DGX Spark vs Radeon 960 XT vs M3 Ultra: One Million AI Tokens Performance Testing

What does it really take to generate 1 million tokens quickly, efficiently, and without breaking the bank? Below, Alex Ziskind breaks down how five distinct computing systems stack up in this challenge, revealing surprising trade-offs between speed, energy consumption, and cost. From the lightning-fast DGX Spark to the budget-friendly AMD Radeon 960 XT, each system reflects a different set of priorities: raw performance, affordability, or efficiency. The results aren’t always what you’d expect. The Mac Studio M3 Ultra, known for its efficiency, struggles to keep pace on speed. If you’ve ever wondered how these machines perform under sustained load, this feature offers a detailed look at their strengths and limitations.
By the end of this guide, you’ll know which system is the fastest, which one saves the most energy, and which might quietly drain your wallet over time. Whether you’re a tech enthusiast, a professional weighing performance against operating costs, or simply curious about the real-world impact of hardware choices, this guide offers plenty to think about. From the leading throughput of the DGX Spark to the unexpected inefficiency of the Beelink GTR9, the comparisons go beyond surface specs to show what matters in practice, and the trade-offs that shape the machines behind modern AI workloads.
Token Generation System Comparison
TL;DR Key Takeaways:
- The DGX Spark is the fastest and most energy-efficient of the four systems that ran the same 4B model, ideal for high-throughput tasks, but it requires a significant financial investment.
- The AMD Radeon 960 XT offers a budget-friendly option with competitive performance and low energy consumption, suitable for smaller-scale operations.
- The Mac Studio M3 Ultra excels in idle energy efficiency but is slower and less efficient during intensive tasks, making it a better fit for machines that spend much of their time idle than for sustained generation workloads.
- Software optimization plays a critical role: vLLM excels at concurrency on Nvidia and AMD hardware, while MLX is optimized for Apple Silicon.
- The H200 Cluster delivers unmatched speed for enterprise-level tasks but at a high energy and financial cost, suitable only for users with substantial computational demands.
Test Setup: Hardware and Software Overview
Five distinct systems were tested:
- AMD Radeon 960 XT: A budget-friendly GPU designed for moderate workloads.
- DGX Spark: Nvidia’s compact desktop AI system, built around the GB10 Grace Blackwell Superchip and aimed at demanding AI workloads.
- Beelink GTR9 (AMD Strix Halo): A compact mini PC aimed at balancing performance and affordability.
- Mac Studio M3 Ultra: Apple’s premium offering, designed for energy efficiency and creative workflows.
- H200 Cluster: A large-scale, high-capacity system built for enterprise-level tasks.
The other four systems were each tasked with generating 1 million tokens using Qwen3 4B, a compact 4-billion-parameter model chosen because it runs on every platform in the test; the H200 Cluster was tested separately with a much larger model. To maximize throughput, the tests used llama.cpp, vLLM, and MLX, with a focus on concurrent requests and cross-platform support. This combination made it possible to compare speed, energy efficiency, and cost across very different hardware.
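The article does not publish the benchmark harness itself. As a rough sketch of how a "generate one million tokens with concurrent requests" test can be structured, the Python below keeps a fixed number of requests in flight until a token target is reached and reports the elapsed time. The names `run_token_benchmark`, `GenerateFn`, and `stub_generate` are hypothetical; in a real test the `generate` callable would wrap a request to whichever backend is being measured (llama.cpp, vLLM, or MLX), which is not shown here, so a stub with an artificial delay stands in for it.

```python
import asyncio
import time
from typing import Awaitable, Callable, Tuple

# Hypothetical interface: an async callable that performs one generation
# request against some backend and returns the number of tokens it produced.
GenerateFn = Callable[[], Awaitable[int]]


async def run_token_benchmark(
    generate: GenerateFn, target_tokens: int, concurrency: int
) -> Tuple[int, float]:
    """Keep `concurrency` requests in flight until `target_tokens` are produced.

    Returns (tokens_generated, elapsed_seconds). The total may overshoot the
    target because requests already in flight are allowed to finish.
    """
    produced = 0
    start = time.perf_counter()

    async def worker() -> None:
        nonlocal produced
        while produced < target_tokens:
            n = await generate()  # await first, then update the shared total
            produced += n

    await asyncio.gather(*(worker() for _ in range(concurrency)))
    return produced, time.perf_counter() - start


async def stub_generate() -> int:
    """Stand-in for a real backend call: fixed latency, fixed output length."""
    await asyncio.sleep(0.01)
    return 256


if __name__ == "__main__":
    tokens, seconds = asyncio.run(run_token_benchmark(stub_generate, 10_000, 8))
    print(f"{tokens} tokens in {seconds:.2f} s ({tokens / seconds:.0f} tokens/s)")
```

Reading each request's token count into a local variable before adding it to the shared total matters here: writing `produced += await generate()` reads the total before the await, so updates made by other workers in the meantime could be silently lost.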
Performance Results: Speed Matters
The speed of token generation varied significantly across the systems, reflecting their design priorities and hardware capabilities:
- DGX Spark: The fastest system, completing the task in just 6.7 minutes with a throughput of 2,451 tokens per second. This makes it ideal for high-throughput environments.
- AMD Radeon 960 XT: A strong contender in the budget category, generating 1,913 tokens per second and completing the task in 8.12 minutes.
- Mac Studio M3 Ultra: While efficient in other areas, it required 26 minutes to generate 1 million tokens, reflecting its focus on energy savings rather than raw speed.
- Beelink GTR9: The slowest of the group, taking 34 minutes to complete the task, highlighting its limitations with high-throughput workloads.
The H200 Cluster, tested separately with a much larger 480-billion-parameter model, reached 2,609 tokens per second, higher than the DGX Spark’s throughput. Because it ran a different model, this is not a like-for-like comparison with the other four systems. That performance also came at a significantly higher energy and financial cost, making the cluster suitable only for users with substantial computational demands.
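Converting between throughput and completion time is simple arithmetic, and running it is a quick way to sanity-check published figures. The sketch below uses only the numbers reported above; the helper names are illustrative, not part of any tool used in the test.

```python
TARGET_TOKENS = 1_000_000


def minutes_for(tokens_per_second: float, tokens: int = TARGET_TOKENS) -> float:
    """Minutes needed to generate `tokens` at a steady `tokens_per_second`."""
    return tokens / tokens_per_second / 60


def tokens_per_second_for(minutes: float, tokens: int = TARGET_TOKENS) -> float:
    """Average throughput implied by generating `tokens` in `minutes`."""
    return tokens / (minutes * 60)


# Throughputs reported in the article -> implied completion time
for name, tps in [("DGX Spark", 2451), ("AMD Radeon 960 XT", 1913)]:
    print(f"{name}: {minutes_for(tps):.1f} min")

# Completion times reported in the article -> implied average throughput
for name, mins in [("Mac Studio M3 Ultra", 26), ("Beelink GTR9", 34)]:
    print(f"{name}: {tokens_per_second_for(mins):.0f} tokens/s")
```

This gives about 6.8 minutes for the DGX Spark, in line with the reported 6.7, and implies average rates of roughly 641 and 490 tokens per second for the M3 Ultra and GTR9. For the Radeon, 1,913 tokens per second works out to about 8.7 minutes rather than the quoted 8.12, so those two figures may come from different runs or timing windows.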
Energy Efficiency: Balancing Power and Performance
Energy consumption was a critical factor in evaluating the systems, as it directly impacts both operational costs and environmental considerations:
- DGX Spark: Demonstrated the highest energy efficiency during active processing, producing the most tokens per kilowatt-hour. This makes it a strong choice for users prioritizing both performance and sustainability.
- Mac Studio M3 Ultra: Excelled in idle energy efficiency, consuming minimal power when not actively generating tokens. However, its energy efficiency during intensive tasks was less competitive.
- Beelink GTR9: Consumed the most energy overall, reflecting its inefficiency under high-throughput workloads. Its 34-minute run was the longest in the test, and because energy is power multiplied by time, even a moderate power draw adds up over a run that long. This makes it less suitable for users concerned with energy costs.
For users seeking a balance between energy consumption and performance, the DGX Spark emerged as the most practical option. In contrast, the Beelink GTR9’s high energy usage underscores the importance of matching hardware capabilities to workload requirements.
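Tokens per kilowatt-hour depends on both power draw and run time, which is why a slower machine can use more energy even while drawing less power. The article does not list wattages, so the figures in this sketch are hypothetical placeholders chosen only to illustrate the calculation.

```python
JOULES_PER_KWH = 3_600_000  # 1 kWh = 1,000 W x 3,600 s


def energy_kwh(avg_watts: float, seconds: float) -> float:
    """Energy consumed by a device averaging `avg_watts` over `seconds`."""
    return avg_watts * seconds / JOULES_PER_KWH


def tokens_per_kwh(tokens: int, avg_watts: float, seconds: float) -> float:
    """Energy efficiency of a generation run, in tokens per kilowatt-hour."""
    return tokens / energy_kwh(avg_watts, seconds)


# HYPOTHETICAL wattages, not measurements from the article:
# a fast machine at 200 W finishing in 408 s (about 6.8 minutes), versus
# a slower machine at 100 W needing 2,040 s (34 minutes).
fast = energy_kwh(200, 408)
slow = energy_kwh(100, 2040)
print(f"fast: {fast:.4f} kWh, {tokens_per_kwh(1_000_000, 200, 408):,.0f} tokens/kWh")
print(f"slow: {slow:.4f} kWh, {tokens_per_kwh(1_000_000, 100, 2040):,.0f} tokens/kWh")
```

With these made-up inputs, the slower machine draws half the power but uses 2.5 times the energy, because it runs five times longer.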
Cost Analysis: Operational Expenses
Operational costs were calculated based on an energy rate of $0.20 per kilowatt-hour, providing a practical perspective on the financial implications of each system:
- DGX Spark: Despite its high upfront cost, it proved cost-effective for large-scale workloads due to its speed and energy efficiency. This makes it a worthwhile investment for users with demanding computational needs.
- AMD Radeon 960 XT: Offered a more affordable alternative, balancing performance and cost for budget-conscious users. Its lower energy consumption further enhances its appeal for smaller-scale operations.
- Beelink GTR9: While inexpensive upfront, its inefficiency led to higher energy costs over time, reducing its overall cost-effectiveness.
The choice between these systems ultimately depends on your workload and budget. For high-throughput tasks, the DGX Spark is an excellent option, while the AMD Radeon 960 XT is better suited for users with limited resources.
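At the article’s rate of $0.20 per kilowatt-hour, cost follows directly from energy. Real machines also sit idle between jobs, so a simple duty-cycle model helps show when idle power, the M3 Ultra’s strength, starts to matter. The wattages and hours below are hypothetical; only the $0.20/kWh rate comes from the article.

```python
RATE_USD_PER_KWH = 0.20  # electricity price used in the article


def run_cost_usd(avg_watts: float, seconds: float, rate: float = RATE_USD_PER_KWH) -> float:
    """Electricity cost of a single run at a constant average power draw."""
    return avg_watts * seconds / 3_600_000 * rate


def annual_cost_usd(
    active_watts: float,
    idle_watts: float,
    active_hours_per_day: float,
    rate: float = RATE_USD_PER_KWH,
) -> float:
    """Yearly electricity cost for a machine that is busy part of each day."""
    if not 0 <= active_hours_per_day <= 24:
        raise ValueError("active_hours_per_day must be between 0 and 24")
    idle_hours = 24 - active_hours_per_day
    daily_kwh = (active_watts * active_hours_per_day + idle_watts * idle_hours) / 1000
    return daily_kwh * 365 * rate


# HYPOTHETICAL machines: same 200 W under load for 2 hours a day,
# but one idles at 10 W and the other at 40 W.
print(f"low idle:  ${annual_cost_usd(200, 10, 2):.2f}/year")
print(f"high idle: ${annual_cost_usd(200, 40, 2):.2f}/year")
```

With these illustrative numbers, a 30 W difference in idle draw roughly doubles the yearly bill ($45.26 versus $93.44), even though both machines cost the same while working.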
Software Optimization: The Role of Tools
The performance of each system was significantly influenced by the software tools used, underscoring the importance of optimization:
- vLLM: Excelled in concurrency, particularly on Nvidia and AMD hardware, making it the preferred choice for high-throughput tasks.
- MLX: Optimized for Apple Silicon, it performed exceptionally well on the Mac Studio but lacks the cross-platform support needed for broader use.
- llama.cpp: A versatile baseline tool suited to simpler setups, but less effective for high-concurrency tasks than specialized software like vLLM.
Selecting the right software is as critical as choosing the hardware itself. vLLM and MLX show how software optimization can unlock a system’s full potential, improving both performance and efficiency.
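The software guidance above boils down to a rule of thumb: MLX on Apple Silicon, vLLM on Nvidia or AMD GPUs, and llama.cpp as the general fallback. The function below is a hypothetical sketch of that heuristic; treating the presence of the `nvidia-smi` or `rocm-smi` command-line tools as evidence of a usable GPU is an assumption, not a robust hardware check.

```python
import platform
import shutil
from typing import Optional


def suggest_backend(
    system: Optional[str] = None,
    machine: Optional[str] = None,
    has_nvidia: Optional[bool] = None,
    has_amd: Optional[bool] = None,
) -> str:
    """Pick an inference backend following the article's rule of thumb.

    Arguments default to values detected on the current machine; pass them
    explicitly to evaluate other configurations.
    """
    system = system if system is not None else platform.system()
    machine = machine if machine is not None else platform.machine()
    # Assumption: vendor CLI tools on PATH indicate a usable GPU stack.
    if has_nvidia is None:
        has_nvidia = shutil.which("nvidia-smi") is not None
    if has_amd is None:
        has_amd = shutil.which("rocm-smi") is not None

    if system == "Darwin" and machine == "arm64":
        return "mlx"  # Apple Silicon
    if has_nvidia or has_amd:
        return "vllm"  # strong concurrency on Nvidia and AMD GPUs
    return "llama.cpp"  # portable baseline


print(suggest_backend())
```

Passing the arguments explicitly keeps the rule testable without the hardware present; in practice the right choice also depends on model support and driver versions, which this sketch ignores.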
H200 Cluster: Pushing the Limits
The H200 Cluster was tested with a larger 480-billion-parameter model to evaluate its capabilities under extreme conditions. It achieved a throughput of 2,609 tokens per second, higher than the DGX Spark recorded on the much smaller 4B model. However, its energy consumption and operational costs were substantially higher, making it viable only for users with significant computational demands and budgets. This system is best suited to enterprise-level applications where performance outweighs cost considerations.
Key Takeaways: Choosing the Right System
This analysis highlights the diverse capabilities of modern computing systems for token generation tasks. Here are the main insights:
- High-end systems: The DGX Spark delivers exceptional speed and energy efficiency, making it ideal for high-throughput environments, though it requires a significant financial investment.
- Budget-friendly options: The AMD Radeon 960 XT offers competitive performance at a fraction of the cost, making it a practical choice for users with limited resources.
- Energy considerations: The Mac Studio M3 Ultra is highly efficient at idle but less competitive during intensive tasks, while the Beelink GTR9 incurs higher energy costs overall.
- Software matters: vLLM and MLX highlight the importance of software optimization in getting the most out of the hardware.
Ultimately, the best system for your needs will depend on your specific priorities, whether they are speed, energy efficiency, or cost-effectiveness. By understanding the strengths and limitations of each option, you can make an informed decision that aligns with your computational requirements and budget.
Media Credit: Alex Ziskind

