Image: a split frame showing the Google Cloud logo on one side and the Intel Xeon 6 processor on the other, symbolizing their collaboration.

Google Cloud's Amazing 70% Cost Cut: How C4 VMs Beat GPT OSS Expectations

Google Cloud's C4 VMs, powered by Intel Xeon 6, offer a significant 70% TCO improvement for GPT OSS models, thanks to collaboration with Intel and Hugging Face.


The world of Large Language Models (LLMs) is constantly evolving, and performance and cost-efficiency are paramount. Google Cloud, in collaboration with Intel and Hugging Face, has unveiled a significant advancement on this front. By leveraging the latest C4 Virtual Machines (VMs) powered by Intel Xeon 6 processors (Granite Rapids), they've achieved a remarkable 70% improvement in Total Cost of Ownership (TCO) when running OpenAI's GPT OSS models, a figure that follows from a 1.7x gain in per-vCPU throughput at comparable hourly pricing. This is a major step toward making large-scale AI deployments more accessible and affordable, and it gives organizations a compelling reason to re-evaluate their cloud strategies and optimize their AI workloads on these new instances.

What's New

The core of this advancement lies in the combination of Google Cloud's C4 VMs and Intel's Xeon 6 processors, optimized in partnership with Hugging Face. The key improvements include:

  • Enhanced TCO: A 1.7x improvement in TCO compared to the previous generation Google C3 VM instances.
  • Increased Throughput: C4 VMs deliver a 1.4x to 1.7x increase in throughput per vCPU per dollar, with TPOT (time per output token) as the underlying latency metric.
  • Competitive Hourly Pricing: C4 VMs are priced per hour competitively with C3 VMs, so the per-vCPU throughput gains carry through to cost savings.
  • Optimized MoE Execution: An expert-execution optimization, contributed by Intel and Hugging Face, eliminates redundant computation by running each expert only on the tokens routed to it (see the sketch after this list).

This translates to faster processing times, reduced operational expenses, and a more sustainable approach to running demanding AI models.
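
To make the MoE optimization concrete, here is a minimal PyTorch sketch of routed expert execution. It is an illustrative reconstruction, not the actual Intel/Hugging Face kernel: the function name, tensor shapes, and top-k routing scheme are assumptions.

```python
import torch

def moe_forward_routed(hidden, router_logits, experts, top_k=2):
    """Run each expert only on the tokens routed to it.

    hidden:        (num_tokens, d_model) flattened token activations
    router_logits: (num_tokens, num_experts) router scores
    experts:       list of per-expert feed-forward modules
    """
    # Pick the top-k experts per token and keep their gate weights.
    gates = torch.softmax(router_logits, dim=-1)
    weights, indices = torch.topk(gates, top_k, dim=-1)  # (num_tokens, top_k)

    out = torch.zeros_like(hidden)
    for e, expert in enumerate(experts):
        # (token, slot) pairs that were routed to expert e.
        token_ids, slot_ids = torch.where(indices == e)
        if token_ids.numel() == 0:
            continue  # no tokens routed here; skip the expert entirely
        # The naive approach runs every expert over *all* tokens and masks
        # the result; gathering only the routed tokens removes that waste.
        expert_out = expert(hidden[token_ids])
        out.index_add_(0, token_ids,
                       expert_out * weights[token_ids, slot_ids].unsqueeze(-1))
    return out
```

Because compute now scales with the number of routed tokens rather than num_experts × num_tokens, experts that receive no tokens in a batch cost nothing.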

Why It Matters

The implications of this development are far-reaching. For businesses deploying GPT OSS models, the 70% TCO reduction represents a significant cost saving. This allows organizations to scale their AI initiatives without incurring prohibitive expenses. The increased throughput also means faster response times and improved user experiences. Furthermore, the collaboration between Google Cloud, Intel, and Hugging Face highlights the importance of open-source optimization in driving innovation in the AI space. This advancement democratizes access to powerful LLM technology, enabling a wider range of organizations to leverage the benefits of AI.

This also showcases the value of specialized hardware and software co-design. By optimizing the software stack for the underlying hardware, significant performance gains can be achieved. This approach is crucial for unlocking the full potential of next-generation processors and maximizing the efficiency of AI workloads.

Technical Details

The benchmark results showcase the performance improvements achieved by the C4 VMs. Here's a comparison of the hardware used:

| Feature   | C3                                 | C4                           |
| --------- | ---------------------------------- | ---------------------------- |
| Processor | 4th Gen Intel Xeon processor (SPR) | Intel Xeon 6 processor (GNR) |
| vCPUs     | 172                                | 144                          |

The following table summarizes the key configuration parameters used in the benchmark:

| Parameter        | Value                            |
| ---------------- | -------------------------------- |
| Model            | unsloth/gpt-oss-120b-BF16        |
| Precision        | bfloat16                         |
| Task             | Text generation                  |
| Input Length     | 1024 tokens (left-padded)        |
| Output Length    | 1024 tokens                      |
| Batch Sizes      | 1, 2, 4, 8, 16, 32, 64           |
| Enabled Features | Static KV cache, SDPA attention  |
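
The article does not publish the benchmark harness itself, but the table's settings map onto the standard Hugging Face transformers API roughly as shown below. The model ID, precision, lengths, and enabled features come from the table; the prompt text and batch size are placeholder assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "unsloth/gpt-oss-120b-BF16"              # model from the table
tok = AutoTokenizer.from_pretrained(model_id, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,                     # bfloat16 precision
    attn_implementation="sdpa",                     # SDPA attention
)

prompts = ["Explain mixture-of-experts models."] * 8  # e.g. batch size 8
inputs = tok(prompts, return_tensors="pt",
             padding="max_length", max_length=1024)   # 1024 tokens, left-padded
out = model.generate(
    **inputs,
    max_new_tokens=1024,                            # 1024 output tokens
    cache_implementation="static",                  # static KV cache
)
```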

The results showed that the C4 instances consistently outperformed the C3 instances, delivering 1.4x to 1.7x the throughput per vCPU of C3. At a batch size of 64, C4 provided 1.7x the per-vCPU throughput, which, at comparable hourly pricing, translates to the headline 70% TCO advantage.

The formula used to calculate the normalized throughput per vCPU is:

normalized_throughput_per_vCPU = (throughput_C4 / vCPUs_C4) / (throughput_C3 / vCPUs_C3)
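
To make the normalization concrete, here is a worked example. The article does not publish absolute throughput figures, so the token rates below are hypothetical; only the vCPU counts come from the hardware table above.

```python
# Hypothetical absolute throughputs, chosen only to illustrate the ratio.
throughput_C4 = 170.0   # tokens/s on C4 (assumed)
throughput_C3 = 119.0   # tokens/s on C3 (assumed)
vCPUs_C4 = 144          # from the hardware table
vCPUs_C3 = 172          # from the hardware table

normalized = (throughput_C4 / vCPUs_C4) / (throughput_C3 / vCPUs_C3)
print(f"{normalized:.2f}x per-vCPU throughput")  # -> 1.71x
```

At similar hourly pricing, 1.7x the per-vCPU throughput means roughly 1.7x the tokens per dollar, which is the ~70% TCO improvement the headline refers to.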

Final Thoughts

The Google Cloud C4 VMs, powered by Intel Xeon 6 processors and optimized through collaboration with Hugging Face, represent a significant leap forward in the efficiency and cost-effectiveness of running large MoE models like GPT OSS. The observed improvements in throughput, latency, and TCO make this a compelling solution for organizations looking to deploy and scale AI applications. As hardware and software continue to co-evolve, we can expect even greater advancements in the performance and accessibility of AI in the future.

Sources verified via Hugging Face as of October 16, 2025.
