Inside Nvidia’s Formula for Faster, Smarter Data Analytics
In today's data-driven world, the demand for rapid data processing is ever-increasing. Businesses are constantly seeking ways to accelerate their analytics pipelines to gain real-time insights and make informed decisions. GPU-accelerated databases and query engines have emerged as a powerful solution, offering substantial price-performance advantages over traditional CPU-based systems. The parallel processing capabilities of GPUs, with their high memory bandwidth and numerous threads, are particularly well-suited for compute-intensive tasks such as complex joins, aggregations, and string processing. Now, IBM and NVIDIA are joining forces to bring NVIDIA cuDF to the Velox execution engine, unlocking GPU-native query execution for widely used platforms like Presto and Apache Spark. This collaboration promises to revolutionize large-scale data analytics, but only if businesses understand how to leverage this technology effectively.
What's New
The collaboration between IBM and NVIDIA introduces several key advancements:
- GPU-Native Query Execution: By integrating NVIDIA cuDF with the Velox execution engine, Presto and Apache Spark can now execute queries directly on GPUs, bypassing the limitations of CPU-based processing.
- Velox as an Intermediate Layer: Velox acts as a translator, converting query plans from Presto and Spark into executable GPU pipelines powered by cuDF.
- Expanded GPU Operator Support: The cuDF backend for Velox has been enhanced with improved GPU operators for TableScan, HashJoin, HashAggregation, and FilterProject, enabling end-to-end GPU execution in Presto.
- UCX-Based Exchange Operator: A new UCX-based Exchange operator facilitates faster data movement between workers, leveraging high-bandwidth NVLink for intra-node connectivity and RoCE or InfiniBand for inter-node connectivity.
- Hybrid CPU-GPU Execution in Spark: The integration with Apache Gluten allows for offloading specific query stages to GPUs, optimizing resource utilization in hybrid clusters.
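To make the operator pipeline concrete, here is a minimal CPU-only Python sketch of the kind of plan Velox hands to its backends: a FilterProject feeding a HashJoin feeding a HashAggregation. The operator names match Velox; the tables, schema, and helper functions are hypothetical, and in the cuDF backend these steps run on GPU columnar batches rather than Python rows.

```python
from collections import defaultdict

# Hypothetical input tables as lists of dicts (Velox actually works on columnar batches).
orders = [
    {"order_id": 1, "cust_id": 10, "price": 120.0},
    {"order_id": 2, "cust_id": 20, "price": 45.0},
    {"order_id": 3, "cust_id": 10, "price": 300.0},
    {"order_id": 4, "cust_id": 30, "price": 15.0},
]
customers = [
    {"cust_id": 10, "region": "EU"},
    {"cust_id": 20, "region": "US"},
    {"cust_id": 30, "region": "EU"},
]

def filter_project(rows):
    """FilterProject: keep orders over 40.0 and project only the needed columns."""
    return [{"cust_id": r["cust_id"], "price": r["price"]}
            for r in rows if r["price"] > 40.0]

def hash_join(left, right, key):
    """HashJoin: build a hash table on the right side, probe with the left side."""
    table = {r[key]: r for r in right}
    return [{**l, **table[l[key]]} for l in left if l[key] in table]

def hash_aggregation(rows, group_key, value_key):
    """HashAggregation: SUM(value) GROUP BY group_key via a hash table."""
    sums = defaultdict(float)
    for r in rows:
        sums[r[group_key]] += r[value_key]
    return dict(sums)

# Chain the operators end to end, as a Velox pipeline would.
result = hash_aggregation(
    hash_join(filter_project(orders), customers, "cust_id"),
    "region", "price",
)
print(result)  # {'EU': 420.0, 'US': 45.0}
```

The point of the sketch is the shape of the plan, not the implementation: each operator consumes the previous operator's output, which is exactly the structure Velox translates into cuDF kernels on GPU.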
Why It Matters
This collaboration has significant implications for organizations dealing with large-scale data analytics. By leveraging GPU acceleration, businesses can achieve:
- Faster Query Processing: GPU-native execution significantly reduces query runtime, enabling real-time insights and faster decision-making.
- Improved Price-Performance: GPUs offer superior performance per dollar compared to CPUs for compute-intensive workloads, leading to cost savings.
- Enhanced Scalability: The ability to scale up to multi-GPU execution allows organizations to handle increasingly large datasets without compromising performance.
- Optimized Resource Utilization: Hybrid CPU-GPU execution in Spark enables efficient utilization of resources in heterogeneous environments.
Technical Details
The performance gains are substantial. In Presto TPC-H benchmarks at scale factor 1,000 (SF1000), Presto on the NVIDIA GH200 Grace Hopper Superchip completed 21 of 22 queries in 99.9 seconds, versus 1,246 seconds for Presto C++ on an AMD 7965WX CPU. Multi-GPU execution on an eight-GPU NVIDIA DGX A100 node delivered more than a 6x speedup using NVLink in the exchange operator, compared to the Presto baseline HTTP exchange. In Apache Spark, offloading the compute-intensive second stage of TPC-DS Query 95 (SF100) to a GPU produced significant speedups even with the first stage running on CPU. The table below summarizes the results:
| Benchmark | Platform | Runtime (seconds) | Queries Completed | Notes |
| --- | --- | --- | --- | --- |
| Presto TPC-H (SF1000) | Presto C++ on AMD 7965WX | 1,246 | 21/22 | CPU-only |
| Presto TPC-H (SF1000) | Presto on NVIDIA RTX PRO 6000 Blackwell | 133.8 | 21/22 | Single GPU |
| Presto TPC-H (SF1000) | Presto on NVIDIA GH200 Grace Hopper Superchip | 99.9 | 21/22 | Single GPU |
| Presto TPC-H (SF1000) | Presto GPU on NVIDIA GH200 | 148.9 | 22/22 | Single GPU, CUDA managed memory |
| Presto TPC-H (SF1000) | Presto GPU on NVIDIA DGX A100 (8 GPUs) | N/A | 22/22 | Multi-GPU, UCX-based Exchange, NVLink |
| Gluten TPC-DS Query 95 (SF100), Stage 2 on GPU | CPU (8 vCPUs) + NVIDIA T4 GPU (g4dn.2xlarge) | Faster | N/A | Hybrid CPU-GPU, Stage 1 CPU, Stage 2 GPU |
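As a quick sanity check, the single-GPU speedups follow directly from the runtimes in the table (21/22 queries in each configuration):

```python
# Runtimes in seconds, taken from the benchmark table above.
cpu_runtime = 1246.0    # Presto C++ on AMD 7965WX (CPU-only)
gh200_runtime = 99.9    # Presto on NVIDIA GH200 Grace Hopper Superchip
rtx_runtime = 133.8     # Presto on NVIDIA RTX PRO 6000 Blackwell

print(round(cpu_runtime / gh200_runtime, 1))  # 12.5  (GH200 speedup over CPU)
print(round(cpu_runtime / rtx_runtime, 1))    # 9.3   (RTX PRO 6000 speedup over CPU)
```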
Velox supports several Exchange types for different data-movement patterns: Partitioned, Merge, and Broadcast. A Partitioned Exchange uses a hash function to split input data across workers. A Merge Exchange receives multiple sorted input partitions from other workers and produces a single, sorted output partition. A Broadcast Exchange loads the data on one worker and then copies it to all other workers.
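The three Exchange patterns can be sketched in a few lines of CPU-only Python. The worker count and row data are hypothetical; a real Velox exchange moves serialized columnar pages between processes, over HTTP in the baseline or over UCX/NVLink in the GPU path described above.

```python
import heapq

NUM_WORKERS = 3  # hypothetical cluster size

def partitioned_exchange(rows, key):
    """Partitioned: route each row to a worker chosen by hashing the partition key."""
    outputs = [[] for _ in range(NUM_WORKERS)]
    for r in rows:
        outputs[hash(r[key]) % NUM_WORKERS].append(r)
    return outputs

def merge_exchange(sorted_partitions, key):
    """Merge: combine pre-sorted partitions from other workers into one sorted stream."""
    return list(heapq.merge(*sorted_partitions, key=lambda r: r[key]))

def broadcast_exchange(rows):
    """Broadcast: every worker receives a full copy of the data."""
    return [list(rows) for _ in range(NUM_WORKERS)]

rows = [{"k": i} for i in range(6)]

parts = partitioned_exchange(rows, "k")
assert sum(len(p) for p in parts) == len(rows)  # each row lands on exactly one worker

merged = merge_exchange([[{"k": 0}, {"k": 2}], [{"k": 1}, {"k": 3}]], "k")
print([r["k"] for r in merged])  # [0, 1, 2, 3]

copies = broadcast_exchange(rows)
assert all(len(c) == len(rows) for c in copies)  # every worker holds the full table
```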
Final Thoughts
The collaboration between IBM and NVIDIA to bring GPU-native Velox and NVIDIA cuDF to Presto and Apache Spark represents a significant leap forward in large-scale data analytics. By unlocking the power of GPUs, organizations can achieve unprecedented performance gains, enabling faster insights and improved decision-making. As the data landscape continues to evolve, GPU acceleration will become increasingly critical for remaining competitive. We anticipate further advancements in this area, with wider adoption of GPU-powered data analytics across various industries. This is a space to watch closely as it shapes the future of data processing.
Sources verified via NVIDIA as of October 6, 2025.
