MPSGraph vs MPS Kernels: What to Use for ML Workloads on macOS

If you are running machine learning workloads on a Mac with Apple Silicon, you have probably come across two terms that sound similar but serve different purposes: MPS kernels and MPSGraph. Both are part of Apple’s Metal Performance Shaders framework, both run on the GPU, and both are deeply involved in how frameworks like PyTorch accelerate ML tasks on macOS. But they operate at different levels of abstraction, and understanding how they work together can have a real impact on your model’s performance. This article breaks down what MPS kernels and MPSGraph actually are, how they differ, when to use each, and what this all means for practical ML development on macOS in 2025.

What Are MPS Kernels?

MPS stands for Metal Performance Shaders. It is a framework from Apple that provides a library of highly optimized GPU routines, or kernels, for tasks like image processing, linear algebra, and neural network inference. These kernels are pre-written, low-level GPU programs that have been hand-tuned by Apple engineers to run efficiently on each specific generation of Metal-compatible GPU.

When you perform a convolution, a matrix multiplication, or an activation function on an Apple GPU, MPS kernels provide a ready-made implementation for that exact operation. The kernels are tuned per GPU family, which means the same high-level operation may run a slightly different code path on an M1 versus an M3 chip, with Apple automatically selecting the best path for the hardware present.

What MPS Kernels Cover

MPS kernels include support for convolutional neural network layers through MPSCNNConvolution, matrix operations through MPSMatrixMultiplication, activation functions through MPSCNNNeuron, image filters, and various signal processing operations. They are accessed through the Objective-C and Swift APIs and are best thought of as atomic GPU building blocks. You call one kernel, it performs one specific task, and you get a result back.

This approach is powerful for applications that need a well-defined set of ML operations and want precise control over exactly what runs on the GPU at each step. But it has limits: managing many individual kernel calls introduces overhead, and there is little opportunity for the GPU to optimize across multiple operations at once.

What Is MPSGraph?

MPSGraph is a higher-level framework that sits on top of MPS kernels. Rather than calling individual kernels one at a time, MPSGraph lets you define an entire computation graph of operations, including multi-dimensional tensor math, control flow, and data dependencies, and then execute that graph on the GPU as a whole.

The key benefit is that MPSGraph can analyze the entire graph before running it and apply optimizations across operations. It can fuse multiple operations into a single kernel call, reduce memory transfers between steps, and schedule GPU work more efficiently than calling kernels manually in sequence. At WWDC 2024, Apple highlighted new transformer optimizations in MPSGraph, including a fused representation for scaled dot-product attention (SDPA), which collapses what would normally be several separate matrix operations into a single efficient kernel pass, a major win for transformer-based models.
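The payoff of fusion can be seen with a deliberately simple toy, written in plain Python rather than any MPSGraph API: executing a chain of elementwise operations one at a time means one full pass over the data per operation (analogous to one kernel launch each), while a fused version applies the whole chain in a single pass.

```python
# Toy illustration (not MPSGraph API): why fusing a chain of elementwise
# operations helps.

def run_unfused(data, ops):
    # One full pass over the data (one "kernel launch") per operation.
    for op in ops:
        data = [op(x) for x in data]
    return data

def run_fused(data, ops):
    # A single pass that applies every operation to each element in turn.
    def fused(x):
        for op in ops:
            x = op(x)
        return x
    return [fused(x) for x in data]

ops = [lambda x: x * 2, lambda x: x + 1, lambda x: max(x, 0)]  # scale, shift, ReLU
print(run_unfused([1, -3, 2], ops))  # [3, 0, 5] -- three passes
print(run_fused([1, -3, 2], ops))    # [3, 0, 5] -- same result, one pass
```

The real win on a GPU is proportionally larger, because each unfused pass also pays kernel-launch overhead and a round trip through memory for the intermediate results.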

MPSGraph and Tensor Support

MPSGraph extends Metal’s compute capabilities to multi-dimensional tensors, which are the fundamental data structures in deep learning. It supports tensors of arbitrary rank, handles dynamic shapes for models whose input sizes vary at runtime, and gives developers low-level control over GPU synchronization, memory, and scheduling. On newer macOS releases, notably macOS 15 and later, the MPS stack has improved support for strided and non-contiguous views, reducing how often frameworks need to materialize contiguous copies before passing data to the GPU.

How PyTorch Uses Both MPS Kernels and MPSGraph

When you run a PyTorch model on the mps device, PyTorch does not choose between MPS kernels and MPSGraph in an either-or fashion. It uses both. PyTorch’s MPS backend, which became official with PyTorch 1.12, maps each high-level operation onto whichever lower-level implementation is most appropriate. Simple, well-defined operations may call directly into MPS kernels, while more complex sequences of operations are routed through MPSGraph to take advantage of graph-level optimization.

The result is that when you write standard PyTorch code and move your tensors to the mps device, you are automatically benefiting from Apple’s full Metal Performance Shaders stack without needing to manage the distinction yourself. The unified memory architecture of Apple Silicon adds further benefit: unlike discrete GPUs that require separate CPU and GPU memory pools with expensive data transfers over a PCIe bus, MPS operations on Apple Silicon access the same physical memory directly, reducing latency significantly for ML workloads on macOS.

When MPSGraph Matters More

MPSGraph becomes especially important when you are working with transformer models, large language models, or any architecture with complex multi-step operations. Apple has highlighted stateful KV-cache and SDPA optimizations for transformer workloads in MPSGraph, and has also introduced support for 4-bit integer weight formats to reduce memory footprint. For models that previously required a high-end discrete GPU, MPSGraph-powered inference on higher-end Apple Silicon, such as M3 Max or M3 Ultra-class machines, can be a genuinely viable alternative for local use.
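The idea behind a KV cache can be sketched in a few lines of plain Python (hypothetical names, no tensor math): during autoregressive decoding, the key and value projections for past tokens are computed once, appended, and reused, so each step only processes the newest token instead of re-running attention math over the whole sequence.

```python
# Minimal sketch of the KV-cache idea, not Apple's implementation.

class KVCache:
    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, new_key, new_value):
        # Append only this token's projections; earlier entries are reused.
        self.keys.append(new_key)
        self.values.append(new_value)
        return self.keys, self.values

cache = KVCache()
for token in range(3):
    keys, values = cache.step(f"k{token}", f"v{token}")

print(keys)  # ['k0', 'k1', 'k2'] -- each entry computed exactly once
```

A stateful cache like this is exactly the kind of structure that benefits from living in GPU memory across decode steps, which is what the MPSGraph transformer optimizations target.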

Direct Use of MPSGraph vs Using It Through a Framework

For most ML developers on macOS, the practical question is not whether to call MPSGraph directly in Swift or Objective-C, but whether to use PyTorch’s MPS backend, Apple’s MLX framework, or Core ML as the front-end. Each of these routes operations through MPSGraph and MPS kernels under the hood, but they expose different programming models and suit different use cases.

PyTorch’s MPS backend is the most straightforward choice if you are already working in the PyTorch ecosystem. You change the device string from cpu or cuda to mps, and training and inference run on the GPU. Core ML is better suited for deploying models in production macOS applications, where it can also leverage the Apple Neural Engine for additional acceleration. MLX, Apple’s own open-source framework released in 2023, is designed specifically for Apple Silicon and has shown benchmark advantages over PyTorch’s MPS backend in many scenarios because it was built from scratch with unified memory and lazy graph evaluation in mind.
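The PyTorch route described above usually comes down to a small device-selection helper. A minimal sketch, with a guarded import only so it also runs on machines without PyTorch installed:

```python
# Pick the best available PyTorch device: Apple's MPS backend when
# present, otherwise CUDA, otherwise CPU.

def pick_device():
    try:
        import torch
        if torch.backends.mps.is_available():
            return "mps"
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    return "cpu"

device = pick_device()
print(device)  # e.g. "mps" on Apple Silicon with a recent PyTorch build
```

From there, `model.to(device)` and `tensor.to(device)` move work onto the GPU with no other code changes.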

When to Go Directly to MPSGraph

If you are building a native macOS application that already uses Metal for rendering or other GPU work, calling MPSGraph directly makes sense. Apple’s documentation notes that MPSGraph allows you to sequence ML tasks alongside other GPU operations in the same Metal command queue, and to share low-level Metal buffers between your rendering pipeline and your ML inference pipeline. This tight integration is not possible through a higher-level framework like PyTorch. For game engines, scientific visualization tools, or augmented reality applications that need ML inference fused into the GPU timeline, direct MPSGraph usage is the right approach.


For research and prototyping, however, the overhead of working directly with MPSGraph APIs is rarely worth it. PyTorch’s MPS backend handles the heavy lifting and gives you access to the same underlying MPS kernel optimizations with far less boilerplate.

Performance Considerations and Limitations on macOS

Despite the impressive progress Apple has made with MPS kernels and MPSGraph, there are practical limitations to keep in mind when planning ML workloads on macOS.

Compared with CUDA, where compilation and kernel-fusion pathways are mature and predictable, performance on MPS depends more on operator coverage and the graph-level optimizations MPSGraph applies than on a compiler-driven fusion pipeline. This is improving with each PyTorch release, but it remains a meaningful gap for training-heavy workloads.
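Operator coverage gaps surface in practice as errors for unsupported operations. PyTorch documents an environment variable that instead falls back to the CPU for any operator without an MPS implementation; it must be set before torch is imported.

```python
# Enable CPU fallback for operators that lack an MPS implementation.
# Must be set before the first `import torch`.
import os

os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

# import torch  # ops without MPS coverage now run on the CPU instead of failing
print(os.environ["PYTORCH_ENABLE_MPS_FALLBACK"])  # 1
```

The fallback keeps scripts running, but each fallback op pays a CPU round trip, so it is a stopgap rather than a performance feature.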

Low-bit quantization also behaves differently on MPS. Apple has introduced support for 4-bit and 8-bit integer weight formats, which reduce memory bandwidth requirements. However, how and when these formats are applied depends on the framework, model format, and operator coverage in use, so throughput gains can vary compared to hardware with native low-precision compute units.
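What a 4-bit integer weight format means mechanically can be shown with a toy symmetric quantizer in plain Python (illustrative only, not Apple's scheme): each float is mapped to an integer code in [-8, 7] with a shared scale, and dequantized back at compute time. Two codes pack into each byte, a 4x memory saving over float16.

```python
# Toy symmetric 4-bit weight quantization with a single per-tensor scale.

def quantize_4bit(weights):
    scale = max(abs(w) for w in weights) / 7  # map the largest weight to +/-7
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.7, -0.3, 0.1, 0.0]
q, scale = quantize_4bit(weights)
print(q)                     # [7, -3, 1, 0] -- 4-bit integer codes
print(dequantize(q, scale))  # approximately the original weights
```

Real implementations refine this with per-group scales and careful rounding, but the bandwidth argument is the same: the GPU reads a quarter of the bytes per weight.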

Additionally, PyTorch’s MPS device is currently single-device. The multi-GPU distributed training stacks used with CUDA, such as NCCL, are not available for MPS, which rules it out for large-scale training runs that require data or model parallelism.

What Works Well for ML on macOS

For local prototyping, fine-tuning, and inference, the combination of MPS kernels and MPSGraph running through PyTorch’s MPS backend or MLX delivers genuinely useful performance on Apple Silicon. Higher-end Apple Silicon machines in the Max or Ultra class, with 48 to 128 GB of unified memory, can run large language models locally that do not fit in the memory of a consumer NVIDIA GPU and would otherwise require a dedicated server. The memory capacity advantage of Apple Silicon is real, and MPSGraph’s continued improvements to transformer inference make this use case more practical with every macOS update.

MPSGraph vs MPS Kernels: Which One Should You Use?

To put it plainly, MPS kernels are the low-level, hand-optimized GPU routines that execute individual operations like matrix multiplication or convolution. MPSGraph is the higher-level graph execution framework that chains these operations together, applies cross-operation optimizations, and manages memory and scheduling for complex ML models. They are complementary rather than competing choices.

For most ML practitioners on macOS, the distinction is handled automatically by whichever framework you use. PyTorch’s MPS backend, Core ML, and MLX all route work through both layers without requiring you to choose. If you are doing research or prototyping, using PyTorch on the mps device or trying MLX gives you the best combination of familiarity and native performance. If you are building a production macOS application, Core ML with optional MPSGraph integration gives you the best deployment story. If you are building a native Metal application that needs ML acceleration woven into a GPU rendering pipeline, calling MPSGraph directly and sharing Metal buffers gives you control that no higher-level framework can match.

The continued investment Apple is making in both MPS kernels and MPSGraph, from fused attention operators to 4-bit quantization and improved strided tensor support, makes macOS a genuinely competitive platform for local ML workloads. The unified memory advantage of Apple Silicon remains a distinct strength that no other consumer hardware currently matches, and MPSGraph is the engine that makes the most of it.
