If you own a Mac with an M-series chip and want to train or run machine learning models locally, the Apple Silicon PyTorch MPS backend is the feature you need to understand. Introduced with PyTorch 1.12, the Metal Performance Shaders (MPS) backend lets your Mac GPU accelerate deep learning workloads directly, without needing a cloud server or an NVIDIA card. This guide walks through how to set it up, which operations are supported, what performance you can realistically expect, and how to handle common problems.
What Is Apple Silicon PyTorch MPS?
MPS stands for Metal Performance Shaders, which is Apple’s GPU programming framework built on top of the Metal API. When PyTorch added the MPS backend, it gave developers a way to run PyTorch operations on the GPU cores inside Apple Silicon chips, covering the M1, M2, M3, and M4 families.
Before the MPS backend, Apple Silicon Macs ran PyTorch workloads entirely on the CPU, leaving the powerful GPU cores idle. The Apple Silicon PyTorch MPS integration changed this by mapping PyTorch’s computational graphs onto Apple’s MPS Graph framework and tuning low-level GPU kernels for Metal’s architecture.
One major architectural advantage Apple Silicon brings to this setup is unified memory. Unlike traditional systems where the CPU and GPU operate on separate memory pools connected by a slow PCIe bus, Apple Silicon chips share a single memory pool across the CPU, GPU, and Neural Engine. This eliminates explicit CPU to GPU memory copies over PCIe and reduces transfer overhead compared to discrete GPU systems. PyTorch still manages tensor placement and synchronization internally, so there is some device synchronization cost, but the absence of physical data movement across the bus is a genuine efficiency gain. Larger models and bigger batch sizes become more practical as a result.
System Requirements and Installation Setup
Before running any code, you need to meet a few baseline conditions. The Apple Silicon PyTorch MPS backend requires macOS 12.3 or later and a Mac with an Apple Silicon chip. You also need a native ARM64 build of Python, not the Rosetta-emulated x86 version. Using an Intel-emulated Python is a common mistake that causes MPS to report itself as unavailable even on capable hardware.
The recommended Python version as of late 2025 is 3.12. Once you have the right Python installed, creating a virtual environment before installing PyTorch is good practice.
To install PyTorch with MPS support using pip, the standard stable release already includes it:
pip install torch torchvision torchaudio
If you want the very latest MPS improvements that have not yet made it into a stable release, the nightly build is an option. On macOS with Apple Silicon there is no separate MPS wheel; MPS support is included in the standard macOS build:
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
After installation, verify that MPS is working with a short Python script:
import torch

if torch.backends.mps.is_available():
    mps_device = torch.device("mps")
    x = torch.ones(1, device=mps_device)
    print(x)
else:
    print("MPS device not found.")
You can also check torch.backends.mps.is_available() and torch.backends.mps.is_built() separately. If is_built() returns False, the PyTorch installation was compiled without MPS support. If is_available() returns False but is_built() returns True, the issue is your macOS version or Python architecture.
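A short diagnostic combining both checks makes the failure mode obvious at a glance:

```python
import torch

# Distinguish a PyTorch build compiled without MPS from an
# environment (macOS version or Python architecture) that blocks it.
built = torch.backends.mps.is_built()
available = torch.backends.mps.is_available()

if not built:
    print("This PyTorch build was compiled without MPS support.")
elif not available:
    print("MPS is built in, but macOS is too old or Python is not a native ARM64 build.")
else:
    print("MPS is ready to use.")
```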
Moving Models and Tensors to MPS
Using the Apple Silicon PyTorch MPS device in your code follows the same pattern as using CUDA on NVIDIA hardware. You define a device string and send your tensors and models to it:
device = "mps" if torch.backends.mps.is_available() else "cpu"
x = torch.rand(1000, 1000, device=device)
model = YourModel()
model.to(device)
Any operation performed on tensors residing on the MPS device will run on the Apple GPU automatically. The workflow is identical to what CUDA users are already familiar with, which makes porting existing PyTorch code to Apple Silicon straightforward.
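The pattern extends naturally to a complete training step. The model and data below are toy placeholders chosen to keep the sketch self-contained, not part of any real workload:

```python
import torch
import torch.nn as nn

# Device-agnostic training step: the same code runs on MPS or CPU.
device = "mps" if torch.backends.mps.is_available() else "cpu"

model = nn.Linear(32, 4).to(device)           # toy model
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(8, 32, device=device)         # batch of 8 samples
y = torch.randint(0, 4, (8,), device=device)  # class labels

opt.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()                               # gradients stay on the same device
opt.step()
print(f"device={device}, loss={loss.item():.4f}")
```

Because the device string is resolved once at the top, the same script runs unchanged on an NVIDIA machine by substituting "cuda" into the availability check.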
Supported Operations in the MPS Backend
The Apple Silicon PyTorch MPS backend supports a large and growing set of operations. Standard operations used in neural network training, including matrix multiplication, convolutions, batch normalization, activation functions like ReLU and GELU, pooling layers, and most standard loss functions, work on MPS without modification.
Common architectures including ResNet, EfficientNet, VGG, BERT, DistilBERT, and many others run on MPS. Tasks like image classification, text classification, sequence labeling, and fine-tuning pre-trained Hugging Face models are well supported.
However, the MPS backend still has gaps. Not every PyTorch operation has been implemented in Metal kernels yet. When your code hits an unsupported operation, PyTorch raises an error by default. The simplest fix is to set a fallback environment variable before running your script:
PYTORCH_ENABLE_MPS_FALLBACK=1 python your_script.py
This tells PyTorch to fall back to the CPU for any operation not yet supported on MPS, while still running everything else on the GPU. It is a practical solution for model architectures that mostly work on MPS but hit occasional unsupported ops.
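If you prefer to set the variable from inside the script rather than the shell, it must be set before torch is imported, so place it at the very top of the entry file:

```python
import os

# Must run before the first "import torch" anywhere in the process;
# the flag is read once when the MPS backend initializes.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch  # imported only after the variable is set
```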
Mixed Precision on MPS
As of PyTorch 2.x, MPS supports float16 tensors and some automatic mixed precision scenarios. However, Apple GPUs lack Tensor Core-style acceleration, and AMP support on MPS is still evolving: speedups are inconsistent and numerical stability is not on par with CUDA AMP. For most workflows, staying with the default float32 on MPS is the safer and more predictable choice.
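For those who want to experiment anyway, a guarded sketch keeps the code portable. Note the assumptions: torch.autocast with device_type="mps" exists only in recent PyTorch 2.x releases, and the fallback branch simply runs in float32:

```python
import torch

# Hedged sketch: MPS autocast requires a recent PyTorch 2.x release,
# and op coverage is narrower than CUDA AMP, so compare results
# against a float32 run before relying on it.
device = "mps" if torch.backends.mps.is_available() else "cpu"
a = torch.randn(64, 64, device=device)
b = torch.randn(64, 64, device=device)

if device == "mps":
    with torch.autocast(device_type="mps", dtype=torch.float16):
        c = a @ b  # runs in float16 where autocast covers the op
else:
    c = a @ b      # plain float32 on the CPU
print(c.shape, c.dtype)
```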
Distributed Training Limitations
Multi-GPU distributed training is not supported on MPS. The standard distributed backends like NCCL are not compatible with the MPS device, and Apple Silicon Macs expose a single logical GPU. CPU-based distributed training remains possible, but GPU-accelerated multi-device training is not currently available through standard PyTorch distributed utilities on Apple hardware.
Speed Expectations: What You Can Realistically Get
Setting realistic speed expectations for Apple Silicon PyTorch MPS requires understanding both its strengths and its position relative to dedicated NVIDIA GPUs.
Compared to running on the Mac CPU alone, MPS acceleration makes a meaningful difference. For transformer inference and CNN workloads, speedups of several times over CPU are common depending on batch size and model architecture. For training on models like ResNet-50 or VGG-16, the Apple M1 Ultra showed substantial acceleration in benchmarks run with macOS Monterey 12.3 and PyTorch 1.12, with the gains being most pronounced at larger batch sizes.
The MPS backend performs best when GPU utilization is high. On very small models or tiny batch sizes, the overhead of dispatching work to the GPU can reduce or eliminate the benefit over running on the CPU. If you are testing MPS and find it slower than expected, increasing your batch size is often the first adjustment worth making.
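A rough timing sketch makes the dispatch-overhead effect easy to observe. One detail matters on MPS: GPU work is queued asynchronously, so torch.mps.synchronize() must be called before reading the clock or the measurement only captures enqueue time. The sizes and iteration counts here are arbitrary illustration values:

```python
import time
import torch

device = "mps" if torch.backends.mps.is_available() else "cpu"

def time_matmul(batch, size=512, iters=10):
    # Batched matmul: (batch, size, size) @ (size, size)
    x = torch.randn(batch, size, size, device=device)
    w = torch.randn(size, size, device=device)
    start = time.perf_counter()
    for _ in range(iters):
        _ = x @ w
    if device == "mps":
        torch.mps.synchronize()  # wait for queued GPU work to finish
    return time.perf_counter() - start

print(f"batch 1:  {time_matmul(1):.4f}s")
print(f"batch 32: {time_matmul(32):.4f}s")
```

On an MPS device, the per-sample cost at batch 32 is typically far lower than at batch 1, because the fixed dispatch overhead is amortized across more work.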
Compared to a high-end NVIDIA GPU, Apple Silicon lags on raw throughput for large-scale training. However, Apple Silicon generally delivers strong performance per watt compared to discrete GPU setups, which matters considerably for long training runs on a laptop without constant power access.
For prototyping, fine-tuning, and local inference, the Apple Silicon PyTorch MPS setup is genuinely useful. For training large models from scratch or running production-scale pipelines, dedicated NVIDIA hardware with CUDA still offers more raw compute.
Memory Management Tips for MPS
In terms of capacity, 16 GB of unified memory allows you to load models that would otherwise require a GPU with a similar VRAM size, because the memory is physically shared rather than divided between separate pools. Bandwidth characteristics between unified memory and dedicated VRAM differ, but for capacity-constrained use cases this is a practical advantage.
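When tuning memory usage, it helps to see what the MPS allocator is actually holding. The torch.mps query functions below exist in recent PyTorch releases; the snippet is guarded so it is a no-op on machines without MPS:

```python
import torch

# Sketch: inspect MPS allocator state (recent PyTorch releases only).
if torch.backends.mps.is_available():
    print(torch.mps.current_allocated_memory())  # bytes held by live tensors
    print(torch.mps.driver_allocated_memory())   # bytes reserved from the Metal driver
else:
    print("MPS not available on this machine.")
```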
When working on memory-intensive tasks, you can free the MPS allocator cache between runs:
import gc
import torch
gc.collect()
torch.mps.empty_cache()
This is particularly helpful in Jupyter notebooks or long loops where intermediate tensors accumulate.
Common Issues and How to Fix Them
If MPS reports as unavailable on a Mac that should support it, the most frequent cause is a Rosetta-emulated Python. Reinstalling Python as a native ARM64 build, and recreating any conda or virtual environments so they use it, resolves the problem in most cases.
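A one-liner from the standard library confirms which architecture the running interpreter was built for. On macOS, a native Apple Silicon Python reports "arm64", while a Rosetta-emulated one reports "x86_64":

```python
import platform

# Check whether this Python interpreter is a native ARM64 build.
arch = platform.machine()
if arch == "arm64":
    print("Native arm64 Python detected; MPS can be used.")
else:
    print(f"Python architecture is {arch}; on an Apple Silicon Mac "
          "this indicates Rosetta emulation.")
```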
If you encounter errors about unsupported operations, enabling the MPS fallback environment variable is the quickest fix. If performance is lower than expected, checking your batch size is the first step. Small batches often do not give the GPU enough work to offset dispatch overhead, and increasing batch size typically improves utilization and throughput.
Building PyTorch from source with MPS support requires Xcode 13.3.1 or later. For most users, the standard pip installation is sufficient and recommended over a source build.