Triton Autotune Explained with Examples: Choosing Block Sizes for Real Speedups

If you have been writing GPU kernels in Triton, you already know that performance is not just about writing correct code. It is about writing code that runs fast on the actual hardware. One of the most powerful tools Triton gives you for this is its built-in autotuning system. Triton autotune lets you define multiple configurations for your kernel and automatically picks the fastest one at runtime. In this article, we will walk through how Triton autotune works, why block sizes matter so much, and how to use autotuning to get real, measurable speedups in your GPU programs.

What Is Triton Autotune and Why Does It Matter

Triton autotune is a decorator-based system that benchmarks multiple kernel configurations and selects the one with the best performance for a given input size. Instead of guessing the right block size or number of warps for your kernel, you define a list of candidate configurations and let Triton figure out which one works best.

This matters because GPU performance is highly sensitive to how you organize your computation. A block size that works great for a 1024-element vector might perform terribly on a 65536-element one. Hardware differences between an A100 and an RTX 3090 add another layer of complexity. Triton autotune removes the guesswork by testing your configurations automatically.

The autotuning happens once per unique input shape and is cached, so you do not pay the benchmarking cost every time your kernel runs. This makes it practical for production use, not just experimentation.

Understanding Block Sizes in Triton Kernels

Before diving into Triton autotune examples, it helps to understand what block sizes actually are and why they have such a big impact on performance.

In Triton, a block is the fundamental unit of parallel work. When you write a kernel, you define how many elements each program instance processes, and that number is your block size. A BLOCK_SIZE of 256 means each program instance operates on a tile of 256 elements at once, which Triton maps onto the GPU's threads and vector instructions for you.

Why Block Size Affects GPU Speedup

Choosing the wrong block size leads to two common problems. If your block size is too small, you leave parallelism on the table. The GPU has thousands of CUDA cores, and small blocks do not give the hardware enough work to stay busy. Memory latency dominates, and throughput drops.

If your block size is too large, you run into register pressure and shared memory limits. Each streaming multiprocessor has a fixed amount of registers and shared memory. When a single block tries to use too much, fewer blocks can run simultaneously, which hurts occupancy and reduces the GPU speedup you actually observe.
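The occupancy tradeoff can be made concrete with a simplified model. The per-SM limits below (64K registers, 64 resident warps) are illustrative values typical of recent NVIDIA GPUs, not queried from hardware, and the function is a sketch, not a real occupancy calculator:

```python
WARP_SIZE = 32

def blocks_per_sm(regs_per_thread, num_warps,
                  regs_per_sm=65536, max_warps_per_sm=64):
    # Simplified occupancy model: how many blocks fit on one SM,
    # limited by register usage and by the resident-warp ceiling.
    threads = num_warps * WARP_SIZE
    limit_by_regs = regs_per_sm // (regs_per_thread * threads)
    limit_by_warps = max_warps_per_sm // num_warps
    return min(limit_by_regs, limit_by_warps)

# Doubling per-thread register use halves the register-limited count:
# blocks_per_sm(32, 8) -> min(8, 8) = 8
# blocks_per_sm(64, 8) -> min(4, 8) = 4
```

Fewer resident blocks means fewer warps for the scheduler to switch between, which is exactly the occupancy loss described above.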

The sweet spot is somewhere in between, and it depends on your kernel’s arithmetic intensity, memory access pattern, and the specific GPU you are targeting. This is exactly why Triton autotune exists.

Common Block Size Values to Try

In practice, block sizes are almost always powers of two. Common choices are 64, 128, 256, 512, and 1024. Triton requires block sizes to be compile-time constants because they are baked into the kernel during compilation (Triton lowers through LLVM to PTX, specializing the generated code for each value). This is why you pass them as constexpr parameters annotated with the tl.constexpr type.
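Since the candidates are nearly always powers of two, a config sweep can be generated rather than typed out by hand. The helper below is an illustrative convenience, not part of Triton's API:

```python
def candidate_block_sizes(lo=64, hi=1024):
    # Powers of two from lo to hi inclusive -- the usual autotune sweep.
    sizes = []
    b = lo
    while b <= hi:
        sizes.append(b)
        b *= 2
    return sizes

# candidate_block_sizes() -> [64, 128, 256, 512, 1024]
```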

How Triton Autotune Works Internally

When you apply the @triton.autotune decorator to a kernel function, you provide a list of triton.Config objects. Each Config specifies a set of meta-parameters like BLOCK_SIZE, num_warps, and num_stages. Triton benchmarks each configuration on the actual hardware with a warm-up run followed by timing runs, then stores the winning configuration in a cache keyed by the input shapes.

Here is a simplified view of what happens under the hood:

  1. Your kernel is compiled once for each configuration in the list.
  2. Each compiled version is run on the GPU with the actual input data.
  3. Execution times are recorded and compared.
  4. The fastest configuration is saved and reused for all future calls with the same input shape.
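In plain Python, the selection step amounts to something like the sketch below. Triton's real implementation times kernels with CUDA events and handles cache invalidation; this is only the shape of the loop:

```python
import time

def autotune_select(configs, run_kernel, warmup=3, reps=10):
    # Benchmark every candidate configuration and return the fastest one.
    best_cfg, best_ms = None, float("inf")
    for cfg in configs:
        for _ in range(warmup):              # untimed warm-up runs
            run_kernel(cfg)
        start = time.perf_counter()
        for _ in range(reps):                # timed runs
            run_kernel(cfg)
        ms = (time.perf_counter() - start) * 1e3 / reps
        if ms < best_ms:
            best_cfg, best_ms = cfg, ms
    return best_cfg
```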

The key parameter tells Triton which input arguments the cache is keyed on. When any of those values changes, Triton re-runs the autotuning process for the new combination and caches a new winner.

A Practical Triton Autotune Example

Let us look at a concrete Triton autotune example using a simple element-wise vector addition kernel. This is a memory-bound operation, so the block size has a direct impact on how efficiently we use memory bandwidth.

import triton
import triton.language as tl
import torch

@triton.autotune(
    configs=[
        triton.Config({'BLOCK_SIZE': 64}, num_warps=2),
        triton.Config({'BLOCK_SIZE': 128}, num_warps=4),
        triton.Config({'BLOCK_SIZE': 256}, num_warps=4),
        triton.Config({'BLOCK_SIZE': 512}, num_warps=8),
        triton.Config({'BLOCK_SIZE': 1024}, num_warps=8),
    ],
    key=['n_elements'],
)
@triton.jit
def vector_add_kernel(
    x_ptr, y_ptr, output_ptr,
    n_elements,
    BLOCK_SIZE: tl.constexpr,
):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    tl.store(output_ptr + offsets, output, mask=mask)

In this Triton autotune example, we define five configurations with different block sizes and warp counts. When this kernel is first called, Triton will benchmark all five and pick the winner. On a modern A100 GPU, BLOCK_SIZE 1024 with 8 warps often wins for large vectors because it maximizes memory throughput. On smaller or older GPUs, 256 or 512 may perform better.

Launching the Kernel

def vector_add(x: torch.Tensor, y: torch.Tensor):
    output = torch.empty_like(x)
    n_elements = output.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
    vector_add_kernel[grid](x, y, output, n_elements)
    return output

Notice how the grid function uses meta to access the chosen BLOCK_SIZE at launch time. This is a common pattern in Triton kernels and it ensures the grid dimension adjusts correctly regardless of which configuration was selected.
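triton.cdiv is ceiling division: the number of BLOCK_SIZE-wide blocks needed to cover every element. A plain-Python equivalent makes the arithmetic explicit:

```python
def cdiv(x, y):
    # Ceiling division: blocks needed to cover x elements at y per block.
    return (x + y - 1) // y

# cdiv(1000, 256) -> 4 blocks; in the kernel, the mask disables the
# 24 out-of-range lanes of the last block (4 * 256 - 1000).
```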

Choosing Block Sizes for Matrix Operations

Vector addition is a simple case. For matrix multiplication and other 2D operations, block size selection becomes more complex because you have two dimensions to consider, often referred to as BLOCK_M and BLOCK_N for the output tile, and BLOCK_K for the reduction dimension.

A 2D Triton Autotune Example

Here is how a typical matmul autotune configuration list looks:

@triton.autotune(
    configs=[
        triton.Config({'BLOCK_M': 64, 'BLOCK_N': 64, 'BLOCK_K': 32}, num_warps=4, num_stages=2),
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 64, 'BLOCK_K': 32}, num_warps=4, num_stages=4),
        triton.Config({'BLOCK_M': 64, 'BLOCK_N': 128, 'BLOCK_K': 64}, num_warps=8, num_stages=4),
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'BLOCK_K': 64}, num_warps=8, num_stages=4),
        triton.Config({'BLOCK_M': 256, 'BLOCK_N': 128, 'BLOCK_K': 64}, num_warps=8, num_stages=3),
    ],
    key=['M', 'N', 'K'],
)
@triton.jit
def matmul_kernel(
    a_ptr, b_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK_M: tl.constexpr,
    BLOCK_N: tl.constexpr,
    BLOCK_K: tl.constexpr,
):
    ...

The num_stages parameter controls software pipelining depth. Higher values allow Triton to prefetch data while computation is happening, which hides memory latency and contributes to GPU speedup. However, more stages consume more shared memory, so there is a tradeoff.
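You can estimate the shared-memory cost of a configuration before adding it to the sweep. The model below assumes fp16 inputs and one A tile plus one B tile buffered per stage; actual usage varies with Triton version and layout, so treat it as a rough sanity check:

```python
def smem_estimate_bytes(block_m, block_n, block_k, num_stages, dtype_bytes=2):
    # Per stage: an A tile (BLOCK_M x BLOCK_K) and a B tile
    # (BLOCK_K x BLOCK_N), replicated num_stages deep for pipelining.
    per_stage = (block_m * block_k + block_k * block_n) * dtype_bytes
    return per_stage * num_stages

# smem_estimate_bytes(128, 128, 64, num_stages=4) -> 131072 bytes (128 KiB),
# which fits in an A100's shared memory budget but would exceed the
# roughly 100 KiB available per block on many consumer GPUs.
```

This is why the largest tiles in the config list above pair with fewer stages: the product of tile size and stage count is what has to fit.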

num_warps and Its Role in Getting Real Speedups

Beyond block sizes, num_warps is another critical configuration parameter. A warp is a group of 32 threads that execute together on the GPU. The number of warps per block affects how well the scheduler can hide latency by switching between warps when one is waiting for memory.

For memory-bound kernels like element-wise operations, 4 to 8 warps is usually sufficient. For compute-bound kernels like matrix multiplication, more warps can help by keeping the tensor cores busy. Triton autotune lets you test different warp counts alongside different block sizes in a single sweep, which is far more efficient than tuning them separately by hand.

Reading Autotune Results and Benchmarking

After autotuning runs, you can inspect which configuration won. In recent Triton versions the decorated kernel exposes a best_config attribute (for example, print(vector_add_kernel.best_config) after the kernel has been launched at least once), which lets you verify the chosen block size and warp count.

To measure the actual GPU speedup, use Triton’s built-in benchmarking utility:

@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=['size'],
        x_vals=[2**i for i in range(12, 28)],
        line_arg='provider',
        line_vals=['triton', 'torch'],
        line_names=['Triton', 'Torch'],
        ylabel='GB/s',
        plot_name='vector-add-performance',
        args={},
    )
)
def benchmark(size, provider):
    x = torch.rand(size, device='cuda', dtype=torch.float32)
    y = torch.rand(size, device='cuda', dtype=torch.float32)
    quantiles = [0.5, 0.2, 0.8]
    if provider == 'triton':
        ms, min_ms, max_ms = triton.testing.do_bench(lambda: vector_add(x, y), quantiles=quantiles)
    if provider == 'torch':
        ms, min_ms, max_ms = triton.testing.do_bench(lambda: x + y, quantiles=quantiles)
    gbps = lambda ms: 3 * x.numel() * x.element_size() * 1e-9 / (ms * 1e-3)
    return gbps(ms), gbps(max_ms), gbps(min_ms)

benchmark.run(show_plots=True, print_data=True)

This gives you a clear picture of the GPU speedup your Triton kernel achieves compared to a baseline PyTorch operation across different input sizes. For memory-bound kernels, you typically measure throughput in GB/s rather than raw execution time, since the goal is to saturate memory bandwidth.
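The gbps lambda in the benchmark counts three tensors' worth of traffic per element (read x, read y, write output). Expanded into a standalone function:

```python
def gbps(ms, n_elements, element_size=4, tensors_moved=3):
    # Bytes moved per call: read x, read y, write output -> 3 tensors.
    bytes_moved = tensors_moved * n_elements * element_size
    return bytes_moved * 1e-9 / (ms * 1e-3)   # bytes -> GB, ms -> s

# A 2**26-element fp32 add finishing in 0.5 ms moves about 0.8 GB,
# i.e. roughly 1611 GB/s of sustained bandwidth.
```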

Common Mistakes When Using Triton Autotune

One mistake developers make is providing too few configurations. If you only test two or three block sizes, you might miss the optimal one. A wider search space takes longer to autotune initially but gives better results. At the same time, including obviously bad configurations like BLOCK_SIZE of 8 or 16 for large inputs wastes time without adding value.

Another common issue is forgetting to set the key parameter correctly. If you forget to include a dimension in the key list, Triton will reuse a cached configuration even when the input shape changes significantly. This can lead to silent performance regressions that are hard to debug.

Finally, developers sometimes confuse autotuning overhead with runtime overhead. The benchmarking cost only happens once per unique shape. After that, the kernel runs at full speed. If your application processes many different shapes, consider warmup strategies or pre-populating the cache.
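The caching behavior is easy to model as a dict keyed by the autotune key values, populated lazily. This toy version (not Triton internals) shows how pre-populating the cache moves the one-time cost up front:

```python
_config_cache = {}

def launch_with_cache(key, tune):
    # Tune once per distinct key value; later calls with the same key
    # reuse the cached winner and pay no benchmarking cost.
    if key not in _config_cache:
        _config_cache[key] = tune(key)   # expensive: runs the full sweep
    return _config_cache[key]

def prewarm(keys, tune):
    # Pay all autotuning costs before serving real traffic.
    for key in keys:
        launch_with_cache(key, tune)
```

It also shows the failure mode from the key mistake above: if a dimension is left out of the key, two genuinely different shapes collapse onto one cache entry and silently share a config.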

Real Speedup Numbers You Can Expect

The actual GPU speedup from Triton autotune depends heavily on the kernel type and hardware. For a simple vector addition on an A100, a well-tuned Triton kernel can reach 90 to 95 percent of peak memory bandwidth, matching or even slightly exceeding PyTorch’s native implementation for specific sizes. For matrix multiplication, hand-tuned Triton kernels with good block size selection have been shown to match cuBLAS performance on certain shapes.

For custom kernels like fused attention or specialized reductions that do not have a cuBLAS equivalent, Triton autotune often delivers 2x to 5x speedups compared to naive implementations. The exact numbers depend on your specific operation, data types, and GPU model, but the general principle holds: proper block size selection through Triton autotune consistently leads to meaningful real speedups.
