127 lines of catch-up for a problem Apple flagged in 2016
Or: how I spent thirteen months¹ almost noticing what a WWDC talk said out loud a decade ago.
Apple’s M-series GPU has no hardware integer divide. None. In 2026. So when you ask AGX to compute a / b, it runs a ~730-cycle software subroutine, a tiny shameful program nestled inside an otherwise competent GPU (or at least mostly competent; it also lacks double precision, but more on that in a future post), doing long division the way you learned it in fourth grade. That is, unless the divisor is a compile-time constant to the AGX backend, which is a much narrower category than you’d think.
ggml’s Metal matmul kernels (mul_mm, mul_mv) eat four such divides per thread, by values that are constant for the entire inference session. The divisors come in through a uniform buffer, so the AGX compiler never sees them, so every dispatch pays ~2,800 cycles of udiv it absolutely doesn’t have to. Multiply by every thread of every dispatch of every layer of every token on every Mac running llama.cpp, and you arrive at a number that’s embarrassing to contemplate.
The fix is mechanical: promote the four divisors to MSL
[[function_constant]]s, baked into the pipeline-state object at PSO-compile time. The AGX backend then strength-reduces them to a no-op (ne12=1), a shift (pow2), or a magic-multiply (otherwise).
Where AGX’s divide goes
| where d comes from | per-op cost |
|---|---|
| Pow2 literal (% 256u) | ~5 cy |
| Non-pow2 literal (% 255u) | ~80 cy |
| [[function_constant]] | ~80 cy |
| Kernel argument / uniform buffer | ~730 cy |
The slow path exists because at PSO build time there is no value to fold; the buffer’s contents only exist at dispatch time. The fast paths all share one thing: AGX could see the divisor when it built the pipeline. Apple warned us about this in a WWDC 2016 talk: “So avoid division or modulus by denominators that are not literal or function constants… that will be very, very slow. Think hundreds of clock cycles.”
When it matters
The win requires three things at once:
- The kernel is ALU-bound, not memory-bound. If the divide hides behind a memory stall, fixing it buys nothing.
- The divisor is constant for the run — tensor dims, GQA group sizes, ring-buffer lengths.
- The divide happens often — once per thread, large grid.
ggml’s matmul kernels happen to hit all three. Most other places you’d grep for don’t — fragment shaders are float-land, audio kernels are memory-bound, Stable Diffusion’s UNet has the wrong arithmetic intensity. So this isn’t some general technique I’ve unlocked; it’s one kernel family where the stars aligned.
The fix
```metal
// before — divisors come through the args struct
const int i13 = im / args.ne12; // ~730 cy
const int i12 = im % args.ne12; // ~730 cy
const int i02 = i12 / args.r2;  // ~730 cy
const int i03 = i13 / args.r3;  // ~730 cy
```

```metal
// after — same values, promoted to function constants
constant int16_t FC_mul_mv_ne12 [[function_constant(FC_MUL_MV_NE12)]];

const int i13 = im / FC_mul_mv_ne12; // ~5 cy: magic-multiply (or shift, or no-op)
```
ggml-org/llama.cpp#22711. +127 / −88 across 4 files.
Host-side: extract (ne12, ne13, r2, r3) from the op shape, bind via MTLFunctionConstantValues, thread them into the PSO cache key so a Gemma-2 GQA pipeline isn’t reused for a TinyLlama (ne12=1, r2=1, r3=1) op. The interesting code is two lines; the boring code is two hundred.
Results
M4 Pro, llama-bench -p 512 -n 128 -r 20, tg128 tok/s:
| model | quant | baseline | patched | delta |
|---|---|---|---|---|
| TinyLlama 1.1B | Q4_0 | 239.05 | 246.57 | +3.15% |
| Llama 3.2 1B | Q4_0 | 227.58 | 230.96 | +1.49% (noise) |
| Gemma 3 1B (GQA) | Q4_K_M | 164.67 | 173.52 | +5.37% |
| Gemma 2 2B (GQA) | Q4_K_M | 103.41 | 106.66 | +3.14% |
| Mistral 7B (GQA) | Q4_0 | 52.10 | 52.73 | +1.21% |
Perplexity is bit-identical: magic-multiply is exact for unsigned division, and since our values are trivially nonnegative the signed divides reduce to the unsigned case. The biggest wins are on GQA models (exactly the ones people actually run), which hit all four divisors meaningfully: r2 and r3 are greater than 1, so the divides are real work.
Global impact, or lying with arithmetic
1M DAU × ~20K tokens/day ÷ ~150 tok/s × ~3% speedup ≈ ~46 person-years per year, if you squint. Add the rest of the ggml ecosystem and call it 50. These numbers are off by up to 10× in either direction.
If you write Metal compute: grep your .metal files for `/ args.` and `% args.`. For each hit, ask whether the divisor is constant for the kernel’s deployment. If yes, and the kernel isn’t memory-bound, there’s a ~9× win sitting on that divide.
If you don’t write Metal but ship software that runs ggml on a Mac: things got 1–5% faster this week.
PR: ggml-org/llama.cpp#22711. Reproducer: github.com/SovereignSoft/agx-idiv-demo.
Footnotes
1. The actual investigation was a day. I noticed the idiv weirdness, forgot about it for a year, and only wandered back into the llama hot loop last week because I was procrastinating on something else. ↩