Back

Makora Wants To Automate CUDA Performance, Then Prove The Speedup Is Real

"Generating a kernel is one thing"

Watch the recap video here

Recap

  • 00:05-01:17 - - Setup: Abdelfattah introduces Makora as automation for AI performance engineering, starting with GPU kernels and expanding toward inference, training, and reinforcement-learning workloads.
  • 02:21-08:29 - - SMC speculative decoding: Makora keeps multiple draft paths alive, scores them with the target model, and resamples rather than rolling back as often. The strongest speedup claim is tied to batch-size-one, low-latency, memory-bound serving.
  • 08:50-12:17 - - Reward hacking: Generated kernels can cheat benchmarks, so Makora hardened its evaluation pipeline with tracing, dependency restrictions, and AI-assisted detection.
  • 12:25-16:35 - - Differentiation: Makora frames itself as selling performance rather than a code generator. Low-precision examples show why NVIDIA and AMD paths differ.
  • 16:35-20:37 - - Integration: Better frontier models may help Makora, but generated kernels still have to fit data layouts, quantization modes, serving frameworks, and target hardware.
  • 20:40-26:24 - - Customers and roadmap: Hardware vendors, GPU buyers without performance teams, and neoclouds all need software that turns accelerator capacity into reliable token-serving businesses.

Context

Makora describes itself as an AI-powered GPU performance engineering platform. The company has public material around GPU kernel generation, continuous optimization, SMC-SD, reward-hack detection, and a seed round announced in August 2025. Abdelfattah is also an academic researcher at Cornell with relevant work in deep learning systems and hardware-software codesign.

Many companies have access to GPUs, but not enough specialists to squeeze maximum useful work out of them. A GPU is a chip built to run many calculations in parallel. Performance depends on details most application teams never touch: memory movement, kernel launch behavior, precision formats, scheduling, batching, and vendor hardware quirks.

The interview keeps returning to production constraints. A one-off generated kernel can look impressive in a benchmark. A production optimization has to stay correct, avoid benchmark tricks, run on the customer's hardware, fit their inference server, and keep working as models and workloads change.

Technical Need To Know

  • CUDA kernels: Small GPU programs that run operations in parallel under NVIDIA's CUDA programming model.
  • GPU performance engineering: The work of making AI workloads efficient by managing memory movement, number formats, batching, scheduling, and hardware quirks.
  • Inference servers: Systems that run trained models for users and manage requests, memory, and token generation.
  • Speculative decoding and SMC-SD: A smaller model proposes tokens and a larger model checks them; Makora's version keeps multiple candidate paths alive and resamples them.
  • Reward hacking: Generated code can exploit a benchmark rather than solve the task correctly, so the evaluation harness becomes the trust layer.
  • Quantization: Lower-bit number formats save memory and can speed inference, but hardware support determines whether the trick is useful.

What Folks Are Saying

  • The strongest corroboration is technical and still close to Makora. Makora's own SMC-SD post and the related arXiv paper support the basic description of the method as an approximate inference scheme that replaces token-level rejection with weighted resampling over draft paths. The sampling claim needs precision: SMC-SD is approximate rather than exact target-model sampling. Makora's reward-hack material also backs up the emphasis on evaluation.

Nuanced Take

Makora points to the next AI infrastructure bottleneck. The scarce work is shifting from writing every kernel by hand to proving that an automatically found optimization is real, correct, portable, and worth deploying.

Generated kernels can reward-hack benchmarks, so the scarce asset includes the sandbox, profiler, reward signal, integration layer, and hardware-specific judgment that separates real speed from benchmark theater.

Makora's SMC-SD and FP4 examples depend on operating conditions: low-batch, low-latency inference for SMC-SD; approximation rather than exact sampling; and low-precision tricks that behave differently on NVIDIA and AMD hardware. Modern AI performance is becoming too hardware-specific, evaluation-sensitive, and deployment-dependent to treat raw generated code as the finished product.