Senior GPU Optimisation Engineer

Experience: 4–10 years
Location: Bangalore
CTC: 40–65L

Senior backend engineer with expertise in GPU optimization and inference performance tuning for AI models.

About this role

About Smallest.ai

Smallest AI is one of the world's fastest-growing speech AI research companies. Our speech AI models, among the world's best, are used by over 30,000 customers across 12 countries to automate their calling operations. At heart, we are engineers, researchers, and a bunch of nerds who love diving deep into AI.

Role

We're hiring a GPU Optimization Engineer who understands GPUs at a deep, architectural level: someone who knows how to squeeze every last millisecond out of a model, which GPU constraints matter, and how to restructure models for real-world inference performance. You'll work across CUDA kernels, model graph optimizations, hardware-specific tuning, and porting models across GPU architectures. Your work directly impacts the latency, throughput, and reliability of Smallest.ai's real-time speech models.

What You'll Do

  • Partner with the research and infra teams to ensure the entire stack is optimized for real-time workloads
  • Work with TensorRT, ONNX Runtime, and custom runtimes for deployment
  • Port models across GPU chipsets (NVIDIA → AMD / edge GPUs / new compute backends)
  • Benchmark and calibrate inference across NVIDIA, AMD, and potentially emerging accelerators
  • Tune models to fit GPU memory limits while maintaining quality
  • Perform operator fusion, graph optimization, and kernel-level scheduling improvements
  • Design and implement custom kernels (CUDA/Triton/Tinygrad) for performance-critical model sections
  • Profile models end-to-end to identify GPU bottlenecks: memory bandwidth, kernel launch overhead, fusion opportunities, and quantization constraints
  • Optimize model architectures (ASR, TTS, SLMs) for maximum performance on specific GPU hardware

Requirements

  • Deep understanding of GPU architecture (NVIDIA, AMD)
  • Experience with CUDA, TensorRT, ONNX Runtime
  • Proficiency in kernel-level optimization and profiling
  • Experience with model quantization and inference optimization
  • Strong background in performance-critical systems