What this interview will probe
This role builds and optimizes the systems that serve OpenAI's models in production, working alongside researchers to improve inference performance, throughput, and reliability for the models powering ChatGPT and the API. Engineers introduce new techniques for low-latency, high-utilization serving of large transformers across GPU fleets. An interview is likely to probe how inference differs from training (KV caching, static versus continuous batching, quantization), the tradeoffs between GPU memory and latency, and how to design a serving stack that maximizes tokens per second under tight tail-latency constraints.
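One way to practice the GPU memory tradeoff mentioned above is a back-of-the-envelope KV cache estimate. The sketch below is illustrative only: the model shape (32 layers, 128-dimensional heads), the 40 GiB memory budget, and the function names are hypothetical assumptions, not figures from any real deployment. It uses the standard sizing rule that each token stores one key and one value vector per layer per KV head.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies across all layers.

    The leading factor of 2 accounts for storing both a key and a value
    vector; bytes_per_elem=2 corresponds to fp16/bf16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_concurrent_sequences(free_bytes, seq_len, per_token_bytes):
    """How many full-length sequences fit in the memory left after weights."""
    return free_bytes // (seq_len * per_token_bytes)


# Hypothetical setup: 40 GiB free for the cache, 4096-token sequences.
FREE_BYTES = 40 * 2**30
SEQ_LEN = 4096

# Baseline: 32 layers, 32 KV heads (plain multi-head attention), fp16 cache.
fp16_mha = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

# Same shape with an 8-bit quantized cache (1 byte per element).
int8_mha = kv_bytes_per_token(32, 32, 128, bytes_per_elem=1)

# Same shape with grouped-query attention: 8 KV heads shared across queries.
fp16_gqa = kv_bytes_per_token(32, 8, 128)

for name, per_token in [("fp16 MHA", fp16_mha),
                        ("int8 MHA", int8_mha),
                        ("fp16 GQA", fp16_gqa)]:
    fits = max_concurrent_sequences(FREE_BYTES, SEQ_LEN, per_token)
    print(f"{name}: {per_token / 2**10:.0f} KiB/token, {fits} sequences fit")
```

Under these assumed numbers the fp16 baseline costs 512 KiB per token, so only 20 full-length sequences fit. Halving the element size or cutting the KV head count raises that ceiling, which is why cache quantization and attention layout come up when discussing batch size and throughput.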
ExoForm is not affiliated with OpenAI. This is an independent practice page.