LLM Inference Infrastructure

This is an opinionated discussion of the basics of LLM inference and the ecosystem of serving runtimes powering inference today, written from a systems design perspective. It was originally written internally at IOP Systems, but thanks to some gentle arm-twisting, it has been posted publicly in case it is useful to anyone. (Thanks for the push, Yao!)

The intended audience is systems and infrastructure developers who want to understand the systems behind LLM inference and how requests are served at scale, but who have little to no machine learning background. It focuses largely on the performance aspects of these systems, specifically on techniques that scale serving, reduce latency, or improve efficiency and hardware utilization.

This is not intended to be comprehensive or complete, and it favours broad intuition over mathematical rigour. That said, it is best thought of as a living document; corrections and contributions to improve clarity or scope are welcome here.

Start reading →