LLM Inference
An opinionated and incomplete survey of LLM inference and serving runtimes from a systems and infrastructure lens.
- LLMs and Transformers: Introduction, embeddings, transformers, and attention mechanisms
- Inference and the KV Cache: Inference execution and the KV cache
- Sharding a Model: Pipeline, tensor, and expert parallelism
- Batching, Scheduling, and Paging: Continuous batching, Orca, and PagedAttention
- I/O-Aware Kernels: FlashAttention and FlashInfer
- Speculative Decoding: Speculative decoding, EAGLE, and Medusa Trees
- Prefill-Decode Scheduling and Disaggregation: Chunked prefill and prefill-decode disaggregation
- KV Cache Management and Offload: Prefix caching and KV offload
- Appendix: Overview of Training: Fine-tuning, RLHF, RLAIF, quantization, and alignment techniques
- Appendix: GPU Hardware: Architecture, CUDA and ROCm, kernels and Triton, memory hierarchy
- Appendix: Inference Runtimes: LLM serving stacks, TensorRT, Triton, vLLM, and SGLang