Introduction

Mini-SGLang is a lightweight yet high-performance inference framework for Large Language Models (LLMs) derived from the SGLang project. It illustrates the key components of the complex modern LLM serving system with a minimalistic design. This blog series steps through the implementation of the MiniSGL system from a beginner’s perspective.

Scheduler

The core of MiniSGL, it collects requests, schedule the batch, and send the batches to the engine. It supports chunked prefill and overlap scheduling. [[MiniSGL Scheduler]]

KVCache

The KVCache stores and manages the KV cache for the model. [[MiniSGL KVCache]]

Engine

Engine is the execution layer, it receives the request batches from the scheduler, run the forward and sampling, and return the results to the scheduler. [[MiniSGL Engine]]