Introduction
Meridian is an inference-time compute scheduler for reasoning-model serving. It treats think-decode and output-decode as separate scheduling domains with separate SLOs, separate KV eviction priorities, and real-time entropy-driven budget control.
Why this exists
Reasoning models (DeepSeek-R1, Qwen3, Qwen2.5, o3) emit two structurally different token sequences within a single request:
[prompt] → <think> ... N reasoning tokens ... </think> → [output tokens]
These two phases have opposite latency profiles:
| Phase | User-visible latency tolerance | Throughput importance | Correct SLO target |
|---|---|---|---|
| Think-decode | High — user waits regardless | Critical (cost driver) | TPOT-relaxed |
| Output-decode | Zero — streaming experience | Secondary | TTOT-strict |
Standard continuous-batching schedulers (including vLLM's default) process all decode tokens, thinking and output alike, from the same priority queue with the same TPOT target. Meridian is the scheduling layer that distinguishes the two phases and schedules each against its own SLO.
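To make the contrast with a single shared queue concrete, here is a minimal Python sketch of phase-separated queues. The `Request` and `DualQueue` types and the per-step request cap are hypothetical, invented for illustration. This is not Meridian's scheduler or vLLM's API, and it does not model token budgets or SLO tracking.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    THINK = "think"
    OUTPUT = "output"


@dataclass
class Request:
    request_id: str  # hypothetical identifier
    phase: Phase


class DualQueue:
    """Two FIFO queues; output-phase requests are always drained first."""

    def __init__(self) -> None:
        self._queues = {Phase.OUTPUT: deque(), Phase.THINK: deque()}

    def submit(self, request: Request) -> None:
        self._queues[request.phase].append(request)

    def next_batch(self, max_requests: int) -> list[Request]:
        """Fill the batch from the output queue, then use leftover slots for think work."""
        batch: list[Request] = []
        for phase in (Phase.OUTPUT, Phase.THINK):
            queue = self._queues[phase]
            while queue and len(batch) < max_requests:
                batch.append(queue.popleft())
        return batch


queue = DualQueue()
queue.submit(Request("a", Phase.THINK))
queue.submit(Request("b", Phase.OUTPUT))
queue.submit(Request("c", Phase.THINK))
print([r.request_id for r in queue.next_batch(max_requests=2)])  # ['b', 'a']
```

Request `b` arrives second but is scheduled first because it is in the output phase. A single shared queue would have served `a` ahead of it.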
What Meridian does
- Dual-queue scheduling. Output-phase requests have absolute priority. Think-phase requests fill remaining capacity with a larger effective batch token budget.
- Phase-aware KV block manager. Three-tier eviction: `ThinkComplete < ThinkActive < OutputCritical`. Blocks from a completed reasoning chain are demoted the moment `</think>` is emitted.
- Entropy-driven budget forcing. EAT (arXiv:2509.26522) and RPDI (arXiv:2603.14251) signals inject `</think>` only when the model itself is signalling convergence or overthinking, not on a static token counter.
- Drop-in vLLM plugin. No fork required. Wraps the existing scheduler via the plugin interface; exposes Prometheus + OpenTelemetry telemetry.
- Disagg KV transfer. `offload_block`/`ingest_block` hooks on the block manager support prefill-decode disaggregation fabrics (NIXL, Mooncake-compatible). Documented in ADR-0006.
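The three-tier eviction order can be sketched in Python as follows. Only the tier names come from the list above. The `KVBlock` type, the `last_used` tick, the least-recently-used tie-break within a tier, and the function names are assumptions made for illustration, not Meridian's block manager API.

```python
from dataclasses import dataclass
from enum import IntEnum


class BlockTier(IntEnum):
    # Lower value is evicted first: ThinkComplete < ThinkActive < OutputCritical.
    THINK_COMPLETE = 0
    THINK_ACTIVE = 1
    OUTPUT_CRITICAL = 2


@dataclass
class KVBlock:
    block_id: int
    tier: BlockTier
    last_used: int  # hypothetical logical clock tick


def pick_victims(blocks: list[KVBlock], count: int) -> list[KVBlock]:
    """Return up to `count` blocks to evict: lowest tier first, then least recently used.

    The LRU tie-break is an assumption of this sketch.
    """
    return sorted(blocks, key=lambda b: (b.tier, b.last_used))[:count]


def on_think_end(blocks: list[KVBlock]) -> None:
    """Demote a request's active reasoning blocks once its closing think token is emitted."""
    for block in blocks:
        if block.tier == BlockTier.THINK_ACTIVE:
            block.tier = BlockTier.THINK_COMPLETE
```

Encoding the tiers as an ordered enum keeps victim selection to a single sort key, so demoting a block on `</think>` only changes one field.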
Scope and assumptions
- In scope: latency-differentiated scheduling for reasoning models served via vLLM on a single node or a disaggregated prefill-decode topology.
- Out of scope: model training, quantisation, speculative decoding, or serving systems other than vLLM. See Non-goals.
- Assumed: the serving stack emits per-request token IDs to a hook point that Meridian can intercept (the vLLM plugin interface satisfies this).
- Assumed: think/output phase boundaries are detectable from the decoded token stream via model-specific boundary token IDs (configurable per model in `meridian.toml`).
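The boundary-token assumption can be illustrated with a small state machine over decoded token IDs. This is a sketch only: the `PhaseTracker` name, the placeholder IDs `1001` and `1002`, and the choice to start in the output phase are assumptions, and real boundary IDs depend on the model's tokenizer.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    THINK = "think"
    OUTPUT = "output"


@dataclass
class PhaseTracker:
    think_start_id: int  # token ID of the opening think marker (model-specific)
    think_end_id: int    # token ID of the closing think marker (model-specific)
    phase: Phase = Phase.OUTPUT  # assumption: no marker seen means output phase

    def observe(self, token_id: int) -> Phase:
        """Update and return the phase after seeing one decoded token ID."""
        if token_id == self.think_start_id:
            self.phase = Phase.THINK
        elif token_id == self.think_end_id:
            self.phase = Phase.OUTPUT
        return self.phase


# Placeholder IDs, not taken from any real tokenizer.
tracker = PhaseTracker(think_start_id=1001, think_end_id=1002)
phases = [tracker.observe(t).value for t in (1001, 17, 42, 1002, 99)]
print(phases)  # ['think', 'think', 'think', 'output', 'output']
```

A tracker configured with the wrong IDs never changes state, so every token of the request is classified under one phase.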
Threats to validity
- Synthetic benchmark results are directional, not absolute. The synthetic-replay harness simulates latency with a calibrated decoder; it does not model multi-tenant memory pressure or real CUDA kernel variance. Results should be reproduced on target hardware before drawing conclusions.
- Budget forcing can affect accuracy. Injecting `</think>` early may shorten correct reasoning chains. Meridian fires forcing only on entropy convergence signals, but the threshold is a tunable heuristic, not a guarantee. Accuracy measurement is the operator's responsibility.
- Phase detection depends on token IDs. If a model tokenises `<think>`/`</think>` differently than the configured boundary IDs, Meridian treats the entire request as single-phase. The `models/*.toml` files in the repo carry vetted IDs for supported models.
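To show why a threshold-based trigger is a heuristic rather than a guarantee, here is a generic Python sketch of an entropy convergence check. It is not the EAT or RPDI algorithm and not Meridian's implementation: the windowed-mean rule, the `ConvergenceDetector` name, and the `threshold` and `window` parameters are all assumptions chosen for illustration.

```python
import math
from collections import deque


def shannon_entropy(probs: list[float]) -> float:
    """Entropy in nats of a probability distribution; zero-probability terms are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class ConvergenceDetector:
    """Signals convergence when mean entropy over a full window drops below a threshold."""

    def __init__(self, threshold: float, window: int) -> None:
        self.threshold = threshold          # tunable heuristic, in nats
        self._recent = deque(maxlen=window)  # most recent per-token entropies

    def update(self, probs: list[float]) -> bool:
        """Record one next-token distribution; return True if forcing would fire."""
        self._recent.append(shannon_entropy(probs))
        full = len(self._recent) == self._recent.maxlen
        return full and sum(self._recent) / len(self._recent) < self.threshold


detector = ConvergenceDetector(threshold=0.5, window=2)
uniform = [0.25, 0.25, 0.25, 0.25]  # high entropy: model is undecided
peaked = [0.97, 0.01, 0.01, 0.01]   # low entropy: model is confident
for probs in (uniform, peaked, peaked):
    fired = detector.update(probs)
print(fired)  # True
```

Lowering `threshold` or widening `window` makes the trigger more conservative at the cost of more think tokens, which is the trade-off an operator would need to validate against accuracy measurements.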
How to read this book
- Architecture: component map, per-component contracts, failure modes, and observability hooks.
- ADRs: every architectural decision with the rejected alternatives. Read these alongside the code to understand the why.
- Configuration: every `meridian.toml` field with type, default, valid range, and tuning guidance.
- API reference: Rust and Python surfaces with lifecycle and concurrency notes.
- Operations: metrics catalogue, alerting recommendations, troubleshooting runbooks, and benchmark methodology.
- Non-goals: what Meridian explicitly does not do.
- Glossary: definitions for TTFT, TTOT, ITL, EAT, RPDI, KV, disagg, NIXL, and other terms used throughout.