Glossary

Terms used throughout the Meridian documentation and codebase.


TTFT: Time to First Token. Wall-clock time from the moment a request is submitted to the moment the first token is returned to the client. Dominated by prefill latency for long prompts.

TTOT: Time to Output Token. Wall-clock time from the emission of the </think> boundary token to the first user-visible output token. The metric Meridian is specifically designed to protect.

TPOT: Time Per Output Token (also ITL, Inter-Token Latency). Time between consecutive output tokens during streaming. The user-perceived "speed" of the stream.

ITL: Inter-Token Latency. See TPOT.
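
As an illustration of how the three latency metrics relate, the sketch below derives them from per-token timestamps. The RequestTimeline type and its field names are hypothetical and not part of Meridian. It also assumes that every token, reasoning tokens included, is timestamped on emission, and that TPOT is averaged over user-visible output tokens only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RequestTimeline:
    """Hypothetical per-request timestamps, in seconds on a monotonic clock."""
    submitted: float
    token_times: List[float]        # emission time of every token, reasoning included
    think_end_index: Optional[int]  # index of the </think> token, None if no reasoning phase


def ttft(t: RequestTimeline) -> float:
    """Time from submission to the first token of any kind."""
    return t.token_times[0] - t.submitted


def ttot(t: RequestTimeline) -> Optional[float]:
    """Time from the </think> token to the first user-visible output token."""
    i = t.think_end_index
    if i is None or i + 1 >= len(t.token_times):
        return None  # no reasoning phase, or no output token emitted yet
    return t.token_times[i + 1] - t.token_times[i]


def mean_tpot(t: RequestTimeline) -> Optional[float]:
    """Mean gap between consecutive user-visible output tokens."""
    start = 0 if t.think_end_index is None else t.think_end_index + 1
    out = t.token_times[start:]
    if len(out) < 2:
        return None  # a gap needs at least two output tokens
    return (out[-1] - out[0]) / (len(out) - 1)


timeline = RequestTimeline(
    submitted=0.0,
    token_times=[0.5, 0.6, 0.7, 1.0, 1.1, 1.2],
    think_end_index=2,
)
print(ttft(timeline), ttot(timeline), mean_tpot(timeline))
```

In this made-up timeline the stream itself is fast, but the gap after </think> is three times the normal inter-token gap, which is the kind of stall TTOT is meant to expose.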

EAT: Entropy-Aware Termination. A budget-forcing signal based on the variance of the EMA of per-token Shannon entropy over the reasoning chain. When that variance drops below a threshold, the model is inferred to have converged on an answer. Defined in arXiv:2509.26522.
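
The convergence test can be sketched roughly as follows. This is an illustration under stated assumptions, not the procedure from the paper or Meridian's implementation: the function names, the sliding window, the use of population variance, and the default values are all hypothetical.

```python
from statistics import pvariance
from typing import List, Sequence


def ema_series(values: Sequence[float], alpha: float) -> List[float]:
    """EMA of `values`, seeded with the first value (an assumption)."""
    out: List[float] = []
    for v in values:
        out.append(v if not out else alpha * v + (1.0 - alpha) * out[-1])
    return out


def eat_converged(
    entropies: Sequence[float],
    ema_alpha: float = 0.2,   # hypothetical default
    window: int = 8,          # hypothetical: number of recent EMA values inspected
    threshold: float = 1e-3,  # hypothetical default
) -> bool:
    """True when the variance of the recent entropy EMA falls below `threshold`."""
    if window < 2:
        raise ValueError("window must be at least 2")
    smoothed = ema_series(entropies, ema_alpha)
    if len(smoothed) < window:
        return False  # not enough reasoning tokens to judge yet
    return pvariance(smoothed[-window:]) < threshold


print(eat_converged([0.5] * 10))                                           # flat entropy
print(eat_converged([3.0, 2.5, 2.0, 1.5, 1.0, 0.5, 3.0, 0.2], window=4))   # still moving
```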

RPDI: Reasoning Phase Divergence Index. A signal based on the ratio of local transition-token frequency to global transition-token frequency. A high ratio indicates the model is cycling through redundant reasoning steps ("overthinking"). Defined in arXiv:2603.14251.
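
A rough sketch of the local-to-global ratio is shown below. The transition-token set, the window size, and the function name are illustrative assumptions; the paper's exact definition and Meridian's tokenisation may differ.

```python
from typing import FrozenSet, Sequence

# Hypothetical set of "transition" tokens that mark a change of reasoning direction.
TRANSITION_TOKENS: FrozenSet[str] = frozenset({"wait", "alternatively", "hmm", "however"})


def rpdi(
    tokens: Sequence[str],
    window: int = 64,  # hypothetical: size of the "local" suffix
    transitions: FrozenSet[str] = TRANSITION_TOKENS,
) -> float:
    """Ratio of transition-token frequency in the last `window` tokens to the
    frequency over the whole reasoning chain. Returns 0.0 when undefined."""
    if window <= 0:
        raise ValueError("window must be positive")
    if not tokens:
        return 0.0
    global_count = sum(tok.lower() in transitions for tok in tokens)
    if global_count == 0:
        return 0.0  # no transitions anywhere, so there is nothing to compare
    recent = tokens[-window:]
    local_count = sum(tok.lower() in transitions for tok in recent)
    local_freq = local_count / len(recent)
    global_freq = global_count / len(tokens)
    return local_freq / global_freq


# 80 ordinary tokens followed by a burst of "wait" in the final 20.
chain = ["x"] * 80 + ["wait", "x"] * 10
print(rpdi(chain, window=20))
```

A ratio near 1 means transitions are spread evenly; the burst at the end of this toy chain pushes the ratio well above 1.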

Entropy (Shannon) — Measure of uncertainty in the next-token distribution. Computed from the logit vector after softmax as -Σ p_i · log(p_i), in nats. High entropy = uncertain prediction; low entropy = confident prediction.
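
For concreteness, the formula above can be evaluated from a raw logit vector like this. It is a minimal pure-Python sketch; the function name is hypothetical, and a real serving path would compute this on the accelerator rather than in a Python loop.

```python
import math
from typing import Sequence


def shannon_entropy_nats(logits: Sequence[float]) -> float:
    """Entropy of softmax(logits) in nats: -sum(p_i * log(p_i))."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    entropy = 0.0
    for e in exps:
        p = e / z
        if p > 0.0:  # treat 0 * log(0) as 0
            entropy -= p * math.log(p)
    return entropy


print(shannon_entropy_nats([0.0, 0.0, 0.0, 0.0]))    # uniform over 4 tokens: log(4)
print(shannon_entropy_nats([100.0, 0.0, 0.0, 0.0]))  # one dominant token: close to 0
```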

EMA: Exponential Moving Average. A smoothed average where recent values are weighted more heavily. Controlled by ema_alpha (smaller = longer memory).
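
The effect of ema_alpha is easiest to see on a step input. The sketch below uses the common recurrence s_t = alpha * x_t + (1 - alpha) * s_(t-1), seeded with the first value; the seeding choice and the function name are assumptions, not something this glossary specifies.

```python
from typing import List, Sequence


def ema(values: Sequence[float], ema_alpha: float) -> List[float]:
    """Exponential moving average, seeded with the first value."""
    if not 0.0 < ema_alpha <= 1.0:
        raise ValueError("ema_alpha must be in (0, 1]")
    out: List[float] = []
    for v in values:
        out.append(v if not out else ema_alpha * v + (1.0 - ema_alpha) * out[-1])
    return out


step = [0.0, 1.0, 1.0, 1.0]
print(ema(step, 0.5))  # follows the step quickly (short memory)
print(ema(step, 0.1))  # lags behind the step (long memory)
```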

KV: Key-Value cache. The GPU memory store holding the attention keys and values for each token in active requests. KV memory is the primary capacity constraint in serving systems.

KV block — The unit of KV cache allocation. A fixed-size chunk (default 16 KiB) covering a fixed number of tokens (default 16). Blocks are allocated at the request level and freed on eviction or request completion.
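
As a worked example of block-granular allocation, the sketch below computes how many blocks a request needs with the defaults quoted above. The constant and function names are hypothetical.

```python
BLOCK_TOKENS = 16         # tokens covered by one block (default from this glossary)
BLOCK_BYTES = 16 * 1024   # block size in bytes, 16 KiB (default from this glossary)


def blocks_needed(num_tokens: int, block_tokens: int = BLOCK_TOKENS) -> int:
    """Blocks required to hold `num_tokens`, rounding up to whole blocks."""
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    return -(-num_tokens // block_tokens)  # ceiling division


def kv_bytes(num_tokens: int) -> int:
    """KV memory reserved for a request, including the partly filled last block."""
    return blocks_needed(num_tokens) * BLOCK_BYTES


tokens = 100
print(blocks_needed(tokens))                          # 7 blocks
print(kv_bytes(tokens))                               # 114688 bytes
print(blocks_needed(tokens) * BLOCK_TOKENS - tokens)  # 12 unused token slots
```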

ThinkComplete — Block tier for KV blocks that belonged to a request's reasoning phase, once </think> has been emitted. Lowest eviction priority (least protected): these blocks are freed first under memory pressure.

ThinkActive — Block tier for KV blocks belonging to a request currently in the reasoning phase.

OutputCritical — Block tier for KV blocks belonging to a request in the output phase. Highest eviction priority (most protected): these blocks are evicted last, because evicting them causes user-visible stream disruption. Any eviction at this tier fires meridian.output_critical_eviction.
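
The three tiers imply an eviction order, which can be sketched as below. This is not Meridian's evictor: the BlockTier enum, the pick_victims function, and the placement of ThinkActive between the other two tiers are assumptions made for illustration. Only the event name comes from this glossary.

```python
from enum import IntEnum
from typing import List, Tuple


class BlockTier(IntEnum):
    """Lower value = evicted earlier. The middle position of THINK_ACTIVE is assumed."""
    THINK_COMPLETE = 0
    THINK_ACTIVE = 1
    OUTPUT_CRITICAL = 2


def pick_victims(
    blocks: List[Tuple[int, BlockTier]], needed: int
) -> Tuple[List[int], List[str]]:
    """Choose `needed` block ids to evict, least protected tier first.

    Returns the victim ids plus the events that would fire."""
    ordered = sorted(blocks, key=lambda b: b[1])  # stable: keeps order within a tier
    victims = ordered[:needed]
    events = [
        "meridian.output_critical_eviction"
        for _, tier in victims
        if tier == BlockTier.OUTPUT_CRITICAL
    ]
    return [block_id for block_id, _ in victims], events


pool = [
    (1, BlockTier.OUTPUT_CRITICAL),
    (2, BlockTier.THINK_COMPLETE),
    (3, BlockTier.THINK_ACTIVE),
    (4, BlockTier.THINK_COMPLETE),
]
print(pick_victims(pool, 2))  # only ThinkComplete blocks are touched
print(pick_victims(pool, 4))  # pressure reaches OutputCritical and an event fires
```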

Disagg / disaggregated serving — Prefill-decode disaggregation: the model prefill (prompt processing) and decode (token generation) steps are executed on separate hardware. Disagg reduces head-of-line blocking by separating the two workloads, which have very different GPU utilisation profiles.

NIXL: NVIDIA Inference Xfer Library. NVIDIA's transfer library for moving KV blocks between prefill and decode nodes in a disaggregated serving topology.

Mooncake — An open-source disaggregated serving framework with a KV-transfer protocol. Meridian's disagg surface is documented as Mooncake-compatible in ADR-0006.

vLLM — An open-source LLM serving framework. Meridian's primary integration target. Meridian wraps vLLM's scheduler without forking the codebase.

DashMap — A concurrent hash map crate used for the PhaseRouter's per-request state. Provides O(1) read and write with sharded locking. See ADR-0003.

EOS: End of Sequence. The special token that signals request completion. vLLM emits an EOS event that the Meridian plugin uses to trigger request teardown and router state reaping.

Conventional Commits — A commit message standard used throughout this repository. Format: <type>(<scope>): <summary>. See conventionalcommits.org.
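
A commit header in this format can be checked with a small parser such as the one below. The regular expression is a simplified sketch, not the repository's actual lint rule: it treats the scope as optional and accepts the "!" breaking-change marker, both of which the Conventional Commits specification allows, and the function name is hypothetical.

```python
import re
from typing import Dict, Optional, Union

# <type>(<scope>)!: <summary>, with the scope and "!" optional.
HEADER_RE = re.compile(
    r"^(?P<type>[a-z]+)"
    r"(?:\((?P<scope>[^()\s]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<summary>\S.*)$"
)


def parse_commit_header(header: str) -> Optional[Dict[str, Union[str, bool, None]]]:
    """Split a commit header into its parts, or return None if it does not match."""
    m = HEADER_RE.match(header)
    if m is None:
        return None
    return {
        "type": m.group("type"),
        "scope": m.group("scope"),  # None when no scope is given
        "breaking": m.group("breaking") is not None,
        "summary": m.group("summary"),
    }


print(parse_commit_header("feat(router): add phase tracking"))
print(parse_commit_header("Fixed stuff"))  # None: no type prefix
```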

DCO: Developer Certificate of Origin. A sign-off mechanism (git commit -s) that certifies the contributor has the right to submit the code under the project's license. Required for all contributions; see CONTRIBUTING.md.

SLSA: Supply-chain Levels for Software Artifacts. A framework for supply-chain security. Meridian attests Level 2 provenance on every tagged release via slsa-github-generator. See ADR-0007.

SBOM: Software Bill of Materials. A machine-readable inventory of software components and their licenses. Meridian generates a CycloneDX SBOM for each release, attached as a GitHub release asset.