Introduction
Meridian is an inference-time compute scheduler for reasoning-model serving. It treats think-decode and output-decode as separate scheduling domains with separate SLOs, separate KV eviction priorities, and real-time entropy-driven budget control.
Why this exists
Reasoning models (DeepSeek-R1, Qwen3, Qwen2.5, o3) emit two structurally different token sequences within a single request:
[prompt] → <think> ... N reasoning tokens ... </think> → [output tokens]
These two phases have opposite latency profiles:
| Phase | User-visible latency tolerance | Throughput importance | Correct SLO target |
|---|---|---|---|
| Think-decode | High — user waits regardless | Critical (cost driver) | TPOT-relaxed |
| Output-decode | Zero — streaming experience | Secondary | TTOT-strict |
Standard continuous-batching schedulers (including vLLM's default) process all decode tokens — thinking and output — from the same priority queue with the same TPOT target. Meridian is the scheduling layer that knows the difference.
What Meridian does
- Dual-queue scheduling. Output-phase requests have absolute priority. Think-phase requests fill remaining capacity with a larger effective batch token budget.
- Phase-aware KV block manager. Three-tier eviction: ThinkComplete < ThinkActive < OutputCritical. Blocks from a completed reasoning chain are demoted the moment </think> is emitted.
- Entropy-driven budget forcing. EAT (arXiv:2509.26522) and RPDI (arXiv:2603.14251) signals inject </think> only when the model itself is signalling convergence or overthinking, not on a static token counter.
- Drop-in vLLM plugin. No fork required. Wraps the existing scheduler via the plugin interface; exposes Prometheus + OpenTelemetry telemetry.
- Disagg KV transfer. offload_block / ingest_block hooks on the block manager support prefill-decode disaggregation fabrics (NIXL, Mooncake-compatible). Documented in ADR-0006.
Scope and assumptions
- In scope: latency-differentiated scheduling for reasoning models served via vLLM on a single node or a disaggregated prefill-decode topology.
- Out of scope: model training, quantisation, speculative decoding, or serving systems other than vLLM. See Non-goals.
- Assumed: the serving stack emits per-request token IDs to a hook point that Meridian can intercept (the vLLM plugin interface satisfies this).
- Assumed: think/output phase boundaries are detectable from the decoded token stream via model-specific boundary token IDs (configurable per model in meridian.toml).
Threats to validity
- Synthetic benchmark results are directional, not absolute. The synthetic-replay harness simulates latency with a calibrated decoder; it does not model multi-tenant memory pressure or real CUDA kernel variance. Results should be reproduced on target hardware before drawing conclusions.
- Budget forcing can affect accuracy. Injecting </think> early may shorten correct reasoning chains. Meridian fires forcing only on entropy convergence signals, but the threshold is a tunable heuristic, not a guarantee. Accuracy measurement is the operator's responsibility.
- Phase detection depends on token IDs. If a model tokenises <think> / </think> differently than the configured boundary IDs, Meridian treats the entire request as single-phase. The models/*.toml files in the repo carry vetted IDs for supported models.
How to read this book
- Architecture: component map, per-component contracts, failure modes, and observability hooks.
- ADRs: every architectural decision with the rejected alternatives. Read these alongside the code to understand the why.
- Configuration: every meridian.toml field with type, default, valid range, and tuning guidance.
- API reference: Rust and Python surfaces with lifecycle and concurrency notes.
- Operations: metrics catalogue, alerting recommendations, troubleshooting runbooks, and benchmark methodology.
- Non-goals: what Meridian explicitly does not do.
- Glossary: definitions for TTFT, TTOT, ITL, EAT, RPDI, KV, disagg, NIXL, and other terms used throughout.
Architecture
Incoming requests
│
▼
┌─────────────────────────────────────────────────────────┐
│ Meridian Daemon │
│ │
│ ┌──────────────┐ ┌────────────────────────────────┐ │
│ │ Prefill │───▶│ Phase Router │ │
│ │ Executor │ │ (token stream state machine) │ │
│ └──────────────┘ └───────────┬─────────────────────┘ │
│ │ │
│ ┌─────────────┴──────────────┐ │
│ │ │ │
│ ┌───────────▼──────────┐ ┌─────────────▼───────┐ │
│ │ Think-Decode │ │ Output-Decode │ │
│ │ Scheduler │ │ Scheduler │ │
│ │ │ │ │ │
│ │ TPOT: relaxed │ │ TTOT: strict SLO │ │
│ │ Batch: 2.5× larger │ │ Batch: standard │ │
│ │ Entropy probe live │ │ Stream priority │ │
│ │ Budget force ready │ │ │ │
│ └──────────┬───────────┘ └────────┬─────────────┘ │
│ │ │ │
│ └──────────┬────────────┘ │
│ │ │
│ ┌──────────────▼─────────────┐ │
│ │ Phase-Aware KV Block Mgr │ │
│ │ │ │
│ │ Tier 0: ThinkComplete │ │
│ │ Tier 1: ThinkActive │ │
│ │ Tier 2: OutputCritical │ │
│ └─────────────────────────────┘ │
└────────────────────────────┬────────────────────────────────┘
│
vLLM worker
(decode kernel, KV store)
Phase state machine
The Phase Router advances each request through this machine on every decoded
token. ForceBudget is emitted as a side effect (the request stays in
ThinkDecode until </think> is observed or injected).
stateDiagram-v2
[*] --> Prefill
Prefill --> ThinkDecode: think_start id / EnterThink
ThinkDecode --> ThinkDecode: token / update EAT + RPDI
ThinkDecode --> ThinkDecode: converged or overthinking / ForceBudget
ThinkDecode --> OutputDecode: think_end id / ExitThink
OutputDecode --> Complete: eos id / Complete
Complete --> [*]
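The transitions in the diagram can be sketched in a few lines of Python. This is an illustrative model only, not the Rust PhaseRouter: the class name PhaseTracker, its fields, and the token IDs 1, 2, and 3 are hypothetical, and the entropy-driven ForceBudget self-loop is left out because it depends on signals that this sketch does not compute.

```python
from enum import Enum, auto


class Phase(Enum):
    PREFILL = auto()
    THINK_DECODE = auto()
    OUTPUT_DECODE = auto()
    COMPLETE = auto()


class PhaseTracker:
    """Hypothetical per-request tracker; not Meridian's PhaseRouter."""

    def __init__(self, think_start_ids, think_end_ids, eos_ids):
        self.think_start_ids = frozenset(think_start_ids)
        self.think_end_ids = frozenset(think_end_ids)
        self.eos_ids = frozenset(eos_ids)
        self.phase = Phase.PREFILL
        self.think_tokens = 0

    def process_token(self, token_id):
        """Advance on one decoded token; return an event name or None."""
        if self.phase is Phase.PREFILL and token_id in self.think_start_ids:
            self.phase = Phase.THINK_DECODE
            return "EnterThink"
        if self.phase is Phase.THINK_DECODE:
            if token_id in self.think_end_ids:
                self.phase = Phase.OUTPUT_DECODE
                return "ExitThink"
            self.think_tokens += 1  # ordinary reasoning token
            return None
        if self.phase is Phase.OUTPUT_DECODE and token_id in self.eos_ids:
            self.phase = Phase.COMPLETE
            return "Complete"
        return None


# Made-up token IDs: 1 = think start, 2 = think end, 3 = end of sequence.
tracker = PhaseTracker(think_start_ids=[1], think_end_ids=[2], eos_ids=[3])
events = [tracker.process_token(t) for t in [1, 10, 11, 2, 20, 3]]
```

Each call does a constant number of set lookups, which is in the spirit of the O(1)-per-token constraint described for the Phase Router. A request that never emits a think-start ID stays in PREFILL in this sketch; the diagram does not show that path either.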
Disaggregated offload sequence
When a fabric is configured, ExitThink triggers a batched offload of the
request's think-complete blocks. Each offloaded block is framed, pushed to the
fabric, and its local slot is reclaimed (see ADR-0006).
sequenceDiagram
participant R as PhaseRouter
participant S as Scheduler / Plugin
participant B as BlockManager
participant F as Fabric (NIXL / Mooncake)
R->>S: ExitThink(req, tokens_used)
S->>B: demote_think_blocks(req)
S->>B: blocks_for_request(req)
B-->>S: [block_ids]
loop batched at offload_threshold_blocks
S->>B: offload_block(id)
B->>F: push(encode(tier, body))
F-->>B: handle
B->>B: free_block_by_id(id)
end
Note over S,F: meridian_disagg_blocks_offloaded_total += n
Components
Phase Router
Inputs: raw token IDs emitted per step, per request ID.
Outputs: PhaseEvent stream (EnterThink, ExitThink, ForceBudget,
BudgetForceReason).
Hot-path constraint: O(1) per token, zero heap allocation in the common
case. Backed by DashMap with sharded locking — see ADR-0003.
Failure mode: if a request is never reaped, its entry leaks in the map.
reap_stale_older_than(Duration) removes entries older than a wall-clock
threshold; the vLLM plugin calls this on every batch step.
Observability: meridian.phase_router.tracked_requests gauge.
Source: crates/meridian-core/src/phase_router.rs.
Dual-Queue Scheduler
Inputs: a pool of pending requests tagged by their current phase.
Outputs: two ordered lists — one output-phase batch (drains first), one
think-phase batch (fills remaining capacity).
Hot-path constraint: a single pass over both queues per schedule_batch
call. No per-token work.
Invariant: output-phase requests are never starved. The think queue only
receives tokens after the output queue is drained or SLO-budget-limited.
Failure mode: if think_batch_multiplier is set too high relative to
GPU capacity, output ITL variance increases. meridian.queue_depth{queue=think}
growing without accompanying budget_force_triggered activity is the signal.
Observability: meridian.schedule_batch.duration_ns, meridian.queue_depth.
See ADR-0001 for the design alternative this rejects.
Source: crates/meridian-core/src/scheduler.rs.
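A minimal sketch of the drain-then-fill policy, assuming each queue holds (request_id, tokens_needed) pairs and that the think batch shares one step-level budget equal to the output token budget times think_batch_multiplier. The function name schedule_batch_sketch and that budget-sharing rule are assumptions made for illustration; the actual accounting lives in scheduler.rs.

```python
def schedule_batch_sketch(output_queue, think_queue, output_token_budget,
                          think_batch_multiplier=2.5):
    """Hypothetical single-pass dual-queue batcher (not Meridian's code)."""
    output_batch, think_batch = [], []
    used = 0
    # Output-phase requests drain first, bounded by the output token budget.
    for req_id, tokens in output_queue:
        if used + tokens > output_token_budget:
            break
        output_batch.append(req_id)
        used += tokens
    # Think-phase requests fill what is left of the larger think budget.
    think_budget = int(output_token_budget * think_batch_multiplier)
    for req_id, tokens in think_queue:
        if used + tokens > think_budget:
            break
        think_batch.append(req_id)
        used += tokens
    return output_batch, think_batch


output_batch, think_batch = schedule_batch_sketch(
    output_queue=[("o1", 1), ("o2", 1)],
    think_queue=[("t1", 1), ("t2", 1), ("t3", 1), ("t4", 1)],
    output_token_budget=2,
)
```

With a budget of 2 and the default multiplier, both output requests are admitted and three of the four think requests fit in the remaining capacity. The think queue is only consulted after the output loop has finished, which is the never-starved invariant in miniature.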
Phase-Aware Block Manager
Inputs: allocate(request_id, tier) and evict_for(required_blocks) calls
from the vLLM KV allocator path.
Outputs: block IDs; eviction decisions ordered by tier.
Invariant: ThinkComplete blocks are always evicted before ThinkActive;
OutputCritical blocks are evicted last and only under sustained pressure.
Failure mode: OutputCritical eviction is a user-visible degradation event
(stream stutter). Every such event increments meridian.output_critical_eviction.
Alert on any increment in a 5-minute window.
Disagg surface: offload_block(block_id) and ingest_block(bytes, tier) are
available when a disagg fabric is configured — see ADR-0006.
Observability: meridian.output_critical_eviction counter.
Source: crates/meridian-core/src/block_manager.rs.
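The tier ordering can be illustrated with a short Python sketch. The names BlockTier and evict_for_sketch mirror the documented API but are hypothetical re-implementations; the sketch also ignores the "sustained pressure" condition on OutputCritical eviction and simply counts how many such blocks were chosen, which is the quantity the meridian.output_critical_eviction counter tracks.

```python
from enum import IntEnum


class BlockTier(IntEnum):
    THINK_COMPLETE = 0   # evicted first
    THINK_ACTIVE = 1
    OUTPUT_CRITICAL = 2  # evicted last


def evict_for_sketch(blocks, required_blocks):
    """Pick victims lowest tier first. `blocks` maps block_id -> BlockTier."""
    # sorted() is stable, so blocks within a tier keep insertion order.
    ordered = sorted(blocks, key=lambda block_id: blocks[block_id])
    victims = ordered[:required_blocks]
    output_critical_evictions = sum(
        1 for block_id in victims if blocks[block_id] == BlockTier.OUTPUT_CRITICAL
    )
    return victims, output_critical_evictions


blocks = {
    10: BlockTier.OUTPUT_CRITICAL,
    11: BlockTier.THINK_ACTIVE,
    12: BlockTier.THINK_COMPLETE,
    13: BlockTier.THINK_COMPLETE,
}
victims, stutters = evict_for_sketch(blocks, 3)
```

Asking for three blocks takes both ThinkComplete blocks and the ThinkActive block and leaves the OutputCritical block resident. Only a request for all four would report an OutputCritical eviction.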
Entropy Probe
Inputs: raw logit vector (fp32, bf16, or fp16) from a completed forward pass.
Outputs: EntropySignal — per-token entropy (nats), EAT value, EAT EMA,
EAT EMA variance, RPDI local/global ratio.
Hot-path constraint: designed to run on a dedicated secondary CUDA stream;
must not stall the generation stream. In Sprint 0 both paths use the NumPy
reference; python/meridian/_backends/cuda.py defines the CUDA backend
interface and delegates to CPU until Sprint 1 wires it to the Rust kernels in
crates/meridian-kernels/.
Invariant: CPU and CUDA backends must agree within atol=1e-5 on the same
logit vector. Enforced by crates/meridian-kernels/tests/kernel_correctness.rs.
Failure mode: if the kernel returns Unavailable, the system falls back to
count-only budget forcing (hard_cap on every termination). This is safe but
loses entropy-driven adaptivity.
Observability: signals surface through meridian.budget_force_reason.
Sources:
- crates/meridian-kernels/: CUDA kernels + C FFI.
- python/meridian/entropy_probe.py: Python facade + EMA state.
- python/meridian/_backends/: CPU and CUDA backends.
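For orientation, the per-token entropy in nats is the Shannon entropy of the softmax distribution over the logit vector. The sketch below is a dependency-free illustration of that one quantity; the function name token_entropy_nats is hypothetical, and the EAT and RPDI signals are defined in the cited papers and are not reproduced here.

```python
import math


def token_entropy_nats(logits):
    """Shannon entropy (nats) of softmax(logits), via a stable log-sum-exp."""
    peak = max(logits)
    shifted = [x - peak for x in logits]  # avoids overflow in exp()
    log_z = math.log(sum(math.exp(x) for x in shifted))
    # H = -sum(p * log p), with log p = shifted - log_z and p = exp(log p).
    return -sum(math.exp(x - log_z) * (x - log_z) for x in shifted)


uniform_entropy = token_entropy_nats([0.0, 0.0, 0.0, 0.0])
peaked_entropy = token_entropy_nats([100.0, 0.0, 0.0])
```

A uniform distribution over four tokens gives ln 4 nats, and a sharply peaked distribution gives a value near zero. Subtracting the maximum logit before exponentiating is what keeps the computation finite for large logits, which matters when comparing reduced-precision backends against a reference.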
vLLM Plugin
Inputs: vLLM Scheduler instance at attach time; schedule_batch calls
at runtime.
Outputs: reordered batch with output-phase requests drained first; injected
</think> tokens on budget-force events; disagg offload calls on ExitThink.
Constraint: no vLLM fork required. The plugin wraps the existing scheduler
via attribute delegation; unknown attributes fall through to the wrapped
scheduler so vLLM internals work unmodified. MeridianSchedulerPlugin.attach()
is a classmethod that installs the plugin as engine.scheduler[0] — no
separate detach() is provided in v0.1.x.
Failure mode: if the plugin raises during schedule_batch, it re-raises
to the vLLM worker, which surfaces as a serving error for that batch. Errors
in the disagg offload path are caught and logged; they do not block generation.
Observability: all Phase Router, Dual-Queue Scheduler, and Block Manager metrics
(meridian.block_manager.*, meridian.queue_depth, meridian.schedule_batch.*),
plus meridian_disagg_blocks_offloaded_total and meridian_vocab_fallback_total,
which the plugin itself emits (Prometheus, and OTLP when [telemetry] is enabled).
Source: python/meridian/vllm_plugin.py.
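Attribute delegation is a standard Python pattern, and a small sketch shows why unknown attributes reach the wrapped object. DelegatingScheduler and FakeScheduler are hypothetical stand-ins written for this example; they are not MeridianSchedulerPlugin or vLLM's scheduler, and the dict-based request shape is invented.

```python
class DelegatingScheduler:
    """Hypothetical wrapper; illustrates delegation, not Meridian's plugin."""

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def schedule_batch(self, *args, **kwargs):
        batch = self._wrapped.schedule_batch(*args, **kwargs)
        # Reorder so output-phase requests come first (stable within a phase).
        return sorted(batch, key=lambda req: req["phase"] != "output")

    def __getattr__(self, name):
        # Only called when normal lookup fails, so anything the wrapper does
        # not define falls through to the wrapped scheduler.
        return getattr(self._wrapped, name)


class FakeScheduler:
    """Stand-in for the wrapped scheduler, used only for this example."""

    max_num_seqs = 256

    def schedule_batch(self):
        return [
            {"id": "a", "phase": "think"},
            {"id": "b", "phase": "output"},
            {"id": "c", "phase": "think"},
        ]


plugin = DelegatingScheduler(FakeScheduler())
order = [req["id"] for req in plugin.schedule_batch()]
```

The wrapper overrides only schedule_batch; reading plugin.max_num_seqs is answered by the wrapped object. That is the property that lets code written against the original scheduler keep working when a wrapper is installed in its place.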
Non-goals
Meridian has explicit scope boundaries. Documenting what it does not do is as important as documenting what it does — it prevents inflated expectations and makes the design auditable.
Not a serving engine
Meridian is a scheduling layer. It requires vLLM (or a compatible serving backend) underneath it. It does not implement:
- Model loading or weight management.
- Prefill execution (prompt processing).
- Decode kernel scheduling (CUDA streams, tensor parallelism).
- Tokenisation or detokenisation.
- HTTP/gRPC serving interfaces.
Not a throughput optimiser
Meridian optimises latency differentiation — protecting output-phase inter-token latency at the cost of slightly reduced think-phase throughput. It does not optimise:
- Raw tokens/sec/GPU throughput. For throughput, run vLLM's own scheduler.
- Batch filling efficiency in non-reasoning workloads.
- Speculative decoding, continuous batching variants, or chunked prefill.
Not an accuracy guarantee
Budget forcing injects </think> based on entropy signals, not correctness.
This can shorten reasoning chains on hard problems. Meridian does not:
- Measure or guarantee reasoning accuracy.
- Validate that forced-short chains produce correct answers.
- Implement any feedback loop between output quality and forcing thresholds.
Operators running accuracy-sensitive workloads should validate that their
eat_ema_variance_threshold and rpdi_threshold settings do not degrade
task accuracy on representative prompts.
Not a multi-model router
Meridian schedules within a single vllm.AsyncLLMEngine instance. It does not:
- Route requests across multiple models.
- Implement A/B model experiments.
- Manage model replicas or horizontal scaling.
Not a vLLM fork
Meridian is a plugin, not a fork. It does not modify vLLM internals and does not ship a patched vLLM binary. Compatibility with a specific vLLM version is the operator's responsibility. See the Compatibility Matrix.
Not production-certified
v0.1.0 is an early release. It has not been validated under production traffic at scale. Known gaps:
- Real-vLLM end-to-end results exist only for Qwen/Qwen2.5-0.5B in the repo's GPU CI workflow.
- The disagg fabric integration uses a synthetic in-process mock when libnixl is not available; real NIXL interop has not been exercised in CI.
- No back-pressure mechanism exists if the disagg fabric becomes unavailable.
Compatibility Matrix
Runtime requirements
| Component | Minimum | Tested |
|---|---|---|
| Linux | Ubuntu 22.04 | Ubuntu 24.04 |
| WSL2 | WSL2 on Windows 10/11 | Windows 10 22H2 |
| Rust toolchain | 1.85.0 | 1.85.0 (pinned) |
| Python | 3.11 | 3.11 |
| vLLM | 0.9.0 | 0.21.0 (resolved in uv.lock) |
| NVIDIA driver | 555.x | 555.x |
| CUDA toolkit | 12.6 | 12.6 |
| CUDA Compute Capability | 8.0 (A100) | 8.0+ |
Build requirements
| Tool | Version |
|---|---|
| cargo + rustup | Rust 1.85.0 |
| maturin | latest (install via pip install maturin) |
| uv | 0.4+ |
| nvcc | 12.6 (only for --features cuda) |
| mdbook | latest (only for docs) |
Model compatibility
Models that have been verified to work with Meridian's phase detection:
| Model family | Boundary detection | Config |
|---|---|---|
| DeepSeek-R1 | Token IDs [128799, 128800] | models/deepseek_r1.toml |
| Qwen3 / Qwen2.5 | Token IDs [151648, 151649] | models/qwen3.toml |
| IBM Granite 3.2 | Prose markers (no distinct token IDs) | models/granite_3_2.toml |
Models that are not verified to work:
- Models with non-standard <think> tokenisation not listed above: configure think_start_token_ids / think_end_token_ids manually and validate with a sample prompt before production use.
- Models served through streaming APIs (e.g. Claude via Anthropic API): Meridian requires direct access to the logit vector, which API-served models do not expose.
Feature flags
| Feature flag | Requires | Status |
|---|---|---|
| (default — no flags) | Linux, Rust | Fully CI-tested |
| --features prometheus | prometheus crate | CI-tested |
| --features unstable | Rust nightly-gated APIs | CI-tested |
| --features nixl | libnixl.so on deploy host | Compiles; integration tested with synthetic mock |
| --features cuda | nvcc, CUDA 12.6, GPU at runtime | Build-tested on GPU CI runner |
CI coverage
The GPU jobs are gated to prevent arbitrary code execution on the self-hosted runner from fork PRs. See GPU CI runner setup.
Known incompatibilities
- vLLM below 0.9.0: the dependency constraint is vllm>=0.9.0. Earlier versions are not supported and will be rejected at install time.
- Windows (native): the Rust workspace builds on Windows (tested in development), but the Python extension and benchmarks require Linux or WSL2 for the CUDA and maturin paths.
- macOS: not supported. CUDA is not available on macOS.
Deployment Model
Single-node deployment
The primary and most-tested deployment topology. One AsyncLLMEngine instance,
one Meridian plugin, all on the same GPU node.
┌──────────────────────────────────────────────┐
│ GPU node │
│ │
│ vLLM AsyncLLMEngine │
│ └── MeridianSchedulerPlugin (attached) │
│ ├── PhaseRouter │
│ ├── MeridianScheduler │
│ └── PhaseAwareBlockManager │
└──────────────────────────────────────────────┘
Prerequisites: Linux (or WSL2), NVIDIA driver 555+, CUDA 12.6, vLLM ≥ 0.9.0 (pip install "meridian[vllm]" resolves to 0.21.0 via uv.lock).
Configuration: standard meridian.toml with [disagg] enabled = false.
Disaggregated prefill-decode
Experimental. Requires NIXL-capable infrastructure. The block manager's
offload_block / ingest_block hooks transfer ThinkComplete KV blocks to
a remote decode node after </think> is emitted.
┌─────────────────┐ NIXL fabric ┌─────────────────┐
│ Prefill node │ ──── KV block transfer ────▶ │ Decode node │
│ │ │ │
│ vLLM prefill │ │ vLLM decode │
│ Meridian plug │ │ Meridian plug │
└─────────────────┘ └─────────────────┘
Status: the disagg wire protocol and block manager hooks are implemented
and verified with a synthetic in-process NIXL mock. Real NIXL interop requires
cargo build --features nixl and libnixl.so on the deploy host.
Configuration:
[disagg]
enabled = true
fabric = "nixl"
offload_threshold_blocks = 4
See ADR-0006 for the protocol specification.
Installation
From source (recommended)
git clone https://github.com/angelnicolasc/meridian.git
cd meridian
# Build and install the Rust core + Python bindings.
uv sync --project python
maturin develop --release -m crates/meridian-python/Cargo.toml
# Optional: build with CUDA kernel support.
# Requires nvcc + CUDA 12.6 toolkit.
maturin develop --release \
-m crates/meridian-python/Cargo.toml \
--cargo-extra-args="--features cuda"
Devcontainer
The repo includes a devcontainer configuration with the full toolchain pre-installed:
# Open in VS Code with the Dev Containers extension, or:
./scripts/dev-up.sh
Configuration loading
The plugin looks for meridian.toml in the current working directory, then
~/.config/meridian/meridian.toml. Override with:
from meridian import load_config
from meridian.vllm_plugin import MeridianSchedulerPlugin
cfg = load_config("/path/to/meridian.toml")
# `engine` is an already-constructed vLLM engine; attach() installs the plugin on it.
plugin = MeridianSchedulerPlugin.attach(engine, cfg, model_key="qwen3")
Known limits
| Dimension | Limit | Notes |
|---|---|---|
| Models per instance | 1 | One engine, one config |
| Concurrent requests | Limited by GPU VRAM / block budget | Set capacity_bytes |
| vLLM version | ≥ 0.9.0 (pinned: 0.21.0 in uv.lock) | Earlier versions not supported |
| Disagg fabric | NIXL (production) or synthetic mock | Real NIXL requires libnixl |
Rust API
The Rust API is generated by rustdoc. Build locally with:
cargo doc --workspace --no-deps --open
Top-level surface
| Item | Kind | Purpose |
|---|---|---|
| PhaseRouter | struct | Per-request token-stream state machine |
| MeridianScheduler | struct | Dual-queue batch scheduler |
| BlockManager | trait | Three-tier KV block manager contract |
| PhaseAwareBlockManager | struct | Default BlockManager impl |
| types | module | ThinkPhase, PhaseEvent, BlockTier, EntropySignal, BlockLocation |
| MeridianConfig | struct | Deserialised TOML config |
Object lifecycle
PhaseRouter
PhaseRouter is Send + Sync. Create once, share across threads via Arc.
Call process_token(req_id, token_id) from any thread; it returns an
Option<PhaseEvent> and is O(1) per call.
Call reap_stale_older_than(duration) periodically to free entries for
completed requests. The vLLM plugin does this on every batch step.
use meridian_core::PhaseRouter;
use std::sync::Arc;
use std::time::Duration;
let router = Arc::new(PhaseRouter::new());
// Per-token: called from the decode loop.
if let Some(event) = router.process_token(req_id, token_id) {
    // Handle PhaseEvent::ExitThink, ForceBudget, etc.
}
// Periodic cleanup: call from the batch step hook.
let reaped = router.reap_stale_older_than(Duration::from_secs(60));
MeridianScheduler
MeridianScheduler is Send + Sync. The schedule_batch method takes a
shared reference and returns owned Vec<RequestId> for each queue; it does not
hold a lock across the call boundary.
BlockManager trait
The three required methods:
fn allocate(&mut self, request_id: RequestId, tier: BlockTier) -> Result<BlockId>;
fn evict_for(&mut self, required_blocks: usize) -> Vec<BlockId>;
fn block_location(&self, block_id: BlockId) -> BlockLocation;
Optional disagg methods (offload_block, ingest_block) have default
implementations that return Err(BlockManagerError::FabricNotConfigured).
Override them when wrapping with a NIXL-backed manager.
Error model
All errors implement std::error::Error and are defined in
meridian_core::error::Error. Configuration errors carry the dotted field
path and a human-readable message:
ConfigValidation { field: "entropy.ema_alpha", reason: "must be in (0, 1]" }
Kernel errors are defined in meridian_kernels::KernelError:
- Unavailable: built without the cuda feature or runtime missing.
- Launch(i32): CUDA returned a non-zero error code.
- NullPointer(&'static str): caller passed a null pointer.
Thread safety
| Type | Thread safety |
|---|---|
| PhaseRouter | Send + Sync via DashMap interior mutability |
| MeridianScheduler | Send + Sync |
| PhaseAwareBlockManager | Send; requires &mut self for mutation, so wrap in Mutex for shared access |
| NixlContext (feature nixl) | Send, not Sync; one context per thread |
Stability
Pre-1.0. Public API breakage is recorded under BREAKING CHANGE: in
CHANGELOG.md.
The C ABI (meridian_entropy_launch, meridian_eat_launch) is treated as
stable from v0.1.0 — forks consuming the FFI directly will not see unexpected
breakage on patch updates.
Python API
Installation
# From source — requires a Linux host with Rust 1.85+ and maturin.
uv sync --project python
maturin develop --release -m crates/meridian-python/Cargo.toml
The package exposes no CUDA dependency at import time. CUDA is lazily loaded
when backend="cuda" is requested on EntropyProbe.
Top-level surface
| Symbol | Kind | Purpose |
|---|---|---|
| meridian.EntropyProbe | class | Stateful per-request entropy probe |
| meridian.EntropySignal | dataclass | Per-token signal record |
| meridian.MeridianConfig | Pydantic model | Runtime configuration |
| meridian.load_config(path) | function | Convenience TOML loader |
| meridian.vllm_plugin.MeridianSchedulerPlugin | class | vLLM scheduler wrapper |
Object lifecycle
EntropyProbe
One instance per request. Not thread-safe — do not share an instance across concurrent requests. Create, use through the token sequence, then discard.
from meridian import EntropyProbe, load_config
import numpy as np
cfg = load_config("meridian.toml")
probe = EntropyProbe(
think_end_token_ids=cfg.model["qwen3"].think_end_token_ids,
backend="cpu", # "cpu" (NumPy) or "cuda" (CUDA kernel)
ema_alpha=cfg.entropy.ema_alpha,
)
# Per-token call — call once per decoded token.
logits = np.random.randn(151_936).astype(np.float32)
sig = probe.compute(req_id=42, logits=logits)
print(sig.token_entropy, sig.eat, sig.eat_ema_variance)
# Batch path — more efficient for large batch sizes.
batch_logits = np.random.randn(8, 151_936).astype(np.float32)
signals = probe.compute_batch(req_ids=list(range(8)), logits_batch=batch_logits)
MeridianSchedulerPlugin
Wraps an existing vllm.core.scheduler.Scheduler at runtime. Attach it once per
engine; v0.1.x provides no detach(), so the plugin stays installed for the
engine's lifetime. Holds no GPU resources; all GPU work goes through the
underlying vLLM scheduler.
from vllm import AsyncLLMEngine
from vllm.engine.arg_utils import AsyncEngineArgs
from meridian import load_config
from meridian.vllm_plugin import MeridianSchedulerPlugin
engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="Qwen/Qwen2.5-0.5B"))
cfg = load_config("meridian.toml")
# attach() is a classmethod — constructs the plugin, installs it as
# engine.scheduler[0], and returns the handle for metric access.
plugin = MeridianSchedulerPlugin.attach(engine, cfg, model_key="qwen3")
# ... serve requests ...
# v0.1.x has no detach(); the plugin runs for the engine's lifetime.
Error model
All configuration errors are raised at construction time as ValueError with
a dotted field path. For example:
ValueError: entropy.ema_alpha must be in (0, 1]; got 1.5
Runtime errors from the CUDA kernel surface as meridian.KernelError (a
subclass of RuntimeError). When the kernel returns Unavailable (built
without the cuda feature, or missing runtime library), the probe falls back
to the CPU backend automatically.
Concurrency notes
- EntropyProbe instances are not thread-safe. One instance per request.
- MeridianSchedulerPlugin is designed to be used from vLLM's single async event loop. Do not call schedule_batch concurrently.
- The Rust PhaseRouter and BlockManager bindings are thread-safe; they use interior mutability backed by DashMap.
Stability guarantees
Pre-1.0. Signatures may change on minor bumps. Breaking changes are listed
under BREAKING CHANGE: in CHANGELOG.md and announced before merging.
Backends
EntropyProbe accepts backend="cpu" (default, pure NumPy) or backend="cuda".
Both backends implement the same mathematical operations and agree within
atol=1e-5. In Sprint 0 the cuda backend delegates to the CPU implementation;
Sprint 1 will wire it to the Rust CUDA kernels in crates/meridian-kernels/ so
the logit reduction runs on a dedicated secondary CUDA stream.
Configuration
Meridian is configured through a single TOML file consumed by both the Rust
core (meridian-core::config::MeridianConfig) and the Python facade
(meridian.config.MeridianConfig). Both parsers agree on field names; a
round-trip test in crates/meridian-core/tests/config_parse.rs exercises
every field.
The fully-annotated example lives at
meridian.toml.example.
Tune by overlay: keep the example as a reference and write a smaller
meridian.toml containing only fields that differ from their defaults.
Validation
Both parsers reject unknown fields and out-of-range values. Cross-field
violations (e.g. min_think_tokens >= max_think_tokens) are also caught.
Errors carry the dotted field path and a human-readable message.
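As a rough illustration of range and cross-field checks that report a dotted field path, here is a sketch over a plain nested dict. The function validate_config_sketch is hypothetical and covers only two of the documented rules; the first message follows the example format shown in the Python API error model, while the wording of the cross-field message is an assumption.

```python
def validate_config_sketch(cfg):
    """Hypothetical validator over a nested dict parsed from TOML."""
    alpha = cfg["entropy"]["ema_alpha"]
    if not (0.0 < alpha <= 1.0):
        raise ValueError(f"entropy.ema_alpha must be in (0, 1]; got {alpha}")
    sched = cfg["scheduler"]
    # Cross-field check: the two limits only make sense as a pair.
    if sched["min_think_tokens"] >= sched["max_think_tokens"]:
        raise ValueError(
            "scheduler.min_think_tokens must be < scheduler.max_think_tokens; "
            f"got {sched['min_think_tokens']} >= {sched['max_think_tokens']}"
        )
    return cfg


bad = {
    "entropy": {"ema_alpha": 1.5},
    "scheduler": {"min_think_tokens": 512, "max_think_tokens": 32768},
}
try:
    validate_config_sketch(bad)
    message = None
except ValueError as exc:
    message = str(exc)
```

Putting the field path at the start of the message lets an operator find the offending line in meridian.toml without reading a traceback.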
[scheduler]
Dual-queue scheduling policy. See ADR-0001.
think_tpot_budget_ms
| Property | Value |
|---|---|
| Type | f64 |
| Default | 80.0 |
| Unit | milliseconds per think-phase token |
| Valid range | > 0 |
TPOT budget for think-phase tokens. The user does not see inter-token latency during reasoning, so this can be set much higher than the output budget. Setting it 4× the output budget gives the batcher room to pack a larger effective batch during think. Raise if your GPU is underutilised during think; lower if think-phase requests monopolise capacity at the expense of output.
output_tpot_budget_ms
| Property | Value |
|---|---|
| Type | f64 |
| Default | 20.0 |
| Unit | milliseconds per output-phase token |
| Valid range | > 0 |
TPOT budget for output-phase tokens. This is the upper bound on user-visible inter-token latency while streaming. 20 ms per token corresponds to 50 tok/s, the top of a 30–50 tok/s display target. Lower values produce tighter streams but reduce think-phase throughput.
think_batch_multiplier
| Property | Value |
|---|---|
| Type | f64 |
| Default | 2.5 |
| Unit | ×output batch token budget |
| Valid range | >= 1.0 |
The think-phase batch can fill this multiple of the output-phase token budget.
2.5× is conservative — empirically stable across H100-class hardware with
MLA-aware allocation. Values above 3.5× risk output ITL variance spikes if
think requests fail to yield promptly. Monitor meridian.queue_depth{queue=think}.
max_think_tokens
| Property | Value |
|---|---|
| Type | u64 |
| Default | 32768 |
| Unit | tokens |
| Valid range | > min_think_tokens |
Hard cap on think tokens per request. Budget forcing fires unconditionally at this limit regardless of entropy signals. 32 768 matches DeepSeek-R1's documented maximum reasoning length and bounds the KV memory a single request can monopolise.
min_think_tokens
| Property | Value |
|---|---|
| Type | u64 |
| Default | 512 |
| Unit | tokens |
| Valid range | < max_think_tokens |
No budget forcing is allowed before this many think tokens. EAT/RPDI signals are noisy below 512 tokens; early forcing can prematurely terminate short-but-correct reasoning chains.
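The interaction between min_think_tokens, max_think_tokens, and the entropy signals can be summarised as a small decision function. This is a sketch under assumptions: the function name, the reason strings other than hard_cap, and the precedence of convergence over overthinking are illustrative and not taken from the Meridian source.

```python
def should_force_budget(think_tokens, converged, overthinking,
                        min_think_tokens=512, max_think_tokens=32768):
    """Return a reason string if </think> should be injected, else None."""
    if think_tokens >= max_think_tokens:
        return "hard_cap"      # unconditional, ignores entropy signals
    if think_tokens < min_think_tokens:
        return None            # signals are too noisy this early
    if converged:
        return "converged"
    if overthinking:
        return "overthinking"
    return None


early = should_force_budget(100, converged=True, overthinking=True)
mid_chain = should_force_budget(600, converged=True, overthinking=False)
at_cap = should_force_budget(32768, converged=False, overthinking=False)
```

Checking the hard cap first and the minimum second means the two limits bracket the region in which the entropy signals are allowed to decide.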
[entropy]
Entropy probe and convergence-detection thresholds.
enabled
| Property | Value |
|---|---|
| Type | bool |
| Default | true |
When false, the entropy probe is disabled and all budget forcing uses
hard_cap only (pure token-count limiting). Useful for A/B comparison
or when the CUDA kernel is not available.
ema_alpha
| Property | Value |
|---|---|
| Type | f64 |
| Default | 0.05 |
| Unit | dimensionless (EMA decay) |
| Valid range | (0.0, 1.0] |
EMA decay applied to all entropy signals. Smaller values give longer memory. α = 0.05 → ~95% mass within the last ~60 samples. Long enough to smooth single-token spikes; short enough to react within a reasoning chain.
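The "~60 samples" figure can be checked with a short calculation, assuming the conventional update ema = alpha * x + (1 - alpha) * ema. Under that form the sample k steps back carries weight alpha * (1 - alpha)^k. The helper name ema_window is hypothetical.

```python
import math


def ema_window(alpha, mass=0.95):
    """Smallest n such that the newest n samples carry `mass` of the weight."""
    if alpha >= 1.0:
        return 1  # alpha = 1 keeps only the latest sample
    # The newest n samples carry 1 - (1 - alpha)**n of the total weight.
    return math.ceil(math.log(1.0 - mass) / math.log(1.0 - alpha))


window_default = ema_window(0.05)
window_faster = ema_window(0.1)
```

For alpha = 0.05 the result is 59 samples, in line with the rounded figure above. Doubling alpha to 0.1 roughly halves the window to 29.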
rpdi_threshold
| Property | Value |
|---|---|
| Type | f64 |
| Default | 3.0 |
| Unit | ratio (local RPDI / global RPDI) |
| Valid range | > 1.0 |
Overthinking is declared when rpdi_local / rpdi_global > threshold. The
value 3.0 is the empirical threshold from arXiv:2603.14251. Raise to be more
permissive (longer chains); lower to be more aggressive.
eat_ema_variance_threshold
| Property | Value |
|---|---|
| Type | f64 |
| Default | 0.001 |
| Unit | nats² |
| Valid range | > 0.0 |
Convergence is declared when EAT EMA variance drops below this threshold. 0.001 is approximately the noise floor of EAT in steady state. Lower values defer forcing; higher values fire earlier.
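To make the threshold concrete, the sketch below tracks an exponentially weighted mean and variance of a scalar signal and shows the variance falling below 0.001 once the signal stops moving. It uses the standard incremental recurrence for exponentially weighted variance; whether Meridian uses this exact recurrence is an assumption, and the input values are synthetic stand-ins rather than real EAT measurements.

```python
class EmaVariance:
    """Exponentially weighted mean and variance of a scalar signal."""

    def __init__(self, alpha):
        self.alpha = alpha
        self.mean = None
        self.variance = 0.0

    def update(self, value):
        if self.mean is None:  # the first sample seeds the mean
            self.mean = value
            return self.variance
        delta = value - self.mean
        self.mean += self.alpha * delta
        # Standard incremental form of the exponentially weighted variance.
        self.variance = (1.0 - self.alpha) * (
            self.variance + self.alpha * delta * delta
        )
        return self.variance


tracker = EmaVariance(alpha=0.05)
for i in range(50):  # oscillating signal: still "exploring"
    noisy_variance = tracker.update(1.0 if i % 2 == 0 else 2.0)
for _ in range(300):  # flat signal: settled
    flat_variance = tracker.update(1.5)

threshold = 0.001  # documented default for eat_ema_variance_threshold
```

After the oscillating stretch the variance sits well above the threshold; after the flat stretch it has decayed below it. The variance also starts at zero before any history exists, which is one reason a warm-up gate such as min_think_tokens is needed alongside the threshold.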
transition_entropy_threshold
| Property | Value |
|---|---|
| Type | f64 |
| Default | 2.5 |
| Unit | nats |
| Valid range | > 0.0 |
A token counts as a "transition" for RPDI when its per-token entropy exceeds this value. 2.5 nats ≈ effective branching factor of 12 — a genuine decision point rather than low-entropy continuation.
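The branching-factor figure is the perplexity exp(H) of a distribution with entropy H nats. A two-function sketch with hypothetical helper names:

```python
import math


def effective_branching_factor(entropy_nats):
    """Perplexity of a distribution with the given entropy: exp(H)."""
    return math.exp(entropy_nats)


def is_transition(entropy_nats, threshold=2.5):
    """A token counts as a transition when its entropy exceeds the threshold."""
    return entropy_nats > threshold


branching_at_default = effective_branching_factor(2.5)
```

exp(2.5) is about 12.18, so a token at the default threshold is as uncertain as a uniform choice among roughly twelve candidates.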
eat_probe_interval_tokens
| Property | Value |
|---|---|
| Type | u32 |
| Default | 32 |
| Unit | tokens |
| Valid range | >= 1 |
The EAT kernel runs every N think tokens. 1 = every token; higher values
trade signal latency for reduced kernel-launch overhead. 32 is the sweet spot
on H100-class hardware. Halving this on slower GPUs is safe.
[kv_memory]
Phase-aware KV block manager policy.
aggressive_think_eviction
| Property | Value |
|---|---|
| Type | bool |
| Default | false |
When true, ThinkComplete blocks are freed immediately on phase transition.
Leave false until cross-attention back-references are audited for your model;
some reasoning-parser pipelines re-attend over the think segment when generating
output and need those blocks resident.
think_phase_memory_fraction
| Property | Value |
|---|---|
| Type | f64 |
| Default | 0.40 |
| Unit | fraction of total KV budget |
| Valid range | (0.0, 1.0) |
Fraction of total KV budget reserved for think-phase blocks. 0.40 leaves 60%
for output-phase blocks, which accommodates think_batch_multiplier = 2.5
without crowding output. Raise for workloads with very long reasoning chains;
lower for workloads with long output sequences.
block_size_bytes
| Property | Value |
|---|---|
| Type | u64 |
| Default | 16384 (16 KiB) |
| Unit | bytes per KV block |
| Valid range | > 0 |
Must match the actual vLLM block layout for your model. The canonical vLLM layout is 16 KiB for bf16/fp16 KV at 16 tokens per block. MLA-aware models can run smaller blocks.
capacity_bytes
| Property | Value |
|---|---|
| Type | u64 or "auto" |
| Default | "auto" |
| Unit | bytes |
| Valid range | > 0 or "auto" |
Total KV memory budget. "auto" queries the device at startup and uses
85% of the total device memory reported by torch.cuda.mem_get_info(), which
returns a (free, total) tuple in bytes. Pin an integer for deterministic
or multi-tenant deployments where you want to reserve GPU memory for
other workloads.
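The arithmetic behind "auto" and the think-phase reservation can be sketched without touching the GPU. The function kv_block_budget is hypothetical; it takes the device total as a plain integer, applies the documented 85% factor, and splits the resulting whole blocks using think_phase_memory_fraction. The rounding choices are assumptions.

```python
def kv_block_budget(total_device_bytes, block_size_bytes=16384,
                    capacity_fraction=0.85, think_phase_memory_fraction=0.40):
    """Split a KV byte budget into whole think-phase and output-phase blocks."""
    capacity_bytes = round(total_device_bytes * capacity_fraction)
    total_blocks = capacity_bytes // block_size_bytes
    think_blocks = int(total_blocks * think_phase_memory_fraction)
    return {
        "capacity_bytes": capacity_bytes,
        "total_blocks": total_blocks,
        "think_blocks": think_blocks,
        "output_blocks": total_blocks - think_blocks,
    }


# Example input: an 80 GiB device.
budget = kv_block_budget(80 * 2**30)
```

For an 80 GiB device this yields a 68 GiB budget, or 4,456,448 blocks of 16 KiB, of which 1,782,579 are reserved for think-phase use under the default fraction.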
[disagg]
Disaggregated KV transfer. Disabled by default. See ADR-0006.
enabled
| Property | Value |
|---|---|
| Type | bool |
| Default | false |
Master switch. Leave false for single-node deployments.
fabric
| Property | Value |
|---|---|
| Type | "nixl", "mooncake", or "none" |
| Default | "none" |
Selects the disagg transport. nixl uses the NVIDIA NIXL library (requires
cargo build --features nixl and libnixl on the deploy host). mooncake
uses the Mooncake-compatible protocol adapter. none is only valid when
enabled = false.
offload_threshold_blocks
| Property | Value |
|---|---|
| Type | u32 |
| Default | 4 |
| Unit | KV blocks |
| Valid range | >= 1 |
Minimum ThinkComplete blocks to accumulate before flushing to the fabric.
Larger values amortise transfer overhead; smaller values reduce latency.
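The constraints on the [disagg] keys can be summarised in a small validation sketch. The type and method names below are assumptions made for illustration (Meridian's actual config types are not shown in this reference); only the rules themselves come from the tables above.

```rust
/// Illustrative mirror of the `fabric` key; not Meridian's real type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FabricKind {
    Nixl,
    Mooncake,
    None,
}

/// Illustrative mirror of the `[disagg]` table.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DisaggConfig {
    pub enabled: bool,
    pub fabric: FabricKind,
    pub offload_threshold_blocks: u32,
}

impl Default for DisaggConfig {
    fn default() -> Self {
        // Documented defaults: disabled, no fabric, flush after 4 blocks.
        Self { enabled: false, fabric: FabricKind::None, offload_threshold_blocks: 4 }
    }
}

impl DisaggConfig {
    /// Check the documented rules: threshold >= 1, and `none` only when disabled.
    pub fn validate(&self) -> Result<(), String> {
        if self.offload_threshold_blocks < 1 {
            return Err("offload_threshold_blocks must be >= 1".to_string());
        }
        if self.enabled && self.fabric == FabricKind::None {
            return Err("fabric = \"none\" is only valid when enabled = false".to_string());
        }
        Ok(())
    }

    /// True once enough ThinkComplete blocks have accumulated to flush.
    pub fn should_flush(&self, pending_think_complete_blocks: u32) -> bool {
        self.enabled && pending_think_complete_blocks >= self.offload_threshold_blocks
    }
}
```

The defaults validate cleanly and never flush; enabling disagg without choosing a fabric is the one combination the sketch rejects.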
[model.<name>]
Per-model token-boundary configuration. One [model.*] table per model
served. The phase router watches for boundary token IDs in the decoded stream.
See models/*.toml for
the vetted token IDs for supported models.
think_start_token_ids
| Property | Value |
|---|---|
| Type | [u32] |
Token IDs that mark the start of a reasoning chain. Model- and tokenizer-specific. If empty, the router never enters think-phase for this model.
think_end_token_ids
| Property | Value |
|---|---|
| Type | [u32] |
Token IDs that mark the end of a reasoning chain. When the decoded stream
contains any of these IDs, the router emits ExitThink.
reasoning_parser
| Property | Value |
|---|---|
| Type | string |
| Values | "deepseek_r1", "qwen3", "granite", "anthropic" |
Selects the reasoning-chain parser. Different models structure their think-output boundary differently; the parser handles model-specific normalization.
supports_think_disable
| Property | Value |
|---|---|
| Type | bool |
Whether the model supports a /no_think directive to suppress the reasoning
phase entirely. When true and the prompt contains the directive, the router
stays in output-phase for the entire request.
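To make the boundary-token behaviour concrete, here is a minimal per-request state machine under stated assumptions: the type names, the EnterThink event, and the choice of output-phase as the initial state are hypothetical (the text above names only ExitThink), and the token IDs used with it are placeholders rather than vetted IDs from models/*.toml.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Output,
    Think,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PhaseEvent {
    EnterThink, // assumed name; only ExitThink appears in the reference
    ExitThink,
}

/// Illustrative mirror of a `[model.<name>]` table's boundary keys.
pub struct ModelBoundaries {
    pub think_start_token_ids: Vec<u32>,
    pub think_end_token_ids: Vec<u32>,
}

pub struct PhaseTracker {
    phase: Phase,
    think_disabled: bool,
}

impl PhaseTracker {
    /// `think_disabled` models a request whose prompt carried /no_think
    /// on a model with supports_think_disable = true.
    pub fn new(think_disabled: bool) -> Self {
        Self { phase: Phase::Output, think_disabled }
    }

    pub fn phase(&self) -> Phase {
        self.phase
    }

    /// Feed one decoded token ID; returns an event only on a phase change.
    pub fn on_token(&mut self, token_id: u32, b: &ModelBoundaries) -> Option<PhaseEvent> {
        match self.phase {
            Phase::Output
                if !self.think_disabled && b.think_start_token_ids.contains(&token_id) =>
            {
                self.phase = Phase::Think;
                Some(PhaseEvent::EnterThink)
            }
            Phase::Think if b.think_end_token_ids.contains(&token_id) => {
                self.phase = Phase::Output;
                Some(PhaseEvent::ExitThink)
            }
            _ => None,
        }
    }
}
```

An empty think_start_token_ids list means the first match arm can never fire, which reproduces the documented "never enters think-phase" behaviour without a special case.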
Glossary
Terms used throughout the Meridian documentation and codebase.
TTFT — Time to First Token. Wall-clock time from the moment a request is submitted to the moment the first token is returned to the client. Dominated by prefill latency for long prompts.
TTOT — Time to Output Token. Wall-clock time from the emission of the
</think> boundary token to the first user-visible output token. The metric
Meridian is specifically designed to protect.
TPOT — Time Per Output Token (also ITL: Inter-Token Latency). Time between consecutive output tokens during streaming. The user-perceived "speed" of the stream.
ITL — Inter-Token Latency. See TPOT.
EAT — Entropy-Aware Termination. A budget-forcing signal based on the variance of the EMA of per-token Shannon entropy over the reasoning chain. When EAT EMA variance drops below a threshold, the model is inferred to have converged on an answer. Defined in arXiv:2509.26522.
RPDI — Reasoning Phase Divergence Index. A signal based on the ratio of local transition-token frequency to global transition-token frequency. A high ratio indicates the model is cycling through redundant reasoning steps ("overthinking"). Defined in arXiv:2603.14251.
Entropy (Shannon) — Measure of uncertainty in the next-token distribution.
Computed from the logit vector after softmax as -Σ p_i · log(p_i), in nats.
High entropy = uncertain prediction; low entropy = confident prediction.
EMA — Exponential Moving Average. A smoothed average where recent values
are weighted more heavily. Controlled by ema_alpha (smaller = longer memory).
KV — Key-Value cache. The GPU memory store holding the attention keys and values for each token in active requests. KV memory is the primary capacity constraint in serving systems.
KV block — The unit of KV cache allocation. A fixed-size chunk (default 16 KiB) covering a fixed number of tokens (default 16). Blocks are allocated at the request level and freed on eviction or request completion.
ThinkComplete — Block tier for KV blocks that belonged to a request's
reasoning phase after </think> has been emitted. Lowest retention priority:
these blocks are evicted first under memory pressure.
ThinkActive — Block tier for KV blocks belonging to a request currently in the reasoning phase.
OutputCritical — Block tier for KV blocks belonging to a request in the
output phase. Highest retention priority: these blocks are evicted last, because evicting them causes user-visible
stream disruption. Any eviction at this tier fires meridian.output_critical_eviction.
Disagg / disaggregated serving — Prefill-decode disaggregation: the model prefill (prompt processing) and decode (token generation) steps are executed on separate hardware. Disagg reduces head-of-line blocking by separating the two workloads, which have very different GPU utilisation profiles.
NIXL — NVIDIA Inference Xfer Library. NVIDIA's reference fabric for transferring KV blocks between prefill and decode nodes in a disaggregated serving topology.
Mooncake — An open-source disaggregated serving framework with a KV-transfer protocol. Meridian's disagg surface is documented as Mooncake-compatible in ADR-0006.
vLLM — An open-source LLM serving framework. Meridian's primary integration target. Meridian wraps vLLM's scheduler without forking the codebase.
DashMap — A concurrent hash map crate used for the PhaseRouter's
per-request state. Provides O(1) read and write with sharded locking.
See ADR-0003.
EOS — End of Sequence. The special token that signals request completion. vLLM emits an EOS event that the Meridian plugin uses to trigger request teardown and router state reaping.
Conventional Commits — A commit message standard used throughout this
repository. Format: <type>(<scope>): <summary>. See
conventionalcommits.org.
DCO — Developer Certificate of Origin. A sign-off mechanism (git commit -s)
that certifies the contributor has the right to submit the code under the
project's license. Required for all contributions — see CONTRIBUTING.md.
SLSA — Supply-chain Levels for Software Artifacts. A framework for
supply-chain security. Meridian attests Level 2 provenance on every tagged
release via slsa-github-generator. See ADR-0007.
SBOM — Software Bill of Materials. A machine-readable inventory of software components and their licenses. Meridian generates a CycloneDX SBOM for each release, attached as a GitHub release asset.
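The Entropy (Shannon) and EMA entries in this glossary translate into a few lines of arithmetic. The sketch below is only a reading of those two definitions; the function names are hypothetical, and it does not attempt to reproduce the EAT or RPDI signals, which are defined in the cited papers.

```rust
/// Shannon entropy, in nats, of the softmax distribution over `logits`.
pub fn shannon_entropy_nats(logits: &[f64]) -> f64 {
    if logits.is_empty() {
        return 0.0;
    }
    // Subtract the max logit before exponentiating for numerical stability.
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|&l| (l - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter()
        .map(|&e| e / sum)
        .filter(|&p| p > 0.0) // treat 0 * ln(0) as 0
        .map(|p| -p * p.ln())
        .sum()
}

/// One EMA step. `alpha` plays the role of ema_alpha: a smaller alpha
/// weights the previous value more heavily, i.e. longer memory.
pub fn ema_update(prev: f64, sample: f64, alpha: f64) -> f64 {
    alpha * sample + (1.0 - alpha) * prev
}
```

A uniform distribution over four tokens gives ln 4 ≈ 1.386 nats, the maximum for that vocabulary size, while a sharply peaked distribution gives a value close to zero.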
Architectural Decision Records
Meridian uses Michael Nygard's ADR format to capture the why behind significant choices. ADRs are immutable once "Accepted"; we supersede them rather than edit them.
Lifecycle: Proposed → Accepted → (later) Superseded by ADR-NNNN /
Deprecated.
Index
| ID | Status | Title |
|---|---|---|
| 0001 | Accepted | Dual-queue vs. priority weights |
| 0002 | Accepted | Workspace tri-crate layout |
| 0003 | Accepted | DashMap for per-request state |
| 0004 | Accepted | KV tier promotion policy |
| 0005 | Accepted | Benchmark methodology |
| 0006 | Accepted | Disagg KV transfer protocol |
| 0007 | Accepted | Release and versioning policy |
| 0008 | Accepted | Request preemption policy |
Writing a new ADR
- Copy template.md to NNNN-short-kebab-title.md.
- Open as Proposed; merge as Accepted after PR review.
- If a later ADR supersedes this one, mark this one Superseded by ADR-NNNN in a new commit; never edit the body of an accepted ADR.
ADR-0001: Dual-queue scheduling vs. priority weights on a single queue
- Status: Accepted
- Date: 2026-05-20
- Authors: angelnicolasc
- Reviewers: sole-maintainer decision record
Context
Meridian's central thesis is that think-decode and output-decode are two structurally different workloads inside a single reasoning-model request:
- Output tokens are user-visible streaming. They must hit a tight TTOT (time-to-output-token) target — ~20 ms is the perceptual threshold for fluid streaming at typical display rates.
- Think tokens are user-invisible reasoning. The user is already waiting for the answer; inter-token latency during reasoning is irrelevant. Throughput (tokens/sec/GPU) is what matters here.
Given that, the scheduler needs to give output-phase requests absolute priority while letting think-phase requests fill any remaining batch capacity with a larger effective batch size to maximise GPU utilisation.
There are two plausible structural shapes to implement this:
- Single queue with priority weights. One queue of all "decode-eligible" requests; each request carries a numeric priority. The scheduler picks the highest-weighted requests every iteration and the eviction policy reads block tier from the request's phase.
- Two independent queues, one per phase. An output_queue drained first to its budget, then a think_queue drained to a larger budget capped by remaining KV memory.
Both can produce equivalent dispatch ordering. They diverge in observability, in the failure modes they expose, and in how cleanly they compose with KV tier management.
Decision
Meridian uses two independent queues — output_queue and think_queue —
sharing the same GPU workers, with output drained first every iteration.
The scheduler exposes per-queue depth as a separate metric label, applies per-queue SLO budgets, and the block manager's eviction tiers are indexed on the block's phase membership rather than on the owning request's priority number.
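The dispatch rule in this decision can be sketched as follows. This is a simplified illustration, not MeridianScheduler itself: the struct and field layout are assumptions, requests are reduced to bare IDs, and the cap imposed by remaining KV memory is deliberately left out.

```rust
use std::collections::VecDeque;

/// Illustrative two-queue scheduler state; request IDs stand in for requests.
pub struct DualQueue {
    pub output_queue: VecDeque<u64>,
    pub think_queue: VecDeque<u64>,
    pub output_budget: usize,
    pub think_batch_multiplier: f64,
}

/// One scheduling iteration's selection, split by phase.
#[derive(Debug, PartialEq, Eq)]
pub struct Batch {
    pub output: Vec<u64>,
    pub think: Vec<u64>,
}

impl DualQueue {
    /// Drain output first up to its budget, then think up to
    /// think_batch_multiplier x output_budget (KV-memory cap omitted here).
    pub fn schedule_batch(&mut self) -> Batch {
        let n_out = self.output_budget.min(self.output_queue.len());
        let output: Vec<u64> = self.output_queue.drain(..n_out).collect();

        let think_budget = (self.think_batch_multiplier * self.output_budget as f64) as usize;
        let n_think = think_budget.min(self.think_queue.len());
        let think: Vec<u64> = self.think_queue.drain(..n_think).collect();

        Batch { output, think }
    }
}
```

Because the output queue is drained before the think queue is even inspected, "output drains first" holds by construction rather than through priority comparisons.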
Consequences
Positive
- SLO isolation by construction. TTOT and TPOT live on different queues and cannot interfere through priority arithmetic. We never need to reason about whether a priority of 5 vs. 7 is enough to keep an output token from being preempted by a think token: the queues are physically separate.
- Reasoning about starvation is local. With priority weights, you have to argue globally about the joint distribution of priorities under load to prove that think requests are not starved. With two queues, the worst case is "output_queue saturates the budget → think_queue waits its turn." That is a one-line argument and a single bounded scalar (think_batch_multiplier × output_budget) to tune.
- Block manager tiering is structurally aligned. The eviction policy iterates BlockTier::ThinkComplete → ThinkActive → OutputCritical. The scheduler's queues map 1:1 onto two of those tiers (ThinkActive, OutputCritical), and the ThinkComplete tier appears precisely when a request transitions queues. The pipeline of state transitions is uniform end-to-end.
- Observability is honest. meridian.queue_depth{queue="output"} and …{queue="think"} are operationally meaningful: they correspond to things an on-call engineer can act on. A single meridian.queue_depth_p95_priority would obscure the failure mode.
- Future disaggregation is cheap. When we add a separate decode pool for think (a natural extension co-located with prefill-decode disagg systems like Mooncake / NIXL), the seam already exists.
Negative / risks
- Two queue data structures instead of one. Marginal memory cost (crossbeam::queue::SegQueue is small) and a second amortised O(1) push path. Not material against the per-token compute budget.
- Risk that think queue is permanently starved under sustained output pressure. Mitigation: the scheduler enforces a minimum think-batch reservation when output_queue.len() < output_budget. Detection: meridian.queue_depth{queue="think"} rising while meridian.budget_force_triggered stays flat; alert at p95 depth > 4× baseline for 5 minutes.
- Edge cases at phase transition. A request that emits </think> and the next token in the same decode step transitions queues mid-iteration. This is handled by MeridianScheduler::on_phase_event taking the request out of the think queue and pushing it into the output queue before the next schedule_batch call. Tests for this case live in tests/phase_router_state_machine.rs.
Neutral
- The number of tunables stays the same. A single-queue design with priority weights requires output_priority, think_priority, and a priority_gap_min; the dual-queue design requires output_tpot_budget_ms, think_tpot_budget_ms, and think_batch_multiplier. Both surfaces are three scalars.
Alternatives considered
Single queue with continuous priority weights
RequestSlot { priority: f32 }, dispatch is argmax(priority) with
preemption. Output requests carry priority ≈ 10, think requests
priority ≈ 1. Rejected because:
- The dispatch order under heavy load depends on the distribution of weighted requests, not on a per-class budget, so you can no longer write down a one-line invariant like "output never waits more than K ms."
- The block manager would need to read priorities to decide eviction order, coupling two subsystems that we want orthogonal.
- Operators tuning the system find priority weights opaque: "is 8.5 enough?" is not a question with a principled answer.
Single queue with phase-stratified preemption
One queue, but a hard rule that any output-phase request preempts any think-phase one. This is structurally equivalent to two queues but expressed differently. Rejected for code-clarity reasons only: the dual-queue shape makes the invariant ("output drains first") the structure, rather than an invariant we have to police in the dispatcher.
Per-tenant queues with phase tags
Considered for multi-tenant SaaS deployments. Not rejected outright, but deferred — it is an orthogonal axis we can layer on top of the two-queue shape. Captured as a future ADR placeholder.
References
- Playbook §3.3 — Dual-Queue Scheduler.
- vLLM v0.9 scheduler internals: vllm/core/scheduler.py (single-queue priority-weighted implementation we are improving on).
- Mooncake disagg paper: separates prefill from decode; orthogonal axis.
- DUCHESS (arXiv:2509.24957): intra-request branch orchestration; operates below the queue layer.
ADR-0002: Workspace tri-crate layout (core / kernels / python)
- Status: Accepted
- Date: 2026-05-20
- Authors: angelnicolasc
Context
Meridian spans three execution domains: a pure-Rust scheduler core, CUDA kernels behind an FFI boundary, and pyo3 bindings that the Python vLLM plugin consumes. The natural layouts are:
- Single crate. All code in one place, gated by cfg(feature = ...).
- Tri-crate Cargo workspace. meridian-core (Rust only), meridian-kernels (CUDA + FFI), meridian-python (pyo3, built via maturin). All members of one workspace, sharing lockfile and lints.
- Polyrepo. Each layer in its own repository, glued by published versions.
Decision
Tri-crate workspace. All three crates live under crates/ in this
repository.
Consequences
Positive
- cargo test -p meridian-core runs on any host: no CUDA, no Python, no nvcc. The CI matrix can validate the core invariants on the cheapest GitHub-hosted runner and only spin up a GPU runner for the CUDA layer.
- The unsafe surface area is visibly contained. Auditors looking at meridian-core see #![forbid(unsafe_code)] at the crate root. The unsafe code lives in exactly one crate (meridian-kernels) and at exactly one boundary (the FFI declarations in src/ffi.rs).
- Workspace [workspace.lints] is applied uniformly across all three crates: one place to change a clippy lint, no drift.
- A future fork that wants only the scheduler core (e.g. for a non-CUDA inference framework) can depend on meridian-core directly without pulling pyo3 or CUDA artifacts.
Negative / risks
- Three Cargo.tomls to keep in sync. Mitigated by workspace.dependencies and workspace.package inheritance: versioned dependencies are declared once.
- The meridian-kernels crate has links = "meridian_kernels". Cargo enforces uniqueness so we cannot accidentally link two implementations of the same native library, a small but real guard.
Neutral
- Build artifacts grow with each additional crate, although all workspace members share a single target/ directory. Negligible in practice.
Alternatives considered
Single crate
Rejected because: any Python binding gate requires pyo3 in the dep graph, which pulls a non-trivial transitive closure. Anyone wanting to depend on just the scheduler core would be forced to compile it.
Polyrepo
Rejected because: in a pre-1.0 project where the three layers co-evolve, splitting them into separate repositories introduces version-skew bugs without a corresponding benefit. Once the public API stabilises post-1.0 this can be revisited.
References
ADR-0003: DashMap for per-request phase state
- Status: Accepted
- Date: 2026-05-20
- Authors: angelnicolasc
- Reviewers: sole-maintainer decision record
Context
The PhaseRouter mutates per-request state (ThinkPhase) on every decoded
token. A single decode worker processes a continuous-batch step that can
touch dozens of requests; in a tensor-parallel deployment, multiple worker
threads may invoke on_token for different requests in the same wall-clock
window. We need:
- O(1) lookup keyed by req_id.
- Concurrent mutation of different keys without serialising the whole map.
- Cheap clone / sharing: the router is shared across the scheduler, the block manager touch path, and the vLLM plugin.
- No GC, no allocations on the hot path.
Candidates evaluated:
- parking_lot::RwLock<HashMap<u64, ThinkPhase>>
- DashMap<u64, ThinkPhase> (sharded HashMap behind per-shard RwLock)
- papaya::HashMap<u64, ThinkPhase> (lock-free, June 2025)
- scc::HashMap<u64, ThinkPhase> (lock-free, sharded)
Decision
DashMap 6.x. It is the boring, well-audited choice that hits the
performance floor we need without introducing a less-battle-tested
dependency on the hot path.
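For readers unfamiliar with the sharded-lock idea behind this choice, the following std-only sketch shows the principle: each key maps to one of several independently locked HashMaps, so writers on different shards do not block each other. It is a teaching aid, not DashMap's implementation; the type, its methods, and the modulo shard selection are all assumptions made for brevity.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Minimal sharded map keyed by u64 request IDs (illustrative only).
pub struct ShardedMap<V> {
    shards: Vec<RwLock<HashMap<u64, V>>>,
}

impl<V> ShardedMap<V> {
    pub fn new(n_shards: usize) -> Self {
        assert!(n_shards > 0, "need at least one shard");
        Self {
            shards: (0..n_shards).map(|_| RwLock::new(HashMap::new())).collect(),
        }
    }

    // Pick the shard by key modulo shard count (a real map would hash first).
    fn shard(&self, key: u64) -> &RwLock<HashMap<u64, V>> {
        &self.shards[(key % self.shards.len() as u64) as usize]
    }

    pub fn insert(&self, key: u64, value: V) -> Option<V> {
        self.shard(key).write().unwrap().insert(key, value)
    }

    /// Mutate the value for `key` in place while holding only its shard's lock.
    pub fn update<R>(&self, key: u64, f: impl FnOnce(&mut V) -> R) -> Option<R> {
        let mut guard = self.shard(key).write().unwrap();
        guard.get_mut(&key).map(f)
    }

    pub fn remove(&self, key: u64) -> Option<V> {
        self.shard(key).write().unwrap().remove(&key)
    }
}
```

Two threads bumping counters for request IDs 1 and 2 take different shard locks here, which is the property a single RwLock around one HashMap cannot provide when every access is a write.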
Consequences
Positive
- Sharded locking. Operations on req_ids that hash to different shards do not contend.
- Mature API. get_mut, entry, remove cover every access pattern in PhaseRouter. No need to invent abstractions on top.
- No unsafe in our crate. Internally backed by parking_lot; we keep the #![forbid(unsafe_code)] invariant in meridian-core intact.
- Crates.io top-100. Wide deployment, frequent security audits, stable semver.
Negative / risks
- Sharded lock is not truly lock-free. Under extreme contention (a single shard absorbing many requests because of hash skew), throughput degrades. Mitigation: monitor meridian.phase_router.tracked_requests; if the gauge approaches n_shards × shard_capacity (default ~512 per shard), revisit with papaya or shard-aware partitioning. In our workload (~1k concurrent requests max per worker) we are far from this ceiling.
- AHasher by default. Good for our integer keys; not a downside, but worth noting that we benefit from non-cryptographic hashing here.
Neutral
- Memory per entry is slightly higher than a flat HashMap because of the shard metadata. Negligible (~64 bytes overhead total).
Alternatives considered
RwLock<HashMap<...>>
Rejected: every on_token call needs a write lock to bump
tokens_so_far, so the RwLock collapses to a Mutex in practice.
Single-shard serialisation across all requests is unacceptable.
papaya::HashMap
Watched, not adopted. Genuinely lock-free, with better tail latency under
contention than DashMap. Adoption deferred until: (a) it stabilises a
1.0 API, and (b) we can demonstrate a workload where DashMap is the
bottleneck. Tracked as future work in the DEVLOG.
scc::HashMap
Comparable to papaya but more API surface; same deferral rationale.
References
- DashMap source: https://github.com/xacrimon/dashmap.
- The phase_router bench measures the steady-state cost of on_token under no contention.
ADR-0004: KV tier transitions are one-way; demotion is the only direction
- Status: Accepted
- Date: 2026-05-20
- Authors: angelnicolasc
- Reviewers: sole-maintainer decision record
Context
The block manager maintains three tiers (ThinkActive, ThinkComplete,
OutputCritical). Tier transitions are emitted by the scheduler on phase
events. We had to decide whether a block can promote (e.g. a
ThinkComplete block re-attended during output generation gets restored
to ThinkActive or even OutputCritical).
Two camps:
- Bidirectional: any block currently accessed gets promoted to the highest applicable tier.
- One-way demotion: blocks only move down the eviction order (OutputCritical → … → freed). Promotion is explicitly disallowed.
Decision
One-way demotion. A block's tier is set at allocation time
(via BlockTier::ThinkActive or BlockTier::OutputCritical) and can only
move toward eviction:
- ThinkActive → ThinkComplete via demote_think_blocks.
- Any tier → freed via evict_for or free.
There is no API to move a ThinkComplete block back to ThinkActive, or
a ThinkActive block to OutputCritical.
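The one-way rule can be expressed in the type itself. In the sketch below the enum mirrors the three tiers named in this ADR, but the demote method, the Block struct, and sort_for_eviction are hypothetical illustrations rather than the block manager's real API (which uses demote_think_blocks, evict_for, and free).

```rust
/// Tiers in eviction order: variants declared first are evicted first,
/// so the derived Ord matches ThinkComplete < ThinkActive < OutputCritical.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum BlockTier {
    ThinkComplete,
    ThinkActive,
    OutputCritical,
}

impl BlockTier {
    /// The only tier-to-tier transition: ThinkActive -> ThinkComplete.
    /// Every other tier is returned unchanged; nothing ever moves up.
    pub fn demote(self) -> BlockTier {
        match self {
            BlockTier::ThinkActive => BlockTier::ThinkComplete,
            other => other,
        }
    }
}

/// Illustrative block record: tier plus a logical last-touch timestamp.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Block {
    pub id: u64,
    pub tier: BlockTier,
    pub last_touch: u64,
}

/// Order candidates for eviction: lowest tier first, then least recently
/// touched within a tier (LRU is the only recency signal).
pub fn sort_for_eviction(blocks: &mut [Block]) {
    blocks.sort_by_key(|b| (b.tier, b.last_touch));
}
```

Since there is no method that returns a higher tier than its input, a block's tier can only stay put or decrease, which is the monotonicity property the decision relies on.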
Consequences
Positive
- Reasoning about eviction stability becomes trivial. A block's tier monotonically decreases. Cross-attention back-references over a reasoning span cannot accidentally "rescue" blocks the scheduler has already decided are evictable — operators can reason about KV pressure without tracking promotion races.
- The block manager API is smaller. No promote() method to test, no invariant to enforce ("you can only promote within the same request").
- Aligns with the playbook intent. kv_memory.aggressive_think_eviction is a one-way knob: think blocks either survive demotion (default) or are freed immediately (aggressive). Promotion would make this knob semantically incoherent.
Negative / risks
- A reasoning model that re-attends over its own think span during output generation pays the eviction cost twice. Cross-attention reads on a ThinkComplete block bring the block into the GPU's L1/L2 cache but do not promote its eviction tier. If that block is then evicted by memory pressure, the next cross-attention read forces a recompute. Mitigation: keep kv_memory.aggressive_think_eviction = false so blocks stay resident as ThinkComplete until pressure actually demands their eviction.
- No way to mark "this block is hot, please keep it" beyond keeping it in the lowest tier it was admitted to. The LRU within a tier is the only signal of recency.
Neutral
- The touch() API exists to update LRU position within a tier, not to promote across tiers. This is explicit in the trait documentation.
Alternatives considered
Bidirectional with promote_block(block_id, tier)
Rejected because:
- Adds a new invariant to police (can a ThinkComplete of request A promote to OutputCritical of request B? Obviously not, but the API must enforce that).
- Promotion races with eviction: a block that the eviction iterator has selected as next victim might be promoted mid-eviction. Resolving this needs either a lock around the whole eviction loop (kills throughput) or a generation counter (more complexity).
- The cross-attention rescue use case is real but rare; the simpler fallback (operator tunes think_phase_memory_fraction) handles it without architectural complexity.
Implicit promotion on touch()
Rejected: touch() is called on the hot path for every cache hit. Doing
tier promotion there would dramatically slow the common case to optimise
a rare one.
References
- ADR-0001 — dual-queue scheduling sits alongside this tier policy.
- Playbook §3.4 — original three-tier eviction design.
ADR-0005: Benchmark methodology and metric selection
- Status: Accepted
- Date: 2026-05-20
- Authors: angelnicolasc
- Reviewers: sole-maintainer decision record
Context
Meridian's value proposition is phase-differentiated scheduling. The benchmark harness has to surface that value — measuring overall throughput or aggregate TPOT will not distinguish Meridian from a stock vLLM scheduler running the same workload, even if Meridian is substantially better for the user-visible streaming experience.
We must choose:
- What metrics to report.
- What workload to drive the system with.
- How to run the benchmark without a GPU in CI.
Decision
Metrics
The benchmark reports the following primary metrics (all per-request, aggregated to percentiles in the report):
| Metric | Definition | Why it matters |
|---|---|---|
| TTFT | Time-to-first-token (prefill + first decoded token). | Industry-standard latency floor; baseline parity. |
| TTOT | Time-to-first-OUTPUT-token. Measured from </think> emission to the first user-visible token after it. | The metric stock vLLM does not even track. This is where dual-queue scheduling shows its value: in a baseline, output tokens can be preempted by think tokens, inflating TTOT P95. |
| Output ITL | Inter-token latency during the output phase only. Measured per (token N → token N+1) pair. P50/P95/P99. | Streaming fluidity — the perceptual quality of the user-visible output. |
| Think tokens | Tokens emitted in the think phase per request. Avg + P95. | Cost driver; budget-forcing efficacy is measured against this. |
| Budget force rate | Percentage of reasoning requests where budget force fired. Broken down by reason (converged, overthinking, hard_cap). | Quality signal: if hard_cap dominates we are forcing blindly; if converged dominates the entropy probe is doing its job. |
| OutputCritical eviction events | Count of eviction events that reached the OutputCritical tier during the run. | User-visible degradation event. Any non-zero is alertable. |
The harness explicitly does NOT report aggregate throughput (tokens/sec/GPU). That is what every other benchmark already reports, and it does not distinguish Meridian from the baseline. Operators who only want a throughput number can run vLLM's own benchmark harness.
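As an illustration of how the per-request metrics in the table might be aggregated, the sketch below computes TTOT from two timestamps and a nearest-rank percentile over a set of samples. Both function names are hypothetical, and nearest-rank is one common percentile convention chosen here for simplicity; the harness's actual aggregation method is not specified in this ADR.

```rust
/// TTOT for one request: time from </think> emission to the first
/// user-visible output token, both given in milliseconds.
pub fn ttot_ms(think_end_ms: f64, first_output_ms: f64) -> f64 {
    first_output_ms - think_end_ms
}

/// Nearest-rank percentile: the smallest sample such that at least
/// `pct` percent of samples are less than or equal to it.
/// Returns None for empty input or a percentile outside 1..=100.
pub fn percentile_nearest_rank(samples: &[f64], pct: u32) -> Option<f64> {
    if samples.is_empty() || pct == 0 || pct > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Integer ceil(pct / 100 * n) avoids floating-point rank errors.
    let rank = (pct as usize * sorted.len() + 99) / 100;
    Some(sorted[rank - 1])
}
```

Feeding the per-request ttot_ms values into percentile_nearest_rank with 50, 95, and 99 gives the kind of P50/P95/P99 summary the table describes.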
Workload
Reference workload: synthetic mix of two categories.
- Chat — short prompts, 40–240 output tokens, no think phase. Models the ShareGPT-style background traffic that should never stutter.
- Reasoning — math-style prompts with expected think token counts in
[600, 6000]. Models a MATH-500-equivalent distribution.
Mix ratio is operator-configurable (--reasoning-ratio); default 0.4 is
the realistic balance for a 2026 reasoning-model deployment.
Arrivals are Poisson (exponential inter-arrival) at a configurable rate. This matches how production traffic actually arrives and exercises the dual-queue policy under realistic burst conditions.
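A Poisson arrival process can be generated by summing exponentially distributed gaps. The sketch below does this with inverse-transform sampling and a small SplitMix64 generator so that it stays dependency-free and deterministic for a given seed; the generator choice and the function names are assumptions for illustration and say nothing about how the harness itself draws arrivals.

```rust
/// SplitMix64: a tiny deterministic PRNG, used here only to avoid
/// external crates in the example.
pub struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform sample in [0, 1) built from the top 53 bits.
    pub fn next_unit(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Arrival timestamps (seconds) of a Poisson process with the given rate:
/// each gap is exponential, sampled as -ln(1 - u) / rate.
pub fn arrival_times(rate_per_sec: f64, count: usize, seed: u64) -> Vec<f64> {
    let mut rng = SplitMix64::new(seed);
    let mut t = 0.0_f64;
    (0..count)
        .map(|_| {
            let u = rng.next_unit();
            t += -(1.0 - u).ln() / rate_per_sec;
            t
        })
        .collect()
}
```

At a rate of 10 requests per second the mean gap converges to 0.1 s, while individual gaps vary widely, which is what produces the bursts that stress the output queue.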
Two execution modes
- synthetic-replay: uses the native Meridian components (PhaseRouter, MeridianScheduler, BlockManager) over a synthetic decoder loop that does not require a GPU or a real vLLM. The phase events, scheduler queue transitions, KV allocations and eviction pressure are all real; only the per-token compute is simulated as a fixed-cost sleep. This mode runs in CI and detects regressions in the scheduler / block manager dynamics.
- real-vllm: drives a real AsyncLLMEngine with the MeridianSchedulerPlugin attached. Requires a CUDA-capable host and a model checkpoint. Runs on the GPU CI job and on demand for release validation.
Both modes emit the same BenchmarkReport JSON+Markdown shape so reports
are directly comparable. CI uploads both as artefacts and the Markdown
form can be posted as a PR comment for visual diff.
Consequences
Positive
- Reproducibility: synthetic-replay is deterministic given --seed and runs in seconds. Two PRs can be compared apples-to-apples without GPU access.
- Honest reporting: metrics call out exactly where Meridian wins (TTOT, output ITL variance) and acknowledge what we don't measure (raw throughput).
- Cross-mode parity: the same BenchmarkReport schema for both modes means a regression caught in synthetic-replay translates directly to expected behaviour under real-vllm.
- Honest about failures: the output_critical_eviction_events counter surfaces user-visible degradation immediately; the budget-force reason breakdown surfaces when the entropy probe is doing real work vs. just hitting the cap.
Negative / risks
- synthetic-replay does not exercise the CUDA kernels. A regression in the kernels will not be caught by the synthetic-replay CI run; the GPU job's kernel_correctness test is the line of defence there.
- Synthetic per-token latency is calibrated, not measured. The default sleeps (6 µs / 18 µs for think / output) are approximations of bf16-Qwen3-on-H100 wall-clock times. Operators tuning a different hardware target should override via the SyntheticDecoder constructor.
- Mix is synthetic. Real ShareGPT / MATH-500 replays are available via --workload sharegpt|math500 and use the offline HuggingFace dataset loader; they do not require a GPU.
Neutral
- Report artefacts are JSON + Markdown only. Operators who want an HTML dashboard can render the JSON externally.
Alternatives considered
"Just use vLLM's benchmark harness"
Rejected. vLLM's harness reports throughput and TTFT, neither of which shows Meridian's value. We would have to extend it to track TTOT and phase-differentiated latencies — at which point we have already built this harness, but with a tight coupling to vLLM's internal benchmark abstractions.
MLPerf-style reference workloads
Considered, deferred. MLPerf targets raw throughput and aggregate API-level latency, not phase-differentiated metrics. An MLPerf-compatible reporting mode could be added if there is downstream demand, but it does not fit the primary signal Meridian optimises for.
Per-token wall-clock tracing
Considered (would record every decode-step timestamp and reconstruct the queue depths after the fact). Rejected because it generates GiB of trace data per run for marginal additional signal — the aggregated percentiles are enough to identify regressions and the OpenTelemetry spans give the deep-dive when needed.
References
- Playbook §5 — target metrics table.
- ADR-0001 — dual-queue scheduling, the design this benchmark validates.
- ADR-0004 — KV tier policy, the design that output_critical_eviction_events directly measures.
ADR-0006: Disaggregated KV transfer protocol
- Status: Accepted
- Date: 2026-05-20
- Authors: angelnicolasc
- Reviewers: sole-maintainer decision record
Context
The current frontier of LLM serving infrastructure has settled on prefill-decode disaggregation: prompt-processing (compute-bound) and token-decoding (memory-bandwidth-bound) run on separate worker pools and exchange KV blocks across a high-bandwidth fabric. NVIDIA NIXL is the CUDA-blessed reference; Mooncake (Moonshot AI, FAST '25) is the open-source protocol that the rest of the ecosystem implements.
Meridian's three-tier block manager already produces the exact signal a
disaggregated pool wants — ThinkComplete blocks at ExitThink are
known-cold. A scheduler with phase visibility is the natural producer
of well-batched, well-timed offload events: no other layer in the stack
knows that right now, this request just finished reasoning and won't
re-read its think KV.
The remaining question is how a phase-aware scheduler talks to the fabric. We need a wire format and a set of trigger points that work across NIXL and Mooncake (and any future fabric) without forcing Meridian to depend on a specific runtime.
Constraints:
- The fabric layer cannot live inside meridian-core: that crate is forbid(unsafe_code) and has no transitive CUDA dependency. It must live in meridian-kernels behind a cargo feature so non-disagg deployments pay nothing.
- The wire format must be readable from Python and C++ NIXL agents, not just from Rust. NIXL's reference implementation is C/C++ + Python bindings; Mooncake's reference implementation is C++ + Python.
- A meridian deployment that wants disagg but doesn't have a libnixl runtime available must still be exercisable end-to-end; otherwise the integration is impossible to test without specialised hardware.
Decision
Meridian defines a small versioned wire protocol (MRDN v1) for
disaggregated KV transfer, implements it behind the nixl cargo
feature of meridian-kernels, and exposes the trigger points through
the BlockManager trait.
Wire format
+---------------+---------------+---------------+---------------+
| magic (4) | version (4) | body_len (4) | tier (1)+pad(3)|
+---------------+---------------+---------------+---------------+
| checksum (16, Blake3-128) |
+---------------+---------------+---------------+---------------+
| body (body_len bytes — opaque to Meridian; NIXL/Mooncake |
| interpret as raw KV bytes) |
+---------------+---------------+---------------+---------------+
- magic = b"MRDN": fail-fast on misrouted payloads.
- version = 1: incremented on any breaking framing change. We commit to preserving v1 across all 0.x releases.
- tier: the producer's tier label (ThinkComplete | ThinkActive | OutputCritical). The consumer may ingest into a different tier; the field exists for telemetry and for fabric-side admission policy.
- checksum: Blake3 of the body, truncated to 128 bits. Detects bit flips on RDMA and silent corruption inside fabric staging buffers.
The header is exactly 32 bytes. Body is opaque to the protocol — NIXL treats it as a raw KV slab; Mooncake adds its own framing inside.
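To show how the 32-byte layout adds up (4 + 4 + 4 + 1 + 3 + 16), here is a sketch of header encoding and decoding. Several details are assumptions that the ADR does not state: little-endian integer encoding, zero-filled padding, the numeric value used for tier, and all type and function names. The checksum is passed in as bytes because computing Blake3 needs an external crate.

```rust
pub const MAGIC: [u8; 4] = *b"MRDN";
pub const HEADER_LEN: usize = 32;

/// Illustrative in-memory form of the MRDN v1 header.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Header {
    pub version: u32,
    pub body_len: u32,
    pub tier: u8,           // numeric tier encoding is an assumption
    pub checksum: [u8; 16], // Blake3 of the body truncated to 128 bits, computed elsewhere
}

impl Header {
    /// Serialise to exactly 32 bytes (little-endian integers assumed).
    pub fn encode(&self) -> [u8; HEADER_LEN] {
        let mut out = [0u8; HEADER_LEN];
        out[0..4].copy_from_slice(&MAGIC);
        out[4..8].copy_from_slice(&self.version.to_le_bytes());
        out[8..12].copy_from_slice(&self.body_len.to_le_bytes());
        out[12] = self.tier; // bytes 13..16 remain zero as padding
        out[16..32].copy_from_slice(&self.checksum);
        out
    }

    /// Parse a header, failing fast on short input, bad magic, or unknown version.
    pub fn decode(buf: &[u8]) -> Result<Header, String> {
        if buf.len() < HEADER_LEN {
            return Err("header shorter than 32 bytes".to_string());
        }
        if buf[..4] != MAGIC[..] {
            return Err("bad magic: not an MRDN payload".to_string());
        }
        let version = u32::from_le_bytes([buf[4], buf[5], buf[6], buf[7]]);
        if version != 1 {
            return Err(format!("unsupported version {}", version));
        }
        let body_len = u32::from_le_bytes([buf[8], buf[9], buf[10], buf[11]]);
        let mut checksum = [0u8; 16];
        checksum.copy_from_slice(&buf[16..32]);
        Ok(Header { version, body_len, tier: buf[12], checksum })
    }
}
```

A consumer would decode the header first, read body_len bytes, and only then verify the checksum against the body before handing the slab to the fabric.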
Trigger points
- ExitThink: the producer's natural offload window. The scheduler batches the request's ThinkComplete blocks and pushes them to the fabric in a single shot, amortised by disagg.offload_threshold_blocks.
- OutputCritical allocation pressure: if the local pool is thrashing OutputCritical (a user-visible degradation event), the scheduler may pull blocks back from the fabric to satisfy allocations. The ingest_block hook is implemented; the automatic pull policy under allocation pressure is deferred pending measured offload latency data to calibrate the threshold.
Fabric trait
pub trait Fabric: Send + Sync + std::fmt::Debug {
    fn push(&self, payload: Vec<u8>) -> Result<u64>;
    fn pull(&self, handle: u64) -> Result<Vec<u8>>;
    fn label(&self) -> &'static str;
}
Shipped implementations:
- SyntheticNixlFabric: in-process keyed map, wire-format-identical to a real NIXL agent on the host side. Used for integration tests and for portfolio deployments where libnixl isn't reachable.
- Real libnixl FFI: gated on nvidia-nixl-sys becoming available on crates.io. The call sites already speak the protocol; switching swaps the Fabric implementation only.
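The shape of an in-process fabric can be sketched in Python. This is an illustrative analogue of the `push` / `pull` / `label` surface of the Rust `Fabric` trait, not the shipped `SyntheticNixlFabric`; the class name `InProcessFabric`, the handle-allocation scheme, and the use of `KeyError` for unknown handles are all assumptions.

```python
import itertools
import threading


class InProcessFabric:
    """Keyed in-memory map exposing the same three operations as `Fabric`."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # Assumption: handles are allocated from a simple increasing counter.
        self._next_handle = itertools.count(1)
        self._store: dict[int, bytes] = {}

    def push(self, payload: bytes) -> int:
        """Store a framed payload and return an opaque handle for later pull."""
        with self._lock:
            handle = next(self._next_handle)
            self._store[handle] = bytes(payload)
            return handle

    def pull(self, handle: int) -> bytes:
        """Return the payload stored under `handle`; raises KeyError if unknown."""
        with self._lock:
            return self._store[handle]

    def label(self) -> str:
        """Telemetry label; this ADR names "nixl-synth" for the synthetic fabric."""
        return "nixl-synth"
```

Because callers only see handles and byte payloads, replacing this object with a real transport would leave the call sites unchanged, which is the property the trait is meant to provide.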
Mooncake compatibility is achieved by writing a MooncakeAdapter: Fabric that re-frames the v1 wire body inside Mooncake's transport.
The header survives unchanged, so an end-to-end conversation between a
Meridian producer and a Mooncake-only consumer is a one-adapter delta.
Consequences
Positive
- A single wire format covers two ecosystems (NIXL + Mooncake) and is forward-compatible because version is on the wire.
- Checksum-on-body catches silent corruption, the kind of incident that takes 48 h to diagnose in a heterogeneous fabric.
- The BlockManager trait gains offload_block / ingest_block / block_location as non-breaking additions: the default impls return DisaggUnavailable for offload/ingest and Local for location, so existing implementations keep working.
- Portfolio deployments can demonstrate the full disagg path without GPU hardware, via the synthetic fabric.
Negative / risks
- Adding Vec<u8> allocations on every offload contradicts the zero-allocation discipline of the router hot path. Mitigation: the offload path runs at ExitThink, which is at most once per request, well off the per-token critical path.
- The 32-byte header is overhead for blocks that may be only a few KiB each. At a typical 16 KiB block this is 0.2%, which is acceptable.
- The synthetic fabric is not a substitute for real NIXL benchmarks. Anyone reading a synthetic-fabric A/B chart should know what they're reading. We label the fabric as "nixl-synth" in all telemetry to make this unambiguous.
Neutral
- The version field commits us to a backward-compatibility plan once v2 ships. ADR-0007 documents the policy.
Alternatives considered
gRPC point-to-point. Considered briefly. Rejected because every KV transfer would pay HTTP/2 framing overhead on top of the actual payload, and the standard NIXL/Mooncake clients don't speak gRPC. Our producer would be the odd one out.
RDMA-only, no framing. Considered. Rejected because RDMA without framing requires every consumer to know the producer's exact tier and checksum convention out-of-band. The 32-byte header buys us self-describing payloads at a 0.2% overhead — strictly worth it.
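The 0.2% figure is simply the 32-byte header divided by a 16 KiB block. A short Python check, with the 4 KiB and 64 KiB block sizes added only for comparison (they are not figures from this ADR):

```python
HEADER_BYTES = 32


def header_overhead_pct(block_bytes: int) -> float:
    """Header size as a percentage of the block body it precedes."""
    return 100.0 * HEADER_BYTES / block_bytes


for kib in (4, 16, 64):
    print(f"{kib:>2} KiB block: {header_overhead_pct(kib * 1024):.2f}% overhead")
#  4 KiB block: 0.78% overhead
# 16 KiB block: 0.20% overhead
# 64 KiB block: 0.05% overhead
```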
Per-block streams over Mooncake without our own header. Considered.
Rejected because we'd lose the tier and version fields, which makes
mixed-version cluster rollouts dangerous: a producer at v1 talking to
a consumer at v0 must fail fast, not silently mis-tier blocks.
References
- NIXL technical brief, NVIDIA Developer Blog, March 2026.
- Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot, ASPLOS '25.
- Blake3 specification, https://github.com/BLAKE3-team/BLAKE3-specs.
- Meridian playbook §6 (Disaggregation outlook).
- ADR-0004 (KV tier promotion policy): describes why ThinkComplete is a natural offload candidate.
ADR-0007: Release and versioning policy
- Status: Accepted
- Date: 2026-05-20
- Authors: angelnicolasc
- Reviewers: sole-maintainer decision record
Context
Meridian ships three artefacts on three different cadences:
- The meridian-core and meridian-kernels crates → crates.io.
- The meridian Python package (the vLLM plugin) → PyPI.
- The mdBook → GitHub Pages.
A release that bumps any of them must keep the others coherent —
breaking the Python plugin while leaving meridian-core stable would
strand operators on a half-upgraded stack. We need an explicit policy
for what triggers a release, which artefacts move together, and
how breaking changes are signalled — both for the pre-1.0 phase
(today) and for the post-1.0 phase (after a year of production use).
We also need to commit to a provenance and SBOM story so the project is auditable end-to-end. SLSA Level 2 is the bar reasonable consumers expect in 2026.
Decision
SemVer interpretation
- Post-1.0: strict SemVer. Breaking changes require a major bump; additive changes require a minor bump; bugfixes are patches.
- Pre-1.0: breaking changes ship as minor bumps. The CHANGELOG entry for every minor bump must list every breaking change under a BREAKING CHANGE: footer. Operators reading the CHANGELOG before upgrading get a complete diff with no surprises.
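The two rules can be written as one small function. This is a sketch of the policy, not of release-plz's behaviour; the function name `next_version` and the change labels `breaking`, `additive` and `fix` are hypothetical, and treating pre-1.0 additive changes as minor bumps is an assumption, since the policy above only spells out the breaking case.

```python
def next_version(current: str, change: str) -> str:
    """Return the next version for a change of kind 'breaking', 'additive' or 'fix'."""
    major, minor, patch = (int(part) for part in current.split("."))
    if change == "fix":
        return f"{major}.{minor}.{patch + 1}"
    if change == "additive" or (change == "breaking" and major == 0):
        # Pre-1.0, a breaking change ships as a minor bump rather than a major one.
        return f"{major}.{minor + 1}.0"
    if change == "breaking":
        return f"{major + 1}.0.0"
    raise ValueError(f"unknown change type: {change!r}")
```

For example, a breaking change on 0.1.0 yields 0.2.0, while the same change on 1.4.2 yields 2.0.0.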
Release cadence
- Minor: every six weeks. The window is fixed; the content is whatever passes CI plus the ADRs accepted in that window.
- Patch: on demand for security fixes and high-severity bugs. No fixed cadence; patches ship within 48 h of a confirmed report for SEV-1 (CVSS ≥ 7.0), per the security policy.
Branch policy
- main is always-releasable. Nothing merges that breaks CI.
- No release branches. release-plz drives the changelog and tag from main directly.
- Hotfixes are commits on main followed by a patch tag. We do not back-port to N-1 minors during pre-1.0; operators on an old pre-1.0 minor are expected to upgrade forward.
Artefact set per release
Every tagged release ships:
- Compiled Rust static libs and the meridian Python wheel, built and uploaded as GitHub release assets by release.yml.
- CycloneDX SBOM for the Rust workspace and the Python wheel, attached to the GitHub release.
- SLSA Level 2 provenance attestation, generated by slsa-github-generator and uploaded as a release asset.
- A signed Git tag (when GPG is configured) and an annotated tag otherwise.
- mdBook deploy → GitHub Pages (via docs.yml on push to main).
Note on crates.io and PyPI publishing: release.yml carries a manual (workflow_dispatch) publish job. It defaults to dry-run (cargo publish --dry-run for the crates plus twine check on the wheel), so packaging is validated on every invocation without pushing anything. Selecting publish_mode = live performs the real publish, but each live step is token-gated: it runs only when CARGO_REGISTRY_TOKEN (crates.io) and PYPI_API_TOKEN (PyPI) are present in repo secrets. Until those are configured the job is safe to run and simply validates. This keeps publish off the automatic tag path while the project stabilises pre-1.0.
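The gating logic of the publish job can be modelled as a small decision function. This is a sketch of the behaviour described in the note, not the contents of release.yml; the function name `publish_steps` is hypothetical, the returned strings are descriptive labels rather than real workflow output, and gating each registry on its own token is one reading of the note.

```python
def publish_steps(publish_mode: str, secrets: dict[str, str]) -> dict[str, str]:
    """Map each registry to the action the publish job would take."""
    if publish_mode != "live":
        # Default path: validate packaging only, push nothing.
        return {"crates.io": "cargo publish --dry-run", "pypi": "twine check"}
    # Live path: each step runs only if its token is present in repo secrets.
    return {
        "crates.io": "publish" if secrets.get("CARGO_REGISTRY_TOKEN") else "skipped (no token)",
        "pypi": "publish" if secrets.get("PYPI_API_TOKEN") else "skipped (no token)",
    }
```

Under this model, selecting live mode on a repository with no tokens configured publishes nothing, which matches the statement that the job is safe to run before secrets exist.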
The three artefacts (crates, wheel, mdBook) move together. A release with a partial set is a CI failure, not a partial release.
Version coupling
The policy is that meridian-core, meridian-kernels, and the meridian
Python package carry the same version string. release-plz enforces this for
the Rust side via Cargo workspace inheritance.
Current state of automation as of v0.1.0: python/pyproject.toml carries
a hardcoded version = "0.1.0" that must be bumped manually in sync with the
Cargo workspace. A build hook to read the workspace version automatically is
planned but not yet wired. Operators upgrading the Rust workspace must also
update python/pyproject.toml until that hook is in place.
Yanking
We will yank a crates.io publish only for a security incident or a correctness regression with no workaround. Style fixes, doc errors and "meh, the version should have been a minor" do not justify yanking — those get a forward fix in the next release.
Consequences
Positive
- Operators reading the CHANGELOG ahead of an upgrade get a complete picture of what's breaking — no surprises during pre-1.0.
- One number versions the entire stack. Cross-artefact compatibility questions ("does the wheel work with crate X.Y.Z?") become trivially answerable.
- SLSA L2 provenance + SBOM at every release make Meridian a credible citizen of supply-chain-conscious deployments.
- A six-week minor cadence is short enough to feel responsive but long enough that operators don't get release fatigue.
Negative / risks
- A monolithic version bump means even a docs-only release moves every artefact. Cargo wastes a publish; PyPI wastes a wheel. The cost is real but small — well below the cost of decoupled versions diverging during an incident.
- The fixed six-week cadence will sometimes ship "nothing material". We accept that; consistency beats hoarding.
Neutral
- The policy is enforced by release-plz config plus the CI release workflow. There is no human gate between a green CI on main and a tagged release.
Alternatives considered
Independent versions per artefact. Considered for the parity it gives with how Cargo and PyPI usually work in larger projects. Rejected because the cross-artefact compatibility matrix becomes a second README, and we are not large enough yet to justify the overhead.
Release on every merge to main. Considered as a continuous-release model. Rejected because it would publish to crates.io and PyPI dozens of times a week, polluting the index and triggering downstream Dependabot noise for every contributor.
Long-lived release branches per minor. Considered for the back-port story it enables. Rejected pre-1.0 because we explicitly do not support N-1; operators are expected to upgrade forward.
References
- SemVer 2.0.0.
- Keep a Changelog.
- SLSA Specification.
- release-plz: drives the changelog and tag.
- CycloneDX: SBOM format.
- ADR-0006 (Disagg KV transfer): wire format version field references this policy.
- SECURITY.md: disclosure SLAs.
ADR-0008: Request preemption policy
- Status: Accepted
- Date: 2026-05-20
- Authors: angelnicolasc
- Reviewers: sole-maintainer decision record
Context
Meridian reorders the batch vLLM hands it so output-phase requests are dispatched ahead of think-phase requests (ADR-0001). Reordering is a non-destructive operation: it changes the rank of requests that vLLM has already decided to run this step, but it never removes a request from the running set and never reclaims KV from a request mid-flight.
The next lever — and the one every mature serving scheduler eventually reaches for — is preemption: evicting a request that is already dispatched so its KV blocks can be reused by a higher-priority request. vLLM has its own preemption (recompute / swap), but it is phase-blind: it preempts on global memory pressure without knowing that a think-phase request holding 14 GB of KV is a far better victim than an output-phase request streaming to a waiting user.
A phase-aware preemption policy is therefore a natural Meridian feature. The question this ADR answers is not "how do we build it" in isolation, but "do we build it before 1.0, and if not, what is the design and the risk that justify deferring it". Shipping preemption is the single highest-risk change available to the scheduler: it can deadlock, it can thrash, and it can corrupt a request's KV if the reconstruction path is wrong. A wrong preemption decision is user-visible as a stalled or restarted generation.
Decision
Meridian will not preempt already-dispatched requests before 1.0. The plugin's influence on vLLM remains advisory — reordering and budget forcing only. This ADR records the intended design and its risk analysis so the deferral is a deliberate, documented choice rather than a gap.
Intended design (post-1.0)
When preemption is implemented it will follow this shape:
- Victim selection by phase, then LRU. The victim search walks the block manager's tier order: ThinkComplete first, then ThinkActive, and OutputCritical only under the same severe-pressure warning that evict_for already emits. Within a tier, the least-recently-touched request is the victim. This reuses the existing eviction ordering rather than introducing a second policy.
- Recompute, not swap, as the default reclamation path. A preempted think-phase request is cheaper to recompute from its prompt than to swap its KV to host and back, because think-phase KV is exactly the data we are willing to discard (ADR-0004). Swap remains available for output-phase victims, which must never lose progress.
- A preemption budget. At most a configurable fraction of the running set may be preempted per scheduler step (preempt_max_fraction, default small). This caps thrash: a pressure spike cannot evict the whole batch in one step.
- A re-admission guard. A preempted request is parked with a monotonically increasing priority floor so it cannot be preempted again immediately after re-admission. This is the anti-livelock invariant.
- Disagg interaction. When a fabric is configured (ADR-0006), a ThinkComplete victim is offloaded rather than discarded, so its KV is recoverable from the fabric instead of recomputed. Preemption and offload share the same victim search.
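The first and third points (tier-then-LRU ordering, capped by a per-step budget) can be sketched in Python. This illustrates the intended shape only and is not Meridian code: the `Running` record, the `select_victims` name, the `allow_output_critical` flag standing in for the severe-pressure condition, and the 0.1 default for `preempt_max_fraction` are all assumptions.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    # Lower value is preempted first: ThinkComplete < ThinkActive < OutputCritical.
    THINK_COMPLETE = 0
    THINK_ACTIVE = 1
    OUTPUT_CRITICAL = 2


@dataclass(frozen=True)
class Running:
    req_id: str
    tier: Tier
    kv_bytes: int
    last_touched: float  # smaller value means touched longer ago


def select_victims(running, needed_bytes, preempt_max_fraction=0.1,
                   allow_output_critical=False):
    """Pick victims by tier, then least-recently-touched, within the step budget."""
    # The budget is a fraction of the whole running set, rounded down.
    budget = int(len(running) * preempt_max_fraction)
    candidates = [r for r in running
                  if allow_output_critical or r.tier != Tier.OUTPUT_CRITICAL]
    candidates.sort(key=lambda r: (r.tier, r.last_touched))
    victims, freed = [], 0
    for req in candidates:
        if freed >= needed_bytes or len(victims) >= budget:
            break
        victims.append(req)
        freed += req.kv_bytes
    return victims
```

Note that rounding the budget down means a small running set with a small fraction yields no victims at all; whether the real policy should instead guarantee at least one victim is a design question this sketch leaves open.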
API shape (sketch, not committed)
The scheduler would gain a single entry point that returns victims for the caller to actuate against vLLM, keeping Meridian advisory rather than reaching into vLLM's running set directly:
fn select_preemption_victims(
    &self,
    needed_bytes: u64,
    running: &[RequestId],
) -> Vec<PreemptionVictim>; // { req_id, reason, reclaim: Recompute | Swap | Offload }
The plugin translates each victim into the vLLM preemption call appropriate
for that vLLM version, isolated in the same _extract_* / _reorder shim
layer that already absorbs vLLM API drift.
Consequences
Positive
- The pre-1.0 scheduler stays advisory and therefore safe: the worst case of a wrong Meridian decision is a sub-optimal dispatch order, never a lost or corrupted generation.
- The design is written down, so when preemption lands it starts from a reviewed risk analysis rather than a blank page.
- Victim selection reuses the existing tier ordering and disagg offload path, so the eventual implementation is additive, not a rewrite.
Negative / risks (the reason for deferral)
- Without phase-aware preemption, Meridian leaves throughput on the table under heavy memory pressure: vLLM's phase-blind preemption will sometimes evict an output-phase request when a think-phase victim was available. This is the cost we accept pre-1.0.
- The risks that justify waiting:
- Deadlock: a preempted request needs memory to be re-admitted that only its own preemption could free. Mitigated by the recompute default and the re-admission priority floor — both unproven until implemented.
- Thrash / livelock: oscillating preempt/re-admit under sustained pressure. Mitigated by the per-step preemption budget.
- KV correctness: a swap-and-restore bug silently corrupts a request's context. This needs a dedicated correctness harness on real hardware before it can ship — which is precisely the validation we do not yet have.
Neutral
- This ADR is revisited once Meridian has a real-hardware benchmark baseline. Preemption that cannot be measured against stock vLLM under memory pressure cannot be justified, so the work is gated on that measurement capability existing first.
Alternatives considered
Implement preemption now, behind a default-off flag. Considered, so the code path exists for early adopters. Rejected: a default-off feature with no real-hardware correctness harness is untested code that rots, and the risk analysis above shows the failure modes are the kind that only surface under real load.
Delegate entirely to vLLM's preemption forever. Considered as the permanent answer. Rejected as a permanent policy because phase-blind preemption contradicts Meridian's entire thesis; accepted as the pre-1.0 policy because the safety bar for taking over preemption is high.
References
- ADR-0001 (Dual-queue vs. priority weights): the advisory-reordering baseline this ADR declines to extend pre-1.0.
- ADR-0004 (KV tier promotion policy): the tier ordering victim selection reuses.
- ADR-0006 (Disagg KV transfer protocol): the offload path a ThinkComplete victim takes when a fabric is configured.
ADR-NNNN: Short kebab title
- Status: Proposed | Accepted | Superseded by ADR-NNNN | Deprecated
- Date: YYYY-MM-DD
- Authors: name
- Reviewers: name, name
Context
What is the problem we are facing? What constraints apply? What evidence do we have (benchmarks, incident postmortems, prior art)?
Decision
What did we decide? State it as a single declarative sentence at the top, then expand.
Consequences
Positive
- Bullet list of what we gain.
Negative / risks
- Bullet list of what we lose, and how we will detect each risk if it materializes.
Neutral
- Things that change but are neither clearly positive nor negative.
Alternatives considered
For each alternative: one paragraph on what it would look like, and why we rejected it.
References
- Links to research papers, prior ADRs, incident reports, benchmarks.
Metrics
Meridian emits Prometheus metrics and OpenTelemetry traces. All metric names are stable contracts — renames or semantic changes trigger a minor-version bump per ADR-0007.
Metric catalog
meridian.think_tokens_per_request
| Property | Value |
|---|---|
| Type | histogram |
| Unit | tokens |
| Cardinality | 1 series (no labels) |
| Source | PhaseRouter on ExitThink or ForceBudget |
| Why | Tracks the distribution of reasoning-chain lengths. Long tails here indicate the entropy probe is deferring too late; a spike in the P99 bucket means hard-cap forcing is dominating. |
Operator action: if P99 frequently hits max_think_tokens, lower max_think_tokens
or adjust the probe thresholds so forcing fires earlier (relax eat_ema_variance_threshold, lower rpdi_threshold).
meridian.budget_force_triggered
| Property | Value |
|---|---|
| Type | counter |
| Unit | events |
| Cardinality | 1 series |
| Source | PhaseRouter |
| Why | Measures how often the router fires </think> injection. A counter that never moves means the entropy probe is never converging (check eat_ema_variance_threshold). |
meridian.budget_force_reason{reason=...}
| Property | Value |
|---|---|
| Type | counter |
| Unit | events |
| Labels | reason ∈ {converged, overthinking, hard_cap} |
| Cardinality | 3 series |
| Source | PhaseRouter |
| Why | Breaks down why forcing fired. converged and overthinking mean the entropy probe is working. Sustained hard_cap dominance means the probe is failing to detect convergence. |
Operator action: monitor the ratio hard_cap / (converged + overthinking) over
a 1-hour window. A ratio above 0.5 is a signal to investigate probe thresholds or
inspect sample EAT traces.
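The ratio check is easy to get wrong when no adaptive forcing has fired at all, so a small Python sketch may help. It takes the per-reason increases of `meridian.budget_force_reason` over the 1-hour window as a plain dict; the function names and the choice to report an infinite ratio when only `hard_cap` fired are assumptions.

```python
def hard_cap_ratio(counts: dict[str, int]) -> float:
    """hard_cap / (converged + overthinking) for one window of counter increases."""
    adaptive = counts.get("converged", 0) + counts.get("overthinking", 0)
    hard_cap = counts.get("hard_cap", 0)
    if adaptive == 0:
        # No adaptive forcing at all: treat any hard_cap activity as worst case.
        return float("inf") if hard_cap else 0.0
    return hard_cap / adaptive


def needs_probe_investigation(counts: dict[str, int], threshold: float = 0.5) -> bool:
    """True when hard-cap forcing dominates beyond the documented 0.5 ratio."""
    return hard_cap_ratio(counts) > threshold
```

For instance, 30 hard-cap events against 40 converged and 10 overthinking events gives a ratio of 0.6, which is above the threshold.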
meridian.output_critical_eviction
| Property | Value |
|---|---|
| Type | counter |
| Unit | events |
| Cardinality | 1 series |
| Source | PhaseAwareBlockManager |
| Why | Every increment is a user-visible degradation event — a KV block backing the live output stream was evicted under memory pressure. Zero is the target. |
Operator action: alert at rate(...) > 0 sustained for 5 minutes. Mitigate
by lowering think_phase_memory_fraction, think_batch_multiplier, or
max_think_tokens.
meridian.phase_router.tracked_requests
| Property | Value |
|---|---|
| Type | gauge |
| Unit | requests |
| Cardinality | 1 series |
| Source | PhaseRouter |
| Why | Requests that complete but are never reaped leak entries. Monotonically growing gauge means reap_stale_older_than is not being called, or the vLLM plugin's post_step is not receiving EOS events. |
Operator action: if the gauge grows without a corresponding growth in active
concurrent requests, check that post_step receives EOS and that the reap
period (60 s default) is shorter than request lifetime.
meridian.schedule_batch.duration_ns
| Property | Value |
|---|---|
| Type | histogram |
| Unit | nanoseconds |
| Cardinality | 1 series |
| Source | MeridianScheduler::schedule_batch |
| Why | Measures scheduling overhead on the hot path. Should be in the microsecond range; millisecond-range values indicate contention inside the scheduler lock. |
Operator action: P99 above 1 ms under steady load is unexpected. File an issue with a CPU profile.
meridian.queue_depth{queue=...}
| Property | Value |
|---|---|
| Type | gauge |
| Unit | requests |
| Labels | queue ∈ {output, think} |
| Cardinality | 2 series |
| Source | MeridianScheduler |
| Why | Monitors queue backlog. Output queue depth growing without draining means the GPU is bottlenecked. Think queue depth growing without budget_force_triggered activity means the entropy probe is not converging and long chains are piling up. |
Operator action: alert when queue_depth{queue="think"} P95 exceeds 4× its
1-hour baseline for 5 consecutive minutes without accompanying forcing activity.
meridian.block_manager.used_bytes
| Property | Value |
|---|---|
| Type | gauge |
| Unit | bytes |
| Cardinality | 1 series |
| Source | PhaseAwareBlockManager |
| Why | Total KV bytes currently allocated across all tiers. Rising towards kv_memory.capacity_bytes predicts incoming eviction pressure. |
Operator action: alert when block_manager.used_bytes / capacity_bytes exceeds
0.90 for 10 minutes — this is the early-warning threshold before
output_critical_eviction events begin.
meridian.block_manager.evictions{tier=...}
| Property | Value |
|---|---|
| Type | counter |
| Unit | events |
| Labels | tier ∈ {think_complete, think_active, output_critical} |
| Cardinality | 3 series |
| Source | PhaseAwareBlockManager |
| Why | Per-tier eviction rate reveals the shape of memory pressure. think_complete evictions are routine; think_active indicates moderate pressure; output_critical is a user-visible degradation event identical to meridian.output_critical_eviction. |
Operator action: alert on any tier=output_critical increment — use this
series or meridian.output_critical_eviction, whichever is easier to route in
your alerting stack.
meridian.scheduler.batch_size{phase=...}
| Property | Value |
|---|---|
| Type | histogram |
| Unit | slots |
| Labels | phase ∈ {output, think} |
| Cardinality | 2 series |
| Source | MeridianScheduler |
| Why | Distribution of actual batch sizes delivered to the vLLM worker per phase. A consistently small output batch under load means output requests are draining faster than think-phase completions replenish the pool. |
Operator action: compare scheduler.batch_size{phase=output} P50 against
queue_depth{queue=output} to verify output requests are being served promptly.
meridian_disagg_blocks_offloaded_total{fabric=...}
| Property | Value |
|---|---|
| Type | counter |
| Unit | blocks |
| Labels | fabric ∈ {nixl, mooncake} |
| Cardinality | 1 series per active fabric |
| Source | MeridianSchedulerPlugin on ExitThink (flushed at offload_threshold_blocks) |
| Why | Tracks disagg throughput. A counter that never moves when disagg is enabled means offload hooks are not firing. |
meridian_vocab_fallback_total
| Property | Value |
|---|---|
| Type | counter |
| Unit | events |
| Cardinality | 1 series |
| Source | MeridianSchedulerPlugin entropy-probe batch path |
| Why | Counts batches where logit rows had heterogeneous vocab sizes and the probe fell back to per-request compute. A rising counter means mixed-model batching is defeating the batched probe; investigate request routing. |
OTLP export
Prometheus is the primary metric surface. When [telemetry] otlp_enabled = true
(requires the otel extra), the plugin additionally exports its counters to an
OTLP/HTTP collector at [telemetry] otlp_endpoint, and the Rust core can wire
its tracing spans to OTLP via the otel crate feature
(meridian_core::telemetry::install). Both are off by default.
Trace spans
Each MeridianScheduler::schedule_batch call opens a meridian.schedule_batch
OpenTelemetry span. PhaseEvents propagate meridian.phase_event{kind=...}
events on the active request's span, allowing per-request phase timelines to be
reconstructed from trace data.
Alerting summary
| Metric | Alert condition | Severity |
|---|---|---|
| output_critical_eviction | rate > 0 for 5 min | High: user-visible |
| block_manager.evictions{tier=output_critical} | rate > 0 for 1 min | High: user-visible (same event, finer label) |
| block_manager.used_bytes | > 90% of capacity for 10 min | Medium: pre-eviction warning |
| queue_depth{queue=think} | P95 > 4× baseline for 5 min with no forcing | Medium: starvation risk |
| budget_force_reason{reason=hard_cap} | ratio > 0.5 over 1 h | Low: probe investigation |
| phase_router.tracked_requests | monotonically growing > 15 min | Low: reap misconfiguration |
Troubleshooting
Each entry follows runbook format: Symptom → Likely cause → How to verify → Immediate mitigation → Longer-term fix.
Output streams stutter under load
Symptom: users report visible gaps in the output token stream; output ITL P99 spikes.
Likely cause: OutputCritical KV blocks are being evicted under memory pressure.
How to verify:
rate(meridian.output_critical_eviction[1m]) > 0
Any non-zero rate confirms the block manager is evicting user-visible KV.
Immediate mitigation: reduce load. Lower the arrival rate or raise kv_memory.capacity_bytes
if headroom exists.
Longer-term fix (pick one or more):
- Lower kv_memory.think_phase_memory_fraction (e.g. 0.40 → 0.30) to give output more room.
- Lower scheduler.think_batch_multiplier to reduce think-phase KV pressure.
- Lower scheduler.max_think_tokens so individual reasoning chains release blocks sooner.
Budget forcing never fires; chains always hit hard cap
Symptom: meridian.budget_force_reason{reason=hard_cap} increments constantly;
converged and overthinking stay at zero.
Likely cause: the entropy probe is not detecting convergence. Either the probe is disabled, the thresholds are too tight, or the model is being asked to reason on prompts that produce genuinely non-converging chains.
How to verify:
- Check meridian.toml: confirm entropy.enabled = true.
- Capture a sample EAT trace: add logging.level = "debug" temporarily and inspect EAT EMA values in the logs for a known-good reasoning prompt.
Immediate mitigation: none — hard-cap forcing is safe, just less adaptive.
Longer-term fix:
- If probe is disabled: set entropy.enabled = true.
- If thresholds are too tight: relax eat_ema_variance_threshold (e.g. 0.001 → 0.005) or lower rpdi_threshold (e.g. 3.0 → 2.0).
- If prompts are genuinely non-converging: lower max_think_tokens to bound KV cost.
Phase router shows runaway tracked-requests gauge
Symptom: meridian.phase_router.tracked_requests grows monotonically without
a corresponding growth in active requests.
Likely cause: completed requests are not being reaped from the router. Either
post_step is not receiving EOS events, or reap_stale_older_than is not being
called.
How to verify:
- Confirm the vLLM plugin's post_step is wired to the engine's step callback.
- Check the reap interval in the plugin config (default 60 s). If request lifetime is shorter than 60 s on average, the reaper may be lagging.
Immediate mitigation: no manual reap trigger is exposed in v0.1.x. The plugin
calls router.reap_stale_older_than(60.0) automatically every 64 schedule()
invocations; under active serving load this fires within seconds. To force an
immediate reap at deploy time, temporarily lower _reap_interval_schedules in
vllm_plugin.py to 1.
Longer-term fix: reduce the reap period in vllm_plugin.py or ensure post_step
receives every EOS signal from the vLLM worker.
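The counter-based cadence described above (one reap every 64 `schedule()` invocations, with a 60 s staleness cutoff) can be sketched as follows. `ReapCadence` is a hypothetical class written for illustration, not the code in vllm_plugin.py; the `reap` callback stands in for `router.reap_stale_older_than`.

```python
class ReapCadence:
    """Invoke a reap callback once every `interval` schedule() calls."""

    def __init__(self, reap, interval: int = 64, max_age_s: float = 60.0) -> None:
        if interval < 1:
            raise ValueError("interval must be at least 1")
        self._reap = reap            # stand-in for router.reap_stale_older_than
        self._interval = interval    # plays the role of _reap_interval_schedules
        self._max_age_s = max_age_s
        self._calls = 0

    def on_schedule(self) -> bool:
        """Call once per schedule(); returns True when a reap was triggered."""
        self._calls += 1
        if self._calls % self._interval == 0:
            self._reap(self._max_age_s)
            return True
        return False
```

With the interval set to 1, every call triggers a reap, which mirrors the deploy-time workaround of lowering `_reap_interval_schedules` to 1.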
CUDA kernel returns Unavailable
Sprint 0 note: python/meridian/_backends/cuda.py currently delegates to CpuEntropyBackend; KernelError::Unavailable cannot be triggered through the Python probe in v0.1.x. The scenario below applies once Sprint 1 wires the Python backend to the Rust kernel path in crates/meridian-kernels/.
Symptom: logs show KernelError::Unavailable; entropy probe silently falls back
to CPU; meridian.budget_force_reason{reason=hard_cap} is the only firing signal.
Likely cause: the meridian-kernels crate was built without --features cuda,
or libcudart.so is not on the dynamic linker path at runtime.
How to verify:
# Confirm the Python extension loads and report the current backend behaviour.
python -c "
from meridian._backends.cuda import CudaEntropyBackend
b = CudaEntropyBackend()
print('backend name:', b.name)
# Sprint 0: always prints 'cuda' but delegates to CPU internally.
# Real CUDA dispatch requires a --features cuda build (Sprint 1).
"
# Confirm the Rust kernels extension is importable.
python -c "import meridian._meridian; print('native extension OK')"
Immediate mitigation: the CPU fallback is correct — entropy values are identical.
The only cost is CPU cycles and higher schedule_batch.duration_ns latency.
Longer-term fix:
- Rebuild with cargo build -p meridian-kernels --features cuda.
- Verify libcudart.so is on LD_LIBRARY_PATH or in /usr/local/cuda/lib64.
- Run maturin develop --release -m crates/meridian-python/Cargo.toml to regenerate the Python extension.
Disagg offload not firing
Symptom: the meridian_disagg_blocks_offloaded_total counter never increments after
[disagg] enabled = true is set.
Likely cause: the fabric is not reachable, the offload_threshold_blocks has
not been reached, or the NIXL feature was not compiled in.
How to verify:
- Confirm cargo build -p meridian-kernels --features nixl succeeds.
- Confirm config.disagg.fabric != "none".
- Check whether ThinkComplete block count per step is below offload_threshold_blocks.
Immediate mitigation: lower offload_threshold_blocks to 1 temporarily to
force immediate offload on every ExitThink.
Longer-term fix: verify fabric connectivity (NIXL service running, network path
open) and restore offload_threshold_blocks to the production value.
Plugin does not intercept schedule_batch
Symptom: no Meridian metrics appear; vLLM appears to schedule normally without phase-awareness.
Likely cause: MeridianSchedulerPlugin.attach() was never called, or the plugin
was attached to a different scheduler instance than the one the engine uses.
How to verify:
print(type(engine.scheduler[0]))
# Should print: <class 'meridian.vllm_plugin.MeridianSchedulerPlugin'>
# Not: <class 'vllm.core.scheduler.Scheduler'>
Immediate mitigation: call MeridianSchedulerPlugin.attach(engine, cfg, model_key="...")
explicitly after engine construction. attach is a classmethod — it constructs
the plugin and replaces engine.scheduler[0] in one step.
Longer-term fix: verify the plugin initialisation order in the serving entrypoint.
The plugin must be attached after AsyncLLMEngine is fully constructed but before
the first request is submitted.
Benchmarks
The benchmark harness lives at benchmarks/.
The methodology behind metric choice is recorded in
ADR-0005.
Quick start
# CI-friendly: no GPU, no vLLM. Drives native Meridian components over a
# synthetic decoder loop. Finishes in seconds.
uv --project python run python -m benchmarks.meridian_bench synthetic-replay \
--duration-s 30 --arrival-rate 8 --reasoning-ratio 0.4 \
--out-dir bench-out/
# GPU-required: drives a real AsyncLLMEngine via the Meridian plugin.
uv --project python run python -m benchmarks.meridian_bench real-vllm \
--model Qwen/Qwen2.5-0.5B --duration-s 30 --arrival-rate 4 \
--out-dir bench-out/
Both modes produce identically-shaped artefacts in --out-dir:
- report.json: full structured report, diffable.
- report.md: Markdown summary suitable for PR comments.
Metric catalog
| Name | Definition |
|---|---|
| TTFT P50/P95 | Time-to-first-token. Prefill + first decoded token. |
| TTOT P50/P95 | Time from </think> emission to the first user-visible token. |
| Output ITL P50/P95/P99 | Inter-token latency during output phase (streaming jitter). |
| Think tokens avg/P95 | Distribution of reasoning-chain length per request. |
| Output tokens avg | Mean output token count per request. |
| Budget forced % | Percentage of reasoning requests where the router forced </think>. |
| Force reason | Breakdown by converged / overthinking / hard_cap. |
| OutputCritical evictions | KV pressure events that reached the user-visible tier. |
See benchmarks/metrics.py
for the exact serialised shape.
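As an illustration of how two of these metrics are derived, the sketch below computes TTOT from a pair of timestamps and takes percentiles with the nearest-rank method. It is not the code in benchmarks/metrics.py: the function names are hypothetical, and the harness may use an interpolating percentile instead of nearest-rank.

```python
import math


def ttot_ms(think_end_s: float, first_output_s: float) -> float:
    """Milliseconds from </think> emission to the first user-visible token."""
    return (first_output_s - think_end_s) * 1000.0


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile (assumption: the harness may interpolate instead)."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(values)
    # Rank is 1-based; clamp to at least 1 so pct=0 returns the minimum.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


samples = [ttot_ms(1.0, 1.25), ttot_ms(2.0, 2.5), ttot_ms(3.0, 3.125), ttot_ms(4.0, 5.0)]
print(percentile(samples, 50), percentile(samples, 95))  # 250.0 1000.0
```

With only four samples P95 is simply the maximum, which is a reminder that tail percentiles from short runs carry little information.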
A/B comparison mode
--baseline runs the same workload through one or more baseline schedulers
alongside Meridian and writes ab-report.{json,md} to --out-dir:
- stock: StockSchedulerBaseline, a priority-weight single-queue scheduler equivalent to vLLM ≤0.8 (no phase awareness, never forces budget).
- static-budget: StaticBudgetBaseline, a fixed think-token cap equivalent to vLLM 0.9's thinking_token_budget (forces </think> on a counter, with no entropy signal). This is the prior art Meridian's EAT/RPDI forcing aims to beat.
- all: run every baseline; the report gets one value column per run and a Δ% vs <baseline> column per baseline with a WIN / win / FLAT / loss / LOSS flag.
Five-minute A/B (no GPU)
# Stock + static-budget + Meridian over a real prompt-length distribution.
python -m benchmarks.meridian_bench synthetic-replay \
--workload sharegpt --baseline all \
--duration-s 30 --arrival-rate 8 --out-dir bench-out/
# Read the comparison. Meridian should win TTOT P95 vs both baselines.
cat bench-out/ab-report.md
The synthetic-replay mode requires the native extension
(maturin develop -m crates/meridian-python/Cargo.toml); the baseline and
report logic alone are exercised by benchmarks/tests/test_baselines.py
without it.
Dataset loaders
Pass --workload sharegpt or --workload math500 to load real traffic
distributions from HuggingFace. Datasets are downloaded once and cached at
~/.cache/meridian/datasets/. Requires no GPU — the offline replay drives the
synthetic decoder with the real prompt/response length distribution.
Test environment disclosure
When comparing numbers across runs:
| Parameter | Default |
|---|---|
| --seed | 42 |
| --arrival-rate | 8 req/s |
| --duration-s | 30 |
| --reasoning-ratio | 0.4 |
Always report --seed and workload flag. Synthetic results are hardware-independent;
real-vLLM results depend on GPU model, driver, and memory state — disclose all three.
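To show why the seed belongs in every disclosure, here is a sketch of a seeded workload generator using the defaults from the table. It assumes Poisson arrivals and an independent draw per request for the reasoning flag; the source does not describe how synthetic-replay generates traffic, and `synth_arrivals` is a hypothetical name.

```python
import random


def synth_arrivals(seed: int = 42, arrival_rate: float = 8.0,
                   duration_s: float = 30.0, reasoning_ratio: float = 0.4):
    """Return (arrival_time_s, is_reasoning) pairs for one synthetic run.

    Assumes exponential inter-arrival gaps (a Poisson process); this is an
    illustration, not the harness's documented traffic model.
    """
    rng = random.Random(seed)  # private RNG so global random state cannot leak in
    t = 0.0
    requests = []
    while True:
        t += rng.expovariate(arrival_rate)
        if t >= duration_s:
            break
        requests.append((t, rng.random() < reasoning_ratio))
    return requests


run_a = synth_arrivals(seed=42)
run_b = synth_arrivals(seed=42)
run_c = synth_arrivals(seed=43)
print(len(run_a), run_a == run_b, run_a == run_c)
```

Two runs with the same seed see the same request stream, so any difference in the reports comes from the scheduler configuration rather than from the workload.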
How to compare two runs
# Run A (baseline config)
python -m benchmarks.meridian_bench synthetic-replay --seed 42 --out-dir bench-out/a/
# Run B (modified config)
python -m benchmarks.meridian_bench synthetic-replay --seed 42 --out-dir bench-out/b/
# Diff the JSON reports
diff <(jq -S . bench-out/a/report.json) <(jq -S . bench-out/b/report.json)
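When the textual diff gets noisy, a small script can list only the numeric fields that changed. The sketch below walks two parsed reports and computes relative changes. It assumes nothing about the report schema beyond nested JSON objects and arrays, and the field names in the sample dictionaries are invented for illustration.

```python
def flatten(node, prefix=""):
    """Yield (dotted_path, value) for every leaf of a parsed JSON document."""
    if isinstance(node, dict):
        for key in sorted(node):
            yield from flatten(node[key], f"{prefix}{key}.")
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from flatten(item, f"{prefix}{index}.")
    else:
        yield prefix.rstrip("."), node


def numeric_changes(report_a, report_b):
    """Return {path: (a, b, pct_change)} for numeric leaves that differ."""
    leaves_a, leaves_b = dict(flatten(report_a)), dict(flatten(report_b))
    changes = {}
    for path in sorted(leaves_a.keys() & leaves_b.keys()):
        a, b = leaves_a[path], leaves_b[path]
        both_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in (a, b)
        )
        if both_numeric and a != b:
            pct = (b - a) / a * 100.0 if a else float("inf")
            changes[path] = (a, b, pct)
    return changes


# Invented stand-ins for two parsed report.json files.
a = {"ttot": {"p50_ms": 40.0, "p95_ms": 100.0}, "seed": 42}
b = {"ttot": {"p50_ms": 40.0, "p95_ms": 80.0}, "seed": 42}
for path, (old, new, pct) in numeric_changes(a, b).items():
    print(f"{path}: {old} -> {new} ({pct:+.1f}%)")
```

To use it on real output, load each report with `json.load` and pass the two parsed objects to `numeric_changes`.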
What this harness is, and what it isn't
- It is a phase-differentiated latency regression suite. It catches changes that move the TTOT or output-ITL distributions, the metrics Meridian was built to improve.
- It is reproducible: the synthetic-replay mode is deterministic given --seed. Two PRs can be diffed apples-to-apples.
- It is not a raw-throughput benchmark. vLLM's own harness already reports tokens/sec/GPU and that metric does not differentiate Meridian from the baseline. Operators who want a throughput number should run vLLM's benchmark.
- It is not an accuracy benchmark. Budget forcing can in principle hurt reasoning accuracy on hard problems. Accuracy measurement requires a separate ground-truth evaluation suite; this harness does not provide one.
GPU CI runner setup
The cuda.yml workflow targets a self-hosted runner labelled gpu because
GitHub-hosted runners do not provide CUDA-capable GPUs in the free tier.
This page documents how to provision and secure the runner.
Hardware requirements
| Component | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA L4 / A10 | H100 / B200 |
| Driver | 555.x | 555.x or newer |
| CUDA | 12.6 | 12.6 |
| Disk | 80 GiB SSD | 200 GiB NVMe |
| RAM | 32 GiB | 64 GiB |
Required secrets
| Secret | Where set | Purpose |
|---|---|---|
| (none required for GPU runner itself) | — | The runner authenticates via GitHub App token |
| RELEASE_PLZ_TOKEN | Repository secrets | Allows release-plz to push tags |
| CARGO_REGISTRY_TOKEN | Repository secrets | crates.io publish (future) |
The runner registration token is generated once from the GitHub Actions UI and
is not stored as a persistent secret — it expires after one hour and is only
used during ./config.sh.
Provisioning
# On the Linux host with the GPU
# -L follows the redirect that GitHub issues for release downloads
curl -L -O https://github.com/actions/runner/releases/download/v2.319.0/actions-runner-linux-x64-2.319.0.tar.gz
mkdir actions-runner && cd actions-runner
tar xzf ../actions-runner-linux-x64-2.319.0.tar.gz
./config.sh \
--url https://github.com/angelnicolasc/meridian \
--token <REGISTRATION_TOKEN> \
--labels self-hosted,linux,x64,gpu \
--unattended
sudo ./svc.sh install
sudo ./svc.sh start
Verification
Run ./run.sh once interactively. Then trigger the cuda.yml workflow from
a branch and confirm nvidia-smi prints the expected device in the job logs.
Blast radius and fork safety
Self-hosted runners execute arbitrary code from the workflow YAML. Meridian mitigates this with a hard gate on every GPU job:
if: github.repository_owner == 'angelnicolasc'
PRs from forks never trigger the CUDA workflow. Only pushes and PRs from the
angelnicolasc org are eligible.
Who can trigger: repository owners and collaborators with write access.
How to rotate the runner: generate a new registration token from the
GitHub Actions UI, run ./config.sh --replace, restart the service.
See the GitHub documentation on self-hosted runner security for a full threat model.
Security & Supply Chain
Disclosure policy
Security vulnerabilities should be reported privately. See SECURITY.md for the full disclosure process, severity classification, and response SLAs.
Summary: reports are acknowledged within 48 hours, and critical issues target a fix or mitigation within 30 days, as set out in SECURITY.md. Do not open public GitHub issues for unpatched vulnerabilities.
Provenance attestation
Every tagged release (v*) generates a SLSA Level 2 provenance attestation via the slsa-github-generator reusable workflow. The attestation covers:
- The meridian-core and meridian-kernels static libraries.
- The meridian Python wheel.
The attestation is uploaded as a GitHub release asset alongside the release artefacts. To verify:
# Install the SLSA verifier
go install github.com/slsa-framework/slsa-verifier/v2/cli/slsa-verifier@latest
# Download the artefact and its provenance from the GitHub release.
# Then verify:
slsa-verifier verify-artifact meridian-*.whl \
--provenance-path meridian.intoto.jsonl \
--source-uri github.com/angelnicolasc/meridian \
--source-tag v0.1.0
What is attested: the build provenance — that the artefact was built from the tagged source in the GitHub Actions environment. SLSA L2 does not attest to the security of the code itself.
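To see which artefacts a provenance file names, the subject digests can be read out and compared with a locally computed hash. The sketch below assumes each line of the .intoto.jsonl file is a DSSE envelope whose base64 payload decodes to an in-toto statement; that layout is an assumption about the generated file, and the helper names are hypothetical. It does not check the signature, so it complements slsa-verifier rather than replacing it.

```python
import base64
import hashlib
import json


def subject_digests(envelope_line: str) -> dict[str, str]:
    """Map subject name -> sha256 from one provenance line.

    Assumes a DSSE envelope whose base64 `payload` decodes to an in-toto
    statement carrying a `subject` list.
    """
    envelope = json.loads(envelope_line)
    statement = json.loads(base64.b64decode(envelope["payload"]))
    return {s["name"]: s["digest"]["sha256"] for s in statement["subject"]}


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 of a byte string, as recorded in the subject digest."""
    return hashlib.sha256(data).hexdigest()


# Self-contained demo with a fabricated artefact and envelope.
artefact = b"not a real wheel"
statement = {"subject": [{"name": "demo.whl", "digest": {"sha256": sha256_hex(artefact)}}]}
line = json.dumps({
    "payloadType": "application/vnd.in-toto+json",
    "payload": base64.b64encode(json.dumps(statement).encode()).decode(),
})
print(subject_digests(line)["demo.whl"] == sha256_hex(artefact))
```

A digest match only shows that the file is the one the attestation describes; trust in the attestation itself still comes from the signature check performed by slsa-verifier.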
Software Bill of Materials (SBOM)
Each release includes a CycloneDX SBOM covering:
- The Rust workspace (all transitive crate dependencies).
- The Python wheel (Python package dependencies from pyproject.toml).
The SBOM is attached as a .cdx.json asset on the GitHub release. Operators
can use it with vulnerability scanning tools (Grype, Trivy, FOSSA).
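Before handing the SBOM to a scanner, it can be useful to list what it contains. The sketch below reads the `components` array of a CycloneDX JSON document and prints name, version, and package URL. The sample document and its component names are made up for this example and say nothing about Meridian's real dependencies.

```python
import json


def list_components(sbom: dict) -> list[tuple[str, str, str]]:
    """Return sorted (name, version, purl) triples from a CycloneDX JSON document."""
    rows = []
    for component in sbom.get("components", []):
        rows.append((
            component.get("name", ""),
            component.get("version", ""),
            component.get("purl", ""),
        ))
    return sorted(rows)


# Tiny hand-written SBOM for demonstration; real files are far larger.
sample = json.loads("""
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.5",
  "components": [
    {"type": "library", "name": "example-pkg", "version": "2.0.0",
     "purl": "pkg:pypi/example-pkg@2.0.0"},
    {"type": "library", "name": "example-crate", "version": "1.0.0",
     "purl": "pkg:cargo/example-crate@1.0.0"}
  ]
}
""")
for name, version, purl in list_components(sample):
    print(f"{name:15} {version:8} {purl}")
```

The package URL prefix (`pkg:cargo/` or `pkg:pypi/`) is a quick way to tell Rust workspace entries from Python wheel entries in the same file.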
Supply-chain controls
| Control | Mechanism |
|---|---|
| Dependency pinning | Cargo.lock and uv.lock committed and verified in CI |
| Dependency auditing | cargo deny check in the supply-chain CI job (licence + advisory check) |
| GitHub Actions pinning | Actions pinned to major version tags in all workflows |
| Self-hosted runner isolation | GPU runner gated to github.repository_owner == 'angelnicolasc' |
| DCO sign-off | All commits require Signed-off-by matching the commit author |
| Release provenance | SLSA L2 via slsa-github-generator |
Dependency policy
New dependencies require:
- A licence compatible with Apache-2.0 (verified by cargo deny).
- No known CVEs at the time of merge (verified by cargo deny advisories check).
- An entry in the SBOM at the next release.
CI workflow permissions
All CI workflows run with minimal permissions:
| Workflow | Permissions |
|---|---|
| ci.yml | contents: read |
| release.yml | contents: write, pull-requests: write, id-token: write, attestations: write |
| sbom.yml | contents: write |
| docs.yml | contents: read, pages: write, id-token: write |
| cuda.yml | contents: read |
release.yml permission notes:
- id-token: write is required by slsa-github-generator to mint the OIDC-backed provenance token; scoped to the build-artifacts and provenance jobs.
- pull-requests: write is required by release-plz to open the automated release PR.
- attestations: write is required by slsa-github-generator to upload the attestation bundle as a release asset.
Contributor Covenant Code of Conduct
This project adopts the Contributor Covenant, version 2.1 as its code of conduct.
Our Pledge
We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.
Our Standards
Examples of behavior that contributes to a positive environment include:
- Demonstrating empathy and kindness toward other people.
- Being respectful of differing opinions, viewpoints, and experiences.
- Giving and gracefully accepting constructive feedback.
- Accepting responsibility and apologizing to those affected by our mistakes.
- Focusing on what is best for the overall community.
Unacceptable behavior includes:
- Sexualized language or imagery, and sexual attention or advances of any kind.
- Trolling, insulting or derogatory comments, and personal or political attacks.
- Public or private harassment.
- Publishing others' private information without explicit permission.
- Other conduct which could reasonably be considered inappropriate in a professional setting.
Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be reported confidentially to the project maintainer at nick.dicerutti@gmail.com. All complaints will be reviewed and investigated promptly and fairly.
For the full text of the Code of Conduct including enforcement guidelines, see https://www.contributor-covenant.org/version/2/1/code_of_conduct/.
Contributing to Meridian
Thank you for considering a contribution. Meridian is an inference-time compute scheduler for reasoning-model serving — correctness, performance and clarity of contracts all matter at the same time. This document describes the bar.
Code of Conduct
Participation is governed by the Contributor Covenant 2.1. Report violations to nick.dicerutti@gmail.com.
Developer Certificate of Origin (DCO)
Meridian uses the DCO instead of a CLA. Every commit must be signed off:
git commit -s -m "feat(core): add RPDI ratio computation"
This appends Signed-off-by: Your Name <you@example.com>, which certifies that
you have the right to submit the change under the project license. PRs without
sign-off will not merge — the DCO check blocks them.
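What the sign-off rule enforces can be sketched in a few lines. The check below looks for a Signed-off-by trailer whose name and email match the commit author. It is a simplified illustration with a hypothetical function name, not the DCO check that actually runs on PRs.

```python
import re

# One trailer per line: "Signed-off-by: Name <email>"
SIGNOFF_RE = re.compile(
    r"^Signed-off-by: (?P<name>.+) <(?P<email>[^<>]+)>$", re.MULTILINE
)


def has_matching_signoff(message: str, author_name: str, author_email: str) -> bool:
    """True if any Signed-off-by trailer matches the commit author."""
    for match in SIGNOFF_RE.finditer(message):
        if (match["name"].strip() == author_name
                and match["email"].strip().lower() == author_email.lower()):
            return True
    return False


message = (
    "feat(core): add RPDI ratio computation\n"
    "\n"
    "Signed-off-by: Your Name <you@example.com>\n"
)
print(has_matching_signoff(message, "Your Name", "you@example.com"))
```

Because the trailer is derived from `user.name` and `user.email`, a mismatch usually means the local git identity differs from the one used to author the commit.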
Commit conventions
We use Conventional Commits with the following scopes:
| Scope | Used for |
|---|---|
| core | crates/meridian-core/ |
| kernels | crates/meridian-kernels/ (CUDA + FFI) |
| python | crates/meridian-python/ and python/meridian/ |
| ci | .github/workflows/, devcontainer, scripts |
| docs | docs/, README, NOTICE |
| adr | new or modified ADRs only |
| deps | dependency bumps |
| bench | benchmarks/ |
Title must be under 72 characters. Breaking changes use feat!: / fix!: and
include a BREAKING CHANGE: footer.
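The title rules above can be checked mechanically. The sketch below validates a commit title against the scope table and the 72-character limit. The list of accepted types is an assumption (the source only mentions feat and fix), and the project's commitlint configuration remains the authority.

```python
import re

SCOPES = {"core", "kernels", "python", "ci", "docs", "adr", "deps", "bench"}
# Assumed type list; the project's commitlint config is the real authority.
TYPES = {"feat", "fix", "docs", "refactor", "perf", "test", "build", "ci", "chore", "revert"}

TITLE_RE = re.compile(
    r"^(?P<type>[a-z]+)(\((?P<scope>[a-z]+)\))?(?P<breaking>!)?: (?P<subject>\S.*)$"
)


def check_title(title: str) -> list[str]:
    """Return a list of problems; an empty list means the title passes."""
    if len(title) >= 72:
        return ["title must be under 72 characters"]
    match = TITLE_RE.match(title)
    if match is None:
        return ["title does not match type(scope): subject"]
    problems = []
    if match["type"] not in TYPES:
        problems.append(f"unknown type {match['type']!r}")
    if match["scope"] is not None and match["scope"] not in SCOPES:
        problems.append(f"unknown scope {match['scope']!r}")
    return problems


for title in ("feat(core): add RPDI ratio computation",
              "fix!: drop legacy flag",
              "feat(gpu): x"):
    print(title, "->", check_title(title) or "ok")
```

The scope is treated as optional here because the breaking-change examples (feat!: and fix!:) carry none; a stricter policy would reject titles without a scope.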
Branch protection
main is protected. PRs require:
- Linear history (rebase, no merge commits).
- CI green: cargo fmt --check, clippy -D warnings, cargo nextest, ruff, mypy --strict, pytest -m "not gpu", mdbook build.
- DCO sign-off on every commit.
- Conventional Commit title.
- At least one approving review.
Local development
./scripts/dev-up.sh # devcontainer + sanity checks
./scripts/ci-local.sh # mirrors CI matrix locally
Pre-commit hooks (rustfmt, clippy, ruff, mypy, commitlint) are configured in
.pre-commit-config.yaml. Install with pre-commit install --install-hooks.
Test strategy
- Pure Rust logic lives in meridian-core and must be covered by unit tests plus, where state machines are involved, proptest-based property tests.
- CUDA correctness lives in meridian-kernels and is verified against the reference CPU implementation in Python.
- Anything that crosses the FFI boundary is exercised from Python tests (python/tests/).
- GPU-dependent tests are marked @pytest.mark.gpu; CI runs them only on the GPU job.
What lands in main
A change is mergeable when:
- Tests cover the new code path and existing tests still pass.
- Public API additions have rustdoc / docstrings with examples.
- Behavior changes that affect operators (config defaults, metric names, exposed traits) are recorded in an ADR.
- CHANGELOG.md is updated under ## [Unreleased] (handled automatically by release-plz for routine changes, manually for breaking ones).
Getting help
Open a GitHub Discussion for design questions, an issue for bugs.
Security Policy
Supported versions
Meridian is in pre-1.0 development. Only the latest tagged release receives security fixes. After 1.0, the latest minor of the latest two majors will be supported.
Reporting a vulnerability
Do not file a public issue for security problems. Use GitHub's Private Vulnerability Reporting or email nick.dicerutti@gmail.com with the subject line [meridian-security].
You should expect:
| Stage | Target time |
|---|---|
| Acknowledgement | 48 hours |
| Initial assessment | 5 business days |
| Fix or mitigation | 30 days (critical), 90 days (high), best-effort otherwise |
| Public disclosure | Coordinated, default 90 days after fix is available |
We credit reporters in the release notes unless anonymity is requested.
Scope
In scope for security reporting:
- Memory safety in meridian-core and meridian-kernels FFI boundary: any reachable UB, out-of-bounds access, double-free, use-after-free.
- CUDA kernel safety: buffer overruns, races on shared memory, illegal memory access reachable from sane inputs.
- Deserialization of meridian.toml, model configs, request payloads: panics on adversarial input, type confusion.
- Denial of service: pathological requests that crash the scheduler or exhaust KV memory irrecoverably.
- Supply chain: compromised crate/wheel that ships under the Meridian name.
Out of scope:
- Vulnerabilities in upstream dependencies (file with the upstream project; we will track and bump promptly).
- Misconfigurations of operator-controlled deployments (e.g. exposing the Prometheus endpoint publicly).
- Reasoning quality degradation when budget forcing is misconfigured — this is a correctness concern, not a security one.
Hardening notes
- meridian-core denies unsafe_op_in_unsafe_fn workspace-wide.
- The CUDA FFI boundary in meridian-kernels is the only unsafe surface and is reviewed for every change.
- Releases are signed via Sigstore cosign; artifact provenance is generated via SLSA Level 2.