Introduction

Meridian is an inference-time compute scheduler for reasoning-model serving. It treats think-decode and output-decode as separate scheduling domains with separate SLOs, separate KV eviction priorities, and real-time entropy-driven budget control.

Why this exists

Reasoning models (DeepSeek-R1, Qwen3, Qwen2.5, o3) emit two structurally different token sequences within a single request:

[prompt] → <think> ... N reasoning tokens ... </think> → [output tokens]

These two phases have opposite latency profiles:

| Phase | User-visible latency tolerance | Throughput importance | Correct SLO target |
|---|---|---|---|
| Think-decode | High (user waits regardless) | Critical (cost driver) | TPOT-relaxed |
| Output-decode | Zero (streaming experience) | Secondary | TTOT-strict |

Standard continuous-batching schedulers (including vLLM's default) process all decode tokens — thinking and output — from the same priority queue with the same TPOT target. Meridian is the scheduling layer that knows the difference.

What Meridian does

  1. Dual-queue scheduling. Output-phase requests have absolute priority. Think-phase requests fill remaining capacity with a larger effective batch token budget.
  2. Phase-aware KV block manager. Three-tier eviction: ThinkComplete < ThinkActive < OutputCritical. Blocks from a completed reasoning chain are demoted the moment </think> is emitted.
  3. Entropy-driven budget forcing. EAT (arXiv:2509.26522) and RPDI (arXiv:2603.14251) signals inject </think> only when the model itself is signalling convergence or overthinking — not on a static token counter.
  4. Drop-in vLLM plugin. No fork required. Wraps the existing scheduler via the plugin interface; exposes Prometheus + OpenTelemetry telemetry.
  5. Disagg KV transfer. offload_block / ingest_block hooks on the block manager support prefill-decode disaggregation fabrics (NIXL, Mooncake-compatible). Documented in ADR-0006.

Scope and assumptions

  • In scope: latency-differentiated scheduling for reasoning models served via vLLM on a single node or a disaggregated prefill-decode topology.
  • Out of scope: model training, quantisation, speculative decoding, or serving systems other than vLLM. See Non-goals.
  • Assumed: the serving stack emits per-request token IDs to a hook point that Meridian can intercept (the vLLM plugin interface satisfies this).
  • Assumed: think/output phase boundaries are detectable from the decoded token stream via model-specific boundary token IDs (configurable per model in meridian.toml).

Threats to validity

  • Synthetic benchmark results are directional, not absolute. The synthetic-replay harness simulates latency with a calibrated decoder; it does not model multi-tenant memory pressure or real CUDA kernel variance. Results should be reproduced on target hardware before drawing conclusions.
  • Budget forcing can affect accuracy. Injecting </think> early may shorten correct reasoning chains. Meridian fires forcing only on entropy convergence signals, but the threshold is a tunable heuristic, not a guarantee. Accuracy measurement is the operator's responsibility.
  • Phase detection depends on token IDs. If a model tokenises <think> / </think> differently than the configured boundary IDs, Meridian treats the entire request as single-phase. The models/*.toml files in the repo carry vetted IDs for supported models.

How to read this book

  • Architecture — component map, per-component contracts, failure modes, and observability hooks.
  • ADRs — every architectural decision with the rejected alternatives. Read these alongside the code to understand the why.
  • Configuration — every meridian.toml field with type, default, valid range, and tuning guidance.
  • API reference — Rust and Python surfaces with lifecycle and concurrency notes.
  • Operations — metrics catalogue, alerting recommendations, troubleshooting runbooks, and benchmark methodology.
  • Non-goals — what Meridian explicitly does not do.
  • Glossary — definitions for TTFT, TTOT, ITL, EAT, RPDI, KV, disagg, NIXL, and other terms used throughout.

Architecture

Incoming requests
      │
      ▼
┌─────────────────────────────────────────────────────────┐
│                    Meridian Daemon                        │
│                                                           │
│  ┌──────────────┐    ┌────────────────────────────────┐  │
│  │   Prefill    │───▶│        Phase Router             │  │
│  │   Executor   │    │  (token stream state machine)   │  │
│  └──────────────┘    └───────────┬─────────────────────┘  │
│                                  │                         │
│                    ┌─────────────┴──────────────┐          │
│                    │                            │          │
│        ┌───────────▼──────────┐  ┌─────────────▼───────┐  │
│        │   Think-Decode       │  │   Output-Decode      │  │
│        │   Scheduler          │  │   Scheduler          │  │
│        │                      │  │                      │  │
│        │  TPOT: relaxed       │  │  TTOT: strict SLO    │  │
│        │  Batch: 2.5× larger  │  │  Batch: standard     │  │
│        │  Entropy probe live  │  │  Stream priority     │  │
│        │  Budget force ready  │  │                      │  │
│        └──────────┬───────────┘  └────────┬─────────────┘  │
│                   │                       │                 │
│                   └──────────┬────────────┘                 │
│                              │                              │
│               ┌──────────────▼─────────────┐               │
│               │   Phase-Aware KV Block Mgr  │               │
│               │                             │               │
│               │  Tier 0: ThinkComplete      │               │
│               │  Tier 1: ThinkActive        │               │
│               │  Tier 2: OutputCritical     │               │
│               └─────────────────────────────┘               │
└────────────────────────────┬────────────────────────────────┘
                             │
                        vLLM worker
                    (decode kernel, KV store)

Phase state machine

The Phase Router advances each request through this machine on every decoded token. ForceBudget is emitted as a side effect (the request stays in ThinkDecode until </think> is observed or injected).

stateDiagram-v2
    [*] --> Prefill
    Prefill --> ThinkDecode: think_start id  /  EnterThink
    ThinkDecode --> ThinkDecode: token  /  update EAT + RPDI
    ThinkDecode --> ThinkDecode: converged or overthinking  /  ForceBudget
    ThinkDecode --> OutputDecode: think_end id  /  ExitThink
    OutputDecode --> Complete: eos id  /  Complete
    Complete --> [*]
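
The transitions above can be sketched in a few lines of Python. This is a minimal illustration and not the Rust implementation in phase_router.rs: the class names and the token IDs (1, 2, 0) are hypothetical placeholders, and the ForceBudget side effect is left out.

```python
from enum import Enum, auto
from typing import Optional


class Phase(Enum):
    PREFILL = auto()
    THINK_DECODE = auto()
    OUTPUT_DECODE = auto()
    COMPLETE = auto()


class PhaseMachine:
    """Per-request state machine; token IDs are illustrative placeholders."""

    def __init__(self, think_start: set[int], think_end: set[int], eos: set[int]):
        self.think_start = think_start
        self.think_end = think_end
        self.eos = eos
        self.phase = Phase.PREFILL

    def step(self, token_id: int) -> Optional[str]:
        """Advance on one decoded token; return an event name or None."""
        if self.phase is Phase.PREFILL and token_id in self.think_start:
            self.phase = Phase.THINK_DECODE
            return "EnterThink"
        if self.phase is Phase.THINK_DECODE and token_id in self.think_end:
            self.phase = Phase.OUTPUT_DECODE
            return "ExitThink"
        if self.phase is Phase.OUTPUT_DECODE and token_id in self.eos:
            self.phase = Phase.COMPLETE
            return "Complete"
        return None  # ordinary token: no phase change


# Placeholder IDs: 1 = think start, 2 = think end, 0 = EOS.
machine = PhaseMachine(think_start={1}, think_end={2}, eos={0})
events = [machine.step(t) for t in [1, 7, 8, 2, 9, 0]]
```

Each call does a constant number of set lookups, which is in the spirit of the O(1)-per-token constraint described for the Phase Router.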

Disaggregated offload sequence

When a fabric is configured, ExitThink triggers a batched offload of the request's think-complete blocks. Each offloaded block is framed, pushed to the fabric, and its local slot is reclaimed (see ADR-0006).

sequenceDiagram
    participant R as PhaseRouter
    participant S as Scheduler / Plugin
    participant B as BlockManager
    participant F as Fabric (NIXL / Mooncake)
    R->>S: ExitThink(req, tokens_used)
    S->>B: demote_think_blocks(req)
    S->>B: blocks_for_request(req)
    B-->>S: [block_ids]
    loop batched at offload_threshold_blocks
        S->>B: offload_block(id)
        B->>F: push(encode(tier, body))
        F-->>B: handle
        B->>B: free_block_by_id(id)
    end
    Note over S,F: meridian_disagg_blocks_offloaded_total += n

Components

Phase Router

Inputs: raw token IDs emitted per step, per request ID.
Outputs: PhaseEvent stream (EnterThink, ExitThink, ForceBudget, BudgetForceReason).
Hot-path constraint: O(1) per token, zero heap allocation in the common case. Backed by DashMap with sharded locking — see ADR-0003.
Failure mode: if a request is never reaped, its entry leaks in the map. reap_stale_older_than(Duration) removes entries older than a wall-clock threshold; the vLLM plugin calls this on every batch step.
Observability: meridian.phase_router.tracked_requests gauge.

Source: crates/meridian-core/src/phase_router.rs.


Dual-Queue Scheduler

Inputs: a pool of pending requests tagged by their current phase.
Outputs: two ordered lists — one output-phase batch (drains first), one think-phase batch (fills remaining capacity).
Hot-path constraint: a single pass over both queues per schedule_batch call. No per-token work.
Invariant: output-phase requests are never starved. The think queue only receives tokens after the output queue is drained or SLO-budget-limited.
Failure mode: if think_batch_multiplier is set too high relative to GPU capacity, output ITL variance increases. meridian.queue_depth{queue=think} growing without accompanying budget_force_triggered activity is the signal.
Observability: meridian.schedule_batch.duration_ns, meridian.queue_depth.

See ADR-0001 for the design alternative this rejects.

Source: crates/meridian-core/src/scheduler.rs.
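
As a rough illustration of the drain-first rule, the sketch below fills an output batch before admitting any think-phase work. It is not the code in scheduler.rs: the `Pending` record, the token accounting, and the reading of "remaining capacity" as `output_budget * think_multiplier` minus what the output batch used are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Pending:
    req_id: int
    phase: str   # "output" or "think"
    tokens: int  # tokens this request would add to the batch


def schedule_batch(pending, output_budget, think_multiplier=2.5):
    """Return (output_batch, think_batch) as lists of request IDs."""
    output_batch, think_batch = [], []
    used = 0
    # Output-phase requests drain first, up to the output token budget.
    for p in pending:
        if p.phase == "output" and used + p.tokens <= output_budget:
            output_batch.append(p.req_id)
            used += p.tokens
    # Think-phase requests fill what remains of the larger think budget.
    think_budget = int(output_budget * think_multiplier)
    for p in pending:
        if p.phase == "think" and used + p.tokens <= think_budget:
            think_batch.append(p.req_id)
            used += p.tokens
    return output_batch, think_batch


queue = [
    Pending(1, "think", 40),
    Pending(2, "output", 30),
    Pending(3, "think", 70),
    Pending(4, "output", 30),
]
output_ids, think_ids = schedule_batch(queue, output_budget=64)
```

With a 64-token output budget, both output requests are admitted (60 tokens), request 1 fits inside the 160-token think budget, and request 3 is deferred to a later batch.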


Phase-Aware Block Manager

Inputs: allocate(request_id, tier) and evict_for(required_blocks) calls from the vLLM KV allocator path.
Outputs: block IDs; eviction decisions ordered by tier.
Invariant: ThinkComplete blocks are always evicted before ThinkActive; OutputCritical blocks are evicted last and only under sustained pressure.
Failure mode: OutputCritical eviction is a user-visible degradation event (stream stutter). Every such event increments meridian.output_critical_eviction. Alert on any increment in a 5-minute window.
Disagg surface: offload_block(block_id) and ingest_block(bytes, tier) are available when a disagg fabric is configured — see ADR-0006.
Observability: meridian.output_critical_eviction counter.

Source: crates/meridian-core/src/block_manager.rs.
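
The tier ordering in the invariant can be shown with a small sketch. The tier names follow the document; the dict-based block table and the tie-break on block ID are assumptions for illustration, not the data structures used in block_manager.rs.

```python
from enum import IntEnum


class BlockTier(IntEnum):
    THINK_COMPLETE = 0   # evicted first
    THINK_ACTIVE = 1
    OUTPUT_CRITICAL = 2  # evicted last


def evict_for(blocks: dict[int, BlockTier], required_blocks: int) -> list[int]:
    """Pick up to required_blocks block IDs, lowest tier first, and remove them."""
    # Sort by (tier, block_id) so the order is deterministic within a tier.
    victims = sorted(blocks, key=lambda b: (blocks[b], b))[:required_blocks]
    for b in victims:
        del blocks[b]
    return victims


resident = {
    10: BlockTier.OUTPUT_CRITICAL,
    11: BlockTier.THINK_ACTIVE,
    12: BlockTier.THINK_COMPLETE,
    13: BlockTier.THINK_COMPLETE,
}
evicted = evict_for(resident, 3)
```

Asking for three blocks takes both ThinkComplete blocks and then the ThinkActive block; the OutputCritical block survives.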


Entropy Probe

Inputs: raw logit vector (fp32, bf16, or fp16) from a completed forward pass.
Outputs: EntropySignal — per-token entropy (nats), EAT value, EAT EMA, EAT EMA variance, RPDI local/global ratio.
Hot-path constraint: designed to run on a dedicated secondary CUDA stream; must not stall the generation stream. In Sprint 0 both paths use the NumPy reference; python/meridian/_backends/cuda.py defines the CUDA backend interface and delegates to CPU until Sprint 1 wires it to the Rust kernels in crates/meridian-kernels/.
Invariant: CPU and CUDA backends must agree within atol=1e-5 on the same logit vector. Enforced by crates/meridian-kernels/tests/kernel_correctness.rs.
Failure mode: if the kernel returns Unavailable, the system falls back to count-only budget forcing (hard_cap on every termination). This is safe but loses entropy-driven adaptivity.
Observability: signals surface through meridian.budget_force_reason.

Sources:


vLLM Plugin

Inputs: vLLM Scheduler instance at attach time; schedule_batch calls at runtime.
Outputs: reordered batch with output-phase requests drained first; injected </think> tokens on budget-force events; disagg offload calls on ExitThink.
Constraint: no vLLM fork required. The plugin wraps the existing scheduler via attribute delegation; unknown attributes fall through to the wrapped scheduler so vLLM internals work unmodified. MeridianSchedulerPlugin.attach() is a classmethod that installs the plugin as engine.scheduler[0] — no separate detach() is provided in v0.1.x.
Failure mode: if the plugin raises during schedule_batch, it re-raises to the vLLM worker, which surfaces as a serving error for that batch. Errors in the disagg offload path are caught and logged; they do not block generation.
Observability: all Phase Router, Dual-Queue Scheduler, and Block Manager metrics (including meridian.block_manager.*, meridian.queue_depth, and meridian.schedule_batch.*), plus meridian_disagg_blocks_offloaded_total and meridian_vocab_fallback_total emitted by the plugin (Prometheus, and OTLP when [telemetry] is enabled).

Source: python/meridian/vllm_plugin.py.
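
The attribute-delegation pattern mentioned under "Constraint" is a standard Python idiom. The sketch below shows the idiom only: `WrappedScheduler` and `DelegatingPlugin` are hypothetical stand-ins, not vLLM or Meridian classes, and the reordering rule is a toy.

```python
class WrappedScheduler:
    """Stand-in for the real scheduler; not a vLLM class."""

    def schedule(self):
        return ["req-think", "req-output"]

    def get_num_unfinished(self):
        return 2


class DelegatingPlugin:
    """Overrides one method and forwards everything else to the wrapped object."""

    def __init__(self, inner):
        self._inner = inner

    def schedule(self):
        # Toy rule for the sketch: put output-phase work first.
        return sorted(self._inner.schedule(), key=lambda r: "output" not in r)

    def __getattr__(self, name):
        # Called only when normal lookup fails, so the override above wins.
        return getattr(self._inner, name)


plugin = DelegatingPlugin(WrappedScheduler())
reordered = plugin.schedule()              # handled by the plugin
unfinished = plugin.get_num_unfinished()   # falls through to the wrapped scheduler
```

Because `__getattr__` is consulted only after normal attribute lookup fails, callers that reach for attributes the wrapper does not define see the wrapped object's behaviour unchanged.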

Non-goals

Meridian has explicit scope boundaries. Documenting what it does not do is as important as documenting what it does — it prevents inflated expectations and makes the design auditable.

Not a serving engine

Meridian is a scheduling layer. It requires vLLM (or a compatible serving backend) underneath it. It does not implement:

  • Model loading or weight management.
  • Prefill execution (prompt processing).
  • Decode kernel scheduling (CUDA streams, tensor parallelism).
  • Tokenisation or detokenisation.
  • HTTP/gRPC serving interfaces.

Not a throughput optimiser

Meridian optimises latency differentiation — protecting output-phase inter-token latency at the cost of slightly reduced think-phase throughput. It does not optimise:

  • Raw tokens/sec/GPU throughput. For throughput, run vLLM's own scheduler.
  • Batch filling efficiency in non-reasoning workloads.
  • Speculative decoding, continuous batching variants, or chunked prefill.

Not an accuracy guarantee

Budget forcing injects </think> based on entropy signals, not correctness. This can shorten reasoning chains on hard problems. Meridian does not:

  • Measure or guarantee reasoning accuracy.
  • Validate that forced-short chains produce correct answers.
  • Implement any feedback loop between output quality and forcing thresholds.

Operators running accuracy-sensitive workloads should validate that their eat_ema_variance_threshold and rpdi_threshold settings do not degrade task accuracy on representative prompts.

Not a multi-model router

Meridian schedules within a single vllm.AsyncLLMEngine instance. It does not:

  • Route requests across multiple models.
  • Implement A/B model experiments.
  • Manage model replicas or horizontal scaling.

Not a vLLM fork

Meridian is a plugin, not a fork. It does not modify vLLM internals and does not ship a patched vLLM binary. Compatibility with a specific vLLM version is the operator's responsibility. See the Compatibility Matrix.

Not production-certified

v0.1.0 is an early release. It has not been validated under production traffic at scale. Known gaps:

  • Real-vLLM end-to-end results exist only for Qwen/Qwen2.5-0.5B in the repo's GPU CI workflow.
  • The disagg fabric integration uses a synthetic in-process mock when libnixl is not available; real NIXL interop has not been exercised in CI.
  • No back-pressure mechanism exists if the disagg fabric becomes unavailable.

Compatibility Matrix

Runtime requirements

| Component | Minimum | Tested |
|---|---|---|
| Linux | Ubuntu 22.04 | Ubuntu 24.04 |
| WSL2 | WSL2 on Windows 10/11 | Windows 10 22H2 |
| Rust toolchain | 1.85.0 | 1.85.0 (pinned) |
| Python | 3.11 | 3.11 |
| vLLM | 0.9.0 | 0.21.0 (resolved in uv.lock) |
| NVIDIA driver | 555.x | 555.x |
| CUDA toolkit | 12.6 | 12.6 |
| CUDA Compute Capability | 8.0 (A100) | 8.0+ |

Build requirements

| Tool | Version |
|---|---|
| cargo + rustup | Rust 1.85.0 |
| maturin | latest (install via pip install maturin) |
| uv | 0.4+ |
| nvcc | 12.6 (only for --features cuda) |
| mdbook | latest (only for docs) |

Model compatibility

Models that have been verified to work with Meridian's phase detection:

| Model family | Boundary detection | Config |
|---|---|---|
| DeepSeek-R1 | Token IDs [128799, 128800] | models/deepseek_r1.toml |
| Qwen3 / Qwen2.5 | Token IDs [151648, 151649] | models/qwen3.toml |
| IBM Granite 3.2 | Prose markers (no distinct token IDs) | models/granite_3_2.toml |

Models that are not verified to work:

  • Models with non-standard <think> tokenisation not listed above — configure think_start_token_ids / think_end_token_ids manually and validate with a sample prompt before production use.
  • Models served through streaming APIs (e.g. Claude via Anthropic API) — Meridian requires direct access to the logit vector, which API-served models do not expose.

Feature flags

| Feature flag | Requires | Status |
|---|---|---|
| (default, no flags) | Linux, Rust | Fully CI-tested |
| --features prometheus | prometheus crate | CI-tested |
| --features unstable | Rust nightly-gated APIs | CI-tested |
| --features nixl | libnixl.so on deploy host | Compiles; integration tested with synthetic mock |
| --features cuda | nvcc, CUDA 12.6, GPU at runtime | Build-tested on GPU CI runner |

CI coverage

| Job | Platform | GPU | Status |
|---|---|---|---|
| rust-core | ubuntu-24.04 | No | CI |
| rust-kernels (stub) | ubuntu-24.04 | No | Same badge |
| python | ubuntu-24.04 | No | Same badge |
| docs | ubuntu-24.04 | No | Same badge |
| supply-chain | ubuntu-24.04 | No | Same badge |
| cuda-build | self-hosted gpu runner | Yes | Runs only on angelnicolasc org pushes |

The GPU jobs are gated to prevent arbitrary code execution on the self-hosted runner from fork PRs. See GPU CI runner setup.

Known incompatibilities

  • vLLM below 0.9.0: the dependency constraint is vllm>=0.9.0. Earlier versions are not supported and will be rejected at install time.
  • Windows (native): the Rust workspace builds on Windows (tested in development), but the Python extension and benchmarks require Linux or WSL2 for the CUDA and maturin paths.
  • macOS: not supported. CUDA is not available on macOS.

Deployment Model

Single-node deployment

The primary and most-tested deployment topology. One AsyncLLMEngine instance, one Meridian plugin, all on the same GPU node.

┌──────────────────────────────────────────────┐
│   GPU node                                    │
│                                               │
│  vLLM AsyncLLMEngine                         │
│   └── MeridianSchedulerPlugin (attached)     │
│        ├── PhaseRouter                       │
│        ├── MeridianScheduler                 │
│        └── PhaseAwareBlockManager            │
└──────────────────────────────────────────────┘

Prerequisites: Linux (or WSL2), NVIDIA driver 555+, CUDA 12.6, vLLM ≥ 0.9.0 (pip install "meridian[vllm]" resolves to 0.21.0 via uv.lock).

Configuration: standard meridian.toml with [disagg] enabled = false.

Disaggregated prefill-decode

Experimental. Requires NIXL-capable infrastructure. The block manager's offload_block / ingest_block hooks transfer ThinkComplete KV blocks to a remote decode node after </think> is emitted.

┌─────────────────┐         NIXL fabric          ┌─────────────────┐
│  Prefill node   │  ──── KV block transfer ────▶ │  Decode node    │
│                 │                               │                 │
│  vLLM prefill   │                               │  vLLM decode    │
│  Meridian plug  │                               │  Meridian plug  │
└─────────────────┘                               └─────────────────┘

Status: the disagg wire protocol and block manager hooks are implemented and verified with a synthetic in-process NIXL mock. Real NIXL interop requires cargo build --features nixl and libnixl.so on the deploy host.

Configuration:

[disagg]
enabled = true
fabric = "nixl"
offload_threshold_blocks = 4

See ADR-0006 for the protocol specification.

Installation

git clone https://github.com/angelnicolasc/meridian.git
cd meridian

# Build and install the Rust core + Python bindings.
uv sync --project python
maturin develop --release -m crates/meridian-python/Cargo.toml

# Optional: build with CUDA kernel support.
# Requires nvcc + CUDA 12.6 toolkit.
maturin develop --release \
  -m crates/meridian-python/Cargo.toml \
  --features cuda

Devcontainer

The repo includes a devcontainer configuration with the full toolchain pre-installed:

# Open in VS Code with the Dev Containers extension, or:
./scripts/dev-up.sh

Configuration loading

The plugin looks for meridian.toml in the current working directory, then ~/.config/meridian/meridian.toml. Override with:

from meridian import load_config
from meridian.vllm_plugin import MeridianSchedulerPlugin

cfg = load_config("/path/to/meridian.toml")
# `engine` is an already-constructed vLLM AsyncLLMEngine.
plugin = MeridianSchedulerPlugin(scheduler=engine.scheduler, config=cfg)

Known limits

| Dimension | Limit | Notes |
|---|---|---|
| Models per instance | 1 | One engine, one config |
| Concurrent requests | Limited by GPU VRAM / block budget | Set capacity_bytes |
| vLLM version | ≥ 0.9.0 (pinned: 0.21.0 in uv.lock) | Earlier versions not supported |
| Disagg fabric | NIXL (production) or synthetic mock | Real NIXL requires libnixl |

Rust API

The Rust API is generated by rustdoc. Build locally with:

cargo doc --workspace --no-deps --open

Top-level surface

| Item | Kind | Purpose |
|---|---|---|
| PhaseRouter | struct | Per-request token-stream state machine |
| MeridianScheduler | struct | Dual-queue batch scheduler |
| BlockManager | trait | Three-tier KV block manager contract |
| PhaseAwareBlockManager | struct | Default BlockManager impl |
| types | module | ThinkPhase, PhaseEvent, BlockTier, EntropySignal, BlockLocation |
| MeridianConfig | struct | Deserialised TOML config |

Object lifecycle

PhaseRouter

PhaseRouter is Send + Sync. Create once, share across threads via Arc. Call process_token(req_id, token_id) from any thread; it returns an Option<PhaseEvent> and is O(1) per call.

Call reap_stale_older_than(duration) periodically to free entries for completed requests. The vLLM plugin does this on every batch step.

use meridian_core::PhaseRouter;
use std::sync::Arc;
use std::time::Duration;

let router = Arc::new(PhaseRouter::new());

// Per-token: called from the decode loop, which supplies req_id and token_id.
if let Some(event) = router.process_token(req_id, token_id) {
    // Handle PhaseEvent::ExitThink, ForceBudget, etc.
}

// Periodic cleanup: call from the batch step hook.
let reaped = router.reap_stale_older_than(Duration::from_secs(60));

MeridianScheduler

MeridianScheduler is Send + Sync. The schedule_batch method takes a shared reference and returns owned Vec<RequestId> for each queue; it does not hold a lock across the call boundary.

BlockManager trait

The three required methods:

trait BlockManager {
    fn allocate(&mut self, request_id: RequestId, tier: BlockTier) -> Result<BlockId>;
    fn evict_for(&mut self, required_blocks: usize) -> Vec<BlockId>;
    fn block_location(&self, block_id: BlockId) -> BlockLocation;
}

Optional disagg methods (offload_block, ingest_block) have default implementations that return Err(BlockManagerError::FabricNotConfigured). Override them when wrapping with a NIXL-backed manager.

Error model

All errors implement std::error::Error and are defined in meridian_core::error::Error. Configuration errors carry the dotted field path and a human-readable message:

ConfigValidation { field: "entropy.ema_alpha", reason: "must be in (0, 1]" }

Kernel errors are defined in meridian_kernels::KernelError:

  • Unavailable — built without the cuda feature or runtime missing.
  • Launch(i32) — CUDA returned a non-zero error code.
  • NullPointer(&'static str) — caller passed a null pointer.

Thread safety

| Type | Thread safety |
|---|---|
| PhaseRouter | Send + Sync via DashMap interior mutability |
| MeridianScheduler | Send + Sync |
| PhaseAwareBlockManager | Send; requires &mut self for mutation, so wrap in Mutex for shared access |
| NixlContext (feature nixl) | Send; not Sync, so use one context per thread |

Stability

Pre-1.0. Public API breakage is recorded under BREAKING CHANGE: in CHANGELOG.md. The C ABI (meridian_entropy_launch, meridian_eat_launch) is treated as stable from v0.1.0 — forks consuming the FFI directly will not see unexpected breakage on patch updates.

Python API

Installation

# From source — requires a Linux host with Rust 1.85+ and maturin.
uv sync --project python
maturin develop --release -m crates/meridian-python/Cargo.toml

The package exposes no CUDA dependency at import time. CUDA is lazily loaded when backend="cuda" is requested on EntropyProbe.

Top-level surface

| Symbol | Kind | Purpose |
|---|---|---|
| meridian.EntropyProbe | class | Stateful per-request entropy probe |
| meridian.EntropySignal | dataclass | Per-token signal record |
| meridian.MeridianConfig | Pydantic model | Runtime configuration |
| meridian.load_config(path) | function | Convenience TOML loader |
| meridian.vllm_plugin.MeridianSchedulerPlugin | class | vLLM scheduler wrapper |

Object lifecycle

EntropyProbe

One instance per request. Not thread-safe — do not share an instance across concurrent requests. Create, use through the token sequence, then discard.

from meridian import EntropyProbe, load_config
import numpy as np

cfg = load_config("meridian.toml")
probe = EntropyProbe(
    think_end_token_ids=cfg.model["qwen3"].think_end_token_ids,
    backend="cpu",          # "cpu" (NumPy) or "cuda" (CUDA kernel)
    ema_alpha=cfg.entropy.ema_alpha,
)

# Per-token call — call once per decoded token.
logits = np.random.randn(151_936).astype(np.float32)
sig = probe.compute(req_id=42, logits=logits)
print(sig.token_entropy, sig.eat, sig.eat_ema_variance)

# Batch path — more efficient for large batch sizes.
batch_logits = np.random.randn(8, 151_936).astype(np.float32)
signals = probe.compute_batch(req_ids=list(range(8)), logits_batch=batch_logits)

MeridianSchedulerPlugin

Wraps an existing vllm.core.scheduler.Scheduler at runtime. Attach once per engine; v0.1.x provides no detach(), so the plugin stays installed for the engine's lifetime. Holds no GPU resources; all GPU work goes through the underlying vLLM scheduler.

from vllm import AsyncLLMEngine
from vllm.engine.arg_utils import AsyncEngineArgs
from meridian import load_config
from meridian.vllm_plugin import MeridianSchedulerPlugin

engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="Qwen/Qwen2.5-0.5B"))
cfg = load_config("meridian.toml")

# attach() is a classmethod — constructs the plugin, installs it as
# engine.scheduler[0], and returns the handle for metric access.
plugin = MeridianSchedulerPlugin.attach(engine, cfg, model_key="qwen3")

# ... serve requests ...

# v0.1.x has no detach(); the plugin runs for the engine's lifetime.

Error model

All configuration errors are raised at construction time as ValueError with a dotted field path. For example:

ValueError: entropy.ema_alpha must be in (0, 1]; got 1.5

Runtime errors from the CUDA kernel surface as meridian.KernelError (a subclass of RuntimeError). When the kernel returns Unavailable (built without the cuda feature, or missing runtime library), the probe falls back to the CPU backend automatically.

Concurrency notes

  • EntropyProbe instances are not thread-safe. One instance per request.
  • MeridianSchedulerPlugin is designed to be used from vLLM's single async event loop. Do not call schedule_batch concurrently.
  • The Rust PhaseRouter and BlockManager bindings are thread-safe; they use interior mutability backed by DashMap.

Stability guarantees

Pre-1.0. Signatures may change on minor bumps. Breaking changes are listed under BREAKING CHANGE: in CHANGELOG.md and announced before merging.

Backends

EntropyProbe accepts backend="cpu" (default, pure NumPy) or backend="cuda". Both backends implement the same mathematical operations and agree within atol=1e-5. In Sprint 0 the cuda backend delegates to the CPU implementation; Sprint 1 will wire it to the Rust CUDA kernels in crates/meridian-kernels/ so the logit reduction runs on a dedicated secondary CUDA stream.

Configuration

Meridian is configured through a single TOML file consumed by both the Rust core (meridian_core::config::MeridianConfig) and the Python facade (meridian.config.MeridianConfig). Both parsers agree on field names; a round-trip test in crates/meridian-core/tests/config_parse.rs exercises every field.

The fully-annotated example lives at meridian.toml.example. Tune by overlay: keep the example as a reference and write a smaller meridian.toml containing only fields that differ from their defaults.

Validation

Both parsers reject unknown fields and out-of-range values. Cross-field violations (e.g. min_think_tokens >= max_think_tokens) are also caught. Errors carry the dotted field path and a human-readable message.
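
A range check and a cross-field check of the kind described above might look like this. It is a simplified sketch, not the Pydantic or Rust validators: the `validate` function and the dict-shaped config are assumptions, and only the ema_alpha message format is taken from the error example shown elsewhere in this book.

```python
def validate(cfg: dict) -> None:
    """Raise ValueError naming the dotted field path of the first violation."""
    alpha = cfg["entropy"]["ema_alpha"]
    if not (0.0 < alpha <= 1.0):
        raise ValueError(f"entropy.ema_alpha must be in (0, 1]; got {alpha}")
    sched = cfg["scheduler"]
    # Cross-field rule: the floor must sit strictly below the hard cap.
    if sched["min_think_tokens"] >= sched["max_think_tokens"]:
        raise ValueError(
            "scheduler.min_think_tokens must be < scheduler.max_think_tokens; "
            f"got {sched['min_think_tokens']} >= {sched['max_think_tokens']}"
        )


good = {
    "entropy": {"ema_alpha": 0.05},
    "scheduler": {"min_think_tokens": 512, "max_think_tokens": 32768},
}
validate(good)  # passes silently

bad = {**good, "entropy": {"ema_alpha": 1.5}}
try:
    validate(bad)
    message = ""
except ValueError as exc:
    message = str(exc)
```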


[scheduler]

Dual-queue scheduling policy. See ADR-0001.

think_tpot_budget_ms

| Property | Value |
|---|---|
| Type | f64 |
| Default | 80.0 |
| Unit | milliseconds per think-phase token |
| Valid range | > 0 |

TPOT budget for think-phase tokens. The user does not see inter-token latency during reasoning, so this can be set much higher than the output budget. Setting it 4× the output budget gives the batcher room to pack a larger effective batch during think. Raise if your GPU is underutilised during think; lower if think-phase requests monopolise capacity at the expense of output.

output_tpot_budget_ms

| Property | Value |
|---|---|
| Type | f64 |
| Default | 20.0 |
| Unit | milliseconds per output-phase token |
| Valid range | > 0 |

TPOT budget for output-phase tokens. This caps the inter-token latency users see while streaming. 20 ms per token corresponds to 50 tok/s, the top of a 30–50 tok/s display target. Lower values produce tighter streams but reduce think-phase throughput.

think_batch_multiplier

| Property | Value |
|---|---|
| Type | f64 |
| Default | 2.5 |
| Unit | × output batch token budget |
| Valid range | >= 1.0 |

The think-phase batch can fill this multiple of the output-phase token budget. 2.5× is conservative — empirically stable across H100-class hardware with MLA-aware allocation. Values above 3.5× risk output ITL variance spikes if think requests fail to yield promptly. Monitor meridian.queue_depth{queue=think}.
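
The arithmetic behind the three scheduler knobs so far is simple enough to check by hand. The helper names below are hypothetical, and the 2048-token output budget is a made-up figure used only to show the multiplier.

```python
def tokens_per_second(tpot_ms: float) -> float:
    """Per-request decode rate implied by a time-per-output-token budget."""
    return 1000.0 / tpot_ms


def think_token_budget(output_token_budget: int, multiplier: float) -> int:
    """Think-phase batch token budget as a multiple of the output budget."""
    return int(output_token_budget * multiplier)


output_rate = tokens_per_second(20.0)          # default output_tpot_budget_ms
think_rate = tokens_per_second(80.0)           # default think_tpot_budget_ms
think_budget = think_token_budget(2048, 2.5)   # 2048 is an illustrative output budget
```

At the defaults, an output-phase request is held to 50 tok/s while a think-phase request may drop to 12.5 tok/s, and the think batch may hold 5120 tokens against the illustrative 2048-token output budget.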

max_think_tokens

| Property | Value |
|---|---|
| Type | u64 |
| Default | 32768 |
| Unit | tokens |
| Valid range | > min_think_tokens |

Hard cap on think tokens per request. Budget forcing fires unconditionally at this limit regardless of entropy signals. 32 768 matches DeepSeek-R1's documented maximum reasoning length and bounds the KV memory a single request can monopolise.

min_think_tokens

| Property | Value |
|---|---|
| Type | u64 |
| Default | 512 |
| Unit | tokens |
| Valid range | < max_think_tokens |

No budget forcing is allowed before this many think tokens. EAT/RPDI signals are noisy below 512 tokens; early forcing can prematurely terminate short-but-correct reasoning chains.


[entropy]

Entropy probe and convergence-detection thresholds.

enabled

| Property | Value |
|---|---|
| Type | bool |
| Default | true |

When false, the entropy probe is disabled and all budget forcing uses hard_cap only (pure token-count limiting). Useful for A/B comparison or when the CUDA kernel is not available.

ema_alpha

| Property | Value |
|---|---|
| Type | f64 |
| Default | 0.05 |
| Unit | dimensionless (EMA decay) |
| Valid range | (0.0, 1.0] |

EMA decay applied to all entropy signals. Smaller values give longer memory. α = 0.05 → ~95% mass within the last ~60 samples. Long enough to smooth single-token spikes; short enough to react within a reasoning chain.
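
The "~95% within ~60 samples" figure follows from the geometric weights of an exponential moving average: the sample k steps back carries weight α(1 − α)^k, so the most recent n samples carry 1 − (1 − α)^n of the total. The helper names below are hypothetical; the sketch just evaluates that expression.

```python
import math


def ema_mass(alpha: float, n: int) -> float:
    """Fraction of total EMA weight carried by the most recent n samples."""
    return 1.0 - (1.0 - alpha) ** n


def samples_for_mass(alpha: float, mass: float) -> int:
    """Smallest n whose most recent n samples carry at least `mass` of the weight."""
    return math.ceil(math.log(1.0 - mass) / math.log(1.0 - alpha))


mass_60 = ema_mass(0.05, 60)          # weight in the last 60 samples at the default alpha
n_95 = samples_for_mass(0.05, 0.95)   # samples needed to reach 95% of the weight
```

The same helpers can be used to see how far back the probe effectively looks when ema_alpha is changed.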

rpdi_threshold

| Property | Value |
|---|---|
| Type | f64 |
| Default | 3.0 |
| Unit | ratio (local RPDI / global RPDI) |
| Valid range | > 1.0 |

Overthinking is declared when rpdi_local / rpdi_global > threshold. The value 3.0 is the empirical threshold from arXiv:2603.14251. Raise to be more permissive (longer chains); lower to be more aggressive.

eat_ema_variance_threshold

| Property | Value |
|---|---|
| Type | f64 |
| Default | 0.001 |
| Unit | nats² |
| Valid range | > 0.0 |

Convergence is declared when EAT EMA variance drops below this threshold. 0.001 is approximately the noise floor of EAT in steady state. Lower values defer forcing; higher values fire earlier.
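
Taken together, the token floor, the hard cap, and the two entropy thresholds suggest a decision rule along the following lines. This is a sketch of one plausible combination, not Meridian's actual logic: the function name and the order in which convergence and overthinking are checked are assumptions, and the reason strings are illustrative labels.

```python
from typing import Optional


def budget_force_reason(
    think_tokens: int,
    eat_ema_variance: float,
    rpdi_ratio: float,
    *,
    min_think_tokens: int = 512,
    max_think_tokens: int = 32768,
    eat_ema_variance_threshold: float = 0.001,
    rpdi_threshold: float = 3.0,
) -> Optional[str]:
    """Return why </think> should be injected now, or None to keep thinking."""
    if think_tokens >= max_think_tokens:
        return "hard_cap"      # unconditional, regardless of entropy signals
    if think_tokens < min_think_tokens:
        return None            # signals are too noisy this early
    if eat_ema_variance < eat_ema_variance_threshold:
        return "converged"
    if rpdi_ratio > rpdi_threshold:
        return "overthinking"
    return None


early = budget_force_reason(100, 0.0001, 5.0)        # below the token floor
settled = budget_force_reason(1000, 0.0005, 1.0)     # variance under threshold
looping = budget_force_reason(1000, 0.01, 3.5)       # RPDI ratio over threshold
capped = budget_force_reason(32768, 0.5, 1.0)        # hit max_think_tokens
```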

transition_entropy_threshold

| Property | Value |
|---|---|
| Type | f64 |
| Default | 2.5 |
| Unit | nats |
| Valid range | > 0.0 |

A token counts as a "transition" for RPDI when its per-token entropy exceeds this value. 2.5 nats ≈ effective branching factor of 12 — a genuine decision point rather than low-entropy continuation.
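
The link between entropy in nats and "effective branching factor" is exp(H): a uniform choice among k tokens has entropy ln k. The sketch below computes per-token entropy from raw logits with a numerically stable softmax, using only the standard library; the function names are hypothetical and this is not the probe's kernel.

```python
import math


def token_entropy_nats(logits):
    """Shannon entropy (nats) of softmax(logits), computed stably."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]   # subtract the max to avoid overflow
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def branching_factor(entropy_nats: float) -> float:
    """Effective number of equally likely choices: exp(entropy)."""
    return math.exp(entropy_nats)


uniform_12 = [0.0] * 12                 # 12 equally likely tokens
h = token_entropy_nats(uniform_12)      # equals ln(12)
is_transition = h > 2.5                 # default transition_entropy_threshold
threshold_branches = branching_factor(2.5)
```

A uniform choice among 12 tokens sits just under the default threshold, so a token needs slightly more than 12 effective alternatives to count as a transition.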

eat_probe_interval_tokens

| Property | Value |
|---|---|
| Type | u32 |
| Default | 32 |
| Unit | tokens |
| Valid range | >= 1 |

The EAT kernel runs every N think tokens. 1 = every token; higher values trade signal latency for reduced kernel-launch overhead. 32 is the sweet spot on H100-class hardware. Halving this on slower GPUs is safe.


[kv_memory]

Phase-aware KV block manager policy.

aggressive_think_eviction

| Property | Value |
|---|---|
| Type | bool |
| Default | false |

When true, ThinkComplete blocks are freed immediately on phase transition. Leave false until cross-attention back-references are audited for your model; some reasoning-parser pipelines re-attend over the think segment when generating output and need those blocks resident.

think_phase_memory_fraction

| Property | Value |
|---|---|
| Type | f64 |
| Default | 0.40 |
| Unit | fraction of total KV budget |
| Valid range | (0.0, 1.0) |

Fraction of total KV budget reserved for think-phase blocks. 0.40 leaves 60% for output-phase blocks, which accommodates think_batch_multiplier = 2.5 without crowding output. Raise for workloads with very long reasoning chains; lower for workloads with long output sequences.

block_size_bytes

| Property | Value |
|---|---|
| Type | u64 |
| Default | 16384 (16 KiB) |
| Unit | bytes per KV block |
| Valid range | > 0 |

Must match the actual vLLM block layout for your model. The canonical vLLM layout is 16 KiB for bf16/fp16 KV at 16 tokens per block. MLA-aware models can run smaller blocks.

capacity_bytes

| Property | Value |
|---|---|
| Type | u64 or "auto" |
| Default | "auto" |
| Unit | bytes |
| Valid range | > 0 or "auto" |

Total KV memory budget. "auto" queries the device at startup and uses 85% of the total device memory reported by torch.cuda.mem_get_info() (the second element of the (free, total) tuple it returns). Pin an integer for deterministic or multi-tenant deployments where you want to reserve GPU memory for other workloads.
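
The interaction between capacity_bytes, block_size_bytes and think_phase_memory_fraction is simple arithmetic, sketched below. `KvPlan` and `plan_kv` are hypothetical names for illustration; the 85% factor and the rounding of the think share down to whole blocks follow the description above, but the real block manager may round differently.

```rust
/// Block counts derived from the [kv_memory] settings.
#[derive(Debug, PartialEq, Eq)]
pub struct KvPlan {
    pub capacity_bytes: u64,
    pub total_blocks: u64,
    pub think_blocks: u64,
    pub output_blocks: u64,
}

/// Models `capacity_bytes = "auto"`: 85% of device memory, split into blocks,
/// with `think_phase_memory_fraction` of them reserved for think-phase use.
pub fn plan_kv(
    device_total_bytes: u64,
    block_size_bytes: u64,
    think_phase_memory_fraction: f64,
) -> Option<KvPlan> {
    let fraction_ok = think_phase_memory_fraction > 0.0 && think_phase_memory_fraction < 1.0;
    if block_size_bytes == 0 || !fraction_ok {
        return None; // outside the documented valid ranges
    }
    // Widen to u128 so the multiplication cannot overflow.
    let capacity_bytes = (device_total_bytes as u128 * 85 / 100) as u64;
    let total_blocks = capacity_bytes / block_size_bytes;
    // Assumption: the think share is rounded down to whole blocks.
    let think_blocks = (total_blocks as f64 * think_phase_memory_fraction).floor() as u64;
    Some(KvPlan {
        capacity_bytes,
        total_blocks,
        think_blocks,
        output_blocks: total_blocks - think_blocks,
    })
}
```

For an 80 GiB device with the defaults, this gives 68 GiB of KV budget, 4,456,448 blocks of 16 KiB, of which 1,782,579 are reserved for think-phase blocks.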


[disagg]

Disaggregated KV transfer. Disabled by default. See ADR-0006.

enabled

| Property | Value |
|---|---|
| Type | bool |
| Default | false |

Master switch. Leave false for single-node deployments.

fabric

| Property | Value |
|---|---|
| Type | one of "nixl", "mooncake", "none" |
| Default | "none" |

Selects the disagg transport. nixl uses the NVIDIA NIXL library (requires cargo build --features nixl and libnixl on the deploy host). mooncake uses the Mooncake-compatible protocol adapter. none is only valid when enabled = false.

offload_threshold_blocks

| Property | Value |
|---|---|
| Type | u32 |
| Default | 4 |
| Unit | KV blocks |
| Valid range | >= 1 |

Minimum ThinkComplete blocks to accumulate before flushing to the fabric. Larger values amortise transfer overhead; smaller values reduce latency.
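
The constraints on the three [disagg] keys can be expressed as a small validation routine. The sketch below is an illustration only: `DisaggConfig`, `FabricKind` and `validate` are hypothetical names, and the error strings are invented for the example.

```rust
/// Mirrors the `fabric` values "nixl", "mooncake" and "none".
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FabricKind {
    Nixl,
    Mooncake,
    NoFabric,
}

#[derive(Debug, Clone)]
pub struct DisaggConfig {
    pub enabled: bool,
    pub fabric: FabricKind,
    pub offload_threshold_blocks: u32,
}

impl Default for DisaggConfig {
    /// Documented defaults: disabled, no fabric, flush after 4 blocks.
    fn default() -> Self {
        Self { enabled: false, fabric: FabricKind::NoFabric, offload_threshold_blocks: 4 }
    }
}

impl DisaggConfig {
    pub fn validate(&self) -> Result<(), String> {
        // "none" is only valid when the master switch is off.
        if self.enabled && self.fabric == FabricKind::NoFabric {
            return Err("disagg.enabled = true requires a fabric other than \"none\"".to_string());
        }
        if self.offload_threshold_blocks < 1 {
            return Err("disagg.offload_threshold_blocks must be >= 1".to_string());
        }
        Ok(())
    }
}
```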


[model.<name>]

Per-model token-boundary configuration. One [model.*] table per model served. The phase router watches for boundary token IDs in the decoded stream.

See models/*.toml for the vetted token IDs for supported models.

think_start_token_ids

| Property | Value |
|---|---|
| Type | [u32] |

Token IDs that mark the start of a reasoning chain. Model- and tokenizer-specific. If empty, the router never enters think-phase for this model.

think_end_token_ids

| Property | Value |
|---|---|
| Type | [u32] |

Token IDs that mark the end of a reasoning chain. When the decoded stream contains any of these IDs, the router emits ExitThink.

reasoning_parser

| Property | Value |
|---|---|
| Type | string |
| Values | "deepseek_r1", "qwen3", "granite", "anthropic" |

Selects the reasoning-chain parser. Different models structure their think-output boundary differently; the parser handles model-specific normalization.

supports_think_disable

| Property | Value |
|---|---|
| Type | bool |

Whether the model supports a /no_think directive to suppress the reasoning phase entirely. When true and the prompt contains the directive, the router stays in output-phase for the entire request.
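
How the boundary token IDs and supports_think_disable combine can be sketched as a small per-request state machine. This is an illustration, not the PhaseRouter itself: `ModelBoundaries`, `RequestPhase`, `EnterThink` and the `prompt_has_no_think` flag are hypothetical names (only ExitThink appears in the text), and the token IDs in the usage note are made up.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Output,
    Think,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PhaseEvent {
    EnterThink,
    ExitThink,
}

/// Per-model settings, named after the [model.<name>] keys.
pub struct ModelBoundaries {
    pub think_start_token_ids: Vec<u32>,
    pub think_end_token_ids: Vec<u32>,
    pub supports_think_disable: bool,
}

/// Per-request phase state.
pub struct RequestPhase {
    phase: Phase,
    think_disabled: bool,
}

impl RequestPhase {
    /// Assumption: the caller has already checked the prompt for the directive.
    pub fn new(model: &ModelBoundaries, prompt_has_no_think: bool) -> Self {
        Self {
            phase: Phase::Output,
            think_disabled: model.supports_think_disable && prompt_has_no_think,
        }
    }

    pub fn phase(&self) -> Phase {
        self.phase
    }

    /// Inspects one decoded token and reports a phase change, if any.
    pub fn on_token(&mut self, model: &ModelBoundaries, token_id: u32) -> Option<PhaseEvent> {
        if self.think_disabled {
            return None; // stays in output phase for the whole request
        }
        match self.phase {
            Phase::Output if model.think_start_token_ids.contains(&token_id) => {
                self.phase = Phase::Think;
                Some(PhaseEvent::EnterThink)
            }
            Phase::Think if model.think_end_token_ids.contains(&token_id) => {
                self.phase = Phase::Output;
                Some(PhaseEvent::ExitThink)
            }
            _ => None,
        }
    }
}
```

With placeholder IDs 1000 (start) and 1001 (end), feeding 1000 then 1001 yields EnterThink followed by ExitThink; with an empty think_start_token_ids list, the request never leaves the output phase.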

Glossary

Terms used throughout the Meridian documentation and codebase.


TTFT: Time to First Token. Wall-clock time from the moment a request is submitted to the moment the first token is returned to the client. Dominated by prefill latency for long prompts.

TTOT: Time to Output Token. Wall-clock time from the emission of the </think> boundary token to the first user-visible output token. The metric Meridian is specifically designed to protect.

TPOT: Time Per Output Token, also called ITL (Inter-Token Latency). Time between consecutive output tokens during streaming. The user-perceived "speed" of the stream.

ITL: Inter-Token Latency. See TPOT.

EAT: Entropy-Aware Termination. A budget-forcing signal based on the variance of the EMA of per-token Shannon entropy over the reasoning chain. When EAT EMA variance drops below a threshold, the model is inferred to have converged on an answer. Defined in arXiv:2509.26522.

RPDI: Reasoning Phase Divergence Index. A signal based on the ratio of local transition-token frequency to global transition-token frequency. A high ratio indicates the model is cycling through redundant reasoning steps ("overthinking"). Defined in arXiv:2603.14251.

Entropy (Shannon): Measure of uncertainty in the next-token distribution. Computed from the logit vector after softmax as -Σ p_i · log(p_i), in nats. High entropy = uncertain prediction; low entropy = confident prediction.

EMA: Exponential Moving Average. A smoothed average where recent values are weighted more heavily. Controlled by ema_alpha (smaller = longer memory).

KV: Key-Value cache. The GPU memory store holding the attention keys and values for each token in active requests. KV memory is the primary capacity constraint in serving systems.

KV block: The unit of KV cache allocation. A fixed-size chunk (default 16 KiB) covering a fixed number of tokens (default 16). Blocks are allocated at the request level and freed on eviction or request completion.

ThinkComplete: Block tier for KV blocks that belonged to a request's reasoning phase, once </think> has been emitted. First in the eviction order: these blocks are freed first under memory pressure.

ThinkActive: Block tier for KV blocks belonging to a request currently in the reasoning phase. Evicted after ThinkComplete and before OutputCritical.

OutputCritical: Block tier for KV blocks belonging to a request in the output phase. Last in the eviction order, because evicting these causes user-visible stream disruption. Any eviction at this tier fires meridian.output_critical_eviction.

Disagg / disaggregated serving: Prefill-decode disaggregation, in which the model prefill (prompt processing) and decode (token generation) steps are executed on separate hardware. Disagg reduces head-of-line blocking by separating the two workloads, which have very different GPU utilisation profiles.

NIXL: NVIDIA Inference Xfer Library. NVIDIA's reference fabric for transferring KV blocks between prefill and decode nodes in a disaggregated serving topology.

Mooncake: An open-source disaggregated serving framework with a KV-transfer protocol. Meridian's disagg surface is documented as Mooncake-compatible in ADR-0006.

vLLM: An open-source LLM serving framework. Meridian's primary integration target. Meridian wraps vLLM's scheduler without forking the codebase.

DashMap: A concurrent hash map crate used for the PhaseRouter's per-request state. Provides expected O(1) read and write with sharded locking. See ADR-0003.

EOS: End of Sequence. The special token that signals request completion. vLLM emits an EOS event that the Meridian plugin uses to trigger request teardown and router state reaping.

Conventional Commits: A commit message standard used throughout this repository. Format: <type>(<scope>): <summary>. See conventionalcommits.org.

DCO: Developer Certificate of Origin. A sign-off mechanism (git commit -s) that certifies the contributor has the right to submit the code under the project's license. Required for all contributions; see CONTRIBUTING.md.

SLSA: Supply-chain Levels for Software Artifacts. A framework for supply-chain security. Meridian attests Level 2 provenance on every tagged release via slsa-github-generator. See ADR-0007.

SBOM: Software Bill of Materials. A machine-readable inventory of software components and their licenses. Meridian generates a CycloneDX SBOM for each release, attached as a GitHub release asset.

Architectural Decision Records

Meridian uses Michael Nygard's ADR format to capture the why behind significant choices. ADRs are immutable once "Accepted"; we supersede them rather than edit them.

Lifecycle: Proposed → Accepted → (later) Superseded by ADR-NNNN / Deprecated.

Index

Writing a new ADR

  1. Copy template.md to NNNN-short-kebab-title.md.
  2. Open as Proposed; merge as Accepted after PR review.
  3. If a later ADR supersedes this one, mark this one Superseded by ADR-NNNN in a new commit — never edit the body of an accepted ADR.

ADR-0001: Dual-queue scheduling vs. priority weights on a single queue

  • Status: Accepted
  • Date: 2026-05-20
  • Authors: angelnicolasc
  • Reviewers: sole-maintainer decision record

Context

Meridian's central thesis is that think-decode and output-decode are two structurally different workloads inside a single reasoning-model request:

  • Output tokens are user-visible streaming. They must hit a tight TTOT (time-to-output-token) target — ~20 ms is the perceptual threshold for fluid streaming at typical display rates.
  • Think tokens are user-invisible reasoning. The user is already waiting for the answer; inter-token latency during reasoning is irrelevant. Throughput (tokens/sec/GPU) is what matters here.

Given that, the scheduler needs to give output-phase requests absolute priority while letting think-phase requests fill any remaining batch capacity with a larger effective batch size to maximise GPU utilisation.

There are two plausible structural shapes to implement this:

  1. Single queue with priority weights. One queue of all "decode-eligible" requests; each request carries a priority numeric. The scheduler picks the highest-weighted requests every iteration and the eviction policy reads block tier from the request's phase.
  2. Two independent queues, one per phase. An output_queue drained first to its budget, then a think_queue drained to a larger budget capped by remaining KV memory.

Both can produce equivalent dispatch ordering. They diverge in observability, in the failure modes they expose, and in how cleanly they compose with KV tier management.

Decision

Meridian uses two independent queues — output_queue and think_queue — sharing the same GPU workers, with output drained first every iteration.

The scheduler exposes per-queue depth as a separate metric label, applies per-queue SLO budgets, and the block manager's eviction tiers are indexed on the block's phase membership rather than on the owning request's priority number.
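
The drain order described in this decision can be sketched with two FIFO queues. This is a simplified illustration rather than MeridianScheduler: `DualQueue`, the request-ID representation, and the `kv_headroom` cap (standing in for "remaining KV memory") are assumptions made for the example.

```rust
use std::collections::VecDeque;

/// Two FIFO queues of request IDs, one per phase.
#[derive(Default)]
pub struct DualQueue {
    output_queue: VecDeque<u64>,
    think_queue: VecDeque<u64>,
}

impl DualQueue {
    pub fn push_output(&mut self, req_id: u64) {
        self.output_queue.push_back(req_id);
    }

    pub fn push_think(&mut self, req_id: u64) {
        self.think_queue.push_back(req_id);
    }

    /// Drains output requests first, up to `output_budget`, then think
    /// requests up to `think_batch_multiplier × output_budget`, capped by
    /// `kv_headroom` (a hypothetical count of think slots KV memory allows).
    pub fn schedule_batch(
        &mut self,
        output_budget: usize,
        think_batch_multiplier: f64,
        kv_headroom: usize,
    ) -> (Vec<u64>, Vec<u64>) {
        let n_out = output_budget.min(self.output_queue.len());
        let output: Vec<u64> = self.output_queue.drain(..n_out).collect();

        let think_budget = (output_budget as f64 * think_batch_multiplier) as usize;
        let n_think = think_budget.min(kv_headroom).min(self.think_queue.len());
        let think: Vec<u64> = self.think_queue.drain(..n_think).collect();

        (output, think)
    }
}
```

Because the output queue is drained before the think queue is even consulted, the "output drains first" invariant is a property of the structure rather than of any priority arithmetic.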

Consequences

Positive

  • SLO isolation by construction. TTOT and TPOT live on different queues and cannot interfere through priority arithmetic. We never need to reason about whether a priority of 5 vs. 7 is enough to keep an output token from being preempted by a think token — the queues are physically separate.
  • Reasoning about starvation is local. With priority weights, you have to argue globally about the joint distribution of priorities under load to prove that think requests are not starved. With two queues, the worst-case is "output_queue saturates the budget → think_queue waits its turn." That is a one-line argument and a single bounded scalar (think_batch_multiplier × output_budget) to tune.
  • Block manager tiering is structurally aligned. The eviction policy iterates BlockTier::ThinkComplete → ThinkActive → OutputCritical. The scheduler's queues map 1:1 onto two of those tiers (ThinkActive, OutputCritical), and the ThinkComplete tier appears precisely when a request transitions queues. The pipeline of state transitions is uniform end-to-end.
  • Observability is honest. meridian.queue_depth{queue="output"} and …{queue="think"} are operationally meaningful — they correspond to things an oncall engineer can act on. A single meridian.queue_depth_p95_priority would obscure the failure mode.
  • Future disaggregation is cheap. When we add a separate decode pool for think (a natural extension co-located with prefill-decode disagg systems like Mooncake / NIXL), the seam already exists.

Negative / risks

  • Two queue data structures instead of one. Marginal memory cost (crossbeam::queue::SegQueue is small) and a second insert path; a SegQueue push is amortised O(1). Not material against the per-token compute budget.
  • Risk that think queue is permanently starved under sustained output pressure. Mitigation: the scheduler enforces a minimum think-batch reservation when output_queue.len() < output_budget. Detection: meridian.queue_depth{queue="think"} rising while meridian.budget_force_triggered stays flat — alert at p95 depth > 4× baseline for 5 minutes.
  • Edge cases at phase transition. A request that emits </think> and the next token in the same decode step transitions queues mid-iteration. This is handled by MeridianScheduler::on_phase_event taking the request out of the think queue and pushing it into the output queue before the next schedule_batch call. Tests for this case live in tests/phase_router_state_machine.rs.

Neutral

  • The number of tunables stays the same. A single-queue design with priority weights requires output_priority, think_priority, and a priority_gap_min; the dual-queue design requires output_tpot_budget_ms, think_tpot_budget_ms, and think_batch_multiplier. Both surfaces are three scalars.

Alternatives considered

Single queue with continuous priority weights

RequestSlot { priority: f32 }, dispatch is argmax(priority) with preemption. Output requests carry priority ≈ 10, think requests priority ≈ 1. Rejected because:

  • The dispatch order under heavy load depends on the distribution of weighted requests, not on a per-class budget — you can no longer write down a one-line invariant like "output never waits more than K ms."
  • The block manager would need to read priorities to decide eviction order, coupling two subsystems that we want orthogonal.
  • Operators tuning the system find priority weights opaque — "is 8.5 enough?" is not a question with a principled answer.

Single queue with phase-stratified preemption

One queue, but a hard rule that any output-phase request preempts any think-phase one. This is structurally equivalent to two queues but expressed differently. Rejected for code-clarity reasons only: the dual-queue shape makes the invariant ("output drains first") the structure, rather than an invariant we have to police in the dispatcher.

Per-tenant queues with phase tags

Considered for multi-tenant SaaS deployments. Not rejected outright, but deferred — it is an orthogonal axis we can layer on top of the two-queue shape. Captured as a future ADR placeholder.

References

  • Playbook §3.3 — Dual-Queue Scheduler.
  • vLLM v0.9 scheduler internals: vllm/core/scheduler.py (single-queue priority-weighted implementation we are improving on).
  • Mooncake disagg paper — separates prefill from decode; orthogonal axis.
  • DUCHESS (arXiv:2509.24957) — intra-request branch orchestration; operates below the queue layer.

ADR-0002: Workspace tri-crate layout (core / kernels / python)

  • Status: Accepted
  • Date: 2026-05-20
  • Authors: angelnicolasc

Context

Meridian spans three execution domains: a pure-Rust scheduler core, CUDA kernels behind an FFI boundary, and pyo3 bindings that the Python vLLM plugin consumes. The natural layouts are:

  1. Single crate. All code in one place, gated by cfg(feature = ...).
  2. Tri-crate Cargo workspace. meridian-core (Rust only), meridian-kernels (CUDA + FFI), meridian-python (pyo3, built via maturin). All members of one workspace, sharing lockfile and lints.
  3. Polyrepo. Each layer in its own repository, glued by published versions.

Decision

Tri-crate workspace. All three crates live under crates/ in this repository.

Consequences

Positive

  • cargo test -p meridian-core runs on any host — no CUDA, no Python, no nvcc. The CI matrix can validate the core invariants on the cheapest GitHub-hosted runner and only spin up a GPU runner for the CUDA layer.
  • The unsafe surface area is visibly contained. Auditors looking at meridian-core see #![forbid(unsafe_code)] at the crate root. The unsafe code lives in exactly one crate (meridian-kernels) and at exactly one boundary (the FFI declarations in src/ffi.rs).
  • Workspace [workspace.lints] is applied uniformly across all three crates — one place to change a clippy lint, no drift.
  • A future fork that wants only the scheduler core (e.g. for a non-CUDA inference framework) can depend on meridian-core directly without pulling pyo3 or CUDA artifacts.

Negative / risks

  • Three Cargo.tomls to keep in sync. Mitigated by workspace.dependencies and workspace.package inheritance — versioned dependencies are declared once.
  • The meridian-kernels crate has links = "meridian_kernels". Cargo enforces uniqueness so we cannot accidentally link two implementations of the same native library — a small but real guard.

Neutral

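  • Build artifacts for all three crates land in the single target/ directory at the workspace root, so the layout adds per-crate build output but no extra target/ directories. Negligible in practice.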

Alternatives considered

Single crate

Rejected because: any Python binding gate requires pyo3 in the dep graph, which pulls a non-trivial transitive closure. Anyone wanting to depend on just the scheduler core would be forced to compile it.

Polyrepo

Rejected because: in a pre-1.0 project where the three layers co-evolve, splitting them into separate repositories introduces version-skew bugs without a corresponding benefit. Once the public API stabilises post-1.0 this can be revisited.

References

ADR-0003: DashMap for per-request phase state

  • Status: Accepted
  • Date: 2026-05-20
  • Authors: angelnicolasc
  • Reviewers: sole-maintainer decision record

Context

The PhaseRouter mutates per-request state (ThinkPhase) on every decoded token. A single decode worker processes a continuous-batch step that can touch dozens of requests; in a tensor-parallel deployment, multiple worker threads may invoke on_token for different requests in the same wall-clock window. We need:

  • O(1) lookup keyed by req_id.
  • Concurrent mutation of different keys without serialising the whole map.
  • Cheap clone / sharing — the router is shared across the scheduler, the block manager touch path, and the vLLM plugin.
  • No GC, no allocations on the hot path.

Candidates evaluated:

  1. parking_lot::RwLock<HashMap<u64, ThinkPhase>>
  2. DashMap<u64, ThinkPhase> (sharded HashMap behind per-shard RwLock)
  3. papaya::HashMap<u64, ThinkPhase> (lock-free, June 2025)
  4. scc::HashMap<u64, ThinkPhase> (lock-free, sharded)

Decision

DashMap 6.x. It is the boring, well-audited choice that hits the performance floor we need without introducing a less-battle-tested dependency on the hot path.
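
The sharded-locking idea behind this choice can be illustrated with the standard library alone. The sketch below is not DashMap and makes no claim about its internals: `ShardedMap` is a hypothetical type, and it picks a shard by taking req_id modulo the shard count, whereas DashMap hashes the key.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// A map split into independently locked shards, keyed by request ID.
pub struct ShardedMap<V> {
    shards: Vec<RwLock<HashMap<u64, V>>>,
}

impl<V> ShardedMap<V> {
    pub fn new(n_shards: usize) -> Self {
        let n = n_shards.max(1);
        Self { shards: (0..n).map(|_| RwLock::new(HashMap::new())).collect() }
    }

    /// Simplification: shard index is `req_id % n_shards`.
    fn shard(&self, req_id: u64) -> &RwLock<HashMap<u64, V>> {
        let idx = (req_id % self.shards.len() as u64) as usize;
        &self.shards[idx]
    }

    pub fn insert(&self, req_id: u64, value: V) -> Option<V> {
        let mut guard = self.shard(req_id).write().unwrap();
        guard.insert(req_id, value)
    }

    /// Mutates one entry in place; only that entry's shard is write-locked.
    pub fn update<R>(&self, req_id: u64, f: impl FnOnce(&mut V) -> R) -> Option<R> {
        let mut guard = self.shard(req_id).write().unwrap();
        guard.get_mut(&req_id).map(f)
    }

    pub fn remove(&self, req_id: u64) -> Option<V> {
        let mut guard = self.shard(req_id).write().unwrap();
        guard.remove(&req_id)
    }
}
```

Two workers updating request IDs that fall in different shards take different locks and do not block each other, which is the property the requirements list asks for.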

Consequences

Positive

  • Sharded locking. Operations on different req_ids do not contend.
  • Mature API. get_mut, entry, remove cover every access pattern in PhaseRouter. No need to invent abstractions on top.
  • No unsafe in our code. DashMap relies on parking_lot primitives and contains unsafe internally, but #![forbid(unsafe_code)] governs only meridian-core's own source, so that invariant stays intact.
  • Crates.io top-100. Wide deployment, frequent security audits, stable semver.

Negative / risks

  • Sharded lock is not truly lock-free. Under extreme contention (a single shard absorbing many requests because of hash skew), throughput degrades. Mitigation: monitor meridian.phase_router.tracked_requests; if the gauge approaches n_shards × shard_capacity (default ~512 per shard), revisit with papaya or shard-aware partitioning. In our workload (~1k concurrent requests max per worker) we are far from this ceiling.
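  • Default hasher. DashMap 6.x defaults to the standard library's RandomState (SipHash) rather than AHasher. That is adequate for our integer keys; a faster non-cryptographic hasher can be supplied through DashMap::with_hasher if hashing ever shows up in profiles.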

Neutral

  • Memory per entry is slightly higher than a flat HashMap because of the shard metadata. Negligible (~64 bytes overhead total).

Alternatives considered

RwLock<HashMap<...>>

Rejected: every on_token call needs a write lock to bump tokens_so_far, so the RwLock collapses to a Mutex in practice. Single-shard serialisation across all requests is unacceptable.

papaya::HashMap

Watched, not adopted. Genuinely lock-free, with better tail latency under contention than DashMap. Adoption deferred until: (a) it stabilises a 1.0 API, and (b) we can demonstrate a workload where DashMap is the bottleneck. Tracked as future work in the DEVLOG.

scc::HashMap

Comparable to papaya but more API surface; same deferral rationale.

References

ADR-0004: KV tier transitions are one-way; demotion is the only direction

  • Status: Accepted
  • Date: 2026-05-20
  • Authors: angelnicolasc
  • Reviewers: sole-maintainer decision record

Context

The block manager maintains three tiers (ThinkActive, ThinkComplete, OutputCritical). Tier transitions are emitted by the scheduler on phase events. We had to decide whether a block can promote (e.g. a ThinkComplete block re-attended during output generation gets restored to ThinkActive or even OutputCritical).

Two camps:

  1. Bidirectional: any block currently accessed gets promoted to the highest applicable tier.
  2. One-way demotion: blocks only move down the eviction order (OutputCritical → … → freed). Promotion is explicitly disallowed.

Decision

One-way demotion. A block's tier is set at allocation time (via BlockTier::ThinkActive or BlockTier::OutputCritical) and can only move toward eviction:

  • ThinkActive → ThinkComplete via demote_think_blocks.
  • Any tier → freed via evict_for or free.

There is no API to move a ThinkComplete block back to ThinkActive, or a ThinkActive block to OutputCritical.
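
A type-level sketch of the one-way rule follows. It is illustrative only: `Block`, `allocate` and `demote_think` are simplified stand-ins for the real block manager API (the text names demote_think_blocks, evict_for and free), and the enum's variant order is chosen so that derived ordering matches eviction order.

```rust
/// Variants are declared in eviction order: ThinkComplete is evicted first.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum BlockTier {
    ThinkComplete,
    ThinkActive,
    OutputCritical,
}

#[derive(Debug)]
pub struct Block {
    tier: BlockTier,
}

impl Block {
    /// Blocks are admitted only as ThinkActive or OutputCritical.
    pub fn allocate(tier: BlockTier) -> Option<Block> {
        match tier {
            BlockTier::ThinkActive | BlockTier::OutputCritical => Some(Block { tier }),
            BlockTier::ThinkComplete => None,
        }
    }

    pub fn tier(&self) -> BlockTier {
        self.tier
    }

    /// ThinkActive → ThinkComplete. Returns false, leaving the tier untouched,
    /// for any other starting tier. There is deliberately no inverse method.
    pub fn demote_think(&mut self) -> bool {
        if self.tier == BlockTier::ThinkActive {
            self.tier = BlockTier::ThinkComplete;
            true
        } else {
            false
        }
    }
}
```

Since the only mutating method moves a block toward eviction, no sequence of calls can raise a block's tier, which is the monotonicity property the decision relies on.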

Consequences

Positive

  • Reasoning about eviction stability becomes trivial. A block's tier monotonically decreases. Cross-attention back-references over a reasoning span cannot accidentally "rescue" blocks the scheduler has already decided are evictable — operators can reason about KV pressure without tracking promotion races.
  • The block manager API is smaller. No promote() method to test, no invariant to enforce ("you can only promote within the same request").
  • Aligns with the playbook intent. kv_memory.aggressive_think_eviction is a one-way knob: think blocks either survive demotion (default) or are freed immediately (aggressive). Promotion would make this knob semantically incoherent.

Negative / risks

  • A reasoning model that re-attends over its own think span during output generation pays the eviction cost twice. Cross-attention reads on a ThinkComplete block bring the block into the GPU's L1/L2 cache but do not promote its eviction tier. If that block is then evicted by memory pressure, the next cross-attention read forces a recompute. Mitigation: keep kv_memory.aggressive_think_eviction = false so blocks stay resident as ThinkComplete until pressure actually demands their eviction.
  • No way to mark "this block is hot, please keep it" beyond keeping it in the lowest tier it was admitted to. The LRU within a tier is the only signal of recency.

Neutral

  • The touch() API exists to update LRU position within a tier, not to promote across tiers. This is explicit in the trait documentation.

Alternatives considered

Bidirectional with promote_block(block_id, tier)

Rejected because:

  • Adds a new invariant to police (can a ThinkComplete of request A promote to OutputCritical of request B? Obviously not, but the API must enforce that).
  • Promotion races with eviction: a block that the eviction iterator has selected as next victim might be promoted mid-eviction. Resolving this needs either a lock around the whole eviction loop (kills throughput) or a generation counter (more complexity).
  • The cross-attention rescue use case is real but rare; the simpler fallback (operator tunes think_phase_memory_fraction) handles it without architectural complexity.

Implicit promotion on touch()

Rejected: touch() is called on the hot path for every cache hit. Doing tier promotion there would dramatically slow the common case to optimise a rare one.

References

  • ADR-0001 — dual-queue scheduling sits alongside this tier policy.
  • Playbook §3.4 — original three-tier eviction design.

ADR-0005: Benchmark methodology and metric selection

  • Status: Accepted
  • Date: 2026-05-20
  • Authors: angelnicolasc
  • Reviewers: sole-maintainer decision record

Context

Meridian's value proposition is phase-differentiated scheduling. The benchmark harness has to surface that value — measuring overall throughput or aggregate TPOT will not distinguish Meridian from a stock vLLM scheduler running the same workload, even if Meridian is substantially better for the user-visible streaming experience.

We must choose:

  1. What metrics to report.
  2. What workload to drive the system with.
  3. How to run the benchmark without a GPU in CI.

Decision

Metrics

The benchmark reports the following primary metrics (all per-request, aggregated to percentiles in the report):

| Metric | Definition | Why it matters |
|---|---|---|
| TTFT | Time-to-first-token (prefill + first decoded token). | Industry-standard latency floor; baseline parity. |
| TTOT | Time-to-first-OUTPUT-token. Measured from </think> emission to the first user-visible token after it. | The metric stock vLLM does not even track. This is where dual-queue scheduling shows its value: in a baseline, output tokens can be preempted by think tokens, inflating TTOT P95. |
| Output ITL | Inter-token latency during the output phase only. Measured per (token N → token N+1) pair. P50/P95/P99. | Streaming fluidity: the perceptual quality of the user-visible output. |
| Think tokens | Tokens emitted in the think phase per request. Avg + P95. | Cost driver; budget-forcing efficacy is measured against this. |
| Budget force rate | Percentage of reasoning requests where budget force fired. Broken down by reason (converged, overthinking, hard_cap). | Quality signal: if hard_cap dominates we are forcing blindly; if converged dominates the entropy probe is doing its job. |
| OutputCritical eviction events | Count of eviction events that reached the OutputCritical tier during the run. | User-visible degradation event. Any non-zero is alertable. |

The harness explicitly does NOT report aggregate throughput (tokens/sec/GPU). That is what every other benchmark already reports, and it does not distinguish Meridian from the baseline. Operators who only want a throughput number can run vLLM's own benchmark harness.
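
The per-request latency metrics in the table reduce to simple timestamp arithmetic. The sketch below shows one way to derive TTOT, output ITL and percentiles from recorded timestamps; the function names and the nearest-rank percentile method are assumptions for illustration, not the harness's actual code.

```rust
/// TTOT: time from </think> emission to the first output token, in ms.
/// Returns None when the request produced no output tokens.
pub fn ttot_ms(think_end_ms: f64, output_token_ms: &[f64]) -> Option<f64> {
    output_token_ms.first().map(|first| first - think_end_ms)
}

/// Output ITL: gap between each consecutive pair of output tokens, in ms.
pub fn output_itl_ms(output_token_ms: &[f64]) -> Vec<f64> {
    output_token_ms.windows(2).map(|w| w[1] - w[0]).collect()
}

/// Nearest-rank percentile (p in 0..=100). Returns None for empty input.
pub fn percentile(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Rank is 1-based: the smallest value whose rank covers p percent.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    let idx = rank.clamp(1, sorted.len()) - 1;
    Some(sorted[idx])
}
```

For a request whose </think> lands at 100 ms and whose output tokens land at 105, 125, 150 and 170 ms, TTOT is 5 ms and the ITL samples are 20, 25 and 20 ms.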

Workload

Reference workload: synthetic mix of two categories.

  • Chat — short prompts, 40–240 output tokens, no think phase. Models the ShareGPT-style background traffic that should never stutter.
  • Reasoning — math-style prompts with expected think token counts in [600, 6000]. Models a MATH-500-equivalent distribution.

Mix ratio is operator-configurable (--reasoning-ratio); default 0.4 is the realistic balance for a 2026 reasoning-model deployment.

Arrivals are Poisson (exponential inter-arrival) at a configurable rate. This matches how production traffic actually arrives and exercises the dual-queue policy under realistic burst conditions.
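
Poisson arrivals can be generated by summing exponentially distributed gaps, each sampled by inverse transform as -ln(U) / rate. The sketch below is self-contained and deterministic for a given seed; the SplitMix64 generator and the names `SplitMix64` and `arrival_times` are choices made for this example, not necessarily what the harness uses.

```rust
/// Small deterministic PRNG (SplitMix64), used so the example needs no crates.
pub struct SplitMix64(u64);

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self(seed)
    }

    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform sample in (0, 1]; never zero, so ln() stays finite.
    fn next_unit(&mut self) -> f64 {
        ((self.next_u64() >> 11) as f64 + 1.0) / (1u64 << 53) as f64
    }
}

/// Returns `n` arrival timestamps (seconds) for a Poisson process with the
/// given rate, built from exponential inter-arrival gaps.
pub fn arrival_times(rate_per_sec: f64, n: usize, seed: u64) -> Vec<f64> {
    let mut rng = SplitMix64::new(seed);
    let mut t = 0.0_f64;
    let mut out = Vec::with_capacity(n);
    for _ in 0..n {
        // Inverse-transform sampling of Exp(rate).
        t += -rng.next_unit().ln() / rate_per_sec;
        out.push(t);
    }
    out
}
```

Fixing the seed reproduces the same schedule on every run, which is the property that makes seeded replays comparable across two runs.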

Two execution modes

  • synthetic-replay — uses the native Meridian components (PhaseRouter, MeridianScheduler, BlockManager) over a synthetic decoder loop that does not require a GPU or a real vLLM. The phase events, scheduler queue transitions, KV allocations and eviction pressure are all real; only the per-token compute is simulated as a fixed-cost sleep. This mode runs in CI and detects regressions in the scheduler / block manager dynamics.

  • real-vllm — drives a real AsyncLLMEngine with the MeridianSchedulerPlugin attached. Requires a CUDA-capable host and a model checkpoint. Runs on the GPU CI job and on demand for release validation.

Both modes emit the same BenchmarkReport JSON+Markdown shape so reports are directly comparable. CI uploads both as artefacts and the Markdown form can be posted as a PR comment for visual diff.

Consequences

Positive

  • Reproducibility: synthetic-replay is deterministic given --seed and runs in seconds. Two PRs can be compared apples-to-apples without GPU access.
  • Honest reporting: metrics call out exactly where Meridian wins (TTOT, output ITL variance) and acknowledge what we don't measure (raw throughput).
  • Cross-mode parity: the same BenchmarkReport schema for both modes means a regression caught in synthetic-replay translates directly to expected behaviour under real-vllm.
  • Honest about failures: the output_critical_eviction_events counter surfaces user-visible degradation immediately; the budget-force reason breakdown surfaces when the entropy probe is doing real work vs. just hitting the cap.

Negative / risks

  • synthetic-replay does not exercise the CUDA kernels. A regression in the kernels will not be caught by CI; the GPU job's kernel_correctness test is the line of defence there.
  • synthetic per-token latency is calibrated, not measured. The default sleeps (6 µs / 18 µs for think / output) are approximations of bf16-Qwen3-on-H100 wall-clock times. Operators tuning a different hardware target should override via the SyntheticDecoder constructor.
  • Mix is synthetic. Real ShareGPT / MATH-500 replays are available via --workload sharegpt|math500 and use the offline HuggingFace dataset loader; they do not require a GPU.

Neutral

  • Report artefacts are JSON + Markdown only. Operators who want an HTML dashboard can render the JSON externally.

Alternatives considered

"Just use vLLM's benchmark harness"

Rejected. vLLM's harness reports throughput and TTFT, neither of which shows Meridian's value. We would have to extend it to track TTOT and phase-differentiated latencies — at which point we have already built this harness, but with a tight coupling to vLLM's internal benchmark abstractions.

MLPerf-style reference workloads

Considered, deferred. MLPerf targets raw throughput and aggregate API-level latency, not phase-differentiated metrics. A MLPerf-compatible reporting mode could be added if there is downstream demand, but it does not fit the primary signal Meridian optimises for.

Per-token wall-clock tracing

Considered (would record every decode-step timestamp and reconstruct the queue depths after the fact). Rejected because it generates GiB of trace data per run for marginal additional signal — the aggregated percentiles are enough to identify regressions and the OpenTelemetry spans give the deep-dive when needed.

References

  • Playbook §5 — target metrics table.
  • ADR-0001 — dual-queue scheduling, the design this benchmark validates.
  • ADR-0004 — KV tier policy, the design that output_critical_eviction_events directly measures.

ADR-0006: Disaggregated KV transfer protocol

  • Status: Accepted
  • Date: 2026-05-20
  • Authors: angelnicolasc
  • Reviewers: sole-maintainer decision record

Context

The current frontier of LLM serving infrastructure has settled on prefill-decode disaggregation: prompt-processing (compute-bound) and token-decoding (memory-bandwidth-bound) run on separate worker pools and exchange KV blocks across a high-bandwidth fabric. NVIDIA NIXL is the CUDA-blessed reference; Mooncake (Moonshot AI, FAST '25) is the open-source protocol that the rest of the ecosystem implements.

Meridian's three-tier block manager already produces the exact signal a disaggregated pool wants — ThinkComplete blocks at ExitThink are known-cold. A scheduler with phase visibility is the natural producer of well-batched, well-timed offload events: no other layer in the stack knows that right now, this request just finished reasoning and won't re-read its think KV.

The remaining question is how a phase-aware scheduler talks to the fabric. We need a wire format and a set of trigger points that work across NIXL and Mooncake (and any future fabric) without forcing Meridian to depend on a specific runtime.

Constraints:

  • The fabric layer cannot live inside meridian-core — that crate is forbid(unsafe_code) and has no transitive CUDA dependency. It must live in meridian-kernels behind a cargo feature so non-disagg deployments pay nothing.
  • The wire format must be readable from Python and C++ NIXL agents, not just from Rust. NIXL's reference implementation is C/C++ + Python bindings; Mooncake's reference implementation is C++ + Python.
  • A meridian deployment that wants disagg but doesn't have a libnixl runtime available must still be exercisable end-to-end — otherwise the integration is impossible to test without specialised hardware.

Decision

Meridian defines a small versioned wire protocol (MRDN v1) for disaggregated KV transfer, implements it behind the nixl cargo feature of meridian-kernels, and exposes the trigger points through the BlockManager trait.

Wire format

+---------------+---------------+---------------+---------------+
| magic (4)     | version (4)   | body_len (4)  | tier (1)+pad(3)|
+---------------+---------------+---------------+---------------+
| checksum (16, Blake3-128)                                     |
+---------------+---------------+---------------+---------------+
| body (body_len bytes — opaque to Meridian; NIXL/Mooncake      |
| interpret as raw KV bytes)                                    |
+---------------+---------------+---------------+---------------+
  • magic = b"MRDN" — fail-fast on misrouted payloads.
  • version = 1 — incremented on any breaking framing change. We commit to preserving v1 across all 0.x releases.
  • tier — the producer's tier label (ThinkComplete | ThinkActive | OutputCritical). The consumer may ingest into a different tier; the field exists for telemetry and for fabric-side admission policy.
  • checksum — Blake3 of the body, truncated to 128 bits. Detects bit flips on RDMA and silent corruption inside fabric staging buffers.

The header is exactly 32 bytes. Body is opaque to the protocol — NIXL treats it as a raw KV slab; Mooncake adds its own framing inside.
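
For readers writing a Python-side consumer, the 32-byte header described above can be packed and parsed with the standard struct module. This is a minimal sketch, not Meridian's implementation. It assumes little-endian integer fields and a 0/1/2 numeric encoding for tier, neither of which is stated in this ADR, and it substitutes BLAKE2b-128 from hashlib for the BLAKE3-128 checksum because BLAKE3 is not in the Python standard library. All function and constant names are illustrative.

```python
import hashlib
import struct

MAGIC = b"MRDN"
VERSION = 1
# Assumed layout: little-endian u32 fields (the ADR does not state byte order).
# magic(4) | version(4) | body_len(4) | tier(1) + pad(3) | checksum(16) = 32 bytes
HEADER = struct.Struct("<4sIIB3x16s")
# Assumed numeric tier codes (index in this tuple); the real encoding may differ.
TIERS = ("ThinkComplete", "ThinkActive", "OutputCritical")


def checksum128(body: bytes) -> bytes:
    # Stand-in digest: BLAKE2b with a 16-byte output. The protocol specifies
    # BLAKE3 truncated to 128 bits, which needs a third-party package.
    return hashlib.blake2b(body, digest_size=16).digest()


def encode_frame(body: bytes, tier: str) -> bytes:
    header = HEADER.pack(MAGIC, VERSION, len(body), TIERS.index(tier), checksum128(body))
    return header + body


def decode_frame(frame: bytes) -> tuple[str, bytes]:
    if len(frame) < HEADER.size:
        raise ValueError("frame shorter than header")
    magic, version, body_len, tier_code, digest = HEADER.unpack_from(frame)
    if magic != MAGIC:
        raise ValueError("bad magic: misrouted payload")
    if version != VERSION:
        raise ValueError(f"unsupported version {version}")
    if tier_code >= len(TIERS):
        raise ValueError(f"unknown tier code {tier_code}")
    body = frame[HEADER.size:HEADER.size + body_len]
    if len(body) != body_len:
        raise ValueError("truncated body")
    if checksum128(body) != digest:
        raise ValueError("checksum mismatch")
    return TIERS[tier_code], body
```

The checks run in fail-fast order: magic first, then version, then length, and the checksum last, so a misrouted or mixed-version payload is rejected before any hashing work is done.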

Trigger points

  • ExitThink — the producer's natural offload window. The scheduler batches the request's ThinkComplete blocks and pushes them to the fabric in a single shot, amortised by disagg.offload_threshold_blocks.
  • OutputCritical allocation pressure — if the local pool is thrashing OutputCritical (a user-visible degradation event), the scheduler may pull blocks back from the fabric to satisfy allocations. The ingest_block hook is implemented; the automatic pull policy under allocation pressure is deferred pending measured offload latency data to calibrate the threshold.
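
The ExitThink batching can be pictured as a small buffer in front of the fabric. The sketch below is one possible reading, not the shipped scheduler logic: it assumes blocks accumulate across ExitThink events and are pushed in one burst once offload_threshold_blocks is reached. The OffloadBuffer class, its method names, and the push callable are hypothetical.

```python
from typing import Callable, List


class OffloadBuffer:
    """Hypothetical buffer that defers fabric pushes until a block threshold is met."""

    def __init__(self, push: Callable[[bytes], int], offload_threshold_blocks: int) -> None:
        self._push = push
        self._threshold = offload_threshold_blocks
        self._pending: List[bytes] = []

    def on_exit_think(self, think_complete_blocks: List[bytes]) -> List[int]:
        # Accumulate the request's ThinkComplete blocks; flush only at the threshold.
        self._pending.extend(think_complete_blocks)
        if len(self._pending) < self._threshold:
            return []
        handles = [self._push(block) for block in self._pending]
        self._pending.clear()
        return handles
```

Setting the threshold to 1 makes every ExitThink flush immediately, which trades the amortisation for lower offload latency.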

Fabric trait

pub trait Fabric: Send + Sync + std::fmt::Debug {
    fn push(&self, payload: Vec<u8>) -> Result<u64>;
    fn pull(&self, handle: u64) -> Result<Vec<u8>>;
    fn label(&self) -> &'static str;
}

Shipped implementations:

  • SyntheticNixlFabric — in-process keyed map, wire-format-identical to a real NIXL agent on the host side. Used for integration tests and for portfolio deployments where libnixl isn't reachable.
  • Real libnixl FFI — gated on nvidia-nixl-sys becoming available on crates.io. The call sites already speak the protocol; switching swaps the Fabric implementation only.
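
To make the push/pull contract concrete, here is a Python analogue of an in-process keyed-map fabric. It is an illustrative sketch and not the shipped SyntheticNixlFabric: the class name, the handle allocation scheme (an increasing counter), the non-consuming pull, and the KeyError on an unknown handle are all assumptions.

```python
import itertools
import threading
from typing import Dict


class InProcessFabric:
    """Hypothetical in-memory stand-in mirroring the push/pull/label trait shape."""

    def __init__(self, label: str = "nixl-synth") -> None:
        self._label = label
        self._store: Dict[int, bytes] = {}
        self._next_handle = itertools.count(1)
        # The Rust trait is Send + Sync, so guard shared state with a lock here.
        self._lock = threading.Lock()

    def push(self, payload: bytes) -> int:
        with self._lock:
            handle = next(self._next_handle)
            self._store[handle] = bytes(payload)
            return handle

    def pull(self, handle: int) -> bytes:
        with self._lock:
            # Assumption: pull does not remove the payload; unknown handles raise KeyError.
            return self._store[handle]

    def label(self) -> str:
        return self._label
```

Because callers only see push, pull and label, swapping this class for a real transport changes no call sites, which is the property the trait is meant to provide.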

Mooncake compatibility is achieved by writing a MooncakeAdapter: Fabric that re-frames the v1 wire body inside Mooncake's transport. The header survives unchanged, so an end-to-end conversation between a Meridian producer and a Mooncake-only consumer is a one-adapter delta.

Consequences

Positive

  • A single wire format covers two ecosystems (NIXL + Mooncake) and is forward-compatible because version is on the wire.
  • Checksum-on-body catches silent corruption — the kind of incident that takes 48 h to diagnose in a heterogeneous fabric.
  • The BlockManager trait gains offload_block / ingest_block / block_location as non-breaking additions: the default impls return DisaggUnavailable for offload/ingest and Local for location, so existing implementations keep working.
  • Portfolio deployments can demonstrate the full disagg path without GPU hardware, via the synthetic fabric.

Negative / risks

  • Adding Vec<u8> allocations on every offload contradicts the zero-allocation discipline of the router hot path. Mitigation: the offload path runs at ExitThink, which is at most once per request — well off the per-token critical path.
  • The 32-byte header is overhead for blocks that may be only a few KiB each. At a typical 16 KiB block this is 0.2% — acceptable.
  • The synthetic fabric is not a substitute for real NIXL benchmarks. Anyone reading a synthetic-fabric A/B chart should know what they're reading. We label the fabric as "nixl-synth" in all telemetry to make this unambiguous.
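
The header-overhead figure follows directly from the 32-byte header size. The short calculation below reproduces it for a few block sizes; the 4 KiB and 64 KiB sizes are illustrative and do not come from this ADR.

```python
HEADER_BYTES = 32  # fixed MRDN v1 header size


def header_overhead_percent(block_bytes: int) -> float:
    """Header size as a percentage of the body it frames."""
    return 100.0 * HEADER_BYTES / block_bytes


for kib in (4, 16, 64):
    pct = header_overhead_percent(kib * 1024)
    print(f"{kib:>2} KiB block: {pct:.3f}% header overhead")
```

At 16 KiB the ratio is 32 / 16384, just under 0.2%, and it stays below 1% even for 4 KiB blocks.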

Neutral

  • The version field commits us to a backward-compatibility plan once v2 ships. ADR-0007 documents the policy.

Alternatives considered

gRPC point-to-point. Considered briefly. Rejected because every KV transfer would pay HTTP/2 framing overhead on top of the actual payload, and the standard NIXL/Mooncake clients don't speak gRPC. Our producer would be the odd one out.

RDMA-only, no framing. Considered. Rejected because RDMA without framing requires every consumer to know the producer's exact tier and checksum convention out-of-band. The 32-byte header buys us self-describing payloads at a 0.2% overhead — strictly worth it.

Per-block streams over Mooncake without our own header. Considered. Rejected because we'd lose the tier and version fields, which makes mixed-version cluster rollouts dangerous: a producer at v1 talking to a consumer at v0 must fail fast, not silently mis-tier blocks.

References

  • NIXL technical brief, NVIDIA Developer Blog, March 2026.
  • Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot, ASPLOS '25.
  • Blake3 specification, https://github.com/BLAKE3-team/BLAKE3-specs.
  • Meridian playbook §6 (Disaggregation outlook).
  • ADR-0004 (KV tier promotion policy) — describes why ThinkComplete is a natural offload candidate.

ADR-0007: Release and versioning policy

  • Status: Accepted
  • Date: 2026-05-20
  • Authors: angelnicolasc
  • Reviewers: sole-maintainer decision record

Context

Meridian ships three artefacts on three different cadences:

  • The meridian-core and meridian-kernels crates → crates.io.
  • The meridian Python package (the vLLM plugin) → PyPI.
  • The mdBook → GitHub Pages.

A release that bumps any of them must keep the others coherent — breaking the Python plugin while leaving meridian-core stable would strand operators on a half-upgraded stack. We need an explicit policy for what triggers a release, which artefacts move together, and how breaking changes are signalled — both for the pre-1.0 phase (today) and for the post-1.0 phase (after a year of production use).

We also need to commit to a provenance and SBOM story so the project is auditable end-to-end. SLSA Level 2 is the bar reasonable consumers expect in 2026.

Decision

SemVer interpretation

  • Post-1.0 — strict SemVer. Breaking changes require a major bump; additive changes require a minor bump; bugfixes are patches.
  • Pre-1.0 — breaking changes ship as minor bumps. The CHANGELOG entry for every minor bump must list every breaking change under a BREAKING CHANGE: footer. Operators reading the CHANGELOG before upgrading get a complete diff with no surprises.
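
The two regimes can be summarised as a small version-bump function. This is a sketch of the rule as written, not tooling that ships with Meridian. The ADR only states that pre-1.0 breaking changes ship as minor bumps; treating pre-1.0 additive changes as minor bumps and fixes as patches is an assumption, as are the function name and the change labels.

```python
def next_version(current: str, change: str) -> str:
    """Return the next version for a change of kind 'breaking', 'additive' or 'fix'."""
    major, minor, patch = (int(part) for part in current.split("."))
    if change == "fix":
        return f"{major}.{minor}.{patch + 1}"
    if change == "additive":
        return f"{major}.{minor + 1}.0"
    if change == "breaking":
        if major == 0:
            # Pre-1.0: breaking changes ship as minor bumps.
            return f"{major}.{minor + 1}.0"
        # Post-1.0: strict SemVer, breaking changes require a major bump.
        return f"{major + 1}.0.0"
    raise ValueError(f"unknown change kind: {change!r}")
```

For example, a breaking change on 0.1.0 yields 0.2.0, while the same change on 1.4.2 yields 2.0.0.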

Release cadence

  • Minor: every six weeks. The window is fixed; the content is whatever passes CI plus the ADRs accepted in that window.
  • Patch: on demand for security fixes and high-severity bugs. No fixed cadence — patches ship within 48 h of a confirmed report for SEV-1 (CVSS ≥ 7.0), per the security policy.

Branch policy

  • main is always-releasable. Nothing merges that breaks CI.
  • No release branches. release-plz drives the changelog and tag from main directly.
  • Hotfixes are commits on main followed by a patch tag. We do not back-port to N-1 minors during pre-1.0 — operators on an old pre-1.0 minor are expected to upgrade forward.

Artefact set per release

Every tagged release ships:

  • Compiled Rust static libs and the meridian Python wheel — built and uploaded as GitHub release assets by release.yml.
  • CycloneDX SBOM for the Rust workspace and the Python wheel, attached to the GitHub release.
  • SLSA Level 2 provenance attestation, generated by slsa-github-generator and uploaded as a release asset.
  • A signed Git tag (when GPG is configured) and an annotated tag otherwise.
  • mdBook deploy → GitHub Pages (via docs.yml on push to main).

Note on crates.io and PyPI publishing: release.yml carries a manual (workflow_dispatch) publish job. It defaults to dry-run (cargo publish --dry-run for the crates plus twine check on the wheel), so packaging is validated on every invocation without pushing anything. Selecting publish_mode = live performs the real publish, but each live step is token-gated: it runs only when CARGO_REGISTRY_TOKEN (crates.io) and PYPI_API_TOKEN (PyPI) are present in repo secrets. Until those are configured the job is safe to run and simply validates. This keeps publish off the automatic tag path while the project stabilises pre-1.0.

The three artefacts (crates, wheel, mdBook) move together. A release with a partial set is a CI failure, not a partial release.

Version coupling

The policy is that meridian-core, meridian-kernels, and the meridian Python package carry the same version string. release-plz enforces this for the Rust side via Cargo workspace inheritance.

Current state of automation as of v0.1.0: python/pyproject.toml carries a hardcoded version = "0.1.0" that must be bumped manually in sync with the Cargo workspace. A build hook to read the workspace version automatically is planned but not yet wired. Operators upgrading the Rust workspace must also update python/pyproject.toml until that hook is in place.

Yanking

We will yank a crates.io publish only for a security incident or a correctness regression with no workaround. Style fixes, doc errors and "meh, the version should have been a minor" do not justify yanking — those get a forward fix in the next release.

Consequences

Positive

  • Operators reading the CHANGELOG ahead of an upgrade get a complete picture of what's breaking — no surprises during pre-1.0.
  • One number versions the entire stack. Cross-artefact compatibility questions ("does the wheel work with crate X.Y.Z?") become trivially answerable.
  • SLSA L2 provenance + SBOM at every release make Meridian a credible citizen of supply-chain-conscious deployments.
  • A six-week minor cadence is short enough to feel responsive but long enough that operators don't get release fatigue.

Negative / risks

  • A monolithic version bump means even a docs-only release moves every artefact. Cargo wastes a publish; PyPI wastes a wheel. The cost is real but small — well below the cost of decoupled versions diverging during an incident.
  • The fixed six-week cadence will sometimes ship "nothing material". We accept that; consistency beats hoarding.

Neutral

  • The policy is enforced by release-plz config plus the CI release workflow. There is no human gate between a green CI on main and a tagged release.

Alternatives considered

Independent versions per artefact. Considered for the parity it gives with how Cargo and PyPI usually work in larger projects. Rejected because the cross-artefact compatibility matrix becomes a second README, and we are not large enough yet to justify the overhead.

Release on every merge to main. Considered as a continuous-release model. Rejected because it would publish to crates.io and PyPI dozens of times a week, polluting the index and triggering downstream Dependabot noise for every contributor.

Long-lived release branches per minor. Considered for the back-port story it enables. Rejected pre-1.0 because we explicitly do not support N-1; operators are expected to upgrade forward.

References

ADR-0008: Request preemption policy

  • Status: Accepted
  • Date: 2026-05-20
  • Authors: angelnicolasc
  • Reviewers: sole-maintainer decision record

Context

Meridian reorders the batch vLLM hands it so output-phase requests are dispatched ahead of think-phase requests (ADR-0001). Reordering is a non-destructive operation: it changes the rank of requests that vLLM has already decided to run this step, but it never removes a request from the running set and never reclaims KV from a request mid-flight.

The next lever — and the one every mature serving scheduler eventually reaches for — is preemption: evicting a request that is already dispatched so its KV blocks can be reused by a higher-priority request. vLLM has its own preemption (recompute / swap), but it is phase-blind: it preempts on global memory pressure without knowing that a think-phase request holding 14 GB of KV is a far better victim than an output-phase request streaming to a waiting user.

A phase-aware preemption policy is therefore a natural Meridian feature. The question this ADR answers is not "how do we build it" in isolation, but "do we build it before 1.0, and if not, what is the design and the risk that justify deferring it". Shipping preemption is the single highest-risk change available to the scheduler: it can deadlock, it can thrash, and it can corrupt a request's KV if the reconstruction path is wrong. A wrong preemption decision is user-visible as a stalled or restarted generation.

Decision

Meridian will not preempt already-dispatched requests before 1.0. The plugin's influence on vLLM remains advisory — reordering and budget forcing only. This ADR records the intended design and its risk analysis so the deferral is a deliberate, documented choice rather than a gap.

Intended design (post-1.0)

When preemption is implemented it will follow this shape:

  • Victim selection by phase, then LRU. The victim search walks the block manager's tier order — ThinkComplete first, then ThinkActive, and OutputCritical only under the same severe-pressure warning that evict_for already emits. Within a tier, the least-recently-touched request is the victim. This reuses the existing eviction ordering rather than introducing a second policy.
  • Recompute, not swap, as the default reclamation path. A preempted think-phase request is cheaper to recompute from its prompt than to swap its KV to host and back, because think-phase KV is exactly the data we are willing to discard (ADR-0004). Swap remains available for output-phase victims, which must never lose progress.
  • A preemption budget. At most a configurable fraction of the running set may be preempted per scheduler step (preempt_max_fraction, default small). This caps thrash: a pressure spike cannot evict the whole batch in one step.
  • A re-admission guard. A preempted request is parked with a monotonically increasing priority floor so it cannot be preempted again immediately after re-admission. This is the anti-livelock invariant.
  • Disagg interaction. When a fabric is configured (ADR-0006), a ThinkComplete victim is offloaded rather than discarded, so its KV is recoverable from the fabric instead of recomputed. Preemption and offload share the same victim search.
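
The first and third bullets (tier-then-LRU victim search and the per-step preemption budget) can be sketched in a few lines. This is an illustration of the intended shape only, since preemption is not implemented. The RunningRequest type, the select_victims name, the severe_pressure flag, and rounding the budget down are assumptions, and the re-admission guard and the reclamation path are not modelled.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

# Eviction order: lower rank is preempted first.
TIER_RANK = {"ThinkComplete": 0, "ThinkActive": 1, "OutputCritical": 2}


@dataclass(frozen=True)
class RunningRequest:
    req_id: str
    tier: str
    kv_bytes: int
    last_touched: float  # timestamp of the most recent access


def select_victims(
    running: Sequence[RunningRequest],
    needed_bytes: int,
    preempt_max_fraction: float,
    severe_pressure: bool = False,
) -> List[str]:
    # Cap the number of victims per step at a fraction of the running set.
    budget = math.floor(len(running) * preempt_max_fraction)
    # Tier order first, then least recently touched within a tier.
    candidates = sorted(running, key=lambda r: (TIER_RANK[r.tier], r.last_touched))
    victims: List[str] = []
    freed = 0
    for req in candidates:
        if freed >= needed_bytes or len(victims) >= budget:
            break
        if req.tier == "OutputCritical" and not severe_pressure:
            break  # sorted order: every remaining candidate is OutputCritical too
        victims.append(req.req_id)
        freed += req.kv_bytes
    return victims
```

With the budget rounded down, a small running set and a small fraction can yield zero victims in a step, which is the conservative side of the thrash trade-off.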

API shape (sketch, not committed)

The scheduler would gain a single entry point that returns victims for the caller to actuate against vLLM, keeping Meridian advisory rather than reaching into vLLM's running set directly:

fn select_preemption_victims(
    &self,
    needed_bytes: u64,
    running: &[RequestId],
) -> Vec<PreemptionVictim>   // { req_id, reason, reclaim: Recompute | Swap | Offload }

The plugin translates each victim into the vLLM preemption call appropriate for that vLLM version, isolated in the same _extract_* / _reorder shim layer that already absorbs vLLM API drift.

Consequences

Positive

  • The pre-1.0 scheduler stays advisory and therefore safe: the worst case of a wrong Meridian decision is a sub-optimal dispatch order, never a lost or corrupted generation.
  • The design is written down, so when preemption lands it starts from a reviewed risk analysis rather than a blank page.
  • Victim selection reuses the existing tier ordering and disagg offload path, so the eventual implementation is additive, not a rewrite.

Negative / risks (the reason for deferral)

  • Without phase-aware preemption, Meridian leaves throughput on the table under heavy memory pressure: vLLM's phase-blind preemption will sometimes evict an output-phase request when a think-phase victim was available. This is the cost we accept pre-1.0.
  • The risks that justify waiting:
    • Deadlock: a preempted request needs memory to be re-admitted that only its own preemption could free. Mitigated by the recompute default and the re-admission priority floor — both unproven until implemented.
    • Thrash / livelock: oscillating preempt/re-admit under sustained pressure. Mitigated by the per-step preemption budget.
    • KV correctness: a swap-and-restore bug silently corrupts a request's context. This needs a dedicated correctness harness on real hardware before it can ship — which is precisely the validation we do not yet have.

Neutral

  • This ADR is revisited once Meridian has a real-hardware benchmark baseline. Preemption that cannot be measured against stock vLLM under memory pressure cannot be justified, so the work is gated on that measurement capability existing first.

Alternatives considered

Implement preemption now, behind a default-off flag. Considered, so the code path exists for early adopters. Rejected: a default-off feature with no real-hardware correctness harness is untested code that rots, and the risk analysis above shows the failure modes are the kind that only surface under real load.

Delegate entirely to vLLM's preemption forever. Considered as the permanent answer. Rejected as a permanent policy because phase-blind preemption contradicts Meridian's entire thesis; accepted as the pre-1.0 policy because the safety bar for taking over preemption is high.

References

  • ADR-0001 (Dual-queue vs. priority weights) — the advisory-reordering baseline this ADR declines to extend pre-1.0.
  • ADR-0004 (KV tier promotion policy) — the tier ordering victim selection reuses.
  • ADR-0006 (Disagg KV transfer protocol) — the offload path a ThinkComplete victim takes when a fabric is configured.

ADR-NNNN: Short kebab title

  • Status: Proposed | Accepted | Superseded by ADR-NNNN | Deprecated
  • Date: YYYY-MM-DD
  • Authors: name
  • Reviewers: name, name

Context

What is the problem we are facing? What constraints apply? What evidence do we have (benchmarks, incident postmortems, prior art)?

Decision

What did we decide? State it as a single declarative sentence at the top, then expand.

Consequences

Positive

  • Bullet list of what we gain.

Negative / risks

  • Bullet list of what we lose, and how we will detect each risk if it materializes.

Neutral

  • Things that change but are neither clearly positive nor negative.

Alternatives considered

For each alternative: one paragraph on what it would look like, and why we rejected it.

References

  • Links to research papers, prior ADRs, incident reports, benchmarks.

Metrics

Meridian emits Prometheus metrics and OpenTelemetry traces. All metric names are stable contracts — renames or semantic changes trigger a minor-version bump per ADR-0007.

Metric catalog

meridian.think_tokens_per_request

| Property | Value |
|---|---|
| Type | histogram |
| Unit | tokens |
| Cardinality | 1 series (no labels) |
| Source | PhaseRouter on ExitThink or ForceBudget |
| Why | Tracks the distribution of reasoning-chain lengths. Long tails here indicate the entropy probe is deferring too late; a spike in the P99 bucket means hard-cap forcing is dominating. |

Operator action: if P99 frequently hits max_think_tokens, lower max_think_tokens or adjust the probe thresholds so forcing fires earlier (raise eat_ema_variance_threshold or lower rpdi_threshold).


meridian.budget_force_triggered

| Property | Value |
|---|---|
| Type | counter |
| Unit | events |
| Cardinality | 1 series |
| Source | PhaseRouter |
| Why | Measures how often the router fires </think> injection. A counter that never moves means the entropy probe is never converging (check eat_ema_variance_threshold). |

meridian.budget_force_reason{reason=...}

| Property | Value |
|---|---|
| Type | counter |
| Unit | events |
| Labels | reason {converged, overthinking, hard_cap} |
| Cardinality | 3 series |
| Source | PhaseRouter |
| Why | Breaks down why forcing fired. converged and overthinking mean the entropy probe is working. Sustained hard_cap dominance means the probe is failing to detect convergence. |

Operator action: monitor the ratio hard_cap / (converged + overthinking) over a 1-hour window. A ratio above 0.5 is a signal to investigate probe thresholds or inspect sample EAT traces.
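
The ratio check is simple enough to express directly. The sketch below assumes the three inputs are the per-reason counter increases over the same 1-hour window; the function names and the handling of a zero denominator are illustrative choices, not part of Meridian.

```python
def hard_cap_ratio(converged: int, overthinking: int, hard_cap: int) -> float:
    """hard_cap / (converged + overthinking) for one observation window."""
    adaptive = converged + overthinking
    if adaptive == 0:
        # No adaptive forcing at all: treat any hard-cap activity as an unbounded ratio.
        return float("inf") if hard_cap > 0 else 0.0
    return hard_cap / adaptive


def needs_probe_investigation(
    converged: int, overthinking: int, hard_cap: int, threshold: float = 0.5
) -> bool:
    return hard_cap_ratio(converged, overthinking, hard_cap) > threshold
```

A window with 60 converged, 20 overthinking and 50 hard_cap events gives 50 / 80 = 0.625, which crosses the 0.5 line.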


meridian.output_critical_eviction

| Property | Value |
|---|---|
| Type | counter |
| Unit | events |
| Cardinality | 1 series |
| Source | PhaseAwareBlockManager |
| Why | Every increment is a user-visible degradation event: a KV block backing the live output stream was evicted under memory pressure. Zero is the target. |

Operator action: alert at rate(...) > 0 sustained for 5 minutes. Mitigate by lowering think_phase_memory_fraction, think_batch_multiplier, or max_think_tokens.


meridian.phase_router.tracked_requests

| Property | Value |
|---|---|
| Type | gauge |
| Unit | requests |
| Cardinality | 1 series |
| Source | PhaseRouter |
| Why | Requests that complete but are never reaped leak entries. Monotonically growing gauge means reap_stale_older_than is not being called, or the vLLM plugin's post_step is not receiving EOS events. |

Operator action: if the gauge grows without a corresponding growth in active concurrent requests, check that post_step receives EOS and that the reap period (60 s default) is shorter than request lifetime.


meridian.schedule_batch.duration_ns

| Property | Value |
|---|---|
| Type | histogram |
| Unit | nanoseconds |
| Cardinality | 1 series |
| Source | MeridianScheduler::schedule_batch |
| Why | Measures scheduling overhead on the hot path. Should be in the microsecond range; millisecond-range values indicate contention inside the scheduler lock. |

Operator action: P99 above 1 ms under steady load is unexpected. File an issue with a CPU profile.


meridian.queue_depth{queue=...}

| Property | Value |
|---|---|
| Type | gauge |
| Unit | requests |
| Labels | queue {output, think} |
| Cardinality | 2 series |
| Source | MeridianScheduler |
| Why | Monitors queue backlog. Output queue depth growing without draining means the GPU is bottlenecked. Think queue depth growing without budget_force_triggered activity means the entropy probe is not converging and long chains are piling up. |

Operator action: alert when queue_depth{queue="think"} P95 exceeds 4× its 1-hour baseline for 5 consecutive minutes without accompanying forcing activity.


meridian.block_manager.used_bytes

| Property | Value |
|---|---|
| Type | gauge |
| Unit | bytes |
| Cardinality | 1 series |
| Source | PhaseAwareBlockManager |
| Why | Total KV bytes currently allocated across all tiers. Rising towards kv_memory.capacity_bytes predicts incoming eviction pressure. |

Operator action: alert when block_manager.used_bytes / capacity_bytes exceeds 0.90 for 10 minutes — this is the early-warning threshold before output_critical_eviction events begin.


meridian.block_manager.evictions{tier=...}

| Property | Value |
|---|---|
| Type | counter |
| Unit | events |
| Labels | tier {think_complete, think_active, output_critical} |
| Cardinality | 3 series |
| Source | PhaseAwareBlockManager |
| Why | Per-tier eviction rate reveals the shape of memory pressure. think_complete evictions are routine; think_active indicates moderate pressure; output_critical is a user-visible degradation event identical to meridian.output_critical_eviction. |

Operator action: alert on any tier=output_critical increment — use this series or meridian.output_critical_eviction, whichever is easier to route in your alerting stack.


meridian.scheduler.batch_size{phase=...}

| Property | Value |
|---|---|
| Type | histogram |
| Unit | slots |
| Labels | phase {output, think} |
| Cardinality | 2 series |
| Source | MeridianScheduler |
| Why | Distribution of actual batch sizes delivered to the vLLM worker per phase. A consistently small output batch under load means output requests are draining faster than think-phase completions replenish the pool. |

Operator action: compare scheduler.batch_size{phase=output} P50 against queue_depth{queue=output} to verify output requests are being served promptly.


meridian_disagg_blocks_offloaded_total{fabric=...}

| Property | Value |
|---|---|
| Type | counter |
| Unit | blocks |
| Labels | fabric {nixl, mooncake} |
| Cardinality | 1 series per active fabric |
| Source | MeridianSchedulerPlugin on ExitThink (flushed at offload_threshold_blocks) |
| Why | Tracks disagg throughput. A counter that never moves when disagg is enabled means offload hooks are not firing. |

meridian_vocab_fallback_total

| Property | Value |
|---|---|
| Type | counter |
| Unit | events |
| Cardinality | 1 series |
| Source | MeridianSchedulerPlugin entropy-probe batch path |
| Why | Counts batches where logit rows had heterogeneous vocab sizes and the probe fell back to per-request compute. A rising counter means mixed-model batching is defeating the batched probe; investigate request routing. |

OTLP export

Prometheus is the primary metric surface. When [telemetry] otlp_enabled = true (requires the otel extra), the plugin additionally exports its counters to an OTLP/HTTP collector at [telemetry] otlp_endpoint, and the Rust core can wire its tracing spans to OTLP via the otel crate feature (meridian_core::telemetry::install). Both are off by default.

Trace spans

Each MeridianScheduler::schedule_batch call opens a meridian.schedule_batch OpenTelemetry span. PhaseEvents propagate meridian.phase_event{kind=...} events on the active request's span, allowing per-request phase timelines to be reconstructed from trace data.

Alerting summary

| Metric | Alert condition | Severity |
|---|---|---|
| output_critical_eviction | rate > 0 for 5 min | High (user-visible) |
| block_manager.evictions{tier=output_critical} | rate > 0 for 1 min | High (user-visible; same event, finer label) |
| block_manager.used_bytes | > 90% of capacity for 10 min | Medium (pre-eviction warning) |
| queue_depth{queue=think} | P95 > 4× baseline for 5 min with no forcing | Medium (starvation risk) |
| budget_force_reason{reason=hard_cap} | ratio > 0.5 over 1 h | Low (probe investigation) |
| phase_router.tracked_requests | monotonically growing > 15 min | Low (reap misconfiguration) |

Troubleshooting

Each entry follows runbook format: Symptom → Likely cause → How to verify → Immediate mitigation → Longer-term fix.


Output streams stutter under load

Symptom: users report visible gaps in the output token stream; output ITL P99 spikes.

Likely cause: OutputCritical KV blocks are being evicted under memory pressure.

How to verify:

rate(meridian.output_critical_eviction[1m]) > 0

Any non-zero rate confirms the block manager is evicting user-visible KV.

Immediate mitigation: reduce load. Lower the arrival rate or raise kv_memory.capacity_bytes if headroom exists.

Longer-term fix (pick one or more):

  1. Lower kv_memory.think_phase_memory_fraction (e.g. 0.40 → 0.30) to give output more room.
  2. Lower scheduler.think_batch_multiplier to reduce think-phase KV pressure.
  3. Lower scheduler.max_think_tokens so individual reasoning chains release blocks sooner.

Budget forcing never fires; chains always hit hard cap

Symptom: meridian.budget_force_reason{reason=hard_cap} increments constantly; converged and overthinking stay at zero.

Likely cause: the entropy probe is not detecting convergence. Either the probe is disabled, the thresholds are too tight, or the model is being asked to reason on prompts that produce genuinely non-converging chains.

How to verify:

  1. Check meridian.toml: confirm entropy.enabled = true.
  2. Capture a sample EAT trace: add logging.level = "debug" temporarily and inspect EAT EMA values in the logs for a known-good reasoning prompt.

Immediate mitigation: none — hard-cap forcing is safe, just less adaptive.

Longer-term fix:

  • If probe is disabled: set entropy.enabled = true.
  • If thresholds are too tight: relax eat_ema_variance_threshold (e.g. 0.001 → 0.005) or lower rpdi_threshold (e.g. 3.0 → 2.0).
  • If prompts are genuinely non-converging: lower max_think_tokens to bound KV cost.

Phase router shows runaway tracked-requests gauge

Symptom: meridian.phase_router.tracked_requests grows monotonically without a corresponding growth in active requests.

Likely cause: completed requests are not being reaped from the router. Either post_step is not receiving EOS events, or reap_stale_older_than is not being called.

How to verify:

  1. Confirm the vLLM plugin's post_step is wired to the engine's step callback.
  2. Check the reap interval in the plugin config — default 60 s. If request lifetime is shorter than 60 s on average, the reaper may be lagging.

Immediate mitigation: no manual reap trigger is exposed in v0.1.x. The plugin calls router.reap_stale_older_than(60.0) automatically every 64 schedule() invocations; under active serving load this fires within seconds. To force an immediate reap at deploy time, temporarily lower _reap_interval_schedules in vllm_plugin.py to 1.

Longer-term fix: reduce the reap period in vllm_plugin.py or ensure post_step receives every EOS signal from the vLLM worker.
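
The call-count-based reap described in the mitigation can be pictured as follows. This is a sketch of the behaviour as described here, not the code in vllm_plugin.py. The ReapTicker and RecordingRouter classes are hypothetical; only reap_stale_older_than, the interval of 64 schedule() calls and the 60-second staleness come from this page.

```python
from typing import List


class RecordingRouter:
    """Hypothetical stand-in for the phase router; records each reap call."""

    def __init__(self) -> None:
        self.reap_calls: List[float] = []

    def reap_stale_older_than(self, seconds: float) -> None:
        self.reap_calls.append(seconds)


class ReapTicker:
    """Triggers a reap once every N schedule() invocations."""

    def __init__(self, router, reap_interval_schedules: int = 64, stale_after_s: float = 60.0) -> None:
        self._router = router
        self._reap_interval_schedules = reap_interval_schedules
        self._stale_after_s = stale_after_s
        self._calls = 0

    def on_schedule(self) -> bool:
        # Count invocations and reap when the counter hits a multiple of the interval.
        self._calls += 1
        if self._calls % self._reap_interval_schedules == 0:
            self._router.reap_stale_older_than(self._stale_after_s)
            return True
        return False
```

Because the trigger counts schedule() calls rather than wall-clock time, an idle engine never reaps, which is consistent with the gauge only being a concern under active serving load.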


CUDA kernel returns Unavailable

Sprint 0 note: python/meridian/_backends/cuda.py currently delegates to CpuEntropyBackend; KernelError::Unavailable cannot be triggered through the Python probe in v0.1.x. The scenario below applies once Sprint 1 wires the Python backend to the Rust kernel path in crates/meridian-kernels/.

Symptom: logs show KernelError::Unavailable; entropy probe silently falls back to CPU; meridian.budget_force_reason{reason=hard_cap} is the only firing signal.

Likely cause: the meridian-kernels crate was built without --features cuda, or libcudart.so is not on the dynamic linker path at runtime.

How to verify:

# Confirm the Python extension loads and report the current backend behaviour.
python -c "
from meridian._backends.cuda import CudaEntropyBackend
b = CudaEntropyBackend()
print('backend name:', b.name)
# Sprint 0: always prints 'cuda' but delegates to CPU internally.
# Real CUDA dispatch requires a --features cuda build (Sprint 1).
"

# Confirm the Rust kernels extension is importable.
python -c "import meridian._meridian; print('native extension OK')"

Immediate mitigation: the CPU fallback is correct — entropy values are identical. The only cost is CPU cycles and higher schedule_batch.duration_ns latency.

Longer-term fix:

  1. Rebuild with cargo build -p meridian-kernels --features cuda.
  2. Verify libcudart.so is on LD_LIBRARY_PATH or in /usr/local/cuda/lib64.
  3. Run maturin develop --release -m crates/meridian-python/Cargo.toml to regenerate the Python extension.

Disagg offload not firing

Symptom: the meridian_disagg_blocks_offloaded_total counter never increments after [disagg] enabled = true is set.

Likely cause: the fabric is not reachable, the offload_threshold_blocks has not been reached, or the NIXL feature was not compiled in.

How to verify:

  1. Confirm cargo build -p meridian-kernels --features nixl succeeds.
  2. Confirm config.disagg.fabric != "none".
  3. Check whether ThinkComplete block count per step is below offload_threshold_blocks.

Immediate mitigation: lower offload_threshold_blocks to 1 temporarily to force immediate offload on every ExitThink.

Longer-term fix: verify fabric connectivity (NIXL service running, network path open) and restore offload_threshold_blocks to the production value.


Plugin does not intercept schedule_batch

Symptom: no Meridian metrics appear; vLLM appears to schedule normally without phase-awareness.

Likely cause: MeridianSchedulerPlugin.attach() was never called, or the plugin was attached to a different scheduler instance than the one the engine uses.

How to verify:

print(type(engine.scheduler[0]))
# Should print: <class 'meridian.vllm_plugin.MeridianSchedulerPlugin'>
# Not: <class 'vllm.core.scheduler.Scheduler'>

Immediate mitigation: call MeridianSchedulerPlugin.attach(engine, cfg, model_key="...") explicitly after engine construction. attach is a classmethod — it constructs the plugin and replaces engine.scheduler[0] in one step.

Longer-term fix: verify the plugin initialisation order in the serving entrypoint. The plugin must be attached after AsyncLLMEngine is fully constructed but before the first request is submitted.

Benchmarks

The benchmark harness lives at benchmarks/. The methodology behind metric choice is recorded in ADR-0005.

Quick start

# CI-friendly: no GPU, no vLLM. Drives native Meridian components over a
# synthetic decoder loop. Finishes in seconds.
uv --project python run python -m benchmarks.meridian_bench synthetic-replay \
    --duration-s 30 --arrival-rate 8 --reasoning-ratio 0.4 \
    --out-dir bench-out/

# GPU-required: drives a real AsyncLLMEngine via the Meridian plugin.
uv --project python run python -m benchmarks.meridian_bench real-vllm \
    --model Qwen/Qwen2.5-0.5B --duration-s 30 --arrival-rate 4 \
    --out-dir bench-out/

Both modes produce identically-shaped artefacts in --out-dir:

  • report.json — full structured report, diffable.
  • report.md — Markdown summary suitable for PR comments.

Metric catalog

| Name | Definition |
|---|---|
| TTFT P50/P95 | Time-to-first-token. Prefill + first decoded token. |
| TTOT P50/P95 | Time from </think> emission to the first user-visible token. |
| Output ITL P50/P95/P99 | Inter-token latency during output phase (streaming jitter). |
| Think tokens avg/P95 | Distribution of reasoning-chain length per request. |
| Output tokens avg | Mean output token count per request. |
| Budget forced % | Percentage of reasoning requests where the router forced </think>. |
| Force reason | Breakdown by converged / overthinking / hard_cap. |
| OutputCritical evictions | KV pressure events that reached the user-visible tier. |

See benchmarks/metrics.py for the exact serialised shape.

A/B comparison mode

--baseline runs the same workload through one or more baseline schedulers alongside Meridian and writes ab-report.{json,md} to --out-dir:

  • stock: StockSchedulerBaseline, a priority-weight single-queue scheduler equivalent to vLLM ≤0.8 (no phase awareness, never forces budget).
  • static-budget: StaticBudgetBaseline, a fixed think-token cap equivalent to vLLM 0.9's thinking_token_budget (forces </think> on a counter, with no entropy signal). This is the prior art Meridian's EAT/RPDI forcing aims to beat.
  • all: run every baseline; the report gets one value column per run and a Δ% vs <baseline> column per baseline with a WIN/win/FLAT/loss/LOSS flag.
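
As a rough illustration of how a Δ% column and its flag could be derived, the sketch below computes a relative change and maps it to one of the five labels. The function names and the 2% / 10% thresholds are assumptions chosen for the example; the harness may use different cut-offs.

```python
def delta_pct(meridian, baseline):
    """Relative change of the Meridian value versus the baseline value, in percent."""
    if baseline == 0:
        raise ValueError("baseline is zero; relative change is undefined")
    return 100.0 * (meridian - baseline) / baseline


def flag(delta, lower_is_better=True, minor=2.0, major=10.0):
    """Map a delta (in percent) onto WIN/win/FLAT/loss/LOSS.

    minor and major are illustrative thresholds, not the harness's values.
    """
    # For latency metrics a negative delta is an improvement, so flip the sign.
    gain = -delta if lower_is_better else delta
    if abs(gain) < minor:
        return "FLAT"
    if gain > 0:
        return "WIN" if gain >= major else "win"
    return "LOSS" if -gain >= major else "loss"


# Example: TTOT P95 of 80 ms under Meridian versus 100 ms under a baseline.
d = delta_pct(80.0, 100.0)
print(d, flag(d))
```

Latency metrics such as TTOT and output ITL are lower-is-better, which is why the sign is flipped before classification.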

Five-minute A/B (no GPU)

# Stock + static-budget + Meridian over a real prompt-length distribution.
python -m benchmarks.meridian_bench synthetic-replay \
    --workload sharegpt --baseline all \
    --duration-s 30 --arrival-rate 8 --out-dir bench-out/

# Read the comparison. Meridian should win TTOT P95 vs both baselines.
cat bench-out/ab-report.md

The synthetic-replay mode requires the native extension (maturin develop -m crates/meridian-python/Cargo.toml); the baseline and report logic alone are exercised by benchmarks/tests/test_baselines.py without it.

Dataset loaders

Pass --workload sharegpt or --workload math500 to load real traffic distributions from HuggingFace. Datasets are downloaded once and cached at ~/.cache/meridian/datasets/. Requires no GPU — the offline replay drives the synthetic decoder with the real prompt/response length distribution.

Test environment disclosure

When comparing numbers across runs:

| Parameter | Default |
| --- | --- |
| --seed | 42 |
| --arrival-rate | 8 req/s |
| --duration-s | 30 s |
| --reasoning-ratio | 0.4 |

Always report --seed and the --workload value. Synthetic results are hardware-independent; real-vLLM results depend on GPU model, driver, and memory state, so disclose all three.
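
To see why the seed matters for reproducibility, consider a minimal seeded workload generator. This sketch assumes Poisson arrivals (exponential inter-arrival gaps); the harness's actual arrival process is not described here, and synth_arrivals is a hypothetical name. Its defaults mirror the table above.

```python
import random


def synth_arrivals(seed=42, arrival_rate=8.0, duration_s=30.0, reasoning_ratio=0.4):
    """Return [(arrival_time_s, is_reasoning), ...] for one synthetic run.

    Assumes a Poisson arrival process: gaps are exponential with mean
    1 / arrival_rate seconds. A private Random instance keeps the stream
    independent of any other use of the global random module.
    """
    rng = random.Random(seed)
    t = 0.0
    arrivals = []
    while True:
        t += rng.expovariate(arrival_rate)
        if t >= duration_s:
            return arrivals
        arrivals.append((t, rng.random() < reasoning_ratio))


run_a = synth_arrivals(seed=42)
run_b = synth_arrivals(seed=42)
print(len(run_a), run_a == run_b)  # same seed, identical request stream
```

Two runs with the same seed see exactly the same request stream, so any difference in the reports comes from the configuration under test rather than from the workload.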

How to compare two runs

# Run A (baseline config)
python -m benchmarks.meridian_bench synthetic-replay --seed 42 --out-dir bench-out/a/

# Run B (modified config)
python -m benchmarks.meridian_bench synthetic-replay --seed 42 --out-dir bench-out/b/

# Diff the JSON reports
diff <(jq -S . bench-out/a/report.json) <(jq -S . bench-out/b/report.json)
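
When a textual diff is too noisy, the same comparison can be done structurally. The following is a sketch under the assumption that report.json is a nested JSON object; the key names in the example (ttot_ms, p50, p95, seed) are invented for illustration and may not match the real report.

```python
import json


def flatten(node, prefix=""):
    """Flatten nested dicts and lists into {"a.b.0": leaf} pairs."""
    if isinstance(node, dict):
        items = node.items()
    elif isinstance(node, list):
        items = enumerate(node)
    else:
        return {prefix: node}
    flat = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat


def changed_keys(report_a, report_b):
    """Return {path: (value_in_a, value_in_b)} for every leaf that differs."""
    a, b = flatten(report_a), flatten(report_b)
    return {k: (a.get(k), b.get(k)) for k in sorted(a.keys() | b.keys()) if a.get(k) != b.get(k)}


# Inline stand-ins for bench-out/a/report.json and bench-out/b/report.json.
run_a = json.loads('{"ttot_ms": {"p50": 20, "p95": 41}, "seed": 42}')
run_b = json.loads('{"ttot_ms": {"p50": 20, "p95": 35}, "seed": 42}')
for path, (old, new) in changed_keys(run_a, run_b).items():
    print(f"{path}: {old} -> {new}")
```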

What this harness is, and what it isn't

  • It is a phase-differentiated latency regression suite. It catches changes that move the TTOT or output-ITL distributions, the metrics Meridian was built to improve.
  • It is reproducible: the synthetic-replay mode is deterministic given --seed. Two PRs can be diffed apples-to-apples.
  • It is not a raw-throughput benchmark. vLLM's own harness already reports tokens/sec/GPU and that metric does not differentiate Meridian from the baseline. Operators who want a throughput number should run vLLM's benchmark.
  • It is not an accuracy benchmark. Budget forcing can in principle hurt reasoning accuracy on hard problems. Accuracy measurement requires a separate ground-truth evaluation suite; this harness does not provide one.

GPU CI runner setup

The cuda.yml workflow targets a self-hosted runner labelled gpu because GitHub-hosted runners do not provide CUDA-capable GPUs in the free tier. This page documents how to provision and secure the runner.

Hardware requirements

| Component | Minimum | Recommended |
| --- | --- | --- |
| GPU | NVIDIA L4 / A10 | H100 / B200 |
| Driver | 555.x | 555.x or newer |
| CUDA | 12.6 | 12.6 |
| Disk | 80 GiB SSD | 200 GiB NVMe |
| RAM | 32 GiB | 64 GiB |

Required secrets

| Secret | Where set | Purpose |
| --- | --- | --- |
| (none required for GPU runner itself) | | The runner authenticates via GitHub App token |
| RELEASE_PLZ_TOKEN | Repository secrets | Allows release-plz to push tags |
| CARGO_REGISTRY_TOKEN | Repository secrets | crates.io publish (future) |

The runner registration token is generated once from the GitHub Actions UI and is not stored as a persistent secret — it expires after one hour and is only used during ./config.sh.

Provisioning

# On the Linux host with the GPU
# -L follows the redirect that GitHub release downloads return
curl -L -O https://github.com/actions/runner/releases/download/v2.319.0/actions-runner-linux-x64-2.319.0.tar.gz
mkdir actions-runner && cd actions-runner
tar xzf ../actions-runner-linux-x64-2.319.0.tar.gz

./config.sh \
    --url https://github.com/angelnicolasc/meridian \
    --token <REGISTRATION_TOKEN> \
    --labels self-hosted,linux,x64,gpu \
    --unattended

sudo ./svc.sh install
sudo ./svc.sh start

Verification

Run ./run.sh once interactively. Then trigger the cuda.yml workflow from a branch and confirm nvidia-smi prints the expected device in the job logs.

Blast radius and fork safety

Self-hosted runners execute arbitrary code from the workflow YAML. Meridian mitigates this with a hard gate on every GPU job:

if: github.repository_owner == 'angelnicolasc'

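This condition keeps the CUDA workflow from running inside forks of the repository. Because pull_request workflows opened from forks run in the base repository's context, where github.repository_owner is still angelnicolasc, fork PRs must also be held back by the repository's workflow-approval setting for outside contributors. Only pushes and PRs from the angelnicolasc org are intended to be eligible.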

Who can trigger: repository owners and collaborators with write access.
How to rotate the runner: generate a new registration token from the GitHub Actions UI, run ./config.sh --replace, restart the service.

See the GitHub documentation on self-hosted runner security for a full threat model.

Security & Supply Chain

Disclosure policy

Security vulnerabilities should be reported privately. See SECURITY.md for the full disclosure process, severity classification, and response SLAs.

Summary: reports are acknowledged within 48 hours, and critical-severity issues target a fix or mitigation within 30 days of confirmation, as set out in the Security Policy. Do not open public GitHub issues for unpatched vulnerabilities.

Provenance attestation

Every tagged release (v*) generates a SLSA Level 2 provenance attestation via the slsa-github-generator reusable workflow. The attestation covers:

  • The meridian-core and meridian-kernels static libraries.
  • The meridian Python wheel.

The attestation is uploaded as a GitHub release asset alongside the release artefacts. To verify:

# Install the SLSA verifier
go install github.com/slsa-framework/slsa-verifier/v2/cli/slsa-verifier@latest

# Download the artefact and its provenance from the GitHub release.
# Then verify:
slsa-verifier verify-artifact meridian-*.whl \
    --provenance-path meridian.intoto.jsonl \
    --source-uri github.com/angelnicolasc/meridian \
    --source-tag v0.1.0

What is attested: the build provenance — that the artefact was built from the tagged source in the GitHub Actions environment. SLSA L2 does not attest to the security of the code itself.

Software Bill of Materials (SBOM)

Each release includes a CycloneDX SBOM covering:

  • The Rust workspace (all transitive crate dependencies).
  • The Python wheel (Python package dependencies from pyproject.toml).

The SBOM is attached as a .cdx.json asset on the GitHub release. Operators can use it with vulnerability scanning tools (Grype, Trivy, FOSSA).
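
For a quick look at what an SBOM contains before handing it to a scanner, the components can be listed directly from the JSON. The sketch below relies only on the standard CycloneDX JSON fields bomFormat and components (each with name and version). The sample document and its component names are made up, and list_components is a hypothetical helper.

```python
import json

# Made-up stand-in for a downloaded .cdx.json release asset.
SAMPLE_SBOM = """
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.5",
  "components": [
    {"type": "library", "name": "example-pkg", "version": "4.5.6",
     "purl": "pkg:pypi/example-pkg@4.5.6"},
    {"type": "library", "name": "example-crate", "version": "1.2.3",
     "purl": "pkg:cargo/example-crate@1.2.3"}
  ]
}
"""


def list_components(sbom_text):
    """Return sorted (name, version) pairs from a CycloneDX JSON document."""
    doc = json.loads(sbom_text)
    if doc.get("bomFormat") != "CycloneDX":
        raise ValueError("not a CycloneDX document")
    return sorted((c["name"], c.get("version", "")) for c in doc.get("components", []))


for name, version in list_components(SAMPLE_SBOM):
    print(f"{name}=={version}")
```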

Supply-chain controls

| Control | Mechanism |
| --- | --- |
| Dependency pinning | Cargo.lock and uv.lock committed and verified in CI |
| Dependency auditing | cargo deny check in the supply-chain CI job (licence + advisory check) |
| GitHub Actions pinning | Actions pinned to major version tags in all workflows |
| Self-hosted runner isolation | GPU runner gated to github.repository_owner == 'angelnicolasc' |
| DCO sign-off | All commits require Signed-off-by matching the commit author |
| Release provenance | SLSA L2 via slsa-github-generator |

Dependency policy

New dependencies require:

  1. A licence compatible with Apache-2.0 (verified by cargo deny).
  2. No known CVEs at the time of merge (verified by cargo deny check advisories).
  3. An entry in the SBOM at the next release.

CI workflow permissions

All CI workflows run with minimal permissions:

| Workflow | Permissions |
| --- | --- |
| ci.yml | contents: read |
| release.yml | contents: write, pull-requests: write, id-token: write, attestations: write |
| sbom.yml | contents: write |
| docs.yml | contents: read, pages: write, id-token: write |
| cuda.yml | contents: read |

release.yml permission notes:

  • id-token: write — required by slsa-github-generator to mint the OIDC-backed provenance token; scoped to the build-artifacts and provenance jobs.
  • pull-requests: write — required by release-plz to open the automated release PR.
  • attestations: write — required by slsa-github-generator to upload the attestation bundle as a release asset.

Contributor Covenant Code of Conduct

This project adopts the Contributor Covenant, version 2.1 as its code of conduct.

Our Pledge

We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.

Our Standards

Examples of behavior that contributes to a positive environment include:

  • Demonstrating empathy and kindness toward other people.
  • Being respectful of differing opinions, viewpoints, and experiences.
  • Giving and gracefully accepting constructive feedback.
  • Accepting responsibility and apologizing to those affected by our mistakes.
  • Focusing on what is best for the overall community.

Unacceptable behavior includes:

  • Sexualized language or imagery, and sexual attention or advances of any kind.
  • Trolling, insulting or derogatory comments, and personal or political attacks.
  • Public or private harassment.
  • Publishing others' private information without explicit permission.
  • Other conduct which could reasonably be considered inappropriate in a professional setting.

Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported confidentially to the project maintainer at nick.dicerutti@gmail.com. All complaints will be reviewed and investigated promptly and fairly.

For the full text of the Code of Conduct including enforcement guidelines, see https://www.contributor-covenant.org/version/2/1/code_of_conduct/.

Contributing to Meridian

Thank you for considering a contribution. Meridian is an inference-time compute scheduler for reasoning-model serving — correctness, performance and clarity of contracts all matter at the same time. This document describes the bar.

Code of Conduct

Participation is governed by the Contributor Covenant 2.1. Report violations to nick.dicerutti@gmail.com.

Developer Certificate of Origin (DCO)

Meridian uses the DCO instead of a CLA. Every commit must be signed off:

git commit -s -m "feat(core): add RPDI ratio computation"

This appends Signed-off-by: Your Name <you@example.com>, which certifies that you have the right to submit the change under the project license. PRs without sign-off will not merge — the DCO check blocks them.
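
The rule the DCO check enforces can be approximated locally. This is a hedged sketch, not the project's actual check: has_matching_signoff is a hypothetical helper, and comparing the trailer against the commit author's name and email is an assumption about how strict the real check is.

```python
import re

# One "Signed-off-by: Name <email>" trailer per line, as appended by `git commit -s`.
SIGNOFF = re.compile(r"^Signed-off-by: (?P<name>.+) <(?P<email>[^<>]+)>$", re.MULTILINE)


def has_matching_signoff(message, author_name, author_email):
    """True if the commit message carries a sign-off matching the given author."""
    for match in SIGNOFF.finditer(message):
        same_name = match.group("name").strip() == author_name
        same_email = match.group("email").lower() == author_email.lower()
        if same_name and same_email:
            return True
    return False


message = (
    "feat(core): add RPDI ratio computation\n"
    "\n"
    "Signed-off-by: Your Name <you@example.com>\n"
)
print(has_matching_signoff(message, "Your Name", "you@example.com"))
```

If a commit is missing the trailer, git commit --amend -s adds it to the most recent commit.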

Commit conventions

We use Conventional Commits with the following scopes:

| Scope | Used for |
| --- | --- |
| core | crates/meridian-core/ |
| kernels | crates/meridian-kernels/ (CUDA + FFI) |
| python | crates/meridian-python/ and python/meridian/ |
| ci | .github/workflows/, devcontainer, scripts |
| docs | docs/, README, NOTICE |
| adr | new or modified ADRs only |
| deps | dependency bumps |
| bench | benchmarks/ |

Title must be under 72 characters. Breaking changes use feat!: / fix!: and include a BREAKING CHANGE: footer.
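
A title can be checked against these rules before pushing. The sketch below is illustrative and is not the project's commitlint configuration: check_title is a hypothetical helper, the scope set is copied from the table above, and accepting any lowercase type is an assumption because the allowed types are not listed here.

```python
import re

SCOPES = {"core", "kernels", "python", "ci", "docs", "adr", "deps", "bench"}

# type, optional (scope), optional "!" for breaking changes, then ": subject".
TITLE = re.compile(
    r"^(?P<type>[a-z]+)(\((?P<scope>[a-z]+)\))?(?P<breaking>!)?: (?P<subject>\S.*)$"
)


def check_title(title):
    """Return a list of problems with a commit title; an empty list means it passes."""
    problems = []
    if len(title) >= 72:
        problems.append("title must be under 72 characters")
    match = TITLE.match(title)
    if match is None:
        problems.append("not in 'type(scope): subject' form")
        return problems
    scope = match.group("scope")
    if scope is not None and scope not in SCOPES:
        problems.append(f"unknown scope: {scope}")
    return problems


for title in ("feat(core): add RPDI ratio computation", "feat(gpu): tune kernel", "update stuff"):
    print(title, "->", check_title(title) or "ok")
```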

Branch protection

main is protected. PRs require:

  1. Linear history (rebase, no merge commits).
  2. CI green: cargo fmt --check, clippy -D warnings, cargo nextest, ruff, mypy --strict, pytest -m "not gpu", mdbook build.
  3. DCO sign-off on every commit.
  4. Conventional Commit title.
  5. At least one approving review.

Local development

./scripts/dev-up.sh                  # devcontainer + sanity checks
./scripts/ci-local.sh                # mirrors CI matrix locally

Pre-commit hooks (rustfmt, clippy, ruff, mypy, commitlint) are configured in .pre-commit-config.yaml. Install with pre-commit install --install-hooks.

Test strategy

  • Pure Rust logic lives in meridian-core and must be covered by unit tests plus, where state machines are involved, proptest-based property tests.
  • CUDA correctness lives in meridian-kernels and is verified against the reference CPU implementation in Python.
  • Anything that crosses the FFI boundary is exercised from Python tests (python/tests/).
  • GPU-dependent tests are marked @pytest.mark.gpu; CI runs them only on the GPU job.

What lands in main

A change is mergeable when:

  • Tests cover the new code path and existing tests still pass.
  • Public API additions have rustdoc / docstrings with examples.
  • Behavior changes that affect operators (config defaults, metric names, exposed traits) are recorded in an ADR.
  • CHANGELOG.md is updated under ## [Unreleased] (handled automatically by release-plz for routine changes — manually for breaking ones).

Getting help

Open a GitHub Discussion for design questions, an issue for bugs.

Security Policy

Supported versions

Meridian is in pre-1.0 development. Only the latest tagged release receives security fixes. After 1.0, the latest minor of the latest two majors will be supported.

Reporting a vulnerability

Do not file a public issue for security problems. Use GitHub's Private Vulnerability Reporting or email nick.dicerutti@gmail.com with the subject line [meridian-security].

You should expect:

| Stage | Target time |
| --- | --- |
| Acknowledgement | 48 hours |
| Initial assessment | 5 business days |
| Fix or mitigation | 30 days (critical), 90 days (high), best-effort otherwise |
| Public disclosure | Coordinated, default 90 days after fix is available |

We credit reporters in the release notes unless anonymity is requested.

Scope

In scope for security reporting:

  • Memory safety in meridian-core and meridian-kernels FFI boundary — any reachable UB, out-of-bounds access, double-free, use-after-free.
  • CUDA kernel safety — buffer overruns, races on shared memory, illegal memory access reachable from sane inputs.
  • Deserialization of meridian.toml, model configs, request payloads — panics on adversarial input, type confusion.
  • Denial of service — pathological requests that crash the scheduler or exhaust KV memory irrecoverably.
  • Supply chain — compromised crate/wheel that ships under the Meridian name.

Out of scope:

  • Vulnerabilities in upstream dependencies (file with the upstream project; we will track and bump promptly).
  • Misconfigurations of operator-controlled deployments (e.g. exposing the Prometheus endpoint publicly).
  • Reasoning quality degradation when budget forcing is misconfigured — this is a correctness concern, not a security one.

Hardening notes

  • meridian-core denies unsafe_op_in_unsafe_fn workspace-wide.
  • The CUDA FFI boundary in meridian-kernels is the only unsafe surface and is reviewed for every change.
  • Releases are signed via Sigstore cosign; artifact provenance is generated via SLSA Level 2.