Configuration

Meridian is configured through a single TOML file consumed by both the Rust core (meridian_core::config::MeridianConfig) and the Python facade (meridian.config.MeridianConfig). Both parsers agree on field names; a round-trip test in crates/meridian-core/tests/config_parse.rs exercises every field.

The fully-annotated example lives at meridian.toml.example. Tune by overlay: keep the example as a reference and write a smaller meridian.toml containing only fields that differ from their defaults.

Validation

Both parsers reject unknown fields and out-of-range values. Cross-field violations (e.g. min_think_tokens >= max_think_tokens) are also caught. Errors carry the dotted field path and a human-readable message.
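
To make the three kinds of check concrete, here is a rough sketch of a validator for the [scheduler] table. ConfigError and validate_scheduler are hypothetical names chosen for this illustration; the real parsers may structure errors differently.

```python
from dataclasses import dataclass


@dataclass
class ConfigError:
    path: str     # dotted field path, e.g. "scheduler.min_think_tokens"
    message: str  # human-readable explanation


# Field names taken from the [scheduler] section of this page.
KNOWN_SCHEDULER_FIELDS = {
    "think_tpot_budget_ms", "output_tpot_budget_ms", "think_batch_multiplier",
    "max_think_tokens", "min_think_tokens",
}


def validate_scheduler(table: dict) -> list[ConfigError]:
    errors = []
    # Unknown fields are rejected rather than silently ignored.
    for key in table:
        if key not in KNOWN_SCHEDULER_FIELDS:
            errors.append(ConfigError(f"scheduler.{key}", "unknown field"))
    # Single-field range checks.
    for key in ("think_tpot_budget_ms", "output_tpot_budget_ms"):
        if key in table and not table[key] > 0:
            errors.append(ConfigError(f"scheduler.{key}", "must be > 0"))
    # Cross-field check; omitted fields fall back to the documented defaults.
    lo = table.get("min_think_tokens", 512)
    hi = table.get("max_think_tokens", 32768)
    if lo >= hi:
        errors.append(ConfigError(
            "scheduler.min_think_tokens",
            f"must be < max_think_tokens ({hi}), got {lo}"))
    return errors


for err in validate_scheduler({"min_think_tokens": 40000, "typo_field": 1}):
    print(f"{err.path}: {err.message}")
```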


[scheduler]

Dual-queue scheduling policy. See ADR-0001.

think_tpot_budget_ms

Property     Value
Type         f64
Default      80.0
Unit         milliseconds per think-phase token
Valid range  > 0

TPOT budget for think-phase tokens. The user does not see inter-token latency during reasoning, so this can be set much higher than the output budget. The default is 4× the output budget, which gives the batcher room to pack a larger effective batch during think. Raise it if your GPU is underutilised during think; lower it if think-phase requests monopolise capacity at the expense of output.

output_tpot_budget_ms

Property     Value
Type         f64
Default      20.0
Unit         milliseconds per output-phase token
Valid range  > 0

TPOT budget for output-phase tokens. This caps the inter-token latency the user sees while streaming. 20 ms per token corresponds to 50 tok/s, the top of a 30–50 tok/s display target, so streams stay fluid. Lower values produce tighter streams but reduce think-phase throughput.
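
A TPOT budget converts directly into a per-request token rate, which can help when picking values. The helper below is only an arithmetic sketch; tokens_per_second is an illustrative name, not part of Meridian.

```python
def tokens_per_second(tpot_budget_ms: float) -> float:
    """Per-request decode rate implied by a time-per-output-token budget."""
    if tpot_budget_ms <= 0:
        raise ValueError("TPOT budget must be > 0")
    return 1000.0 / tpot_budget_ms


print(tokens_per_second(20.0))  # output-phase default -> 50.0
print(tokens_per_second(80.0))  # think-phase default  -> 12.5
```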

think_batch_multiplier

Property     Value
Type         f64
Default      2.5
Unit         × output batch token budget
Valid range  >= 1.0

The think-phase batch can fill this multiple of the output-phase token budget. 2.5× is conservative — empirically stable across H100-class hardware with MLA-aware allocation. Values above 3.5× risk output ITL variance spikes if think requests fail to yield promptly. Monitor meridian.queue_depth{queue=think}.

max_think_tokens

Property     Value
Type         u64
Default      32768
Unit         tokens
Valid range  > min_think_tokens

Hard cap on think tokens per request. Budget forcing fires unconditionally at this limit regardless of entropy signals. 32 768 matches DeepSeek-R1's documented maximum reasoning length and bounds the KV memory a single request can monopolise.

min_think_tokens

Property     Value
Type         u64
Default      512
Unit         tokens
Valid range  < max_think_tokens

No budget forcing is allowed before this many think tokens. EAT/RPDI signals are noisy below 512 tokens; early forcing can prematurely terminate short-but-correct reasoning chains.
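
Taken together, min_think_tokens and max_think_tokens bracket the window in which entropy signals may trigger budget forcing. The sketch below shows one way to read that rule; should_force_exit and the single boolean entropy_signal_fired are assumptions made for illustration, and how the scheduler actually combines its signals is not specified here.

```python
def should_force_exit(think_tokens: int,
                      entropy_signal_fired: bool,
                      min_think_tokens: int = 512,
                      max_think_tokens: int = 32768) -> bool:
    """Decide whether to force the end of the think phase for one request."""
    if think_tokens >= max_think_tokens:
        return True   # hard cap: fires regardless of entropy signals
    if think_tokens < min_think_tokens:
        return False  # too early: signals are treated as noise
    return entropy_signal_fired  # inside the window, defer to the signals


print(should_force_exit(100, True))     # False, below the minimum
print(should_force_exit(600, True))     # True, signal honoured
print(should_force_exit(32768, False))  # True, hard cap reached
```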


[entropy]

Entropy probe and convergence-detection thresholds.

enabled

Property     Value
Type         bool
Default      true

When false, the entropy probe is disabled and all budget forcing uses hard_cap only (pure token-count limiting). Useful for A/B comparison or when the CUDA kernel is not available.

ema_alpha

Property     Value
Type         f64
Default      0.05
Unit         dimensionless (EMA decay)
Valid range  (0.0, 1.0]

EMA decay applied to all entropy signals. Smaller values give longer memory. With α = 0.05, about 95% of the weight falls on the last ~60 samples: long enough to smooth single-token spikes, short enough to react within a reasoning chain.
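
The relationship between α and the effective memory can be checked numerically. This sketch assumes the standard EMA update ema = α·x + (1 − α)·ema, under which the most recent n samples carry a total weight of 1 − (1 − α)^n; the function names are illustrative.

```python
import math


def ema_mass_in_last(n: int, alpha: float) -> float:
    """Share of total EMA weight carried by the most recent n samples."""
    return 1.0 - (1.0 - alpha) ** n


def samples_for_mass(mass: float, alpha: float) -> float:
    """Number of recent samples needed to cover the given share of weight."""
    if alpha >= 1.0:
        return 1.0  # alpha = 1 means only the latest sample counts
    return math.log(1.0 - mass) / math.log(1.0 - alpha)


print(round(ema_mass_in_last(60, 0.05), 3))    # a little over 0.95
print(round(samples_for_mass(0.95, 0.05), 1))  # between 58 and 59
```

Doubling α to 0.1 roughly halves the window, which is a quick way to estimate how a change will affect responsiveness.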

rpdi_threshold

Property     Value
Type         f64
Default      3.0
Unit         ratio (local RPDI / global RPDI)
Valid range  > 1.0

Overthinking is declared when rpdi_local / rpdi_global > threshold. The value 3.0 is the empirical threshold from arXiv:2603.14251. Raise to be more permissive (longer chains); lower to be more aggressive.

eat_ema_variance_threshold

Property     Value
Type         f64
Default      0.001
Unit         nats²
Valid range  > 0.0

Convergence is declared when EAT EMA variance drops below this threshold. 0.001 is approximately the noise floor of EAT in steady state. Lower values defer forcing; higher values fire earlier.
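
One way to picture the convergence test is an exponentially weighted running variance compared against the threshold. The EmaVariance class below is a sketch under that assumption, using a common incremental formula; it is not the EAT kernel's actual computation, and the input values are made up.

```python
class EmaVariance:
    """Exponentially weighted mean and variance of a scalar signal."""

    def __init__(self, alpha: float = 0.05):
        self.alpha = alpha
        self.mean = None
        self.var = 0.0

    def update(self, x: float) -> float:
        if self.mean is None:
            self.mean = x  # first sample seeds the mean
            return self.var
        diff = x - self.mean
        incr = self.alpha * diff
        self.mean += incr
        self.var = (1.0 - self.alpha) * (self.var + diff * incr)
        return self.var


THRESHOLD = 0.001  # eat_ema_variance_threshold default, in nats squared

tracker = EmaVariance(alpha=0.05)
for i in range(100):  # oscillating signal: not converged
    tracker.update(1.0 if i % 2 == 0 else 2.0)
print(tracker.var < THRESHOLD)
for _ in range(400):  # flat signal: variance decays toward zero
    tracker.update(1.5)
print(tracker.var < THRESHOLD)
```

Note that a freshly initialised tracker reports zero variance, which is one reason a minimum think-token count before any forcing is useful.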

transition_entropy_threshold

Property     Value
Type         f64
Default      2.5
Unit         nats
Valid range  > 0.0

A token counts as a "transition" for RPDI when its per-token entropy exceeds this value. 2.5 nats ≈ effective branching factor of 12 — a genuine decision point rather than low-entropy continuation.
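
The branching-factor figure follows from exp(H) for an entropy H measured in nats. The sketch below reproduces that arithmetic and applies the threshold to two uniform distributions; the function names are illustrative.

```python
import math


def effective_branching_factor(entropy_nats: float) -> float:
    """Perplexity of the next-token distribution: exp(H) for H in nats."""
    return math.exp(entropy_nats)


def token_entropy_nats(probs) -> float:
    """Shannon entropy, in nats, of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def is_transition(probs, threshold_nats: float = 2.5) -> bool:
    return token_entropy_nats(probs) > threshold_nats


print(round(effective_branching_factor(2.5), 2))  # about 12.18
print(is_transition([1 / 12] * 12))  # ln(12) is just under 2.5 -> False
print(is_transition([1 / 13] * 13))  # ln(13) is just over 2.5  -> True
```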

eat_probe_interval_tokens

Property     Value
Type         u32
Default      32
Unit         tokens
Valid range  >= 1

The EAT kernel runs every N think tokens. 1 = every token; higher values trade signal latency for reduced kernel-launch overhead. 32 is the sweet spot on H100-class hardware. Halving this on slower GPUs is safe.
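
The trade-off can be seen by listing where the probe would fire for a given interval. This sketch assumes the probe runs whenever the think-token count is a multiple of N; the exact alignment in Meridian may differ, and probe_positions is a hypothetical helper.

```python
def probe_positions(think_tokens: int, interval: int = 32) -> list[int]:
    """Think-token counts at which the entropy probe would run."""
    if interval < 1:
        raise ValueError("eat_probe_interval_tokens must be >= 1")
    return [t for t in range(1, think_tokens + 1) if t % interval == 0]


print(probe_positions(100, 32))      # [32, 64, 96]
print(len(probe_positions(100, 1)))  # 100 launches, one per token
```

With the default of 32, a change in the signal can go unobserved for up to 31 tokens, in exchange for 32 times fewer kernel launches than probing every token.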


[kv_memory]

Phase-aware KV block manager policy.

aggressive_think_eviction

Property     Value
Type         bool
Default      false

When true, ThinkComplete blocks are freed immediately on phase transition. Leave false until cross-attention back-references are audited for your model; some reasoning-parser pipelines re-attend over the think segment when generating output and need those blocks resident.

think_phase_memory_fraction

Property     Value
Type         f64
Default      0.40
Unit         fraction of total KV budget
Valid range  (0.0, 1.0)

Fraction of total KV budget reserved for think-phase blocks. 0.40 leaves 60% for output-phase blocks, which accommodates think_batch_multiplier = 2.5 without crowding output. Raise for workloads with very long reasoning chains; lower for workloads with long output sequences.
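
To see what the fraction means in blocks, the following sketch splits a KV budget between the two phases. split_kv_budget is a hypothetical helper, and rounding the think share down to whole blocks is an assumption rather than documented behaviour.

```python
def split_kv_budget(capacity_bytes: int,
                    think_fraction: float = 0.40,
                    block_size_bytes: int = 16384) -> tuple[int, int]:
    """Return (think_blocks, output_blocks) for a given KV budget."""
    if not 0.0 < think_fraction < 1.0:
        raise ValueError("think_phase_memory_fraction must be in (0.0, 1.0)")
    total_blocks = capacity_bytes // block_size_bytes
    think_blocks = int(total_blocks * think_fraction)  # round down
    return think_blocks, total_blocks - think_blocks


# A budget of exactly 1000 blocks at the default 16 KiB block size.
print(split_kv_budget(1000 * 16384))  # (400, 600)
```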

block_size_bytes

Property     Value
Type         u64
Default      16384 (16 KiB)
Unit         bytes per KV block
Valid range  > 0

Must match the actual vLLM block layout for your model. The canonical vLLM layout is 16 KiB for bf16/fp16 KV at 16 tokens per block. MLA-aware models can run smaller blocks.

capacity_bytes

Property     Value
Type         u64 or "auto"
Default      "auto"
Unit         bytes
Valid range  > 0 or "auto"

Total KV memory budget. "auto" queries the device at startup and uses 85% of the total device memory reported by torch.cuda.mem_get_info(), which returns a (free, total) tuple in bytes. Pin an integer for deterministic or multi-tenant deployments where you want to reserve GPU memory for other workloads.
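
The resolution of "auto" versus a pinned integer can be sketched as a pure function of the device's total memory. resolve_capacity_bytes is a hypothetical name; on a CUDA host the total would come from the second element of the tuple returned by torch.cuda.mem_get_info(), which is passed in here as a plain argument so the sketch runs without a GPU.

```python
def resolve_capacity_bytes(setting, device_total_bytes: int,
                           auto_fraction: float = 0.85) -> int:
    """Turn the capacity_bytes setting into a concrete byte count."""
    if setting == "auto":
        # On a CUDA host: free, total = torch.cuda.mem_get_info()
        return int(device_total_bytes * auto_fraction)
    if isinstance(setting, int) and not isinstance(setting, bool) and setting > 0:
        return setting  # pinned value is used as-is
    raise ValueError('kv_memory.capacity_bytes: must be > 0 or "auto"')


total = 80 * 10**9  # illustrative device size, not a measured value
print(resolve_capacity_bytes("auto", total))
print(resolve_capacity_bytes(32 * 2**30, total))
```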


[disagg]

Disaggregated KV transfer. Disabled by default. See ADR-0006.

enabled

Property     Value
Type         bool
Default      false

Master switch. Leave false for single-node deployments.

fabric

Property     Value
Type         "nixl" | "mooncake" | "none"
Default      "none"

Selects the disagg transport. nixl uses the NVIDIA NIXL library (requires cargo build --features nixl and libnixl on the deploy host). mooncake uses the Mooncake-compatible protocol adapter. none is only valid when enabled = false.
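
The constraint between enabled and fabric is a cross-field rule. The sketch below shows how such a check might look; check_disagg is a hypothetical helper, and treating a non-"none" fabric with enabled = false as acceptable is an assumption, since the text only states the rule for "none".

```python
VALID_FABRICS = ("nixl", "mooncake", "none")


def check_disagg(enabled: bool, fabric: str = "none") -> list[str]:
    """Return a list of error strings for the [disagg] table."""
    errors = []
    if fabric not in VALID_FABRICS:
        errors.append(f'disagg.fabric: expected one of {VALID_FABRICS}, got "{fabric}"')
    elif enabled and fabric == "none":
        errors.append('disagg.fabric: "none" is only valid when disagg.enabled = false')
    return errors


print(check_disagg(False))          # defaults are valid
print(check_disagg(True, "none"))   # one cross-field error
```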

offload_threshold_blocks

Property     Value
Type         u32
Default      4
Unit         KV blocks
Valid range  >= 1

Minimum ThinkComplete blocks to accumulate before flushing to the fabric. Larger values amortise transfer overhead; smaller values reduce latency.
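
The accumulate-then-flush behaviour can be sketched as a small buffer that releases a batch once the threshold is reached. OffloadBuffer is a hypothetical class for illustration; the actual transfer path and block representation are not shown.

```python
class OffloadBuffer:
    """Collect completed think blocks and release them in batches."""

    def __init__(self, threshold_blocks: int = 4):
        if threshold_blocks < 1:
            raise ValueError("offload_threshold_blocks must be >= 1")
        self.threshold = threshold_blocks
        self.pending = []

    def add(self, block_id: int) -> list:
        """Queue one block; return the batch to flush, or [] if below threshold."""
        self.pending.append(block_id)
        if len(self.pending) >= self.threshold:
            batch, self.pending = self.pending, []
            return batch
        return []


buf = OffloadBuffer(threshold_blocks=4)
for block in range(5):
    flushed = buf.add(block)
    if flushed:
        print("flush", flushed)  # fires once, with blocks 0..3
```

A threshold of 1 flushes every block immediately, which minimises latency at the cost of one transfer per block.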


[model.<name>]

Per-model token-boundary configuration. One [model.*] table per model served. The phase router watches for boundary token IDs in the decoded stream.

See models/*.toml for the vetted token IDs for supported models.

think_start_token_ids

Property     Value
Type         [u32]

Token IDs that mark the start of a reasoning chain. Model- and tokenizer-specific. If empty, the router never enters think-phase for this model.

think_end_token_ids

Property     Value
Type         [u32]

Token IDs that mark the end of a reasoning chain. When the decoded stream contains any of these IDs, the router emits ExitThink.
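
The boundary-token mechanism amounts to a two-state machine over the decoded token stream. The sketch below is a simplified illustration: route_phases and the "EnterThink" event name are hypothetical (only ExitThink is named on this page), the token IDs are made up, and it assumes end tokens only matter while the request is in think-phase.

```python
def route_phases(token_ids, think_start_ids, think_end_ids):
    """Return (token_id, event) pairs; event is a phase change or None."""
    in_think = False
    events = []
    for tok in token_ids:
        event = None
        if not in_think and tok in think_start_ids:
            in_think, event = True, "EnterThink"
        elif in_think and tok in think_end_ids:
            in_think, event = False, "ExitThink"
        events.append((tok, event))
    return events


# Made-up IDs: 100 opens the reasoning chain, 101 closes it.
stream = [5, 100, 7, 8, 101, 9]
for tok, event in route_phases(stream, {100}, {101}):
    if event:
        print(tok, event)
```

With an empty think_start_token_ids set the first branch never matches, so the request stays in output-phase throughout.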

reasoning_parser

Property     Value
Type         string
Values       "deepseek_r1", "qwen3", "granite", "anthropic"

Selects the reasoning-chain parser. Different models structure their think-output boundary differently; the parser handles model-specific normalization.

supports_think_disable

Property     Value
Type         bool

Whether the model supports a /no_think directive to suppress the reasoning phase entirely. When true and the prompt contains the directive, the router stays in output-phase for the entire request.