Configuration
Meridian is configured through a single TOML file consumed by both the Rust
core (meridian-core::config::MeridianConfig) and the Python facade
(meridian.config.MeridianConfig). Both parsers agree on field names; a
round-trip test in crates/meridian-core/tests/config_parse.rs exercises
every field.
The fully annotated example lives at meridian.toml.example.
Tune by overlay: keep the example as a reference and write a smaller
meridian.toml containing only fields that differ from their defaults.
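The overlay approach amounts to a recursive merge of your file over the defaults. The sketch below shows the idea on plain dictionaries; `apply_overlay` and the abbreviated `DEFAULTS` table are hypothetical names used for illustration, not Meridian's loader.

```python
import copy

# Abbreviated defaults, taken from the tables in this document.
DEFAULTS = {
    "scheduler": {"think_tpot_budget_ms": 80.0, "output_tpot_budget_ms": 20.0},
    "entropy": {"enabled": True, "ema_alpha": 0.05},
}


def apply_overlay(defaults, overlay):
    """Return a copy of defaults with every key in overlay replaced, recursing into tables."""
    merged = copy.deepcopy(defaults)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overlay(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# A meridian.toml that changes a single field would parse to this overlay.
overlay = {"scheduler": {"output_tpot_budget_ms": 15.0}}
effective = apply_overlay(DEFAULTS, overlay)
print(effective["scheduler"])
```

Fields absent from the overlay keep their defaults, so the small file stays readable as a list of deliberate deviations.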
Validation
Both parsers reject unknown fields and out-of-range values. Cross-field
violations (e.g. min_think_tokens >= max_think_tokens) are also caught.
Errors carry the dotted field path and a human-readable message.
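As a rough illustration of the three kinds of check, the sketch below validates a [scheduler] table held as a dictionary. `validate_scheduler` and its error format are assumptions made for this example; the real parsers may word and order their errors differently.

```python
# Range rules copied from the [scheduler] tables: field -> (predicate, range text).
RANGES = {
    "think_tpot_budget_ms": (lambda v: v > 0, "> 0"),
    "output_tpot_budget_ms": (lambda v: v > 0, "> 0"),
    "think_batch_multiplier": (lambda v: v >= 1.0, ">= 1.0"),
}
KNOWN = set(RANGES) | {"max_think_tokens", "min_think_tokens"}
DEFAULT_MIN, DEFAULT_MAX = 512, 32768


def validate_scheduler(table):
    """Return (dotted_path, message) pairs for every problem in a [scheduler] table."""
    errors = []
    for name, value in table.items():
        path = f"scheduler.{name}"
        if name not in KNOWN:
            errors.append((path, "unknown field"))
        elif name in RANGES and not RANGES[name][0](value):
            errors.append((path, f"{value} is outside the valid range {RANGES[name][1]}"))
    # Cross-field rule: compare after filling in defaults for missing fields.
    low = table.get("min_think_tokens", DEFAULT_MIN)
    high = table.get("max_think_tokens", DEFAULT_MAX)
    if low >= high:
        errors.append(("scheduler.min_think_tokens", f"{low} must be < max_think_tokens ({high})"))
    return errors


bad = {"output_tpot_budget_ms": -1.0, "think_budget": 80.0, "min_think_tokens": 40000}
for path, message in validate_scheduler(bad):
    print(f"{path}: {message}")
```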
[scheduler]
Dual-queue scheduling policy. See ADR-0001.
think_tpot_budget_ms
| Property | Value |
|---|---|
| Type | f64 |
| Default | 80.0 |
| Unit | milliseconds per think-phase token |
| Valid range | > 0 |
TPOT budget for think-phase tokens. The user does not see inter-token latency during reasoning, so this can be set much higher than the output budget. The default is 4× the output budget (80 ms against 20 ms), which gives the batcher room to pack a larger effective batch during think. Raise it if your GPU is underutilised during think; lower it if think-phase requests monopolise capacity at the expense of output.
output_tpot_budget_ms
| Property | Value |
|---|---|
| Type | f64 |
| Default | 20.0 |
| Unit | milliseconds per output-phase token |
| Valid range | > 0 |
TPOT budget for output-phase tokens. This bounds the inter-token latency the user sees while streaming. 20 ms per token corresponds to 50 tok/s, the upper end of a 30–50 tok/s display target. Lower values produce tighter streams but reduce think-phase throughput.
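The relationship between a TPOT budget and a per-request token rate is simple arithmetic. The sketch below applies it to the two defaults; `tokens_per_second` is a hypothetical helper, and it assumes every token takes exactly the budgeted time.

```python
def tokens_per_second(tpot_budget_ms):
    """Per-request token rate when each token takes exactly the budgeted time."""
    return 1000.0 / tpot_budget_ms


output_rate = tokens_per_second(20.0)  # default output_tpot_budget_ms
think_rate = tokens_per_second(80.0)   # default think_tpot_budget_ms
print(output_rate, think_rate, output_rate / think_rate)
```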
think_batch_multiplier
| Property | Value |
|---|---|
| Type | f64 |
| Default | 2.5 |
| Unit | ×output batch token budget |
| Valid range | >= 1.0 |
The think-phase batch can fill this multiple of the output-phase token budget.
2.5× is conservative — empirically stable across H100-class hardware with
MLA-aware allocation. Values above 3.5× risk output ITL variance spikes if
think requests fail to yield promptly. Monitor meridian.queue_depth{queue=think}.
max_think_tokens
| Property | Value |
|---|---|
| Type | u64 |
| Default | 32768 |
| Unit | tokens |
| Valid range | > min_think_tokens |
Hard cap on think tokens per request. Budget forcing fires unconditionally at this limit regardless of entropy signals. 32768 matches DeepSeek-R1's documented maximum reasoning length and bounds the KV memory a single request can monopolise.
min_think_tokens
| Property | Value |
|---|---|
| Type | u64 |
| Default | 512 |
| Unit | tokens |
| Valid range | < max_think_tokens |
No budget forcing is allowed before this many think tokens. EAT/RPDI signals are noisy below 512 tokens; early forcing can prematurely terminate short-but-correct reasoning chains.
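Taken together, min_think_tokens and max_think_tokens gate the entropy signals. One way to read the two rules is sketched below; `forcing_allowed` is a hypothetical name, and treating a count exactly equal to min_think_tokens as eligible is an assumption.

```python
def forcing_allowed(think_tokens, signal_fired, min_think_tokens=512, max_think_tokens=32768):
    """Decide whether budget forcing may fire after think_tokens think tokens."""
    if think_tokens >= max_think_tokens:
        return True           # hard cap: fires regardless of entropy signals
    if think_tokens < min_think_tokens:
        return False          # too early: signals are ignored
    return signal_fired       # in between: follow the entropy signals


print(forcing_allowed(100, signal_fired=True))
print(forcing_allowed(4096, signal_fired=True))
print(forcing_allowed(32768, signal_fired=False))
```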
[entropy]
Entropy probe and convergence-detection thresholds.
enabled
| Property | Value |
|---|---|
| Type | bool |
| Default | true |
When false, the entropy probe is disabled and all budget forcing uses
hard_cap only (pure token-count limiting at max_think_tokens). Useful for
A/B comparison or when the CUDA kernel is not available.
ema_alpha
| Property | Value |
|---|---|
| Type | f64 |
| Default | 0.05 |
| Unit | dimensionless (EMA decay) |
| Valid range | (0.0, 1.0] |
EMA decay applied to all entropy signals. Smaller values give longer memory. With α = 0.05, about 95% of the total weight falls on the last ~60 samples. That is long enough to smooth single-token spikes and short enough to react within a reasoning chain.
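The window figure follows from the geometric weights of an EMA: the most recent n samples carry 1 − (1 − α)^n of the total weight. The sketch below evaluates that expression; the helper names are hypothetical, and `samples_for_mass` assumes α < 1.

```python
import math


def ema_mass(alpha, n):
    """Share of total EMA weight carried by the most recent n samples."""
    return 1.0 - (1.0 - alpha) ** n


def samples_for_mass(alpha, mass):
    """Smallest n whose most recent n samples carry at least `mass` of the weight."""
    return math.ceil(math.log(1.0 - mass) / math.log(1.0 - alpha))


print(round(ema_mass(0.05, 60), 3))
print(samples_for_mass(0.05, 0.95))
```

Doubling α roughly halves the window, which is a quick way to reason about a candidate value before trying it.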
rpdi_threshold
| Property | Value |
|---|---|
| Type | f64 |
| Default | 3.0 |
| Unit | ratio (local RPDI / global RPDI) |
| Valid range | > 1.0 |
Overthinking is declared when rpdi_local / rpdi_global > threshold. The
value 3.0 is the empirical threshold from arXiv:2603.14251. Raise to be more
permissive (longer chains); lower to be more aggressive.
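The check itself is a single ratio comparison. A minimal sketch follows; `is_overthinking` is a hypothetical name, and returning False while the global RPDI is still zero is an assumption about how an empty baseline might be handled.

```python
def is_overthinking(rpdi_local, rpdi_global, rpdi_threshold=3.0):
    """True when the local RPDI exceeds the global RPDI by more than the threshold ratio."""
    if rpdi_global <= 0.0:
        return False  # assumption: no global baseline yet, so no verdict
    return rpdi_local / rpdi_global > rpdi_threshold


print(is_overthinking(0.9, 0.25))  # ratio 3.6
print(is_overthinking(0.6, 0.25))  # ratio 2.4
```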
eat_ema_variance_threshold
| Property | Value |
|---|---|
| Type | f64 |
| Default | 0.001 |
| Unit | nats² |
| Valid range | > 0.0 |
Convergence is declared when EAT EMA variance drops below this threshold. 0.001 is approximately the noise floor of EAT in steady state. Lower values defer forcing; higher values fire earlier.
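To make the convergence rule concrete, the sketch below tracks an exponentially weighted mean and variance and compares the variance with the threshold. The update rule is the standard incremental form for exponentially weighted variance, used here as an assumption; the document does not specify how Meridian computes the EAT EMA variance, and `EmaVariance` is a hypothetical name.

```python
class EmaVariance:
    """Exponentially weighted mean and variance of a scalar signal."""

    def __init__(self, alpha=0.05):
        self.alpha = alpha
        self.mean = None
        self.var = 0.0

    def update(self, x):
        if self.mean is None:
            self.mean = x  # first sample seeds the mean; variance stays 0
            return self.var
        delta = x - self.mean
        self.mean += self.alpha * delta
        self.var = (1.0 - self.alpha) * (self.var + self.alpha * delta * delta)
        return self.var


def converged(tracker, threshold=0.001):
    """True when the tracked variance has dropped below the threshold."""
    return tracker.var < threshold


noisy = EmaVariance()
for i in range(100):
    noisy.update(0.5 if i % 2 == 0 else 2.5)  # signal still swinging
print(converged(noisy))

for _ in range(300):
    noisy.update(1.5)                          # signal settles
print(converged(noisy))
```

A fresh tracker reports zero variance, so a check like this only makes sense once min_think_tokens has been reached.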
transition_entropy_threshold
| Property | Value |
|---|---|
| Type | f64 |
| Default | 2.5 |
| Unit | nats |
| Valid range | > 0.0 |
A token counts as a "transition" for RPDI when its per-token entropy exceeds this value. 2.5 nats corresponds to an effective branching factor of about 12 (e^2.5 ≈ 12.2), which marks a genuine decision point rather than a low-entropy continuation.
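The branching-factor reading comes from exponentiating the entropy: a uniform choice among k tokens has entropy ln k nats. The sketch below checks the threshold against uniform distributions; the function names are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy in nats of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def is_transition(probs, transition_entropy_threshold=2.5):
    """True when the token's entropy exceeds the transition threshold."""
    return token_entropy(probs) > transition_entropy_threshold


print(round(math.exp(2.5), 2))       # effective branching factor at the threshold
print(is_transition([1 / 12] * 12))  # ln 12 is just under 2.5
print(is_transition([1 / 13] * 13))  # ln 13 is just over 2.5
```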
eat_probe_interval_tokens
| Property | Value |
|---|---|
| Type | u32 |
| Default | 32 |
| Unit | tokens |
| Valid range | >= 1 |
The EAT kernel runs every N think tokens. 1 = every token; higher values
trade signal latency for reduced kernel-launch overhead. 32 is the sweet spot
on H100-class hardware. Halving this on slower GPUs is safe.
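The overhead side of the trade-off can be estimated by counting launches over a think phase. The sketch below assumes the probe fires on every multiple of the interval; `probe_launches` is a hypothetical helper.

```python
def probe_launches(think_tokens, eat_probe_interval_tokens=32):
    """Number of EAT kernel launches over a think phase of the given length."""
    return think_tokens // eat_probe_interval_tokens


# Launch counts for a request that runs to the default max_think_tokens.
for interval in (1, 16, 32):
    print(interval, probe_launches(32768, interval))
```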
[kv_memory]
Phase-aware KV block manager policy.
aggressive_think_eviction
| Property | Value |
|---|---|
| Type | bool |
| Default | false |
When true, ThinkComplete blocks are freed immediately on phase transition.
Leave false until cross-attention back-references are audited for your model;
some reasoning-parser pipelines re-attend over the think segment when generating
output and need those blocks resident.
think_phase_memory_fraction
| Property | Value |
|---|---|
| Type | f64 |
| Default | 0.40 |
| Unit | fraction of total KV budget |
| Valid range | (0.0, 1.0) |
Fraction of total KV budget reserved for think-phase blocks. 0.40 leaves 60%
for output-phase blocks, which accommodates think_batch_multiplier = 2.5
without crowding output. Raise for workloads with very long reasoning chains;
lower for workloads with long output sequences.
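As a rough guide to what the fraction means in blocks, the sketch below splits a KV budget between the two phases. `partition_blocks` is a hypothetical helper, the 16 GiB capacity is an arbitrary example value, and rounding the think share down is an assumption.

```python
def partition_blocks(capacity_bytes, block_size_bytes=16384, think_phase_memory_fraction=0.40):
    """Split the KV budget into (think_blocks, output_blocks)."""
    total_blocks = capacity_bytes // block_size_bytes
    think_blocks = int(total_blocks * think_phase_memory_fraction)  # round down
    return think_blocks, total_blocks - think_blocks


think_blocks, output_blocks = partition_blocks(16 * 2**30)  # example: 16 GiB budget
print(think_blocks, output_blocks)
```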
block_size_bytes
| Property | Value |
|---|---|
| Type | u64 |
| Default | 16384 (16 KiB) |
| Unit | bytes per KV block |
| Valid range | > 0 |
Must match the actual vLLM block layout for your model. The canonical vLLM layout is 16 KiB for bf16/fp16 KV at 16 tokens per block. MLA-aware models can run smaller blocks.
capacity_bytes
| Property | Value |
|---|---|
| Type | u64 or "auto" |
| Default | "auto" |
| Unit | bytes |
| Valid range | > 0 or "auto" |
Total KV memory budget. "auto" queries the device at startup and uses
85% of the total device memory reported by torch.cuda.mem_get_info() (the
second element of the returned (free, total) tuple). Pin an integer for
deterministic or multi-tenant deployments where you want to reserve GPU
memory for other workloads.
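The two accepted forms of capacity_bytes can be resolved as sketched below. `resolve_capacity` is a hypothetical helper; it takes the device total as an argument so the example runs without a GPU. In a real deployment that value would be the second element of the (free, total) tuple returned by torch.cuda.mem_get_info().

```python
def resolve_capacity(capacity_bytes, total_device_bytes):
    """Turn the configured capacity_bytes into a concrete byte count."""
    if capacity_bytes == "auto":
        return total_device_bytes * 85 // 100  # 85% of device memory, integer math
    if isinstance(capacity_bytes, int) and not isinstance(capacity_bytes, bool) and capacity_bytes > 0:
        return capacity_bytes                  # pinned value is used as-is
    raise ValueError(f"kv_memory.capacity_bytes: expected a positive integer or \"auto\", got {capacity_bytes!r}")


total = 80 * 2**30  # example: an 80 GiB device
print(resolve_capacity("auto", total))
print(resolve_capacity(32 * 2**30, total))
```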
[disagg]
Disaggregated KV transfer. Disabled by default. See ADR-0006.
enabled
| Property | Value |
|---|---|
| Type | bool |
| Default | false |
Master switch. Leave false for single-node deployments.
fabric
| Property | Value |
|---|---|
| Type | "nixl" \| "mooncake" \| "none" |
| Default | "none" |
Selects the disagg transport. nixl uses the NVIDIA NIXL library (requires
cargo build --features nixl and libnixl on the deploy host). mooncake
uses the Mooncake-compatible protocol adapter. none is only valid when
enabled = false.
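The constraint between enabled and fabric is a cross-field rule of the kind described under Validation. A minimal sketch, with `check_disagg` and its message wording as assumptions:

```python
VALID_FABRICS = {"nixl", "mooncake", "none"}


def check_disagg(enabled, fabric):
    """Return an error message for an inconsistent [disagg] table, or None if it is fine."""
    if fabric not in VALID_FABRICS:
        return f"disagg.fabric: unknown fabric {fabric!r}"
    if enabled and fabric == "none":
        return 'disagg.fabric: "none" is only valid when disagg.enabled = false'
    return None


print(check_disagg(False, "none"))
print(check_disagg(True, "none"))
print(check_disagg(True, "nixl"))
```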
offload_threshold_blocks
| Property | Value |
|---|---|
| Type | u32 |
| Default | 4 |
| Unit | KV blocks |
| Valid range | >= 1 |
Minimum ThinkComplete blocks to accumulate before flushing to the fabric.
Larger values amortise transfer overhead; smaller values reduce latency.
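The accumulate-then-flush behaviour can be pictured as a small buffer. The sketch below is illustrative only: `OffloadBuffer` is a hypothetical class, and the real transfer path to the fabric is not shown.

```python
class OffloadBuffer:
    """Collects ThinkComplete block IDs and releases them in batches."""

    def __init__(self, offload_threshold_blocks=4):
        self.threshold = offload_threshold_blocks
        self.pending = []

    def add(self, block_id):
        """Queue a block; return the batch to flush, or [] while below the threshold."""
        self.pending.append(block_id)
        if len(self.pending) < self.threshold:
            return []
        batch, self.pending = self.pending, []
        return batch


buffer = OffloadBuffer()
for block_id in range(1, 6):
    print(block_id, buffer.add(block_id))
```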
[model.<name>]
Per-model token-boundary configuration. One [model.*] table per model
served. The phase router watches for boundary token IDs in the decoded stream.
See models/*.toml for
the vetted token IDs for supported models.
think_start_token_ids
| Property | Value |
|---|---|
| Type | [u32] |
Token IDs that mark the start of a reasoning chain. Model- and tokenizer-specific. If empty, the router never enters think-phase for this model.
think_end_token_ids
| Property | Value |
|---|---|
| Type | [u32] |
Token IDs that mark the end of a reasoning chain. When the decoded stream
contains any of these IDs, the router emits ExitThink.
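The boundary detection can be sketched as a two-state scan over the token stream. In the example below the token IDs 1001 and 1002 are made up, `EnterThink` is an assumed name for the counterpart of ExitThink, and emitting ExitThink only while in think-phase is an assumption.

```python
def phase_events(token_ids, think_start_token_ids, think_end_token_ids):
    """Return (index, event) pairs for each phase boundary found in the stream."""
    start_ids, end_ids = set(think_start_token_ids), set(think_end_token_ids)
    in_think = False
    events = []
    for index, token_id in enumerate(token_ids):
        if not in_think and token_id in start_ids:
            in_think = True
            events.append((index, "EnterThink"))
        elif in_think and token_id in end_ids:
            in_think = False
            events.append((index, "ExitThink"))
    return events


stream = [5, 1001, 7, 8, 1002, 9]  # hypothetical IDs: 1001 opens think, 1002 closes it
print(phase_events(stream, [1001], [1002]))
print(phase_events(stream, [], [1002]))  # empty start list: never enters think-phase
```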
reasoning_parser
| Property | Value |
|---|---|
| Type | string |
| Values | "deepseek_r1", "qwen3", "granite", "anthropic" |
Selects the reasoning-chain parser. Different models structure their think-output boundary differently; the parser handles model-specific normalization.
supports_think_disable
| Property | Value |
|---|---|
| Type | bool |
Whether the model supports a /no_think directive to suppress the reasoning
phase entirely. When true and the prompt contains the directive, the router
stays in output-phase for the entire request.