Troubleshooting
Each entry follows runbook format: Symptom → Likely cause → How to verify → Immediate mitigation → Longer-term fix.
Output streams stutter under load
Symptom: users report visible gaps in the output token stream; output ITL P99 spikes.
Likely cause: OutputCritical KV blocks are being evicted under memory pressure.
How to verify:
rate(meridian.output_critical_eviction[1m]) > 0
Any non-zero rate confirms the block manager is evicting user-visible KV.
Immediate mitigation: reduce load. Lower the arrival rate or raise kv_memory.capacity_bytes
if headroom exists.
Longer-term fix (pick one or more):
- Lower kv_memory.think_phase_memory_fraction (e.g. 0.40 → 0.30) to give output more room.
- Lower scheduler.think_batch_multiplier to reduce think-phase KV pressure.
- Lower scheduler.max_think_tokens so individual reasoning chains release blocks sooner.
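To get a feel for what lowering the fraction buys, the arithmetic can be sketched as below. This assumes kv_memory.think_phase_memory_fraction is a plain share of kv_memory.capacity_bytes and that the remainder is what output blocks can use. The 40 GiB capacity is an illustrative figure rather than a Meridian default, and output_budget_bytes is a hypothetical helper, not part of the library.

```python
GIB = 1024 ** 3


def output_budget_bytes(capacity_bytes: int, think_fraction: float) -> int:
    """Bytes left outside the think-phase share, under a simple split (assumption)."""
    if not 0.0 <= think_fraction <= 1.0:
        raise ValueError("think_fraction must be within [0, 1]")
    return int(round(capacity_bytes * (1.0 - think_fraction)))


capacity = 40 * GIB  # illustrative capacity, not a default
before = output_budget_bytes(capacity, 0.40)
after = output_budget_bytes(capacity, 0.30)
print(f"output budget: {before / GIB:.1f} GiB -> {after / GIB:.1f} GiB")
```

Under these assumptions a ten-point reduction in the think-phase fraction moves a tenth of total capacity to the output side, which gives a rough sense of whether the change is large enough to stop the evictions.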
Budget forcing never fires; chains always hit hard cap
Symptom: meridian.budget_force_reason{reason=hard_cap} increments constantly;
converged and overthinking stay at zero.
Likely cause: the entropy probe is not detecting convergence. Either the probe is disabled, the thresholds are too tight, or the model is being asked to reason on prompts that produce genuinely non-converging chains.
How to verify:
- Check meridian.toml: confirm entropy.enabled = true.
- Capture a sample EAT trace: temporarily add logging.level = "debug" and inspect EAT EMA values in the logs for a known-good reasoning prompt.
Immediate mitigation: none — hard-cap forcing is safe, just less adaptive.
Longer-term fix:
- If probe is disabled: set entropy.enabled = true.
- If thresholds are too tight: relax eat_ema_variance_threshold (e.g. 0.001 → 0.005) or lower rpdi_threshold (e.g. 3.0 → 2.0).
- If prompts are genuinely non-converging: lower max_think_tokens to bound KV cost.
Phase router shows runaway tracked-requests gauge
Symptom: meridian.phase_router.tracked_requests grows monotonically without
a corresponding growth in active requests.
Likely cause: completed requests are not being reaped from the router. Either
post_step is not receiving EOS events, or reap_stale_older_than is not being
called.
How to verify:
- Confirm the vLLM plugin's post_step is wired to the engine's step callback.
- Check the reap interval in the plugin config (default 60 s). If request lifetime is shorter than 60 s on average, the reaper may be lagging.
Immediate mitigation: no manual reap trigger is exposed in v0.1.x. The plugin
calls router.reap_stale_older_than(60.0) automatically every 64 schedule()
invocations; under active serving load this fires within seconds. To force an
immediate reap at deploy time, temporarily lower _reap_interval_schedules in
vllm_plugin.py to 1.
Longer-term fix: reduce the reap period in vllm_plugin.py or ensure post_step
receives every EOS signal from the vLLM worker.
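The interaction between the EOS path and the periodic reaper can be illustrated with a toy model. This is a sketch of the behaviour described above, not the plugin's code: TrackedRequests and touch are hypothetical, and reap_stale_older_than takes an explicit now argument here only to keep the example deterministic.

```python
class TrackedRequests:
    """Toy stand-in for the router's per-request tracking table (hypothetical)."""

    def __init__(self, reap_interval_schedules: int = 64) -> None:
        self._last_seen: dict[str, float] = {}
        self._reap_interval_schedules = reap_interval_schedules
        self._schedule_calls = 0

    def touch(self, request_id: str, now: float) -> None:
        """Record activity for a request while it is being scheduled."""
        self._last_seen[request_id] = now

    def post_step(self, finished_ids: list[str]) -> None:
        """Normal path: drop requests whose EOS was observed."""
        for request_id in finished_ids:
            self._last_seen.pop(request_id, None)

    def reap_stale_older_than(self, max_age_s: float, now: float) -> int:
        """Safety net: drop entries not touched within max_age_s seconds."""
        stale = [rid for rid, seen in self._last_seen.items() if now - seen > max_age_s]
        for rid in stale:
            del self._last_seen[rid]
        return len(stale)

    def schedule(self, now: float) -> None:
        """Run the reaper once every reap_interval_schedules invocations."""
        self._schedule_calls += 1
        if self._schedule_calls % self._reap_interval_schedules == 0:
            self.reap_stale_older_than(60.0, now)

    @property
    def tracked(self) -> int:
        return len(self._last_seen)


router = TrackedRequests()
router.touch("req-a", now=0.0)
router.touch("req-b", now=0.0)
router.post_step(["req-a"])   # EOS delivered: removed immediately
for _ in range(64):           # EOS for req-b was lost; only the reaper can remove it
    router.schedule(now=61.0)
print("tracked after reap:", router.tracked)
```

In this model the gauge only grows without bound when both paths fail: post_step never sees the EOS and schedule() is not being invoked often enough for the reaper to run.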
CUDA kernel returns Unavailable
Sprint 0 note:
python/meridian/_backends/cuda.py currently delegates to CpuEntropyBackend; KernelError::Unavailable cannot be triggered through the Python probe in v0.1.x. The scenario below applies once Sprint 1 wires the Python backend to the Rust kernel path in crates/meridian-kernels/.
Symptom: logs show KernelError::Unavailable; entropy probe silently falls back
to CPU; meridian.budget_force_reason{reason=hard_cap} is the only firing signal.
Likely cause: the meridian-kernels crate was built without --features cuda,
or libcudart.so is not on the dynamic linker path at runtime.
How to verify:
# Confirm the Python extension loads and report the current backend behaviour.
python -c "
from meridian._backends.cuda import CudaEntropyBackend
b = CudaEntropyBackend()
print('backend name:', b.name)
# Sprint 0: always prints 'cuda' but delegates to CPU internally.
# Real CUDA dispatch requires a --features cuda build (Sprint 1).
"
# Confirm the Rust kernels extension is importable.
python -c "import meridian._meridian; print('native extension OK')"
Immediate mitigation: the CPU fallback is correct — entropy values are identical.
The only cost is CPU cycles and higher schedule_batch.duration_ns latency.
Longer-term fix:
- Rebuild with cargo build -p meridian-kernels --features cuda.
- Verify libcudart.so is on LD_LIBRARY_PATH or in /usr/local/cuda/lib64.
- Run maturin develop --release -m crates/meridian-python/Cargo.toml to regenerate the Python extension.
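The libcudart check can be done from the same Python environment that serves requests, which avoids mismatches between a login shell and the service's environment. The sketch below uses only the standard-library ctypes module; find_cudart is a hypothetical helper, and the fallback path is the conventional toolkit location named above, which may differ on a given host.

```python
import ctypes
import ctypes.util
from typing import Optional


def find_cudart() -> Optional[str]:
    """Return a loadable libcudart name or path, or None if none can be loaded."""
    candidates = []
    found = ctypes.util.find_library("cudart")  # consults the system linker search
    if found:
        candidates.append(found)
    # Fall back to the conventional toolkit location.
    candidates.append("/usr/local/cuda/lib64/libcudart.so")
    for candidate in candidates:
        try:
            ctypes.CDLL(candidate)  # raises OSError if the library cannot be loaded
        except OSError:
            continue
        return candidate
    return None


location = find_cudart()
print("libcudart:", location or "not loadable; check LD_LIBRARY_PATH")
```

A None result on a host that does have the CUDA toolkit installed usually points at the environment of the serving process rather than at the build.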
Disagg offload not firing
Symptom: meridian.disagg.blocks_offloaded counter never increments after
[disagg] enabled = true is set.
Likely cause: the fabric is not reachable, the offload_threshold_blocks has
not been reached, or the NIXL feature was not compiled in.
How to verify:
- Confirm cargo build -p meridian-kernels --features nixl succeeds.
- Confirm config.disagg.fabric != "none".
- Check whether the ThinkComplete block count per step is below offload_threshold_blocks.
Immediate mitigation: lower offload_threshold_blocks to 1 temporarily to
force immediate offload on every ExitThink.
Longer-term fix: verify fabric connectivity (NIXL service running, network path
open) and restore offload_threshold_blocks to the production value.
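The three verification steps can be folded into a small triage helper so the first failing precondition is reported consistently. This is only a sketch: OffloadInputs and diagnose_offload are hypothetical names, and the values are expected to be gathered by hand from the build, the config, and the metrics rather than read from any Meridian API.

```python
from dataclasses import dataclass


@dataclass
class OffloadInputs:
    """Values collected manually from the build, config, and metrics (hypothetical)."""
    nixl_feature_built: bool
    fabric: str
    think_complete_blocks_per_step: int
    offload_threshold_blocks: int


def diagnose_offload(inputs: OffloadInputs) -> str:
    """Return the first failing precondition, checked in the order listed above."""
    if not inputs.nixl_feature_built:
        return "rebuild meridian-kernels with --features nixl"
    if inputs.fabric == "none":
        return "set disagg.fabric to a real fabric"
    if inputs.think_complete_blocks_per_step < inputs.offload_threshold_blocks:
        return "threshold not reached; lower offload_threshold_blocks to test"
    return "preconditions hold; check fabric connectivity"


# Illustrative values only.
print(diagnose_offload(OffloadInputs(True, "nixl", 3, 16)))
```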
Plugin does not intercept schedule_batch
Symptom: no Meridian metrics appear; vLLM appears to schedule normally without phase-awareness.
Likely cause: MeridianSchedulerPlugin.attach() was never called, or the plugin
was attached to a different scheduler instance than the one the engine uses.
How to verify:
print(type(engine.scheduler[0]))
# Should print: <class 'meridian.vllm_plugin.MeridianSchedulerPlugin'>
# Not: <class 'vllm.core.scheduler.Scheduler'>
Immediate mitigation: call MeridianSchedulerPlugin.attach(engine, cfg, model_key="...")
explicitly after engine construction. attach is a classmethod — it constructs
the plugin and replaces engine.scheduler[0] in one step.
Longer-term fix: verify the plugin initialisation order in the serving entrypoint.
The plugin must be attached after AsyncLLMEngine is fully constructed but before
the first request is submitted.
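One way to catch an ordering mistake early is to assert the scheduler type in the serving entrypoint, right after attaching and before accepting traffic. The sketch below shows such a guard; assert_plugin_attached is a hypothetical helper, and the Fake classes are stand-ins so the example runs without vLLM or Meridian installed. In a real entrypoint the engine and MeridianSchedulerPlugin would be passed instead.

```python
def assert_plugin_attached(engine, plugin_cls) -> None:
    """Raise if engine.scheduler[0] is not an instance of plugin_cls."""
    scheduler = engine.scheduler[0]
    if not isinstance(scheduler, plugin_cls):
        raise RuntimeError(
            f"expected {plugin_cls.__name__} at engine.scheduler[0], "
            f"found {type(scheduler).__name__}"
        )


# Stand-ins for illustration only.
class FakePlugin:
    pass


class FakeEngine:
    def __init__(self, scheduler) -> None:
        self.scheduler = [scheduler]


assert_plugin_attached(FakeEngine(FakePlugin()), FakePlugin)  # passes silently
```

Failing fast at startup is preferable to discovering the problem later through missing metrics.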