Meridian emits Prometheus metrics and OpenTelemetry traces. All metric names
are stable contracts — renames or semantic changes trigger a minor-version
bump per ADR-0007.
Tracks the distribution of reasoning-chain lengths. Long tails here indicate the entropy probe is deferring too late; a spike in the P99 bucket means hard-cap forcing is dominating.
Operator action: if P99 frequently hits max_think_tokens, lower max_think_tokens
or tighten eat_ema_variance_threshold / rpdi_threshold to fire forcing earlier.
Measures how often the router fires </think> injection. A counter that never moves means the entropy probe is never converging (check eat_ema_variance_threshold).
Breaks down why forcing fired. converged and overthinking mean the entropy probe is working. Sustained hard_cap dominance means the probe is failing to detect convergence.
Operator action: monitor the ratio hard_cap / (converged + overthinking) over
a 1-hour window. A ratio above 0.5 is a signal to investigate probe thresholds or
inspect sample EAT traces.
Every increment is a user-visible degradation event — a KV block backing the live output stream was evicted under memory pressure. Zero is the target.
Operator action: alert at rate(...) > 0 sustained for 5 minutes. Mitigate
by lowering think_phase_memory_fraction, think_batch_multiplier, or
max_think_tokens.
Requests that complete but are never reaped leak entries. Monotonically growing gauge means reap_stale_older_than is not being called, or the vLLM plugin's post_step is not receiving EOS events.
Operator action: if the gauge grows without a corresponding growth in active
concurrent requests, check that post_step receives EOS and that the reap
period (60 s default) is shorter than request lifetime.
Measures scheduling overhead on the hot path. Should be in the microsecond range; millisecond-range values indicate contention inside the scheduler lock.
Operator action: P99 above 1 ms under steady load is unexpected. File an issue
with a CPU profile.
Monitors queue backlog. Output queue depth growing without draining means the GPU is bottlenecked. Think queue depth growing without budget_force_triggered activity means the entropy probe is not converging and long chains are piling up.
Operator action: alert when queue_depth{queue="think"} P95 exceeds 4× its
1-hour baseline for 5 consecutive minutes without accompanying forcing activity.
Total KV bytes currently allocated across all tiers. Rising towards kv_memory.capacity_bytes predicts incoming eviction pressure.
Operator action: alert when block_manager.used_bytes / capacity_bytes exceeds
0.90 for 10 minutes — this is the early-warning threshold before
output_critical_eviction events begin.
Per-tier eviction rate reveals the shape of memory pressure. think_complete evictions are routine; think_active indicates moderate pressure; output_critical is a user-visible degradation event identical to meridian.output_critical_eviction.
Operator action: alert on any tier=output_critical increment — use this
series or meridian.output_critical_eviction, whichever is easier to route in
your alerting stack.
Distribution of actual batch sizes delivered to the vLLM worker per phase. A consistently small output batch under load means output requests are draining faster than think-phase completions replenish the pool.
Operator action: compare scheduler.batch_size{phase=output} P50 against
queue_depth{queue=output} to verify output requests are being served promptly.
Counts batches where logit rows had heterogeneous vocab sizes and the probe fell back to per-request compute. A rising counter means mixed-model batching is defeating the batched probe; investigate request routing.
Prometheus is the primary metric surface. When [telemetry] otlp_enabled = true
(requires the otel extra), the plugin additionally exports its counters to an
OTLP/HTTP collector at [telemetry] otlp_endpoint, and the Rust core can wire
its tracing spans to OTLP via the otel crate feature
(meridian_core::telemetry::install). Both are off by default.
Each MeridianScheduler::schedule_batch call opens a meridian.schedule_batch
OpenTelemetry span. PhaseEvents propagate meridian.phase_event{kind=...}
events on the active request's span, allowing per-request phase timelines to be
reconstructed from trace data.