ADR-0004: KV tier transitions are one-way; demotion is the only direction
- Status: Accepted
- Date: 2026-05-20
- Authors: angelnicolasc
- Reviewers: sole-maintainer decision record
Context
The block manager maintains three tiers (ThinkActive, ThinkComplete,
OutputCritical). Tier transitions are emitted by the scheduler on phase
events. We had to decide whether a block can promote (e.g. a
ThinkComplete block re-attended during output generation gets restored
to ThinkActive or even OutputCritical).
Two camps:
- Bidirectional: any block currently accessed gets promoted to the highest applicable tier.
- One-way demotion: blocks only move down the eviction order (OutputCritical → … → freed). Promotion is explicitly disallowed.
Decision
One-way demotion. A block's tier is set at allocation time
(via BlockTier::ThinkActive or BlockTier::OutputCritical) and can only
move toward eviction:
- ThinkActive → ThinkComplete via demote_think_blocks.
- Any tier → freed via evict_for or free.
There is no API to move a ThinkComplete block back to ThinkActive, or
a ThinkActive block to OutputCritical.
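The shape of this API can be sketched as follows. This is a minimal illustration, not the actual block manager: BlockManager, BlockId, allocate, and tier_of are hypothetical names, the signatures of demote_think_blocks and free are assumed, and rejecting ThinkComplete at allocation is an inference from the statement that blocks are admitted as ThinkActive or OutputCritical. evict_for is omitted because its victim order is not specified here.

```rust
use std::collections::HashMap;

/// Tiers named in this ADR. No ordering is derived on purpose: the only
/// legal transitions are the ones spelled out by the methods below.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BlockTier {
    ThinkActive,
    ThinkComplete,
    OutputCritical,
}

/// Hypothetical block identifier (the real type is not given in this ADR).
pub type BlockId = u64;

/// Illustrative manager: maps each live block to its current tier.
#[derive(Default)]
pub struct BlockManager {
    tiers: HashMap<BlockId, BlockTier>,
    next_id: BlockId,
}

impl BlockManager {
    /// The tier is chosen once, at allocation. ThinkComplete is treated as
    /// reachable only through demotion (an assumption of this sketch).
    pub fn allocate(&mut self, tier: BlockTier) -> Option<BlockId> {
        if tier == BlockTier::ThinkComplete {
            return None;
        }
        let id = self.next_id;
        self.next_id += 1;
        self.tiers.insert(id, tier);
        Some(id)
    }

    /// ThinkActive -> ThinkComplete for every think block.
    /// Returns how many blocks were demoted.
    pub fn demote_think_blocks(&mut self) -> usize {
        let mut moved = 0;
        for tier in self.tiers.values_mut() {
            if *tier == BlockTier::ThinkActive {
                *tier = BlockTier::ThinkComplete;
                moved += 1;
            }
        }
        moved
    }

    /// Any tier -> freed. Returns the tier the block held, if it existed.
    pub fn free(&mut self, id: BlockId) -> Option<BlockTier> {
        self.tiers.remove(&id)
    }

    /// Read-only view of a block's tier; None once the block is freed.
    pub fn tier_of(&self, id: BlockId) -> Option<BlockTier> {
        self.tiers.get(&id).copied()
    }
}
```

The sketch has no method that raises a tier, so the one-way property holds by construction rather than through a runtime check.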
Consequences
Positive
- Reasoning about eviction stability becomes trivial. A block's tier monotonically decreases. Cross-attention back-references over a reasoning span cannot accidentally "rescue" blocks the scheduler has already decided are evictable — operators can reason about KV pressure without tracking promotion races.
- The block manager API is smaller. No promote() method to test, no invariant to enforce ("you can only promote within the same request").
- Aligns with the playbook intent. kv_memory.aggressive_think_eviction is a one-way knob: think blocks either survive demotion (default) or are freed immediately (aggressive). Promotion would make this knob semantically incoherent.
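The two settings of the knob can be illustrated with a small sketch. It assumes the knob is consulted when the think phase ends; KvMemoryConfig and end_think_phase are hypothetical names, and a plain Vec of tiers stands in for the real block table.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BlockTier {
    ThinkActive,
    ThinkComplete,
    OutputCritical,
}

/// Hypothetical mirror of the kv_memory config section.
pub struct KvMemoryConfig {
    pub aggressive_think_eviction: bool,
}

/// Applies the end-of-think-phase policy in place and returns the number
/// of blocks freed. Both branches move think blocks toward eviction;
/// neither can raise a tier.
pub fn end_think_phase(blocks: &mut Vec<BlockTier>, cfg: &KvMemoryConfig) -> usize {
    let before = blocks.len();
    if cfg.aggressive_think_eviction {
        // Aggressive: think blocks are freed immediately.
        blocks.retain(|t| *t != BlockTier::ThinkActive);
    } else {
        // Default: think blocks survive, demoted to ThinkComplete.
        for t in blocks.iter_mut() {
            if *t == BlockTier::ThinkActive {
                *t = BlockTier::ThinkComplete;
            }
        }
    }
    before - blocks.len()
}
```

With the flag off, a table of two ThinkActive blocks and one OutputCritical block keeps all three entries; with it on, only the OutputCritical block remains.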
Negative / risks
- A reasoning model that re-attends over its own think span during output generation pays the eviction cost twice. Cross-attention reads on a ThinkComplete block bring the block into the GPU's L1/L2 cache but do not promote its eviction tier. If that block is then evicted by memory pressure, the next cross-attention read forces a recompute. Mitigation: keep kv_memory.aggressive_think_eviction = false so blocks stay resident as ThinkComplete until pressure actually demands their eviction.
- No way to mark "this block is hot, please keep it" beyond keeping it in the lowest tier it was admitted to. The LRU within a tier is the only signal of recency.
Neutral
- The touch() API exists to update LRU position within a tier, not to promote across tiers. This is explicit in the trait documentation.
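A minimal sketch of that contract, assuming one LRU queue per tier. TieredLru, insert, lru_of, and the return type of touch are hypothetical; the real trait signature is not reproduced here. The linear scan keeps the example short; a hot-path implementation would need a constant-time structure.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BlockTier {
    ThinkActive,
    ThinkComplete,
    OutputCritical,
}

pub type BlockId = u64;

/// One LRU queue per tier: front = least recently used, back = most recent.
#[derive(Default)]
pub struct TieredLru {
    think_active: VecDeque<BlockId>,
    think_complete: VecDeque<BlockId>,
    output_critical: VecDeque<BlockId>,
}

impl TieredLru {
    fn queue(&self, tier: BlockTier) -> &VecDeque<BlockId> {
        match tier {
            BlockTier::ThinkActive => &self.think_active,
            BlockTier::ThinkComplete => &self.think_complete,
            BlockTier::OutputCritical => &self.output_critical,
        }
    }

    fn queue_mut(&mut self, tier: BlockTier) -> &mut VecDeque<BlockId> {
        match tier {
            BlockTier::ThinkActive => &mut self.think_active,
            BlockTier::ThinkComplete => &mut self.think_complete,
            BlockTier::OutputCritical => &mut self.output_critical,
        }
    }

    /// Registers a block as the most recently used entry of `tier`.
    pub fn insert(&mut self, id: BlockId, tier: BlockTier) {
        self.queue_mut(tier).push_back(id);
    }

    /// Moves `id` to the most-recent position of the tier it already lives
    /// in. Returns that tier, or None for an unknown block. The block never
    /// changes queues, so a touch cannot act as a promotion.
    pub fn touch(&mut self, id: BlockId) -> Option<BlockTier> {
        for tier in [
            BlockTier::ThinkActive,
            BlockTier::ThinkComplete,
            BlockTier::OutputCritical,
        ] {
            let q = self.queue_mut(tier);
            if let Some(pos) = q.iter().position(|&b| b == id) {
                q.remove(pos);
                q.push_back(id);
                return Some(tier);
            }
        }
        None
    }

    /// Least recently used block of a tier, i.e. its next eviction candidate.
    pub fn lru_of(&self, tier: BlockTier) -> Option<BlockId> {
        self.queue(tier).front().copied()
    }
}
```

Touching a ThinkComplete block changes which block in that tier is next in line for eviction, but leaves every other tier's queue untouched.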
Alternatives considered
Bidirectional with promote_block(block_id, tier)
Rejected because:
- Adds a new invariant to police (can a ThinkComplete of request A promote to OutputCritical of request B? Obviously not, but the API must enforce that).
- Promotion races with eviction: a block that the eviction iterator has selected as next victim might be promoted mid-eviction. Resolving this needs either a lock around the whole eviction loop (kills throughput) or a generation counter (more complexity).
- The cross-attention rescue use case is real but rare; the simpler fallback (operator tunes think_phase_memory_fraction) handles it without architectural complexity.
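To make the generation-counter cost concrete, here is a rough, single-threaded sketch of what the rejected design would have required. Every name is hypothetical (PromotingManager, Victim, select_victim, try_evict), and the interleaving of promotion and eviction is modelled by call order rather than real concurrency.

```rust
use std::collections::HashMap;

pub type BlockId = u64;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BlockTier {
    ThinkActive,
    ThinkComplete,
    OutputCritical,
}

struct Entry {
    tier: BlockTier,
    /// Bumped on every promotion so stale victim snapshots can be detected.
    generation: u64,
}

/// Snapshot taken by the eviction pass when it picks a victim.
pub struct Victim {
    id: BlockId,
    generation: u64,
}

#[derive(Default)]
pub struct PromotingManager {
    blocks: HashMap<BlockId, Entry>,
}

impl PromotingManager {
    pub fn insert(&mut self, id: BlockId, tier: BlockTier) {
        self.blocks.insert(id, Entry { tier, generation: 0 });
    }

    /// Eviction step 1: remember the victim and the generation it had.
    pub fn select_victim(&self, id: BlockId) -> Option<Victim> {
        self.blocks.get(&id).map(|e| Victim { id, generation: e.generation })
    }

    /// The rejected API: changes the tier and invalidates pending victims.
    pub fn promote_block(&mut self, id: BlockId, tier: BlockTier) -> bool {
        match self.blocks.get_mut(&id) {
            Some(e) => {
                e.tier = tier;
                e.generation += 1;
                true
            }
            None => false,
        }
    }

    /// Eviction step 2: free the block only if nobody promoted it since
    /// step 1. A false return means the eviction pass must pick again.
    pub fn try_evict(&mut self, v: &Victim) -> bool {
        let still_current = self
            .blocks
            .get(&v.id)
            .map_or(false, |e| e.generation == v.generation);
        if still_current {
            self.blocks.remove(&v.id);
        }
        still_current
    }

    pub fn tier_of(&self, id: BlockId) -> Option<BlockTier> {
        self.blocks.get(&id).map(|e| e.tier)
    }
}
```

Even in this reduced form, eviction becomes a two-step protocol with a retry path, which is the extra complexity the one-way design avoids.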
Implicit promotion on touch()
Rejected: touch() is called on the hot path for every cache hit. Doing
tier promotion there would dramatically slow the common case to optimise
a rare one.
References
- ADR-0001 — dual-queue scheduling sits alongside this tier policy.
- Playbook §3.4 — original three-tier eviction design.