This page looks best with JavaScript enabled

Paged Attention: Virtual Memory for the GPU

 ·   ·  โ˜• 10 min read · ๐Ÿ‘€... views

the fragmentation problem

the previous post in this series explained why we cache key and value vectors during autoregressive decoding. by the end of that post, the KV cache was saving us enormous amounts of recomputation โ€” but it quietly introduced a new problem: where do you put all that memory?

the naive answer is: allocate a contiguous block of GPU memory for each request, big enough to hold its maximum possible output length. but in practice, you don’t know the output length in advance. so you guess โ€” reserving space for the worst case.

this leads to two kinds of waste:

internal fragmentation. every request gets pre-allocated memory for the maximum output length, but most requests finish much earlier. if you reserve 2048 tokens of KV cache but the request ends at token 300, the remaining 1748 token slots sit empty for the entire duration.

external fragmentation. as requests finish at different times, they release their reserved blocks back to the free pool. soon the free memory is scattered in many disconnected chunks. a new long request needs a large contiguous block, but even if the total free memory is sufficient, no single contiguous region may be large enough to satisfy the allocation.

GPU memory (naive allocation):

[Request A: โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘]  โ† 1/3 used, 2/3 wasted
[Request B: โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘]  โ† 1/2 used, 1/2 wasted
[Request C: โ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘]  โ† early finish, lots of waste
[          free (fragmented)         ]
[Request D: ????  can't fit! total   ]
[           free is enough, but no   ]
[           contiguous block exists  ]
Figure 1: naive contiguous allocation (left) wastes most GPU memory through internal and external fragmentation; paged attention (right) scatters fixed-size blocks through a shared pool, achieving ~96% utilization.

Figure 1: naive contiguous allocation (left) wastes most GPU memory through internal and external fragmentation; paged attention (right) scatters fixed-size blocks through a shared pool, achieving ~96% utilization.

the vLLM paper (Kwon et al., 2023) measured that naive implementations achieve only 20โ€“38% GPU memory utilization. the rest is wasted on fragmentation.

borrowing from the OS

operating systems solved exactly this problem decades ago for process memory. the solution: paging.

instead of giving each process a contiguous chunk of physical memory, the OS divides physical memory into fixed-size frames and gives processes a virtual address space divided into same-size pages. a page table maps each virtual page to its physical frame. frames don’t need to be contiguous โ€” scattered free frames are assembled into a coherent virtual address space through the mapping.

paged attention applies this idea to KV cache management:

OS conceptpaged attention equivalent
physical page frameKV block โ€” a fixed-size contiguous chunk of GPU memory
virtual pagelogical block โ€” the i-th segment of a request’s KV sequence
page tableblock table โ€” per-request mapping from logical to physical blocks
page faultrequesting a new block from the block manager

instead of allocating one large contiguous region per request, we carve the entire available GPU memory into a pool of fixed-size KV blocks (typically holding 16 tokens each). requests are granted blocks one at a time as they need them โ€” and those blocks can come from anywhere in the pool.

data structures

the KV block

each KV block stores the key and value vectors for exactly block_size tokens (typically 16), for every transformer layer:

KV Block #7  (block_size = 16):
  layer  0 โ†’  K[0..15] shape (16, d_head)
              V[0..15] shape (16, d_head)
  layer  1 โ†’  K[0..15], V[0..15]
  ...
  layer 31 โ†’  K[0..15], V[0..15]

the block pool is pre-allocated at startup:

Physical block pool (N blocks total):
[ B0 | B1 | B2 | B3 | B4 | B5 | B6 | B7 | ... | B_N ]
  A     A    A    B    free  C    A   free  ...

each block is either free or assigned to exactly one request (or shared, as we’ll see below).

the block table

every sequence maintains a block table: a simple list mapping logical block index to physical block number.

Request A (38 tokens generated, block_size = 16):

  logical block 0  โ†’  physical block  1   (full: 16 tokens)
  logical block 1  โ†’  physical block  5   (full: 16 tokens)
  logical block 2  โ†’  physical block  3   (6 tokens, partially filled)

to find where token \(t\) lives: physical_block = block_table[t // block_size], offset within block = t % block_size. physical blocks need not be contiguous; the block table is what assembles them into a coherent sequence.

Figure 2: logical blocks 0, 1, 2 map via the block table to scattered physical blocks anywhere in the pool. the attention kernel dereferences this indirection at runtime.

Figure 2: logical blocks 0, 1, 2 map via the block table to scattered physical blocks anywhere in the pool. the attention kernel dereferences this indirection at runtime.

the block manager

a centralized block manager tracks which blocks are free, maintains reference counts for shared blocks, and handles allocation and deallocation:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
class BlockManager:
    def __init__(self, num_gpu_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_gpu_blocks))
        self.ref_count = [0] * num_gpu_blocks

    def allocate(self, seq) -> bool:
        needed = ceil(seq.prompt_len / self.block_size)
        if len(self.free_blocks) < needed:
            return False  # scheduler must preempt another request
        seq.block_table = []
        for _ in range(needed):
            bid = self.free_blocks.pop()
            self.ref_count[bid] = 1
            seq.block_table.append(bid)
        return True

    def append_slot(self, seq):
        last_token = seq.num_tokens - 1
        last_logical = last_token // self.block_size
        if last_logical >= len(seq.block_table):
            # current block is full, need a new one
            bid = self.free_blocks.pop()
            self.ref_count[bid] = 1
            seq.block_table.append(bid)
        else:
            phys = seq.block_table[last_logical]
            if self.ref_count[phys] > 1:
                self._copy_on_write(seq, last_logical)

    def free(self, seq):
        for bid in seq.block_table:
            self.ref_count[bid] -= 1
            if self.ref_count[bid] == 0:
                self.free_blocks.append(bid)

attention over non-contiguous blocks

standard attention assumes \(K\) and \(V\) are contiguous tensors. paged attention modifies this assumption: KV pairs live in scattered blocks, and the attention kernel must gather them.

the computation is equivalent to splitting the key-value sequence into blocks and running online softmax to merge results incrementally:

m = -inf  (running max)
l = 0     (running normalizer)
o = 0     (running output)

for each logical block j:
    K_j, V_j = load_block(block_table[j])       # gather from physical memory
    a_j = (q @ K_j.T) / sqrt(d_k)               # shape: (1, block_size)

    m_j = max(a_j)
    l_j = sum(exp(a_j - m_j))
    o_j = exp(a_j - m_j) @ V_j

    # merge with running statistics (log-sum-exp trick)
    m_new = max(m, m_j)
    l = exp(m - m_new) * l + exp(m_j - m_new) * l_j
    o = exp(m - m_new) * o + exp(m_j - m_new) * o_j
    m = m_new

output = o / l

this online softmax formulation lets each block be fetched independently. there is no need to concatenate KV before running attention โ€” the CUDA kernel loops over blocks (indexed via the block table) and accumulates the softmax-weighted output in registers.

mathematically, this produces exactly the same result as standard attention over a concatenated sequence. the key identity is:

\begin{equation}
\text{softmax}([a_0, a_1, \ldots]) \cdot [V_0; V_1; \ldots] = \text{OnlineSoftmax}({(a_j, V_j)}_j)
\end{equation}

where OnlineSoftmax denotes the incremental merge above.

copy-on-write for parallel sampling

parallel sampling generates multiple independent outputs from the same prompt (e.g., beam search, or sampling with \(n > 1\)). without paged attention, each candidate would need its own full copy of the prompt’s KV cache โ€” wasting memory proportional to the number of candidates.

with block-level sharing, all candidates share the same physical blocks for the prompt, identified via reference counting:

Parallel sampling (3 candidates from the same prompt):

  logical block 0  โ†’  physical block 2  (ref_count = 3, shared)
  logical block 1  โ†’  physical block 6  (ref_count = 3, shared)

  Candidate 1 โ†’  block 9   (ref_count = 1, exclusive)
  Candidate 2 โ†’  block 11  (ref_count = 1, exclusive)
  Candidate 3 โ†’  block 14  (ref_count = 1, exclusive)

when a candidate needs to write to a shared block (because the new token falls in a slot that is still shared), it triggers copy-on-write:

  1. allocate a new free block
  2. gpu_memcpy the shared block’s content into the new block
  3. decrement ref_count on the original block
  4. update the sequence’s block table to point to the new block
  5. write the new token into the new block (now exclusively owned)
Figure 3: three sampling candidates share prompt blocks (ref_count = 3) at zero memory cost. when candidate 2 writes a new token into a shared block, copy-on-write allocates a fresh exclusive block before writing.

Figure 3: three sampling candidates share prompt blocks (ref_count = 3) at zero memory cost. when candidate 2 writes a new token into a shared block, copy-on-write allocates a fresh exclusive block before writing.

this is exactly how Linux handles copy-on-write after fork().

prefix caching falls out naturally

a beautiful property of block-level KV management is that cross-request sharing becomes straightforward. if two requests share the same prefix (e.g., the same system prompt), their blocks for that prefix will contain identical content โ€” so they can literally share the same physical blocks.

the implementation: compute a hash of each block’s token contents. use this hash as a key in a global block cache. on a new request, walk its prompt block-by-block: if a hash hits the cache, reuse the cached physical block (increment ref count); if it misses, compute the KV normally and populate the cache.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
def allocate_with_prefix_cache(self, seq):
    for i, token_block in enumerate(seq.prompt_blocks()):
        h = hash_block(token_block)
        if h in self.prefix_cache:
            phys = self.prefix_cache[h]
            self.ref_count[phys] += 1
            seq.block_table.append(phys)
            seq.cached_prefix_len += self.block_size
        else:
            phys = self.free_blocks.pop()
            self.ref_count[phys] = 1
            seq.block_table.append(phys)
            # will be populated during prefill
            self.prefix_cache[h] = phys

cached blocks skip prefill entirely โ€” the attention kernel reads their KV pairs just like any other block, but no forward pass was needed to fill them. for a 4096-token system prompt with block_size = 16, that’s 256 blocks that can be reused across all requests sharing that prompt. a full post on prefix caching covers the eviction policies and radix-tree structure that make this practical at scale.

memory utilization: 30% โ†’ 96%

implementationGPU utilizationreason for waste
naive contiguous allocation20โ€“38%max-length reservation + external fragmentation
paged attention~96%only the last (partially filled) block per request has internal fragmentation

the worst case internal fragmentation with paged attention is \(\text{block_size} / 2\) tokens per request (on average the last block is half full). with block_size = 16, that’s 8 tokens of wasted space โ€” negligible compared to the hundreds or thousands of tokens that naive allocation wastes.

this improvement is not just an efficiency win: it directly translates to higher concurrency. if GPU memory utilization goes from 30% to 96%, roughly 3ร— more requests can be in flight simultaneously on the same hardware.

connection to continuous batching

paged attention doesn’t exist in isolation. it pairs naturally with continuous batching, which schedules new requests at every decoding iteration rather than waiting for an entire batch to finish.

continuous batching creates a dynamic workload: requests are constantly arriving and completing, KV cache blocks are being allocated and freed. the block manager handles this gracefully โ€” freed blocks go back to the pool immediately, and new requests draw from the pool without any compaction or reshuffling. the block table indirection absorbs all the movement.

when the block pool runs out entirely, the scheduler can preempt low-priority requests by either swapping their KV blocks to CPU memory (expensive: ~32 GB/s PCIe bandwidth) or simply discarding them and marking them for re-prefill later (simpler, usually faster for short prompts).

summary

paged attention’s core insight is one sentence: KV cache doesn’t need to be contiguous โ€” it just needs to be addressable.

by borrowing virtual memory from the OS:

  • block table indirection absorbs physical non-contiguity
  • online softmax lets the attention kernel gather blocks at scatter addresses
  • reference counting enables zero-copy sharing among parallel candidates and across requests with common prefixes
  • fine-grained allocation eliminates external fragmentation; only โ‰ค half a block of internal fragmentation remains per request

the payoff: GPU utilization climbs from ~30% to ~96%, enabling ~3ร— more concurrent requests on the same hardware. this is the foundational memory management technique behind vLLM and essentially every modern LLM serving engine.

with KV cache memory under control, the next question is how to keep the GPU busy while requests arrive and finish at different times โ€” that’s the problem continuous batching solves.

Share on