KV-Cache 에 대해 알아보자

KV Cache 계산 공식
- 예시
References

The Attention module concatenates the current key-values with the past key-values stored in the cache. This results in attention weights of shape (new_tokens_length, past_kv_length + new_tokens_length). Essentially, the past and current key-values are combined to compute attention scores, ensuring that the model considers both previous context and new input. The concatenated key-values are used to compute the attention scores resulting in attention weights of shape (new_tokens_length, past_kv_length + new_tokens_length).

Therefore, when iteratively calling forward() instead of the generate() method, it’s crucial to ensure that the attention mask shape matches the combined length of past and current key-values. The attention mask should have the shape (batch_size, past_kv_length + new_tokens_length). This is usually handled internally when you call generate() method. If you want to implement your own generation loop with Cache classes, take this into consideration and prepare the attention mask to hold values to current and past tokens.

KV Cache 계산 공식

모델을 vllm 같은 프레임워크에서 배포하기 위해 메모리가 충분한지 계산을 해야한다.

이때 아래와 같은 공식을 사용할 수 있다.

KV 캐시 메모리 (Bytes) = \
  2 * n_layers * n_heads * d_head * context_length * bytes_per_param

2: Key와 Value를 모두 저장하므로 2배.
n_layers: 트랜스포머 레이어 수
n_heads: 어텐션 헤드 수
d_head: 헤드 당 차원 (예: hidden_size=8192면 d_head = 8192 / n_heads).
context_length: 컨텍스트 길이 (예: 32k=32768).
bytes_per_param: 데이터 타입 (FP16=2 bytes, FP32=4 bytes).

예시

LLaMA-65B 모델을 fp16 모드로 사용하는 경우 아래와 같이 계산할 수 있다.

n_layers: 80
hidden_size: 8192
n_heads: 64
d_head: 8192 / 64 = 128
context_length: 32768
bytes_per_param: 2 (FP16)

KV 캐시 메모리 (Bytes) = \
  2 * 80 * 64 * 128 * 32768 * 2 = \
  85,899,345,920 Bytes = 약 85.9 GB

References

huggingface transformers - kv_cache

KV Cache 계산 공식

예시

References

Enjoy Reading This Article?