Multi-Head Latent Attention

Created on February 06, 2025

2025 · NLP attention · ml-fundamentals

Rather than reducing the number of key/value heads as MQA (Multi-Query Attention) or GQA (Grouped Query Attention) do, MLA (Multi-Head Latent Attention) compresses the $W_{KV}$ matrix using low-rank decomposition.
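As a rough sketch of what this factorization looks like, the projection that produces keys and values can be split into a shared down-projection followed by separate up-projections. The notation below is an assumption that loosely follows the DeepSeek-V2 convention, not something defined in this post, and it ignores the separate positional (RoPE) key that the full method also uses. For a token hidden state $h_t \in \mathbb{R}^{d}$ and a latent dimension $d_c \ll d$:

$$
c_t^{KV} = W^{DKV} h_t, \qquad k_t = W^{UK} c_t^{KV}, \qquad v_t = W^{UV} c_t^{KV}
$$

Here $W^{DKV} \in \mathbb{R}^{d_c \times d}$ is the down-projection and $W^{UK}, W^{UV} \in \mathbb{R}^{n_h d_h \times d_c}$ are the up-projections, with $n_h$ heads of size $d_h$. Under these assumptions only $c_t^{KV}$ needs to be cached, so the per-token cache holds $d_c$ values instead of $2 n_h d_h$.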

MLA works as follows:

Compress the $K$ and $V$ vectors into latent $K$ and $V$ vectors
Store this compressed information in the KV cache
When needed, restore it to the full-size $K$ and $V$ (decompression)
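The three steps above can be sketched in NumPy. This is a minimal illustration, not the implementation of any particular model: the dimensions, the random weights, and the names `W_down`, `W_up_k`, `W_up_v`, `compress`, and `decompress` are all hypothetical, and details such as positional encoding are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not taken from any real model).
d_model = 64    # hidden size of each token
n_heads = 4     # number of attention heads
d_head = 16     # dimension per head
d_latent = 8    # latent size, much smaller than 2 * n_heads * d_head
seq_len = 10    # number of tokens seen so far

# Hypothetical low-rank factors: one shared down-projection, two up-projections.
W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)


def compress(h):
    """Step 1: project hidden states down to the latent KV representation."""
    return h @ W_down  # shape: (tokens, d_latent)


def decompress(latent):
    """Step 3: restore full-size K and V from the cached latent."""
    tokens = latent.shape[0]
    k = (latent @ W_up_k).reshape(tokens, n_heads, d_head)
    v = (latent @ W_up_v).reshape(tokens, n_heads, d_head)
    return k, v


h = rng.standard_normal((seq_len, d_model))

# Step 2: only the latent is stored in the KV cache.
kv_cache = compress(h)

k, v = decompress(kv_cache)

# Cache size in floats: latent cache vs. a standard cache holding full K and V.
latent_cache_floats = kv_cache.size
full_cache_floats = 2 * seq_len * n_heads * d_head
print(latent_cache_floats, full_cache_floats)
```

With these assumed sizes, the cache stores `d_latent` values per token instead of `2 * n_heads * d_head`, which is where the memory saving comes from; the price is the extra matrix multiplication in `decompress` whenever the full $K$ and $V$ are needed.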
