llama 에 대해서 알아보자

LLaMA 1, 2 모델을 뜯어보면서 알게된 내용들을 정리합니다.

We attribute their success, as all else, to divine benevolence.

RMSNorm (Root Mean Square Layer Normalization)

Why need a layer normalization? Internal covariate shift

RMS 는 LayerNorm 의 효과가 mean 보다는 variance 쪽에 기여치가 더 높을 것이라 가정한다. 그래서 RMSNorm 은 평균 계산을 포기하고 variance 만 계산하여 정규화해서 computation-efficiency 에 이득을 취한다.

Rotary positional embedding 은 relative positional embedding 과 비슷하지만, distance 정보를 상수값으로 치환하여 embedding vector 에 적용하면서 계산 효율을 높인다.

그 상수는 complex number 로, Euler’s formula 를 이용해서 attention 값을 계산한다.

Other positional embedding methods

Absolute: Vanilla transformers 에서 적용된 방법으로, Attention 계산 시 이미 고정된 constant position 정보가 적용되어 있는 각 embedding vector 를 계산에 활용한다.
Relative: Attention 계산 시, 각 embedding vector pair 마다 상대적인 distance 정보를 변수로 활용하여 계산한다.

Absolute, relative 와 다르게 rotary positional embedding 은 q, k weight 가 먼저 적용된 이후 에 적용된다는 점이다.

Multi-Query Attention

일반적인 multi-head attention 은 각 head 마다 서로 다른 key, value 를 사용하는데, 이를 하나로 통일하여 모든 query 에 동일한 key, value 를 사용한다.

Why?

Grouped Multi-Query Attention

Grouped Multi-Query 는 일정 개수의 그룹마다 동일한 key, value 를 사용하는 방법이다. 즉, group 사이즈가 1 일 때는 일반적인 multi-head attention 과 동일하다.

SwiGLU는 Swish + GLU, 두개의 Activation Functions를 섞어 만든 함수

Swish Function

$\sigma(x) = x \cdot \sigma(x)$ 로 표현된다.
Original Transformer 에서의 ReLU 와 비슷하지만, 음수쪽에서 0 에 가까워질때 기울기가 0 이 되는 문제(Dying ReLU)를 해결한다.

GLU (Gated Linear Unit)

\[GLU(a, b) = a \otimes \sigma(b)\]

The GLU also has non-linear capabilities, but has a linear path for the gradient so diminishes the vanishing gradient problem.

SwiGLU

GLU 에서 sigmoid 대신 Swish Function 을 사용한다.

\[SwiGLU(a, b) = a \otimes \text{Swish}_\beta(b)\]

구체적으로는 총 3개의 weight matrix 를 사용하여 LLaMA 의 FFN 을 구성한다.

\[\text{FFN}_\text{SwiGLU}(x, W, V, W_2) = (\text{Swish}_1(xW) \otimes xV)W_2\]

vs. ReLU

ReLU 보다는 SwiGLU 가 안정적으로 학습되는 느낌이지만, 그렇다고 엄청 좋은 성능을 보이는 것은 아니다. 실험 결과에서는 ReLU 는 83.80 점이고, SwiGLU 는 84.36 점 정도로, 거의 차이가 없는 느낌.

하지만 전반적인 벤치마크 성능에서 SwiGLU 쪽이 우위인 상황.