Original E5

Introduction

기존의 BERT 와 GPT 같은 모델을 이용해 (추가 학습 없이) text representation 을 추출할 수 있으나, 이러한 모델들은 contrastive learning 으로 학습을 진행하지 않다보니 embedding 품질이 상대적으로 아쉬울 수 있다.

a pre-trained Transformer encoder and average pooling over the output layer to get fixed-size text embeddings

two prefix identifiers “query:” and “passage:” to $q$ and $d$

Datasets: CCPairs

1.3B unfiltered

pair (query, passage) 예시

reddit: (post, comment)
stackoverflow: (question, upvoted answer)
wikipedia: (title, abstract)

A consistency-based data filtering technique

신경망 모델이 clean label 을 먼저 학습하고 그 이후에 noisy label 을 학습하는 경향이 있으므로, 이러한 경향을 활용하여 일관성 기반 데이터셋 필터링 기법을 적용

아래의 방법을 통해 1.3B text pairs 를 270M 개로 줄임

우선 1.3B 데이터셋을 이용해서 모델을 학습
학습된 모델을 이용해 query 와 (정답이 포함된) 1M random passages 에 대한 연관성을 랭킹
Top-k(=2) 개의 passage 안에 실제 정답이 포함되어 있는지 확인
만약 정답이 포함되면 pair 를 keep, 그렇지 않다면 discard

Training

Contrastive loss (for pre-training)

아래와 같은 InfoNCE contrastive loss 를 사용하여 학습

\[\min L_{\mathrm{cont}}=-\frac{1}{n} \sum_i \log \frac{\mathrm{e}^{s_\theta\left(q_i, p_i\right)}}{\mathrm{e}^{s_\theta\left(q_i, p_i\right)}+\sum_j \mathrm{e}^{s_\theta\left(q_i, p_{i j}^{-}\right)}}\]

$s_\theta$: 모델 파라미터 $\theta$ 에 기반한 유사도 계산 함수
$q_i$: query
$p_i$: a positive passage
$p_{i j}^{-}$: negative passages

실제 구현은 sentence-transformers 라이브러리를 이용했는데, 찾아본 결과 MultipleNegativesRankingLoss 를 이용하여 학습한 것으로 보인다.

Knowledge distillation (for finetuning)

knowledge distillation from a cross-encoder (CE) teacher model

KL divergence $D_{\mathrm{KL}}$ for distilling soft labels from the teacher model

\[\min D_{\mathrm{KL}}\left(p_{\mathrm{ce}}, p_{\mathrm{stu}}\right)+\alpha L_{\mathrm{cont}}\]

$p_{\mathrm{ce}}$, $p_{\mathrm{stu}}$: probability from the cross-encoder (CE) teacher and student model respectively
$\alpha$: a hyperparameter to balance the two loss functions
$L_{\mathrm{cont}}$: contrastive loss

여기서는 SimLM 모델을 cross-encoder (teacher model)로 사용하였다고 한다.

Multilingual E5

e5-multilingual 모델은 다음과 같은 두 단계의 학습 전략을 통해 학습되었다.

Training: Two-stage methodology

Weakly-supervised contrastive pre-training on billions of text pairs
Supervised finetuning on small quantity of high-quality labeled data

(1) Weakly-supervised contrastive pre-training

약 1B 정도의 다국어 데이터셋(text pairs)을 이용하여 학습했다.

학습 loss 의 경우, in-batch negatives 가 적용된 InfoNCE contrastive loss 를 사용했다. 다른 하이퍼파라미터는 영어 전용 e5 모델을 학습할때와 동일하게 설정하였다고 한다.

참고로, In-batch negative 는 딥러닝 모델 학습 시 배치(batch) 내의 다른 샘플들을 부정 샘플(negative sample)로 활용하는 기법이다. 해당 방식에 대해 몇가지 의견이 있다면…

이는 별도의 negative sampling 작업없이 현재 배치에 포함된 데이터만으로 모델을 학습시킬 수 있어 효율적이다.
또한, 배치 사이즈가 충분히 크다면 이런 단순한 전략이 다른 방식보다 훨씬 효율적이고 안정적인 학습을 가능하게 한다고 말한다.
물론, batch size 를 줄이고 hard negatives 를 추가하는 것도 나쁘진 않지만, 100M 이상의 큰 데이터셋에서 hard negatives 를 찾아내는건 쉽지 않다 (non-trivial).

(2) Supervised finetuning

이 단계에서는 in-batch negatives 전략 외에도, mined hard negatives 와 교차 인코더 모델(cross-encoder model)로부터의 지식 증류(knowledge distillation)를 전략을 추가로 활용하여 임베딩 품질을 더욱 향상시켰다.

Instruction model 의 경우, Reference [2] 에서 사용했던 GPT-3/4 기반 합성 데이터를 이용하여 embedding model 에 대해 instruction 튜닝을 진행하였다.

Training: hyper-parameters

mE5 (multilingual E5) 모델은 각각 다음과 같은 모델로부터 초기화되었다.

mE5small: multilingual MiniLM
mE5base: xlm-roberta-base
mE5large: xlm-roberta-large

Learning Rate 의 경우 모델 사이즈가 커질수록 더 낮게 설정했다.

Pretraining: small, base, large 각각 {3, 2, 1}×10⁻⁴
Finetuning: small, base, large 각각 {3, 2, 1}×10⁻⁵

Evaluation

MTEB benchmark
MIRACL multilingual retrieval benchmark (MAP, nDCG)
Bitext mining

References

Papers

Code

e5-evaluation: https://github.com/microsoft/unilm/tree/master/e5
sentence-transformers: https://github.com/UKPLab/sentence-transformers

Models

한국어 특화 임베딩 모델: https://github.com/nlpai-lab/KURE

Blog

KURE: 최초의 한국어 특화 임베딩 모델

Text Embedding 모델: E5