Zzong's Notes

❯

machine_learning

❯

mixed precision

mixed precision

2026년 6월 14일1 min read

Mixed Precision

mixed precision 은 모델 학습시 FP16, FP32 부동 소수점 유형을 상황에 따라 유연하게 사용하여 학습을 더 빠르게 실행하고 메모리를 적게 사용하는 방법이다. Forward, Backward Propagation 은 모두 FP16 으로 연산하고, weight 를 업데이트 할 때에는 다시 FP32 로 변환하여 계산한다.

B) 장점

기존 FP32 로 학습한 것 대비, 정확도의 손실이 없거나 오히려 향상됨
메모리가 절반으로 줄어들어, 배치 사이즈를 늘리거나, 더 큰 모델을 학습할 수 있음

C) Related

D) References

함께 보면 좋은 글

gradient checkpoint

Gradient Checkpoint GitHub - cybertronai/gradient-checkpointing: Make huge neural nets fit in memory.

layer normalization

Layer Normalization Batch Normalization 이라고도 불리며, 배치에 있는 각 입력을 평균이 0 이고 분산이 1 을 가지도록 정규화하는 작업을 의미한다.

Eager Learning

Eager Learning 일반적인 학습 과정으로, training 데이터로 일정 기간 학습시킨 후, 학습시킨 모델을 기반으로 테스트 데이터를 적용하는 방법 training 데이터는 학습 시에만 메모리에 보관되며 학습 이후에는 training 데이터를 메모리에서 제거한다.

Lazy learning

Lazy Learning Training data 전체를 메모리상에 보관하면서 test 데이터가 새로 들어왔을 때 바로 학습하는 방법 A.1) 장점 추가적인 학습 시간 없이, 곧바로 학습 결과를 얻을 수 있다.

Fully Sharded Data Parallel

Fully Sharded Data Parallel B) Error Handling UserWarning: FSDP is switching to use NO SHARD instead of ShardingStrategy.FULL SHARD since the world size is 1.

overfitting

Overfitting 이란 모델이 특정 데이터 셋에 과도하게 적합된 것을 의미 B) Overfitting 문제를 완화하기 위한 접근들 더 많은 training data Data Augmentation regularization dropout Early Stopping C) Related D) References.

Trainer(huggingface)

Trainer(huggingface) 언어 모델을 학습하기 위해 사용하는 허깅페이스 클래스. 주로 LLM 을 학습할때 사용한다. 2. Training Arguments steps: 데이터셋에 포함된 배치 개수를 의미.

vanishing gradients

Vanishing Gradients gradient 가 backpropagation 을 통해 layer 들을 지날 때 마다 exponential 하게 감소하는 현상 해결책: Gated Recurrent Unit / Long Short-Term Memory / Truncated BTT exploding...

Data Augmentation

Data Augmentation overfitting 에 대응하기 위해서는 많은 training data 를 이용해서 학습하는 것이 좋다. 특히, 이미지 데이터의 경우, 기존 이미지를 변형해서 새로운 이미지 학습 데이터를 만들 수 있다.

knowledge distillation

Knowledge Distillation 모델 경량화를 위해 쓰이는 방식 2. Related 3. References Hinton et al.,2015 .

Mixed Precision
B) 장점
C) Related
D) References