Zzong's Notes

DeepSpeed-MoE

June 14, 2026 · 1 min read

DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

DeepSpeed-MoE is an end-to-end Mixture-of-Experts (MoE) training and inference solution that ships as part of the DeepSpeed library.

B) Features

Provides novel MoE architecture designs and model compression techniques that reduce MoE model size by up to 3.7x.
Includes a highly optimized inference system that provides 7.3x better latency and cost compared to existing MoE inference solutions.
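
To make the size and latency claims above easier to read, here is a minimal sketch of what a sparsely activated MoE layer does. It is a toy written for illustration only, not DeepSpeed's implementation: the class name `ToyMoELayer`, its methods, and the top-1 routing choice are all assumptions made for this note. It shows that the parameter count grows with the number of experts while each token only runs through one expert.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class ToyMoELayer:
    """Hypothetical top-1 gated MoE layer (illustration, not the DeepSpeed API)."""

    def __init__(self, hidden_size, num_experts, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

        # Gate: one row of scores per expert.
        self.gate = rand_matrix(num_experts, hidden_size)
        # Each expert is a single dense hidden_size x hidden_size matrix.
        self.experts = [rand_matrix(hidden_size, hidden_size) for _ in range(num_experts)]

    def forward(self, token):
        """Route one token to its highest-scoring expert only."""
        probs = softmax(matvec(self.gate, token))
        expert_id = max(range(len(probs)), key=probs.__getitem__)
        out = matvec(self.experts[expert_id], token)
        # Scale the expert output by its gate probability.
        return [probs[expert_id] * y for y in out], expert_id

    def num_parameters(self):
        """Total parameters stored, regardless of how many are used per token."""
        gate = sum(len(row) for row in self.gate)
        experts = sum(len(row) for expert in self.experts for row in expert)
        return gate + experts


layer = ToyMoELayer(hidden_size=4, num_experts=8)
output, chosen = layer.forward([0.5, -1.0, 0.25, 2.0])
print(f"expert {chosen} handled the token; layer stores {layer.num_parameters()} parameters")
```

With 8 experts the layer stores all 8 expert matrices, but the forward pass touches only one of them per token. That gap between stored and active parameters is why model size and inference cost are the two things an MoE system has to optimize.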

C) Related

D) References
