Zzong's Notes

RLVR

3 entries

June 28, 2026
Group Sequence Policy Optimization
June 14, 2026
One Token to Fool LLM-as-a-Judge
June 14, 2026
Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs