Reinforcement Learning

asynchronous dynamic programming

Asynchronous Dynamic Programming: In Reinforcement Learning, "DP" usually means synchronous DP. The drawback of synchronous DP is that it requires sweeps over the entire state set: ...
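To make the contrast concrete, here is a minimal sketch of synchronous versus asynchronous (in-place, one state at a time) value iteration. The 4-state chain, its rewards, and all function names are hypothetical and chosen only for illustration; they are not taken from the linked note.

```python
import random

# Hypothetical toy problem (an assumption, not from the note):
# a deterministic chain with states 0..3, where state 3 is terminal.
# Action 0 moves left, action 1 moves right; every step yields reward -1.
N_STATES = 4
TERMINAL = 3
GAMMA = 1.0

def step(state, action):
    """Return (next_state, reward) for the toy chain."""
    if state == TERMINAL:
        return state, 0.0
    next_state = max(0, state - 1) if action == 0 else state + 1
    return next_state, -1.0

def backup(values, state):
    """Bellman optimality backup for a single state."""
    if state == TERMINAL:
        return 0.0
    outcomes = (step(state, action) for action in (0, 1))
    return max(reward + GAMMA * values[nxt] for nxt, reward in outcomes)

def synchronous_value_iteration(sweeps):
    values = [0.0] * N_STATES
    for _ in range(sweeps):
        # Full sweep: every state is recomputed from the previous value table.
        values = [backup(values, s) for s in range(N_STATES)]
    return values

def asynchronous_value_iteration(updates, seed=0):
    rng = random.Random(seed)
    values = [0.0] * N_STATES
    for _ in range(updates):
        # No sweep: pick any single state and update it in place.
        s = rng.randrange(N_STATES)
        values[s] = backup(values, s)
    return values

sync_values = synchronous_value_iteration(sweeps=10)
async_values = asynchronous_value_iteration(updates=500)
print(sync_values, async_values)
```

On this toy chain both variants reach the same value table; the asynchronous one simply never needs a pass over the whole state set at once.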

bootstrapping

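...values can be computed iteratively. This process is called bootstrapping. However, this approach becomes infeasible when there are too many states. Many Reinforcement Learning methods perform bootstrapping without the complete and accurate model of the environment (MDP) that DP requires. D) Related bagging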

expected return

Expected Return: The cumulative reward; in Reinforcement Learning the agent's goal is to maximize this value. In tasks that terminate (tasks that have episodes, i.e. episodic tasks), the return Gt is as follows...
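The excerpt is cut off before its formula, so the following is only a sketch under the common textbook convention, assumed here, that the return is the (optionally discounted) sum of rewards that follow time t. The function name `returns`, the `gamma` parameter, and the sample rewards are hypothetical.

```python
def returns(rewards, gamma=1.0):
    """Compute G_t for every step t of one finished episode.

    Assumption: rewards[t] is the reward received after step t, and
    G_t = rewards[t] + gamma * G_{t+1}, with G = 0 after the episode ends.
    """
    g = 0.0
    out = [0.0] * len(rewards)
    # Work backwards so each G_t reuses the already computed G_{t+1}.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        out[t] = g
    return out

episode_rewards = [1.0, 0.0, 2.0]  # made-up rewards for illustration
undiscounted = returns(episode_rewards)
discounted = returns(episode_rewards, gamma=0.5)
print(undiscounted, discounted)
```

With `gamma=1.0` this is the plain sum of the remaining rewards in the episode; a smaller `gamma` weights later rewards less.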

Exploration by Random Network Distillation

Exploration by Random Network Distillation RL methods work by maximizing the expected return of a policy. In reality it is often impractical to engineer d...

Multi-Armed Bandit

...dit refers to a Reinforcement Learning algorithm that, when the payout rate of each slot machine is unknown, balances exploration and exploitation to find the best achievable return. A.1) Mathematical definition
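As a rough illustration of trading off exploration and exploitation, here is a small epsilon-greedy sketch. Epsilon-greedy is one common strategy picked for this example, not necessarily the one the linked note describes; the win probabilities, parameter values, and function name are hypothetical.

```python
import random

def epsilon_greedy_bandit(win_probs, steps=5000, epsilon=0.1, seed=0):
    """Play a Bernoulli bandit whose win probabilities are hidden from the agent."""
    rng = random.Random(seed)
    n_arms = len(win_probs)
    counts = [0] * n_arms        # how often each arm was pulled
    estimates = [0.0] * n_arms   # running mean reward per arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random arm.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pull the arm with the highest current estimate.
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < win_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the sample mean for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return estimates, counts, total

# Three hypothetical slot machines with made-up payout rates.
estimates, counts, total_reward = epsilon_greedy_bandit([0.2, 0.5, 0.8])
print(estimates, counts, total_reward)
```

The agent only ever sees sampled rewards, so `estimates` is its learned view of each machine; `epsilon` controls how often it keeps exploring instead of pulling the arm that currently looks best.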

RLHF

SAGE - Steerable Agentic Data Generation for Deep Search with Execution Feedback

Deep Search Reinforcement Learning

Trinity, an evolved coordinator that orchestrates multiple LLMs

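| Training method | Performance | Characteristics |
| sep-CMA-ES (Trinity) | 61.5% | - |
| Supervised learning (SFT) | 59.2% | Label generation is very costly |
| Random search (RS) | 37.4% | Slow convergence |
| REINFORCE (RL) | 25.3% | Extremely noisy gradients |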

Zzong's Notes
