Reward Hacking
Maximizing the reward signal without fulfilling the intended goal.
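A toy sketch of the idea (hypothetical environment and policies, not from any benchmark): the intended goal is to reach position 10, but the proxy reward pays for any rightward step, so an oscillating policy farms unbounded proxy reward while never finishing the task.

```python
# Hypothetical 1-D task: the real goal is to reach position 10, but the
# proxy reward pays +1 for any rightward step and never penalizes stepping
# back, so oscillating "hacks" the proxy without completing the task.

def proxy_reward(old_pos, new_pos):
    return 1.0 if new_pos > old_pos else 0.0

def run(policy, steps=40):
    pos, total = 0, 0.0
    for t in range(steps):
        move = policy(t, pos)
        total += proxy_reward(pos, pos + move)
        pos += move
    return pos, total

def honest(t, pos):
    return 1 if pos < 10 else 0      # walk to the goal, then stop

def hacker(t, pos):
    return 1 if t % 2 == 0 else -1   # oscillate forever, farming the proxy

print(run(honest))  # → (10, 10.0): goal reached, bounded reward
print(run(hacker))  # → (0, 20.0): goal never reached, more proxy reward
```

The hacker earns twice the proxy reward of the honest policy while making zero progress on the real task, which is exactly the gap reward hacking exploits.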
Related terms:
Reward shaping: modifying the reward signal to accelerate learning.
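A common safe form is potential-based shaping, F(s, s') = γΦ(s') − Φ(s), which densifies feedback without changing the optimal policy. A minimal sketch, assuming a hypothetical goal-distance potential:

```python
# Potential-based reward shaping: add F(s, s') = gamma * phi(s') - phi(s)
# to the environment reward. This gives dense feedback while provably
# leaving the optimal policy unchanged. phi is a hypothetical heuristic:
# higher potential closer to an assumed goal state.

GAMMA = 0.99
GOAL = 10

def phi(state):
    return -abs(GOAL - state)

def shaped_reward(env_reward, state, next_state, gamma=GAMMA):
    return env_reward + gamma * phi(next_state) - phi(state)

# In a sparse setting (env_reward == 0), a step toward the goal now
# yields positive shaped reward, and a step away yields negative.
print(shaped_reward(0.0, state=3, next_state=4))  # positive
print(shaped_reward(0.0, state=4, next_state=3))  # negative
```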
Sparse reward: reward is only given upon task completion.
Reinforcement learning: a learning paradigm in which an agent interacts with an environment and learns to choose actions that maximize cumulative reward.
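A minimal sketch of this loop, using tabular Q-learning on a hypothetical 1-D chain (states 0 to 5, reward only at the terminal state, so the reward here is also sparse):

```python
import random

# Tabular Q-learning on a hypothetical 1-D chain: states 0..5,
# actions 0 (left) and 1 (right); reward 1.0 only on reaching state 5.
random.seed(0)
N, ALPHA, GAMMA, EPS = 6, 0.5, 0.9, 0.1
Q = [[0.0, 0.0] for _ in range(N)]

def greedy(s):
    # Break ties randomly so the untrained agent still moves around.
    if Q[s][0] == Q[s][1]:
        return random.randrange(2)
    return 0 if Q[s][0] > Q[s][1] else 1

def step(s, a):
    s2 = max(0, min(N - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == N - 1 else 0.0), s2 == N - 1

for _ in range(500):
    s, done = 0, False
    while not done:
        # Epsilon-greedy action selection, then the Q-learning update:
        # move Q(s, a) toward r + gamma * max_a' Q(s', a').
        a = random.randrange(2) if random.random() < EPS else greedy(s)
        s2, r, done = step(s, a)
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])
        s = s2

print([greedy(s) for s in range(N - 1)])  # learned policy: always move right
```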
RLHF (reinforcement learning from human feedback): uses preference data to train a reward model and optimize the policy.
Value function: the expected cumulative reward from a state or state-action pair.
Policy gradient: optimizing policies directly via gradient ascent on expected reward.
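A minimal sketch of this idea (REINFORCE on a hypothetical two-armed bandit, using the score-function estimate grad log π(a) · r; arm payoffs here are made up for illustration):

```python
import math
import random

# REINFORCE on a two-armed bandit: arm 1 pays 1.0, arm 0 pays 0.2.
# The policy is a softmax over two logits; each sampled action nudges
# the logits along (indicator(a) - p) * reward, the score-function
# estimate of the gradient of expected reward.
random.seed(0)
theta = [0.0, 0.0]  # policy logits
LR = 0.1

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

for _ in range(2000):
    p = softmax(theta)
    a = 0 if random.random() < p[0] else 1
    r = 1.0 if a == 1 else 0.2
    for i in range(2):
        theta[i] += LR * r * ((1.0 if i == a else 0.0) - p[i])

print(softmax(theta))  # probability mass concentrates on the better arm
```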
Inverse reinforcement learning: inferring a reward function from observed behavior.
Reward model: a model trained to predict human preferences (or utility) for candidate outputs; used in RLHF-style pipelines.
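Reward models are commonly trained on pairwise comparisons with a Bradley-Terry-style loss, −log σ(r_chosen − r_rejected). A minimal sketch, where the scalar scores stand in for a reward model's outputs:

```python
import math

# Pairwise preference loss (Bradley-Terry form) for reward-model training:
# loss = -log sigmoid(r_chosen - r_rejected). The scores below are
# hypothetical scalar outputs for a preferred and a dispreferred response.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(r_chosen, r_rejected):
    return -math.log(sigmoid(r_chosen - r_rejected))

# The loss shrinks as the model scores the preferred output higher,
# pushing the reward model to rank outputs the way annotators did.
print(preference_loss(2.0, 0.0))  # small: preferred output scored higher
print(preference_loss(0.0, 2.0))  # large: preference order violated
```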