Reward Hacking in Reinforcement Learning

Reward hacking occurs when a reinforcement learning (RL) agent exploits flaws or ambiguities in the reward function to achieve high rewards without genuinely learning or completing the intended task. Reward hacking exists because RL environments are often imperfect and accurately specifying a reward function is fundamentally challenging. With language models generalizing to a broad spectrum of tasks and RLHF becoming a de facto method for alignment training, reward hacking in the RL training of language models has become a critical practical challenge. Instances where the model learns to modify unit tests to pass coding tasks, or where responses contain biases that mimic a user’s preference, are concerning and are likely one of the major blockers for real-world deployment of more autonomous use cases of AI models. ...

Date: November 28, 2024 | Estimated Reading Time: 37 min | Author: Lilian Weng

The Transformer Family Version 2.0

Many new Transformer architecture improvements have been proposed since my last post on “The Transformer Family” about three years ago. Here I did a big refactoring and enrichment of that 2020 post: I restructured the hierarchy of sections and improved many sections with more recent papers. Version 2.0 is a superset of the old version, about twice the length.

Notations
- $d$: The model size / hidden state dimension / positional encoding size.
- $h$: The number of heads in a multi-head attention layer.
- $L$: The segment length of the input sequence.
- $N$: The total number of attention layers in the model; not considering MoE.
- $\mathbf{X} \in \mathbb{R}^{L \times d}$: The input sequence, where each element has been mapped into an embedding vector of shape $d$, same as the model size.
- $\mathbf{W}^k \in \mathbb{R}^{d \times d_k}$: The key weight matrix.
- $\mathbf{W}^q \in \mathbb{R}^{d \times d_k}$: The query weight matrix.
- $\mathbf{W}^v \in \mathbb{R}^{d \times d_v}$: The value weight matrix. Often we have $d_k = d_v = d$.
- $\mathbf{W}^k_i, \mathbf{W}^q_i \in \mathbb{R}^{d \times d_k/h}$; $\mathbf{W}^v_i \in \mathbb{R}^{d \times d_v/h}$: The weight matrices per head.
- $\mathbf{W}^o \in \mathbb{R}^{d_v \times d}$: The output weight matrix.
- $\mathbf{Q} = \mathbf{X}\mathbf{W}^q \in \mathbb{R}^{L \times d_k}$: The query embedding inputs.
- $\mathbf{K} = \mathbf{X}\mathbf{W}^k \in \mathbb{R}^{L \times d_k}$: The key embedding inputs.
- $\mathbf{V} = \mathbf{X}\mathbf{W}^v \in \mathbb{R}^{L \times d_v}$: The value embedding inputs.
- $\mathbf{q}_i, \mathbf{k}_i \in \mathbb{R}^{d_k}$, $\mathbf{v}_i \in \mathbb{R}^{d_v}$: Row vectors in the query, key and value matrices, $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$.
- $S_i$: A collection of key positions for the $i$-th query $\mathbf{q}_i$ to attend to.
- $\mathbf{A} \in \mathbb{R}^{L \times L}$: The self-attention matrix between an input sequence of length $L$ and itself; $\mathbf{A} = \text{softmax}(\mathbf{Q}\mathbf{K}^\top / \sqrt{d_k})$.
- $a_{ij} \in \mathbf{A}$: The scalar attention score between query $\mathbf{q}_i$ and key $\mathbf{k}_j$.
- $\mathbf{P} \in \mathbb{R}^{L \times d}$: The position encoding matrix, where the $i$-th row $\mathbf{p}_i$ is the positional encoding for input $\mathbf{x}_i$.

Transformer Basics
The Transformer (which will be referred to as the “vanilla Transformer” to distinguish it from other enhanced versions; Vaswani et al., 2017) model has an encoder-decoder architecture, as commonly used in many NMT models. Later, simplified Transformers were shown to achieve great performance in language modeling tasks, such as the encoder-only BERT or the decoder-only GPT. ...
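To make the notation concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention following $\mathbf{A} = \text{softmax}(\mathbf{Q}\mathbf{K}^\top/\sqrt{d_k})$; the shapes and variable names mirror the table above, while the toy dimensions in the usage example are purely illustrative.

```python
import numpy as np

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    """Single-head attention following the notation above.

    X:  (L, d)    input embeddings
    Wq: (d, d_k)  query projection W^q
    Wk: (d, d_k)  key projection W^k
    Wv: (d, d_v)  value projection W^v
    """
    Q = X @ Wq                                   # (L, d_k) query embeddings
    K = X @ Wk                                   # (L, d_k) key embeddings
    V = X @ Wv                                   # (L, d_v) value embeddings
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (L, L) scaled dot products
    scores -= scores.max(axis=-1, keepdims=True) # for numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)           # row-wise softmax -> attention matrix A
    return A @ V                                 # (L, d_v) attended values

# Toy usage with illustrative sizes: L=4 tokens, d=8, d_k=d_v=8.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(scaled_dot_product_attention(X, Wq, Wk, Wv).shape)  # (4, 8)
```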

Date: January 27, 2023 | Estimated Reading Time: 46 min | Author: Lilian Weng

Neural Architecture Search

Although most popular and successful model architectures are designed by human experts, that doesn’t mean we have explored the entire network architecture space and settled on the best option. We would have a better chance of finding the optimal solution if we adopted a systematic and automatic way of learning high-performance model architectures. ...

Date: August 6, 2020 | Estimated Reading Time: 32 min | Author: Lilian Weng

Exploration Strategies in Deep Reinforcement Learning

[Updated on 2020-06-17: Add “exploration via disagreement” in the “Forward Dynamics” section.] Exploitation versus exploration is a critical topic in Reinforcement Learning. We’d like the RL agent to find the best solution as fast as possible. However, committing to solutions too quickly without enough exploration is risky, as it could lead to local minima or total failure. Modern RL algorithms that optimize for the best returns can achieve good exploitation quite efficiently, while exploration remains more of an open topic. ...

Date: June 7, 2020 | Estimated Reading Time: 36 min | Author: Lilian Weng

The Transformer Family

[Updated on 2023-01-27: After almost three years, I did a big refactoring update of this post to incorporate a bunch of new Transformer models since 2020. The enhanced version of this post is here: The Transformer Family Version 2.0. Please refer to that post on this topic.] ...

Date: April 7, 2020 | Estimated Reading Time: 25 min | Author: Lilian Weng

Curriculum for Reinforcement Learning

[Updated on 2020-02-03: mentioning PCG in the “Task-Specific Curriculum” section.] [Updated on 2020-02-04: Add a new “curriculum through distillation” section.] ...

Date: January 29, 2020 | Estimated Reading Time: 24 min | Author: Lilian Weng

Self-Supervised Representation Learning

[Updated on 2020-01-09: add a new section on Contrastive Predictive Coding.] [Updated on 2020-04-13: add a “Momentum Contrast” section on MoCo, SimCLR and CURL.] [Updated on 2020-07-08: add a “Bisimulation” section on DeepMDP and DBC.] [Updated on 2020-09-12: add MoCo V2 and BYOL in the “Momentum Contrast” section.] [Updated on 2021-05-31: remove the “Momentum Contrast” section and add a pointer to a full post on “Contrastive Representation Learning”.] ...

Date: November 10, 2019 | Estimated Reading Time: 38 min | Author: Lilian Weng

Evolution Strategies

Stochastic gradient descent is a universal choice for optimizing deep learning models. However, it is not the only option. With black-box optimization algorithms, you can evaluate a target function $f(x): \mathbb{R}^n \to \mathbb{R}$ even when you don’t know the precise analytic form of $f(x)$ and thus cannot compute its gradients or Hessian matrix. Examples of black-box optimization methods include Simulated Annealing, Hill Climbing and the Nelder-Mead method. ...
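To make the black-box setting concrete, here is a minimal sketch of a simple evolution-strategies-style optimizer that relies only on evaluations of $f(x)$, never on its gradients: it perturbs the current parameters with Gaussian noise and moves toward the perturbations that scored well. The quadratic toy objective, population size and learning rate are illustrative assumptions, not values from the post.

```python
import numpy as np

def evolution_strategy(f, x0, sigma=0.1, lr=0.02, population=50, iterations=300):
    """Basic ES: estimate a search direction purely from function evaluations."""
    x = np.array(x0, dtype=float)
    rng = np.random.default_rng(0)
    for _ in range(iterations):
        noise = rng.normal(size=(population, x.size))          # Gaussian perturbations
        rewards = np.array([f(x + sigma * n) for n in noise])  # evaluate each candidate
        rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # normalize fitness
        # Weight each perturbation by its (normalized) reward and step in that direction.
        x += lr / (population * sigma) * (noise.T @ rewards)
    return x

# Toy usage: maximize f(x) = -||x - 3||^2, whose optimum is at x = [3, 3, 3].
f = lambda x: -np.sum((x - 3.0) ** 2)
print(evolution_strategy(f, x0=np.zeros(3)))
```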

Date: September 5, 2019 | Estimated Reading Time: 22 min | Author: Lilian Weng

Meta Reinforcement Learning

In my earlier post on meta-learning, the problem is mainly defined in the context of few-shot classification. Here I would like to dig more into the cases where we try to “meta-learn” Reinforcement Learning (RL) tasks by developing an agent that can solve unseen tasks quickly and efficiently. ...

Date: June 23, 2019 | Estimated Reading Time: 22 min | Author: Lilian Weng

Domain Randomization for Sim2Real Transfer

In Robotics, one of the hardest problems is how to make your model transfer to the real world. Due to the sample inefficiency of deep RL algorithms and the cost of data collection on real robots, we often need to train models in a simulator, which theoretically provides an infinite amount of data. However, the reality gap between the simulator and the physical world often leads to failure when working with physical robots. The gap stems from inconsistencies in physical parameters (e.g. friction, kp, damping, mass, density) and, more critically, from incorrect physical modeling (e.g. collisions between soft surfaces). ...
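As a rough illustration of how domain randomization tackles this, here is a minimal sketch that re-samples those physical parameters before every training episode so the policy must cope with a whole distribution of dynamics; the parameter ranges and the `sim.set_parameters` / `sim.step` interface are hypothetical, not taken from any specific simulator.

```python
import random

# Hypothetical uniform randomization ranges for a few physical parameters;
# the exact bounds and the simulator API below are illustrative assumptions.
PARAM_RANGES = {
    "friction": (0.5, 1.5),
    "kp":       (10.0, 100.0),   # position-control gain
    "damping":  (0.1, 2.0),
    "mass":     (0.8, 1.2),      # multiplier on the nominal link mass
}

def sample_randomized_params(ranges=PARAM_RANGES):
    """Draw one set of physics parameters uniformly from each range."""
    return {name: random.uniform(lo, hi) for name, (lo, hi) in ranges.items()}

def run_episode(sim, policy):
    """Re-randomize the simulator before each episode, then roll out the policy."""
    sim.set_parameters(**sample_randomized_params())  # hypothetical simulator API
    obs = sim.reset()
    done, total_reward = False, 0.0
    while not done:
        obs, reward, done = sim.step(policy(obs))      # hypothetical step signature
        total_reward += reward
    return total_reward
```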

Date: May 5, 2019 | Estimated Reading Time: 15 min | Author: Lilian Weng

Implementing Deep Reinforcement Learning Models with Tensorflow + OpenAI Gym

The full implementation is available in lilianweng/deep-reinforcement-learning-gym. In the previous two posts, I have introduced the algorithms of many deep reinforcement learning models. Now it is time to get our hands dirty and practice how to implement the models in the wild. The implementation will be built with TensorFlow and the OpenAI Gym environment. The full version of the code in this tutorial is available in lilianweng/deep-reinforcement-learning-gym. ...
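For a flavor of the environment interface the implementation builds on, here is a minimal random-agent rollout loop using the classic OpenAI Gym API of that era (`reset()` returning only the observation and `step()` returning a 4-tuple); note that newer Gymnasium releases changed these signatures, and the random policy is just a placeholder.

```python
import gym

# Classic Gym interaction loop (pre-Gymnasium API, matching the era of this post).
env = gym.make("CartPole-v0")

for episode in range(5):
    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()          # random policy as a placeholder
        obs, reward, done, info = env.step(action)  # 4-tuple in the old API
        total_reward += reward
    print(f"episode {episode}: return = {total_reward}")

env.close()
```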

Date: May 5, 2018 | Estimated Reading Time: 13 min | Author: Lilian Weng

Policy Gradient Algorithms

[Updated on 2018-06-30: add two new policy gradient methods, SAC and D4PG.] [Updated on 2018-09-30: add a new policy gradient method, TD3.] [Updated on 2019-02-09: add SAC with automatically adjusted temperature.] [Updated on 2019-06-26: Thanks to Chanseok, we have a version of this post in Korean.] [Updated on 2019-09-12: add a new policy gradient method, SVPG.] [Updated on 2019-12-22: add a new policy gradient method, IMPALA.] [Updated on 2020-10-15: add a new policy gradient method, PPG, & some new discussion in PPO.] [Updated on 2021-09-19: Thanks to Wenhao & 爱吃猫的鱼, we have this post in Chinese1 & Chinese2.] ...

Date: April 8, 2018 | Estimated Reading Time: 52 min | Author: Lilian Weng

A (Long) Peek into Reinforcement Learning

[Updated on 2020-09-03: Updated the algorithm of SARSA and Q-learning so that the difference is more pronounced.] [Updated on 2021-09-19: Thanks to 爱吃猫的鱼, we have this post in Chinese.] ...

Date: February 19, 2018 | Estimated Reading Time: 31 min | Author: Lilian Weng

The Multi-Armed Bandit Problem and Its Solutions

The algorithms are implemented for the Bernoulli bandit in lilianweng/multi-armed-bandit. Exploitation vs Exploration: the exploration vs exploitation dilemma exists in many aspects of our life. Say your favorite restaurant is right around the corner. If you go there every day, you would be confident of what you will get, but you would miss the chance of discovering an even better option. If you try new places all the time, very likely you will have to eat unpleasant food from time to time. Similarly, online advertisers try to balance between the ads known to be most attractive and new ads that might be even more successful. ...
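As a minimal illustration of this trade-off on a Bernoulli bandit, here is a sketch of an ε-greedy agent: with probability ε it explores a random arm, otherwise it exploits the arm with the best empirical estimate. The arm probabilities and ε value below are illustrative; the full set of algorithms lives in the linked repo.

```python
import random

def epsilon_greedy_bandit(true_probs, epsilon=0.1, steps=10_000):
    """Epsilon-greedy on a Bernoulli bandit with the given arm probabilities."""
    n_arms = len(true_probs)
    counts = [0] * n_arms        # number of pulls per arm
    estimates = [0.0] * n_arms   # empirical mean reward per arm
    total_reward = 0
    for _ in range(steps):
        if random.random() < epsilon:
            arm = random.randrange(n_arms)                         # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])   # exploit
        reward = 1 if random.random() < true_probs[arm] else 0     # Bernoulli draw
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean update
        total_reward += reward
    return estimates, total_reward

# Toy usage with illustrative arm probabilities.
estimates, total = epsilon_greedy_bandit([0.3, 0.5, 0.8])
print(estimates, total)
```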

Date: January 23, 2018 | Estimated Reading Time: 10 min | Author: Lilian Weng