Reinforcement. Rewards. Curiosity.

In one of our prior discussions, we examined the first step in transforming a raw large language model into a dialogue agent such as ChatGPT. That step begins with emulating the behavior of a reliable expert, captured in sources like dialogue passages in books or high-quality data from Q&A websites. However, the progression to a comprehensive conversational agent like ChatGPT requires several additional stages. Before delving into these, it is worth covering some basics and examining some open questions in the field of artificial intelligence.

Have you ever played the simple video game Pong? In this game, players have just two options: moving the paddle up or down. Now, consider the task of building an agent capable of playing Pong at the level of a 14-year-old. This objective can be approached through two distinct methods.

In the first method, an expert’s behavior, say that of a skilled teenager, is recorded while they play the game. For every input frame, represented as an image of the game, the associated action (either up or down) is recorded. A substantial dataset is then constructed, comprising numerous images paired with labels indicating the action taken. Learning to play Pong becomes a matter of fitting this dataset, effectively establishing associations between input frames and actions. Termed supervised learning, this approach relies heavily on the quality and volume of the labels provided.
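To make this concrete, here is a minimal, purely illustrative sketch of the supervised route in Python (PyTorch), assuming the frames have already been preprocessed into small grayscale tensors and paired with the expert’s up/down choices; all names and shapes here are invented for illustration, not a prescription.

```python
# Illustrative sketch: imitate the expert by classifying each frame as "up" or "down".
import torch
import torch.nn as nn

class PongPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(80 * 80, 256),  # assumes 80x80 grayscale frames
            nn.ReLU(),
            nn.Linear(256, 2),        # logits for "up" and "down"
        )

    def forward(self, frames):
        return self.net(frames)

policy = PongPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# expert_frames: (N, 1, 80, 80) tensor, expert_actions: (N,) tensor of 0/1 labels
def train_step(expert_frames, expert_actions):
    logits = policy(expert_frames)
    loss = loss_fn(logits, expert_actions)  # push the policy toward the expert's choices
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```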

The second approach, known as reinforcement learning, hinges on some prerequisites. First, it assumes the existence of a policy, a function designed to select game actions based on input images. This policy (initially random) serves as the guide for choosing an action, up or down, in response to each input frame. Additionally, this approach assumes the game to be episodic, culminating in an endpoint with an associated outcome, such as a final score (e.g., 10–2). In this second approach, the agent is not explicitly instructed on how to maneuver; rather, it explores various actions, learning through trial and error. It simply finds out, at the end, whether it won the game.
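For contrast, here is a rough sketch of the trial-and-error route, in the spirit of a REINFORCE-style policy gradient. The env object and its reset()/step() interface are assumptions made for illustration, and the only feedback is the reward handed out at the very end of the episode.

```python
# Illustrative sketch: sample actions from the (initially random) policy,
# and credit the whole episode with the final win/loss reward.
import torch

def run_episode(env, policy):
    frame = env.reset()                              # assumed to return a frame tensor
    log_probs, done, final_reward = [], False, 0.0
    while not done:
        logits = policy(frame.unsqueeze(0))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()                       # explore: up or down
        log_probs.append(dist.log_prob(action))
        frame, reward, done = env.step(action.item())  # hypothetical interface
        final_reward = reward                        # non-zero only at the end (+1 win, -1 loss)
    # Every action taken in the episode shares the same terminal outcome.
    loss = -final_reward * torch.stack(log_probs).sum()
    return loss
```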

The second approach requires numerous trial episodes for the agent to learn, and it takes time before meaningful progress is observed. However, over time, this method holds the potential to surpass the performance of the expert (in this case, the skilled teenage player) and achieve exceptional feats without any labeled training data.

Despite its potential, the second approach has challenges beyond the sheer volume of trials required. Consider a situation where a high-performing player, after demonstrating impressive gameplay, makes a couple of suboptimal moves at the end, leading to defeat and a substantial negative reward. That single outcome then colors all the good effort earlier in the episode!

Let’s understand this more deeply. Have you ever been in a situation where, after putting years of blood, sweat and tears into a project, something trivial goes wrong at the end and makes you look bad? Have you ever been in the opposite situation, where you were praised heavily for something trivial that took you barely an hour to accomplish? This discrepancy arises from the intricate nature of assessing effort versus outcomes. In reinforcement learning, this is called the credit assignment problem, arguably the most important practical problem in the field.
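A tiny numerical sketch makes the problem concrete: with only a terminal reward, every action in the episode inherits a (discounted) share of the same outcome, the brilliant moves and the blunders alike. The numbers below are invented purely for illustration.

```python
# Illustrative sketch of credit assignment with a single terminal reward.
def discounted_returns(rewards, gamma=0.99):
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

# 50 frames of excellent play, two blunders, then a -1.0 loss at the very end:
rewards = [0.0] * 52 + [-1.0]
print(discounted_returns(rewards)[:3])  # even the earliest, good actions inherit the blame
```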

Take a real-world example. In management circles, you may have heard the adage “reward the effort, not the outcome” as the way to motivate employees correctly. Yet this is much easier said than done. Rewarding effort demands an acute attention to detail, and a level of expertise, that often exceeds the discernment most of us possess. This is true for humans, and hence true for the algorithms we design. You can find some very funny videos where robots have learned to game the reward system and end up doing something ridiculous; to be fair, humans do that too.

To tackle the credit assignment problem in the short term, practitioners employ a technique known as reward shaping. This method refines the reward function by leveraging a deeper understanding of the game’s dynamics, encoding the notion of effort more completely in the reward itself. A tangible illustration can be found in our Pong example, where the margin of victory or defeat is rewarded alongside the act of winning itself.
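Here is a hedged sketch of what such shaping could look like for Pong, assuming we can read both scores at the end of a game; the weighting is arbitrary and chosen only for illustration.

```python
# Illustrative shaped reward: credit the margin of victory, not just the win/loss.
def shaped_reward(my_score, opponent_score, margin_weight=0.1):
    outcome = 1.0 if my_score > opponent_score else -1.0
    margin = my_score - opponent_score
    return outcome + margin_weight * margin

print(shaped_reward(10, 2))   # 1.8: a convincing win
print(shaped_reward(10, 9))   # 1.1: a narrow win
print(shaped_reward(3, 10))   # -1.7: a heavy loss
```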

In a similar spirit, hiring committees in technical interviews often emphasize overlapping feedback while discounting outlying opinions. Nevertheless, reward shaping, while pragmatic, remains a convoluted hack. A hand-crafted reward structure tends to evolve more slowly than the game’s dynamics (with many skilled players involved), the surrounding context, or the agent’s capacity to exploit loopholes in the reward.

Ample scientific literature and years of research offer progressively refined strategies for tackling this challenge. Below, I will highlight the essence of these approaches.

Personally, I despise self-help books and their inherent emotional manipulation. Yet one sub-category of those books stands out for me: the “why” books.

Viktor Frankl, drawing from his experience of surviving the concentration camps during World War II, eloquently captures the essence of resilience and purpose in his seminal work “Man’s Search for Meaning.” Frankl aptly quotes Nietzsche: “He who has a why to live can bear almost any how.” This idea finds resonance in contemporary literature as well; works like “Start with Why” and “The Subtle Art of Not Giving a F**k” are modern interpretations of it.

Our pursuits are often geared toward external rewards: greater recognition, higher grades, increased wealth. Yet the failure to achieve these, or an unwarranted sense of entitlement, is a source of much of our suffering. What keeps us going is intrinsic motivation and a sense of personal reward: the satisfaction of a job well done, the pursuit of knowledge (the reason you are reading this article), incremental advancements in inner tranquility, confronting challenges head-on without yielding, staying authentic during difficult conversations, refusing to compromise your ideals for an easy win, and crushing that daily milestone, no matter how small. Likewise, some of the best reinforcement learning algorithms are designed to model intrinsic motivation, turning failures into experience and driving the agent by nothing more than curiosity, the urge to reduce surprise about its environment.
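For the curious, here is a rough sketch of how curiosity can be modeled as an intrinsic reward, in the spirit of prediction-error methods such as the Intrinsic Curiosity Module: the agent learns a forward model of its world and is rewarded in proportion to how surprised that model is. All shapes and names below are illustrative assumptions.

```python
# Illustrative sketch: prediction error of a learned forward model as an intrinsic reward.
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    def __init__(self, state_dim=64, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128),
            nn.ReLU(),
            nn.Linear(128, state_dim),  # predict the next state's features
        )

    def forward(self, state, action_onehot):
        return self.net(torch.cat([state, action_onehot], dim=-1))

def intrinsic_reward(model, state, action_onehot, next_state):
    predicted = model(state, action_onehot)
    # The bigger the surprise (prediction error), the bigger the reward.
    return ((predicted - next_state) ** 2).mean().item()

model = ForwardModel()
s, a, s_next = torch.randn(64), torch.tensor([1.0, 0.0]), torch.randn(64)
print(intrinsic_reward(model, s, a, s_next))  # higher when the world defies the model's expectations
```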

Hence, no matter what society’s reward-meter says, our personal compass can always reward the efforts directed at our personal “why.” That lets you “be yourself,” every single day.

Now that we understand the basic challenges behind modeling rewards, in the next essay we will look at how ChatGPT attempts to venture beyond the experts that helped bootstrap it.