Reinforcement Learning
Today, reinforcement learning. The University of Houston presents this series about the machines that make our civilization run, and the people whose ingenuity created them.
______________________
How do we learn new skills? We try, fail, and adjust. We remember what works, and we try it again. We avoid repeating what didn’t work. We learn from the consequences of our actions. Psychologists call this reinforcement learning. And this idea has been central to building machines that can teach themselves.
The first reinforcement learning algorithms learned slowly from experience. They tried random actions and got rewarded only when they succeeded. But this approach has a fatal flaw: when success finally comes, it is hard to say which actions deserve the credit.
In 1951, Marvin Minsky built SNARC, a maze-solving machine that learned through trial and error. It worked, but barely. The problem was timing. Consider teaching a computer to play chess. If the computer wins after 50 moves, which moves decided the game? Crediting moves close to the game’s end ignores the early moves that may have paved the way to victory. Crediting all moves creates too much noise to learn anything useful.

Enter Richard Sutton, a psychology Ph.D. student who noticed something crucial about animal behavior. Animals don't wait for final outcomes. They get excited when they expect rewards. Pavlov’s dogs salivated when they heard a dinner bell, not just when eating. Sutton asked if machines could learn the same way.
In his 1984 dissertation, Sutton proposed temporal difference learning. The idea was simple: Don't wait for the final outcome, but instead learn from your own changing expectations. At every moment, predict how much future reward you will receive. When something happens that makes that prediction jump, say a chess position goes from "50% chance of winning" to "80% chance of winning", use that jump as the learning signal. Thus, you do not need to wait until the chess match ends to evaluate a move.
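Sutton’s idea can be sketched in a few lines of code. Below is a minimal, illustrative example of temporal difference learning on a toy problem that is not from the episode: a five-state random walk where reaching the right end counts as a win. The names and parameters (ALPHA, the number of episodes, and so on) are assumptions chosen for the sketch. The key line is the one computing the TD error: the learning signal is the jump between successive predictions, not the final outcome.

```python
import random

ALPHA = 0.1      # learning rate (illustrative choice)
N_STATES = 5     # states 0..4; start in the middle
TERMINALS = {0: 0.0, 4: 1.0}   # terminal state -> reward on arrival

# Initial guess: every non-terminal state has a 50% chance of winning.
# Terminal states simply hold their reward.
values = [0.5] * N_STATES
for s, r in TERMINALS.items():
    values[s] = r

random.seed(0)
for episode in range(1000):
    state = 2  # start in the middle of the walk
    while state not in TERMINALS:
        next_state = state + random.choice([-1, 1])
        # Our prediction after the move (terminals carry their reward).
        target = values[next_state]
        # The TD error: how much did our own prediction just jump?
        td_error = target - values[state]
        # Learn from that jump immediately -- no waiting for the end.
        values[state] += ALPHA * td_error
        state = next_state

print([round(v, 2) for v in values])
```

Running this drives the middle estimates toward their true win probabilities (0.25, 0.5, 0.75 for states 1, 2, 3), even though each update looks only one step ahead.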
But skeptics asked: Why should learning from your own flawed predictions work better than learning from actual outcomes? The answer came in the ‘90s, when Gerald Tesauro built TD-Gammon using Sutton's method. This backgammon program learned purely through self-play and reached a level that rivaled the world’s best players.

Marvin Minsky's Stochastic Neural Analog Reinforcement Calculator (SNARC).
Photo from Wikipedia
But the deepest validation came from biology itself. Researchers discovered that cells in monkey brains work as Sutton suggested. These cells signal not when animals receive rewards, but when their prediction of a reward suddenly jumps up. Evolution had solved the credit assignment problem millions of years ago.
Today's self-driving cars, game-playing AIs, and recommendation systems all trace their lineage to Sutton's insight. And that insight was powerful: To learn, you do not need to know whether you’ll succeed. You just need to know that you’re heading in the right direction.
This is Krešo Josić at the University of Houston, where we're interested in the way inventive minds work.
(Theme music)
Here is an accessible overview of reinforcement learning without much math or jargon. If you want to see a little more math, you can check out this post.
Reinforcement learning has become extremely important in training large language models. In particular, it is used to train them to be polite (among other things) through human feedback: humans rate different responses, and the model learns from these ratings. You can read more about this type of RL here.
This episode first aired on March 11, 2026.