Q-Learning and the Bellman Equation
Q-learning is an incremental dynamic-programming approach for updating our estimate of the agent's future success. In Q-learning, the agent's experience consists of a sequence of distinct stages, or episodes. You can think of $n$ as the number of times the agent has played the game and $t$ as the number of steps the agent has taken within a single game.
At the $t^{th}$ time step of the $n^{th}$ episode, the agent:
- Observes its current state $s_t$
- Selects and performs an action $a_t$
- Observes the subsequent state $s_{t+1}$
- Adjusts its $Q_{n-1}$ value using the learning factor $\alpha_{n}$ according to the Bellman equation below (a short code sketch of the full loop follows the definitions at the end of this section):

$$Q_n(s_t, a_t) = (1 - \alpha_n)\, Q_{n-1}(s_t, a_t) + \alpha_n \left[ r_t + \gamma\, V_{n-1}(s_{t+1}) \right]$$
- $V_{n-1}(s_{t+1})$ is the best value the agent can achieve from state $s_{t+1}$ onward, i.e. $V_{n-1}(s_{t+1}) = \max_{a} Q_{n-1}(s_{t+1}, a)$; in other words, the expected future reward from the next state.
- $\alpha_n$ is the learning factor. It determines how much weight the newly observed outcome gets relative to the agent's existing estimate. Typically $\alpha_n$ is a function of $n$: during the first few episodes $\alpha_n$ is close to $1$, and it is gradually decreased toward $0$.
- If $\alpha_n = 0$, the agent learns nothing new and simply exploits its prior information.
- If $\alpha_n = 1$, the agent discards its prior estimate and relies entirely on the new experience.
- $r_t$: Immediate reward received at time $t$.
- $\gamma$: Discount factor, which weighs the importance of future rewards. A smaller value of $\gamma$ means future rewards are worth less, so the agent cares more about the immediate reward.
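To make the procedure concrete, here is a minimal Python sketch of tabular Q-learning. The five-state corridor environment, the purely random action selection, and the $\alpha_n = 1/n$ schedule are illustrative assumptions, not part of the text above; only the update line implements the Bellman equation from this section.

```python
import random

# --- Hypothetical toy problem (an assumption, not from the text): ----------
# a corridor of states 0..4; action 0 moves left, action 1 moves right.
# Reaching state 4 ends the episode with reward +1; every other step gives 0.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4
GAMMA = 0.9                                   # discount factor

def step(s, a):
    """One environment transition: returns (next_state, reward, done)."""
    s_next = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    reward = 1.0 if s_next == GOAL else 0.0
    return s_next, reward, s_next == GOAL

# Q-table Q[s][a], initialised to zero (Q_0 in the text's notation).
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

num_episodes = 1000
for n in range(1, num_episodes + 1):          # n-th episode ("game")
    alpha_n = 1.0 / n                         # learning factor: ~1 early, decays toward 0
    s_t, done = 0, False
    while not done:                           # t-th step of episode n
        a_t = random.randrange(N_ACTIONS)     # pick a_t (random here; epsilon-greedy is usual)
        s_next, r_t, done = step(s_t, a_t)    # observe s_{t+1} and the immediate reward r_t

        # V_{n-1}(s_{t+1}): best value achievable from the next state onward
        # (zero once the episode has ended).
        V_next = 0.0 if done else max(Q[s_next])

        # Bellman update in the same form as the equation above:
        # Q_n = (1 - alpha_n) * Q_{n-1} + alpha_n * (r_t + gamma * V_{n-1}(s_{t+1}))
        Q[s_t][a_t] = (1 - alpha_n) * Q[s_t][a_t] + alpha_n * (r_t + GAMMA * V_next)

        s_t = s_next

# After enough episodes, "right" should score higher than "left" in each non-goal state.
for s in range(GOAL):
    print(f"state {s}: Q(left)={Q[s][0]:.2f}  Q(right)={Q[s][1]:.2f}")
```

The update is written as a convex combination to mirror the equation; it is algebraically identical to the increment form $Q \leftarrow Q + \alpha_n \left( r_t + \gamma V_{n-1}(s_{t+1}) - Q \right)$ that many implementations use.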