Q-learning is a dynamic programming approach for updating the agent's estimate of its future success. In Q-learning, the agent's experience consists of a sequence of distinct stages, or episodes. You can think of $n$ as the number of times the agent has played the game and $t$ as the number of steps the agent has taken within a single game.

At the $t^{th}$ time step of the $n^{th}$ episode, the agent:

  • Observes its current state $s_t$
  • Selects and performs an action $a_t$
  • Observes the subsequent state $s_{t+1}$
  • Adjusts its $Q_{n-1}$ value for the pair $(s_t, a_t)$ using the learning factor $\alpha_{n}$, according to the update rule below, which is derived from the Bellman equation (a code sketch follows this list):
\[Q_{n}(s_t,a_t) = \begin{cases} (1-\alpha_{n})Q_{n-1}(s_t,a_t) + \alpha_{n}[{r_t} + \gamma V_{n-1}(s_{t+1})], & \text{if not a terminal state} \\ Q_{n-1}(s_t,a_t), & \text{if a terminal state} \end{cases}\]
  • $V_{n-1}(s_{t+1})$ is the best value that the agent can achieve from state $s_{t+1}$ onward; in other words, its expected future reward:
\[V_{n-1}(s_{t+1}) = \max_{a'} {Q_{n-1}(s_{t+1},{a'})}\]
  • $\alpha_n$ is the learning factor. It determines how strongly newly observed information overrides the agent's prior estimates. Typically $\alpha_n$ is a function of $n$: during the first few episodes it is close to $1$, and it is gradually decreased toward $0$.
    • If $\alpha_n = 0$, the agent learns nothing new and relies entirely on its prior estimates.
    • If $\alpha_n = 1$, the agent discards its prior estimates and fully adopts the newly observed information.
  • $r_t$: Immediate reward received at time $t$.
  • $\gamma$: Discount factor, which weighs the importance of future rewards. A smaller value of $\gamma$ means that future rewards are worth less, so the agent cares more about the immediate reward.
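
Putting the pieces together, below is a minimal sketch of this tabular update loop in Python. The toy corridor environment, the action set, the purely random action selection, and the $\alpha_n = 1/(1+n)$ schedule are illustrative assumptions added here, not details from the description above.

```python
import random
from collections import defaultdict

class CorridorEnv:
    """Hypothetical toy task: states 0..4, actions 0 (left) / 1 (right).
    Reaching state 4 gives reward 1 and ends the episode."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state = max(0, self.state - 1) if action == 0 else self.state + 1
        terminal = self.state == 4
        reward = 1.0 if terminal else 0.0
        return self.state, reward, terminal

GAMMA = 0.9              # discount factor gamma (assumed value)
NUM_EPISODES = 500       # n ranges over episodes
ACTIONS = [0, 1]

Q = defaultdict(float)   # Q[(state, action)], initialized to 0

def alpha(n):
    """Learning factor alpha_n: close to 1 in early episodes, decaying toward 0."""
    return 1.0 / (1.0 + n)

def best_value(state):
    """V(s) = max over a' of Q(s, a'): expected future reward from `state` onward."""
    return max(Q[(state, a)] for a in ACTIONS)

env = CorridorEnv()
for n in range(NUM_EPISODES):
    state = env.reset()
    terminal = False
    while not terminal:
        action = random.choice(ACTIONS)                # action selection kept trivially random here
        next_state, reward, terminal = env.step(action)
        # Non-terminal case of the update rule; terminal states are never
        # updated, so their V stays at the initial value of 0 and the target
        # reduces to the immediate reward on the last step of an episode.
        target = reward + GAMMA * best_value(next_state)
        Q[(state, action)] = (1 - alpha(n)) * Q[(state, action)] + alpha(n) * target
        state = next_state

# At state 0, "right" should typically end up with a higher Q-value than "left".
print(Q[(0, 1)], Q[(0, 0)])
```

Note that the update blends the old estimate $Q_{n-1}(s_t,a_t)$ with the new target $r_t + \gamma V_{n-1}(s_{t+1})$ in exactly the proportions $(1-\alpha_n)$ and $\alpha_n$ from the formula, with $\alpha_n$ shrinking as the episode count grows.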