Learning how to walk through reinforcement learning

 

Hi again! Sorry, I have been swamped with a lot of things lately, including renovating my place, which meant being in and out of hotels for about a month.

So enough of that, and let's dive into reinforcement learning! Recently, I was given the splendid opportunity to take Udacity's course on Deep Reinforcement Learning (RL). But before getting into the "deep" end of RL, I first had to understand what RL actually is.

"Reinforcement learning ( RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward. Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning."
- Wikipedia 

Reinforcement learning is a branch of machine learning in which an agent outputs an action, and the environment in return outputs an observation (the state of the system) plus a reward. The agent's goal is then to choose actions that maximize its cumulative reward.
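To make that loop concrete, here is a minimal sketch in Python. It assumes the classic OpenAI Gym interface (where reset() returns the first observation and step() returns observation, reward, done, info) and a purely random agent; the environment name is just an example and none of this is taken from the course itself.

    import gym  # assumes the classic Gym API (newer Gymnasium versions return slightly different tuples)

    env = gym.make("CartPole-v1")        # any environment works; CartPole is just an example
    state = env.reset()                  # the environment hands the agent its first observation
    total_reward, done = 0.0, False

    while not done:
        action = env.action_space.sample()             # the "agent": here it simply acts at random
        state, reward, done, info = env.step(action)   # the environment returns the new state plus a reward
        total_reward += reward                         # this running total is what a real agent tries to maximize

    print("Return collected this episode:", total_reward)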


So what exactly are rewards? A nice illustration comes from Google DeepMind's work on agents learning how to walk: the agent (the actor in a defined world) receives a reward at each step that tells it how well it is doing at the walking task, and it learns by trying to collect as much of that reward as possible.

So, in a nutshell, here are the terms we need to be familiar with in RL: state, action, reward, and policy, all neatly explained by Deepak:

"At its core, any reinforcement learning task is defined by three things — states, actions and rewards. States are a representation of the current world or environment of the task. Actions are something an RL agent can do to change these states. And rewards are the utility the agent receives for performing the “right” actions. So the states tell the agent what situation it is in currently, and the rewards signal the states that it should be aspiring towards. The aim, then, is to learn a “policy”, something which tells you which action to take from each state so as to try and maximize reward. This broad definition can be used to fit a number of diverse tasks that we perform every day."

 - Deepak 
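As a toy illustration of those four terms (the tiny world, rewards, and numbers here are made up by me, not taken from Deepak's post or the course), a policy can literally be represented as a table mapping each state to an action:

    # States and actions of a made-up two-room world.
    states = ["start", "hallway", "goal"]
    actions = ["left", "right"]

    # Reward for taking an action in a state (illustrative numbers only).
    rewards = {
        ("start", "right"): 0.0,
        ("start", "left"): -1.0,
        ("hallway", "right"): 1.0,   # stepping into the goal pays off
        ("hallway", "left"): -1.0,   # walking backwards is penalized
    }

    # A policy is simply a mapping from every non-terminal state to an action.
    policy = {"start": "right", "hallway": "right"}

    # Following this policy from "start" collects 0.0 + 1.0 = 1.0 reward in total.
    print(rewards[("start", policy["start"])] + rewards[("hallway", policy["hallway"])])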

State-value and Action-value functions
For each state, the state-value function yields the expected return if the agent starts in that state and then follows the policy for all future time steps.

For each state and action, the action-value function yields the expected return if the agent starts in that state, takes the action, and then follows the policy for all future time steps.
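In the usual RL notation (my transcription, with gamma as the discount factor and G_t the discounted return), those two definitions can be written as:

    % expected return when starting in state s and following policy pi thereafter
    v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right],
    \quad \text{where } G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}

    % expected return when starting in state s, taking action a, then following pi
    q_\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s,\ A_t = a \right]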




Monte Carlo Methods
Imagine a robot traversing a labyrinth. To start with, the agent uses the equiprobable random policy to interact with the environment, i.e. in every state it picks each available action with equal probability.

The fundamental idea behind Monte Carlo methods is that the robot just needs to collect a lot of episodes (complete runs through the labyrinth) to get a grasp of what the system is like.


So in essence, a good way to find the optimal policy is to jot down the state-action pairs visited in each episode, average the returns observed for each pair, and then pick, for each state, the action with the best average return. It is much like the pay-off matrix in game theory. Cool stuff.
In RL, this table of state-action values is apparently called the Q-table.
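Here is a rough sketch of that idea: every-visit Monte Carlo estimation of a Q-table for a tiny corridor world I made up for illustration. The environment, rewards, discount factor, and episode count are my own assumptions, not from the course.

    import random
    from collections import defaultdict

    # A tiny made-up corridor: states 0..3, where state 3 is the goal.
    # Actions: 0 = left, 1 = right. Reaching the goal gives reward +1, every other step 0.
    def step(state, action):
        next_state = max(0, state - 1) if action == 0 else min(3, state + 1)
        reward = 1.0 if next_state == 3 else 0.0
        done = next_state == 3
        return next_state, reward, done

    def run_episode():
        """Follow the equiprobable random policy and record (state, action, reward) tuples."""
        episode, state, done = [], 0, False
        while not done:
            action = random.choice([0, 1])
            next_state, reward, done = step(state, action)
            episode.append((state, action, reward))
            state = next_state
        return episode

    # Every-visit Monte Carlo: average the return observed after each (state, action) pair.
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    gamma = 0.9  # discounting, so actions that reach the goal sooner look better

    for _ in range(5000):
        episode = run_episode()
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            returns_sum[(state, action)] += G
            returns_count[(state, action)] += 1

    # The Q-table: estimated return for each (state, action) pair.
    Q = {sa: returns_sum[sa] / returns_count[sa] for sa in returns_sum}
    action_name = {0: "left", 1: "right"}
    for (state, action), value in sorted(Q.items()):
        print(f"Q(state {state}, {action_name[action]}): {value:.2f}")

Reading the printed table, "right" ends up with a higher estimated value than "left" in every state, which is exactly how one would pick the better policy from the averaged returns.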


The concepts I have explained above are from Udacity's Nanodegree on Deep Reinforcement Learning. 






