Monday, 12 September 2016

Getting AI smarter with Q-learning: a simple first step in Python

Yesterday I found an “old” script I wrote one morning last semester. I remember being a little bored and interested in the concept of Q-learning. That was around the time AlphaGo beat the world champion of Go, and from reading here and there I found out that a bit of Q-learning mixed with deep learning might have been involved.

Indeed, Q-learning is an interesting concept, perhaps even more fascinating than traditional supervised machine learning, since here the machine basically learns from scratch how to perform a certain task in order to optimize future rewards.

If you would like to read a, quote, “Painless Q-learning tutorial”, I suggest the following explanation: A Painless Q-learning Tutorial. That article explains the concept of Q-learning through a simple example and a clear walk-through. After reading it, I decided to put the example into code. The example shows a maze through which the agent should find its way up to the goal state (state 5). Basically, the idea is to train an algorithm to find the optimal path, given an (often random) initial condition, in order to maximize a certain outcome. In this simple example, as you can see from the picture shown in the article, the possible choices are all known and the outcome of each choice is deterministic. A best path exists and can be found easily regardless of the initial condition. Furthermore, the maze is time invariant. These nice theoretical hypotheses are usually not true when dealing with real-world problems, and this makes Q-learning hard to use in practice, even though the concept behind it is relatively simple.

Before taking a look at the code, I suggest reading the article mentioned above, where you will become familiar with the problem tackled below. I'm not going to go deep into explaining what is going on, since the author of that article has already done a pretty good job of it and my work is just a (probably horribly inefficient) translation of the algorithm into Python.

Given an initial condition of, say, state 2, the optimal path is clearly 2 - 3 - 1 - 5. Let's see if the algorithm finds it!
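
In a nutshell, the script does the following (this is a condensed sketch with my own variable names, random seed, and iteration budget, not the exact code the comments below refer to):

```python
import numpy as np

# Reward matrix from the tutorial: -1 marks a missing door,
# 0 an open door, 100 a door leading into the goal state (5).
R = np.array([[-1, -1, -1, -1,  0,  -1],
              [-1, -1, -1,  0, -1, 100],
              [-1, -1, -1,  0, -1,  -1],
              [-1,  0,  0, -1,  0,  -1],
              [ 0, -1, -1,  0, -1, 100],
              [-1,  0, -1, -1,  0, 100]])

gamma = 0.8                 # discount factor used in the tutorial
Q = np.zeros(R.shape)       # Q starts at all zeros
rng = np.random.default_rng(0)

# Training: sample random states and random valid actions, then apply
# the update Q(s, a) = R(s, a) + gamma * max_a' Q(s', a').
for _ in range(1000):
    state = rng.integers(0, 6)
    valid = np.where(R[state] >= 0)[0]     # actions with an open door
    action = int(rng.choice(valid))
    Q[state, action] = R[state, action] + gamma * Q[action].max()

# Testing: start from state 2 and greedily follow the largest Q value.
state, path = 2, [2]
while state != 5:
    state = int(np.argmax(Q[state]))
    path.append(state)
print(path)
```

Note that from state 3 both neighbours 1 and 4 open into the goal through a 100-reward door, so depending on tie-breaking the greedy path can also come out as 2 - 3 - 4 - 5, which is equally good.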

And sure enough, there it is! Bear in mind this is a very simplified problem. At the same time, though, keep in mind that AlphaGo is powered in part by a similar concept. I wonder what other applications will come from Q-learning.

The concept of Q-learning is much broader than what I showed here; nevertheless, I hope my post was an interesting insight.

1. Hey, I am getting an error in the Q-learning formula. The error is "too many indices for array". How do I solve that?

1. Hi, there seems to be an error in the indexing process, try to start debugging there.

2. In my example I have 50000 states .... and I got a memory error.

1. mmm.. not sure I understand. Sorry.

3. Thanks for the nice Python implementation - it works great on new graphs. One minor nitpick: your graph image doesn't correspond to your reward matrix. It shouldn't go from 3 to 5 and back; instead, 3 only goes to points 4 and 1, and 5 loops onto itself for 100 points. Thanks for posting this!

Manuel

1. Hi, thanks for reading! Oh, I didn't notice that! I didn't want to take the picture from the original article so I made one myself in a hurry. Thanks for pointing out the mistake, fixed it! :)

4. Hi Mic, can you tell me why the next action is chosen at random (lines 29-33). I thought in Q learning the next action is chosen based on the highest Q value. Best, Nina

1. Hi Nina, the next action is chosen at random only during the training of the agent in order to explore the environment (and build the Q matrix). In testing (lines 71->89) the next action is chosen according to the highest Q value as you expect.
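
A common way to blend the two regimes is epsilon-greedy selection; a small sketch (the helper name and the epsilon value are illustrative, not part of my script):

```python
import numpy as np

rng = np.random.default_rng()

def choose_action(Q, R, state, epsilon=0.3):
    """With probability epsilon explore a random valid move,
    otherwise exploit the highest Q value among valid moves."""
    valid = np.where(R[state] >= 0)[0]      # moves with an open door
    if rng.random() < epsilon:
        return int(rng.choice(valid))                  # explore
    return int(valid[np.argmax(Q[state, valid])])      # exploit
```

Decaying epsilon over training shifts the agent gradually from exploration toward exploitation.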

5. In the update function, why are we selecting the max Q index based on the randomly selected action placed into what I thought was the state field of Q? Given that Q(state, action) returns the Q-value for that specific pair, why are we inserting the action into the state field (Q[action,])? Shouldn't we insert the state there instead?

6. It's a fantastic example for designing MDP-based algorithms for many randomly generated environments.
May I ask a question? Do you have an example like this for multiple agents, e.g. team-Q (friend-Q) learning for cooperative missions?

1. Thanks! I'm sorry, as of now I haven't got any examples for multi agents.

2. Hello,
how would one adapt a Q-learning algorithm to games like shogi or chess?

7. Hi! Great article combined with the painless Q-learning tutorial! I'm interested in developing a reinforcement learning algorithm for a Flash game. If I'm correct, you don't create any walls or rooms; your agent is just wandering around. So I just need to get data from the Flash game and change the Python algorithm?

1. Hi, as long as you can build the R matrix, you can adapt this example by simply using your own R matrix. However, I think it is going to be a bit slow with larger R matrices.
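
One possible way to build an R matrix from a list of doors between rooms (the helper and its signature are just a sketch, not part of the original script):

```python
import numpy as np

def build_R(edges, n_states, goal, goal_reward=100):
    """Build a reward matrix: -1 = no door, 0 = open door,
    goal_reward on any door leading into the goal state."""
    R = -np.ones((n_states, n_states))
    for a, b in edges:          # doors open in both directions
        R[a, b] = 0
        R[b, a] = 0
    col = R[:, goal]            # reward every door into the goal
    col[col == 0] = goal_reward
    R[goal, goal] = goal_reward # staying at the goal keeps paying off
    return R
```

With the six rooms of the tutorial, `build_R([(0, 4), (4, 3), (4, 5), (3, 1), (3, 2), (1, 5)], 6, 5)` reproduces the tutorial's reward matrix.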

8. Hi,
Thanks for this great article! By the way, how many layers are in the NN you have implemented here? Can we specify the number of hidden layers in the network in reinforcement learning?

10. def sample_next_action(available_actions_range):
... next_action = int(np.random.choice(available_act,1))
File "", line 2
next_action = int(np.random.choice(available_act,1))
^
IndentationError: expected an indented block

any idea why?

18. Thanks! If I understand correctly, the 4th row in the R matrix should be [0, -1, -1, 0, -1, 100].
