Yesterday I found an “old” script I wrote during a morning in the last semester. I remember being a little bored and interested in the concept of **Q-learning**. That was about the time **Alpha-Go** had beaten the world champion of Go and by reading here and there I found out that a bit of Q-learning mixed with deep learning might have been involved.

Indeed Q-learning seems an interesting concept, perhaps even more fascinating than traditional supervised machine learning since in this case the machine is basically learning from scratch how to perform a certain task in oder to optimize future rewards.

If you would like to read a, quote, “Painless Q-learning tutorial”, I suggest you to read the following explanation: A Painless Q-learning Tutorial. In this article the concept of Q-learning is explained through a simple example and a clear walk-through. After having read the article I decided to put into code the example shown. The example shows a maze through which the agent should go and find its way up to the goal stage (stage 5). Basically, the idea is to train an algorithm to find the optimal path, given an (often random) initial condition, in order to maximize a certain outcome. In this simple example, as you can see from the picture shown in the article, the possible choices are all known and the outcome of each choice is deterministic. A best path exists and can be found easily regardless of the initial condition. Furthemore, the maze is time invariant. These nice theoretical hypothesis are usually not true when dealing when real world problems and this makes using Q-learning hard in practice, even though the concept behind it is relatively simple.

Before taking a look at the code, I suggest to read the article mentioned above, where you will get familiar with the problem tackled below. I’m not going deep in explaining what is going on since the author of that article has already done a pretty good job doing it and my work is just a (probably horribly inefficient) translation in Python of the algorithm.

Given an initial condition of, say, state 2, the optimal sequence path is clearly 2 - 3 – 1 - 5. Let’s see if the algorithm finds it!

And sure enough there it is! Bear in mind this is a very simplified problem. At the same time though, keep in mind that Alpha Go is powered in part by a similar concept. I wonder what other application will come from Q-learning.

The concept of Q-learning is more vast than what I showed here, nevertheless I hope my post was an interesting insight.

Hey, I am getting error in Q learning formula.The error is " too many indices for array".How to solve that?

ReplyDeleteHi, there seems to be an error in the indexing process, try to start debugging there.

DeleteIn my example I have 50000 states .... and I got memory error.

ReplyDeletemmm.. not sure I understand. Sorry.

DeleteThanks for the nice python implementation - works great on new graphs. One minor nitpick, your graph image doesn't correspond to your reward matrix, it shouldn't go from 3 to 5 and back, instead 3 only goes to points 4 and 1. 5 loops onto itself for 100 points. Thanks for posting this!

ReplyDeleteManuel

Hi, thanks for reading! Oh, I didn't notice that! I didn't want to take the picture from the original article so I made one myself in a hurry. Thanks for pointing out the mistake, fixed it! :)

DeleteHi Mic, can you tell me why the next action is chosen at random (lines 29-33). I thought in Q learning the next action is chosen based on the highest Q value. Best, Nina

ReplyDeleteHi Nina, the next action is chosen at random only during the training of the agent in order to explore the environment (and build the Q matrix). In testing (lines 71->89) the next action is chosen according to the highest Q value as you expect.

DeleteIn the update function, why are we selecting the Q max index based on the randomly selected action going into what I thought was the state field of Q? Such that Q(state, action) returns the Q-value for that specific pair, why are we inserting action into state field? (Q[action,])? Shouldn't we want to insert state there to get index from that?

ReplyDeleteIt's a fantastic example for designing MDP based algorithms for many randomly generated environments.

ReplyDeleteMay I ask a question? Do you have like this example for multi agents. like team-q (friend-q) learning for cooperative mission

Thanks! I'm sorry, as of now I haven't got any examples for multi agents.

DeleteBonjour,

Deletecomment adapter un algorithme de Q learning aux jeux de shogi ou échecs?