MLG 029 Reinforcement Learning Intro

Feb 05, 2018

Introduction to reinforcement learning (RL), a system where an agent learns to navigate an environment and achieve defined goals without being given explicit instructions, by using a rewards and punishment mechanism. RL can be model-free, which is reaction-based, or model-based, which incorporates planning. Applications of RL include self-driving cars and video games. Compares RL to supervised learning and its business applications like vision and natural language processing.

Resources

Hands-On Machine Learning with Scikit-Learn and PyTorch

StatQuest - Machine Learning

Reinforcement Learning: An Introduction (2nd Ed.) by Sutton & Barto

UC Berkeley CS285: Deep Reinforcement Learning

Nathan Lambert - RLHF Book

DeepLearning.AI: Reinforcement Learning from Human Feedback

Stanford CS234 Reinforcement Learning

Show Notes

Reinforcement learning (RL) is not artificial intelligence in itself, but it is a fundamental component of AI. It is considered a key aspect of AI because it lets an agent learn through interaction with its environment using a system of rewards and punishments.

Concepts and Definitions

Reinforcement Learning (RL):
- RL is a framework where an "agent" learns by interacting with its environment and receiving feedback in the form of rewards or punishments.
- It is part of the broader machine learning category, which includes supervised and unsupervised learning.
- Unlike supervised learning, where a model learns from labeled data, RL focuses on decision-making and goal achievement.

Comparison with Other Learning Types

Supervised Learning:
- Involves a teacher-student paradigm where models are trained on labeled data.
- Common in applications like image recognition and language processing.
Unsupervised Learning:
- Not commonly used in practical applications according to the experience shared in the episode.
Reinforcement Learning vs. Supervised Learning:
- RL allows agents to learn independently through interaction, unlike supervised learning where training occurs with labeled data.

Applications of Reinforcement Learning

Games and Simulations:
- Deep reinforcement learning is used in games like Go (AlphaGo) and video games, where the environment and possible rewards or penalties are predefined.
Robotics and Autonomous Systems:
- Examples include robotics (e.g., Boston Dynamics mules) and autonomous vehicles that learn to navigate and make decisions in real-world environments.
Finance and Trading:
- Utilized for modeling trading strategies that aim to optimize financial returns over time, although breakthrough performance in trading isn’t yet evidenced.

RL Frameworks and Environments

Framework Examples:
- OpenAI Baselines, TensorForce, and Intel's Coach, each with different capabilities and company backing for development.
Environments:
- OpenAI's Gym is a suite of environments used for training RL agents.

Future Aspects and Developments

Model-based vs. Model-free RL:
- Model-based RL involves planning and knowledge of the world dynamics, while model-free is about reaction and immediate responses.
Remaining Challenges:
- Current hurdles in AI include reasoning, knowledge representation, and memory, where efforts are ongoing in institutions like Google DeepMind for further advancement.

Transcript

This is episode 29, Reinforcement Learning Introduction. Finally, my friends, we are at reinforcement learning, the beginning of the end of your artificial intelligence quest, the beginning of the end.

The beginning of the end because reinforcement learning is not artificial intelligence, but as you'll see in a bit, it is such a core component of AI. It's kind of the heart of AI, and when you start diving into RL, you'll really feel like you've made a huge step towards that goal. You'll feel the magic of RL and start to feel like you're making a dent in the grand picture.

Let's define RL first, and then we'll come back to AI. Reinforcement learning, or RL, is the third pillar of machine learning. We have unsupervised, supervised, and reinforcement learning, and sometimes semi-supervised, but we don't talk about that in this series. Unsupervised learning, as you'll recall, is a machine learning model learning without instruction to sort of sift data into piles, to find patterns in the data, to put the triangles over here and the circles over there.

It's not super common in my experience in industry or practical application. So most of what we've been dealing with in this podcast series is supervised learning. Supervised learning is a student and a teacher. You're the teacher, and your neural network is the student. You're training your model to recognize patterns with flashcards.

Okay? The flashcards are your data. You hand your neural network a pile of flashcards, and on the front of each flashcard are the features. And on the back of each flashcard is the label or the target that your neural network is trying to learn how to predict. So supervised learning is like neural network school.

You are the trainer and you're training your model to recognize patterns so that when you release it into the wild, it can continue to recognize that pattern on data it's never seen before. Vision, natural language processing, recommender systems: most of the practical business applications of machine learning fall into this category.

Reinforcement learning is the learning model, or the agent (we call it the agent), training itself. You don't give it labeled data to train on. Instead, you give it a system whereby it knows whether an action it took is good or bad, and that's it. From there, it learns all on its own how to navigate the cruel world.

So supervised learning is handing your student a deck of flashcards. Reinforcement learning is handing a kid a sword and a shield and a scoreboard and sending it out into the world, and your agent will learn all by itself how to swing the sword, how to walk around this map, which bad guys are too strong for it to fight, and eventually how to beat the game.

It's an action-based machine learning system. It's all about taking actions in an environment to achieve an eventual goal. So it's a goals and action-based machine learning system, and the way that it learns what actions to take and how to accomplish its goal is by being rewarded and punished. And that's it.
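
The loop described above (take an action, observe the environment, collect a reward or punishment, repeat until the goal is reached) can be sketched in a few lines of Python. This is a minimal illustration, not code from the episode: `CorridorEnv`, its reward values, and `run_episode` are hypothetical names and numbers chosen for the example, and the `reset`/`step` method names simply follow a common convention for RL environments.

```python
class CorridorEnv:
    """Hypothetical toy environment: walk right along a corridor to reach the goal."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos  # the observation is just the current position

    def step(self, action):
        # action 0 = step left, action 1 = step right (clamped to the corridor)
        move = 1 if action == 1 else -1
        self.pos = min(max(self.pos + move, 0), self.length - 1)
        done = self.pos == self.length - 1
        reward = 10.0 if done else -1.0  # reward at the goal, small punishment otherwise
        return self.pos, reward, done


def run_episode(env, policy, max_steps=100):
    """Let the policy act in the environment and add up the rewards it collects."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = policy(obs)
        obs, reward, done = env.step(action)
        total += reward
        if done:
            break
    return total


always_right = lambda obs: 1
print(run_episode(CorridorEnv(), always_right))  # prints 7.0
```

Nothing here learns yet. The point is only the shape of the interaction: the agent never sees labels, just observations and a running score.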

So classic applications and use cases of reinforcement learning tend to be geared around games: normal human games like chess and Go (I'm sure you've heard about AlphaGo beating the world champion Lee Sedol in recent times; that's deep reinforcement learning) as well as video games like Atari and Doom.

Anything that can be framed as trying to achieve some eventual goal by taking actions through time and receiving reward or punishment, that's a reinforcement learning scenario. So it's not just video games. We've got self-driving cars, whose goal is to get the human from A to B safely, where rewards might center around staying in the lines and getting the person there in a timely fashion, and punishment might be braking too fast and these things.

You could apply it to robots walking around in an environment. I'm sure you've seen those mules or dogs by Boston Dynamics that are walking around through the forest, so I'm sure it has some sort of reward and punishment system baked into it for learning how to walk and navigate an environment. And of course, for our purposes, it can apply to stock trading or day trading.

The eventual goal being maximizing your portfolio value, and the rewards and punishment being your gains and losses as you trade. So there's lots of applications of reinforcement learning. Now let's get back to AI. In a very early episode of this series, we defined AI as sort of a list of check boxes, like we'll have achieved AI once we've combined a bucket list of features.

We talked about perception, like vision and speech. Well, those we've tackled with convnets and RNNs. AI requires learning. Well, this whole thing has been about learning. You probably want your AI to have a body; that's robotics. It's not necessary, but it's icing on the cake, and that's out of the jurisdiction of this podcast.

It should have the ability to act. It should have actuators, the ability to open doors or walk around or make decisions. Well, action is what reinforcement learning is all about, so we're gonna be covering that stuff. And it should be able to plan. Planning versus action: we're gonna sift those two bits apart in this episode. We're gonna talk about the difference between action and planning, model-free reinforcement learning versus model-based reinforcement learning. We'll get to that in a bit, but it looks like all the stuff we've covered so far and the stuff we're gonna be covering now pretty much check all of our boxes.

So it looks like we're getting really close to the grand goal of artificial intelligence. Now there's three more check boxes that I commonly see that may not yet be checked. Those are reasoning, knowledge representation, and memory. Now, in my opinion, a case could be made for both knowledge representation and memory being baked into a neural network's neurons: that the weights of a neuron are its memory of the pattern it's trying to predict, or the history of actions that have resulted in high rewards that it's now going to act upon in a reinforcement learning agent. So a case could be made that knowledge representation and memory might be considered to be baked into the neural network's neurons already. In other words, that we've already checked those boxes inadvertently.

As for reasoning, that's a tougher one. That's one that maybe indeed is an unchecked checkbox, but another case could be made here that planning and reasoning might go hand in hand. That being able to plan your way through an action sequence of an environment is the act of reasoning about your situation.

And since we're going to be covering planning in this sequence of episodes, a case could be made that we're also covering reasoning. It may not be a very strong case. So there's potentially some more work that still needs doing around the reasoning, knowledge representation, and memory aspects of artificial intelligence.

There's also a research project by Google DeepMind called the Differentiable Neural Computer, which purports to solve these exact things: knowledge representation, memory, and reasoning. So that's something worth looking into, but we'll put those on the back burner for now, and I'll leave it to you listeners to decide whether those aspects of AI are yet to be accomplished.

Other than those bits, we're damn close. Damn close to the end goal. And reinforcement learning, especially model-based reinforcement learning, which introduces planning to the mix, takes us one giant step towards that vision. Now let's spend a little bit more time on supervised versus reinforcement learning.

There are cases where supervised learning is an obvious fit. Vision, for example. You're training your convnet on a bunch of images that are tagged as cat, dog, or tree. It learns by flipping over the flashcards, slapping its forehead, and fixing its mistakes over time how to make that distinction on its own, how to recognize the pattern. Supervised learning, clear case.

A clear case of reinforcement learning is playing a video game. You have a scoreboard in the top left. That's your reward. You're taking actions in an environment over time, trying to accomplish a goal. Very clear case of reinforcement learning.

What about trading? What about Bitcoin trading, which is our podcast project? That is a case that can go both ways, and that's partially why I chose that project for us: it transitions us from supervised to reinforcement. How could we frame it as a supervised learning scenario? Well, we would feed into our model, whether it's an LSTM RNN or a convnet, a window of time steps.

That's the front of the flashcard. And it's trying to predict the very next time step's price. That's the back of the flashcard. So it learns over time how to predict a next price action based on a time window. Now what do you do with that? What do you do now that you know the next price action? Well, if you're an expert trader, you'll program a bunch of trading rules manually into your program, a bunch of if-else statements, basically saying: if the price is going this way and the simple moving average is such and such, and the price 20 steps ago was this, then do that.

That's the supervised learning approach to acting on a predicted price action. And for many trading firms, that's exactly what they want. They know how to trade. They're the experts in the subject. All they want is a predicted next price action from their supervised learning model, and they'll take it from here.
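
As a rough sketch of the supervised framing just described, the snippet below cuts a price series into flashcards (a window of past prices on the front, the next price on the back) and then applies a hand-written if/else rule to a predicted price. The function names, the window size, and the specific trading rule are assumptions made up for illustration; they are not the rules or code used for the podcast's trading bot.

```python
def make_flashcards(prices, window):
    """Split a price series into (front, back) pairs for supervised training."""
    cards = []
    for i in range(len(prices) - window):
        front = prices[i:i + window]  # features: the last `window` prices
        back = prices[i + window]     # label: the very next price
        cards.append((front, back))
    return cards


def rule_based_action(front, predicted_next):
    """Hand-written trading rule layered on top of a prediction (illustrative only)."""
    sma = sum(front) / len(front)  # simple moving average over the window
    last = front[-1]
    if predicted_next > last and last > sma:
        return "buy"
    if predicted_next < last and last < sma:
        return "sell"
    return "hold"


prices = [10.0, 11.0, 12.0, 13.0, 12.0, 11.0]
cards = make_flashcards(prices, window=3)
for front, back in cards:
    # Here the true next price stands in for a model's prediction.
    print(front, "->", back, rule_based_action(front, back))
```

In a real setup the second argument to `rule_based_action` would come from a trained model rather than from the label itself; the expert's knowledge lives entirely in the hand-coded rule.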

Thank you very much. Okay. Another approach would be that the front of your flashcard is the time step window, like before, but the back of the flashcard is whether to buy or sell or hold. So it kind of looks like reinforcement learning in that we're predicting an action. But the agent isn't learning what actions to take on its own.

You are teaching it which actions to take given a time step window. You are manually teaching it. You are the teacher and it is the student, and you are teaching it: if the window looks like this, you buy; if the window looks like that, you sell. And in order to do that, you have to have an expert on hand, able to label very precisely and accurately the best trade signal for a time step window.

That may be very difficult to get accurate, and that may be very time consuming. When we switch to reinforcement learning, our agent learns how to buy and sell through trial and error, through losing money and gaining money, towards an ultimate goal of having a very high value portfolio. And you don't have to tell it how to trade.

You don't have to tell it what's a good trade and what's a bad trade. It learns those all by itself. And the hope of reinforcement learning, as has been shown with Atari game playing and AlphaGo versus Lee Sedol, is that this bot can learn to trade with superhuman performance. That it will learn the proper buy and sell signals given a time window much better than a human could train it on.

Now we haven't seen that to be the case yet in deep reinforcement learning when applied to trading. That's my disclaimer for you. I actually haven't made any money yet with our Bitcoin trading bot and through lots of conversations with people and companies, there's still a lot of work and research in the space that needs to be done before these bots achieve superhuman trading power.

But that's the goal. Now, one thing that a good reinforcement learning agent would need to consider when it's performing its actions is the consequences of its actions now for future rewards. This is called the credit assignment problem, because, you see, our RL agent is going to be experiencing delayed rewards.

It might buy some amount now and take a small penalty based on the commissions. But that purchase will grow in value over time, so its downstream reward is greater than its present penalty. And this is something that's factored into any reinforcement learning agent: delayed rewards, the credit assignment problem.

This is solved by something called a discount factor, which we'll describe in the technical details of reinforcement learning in the next episode. So that's a lay of the land of reinforcement learning, its definition, and how it compares to supervised learning. Now, let's start cracking open reinforcement learning from a high-level perspective.
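
Ahead of the fuller treatment promised for the next episode, here is a minimal sketch of how a discount factor spreads credit for a delayed reward back to earlier actions. Each step's return is its own reward plus the discounted return of the step after it. The reward sequence is made up to mirror the commission example above, and the discount of 0.5 is chosen only to keep the arithmetic easy to follow; values close to 1 are more typical in practice.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every time step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A small penalty now (a commission), then a larger payoff three steps later.
rewards = [-1.0, 0.0, 0.0, 10.0]
print(discounted_returns(rewards, gamma=0.5))  # prints [0.25, 2.5, 5.0, 10.0]
```

The first action still comes out positive (0.25) even though its immediate reward was a penalty, because part of the later payoff is credited back to it.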

Let's start looking at its insides. We won't get too technical in this episode. This is an introduction. We'll get a little bit more technical in the next episode. The first high-level distinction we make in reinforcement learning is model-free reinforcement learning agents versus model-based reinforcement learning agents. Model-free versus model-based.

The simpler reinforcement learning agent, the model-free RL agent, is something I like to consider a reactionary agent. It is a gut-reaction agent, an instinct-based agent. If you tap its knee with a hammer, its leg comes up. If you put food in front of its mouth, it bites. This is almost like its reptilian brain, and this reactionary component of our agent has a special name.

It's called a policy. The policy determines what action the agent takes given what it is experiencing right now: what it sees, or what it hears, or what it tastes or smells. If it sees food right in front of its face, that goes into a neural network as inputs.

That neural network is called your policy, and out comes an action, which is to bite the food. So a model-free RL agent is a gut-reaction, instinct agent. And a model-based RL agent is a much more sophisticated agent, which has planning built into its system. It can look ahead many steps. It can start to think about the problem and weigh the pros and cons of specific actions.

So model-free is reactionary. Model-based is planning-based. That's the high-level separation of RL agents. In this world of reinforcement learning that we're about to embark on, we won't be getting into the model-based agents for quite some time. They're much more sophisticated and complex. So we're gonna start with the reactionary agents, the policy-based, model-free agents.
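
To make the idea of a policy concrete, here is a toy sketch of one: a function that takes what the agent senses right now and returns an action, with no look-ahead. A single layer of hand-set weights followed by a softmax stands in for the neural network. The feature names, the action names, and the weight values are all hypothetical and chosen only to echo the knee-tap and food examples above; a real policy would learn its weights.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


ACTIONS = ["bite", "kick", "wait"]

# Hypothetical hand-set weights: one row per action, one column per input
# feature (food_in_front, knee_tapped). A trained policy would learn these.
WEIGHTS = [
    [4.0, 0.0],  # "bite" responds to food
    [0.0, 4.0],  # "kick" responds to a knee tap
    [1.0, 1.0],  # "wait" is a weak default
]


def policy(observation):
    """Map the current observation straight to an action: a pure gut reaction."""
    scores = [sum(w * x for w, x in zip(row, observation)) for row in WEIGHTS]
    probs = softmax(scores)
    best = max(range(len(ACTIONS)), key=lambda i: probs[i])
    return ACTIONS[best], probs


print(policy([1.0, 0.0])[0])  # food in front of its face
print(policy([0.0, 1.0])[0])  # hammer on the knee
```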

You'd be surprised at how far you can get with these model-free agents. Maybe you've seen these YouTube videos of an orange ragdoll man in an environment. He's running around and it has that Benny Hill theme song playing, and he's running really funny and swinging his arm around and trying to jump over gaps and stuff.

This is a physics-based environment called MuJoCo, M-U-J-O-C-O, which stands for Multi-Joint dynamics with Contact. And it's a very complex environment. I mean, this ragdoll guy has to learn how to move all of his limbs, with joints throughout the limbs. So he has to learn how to move quite a number of parts on his body and learn how to jump over obstacles and all these things.

You'd think that couldn't be performed by an unsophisticated model, but indeed, usually that is showcasing a model-free reinforcement learning agent, like the Proximal Policy Optimization agent or the Deep Q-Network. So you can make surprising progress with a model-free agent. And in fact, we're using a model-free agent for our Bitcoin trading bot.

Now before I move past model based agents, because again, we're not gonna be really diving into that for quite some time, I want to tell you what these models are. First off the word model. I really dislike this. It took me a long time to understand what was being said here. Model free versus model based.

I don't like the use of the word model because there's models everywhere in our RL agents, whether they're model-free or model-based. Our model-free agent has the reinforcement learning agent at a high level, like the Proximal Policy Optimization agent. That's a model. And it has inside of that a convnet for parsing the screenshot of the video game it's playing, or an LSTM RNN if it's parsing some time series like stock data. That's a model. So right from the get-go, we have two models, and yet this is a model-free agent. It's very confusing. When they say model in reinforcement learning, in the context of model-free versus model-based, they're referring to something very, very specific.

It's the planning component of the RL agent. Namely, it's a system which learns how the world around it works, so that it can plan based on that knowledge. This is called transition dynamics, and we'll get to that in the next episode. In a very simple reinforcement learning environment, this doesn't need to be learned per se.

You'll see some of this stuff in the first chapters of the resources that I'll be recommending. There, the transition dynamics are baked into the system. For much more complex environments, like Atari games or the MuJoCo environment, the transition dynamics need to be learned. So let's have an example. If we're talking about the game of chess when played on a computer, maybe it's the computer versus you.

We could do a number of approaches. The computer could have built into its system all the rules of chess and sort of what are the best actions to take under what circumstances. And this would be like those Windows 95 and Windows 98 chess-playing algorithms. This isn't AI, this is an algorithm, but it is planning.

It is a planning-based model using either these tree search algorithms (Monte Carlo tree search is a popular one) or any number of other planning-based algorithms. This is what planning is: if the chess piece is in this position, then what is the optimal sequence of steps to take given the opponent's configuration?

And so it goes down this tree, simulating how things might play out, and it prunes branches that look like they'll be a dead end, until eventually it comes to a high-scoring possible move to take and then takes that move. That would be the case when the dynamics of the system are baked into your model; they're programmed into the Windows 95 chess algorithm.
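
Chess is far too large for a short listing, so the sketch below illustrates planning with baked-in rules on a much smaller game chosen for this example: a pile of stones, each player removes one to three, and whoever takes the last stone wins. The search is plain exhaustive look-ahead (a negamax recursion with caching), not Monte Carlo tree search and without pruning. What it shares with the chess programs described above is that the rules are programmed in and the move is chosen by simulating how play could unfold.

```python
from functools import lru_cache

MOVES = (1, 2, 3)  # a player may take 1, 2, or 3 stones per turn


@lru_cache(maxsize=None)
def value(stones):
    """+1 if the player to move can force a win from here, -1 otherwise."""
    if stones == 0:
        return -1  # the previous player took the last stone, so the mover has lost
    # My best outcome is the worst outcome I can leave for my opponent.
    return max(-value(stones - m) for m in MOVES if m <= stones)


def best_move(stones):
    """Search the game tree and pick the move that leaves the opponent worst off."""
    legal = [m for m in MOVES if m <= stones]
    return max(legal, key=lambda m: -value(stones - m))


print(best_move(7))  # prints 3, leaving the opponent a losing pile of 4
```

Nothing is learned here. The transition dynamics (what the pile looks like after a move) are written directly into the code, which is exactly what stops scaling once the environment becomes too complex to hand-program.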

That stuff doesn't fly in the real world. In modern systems and complex video games and robots walking around the world, we couldn't possibly program the physics of the universe into our robot. And so instead, you give it the option to learn how the world around it works. It can learn the dynamics of the world.

These are called the transition dynamics. So when we say model-free versus model-based, the word model refers to a model that is mapping the dynamics of the world, the transition dynamics. And so in a model-based reinforcement learning algorithm, we'll have a model for our reinforcement learning agent and for the convnet within it and whatever else over here on the left.

And we'll have a model for learning how the world around it works, so that it can plan to make wiser decisions, over here on the right. And that's what the model is that they refer to in model-based. Now these planning or searching algorithms are the stuff of classical AI. If you crack open a pre-2000 textbook about AI, like one that I will recommend at the end of this show called AI: A Modern Approach, it's these algorithms that those books will teach you.

These planning, searching algorithms for deciding what action to take given a specific configuration. And the newfangled stuff of reinforcement learning takes that planning component, pops it in as a module, and now it can both react (it can bite food or kick its leg) and it can plan. And it can learn to do both of those in a very sophisticated deep learning framework using convnets, LSTMs, or dense layers.

So this is kind of why I think deep model-based reinforcement learning is getting towards the crux of AI. It's 'cause we're combining all the powers of everything we've learned thus far all into one robot. So that's the high-level breakdown of RL agents that we'll be exploring in our education: model-free versus model-based.

We'll start with model-free, and now we will make another division over here on the left in the model-free RL agents. We're gonna split that into two new branches. These are the policy gradient agents and the value-based agents. We'll get into the technicals of those in the next episode.

We'll talk high level here. Policy gradient agents are dirt simple. It's the 101 RL agent you'll learn in any book, and all it is, is deep learning applied to actions. It's just performing an action, assessing the consequence (the reward), considering actions in a time horizon so we can handle delayed rewards and the credit assignment problem, and then from there taking a gradient step, just traditional machine learning, in a direction that optimizes the policy, in a direction that helps the agent make better decisions in the future.

So it's really the classical machine learning strategies that you've seen up until this point applied to an actions and rewards framework. It's really vanilla stuff, and you'll see this early on in your education. The policy gradient methods. Then over here we have the value-based methods. And this takes a different approach to the problem setting.
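
As a small illustration of "perform an action, assess the reward, take a gradient step", here is a sketch of the simplest policy gradient update (often called REINFORCE) on a two-choice problem. It is a simplified assumption-laden example: there is only one step per episode, so there are no delayed rewards to handle, the policy is a softmax over two learnable preferences rather than a deep network, and the reward values, learning rate, and step count are made up.

```python
import math
import random


def softmax(prefs):
    """Turn preferences into action probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_policy_gradient(rewards_per_action, steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    prefs = [0.0] * len(rewards_per_action)  # the policy's learnable parameters
    for _ in range(steps):
        probs = softmax(prefs)
        action = rng.choices(range(len(prefs)), weights=probs)[0]  # act
        reward = rewards_per_action[action]                        # assess the consequence
        for i in range(len(prefs)):
            # Gradient of log pi(action) with respect to prefs[i] for a softmax policy.
            grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
            prefs[i] += lr * reward * grad_log_prob                # gradient step
    return softmax(prefs)


# Action 1 always pays 1.0 and action 0 pays nothing, so the policy should
# end up choosing action 1 almost all of the time.
probs = train_policy_gradient([0.0, 1.0])
print(probs)
```

The update nudges up the probability of whatever action was just rewarded, in proportion to the reward, which is the "vanilla" behavior described above.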

It uses all this Bellman stuff that we'll get into later, the Bellman optimality equation and value iteration and all these things, in order to pose the problem with a different spin. Very, very similar, but a different spin. And what we learn is not the policy directly, but we learn to be able to predict the value of the current state we're in and the value of each possible action we can take.

So it's a slightly more sophisticated spin. It's a little bit more like a one-step look-ahead than a gut reaction. So in a policy gradient, you're training directly the neurons to fire in a certain way when the network is in a specific state. That's really the hammer on the knee, kicking the leg, and food in front of the mouth, taking a bite.

Whereas the value-based methods are able to sort of look down and see the value of the state we're currently standing on, look forward and see the value of each of the options we can take, A, B, or C, and reach out and grab one of those options, whichever one has the highest value. It's a little bit more sophisticated, a little bit less reactionary.

And in fact, it turns out it has lower variance, as we'll see later, than the policy gradient methods. So there's a lot of advantages. It's certainly not a planning agent, don't get me wrong. When I say it's looking forward one step at the actions it could take, it's not planning. So it's more like a judgment call than a gut reaction.
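
The "look at the value of each option and grab the highest" idea can be sketched with a table of action values on a tiny corridor world. This example makes several simplifying assumptions: the environment, its rewards, and the discount of 0.5 are invented for illustration, and the values are computed by sweeping over every state and action using the known toy dynamics. An agent such as DQN would instead estimate the same kind of values from sampled experience with a neural network.

```python
GOAL = 3          # terminal state at the right end of a 4-cell corridor
ACTIONS = (0, 1)  # 0 = left, 1 = right
GAMMA = 0.5       # discount factor, kept small so the numbers are easy to read


def step(state, action):
    """Hypothetical toy dynamics: returns (next_state, reward, done)."""
    nxt = min(max(state + (1 if action == 1 else -1), 0), GOAL)
    done = nxt == GOAL
    return nxt, (10.0 if done else -1.0), done


def learn_q_values(sweeps=10):
    q = {(s, a): 0.0 for s in range(GOAL) for a in ACTIONS}
    for _ in range(sweeps):
        for s in range(GOAL):
            for a in ACTIONS:
                nxt, reward, done = step(s, a)
                future = 0.0 if done else max(q[(nxt, b)] for b in ACTIONS)
                # Bellman optimality backup: reward now plus discounted best value next.
                q[(s, a)] = reward + GAMMA * future
    return q


def greedy_action(q, state):
    """Compare the value of each available action and take the best one."""
    return max(ACTIONS, key=lambda a: q[(state, a)])


q = learn_q_values()
print(q[(0, 0)], q[(0, 1)])  # prints -0.5 1.0
print(greedy_action(q, 0))   # prints 1 (go right)
```

Note that acting still involves no multi-step search: the agent only compares the learned numbers for the actions available right now, which matches the "judgment call" description above.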

Now there are pros and cons to the policy gradient methods versus the value-based methods, so it's not cut and dry. I made it out to sound like the value-based methods are the better approach. That's not true. It's a pros and cons setup. For example, this isn't necessarily the case, but it is very, very often the case.

Policy gradient methods allow you to use continuous actions, where value-based methods require you to use discrete actions. Not necessarily the case, but very, very commonly the case. A continuous action, for example, is picking a number between zero and 100. So in our Bitcoin trading bot, we want to be able to sell some arbitrary amount of Bitcoin or dollars, so we want to have the ability to take an action on a continuous scale.

Whereas if we were limited to using a discrete action, we'd have to hard-code some predetermined amount that the trading bot could buy or sell, which isn't very slick. On the other hand, the value-based methods have a lot less variance, and variance is a huge problem in reinforcement learning, which we'll get to later.
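
The difference between the two kinds of action can be shown with a short sketch. With discrete actions the agent picks an index into a hard-coded menu of trade sizes; with continuous actions it outputs a real number that is mapped onto a trade size. The menu, the use of `tanh` to squash the output into a range, and the function names are assumptions for illustration, not how the podcast's trading bot is implemented.

```python
import math

# Discrete: the agent can only choose among predetermined trade sizes.
# Hypothetical menu of portfolio fractions; negative means sell, positive means buy.
DISCRETE_TRADES = [-1.0, -0.5, 0.0, 0.5, 1.0]


def discrete_trade(action_index, balance):
    """Look up the hard-coded fraction for the chosen action index."""
    return DISCRETE_TRADES[action_index] * balance


def continuous_trade(raw_output, balance):
    """Map any real-valued network output to a fraction in (-1, 1) of the balance."""
    fraction = math.tanh(raw_output)
    return fraction * balance


print(discrete_trade(3, 200.0))       # prints 100.0, the closest option on the menu
print(continuous_trade(0.25, 200.0))  # an arbitrary amount between -200 and 200
```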

So pros and cons of policy gradient methods versus value-based methods. And per the prior episode on hyper search, what you really want to do is try them all, try both approaches. We'll get into the specific policy gradient models and value-based models in a future episode. But right here, I want to name-drop the most popular from each camp, just so you have an idea.

The reigning champion from the value-based camp is the most popular RL agent in the world, the most spoken of in all the literature and blog posts, and showcased all over the internet: Deep Q-Networks, DQN. That is a value-based agent. I'm quite sure you've heard of DQN. And the current reigning champion of the policy gradient approaches is the Proximal Policy Optimization model, or PPO.

So PPO versus DQN is sort of a showdown you'll see a lot of. Now just to name-drop a handful of other popular RL agents out there, we have the actor-critic agent, the ACER agent, DDPG, or Deep Deterministic Policy Gradient, and TRPO, or Trust Region Policy Optimization. And we'll compare a lot of these in the future.

Now, something I've observed (this is my own observation, and I don't know whether it's true; I might get into trouble for saying this) is this: Google DeepMind and OpenAI, those are the two biggest research outfits for deep RL. Now, I've observed that Google tends to be a champion of the value learning approaches, the Q-networks, where OpenAI, it seems, tends to be a champion of the policy gradient approaches like PPO and TRPO. So that's just something interesting that I've observed, that Google seems to champion the Deep Q-Networks and OpenAI tends to champion the policy gradient approaches, whether or not that's true.

This goes to show that there are pros and cons to both camps. There is no clear winner on PG versus value-based. We'll break apart the PG versus value stuff more in the future, but that's just a lay of the land. Now let's talk about technology, libraries and frameworks and code that you can use. Now the Hands-On Machine Learning book that I've been recommending over and over has a fantastic chapter on deep reinforcement learning, the last chapter of the book, and it guides you on a lay of the land, like this episode and the next episode.

It walks you through the core concepts of reinforcement learning. It has you hand-code from scratch a policy gradient approach and a Deep Q-Network, and it also throws a little something called actor-critic into the mix, which we'll talk about. So I highly recommend, if you haven't already, reading that chapter in the Hands-On ML book.

And with that, you can hand-code your own deep reinforcement learning agent. But that'll be as simple as they come. Moving forward, if you want to keep up with the times, follow the latest and greatest improvements and modifications on the reinforcement learning agents by cutting-edge research, then you're gonna want to use a framework.

And these frameworks will bake in all the very complex math and theory behind these agents, which can be quite a task to wrestle with if you want to hand-code these things. And there's a handful of popular frameworks out there for reinforcement learning. All of them are built on top of TensorFlow, so you'll get that GPU optimization, and all the knowledge that you've gained thus far will come in handy.

The first one I want to mention is OpenAI Baselines. Baselines is a repository of the code that accompanies all of OpenAI's publications. So each of their papers, like the PPO paper, for example, has accompanying open source code so that you can follow along, and also so that you can verify their benchmarks versus your own on your computer, or try different twists of these agents, et cetera. It's not really a framework. It's more of a dumping ground for code that accompanies their research. So OpenAI Baselines is more intended for research, not intended for developers, not intended as a plug-and-play framework.

And in fact, I tried to use OpenAI Baselines for our trading bot in the early days, and I just couldn't adapt their code to our circumstances, because their code was too tightly coupled to their environment setup. So Baselines is intended for research, not intended for developers. So on the flip side, we have reinforce.io's TensorForce, which is the framework that we're using for the Bitcoin trading bot.

TensorForce is a developer framework. It's intended to be plug and play and easy to use. And in my opinion, it is the easiest to use of the frameworks out there. It just has a really slick interface where you can pop in an environment like our Bitcoin trading environment, and in 50 lines of code you can choose which model-free reinforcement learning agent you want to use, like the PPO agent or the DQN agent; what your network architecture is, whether it's a CNN or an LSTM RNN; and a handful of hyperparameters. Then you hit go, and it will abstract all the really hairy math behind the scenes for you.

A cool part about it is its modular architecture, which allows you to decide on a whim to switch from PPO to DQN without very much effort at all. It could take maybe 10 minutes and you've switched from your PPO to a DQN, and that way you can easily benchmark the relative performance of reinforcement learning agents for your environment. I talked previously about using hyper search to determine which reinforcement learning agent to use, and you'll also want to use hyper search on the hypers that correspond to specific reinforcement learning agents.

So TensorForce, being a developer framework, makes that process very, very painless. Now, the downside of TensorForce is that it doesn't really have a big company backing it. There are two main developers behind it. They're both from the University of Oxford, so it doesn't have the big name behind it like Baselines has OpenAI, which could be a problem for some people who might want to gauge their trust in the future success of the framework.

Another framework that might meet you halfway between those two is Nervana Systems' Coach. Coach is built by Intel. Now that's a big name as far as a backer is concerned. Intel has a sort of labs department called Nervana, where they are actually developing their own deep learning framework, a competitor to TensorFlow. And on top of it, they've been building this deep reinforcement learning framework called Coach.

Now they're smart. They know that most people out there are using TensorFlow, not Nervana. And so they've made their framework also compatible with TensorFlow as a first-class citizen.

And so Coach runs on TensorFlow, and it is intended to be a developer framework, just like TensorForce is. It's not, like Baselines, a dumping ground for research code. It's intended to be used by developers in a plug-and-play fashion, but in my experience it is not as successful at being plug and play as TensorForce.

So there are pros and cons for all of these frameworks. Coach has a bigger backing and thereby a prospectively brighter future. But TensorForce at present, in my experience, is the most well-oiled machine and lends itself best to developers like you and me. And then finally we have rllab, and this is an older framework.

This was more popular before these other contenders came on the scene. I actually cannot speak to this framework at all. I don't have any experience with rllab, but I'll just drop it in the show notes so you can take a look at it and do your own comparison. Now, the way these frameworks work is you specify which reinforcement learning agent you're going to use, either one from the policy gradient camp like PPO or one from the value-based camp like DQN, and you specify your network.

So you'll build some fashion of a CNN or an LSTM with some dense layers in there, and you'll provide it with an environment. An environment is a class that you build. It is a subclass of the environment class in OpenAI's Gym package, G-Y-M, Gym. OpenAI's Gym is a whole suite of environments that you can train your RL agent within, things like the Atari video games, or moving a mouse through a maze, or trying to drive a car up a hill.

All these little experimental environments have prebuilt into them the actions you can take, the reward system, the transition dynamics, the environment physics, all these things, and they range from very simple to very complex. Now, Gym does not have MuJoCo, which I described earlier.

That is actually a proprietary environment that you have to purchase a license for. So Gym is an open source suite of environments that you can download via pip. MuJoCo is something you'll have to go to a website and purchase a license to use. But these environments range from simple to complex, and the simplest of all, the most commonly used, the Hello World of reinforcement learning environments, is called CartPole.

You're balancing a pole on a cart. It's kind of like if you've ever held a broom vertically in the palm of your hand and you're moving your hand around trying to balance the broom so that it doesn't fall down. You're kind of doing this little shuffle dance trying to balance the broom. That's what CartPole is.

You can move this cart either left or right, and the goal is to keep the broom balanced in the air. And the reward system is a plus one for every time step that you don't drop the broom. In other words, the longer you keep the broom balanced, the more reward you get. The actions are either left or right.
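Here is a minimal sketch of the CartPole setup just described, written in plain Python rather than with Gym itself so that it runs without any installs. The class name `TinyCartPole`, the physical constants, and the failure thresholds are assumptions for illustration, loosely modeled on the classic cart-pole equations; this is not Gym's actual code. It shows the two discrete actions, the +1 reward for each time step the pole stays up, and the episode ending once the pole tips too far.

```python
import math
import random


class TinyCartPole:
    """Toy cart-pole: a hypothetical stand-in, not the Gym implementation."""

    GRAVITY = 9.8
    CART_MASS = 1.0
    POLE_MASS = 0.1
    POLE_HALF_LENGTH = 0.5
    FORCE = 10.0                   # size of each push (assumed)
    DT = 0.02                      # seconds per time step (assumed)
    MAX_ANGLE = math.radians(12)   # pole counts as dropped beyond this
    MAX_POSITION = 2.4             # cart counts as off the track beyond this

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = None

    def reset(self):
        # State: cart position, cart velocity, pole angle, pole angular velocity.
        self.state = [self.rng.uniform(-0.05, 0.05) for _ in range(4)]
        return tuple(self.state)

    def step(self, action):
        # Action 0 pushes the cart left, action 1 pushes it right.
        x, x_dot, theta, theta_dot = self.state
        force = self.FORCE if action == 1 else -self.FORCE
        total_mass = self.CART_MASS + self.POLE_MASS
        pole_ml = self.POLE_MASS * self.POLE_HALF_LENGTH
        sin_t, cos_t = math.sin(theta), math.cos(theta)

        temp = (force + pole_ml * theta_dot ** 2 * sin_t) / total_mass
        theta_acc = (self.GRAVITY * sin_t - cos_t * temp) / (
            self.POLE_HALF_LENGTH
            * (4.0 / 3.0 - self.POLE_MASS * cos_t ** 2 / total_mass)
        )
        x_acc = temp - pole_ml * theta_acc * cos_t / total_mass

        # Simple Euler integration of the physics.
        x += self.DT * x_dot
        x_dot += self.DT * x_acc
        theta += self.DT * theta_dot
        theta_dot += self.DT * theta_acc
        self.state = [x, x_dot, theta, theta_dot]

        done = abs(theta) > self.MAX_ANGLE or abs(x) > self.MAX_POSITION
        reward = 0.0 if done else 1.0  # +1 for each step the pole is still up
        return tuple(self.state), reward, done, {}


def run_episode(env, policy, max_steps=200):
    """Run one episode and return the total reward collected."""
    state = env.reset()
    total = 0.0
    for _ in range(max_steps):
        state, reward, done, _ = env.step(policy(state))
        total += reward
        if done:
            break
    return total


def always_left(state):
    return 0


def lean_policy(state):
    # Crude hand-written heuristic: push toward the side the pole leans to.
    return 1 if state[2] > 0 else 0
```

Calling `run_episode(TinyCartPole(), always_left)` gives the total reward for an agent that only ever pushes left, and swapping in `lean_policy` lets you compare a slightly less naive rule. Neither is a learned agent; they are only there to exercise the reward loop.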

So it's a single discrete action, so you can use a deep Q-network here. Easy peasy. And then built into the environment is the physics of how this thing works, the angular velocity and the direction of the cart and all that stuff, so that the pole balancing follows sort of a physical system.

Well, if you want to build your own environment, as we did in our Bitcoin trading bot, where we wanted a Bitcoin environment with price action history and the physics of what happens when you buy and sell, you subclass the environment class from OpenAI. OpenAI's Gym package is sort of like a package of standards that is respected by all the reinforcement learning frameworks.
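As a rough sketch of what such a custom environment might look like: to stay runnable without any installs, the class below does not import Gym. It only mirrors the `reset()` and `step(action)` convention, with `step` returning an observation, a reward, a done flag, and an info dict as in classic Gym. The class name `ToyTradingEnv`, the three actions, the all-in buy and sell rule, and the reward (change in account value) are all assumptions for illustration, not the actual trading bot's code.

```python
class ToyTradingEnv:
    """Hypothetical trading environment following the reset/step convention."""

    HOLD, BUY, SELL = 0, 1, 2

    def __init__(self, prices, window=3, starting_cash=100.0):
        if len(prices) <= window:
            raise ValueError("need more prices than the history window")
        self.prices = list(prices)
        self.window = window
        self.starting_cash = starting_cash
        self.reset()

    def _value(self):
        # Account value = cash plus coins valued at the current price.
        return self.cash + self.coins * self.prices[self.t]

    def _observation(self):
        # Recent price history plus a flag for whether we hold coins.
        history = self.prices[self.t - self.window + 1 : self.t + 1]
        return tuple(history) + (1.0 if self.coins > 0 else 0.0,)

    def reset(self):
        self.t = self.window - 1
        self.cash = self.starting_cash
        self.coins = 0.0
        return self._observation()

    def step(self, action):
        price = self.prices[self.t]
        before = self._value()
        # The "physics": buying converts all cash to coins, selling the reverse.
        if action == self.BUY and self.cash > 0:
            self.coins = self.cash / price
            self.cash = 0.0
        elif action == self.SELL and self.coins > 0:
            self.cash = self.coins * price
            self.coins = 0.0
        self.t += 1
        reward = self._value() - before  # profit or loss over this step
        done = self.t == len(self.prices) - 1
        return self._observation(), reward, done, {}
```

With made-up prices such as `[10.0, 10.0, 10.0, 12.0, 9.0, 11.0]`, buying on the first step earns a reward of 20.0, because 100 in cash becomes 10 coins that are worth 120 one step later. A real version would subclass the Gym environment class and declare its action and observation spaces so the frameworks can read them.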

So the Gym environment superclass is sort of the specification that Baselines, TensorForce, Coach, and rllab all respect. So as long as you subclass a Gym environment, your environment is bound to work across the different frameworks, which makes evaluating the different frameworks a lot simpler. So my recommendation is to try your hand at all these frameworks. We're using TensorForce; I like that one the best. Coach is one that I want to dive into, but I haven't had the time to take a look at it. Baselines is more for you researchers out there rather than developers.

That is reinforcement learning in a nutshell. It is goal-oriented machine learning: taking actions in an environment and being rewarded. It teaches itself how to act in an environment, which is what differentiates it from supervised learning, where you teach it how to act.

Reinforcement learning teaches itself. We split it into model-free versus model-based reinforcement learning agents. Model-free agents are reactionary agents. You hit its knee with a hammer and it kicks. You put food in front of its mouth and it bites. Model-based RL algorithms model the environment around them in a sufficiently sophisticated way that they can use that model to plan actions: not just react, but plan.
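To make the react-versus-plan distinction concrete, here is a toy sketch under stated assumptions: a made-up five-cell corridor with the goal at the right end. The names `REFLEX_TABLE`, `model`, `react`, and `plan` are hypothetical. The model-free agent is reduced to a lookup table of reflexes (a real agent would learn it from rewards), while the model-based agent is handed a transition model and searches it for a sequence of actions before moving.

```python
from collections import deque

N_CELLS = 5          # corridor cells 0..4 (made-up toy world)
GOAL = 4
ACTIONS = ("left", "right")


def model(state, action):
    """Transition model: predicts the next cell for a state and an action."""
    if action == "left":
        return max(0, state - 1)
    return min(N_CELLS - 1, state + 1)


# Model-free: a reflex table mapping each state straight to an action.
# A real agent would learn this from rewards; here it is filled in by hand.
REFLEX_TABLE = {s: "right" for s in range(N_CELLS)}


def react(state):
    """Stimulus in, action out. No lookahead of any kind."""
    return REFLEX_TABLE[state]


def plan(start, goal=GOAL):
    """Model-based: breadth-first search through the model for a path to goal."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, actions = frontier.popleft()
        if state == goal:
            return actions
        for action in ACTIONS:
            nxt = model(state, action)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, actions + [action]))
    return None  # goal unreachable under this model
```

Here `plan(1)` returns `["right", "right", "right"]`, worked out entirely inside the model before any action is taken, whereas `react(1)` only ever answers for the current cell. If the goal moved, say `plan(3, goal=0)`, the planner could reuse the same model, while the reflex table would have to be relearned.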

And really it's model-based deep RL that's getting us into the very depths of AI proper. We will be discussing model-free RL for the next few episodes. We'll break model-free RL down into policy gradient methods like proximal policy optimization, or PPO, and value-based methods like deep Q-networks, or DQNs.

There are pros and cons to both approaches. And we discussed some technology, namely subclassing an OpenAI Gym environment to create an environment that you can work in, or using one of their prefabs like CartPole, the Hello World of RL environments, and then using those environments within a deep reinforcement learning framework like OpenAI Baselines, reinforce.io's TensorForce, or Nervana Systems' Coach.

Now let's talk about the resources. This is gonna be a very heavy resources section, and the reason for this is that the next episodes are gonna be coming out slowly. I'm new to reinforcement learning myself, so I'm going to be releasing episodes as I understand the concepts, so it's gonna take some time.

And I want you to be able to get a deep-dive head start. I want you to be able to read these resources now so you don't have to wait for the next episodes. So I'm going to dump all the deep reinforcement learning resources that we're gonna be covering in this whole sequence of episodes right here and now in the resources section.

The first thing, like I mentioned, is to read the last chapter in the Hands-On Machine Learning book, which is a chapter on deep RL. It's a real quickie lay of the land, and it has you programming a very simple vanilla policy gradient method and a deep Q-network. And I want you to consume these resources in sequential order.

The next thing you should read is Reinforcement Learning: An Introduction by Sutton and Barto. This is the single most recommended resource on reinforcement learning out there; it is the base textbook for your introductory university reinforcement learning course. And they just released a free, completed second edition draft PDF, which is a 2018 release, so it's really fresh.

The original first edition was way back in 1998 or something, and this edition is brand spanking new. So you're gonna get a lot of the latest and greatest, and it introduces reinforcement learning, primarily model-free reinforcement learning. When you finish that book, move on to AI: A Modern Approach.

This is the classic AI introduction textbook. When you embark on your machine learning master's or PhD, probably one of the first classes they have you take is gonna be an introduction to artificial intelligence, and they'll assign this textbook, AI: A Modern Approach. It introduces all the classical approaches to artificial intelligence, especially in the domain of searching and planning.

It's these searching and planning algorithms that you are going to package up and pop into your model-based reinforcement learning agents. So the combination of the Sutton and Barto book and the AI: A Modern Approach book will get you all the knowledge you need to go forward with model-based reinforcement learning.

Next, it's time to move on to deep reinforcement learning and to combine those two together. The resource here is a Berkeley course, CS 294: Deep Reinforcement Learning, and all the videos are available on YouTube. So this is gonna be a very heavy deep reinforcement learning course that combines model-free and model-based approaches and describes all the latest and greatest state-of-the-art research in RL, like the PPO algorithm. Those are the three primary resources: Sutton and Barto, AI: A Modern Approach, and CS 294.

There's also a popular video series, RL Course by David Silver, on YouTube that I'll recommend. This one, personally, I have converted to audio and put on my iPod for while I'm doing chores or commuting. I found the other three resources to be richer educational material, so I want you to save your vision time for those three resources and use the RL Course by David Silver as supplementary audio-time material for when you're at the gym or cleaning the house.

That's it for the resources. Those resources will take you quite some time to consume, maybe the better part of a year. So this will keep you busy for a while, and I'm unlikely to be recommending other resources in the coming few episodes, so don't feel overwhelmed. Now, it may take me some time to release the next episode, 'cause I'm going to want to really intuitively understand the technical details of this stuff so that I can boil it down for you.

So just a heads up: it could be a bit of time. But that's it for the introduction to RL, and I'll see you next time.
