What is reinforcement learning?

Reinforcement learning is an attempt at solving an extremely general and difficult problem.

Think about how we live our lives. Every minute we are presented with some decision. It could be something minor, such as “Should I scratch that itch?” or something major, such as “Should I move to this new city?”. Our choices determine what happens next. Some outcomes are bad and they teach us not to make similar choices in the future, and some outcomes feel nice and encourage us to repeat our behaviour. Over time, if we’re smart, we learn how to achieve a reasonable level of happiness. This learning process is what we want to understand better.

Reinforcement learning is a way to model this sort of learning process. When we try to say the above with more precision, we get reinforcement learning.

We first define a set A and call it the set of actions. This set contains all possible actions we could take, including “scratch that itch”, “suffer through the itch”, “move to new city”, “stay comfortable in current city”, etc.

Next, we approximate the whole universe with one symbol: S. This denotes the set of all states the universe could possibly be in. This could include things like “raining in Mumbai”, “over the counter drug for cancer invented”, “sun explodes and destroys the entire solar system”, etc.

In the framework of reinforcement learning, we assume that the universe is governed by some process that can be partially influenced by our actions. We can think of this as a game. The universe starts in some state s_0 \in S. We get to observe the state, and are allowed to take an action a_0 \in A. Based on our action, the universe moves to the next state s_1 \in S, which we can observe, and presents us with some reward r_1, which is a real number that depends on s_0, s_1, and a_0. We pick our next action a_1 \in A, and the game continues. The goal of this game is for us to come up with a way of picking actions that maximizes the rewards in the long term. The process of figuring out the optimal way of picking actions in this setting is what we call reinforcement learning.
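The game described above can be sketched as a short loop. This is only an illustration: the two states, the two actions, the probabilities, and the reward rule are invented for the example and are not part of the definition.

```python
import random

# Hypothetical toy universe, invented for illustration.
STATES = ["hungry", "full"]
ACTIONS = ["eat", "wait"]

def step(state, action, rng):
    """The universe's move: return (next_state, reward)."""
    if action == "eat":
        # Eating usually works, but not always.
        next_state = "full" if rng.random() < 0.9 else "hungry"
    else:
        # Waiting keeps us full only half the time, and never cures hunger.
        next_state = "full" if (state == "full" and rng.random() < 0.5) else "hungry"
    reward = 1.0 if next_state == "full" else 0.0
    return next_state, reward

def play(policy, num_steps, seed=0):
    """Play num_steps rounds with the given policy and return the rewards."""
    rng = random.Random(seed)
    state = STATES[0]                  # s_0
    rewards = []
    for _ in range(num_steps):
        action = policy(state)         # a_t, chosen after observing s_t
        state, reward = step(state, action, rng)
        rewards.append(reward)         # r_{t+1}
    return rewards

print(sum(play(lambda s: "eat", num_steps=10)))
print(sum(play(lambda s: "wait", num_steps=10)))
```

Here a policy is just a function from the observed state to an action; the learning problem is to find a good one without being told how `step` works.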

There are some issues that need to be carefully thought out to make this definition precise:

  1. What exactly do we mean by long-term reward? It sounds like something along the lines of \sum_i r_i. But if the game goes on forever, this sum is not guaranteed to converge, and we may end up with strategies whose long-term reward is undefined. Even when the sum is well defined, it may be infinite for many strategies, in which case they all look equally good and finding an optimal strategy becomes trivial. There are two popular ways to handle this issue. One option is to assume that the game ends at some point. This is enforced by including a state e in S that denotes the end. Then we need not worry about undefined rewards any more. Another option is to impose no such constraint but consider the discounted reward instead of just the sum, i.e., we try to maximize \sum_i \gamma^i r_i for some constant \gamma \in (0, 1).
  2. How exactly does the universe decide the next state to be in? One option is to say that there’s a function \tau: (S, A)\to S that the universe uses to determine the next state given current state and action. This, however, makes things deterministic in the sense that for a given state and action, the next state is fully determined. We may want to relax that and add some randomness by defining the function as \tau: (S, A, S)\to [0, 1] where \tau(s, a, s') now denotes the probability of going from state s to state s' when action a is taken. This function is usually called the transition function. One might wonder whether this relaxation is really needed. Suppose we observe a universe where taking an action a in state s sometimes leads to state s' and sometimes to s''. Could it be because we haven’t modelled the universe with a rich enough set of states? Perhaps we could capture some extra information in the set S that would be enough to uniquely determine the next state? While this could theoretically be true, experimental tests of Bell’s theorem suggest that the universe we live in cannot be made deterministic merely by adding more local information. Moreover, even if we could make the universe deterministic by adding more information, we are often in situations where we don’t have that information and we want to be able to model it anyway. Probabilities could still be useful as a model of uncertainty that comes from lack of information.
  3. Continuing on the previous point, is it fair to constrain the universe to only use the current state, and not the whole sequence of previous states, to decide on the next state? As long as we have no constraints on the definition of S, this constraint doesn’t change anything. We could just redefine S to contain all possible sequences of states.
  4. Turning our focus to the player, how does the player decide what action to pick next? Let’s model this in a similar way. The player uses a deterministic function \alpha: S\to A that, given a state, returns an action to take. This function is called the policy. If we fix the set of states S, the set of actions A, and the universe’s transition function \tau, it can be shown that there exists a policy \alpha that is the best policy, i.e., it maximizes the long-term reward of the game. Even if \tau is random, there is always an optimal policy that is deterministic.
  5. We can then define learning as the task of figuring out the optimal policy. If the player knows \tau, then this is a computational question and one can measure the performance of the learning algorithm using tools from computational complexity. But for most of the problems we want to model the player doesn’t have the luxury of knowing \tau. The player just starts playing the game by taking actions and observing rewards and states, and is asked to figure out the optimal policy or something close to it after several iterations. In this setting a good way to track performance is to measure the number of iterations needed to reach a desired approximation to the optimal policy. 
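To make points 1, 2, 4 and 5 concrete, here is a minimal sketch of the case where the player knows \tau. It uses value iteration, a standard technique that the text above does not name, on a made-up two-state universe. The states, actions, probabilities, reward function and the discount of 0.9 are all assumptions chosen for the example.

```python
GAMMA = 0.9  # discount factor, some constant in (0, 1)

# Hypothetical toy universe, invented for illustration.
STATES = ["hungry", "full"]
ACTIONS = ["eat", "wait"]

# TAU[(s, a)][s2] = probability of moving from s to s2 when action a is taken.
TAU = {
    ("hungry", "eat"):  {"full": 0.9, "hungry": 0.1},
    ("hungry", "wait"): {"hungry": 1.0},
    ("full", "eat"):    {"full": 1.0},
    ("full", "wait"):   {"hungry": 0.5, "full": 0.5},
}

def reward(s, a, s2):
    # Assumed reward: being full is pleasant, eating costs a little effort.
    return (1.0 if s2 == "full" else 0.0) - (0.2 if a == "eat" else 0.0)

def discounted_return(rewards, gamma=GAMMA):
    """The quantity sum_i gamma^i * r_i for one observed sequence of rewards."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

def q_value(s, a, values):
    """Expected discounted reward of taking a in s, then following `values`."""
    return sum(p * (reward(s, a, s2) + GAMMA * values[s2])
               for s2, p in TAU[(s, a)].items())

def value_iteration(tol=1e-9):
    """Repeat the Bellman update until the values stop changing."""
    values = {s: 0.0 for s in STATES}
    while True:
        new_values = {s: max(q_value(s, a, values) for a in ACTIONS)
                      for s in STATES}
        if max(abs(new_values[s] - values[s]) for s in STATES) < tol:
            return new_values
        values = new_values

values = value_iteration()
# A deterministic policy: in each state, pick the action with the best value.
policy = {s: max(ACTIONS, key=lambda a: q_value(s, a, values)) for s in STATES}
print(policy)
```

Note that the resulting policy is a plain lookup table from states to actions, even though the transitions are random. When \tau is unknown, the player cannot run this computation directly and has to estimate the same quantities from play.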

Let’s step back a bit again so we don’t lose sight of the big picture and see what kinds of problems can be cast as a reinforcement learning problem. We already saw in the beginning that life is a reinforcement learning problem.

A variety of scientific fields can in fact be cast as reinforcement learning. For example, all of physics is a reinforcement learning problem as follows: let S be the set of all possible states of the universe; let A be the same as S, i.e., our action is simply a prediction of what the next state is going to be; let \tau depend only on the current state and not on the action; and finally, let the reward be high if our prediction was close to the true next state and low otherwise. Clearly, the optimal policy is one that makes accurate predictions about the universe’s future, which is exactly what the laws of physics aim to do. So if one claims that reinforcement learning is a solvable problem in general, they are also claiming that the field of physics doesn’t need to exist any more.

One can also very easily cast “making money” as a reinforcement learning problem. Of course, since life itself is reinforcement learning, it’s no surprise that making money is too. But more specifically, one can pick the set of actions to be the set of all investments one can possibly make at any given time. The optimal policy in this case would be the one that maximizes the return on investments. Thus someone claiming to have solved reinforcement learning has also solved all of finance and the hedge fund industry doesn’t need to exist any more.

In fact, the problem of reinforcement learning is so general that it’s a surprise that it’s solvable at all! And yet, some recent research has shown remarkable success at solving it in specific domains. For example, if you haven’t been living under a rock, you have probably heard of DeepMind’s algorithm that managed to learn all of Go, chess, and shogi without having been given any domain knowledge beyond the rules of each game. Merely by playing the games over and over again and observing the rewards, the algorithm figured out how to beat the world champion programs in all three games. That’s getting closer and closer to solving all of life!

In this series, we will explore the frontier of this very exciting field. We will delve deep into some of the recent work and try to develop an understanding of what makes reinforcement learning solvable in certain domains and what can be done to make it solvable in others.


The two most important lessons in personal finance

I am gradually learning how to create clickbait titles. I think this one is a step closer.

There’s a lot of advice available on managing your wealth and most of it is rubbish. After thinking over several years about money, I have come to the conclusion that the following two lessons are the most important ones. Ignore everything else and abide by these lessons.

Lesson #1: Focus on earning; not on saving.

The following fact is true: the quality of one’s life is better measured by the total amount of money one spends as opposed to the total amount of money one saves. You enhance your life by spending money on life-enhancing experiences, not by stashing it in a bank account. And yet, 99% of advice on managing your money focuses on how to spend less on your groceries.

Just do the calculations. To make things slightly simpler, say you are 18 years old right now so that lots of career options are available to you. Let’s think about two versions of yourself. The first version puts serious effort into finding the best career for your skill set and the second one half-asses an average career. What’s the difference between the average amount of money the two versions make per year? For any given skill set, if you start seriously optimizing your career at the age of 18, you can easily rake in upwards of a few hundred thousand dollars per year averaged over your lifetime. But with a half-assed career you will likely never make more than $100k. So that’s a difference of a few hundred thousand dollars per year. Now, let’s think about how much money you will save by carefully optimizing the grocery store you shop at. Suppose you shop once per week and save $100 every time you shop because your arduous research has found you the cheapest grocery store in town. Since there are about 50 weeks in a year, that’s $5,000 saved in groceries. Let’s be generous and multiply this number by five because you are doing the same level of optimization in your phone plan, your haircuts, your clothes, and food. That’s still a mere $25,000. Comparing this to the rewards of optimizing your career properly, it’s clear which one is worth your time.
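Here is the same back-of-the-envelope calculation written out. The $300k figure is an assumption standing in for “a few hundred thousand dollars”; every other number comes from the paragraph above.

```python
WEEKS_PER_YEAR = 50
savings_per_trip = 100

grocery_savings = savings_per_trip * WEEKS_PER_YEAR  # $5,000 per year
frugality_savings = 5 * grocery_savings              # $25,000 per year

half_assed_income = 100_000
optimized_income = 300_000  # assumption: "a few hundred thousand" taken as 300k

career_gain = optimized_income - half_assed_income   # extra dollars per year
print(career_gain / frugality_savings)               # how many times larger
```

Plug in your own numbers for `optimized_income` and `savings_per_trip` to see how the ratio moves.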

Things might be slightly different if you are older because your career is not very flexible any more. But most people spend way more time trying to save than trying to earn. I think at any age there is some reward to be drawn from shifting your focus from saving to earning.

In fact, for the benefit of the reader, let me make this really clear. I think even if you spend absolutely zero mental energy on saving on groceries, phone, cabs, and food, but are aggressively optimizing your career, you will do orders of magnitude better financially than someone who is aggressively finding the best deals in town but just picks whatever career comes his way.

Lesson #2: Realize that there are many different forms of wealth and most of them are interconvertible.

Once a friend of mine told me that he only spends money on things that appreciate in value. This was his justification for not buying a car. Of course, if you buy a car and then try to sell it back, you will get less money than its original price. Thus with time, its value decreases. But if you buy Apple stock, it’s very likely that you will be able to sell it for a higher amount in the future, thus making more money. So isn’t it obvious that you should spend money only on things like Apple stock and never on things like cars?

The flaw in this argument is that it’s overly focused on one specific kind of wealth, i.e., money. Wealth comes in many forms, including time, convenience, happiness, relationships, pleasure, luxury, status, knowledge, expertise, etc. Sure, buying a car will reduce your monetary wealth, but that’s only because the monetary wealth is being converted into (a) time, because you will save on your commute, (b) convenience, because now you won’t have to carry your groceries in the subway, and maybe (c) status, in case it’s one of those cars that enhance your status. Why should we look at these other abstract kinds of wealth? Because of the inter-convertibility of one form of wealth into another! Even if you only care about money, it will be easier for you to earn more money in the future if you have time, convenience, and status in your hands now.

Many people ignore this when they drive 3 hours to go to a city where the discounts are 5% higher on clothes. Should you stay in your city and pay 5% extra or go to the next city and save it? Stated this way, it seems clear that you should drive to the next city. But the statement above does not represent reality. The real comparison is between 3 hours vs. 5%. Depending on 5% of what, the 3 hours might be more expensive.
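To put a number on “3 hours vs. 5%”, here is a small sketch. The value of $30 per hour of your time is purely an assumption; substitute your own.

```python
def break_even_purchase(hours, hourly_value, discount_percent):
    """Purchase size at which the discount exactly pays for the time spent."""
    return hours * hourly_value * 100 / discount_percent

# Assumption: an hour of your time is worth $30 to you.
threshold = break_even_purchase(hours=3, hourly_value=30, discount_percent=5)
print(threshold)
```

Under these assumptions the drive only pays off if you plan to spend more than `threshold` dollars on clothes, and that is before counting fuel.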

Similar reasoning can be applied to going to elite universities. On the face of it, it’s just a sink of money. But why do people still go to university? Because even though you spend money on your tuition fee, you gain wealth in the form of a network of future successful people, and expertise in economically valuable skills. This wealth can later be converted into other forms of wealth, including money!



Why does art exist and what role does it play in the society? Let’s get the basics out of the way first. If you read Robin Hanson’s blog, you know that art’s primary role is signalling, just like any other human activity. But signalling what exactly?

My current theory is that there are two kinds of art based on what it’s used to signal. First, there are the photorealistic painters, the sculptors at Madame Tussauds, guys that can play the piano with their feet, or those that can play 20 notes per second, and people who specialize in writing poems that are also palindromes. These people distinguish themselves by achieving something that others can’t. And it’s in human nature to assign higher status to people who can do things that you can’t. So this kind of art simply serves the purpose of enhancing the artist’s status.

Is it really true, though, that you get higher status by simply being able to do things others can’t, no matter how arbitrary the thing in question is? I think the answer is yes and the most obvious testimony is the existence of sports. The whole institution of sports relies on competing against each other based on completely arbitrary rules, and yet it is able to elicit an enormous amount of passion all over the world. In fact, speaking of sports, I personally prefer to call the kind of art mentioned above a kind of sport, and not art. But the world considers it art, so it deserves inclusion in this article.

The second kind of art is everything else. From music, to paintings, to poetry, I think all of it is used for a specific kind of signalling, namely signalling one’s allegiance to a specific community. It’s kind of like a secret handshake, which is why leaving things open to interpretation is so popular in art. If you specify exactly what your art means, it will be very easy to feign interest and thus membership in your group. But if you leave lots of things open to interpretation and someone still “gets it”, then there’s a high chance that their thought patterns resemble those of the community your artwork is designed to test membership for.

Every community has associated with it a specific genre of art that its members revere. Liking that genre is usually essential to gain membership into that community. This is why one’s taste in art is so influenced by peer pressure. Peer pressure is nothing but a collection of membership tests. One’s willingness to be a member of the group creates the pressure to pass those tests.

A study done a few years ago revealed a correlation between liking classical music and having a high IQ. My theory provides an explanation. People with high IQs want to be associated with the community of intellectuals and classical music is popular in that community. So a newly minted intellectual will try to make himself like classical music so that he solidifies his membership in the group.

An argument against universal healthcare

The following argument shows that the most extreme form of universal healthcare, i.e., the one where every individual is given exactly the same kind of healthcare, is impossible, assuming you also want to provide the best healthcare. Universal healthcare can obviously be achieved if you don’t care about its quality. For example, no healthcare at all is horrible, but it’s at least equal for everyone.

So, for contradiction, suppose it’s possible. Then there is a hospital that treats its patients on a strictly first-come first-served basis. No matter how much money you have or how important you are, if you come in last, you will be treated last. It is easy to see that this hospital provides horrible healthcare. Why? Imagine that there’s a queue of a hundred patients and then a sick doctor comes in. If they treated the doctor first, they would have one extra doctor to now treat the rest of the patients. So by not treating the doctor first, they are providing inferior healthcare to the people already in line.
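A toy calculation makes the queue argument concrete. The model is an assumption on my part: the hospital starts with a single doctor, every treatment takes one hour, and the sick doctor can start working the moment he is cured.

```python
def finish_times(num_patients, num_doctors, start=0.0, treat_time=1.0):
    """Hour at which each patient in the queue has been treated."""
    return [start + treat_time * (i // num_doctors + 1)
            for i in range(num_patients)]

# Strict first-come first-served: one doctor works through all 100 patients.
fcfs = finish_times(100, num_doctors=1)

# Doctor first: one hour is spent curing him, then two doctors share the queue.
doctor_first = finish_times(100, num_doctors=2, start=1.0)

print(sum(fcfs) / len(fcfs), max(fcfs))
print(sum(doctor_first) / len(doctor_first), max(doctor_first))
```

In this model the average patient and the last patient are both treated much sooner when the doctor jumps the queue, although the very first patient in line does wait an extra hour.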

This argument can be gradually extended from the doctor to, say, a doctor’s secretary. If a doctor’s secretary is sick, and the doctor needs the secretary to take care of some tasks before he can start treating patients, it seems like a good idea to treat the secretary before the other patients. How about the secretary’s secretary? You can imagine where this is going.

So essentially, even if your only aim is to maximize society’s health, some people must get better healthcare than others. Equal healthcare and optimal healthcare cannot exist simultaneously.

A new kind of advertising

Creating a successful business involves three steps:

  1. Create a product that people may want to use.
  2. Find people that may want to use it.
  3. Convince them to start using it.

If you live in a tribe, you pretty much know everyone who lives there. So #2 is not an issue for you. You already know everyone who could possibly want to use your product. And if this number is too low, you wouldn’t build the product to begin with. Thus tribal advertising is all about #3.

But once you have access to a bigger audience, such as the one that the internet provides, #2 starts playing an important role and might even take over a large share of #3’s responsibilities. If you have modest goals (for example, if you just want to find 1,000 true fans), it’s probably mainly about doing #2 right. If you have a decent product, what are the chances that out of a population of a billion, not even 1,000 want to buy it?

Interestingly, advertising is not just about building businesses. Building a career, making friends, and dating, all depend on how well you advertise yourself, and the process can once again be broken down into three steps:

  1. Work on yourself to make yourself awesome.
  2. Find people who may think you are awesome.
  3. Convince them that you are awesome.

The definition of awesome will depend on the context. Once again, the internet might be responsible for a gradual shift from #3 to #2. If you live in a tribe, then becoming the cool person of the tribe will mostly involve moulding yourself to fit into the social customs dictated by the tribe. But with access to a much bigger audience, it may be possible to become a celebrity by aggressively looking for people who already value the qualities you have.

The Nash equilibrium of online dating

(I got the idea for this post while listening to Tim Ferriss’ podcast with Samy Kamkar.)

Following in the footsteps of Hollywood, let me put together John Nash and dating once again, although in a much less romanticized way.

I used to think that guys are the ones who have a horrible time on online dating websites while girls simply sit back and enjoy the overwhelming attention. But it turns out, I was wrong. Members of both genders are generally utterly disappointed by the outcome of their online dating experiments. So then why isn’t anyone doing something about it? I think it’s because the online dating market has simply settled into a shitty Nash equilibrium. Let me explain. But first, here’s some quick background on Nash equilibrium for you.

Nash equilibrium is an abstract mathematical concept that captures a pattern seen in many social situations. Often people in a society will behave in a way that’s bad for each individual (including themselves), even though, in principle, if they all got together and collectively chose to behave differently, every member of the society would be better off. A modern example is global warming. A simpler but more abstract example is the prisoner’s dilemma. In all these cases, a Nash equilibrium is the equilibrium towards which society converges. It is interesting whenever the equilibrium towards which society converges leaves its members much worse off than a state that could have been reached by a collective decision (or an external intervention such as that by a government).

The rule to spot a Nash equilibrium is easy. For each individual in the society under investigation, you need to check the following: assuming every other person’s behaviour remains unchanged, is there any change in this individual’s behaviour that would make him better off than he is now? If the answer is no for each individual, the present state is a Nash equilibrium.
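The check can be run mechanically on the prisoner’s dilemma mentioned earlier. The payoff numbers below are a conventional textbook choice (think of them as negative years in prison), not something taken from this post. The code uses the standard form of the rule: a state is a Nash equilibrium if no individual can become better off by changing only their own behaviour.

```python
from itertools import product

ACTIONS = ["cooperate", "defect"]

# PAYOFFS[(a1, a2)] = (payoff to player 1, payoff to player 2); higher is better.
# Illustrative numbers only.
PAYOFFS = {
    ("cooperate", "cooperate"): (-1, -1),
    ("cooperate", "defect"):    (-3, 0),
    ("defect", "cooperate"):    (0, -3),
    ("defect", "defect"):       (-2, -2),
}

def is_nash(profile):
    """True if no single player gains by changing only their own action."""
    for player in range(2):
        current = PAYOFFS[profile][player]
        for alternative in ACTIONS:
            deviated = list(profile)
            deviated[player] = alternative
            if PAYOFFS[tuple(deviated)][player] > current:
                return False
    return True

equilibria = [p for p in product(ACTIONS, repeat=2) if is_nash(p)]
print(equilibria)
```

The only state that survives the check is the one where both players defect, even though both would be better off if they both cooperated. That gap is exactly what makes an equilibrium “shitty”.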

Next, some background on online dating. In case you don’t know what happens on online dating websites, here’s a quick summary. Guys spam girls with hundreds of average or low quality messages, and girls sift through the deluge of “hey”s and “wanna fuck”s mining for some semblance of attraction. Since this process is frustrating, most messages get ignored. End result: guys frustrated about spamming and not getting replies, and girls frustrated about receiving spam and not having time or patience to reply.

Is this a Nash equilibrium? Consider some guy named Bob trying online dating. Considering that every other guy is sending out hundreds of messages per week, Bob can’t afford to send out the more natural 5-6 per week, unless, of course, he comes up with the magic message that makes every woman swoon. Assuming such messages do not exist, Bob must participate in the spam race to keep a decent standing. Or in other words, if he chooses to deviate from the strategy of spamming a hundred women per week, he will be worse off, implying this is a Nash equilibrium indeed.

Things could be improved if all guys got together and promised each other to only send out 5-6 well thought out messages per week. The girls would be happy to see a neater inbox and would feel motivated to reply to a higher percentage of the messages. This would make both guys and girls happy.

But since such a collective decision is impossible, a better approach is external intervention. If the dating website itself constrained its users to send not more than 10 messages per week, or if it charged money for sending messages, things could improve.

The CEO paradox

This post is half-baked, and is perhaps just nonsense, but hear me out for a bit.

If there is a useful skill that can be defined precisely, then any competitive economy will converge to a point where that skill can be outsourced. This means the highest paid employees will always be the ones with the most vaguely defined skills.

People go on and on about why CEOs get so much money even though what they do is not even very clear. The ambiguity of their skillset is perhaps the reason why they get paid so much.

Economics 101: Types of assets

(Disclaimer: I know nothing about economics.)

In the last post, I defined assets as anything that can be owned. Interestingly, this concept needs examples to be fully understood. In this post I will provide some examples.

Cars, houses, food, clothes etc. are obvious examples. These are all tangible objects you can hold in your hands. A vacation package to Hawaii is less tangible but no one will have problems accepting it as an asset. Let’s move quickly to the non-obvious.

A loan is an asset. Let me explain why. If you give out a $1,000 loan to your friend Bob at a 10% interest rate, you now own the right to receive $1,000 + interest back from Bob. This right is an asset, and therefore, you can do things with it that you can do with other assets. For example, you can sell it. Whoever buys it will get the right to receive $1,000 + interest from Bob. What value should the asset be priced at? That depends on many things including how reliable Bob is, what interest rates the banks are charging on loans, and how good a deal you can negotiate in your trade.
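As a very rough sketch of how those factors could combine, one can discount the expected repayment at the going market rate. This formula is a simplification I am assuming for illustration: it treats the loan as a single repayment due in one year, and the 90% repayment probability and 5% market rate are made-up inputs.

```python
def loan_price(principal, interest_rate, repay_probability, market_rate):
    """Expected repayment in one year, discounted back to today."""
    promised = principal * (1 + interest_rate)    # what Bob owes in a year
    expected = repay_probability * promised       # adjust for Bob's reliability
    return expected / (1 + market_rate)           # compare with lending elsewhere

# Assumptions: Bob repays with probability 0.9, banks offer 5%.
print(round(loan_price(1000, 0.10, repay_probability=0.9, market_rate=0.05), 2))
```

A less reliable Bob or a higher market rate both push the price down, and whatever you actually get still depends on the negotiation.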

Similarly, many other contracts you sign with people can be considered an asset. A share of a stock is another slightly non-trivial kind of asset. If you own 1% of the shares of a company, you essentially own 1% of the company. Once again, you can exchange this asset with other kinds of assets and the price will depend on many factors including how valuable the company itself is.

The idea of treating contracts as assets has been used extensively in recent years in the form of something called derivatives. Derivatives are assets that are derived from other assets. For example, now that we have seen that a share of a stock is an asset, one can define a new asset called a “call option” which is the right to buy a share at a certain price at a specific future date. A call option is a derivative that’s derived from a stock. Being an asset, this contract can be bought and sold in exchange for other assets and its price is determined by several different market parameters interacting in complex ways. Similarly, for any kind of asset, you can sell the promise to buy or sell the asset at a future date at a specified price and that will be an asset of its own.
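Pricing an option before its date is the complicated part, but what it is worth on the date itself is simple: you exercise the right only if the share trades above the agreed price. A minimal sketch, with made-up prices:

```python
def call_payoff(share_price, strike):
    """Value of a call option on its exercise date."""
    # Exercise only if buying at the strike is cheaper than buying in the market.
    return max(share_price - strike, 0.0)

# Hypothetical numbers: the right to buy one share at $100.
for price in (80, 100, 120):
    print(price, call_payoff(price, strike=100))
```

The payoff is never negative, because the contract is a right and not an obligation.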

Next, we turn our attention to a unique kind of asset: money. Money is simply an asset whose role is to make trades easier to execute. Any object that a large community has collectively agreed to value can be considered money. Imagine you live in a society where money has not been invented yet. If you own lots of horses and you need some rice, in order to exchange horses for rice you will need to find someone who has extra rice and is in need of horses. Or if you are good at sales, you can sell your horses for rice to someone who doesn’t care for horses using the following sales pitch: “Why don’t you take my horses, give me your rice, and then sell these horses to someone else who has the thing you want? And since horses are in demand these days, selling them shouldn’t be a problem.”

This might work if horses were truly in demand (or if you were excellent at sales). This is where money comes in. This is what distinguishes money from other kinds of assets. There is some sort of probabilistic guarantee that money will always be in demand in the sense that anyone you talk to will be willing to accept your money in exchange for something else. How such a guarantee comes into existence is an interesting issue that will require a blogpost of its own. But for now, we will use this as our working definition for money: an asset that a large community has collectively agreed to value.

Finally, another interesting and often ignored asset is time. Everyone owns a certain amount of time and the thing that distinguishes time from other kinds of assets is the property that you cannot choose to not spend it. In fact, it is constantly being spent at a rate of one second per second and you can only choose what you are going to spend it on. Unlike money, for example, you cannot save it in a bank.

Once again, being an asset, it can be exchanged for other assets. The most explicit way of doing that is to take up a contractual job that pays you an hourly wage. But more indirectly, you are always converting time into other kinds of assets. If you spend four years at university getting a degree in computer science that later lands you a $100k job, you have in some way converted those four years into some sort of wealth.

At this high a level of abstraction, things become confusing and it’s not always clear exactly what trade is being executed, but I think there is a certain advantage in thinking of everything as an asset and every activity as a trade. Hopefully this will become clearer in a future post.

Economics 101: A Basic Model

(Disclaimer: I know nothing about Economics.)

From the point of view of economics, the world consists of:

  1. A set of people.
  2. A set of assets.
  3. A set of ownerships between people and assets.

#2 and #3 deserve some explanation. #1 should be clear to you if you are not an alien.

What is an asset? I want to make the definition very general and say that anything that can be owned is an asset. So, a pair of shoes is an asset, an iPhone is an asset, and so is a car. Time is also an asset, although a more subtle one. I will come back to this later and provide some more non-trivial examples of assets.

An ownership is just a pair consisting of one person and one asset. Given a set of ownerships, each person will be a part of several pairs and the set of assets in those pairs will be the assets owned by that person. A society where there is a mechanism to enforce ownerships will be amenable to various laws of economics. What do I mean by enforcing ownership? That the set of ownerships remains unchanged unless two people decide, by mutual consent, to exchange a few assets that they own. An exchange of assets is usually called a trade.
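The model so far is small enough to write down directly. In this sketch the names `alice`, `bob`, `horse` and `rice` are hypothetical, and a trade is restricted to swapping one asset for one asset to keep things short.

```python
# The set of ownerships: each element is a (person, asset) pair.
ownerships = {("alice", "horse"), ("bob", "rice"), ("bob", "shoes")}

def assets_of(person, ownerships):
    """All assets that appear in a pair with this person."""
    return {asset for owner, asset in ownerships if owner == person}

def trade(ownerships, a, asset_a, b, asset_b):
    """Return the new set of ownerships after a and b swap one asset each."""
    if (a, asset_a) not in ownerships or (b, asset_b) not in ownerships:
        raise ValueError("each party must own what it gives away")
    given_up = {(a, asset_a), (b, asset_b)}
    received = {(a, asset_b), (b, asset_a)}
    return (ownerships - given_up) | received

after = trade(ownerships, "alice", "horse", "bob", "rice")
print(assets_of("alice", after), assets_of("bob", after))
```

Enforcing ownership then means that `trade` is the only way the set is allowed to change.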

If a person combines a few of the assets he owns to create a new asset, the new asset is automatically considered to be owned by him. So if you have the ingredients for making ketchup and you use your time, which is an asset owned by you, to make ketchup, the ketchup will be considered to be owned by you.

The meaning of consent needs clarification. It’s one of the trickiest things to define and I am not going to claim that I have the most satisfactory definition. Use of physical force is clearly not consent. It’s robbery. But how about using mental tricks? It seems clear that if you drug someone and steal their wallet, they did not consent to give you their wallet. However, what if you use advanced marketing tricks to convince them to empty half their wallet for a weight-reduction armwear that clearly doesn’t work? That sounds like consent.

I am going to completely sidestep the task of defining consent by just stating its desired property. I will say that any trade that ends up increasing the perceived utility of both parties was carried out by mutual consent. That is, both parties involved in the trade should think that the trade improved their lives. As long as this happens, we will say that the trade happened with mutual consent. Note that this definition fits the examples discussed above. In case of physical force, the party on whom the physical force was applied definitely doesn’t think they are better off now. But in case of using marketing tricks, the person who bought the weight-reduction armwear does believe the purchase to be progress. In fact, that’s the point of marketing: to convince the customers that buying the product will make them richer, happier, sexier, and healthier.
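Stated as code, the test looks like this. Representing perceived utility as a single number per person is an assumption made for the sketch, and the numbers themselves are invented.

```python
def is_mutual_consent(perceived_before, perceived_after):
    """True if every party believes the trade left them better off."""
    return all(perceived_after[person] > perceived_before[person]
               for person in perceived_before)

# Robbery: the victim feels worse off afterwards.
robbery = is_mutual_consent({"robber": 5, "victim": 5},
                            {"robber": 8, "victim": 2})

# Marketing: the buyer believes the armwear improved their life.
marketing = is_mutual_consent({"seller": 5, "buyer": 5},
                              {"seller": 7, "buyer": 6})

print(robbery, marketing)
```

Note that the check only looks at what each party believes, not at whether the belief is correct.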

With these definitions in place, if we assume that every individual has sufficient resources (intelligence and information) to figure out what’s good for them in the long term, then we get a simple and elegant model of governance: just make sure that all trades are carried out by mutual consent. The assumption that each individual knows what’s good for them is essentially saying that the perceived utility is always the same as the correct utility, and if that’s true, every trade must improve the correct utilities of both parties involved.


Protein deficiency will kill you; as will protein excess. Same is true of any nutrient. As you increase the intake of a nutrient, its utility first increases, then reaches a plateau, and eventually starts to decrease. Once it’s in the negative, it has potential to kill you.

The ideal amount of protein to consume per day is somewhere between 200 and 600 calories (~50 to ~150 g).

Interestingly, proteins help in both losing weight and gaining muscle mass. Proteins have a satiating effect; so people consuming low amounts of proteins feel more hungry and eat more food, thus consuming more calories. This effect plateaus around 15% protein intake.
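The conversion between calories and grams uses the usual figure of about 4 calories (kcal) per gram of protein. The 2,000-calorie daily intake below is just an assumed example for the 15% figure.

```python
KCAL_PER_GRAM_PROTEIN = 4  # standard approximation

def protein_grams(protein_kcal):
    """Convert calories from protein into grams of protein."""
    return protein_kcal / KCAL_PER_GRAM_PROTEIN

print(protein_grams(200), protein_grams(600))  # the suggested daily range

# Assumption: a 2,000-calorie day, with 15% of calories from protein.
daily_kcal = 2000
print(protein_grams(0.15 * daily_kcal))
```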

Proteins also signal to the body that there’s enough food in it and so it can focus on muscle growth. This is why higher protein intake helps grow muscles. Note, however, that muscle growth is mostly supported by a high calorie intake. As long as you are near the higher end of the 200-600 calorie spectrum, increasing protein intake will not help grow muscles. But increasing calorie intake helps, no matter what the composition of the calories.