Stable systems and managing expectations

In his book Sapiens, Yuval Noah Harari observes that civilizations in the past fell not because of one error in judgement, but because of multiple such errors. For instance, many empires in ancient China fell because they were not receptive to scientific and military advances from the outside, their bureaucracies had become stifling, and so on. I have often returned to this simple observation, and to how it relates to much of the world around us.

One way of re-phrasing what Harari was trying to say is that many systems around us are stable. They can absorb small shocks or errors and still remain relatively unchanged. It takes many errors and misjudgments to completely wreck them. For instance, when we drive on the road, we are generally sensitive to the drivers around us, even if we’re not actively paying attention. If a car suddenly swerves into our lane, we will almost involuntarily slam our brakes and try to control the outcome. Most times, we will be able to do so successfully. However, if we’re intoxicated while behind the wheel, our reaction time will suffer, and we may well crash into that car. Hence, being intoxicated or having a car swerve into your lane are not by themselves enough to get you into a car accident. Both of these conditions have to come together for you to have a decent chance of crashing your car. In other words, when on the road, you’re a stable system. It will take multiple misjudgments on your part or that of others to wreck your vehicle.

Can we also study anger and anxiety from a systems perspective? Imagine that you’re having a bad day at work. Your boss is breathing down your neck for no fault of yours, and the heater has also started malfunctioning, causing you to freeze in your seat for most of the day. Although this will surely put a damper on your mood, it is generally not enough to make you scream in agony. However, if on your drive back home you get into a fender bender, and on reaching home you realize that there is no electricity and most of the food in the refrigerator has gone bad, you will probably lose it and lash out at anyone or anything. It took multiple unfortunate circumstances or “errors” to take you from a sullen face to black rage. If instead, on returning home from work, you find your favorite sandwich waiting for you in the refrigerator, you will soon return to normal. Good food has partially compensated for the bad day at work, and things are alright again.

Hence, although I find myself getting anxious and angry every now and then for seemingly trivial reasons, I have now started thinking that it is a miracle that I don’t get angry or anxious more often. One bad thing is generally not enough to destroy my peace if other things are going well in my life. It is only when bad things line up- car malfunctioning, arguments with friends, problems at home- that I find myself losing my cool. There is probably an evolutionary advantage to not losing your composure over just one thing going wrong. Hence, our moods have evolved to become more stable over eons. It is only when multiple things go wrong that we don’t quite know what to do or who to blame.

Now a small digression: why does it sometimes seem like everything is going wrong for us all at the same time? When we have a bad day at work, we hope that we will at least get good food at home to compensate for it. We hope that we can relax with our partner, tell them about our problems, and that they will utter some comforting words which will make our miseries go away. Hence, our expectations from life become even more demanding than before. We don’t just want an ordinary day. We want a fantastic day after this in order to forget our troubles. When this does not happen, we get even more dejected, and think that along with a bad boss, we also have an unsympathetic partner and a complete lack of good food in our lives. Hence, one “error” in our life leads us to form narrow, unrealistic expectations, and when they’re not met, we feel that everything is going wrong with us.

So far so good. Our existences are sort-of stable. However, we must ask the following question: how can we become even more stable, and keep our composure even when very many things go wrong? I struggle with this question because I may have slight anxiety issues. Driving on the road is a struggle because I become nervous when there are cars around me. Even the slightest disturbance when I am working often disturbs my calm. When I’m watching TV, if someone talks to me, I get distracted and irritable. I think all this may be because I expect to have an easy drive with very few cars around me, complete silence when I work, and no one talking to me when I watch TV. Hence, when these expectations are proved wrong, I get nervous and irritable, as I can no longer control my surroundings.

Yesterday, while driving in seemingly erratic conditions, I tried to calm my nerves by expanding my range of expectations. I assigned small probabilities to multiple things that could go wrong. Perhaps a car would crash into me. Perhaps a car would swerve wildly into my lane. Perhaps I would hit an animal. Thinking about all these possibilities made me mentally prepared to deal with such eventualities. If I did hit an animal, because I had already assigned a non-zero probability to it, I would recover from the shock much more quickly, and would probably be able to deal with the situation much better.

This single act of thinking about eventualities calmed my nerves within a few minutes, and I had the rarest of experiences- an enjoyable drive back home, even amongst erratic drivers and the general highway craziness.

Hence, I would like to make the hypothesis that a lot of anxiety and anger may stem from the fact that our range of expectations is often too narrow. When reality doesn’t meet our very narrow expectations, we lose control of our peace of mind and the situation. If we spend time broadening our range of expectations, we will start anticipating more of the things that actually happen to us in real life, reducing shock, and hence anxiety, in the process.

Perceptual Control Theory

Perceptual Control Theory is a well-known model of how the brain works. Its basic premise is this: our perception of the world is not merely based on what our senses perceive. It is also based on what we “expect to perceive”. For instance, when I am in a car driving down a road, I “expect” to see some trees, perhaps some houses, people walking on the sidewalk, etc. Hence, on staring out of the window of the car, if I continue seeing just these things, the conscious part of my mind will zone out, while my unconscious brain will keep registering these things passing by the window. However, if I suddenly see a Royal Bengal tiger jumping on my car, the unconscious brain will make the conscious brain snap back into attention, and I will reflexively crouch, or perhaps shield my body somehow.

If I were to re-phrase what perceptual control theory says, I would say that it talks about expectation, and how that expectation aligns with the world around us. When I sleep, I expect to be surrounded by relative peace and calm. Hence, on hearing a loud noise, this expectation has been falsified, and I wake up in order to understand what is happening around me. I need to create a new model of the world, in which such disturbances are possible while I sleep. The good thing about these expectations and models of the world is that they can be corrected quickly and decisively. If I don’t expect to see snakes in my house, and I suddenly see a snake one day, my mental model of my house will change to include the possibility of snakes. I’ll now be more careful when I open cupboards and peer down drain pipes, in order to improve my chances of survival.

However, our brain also forms expectations and models of the world that cannot be corrected so quickly and decisively. Hence, we stick to these models for a long period of time, almost always to our detriment. Why am I writing about this? Because I have often made mental models of the world that were incorrect, and proved to have significant impacts on my life in terms of career, social life, etc. In this article, I will talk about one particular example.

I have a mental model of the world which is perhaps more aspirational than realistic. When I pick up a book, or a paper, I expect that if I really focus, I will be able to read it within a few hours, and well within a day. This is my expectation. However, reality is different. Time after time, I have noticed that I am not able to read more than a couple of pages of a mathematics paper or textbook in one sitting. It is possible that I am slower than most, or maybe this is the pace at which everybody reads. However, I refuse to accept that this is fundamentally true for me, because I have ambitions of blazing through lots of books and papers in no time, and doing good research as a consequence. Hence, one part of my brain has an expectation that aligns with my lofty goals, and another part of my brain knows that this is not possible.

What happens as a result of this incoherence? When my “expectation” and “reality” are different, instead of crouching like I did when I saw the tiger, I decide to not read that book or paper at all. I decide to do something completely different, like reading another book. Note the difference: sense perception forces us to change our model of the world much more quickly and easily. If I expect to not see a tiger on the road, and I see a tiger, my model of the world changes instantaneously to include the possibility of tigers on the road. If I expect to see Trump-Pence banners on the roads, and I see Biden-Harris banners instead, my model of the world quickly changes to accommodate the fact that there may be many Democrats in my neighborhood. However, when I have two models of productivity in my brain, even though one may have resulted from past experience, it is very difficult to abandon the aspirational model for the more realistic model.

In this case, the aspirational and ambitious model in my brain suggests that I should be able to blaze through thick tomes in no time if I really focus. However, the more realistic model, which has been informed by past experiences, suggests that on average, I do not read more than 4 pages of math in a day. At this stage my brain does two things: it says that the former model may still not be wrong, because I didn’t really focus when I read only 4 pages. When I do, I will obviously be able to read more. This is perhaps similar to “True Communism has never really been tried. When the one true Communist establishes a state, humanity will progress like never before.” The other thing that my brain does is that it recognizes that I will probably not be able to truly focus and read large parts of this book. Hence, in order to avoid potentially falsifying my ambitious model, it forces me to read something completely different, through which I can escape judgement.

I have suffered because of this for a long time. I don’t work on homework assignments until the last moment because I think that I will be able to do them the day before class. I don’t work on papers on time because I think that I will be able to do them at the last minute. I don’t read the texts I am supposed to, because I cannot find a way to falsify the ambitious part of my brain.

Yesterday, I attended a productivity workshop. In that workshop, one interesting perspective that I gained was that I need to manage my expectations. To think “I will be happy if I am able to read 4 pages of this book” as opposed to “I will be happy only when I am able to read this whole book in a short time”. Although it may sound counterintuitive, lowering expectations over the short term may lead to much better results in the long run. Perhaps managing expectations is an extremely important part of my life that I never quite recognized before. If I have a model of the world, I should write down the circumstances in which I will accept it as an accurate model of the world, and also the circumstances in which I will abandon it. In this case, I should write down that if I am not able to read 50 pages of a textbook in a day, I will probably never be able to do it, and hence this expectation is false. I should revise my expectations, until my model of the world aligns with the actual world.

In some ways, this is akin to the scientific method. And it is equally powerful when falsifying theories of black magic, as it is in falsifying delusional theories of the self.

Learning as a process of re-labeling

Disclaimer: This article is highly speculative, and based on my own experiences and a couple of articles I might have come across. I will be happy to remove it when I come across scientific evidence that contradicts it.

One defining feature of smart people is that they learn things fast. You tell them a concept or idea, and they’re able to understand and implement it much faster than the average person. Stupid people take much longer to understand an idea, assuming that they’re ever able to completely understand the idea. As someone who has been stupid for most of his life, who has only recently begun to be slightly “smarter”, I feel that I can shed some light on what might be causing this discrepancy.


On a naive level, it is not difficult to believe that “simple” things are easy to understand, and that understanding “complex” things takes much more time. For example, it might be easy for us to understand that a car can travel faster than a human, but difficult for us to understand what Maxwell’s laws are saying.

What makes things “simple” or “complex”? Although there exist several ways of classifying things as simple or complex, my proposed definitions for both concepts are the following: things that we can refer to from sensory or emotional experience are “simple”, while things that are abstract and cannot be referred to from such experience are “complex”. For instance, an “apple” is a simple concept, because as soon as you say the word, you visualize a red, almost spherical object. You can remember what it smells like, what it tastes like, etc. “Moving fast” is a simple concept. You remember running on the ground, as well as traveling in a car. You implicitly know that cars travel faster than humans, because you have seen your car overtake humans on the sidewalk all the time. Maxwell’s laws, on the other hand, are a “complex” concept. You have no actual sensory or emotional experiences that result from Maxwell’s laws. You may remember them as mathematical formulae written in your textbook, or perhaps some diagrams involving charges, magnets, field lines, etc. Hence, Maxwell’s laws will always be a much more complicated concept to internalize than the concept of an “apple”.

What is “simple” for humans may not be “simple” for computers. For instance, you can easily program a computer to solve Physics questions based on Maxwell’s laws. However, it is much more difficult for you to tell a computer what an apple is. What an apple might look like, smell like, or perhaps taste like. Hence, this definition of “simple” only attests to the ease with which a person may learn a concept, and not to any inherent, universal simplicity.

Anecdotal experience

So where am I going with this? We remember and learn concepts better when we can perceive them through our senses or emotions, and re-label those concepts as the corresponding sensory or emotional experience. Let me try and explain this with a couple of anecdotes.

I have tried to learn Algebraic Topology and Algebraic Geometry for a very long time. I own multiple books, all filled with the correct formulae, and I have pored over them many a time. But after months and years of staring at those formulae, memorizing them, and occasionally repeating them in front of confused strangers, I often had to contend with the fact that I had no idea what the heck they were saying. Despite overwhelming evidence, I refused to accept the obvious inference- that I was stupid.

On the other hand, I came across strangers on the internet who had written lengthy lecture notes on these topics in high school! They had published research papers in prestigious journals, and were glorified everywhere. Sure. I could accept that they were smarter than me. Maybe even much smarter. But that much smarter? That they could learn things in high school that I could not as a PhD student at a decent grad school? I also recently had the opportunity to talk to a PhD student who had recently completed his thesis on Algebraic Geometry. He said that it was only after he had defended his thesis that he had begun to understand some basic concepts from his field. He felt that he should re-do all the problems from the very basic textbooks in order to really understand what was happening. I got the same reaction from multiple faculty members at a prestigious Indian research institute. Sure, I could be stupid. But all these very smart people, who had completed their degrees at some of the best institutes in the world, could not be stupid.

Now let us talk about music. I’ve always had a good ear for music. I picked up the guitar in class 7, and within a couple of months, I could play along with most Hindi songs that played on the radio. If you played me a chord, I could play it back to you in seconds. I thought that this was how most people learned music, and did not know how rare this was until I went to college. I was one of four people in my batch of 800 who was selected for the music club, and was often told that I had the best “ear” for music that they’d seen in years. Why was I good at music, whilst terrible at mathematics? Why wasn’t I uniformly stupid, or uniformly smart?

This was because of the following: when I listened to a chord, I felt an emotional experience. I felt either happy and “straight”, or mysterious and romantic, or about to launch into a speech, or some other complex emotion, and I would know right away that the chords being played were C major, A minor or F major respectively.

With Mathematics, I would have no such sensory or emotional feeling. I would see formulae on the paper that I would memorize, understand, and then soon forget. A particular example from Topology comes to mind: I have read and re-read about homology at least 50 times in my life. Perhaps more. I understand the formulae. The definitions. Where they come from. Calculating homology is the culmination of multiple mathematical concepts that come together beautifully. However, despite verifying and re-verifying this edifice very many times in my life, I had never actually “understood” what was happening. This was until someone told me that homology calculates the number of “holes” in an object. Since re-labeling the abstract concept of homology as the visual picture of the number of holes in an object, life has become much easier for me. Even if I can’t always calculate homology on my first go, I know that I am just calculating the number of holes in something, and then intuition takes over to guide me to the right answer.
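For what it’s worth, the “holes” picture matches the standard textbook computations. These examples are standard ones of my own choosing, not taken from the discussion above; the rank of H_1 counts the independent one-dimensional holes:

```latex
% circle: one connected component, one 1-dimensional hole
H_0(S^1) \cong \mathbb{Z}, \qquad H_1(S^1) \cong \mathbb{Z}

% torus: two independent loops, and one enclosed 2-dimensional cavity
H_0(T^2) \cong \mathbb{Z}, \qquad H_1(T^2) \cong \mathbb{Z}^2, \qquad H_2(T^2) \cong \mathbb{Z}
```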


I have never been particularly good at mental math. On multiple occasions, I’ve been asked “You study mathematics right? Calculate 34\times 47“. And on many occasions, someone else would have calculated it faster than me. Of course I could rebuff it by saying something like “mathematicians are not calculators. We study ideas”. But I never did. This is because I have tried to be better at mental calculations for a long time, and have always failed. Hence, unless I join an Abacus class or something, I will probably never be much better than the layperson in multiplying two digit numbers in my head.

And then of course come the people with synesthesia. On seeing numbers, some people with synesthesia have reported seeing colors or shapes. If you ask them to multiply 34\times 47 for instance, they would see something like a yellow disc swallowing a giraffe, and then turning into a hippopotamus. And as soon as they see the hippopotamus, they’d know that the answer is 1598. You don’t believe me? See this.

Why this discrepancy? This is because I don’t have a sensory or emotional response when I see a number. I just see it as something written on a piece of paper. However, a person with synesthesia has re-labeled various numbers as emotions. And those emotions help them calculate much faster than the usual multiplication algorithm would.

So how can we use any of this stuff?

Over the past year, since I decided to fire up this blog again, I have tried to learn very many different concepts from different fields. At the beginning, I understood things only at a superficial level. At the same time, of course, I was also having trouble understanding my own mathematical field, which Penn State might soon proclaim I am an expert at because I’ve done a PhD in it. What a load of drivel.

However, after failing to understand a particular mathematical paper despite reading it multiple times over a month, I started drawing things up on my iPad in multiple colors. In other words, I began relabeling those mathematical concepts as sketches on my iPad. From then on, whenever I read a concept, my brain would visualize the sketch that I’d made for it. The sketches didn’t even need to be accurate drawings of the concept. I just needed some visual representation. This was tremendously helpful. I soon began doing the same for concepts that I tried to learn from other papers, and I think that I now understand many concepts much better than before.

Of course, if I am able to relabel those mathematical concepts as some emotions inside my head, I would have an even easier time understanding, recalling and manipulating those concepts. However, until I figure out how to do so, a sensory (visual) representation would suffice.

So if you’re trying to learn something, understand it using reason and logic. But remember it by relabeling it as something that evokes a sensory or emotional response. If you’re anything like me (let’s hope for your sake you’re not), it might greatly boost your ability to learn new things much faster.

Machine learning and life lessons

Anyone who tries to draw “life lessons” from machine learning is someone who understands neither life, nor machine learning. With that in mind, let us get into the life lessons that I draw from machine learning.

Disclaimer: The idea that I am going to expound below is something I first came across in the book Algorithms to Live By. I thought it was a really impressive idea, but didn’t do much about it. Now I came across it again in Neel Nanda’s fantastic video on machine learning. Humans are hardwired to pay attention to an idea that they come across multiple times in unrelated contexts. If the same app is recommended to you by your friends and a random stranger on the internet, it’s probably very good and you should download it. Hence, seeing as I heard about this idea from two people I enjoy reading and learning from, I decided to give it some thought and write about it.

Minimizing regret

In programming as well as in life, we want to minimize our error, or loss function. When a neural network builds a model of the world from given data, it tries to minimize the difference between its predictions and the data. But what do humans want to minimize?

Humans want to minimize regret (also explained in the book Algorithms to Live By).

Regret often comes from choices that led to negative experiences. I read a fantastic interview of the famous programmer (and inventor of TeX) Donald Knuth, who said that in life, instead of trying to maximize our highs or positive experiences, we should focus more on minimizing our lows or negative experiences. What does that mean? Suppose you work in an office in which you win the best employee award every month. Clearly, your career is going well. However, your spouse is on the verge of divorcing you, and all your co-workers hate you. As opposed to this, imagine that you’re an average worker in an office setting whose career is not really that spectacular, but you get along with your spouse and co-workers. In which situation would you be happier and more satisfied? I bet that most of us would choose the second scenario without even thinking. Negative experiences stay with us for longer, and harm us more. Positive experiences provide self-confidence and happy nostalgia, but are overshadowed by negative experiences on most days. I’ve often thought about this statement by Knuth, and it keeps getting clearer and more relevant with time. Hence, humans will do well to minimize the regret that they might have accumulated from negative experiences.

Although regret often stems from negative experiences, it may also arise from actions not taken. A common example would be someone who really wanted to become an artist, but was forced by circumstances into a miserable profession (hello, MBAs). They would regret not pursuing their passion for a very long time.

Hence, a happier life is not necessarily one in which we have maximized the number of happy moments, but one in which we have minimized regret.

Gradient descent

So how does one minimize regret? I want to answer this question by discussing how a neural network performs gradient descent.

Imagine that your loss/error function is a graph in two dimensions, and you want to find the value of x (your parameter) for which this function is minimized.

If the graph given above is that of the loss function, it is minimized at the value of x where the function attains its global minimum. How does gradient descent work? We choose a starting point on the x-axis, say x_0, and then perform the simple iteration x_{n+1}=x_n-\alpha f'(x_n). If your starting point is near a local minimum, for instance, you will slowly be led towards that local minimum.

But wait. You wanted the global minimum, not a local minimum. How does one find it? What machine learning algorithms do is keep jumping between different starting points x_0, and perform gradient descent with values of \alpha that are large at first, but decrease with time. This approach generally works because real-life data that is fed to the neural network is “nice” (probably in the sense that the attractor well for the global minimum is larger than the wells for local minima). Hence, after a few jumps, we have a pretty good idea of where the attractor well of the global minimum lies. Now we can keep iterating the process of gradient descent until we reach the global minimum.
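The restart strategy described above can be sketched in a few lines. The loss function, its gradient, and all the constants below are invented for illustration, so treat this as a minimal sketch rather than how any real framework does it:

```python
import random

def descend_with_restarts(f, grad, restarts=20, steps=200,
                          alpha0=0.01, decay=0.99, domain=(-3.0, 3.0)):
    """Gradient descent from several random starting points x_0, with a
    step size alpha that is large at first and decays with time. The best
    point found across all restarts is returned."""
    best_x, best_val = None, float("inf")
    for _ in range(restarts):
        x = random.uniform(*domain)      # a fresh starting point x_0
        alpha = alpha0
        for _ in range(steps):
            x = x - alpha * grad(x)      # x_{n+1} = x_n - alpha * f'(x_n)
            alpha *= decay               # shrink the step size over time
        if f(x) < best_val:
            best_x, best_val = x, f(x)
    return best_x

# A made-up loss with a shallow local minimum near x = 1.9
# and a deeper global minimum near x = -2.1.
f = lambda x: x**4 - 8 * x**2 + 3 * x
grad = lambda x: 4 * x**3 - 16 * x + 3
```

A single descent that happens to start on the right of the bump settles into the shallow minimum; it is the restarts that give the deeper one a chance to be found.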

How does this have anything to do with life? The perhaps too obvious, but useful analogy is that we should keep trying new and different things. This is an age-old adage, but there seems to be a mathematical basis for it. Trying new things, like visiting a new part of town, trying meditation, talking to strangers, reading a book in a field you know nothing about, or learning a new language, is akin to having a new starting point x_0. Now you can perform gradient descent. Remove the aspects of that new thing that you don’t like. Minimize regret (I will probably regret focusing on learning French for six hours instead of writing my thesis). You might arrive upon a minimum that is better than your previous one (if not, you can revert to your previous state of life). If you do it enough times, chances are that you will find your global minimum, or your best possible life.


This ties in with another concept that was discussed in Algorithms to Live By– annealing. When Intel was trying to design the microchip processors that would power the modern computer, finding the right arrangement for each part of the processor was proving to be a mathematically intractable problem. How should these millions of different parts be arranged so that the processor is fast and does not generate too much heat? There were literally millions of parameters, and the brightest minds in the world had no idea what to do.

What one physicist suggested was the process of simulated annealing. What is annealing? It is the process through which metals are heated to very high temperatures, and then slowly allowed to cool. This relieves internal stresses, and leaves the metal in a more stable, ordered state. Similarly, the physicist suggested that they randomly arrange all the parts of the processor, and then perform small changes- freely at first, and more conservatively as things cool down- that would make the processor more stable and efficient. Soon, they arrived upon a design that was efficient and successfully powered the modern computer.
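A minimal sketch of simulated annealing follows. The cost function, the neighbor rule, and the temperature schedule are all illustrative assumptions of mine, not the actual chip-design procedure:

```python
import math
import random

def simulated_annealing(cost, neighbor, x0, t0=10.0, cooling=0.995, steps=5000):
    """Start from an arbitrary arrangement and repeatedly propose small
    changes. Improvements are always kept; worse arrangements are kept
    with probability exp(-delta / t), which shrinks as the temperature t
    cools -- like a metal settling into a stable state as it cools."""
    x, t = x0, t0
    best, best_cost = x, cost(x)
    for _ in range(steps):
        y = neighbor(x)
        delta = cost(y) - cost(x)
        if delta < 0 or random.random() < math.exp(-delta / t):
            x = y                        # accept the proposed change
        if cost(x) < best_cost:
            best, best_cost = x, cost(x)
        t *= cooling                     # cool down slowly
    return best

# A toy "layout quality" cost with many local minima.
cost = lambda x: (x - 2) ** 2 + 3 * math.sin(5 * x)
neighbor = lambda x: x + random.uniform(-0.5, 0.5)
```

The early high-temperature phase is what lets the search climb out of bad local arrangements; pure greedy improvement would get stuck in the first dip it found.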

How does this apply in one’s life? One possibility is resource allocation. How much time should I devote to working out, as opposed to studying or socializing? We can start at an arbitrary point- say I work out for 10 mins everyday, study for 5 hours and socialize for 2 hours. I can then change the parameters, in the same way that a metal slowly cools down. I should probably work out more, and not spend as much time socializing. Hence, maybe I can work out for 30 mins, study for 4 hours and socialize for 1.5 hours. I can keep tweaking these parameters until I reach an arrangement that is satisfactory for me and makes me happy.
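This parameter-tweaking loop can be sketched directly. The satisfaction function below- square-root (diminishing) returns on each activity, minus a penalty on total hours- is entirely made up for illustration, as are the starting numbers:

```python
import math
import random

def tweak_schedule(satisfaction, start, steps=2000, delta=0.25):
    """Greedy local search over a daily schedule: nudge one activity's
    hours at a time, and keep the change only if satisfaction improves."""
    schedule = dict(start)
    for _ in range(steps):
        activity = random.choice(list(schedule))
        candidate = dict(schedule)
        candidate[activity] = max(0.0, candidate[activity]
                                  + random.uniform(-delta, delta))
        if satisfaction(candidate) > satisfaction(schedule):
            schedule = candidate
    return schedule

# Hypothetical preferences: diminishing returns on each activity,
# and a cost of 0.5 "satisfaction units" per hour spent.
def satisfaction(s):
    return (math.sqrt(s["workout"]) + 2 * math.sqrt(s["study"])
            + math.sqrt(s["socialize"]) - 0.5 * sum(s.values()))

start = {"workout": 10 / 60, "study": 5.0, "socialize": 2.0}
```

Under these made-up preferences, the search drifts toward roughly one hour of working out, four hours of study, and one hour of socializing- i.e. “work out more, socialize less”, exactly the kind of adjustment described above.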

Hence, if you’re looking to live your best life possible, you should start doing things you never thought to do before, and then change small parameters until you reach a satisfactory situation. Then start all over again. After trying enough things, there’s a good chance that you will have figured out what makes you happiest and minimizes regret. You’ve figured out how to live.

And thus ends my Deepak Chopra-esque spiel.

Capitalism and coordination problems

We all know that the world’s problems can be solved if “all of us can come together and act as one….” you’ve probably dozed off by now. Of course this is true. And of course this never happens. But why not? What is the single most important reason that we cannot get together as a single planet and solve all our problems in an hour?

The reason is simple- working in a group is a complex coordination problem. A large group of people, with no personal ties or friendships, have to come together under a common umbrella and try to solve a problem together. No individual should shirk their share of the work, and everyone should contribute equally (or at least equitably). As anyone who has worked on a group project can surely attest, this never happens, because some members shirk their responsibilities, hoping that others will pick up the slack. The hardworking ones work hard for some time, and then realize that they’ve been given an unfair deal. Soon, they stop working as well. Sometimes, they keep working hard, but claim that they deserve “more” than what the slackers are getting- maybe they want more credit, or complete control of the project, etc. Soon, the group disintegrates, and the final outcome is substandard.

Coordination problems are responsible for not enough donations to politics (less than the amount of money that Americans spend on almonds each year, as explained in the linked article), not enough donations to Wikipedia (despite Jimmy’s constant threats and emails), our screwed up education systems, garbage on the roads, etc. Why donations, you may ask? How is that a coordination problem? Let me give a simple example. I want to stop poverty. I really do. When I see a Facebook ad for a project in Africa that is saving children, I feel a pang of conscience, and want to donate an amount. I hear crazy statistics, like: if enough people in the world donated $20, we would end global hunger in a day. Of course I can donate $20, no problem at all. However, most times I don’t end up donating. Why? Because I know that very few other people will donate, and my $20 will not make a difference at all. That is why most people don’t vote (their puny vote will not make a difference at all). What if I was convinced that most other people in the world would donate if I did? Would I donate then? Of course I would! But for that, a miracle would have to happen. The world really would have to come together to accomplish one purpose. Our greatest coordination problem would have to have been solved. Clearly, this is unlikely to happen in the near future. And because we are not able to get the world together and make everyone donate a small sum, we let hundreds of thousands of innocent children die every day.

In the book “The Precipice”, the author Toby Ord writes that the reason why the world is headed towards annihilation is that saving the world is a complex coordination problem. In order to stop climate change, reduce pollution, reduce the threat of nuclear winter, etc., all countries have to come together and make sacrifices. However, some countries keep polluting and manufacturing weapons of mass destruction. Why is that? Because the benefits of saving the world will be reaped by all countries, including the errant ones, while the benefits of misbehaving will only be reaped by the misbehaving countries. If Indian industries keep polluting while the rest of the world reduces its emissions, India will benefit by having a higher manufacturing output than others, and will also benefit from the cleaner air that has been achieved through the sacrifices of others. Hence, slacking is often a win-win tactic: slackers win short term, because they don’t have to play by the rules, and also hopefully long term, because other more responsible parties will have to pick up the slack and fulfill the objective, providing benefits for all. How does one reduce slacking in a group? How does one solve this omnipresent coordination problem?

Capitalism is the most successful idea in human history. It brought the world together, and almost singlehandedly improved the quality and duration of life for almost everyone around the world. This happened because the whole world did indeed come together and work as one. How did Capitalism solve the coordination problem that killed most other ideas like Communism? It did so by providing an incentive for each party in the world to do their job. If you do your job, you get money and power. If you don’t, you get nothing. Hence, if you’ve slacked off, you’ll be left behind by others who work hard and make money. This is something that you don’t generally get to see in group projects, or in things like Communism.

So how can we solve our great coordination problems? How can we really end poverty and hunger and climate change once and for all? I don’t know. But I think the trick will be to find a way to give an incentive to each individual person in the whole world. Does this incentive have to be money? Should the government be paying people to clean up beaches or repair their overly polluting vehicles? Possibly. But this might not be the whole story. Clearly, many governments have limited coffers that cannot bear a further burden. In that case, maybe they could share success stories of people cleaning up their neighborhoods on social media. Maybe they could offer to name benches in public parks after particularly generous donors to orphanages, etc. Finding the right incentives for a non-homogeneous population is often a difficult task. But perhaps the main point of writing this essay is that finding incentives is the most important thing that we can do to solve complex coordination problems. And it is only through solving these problems that we can solve major global issues, and perhaps hope to re-create the phenomenal success of Capitalism.

PTSD and Type II thinking

[Note: This article is highly speculative, and I will pull it down without warning if I find evidence to the contrary]

Reading many of Scott Alexander’s blog posts, which review basic neuroscience research, has given me the impression that PTSD and depression might fundamentally be neuroscience problems. Let me explain that below.

What does PTSD feel like, exactly? You had a bad experience a gazillion years ago. Maybe you were bullied, or had a bad breakup, etc. You wake up in the morning, and suddenly that is all you can think about. How this bad experience will keep affecting your future, how there is no escape, how you wish you could have behaved differently back then, etc.

This is your brain going in a downward spiral, sucking you into a pit of misery. Well, a part of your brain, anyway. Reading Alexander’s posts gives you the clear impression that your brain is not one cohesive entity. It is a bunch of different shit thrown together. And there is a part of the brain that is responsible for computation and logical inferences from known data. And this part of the brain is not engaged when you go into this downward spiral.

Imagine that you’re falling deeper and deeper into a well. You can stop this fall at any point by pushing your hands and legs against the walls of the well. But you’re just not able to. Mainly because you’ve forgotten about your hands and feet completely.

Similarly, when you fall into this self-denigrating pit of despair, you simply cannot remember to employ the part of the brain that will tell you that you’re overthinking this, and that your past trauma is just not relevant anymore. The world has moved on. How do you deploy this part of your brain?

Write things down. If one were to become slightly technical, there are two types of thinking that the brain employs: Type I and Type II. Type I thinking is the kind of illogical/intuitive/emotional thinking that can lead you into such pits. Type II thinking is the more deliberate, logical form of thinking that might save you. When we write things down, or perhaps explain our position to someone else, Type II thinking gets deployed. We may soon realize that the past event just isn’t relevant anymore, or at least not as important as we’re making it out to be.

Another way to think about it is that PTSD is just a failure of computation. Suppose you were bullied as a child. You keep thinking about that, slowly descending further into misery. But what if you deliberately compute the effect of those people in your life? Do these people live near you anymore? No. Do you work with them? No. Have they forgotten about it all? Probably. If you met them again, would they still be horrible to you? Probably not, and definitely not in a few years. When we compute these possibilities, we again deploy Type II thinking, which will lead us out of this spiral.

But why is the brain not always employing Type II thinking anyway? I’m a fairly intelligent person. I should be able to reason myself out of anything. The reason why the brain doesn’t always think in its logical mode is the same reason why the TV doesn’t turn itself on- no one pressed the button. Although the TV is fully capable of showing us our favorite channels, someone still has to switch it on. Similarly, for the brain to employ Type II thinking, we HAVE to write things down, or perform a computation. This is the only way that the logical part of our brain “switches on”. Without that, the brain will keep chasing us down the same rabbit holes that we’ve been haunted by for years.

IMO 2020, Problem 2

The following is a question from IMO 2020: show that for real numbers a\geq b\geq c\geq d>0 satisfying a+b+c+d=1, we have (a+2b+3c+4d)a^ab^bc^cd^d<1.

The first time I tried to solve the problem, I thought I had a solution, but it turned out to be wrong. I had wrongly assumed that a^ab^bc^cd^d would be maximized when a=b=c=d, which is commonly the case in Olympiad problems, but needn’t be true here.

I then looked at solutions available online, and realized that I just needed to show that a^ab^bc^cd^d\leq a^2+b^2+c^2+d^2. After doing that, I homogenized both sides, and tried to prove the statement. I was finally successful. I am recording my solution, as it is slightly different from the ones available online.
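The reduction above can be sketched in a couple of lines. This is the standard argument available online, not necessarily identical to my recorded solution, and the expansion details are omitted:

```latex
% Weighted AM-GM with weights a, b, c, d (which sum to 1),
% applied to the numbers a, b, c, d themselves:
a^a b^b c^c d^d \le a\cdot a + b\cdot b + c\cdot c + d\cdot d
                  = a^2 + b^2 + c^2 + d^2.
% It therefore suffices to prove
(a + 2b + 3c + 4d)(a^2 + b^2 + c^2 + d^2) < (a + b + c + d)^3 = 1,
% which follows by expanding the cube and comparing terms,
% using the ordering a \ge b \ge c \ge d.
```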

Quantum Computing

Today, I will be talking about quantum computing. I will be following Quantum Computing– Lecture Notes by Mark Oskin, who is a professor at the University of Washington, Seattle. These lecture notes are roughly based on the book Quantum Computing and Quantum Information by Michael A. Nielsen and Isaac L. Chuang.

So what exactly is quantum computing? Imagine that you’re trying to find the shortest route from your house to a restaurant. The way that a classical computer would solve this problem is that it would consider all possible routes from your house to the restaurant, and calculate the length of each route individually. It would then compare the lengths of all the routes, and give us the shortest one. A quantum computer, on the other hand, can determine all routes and calculate all route lengths at once.

What does that mean? Suppose there are a million routes, can the quantum computer calculate the lengths of each all at once? What about a trillion? This seems impossible, as it would imply that a quantum computer is infinitely faster than a classical computer. What a quantum computer actually does is that it performs all of these calculations, but then stores the results of these calculations in a superposition. This means that users cannot extract the lengths of all these trillions of routes all at once. Performing a measurement would collapse the wave function generated by a quantum computer into one such route. Hence, we may only be able to know the details of one route on making one measurement.

What use is a quantum computer then? We don’t want to find the length of each route by performing one measurement at a time. We just need the shortest route. If we can find a way to increase the probability of the wave function collapsing into the shortest route, we will have solved our problem. Although this process of increasing the probability of the wave function collapsing into the “right answer” may take some time, it generally takes much less time than a classical computer would. Hence, for some problems, quantum computers can provide exponential speedups over classical computers.

In short, quantum computers are useful because, for certain problems, they are far faster than any classical computer will ever be.

A quantum bit

A classical bit is a “storage space” on a computer, which stores either the number 0 or 1. It cannot store both. A quantum bit or qubit, on the other hand, can store a superposition of both numbers in the form \alpha |0 \rangle + \beta | 1\rangle, where \alpha,\beta\in\Bbb{C} such that \alpha^* \alpha+\beta^* \beta=1. In other words, |\alpha|^2 and |\beta|^2 are the probabilities of the qubit wave function collapsing into the |0 \rangle and |1 \rangle states respectively.

But what if we just want to store the ordinary number 1 in a qubit? We can just find a way to get |\beta|=1 and \alpha=0. Hence, all storage operations that can be performed on a classical computer can also be performed on a quantum computer.
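As a quick sanity check, the arithmetic of a single qubit can be sketched in a few lines of numpy. The variable names here are my own; this is an illustration, not a quantum library:

```python
import numpy as np

# A qubit as a length-2 complex vector (alpha, beta) with |alpha|^2 + |beta|^2 = 1.
ket0 = np.array([1, 0], dtype=complex)   # |0>
ket1 = np.array([0, 1], dtype=complex)   # |1>

# An equal superposition of |0> and |1>:
psi = (ket0 + ket1) / np.sqrt(2)

# Probabilities of collapsing into |0> and |1> on measurement:
p0 = abs(np.vdot(ket0, psi)) ** 2
p1 = abs(np.vdot(ket1, psi)) ** 2

# Storing the classical bit 1: set alpha = 0 and |beta| = 1.
classical_one = ket1
```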

Bloch sphere-I

Each qubit wave function can be represented as a point on the two dimensional sphere S^2.

But how can two complex numbers (\alpha,\beta)\in\Bbb{C}^2 be represented on only a two dimensional manifold? Mathematically speaking, we have the condition that \alpha^*\alpha+\beta^*\beta=1, which gives us the fact that all such (\alpha,\beta) are contained within S^3\subset \Bbb{C}^2. We then mod out S^3/S^1, as the individual phases of \alpha and \beta don’t matter- only the difference of their phases does. This gives us S^2. For the mathematically minded, what I have performed above is a Hopf fibration.

In the sphere above, we have the |0\rangle state as the North Pole and the |1\rangle state as the South Pole. Why is that? This fact doesn’t matter at all, I suppose. I could have chosen any two points as |0\rangle and |1\rangle, and declared all other points to be superpositions of these two states. Choosing the north and south poles for these two points just allows for a fairly intuitive parametrization of any superposition of those states. The parametrization is (\phi,\psi)\to \cos(\frac{\phi}{2})|0\rangle+\sin(\frac{\phi}{2})e^{i\psi}|1\rangle. Note that this naturally builds up the spinor framework, in which a 4\pi rotation with respect to the angle \phi on the Bloch sphere would correspond to a 2\pi rotation of the superposition of states. Is this just an artificial construction; perhaps an undesired outcome of the parametrization? Perhaps. However, it is still a useful mathematical construction. Each point on the Bloch sphere corresponds to a spinor, and not much is lost if we assume that a 4\pi rotation now corresponds to a full rotation of the quantum state, and not a 2\pi rotation. Moreover, spinors come up naturally in lots of other areas of Physics.
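The parametrization and the spinor behaviour are easy to verify numerically. This is a small sketch with my own function name: increasing \phi by 2\pi flips the sign of the state, and only a 4\pi increase returns it exactly.

```python
import numpy as np

# Bloch-sphere parametrization: (phi, psi) -> cos(phi/2)|0> + sin(phi/2) e^{i psi}|1>
def bloch_state(phi, psi):
    return np.array([np.cos(phi / 2), np.sin(phi / 2) * np.exp(1j * psi)])

north = bloch_state(0.0, 0.0)      # |0> at the North Pole
south = bloch_state(np.pi, 0.0)    # |1> at the South Pole

s0 = bloch_state(0.3, 0.7)
s1 = bloch_state(0.3 + 2 * np.pi, 0.7)   # sign flip: the spinor behaviour
s2 = bloch_state(0.3 + 4 * np.pi, 0.7)   # full 4*pi rotation: back to the start
```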

Evolution of quantum systems

The evolution of a quantum system can be thought of as the motion of a dot on the Bloch sphere. It happens through unitary transformations. What does this mean? Because unitary transformations have eigenvalues of modulus 1, they merely multiply their eigenstates by phase factors, keeping the probabilities of those eigenstates the same. Hence, if you take a wave function with a high probability of collapsing into an eigenstate of the evolution operator, this probability will remain high as the wave function evolves. Of course, the shape of the wave function will still change with time.


One may calculate the probability of making a particular measurement by projecting the state vector onto the desired subspace. For example, if we have a wavefunction \alpha|0\rangle +\beta |1\rangle, we may calculate the probability of this wavefunction collapsing into the |0\rangle state by projecting it onto the |0\rangle subspace and taking the squared modulus of the coefficient, which is |\alpha|^2. This whole process can be formalized by saying that given a wave function \psi, the probability of making a measurement m is \langle\psi|M_m^* M_m|\psi\rangle, where M_m is the projection operator projecting the state vector onto the |m\rangle subspace. The state of the system after this measurement is actually observed is \frac{M_m|\psi\rangle}{\sqrt{\langle\psi|M_m^* M_m|\psi\rangle}}.
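A minimal numerical sketch of this measurement rule, using the projector M_0=|0\rangle\langle 0| and example amplitudes of my own choosing:

```python
import numpy as np

alpha, beta = 0.6, 0.8j            # example amplitudes; |alpha|^2 + |beta|^2 = 1
psi = np.array([alpha, beta])

# Projector onto the |0> subspace: M_0 = |0><0|
M0 = np.array([[1, 0], [0, 0]], dtype=complex)

# Probability of measuring 0: <psi| M_0* M_0 |psi> = |alpha|^2
p0 = np.vdot(psi, M0.conj().T @ M0 @ psi).real

# Post-measurement state: M_0 |psi> / sqrt(p0)
post = (M0 @ psi) / np.sqrt(p0)
```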

Multi-qubit systems

Suppose we have two qubits q_1=a|0\rangle +b|1\rangle and q_2=c|0\rangle+d|1\rangle. Can we also form a wavefunction of this system of two qubits? It turns out that we can. We just take the tensor product of the two qubits to get ac|00\rangle +ad|01\rangle +bc|10\rangle +bd|11\rangle. Why the tensor product? The tensor product is a poor man’s multiplication sign when multiplication is not well-defined. Why do we need multiplication at all? Simple probabilistic arguments would suffice. If the probability of observing the first qubit in state |0\rangle is |a|^2 and the probability of observing the second qubit in state |1\rangle is |d|^2, then the probability of observing the state |01\rangle should be |a|^2|d|^2=|ad|^2, which is exactly what we see in the above system. Hence, this is as good a representation as any to represent the quantum state of the two qubit system.
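The tensor product of two qubits can be computed with numpy’s kron, which produces exactly the amplitudes ac, ad, bc, bd described above. The example amplitudes here are my own:

```python
import numpy as np

a, b = 1 / np.sqrt(2), 1 / np.sqrt(2)    # first qubit q1
c, d = 0.6, 0.8                          # second qubit q2
q1 = np.array([a, b])
q2 = np.array([c, d])

# Joint state: amplitudes [ac, ad, bc, bd] for |00>, |01>, |10>, |11>
joint = np.kron(q1, q2)

# P(|01>) = |ad|^2 = |a|^2 |d|^2, the product of the individual probabilities
p01 = abs(joint[1]) ** 2
```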


Before we understand entanglement, we have to grapple with the Hadamard gate. A quantum gate is a linear transformation performed on a state vector. A Hadamard gate, represented by the matrix \frac{1}{\sqrt{2}}\begin{pmatrix} 1&1\\1&-1 \end{pmatrix}, can be thought of as the process of “mixing it up”. When you input the |0\rangle eigenstate into this gate, for instance, you get back \frac{1}{\sqrt{2}}|0\rangle+\frac{1}{\sqrt{2}}|1\rangle. Hadamard gates are an easy way of creating a superposition of states from a single pure state, where all the eigenfunctions in the superposition have equal amplitudes. It is only by using these superpositions of states that we truly harness the power of quantum computation.
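A short numpy sketch of the Hadamard gate acting on |0\rangle, with illustrative names:

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

ket0 = np.array([1.0, 0.0])
plus = H @ ket0          # (|0> + |1>)/sqrt(2): an equal superposition

# H is unitary and its own inverse: applying it twice returns |0>.
back = H @ plus
```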

There is another quantum gate (transformation) that we should learn about: the CNot gate. This transformation acts on systems of two qubits, and is represented by

\begin{pmatrix} 1&0&0&0\\0&1&0&0\\0&0&0&1\\0&0&1&0\end{pmatrix}

This is a unitary transformation; in fact, it is just a permutation of the basis states. It flips the second (target) qubit exactly when the first (control) qubit is |1\rangle, mapping |10\rangle to |11\rangle and vice versa while leaving |00\rangle and |01\rangle unchanged. The CNot gate is instrumental in entangling two qubits. What does that mean?

When two wires get entangled, we find it hard to tell them apart. It seems like they have morphed into one, quite unwieldy, entity. Similarly, when two qubits get entangled- it is hard to separate them into distinct qubits. They seem to have morphed into one blob of information. How does the CNot gate morph two distinct qubits into one such blob, though? The CNot gate maps the state vector \frac{1}{\sqrt{2}}|00\rangle + \frac{1}{\sqrt{2}}|10\rangle to the vector \frac{1}{\sqrt{2}}|00\rangle + \frac{1}{\sqrt{2}}|11\rangle. This is an entangled state because there do not exist any states |\psi\rangle and |\phi\rangle such that |\psi\rangle\otimes |\phi\rangle=\frac{1}{\sqrt{2}}|00\rangle + \frac{1}{\sqrt{2}}|11\rangle. This is easy to see using the method of undetermined coefficients. Hence, we cannot recover the wavefunctions of the individual qubits anymore.

As the CNot gate permutes basis states, the amplitudes attached to the various eigenfunctions get moved around. Also note that in the above system represented by the wavefunction \frac{1}{\sqrt{2}}|00\rangle + \frac{1}{\sqrt{2}}|11\rangle, if we observe the first qubit to be in state |0\rangle, we automatically know that the second qubit must also be in state |0\rangle. Hence, knowing the state of one implies knowing the state of the other instantaneously. This further drives home the fact that these two qubits are no longer different entities with possibly different properties.
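One can check numerically that the CNot matrix is a unitary permutation matrix, and that it produces the entangled Bell state described above. A numpy sketch:

```python
import numpy as np

CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)

# CNOT is unitary: CNOT^T CNOT = I
unitary = np.allclose(CNOT.T @ CNOT, np.eye(4))

# Applying it to (|00> + |10>)/sqrt(2) yields the Bell state (|00> + |11>)/sqrt(2)
state = np.array([1, 0, 1, 0]) / np.sqrt(2)
bell = CNOT @ state
```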


I learned about teleportation through this amazing video. What exactly is being teleported here? A person? An object? Not quite. Only the quantum wavefunction of a qubit is being teleported- which is still something!

How does all of this happen exactly? Imagine that we have two scientists- Alice and Bob. I have unashamedly borrowed these names from the video. Two qubits are entangled (using the CNot gate), and then Alice and Bob are given one qubit each. Now imagine that Alice has another qubit Q, whose state is a|0\rangle +b|1\rangle. She wants to communicate this wave function to Bob. However, she can only communicate classical, and not quantum, information. How can she communicate the quantum wave function to Bob? I am going to follow the description in the video, and not that in the text, although the YouTube comments suggest that both descriptions are equivalent.

If Alice tensors her entangled qubit with the qubit Q, Bob’s qubit automatically gets tensored with Q as well….well, in a sense. He still needs more information to get a complete description of Q, but it’s a start. What this tensoring does to the two entangled qubits is similar to what the Hadamard gate does to the eigenfunction |0\rangle: it mixes things up. In other words, if the original entangled state was \frac{1}{\sqrt{2}}(|00\rangle +|11\rangle), tensoring it with Q creates a superposition of |00\rangle +|11\rangle, |01\rangle +|10\rangle,|10\rangle -|01\rangle and |00\rangle -|11\rangle. This is the set of all possible states that an entangled system of two qubits can exist in, and is called the set of Bell states. Now if Alice makes a Bell state measurement, it collapses the whole quantum system into one Bell state, tensored with one wave function. This happens for both Bob and Alice, although Bob does not know what the collapsed state of the system is. Now if Alice communicates to Bob which Bell state the system has collapsed into, which she can through classical channels, Bob will know how to retrieve the original wave function of Q from the wave function he has right now. This retrieval is through multiplication with one of the Pauli matrices, and which Pauli matrix is required depends on which Bell state the system has collapsed into. For example, if the Bell state that is measured is |00\rangle-|11\rangle, Bob knows that he needs the Z Pauli matrix to retrieve the original Q wavefunction.

One may ask why Bob cannot perform such measurements himself. It is implicit in this description that only Alice has the tools to make measurements.
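The whole protocol can be simulated in a few lines of numpy. This is a sketch under one standard convention: the qubit ordering, gate construction, and the correction Z^{m_0}X^{m_1} are assumptions of this particular simulation, not taken from the video. The measurement is handled by slicing out the amplitudes for each of Alice's four possible outcomes:

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
X = np.array([[0, 1], [1, 0]])
Z = np.array([[1, 0], [0, -1]])

def cnot(control, target, n):
    """CNOT on an n-qubit register as a permutation of basis states.
    Qubit 0 is the most significant bit of the basis index."""
    dim = 2 ** n
    U = np.zeros((dim, dim))
    for i in range(dim):
        bits = [(i >> (n - 1 - k)) & 1 for k in range(n)]
        if bits[control]:
            bits[target] ^= 1
        j = sum(b << (n - 1 - k) for k, b in enumerate(bits))
        U[j, i] = 1
    return U

# Qubit 0 is Alice's unknown qubit Q; qubits 1 and 2 are the shared Bell pair.
alpha, beta = 0.6, 0.8
psi = np.array([alpha, beta])
bell = np.array([1, 0, 0, 1]) / np.sqrt(2)
state = np.kron(psi, bell)

# Alice entangles Q with her half of the pair, then applies a Hadamard to Q.
state = cnot(0, 1, 3) @ state
state = np.kron(H, np.eye(4)) @ state

# Whatever outcome (m0, m1) Alice measures on her two qubits, Bob recovers
# psi by applying Z^m0 X^m1 to his qubit (the outcome is the classical message).
ok = True
for m0 in (0, 1):
    for m1 in (0, 1):
        bob = state[4 * m0 + 2 * m1 : 4 * m0 + 2 * m1 + 2]
        bob = bob / np.linalg.norm(bob)
        fixed = np.linalg.matrix_power(Z, m0) @ np.linalg.matrix_power(X, m1) @ bob
        ok = ok and np.allclose(fixed, psi)
```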

Super dense coding

Super dense coding is teleportation in reverse: it is the process through which two classical bits may be communicated by Alice to Bob, if she can only send across the quantum information of one qubit.

How does this happen? Assume that Alice and Bob share two entangled qubits that are in the Bell state \frac{1}{\sqrt{2}}(|00\rangle +|11\rangle). Now Alice has the following options:

  1. If Alice wants to communicate the classical bits |00\rangle, she can let the entangled state remain unchanged. When Bob receives Alice’s qubit, he will know that the Bell state of the entangled system is still \frac{1}{\sqrt{2}}(|00\rangle +|11\rangle), and will hence infer that Alice just wanted to communicate |00\rangle.
  2. If Alice wants to communicate |01\rangle, she will perform the X operation on her own entangled qubit, causing the Bell state of the entangled system to change to \frac{1}{\sqrt{2}}(|01\rangle +|10\rangle). This change in the Bell state is not communicated to Bob until he also receives Alice’s entangled qubit. However, when he does, he soon notices the change in Bell state, and infers that Alice wanted to communicate the |01\rangle state.
  3. Similar conclusions can be reached for both the |10\rangle and |11\rangle states, in which Alice can perform the Z and XZ rotations on her entangled qubit respectively.

How does Bob know the Bell state of the entangled system after he has received Alice’s qubit? Can one just look at both the qubits and know the state? No. Bob passes the entangled qubits through the CNot gate, and then passes Alice’s qubit through the Hadamard gate. After performing these operations, Bob gets the two-bit classical state that Alice wanted to communicate. The moral of the story is that a quantum wavefunction cannot really be inferred with just one measurement.
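Bob’s decoding procedure (CNot, then a Hadamard on Alice’s qubit) can be simulated directly. The encoding table below follows the list above; the qubit ordering and function names are my own:

```python
import numpy as np

I2 = np.eye(2)
X = np.array([[0, 1], [1, 0]])
Z = np.array([[1, 0], [0, -1]])
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]])

bell = np.array([1, 0, 0, 1]) / np.sqrt(2)    # shared Bell state

# Alice encodes two classical bits by acting on her qubit (the first one) alone.
encodings = {'00': I2, '01': X, '10': Z, '11': X @ Z}

def decode(state):
    """Bob's decoding: CNot, then Hadamard on Alice's qubit, then read out."""
    state = CNOT @ state
    state = np.kron(H, I2) @ state
    return format(int(np.argmax(np.abs(state))), '02b')

decoded = {bits: decode(np.kron(U, I2) @ bell) for bits, U in encodings.items()}
```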

Deutsch-Jozsa algorithm

I will be following this video to explain the Deutsch-Jozsa algorithm.

What problem does this algorithm solve? Imagine that you have a function f: \{0,1\}^n\to \{0,1\}, and you know that it is either a constant function, or a balanced function, which means it maps half of its domain to 0, and the other half to 1. How do we find out which it is? Using a classical computer, we will need at most 2^{n-1}+1 queries to be absolutely certain of the nature of f. However, a quantum computer can solve this problem with 1 query. How does this happen?

U_f is a quantum gate that is called an oracle function. It is a kind of black box that is quite useful in many applications. The operation that it performs is |x\rangle |y\rangle\to |x\rangle |y + f(x) \mod 2\rangle, where x\in \{0,1\}^n,y\in\{0,1\}. We are performing manipulations on n+1 qubits here. So in what order does all this happen?

We first input a classical state |0\rangle ^{\otimes n} of n bits, and an additional classical state |1\rangle of 1 bit, into the system. We then perform H^{\otimes n} on the n classical bits to produce a wave function over n qubits, and also perform H on the additional classical bit to produce a wave function on 1 qubit. The U_f gate then performs the operation described above on the wave function determined by tensoring the n qubits with the one additional qubit. Now we again perform the H^{\otimes n} operation on the first n qubits, and get another quantum wave function over n+1 qubits. Call this final wave function |\psi_3\rangle.

If the wave function |\psi_3\rangle comprises only the |0\rangle^{\otimes n} eigenstate for the first n qubits, f is a constant function. However, if it does not contain the |0\rangle^{\otimes n} eigenstate for the first n qubits at all, f is a balanced function. Fortunately, these two possibilities can be distinguished by just one measurement of |\psi_3\rangle.
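The whole circuit is small enough to simulate by brute force. In this numpy sketch (function names are my own), the oracle U_f is built as an explicit permutation matrix on the n+1 qubits:

```python
import numpy as np
from functools import reduce

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

def deutsch_jozsa(f, n):
    """Decide whether f: {0, ..., 2^n - 1} -> {0, 1} is constant or balanced
    with a single application of the oracle U_f."""
    N = 2 ** n
    # Start in |0>^n |1>; basis index ordered as (x, y) -> 2*x + y
    state = np.zeros(2 * N)
    state[1] = 1.0

    # Hadamard on all n + 1 qubits
    state = reduce(np.kron, [H] * (n + 1)) @ state

    # Oracle: |x>|y> -> |x>|y XOR f(x)>, a permutation of basis states
    Uf = np.zeros((2 * N, 2 * N))
    for x in range(N):
        for y in (0, 1):
            Uf[2 * x + (y ^ f(x)), 2 * x + y] = 1
    state = Uf @ state

    # Hadamard on the first n qubits only
    state = np.kron(reduce(np.kron, [H] * n), np.eye(2)) @ state

    # Probability that the first n qubits read all zeros
    p_zero = state[0] ** 2 + state[1] ** 2
    return 'constant' if np.isclose(p_zero, 1.0) else 'balanced'

const = deutsch_jozsa(lambda x: 1, 3)        # constant f
bal = deutsch_jozsa(lambda x: x & 1, 3)      # balanced f (last bit of x)
```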

Bloch sphere

A lot of details about the Bloch sphere have already been mentioned in this article. Perhaps one important point that we should keep in mind is that every unitary transformation of the wave function of a single qubit can be represented as a rotation of the corresponding point on the Bloch sphere around some axis. Moreover, each such rotation can be written as a composition of rotations around the x, y and z axes. In other words, U=R_x(\theta_1)R_y(\theta_2)R_z(\theta_3) for some angles \theta_1,\theta_2,\theta_3 (up to an overall phase). In fact, any unitary transformation U can be thought of as a composition of rotations around just the x and z axes, as any rotation around the y axis can itself be decomposed into rotations around the x and z axes.

In fact, even stronger claims than that can be made. What if we couldn’t change the angle? What if we had to fix \theta_1,\theta_2? Could we still represent any unitary transformation U to arbitrary precision by such rotations? Yes we can. There do exist two axes a and b, and rotations R_a(\alpha) and R_b(\beta) with \alpha,\beta fixed such that any unitary transformation can be approximated to arbitrary precision by a bunch of these transformations. The only constraint is that \alpha/2\pi and \beta/2\pi have to be irrational.

These rotations are known as universal quantum gates.

Shor’s algorithm

I am going to be following this video to explain what Shor’s algorithm is, and how it works.

Popular science literature has often emphasized the fact that a lot of encryption is based on the simple fact that it is exceedingly difficult and time consuming for classical computers to factor large numbers. Such a computer may take 2000 years of computing time to factor a 100 digit number, and the numbers used in encryption are generally much larger than that.

But what is encryption, and where does factoring large numbers come into all of it? Encryption is the process through which information is converted into a kind of code, and sent to the intended receiver. The hope is that if some malicious party gets their hands on this code, they will not be able to crack it in order to obtain that information. But what kind of code is it?

Imagine that Alice sends Bob a code, along with a large number N. Breaking the code to retrieve the information is only possible if one knows the factors of the number already, which Bob does. If a malicious party gets their hands on this code and large number, they will not be able to decrypt the code without spending a prohibitively large amount of time and effort to find the factors of this number. Shor’s algorithm, when run on a quantum computer, can factor these numbers with ease. How does it do it?

As mentioned before, the large number we have is N. Now pick a random number g, and check if it has a common factor with N. This can be done very quickly using Euclid’s algorithm. If it does, congrats! We’re done. Two factors of N are \gcd(g,N) and N/\gcd(g,N). We can use the algorithm below to further factorize these factors if we want.

Chances are that an arbitrarily chosen number g will not have a common factor with N. What should we do now? We should look for a number p>1 such that g^p\equiv 1\mod N. This is because if we can find such a p, then g^{p}-1\equiv 0\mod N, or (g^{\frac{p}{2}}-1)(g^{\frac{p}{2}}+1)\equiv 0\mod N. Hence, two factors of N can be determined by simply using the Euclidean algorithm to find common factors of N and g^{\frac{p}{2}}\pm 1. However, there is a problem.

What if p is odd? Or what if g^{\frac{p}{2}}\equiv -1\mod N, so that the only common factor of g^{\frac{p}{2}}+1 and N is N itself? We will have to start over and find another g, and hope that we don’t face this problem again. The good news is that the probability of us finding a “good” random number g, for which we face neither of the above problems, is at least \frac{3}{8} for each trial. Hence, we are almost certain to choose a “good” random number g after a sufficient number of trials.

But how does one find such a number p such that g^p\equiv 1\mod N? This is where Shor’s algorithm comes in. It constructs the wave function |1,g^1\mod N\rangle +|2,g^2\mod N\rangle+\dots, suitably normalized of course. Now if we measure only the remainder, the wave function collapses to the terms with some fixed remainder, say a. The resultant wave function is |x,a\rangle+|x+p,a\rangle +|x+2p,a\rangle+\dots, where g^x\equiv a\mod N. We now need to determine the number p, which is also the period of this wave function.

We do this by determining the quantum Fourier transform of this wave function. Naively speaking, a Fourier transform gives us all the frequencies of the wave function. The fundamental frequency of this wave function is 1/period. However, a quantum Fourier transform gives us all multiples of the fundamental frequency, which can be thought of as resonant frequencies. Hence, the Fourier transform of the above wave function will be |1/p\rangle+|2/p\rangle+\dots

Now when we perform a Fourier measurement, the above wave function may collapse to any |i/p\rangle, where i is a positive integer. However, with enough measurements (of course preceded by painstakingly recreating the same wavefunction by following the same procedure), we can figure out the fundamental frequency 1/p, and consequently its reciprocal p. A small number of trials generally suffices.
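The classical skeleton of the algorithm can be sketched in Python, with the quantum period-finding step replaced by a brute-force search. The function names are mine; the period-finding step is the only part that the quantum computer actually accelerates:

```python
from math import gcd

def find_period(g, N):
    """Classical stand-in for the quantum step: the smallest p > 0
    with g^p = 1 (mod N). This is what Shor's algorithm speeds up."""
    val, p = g % N, 1
    while val != 1:
        val = (val * g) % N
        p += 1
    return p

def shor_classical(N, g):
    if gcd(g, N) > 1:                  # lucky guess: g already shares a factor
        return gcd(g, N), N // gcd(g, N)
    p = find_period(g, N)
    if p % 2 == 1:
        return None                    # bad g: odd period, try another
    half = pow(g, p // 2, N)
    if half == N - 1:
        return None                    # bad g: g^(p/2) = -1 (mod N)
    return gcd(half - 1, N), gcd(half + 1, N)

factors = shor_classical(15, 7)        # period of 7 mod 15 is 4; 7^2 = 4 (mod 15)
```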

The discovery of this algorithm was hugely responsible for creating and sustaining interest in quantum computation.

Grover’s algorithm

I will follow this video to explain Grover’s algorithm.

Imagine that we have N switches, some attached correctly and others attached upside down. We have to figure out the correct configuration of the switches which will light the bulb. A classical computer will take up to 2^N trials to determine the correct configuration. However, a quantum computer using Grover’s algorithm can provide a quadratic speedup: we may get away with roughly 2^{N/2} trials.

How does all of this happen? Imagine a system of N+1 qubits, in which the first N qubits represent all configurations of the N switches, and the last qubit represents the state of the bulb, which is initially assumed to be off. One may imagine that an amplitude of \frac{1}{2} corresponds to the bulb being off and an amplitude of -\frac{1}{2} corresponds to the bulb being on. For the eigenstate corresponding to the correct configuration of the switches, the function f switches the configuration of the bulb to “on”.

Now performing the “Grover iteration” denoted by U_+U_f, the amplitude of the correct configuration is increased. After performing it enough times, we can be almost certain that the wave function will collapse into the correct configuration when the measurement is made.

What is U_+U_f? U_f flips the sign of the amplitude of the correct configuration, leaving all other amplitudes untouched. U_+ then reflects every amplitude about the average amplitude (this is often called the “inversion about the mean”). With each such iteration, the amplitudes of the wrong configurations shrink slightly, while the amplitude of the correct configuration grows. After roughly \frac{\pi}{4}\sqrt{2^N} iterations, almost all of the amplitude of the wave function gets concentrated on the correct configuration.
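One common way to describe the Grover iteration is as a sign flip of the marked amplitude (U_f) followed by an inversion about the mean (U_+), and that version can be simulated directly on a vector of amplitudes. A numpy sketch with illustrative parameters:

```python
import numpy as np

def grover(n, marked, iterations=None):
    """Grover search over 2^n configurations for a single marked index."""
    N = 2 ** n
    if iterations is None:
        iterations = int(np.floor(np.pi / 4 * np.sqrt(N)))
    amp = np.full(N, 1 / np.sqrt(N))       # uniform superposition
    for _ in range(iterations):
        amp[marked] *= -1                  # U_f: phase-flip the marked state
        amp = 2 * amp.mean() - amp         # U_+: inversion about the mean
    return amp

amp = grover(n=6, marked=37)
p_marked = amp[37] ** 2                    # probability of the correct configuration
```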

I hope to study quantum error correction soon and blog about that as well. Thanks for reading!

IMO 2019, Problem 1

The International Math Olympiad 2019 had the following question:

Find all functions f:\Bbb{Z}\to \Bbb{Z} such that f(2a)+2f(b)=f(f(a+b)).

The reason that I decided to record this is because I thought I’d made an interesting observation that allowed me to solve the problem in only a couple of steps. However, I later realized that at least one other person has solved the problem the same way.

The right hand side is symmetric in a,b. Clearly, f(f(a+b))=f(f(b+a)). Hence, symmetrizing the left side as well, we get f(2a)+2f(b)=f(2b)+2f(a). This implies that f(2a)-f(2b)=2(f(a)-f(b)). Assuming b=0, we get f(2a)=2f(a)-f(0).

Now substitute f(2a)=2f(a)-f(0) back into the original equation to get 2f(a)+2f(b)-f(0)=f(f(a+b)); since the right hand side depends only on a+b, this forces f to be linear. Checking linear functions against the original equation shows that f(x)=2x+c (for an arbitrary constant c) and f(x)=0 are the only solutions to this question.
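As a quick (non-rigorous) sanity check, one can verify numerically that functions of the form f(x)=2x+c, as well as f(x)=0, satisfy the functional equation over a range of integers:

```python
# Check f(2a) + 2f(b) == f(f(a + b)) over a grid of integer pairs (a, b).
# This is a spot check, not a proof.
def check(f):
    return all(f(2 * a) + 2 * f(b) == f(f(a + b))
               for a in range(-20, 21) for b in range(-20, 21))

ok_linear = check(lambda x: 2 * x + 7)   # c = 7 as an example constant
ok_zero = check(lambda x: 0)
```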