Machine learning and life lessons

Anyone who tries to draw “life lessons” from machine learning is someone who understands neither life, nor machine learning. With that in mind, let us get into the life lessons that I draw from machine learning.

Disclaimer: The idea that I am going to expound below is something I first came across in the book Algorithms to Live By. I thought it was a really impressive idea, but didn’t do much about it. Now I came across it again in Neel Nanda’s fantastic video on machine learning. Humans are hardwired to pay attention to an idea that they come across multiple times in unrelated contexts. If the same app is recommended to you by your friends and a random stranger on the internet, it’s probably very good and you should download it. Hence, seeing as I heard about this idea from two people I enjoy reading and learning from, I decided to give it some thought and write about it.

Minimizing regret

In programming as well as in life, we want to minimize our error, or loss function. When a neural network builds a model of the world from given data, it tries to minimize the difference between its predictions and the data. But what do humans want to minimize?

Humans want to minimize regret (also explained in the book Algorithms to Live By).

Regret often comes from choices that led to negative experiences. I read a fantastic interview by the famous programmer (and inventor of Latex) Donald Knuth, who said that in life, instead of trying to maximize our highs or positive experiences, we should focus more on minimizing our lows or negative experiences. What does that mean? Suppose you work in an office in which you win the best employee award every month. Clearly, your career is going well. However, your spouse is on the verge of divorcing you, and all your co-workers hate you. As opposed to this, imagine that you’re an average worker in an office setting whose career is not really that spectacular, but you get along with your spouse and co-workers. In which situation would you be happier and more satisfied? I bet that most of us would choose the second scenario without even thinking. Negative experiences stay with us for longer, and harm us more. Positive experiences provide for self-confidence and happy nostalgia, but are overshadowed by negative experiences on most days. I’ve often thought about this statement by Knuth, and it keeps getting clearer and more relevant with time. Hence, humans will do well to minimize the regret that they might have accumulated from negative experiences.

Although regret often stems from negative experiences, it may also arise from from actions not taken. A common example would be someone who really wanted to become an artist, but was forced by circumstances into a miserable profession (hello, MBAs). They would regret not pursuing their passion for a very long time.

Hence, a happier life is not necessarily one in which we have maximized the number of happy moments, but one in which we have minimized regret.

Gradient descent

So how does one minimize regret? I want to answer this question by discussing how a neural network performs gradient descent.

Imagine that your loss/error function is a graph in two dimensions, and you want to find the value of x (your parameter) for which this function is minimized.

If the graph given above is that of the loss function, it is minimized at the value of x for which the function has its global minima. How does gradient descent work? We choose a starting point on the x-axis, say x_0, and then perform the simple iteration x_{n+1}=x_n-\alpha \nabla f. If you starting point is near the local minima for instance, you will slowly be led towards the local minima.

But wait. You wanted the global minima, and not the local minima. How does one do that? What machine learning algorithms do is that they keep jumping between different starting points x_0, and performing gradient descent with values of \alpha that are large at first, but decrease with time. This approach generally works because real life data that is fed to the neural network is “nice” (probably in the sense that the attractor well for the global minima is larger than the wells for local minima). Hence, after a few jumps, we have a pretty good idea of where the attractor well of the global minima lies. Now we can keep iterating the process of gradient descent until we reach the global minima.

How does this have anything to do with life? The perhaps too obvious, but useful analogy is that we should keep trying new and different things. This is an age-old adage, but there seems to be a mathematical basis for it. Trying new things, like visiting a new part of town, trying meditation, talking to strangers, reading a book in a field you know nothing about, learning a new language, is akin to having a new starting point- x_0. Now you can perform gradient descent. Remove the aspects of that new thing that you don’t like. Minimize regret (I will probably regret focusing on learning French for six hours instead of writing my thesis). You might arrive upon a minima that is better than your previous minima (if not, you can revert to your previous state of life). If you do it enough times, chances are that you will find your global minima, or your best possible life.


This ties in with another concept that was discussed in Algorithms to Live By– annealing. When Intel was trying to design microchip processors that would power the modern computer, finding the right arrangements for each part of the processor was proving to be a mathematically intractable problem. How should these millions of different parts be arranged so that the processor is fast and does not generate too much heat? There were literally millions of parameters, and the brightest minds in the world literally had no idea what to do.

What one physicist suggested was the process of annealing. What is annealing? It is the process through which metals are heated to very high temperatures, and then slowly allowed to cool. This causes metals and alloys to harden. Similarly, the physicist suggested that they randomly arrange all the parts of the processor, and then perform small changes that would make the processor more stable and efficient. Soon, they arrived upon a design that was efficient and successfully powered the modern computer.

How does this apply in one’s life? One possibility is resource allocation. How much time should I devote to working out, as opposed to studying or socializing? We can start at an arbitrary point- say I work out for 10 mins everyday, study for 5 hours and socialize for 2 hours. I can then change the parameters, in the same way that a metal slowly cools down. I should probably work out more, and not spend as much time socializing. Hence, maybe I can work out for 30 mins, study for 4 hours and socialize for 1.5 hours. I can keep tweaking these parameters until I reach an arrangement that is satisfactory for me and makes me happy.

Hence, if you’re looking to live your best life possible, you should start doing things you never thought to do before, and then changing small parameters until you reach a satisfactory situation. Then start all over again. After trying enough things, there’s a good chance that you’ve already figured out what makes you happiest, and minimizes regret. You’ve figured out how to live.

And thus ends my Deepak Chopra-esque spiel.

Capitalism and coordination problems

We all know that the world’s problems can be solved if “all of us can come together and act as one….” you’ve probably dozed off by now. Of course this is true. And of course this never happens. But why not? What is the single most important reason that we cannot get together as a single planet and solve all our problems in an hour?

The reason is simple- working in a group is a complex coordination problem. A large group of people, with no personal ties or friendships, have to come together under a common umbrella, and try to solve a problem together. No individual should shirk their share of the work, and everyone should contribute equally (or at least equitably). As anyone who has worked in a group project can surely testify, this never happens because some members shirk their responsibilities, hoping that others would pick up the slack. The hard working ones work hard for some time, and then realize that they’ve been given an unfair deal. Soon, they stop working as well. Sometimes, they keep working hard, but claim that they deserve “more” than what the slackers are getting- maybe they want more credit, or complete control of the project, etc. Soon, the group disintegrates, and the final outcome is substandard.

Coordination problems are responsible for not enough donations to politics (less than the amount of money that Americans spend on almonds each year, as explained in the linked article), not enough donations to Wikipedia (despite Jimmy’s constant threats and emails), our screwed up education systems, garbage on the roads, etc. Why donations, you may ask? How is that a coordination problem? Let me give a simple example. I want to stop poverty. I really do. When a see a Facebook ad for a project in Africa that is saving children, I feel the pang of conscience, and want to donate an amount. I hear crazy statistics like if enough people in the world donated $20, we would end global hunger in a day. Of course I can donate $20, no problem at all. However, most times I don’t end up donating. Why? Because I know that very few other people will donate. And my $20 will not make a difference at all. That is why most people don’t vote (their puny vote will not make a difference at all). What if I was convinced that most other people in the world will donate if I do? Will I donate then? Of course I will! But for that, a miracle would have to happen. The world really would have to come together to accomplish one purpose. Our greatest coordination problem will have to have been solved. Clearly, this is unlikely to happen in the near future. And because we are not able to get the world together and make everyone donate a small sum, we let hundreds of thousands of innocent children die everyday.

In the book “The Precipice”, the author Toby Ord writes that the reason why the world is headed towards annihilation is that saving the world is a complex coordination problem. In other to stop climate change, reduce pollution, reduce the threat of nuclear winter, etc, all countries have to come together and make sacrifices. However, some countries keep polluting and manufacturing weapons of mass destruction. Why is that? Because the benefits of saving the world will be reaped by all countries, including the errant countries. However, the benefits of misbehaving will only be reaped by the misbehaving countries. If Indian industries keep polluting while the rest of the world reduces its emissions, India will benefit by having a higher manufacturing output than others, and will also benefit due to the cleaner air that has been achieved through the sacrifices of others. Hence, slacking is often a win-win tactic- slackers win short term, because they don’t have to play by the rules, and also hopefully long term, because other more responsible parties will have to pick up the slack and fulfill the objective, providing benefits for all. How does one reduce slacking in a group? How does one solve this omnipresent coordination problem?

Capitalism is the most successful idea in human history. It brought the world world together, and almost singlehandedly improved quality and duration of life for almost everyone around the world. This happened because the whole world did indeed come together and work as one. How did Capitalism solve the coordination problem that killed most other ideas like Communism? It did so by providing an incentive for each party in the world to do their job. If you do your job, you get money and power. If you don’t, you get nothing. Hence, if you’ve slacked off, you’ll be left behind by others who work hard and make money. This is something that you don’t generally get to see in group projects, or in things like Communism.

So how can we solve our great coordination problems? How can we really end poverty and hunger and climate change once and for all? I don’t know. But I think the trick will be to find a way to give an incentive to each individual person in the whole world. Does this incentive have to be money? Should the government be paying people to clean up beaches or repair their overly polluting vehicle? Possibly. But this might not be the whole story. Clearly, many governments will be constrained by limited coffers that they cannot add a further burden to. In its place, maybe they could share success stories of people cleaning up their neighborhoods on social media. Maybe they could offer to name benches on public parks after particularly generous donors to orphanages, etc. Finding the right incentives for a non-homogeneous population is often a difficult talk. But perhaps the main point of writing this essay is that finding incentives is the most important thing that we can do to solve complex coordination problems. And it is only through solving these problems that we can solve major global issues, and perhaps hope to re-create the phenomenal success of Capitalism.

PTSD and Type II thinking

[Note: This article is highly speculative, and I will pull it down without warning if I find evidence to the contrary]

Reading very many of Scott Alexander blogposts, which review basic neuroscience research, has given me the impression that PTSD and depression might be a fundamental neuroscience problem. Let me explain that below.

What does PTSD feel like, exactly? You had a bad experience a gazillion years ago. Maybe you were bullied, or had a bad breakup, etc. You wake up in the morning, and suddenly that is all you can think about. How this bad experience will keep affecting your future, how there is no escape, how you wish you could have behaved differently back then, etc.

This is your brain going in a downward spiral, sucking you into a pit of misery. Well a part of your brain anyway. Reading Alexander’s posts give you the clear impression that your brain is not one cohesive entity. It is a bunch of different shit thrown together. And there is a part of the brain that is responsible for computation and logical inferences from known data. And this part of the brain is not engaged when you go into this downward spiral.

Imagine that you’re falling deeper and deeper into a well. You can stop this fall at any point by pushing your hands and legs against the walls of the well. But you’re just not able to. Mainly because you’ve forgotten about your hands and feet completely.

Similarly, when one falls into this self-denigrating pit of despair, one simply cannot remember to employ the part of the brain that will tell you that you’re overthinking this, and that your past trauma is just not relevant anymore. The world has moved on. How do you deploy this part of your brain?

Write things down. If one were to become slightly technical, there are two types of thinking that the brain employs- Type I and Type II. Type I thinking is the kind of illogical/intuitive/emotional thinking that can lead you into such pits. Type II thinking is the more deliberate, logical form thinking that might save you. When we write things down, or perhaps explain our position to someone else, Type II thinking gets deployed. We may soon realize that the past event just isn’t relevant anymore, or at least not as important as we’re making it out to be.

Another way to think about it is that PTSD is just a failure of computation. Suppose you were bullied as a child. You keep thinking about that, slowly descending further into misery. But what if you deliberately compute the effect of those people in your life? Do these people live near you anymore? No. Do you work with them? No. Have they forgotten about it all? Probably. If you met them again, will they still be horrible to you? Probably not, and definitely not in a few years. When we compute these possibilities, we again deploy Type II thinking, which will lead us out of this spiral.

But why is the brain not always employing Type II thinking anyway? I’m a fairly intelligent person. I should be able to reason myself out of anything. The reason why the brain doesn’t always think in its logical mode is the same reason why the TV doesn’t turn itself on- no one pressed the button. Although the TV is fully capable of showing us our favorite channels, switching it on is still compulsory. Similarly, for the brain to employ Type II thinking, we HAVE to write things down, or perform a computation. This is the only way that the logical part of our brain “switches on”. Without that, the brain will keep chasing us down the same rabbitholes that we’ve been haunted by for years.

IMO 2020, Problem 2

The following is a question from IMO 2020:

The first time I tried to solve the problem, I thought I had a solution, but it turned out to be wrong. I wrongly assumed that a^ab^bc^cd^d would be maximized when a=b=c=d, which is commonly true in Olympiad problems, but that needn’t be the case.

I then looked at solutions available online, and realized that I just needed to show that a^ab^bc^cd^d\leq a^2+b^2+c^2+d^2. After doing that, I homogenized both sides, and tried to prove the statement. I was finally successful. I am recording my solution, as it is slightly different from the ones available online.

Quantum Computing

Today, I will be talking about quantum computing. I will be following Quantum Computing– Lecture Notes by Mark Osdin, who is a professor at the University of Washington, Seattle. These lecture notes are roughly based on the book Quantum Computing and Quantum Information by Michael A. Nielsen and Isaac L. Chuang.

So what exactly is quantum computing? Imagine that you’re trying to find the shortest route from your house to a restaurant. The way that a classical computer would solve this problem is that it would consider all possible routes from your house to the restaurant, and calculate the length of each route individually. It would then compare the lengths of all the routes, and give us the shortest one. A quantum computer, on the other hand, can determine all routes and calculate all route lengths at once.

What does that mean? Suppose there are a million routes, can the quantum computer calculate the lengths of each all at once? What about a trillion? This seems impossible, as it would imply that a quantum computer is infinitely faster than a classical computer. What a quantum computer actually does is that it performs all of these calculations, but then stores the results of these calculations in a superposition. This means that users cannot extract the lengths of all these trillions of routes all at once. Performing a measurement would collapse the wave function generated by a quantum computer into one such route. Hence, we may only be able to know the details of one route on making one measurement.

What use is a quantum computer then? We don’t want to find the length of each route by performing one measurement at a time. We just need the shortest route. If we can find a way to increase the probability of the wave function collapsing into the shortest route, we will have solved our problem. Although this process of increasing the probability of the wave function collapsing into the “right answer” may take some time, it generally takes much less time than a classical computer. Hence, quantum computers have been known to provide exponential speedups over classical computers.

In short, quantum computers are useful because they are fast. Much faster than any classical computer will ever be.

A quantum bit

A classical bit is a “storage space” on a computer, which stores either the number 0 or 1. It cannot store both. A quantum bit or qubit, on the other hand, can store a superposition of both numbers in the form \alpha |0 \rangle + \beta | 1\rangle, where \alpha,\beta\in\Bbb{C} such that \alpha^* \alpha+\beta^* \beta=1. In other words, \alpha,\beta are indicative of the probabilities of the qubit wave function collapsing into the |0 \rangle or |1 \rangle states.

But what if we just want to store the ordinary number 1 in a qubit? We can just find a way to get |\beta|=1 and \alpha=0. Hence, all storage operations that can be performed on a classical computer can also be performed on a quantum computer.

Bloch sphere-I

Each qubit wave function can be represented as a point on the two dimensional sphere S^2.

But how can two complex numbers (\alpha,\beta)\in\Bbb{C}^2 be represented on only a two dimensional manifold? Mathematically speaking, we have the condition that \alpha^*\alpha+\beta^*\beta=1, which gives us the fact that all such (\alpha,\beta) are contained within S^3\subset \Bbb{C}^2. We then mod out S^3/S^1, as the individual phases of \alpha and \beta don’t matter- only the difference of their phases does. This gives us S^2. For the mathematically minded, what I have performed above is a Hopf fibration.

In the sphere above, we have the |0\rangle state as the North Pole and the |1\rangle state as the South Pole. Why is that? This fact doesn’t matter at all, I suppose. I could have chosen any two points as |0\rangle and |1\rangle, and declared all other points to be superpositions of these two states. Choosing the north and south poles for these two points just allows for a fairly intuitive parametrization of any superpositon of those states. The parametrization is (\phi,\psi)\to \cos(\frac{\phi}{2})|0\rangle+\sin(\frac{\phi}{2})e^{i\psi}|1\rangle. Note that this naturally builds up the spinor framework, in which a 4\pi rotation with respect to the angle \phi on the Bloch sphere would correspond to a 2\pi rotation of the superposition of states. Is this just an artificial construction; perhaps an undesired outcome of the parametrization? Perhaps. However, it is still a useful mathematical construction. Each point on the Bloch sphere corresponds to a spinor, and not much is lost if we assume that a 4\pi rotation now corresponds to a full rotation of the quantum state, and not a 2\pi rotation. Moreover, spinors come up naturally in lots of other areas of Physics.

Evolution of quantum systems

The evolution of a quantum system can be thought of as the motion of a dot on the Bloch sphere. It is thought to happen through unitary transformations. What does this mean? Because unitary transformations have eigenvalues of modulus 1, these transformations merely rotate all the eigenstates whose superposition forms the overall wave function, keeping their probabilities the same. Hence, if you take a wave function with a high probability of collapsing into the |0\rangle eigenstate, this probability will remain high as the wave function evolves. Of course the shape of the wave function will change with time.


One may calculate the probability of making a particular measurement by projecting the state vector onto the desired subspace. For example, if have a wavefunction \alpha|0\rangle +\beta |1\rangle, we may calculate the probability of this wavefunction collapsing into the |0\rangle state by projecting it onto the |0\rangle subspace and then calculating the coefficient, which is \alpha. This whole process can be formalized by saying that given a wave function \psi, the probability of making a measurement m is \sqrt{\langle\psi|M_m^* M_m|\psi\rangle}, where M_m is the projection operator projecting the state vector onto the |m\rangle subspace. The state of the system after this measurement is actually observed is \frac{M_m|\psi\rangle}{\sqrt{\langle\psi|M_m^* M_m|\psi\rangle}}.

Multi-qubit systems

Suppose we have two qubits q_1=a|0\rangle +b|1\rangle and q_2=c|0\rangle+d|1\rangle. Can we also form a wavefunction of this system of two qubits? It turns out that we can. We just take the tensor product of the two qubits to get ac|00\rangle +ad|01\rangle +bc|10\rangle +bd|11\rangle. Why the tensor product? The tensor product is a poor man’s multiplication sign when multiplication is not well-defined. Why do we need multiplication at all? Simple probabilitistic arguments would suffice. If the probability of observing the first qubit in state |0\rangle is |a|^2 and the probability of observing the second qubit in state |1\rangle is |d|^2, then the probability of observing the state |01\rangle should be |a|^2|d|^2=|ad|^2, which is exactly what we see in the above system. Hence, this is as good a representation as any to represent the quantum state of the two qubit system.


Before we understand entanglement, we have to grapple with the Hadamard gate. A quantum gate is a linear transformation performed on a state vector. A Hadamard gate, represented by the matrix \frac{1}{\sqrt{2}}\begin{pmatrix} 1&1\\1&-1 \end{pmatrix}, can be thought of as the process of “mixing it up”. When you input the |0\rangle eigenstate into this gate, for instance, you get back \frac{1}{\sqrt{2}}|0\rangle+\frac{1}{\sqrt{2}}|1\rangle. Hadamard gates are an easy way of creating a superposition of states from a single pure state, where all the eigenfunctions in the superposition have equal amplitudes. It is only by using these superpositions of states that we truly harness the power of quantum computation.

There is another quantum gate (transformation) that we should learn about- the CNot gate. This transformation acts on systems of two qubits, and is represented by

\begin{pmatrix} 1&0&0&0\\0&1&0&0\\0&0&0&1\\0&0&1&0\end{pmatrix}

This is not a unitary transformation. What this means is that the probabilities of the occurence of each eigenstate can now change. The CNot gate is instrumental behind entangling two qubits. What does that mean?

When two wires get entangled, we find it hard to tell them apart. It seems like they have morphed into one, quite unwieldy, entity. Similarly, when two qubits get entangled- it is hard to separate them into distinct qubits. They seem to have morphed into one blob of information. How does the CNot gate morph two distinct qubits into one such blob, though? The CNot gate maps the state vector \frac{1}{\sqrt{2}}|00\rangle + \frac{1}{\sqrt{2}}|10\rangle to the vector \frac{1}{\sqrt{2}}|00\rangle + \frac{1}{\sqrt{2}}|11\rangle. This is an entangled state because there do not exist any states |\psi\rangle and |phi\rangle such that |\psi\rangle\otimes |\phi\rangle=\frac{1}{\sqrt{2}}|00\rangle + \frac{1}{\sqrt{2}}|11\rangle. This is easy to see using the method of undetermined coefficients. Hence, we cannot recover the wavefunctions of the individual qubits anymore.

As the CNot gate is not unitary, the probabilities of various eigenfunctions are often changed. Also note that in the above system represented by the wavefunction \frac{1}{\sqrt{2}}|00\rangle + \frac{1}{\sqrt{2}}|11\rangle, if we observe the first qubit to be in state |0\rangle, we automatically know that the second qubit must also be in state |0\rangle. Hence, knowing the state of one implies knowing the state of the other instantaneously. This further drives home the fact that these two qubits are no longer different entities with possibly different properties.


I learned about teleportation through this amazing video. What exactly is being teleported here? A person? An object? Not quite. Only the quantum wavefunction of a qubit is being teleported- which is still something!

How does all of this happen exactly? Imagine that we have two scientists- Alice and Bob. I have unashamedly borrowed these names from the video. Two qubits are entangled (using the CNot gate), and then both of them are given one qubit each. Now imagine that Alice has another qubit Q, whose state is a|0\rangle +b|1\rangle. She wants to communicate this wave function to Bob. However, she can only communicate classical, and not quantum information. How can she communicate the quantum wave function to Bob? I am going to follow the description in the video, and not that in the text, although the youtube comments suggest that both descriptions are equivalent.

If Alice tensors her entangled qubit with the qubit Q, Bob’s qubit automatically gets tensored with Q as well….well in a sense. He still needs more information to get a complete description of Q, but it’s a start. What this tensoring does to the two entangled qubits is similar to what the Hadamard gate does to the eigenfunction |0\rangle: it mixes things up. In other words, if the original entangled state was \frac{1}{\sqrt{2}}(|00\rangle +|11\rangle), tensoring it with Q creates a superposition of |00\rangle +|11\rangle, |01\rangle +|10\rangle,|10\rangle -|01\rangle and |00\rangle -|11\rangle. This is the set of all possible states that an entangled system of two qubits can exist in, and is called the set of Bell states. Now if Alice makes a Bell state measurement, it collapses the whole quantum system into one Bell state, tensored with one wave function. This happens for both Bob and Alice, although Bob does not know what the collapsed state of the system is. Now if Alice communicates to Bob which Bell state the system has collapsed into, which she can through classical channels, Bob will know how to retrieve the original wave function |\psi\rangle from the wave function he has for Q right now. This retrieval is through multiplication with one of the Pauli matrices, and which Pauli matrix is required for retrieval depends on which Bell state the system has collapsed into. For example, if the Bell state that is measured is |00\rangle-|11\rangle, Bob knows that he needs teh $latex $X$ Pauli matrix to retrieve the original Q wavefunction.

One may ask why Bob cannot perform such calculations himself. Hence, it is implicit in this description that only Alice has the tools to make measurements.

Super dense coding

Super dense coding is teleportation in reverse: it is the process through which two classical bits may be communicated by Alice to Bob, if she can only send across the quantum information of one qubit.

How does this happen? Assume that Alice and Bob share two entangled qubits that are in the Bell state \frac{1}{\sqrt{2}}(|00\rangle +|11\rangle). Now Alice has the following options:

  1. If Alice wants to communicate the classical bits |00\rangle, she can let the entangled state remain unchanged. When Bob receives Alice’s qubit, he will know that the Bell state of the entangled system is still \frac{1}{\sqrt{2}}(|00\rangle +|11\rangle), and will hence infer that Alice just wanted to communicate |00\rangle.
  2. If Alice wants to communicate |01\rangle, she will perform the X rotation on her own entangled qubit, causing the Bell state of the entangled system to change to \frac{1}{\sqrt{2}}(|00\rangle -|11\rangle). This change in the Bell state is not communicated to Bob until he also receives Alice’s entangled qubit. However, when he does, he soon notices the change in Bell state, and infers that Alice wanted to communicate the |01\rangle state.
  3. Similar conclusions can be reached for both the |10\rangle and |11\rangle states, in which Alice can perform the Z and XZ rotations on her entangled qubit respectively.

How does Bob know the Bell state of the entangled system, after he was received Alice’s qubit? Can one just look at both the qubits and know the state? No. Bob passes the entangled qubits through the CNot gate, and then passes Alice’s qubit through the Hadamard gate. After performing these operations, Bob gets the two bit classical state that Alice wanted to communicate. The moral of the story is that a quantum wavefunction cannot really be inferred with just one measurement.

Deutsch-Jozsa algorithm

I will be following this video to explain the Deutsch-Josza algorithm.

What problem does this algorithm solve? Imagine that you have a function f: \{0,1\}^n\to \{0,1\}, and you know that it is either a constant function, or a balanced function, which means it maps half of its domain to 0, and the other half to 1. How do we find out which it is? Using a classical computer, we will need at most 2^{n-1}+1 queries to be absolutely certain of the nature of f. However, a quantum computer can solve this problem with 1 query. How does this happen?

U_f is a quantum gate that is called an oracle function. It is a kind of black box that is quite useful in many applications. The operation that it performs is |x\rangle |y\rangle\to |x\rangle |y + f(x) \mod 2\rangle, where x\in \{0,1\}^n,y\in\{0,1\}. We are performing manipulations on n+1 qubits here. So in what order does all this happen?

We first input a classical state |0\rangle ^{\otimes n} of n bits, and an additional classical state |1\rangle of 1 bit into system. We then perform H^{\otimes n} on the n classical bits to produce a wave function over n qubits, and also perform H on the additional classical bit to produce a wave function on 1 qubit. The U_f gate then performs the operation described above for the wave function determined by tensoring the n qubits with the one additional qubit. Now we again perform the H^{\otimes n} operation on the first n qubits, and get another quantum wave function over n+1 qubits. This final wave function is denoted as |\psi_3\rangle in the diagram below.

If the wave function |\psi_3\rangle comprises of only the |0\rangle^{\otimes n} eigenstate for the first n qubits, f is a constant function. However, if it does not contain the |0\rangle^{\otimes n} eigenstate for the first n qubits at all, f is a balanced function. Fortunately, both of these possibilities can be determined by just one measurement of |\psi_3\rangle.

Bloch sphere

A lot of details about the Bloch sphere have already been mentioned before in this article. Perhaps one important point that we should keep in mind is that every unitary transformation of the wave function of a single qubit can be represented as a rotation of the corresponding point on the Bloch sphere around some axis. Moreover, each such rotation can be written as a composition of rotation around the x-axis, y-axis and z-axis. In other words, U=R_x(\theta_1)R_y(\theta_2)R_z(\theta_3) for some angles \theta_1,\theta_2,\theta_3. In fact, any unitary transformation U can be thought of as a bunch of rotations around just the x and z axes, as any rotation around the y axis can be decomposed into a rotation around the x axis followed by a rotation around the z-axis.

In fact, even stronger claims than that can be made. What if we couldn’t change the angle? What if we had to fix \theta_1,\theta_2? Could we still represent any unitary transformation U to arbitrary precision by such rotations? Yes we can. There do exist two axes a and b, and rotations R_a(\alpha) and R_b(\beta) with \alpha,\beta fixed such that any unitary transformation can be approximated to arbitrary precision by a bunch of these transformations. The only constraint is that \alpha/2\pi and \beta/2\pi have to be irrational.

These rotations are known as universal quantum gates.

Shor’s algorithm

I am going to be following this video to explain what is Shor’s algorithm, and how it works.

Popular science literature has often emphasized the fact that a lot of encryption is based on the simple fact that it is exceedingly difficult and time consuming for classical computers to factor large numbers. Such a computer may take 2000 years of computing time to factor a 100 digit number, and the numbers used in encryption are generally much larger than that.

But what is encryption, and where does factoring large numbers come into all of it? Encryption is the process through which information is converted into a kind of code, and sent to the intended receiver. The hope is that if some malicious party get their hands on this code, they will not be able to crack it in order to obtain that information. But what kind of code is it?

Imagine that Alice sends Bob a code, along with a large number N. Breaking the code to retrieve the information is only possible if Bob knows the factors of the number already, which he does. If a malicious party get their hands on this code and large number, they will not be able to decrypt the code without spending a prohibitively large amount of time and effort to find the factors of this number. Shor’s algorithm, when run on a quantum computer, can factor these numbers with ease. How does it do it?

As mentioned before, the large number we have is N. Now pick a random number g, and check if it has a common factor with N. This can be done very quickly using Euclid’s algorithm. If it does, congrats! We’re done. Two factors of N are N/g. We can use the algorithm below to further factorize these factors if we want.

Chances are that an arbitrarily chosen number g will not have a common factor with N. What should we do now? We should look for a number p>1 such that g^p\equiv 1\mod N. This is because if we can find such a p, then g^{p}-1\equiv 0\mod N, or (g^{\frac{p}{2}}-1)(g^{\frac{p}{2}}+1)\equiv 0\mod N. Hence, two factors of N can be determined by simply using the Euclidean algorithm to find common factors of N and g^{\frac{p}{2}}\pm 1. However, there is a problem.

What if p is odd? Or what if (g^{\frac{p}{2}}-1) or (g^{\frac{p}{2}}+1) has a common factor of N with N? We will have to start over and find another g, and hope that we don’t face this problem again. The good news is that the probability of us finding a “good” random number g for which we face neither of the above problems is \frac{3}{8} for each trial. Hence, we are almost certain to choose a “good” random number g after a sufficient number of trials.

But how does one find such a number p such that g^p\equiv 1\mod N? This is where Shor’s algorithm comes in. It constructs the wave function |1,g^1\mod N\rangle +|2,g^2\mod N\rangle+\dots, suitably normalized of course. Now if we measure only the remainder, the wave function to some remainder, say a. The resultant wave function is |b^x,a\mod N\rangle+|b^{x+p},a\mod N\rangle +|b^{x+2p},a\mod N\rangle+\dots We now need to determine the number p, which is also the period of this wave function.

We do this by determining the quantum fourier transform of this wave function. Naively speaking, a fourier transform gives us all the frequencies of the wave function. The fundamental frequency of this wave function is 1/period. However, a quantum fourier transform gives us all multiples of the fundamental frequency, which can be thought of as resonant frequencies. Hence, the fourier transform of the above wave function will be |1/p\rangle+|2/p\rangle+\dots

Now when we perform a fourier measurement, the above wave function may collapse to any |i/p\rangle, where i is a positive integer. However, with enough measurements (of course preceded by painstakingly recreating the same wavefunction by following the same procedure), we can figure out the fundamental frequency |1/p\rangle, and consequently its reciprocal p. The maximum number of trials required is equal to the number of factors of p.

The discovery of this algorithm was hugely responsible for creating and sustaining interest in quantum computation.

Grover’s algorithm

I will follow this video to explain Grover’s algorithm.

Imagine that we have N switches, some attached correctly and others attached upside down. We have to figure out the correct configuration of the switches which will light the bulb. A classical computer will take 2^N trials to determine the correct configuration. However, a quantum computer using Grover’s algorithm can provide a quadratic speedup. In this case, we may just get away with using 2^{N/2} trials.

How does all of this happen? Imagine a system of N+1 qubits, in which the first N qubits represent all configurations of the N switches, and the last qubit represents the state of the bulb, which is initially assumed to be off. One may imagine that |\frac{1}{2}\rangle corresponds to the bulb being off and \frac-{1}{2}\rangle corresponds to the bulb being on. For the eigenstate corresponding to the correct configuration of the switches, the function f switches the configuration of the bulb to “on”.

Now performing the “Grover iteration” denoted by U_+U_f, the amplitude of the correct configuration is increased. After performing. it enough times, we can be almost certain that the wave function will collapse into the correct configuration when the measurement is made.

What is U_+U_f? Let us try to understand this with an analogy. For a number x, U_f(x)=1-2x, and U_+(x)=2x-1. Hence for a number that is exactly equal to \frac{1}{2}, U_+U_f maps it back exactly to \frac{1}{2}. However, for a number that is equal to -\frac{1}{2}, U_+U_f maps it to 3, (U_+U_f)^2 maps it to -11, etc. It is easy to see that |(U_+U_f)^n(-\frac{1}{2})|\to \infty as n\to\infty. Hence, when we normalize the wave function above, all of the amplitude of the function gets concentrated around the configuration that corresponds to -\frac{1}{2}, which is the correct configuration.

I hope to study quantum error correction soon and blog about that as well. Thanks for reading!

IMO 2019, Problem 1

The International Math Olympiad 2019 had the following question:

Find all functions f:\Bbb{Z}\to \Bbb{Z} such that f(2a)+2f(b)=f(f(a+b)).

The reason that I decided to record this is because I thought I’d made an interesting observation that allowed me to solve the problem in only a couple of steps. However, I later realized that at least one other person has solved the problem the same way.

The right hand side is symmetric in a,b. Clearly, f(f(a+b))=f(f(b+a)). Hence, symmetrizing the left side as well, we get f(2a)+2f(b)=f(2b)+2f(a). This implies that f(2a)-f(2b)=2(f(a)-f(b)). Assuming b=0, we get f(2a)=2f(a)-f(0).

Now use a=x+y and b=0 to show that f(x)-f(0) is linear. This shows us that f(x)=2x-f(0) or f(x)=0 are the only solutions to this question.

Poverty, mental health and rhesus monkeys

Today I came across a very interesting paper titled “Values Encoded in orbitofrontal cortex are causally related to economic choices” by Ballesta, Shi, Conen and Padua-Schioppa. I haven’t read and analyzed the paper fully, partly because of the many statistical tools that I will have to learn to assess it carefully. However, I did manage to read some important bits, and it set me thinking about how it directly applies to so many of us in our daily lives.

In this paper, the researchers claim that our subjective values of things are hard-coded in our orbitofrontal cortices. That is just a fancy way of saying that if we like burgers more than fries, this information is stored in a part of the brain that lies directly above your eyes. Hence, every time you’re offered a choice between burgers and fries, that part of your brain implores you to choose burgers.


The following experiment was done on rhesus monkeys. They were given a choice between 2 drops of grape juice and 6 drops of peppermint tea. Depending upon their surjective preferences (as coded in their orbitofrontal cortices), they would prefer one or the other. For example, let us assume that a monkey named Tim would mostly choose the 2 drops of grape juice over peppermint tea. How exactly is Tim offered these choices?

Tim is first shown a picture of 2 drops of grape juice. Then after a 1 second delay, he is shown a picture of 6 drops of peppermint tea. He is then asked to choose amongst the two images. He consistently chooses the grape juice.

However, suppose a current of 100 \mu A is passed through his orbitofrontal cortex every time he is shown the grape juice image. The passage of this current causes Tim to start choosing peppermint tea slightly more often than before. Note that Tim does not have a clear preference for peppermint tea as such. It is just that his choices become more randomized. Why does this happen? The authors of the paper think that the current interferes with the working of the orbitofrontal cortex, which was initially asking it to consistently choose the grape juice.

AB means that A is being offered first, and B is being offered second. StimON means that current is passed during the first offer.

Depression and poverty

If you’ve been depressed before, you can surely empathize with this. Let us suppose that in a happy and stable state of mind, you’d always prefer the color red over blue. Given the choice between two t-shirts of those colors, you’d consistently choose the red one. However, once that depression hits, you don’t really care anymore. You’d start choosing the blue one more often than before because they’re all the same, and it doesn’t matter. Our brain just refuses to do the computation that leads us to conclude that we prefer red over blue (yes, even choices are a result of mental computation). And the absence of computation leads to random choice.

What about poverty? Imagine that poverty causes a small current to run though your orbitofrontal cortex. This causes your brain’s computational capacities to plummet, leading you to make arbitary choices, or perhaps choices that are dictated by short term thinking (obviously short term thinking requires less computation than long term thinking). Say at the end of a hard day’s labor, you have $100 in your pocket. If you save $50 every day for a year, you will have saved enough money to accumulate interest, perhaps help you tide over bad times. But c’mon. You’re incapable of that computation. Your orbitofrontal cortex is screaming at you to save some money at least. You’ve been through terrible times, and if you’d saved in the past, you’d have been so much better off. You know that you should save money. You’ve definitely been burned enough times to know that. But you cannot hear your brain screaming over the current. You’d rather go to the bar right now and drink it all up. Tomorrow is another day.

What about self-destructive behaviour? What if the lack of will power is basically the lack of computational capabilites? This is perhaps getting into speculative territory. But these are burning questions that can be answered using a similar experimental evidence as described above.

The analysis above is different from the paper in one significant aspect: in the case of depression and poverty, there’s a current that is always running through the orbitofrontal cortex. Hence, our brain, that is now incapable of computing and hence making a good choice, now makes a random choice. However, in the experiment, the current is running through Tim’s brain only when the first choice is being shown. In effect, the current interferes with Tim’s ability to register or analyze the first choice properly. Hence, he starts choosing the second choice, which he can at least perceive properly (even though he may not like the second choice per se). If current through the orbitofrontal cortex does indeed decrease computational capabilities, then this current is causing Tim to not be able to carry out the computation to register the first choice, hence leading it to go for the second choice that it has registered better.

It would be interesting to see an experiment in which Tim has a clear preference for choice A, and a current is running through his brain when both choices are presented. Will this randomize his choices between A and B? That would then provide supporting evidence for my brain current theory of depression and poverty.


  1. Values Encoded in orbitofrontal cortex are causally related to economic choices” by Ballesta, Shi, Conen and Padua-Schioppa

Economics with bad drawings

All things considered, I am not very good at reading.

Let me elaborate on that. Most of my learning, throughout my life, has been based upon reading texts. Recently, while trying to wrap my head around complicated geometric notions written in texts without pictures, I realized that I was pretty horrible at understanding concepts without pictures. Soon, I began making pictures for everything, and it seemed to help in a lot of areas that I thought I could never understand. Including in my own mathematical research.

One field that I could never really understand that well before is Economics. Today, I spent some time reading a glossary of economic concepts and coming up with pictorial representations for all of them:

Evolution, Wars and Game theory

This post is on Evolutionary Game Theory, derived from the book Networks, Crowds and Markets: Reasoning about a highly connected world by David Easley and Jon Kleinberg. This seemingly esoteric field shines a light on an incredible number of things, including evolution, wars, and society itself. Although the subject is inherently mathematical, the mathematics only serves to distract from the ultimate message in a lot of examples. Hence, this article mostly avoids formulae and lengthy calculations. Almost all the examples constructed in this article are my own.


What does Game theory actually mean? Imagine that India and Pakistan are about to go to war. Both sides have declared the other to be responsible for recent provocations, and have justified to their citizens the use of overwhelming force to crush the enemy. Who should attack first? Should they attack at all?

If both of them are able to stop themselves from going to war, both of them will benefit. No soldiers or citizens killed, no blow to the economy, you name it. However, if one of them surprise attacks first, they will catch the other sleeping and quickly be able to gain territory. This will boost the economy, and improve public morale, ensuring that the current government surely wins the next election. Hence, the benefits of attacking first outweigh the benefits of withholding from attack in this situation.

If India and Pakistan are not in communication with each other, they will each expect that the other will want to attack first. Hence, if only to preemptively deny this advantage to the other, they will both attack, leading to entrenched warfare, massive destruction, with very little gain.

If one could quantify the benefits of going to war or abstaining from war in a table, it would look something like this:

India/PakistanNot attackAttack
Not attack0,0-10,10
Benefits and losses have been represented as numbers between -10 and 10

Hence, if India and Pakistan are not in communication with each other, out of sheer insecurity, they will attack each other, leading to a bad result in which both of them suffer. However, if they are in communication with each other, and show cooperation by agreeing not to attack one another, they will both avoid a whole lot of damage and suffering.

This is an example of the classic Prisoner’s Dilemma, in which insecurity amongst people leads to bad conclusions. If I take pains to help out a classmate who will never help me in return, then without communication I will just assume that I’m being a fool, and that I might as well be unhelpful to that person. This inevitably leads to an uncomfortable atmosphere in class for both. However, if both of us have a verbal or non-verbal understanding that we will always help the other out, then both of us can benefit. Life is generally not a zero-sum game, and humans generally do benefit from cooperation, as long as they play by the rules.

Game theory is the study of decision making in such circumstances.

Evolutionary Biology

Evolution, as we all know, is the process through which some genetic traits persist over time, while others are wiped out. If there are two types of owls on an island, identical in every way except for the fact that one type is blind and the other can see, then soon enough the blind owls will be wiped out from the island. This is because it will be in direct competition with the seeing owls for food and other resources. How did one type of owl come to be blind while the other did not? Most genetic differences arise because of random mutation. It is possible that owls were always blind, and due to a random mutation one of them could see. This mutated owl cold hunt better, and soon gave birth to a lot of kids who also had eyes. Gradually, these owls with the power of vision out-competed the “original” blind owls, established their dominance over the island, and with a couple of hundred years those blind owls were nowhere to be found.

Note that evolution is the composite of random mutation and natural selection. It is not the process through which one species naturally dominates another. For instance, when humans first reached Australia more than 40,000 years ago, they hunted many species into extinction. This is not explained through evolution. Evolution only occurs gradually between members of the same species that are distinct only in small ways, their differences caused by random mutation.

Evolutionary Game Theory

How does game theory come into the picture though? Let us suppose that in a large community of say a thousand pigs, ten pigs have three legs instead of four due to random mutation. What will happen? The other pigs will out-compete these mutated pigs for food, shelter and mates. These mutated pigs will probably starve a lot of the time, not have many mates, and hence much fewer children than the rest. The fraction of three-legged pigs will slowly decrease generation upon generation, until it is negligible (or even zero). A genetic trait is called evolutionarily stable if a large enough fraction of the population with that trait can successfully dominate all other mutations, provided that the fraction of such mutants is small enough. Having four legs seems to be an evolutionarily stable trait. Of course we haven’t proved it yet. What if even one pig with 16 legs can out-compete this pig population for food and mates? This is unlikely, as this 16 legged pig will only be made fun of by the others, and will find it hard to attract mates in order to pro-create. Hence, in all likelihood, having 4 legs is evolutionarily stable for this population of pigs.

Let us take another example. Imagine that in a society with a thousand humans, one socially awkward child with a super high IQ is born. If there is only one such child, he/she will struggle to fit in, probably fail to attract mates in competition with others, and hence this mutation will die out. Hence, average IQ is evolutionarily stable, and not a very high one. However, if a sufficiently large number of children are high IQ mutants, they can fend for each other, procreate with each other, and ultimately beat others while in competition for limited resources. Hence, high IQ can spread only if the number of high IQ mutants is large enough to begin with. This would perhaps explain the current state of the world to some extent.

But what happens if the world is mostly comprised of high IQ individuals, and a small number of low IQ mutant is introduced? Intense competition between the high IQ individuals may lead to their mutual destruction. For example, two technologically advanced nations may blow each other up with nuclear weapons, while less advanced nations may survive unscathed. In the aftermath of this destruction, the low IQ mutants may pro-create comfortably, soon populating society with mostly low IQ individuals. Hence, high IQ is not evolutionarily stable.

In the book Sapiens by Yuval Noah Harari, the author argues that Neanderthals were known to actually have a bigger brain than Homo sapiens, and a bigger brain generally translates to a higher IQ. Hence, it was always a mystery how Sapiens managed to exterminate the Neanderthals and spread over the whole world. Perhaps the fact that high IQ is not evolutionarily stable explains this counter-intuitive phenomenon.

Nash equilibrium

What is Nash equilibrium? It is an arrangement between two players in which neither side will gain from changing their strategy. For example, let us suppose that India and Pakistan agree to not go to war with each other. If Pakistan now changes its stance and attacks India, India will retaliate by attacking Pakistan. In this way, things will only become worse if either country stances its stance from peace to war. This shows that both countries being at peace is a Nash equilibrium.

What does Nash equilibrium have to do with evolution? Things that are evolutionarily stable are also Nash equilibria. What this means is the following: suppose we have a population with an evolutionarily stable trait (say, average IQ). If a small fraction of individuals suddenly becomes high IQ, that small fraction will soon be wiped out. Hence, when two average IQ persons are competing for a resource, it will not benefit either of them (over the long run) to become a high IQ mutant. Hence, the only stable arrangement, in which everyone is well-off over the long run, is if everyone has an average IQ.

Mixed strategy

In some situations, we don’t have an evolutionarily stable trait.

Consider the scenario in which most countries of the world are peaceful and minding their own business. Suddenly, the US starts attacking everybody, soon accumulating a load of wealth and territory. Other countries will soon follow suit, and the world will descend into chaos. But in this aftermath of bloody destruction, people will realize that Switzerland has remained unscathed because it refused to descend into this horrible mess. They will realize their folly, and decide to again become peaceful. Soon, most countries in the world are again peaceful. But then they see Russia attacking countries near its borders, slowly gaining control over large parts of the world. To prevent the scaled from being tilted too heavily in Russia’s favor, other countries soon join the fray.

In this way, much of world history has been a story of wars, interspersed with periods of genuine peace. Neither war, nor peace are evolutionarily stable. It took only one belligerent country to make all other countries go to war, and it took only one peaceful, relatively unscathed country to make everyone return to peace. I don’t mean to overplay my hand, but this could be a possible reason why humanity has often oscillated between wars and peace, and neither has stuck.

What should one do then? The author of the article proposes that we should have a mixed strategy in situations without pure evolutionary equilibria. What would that look like in this situation? Suppose all countries mutually decide to initiate an attack with another only one time out of ten (of course all countries can retaliate every time they’re attacked first). If a country suddenly starts attacking everyone, it will find itself always at war, its people and economy destroyed. Hence, it will adapt to attacking less frequently. If a country that is involved in world politics never initiates an attack, it will soon enough find itself attacked first by another country, and will hence suffer considerable damage. Soon, it will start initiating some attacks on its own to never let the past repeat itself. Hence, attacking sometimes, perhaps one time out of ten, is the evolutionarily stable strategy, as eventually everyone will start doing this.


As one may imagine, game theory can probably be applied almost everywhere in human life. When people lack cooperation, insecurity ultimately leads them to destroy their own peace of mind and others’. For example, in a society in which everyone throws garbage on the roads, one person deciding not to do so and taking pains to clean the streets is at a disadvantage. Other people will keep throwing garbage on the streets. Hence, that person will soon decide to stop his/her futile efforts, and litter the streets like other. This shows that the situation in which everyone throws garbage on the streets is evolutionarily stable. However, if people cooperate and mutually decide to never throw garbage on the streets, and also to punish/fine individuals that pollute the streets, our streets will remain clean forever. Hence, clean streets can be evolutionarily stable only when people communicate and cooperate with each other. I stole this example from the book Algorithms to Live By, by Brian Christian and Tom Griffiths. This book also re-kindled my desire to finally understand Game theory, after multiple failed attempts in the past.


  1. Evolutionary Game Theory, by Easley and Kleinberg