Anyone who tries to draw “life lessons” from machine learning is someone who understands neither life, nor machine learning. With that in mind, let us get into the life lessons that I draw from machine learning.
Disclaimer: The idea that I am going to expound below is something I first came across in the book Algorithms to Live By. I thought it was a really impressive idea, but didn’t do much about it. I recently came across it again in Neel Nanda’s fantastic video on machine learning. Humans are hardwired to pay attention to an idea that they come across multiple times in unrelated contexts. If the same app is recommended to you by your friends and by a random stranger on the internet, it’s probably very good and you should download it. Hence, seeing as I heard about this idea from two people I enjoy reading and learning from, I decided to give it some thought and write about it.
In programming as well as in life, we want to minimize our error, or loss function. When a neural network builds a model of the world from given data, it tries to minimize the difference between its predictions and the data. But what do humans want to minimize?
Humans want to minimize regret (also explained in the book Algorithms to Live By).
Regret often comes from choices that led to negative experiences. I read a fantastic interview with the famous programmer (and creator of TeX) Donald Knuth, who said that in life, instead of trying to maximize our highs or positive experiences, we should focus more on minimizing our lows or negative experiences. What does that mean? Suppose you work in an office in which you win the best employee award every month. Clearly, your career is going well. However, your spouse is on the verge of divorcing you, and all your co-workers hate you. Now imagine instead that you’re an average worker whose career is not really that spectacular, but you get along with your spouse and co-workers. In which situation would you be happier and more satisfied? I bet that most of us would choose the second scenario without even thinking. Negative experiences stay with us for longer, and harm us more. Positive experiences provide self-confidence and happy nostalgia, but are overshadowed by negative experiences on most days. I’ve often thought about this statement by Knuth, and it keeps getting clearer and more relevant with time. Hence, humans will do well to minimize the regret that accumulates from negative experiences.
Although regret often stems from negative experiences, it may also arise from actions not taken. A common example would be someone who really wanted to become an artist, but was forced by circumstances into a miserable profession (hello, MBAs). They would regret not pursuing their passion for a very long time.
Hence, a happier life is not necessarily one in which we have maximized the number of happy moments, but one in which we have minimized regret.
So how does one minimize regret? I want to answer this question by discussing how a neural network performs gradient descent.
Imagine that your loss/error function is a graph in two dimensions, and you want to find the value of $x$ (your parameter) for which this function is minimized.
If the graph given above is that of the loss function $f$, it is minimized at the value of $x$ at which the function attains its global minimum. How does gradient descent work? We choose a starting point on the x-axis, say $x_0$, and then perform the simple iteration $x_{n+1} = x_n - \alpha f'(x_n)$, where $\alpha > 0$ is a small step size. Each step moves $x$ a little way downhill. If your starting point is near a local minimum, for instance, you will slowly be led towards that local minimum.
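To make the iteration concrete, here is a minimal sketch in Python. The loss curve `f`, the step size `alpha=0.01` and the number of steps are all assumptions picked for illustration, not taken from any real network. This particular curve has a local minimum near x = 1.13 and a deeper, global minimum near x = -1.30.

```python
def f(x):
    # Hypothetical loss curve: local minimum near x = 1.13,
    # global minimum near x = -1.30.
    return x**4 - 3 * x**2 + x


def f_prime(x):
    # Derivative of the hypothetical loss curve above.
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x0, alpha=0.01, steps=1000):
    # Repeatedly step against the slope, starting from x0.
    x = x0
    for _ in range(steps):
        x = x - alpha * f_prime(x)
    return x


right = gradient_descent(2.0)   # starts in the well of the local minimum
left = gradient_descent(-2.0)   # starts in the well of the global minimum
print(f"start  2.0 -> x = {right:.3f}, loss = {f(right):.3f}")
print(f"start -2.0 -> x = {left:.3f}, loss = {f(left):.3f}")
```

The two runs use exactly the same rule and differ only in where they start, yet they settle in different valleys. Plain gradient descent finds the bottom of whichever well it begins in.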
But wait. You wanted the global minimum, not a local one. How does one do that? What machine learning algorithms do is keep jumping between different starting points $x_0$, performing gradient descent with values of the step size $\alpha$ that are large at first, but decrease with time. This approach generally works because real-life data that is fed to the neural network is “nice” (probably in the sense that the attractor well of the global minimum is larger than the wells of the local minima). Hence, after a few jumps, we have a pretty good idea of where the attractor well of the global minimum lies. Now we can keep iterating the process of gradient descent until we reach the global minimum.
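The jump-and-descend idea can be sketched as follows. Everything here is an assumption made for illustration: the loss curve `f`, the decay schedule `alpha0 / (1 + decay * t)`, the range the starting points are drawn from, and the helper names `descend` and `best_of_restarts`. It is a toy version of the strategy described above, not how any particular library trains a network.

```python
import random


def f(x):
    # Hypothetical loss curve: local minimum near x = 1.13,
    # global minimum near x = -1.30.
    return x**4 - 3 * x**2 + x


def f_prime(x):
    return 4 * x**3 - 6 * x + 1


def descend(x0, alpha0=0.05, decay=0.01, steps=500):
    # Gradient descent whose step size starts large and shrinks over time.
    x = x0
    for t in range(steps):
        alpha = alpha0 / (1 + decay * t)
        x = x - alpha * f_prime(x)
    return x


def best_of_restarts(starts):
    # Run one descent per starting point and keep the lowest-loss result.
    candidates = [descend(x0) for x0 in starts]
    return min(candidates, key=f)


rng = random.Random(0)  # fixed seed so the run is repeatable
starts = [rng.uniform(-2.0, 2.0) for _ in range(10)]
best = best_of_restarts(starts)
print(f"best x = {best:.3f}, loss = {f(best):.3f}")
```

The more starting points you try, the more likely it is that at least one of them lands in the well of the global minimum. That is the whole trick.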
How does this have anything to do with life? The perhaps too obvious, but useful analogy is that we should keep trying new and different things. This is an age-old adage, but there seems to be a mathematical basis for it. Trying new things, like visiting a new part of town, trying meditation, talking to strangers, reading a book in a field you know nothing about, or learning a new language, is akin to having a new starting point $x_0$. Now you can perform gradient descent. Remove the aspects of that new thing that you don’t like. Minimize regret (I will probably regret focusing on learning French for six hours instead of writing my thesis). You might arrive at a minimum that is better than your previous minimum (if not, you can revert to your previous state of life). If you do it enough times, chances are that you will find your global minimum, or your best possible life.
This ties in with another concept that was discussed in Algorithms to Live By: annealing. When IBM was trying to design microchip processors that would power the modern computer, finding the right arrangement for each part of the processor was proving to be a mathematically intractable problem. How should these millions of different parts be arranged so that the processor is fast and does not generate too much heat? There were literally millions of parameters, and the brightest minds in the world had no idea what to do.
What one physicist suggested was the process of annealing. What is annealing? It is the process through which metals are heated to very high temperatures, and then slowly allowed to cool. This relieves internal stresses and leaves the metal softer and easier to work, because the atoms have time to settle into a more ordered, lower-energy arrangement. Similarly, the physicist suggested that they randomly arrange all the parts of the processor, and then perform small changes that would make the processor more stable and efficient, tolerating the occasional change for the worse early on and fewer of them as the process “cooled”. Soon, they arrived at a design that was efficient and successfully powered the modern computer.
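Here is a toy sketch of that idea, usually called simulated annealing. It is nowhere near a real chip-layout tool. The problem (eight parts in a row, joined by wires, with total wire length as the cost), the swap move, the starting temperature and the cooling rate are all assumptions made up for illustration, as are the names `wire_length` and `anneal`.

```python
import math
import random


def wire_length(order, wires):
    # Cost of a layout: total distance between parts joined by a wire.
    position = {part: slot for slot, part in enumerate(order)}
    return sum(abs(position[a] - position[b]) for a, b in wires)


def anneal(order, wires, steps=5000, t_start=5.0, cooling=0.999, seed=0):
    rng = random.Random(seed)
    current = list(order)
    current_cost = wire_length(current, wires)
    best, best_cost = list(current), current_cost
    temperature = t_start
    for _ in range(steps):
        # Small change: swap two randomly chosen parts.
        i, j = rng.sample(range(len(current)), 2)
        current[i], current[j] = current[j], current[i]
        new_cost = wire_length(current, wires)
        delta = new_cost - current_cost
        # Always keep improvements; keep a worse layout only with a
        # probability that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current_cost = new_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
        else:
            current[i], current[j] = current[j], current[i]  # undo the swap
        temperature *= cooling  # cool down slowly
    return best, best_cost


# Hypothetical example: parts 0..7 wired in a chain 0-1-2-...-7.
wires = [(k, k + 1) for k in range(7)]
start = [3, 7, 0, 5, 2, 6, 1, 4]
layout, cost = anneal(start, wires)
print(wire_length(start, wires), "->", cost)
```

While the temperature is high, the search is allowed to wander, including uphill. As it cools, it behaves more and more like plain downhill tweaking. The function keeps track of the best layout seen along the way, so the result is never worse than the starting arrangement.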
How does this apply in one’s life? One possibility is resource allocation. How much time should I devote to working out, as opposed to studying or socializing? We can start at an arbitrary point: say I work out for 10 minutes every day, study for 5 hours and socialize for 2 hours. I can then change the parameters little by little, in the same way that a metal slowly cools down. I should probably work out more, and not spend as much time socializing. Hence, maybe I can work out for 30 minutes, study for 4 hours and socialize for 1.5 hours. I can keep tweaking these parameters until I reach an arrangement that is satisfactory for me and makes me happy.
Hence, if you’re looking to live your best life possible, you should start doing things you never thought to do before, and then change small parameters until you reach a satisfactory situation. Then start all over again. After trying enough things, there’s a good chance that you will have figured out what makes you happiest, and minimizes regret. You’ve figured out how to live.
And thus ends my Deepak Chopra-esque spiel.