Fruits of procrastination

The science of going in circles on doughnuts

The paper that I’m going to be reviewing today is A Topological Look at the Quantum Hall Effect. It explains an amazing coming together of topology and quantum physics, of all things, and provides a review of very important, Nobel-prize-winning work in physics.

The Hall Effect

We all learned in high school that if we pass current through a conductor in a magnetic field, and the magnetic field is not parallel to the current, then the conductor experiences a force.

But does the conductor as a whole experience this force, or only the electrons inside it? Maxwell, in his Treatise on Electricity and Magnetism, said that “the mechanical force which urges the conductor…acts not on the electric current, but on the conductor which carries it.” This seems wrong, right? It is the charges that should experience the force, and not non-charged portions of the conductor.

Edwin Hall, a student at Johns Hopkins University, investigated further. He passed a current through a gold leaf in a magnetic field. He could see that the magnetic field was acting on the electrons themselves, as it altered the charge distribution inside the gold leaf, which he could measure with a galvanometer.

The result is a separation of charges across the breadth of the plate. One may imagine that instead of flowing through the whole plate uniformly, the current gets deflected to one side of the plate as soon as it enters, although it keeps moving forward. The potential difference created by this separation of charges is known as the Hall voltage, and the Hall conductance is the current already flowing through the plate, divided by the Hall voltage. Remember Ohm’s law, which says V=IR? This implies that \frac{1}{R}=\frac{I}{V}, and this \frac{1}{R} is the conductance. Of course, the difference here is that the current is not caused by the Hall voltage; hence this formula cannot be obtained directly from Ohm’s law, but let’s shut our eyes to these details.

The direction of this voltage is not fixed, and depends on the charge of the charge carriers in the conductor. Hence, measuring the direction of this voltage is an easy way to determine the nature of charge carriers in a conductor. Some semiconductors owe their efficacy to having positively charged charge carriers (called holes), instead of the usual electrons.

The Quantum Hall Effect

In 1980, Klaus von Klitzing was studying the conductance of a two-dimensional electron gas at very low temperatures. Now remember that resistance = voltage/current (because conductance = current/voltage, as discussed previously). For a fixed current, this formula says that resistance should increase as the voltage increases. And since the Hall voltage increases with the magnetic field strength, as the magnetic field strength increases, so should the resistance. But what is the pattern of this increase? It is pretty surprising, surprising enough to have won von Klitzing a Nobel prize for studying it.

Note: the resistance seems to increase in small steps at lower magnetic field strengths, and in bigger steps at higher values of the magnetic field. However, the conductance, which is its reciprocal, decreases by the same amount at each step. In other words, conductance is quantized. A quantity is referred to as quantized if it can be written as an integer multiple of some fundamental quantity, and increases or decreases in steps of that size.

Why do we get this staircase graph, and not, say, a linear graph? If we consider the conductance = 1/resistance values in the above graph, we see that all the successive values are integer multiples of a constant we find in nature: e^2/h, irrespective of the geometric properties or imperfections of the material. Here e is the electric charge of an electron, and h is Planck’s constant. Why does this happen?
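As a quick numerical aside (using CODATA values for e and h, which are not quoted in the post itself), the size of the conductance steps, and the corresponding resistance quantum, can be computed directly:

```python
# Sketch: the conductance quantum e^2/h seen in the Hall plateaus.
# The constants below are CODATA values, not numbers from the paper.
e = 1.602176634e-19   # electron charge, in coulombs
h = 6.62607015e-34    # Planck's constant, in joule-seconds

g0 = e**2 / h         # size of each conductance step, in siemens

# The measured Hall conductance sits on plateaus at integer multiples of g0:
plateaus = [n * g0 for n in (1, 2, 3, 4)]

print(f"e^2/h = {g0:.4e} S")
print(f"h/e^2 = {1/g0:.3f} ohm")   # the von Klitzing constant, ~25812.807 ohm
```

The reciprocal h/e^2 is so reproducible across samples that it is now used as a resistance standard.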

Robert Laughlin offered an explanation. Consider an electron gas that is cold enough that quantum coherence holds. This basically means that there’s not enough energy in the system for all the particles to behave independently, so all particles behave “similarly”. Hence, the behavior of the system can be described by a Hamiltonian depending on a small number of variables (two, in this case). Laughlin imagined the electron gas to be a looped ribbon, with its opposite edges connected to different electron reservoirs.

Now imagine that there’s a current flowing in the ribbon along its surface in a circular motion. The presence of the magnetic field causes a separation of charges. Because the two edges are already connected to electron reservoirs (which can both accept and donate electrons easily), the separation of charges causes electrons to flow from one reservoir to another. If, say, the side connected to A becomes positive due to separation of charges and the side connected to B becomes negative, then electrons from A go into the ribbon, and electrons from the ribbon go into B. One may imagine a current flowing between the electron reservoirs, which is maintained because the magnetic field maintains the Hall voltage between the two ends of the ribbon. However, in order to increase this Hall current, the magnetic flux \Phi will have to be increased (which will increase the Hall voltage).

The Aharonov-Bohm principle says that the Hamiltonian describing the system is gauge invariant under (magnetic) flux changes of \Phi_0=hc/e, which is the elementary quantum of magnetic flux. The Hamiltonian of a system may be thought of as its “energy”. Hence, although the Hall voltage increases every time we increase the magnetic flux by one quantum, the “energy” of the system remains unchanged. This quantized increase in Hall voltage causes a quantized increase in resistance, which can be seen from the step graph above. If the increase in resistance was not quantized, then the graph would have been a smooth curve, and not “stair-like”.

A small note about what it means for the Hamiltonian to be gauge invariant under flux changes of \Phi_0: it is not that the Hamiltonian doesn’t change with time. By the time the flux increase happens, all the eigenstates of the Hamiltonian will have changed. What gauge invariance means here is that if the measured eigenstate of the Hamiltonian at time t_0 is \psi(t_0), and we finish increasing the magnetic flux by \Phi_0 at time t_1 and also measure the Hamiltonian at that time, then the eigenstate we get is going to be \psi(t_1), and not some other eigenstate \psi^*(t_1). For simplicity, we will just assume in this article that \psi(t_0)=\psi(t_1), and that the Hamiltonian doesn’t change with time.

Another important note: the charge transfer between the two reservoirs is quantized. In other words, when I increase the magnetic flux by one quantum of flux, the Hall current increases by a certain amount; when I increase the magnetic flux by the same amount again, the same increase in Hall current is seen. There is no good reason for this in quantum mechanics. Although increasing the magnetic flux by \Phi_0 brings the system back to its original Hamiltonian eigenstate, quantum mechanics gives no guarantee that the same initial conditions will lead to the same outcome: quantum states randomly collapse into any one of their eigenstates, and don’t have to collapse to the same eigenstate each time. This fact, of the same increase in Hall current each time, needs some topology to be explained. But before that, we explore a concept from geometry: curvature.

Adiabatic curvature

“Adiabatic” evolution occurs when the external conditions of the system change gradually, allowing the system to adapt its configuration.

What is curvature? Let us explain this with an example. Take the earth. Starting at the North Pole, build a small road that comes back to the North Pole. By small, we mean that the road should not go all the way around the earth. If you drive your car along this road, the initial orientation of the car will be different from the final orientation of the car (when you arrive). Curvature is the limit of this difference between initial and final orientations of the car, divided by the area enclosed by the road, as the loop tends to a point. Note that this would never happen on flat land which was “not curved”: in other words, the initial and final orientations of the car would be the same.
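This “holonomy divided by enclosed area” definition can be checked numerically on a sphere, where the answer is known to be 1/R^2. The loop below is the standard octant triangle (pole to equator, a quarter of the way along the equator, and back); treating the Earth as a sphere of round-number radius is an assumption for the example:

```python
import math

# Toy check of "curvature = (change in orientation) / (enclosed area)" on a
# sphere of radius R, whose Gaussian curvature is known to be 1/R^2.

R = 6371.0  # km, roughly the Earth's radius (an assumed round number)

# Octant triangle: North Pole -> equator -> quarter way along the equator -> back.
# All three corner angles are 90 degrees, so the orientation of the car comes
# back rotated by the angle excess 3*(pi/2) - pi = pi/2.
excess = 3 * (math.pi / 2) - math.pi

# For a geodesic triangle on a sphere, the enclosed area is (angle excess) * R^2.
area = excess * R**2

curvature = excess / area          # holonomy divided by enclosed area
print(curvature, 1 / R**2)         # both equal the Gaussian curvature 1/R^2
```

Strictly, the definition takes a limit as the loop shrinks to a point, but on a sphere the curvature is the same everywhere, so this single large loop already gives the right answer.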

But how does curvature fit into quantum mechanics? Consider a curved surface parametrized by two angles, \theta and \Phi. Each point of this curved surface corresponds to an eigenstate of the Hamiltonian, which can be written as H(\theta,\Phi). Essentially, this is a configuration space of the Hamiltonian of the system. Let us now build a quantum (ground) state, say e^{i\alpha}|\psi(\theta,\Phi)\rangle. If we traverse a small loop from a point back to itself, this ground state, as it is “driven” around the loop, changes its phase from \alpha to something else. The curvature of this configuration space, measured using the limiting process described above, turns out to be 2\,\mathrm{Im}\,\langle \partial_{\theta}\psi|\partial_{\Phi}\psi\rangle.
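As a sanity check on this formula, here is a small numerical experiment on a standard textbook system, a spin-1/2 in a magnetic field along the direction (\theta, \phi), rather than the Hall system itself. The ground state written below is one common convention, and the known answer for the curvature magnitude is \sin(\theta)/2; since the sign depends on conventions, the sketch compares absolute values:

```python
import cmath, math

# Numerical check of the curvature formula 2*Im< d_theta psi | d_phi psi >
# on the textbook example of a spin-1/2 in a field along (theta, phi).
# Known answer: |curvature| = sin(theta)/2.

def psi(theta, phi):
    # ground state of H = B . sigma, in one common convention
    return [math.sin(theta / 2), -math.cos(theta / 2) * cmath.exp(1j * phi)]

def inner(u, v):
    # <u|v> = sum of conj(u_i) * v_i over components
    return sum(complex(a).conjugate() * b for a, b in zip(u, v))

def curvature(theta, phi, eps=1e-6):
    # central finite differences for the two partial derivatives of psi
    dth = [(a - b) / (2 * eps)
           for a, b in zip(psi(theta + eps, phi), psi(theta - eps, phi))]
    dph = [(a - b) / (2 * eps)
           for a, b in zip(psi(theta, phi + eps), psi(theta, phi - eps))]
    return 2 * inner(dth, dph).imag

theta = 1.0
print(abs(curvature(theta, 0.3)), math.sin(theta) / 2)  # the two should agree
```

Integrating this \sin(\theta)/2 over the whole sphere of field directions gives 2\pi, which is the Chern number 1 in the units used later in the post.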

Why do we care about bringing things back in a loop, as far as Hamiltonians are concerned? Because at the end of the process of increasing the magnetic flux by one quantum, the initial and final Hamiltonians have to be the “same” in our case. We always have to come back in a loop to our initial Hamiltonian.

Hall Conductance as Curvature

What are \theta and \Phi as described above? \Phi is the magnetic flux (and through it, it drives the Hall voltage), and \theta can be thought of as a function of \Phi. More precisely, \theta is defined such that I=c\partial_{\theta}H(\Phi,\theta). The Hamiltonian is periodic with respect to \Phi. If the period of H with respect to \Phi is P, then the process of increasing the magnetic flux by one quantum can be thought of as changing the value of \Phi to \Phi+P, which leaves the Hamiltonian unchanged. Geometrically, if the configuration space is thought of as a curved surface, this process can be thought of as moving along a circle and returning to the point we started out from.

For an adiabatic process where \Phi is changed very slowly, Schrodinger’s equation gives us \langle \psi|I|\psi\rangle=\hbar cK\dot{\Phi}. Here \langle \psi|I|\psi\rangle can be thought of as the expected value of the current (the weighted average of all possible values of the current), and by Faraday’s law the rate of change of flux \dot{\Phi} is proportional to the voltage. Hence K, going by the formula discussed above, is (some constant multiplied with) the conductance. In this way, Hall conductance can be interpreted as the curvature of the configuration space of the Hamiltonian.
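Keeping track of the constants loosely (and working in Gaussian units, where Faraday’s law reads V=\dot{\Phi}/c), the identification of K with conductance is a one-line calculation:

```latex
\langle \psi | I | \psi \rangle = \hbar c K \dot{\Phi}
\quad\text{and}\quad
V = \frac{\dot{\Phi}}{c}
\quad\Longrightarrow\quad
\frac{\langle I \rangle}{V}
  = \frac{\hbar c K \dot{\Phi}}{\dot{\Phi}/c}
  = \hbar c^{2} K .
```

So up to the constant factor \hbar c^{2}, the conductance is the curvature K.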

Chern numbers

What is this “curved surface” that forms the configuration space of the Hamiltonian? As it is periodic in both its parameters \Phi,\theta, we can think of it as a torus. Each point of this configuration space corresponds to an eigenstate of the Hamiltonian.

Let the torus given above be the configuration space of the eigenstates of the Hamiltonian. Consider the integral \int K dA. The difference between this integral evaluated on the orange patch and on the complementary blue patch is an integer multiple of 2\pi. This integer is known as the Chern number. You may have seen the famous Gauss-Bonnet Theorem, which states that \frac{1}{2\pi} \int K dA=2(1-g). The right-hand side, 2(1-g), is the Chern number in the special case that the loop around the orange area is contractible (can be continuously shrunk to a point).
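A quick numerical check of the Gauss-Bonnet statement on the simplest closed surface, a sphere of radius R (genus g = 0), where the curvature is constant:

```python
import math

# Numerical sanity check of Gauss-Bonnet on a sphere (genus g = 0):
# (1/2pi) * integral of K dA should equal 2(1 - g) = 2.

R = 1.0
K = 1.0 / R**2          # a sphere has constant Gaussian curvature 1/R^2

# Area element: dA = R^2 sin(theta) dtheta dphi; the phi integral contributes 2*pi.
n = 1000
integral = 0.0
for i in range(n):
    theta = (i + 0.5) * math.pi / n          # midpoint rule in theta
    integral += K * R**2 * math.sin(theta) * (math.pi / n) * (2 * math.pi)

chern = integral / (2 * math.pi)
print(chern)            # approximately 2, matching 2(1 - g) with g = 0
```

For a torus (g = 1) the same integral over the whole surface comes out to 0: the positive curvature on the outside exactly cancels the negative curvature around the hole.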

The Chern number associated with the diagram given above does not change when we deform the orange patch slightly. However, it will change if the boundary of the orange patch changes its topological properties (i.e. if its homotopy class changes). One way it could do that is if the loop circled the “doughnut hole” of the torus completely.

Let’s bring it all together

Now that we have all that seemingly useless background, let’s bring it all together. Let’s start with the torus, representing all eigenstates of the Hamiltonian. Every time we increase the magnetic flux by one quantum, it’s like we’re going one full circle around the doughnut, and returning to our original Hamiltonian eigenstate. Both the pink and red curves given below are examples of this.

This causes the Chern number to change by an amount, say a. Why does the Chern number change? Because we’re adding the integral \frac{1}{2\pi}\int K dA to the previous integral. When we again increase the magnetic flux by one quantum, we go in another full circle around the torus, which changes the Chern number by the same amount a. Both the charge transfer between the reservoirs and the Hall conductance are exactly this Chern number in fundamental units (where constants like h and c are taken to be 1). Hence, each time we increase the magnetic flux by one quantum, we see the same increase in Hall current, and the same decrease in conductance.

Why does topology have anything to do with quantum mechanics? We might think of this as an analogy that we constructed, that just happened to work absurdly well. We interpreted conductance in terms of curvature. Hence, the change in conductance can be thought of as an integral of curvature over regions of the torus. These integrals satisfy strict mathematical relations, via the Chern-Gauss-Bonnet Theorem. Hence, we can use this theorem to determine exactly how much the conductance changes every time we move along a certain loop on the torus (every time we increase the magnetic flux by one magnetic flux quantum).

Thanks for reading!


  1. A Topological Look at the Quantum Hall effect
  2. Adiabatic Theorem

Alcohol Use and Burden: Lancet

Alcohol consumption is a loaded topic. Most people know that it is probably not very good for your health, but how bad can it be? We know that some people die due to alcohol consumption, but that probably happens only to drunkards in old age. Moderation is the key. Moreover, there’s been a notion in popular culture for years, that a drink or two might be actually good for you! Everybody does it. Millions of people cannot be wrong, of course. These claims will be explored in the article below.

The paper that I will be studying today is Alcohol Use and Burden for 195 countries and territories, 1990-2016: a systematic analysis of the Global Burden of Disease Study 2016. This paper, published in the prestigious journal Lancet, is one of the most highly cited scientific papers in recent years, and disproves a lot of myths around alcohol consumption.


Most researchers concur that alcohol is a leading cause for much illness and disease. However, some evidence suggests that a low intake of alcohol has a protective effect on some conditions like ischemic heart disease (the most common form of heart disease) and diabetes. The method of data collection for this “evidence” was self-reporting by consumers of alcohol, which has been found to be unreliable. Some recent studies have instead focused on alcohol sales numbers, prevalence of drinking habits, etc.

This study improves on the prevalent data collection process in the following ways: 694 individual and population-level data sources have been consulted. Alcohol consumption levels have been adjusted to include consumption by tourists (which probably accounts for a fairly large percentage of alcohol consumption in touristy places). New methods have been devised to better approximate illicit alcohol sales (that are not recorded in stores). 23 health-related outcomes of alcohol consumption, like cirrhosis, tuberculosis, cancer, etc., have been identified and individually studied. Also, relative risk curves have been used to determine the level of alcohol consumption that minimizes health risk. What is a relative risk curve, you ask? If not drinking alcohol can be thought to have a “risk” factor of 1 for the advent of tuberculosis, then drinking 4 glasses of beer everyday can be thought to have a relative risk factor of 2.3. Hence, drinking 4 glasses of beer everyday might make you 2.3 times more likely to contract tuberculosis.
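Relative risk is just a multiplier on a baseline probability. A toy calculation with invented numbers, echoing the beer example above:

```python
# Toy relative-risk arithmetic; all numbers are invented for illustration,
# not taken from the paper.
baseline_tb_risk = 0.002   # hypothetical tuberculosis risk for non-drinkers
relative_risk = 2.3        # hypothetical RR for 4 beers/day, as in the example

risk_at_4_beers = baseline_tb_risk * relative_risk
print(f"{risk_at_4_beers:.4f}")   # 2.3 times the baseline risk
```

A relative risk curve is just this multiplier plotted as a function of daily consumption, with RR = 1 at zero drinks.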

How do you study the effects of alcohol? By comparing a group of people who drink with a group of people who do not. This group of people who do not drink, with which other categories of people are compared, can be called the “reference category” in this context. Previous studies didn’t care about the composition of this reference category. Hence, their reference category included people who drank heavily in the past (and hence had less than ideal health conditions). On comparing this reference category with the category of drinkers, they didn’t find much difference, and hence concluded that drinking doesn’t change one’s health that much. This study removes this bias by carefully controlling the membership of the reference category.

Also, previous studies simply assumed that the amount of drinking that minimizes harm is zero, and then determined population statistics and error margins from this. This should not just be an assumption, but needs to be proven; moreover, such an assumption reduces the accuracy of the statistical study. This paper does not make such an assumption, and hence better captures the effects of alcohol on health outcomes, though possibly at the cost of wider confidence intervals in its conclusions.

The data used in this study has been collected from 195 locations across the world from 1990 to 2016, for both sexes, and for 5 year age groups between the ages of 5 and 95 years or older. Moreover, the risk of alcohol consumption for 23 diseases and health disorders has also been studied.


Alcohol consumption is measured in grams of pure ethanol consumed. Hence, a couple of light beers might be equal to a whiskey shot. For estimating consumption of alcohol, 121029 data points from 694 resources were pooled. For estimating disease risk, risk estimates from 592 studies were combined. A total of 28 million individuals were involved in these 592 studies.

The total amount of alcohol in a community is approximated from the total amount of alcohol sold in stores. The relative numbers of drinkers and abstainers, as well as the amount of alcohol consumed at an individual level across age groups, are approximated through surveys. This data is corrected for alcohol consumed by tourists. In order to get a conservative estimate for the amount of illicit alcohol sold (outside of stores), the authors collate estimates from multiple studies on the sale of illicit alcohol, and construct a uniform probability distribution over these estimates, so that each study has an equal probability of being correct. They then construct a new probability distribution, whose range is 0 to the mean of the first distribution, take a sample of 1000 from this new distribution, and average it. This average is almost certainly lower than the mean of the first distribution. Hence, they get a conservative estimate of the amount of illicit alcohol sold.
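As I read it, the conservative-estimate procedure can be sketched as follows (the study estimates below are invented; only the two-step procedure is from the paper):

```python
import random

# Sketch of the conservative estimate for illicit alcohol sales.
# The per-study estimates are made up for illustration.
random.seed(0)

study_estimates = [4.2, 5.0, 6.8, 3.1, 5.5]   # hypothetical illicit-share estimates, %

# Step 1: treat each study as equally likely; the mean of that flat
# distribution is just the average of the estimates.
mean_estimate = sum(study_estimates) / len(study_estimates)

# Step 2: draw 1000 samples uniformly from [0, mean] and average them.
samples = [random.uniform(0, mean_estimate) for _ in range(1000)]
conservative = sum(samples) / len(samples)

# The result hovers around mean/2, i.e. systematically below the raw average.
print(mean_estimate, conservative)
```

The second step is what makes the estimate conservative: averaging draws from [0, mean] pulls the figure down to roughly half the raw average.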


In 2016, 32.5% of the world population were drinkers (25% of females and 39% of males). Hence, 2.4 billion people in the world drank alcohol. The mean amount of alcohol consumed was 0.73 standard drinks daily for females and 1.7 standard drinks for males. The actual consumption varied with the Socio-Demographic Index, or SDI, which can be thought of as similar to an index of economic and social prosperity.

On the y-axis of both these graphs, the disability-adjusted life-years, or DALYs, have been plotted. These count the number of years spent handicapped or otherwise disabled, per 100,000 person-years, due to alcohol consumption. Note that these are only global averages; the data varies considerably with changing SDIs. Note that in both instances, alcohol consumption increases the amount of disability in both sexes across nearly all of the 23 diseases and disorders. The exceptions: the prevalence of ischaemic heart disease in older age groups reduces slightly in both sexes, and diabetes in females also becomes slightly better with alcohol consumption. These are the only exceptions out of the 23 diseases.

In 2016, 2.8 million deaths were attributed to alcohol consumption. This is 6.8% of all deaths amongst males, and 2.2% of all deaths amongst females. Also, alcohol was the cause of 6% of all DALYs amongst males (years spent incapacitated), and 1.6% of all DALYs amongst females. These percentages were even higher for groups of people between 15-49 years. The three leading causes of death that alcohol contributed to, for both females and males in this age group, were tuberculosis (1.4%), road injuries (1.2%) and self-harm (1.1%). For the age groups 50-80, the most prevalent disease in high SDI regions that was contributed to by alcohol consumption was cancer (27.1%). In poorer societies, tuberculosis was the most prevalent such disease. In age groups beyond 80, hemorrhage and hypertensive heart disease became much more prevalent.

If the risk of alcohol consumption with respect to these 23 diseases could be “averaged out” and then plotted, the graph would look something like this:


Past studies had shown that although risks associated with alcohol consumption increased with an increase in alcohol intake, consuming a small amount of alcohol daily was actually good for the body. Subsequent studies, conducted using Mendelian randomization and multivariate meta-analysis, had then shown this to be a flawed conclusion. This paper reaffirms the finding that no amount of alcohol is good for the body. For optimum health, the amount of alcohol we should be drinking daily is zero. However, the graph shown above suggests that drinking a maximum of one drink everyday is not all that bad, although still worse than not drinking at all.

Alcohol is a leading risk factor for disease worldwide, and accounts for nearly 10% of deaths among people aged 15-49. The authors suggest that governments should make buying alcohol more expensive, and restrict its availability in stores. An extreme example of unchecked alcohol consumption is Russia, where in the 1980s, 75% of male deaths were attributed to excessive alcohol consumption. The authors concede that they don’t have data for multiple things, like the fraction of road accidents caused by alcohol outside the US, and interpersonal harm caused due to alcohol. Also, they don’t have data on drinking below the age of 15. However, they conclude that having this data could only increase the estimate of health risk associated with alcohol consumption, not reduce it.


  1. Alcohol Use and Burden for 195 countries and territories, 1990-2016: a systematic analysis of the Global Burden of Disease Study 2016.
  2. Counterfactuals in statistics

So how does Google work, exactly?

How do we decide which movie to watch on a given weekend? We could google “latest movies”, and then take a random pick. However, that throws up both good and bad results. How do we pick a “good movie” from this list? We go to Rottentomatoes, IMDB or the “Reception” section of Wikipedia, and see what critics have to say about those movies. We generally go for the movies about which the critics we value most have mostly positive things to say. Google works on the same principle. How does it know which websites are most relevant to your query? It finds the websites that most other “credible” websites (containing your query words) cite, or link to.

The paper that I’m blogging about today is The Anatomy of a Large-Scale Hypertextual Web Search Engine. Seeing as it gave rise to Google, perhaps the greatest political, social and economic force of our times, it is one of the most important papers that I’ve had the opportunity to read. And the fact that it is based on one of the most intuitive ideas known to humanity is mind-boggling.


Google did not start the search engine revolution. However, it revolutionized it. But how? Search engines before it, like Yahoo and AltaVista, did not assume that users already knew what they were looking for and just needed a big search bar; they thought most users just wanted to “browse”. Hence, if you remember the heavy-machinery Yahoo and MSN websites, they contained separate sections for News, Shopping, Sports, etc. Moreover, their search algorithms relied on keyword matching, which turned up way too many low-quality results. If you typed “Sachin”, you’d get the page that contained the word “Sachin” the most number of times, and not the ESPN stats page for Sachin Tendulkar, which was probably what you were looking for.

Google, on the other hand, was blasphemously bland. It had just one big search bar that you could use only if you already knew what you were looking for. This flew in the face of conventional wisdom. However, they improved their search algorithms and speed of information retrieval so much that they soon became the market leader in this segment.

The authors, Sergey Brin and Lawrence Page, claim in this paper that Google, which is a misspelling of googol (10^{100}), is capable of fast crawling, indexing the massive landscape of the internet effectively, and also dealing with the millions of queries that search engines deal with daily.

Improved Search Quality and PageRank

Search engines by that time were already capable of crawling (downloading) and indexing vast swathes of the internet, and presenting these pages to users when searched for. However, they had not determined an effective system to rank these pages by relevance. Although millions of results could be returned for a query, users would typically only check out the first few, and then abandon their search. It was hence extremely important that these first few results turned out to be the “best” ones. Google exploited hypertextual information to find an effective algorithm to rank pages by relevance. What does this mean?

Let me explain this with an analogy. Suppose you need to find out if person A is honest and trustworthy. How do you determine that? You might ask them if they think they’re trustworthy. However, such self introspection is likely to be biased. Similarly, analyzing the contents of a page, like a page containing the word Sachin the most number of times, might not be the best test of its relevance. On the other hand, if you ask some people who you respect and know to be honest and trustworthy about person A, and all of them have a high opinion of this person, then you have a strong reason to trust person A. This is how relationships and ties are formed in the real world anyway, especially in high risk situations like marriage, business deals, etc. Google does pretty much the same thing.

Google ranks pages based on how many other pages cite them, or link to them. The actual contents of a page are not as important. If a search for “Einstein” returns 10 million pages, and 8 million of them link to a certain article, that article is likely to be an influential and important article on Albert Einstein. However, there’s a further modification needed to make the search algorithm even better, which I can again explain with a bad analogy. When you ask for opinions of others on person A, there might be persons who always speak well of others. Despite this charming personal trait, their opinion must be discounted in this particular situation. However, if you know persons who don’t generally throw around compliments willy-nilly, and they have overwhelmingly positive things to say about person A, then their opinion should hold much more weight.

Similarly, Google gives less weight to endorsements from pages that contain a lot of hyperlinks to various websites, and more to endorsements from pages that contain few. If there is a page on Einstein that does not contain a lot of links, and it cites only one page for all its information, that one page is likely to be extremely important. Of course there will be exceptions to this rule. However, statistically, such a search algorithm turns up good results most of the time.

If a page A has pages T_1,\dots,T_n linking to it, and PR(T_i) is the PageRank of page T_i while C(T_i) is the number of links going out of T_i, then

PR(A)= (1-d) +d(\frac{PR(T_1)}{C(T_1)}+\dots+\frac{PR(T_n)}{C(T_n)}).

Here d is a number between 0 and 1, and is called the damping factor. This factor models user fatigue: the PageRank of a page is supposed to reflect the probability that a user reaches that page after clicking on a bunch of hyperlinks in other pages. If d=0.85, for example, then there is an 85% chance that the user will click on the next link, and a 15% chance that the user will quit on that page.

Note that PageRanks are calculated iteratively: starting from an initial guess for every page, the formula above is applied repeatedly until the ranks converge. This is needed because the rank of a page depends on the ranks of the pages linking to it; if page A links to page B, we can only calculate the rank of page B once we have an estimate for the rank of page A.
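A minimal sketch of this iteration, on a made-up three-page web (the damping factor 0.85 is the paper’s example value):

```python
# Minimal PageRank sketch for a tiny invented web.
# links[p] = the pages that page p links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
d = 0.85                               # damping factor, the paper's example value

ranks = {p: 1.0 for p in links}        # start from a uniform guess
for _ in range(100):                   # iterate until (approximately) converged
    new = {}
    for p in links:
        # pages that link to p contribute their rank divided by their out-degree
        incoming = [q for q in links if p in links[q]]
        new[p] = (1 - d) + d * sum(ranks[q] / len(links[q]) for q in incoming)
    ranks = new

print(ranks)   # "C" ends up highest: both "A" and "B" link to it
```

Note how "B" ends up lowest even though it has an incoming link: its only endorsement comes from "A", which splits its vote between two outgoing links.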

Anchor Text

When you search for “Einstein”, say the most highly ranked page that you get is the Wikipedia entry on him. It says that Einstein was a physicist, where the word “physicist” is a hyperlink to the Wikipedia page for physicists. Now when you search for “physicist” on Google, you’d want to get the Wikipedia page on “Physicist”, and not “Einstein”. Hence, the mention of the word “physicist” on Einstein’s page should contribute to the rank of the “Physicist” page, and not the “Einstein” page. If you look at the formula for PageRank given above, you’ll see that this is exactly what happens: hyperlinks always contribute to the page they lead to, and not the page they lead away from. This is also how we can search for images and videos on Google: there are hyperlinks on pages with text, which lead to those images.
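The anchor-text idea can be sketched in a few lines: the words of a link are credited to the page the link points to, not the page it sits on (all page names here are invented):

```python
# Each entry: (page containing the link, anchor text, page the link points to).
# All page names are invented for illustration.
hyperlinks = [
    ("EinsteinPage", "physicist", "PhysicistPage"),
    ("EinsteinPage", "relativity", "RelativityPage"),
    ("BohrPage", "physicist", "PhysicistPage"),
]

# Credit each anchor word to the *target* page, not the source page.
anchor_index = {}
for source, anchor_word, target in hyperlinks:
    anchor_index.setdefault(anchor_word, []).append(target)

print(anchor_index["physicist"])   # ["PhysicistPage", "PhysicistPage"]
```

A query for “physicist” now surfaces PhysicistPage twice over, even if that page barely uses the word itself; this is also what lets text queries retrieve images.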

Google Architecture Overview

This is perhaps the most technical part of the blogpost. Most of Google is implemented in C or C++. What Google does is download as much of the internet as it can access, store each webpage, and then determine how many times each word occurs in each webpage. It catalogues all this information, ranks the webpages, and stores them. So when you search for “Einstein”, it finds all webpages with the word “Einstein” in its catalogue, implements the PageRank algorithm to rank them, and gives you the results in order. Now for a more technical description:

The web crawling, or the downloading of web pages, is done by several distributed crawlers. There is a URLServer that sends lists of URLs to these crawlers to download. These webpages are then sent to a storeserver, which stores them in a repository. Each webpage has a docID, which is stored along with the webpage. This docID is useful because URLs can be of arbitrary length. However, docIDs are of a fixed length. Hence, they’re useful for cataloguing.

Now comes the indexer. The indexer takes each webpage and stores details of the words in it. Say it takes a webpage on “Trump”. It counts the number of occurrences of each word in this webpage, which is known as recording hits, and creates a “forward index”, which looks like “docID for Trump: president, orange, twitter rage, yuge,…”. This information is then stored in “barrels”, which are boxes corresponding to sets of words, and which store details of all webpages that contain those words. For example, this page on Trump is stored in the barrel corresponding to the words “Trump, Obama, Lincoln,…”, and also in the barrel corresponding to the set of words “orange, twitter, yellow, green,…”. What a barrel does is create an “inverted index”: for each word, it has a list of docIDs of the webpages containing it. Also, each word is allotted a wordID. An example of an entry in this inverted index would be “wordID for orange: docID for Trump, docID for the ice cream cones,…”, etc.
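The forward-to-inverted index flip can be sketched in a few lines (document IDs and word lists are invented):

```python
# Toy forward index -> inverted index, mimicking the "barrels" idea.
# Document IDs and contents are invented for illustration.
forward_index = {
    "doc1": ["trump", "president", "twitter"],
    "doc2": ["orange", "ice", "cream"],
    "doc3": ["trump", "orange"],
}

# Invert it: for each word, the list of documents containing that word.
inverted_index = {}
for doc_id, words in forward_index.items():
    for word in words:
        inverted_index.setdefault(word, []).append(doc_id)

# Answering a one-word query is now a single lookup:
print(inverted_index["orange"])   # ["doc2", "doc3"]
```

The real system also stores a hit list (positions, fonts, capitalization) alongside each docID, but the lookup structure is the same.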

The indexer also notes down where in the document each word occurs, its font size, capitalization, etc. Also, the indexer notes down the URLs on that page: the text of each link, and the docID of the page it leads to. This information is stored in an “anchors file”.

Major Data Structures

Almost all of the data is stored in Bigfiles, which span multiple file systems. There are also several repositories. The raw HTML repository stores the compressed HTML of each page, with a small header. The document index, on the other hand, keeps information about each document, including its current status, various statistics, etc. A hit records an occurrence of a certain word on a webpage. Hits are stored using hand-optimized compact encoding, which uses two bytes for every hit. Hit lists are combined with wordIDs in the forward index and with docIDs in the inverted index.

In the inverted index, for each wordID, there is a pointer, which points towards the barrel corresponding to that wordID, the docIDs corresponding to the webpages that contain that word, and the hit lists telling us the details of the occurrence of that word in those webpages (font, number of occurrences, etc). This list is called a doclist. But in what order should these items be stored in a doclist? Should we just rank the docIDs numerically? Or rank them by the number of occurrences of that word? We basically order the docIDs by the number of occurrences, but we introduce a further modification to the barrels: there are now two sets of barrels. One set concerns itself only with titles and hyperlinks, and the other deals with the main text. When you search for “Einstein”, the “title and hyperlink” barrels corresponding to “Einstein” are searched first, and the relevant ranked pages yielded. Only when there are not enough results this way are the barrels corresponding to the main text upturned to yield relevant webpages. These yielded webpages are then ranked using the PageRank algorithm, and shown on the user’s device.
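
The two-tier barrel lookup described above can be sketched as follows; the barrel contents, document names, and the fallback threshold here are all invented for illustration:

```python
# Hypothetical barrels: word -> list of (docID, hit count)
short_barrels = {"einstein": [("doc_bio", 9), ("doc_relativity", 7)]}   # title/anchor hits
full_barrels = {"einstein": [("doc_forum_post", 3), ("doc_news", 2)]}   # body-text hits

def search(word, min_results=3):
    # Search the title/hyperlink barrels first, ranked by hit count;
    # fall back to the full-text barrels only if too few results came back.
    results = sorted(short_barrels.get(word, []), key=lambda r: -r[1])
    if len(results) < min_results:
        results += sorted(full_barrels.get(word, []), key=lambda r: -r[1])
    return [doc for doc, hits in results]

print(search("einstein"))
# ['doc_bio', 'doc_relativity', 'doc_forum_post', 'doc_news']
```

The design choice is a precision/speed tradeoff: hits in titles and anchor text are strong relevance signals, so for most queries the small barrels alone suffice.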


Google claims to be much better than others at searching for relevant webpages. For instance, it says that most existing search engines cannot even find their own webpages effectively when queried: if you type “Yahoo” into the Yahoo search bar, it might not even bring up Yahoo’s own webpage, but instead some other random webpage that happens to contain the word “Yahoo” the most number of times. Google, on the other hand, is much better at this sort of thing: when you type “Bill Clinton”, the most relevant results are articles about the president himself.

As far as crawling the internet and indexing the huge mass of data is concerned, Google claims that downloading 26 million webpages took them only 9 days, and indexing and storing them just a couple more days. The indexer ran at 54 pages per second, and the whole sorting process took only a day.

Although its processes and algorithms could be optimized, Google was fast and scalable. And it was here to stay.

It indeed was.


  1. The Anatomy of a Large-Scale Hypertextual Web Search Engine
  2. Search engine indexing
  3. Inverted index

Gravitational Wave Detection with LIGO

The paper that I’ll be reviewing today is the famous Observation of Gravitational Waves from a Binary Black Hole Merger by B. P. Abbott et al. This paper records the detection of gravitational waves for the first time in human history, and the discovery was awarded the Nobel Prize in 2017.


What are gravitational waves, and how are they caused? Gravitational waves are caused by the change with time of the quadrupole moment of a system of masses. What does that mean? Let us take an example. Imagine the rotation of the earth. As it rotates, its mass doesn’t change (the monopole moment), and neither does the position of its center of mass (the dipole moment). However, during rotation, the poles get flattened a little bit. When this flattening is happening, the quadrupole moment changes. One way of thinking of it is to say that the “shape” changes. If one were calculating the gravitational potential energy of the earth at a distant point in space, it would reduce a little because of the flattening of the earth near the poles. The energy that is lost by the system is radiated out in the form of gravitational waves.
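
This can be made precise. A textbook result (not derived in the paper under review, but stated here for context) is the quadrupole formula, which gives the power radiated in gravitational waves by a system with traceless mass quadrupole moment Q_{ij}:

```latex
P = \frac{G}{5c^5} \left\langle \dddot{Q}_{ij} \, \dddot{Q}_{ij} \right\rangle
```

Here \dddot{Q}_{ij} is the third time derivative of the quadrupole moment. A system whose quadrupole moment is constant in time radiates nothing, which is why uniform rotation of a perfect sphere produces no gravitational waves.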

But what do gravitational waves do? Simplistically speaking, they stretch and compress space itself. One may imagine the universe to be a rubber sheet. Then gravitational waves elongate and compress this sheet with time. This is exactly the property of waves that LIGO exploited to detect gravitational waves. A more mathematical description of gravitational waves can be found here. In this mathematical explanation, Einstein’s Field Equations are “simplified” to make them linear, and then one uses the fact that however we deform the universe, although metrics and other observables might change, the physical laws will always remain the same (the justification for gauge transformations). With different kinds of such deformations, we get different solutions for how the “shape and size” (metric) of the universe changes with time. All of these are valid descriptions of the universe. One such solution describes gravitational waves emitted after certain phenomena. Although these are solutions only for the “simple” and “approximate” Einstein’s Field Equations, it has been rigorously proven that gravitational waves are indeed solutions of Einstein’s Field Equations in the complicated, non-linear case.

An important caveat is that this is not the first time that gravitational waves have been observed. The detection of the binary pulsar system PSR B1913+16 proved that gravitational waves do indeed exist. However, the presence of gravitational waves was inferred from other observations, and not directly detected. This paper presents proof for the first actual detection of gravitational waves, and also the first observation of the merger of two black holes in a binary black hole system.


The merger of two black holes so far away in space and time was observed on September 14, 2015. The initial detection was made by low-latency searches, which means that some preliminary gravitational wave analysis was done quickly (within 3 minutes) so that scientists could figure out the approximate location of the merger of the black holes, and turn their telescopes to the portion of the sky where the signal was coming from. If astronomical observations coincided with this wave detection, then they would have proof that this wave was not a technical glitch, but the result of an astronomical event.

Two observatories observed this event: one in Hanford, WA and another in Livingston, LA. The gravitational waves reached Livingston about 7 ms before Hanford, and this discrepancy in time of observation helped scientists locate the binary black hole system in the sky. The observed signals had a combined signal-to-noise ratio of 24.

Some features of the observed signal were these: over 0.2 seconds, both the amplitude and the frequency of the wave increased in 8 cycles from 35 Hz to 150 Hz, where the amplitude reached its maximum. Let us focus on that maximum, where the frequency is 150 Hz. Two bodies orbiting their common center of mass would have to be orbiting 75 times every second to produce this signal. After noting the frequency and the rate of change of frequency, the scientists concluded that two compact masses of a certain total mass had to be orbiting each other at a distance of approximately 350 km for these gravitational waves to have been produced. They concluded that these two objects could not be neutron stars (too light) or a neutron star/black hole binary (it would have coalesced at a lower orbital frequency). Hence, these had to be two black holes.
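
We can sanity-check that ~350 km figure ourselves with Kepler’s third law, using the paper’s rough estimate of about 70 solar masses for the total mass of the binary:

```python
import math

# Back-of-the-envelope check of the separation quoted in the paper.
# The ~70 solar-mass total is the paper's estimate; everything else is
# standard physical constants and Kepler's third law.
G = 6.674e-11            # gravitational constant, m^3 kg^-1 s^-2
M_sun = 1.989e30         # solar mass, kg
M = 70 * M_sun           # total mass of the binary
f_orb = 75.0             # orbital frequency in Hz (half the 150 Hz GW frequency)

omega = 2 * math.pi * f_orb
# Kepler's third law for a circular orbit: omega^2 a^3 = G M
separation = (G * M / omega**2) ** (1 / 3)
print(f"{separation / 1000:.0f} km")   # roughly 350 km
```

Only objects far more compact than ordinary stars can orbit each other at a distance of a few hundred kilometers, which is the heart of the “these must be black holes” argument.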


The detection of gravitational waves happened at two observatories. This was so that random noise from the environment could be eliminated (most sources of noise are local, and hence would be detected by one observatory but not the other), and so that the location of the astronomical event in the sky could be determined (mainly by the time difference of observation between the two observatories). The instrument that was used was a laser interferometer, and its working principle is explained beautifully here. Simplistically speaking, an interferometer has two arms that are perpendicular to each other, with light beams inside both. They are arranged in such a way that these light beams interfere destructively at the detector. However, a gravitational wave stretches one arm and compresses the other (unless the arms are at equal angles to the direction of propagation of the wave, which is extremely unlikely for a randomly generated gravitational wave in the sky). This causes the light to no longer interfere perfectly destructively, and we can hence detect gravitational waves by detecting visible interference patterns of laser beams. The instrument uses a resonant optical cavity to increase the intensity of the laser: we end up getting 100 kW of laser light from a 700 W input. Hence, even a small change in the interference pattern of light can be detected easily.
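
To get a sense of how small the measured effect is: the strain h of a gravitational wave is the fractional change in length it causes, so the arm-length change is ΔL = hL. With the roughly 10^{-21} peak strain of this event and LIGO’s 4 km arms (both figures approximate):

```python
# Order-of-magnitude estimate of the arm-length change LIGO must resolve.
h = 1.0e-21      # peak dimensionless strain (approximate)
L = 4.0e3        # interferometer arm length in meters
delta_L = h * L
print(delta_L)   # ~4e-18 m, about a thousandth of a proton's diameter
```

This is why the resonant cavities, vacuum, and pendulum-mounted mirrors described below are all necessary: the signal is absurdly tiny.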

The mirrors used in the experiment are mounted on quadruple-pendulum systems, so that they remain almost stationary in the event of earthquakes or other sources of external noise. To reduce Rayleigh scattering of photons, an ultrahigh vacuum is created: the pressure inside the arms is reduced to 10^{-6} Pa. Other instruments like seismometers were also mounted near both observatories, so that if localized causes did create a disturbance in the detector, they would show up in those instruments as well, and such detections could be discarded.

Methods of searching

The black hole merger event, codenamed GW150914, was verified by two different techniques. These two techniques use completely independent methods of analysis, and the fact that both report the same event boosts the plausibility of this being an actual gravitational wave and not another random disturbance. All observed events are assigned a detection-statistic value by both methods, which measures their likelihood of being black hole merger events. GW150914 was given very high values by both methods. But how can we calibrate the instrument to ensure that we give a high detection-statistic value only to actual black hole merger events? We measure the rate at which the instrument gives the same or higher detection-statistic values to events that are not black hole mergers. If this happens very rarely, say once in 5,000 years, then we can be reasonably sure that an event that has been given a high detection-statistic value is an actual black hole merger, and not a freakish once-in-5,000-years event.

The two methods of analysis used are:

  • Generic Transient Search– In this technique, there are no pre-constructed gravitational wave forms with which incoming signals are compared. After observing a signal lying within a certain frequency and time range, we construct a gravitational wave form having those properties. Then this wave form and the incoming signal are compared. If the comparison is favorable, we can reasonably infer that the incoming signal is a gravitational wave. The detection-statistic value given to GW150914 was 20. Only once in 22,500 years would this method give such a high value to an event that is not the merger of two black holes producing gravitational waves.
  • Binary Coalescence Search– We have approximately 225,000 pre-constructed gravitational wave forms. The incoming signal is then compared with these wave forms, and the better the match, the higher the detection statistic value. The detection statistic value that this search gave to GW150914 was 23.6, and this method would erroneously give such a high value only once in 203,000 years.
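
To see how such a false-alarm rate translates into confidence, here is a back-of-the-envelope Poisson calculation; the 16-day coincident observation window is my assumption for illustration, not a figure from the text above:

```python
import math

# Probability of at least one equally loud false alarm during the run,
# modeling false alarms as a Poisson process.
false_alarm_rate = 1 / 203_000   # false alarms per year (binary coalescence search)
T = 16 / 365                     # assumed observation time in years (~16 days)

p_false = 1 - math.exp(-false_alarm_rate * T)
print(p_false)   # ~2e-7: a noise event this loud is exceedingly unlikely
```

A probability of a few parts in ten million is what lets the collaboration claim a detection at very high statistical significance.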


Some properties of the black hole merger event could be calculated from the observations:

No deviation from General Relativity was found with regard to mass and spin of black holes, and also the coefficients in the mathematical expression for the phase of the gravitational waves (expressed as a power series in f^{1/3}). Moreover, better lower bounds for the Compton wavelength of a graviton were obtained.

The authors conclude the paper with the assertion that there are probably a lot of binary black hole systems out there merging much more often than we can observe, and that predicting and analyzing the gravitational waves created by such events could help us decode the history of the universe.


  1. Observation of Gravitational Waves from a Binary Black Hole Merger
  2. Binary Black Hole
  3. Interferometers

Effective Altruism- May and June

I’m attaching screenshots below for May and June. I donated to soup kitchen relief in West Bengal, cyclone relief efforts in West Bengal, and the State College fund for Covid relief.

My relief efforts have been a little off the cuff, and not studied. Hence, this has been more like virtue signaling than actual labored virtue. I really do hope to rectify this in the future.

The Bitcoin Revolution

The paper that I’ll be discussing today is perhaps the most famous that has come out in recent years- Bitcoin: A Peer-to-Peer Electronic Cash System, by Satoshi Nakamoto. It was a surprisingly short paper packed with the most amazing ideas for creating an alternate banking system. This paper has gone a long way in upending the finance industry.


We all deal with digital money, for example when we make a payment through a credit card. There are some features of such transactions that are relevant to this discussion:

  1. We depend on trustworthy financial institutions to facilitate this transaction.
  2. Double spending, which is spending the same money multiple times, cannot always be prevented. Someone may buy a product using a credit card, use it, and then claim that his card was fraudulently used. In many such cases, because of lack of evidence, this money is refunded to the complainant, although he is clearly defrauding the system. Credit card companies consider this part and parcel of being in the business. Such transactions are called reversible transactions.
  3. These costs add up for credit card companies, which then increase their transaction fees. Credit card companies approximate the amount of money they spend each year in hiring employees, advertising, and also in returning money to fraudulent customers. These costs are ultimately transferred to buyers and sellers.

In order to remove the need for third parties to moderate transactions, Nakamoto says that what is needed is an electronic payment system based on cryptographic proof. When you agree to pay me for a service that I’ve rendered to you, the proof of this exchange of money will be recorded in the public domain. Hence if I claim that you have not paid me, you can show me, and everyone else, the proof of payment.


How does it all work though? Everyone has a public key, and a private key. A key here stands for an ID number, like your driver’s license number. When I make a payment to you, I verify this transaction with my public key, and sign the transaction with my private key. The money is then transferred to you, and the electronic details of the transaction, which contain both our public keys and also the time and amount, are hidden through hashing. What hashing does is that it takes all the details of a transaction, and converts them into an alphanumeric code like 000a234bfu344. This code can never be reversed back into the details of the payment. This is to ensure that both the payer and payee retain their privacy. The only thing that can be verified by everyone else is that the payment did indeed take place.
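
Here is a toy illustration using SHA-256 from Python’s standard library; the transaction fields below are invented, and real Bitcoin serializes transactions very differently:

```python
import hashlib
import json

# A made-up transaction record, for illustration only.
transaction = {
    "payer_public_key": "A1B2C3",
    "payee_public_key": "D4E5F6",
    "amount": 10,
    "time": "2020-07-04T12:00:00Z",
}

# Hash the serialized record. The digest is a fixed-length hex string that
# reveals nothing about the underlying details.
digest = hashlib.sha256(json.dumps(transaction, sort_keys=True).encode()).hexdigest()
print(digest)   # a 64-character hex string
```

Hashing is deterministic, so anyone holding the original record can verify it matches the published hash, but nobody can recover the record from the hash alone.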

Bitcoin Network

This system of payments works by employing a set of CPUs/nodes to process and store the information of transactions. But what purpose does it serve? Let us demonstrate that with an example:

If you have a physical $10 note, you can clearly only spend it once. However, with digital money being a set of codes instead of a physical entity, people may try to use the $10 they have in digital money multiple times. Can they get away with it? If I try to spend my $10 digital money multiple times, the details of all these multiple transactions go to all nodes that are connected to the Bitcoin network. They only accept the first such transaction, and all other transactions are considered void. Now you can’t spend that $10 anymore, and the only person who can now spend that money is the person you paid.

Timestamp server

Details of each transaction are converted into an alphanumeric code, or a hash. Many such hashes form a block. When a block has been created by the network, a stamp of time is put on it, and every time the server sees that block in the “blockchain” it puts a timestamp on it. So for instance, if a block is created at 12 noon, it will have a timestamp of 12 noon. If the timestamp server processes the “blockchain” (literally a chain of blocks) again at 12:05 PM, that particular block will have two timestamps: 12 noon and 12:05 PM, and so on. The nodes keep making chains with all the blocks that they receive, and keep exchanging each other’s blockchains to ensure that they haven’t missed out on any details or transactions.


The recording of transactions, the hashing, the verification, all of this is done by CPUs connected to the Bitcoin network. But why should I connect my CPU to this network? This is what happens: every new transaction is broadcast to all nodes (or CPUs) in the network. Each node collects many such transactions into a block. Each node now tries to find a “proof-of-work” for this block, which is basically finding an appropriate hash code for the block with an appropriate number of 0‘s in front. The first node that finds such a code broadcasts it to all other nodes, who verify that that is indeed an appropriate code for that block. Now they add that block to the blockchain, and continue processing other transactions and blocks. The finding of this appropriate hash code is probabilistic, and each node is equally likely to find it, assuming equal computational power.
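
A minimal proof-of-work sketch is below. Real Bitcoin double-hashes a binary block header and compares against a numeric target rather than counting leading zeros of a hex string, but the idea is the same:

```python
import hashlib

def proof_of_work(block_data: str, difficulty: int) -> int:
    # Try nonces until the block's hash starts with `difficulty` zeros.
    # Finding the nonce is expensive; verifying it takes a single hash.
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce
        nonce += 1

nonce = proof_of_work("some transactions", difficulty=4)

# Any other node can verify the work with one hash:
digest = hashlib.sha256(f"some transactions{nonce}".encode()).hexdigest()
print(digest[:8])   # starts with "0000"
```

Each extra required zero multiplies the expected work by 16 (for hex digits), which is how the network tunes how hard blocks are to produce.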

But why should I connect my CPU to the network for this? Because whatever I spend in the form of processing time and electricity costs, I get back in the form of Bitcoin. This Bitcoin essentially is a form of money that I can now spend online.

Information Loss

What about information loss? It is entirely possible that some node does not receive the details of a certain transaction. However, we can be almost certain that at least one node will. Eventually, when the node that missed the transaction receives a block containing its details, it will rectify its mistake.

Now here’s an important caveat: as we’ve seen before, all these blocks are arranged in a chain. However, when these blocks are broadcast by one node (which could successfully come up with a good hash for the block) to all the other nodes, some nodes may have missed out on this. Let us suppose node A missed out on a block, but it successfully receives the next block. It creates a chain that is one block short. However, when it receives the block chains that other nodes have created, it discards the previous chain, and chooses the longer chain. This process of choosing the longer chain removes the errors that creep in when certain nodes do not receive all the blocks. This works on the assumption that at least some nodes receive all the information of all blocks in a small interval of time. This is likely, as there are thousands of nodes in the network.

Another caveat is that nodes only accept information of transactions that are valid. As a node receives the information of all transactions, if I try to spend my digital $10 twice, the nodes (which successfully receive this information) will not process my second transaction.

Reclaiming disk space

All the CPUs store all the information of each transaction on the network. That sure sounds like a lot of space. In order to clear space, blocks only store information pertaining to the fact that the transaction happened, and erase the details of old transactions. If a transaction can be thought of as a tree with roots (proof of transaction) and branches (details of transaction), the branches of the trees are cut off so that only the roots remain. Nakamoto calculates that a CPU would need only 4.2 MB of storage space in a year to store all information of all transactions.
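
Nakamoto’s 4.2 MB figure is easy to reproduce: with transaction details pruned, only block headers (about 80 bytes each) need to be kept, and blocks are generated roughly every 10 minutes:

```python
# Reproducing the whitepaper's storage estimate for pruned blocks.
bytes_per_header = 80            # approximate size of one block header
blocks_per_year = 6 * 24 * 365   # one block every 10 minutes

storage_per_year = bytes_per_header * blocks_per_year
print(storage_per_year / 1e6)    # ~4.2 MB per year
```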

Payment Verification

How do I verify that a certain transaction took place at 12 noon on Saturday? I ask the nodes for the longest blockchain (which is the correct one with no missing information), access the relevant block by using the right block header, and then verify that some payment indeed took place at that time. I cannot find out the parties involved, or the amount. However, the fact that this payment took place is public information.

Beating the network

What does beating the network mean here? I can either spend my $10 multiple times and force the network to accept it, or I can go back into history, and erase the record of me paying someone money so that I can keep it for myself. Because the nodes will not accept multiple transactions, the first option is infeasible. Hence, the only way is to go back into history, erase the record of my payment, and then build blocks faster than the other nodes so that I can make a longer chain than them.

Suppose there are 10,000 CPUs/nodes in the network, and I have managed to capture 100 of them. Suppose that the transaction that I want to erase was recorded z blocks ago. I need to go back z blocks, which means I need to erase z-1 blocks, erase my transaction from the zth block, and now hash blocks faster than the “honest” nodes so as to form a longer chain. Remember that only the longest blockchain will be thought to be the correct one, and will be accepted by all parties involved. This act of erasing blocks may stop all exchange of blocks between the “honest” nodes and the “evil” nodes under my control. Hence, the only way to legalize my fraud is to quickly build a longer blockchain, so that everyone will have to accept that my version of events is the correct one.

Coming up with the right hash or code for a block is probabilistic. Hence, the honest nodes, which are much greater in number, have a much higher chance of forming the longer blockchain. The author calculates that if the number of “honest” nodes h is greater than the number of “evil” nodes e, the probability of the evil nodes ever making a longer chain than the honest nodes is (\frac{e}{h})^z. Hence, if the number of honest nodes is much greater than the number of evil nodes, which is likely to be the case in the real world, then it is extremely unlikely that an evil person will be able to overpower the system.
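
Plugging the numbers from the example above into this formula shows just how safe the network is; the choice of z = 6 confirmations is a common Bitcoin convention, not something from the text:

```python
def attacker_success(e: int, h: int, z: int) -> float:
    # Probability that e "evil" nodes (out of e + h equally powerful nodes)
    # ever catch up from z blocks behind: (e/h)^z, as in the text above.
    return (e / h) ** z

# 100 captured nodes out of 10,000 (so 9,900 honest ones), transaction
# buried 6 blocks deep:
p = attacker_success(100, 9_900, 6)
print(p)   # about 1e-12: effectively impossible
```

This is why merchants typically wait for several confirmations before treating a payment as final: each additional block multiplies the attacker’s already-tiny odds by e/h again.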

Another incentive for people not to overpower the system is that if they spend the CPU power that they’re using for evil in hashing transactions and blocks instead, the monetary value of the bitcoin that they’ll earn will be far greater than any monetary reward that they might earn by trying to break the system.

Thanks for reading!


  1. Bitcoin: A Peer-to-Peer Electronic Cash System
  2. Hash

Google’s Quantum Supremacy

The paper that I’m going to talk about today is the famous Quantum Supremacy Using a Programmable Superconducting Processor. This was published in Nature in 2019, and demonstrated quantum supremacy for the first time in human history. It is one of the best-written papers I’ve ever read, and is more helpful for understanding the definitions and concepts involved than even expansive Wikipedia entries. Its findings have been disputed by companies like IBM, and mathematicians like Gil Kalai.

But first, a primer on quantum measurement. What exactly is it? If you’ll allow me a bad analogy, quantum measurement is like picking out an M&M from a bag of M&Ms. You never know what color you’re going to pick out. The bag may contain an equal number of M&Ms of all colors, or it may contain a lot of blue ones and very few yellow ones (the different physical states, of which the quantum state is a superposition, might have different probabilities; or said with a top hat, the different eigenstates might have different corresponding amplitudes). So although it is more likely that you’ll pick out a blue one, there is still a real possibility that you might pick out a yellow one. Hence, it doesn’t make sense to ask what is the color of the M&Ms in a bag. The bag is the (weighted) superposition of all the colors present in it.

Quantum Supremacy

So what exactly is quantum supremacy? Let me offer another terrible analogy: there are some humans that run slow, and some, like Usain Bolt, that run very fast. Although Usain Bolt holds the record for being the fastest human on Earth, we expect that in the future, there will be even faster humans who will break even those records. However, regardless of how fast humans may run in the future, they will never be able to outrun jet planes. Similarly, quantum supremacy is achieved when we can prove that a quantum processor can perform a certain task much faster than any classical processor, regardless of how brilliant the creator of that processor is, and how amazing the algorithm that it uses is.

But how do we find such a task, where we can be sure that brilliance will not help in solving a problem faster than a quantum processor? An example: suppose a person quick with mental calculations is given a string of randomly generated numbers, and asked to square them. If the numbers were not randomly generated, say if they were all of the form 10^n, this would be easy for a person who recognizes that the square of 10^n is 10^{2n}. However, because they are randomly generated, there is no pattern to the numbers. Hence, regardless of the mental faculties of that person, taking the squares of a million numbers, say \{247,3691,47385,\dots\}, will still take him days (if not months or years). However, plugging all these numbers into a supercomputer will give you the answer in seconds. So, just as no amount of brilliance will help that man calculate the squares of a million randomly generated integers in seconds, we should find a task in which the data is randomly generated, so that there are no patterns in the data, and no amount of algorithmic genius will enable a classical processor to perform it as fast as a quantum processor.

The task that the researchers chose is the following, which I’ll again explain with the M&M analogy: your job is to make a bag with 53 different colors of M&Ms (as 53 qubits are used in the experiment). There may be one green M&M, 5 red M&Ms, etc. The numbers don’t have to be the same. However, there is one caveat: the number of blue and yellow M&Ms is much higher than the number of other M&Ms. Hence, if you were to pick an M&M, the probability of picking each color is well defined, and you’re much more likely to pick a blue or yellow M&M than any other color.

Now imagine that your friend doesn’t know how many M&Ms of each color are in the bag. He is asked to pick a certain number of M&Ms from the bag, until he can write down a probability distribution (calculate the probability of picking each of the 53 colors from the bag). Moreover, this probability distribution should be “close” to the actual probability distribution. Suppose we calculate that for his probability distribution to be “sufficiently close” to the actual one, he needs to pick out at least 1,000 M&Ms. Because he doesn’t have any prior information about the contents of the bag, he cannot use any clever techniques to calculate this probability distribution. He has to just keep picking out M&Ms until he has a good idea of the contents of the bag. How much time does it take him to do this? Your friend may take, say, 10 minutes. But your quantum friend can do it in 1 millisecond.
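
The friend’s predicament is easy to simulate: estimate a distribution purely by sampling. The bag below is a made-up four-color distribution standing in for the 53 colors:

```python
import random
from collections import Counter

random.seed(0)   # for reproducibility

# A hypothetical "bag": the true distribution, unknown to the sampler.
bag = {"blue": 0.4, "yellow": 0.35, "red": 0.15, "green": 0.1}
colors, weights = list(bag), list(bag.values())

# Draw many M&Ms and estimate the distribution from counts alone.
n = 100_000
draws = random.choices(colors, weights=weights, k=n)
counts = Counter(draws)
estimate = {c: counts[c] / n for c in colors}
print(estimate)   # close to the true distribution, e.g. blue near 0.4
```

With no prior knowledge of the bag there is no shortcut: accuracy comes only from drawing more samples, which is exactly why the race reduces to raw sampling speed.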

The task given to the quantum processor is this: a quantum state, which in this case is a superposition of 2^{53} discrete states that look like |0011001\dots\rangle, is measured. Much like an M&M that you pick from the bag cannot have more than one color, this measured quantum state projects to one of the 2^{53} states that it was a superposition of. Now the quantum state that we initially constructed has a certain probability distribution, which tells us the probability that it will project to a given state. For instance, the probability that this quantum state will project to |011100\dots0\rangle might be 0.5. We want to measure this quantum state a few million times, so that we can approximate the probability distribution of the quantum state to within a given degree of accuracy.

Building a processor

How do you produce randomly generated data though? The authors create random quantum circuits with one- and two-qubit gates. Qubit gates, much like logic gates, take in a quantum state, tweak it, and spit out another quantum state. They can be thought of as reversible functions that map quantum states to quantum states.

What a quantum circuit does is that it takes in a quantum state (a superposition of discrete states), and gives us another quantum state. In this case, for instance, the quantum circuit takes in a quantum state (a linear superposition of the 2^{53} discrete states), and modifies the amplitudes (for instance, it might make the amplitude of |000\dots 0\rangle equal to 0.999, and the amplitude of |111\dots 1\rangle equal to 0.001). Hence, after passing through this quantum circuit, the input quantum state is much more likely to be observed as |000\dots 0\rangle rather than |111\dots 1\rangle. The circuit also highly entangles certain qubits with the help of two-qubit gates, which implies that certain quantum states will be much more likely to appear than others. For example, if the first two qubits are entangled in a certain way, then we can be sure that |01\dots\rangle and |10\dots\rangle can be observed, but |00\dots\rangle can never be observed. In fact, the authors emphasize that the probability distribution is exponential: some discrete states are exponentially more likely to be observed than others.
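
The entanglement example in this paragraph can be reproduced with a tiny state-vector simulation in pure Python; the three gates below prepare the state (|01\rangle + |10\rangle)/\sqrt{2}, in which |01\rangle and |10\rangle can be observed but |00\rangle never can:

```python
import math

# State vector over two qubits, in the order |00>, |01>, |10>, |11>.

def apply_x_on_qubit1(s):
    # X (NOT) on the second qubit swaps |q0 0> <-> |q0 1>
    return [s[1], s[0], s[3], s[2]]

def apply_h_on_qubit0(s):
    # Hadamard on the first qubit
    r = 1 / math.sqrt(2)
    return [r * (s[0] + s[2]), r * (s[1] + s[3]),
            r * (s[0] - s[2]), r * (s[1] - s[3])]

def apply_cnot(s):
    # CNOT with qubit 0 as control: flips qubit 1 when qubit 0 is 1
    return [s[0], s[1], s[3], s[2]]

state = [1, 0, 0, 0]               # start in |00>
state = apply_x_on_qubit1(state)   # |01>
state = apply_h_on_qubit0(state)   # (|01> + |11>)/sqrt(2)
state = apply_cnot(state)          # (|01> + |10>)/sqrt(2)

probs = [abs(a) ** 2 for a in state]
print(probs)   # [0.0, 0.5, 0.5, 0.0]: |00> and |11> are never observed
```

Measurement probabilities are the squared magnitudes of the amplitudes, which is why the zero amplitudes of |00\rangle and |11\rangle make those outcomes impossible.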

But what is a random quantum circuit? It is a circuit in which the one-qubit gates are randomly selected. In essence, we want to make sure that we are randomly changing the data, via randomly manufactured circuits. Because the generation of data is random, a classical processor can’t use clever algorithms to calculate the sample distribution. It has to do what everyone else is doing, take a sample, count the number of each state, and then write down a probability distribution. Now it is all down to speed, and not cleverness.

What the word fidelity means is how close the probability distribution that the processor writes down is to the actual probability distribution. If the fidelity is 0.000001, then the probability distribution is way off, and if it is 0.99, then it is very close to the actual probability distribution.
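
Concretely, the paper quantifies this with the linear cross-entropy benchmark, F_{XEB} = 2^n \langle P(x_i) \rangle - 1, where P(x_i) is the ideal probability of each observed bitstring x_i. A sketch with a made-up 2-qubit distribution (the numbers are mine, chosen for illustration):

```python
# Linear cross-entropy fidelity: F = 2^n * mean(P(x_i)) - 1,
# evaluated here in expectation rather than from finite samples.
n = 2
ideal = {"00": 0.40, "01": 0.30, "10": 0.20, "11": 0.10}

def f_xeb(ideal_probs, sampler_probs):
    # Expected F_XEB when bitstrings are drawn from sampler_probs.
    mean_p = sum(sampler_probs[x] * ideal_probs[x] for x in ideal_probs)
    return 2 ** n * mean_p - 1

uniform = {x: 0.25 for x in ideal}
print(f_xeb(ideal, uniform))   # 0.0: random guessing has zero fidelity
print(f_xeb(ideal, ideal))     # 0.2 here: sampling the true distribution scores higher
```

A sampler that preferentially lands on the high-probability bitstrings scores above zero, while one that outputs uniform noise scores zero, which is what makes even a small positive fidelity like 0.002 meaningful.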

Recall that we first have to build a quantum state, and then pass it through the quantum circuit, which is randomly changed after a few million observations. How do we build this quantum state? We first build a superconducting circuit, which implies that current and voltage follow quantum rules. Each of the 53 qubits is encoded with the lowest two eigenstates of a resonant circuit, which is just a fancy way of saying that each qubit is a superposition of two states, and projects to one of them when a resonant frequency is passed through the qubit. But what about the qubit gates, the parts of the quantum circuit that modify the amplitudes of the eigenstates? We “benchmark” them, and ensure that they do what is expected of them (within an acceptable range of error).

After quantum states have been generated and quantum circuits have been built, we want to take a sample, and see if we can write down a probability distribution that has a fidelity of at least 0.002 compared to the actual probability distribution. 0.002 seems low, but taking enough samples to do even that well is hard. The authors calculate that for 53 qubits, and a quantum circuit comprising 1,113 single-qubit gates and 430 two-qubit gates, a fidelity of 0.002 can be achieved by taking a few million measurements. The authors contend that taking these many measurements would take a classical processor 10,000 years, while it takes their quantum processor 200 seconds.

Cross-checking and benchmarking

But how do we know the actual probability distribution? If a classical processor takes 10,000 years to calculate an approximate probability distribution, how does it calculate the actual probability distribution in finite time? This is done by taking the quantum circuit constructed above, and simplifying it a little bit, such that the probability distribution doesn’t vary much, but calculating it becomes much simpler. But wait. We thought the quantum circuits were randomly generated. If the classical processor can do this simplification, why does it not do this, instead of taking 10,000 years for sampling, and then calculating an approximate probability distribution?

This is because the arrangement of qubit-gates is not known initially. It is only after all the sampling has been done that the positions of qubit gates are observed, and the circuit simplified, so as to be able to calculate the “actual” probability distribution.

Hence, we have a task at hand that takes a classical processor 10,000 years to accomplish, but takes the quantum processor Sycamore 200 seconds to accomplish. More impressively, the actual sampling part only takes 30 seconds.

Applications and Future Directions

The authors contend that some immediate applications of this quantum processor are in materials science, machine learning, and optimization. They expect that quantum computational power, relative to classical computational power, will grow at double exponential speed: adding qubits will not make quantum processors that much slower, but will make their classical simulation exponentially slower. Moreover, Moore’s law will hold for components of quantum circuits. Hence, quantum computing capabilities can be expected to double every few years. However, quantum error correction has to be vastly improved, and many more qubits need to be added, before we can perform quantum algorithms like Shor’s algorithm, which can famously be used to break widely deployed public-key encryption schemes such as RSA.


  1. Quantum Supremacy Using a Programmable Superconducting Processor
  2. Quantum Circuit

Origins of Astrology in India

The paper that I will be discussing today is On the Philosophical and Cosmological Foundations of Indian Astrology, by Audrius Beinorius. I’ve always been fascinated with Astrology, and am glad that I got the time and opportunity to study it. This article is not paywalled.

The Bible says that God created man in his image. You might not be surprised to learn that this is not a belief that is original to the Bible, but has been influencing the development of humanity for thousands of years. This belief was called sādarśya in Ancient India, although perhaps a human-like God could be replaced by the heavens in the Indian context. I will be discussing this, and more below.


The study of Astrology is called Jyotiśāstra, or the study of lights, which I think is an amazing name for the study of anything. However, was it really mentioned in the ancient Hindu texts, or was it a later invention falsely attributed to them (like Vedic Maths)? It was the former: Jyotiśāstra formed an integral part of the Rig Veda, Yajurveda and Atharvaveda, and Jyotisa Vedanga was known as one of the six auxiliary sciences (āngas) of the Vedas. Hence, astrology was one of the most integral parts of science and society in Ancient India.

Although attributed to the Indian sage Lagadha around 400 BC, the Jyotisa Vedanga is claimed to have been revealed by Brahma (prathama minu) himself to eighteen mythological sages. According to the Mātsyapurana, the authority of a Rishi or seer is based solely on his understanding of celestial movements; some of them even become stars themselves when they die, according to their karma. The main purpose that the Jyotisa Vedanga served was to inform people about the correct time of sacrifice, and not necessarily to predict the future.

The Vedas arose for the purpose of use in sacrifices; sacrifices are enjoined according to the order of times; therefore, he who knows Jyotiṣa knows sacrifices. Just as a tuft of hair stands on the head of peacocks or a jewel in the heads of cobras, so astronomical calculations (gaṇita) stand at the head of all the sciences that are spoken of as vedāṅga.

Ṛgveda Jyotiṣa, 35 Yajurveda Jyotiṣa

According to the sage Panini, the science of the movements of heavenly bodies formed “the eye of the Vedas”. Hence, Astrology and Astronomy formed perhaps the most important sciences according to ancient Indian texts.

Space and Time

The ancients asked the same questions that high-powered physicists are still asking: what is time, and how is it created? The Bhagvata Purana, one of the holiest Indian texts that is read on a regular basis in many Indian households, says that time is the same as God, and is the primary catalyst that causes the cosmos to unfold “with its infinite potential for permutation”. In other words, time causes stars and planets to move in their eternal orbits.

On the other hand, ancient Indian astronomers thought that it was the motion of planets that “created” time, and not the other way round.

Furthermore, the knowers of time regard time as the motion of the sun, planets, and constellations, distinguished by their different revolutions.


The time that the motion of planets created was continuous and regular, and humans artificially divided it into seconds, hours and days. Moreover, time was not just an instrument of measurement in Ancient India. It possessed a certain moral character. Your karma from your previous incarnations required you to perform your dharma in your current lifetime in order to atone for your sins. But how would you know the right time and opportunity to perform your dharma? By studying the movements of planets. If you understood Astrology well enough, the positions of planets in certain constellations would tell you what you should do and when. The eternal motions of planets not only symbolize eternal order, or sanatana dharma, but also individual order, or svadharma, so that humans would know what to do in order to maintain the eternal harmony of the universe.

In the image of the heavens

But what about people who are not astrologers, and cannot study the heavens? Are they not affected by planetary motion? They indeed are, because a human was thought to be the embodiment of the whole cosmos.

A belief that permeated the whole of the ancient world (and Europe until the 16th century) was that things that resembled each other in shape, were of the “same kind”. Hence, when Sphujidhvaja asserted in 270 AD that “mutual interactions (yoga) of the planets as they pass through the various signs of the Zodiac are said to be formed in the likeness of shapes”, it was concluded that objects that resembled those same shapes would now be affected by the motion of the planets. Convenient.

The early Taittirīya brāhmaṇa (a commentary on the Vedas from the 6th century BC) had already suggested: “The constellations are images of the world”

Because the world and constellations are images of each other, whatever happened in the world would also happen in the heavens (and vice versa). However, the world was too complicated to understand, with lots of humans swarming around killing, looting and the like. Hence, people could look up to the heavens to understand what was happening, or going to happen, in the world. As a lot of constellations resemble human shapes (the Hunter, for instance), the stars clearly mirror human activities. But astrologers took things one step further:

The Bṛhatsaṃhitā (14.1-5) then speaks of the nakṣatra-puruṣa (“man of the constellations”), whose body is formed from the twenty-seven lunar mansions. The Bṛhajjātaka of Varāhamihira (1.4) describes the kālapuruṣa (“man of time”), whose body is composed of the twelve zodiacal signs, beginning with Aries and ending with Pisces, his head and feet, respectively.

So it was not just that stars could tell man’s future. Man was a microcosm of the whole universe, with all the constellations contained inside him. Hence, when planets passed through certain constellations, one may imagine miniature planets passing through the same constellation inside the human body, mirroring the motion of the planets in the skies. This affected the good or evil inside of him, and also his fate. This can be taken as an example of taking an analogy and pushing it to its logical extreme. It wasn’t just that we could understand our fate by looking at the skies because we were created in their image. We could understand our fates by looking at the skies because we contain the skies within us.

While researching this piece, I read some parts of another paper, by David Pingree, which stated that Indian astronomers originally held that there were 28 constellations. However, under Babylonian influence around 500 BC, this was corrected to 12. Along parallel lines, Indian astrology originally concerned itself only with knowing the right time for animal (and other) sacrifices. It was only later, with the advent of horoscopic astrology and its 12 constellations, that Indian astrology concerned itself with predicting the future. This probably gave a lot of clout to seers and sages in the courts of kings.


Both the author of the paper and I had similar feelings while reading up on this topic: although it’s easy to dismiss Astrology as pseudo-science, it forms a window through which to view the ancient world. In fact, things like divination are almost axiomatic. If you observe A, then you conclude B. If I see that Mercury is in Aries and Jupiter is in Taurus, I have to conclude that something good is going to happen. The only fallacy, perhaps, is to conclude that the motions of planets have anything to do with human fate. However, we can hardly fault the ancients for a bit of imagination, seeing as the motion of some heavenly bodies, like the Sun and the Moon, did actively determine their lives and fates.


  1. On the Philosophical and Cosmological Foundations of Indian Astrology, by Audrius Beinorius
  2. Astronomy and Astrology in India and Iran, by David Pingree

Deep Learning through bad analogies

The paper that I will be reviewing today is Deep Learning by LeCun, Bengio and Hinton. It was published in Nature in 2015, and is one of the most highly cited papers in the field with 16,750 citations. The authors of the paper went on to win the Turing Award in 2018.

I did not know deep learning and machine learning before reading this article. Hence, a lot of my experience of reading the paper involved reading a paragraph, not knowing what the heck was going on, and then watching YouTube videos to break down what was happening. I will be linking those YouTube videos as I go along. Needless to say, I am amazed at the quality of the content related to this field that is available online.

What does deep learning do? Deep learning helps us classify data. Looking at a picture of a dog, it should be able to tell you that it is a dog, and not a pangolin (which might be useful information in these times). But how does it do that? It does this through a neural network, which is essentially a series of layers of “calculators”. Suppose we take a number which has 4 digits. This can obviously be encoded in a 4-tuple. Now we want to know if that number is divisible by 37 or not (let us imagine a world in which we care whether numbers are divisible by 37). What the layers of “calculators”, or neural units, as they’re generally called, do is take in that 4-tuple, and spit out another 4-tuple. At the end, we will again have a 4-tuple which will look nothing like the original tuple. However, we might have a test saying that if the first component of this final tuple is >0.9, then this number has at least a 95% chance of being divisible by 37. This is only a simplistic example (which might have been better performed on an actual calculator), but what neural networks do is give us the probability that a certain input lies inside a certain class of objects: like the probability that the 4-digit number that was input lies inside the class of numbers divisible by 37.
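The layers-of-calculators picture can be sketched in a few lines of Python. Everything below is hypothetical: the weights are random and untrained, so the final tuple means nothing until a real training procedure tunes them. The point is only the shape of the computation, a 4-tuple going in, passing through three layers, and a 4-tuple coming out:

```python
import math
import random

random.seed(37)

def layer(vec, weights, biases):
    """One 'calculator': takes a 4-tuple in, spits a 4-tuple out."""
    out = []
    for w_row, b in zip(weights, biases):
        s = sum(w * v for w, v in zip(w_row, vec)) + b
        out.append(1 / (1 + math.exp(-s)))  # squash each entry into (0, 1)
    return out

# Three layers of random, untrained weights. A real network would learn
# these from many examples of numbers that are (and aren't) divisible by 37.
layers = [
    ([[random.uniform(-1, 1) for _ in range(4)] for _ in range(4)],
     [random.uniform(-1, 1) for _ in range(4)])
    for _ in range(3)
]

def forward(digits):
    vec = list(digits)
    for weights, biases in layers:
        vec = layer(vec, weights, biases)
    return vec

final = forward((1, 4, 8, 0))  # 1480 = 40 * 37
print(final)  # a 4-tuple; after training, its first entry would encode the answer
```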

Supervised Learning and Back Propagation

How do we train a neural network to give us the right outputs? The same way you train a kid to behave around guests. When they scream at guests and break the china, you send them outside and ask them to re-evaluate their behavior. You keep doing that until they start behaving themselves around guests.

A neural network has been displayed below:

Each edge corresponds to certain weights, or punishment techniques that will help kids learn how to behave around strangers. You may for instance try a variety of punishments: offer them ice cream, ask them to go to their room, threaten to take away their toys, or tell them that they will only be allowed to eat in yellow bowls instead of white from now on. After trying combinations of all these punishment techniques, you find out the following: you should take away ice cream instead of offering more (negative gradient for this weight), children’s behavior improves if you ask them to go to their room, their behavior improves much faster if you threaten to take away their toys, and they really don’t care if they eat in white bowls or yellow bowls. From this analysis, you conclude that you will be best served by a combination of threatening to take away their toys, and taking away their ice cream.

Similarly, neural networks might take in the picture of a dog, and tell you that it’s a cat. You’ll then have to adjust the weights associated with the edges such that this error is corrected. Because millions of inputs are used to train neural networks, some inputs might ask us to increase certain weights, whilst others might require us to decrease them. Hence, we take an average desired change in weights over all inputs, and then change weights accordingly so as to decrease the error “on average”.
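This averaging can be sketched on a made-up one-weight model (nothing to do with the paper’s actual networks): each training input votes for a change in the weight, and we move against the average vote.

```python
# Made-up training data for a one-weight model y = w * x; the desired
# outputs happen to follow y = 2 * x, so training should drive w toward 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = 0.0
learning_rate = 0.05
for step in range(200):
    # Each input votes for a change in w (gradient of its squared error);
    # we average the votes over all inputs and move against the average.
    grads = [2 * (w * x - y) * x for x, y in data]
    avg_grad = sum(grads) / len(grads)
    w -= learning_rate * avg_grad
print(round(w, 3))  # → 2.0
```

With millions of inputs instead of three, the same loop decreases the error “on average”, even though individual inputs may pull the weight in opposite directions.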

What is back propagation? It is a fancy way of saying that we correct the values of the weights backwards. Refer to the image of the neural network given above. First we change the weights of the edges connecting the units in Hidden Units H2 to the output layer in the diagram given above, so as to minimize error. Then, using those new values, we change the weights of the edges connecting Hidden Units H1 to Hidden Units H2, in order to minimize the error further, and so on. An amazing video that explains this much better than I ever could is this.
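Here is a stripped-down sketch of the backwards idea, for a hypothetical two-weight chain with no nonlinearities: the error signal is computed at the output, used to update the last weight, and then propagated one layer back to update the first.

```python
# out = w2 * (w1 * x), squared error against a target; chain rule backwards.
x, target = 3.0, 12.0
w1, w2 = 0.5, 0.5
lr = 0.01
for _ in range(500):
    h = w1 * x                  # hidden unit
    out = w2 * h                # output unit
    d_out = 2 * (out - target)  # error signal at the output
    d_w2 = d_out * h            # first, the weight nearest the output...
    d_h = d_out * w2            # ...then push the error one layer back
    d_w1 = d_h * x
    w2 -= lr * d_w2
    w1 -= lr * d_w1
print(round(w1 * w2 * x, 2))  # → 12.0
```

Real backpropagation does exactly this, layer by layer, over matrices of weights and nonlinear units instead of two scalars.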

Unsupervised Training

But what if you don’t have the time to train your child? Would it still be likely that the child behaves well in front of unknown guests? A kid will not necessarily know good from bad. However, a sufficiently observant child might know that screaming and breaking china evoke certain kinds of reactions from parents and guests, and that laughing and playing with guests evoke other kinds of reactions. This is similar to the clustering algorithm used in unsupervised training. If you feed new raw data without labels into an unsupervised neural network, it may not know the *names* of the classes that the data lies in, but it can plot that data, and determine that it can be subdivided into certain classes. Look at the illustration below for an example of clustering:
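A bare-bones sketch of this idea, using a hypothetical k-means run on made-up one-dimensional data (real clustering algorithms vary): the data has two groups, but no labels, and the algorithm finds the two centers anyway.

```python
import random

random.seed(0)
# Unlabeled data: two kinds of "behavior", but no names attached to either.
data = ([random.gauss(0.0, 0.5) for _ in range(50)]
        + [random.gauss(5.0, 0.5) for _ in range(50)])

# Bare-bones k-means with k = 2: find two cluster centers without labels.
centers = [min(data), max(data)]
for _ in range(20):
    groups = [[], []]
    for x in data:
        # assign each point to its nearest center
        nearest = 0 if abs(x - centers[0]) < abs(x - centers[1]) else 1
        groups[nearest].append(x)
    # move each center to the mean of its group
    centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
print(sorted(round(c, 1) for c in centers))  # two centers, near 0 and 5
```

The algorithm never learns what the two clusters are *called*, only that the data splits into two groups, just like the observant, untrained child.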

Another feature of unsupervised learning is that instead of classification, it can encode images, and then decode images. What this does is that it removes noise, and preserves only the important details of the input. An illustration is given below:

Much like an unsupervised child, an unsupervised neural network can do a lot of things without being trained by its parents. All the above illustrations have been taken from this amazing video.

Convolutional Neural Networks

Suppose that in a more fun parallel universe, you’re asked to locate a 1 ft tall Mickey Mouse statue in a dimly lit room. How would you do that? One way is the following: you look around the room to see if you can spot objects that are approximately 1 ft tall. Objects that are too big or too small are automatically eliminated. Then amongst the more or less 1 ft tall objects, you should look for objects that have two round ears sticking out. You must have eliminated everything else apart from, possibly, Minnie Mouse. Now you can perhaps concentrate harder on facial features to conclusively locate the Mickey Mouse statue.

Similarly, convolutional networks are neural networks that try to detect features of an input image step by step, in order to determine what it is an image of. In the first hidden unit layer, the convolutional layer might detect where in the image one may find circles. This information is now input into the second hidden layer, which might detect whiskers, or spots, or perhaps stripes. At the end of this process, perhaps after being processed through thousands of layers of hidden units, the neural network might be able to tell us what the image is of.

The detection of edges etc happens by breaking the input image into pixels, and constructing a very large matrix containing information of those pixels. Say this pixel matrix contains a 1 whenever the color black is present in a pixel, and 0 whenever white is present. Gray would perhaps correspond to 0.5. Now construct a much smaller matrix, say a matrix of the form \begin{pmatrix}1&1\\0&0 \end{pmatrix}.

This matrix corresponds to two (short) rows of pixels, the top row containing the color black, and the bottom row white. Now we take this small matrix, lay it over each 2\times 2 sub-matrix of the bigger pixel matrix, multiply the overlapping entries pairwise, and sum the products. This process helps us detect which parts of the image look like one row of black on top of one row of white, which helps us find horizontal edges between black and white regions. All this has been amazingly explained in this video.
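The sliding-window multiplication described above can be sketched directly. The image below is made up for illustration; note that each output entry is a sum of pairwise products with the kernel, not a matrix product.

```python
# The small matrix from the text: one row of black over one row of white.
kernel = [[1, 1],
          [0, 0]]

image = [  # 1 = black pixel, 0 = white pixel
    [0, 0, 0, 0],
    [1, 1, 1, 1],  # a black row sitting on top of...
    [0, 0, 0, 0],  # ...a white row: a horizontal edge
    [0, 0, 0, 0],
]

def convolve(img, ker):
    """Slide the 2x2 kernel over the image; at each position, multiply
    the overlapping entries pairwise and sum them."""
    rows, cols = len(img) - 1, len(img[0]) - 1
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            out[i][j] = sum(ker[a][b] * img[i + a][j + b]
                            for a in range(2) for b in range(2))
    return out

for row in convolve(image, kernel):
    print(row)  # the response peaks (value 2) exactly along the black row
```

The middle row of the output lights up where the black row sits above the white one, which is precisely the feature this kernel was built to detect.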

Natural Language Processing

When we type “neural” into the Google Search Bar, how does Google know that we mean “neural networks” (*meta*)? This is because the neural network responsible for making these predictions has been trained on billions of pages of text, which help it “understand” which words make meaningful phrases. I will mostly be referring to this video on natural language processing.

Let us suppose that each word in the English language has been assigned a 50-tuple, in which each entry of the tuple has some number. For instance, it is possible that the word “neural” corresponds to (1,1,1,0,0,\dots,0). Now suppose we decide to train the neural network to understand English phrases that are three words long. We feed billions of pages of text into it, and train it to give us a positive score whenever the phrase is meaningful, like “neural networks rock”, and give a negative score whenever the phrase is not meaningful, like “Amit Shah rocks”.

But how does a neural network mathematically process a phrase? It takes the three 50-tuples corresponding to each word, glues them together, and now makes a 150-tuple. Assuming that the weights of all the edges in this neural network remain constant, we’re going to change the entries in the 50-tuple of each word such that when the neural network performs mathematical manipulations on the 150-tuples, we get a positive answer for meaningful phrases, and a negative answer for nonsense phrases. So after all this training is done, the words “Amit” might correspond to (1,0,\dots,0), “Shah” might correspond to (0,1,\dots,0), and “rocks” might correspond to (-1,-1,-1,\dots,-1). Hence, when the 150-tuple (1,0,\dots,0,1,\dots,0,-1,-1,\dots,-1) is plugged into the neural network, we get a negative answer.
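A toy version of this scoring, with made-up 4-tuples standing in for the 50-tuples and the simplest possible “network” (a single dot product with the glued tuple), just to show the gluing and the sign of the score. All the vectors here are invented for illustration:

```python
# Made-up 4-tuples standing in for the 50-tuples in the text.
embeddings = {
    "neural":   (1.0, 1.0, 0.0, 0.0),
    "networks": (1.0, 0.0, 1.0, 0.0),
    "rock":     (0.5, 0.5, 0.5, 0.5),
    "Amit":     (-1.0, -1.0, 0.0, 0.0),
    "Shah":     (0.0, -1.0, -1.0, 0.0),
    "rocks":    (0.5, 0.5, 0.5, 0.5),
}

# The "network" here is as simple as possible: glue the three tuples into
# one 12-tuple and take a single dot product with fixed weights.
weights = [1.0] * 12

def score(phrase):
    glued = [entry for word in phrase.split() for entry in embeddings[word]]
    return sum(w * entry for w, entry in zip(weights, glued))

print(score("neural networks rock"))  # positive: a meaningful phrase
print(score("Amit Shah rocks"))       # negative: a nonsense phrase
```

In real training it is the entries of the tuples (and the weights) that get adjusted until meaningful phrases score positive and nonsense phrases score negative; here they have simply been chosen by hand to make the point.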

But then how does word prediction work? When we type “Amit Shah” into the Google search bar, the neural network looks for words that would give the completed three-word phrase a positive score. Then it calculates the probability of each of those words completing the phrase, probably depending on the frequency of use in the training data, and offers the words with the highest probabilities as predictions just below the search bar.

Recurrent Neural Networks

Let us now talk about speech detection. Suppose you say the phrase “Black Lives Matter” (in order to train your neural network to do right by society). Each word that you say has to be separately interpreted by the neural network. This is a task that a recurrent neural network can perform: it takes as input each word that you say, assigns probabilities to the various options corresponding to that input, stores them, and then takes the next input.

For instance, when you say “Black”, the neural network assigns a 99% probability that you said black, and a 1% probability that you said blot, based on the clarity of your speech. This gets stored (in Long Short Term Memory (LSTM) units). Now let us assume that when you say “Lives”, your speech is not clear. Hence, the neural network assigns a 40% probability to that word being Lives, and a 60% probability to that word being Pies. Now when you say “Matter”, your speech is much clearer, and the neural network is 99% sure that you said “Matter”. Now it tries to process the two phrases “Black Lives Matter” and “Black Pies Matter”. The neural network will give a negative score for the latter phrase as it has never seen it before, or perhaps has been negatively trained on it, and a positive score for the former. Hence, the neural network ultimately detects “Black Lives Matter”. We hope that it says an emphatic “Yes” in response.
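The last step, choosing between “Black Lives Matter” and “Black Pies Matter”, can be sketched like this. All the probabilities and scores below are made up to match the story; a real LSTM-based recognizer computes both from the audio and from training:

```python
import itertools

# Made-up per-word probabilities from the speech model, as in the story above.
word_options = [
    [("Black", 0.99), ("Blot", 0.01)],
    [("Lives", 0.40), ("Pies", 0.60)],
    [("Matter", 0.99)],
]

# Phrases the network scores positively (everything else scores negatively).
seen_phrases = {"Black Lives Matter"}

def best_transcript():
    best, best_score = None, float("-inf")
    for combo in itertools.product(*word_options):
        words = " ".join(w for w, _ in combo)
        acoustic = 1.0
        for _, p in combo:
            acoustic *= p  # how confident we are in what we heard
        # Positive phrase score for known phrases, negative otherwise.
        phrase_score = 1.0 if words in seen_phrases else -1.0
        total = acoustic * phrase_score
        if total > best_score:
            best, best_score = words, total
    return best

print(best_transcript())  # → Black Lives Matter
```

Even though “Pies” was the acoustically likelier word, the phrase score vetoes “Black Pies Matter”, and the recognizer settles on the phrase it has actually seen.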

Future directions

The authors opine that much of the future of deep learning lies in further developing unsupervised learning, and in basing classification algorithms on human vision, which focuses on only a small part of an object at a time, and blurs everything else. The authors conclude that combining representation learning with complex reasoning is what will be most useful in the long run.

Thanks for reading!



Of mountain passes and migration in the olden days

The paper that I read today is Why Mountain Passes are Higher in the Tropics by Daniel H. Janzen. I found this paper by browsing the paper “100 articles every ecologist should read”, which was published in Nature in 2017.

What were the obstructions to the migration of species of animals in the ancient world, and what made species in temperate regions more suited to migration than those in the tropical regions?

One might imagine that tall mountains would be an obstruction to migration. High mountains would prevent species living in the valleys from travelling through impassable mountain passes to other parts of the world. However, mountains were less of an obstruction to species’ migration in temperate regions of the world than in tropical regions. Why was that?

One of the factors that could affect migration is this: if the climate in the valley and in the high mountain passes were comparable, then species that were already thriving in the valley could easily travel through the mountain passes and spread. Maybe the weather conditions of mountain passes and valleys were much more similar in temperate regions than in the tropics? That could explain why species migrated much more in temperate regions than in tropical regions. The author, after a statistical study, concludes the following:

The climatic differences between the mountain passes and the valley in tropical and temperate regions were comparable. Hence, this wouldn’t be a deciding factor in migration.

What was the deciding factor in migration, however, was that temperatures varied a lot more annually in the temperate regions than in the tropics. Anyone living on the East Coast can attest to that. Hence, by natural selection, the species in temperate regions were a lot more resistant to wildly varying temperatures than their tropical counterparts, who were used to similar temperatures throughout the year.

These toughened and highly resistant species would then cross mountain passes, and spread out over a much larger area than species in the tropics.

This also explains another interesting observation: there is a much larger variety of species found in the tropics than in temperate regions. This is because of the following reason: when a few unusually resistant individuals of a tropical species were able to cross mountain passes and travel to other regions, they would not be followed by their less resistant counterparts. Hence, they could go to new regions and populate them with their evolutionarily advantaged genes, without any mixing and interference from their unfit counterparts still stuck in their old habitat.

Hence, mountain passes are not always literally higher in the tropics. What tropical species lacked, however, was the ability to cross them.

<Insert moral about climbing mountains or something similar>