How, on Earth, we took a picture of a black hole

The paper that I will be writing about today is Computational Imaging for VLBI Image Reconstruction by Katie Bouman et al. The image-processing algorithm explained in this paper underlies the first photo that humanity ever took of a black hole.


At the time that this paper was written, there was an international collaboration underway that wanted to capture an image of a black hole’s event horizon for the first time in history. But the angular resolution needed was an order of magnitude finer than what current technology could provide. What does this mean?

We know that there are billions of stars in the sky. However, stars that are very far away don’t always appear as distinct points to us. Instead, they blur into a hazy cosmic soup. This is because the naked eye is not very good at distinguishing between stars that are very far from us and separated by only a tiny angle in the sky.

When light rays from far away stars enter the tiny apertures in our eyes, they undergo diffraction. For close enough stars, these diffraction patterns merge into one amorphous blob. Hence, instead of distinct stars, we just see a cosmic haze.

The Rayleigh formula for angular resolution is \theta=1.22 \frac{\lambda}{D}, where \theta is angular separation, \lambda is the wavelength of the incoming signal and D is the diameter of the aperture (of the human eye or a telescope). Given that the observed wavelength and the diameter of the aperture are fixed, if two stars are separated by an angle of at least \theta, they can be seen as distinct entities. If they have an angular separation that is smaller than that, they will be seen as one hazy blob.

How does all this relate to taking a photo of the event horizon of a black hole? Imagine that the event horizon subtends an angle \alpha. If \alpha>\theta, then we will be able to take an image of the black hole that has at least two distinct points. Hence, the event horizon will appear as an elongated object instead of just a dot against the night sky. This is what we mean by taking a photo. When we talk about taking a photo of Jupiter, we refer to a photo containing details of its stripes, the Great Red Spot, etc., and not just a pale dot in the night sky. We hope to accomplish a similar feat with black holes (this hope was obviously fulfilled in 2019).

Now, as black holes are very far away, the angle that they subtend can be very small. Hence, keeping \lambda constant, we will need a huge aperture diameter. For example, it is predicted that the black hole at the center of the Milky Way has an angular diameter (relative to viewers on Earth) of 2.5\times 10^{-10} radians. For \lambda=1.3 mm, we would have to build a telescope of diameter 13,000 km to resolve it. This, of course, is not realistic. To overcome this constraint, the authors plan on using multiple telescopes spread over the world, and combining their data to emulate the data collected by one giant telescope. Using multiple telescopes in this manner is known as Very Long Baseline Interferometry (VLBI). Hence, VLBI can solve our problem of angular resolution. But what about the post-VLBI stuff? What about the creation of the actual image? There exist multiple algorithms to produce images from the given data, with varying degrees of clarity.
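To get a feel for these numbers, here is a quick back-of-the-envelope check (a sketch using the values quoted above; the factor 1.22 comes from the Rayleigh criterion). Just meeting the Rayleigh limit already demands an aperture thousands of kilometres across; resolving structure within the disc, rather than merely detecting it, pushes the requirement toward the quoted 13,000 km, roughly the diameter of the Earth itself.

```python
# Rayleigh criterion: theta = 1.22 * lam / D, solved for the aperture D.
# lam and theta are the values quoted above for the Milky Way's black hole.
lam = 1.3e-3        # observing wavelength in metres (1.3 mm)
theta = 2.5e-10     # angular diameter of the source in radians

D = 1.22 * lam / theta   # aperture needed to just resolve the source
print(f"required aperture: {D / 1e3:.0f} km")   # thousands of km, Earth-scale
```

Since no single dish of that size can be built, an Earth-sized *array* of dishes is the only way to reach this resolution.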

The authors contend that the Continuous High-Resolution Image Reconstruction using Patch Priors, or CHIRP, algorithm can reconstruct better images from the same data than other existing algorithms.

A Brief Introduction to VLBI

Before we talk about the CHIRP algorithm, we need to talk about the VLBI process, so that we understand the kind of data that we’re collecting. As explained before, in order to get a better angular resolution, multiple telescopes are needed. Such an array of telescopes is known as an interferometer. This proves problematic, though. Two telescopes located at different places will receive different signals: signals that differ in phase, etc. How can we combine all of these observations to produce one coherent signal?

Here’s how we do it: fix a wavelength \lambda amongst all the incoming wavelengths. The signal at this wavelength is composed of very many spatial frequencies. Spatial frequency can be thought of as the number of light and dark lines per unit distance when a light beam is diffracted by a particular apparatus. Look at the diagram below, where SF stands for spatial frequency.

One may imagine that the incoming signal is the sum of waves of very many different spatial frequencies. We calculate the extent to which the two telescopes under consideration receive waves with the same spatial frequency. Another way to look at it: we measure the extent to which the two telescopes receive the “same” wave (with the same frequency, etc.). This is calculated by the correlation function, which is defined as

\Gamma_{12}(u,v,0)=\int\int I(l,m)e^{-2\pi i(ul+vm)}dldm

This formula comes from the Van Cittert-Zernike Theorem, and the Wikipedia article on it is extremely well-written. Here the telescopes are numbered 1 and 2, and their correlation function (also known as visibility) is denoted as \Gamma_{12}. Just some notation: (l,m) denotes the angular position of the celestial object, and (u,v) denotes the position vector of one telescope with respect to the other. The basic gist of the Van Cittert-Zernike theorem is this: if you consider any two stars (or even different points on the same star), they emit electromagnetic waves with different properties like spatial frequency, etc. Hence, the waves emitted by them are not “coherent” at any two points. An analogy: if you drop two pebbles into water, then at points close to the pebbles, different points on the water surface oscillate differently. Some oscillate wildly, while others don’t oscillate at all (destructive interference). However, points that are a little further away from the two pebbles oscillate similarly. Likewise, points that are far away from these non-coherent stars will receive waves of similar properties. The extent of this similarity, which is generally nonzero, is given by the formula above.
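The integral above is easy to play with numerically. In the sketch below, the grid size, source positions and baseline values are all made up for illustration (this is not the paper’s code): it evaluates the van Cittert-Zernike integral for a toy sky of two Gaussian “stars”, and checks that the zero-baseline visibility equals the total flux of the source.

```python
import numpy as np

# Toy numerical version of the van Cittert-Zernike integral: the visibility
# on a baseline (u, v) is the Fourier transform of the sky brightness I(l, m).
n = 64
l = np.linspace(-1, 1, n)      # angular coordinates (arbitrary units)
m = np.linspace(-1, 1, n)
L, M = np.meshgrid(l, m)

# A made-up sky: two small Gaussian "stars".
I = np.exp(-((L - 0.2)**2 + M**2) / 0.01)
I += np.exp(-((L + 0.3)**2 + (M - 0.1)**2) / 0.01)

dl = l[1] - l[0]
dm = m[1] - m[0]

def visibility(u, v):
    """Directly evaluate Gamma(u, v) = integral of I(l,m) e^{-2 pi i (ul + vm)}."""
    return np.sum(I * np.exp(-2j * np.pi * (u * L + v * M))) * dl * dm

# A zero-length baseline just measures the total flux of the source:
print(visibility(0, 0).real, I.sum() * dl * dm)
```

Each physical pair of telescopes samples this function at one (u,v) point, so an array of N telescopes gives a sparse set of samples of the sky’s Fourier transform.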

Now note that if we have N telescopes, we can form {N\choose 2} pairs of telescopes. Hence, we have {N\choose 2} visibility values. These values change with atmospheric conditions: the atmosphere introduces a delay in the phase of the incoming waves, and a different delay at each telescope, as atmospheric conditions differ from point to point on the planet. Hence \Gamma_{i,j} becomes \Gamma_{i,j}e^{i(\theta_i-\theta_j)}. However, if we take a set of three telescopes i,j,k, then the product \Gamma_{i,j}\Gamma_{j,k}\Gamma_{k,i} remains unchanged even after the phase delays introduced by the atmosphere, because the three phase differences cancel around the triangle. We refer to this triple product as the bispectrum (I will refer to the plural as bispectra). This is a useful invariant to have, as we no longer have to worry about atmospheric delays.
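The cancellation of atmospheric phases around a triangle of telescopes is easy to verify directly. In the sketch below, the visibilities and station delays are made-up numbers; the point is only that the triple product is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up "true" visibilities on the three baselines of telescopes 1, 2, 3.
gamma = {(1, 2): 0.8 * np.exp(0.3j),
         (2, 3): 0.5 * np.exp(-1.1j),
         (3, 1): 0.9 * np.exp(2.0j)}

# A random atmospheric phase delay at each station.
theta = {i: rng.uniform(0, 2 * np.pi) for i in (1, 2, 3)}

# Each measured visibility picks up the phase difference of its two stations.
measured = {(i, j): g * np.exp(1j * (theta[i] - theta[j]))
            for (i, j), g in gamma.items()}

bispectrum_true = gamma[(1, 2)] * gamma[(2, 3)] * gamma[(3, 1)]
bispectrum_meas = measured[(1, 2)] * measured[(2, 3)] * measured[(3, 1)]

# (t1-t2) + (t2-t3) + (t3-t1) = 0, so the phases cancel around the triangle.
print(np.isclose(bispectrum_true, bispectrum_meas))
```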

This introduces a number of constraints into the system, and hence reduces the number of other, independent constraints that we can impose on it. Why? Because we have {N\choose 2} independent variables of the form \Gamma_{i,j}, and these variables can vary arbitrarily (because of the delay introduced by the atmosphere, for instance). However, we now have {N\choose 3} bispectrum constraints relating them. Since a system can only carry a fixed number of independent constraints while remaining well-defined, the bispectra use up some of the freedom we would otherwise have to impose additional constraints of our own.

Let’s back up. It seems that we have a system of equations in the variables \Gamma_{i,j}, and we want to find these variables. But why do we want to find them at all? How will they help us reconstruct the image of a celestial object? It turns out that with only this information, the authors have an algorithm to construct an image of the source. This is incredible, and I will try to provide more details below. From this point on, we will assume that we have determined all the visibility (\Gamma_{i,j}) and bispectrum (\Gamma_{i,j}\Gamma_{j,k}\Gamma_{k,i}) values.

A discussion of some algorithms

What algorithm should we now use to construct an image from the given data? Some known algorithms are given below:

CLEAN– This algorithm assumes that the image is made up of discrete bright points. It finds the brightest point in the image, and then “deconvolves” around that location. Deconvolution can be thought of as recovering a sharper image from hazy data. After each iteration, the brightness of the identified source is subtracted away. Because this algorithm focuses on the brightest points, it often produces poor images of extended objects with no particularly bright points.

OPTICAL INTERFEROMETRY– We will talk about two algorithms here: the BiSpectrum Maximum Entropy Method (BSMEM), and SQUEEZE. Both are capable of using bispectrum data to construct images. BSMEM uses a Bayesian approach to image reconstruction, which may be thought of as determining the most probable “actual” image of the source, given the collected data. SQUEEZE, on the other hand, moves a set of point sources around the Field of View, and then takes an “average” of the images obtained.

The authors propose their own algorithm, CHIRP, which produces better images than the algorithms mentioned above, with minimal human intervention.

CHIRP Method

Let us think back to the formula \Gamma_{12}(u,v,0)=\int\int I(l,m)e^{-2\pi i(ul+vm)}dldm

It is clearly impossible to calculate I_{\lambda}(l,m) directly for different values of (l,m), as the source is far away, and the received signal is hazy and impossible to resolve. However, the visibility and bispectrum values have been determined above. Using these values, we determine I_{\lambda}(l,m) using the inverse Fourier transform (with an intermediate step in which we discretize I_{\lambda}(l,m)). More details are given shortly.

The authors parametrize a continuous image using a discrete set of parameters. Having a continuous image reduces modeling errors during optimization. Now remember that we have samples of the Fourier transform of the original signal I_{\lambda}(l,m) for each pair of telescopes (the visibility values). Because the image can be written as a weighted sum of known basis functions, we can represent it as the sum of continuous, scaled and shifted pulses (although the weights of these pulses, and hence the original image, are unknown at this stage). For a scene defined in the angular range l\in [-\frac{F_l}{2},\frac{F_l}{2}] and m\in [-\frac{F_m}{2},\frac{F_m}{2}], we parametrize the space into N_l\times N_m scaled pulse functions h(l,m), centered at the geometric center of each cell of the N_l\times N_m grid. The numbers N_l,N_m are for us to decide, depending on how much accuracy is desired. We use known triangular pulse functions h(l,m) here, and hence their Fourier transforms are known. The weights that we assign to the pulse functions can be written down as x[i,j], or collectively as the vector x.

What we’ve essentially done is locate the (angular) coordinates across which the celestial body extends, zoom in on those coordinates, draw up a grid across this space, and then place a pulse function inside each cell of the grid. Why have we done this? We want to convert the difficult calculation of integration into the relatively simple calculation of a linear matrix operation. In other words, we can calculate I_{\lambda}(l,m) from the values of h(l,m) using just simple linear algebra. However, we need x in order to do that, and we calculate the value of x below.
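The pulse parametrization can be sketched in a few lines. Below, the grid size, the (u,v) points and the use of a separable triangular pulse are illustrative assumptions of mine (the paper’s actual discretization has more detail); the point is that once the pulse’s Fourier transform is known in closed form, every model visibility is a linear function A x of the weights.

```python
import numpy as np

# Pulse centres on an Nl x Nm grid (toy numbers, not the paper's).
Nl = Nm = 8
centers_l = np.linspace(-0.5, 0.5, Nl)
centers_m = np.linspace(-0.5, 0.5, Nm)
w = centers_l[1] - centers_l[0]   # pulse half-width = grid spacing

def pulse_ft(u, v):
    """Fourier transform of a separable 2-D triangular pulse at the origin.

    np.sinc(x) is sin(pi x)/(pi x), so a triangle of half-width w
    transforms to w * sinc(w u)^2 in each dimension.
    """
    return (w * np.sinc(w * u)**2) * (w * np.sinc(w * v)**2)

# Each measured (u, v) point gives one row of a linear operator A; the
# Fourier shift theorem moves the pulse to each grid centre.
uv_points = [(1.0, 0.0), (0.0, 2.0), (3.0, -1.0)]   # made-up baselines
A = np.array([[pulse_ft(u, v) * np.exp(-2j * np.pi * (u * cl + v * cm))
               for cl in centers_l for cm in centers_m]
              for (u, v) in uv_points])

x = np.random.rand(Nl * Nm)    # pulse weights = the image parameters
visibilities = A @ x           # model visibilities, linear in x
print(visibilities.shape)      # one complex number per (u, v) point
```

Because the model visibilities are A x, fitting them to data becomes a linear-algebra problem rather than a continuous integration problem.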

We need to construct an image from the values of the bispectra that we have. We use a maximum a posteriori (MAP) estimate. What this means is the following: consider the vector x as representing the weights of the pulse functions h(l,m), as described above. If the observed bispectra (used instead of visibilities because these values don’t change with atmospheric conditions) are denoted as y, then we need to minimize the energy f_r(x|y)=-D(y|x)-EPLL_r(x). Clearly, y here is fixed, and we vary the values of x to find the optimal one.

The issue at hand is slightly more subtle than I let on. D(y|x) and EPLL_r(x) can only be calculated if we already know what \hat{I}(x) is. However, at this stage, we clearly don’t. Hence, what is probably happening is an iteration: we start with an image obtained by patching together information from the patch prior (hence the value of x is known). We use this value of x to determine D(y|x) and EPLL_r(x). Using these, we now calculate a better value of x by minimizing the function f_r(x|y)=-D(y|x)-EPLL_r(x), and so on. After having found an optimal value of x, we can use it to reconstruct the image. In practice, we don’t minimize this function directly, but use a slightly different algorithm that gives us the same result. This algorithm is explained below.

We use the “Half Quadratic Splitting” method. Start with a blurry image, with a known value of x. Use this value, and the value of a varying parameter \beta, to calculate the most likely patches \{z^n\}, which may be thought of as the most likely sub-images in the grid that can be overlapped and put together to form the complete image. This information is then used to calculate a better value of x. This value can again be used to calculate the most likely patches \{z^n\}, now with a different value of \beta. The values of \beta for which we run this iteration are 1,4,8,16,32,64,128,256,512. After this process is completed, we will have a “good” value of x, with which we can construct I_{\lambda}(l,m).
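Here is a minimal sketch of the half quadratic splitting loop, on a 1-D toy problem rather than the bispectrum objective: the measurement operator B and the soft-threshold “prior” step are stand-ins I made up for the real data term and EPLL patch step, but the alternation between a z-step and an x-step, under the same increasing \beta schedule, is the same idea.

```python
import numpy as np

rng = np.random.default_rng(1)

# A blocky 1-D "image" observed through a made-up blur operator B.
n = 32
x_true = np.zeros(n)
x_true[10:15] = 1.0
B = np.eye(n) + 0.5 * np.eye(n, k=1)       # stand-in measurement operator
y = B @ x_true + 0.01 * rng.normal(size=n)  # noisy data

x = np.zeros(n)
for beta in [1, 4, 8, 16, 32, 64, 128, 256, 512]:
    # z-step: most likely "patches" given the current estimate; here a
    # soft-threshold pushing the estimate toward a sparse prior.
    z = np.sign(x) * np.maximum(np.abs(x) - 0.1 / beta, 0.0)
    # x-step: quadratic problem balancing the data term ||y - Bx||^2
    # against agreement with z, weighted by beta.
    x = np.linalg.solve(B.T @ B + beta * np.eye(n), B.T @ y + beta * z)

print(np.round(x[10:15], 2))   # estimate over the true support
```

As \beta grows, x is forced to agree with the prior-projected z, so the two sub-problems gradually converge to one answer.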


So does the CHIRP algorithm perform better than SQUEEZE, BSMEM, or other algorithms? Let us look at the pictures below.

Note that the data, based on which these images were created, are a result of simulations, and not actual observations. The authors made their very extensive and useful dataset available here.

The CHIRP algorithm was also found to be more robust to random (Gaussian) noise.

What effect did patch priors have? Patch priors allow for the algorithm to train itself on what the image “probably” looks like, given the input data y. Patch priors helped the algorithm to piece together a much better picture of celestial bodies.

The authors contend that CHIRP is a better image processing algorithm than other state of the art algorithms. And they did prove it with a magnificent picture of the event horizon of a black hole 4 years later, for all of eternity to stare at with moist eyes.


  1. Computational Imaging for VLBI Image Reconstruction
  2. Van Cittert-Zernike Theorem

Why is chlorophyll green?

The paper that I want to discuss today is Quieting a Noisy Antenna Reproduces Photosynthetic Light Harvesting Spectra, by Arp, Aji, Gabor et al. I first read about this in a Quanta article. The scientific question answered in this paper is amazing: why are plants green?

So, why indeed are plants green? One might say it is because of chlorophyll, which is green in color. But why is chlorophyll green? Of course, one answer to this question is “because it is green”. But a more satisfying answer is that chlorophyll being green ensures that plants receive a steady, not-wildly-fluctuating supply of energy.


A plant, or at least the light harvesting parts of a plant, may be thought of as an antenna. The extremal points of a plant absorb energy from sunlight, and transfer this energy via various pathways to the parts where it is needed in order to make food and sustain other life-giving activities. Sunlight contains a spectrum of wavelengths, and plants would seemingly want to absorb all of them in order to maximize their intake. However, absorbing the green part of the spectrum would lead to a lot of variance in the amount of energy absorbed. Hence, to reduce this variance, plants simply reflect the green part of sunlight, and absorb the red and blue parts.

And that is why plants appear green.

Antenna Network and Noise

Ensuring that the energy input into a network or grid is equal to the energy output is a fundamental requirement of networks. If excess energy is absorbed, it may destroy the system, and if not enough energy is absorbed, an underpowered system will soon shut down. However, the environment of a plant can vary rapidly with time. The sun can be covered by clouds, the plants above a light absorbing leaf may sway with the wind and hence block access to sunlight at intervals, etc. How can a plant ensure that it receives a steady supply of energy? Clearly, much like we need a constant amount of food everyday, a plant’s energy output to its food making parts needs to be constant in order to survive.

If the energy absorbed by a plant at a fixed point in time is plotted as a probability distribution, with different probabilities assigned to different amounts of energy absorbed, then the greater the variance of this distribution, the greater the fluctuation in the energy absorbed by the plant. This variance, which is called noise, should be reduced. Reducing noise is going to be our main motive in the analysis below. Methods of reducing noise like adaptive noise filtering require external intervention, and hence are not available to plants.

One node network

Imagine a network with a single input node A, that absorbs light at wavelength \lambda_A with (maximum) power P_A. Note that the average power absorbed does not have to equal P_A due to changing external conditions like swaying plants blocking sunlight, etc. Hence, the average power absorbed by a plant is p_A P_A, where p_A<1, and can be thought of as the probability of the plant absorbing P_A.

Let the energy output be \Omega. If P_A=\Omega, then the average energy input, which is p_A P_A<P_A, would always be less than output. Hence, in this model of only one input node A, P_A>\Omega so that the average energy input is equal to output. In other words, p_A P_A=\Omega.

Let us now calculate the variance of the energy received. If p_A is the probability that the plant receives P_A energy, and (1-p_A) is the probability of it receiving 0 energy, then the variance is \sigma^2=p_A(P_A-\Omega)^2+(1-p_A)(\Omega)^2. Using p_A P_A=\Omega, this can be simplified as

\sigma^2=p_A P_A^2-\Omega^2=\Omega(P_A-\Omega)

We should look for ways to reduce this variance. We can do this by having two nodes instead of one.
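Expanding the variance defined above and using the constraint p_A P_A=\Omega gives \sigma^2=\Omega(P_A-\Omega). A quick Monte Carlo check of this, with made-up numbers for \Omega and P_A:

```python
import numpy as np

rng = np.random.default_rng(2)

# One-node model: the plant absorbs P_A with probability p_A, else 0.
omega = 1.0
P_A = 2.5
p_A = omega / P_A   # constraint: average input p_A * P_A equals the output omega

# Simulate many independent moments of absorption.
absorbed = np.where(rng.random(1_000_000) < p_A, P_A, 0.0)

empirical_var = np.var(absorbed)    # variance about the mean (which is omega)
formula = omega * (P_A - omega)     # closed-form prediction

print(empirical_var, formula)
```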

Two node network

Let us now have a network with two input nodes, and see if we can reduce the variance. Let the input nodes A and B absorb light at wavelengths \lambda_A,\lambda_B. Let the powers absorbed be P_A,P_B with probabilities p_A,p_B. We will assume that P_A<\Omega<P_B. Also, we want p_A P_A+p_B P_B=\Omega, as the average power input should equal the output. One constraint of the system is that the plant shouldn’t absorb P_A+P_B power, because P_A+P_B>>\Omega. Hence the possibilities are that the plant absorbs P_A power, P_B power, or 0 power. The variance of this model is \sigma^2=p_A(P_A-\Omega)^2+p_B(P_B-\Omega)^2+(1-p_A-p_B)(\Omega)^2. Using p_A P_A+p_B P_B=\Omega, this can be simplified as

\sigma^2=p_A P_A^2+p_B P_B^2-\Omega^2

Clearly, this is smaller than the variance we get from the network with just one node. The good news is that plants also have two input nodes- chlorophyll a and chlorophyll b. The presence of two input nodes for a given wavelength probably served as an evolutionary advantage in order to minimize noise in energy absorption.


We want plants to absorb energy at a steady rate, ensuring that energy input equals energy output. We want to maximize P_B-P_A, where P_A<\Omega<P_B, so that the noise, or variance in energy absorption, is minimized. Our constraint is that p_A P_A + p_B P_B=\Omega. And we want to do this for all the wavelengths of light that we can.

Now the maximum power available P_A depends upon the sunlight available, and is given below as the black curve in the graph. Ignore the two peaks for now.

Hence, we can ideally select two nodes each for the blue, green and red regions of the wavelength spectrum, and absorb energy from each of them. In order to reduce noise, however, we need to maximize P_B-P_A. This can be done if we place a pair of nodes in each of the three regions, and let the two nodes in a pair have very similar wavelengths but different P_i‘s. This is easy where the slope of the irradiance curve is high. We can see in the graph above that the slope of the irradiance curve is high in the blue and red regions. However, in the green region, the slope is close to 0. Hence, if we place two nodes there with similar wavelengths, P_B-P_A will be almost 0, and there will be a lot of noise in the energy input.
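The slope argument can be illustrated with a made-up irradiance curve (a single Gaussian peak standing in for the green maximum of sunlight; all the numbers here are arbitrary): two nodes a few nanometres apart see very different powers on a steep flank, and almost identical powers near the flat peak.

```python
import numpy as np

def irradiance(lam):
    """Toy irradiance curve: one broad peak at 550 nm, arbitrary units."""
    return np.exp(-((lam - 550.0) / 80.0) ** 2)

dlam = 5.0   # the two nodes of a pair sit 5 nm apart

# On the steep blue-side slope, the pair sees very different powers...
blue_gap = abs(irradiance(450.0 + dlam) - irradiance(450.0))
# ...while near the flat green peak the pair sees almost the same power.
green_gap = abs(irradiance(550.0 + dlam) - irradiance(550.0))

print(blue_gap, green_gap)   # the slope region gives a much larger gap
```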

This is the reason why plants have two nodes each only in the red and blue regions of the light spectrum, and not the green region. The green light is reflected, and this is why plants are green.

Purple bacteria and green sulphur bacteria can be modeled using the same constraint of reducing noise in energy absorption. Hence, the model developed by the authors is robust, and can explain the color of much of the photosynthetic life found on the planet.


  1. Quieting a Noisy Antenna Reproduces Photosynthetic Light Harvesting Spectra

Jagdish Chandra Bose and Plant Neurobiology

The paper that I’ll be discussing today is Jagdish Chandra Bose and Plant Neurobiology by Prakash Narain Tandon, a celebrated Indian scientist.

When I was a school student in India, I often came across JC Bose’s claims of plants being sentient beings, having nervous systems, etc. However, these things were never part of the official curriculum (i.e. we never had to learn them for tests). Bose’s statements in this matter have always been considered something of a not-completely-scientific, “the whole universe is connected by prāna”-type claim by the larger scientific community. This paper asserts that despite initial rejection, most of Bose’s claims have been proven correct by modern science in recent times.


By the time Bose retired from Presidency College, he was a world renowned physicist who was known to have studied radio waves even before Marconi (although the primacy debate is a complex one, there is evidence to suggest that there were scientists in Europe who had studied radio waves even before Bose). After retiring, Bose started working at Bose Institute (which he founded), and guided by his “Unity of Life” philosophy, started studying the effect of radio waves on inorganic matter. Finding their response to be “similar to animal muscle”, he now started studying plant physiology (the nervous system of plants). He would expose plants to various stimuli, and record their response through the use of ingenious instruments that he himself designed. His conclusion was that the nervous impulses of plants were similar to those of animals.

Action Potentials

Bose studied both large plant parts and individual plant cells. He would connect microelectrodes to these cells, and record their response to stimuli. He concluded that plants contain receptors of stimuli, nerve cells that code these stimuli electrically and propagate these messages, and also motor organs that purportedly helped in carrying out a response to the stimuli. In this, he concluded that plants and animals have similar nervous systems. Bose said that the nervous system in plants was responsible for things like photosynthesis, ascent of sap, response to light, etc.

Bose said that the action potential of plant neurons follows the unipolarity of animal neurons. But what is an action potential? This is an amazing video explanation of what it is. An action potential is the electrical impulse via which neurons transmit messages. In the resting state, the electric potential difference between the inside and the outside of a neuron is -70 mV (the inside is negatively charged). When neurotransmitters activate the neuron (because a message is to be passed), this negative potential difference is destroyed by a stream of positive sodium ions that flows into the neuron from the outside. This causes many changes in the neuron, including inducing it to release neurotransmitters to activate the next neuron in line. The electric potential difference becomes positive, and then becomes negative again as the neuron loses a lot of potassium ions to the outside. The sodium-potassium pump on the cell membrane then expends energy to exchange sodium and potassium ions, ensuring that the neuron returns to the state it was in before it was excited. Thus, the neuron enters the resting state again. This is the chemical mechanism by which a neuron conducts a message.

Where can one find the “nerves” of plants? Bose localized the nervous tissue in the phloem, which conducted both efferent and afferent nervous impulses. He also measured the speed of the nervous impulse, which he found to be 400 mm/sec. Although Burdon-Sanderson and Darwin had previously reported on nerve impulses in insectivorous plants, Bose’s studies over the next three decades were far wider and deeper. Although ignored after the 1930s, his studies have been found to be correct by modern experiments. The author claims that Baluska et al have not only confirmed Bose’s major findings, but have also advanced these further utilizing molecular biology, genomics, etc. Baluska seems to have published this paper in a journal that he himself is the editor of. Hence, these claims perhaps need to be investigated further.

Electrical Studies

Along with Action Potentials (APs) (common to plants and animals), Slow Wave Potentials (SWPs) or Variation Potentials (VPs) (found only in plants) are also used by plants to transmit nerve impulses. These SWPs do not propagate electrically, but via hydraulic pressure exerted by tissues found in the xylem. Some plants like the Dionaea flytrap were found to possess unidirectional APs similar to those found in cardiac myocytes (cardiac muscle cells). This prompted Bose to poetically state that plants possess hearts that beat as long as they live.

Molecular Studies

At the molecular level, plants possess voltage-gated channels (membrane proteins that are activated by changes in electric potential and allow the exchange of ions), a vesicular trafficking apparatus (for the transport of proteins and other molecules within the cell plasma), etc., all of which are also found in animal cells. Trewavas also observed that water-soluble Ca^{2+} ions are responsible for intra-cell communication, and for inducing changes in plants as a response to environmental conditions. We now know that there exist many such water-soluble (these are called cytosolic) messengers in plants, as there do in animals.

Plant Roots

Darwin had pointed out that the tip of the radicle (found in the roots) is endowed with sensitivity, and also directs the movements of adjoining parts. In this, it is like the brain.

Bose elaborated on this by saying that the radicle is stimulated by friction and the chemical constitution of the surrounding soil. The cells undergo contraction at appropriate times, causing their liquid contents to go up. This causes the ascent of sap. Baluska et al carried these claims even further, and stated that within the root apex of the maize plant, there is a “command centre” which facilitates the long-distance travel of nervous impulses, and instructs the plant to move towards positive stimuli (and away from negative stimuli). Tandon rejects the notion that such a command centre is anywhere near as complex as an animal brain.

Synapses- Neurotransmitters

Bose found the nerve cells in plants to be elongated tubes, and the dividing membrane between them to be the synapse (the gap between animal neurons where messages are transmitted between neurons). This claim has been substantiated by Barlow, who said that plant synapses share many characteristics with animal synapses. Plants also use many of the same neurotransmitters as animals like acetylcholine, glutamate and \gamma-aminobutyric acid.

Plant Memory, Learning and Intelligence

Bose claimed that plants are intelligent, have memory, and are capable of learning. Tandon claims that Trewavas describes a large number of protein kinases in plant neural pathways, and hence finds their nervous system to be similar to that of animals. On skimming Trewavas’ paper, however, I mostly found it to say that although there do exist protein kinases in plants, the neural systems found in plants differ from those of animals in important ways.

In another paper, Trewavas claims that one important difference between plant and animal nervous systems is the timescale of response: plants respond much more slowly to external stimuli. Hence, we need time-lapse photography to properly study the plant neural response. Also, if intelligence can be thought of as “a capacity for problem solving”, then plants show signs of intelligence, as they change their architecture, physiology and phenotype in order to compete for resources, forage for food, and protect themselves against harsh elements of the environment.

Barlow substantiates these arguments, claiming that plants rapidly convert external stimuli to electrochemical signals, which cause them to change their physiology. He also claims that plants do have memory, as their decision making would involve the recollection of previously stored memories. Barlow also says that roots experience four stimuli at once (touch, gravity, humidity and light), and have to decide how to obtain the optimal mix of all four. Hence, plants do possess the decision-making aspect of intelligence.

With regard to plant intelligence, Baluska et al make the following claim:

“Recent advances in chemical ecology reveal the astonishing communicative complexity of higher plants as exemplified by the battery of volatile substances which they produce and sense in order to share with other organisms information about their physiological state”

Gruntman and Novoplansky from Israel also claim that B. dactyloides plants are able to differentiate between themselves and other plants, and that if a plant has multiple roots, each set of roots identifies the others as belonging to the same plant (and hence these roots don’t compete with each other for resources). But how do plants recognize themselves and others? The authors claim that this happens through the internal oscillations of hormones like auxin and cytokinin. The frequency of this oscillation is unique to each plant, and can be sensed externally by roots.


Bose claimed that

“these trees have a life like ours… they eat and grow… face poverty, sorrows and sufferings. This poverty may… induce them to steal and rob… they help each other, develop friendships, sacrifice their lives for their children”

The author finds that this sentiment is not yet fully supported by the scientific data collected by Bose. However, these claims may be further ratified when more experiments are done in this realm.

Bose single-handedly created the field of Plant Neurobiology. Although the establishment of this field has its opponents, even the most vocal of these opponents cannot find fault with any of Bose’s scientific claims. The author hopes that plant and animal neuroscientists communicate better with each other in the future, and find the time and resources to study this field more. Hopefully, such studies will ratify even more of Bose’s revolutionary ideas and claims.


  1. Jagdish Chandra Bose and Plant Neurobiology
  2. How Plants Learn

The Imitation Game

The paper that I’m reviewing today is Computing Machinery and Intelligence by Alan Turing. It is one of the most important papers in the history of computation, and has hence shaped the world as we know it today. It could perhaps be called more an essay than a paper; this claim will hopefully become apparent below. I spent 4th of July reading it, and it was a pretty fun way to spend my day.

The Imitation Game

“Can machines think?” We soon realize that we first need to define the terms “machine” and “think”. Let us set the scene: there are three people- A (a man), B (a woman) and C (the interrogator). None of them can see each other or speak, and they can only communicate through typewritten messages. The interrogator (C) has the job of determining which of A and B is the man and which the woman (he knows one of them is a man and the other a woman, but not which is which). C can ask them any number of questions. A’s job is to mislead C into thinking that he is the woman. B’s job is to convince C that she is the woman, and that A is the man. Both A and B can lie in their answers to C.

Clearly, lying and misleading are traits of human beings, and require thought. Suppose we replace A (the man) with a machine. Will the machine be as convincing a liar as A? Will it be able to dodge C’s questions as skillfully as A? If the probability of C calling out the machine’s lies is less than or equal to the probability of C calling out A’s lies, then Turing says that the machine can “think”. Testing whether someone or something can “think” would involve having them accomplish a task that requires considerable thought and planning, and lying and misleading clearly fall into that category. Without going into the metaphysics of what exactly “thought” is, and why inanimate objects supposedly can’t think because they don’t own big squishy brains like ours, I think this is a good definition of what it means for a machine to think.

What is a “machine” though? In order to do away with objections like “humans are also machines”, Turing says that all machines considered in this paper are “digital computers”. This may sound like a very restrictive definition. However, he says that digital computers, given enough memory, can imitate all other machines (and are hence universal machines). This claim will be partly justified below. Hence, if we wish to prove that there exists some machine capable of a certain task, it is equivalent to prove that there exists a digital computer capable of that task.

One of the more fascinating parts of this section is where Turing describes how the storage of a computer would work, and how a loop program works. He then gives an example of a loop in human life, which is what a loop program is meant to imitate:

To take a domestic analogy. Suppose Mother wants Tommy to call at the cobbler’s every morning on his way to school to see if her shoes are done, and she can ask him afresh every morning. Alternatively, she can stick up a notice once and for all in the hall which he will see when he leaves for school and which tells him to call for the shoes, and also to destroy the notice when he comes back if he has the shoes with him.

Although fascinating, the details of such functions are quite well known, and hence I won’t elaborate more on them here.

Universal machines

A “machine” is a pretty broad term in general. It has certain states, and rules for how it will behave when it is in each state. For instance, the universe may be thought of as a machine. Its initial state was extremely hot, and God’s Rule Book said that in such a state, the universe had to expand and cool down, which is exactly what it did. The universe can be described as a continuous state machine: if its initial state had been, say, even one millionth of a degree cooler, it would have evolved very differently. Hence, the universe is a machine that is very sensitive to initial conditions, and that is what makes it a continuous state machine.

The opposite of a continuous state machine is a discrete state machine: a machine that is not very sensitive to initial conditions. An example would be your stereo system. Imagine that it has one of those old-fashioned volume knobs that you turn to increase or decrease the volume. It won’t make much of a difference to your audio experience if you turn the knob slightly more than you intended to: 70 dB will sound almost the same as 71 dB. Hence, what a lot of manufacturers do is break up the volume level into discrete states. Because 70 dB sounds practically the same as 71 dB, both can be clubbed into the volume level “15”. When you turn the volume knob and see the volume bar in the visual display go up by 1, you’ve moved up from one state to the next. In this way, your stereo system is a discrete state system.
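
The volume-knob idea can be sketched in a few lines of code. This is a minimal illustration (the 2 dB step size and the state numbering are my own hypothetical choices, not from Turing's paper): every 2 dB band of the continuous dial collapses into one discrete state, so nearby inputs become indistinguishable.

```python
def to_discrete_state(volume_db, step_db=2.0):
    """Map a continuous volume level (in dB) to a discrete state index.

    step_db is a hypothetical step size: every 2 dB band is one state,
    so small perturbations of the knob do not change the state at all.
    """
    return int(volume_db // step_db)

# 70 dB and 71 dB fall in the same 2 dB band, so they map to the same state:
print(to_discrete_state(70.0))  # 35
print(to_discrete_state(71.0))  # 35
print(to_discrete_state(72.0))  # 36 (the next state)
```

This insensitivity to small input changes is exactly what makes the behavior of a discrete state machine easy to predict.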

Digital computers are discrete state machines. Discrete state machines are stable to small changes in initial conditions, and hence it becomes easy to predict their behavior in the future, based on knowing their initial state.

Turing then makes a fantastic prediction in this paper:

..I believe that at the end of this century…one will be able to speak of machines thinking without expecting to be contradicted.

Turing claims that this is only a conjecture, but that making conjectures is good for science. This is one conjecture that is perhaps still hotly debated across the world, and “verify you are a human” tests on many websites regularly play The Imitation Game with us.

Can machines have our immortal soul?

Turing anticipates many objections to his claim that machines can “think” if they can mislead other humans. And he deals with these anticipated objections one by one.

The Theological Objection might be that God provides immortal souls capable of thinking only to humans, and not to animals or machines. Turing contends that “we should not irreverently usurp his power of creating souls”, and that He might indeed have endowed all objects, animate and inanimate, with souls capable of thought. Turing also clarifies that religious arguments don’t impress him anyway. The Head in the Sand Objection contends that contemplating machines having the capacity for thought is too scary, and hence we should assume that they will never gain this faculty. Intelligent people, basing their superiority on their capacity for thought, often make this argument. Turing thinks that this argument is too weak to even argue against, and that such “intelligent people” just need consolation.

The Mathematical Argument against machines being able to imitate humans well is that every machine is based on a certain logical system. Gödel proved that for every such logical system, there is a Yes/No question that the system will either answer incorrectly or never be able to answer. As humans (namely C, in this context) don’t have such a handicap, we can easily find out whether A is a machine or not. However, knowing what this question might be is going to be near impossible, as we wouldn’t know what type of machine A might be (even if we are fairly sure it is a machine). Hence, we might not know what question to ask to make the machine falter. Moreover, it is entirely possible that humans would also give the incorrect answer to this question (as it is likely to be a tough logical question). In addition, what if this one machine is actually a collection of digital machines, with a different “trap” question for each? It is going to be tough finding a single question on which all such machines fail. Therefore, this mathematical objection, although valid, is not something we need to worry about too much.

The Argument from Consciousness says that machines cannot think unless they can write a sonnet, feel pleasure at their success, feel warm because of flattery, etc. Turing contends that there is no way of knowing whether machines can feel any of this unless we can become the machine itself, which is impossible. One way to ascertain the emotional content of a machine is to conduct a viva voce, in which answering the questions posed would require some amount of emotional processing. Turing provides an example of such a viva voce in the paper, which he expects both a human and a thinking machine to be able to sustain.

Lady Lovelace, who described in detail the Analytical Engine that Charles Babbage designed, claimed that machines cannot do anything original, and can only do what they’re programmed to do. Turing contends that although the kinds of machines that Lady Lovelace could see perhaps led her to that conclusion, she was incorrect, and that even the Analytical Engine could be suitably programmed so that it could do “original” things. One variation of her statement could be that “machines cannot do anything new”. However, even the “new” things that humans do are inspired, at the very least, by their own experiences. Hence, an even better variant would be “machines cannot surprise us”. This is also incorrect: humans often make slipshod calculations that lead them to certain conclusions, and when they ask machines for answers, they’re often surprised by the (correct) answers that the machines provide. An analogy would be when we incorrectly calculate that preparing for a certain exam will take us one all-nighter, and are surprised by how bad that plan was: we did not correctly calculate our productivity over the course of one night.

The Argument from Continuity says that the nervous system is a continuous state system, while a digital computer is a discrete state system; hence, a digital computer can never successfully imitate a human being. Turing counters this by saying that a digital computer is capable of imitating continuous state systems. Take a differential analyzer, for instance, which is a continuous state system. If asked for the value of \pi, the differential analyzer might give any value between 3.12 and 3.16, based on its current state. This is a feature of continuous state systems: even a small deviation from the “ideal” state brings about a noticeable change in the output. This could be imitated by a digital computer by having it output 3.12, 3.13, 3.14, 3.15, 3.16 with probabilities of 0.05, 0.19, 0.55, 0.20, 0.01.
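
A minimal sketch of this probabilistic imitation, assuming the five outputs run from 3.12 to 3.16 and pairing them with the five probabilities above. A digital (discrete) machine samples from a fixed distribution, yet from the outside it looks just as "wobbly" as the continuous analyzer:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Possible "readings" of pi and their probabilities, as in the text.
values        = [3.12, 3.13, 3.14, 3.15, 3.16]
probabilities = [0.05, 0.19, 0.55, 0.20, 0.01]

def noisy_pi():
    """Sample one plausible reading, imitating a differential analyzer."""
    return random.choices(values, weights=probabilities, k=1)[0]

samples = [noisy_pi() for _ in range(10_000)]
most_common = max(set(samples), key=samples.count)
print(most_common)  # 3.14: the most probable reading dominates
```

To an interrogator who only sees the outputs, this discrete sampler is indistinguishable from a machine that drifts continuously around \pi.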

The Argument from Informality of Behavior says that machines follow a rule book, which tells them how to behave under certain circumstances. Moreover, the behavior of a machine can be completely studied in a reasonable amount of time, such that we’ll be able to predict the behavior of a machine perfectly in any given situation. Humans are not predictable in such a fashion. Hence, humans and machines are different, and can be easily distinguished. Turing argues against this by saying that it is possible that such a “rule book” for humans may also exist, and the unpredictability of humans is just a result of the fact that we haven’t found all the rules in the book yet. Moreover, he says he has written a program on a relatively simple Manchester computer, which when supplied with one 16 digit number returns another such number within 2 seconds. He claims that humans will not be able to predict what number this program returns even if they get a thousand years to study the machine.

One of the more interesting sections of the paper is where Turing says that if player B is telepathic, then this game would break down, as she would easily be able to pass tests like C asking “What number am I thinking of?”. Clearly, a machine would be unable to guess this number with any degree of certainty. While Turing contends that telepathy is indeed real, he overcomes this problem by suggesting that all the participants sit in “telepathy-proof” rooms.

Learning Machines

Turing says that in some sense, a machine can be made to become like a “super-critical” brain (something which is capable of devising plans of action after being given an idea), and that in another sense a brain is also a machine. However, these ideas are likely to be contested. The only proof of whether a learning machine can exist can be given only when one such machine is constructed, which for all purposes lies far ahead in the future.

But what are the constraints when one tries to construct such a machine? Turing says that a human brain uses about 10^7 binary digits to process ideas and “think”. These binary digits can be thought of as analogues to neurons, which are 1 when they fire, and 0 when they don’t. A particular combination of neuron firing leads to resultant thoughts and actions. Turing thinks that the storage space needed for containing these many binary digits can be easily added to a computer. Hence, there is no physical constraint on constructing this device. The only question is, how do we program a computer to behave like a human being?

One could theoretically observe human beings closely, see how they behave under all possible circumstances, and then program a computer to behave in exactly that way. However, such an endeavor is likely to fail. One can instead program a computer to behave like a human child, and then make it experience the same things that a human child experiences (education, interacting with others, etc). Because a child is mostly a blank slate, and forms its picture of the world and behavioral paradigms based on its experiences, a computer may learn how to be a human adult (and hence imitate a human adult) in exactly the same way. Turing concedes that constructing a “child machine” is a difficult task, and that some trial and error is required.

A child learns through a “reward and punishment” system. What rewards and punishments can a teacher possibly give to a machine? A machine has to be programmed to respond to rewards and punishments much like a child does. It should repeat behaviors for which it is praised by the teacher, and not repeat behaviors for which it is scolded or punished. Also, it can be fed with programs that ask it to do exactly as the teacher says. Although inferring how the machine should behave based on fuzzy input from the external world might lead to mistakes, Turing contends that this is not any more likely than “falling over unfenced cliffs”. Moreover, suitable imperatives can be fed into the machine to further reduce such errors.

A critical feature of “learning machines” is that they change their rules of behavior with time. Suppose they have been programmed to go to school every day at 8 am. One day, the teacher announces that school will begin at 9 am from the next day. Learning machines are able to change the 8 am rule to 9 am. Adapting rules of behavior based on external input is also a feature of human beings.
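
A minimal sketch of this, assuming (hypothetically) that a learning machine stores its rules of behavior as a simple condition-to-action table that experience can overwrite:

```python
# Hypothetical rule table: condition -> action time.
rules = {"leave for school": "08:00"}

def learn(rule, new_value):
    """Overwrite a stored rule of behavior after new external input."""
    rules[rule] = new_value

# The teacher announces that school now begins at 9 am:
learn("leave for school", "09:00")
print(rules["leave for school"])  # 09:00
```

The point is only that the rules are data, not fixed wiring, so the machine's behavior at any moment depends on everything it has been told so far.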

How does a machine choose its mode of behavior, though? Suppose the parents say “I want you to be on your best behavior in front of the guests”. What does that mean? There are a lot of near-solutions to this problem: the machine could sit in a corner silently throughout the party, it could perform dangerous tricks to enthrall and entertain the guests, etc. How does it know which of these solutions is best? Turing suggests that we introduce a random element. Let us suppose that all the “solutions” mentioned above are assigned numbers between 0 and 100. The machine could pick a number randomly, try out the solution corresponding to that number, evaluate the efficacy of that solution, and then pick another number. After a certain number of trials (maybe 15), it could pick the solution that was the most effective. Why not try all numbers between 0 and 100? Because that would take too much time and computation.
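
This trial-and-selection procedure is just random search. A sketch under stated assumptions: the efficacy scores here are made-up random numbers standing in for "how well did that behavior go over with the guests", and 15 trials is the budget mentioned above.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical efficacy scores for the 101 candidate "solutions", numbered 0-100.
# In Turing's setting these would be judged behaviors; here, just numbers.
efficacy = [random.random() for _ in range(101)]

def pick_behaviour(trials=15):
    """Try `trials` randomly chosen solutions and keep the best one seen."""
    tried = [random.randrange(101) for _ in range(trials)]
    return max(tried, key=lambda i: efficacy[i])

best = pick_behaviour()
print(best, round(efficacy[best], 3))
```

With 15 random trials the machine probably won't find the single best of the 101 solutions, but it will usually find a good one, at a fraction of the cost of exhaustive search.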

By the time the machine is an adult, it will be difficult to accurately predict how the machine will behave in a certain situation, because we cannot know all the experiences that the machine has been through, and how its behavioral patterns have evolved. Hence, if we program something into it, it might behave completely unexpectedly- it might “surprise us”. Also, because its learned behavior and traits are unlikely to be perfect, it is expected to behave less than optimally in many situations, and hence mimic “human fallibility”. For all purposes, our machine will have become a human.

What is incredible to me is that this is exactly how neural networks behave. It is insane that Turing could envision such a futuristic technology more than half a century ago. Thanks for reading!


  1. Computing Machinery and Intelligence

The science of going in circles on doughnuts

The paper that I’m going to be reviewing today is A Topological Look at the Quantum Hall Effect. It explains an amazing coming together of topology and quantum physics, of all things, and provides a review of very important, Nobel prize winning work in Physics.

The Hall Effect

We all learned in high school that if we pass current through a conductor in a magnetic field, and the magnetic field is not parallel to the current, then the conductor experiences a force.
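
For reference, the high-school result being invoked is the magnetic force law (standard physics, not specific to the paper): the force on a straight current-carrying wire, and the underlying Lorentz force on each individual charge carrier moving inside it,

```latex
\vec{F}_{\text{wire}} = I\,\vec{L} \times \vec{B},
\qquad
\vec{F}_{\text{charge}} = q\,\vec{v} \times \vec{B}.
```

The cross product is why the force vanishes when the current is parallel to the field, and why the question below (does the force act on the charges or on the conductor as a whole?) is natural to ask.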

But does the conductor as a whole experience this force, or only the electrons inside it? Maxwell, in his Treatise on Electricity and Magnetism, said that “the mechanical force which urges the conductor…acts not on the electric current, but on the conductor which carries it.” This seems wrong, right? It is the charges that should experience the force, and not non-charged portions of the conductor.

Edwin Hall, a student at Johns Hopkins University, investigated further. He passed a current through a gold leaf in a magnetic field. He could see that the magnetic field was acting on the electrons themselves, as it altered the charge distribution inside the gold leaf, which he could measure with a galvanometer.

In this setup, there is a separation of charges across the breadth of the plate. One may imagine that instead of flowing through the whole plate uniformly, the current gets deflected to one side of the plate as soon as it enters, although it keeps moving forward. The potential difference created by this separation of charges is known as the Hall voltage. The Hall conductance is the current already flowing through the plate, divided by the Hall voltage. Remember Ohm’s law, which says V=IR? This implies that \frac{1}{R}=\frac{I}{V}. This \frac{1}{R} is the conductance. Of course, the difference is that this current is not caused by the Hall voltage. Hence, this formula cannot directly be obtained from Ohm’s law, but let’s shut our eyes to these details.

The direction of this voltage is not fixed, and depends on the charge of the charge carriers in the conductor. Hence, measuring the direction of this voltage is an easy way to determine the nature of charge carriers in a conductor. Some semiconductors owe their efficacy to having positively charged charge carriers, instead of the usual electrons.

The Quantum Hall Effect

In 1980, Klaus von Klitzing was studying the conductance of a two-dimensional electron gas at very low temperatures. Now remember that resistance = voltage/current (because conductance = current/voltage, as discussed previously). As the voltage increases, this formula says that the resistance should increase. And since the Hall voltage increases with the magnetic field strength, the resistance should increase as the magnetic field gets stronger. But what is the pattern of this increase? It is pretty surprising: surprising enough to win von Klitzing a Nobel prize for studying it.

Note: the resistance increases in small steps at lower magnetic field strengths, and in bigger steps at higher field strengths. However, the conductance, which is its reciprocal, decreases by the same amount at each step. In other words, the conductance is quantized. A quantity is referred to as quantized if it can be written as an integer multiple of some fundamental quantity, and increases or decreases in steps of that same size.

Why do we get this staircase pattern, and not, say, a straight line? If we consider the conductance = 1/resistance values, we see that all the successive values are integer multiples of a constant we find in nature, e^2/h, irrespective of the geometric properties or imperfections of the material. Here e is the electric charge of an electron, and h is Planck’s constant. Why does this happen?
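
The staircase values can be computed directly. A small sketch (not from the paper) that evaluates the conductance steps n \cdot e^2/h and the corresponding resistance plateaus, using the exact SI values of e and h:

```python
# Quantized Hall conductance: G_n = n * e^2 / h for integer n.
e = 1.602176634e-19  # elementary charge, coulombs (exact SI value)
h = 6.62607015e-34   # Planck's constant, joule-seconds (exact SI value)

G0 = e**2 / h   # the conductance quantum of the Hall staircase, in siemens
R_K = 1 / G0    # the von Klitzing constant, about 25812.8 ohms

for n in range(1, 4):
    print(f"step n={n}: G = {n * G0:.4e} S, R = {R_K / n:.1f} ohm")
```

The resistance plateau h/e^2 is so reproducible across materials that it is now used as the practical standard of electrical resistance.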

Robert Laughlin offered an explanation. Consider an electron gas that is cold enough, such that quantum coherence holds. This basically means that there’s not much energy in the system for all the particles to behave independently, and that all particles behave “similarly”. Hence, the behavior of the system can be described easily by a Hamiltonian dependent on a small number of variables (2 in this case). Laughlin imagined the electron gas to be a looped ribbon, with its opposite edges connected to different electron reservoirs.

Now imagine that there’s a current flowing in the ribbon along its surface in a circular motion. The presence of the magnetic field causes a separation of charges. Because the two edges are already connected to electron reservoirs (which can both accept and donate electrons easily), the separation of charges causes electrons to flow from one reservoir to another. If, say, the side connected to A becomes positive due to separation of charges and the side connected to B becomes negative, then electrons from A go into the ribbon, and electrons from the ribbon go into B. One may imagine a current flowing between the electron reservoirs, which is maintained because the magnetic field maintains the Hall voltage between the two ends of the ribbon. However, in order to increase this Hall current, the magnetic flux \Phi will have to be increased (which will increase the Hall voltage).

The Aharonov-Bohm principle says that the Hamiltonian describing the system is gauge invariant under (magnetic) flux changes of \Phi_0=hc/e, which is the elementary quantum of magnetic flux. The Hamiltonian of a system may be thought of as its “energy”. Hence, although the Hall voltage increases every time we increase the magnetic flux by one quantum, the “energy” of the system remains unchanged. This quantized increase in Hall voltage causes a quantized increase in resistance, which is what produces the step pattern. If the increase in resistance were not quantized, the graph would be a smooth curve, and not “stair-like”.

A small note about what it means for the Hamiltonian to be gauge invariant under flux changes of \Phi_0: it is not that the Hamiltonian doesn’t change with time. By the time the flux increase happens, all the eigenstates of the Hamiltonian will have changed. What gauge invariance means here is that if the measured eigenstate of the Hamiltonian at time t_0 is \psi(t_0), and we complete increasing the magnetic flux by \Phi_0 at time t_1 and also measure the Hamiltonian at that time, then the eigenstate we get is going to be \psi(t_1), and not some other eigenstate \psi^*(t_1). For simplicity, we will just assume that \psi(t_0)=\psi(t_1) in this article, and that the Hamiltonian doesn’t change with time.

Another important note is the following: the charge transfer between the two reservoirs is quantized. In other words, when I increase the magnetic flux by one quantum of flux, the Hall current increases by a certain amount. When I increase the magnetic flux by the same amount again, the same increase in Hall current is seen. There is no good reason for this in quantum mechanics. This is because although the increase of the magnetic flux by \Phi_0 brings back the system to its original Hamiltonian eigenstate, in quantum mechanics there is no guarantee that the same initial conditions will lead to the same outcome. Quantum states randomly collapse into any one of their eigenstates, and don’t have to collapse to the same eigenstate each time. This fact, of the same increase in Hall current, needs some topology to be explained. But before that, we explore some concepts from geometry- namely, curvature.

Adiabatic curvature

“Adiabatic” evolution occurs when the external conditions of the system change gradually, allowing the system to adapt its configuration.

What is curvature? Let us explain this with an example. Take the earth. Starting at the North Pole, build a small road that comes back to the North Pole. By small, we mean that the road should not go all the way around the earth. If you drive your car along this road, the initial orientation of the car will be different from the final orientation of the car (when you arrive). Curvature is the limit of this difference between initial and final orientations of the car, divided by the area enclosed by the road, as the loop tends to a point. Note that this would never happen on flat land which was “not curved”: in other words, the initial and final orientations of the car would be the same.
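
As a worked instance of this definition (standard differential geometry, using a perfect sphere of radius R in place of the actual Earth): the curvature is the limiting ratio of the car's net rotation \Delta\theta to the area A enclosed by the road, and on a sphere it comes out constant,

```latex
K \;=\; \lim_{A \to 0} \frac{\Delta\theta}{A},
\qquad
K_{\text{sphere}} \;=\; \frac{1}{R^2}
\;\;\Longrightarrow\;\;
\Delta\theta \;\approx\; \frac{A}{R^2}.
```

So a short round trip near the North Pole enclosing area A turns the car by roughly A/R^2 radians, while on flat ground (K = 0) the car comes back with its original orientation.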

But how does curvature fit into quantum mechanics? Consider a curved surface parametrized by two angles, \theta and \Phi. Each point of this curved surface corresponds to an eigenstate of the Hamiltonian, which can be written as H(\theta,\Phi). Essentially, this is a configuration space of the Hamiltonian of the system. Let us now build a quantum (ground) state, say e^{i\alpha}|\psi(\theta,\Phi)\rangle. If we build a small loop from a point to itself, this ground state, as it is “driven” around the loop, changes its phase from \alpha to something else. The curvature of this configuration space, measured using the limiting process described above, turns out to be 2\,\mathrm{Im}\,\langle \partial_{\theta}\psi|\partial_{\Phi}\psi\rangle.

Why do we care about bringing things back in a loop, as far as Hamiltonians are concerned? This is because at the end of the process of increasing the magnetic flux by one quantum, the initial and final Hamiltonian has to be the “same” in our case. We have to always come back in a loop to our initial Hamiltonian.

Hall Conductance as Curvature

What are \theta and \Phi here? \Phi is associated with the Hall voltage, and \theta can be thought of as a function of \Phi. More precisely, \theta is defined such that I=c\partial_{\theta}H(\Phi,\theta). The Hamiltonian is periodic with respect to \Phi, i.e. with respect to the Hall voltage. If the period of H with respect to \Phi is P, then the process of increasing the magnetic flux by one quantum can be thought of as changing the value of \Phi to \Phi+P, which leaves the Hamiltonian unchanged. Geometrically, if the configuration space can be thought of as a curved surface, this process can be thought of as moving along a circle and returning to the point we started out from.

For an adiabatic process where \Phi is changed very slowly, Schrödinger’s equation gives us \langle \psi|I|\psi\rangle=\hbar cK\dot{\Phi}. Here \langle \psi|I|\psi\rangle can be thought of as the expected value of the current (the weighted average of all possible values of the current), and \dot{\Phi}, the rate of change of the flux, is proportional to the Hall voltage by Faraday’s law. Hence, going by the formula discussed above, K is (some constant multiplied by) the conductance. In this way, the Hall conductance can be interpreted as the curvature of the configuration space of the Hamiltonian.

Chern numbers

What is this “curved surface” that forms the configuration space of the Hamiltonian? As it is periodic in both its parameters \Phi,\theta, we can think of it as a torus. Each point of this configuration space corresponds to an eigenstate of the Hamiltonian.

Let the torus be the configuration space of the eigenstates of the Hamiltonian, and consider the integral \int K \, dA. The difference between this integral evaluated on a patch of the torus and on its complement is an integer multiple of 2\pi. This integer is known as the Chern number. You may have seen the famous Gauss-Bonnet Theorem, which states that \frac{1}{2\pi} \int K \, dA=2(1-g). The right-hand side, 2(1-g), is the Chern number in the special case that the loop around the patch is contractible (can be continuously shrunk to a point).
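
As a sanity check, the Gauss-Bonnet formula just quoted can be applied to the whole torus, which has genus g = 1:

```latex
\frac{1}{2\pi}\int_{T^2} K \, dA \;=\; 2(1-g) \;=\; 2(1-1) \;=\; 0.
```

The total curvature of a torus vanishes: the positive curvature on its outer side exactly cancels the negative curvature near the hole. Nontrivial Chern numbers therefore come from patches and loops, not from the surface as a whole.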

The Chern number associated with this construction does not change when we deform the patch slightly. However, it will change if the boundary of the patch changes its topological properties (i.e. if its homotopy class changes). One way it could do that is if the loop circled the “doughnut hole” of the torus completely.

Let’s bring it all together

Now that we have all that seemingly useless background, let’s bring it all together. Let’s start with the torus, representing the eigenstates of the Hamiltonian. Every time we increase the magnetic flux by one quantum, it’s like we’re going one full circle around the doughnut, and returning to our original Hamiltonian eigenstate.

This causes the Chern number to change by an amount, say a. Why does the Chern number change? Because we’re adding the integral \frac{1}{2\pi}\int K \, dA to the previous integral. When we again increase the magnetic flux by one quantum, we go in another full circle around the torus. This causes the Chern number to change by the same amount a. Both the charge transferred between the reservoirs and the change in conductance are exactly this Chern number in fundamental units (where constants like h and c are taken to be 1). Hence, each time we increase the magnetic flux by one quantum, we see the same increase in Hall current, and the same decrease in conductance.

Why does topology have anything to do with quantum mechanics? We might think of this as an analogy that we constructed, which just happened to work absurdly well. We interpreted conductance in terms of curvature. Hence, the change in conductance can be thought of as an integral of curvature over the torus. These integrals satisfy strict mathematical relations, via the Chern-Gauss-Bonnet Theorem. Hence, we can use this theorem to determine exactly how much the conductance changes every time we move along a certain loop on the torus (every time we increase the magnetic flux by one magnetic flux quantum).

Thanks for reading!


  1. A Topological Look at the Quantum Hall effect
  2. Adiabatic Theorem

Alcohol Use and Burden: Lancet

Alcohol consumption is a loaded topic. Most people know that it is probably not very good for your health, but how bad can it be? We know that some people die due to alcohol consumption, but that probably happens only to drunkards in old age. Moderation is the key. Moreover, there’s been a notion in popular culture for years, that a drink or two might be actually good for you! Everybody does it. Millions of people cannot be wrong, of course. These claims will be explored in the article below.

The paper that I will be studying today is Alcohol Use and Burden for 195 countries and territories, 1990-2016: a systematic analysis of the Global Burden of Disease Study 2016. This paper, published in the prestigious journal Lancet, is one of the most highly cited scientific papers in recent years, and disproves a lot of myths around alcohol consumption.


Most researchers concur that alcohol is a leading cause for much illness and disease. However, some evidence suggests that a low intake of alcohol has a protective effect on some conditions like ischemic heart disease (the most common form of heart disease) and diabetes. The method of data collection for this “evidence” was self-reporting by consumers of alcohol, which has been found to be unreliable. Some recent studies have instead focused on alcohol sales numbers, prevalence of drinking habits, etc.

This study improves on the prevalent data collection process in the following ways: 694 individual and population-level data sources have been consulted. Alcohol consumption levels have been adjusted to include consumption by tourists (which probably accounts for a fairly large percentage of alcohol consumption in touristy places). New methods have been devised to better approximate illicit alcohol sales (which are not recorded in stores). 23 health-related outcomes of alcohol consumption, like cirrhosis, tuberculosis, cancer, etc., have been identified and individually studied. Also, relative risk curves have been used to determine the ideal level of alcohol consumption for maximal health benefits. What is a relative risk curve, you ask? If not drinking alcohol can be thought to have a “risk” factor of 1 for the advent of tuberculosis, then drinking 4 glasses of beer every day can be thought to have a relative risk factor of 2.3. Hence, drinking 4 glasses of beer every day might make you 2.3 times more likely to contract tuberculosis.
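
A relative risk is just a ratio of incidences. A minimal sketch, with made-up numbers chosen only to reproduce the 2.3 figure in the example above (they are not from the Lancet paper):

```python
def relative_risk(incidence_exposed, incidence_unexposed):
    """Risk in the exposed group divided by risk in the reference category."""
    return incidence_exposed / incidence_unexposed

# Hypothetical numbers mirroring the tuberculosis example: 23 cases per
# 10,000 among people drinking 4 beers a day vs 10 per 10,000 among
# non-drinkers gives a relative risk of 2.3.
rr = relative_risk(23 / 10_000, 10 / 10_000)
print(round(rr, 2))  # 2.3
```

A relative risk curve then just plots this ratio as a function of the daily dose of alcohol, with the non-drinking reference pinned at 1.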

How do you study the effects of alcohol? By comparing a group of people who drink with a group of people who do not. This group which does not drink, against which other categories of people are compared, is called the “reference category” in this context. Previous studies didn’t care about the composition of this reference category. Hence, their reference category included people who had drunk heavily in the past (and hence had less than ideal health conditions). On comparing this reference category with the category of drinkers, they didn’t find much difference, and hence concluded that drinking doesn’t change one’s health that much. This study removes this bias by carefully controlling the membership of the reference category.

Also, previous studies simply assumed that the amount of drinking that minimizes harm is zero, and then determined population statistics and error margins from this. This should not just be an assumption; it needs to be demonstrated. Moreover, such an assumption reduces the accuracy of the statistical study. This paper does not make it, and hence better captures the effects of alcohol on health outcomes, although the confidence intervals of its conclusions are possibly wider as a result.

The data used in this study has been collected from 195 locations across the world from 1990 to 2016, for both sexes, and for 5-year age groups between the ages of 15 and 95 years or older. Moreover, the risk of alcohol consumption for 23 diseases and health disorders has also been studied.


Alcohol consumption is measured in grams of pure ethanol consumed. Hence, a couple of light beers might be equivalent to a whiskey shot. For estimating consumption of alcohol, 121,029 data points from 694 sources were pooled. For estimating disease risk, risk estimates from 592 studies were combined. A total of 28 million individuals were involved in these 592 studies.

The total amount of alcohol in a community is approximated from the total amount of alcohol sold in stores. The relative numbers of drinkers and abstainers are approximated through surveys, as is the amount of alcohol consumed at an individual level across age groups. This data is corrected for alcohol consumed by tourists. Also, in order to get a conservative estimate of the amount of illicit alcohol sold (outside of stores), the authors collate estimates from multiple studies on the sale of illicit alcohol, and construct a uniform probability distribution based on them, so that each study has an equal probability of being correct. They then construct a new probability distribution, whose range is 0 to the mean of the above distribution, draw 1,000 samples from this new distribution, and average them. This average is almost certainly lower than the mean of the first distribution. Hence, they get a conservative estimate of the amount of illicit alcohol sold.
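A minimal sketch of this conservative sampling procedure, with invented study estimates standing in for the real pooled values:

```python
import random

random.seed(0)

# Hypothetical estimates (as fractions of recorded sales) of unrecorded
# alcohol from several studies; the real paper pools actual study values.
study_estimates = [0.10, 0.25, 0.18, 0.30]

# Step 1: treat each study as equally likely (uniform over the estimates)
mean_estimate = sum(study_estimates) / len(study_estimates)

# Step 2: draw 1000 samples from Uniform(0, mean) and average them
samples = [random.uniform(0, mean_estimate) for _ in range(1000)]
conservative = sum(samples) / len(samples)

# The mean of Uniform(0, m) is m/2, so this lands at roughly half the
# pooled estimate -- deliberately conservative.
print(mean_estimate, conservative)
```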


In 2016, 32.5% of the world population were drinkers (25% of females and 39% of males). Hence, 2.4 billion people in the world drank alcohol. The mean amount of alcohol consumed was 0.73 standard drinks daily for females and 1.7 standard drinks for males. The actual consumption varied with the Socio-Demographic Index, or SDI, which can be thought of as similar to an index of economic and social prosperity.

On the y-axis of both these graphs, disability-adjusted life-years, or DALYs, have been plotted. These count the number of years spent disabled or otherwise incapacitated, per 100,000 person-years, due to alcohol consumption. Note that these are only global averages; the data varies considerably with changing SDIs. For almost all of the 23 diseases and disorders, alcohol consumption only increases the burden of disability in both sexes. The only exceptions: the prevalence of ischaemic heart disease in older age groups reduces slightly in both sexes, and diabetes in females also improves slightly with alcohol consumption.

In 2016, 2.8 million deaths were attributed to alcohol consumption. This is 6.8% of all deaths amongst males, and 2.2% of all deaths amongst females. Also, alcohol was the cause of 6% of all DALYs amongst males (years spent incapacitated), and 1.6% of all DALYs amongst females. These percentages were even higher for groups of people between 15-49 years. The three leading causes of death that alcohol contributed to, for both females and males in this age group, were tuberculosis (1.4%), road injuries (1.2%) and self-harm (1.1%). For the age groups 50-80, the most prevalent disease in high SDI regions that was contributed to by alcohol consumption was cancer (27.1%). In poorer societies, tuberculosis was the most prevalent such disease. In age groups beyond 80, hemorrhage and hypertensive heart disease became much more prevalent.

If the risk of alcohol consumption with respect to these 23 diseases could be “averaged out” and then plotted, the graph would look something like this:


Past studies had shown that although the risks associated with alcohol consumption increased with intake, consuming a small amount of alcohol daily was actually good for the body. Subsequent studies, conducted using Mendelian randomization and multivariate meta-analysis, showed this to be a flawed conclusion. This paper reaffirms the finding that no amount of alcohol is good for the body: for optimum health, the amount of alcohol we should be drinking daily is zero. However, the graph shown above suggests that drinking a maximum of one drink every day is not all that bad, although still worse than not drinking at all.

Alcohol is a leading risk factor for disease worldwide, and accounts for nearly 10% of deaths worldwide. The authors suggest that governments should make alcohol more expensive to buy and restrict its availability in stores. An extreme example of unchecked alcohol consumption is Russia, where in the 1980s, 75% of male deaths were attributed to excessive alcohol consumption. The authors concede that they lack data on several things, like the fraction of road accidents caused by alcohol outside the US, and interpersonal harm caused by alcohol. Also, they don’t have data on drinking below the age of 15. However, they conclude that having this data could only increase the estimate of the health risk associated with alcohol consumption, and not reduce it.


  1. Alcohol Use and Burden for 195 countries and territories, 1990-2016: a systematic analysis of the Global Burden of Disease Study 2016.
  2. Counterfactuals in statistics

So how does Google work, exactly?

How do we decide which movie to watch on a given weekend? We could google “latest movies”, and then take a random pick. However, that throws up both good and bad results. How do we pick a “good movie” from this list? We go to Rottentomatoes, IMDB or the “Reception” section of Wikipedia, and see what critics have to say about those movies. We generally go for the movies about which the critics we value most have positive things to say. Google works on the same principle. How does it know which websites are most relevant to your query? It finds the website that most other “credible” websites (containing your query words) cite, or link to.

The paper that I’m blogging about today is The Anatomy of a Large-Scale Hypertextual Web Search Engine. Seeing as it gave rise to Google, perhaps the greatest political, social and economic force of our times, it is one of the most important papers that I’ve had the opportunity to read. And the fact that it is based on one of the most intuitive ideas known to humanity is mind-boggling.


Google did not start the search engine revolution, but it did revolutionize it. How? Earlier search engines, like Yahoo and AltaVista, did not assume that users already knew what they were looking for and just needed a big search bar. They thought most users just wanted to “browse”. Hence, if you remember the cluttered Yahoo and MSN homepages, they contained separate sections for News, Shopping, Sports, etc. Moreover, their search algorithms relied on keyword matching, which turned up way too many low quality results. If you typed “Sachin”, they would turn up the page that contained the word “Sachin” the most number of times, and not the ESPN stats page for Sachin Tendulkar, which was probably what you were looking for.

Google, on the other hand, was blasphemously bland. It had just one big search bar that you could use only if you already knew what you were looking for. This flew in the face of conventional wisdom. However, they improved their search algorithms and speed of information retrieval so much that they soon became the market leader in this segment.

The authors, Sergey Brin and Lawrence Page, claim in this paper that Google, which is a misspelling of googol (10^{100}), is capable of fast crawling, indexing the massive landscape of the internet effectively, and also dealing with the millions of queries that search engines deal with daily.

Improved Search Quality and PageRank

Search engines by that time were already capable of crawling (downloading) and indexing vast swathes of the internet, and presenting these pages to users when searched for. However, they had not devised an effective system to rank these pages by relevance. Although millions of results could be returned for a query, users typically would only check out the first few, and then abandon their search. It was hence extremely important that these first few results turned out to be the “best” ones. Google exploited hypertextual information to find an effective algorithm to rank pages based on relevance. What does this mean?

Let me explain this with an analogy. Suppose you need to find out if person A is honest and trustworthy. How do you determine that? You might ask them if they think they’re trustworthy. However, such self introspection is likely to be biased. Similarly, analyzing the contents of a page, like a page containing the word Sachin the most number of times, might not be the best test of its relevance. On the other hand, if you ask some people who you respect and know to be honest and trustworthy about person A, and all of them have a high opinion of this person, then you have a strong reason to trust person A. This is how relationships and ties are formed in the real world anyway, especially in high risk situations like marriage, business deals, etc. Google does pretty much the same thing.

Google ranks pages based on how many other pages cite, or link to, them. The actual contents of a page are not as important. If a search for “Einstein” returns 10 million pages, and 8 million of them link to a certain article, that article is likely to be an influential and important article on Albert Einstein. However, there’s a further modification needed to make the search algorithm even better, which I can again explain with a bad analogy. When you ask for opinions of others on person A, there might be persons who always speak well of others. Despite this charming personal trait, their opinion must be discounted in this particular situation. However, if you know persons who don’t generally throw around compliments willy-nilly, and they have overwhelmingly positive things to say about person A, then their opinion should hold much more weight.

Similarly, Google gives less weight to links from pages that contain a lot of hyperlinks to various websites, and more to links from pages that do not. If there is a page on Einstein that does not contain a lot of links, and it cites only one page for all its information, that one page is likely to be extremely important. Of course there will be exceptions to this rule. However, statistically, such a search algorithm turns up good results most times.

If pages T_1,\dots,T_n point to (i.e. cite) a page A, and PR(T_i) is the PageRank of page T_i while C(T_i) is the number of links going out of T_i, then

PR(A)= (1-d) +d(\frac{PR(T_1)}{C(T_1)}+\dots+\frac{PR(T_n)}{C(T_n)}).

Here d is a number between 0 and 1, called the damping factor. This factor models user fatigue: the PageRank of a page is supposed to reflect the probability that a user reaches that page after clicking on a bunch of hyperlinks in other pages. If d=0.85, for example, then there is an 85% chance that the user will click on another link, and a 15% chance that the user will quit on that page.

Note that PageRanks are calculated iteratively: we start with an initial guess for every page’s rank, apply the formula above to all pages, and repeat until the values settle. Hence, if page A links to page B, the rank of page B depends on the rank of page A, which in turn depends on the pages linking to A.
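The iterative calculation can be sketched as follows, using the formula above on an invented four-page link graph:

```python
# A minimal PageRank iteration following the formula above. The tiny
# four-page link graph is invented for illustration.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

d = 0.85
pr = {page: 1.0 for page in links}  # initial guess for every page

for _ in range(50):  # iterate until the ranks settle
    new_pr = {}
    for page in links:
        incoming = [p for p in links if page in links[p]]
        new_pr[page] = (1 - d) + d * sum(pr[p] / len(links[p]) for p in incoming)
    pr = new_pr

# C is linked to by three pages, so it should rank highest; D, which
# nothing links to, gets the baseline rank (1 - d).
print(sorted(pr, key=pr.get, reverse=True))
```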

Anchor Text

When you search for “Einstein”, say the most highly ranked page that you get is the Wikipedia entry on him. It says that Einstein was a physicist, where the word “physicist” is a hyperlink to the Wikipedia page for physicists. Now when you search for “physicist” on Google, you’d want to get the Wikipedia page on “Physicist”, and not “Einstein”. Hence, the mention of the word “Physicist” on Einstein’s page should contribute to the ranking of the “Physicist” page, and not the “Einstein” page. If you look at the formula for PageRank given above, you’ll see that this is exactly what happens: hyperlinks always contribute to the page they lead to, and not the page they lead away from. This is also how we can search for images and videos on Google: there are hyperlinks with descriptive text on some pages, which lead to those images.

Google Architecture Overview

This is perhaps the most technical part of the blogpost. Most of Google is implemented in C or C++. What Google does is that it downloads as much of the internet that it can access, stores each webpage, and then determines how many times each word occurs in each webpage. It catalogues all this information, ranks the webpages, and stores them. So when you search for “Einstein”, it finds all webpages with the word “Einstein” in its catalogue, implements the PageRank algorithm to rank the webpages, and gives you the results in order. Now for a more technical description:

The web crawling, or the downloading of web pages, is done by several distributed crawlers. There is a URLServer that sends lists of URLs to these crawlers to download. These webpages are then sent to a storeserver, which stores them in a repository. Each webpage has a docID, which is stored along with the webpage. This docID is useful because URLs can be of arbitrary length. However, docIDs are of a fixed length. Hence, they’re useful for cataloguing.

Now comes the indexer. The indexer takes each webpage and stores details of the words in that webpage. Say it takes a webpage on “Trump”. It counts the occurrences of each word in this webpage, which is known as recording hits, and creates a “forward index”, which looks like “docID for Trump: president, orange, twitter rage, yuge,…”. This information is then stored in “barrels”, which are boxes corresponding to a set of words, and store details of all webpages that contain those words. For example, this page on Trump is stored in the barrel corresponding to the words “Trump, Obama, Lincoln”, etc. It is also stored in the barrel corresponding to the set of words “orange, twitter, yellow, green,…”. What a barrel does is create an “inverted index”, which means that for each word, it has a list of docIDs of webpages containing it. Also, each word is allotted a wordID. An example of an entry in this inverted index would be “wordID for orange: docID for Trump, docID for ice cream cones,…”, etc.
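A toy version of the forward and inverted indexes might look like this (the pages and docIDs are invented; the real barrels also store hit details like position and font):

```python
# Toy forward and inverted indexes built from a few "webpages".
pages = {
    1: "einstein was a physicist",
    2: "a physicist studies physics",
    3: "einstein played the violin",
}

# forward index: docID -> list of words on that page
forward_index = {doc_id: text.split() for doc_id, text in pages.items()}

# inverted index: word -> sorted list of docIDs containing that word
inverted_index = {}
for doc_id, words in forward_index.items():
    for word in set(words):
        inverted_index.setdefault(word, []).append(doc_id)
for doc_ids in inverted_index.values():
    doc_ids.sort()

print(inverted_index["einstein"])   # docIDs of pages containing "einstein"
print(inverted_index["physicist"])  # docIDs of pages containing "physicist"
```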

The indexer also notes down where in the document each word occurs, font size, capitalization, etc. Also the indexer notes down the URLs on that page- name of URL, and the docID of the page it leads to. This information is stored in an “anchors file”.

Major Data Structures

Almost all of the data is stored in BigFiles, which span multiple file systems. There are also several repositories. The raw HTML repository stores the compressed HTML of each page, with a small header. The document index, on the other hand, keeps information about each document: current status, various statistics, etc. A hit records an occurrence of a word in a document, and is stored using hand-optimized compact encoding that uses two bytes per hit. Hit lists are combined with wordIDs in the forward index and with docIDs in the inverted index.

In the inverted index, for each wordID there is a pointer to the barrel corresponding to that wordID, to the docIDs of the webpages that contain that word, and to the hit lists telling us the details of the occurrences of that word in those webpages (font, number of occurrences, etc). This list is called a doclist. But in what order should these items be stored in a doclist? Should we just sort the docIDs numerically? Or rank them by the number of occurrences of the word? We basically order the docIDs by the number of occurrences, but we introduce a further modification to the barrels: there are now two sets of barrels. One set concerns itself only with titles and hyperlinks, and the other deals with the main text. When you search for “Einstein”, the “title and hyperlink” barrels corresponding to “Einstein” are searched first, and the relevant ranked pages yielded. Only when there are not enough results this way are the barrels corresponding to the main text upturned to yield relevant webpages. These webpages are then ranked using the PageRank algorithm, and shown on the user’s device.
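The two-tier lookup described above can be sketched like this (the barrel contents and docIDs are invented):

```python
# Sketch of the two-tier lookup: consult the title/anchor barrels first,
# and fall back to the full-text barrels only if results are scarce.
title_barrels = {"einstein": [42]}
fulltext_barrels = {"einstein": [42, 7, 19], "relativity": [42, 88]}

def search(word, min_results=2):
    results = list(title_barrels.get(word, []))
    if len(results) < min_results:  # not enough title/anchor hits
        for doc_id in fulltext_barrels.get(word, []):
            if doc_id not in results:
                results.append(doc_id)
    return results  # these would then be ranked with PageRank

print(search("einstein"))  # title hit first, then full-text hits
```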


Google claims to be much better than others at searching for relevant webpages. For instance, it says that most existing search engines cannot even find their own webpages effectively when queried: if you type “Yahoo” into the Yahoo search bar, it might not even bring up Yahoo’s own homepage, but some other random webpage containing the word Yahoo the most number of times. Google, on the other hand, is much better at this sort of thing: when you type “Bill Clinton”, the most relevant results are actually presidential articles, not pages that merely mention the name.

As far as crawling the internet and indexing the huge mass of data is concerned, Google claims that downloading 26 million webpages took them only 9 days, and indexing and storing them just a couple more days. The indexer ran at 54 pages per second, and the whole sorting process took only a day.

Although its processes and algorithms could be optimized, Google was fast and scalable. And it was here to stay.

It indeed was.


  1. The Anatomy of a Large-Scale Hypertextual Web Search Engine
  2. Search engine indexing
  3. Inverted index

Gravitational Wave Detection with LIGO

The paper that I’ll be reviewing today is the famous Observation of Gravitational Waves from a Binary Black Hole Merger by B.P. Abott et al. This paper records the detection of gravitational waves for the first time in human history, and was awarded the Nobel Prize in 2017.


What are gravitational waves, and how are they caused? Gravitational waves are caused by the change with time of the quadrupole moment of a system of masses. What does that mean? Let us take an example. Imagine the rotation of the earth. As it rotates, its mass doesn’t change (the monopole moment), and neither does the position of its center of mass (the dipole moment). However, during rotation, the poles get flattened a little bit. When this flattening happens, the quadrupole moment changes. One way of thinking of it is to say that the “shape” changes. If one were calculating the gravitational potential energy of the earth at a distant point in space, it would change a little because of the flattening of the earth near the poles: it will have reduced. The energy that is lost by the system is radiated out in the form of gravitational waves.

But what do gravitational waves do? Simplistically speaking, they stretch and compress space itself. One may imagine the universe to be a rubber sheet; gravitational waves then elongate and compress this sheet with time. This is exactly the property that LIGO exploited to detect gravitational waves. A more mathematical description of gravitational waves can be found here. In this mathematical explanation, Einstein’s Field Equations are “simplified” to make them linear, and then one uses the fact that however we deform the universe, although metrics and other observables might change, the physical laws will always remain the same (the justification for gauge transformations). With different kinds of such deformations, we get different solutions for how the “shape and size” (metric) of the universe changes with time, and all of these are valid descriptions of the universe. One such description is that gravitational waves are emitted after certain phenomena. Although these are solutions only for “simple” and “approximate” Einstein’s Field Equations, it has been rigorously proven that gravitational waves are indeed solutions of Einstein’s Field Equations in the complicated, non-linear case.

An important caveat is that this is not the first time that gravitational waves have been observed. The detection of the binary pulsar system PSR B1913+16 proved that gravitational waves do indeed exist. However, the presence of gravitational waves was inferred from other observations, and not directly detected. This paper presents proof for the first actual detection of gravitational waves, and also the first observation of the merger of two black holes in a binary black hole system.


The merger of two black holes, far away in space and time, was observed on September 14, 2015. The initial detection was made by low-latency searches, meaning that some preliminary gravitational wave analysis was done quickly (within 3 minutes) so that scientists could figure out the approximate location of the merger of the black holes, and turn their telescopes to the portion of the sky the signal was coming from. If astronomical observations coincided with this wave detection, then they would have proof that the wave was not a technical glitch, but the result of an astronomical event.

Two observatories observed this event: one in Hanford, WA and another in Livingston, LA. The gravitational waves reached Hanford 10 ms before Livingston, and this discrepancy in time of observation helped scientists locate the binary black hole system in the sky. The observed signals had a signal-to-noise ratio of 24.

Some features of the observed signal were these: over 0.2 seconds, both the amplitude and the frequency of the wave increased over 8 cycles, the frequency rising from 35 to 150 Hz, where the amplitude reached its maximum. Let us focus on this maximum, where the frequency is 150 Hz. Two bodies orbiting their common center of mass would have to be orbiting 75 times every second to produce this signal. After noting the frequency and the rate of change of frequency, the scientists concluded that two compact masses of a certain total mass had to be orbiting each other at a distance of approximately 350 km for these gravitational waves to have been produced. They concluded that these two point masses could not be neutron stars (too light) or a neutron star-black hole binary (it would have coalesced at a lower orbital frequency). Hence, these had to be two black holes.
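We can roughly check the ~350 km figure with Kepler’s third law for a circular orbit, taking a total mass of about 70 solar masses (the order of magnitude the paper infers from the signal) and noting that a 150 Hz gravitational wave corresponds to a 75 Hz orbit:

```python
import math

# Back-of-envelope check of the ~350 km separation via Kepler's third law.
G = 6.674e-11          # gravitational constant, m^3 kg^-1 s^-2
M_sun = 1.989e30       # solar mass, kg
M_total = 70 * M_sun   # rough total mass of the binary (assumption)
f_orb = 75.0           # orbital frequency in Hz (half the GW frequency)
omega = 2 * math.pi * f_orb

# Kepler: omega^2 * a^3 = G * M_total, where a is the separation
a = (G * M_total / omega ** 2) ** (1 / 3)
print(a / 1000)  # separation in km, roughly 350
```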


The detection of gravitational waves happened at two observatories. This was so that random noise from the environment could be eliminated (most sources of noise would be local, and hence would be detected by one observatory but not the other), and the location of the astronomical event in the sky could be determined (mainly by the time difference of observation between the two observatories). The instrument that was used was a laser interferometer, and its working principle is explained beautifully here. Simplistically speaking, an interferometer has two arms that are perpendicular to each other, with light beams inside both. They are arranged in such a way that these light beams interfere destructively at the detector. However, gravitational waves stretch one arm and compress the other (unless the arms are at equal angles to the direction of propagation of the wave, which is extremely unlikely for a randomly arriving gravitational wave). This causes the light to no longer interfere destructively, and we can hence detect gravitational waves by detecting visible interference patterns of laser beams. The instrument uses a resonant optical cavity to increase the intensity of the laser: we end up getting 100 kW of laser light from a 700 W source. Hence, even a small change in the interference pattern of the light can be detected easily.
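To get a feel for how tiny the effect is: a strain h changes an arm of length L by roughly h \times L (up to a factor of 2). Taking h \sim 10^{-21}, the order of the peak strain reported, and the 4 km arms:

```python
# How small is the stretch? delta_L ~ strain * arm length.
h = 1.0e-21   # dimensionless strain, order of the reported peak
L = 4000.0    # arm length in metres

delta_L = h * L
print(delta_L)  # ~4e-18 m, a small fraction of a proton's diameter
```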

The mirrors used in the experiment are also mounted on quadruple-pendulum systems, so that these mirrors remain almost stationary in the event of earthquakes or other sources of external noise. To reduce Rayleigh scattering of photons, ultrahigh vacuum is created, and the pressure inside the arms is reduced to 10^{-6} Pa. Other instruments like seismometers, etc were also mounted near both observatories so that if localized causes did create a disturbance in the instrument, they would be detected by those instruments also, and hence such detections could be discarded.

Methods of searching

The black hole merger event, codenamed GW150914, was verified by two different techniques. Because these two techniques use completely independent methods of analysis, the fact that both report the same event boosts the plausibility of this being an actual gravitational wave and not another random disturbance. All observed events are assigned a detection-statistic value by both methods, which measures their likelihood of being black hole merger events. GW150914 was given very high values by both methods. But how can we calibrate the instrument to ensure that we give high detection-statistic values only to actual black hole merger events? We measure the rate at which the instrument gives the same or higher detection-statistic values to events that are not black hole mergers. If this happens very rarely, say once in 5,000 years, then we can be reasonably sure that an event that has been given a high detection-statistic value is an actual black hole merger, and not a freakish once-in-5,000-years event.

The two methods of analysis used are:

  • Generic Transient Search– In this technique, there are no pre-constructed gravitational waves with which incoming signals should be compared. After observing a signal lying within a certain frequency and time range, we construct a gravitational wave form having those properties. Then this wave form and the incoming signal is compared. If the comparison is favorable, we can reasonably infer that the incoming signal is a gravitational wave. The detection-statistic value given to GW150914 was 20. Only once in 22,500 years will this method give such a high value to an event that is not the merger of two black holes producing gravitational waves.
  • Binary Coalescence Search– We have approximately 225,000 pre-constructed gravitational wave forms. The incoming signal is then compared with these wave forms, and the better the match, the higher the detection statistic value. The detection statistic value that this search gave to GW150914 was 23.6, and this method would erroneously give such a high value only once in 203,000 years.
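As a sketch of how such a false-alarm rate translates into a significance level, here is a standard Poisson-style conversion. The 1-in-203,000-years rate is from the paper; the 16-day coincident observation period is an assumption based on the analysis period the paper describes:

```python
from statistics import NormalDist

# Rough conversion of a false-alarm rate into a significance level.
rate_per_year = 1 / 203_000      # background events this loud, per year
observing_years = 16 / 365.25    # assumed coincident observation period

# expected number of background events at least this loud
expected = rate_per_year * observing_years
# for tiny rates, P(at least one such event) is approximately the
# expected count itself
p_value = expected

sigma = NormalDist().inv_cdf(1 - p_value)
print(sigma)  # around 5, consistent with the significance the paper quotes
```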


Some properties of the black hole merger event could be calculated from the observations:

No deviation from General Relativity was found with regard to mass and spin of black holes, and also the coefficients in the mathematical expression for the phase of the gravitational waves (expressed as a power series in f^{1/3}). Moreover, better lower bounds for the Compton wavelength of a graviton were obtained.

The authors conclude the paper with the assertion that there are probably a lot of binary black hole systems out there merging much more often than we can observe, and that predicting and analyzing the gravitational waves created by such events could help us decode the history of the universe.


  1. Observation of Gravitational Waves from a Binary Black Hole Merger
  2. Binary Black Hole
  3. Interferometers

Effective Altruism- May and June

I’m attaching screenshots below for May and June. I donated to soup kitchen relief in West Bengal, cyclone relief efforts in West Bengal, and the State College fund for Covid relief.

My relief efforts have been a little off the cuff, and not studied. Hence, this has been more like virtue signaling than actual labored virtue. I really do hope to rectify this in the future.

The Bitcoin Revolution

The paper that I’ll be discussing today is perhaps the most famous that has come out in recent years- Bitcoin: A Peer-to-Peer Electronic Cash System, by Satoshi Nakamoto. It was a surprisingly short paper packed with the most amazing ideas for creating an alternate banking system. This paper has gone a long way in upending the finance industry.


We all deal with digital money, for example when we make a payment through a credit card. There are some features of such transactions that are relevant to this discussion:

  1. We depend on trustworthy financial institutions to facilitate this transaction.
  2. Double spending, which is spending the same money multiple times, cannot always be prevented. Someone may buy a product using a credit card, use it, and then claim that his card was fraudulently used. In many such cases, because of lack of evidence, this money is refunded to the complainant, although he is clearly defrauding the system. Credit card companies consider this part and parcel of being in the business. Such transactions are called reversible transactions.
  3. These costs add up for credit card companies, which then increase transaction fees. Credit card companies approximate the amount of money they spend each year on hiring employees, advertising, and refunding money to fraudulent customers. These costs are ultimately passed on to buyers and sellers.

In order to remove the need for third parties to moderate transactions, Nakamoto says that what is needed is an electronic payment system based on cryptographic proof. When you agree to pay me for a service that I’ve rendered to you, the proof of this exchange of money will be recorded in the public domain. Hence, if I claim that you have not paid me, you can show me, and everyone else, the proof of payment.


How does it all work though? Everyone has a public key and a private key. A key here is like an ID number, such as your driver’s license number. When I make a payment to you, I sign the transaction with my private key, and anyone can verify that signature using my public key. The money is then transferred to you, and the electronic details of the transaction, which contain both our public keys as well as the time and amount, are hidden through hashing. What hashing does is take all the details of a transaction and convert them into an alphanumeric code like 000a234bfu344. This code can never be reversed back into the details of the payment. This ensures that both the payer and payee retain their privacy. The only thing that can be verified by everyone else is that the payment did indeed take place.
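A sketch of the one-way hashing step, using SHA-256 (the hash function Bitcoin actually uses) on invented transaction fields:

```python
import hashlib

# One-way hashing of transaction details. The fields are invented.
transaction = "payer:pubkey_A|payee:pubkey_B|amount:10|time:2009-01-03T18:15"

digest = hashlib.sha256(transaction.encode()).hexdigest()
print(digest)  # a fixed-length code; infeasible to reverse to the details

# Changing any detail changes the hash completely:
tampered = transaction.replace("amount:10", "amount:100")
print(hashlib.sha256(tampered.encode()).hexdigest() == digest)  # False
```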

Bitcoin Network

This system of payments works by employing a set of CPUs/nodes to process and store the information of transactions. But what purpose does it serve? Let us demonstrate that with an example:

If you have a physical $10 note, you can clearly only spend it once. However, with digital money being a set of codes instead of a physical entity, people may try to use the $10 they have in digital money multiple times. Can they get away with it? If I try to spend my $10 digital money multiple times, the details of all these multiple transactions go to all nodes that are connected to the Bitcoin network. They only accept the first such transaction, and all other transactions are considered void. Now you can’t spend that $10 anymore, and the only person who can now spend that money is the person you paid.
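The “accept only the first spend” rule can be sketched as follows (the coin and transaction IDs are invented):

```python
# First-seen rule: a node accepts only the earliest transaction spending
# a given coin, and rejects all later attempts to spend the same coin.
spent = {}  # coin_id -> txid of the accepted spend

def accept(coin_id, txid):
    if coin_id in spent:
        return False  # double spend: this coin is already spent
    spent[coin_id] = txid
    return True

print(accept("coin-10usd", "tx1"))  # True  -- first spend accepted
print(accept("coin-10usd", "tx2"))  # False -- second spend rejected
```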

Timestamp server

Details of each transaction are converted into an alphanumeric code, or a hash. Many such hashes form a block. When a block has been created by the network, a stamp of time is put on it, and every time the server sees that block in the “blockchain” (literally a chain of blocks), it puts a fresh timestamp on it. So, for instance, if a block is created at 12 noon, it will have a timestamp of 12 noon. If the timestamp server processes the blockchain again at 12:05 PM, that particular block will have two timestamps, 12 noon and 12:05 PM, and so on. The nodes keep making chains with all the blocks that they receive, and keep exchanging each other’s blockchains to ensure that they haven’t missed out on any details or transactions.


The recording of transactions, the hashing, the verification: all of this is done by CPUs connected to the Bitcoin network. But why should I connect my CPU to this network? This is what happens: every new transaction is broadcast to all nodes (or CPUs) in the network. Each node collects many such transactions into a block. Each node then tries to find a “proof-of-work” for this block, which essentially means finding a hash code for the block that begins with an appropriate number of zeroes. The first node to find such a code broadcasts it to all the other nodes, which verify that it is indeed a valid code for that block. They then add the block to the blockchain, and continue processing other transactions and blocks. Finding this hash code is probabilistic, and each node is equally likely to find it, assuming equal computational power.
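The proof-of-work search can be sketched directly. A difficulty of four leading zeroes is an arbitrary toy choice (real Bitcoin difficulty is vastly higher and expressed as a numeric target), but the asymmetry the text describes is visible: finding a nonce takes many trials, while verifying one takes a single hash.

```python
import hashlib

def proof_of_work(block_data, difficulty=4):
    """Try nonces until the block's SHA-256 hash starts with
    `difficulty` zeroes. Pure trial and error, so a node's chance
    of winning is proportional to its hashing power."""
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce, digest
        nonce += 1

def verify(block_data, nonce, difficulty=4):
    """Other nodes check a claimed proof with one hash."""
    digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

nonce, digest = proof_of_work("block with many transaction hashes")
```

At this toy difficulty, a node needs on the order of $16^4 \approx 65{,}000$ hash attempts on average, but any other node can verify the result instantly.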

But why should I connect my CPU to the network for this? Because whatever I spend in the form of processing time and electricity costs, I get back in the form of Bitcoin. This Bitcoin is essentially a form of money that I can now spend online.

Information Loss

What about information loss? It is entirely possible that some node does not receive the details of a certain transaction. However, we can be almost certain that at least one node will. And eventually, when the node that missed the transaction receives the block containing its details, it will rectify the omission.

Now here’s an important caveat: as we’ve seen before, all these blocks are arranged in a chain. However, when a block is broadcast by one node (the one that successfully came up with a valid hash for it) to all the other nodes, some nodes may miss it. Let us suppose node A misses one block, but successfully receives the next one. It creates a chain that is one block short. However, when it receives the blockchains that other nodes have created, it discards its own shorter chain and chooses the longer one. This process of choosing the longer chain removes the errors that creep in when certain nodes do not receive all the blocks. It works on the assumption that at least some nodes receive the information of all blocks within a small interval of time. This is likely, as there are thousands of nodes in the network.
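The longest-chain rule above reduces to a one-line comparison. The function name `resolve` and the block labels are made up for illustration; a real node would also validate every block in the received chain before adopting it.

```python
def resolve(local_chain, received_chain):
    """Longest-chain rule: a node that receives a longer chain from a
    peer discards its own shorter (possibly incomplete) copy."""
    if len(received_chain) > len(local_chain):
        return received_chain
    return local_chain

# Node A missed block "b2", so its chain is one block short:
chain_a = ["b1", "b3"]
chain_b = ["b1", "b2", "b3"]
resolved = resolve(chain_a, chain_b)   # node A adopts the fuller chain
```

After the exchange, node A's gap is healed without any central coordinator telling it what it missed.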

Another caveat is that nodes only accept information about transactions that are valid. Since every node receives the details of every transaction, if I try to spend my digital $10 twice, the nodes (which successfully receive this information) will not process my second transaction.

Reclaiming disk space

All the CPUs store all the information of each transaction on the network. That sure sounds like a lot of space. In order to clear space, blocks retain only the information proving that a transaction happened, and erase the details of old transactions. If a transaction can be thought of as a tree with roots (proof of transaction) and branches (details of transaction), the branches of the trees are cut off so that only the roots remain. Nakamoto calculates that a CPU would need only 4.2 MB of storage space per year to keep this pruned record of all transactions.
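The tree the author alludes to is a Merkle tree, which is what Bitcoin actually uses for this pruning. A sketch of computing its root follows; it uses single SHA-256 for brevity where Bitcoin uses double SHA-256, and the helper names are made up.

```python
import hashlib

def sha(text):
    return hashlib.sha256(text.encode()).hexdigest()

def merkle_root(tx_hashes):
    """Pair up transaction hashes and hash each pair together, level by
    level, until one root remains. Old transaction details (the branches)
    can then be discarded; the root alone proves they were included."""
    level = list(tx_hashes)
    while len(level) > 1:
        if len(level) % 2:          # odd count: duplicate the last hash
            level.append(level[-1])
        level = [sha(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

root = merkle_root([sha("tx1"), sha("tx2"), sha("tx3"), sha("tx4")])
```

Changing any transaction anywhere in the tree changes the root, so keeping just the root is enough to detect tampering with the pruned history.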

Payment Verification

How do I verify that a certain transaction took place at 12 noon on Saturday? I ask the nodes for the longest blockchain (which is the correct one with no missing information), access the relevant block by using the right block header, and then verify that some payment indeed took place at that time. I cannot find out the parties involved, or the amount. However, the fact that this payment took place is public information.

Beating the network

What does beating the network mean here? I could either spend my $10 multiple times and force the network to accept it, or go back into history and erase the record of me paying someone money so that I can keep it for myself. Because the other nodes will not accept multiple spends of the same money, the first option is infeasible. Hence, my only option is to go back into history, erase the record of my payment, and then build blocks faster than the other nodes so that I can make a longer chain than them.

Suppose there are 10,000 CPUs/nodes in the network, and I have managed to capture 100 of them. Suppose that the transaction I want to erase was recorded z blocks ago. I need to go back z blocks, which means I need to erase z-1 blocks, erase my transaction from the zth block, and then hash blocks faster than the “honest” nodes so as to form a longer chain. Remember that only the longest blockchain is considered the correct one, and is accepted by all parties involved. This act of erasing blocks may stop all exchange of blocks between the “honest” nodes and the “evil” nodes under my control. Hence, the only way to legitimize my fraud is to quickly build a longer blockchain, so that everyone will have to accept that my version of events is the correct one.

Coming up with the right hash or code for a block is probabilistic. Hence, the honest nodes, which are much greater in number, have a much higher chance of forming the longer blockchain. The author calculates that if the number of “honest” nodes h is greater than the number of “evil” nodes e, the probability of the evil nodes ever making a longer chain than the honest nodes is (\frac{e}{h})^z. Hence, if the number of honest nodes is much greater than the number of evil nodes, which is likely to be the case in the real world, then it is extremely unlikely that an evil person will be able to overpower the system.
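Plugging the earlier example into the formula (\frac{e}{h})^z shows just how lopsided this race is. The function name is made up; the expression is the one given above.

```python
def catch_up_probability(e, h, z):
    """Chance that e "evil" nodes ever out-build h honest nodes after
    starting z blocks behind: (e/h)^z, assuming e < h."""
    return (e / h) ** z

# With 100 captured nodes out of 10,000 (so 9,900 honest ones),
# erasing a transaction a mere 6 blocks deep is already hopeless:
p = catch_up_probability(100, 9900, 6)
```

Even this shallow rewrite succeeds with probability on the order of one in a trillion, and every additional block of depth multiplies the odds down by another factor of 99.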

Another incentive for people not to overpower the system is that if they spend the CPU power that they’re using for evil in hashing transactions and blocks instead, the monetary value of the bitcoin that they’ll earn will be far greater than any monetary reward that they might earn by trying to break the system.

Thanks for reading!


  1. Bitcoin: A Peer-to-Peer Electronic Cash System
  2. Hash