Evolution, Wars and Game theory

This post is on Evolutionary Game Theory, derived from the book Networks, Crowds and Markets: Reasoning about a highly connected world by David Easley and Jon Kleinberg. This seemingly esoteric field shines a light on an incredible number of things, including evolution, wars, and society itself. Although the subject is inherently mathematical, the mathematics only serves to distract from the ultimate message in a lot of examples. Hence, this article mostly avoids formulae and lengthy calculations. Almost all the examples constructed in this article are my own.


What does Game theory actually mean? Imagine that India and Pakistan are about to go to war. Both sides have declared the other to be responsible for recent provocations, and have justified to their citizens the use of overwhelming force to crush the enemy. Who should attack first? Should they attack at all?

If both of them are able to stop themselves from going to war, both of them will benefit. No soldiers or citizens killed, no blow to the economy, you name it. However, if one of them surprise attacks first, they will catch the other sleeping and quickly be able to gain territory. This will boost the economy, and improve public morale, ensuring that the current government surely wins the next election. Hence, the benefits of attacking first outweigh the benefits of withholding from attack in this situation.

If India and Pakistan are not in communication with each other, they will each expect that the other will want to attack first. Hence, if only to preemptively deny this advantage to the other, they will both attack, leading to entrenched warfare, massive destruction, with very little gain.

If one could quantify the benefits of going to war or abstaining from war in a table, it would look something like this:

India/PakistanNot attackAttack
Not attack0,0-10,10
Benefits and losses have been represented as numbers between -10 and 10

Hence, if India and Pakistan are not in communication with each other, out of sheer insecurity, they will attack each other, leading to a bad result in which both of them suffer. However, if they are in communication with each other, and show cooperation by agreeing not to attack one another, they will both avoid a whole lot of damage and suffering.

This is an example of the classic Prisoner’s Dilemma, in which insecurity amongst people leads to bad conclusions. If I take pains to help out a classmate who will never help me in return, then without communication I will just assume that I’m being a fool, and that I might as well be unhelpful to that person. This inevitably leads to an uncomfortable atmosphere in class for both. However, if both of us have a verbal or non-verbal understanding that we will always help the other out, then both of us can benefit. Life is generally not a zero-sum game, and humans generally do benefit from cooperation, as long as they play by the rules.

Game theory is the study of decision making in such circumstances.

Evolutionary Biology

Evolution, as we all know, is the process through which some genetic traits persist over time, while others are wiped out. If there are two types of owls on an island, identical in every way except for the fact that one type is blind and the other can see, then soon enough the blind owls will be wiped out from the island. This is because it will be in direct competition with the seeing owls for food and other resources. How did one type of owl come to be blind while the other did not? Most genetic differences arise because of random mutation. It is possible that owls were always blind, and due to a random mutation one of them could see. This mutated owl cold hunt better, and soon gave birth to a lot of kids who also had eyes. Gradually, these owls with the power of vision out-competed the “original” blind owls, established their dominance over the island, and with a couple of hundred years those blind owls were nowhere to be found.

Note that evolution is the composite of random mutation and natural selection. It is not the process through which one species naturally dominates another. For instance, when humans first reached Australia more than 40,000 years ago, they hunted many species into extinction. This is not explained through evolution. Evolution only occurs gradually between members of the same species that are distinct only in small ways, their differences caused by random mutation.

Evolutionary Game Theory

How does game theory come into the picture though? Let us suppose that in a large community of say a thousand pigs, ten pigs have three legs instead of four due to random mutation. What will happen? The other pigs will out-compete these mutated pigs for food, shelter and mates. These mutated pigs will probably starve a lot of the time, not have many mates, and hence much fewer children than the rest. The fraction of three-legged pigs will slowly decrease generation upon generation, until it is negligible (or even zero). A genetic trait is called evolutionarily stable if a large enough fraction of the population with that trait can successfully dominate all other mutations, provided that the fraction of such mutants is small enough. Having four legs seems to be an evolutionarily stable trait. Of course we haven’t proved it yet. What if even one pig with 16 legs can out-compete this pig population for food and mates? This is unlikely, as this 16 legged pig will only be made fun of by the others, and will find it hard to attract mates in order to pro-create. Hence, in all likelihood, having 4 legs is evolutionarily stable for this population of pigs.

Let us take another example. Imagine that in a society with a thousand humans, one socially awkward child with a super high IQ is born. If there is only one such child, he/she will struggle to fit in, probably fail to attract mates in competition with others, and hence this mutation will die out. Hence, average IQ is evolutionarily stable, and not a very high one. However, if a sufficiently large number of children are high IQ mutants, they can fend for each other, procreate with each other, and ultimately beat others while in competition for limited resources. Hence, high IQ can spread only if the number of high IQ mutants is large enough to begin with. This would perhaps explain the current state of the world to some extent.

But what happens if the world is mostly comprised of high IQ individuals, and a small number of low IQ mutant is introduced? Intense competition between the high IQ individuals may lead to their mutual destruction. For example, two technologically advanced nations may blow each other up with nuclear weapons, while less advanced nations may survive unscathed. In the aftermath of this destruction, the low IQ mutants may pro-create comfortably, soon populating society with mostly low IQ individuals. Hence, high IQ is not evolutionarily stable.

In the book Sapiens by Yuval Noah Harari, the author argues that Neanderthals were known to actually have a bigger brain than Homo sapiens, and a bigger brain generally translates to a higher IQ. Hence, it was always a mystery how Sapiens managed to exterminate the Neanderthals and spread over the whole world. Perhaps the fact that high IQ is not evolutionarily stable explains this counter-intuitive phenomenon.

Nash equilibrium

What is Nash equilibrium? It is an arrangement between two players in which neither side will gain from changing their strategy. For example, let us suppose that India and Pakistan agree to not go to war with each other. If Pakistan now changes its stance and attacks India, India will retaliate by attacking Pakistan. In this way, things will only become worse if either country stances its stance from peace to war. This shows that both countries being at peace is a Nash equilibrium.

What does Nash equilibrium have to do with evolution? Things that are evolutionarily stable are also Nash equilibria. What this means is the following: suppose we have a population with an evolutionarily stable trait (say, average IQ). If a small fraction of individuals suddenly becomes high IQ, that small fraction will soon be wiped out. Hence, when two average IQ persons are competing for a resource, it will not benefit either of them (over the long run) to become a high IQ mutant. Hence, the only stable arrangement, in which everyone is well-off over the long run, is if everyone has an average IQ.

Mixed strategy

In some situations, we don’t have an evolutionarily stable trait.

Consider the scenario in which most countries of the world are peaceful and minding their own business. Suddenly, the US starts attacking everybody, soon accumulating a load of wealth and territory. Other countries will soon follow suit, and the world will descend into chaos. But in this aftermath of bloody destruction, people will realize that Switzerland has remained unscathed because it refused to descend into this horrible mess. They will realize their folly, and decide to again become peaceful. Soon, most countries in the world are again peaceful. But then they see Russia attacking countries near its borders, slowly gaining control over large parts of the world. To prevent the scaled from being tilted too heavily in Russia’s favor, other countries soon join the fray.

In this way, much of world history has been a story of wars, interspersed with periods of genuine peace. Neither war, nor peace are evolutionarily stable. It took only one belligerent country to make all other countries go to war, and it took only one peaceful, relatively unscathed country to make everyone return to peace. I don’t mean to overplay my hand, but this could be a possible reason why humanity has often oscillated between wars and peace, and neither has stuck.

What should one do then? The author of the article proposes that we should have a mixed strategy in situations without pure evolutionary equilibria. What would that look like in this situation? Suppose all countries mutually decide to initiate an attack with another only one time out of ten (of course all countries can retaliate every time they’re attacked first). If a country suddenly starts attacking everyone, it will find itself always at war, its people and economy destroyed. Hence, it will adapt to attacking less frequently. If a country that is involved in world politics never initiates an attack, it will soon enough find itself attacked first by another country, and will hence suffer considerable damage. Soon, it will start initiating some attacks on its own to never let the past repeat itself. Hence, attacking sometimes, perhaps one time out of ten, is the evolutionarily stable strategy, as eventually everyone will start doing this.


As one may imagine, game theory can probably be applied almost everywhere in human life. When people lack cooperation, insecurity ultimately leads them to destroy their own peace of mind and others’. For example, in a society in which everyone throws garbage on the roads, one person deciding not to do so and taking pains to clean the streets is at a disadvantage. Other people will keep throwing garbage on the streets. Hence, that person will soon decide to stop his/her futile efforts, and litter the streets like other. This shows that the situation in which everyone throws garbage on the streets is evolutionarily stable. However, if people cooperate and mutually decide to never throw garbage on the streets, and also to punish/fine individuals that pollute the streets, our streets will remain clean forever. Hence, clean streets can be evolutionarily stable only when people communicate and cooperate with each other. I stole this example from the book Algorithms to Live By, by Brian Christian and Tom Griffiths. This book also re-kindled my desire to finally understand Game theory, after multiple failed attempts in the past.


  1. Evolutionary Game Theory, by Easley and Kleinberg

Effective Altruism- October

I am attaching my EA donation slip for October below. I took the Effective Altruism pledge last year, in which I pledged to donate 10% of my lifetime earnings to the organization.

Today is a special day for many Indians, including myself, as today is Gandhi Jayanti, Mahatma Gandhi’s birthday. Reading his autobiography was one of the most influential decisions of my life, and a lot of my life’s decisions were based on trying (and mostly failing) to follow in his footsteps- like going vegan (very few people know that Gandhi was vegan *before it was cool*), trying to donate a part of my salary, trying to be helpful to everyone, etc. Although I’ve failed in a lot of the above, I’ve also had some modest successes.

Gandhi’s influence in my life has changed over the past few years. I’ve helped myself to the odd dairy dessert way too many times. I’ve also been treated extremely badly by people who I’ve tried to be nice to. But most importantly, I’ve slowly realized that Gandhi’s ideas were mostly bad in the long run for society. It tries to negate basic human tendencies like selfishness by trying to establish primarily agricultural and socialistic communities, sexuality by trying to promote abstinence, etc. He may have had some success with these, but I’ve mostly failed.

After doing some random unfocused reading, I have realized that before having grand visions for society, we should have some evidence on whether our vision will work or not. I’ve found Abhijit Banerjee’s writings on experimentation in economics to be quite relevant in these matters. Despite Gandhi’s lack of insight into behavioral economics, he was an amazing man, and I am humbled by the opportunity to be able to donate some money towards social upliftment on his birthday.

Forecasting the American presidential elections

The article that I will be talking about today is How the Economist presidential forecast works by G. Elliott Morris. I have always wanted to know more about how American Presidential forecasts work, and how reliable they are. This is my attempt to try and understand it. Note that the authors have developed a code for their statistical algorithm, that they have posted here.

Poll position

How does one predict who will win the Presidential election? Simple. Randomly select a group of people from amongst the population, and note their voting preferences. If this selection process is unbiased and selects a large enough group of people, you should have a good indicator of who will win. Right? Unfortunately, this is not the full picture.

The time at which this poll is conducted is of great importance. At the beginning of the election year, a lot of people are undecided. “I don’t want to choose between two bad options, I’d rather go for a third party”. However, as the elections loom closer, people often return to their inherent preferences for either the Democrats or the Republicans. Hence, polls conducted at the beginning of the year are much less accurate than polls conducted right before the elections. For example, even as late as June 1988, George HW Bush was trailing by 12 percentage points in polling averages to his contender. He went on to win by 8 points just five months later. Of course, national polls taken close to the election can also be unreliable. Hillary Clinton was leading by 8 percentage points over Donald Trump as late is October 2016. She won the popular vote by only 2 points (and of course lost the election). For a fascination explanation of the electoral college, and how a candidate can lose the election despite winning the popular vote, watch this.

So if national polls are not completely reliable (at least the ones conducted in the early stages), how can one predict the election? A lot of things like market stability, global business, and even the stability of foreign governments rides on being able to predict the American Presidential election successfully. Hence, political theorists have put a lot of thought into it. It tuns out that there are some “fundamentals” that predict the election outcome better than polls. The “fundamentals” that we are concerned with here are the state of the economy, the state of the country, etc. One such model that uses “fundamentals” is “Time for Change”, developed by the political scientist Alan Abramowitz. It predicts the election outcome by using the GDP growth, net approval rating, and whether the incumbent is running for re-election. The error margins for this model have historically been comparable to those of polls taken late in the election season, and in 1992 it did a better job of predicting the election than national polls.

Something simple, please

To develop a prediction model using “fundamentals”, we have to choose the factors that are important in determining the election outcome. In selecting these factors using the given data, we might select factors that “seem” important, given the limited data, but do not really matter in predicting elections. This fallacy is known as overfitting, and can introduce substantial error into our predictions. To mitigate this problem, we borrow two techniques from machine learning- “elastic-net regularization” and “leave-one-out cross-validation”. It is heartening to see that although statistics heralded the machine learning revolution, new insights into how machines learn have also started changing the world of statistics.

Elastic-net regularization is the process of “shrinking” the impact of factors we consider in our model. The mantra that one must follow is that simpler equations do a better job of predicting the future than more convoluted ones. Hence, we may reduce the weights of the various factors we are considering, or remove the weak ones entirely. But how does one know by how much we should reduce the weights of these factors, or which factors to completely remove? For this, we use leave-one-out cross-validation. We will leave out one part of our data set, and train the model on the remainder of the data set, using a pre-determined “shrinkage” algorithm for reducing the weights of certain factors. We may also completely remove certain factors. We then test whether our model is able to predict the correct conclusion based on that left out data set. For instance, if we training an election prediction model based on data from 1952 to 2016, we leave out the data from 1952, and train out model on all the other election years to identify relevant factors and prescribe weights to them. Then we feed the data for 1952 into the model and see if it is able to predict the election result correctly. In the next iteration, we leave out 1956, and run the same algorithm. After we have this algorithm for all election years, we change the “shrinkage” algorithm and run the whole process all over again. A total of 100 times. We select the shrinkage algorithm that is the most successful on average.

The “shrinkage” algorithm that the authors found after running this algorithm was pretty close to Alan Abramowitz’s model. Some small differences were that the authors prescribed a penalty to parties that had been in power for two previous terms, and used a cocktail of economic indicators like real disposable income, non-farm payrolls, stock market, etc rather than just second-quarter GDP growth. They interestingly found that these economic factors have become less important in predicting elections, as the voter base gets more polarized. Hence, ideology has slowly come to trump economic concerns, which is a worrying indicator of major ideological upheaval in the coming years.

Of course, economic indicators are important in the 2020 elections. The pandemic has caused major economic depression, which is likely to reverse quickly once the health scare is mitigated. The authors see it fit to assign a weight to economic factors that is 40% higher than that assigned to economic factors during the 2008-2009 Great Recession.

The authors find that their “fundamentals” model does exceedingly well in back-testing, and better in fact that both early polls and late polls.

When the authors try to include polls in the model to possibly make it even more accurate, the machine learning algorithms they use decline to use early polls, and only incorporate polls conducted very close to the actual election.

There’s no margin like an error margin

Suppose the model predicts that Biden will win 51.252% of the vote. The actual election results being exactly this is effectively zero. The most important information that a model produces is the uncertainty estimate around that prediction. If the model predicts that Biden will get between 50% and 53% of the vote with 95% certainty, we can be pretty sure that Biden will win the popular vote.

To calculate these ranges of outcomes, we use a beta distribution, which is essentially like the normal distribution, but for values between 0 and 1. Also, the width of the beta distribution can vary as compared to the normal distribution, increasing or decreasing the uncertainty of a model’s prediction. If the beta distribution is wide, the margin of error is large. If the margin of error (95% confidence interval) is, say 10%, then a candidate predicted to win 52% of the vote has a 2.5% chance of getting less than 42% of the vote, and a 2.5% chance of getting more than 62%. Hence, in closely contested elections, beta distributions with large uncertainty can be quite unreliable.

Modeling uncertainty

How does one model uncertainty though, now that we’ve calculated the correct amount of “shrinkage”? We again use elastic-net regularization and leave-one-out cross-validation. Uncertainty also depends on certain parameters, and these parameters can be determined by these two algorithms. Uncertainties, in the authors’ model, are smaller closer to the election, in polarized elections, when there’s an incumbent running for re-election, and when economic conditions are similar to the long-term average. For instance, 11 months before the election in 1960, when the economy was unusually buoyant and the incumbent set to retire, the 95% confidence interval of the Republican vote share was quite large: from 42.7% to 62.4%. However, in 2004, with George W Bush seeking re-election, when the economy was in a stable state and the electorate was polarized, the 95% confidence level of Bush’s vote-share was from 49.6% to 52.6%. He ended up getting 51.1%, which was almost identical to the authors’ prediction.

Moral victories are for losers

Winning the popular vote does not guarantee winning the election. Hillary Clinton famously won the popular vote by 2%, but still lost the election to Donald Trump. The election outcome depends upon the “electoral college“, through which states, rather than people, do the voting. The authors, in trying to predict national election outcomes, choose to forecast a state’s “partisan lean” rather than the actual state election outcome. “Partisan lean” can be defined as how much a state favors Democrats or Republicans as compared to the whole nation, and hence how it would be expected to vote in the event of a tie.

Why would we forecast partisan lean instead of actual outcome though? This is because partisan lean is a more stable predictor of the actual voting outcome in the state. For instance, in the early stages, our model might predict that Minnesota is likely to vote Democrat by 52%. However, as the election approaches, events might cause the voting pattern across the whole nation to shift to the Republican side, including in Minnesota. Hence, although our model would predict that Minnesota would vote Democrat, events might transpire such that Minnesota eventually votes Republican. However, if one forecasts partisan bias, this bias towards a particular party will remain unchanged even if national events cause voters to swing, as long as this swing is spread uniformly across all states. Hence, partisan bias is a better predictor of eventual election outcome.

To produce central estimates of partisan lean, the authors use a variety of parameters like the state’s partisan lean during the previous two elections, the home states of the presidential candidates, the actual national popular vote, etc. But how do we use this analysis for 2020? The actual national popular vote has not even been conducted yet. In this case, we can use the various possible outcomes, calculate the partisan bias based on these numbers, and then attach a weight to them based on the probability of that outcome. For instance, if there’s a 10% chance of Trump getting 52% of the vote and Biden getting 45% of the vote, we plug those numbers into the algorithm to calculate each state’s partisan bias, and then attach a weight of 0.10 to it.

Bayesian analysis

The principle of Bayesian statistics is pretty powerful. First assume that a certain hypothesis or “prior” is true. Now study the actual real world data, and calculate the probability of that data being the outcome, assuming that your prior was true. If the probability if low, discard your prior and choose one for which the real world data is likely.

How does Bayesian analysis fit in here though? Don’t we already have a model? Why don’t we just plug in all the data and get a prediction already? This is because poll numbers are unreliable, and often have a bias. We can remove this bias by using Bayesian analysis.

What are some sources of errors while taking polls? One source of error is sampling error, in which the sample chosen is not representative of the whole population. For instance, in a population where half the people vote Democrat and the other half vote Republican, choosing a group of people who are mostly Republican will give us a biased election forecast.

However, there are other, non-sampling errors too. Even if the sample we select is representative of the population, not all of the people who are polled are eligible to vote, or will actually go out and vote even if they are eligible. Polls that try to correct for these and other non-sampling errors also don’t do a very good job of it, and inevitably introduce a bias. The model developed by the authors corrects for these biases by comparing the predictions of polls that have a bias towards the Democrats, and others that are biased towards the Republicans. Simplistically speaking, both of these kinds of biases cancel out.

There is another source of error that is more subtle: the partisan non-response. Let me illustrate that with an example. Given the generally negative media coverage for Donald Trump amongst multiple news outlets, many Republican voters will not agree to be surveyed at all. They might be scared of social ridicule should they voice their support for Trump, and probably don’t want to lie that they support Biden. Hence, any sample that polling organizations can construct will have a pro-Biden bias. However, this might change if the overall news coverage of Trump becomes more favorable. This introduces a lot of uncertainty into any forecasting model. The authors correct for partisan non-response by separating all polls into two groups- those that correct for partisan non-response, and those that don’t. Then they observe how the predictions given by these polls change every day. The difference in prediction between the two types of polls can be attributed to partisan non-response, and the authors can then incorporate this difference into their model.

However, what about the far flung states that are not polled as often as other, always in-the-news states? How do we reliably predict how these states will vote? The authors conclude that neighboring states, with similar geographies and demographies, vote similarly. Hence, if we have a forecast for Michigan but not for Wisconsin, we can reliably assume that Wisconsin is likely to have a similar polling result as Michigan. Given below is the correlation matrix for various states.

Bayes-in the model

Let us now put all the pieces together. The authors use an extension of the Markov Chain Monte Carlo Method, first expounded by Drew Linzer. What does it do? It performs a random walk.

Let me illustrate this with an example. Let us choose the prior that polls are biased towards the Democrats by about 5%. Also, we know the partisan bias for Michigan in the month of June. In the coming days until the election, Michigan can, swing Republican, Democrat, or stay the same. Because of our prior, however, we have to assign different probabilities to Michigan swinging Republican or Democrat (or staying the same). We perform this random walk every day for Michigan until the election, to get a prediction for how Michigan will vote, assuming our prior is true.

Now we may assume a different prior: that polls over-estimate Republican chances by 2%. We again perform a random walk for each state, including Michigan. The authors take 20,000 such priors, and perform random walks for various states. They now calculate each candidate’s chances of winning as the total number of priors which led to them winning, divided by the total number of priors.

Using this model, the authors predict a comfortable win for Biden.


  1. How the Economist presidential forecast works, by G. Elliott Morris.

IMO 1988/Problem 6

I spent some time thinking about the infamous IMO 1988/Problem 6 today:

For positive integers a,b, if \frac{a^2+b^2}{ab+1} is an integer, prove that it is a square number.

After some initial false starts, I came up with this:

a^2+b^2=(ab+1)(\frac{a}{b}-\frac{1}{b^2})+(b^2+\frac{1}{b^2}). We need to eliminate the \frac{1}{b^2}, in order to have some hope of a quotient that is an integer. One way to do that is to have (ab+1)\frac{1}{b^2}=(b^2+\frac{1}{b^2}). This implies that \frac{a}{b}=b^2. We have thus proven that the quotient, which is \frac{a}{b}, is a square number.

Of course, this is not the complete solution. We might have (ab+1)(\frac{a}{b^2}+l)=(b^2+\frac{1}{b^2}), in which case the quotient could also be an integer. However, it turns out that the solution that I have written above is the only possible solution. Now we only need to justify that this is the only solution.

Social Capital: Networks and Connections

The paper that I want to review today is “Social Capital: Its Origins and Applications in Modern Sociology” by Alejandro Portes. I found this paper in the list of the most highly cited Sociology papers of the last century. This is the first paper that I’ve ever read in Sociology, and I found it to be very engaging. This has definitely made me want to explore this field more.


The concept of “social capital” has become one of the rare terms to transcend the rarefied world of sociology into everyday language. The media makes it sound like the solution to most of the world’s problems, both between individuals and countries. Due to the diverse applications to which this term has been put, there is no one definition of the term anymore.

Although this term has its historical roots in the writings of Marx and Durkheim, its modern presentation leaves much to be desired. Sociologists often only present the positive aspects of it whilst leaving aside the negative. Also, “social capital” is often interpreted as similar to monetary capital in its capacity to provide an individual with power, status or opportunities. Some authors have also gone on to the extent of saying that cities and countries too can possess social capital, as opposed to just individuals, and the presence of this ill-defined “social capital” is retrospectively held responsible for certain cities being more prosperous and stable.

Clearly, the modern presentation of social capital can benefit from a more balanced view. The author intends to do just that in this article.


Pierre Bourdieu, the first person to systematically analyze the concept, defined social capital as “the aggregate of the actual or potential resources which are linked to possession of a durable network of more or less institutionalized relationships of mutual acquaintance or recognition”. For historical reasons, this analysis did not become well-known amongst researchers, what with the original paper being in French. Bourdieu makes the point that it is the benefits that members accrue, from being part of a social network, that gives rise to the strength and stability of such networks. Social networks do not just come into place on their own. People have to invest time and effort into building social ties. However, once this network is in place, the members of this network can appeal to the institutionalized norms of group relations to gain social capital. In some sense, this is like a monetary investment that can pay dividends later.

Because spending this social capital may lead to the acquisition of economic capital in the form of loans, investment tips, etc, Bourdieu thinks of social capital as completely interchangeable with economic capital. However, the acquisition of social capital is much less transparent, and much more uncertain than the process of acquisition of economic capital. It requires the investment of both economic and cultural resources, and involves uncertain time horizons, unspecified obligations, and the possible violation of reciprocity. If you help your friend today, it is not completely certain that you will ever need their help in the future, and that they will help you if you do.

Another contemporary researcher who has probed this realm is Glen Loury (1977), who stated that economists studying racial inequality and welfare focused too much on human capital (which might perhaps be interpreted as individual education or ability), and the creation of a level playing field such that only the most skilled persons succeed. They want to create a level playing field by making employers’ racial preferences illegal. However, this cannot succeed because the acquisition of human capital by some communities is stunted due to a lack of economic resources, along with the absence of strong social networks.

The merit notion that, in a free society, each individual will rise to the level justified by his or her competence conflicts with the observation that no one travels that road entirely alone. The social context within which individual maturation occurs strongly conditions what otherwise equally competent in- dividuals can achieve. This implies that absolute equality of opportunity…is an ideal that cannot be achieved. (Loury 1977)

Although Loury’s analysis of social capital and social networks stopped here, this led Coleman to delve into the issue in more detail, who described how social capital leads to the acquisition of human capital. Coleman defined social capital as “a variety of entities with two things in common: They all consist of some aspect of social structures, and they facilitate certain action of actors- whether persons or corporate actors- within the structure”. Coleman also described some things that lead to the generation of social capital, like reciprocity expectations and group enforced norms, along with the consequences of having social capital, like privileged access to information. Resources obtained through social capital are often looked upon as “gifts”, and hence one must distinguish between the possession of social capital, and the ability to acquire these gifts through it. Not everyone who possesses social capital, by virtue of being a member of a social group, can necessarily acquire these gifts without some requisite social savvy.

Another distinction that should be made is between the motivations of recipients and donors. Although the motivations of recipients are fairly clear, donors can either be motivated by reciprocity from the individual that they’re helping, or greater status in the community; or perhaps both.

Coleman also talks about the concept of “closure” in communities, which is the presence of sufficient ties in a community that guarantee the observance of norms. For instance, the possibility of malfeasance in the tightly knit community of Jewish diamond traders in New York is pretty low because of “closure”. This leads to easy transactions between members without going into much legalese.

Another interesting perspective on social capital is offered by Burt, who says that it is the relative absence of ties, called “structural holes”, that facilitates individual mobility. This is because in dense networks, after a certain amount of time, new information is scarce, and only redundant information gets transmitted. It is the weak ties in a sparse network that can suddenly become active, and transmit useful and new information, that can lead to new contacts, jobs, etc. Hence, it is the weaker networks that mostly lead to advancement as opposed to the stronger or denser ones. This is in stark contrast to the stance taken by others like Coleman, who emphasize that the benefits that can be accrued through a social network is directly dependent on how dense that network is.

Sources of social capital

What motivates a donor to help out a person asking for help in a social network? This motivation can be consummatory or instrumental. A consummatory motivation is one that stems from a sense of duty or responsibility. For instance, the economically well off members of a tightly knit community might feel an obligation to help out those who are less privileged. An instrumental motivation is one that stems from expectation of reciprocity. Donors help others only to accumulate obligations, and expect to be repaid on full at some time in the future. This is different from an economic exchange however, because the method of repayment can be different from the original method of payment, and also because the time frame of repayment is more uncertain.

There is another set of examples that explains this dichotomy of motivations. Bounded solidarity refers to the mechanism through which an underprivileged or sidelined community develops a sense of solidarity, and all members feel a duty to help each other out. This is an example of consummatory motivation. On the other hand, sometimes donors help out others only to raise their status in society. Hence, although there is an expectation of reciprocity, it is from the whole community and not from an individual. This again is an example of instrumental motivation. Of course the two motivations can be mixed: a donor may extend a loan to another member of a community to both gain status, and also expect individual reciprocity in that they may expect that the money be returned in time. The strong ties in the social network would ensure that both happen.

Effects of social capital

The three basic functions of social capital are

  1. As a source of social control- Parents, the police, etc use their social capital to influence the behavior of others in the community. For instance, parents may expect their children to behave well by using their social capital, which they possess by virtue of being guardians of their children. This social capital, if one may imagine it to be some sort of money, is never really spent and exhausted. Parents will always have an infinite amount of social capital to control the behvaiour of their children. The same goes for policemen, etc.
  2. As a source of family support- Children may use their social capital, which they possess by virtue of being dependent on their parents, to expect help from their parents in all spheres of life. This form of social capital is also inexhaustible. It has been noted that children that are brought up in a stable household with two parents often experience better success in their education and careers. On the other hand, children brought up in one-parents households face a harder time dealing with their education and career. This is mainly because children in one-parent households have less social capital, in that they have one less parent to ask for help from.
  3. As a source of benefits through extrafamilial networks- This one is slightly more intuitive. Connections made outside one’s family can have a huge impact on individual mobility and careers. For instance, Jewish immigrants in New York at the turn of the 20th century often received help from other immigrants in the form of small loans of employment in companies. They had social capital just by virtue of belonging to the same community. Other examples of this phenomenon are New York’s Chinatown, Miami’s Little Havana, etc.

    On the flip side, a lack of connections can spell doom for certain communities. For example, impoverished communities rarely have connections with the better parts of town which might provide them with employment or relief. For instance, inner-city impoverished black communities often lack connections with potential sources of employment, and remain mired in poverty. This problem is further exacerbated by the dense social network existing between members of this community, which leads them to influence each other to pursue crime and drug abuse.

    Stanton-Salazar and Dornbrush have found a positive correlation between the existence of such extrafamilial social networks and academic achievement amongst Hispanic students in San Francisco. On a side note, they found an even higher correlation between bilingualism and academic achievement, highlighting the importance of being able to communicate with a wider community.

    This form of social capital is exhaustible, in that you cannot keep asking the wider community for help and not expect people’s patience to run out.

Negative social capital

It is important to identify the negative with the positive. Recent studies have noted four negative consequences of the existence of social networks:

  1. Strong social ties within a community can bar access to others. For instance, business owners from some ethnic communities often employ only members of the same community. Control of the produce business on the East Coast by the Korean community, control of the diamond business in New York by the Jewish community, etc are examples.
  2. Successful members of a community are often assaulted by job-seeking kinsmen. The strong social ties often force these otherwise successful professionals/businessmen to help out or hire their kinsmen, affecting the overall quality and performance of their organizations.
  3. Social control can lead to demands of excessive conformity. In tightly knit traditional societies today, divorces are still looked down upon for instance, and errant members are ostracized. Privacy and individualism are reduced in this way.
  4. Marginalized communities often develop a strong sense of solidarity, and are apprehensive about mixing with the rest of the population and go up the social and career ladders. Consider the following quote from a Puerto Rican laborer, for instance

“When you see someone go downtown and get a good job, if they be Puerto Rican, you see them fix up their hair and put some contact lenses in their eyes. Then they fit in and they do it! I have seen it! … .Look at all the people in that building, they all “turn-overs.” They people who want to be white. Man, if you call them in Spanish it wind up a problem. I mean like take the name Pedro-I’mjust telling you this as an example-Pedro be saying (imitating a whitened accent) “My name is Peter.” Where do you get Peter from Pedro?”

(Bourgois 1991, p. 32

Decades and centuries of discrimination or persecution often lead to certain communities becoming closed to the outside world, which removes them from the larger social network that could perhaps have helped them succeed. This self-imposed exclusion makes their situation even worse than before. These are known as downward leveling norms. Moreover, members that step outside of these communities are often ostracized, which leads to low overall member mobility.

Social capital as a feature of communities and nations

Some political scientists have also extended the notion of social capital to cities and communities, renaming it as “civicness”. This “civicness” or social capital of a community encompasses “the features of a social organization, like networks, norms and trust, that facilitate action and cooperation for mutual benefit”. There is no information on the number of people involved in this social network, the density of the social network, etc.

Robert Putnam, a prominent advocate of the community view of social capital, said that the decline in the nature of cities and the community in general is a result of the loss of social capital through the falling membership of organizations like PTA, the Elks Club, the League of Women Voters, etc. Critics have called this view elitist for stating that social capital can only be regained through membership in these high society organizations. Moreover, they have also admonished Putnam’s opinion that the responsibility for increasing this social capital lies in the hands of the masses by joining these organizations, and not in the hands of the government or corporate leaders.

Dr Portes notes that Putnam’s argument is also circular. The social capital of a community cannot be measured directly. It can only be inferred from a community’s success. If a community is successful, one may infer that there probably is a strong sense of cohesion between the members. Anything that cannot be measured directly, and can only be inferred, cannot scientifically considered to be a cause. For instance, “emotional balance” cannot scientifically be considered a cause of a person’s success. There may be lots of causes of their success, like hard work of networking. Emotional balance can only be inferred: because this person is successful, they probably do have emotional balance. In this way, identifying one single cause for the success of a person or community, especially if that cause can only be inferred and not measured directly, is a dangerous game. The only way that Putnam could have proven his thesis is by taking two communities that are similar in all regards except that one has more social capital than the other (we are assuming that this social capital can be directly measured), and showing that the one with more social capital is more successful. This is obviously a difficult experiment to conduct in real life.


“Social capital” is essentially a mix of old ideas in a new garb. Moreover, it is unlikely that just increasing social capital will lead to a solution of community-wide problems. As has been explained above, social capital is also responsible for holding back certain communities from development. Hence, appreciating both the positives and negatives of social capital is important for having a balanced and realistic view of the concept.


  1. Social Capital: Its Origins and Applications in Modern Sociology, by Alejandro Portes.

EA- September

Attaching my Effective Altruism receipt for the month. I’ve been neglecting this for a couple of months, in part because of recent high expenses (car, etc). This is me trying to jump back on the bandwagon.

I’ve often thought about whether “blacking” out the amount donated would be a more altruistic thing to do. However, this is not a one-off donation. This is supposed to be 10% of your income, and the amount donated should reflect that. I might still black things out in the future to make these posts less awkward.

How have I been able to afford it? I don’t drink, or eat out that much anymore. My other expenses have generally only reduced over time. Hence, I am perhaps just substituting one form of expenditure with another. I have tried donating to friends’ charities and things like that before. However, on average, I’m happiest donating to just EA and Arxiv.

The topology of data

The paper that I wish to discuss today is An Introduction to Topological Data Analysis: fundamental and practical aspects for data scientists, by Frederic Chazal and Bertrand Michel. Topological data analysis is an exciting new field, and this paper can be understood by people from a wide range of backgrounds.

Notation: For this paper, \Bbb{X} denotes a topological space, and \Bbb{X}_n=\{x_1,\dots,x_n\}\subset \Bbb{X} a sample of points from it.


Topological Data Analysis works on the assumption that the topology of data is important. But why? Let us take an example from physics. Suppose we want to study the energy emitted by atoms after absorbing sunlight. We observe that energy emitted by atoms forms a discrete set, and not a continuous one. We take a large number of readings, and observe the very same phenomenon. Hence, we can conclude with a high degree of probability that the topology of the energy states of atoms is discrete. This is one of the most fundamental discoveries of modern Science, and heralded the Quantum revolution.

It becomes clear from the above example that understanding the topology of data can lead us to understand the universe around us. However, we have to “guess” this topology of the “population” from a much smaller “sample”. Topological Data Analysis can be thought of as the study of making “good” guesses in this realm.

Metric Spaces, Covers, and Simplicial Complexes

Let us take a metric space (X,d). The Hausdorff distance between two compact sets A,B\subset X is defined as \inf\limits_{a\in A}\sup\limits_{b\in B}d(a,b). It is essentially a measure of how “similar” two compact sets look. We need compactness, or rather boundedness, because we need the \inf, \sup limits to be well-defined. However, what if A,B are not necessarily subsets of the same space? Gromov generalized the above definition in the following way: The Gromov-Hausdorff distance between two compact sets A,B is the infimum of all positive real numbers r\geq 0 such that d(\phi_A(A),\phi_B(B))\leq r, where \phi is the isometric embedding of A,B in some manifold M. Essentially, we want to see how “close” the two sets can get across all possible isometric embeddings in all possible manifolds. As one can imagine, calculating it is a seemingly impossible task in most situations, and it is primarily useful when an upper bound to it implies other useful mathematical facts.

Simplicial complexes

Now given a set of points \{x_0,\dots,x_k\}\subset \Bbb{R}^d of k+1 affinely independent points, we can make a k-simplex in \Bbb{R}^d called the convex hull of the points. A simplicial complex K is a collection of simplices such that

  1. Any face of a simplex is also a simplex. We need this condition because we want to define a boundary map from the simplicial complex to itself, which will allow us to calculate topological invariants like the homology of the complex.
  2. The intersection of two simplices is either empty, or a common face of both. This condition ensures that only topological “holes” are detected in homology groups, and not other topological features.

Note that K can be thought of as both a topological space and a combinatorial entity. The topological perspective is useful when we’re trying to break up a topological space into simplices in order to calculate homology, and the combinatorial perspective is useful when we use simplices to represent mathematical entities that are not originally topological spaces. For instance, polynomial rings can also be studied using simplicial complexes, where each vertex corresponds to a homogeneous polynomial. Of course, each combinatorial simplicial complex can be given a topological description. Also, note that the combinatorial description is more suitable than the topological one in the realm of algorithms and machine learning.

Vietoris-Rips Complex: Given a set of points \{x_0,\dots,x_k\}\subset \Bbb{R}^d with pre-determined distances, form all simplices of the form [x_{i_1},\dots,x_{i_l}] whenever the distance between any pair of points in \{x_{i_1},\dots,x_{i_l}\} is at most \alpha. If k>d, we might even have simplices that cannot be fit into \Bbb{R}^d. The set of all such simplices forms a simplicial complex in some \Bbb{R}^k, where k might be bigger than d. As we increase the value of \alpha>0, this simplicial complex, and consequently its homology groups, will change.

Čech Complex: Given a set of points \{x_0,\dots,x_k\}\subset \Bbb{R}^d with pre-determined distances, we form l-simplices when the \alpha-balls of l points have a non-empty intersection. Hence, the maximum distance between two points now has to be 2\alpha, and not \alpha. A Čech complex is denoted as {Cech}_{\alpha}(\Bbb{X}). How is this different from Rips_{2\alpha}(\Bbb{X}) though? As shown in the diagram above, three pairwise intersecting discs give us a 2-simplex in Rips_{2\alpha}(\Bbb{X}), but just a 1-cycle in the corresponding Čech complex.

The Nerve Theorem: Given a cover \{U_i\} of a manifold, we can form a Čech complex from it (note that the open sets here are U_i, and not \alpha-balls as stated before). The Nerve Theorem says that if the intersection of any sub collection of \{U_i\} is either empty or contractible, then the Čech complex is homotopy equivalent to the manifold.

The contractibility of intersections ensures that we do not “cover up” any holes in the manifold using our discs. But why the Čech complex, and not the Rips complex? This is because a hole would be covered up by three pair-wise intersecting discs in a Rips complex, but not in the Čech complex. Hence, the Čech complex is a useful way of preserving the homology of the underlying manifold.

Why is this theorem important? This is because it takes a continuous property of a topological space, and converts it into a combinatorial entity. For instance, if I only needed to know the homotopy class of a manifold for my problem, studying a simplicial complex on the manifold with the same homotopy type is much easier for me, as I can now feed the combinatorial data of the simplicial complex into well-known algorithms to make deductions about the space.

Mapper Algorithm

For a space X, let f:X\to \Bbb{R}^d be a continuous map. For a cover \{U_i\} of \Bbb{R}^d, the sets \{f^{-1}(U_i)\} form a cover of X. Now consider the set of all connected components of \{f^{-1}(U_i)\}. If the function f and the open cover \{U_i\} are well chosen, the nerve of this “refined” cover gives us useful information about the underlying space X. Note that the \{f^{-1}(U_i)\} don’t have to be contractible anymore. An example is given below:

The function here is the height function (look out for the slightly camouflaged yellow parts). This nerve of the two-holed torus does a decent job of representing its topology, although we fail to detect the small hole at the bottom because the cover chosen of \Bbb{R} is not fine enough.

In practice, however, we don’t map continuous manifolds, but just data points. A suitable example is given below:

From the nerve drawn on the right, one may conclude that the topology of the underlying population, from which the data has been extracted, is a circle.

Some popular choices of f are the centrality function f(x)=\sum\limits_{y\in \Bbb{X}} d(x,y), or the eccentricity function f(x)=\max\limits_{y\in \Bbb{X}} d(x,y). These functions do not require any special knowledge of the data.

As one may imagine, the choice of the cover \{U_i\} determines the nerve that we get from the Mapper algorithm. Generally, one may choose regularly spaced rectangles U_i which overlap with each other. If f:X\to \Bbb{R}, then the length of the intervals U_i is known as the resolution, and the fraction of overlap is denoted as g. One must explore various resolutions and values of g in order to find the “best” nerve of X.

Now remember that the connected components of the f^{-1}(U_i)`s form just the vertex set of the simplicial complex we are building. Although we could build a Čech complex from these pre-images, we may also cluster the vertices corresponding to the f^{-1}(U_i)‘s in other ways. For instance, we may build a k-NN graph with the points in X, associate points to each f^{-1}(U_i), cluster them appropriately, and then only select the connected components of the subgraph whose vertices correspond to the f^{-1}(U_i)‘s. This is a different algorithm because we don’t care if the f^{-1}(U_i)`s intersect anymore.

Geometric Reconstruction and Homology Inference

Let us now bring probability into the mix. Suppose we have a space X\subset \Bbb{R}^d with a probability measure \mu. If we take a sample of points \{x_0,\dots,x_k\}\subset X, we want the topology of Cech_{\alpha}(\{x_0,\dots,x_k\}) to somehow approximate the topology of X.

Now, we introduce some mathematical definitions. For a compact space K\subset \Bbb{R}^d, the r-offset K^r is defined as the set of all points x such that d(x,K)\leq r. Why do we care about r-offsets? Because they are much better at capturing the topology of the space around them. For instance, the homology of a loop is clearly different from the Euclidean space it is contained in. However, for appropriate values of r, the r-offset of the loop becomes contractible, and hence has the same homology as the surrounding space.

A function \phi:\Bbb{R}^d\to \Bbb{R}_+ is distance-like if it is proper, and x\to \|x\|^2-\phi^2(x) is convex. We need the proper condition because we don’t want unbounded sets to only be at finite distance from a point. Hence, properness ensures that only bounded sets are at a bounded distance from a point (with respect to which a distance function is defined). The second condition is that of semi-concavity, and I would like to re-write it in its more natural form: \phi^2(x)-\|x\|^2 is concave. The reason that it is called “semi”-concave is that it is concave only when a very concave function is added to it. \phi(x) can actually be a convex function itself. Why do we want this condition here? This is because we want to generate a continuous flow using \phi, and functions that “rise too fast” may have discontinuities when we generate this flow.

A point x\in \Bbb{R}^d is said to be \alpha-critical if \|\nabla_x \phi\|\leq \alpha. It is a generalization of the notion of a critical point (where \alpha=0). We want to find the smallest r such that \phi^{-1}[0,r) does not have any \alpha-critical values. This is known as the \alpha-reach of \phi. What this means is that the level sets \phi^{-1}(r') for r'<r will “flow” along the gradient of \phi at a speed that is faster than \alpha, at least until they reach the level set \phi^{-1}(r). Why do we want level sets flowing into each other at all? This is because of their relation to Čech complexes. Consider the level sets in \phi^{-1}((a,b)) and \phi^{-1}((c,d)). If the level sets of the first can “flow” into the level sets of the “second”, then there is no hole between them, and the two vertices corresponding to these inverse open sets can be joined by a line regardless of how these vertices have been clustered.

Isotopy Lemma– Let \phi be a distance-like function such that \phi^{-1}([r_1,r_2]) has no critical points. Then the level sets of \phi^{-1}([0,r]) are isotopic for r\in [r_1,r_2]. Two topological sets are isotopic when they are homeomorphic, and one can continuously be deformed into the other. Essentially, the level sets in \phi^{-1}([0,r_1]) can essentially stay where they are, and the level sets inside \phi^{-1}([r_1,r_2]) can move because there are no critical points inside. Whether there are critical points in \phi^{-1}([0,r_1]) is irrelevant.

Reconstruction Theorem– This theorem essentially states that when two distance-like functions \phi and \psi are “close enough”, suitable sub level sets of both are homotopically equivalent. Of course it is unclear from the statement how these level sets “flow” into each other, and which function’s gradient field is chosen for this. Why is this theorem important?

Let \phi=d_M and \psi=d_{\Bbb{X}_n}, which are the distance functions with respect to the support of \mu on M and \Bbb{X}_n respectively. The Reconstruction theorem can tell us that for appropriate values of r and \eta, the r-offset of M is homotopically equivalent to the union of the \eta-offsets of the points in \Bbb{X}_n, which in turn is homotopically equivalent to the Čech complex formed by these offsets. Essentially, the Reconstruction Theorem provides the basis for studying the topology of \Bbb{X} using the nerve of \Bbb{X}_n.

An important result of Chazal and Oudot in this direction is the following: For M\subset \Bbb{R}^d a compact set, let the \alpha-reach of d_M be R>0. Also, let X be a set of points such that d(M,X)<\epsilon<\frac{1}{9}(\alpha-reach) of d_M. Then for \eta\in (0,R), \beta_k(X^{\eta})=rk(H_k(Rips_r(X)\to H_k(Rips_{4r}(X))). Here a suitable r has been chosen. Why is this theorem important? Because it allows us to calculate the Betti numbers of M using just information gleaned from the Rips complexes of X.

Distance to measure

Note that in all of the theorems discussed above, d_M and d_{X_n} have been “pretty close” as metrics. This forms the basis of all our topological deductions. What if they’re not? What if we have outlier points in X_n? Note that it is not in general possible to select a point from outside of the support of the probability measure \mu on M, as the probability of selecting a point outside of it is by definition 0. Hence, the existence of such a point is purely due to noise, which brings in a probability distribution that is independent of $mu$.

To deal with noise of this sort, we have the notion of “distance to a measure”. Given a probability distribution P and a given parameter 0<u\leq 1, \delta_{P,u}: x\to \inf\{t>0: P(B(x,t))\geq u\}.

Note that this map can be discontinuous if the support of P is badly behaved. Hence, to further regularize this distance, we define d_{P,m,r}=\big(\frac{1}{m}\int_0^m{\delta^r_{P,m}(x) du}\big)^{1/r}

A nice property of d_{P,m,r} is that it is stable with respect to the Wasserstein metric. In other words if P,P' are two probability measures, then \|d_{P,m,r}-d_{P',m,r}\|_{\infty}\leq m^{-1/r}W_r(P,P'). Hence, d_{P,m,r} is a good distance-like function to consider to analyze the topology of M, or at least the support of \mu.

In practice, P is not known. We can only hope to approximate it from X_n. Consider the probability measure P_n on \Bbb{X}_n. Although the exact construction of P_n is not mentioned (it might just be a discrete measure), the following formula has been mentioned: for

m=\frac{k}{n}, \delta^r_{P_n,k/n,r}(x)=\frac{1}{k}\sum\limits_{j=1}^k \|x-X_n\|^r_{j}

Here \|x-X_n\|^r_{j} denotes the distance between x and its jth neighbor in \Bbb{X}_n. If the Wasserstein distance between P and P_n is small, which is what we can hope if we take a large enough sample from the population, then \delta^r_{P_n,k/n,r} is pretty close to \delta^r_{P,k/n,r} in the L^{\infty} measure.

Persistent Homology

Persistent homology is used to study how the Čech complexes and Betti numbers change with a parameter (generally the radius of the overlapping balls). Why is it important? We can never really be sure if the homology of the Čech complex that we have is the same as the homology of the underlying space. Hence, why not study all possible homologies obtained from an infinite family of Čech complexes? Two spaces with “similar” such families of homologies are likely to be the “same” in some well defined topological sense. Hence, this is another attempt at determining the topology of the underlying space.

A filtration of a set X is a family of subsets \{X_r\} such that r<r'\implies X_r\subset X_{r'}. Some useful filtrations are the family of simplicial complexes \{Cech_{r}(X_n)\} or \{f^{-1}([0,r])\}.

Given a filtration F_r, the homology of F_r changes as r increases. Consider the persistence diagram below:

I will briefly try to explain what is happening here. f is the height function defined on this graph, and F_r=f^{-1}([0,r]) for a_1<r<a_2 looks like an interval that is expanding in size. When r=a_2,,a_3, etc, new intervals are created, which die when they join with some other interval that was created before them. For example, the interval created when r=a_3 joins the interval created at r=a_1 when we reach the point r=a_4. (value of r at the birth of an interval, value of r at the death of that interval) can be graphed as a point in \Bbb{R}^2, as shown by the red dots in the diagram on the right. We also add al diagonal points with infinite multiplicity. One way to interpret that is for all values of r, an infinite number of intervals come alive and die. The reason why we add this seemingly arbitrary line is that when we try to sample a population of data points, we might receive some noise. Hence, we can create a small neighborhood of the diagonal, and only interpret the points that lie outside of the diagonal as genuine topological features of the underlying manifold. The points inside the neighborhood denote topological features that are born and die “too soon”, and hence might just be noise. More will be said about this later.

Given below is the corresponding diagram for Čech complexes. The blue dots correspond to the birth and death times of “topological holes”.

Note that in the diagram above, all the balls grow at the same time. Hence, we don’t have a clear way of choosing which red interval should “die” and which one should survive. We arbitrarily decide that if two balls that start at the same time join, the red line below remains, and the one above ends. The persistence diagram is is given on the bottom right.

Persistence modules and persistence diagrams

The inclusion diagram \dots\to F_r\to F_{r'}\to \dots, where r<r', induces the inclusion diagram of vector spaces \dots\to H_k(F_r)\to H_k(F_{r'})\to \dots. The latter inclusion diagram is called a persistence module. Why is this important? Because it shows, in real time, the evolution of the kth Betti number of the diagram.

In general, a persistence module \Bbb{V} over a subset of the real numbers T is an indexed family of vector spaces \{V_r| r\in T\} such that \psi^r_s:V_r\to V_s when r\leq s, along with the property that \psi^t_r\circ \psi^r_s=\psi^t_s. In many cases, such diagrams can expressed as a direct sum of “inclusions” modules \Bbb{I}_{(b,d)} of the form

\dots\to 0 \dots\to 0 \to \Bbb{Z}_2\to \Bbb{Z}_2\to \dots\to \Bbb{Z}_2\to 0 \to\dots

where the \Bbb{Z}_2\to \Bbb{Z}_2 maps are identity maps, and the rest are 0 maps. In some sense, when the vector spaces in the persistence module are the H_k groups, we are breaking up the \dots\to H_k(F_r)\to H_k(F_{r'})\to \dots for each k-dimensional hole, and tracking when each hole appears and disappears. Chazal et al proved that if the map \psi^r_s:V_r\to V_s in a persistence module has finite rank for each r,s\in T, then it can be decomposed as a direct sum of “inclusions” modules in a well-defined way. One way to think of this is the following: a generic “element” of a persistence module is the set births and deaths of all topological k-holes that survive at least one iteration (or increment in the value of r), between r=b and r=d. If all but a finite number of holes die only within one iteration, then each element of the persistence module can be thought of as a finite sum of “inclusions modules”.

Persistence landscapes

The persistence landscape was introduced by Bubenik, who stated that the “dots” in a persistence diagram should be replaced by “functions”. But why? Because there’s not much algebra that we can do with dots. We can’t really add or multiply them in canonical ways. However, functions lend themselves to such operations easily. Hence, the inferences that we make about such functions may help us make inferences about the underlying topological space.

How do we construct such a function? Take a point (a,b). We just construct a function \Lambda_{p}(t) for a point p that looks like a “tent”, by joining the points (a,0) and (\frac{a+b}{2},\frac{b-a}{2}) by a straight line, and then joining (\frac{b+a}{2},\frac{b-a}{2}) and (b,0) by another straight line. Three such “tents” for three different points are given below. They’re drawn in red.

The persistence landscape for this diagram is defined as \lambda_{dgm}(k,t)=kmax\Lambda_p (t), where kmax is the kth highest value in the set. For instance, the function in blue drawn above is \lambda_{dgm}(1,t).

A short note about the axes in the two diagrams above: the b and d on the left diagram correspond to time of birth and death respectively. For the diagram on the right, the axes denote the coordinates of the black “dots” on the functions. The “tent”-ed functions themselves may be thought of as a progression from left to right, in which a topological feature is birthed and then dies.

One of the advantages of persistence landscapes is that they share the same stability properties as persistence diagrams. Hence, if two probability measures are “close enough”, then their persistence landscapes will also be “quite close”.

Metrics on the space of persistence diagrams

We know that then two probability distributions are “close enough”, then the distance functions to those probability distributions are also “pretty close”. However, what about the persistence diagrams of those probability distributions? Does the persistence diagrams of two probability distributions being “close” imply that the probability distributions are also close? Before we can answer this question, we must find a good metric to calculate the distance between two persistence diagrams.

One such metric is the bottleneck metric, which is defined as

d_b(dgm_1,dgm_2)=\inf_{m}\max_{(p,q)\in m}\|p-q\|_{\infty}

Here m is a “matching”, which is the arbitrary bipartite pairing of points in dgm_1 with those in dgm_2. Because the L^{\infty} norm is too sensitive to “outliers”, a more robust metric is

W_p(dgm_1,dgm_2)^p=\inf_{m}\sum_{(p,q)\in m}\|p-q\|^p_{\infty}

Stability properties of persistence diagrams

But if two persistence diagrams are “close”, are the underlying probability distributions also “close”? We don’t know. But the converse is true.

Let \Bbb{X} and \Bbb{Y} be two compact metric spaces and let Filt(\Bbb{X}) and Filt(\Bbb{Y}) be the Vietoris Rips of Čech filtrations built on top of \Bbb{X} and \Bbb{Y}. Then

d_b(dgm(Filt(\Bbb{X})),dgm(Filt(\Bbb{Y})))\leq 2d_{GH}(\Bbb{X},\Bbb{Y})

We can also conclude that if two persistence diagrams are close, then their persistence landscapes are also close: Let \Bbb{X} and \Bbb{\tilde{X}} be two compact sets. Then for any t, we have

|\lambda_{\Bbb{X}}(k,t)-\lambda_{\tilde{X}}(k,t)|\leq d_b(dgm(Filt(\Bbb{X})),dgm(Filt(\Bbb{Y})))

Whether the closeness of persistence diagrams denotes the closeness of the underlying topological spaces remains woefully unanswered.

Statistical aspects of persistent homology

While talking about persistence homology, we have only talked about topological spaces, and not necessarily about probability distributions. We do so here.

Let \Bbb{X} be an underlying space with probability measure \mu, and let \Bbb{X}_{\mu} be the compact support of this measure. If we take n independent readings from this set- say \Bbb{X}_n=\{X_1,\dots,X_n\}, then we can estimate the space \Bbb{X}_{\mu} by \Bbb{\hat{X}}. The probability measure on \Bbb{X}_n has support \{X_1,\dots,X_n\}.

For some (a,b), let \mu satisfy the condition \mu(B(x,r))\geq \min(ar^b,1). Then

\Bbb{P}\big(d_b(dgm(Filt(\Bbb{X}_{\mu})),dgm(Flt(\Bbb{X}_n)))>\epsilon \big)\leq\min\big(\frac{2^b}{a\epsilon^b}exp^{-na\epsilon^b},1\big)

Because we do not exactly know \mu and hence the persistence diagram of \Bbb{X}_{\mu}, we can only calculate the probability of the persistence diagram of \Bbb{X}_n being close to that of \Bbb{X}_{\mu}. Clearly, as n grows big, this probability becomes smaller. This can also be ascertained from the following formula:

P\big(d_b(dgm(Filt(\Bbb{X}_{\mu})),diag(Filt(\Bbb{X}_n)))\leq C\big(\frac{\log n}{n}\big)^{1/b}\big)

approaches 1 as n\to\infty. Here, C is some constant.

Estimation of the persistent homology of functions

If two functions f,g on a manifold are “close”, then the persistence diagrams induced by them are also close. More precisely,

d_b(dgm_k(f),dgm_k(g))\leq \sup\limits_{x\in M}|f(x)-g(x)|

This opens up a vista of opportunities, in that we can now study density estimators, regression functions, etc. But how? Suppose we do not know how to calculate the persistence homology of a complicated function. We take a more regular function that is “close” to it in the L^{\infty} norm, calculate its persistence homology, and then be assured that the persistence diagram of the complicated function looks almost like the persistence diagram of the current, “better” function.

Confidence regions for persistent homology

When estimating a persistence diagram dgm with an estimator diagram \hat{dgm}, we look for a value \eta_{\alpha} such that P(d_b(\hat{dgm},dhm)\geq n_{\alpha})\leq \alpha, for some \alpha\in (0,1). The n_{\alpha} gives us an upper bound on how “far” the two diagrams can be.

In some sense, if we were in the space of persistence diagrams (each point in this space is a persistence diagram), B(\hat{dgm},\eta_{\alpha}) is the \alpha-confidence interval of dgm. How does this translate to confidence intervals of the actual points in \hat{dgm}? One way to do this is to center a box of side length 2\alpha at each of these points. Another way is to create an \eta_{\alpha} neighborhood of the diagonal line in the persistence diagram. The points outside of it are significant topological features of the sample, and are definitely preserved in dgm. This is perhaps an important reason why we include the diagonal in persistence diagrams- points on the diagonal are unimportant topological features that might just be noise. Hence, we can infer the persistence diagram of the underlying space from that of the sample, as long as the points that we get are “far enough” from the diagonal.

Of course, all of this depends on whether we can successfully approximate the value of \eta_{\alpha} from the sample. Methods like the Subsampling Approach and the Bottleneck Bootstrap are important in this context.

Central tendency for persistent homology

Persistence diagrams are just a bunch of dots and a diagonal line. As they’re not elements of a Hilbert space, we cannot determine an “expected” or “mean” persistence diagram. Hence, we need to move to persistence landscapes, which are elements of a Hilbert space, and consequently lend themselves to such an analysis.

For any positive integer m, let \{x_1\dots,x_m\}\subset \Bbb{X}_{\mu} be a sample of m points from \mu. For a fixed k, let the corresponding persistence landscape be \lambda_{X}. Now consider the space of all persistence landscapes (corresponding to different \Bbb{X}_n‘s drawn from \Bbb{X}), and let \mu^{\otimes m} be the induced measure on \Bbb{X}^m, which then induces the measure \Psi^m_{\mu} on the space of persistence landscapes. It is now possible to calculate the expected persistence landscape, which is \Bbb{E}_{\Psi^m_{\mu}}[\lambda_X(t)]. This quantity is quite stable. In fact, the following is true:

Let X\sim \mu^{\otimes m} and Y\sim \nu^{\otimes m}, where \mu and \nu are two probability measures on M. For any p\geq 1, we have

\|\Bbb{E}_{\Psi^m_{\mu}}[\lambda_X]-\Bbb{E}_{\Psi^m_{\mu}}[\lambda_Y]\|_{\infty}\leq 2m^{\frac{1}{p}}W_p(\mu,\nu)

Persistence homology and machine learning

Due to the highly non linear nature of persistence diagrams, persistence diagrams first need to be converted into persistence landscapes to be useful in machine learning. Such persistence landscapes have been useful in protein binding and object recognition.

The construction of kernels for persistence diagrams has also drawn some interest. Can we directly compare two persistence diagrams by taking their “dot product” in some sense? Convolving a symmetrized version of a persistence diagram (with respect to the diagonal) with the two dimensional Gaussian measure gives us exactly such a kernel. Such a kernel can be used for texture recognition, etc.

Sometimes, identifying topological features from a persistence diagram becomes a difficult task. Hence, the choice of a kernel becomes an important factor. Also, deep learning can also be used to identity the relevant topological features in a given situation.


  1. An Introduction to Topological Data Analysis: fundamental and practical aspects for data scientists, by Chazal and Michel

CRISPR- A review

The paper that I want to talk about today is The CRISPR Toolkit for Genome Editing and Beyond by Mazhar Adli, published in Nature Communications in 2018. My rather elementary knowledge of Biology did not make this easy, and it was fun to watch countless youtube videos to try and get to grips with this amazing technology.

Genome-editing before CRISPR

Spiderman developed his awesome superhuman skills because a radioactive spider bite caused a mutation in his DNA. Genome-editing technologies have chased similar superhuman dreams for a long time now. What if we could edit our DNA itself to give us amazing capabilities, or remove those parts of the DNA that are responsible for our deformations? The first glimmer of hope in this direction was the discovery of restriction enzymes in bacteria, that protected them from invading agents called phages. Restriction enzymes scan DNA molecules, and if they see an “enemy” pattern that they’ve been trained to recognize, they cut the DNA molecules at an appropriate site, effectively rendering the “enemy” gene useless.

With this discovery, scientists were able to manipulate the DNA of cells in test tubes, rendering similar cuts to the enemy. However, manipulating the DNA of living cells that were part of a larger living organism (in vivo) remained an elusive dream. This was finally realized by the work of Capechhi and Smithies, who found that mammalian cells could incorporate a foreign copy of DNA into their own genome. This happens through a process called homologous recombination, and is explained here.

So could we just keep introducing desired DNA copies into mammalian cells, and hope that these get incorporated? No. This is because only 1 in 10^7 cells allowed the foreign DNA to combine with the existing DNA in the cell. Secondly, this foreign material could be incorporated in other parts of the DNA instead of the desired foci. Hence, we needed to get better control over the process.

Cut ’em up

Researchers soon realized that if there was a break in both strands of the DNA at the desired site, called a double-strand break or DSB, the frequency of the foreign material getting attached there would increase by orders of magnitude. This led to a lot of research into large “cutting” molecules, or meganucleases. These meganucleases could recognize strands that were 14-40 base pairs long, and then cut the genes at the desired site. This was problematic however, because scientists couldn’t find meganucleases for all the sites of interest to them. New meganucleases were not easy to engineer. Moreover, the meganuclease-induced DSBs are mostly repaired by non-homologous end joining (NHEJ), which may be thought of as a rough and slipshod method of joining ends of the DNA. Hence, this method would not be suitable for introducing the desired foreign DNA in the correct manner into the genes.

This problem was partially solved when zinc finger proteins (ZNPs) were discovered. Instead of 14-40 base pairs long sites, these proteins could recognize sites that were only 3 base pairs long. Hence, given the (possibly long) base pair configuration of a desired site, we could attach multiple such zinc finger proteins to match the sequence at the desired site. In this manner, scientists could manipulate many more sites on the genome than before. Note that the zinc finger proteins would not be performing the actual cleavage: this would be performed by an endonuclease called Fok I, to which the zinc finger proteins would be bound. ZNFs can be thought of as the search party, and Fok I held the actual knife for cleaving.

The situation was further improved when scientists discovered TALE proteins, which could now recognize just 1 base pair instead of 3 bp long sites. However, even with this discovery, a lot of difficult engineering and re-engineering of proteins was required to target all possible sites of interest. The CRISPR gene editing technology turned out to be just as robust as these technologies, if not more, and also much easier to use!


CRISPR stands for clustered regularly interspaced short palindromic repeat DNA sequences. These highly repeating DNA sequences, interspersed with non-repeating spacer genomic sequences, were first observed in Escherichia coli, although they were later observed to be present in more than 40% of all bacteria and 90% of archaea. These CRISPR sequences form the backbone of the bacterial immune response to invasion by bacteriophages. A horrifying video of such a bacteriophage invasion is present here. In response to this invasion, the bacteria would store a part of the foreign DNA of the invaders in the form of spacers. Hence, the CRISPR may be thought of as a library in which you keep records of all the invaders that have wronged you. In the future, if the same phages attacked the bacteria again, their DNA strands would get recognized and destroyed.

How does all this happen though? After detecting and storing the DNA sequence of the invading genome, the CRISPR system makes copies of this sequence and stores them in two short RNAs- the mature crRNA and the trans-activating crRNA. Both of these RNAs activate the Cas9 enzyme, which goes in search of this particular DNA sequence. When it does detect the sequence, it cleaves the genome, rendering it useless. However, the CRISPR system itself also contains a copy of this sequence! How does the Cas9 protein know that it should not cleave the CRISPR DNA sequence? This is because of propospacer-adjacent motifs or PAMs. PAMs are base pair sequences that are present in the invading genome but not in the CRISPR sequence. Hence, before cleaving, the Cas9 enzyme checks if the genome contains the relevant PAM or not, and cleaves the DNA sequence only if such a PAM exists.

Scientists soon realized that they don’t need to go through the whole shindig of first letting a foreign genome attack a cell, and only then getting the required genome sequence in order to look for DNA sites to cleave. They could just directly engineer the two crRNAs containing information about the DNA site which they wished to have cleaved, and the Cas9 enzyme would do the rest. Better still, instead of two, they could just manufacture one RNA- the guide RNA or sgRNA! This idea caught on pretty quickly, and since 2012, when the field was created, there have been over 10,000 articles written on this topic to date.

Just to be clear, CRISPR sequences, Cas9 enzymes, etc are not naturally found in human cells. They would have to be extracted from bacteria and other prokaryotes, and then put inside eukaryotic cells like those of humans. Moreover, the cleavage of DNA sites by Cas9 enzymes is only half the story. If scientists wish to add sequences to the genome, they would have to ensure that these sequences have already been accepted into the cell. The cleavage just speeds up the process of modifying the genome by adding these sequences.

Different CRISPR systems

There are two CRISPR classes- Class I, which contains types I and III of CRISPR, and Class II, which contains types II and IV. The most commonly used type is the type II, which is found in Streptococcus pyogenes (spCas9). However, researchers have also identified 10 other different types of CRISPR proteins. A table of some of them is given below:

As one can see, each protein recognizes a different PAM sequence in the genome before cleaving, and hence is suitable for attacking different types of invading genomes.

Because Cas9 or other cleaving proteins are not naturally found in human cells, they have to be packaged and delivered through Lenti or Adeno Associated Viruses (AAVs). This can be a problem if the proteins are big. For instance, the spCas9 protein is 1366 aa. Although some smaller cleaving proteins have been discovered, they have the disadvantage of having really complex PAM requirements. For instance, although the SaCas9 is only 1053 aa, it requires a PAM sequence of 5′-NNGRRT-3′. Here, 5′ and 3′ denote the ends of a DNA sequence. Because very few (invading or non-invading) genomes contain this particular sequence, SaCas9 can target very few types of invaders.

Re-engineering CRISPR-Cas9 tools

Scientists are curious about whether they can re-engineer the naturally found Cas proteins to change their sizes, PAM requirements, etc. They also want to improve the target specificity of these proteins, so that they don’t go cleave the wrong DNA sites. Unfortunately, Cas9 proteins have a natural propensity to not be too site-specific, as they were mainly used in bacteria to attack constantly mutating bacteriophages. In order to study the specificity of Cas9 proteins, scientists tried to map the DNA binding sites of catalytically inactive SpCas9. They saw that the protein was more likely to bind with open chromatin regions. Also, the cleavage rates at sites of incorrect binding were quite low. This was good news, as even though these proteins would bind with undesired sites, they wouldn’t do as much harm there.

Scientists have spent a lot of time thinking of ways to reduce off-site binding and improve target specificity. One method that is useful is changing the delivery method of the Cas9-sgRNA complex, from plasmid-based to delivery as a ribonucleotide protein (RNP) complex. This complex makes the Cas9 protein relatively inactive, and hence less likely to bind to the wrong site in a flurry of activity. Another method is to have two separate sgRNAs direct a nickase Cas9 (nCas9), attached to a Fok 1 enzyme, to cleave a certain site of the genome. A nickase Cas9 protein or nCas9 cleaves only one strand of the DNA helix, and not both. Hence, for such a complex to cleave the wrong site of the genome, both the nCas9 proteins have to make a mistake, which has a smaller probability than just one of them making a mistake. Obviously the two nCas9 proteins are slightly separated, and contain different sequences. Other ways of affecting specificity of the cleaving proteins are increasing or reducing the length of the sgRNAs, attaching self-cleaving catalytic RNAs to the sgRNAs to regulate Cas9 action, using optical light to regular Cas9 approaches, etc.

CRISPR beyond genome-editing

What if we just want to identify relevant genome sites, and not cleave them? For this purpose, we can use catalytically inactive dead Cas9 proteins, or dCas9. How are dCas9 proteins formed? A regular CRISPR-Cas9 protein has two catalytic domains- HNH and RuvC, which cleave one DNA strand each. Point mutations in either of them render them ineffective. Hence, a point mutation in only one of them gives rise to a nickase Cas9, and point mutations in both gives rise to a dCas9.


In this section, we will primarily talk about the nCas9. The nickase cas9, or nCas9, is quite useful for converting one base into another, without cleaving both strands of the DNA and hence possibly introducing harmful indels (indels or insertions/deletions are arbitrary insertions or deletions of base pairs in the DNA strand). Komor et al discovered that nCas9, fused to an APOBEC1 deaminase enzyme and a UGI protein, can change C to T without cleaving both strands of the DNA helix. Similarly, another nCas9 complex is now able to change A to G. Scientists can now subsequently introduce STOP codons in genes. A STOP codon is a trinucleotide (can be thought of as a sequence of three bases) present in the RNA, that halts the production of proteins when instructions are bring read from the mRNA. Hence, the distance between the START and STOP codons determines the number of amino acids in a protein molecule. Scientists realized that by changing C to T, scientists could change the trinucleotides CGA, CAG and CAA to TGA, TAG and TAA, which are the three STOP codons. Hence, scientists could effectively manipulate the production of proteins in the ribosomes. Another route that scientists have gone down is forming an nCas9-AID complex, where AID stands for activation-induced adenosine deaminase enzyme. In the absence of UGI, this complex supports local mutations, and hence is a powerful gain-of-function screening tool. Gain-of-function screening tools are those that identify which genes are most suitable for mutation in order for the organism to develop a desired phenotype. Hence, the nCas9-AID complex can introduce mutations at multiple genes, and then select the most suitable.

Gene expression regulation

In this section, we primarily deal with dCas9 or catalytically dead Cas9, because we don’t want to cleave any DNA sites.

Gene expression is the process by which a gene is converted into a final product, which may be a protein, non-coding RNA, etc. Hence, regulating gene expression is an important goal for researchers: essentially, we wish to induce “beneficial” genes to express at a higher rate, and the “bad” genes to not express at all. dCas9 was found to tightly bind to DNA sites, and prevent other proteins such as RNA Polymerase II to bind there and start transcription. This phenomenon was exploited to form the CRISPR interference approach or CRISPRi. Notably, attaching a Kruppel-associated Box or KRAB complex to dCas9 results in an even stronger gene repressor. It has been shown that KRAB-mediated gene repression is associated with deacetylation and methylation of histone proteins. Wait, what are those?

Histones are proteins around which the DNA double helix wraps itself, both at the actual targeted gene, and also at the promoter and enhancer sites of the gene. When acetyl groups are attached to histone molecules, the helix unwinds, and becomes ready for transcription. When these acetyl groups are removed (temporary change), or replaced by methyl groups (permanent change), the DNA helix wraps itself even more tightly to the histone proteins, and hence is not expressed. For the H3 histone protein, it has been noticed that the repression activities of the KRAB-dCas9 complex occurs through H3 deacetylation and increased H3 methylation, especially in the H3 proteins present in the promoter and distal (far away) enhancer regions of the targeted gene. This picture is quite complicated, however, and is explained in more detail later.

In contrast, dCas9 can also promote gene transcription (and hence expression) through fusion with VP64, which is composed of four identical repeating units of VP16, a 16-amino acid chain found in the Herpes simplex virus. Other dCas9 complexes that promote gene expression are SunTag, VPR and SAM. SunTag has a dCas9 fused protein scaffold that contains a repeating peptide array, that is used to recruit multiple copies of an antibody fused effector protein. These effector proteins bind with the histone modules and regulate gene expression. SAM is just a complex of gene expression-promoting proteins comprising of dCas9-VP64 and MCP-fused P65-HSF1. The latter is carried to the target site in an engineered sgRNA scaffold. VPR is a complex of VP64, P65 and Rta proteins, all of which also enhance gene expression. CRISPR regulates gene expression, but the actual expression of the gene happens through the regular mechanism of the cell itself, as opposed to other approaches in which gene expression may be facilitated by foreign elements. Hence, this process is more robust and less prone to errors.

Epigenome editing

Epigenetics refers to the mechanism of differential gene expression, even though the genome might be the same. Hence, two identical twins with the same genome are different in many ways because of differences in gene expression. The “epigenome”, on the other hand, refers to the set of molecules that attach to the genome in order to regulate gene expression. All of this is explained beautifully in this video. Epigenome may also influence post-translational modifications of features. Despite recent epigenomic mapping efforts like the Encyclopedia of DNA elements (ENCODE), the functioning of even basic epigenomic features like histone modifications and DNA methylation remain poorly understood. Scientists now hope to use dCas9 complexes to add or remove epigenetic markers at various locations on the genome, in order to study their impact on gene expression. We have already seen how dCas9 induced DNA methylation at the promoter or enhancer sites leads to gene suppression. It is known that many disorders including some types of cancer are caused by aberrant methylation (too much or too little). Although some drugs exist to counter this, they act on the whole genome globally, and hence may affect undesired sites. Some dCas9 complexes like DNMT3A can rectify this by promoting methylation only at the targeted sites. Note that even Cas9 proteins are not known for being very target specific, and are often found bound to undesired sites. However, gene expression does not change at these undesired sites. This makes DNMT3A a useful complex to promote methylation.

On the other hand, if we want to suppress excessive methylation, TET proteins are pretty useful. Researchers formed dCas9-TET1 complexes to promote demethylation at desired sites. The outcome was found to be robust, as there was a 90% reduction in methylation at CpG dinucleotides. The impact at off-target sites was yet to be studied.

Although methylation is seen as a way of suppressing gene expression, it can also promote gene expression in some cases. This phenomenon is beautifully explained in this video. Histone proteins contain 4 types of residues- H2A, H2B, H3 and H4. The H3 residue contains both the H3K4 and the H3K27 sites. Both the acetylation and methylation of H3K4 promote gene expression, while the trimethylation of H3K27 only suppresses gene expression (acetylation of H3K27 promotes gene expression though). This duo can act as a bivalent regulator of gene expression, in which one part promotes and the other represses gene expression. Researchers are curious about controlling the methylation and acetylation of histone residues via dCas9 complexes.

In order to control the methylation and acetylation of H3 resides, researchers used a dCas9 complex to recruit LSD1 at the desired sites to reduce the number of enhancers H3K4me2 and H3K27ac (remember that the acetylation of H3K27 has made it an enhancer). Hence, this complex serves to repress gene expression. On the other hand, the dCas-P300 complex results in a significant increase in the number of H3K27ac, which promotes gene expression. Other dCas9 complexes have also been used to increase H3K4me3, which promotes gene expression, or reduce H3K27ac, which represses it. The global footprint (impact on the genome globally) of such dCas9 complexes is still unknown.

CRISPR-mediated live cell chromatin imaging

Although technology to image specific parts of the genome has existed for some time, it was mainly done in vitro (in a test tube) through Fluorescent In-Situ Hybridization (FISH) methods, and not in vivo (in a live organism). The development of CRISPR has revolutionized live cell chromatin imaging.

But “how bright does the bulb have to be”? If we imagine the dCas9 complexes as bulbs that attach themselves to desired genomic loci, these are likely to be too small and faint to register on our machines. Hence ideally, such complexes should target repeating genomic sequences that are close together, so that multiple such bulbs can go and attach themselves to each of these repeating sequences, giving out a brighter light that can be registered. For a non-repeating sequence, 26-36 sgRNAs need to attach themselves to one single sequence in order to produce a clear enough signal. So many sgRNAs attaching themselves to a single site is statistically quite unlikely. To overcome this problem, researchers came up with an sgRNA scaffold containing 16 MS2 binding molecules. All of these molecules travel together, and hence attach themselves to the binding site when the sgRNA reaches the desired loci. Put together, these now generate a strong enough signal for imaging. Using these scaffolds, repeated genomic sequences can now be imaged with just 4 sgRNAs, and non-repeating sequences can now be imaged with just 1 sgRNA, as explained above.

Manipulation of chromatin topology

Chromatins are strands of DNA that are arranged linearly. What if we could bring the promoter and enhancer for a gene closer together, or push them further apart? Would this affect gene expression? Yes! That is why researchers are interested in forming chromatin loops, or change the topology of chromatin strands in other ways. Morgan et al took two dimerizable proteins (proteins that had a tendency to attract each other and form a bond), and attached them to two different dCas9 complexes. These complexes now attached themselves to the promoter and enhancer regions separately, and then the dimer bond formation between the two proteins brought the promoter and enhancer closer together. This did result in an increase in gene expression.

Large-scale screenings

How do we find out which gene affects a particular phenotype, say cell proliferation? Checking each gene out of the millions available is surely a daunting task. What if we could check thousands of genes at once? This task is accomplished using hundreds of thousands of sgRNAs in a large population of cells. The way that this works is this: we ensure that each cell receives one or less sgRNA, and each gene is targeted by 6-10 different sgRNAs. This means that at at least 6-10 cells are used to study the impact of one gene on the desired phenotype, which in this case is cell proliferation. The sgRNAs which hit the correct gene will cause their cells to proliferate fast, and the other cells will die out eventually. This helps us zero in on the gene which causes cell proliferation. Of course we will have to keep track of which sgRNA goes to which cell, which will allow us to make the right deductions.

Future directions

One major aim for future researchers should be to reduce the size of the existing Cas proteins, so that they may be easily transported using virus vectors. Another aim should be the careful design of CRISPR procedures, so that “gene drives” that potentially impact entire populations do not cause harm in the long run.

An important obstacle to overcome is the fact that more than half of all humans experience an immune response to the introduction of Cas9 proteins in cells (this is called immunogenicity). One possible solution to this problem is the development of Cas9 proteins to which humans have not been exposed before, so that we don’t have an immune response against them.

CRISPR has great potential to benefit society and eradicate formidable diseases. I am excited to see what comes next.


  1. The CRISPR toolkit for genome editing and beyond
  2. https://mbio.asm.org/content/5/4/e01730-14
  3. https://www.youtube.com/watch?v=vP23dkY0mPo
  4. https://www.youtube.com/watch?v=lLvdxtPaYGM
  5. https://en.wikipedia.org/wiki/Bivalent_chromatin
  6. https://en.wikipedia.org/wiki/Epigenetics
  7. https://www.youtube.com/watch?v=iSEEw4Vs_B4

Notation in Riemannian Geometry

I have always found the notation in Riemannian Geometry to be very confusing. How and why are we doing the things that we’re doing? What does all this abstruse notation mean? This is my attempt to write a helpful guide for anyone starting out in this field.

  1. Why the affine connection? Why this notion of derivative in particular?

A common sentiment that goes around in mathematical circles is that we need a coordinate invariant notion of a derivative. When we say \frac{\partial f}{\partial x}, we are specifying a Euclidean coordinate chart, using which we are differentiating the function f. But Euclidean charts are not always the most convenient setting for calculations- sometimes we need polar coordinates, for instance. Hence, if we could represent equations in a way that does not assume a coordinate chart, it will make life much simpler for us. There would be no complicated Euclidean-to-polar coordinate conversion operations, for example.

Let us now dig slightly deeper into what a coordinate invariant mathematical expression actually means. Suppose we have a physical law saying that a quantity f exists such that \frac{\partial f}{\partial x}=1. Now if we have a transformation x\to x' such that x'=\frac{1}{2}x, then we know that this law cannot hold true anymore. This is because if \frac{\partial f}{\partial x}=1, then \frac{\partial f}{\partial x'}=\frac{1}{2}. Hence, when we state this physical law, we also have to specify the coordinate system that we must choose.

Much importance, at least in Physics, is given to the fact that there is no preferred coordinate system. All inertial systems have the same Newton’s laws. In Special Relativity, we find quantities that are Lorentz invariant. Why can we not just specify the coordinate system each time we mention a law? This is because things can get unmanageable and cumbersome if we propose a different law for each moving reference frame. Moreover, these “laws” might also change when we change units of space and time: in fact a choice of such units is also a choice of coordinate systems. Therefore, physical laws should be such that regardless of whether we choose metres of feet, and regardless of whether we choose Euclidean coordinates or polar coordinates, they remain invariant. I can now choose my preferred coordinate system which simplifies calculations the most, and then arrive upon the answer.

Now that we’ve established that we need a coordinate invariant notion of a derivative, why the affine connection in particular? Mainly because the properties \nabla g=0 and \nabla_X Y-\nabla_Y X=[X,Y] simplify a lot of calculations. These are just constraints that we put on the definition, which give us a unique connection. We could also have put other constraints, and perhaps gotten a different unique connection.

How do we know connections are coordinate invariant? Because connections, by definition, have the property that \nabla_X=X^i\nabla_i. Hence, the coordinate invariance property follows from the definition itself. When we don’t specify a specific coordinate system, and claim that a certain mathematical expression holds in general, we have written down a coordinate invariant expression. This is exactly what we do here.

Another important point to note is that because connections follow the product rule of differentiation, the difference of two connections is always a tensor. Hence, if \nabla is the affine connection and \nabla' is any other connection, we can just define the tensor \nabla-\nabla', and we’re done. How do we define the affine connection in an intuitive way then? We seem to have a lot of choice, as we can choose any other already defined connection \nabla', and then write \nabla=\nabla'+ some tensor. Here, we choose \nabla' to be Euclidean differentiation. This allows us to interpret the affine connection as a “correction” to regular differentiation.

2. Why do we deal with abstract notation at all? Why do we have something like g^{IJ}\nabla_I\nabla_KT_{JL}? The indices show what kinds of mathematical objects we are dealing with. The \nabla_I, for instance, tells us that it takes in a vector. g^{IJ}\nabla_I\nabla_KT_{JL}, when it accepts 2 vectors, will become a function.

Let us now consider the tensor g^{IJ}\nabla_I\nabla_KT_{JL}(X,Y). How do we know where X and Y go? Does X go the the \nabla_I or the T? We solve this conundrum by the following rule: X goes to the left-most place it can go, and Y goes to the left-most place after that. Another, perhaps clearer way of saying this is that we contract X with the I index and Y with the L index. We use this notation because of the tensorial nature of this mathematical object- when X goes to the \nabla_I, we get X^I\nabla_I.

So is that it? Is this expression equal to g^{IJ}\nabla_I\nabla_KT_{JL}X^I Y^L? Yep. It’s as simple as that. But this doesn’t “mean” anything. Let me try and elaborate on this statement. This is just abstract notation. Fluff. Refined nonsense. We know which vector goes where. We have some information about the mathematical object we are dealing with. However, performing actual calculations is a completely different beast.

How do the calculations go, though? We first select a coordinate system and vector space basis elements. We then perform the tensorial differentiation via the connection \nabla. It is only then that we plug in the vectors into the right places. What does the g^{IJ} do? There are again two levels of understanding- one level is just manipulating this expression abstractly, and another is actually choosing a coordinate system and calculating the final expression. For the abstract level, we can just write this as \nabla^J\nabla_KT_{JL}. Now the actual calculation: suppose we choose an orthonormal basis. Then we can write g^{IJ}\nabla_I\nabla_KT_{JL}(X,Y) as \sum\limits_{i}\nabla_I\nabla_KT_{JL}(e_i,X,e_i,Y), and then simplify. Let us simplify this particular expression. This becomes \sum\limits_{i}\nabla_{e_i}(\nabla_KT_{JL})(X,e_i,Y), which simplifies to


Each of these terms can also be simplified using the same rules of tensorial differentiation. Hence, the actual calculation is a long iterative process. When we deal with these expressions abstractly, however, manipulations are generally substantially shorter.

3. Whenever we perform calculations at a point, they become substantially shorter and easier. Why? And what does performing a calculation at a point even mean? When we select a tensorial operation and vector or co-vector fields to operate on, we are selecting global entities. All of the mathematical objects defined above are defined over the whole space or manifold. However,if X|p=\tilde{X}|_p, then T(X)|_p=T(\tilde{X})|_p. where T is a tensor. The utility of this fact is that given a complicated vector field X, we can choose a really simple \tilde{X} with special properties that will make life easy. For instance, when dealing with tensors, we can always choose \tilde{X}_i such that they \nabla{e_i} X_j=0. This substantially simplifies calculations. However, the most common pre-requisite for such drastic simplifications is that we are dealing with tensors, which is something one should always check.

4. Raising and lowering indices- We know that a lowered index means that the tensor accepts vectors, and a raised index means that it accepts co-vectors. However, why do we raise a lowered index, and vice-versa? We will first talk about lowering an index. Consider a vector V^I. We can lower that index via g_{IJ}V^I=V_J. In common parlance, we say that V_J is now a co-vector. But how did we magically get a co-vector by just multiplying with g_{IJ}? Let us see what happens when we contract a vector W^J with V_J. We get V_JW^J=g_{IJ}V^IW^J, which is the inner product of the two vectors! Hence, g_{IJ} converted a vector into a co-vector because it transformed V^I\to \langle V_I,- \rangle.
The same can be said about the raising of indices for co-vectors. Because inner products of all kinds of tensors are defined only using the metric, g_{IJ} or g^{IJ} are involved in raising or lowering indices for all tensors.

5. What does \delta^I_J do exactly?
Essentially \delta(e_i,\omega^j)=1 if and only if i=j. This implies that \delta^I_JV^J=V^I because, by definition, (\delta^I_JV^J)(\omega_I)=1, and V^I is the unique vector with this property.

6. Who came first, the dual or the metric? If we think of a function f as a co-vector, then we know that its dual \nabla f can only be defined in terms of its metric. Hence, we can conclude that duals are not defined independently. The dual of a vector V^I is a covector V_J such that given a vector W, V_JW^J=\langle V^I, W^J\rangle. In fact, the dual V_J doesn’t have to be such that V_JV^J=1. The value of V_JV^J=g_{IJ}V^IV^J completely depends on the metric.

7. What does it mean to raise the index of \nabla? In other words, what does \nabla^J=g^{IJ}\nabla_I mean? When we contract a co-vector W with \nabla^J in the form \nabla^JW_J, then what we are really doing is determining the inner product g^{IJ}\nabla_I W_J. However, \nabla is an operator that acts on other tensors. Hence, the actual calculation is a completely different story than this abstract nonsense. For co-vector W, consider \nabla^I(W)=g^{IJ}\nabla_J(W)=g^{IJ}\nabla_J W_I=\sum\limits_{i,j}(\nabla W)(e_j,e_i). Like before, the actual calculation requires substantial simplification before we can just bring in the vectors and sum over everything.

Neuromorphic Computing

I’ve searched far and wide for an easy to understand review paper on Neuromorphic computing, and finally got lucky with Neuromorphic Computing: From Materials to Systems Architecture. This is a report from a roundtable conference organized in 2015 to discuss the future of Neuromorphic computing. The organizing committee of the conference comprised of Evan Schuller from UC San Diego and Rick Stevens from the University of Chicago and Argonne National Laboratory.

Why Neuromorphic Computing

Currently, 5-15% of the world’s energy goes into powering computing devices. Surely we’d love to have devices that consume much less power. We are, in fact, acquainted with a computing device that consumes much less power- the human brain. Can we make computing devices that mimic the human brain in terms of power consumption, given that we are ready to give up some of the accuracy that modern computational devices possess? Turns out that we can…if we can construct working neuromorphic devices.

Other capabilities that a brain-like device might possess are speech and image recognition. Clearly, human brains are capable of recognizing people and understanding what they are saying. Despite impressive capabilities in other areas, conventional devices are still thought to be pretty bad at this, despite consuming much more power. Hence, it might do us good to manufacture brain-like devices.

But wait. Devices have been created which are “pretty” good at speech and image recognition. Systems have now been created with 5 billion transistors on a single die, and feature sizes approaching 10 nm! Moreover, parallel computing helps us put multiple such devices to work at the same time. Surely things are looking good in terms of capabilities (if we can forget that such devices consume millions of times more power). However, this dream cannot last forever. There will come a time when we cannot pack any more transistors on a single die. Because of energy dissipation, we will not be able to pack transistors more densely (otherwise, because of the large amount of energy dissipated in too small an area, the whole die may catch on fire or be severely damaged). In short, we will have to create a fundamentally different kind of technology, that does not just involve packing more and more transistors on a chip. Neuromorphic devices promise to be this different kind of technology.

It is important to note that neuromorphic computing may not be as accurate as conventional computing devices. Hence, they need only be implemented when lower power “functional” approximations are required, as opposed to high power precise values. For instance, if you need to find out whether Jupiter is closer to us or the Sun, we need only know the approximate distances in millions of miles, and not up to inches. Neuromorphic computing may be useful to us in these situations.

Another feature that is important to us is that neuromorhic devices should be able to learn from past experience, much like humans do. If you take two people and put them in the same situation, say lock them up in a room, they may react differently. One person may panic, possibly because of an earlier clasutrophobic experience, while the other person may stay relatively calm, knowing that someone will most likely rescue them, based on a similar past experience. We want neuromorphic devices to also have different reactions to the same situation. This means that we want neuromorphic devices to have self-learning capabilities when exposed to external stimuli, and we want this self-learning to change them and their reactions to future stimuli. This is not what happens in conventional devices, as all devices of the same kind react in the very same way to external stimuli.

Von Neumann vs Neuromorphic Architecture

The von Neumann architecture, named after the famed polymath John von Neumann, has the processor and memory units separated. Whenever a computation needs to be performed, the processor calls for the required information from the memory unit, performs the computation, and then stores the answer in the memory unit. Although processing power and storage capacity have both increased over the years, the rate of transfer of information between memory and the processor has stayed relatively stagnant. Hence, this is a bottleneck that is preventing faster computations.

Neuromorphic devices on the other hand would have their memory and processing units located right next to each other. In fact, in the human brain, there is not such a clear demarcation of what the storage unit is and the processing unit is, and different parts take on either or both roles based on learning and adapting. Hence, the bottleneck present in von Neumann architectures can be avoided.

Also, at the device level, the von Neumann architecture is made up of resistors, capacitors, etc. These are elements that perform clearly differentiated functions, and are not fault tolerant. If the resistors in the device stop working, the whole device stops working. The human brain, and hence the neuromorphic architecture, consists of just one element- the neuron. The neuron consists of dendrites, the soma, the axon and synapses. While these parts do perform some specific functions, most of them multi-task, and can learn to perform the tasks of another neuron or another part of the same neuron. Hence, neuromorphic architecture is fault resistant by design, which means that if some part of it stops working, the device can still adapt and keep going.

Comparing the two architectures

Let us now directly compare the two architectures:

In the advantages column, black font shows an advantage for the von Neumann architecture, and red font shows an advantage for the neuromorphic architecture. It becomes clear that although neuromorphic devices are much less reliable than conventional devices, they consume a million times less power, and neurons can also be much more densely packed on a chip than transistors because of the smaller amount of energy dissipated. Note that the above data for neuromorphic systems was not gathered from actual neuromorphic devices (good scalable devices are yet to be built), but from simulations of such devices on conventional transistor chips.

Existing Neuromorphic Systems

Many existing devices can be modified to approximate neuromorphic devices. On one hand, we have artificial neural network architectures whose processing algorithms depend on matrix operators (which are much faster than other algorithms). These also consume much less power than conventional CPU devices, although they are less accurate. On the other side of the spectrum we have the actual digital or analog implementation of neurons- dendrites, some, axon and synapses. Axons are implemented as conventional wires, and synapses can be constructed to show learning with time (this is called spike dependent plastic synapses or STDP). The analog implementation is much more power efficient than the digital implementation, and also requires less transistors (yes, both still require transistors). However, the brain is still four or five orders of magnitude more power efficient.

Some devices, like IBM’s TrueNorth chip and FACETS at the university of Heidelberg, contain millions of neurons and billions of synapses. Hence, they may be coming close to approximating the complexity and power of the human brain. Why are these more synapses then neurons? A synapse may be thought of as a connection between one neuron and another. Each neuron needs to be connected to very many other neurons to be useful. Hence, the number of synapses needs to be orders of magnitude bigger than the number of neurons. Existing neurotrophic chips restrict the number synapses per neuron to 256. We might need many more synapses per neuron in order to become “more like the brain”.

Also, in current neuromorphic devices, synapses are arranged in a dense crossbar configuration.

Synapses being in this configuration implies that multiple neurons can be connected to each other at the same time, through direct contact. However, the crossbar cannot have an arbitrary fan-in fan-out ratio. Hence, the number of neurons that can be connected to each other through this crossbar configuration is clearly controlled.

Implementation Needs

These properties of the brain need to be replicated in our ideal neuromorphic devices of the future:

  1. Spiking- Neurons work only when they “spike”. The intricacies of how they work are explained in this beautiful video. Essentially, neurons lie at rest until their their voltage is disturbed to above a threshold value, after which they perform whatever function they’re expected to perform, and then go back to rest. It’s like a bored parent sitting in front of the TV, who doesn’t really care if the kids are dirtying the house or fighting with each other. They only get to action if they hear screaming or see blood, discipline the children, and then again go back to resting and watching TV. That sure sounds much more relaxing than a hyperactive parent who is constantly running around their children, screaming at them and micro-managing all their activities. Conventional devices are like those hyperactive parents, always working, and hence constantly drawing copious amounts of power. Neuromorphic devices would be like the bored parent, work only when they’re asked to work, and be at rest the rest of the time.
  2. Plasticity- Neuromorphic devices, we hope, will be self-learning. For this, they need to have a little plasticity- be able to change their properties and parameters as they learn.
  3. Fan in/fan out- Conventional devices have a much lower number of connections between various units than that required in neuromorphic devices. We still have to figure out whether this part is essential, or we can do with a smaller number of connections.
  4. Hebbian learning- A much used phrase in neuroscience is “neurons that fire together, wire together”. What does this mean? Imagine that performing a particular task involves neuron A sending a message to neuron B. At first, neuron B might not be as receptive to neuron A’s message, as the synapse between them might be weak or relatively ineffective. However, the more this synapse is activated, the stronger it becomes. After a few times, it becomes super easy for neuron A to transfer a message to neuron B, and hence it becomes much easier, faster, and almost second nature for a person to perform that task. This reaffirms conventional wisdom that practicing a skill will make you better at it. We need neuromorphic devices to have the same property- the more times one neuron activates another, the stronger their connection should become. In contrast, this does not happen at all in conventional computing, and your computer doesn’t run Fortnite faster if you’ve played it every day for the last 5 years.
  5. Criticality- The brain is thought to be in a critical state. By this, we mean that the brain, although not chaotic and hence relatively stable, is capable of changing itself as it is exposed to the varied experiences that life throws at us. We want neuromorphic devices to be the same way.

Building Blocks

The four building blocks of a neuromorphic device would be:

  1. Synapse/memristor- The synapse is perhaps the most important of the building blocks. It transmits messages between neurons, and strengthens or weakens with time. The device that can perhaps mimic it most convincingly is the memristor. I learned about this device a couple of weeks back, and it is by far my favorite piece of technology. This video does a fantastic job of explaining what a memristor is and what it does. It essentially acts like a transistor that is capable of handling very high voltages. The capability of a memristor that we are concerned with here is that it should be able to strengthen with time. Let us look at the diagrams below:

A regular transistor, as shown in the graph on the left, shows the same increase in current as we apply a voltage to it. Each time we apply this voltage, the current goes up by the same amount. A memristor on the other hand, as shown in the graph on the right, shows an increase in current every time we apply the same voltage. In other words, as opposed to a regular transistor, the transfer of messages by the memristor gets easier every time.

2. Soma/Neuristor- The soma is the body of the neuron where the threshold voltage is reached, which causes the neuron to spike and send a message to other neurons. One possible implementation is a capacitor coupled with a memristor. The capacitor, when it reaches the threshold frequency, would activate the memristor, which in turn would send the message.

3. Axon/Long wire- The axon was previously thought to be responsible only for signal conduction between the some and the next neuron. However, it is now known to also be responsible for signal conduction. It can be implemented using a long wire.

4. Dendrite/Short wire- Dendrites are the input wires that bring messages from outside to the soma. They have also been shown to possess pattern recognition capabilities. They can be implemented using short wires.

Issues with Implementation

There are multiple issues to be resolved, if we implement neuromorphic devices are suggested above.

For one, memristors often show chaotic behavior. They are designed to show huge increases in drain current when a small voltage is applied to them. Hence, when this voltage difference exists by accident in the environment, memristors can allow huge currents to pass when not required to do so. They can also inject previously stored energy into the device, which leads to undesired voltage amplification.

The way that a synapse implementation can work is the following: synapses are required to show stronger connections the more they’re activated. Hence, if two electrodes are separated by an insulator to not allow large current to pass, as those electodes are activated more and more, we need metal impurities to slowly form between the two electrodes in order to allow currents and hence signals to flow. However, the rate of formation and positions of these impurities is still random, and hence a lot more study is required so that we may be able to control this “strengthening” of the synapse.

Spin torque switching happens when a relatively large current of polarized electrons causes the magnetic field in an electromagnet to switch. This article is a beautiful explanation of how this happens. Such devices are used in Magnetic RAMs and other storage devices. However, the stability of such a device, along with the effect of impurities on it at the nanoscale, still needs to be studied.

Nanoelectronics, on the whole, is a complicated endeavor. Effects of impurities, Oersted fields, etc, which are not markable at larger scales, become all important at the nano scale. If we are to squeeze millions of neurons onto a chip, we have to study nanoelectronics. Hence, the amount of progress we can make in this field directly controls the amount of progress we can make in building powerful neuromorphic devices.


This paper, as noted before, was written at a conference which hoped to lay down a roadmap that computer scientists and material scientists could follow to make scalable neuromorphic devices. Its recommendations, for such a dream to reach fruition, are the following:

Computer scientists and material scientists should work together . Computer scientists can worry about optimizing algorithms and putting everything together, while material scientists can worry about the best possible implementation of the building blocks. The successful construction of such a device will follow only from singling out the best possible materials for the various parts, which can come only from a deep grasp of the quantum properties of materials along with a good understanding of nanotechnology.

Moreover, the device should exhibit three dimensional reconstruction, external and internal healing, distributed power delivery, fault tolerance, etc. It is important to understand that the construction of such a device will not happen overnight, and hence scientists should recognize and implement useful intermediate steps, that will ultimately result in a neuromorphic revolution.


  1. Neuromorphic Computing: From Materials to Systems Architecture
  2. Memristors