The paper that I wish to discuss today is An Introduction to Topological Data Analysis: fundamental and practical aspects for data scientists, by Frederic Chazal and Bertrand Michel. Topological data analysis is an exciting new field, and this paper can be understood by people from a wide range of backgrounds.
Notation: For this paper, denotes a topological space, and a sample of points from it.
Topological Data Analysis works on the assumption that the topology of data is important. But why? Let us take an example from physics. Suppose we want to study the energy emitted by atoms after absorbing sunlight. We observe that energy emitted by atoms forms a discrete set, and not a continuous one. We take a large number of readings, and observe the very same phenomenon. Hence, we can conclude with a high degree of probability that the topology of the energy states of atoms is discrete. This is one of the most fundamental discoveries of modern Science, and heralded the Quantum revolution.
It becomes clear from the above example that understanding the topology of data can lead us to understand the universe around us. However, we have to “guess” this topology of the “population” from a much smaller “sample”. Topological Data Analysis can be thought of as the study of making “good” guesses in this realm.
Metric Spaces, Covers, and Simplicial Complexes
Let us take a metric space . The Hausdorff distance between two compact sets is defined as . It is essentially a measure of how “similar” two compact sets look. We need compactness, or rather boundedness, because we need the limits to be well-defined. However, what if are not necessarily subsets of the same space? Gromov generalized the above definition in the following way: The Gromov-Hausdorff distance between two compact sets is the infimum of all positive real numbers such that , where is the isometric embedding of in some manifold . Essentially, we want to see how “close” the two sets can get across all possible isometric embeddings in all possible manifolds. As one can imagine, calculating it is a seemingly impossible task in most situations, and it is primarily useful when an upper bound to it implies other useful mathematical facts.
Now given a set of points of affinely independent points, we can make a -simplex in called the convex hull of the points. A simplicial complex is a collection of simplices such that
- Any face of a simplex is also a simplex. We need this condition because we want to define a boundary map from the simplicial complex to itself, which will allow us to calculate topological invariants like the homology of the complex.
- The intersection of two simplices is either empty, or a common face of both. This condition ensures that only topological “holes” are detected in homology groups, and not other topological features.
Note that can be thought of as both a topological space and a combinatorial entity. The topological perspective is useful when we’re trying to break up a topological space into simplices in order to calculate homology, and the combinatorial perspective is useful when we use simplices to represent mathematical entities that are not originally topological spaces. For instance, polynomial rings can also be studied using simplicial complexes, where each vertex corresponds to a homogeneous polynomial. Of course, each combinatorial simplicial complex can be given a topological description. Also, note that the combinatorial description is more suitable than the topological one in the realm of algorithms and machine learning.
Vietoris-Rips Complex: Given a set of points with pre-determined distances, form all simplices of the form whenever the distance between any pair of points in is at most . If , we might even have simplices that cannot be fit into . The set of all such simplices forms a simplicial complex in some , where might be bigger than . As we increase the value of , this simplicial complex, and consequently its homology groups, will change.
Čech Complex: Given a set of points with pre-determined distances, we form -simplices when the -balls of points have a non-empty intersection. Hence, the maximum distance between two points now has to be , and not . A Čech complex is denoted as . How is this different from though? As shown in the diagram above, three pairwise intersecting discs give us a -simplex in , but just a -cycle in the corresponding Čech complex.
The Nerve Theorem: Given a cover of a manifold, we can form a Čech complex from it (note that the open sets here are , and not -balls as stated before). The Nerve Theorem says that if the intersection of any sub collection of is either empty or contractible, then the Čech complex is homotopy equivalent to the manifold.
The contractibility of intersections ensures that we do not “cover up” any holes in the manifold using our discs. But why the Čech complex, and not the Rips complex? This is because a hole would be covered up by three pair-wise intersecting discs in a Rips complex, but not in the Čech complex. Hence, the Čech complex is a useful way of preserving the homology of the underlying manifold.
Why is this theorem important? This is because it takes a continuous property of a topological space, and converts it into a combinatorial entity. For instance, if I only needed to know the homotopy class of a manifold for my problem, studying a simplicial complex on the manifold with the same homotopy type is much easier for me, as I can now feed the combinatorial data of the simplicial complex into well-known algorithms to make deductions about the space.
For a space , let be a continuous map. For a cover of , the sets form a cover of . Now consider the set of all connected components of . If the function and the open cover are well chosen, the nerve of this “refined” cover gives us useful information about the underlying space . Note that the don’t have to be contractible anymore. An example is given below:
The function here is the height function (look out for the slightly camouflaged yellow parts). This nerve of the two-holed torus does a decent job of representing its topology, although we fail to detect the small hole at the bottom because the cover chosen of is not fine enough.
In practice, however, we don’t map continuous manifolds, but just data points. A suitable example is given below:
From the nerve drawn on the right, one may conclude that the topology of the underlying population, from which the data has been extracted, is a circle.
Some popular choices of are the centrality function , or the eccentricity function . These functions do not require any special knowledge of the data.
As one may imagine, the choice of the cover determines the nerve that we get from the Mapper algorithm. Generally, one may choose regularly spaced rectangles which overlap with each other. If , then the length of the intervals is known as the resolution, and the fraction of overlap is denoted as . One must explore various resolutions and values of in order to find the “best” nerve of .
Now remember that the connected components of the `s form just the vertex set of the simplicial complex we are building. Although we could build a Čech complex from these pre-images, we may also cluster the vertices corresponding to the ‘s in other ways. For instance, we may build a k-NN graph with the points in , associate points to each , cluster them appropriately, and then only select the connected components of the subgraph whose vertices correspond to the ‘s. This is a different algorithm because we don’t care if the `s intersect anymore.
Geometric Reconstruction and Homology Inference
Let us now bring probability into the mix. Suppose we have a space with a probability measure . If we take a sample of points , we want the topology of to somehow approximate the topology of .
Now, we introduce some mathematical definitions. For a compact space , the -offset is defined as the set of all points such that . Why do we care about -offsets? Because they are much better at capturing the topology of the space around them. For instance, the homology of a loop is clearly different from the Euclidean space it is contained in. However, for appropriate values of , the -offset of the loop becomes contractible, and hence has the same homology as the surrounding space.
A function is distance-like if it is proper, and is convex. We need the proper condition because we don’t want unbounded sets to only be at finite distance from a point. Hence, properness ensures that only bounded sets are at a bounded distance from a point (with respect to which a distance function is defined). The second condition is that of semi-concavity, and I would like to re-write it in its more natural form: is concave. The reason that it is called “semi”-concave is that it is concave only when a very concave function is added to it. can actually be a convex function itself. Why do we want this condition here? This is because we want to generate a continuous flow using , and functions that “rise too fast” may have discontinuities when we generate this flow.
A point is said to be -critical if . It is a generalization of the notion of a critical point (where ). We want to find the smallest such that does not have any -critical values. This is known as the -reach of . What this means is that the level sets for will “flow” along the gradient of at a speed that is faster than , at least until they reach the level set . Why do we want level sets flowing into each other at all? This is because of their relation to Čech complexes. Consider the level sets in and . If the level sets of the first can “flow” into the level sets of the “second”, then there is no hole between them, and the two vertices corresponding to these inverse open sets can be joined by a line regardless of how these vertices have been clustered.
Isotopy Lemma– Let be a distance-like function such that has no critical points. Then the level sets of are isotopic for . Two topological sets are isotopic when they are homeomorphic, and one can continuously be deformed into the other. Essentially, the level sets in can essentially stay where they are, and the level sets inside can move because there are no critical points inside. Whether there are critical points in is irrelevant.
Reconstruction Theorem– This theorem essentially states that when two distance-like functions and are “close enough”, suitable sub level sets of both are homotopically equivalent. Of course it is unclear from the statement how these level sets “flow” into each other, and which function’s gradient field is chosen for this. Why is this theorem important?
Let and , which are the distance functions with respect to the support of on and respectively. The Reconstruction theorem can tell us that for appropriate values of and , the -offset of is homotopically equivalent to the union of the -offsets of the points in , which in turn is homotopically equivalent to the Čech complex formed by these offsets. Essentially, the Reconstruction Theorem provides the basis for studying the topology of using the nerve of .
An important result of Chazal and Oudot in this direction is the following: For a compact set, let the -reach of be . Also, let be a set of points such that -reach) of . Then for , . Here a suitable has been chosen. Why is this theorem important? Because it allows us to calculate the Betti numbers of using just information gleaned from the Rips complexes of .
Distance to measure
Note that in all of the theorems discussed above, and have been “pretty close” as metrics. This forms the basis of all our topological deductions. What if they’re not? What if we have outlier points in ? Note that it is not in general possible to select a point from outside of the support of the probability measure on , as the probability of selecting a point outside of it is by definition . Hence, the existence of such a point is purely due to noise, which brings in a probability distribution that is independent of $mu$.
To deal with noise of this sort, we have the notion of “distance to a measure”. Given a probability distribution and a given parameter , .
Note that this map can be discontinuous if the support of is badly behaved. Hence, to further regularize this distance, we define
A nice property of is that it is stable with respect to the Wasserstein metric. In other words if are two probability measures, then . Hence, is a good distance-like function to consider to analyze the topology of , or at least the support of .
In practice, is not known. We can only hope to approximate it from . Consider the probability measure on . Although the exact construction of is not mentioned (it might just be a discrete measure), the following formula has been mentioned: for
Here denotes the distance between and its th neighbor in . If the Wasserstein distance between and is small, which is what we can hope if we take a large enough sample from the population, then is pretty close to in the measure.
Persistent homology is used to study how the Čech complexes and Betti numbers change with a parameter (generally the radius of the overlapping balls). Why is it important? We can never really be sure if the homology of the Čech complex that we have is the same as the homology of the underlying space. Hence, why not study all possible homologies obtained from an infinite family of Čech complexes? Two spaces with “similar” such families of homologies are likely to be the “same” in some well defined topological sense. Hence, this is another attempt at determining the topology of the underlying space.
A filtration of a set is a family of subsets such that . Some useful filtrations are the family of simplicial complexes or .
Given a filtration , the homology of changes as increases. Consider the persistence diagram below:
I will briefly try to explain what is happening here. is the height function defined on this graph, and for looks like an interval that is expanding in size. When etc, new intervals are created, which die when they join with some other interval that was created before them. For example, the interval created when joins the interval created at when we reach the point . (value of at the birth of an interval, value of at the death of that interval) can be graphed as a point in , as shown by the red dots in the diagram on the right. We also add al diagonal points with infinite multiplicity. One way to interpret that is for all values of , an infinite number of intervals come alive and die. The reason why we add this seemingly arbitrary line is that when we try to sample a population of data points, we might receive some noise. Hence, we can create a small neighborhood of the diagonal, and only interpret the points that lie outside of the diagonal as genuine topological features of the underlying manifold. The points inside the neighborhood denote topological features that are born and die “too soon”, and hence might just be noise. More will be said about this later.
Given below is the corresponding diagram for Čech complexes. The blue dots correspond to the birth and death times of “topological holes”.
Note that in the diagram above, all the balls grow at the same time. Hence, we don’t have a clear way of choosing which red interval should “die” and which one should survive. We arbitrarily decide that if two balls that start at the same time join, the red line below remains, and the one above ends. The persistence diagram is is given on the bottom right.
Persistence modules and persistence diagrams
The inclusion diagram , where , induces the inclusion diagram of vector spaces . The latter inclusion diagram is called a persistence module. Why is this important? Because it shows, in real time, the evolution of the th Betti number of the diagram.
In general, a persistence module over a subset of the real numbers is an indexed family of vector spaces such that when , along with the property that . In many cases, such diagrams can expressed as a direct sum of “inclusions” modules of the form
where the maps are identity maps, and the rest are maps. In some sense, when the vector spaces in the persistence module are the groups, we are breaking up the for each -dimensional hole, and tracking when each hole appears and disappears. Chazal et al proved that if the map in a persistence module has finite rank for each , then it can be decomposed as a direct sum of “inclusions” modules in a well-defined way. One way to think of this is the following: a generic “element” of a persistence module is the set births and deaths of all topological -holes that survive at least one iteration (or increment in the value of ), between and . If all but a finite number of holes die only within one iteration, then each element of the persistence module can be thought of as a finite sum of “inclusions modules”.
The persistence landscape was introduced by Bubenik, who stated that the “dots” in a persistence diagram should be replaced by “functions”. But why? Because there’s not much algebra that we can do with dots. We can’t really add or multiply them in canonical ways. However, functions lend themselves to such operations easily. Hence, the inferences that we make about such functions may help us make inferences about the underlying topological space.
How do we construct such a function? Take a point . We just construct a function for a point that looks like a “tent”, by joining the points and by a straight line, and then joining and by another straight line. Three such “tents” for three different points are given below. They’re drawn in red.
The persistence landscape for this diagram is defined as , where is the th highest value in the set. For instance, the function in blue drawn above is .
A short note about the axes in the two diagrams above: the and on the left diagram correspond to time of birth and death respectively. For the diagram on the right, the axes denote the coordinates of the black “dots” on the functions. The “tent”-ed functions themselves may be thought of as a progression from left to right, in which a topological feature is birthed and then dies.
One of the advantages of persistence landscapes is that they share the same stability properties as persistence diagrams. Hence, if two probability measures are “close enough”, then their persistence landscapes will also be “quite close”.
Metrics on the space of persistence diagrams
We know that then two probability distributions are “close enough”, then the distance functions to those probability distributions are also “pretty close”. However, what about the persistence diagrams of those probability distributions? Does the persistence diagrams of two probability distributions being “close” imply that the probability distributions are also close? Before we can answer this question, we must find a good metric to calculate the distance between two persistence diagrams.
One such metric is the bottleneck metric, which is defined as
Here is a “matching”, which is the arbitrary bipartite pairing of points in with those in . Because the norm is too sensitive to “outliers”, a more robust metric is
Stability properties of persistence diagrams
But if two persistence diagrams are “close”, are the underlying probability distributions also “close”? We don’t know. But the converse is true.
Let and be two compact metric spaces and let and be the Vietoris Rips of Čech filtrations built on top of and . Then
We can also conclude that if two persistence diagrams are close, then their persistence landscapes are also close: Let and be two compact sets. Then for any , we have
Whether the closeness of persistence diagrams denotes the closeness of the underlying topological spaces remains woefully unanswered.
Statistical aspects of persistent homology
While talking about persistence homology, we have only talked about topological spaces, and not necessarily about probability distributions. We do so here.
Let be an underlying space with probability measure , and let be the compact support of this measure. If we take independent readings from this set- say , then we can estimate the space by . The probability measure on has support .
For some , let satisfy the condition . Then
Because we do not exactly know and hence the persistence diagram of , we can only calculate the probability of the persistence diagram of being close to that of . Clearly, as grows big, this probability becomes smaller. This can also be ascertained from the following formula:
approaches as . Here, is some constant.
Estimation of the persistent homology of functions
If two functions on a manifold are “close”, then the persistence diagrams induced by them are also close. More precisely,
This opens up a vista of opportunities, in that we can now study density estimators, regression functions, etc. But how? Suppose we do not know how to calculate the persistence homology of a complicated function. We take a more regular function that is “close” to it in the norm, calculate its persistence homology, and then be assured that the persistence diagram of the complicated function looks almost like the persistence diagram of the current, “better” function.
Confidence regions for persistent homology
When estimating a persistence diagram with an estimator diagram , we look for a value such that , for some . The gives us an upper bound on how “far” the two diagrams can be.
In some sense, if we were in the space of persistence diagrams (each point in this space is a persistence diagram), is the -confidence interval of . How does this translate to confidence intervals of the actual points in ? One way to do this is to center a box of side length at each of these points. Another way is to create an neighborhood of the diagonal line in the persistence diagram. The points outside of it are significant topological features of the sample, and are definitely preserved in . This is perhaps an important reason why we include the diagonal in persistence diagrams- points on the diagonal are unimportant topological features that might just be noise. Hence, we can infer the persistence diagram of the underlying space from that of the sample, as long as the points that we get are “far enough” from the diagonal.
Of course, all of this depends on whether we can successfully approximate the value of from the sample. Methods like the Subsampling Approach and the Bottleneck Bootstrap are important in this context.
Central tendency for persistent homology
Persistence diagrams are just a bunch of dots and a diagonal line. As they’re not elements of a Hilbert space, we cannot determine an “expected” or “mean” persistence diagram. Hence, we need to move to persistence landscapes, which are elements of a Hilbert space, and consequently lend themselves to such an analysis.
For any positive integer , let be a sample of points from . For a fixed , let the corresponding persistence landscape be . Now consider the space of all persistence landscapes (corresponding to different ‘s drawn from ), and let be the induced measure on , which then induces the measure on the space of persistence landscapes. It is now possible to calculate the expected persistence landscape, which is . This quantity is quite stable. In fact, the following is true:
Let and , where and are two probability measures on . For any , we have
Persistence homology and machine learning
Due to the highly non linear nature of persistence diagrams, persistence diagrams first need to be converted into persistence landscapes to be useful in machine learning. Such persistence landscapes have been useful in protein binding and object recognition.
The construction of kernels for persistence diagrams has also drawn some interest. Can we directly compare two persistence diagrams by taking their “dot product” in some sense? Convolving a symmetrized version of a persistence diagram (with respect to the diagonal) with the two dimensional Gaussian measure gives us exactly such a kernel. Such a kernel can be used for texture recognition, etc.
Sometimes, identifying topological features from a persistence diagram becomes a difficult task. Hence, the choice of a kernel becomes an important factor. Also, deep learning can also be used to identity the relevant topological features in a given situation.