# How, on Earth, we took a picture of a black hole

The paper that I will be writing about today is Computational Imaging for VLBI Reconstruction by Katie Bouman, et al. The image processing algorithm explained in this paper was the reason for the first photo that humanity ever took of a black hole.

## Introduction

At the time that this paper was written, there was an international collaboration underway that wanted to capture the image of a black hole’s Event Horizon for the first time in history. But the angular resolution needed was an order of magnitude lower than what was available with current technology. What does this mean?

We know that there are billions of stars in the sky. However, stars that are very far away don’t appear as distinct stars to us. Instead, they appear to be broiling in a hazy cosmic soup. This is because the naked eye is not very good at distinguishing between relatively close stars that are very far away.

When light rays from far away stars enter the tiny apertures in our eyes, they undergo diffraction. For close enough stars, these diffraction patterns merge into one amorphous blob. Hence, instead of distinct stars, we just see a cosmic haze.

The Rayleigh formula for angular resolution is $\theta=1.22 \frac{\lambda}{D}$, where $\theta$ is angular separation, $\lambda$ is the wavelength of the incoming signal and $D$ is the diameter of the aperture (of the human eye or a telescope). Given that the observed wavelength and the diameter of the aperture are fixed, if two stars are separated by an angle of at least $\theta$, they can be seen as distinct entities. If they have an angular separation that is smaller than that, they will be seen as one hazy blob.

How does all this relate to taking the photo of the event horizon of a black hole? Imagine that the event horizon subtends for an angle $\alpha$. If this $\alpha>\theta$, then we will be able to take an image of a black hole that has at least two distinct points. Hence, the event horizon will appear to be an elongated object instead of just a dot against the night sky. This is what we mean by taking photo. When we talk about taking a photo of Jupiter, we refer to a photo containing details of its stripes, the great red spot, etc, and not just a pale dot in the night sky. We hope to accomplish a similar feat with black holes (this hope was obviously fulfilled in 2019).

Now as black holes are very far away, the angle that they subtend might be very small. Hence, keeping $\lambda$ constant, we will have to have a a huge aperture diameter. For example, it is predicted that the black hole at the center of the Milky Way has an angular diameter (relative to viewers on Earth) of $2.5\times 10^{-10}$ radians. For $\lambda=1.3 mm$, we will have to build a telescope of diameter 13,000 km to resolve it. This of course is not realistic. To overcome this constraint, the authors plan on using multiple telescopes spread over the world, and then use their combined data to emulate the data collected by one giant telescope. Using multiple telescopes in this manner is known is Very Long Baseline Interferometry (VLBI). Hence, VLBI can solve our problems of angular resolution. But what about the post-VLBI stuff? What about the creation of the actual image? There exist multiple algorithms to produce images from given data, with varied degrees of clarity.

The authors content that the Continuous High-Resolution Image Reconstruction using Path Priors, or CHIRP algorithm, can reconstruct better images than other existing algorithms from the same data.

## A Brief Introduction to VLBI

Before we talk about the CHIRP algorithm, we need to talk about the VLBI process so that we understand the kind of data that we’re collecting. As explained before, in order to be able to get a better angular resolution, multiple telescopes are needed. Such an array of telescopes is known as an interferometer. This proves problematic though. Two telescopes located at different places will receive different signals: signals that are different in terms of phase, etc. How can be combine all of these observations to produce one coherent signal then?

Here’s how we do it: fix a wavelength $\lambda$ amongst all the incoming wavelengths. This wavelength is composed of very many spatial frequencies. Spatial frequency can be thought of as the number of light and dark lines in unit distance, when a light beam is diffracted by a particular apparatus. Look at the diagram below, where SF stands for spatial frequency

One may imagine that the incoming signal is the sum of waves of very many different spatial frequencies. We calculate to what extent the two telescopes under consideration receive waves with the same spatial frequency. Another way to look at it is we measure to what extent the two telescopes receive the “same” wave (with the same frequency, etc). This is calculated by the correlation function, which is defined as $\Gamma_{12}(u,v,0)=\int\int I(l,m)e^{-2\pi i(ul+vm)}dldm$

This formula comes from the Van Cittert-Zernike Theorem, and the Wikipedia article on it is extremely well-written. Here the telescopes are numbered $1$ and $2$, and their correlation function (also known as visibility) is denoted as $\Gamma_{1,2}$. Just some notation- $l,m$ denotes the angular position of the celestial object, and $(u,v)$ denotes the position vector of one telescope with respect to another. The basic gist of the Van Cittert-Zernike theorem is this: if you consider any two stars (or even different points on the same star), they emit electromagnetic waves with different properties like spatial frequency, etc. Hence, the waves emitted by them are not “coherent” at any two points. An analogy is that if you drop two pebbles into water, at points close to the pebbles, different points on the water oscillate differently. Some show oscillate wildly, whilst others don’t oscillate at all (destructive interference). However, points that are a little further away from the two pebbles oscillate similarly. They show the same behavior. Similarly, points that are far away from these non-coherent stars will receive waves of similar properties. The extent of similarity, which is definitely much greater than $0$, is given by the formula above.

Now note that if we have $N$ telescopes, we can form ${N\choose 2}$ pairs of telescopes. Hence, we have $N\choose 2$ visibility numbers. These numbers change due to relative atmospheric conditions, as the atmosphere introduces a delay in the phase of the incoming waves, and it introduces a different delay for each telescope as atmospheric conditions are different at different points on the planet. Hence $Gamma_{i,j}$ becomes $\Gamma_{i,j}e^{i(\theta_i-\theta_j)}$. However, if we have a set of three telescopes $i,j,k$, then the product $\Gamma_{i,j}\Gamma_{j,k}\Gamma_{k,i}$ remains unchanged, even after the phase delay introduced by the atmosphere. We refer to this triple product as the bispectrum (I will refer to the plural as bispectra). This is a useful invariant to have, as we don’t have to wonder about atmospheric delays anymore.

This introduces a number of constraints into the system, and hence reduces other independent additional constraints that we impose on the system. Why? Because we have $N\choose 2$ independent variables of the form $\Gamma_{i,j}$, and these independent variables can vary arbitrarily (because of the delay introduced by the atmosphere, for instance). However, we now have $N\choose 3$ constraints. If we can only have fixed number of independent constraints on the system so that it still remains well-defined, then we have to reduce the number of other additional constraints that we can impose on the system.

Let’s back up. It seems that we have a system of equations in the variables $\Gamma_{i,j}$, and we want to find these variables. But why do we want to find these variables at all? How will these help us reconstruct the image of a celestial object? It seems that with only this information, the authors have an algorithm to construct an image of the source. This is incredible, and I will try to provide more details below. Anyway, from this point we will assume that we have determine all the visibility ( $\Gamma_{i,j}$) and bispectra ( $\Gamma_{i,j}\Gamma_{j,k}\Gamma_{k,i}$ values).

## A discussion of some algorithms

What algorithm should we now use to construct an image from the given data? Some known algorithms are given below:

CLEAN– This algorithm assumes that the image is made up of discrete bright points. It finds the brightest point on the image, and then “deconvolves” around that location. Deconvolution can be thought of as getting a sharper image from hazy data. After a few iterations, the brightness of the original source is reduced. Because this algorithm focuses on the brightest points, it often produces bad images of extended objects with no particular brightest points.

OPTICAL INTERFEROMETRY– We will talk about two algorithms here- the BiSpectrum Maximum Entropy Method (BSMEM), and SQUEEZE. Both are capable of using bispectrum data to construct images. BSMEM uses a Bayesian approach to image reconstruction, which may be thought of as determining what is the most probably “actual” image of the source, given the collected data. SQUEEZE on the other hand moves a set of point sources around in the Field of View, and then takes an “average” of the images obtained.

The authors propose their own algorithm CHIRP, which produces superior images than the algorithms mentioned above, with minimal human intervention.

## CHIRP Method

Let us think back to the formula $\Gamma_{12}(u,v,0)=\int\int I(l,m)e^{-2\pi i(ul+vm)}dldm$

It is clearly impossible to calculate $I_{\lambda}(l,m)$ for different values of $l,m$, as the source is far away, and the received signal is hazy and impossible to resolve. However, the visibility and bispectra values have been determined above. Using these values, we determine the value of $lambda I_{\lambda}(l,m)$ using the inverse Fourier transform (with an intermediate step in which we discretize $I_{\lambda}(l,m)$. More details to be given shortly.

The authors parametrize a continuous image using a discrete set of parameters. Having a continuous image reduces modeling errors during optimization. Now remember that we have the Fourier transforms of the original signal $I_{\lambda}(l.m)$ for each pair of telescopes (the visibility values). Because the signal is the weighted sum of these Fourier transforms, we can represent the original image as the sum of these continuous, scaled and shifted pulses (although the weights of these pulses, and hence the original image, are unknown at this stage). For a scene defined in the angular range $l\in [-\frac{F_l}{2},\frac{F_l}{2}]$ and $m\in [-\frac{F_m}{2},\frac{F_m}{2}]$, we parametrize the space into $N_l\times N_m$ scaled pulse functions, $h(l,m)$, centered at the geometric center of each cube of the $N_l\times N_m$ grid. The numbers $N_l,N_m$ are for us to decide, depending on how much accuracy is desired. We use known triangular pulse functions $h(l,m)$ here, and hence their Fourier transforms are known. The weights that we assign to each pulse function can be written down as $x[i,j]$ or $x$.

What we’ve essentially done is that we’ve located the (angular) coordinates across which the celestial body extends, zoomed in on those coordinates, drawing up a grid across this space, and then drawn pulse functions inside each cube of this grid. Why have we done this though? We want to convert the difficult calculation of integration into the relatively simpler calculation of a linear matrix operation. In other words, we can calculate $I_{\lambda}(l,m)$ from the values of $h(l,m)$ using just simple linear algebra. However, we need $x$ in order to do that, and we calculate the value of $x$ below.

We need to construct an image with the values of the bispectra that we have. We use a maximum a posteriori (MAP) estimate. What this means is the following: consider the vector $x$ as representing the weights of $h(l,m)$, as described above. If the observed observed bispectra (used instead of visibility because these values don’t change with atmospheric conditions) is denoted as $y$, then we need to minimize the energy $f_r(x|y)=-D(y|x)-EPLL_r(x)$. Clearly $y$ here is fixed, and we vary the values of $x$ to find the optimal one.

The issue at hand is slightly more subtle than I let on. $D(y|x)$ and $EPLL_r$ can only be calculated if we already know what $\hat{I}(x)$ is. However, at this stage, we clearly don’t. Hence, what is probably happening is an iteration: we start with an image that we obtain after patching together information from the patch prior (hence the value of $x$ is known). We use this value of $x$ to determine $D(y|x)$ and $EPLL_r(x)$. Using these, we now calculate a better value for $x$ by minimizing the function $f_r(x|y)=-D(y|x)-EPLL_r(x)$, and so on. After having found an optimal value for $x$, we can use it to reconstruct the image. In practice, we don’t minimize this function directly, but use a slightly different algorithm that gives us the same result. This algorithm is explained below:

We use the “Half Quadratic Splitting” method. Start with a blurry image, with a known value of $x$. Use this value, and the value of a varying parameter $\beta$, to calculate the most likely patches $\{z^n\}$, which may be thought of as the most likely sub-images in the grid that can be overlapped and put together to form the complete image. This information is now used to calculate a better value of $x$. This value can now again be used to calculate the most likely patches $\{z^n\}$, with a different value of $\beta$. The values of $\beta$ for which we run this iteration are $1,4,8,16,32,64,128,256,512$. After this process is completed, we will have gained a “good” value of $x$, using which we can construct $I_{\lambda}(l,m)$.

## Conclusion

So does the CHIRP algorithm perform better than SQUEEZE, BSMEMS or other algorithms? Let us look at the pictures below.

Note that the data, based on which these images were created, are a result of simulations, and not actual observations. The authors made their very extensive and useful dataset available here.

The CHIRP algorithm was also found to be more robust to random (Gaussian) noise.

What effect did patch priors have? Patch priors allow for the algorithm to train itself on what the image “probably” looks like, given the input data $y$. Patch priors helped the algorithm to piece together a much better picture of celestial bodies.

The authors contend that CHIRP is a better image processing algorithm than other state of the art algorithms. And they did prove it with a magnificent picture of the event horizon of a black hole 4 years later, for all of eternity to stare at with moist eyes.