How do we decide which movie to watch on a given weekend? We could google “latest movies”, and then take a random pick. However, that throws up both good and bad results. How do we pick a “good movie” from this list? We go to Rottentomatoes, IMDB or the “Reception” section of Wikipedia, and see what critics have to say about those movies. We generally go for the movies for which the critics that we highly value have mostly positive things to say. Google works on the same principle. How does it know which websites are most relevant to your query? It finds the website that most other “credible” websites (containing your query words) cite, or link to.
The paper that I’m blogging about today is The Anatomy of a Large-Scale Hypertextual Web Search Engine. Seeing as it gave rise to Google, perhaps the greatest political, social and economic force of our times, it is one of the most important papers that I’ve had the opportunity to read. And the fact that it is based on one of the most intuitive ideas known to humanity is mind-boggling.
Google did not start the search engine revolution. However, it revolutionized it. But how? Search engines before, like Yahoo and Altavista, were not the kind that thought users already knew what they were looking for, and just needed a big search bar. They thought most users just wanted to “browse”. Hence, if you remember the heavy machinery Yahoo and MSN websites, they would contain separate sections for News, Shopping, Sports, etc. Moreover, the search algorithm relied on keyword matching, that turned up way too many low quality results. If you typed “Sachin”, it would turn up the page that contained the word “Sachin” the most number of times, and not the ESPN stats page for Sachin Tendulkar, which was probably what you were looking for.
Google, on the other hand, was blasphemously bland. It had just one big search bar that you could use only if you already knew what you were looking for. This flew in the face of conventional wisdom. However, they improved their search algorithms and speed of information retrieval so much that they soon became the market leader in this segment.
The authors, Sergey Brin and Lawrence Page, claim in this paper that Google, which is a misspelling of googol (), is capable of fast crawling, indexing the massive landscape of the internet effectively, and also dealing with the millions of queries that search engines deal with daily.
Improved Search Quality and PageRank
Search engines by that time were already capable of crawling (downloading) and indexing vast swathes of the internet, and presenting these pages to users when searched for. However, they had not effectively determined a system to rank these pages by relevance. Although millions of results could be returned based on a query, users typically would only check out the first few, and then abandon their search. It was hence extremely important these first few results turned out to be the “best” ones. Google exploited hyper textual information to find an effective algorithm to rank pages based on relevance. What does this mean?
Let me explain this with an analogy. Suppose you need to find out if person A is honest and trustworthy. How do you determine that? You might ask them if they think they’re trustworthy. However, such self introspection is likely to be biased. Similarly, analyzing the contents of a page, like a page containing the word Sachin the most number of times, might not be the best test of its relevance. On the other hand, if you ask some people who you respect and know to be honest and trustworthy about person A, and all of them have a high opinion of this person, then you have a strong reason to trust person A. This is how relationships and ties are formed in the real world anyway, especially in high risk situations like marriage, business deals, etc. Google does pretty much the same thing.
Google ranks pages based on how many other pages cite it, or link to it. The actual contents of that page are not as important. If a search for “Einstein” results 10 million pages, and 8 million of them link to a certain article, that article is likely to be an influential and important article on Albert Einstein. However, there’s further modification that is needed to make the search algorithm even better, which I can again explain with a bad analogy. When you ask for opinions of others on person A, there might be persons who always speak well of others. Despite this charming personal trait, their opinion must be discounted in this particular situation. However, if you know persons who don’t generally throw around compliments willy nilly, and they have overwhelmingly positive things to say about person A, then their opinion should hold much more weight.
Similarly, google gives less importance to pages that contain a lot of hyperlinks to various websites, and more to pages that do not contain a lot of links. If there is a page on Einstein that does not contain a lot of links, and it cites only one page for all its information, that one page is likely to be extremely important. Of course there will be exceptions to this rule. However, statistically, such a search algorithm turns up good results most times.
If a page A has pages , and is the page rank of page while is the number of links going out of , then
Here is a number between and , and is called the damping factor. This factor models user fatigue: PageRank of a page is supposed to reflect what is the probability that a user reaches that page after clicking on a bunch of hyperlinks in other pages. If , for example, then there is an 85% chance that the user will click on the next link, and a 15% chance that the use will quit on that page.
Note that PageRanks are calculated iteratively. Hence, if page A links to page B, then we can calculate the rank of page B only if we already know the rank of page A.
When you search for “Einstein”, say the most highly ranked page that you get is the Wikipedia entry on him. It says that Einstein was a physicist, where the word “physicist” is a hyperlink to the wikipedia page for physicists. Now when you search for “physicist” on google, you’d want to get the Wikipedia page on “Physicist”, and not “Einstein”. Hence, the mention of the word “Physicist” on Einstein’s page should contribute to the PageRank of of the “Physicist” page, and not the “Einstein” page. If you observe the formula for page rank given above, you’ll see that this is exactly what happens. Hyperlinks always contribute to the page they lead to, and not the page they lead away from. This is how we can search for images and videos on google. There are hyperlinks on some pages with text, which lead to those images.
Google Architecture Overview
This is perhaps the most technical part of the blogpost. Most of Google is implemented in C or C++. What Google does is that it downloads as much of the internet that it can access, stores each webpage, and then determines how many times each word occurs in each webpage. It catalogues all this information, ranks the webpages, and stores them. So when you search for “Einstein”, it finds all webpages with the word “Einstein” in its catalogue, implements the PageRank algorithm to rank the webpages, and gives you the results in order. Now for a more technical description:
The web crawling, or the downloading of web pages, is done by several distributed crawlers. There is a URLServer that sends lists of URLs to these crawlers to download. These webpages are then sent to a storeserver, which stores them in a repository. Each webpage has a docID, which is stored along with the webpage. This docID is useful because URLs can be of arbitrary length. However, docIDs are of a fixed length. Hence, they’re useful for cataloguing.
Now comes the indexer. The indexer takes each webpage, and stores details of the words in that webpage. Say it takes a webpage on “Trump”. Now it counts the number of occurrence of each word in this webpage, which is known as recording hits, and creates a “forward index”, which looks like “docID for Trump: president, orange, twitter rage, yuge,…”. This information is now stored in “barrels”, which are boxes corresponding to a set of words, and store details of all webpages that contain those words. For example, this page on Trump is now stored in the barrel corresponding to the words “Trump, Obama, Lincoln”, etc. It is also stored in the barrel corresponding to the set of words “orange, twitter, yellow, green,…”. What a barrel does is that it creates an “inverted index”, which means that for each word, it has a list of docIDs or webpages containing it. Also, each word is allotted a wordID. An example of an entry in this inverted index would be “wordID for orange: docID for Trump, docID for the ice cream cones,….” etc
The indexer also notes down where in the document each word occurs, font size, capitalization, etc. Also the indexer notes down the URLs on that page- name of URL, and the docID of the page it leads to. This information is stored in an “anchors file”.
Major Data Structures
Almost all of the data is stored in Bigfiles, which span multiple file systems. Also, there are many repositories. The raw HTML repository stores the compressed HTML of each page, with a small header. The document index, on the other hand, keeps information about each document, including current status, various statistics, etc. A hit counts the number of times that a certain word occurs on a webpage. It is stored using hand optimized compact encoding, which uses two bytes for every hit. Hit lists are combined with wordIDs in forward index and docID in the inverted index.
In the inverted index, for each wordID, there is a pointer, which points towards the barrel corresponding to that wordID, the docIDs corresponding to the webpages that contain that word, and the hit lists telling us the details of the occurrence of that word in the webpages (font, number of occurrences, etc). This list is called a doclist. But in what order should these items be stored in a doclist? Should we just rank the docIDs numerically? Or rank them by the number of occurrences of that word? We basically order the docIDs by the number of occurrences, but we introduce a further modification to the barrels: there are now two sets of barrels. One set concerns itself only with title and hyperlinks, and another set deals with the main text. When you search for “Einstein”, the”title and hyperlink” barrels corresponding to “Einstein” are searched first, and the relevant ranked pages yielded. However, when there are not enough results this way, then only are the barrels corresponding to the main text upturned to yield relevant webpages. Now these yielded webpages are ranked using the PageRank algorithm, and are shown on the user’s device.
Google claims to be much better than others at searching for relevant webpages. For instance, it says that most existing search engines cannot even find their own webpages effectively when queried- if you type “Yahoo” into the Yahoo search bar, it might not even bring up its own webpage, but might some other random webpage containing the word Yahoo the most number of times. Google, on the other hand, is much better at this sort of thing. Also, when you type “Bill Clinton”, the most relevant results are presidential articles from whitehouse.gov.
As far as crawling the internet and indexing the huge mass of data is concerned, Google claims that downloading 26 million webpages took them only 9 days, and indexing and storing them just a couple more days. The indexer ran at 54 pages per second, and the whole sorting process took only a day.
Although its processes and algorithms could be optimized, Google was fast and scalable. And it was here to stay.
It indeed was