by Krešo Josić
Today, let's talk about how Google ranks your search results. The University of Houston Mathematics Department presents this program about the machines that make our civilization run, and the people whose ingenuity created them.
Vannevar Bush was the most effective and influential science advisor of the last century. In 1945 he presciently warned of getting lost in the torrent of information that humanity is generating. He also pointed to a solution: a memory storage and retrieval system he named the memex. Vannevar Bush's dream is embodied in today's web — and, fortunately, search engines were developed right along with it! From the now-forgotten Archie in 1990 to Google and Yahoo today, search engines allow us to make effective use of the mountain of information that is the web.
What makes modern search engines so useful? How do they lead us to what we wish to find? They start by sending out spiders — programs that methodically crawl the web and store the information they find. Search engines work on these huge collections of words, listed along with the webpages on which they were found. But how do unthinking machines know what is important to us humans?
Search engines integrate the clues about our preferences that we left throughout the web. We write about things we like, and link to pages we consider important. Search engines use massive collections of such hints to lead us to the information we need.
PageRank, the algorithm used by Google, is one of the most successful of this type. Here is what it does: the importance of a webpage can be judged by the number of other webpages that link to it. Suppose that you receive 1 popularity point for each website that links to yours. What if I have 20 of my buddies link to my webpage, and you have the Houston Chronicle and the KUHF websites link to yours? Your website is a lot more visible. And the fact that you were chosen by extremely popular websites means that your site is more popular than mine.
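The simple point-counting scheme above can be sketched in a few lines of Python. The tiny "web" here is made up for illustration; each page simply lists the pages it links to, and every incoming link is worth one point:

```python
# A toy web (made-up data): each page lists the pages it links to.
links = {
    "my_page": ["buddy1", "buddy2"],
    "houston_chronicle": ["your_page"],
    "kuhf": ["your_page"],
    "buddy1": ["my_page"],
    "buddy2": ["my_page"],
}

def popularity_points(links):
    """Award one popularity point for each incoming link."""
    points = {page: 0 for page in links}
    for page, targets in links.items():
        for target in targets:
            points[target] = points.get(target, 0) + 1
    return points

print(popularity_points(links))
```

Counting raw links this way treats every vote equally; the refinement described next weighs each vote by the popularity of the voter.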
Different websites that link to yours should provide you with a different amount of popularity points. The number of points needs to depend on the popularity of the linking websites. Incoming links from popular sites should translate into more popularity points for you. But, it looks like we have just moved the problem one step back — how do we judge the popularity of sites that link to yours? We should award the points according to the websites that link to them. Now we need the popularity of websites twice removed from yours. This approach quickly gets us going in circles.
So what is the way out? It is a branch of mathematics called linear algebra. PageRank works with a large spreadsheet. On the rows and columns of this spreadsheet lie all the pages of the web. When the webpage on a row links to a webpage in a column, the spreadsheet is filled at the intersection of that row and that column. Computations with this spreadsheet, or matrix, are a billion dollar business.
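The way linear algebra breaks the circularity can be sketched as follows. The popularity scores turn out to be the entries of an eigenvector of the link matrix, which can be found by simply multiplying by the matrix over and over (power iteration) until the scores settle down. The four-page web and the damping value below are made-up illustrations; Google's actual computation involves billions of pages:

```python
# A minimal sketch of the eigenvector computation behind PageRank.
# matrix[i][j] = 1 if page j links to page i, else 0.

def pagerank(matrix, damping=0.85, iterations=100):
    n = len(matrix)
    # Each page splits its vote evenly among its outgoing links.
    out = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    ranks = [1.0 / n] * n            # start with equal popularity
    for _ in range(iterations):      # repeat until the scores stop changing
        ranks = [(1 - damping) / n + damping * sum(
                     matrix[i][j] * ranks[j] / out[j]
                     for j in range(n) if out[j] > 0)
                 for i in range(n)]
    return ranks

# Pages 0..3: page 0 links to 1 and 2, page 1 to 2, page 2 to 0, page 3 to 2.
links = [[0, 0, 1, 0],
         [1, 0, 0, 0],
         [1, 1, 0, 1],
         [0, 0, 0, 0]]
ranks = pagerank(links)
print(ranks)  # page 2, with three incoming links, scores highest
```

Notice that no page's score is computed before any other's: each pass updates all of them at once from the previous pass, which is exactly how the circular definition becomes harmless.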
However, there is something even more circular going on here: We use the result of our web searches to find pages we want to reference and link to. In turn search engines will revise their popularity rankings according to the links and entries we thus create. As our preferences shape the search results, so do the search results shape our preferences. Thus the evolution of the web is in part the result of a dialogue between man and machine.
I'm Krešo Josić, at the University of Houston, where we're interested in the way inventive minds work.
For a nice history of search engines you can look at http://www.searchenginehistory.com/.
Here is a nice article on how Google collects and ranks results https://www.google.com/librariancenter/articles/0512_01.html. Google's ranking system, PageRank, is nicely spoofed on another Google page https://www.google.com/technology/pigeonrank.html.
You can read more about Google bombing here https://en.wikipedia.org/wiki/Google_bomb. The idea was to create a huge number of links to a page, thereby causing it to be ranked highly.
Vannevar Bush's article is available on the Atlantic Monthly website at http://www.theatlantic.com/doc/194507/bush.
A more technical description of the Google algorithm can be found in K. Bryan and T. Leise, The $25,000,000,000 eigenvector: The linear algebra behind Google, SIAM Review (2006), vol. 48 (3), p. 569.