Network Effects Explained: PageRank and Preferential Attachment
In the United States where I’m based it is now the beginning of the school year and that means that it’s also the season for University Rankings. Sometime ago, I was contacted by one of the myriad of university ranking “services” and asked me to fill out a survey about university reputations. As one of my jobs in college was cold-calling people’s landlines for one of the larger (Portuguese) polling companies, I was curious about their questions and decided to participate. My curiosity was richly rewarded.
Their survey asked a number of questions, but I did eventually manage to finish it. I’ve forgotten the details of most questions, but I do remember a few. It started off by asking questions about my academic background, which area I worked in, which universities I had been affiliated with, etc. Perhaps the most interesting kind of question went something along the lines of: “in your opinion, in field X, what are the Top 3 (?) universities?” I naturally excluded any that I had been personally affiliated with, of course ;o)
I’m sure there are many other factors that contribute to the final ranking but for the sake of simplicity let’s focus just on this aspect of their algorithm. This apparently innocuous, and someone obvious, question gives us an interesting view on how their ranking works. Presumably, universities that get named to by more people are taken to be more important than others that get named less often, thus giving a “good” view of the relative importance of each one.
On the other hand, their respondents are not chosen at random. I’m not privy to their methodology but if they want to focus on academics they probably run a search on some academic search engine (Web of Science, Google Scholar, Microsoft Academic, etc…) to find researchers working in each area. As each academic comes from at least one university, we effectively have the individuals from one university naming the most influential universities based on the work they see originating from individuals in other universities. In a rough approximation, this can be seen as Universities ranking each others reputations.
If you have a Computer Science background, you might recognize that a similar mechanism is at play in the World Wide Web. When you create a new webpage or blog post you will likely create links to other pages or posts and intuitively, you expect that the best pages will quickly acquire the largest number of incoming links. Indeed, this simple fact lies at the heart of the algorithm devised by Larry Page and Sergey Brin (coincidentally named “PageRank”) and that laid the original foundation for the Google search engine.
PageRank relies on the uniquely democratic nature of the Web by using its vast link structure as an indicator of an individual page’s value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at considerably more than the sheer volume of votes, or links a page receives; for example, it also analyzes the page that casts the vote. Votes cast by pages that are themselves “important” weigh more heavily and help to make other pages “important.” Using these and other factors, Google provides its views on pages’ relative importance. This idea is extremely simple. The importance of each page is the average of the importance of every page that points to it. There is however, one obvious problem…
if the importance of everybody depends on everybody else how can you calculate anything?
The answer lies in the concept of iteration… you assign each node the same “importance”, say, m, and make every node play a simple game: Look at your outgoing links and evenly send each page you point to an equal share of your current importance. This similar in idea to the questionnaire I mentioned above, except that here we are not limited to just the 3 “most important” connections. After one step of this game each node has a new value, mi, of importance that is the sum of the importance “points” it received from all the pages that link to it. Under some simple assumptions, it is simple to show that by playing this game repeatedly we will eventually reach a situation where the values of mi assigned to each node stops changing.
The more mathematically inclined among you will recognize this as a Markov Chain that can be mathematically represented as an eigenvalue problem:
where P is the PageRank vector (the vector of “importances”) and G is known as the “Google Matrix”, a column stochastic matrix that describes the structure of the links. If you want to understand the mathematical details behind PageRank, there is no better reference than “The $25,000,000,000 Eigenvector: The Linear Algebra Behind Google” by Kurt Bryan and Tanya Leise. One particularly intuitive way of looking at this process is to imagine yourself being placed on a node and having to walk randomly in the graph by following edges from node to node. The PageRank vector is then simply the amount of time you spend in each node if you walk randomly in the graph for a looooooooooooooong time.
You can generate the Google Matrix from the adjacency matrix in a straightforward manner:
Since we are interested in the eigenvector corresponding to the largest eigenvalue (with value one) we can use the Power Method to calculate the eigenvector in an efficient way, by repeated multiplication by a normalized vector:
If we now plot the elements of the PageRank vector as a function of the number of connections each node has, its degree, we notice that there is a strong, although not perfect, correlation between the two:
The more incoming connections a node has the higher its PageRank. And of course results obtained on a much larger scale have been published in the academic literature:
If we think a little about the mechanism we described above we quickly find the explanation for this correlation. The more incoming connections a node has, the more influence “points” it receives with the fluctuations being likely accounted for by the local network structure, the details of who connects to whom.
There is however one other, more subtle, process at work. When Google uses the value of PageRank to sort the results it displays as the result of our queries it implicitly influences future results: If, say, I’m researching sources to link from a blog post I’m writing, I will likely link to the relevant result that I find first. In this way, result ranking in Google (or any other search engine) becomes a self reinforcing prophecy: the more people link to my page, the higher its PageRank, and the higher its PageRank the more links it attracts as a result.
This process is perhaps one of the most interdisciplinary models in existence, having been introduced, independently, by the British Statistician Udny Yule in 1925, the American Sociologist Herbert Simon in 1955, the British Scientometrician Derek Price in 1976 and the Hungarian Physicists Albert-Lázló Barabási and Réka Albert in 1999. It is commonly referred to by names as diverse the “Yule Process”, “Mathew effect”, “the rich get richer”, “Preferential Attachment”, the “Barabási-Albert” model, among many others. The reason behind this interest spanning almost 100 years is the fact that this simple idea results in a “scale-free” or “power-law” distribution of popularity or number of connections.
In the case of networks, this idea is implemented by adding a new node and edge at each time step. The newly created node connects at random to one of the previously existing nodes proportionally to their current degree. A simple way of doing this is by keeping a linear list where position of the form 2i have value i (the label of the node added at time i) and positions of the form 2i+1 have the value of the node that node i is connected to.
In other words, a network of three nodes in the shape of a triangle would be represented by the list: [0, 1, 1, 2, 2, 0]. It’s easy to see that each node label appears twice and has degree two. At the next time step, node three would connect at random by uniformly choosing a position in this list and connecting to the node listed in that position. Or, in code form:
and the resulting degree distribution, the probability, P[k], of a node chosen uniformly at random having degree k is clearly broadtail with an exponent of approximately 3.
This means that nodes of degree 100 are approximately 1,000,000 times less common than nodes of degree 1. A consequence of this form of the degree distribution is the existence of “hubs”, or superstars, nodes that account for a disproportionately large number of connections. For example, in a network of 1,000,000 nodes, the Top 1% of nodes by degree account for roughly 21% of all connections and the Top 10 nodes have over 1.5% of the total. This kind of disparity can be seen in many other cases, such as the accumulation of wealth.
As a counter example, let’s consider the simple case of purely random attachment. Now at each time step we simply choose two nodes at random and connect them:
In the end we end up with the same number of edges and number of nodes but they are now distributed in a very different way. For example, the degree distributions is now:
with every node having degree less than 15 and where the Top 1% has only ~3% of the connections. A much more equitable situation. This model is know as the Erdős-Rényi model after its introduction by two Hungarian mathematicians, Paul Erdős and Alfréd Rényi in 1958. Unfortunately, the real world is much more prone to the less equitable Preferential Attachment scenario.
P. W. Anderson, the 1977 Nobel prize in Physics famously said in a 1972 issue of Science that “More Is Different”. By this, he meant that when you have more than just a few items, be it grains of sand, atoms, people or even random numbers, strange things start to happen. Larger sets of “objects” are able to demonstrate properties and behaviors that you wouldn’t be able to predict by analyzing them in isolation and at each step of the hierarchy you need new ideas and, in some ways, new science to properly understand the system. In my view, this has never been truer than in the case of Complex Networks. As we saw above, not only the number of connections but the way they are distributed can have a huge impact on the way the system functions and what the final outcomes are.
Of course, the importance of a node is not given just by the number of connections it has, but also by its position on the network, but that’s a story for another day.
As usual, you can find the code for the snippets and figures above on my GitHub: