Demystifying the PageRank and HITS Algorithms...Or just enough linear algebra to master network-based ranking methods
There is a ton of information on the internet about how search engines work and the webpage ranking algorithms that run 'under the hood'. The hyperlink, network-based algorithms known as PageRank and HITS are the two most prominent examples of such methods. However, much of the information on these methods is either misleading or confusing, and really require only a little bit of mathematics to understand well. Unfortunately, most explanations on websites are couched in language of linear algebra and "eigenvalues", which often obscures the relatively simply ideas underlying them. Here is an attempt to make these algorithms more accessible by boiling down the math to the essential details.
Let G be a directed graph and let A(G)=A be the adjacency matrix where A[i,j] = 1 if (i,j) is an arc, is 0 otherwise. The ith row of A is the (characteristic) vector representing all the out-neighbors of the ith node, and the jth column of A is the vector representing the in-neighbors of the jth node. The transpose matrix At has the roles reversed, where the ith row of At is the vector representing all the in-neighbors of the ith node, and the jth column of At is the vector representing the out-neighbors of the jth node.
Matrix multiplication can give simple algebraic representations of graph structures. For example, the square of the adjacency matrix A2 has as its [i,j]th entry the number of 2-hop paths from ith node to jth node. And the product AtA has as its [i,j]th entry the number of nodes that are common in-neighbors to both the ith and jth node. Note how AtA is a symmetric matrix, likewise AAt is also symmetric.
Stochastic Matrices
Let us place a probability distribution on the out-neighborhood of each node – that is we give a positive weight (between 0 and 1) to each arc, with the property that the sum of weights out of each node is exactly 1. We call such a matrix a stochastic matrix.
Figure 1. Stochastic matrix obtained from a digraph where a uniform distribution is used at every node.
Markov Chains and PageRank
Given a stochastic matrix P we consider each entry as a transition probability of a random walk, or equivalently, the transition probability matrix of the Markov chain. A stationary distribution of a Markov chain is a probability distribution d on the nodes, so that Ptd =d, i.e., d is a fixed point under the action of the transition matrix. For such matrices we can apply the fundamental theorem of Markov chains, which says that, under certain conditions, such stationary distributions always exist and are unique.
Fundamental Theorem of Markov Chains: If a Markov chain P is irreducible and aperiodic, then it has a unique stationary distribution. Moreover, (Pt)n x_0 converges to the stationary distribution d as n goes to infinity for any initial distribution x_0.
This unique stationary distribution is associated with a ranking of the nodes of a web-graph, called the PageRank vector (normalized such that the entries sum to 1). However, for the above theorem to apply we must make sure that our web-graph satisfies the hypothesis. To wit, a Markov chain P is irreducible if the graph corresponding to P is strongly connected, i.e., there is a path between any pair of nodes. A Markov chain P is aperiodic if for all x, y we have gcd{t : Pt(x, y) > 0} = 1, which simply means the set of random walks from nodes x to y are not caught in a cycle. So we see that the Fundamental Theorem of Markov Chains can apply to most graphs we encounter; however to apply it to general web-graphs we must make sure the graph associated with the Markov chain is strongly-connected. PageRank achieves this by letting a small amount of probability be used to jump to any randomly chosen node. This will modify the graph/Markov chain so that it does indeed satisfy the theorem above.
The proof of the fundamental theorem above relies on a result called Perron-Frobenius Theorem which essentially states that such Markov chains have a left eigenvector of P with eigenvalue 1, and we will consider eigenvectors and eigenvalues later in this note.
An Algorithm for PageRank
The Fundamental Theorem of Markov Chains also gives us a simple algorithm for computing the PageRank (the unique stationary distribution) for graphs to which it applies:
We can simply power up or iteratively update a distribution vector, starting from an arbitrary initial state (usually taken as the uniform distribution). Each iteration of this algorithm takes time O(|V| + |E|), and will often converge very quickly.
Determinants and Eigenvalues of Matrices
A determinant is a function that associates a scalar value to a square matrix. The value of the determinant can tell you if the matrix has an inverse or not. A matrix with a nonzero determinant is called regular or invertible matrix (and, we can calculate its inverse matrix). If the determinant is zero the matrix is called a singular matrix and it does not have an inverse.
Formally, a determinant is computed by a sum of products over all permutations of numbers 1 to n, and therefore has n! terms in sum. Determinants over 2 by 2 matrices, that is where n = 2, have only sum/difference of two terms (a11 * a22 – a12 * a21) and are simple to calculate by hand. For larger n>2, the number of terms start to get out of hand.
All regular matrices can be transformed into singular matrices by subtracting common value from the diagonal entries of the matrix. The problem of transforming a regular matrix into a singular matrix is referred to as the eigenvalue problem. The Eigenvalue Problem is determined by subtracting c*I from A and setting the determinant to 0 and solving for c. Given a solution c to the eigenvalue problem we can find a vector (a class of vectors) X such that AX=cX. Any such X is called the associated eigenvector for eigenvalue c.
The Power Method for Principal Eigenvector
One of the simplest methods for finding the largest magnitude eigenvalue and an associated eigenvector is called the Power Method. Given a matrix A and an initial vector b0 that is a linear combination (that is, a weighted sum) of eigenvectors, we can see that by repeatedly multiplying each bi by A, we are reinforcing (the direction) of the eigenvector with the largest magnitude eigenvalue (relative to the direction of the other eigenvectors in the sum). Hence, the power method is applicable for the following technical conditions:
1) A has an eigenvalue that is strictly greater in magnitude than its other eigenvalues.
2) The starting vector b0 has a nonzero component in the direction of an eigenvector associated with the dominant eigenvalue, i.e., b0 is not orthogonal to the eigenvector.
The power method algorithm starts with a vector b0, and is described by the following iteration:
for k = 0,1,2,3,...
bk+1 = Abk / || Abk||
The HITS Algorithm for Ranking Webpages
The HITS algorithm is a conceptually simply link based ranking algorithm based on the power method, which posits two parameters for each webpage: an authority value a_i and a hub value h_i. Authority values are simply the sum of the in-neighbor hub values, and similarly the hub values are the sum of the out-neighbor authority values. So hubs and authority values are mutually dependent. In matrix form we have the following two matrix-vector equations:
a= Ath and h = Aa
Plugging back in for h and a leads to the following equations yielding an pair of eigenvector problems:
a= AtAa and h = A Ath
a= AtAa and h = AAth
Looking at these equations we can see that we seek eigenvectors associated with eigenvalue 1 for each of the pair of matrices AtA and AAt. We cannot in general assume that such eigenvalue/eigenvector pairs exists, but we can find other eigenvectors associated with the largest eigenvalue using the normalized Power Method described above, so long as technical conditions are satisfied. To wit, as noted both AtA and AAt are positive, symmetric matrices. Such matrices have the property that all the eigenvalues are positive reals, and the associated eigenvectors form a basis, that is every vector is a linear combination of eigenvectors. Hence, we can be assured that the HITS algorithm will converge to the (normalized) sum of eigenvectors associated with the largest eigenvalue (for which the starting vectors a_0 and h_0 have non-zero component). We are in good shape if these vectors are real-valued, so that we can interpret the components as ranking values. The theory says that as long as the matrices in question are "irreducible" then we are in business.
0 comments:
Post a Comment