Rank Centrality: Ranking from Pair-wise Comparisons

Sahand Negahban
Statistics Department, Yale University, 24 Hillhouse Ave, New Haven, CT 06510, sahand.negahban@yale.edu

Sewoong Oh
Department of Industrial and Enterprise Systems Engineering, University of Illinois at Urbana-Champaign, 104 S. Mathews Ave., Urbana, IL 61801, swoh@illinois.edu

Devavrat Shah*
Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Massachusetts Ave., Cambridge, MA 02139, devavrat@mit.edu

The question of aggregating pair-wise comparisons to obtain a global ranking over a collection of objects has been of interest for a very long time: be it ranking of online gamers (e.g. MSR’s TrueSkill system) and chess players, aggregating social opinions, or deciding which product to sell based on transactions. In most settings, in addition to obtaining a ranking, finding ‘scores’ for each object (e.g. player’s rating) is of interest for understanding the intensity of the preferences.

In this paper, we propose Rank Centrality, an iterative rank aggregation algorithm† for discovering scores for objects (or items) from pair-wise comparisons. The algorithm has a natural random walk interpretation over the graph of objects with an edge present between a pair of objects if they are compared; the score, which we call Rank Centrality, of an object turns out to be its stationary probability under this random walk.

To study the efficacy of the algorithm, we consider the popular Bradley-Terry-Luce (BTL) model (equivalent to the Multinomial Logit (MNL) for pair-wise comparisons) in which each object has an associated score which determines the probabilistic outcomes of pair-wise comparisons between objects. In terms of the pair-wise marginal probabilities, which are the main subject of this paper, the MNL model and the BTL model are identical. We bound the finite sample error rates between the scores assumed by the BTL model and those estimated by our algorithm.
In particular, the number of samples required to learn the score well with high probability depends on the structure of the comparison graph. When the Laplacian of the comparison graph has a strictly positive spectral gap, e.g. each item is compared to a subset of randomly chosen items, this leads to dependence on the number of samples that is nearly order-optimal. Experimental evaluations on synthetic datasets generated according to the BTL model show that our algorithm performs as well as the Maximum Likelihood estimator for that model and outperforms other popular ranking algorithms.

Key words: Rank Aggregation, Rank Centrality, Markov Chain, Random Walk
History: This paper was first submitted on December 1st, 2013.

arXiv:1209.1688v4 [cs.LG] 12 Nov 2015

Author: Rank Centrality: Ranking from Pair-wise Comparisons
Article submitted to Operations Research; manuscript no. (Please, provide the manuscript number!)

1. Introduction

Rank aggregation is an important task in a wide range of learning and social contexts arising in recommendation systems, information retrieval, and sports and competitions. Given $n$ items, we wish to infer relevancy scores or an ordering on the items based on partial orderings provided through many (possibly contradictory) samples. Frequently, the available data comes in the form of a comparison: player A defeats player B; book A is purchased when books A and B are displayed (a bigger collection of books implies multiple pair-wise comparisons); movie A is liked more compared to movie B. From such partial preferences in the form of comparisons, we frequently wish to deduce not only the order of the underlying objects, but also the scores associated with the objects themselves so as to deduce the intensity of the resulting preference order.

For example, the Microsoft TrueSkill engine assigns scores to online gamers based on the outcomes of (pair-wise) games between players.
Indeed, it assumes that each player has inherent “skill” and the outcomes of the games are used to learn these skill parameters which in turn lead to scores associated with each player. In most such settings, similar model-based approaches are employed.

In this paper, we have set out with the following goal: develop an algorithm for the above stated problem which (a) is computationally simple, (b) works with available (comparison) data only, and (c) does as well as the best model-aware algorithm when the data is generated as per a reasonable model. The main result of this paper is an algorithm that meets all three requirements.

Related work. Most rating based systems rely on users to provide explicit numeric scores for their interests. While these assumptions have led to a flurry of theoretical research for item recommendations based on matrix completion (cf. Candès and Recht (2009), Keshavan et al. (2010), Negahban and Wainwright (2012)), arguably numeric scores provided by individual users are generally inconsistent. Furthermore, in a number of learning contexts as illustrated above, explicit scores are not available. These observations have led to the need to develop methods that can aggregate such forms of ordering information into relevance ratings.

* This work was supported in parts by MURI W911NF-11-1-0036 and NSF CMMI-1462158.
† Similar algorithms, based on the comparison data matrix, have been proposed in the literature. As discussed in detail in Section 3.3, they are all different from Rank Centrality.

In general, however, designing consistent aggregation methods can be challenging due in part to possible contradictions between individual preferences. For example, if we consider items A, B, and C, one user might prefer A to B, while another prefers B to C, and a third user prefers C to A. Such problems have been well studied starting with
(and potentially even before) Condorcet (1785). In the celebrated work by Arrow (1963), existence of a rank aggregation algorithm with reasonable sets of properties (or axioms) was shown to be impossible.

In this paper, we are interested in a more restrictive setting: we have outcomes of pair-wise comparisons between pairs of items, rather than a complete ordering as considered in (Arrow 1963). Based on those pair-wise comparisons, we want to obtain a ranking of items along with a score for each item indicating the intensity of the preference. One reasonable way to think about our setting is to imagine that there is a distribution over orderings or rankings or permutations of items (also known as the discrete choice model in the literature on Social Choice) and every time a pair of items is compared, the outcome is generated as per this underlying distribution. Examples of popular distributions over permutations include the Plackett-Luce model (Luce 1959, Plackett 1975) and the Mallows model (Mallows 1957). With this, our question becomes even harder than the setting considered by Arrow (1963) as, in that work, effectively the entire distribution over permutations was already known!

Indeed, such hurdles have not stopped the scientific community as well as practical designers from designing such systems. Chess rating systems and the more recent MSR TrueSkill Ranking system are prime examples. Our work falls precisely into this realm: design algorithms that work well in practice, make sense in general, and perhaps more importantly, have attractive theoretical properties under common comparative judgment models.

An important and landmark model in this class is called the Plackett-Luce model, which is also known as the Multinomial Logit (MNL) model (cf.
McFadden (1973)) in the operations research and social science literature. A special case of the Plackett-Luce model applied to pair-wise comparisons is known as the Bradley-Terry-Luce (BTL) model (Bradley and Terry 1955, Luce 1959). It has been the backbone of many practical system designs including pricing in the airline industry, e.g. see Talluri and VanRyzin (2005). Adler et al. (1994) used such models to design adaptive algorithms that select the winner from a small number of rounds. Interestingly enough, the (near-)optimal performance of their adaptive algorithm for winner selection is matched by our non-adaptive algorithm for assigning scores to obtain global rankings of all players.

We propose a new rank aggregation algorithm, which we call Rank Centrality, that builds on a long line of research in using eigenvectors of certain matrices to find global rankings of items, which dates back to Seeley (1949). This line of research is referred to as spectral ranking and for an extensive survey we refer to Vigna (2009). Given pair-wise comparisons of items from a single individual on all possible choices of pairs, Wei (1952) introduced a ranking algorithm based on the leading eigenvector of the matrix representing the comparison outcomes. A slight generalization accounting for data from multiple decision makers was proposed by Kendall (1955). Keener (1993), and more recent work by Dwork et al. (2001a), proposed several variations of spectral algorithms for ranking from pair-wise comparisons. We propose Rank Centrality for ranking from pair-wise comparisons by using the leading eigenvector of a particular matrix formed by constructing a Markov chain corresponding to a random walk on a graph.
Although it appears to be similar to the existing spectral ranking approaches, the precise form of the algorithm proposed is distinct and this precise form does matter: the empirical results using synthetic data presented in Section 3.3 make this clear. In summary, building on the classical field of spectral ranking, we propose a novel spectral ranking algorithm and provide a firm theoretical grounding by showing that it is a provably near-optimal estimator for a popular discrete choice model, i.e. the BTL model formally defined in Section 2.1.

Numerous spectral ranking algorithms have been proposed in the past, one of the most popular examples being PageRank (Brin and Page 1998). However, almost invariably, the question of when one should choose to use a particular spectral ranking algorithm is left open. One notable exception is the work of Altman and Tennenholtz (2005), which provides a set of axioms satisfied by the PageRank algorithm and proves that PageRank is the only rank aggregation algorithm that satisfies those particular axioms. Hence, it provides a guideline for deciding when PageRank should be used, i.e. in applications where the specific set of axioms makes sense. In a similar spirit, Rank Centrality is a spectral ranking algorithm with a theoretical justification suggesting that it should be used in applications where the BTL or MNL model makes sense (in the remainder of this manuscript, we shall use the BTL model as representative of both the BTL and MNL models).

There has been significant work on rankings from pair-wise comparison in the last several years. A popular model is a distribution over permutations known as the Mallows model, which assigns probability to observed rankings according to the Kendall-τ distance to a true ranking. Since the maximum likelihood estimation is provably difficult, Dwork et al. (2001b) studied this problem (also known as the Kemeny optimization) when full rankings are observed and provided a 2-approximation algorithm.
This was later improved by Ailon et al. (2008) and also generalized to partial rankings (Ailon 2010). Recently, Lu and Boutilier (2011) proposed an expectation-maximization approach with novel sampling schemes to learn the Mallows model from pair-wise comparisons. These distance-based approaches aim to provide good approximation algorithms for the provably difficult problem of minimizing the Kendall-τ distance and some variations of it (e.g. Farnoud et al. (2012)).

Learning to rank from pair-wise comparisons has also been studied in applications where one might observe more than just the ordinal outcome of pair-wise comparisons. Additional data on cardinal preferences such as the margin of victory (the difference between the winning team’s score and the losing team’s score) in a football match has led to score-based methods for ranking, where the goal is to find scores for each team such that the difference of the scores is consistent with the observed margins of victory (Hochbaum 2006, Gleich and Lim 2011, Jiang et al. 2011). More recently, Volkovs and Zemel (2012) proposed a unified model that generalizes both the BTL model and the cardinal preferences.

These approaches add to the traditional approaches based on some notion of distance, such as the Kendall-τ distance, and probabilistic models, such as the BTL model. Another probabilistic model directly parameterizes the distribution of pair-wise comparisons for all the pairs and asks the question of whether existing pair-wise ranking algorithms are consistent or not (Duchi et al. 2010, Rajkumar and Agarwal 2014). It is shown that many existing algorithms do not meet the proposed ‘consistency’ criteria and new regret/optimization based algorithms are presented.
The algorithm proposed by Ammar and Shah (2011) can be viewed as a natural adaptation of Borda count based on pair-wise comparison data. They establish it to be equivalent to Borda count based on the entire distribution when perfect pair-wise marginals are available, i.e. in the large sample limit.

In Braverman and Mossel (2008), the authors present an algorithm that produces an ordering based on $O(n \log n)$ pair-wise comparisons on adaptively selected pairs. They assume that there is an underlying true ranking and one observes noisy comparison results. Each time a pair is queried, we are given the true ordering of the pair with probability $1/2 + \gamma$ for some $\gamma > 0$ which does not depend on the items being compared.

Our contributions. In this paper, we introduce Rank Centrality, an iterative algorithm that takes the noisy comparison answers between a subset of all possible pairs of items as input and produces scores for each item as the output. The proposed algorithm has a nice intuitive explanation. Consider a graph with nodes/vertices corresponding to the items of interest (e.g. players). Construct a random walk on this graph where at each time, the random walk is likely to go from vertex $i$ to vertex $j$ if items $i$ and $j$ were ever compared; and if so, the likelihood of going from $i$ to $j$ depends on how often $i$ lost to $j$. That is, the random walk is more likely to move to a neighbor who has more “wins”. How frequently this walk visits a particular node in the long run, or equivalently the stationary distribution, is the score of the corresponding item. Thus, effectively this algorithm captures preference of the given item versus all of the others, not just immediate neighbors: the global effect induced by transitivity of comparisons is captured through the stationary distribution.
Such an interpretation of the stationary distribution of a Markov chain or a random walk has been an effective measure of relative importance of a node in a wide class of graph problems, popularly known as Network Centrality (cf. Newman 2010). Notable examples of such network centralities include the random surfer model on the web graph for the version of PageRank (Brin and Page 1998) which computes the relative importance of a web page, a model of a random crawler in a peer-to-peer file-sharing network to assign a trust value to each peer in EigenTrust (Kamvar et al. 2003), and a random walk interpretation of Rumor Centrality that assigns likelihood to each node for being the source of information (or rumor) spread in a network graph based on the foot-print of infection under the Susceptible-Infected model (Shah and Zaman 2011, 2015).

The computation of the stationary distribution of the Markov chain boils down to ‘power iteration’ using the transition matrix, which lends itself to a nice iterative algorithm. To establish rigorous properties of the algorithm, we analyze its performance under the BTL model described in Section 2.1.

Formally, we establish the following result: given $n$ items, when comparisons between randomly chosen $\omega(n \log n)$ pairs of items are produced as per an (unknown) underlying BTL model, Rank Centrality learns the true score up to an arbitrary accuracy with high probability as $n \to \infty$. It should be noted that $\Omega(n \log n)$ is a necessary number of (random) comparisons for any algorithm to even produce a consistent ranking with high probability since with fewer edges (comparisons) the resulting random graph will be disconnected with positive probability. In that sense, Rank Centrality is nearly order-optimal.

In general, the comparisons may not be available between randomly chosen pairs.
Let $G = ([n], E)$ denote the graph of comparisons between these $n$ objects with an edge $(i,j) \in E$ if and only if objects $i$ and $j$ are compared. In this setting, we establish that with $O(\xi^{-2}\, n\, \mathrm{poly}(\log n))$ comparisons, Rank Centrality learns the true score of the underlying BTL model up to an arbitrarily small error with high probability. Here, $\xi$ is the spectral gap for the Laplacian of $G$ and this is how the graph structure of comparisons plays a role. Indeed, as a special case when comparisons are chosen at random, the induced graph is Erdős-Rényi for which $\xi$ is strictly positive, independent of $n$, with high probability, leading to the (order) optimal performance of the algorithm as stated earlier.

To understand the performance of Rank Centrality compared to the other options, we perform an experimental study. It shows that the performance of Rank Centrality is identical to the ML estimation of the BTL model. Furthermore, it outperforms other popular choices. In summary, Rank Centrality (a) is computationally simple, (b) always produces a solution using available data, and (c) has near optimal performance with respect to a reasonable generative model.

Some remarks about our analytic technique. Our analysis boils down to studying the induced stationary distribution of the random walk or Markov chain corresponding to the algorithm. Like most such scenarios, the only hope to obtain meaningful results for such a ‘random noisy’ Markov chain is to relate it to the stationary distribution of a known Markov chain. Through recent concentration of measure results for random matrices and the comparison technique using Dirichlet forms for characterizing the spectrum of reversible/self-adjoint operators, along with the known expansion property of the random graph, we obtain the eventual result. Indeed, it is the consequence of such
existing powerful results that lead to near-optimal analytic results for the random comparison model and characterization of the algorithm’s performance for the general setting.

As an important comparison, we provide an analysis of the sample complexity required by the maximum likelihood estimator (MLE) using the state-of-the-art analytic techniques, cf. Negahban and Wainwright (2012). Subsequent to our work, Hajek et al. (2014) extended our analysis of MLE and established that MLE also achieves near-optimal performance guarantees (up to a logarithmic factor). Our numerical experiments suggest something even stronger: the resulting error is effectively identical for both MLE and Rank Centrality.

Organization. The remainder of the paper is organized as follows. In Section 2, we describe the model, problem statement and the Rank Centrality algorithm. Section 3 describes the main results: the key theoretical properties of Rank Centrality as well as its empirical performance in the context of two real datasets from NASCAR and One Day International (ODI) cricket. We provide a comparison of Rank Centrality with the maximum likelihood estimator using the existing analytic techniques in the same section. We derive the Cramér-Rao lower bound on the square error for estimating parameters by any algorithm; across a range of parameters, the performance of Rank Centrality and MLE matches the lower bound implied by the Cramér-Rao bound, as explained in Section 3 as well. Finally, Section 4 details proofs of all results. We discuss and conclude in Section 5.

Notation. In the remainder of this paper, we use $C$, $C'$, etc. to denote absolute constants, and their value might change from line to line. We use $A^T$ to denote the transpose of a matrix $A$. The Euclidean norm of a vector is denoted by $\|x\| = \sqrt{\sum_i x_i^2}$, and the operator norm of a linear operator is denoted by $\|A\|_2 = \max_x x^T A x / x^T x$.
When we say with high probability, we mean that the probability of a sequence of events $\{A_n\}_{n=1}^{\infty}$ goes to one as $n$ grows: $\lim_{n \to \infty} \mathbb{P}(A_n) = 1$. Also define $[n] = \{1, 2, \ldots, n\}$ to be the set of all integers from 1 to $n$.

2. Model, Problem Statement and Algorithm

2.1. Model

In this section, we discuss a model of comparisons between various items. This model will be used to analyze the Rank Centrality algorithm.

Bradley-Terry-Luce model for comparative judgment. When comparing pairs of items from $n$ items of interest, represented as $[n] = \{1, \ldots, n\}$, the Bradley-Terry-Luce model assumes that there is a weight or score $w_i \in \mathbb{R}_+ \equiv \{x \in \mathbb{R} : x > 0\}$ associated with each item $i \in [n]$. The outcome of a comparison for a pair of items $i$ and $j$ is determined only by the corresponding weights $w_i$ and $w_j$. Let $Y_{ij}^{l}$ denote the outcome of the $l$-th comparison of the pair $i$ and $j$, such that $Y_{ij}^{l} = 1$ if $j$ is preferred over $i$ and 0 otherwise. Then, according to the BTL model,

$$ Y_{ij}^{l} = \begin{cases} 1 & \text{with probability } \dfrac{w_j}{w_i + w_j}, \\ 0 & \text{otherwise.} \end{cases} $$

Furthermore, conditioned on the score vector $w = (w_1, \ldots, w_n)^T$, it is assumed that the random variables $Y_{ij}^{l}$ are independent of one another for all $i$, $j$, and $l$.

Since the BTL model is invariant under the scaling of the scores, an $n$-dimensional representation of the scores is not unique. Indeed, under the BTL model, a score vector $w \in \mathbb{R}^n_+$ is the equivalence class $[w] = \{w' \in \mathbb{R}^n_+ \mid w' = a w, \text{ for some } a > 0\}$. The outcome of a comparison only depends on the equivalence class of the score vector.

To get a unique representation, we represent each equivalence class by its projection onto the standard orthogonal simplex such that $\sum_i w_i = 1$. This representation naturally defines a distance between two equivalence classes as the Euclidean distance between two projections:

$$ d(w, w') \equiv \left\| \frac{1}{\langle w, \mathbf{1} \rangle} w - \frac{1}{\langle w', \mathbf{1} \rangle} w' \right\| . $$
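The sampling rule and the distance above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function names `btl_outcome` and `normalized_distance`, the example scores, the sample count, and the fixed seed are all hypothetical choices, not part of the paper.

```python
import math
import random


def btl_outcome(w_i, w_j, rng):
    """Return 1 if j is preferred over i (probability w_j / (w_i + w_j)), else 0."""
    return 1 if rng.random() < w_j / (w_i + w_j) else 0


def normalized_distance(w, w_prime):
    """d(w, w'): Euclidean distance after scaling each score vector to sum to 1."""
    s, s_prime = sum(w), sum(w_prime)
    return math.sqrt(sum((a / s - b / s_prime) ** 2 for a, b in zip(w, w_prime)))


rng = random.Random(0)   # fixed seed, used only to make the sketch reproducible
w = [1.0, 2.0, 4.0]      # hypothetical scores for three items

# Empirical frequency with which item 2 is preferred over item 0.
# Under the BTL model the probability is 4 / (1 + 4) = 0.8.
k = 20000
freq = sum(btl_outcome(w[0], w[2], rng) for _ in range(k)) / k

# Rescaling a score vector stays inside its equivalence class,
# so the distance to the original vector is (numerically) zero.
d_scaled = normalized_distance(w, [10.0 * x for x in w])
```

The second computation reflects the scale invariance noted above: only the equivalence class of the score vector matters, so any estimate should be compared to the truth through $d(\cdot,\cdot)$ rather than through the raw vectors.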
Our main result provides an upper bound on the (normalized) distance between the estimated score vector and the true underlying score vector.

Bradley-Terry-Luce is equal to pair-wise marginals of Multinomial Logit (MNL)/Plackett-Luce. We take a brief detour to remind the reader that the BTL model is identical to the MNL model in the sense that the pair-wise distributions between objects induced under BTL are identical to those under MNL. Consider an equivalent way to describe an MNL model. Each object $i$ has an associated score $w_i > 0$. A random ordering over all $n$ objects is drawn as follows: iteratively fill the ordered positions $1, \ldots, n$ by choosing object $i(k)$ for position $k$, amongst the remaining objects (not chosen in the first $1, \ldots, k-1$ positions) with probability proportional to its weight $w_{i(k)}$. It can be easily verified that in the random ordering of $n$ objects generated as per this process, $i$ is ranked higher than $j$ with probability $w_i/(w_i + w_j)$.

Sampling model. We also assume that we perform a fixed number $k$ of comparisons for all pairs $i$ and $j$ that are considered (e.g. a best of $k$ series). This assumption is mainly to simplify notation, and the analysis as well as the algorithm easily generalize to the case when we might have a different number of comparisons for different pairs. Given observations of pair-wise comparisons among $n$ items according to this sampling model, we define a comparisons graph $G = ([n], E, A)$ as a graph of $n$ items where two items are connected if we have comparisons data on that pair and $A$ denotes the weights on each of the edges in $E$.

2.2. Rank Centrality

In our setting, we will assume that $a_{ij}$ represents the fraction of times object $j$ has been preferred to object $i$, for example the fraction of times chess player $j$ has defeated player $i$.
Given the notation above, we have that $a_{ij} = (1/k) \sum_{l=1}^{k} Y_{ij}^{l}$. Consider a random walk on a weighted directed graph $G = ([n], E, A)$, where a pair $(i,j) \in E$ if and only if the pair has been compared. The edge weights are defined based on the outcome of the comparisons: $A_{ij} = a_{ij}/(a_{ij} + a_{ji})$ and $A_{ji} = a_{ji}/(a_{ij} + a_{ji})$ (note that $a_{ij} + a_{ji} = 1$ in our setting). We let $A_{ij} = 0$ if the pair has not been compared. Note that by the Strong Law of Large Numbers, as the number of comparisons $k \to \infty$ the quantity $A_{ij}$ converges to $w_j/(w_i + w_j)$ almost surely.

A random walk can be represented by a time-independent transition matrix $P$, where $P_{ij} = \mathbb{P}(X_{t+1} = j \mid X_t = i)$. By definition, the entries of a transition matrix are non-negative and satisfy $\sum_j P_{ij} = 1$. One way to define a valid transition matrix of a random walk on $G$ is to scale all the edge weights by $1/d_{\max}$, where we define $d_{\max}$ as the maximum out-degree of a node. This rescaling ensures that each row-sum is at most one. Finally, to ensure that each row-sum is exactly one, we add a self-loop to each node. Concretely,

$$ P_{ij} = \begin{cases} \dfrac{1}{d_{\max}} A_{ij} & \text{if } i \neq j, \\[2mm] 1 - \dfrac{1}{d_{\max}} \sum_{k \neq i} A_{ik} & \text{if } i = j. \end{cases} \tag{1} $$

The choice to construct our random walk as above is not arbitrary. In an ideal setting with infinite samples ($k \to \infty$) per comparison the transition matrix $P$ would define a reversible Markov chain under the BTL model. Recall that a Markov chain is reversible if it satisfies the detailed balance equation: there exists $v \in \mathbb{R}^n_+$ such that $v_i P_{ij} = v_j P_{ji}$ for all $i, j$; and in that case, $\pi \in \mathbb{R}^n_+$ defined as $\pi_i = v_i / (\sum_j v_j)$ is its unique stationary distribution. In the ideal setting (say $k \to \infty$), we will have $P_{ij} = \tilde{P}_{ij} \equiv (1/d_{\max})\, w_j/(w_i + w_j)$ for every compared pair $(i,j) \in E$. That is, the random walk will move from state $i$ to state $j$ with probability proportional to the chance that item $j$ is preferred to item $i$. In such a setting, it is clear that $v = w$ satisfies the reversibility conditions.
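For completeness, the detailed balance check with $v = w$ can be written out in one line. The display below assumes $i \neq j$ and $(i,j) \in E$; for pairs that were never compared both $\tilde{P}_{ij}$ and $\tilde{P}_{ji}$ are zero, and for $i = j$ the identity is trivial.

```latex
\[
w_i \tilde{P}_{ij}
  = \frac{1}{d_{\max}} \cdot \frac{w_i \, w_j}{w_i + w_j}
  = w_j \cdot \frac{1}{d_{\max}} \cdot \frac{w_i}{w_j + w_i}
  = w_j \tilde{P}_{ji} .
\]
```

The middle expression is symmetric in $i$ and $j$, which is exactly why the $1/d_{\max}$ scaling (a single constant for every row) preserves reversibility.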
Therefore, under these ideal conditions it immediately follows that the vector $w / \sum_i w_i$ acts as a valid stationary distribution for the Markov chain defined by $\tilde{P}$, the ideal matrix. Hence, as long as the graph $G$ is connected and at least one node has a self loop then we are guaranteed that the random walk on our graph has a unique stationary distribution proportional to $w$. If the Markov chain is reversible then we may apply the spectral analysis of self-adjoint operators, which is crucial in the analysis of the behavior of the method.

In our setting, the matrix $P$ is a noisy version (due to finite sample error) of the ideal matrix $\tilde{P}$ discussed above. Therefore, it naturally suggests the following algorithm as a surrogate. We estimate the probability distribution obtained by applying the matrix $P$ repeatedly starting from any initial condition. Precisely, let $p_t(i) = \mathbb{P}(X_t = i)$ denote the distribution of the random walk at time $t$, with $p_0 = (p_0(i)) \in \mathbb{R}^n_+$ an arbitrary starting distribution on $[n]$. Then,

$$ p_{t+1}^T = p_t^T P . \tag{2} $$

When the transition matrix has a unique largest left eigenvector, then starting from any initial distribution $p_0$, the limiting distribution $\pi$ is unique. This stationary distribution $\pi$ is the top left eigenvector of $P$, which makes computing $\pi$ a simple eigenvector computation. Formally, we state the algorithm, which assigns numerical scores to each node, which we shall call Rank Centrality:

Rank Centrality
Input: $G = ([n], E, A)$
Output: rank $\{\pi(i)\}_{i \in [n]}$
1: Compute the transition matrix $P$ according to (1);
2: Compute the stationary distribution $\pi$ (as the limit of (2)).

The stationary distribution of the random walk is a fixed point of the following equation:

$$ \pi(i) = \sum_j \pi(j) \frac{A_{ji}}{\sum_{\ell} A_{i\ell}} . $$
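The two steps of the algorithm can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the dense list-of-lists representation of $A$, the explicit `d_max` argument, the uniform starting distribution, and the fixed iteration count are all hypothetical choices. The example uses the ideal weights $A_{ij} = w_j/(w_i + w_j)$ on a complete comparison graph, so the iteration should recover $w / \sum_i w_i$.

```python
def transition_matrix(A, d_max):
    """Step 1: build P as in equation (1): A[i][j] / d_max off the diagonal, self-loops on it."""
    n = len(A)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                P[i][j] = A[i][j] / d_max
        # The self-loop absorbs the remaining mass so that row i sums to one.
        P[i][i] = 1.0 - sum(P[i][j] for j in range(n) if j != i)
    return P


def rank_centrality(A, d_max, iters=200):
    """Step 2: approximate the stationary distribution by iterating p^T <- p^T P, as in (2)."""
    n = len(A)
    P = transition_matrix(A, d_max)
    p = [1.0 / n] * n  # uniform start; any starting distribution is allowed
    for _ in range(iters):
        p = [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]
    return p


# Hypothetical example: three items, every pair compared, ideal (k -> infinity) weights.
w = [1.0, 2.0, 4.0]
n = len(w)
A = [[0.0 if i == j else w[j] / (w[i] + w[j]) for j in range(n)] for i in range(n)]
scores = rank_centrality(A, d_max=n - 1)  # each node has degree n - 1 in the complete graph
```

With finite $k$ one would pass the empirical fractions in place of the ideal weights; the fixed iteration count is a simplification, and a convergence check on successive iterates would be the more careful choice for larger or poorly connected graphs.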
This fixed-point equation suggests an alternative intuitive justification: an object receives a high rank if it has been preferred to other high ranking objects or if it has been preferred to many objects.

One key question remains: does $P$ have a well defined unique stationary distribution? Since the Markov chain has a finite state space, there is always a stationary distribution or solution of the above stated fixed-point equations. However, it may not be unique if the Markov chain $P$ is not irreducible. The irreducibility follows easily when the graph is connected and for all edges $(i,j) \in E$, $a_{ij} > 0$, $a_{ji} > 0$. Interestingly enough, we show that the iterative algorithm produces a meaningful solution with near optimal sample complexity as stated in Theorem 2 when the pairs of objects that are compared are chosen at random.

3. Main Results

The main result of this paper derives sufficient conditions under which the proposed iterative algorithm finds a solution that is close to the true solution (under the BTL model) for a general model with an arbitrary connected comparison graph $G$. This result is stated as Theorem 1 below. In words, the result implies that to learn the true score correctly as per our algorithm, it is sufficient to have the number of comparisons scaling as $O(\xi^{-2}\, n\, \mathrm{poly}(\log n))$ where $\xi$ is the spectral gap of the Laplacian of the graph $G$. This result explicitly identifies the role played by the graph structure in the ability of the algorithm to learn the true scores.

In the special case when the pairs of objects to be compared are chosen at random, that is, the induced $G$ is an Erdős-Rényi random graph, the spectral gap $\xi$ can be lower-bounded by a constant with high probability and hence the resulting number of comparisons required scales as $O(n\, \mathrm{poly}(\log n))$. This is effectively the optimal sample complexity.
The bounds are presented as the rescaled Euclidean norm between our estimate pi and the underlying stationary distribution of P˜ . This error metric provides us with a means to quantify the relative certainty in guessing if one item is preferred over another. After presenting our main theoretical result, we describe illustrative simulation results. We also present application of the algorithm in the context of two real data-sets: results of NASCAR race for ranking drivers, and results of One Day International (ODI) Cricket for ranking teams. We shall discuss relation between Rank Centrality, the maximum likelihood estimator and the information theoretic lower bound to conclude that both MLE and Rank Centrality are near-optimal when the pairs are chosen according to the Erdo¨s-Renyi random graph. 3.1. Rank Centrality: Error bound for general graphs Recall that in the general setting, each pair of objects or items are chosen for comparisons as per the comparisons graph G([n],E). For each such pair, we have k comparisons available. The result below characterizes the performance of Rank Centrality for such a general setting. Before we state the result, we present a few necessary notations. Let di denote the degree of node i in G; let the max-degree be denoted by dmax ≡ maxi di and min-degree be denoted by dmin ≡mini di; let κ≡ dmax/dmin. The random walk normalized Laplacian matrix of the graph G is defined as L=D−1B where D is the diagonal matrix with Dii = di and B is the adjacency matrix with Bij =Bji = 1 if (i, j) ∈ E and 0 otherwise. This normalized Laplacian, defined thus, can be thought of as a transition matrix of a reversible random walk on graph G: from each node i, jump to one of its neighbors j with equal probability. Given this, it is well known that the random walk normalized Laplacian of the graph has real eigenvalues denoted as −1 ≤ λn(L) ≤ . . . ≤ λ1(L) = 1. 
(3)

We shall denote the spectral gap of the Laplacian as

$$\xi \equiv 1 - \lambda_{\max}(L), \quad \text{where} \quad \lambda_{\max}(L) \equiv \max\{\lambda_2(L), -\lambda_n(L)\}. \tag{4}$$

There is a one-to-one correspondence between the eigenvalues of the random walk normalized Laplacian $L$ and those of the standard (symmetric) normalized Laplacian $I - D^{-1/2} B D^{-1/2}$. Now we state the result establishing the performance of Rank Centrality.

Theorem 1. Given $n$ objects and a connected comparison graph $G = ([n], E)$, let each pair $(i,j) \in E$ be compared $k$ times with outcomes produced as per a BTL model with parameters $w_1, \dots, w_n$. Then, for some positive constant $C \ge 8$ and when $k \ge 4C^2 \big(b^5 \kappa^2 / (d_{\max} \xi^2)\big) \log n$, the following bound on the normalized error holds with probability at least $1 - 4n^{-C/8}$:

$$\frac{\|\pi - \tilde\pi\|}{\|\tilde\pi\|} \le \frac{C\, b^{5/2} \kappa}{\xi} \sqrt{\frac{\log n}{k\, d_{\max}}},$$

where $\tilde\pi(i) = w_i / \sum_\ell w_\ell$, $b \equiv \max_{i,j} w_i/w_j$, and $\kappa \equiv d_{\max}/d_{\min}$.

3.2. Rank Centrality: Error bound for random graphs

Now we consider the special case when the comparison graph $G$ is an Erdős-Rényi random graph with each pair $(i,j)$ being compared with probability $d/n$. When $d$ is poly-logarithmic in $n$, we provide a strong performance guarantee. Specifically, the result stated below suggests that with $O(n\, \mathrm{poly}(\log n))$ comparisons, Rank Centrality manages to learn the true scores with high probability.

Theorem 2. Given $n$ objects, let the comparison graph $G = ([n], E)$ be generated by selecting each pair $(i,j)$ to be in $E$ with probability $d/n$, independently of everything else. Each such chosen pair of objects is compared $k$ times with the outcomes of comparisons produced as per a BTL model with parameters $w_1, \dots, w_n$. Then, if $d \ge 10 C^2 \log n$ and $k d \ge 128 C^2 b^5 \log n$, the following bound on the error rate holds with probability at least $1 - 10 n^{-C/8}$:

$$\frac{\|\pi - \tilde\pi\|}{\|\tilde\pi\|} \le 8 C\, b^{5/2} \sqrt{\frac{\log n}{k\, d}},$$

where $\tilde\pi(i) = w_i / \sum_\ell w_\ell$ and $b \equiv \max_{i,j} w_i/w_j$.

Remarks. Some remarks are in order.
First, Theorem 2 immediately implies that as long as $kd$ grows super-linearly in $\log n$, the error goes to 0. Furthermore, when the number of items $n$ grows to $\infty$, as long as we choose $d = \Omega(\log n)$ and $kd = \omega(\log n)$, the relative error goes to 0 with high probability. That is, with $\omega(n \log n)$ total samples, the relative error goes to 0 with high probability. It is well known that for Erdős-Rényi graphs, the induced graph $G$ is connected with high probability only when $d = \Omega(\log n)$, i.e. when the total number of pairs sampled scales as $\Omega(n \log n)$. Thus, Rank Centrality is nearly order-optimal in this setting.

Second, the parameter $b$ should be treated as a constant. It is the dynamic range in which we are trying to resolve the uncertainty between scores. We are considering a regime in which there exists some uncertainty in the samples. Otherwise, if the weight of a single item were an order of $n$ greater than the weights of the other items, then it would effectively be preferred with certainty, and we would remove it from the items under consideration.

Third, for a general graph, Theorem 1 implies that with the choice $k\, d_{\max} = O(\kappa^2 \xi^{-2} \log n)$, Rank Centrality learns a score vector close to the true scores with high probability. That is, effectively the Rank Centrality algorithm requires $O(n \kappa^2 \xi^{-2}\, \mathrm{poly}(\log n))$ comparisons to learn the scores well. Ignoring $\kappa$, the graph structure plays a role through $\xi^{-2}$, the squared inverse of the spectral gap of the Laplacian of $G$, in dictating the performance of Rank Centrality. A reversible natural random walk on $G$, whose transition matrix is the Laplacian, has its mixing time scaling as $\xi^{-2}$ (precisely, relaxation time). In that sense, the mixing time of the natural random walk on $G$ ends up playing an important role in the ability of Rank Centrality to learn the true scores.
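The dependence on the graph through $\xi$ can be explored numerically. The sketch below computes $\xi = 1 - \max\{\lambda_2(L), -\lambda_n(L)\}$ for $L = D^{-1}B$ by working with the symmetric matrix $D^{-1/2} B D^{-1/2}$, which has the same eigenvalues. The function name `spectral_gap` and the two example graphs are our own illustrative choices, and the code assumes every node has at least one neighbor.

```python
import numpy as np

def spectral_gap(adj):
    """xi = 1 - max(lambda_2, -lambda_n) for L = D^{-1} B (illustrative sketch)."""
    B = np.asarray(adj, dtype=float)
    deg = B.sum(axis=1)                          # node degrees d_i (assumed > 0)
    # D^{-1/2} B D^{-1/2} is symmetric and shares its eigenvalues with D^{-1} B.
    S = B / np.sqrt(np.outer(deg, deg))
    eig = np.sort(np.linalg.eigvalsh(S))[::-1]   # lambda_1 >= ... >= lambda_n
    return 1.0 - max(eig[1], -eig[-1])

# Complete graph on 5 nodes: eigenvalues of L are 1 and -1/4, so xi = 3/4.
complete = np.ones((5, 5)) - np.eye(5)
# 4-cycle: bipartite, so lambda_n = -1 and the gap under this definition is 0.
cycle = np.array([[0, 1, 0, 1],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [1, 0, 1, 0]])
print(spectral_gap(complete), spectral_gap(cycle))
```

The 4-cycle illustrates why the definition uses $\max\{\lambda_2, -\lambda_n\}$ rather than $\lambda_2$ alone: a bipartite comparison graph has $\lambda_n(L) = -1$, so its gap vanishes even though the graph is connected.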
Hence, if one has the option to choose which pairs to compare, our analysis in Theorem 1 suggests that one should choose pairs such that the resulting graph has large spectral gap. Spectral gap of the comparisons graph also plays an important role in Osting et al. (2013), where the goal is to choose pairs to compare under a different model where cardinal preferences (as opposed to ordinal preferences) are observed. Finally, if we wish to obtain a relative accuracy of  with probability at least 1 − δ for a fixed number of items n, then our results also show that we require k d ≥ 512 b5/2 max(log2(10/δ)/ logn, logn). 3.3. Experimental Results Under the BTL model, define an error metric of an estimated ordering σ as the weighted sum of pairs (i, j) whose ordering is incorrect: Dw(σ) = { 1 2n‖w‖2 ∑ i 0 )}1/2 , where I(·) is an indicator function. This is a more natural error metric compared to the Kemeny distance, which is an unweighted version of the above sum, since Dw(·) is less sensitive to errors between pairs with similar weights. Further, assuming without loss of generality that w is normal- ized such that ∑ iwi = 1, the next lemma connects the error in Dw(·) to the bound provided in Theorem 2. Hence, the same upper bound holds for Dw error. A proof of this lemma is provided in the Appendix. Lemma 1. Let σ be an ordering of n items induced by a scoring pi. Then, Dw(σ) ≤ ‖w−pi‖‖w‖ . Synthetic data. To begin with, we generate data synthetically as per a BTL model for a specific choices of scores. For a given n and b, the scores are chosen such that the ratio between two consecutive scores are fixed to be b1/n, i.e. w1 = b (1−n)/2n, w2 = b(3−n)/2n, w3 = b(5−n)/2n etc. A Author: Rank Centrality: Ranking from Pair-wise Comparisons 14 Article submitted to Operations Research; manuscript no. (Please, provide the mansucript number!) representative result is depicted in Figure. 
1: for fixed $n = 400$ and fixed $b = 10$, it shows how the error scales when varying two key parameters: the number of comparisons per pair $k$ with fixed $d = 10 \log n$ (left), and the sampling probability $d/n$ with fixed $k = 32$ (right). This figure compares the performance of Rank Centrality with a variety of other algorithms. Next, we provide a brief description of the algorithms that we compare with.

Regularized Rank Centrality. When there are items that have been compared only a few times, the scores of those items might be sensitive to the randomness in the outcome of the comparisons, or, even worse, the resulting comparison graph might not be connected. To make the random walk irreducible and obtain a ranking that is more robust against noise in those edges with only a few comparisons, one can add regularization to Rank Centrality. A reasonable way to add regularization is to consider the transition probability $P_{ij}$ as the prediction of the event that $j$ beats $i$, given the data $(a_{ij}, a_{ji})$. Rank Centrality, in the non-regularized setting, uses the Haldane prior $\mathrm{Beta}(0,0)$, which gives $P_{ij} \propto a_{ij}/(a_{ij} + a_{ji})$. To add regularization, one can use different priors, for example $\mathrm{Beta}(\varepsilon, \varepsilon)$, which gives

$$P_{ij} = \frac{1}{d_{\max}} \, \frac{a_{ij} + \varepsilon}{a_{ij} + a_{ji} + 2\varepsilon}. \tag{5}$$

When the prior is unknown, a reasonable choice in practice is $\varepsilon = 1$.

Maximum Likelihood Estimator (MLE). The ML estimator directly maximizes the likelihood assuming the BTL model (Ford 1957). If we reparameterize the problem so that $\theta_i = \log(w_i)$, then we obtain our estimates $\hat\theta$ by solving the convex program

$$\hat\theta \in \arg\min_{\theta} \sum_{(i,j) \in E} \sum_{l=1}^{k} \log\big(1 + \exp(\theta_j - \theta_i)\big) - Y^l_{ij} (\theta_j - \theta_i), \tag{6}$$

which is pair-wise logistic regression. The MLE is known to be consistent (Ford 1957). The finite sample analysis of MLE is provided in Section 3.5.
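As an illustration of the convex program (6), the following minimal sketch minimizes it by plain gradient descent on aggregated win fractions. It assumes that $Y^l_{ij} = 1$ when $j$ beats $i$ in the $l$-th comparison (so that $\sum_l Y^l_{ij} = k\, a_{ij}$, with $a_{ji} = 1 - a_{ij}$), divides the objective by $k$, and uses a fixed step size $1/d_{\max}$. The function name `btl_mle`, the step size, and the iteration count are our own choices and are not taken from the paper's implementation.

```python
import numpy as np

def btl_mle(a, adj, n_iter=5000):
    """Gradient descent on the BTL negative log-likelihood (6), scaled by 1/k.

    a[i, j]  : fraction of comparisons between i and j won by j (a[j, i] = 1 - a[i, j]).
    adj[i, j]: True if the pair (i, j) was compared (symmetric, False on diagonal).
    Returns theta with mean zero; the model is invariant to adding a constant.
    """
    n = adj.shape[0]
    d_max = adj.sum(axis=1).max()
    theta = np.zeros(n)
    for _ in range(n_iter):
        diff = theta[None, :] - theta[:, None]          # diff[i, j] = theta_j - theta_i
        sigm = 1.0 / (1.0 + np.exp(-diff))              # model probability that j beats i
        resid = np.where(adj, sigm - a, 0.0)            # predicted minus observed
        grad = resid.sum(axis=0)                        # d/d theta_m of objective / k
        theta -= grad / d_max                           # assumed fixed step size
        theta -= theta.mean()                           # keep the estimate centered
    return theta

# Sanity check with exact BTL probabilities on a complete graph of 4 items.
w = np.array([1.0, 2.0, 3.0, 4.0])
adj = ~np.eye(4, dtype=bool)
a = w[None, :] / (w[:, None] + w[None, :])
theta_hat = btl_mle(a, adj)
print(np.exp(theta_hat) / np.exp(theta_hat).sum())      # compare with w / w.sum()
```

When the win fractions equal the exact BTL probabilities, the minimizer recovers $\log w$ up to an additive constant, which gives a simple correctness check before running on sampled data.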
For comparison with Regularized Rank Centrality, we also consider regularized MLE, or regularized logistic regression:

$$\arg\min_{\theta} \sum_{(i,j) \in E} \sum_{l} \Big\{ \log\big(1 + \exp(\theta_j - \theta_i)\big) - Y^l_{ij} (\theta_j - \theta_i) \Big\} + \frac{1}{2} \lambda \|\theta\|^2. \tag{7}$$

Borda Count. The (generalized) Borda Count method, analyzed recently by Ammar and Shah (2011), scores an item by counting the number of wins divided by the total number of comparisons:

$$s(i) = \frac{\#\text{ of times item } i \text{ has won}}{\#\text{ of times item } i \text{ has been compared}}.$$

This can be thought of as an extension of the standard Borda Count for aggregating full rankings (de Borda 1781), which is widely used in psychology (David 1963, Kendall and Smith 1940, Mosteller 1951). If we break the full rankings into pair-wise comparisons and apply the pair-wise version of the Borda Count from Ammar and Shah (2011), then it produces the same ranking as the standard Borda Count applied to the original full rankings. This is different from how HodgeRank from Jiang et al. (2011) generalizes the Borda Count, which does not normalize the scores by the number of comparisons.

Spectral Ranking Algorithms. Rank Centrality can be classified as one of the spectral ranking algorithms, which assign scores to the items according to the leading eigenvector of a matrix that represents the data. Different choices of the matrix lead to different algorithms. A few prominent examples are the Ratio matrix of Saaty (2003) and the algorithms of Dwork et al. (2001a). In the Ratio matrix algorithm, a matrix $M \in \mathbb{R}^{n \times n}$ with $M_{ij} = a_{ij}/a_{ji}$ is constructed (with $M_{ii} = 1$), and the scores for the items are assigned as per the top eigenvector of this ratio matrix. Dwork et al. (2001a) introduced four spectral ranking algorithms called MC1, MC2, MC3 and MC4. They are all based on a random walk very similar (but distinct) to that of Rank Centrality.
These algorithms use the stationary distributions of the following Markov chains respectively, translated to account for the pair-wise comparisons data: P (MC1) ij = 1/|{` : ai` > 0}|, P (MC2)ij = aij/ ∑ ` 6=i ai`, P (MC3) ij = { aij/deg(i) if i 6= j . 1−∑` 6=i ai`/deg(i) if i= j . , P (MC4)ij =  1/n if aij ≥ aji ,0 if aij 12n logn observations of the form (i, j, y) where i and j are drawn uniformly at random from [n] and y is Bernoulli with parameter exp(θ∗i − θ∗j )/(1 + exp(θ∗i − θ∗j )). Then, we have with probability at least 1− 2/n ‖θ̂− θ∗‖ ≤ 6(1 + b) 2 b √ n2 logn m . With the assumption that ‖θ∗‖∞ ≤ b˜, we have ‖θ∗‖ ≤ b˜ √ n. Author: Rank Centrality: Ranking from Pair-wise Comparisons Article submitted to Operations Research; manuscript no. (Please, provide the mansucript number!) 21 3.6. Crame´r-Rao lower bound The Fisher information matrix (FIM) encodes the amount of information that the observed mea- surements carry about the parameter of interest. The Crame´r-Rao bounds we derive in this section provides a lower bound on the expected squared Euclidean norm E[‖p˜i − pi‖2] of any unbiased estimator and is directly related to the (inverse of) Fisher information matrix. Denote the log-likelihood function as `(p˜i|a) = ∑ (i,j)∈E log f(aij, aji|p˜i) , where f(aij, aji|p˜i) = ( p˜ij p˜ii + p˜ij )kijaij( p˜ii p˜ii + p˜ij )kijaji , and kij is the number of times the pair (i, j) was compared. The Fisher information matrix with the BTL weights p˜i is defined as F (p˜i)∈Rn×n with F (p˜i)ij = Ea [ − ∂ 2`(p˜i|a) ∂p˜ii∂p˜ij ] =  ∑ i′∈∂i kii′ (p˜ii+p˜ii′ )2 p˜ii′ p˜ii if i= j , − kij (p˜ii+p˜ij) 2 if (i, j)∈E , 0 otherwise . This follows from the fact that ∂`(p˜i|a) ∂p˜ii = ∑ i′∈∂i −kii′(aii′ + ai′i) p˜ii + p˜ii′ + kii′ai′i p˜ii , and ∂2`(p˜i|a) ∂p˜ii∂p˜ij =  ∑ i′∈∂i kii′ ( 1 (p˜ii+p˜ii′ )2 − ai′i (p˜ii) 2 ) if i= j , kij (p˜ii+p˜ij) 2 if (i, j)∈E , 0 otherwise . Let pi denote our estimate of the weights. 
Applying the Cramér-Rao bound (Rao 1945), we get the following lower bound for all unbiased estimators $\pi$:

$$\mathbb{E}\big[\|\pi - \tilde\pi\|^2\big] \ge \mathrm{Trace}\big(F(\tilde\pi)^{-1}\big).$$

This bound depends on $\tilde\pi$ and the graph structure. Although a closed-form expression is difficult to obtain, and Rank Centrality as well as the ML estimate are biased, we compare our numerical experiments with a numerically computed Cramér-Rao bound on the same graph and the same weights $\tilde\pi$.

3.6.1. Numerical comparisons

In Figure 3, the average normalized root mean squared error (RMSE) is shown as a function of various model parameters. We fixed the control parameters as $k = 32$, $n = 400$, $d = 60$ and $b = 10$, with pairs assigned according to the Erdős-Rényi graph $G(n, d/n)$.

[Figure 3: three panels plotting RMSE against $k$, $d/n$, and $b$, each with curves for Rank Centrality, the ML estimate, and the Cramér-Rao bound.]

Figure 3. Comparisons of Rank Centrality, the ML estimator, and the Cramér-Rao bound. All three lines are almost indistinguishable for all ranges of model parameters.

Each point in the figure is averaged over a set $S$ of 20 random instances. Let $\pi^{(i)}$ be the resulting estimate in the $i$-th experiment; then

$$\mathrm{RMSE} = \frac{1}{|S|} \sum_{i \in S} \frac{\|\pi^{(i)} - \tilde\pi\|}{\|\tilde\pi\|}. \tag{10}$$

For all ranges of the model parameters $k$, $d$, and $b$, the RMSE achieved using Rank Centrality is almost indistinguishable from that of the ML estimate and also from the Cramér-Rao bound (CRB). The CRB provides a lower bound on the expected mean squared error for unbiased estimators. Although we are plotting the average root mean squared error, as opposed to the average mean squared error, we do not expect any estimator to achieve an RMSE better than the CRB as long as the error concentrates.
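For readers who wish to reproduce a bound of this kind, the sketch below assembles the Fisher information matrix from its definition in Section 3.6, under the simplifying assumption $k_{ij} = k$ for every compared pair. One point worth flagging: with these entries $F(\tilde\pi)\tilde\pi = 0$, reflecting the scale invariance of the BTL likelihood, so $F$ is singular and the inverse must be interpreted with care. The code therefore sums the reciprocals of the nonzero eigenvalues (the trace of the pseudo-inverse); this is our assumption, and it may differ from how the bound was computed for Figure 3. The names `fisher_information` and `crb_trace` are hypothetical.

```python
import numpy as np

def fisher_information(pi, adj, k):
    """F(pi) for the BTL model, assuming k comparisons on every compared pair."""
    pi = np.asarray(pi, dtype=float)
    s2 = (pi[:, None] + pi[None, :]) ** 2                 # (pi_i + pi_j)^2
    F = np.where(adj, -k / s2, 0.0)                       # off-diagonal entries
    # Diagonal: sum over neighbors i' of k * pi_i' / ((pi_i + pi_i')^2 * pi_i).
    diag = np.where(adj, k * pi[None, :] / (s2 * pi[:, None]), 0.0).sum(axis=1)
    np.fill_diagonal(F, diag)
    return F

def crb_trace(pi, adj, k):
    """Trace of the pseudo-inverse of F (ASSUMPTION: connected graph, one zero eigenvalue)."""
    eig = np.linalg.eigvalsh(fisher_information(pi, adj, k))  # ascending order
    return float(np.sum(1.0 / eig[1:]))                   # drop the zero eigenvalue

# Small illustrative instance: complete graph on 5 items.
pi = np.array([1.0, 2.0, 3.0, 4.0, 5.0]) / 15.0
adj = ~np.eye(5, dtype=bool)
F = fisher_information(pi, adj, k=32)
print("F @ pi (should vanish):", F @ pi)
print("trace of pseudo-inverse:", crb_trace(pi, adj, k=32))
```

Dropping the smallest eigenvalue explicitly, rather than relying on a numerical cutoff, keeps the sketch robust when the null direction is only zero up to rounding error.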
The ML estimator in (7) with $\lambda = 0$ finds an estimate $\pi = e^{\hat\theta}$ that maximizes the log-likelihood, and in general the ML estimate does not coincide with the minimum mean squared error estimator. From the figure we see that it in fact achieves the minimum mean squared error and matches the CRB. What is perhaps surprising is that for all the parameters that we experimented with, the RMSE achieved by Rank Centrality is almost indistinguishable from that of the ML estimate and the CRB. Thus, coupled with the minimax lower bounds, one cannot do better than Rank Centrality under the BTL model.

3.7. Discussion of Results

In this section we review the results that we have established above. In Theorem 1 we establish upper bounds on the error when samples are drawn from an arbitrary graph and when each edge is compared $k$ times. This bound depends on the spectral gap of the underlying graph, which shows that graphs with a larger spectral gap achieve smaller estimation error. For the case of Erdős-Rényi graphs, Theorem 2 provides an upper bound on the error achieved by Rank Centrality. In Theorem 3 we prove that the bound is near-optimal, up to logarithmic factors, in an information-theoretic sense. That is, no method, regardless of computational power, can achieve better performance on the same statistical model.

For a tighter analysis of the optimality of Rank Centrality, we provide numerical experiments under the BTL model and compare the results to the Cramér-Rao lower bound established in Section 3.6. Comparisons with the Cramér-Rao bound in Figure 3 suggest that the error achieved by Rank Centrality is indistinguishable from the fundamental Cramér-Rao lower bound, and hence exactly optimal for a certain class of estimators.

For completeness, we further provide an analysis of the error achieved by the MLE in Theorem 4.
Building upon our analysis, Hajek et al. (2014) show that the MLE is near order-optimal, just like Rank Centrality.

Finally, we compare the computational cost of Rank Centrality with that of the MLE. While it is difficult to make an exact, theoretical comparison, we nevertheless compare their computational cost by means of popular implementations on a common computation platform. For Rank Centrality, the implementation is based on the eigs function in MATLAB. For the MLE, the implementation is based on a basic first-order method. In a collection of experiments (with varying problem parameters), Rank Centrality converges an order of magnitude faster than the MLE. It should be noted that the first-order method has a tunable step size, and our implementation did not attempt to optimize this selection when varying problem parameters. Finally, the MLE can be viewed as a standard logistic regression. Therefore, the lm function of R can be used to solve for the MLE. Again, in the same computation environment, the resulting MLE is an order of magnitude slower than the MATLAB implementation of Rank Centrality, but faster than the first-order method.

4. Proofs

We may now present proofs of Theorems 1 and 2. We first present a proof of convergence for general graphs in Theorem 1. This result follows from Lemma 2, stated below, which shows that our algorithm enjoys convergence properties that result in useful upper bounds. The lemma is general and uses standard techniques of spectral theory. The main difficulty arises in establishing that the Markov chain $P$ satisfies certain properties that we will discuss subsequently. Given the proof for the general graph, Theorem 2 follows by showing that in the case of Erdős-Rényi graphs, certain spectral properties are satisfied with high probability. The next set of proofs covers the information-theoretic lower bound stated in Theorem 3 and the proof of Theorem 4, which establishes the finite sample error analysis of the MLE.
4.1. Proof of Theorem 1: General graph

In this section, we characterize the error rate achieved by our ranking algorithm. Given the random Markov chain $P$, where the randomness comes from the outcome of the comparisons, we will show that it does not deviate too much from its expectation $\tilde P$, where we recall that $\tilde P$ is defined as

$$\tilde P_{ij} = \begin{cases} \dfrac{1}{d_{\max}} \dfrac{w_j}{w_i + w_j} & \text{if } i \ne j, \\[2ex] 1 - \dfrac{1}{d_{\max}} \displaystyle\sum_{\ell \ne i} \dfrac{w_\ell}{w_i + w_\ell} & \text{if } i = j, \end{cases}$$

for all $(i,j) \in E$, and $\tilde P_{ij} = 0$ otherwise.

Recall from the discussion following equation (1) that the transition matrix $P$ used in our ranking algorithm has been carefully chosen such that the corresponding expected transition matrix $\tilde P$ has two important properties. First, the stationary distribution of $\tilde P$, which we denote by $\tilde\pi$, is proportional to the weight vector $w$. Furthermore, when the graph is connected and has at least one self-loop (which is always the case here), this Markov chain is irreducible and aperiodic, so that the stationary distribution is unique. The second important property of $\tilde P$ is that it is reversible: $\tilde\pi(i) \tilde P_{ij} = \tilde\pi(j) \tilde P_{ji}$. This observation implies that the operator $\tilde P$ is symmetric in an appropriately defined inner product space. The symmetry of the operator $\tilde P$ will be crucial in applying ideas from spectral analysis to prove our main results.

Let $\Delta$ denote the fluctuation of the transition matrix around its mean, such that $\Delta \equiv P - \tilde P$. The following lemma bounds the deviation of the Markov chain after $t$ steps in terms of two important quantities: the spectral radius of the fluctuation $\|\Delta\|_2$ and the spectral gap $1 - \lambda_{\max}(\tilde P)$, where

$$\lambda_{\max}(\tilde P) \equiv \max\{\lambda_2(\tilde P), -\lambda_n(\tilde P)\}.$$

Since the eigenvalues $\lambda_i(\tilde P)$ are sorted, $\lambda_{\max}(\tilde P)$ is the second largest eigenvalue in absolute value.

Lemma 2.
For any Markov chain $P = \tilde P + \Delta$ with a reversible Markov chain $\tilde P$, let $p_t$ be the distribution of the Markov chain $P$ after $t$ steps when started with initial distribution $p_0$. Then,

$$\frac{\|p_t - \tilde\pi\|}{\|\tilde\pi\|} \le \rho^t \, \frac{\|p_0 - \tilde\pi\|}{\|\tilde\pi\|} \sqrt{\frac{\tilde\pi_{\max}}{\tilde\pi_{\min}}} + \frac{1}{1 - \rho} \, \|\Delta\|_2 \sqrt{\frac{\tilde\pi_{\max}}{\tilde\pi_{\min}}}, \tag{11}$$

where $\tilde\pi$ is the stationary distribution of $\tilde P$, $\tilde\pi_{\min} = \min_i \tilde\pi(i)$, $\tilde\pi_{\max} = \max_i \tilde\pi(i)$, and $\rho = \lambda_{\max}(\tilde P) + \|\Delta\|_2 \sqrt{\tilde\pi_{\max}/\tilde\pi_{\min}}$.

The above result provides a general mechanism for establishing error bounds between an estimated stationary distribution $\pi$ and the desired stationary distribution $\tilde\pi$. It is worth noting that the result only requires control on the quantities $\|\Delta\|_2$ and $1 - \rho$. We may now state two technical lemmas that provide control on the quantities $\|\Delta\|_2$ and $1 - \rho$, respectively.

Lemma 3. For some constant $C \ge 8$, the error matrix $\Delta = P - \tilde P$ satisfies

$$\|\Delta\|_2 \le C \sqrt{\frac{\log n}{k\, d_{\max}}}$$

with probability at least $1 - 4n^{-C/8}$.

The next lemma provides our desired bound on $1 - \rho$.

Lemma 4. If $\|\Delta\|_2 \le C \sqrt{\log n / (k\, d_{\max})}$ and $k \ge 4 C^2 b^5 d_{\max} \log n \, \big(1/(d_{\min} \xi)\big)^2$, then

$$1 - \rho \ge \frac{\xi\, d_{\min}}{b^2 d_{\max}}.$$

Proof of Theorem 1. With the above stated lemmas, we proceed with the proof of Theorem 1. When there is a positive spectral gap such that $\rho < 1$, the first term in (11) vanishes as $t$ grows, since its remaining factors are bounded and independent of $t$. Formally, we have

$$\tilde\pi_{\max}/\tilde\pi_{\min} \le b, \qquad \|\tilde\pi\| \ge 1/\sqrt{n}, \qquad \text{and} \qquad \|p_0 - \tilde\pi\| \le 2,$$

by the assumption that $\max_{i,j} w_i/w_j \le b$ and the fact that $\tilde\pi(i) = w_i / \big(\sum_j w_j\big)$. Hence, the error between the distribution at the $t$-th iteration $p_t$ and the true stationary distribution $\tilde\pi$ is dominated by the second term in equation (11). Lemma 4 gives $1/(1-\rho) \le b^2 d_{\max}/(\xi\, d_{\min})$, Lemma 3 gives $\|\Delta\|_2 \le C \sqrt{\log n/(k\, d_{\max})}$, and $\sqrt{\tilde\pi_{\max}/\tilde\pi_{\min}} \le b^{1/2}$. Multiplying these three factors, the dominant second term in equation (11) is bounded by

$$\lim_{t \to \infty} \frac{\|p_t - \tilde\pi\|}{\|\tilde\pi\|} \le \frac{C\, b^{5/2}}{\xi\, d_{\min}} \sqrt{\frac{d_{\max} \log n}{k}}$$

with probability at least $1 - 4n^{-C/8}$.
In fact, we only need t = Ω(logn + log b + log(dmax logn/(d 2 minkξ 2))) to ensure that the above bound holds up to a constant factor. This finishes the proof of Theorem 1. Notice that in order for this result to hold, we need k ≥ 4C2b5dmax logn(1/dminξ) 2 for Lemma 4. 4.1.1. Proof of Lemma 2. Due to the reversibility of P˜ , we can view it as a self-adjoint operator on an appropriately defined inner product space. This observation allows us to apply the well-understood spectral analysis of self-adjoint operators. To that end, define an inner product space L2(p˜i) as a space of n-dimensional vectors, Rn, endowed with 〈a, b〉p˜i = n∑ i=1 aip˜iibi . Similarly, we define ‖a‖p˜i = √〈a,a〉 p˜i as the 2-norm in L2(p˜i). An operator (matrix) A is self-adjoint with respect to L2(pi) if 〈u,Av〉p˜i = 〈Au,v〉p˜i for all u, v ∈ Rn. For a self-adjoint operator A in Author: Rank Centrality: Ranking from Pair-wise Comparisons 26 Article submitted to Operations Research; manuscript no. (Please, provide the mansucript number!) L2(p˜i), we define ‖A‖p˜i,2 = maxa ‖Aa‖p˜i/‖a‖p˜i as the operator norm. These norms are related to the corresponding norms in the Euclidean space through the following inequalities. √ p˜imin ‖a‖ ≤ ‖a‖p˜i ≤ √ p˜imax ‖a‖ , (12)√ p˜imin p˜imax ‖A‖2 ≤ ‖A‖p˜i,2 ≤ √ p˜imax p˜imin ‖A‖2 . (13) It is easy to check that, a reversible Markov chain P˜ is self-adjoint in L2(p˜i) due to the detailed- balanced condition, where p˜i is the unique stationary distribution of P˜ . Consider symmetrized version of P˜ , defined as S = Π˜1/2P˜ Π˜−1/2, where Π˜ is a diagonal matrix with Π˜ii = p˜i(i). Again, reversibility of P˜ makes S symmetric. It can be verified that P˜ and S have the same set of eigenvalues. By Perron-Frobenius theorem, the eigenvalues are in [−1,1] with largest being equal to 1. Let they be denoted as 1 = λ1 ≥ λ2 ≥ . . .≥ λn ≥−1, and let λmax = max{|λn|, λ2}. Let ui be the left eigenvector of S corresponding to λi for 1≤ i≤ n. 
Then the ith left eigenvector of P˜ is given by vi = Π˜ 1/2ui. Since the first left eigenvector of P˜ is the stationary distribution, i.e. v1 = p˜i, we have that u1(i) = p˜i(i) 1/2 or Π˜−1/2u1 = 1. Finally, define rank-1 projection of S as S1 = λ1u1u T 1 = u1u T 1 and let P˜1 = Π˜ −1/2S1Π˜1/2. Our interest is in Markov chain P = P˜ + ∆ and iterates obtained from it pTt = p T t−1P . Then, pTt − p˜iT = (pt−1− p˜i)T (P˜ + ∆) + p˜iT∆ . (14) Using the fact that (p` − p˜i)T Π˜−1/2u1 = (p` − p˜i)T1 = 0 for any probability distribution p`, we get (p`− p˜i)T P˜1 = (p`− p˜i)T Π˜−1/2u1λ1uT1 Π˜1/2 = 0. Then, from (14) we get pTt − p˜iT = (pt−1− p˜i)T (P˜ − P˜1 + ∆) + p˜iT∆ . By definition of P˜1, it follows that ‖P˜ − P˜1‖p˜i,2 = ‖S−S1‖2 = λmax. Let ρ= λmax + ‖∆‖p˜i,2, then ‖pt− p˜i‖p˜i ≤ ‖pt−1− p˜i‖p˜i(‖P˜ − P˜1‖p˜i,2 + ‖∆‖p˜i,2) + ‖p˜iT∆‖p˜i ≤ ρt‖p0− p˜i‖p˜i + t−1∑ `=0 ρt−1−`‖p˜iT∆‖p˜i . Dividing each side by ‖p˜i‖ and applying the bounds in (12) and (13), we get ‖pt− p˜i‖ ‖p˜i‖ ≤ ρ t √ p˜imax p˜imin ‖p0− p˜i‖ ‖p˜i‖ + t−1∑ `=0 ρt−1−` √ p˜imax p˜imin ‖p˜iT∆‖ ‖p˜i‖ . This finishes the proof of the desired claim. Author: Rank Centrality: Ranking from Pair-wise Comparisons Article submitted to Operations Research; manuscript no. (Please, provide the mansucript number!) 27 4.1.2. Proof of Lemma 3. Our interest is in bounding ‖∆‖2. Now ∆ = P − P˜ so that for 1≤ i, j ≤ n, ∆ij = 1 kdmax Cij, (15) where Cij is distributed as per B(k, pij)−kpij if (i, j)∈E and Cij = 0 otherwise. Here B(k, pij) is a Binomial random variable with parameter k and pij ≡ wjwi+wj . It should be noted that Cij +Cji = 0 and Cij are independent across all the pairs with i < j. For 1≤ i≤ n ∆ii = Pii− P˜ii = ( 1− ∑ j 6=i Pij )− (1−∑ j 6=i P˜ij ) = ∑ j 6=i P˜ij −Pij =− ∑ j 6=i ∆ij. (16) Given the above dependence between diagonal and off-diagonal entries, we shall bound ‖∆‖2 as follows: let D be the diagonal matrix with Dii = ∆ii for 1≤ i≤ n and ∆¯ = ∆−D. Then, ‖∆‖2 = ‖D+ ∆¯‖2 ≤ ‖D‖2 + ‖∆¯‖2. 
(17) We shall establish the bound of O (√ logn kdmax ) for both ‖D‖2 and ‖∆¯‖2 to establish the Lemma 3. Bounding ‖D‖2. Since D is a diagonal matrix, ‖D‖2 = maxi |Dii| = maxi |∆ii|. For a given fixed i, as per (15)-(16), kdmax∆ii can be expressed as summation of at most kdmax independent, zero- mean random variables taking values in the range of at most 1. Therefore, by an application of Azuma-Hoeffding’s inequality, it follows that P ( kdmax|∆ii|> t )≤ 2exp (− t2 2kdmax ) . (18) By selection of t = C √ kdmax logn for appropriately large constant, it follows from above display that P ( ‖D‖2 ≥C √ logn kdmax ) ≤ n∑ i=1 P ( |∆ii|>C √ logn kdmax ) (19) ≤ 2n−C2/2+1 (20) Bounding ‖∆¯‖2 when dmax ≤ logn. Towards this goal, we shall make use of the following standard inequality: for any square matrix M , ‖M‖2 ≤ √ ‖M‖1‖M‖∞, (21) where ‖M‖1 = maxi ∑ j |Mij| and ‖M‖∞ = ‖MT‖1. In words, ‖M‖22 is bounded above by product of the maximal row-sum and column-sum of absolute values of M . Since ∆ij and ∆ji are identically distributed and entries along each row (and hence each column) are independent, it is sufficient Author: Rank Centrality: Ranking from Pair-wise Comparisons 28 Article submitted to Operations Research; manuscript no. (Please, provide the mansucript number!) to obtain a high probability bound (≥ 1− 1/poly(n)) for maximal row-sum of absolute values of ∆¯; exactly the same bound will apply for column-sum using; and using union bound the desired result will follow. To that end, consider the sum of the absolute values of the ith row-sum of ∆¯ and for simplicity let us denote it by Ri. Then, Ri = 1 kdmax ∑ j 6=i |Cij|, (22) where recall that Cij =Xij−kpij with Xij an independent Binomial random variable with param- eters k, pij. 
Therefore, for any s > 0, P ( Ri > s ) = P (∑ j∈∂i |Cij|>kdmaxs ) ≤ ∑ j∈∂i ∑ ξj∈{−1,+1} P (∑ j ξjCi,j >kdmaxs ) by the union bound ≤ ∑ j∈∂i ∑ ξj∈{−1,+1} exp (−2k2d2maxs2 dik ) where the last inequality follows from Hoeffding’s bound and the fact that Xij = ∑k j=1(yij − pij) where yij are Bernoulli random variables with mean pij. Now, the number of terms in the sum is 2di , the summand is constant, and di ≤ dmax. Thus, the last inequality is upper-bounded by∑ j∈∂i ∑ ξj∈{−1,+1} exp (−2k2d2maxs2 dik2 ) ≤ exp (−2kdmaxs2 + di ln 2) By an application of the union bound P (‖∆¯‖2 ≥ s)≤ 2nP (Ri ≥ s) ≤ 2n exp (−2kdmaxs2 + dmax ln 2) . Now, if we set s= C 2 √ logn+dmax ln 2 kdmax we have that P ( ‖∆¯‖2 ≥C/2 √ logn+ dmax ln 2 kdmax ) ≤ 2n−(C2/2−1) Finally, using the assumption that dmax ≤ logn yields ‖∆¯‖2 ≤C √ logn kdmax with probability at least 1− 2n−C2/2+1. Bounding ‖∆¯‖2 when dmax ≥ logn. Towards this goal, we shall make use of the recent results on the concentration of the sum of independent random matrices. For completeness, we recall the following result (Tropp 2011). Author: Rank Centrality: Ranking from Pair-wise Comparisons Article submitted to Operations Research; manuscript no. (Please, provide the mansucript number!) 29 Lemma 5 (Theorem 6.2 (Tropp 2011)). Consider a finite sequence {Z˜ij}ij(A˜ij)2‖2. Then, for all t≥ 0, P (∥∥∥∑ i (1/2)d )≤ 2e−d/16. Hence, for d≥C ′ logn, equation (30) is true with probability at least 1− 2n−C′/16. Finally, we finish the proof with a result on the lower bound of the spectral gap ξ = 1 − λmax(D −1B). Lemma 7. Consider a random graph G drawn from the Erd¨os-Renyi distribution G(n,d/n). Then if d≥ 10C2 logn, we have ξ ≥ 1/2 with probability at least 1−n−Cn/(n−d)/8 The proof of this result can be found in Appendix B. Author: Rank Centrality: Ranking from Pair-wise Comparisons Article submitted to Operations Research; manuscript no. (Please, provide the mansucript number!) 33 4.3. 
Proof of Theorem 3: Information-theoretic lower bound In this section, we prove Theorem 3 using an information-theoretic method that allows us to reduce the stochastic inference problem into a multi-way hypothesis testing problem. This estimation problem can be reduced to the following hypothesis testing problem. Consider a set {p˜i(1), . . . , p˜i(M(δ))} of M(δ) vectors on the standard orthogonal simplex which are separated by δ, such that ‖p˜i(`1)− p˜i(`2)‖ ≥ δ for all `1 6= `2. To simplify the notations, we are going to use M as a shorthand for M(δ). Suppose we choose an index L ∈ {1, . . . ,M} uniformly at random. Then, we are given noisy outcomes of pair-wise comparisons with w= p˜i(L) from the BTL model. We use X to denote this set of observations. Let pi be the estimation produced by an algorithm using the noisy observations. Given this, the best estimation of the “index” is Lˆ, where Lˆ= arg min`∈[M ] ‖pi− p˜i(`)‖. By construction of our packing set, when we make a mistake in the hypothesis testing, our estimate is at least δ/2 away from the true weight p˜i(L). Precisely, Lˆ 6=L implies that ‖pi− p˜i(L)‖ ≥ δ/2. Then, E [‖pi− p˜i(L)‖ ] ≥ δ 2 P ( Lˆ 6=L) ≥ δ 2 { 1− I(Lˆ;L) + log 2 logM } , (32) where I(·; ·) denotes the mutual information between two random variables and the second inequal- ity follows from Fano’s inequality. These random vectors form a Markov chain L— p˜i(L) —X—pi— Lˆ , where X—Y —Z indicates that X and Z are conditionally independent given Y . Let PL,X(`, x) denote the joint probability function, and PX|L(x|`), PL(`) and PX(x) denote the conditional and marginal probability functions. 
Then, by data processing inequality for a Markov chain, we get I(L; Lˆ) ≤ I(L;X) = EL,X [ log ( PL,X(L,X) PL(L)PX(X) )] = 1 M ∑ `∈[M ] EX [ log (PX|L(X|`) PX(X) )] = 1 M ∑ `∈[M ] EX [ log ( PX|L(X|`)∑ `2∈[M ] PX|L(X|`2)P(`2) )] ≤ 1 M ∑ `∈[M ] ∑ `2∈[M ] P(`2)EX [ log ( PX|L(X|`) PX|L(X|`2) )] = 1 M 2 ∑ `1,`2 DKL ( PX|L(X|`1) ∥∥∥PX|L(X|`2)) , (33) where DKL(·‖·) is the Kullback-Leibler (KL) divergence and the inequality follows from the con- cavity of logarithm and Jensen’s inequality. Author: Rank Centrality: Ranking from Pair-wise Comparisons 34 Article submitted to Operations Research; manuscript no. (Please, provide the mansucript number!) The KL divergence between the observations coming from two different BTL models depend on how we sample the comparisons. We are sampling each pair of items for comparison with probability d/n, and we are comparing each of these sampled pairs k times. Let Xij denote the outcome of k comparisons for a sampled pair of items (i, j). To simplify notations, we drop the subscript X|L whenever it is clear from the context. Then, DKL ( P(X|`1) ∥∥P(X|`2) ) = d n ∑ 1≤i αδ√ n ) ≤ 2e−n/2 . By union bound, this holds uniformly for all ` with probability at least 1−2e−63n/128. In particular, this implies that 1− 2αδ√n n ≤ p˜i(`)i ≤ 1 + 2αδ √ n n , (35) for all i∈ [n] and `∈ [M ]. Next, we use standard concentration results to bound the distance between two vectors: ∥∥p˜i(`1)− p˜i(`2)∥∥2 = ∥∥Y (`1)−Y (`2)∥∥2−n(Y¯ (`1)− Y¯ (`2))2 Applying Hoeffding’s inequality for the first term, we get P (|∑i(Y (`1)i − Y (`2)i )2 − (2/3)α2δ2| ≥ (1/2)α2δ2 ) ≤ 2e−n/32. Similarly for the second term, we can show that P(|∑i(Y (`1)i − Y (`2)i )| ≥ (1/4)αδ √ n )≤ 2e−n/32. Substituting these bounds, we get 1 10 α2δ2 ≤ ‖p˜i(`1)− p˜i(`2)‖2 ≤ 13 10 α2δ2 , (36) Author: Rank Centrality: Ranking from Pair-wise Comparisons 36 Article submitted to Operations Research; manuscript no. (Please, provide the mansucript number!) with probability at least 1− 4e−n/32. 
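Applying the union bound over $\binom{M}{2} \le e^{n/64}$ pairs of vectors, we get that the lower and upper bounds hold for all pairs $\ell_1 \ne \ell_2$ with probability at least $1 - 4e^{-n/64}$. The probability that both conditions (35) and (36) are satisfied is at least $1 - 4e^{-n/64} - 2e^{-63n/128}$. For $n \ge 90$, the probability of success is strictly positive. Hence, we know that there exists at least one set of vectors that satisfies the conditions. Setting $\alpha = \sqrt{10}$, we have constructed a set that satisfies all the conditions.

4.4. Proof of Theorem 4: Finite sample analysis of MLE

The proof of this theorem proceeds in two parts. First we show that if the gradient of the loss $\nabla\mathcal{L}_m$ evaluated at $\theta^*$ is small, then the error between $\theta^*$ and $\hat\theta$ is also small. To that end we begin with a simple inequality: $\mathcal{L}_m(\hat\theta) \le \mathcal{L}_m(\theta^*)$. Let $\Delta = \hat\theta - \theta^*$. Subtracting $\langle \nabla\mathcal{L}_m(\theta^*), \Delta\rangle$ from both sides of the above inequality, we obtain

$$ \mathcal{L}_m(\theta^* + \Delta) - \mathcal{L}_m(\theta^*) - \langle \nabla\mathcal{L}_m(\theta^*), \Delta\rangle \;\le\; -\langle \nabla\mathcal{L}_m(\theta^*), \Delta\rangle . $$

Now assume $\|\nabla\mathcal{L}_m(\theta^*)\|_2 \le c$. By the Cauchy-Schwarz inequality we have that

$$ \mathcal{L}_m(\theta^* + \Delta) - \mathcal{L}_m(\theta^*) - \langle \nabla\mathcal{L}_m(\theta^*), \Delta\rangle \;\le\; c\,\|\Delta\|_2 . $$

Therefore, if we prove that

$$ \mathcal{L}_m(\theta^* + \Delta) - \mathcal{L}_m(\theta^*) - \langle \nabla\mathcal{L}_m(\theta^*), \Delta\rangle \;\ge\; \frac{\mu}{2}\,\|\Delta\|_2^2, \tag{37} $$

then we immediately have that $\|\Delta\|_2 \le 2c/\mu$. We now proceed to establish the above inequality.

4.4.1. Proof of Equation 37

By Taylor's theorem and the definition of $\mathcal{L}_m$ from equation 9, for some $v \in [0,1]$ we have

$$ \mathcal{L}_m(\theta^* + \Delta) - \mathcal{L}_m(\theta^*) - \langle \nabla\mathcal{L}_m(\theta^*), \Delta\rangle = \frac{1}{2m} \sum_{l=1}^{m} \frac{\exp(\langle\theta^*, x_l\rangle + v\langle\Delta, x_l\rangle)}{\big(1 + \exp(\langle\theta^*, x_l\rangle + v\langle\Delta, x_l\rangle)\big)^2}\, \big(\langle\Delta, x_l\rangle\big)^2 . $$

Now, by assumption $\sum_i \theta^*_i = \sum_i \hat\theta_i = 0$, and both $\theta^*_{\max} - \theta^*_{\min} \le \log(b)$ and $\hat\theta_{\max} - \hat\theta_{\min} \le \log(b)$, so that $|\langle\theta^*, x_l\rangle + v\langle\Delta, x_l\rangle| \le \log(b)$. Therefore,

$$ \mathcal{L}_m(\theta^* + \Delta) - \mathcal{L}_m(\theta^*) - \langle \nabla\mathcal{L}_m(\theta^*), \Delta\rangle \;\ge\; \frac{1}{2m} \sum_{l=1}^{m} \frac{b}{(1+b)^2}\, \big(\langle\Delta, x_l\rangle\big)^2 . $$

Thus, what remains is to establish a lower-bound on $\frac{1}{m} \sum_{l=1}^{m} (\langle\Delta, x_l\rangle)^2$. We appeal to the following lemma for the lower-bound.

Lemma 9. Given $m > 12\,n\log n$ i.i.d.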
samples $y_l, x_l$ we have that

$$ \frac{1}{m} \sum_{l=1}^{m} \big(\langle\Delta, x_l\rangle\big)^2 \;\ge\; \frac{1}{3n}\,\|\Delta\|_2^2 $$

with probability at least $1 - 1/n$.

Finally, we present the following lemma that establishes an upper-bound on $\|\nabla\mathcal{L}_m(\theta^*)\|_2$.

Lemma 10. Given $m$ observations $(v_l, x_l)$ we have that

$$ \|\nabla\mathcal{L}_m(\theta^*)\|_2 \;\le\; 2\sqrt{\frac{\log n}{m}} $$

with probability at least $1 - 1/n$.

Therefore, putting everything together we have that

$$ \|\Delta\|_2 \;\le\; \frac{6(1+b)^2}{b} \sqrt{\frac{n^2 \log n}{m}}, $$

which establishes the desired result.

4.4.2. Proof of Lemma 9

To prove this lemma we note that

$$ \frac{1}{m} \sum_{l=1}^{m} \big(\langle\Delta, x_l\rangle\big)^2 = \frac{1}{m} \sum_{l=1}^{m} \Delta^T x_l x_l^T \Delta . $$

Thus, it is sufficient to prove a lower-bound on $\lambda_{\min}\big(\frac{1}{m}\sum_{l=1}^{m} x_l x_l^T\big)$. In order to do so we may again appeal to recent results in random matrix theory (Tropp 2011).

Lemma 11 (Theorem 1.4 (Tropp 2011)). Consider a finite sequence $\{X_k\}$ of independent, random, self-adjoint matrices with dimension $d$. Assume that each random matrix satisfies $E X_k = 0$ and $\lambda_{\max}(X_k) \le R$ almost surely. Then, for all $t \ge 0$,

$$ P\Big\{ \lambda_{\max}\Big(\sum_k X_k\Big) \ge t \Big\} \;\le\; d \cdot \exp\left( \frac{-t^2/2}{\sigma^2 + Rt/3} \right) \quad \text{where } \sigma^2 := \Big\| \sum_k E(X_k^2) \Big\|, \tag{38} $$

and $\|X\|$ for a matrix $X$ represents the operator norm of $X$, that is, its largest singular value.

In order to apply the above lemma we let $X_l = x_l x_l^T - \frac{2}{n}(I - \mathbb{1}\mathbb{1}^T/n)$. Therefore, the $X_l$ are zero-mean, i.i.d., and symmetric. Furthermore, $\|X_l\| \le 2$ and $E X_l^2 = \frac{4}{n}(I - \mathbb{1}\mathbb{1}^T/n) - \frac{4}{n^2}(I - \mathbb{1}\mathbb{1}^T/n)$. Therefore, applying the above lemma to both $X_l$ and $-X_l$ yields the inequality

$$ P\Big\{ \Big\| \sum_l X_l / m \Big\| \ge t \Big\} \;\le\; 2n \exp\left( \frac{-t^2/2}{\frac{4}{nm} + \frac{2t}{3m}} \right) . $$

Thus, with probability at least $1 - 1/n$,

$$ \Big\| \frac{1}{m} \sum_l X_l \Big\| \;\le\; \max\left( 4\sqrt{\frac{2\log n}{nm}},\; \frac{8}{3}\,\frac{\log n}{m} \right) . $$

Hence, as long as $12\,n\log n <$ […] $> 12\,n\log n$, the above inequality can be lower bounded by $\frac{1}{3n}\|\Delta\|_2^2$, establishing the desired result.

4.4.3. Proof of Lemma 10

To establish this result we proceed by showing that each individual element of $\nabla\mathcal{L}_m$ is upper bounded by $2\sqrt{\log n/(nm)}$ with high probability.
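The component-wise claim can be sanity-checked numerically before working through the argument. The sketch below is illustrative only and rests on assumptions that are not taken from this section: each $x_l$ is $e_i - e_j$ with $i$ and $j$ drawn independently and uniformly, outcomes follow a logistic model in $\langle\theta^*, x_l\rangle$, and the gradient is the usual logistic-regression average of $x_l$ times (mean minus outcome). The function name and parameter values are hypothetical.

```python
import numpy as np

def gradient_sup_norm(n, m, b=4.0, seed=0):
    """Return (sup-norm of the simulated gradient at theta*, 2*sqrt(log n / (n m)))."""
    rng = np.random.default_rng(seed)
    # Hypothetical true parameters: centered, with range at most log(b).
    theta = rng.uniform(0.0, np.log(b), size=n)
    theta -= theta.mean()
    # Assumed sampling model: x_l = e_i - e_j with i, j independent and uniform.
    i = rng.integers(0, n, size=m)
    j = rng.integers(0, n, size=m)
    z = theta[i] - theta[j]                       # <theta*, x_l>
    p = 1.0 / (1.0 + np.exp(-z))                  # conditional mean of the outcome
    y = (rng.random(m) < p).astype(float)         # simulated binary outcome
    resid = p - y                                 # mean minus outcome
    grad = np.zeros(n)
    np.add.at(grad, i, resid)                     # contribution of the +1 entry of x_l
    np.add.at(grad, j, -resid)                    # contribution of the -1 entry of x_l
    grad /= m
    threshold = 2.0 * np.sqrt(np.log(n) / (n * m))
    return float(np.abs(grad).max()), float(threshold)

if __name__ == "__main__":
    sup_norm, threshold = gradient_sup_norm(n=50, m=20000)
    print(sup_norm, threshold)
```

Comparing the two returned values for a few choices of `n` and `m` gives a rough feel for how much slack the bound has; it is not a substitute for the proof.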
Recall that

$$ \nabla\mathcal{L}_m = \frac{1}{m} \sum_{l=1}^{m} x_l \big( E[X_l | x_l] - X_l \big) . $$

Consequently, focusing on a single component $\nabla\mathcal{L}_{mk}$ we have that

$$ \nabla\mathcal{L}_{mk} = \frac{1}{m} \sum_{l=1}^{m} (x_l)_k \big( E[X_l | x_l] - X_l \big) . $$

Thus, the $k$th component of $\nabla\mathcal{L}_m$ is the average of $m$ independent mean-zero random variables that are bounded by 1 in absolute value and that each have variance upper-bounded by $1/n$. Therefore, an application of Bernstein's inequality yields

$$ P\big( |\nabla\mathcal{L}_{mk}| \ge t \big) \;\le\; 2\exp\left( \frac{-t^2}{\frac{2}{nm} + \frac{2t}{3m}} \right) . $$

Therefore, by the union bound over the $n$ components,

$$ P\big( \|\nabla\mathcal{L}_m\|_\infty \ge t \big) \;\le\; n\,P\big( |\nabla\mathcal{L}_{mk}| \ge t \big) \;\le\; 2n \exp\left( \frac{-t^2}{\frac{2}{nm} + \frac{2t}{3m}} \right) . $$

Using arguments similar to those used to establish the results in Section 4.4.1, we have that with probability at least $1 - 2/n$,

$$ \|\nabla\mathcal{L}_m\|_\infty \;\le\; 2\sqrt{\frac{\log n}{nm}}, $$

as desired.

5. Discussion

The main contribution of this paper is the design and analysis of Rank Centrality: an iterative algorithm for rank aggregation using pair-wise comparisons. We established the efficacy of the algorithm by analyzing its performance when data is generated as per the popular Bradley-Terry-Luce (BTL) or Multinomial Logit (MNL) model. We have obtained an analytic bound on the finite sample error rates between the scores assumed by the BTL model and those estimated by our algorithm. As shown, these lead to near-optimal dependence on the number of samples required to learn the scores well by our algorithm under random selection of pairs for comparison. More generally, the comparison graph structure plays a crucial role in the performance of the algorithm.

For a tighter analysis of the optimality of Rank Centrality, we provide numerical experiments under the BTL model and compare its error to the Cramér-Rao lower bound.
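As a reminder of what a numerical experiment under the BTL model involves, the following minimal sketch simulates comparisons and computes scores as the stationary distribution of a random walk on the comparison graph. The specific transition rule (move from $i$ to $j$ with probability equal to the fraction of their comparisons won by $j$, divided by the maximum degree, with the remaining mass on a self-loop), the function names, and the parameter values are assumptions made for illustration; they are not taken from this section.

```python
import numpy as np

def simulate_btl(w, d, k, rng):
    """Sample a comparison graph G(n, d/n) and k BTL outcomes per sampled pair.

    Returns a matrix a where a[i, j] is the fraction of the k comparisons
    between i and j won by j (zero if the pair was not compared)."""
    n = len(w)
    a = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < d / n:
                wins_j = rng.binomial(k, w[j] / (w[i] + w[j]))
                a[i, j] = wins_j / k
                a[j, i] = 1.0 - a[i, j]
    return a

def rank_centrality(a, iters=2000):
    """Approximate the stationary distribution of the assumed random walk."""
    n = a.shape[0]
    # a[i, j] + a[j, i] equals 1 exactly on compared pairs, so this counts degrees.
    d_max = max(1, int((a + a.T > 0).sum(axis=1).max()))
    P = a / d_max
    P[np.diag_indices(n)] = 1.0 - P.sum(axis=1)   # self-loops make each row sum to 1
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):                        # plain power iteration
        pi = pi @ P
    return pi / pi.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = np.linspace(1.0, 2.0, 30)                 # hypothetical true BTL weights
    pi = rank_centrality(simulate_btl(w, d=15, k=200, rng=rng))
    target = w / w.sum()
    print(np.linalg.norm(pi - target) / np.linalg.norm(target))
```

Under this rule the true normalized weights satisfy detailed balance when the win fractions equal their BTL expectations, which is why the printed relative error is expected to shrink as `k` and `d` grow.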
Comparisons with the Cramér-Rao bound in Figure 3 suggest that the error achieved by Rank Centrality is indistinguishable from the fundamental Cramér-Rao lower bound, and thus that it has stronger optimality properties than what we can establish.

For completeness, we further provided an analysis of the error achieved by the MLE. Building upon our analysis, Hajek et al. (2014) show that the MLE is near order-optimal, just like Rank Centrality. It is worth noting, however, that empirically the computational cost of Rank Centrality seems much lower than that of finding the MLE.

Appendix

A. Proof of Lemma 1

Without loss of generality, let us consider two items $i$ and $j$ such that $w_i > w_j$. When we estimate a higher score for item $j$, we make a mistake in the ranking of these two items. When this happens, so that $\pi_j - \pi_i > 0$, it naturally follows that $w_i - w_j \le w_i - w_j + \pi_j - \pi_i \le |w_i - \pi_i| + |\pi_j - w_j|$. For a general pair $i$ and $j$, we have that $(w_i - w_j)(\sigma_i - \sigma_j) > 0$ implies $|w_i - w_j| \le |w_i - \pi_i| + |w_j - \pi_j|$. Substituting this into the definition of the weighted distance $D_w(\cdot)$, and using the fact that $(a + b)^2 \le 2a^2 + 2b^2$, we get $D_w(\sigma) = \big\{ \frac{1}{2n\|w\|^2} \sum_{i}$ […] $0\,)\big\}^{1/2} \le \big\{ \frac{1}{n\|w\|^2} \sum_{i}$ […]

[…] $0$, $\exp(\theta|x|) \le \exp(\theta x) + \exp(-\theta x)$. From this, it follows that

$$ E[\exp(\theta|C_{ij}|)] \;\le\; E[\exp(\theta C_{ij})] + E[\exp(-\theta C_{ij})] . \tag{40} $$

Now for any $\theta \in \mathbb{R}$, using the fact that $X_{ij}$ follows a Binomial distribution and $1 + x \le \exp(x)$ for any $x \in \mathbb{R}$, we have

$$ E[\exp(\theta C_{ij})] = \exp(-\theta k p_{ij}) \big( 1 + p_{ij}(\exp(\theta) - 1) \big)^k \;\le\; \exp(-\theta k p_{ij}) \exp\big( k p_{ij}(\exp(\theta) - 1) \big) . \tag{41} $$

Using a second-order Taylor expansion, for any $\theta \in [-\ln(4/3), \ln(4/3)]$, we obtain that $|\exp(\theta) - 1 - \theta| \le \frac{2}{3}\theta^2$.
(42)

Using the above display in (41), we obtain the claimed result.

B. Proof of Lemma 7

Since we are interested in the eigenvalues of $L = D^{-1}B$, we define a more tractable matrix with the same set of eigenvalues: $\tilde L = D^{-1/2} B D^{-1/2}$. Because $\tilde L$ is a symmetric matrix, its eigenvalues are the same as its singular values up to a sign. Let $\sigma_1(\tilde L) \ge \sigma_2(\tilde L) \ge \ldots$ denote the ordered singular values of $\tilde L$. Note that the matrix $D^{-1/2} B D^{-1/2}$ has largest singular value equal to 1. Therefore, $\sigma_2(\tilde L) \le \|D^{-1/2} B D^{-1/2} - \mathbb{1}\mathbb{1}^T/n\|_2$ because the vector $\mathbb{1}/\sqrt{n}$ has unit norm. Decomposing the above we have that

$$ \|D^{-1/2} B D^{-1/2} - \mathbb{1}\mathbb{1}^T/n\|_2 \;\le\; \|B/d - \mathbb{1}\mathbb{1}^T/n\|_2 + \|B/d - D^{-1/2} B D^{-1/2}\|_2 . $$

We now appeal to the following lemma:

Lemma 13. If the matrix $B \in \mathbb{R}^{n \times n}$ is the adjacency matrix of a random graph drawn from the Erdős-Rényi ensemble $G(n, d/n)$ with $d \ge C\log n$, and $D$ is the corresponding diagonal matrix whose entry $d_{ii}$ is equal to the degree of node $i$, then we have that

$$ \|B/d - \mathbb{1}\mathbb{1}^T/n\|_2 \le C\sqrt{\frac{\log n}{d}} \quad \text{and} \quad \|B/d - D^{-1/2} B D^{-1/2}\|_2 \le C\sqrt{\frac{\log n}{d}} $$

with probability at least $1 - 2n^{-Cn/(n-d)/8}$.

At this point, applying the above bound yields the result. It remains to prove the above bound.

B.1. Proof of Lemma 13

We prove the result in two parts. We first focus on establishing that $\|B/d - \mathbb{1}\mathbb{1}^T/n\|_2 \le C\sqrt{\log n / d}$ with probability at least $1 - n^{-Cn/(n-d)/8}$. To prove this result, we appeal to the following lemma.

Lemma 14 (Theorem 1.4 (Tropp 2011)). Consider a finite sequence $\{X_k\}$ of independent, random, self-adjoint matrices with dimension $d$. Assume that each random matrix satisfies $E X_k = 0$ and $\lambda_{\max}(X_k) \le R$ almost surely. Then, for all $t \ge 0$,

$$ P\Big( \lambda_{\max}\Big(\sum_k X_k\Big) \ge t \Big) \;\le\; d \cdot \exp\left( \frac{-t^2/2}{\sigma^2 + Rt/3} \right) \quad \text{where } \sigma^2 = \Big\| \sum_k E X_k^2 \Big\|_2 . $$
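To get a feel for the quantity this lemma is used to control, the following sketch samples a symmetric 0/1 matrix with independent Bernoulli($d/n$) entries on and above the diagonal and compares $\|B/d - \mathbb{1}\mathbb{1}^T/n\|_2$ with $\sqrt{\log n / d}$. Allowing self-loops with the same probability is an assumption made here for simplicity, and the function name and parameter values are illustrative.

```python
import numpy as np

def er_deviation(n, d, seed=0):
    """Return (||B/d - 11^T/n||_2, sqrt(log n / d)) for one sampled graph."""
    rng = np.random.default_rng(seed)
    # Independent Bernoulli(d/n) entries for i <= j, then symmetrize.
    upper = np.triu(rng.random((n, n)) < d / n)
    B = np.where(upper | upper.T, 1.0, 0.0)
    centered = B / d - np.ones((n, n)) / n
    deviation = np.linalg.norm(centered, ord=2)   # spectral norm
    return float(deviation), float(np.sqrt(np.log(n) / d))

if __name__ == "__main__":
    deviation, scale = er_deviation(n=400, d=40)
    print(deviation, scale, deviation / scale)
```

The ratio printed at the end plays the role of the constant $C$ for this one sample; repeating the call with different seeds and sizes shows how it behaves empirically.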
In our setting we are interested in the random matrix $B$, where we can write

$$ B - \mathbb{1}\mathbb{1}^T d/n = \sum_{i > j} (A_{ij} - d/n)(e_i e_j^T + e_j e_i^T) + \sum_i (A_{ii} - d/n)\, e_i e_i^T $$

and $A_{ij}$ is a Bernoulli random variable with parameter $d/n$. Therefore, in applying the above lemma we have that $R = 1$ almost surely and $\sigma^2 = d(1 - d/n)$. Setting $t = C\sqrt{d \log n}$ we have that $\|B/d - \mathbb{1}\mathbb{1}^T/n\|_2 \le C\sqrt{\log n / d}$ with probability at least $1 - n^{-Cn/(n-d)/8}$.

Next we show that $\|B/d - D^{-1/2} B D^{-1/2}\|_2 \le C\sqrt{\log n / d}$ with the same probability as above. To prove this result we let $E = D^{1/2} - d^{1/2} I$ and first note that

$$ \|B/d - D^{-1/2} B D^{-1/2}\|_2 \;\le\; \frac{1}{d \cdot d_{\min}}\, \|D^{1/2} B D^{1/2} - dB\|_2 $$

because $\|D^{-1/2}\|_2^2 = 1/d_{\min}$. Some simple calculations show that

$$ \|D^{1/2} B D^{1/2} - dB\|_2 \;\le\; \|B\|_2 \cdot \big[ \|E\|_2^2 + 2 d^{1/2} \|E\|_2 \big], $$

and by the above we know that $\|B\|_2 \le 2d$ with high probability. Therefore,

$$ \|B/d - D^{-1/2} B D^{-1/2}\|_2 \;\le\; \frac{2}{d_{\min}} \big[ \|E\|_2^2 + 2 d^{1/2} \|E\|_2 \big] . $$

An application of Bernstein's inequality shows that with probability at least $1 - 2n^{-Cn/(n-d)/8}$ we have $\|E\|_2 \le 10C\sqrt{\log n}$. Finally, using the fact that with high probability $d_{\min} \ge \frac{1}{2}d$, we obtain

$$ \|B/d - D^{-1/2} B D^{-1/2}\|_2 \;\le\; 12C\sqrt{\frac{\log n}{d}} $$

with probability at least $1 - 2n^{-Cn/(n-d)/8}$.

References

Adler, M., P. Gemmell, M. Harchol-Balter, R. M. Karp, C. Kenyon. 1994. Selection in the presence of noise: the design of playoff systems. Proceedings of the fifth annual ACM-SIAM symposium on Discrete algorithms. SODA ’94, Society for Industrial and Applied Mathematics, 564–572.
Ailon, N. 2010. Aggregation of partial rankings, p-ratings and top-m lists. Algorithmica 57(2) 284–300.
Ailon, N., M. Charikar, A. Newman. 2008. Aggregating inconsistent information: ranking and clustering. Journal of the ACM (JACM) 55(5) 23.
Altman, A., M. Tennenholtz. 2005. Ranking systems: the pagerank axioms. Proceedings of the 6th ACM conference on Electronic commerce. ACM, 1–8.
Ammar, A., D. Shah. 2011. Ranking: Compare, don’t score.
Communication, Control, and Computing (Allerton), 2011 49th Annual Allerton Conference on. 776–783.
Arrow, K. J. 1963. Social Choice and Individual Values. Yale University Press.
Boyd, S., A. Ghosh, B. Prabhakar, D. Shah. 2005. Mixing times for random walks on geometric random graphs. SIAM ANALCO.
Bradley, R. A., M. E. Terry. 1955. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika 39(3/4) 324–345.
Braverman, M., E. Mossel. 2008. Noisy sorting without resampling. Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms. SODA ’08, Society for Industrial and Applied Mathematics, 268–276.
Brin, S., L. Page. 1998. The anatomy of a large-scale hypertextual web search engine. Seventh International World-Wide Web Conference (WWW 1998).
Candès, E. J., B. Recht. 2009. Exact matrix completion via convex optimization. Foundations of Computational Mathematics 9(6) 717–772.
Condorcet, M. 1785. Essai sur l’application de l’analyse à la probabilité des décisions rendues à la pluralité des voix. l’Imprimerie Royale.
David, H. A. 1963. The method of paired comparisons, vol. 12. DTIC Document.
de Borda, J. C. 1781. Mémoire sur les élections au scrutin.
Diaconis, P., L. Saloff-Coste. 1993. Comparison theorems for reversible markov chains. The Annals of Applied Probability 3(3) 696–730.
Duchi, J. C., L. Mackey, M. I. Jordan. 2010. On the consistency of ranking algorithms. Proceedings of the ICML Conference. Haifa, Israel.
Dwork, C., R. Kumar, M. Naor, D. Sivakumar. 2001a. Rank aggregation methods for the web. Proceedings of the Tenth International World Wide Web Conference, 2001.
Dwork, Cynthia, Ravi Kumar, Moni Naor, Dandapani Sivakumar. 2001b. Rank aggregation methods for the web. Proceedings of the 10th international conference on World Wide Web.
ACM, 613–622.
Farnoud, F., B. Touri, O. Milenkovic. 2012. Novel distance measures for vote aggregation. arXiv preprint arXiv:1203.6371.
Gleich, D. F., L. Lim. 2011. Rank aggregation via nuclear norm minimization. Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 60–68.
Guiver, J., E. Snelson. 2009. Bayesian inference for plackett-luce ranking models. Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 377–384.
Hajek, Bruce, Sewoong Oh, Jiaming Xu. 2014. Minimax-optimal inference from partial rankings. Advances in neural information processing systems (NIPS).
Hochbaum, D. S. 2006. Ranking sports teams and the inverse equal paths problem. Internet and Network Economics. Springer, 307–318.
Horn, R. A., C. R. Johnson. 1985. Matrix Analysis. Cambridge University Press.
Hunter, David R. 2004. Mm algorithms for generalized bradley-terry models. Annals of Statistics 384–406.
Jiang, X., L. Lim, Y. Yao, Y. Ye. 2011. Statistical ranking and combinatorial hodge theory. Mathematical Programming 127(1) 203–244.
Kamvar, S. D., M. T. Schlosser, H. Garcia-Molina. 2003. The eigentrust algorithm for reputation management in p2p networks. Proceedings of the 12th international conference on World Wide Web. WWW ’03, ACM, New York, NY, USA, 640–651.
Keener, J. P. 1993. The perron-frobenius theorem and the ranking of football teams. SIAM review 35(1) 80–93.
Kendall, M. G. 1955. Further contributions to the theory of paired comparisons. Biometrics 11(1) 43–62.
Kendall, M. G., B. B. Smith. 1940. On the method of paired comparisons. Biometrika 324–345.
Keshavan, R. H., A. Montanari, S. Oh. 2010. Matrix completion from noisy entries. Journal of Machine Learning Research 11 2057–2078.
L. R. Ford, Jr. 1957. Solution of a ranking problem from binary comparisons. The American Mathematical Monthly 64(8) 28–33.
Lu, T., C. Boutilier. 2011. Learning mallows models with pairwise preferences.
Proceedings of the 28th International Conference on Machine Learning (ICML-11). 145–152.
Luce, D. R. 1959. Individual Choice Behavior. Wiley, New York.
Mallows, C. L. 1957. Non-null ranking models. i. Biometrika 114–130.
McFadden, D. 1973. Conditional logit analysis of qualitative choice behavior. Frontiers in Econometrics 105–142.
Mosteller, F. 1951. Remarks on the method of paired comparisons: I. the least squares solution assuming equal standard deviations and equal correlations. Psychometrika 16(1) 3–9.
Negahban, S., M. J. Wainwright. 2012. Restricted strong convexity and (weighted) matrix completion: Optimal bounds with noise. Journal of Machine Learning Research 1665–1697.
Newman, M. E. J. 2010. Networks: An Introduction. Oxford University Press.
Osting, B., C. Brune, S. Osher. 2013. Enhanced statistical rankings via targeted data collection. Proceedings of the 30th International Conference on Machine Learning. 489–497.
Plackett, R. L. 1975. The analysis of permutations. Applied Statistics 193–202.
Rajkumar, A., S. Agarwal. 2014. A statistical convergence perspective of algorithms for rank aggregation from pairwise data. Proceedings of The 31st International Conference on Machine Learning. 118–126.
Rao, C. R. 1945. Information and accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society 37(3) 81–91.
Saaty, T. L. 2003. Decision-making with the ahp: Why is the principal eigenvector necessary. European Journal of Operational Research 145 pp. 85–91.
Salganik, M. J., K. E.C. Levy. 2012. Wiki surveys: Open and quantifiable social data collection. Tech. Rep. arXiv:1202.0500.
Seeley, J. R. 1949. The net of reciprocal influence. Canadian Journal of Psychology 3(4) 234–240.
Shah, D., T. Zaman. 2011. Rumors in a network: who’s the culprit?
IEEE Transactions on Information Theory 57(8) 5163–5181.
Shah, D., T. Zaman. 2015. Finding rumor sources on random trees. Operations Research.
Talluri, K. T., G. VanRyzin. 2005. The Theory and Practice of Revenue Management. Springer.
Tropp, J. 2011. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics.
Vigna, S. 2009. Spectral ranking. arXiv preprint arXiv:0912.0238.
Volkovs, M. N., R. S. Zemel. 2012. A flexible generative model for preference aggregation. Proceedings of the 21st international conference on World Wide Web. ACM, 479–488.
Wei, T. H. 1952. The algebraic foundations of ranking theory. Ph.D. thesis, University of Cambridge.