Joint Learning and Control in Stochastic Queueing Networks with Unknown Utilities

We study the optimal control problem in stochastic queueing networks with a set of job dispatchers connected to a set of parallel servers with queues. Jobs arrive at the dispatchers and are routed to the servers following some routing policy. The arrival processes of jobs and the service processes of servers are stochastic, with unknown arrival rates and service rates. Upon the completion of each job from dispatcher $u_n$ at server $s_m$, a random utility whose mean is unknown is obtained. We seek to design a control policy that makes routing decisions at the dispatchers and scheduling decisions at the servers to maximize the total utility obtained by the end of a finite time horizon $T$. The performance of policies is measured by regret, which is defined as the difference in total expected utility with respect to the optimal dynamic policy that has access to the arrival rates, service rates and underlying utilities. We first show that the expected utility of the optimal dynamic policy is upper bounded by $T$ times the solution to a static linear program, where the optimization variables correspond to rates of jobs from dispatchers to servers and the feasibility region is parameterized by the arrival rates and service rates. We next propose a policy for the optimal control problem that is an integration of a learning algorithm and a control policy. The learning algorithm seeks to learn the optimal extreme point solution to the static linear program based on the information available in the optimal control problem. The control policy, a mixture of priority-based and Join-the-Shortest-Queue routing at the dispatchers and priority-based scheduling at the servers, makes decisions based on the graphical structure induced by the extreme point solutions provided by the learning algorithm. We prove that our policy achieves logarithmic regret, whereas the application of existing techniques to the optimal control problem would lead to $\Omega(\sqrt{T})$-regret. The theoretical analysis is further complemented with simulations that evaluate the empirical performance of our policy.


INTRODUCTION
Consider a bipartite queueing network with a set of job dispatchers connected to a set of servers with queues. Jobs arriving at the dispatchers are routed to the queues for service, and a certain utility is obtained for each completed job. Such bipartite queueing networks have been widely adopted to model networked systems such as inter-connected switches [1], cloud platforms [2] and server farms [3,4]. We study the optimal control problem in such bipartite queueing networks.

[Fig. 2. A simple example with one dispatcher and two servers, where the static linear program is to maximize $w_{11}x_{11} + w_{12}x_{12}$ subject to the arrival and service constraints.]

A central challenge is to design a control policy that, given the solution to the static linear program, achieves logarithmic regret. What further elevates the challenge is that in the optimal control problem, as the parameters are unknown and we can only observe their stochastic realizations, it is impossible to obtain the exact optimal solution to the static linear program; e.g., no learning algorithm can obtain the exact values of $x^*_{11}$, $x^*_{12}$ in a finite time horizon [17]. The solutions computed by the learning algorithm are inherently approximate, which precludes the routing policy from relying on the values of the solutions, as the approximation error will accumulate to $\Omega(\sqrt{T})$-regret (see Section 4). Instead, the routing policy has to rely on some structure of the solutions that is robust to the approximation error.
Our main results are as follows. First, we show that each extreme point solution to the static linear program induces a spanning forest of the bipartite network. We thus propose a control policy that consists of a mixture of priority-based routing and Join-the-Shortest-Queue routing at the dispatchers and priority-based scheduling at the servers, and that takes the spanning forests of the solutions as input. The spanning forest structure is robust against errors in the values of the solutions. Second, we propose a learning algorithm that can learn the optimal solution to the linear program and handle the unknown feasibility region in the optimal control problem. Finally, we integrate the learning algorithm and the control policy into a joint learning and control policy. We show that our policy achieves logarithmic regret for the optimal control problem, which is superior to the application of existing techniques that would lead to $\Omega(\sqrt{T})$-regret. We also complement our theoretical analysis with empirical evaluation. Our results highlight the importance of co-design for the learning and network control aspects of the optimal control problem.
We note that the optimal control problem can be subsumed into the general framework of network utility maximization (with unknown utility functions) [8,23,24] or reinforcement learning [12-14]. However, the methods therein can only achieve $\Omega(\sqrt{T})$-regret, since they cannot leverage the structure of the optimal control problem. We refer the reader to Section 4 for further discussion.
The rest of the paper is organized as follows: we formally present the model and formulation of the optimal control problem in Section 2. In Section 3, we introduce several key preliminary results. In Section 4, we give an overview of our main results and discuss related results in the literature. We propose our control policy in Section 5, and the learning algorithm and the joint learning and control policy in Section 6. In Section 7, we conduct an empirical evaluation of our policy. Finally, we conclude the paper in Section 8.

MODEL AND PROBLEM FORMULATION
Consider a bipartite network $\mathcal{G}(U, S)$ that operates in discrete time, with job dispatchers $U = \{u_1, \ldots, u_N\}$ and parallel servers $S = \{s_1, \ldots, s_M\}$ (see Figure 1). The job dispatchers are controlled by a single decision maker. Dispatcher $u_n$ is connected to a set of servers $S_n \subseteq S$. Server $s_m$ is connected to a set of job dispatchers $U_m \subseteq U$. We will also sometimes use $u$ to represent a generic dispatcher and $s$ to represent a generic server. At each time slot $t$, $a_n(t)$ unit-size jobs arrive at dispatcher $u_n$. The arrivals $a_n(t)$ are independent random variables with unknown means (arrival rates) $\mathbb{E}[a_n(t)] = \lambda_n$. Each dispatcher sends its incoming jobs to servers to which it is connected. Each server has a buffer that stores incoming jobs. For server $s_m$, its offered service at time $t$ is denoted by $b_m(t)$, with the $b_m(t)$ being independent random variables with unknown mean (service rate) $\mathbb{E}[b_m(t)] = \mu_m$. The offered service $b_m(t)$ is equal to the number of jobs that can finish executing at $s_m$ in time slot $t$. The arrival rates $\{\lambda_n\}$ and the service rates $\{\mu_m\}$ will be referred to as the network statistics. After a job from dispatcher $u_n$ finishes execution at server $s_m$, we obtain and observe a random utility $\hat{w}_{nm}$ with an unknown mean $\mathbb{E}[\hat{w}_{nm}] = w_{nm}$, where $w_{nm}$ is the underlying utility associated with server $s_m$ for dispatcher $u_n$. Each $\hat{w}_{nm}$ is a 1-sub-Gaussian random variable and the $\hat{w}_{nm}$'s for different jobs are independent. Note that the underlying utilities $w_{nm}$ are unknown, and that we can only observe the $\hat{w}_{nm}$'s. We assume that the realized arrivals and offered services are bounded by a constant $c$, i.e., $a_n(t) \le c$ and $b_m(t) \le c$ for all $n$, $m$ and $t$; the queue length of each server then evolves as $Q_m(t+1) = [Q_m(t) + (\text{jobs routed to } s_m \text{ at time } t) - b_m(t)]^+$, where $[\cdot]^+ = \max\{\cdot, 0\}$. Our goal is to design a control policy that makes routing decisions at the dispatchers (i.e., sending each incoming job to a server) and scheduling decisions at the servers (i.e., deciding which jobs to serve) such that the expected utility obtained by the end of the time horizon is maximized. To make the problem concrete, we first define the expected utility of a generic policy $\pi$. Let $C^{\pi}_{nm}$ be the total number of jobs from dispatcher $u_n$ completed at server $s_m$ by the end of the time horizon. Note that $C^{\pi}_{nm}$ is a random variable. As the noise associated with the utility of each job is independent, the expected total utility obtained under $\pi$ is

$$J(\pi) = \sum_{n=1}^{N} \sum_{m=1}^{M} w_{nm}\, \mathbb{E}[C^{\pi}_{nm}].$$

Also, note that only the jobs that are completed by $T$ contribute to the total utility, while the jobs that are left in the queues at the end of the time horizon do not count towards the total utility. Let $\Pi$ be the set of all policies, including the ones that have knowledge of the underlying utilities $\{w_{nm}\}$, the network statistics, and the realizations of arrivals and services over the whole time horizon. The optimal dynamic policy $\pi^*$ is the best policy in $\Pi$, i.e., $\pi^* = \arg\max_{\pi' \in \Pi} J(\pi')$. Note that the optimal dynamic policy is typically inadmissible in the optimal control problem, as it can make decisions based on information that is not available in the problem setting. We define the regret of a policy $\pi$ as $R_T(\pi) = J(\pi^*) - J(\pi)$, i.e., the gap between the expected utility of $\pi$ and that of the optimal dynamic policy. In this paper, we pursue admissible policies with low regret that make decisions based only on observable information, without prior knowledge of the network statistics or the underlying utilities. We will refer to this problem as the Optimal Control Problem.

PRELIMINARIES
In this section, we introduce several key preliminary results. We start by presenting a static linear program as a fluid version of the optimal control problem. Based on the linear program, we first give an upper bound on the expected utility of the optimal dynamic policy, and then set up the instance-dependent conditions on the network statistics that will be assumed throughout the paper. Finally, we establish several structural properties of the extreme points of the linear program.

The Static Linear Program
Consider the following linear program P:

$$\max_{\{x_{nm}\}} \;\; \sum_{n=1}^{N} \sum_{m: s_m \in S_n} w_{nm} x_{nm}$$
$$\text{s.t.} \quad \sum_{m: s_m \in S_n} x_{nm} = \lambda_n \;\; \forall n, \qquad \sum_{n: s_m \in S_n} x_{nm} \le \mu_m \;\; \forall m, \qquad x_{nm} \ge 0 \;\; \forall n, m.$$

The linear program P can be interpreted as a fluid version of the optimal control problem, where $x_{nm}$ represents the rate of jobs from dispatcher $u_n$ completed at server $s_m$. Let $V(\mathrm{P})$ be the optimal value of P. Using the same argument as in [23], we have the following proposition. Similar results that upper-bound the value of a stochastic dynamic program with its fluid counterpart have appeared in other works on online decision problems, e.g., [25,26].

Proposition 1. The expected utility of the optimal dynamic policy is upper bounded by $T \cdot V(\mathrm{P})$, i.e., $J(\pi^*) \le T \cdot V(\mathrm{P})$.
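To make the fluid relaxation concrete, below is a minimal sketch (ours, not from the paper) that solves an instance of P with scipy. The utilities $w_{11} = 6$, $w_{12} = 5$, arrival rate $\lambda_1 = 7$ and service rate $\mu_1 = 5$ are taken from the Figure 2 example discussed in Section 4; the value $\mu_2 = 5$ is an assumed placeholder, as it is not stated in the text.

```python
# A minimal sketch (not from the paper) of solving the static LP (P)
# with scipy, for a one-dispatcher, two-server instance. mu2 = 5 is an
# assumed value; the other numbers come from the Figure 2 example.
import numpy as np
from scipy.optimize import linprog

w = np.array([6.0, 5.0])       # utilities w11, w12 (to be maximized)
lam = 7.0                      # arrival rate lambda_1
mu = np.array([5.0, 5.0])      # service rates mu1, mu2 (mu2 assumed)

# Decision vector x = (x11, x12); linprog minimizes, so negate w.
res = linprog(
    c=-w,
    A_eq=[[1.0, 1.0]], b_eq=[lam],   # x11 + x12 = lambda_1
    A_ub=np.eye(2), b_ub=mu,         # x11 <= mu1, x12 <= mu2
    bounds=[(0, None)] * 2,
    method="highs",
)
print(res.x, -res.fun)  # expect x* = (5, 2) and V(P) = 40
```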
From Proposition 1, we see that if our policy can obtain an expected utility close to $T \cdot V(\mathrm{P})$, then it will achieve low regret. We can thus approach the optimal control problem by learning the solution to P while making routing decisions based on the learned solution.

Structure of the Extreme Points
We now briefly recall some standard definitions from the linear programming literature [29] that will be useful in the subsequent discussion. The definitions are most conveniently stated for a linear program in standard form:

$$\min \; c^{\mathsf{T}} z \quad \text{s.t.} \quad Az = b, \;\; z \ge 0,$$

with $A \in \mathbb{R}^{k \times d}$, $k \le d$; the feasibility region $D = \{z \mid Az = b, z \ge 0\}$ is a polyhedron. Let $z^*$ be an optimal solution and $\mathcal{E}$ be the set of all extreme points of the polyhedron $D$, which is equivalently the set of all basic feasible solutions of the linear program. Without loss of generality, assuming $A$ has linearly independent rows, each basic solution is characterized by a basis $B = [A_{B(1)}, \ldots, A_{B(k)}]$ of $k$ linearly independent columns of $A$, with the basic variables given by $B^{-1}b$. A basis $B$ is $\Delta$-feasible if $(B^{-1}b)_i \ge \Delta$ for all $i$, and $\Delta$-infeasible if $\max_{i: (B^{-1}b)_i < 0} (B^{-1}b)_i \le -\Delta$, i.e., the maximum negative basic variable is smaller than $-\Delta$.

Each basic feasible solution is characterized by a feasible basis, i.e., a basis with $B^{-1}b \ge 0$. If the feasibility region $D$ is bounded, then there exists an optimal solution $z^*$ that is a basic feasible solution or, equivalently, an extreme point (i.e., $z^* \in \mathcal{E}$). We will write $B^*$ for the basis associated with $z^*$. Following the standard definition in the linear programming literature [29], we say an extreme point with basis $B$ is non-degenerate if $B^{-1}b > 0$ component-wise.
Back to the static linear program P of the optimal control problem, we can write P in standard form as follows.

Here, the matrix $A$ and vector $b$ are formed by the constraints $\sum_{m: s_m \in S_n} x_{nm} = \lambda_n$ and $\sum_{n: s_m \in S_n} x_{nm} + y_m = \mu_m$, and the optimization vector is formed by $\{x_{nm}\}$ and $\{y_m\}$. Compared to the original form of P, in the standard form we introduce slack variables $\{y_m\}$ to replace the inequality constraints with equality constraints. The slack variables essentially capture the difference between the service rates and the total rates of incoming jobs. Let $(x, y) = \{x_{nm}, y_m\}$ be an extreme point of P in standard form. Under the extreme point, we define $s_m$ to be an idle server if $x_{nm} = 0$ for all $u_n \in U_m$, and to be a slack server if $y_m > 0$. For a dispatcher $u_n$ and server $s_m$ that are connected, we say that the link between $u_n$ and $s_m$ is essential if $x_{nm} > 0$. In Proposition 2, we show that each extreme point induces a spanning forest of $\mathcal{G}$. Similar results and extreme point structure have also been applied in the heavy-traffic analysis of stochastic queueing networks [30,31]. The proof relies on standard techniques for analyzing linear programs related to network flow. We give the proof in Appendix A for completeness.

Proposition 2. Under any extreme point $\{x_{nm}, y_m\}$, the subgraph induced by the essential links and the idle servers forms a spanning forest of the bipartite network $\mathcal{G}$. Each tree in the spanning forest contains at most one slack server. Furthermore, if the extreme point is non-degenerate, then each tree contains exactly one slack server.
To give an example of Proposition 2, consider the network shown in Figure 3(a) with the essential links of an extreme point marked in red. The values of the $x_{nm}$ variables are labeled beside the links. In this example, the extreme point is non-degenerate and the only slack server is $s_4$ (with $y_4 = 2$). The spanning forest induced by the extreme point has two trees: one is a trivial tree consisting only of the idle server $s_2$, and the other is shown in Figure 3(b).
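As an illustration of how this structure can be read off from a solution in practice, here is a small sketch with our own data layout and a floating-point tolerance `eps` of our choosing; the example values are those of the Figure 2 instance solved above (with the assumed $\mu_2 = 5$).

```python
# A sketch (our own data layout) of reading off the structure of
# Proposition 2 from an extreme point {x_nm, y_m}. The tolerance eps
# guards against floating-point noise and is our own device.
def forest_structure(x, y, eps=1e-9):
    """x: dict (n, m) -> rate x_nm; y: dict m -> slack y_m."""
    essential = [(n, m) for (n, m), v in x.items() if v > eps]
    touched = {m for (_, m) in essential}
    idle = [m for m in y if m not in touched]   # no essential link
    slack = [m for m in y if y[m] > eps]        # leftover capacity
    return essential, idle, slack

# Figure 2 instance (mu2 = 5 assumed): x* = (5, 2), slacks y = (0, 3).
print(forest_structure({(1, 1): 5.0, (1, 2): 2.0}, {1: 0.0, 2: 3.0}))
# -> ([(1, 1), (1, 2)], [], [2]): both links essential, s2 is slack
```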

Conditions on Network Statistics
We define the following two conditions.

• Condition 1: the optimal extreme point of P is unique, and its objective value exceeds that of every other extreme point by at least $\Delta_1$, with $\Delta_1 > 0$ being a constant independent of $T$.

• Condition 2: the optimal extreme point is non-degenerate, and every basis is either $\Delta_2$-feasible or $\Delta_2$-infeasible, with $\Delta_2 > 0$ being a constant independent of $T$.

Using the terminology of the online learning literature [11], $\Delta_1$ and $\Delta_2$ can be viewed as "instance-dependent" parameters, where $\Delta_1$ denotes the gap between the optimal and the second-best extreme point, and $\Delta_2$ represents the minimum absolute value of the non-zero basic variables. Note that under Condition 2, as the optimal extreme point is feasible and non-degenerate, all of its basic variables are non-zero and larger than or equal to $\Delta_2$. Using the structure of extreme points established in Proposition 2, Condition 2 means that for each extreme point (basic feasible solution), the rates of jobs on essential links are at least $\Delta_2$ and the one slack server of each tree has at least $\Delta_2$ of extra capacity ($y_m \ge \Delta_2$). For each infeasible basic solution, the rates of jobs on essential links are at least $\Delta_2$ and the servers with arrival greater than service have $y_m \le -\Delta_2$ (which is what makes the basic solution infeasible).
For the rest of the paper, we will assume that Conditions 1 and 2 hold, and focus on analyzing instance-dependent regret (with $\Delta_1$, $\Delta_2$ being the instance-dependent parameters), where the instance of the optimal control problem is fixed and we study the scaling of the regret with respect to the time horizon $T$. Note that existing lower bounds for stochastic linear optimization [11,17] establish that only in the instance-dependent case can one hope to achieve logarithmic regret; otherwise, it is impossible to have a policy with regret better than $\Omega(\sqrt{T})$. The standard form of P was introduced to facilitate the definitions related to the extreme points. For consistency of notation, in what follows we will focus on the original form of P (instead of the standard form) and use a vector $x$ or $\{x_{nm}\}$ to represent a generic feasible solution to P.

OVERVIEW AND DISCUSSION OF RESULTS
In this section, we give an overview of our main results and discuss their relation to previous works in the literature.

Overview of Results
Proposition 1 shows that solving the optimal control problem boils down to learning the solution to P in the context of stochastic queueing networks and making routing and scheduling decisions based on the learned solution. Therefore, we can break down the optimal control problem into two logical components: the learning algorithm and the control policy. The learning algorithm (approximately) solves P using the feedback available in the optimal control problem. The control policy makes routing and scheduling decisions based on the solution provided by the learning algorithm. Note that the two logical components are anything but disjoint. They must be integrated as a joint policy: the learning algorithm updates its solution based on the utility observations, whose dynamics are determined by the control policy, while the control policy relies on the solution fed to it by the learning algorithm.
For the control policy, we propose one that takes an extreme point of P as input. It relies on the graphical structure, or more specifically, the spanning forest induced by the extreme point, rather than on the value of the extreme point. This makes it robust to errors in the solution provided by the learning algorithm. We will show that, given the optimal extreme point, our control policy achieves logarithmic regret. The control policy is a mixture of threshold-based Join-the-Shortest-Queue routing and priority-based routing at the dispatchers, and priority-based scheduling at the servers, with the priorities defined by the structure of the forest. More details will be given in Section 5.
The learning algorithm and the joint policy that integrates the learning algorithm and the control policy will be presented in Section 6. We adapt the algorithm for stochastic linear optimization with bandit feedback proposed in [17] to learn the solution to P. The algorithm from [17] cannot be directly applied here due to the unknown feasibility region of P (which is parameterized by the unknown network statistics) and the feedback delay. More details on these two challenges will be given in Section 6. We will adapt the algorithm from [17], combine it with our control policy to form a joint learning and control policy, and show that the joint policy achieves logarithmic regret for the optimal control problem.

Discussion
In this section, we review related results in the literature and discuss how they fall short of achieving logarithmic regret for the optimal control problem, which also highlights the novelty of our results on learning in stochastic queueing networks.

Online Learning and Online Decision Making.
As we have mentioned, there has been extensive effort in the field of online learning/online decision making dedicated to optimization problems with unknown utility functions, with feedback given through a (zero-order) oracle on function values [9,17,25-27]. Due to the unknown constraints and feedback delay, those algorithms cannot be directly applied to learn the optimal extreme point of P. More importantly, as those works study a pure optimization problem, they are not concerned with the networking aspect of the optimal control problem, i.e., how to translate the learned solution into an effective control policy and how the control policy affects the learning process. Note that many works in the online learning literature consider adversarial objective functions that are time-varying [32], which is more general than the fixed utilities we consider in the optimal control problem. However, that generality does not help, as their settings still do not involve stochastic queueing dynamics.

Network Utility Maximization.
The optimal control problem can be considered as a network utility maximization problem with an (unknown) linear utility function. Although results for general network utility maximization problems [8,19,20,23,24,28] can be used to derive viable policies for the optimal control problem, we justify in the following that those policies will only achieve $\Omega(\sqrt{T})$-regret, which is strictly worse than the logarithmic regret achieved by our policy. Consider the simple network with one dispatcher and two servers in Figure 2. Since $w_{11} = 6 > w_{12} = 5$, the optimal solution is $x^*_{11} = 5$, $x^*_{12} = 2$, and the network statistics satisfy Conditions 1 and 2 with $\Delta_1 = \Delta_2 = 3$. We first consider a simplistic case where the optimal solution is given and we only need a good control policy to achieve low regret. A simple idea is to use a static randomized routing policy parameterized by the optimal solution, combined with an arbitrary scheduling policy. For example, a valid static policy based on $\{x^*_{nm}\}$ is one that at each time sends the incoming jobs to $s_1$ with probability $\frac{5}{7}$ and to $s_2$ with probability $\frac{2}{7}$, while the servers serve the jobs in an arbitrary order. This policy seems natural, but it has a utility gap of $\Omega(\sqrt{T})$, where the utility gap is defined as the difference between the expected utility achieved by the policy and $T$ times the value of the solution $x^*$, and this leads to $\Omega(\sqrt{T})$-regret. The reason is that the queue of $s_1$ is critically loaded, which results in $\mathbb{E}[Q_1(T)] = \Omega(\sqrt{T})$ and causes an $\Omega(\sqrt{T})$-loss of utility. This is by no means specific to the static randomized policy considered in the example: we formally show in Appendix B that any static policy that makes routing decisions independently over time has a regret of $\Omega(\sqrt{T})$, which is inferior to the logarithmic regret that our policy achieves. As existing works on network utility maximization use variants of Max-Weight policies, which are derived from minimizing certain quadratic Lyapunov functions and seek to converge to the optimal static policy, they also cannot achieve a regret better than $\Omega(\sqrt{T})$ [19]. It follows that the network control component alone is already non-trivial. Furthermore, when we bring the problem of learning the optimal solution back into the picture, we see that to achieve logarithmic regret, the control policy cannot rely on the values of the solution. Since both the objective function and the feasibility region of the static linear program P are unknown, existing lower bounds [11,17] establish that it is impossible to obtain the solution to P within an error smaller than $\Theta(\frac{1}{\sqrt{T}})$ after $T$ time slots. For example, in the aforementioned case (Figure 2), no learning algorithm can obtain the exact values of $x^*_{11} = 5$, $x^*_{12} = 2$, but at best $x^*_{11} \simeq 5 \pm \Theta(\frac{1}{\sqrt{T}})$, $x^*_{12} \simeq 2 \pm \Theta(\frac{1}{\sqrt{T}})$. Therefore, relying on the values of the solution will inevitably lead to $\Omega(\sqrt{T})$-regret. To achieve logarithmic regret, we have to rely on some structure of the solutions that is robust against the error.
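The critical-loading effect underlying the $\Omega(\sqrt{T})$ bound can be seen in a few lines of simulation. The following sketch is ours: it uses an illustrative two-point arrival/service distribution (not the one in the paper) to show that a queue whose arrival rate equals its service rate has $\mathbb{E}[Q(T)]$ growing on the order of $\sqrt{T}$.

```python
# A self-contained sketch (ours): a critically loaded queue, i.e., one
# with arrival rate equal to service rate, has E[Q(T)] = Theta(sqrt(T)).
import random

def mean_queue_at_T(T, runs=200):
    total = 0.0
    for _ in range(runs):
        q = 0
        for _ in range(T):
            a = random.choice([4, 6])  # arrivals, mean 5 (illustrative)
            b = random.choice([4, 6])  # offered service, mean 5
            q = max(q + a - b, 0)      # queue update with [.]^+
        total += q
    return total / runs

for T in [1000, 4000, 16000]:
    print(T, round(mean_queue_at_T(T), 1))  # roughly doubles as T quadruples
```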

Reinforcement Learning.
The optimal control problem can be formulated as a Markov decision process with unknown parameters. Therefore, reinforcement learning techniques can also be applied.
However, none of the existing results on reinforcement learning can be shown to achieve a regret better than $\Omega(\sqrt{T})$ [12-16]. Due to the generality of the reinforcement learning framework, the results therein cannot exploit the structure of the optimal control problem.

THE DUAL-LEVEL JSQ-K POLICY
In this section, we introduce the control policy for the optimal control problem, which we call the Dual-Level JSQ-$K$ policy. At the dispatchers, it makes routing decisions in a Join-the-Shortest-Queue fashion with queue-length threshold $K$, based on the priority levels defined on the forest induced by the extreme point. At the servers, it makes priority-based scheduling decisions, also based on the priority levels defined on the forest. In what follows, we first introduce the priority levels of extreme points, and then present the details and performance analysis of the Dual-Level JSQ-$K$ policy.

Priority Levels Induced by an Extreme Point
Given an extreme point of the linear program P, we associate a priority level with each server and dispatcher as follows. For a tree in the spanning forest induced by the extreme point, if there is a slack server in the tree, then we designate the slack server as the root; otherwise, we designate an arbitrary server in the tree as the root. The designated root essentially gives each tree an orientation. For each node in the tree, we define its priority level as its distance to the root in the tree. For example, the root server has priority level 0, and the job dispatchers that are immediately connected to the root server in the tree have priority level 1. From these definitions, we have the following observations (see also the sketch after this list):

• For a dispatcher of level $h$, it is connected to exactly one server of level $h - 1$. If the dispatcher is not a leaf node, it is also connected to at least one server of level $h + 1$. We will refer to the level-$(h-1)$ server as the secondary server of the dispatcher, and the level-$(h+1)$ server(s) as the primary server(s) of the dispatcher.

• For a server of level $h \ne 0$, it is connected to exactly one dispatcher of level $h - 1$. If the server is not a leaf node, it is also connected to at least one dispatcher of level $h + 1$. We will refer to the level-$(h-1)$ dispatcher as the secondary dispatcher of the server, and the level-$(h+1)$ dispatcher(s) as the primary dispatcher(s) of the server.

• If a server $s$ is a primary server of a dispatcher $u$, then the dispatcher $u$ is the (only) secondary dispatcher of the server $s$. Similarly, if $s$ is the secondary server of a dispatcher $u$, then $u$ is a primary dispatcher of $s$.
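The level assignment is just a breadth-first traversal of each tree from its designated root. The following is a minimal sketch under our own graph representation:

```python
# A sketch (our own graph representation) of the priority levels of
# Section 5.1: root each tree at its slack server (or an arbitrary
# server if the tree has none) and set each node's level to its
# distance from the root.
from collections import deque

def priority_levels(tree_adj, root):
    """tree_adj: node -> list of neighbors within one tree of the
    spanning forest; root: the designated root server."""
    level = {root: 0}
    frontier = deque([root])
    while frontier:
        v = frontier.popleft()
        for u in tree_adj[v]:
            if u not in level:           # tree: each node seen once
                level[u] = level[v] + 1  # distance to the root
                frontier.append(u)
    return level
```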

The Control Policy
We now present the details of the Dual-Level JSQ-$K$ policy, which we will often abbreviate as the JSQ-$K$ policy. The JSQ-$K$ policy is parameterized by a threshold $K$ whose value will be set later. For a given extreme point $x$, the JSQ-$K$ policy is structured based on the spanning forest induced by $x$.
Scheduling: The queue of each server $s_m$ is partitioned into two virtual queues $Q^h_m$ and $Q^l_m$. Under an extreme point, the virtual queue $Q^h_m$ is the high-priority queue that holds the jobs from the primary dispatchers of the server, while the virtual queue $Q^l_m$ is the low-priority queue that holds the jobs from the secondary dispatcher of the server. As an example, the virtual queueing architecture corresponding to the extreme point in the example of Figure 3(a) is shown in Figure 3(b). For each server, the scheduling policy is to give priority service to the jobs in its high-priority queue and to serve the jobs in its low-priority queue only if its high-priority queue is empty.

Routing: For each dispatcher that is connected to its primary servers $s_{m_1}, \ldots, s_{m_J}$ and secondary server $s_{m_0}$, it first checks whether any of its primary servers has a low-priority queue with length no larger than the threshold $K$. If so, it sends the jobs to the primary server with the smallest low-priority queue length. Otherwise, it checks whether the high-priority queue of its secondary server is no larger than $K$. If so, it sends the incoming jobs to its secondary server. If all the low-priority queues of $s_{m_1}, \ldots, s_{m_J}$ and the high-priority queue of $s_{m_0}$ are greater than $K$, then the dispatcher discards the incoming jobs. The pseudo-code of the JSQ-$K$ policy is shown in Algorithm 1.
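Since Algorithm 1 does not appear in this excerpt, the following sketch (ours, written directly from the textual description above, not the paper's pseudo-code) captures one routing decision at a dispatcher and one scheduling decision at a server:

```python
# A sketch (ours, following the textual description of the dual-level
# JSQ-K policy) of one routing decision and one scheduling decision.
def route(primary, secondary, Qh, Ql, K):
    """primary: list of primary servers; secondary: secondary server;
    Qh/Ql: dicts of high-/low-priority queue lengths. Returns the
    server to route an incoming job to, or None to discard it."""
    # 1) Join the shortest low-priority queue among the primary
    #    servers, provided it is within the threshold K.
    candidates = [m for m in primary if Ql[m] <= K]
    if candidates:
        return min(candidates, key=lambda m: Ql[m])
    # 2) Otherwise fall back to the secondary server if its
    #    high-priority queue is within K.
    if Qh[secondary] <= K:
        return secondary
    return None  # 3) all thresholds exceeded: discard the job

def schedule(high_jobs, low_jobs, service):
    """Serve up to `service` jobs, draining the high-priority queue
    first and the low-priority queue only with leftover service."""
    take_h = min(service, len(high_jobs))
    take_l = min(service - take_h, len(low_jobs))
    return high_jobs[:take_h], low_jobs[:take_l]
```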
To give a concrete example, consider the extreme point in Figure 3(b). On the server side, the high-priority queue of server $s_1$ receives jobs from dispatchers $u_1$ and $u_2$, and the low-priority queue of $s_1$ receives jobs from $u_3$. The high-priority queue of $s_3$ receives jobs from $u_4$ and the low-priority queue of $s_3$ receives jobs from $u_3$. The high-priority queue of $s_4$ receives jobs from $u_3$ and $u_5$. On the dispatcher side, $u_1$ and $u_2$ only send jobs to $s_1$ when $Q^h_1 \le K$, and discard the incoming jobs otherwise. Dispatcher $u_4$ sends jobs to $s_3$ when $Q^h_3 \le K$ and discards the incoming jobs otherwise. Dispatcher $u_3$ sends jobs to the shorter of $Q^l_1$, $Q^l_3$ when at least one of them is no larger than $K$; otherwise $u_3$ sends jobs to $s_4$ when $Q^h_4 \le K$. When $Q^l_1$, $Q^l_3$, $Q^h_4$ are all greater than $K$, $u_3$ discards the incoming jobs. Dispatcher $u_5$ sends jobs to $s_4$ when $Q^h_4 \le K$, and discards the incoming jobs otherwise.

Analysis
We first show the claim that, given the optimal extreme point, the JSQ-$K$ policy achieves logarithmic regret. The claim follows from Theorem 1, which establishes a more general statement: given any non-degenerate extreme point $x$, the difference between the total utility achieved by the JSQ-$K$ policy with $x$ as input and $T \cdot V(x)$ is in $O(\log T)$, where $V(x) = \sum_{n,m} w_{nm} x_{nm}$ is the value of $x$ with respect to the objective function of P. Combining Proposition 1 and Condition 2, we have that the optimal extreme point is non-degenerate. Hence, Theorem 1 implies the claim that the JSQ-$K$ policy achieves logarithmic regret if given the optimal extreme point.

Theorem 1. For any non-degenerate extreme point $x$, the total expected utility achieved by the JSQ-$K$ policy based on the spanning forest induced by $x$ with $K = \Theta(\log T)$ is at least $T \cdot V(x) - O(\log T)$.

[Algorithm 1: pseudo-code of the Dual-Level JSQ-$K$ policy.]

Due to space limitations, we give the overall structure and the intuition of the proof here. The proof details are deferred to Appendix C.
The proof of Theorem 1 consists of proving two claims for each node in the spanning forest. Under a policy $\pi$, we define the random variable $\tilde{c}_{nm}(t)$ as the number of jobs from dispatcher $u_n$ completed at server $s_m$ at time $t$. In the proof, we will omit the superscript $\pi$ as it always refers to the JSQ-$K$ policy. Consider a non-degenerate extreme point $x$. The extreme point induces a spanning forest of $\mathcal{G}$. We consider an arbitrary tree in the spanning forest with $H + 1$ priority levels $0, \ldots, H$, and we establish two claims for each dispatcher and two claims for each server. More specifically, we define constants $\epsilon_h \in (0,1)$, $h = 1, \ldots, H$, that shrink geometrically by a factor of $4(N+M)$ per level. For each dispatcher $u_n$ at level $H - h$ with primary servers $s_{m_1}, \ldots, s_{m_J}$ and secondary server $s_{m_0}$, we establish the following two claims:

Claim (1.1): Starting from $t_1 = \eta_h \ln T$ (for a suitable constant $\eta_h$), the probability that there exists a primary server of $u_n$ with queue length (high-priority queue plus low-priority queue) smaller than $(1 - \epsilon_h)K$ is small.

Claim (1.2): The expected total completed service from $u_n$ at each of its primary servers is close to $T \cdot x$, i.e., for each $j = 1, \ldots, J$, $\big|\mathbb{E}\big[\sum_{t=1}^{T} \tilde{c}_{n m_j}(t)\big] - T x_{n m_j}\big| = O(\log T)$.

For each server $s_m$ (of priority level $H - h$) with primary dispatchers $u_{n_1}, \ldots, u_{n_I}$ and secondary dispatcher $u_{n_0}$, we establish the following two claims:

Claim (2.1): Starting from $t_1 = \eta_h \ln T$, the probability that the high-priority queue of $s_m$ grows over $\epsilon_h K$ is small.

Claim (2.2): The expected total completed service at $s_m$ from each of its primary dispatchers is close to $T \cdot x$, i.e., for each $i = 1, \ldots, I$, $\big|\mathbb{E}\big[\sum_{t=1}^{T} \tilde{c}_{n_i m}(t)\big| - T x_{n_i m}\big| = O(\log T)$.

Note that the constant $\epsilon_h$ decreases as the level of the node increases (going from the root to the leaf nodes). Claim (1.1) shows that for each server that is not the root (slack) server, its queue length rarely goes below $(1 - \epsilon_h)K$ after $O(\log T)$ time slots, which implies that it is almost never idle; this in turn implies that non-slack servers are fully utilized. Claim (2.1) shows that for each server, its high-priority queue is rarely greater than $\epsilon_h K$ with $\epsilon_h < 1$. This implies that dispatchers almost never drop incoming jobs, since the high-priority queue of their secondary server is almost always smaller than $K$. Claims (1.1) and (2.1) are intermediate steps that are instrumental in proving Claims (1.2) and (2.2). After having proved Claim (1.2) for each dispatcher in the tree and Claim (2.2) for each server in the tree, it follows that the difference between the total expected utility obtained under the policy and $T \cdot V(x)$ is $O(\log T)$, which implies Theorem 1.
The proof of the claims for each node in the tree proceeds via an induction framework.

Structure of the Induction: For each tree in the spanning forest, the base step of the induction deals with the nodes at level $H - 1$ (parents of the leaf nodes). The base step starts from nodes at level $H - 1$ instead of nodes at level $H$ because nodes at level $H$ (the leaf nodes) do not have any children and thus their corresponding claims trivially hold. Depending on whether the nodes are servers or dispatchers, the base step can be divided into two cases. In the first case, the nodes at level $H - 1$ are dispatchers (and the leaf nodes are servers), and we need to prove Claims (1.1) and (1.2) for each dispatcher at level $H - 1$. In the second case, the nodes at level $H - 1$ are servers (and the leaf nodes are dispatchers), and we need to prove Claims (2.1) and (2.2) for each server at level $H - 1$. Proceeding from the base step, the induction step works by proving, for each node in the tree, Claims (1.1) and (1.2) if the node is a dispatcher, and Claims (2.1) and (2.2) if the node is a server, under the induction hypothesis that Claims (1.1) and (1.2) hold for all the dispatchers, and Claims (2.1) and (2.2) hold for all the servers, in the subtree rooted at the node. Upon completing the induction, we will have proved the corresponding claims for each node in the forest.
Intuition Behind the Claims: We now give the intuition behind why the claims hold. The details of establishing the claims are deferred to Appendix C. We first give the main intuition behind Claims (1.1) and (1.2). Consider a dispatcher $u_n$ whose primary servers $s_{m_1}, \ldots, s_{m_J}$ are leaf nodes of the tree and whose secondary server is $s_{m_0}$. For Claim (1.2), note that in this case $x_{n m_j} = \mu_{m_j}$. Therefore, the upper bound on $\sum_{t=1}^{T} \mathbb{E}[\tilde{c}_{n m_j}(t)]$ is straightforward. For the lower bound, since each primary server only receives jobs from $u_n$, we need to show that the cumulative idleness of each of the servers $s_{m_1}, \ldots, s_{m_J}$ is in $O(\log T)$, which essentially follows from Claim (1.1). For Claim (1.1), note that if we consider the servers $s_{m_1}, \ldots, s_{m_J}$ as a set, as long as one of the servers has queue length smaller than $K$, the incoming jobs from $u_n$ (which arrive at rate $\lambda_n$) are sent to the set, while the total service rate of the set is $\sum_{j=1}^{J} \mu_{m_j}$. By the constraint satisfied by the extreme point, $\lambda_n - \sum_{j=1}^{J} \mu_{m_j} = x_{n m_0} > 0$. Therefore, the total queue length of the set tends to have positive drift when the queues are not too large, from which Claim (1.1) can be derived. When proving Claims (1.1) and (1.2) for a dispatcher higher up in the tree (i.e., in the induction steps), the key ideas are the same, but the drift arguments are more challenging to construct, since the queueing dynamics are influenced by the servers and dispatchers of higher priorities (that are descendants of the dispatcher in the tree).
For the main intuition behind Claims (2.1) and (2.2), consider a server $s_m$ whose primary dispatchers $u_{n_1}, \ldots, u_{n_I}$ are leaf nodes and whose secondary dispatcher is $u_{n_0}$. In this case $x_{n_i m} = \lambda_{n_i}$. Hence, for Claim (2.2) the upper bound on $\sum_{t=1}^{T} \mathbb{E}[\tilde{c}_{n_i m}(t)]$ is straightforward. For the lower bound, since each dispatcher $u_{n_1}, \ldots, u_{n_I}$ only sends jobs to $s_m$, we need to show that the queue length of $s_m$ and the total number of jobs discarded are in $O(\log T)$. The queue length of $s_m$ is in $O(\log T)$ by design, as $K = \Theta(\log T)$. The total number of jobs discarded is of the same order as the total probability of $Q^h_m$ being greater than $K$ over the whole time horizon, which essentially follows from Claim (2.1). For Claim (2.1), note that the total arrival rate to $Q^h_m$ is at most $\sum_{i=1}^{I} \lambda_{n_i}$, while the offered service rate to $Q^h_m$ (as it receives priority service) is $\mu_m$ as long as $Q^h_m > 0$. As, by the constraint satisfied by the extreme point, $\mu_m - \sum_{i=1}^{I} \lambda_{n_i} = x_{n_0 m} > 0$, $Q^h_m$ is a queue with negative drift, from which Claim (2.1) can be derived. Again, when proving Claims (2.1) and (2.2) for a server higher up in the tree (i.e., in the induction steps), we use the same ideas but need to be more careful when constructing the drift arguments.
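To make the negative-drift step concrete, the base-case drift bound behind Claim (2.1) can be written out as follows; this is a sketch assembled from the constraints stated above, with the last inequality supplied by Condition 2 (an essential link carries rate at least $\Delta_2$):

```latex
% One-step drift of the high-priority queue Q^h_m in the base case:
% when Q^h_m(t) >= c, the high-priority queue can absorb the full
% offered service of s_m, so
\mathbb{E}\bigl[\,Q^h_m(t+1) - Q^h_m(t) \,\bigm|\, Q^h_m(t) \ge c\,\bigr]
  \;\le\; \sum_{i=1}^{I} \lambda_{n_i} - \mu_m
  \;=\; -\,x_{n_0 m}
  \;\le\; -\,\Delta_2 .
```

Since the one-step change of $Q^h_m$ is bounded, standard concentration arguments for queues with negative drift turn this into the exponentially small tail probability required by Claim (2.1).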

LEARNING ALGORITHM AND THE JOINT POLICY
In this section, we first present our learning algorithm, which is an adaptation of the Confidence-Ball algorithm proposed in [17]. Next, we integrate the learning algorithm with the JSQ-$K$ policy to form a joint policy and prove that it achieves logarithmic regret for the optimal control problem.
6.1 Learning Algorithm: The Confidence-Ball Algorithm

6.1.1 Stochastic Linear Optimization with Bandit Feedback. We start by briefly reviewing the results of [17], which studied the problem of stochastic linear optimization with bandit feedback. Consider the linear optimization problem $\max_{x \in D} \mu \cdot x$, where the feasibility region $D \subseteq \mathbb{R}^d$ is known in advance but the vector $\mu$ is unknown. At every time $t$, we choose a decision vector $x_t \in D$ and receive an observation $z_t = \mu \cdot x_t + \varepsilon_t$, where $\varepsilon_t$ is a zero-mean noise with bounded variance. Let the optimal solution be $x^* := \arg\max_{x \in D} \mu \cdot x$. The goal is to design an algorithm that outputs a sequence of decisions $x_1, \ldots, x_T$ such that the regret $R_T = \sum_{t=1}^{T} (\mu \cdot x^* - \mu \cdot x_t)$ is low. The algorithm proposed in [17], the Confidence-Ball algorithm, achieves logarithmic regret for stochastic linear optimization with bandit feedback. Before reviewing the details of the algorithm, the following definitions are needed. For a vector $x \in \mathbb{R}^d$ and a positive definite matrix $A \in \mathbb{R}^{d \times d}$, we denote by $\|x\|_{1,A} := \|A^{1/2} x\|_1 = \sum_{i=1}^{d} |(A^{1/2} x)_i|$ the 1-norm based on $A$. The details of the algorithm are shown in Algorithm 2. The algorithm essentially works by estimating the vector $\mu$ from the observations using a linear-regression-like procedure. Note that if we are given $(x_1, z_1), \ldots, (x_t, z_t)$, the problem of estimating $\mu$ resembles the linear regression problem, where the estimate is given by $\hat{\mu}_t = A_t^{-1} \sum_{\tau=1}^{t} z_\tau x_\tau$ with $A_t = \sum_{\tau=1}^{t} x_\tau x_\tau^{\mathsf{T}}$. The Confidence-Ball algorithm essentially uses the same procedure, where the matrix $A_t$ keeps track of $\sum_{\tau=1}^{t} x_\tau x_\tau^{\mathsf{T}}$ but is initialized with a barycentric spanner to make sure that $A_t$ is invertible for $t = 1, \ldots, T$. Instead of using the point estimate $\hat{\mu}_t$, the algorithm uses the best $\tilde{\mu}$ in an ellipsoid (confidence ball) around $\hat{\mu}_t$ (Line 4) to solve for $x_t$ (Line 5). Note that, as pointed out in [17], when $D$ is a polyhedron, every $x_t$ can be solved to be an extreme point of $D$.
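To illustrate the mechanics, here is a schematic sketch (ours) of one round over a finite set of extreme points. For a closed-form optimistic value we use the 2-norm ellipsoid $\{\tilde{\mu} : \|\tilde{\mu} - \hat{\mu}\|_A \le \beta\}$, whereas [17] works with the 1-norm ball defined above; the radius $\beta$ is left as a parameter.

```python
# A schematic sketch (ours) of one Confidence-Ball round over a finite
# set of extreme points. Uses a 2-norm ellipsoid for a closed-form
# optimistic value; [17] uses a 1-norm ball instead.
import numpy as np

def confidence_ball_round(A, bz, extreme_points, beta):
    """A: sum of x x^T so far (initialized, e.g., from a barycentric
    spanner so that it is invertible); bz: sum of z_t * x_t so far."""
    mu_hat = np.linalg.solve(A, bz)    # regression estimate of mu
    A_inv = np.linalg.inv(A)
    def optimistic(x):
        # max over the ellipsoid of mu.x = mu_hat.x + beta*||x||_{A^-1}
        return mu_hat @ x + beta * np.sqrt(x @ A_inv @ x)
    x_t = max(extreme_points, key=optimistic)
    return x_t, mu_hat

def update(A, bz, x, z):
    """Fold in the observation z = mu.x + noise for decision x."""
    return A + np.outer(x, x), bz + z * x
```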

[Algorithm 2: the Confidence-Ball algorithm of [17].]

6.1.2 Challenges of Applying the Confidence-Ball Algorithm. Two main challenges prevent us from directly applying the Confidence-Ball algorithm to find the optimal extreme point of the static linear program P in the optimal control problem. The first is the unknown feasibility region.
Recall that the feasibility region of P is written as $D = \{\{x_{nm}\} \mid \sum_{m: s_m \in S_n} x_{nm} = \lambda_n, \; \sum_{n: s_m \in S_n} x_{nm} \le \mu_m, \; x_{nm} \ge 0\}$. The set $D$ is unknown a priori, since it is parameterized by the unknown network statistics (arrival rates and service rates).
The second challenge arises from the delay in obtaining unbiased estimates of the objective function. The Confidence-Ball algorithm requires an unbiased estimate of the objective value $\mu \cdot x$ for a decision vector $x$. Such an unbiased estimate is not directly available in the optimal control problem, but it can be synthesized from utility observations of completed jobs. Consider a decision vector $x = \{x_{nm}\}$ for P. If for each $(n, m)$ with $x_{nm} > 0$ we have a utility observation $\hat{w}_{nm}$ from a job from $u_n$ completed at $s_m$ (which from now on will be referred to as an $(n,m)$-job), then $\sum_{n,m: x_{nm} > 0} \hat{w}_{nm} \cdot x_{nm}$ is an unbiased estimate of the objective function of P at $\{x_{nm}\}$. However, such synthesized estimates are not immediately available, since we only observe utilities upon the completion of jobs, which experience queueing delay.
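In code, the synthesis step is a one-liner once one fresh observation per pair is in hand (a sketch with our own naming):

```python
# A sketch of the estimate synthesis of Section 6.1.2: given one fresh
# utility observation w_hat[(n, m)] per (n, m) with x[(n, m)] > 0,
# return an unbiased estimate of sum_{n,m} w_nm * x_nm.
def synthesize_estimate(x, w_hat):
    return sum(w_hat[nm] * rate for nm, rate in x.items() if rate > 0)
```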
Therefore, the Confidence-Ball algorithm cannot be directly applied to the optimal control problem. In what follows, we propose an adapted version of the Confidence-Ball algorithm that addresses the aforementioned two challenges, and we integrate the algorithm with the JSQ-$K$ policy.

The Confidence-Ball JSQ-K Policy
In this section, we propose an adapted version of the Confidence-Ball algorithm and integrate it with the Dual-Level JSQ-$K$ policy to form the Confidence-Ball JSQ-$K$ policy, which will be shown to achieve logarithmic regret for the optimal control problem. The Adapted Confidence-Ball algorithm is the same as Algorithm 2, but with the optimization in Line 5 taken over the estimated feasibility region $\hat{D}$, i.e., $x_t = \arg\max_{x \in \hat{D}} \max_{\tilde{\mu} \in B_t} (\tilde{\mu} \cdot x)$, and with Line 6 replaced by synthesizing the unbiased estimate from the corresponding utility observations, as introduced in Section 6.1.2.
We define $\hat{x}^*_t$ as the optimal extreme point over $\hat{D}_t$, i.e., $\hat{x}^*_t := \arg\max_{x \in \hat{D}_t} \mu \cdot x$. Using the results of [17], we have the following proposition regarding the performance guarantee of the Adapted Confidence-Ball algorithm.

Proposition 3. With probability at least $1 - \frac{1}{T}$, during $T$ iterations of the Adapted Confidence-Ball algorithm, the number of $t \in \{1, \ldots, T\}$ such that $x_t \ne \hat{x}^*_t$ is in $O(\log^3 T)$.

Integration of Adapted Confidence-Ball and Dual-Level JSQ-$K$. We are now ready to introduce the Confidence-Ball JSQ-$K$ (CB-JSQ-$K$) policy, which is the joint policy that integrates the Adapted Confidence-Ball algorithm and the Dual-Level JSQ-$K$ policy for the optimal control problem.
The basic idea of the CB-JSQ-$K$ policy is to make routing decisions using the JSQ-$K$ policy based on the extreme-point solution provided by the Adapted Confidence-Ball algorithm. Due to the aforementioned feedback delay, the Adapted Confidence-Ball algorithm cannot update its solution every time slot. Instead, we employ an episodic version of the Adapted Confidence-Ball algorithm where the solution is updated once every episode (consisting of multiple time slots). The routing decisions within each episode are made by the JSQ-$K$ policy based on the same solution. The episode length is set long enough to ensure that the Adapted Confidence-Ball algorithm can obtain the utility observations necessary to synthesize unbiased estimates of the objective function.
More specifically, the CB-JSQ-$K$ policy embeds the JSQ-$K$ policy in an episodic version of the Adapted Confidence-Ball algorithm. The episode length is set to $L = \log^2 T \log\log T$. We index the episodes by $k = 1, \ldots$ and let $t_k$ be the first time slot of episode $k$. The policy maintains the matrix $A_k$, the estimate $\hat{\mu}_k$ of $\mu$, and the ellipsoid $B_k$ around $\hat{\mu}_k$ for every episode $k$. At the beginning of each episode, it solves for an extreme point $x_k := \arg\max_{x \in \hat{D}_k} \max_{\tilde{\mu} \in B_k} (\tilde{\mu} \cdot x)$ through optimization over $B_k$ and $\hat{D}_k$, where $\hat{D}_k$ is the estimate of the feasibility region (Definition 1) at the beginning of episode $k$. Let $\mathcal{T}_k$ be the spanning forest induced by $x_k$. If $\mathcal{T}_k \ne \mathcal{T}_{k-1}$, the policy discards all the jobs in the queues. Then, it makes routing and scheduling decisions based on the JSQ-$K$ policy using $\mathcal{T}_k$, while collecting utility observations. At the end of the episode, the policy updates $\hat{D}$ based on the observations of arrivals and offered services, and updates $A_k$ and $\hat{\mu}_k$ if at least one utility observation of an $(n,m)$-job has been obtained for every $(n,m)$ with $x_{k,nm} > 0$.
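The episodic structure can be summarized by the following high-level sketch (ours); `estimate_region`, `solve`, `jsqk_step`, `spanning_forest` and `has_all_pairs` are hypothetical stand-ins for the components described above.

```python
# A high-level sketch (ours) of the CB-JSQ-K episode loop; all helper
# names are hypothetical stand-ins for the components described above.
def cb_jsqk(T, K, episode_len, learner, network):
    prev_forest, t = None, 0
    while t < T:
        D_hat = network.estimate_region()   # from arrival/service samples
        x_k = learner.solve(D_hat)          # optimistic extreme point
        forest = spanning_forest(x_k)
        if forest != prev_forest:
            network.clear_queues()          # discard jobs on a forest switch
        observations = []
        for _ in range(min(episode_len, T - t)):
            # Route/schedule by JSQ-K on the current forest; collect
            # the utilities of jobs completed in this slot.
            observations += network.jsqk_step(forest, K)
            t += 1
        if has_all_pairs(observations, x_k):  # one (n,m)-job per x_nm > 0
            learner.update(x_k, synthesize_estimate(x_k, dict(observations)))
        prev_forest = forest
```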

[Algorithm 3: pseudo-code of the CB-JSQ-$K$ policy.]

Theorem 2. The regret of the CB-JSQ-$K$ policy for the optimal control problem is in $O(\log^5 T \log\log T)$.

Proof Sketch: We provide the main idea of the proof here and defer the details to Appendix D. The first step of the proof is to show that we will be able to obtain all the necessary utility observations in each episode (with high probability) after $t \ge \log^3 T$. This holds since, after $t \ge \log^2 T$, the estimates $\hat{\lambda}_n$, $\hat{\mu}_m$ will be sufficiently close to $\lambda_n$, $\mu_m$, so that the estimated feasibility region $\hat{D}$ is sufficiently close to the true feasibility region $D$. It follows that the forest induced by any extreme point of $\hat{D}$ will be feasible with respect to $D$. Therefore, we can construct drift arguments similar to those in the proof of Theorem 1 to show that within each episode $k$, at least one $(n,m)$-job is completed for every $(n,m)$ with $x_{k,nm} > 0$ with high probability, which implies that all the necessary utility observations are obtained. The second step is to show that after $t \ge \log^2 T$, the optimal extreme point over $\hat{D}$ and the optimal extreme point over $D$ induce the same spanning forest. This again follows from $\hat{\lambda}$, $\hat{\mu}$ being sufficiently close to $\lambda$, $\mu$. Combined with Proposition 3, this implies that there are at most $O(\log^3 T)$ episodes in which the spanning forest used by the JSQ-$K$ policy is sub-optimal. Based on the previous two steps, we proceed to analyze the regret of the CB-JSQ-$K$ policy. We divide the time horizon into periods, where each period is formed by consecutive episodes with the same forest. The regret of the CB-JSQ-$K$ policy is the sum of the regret over the periods. We call a period/episode correct if its spanning forest coincides with the optimal one, and incorrect otherwise. Since there are at most $O(\log^3 T)$ episodes where $\mathcal{T}_k$ is not equal to the optimal forest, there are at most $O(\log^3 T)$ periods. The total length of the incorrect periods is upper bounded by the total number of incorrect episodes times the episode length, which is $O(\log^5 T \log\log T)$ time slots. Therefore, the regret incurred in incorrect periods is in $O(\log^5 T \log\log T)$. Whenever the policy switches between periods, it discards all the jobs in the queues, which in total incurs $O(\log^4 T)$-regret, as the total queue length is in $O(\log T)$. Finally, by Theorem 1, as the optimal extreme point is non-degenerate, the regret incurred in each correct period is $O(\log T)$. In summary, the regret of the CB-JSQ-$K$ policy is in $O(\log^5 T \log\log T)$.
Discussion: The main reason that the Confidence-Ball JSQ-$K$ policy can achieve (poly-)logarithmic regret instead of $\Omega(\sqrt{T})$-regret is that it tries to learn the optimal spanning forest, i.e., the structure of the optimal solution to the linear program P, instead of the value of the optimal solution. Since the total number of extreme points is finite, when the assumptions are satisfied there is enough "separation" between different extreme points, whereas no such separation exists for the value of the solution, which lies in a continuous set. This is the main reason why learning the optimal structure is more robust than learning the optimal value.

SIMULATIONS
In this section, we evaluate the empirical performance of our policies via simulations on the network shown in Figure 3(a). The arrival rates and service rates are shown in the figure. The underlying utilities are chosen such that the extreme point shown in Figure 3(b) is the optimal extreme point. For each dispatcher $u_n$, $a_n(t)$ is chosen as a uniform integer between $\lambda_n - 2$ and $\lambda_n + 2$, and for each server $s_m$, $b_m(t)$ is chosen as a uniform integer between $\mu_m - 2$ and $\mu_m + 2$. We first study the growth of regret with the time horizon. We vary the time horizon $T$ in $\{10000, 20000, \ldots, 200000\}$ and compare the performance of three policies:

• Static: the static optimal control policy based on the optimal solution to P.

• JSQK: the Dual-Level JSQ-$K$ policy with $K = 20 \log T$, given the optimal extreme point.

• CB-JSQK: the Confidence-Ball JSQ-$K$ policy with $K = 20 \log T$.
The regret of each policy is approximated by the difference between $T \cdot V(\mathrm{P})$ and the total utility obtained over the time horizon. We plot the regret and the total queue lengths at the end of the time horizon in Figure 4. The results are averaged over 10 runs.
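For reference, the sampling model and the regret estimate described above amount to the following (a sketch with our own function names):

```python
# A sketch of the simulation primitives of Section 7: integer arrivals
# and services uniform on [rate - 2, rate + 2], and regret approximated
# by T * V(P) minus the realized total utility, averaged over runs.
import random

def sample_arrival(lam):
    return random.randint(int(lam) - 2, int(lam) + 2)

def sample_service(mu):
    return random.randint(int(mu) - 2, int(mu) + 2)

def empirical_regret(total_utilities, T, V_P):
    return sum(T * V_P - u for u in total_utilities) / len(total_utilities)
```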
From Figure 4(a), we can see that JSQK and CB-JSQK have significantly lower regret than the static optimal policy, which validates our theoretical analysis, as the former two achieve logarithmic regret while the latter has a regret of $\Omega(\sqrt{T})$. The regret of CB-JSQK is higher than that of JSQK, since it needs to learn the optimal extreme point while JSQK is given the optimal extreme point. A similar phenomenon can be observed for the queue lengths in Figure 4(b). However, the queue length under CB-JSQK is slightly lower than under JSQK, which can be attributed to the fact that the CB-JSQK policy clears the queues when the current extreme point changes between episodes.
Next, we study the sensitivity of the performance of JSQK and CB-JSQK to the parameter $K$. We fix the time horizon to $T = 100000$ and vary $K$ in $\{20, 40, \ldots, 200\}$. We plot the regret and the total queue lengths at the end of the time horizon in Figure 5. From Figure 5, we can see that for both JSQK and CB-JSQK, the total queue lengths increase with $K$, which is not surprising given the role of $K$ as the queue-length threshold in the policies. Furthermore, the regret of both policies is not sensitive to $K$ as long as its value is in the range of 80 to 200 (for a time horizon of 100000).

CONCLUSION
In this paper, we studied the optimal control problem in stochastic bipartite queueing networks, where we developed an admissible policy with low regret compared to the optimal dynamic policy. This problem belongs to a first class of problems that focus on the challenges of combining learning and network control, where the learning aspect and the network control aspect are not separable and must be co-designed. We first showed that the expected utility of the optimal dynamic policy is upper bounded by $T$ times the solution to a static linear program, where the optimization variables correspond to rates of jobs from dispatchers to servers and the feasibility region is parameterized by the arrival rates and service rates. We next proposed the CB-JSQ-$K$ policy for the optimal control problem, which is an integration of an adapted version of the Confidence-Ball algorithm (the learning algorithm) and the Dual-Level JSQ-$K$ policy (the control policy). The Dual-Level JSQ-$K$ policy relies on the spanning forest structure induced by the extreme points of the static linear program, while the Confidence-Ball algorithm seeks to learn the optimal extreme point. We proved that the CB-JSQ-$K$ policy achieves logarithmic regret, which is superior to techniques in previous works that can only achieve $\Omega(\sqrt{T})$-regret. There are several future directions. First, it would be interesting to consider utility functions that depend on the waiting times of the jobs instead of only the dispatcher and the server. The second direction involves the lower bound for the optimal control problem. Since the optimal control problem can be considered as a generalization of the multi-armed bandit problem, following from the lower bound for multi-armed bandits [11], a regret lower bound of $\Omega(\log T)$ also holds for the optimal control problem. An important open problem is thus whether stronger lower bounds (e.g., poly-logarithmic in $T$) hold for the optimal control problem.

A PROOF OF PROPOSITION 2
Proof. The proof consists of three steps. We first show that the set of essential links forms a forest in the bipartite network, and that each node in $\mathcal{G}$ is either incident to an essential link or is an idle server, which implies that the essential links and idle servers form a spanning forest. Second, we prove that each tree in the spanning forest contains at most one slack server. Finally, we argue that for a non-degenerate extreme point, each tree in the forest contains exactly one slack server. The proof relies on the connection between the optimization problem P and network flows in an extended network $\tilde{\mathcal{G}}$ of $\mathcal{G}$ (Definition 2): $\tilde{\mathcal{G}}$ augments $\mathcal{G}$ with a source node $\tilde{a}$ connected to each dispatcher $u_n$ by a link of capacity $\lambda_n$, and a sink node $\tilde{b}$ to which each server $s_m$ is connected by a link of capacity $\mu_m$, while the links between $U$ and $S$ have infinite capacity. An example of the extended network is shown in Figure 6. Consider the flow polytope formed by the set $F$ of $\tilde{a}$-$\tilde{b}$ flows of value $\sum_n \lambda_n$ on $\tilde{\mathcal{G}}$. Standard results in network flow [29] show that the flow polytope is equivalent to the feasibility region of P, with the equivalence manifested by the flow value on the link $(u_n, s_m)$ in $F$ corresponding to the value of the variable $x_{nm}$ in P. It follows that the extreme points of the flow polytope are equivalent to the extreme points of P. For an $\tilde{a}$-$\tilde{b}$ flow in $\tilde{\mathcal{G}}$, we say a link is unsaturated if the flow value on the link is smaller than its capacity. The following lemma (Lemma 1) characterizes the structural property of extreme points of the flow polytope: under any extreme point of the flow polytope, the unsaturated links with positive flow contain no cycle. It is a standard result in the network flow literature and can be found in [29]. Based on Definition 2 and Lemma 1, we are ready to carry out the three steps of the proof. For the first step, note that as the capacities of the links between $U$ and $S$ are infinite, those links are unsaturated under any flow. It follows that every essential link (in $\mathcal{G}$) corresponds to an unsaturated link with positive flow in the extended network $\tilde{\mathcal{G}}$. If the essential links did not form a forest (i.e., they contained a cycle), then there would be a cycle of unsaturated links with positive flow in the extended network, which contradicts the condition that $\{x_{nm}, y_m\}$ is an extreme point, by Lemma 1. Furthermore, for each dispatcher $u_n$, as $\sum_{m: s_m \in S_n} x_{nm} = \lambda_n > 0$, there must exist an $s_m \in S_n$ with $x_{nm} > 0$. Thus, each dispatcher is incident to at least one essential link. For each server $s_m$, note that if $s_m$ is not incident to any essential link, then we have $y_m = \mu_m > 0$ and $s_m$ is an idle server. Therefore, the essential links and idle servers form a spanning forest of $\mathcal{G}$, where each tree in the forest is either a tree that contains essential links, servers and dispatchers, or a trivial tree that only contains an idle server. For the second step, consider an arbitrary tree in the forest induced by the extreme point. Suppose, for the sake of contradiction, that there exist two slack servers $s$, $s'$ in the tree. Then the links $(s, \tilde{b})$ and $(s', \tilde{b})$ are unsaturated with positive flow. Furthermore, since $s$ and $s'$ are in the same tree, there exists a path of essential links connecting $s$ and $s'$. Therefore, in the extended network $\tilde{\mathcal{G}}$, the path together with the links $(s, \tilde{b})$ and $(s', \tilde{b})$ forms a cycle of unsaturated links, which contradicts Lemma 1. Hence, there is at most one slack server in each tree.
Finally, if the extreme point is non-degenerate, then all variables in the basis must be strictly positive. Note that there are $N + M$ constraints in P (not including the non-negativity constraints). Hence, there are $N + M$ variables in the basis. Suppose that under the extreme point there are $\kappa$ trees in the spanning forest. Then, as the spanning forest has $N + M$ nodes, it must have $N + M - \kappa$ edges, which correspond to $N + M - \kappa$ essential links. Each essential link corresponds to a basic variable, and, as the extreme point is non-degenerate, links with zero flow are not in the basis; it follows that there are $\kappa$ variables $y_m > 0$ in the basis that correspond to slack servers. Since in the second step we have shown that each tree contains at most one slack server, each tree contains exactly one slack server when the extreme point is non-degenerate. □

B REGRET LOWER BOUND OF STATIC POLICIES
In this section, we establish a lower bound on the regret of static policies. We formally define static policies as ones under which the numbers of jobs sent from dispatcher $u_n$ to server $s_m$, i.e., $e_{nm}(0), \ldots, e_{nm}(T)$ at times $t = 0, \ldots, T$, are independent random variables with the same mean. Note that we only require independence of the decisions corresponding to each dispatcher-server pair across time, but do not ask for independence of the $e_{nm}(t)$'s across different dispatcher-server pairs for the same $t$ (in fact, they are often not independent, as they jointly have to satisfy the constraint imposed by the number of arriving jobs). Also, the requirement is with respect to the routing policy at the dispatchers, while the servers can employ an arbitrary scheduling policy. The lower bound is summarized as follows.
Proposition 4. There exist instances of the optimal control problem in which any static policy has $\Omega(\sqrt{T})$-regret.
Proof. We consider an instance with one dispatcher ($N = 1$) and two servers ($M = 2$). Server $s_1$ has an integer service rate $\mu_1$, with its offered service $b_1(t)$ being an integer chosen from $\{\mu_1 - 1, \mu_1 + 1\}$ uniformly at random. Server $s_2$ has an integer service rate $\mu_2$, with its offered service $b_2(t)$ being an integer chosen from $\{\mu_2 - 1, \mu_2 + 1\}$ uniformly at random. The underlying utilities satisfy $w_{11} = w_{12} + 1$. The arrival process is deterministic, with $a_1(t) = \lambda_1$ for each $t$. Note that we explicitly specify the distributions of the arrival and service processes only for concreteness. It should be clear from the proof that the result holds for a wide range of instances, not restricted to $M = 2$ or the distributions assumed here.

C PROOF OF THEOREM 1
Proof. In this section, we give the details of the proof of Theorem 1. Recall that for each dispatcher $u_n$ (of priority level $H - h$) with primary servers $s_{m_1}, \ldots, s_{m_J}$ and secondary server $s_{m_0}$, we establish Claims (1.1) and (1.2), and for each server we establish Claims (2.1) and (2.2); proving Claims (1.2) and (2.2) for all the dispatchers and servers will conclude the proof of the theorem.

Base Step: The base step starts from the nodes of priority level $H - 1$, since the nodes of priority level $H$ are leaf nodes (with no children), for which the claims are trivial. We begin with the following fact, due to the arrivals and offered services being upper bounded by the constant $c$: the one-step change of any queue length is at most $c$ with probability 1.
Claim (1.2): For Claim (1.2), recall that $b_m(t, \omega)$ is the offered service of server $s_m$ at time $t$ on the sample path $\omega$. Note that by construction, $x_{n m_j} = \mu_{m_j}$. Hence, we first have, for each $j = 1, \ldots, J$, $\mathbb{E}[\sum_{t=1}^{T} \tilde{c}_{n m_j}(t)] \le \sum_{t=1}^{T} \mathbb{E}[b_{m_j}(t)] = T x_{n m_j}$. Next, as $\tilde{c}_{n m_j}(t) = b_{m_j}(t)$ on sample paths where there is no idleness in the queue of server $s_{m_j}$, the gap between $\mathbb{E}[\sum_{t=1}^{T} \tilde{c}_{n m_j}(t)]$ and $T x_{n m_j}$ is bounded by the expected cumulative idleness of $s_{m_j}$, which is $O(\log T)$ by Claim (1.1) that we just proved. This concludes the proof for the first case of the base step.

Base Step - Case 2: We next consider the case where the nodes of priority level $H - 1$ are servers and prove Claims (2.1) and (2.2) for these nodes. Note that here, the server's primary dispatchers $u_{n_1}, \ldots, u_{n_I}$ are only connected to $s_m$, which is their secondary server in the tree.

Claim (2.1): In this case, the high-priority queue length $Q^h_m(t)$ itself can be used as the potential function for our analysis. For simplicity of notation, we will write $\Phi_2(t) := Q^h_m(t)$. When $\Phi_2(t) \ge c$, there is no idleness for the high-priority queue at server $s_m$ at time $t$. Therefore, the conditional drift satisfies $\mathbb{E}[\Phi_2(t+1) - \Phi_2(t) \mid \Phi_2(t) \ge c] \le \sum_{i=1}^{I} \lambda_{n_i} - \mu_m = -x_{n_0 m} \le -\Delta_2$, where the last inequality is due to the non-degeneracy of the extreme point. Also, $|\Phi_2(t+1) - \Phi_2(t)| \le c$ with probability 1. Therefore, letting $D_2(t) = \Phi_2(t+1) - \Phi_2(t)$, we obtain a drift bound with margin $\delta = \frac{1}{4}\Delta_2$. It follows, similarly as in (10) and (11), that the tail of $\Phi_2(t)$ contracts in each step. Iterating the resulting inequality (25) and noting that $\Phi_2(0) = 0$, we obtain an exponentially decaying bound on $\mathbb{P}[\Phi_2(t) > \epsilon_1 K]$. Thus, we have $\sum_{t=1}^{T} \mathbb{P}[Q^h_m(t) > \epsilon_1 K] = O(1)$, which implies the claim.

Claim (2.2): By the definition of our JSQ-$K$ policy, $x_{n_i m} = \lambda_{n_i}$. Hence, we first have, for each $i = 1, \ldots, I$, $\mathbb{E}[\sum_{t=1}^{T} \tilde{c}_{n_i m}(t)] \le T \lambda_{n_i} = T x_{n_i m}$. Next, note that the incoming jobs from dispatcher $u_{n_i}$ are either discarded, still in the queue, or completed by $T$, and job discarding can only happen when $Q^h_m(t) > K$. It follows that the gap between $\mathbb{E}[\sum_{t=1}^{T} \tilde{c}_{n_i m}(t)]$ and $T x_{n_i m}$ is $O(\log T)$, where the bound holds because of Claim (2.1) that we just proved and because $Q^h_m(T)$ is no larger than $K + c$ almost surely. This concludes the proof for the second case of the base step.

Induction Step: We now proceed to the induction step of the proof. Suppose that Claims (1.1), (1.2), (2.1), (2.2) hold for the nodes (dispatchers, servers) of priority levels $H, H-1, \ldots, H-h+1$. We consider a node of priority level $H - h$.

Induction Step - Case 1: Consider a dispatcher node $u_n$ of priority level $H - h$. It is connected to its primary servers $s_{m_1}, \ldots, s_{m_J}$ of priority level $H - h + 1$, and its secondary server $s_{m_0}$ of priority level $H - h - 1$.
Induction Step: We now proceed to the induction step of the proof. Suppose that Claims (1.1), (1.2), (2.1), (2.2) hold for nodes (dispatchers and servers) of priority levels $K, K-1, \ldots, K-h+1$. We consider a node of priority level $K - h$.
Induction Step - Case 1: Consider a dispatcher node $u$ of priority level $K - h$. It is connected to its primary servers $s_1, \ldots, s_k$ of priority level $K-h+1$, and to its secondary server $s_0$ of priority level $K-h-1$. To establish a lower bound on $\sum_{t=1}^{T} \mathbb{E}[\tilde S_u(t)]$, we observe that $\mu_{\max} \cdot \sum_{t=1}^{T} \mathbb{P}\{Q^h_s(t) + Q^\ell_s(t) \leq q_1\}$ upper-bounds the total amount of wasted service at the corresponding server due to idleness. By the induction hypothesis for the dispatchers in $\mathcal{U}_s$, we have, for each $u' \in \mathcal{U}_s$, $\mathbb{E}[\sum_{t=1}^{T} \tilde S_{u'}(t)] \leq \lambda_{u'} T + O(\log T)$. Combining this with Claim (1.1) for dispatcher $u$ that we just proved, we obtain the desired bound, from which Claim (1.2) follows.
Induction Step - Case 2: Consider a server node $s$ of priority level $K - h$. It is connected to its primary dispatchers $d_1, \ldots, d_k$ of priority level $K-h+1$, and to its secondary dispatcher $d_0$ of priority level $K-h-1$.
Claim (2.1): We consider the sub-tree rooted at $s$. Denote the set of dispatchers in the sub-tree as $\mathcal{U}_s$ and the set of servers in the sub-tree (including $s$) as $\mathcal{S}_s$. Let $\mathcal{M}_j$ be the set of server nodes of priority level $K-h+j$ in the sub-tree. Let $\mathcal{J} := \{2, 4, \ldots, h-1 \text{ or } h\}$ be the values of $j$ corresponding to the server nodes in the sub-tree other than $s$. Let $h_s = \sum_{j \in \mathcal{J}} |\mathcal{M}_j|$, i.e., the total number of servers (excluding $s$) in the sub-tree. We consider the potential function $\Phi_2(t) := \sum_{j \in \mathcal{J}} \sum_{s' \in \mathcal{M}_j} [Q^h_{s'}(t) + Q^\ell_{s'}(t)] + Q^h_s(t)$. Again, as in the induction step of Claim (1.1), we focus on the sample paths where the high-probability events of the inductive hypothesis hold, which have a probability mass of $1 - O(1/T)$. For $t \geq t_0 = c_1 (h-1) \ln T$, according to the induction hypothesis, we have the following observations:
• Using Claim (1.1) of the induction hypothesis, for each $j \in \mathcal{J}$ and $s' \in \mathcal{M}_j$, the high-priority queue length $Q^h_{s'}(t)$ is suitably upper bounded.
• Using Claim (2.1) of the induction hypothesis, and the fact that $Q^\ell_{s'}(t) \leq q_1 + \lambda_{\max}$ for all $s'$, we have that for $s' \in \mathcal{M}_j$, $Q^h_{s'}(t) + Q^\ell_{s'}(t) \leq (1+\gamma) q_1 + \lambda_{\max}$.
Let $q_2 = \sum_{j \in \mathcal{J}} |\mathcal{M}_j| (1-\gamma) q_1 + h_s$. To establish the claim, it suffices to bound the probability that $\Phi_2(t) \geq q_2$.
Observe that $\Phi_2$ has strictly negative expected drift whenever $\Phi_2(t) \geq q_2$. It follows from (68) that $\mathbb{P}[\Phi_2(t) \geq q_2] \leq \frac{2}{T^2}$ for $t \geq c_1 h \ln T$, and thus the claim holds.
Claim (2.2): For each $i = 1, \ldots, k$, we first note that the number of jobs from $d_i$ completed at $s$ is upper bounded by the total number of jobs sent from $d_i$ to $s$, which is in turn upper bounded by the difference between the total number of incoming jobs at $d_i$ and the number of jobs completed by the primary servers of $d_i$. More formally, letting $\mathcal{M}_{d_i}$ be the set of primary servers of $d_i$, we obtain the corresponding upper bound, and invoking the induction hypothesis on Claim (1.2) for $d_i$ yields the desired inequality. On the other hand, the total number of jobs from $d_i$ completed at $s$ is lower bounded by the total number of incoming jobs at $d_i$, minus the total number of jobs completed at the primary servers of $d_i$, the number of unfinished jobs from $d_i$ (still in the queues), and the total number of jobs discarded by $d_i$ due to all of its primary servers and its secondary server ($s$) having queue length larger than $q_{\mathrm{th}}$. It follows that the matching lower bound holds, where the last inequality follows from Claim (2.1) for $s$ that we just proved and the induction hypothesis of Claim (1.2) for $d_i$. □
Remark: From the proof of Theorem 1, we see that to achieve a logarithmic utility gap, it suffices to set $q_{\mathrm{th}}$ to $8 \cdot [4(\lambda_{\max} + \mu_{\max})]^{\bar h} \log T$, where $\bar h$ is the maximum height of the trees in the forest induced by the extreme point. Therefore, the value of $q_{\mathrm{th}}$ need not depend on $\Delta_2$ (although the upper bound on the utility gap is proportional to $\frac{1}{\Delta_2}$). Moreover, as we did not tighten the bound in terms of constant factors, the constant factor in the value of $q_{\mathrm{th}}$ could actually be set to a much smaller value.
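For a sense of scale, the threshold in the remark can be computed directly. In the sketch below, the values of $\lambda_{\max}$, $\mu_{\max}$, $\bar h$, and $T$ are assumed examples only; as the remark notes, the constant factor is far from tight.

```python
import math

# q_th = 8 * [4(lambda_max + mu_max)]^h_bar * log T, as in the remark.
def q_threshold(lam_max: float, mu_max: float, h_bar: int, T: int) -> float:
    return 8 * (4 * (lam_max + mu_max)) ** h_bar * math.log(T)

# Assumed example: lam_max = mu_max = 5, tree height 2, horizon 10^6.
print(q_threshold(5, 5, 2, 10**6))  # about 1.8e5; still O(log T) in T
```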

D PROOF OF THEOREM 2
Proof. Throughout the proof, we refer to events that happen with probability at least $1 - O(\frac{1}{T})$ as events "with high probability". As in Theorem 1, the threshold $q_{\mathrm{th}}$ is chosen to be $8\,[4(\lambda_{\max} + \mu_{\max})]^{\bar h} \log T$, where $\bar h$ upper-bounds the maximum height of the trees in the forest induced by any extreme point (e.g., $\bar h = N + M$). Throughout the proof, we make arguments about the extreme points of $\mathcal{P}$, which are most conveniently stated with respect to the standard form of $\mathcal{P}$. Recall that the standard form of $\mathcal{P}$ has a feasibility region of the form $\mathcal{F} = \{z \mid Az = b, z \geq 0\}$, where $z = (x, e)$ collects the rate variables and the slack variables, $A$ is the constraint matrix, and $b$ is the constraint vector (formed by the arrival rates and service rates). We use $B \subseteq A$ to denote a generic basis (a maximal set of linearly independent columns of $A$). We use $z$, or $x$ when we are focusing only on the $\{x_{nm}\}$ components, to denote a generic extreme point.
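To make the standard-form notation concrete, the following toy sketch (with a made-up constraint matrix and vector, not the paper's data) extracts a basis $B$ from $A$, computes the corresponding extreme point by solving $B z_B = b$, and checks its feasibility.

```python
import numpy as np

# Toy illustration of the standard-form objects: F = {z | Az = b, z >= 0},
# a basis B formed by linearly independent columns of A, and the extreme
# point obtained by solving B z_B = b (non-basic components set to 0).
# A and b below are made-up examples.
A = np.array([[1.0, 1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0, 1.0]])  # constraint matrix (slacks included)
b = np.array([4.0, 3.0])              # arrival/service rate vector

basis = [0, 1]                        # chosen basis column indices
z_basic = np.linalg.solve(A[:, basis], b)

z = np.zeros(A.shape[1])
z[basis] = z_basic
print(z, bool(np.all(z_basic >= 0)))  # feasible iff B^{-1} b >= 0
```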
The key step of the proof is to show that, with high probability, we obtain at least one utility observation of an $(n, m)$-job for each $(n, m)$ with $x^{(k)}_{nm} > 0$, where $x^{(k)}$ is the extreme point used in episode $k$. This, combined with Corollary 3 and Theorem 1, leads to the proof of the theorem. We formally state the claim in Lemma 2.
Lemma 2. With probability $1 - O(\frac{1}{T})$, for each episode $k \geq 2$, at least one $(n, m)$-job is completed during episode $k$ for every $x^{(k)}_{nm} > 0$.
Proof of Lemma 2: First, we note that by the start of any episode $k \geq 2$, at least $\log^2 T$ samples of each arrival and service process have been collected. It follows from the Azuma-Hoeffding inequality that, with high probability, $\|\hat b - b\| = O(\frac{1}{\log T})$. Let $B^{(k)}$ be the basis corresponding to $x^{(k)}$. We begin by showing that $B^{(k)}$ is a feasible basis with respect to the true feasibility region of $\mathcal{P}$. Indeed, let $\hat b$ be the constraint vector of $\hat{\mathcal{P}}$ based on the empirical means of the arrivals and services. Since $B^{(k)}$ is feasible with respect to $\hat{\mathcal{P}}$, we have $(B^{(k)})^{-1} \hat b \geq 0$. If $B^{(k)}$ is not feasible with respect to $\mathcal{P}$, then by Condition 2, there exists a component $i$ such that $((B^{(k)})^{-1} b)_i \leq -\Delta_2$, which leads to a contradiction. Therefore, $B^{(k)}$ is a feasible basis with respect to $\mathcal{P}$. It follows that all the drift inequalities (under the arrival rates $\{\lambda_n\}$ and service rates $\{\mu_m\}$) in the proof of Theorem 1 hold for the JSQ policy based on the forest $\mathcal{T}^{(k)}$, except possibly for the root servers (since the extreme point can be degenerate). Thus, after $O(\log T)$ time slots from the beginning of the episode, with high probability, the servers (except for the root servers) are never idle, and the dispatchers never discard incoming jobs (except possibly for the dispatchers directly connected to a root server). We again focus on this set of sample paths, which, as has been shown, does not affect the regret analysis.
Now we show that, with high probability, at least one $(n, m)$-job is completed for every $x^{(k)}_{nm} > 0$ during episode $k$. First, we consider the case where $u_n$ is a primary dispatcher of $s_m$, i.e., the jobs from $u_n$ receive priority service at $s_m$ in its (virtual) high-priority queue $Q^h_m$. Note that since $Q^h_m(t) \leq q_{\mathrm{th}} + \lambda_{\max} = O(\log T)$, the queueing delay experienced by any $(n, m)$-job is $O(\log T)$ with high probability. Furthermore, we show that within $O(\log T)$ time slots from the beginning of the episode, dispatcher $u_n$ sends at least one job to $Q^h_m$; this, combined with the previous argument, proves the claim. Indeed, let $s_{m_1}, \ldots, s_{m_k}$ be the primary servers of $u_n$. By the construction of the JSQ policy, dispatcher $u_n$ sends incoming jobs to $s_m$ when $Q_{m_i}(t) > q_{\mathrm{th}}$ for all $i = 1, \ldots, k$. Let $\mathcal{M}_n$ be the set of servers in the sub-tree of $u_n$. From the previous discussion and using Condition 2, we have that the function $\sum_{s' \in \mathcal{M}_n} [Q^h_{s'}(t) + Q^\ell_{s'}(t)]$ has positive drift of at least $\Delta_2$ whenever there exists some $i \in \{1, \ldots, k\}$ with $Q_{m_i}(t) \leq q_{\mathrm{th}}$. Note that this still holds even when $s_m$ is a root server. Hence, following a drift analysis similar to that in the proof of Theorem 1, after at most $O(\log T)$ time slots, all the queues $Q_{m_i}$, $i = 1, \ldots, k$, exceed $q_{\mathrm{th}}$, and $u_n$ sends its incoming jobs to $s_m$.
Second, we consider the case where $u_n$ is the secondary dispatcher of $s_m$, i.e., the jobs from $u_n$ receive service at $s_m$ in its (virtual) low-priority queue $Q^\ell_m$ when $Q^h_m$ is empty. We show that, with high probability, $Q^h_m$ is empty (and jobs in $Q^\ell_m$ are served) every $O(\log T)$ time slots. Indeed, consider the sub-tree rooted at $s_m$ and let $\mathcal{M}_m$ be the set of servers in the sub-tree (not including $s_m$). From the previous discussion, we have that the function $Q^h_m(t) + \sum_{s' \in \mathcal{M}_m} [Q^h_{s'}(t) + Q^\ell_{s'}(t)]$ has a negative drift of at least $\Delta_2$ when $Q^h_m$ is not idle. Hence, following a drift analysis similar to that in the proof of Theorem 1, after at most $O(\log T)$ time slots, $Q^h_m$ becomes idle and $Q^\ell_m$ receives service. Since $Q^\ell_m$ is bounded by $q_{\mathrm{th}} + \lambda_{\max} = O(\log T)$, it follows that the queueing delay experienced by any job in $Q^\ell_m$ is $O(\log^2 T)$. As $s_m$ is a primary server of $u_n$, there will be at least one $(n, m)$-job in $Q^\ell_m$ within $O(\log T)$ slots from the beginning of the episode. Combining the above arguments, we conclude that at least one $(n, m)$-job is completed in an episode of length $\log^2 T \log\log T$ with high probability. □
From Lemma 2, we see that we can focus on the set of sample paths where the necessary utility observations are obtained in every episode $k \geq 2$. Let $\hat x^*$ be the optimal extreme point (with respect to the true utility vector $w$) of $\hat{\mathcal{P}}$. The analysis in [17] directly implies the following extension of Corollary 3: with probability at least $1 - \frac{1}{T}$, $x^{(k)} \neq \hat x^*$ for at most $O(\log^3 T)$ episodes. We proceed to show in Lemma 3 that, for $k \geq 2$, $\hat x^*$ induces the same forest as the optimal extreme point $x^*$ of $\mathcal{P}$.
Lemma 3. For $k \geq 2$, the optimal extreme point of $\hat{\mathcal{P}}$ induces the same forest as the optimal extreme point of $\mathcal{P}$.
Proof of Lemma 3: Let $B^*$ be the basis of the optimal extreme point $x^*$ of $\mathcal{P}$. Since, from the previous discussion, $\|\hat b - b\| = O(\frac{1}{\log T})$ with high probability for $k \geq 2$, we have that $B^*$ is also a feasible basis for $\hat{\mathcal{P}}$. Suppose, for the sake of contradiction, that the optimal extreme point of $\hat{\mathcal{P}}$ has a different basis $B$. From the proof of Lemma 2, we see that $B$ is also a feasible basis of $\mathcal{P}$. Let $\tilde x := B^{-1} b$ be the extreme point of $\mathcal{P}$ with respect to $B$, and let $\hat x^* := (B^*)^{-1} \hat b$ be the extreme point of $\hat{\mathcal{P}}$ corresponding to the basis $B^*$. We have that $\|\hat x^* - x^*\| = O(\frac{1}{\log T})$ and $\|\hat x - \tilde x\| = O(\frac{1}{\log T})$. It follows that $|w \cdot \hat x^* - w \cdot \tilde x| = O(\frac{1}{\log T})$, which contradicts Condition 1. Thus, we conclude the proof of the lemma. □
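The stability used in Lemmas 2 and 3 can be illustrated numerically: if every basic component of $B^{-1} b$ is at least $\Delta_2$ away from zero, then a perturbation of $b$ of size $O(\frac{1}{\log T})$ cannot flip the feasibility of the basis, and hence cannot change the induced forest. A toy check with made-up data:

```python
import numpy as np

# Toy check of basis stability: a feasibility margin Delta_2 on B^{-1} b
# absorbs any sufficiently small perturbation of the constraint vector b,
# so the basis (and the spanning forest it induces) is unchanged.
# All data below are made-up illustrations.
B = np.array([[1.0, 1.0],
              [1.0, 0.0]])
b_true = np.array([4.0, 3.0])
delta_2 = 0.5  # assumed feasibility margin
assert np.all(np.linalg.solve(B, b_true) >= delta_2)

rng = np.random.default_rng(0)
eps = 0.05  # plays the role of the O(1/log T) estimation error
for _ in range(1000):
    b_hat = b_true + rng.uniform(-eps, eps, size=2)
    assert np.all(np.linalg.solve(B, b_hat) > 0)  # basis remains feasible
print("basis feasibility stable under small perturbations of b")
```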
From Lemma 3, we conclude that, with high probability, there are at most $O(\log^3 T)$ episodes where $\mathcal{T}^{(k)}$ is not equal to the forest induced by the optimal extreme point $x^*$. We divide the time horizon into periods, where each period is formed by consecutive episodes with the same forest. The regret of the CB-JSQ policy is the sum of the regret over the periods. We call a period/episode correct if its spanning forest coincides with the optimal one, and incorrect otherwise. Since there are at most $O(\log^3 T)$ incorrect episodes, there are at most $O(\log^3 T)$ periods. The total length of the incorrect periods is upper bounded by the total number of incorrect episodes times the episode length, which is $O(\log^5 T \log\log T)$ time slots. Therefore, the regret incurred in incorrect periods is $O(\log^5 T \log\log T)$. Whenever the policy switches between periods, it discards all the jobs in the queues, which in total incurs $O(\log^4 T)$ regret, as the total queue length is $O(\log T)$. Finally, by Theorem 1, as the optimal extreme point is non-degenerate, the regret incurred in each correct period is $O(\log T)$. Therefore, in summary, the regret of the CB-JSQ policy is $O(\log^5 T \log\log T)$.
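For completeness, the accounting in the last paragraph can be collected into a single estimate (constants suppressed):

```latex
R(T) \;\le\; \underbrace{O(\log^3 T)\cdot O(\log^2 T\,\log\log T)}_{\text{incorrect periods}}
  \;+\; \underbrace{O(\log^3 T)\cdot O(\log T)}_{\text{discarding at switches}}
  \;+\; \underbrace{O(\log^3 T)\cdot O(\log T)}_{\text{correct periods}}
  \;=\; O(\log^5 T\,\log\log T).
```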