Vector quantile regression and optimal transport, from theory to numerics

In this paper, we first revisit the Koenker and Bassett variational approach to (univariate) quantile regression, emphasizing its link with latent factor representations and correlation maximization problems. We then review the multivariate extension due to Carlier et al. (2016, 2017) which relates vector quantile regression to an optimal transport problem with mean independence constraints. We introduce an entropic regularization of this problem, implement a gradient descent numerical method and illustrate its feasibility on univariate and bivariate examples.


Introduction
Quantile regression, introduced by Koenker and Bassett (1978), has become a very popular tool for analyzing the response of the whole distribution of a dependent variable to a set of predictors. It is a far-reaching generalization of median regression, allowing for the prediction of any quantile of the distribution. We briefly recall classical quantile regression.
For t ∈ (0, 1), it is well known that the t-quantile q_t(x) of Y given X = x minimizes over q the loss function E[ε^+ + (t − 1)ε | X = x], where ε := Y − q. As a result, if q_t(x) is specified under the parametric form q_t(x) = β_t^⊤x + α_t, it is natural to estimate α_t and β_t by minimizing the loss

E[(Y − β_t^⊤X − α_t)^+ + (t − 1)(Y − β_t^⊤X − α_t)].

While the previous optimization problem estimates α_t and β_t for pointwise values of t, if one would like to estimate the whole curve t ↦ (α_t, β_t), one simply should construct the loss function by integrating the previous loss functions over t ∈ [0, 1]; the curve t ↦ (α_t, β_t) thus minimizes

min_{(α,β)} ∫_0^1 E[(Y − β_t^⊤X − α_t)^+ + (t − 1)(Y − β_t^⊤X − α_t)] dt.

As has been known since the original work by Koenker and Bassett, this problem has an (infinite-dimensional) linear programming formulation. Defining P_t = (Y − β_t^⊤X − α_t)^+ as the positive deviations of Y with respect to the predicted quantiles β_t^⊤X + α_t, and N_t as the corresponding negative deviations, we have Y − β_t^⊤X − α_t = P_t − N_t with P_t, N_t ≥ 0; hence we arrive at the primal formulation²

max { ∫_0^1 E(V_t Y) dt : 0 ≤ V_t ≤ 1, E(V_t) = 1 − t, E(V_t X) = (1 − t) E(X) }.

If V_t and (α_t, β_t) are solutions to the above primal and dual programs, complementary slackness yields 1_{{Y > β_t^⊤X + α_t}} ≤ V_t ≤ 1_{{Y ≥ β_t^⊤X + α_t}}; hence, if (X, Y) has a continuous distribution, then for any (α, β), P(Y − β^⊤X − α = 0) = 0, and therefore one has almost surely V_t = 1_{{Y > β_t^⊤X + α_t}}.
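To make the t-by-t linear programming formulation concrete, here is a minimal numerical sketch (the function name and the synthetic data are ours, not the paper's): it solves the Koenker-Bassett LP above with scipy for a few probability levels.

```python
import numpy as np
from scipy.optimize import linprog

def qr_linprog(X, Y, t):
    """Koenker-Bassett t-quantile regression as a linear program.

    Decision variables are (alpha, beta, P, M) with
      Y_i = alpha + beta'X_i + P_i - M_i,   P_i, M_i >= 0,
    and we minimize the check-function loss E[t*P + (1-t)*M].
    """
    n, k = X.shape
    c = np.concatenate([np.zeros(1 + k), t * np.ones(n) / n, (1 - t) * np.ones(n) / n])
    A_eq = np.hstack([np.ones((n, 1)), X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * (1 + k) + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=Y, bounds=bounds, method="highs")
    alpha, beta = res.x[0], res.x[1:1 + k]
    return alpha, beta

# Example: heteroskedastic data, so the quantile slopes vary with t.
rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(300, 1))
Y = 1.0 + 2.0 * X[:, 0] + (0.2 + 0.5 * X[:, 0]) * rng.normal(size=300)
for t in (0.1, 0.5, 0.9):
    a, b = qr_linprog(X, Y, t)
    print(f"t={t}: alpha={a:.2f}, beta={b[0]:.2f}")
```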
Koenker and Ng (2005) impose a monotonicity constraint on the estimated quantile curves. Indeed, if β_t^⊤x + α_t is the t-quantile of the conditional distribution of Y given X = x, the curve t ↦ β_t^⊤x + α_t should be nondecreasing. Hence, these authors impose a natural constraint on the dual, that is β_{t'}^⊤X + α_{t'} ≥ β_t^⊤X + α_t for t' ≥ t, and they incorporate this constraint in the estimation procedure.

² It may seem awkward to start with the "dual" formulation before giving out the "primal" one, and the "primal" being the dual to the "dual," this choice of labeling is pretty arbitrary. However, our choice is motivated by consistency with optimal transport theory, introduced below.
Therefore, in that case, V_t should be nonincreasing in t, which allows us to impose a monotonicity constraint on the primal variable V_t instead of a monotonicity constraint on the dual variables β_t and α_t. This is precisely the problem we look at. Consider

max { ∫_0^1 E(V_t Y) dt : 0 ≤ V_t ≤ 1, E(V_t) = 1 − t, E(V_t X) = (1 − t) E(X), t ↦ V_t nonincreasing }. (1.3)

Let us now take a look at a sample version of this problem. Here, we observe a sample (X_i, Y_i) for i ∈ {1, ..., N}. We shall discretize the probability space [0, 1] into T points t_1 < ... < t_T. The sample analog of (1.3) is then a finite-dimensional linear program over the T × N matrix V. Denoting by t the 1 × T row vector with entries t_τ, and by D the T × T matrix with ones on the main diagonal, −1 on the diagonal just below the main diagonal, and 0 elsewhere, the monotonicity constraint on V can be written as a sign constraint on the finite differences encoded by D. Setting π = D V/N, the problem is turned into a discrete optimal transport problem with an additional mean-independence constraint, which we label (1.4). Note that (1.4) is a direct extension of the Monge-Kantorovich problem of optimal transport; in fact, it boils down to it when the last constraint is absent. This should not be surprising, given the connection between optimal transport and quantiles recalled below. In the present paper, we introduce the Regularized Vector Quantile Regression (RVQR) problem, which consists of adding an entropic regularization term to the expression (1.4): for a given data distribution ν, this amounts to subtracting ε Σ π ln π, with ε > 0, from the objective of (1.4), which yields problem (1.5). Due to smoothness and regularity, the regularized problem (1.5) enjoys computational and analytical properties that are missing from the original problem (1.4). In particular, the dual to (1.5) is a smooth, unconstrained problem that can be solved by computational methods. While here, unlike in the context of standard optimal transport, the Kullback-Leibler divergence projection onto the mean-independence constraint is not in closed form, we can use Nesterov's gradient descent acceleration, which gives optimal convergence rates for first-order methods.
The present paper in part provides a survey of previous results, and in part conveys new results. In the vein of the previous papers on the topic by Carlier et al. (2016, 2017), this paper seeks to apply the optimal transport toolbox to quantile regression. In contrast with these papers, a particular focus of the present paper is to propose a regularized version of the problem as well as new computational methods. The two main new contributions of the paper are (1) a connection with shape-constrained classical quantile regression (section 4), and (2) the introduction of the regularized vector quantile regression problem (RVQR) along with a duality theorem for that problem (section 6).
The paper is organized as follows. Section 2 offers reminders on the notion of quantile; section 3 reviews the previous results of Carlier et al. (2016, 2017) on the "specified" case; section 4 offers a new result (theorem 4.1) on the comparison with shape-constrained classical quantile regression; and section 5 reviews results on the multivariate case.
Section 6 introduces RVQR and presents results relevant for that problem, in particular a duality result (theorem 6.1).

Several characterizations of quantiles
Throughout the paper, (Ω, F, P) will be some fixed nonatomic probability space³. Given a random vector Z with values in R^k defined on this space, we will denote by L(Z) the law of Z; given a probability measure θ on R^k, we shall often write Z ∼ θ to express that L(Z) = θ. Independence of two random variables Z₁ and Z₂ will be denoted Z₁ ⊥⊥ Z₂.
³ One way to define the nonatomicity of (Ω, F, P) is by the existence of a uniformly distributed random variable on this space; this somehow ensures that the space is rich enough so that there exist random variables with prescribed law. If, on the contrary, the space is finite for instance, only finitely supported probability measures can be realized as the law of such random variables.

2.1. Quantiles. Let Y be some univariate random variable defined on (Ω, F, P). Denoting by F_Y the distribution function of Y, F_Y(y) := P(Y ≤ y), the quantile function of Y is the generalized inverse of F_Y given by the formula

Q_Y(t) := inf{y ∈ R : F_Y(y) > t} for all t ∈ (0, 1). (2.1)

Let us now recall two well-known facts about quantiles:
• for fixed t ∈ (0, 1), Q_Y(t) solves the minimization problem min_{q∈R} E((Y − q)^+ + (t − 1)(Y − q)); (2.2)
• Y = Q_Y(U) almost surely for some uniformly distributed U, and Q_Y is characterized as the nondecreasing map in the maximal correlation problem max{E(V Y) : V ∼ µ}, (2.3)
where µ denotes the uniform (Lebesgue) measure on [0, 1]. The latter (polar factorization) result goes back to Ryff (1970), and the extension to the multivariate case (by optimal transport) is due to Brenier (1991).
We therefore see that there are basically two different approaches to study or estimate quantiles:
• the local or "t by t" approach, which consists, for a fixed probability level t, in using directly formula (2.1) or the minimization problem (2.2) (or some approximation of it); this can be done very efficiently in practice, but has the disadvantage of forgetting the fundamental global property of the quantile function: it should be monotone in t;
• the global approach (or polar factorization approach), where quantiles of Y are defined as all nondecreasing functions Q for which one can write Y = Q(U) with U uniformly distributed; in this approach, one rather tries to recover directly the whole monotone function Q (or the uniform variable U that is maximally correlated to Y), and therefore one should rather use the optimal transport problem (2.3).
The two approaches are contrasted numerically in the sketch below.
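Here is a minimal numerical contrast of the two approaches on a synthetic sample (our own illustration, not from the paper): the local route evaluates one quantile level, while sorting recovers the whole monotone map Q together with the uniform variable U of the polar factorization.

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.exponential(size=1000)

# Local ("t by t") approach: evaluate the generalized inverse at one level t.
t = 0.25
q_local = np.quantile(Y, t)

# Global (polar factorization) approach: recover the whole monotone map Q
# and a uniform U with Y = Q(U). On a sample, sorting does exactly this:
order = np.argsort(Y)
U = np.empty(len(Y))
U[order] = (np.arange(len(Y)) + 0.5) / len(Y)   # ranks, approx. uniform on (0,1)
Q = np.sort(Y)                                   # empirical quantile function on a grid
assert np.allclose(Y, Q[np.floor(U * len(Y)).astype(int)])
print(q_local, Q[int(t * len(Y))])               # both give (approximately) the same number
```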

2.2. Conditional quantiles.
Let us assume now that, in addition to the random variable Y, we are also given a random vector X ∈ R^N, which we may think of as being a list of explanatory variables for Y. We are primarily interested in the dependence between Y and X, and in particular in the conditional quantiles of Y given X = x. Let us denote by ν the joint law of (X, Y), by m the law of X, and by ν(·|x) the conditional law of Y given X = x, which in particular yields dν(x, y) = dν(y|x) dm(x). (2.4)
We then denote by F(x, y) = F_{Y|X=x}(y) the conditional cdf, F(x, y) := P(Y ≤ y | X = x), and by Q(x, t) the conditional quantile, Q(x, t) := inf{y : F(x, y) > t}. For the sake of simplicity, we shall assume that for m = L(X)-almost every x ∈ R^N (m-a.e. x for short), one has

t ↦ Q(x, t) is continuous and increasing, (2.5)

so that for m-a.e. x, F(x, Q(x, t)) = t for every t ∈ (0, 1) and Q(x, F(x, y)) = y for every y in the support of ν(·|x).
Let us now define the random variable U := F(X, Y); then, by construction, for m-a.e. x and every t ∈ (0, 1), P(U ≤ t | X = x) = P(Y ≤ Q(x, t) | X = x) = F(x, Q(x, t)) = t. We deduce that U is uniformly distributed and independent from X (since its conditional cdf does not depend on x). Moreover, by (2.5), one has the representation Y = Q(X, U), in which U can naturally be interpreted as a latent factor.
This easy remark leads to a conditional polar factorization of Y through the pointwise relation Y = Q(X, U), with Q(X, ·) nondecreasing, U ∼ µ and U ⊥⊥ X. We would like to emphasize now that there is a variational principle behind this conditional decomposition.
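For a concrete check of this factorization, here is a small simulation (the location-scale model and all names are ours): U := F(X, Y) is computed from the known conditional cdf, and is indeed uniform, independent of X, with Y = Q(X, U).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 100_000
X = rng.normal(size=n)
m, s = 1.0 + 2.0 * X, 0.5 + 0.3 * np.abs(X)      # conditional mean and scale
Y = m + s * rng.normal(size=n)

# Conditional polar factorization: U := F(X, Y) = F_{Y|X}(Y)
U = norm.cdf((Y - m) / s)

# U is uniform and independent of X; Y = Q(X, U) with Q(x, .) nondecreasing
print(np.quantile(U, [0.25, 0.5, 0.75]))   # close to [0.25, 0.5, 0.75]
print(np.corrcoef(X, U)[0, 1])             # close to 0
Q = lambda x, u: (1 + 2 * x) + (0.5 + 0.3 * np.abs(x)) * norm.ppf(u)
print(np.allclose(Y, Q(X, U)))             # True up to floating error
```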
Let us now consider the variant of the optimal transport problem (2.3) where one further requires U to be independent from the vector of regressors X:

max { E(U Y) : U ∼ µ, U ⊥⊥ X }. (2.7)

Proposition 2.1. If Y is integrable and (2.5) holds, then U = F(X, Y) solves (2.7).

Proof. Let V be admissible for (2.7). Let us define, for x ∈ spt(m) and t ∈ [0, 1], ϕ(x, t) := ∫_0^t Q(x, s) ds. We first claim that ϕ(X, U) is integrable; indeed, we obviously have |ϕ(x, t)| ≤ ∫_0^1 |Q(x, s)| ds = E(|Y| | X = x), where we have used the fact that the image of ν(·|x) by F(x, ·) is the uniform measure µ. Since Q(x, ·) is nondecreasing, t ↦ ϕ(x, t) is convex with derivative Q(x, t), so that for every t, v in [0, 1], Q(x, t) v ≤ Q(x, t) t − ϕ(x, t) + ϕ(x, v). Taking (x, t, v) = (X, U, V), using Y = Q(X, U) and taking expectations, we get E(V Y) ≤ E(U Y) − E(ϕ(X, U)) + E(ϕ(X, V)); but since U and V are both uniformly distributed and independent from X, E(ϕ(X, U)) = E(ϕ(X, V)), so that E(V Y) ≤ E(U Y), which proves the claim.

3. Specified and quasi-specified quantile regression

3.1. Specified quantile regression. Since the seminal work of Koenker and Bassett (1978), it has been widely accepted that a convenient way to estimate conditional quantiles is to stipulate an affine form with respect to x for the conditional quantile. Since a quantile function should be monotone in its second argument, this leads to the following definition.

Definition 3.1. Let (α, β) be continuous on [0, 1] and such that, for m-a.e. x,

t ↦ α(t) + β(t)^⊤x is increasing on [0, 1]. (3.1)

If, in addition, for m-a.e. x and every t ∈ (0, 1),

Q(x, t) = α(t) + β(t)^⊤x (3.2)

holds, quantile regression is specified with regression coefficients (α, β).

Specification of quantile regression can be characterized by the validity of an affine-in-X representation of Y with a latent factor:

Proposition 3.2. Let (α, β) be continuous and satisfy (3.1). Quantile regression is specified with regression coefficients (α, β) if and only if there exists U such that

U ∼ µ, U ⊥⊥ X and Y = α(U) + β(U)^⊤X. (3.3)

Proof. The fact that specification of quantile regression implies decomposition (3.3) has already been explained in paragraph 2.2. Let us assume (3.3) and compute, for m-a.e. x, the conditional law of Y given X = x: it is the image of µ by the continuous increasing map t ↦ α(t) + β(t)^⊤x, so that this map is exactly the conditional quantile, i.e. (3.2) holds.

3.2. Quasi-specified quantile regression. Let us now assume that both X and Y are integrable, and normalize, without loss of generality, X in such a way that E(X) = 0. Koenker and Bassett showed that, for a fixed probability level t, the regression coefficients (α, β) can be estimated by quantile regression, i.e. the minimization problem

inf_{(α,β)∈R×R^N} E(ρ_t(Y − α − β^⊤X)) (3.6)

where the penalty ρ_t is given by ρ_t(z) := t z^+ + (1 − t) z^−, with z^+ and z^− denoting the positive and negative parts of z. For further use, note that (3.6) can conveniently be rewritten as

inf_{(α,β)} E((Y − α − β^⊤X)^+ + (t − 1)(Y − α − β^⊤X)). (3.7)

As noticed by Koenker and Bassett, this convex program admits as dual formulation

max { E(V_t Y) : 0 ≤ V_t ≤ 1, E(V_t) = 1 − t, E(V_t X) = 0 }. (3.8)

An optimal (α, β) for (3.7) and an optimal V_t in (3.8) are related by the complementary slackness condition:

1_{{Y > α + β^⊤X}} ≤ V_t ≤ 1_{{Y ≥ α + β^⊤X}}. (3.9)

Note that α appears naturally as a Lagrange multiplier associated to the constraint E(V_t) = 1 − t, and β as a Lagrange multiplier associated to E(V_t X) = 0.
To avoid mixing, i.e. the possibility that V_t takes values in (0, 1), it will be convenient to assume that ν = L(X, Y) gives zero mass to nonvertical hyperplanes, i.e.

ν({(x, y) : y = α + β^⊤x}) = 0 for every (α, β) ∈ R × R^N. (3.10)

We shall also consider a nondegeneracy condition on the (centered) random vector X which says that its law is not supported by any hyperplane:

E((β^⊤X)²) > 0 for every β ∈ R^N \ {0}. (3.11)

Thanks to (3.10), we may simply write

V_t = 1_{{Y > α + β^⊤X}}, (3.12)

and thus the constraints of (3.8) become

P(Y > α + β^⊤X) = 1 − t and E(X 1_{{Y > α + β^⊤X}}) = 0, (3.13)

which simply are the first-order conditions for (3.7).
Any pair (α, β) which solves the optimality conditions (3.13) for the Koenker and Bassett approach will be denoted (α^{QR}(t), β^{QR}(t)), and the variable V_t solving (3.8), given by (3.12), will similarly be denoted V_t^{QR} := 1_{{Y > α^{QR}(t) + β^{QR}(t)^⊤X}}. Assume that (α^{QR}, β^{QR}) is continuous and that quantile regression is quasi-specified, in the sense that

U^{QR} := ∫_0^1 V_t^{QR} dt (3.14)

is uniformly distributed; then:

Y = α^{QR}(U^{QR}) + β^{QR}(U^{QR})^⊤X almost surely. (3.15)

Moreover, U^{QR} solves the correlation maximization problem with a mean-independence constraint:

max { E(V Y) : V ∼ µ, E(X | V) = 0 }. (3.16)

Proof. Obviously 0 ≤ U^{QR} ≤ 1. As already observed, U^{QR} > t implies that Y ≥ α^{QR}(t) + β^{QR}(t)^⊤X; applying this with t = U^{QR} − δ for δ > 0, letting δ → 0^+ and using the continuity of (α^{QR}, β^{QR}), we get Y ≥ α^{QR}(U^{QR}) + β^{QR}(U^{QR})^⊤X. The converse inequality is obtained similarly by remarking that U^{QR} < t implies that Y ≤ α^{QR}(t) + β^{QR}(t)^⊤X.
Let us now prove that U^{QR} solves (3.16). Take V uniformly distributed, such that X is mean-independent from V, and set V_t := 1_{{V > t}}; then each V_t is admissible for (3.8), so that E(V_t Y) ≤ E(V_t^{QR} Y), and integrating over t ∈ [0, 1] gives E(V Y) ≤ E(U^{QR} Y), as desired.

Let us finally turn to a uniqueness statement. Assume that:
• both U and Ũ are uniformly distributed,
• X is mean-independent from U and Ũ: E(X|U) = E(X|Ũ) = 0,
• α, β, α̃, β̃ are continuous on [0, 1],
• (α, β) and (α̃, β̃) satisfy the monotonicity condition (3.1),
• Y = α(U) + β(U)^⊤X = α̃(Ũ) + β̃(Ũ)^⊤X;
then U = Ũ almost surely, α = α̃, and β(U)^⊤X = β̃(U)^⊤X almost surely.

Let us also define, for (x, y) in R^{N+1},

ϕ(x, y) := max_{t∈[0,1]} { t y − A(t) − B(t)^⊤x }, where A(t) := ∫_0^t α(s) ds and B(t) := ∫_0^t β(s) ds;

thanks to the monotonicity condition (3.1), the maximization program above is strictly concave in t for every y and m-a.e. x. We then remark that Y = α(U) + β(U)^⊤X is exactly the first-order condition for the above maximization problem when (x, y) = (X, Y). In other words, we have

t y − A(t) − B(t)^⊤x ≤ ϕ(x, y), (3.17)

with equality for (x, y, t) = (X, Y, U). Using the fact that L(U) = L(Ũ) and the fact that mean-independence gives E(B(Ũ)^⊤X) = E(B(U)^⊤X) = 0, taking expectations in (3.17) at t = Ũ yields E(Ũ Y) ≤ E(U Y); but reversing the roles of U and Ũ, we also have E(U Y) ≤ E(Ũ Y), so that, thanks to inequality (3.17) holding with equality almost surely at t = Ũ and to the strict concavity in t, we get U = Ũ almost surely. Hence Y = α(U) + β(U)^⊤X = α̃(U) + β̃(U)^⊤X; taking the conditional expectation with respect to U on both sides, we then obtain α = α̃, and thus β(U)^⊤X = β̃(U)^⊤X almost surely.

To sum up, we have shown that quasi-specification is equivalent to the validity of the factor linear model Y = α(U) + β(U)^⊤X, for (α, β) continuous and satisfying the monotonicity condition (3.1), and U uniformly distributed and such that X is mean-independent from U. This has to be compared with the decomposition of paragraph 2.2, where U is required to be independent from X, but the dependence of Y with respect to U, given X, is given by a nondecreasing function of U which is not necessarily affine in X.
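To make the construction of U^{QR} concrete, here is a sketch (ours) that reuses qr_linprog and the synthetic data from the earlier example: it estimates (α_t, β_t) on a grid of levels t and averages the indicators 1{Y > α_t + β_t^⊤X}, which approximates the integral defining U^{QR} in (3.14). Under quasi-specification, this variable should be approximately uniform, with X approximately mean-independent from it.

```python
# Reusing X, Y and qr_linprog from the quantile-regression sketch above.
ts = np.linspace(0.02, 0.98, 49)
V = np.empty((len(ts), len(Y)))
for k, t in enumerate(ts):
    a, beta = qr_linprog(X, Y, t)
    V[k] = (Y > a + X @ beta).astype(float)
U_qr = V.mean(axis=0)              # approximates U^QR = int_0^1 V_t^QR dt
# Diagnostics: U_qr approximately uniform, roughly uncorrelated with X
print(np.quantile(U_qr, [0.25, 0.5, 0.75]))   # close to (0.25, 0.5, 0.75)
print(np.corrcoef(U_qr, X[:, 0])[0, 1])       # close to 0
```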

Quantile regression without specification
Now we wish to address quantile regression in the case where neither specification nor quasi-specification can be taken for granted.In such a general situation, keeping in mind the remarks from the previous paragraphs, we can think of two natural approaches.
The first one consists in studying directly the correlation maximization with a mean-independence constraint (3.16). The second one consists in getting back to the Koenker and Bassett t-by-t problem (3.8), but adding as an additional global consistency constraint that V_t should be nonincreasing with respect to t (which we abbreviate as V_t ↓):

max { ∫_0^1 E(V_t Y) dt : 0 ≤ V_t ≤ 1, E(V_t) = 1 − t, E(V_t X) = 0, V_t ↓ }. (4.1)

Our aim is to compare these two approaches (and in particular to show that the maximization problems (3.16) and (4.1) have the same value), as well as their dual formulations.
Proof. Let U be admissible for (3.16) and define V_t := 1_{{U > t}}; then U = ∫_0^1 V_t dt and obviously (V_t)_t is admissible for (4.5), so we have sup(3.16) ≤ sup(4.5).⁸ Take now (V_t)_t admissible for (4.5) and let V := ∫_0^1 V_t dt; we then have V_t = 1_{{V > t}} almost surely, so that E(X 1_{{V > t}}) = 0, which implies that X is mean-independent from V, and thus that V is admissible for (3.16). We conclude that sup(3.16) = sup(4.5).

⁸ With a little abuse of notation, when a reference number (A) refers to a maximization (resp. minimization) problem, we will simply write sup(A) (resp. inf(A)) to denote the value of this optimization problem.

Let us now define the dual problem to (4.1), which we denote (4.3). Let (V_t)_t be admissible for (4.1) and set V := ∫_0^1 V_t dt. It follows from Lemma 4.3 below that, for q ∈ L¹(0, 1), defining Q(t) := ∫_0^t q(s) ds, one has ∫_0^1 q(t) V_t dt ≤ max_{s∈[0,1]} Q(s). So, setting ϕ(t) := ∫_0^t α(s) ds and b(t) := ∫_0^t β(s) ds, and remarking that an integration by parts immediately gives sup(4.1) ≤ inf(4.3); but we know from (4.4) that inf(4.3) = sup(3.16), which ends the proof.
In the previous proof, we have used the following elementary result (proven in the appendix).

Lemma 4.3. Let q ∈ L¹(0, 1) and define Q(t) := ∫_0^t q(s) ds for every t ∈ [0, 1]; one has

sup { ∫_0^1 q(t) v(t) dt : v : [0, 1] → [0, 1] nonincreasing } = max_{s∈[0,1]} Q(s).

Vector quantiles, vector quantile regression and optimal transport
We now consider the case where Y is a random vector with values in R d with d ≥ 2.
The notion of quantile does not have an obvious generalization to the multivariate setting; however, the various correlation maximization problems we have encountered in the previous sections still make sense in dimension d (provided Y is integrable, say) and are related to optimal transport theory. The aim of this section is to briefly summarize the optimal transport approach to quantile regression introduced in Carlier et al. (2016) and further analyzed in their follow-up 2017 paper.

5.1. Brenier's map as a vector quantile. From now on, we fix as a reference measure the uniform measure on the unit cube [0, 1]^d, i.e. µ_d := uniform([0, 1]^d).
Given Y, an integrable R^d-valued random variable on (Ω, F, P), a remarkable theorem due to Brenier (1991) and extended by McCann (1995) implies that there exist a unique U ∼ µ_d and a unique (up to the addition of a constant) convex function ϕ defined on [0, 1]^d such that

Y = ∇ϕ(U) almost surely. (5.2)

The map ∇ϕ is called the Brenier's map between µ_d and L(Y).
The convex function ϕ is not necessarily differentiable, but being convex it is differentiable at Lebesgue-a.e. point of [0, 1]^d, so that ∇ϕ(U) is well defined almost surely. It is worth recalling at this point that the Legendre transform of ϕ is the convex function ϕ*(y) := sup_{u∈[0,1]^d} { u^⊤y − ϕ(u) }, and that y ∈ ∂ϕ(u) if and only if u ∈ ∂ϕ*(y), which is often referred to in convex analysis as the Fenchel reciprocity formula. Note then that (5.2) implies that U ∈ ∂ϕ*(Y) almost surely. If both ϕ and ϕ* are differentiable, their subgradients reduce to the singleton formed by their gradient, and the Fenchel reciprocity formula simply gives (∇ϕ)^{-1} = ∇ϕ*. Recalling that the subgradient of the convex function ϕ is monotone, in the sense that whenever y₁ ∈ ∂ϕ(u₁) and y₂ ∈ ∂ϕ(u₂) one has (y₁ − y₂)^⊤(u₁ − u₂) ≥ 0, we see that gradients of convex functions are a generalization to the multivariate case of monotone univariate maps. It is therefore natural, in view of (5.2), to define the vector quantile of Y as:

Definition 5.1. The vector quantile of Y is the Brenier's map between µ_d and L(Y).

Now, it is worth noting that the Brenier's map (and the uniformly distributed random vector U in (5.2)) are not abstract objects; they have a variational characterization related to optimal transport. Consider indeed

max { E(U^⊤Y) : U ∼ µ_d } (5.4)

and its dual

inf { ∫_{[0,1]^d} f dµ_d + E(g(Y)) : f(u) + g(y) ≥ u^⊤y for all (u, y) }; (5.5)

then U in (5.2) is the unique solution of (5.4), and any solution (f, g) of the dual (5.5) satisfies ∇f = ∇ϕ µ_d-a.e.
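As a concrete illustration of (5.4): when both measures are discretized uniformly with the same number of atoms, the Monge-Kantorovich problem becomes an optimal assignment, which off-the-shelf solvers handle directly. The construction below (grid, toy target sample, and all names) is our own sketch, not the authors' code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(3)
n_side, d = 32, 2
# Uniform grid on [0,1]^2 as the reference measure mu_d
g = (np.arange(n_side) + 0.5) / n_side
U = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, d)
# Target sample: a banana-shaped cloud (stand-in for L(Y))
Z = rng.normal(size=(U.shape[0], d))
Y = np.c_[Z[:, 0], 0.5 * Z[:, 0] ** 2 + 0.5 * Z[:, 1]]

# Maximize sum_i <u_i, y_sigma(i)> over permutations sigma;
# linear_sum_assignment minimizes cost, so we negate the correlation matrix.
C = -U @ Y.T
row, col = linear_sum_assignment(C)
# The discrete Brenier / vector-quantile map sends u_i to Y[col[i]]
vector_quantile = Y[col]
print(vector_quantile[:3])
```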

5.2. Conditional vector quantiles.
Assume now, as in paragraph 2.2, that we are also given a random vector X ∈ R^N. As in (2.4), we denote by ν the law of (X, Y), by m the law of X, and by ν(·|x) the conditional law of Y given X = x (the only difference with (2.4) is that Y is R^d-valued). Conditional vector quantiles are then defined as follows.

Definition 5.2. For m = L(X)-a.e. x ∈ R^N, the conditional vector quantile of Y given X = x is the Brenier's map between µ_d and ν(·|x). We denote this well-defined map by ∇ϕ_x, where ϕ_x is a convex function on [0, 1]^d.
If both ϕ_x and its Legendre transform are differentiable¹¹, one can define the random vector U := ∇ϕ_X^*(Y), which is equivalent to

Y = ∇ϕ_X(U). (5.6)

One can check, exactly as in the proof of Proposition 2.1 for the univariate case, that if Y is integrable then U ∼ µ_d and U ⊥⊥ X.

5.3. Vector quantile regression. When one assumes that the convex function ϕ_x is affine with respect to the explanatory variables x (specification), i.e. ϕ_x(u) = ϕ(u) + b(u)^⊤x with ϕ : [0, 1]^d → R and b : [0, 1]^d → R^N smooth, the conditional quantile is itself affine in x and the relation (5.6) takes the form

Y = ∇ϕ(U) + Db(U)^⊤X. (5.8)

This affine form moreover implies that U maximizes the correlation with Y not only among uniformly distributed random vectors independent from X, but in the larger class of uniformly distributed random vectors from which X is mean-independent, E(X|U) = 0:

max { E(U^⊤Y) : U ∼ µ_d, E(X|U) = 0 }. (5.9)

¹¹ A deep regularity theory initiated by Caffarelli (1992) in the 1990's gives conditions on ν(·|x) under which this is in fact the case, i.e. the optimal transport map is smooth and/or invertible; we refer the interested reader to the textbook of Figalli (2017) for a detailed and recent account of this regularity theory.
This is the reason why the study of (5.9) is the main tool in the approach of Carlier et al. (2016, 2017) to vector quantile regression.
Let us now briefly summarize the main findings in these two papers. First observe that (5.9) can be recast as a linear program by setting π := L(U, X, Y) and observing that U solves (5.9) if and only if π solves

max { ∫ u^⊤y dπ(u, x, y) : π ∈ MI(ν, µ_d) } (5.10)

where MI(ν, µ_d) is the set of probability measures π on [0, 1]^d × R^N × R^d which satisfy the linear constraints:
• the first marginal of π is µ_d, i.e., for every ϕ ∈ C([0, 1]^d, R): ∫ ϕ(u) dπ(u, x, y) = ∫_{[0,1]^d} ϕ dµ_d;
• the second marginal of π is ν, i.e., for every ψ ∈ C(R^N × R^d, R): ∫ ψ(x, y) dπ(u, x, y) = ∫ ψ dν;
• the conditional expectation of x given u is 0, i.e., for every b ∈ C([0, 1]^d, R^N): ∫_{[0,1]^d×R^N×R^d} b(u)^⊤x dπ(u, x, y) = 0.

The dual of the linear program (5.9) then reads

inf_{(ϕ,ψ,b)} ∫_{[0,1]^d} ϕ dµ_d + ∫_{R^N×R^d} ψ(x, y) dν(x, y) (5.11)

subject to the pointwise constraint ϕ(u) + b(u)^⊤x + ψ(x, y) ≥ u^⊤y; given b and ϕ, the lowest ψ fitting this constraint is the (convex in y) function ψ(x, y) := sup_{u∈[0,1]^d} { u^⊤y − ϕ(u) − b(u)^⊤x }. The existence of a solution (ψ, ϕ, b) to (5.11) is established in Carlier et al. (2016) (under some assumptions on ν), and optimality for U in (5.9) is characterized by the pointwise complementary slackness condition ϕ(U) + b(U)^⊤X + ψ(X, Y) = U^⊤Y almost surely. If ϕ and b were smooth, we could deduce from the latter that Y = ∇ϕ(U) + Db(U)^⊤X = ∇ϕ_X(U), for ϕ_x(u) := ϕ(u) + b(u)^⊤x, which is exactly (5.8). So specification of vector quantile regression is essentially the same as assuming this smoothness and the convexity of u ↦ ϕ_x(u) := ϕ(u) + b(u)^⊤x. In general, these properties cannot be taken for granted, and what can be deduced from complementary slackness is given by the weaker relations ϕ_X(U) = ϕ**_X(U), Y ∈ ∂ϕ**_X(U) almost surely, where ϕ**_x is the convex envelope of ϕ_x (i.e. the largest convex function below ϕ_x); we refer the reader to Carlier et al. (2017) for details.

6. Discretization, regularization, numerical minimization

6.1. Discrete optimal transport with a mean-independence constraint. We now turn to a discrete setting for implementation purposes, and consider data (X_j, Y_j)_{j=1,...,J} distributed according to the empirical measure ν = Σ_{j=1}^J ν_j δ_{(x_j,y_j)}, and a [0, 1]^d-uniform sample (U_i)_{i=1,...,I} with empirical measure µ = Σ_{i=1}^I µ_i δ_{u_i}. In this setting, the vector quantile regression primal (5.10) writes

max_{π≥0} Σ_{i,j} π_{ij} u_i^⊤y_j

subject to the marginal constraints Σ_i π_{ij} = ν_j for all j and Σ_j π_{ij} = µ_i for all i, and the mean-independence constraint between X and U: Σ_j x_j π_{ij} = 0 for all i. Its dual formulation (5.11) reads

inf_{(φ,ψ,b)} Σ_i µ_i φ_i + Σ_j ν_j ψ_j subject to φ_i + b_i^⊤x_j + ψ_j ≥ u_i^⊤y_j for all i, j.

6.2. The Regularized Vector Quantile Regression (RVQR) problem. Using the optimality condition φ_i = max_j { u_i^⊤y_j − b_i^⊤x_j − ψ_j }, we obtain the unconstrained formulation

inf_{(ψ,b)} Σ_i µ_i max_j { u_i^⊤y_j − b_i^⊤x_j − ψ_j } + Σ_j ν_j ψ_j.

Replacing the maximum with its smoothed version¹³, given a small regularization parameter ε > 0, yields the smooth convex minimization problem (see Cuturi and Peyré (2016) for more details in connection with entropic regularization of optimal transport), which we call the Regularized Vector Quantile Regression (RVQR) problem

inf_{(ψ,b)} J(ψ, b) := Σ_i µ_i Softmax_ε((u_i^⊤y_j − b_i^⊤x_j − ψ_j)_{j=1,...,J}) + Σ_j ν_j ψ_j. (6.1)

We then have the following duality result¹⁴:

Theorem 6.1. The RVQR problem

max_{π≥0} Σ_{i,j} π_{ij} u_i^⊤y_j − ε Σ_{i,j} π_{ij} ln π_{ij},

subject to the same marginal and mean-independence constraints as above, has dual (6.1); equivalently, the values of the two problems coincide.

¹³ Recall that the softmax with regularization parameter ε > 0 of (α_1, ..., α_J) is given by Softmax_ε(α_1, ..., α_J) := ε log(Σ_{j=1}^J e^{α_j/ε}).
¹⁴ Which can be proved either by using the Fenchel-Rockafellar duality theorem or by hand. Indeed, in the primal, there are only finitely many linear constraints, and the nonnegativity constraints are not binding because of the entropy. The existence of Lagrange multipliers for the equality constraints is then straightforward.
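To fix ideas, here is a small numerical sketch (our own, not the authors' code) of the objective J in (6.1) together with its gradient, made explicit in (6.2)-(6.3) below; the function name and the log-sum-exp stabilization are ours, and the regressors are assumed centered so that the mean-independence constraint is feasible.

```python
import numpy as np

def rvqr_objective_grad(psi, b, U, X, Y, mu, nu, eps):
    """Value and gradient of the RVQR dual objective J of (6.1).

    U: (I, d) quantile-level grid, X: (J, N) centered regressors,
    Y: (J, d) outcomes, mu: (I,), nu: (J,) probability weights, eps > 0.
    """
    A = U @ Y.T - b @ X.T - psi[None, :]          # A[i, j] = u_i'y_j - b_i'x_j - psi_j
    Amax = A.max(axis=1, keepdims=True)           # stabilized log-sum-exp
    S = np.exp((A - Amax) / eps)
    softmax = Amax[:, 0] + eps * np.log(S.sum(axis=1))
    J = mu @ softmax + nu @ psi
    pi = mu[:, None] * S / S.sum(axis=1, keepdims=True)   # pi^eps; row i sums to mu_i
    grad_psi = nu - pi.sum(axis=0)                # matches (6.2)
    grad_b = -pi @ X                              # matches (6.3)
    return J, grad_psi, grad_b, pi
```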
Note that the objective J in (6.1) remains invariant under the two transformations:
• b_i ← b_i + c and ψ_j ← ψ_j − c^⊤x_j for all i, j, where c ∈ R^N is a constant vector (assuming, as in section 3, that X is normalized to be centered);
• ψ ← ψ + λ, where λ ∈ R is a constant.
These two invariances enable us to fix the value of b_1 = 0 and (for instance) to choose λ in such a way that Σ_{i,j} exp((u_i^⊤y_j − b_i^⊤x_j − ψ_j)/ε) = 1.

Remark. This formulation is eligible for stochastic optimization techniques when the number of (X, Y) observations is very large; since the index j corresponds to observations, the objective J is an expectation with respect to ν. Such techniques are not needed to compute b, since the number of U samples (i.e. the size of b) is set by the user. Stochastic optimization w.r.t. ψ can be performed using the stochastic averaged gradient algorithm, see Genevay et al. (2016).

6.3. Gradient descent. As already noted, the objective J in (6.1) is convex¹⁵ and smooth. Its gradient has the explicit form

∂J/∂ψ_j = ν_j − Σ_i π^ε_{ij}, (6.2)
∂J/∂b_i = −Σ_j x_j π^ε_{ij}, (6.3)

where

π^ε_{ij} := µ_i exp((u_i^⊤y_j − b_i^⊤x_j − ψ_j)/ε) / Σ_k exp((u_i^⊤y_k − b_i^⊤x_k − ψ_k)/ε).

To solve (6.1) numerically, we therefore can use a gradient descent method. An efficient way to do it is to use the Nesterov accelerated gradient algorithm, see Nesterov (1983) and Beck and Teboulle (2009). Note that if (ψ, b) solves (6.1), the fact that the partial derivatives in (6.2)-(6.3) vanish implies that the coupling π^ε satisfies the fixed-marginals and mean-independence constraints of the primal problem.
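The following sketch runs a Nesterov-accelerated descent on J, reusing rvqr_objective_grad from the sketch above; the step size, iteration count, momentum schedule and toy data are our own choices, not the paper's.

```python
def solve_rvqr(U, X, Y, mu, nu, eps=0.1, lr=0.5, iters=2000):
    """Nesterov accelerated gradient descent on the RVQR dual (6.1)."""
    I, J = U.shape[0], Y.shape[0]
    psi, b = np.zeros(J), np.zeros((I, X.shape[1]))
    psi_prev, b_prev = psi.copy(), b.copy()
    for k in range(1, iters + 1):
        m = (k - 1.0) / (k + 2.0)                 # momentum coefficient
        psi_y = psi + m * (psi - psi_prev)        # look-ahead point
        b_y = b + m * (b - b_prev)
        _, g_psi, g_b, _ = rvqr_objective_grad(psi_y, b_y, U, X, Y, mu, nu, eps)
        psi_prev, b_prev = psi, b
        psi, b = psi_y - lr * g_psi, b_y - lr * g_b
    return psi, b

# Toy usage (d = 1, one regressor); X is centered so that the
# mean-independence constraint sum_j x_j pi_ij = 0 is feasible.
rng = np.random.default_rng(0)
Jn, In = 200, 20
X1 = rng.uniform(1.4, 2.0, size=(Jn, 1)); X1 -= X1.mean(axis=0)
Y1 = 10.0 * X1 + rng.normal(0, 1, size=(Jn, 1))
U1 = ((np.arange(In) + 0.5) / In).reshape(-1, 1)
mu1, nu1 = np.full(In, 1 / In), np.full(Jn, 1 / Jn)
psi1, b1 = solve_rvqr(U1, X1, Y1, mu1, nu1, eps=0.1, lr=0.5, iters=2000)
```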
¹⁵ It is even strictly convex once we have chosen normalizations which take into account the two invariances of J explained above.

Results
Quantiles computation. The discrete probability π^ε is an approximation (because of the regularization ε) of L(U, X, Y), where U solves (5.9). The corresponding approximate quantiles are obtained by conditioning π^ε; in the above discrete setting, this yields the estimate

Q̂(u_i, x_j) := Σ_{j'} y_{j'} π^ε_{ij'} 1_{{x_{j'} = x_j}} / Σ_{j'} π^ε_{ij'} 1_{{x_{j'} = x_j}},

i.e. the π^ε-mean of Y given U = u_i and X = x_j.
Remark. To estimate the conditional distribution of Y given U = u and X = x, we can use kernel methods. In the experiments, we compute approximate quantiles as means over neighborhoods of X values, to make up for the lack of replicates. This amounts to considering

Q̂(u_i, x) := Σ_{j : x_j ∈ B_η(x)} y_j π^ε_{ij} / Σ_{j : x_j ∈ B_η(x)} π^ε_{ij},

where B_η(x) is a Euclidean ball of radius η centered at x.
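As an illustration, here is a possible implementation of this neighborhood estimator (the function name and the ball-radius convention are ours), taking as input a coupling pi as returned by the sketches above:

```python
def approx_quantile(pi, Y, X, i, x0, eta):
    """Neighborhood estimate of the u_i-quantile at X = x0 from pi^eps:
    weighted mean of Y over observations with X in the ball B_eta(x0)."""
    mask = np.linalg.norm(X - x0, axis=1) <= eta
    w = pi[i, mask]
    return (w[:, None] * Y[mask]).sum(axis=0) / w.sum()
```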
Empirical illustrations. We demonstrate the use of this approach on a series of health-related experiments. We use the "ANSUR II" dataset (Anthropometric Survey of US Army Personnel), which can be found online. This dataset is one of the most comprehensive publicly available datasets on body size and shape, containing 93 measurements for 4,082 adult male US military personnel. It allows us to easily build multivariate dependent variables.
One-dimensional VQR. We start with one-dimensional dependent variables (d = 1), namely weight (Y₁) and thigh circumference (Y₂), explained by X = (1, Height), to allow for comparison with the classical quantile regression of Koenker and Bassett (1978). Figure 1 displays the results of our method compared to the classical approach, for different height quantiles (10%, 30%, 60%, 90%). Figure 1 is computed with a "soft" potential φ, while Table 1 depicts the difference with its "hard" counterpart (see the beginning of section 6.2).
Figure 2 and Table 2 detail the impact of regularization strength on these quantiles.Chosen grid size is n = 10 (per axis) and regularization strength ε = 0.1.



Figure 3. Comparison of computational times between the unregularized case (using Gurobi's barrier logging) and the regularized case, for a varying number of predictors in 2D. In the latter, the reported time is the time needed to reach an error of 10⁻⁵ in ∥·∥₂ between two iterates of the transport plan, for ε = 0.1. Chosen grid size is n = 10 (per axis).