Gaussian linear density estimation in high dimension

In this complement, we determine the minimax excess risk for predictive (conditional) density estimation with respect to the linear Gaussian model in the well-specified case, which was referred to in Section 7.4.1 of Chapter 7 as well as Section 1.4.5 of the introduction. Specifically, the setting is that of conditional density estimation, see Section 1.4 as well as Chapter 7. Here, the space of covariates is $\mathcal{X} = \mathbf{R}^d$ and the response lies in $\mathcal{Y} = \mathbf{R}$. The model under consideration is the Gaussian linear model, given by the conditional densities of the form
\[
\mathcal{F} = \big\{ f_\beta(\cdot \,|\, x) := \mathcal{N}(\langle \beta, x \rangle, \sigma^2) : \beta \in \mathbf{R}^d \big\}, \tag{8.1}
\]
where we take the base measure on $\mathbf{R}$ to be $\mu(\mathrm{d}y) = (2\pi)^{-1/2} \mathrm{d}y$ and identify conditional distributions with their densities with respect to $\mu$. Here, $\sigma^2$ is fixed, and without loss of generality we assume that $\sigma^2 = 1$ (so that the density of $f_\beta(\cdot \,|\, x)$ with respect to $\mu$ is $y \mapsto e^{-(y - \langle \beta, x \rangle)^2/2}$). Finally, we consider in this section the well-specified case, where the true conditional distribution of $Y$ given $X$ belongs to the class $\mathcal{F}$. The results here (and their proofs) are similar in spirit to those of Chapter 6 on regression with square loss.

Setting. We assume that $(X_1, Y_1), \dots, (X_n, Y_n)$ are i.i.d. samples from a distribution $P$ such that the conditional distribution of $Y$ given $X$ belongs to the class $\mathcal{F}$, i.e. such that $Y = \langle \beta^*, X \rangle + \varepsilon$ where $\varepsilon \,|\, X \sim \mathcal{N}(0, 1)$. Hence, the corresponding set $\mathcal{P}$ of distributions of $(X, Y)$ is characterized by the distribution $P_X$ of the covariates $X$, and is denoted $\mathcal{P} := \mathcal{P}_{\mathrm{Gauss}}(P_X, 1)$ (with the notation of Chapter 6). Recall from Sections 1.1.1 and 1.4 of the introduction that the risk of a conditional density $g$ is
\[
R(g) := \mathbf{E}[\ell(g, (X, Y))] = \mathbf{E}[- \log g(Y \,|\, X)],
\]
where $\ell$ denotes the logarithmic loss. Also, the minimax excess risk is by definition
\[
\mathcal{E}^*_n(P_X) := \inf_{\widehat{g}_n} \sup_{P \in \mathcal{P}} \mathbf{E}[\mathcal{E}(\widehat{g}_n)] = \inf_{\widehat{g}_n} \sup_{P \in \mathcal{P}} \Big\{ \mathbf{E}[R(\widehat{g}_n)] - \inf_{\beta \in \mathbf{R}^d} R(f_\beta) \Big\}, \tag{8.2}
\]
where $\widehat{g}_n$ spans all estimators of the conditional density of $Y$ given $X$. In what follows, we assume that $\mathbf{E}[\|X\|^2] < +\infty$ and that the covariance matrix $\Sigma = \mathbf{E}[X X^\top]$ is invertible.

Main result. Theorem 8.1 below provides the minimax risk, as a function of the distribution $P_X$ of the covariates.

Theorem 8.1. If the distribution $P_X$ is degenerate (in the sense of Definition 6.1, Chapter 6) or if $n < d$, then the minimax risk (8.2) is infinite. If $P_X$ is non-degenerate and $n \geq d$, then the minimax excess risk (8.2) in the well-specified case is given by $\frac{1}{2}\, \mathbf{E}\big[\log \cdots$
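To make the objects in (8.1)-(8.2) concrete, here is a minimal Monte Carlo sketch (illustrative only, not from the text; it assumes numpy, a standard Gaussian design with $\Sigma = I_d$, and the ordinary least-squares plug-in $f_{\widehat{\beta}_n}$ as the estimator). It uses the fact that, with the base measure $\mu$ above and $\sigma^2 = 1$, the excess log-loss risk of a plug-in density $f_{\widehat{\beta}}$ equals $\frac{1}{2} \|\widehat{\beta} - \beta^*\|_\Sigma^2$.

```python
# A minimal sketch (illustrative, not from the text): Monte Carlo estimate of the
# excess log-loss risk of the OLS plug-in density f_{beta_hat}(.|x) = N(<beta_hat, x>, 1).
# With mu(dy) = (2*pi)^{-1/2} dy, we have -log f_beta(y|x) = (y - <beta, x>)^2 / 2, so
# R(f_beta_hat) - R(f_{beta*}) = (1/2) * ||beta_hat - beta*||_Sigma^2.
import numpy as np

rng = np.random.default_rng(0)
d, n, n_rep = 5, 50, 2000
beta_star = rng.standard_normal(d)
Sigma = np.eye(d)                      # assumption: Gaussian design with Sigma = I_d

excess = []
for _ in range(n_rep):
    X = rng.standard_normal((n, d))                  # X_i ~ N(0, I_d)
    Y = X @ beta_star + rng.standard_normal(n)       # Y_i = <beta*, X_i> + eps_i, eps_i ~ N(0, 1)
    beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]  # OLS / maximum-likelihood estimator
    diff = beta_hat - beta_star
    excess.append(0.5 * diff @ Sigma @ diff)         # excess log-loss risk of the plug-in

print("Monte Carlo excess risk of the OLS plug-in:", np.mean(excess))
# For Gaussian design, a classical Wishart computation gives E[excess] = d / (2*(n - d - 1)),
# which in particular upper bounds the minimax risk (8.2) for this P_X.
print("d / (2*(n - d - 1))                       :", d / (2 * (n - d - 1)))
```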

A Marchenko-Pastur lower bound on Stieltjes transforms of ESDs of covariance matrices

In this section, we let $X$ be a random vector in $\mathbf{R}^d$ with unit covariance: $\mathbf{E}[X X^\top] = I_d$. Given $n$ i.i.d. variables $X_1, \dots, X_n$ distributed as $X$, define the sample covariance matrix as
\[
\widehat{\Sigma}_n := \frac{1}{n} \sum_{i=1}^n X_i X_i^\top. \tag{8.10}
\]
$\widehat{\Sigma}_n$ is a symmetric, positive semi-definite $d \times d$ matrix. Let $\lambda_1(\widehat{\Sigma}_n) \geq \dots \geq \lambda_d(\widehat{\Sigma}_n)$ denote the (ordered) eigenvalues of $\widehat{\Sigma}_n$, and write $\widehat{\lambda}_{j,n} = \lambda_j(\widehat{\Sigma}_n)$ for $1 \leq j \leq d$. The empirical spectral distribution (ESD) of $\widehat{\Sigma}_n$ is by definition the distribution $\widehat{\mu}_n = (1/d) \sum_{j=1}^d \delta_{\widehat{\lambda}_{j,n}}$, with cumulative distribution function
\[
\widehat{F}_n(x) = \frac{1}{d} \sum_{j=1}^d \mathbf{1}(\widehat{\lambda}_{j,n} \leq x) \quad \text{for } x \in \mathbf{R}.
\]
The celebrated Marchenko-Pastur theorem (Marchenko and Pastur, 1967) states that, if $X \sim \mathcal{N}(0, I_d)$, then as $d, n \to \infty$ with $d/n \to \gamma \in (0, 1)$, the ESD $\widehat{\mu}_n$ converges almost surely in distribution to the Marchenko-Pastur distribution $\mu^{\mathrm{MP}}_\gamma$, with density
\[
x \longmapsto \frac{\sqrt{(b_\gamma - x)(x - a_\gamma)}}{2 \pi \gamma x} \cdot \mathbf{1}(a_\gamma \leq x \leq b_\gamma)
\]
with respect to the Lebesgue measure, where $a_\gamma = (1 - \sqrt{\gamma})^2$ and $b_\gamma = (1 + \sqrt{\gamma})^2$. This behavior exhibits a form of universality, in the sense that it remains true whenever the coordinates of $X$ are independent, centered and with unit variance (Wachter, 1978; Yin, 1986). On the other hand, the independence assumption underlying this "universal" behavior is quite strong, especially in high dimension, where it implies a very specific "incoherent" geometry for the $X_i$'s (including near-constant norms and pairwise near-orthogonality, see Section 1.3.2). In this section, we show a form of extremality of the Marchenko-Pastur distribution among ESDs of empirical covariance matrices of general (unit covariance) random vectors in $\mathbf{R}^d$.

Define the Stieltjes transform $S_\mu : \mathbf{R}^*_+ \to \mathbf{R}$ of a probability distribution $\mu$ supported on $\mathbf{R}_+$ by
\[
S_\mu(\lambda) := \int_{\mathbf{R}} (x + \lambda)^{-1} \, \mu(\mathrm{d}x).
\]
The Stieltjes transform (extended to $\lambda \in \mathbf{C} \setminus \mathbf{R}_-$) plays an important role in the spectral analysis of random matrices, and in particular in the proof of the Marchenko-Pastur law (Bai and Silverstein, 2010). Also, define the expected ESD $\bar{\mu}_n = \mathbf{E}[\widehat{\mu}_n]$ (such that $\bar{\mu}_n(A) = (1/d) \sum_{j=1}^d \mathbf{P}(\widehat{\lambda}_{j,n} \in A)$ for every measurable subset $A$ of $\mathbf{R}$) and its cumulative distribution function $\bar{F}_n(x) := \mathbf{E}[\widehat{F}_n(x)] = (1/d) \sum_{j=1}^d \mathbf{P}(\widehat{\lambda}_{j,n} \leq x)$. Our main result is the following:

Theorem 8.2 (Marchenko-Pastur lower bound). Let $X$ be a random vector in $\mathbf{R}^d$ such that $\mathbf{E}[X X^\top] = I_d$. Then, the expected Stieltjes transform of the ESD $\widehat{\mu}_n$ is lower bounded in terms of that of the Marchenko-Pastur distribution $\mu^{\mathrm{MP}}_{\gamma'}$ with $\gamma' = d/(n+1)$. Specifically, for every $\lambda > 0$, denoting $\lambda' = [n/(n+1)]\, \lambda$,
\[
S_{\bar{\mu}_n}(\lambda) = \frac{1}{d}\, \mathbf{E}\big[\mathrm{Tr}\big((\widehat{\Sigma}_n + \lambda I_d)^{-1}\big)\big] \;\geq\; \frac{n}{n+1} \cdot \frac{-(1 - \gamma' + \lambda') + \sqrt{(1 - \gamma' + \lambda')^2 + 4 \gamma' \lambda'}}{2 \lambda' \gamma'} \;=\; \frac{n}{n+1}\, S_{\mu^{\mathrm{MP}}_{\gamma'}}(\lambda'). \tag{8.11}
\]
In particular, if $n, d \to \infty$ with $d/n \to \gamma \in (0, 1)$, then $\liminf_{n \to \infty} \inf_{P_X} S_{\bar{\mu}_n}(\lambda) \geq S_{\mu^{\mathrm{MP}}_\gamma}(\lambda)$ for every $\lambda > 0$.

Theorem 8.2 states that the Marchenko-Pastur law, which is a limiting distribution of ESDs of vectors with independent coordinates, also provides a non-asymptotic lower bound (in terms of the associated Stieltjes transforms) for ESDs of general random vectors in $\mathbf{R}^d$.
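The bound (8.11) can be checked numerically. The following sketch (illustrative only, not from the text; it assumes numpy, illustrative values of $d$, $n$, $\lambda$, and two simple unit-covariance designs: i.i.d. standard Gaussian coordinates and a rescaled spherical design) estimates $\frac{1}{d}\mathbf{E}[\mathrm{Tr}((\widehat{\Sigma}_n + \lambda I_d)^{-1})]$ by Monte Carlo and compares it with $\frac{n}{n+1} S_{\mu^{\mathrm{MP}}_{\gamma'}}(\lambda')$.

```python
# Numerical sanity check (a sketch, not part of the text) of the lower bound (8.11):
# for any X with E[X X^T] = I_d, the expected Stieltjes transform of the ESD of the
# sample covariance matrix dominates the Marchenko-Pastur quantity on the right-hand side.
import numpy as np

def mp_stieltjes(lam, gamma):
    """Stieltjes transform S_{mu^MP_gamma}(lam) = int (x + lam)^{-1} dmu^MP_gamma(x)."""
    return (-(1 - gamma + lam) + np.sqrt((1 - gamma + lam) ** 2 + 4 * gamma * lam)) / (2 * gamma * lam)

rng = np.random.default_rng(1)
d, n, lam, n_rep = 40, 100, 0.5, 500   # illustrative values

def expected_stieltjes(sampler):
    """Monte Carlo estimate of (1/d) E[ Tr((Sigma_hat_n + lam I_d)^{-1}) ]."""
    vals = []
    for _ in range(n_rep):
        X = sampler((n, d))
        S = X.T @ X / n
        vals.append(np.trace(np.linalg.inv(S + lam * np.eye(d))) / d)
    return np.mean(vals)

# Two unit-covariance designs: i.i.d. N(0,1) coordinates, and a spherical design
# X = sqrt(d) * U with U uniform on the unit sphere (which also satisfies E[X X^T] = I_d).
gaussian = lambda shape: rng.standard_normal(shape)
def sphere(shape):
    G = rng.standard_normal(shape)
    return np.sqrt(shape[1]) * G / np.linalg.norm(G, axis=1, keepdims=True)

gamma_p = d / (n + 1)                  # gamma' = d/(n+1)
lam_p = n / (n + 1) * lam              # lambda' = n/(n+1) * lambda
lower = n / (n + 1) * mp_stieltjes(lam_p, gamma_p)

print("Marchenko-Pastur lower bound:", lower)
print("Gaussian design             :", expected_stieltjes(gaussian))
print("Spherical design            :", expected_stieltjes(sphere))
```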
Before giving the proof of Theorem 8.2 (which is elementary and relies on a combination of the Sherman-Morrison formula with a fixed-point argument), let us indicate some consequences for least-squares regression and Gaussian linear density estimation.

Let us fix a distribution $P_X$ of covariates $X$ such that $\Sigma := \mathbf{E}[X X^\top]$ is invertible. For $\sigma^2 > 0$, consider the statistical model $\mathcal{P} = \mathcal{P}_{\mathrm{Gauss}}(P_X, \sigma^2) = \{P_{(X,Y)} : Y \,|\, X \sim \mathcal{N}(\langle \beta^*, X \rangle, \sigma^2),\ \beta^* \in \mathbf{R}^d\}$. For $\lambda > 0$, define the prior distribution $\Pi_\lambda = \mathcal{N}\big(0, \sigma^2/(\lambda n)\, \Sigma^{-1}\big)$ on $\beta^*$. $\Pi_\lambda$ has constant density on the sets $\{\beta^* \in \mathbf{R}^d : \|\beta^*\|_\Sigma = t\}$ of constant signal strength $\|\beta^*\|_\Sigma = \mathbf{E}[\langle \beta^*, X \rangle^2]^{1/2}$. Let us also define the signal-to-noise ratio (SNR) $\eta^2 = \eta^2(\lambda) := \mathbf{E}_{\beta^* \sim \Pi_\lambda}[\|\beta^*\|^2_\Sigma] / \sigma^2 = d/(\lambda n)$.

Corollary 8.2 (Lower bound on Bayes risk in regression in terms of SNR). Let $\lambda > 0$, and let $\eta := \eta(\lambda)$ be the corresponding SNR. Then, for every distribution $P_X$ such that $\mathbf{E}[X X^\top] = \Sigma$, the Bayes optimal risk $B_{d,n}(P_X, \eta, \sigma^2)$ under the prior $\Pi_\lambda$, for prediction under the square loss $\ell(\beta, (x, y)) = (y - \langle \beta, x \rangle)^2$, is lower bounded as $B_{d,n}(P_X, \eta, \sigma^2) \;\geq\; \sigma^2 \cdot \big(-\cdots$
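As a quick sanity check on the prior $\Pi_\lambda$, the following sketch (illustrative only; it assumes numpy and an arbitrary invertible $\Sigma$ built for the example) draws $\beta^* \sim \Pi_\lambda$ and verifies the SNR identity $\eta^2(\lambda) = d/(\lambda n)$ by Monte Carlo.

```python
# A minimal sketch (not from the text) checking that the prior
# Pi_lambda = N(0, sigma^2/(lambda*n) * Sigma^{-1}) yields
# E[||beta*||_Sigma^2] / sigma^2 = d / (lambda * n), as defined above.
import numpy as np

rng = np.random.default_rng(2)
d, n, lam, sigma2 = 10, 200, 0.3, 2.0

# Some invertible covariance Sigma (illustrative choice, not from the text).
A = rng.standard_normal((d, d))
Sigma = A @ A.T / d + np.eye(d)
Sigma_inv = np.linalg.inv(Sigma)

# Sample beta* ~ Pi_lambda and estimate the SNR by Monte Carlo.
cov_prior = sigma2 / (lam * n) * Sigma_inv
betas = rng.multivariate_normal(np.zeros(d), cov_prior, size=20000)
snr_mc = np.mean(np.einsum("ij,jk,ik->i", betas, Sigma, betas)) / sigma2

print("Monte Carlo SNR :", snr_mc)
print("d / (lambda*n)  :", d / (lam * n))
```

Incidentally (a standard Gaussian conjugacy fact, not part of the statement above), under this prior and Gaussian noise the posterior mean of $\beta^*$ given the sample is the generalized ridge estimator $(\mathbf{X}^\top \mathbf{X} + \lambda n \Sigma)^{-1} \mathbf{X}^\top \mathbf{Y}$, which is the Bayes predictor under the square loss.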
