An improper estimator with optimal excess risk in misspecified density estimation and logistic regression

Abstract. We introduce a procedure for predictive conditional density estimation under logarithmic loss, which we call SMP (Sample Minmax Predictor). This estimator minimizes a new general excess risk bound for supervised statistical learning. On standard examples, this bound scales as $d/n$, with $d$ the model dimension and $n$ the sample size, and critically remains valid under model misspecification. Being an improper (out-of-model) procedure, SMP improves over within-model estimators such as the maximum likelihood estimator, whose excess risk degrades under misspecification. Compared to approaches reducing to the sequential problem, our bounds remove suboptimal $\log n$ factors, addressing an open problem from Grünwald and Kotłowski (2011) for the considered models, and can handle unbounded classes. For the Gaussian linear model, the predictions and risk bound of SMP are governed by leverage scores of covariates, nearly matching the optimal risk in the well-specified case without conditions on the noise variance or approximation error of the linear model. For logistic regression, SMP provides a non-Bayesian approach to calibration of probabilistic predictions relying on virtual samples, and can be computed by solving two logistic regressions. It achieves a nonasymptotic excess risk of $O((d + B^2 R^2)/n)$, where $R$ bounds the norm of features and $B$ that of the comparison parameter; by contrast, no within-model estimator can achieve a better rate than $\min(BR/\sqrt{n},\, d e^{BR}/n)$ in general (Hazan et al., 2014). This provides a computationally more efficient alternative to Bayesian approaches, which require approximate posterior sampling, thereby partly answering a question by Foster et al. (2018).

Introduction

Consider the standard problem of density estimation: given an i.i.d. sample $Z_1, \dots, Z_n$ from an unknown distribution $P$ on some measurable space $\mathcal{Z}$, the goal is to produce a good approximation $\widehat{P}_n$ of $P$. One way to measure the quality of an estimate $\widehat{P}_n$ is through its predictive risk: given a base measure $\mu$ on $\mathcal{Z}$, the risk of a density $g$ on $\mathcal{Z}$ with respect to $\mu$ is given by

$$R(g) = \mathbb{E}[\ell(g, Z)]\,, \quad \text{where } \ell(g, z) = -\log g(z) \qquad (7.1)$$

for $z \in \mathcal{Z}$ and where $Z$ is a random variable with distribution $P$. Letting $\mathcal{G}$ denote the set of all probability densities on $\mathcal{Z}$ with respect to $\mu$, the loss function $\ell : \mathcal{G} \times \mathcal{Z} \to \mathbb{R}$ defined by (7.1), called logarithmic (or negative log-likelihood, entropy or logistic) loss, measures the error of the density $g \in \mathcal{G}$ (which can be interpreted as a probabilistic prediction of the outcome) given outcome $z \in \mathcal{Z}$. This loss function is standard in the information theory literature, due to its link with coding (Cover and Thomas, 2006). The risk of a density $g$ can be interpreted in relation to the joint probability assigned by $g$ to a large i.i.d. test sample $Z'_1, \dots, Z'_m$ from $P$: by the law of large numbers, as $m$ tends to infinity, almost surely

$$\prod_{j=1}^m g(Z'_j) = \exp\Big( -\sum_{j=1}^m \ell(g, Z'_j) \Big) = \exp\big( -m \, [R(g) + o(1)] \big)\,.$$

In addition, assume that the distribution $P$ of $Z$ has a density $p \in \mathcal{G}$; we then have, for every $g \in \mathcal{G}$,

$$R(g) - R(p) = \mathbb{E}\Big[ \log \frac{p(Z)}{g(Z)} \Big] = \int_{\mathcal{Z}} \log\Big( \frac{p}{g} \Big)\, p \, \mathrm{d}\mu = \mathrm{KL}(p \cdot \mu,\, g \cdot \mu) \geq 0\,,$$

where $\mathrm{KL}(P, Q) := \int_{\mathcal{Z}} \log\big( \frac{\mathrm{d}P}{\mathrm{d}Q} \big)\, \mathrm{d}P$ denotes the Kullback-Leibler divergence, so that the true density $p$ minimizes the risk. When the model is well-specified, namely when $p$ belongs to it, classical parametric theory identifies estimators that make the best use of available data. A major limitation of this approach, however, is that these results rely on the unrealistic assumption that the true distribution belongs to the selected model. Such an assumption is generally unlikely to hold, since the model usually involves a simplified representation of the phenomenon under study: it comes from a choice of the statistician, who has no control over the true underlying mechanism. A more realistic situation occurs when the underlying model captures some aspects of the true distribution, such as its most salient properties, but not all of them. In other words, the statistical model provides some non-trivial approximation of the true distribution, and is thus “wrong but useful”. In such a case, a meaningful objective is to approximate the true distribution (namely, to predict its realizations) almost as well as the best distribution in the model.

This task can naturally be cast in the framework of Statistical Learning Theory (Vapnik, 1998), where one constrains the comparison class $\mathcal{F}$ while making few modeling assumptions about the true distribution. Given a class $\mathcal{F}$ of densities, the performance of an estimator $\widehat{g}_n$ is evaluated in terms of its excess risk with respect to the class $\mathcal{F}$, namely $\mathcal{E}(\widehat{g}_n) := R(\widehat{g}_n) - \inf_{f \in \mathcal{F}} R(f)$. We say that the estimator $\widehat{g}_n$ is proper (or a plug-in estimator) when it takes values inside the class $\mathcal{F}$; otherwise, $\widehat{g}_n$ will be referred to as an improper procedure. Below, we discuss two established approaches to this problem.
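As a concrete numerical illustration of these definitions, the following minimal sketch estimates the risk under logarithmic loss of a plug-in maximum likelihood density on simulated data from a misspecified distribution, together with its excess risk over the model; the Gaussian location model, sample sizes, and function names are illustrative choices, not part of the original text.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# True distribution: a two-component mixture, deliberately outside the
# Gaussian location model F = {N(theta, 1)} (misspecified setting).
def sample_true(n):
    comp = rng.random(n) < 0.3
    return np.where(comp, rng.normal(-2.0, 0.5, n), rng.normal(1.0, 1.0, n))

def log_loss_risk(density, z):
    """Empirical logarithmic risk: average of -log g(z) over a sample."""
    return -np.mean(np.log(density(z)))

n, m = 200, 100_000
train, test = sample_true(n), sample_true(m)   # large test sample approximates the expectation

# Proper (plug-in) estimator: MLE within the model F = {N(theta, 1)}.
theta_hat = train.mean()
mle_density = lambda z: norm.pdf(z, loc=theta_hat, scale=1.0)

# Risk minimizer over F is theta* = E[Z]; approximate it with the test mean.
theta_star = test.mean()
best_in_model = lambda z: norm.pdf(z, loc=theta_star, scale=1.0)

risk_mle = log_loss_risk(mle_density, test)
risk_best = log_loss_risk(best_in_model, test)
print(f"risk of plug-in MLE : {risk_mle:.4f}")
print(f"best in-model risk  : {risk_best:.4f}")
print(f"excess risk         : {risk_mle - risk_best:.4f}")
```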

Our contributions

Let us now summarize our main contributions. Note that, while the previous discussion dealt with density estimation, most of this work in fact deals with conditional density estimation, where one seeks to estimate the conditional distribution of a response $Y$ given an input variable $X$, under the logarithmic loss $\ell(f, (X, Y)) = -\log f(Y \mid X)$ (see Section 7.2.2).

SMP: a general procedure for conditional density estimation. In the present work, we introduce a general procedure for predictive density estimation under entropy risk. This estimator, which we call the Sample Minmax Predictor (SMP), is obtained by minimizing a new general excess risk bound for supervised statistical learning (Theorem 7.1), and in particular conditional density estimation (Theorem 7.2). In short, SMP is the solution of a minmax problem obtained by considering virtual samples. SMP satisfies an excess risk bound valid under model misspecification and, unlike previous approaches, does not rely on a reduction to the sequential problem, thereby improving rates for parametric classes from $O(d \log n / n)$ to $O(d/n)$ for our considered models, addressing an open problem from Grünwald and Kotłowski (2011) in this case.
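To make the role of virtual samples concrete, the following display is a sketch of the resulting predictor in the conditional setting, with notation that is ours (the formal definition accompanies Theorem 7.2). Writing $\widehat{f}_n^{(x,y)}$ for the maximum likelihood (empirical risk minimizing) element of $\mathcal{F}$ on the sample augmented with the virtual example $(x, y)$, the predictor normalizes, over possible labels $y$, the density that each augmented fit assigns to its own virtual label:

$$\widetilde{f}_n(y \mid x) \;=\; \frac{\widehat{f}_n^{(x,y)}(y \mid x)}{\int_{\mathcal{Y}} \widehat{f}_n^{(x,y')}(y' \mid x)\, \mu(\mathrm{d}y')}\,.$$

The resulting density generally lies outside $\mathcal{F}$, which is what makes the procedure improper.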

SMP for the Gaussian linear model

We apply SMP to the Gaussian linear model $\mathcal{F} = \{ f_\theta(\cdot \mid x) = N(\langle \theta, x\rangle, \sigma^2) : \theta \in \mathbb{R}^d \}$ for some $\sigma^2 > 0$, a classical conditional model for a scalar response $y \in \mathbb{R}$ to covariates $x \in \mathbb{R}^d$. SMP then smoothes predictions in terms of leverage scores, and for every distribution of covariates, its expected excess risk in the general misspecified case is at most twice the minimax excess risk in the well-specified case, but without any condition on the approximation error of the linear model or noise variance (Theorem 7.4). This yields an excess risk bound of $d/n + O((d/n)^2)$ over the class $\mathcal{F}$ under some regularity assumptions on covariates (Corollary 7.1); such a guarantee cannot be obtained for a within-model estimator, or through a regret minimization approach. We also consider a Ridge-regularized variant of SMP, and study its performance on balls of the form $\mathcal{F}_B = \{ f_\theta : \|\theta\| \leq B \}$ for $B > 0$. For covariates $X$ bounded by $R > 0$, we establish two guarantees: a “finite-dimensional” bound of $O(d \log(BR/\sqrt{d})/n)$ (Proposition 7.3), removing an extra $\log n$ term from results of Kakade and Ng (2005) in the sequential case, and a dimension-free “nonparametric” bound (Theorem 7.5), where explicit dependence on $d$ is replaced by a dependence on the covariance structure of covariates, matching well-specified minimax rates over such balls in infinite dimension (Caponnetto and De Vito, 2007).
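To illustrate how leverage scores enter the predictions, here is a minimal numerical sketch of an unpenalized SMP prediction for this model, under the assumption that the virtual-sample definition reduces, after normalizing over the virtual label, to a Gaussian centered at the least-squares prediction with a leverage-inflated variance; the function name, the regularization-free setting, and the exact inflation factor are illustrative rather than the paper's formal statement (Theorem 7.4).

```python
import numpy as np

def gaussian_smp_predict(X, y, x_new, sigma2):
    """Sketch of an SMP-style predictive distribution for the Gaussian linear model.

    Obtained by fitting least squares on the sample augmented with the virtual
    example (x_new, y), then normalizing over the virtual label y. Under this
    derivation, the prediction is a Gaussian centered at the least-squares
    prediction with standard deviation inflated by the leverage of x_new.
    """
    S = X.T @ X                                    # unnormalized covariate Gram matrix
    theta_hat = np.linalg.solve(S, X.T @ y)        # least-squares / MLE parameter
    leverage = x_new @ np.linalg.solve(S, x_new)   # l_n(x) = x^T S^{-1} x
    mean = x_new @ theta_hat
    std = np.sqrt(sigma2) * (1.0 + leverage)       # variance sigma^2 * (1 + l_n(x))^2
    return mean, std

# Example: high-leverage inputs receive wider (more cautious) predictive densities.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)
print(gaussian_smp_predict(X, y, x_new=np.zeros(3), sigma2=1.0))
print(gaussian_smp_predict(X, y, x_new=5 * np.ones(3), sigma2=1.0))
```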

SMP for logistic regression

We then turn to logistic regression, arguably the most standard model for a binary response $y \in \{-1, 1\}$ to covariates $x \in \mathbb{R}^d$, given by $\mathcal{F} = \{ f_\theta(1 \mid x) = \sigma(\langle \theta, x\rangle) : \theta \in \mathbb{R}^d \}$, where $\sigma(u) = e^u/(1 + e^u)$. In this case, SMP admits a simple form, and its prediction can be computed by solving two logistic regressions. Assuming that $\|X\| \leq R$, we show that a Ridge-penalized variant of SMP achieves excess risk $O((d + B^2 R^2)/n)$ with respect to the ball $\mathcal{F}_B = \{ f_\theta : \|\theta\| \leq B \}$ for all $B > 0$ (Corollary 7.2), together with dimension-free bounds (Theorem 7.6). In contrast, results of Hazan et al. (2014) show that no within-model estimator can achieve a better rate than $\min(BR/\sqrt{n},\, d e^{BR}/n)$ without further assumptions. Compared to approaches obtaining fast rates through Bayesian mixtures (Kakade and Ng, 2005; Foster et al., 2018), computation of SMP replaces posterior sampling by optimization. SMP thus provides a natural non-Bayesian approach to uncertainty quantification and calibration of probabilistic estimates, relying on virtual samples.
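Since the prediction reduces to solving two logistic regressions on virtual samples, a minimal sketch of this computation is as follows; the use of scikit-learn's Ridge-penalized solver, the regularization level, and the final normalization step are stated here as illustrative assumptions, not as the exact estimator analyzed in Corollary 7.2.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def logistic_smp_predict(X, y, x_new, reg=1e-2):
    """Sketch of an SMP-style prediction for logistic regression.

    Following the virtual-sample description: fit one (Ridge-penalized) logistic
    regression per candidate label of x_new, then normalize the probability that
    each fit assigns to its own virtual label. Labels y are in {-1, +1}.
    """
    probs = {}
    for label in (+1, -1):
        X_aug = np.vstack([X, x_new])
        y_aug = np.append(y, label)
        clf = LogisticRegression(C=1.0 / reg, fit_intercept=False).fit(X_aug, y_aug)
        # Probability the augmented-sample fit assigns to its own virtual label.
        probs[label] = sigmoid(label * (clf.coef_.ravel() @ x_new))
    return probs[+1] / (probs[+1] + probs[-1])   # SMP probability of label +1

# Example usage on synthetic data (names and regularization level are illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
theta = np.array([1.0, -1.0, 0.5, 0.0, 2.0])
y = np.where(rng.random(100) < sigmoid(X @ theta), 1, -1)
print(logistic_smp_predict(X, y, x_new=rng.normal(size=5)))
```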
