Mondrian Random forests: theory and methodology

Introduction

Introduced by Breiman (2001a), Random Forests (RF) are state-of-the-art classification and regression algorithms that proceed by averaging the forecasts of a number of randomized decision trees grown in parallel. Many extensions of RF have been proposed to tackle quantile estimation (Meinshausen, 2006), survival analysis (Ishwaran et al., 2008) and ranking (Clémençon et al., 2013); improvements of the original RF have also been proposed in the literature, including, to cite but a few, better sampling strategies (Geurts et al., 2006), new splitting methods (Menze et al., 2011) and Bayesian alternatives (Chipman et al., 2010). Despite their widespread use and remarkable success in practical applications, the theoretical properties of such algorithms are still not fully understood (for an overview of theoretical results on RF, see Biau and Scornet, 2016). As a result of the complexity of the procedure, which combines sampling steps and feature selection, Breiman’s original algorithm has proved difficult to analyze. A recent line of research (Scornet et al., 2015; Wager and Walther, 2015; Mentch and Hooker, 2016; Cui et al., 2017; Wager and Athey, 2018; Athey et al., 2019) has sought theoretical guarantees for RF variants that closely resemble the algorithm used in practice. It should be noted, however, that most of these guarantees offer only limited information on the quantitative behavior of the algorithm (guidance for parameter tuning is scarce), or come at the price of conjectures on the true behavior of the RF algorithm itself, and thus remain far from explaining its excellent empirical performance. In order to achieve a better understanding of the random forest algorithm, another line of research focuses on modified and stylized versions of RF. 
Among these methods, Purely Random Forests (PRF) (Breiman, 2000; Biau et al., 2008; Biau, 2012; Genuer, 2012; Arlot and Genuer, 2014; Klusowski, 2018) grow the individual trees independently of the sample, and are thus particularly amenable to theoretical analysis. The consistency of such algorithms (as well as of other idealized RF procedures) was first obtained by Biau et al. (2008), as a byproduct of the consistency of individual tree estimates. Subsequent results quantify performance guarantees by analyzing the bias/variance of simplified versions of RF, such as PRF models (Genuer, 2012; Arlot and Genuer, 2014). In particular, Genuer (2012) shows that a PRF variant achieves the minimax rate for the estimation of a Lipschitz regression function in dimension one. The bias-variance analysis is extended by Arlot and Genuer (2014), showing that PRF can also achieve minimax rates for $C^2$ regression functions in dimension one. These results are much more precise than mere consistency, and offer insights on the proper tuning of the procedure. Quite surprisingly, these optimal rates are only obtained in the one-dimensional case (where decision trees reduce to histograms). In the multi-dimensional setting, where trees exhibit an intricate recursive structure, only suboptimal rates are derived. As shown by the lower bounds of Klusowski (2018), this is not merely a limitation of the analysis: centered forests, a standard variant of PRF, exhibit suboptimal rates under nonparametric assumptions. From a more practical perspective, an important limitation of the most commonly used RF algorithms, such as Breiman’s Random Forests (Breiman, 2001a) and the Extra-Trees algorithm (Geurts et al., 2006), is that they are typically trained in a batch manner: the whole dataset, available at once, is required to build the trees. 
In order to allow their use when large amounts of data must be analyzed in a streaming fashion, several online variants of decision trees and RF algorithms have been proposed (Domingos and Hulten, 2000; Saffari et al., 2009; Taddy et al., 2011; Denil et al., 2013, 2014). Of particular interest in this chapter is the Mondrian Forest (MF) algorithm, an efficient and accurate online random forest classifier introduced by Lakshminarayanan et al. (2014); see also Lakshminarayanan et al. (2016). This algorithm is based on the Mondrian process (Roy and Teh, 2009; Roy, 2011; Orbanz and Roy, 2015), a natural probability distribution on the set of recursive partitions of the unit cube $[0,1]^d$. An appealing property of Mondrian processes is that they can be updated in an online fashion: in Lakshminarayanan et al. (2014), the use of the conditional Mondrian process enables the authors to design an online algorithm that matches its batch counterpart, in the sense that training the algorithm one data point at a time leads to the same randomized estimator as training it on the whole dataset at once. The algorithm proposed in Lakshminarayanan et al. (2014) depends on a lifetime parameter λ > 0 that guides the complexity of the trees by stopping their building process. However, a theoretical analysis of MF has been lacking; in particular, the tuning of λ is unclear from a theoretical perspective. In this chapter, we show that, aside from their appealing computational properties, Mondrian Forests are amenable to a precise theoretical analysis. We study MF in a batch setting and provide theoretical guidance on the tuning of λ. 
Based on a detailed analysis of Mondrian partitions, we prove consistency and convergence rates for MF in arbitrary dimension, which turn out to be minimax optimal on the set of s-Hölder functions with s ∈ (0, 2], provided that λ and the number of trees in the forest (for s ∈ (1, 2]) are properly tuned. Furthermore, we construct a procedure that adapts to the unknown smoothness s ∈ (0, 2] by combining Mondrian Forests with a standard model aggregation algorithm. To the best of our knowledge, such results have only been proved for very specific purely random forests, where the covariate space is of dimension one (Arlot and Genuer, 2014). Our analysis also sheds light on the benefits of Mondrian Forests compared to single Mondrian Trees: the bias reduction of Mondrian Forests allows them to be minimax for s ∈ (1, 2], while a single tree fails to be minimax in this case.

Agenda. This chapter is organized as follows. In Section 2.2, we describe the considered setting and set the notation for trees and forests. Section 2.3 defines the Mondrian process introduced by Roy and Teh (2009) and describes the MF algorithm. Section 2.4 provides new sharp properties of Mondrian partitions: the distribution of cells in Proposition 2.1 and a control of the cell diameter in Corollary 2.1, while the expected number of cells is given in Proposition 2.2. Building on these properties, we provide, in Section 2.5, statistical guarantees for MF: Theorem 2.1 proves consistency, while Theorems 2.2 and 2.3 provide minimax rates for s ∈ (0, 1] and s ∈ (1, 2] respectively. Finally, Proposition 2.4 proves that a combination of MF with a model aggregation algorithm adapts to the unknown smoothness s ∈ (0, 2]. 

The Mondrian Forest algorithm

Given a rectangular box $C = \prod_{j=1}^d [a_j, b_j] \subseteq \mathbb{R}^d$, we denote by $|C| := \sum_{j=1}^d (b_j - a_j)$ its linear dimension. The Mondrian process MP(C) is a distribution on (infinite) tree partitions of C introduced by Roy and Teh (2009); see also Roy (2011) for a rigorous construction. Mondrian partitions are built by iteratively splitting cells at random times that depend on the linear dimension of the cell; the probability of splitting along each coordinate is proportional to the corresponding side length of the cell, and the split position is drawn uniformly on that side. The Mondrian process distribution MP(λ, C) is a distribution on tree partitions of C, obtained by pruning partitions drawn from MP(C): all splits occurring after time λ > 0 are removed. In this perspective, λ is called the lifetime parameter and controls the complexity of the partition: large values of λ correspond to deep trees (complex partitions). Sampling from the distribution MP(λ, C) can be done efficiently by applying the recursive procedure SampleMondrian(C, τ = 0, λ) described in Algorithm 1. Figure 2.1 below shows a particular instance of a Mondrian partition of a square box, with lifetime parameter λ = 3.4. In what follows, Exp(λ) stands for the exponential distribution with intensity λ > 0.
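Algorithm 1 itself is not reproduced in this chunk; the recursion it describes can be sketched as follows. This is a minimal Python sketch (the function name `sample_mondrian` and the output format are our own choices, not part of the original pseudocode): a cell C born at time τ is split at time τ + E with E ∼ Exp(|C|), unless this exceeds the lifetime λ, in which case it becomes a leaf.

```python
import random

def sample_mondrian(cell, tau, lam, splits=None):
    """Sample a Mondrian partition of `cell` (a list of (a_j, b_j) intervals)
    born at time `tau`, pruned at lifetime `lam`.

    Returns the list of splits as (time, dimension, position) triples.
    """
    if splits is None:
        splits = []
    lengths = [b - a for a, b in cell]
    linear_dim = sum(lengths)  # |C| = sum_j (b_j - a_j)
    # Waiting time until the next split is Exp(|C|).
    e = random.expovariate(linear_dim) if linear_dim > 0 else float("inf")
    if tau + e > lam:
        return splits  # split occurs after the lifetime: the cell is a leaf
    # The split dimension is chosen with probability proportional to the
    # side length, and the position is uniform on the chosen side.
    j = random.choices(range(len(cell)), weights=lengths)[0]
    a_j, b_j = cell[j]
    s = random.uniform(a_j, b_j)
    splits.append((tau + e, j, s))
    # Recurse on the two children, born at the split time tau + e.
    left, right = cell.copy(), cell.copy()
    left[j] = (a_j, s)
    right[j] = (s, b_j)
    sample_mondrian(left, tau + e, lam, splits)
    sample_mondrian(right, tau + e, lam, splits)
    return splits
```

For instance, `sample_mondrian([(0.0, 1.0), (0.0, 1.0)], 0.0, 3.4)` draws a partition of the unit square with lifetime λ = 3.4, as in Figure 2.1; larger values of λ yield more splits.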

Local and global properties of the Mondrian process

In this section, we show that the properties of the Mondrian process enable us to compute explicitly some local and global quantities related to the structure of Mondrian partitions. To do so, we will need the following two facts, exposed by Roy and Teh (2009).

Fact 2.1 (Dimension 1). For d = 1, the splits from a Mondrian process Πλ ∼ MP(λ, [0, 1]) form a subset of [0, 1], which is distributed as a Poisson point process of intensity $\lambda \, \mathrm{d}x$.

Fact 2.2 (Restriction). Let Πλ ∼ MP(λ, $[0,1]^d$) be a Mondrian partition, and let $C = \prod_{j=1}^d [a_j, b_j] \subset [0,1]^d$ be a box. Consider the restriction Πλ|C of Πλ to C, i.e. the partition of C induced by the partition Πλ of $[0,1]^d$. Then Πλ|C ∼ MP(λ, C).

Fact 2.1 deals with the one-dimensional case by making explicit the distribution of the splits of a Mondrian process, which form a Poisson point process. The restriction property stated in Fact 2.2 is fundamental, and enables one to precisely characterize the behavior of Mondrian partitions. Given any point x ∈ $[0,1]^d$, Proposition 2.1 below is a sharp result giving the exact distribution of the cell Cλ(x) containing x in the Mondrian partition. Such a characterization is typically unavailable for other randomized tree partitions involving a complex recursive structure. 
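Fact 2.1 can be checked empirically: in dimension one, the recursive construction yields a split set on [0, 1] whose cardinality is Poisson(λ), since a Poisson point process of intensity λ dx on [0, 1] has Poisson(λ) points. The following minimal simulation (the helper name `mondrian_splits_1d` is ours, for illustration only) compares the average number of splits over many draws to λ.

```python
import random

def mondrian_splits_1d(a, b, tau, lam, out):
    """Sample the splits of a 1-d Mondrian process on [a, b], born at time
    `tau` with lifetime `lam`, appending split positions to `out`."""
    e = random.expovariate(b - a)  # waiting time Exp(|C|) with |C| = b - a
    if tau + e > lam:
        return  # no split before the lifetime: [a, b] is a leaf
    s = random.uniform(a, b)
    out.append(s)
    mondrian_splits_1d(a, s, tau + e, lam, out)
    mondrian_splits_1d(s, b, tau + e, lam, out)

# By Fact 2.1, the number of splits on [0, 1] is Poisson(lam), with mean lam.
random.seed(1)
lam = 5.0
counts = []
for _ in range(2000):
    out = []
    mondrian_splits_1d(0.0, 1.0, 0.0, lam, out)
    counts.append(len(out))
mean = sum(counts) / len(counts)  # should be close to lam
```

The sample mean concentrates around λ (standard deviation of the estimate is roughly $\sqrt{\lambda/2000} \approx 0.05$ here), consistent with the Poisson point process characterization.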
