Breiman's "Two Cultures'' Revisited and Reconciled

In a landmark paper published in 2001, Leo Breiman described the tense standoff between two cultures of data modeling: parametric statistical modeling and algorithmic machine learning. The cultural division between these two statistical learning frameworks has been widening steadily in recent years. What is the way forward? It has become clear that this growing gap between "the two cultures" cannot be bridged unless we find a way to blend them into a coherent whole. This article presents one such solution by establishing a link between the two cultures. Through examples, we describe the challenges and potential gains of this new integrated statistical thinking.


A Clash of Two Cultures
Some time around the year 2000 a split opened up in the world of statistics (Efron, 2020). By the dawn of the 21st century, the statistics community was clearly divided into two distinct camps: (parametric) statistics and (algorithmic) machine learning. Leo Breiman (2001) gave a vivid description of this cultural polarization in his influential essay "Statistical Modeling: The Two Cultures." But what triggered this split?
ML > STAT: Breiman (2001) argued that algorithmic models are far more flexible, scalable, and accurate for complex big-data problems. Traditional statistical methods, by contrast, are based on a priori assumed parametric models that are mainly suitable for small datasets.
STAT > ML: This algorithmic-supremacy standpoint was vigorously contested by several eminent statisticians in the discussion of the original paper. They found Breiman's claim problematic on several counts. First, despite convincing prediction results, the absence of an explicit data-generating model (a probabilistic basis) can make algorithmic methods less useful for scientific investigation; see the commentaries by Cox (2001) and Parzen (2001). Efron (2020) articulated this admirably: "Abandoning mathematical models comes close to abandoning the historic scientific goal of understanding nature." Second, the black-box nature of algorithmic models makes them inscrutable, opaque, and not easily interpretable. In fact, regarding machine learning models, Breiman himself agreed that "their mechanism for producing a prediction is difficult to understand... So on interpretability, they rate an F." Where do we stand now? To a large extent, the cultural division between these two statistical learning frameworks has kept growing over the past decades, especially in the post-deep-learning era. What options are we left with? (1) Abandon black-boxes and use statistical parametric models, even though that means sacrificing accuracy; or (2) stick with black-boxes because they are accurate, even though that means sacrificing interpretability and robustness. Perhaps unsurprisingly, neither of these two extreme positions is tenable. To make real progress, we are forced to consider the question: how do these two different cultures fit together? Can we build a framework of data analysis that combines the best ideas from both sides and offers a way to make the two into one? In this paper, we show that such an integrated framework of statistical learning is entirely buildable.
The message of this paper is very simple: algorithmic models should not be feared because of their complexity, and parametric models should not be shunned because of their simplicity; we can connect the two to create a more powerful learning scheme. With the help of a synthetic example, the next section explains what we mean by 'powerful' and why we need such a technology.

The Butterfly Prediction Problem
I strongly support the view that statisticians must face the crisis of the difficulties in their practice of regression. (Parzen, 2001)

We are given $(x_i, y_i)$, $i = 1, \ldots, N$, as shown in Fig. 1, which we refer to as the butterfly data. The goal is to build a prediction model for the outcome variable $Y$ given $X = x$.

Figure 1: The butterfly prediction problem: a superposition of $N = 350$ samples, generated from $y = x + \epsilon$ and $y = -x + \epsilon$, with $\epsilon \sim \mathcal{N}(0, 1)$. The red line is the estimated regression function, whose 90% bootstrap confidence band is shown by the blue dotted line.
The conventional regression-based prediction methods proceed as follows: (1) Estimation: We have a vast number of choices available. On the one hand, we have the simplest parametric model, linear regression, and on the other, a large collection of sophisticated machine learning (ML) methods such as k-nearest neighbors (knn), random forest, support vector machine (svm), gradient boosting machine (gbm), neural nets, etc. Irrespective of whether we choose a parametric or an algorithmic method, the estimated regression function takes the form, more or less, of a 'flat' zero line, as shown in Fig. 1. (2) Inference: The obvious next question is, how accurate is the estimate? The blue dotted line in Fig. 1 displays the 90% bootstrap confidence band, which is narrow at the center (around 0) and gets wider as we move outward, correctly reflecting the uncertainty of the mean function.
Conclusion: the small bootstrap error (uncertainty) indicates that the estimated regression line is impressively accurate! But how useful is this analysis? Even though we have an "accurate" estimate of the underlying regression function, it does not tell us anything useful. In fact, it paints a grossly misleading picture of reality (it suggests that x is not useful for predicting y, which is clearly untrue). So, what went wrong? Do we need a bigger, more complex model? Better tuning of the hyperparameters? More computing power? More data? Unfortunately, 'more data + more computing power' does not necessarily guarantee a better algorithm. The problem lies elsewhere.
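To see the issue concretely, here is a minimal R sketch that reproduces the butterfly phenomenon (the exact x-range and the equal mixture of the two branches are assumptions based on the description of Fig. 1): any conditional-mean learner, simple or complex, is driven toward the flat zero line.

set.seed(1)
N <- 350
x <- runif(N, -4, 4)
# each observation follows y = x + eps or y = -x + eps with equal probability
branch <- sample(c(-1, 1), N, replace = TRUE)
y <- branch * x + rnorm(N)

# a parametric fit and an algorithmic fit give essentially the same answer:
lm_fit <- lm(y ~ x)                      # linear regression
lo_fit <- loess(y ~ x)                   # a flexible nonparametric smoother
grid   <- seq(-4, 4, length.out = 100)
plot(x, y, pch = 16, col = "grey60")
lines(grid, predict(lm_fit, data.frame(x = grid)), col = "red",  lwd = 2)
lines(grid, predict(lo_fit, data.frame(x = grid)), col = "blue", lwd = 2)
# both estimated mean curves hover around zero: E[Y | X = x] carries
# (almost) no predictive information for this data.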
How do we avoid statistical sin in the practice of regression? To avoid "elegantly solving the wrong problem" we must first ask what to estimate, before tackling the question of how to estimate (linear or nonlinear; parametric or algorithmic; Bayes or frequentist). Almost all prediction methods start with the following '$\mu + \epsilon$' model², which approximates $y$ as a (possibly non-linear) function of $x = (x_1, \ldots, x_p)$:

$$y = \mu(x) + \epsilon, \qquad (1.1)$$

where the random errors $\epsilon_i$ are i.i.d. with expected value zero and finite variance, often assumed to be Gaussian with fixed variance: $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Conventional statistical learning estimates the high-dimensional mean (regression) function $\mu(x)$, which can then be used for prediction. Now, it is not hard to see that no matter how complex the ML method we choose to approximate $\mu(x)$, the effort will be fruitless (and a turbo-waste of computation³), since the conditional mean contains no useful information for the butterfly example! But then, how should we perform prediction for the butterfly data? What should we report?
These are basic questions, yet often overlooked in the rush to apply fancy ML methods.
Uncertainty prediction machine. The main object of interest is the random variable $Y \mid X = x$, which will be denoted by $Y_{|x}$. Can we really predict the value of the response variable $Y$ for a given $X = x$? No. It is impossible to foresee what exact value the random variable $Y_{|x}$ will take.⁴ At best, we can predict the probabilities of all its possible values, that is, the conditional distribution $f_{Y|X=x}(y)$, which encodes a complete description of $Y_{|x}$. The problem of statistical learning is to design an 'uncertainty prediction machine' that infers $f_{Y|X=x}(y)$ from the observed data $(x_i, y_i)$, $i = 1, \ldots, N$. Classical ML methods generally predict some type of summary statistic (e.g., location or scale) of the conditional distribution. However, the butterfly example taught us that this limited information can be dangerously misleading when dealing with messy heterogeneous datasets.
One of the motivating questions behind our study is: how can we convert "first-generation" machine learning methods into uncertainty distribution prediction machines? Our interest lies in extracting the underlying statistical "law" (governing equation) that is driving the data.
2 Open your favorite machine learning textbook, and it won't be long before you encounter the '$\mu + \epsilon$' model. 3 Algorithms are not bulldozers; they are tools for understanding. 4 After all, Heisenberg (1927) taught us that we can never know enough about the present to exactly predict a future event; at best we can predict the probabilities of different possible outcomes: the indeterminacy principle.

Uncertainty Prediction Machine: What to Expect?
Before delving into the technical details, it may be worthwhile to start by taking a 'crude look at the whole' to clearly convey what practical results can be hoped for from our statistical learning technology. Given the data $(x_1, y_1), \ldots, (x_N, y_N)$ with $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$, our primary focus is the inference of $Y_{|x}$, where $Y_{|x}$ denotes the conditional random variable $Y \mid X = x$ with distribution function $F_{Y|X=x}(y)$ and density $f_{Y|X=x}(y)$. We will be using the butterfly data as a stylized example to describe the different stages of our analysis.
Level 0. ML curve-fitting. We start with a user-selected machine learning method ML$_0$ to estimate the (possibly high-dimensional) regression function $\hat{\mu}(x)$ along with its uncertainty band.
The result for butterfly data is already displayed in Fig. 1. Frankly speaking, this is where the practice of traditional machine learning ends, and ours begins.
Level 1. Heterogeneity diagnostic. A natural question is whether the initial ML$_0$ captured the essential relationship between $y$ and the predictor variables. This can be checked by searching for (unmodeled) patterns in the residuals. Recall that for the butterfly data $\hat{\mu}(x) \approx 0$ for all $x$; thus the residuals are essentially the original response values $y_1, \ldots, y_N$.
The critical task boils down to checking whether the residuals are homogeneous, i.e., whether $f_{Y|X=x}(y) = f_Y(y)$ for all $x$. Homogeneity would ensure that ML$_0$ was able to capture all the important predictive information. However, in most practical real-life problems (including the butterfly data) the residuals will be heterogeneous, and the main challenge is to identify the components of $f_{Y|X=x}(y)$ that are changing significantly with $x$. We address this by constructing a simple diagnostic plot that helps users visually examine and detect the prominent sources of heterogeneity in the residuals. Fig. 2(a) is one such graphical exploratory plot drawn for the butterfly data; it reveals that the scale (second-order effect) and tail (fourth-order kurtosis effect) of $f_{Y|X=x}(y)$ are changing significantly with $x$. Accordingly, this tool helps us discover the hidden patterns that, while invisible to ML$_0$, still persist and should be modeled to sharpen the prediction quality.
Level 2. Pivot density. We now turn our attention to the problem of predicting the conditional distribution at a particular $x$, say at $x_0 = 2$ for the butterfly example. There are some obvious candidates: $\tilde{f}_Y(y)$, the empirical (marginal) distribution of $y$; its nonparametric smooth (say, kernel-smoothed) version $\hat{f}_Y(y)$; or some kind of parametric model motivated by domain knowledge or statistical convenience, e.g., $\mathcal{N}(\hat{\mu}(x_0), \hat{\sigma}_y^2)$, where $\hat{\sigma}_y$ denotes the sample standard deviation of $y$. We call these initial starting "guesses" the pivot density; they are nothing but a crude first approximation of the true unknown conditional density. For the butterfly data, we choose $\mathcal{N}(0, \hat{\sigma}_y^2 = 2.55^2)$ to be the pivot, as shown in Fig. 2(b).
A natural question to ask is whether the presumed pivot model is approximately correct. In most practical cases the answer will be no. Thus we have to take some corrective measures to rectify the deficiency of the starting pivot. But before doing so, it is crucial to devise a technique that can reveal the deficiency of the current model. The only problem is that we don't have enough $y$-samples at $X = x_0$ to perform traditional goodness-of-fit tests. We have to come up with something new that is computable and interpretable.
Level 3. Uncertainty quantification for the pivot. The basic question is whether the starting pivot density $f_0(y)$ is a good model for $Y_{|x_0}$. We answer this question by estimating the following contrast density function

$$d^0_x(u) \,=\, \frac{f_{Y|X=x}\big(F_0^{-1}(u)\big)}{f_0\big(F_0^{-1}(u)\big)}, \qquad u = F_0(y), \; 0 < u < 1, \qquad (1.2)$$

where $F_0^{-1}(u)$ is the quantile function of the pivot $f_0(y)$. Note that $d^0_x(u)$ captures the deviation of $f_0(y)$ from the true unknown conditional density $f_{Y|X=x_0}(y)$, and by virtue of doing so it helps us quantify and characterize the uncertainty of the pivot density. By inspecting the shape of $d^0_{x_0}(u)$, $0 < u < 1$ (how it deviates from uniformity), practitioners can quickly infer whether $f_0(y)$ needs any repair. Fig. 2(c) displays the estimated contrast function for the butterfly data. Looking at the shape, it becomes clear that the initial Gaussian pivot completely missed the bimodality of $f_{Y|X=2}(y)$. The contrast function $d^0_x(u)$ plays a central role in our theory. In Section 2.3 we provide an estimation strategy; the fascinating aspect of the algorithm is that it utilizes the power of the initial machine learning algorithm ML$_0$ to obtain the covariate-adaptive $\hat{d}^0_x$.
Level 4. Density modulation. When the contrast function indicates the inadequacy of the starting pivot $f_0(y)$ as a model for the conditional distribution $f_{Y|X=x}(y)$, what to do next? How do we modify (or 'boost') $f_0(y)$ to make it data-consistent? This can be solved in a remarkably simple way via density modulation: $x$ modulates (modifies) the shape of the starting $f_0(y)$ to produce the conditional density. The detailed theory of why and how this works is discussed in Section 2. Fig. 2(d) displays the estimated predictive density for the butterfly data, which we can write down analytically as

$$\hat{f}_{Y|X=x_0}(y) \,=\, f_0(y)\,\hat{d}^0_{x_0}\big(F_0(y)\big) \,=\, f_0(y)\Big[1 + \sum_j \widehat{\mathrm{LP}}_{j|x_0}\, T_j(y; F_0)\Big], \qquad (1.3)$$

which is a product of two components: the Gaussian pivot and the contrast density $d^0_{x_0}(F_0(y))$. The basis functions $\{T_j(y; F_0)\}$ in equation (1.3) are specially designed rank-based orthonormal polynomials of the pivot. They inject robustness and non-linearity into the data modeling process; more on this later. We will also show that this new class of $d$-modulated conditional density models can generate an extraordinarily rich spectrum of shapes.
One notable aspect of (1.3) is that we have 'identified' a parametric model (instead of blindly assuming one, which Breiman sharply criticized) by algorithmic (fully nonparametric) means. In Section 2.3, we elucidate the process by which we boost and convert the "0th-order" machine learning method (ML$_0$) into a finitely parameterizable uncertainty prediction machine. This allows our learning philosophy to integrate the fundamental character of both algorithmic data-driven modeling and parametric statistical modeling. Moreover, the predicted distribution (1.3) compactly represents all possible outcome values along with their respective probabilities, which is essential for decision-making under uncertainty.
Level 5. Predictive inference: What are some of the most likely values of the response variable $y$, given that we have observed $X = x_0$? For a fixed $\alpha$, ideally we would like to find the smallest-volume region that covers $100(1-\alpha)\%$ of the area of the conditional distribution, a.k.a. the highest-density prediction region. The orange highlighted region in Fig. 2 displays this region for the butterfly data; see also Fig. 17 of Appendix A.4.1.
Level 6. Verification: Data modeling typically consists of four stages: estimation (of the initial pivot), exploration (discovering the deficiency of the pivot), rectification (repairing the pivot), and verification (checking whether the corrected pivot emulates the observed reality).
The last step remains to be done, which requires evidence that the estimated uncertainty prediction model provides a satisfactory description of the data. Scientific data analysis, by definition, needs to be testable (falsifiable), and there are no two ways about it.
For our butterfly example, we compute $U_i = \hat{F}_{Y|X=x_i}(y_i)$ for the observations in a 15% hold-out validation dataset, using the trained model. It is not difficult to see that the distribution of the $U_i$'s will be 'close' to Uniform[0, 1] if the estimated conditional density explains the essential patterns and variations in the data. The quantiles of the empirical distribution of the $U_i$'s are plotted against the respective quantiles of the uniform distribution at the bottom right of Fig. 2. The tight cluster of points around the 45-degree reference line strongly indicates that the model has captured the underlying data-generating process and can be trusted to make predictions. Additionally, in Section 3.3, we provide a formal goodness-of-fit test for uniformity, which, when applied to this example, yields a p-value of 0.82.
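A minimal R sketch of this verification step, assuming a function cond_cdf(y, x) that evaluates the fitted conditional CDF $\hat{F}_{Y|X=x}(y)$ (the name is hypothetical; it stands in for whatever uncertainty prediction machine has been trained on the remaining 85% of the data):

# hold-out probability integral transform (PIT) check
# cond_cdf() is a placeholder for the trained conditional-CDF evaluator
U <- mapply(cond_cdf, y = y_holdout, x = x_holdout)

# visual check: histogram and uniform QQ-plot of the PIT values
par(mfrow = c(1, 2))
hist(U, breaks = 20, main = "PIT values", xlab = "U")
qqplot(qunif(ppoints(length(U))), sort(U),
       xlab = "Uniform quantiles", ylab = "Empirical quantiles")
abline(0, 1, col = "red")   # points should hug the 45-degree line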
Remark 1. Before leaving this section, we make two comments: (1) The purpose of this butterfly data example was to explain how our data-analysis scheme systematically extracts increasingly fine-grained knowledge from data, starting from a user-specified ML 0 .
(2) Our statistical learning theory is simultaneously applicable to small and large-scale high-dimensional problems. The next section clearly outlines our goals and objectives in somewhat broader terms.  Figure 3 shows a schematic description of the modular architecture of our data analysis pipeline, which aims to "assemble" different statistical learning cultures-algorithmic, parametric, exploratory, and robust modeling-into one coherent whole.

Goals and Contributions
‚ Unified Interface: We aim to construct a generic "interface" that can convert any algorithmic regression procedure into an uncertainty prediction machine. We are less interested in solutions that depend on the specific choice of ML method; developing 'tricks' on a case-by-case basis is too clumsy. This raises the question: how do we design such a simple and unified interface? Section 2 presents general theory that enables such a design.
‚ Algorithmic Parameterization: The next crucial aspect of our method lies in distilling an explicit and simple statistical model from a black-box technique. Section 2 lays out such a mechanism to allow algorithmic synthesis of probabilistic models with interpretable statistical parameters. This opens up the possibility of building closed-form, interpretable statistical models that are as expressive and scalable as machine learning models.
‚ Exploratory Machine Learning: How can we check whether a trained machine learning method is consistent with the observed data? If not, can we discover its blind spots using exploratory graphical tools? Finally, can we revise the starting algorithmic method when required? All these basic questions are often left unanswered by traditional machine learning procedures. In the words of Richard Hamming (1997), "AI has traditionally stuck to the what is done and seldom considered the how it is done." The 'how' part is important for building trusted AI/ML systems. Section 3 discusses some new tools to make the 'modeling process' more transparent and interpretable.⁵ It consists of the following chain of well-connected steps: estimation, exploration, rectification, and verification. This layout of data analysis embodies the viewpoint that empirical (statistical) science is a 'self-correcting' process.
‚ 'Robust + Efficient' Machine Learning: Current ML systems are extremely vulnerable to small perturbations of the training data, which makes them ineffective for critical applications.
One of the unsolved dilemmas is how to construct learning algorithms that are simultaneously robust against noise and lose very little, if any, efficiency under the standard (unperturbed) situation. Section 2.3 develops a new robust learning theory to tackle this important issue.
‚ Fundamentals of Applied Statistics: The new mechanics of data modeling presented here is not a "one-trick pony"; it has deep connections with fundamental statistics. In Section 4, we demonstrate how one can systematically derive and generalize (to large high-dimensional problems in a scalable and flexible manner) many traditional and modern statistical methods as special cases of our framework. This could be especially useful for the practice and pedagogy of ML-powered modern applied statistics. Ultimately, interlinking Statistics and Machine Learning will enrich and revitalize both communities, by creating excitement to work across the boundary.

Integrated Statistical Learning Theory
In this section, we introduce the basic theoretical underpinnings, principles, and methods of the integrated statistical-learning framework.

Entropy, Uncertainty, and Prediction
The marginal distribution $f_Y(y)$ embodies the uncertainty of $Y$. Once we observe $X = x$, the uncertainty is updated to $f_{Y|X=x}(y)$. We say that some information about $Y$ is received through $X$. Therefore, one can measure the additional information gained on $Y$ after observing $X = x$ by examining the change from the probability distribution $f_Y(y)$ to $f_{Y|X=x}(y)$. Information theory provides an elegant framework for quantifying this "change."

Definition 1. The conditional information density function (the reason behind this name will become clear soon) between $F_Y$ and $F_{Y|X=x}$ is defined as

$$d(u; F_Y, F_{Y|X=x}) \,=\, \frac{f_{Y|X=x}\big(F_Y^{-1}(u)\big)}{f_Y\big(F_Y^{-1}(u)\big)}, \qquad 0 < u < 1, \qquad (2.1)$$

which satisfies $d \ge 0$ and $\int_0^1 d(u; F_Y, F_{Y|X=x})\,\mathrm{d}u = 1$, i.e., it is a proper density on the unit interval. To improve readability, $d(u; F_Y, F_{Y|X=x})$ will be abbreviated as $d_x(u)$, with the understanding that it compares the marginal with the conditional.
Definition 2. The Kullback-Leibler (KL) divergence between $f_Y(y)$ and $f_{Y|X=x}(y)$ is defined as

$$\mathrm{KL}(f_{Y|x}\,;\,f_Y) \,=\, \int f_{Y|X=x}(y)\,\log\frac{f_{Y|X=x}(y)}{f_Y(y)}\,\mathrm{d}y. \qquad (2.2)$$

Theorem 1. The KL-divergence between $f_Y(y)$ and $f_{Y|X=x}(y)$ can be rewritten as the KL-divergence between $d_x(u)$ of (2.1) and the uniform density over $[0, 1]$:

$$\mathrm{KL}(f_{Y|x}\,;\,f_Y) \,=\, \mathrm{KL}(d_x\,;\,\mathbb{1}_{[0,1]}). \qquad (2.3)$$

Proof: First substitute $F_Y(y) = u$ in (2.2), and then apply the definition (1.2) to deduce

$$\mathrm{KL}(f_{Y|x}\,;\,f_Y) \,=\, \int_0^1 d_x(u)\,\log d_x(u)\,\mathrm{d}u \,=\, \mathrm{KL}(d_x\,;\,\mathbb{1}_{[0,1]}). \qquad (2.4)$$

Remark 2. Theorem 1 formally implies that the deviation of $d_x(u)$ from uniformity (see Fig. 2(c) and visually compare the blue curve with the black dotted uniform line) captures the amount of (predictive) information gained on $Y$ by virtue of a priori knowing $X = x$. For that reason, we call $d_x(u)$ the "conditional information density." Consequently, it is not surprising that $d_x(u)$ plays a fundamental role in developing our statistical theory of prediction.
For each $x_i$ in our dataset, we will get a different $d_{x_i}$ and a different KL-divergence value (2.4).
Can we measure the overall predictive power of $X$ by taking the average of all these different KL-divergence values? To answer that satisfactorily, we have to introduce the concept of entropy.
Definition 3. Entropy is the most natural and fundamental measure of uncertainty. For a continuous random variable with density $g$, it is defined as $H(g) = -\int g \log g$.
The predictability index can be defined as the difference between the entropy of $Y$ and that of $Y \mid X$, which measures how much our uncertainty about $Y$ decreases after observing $X$. The following theorem establishes a beautiful connection between the predictability index and our conditional information density function.
Theorem 2. The average predictability of $Y$ with respect to $X$ (i.e., the reduction in entropy) can be interpreted as how different $d_x(u)$ is from the uniform distribution, on average:

$$H(Y) - H(Y \mid X) \,=\, \mathbb{E}_X\big[\mathrm{KL}(d_X\,;\,\mathbb{1}_{[0,1]})\big].$$

The proof is given in Appendix A.3. The important point here is that predictive information can be expressed solely in terms of $d_x$, which distills the discrepancy (the "gap") between the marginal $f_Y$ and the conditional $f_{Y|x}$. Accordingly, the conditional information density (CID) $d_x(u)$ acts as a natural starting point for developing a statistical theory of predictive modeling.
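For readers who want the intermediate step, here is a short derivation sketch using only the standard definition of conditional entropy and Theorem 1 (the paper's own proof is in Appendix A.3):

\begin{align*}
H(Y) - H(Y \mid X)
  &= -\int f_Y(y)\log f_Y(y)\,dy + \int\! f_X(x)\!\int f_{Y|X=x}(y)\log f_{Y|X=x}(y)\,dy\,dx \\
  &= \int\! f_X(x)\!\int f_{Y|X=x}(y)\,\log\frac{f_{Y|X=x}(y)}{f_Y(y)}\,dy\,dx
     \qquad \text{(since } f_Y(y) = \textstyle\int f_X(x)\,f_{Y|X=x}(y)\,dx\text{)} \\
  &= \mathbb{E}_X\big[\mathrm{KL}(f_{Y|X}\,;\,f_Y)\big]
   \;=\; \mathbb{E}_X\big[\mathrm{KL}(d_X\,;\,\mathbb{1}_{[0,1]})\big] \qquad \text{(by Theorem 1).}
\end{align*}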

d-Modulated Conditional Density Representation
Definition 4 (d-modulation). Any arbitrary conditional density can be expressed as

$$f_{Y|X=x}(y) \,=\, f_Y(y)\; d_x\big(F_Y(y)\big). \qquad (2.6)$$

A few key points about this conditional density formula:
‚ This new representation of the conditional density is universally valid and does not impose any distributional assumptions on the outcome (e.g., dispersion, symmetry, skewness, heavy-tailedness, etc.). Most importantly, it decouples the conditional density into two components: the marginal $f_Y$ (the 'pivot' density) and the conditional information density $d_x$ (which encapsulates all the essential knowledge for predicting $Y$ from $X = x$).
‚ We call the density perturbation formula of Eq. (2.6) the d-modulation of the marginal $f_Y$. The function $d_x$ filters out the directions in which the initial $f_Y(y)$ departs most significantly from the true $f_{Y|X=x}(y)$. For that reason, we also call it the contrast density function.
We now provide another interpretation of $d_x(u)$ by introducing the notion of the conditional probability integral transform (cPIT, for short).
Definition 5. For continuous $Y \sim F_Y$ and $Y_{|x} \sim F_{Y|X=x}$, the random variable $U_{|x} = F_Y(Y_{|x})$ is defined as the conditional probability integral transform. Note that when $F_Y = F_{Y|X=x}$, i.e., under homogeneity, $U_{|x}$ follows the uniform distribution.
Taking the derivative, we get its density: $f_{U|x}(u) = d_x(u)$, $0 < u < 1$.

Remark 3. Definition 4 uses the marginal distribution as a natural pivot for representing the conditional density of $Y$ given $X = x$. However, one can choose any reasonable density function (parametric, nonparametric, etc.) as a pivot, as long as its support⁶ contains the support of $f_{Y|X=x}(y)$. The generalization of formula (2.6) to an arbitrary pivot $f_0(y)$ is straightforward:

$$f_{Y|X=x}(y) \,=\, f_0(y)\; d^0_x\big(F_0(y)\big),$$

where $d^0_x(u)$ is the contrast density between $F_0$ and $F_{Y|X=x}$, as defined in (1.2).
The main reference here is Newell et al. (1959). Similar ideas also appeared in the Ph.D. thesis of Patrick Winston (1970) and, more recently, in Marvin Minsky's (2007) book on building an AI system with commonsense reasoning.

A Robust Learning Theory
It is evident from the decomposition (2.6) that we need to estimate $d_x(F_Y(y))$ to obtain the conditional density of $Y$ given $X = (X_1, \ldots, X_p)$. In this section, we describe a special technique for estimating $d_x$ that enjoys the following desirable properties: ‚ Robust as well as efficient: It has already been noticed that (i) existing statistical learning methods are notoriously sensitive to noise, which makes them an easy target for adversarial attacks; and, at the same time, (ii) naively constructed robust algorithms perform poorly in the ideal setting with no contamination. The real question comes down to this: how do we robustify algorithms so that they are resistant to noise and outliers while remaining sufficiently accurate (efficient) when applied to relatively clean data?⁷ Here we present a new class of statistical learning methods that balance robustness and efficiency.
‚ Scalable to massive high-dimensional data: An important practical goal is to have a method that works for large-(n, p) problems. We achieve this goal by utilizing the power of general machine learning tools.
‚ Interpretable: It would be nice to have a tractable (and preferably smooth) model for the contrast density $d_x(u)$ with a few interpretable parameters that can capture increasingly complex shapes of the conditional distributions.
It is a challenging task to reliably learn the "true" relationship between the input features $X$ and the output $Y$ from big noisy datasets. To ensure that we succeed at it, our method has to be 'doubly robust,' i.e., it must tackle possible contamination in both the response and the covariates. We approach the estimation of $d_x$ in two stages: first, we make it robust with respect to the outcome variable $Y$; then we discuss defence strategies to guard against corrupted inputs $X$.
Y-Robustness. Substituting $u = F_Y(y)$ in the definition (2.1), we see that $d_x(F_Y(y))$ allows us to represent the ratio $f_{Y|X=x}(y)/f_Y(y)$ as a function of the rank-transform (i.e., probability integral transform) $F_Y(Y)$. This is the first major step towards ensuring the Y-robustness of our method. As a result, one can expand $d_x(F_Y(y))$ in an orthonormal basis of $F_Y(y)$. One such orthonormal system is the LP-family of rank-polynomials (Mukhopadhyay, 2017a; Bruce et al., 2019; Mukhopadhyay and Wang, 2020a), which we denote $\{T_j(y; F_Y)\}$ to emphasize that they are polynomials of $F_Y(Y)$, not $Y$, and hence extremely robust. As the true $F_Y$ is unknown, we instead use the empirical LP-bases (eLP) $\{T_j(y; \tilde{F}_Y)\}$ for our data analysis. Appendix A.2 describes their construction and properties.
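The exact construction is given in Appendix A.2; as a rough stand-in for the sketches that follow, the R helper below builds an orthonormal polynomial basis of the empirical rank-transform (using poly() as a convenient Gram-Schmidt device). It captures the two properties used here: the basis functions depend on the data only through ranks, and they are empirically mean-zero and orthonormal.

# Rough stand-in for the empirical LP basis T_j(y; F~_Y), j = 1, ..., m:
# orthonormal polynomials of the mid-rank transform of y.
lp_basis <- function(y, m = 4) {
  n <- length(y)
  u <- (rank(y) - 0.5) / n            # empirical (mid) rank-transform
  B <- poly(u, degree = m)            # Gram-Schmidt orthogonal polynomials
  B <- scale(B, center = TRUE, scale = FALSE)
  # rescale columns so that mean(T_j^2) = 1 (empirical orthonormality)
  sweep(B, 2, sqrt(colMeans(B^2)), "/")
}

T_Y <- lp_basis(y, m = 4)             # e.g., the butterfly responses from before
round(crossprod(T_Y) / length(y), 2)  # approximately the identity matrix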
Summing up the above discussion, we have the following LP-series representation:

$$d_x\big(F_Y(y)\big) \,=\, 1 + \sum_{j} \mathrm{LP}_{j|x}\, T_j(y; F_Y). \qquad (2.9)$$

The problem of estimating the function $d_x$ now boils down to estimating the orthogonal coefficients $\mathrm{LP}_{j|x}$. To get an expression for the conditional LP-coefficients, first recall that the $T_j$'s are mean-zero random variables and are orthonormal with respect to the measure $F_Y$, i.e.,

$$\int T_j(y; F_Y)\, T_k(y; F_Y)\, \mathrm{d}F_Y(y) \,=\, \delta_{jk},$$

where $\delta_{jk}$ is the Kronecker delta. Consequently, we multiply both sides of equation (2.9) by $T_j(y; F_Y)\, f_Y(y)$ and integrate to produce the identity

$$\mathrm{LP}_{j|x} \,=\, \int d_x\big(F_Y(y)\big)\, T_j(y; F_Y)\, f_Y(y)\, \mathrm{d}y. \qquad (2.11)$$

This immediately yields the following important theorem.
Theorem 4. The jth conditional LP-Fourier coefficient admits the following representation:

$$\mathrm{LP}_{j|x} \,=\, \mathbb{E}\big[\,T_j(Y; F_Y) \mid X = x\,\big]. \qquad (2.12)$$

The proof is trivial; just substitute $d_x(F_Y(y))$ in equation (2.11) by $f_{Y|X=x}(y)/f_Y(y)$. Theorem 4 allows us to interpret the conditional coefficients $\mathrm{LP}_{j|x}$ (which are of course functions of $x$) as specialized regression functions, where we regress $T_j(y; F_Y)$ on $X$. This is extremely handy, because one can now use any general machine learning method to estimate these coefficients! X-Robustness. Nevertheless, the computational theory (2.12) is still not fully satisfactory, as corrupted (or adversarially perturbed) $X$ can substantially influence the coefficients. To provide further defense, we need one last piece of the puzzle: a little-known but fundamental fact about 'regression via rank-transform.' Also see Appendix A.4.2 for something surprising.
Theorem 5. For any general random variable $X$ (whether discrete or continuous) and any measurable function $\Psi(\cdot)$, the following result holds:

$$\mathbb{E}\big[\Psi(Y) \mid X\big] \,=\, \mathbb{E}\big[\Psi(Y) \mid F_X(X)\big], \quad \text{with probability } 1. \qquad (2.13)$$

Proof: We start with a basic quantile-mechanics fact: for any random variable $X$ we have $Q_X(F_X(X)) = X$ with probability 1,
where $Q_X(u) = \inf\{x : F_X(x) \ge u\}$, $0 < u < 1$, denotes the quantile function of $X$. Accordingly, any function of $X$, say $h(X)$, can be equivalently rewritten as a function of the rank-transform $F_X(X)$, since $h(X)$ equals $h \circ Q_X(F_X(X))$ with probability 1.
In plain language, this result establishes the "sufficiency" of the rank-transform (it retains all the information) for estimating the conditional mean function. Therefore, we can estimate $\mathrm{LP}_{j|x}$ in (2.12) by regressing $T_j(Y; \tilde{F}_Y)$ on the LP-bases of $X$. This allows doubly robust estimation of $d_x$ without sacrificing an iota of efficiency in the process. Next, we go over some examples to demonstrate how this theory translates into practice (an algorithm).
Example 1. The butterfly data. Fig. 4 displays the scatter plots of $T_j(y; \tilde{F}_Y)$ against $u = \tilde{F}_X(x)$ for $j = 1, \ldots, 4$. We refer to these graphs as LP-scatter plots. The red curves denote knn-estimated regression curves. In the R computing language, this is simply knn(T_j(y; F̃_Y) ~ T_X), for j = 1, ..., 4, where $T_k(x; \tilde{F}_X)$ is the kth column of the LP-transformed feature matrix $T_X$. Of course, instead of the knn method, one can choose any machine learning routine; the next example uses the gradient boosting algorithm (gbm). To get the estimated conditional LP-coefficients at $x_0 = 2$ (indicated by the blue dots) we record the predicted values at $u_0 = \tilde{F}_X(x_0)$: $\widehat{\mathrm{LP}}_{2|x_0} = -0.020$, $\widehat{\mathrm{LP}}_{4|x_0} = -0.47$, and the rest are practically zero. Substituting them into (2.9) yields the estimated contrast function $\hat{d}_{x_0}$, as shown in Fig. 2(c).
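A sketch of this step in R, reusing the lp_basis() helper defined above and the FNN package for the knn fit (the package choice and k = 15 are our illustrative assumptions; any regression learner could be substituted):

library(FNN)   # for knn.reg()

T_Y <- lp_basis(y, m = 4)            # LP basis of the response
u_x <- (rank(x) - 0.5) / length(x)   # rank-transform of the single covariate
x0  <- 2
u0  <- mean(x <= x0)                 # approximate F~_X(x0)

# regress each T_j(y; F~_Y) on the rank-transformed x and predict at u0
LP_x0 <- sapply(1:4, function(j) {
  FNN::knn.reg(train = matrix(u_x), test = matrix(u0),
               y = T_Y[, j], k = 15)$pred
})
round(LP_x0, 2)   # estimated conditional LP-coefficients at x0 = 2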
Example 2. BUPA liver disorders data. This is a popular dataset (available in the UCI ML repository) used by many researchers, including Breiman (2001). It consists of measurements of gamma-glutamyl transpeptidase (GGT) and alanine aminotransferase (ALT) extracted from blood tests. Our data analysis scheme can be summarized as follows:

Figure 4: First four LP-scatter plots for the butterfly data, illustrating the steps for obtaining $\widehat{\mathrm{LP}}_{j|x}$ using a generic machine learning program operating on the LP-transformed domain.

1) We start by modeling the conditional mean function using the gradient boosting algorithm.
The rugged blue line denotes the estimated regression curve $\hat{\mu}(x)$.
2) To estimate the conditional density at $x_0 = 3.5$, we choose $\mathcal{N}(\hat{\mu}(x_0), \sigma_y^2)$ as our pivot model, where $\hat{\mu}(x_0) = 3.35$ and $\sigma_y = 0.5$.
3) Following Theorems 4 and 5, we estimate the conditional LP-Fourier coefficients $\mathrm{LP}_{j|x_0}$ and the associated contrast density function: $\hat{d}_{x_0}(F_0(y)) \approx 1 - 0.72\, T_2(y; F_0) + 0.10\, T_3(y; F_0)$. This is shown in the middle panel of Fig. 5.
4) Interestingly, we can interpret the coefficients of $\hat{d}_{x_0}$: the large value of $\widehat{\mathrm{LP}}_{2|x_0}$ indicates that the selected pivot needs a scale correction ("second-order" correction) to get close to the true conditional density at $x_0$. In addition, the negative sign implies that the variability has to be reduced, which is vividly apparent from Fig. 5.

5) The small positive coefficient of $T_3(y; F_0)$ suggests a mild skewness ("third-order") correction, justified by the presence of a few large $y$-values⁸ around $x_0 = 3.5$, as indicated by squares.
6) Finally, the shape of $\hat{d}_x$ provides quick exploratory guidance for choosing appropriate families of parametric conditional distributions. For the BUPA liver-disorders data, one could select a skew-normal or asymmetric logistic distribution (Friedman, 2020) as a model for $Y_{|x}$.

The Basic Algorithm
We are now in a position to provide the blueprint of the algorithmic interface of Fig. 3, which converts a user-selected machine learning method into an uncertainty prediction machine in a robust and interpretable manner. The theory of the previous section makes it straightforward to implement; it essentially involves three steps. The following algorithm presents a flowchart of the core computational steps in pseudo-code, so that one can implement it using any programming language or ML platform. We have found the H2O open-source platform to be extremely efficient, but users can choose any ML environment (e.g., TensorFlow, Azure, AutoGluon, etc.) to operationalize the following workflow.
Interface Design: A flowchart of the core computational steps in pseudo-code

Input: Data $(X_i, y_i)$ for $i = 1, \ldots, N$; a user-selected machine learning method (ML); values of $m_y$ and $m_x$; and the target test-point $x_0 \in \mathbb{R}^p$.

Step 1. $T_Y \leftarrow$ LP.basis$(y, m_y)$;  $T_X \leftarrow$ LP.basis$(X, m_x)$
for (j = 1, ..., m_y) {
  Step 2. Train the user-selected ML method with $T_Y[\,,j]$ as the response and $T_X$ as the predictors.
  Step 3. $\widehat{\mathrm{LP}}_{j|x_0} \leftarrow$ the fitted value of this model at the LP-transformed test point $x_0$.
}

8 Classical robust methods treat them as "outliers" and devise mechanisms for removing them from the analysis. This is incorrect. A careful examination will reveal that these large-y values (indicated by squares) have some pattern; they are not randomly scattered (like outliers). The real question of what is an outlier and what is a "real" pattern finally boils down to artfully balancing robustness with efficiency.

The LP-transformation is implemented by the function LP.basis; $T_X$ is the stacked matrix of LP-transformed covariates. Step 1 makes the whole procedure robust, non-linear, and invariant under monotone transformations. Step 2 performs the training of the ML method, and Step 3 predicts the fitted value at the desired $X = x_0$. Based on our experience, $m_x = 4$ and $m_y = 6$ work reasonably well for most problems, including all the datasets analyzed in this paper. Finally, one can generate the uncertainty distribution $f_{Y|x_0}$ by applying the d-modulation formula of equation (2.6).
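A compact R sketch of this three-step interface, reusing the lp_basis() helper from above and plain lm() as the stand-in ML engine (the helper, the engine, the single-covariate setting, and the nearest-point approximation at x0 are all illustrative assumptions; any learner and any p would do):

upm_fit <- function(x, y, x0, m_y = 6, m_x = 4) {
  T_Y <- lp_basis(y, m_y)                      # Step 1: LP-transform response
  T_X <- lp_basis(x, m_x)                      #         and covariate(s)
  # LP basis evaluated at the test point x0 (nearest-observation approximation)
  T_x0 <- T_X[which.min(abs(x - x0)), , drop = FALSE]

  sapply(seq_len(m_y), function(j) {
    df  <- data.frame(Tj = T_Y[, j], T_X)
    fit <- lm(Tj ~ ., data = df)               # Step 2: train ML(T_j ~ T_X)
    predict(fit, newdata = data.frame(T_x0))   # Step 3: fitted value at x0
  })
}

LP_x0 <- upm_fit(x, y, x0 = 2)
# plug these coefficients into d_x0(F_0(y)) = 1 + sum_j LP_x0[j] * T_j(y; F_0)
# and modulate the pivot as in equation (2.6) to get f_{Y|X=x0}(y).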
Remark 5. In this section we have shown: how to actualize the vision sketched in Fig. 3; how to compress large, complicated algorithmic methods down into concise parametric forms with a few interpretable coefficients⁹; how to derive a smoothness-promoting model for the response Y from black-box prediction methods¹⁰; and how to robustify algorithmic techniques¹¹. We have achieved each of these ends by devising a generic mechanism that holds for any machine learning method. Furthermore, Section 4 shows how these same modeling principles can unify a wide range of traditional and novel statistical methods.

d-Kernel Smoothing and Importance Weighting
In many applications, we are often interested in estimating quantities like $\mathbb{E}[\Psi(Y) \mid X = x]$ for a general function $\Psi(\cdot)$. By applying formula (2.6), this can be rewritten as an expectation with respect to the marginal distribution of $Y$. This leads to the following theorem, which expresses the (local) conditional mean as a weighted average of the global samples $\{y_1, \ldots, y_N\}$.
Theorem 6 (d-kernel smoothing). The conditional expectation of $\Psi(Y)$ given $X = x$ can alternatively be expressed as

$$\mathbb{E}\big[\Psi(Y) \mid X = x\big] \,=\, \mathbb{E}\big[\Psi(Y)\, d_x\big(F_Y(Y)\big)\big], \qquad (2.14)$$

where the right-hand-side expectation is taken over the marginal distribution of $Y$.

9 This materializes the objective of 'Trend 1' as proposed by Efron (2020).
Remark 6. The above representation implies the following d-weighted estimator:

$$\widehat{\mathbb{E}}\big[\Psi(Y) \mid X = x\big] \,=\, \frac{1}{N}\sum_{i=1}^{N} w_i(x)\, \Psi(y_i), \qquad (2.15)$$

where the empirical weighting function is $w_i(x) = \hat{d}_x\big(\tilde{F}_Y(y_i)\big)$. The formula (2.15) can be interpreted as a specially designed kernel smoothing, which can be extremely handy in some cases.
Three particular examples are given below: (1) $\Psi(Y) = I(Y > k)$ gives the conditional probability $\Pr(Y > k \mid X = x)$, which can be used to construct different risk measures based on tail events; (2) $\Psi(Y) = Y$ recovers the conditional mean $\mathbb{E}[Y \mid X = x]$; (3) $\Psi(Y) = (Y - \mathbb{E}[Y \mid X = x])^2$ yields the conditional variance $\mathrm{Var}(Y \mid X = x)$, whose square root, i.e., the standard error of $Y_{|x}$, is often used as an uncertainty quantification measure for a prediction.
However, we should point out that the standard error is not always a prudent choice for quantifying the reliability of a prediction, especially when the conditional distribution is heavy-tailed, skewed, or multi-modal. The length of the prediction interval (see Section 4.6) provides a more robust measure.
To drive this point home, Example 12 discusses an interesting real prediction problem (the "Auto MPG" data).
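A small R sketch of the d-weighted estimator (2.15), assuming we already have the estimated contrast-function values at the target point, represented here by a hypothetical vector d_hat with d_hat[i] = d̂_x(F̃_Y(y_i)), and using 3 as an arbitrary tail threshold:

# d-weighted (kernel-smoothing-like) estimates at a fixed x, per Remark 6
w <- d_hat                               # weights w_i(x) = d_hat_x(F~_Y(y_i))

cond_mean <- mean(w * y)                 # Psi(Y) = Y
cond_tail <- mean(w * (y > 3))           # Psi(Y) = I(Y > 3): Pr(Y > 3 | X = x)
cond_var  <- mean(w * (y - cond_mean)^2) # second central moment
cond_sd   <- sqrt(cond_var)              # 'standard error' of Y|x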

Exploratory Learning Machine
Our goal here is to develop graphical diagnostic methods that turn a black-box answer machine into an exploratory learning machine, one that can drill deeper and generate previously unanticipated, interesting questions to facilitate data-driven scientific discovery and understanding.¹² This can be viewed as a pre-deployment "health check-up" for statistical learning methods that comes with a set of nonparametric exploratory tests.

Heterogeneity Component Analysis
Heterogeneity component analysis (HCA) is a technique for identifying the dynamic components of $Y_{|x}$, i.e., the aspects of the conditional distribution $f_{Y|X=x}(y)$ (such as location, scale, skewness, tail, etc.) that vary with the covariates. We illustrate the main idea using the butterfly example. The procedure starts with the LP-scatter plots, as shown in Fig. 4. The first scatter plot captures how the location of $f_{Y|X=x}(y)$ (the conditional mean) changes with x; the second one captures the change of scale (the nature of heteroscedasticity); the third one provides information on changing skewness; and so on. Thus, an obvious approach for detecting the heterogeneity components is: Step 1. Compute a goodness-of-fit statistic for the jth LP-scatter plot via its coefficient of multiple correlation: $R_j^2$ = proportion of variance in $T_j(y; F_Y)$ that is explained by $\{T_1(x; F_X), \ldots, T_m(x; F_X)\}$. Step 2. Assess the significance of $R_j^2$ by constructing an F-statistic for the jth LP-scatter plot. Fig. 2(a) displays the significant (p-value < 0.05) heterogeneity components for the butterfly data.
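In R, the two HCA steps amount to one regression per component; a sketch under the same single-covariate, lp_basis() assumptions as before (the p-value comes from the usual overall regression F-test):

# Heterogeneity component analysis: which T_j(y) are predictable from x?
T_Y <- lp_basis(y, m = 4); T_X <- lp_basis(x, m = 4)   # LP matrices, as before
hca <- t(sapply(seq_len(ncol(T_Y)), function(j) {
  fit <- summary(lm(T_Y[, j] ~ T_X))
  f   <- fit$fstatistic                                # overall F-statistic
  c(R2 = fit$r.squared,
    p.value = pf(f[[1]], f[[2]], f[[3]], lower.tail = FALSE))
}))
rownames(hca) <- paste0("T", seq_len(ncol(T_Y)))
hca   # significant rows flag the dynamic components (location, scale, ...)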
The "Flatland" Hypothesis. For real data analysis, one can apply the above method on residuals to determine which shape parameters are varying with x, beyond first-order mean.
The "flatness" of residuals implies the errors are homogeneous with respect to the covariates.
Example 3. Bone mineral density data (Bachrach et al., 1999). This dataset contains measurements of the relative change in spinal bone density (over a one-year period) of N = 485 North American adolescents, as a function of age. It was previously analyzed by Hastie et al. (2009, p. 152). Fig. 6(a) shows the data along with the estimated regression smoother. We would like to understand whether there is any "excess" heterogeneity beyond the first-order regression modeling. To check the homogeneity of the residuals with respect to the covariate x, we apply our HCA procedure to the residuals $y_i - \hat{y}_i$. The result is shown in the top row of Fig. 6, which indicates that the regression model has not taken into account the heteroscedasticity of the data (furthermore, there is some evidence of varying skewness in the conditional distribution).
These unmodeled (dynamic) heterogeneity components should be modeled to improve the prediction. The heterogeneity is partly due to the presence of both male and female youngsters in the dataset.
Example 4. Online news popularity data (Fernandes et al., 2015). The objective of this study is to predict the popularity of online news. The data¹⁴ consist of N = 39,644 articles; for each article we have y (log₁₀ of the number of shares in social networks, a measure of popularity) and p = 59 extracted features x (such as the number of words in the title, number of links, number of images, etc.). This dataset was recently analyzed by Friedman (2020).
We start by fitting gradient boosting regression to the data. At this point, it is a natural question to ask whether the gbm captured most of the important information in the data.
13 To allow non-normal errors, use the fact that $m F_{m, N-m-1}$ converges in distribution to $\chi^2_m$ as $N \to \infty$, and compute the p-value based on this large-sample chi-square approximation.
14 The article in row #31,038 appears to be an outlier (in fact it is an erroneous data point: its 4th and 5th feature values, the rates of unique and non-stop words, cannot exceed 1 by definition). Since our method is doubly robust we do not worry much about it, but standard machine learning methods might get unduly influenced by this one data point (article). Looking at the data is always a good habit.

If not, what is missing? To spot the unmodeled dynamic components, we start from the residuals, as shown in Fig. 6(c). The HCA diagnostic plot of panel (d), constructed from the Lasso-selected features, clearly indicates the presence of strong heteroscedasticity.
It is also true that the conditional distribution is asymmetric and long-tailed, but our analysis shows that these features are not changing swiftly with the covariates; they are more or less static, not dynamically evolving shape parameters. This further corroborates the findings of Friedman (2020, Sec. 6.2).

Pivot Uncertainty Modeling
The parameters $\mathrm{LP}_{j|x}$ of the contrast density $d_x$ can be viewed as "generalized coordinates" that describe the shape of $f_{Y|X=x}(y)$ relative to some reference distribution (pivot) $f_0(y)$. In fact, these shape coefficients $\mathrm{LP}_{j|x}$ capture the whole dynamics of how the conditional distribution $f_{Y|X=x}(y)$ evolves with x.¹⁵ Naturally, a large value of

$$\mathrm{qPivot}(x) \,=\, \sum_j \big|\mathrm{LP}_{j|x}\big|^2 \qquad (3.1)$$

indicates higher uncertainty (misfit) of the pivot $f_0$ for modeling the conditional $f_{Y|X=x}$. The 'q' in qPivot stands for quantification of lack-of-fit. The above formula can also be interpreted as measuring the deviation of $d_x(u)$ from uniformity, since by Parseval's identity $\sum_j |\mathrm{LP}_{j|x}|^2 = \int_0^1 \big(d_x(u) - 1\big)^2\,\mathrm{d}u$.

Example 5. Movie box-office revenue data (Voudouris et al., 2012). The goal of this study is to build a forecasting model for film revenues. The data contain N = 4031 pairs of observations $(x_i, y_i)$ on 1990s films, where $x_i$ is the log of opening box-office revenues and $y_i$ is the log of box-office revenues after the first week. We are interested in predicting the distribution of the outcome y for x = 12 and 14, marked with the blue and green dotted lines in Fig. 7(a). Our starting pivot $f_0 = \mathcal{N}(14, \sigma_y = 3.25)$ is shown in panel (b). To check whether the conditional density at x follows this presumed parametric law, we estimate the respective contrast densities; see the middle panel of Fig. 7. The bimodality of $\hat{d}_{x=12}$ indicates that the Gaussian pivot failed to capture the presence of films with two kinds of box-office earnings.
This is also reflected in the pattern of the scatter plot, which is divided into two branches. On the other hand, the inverted-'U' (quadratic) shape of $\hat{d}_{x=14}$ implies that the variability of the Gaussian pivot needs to be adjusted (a second-order correction) to go from $f_0(y)$ to $f_{Y|X=14}(y)$.
The final refurbished conditional density models at x " 12 and 14 are shown in the bottom row of Fig. 7. They are estimated using the d-modulation technique, described in Sec. 2.2.

Goodness-of-fit Diagnostics
The flexibility of our approach (as depicted in Fig. 3) enables us to choose any machine learning method to design the uncertainty prediction machine (UPM). Nonetheless, a user may want to know which ML-powered distribution prediction algorithm(s) better explain the patterns in the data. To answer that, here we introduce a graphical exploratory procedure to evaluate the overall goodness-of-fit of a method. Deploying statistical models without knowing how well they actually fit the data is a statistical sin. We illustrate our idea using the butterfly data.
ML-driven UPM system. To construct a UPM for the butterfly data, we start with three possible base learners: k-nearest neighbors (knn), random forest (RF), and the gradient boosting algorithm (GBM). We train each ML-driven UPM model using the generic procedure described in Section 2.4. To check whether these models are congruent with the observed data, we proceed as follows:

Step 1. Compute the generalized quantile-residuals for all three models on a hold-out validation dataset (separate from the training samples) of size $n_v$:

$$U_i^{\ell} \,=\, \hat{F}^{\ell}_{Y|X=x_i}(y_i), \qquad i = 1, \ldots, n_v; \;\; \ell = 1, 2, 3. \qquad (3.3)$$

15 Indeed, one can even construct an m-dimensional phase space (analogous to statistical mechanics), consisting of the N points $\{\mathrm{LP}_{1|x_i}, \ldots, \mathrm{LP}_{m|x_i}\}_{i=1}^{N}$, to graphically represent the evolving shape of $f_{Y|X=x_i}(y)$; each point in this LP-phase-space corresponds to a particular shape for a particular value of $x = (x_1, \ldots, x_p)$.

Figure 8 (caption fragment): computed based on a 15% hold-out dataset; knn (with k = 15) seems to fit the data well.

Note that if the model accurately captures the pattern in the validation data, then we expect the distribution of the $U^{\ell}$'s to be close to uniform.
Step 2. Display the histograms and QQ-plots of $\{U_1^{\ell}, \ldots, U_{n_v}^{\ell}\}$. This is done in Fig. 8 for the butterfly data, which shows that the knn model satisfactorily captures the observed reality.
Step 3. To quantify the model fit we propose the following test statistic for the ℓ-th model:

$$\mathrm{qDIV}_{\ell} \,=\, \sum_{j=1}^{m} \Big[\frac{1}{n_v}\sum_{i=1}^{n_v} \mathrm{Leg}_j\big(U_i^{\ell}\big)\Big]^2,$$

where $\mathrm{Leg}_j$ denotes the jth orthonormalized shifted Legendre polynomial on the unit interval.
The qDIV statistic measures the distance between Uniform[0, 1] and the sample distribution of the $U^{\ell}$'s. We have used m = 6 in all our examples. One can also compute p-values using the large-sample $\chi^2_m$ null distribution of $n_v\,\mathrm{qDIV}$. For more details see Mukhopadhyay (2018) and Mukhopadhyay and Wang (2020a).
Step 4. It is also interesting to note that one can directly compute these generalized quantile-residuals from the contrast distribution function, since (substituting $F_0(t) = u$)

$$\hat{F}_{Y|X=x_i}(y_i) \,=\, \int_{-\infty}^{y_i} f_0(t)\,\hat{d}_{x_i}\big(F_0(t)\big)\,\mathrm{d}t \,=\, \int_0^{F_0(y_i)} \hat{d}_{x_i}(u)\,\mathrm{d}u.$$

For the butterfly data the qDIV values (with p-values in parentheses) for knn, random forest, and gbm are, respectively, 0.0466 (0.82), 1.12 (1.4×10⁻¹⁰), and 0.386 (0.0026). These statistics can be used as a general measure of "lack-of-fit" for comparing (ranking, or even tuning the hyperparameters of) the performances of different ML-engine-based UPMs.
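A sketch of the qDIV computation in R, under the reconstruction above (the shifted Legendre polynomials are built by Bonnet's recursion and rescaled to be orthonormal on [0, 1]; the uniform draws stand in for the hold-out quantile-residuals):

# orthonormal shifted Legendre polynomials on [0,1] via Bonnet's recursion
leg_shifted <- function(u, m) {
  x <- 2 * u - 1
  P <- matrix(0, length(u), m + 1)
  P[, 1] <- 1
  if (m >= 1) P[, 2] <- x
  if (m >= 2) for (j in 2:m)
    P[, j + 1] <- ((2 * j - 1) * x * P[, j] - (j - 1) * P[, j - 1]) / j
  sweep(P[, -1, drop = FALSE], 2, sqrt(2 * (1:m) + 1), "*")
}

qdiv <- function(U, m = 6) {
  L <- leg_shifted(U, m)
  sum(colMeans(L)^2)                     # qDIV statistic
}

U    <- runif(50)                        # stand-in for hold-out residuals U_i
stat <- qdiv(U, m = 6)
pval <- pchisq(length(U) * stat, df = 6, lower.tail = FALSE)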

Sharpening Black-box Generators
Imagine we are given some black-box generative model that can simulate y-samples for a given $x = (x_1, \ldots, x_p)$: $\mathrm{MODEL}(x) \mapsto \{\tilde{y}^x_1, \ldots, \tilde{y}^x_s\}$.
Without having any information on the particular MODEL¹⁶, our goal is to: (i) check whether the samples are truly generated from the underlying conditional distribution $f_{Y|X=x}(y)$; (ii) if not, explain 'why' by displaying the associated $\hat{d}_x$; and finally, (iii) sharpen the given imperfect samples to make them compatible with the underlying stochastic data-generating mechanism.
Butterfly data. To illustrate the method, suppose we are given all the $y_i$'s for which the $x_i$ value lies between 0 and 4. The histogram of these samples is shown in the left panel of Fig. 9. The blue curve shows the true $f_{Y|X=2}(y)$. Starting from these weak conditional samples, our goal is to refine them so that they comply with the underlying law at x = 2.
Step 1. Choose the empirical distribution $\check{f}_{y|x}$ of $\{\tilde{y}^x_1, \ldots, \tilde{y}^x_s\}$ as the pivot; see Fig. 9(a).
Step 2. Randomly generate one sample $\tilde{y}^*$ from $\check{f}_{y|x}$ and $U$ from Uniform[0, 1].
Step 3. Let $\check{F}_{y|x}$ denote the cdf of $\check{f}_{y|x}$. Then set $\hat{y}^x = \tilde{y}^*$ if $U \le d_x\big(\check{F}_{y|x}(\tilde{y}^*)\big)\big/\sup_u d_x(u)$; otherwise discard it and go back to the previous step. Here $d_x(u)$ means $d_x(u; \check{F}_{y|x}, F_{y|x})$.
Step 4. Repeat Steps 2-3 until we have an s-sample $\{\hat{y}^x_1, \ldots, \hat{y}^x_s\}$, which we call the d-sharp sample, since $d_x$ acts as a tool for sharpening the initial weak conditional samples; see Fig. 9(b).
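A sketch of this accept-reject loop in R; d_x() is a hypothetical (vectorized) function returning the estimated contrast density $\hat d_x(u; \check F_{y|x}, F_{y|x})$ at a given u, and the acceptance rule is the standard rejection-sampling bound written out in Step 3:

d_sharp <- function(y_weak, d_x, s = length(y_weak)) {
  F_check <- ecdf(y_weak)                       # pivot cdf of the weak samples
  M <- max(d_x(seq(0.001, 0.999, by = 0.001)))  # envelope constant sup_u d_x(u)
  out <- numeric(0)
  while (length(out) < s) {
    y_star <- sample(y_weak, 1)                 # Step 2: draw from the pivot
    if (runif(1) <= d_x(F_check(y_star)) / M)   # Step 3: accept or reject
      out <- c(out, y_star)                     # Step 4: keep accepted draws
  }
  out                                           # the d-sharp samples
}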

Some Connections and Unified Interpretation
We should teach in our introductory courses that one meaning of statistics is "statistical data modeling done in a systematic way." (Parzen, 2001)

The theoretical constructs of our integrated learning framework are not based on an isolated "trick"; they are fundamental to statistics. Our formalism and principles of data modeling unify many traditional and modern statistical methods. In this section, we report findings that highlight this fact.

K-sample Comparison Problems
Given random samples from k distributions $F_i(y)$, $i = 1, \ldots, k$, we seek to test the null hypothesis

$$H_0: F_1 = F_2 = \cdots = F_k, \qquad (4.1)$$

with the alternative hypothesis that at least two distributions are different. Before attacking this general problem of equality of distributions, we start with the simpler problem of testing the equality of means. The Kruskal-Wallis statistic provides a nonparametric test for it.
Definition and notation. We have k mutually independent random samples with sizes $n_1, \ldots, n_k$ and combined sample size N; $R_i$ denotes the rank of $Y_i$ in the pooled sample; $G_\ell$ is the set of indices for the ℓ-th group. The Kruskal-Wallis test statistic is defined as

$$\mathrm{KW} \,=\, \frac{12}{N(N+1)} \sum_{\ell=1}^{k} n_\ell \Big( \bar{R}_\ell - \frac{N+1}{2} \Big)^2, \qquad \bar{R}_\ell = \frac{1}{n_\ell}\sum_{i \in G_\ell} R_i.$$

One may go one step further and ask how to test the equality of scale parameters (variability across groups), which can be done using the Mood statistic. For testing higher-order differences like skewness (not to mention kurtosis, whose formula takes almost two full lines to write), the test statistics get even more complicated.¹⁷ A practitioner at this point might wonder how to make sense of these monstrous formulae. Is any intuitive or systematic derivation possible? The answer is yes; we just have to look at the problem from a different perspective. A modern introduction to the k-sample comparison problem is given below.
Step 1. View it as an (X, Y) problem: Define X to be the group-index variable taking values $\{1, \ldots, k\}$. We now have data $\{(x_i, y_i)\}_{i=1}^N$, which can be displayed as a scatter plot.
Step 2. Reformulate $H_0$: The k-sample null hypothesis of equality of distributions can now be viewed as a test of homogeneity (see Section 3.1), i.e., $f_{Y|X=x}(y) = f_Y(y)$ for all x. Accordingly, the original k-sample hypothesis (4.1) can alternatively be rewritten as $H_0': \mathrm{LP}_{j|x} = 0$ for all j and all x.
Hence, intuitively, it makes sense to apply the HCA diagnostic test of Section 3.1 to the (x, y) data. So what happens if we dare to take this leap of faith? We get almost all of the known results of k-sample modeling in one shot.¹⁸ To justify this statement, we start with the following theorem.
Theorem 7. The Kruskal-Wallis statistic can be expressed as

$$\mathrm{KW} \,=\, (N - 1)\, R_1^2, \qquad (4.6)$$

where the first-order LP-multiple correlation $R_1^2$ is defined in Step 1 of Section 3.1.
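As a quick numerical check of Theorem 7, the following R sketch compares kruskal.test() with the (N-1)R₁² construction on simulated k-sample data, reusing the lp_basis() helper (the data are purely illustrative):

set.seed(7)
k <- 4
x <- factor(rep(1:k, times = c(12, 9, 10, 8)))   # group labels
y <- rnorm(length(x), mean = as.numeric(x))      # shifted groups
N <- length(y)

T1 <- lp_basis(y, m = 1)[, 1]          # first-order LP score of y (rank-based)
TX <- model.matrix(~ x)[, -1]          # group indicators play the role of T_X

R2_1 <- summary(lm(T1 ~ TX))$r.squared
(N - 1) * R2_1                         # matches the Kruskal-Wallis statistic
kruskal.test(y ~ x)$statistic          # (up to the tie/ranking convention used)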
The proof is in Appendix A.3. The beauty of this result is that it seamlessly generalizes to higher-order cases¹⁹ (the proofs are not difficult and are left to the reader): for example, the Mood-type scale statistic can be expressed as $(N-1)R_2^2$, the skewness statistic as $(N-1)R_3^2$, and so on. The practical benefit of these results is that they provide a systematic and comprehensive k-sample analysis program that is radically simple to implement, since the LP-multiple correlations $R_j^2$ can be computed with a single line of R code: summary(lm(T_j(y) ~ T_X))$r.squared.

17 If you are a machine learning practitioner and are concerned that these are too abstract and complicated, hang in there; don't give up. We will soon see that they can be represented in a beautiful and compact manner using our modern notation, and implemented in one line, without the burden of memorizing any formula.

18 After all, "the man who perpetually hesitates accomplishes nothing; it is the man who dares who wins."
Example 6. LDL cholesterol of quail (Hettmansperger and McKean, 2010). This is a randomized experiment investigating LDL (low-density lipoprotein) cholesterol in quail; N = 39 quails were randomly assigned to k = 4 diets for a specific period of time, where each diet was mixed with a different drug compound. The boxplots are shown in Fig. 10. The scientific question is whether different drug compounds change the LDL cholesterol level. The right panel of Fig. 10 shows the re-scaled HCA plot, where the jth LP-multiple correlation $R_j^2$ is multiplied by the sample size N; the asterisk '*' indicates that the associated test statistic has p-value < 0.10. The plot shows that the distribution of LDL cholesterol levels is changing in both location and scale. In R, kruskal.test(y ~ x) yields the value 7.18, which is practically the same as ours using (4.6).
Remark 7. For this data, the classical ANOVA F-test yields a p-value of 0.345 and thus fails to detect any significant location differences, clearly contradicting the boxplots. Why has the F-test failed? There are two primary reasons: (i) the highly non-Gaussian nature of the distribution of Y, along with a few outlying values, was enough to kill the parametric F-test; and (ii) the assumption of equal variance also seems inaccurate. The takeaway: we need both robustness and efficiency to reliably detect higher-order distributional differences.

Probabilistic Index Model
Do baseball players gain weight as they get older? Can a larger dose of an antipsychotic drug reduce the severity of depression (after adjusting for the confounders)? Do working-class Belgian families spend proportionately less of their income on food as they come to make more money? All of these problems are concerned with modeling the comparison probability regression

$$\mathrm{CPR}(x) \,=\, \Pr\big(Y \le Y_{|x}\big) \qquad (4.7)$$
as a function of x, instead of the usual conditional mean regression. CPR(x) encodes the probability that a randomly selected subject from the subpopulation with X = x has a higher response value than a randomly selected subject from the full data. This concept was pioneered by Olivier Thas and his collaborators under the name of probabilistic index models; see Thas et al. (2012). To understand how (4.7) is related to our framework, we start with the following integral representation:

$$\Pr\big(Y \le Y_{|x}\big) \,=\, \int F_Y(y)\, \mathrm{d}F_{Y|X=x}(y). \qquad (4.8)$$

This leads to the following important result.
Theorem 8. $\Pr(Y \le Y_{|x})$ can be expressed as a regression of the probability integral transform $F_Y(Y)$ on X:

$$\Pr\big(Y \le Y_{|x}\big) \,=\, \mathbb{E}\big[F_Y(Y) \mid X = x\big] \,=\, \tfrac{1}{2} + \tfrac{1}{\sqrt{12}}\,\mathbb{E}\big[T_1(Y; F_Y) \mid X = x\big]. \qquad (4.9)$$

The first equality follows from (4.8), and the second from supplementary equation (E5.4). The representation (4.9) justifies the name comparison probability regression: it is simply a standardized first-order LP-regression function. As a consequence of Theorem 8, the whole procedure can be implemented in one line of R code: lm(T_1(y) ~ T_X).
Example 7. Baseball data (Matloff, 2017). Fig. 11 shows the age versus weight scatter plot of N = 1015 major-league baseball players. We would like to investigate whether baseball players gain weight as they age; in other words, our task is to check whether $\Pr(Y \le Y_{|x})$ increases as a function of age. We proceed as follows: Step 1. Compute the LP-basis functions: $T_1(y; \tilde{F}_Y)$ for the response and $T_X = [T_1(x; \tilde{F}_X), \ldots, T_m(x; \tilde{F}_X)]$ for the predictor variable. Here we have picked m = 4.
Step 2. (Penalized) LP-regression: Perform the multiple linear regression of $T_1(y; \tilde{F}_Y)$ on $T_X$.

Return the Bayesian information criterion (BIC) selected model. If no variable is selected, then
this supports the null hypothesis $H_0: \Pr(Y \le Y_{|x}) = 0.5$. For the baseball data, the BIC-selected model is given in eq. (4.10); there is no intercept since $\mathbb{E}[T_1(y; \tilde{F}_Y)] = 0$ by construction. Although athletes strive to keep physically fit, eq. (4.10) suggests that baseball players do gain some weight (at a slow, roughly linear rate) over time; Fig. 11(b) makes it somewhat clearer how.

Shape Predictors: Generalized Feature Selection
When dealing with a large number of covariates, it is often important to understand which ones are the most predictive of the response variable Y. Currently, the most popular feature selection method is $L_1$-penalized lasso regression (Hastie et al., 2009, Ch. 3.4), which minimizes the residual sum of squares $\|y - X\beta\|_2^2$ subject to $\|\beta\|_1 \le \lambda$. This will be denoted algorithmically as lasso(y ~ X). It is important to note that when we perform lasso regression of y on X, we only get 'first-order' informative features, i.e., the variables that influence the conditional mean $\mathbb{E}[Y \mid X = x]$. Naturally, the question is: can we go beyond the mean and find the attributes most relevant for describing the shape of the conditional distribution $f_{Y|X=x}(y)$? Can we better utilize the lasso technique to find those "shape predictors"? The following is an attempt to construct such a generalized feature selection program, which is extremely easy to implement.
The algorithmic logic proceeds as follows. Step 1. Start by recalling our basic model: $f_{Y\mid X=x}(y) = f_Y(y)\, d_x\big(F_Y(y)\big)$. The covariates $x = (x_1, \ldots, x_p)$ can influence the distribution of the response Y only through $d_x \circ F_Y(y) = 1 + \sum_j \mathrm{LP}_{j\mid x}\, T_j(y; F_Y)$, which is governed by the following system of equations:
$$\mathrm{LP}_{j\mid x} \;=\; \mathbb{E}\big[T_j(Y; F_Y) \mid X = x\big], \qquad j = 1, 2, \ldots$$
The first equation captures the dynamics of the conditional mean as a function of x, the second describes how the variability (scale) changes with x, and so on.
Step 2. One can therefore extract the $j$th-order features, denoted by $\mathcal{O}_j = \{\ell : \widehat\beta^{(j)}_\ell \ne 0\}$, where $\widehat\beta^{(j)}$ is the output of $\text{lasso}(T_j(y) \sim X)$, which efficiently computes the solution of
$$\widehat\beta^{(j)} \;=\; \arg\min_{\beta}\; \big\|T_j(y) - X\beta\big\|_2^2 \quad \text{subject to } \|\beta\|_1 \le \lambda. \qquad (4.11)$$
Accordingly, $\mathcal{O}_1$ is the collection of location-informative variables, $\mathcal{O}_2$ contains the scale-informative ones, $\mathcal{O}_3$ variables are responsible for predicting changes in the skewness of Y, and so on.
Step 3. To make the analysis doubly robust and to allow for nonlinearity of the features, simply replace X with the LP-transformed matrix $T_X$, as we did in Sec. 2.4.
Definition 6. Let $\widehat\beta^{(j)}_{k\ell}$ be the estimated lasso regression coefficient for $T_k(x_\ell; \widetilde F_{X_\ell})$. Then define the Omnibus Variable Importance Score (Ovis) for $X_\ell$ as
$$\mathrm{Ovis}_\ell \;=\; \sum_j \sum_k \big|\widehat\beta^{(j)}_{k\ell}\big|^2, \qquad \ell = 1, \ldots, p. \qquad (4.12)$$
Remark 8 (Portability). Our procedure for finding generalized shape predictors is extremely flexible: one can easily integrate it with any machine learning algorithm. Train the model $\mathrm{ML}(T_j(y) \sim X)$ and plug it into a feature-importance wrapper, e.g., h2o.varimp() from the h2o R package, varImp() from the caret R package, or FeatureImp() from the iml R package.
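The following R sketch illustrates Steps 1-3 and Definition 6 on synthetic stand-in data, using cv.glmnet() as the lasso() routine and first-order LP transforms of the covariates (k = 1), as in Example 8 below. The eLP() helper is the same one used in the Example 7 sketch; none of this is the paper's exact implementation.

```r
library(glmnet)

eLP <- function(z, m) {                              # empirical LP basis (see the Example 7 sketch)
  u <- (rank(z) - 0.5) / length(z)
  sqrt(length(z)) * poly(u, degree = m)
}

set.seed(2)                                          # synthetic stand-in data
n <- 500; p <- 10
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- X[, 1] + exp(0.5 * X[, 2]) * rnorm(n)           # x1 shifts the location, x2 drives the scale

Ty <- eLP(y, 3)                                      # T_1(y), T_2(y), T_3(y)
TX <- apply(X, 2, function(x) eLP(x, 1))             # first-order LP transforms T_1(x_l)

# beta-hat^(j) from lasso(T_j(y) ~ T_X), eq. (4.11); O_j = nonzero rows of column j
beta <- sapply(1:3, function(j) as.numeric(coef(cv.glmnet(TX, Ty[, j])))[-1])
rownames(beta) <- colnames(X)

Ovis <- rowSums(beta^2)                              # eq. (4.12) with k = 1 and j = 1, 2, 3
sort(Ovis, decreasing = TRUE)                        # ranked shape predictors
```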
Example 8. Online news popularity data, continued. Which attributes are most important for predicting the popularity of a given article? More specifically, which variables affect the changing shape of the conditional distribution? Our approach consists of four main steps:
• First, determine which shape parameters of $f_{Y\mid X=x}(y)$ are changing with x. In Fig. 6 we have already seen that scale and skewness are the two principal dynamic components, in addition to location.
• Second, to determine the variables that impact the changing location, scale, and skewness, we perform the following LP-lasso regression[20]: $\text{lasso}\big(T_j(y) \sim T_1(x_1) + \cdots + T_1(x_p)\big)$ for $j = 1, 2, 3$, where $T_k(x_\ell) := T_k(x_\ell; \widetilde F_{X_\ell})$. Store the selected features in the set $\mathcal{O}_j$.
• Third, we summarize the findings by ranking the variables using the Ovis index; here we use $k = 1$ and $j = 3$ in formula (4.12). The top 10 features are displayed in Fig. 12, which shows the nature of the contributions of the different variables. For example, the variable 'is weekend' (whether the news article was published on the weekend or not) plays the dual role of being both a location and a scale informant. Consequently, our multi-layered robust method provides a refined understanding of the "which and how" aspects of feature selection.
• Finally, one can perform targeted feature screening. An investigator can specifically query which variables are most responsible for heteroscedasticity, or for a change in symmetry, etc. Table 1 shows the top five variables in each category: location, scale, and skewness. Investigators can use this tool to identify different sources of heterogeneity. These insights will ultimately help applied researchers better understand the nature of the association between the response Y and $X = (X_1, \ldots, X_p)$.
[20] See Appendix A.4.4 for a related discussion on 'robust lasso.'

The XYZ Problem: Distributional Impact Analysis
In application fields such as healthcare, economics, and the social sciences, data often arrive in XYZ format, where Z is a binary treatment variable, Y is the response variable, and X is a (possibly large) collection of covariates. One such problem is discussed below.
Example 9. Rosner's FEV data (Rosner, 1995). Table 2 describes the data. The main interest lies in understanding the impact of smoking on the distribution of FEV as a function of age.
This is important for individualized, custom-tailored decision making, where a treatment might be locally effective for a sub-population (characterized by certain values of x) without being globally effective for the whole heterogeneous population. Our model for the conditional distribution of Y given X = x and Z = z is
$$f_{Y\mid x, z}(y) \;=\; f_Y(y)\, d_{x,z}\big(F_Y(y)\big), \qquad z \in \{0, 1\}. \qquad (4.13)$$
Distributional impact analysis aims to quantify (how much) and characterize (in which ways) the differences between $f_{Y\mid x,1}(y)$ and $f_{Y\mid x,0}(y)$. To develop a measure of distributional impact, first note from equation (4.13) that both $f_{Y\mid x,1}$ and $f_{Y\mid x,0}$ share the same pivot $f_Y(y)$. Thus the deviation between them depends directly on the distance between the contrast densities $d_{x,0}$ and $d_{x,1}$.
Theorem 9. The $L_2$ distance between $d_{x,1}(u)$ and $d_{x,0}(u)$ can be expressed as the $\ell_2$ distance between the corresponding LP-Fourier coefficients:
$$\big\| d_{x,0} - d_{x,1} \big\|_2^2 \;=\; \sum_j \big| \mathrm{LP}_{j\mid x,0} - \mathrm{LP}_{j\mid x,1} \big|^2.$$
Definition 7. Define the distributional impact function (DIF) for a given X = x as the sequence of LP-differences
$$\mathrm{DIF}_j(x) \;=\; \mathrm{LP}_{j\mid x,0} \,-\, \mathrm{LP}_{j\mid x,1}, \qquad j = 1, 2, \ldots$$
The components of the DIF capture the amount of change in location, scale, and so on. For example, the location difference can be estimated from the first-order LP-difference statistic $\mathrm{LP}_{1\mid x,0} - \mathrm{LP}_{1\mid x,1}$.
We are now ready to apply this procedure to the FEV data. To estimate the LP-coefficients in (4.13) we apply our UPM framework (the algorithm of Section 2.3.2) with a gradient boosting machine as the base learner. We chose gbm because it efficiently captures the interactions[21] between the treatment and the covariates. The left panel of Fig. 13 shows the DIF statistic for ages 12 to 15.
The striking feature of this graph is that it simultaneously answers which x-groups are most impacted by the treatment and how they are impacted. For example, at age 14 variability is the dominant mode of difference, whereas at age 15 location is the main source of difference.
Overall, the DIF barplot does a modest job of capturing the heterogeneous impact of teenage smoking on lung function across different age groups.
[21] Treatment × covariate interactions allow the treatment effect to vary among individuals with different covariate values.
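For concreteness, here is a minimal R sketch of how the LP-coefficient surfaces in (4.13), and hence the first-order DIF component, could be estimated with gbm as the base learner. This is an illustrative stand-in for the UPM algorithm of Section 2.3.2, not the paper's implementation; fev, age, and smoke are synthetic stand-ins for Rosner's FEV variables, and eLP() is the helper from the Example 7 sketch.

```r
library(gbm)

eLP <- function(z, m) {                              # empirical LP basis (see the Example 7 sketch)
  u <- (rank(z) - 0.5) / length(z)
  sqrt(length(z)) * poly(u, degree = m)
}

set.seed(3)                                          # synthetic stand-ins for the FEV variables
n     <- 600
age   <- sample(9:19, n, replace = TRUE)
smoke <- rbinom(n, 1, plogis(0.4 * (age - 14)))
fev   <- 0.25 * age - 0.3 * smoke * (age > 12) + rnorm(n, sd = 0.5)

Ty  <- eLP(fev, 4)                                   # LP-transformed response: T_1(y), ..., T_4(y)
dat <- data.frame(age = age, smoke = smoke)

# LP_{j | x, z}: regress T_j(y) on (age, smoke), then read off the surface at z = 0 or 1
LP_surface <- function(j, z0) {
  dat$Tj <- Ty[, j]
  fit <- gbm(Tj ~ age + smoke, data = dat, distribution = "gaussian",
             n.trees = 500, interaction.depth = 2, shrinkage = 0.05)
  predict(fit, newdata = transform(dat, smoke = z0), n.trees = 500)
}

dif1 <- LP_surface(1, 0) - LP_surface(1, 1)          # first-order DIF: location difference
round(tapply(dif1, age, mean), 3)                    # DIF_1(x) summarized by age group
```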
Remark 9 (Connection with causal inference). For randomized experiments and clinical trials, our DIF statistic can be interpreted as a 'distributional' treatment effect[22], one whose impact varies across sub-populations. For observational studies, one needs to invoke the so-called 'unconfoundedness' assumption to attach a causal interpretation.

Quantile Regression
Quantile regression, pioneered by Koenker and Bassett (1978), has become a pervasive tool in a host of real-world application areas. Despite the great progress made over the last 40 years (Koenker, 2017), some practical challenges remain: how do we develop a nonparametric quantile regression method that can scale to high-dimensional problems? How do we efficiently incorporate nonlinearity? How do we ensure that the estimated conditional quantile curves will not cross each other? Our statistical learning theory can offer some realistic solutions to these frontier problems of quantile regression. But before getting into that, let us start with a slightly broader context. Our "interlinked" statistical learning architecture (as depicted in Fig. 3) has two notable consequences for the practice of nonparametric data modeling:
• Our technology opens up brand-new ways of building "ML-powered" statistical models that are simultaneously flexible and scalable[23] for large (n, p) problems, where classical nonparametric methods are not a viable option.
• Our UPM learning architecture is designed using a high-level universal language that is agnostic to the specific ML method, which ensures the portability and durability of the technology.[24] To perform UPM-based quantile regression, simply extract the appropriate percentiles from the conditional density estimate $\widehat f_{Y\mid X=x}(y)$; a short sketch is given below. The beauty of our method is that we do not have to develop theories specialized to a particular kind of machine learning method, such as random forests (Meinshausen, 2006; Athey et al., 2019) or neural networks (Cannon, 2011); it is all done using a single generic procedure. See Appendix A.4.3 and Fig. 19 for more details.
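To make the 'extract the appropriate percentiles' step concrete, the R sketch below numerically inverts a conditional density estimate evaluated on a grid at a fixed x (here a hypothetical f_hat); because all quantile levels are read off the same estimated density, the resulting quantile curves cannot cross.

```r
# Invert a conditional density estimate f_hat (evaluated on y_grid at a fixed x)
# to obtain the conditional quantiles Q_{Y|X=x}(u).
cond_quantile <- function(y_grid, f_hat, probs = c(0.05, 0.25, 0.50, 0.75, 0.95)) {
  dy  <- diff(y_grid)
  cdf <- cumsum(c(0, dy * (head(f_hat, -1) + tail(f_hat, -1)) / 2))   # trapezoidal CDF
  cdf <- cdf / max(cdf)                                               # normalize to [0, 1]
  approx(cdf, y_grid, xout = probs, ties = "ordered")$y               # numerical inversion
}

# toy check against a known density: should be close to qnorm(probs)
y_grid <- seq(-5, 5, length.out = 512)
round(cond_quantile(y_grid, dnorm(y_grid)), 3)
```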
Example 10. Dutch Boys data (Fredriks et al., 2000). This dataset is part of the Fourth Dutch Growth Study, which comprised observations on the age and BMI of N = 7294 Dutch boys. The goal is to estimate age-specific reference growth curves based on the conditional quantile function $\widehat Q_{Y\mid X=x}(u)$. Fig. 14 displays our result, where we have used lasso (as implemented ...).

[22] The topic of 'distributional' treatment effects (as opposed to 'mean' effects) carries major significance in modern econometrics (Bitler et al., 2006; Imbens and Wooldridge, 2009; Banerjee et al., 2015), where it is known that the impact of an intervention on the outcome distribution can be highly heterogeneous, somewhat similar to what we have already seen in Fig. 13. The good news is that DIF(Y, Z|X) provides a systematic way to characterize the impact of a treatment Z across the entire distribution of Y as a function of the covariates X.
[23] By taking advantage of high-performance, specialized ML hardware and software architectures.
[24] It is important to keep in mind that ML algorithms are growing at a staggeringly fast rate. In these circumstances, it is not practical to develop 'retail' strategies on a case-by-case basis for each ML procedure. A practical way out is to build a unified learning ecosystem.

Prediction Interval Estimator
One way to summarize the conditional density is through a prediction interval (PI), which provides a concise view of the most likely values of the response variable.[25] For a specified level α, the goal is to construct an interval that covers no less than $(1-\alpha)$ of the probability mass of $f_{Y\mid X=x}(y)$. How do we construct such PIs? We discuss three different procedures. The real challenge is to make the interval as narrow as possible while maintaining the desired coverage.
• Quantile-based PI. As noted by Meinshausen (2006), one can use quantile regression to build PIs. A $100(1-\alpha)\%$ prediction interval for Y given X = x can be estimated as
$$\mathrm{qPI}_{1-\alpha}(x) \;=\; \big[\, \widehat Q_{Y\mid X=x}(\alpha/2),\;\; \widehat Q_{Y\mid X=x}(1-\alpha/2) \,\big]. \qquad (4.16)$$
Naturally, the width (precision) of the quantile-PIs varies over the covariate space.
[Caption fragment: "... to produce narrow, more precise prediction intervals while ensuring the desired coverage. qPI seems to be a little misaligned, due to the long right-tail."]
• Standard-error-based PI. If we are brave enough to assume some parametric form for the conditional density, then we can greatly simplify the expression for the PIs. For example, under a Gaussianity assumption one can express (4.16) in the following compact form, involving only the conditional mean $\mu(x)$ and standard deviation:
$$\mathrm{gPI}_{1-\alpha}(x) \;=\; \mu(x) \,\pm\, z^{(1-\alpha/2)}\, \sigma_{\mid x},$$
where $\Phi^{-1}(\alpha) = z^{(\alpha)}$, $\mathrm{Var}(Y\mid X=x) = \sigma^2_{\mid x}$, and 'g' denotes the Gaussian-PI.
• Highest-density PI. This is the shortest possible interval with a given coverage probability (Box and Tiao, 1973), defined as
$$\mathrm{hdPI}_{1-\alpha}(x) \;=\; \big\{\, y : f_{Y\mid X=x}(y) \ge \tau_{1-\alpha} \,\big\},$$
where $\tau_{1-\alpha}$ is the largest number satisfying $\int_{y \in \mathrm{hdPI}_{1-\alpha}(x)} f_{Y\mid X=x}(y)\, \mathrm{d}y \;\ge\; 1-\alpha$.
Another nice property of hdPIs is that any point within the interval has a higher density than any other point outside it, which justifies their name.
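A minimal R sketch of the highest-density construction (a generic grid approximation, not the paper's implementation): keep the highest-density grid cells until they accumulate at least 1 − α probability mass; contiguous runs of retained cells then form the, possibly disjoint, pieces of the hdPI.

```r
hdPI <- function(y_grid, f_hat, alpha = 0.10) {
  dy   <- mean(diff(y_grid))
  mass <- f_hat * dy / sum(f_hat * dy)              # probability mass of each grid cell
  ord  <- order(f_hat, decreasing = TRUE)           # visit cells from highest density down
  k    <- which(cumsum(mass[ord]) >= 1 - alpha)[1]  # smallest set with >= (1 - alpha) coverage
  idx  <- sort(ord[1:k])
  runs <- cumsum(c(1, diff(idx) != 1))              # contiguous runs -> disjoint sub-intervals
  lapply(split(y_grid[idx], runs), range)           # one (lower, upper) pair per piece
}

# toy bimodal example: the 90% hdPI consists of two disjoint pieces, one per mode
y_grid <- seq(-7, 7, length.out = 1024)
f_hat  <- 0.5 * dnorm(y_grid, mean = -3) + 0.5 * dnorm(y_grid, mean = 3)
hdPI(y_grid, f_hat, alpha = 0.10)
```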
Based on their operating principles, these three kinds of PIs can be broadly categorized into two groups: the vertical and the horizontal approaches.
Example 11. Butterfly data, continued. All three prediction intervals are shown in Fig. 15(a).
Three main conclusions: (i) the shortest of all is the highest-density PI, which takes the form of disjoint subintervals, one for each local mode; (ii) the Gaussian assumption behind gPI seems too restrictive for real-world data analysis; (iii) both gPI and qPI include a low-density valley area around $x = 0$ at the cost of other, more likely values around $x = \pm 2$.
Example 12. The Auto-MPG data. The task is to predict the miles-per-gallon (MPG) gas consumption of an automobile based on p = 7 features, including horsepower, weight, and acceleration. The dataset was used in the 1983 American Statistical Association Exposition and is available in the UCI Machine Learning Repository. Here we are mainly interested in predicting the MPG of a specific instance, row number 390, which corresponds to the 1982 Dodge Rampage. The UPM-predicted uncertainty distribution is displayed in Fig. 15(b).
The bimodality of the estimated conditional density reflects the fact that the 1982 Dodge Rampage is a hybrid two-door car, a mix between a passenger car and a truck. The next notable thing is the length of the PIs: the standard-error-based gPI is almost three times wider than the hdPI, and thus provides an overly pessimistic assessment of an otherwise accurate prediction. This also sheds some light on the apparently paradoxical fact (noticed by Wager et al., 2014) that a bootstrap standard-error-based Gaussian confidence interval yields a large error bar for the Dodge Rampage example even when the prediction is dead on.

Modern Applied Statistics: Theory, Practice, and Pedagogy
Theoretical research that has no relevance to practice and teaching is not persuasive enough to be taken as fundamental. To us, it is very important to know whether our research is able to link (at least partially) the triad of theory, practice, and teaching. In this article, through several examples, we have described 'the joy of systematic data analysis' derived from a general theory, which has the following implications for the practice and pedagogy of statistics:
• Our theory makes statistical methods easier to apply (arising from simplification) and makes the connections between different methods easier to see (arising from unification).
• Our theory provides holistic training[26] by introducing a large variety of statistical topics (embracing different cultures: algorithmic, parametric, nonparametric, information-theoretic, robust, exploratory) in an organized manner through a common semantics. This is extremely important for producing broad "integrator" foxes rather than highly specialized hedgehogs.[27] Ultimately, what matters is whether our theory provides a faster and easier (lazier?) way to learn the fundamentals of statistics. We believe that it does.[28]

In 1951, Evelyn Fix and Joseph Hodges wrote a technical report introducing the nearest-neighbor method. This was a groundbreaking paper that marked the beginning of statistical machine learning.[29] Fix and Hodges's work was a precursor to the celebrated kernel-based methods and decision-tree-based algorithms, which revolutionized the world of machine learning. The time has come to bring machine learning back to its statistical roots.[30] The current paper challenges Breiman's "two-culture" model in favor of an integrated learning system in which different cultures can be "glued" together by some underlying deeper principles.[31] Through numerous examples, we have shown that the two cultures are not contradictory but complementary to each other, and when combined properly they can yield a more powerful learning technology that is simultaneously flexible, scalable, and explainable.

If there is one thing that our work has made apparent, it is this: there is plenty of room in the middle; tremendous opportunities lie at the boundaries between the two cultures. In this paper we have described a new theory for constructing a universal interface between statistics and machine learning. What we have proposed is by no means the only possible, or even the "best," theory; it is merely a provisional one. As we go forward, it will be very important to come up with new, innovative theories that can reconcile and build new links between the various cultures of statistical modeling. All of this is still in its nascent stages, but it is definitely poised for rapid progress in the coming decade.

[26] For more discussion on this topic, see Mukhopadhyay (2017b).
[27] Readers may find it worthwhile to read Philip Tetlock's book on 'Why Foxes Are Better Forecasters Than Hedgehogs,' which is based on a 20-year study of 284 experts from diverse fields, including government officials, professors, journalists, and others. As Richard Feynman said, "Science is not a specialist business."
[28] At least it has a better chance than any other currently known global theory of data analysis. We derive confidence from the ability of our theory to tackle statistical problems as diverse as time-series analysis (Mukhopadhyay and Parzen, 2018), copula modeling, empirical Bayes, graph theory (Mukhopadhyay and Wang, 2020b), multiple testing (Mukhopadhyay, 2016, 2018), high-dimensional data analysis (Mukhopadhyay and Wang, 2020a), large-scale distributed learning (Bruce et al., 2019), etc. This is an ongoing and growing movement with the mission to structure the field in an understandable and effective way.
[29] Interestingly, just after that, Arthur Samuel coined the phrase "Machine Learning" in 1952.
[30] The next revolution of machine learning will need both the power of computing and the power of statistical thinking (and a little bit of common sense).
[31] There is no doubt that in the coming years we will witness many more new varieties and cultures of data analysis, but in the midst of those temporary excitements, we should not forget to keep the discipline internally consistent. Diversity without unity would be a complete mess, which is not healthy for our profession.

SUPPLEMENTARY APPENDIX
A.1. Software
We have developed an R software package, LPMachineLearning,[33] to perform all the tasks outlined in the paper. We hope this software will encourage applied data scientists to apply our method to their real prediction problems.

A.2. LP-Orthonormal System: Construction and Properties
LP-basis functions are a specially designed orthonormal system of rank polynomials. They are inherently 'nonparametric': constructed in a fully data-driven way, not pre-defined like classical polynomials. For a random variable Z of arbitrary type (discrete or continuous) with distribution function $F_Z$, we describe the construction of the LP-basis functions below.
Construction Algorithm. A fully automated construction of the LP-orthonormal rank polynomials of Z consists of the following steps.
Step 1. Mid-distribution transform: The mid-distribution function of Z is defined as
$$F^{\mathrm{mid}}_Z(z) \;=\; F_Z(z) \,-\, \tfrac{1}{2}\, p_Z(z),$$
where $p_Z(z)$ is the probability mass function.
Step 2. Standardizing $F^{\mathrm{mid}}_Z(Z)$: The random variable $F^{\mathrm{mid}}_Z(Z)$ has mean $\mathbb{E}[F^{\mathrm{mid}}_Z(Z)] = 1/2$ and variance $\mathrm{Var}[F^{\mathrm{mid}}_Z(Z)] = \tfrac{1}{12}\big(1 - \sum_z p^3_Z(z)\big)$. Define the first-order LP-basis function as
$$T_1(z; F_Z) \;=\; \frac{F^{\mathrm{mid}}_Z(z) - 1/2}{\sqrt{\mathrm{Var}\big[F^{\mathrm{mid}}_Z(Z)\big]}}.$$
Step 3. Gram-Schmidt orthonormalization: Construct the higher-order polynomials by applying Gram-Schmidt orthonormalization (see Appendix A of Mukhopadhyay and Parzen (2020) for a quick refresher on the Gram-Schmidt process) to the powers of $T_1(Z; F_Z)$.
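The three steps translate almost line for line into R. The sketch below is a self-contained illustration (not the LPMachineLearning implementation): Step 1 computes the mid-distribution transform, Step 2 standardizes it to get T_1, and Step 3 orthonormalizes the powers of T_1 against the empirical measure via a QR decomposition, which is numerically equivalent to Gram-Schmidt up to sign.

```r
eLP_basis <- function(z, m = 4) {
  n    <- length(z)
  pz   <- ave(rep(1 / n, n), z, FUN = sum)             # Step 1: empirical pmf p_Z(z_i)
  Fmid <- ecdf(z)(z) - 0.5 * pz                        #         mid-distribution transform
  s1   <- sqrt((1 - sum((table(z) / n)^3)) / 12)       # sd of F^mid_Z(Z)
  T1   <- (Fmid - 0.5) / s1                            # Step 2: first-order LP basis
  P    <- sapply(1:m, function(j) T1^j)                # powers of T_1 (m < number of unique values)
  Q    <- qr.Q(qr(cbind(1, P)))[, -1, drop = FALSE]    # Step 3: Gram-Schmidt via QR
  sqrt(n) * Q                                          # unit empirical norm for each column
}

# quick check of property 1): empirical means ~ 0 and crossprod / n ~ identity
z  <- rnorm(200)
Tz <- eLP_basis(z, m = 4)
round(colMeans(Tz), 3)
round(crossprod(Tz) / length(z), 3)
```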
Some properties. Here we list some important properties of the LP-orthonormal system $\{T_j(z; F_Z)\}$:
1) LP bases are orthonormal with respect to the measure (weighting function) $F_Z$:
$$\int T_j(z; F_Z)\, \mathrm{d}F_Z = 0, \qquad \text{and} \qquad \int T_j(z; F_Z)\, T_k(z; F_Z)\, \mathrm{d}F_Z = \delta_{jk}.$$
For related discussions, see Remark 4 of Mukhopadhyay and Parzen (2020, p. 82).
[Fig. 16 caption fragment: "... of the Auto-MPG data. They are plotted with respect to $\widetilde F_X(x)$ over the unit interval [0, 1], following equation (E5.11)."]
2) For real-data analysis, construct the empirical LP basis (in short, eLP basis) $\{T_j(z; \widetilde F_Z)\}_{j = 1, \ldots, m}$, where m is strictly less than the number of unique values in the sample $\{z_1, \ldots, z_n\}$. The eLP bases form an orthonormal system of functions with respect to the discrete empirical measure $\widetilde F_Z$.
4) Derivation of the empirical $T_1(z; \widetilde F_Z)$ from a sample of all-unique values $z_1, \ldots, z_N$ (i.e., realizations from a continuous Z).[34] In this case the empirical probability mass function (pmf) is $\widetilde p_Z(z_i) = 1/N$ and the empirical cdf is $\widetilde F_Z(z_i) = R_i/N$, where $R_i = \mathrm{rank}(z_i)$. Since $\widetilde F^{\mathrm{mid}}_Z(z_i) = \widetilde F_Z(z_i) - \tfrac12 \widetilde p_Z(z_i) = \tfrac{1}{N}\big(R_i - \tfrac12\big)$ and the correction factor is $1 - \sum_i \widetilde p^3_Z(z_i) = 1 - \tfrac{1}{N^2}$, this immediately implies (following eq. E5.2) that
$$T_1(z_i; \widetilde F_Z) \;=\; \sqrt{\frac{12}{N^2 - 1}}\, \Big( R_i - \frac{N+1}{2} \Big). \qquad (E5.7)$$
Eq. (E5.7) makes it clear why we call LP bases polynomials of ranks.
[34] In our regression context the response variable Y is continuous, and thus admits this analytic LP-basis form. However, this one-to-one correspondence between LP-polynomials and Legendre polynomials of the rank-transform (also known as the probability integral transform) $F_Z(Z)$ does not hold for discrete Z, i.e., for the covariates X.
5) The shapes of the LP-polynomials are data-adaptive. The left panel of Fig. 16 displays the top four eLP-basis functions for the covariate weight, taken from the Auto-MPG data (Example 12 in the main paper). The weight variable has very few ties, which is why the shapes of the LP bases completely match eqs. (E5.4)-(E5.6). Contrast this with the right panel of Fig. 16, which shows the top four eLP bases for the variable acceleration; due to the presence of a large number of ties, they take a distinctive piecewise-constant shape.
The last equality follows from Theorem 1.
Proof of Theorem 9: Define the LP unit-basis functions as
$$S_j(u; F_Y) \;=\; T_j\big(Q_Y(u); F_Y\big), \qquad 0 < u < 1, \qquad (E5.11)$$
which satisfy the following orthonormality relations (immediate from eq. 2.10):
$$\int_0^1 S_j(u; F_Y)\, \mathrm{d}u = 0, \qquad \int_0^1 S_j(u; F_Y)\, S_k(u; F_Y)\, \mathrm{d}u = \delta_{jk}. \qquad (E5.12)$$
Substituting $F_Y(y) = u$ in (2.9) expresses the contrast density function in the quantile domain:
$$d_{x,z}(u) \;=\; 1 + \sum_j \mathrm{LP}_{j\mid x,z}\, S_j(u; F_Y), \qquad 0 < u < 1.$$
It is now easy to see that the $L_2$ distance between $d_{x,0}$ and $d_{x,1}$ can be written as
$$\big\| d_{x,0} - d_{x,1} \big\|_2^2 \;=\; \sum_j \big| \mathrm{LP}_{j\mid x,0} - \mathrm{LP}_{j\mid x,1} \big|^2,$$
due to (E5.12). Hence proved.

A.4. Additional Remarks
Remark A.4.1 (Koenker's quantile regression for the butterfly data): Fig. 17 shows the quantile regression curves for the butterfly data, computed with the algorithm proposed in Koenker and d'Orey (1987). The left plot shows the data in the xy-domain, and the right plot shows the same data in the quantile domain, where x is replaced by its rank transform $\widetilde F_X(x)$. The red curves are spline smoothers (estimated using the smooth.spline() R function). A few interesting points to note:
• Smoothness: Regression via the rank transform (see Theorem 5) seems to produce a much smoother (more parsimonious) regression curve compared to the rough, zigzag xy-domain estimate. Is it universal that nature reveals herself in a more parsimonious way in the quantile domain? Beyond intuitive understanding, we still do not know of any concrete mathematical explanation for this surprising phenomenon.
• Data-sparsity: The quantile-domain treatment seems to address the 'data sparsity' problem (which exists for x > 30) quite well. Surprisingly, after the quantile transformation, a data-sparse regression problem turns into a data-dense one.
• Visualization: The rank-transform scatter plot makes it far easier to visualize and understand the relationship between x and y.
• Robust+Smooth: This shows an additional benefit of X-robustness (Theorem 5) as a tool for promoting smoothness.
Remark A.4.3 (ML-assisted quantile regression: comparison). The first row of Fig. 19 shows the generalized random forest-based (Athey et al., 2019) and gradient-boosting-based quantile regression estimates. These are specially designed ML algorithms that produce the desired conditional quantile curves. The second row shows our UPM-based versions, which are derived from the general scheme described in Section 2.3 of the main paper.
Remark A.4.4 (Robust Lasso): As already noted, the online news data contain an outlier: the news article in row #31,038. How does it impact traditional estimation and feature-selection methods? As we will see, a single data point can have a devastating impact on the overall modeling. Fig. 20 (left panel) shows the lasso-estimated $\widehat\beta$ under two scenarios: first on the outlier-removed data, and second on the full data (with the outlier). Notice that the estimated coefficient values (as well as their sparsity levels) are dramatically affected by just one data point. One way to address this fragility is to apply lasso to the first-order LP-transformed matrix $T_X = [T_1(x_1; \widetilde F_{X_1}), \ldots, T_1(x_p; \widetilde F_{X_p})]$ instead of the raw feature matrix X. The result is displayed in the right panel, which shows extraordinary consistency, with (almost) no impact on the estimated coefficients. This mid-rank-transform-based robust lasso could be a reliable alternative for dealing with noisy, messy data.
[Figure 18 caption: Boston housing data: house prices drop sharply with high crime rates.]
[Figure 20 caption fragment: "... shows the result for the lasso(y ~ X) case, and the right one for the robustified lasso(y ~ T_X). For more details see Remark A.4.4."]
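A minimal R sketch of this robustification on synthetic stand-in data, with cv.glmnet() playing the role of lasso(): a single planted outlier visibly moves the raw-feature lasso coefficients, while the lasso on the first-order LP (mid-rank) transforms barely changes, since ranks are insensitive to the magnitude of the contaminated row.

```r
library(glmnet)

T1 <- function(x) {                                   # first-order LP (mid-rank) transform
  u <- (rank(x) - 0.5) / length(x)                    # mid-distribution transform
  (u - mean(u)) / sd(u)                               # standardize to mean 0, unit variance
}

set.seed(4)                                           # synthetic stand-in data
n <- 300; p <- 8
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- X[, 1] - 0.5 * X[, 2] + rnorm(n)
X_out <- X; X_out[1, ] <- 50                          # plant one gross outlier (cf. row #31,038)

TX     <- apply(X,     2, T1)                         # LP-transformed designs: driven by ranks,
TX_out <- apply(X_out, 2, T1)                         # so the contaminated row has little leverage

coef_raw    <- cbind(clean = as.numeric(coef(cv.glmnet(X,     y))),
                     dirty = as.numeric(coef(cv.glmnet(X_out, y))))
coef_robust <- cbind(clean = as.numeric(coef(cv.glmnet(TX,    y))),
                     dirty = as.numeric(coef(cv.glmnet(TX_out, y))))
round(coef_raw, 2)                                    # raw lasso: coefficients shift with the outlier
round(coef_robust, 2)                                 # LP-based lasso: (almost) no impact
```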