V-statistics and Variance Estimation

Zhengze Zhou, Lucas Mentch, Giles Hooker

arXiv
May 7, 2020

Abstract: This paper develops a general framework for analyzing the asymptotics of V-statistics. Previous literature on limiting distributions mainly focuses on the case where n→∞ with the kernel size k held fixed. Under some regularity conditions, we demonstrate asymptotic normality when k grows with n by utilizing existing results for U-statistics. The key to our approach is a mathematical reduction to U-statistics, obtained by designing an equivalent kernel for V-statistics. We also provide a unified treatment of variance estimation for both U- and V-statistics by observing connections to existing methods and proposing an empirically more accurate estimator. Ensemble methods such as random forests, where multiple base learners are trained and aggregated for prediction purposes, serve as a running example throughout the paper because they are a natural and flexible application of V-statistics.

Lay Summary: There has been recent interest in developing estimates for the variance of random forests and other ensemble methods. These have centered on trees built on subsamples of the data, a setting in which theoretical results give an expression for this variance. Unfortunately, estimates of this variance are biased upwards for the number of trees typically used in a random forest. This paper shows that the problem can be fixed by the simple change of taking subsamples with replacement instead of without replacement, and does the theoretical work to update the expressions for the variance.
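
To make the with-replacement versus without-replacement distinction concrete, here is a minimal Python sketch. It is only an illustration under assumed choices, not the paper's estimator: the kernel h(a, b) = (a - b)^2 / 2, the fixed kernel size k = 2, and the function names are all hypothetical. Averaging the kernel over every pair drawn without replacement gives a U-statistic; averaging over every ordered pair drawn with replacement gives the corresponding V-statistic.

```python
import itertools

import numpy as np


def kernel(a, b):
    # Illustrative symmetric kernel of size k = 2 (an assumption for this
    # sketch); its expectation over i.i.d. draws is the population variance.
    return 0.5 * (a - b) ** 2


def u_statistic(x):
    # Average the kernel over all pairs chosen WITHOUT replacement.
    pairs = itertools.combinations(x, 2)
    return float(np.mean([kernel(a, b) for a, b in pairs]))


def v_statistic(x):
    # Average the kernel over all ordered pairs chosen WITH replacement,
    # so pairs such as (x_i, x_i) are included.
    pairs = itertools.product(x, repeat=2)
    return float(np.mean([kernel(a, b) for a, b in pairs]))


rng = np.random.default_rng(0)
x = rng.normal(size=30)

u = u_statistic(x)  # matches the sample variance with divisor n - 1
v = v_statistic(x)  # matches the sample variance with divisor n
print(f"U-statistic: {u:.4f}  V-statistic: {v:.4f}")
```

For this particular kernel the two statistics differ only by the factor (n - 1)/n, so they coincide as n grows. In an ensemble such as a random forest the kernel would instead be a base learner trained on a subsample, and the kernel size k would be the subsample size.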



Featured Fellows

Giles Hooker

Statistics, UC Berkeley
Faculty Affiliate