Machine Learning and Science Forum: Robust Mean Estimation in Nearly-Linear Time, with Applications to Outlier Removal

ML&Sci Forum

April 20, 2020
1:30pm to 2:30pm
Virtual Presentation

Participate remotely using this Zoom link.

Abstract: Robust mean estimation is the following basic estimation question: given i.i.d. copies of a random vector X in d-dimensional Euclidean space of which a small constant fraction are corrupted, how well can you estimate the mean of the distribution? This is a classical problem in statistics, going back to the 60's and 70's, and has recently found application to many problems in reliable machine learning. However, in high dimensions, classical algorithms for this problem either were (1) computationally intractable, or (2) lost poly(d) factors in their accuracy guarantees. Recently, polynomial time algorithms have been demonstrated for this problem that still achieve (nearly) optimal provable error guarantees. However, the running times of these algorithms were either at least quadratic in dimension or in 1/(desired accuracy). In this talk we give the first truly nearly linear time algorithm for robust mean estimation which achieves nearly optimal statistical performance. Our algorithm is based on the matrix multiplicative weights method. 

From our provable algorithm, we also extract an implementable subroutine for outlier removal/data cleaning. In the second part of the talk I will describe several experiments showing that this method compares favorably to existing methods for outlier detection in e.g. sklearn.

Based on Quantum Entropy Scoring for Fast Robust Mean Estimation and Improved Outlier Detection. Yihe Dong, Samuel B. Hopkins, Jerry Li. NeurIPS 2019.

The Machine Learning and Science Forum (formerly the Berkeley Statistics and Machine Learning Forum) meets biweekly to discuss current applications across a wide variety of research domains in the physical sciences and beyond. Hosted by UC Berkeley Physics Professor and BIDS Senior Fellow Uros Seljak, these active sessions bring together domain scientists, statisticians, and computer scientists who are either developing state-of-the-art methods or are interested in applying these methods in their research. To receive email notifications about upcoming meetings, or to request more information, please contact berkeleymlforum@gmail.comAll interested members of the UC Berkeley and Berkeley Lab communities are welcome and encouraged to attend.

Speaker(s)

Sam Hopkins

EECS, UC Berkeley

Sam Hopkins is a Miller Postdoctoral Fellow in the EECS department at UC Berkeley. His research interests lie in algorithms, especially algorithms for high-dimensional statistics. He received a PhD in computer science from Cornell University, and his work has been supported by an NSF Graduate Research Fellowship and a Microsoft PhD Fellowship.