Garbage In, Garbage Out? Do Machine Learning Research Papers Report Where Training Data Comes From?

Supervised machine learning is widely used across fields, but biased, inaccurate, and incomplete training data are causing major problems. In this project, we investigate to what extent published machine learning application papers give specific details about the training data they used, focusing heavily on papers that involve humans labeling specific cases (e.g., is an e-mail spam or not?). A team of undergraduate research interns is reviewing a large corpus of papers and recording answers to questions such as: does the paper report how many human labelers were involved, what their qualifications were, whether they were given formal instructions or definitions, whether they independently checked each other’s work, how often they agreed or disagreed, and how disagreements were resolved? Much of machine learning focuses on what to do once labeled training data is obtained, but this project tackles the equally important question of whether such data is reliable in the first place. This information is crucial for building trustworthy, high-quality classifiers, but it is often not reported.
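To make "how often they agreed or disagreed" concrete, here is a minimal sketch of two agreement measures a paper could report for a pair of labelers: raw percent agreement and Cohen's kappa, which corrects for agreement expected by chance. The function names and the spam/ham labels are hypothetical and invented for illustration. They are not data or code from this project, which does not prescribe a particular agreement statistic.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two labelers assigned the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two labelers: (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    p_o = percent_agreement(labels_a, labels_b)  # observed agreement
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability both labelers pick the same category
    # if each labeled independently at their own observed rates.
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_e == 1.0:
        # Both labelers used one identical category for every item.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Invented example: two labelers classify the same eight e-mails.
labeler_a = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "ham"]
labeler_b = ["spam", "ham", "ham", "ham", "spam", "ham", "spam", "ham"]

print("percent agreement:", percent_agreement(labeler_a, labeler_b))
print("Cohen's kappa:", cohens_kappa(labeler_a, labeler_b))
```

Kappa comes out lower than raw agreement here because both labelers mark most e-mails as "ham", so a good share of their matches would be expected by chance alone. That gap is one reason a paper that reports only raw agreement, or no agreement figure at all, leaves label reliability hard to judge.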

BIDS Affiliates


R. Stuart Geiger