BIDS Fellow Nick Adams released TextThresher, an open-access natural language processing (NLP) platform, in August 2016.
What is TextThresher?
TextThresher (http://www.textthresher.org/) is mass collaboration software that lets researchers direct hundreds of volunteers, working over the internet, to label tens of thousands of text documents according to all the concepts vital to the researchers’ theories and questions. With TextThresher, projects that would have required a decade of effort, and the close training of wave after wave of research assistants, can be completed in about a year online.
How Will People Use TextThresher?
TextThresher is specifically designed for large and complex content analysis jobs that cannot be completed with existing automated algorithms. It is the ideal tool whenever automated approaches to textual data fail to recognize concepts vital to social scientists’ intricate theories, fail to tease out ambiguous or contextualized meanings, or fail to effectively parse relationships among, or sequences of, social entities.
If you are interested in performing a shallow sentiment analysis of Tweets, or developing an exploratory topic model of some corpus, you won’t need TextThresher. If you have a few dozen interviews to analyze, TextThresher is probably overkill. But if you want to extract hierarchically organized, openly validated, research-grade records of related social entities and concepts appearing across thousands of longer documents, TextThresher is for you. Especially in this first beta version, it is ideally suited for the analysis of news events, historical trends, or the evolution of legal theories. Here’s how it works:
The crowd content analysis assembly line that TextThresher enables is organized around two major steps. First, annotators work through the researcher’s documents and identify text units (words, phrases, or sentences) that correspond to the relatively small number of nodes at the highest level of the researcher’s hierarchically organized conceptual/semantic scheme. These high-level nodes describe the researcher’s units of analysis: the social units (individuals, events, organizations, etc.) that the variables and attributes at the lower-level nodes of the scheme describe. In contrast to old-style content analysis, an annotator using TextThresher does not even attempt the conceptually overwhelming task of applying dozens of different labels to a full document. They just label the text units corresponding to the (usually) three to six highest-level concepts important to the researcher. This is comparatively easy work.
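To make the first step concrete, here is a minimal Python sketch. It is an illustration under assumptions, not TextThresher's actual data model: the concept names, the `Highlight` class, and the `make_highlight` helper are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical top-level concepts (units of analysis) in a researcher's scheme.
TOP_LEVEL_CONCEPTS = {"Event", "Organization", "Individual"}


@dataclass(frozen=True)
class Highlight:
    """One step-one label: a span of a document tagged with a top-level concept."""
    doc_id: str
    start: int    # character offset, inclusive
    end: int      # character offset, exclusive
    concept: str  # must be one of TOP_LEVEL_CONCEPTS


def make_highlight(doc_id, text, start, end, concept):
    """Validate a highlight and return it together with the text unit it covers."""
    if concept not in TOP_LEVEL_CONCEPTS:
        raise ValueError(f"unknown top-level concept: {concept!r}")
    if not 0 <= start < end <= len(text):
        raise ValueError("highlight offsets fall outside the document")
    return Highlight(doc_id, start, end, concept), text[start:end]


document = "Police arrested twelve protesters. The mayor later apologized."
end_of_first_sentence = document.index(".") + 1
highlight, text_unit = make_highlight(
    "doc-1", document, 0, end_of_first_sentence, "Event"
)
```

The annotator only chooses among a handful of top-level concepts; the extracted `text_unit` is what a second-step worker would later read.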
In the second step, TextThresher displays those much smaller text units, each corresponding to just one case of one unit of analysis, to citizen scientists or crowd workers, and guides them through a series of leading questions about the text unit. Since TextThresher already knows the text unit is about a certain type of unit of analysis (or ‘object,’ to use computer science speak), it only asks questions prompting users to search for details about the variables and attributes of that unit of analysis. By answering this relatively short list of questions and highlighting the words that justify their answers, citizen scientists label the text exactly as highly trained research assistants would. But their work goes much faster and they are more accurate, because (1) they are only reading relatively short text units; (2) they are only looking for a relatively short list of variables, all guaranteed to be relevant to the text unit they are analyzing; and (3) the work is organized as a ‘reading comprehension’ task familiar to everyone who has graduated from middle school.
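The second step can be sketched in the same spirit. The snippet below is a hypothetical illustration, not TextThresher's interface code: the `QUESTIONS` table, the variable names, and both helper functions are assumptions made for the example.

```python
# Hypothetical lower-level variables, grouped by the top-level concept they describe.
QUESTIONS = {
    "Event": [
        ("event_type", "What kind of event does this passage describe?"),
        ("num_arrests", "How many people were arrested, if any?"),
    ],
    "Organization": [
        ("org_name", "What is the organization called?"),
    ],
}


def questions_for(concept):
    """Return only the questions relevant to this type of unit of analysis."""
    return QUESTIONS.get(concept, [])


def record_answer(text_unit, variable, answer, justification):
    """Store an answer along with the offsets of the words that justify it."""
    start = text_unit.find(justification)
    if start == -1:
        raise ValueError("justification must be quoted from the text unit")
    return {
        "variable": variable,
        "answer": answer,
        "start": start,                     # inclusive offset within the text unit
        "end": start + len(justification),  # exclusive offset within the text unit
    }


text_unit = "Police arrested twelve protesters."
event_questions = questions_for("Event")
answer = record_answer(text_unit, "num_arrests", 12, "twelve")
```

Because the text unit is already known to describe an event, the worker sees only the event questions, and each answer carries the highlighted words that support it.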
TextThresher uses a number of transparent approaches to validate annotators’ labels, including gold standard pre-testing, Bayesian voting weighted by annotator reputation scores, and active learning algorithms. All the labels are exportable as annotation objects consistent with W3C annotation standards, and maintain their full provenance. So, in addition to scaling up content analysis for all the ‘big text data’ out there, TextThresher also brings the old method into the light of ‘open science.’
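As a rough illustration of reputation-weighted voting and annotation export, consider the sketch below. It is not TextThresher's validation algorithm. It assumes a simple naive Bayes model in which each annotator's reputation is the probability that their label is correct, wrong labels are spread evenly over the remaining options, and the prior over labels is uniform. The function names and example values are hypothetical; the exported dictionary uses field names from the W3C Web Annotation Data Model (`@context`, `type`, `creator`, `body`, `target`, `TextPositionSelector`).

```python
import math


def weighted_vote(votes, reputations, labels):
    """Posterior probability of each label under a naive Bayes vote.

    Assumes at least two labels and reputations strictly between 0 and 1.
    """
    k = len(labels)
    log_post = {}
    for label in labels:
        score = 0.0  # uniform prior adds the same constant to every label
        for annotator, vote in votes.items():
            p = reputations[annotator]  # assumed probability of a correct label
            score += math.log(p if vote == label else (1.0 - p) / (k - 1))
        log_post[label] = score
    peak = max(log_post.values())  # subtract the peak for numerical stability
    unnormalized = {label: math.exp(s - peak) for label, s in log_post.items()}
    total = sum(unnormalized.values())
    return {label: u / total for label, u in unnormalized.items()}


def to_web_annotation(doc_url, start, end, label, annotator):
    """Express one annotator's label as a Web Annotation style dictionary."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "creator": {"type": "Person", "nickname": annotator},  # provenance
        "body": {"type": "TextualBody", "value": label, "purpose": "tagging"},
        "target": {
            "source": doc_url,
            "selector": {"type": "TextPositionSelector", "start": start, "end": end},
        },
    }


votes = {"ann_a": "protest", "ann_b": "riot", "ann_c": "riot"}
reputations = {"ann_a": 0.95, "ann_b": 0.60, "ann_c": 0.60}
posterior = weighted_vote(votes, reputations, ["protest", "riot"])
winner = max(posterior, key=posterior.get)
annotation = to_web_annotation(
    "https://example.org/doc-1", 0, 34, votes["ann_a"], "ann_a"
)
```

With these made-up numbers, the single annotator with a 0.95 reputation outweighs the two annotators at 0.60 who disagree: the posterior for "protest" is about 0.89. A simple majority vote would have gone the other way.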
How Can I Get My Hands on TextThresher?
Today, I am announcing that TextThresher lives. It moves data through all of its interfaces as it should. The interfaces are fully functional. (See Demo below.) And TextThresher can be deployed on Scifabric (PYBOSSA), our partner citizen (volunteer) science platform. In the weeks and months to come, we will be testing TextThresher’s user experience, refining our label validation algorithms, and using TextThresher to collect data for the GoodlyLabs’ DecidingForce and PublicEditor projects. Once we feel confident that TextThresher is working smoothly (probably around October 2017), we will invite researchers to apply to become beta users of the software. (If you already know you are excited to use TextThresher, feel free to shoot me an email and I will keep you updated about upcoming opportunities.) We hope to release TextThresher 1.0 to the general public in early 2018.
TextThresher would not exist without the support and hard work of many people.
I wish first to thank our institutional sponsors. The Hypothes.is “Open Annotation” Fund, the Alfred P. Sloan Foundation, and the Berkeley Institute for Data Science (BIDS) all provided seed funding that allowed us to hire creative and skilled developers. BIDS also provided workspace for meetings and supported me financially. The D-Lab and the Digital Humanities @ Berkeley also provided essential resources when the project was in its very early stages.
TextThresher’s viability also owes much to the encouragement of the annotation and citizen science communities. Dan Whaley, Benjamin Young, Nick Stenning, and Jake Hartnell of Hypothes.is are especially to blame for motivating and guiding our early efforts. Daniel Lombraña of Scifabric, Chris Lintott of Zooniverse, and Jason Radford of Volunteer Science also bolstered our hopes that the citizen science community would appreciate and use our tools.
And of course, TextThresher would not exist without the collective efforts, lost sleep, and careful programming of our talented and dedicated development team. From our earliest prototype until today, we have been fueled by the voluntary and semi-voluntary efforts of students and freelance developers across the Berkeley campus and Bay Area. As the person who got it all started at a point when I could just barely script my way out of a paper bag, I especially wish to thank Daniel Haas, Fady Shoukry, and Tyler Burton for their early efforts architecting TextThresher’s backend and frontend (and for believing in the vision).
Steven Elleman deserves kudos for our rather sophisticated (if I do say so!) highlighter tool. Jasmine Deng has built the reading comprehension interface that makes TextThresher so easy to use compared to QDAS packages. Flora Xue, with the mentorship of the busy and brilliant Stefan van der Walt, has refactored our data model through multiple improving iterations. And we can all count on TextThresher to become increasingly efficient thanks to the human-computer interactions enabled by Manisha Sharma’s hand-rolled ‘NLP hints’ module.
All of this work has been helped along, too, by a number of volunteers like Allen Cao, Youdong Zhang, Aaron Culich, Arjun Mehta, Piyush Patil, and Vivian Fang, who have taken on quick but essential tasks across the TextThresher codebase. Finally, I have to express my deep gratitude for Norman Gilmore, our development team lead. Norman has not only played an essential role in architecting, writing, and improving code throughout TextThresher; he has also served as a patient and caring mentor to all of our developers, helping our team establish and maintain agile scrum practices, proper git etiquette, and a happy, grooving work rhythm. Thanks, Norman! And thanks to all our friends, family, and colleagues who have been rooting for us. We did it! Our work is done! ;) (Haha!)