Scaling Up Content Analysis: Text Thresher Joins Forces with Hypothes.is and Crowdcrafting

June 8, 2015

The Text Thresher team and I are excited to announce that we have joined forces with Hypothes.is and Crowdcrafting (with funding from the Alfred P. Sloan Foundation and the Open Annotation Fund) to develop our prototype software in a way that bridges the broader annotation and citizen science communities. Together, we will be helping researchers extract richer data from larger text corpora. Over the course of a six-month grant, we will be converting our prototype Text Thresher software into the Annotator Content Analysis (ACA) module. This open source tool, hosted on GitHub, will enable researchers to break down daunting text-annotation projects into smaller, more manageable content analysis tasks that can be performed by crowd workers, citizen scientists, and annotation hobbyists through the Crowdcrafting platform.

The Crowd Content Analysis Assembly Line and ACA Module

Our team is re-organizing traditional, slow-going content analysis into two steps. First, experts (trained research assistants) identify text units in larger documents that correspond with just one branch of a researcher’s larger semantic scheme—a branch specifying variables and attributes that describe just one unit of analysis.

Next, the ACA module will display those smaller text units (rich with information about the variables and attributes of just one unit of analysis) to crowd workers and walk them through a series of leading questions about the text. By answering these questions and highlighting (using ACA) the words justifying their answers, crowd workers extract detailed variable/attribute information relevant to the researcher’s semantic scheme while labeling the text that corresponds to those variables/attributes. Thus, the crowd completes work equivalent to content analysis much faster than a small research team could. This content analysis work is achievable as crowd work because researchers reduce text units from document length to a few sentences, because those few sentences are only relevant to a small branch of the larger semantic scheme, and because so many people are familiar with reading-comprehension tasks.
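To make the second step more concrete, here is a minimal Python sketch of how a text unit, the crowd's answers, and their highlights might be represented. It is only an illustration under stated assumptions, not the ACA module's actual design: the class names, the example question, and the majority-vote step are all hypothetical and are not described in the project plan above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextUnit:
    """A short passage that an expert matched to one branch of the scheme."""
    unit_id: str
    branch: str  # hypothetical branch name
    text: str


@dataclass
class Answer:
    """One crowd worker's answer plus the highlighted span that justifies it."""
    worker: str
    question: str
    value: str
    start: int  # offset into TextUnit.text where the highlight begins
    end: int    # offset where the highlight ends (exclusive)


def highlighted(unit: TextUnit, answer: Answer) -> str:
    """Return the words the worker highlighted to justify the answer."""
    return unit.text[answer.start:answer.end]


def consensus(answers: List[Answer], question: str) -> Optional[str]:
    """Most common value given for a question (assumed majority vote)."""
    values = [a.value for a in answers if a.question == question]
    if not values:
        return None
    return Counter(values).most_common(1)[0][0]


# Hypothetical example: one text unit and three workers' answers.
unit = TextUnit(
    unit_id="u1",
    branch="arrest_event",
    text="Police arrested 12 protesters near city hall on Friday.",
)
question = "How many people were arrested?"
start = unit.text.index("12")
answers = [
    Answer("worker_a", question, "12", start, start + 2),
    Answer("worker_b", question, "12", start, start + 2),
    Answer("worker_c", question, "20", start, start + 2),
]

print(consensus(answers, question))
print(highlighted(unit, answers[0]))
```

Each answer carries both a variable value and the span of text supporting it, mirroring the pairing of answers and highlights described above. How disagreements among workers should actually be resolved is left open here.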

The Possibilities

The possibilities for the ACA module extend as far as the availability of text data and the imaginations of researchers. Some researchers will be interested in legal documents, others policy documents and speeches. Some may have less interest in a particular class of documents and more interest in units of text ranging across them—perhaps related to the construction and reproduction of gender, class, or ethnic categories. Some may wish to study students’ written work en masse to better understand educational outcomes or the email correspondence of non-governmental organizations to optimize communication flows.

Whatever the corpus and topic, the ACA module can help researchers generate rich, large databases from text. With such data, longitudinal and cross-cultural trends in the expression of political power, gender, race, and more may be measured, modeled, and visualized to animate public awareness of issues and events affecting everything from individual well-being to economic prosperity and international cooperation. And galleries, libraries, archives, museums, and classrooms may also deploy the ACA module, advancing scientific literacy and engaging more people in social scientists’ efforts to better understand our world.


What is Hypothes.is?

Hypothes.is is building an open platform for discussion on the web. It leverages open source annotation tools and standards to enable sentence-level critique or note-taking on top of news, blogs, scientific articles, books, terms of service, ballot initiatives, legislation, and more. Everything Hypothes.is builds is free, open, non-profit, neutral, lasting, and community-oriented.

What is Crowdcrafting?

Crowdcrafting is a web-based service that invites volunteers to contribute to scientific projects developed by citizens, professionals, or institutions that need help to solve problems, analyze data, or complete challenging tasks that machines alone cannot do because they require human intelligence. The platform is 100% open source (its software is developed and distributed freely) and 100% open science, making scientific research accessible to everyone. Crowdcrafting runs on its own PyBossa software, an open source framework for crowdsourcing projects. Institutions like the British Museum, CERN, and the United Nations (UNITAR) are also PyBossa users.