Proselint: The Linting of Science Prose and the Science of Linting Prose

SciPy 2016

Lecture

July 13, 2016
3:30pm to 4:00pm
Austin, TX

Writing is notoriously hard, even for the best writers, and it's not for lack of good advice — a tremendous amount of knowledge is strewn across usage guides, dictionaries, technical manuals, essays, pamphlets, websites, and the hearts and minds of great authors and editors. But this knowledge is trapped, waiting to be extracted and transformed.

We built Proselint, a Python-based linter for prose that identifies violations of expert style and usage guidelines. It is open-source software released under the BSD license and works with Python 2 and 3. It runs as a command-line utility or as an editor plugin (e.g., for Sublime Text, Atom, Vim, and Emacs) and outputs its advice in standard formats (e.g., JSON). Though in its infancy (perhaps 2% of what it could be), Proselint already includes modules addressing redundancy, jargon, illogic, clichés, sexism, misspelling, inconsistency, misuse of symbols, malapropisms, oxymorons, security gaffes, hedging, apologizing, and pretension.
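As a sketch of how a check in a prose linter can work (an illustrative example under our own assumptions, not Proselint's actual implementation; the check name and message below are invented), a module can be a function that scans text with a pattern and reports each violation with its position, ready to serialize as JSON:

```python
import json
import re


def check_illogic(text):
    """Illustrative lint check: flag "very unique", since "unique"
    is an absolute adjective that resists intensification."""
    results = []
    for match in re.finditer(r"\bvery unique\b", text, re.IGNORECASE):
        results.append({
            "check": "illogic.very_unique",  # invented check name
            "message": '"unique" is absolute; drop "very".',
            "start": match.start(),
            "end": match.end(),
        })
    return results


# Example: scan a sentence and emit the advice as JSON.
text = "Our approach is very unique and avoids clichés."
print(json.dumps(check_illogic(text), indent=2))
```

A real linter composes many such checks and maps character offsets back to line and column numbers for display in an editor.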

Proselint can be seen as both a language tool for scientists and a tool for language science. On the one hand, it includes modules that promote clear and consistent prose in science writing. On the other, it measures language usage and explores the factors relevant to creating a useful linter.

Presenter(s):
M Pacer, University of California, Berkeley
Jordan Suchow, University of California, Berkeley

The annual SciPy Conference brings together over 650 participants from industry, academia, and government to showcase their latest projects, learn from skilled users and developers, and collaborate on code development. The full program consists of 2 days of tutorials, 3 days of talks, and 2 days of developer sprints.


Speaker(s)

M Pacer

BIDS Alum - Postdoctoral Scholar

M Pacer was a computational cognitive scientist working as a core developer on the Jupyter Project. Her work focused on developing mechanisms for integrating computational narratives (e.g., Jupyter notebooks) into the scientific publishing pipeline. A long-range goal was to build a data set suitable for scientific language processing, which requires joint inference over natural language (the scientific prose in which connections to previous work and theory are usually established), mathematical language (the equations, formalisms, and theorems that precisely express theoretical relations), and programming languages (the explicit or implicit computations that connect data to theories).