TextXD brings together researchers from a wide range of disciplines who work with text as a primary source of data. We work to identify common principles, algorithms, and tools that advance text-intensive research, and to break down the boundaries between domains so as to foster exchange and new collaborations among like-minded researchers. Talks will range from the theory of NLP and deep learning to applied analyses and new software packages.
2018 TextXD Symposium
December 5-7, 2018
190 Doe Library, UC Berkeley
- Wednesday, Dec. 5th: Training workshops - "Intro to Text Analysis"
- Thursday, Dec. 6th: Invited talks, discussions, and applied collaboration sessions
- Friday, Dec. 7th: Invited talks, discussions, applied collaboration sessions
- Laurent El Ghaoui, UC Berkeley Electrical Engineering and Computer Sciences (EECS)
- Niek Veldhuis, UC Berkeley Near Eastern Studies
AGENDA -- Titles link to video recordings of the presentations.
Day 1: Wed, Dec 5th (Learn)
Learn text analysis and deep learning tools
1:00-2:00 — Caroline Le Pennec-Caldichoury: Text as Data, An introduction
2:00-3:00 — Geoff Bacon (UC Berkeley, Linguistics): Intro to Web Scraping
3:00-3:15 — Break
3:15-4:15 — Jaren Haber (UC Berkeley, Sociology): Intro to Neural-Net Word Embeddings
4:15-5:15 — Geoff Bacon (UC Berkeley, Linguistics): Intro to Text Classification
5:15-5:30 — Discussion
Day 2: Thu, Dec 6th (AM: Talks, PM: Create)
Tools and approaches from across disciplines + hands-on learning
8:45-9:00 — Aaron Culich et al. (D-Lab): Introduction
9:00-9:10 — David Mongeau, Chris Kennedy, Maryam Vareth: Welcome and Introduction
9:10-9:30 — Rochelle Terman (U. Chicago, Political Science): The Outrage Machine: Human Rights Shaming from Media, Governments, and NGOs
9:30-9:50 — Adam Anderson (UC Berkeley, Near Eastern Studies): Learning Curve: Student Responses to the Digital Humanities Curriculum
9:50-10:10 — Alex Paxton (U. Connecticut, Psychological Sciences): The octo-source community: Exploring open-source software community health on GitHub
10:10-10:30 — Deborah Sunter (Tufts, Mechanical Engineering): Text Analysis to Understand International Variations in Interpreting Sustainable Development
10:50-11:10 — Caroline Le Pennec-Caldichoury (UC Berkeley, Economics): Electoral competition and campaign messages in French legislative elections
11:10-11:30 — Geoff Bacon (UC Berkeley, Linguistics): Probing sentence embeddings for structure-dependent tense
11:30-11:50 — Jaren Haber (UC Berkeley, Sociology): Inductive dictionary creation with word embedding models
1:15-1:45 — Keynote Niek Veldhuis (UC Berkeley, Near Eastern Studies): Sumerian Word Embeddings
1:45-2:05 — Manoj Tiwari (Google): Lessons learned from building real-life ML systems
2:05-2:25 — Christopher Hench (Amazon Alexa): An Overview of the Alexa Architecture
2:45-5:20 — Collaborative work session
5:30-7:00 — Social Event: Happy hour at Tap Haus (2518 Durant Ave, a block south of campus)
Day 3: Fri, Dec 7th (AM: Talks, PM: Create)
Tools and approaches from across disciplines + hands-on learning
9:00-9:10 — Heather Haveman: Welcome
9:10-9:30 — Laura Nelson (Northeastern, Sociology & Anthropology): Finding simple patterns in complex data
9:30-9:50 — Devin Cornell (UCSB, Sociology): The Evolution of Political Discourse in the Colombian Party Centro Democrático
9:50-10:10 — Alina Arseniev-Koehler (UCLA, Sociology): Gender, Morality, Social Class, and other Cultural Dimensions in Word Embeddings
10:10-10:30 — Rob Voigt (Stanford, NLP): Computational Linguistics for Police-Community Interaction
10:50-11:10 — Abigail See (Stanford, NLP): Controlling text generation for a better chatbot
11:10-11:30 — Russell Lee-Goldman (Google): Linguistic research and application at Google
11:30-11:50 — AJ Alvero (Stanford, Education): Sociocultural Considerations of the College Admissions Essay
11:50-12:20 — Keynote Laurent El Ghaoui (UC Berkeley, EECS & BAIR): Text Analytics: A Guided Tour
1:25-1:45 — Tanya Roosta & Emmanuel Vallod (SumUp Analytics): Topic analysis and beyond, in real-time
1:45-2:05 — Milena Gianfrancesco (UCSF) & Suzanne Tamang (Stanford): Using text mining methods to detect a clinical infection
2:05-2:25 — Jae Ho Sohn (UCSF, Radiology): Natural Language Processing in Radiology: Why, What, and How?
2:25-2:45 — Hunter Mills & Justin Krogue (UCSF): Using ML on UMLS Terminology to Determine Hip Fracture Status from Clinical Notes
3:05-5:20 — Collaborative work session
5:20-5:30 — Team Demos & Closing Remarks
5:30-7:00 — Social Event: Data science reception hosted by D-Lab (Barrows Hall 356)
Niek Veldhuis' keynote presentation on "Sumerian Word Embeddings" (Photo: J. Dugan)
Registration was free, but pre-registration was required due to limited seating. Registrants were accepted via email confirmation on a space-available basis.
Organizing Committee: Chris Kennedy (Chair), Heather Haveman, Caroline Le Pennec, Maryam Vareth, Jaren Haber, Geoff Bacon, Aaron Culich, Jonathan Dugan, and Adam Lavertu.
Contact: Chris Kennedy (email@example.com)
This event was hosted by BIDS and co-sponsored by the D-Lab, UCSF Bakar Computational Health Sciences Institute, and the UC Berkeley School of Information.
Heather A. Haveman is a Professor of Sociology and Business at UC Berkeley. She holds a BA in history and an MBA from the University of Toronto, and a PhD in organizational behavior and industrial relations from UC Berkeley. Following positions at Duke University's Fuqua School of Business, Cornell University's Johnson Graduate School of Management, and Columbia University's Graduate School of Business, Professor Haveman joined UC Berkeley in July 2006. Her research examines how organizations, the fields in which they are embedded, and the careers of their members and employees evolve. Her current work involves American magazines and wineries, Chinese listed firms, and the emerging marijuana market in several US states.
Former BIDS Data Science Fellow Laura K. Nelson is an Assistant Professor of Sociology in the College of Social Sciences and Humanities at Northeastern University. Laura uses computational methods and open source tools - principally automated text analysis - to study social movements, culture, gender, institutions, and organizations. She is particularly interested in developing computational tools that can bolster the way social scientists do inductive and theory-driven research. She received her PhD in sociology from the University of California, Berkeley, and she also holds an MA from UC Berkeley and a BA from the University of Wisconsin, Madison. While at UC Berkeley, she was a postdoctoral fellow with Digital Humanities @ Berkeley, developing a course for undergraduates on computational text analysis in the humanities and social sciences.
Alexandra Paxton is a BIDS data science fellow and a postdoctoral scholar working with Tom Griffiths in the Institute of Cognitive and Brain Sciences. She received her PhD in cognitive and information sciences from the University of California, Merced, in December 2015.
Her work explores human communication in data-rich environments. From capitalizing on large-scale real-world corpora to capturing multimodal experimental data, her research seeks to understand how context changes communication dynamics. Broadly, her work integrates computational and social perspectives to understand interpersonal interaction as a nonlinear dynamical system.
Relatedly, Alexandra also develops research methods to facilitate quantitative research on interaction and encourages others to use data-rich computational methods through teaching and service. As part of that effort, she works with the Center for Data on the Mind to foster the application of big data to questions about cognition and behavior.
Niek Veldhuis is Professor of Assyriology (cuneiform studies) in the Department of Near Eastern Studies. He received his PhD at the Rijksuniversiteit Groningen (The Netherlands) in 1997, and came to Berkeley in 2002. His primary interests are in the intellectual history of ancient Mesopotamia (History of the Mesopotamian Lexical Tradition, 2014) and Sumerian literature (Religion, Literature and Scholarship: The Sumerian Composition Nanše and the Birds, 2004). He is director of the NEH-supported Digital Corpus of Cuneiform Lexical Texts and is a member of the international Oracc Steering Committee, providing tools and standards for digital publication of cuneiform texts to scholars worldwide. Today, his main research focus is on developing computational text analysis scripts (primarily in Jupyter Notebooks) for cuneiform datasets.
Christopher Hench was a BIDS Data Science Fellow and a PhD Candidate in German Literature and Medieval Studies at UC Berkeley from 2017 to 2018. He studied computational approaches to the formal analysis of lyric and epic poetry, and reading soundscapes. More broadly, he had a particular interest in the challenges of domain adaptation for NLP and in algorithms for the detection and scoring of text reuse. Christopher was also the Program Development Lead for Digital Humanities at Berkeley and the D-Lab at Berkeley, where he collaborated on several research projects and taught Python and Git workshops. He also coordinated the modules development effort in cooperation with BIDS, D-Lab, and the Data Science Education Program (DSEP).
Deborah Sunter is an Assistant Professor of Mechanical Engineering at Tufts University. While at the University of California, Berkeley, she was a BIDS Data Science Fellow and a postdoctoral fellow in the Renewable and Appropriate Energy Laboratory. Her research interests included data science for sustainability, national energy planning, city-integrated renewable energy systems, environmental justice, and clean technology innovation. While working on her BS in Mechanical and Aerospace Engineering at Cornell University, she developed a nanosatellite mission that was successfully launched into orbit. Although fascinated by aerospace applications, the time-critical issue of global warming shifted her focus in graduate school to explore renewable energy. Specializing in computational modeling of thermo-physics in multiphase systems, she developed a novel solar absorber tube and received her PhD in Mechanical Engineering at the University of California, Berkeley. The need for better global environmental solutions led her to do research abroad in both Japan and China. After receiving her doctorate, she advanced her understanding of energy policy as an AAAS Science and Technology Policy Fellow at the U.S. Department of Energy.
Maryam Vareth leads BIDS’ data science research efforts in the Health & Life Sciences. Dr. Vareth is a Co-Director of the Innovate For Health initiative, a collaboration among UC Berkeley, UCSF, and Janssen Pharmaceutical Companies of Johnson & Johnson. As an experienced engineer, researcher, and data scientist, she applies mathematics, statistics and physics to solve unmet needs in healthcare to enhance patients’ experience during their medical journey. She is an advocate for “data-driven” medicine, and in particular for linking medical imaging data with medical diagnostics and therapeutics to extract clinically-relevant insights through the use of open research and open source practices. Dr. Vareth received her BS and MS training in Electrical Engineering and Computer Science (EECS) from UC Berkeley, where she was awarded the prestigious Regents’ and Chancellor’s Scholarship. She completed her PhD through the joint UC Berkeley-UCSF Bioengineering program as a National Science Foundation Fellow, where she was awarded the Margaret Hart Surbeck Endowed Fellowship for Interdisciplinary Research. Her doctoral work developed new techniques and algorithms for the acquisition, reconstruction and quantitative analysis of Magnetic Resonance Spectroscopy Imaging (MRSI), with the goal of increasing its speed, sensitivity and specificity to improve the management of patients with brain tumors. She conducted her post-doctoral fellowship at UCSF, combining structural, physiological and metabolic imaging data from large clinical trials to quantitatively characterize heterogeneity within malignant brain tumors.
Chris Kennedy is an instructor in psychiatry at Harvard Medical School / Massachusetts General Hospital. He has a PhD in biostatistics from UC Berkeley. He is a senior fellow at UC Berkeley’s D-Lab and is affiliated with the Integrative Cancer Research Group and the Division of Research at Kaiser Permanente Northern California. At BIDS, he was a BIDS-Biomedical Big Data Training (BBDT) Data Science Fellow and a PhD student in biostatistics at UC Berkeley, where he worked with Alan Hubbard. He was also a D-Lab instructor and consultant, and an NIH biomedical big data trainee. His methodological interests encompassed targeted machine learning, randomized trials, causal inference, deep learning, text analysis, signal processing, and computer vision. His applications were primarily in precision medicine, public health, genomics, and election campaigns. His software projects included the SuperLearner ensemble learning system and varImpact for variable importance estimation; he leveraged high performance computing on Savio and XSEDE clusters to accelerate his work. Prior to Berkeley he worked in political analytics in DC, running dozens of randomized trials and integrating machine learning into multi-million dollar programs to improve voter turnout for underrepresented Americans. He has also worked to support climate change action through Al Gore’s Climate Reality Project and the Yale Program on Climate Change Communication. He holds an M.A. in political science from UC Berkeley, an M.P.Aff. from the LBJ School of Public Affairs, and a B.A. in government & economics from The University of Texas at Austin.
As BIDS Chief Research Officer, Jonathan Dugan directed BIDS' efforts to solve research issues, working with the Berkeley research community to empower faculty and researchers to develop missing areas of the data science environment (tools, methods, systems, workflows, etc.), and secure the resources to fund them. He helped to promote successes, fund new work, and find practical solutions that brought together the best faculty, postdocs, students, and staff to solve immediate challenges for our research and education efforts.
Jonathan has focused his career on the promotion of science, education and open culture. His work in both nonprofit and for-profit businesses has included consulting, business management, entrepreneurship, web services and software development, community engagement, and biomedical informatics systems. He completed his PhD in biomedical informatics at Stanford in 2002, where he developed nonlinear mathematical simulations for protein structure modeling. His research interests include citations, data sharing, software development, community engagement, identity and reputation systems, and applying machine learning techniques to solve research questions.