Teaching Researchers How to Tackle Their Big Data with Apache Spark

January 16, 2015

Earlier this week, BIDS, AMPLab, and Databricks hosted a three-day mini-course on distributed analytics and machine learning with Apache Spark. The course, which covered a subset of material to be included in two massive open online courses (MOOCs) being offered later this year, attracted nearly 40 individuals from the Berkeley community interested in learning to do more with their big data. The course began with a tutorial on Spark, included lectures on data science and distributed machine learning, and featured a series of hands-on exercises using Apache Spark in Databricks Cloud.

Apache Spark

As researchers collect more data and datasets grow in both size and complexity, more advanced data science techniques and tools are needed to process and make sense of large-scale data. While countless tools are available for processing small-scale data, the options for larger datasets are rather limited. Apache Spark, the popular open source big data–processing engine developed at AMPLab, addresses this gap because it is both easy to use and easy to scale.

Upcoming MOOCs

Both the students and the instructors, Ameet Talwalkar and Anthony Joseph, viewed the mini-course as a success. The instructors are now turning their attention to teaching the two upcoming MOOCs offered by BerkeleyX and Databricks:

  • Introduction to Big Data with Apache Spark: Students will learn how to apply data science techniques using parallel programming in Spark to explore big (and small) data. The course will identify the most common responsibilities of data scientists and teach students how to use Spark to meet these expectations.
    • When: February 23–March 27, 2015
    • Professor: Anthony D. Joseph, Professor in Electrical Engineering and Computer Science at UC Berkeley and Technical Advisor at Databricks
  • Scalable Machine Learning: The course will present the underlying statistical and algorithmic principles required to develop scalable machine learning pipelines and provide hands-on experience using Apache Spark. Students will use Spark to implement scalable algorithms for fundamental statistical models while tackling key real-world problems from various domains.
    • When: April 14–May 18, 2015
    • Professor: Ameet Talwalkar, Assistant Professor of Computer Science at UCLA and Technical Advisor at Databricks

Both courses are free and open to the public, and enrollment is now open on the edX website. edX Verified Certificates are also available for a fee. Together, the two courses make up the Big Data XSeries offered by edX. For more information, visit https://www.edx.org/.