Introduction to Apache Spark

Instructor: Dave Schulz (Research Computing Services)

An introduction to the basic theory of MapReduce applied to Resilient Distributed Datasets (RDDs). This course primarily uses PySpark to introduce the Spark approach to parallelization.

This course will begin with an introduction to RDDs and MapReduce. We will demonstrate how to load data into an RDD and how to make use of the RDD API. This introduction will include practical examples of filtering, transforming, and aggregating a data stream with the RDD API. We will also introduce additional high-level Spark APIs such as the DataFrame API.
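The filter, transform, and aggregate steps mentioned above can be sketched in plain Python. This is only a conceptual illustration (the sample records and field layout are invented); Spark's RDD API exposes the same operations as rdd.filter, rdd.map, and rdd.reduce, executed in parallel across the partitions of a dataset.

```python
from functools import reduce

# Hypothetical text records: date, record type, measurement value.
lines = [
    "2024-01-01,reading,12.5",
    "2024-01-02,reading,7.0",
    "2024-01-03,error,NaN",
]

# Filter: keep only well-formed measurement records.
readings = filter(lambda line: ",error," not in line, lines)

# Transform (map): extract the numeric field from each record.
values = map(lambda line: float(line.split(",")[2]), readings)

# Aggregate (reduce): sum the extracted values.
total = reduce(lambda a, b: a + b, values)
print(total)  # 19.5
```

In PySpark the same pipeline would start from a parallelized collection or a file (e.g. sc.parallelize or sc.textFile) and chain the equivalent RDD methods, which Spark evaluates lazily and distributes across the cluster.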

Target audience: researchers interested in a first introduction to parallel computations in a Spark framework

Duration: 3 hours

Level: beginner

Prerequisites: This course assumes familiarity with basic Python syntax for variable declaration, function definition and use, and iteration.

Laptop software: All attendees will need to bring a laptop with wireless access and a remote SSH client installed (on Windows laptops we recommend the free edition of MobaXterm; Mac and Linux laptops need no additional software).