Schedule

Part  Description                       Dates                 CS 451/651 Assignments  CS 431/631 Assignments
1     MapReduce Algorithm Design        Sep 5, 10, 12, 17     A0: Sep 17              -
2     From MapReduce to Spark           Sep 19, 24            A1: Sep 24              A0: Sep 19
3     Analyzing Text                    Sep 26, Oct 1         A2: Oct 1               A1: Sep 26
4     Analyzing Graphs                  Oct 3, 8              A3: Oct 8               A2: Oct 10
5     Analyzing Relational Data         Oct 10, 22, 24        -                       A3: Oct 24
6     Data Mining and Machine Learning  Oct 29, 31, Nov 5, 7  A4: Oct 29              -
7     Mutable State                     Nov 12, 14            A5: Nov 12              A4: Nov 14
8     Analyzing Graphs, Redux           Nov 19, 21            -                       -
9     Real-Time Analytics               Nov 26, 28            A6: Nov 26              A5: Nov 28
10    Looking Ahead                     Dec 3                 A7: Dec 3               -

Part 1: MapReduce Algorithm Design Sep 5, 10, 12, 17

Topics

  • What's this course about?
  • Why big data?
  • The datacenter is the computer and other "big ideas"
  • MapReduce programming model
  • Cloud computing and datacenters
  • Hadoop API
  • Hadoop physical execution
  • MapReduce design patterns
  • Intermediate aggregation and combiners
  • Partitioning, grouping, sorting, and monoids
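
As a rough illustration of the last two topics, the sketch below simulates map, combine, shuffle, and reduce in plain Python (not the Hadoop API) to compute a per-key mean. A mean cannot be combined directly, but (sum, count) pairs form a monoid under component-wise addition, so a combiner can pre-aggregate safely. The function names and the in-memory "splits" are assumptions made for this example only.

```python
from collections import defaultdict
from functools import reduce

# Monoid for computing a mean: (sum, count) pairs with component-wise addition.
IDENTITY = (0.0, 0)


def combine(a, b):
    """Associative merge of two (sum, count) pairs."""
    return (a[0] + b[0], a[1] + b[1])


def mapper(record):
    """Emit (key, (value, 1)) so that partial results can be merged later."""
    key, value = record
    yield key, (float(value), 1)


def run_combiner(pairs):
    """Locally aggregate one mapper's output before the shuffle."""
    partial = defaultdict(lambda: IDENTITY)
    for key, value in pairs:
        partial[key] = combine(partial[key], value)
    return list(partial.items())


def run_job(splits):
    """Simulate map -> combine -> shuffle -> reduce over in-memory splits."""
    shuffled = defaultdict(list)
    for split in splits:
        mapped = [kv for record in split for kv in mapper(record)]
        for key, value in run_combiner(mapped):
            shuffled[key].append(value)
    result = {}
    for key, values in shuffled.items():
        # The reducer uses the same merge operation as the combiner.
        total, count = reduce(combine, values, IDENTITY)
        result[key] = total / count
    return result


if __name__ == "__main__":
    splits = [[("a", 1), ("a", 3), ("b", 10)], [("a", 5), ("b", 20)]]
    print(run_job(splits))
```

Because the combiner and the reducer share one associative operation with an identity, the result does not depend on how records are split across mappers.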

Readings

  • Data-Intensive Text Processing with MapReduce
  • Hadoop: The Definitive Guide (4th Edition):
    • Chapter 1: Meet Hadoop
    • Chapter 2: MapReduce
    • Chapter 3: The Hadoop Distributed Filesystem (Focus on the mechanics of the HDFS commands and don't worry so much about learning the Java API all at once—you'll pick it up in time.)
    • Chapter 5: Hadoop I/O (Read sections "Serialization" and "File-Based Data Structures")
    • Chapter 6: Developing a MapReduce Application (Skip sections "Setting Up the Development Environment", "Writing a Unit Test with MRUnit" and "MapReduce Workflows")
    • Chapter 7: How MapReduce Works (Skip section on "Configuration Tuning")
    • Chapter 8: MapReduce Types and Formats
    • Chapter 9: MapReduce Features (Read sections on "Counters", "Sorting", and "Side Data Distribution")

Slides

PPTX PDF   Part 1a: September 5

PPTX PDF   Part 1b: September 10

PPTX PDF   Part 1c: September 12

PPTX PDF   Part 1d: September 17

Part 2: From MapReduce to Spark Sep 19, 24

Topics

  • Evolution of dataflow abstractions
  • MapReduce, Pig, Spark, etc.

Readings

  • Jimmy Lin. Monoidify! Monoids as a Design Principle for Efficient MapReduce Algorithms. arXiv:1304.7544.
  • Learning Spark (Optional):
    • Chapter 1: Introduction to Data Analysis with Spark
    • Chapter 2: Downloading Spark and Getting Started (Skip section on downloading)
    • Chapter 3: Programming with RDDs
    • Chapter 4: Working with Key/Value Pairs
    • Chapter 5: Loading and Saving Your Data (Stop when you get to Structured Data with Spark SQL)

Note that the Spark book is somewhat outdated: it covers Spark 1.3, whereas we're using Spark 2.1. All the material in the book can be found in a multitude of sources online, but you'll have to hunt around for it; the book is useful primarily as a single reference that gathers everything together.

Slides

PPTX PDF   Part 2a (1/2): September 19

PPTX PDF   Part 2a (2/2): September 19

PPTX PDF   Part 2b: September 24

Part 3: Analyzing Text Sep 26, Oct 1

Topics

  • Language models and machine translation
  • Inverted indexing and search

Readings

Slides

PPTX PDF   Part 3a: September 26

PPTX PDF   Part 3b: October 1

Part 4: Analyzing Graphs Oct 3, 8

Topics

  • Graph representations
  • Parallel breadth-first search
  • PageRank and random walks
  • Issues and challenges with dataflow abstractions
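
For orientation, here is a minimal single-machine sketch of iterative PageRank over an adjacency-list graph, written in plain Python rather than MapReduce or Spark. The graph format, the damping factor of 0.85, the fixed iteration count, and the even redistribution of mass from dangling nodes are all assumptions chosen for this example, not necessarily the formulation used in lecture.

```python
def pagerank(graph, damping=0.85, iterations=20):
    """Iterative PageRank over {node: [out-neighbors]}.

    Every neighbor must itself appear as a key in `graph`.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # "Map" step: each node splits its rank evenly across its out-edges.
        contributions = {node: 0.0 for node in nodes}
        dangling = 0.0
        for node, neighbors in graph.items():
            if neighbors:
                share = rank[node] / len(neighbors)
                for neighbor in neighbors:
                    contributions[neighbor] += share
            else:
                # Nodes without out-edges spread their mass over all nodes.
                dangling += rank[node]
        # "Reduce" step: sum incoming mass and apply the random-jump factor.
        rank = {
            node: (1.0 - damping) / n
            + damping * (contributions[node] + dangling / n)
            for node in nodes
        }
    return rank


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": []}
    for node, score in sorted(pagerank(graph).items()):
        print(node, round(score, 4))
```

Each iteration needs the full graph structure plus the current ranks, which is one reason iterative graph algorithms are awkward to express as a chain of independent dataflow jobs.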

Readings

Slides

PPTX PDF   Part 4a: October 3

PPTX PDF   Part 4b: October 8

Part 5: Analyzing Relational Data Oct 10, 22, 24

Topics

  • OLTP vs. OLAP
  • Data warehousing and data lakes, ETL
  • SQL-on-Hadoop: relational data processing with MapReduce and Spark
  • Optimizations for relational processing: row vs. column stores, vectorized processing
  • Semistructured data and record reconstruction (Parquet)

Readings

Slides

PPTX PDF   Part 5a: October 10

PPTX PDF   Part 5b: October 22

PPTX PDF   Part 5c: October 24

Part 6: Data Mining and Machine Learning Oct 29, 31, Nov 5, 7

Topics

  • Supervised machine learning: binary classification
  • Logistic regression, gradient descent, stochastic gradient descent, ensemble methods
  • Production machine learning pipelines
  • Hashing: minhash, random projections, etc.
  • Clustering: k-means, Gaussian mixture models

Readings

Slides

PPTX PDF   Part 6a: October 29

PPTX PDF   Part 6b: October 31

PPTX PDF   Part 6c: November 5

PPTX PDF   Part 6d: November 7

Part 7: Mutable State Nov 12, 14

Topics

  • Bigtable/HBase: log-structured merge trees
  • Distributed hash tables
  • Consistency, latency, and availability tradeoffs

Readings

Slides

PPTX PDF   Part 7a: November 12

PPTX PDF   Part 7b: November 14

Part 8: Analyzing Graphs, Redux Nov 19, 21

Topics

  • Bulk synchronous parallel: "think like a vertex" (Giraph)
  • Alternative approaches: GraphX

Readings

Slides

PPTX PDF   Part 8a: November 19

PPTX PDF   Part 8b: November 21

Part 9: Real-Time Analytics Nov 26, 28

Topics

  • Stream processing semantics, issues, and frameworks
  • Probabilistic data structures (HyperLogLog counters, Bloom filters, count-min sketches, etc.)
  • Integrating batch and stream processing
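
To make one of the probabilistic data structures above concrete, the following is a minimal Bloom filter sketch in plain Python. The class name, the default sizes, and the use of seeded SHA-256 digests as hash functions are assumptions chosen for readability; production implementations typically use faster hashes and a packed bit array.

```python
import hashlib


class BloomFilter:
    """Set-membership sketch: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the code simple at the cost of memory.
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        """Derive num_hashes bit positions by seeding a SHA-256 digest."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos] for pos in self._positions(item))


if __name__ == "__main__":
    seen = BloomFilter()
    for word in ["hadoop", "spark", "hbase"]:
        seen.add(word)
    print(seen.might_contain("spark"))
```

The filter never stores the items themselves, so memory use is fixed by `num_bits` regardless of how many items pass through the stream; the trade-off is a false-positive rate that grows as more bits are set.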

Readings

Slides

PPTX PDF   Part 9a: November 26

PPTX PDF   Part 9b: November 28
