Schedule

Part | Description                      | Dates                | CS 451/651 Assignments | CS 431/631 Assignments
1    | Introduction to Big Data         | Sep 9                |                        |
2    | MapReduce Algorithm Design       | Sep 14, 16, 21       | A0: Sep 17             | A0: Sep 17
3    | From MapReduce to Spark          | Sep 23, 28, 30       |                        |
4    | Analyzing Text                   | Oct 5, 7             | A1: Oct 1              | A1: Oct 1
5    | Analyzing Graphs                 | Oct 19, 21           | A2: Oct 18             | A2: Oct 18
6    | Data Mining and Machine Learning | Oct 26, 28, Nov 2, 4 | A3: Oct 29             | A3: Nov 1
7    | Analyzing Relational Data        | Nov 9, 11, 16        | A4: Nov 5              | A4: Nov 12
8    | Real-Time Analytics              | Nov 18, 23           | A5: Nov 19             |
9    | Mutable State                    | Nov 25, 30           | A6: Nov 29             | A5: Nov 26
10   | Analyzing Graphs, Redux          | Dec 2, 7             | A7: Dec 6              | A6: Dec 6

Part 1: Introduction to Big Data

Topics

  • What's this course about?
  • Why big data?
  • Scaling models

Slides

PDF   Part 1

Part 2: MapReduce Algorithm Design

Topics

  • MapReduce programming model
  • Cloud computing and datacenters
  • Hadoop API
  • Hadoop physical execution
  • MapReduce design patterns
  • Intermediate aggregation and combiners
  • Partitioning, grouping, and sorting

Readings

  • Data-Intensive Text Processing with MapReduce
  • Hadoop: The Definitive Guide (4th Edition):
    • Chapter 1: Meet Hadoop
    • Chapter 2: MapReduce
    • Chapter 3: The Hadoop Distributed Filesystem (Focus on the mechanics of the HDFS commands and don't worry so much about learning the Java API all at once—you'll pick it up in time.)
    • Chapter 5: Hadoop I/O (Read sections "Serialization" and "File-Based Data Structures")
    • Chapter 6: Developing a MapReduce Application (Skip sections "Setting Up the Development Environment", "Writing a Unit Test with MRUnit" and "MapReduce Workflows")
    • Chapter 7: How MapReduce Works (Skip section on "Configuration Tuning")
    • Chapter 8: MapReduce Types and Formats
    • Chapter 9: MapReduce Features (Read sections on "Counters", "Sorting", and "Side Data Distribution")

Slides

PDF   Part 2a

PDF   CS451/651: Hadoop API

PDF   Part 2b

PDF   Part 2c

Part 3: From MapReduce to Spark

Topics

  • Evolution of dataflow abstractions
  • MapReduce, Pig, Spark, etc.

Readings

  • Learning Spark (Optional):
    • Chapter 1: Introduction to Data Analysis with Spark
    • Chapter 2: Downloading Spark and Getting Started (Skip section on downloading)
    • Chapter 3: Programming with RDDs
    • Chapter 4: Working with Key/Value Pairs
    • Chapter 5: Loading and Saving Your Data (Stop when you get to Structured Data with Spark SQL)

Slides

PDF   Part 3a

PDF   Part 3b

PDF   Part 3c

Part 4: Analyzing Text

Topics

  • Language models and machine translation
  • Inverted indexing and search

Readings

Slides

PDF   Part 4a

PDF   Part 4b

Part 5: Analyzing Graphs

Topics

  • Graph representations
  • Parallel breadth-first search
  • PageRank and random walks
  • Issues and challenges with dataflow abstractions

Readings

Slides

PDF   Part 5a

PDF   Part 5b

Part 6: Data Mining and Machine Learning

Topics

  • Supervised machine learning: binary classification
  • Logistic regression, gradient descent, stochastic gradient descent, ensemble methods
  • Production machine learning pipelines
  • Hashing: minhash
  • Clustering: k-means

Readings

  • Tom Mitchell. Naive Bayes and Logistic Regression. (This book chapter serves as supplemental reading and goes into classification in more detail than in lecture.)
  • Deisenroth et al., Mathematics for Machine Learning: Chapter 12, Classification with Support Vector Machines. (Optional supplemental reading)
  • Deisenroth et al., Mathematics for Machine Learning: Chapter 11, Density Estimation with Gaussian Mixture Models. (This book chapter serves as supplemental reading and goes into clustering with Gaussian mixture models in more detail than in lecture.)

Slides

PDF   Part 6a

PDF   Part 6b

PDF   Part 6c

PDF   Part 6d

Part 7: Analyzing Relational Data

Topics

  • OLTP vs. OLAP
  • Data warehousing and data lakes, ETL
  • SQL-on-Hadoop: relational data processing with MapReduce and Spark
  • Optimizations for relational processing: row vs. column stores, vectorized processing
  • Semistructured data and record reconstruction (Parquet)

Readings

Slides

PDF   Part 7a

PDF   Part 7b

PDF   Part 7c

Part 8: Real-Time Analytics

Topics

  • Stream processing semantics, issues, and frameworks
  • Introduction to Apache Spark Streaming
  • Probabilistic data structures (HyperLogLog counters, Bloom filters, count-min sketches, etc.)

Readings

Slides

PDF   Part 8a

PDF   Part 8b

Part 9: Mutable State

Topics

  • Bigtable/HBase: log-structured merge trees
  • Distributed hash tables
  • Consistency, latency, and availability tradeoffs

Readings

Slides

PDF   Part 9a

PDF   Part 9b

Part 10: Analyzing Graphs, Redux

Topics

  • Bulk synchronous parallel: "think like a vertex" (Giraph)

Readings

Slides

PDF   Part 10a

PDF   Part 10b
