Data-Intensive Distributed Computing

Schedule

Part	Description	Dates	CS 451/651 Assignments	CS 431/631 Assignments
1	Introduction to Big Data	Sep 9
2	MapReduce Algorithm Design	Sep 14, 16, 21	A0: Sep 17	A0: Sep 17
3	From MapReduce to Spark	Sep 23, 28, 30
4	Analyzing Text	Oct 5, 7	A1: Oct 1	A1: Oct 1
5	Analyzing Graphs	Oct 19, 21	A2: Oct 18	A2: Oct 18
6	Data Mining and Machine Learning	Oct 26, 28, Nov 2, 4	A3: Oct 29	A3: Nov 1
7	Analyzing Relational Data	Nov 9, 11, 16	A4: Nov 5	A4: Nov 12
8	Real-Time Analytics	Nov 18, 23	A5: Nov 19
9	Mutable State	Nov 25, 30	A6: Nov 29	A5: Nov 26
10	Analyzing Graphs, Redux	Dec 2, 7	A7: Dec 6	A6: Dec 6

Part 1: Introduction to Big Data

Topics

What's this course about?
Why big data?
Scaling models

Slides

PDF Part 1

Part 2: MapReduce Algorithm Design

Topics

MapReduce programming model
Cloud computing and datacenters
Hadoop API
Hadoop physical execution
MapReduce design patterns
Intermediate aggregation and combiners
Partitioning, grouping, and sorting
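
As a rough illustration of how the topics above fit together, here is a minimal in-memory sketch of a word count job with a mapper, combiner, partitioner, and reducer. It is not Hadoop API code: the function names (`mapper`, `combiner`, `reducer`, `run_job`) are hypothetical, and the combiner is run once per input record for simplicity, whereas a real framework decides when and how often combiners run.

```python
from collections import defaultdict


def mapper(doc_id, text):
    """Emit (word, 1) for every token in one input record."""
    for word in text.lower().split():
        yield word, 1


def combiner(word, counts):
    """Local aggregation of one mapper's output before the shuffle."""
    yield word, sum(counts)


def reducer(word, counts):
    """Final aggregation of all partial counts for one key."""
    yield word, sum(counts)


def run_job(records, num_reducers=2):
    # One dict per reducer; each maps key -> list of values (grouping).
    partitions = [defaultdict(list) for _ in range(num_reducers)]

    # Map phase: every record is processed independently.
    for doc_id, text in records:
        local = defaultdict(list)
        for key, value in mapper(doc_id, text):
            local[key].append(value)
        # Combine, then partition: route each key to a reducer by hash.
        for key, values in local.items():
            for ckey, cvalue in combiner(key, values):
                partitions[hash(ckey) % num_reducers][ckey].append(cvalue)

    # Reduce phase: keys are visited in sorted order within each partition.
    output = {}
    for part in partitions:
        for key in sorted(part):
            for rkey, rvalue in reducer(key, part[key]):
                output[rkey] = rvalue
    return output


if __name__ == "__main__":
    print(run_job([("d1", "a b a"), ("d2", "b c")]))
```

Because summation is associative and commutative, the combiner can reuse the reducer's logic here; that does not hold for every reduce function.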

Readings

Data-Intensive Text Processing with MapReduce
Hadoop: The Definitive Guide (4th Edition):
- Chapter 1: Meet Hadoop
- Chapter 2: MapReduce
- Chapter 3: The Hadoop Distributed Filesystem (Focus on the mechanics of the HDFS commands and don't worry so much about learning the Java API all at once—you'll pick it up in time.)
- Chapter 5: Hadoop I/O (Read sections "Serialization" and "File-Based Data Structures")
- Chapter 6: Developing a MapReduce Application (Skip sections "Setting Up the Development Environment", "Writing a Unit Test with MRUnit" and "MapReduce Workflows")
- Chapter 7: How MapReduce Works (Skip section on "Configuration Tuning")
- Chapter 8: MapReduce Types and Formats
- Chapter 9: MapReduce Features (Read sections on "Counters", "Sorting", and "Side Data distribution")

Slides

PDF Part 2a

PDF CS451/651: Hadoop API

PDF Part 2b

PDF Part 2c

Part 3: From MapReduce to Spark

Topics

Evolution of dataflow abstractions
MapReduce, Pig, Spark, etc.

Readings

Learning Spark (Optional):
- Chapter 1: Introduction to Data Analysis with Spark
- Chapter 2: Downloading Spark and Getting Started (Skip section on downloading)
- Chapter 3: Programming with RDDs
- Chapter 4: Working with Key/Value Pairs
- Chapter 5: Loading and Saving Your Data (Stop when you get to Structured Data with Spark SQL)

Slides

PDF Part 3a

PDF Part 3b

PDF Part 3c

Part 4: Analyzing Text

Topics

Language models and machine translation
Inverted indexing and search
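
For orientation, a minimal single-machine sketch of an inverted index with term frequencies and a boolean AND query. The function names and the whitespace tokenizer are assumptions for illustration only; the lectures and readings cover the distributed construction.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to a postings list of (doc_id, term_frequency),
    ordered by doc_id. `docs` maps doc_id -> text."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        term_counts = Counter(docs[doc_id].lower().split())
        for term, tf in term_counts.items():
            index[term].append((doc_id, tf))
    return dict(index)


def boolean_and(index, terms):
    """Return sorted doc_ids that contain every query term."""
    postings = [{doc_id for doc_id, _ in index.get(t, [])} for t in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


if __name__ == "__main__":
    idx = build_inverted_index({1: "a b", 2: "b c b"})
    print(idx["b"], boolean_and(idx, ["b", "c"]))
```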

Readings

Large Language Models in Machine Translation (Optional)
Data-Intensive Text Processing with MapReduce — Chapter 4: Inverted Indexing for Text Retrieval

Slides

PDF Part 4a

PDF Part 4b

Part 5: Analyzing Graphs

Topics

Graph representations
Parallel breadth-first search
PageRank and random walks
Issues and challenges with dataflow abstractions
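
As a point of reference for the PageRank topic, here is a small single-machine sketch of the iterative computation with a damping factor. The function name, the default damping value of 0.85, and the uniform redistribution of mass from dangling nodes are assumptions of this sketch, not necessarily the formulation used in lecture. It also assumes every link target appears as a key in `graph`.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph maps each node to a list of nodes it links to."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}

    for _ in range(iterations):
        incoming = {v: 0.0 for v in nodes}
        dangling = 0.0  # mass held by nodes with no outgoing links
        for v, outlinks in graph.items():
            if outlinks:
                share = rank[v] / len(outlinks)
                for w in outlinks:
                    incoming[w] += share
            else:
                dangling += rank[v]
        # Random jump plus damped link mass; dangling mass is spread evenly.
        rank = {
            v: (1.0 - damping) / n + damping * (incoming[v] + dangling / n)
            for v in nodes
        }
    return rank


if __name__ == "__main__":
    print(pagerank({"a": ["c"], "b": ["c"], "c": ["a"]}))
```

Each iteration has the shape of one MapReduce job: nodes distribute their rank along outgoing edges (map), and each node sums what it receives (reduce).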

Readings

Data-Intensive Text Processing with MapReduce — Chapter 5: Graph Algorithms

Slides

PDF Part 5a

PDF Part 5b

Part 6: Data Mining and Machine Learning

Topics

Supervised machine learning: binary classification
Logistic regression, gradient descent, stochastic gradient descent, ensemble methods
Production machine learning pipelines
Hashing: minhash
Clustering: k-means
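
To make the minhash topic concrete, here is a minimal sketch that estimates Jaccard similarity from signatures. The linear hash family `(a*x + b) mod p`, the use of CRC32 to turn strings into integers, and all function names are assumptions chosen for brevity; signatures of empty sets are not handled.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # a Mersenne prime larger than any CRC32 value


def make_hash_funcs(k, seed=0):
    """Return k (a, b) pairs defining h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def minhash_signature(items, hash_funcs):
    """For each hash function, keep the minimum hash over the set's items."""
    xs = [zlib.crc32(item.encode("utf-8")) for item in items]
    return [min((a * x + b) % PRIME for x in xs) for a, b in hash_funcs]


def estimate_jaccard(sig1, sig2):
    """The fraction of matching signature positions estimates |A∩B| / |A∪B|."""
    matches = sum(1 for u, v in zip(sig1, sig2) if u == v)
    return matches / len(sig1)


if __name__ == "__main__":
    funcs = make_hash_funcs(256)
    a = {str(i) for i in range(100)}
    b = {str(i) for i in range(50, 150)}
    print(estimate_jaccard(minhash_signature(a, funcs),
                           minhash_signature(b, funcs)))
```

The two sets in the usage example have a true Jaccard similarity of 50/150; the estimate approaches that value as the number of hash functions grows.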

Readings

Tom Mitchell. Naive Bayes and Logistic Regression. (This book chapter serves as supplemental reading and goes into classification in more detail than in lecture.)
Deisenroth et al., Mathematics for Machine Learning: Chapter 12, Classification with Support Vector Machines. (Optional supplemental reading)
Deisenroth et al., Mathematics for Machine Learning: Chapter 11, Density Estimation with Gaussian Mixture Models. (This book chapter serves as supplemental reading and goes into clustering with Gaussian mixture models in more detail than in lecture.)

Slides

PDF Part 6a

PDF Part 6b

PDF Part 6c

PDF Part 6d

Part 7: Analyzing Relational Data

Topics

OLTP vs. OLAP
Data warehousing and data lakes, ETL
SQL-on-Hadoop: relational data processing with MapReduce and Spark
Optimizations for relational processing: row vs. column stores, vectorized processing
Semistructured data and record reconstruction (Parquet)

Readings

Data-Intensive Text Processing with MapReduce — Chapter 6: Processing Relational Data
MapReduce: A major step backwards

Slides

PDF Part 7a

PDF Part 7b

PDF Part 7c

Part 8: Real-Time Analytics

Topics

Stream processing semantics, issues, and frameworks
Introduction to Apache Spark Streaming
Probabilistic data structures (HyperLogLog counters, Bloom filters, count-min sketches, etc.)
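
As one example of the probabilistic data structures listed above, here is a minimal Bloom filter sketch. The class name, the default sizes, and the choice of SHA-256 with an index prefix to derive several hash positions are assumptions for illustration; production implementations typically use faster hashes and size the filter from a target false-positive rate.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by hashing the item with a prefix.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


if __name__ == "__main__":
    bf = BloomFilter()
    bf.add("spark")
    print("spark" in bf, "hadoop" in bf)
```

The filter never reports a false negative for an added item, but it may report a false positive for an item that was never added.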

Readings

Zaharia et al. Discretized Streams: Fault-Tolerant Streaming Computation at Scale, SOSP 2013.

Slides

PDF Part 8a

PDF Part 8b

Part 9: Mutable State

Topics

Bigtable/HBase: Log-structured merge trees
Distributed hash tables
Consistency, latency, and availability tradeoffs
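
As a small illustration related to distributed hash tables, here is a sketch of consistent hashing on a ring, a placement technique used by many DHT designs. It is an assumption that this matches the scheme discussed in lecture; the class name, the use of MD5 for ring positions, and the absence of virtual nodes and replication are simplifications.

```python
import bisect
import hashlib


def _ring_position(key):
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=()):
        self._ring = []  # sorted list of (position, node_name)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        bisect.insort(self._ring, (_ring_position(node), node))

    def remove_node(self, node):
        self._ring.remove((_ring_position(node), node))

    def lookup(self, key):
        """Return the first node at or after the key's position, wrapping around."""
        if not self._ring:
            raise LookupError("ring has no nodes")
        idx = bisect.bisect_right(self._ring, (_ring_position(key), ""))
        return self._ring[idx % len(self._ring)][1]


if __name__ == "__main__":
    ring = HashRing(["node-a", "node-b", "node-c"])
    print(ring.lookup("user:42"))
```

When a node leaves, only the keys it owned move to its successor; keys owned by other nodes stay where they were.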

Readings

The original Bigtable paper.
The original DHT paper.
Daniel Abadi. Consistency Tradeoffs in Modern Distributed Database System Design, Computer, 45(2):37-42, 2012.

Slides

PDF Part 9a

PDF Part 9b

Part 10: Analyzing Graphs, Redux

Topics

Bulk synchronous parallel: "think like a vertex" (Giraph)
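
To give a feel for the vertex-centric model, here is a minimal single-process sketch of bulk synchronous parallel supersteps computing unweighted shortest-path distances from a source. It does not use the Giraph API; the function name and message format are hypothetical, and every link target is assumed to appear as a key in `graph`.

```python
def bsp_shortest_paths(graph, source):
    """graph maps each vertex to a list of out-neighbours.
    Returns (distances, number_of_supersteps)."""
    dist = {v: float("inf") for v in graph}
    inbox = {source: [0]}  # messages delivered at the start of a superstep
    supersteps = 0

    while inbox:  # halt when no messages are in flight
        outbox = {}
        # Each vertex with mail runs the same compute logic on its own state.
        for v, messages in inbox.items():
            best = min(messages)
            if best < dist[v]:
                dist[v] = best
                # Tell neighbours about the improved distance.
                for w in graph[v]:
                    outbox.setdefault(w, []).append(best + 1)
        # Barrier: messages sent now are only visible in the next superstep.
        inbox = outbox
        supersteps += 1

    return dist, supersteps


if __name__ == "__main__":
    g = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
    print(bsp_shortest_paths(g, "a"))
```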

Readings

Mining of Massive Datasets: Link Analysis Section 5.4
Sherif Sakr. Large-Scale Graph Processing Systems, 2016.

Slides

PDF Part 10a

PDF Part 10b

Syllabus Data-Intensive Distributed Computing (Fall 2021)
