Data-Intensive Distributed Computing

Schedule

Part	Description	Dates	CS 451/651 Assignments	CS 431/631 Assignments
1	MapReduce Algorithm Design	Sep 5, 10, 12, 17	A0: Sep 17
2	From MapReduce to Spark	Sep 19, 24	A1: Sep 24	A0: Sep 19
3	Analyzing Text	Sep 26, Oct 1	A2: Oct 1	A1: Sep 26
4	Analyzing Graphs	Oct 3, 8	A3: Oct 8	A2: Oct 10
5	Analyzing Relational Data	Oct 10, 22, 24		A3: Oct 24
6	Data Mining and Machine Learning	Oct 29, 31, Nov 5, 7	A4: Oct 29
7	Mutable State	Nov 12, 14	A5: Nov 12	A4: Nov 14
8	Analyzing Graphs, Redux	Nov 19, 21
9	Real-Time Analytics	Nov 26, 28	A6: Nov 26	A5: Nov 28
10	Looking Ahead	Dec 3	A7: Dec 3

Part 1: MapReduce Algorithm Design Sep 5, 10, 12, 17

Topics

What's this course about?
Why big data?
The datacenter is the computer and other "big ideas"
MapReduce programming model
Cloud computing and datacenters
Hadoop API
Hadoop physical execution
MapReduce design patterns
Intermediate aggregation and combiners
Partitioning, grouping, sorting, and monoids

Readings

Data-Intensive Text Processing with MapReduce
Hadoop: The Definitive Guide (4th Edition):
- Chapter 1: Meet Hadoop
- Chapter 2: MapReduce
- Chapter 3: The Hadoop Distributed Filesystem (Focus on the mechanics of the HDFS commands and don't worry so much about learning the Java API all at once—you'll pick it up in time.)
- Chapter 5: Hadoop I/O (Read sections "Serialization" and "File-Based Data Structures")
- Chapter 6: Developing a MapReduce Application (Skip sections "Setting Up the Development Environment", "Writing a Unit Test with MRUnit" and "MapReduce Workflows")
- Chapter 7: How MapReduce Works (Skip section on "Configuration Tuning")
- Chapter 8: MapReduce Types and Formats
- Chapter 9: MapReduce Features (Read sections on "Counters", "Sorting", and "Side Data Distribution")

Slides

PPTX PDF Part 1a: September 5

PPTX PDF Part 1b: September 10

PPTX PDF Part 1c: September 12

PPTX PDF Part 1d: September 17

Part 2: From MapReduce to Spark Sep 19, 24

Topics

Evolution of dataflow abstractions
MapReduce, Pig, Spark, etc.

Readings

Jimmy Lin. Monoidify! Monoids as a Design Principle for Efficient MapReduce Algorithms. arXiv:1304.7544.
Learning Spark (Optional):
- Chapter 1: Introduction to Data Analysis with Spark
- Chapter 2: Downloading Spark and Getting Started (Skip section on downloading)
- Chapter 3: Programming with RDDs
- Chapter 4: Working with Key/Value Pairs
- Chapter 5: Loading and Saving Your Data (Stop when you get to Structured Data with Spark SQL)

Note that the Spark book is a bit outdated since it covers Spark 1.3; we're using Spark 2.1. All the material in the book can be found in a multitude of sources online, but you'll have to hunt around for resources; the book is useful primarily as a single reference that gathers everything together.

Slides

PPTX PDF Part 2a (1/2): September 19

PPTX PDF Part 2a (2/2): September 19

PPTX PDF Part 2b: September 24

Part 3: Analyzing Text Sep 26, Oct 1

Topics

Language models and machine translation
Inverted indexing and search

Readings

Data-Intensive Text Processing with MapReduce — Chapter 4: Inverted Indexing for Text Retrieval

Slides

PPTX PDF Part 3a: September 26

PPTX PDF Part 3b: October 1

Part 4: Analyzing Graphs Oct 3, 8

Topics

Graph representations
Parallel breadth-first search
PageRank and random walks
Issues and challenges with dataflow abstractions

Readings

Data-Intensive Text Processing with MapReduce — Chapter 5: Graph Algorithms

Slides

PPTX PDF Part 4a: October 3

PPTX PDF Part 4b: October 8

Part 5: Analyzing Relational Data Oct 10, 22, 24

Topics

OLTP vs. OLAP
Data warehousing and data lakes, ETL
SQL-on-Hadoop: relational data processing with MapReduce and Spark
Optimizations for relational processing: row vs. column stores, vectorized processing
Semistructured data and record reconstruction (Parquet)

Readings

Data-Intensive Text Processing with MapReduce — Chapter 6: Processing Relational Data
MapReduce: A major step backwards
Chaudhuri et al. (2011) An overview of business intelligence technology, CACM, 54(8):88-98.

Slides

PPTX PDF Part 5a: October 10

PPTX PDF Part 5b: October 22

PPTX PDF Part 5c: October 24

Part 6: Data Mining and Machine Learning Oct 29, 31, Nov 5, 7

Topics

Supervised machine learning: binary classification
Logistic regression, gradient descent, stochastic gradient descent, ensemble methods
Production machine learning pipelines
Hashing: minhash, random projections, etc.
Clustering: k-means, Gaussian mixture models

Readings

Tom Mitchell. Naive Bayes and Logistic Regression. (This book chapter serves as supplemental reading and goes into classification in more detail than in lecture.)
Deisenroth et al., Mathematics for Machine Learning: Chapter 12, Classification with Support Vector Machines. (Optional supplemental reading)
Deisenroth et al., Mathematics for Machine Learning: Chapter 11, Density Estimation with Gaussian Mixture Models. (This book chapter serves as supplemental reading and goes into clustering with Gaussian mixture models in more detail than in lecture.)
Jimmy Lin and Dmitriy Ryaboy. Scaling Big Data Mining Infrastructure: The Twitter Experience, SIGKDD Explorations, 14(2):6-19, 2012.

Slides

PPTX PDF Part 6a: October 29

PPTX PDF Part 6b: October 31

PPTX PDF Part 6c: November 5

PPTX PDF Part 6d: November 7

Part 7: Mutable State Nov 12, 14

Topics

Bigtable/HBase: log-structured merge trees
Distributed hash tables
Consistency, latency, and availability tradeoffs

Readings

The original Bigtable paper.
The original DHT paper.
Daniel Abadi. Consistency Tradeoffs in Modern Distributed Database System Design, Computer, 45(2):37-42, 2012.

Slides

PPTX PDF Part 7a: November 12

PPTX PDF Part 7b: November 14

Part 8: Analyzing Graphs, Redux Nov 19, 21

Topics

Bulk synchronous parallel: "think like a vertex" (Giraph)
Alternative approaches: GraphX

Readings

Sherif Sakr. Large-Scale Graph Processing Systems, 2016.

Slides

PPTX PDF Part 8a: November 19

PPTX PDF Part 8b: November 21

Part 9: Real-Time Analytics Nov 26, 28

Topics

Stream processing semantics, issues, and frameworks
Probabilistic data structures (HyperLogLog counters, Bloom filters, count-min sketches, etc.)
Integrating batch and stream processing

Readings

Zaharia et al. Discretized Streams: Fault-Tolerant Streaming Computation at Scale, SOSP 2013.

Slides

PPTX PDF Part 9a: November 26

PPTX PDF Part 9b: November 28

Syllabus Data-Intensive Distributed Computing (Fall 2019)
