Schedule
| Part | Description | Dates | Related Assignment |
| 1 | Introduction to Big Data | Sept. 5th | CS451 - A0, CS431 - A0 |
| 2 | MapReduce Algorithm Design | Sept. 10, 12, 17 | CS451 - A1, CS431 - A1 |
| 3 | From MapReduce to Spark | Sept. 19, 24, 26 | CS451 - A2, CS431 - A2 |
| 4 | Analyzing Text | Oct. 1, 3 | CS451 - A3 |
| 5 | Analyzing Graphs | Oct. 8, 10, 22 | CS451 - A4, CS431 - A3 |
| | Reading Week! | Oct. 12-20 | - |
| 6 | Data Mining and Machine Learning | Oct. 24, 29, 31🎃, Nov. 5 | CS451 - A5, CS431 - A4 |
| 7 | Analyzing Relational Data | Nov. 7, 12, 14 | CS451 - A6, CS431 - A5 |
| 8 | Real-Time Analytics (Streaming) | Nov. 19, 21 | CS451 - A7, CS431 - A6 |
| 9 | Mutable State (Big Table / HBase) | Nov. 26, 28 | - |
| 10 | Analyzing Graphs, Redux (Giraph, Spark GraphX) | Dec. 3 | - |
(The party hat is because it's my birthday.)
Note that the following slides are from last term; I will tweak them as time permits. Some JavaScript on this page puts an "updated" note beside any file that changes.
Part 1: Introduction to Big Data
Topics
- What's this course about?
- Why big data?
- Scaling models
Slides
Part 2: MapReduce Algorithm Design
Topics
- MapReduce programming model
- Cloud computing and datacenters
- Hadoop API
- Hadoop physical execution
- MapReduce design patterns
- Intermediate aggregation and combiners
- Partitioning, grouping, and sorting
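As a rough illustration of the programming model and of intermediate aggregation, the sketch below simulates word count in plain Python on a single machine. It is not Hadoop API code: the function names (`map_fn`, `combine`, `reduce_fn`, `run_job`) and the two-split input are assumptions made up for this example. It also counts how many key-value pairs cross the simulated shuffle with and without a combiner.

```python
from collections import defaultdict

def map_fn(line):
    """Mapper: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def combine(pairs):
    """Combiner: pre-sum counts within a single mapper's output."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def reduce_fn(key, values):
    """Reducer: sum all partial counts for one key."""
    return key, sum(values)

def run_job(splits, use_combiner=True):
    """Simulate map -> (combine) -> shuffle -> reduce over input splits.

    Returns (result dict, number of pairs sent through the shuffle).
    """
    shuffled = defaultdict(list)
    pairs_sent = 0
    for split in splits:  # one simulated mapper per split
        out = [pair for line in split for pair in map_fn(line)]
        if use_combiner:
            out = combine(out)
        pairs_sent += len(out)
        for key, value in out:
            shuffled[key].append(value)  # group values by key
    # sorted() stands in for the framework's sort-by-key phase
    result = dict(reduce_fn(k, v) for k, v in sorted(shuffled.items()))
    return result, pairs_sent

splits = [["a b a", "b a"], ["a c"]]
with_combiner = run_job(splits, use_combiner=True)
without_combiner = run_job(splits, use_combiner=False)
```

Both runs produce the same counts; the combiner only reduces the number of intermediate pairs, which is safe here because addition is associative and commutative.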
Readings
- Data-Intensive Text Processing with MapReduce
- Hadoop: The Definitive Guide (4th Edition):
- Chapter 1: Meet Hadoop
- Chapter 2: MapReduce
- Chapter 3: The Hadoop Distributed Filesystem (Focus on the mechanics of the HDFS commands and don't worry so much about learning the Java API all at once—you'll pick it up in time.)
- Chapter 5: Hadoop I/O (Read sections "Serialization" and "File-Based Data Structures")
- Chapter 6: Developing a MapReduce Application (Skip sections "Setting Up the Development Environment", "Writing a Unit Test with MRUnit" and "MapReduce Workflows")
- Chapter 7: How MapReduce Works (Skip section on "Configuration Tuning")
- Chapter 8: MapReduce Types and Formats
- Chapter 9: MapReduce Features (Read sections on "Counters", "Sorting", and "Side Data Distribution")
Slides
Part 3: From MapReduce to Spark
Topics
- Evolution of dataflow abstractions
- MapReduce, Pig, Spark, etc.
Readings
- Learning Spark (Optional):
- Chapter 1: Introduction to Data Analysis with Spark
- Chapter 2: Downloading Spark and Getting Started (Skip section on downloading)
- Chapter 3: Programming with RDDs
- Chapter 4: Working with Key/Value Pairs
- Chapter 5: Loading and Saving Your Data (Stop when you get to Structured Data with Spark SQL)
Slides
Part 4: Analyzing Text
Topics
- Language models and machine translation
- Inverted indexing and search
Readings
Slides
Part 5: Analyzing Graphs
Topics
- Graph representations
- Parallel breadth-first search
- PageRank and random walks
- Issues and challenges with dataflow abstractions
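For a concrete feel of PageRank as a random walk, here is a minimal single-machine sketch in Python. It is only an illustration, not the formulation used in the lectures or assignments: the function name `pagerank`, the random-jump probability `alpha=0.15`, the fixed iteration count, and the choice to spread mass from dangling nodes evenly are all assumptions made for this example.

```python
def pagerank(graph, alpha=0.15, iterations=50):
    """Iterative PageRank on an adjacency-list graph.

    graph maps each node to a list of out-neighbours; alpha is the
    random-jump probability. Mass held by dangling nodes (no out-links)
    is redistributed evenly over all nodes.
    """
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        dangling = sum(rank[v] for v, outs in graph.items() if not outs)
        received = {node: 0.0 for node in graph}
        for node, outs in graph.items():
            for m in outs:
                # each node splits its mass evenly among its out-links
                received[m] += rank[node] / len(outs)
        rank = {
            node: alpha / n + (1 - alpha) * (received[node] + dangling / n)
            for node in graph
        }
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": []}
ranks = pagerank(graph)
```

The inner loop over out-links is the part that maps naturally onto a distributed job: each node emits its share of mass to its neighbours, and the contributions are summed per destination node.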
Readings
Slides
Part 6: Data Mining and Machine Learning
Topics
- Supervised machine learning: binary classification
- Logistic regression, gradient descent, stochastic gradient descent, ensemble methods
- Production machine learning pipelines
- Hashing: MinHash
- Clustering: k-means
Readings
- Tom Mitchell. Naive Bayes and Logistic Regression. (This book chapter serves as supplemental reading and goes into classification in more detail than in lecture.)
- Deisenroth et al., Mathematics for Machine Learning: Chapter 12, Classification with Support Vector Machines. (Optional supplemental reading)
- Deisenroth et al., Mathematics for Machine Learning: Chapter 11, Density Estimation with Gaussian Mixture Models. (This book chapter serves as supplemental reading and goes into clustering with Gaussian mixture models in more detail than in lecture.)
Slides
Part 7: Analyzing Relational Data
Topics
- OLTP vs. OLAP
- Data warehousing and data lakes, ETL
- SQL-on-Hadoop: relational data processing with MapReduce and Spark
- Optimizations for relational processing: row vs. column stores, vectorized processing
- Semistructured data and record reconstruction (Parquet)
Readings
Slides
Part 8: Real-Time Analytics
Topics
- Stream processing semantics, issues, and frameworks
- Introduction to Apache Spark Streaming
- Probabilistic data structures (HyperLogLog counters, Bloom filters, count-min sketches, etc.)
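To make one of the probabilistic data structures concrete, here is a small Bloom filter sketch in Python. The class name `BloomFilter`, the default sizes, and the use of salted SHA-256 to derive hash positions are assumptions chosen for readability, not a recommendation for production use.

```python
import hashlib

class BloomFilter:
    """Bloom filter: never gives false negatives, may give false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # Present only if every position is set; any zero means "definitely not".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for word in ["spark", "hadoop", "hbase"]:
    bf.add(word)
```

A membership test such as `"spark" in bf` is always true for items that were added; for items that were not, it is usually false, with a false-positive rate that grows as more bits are set.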
Readings
Slides
Part 9: Mutable State
Topics
- Bigtable/HBase: log-structured merge trees
- Distributed hash tables
- Consistency, latency, and availability tradeoffs
Readings
Slides
Part 10: Analyzing Graphs, Redux
Topics
- Bulk synchronous parallel: "think like a vertex" (Giraph)
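The "think like a vertex" idea can be sketched without Giraph: in each superstep, every vertex that received messages updates its own state and sends messages to its neighbours, and the computation halts when no messages are in flight. The Python below simulates this for breadth-first hop counts. The function name `bsp_shortest_hops` and the dictionary-based inbox and outbox are assumptions for illustration and do not mirror the Giraph API.

```python
def bsp_shortest_hops(graph, source):
    """Vertex-centric BFS under a bulk-synchronous model.

    graph maps each vertex to its out-neighbours. Returns the hop
    distance from source for every vertex and the number of supersteps.
    """
    dist = {v: float("inf") for v in graph}
    inbox = {source: [0]}  # messages delivered at the start of a superstep
    supersteps = 0
    while inbox:
        outbox = {}
        for v, msgs in inbox.items():  # per-vertex compute step
            best = min(msgs)
            if best < dist[v]:
                dist[v] = best
                for nbr in graph[v]:
                    outbox.setdefault(nbr, []).append(best + 1)
        inbox = outbox  # barrier: messages become visible next superstep
        supersteps += 1
    return dist, supersteps

graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": [], "e": ["a"]}
dist, supersteps = bsp_shortest_hops(graph, "a")
```

Vertices that cannot be reached from the source, such as `"e"` here, never receive a message and keep an infinite distance.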
Readings
Slides