Instructors: Dan Holtby -- Please include the course code in the subject, e.g. "[CS451] ... ", so my email filters will put it in the right folder (my root inbox is a lawless wasteland)
When/Where: See Quest
Piazza: CS451/651 or CS431

Office Hours:

Who | Time | Location
Dan Holtby | Tuesday/Thursday 11:30-12:30 (give or take time to go grab a coffee from C&D) | DC 2106
Dan Holtby (CS451) | Wednesday (11AM-noon) on Microsoft Teams (Look for a post in the "Office Hours" channel). (also feel free to request an appointment, I'll try to fit you in if it's during the workday MWF) | Microsoft Teams (online)
Dan Holtby (CS431) | Friday (11AM-noon) | Microsoft Teams (online)
TAs | Varies, see Piazza postings for schedules |

The big data stack

Over the past decade, we have seen the emergence of "big data": disruptive technologies that have transformed commerce, science, and many aspects of society. These developments are enabled by infrastructure that allows us to distribute computations across hundreds or even thousands of commodity servers. One important advance that has made all this possible is the development of abstractions for data-intensive computing that allow programmers to reason about computations at a massive scale, hiding low-level details such as synchronization, data movement, and fault tolerance.

What is this course about? This course provides an introduction to data-intensive distributed computing. Our focus is algorithm design and "thinking at scale": we will cover data mining and machine learning techniques as applied to text, graphs, and relational data. Most of the course will be taught in a combination of MapReduce and Spark, two representative dataflow abstractions for large-scale data analysis, although we will introduce alternative abstractions such as bulk-synchronous parallel and streaming models as well.
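
To give a rough flavour of what a dataflow abstraction like MapReduce asks the programmer to write, here is a minimal single-machine sketch of word count in plain Python. It is only an illustration, not course code and not the Hadoop or Spark API: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example, and the shuffle step stands in for the grouping that a real framework would perform across the cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper (hypothetical name): emit a (word, 1) pair for each word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key; a real framework does this between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer (hypothetical name): sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the map, shuffle, and reduce steps in sequence on one machine."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big ideas", "data at scale"]))
```

The programmer supplies only the mapper and the reducer; in a real system, the framework would take care of distributing the input, moving data between the two phases, and recovering from failures.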

One might break down the "big data" stack in the manner shown on the right. At the bottom resides the execution infrastructure, which is responsible for coordinating computations across a cluster (examples include MapReduce and Spark). In the middle resides analytics infrastructure, which implements data mining and machine learning algorithms on top of the execution infrastructure (an example would be MLlib in Spark). At the top are the tools data scientists use to generate insights, built on top of the analytics infrastructure. This course focuses on the middle part — by the end of the course, you will be able to implement basic data mining and machine learning algorithms that can operate at scale. Of course, effective algorithm design requires understanding the execution infrastructure (below) and what the algorithms are used for (above), so we will cover the broader context as well.