Data-Intensive Distributed Computing

Bespin

Bespin is a software library that contains reference implementations of "big data" algorithms in MapReduce and Spark. It provides sample code for many of the algorithms we'll be discussing in class and also provides starting points for the assignments. You'll want to familiarize yourself with the library.

Single-Node Hadoop: Linux Student CS Environment

A single-node Hadoop cluster (also called "local" mode) comes pre-configured in the linux.student.cs.uwaterloo.ca environment. We will ensure that everything works correctly in this environment.

TL;DR. Just set up your environment as follows (in bash; adapt accordingly for your shell of choice):

export PATH=/usr/lib/jvm/java-8-openjdk-amd64/jre/bin:/u3/cs451/packages/spark/bin:/u3/cs451/packages/hadoop/bin:/u3/cs451/packages/maven/bin:/u3/cs451/packages/scala/bin:$PATH
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre

Note that we do not advise you to add the above lines to your shell config file (e.g., .bash_profile), but rather to set up your environment explicitly every time you log in. The reason for this is to reduce the possibility of conflicts when you start using the Datasci cluster (see below).

Alternative: these commands live in ~cs451/pub/setup.bash so you can just run

source ~cs451/pub/setup.bash

It's easier than copy & pasting those two lines.every time. And I've included a guard that only runs them if you're on one of the student.cs Ubuntu hosts. So it is safe to add to your .bashrc.

Details. For the course we need Java, Scala, Hadoop, Spark, and Maven. Java is already available in the default user environment (but we need to point to the right version). The rest of the packages are installed in /u3/cs451/packages/. The directories scala, hadoop, spark, and maven are actually symlinks to specific versions. This is so that we can transparently change the links to point to different versions if necessary without affecting downstream users. Currently, the versions are:

Java: 1.8.0
Scala: 2.12.20
Hadoop: 3.4.1
Spark: 3.5.4
Maven: 3.3.9

Single-Node Hadoop: Personal Install

You can complete all assignments using student.cs and the datasci cluster, HOWEVER, student.cs has a sometimes-slow filesystem that can make maven builds take a long time, and that can be painful when doing dev work. I strongly suggest that if you have a computer with enough free space (~10GB) that you follow the next steps and install Hadoop and Spark yourself.

In addition to using the single-node Hadoop cluster on linux.student.cs.uwaterloo.ca, you may wish to install all necessary software packages locally on your own machine. We provide basic installation instructions here, but the course staff cannot provide technical support due to the size of the class and the idiosyncrasies of individual systems. We will be responsible for making sure everything works properly in the Linux Student CS Environment (above), but can only offer limited help with your own system. It's pretty straight forward in Ubuntu (whether running natively or using WSL under Windows).

Both Hadoop and Spark work fine on Mac OS X and Linux, but may be difficult to get working on Windows (I very strongly suggest that you use WSL, where it's easy). Note that to run Hadoop and Spark on your local machine comfortably, you'll need at least 4 GB memory and plenty of disk space (at least 10 GB).

You'll also need Java (Must use JDK 1.8), Scala (must use Scala 2.12.20 EXACTLY since Maven is picky about versions), and Maven (any reasonably recent version).

The versions of the packages installed on linux.student.cs.uwaterloo.ca are as follows:

Download the above packages, unpack the tarball, add their respective bin/ directories to your path (and your shell config), and you should be go to go.

Alternatively, you can also install the various packages using a package manager, e.g., apt-get, MacPorts, etc. However, make sure you get the right version.

Installing Hadoop and Spark on Ubuntu

(Note that this will work for any distribution, but if you need to install a package you'd use the appropriate manager, like brew for OS X.)

Step 1: Do you have Java?

What does javac -version say?

"openjdk 1.8.0_xyz" - excellent, you have OpenJDK Java 8 installed AND it's the default java (the xyz doesn't matter, any number is fine)
"java version 1.8.0_xzy" - excellent, you have Oracle Java 8 installed and it's the default java. (you should switch to OpenJDK at some point though, especially if you're a grad student and this is a department issued machine you're using!)
"1.A.B_xzy" - not good, unless A is 8 and B is 0 you do not have Java 8 installed (or you might, but it's not the default java)
"command not found" - you definitely don't have any kind of Java installed.

If you had the wrong version, you can use update-alternatives --list javac to see all versions of java that are installed. If you see that 1.8.0 is installed, then you can make it the default:

      sudo update-alternatives --set java {wherever it said java 8 was installed}/bin/java
      sudo update-alternatives --set javac {wherever it said java 8 was installed}/bin/javac

(I don't know the non-ubuntu alternatives for update-alternatives) If you got "command not found" or the alternatives list did not contain Java 8, you can install it like this: sudo apt install openjdk-8-jdk. If you have multiple versions installed now you will need to do the update-alternatives commands above.

Step 2: Install Scala

You can get the right version From the scala-lang website For ubuntu you want the .deb file. You could also run in the command line wget https://github.com/scala/scala/releases/download/v2.12.20/scala-2.12.20.deb . For OS X the above link has instructions for using brew or port to install it. If you have the .deb file downloaded then run sudo apt install ./scala-2.12.20.deb

Step 3: Install Maven

This one is easy: sudo apt install maven The version doesn't matter so whatever version is in the distro's repository is fine.

Step 4: Install Hadoop

Go here: https://hadoop.apache.org/releases.html - you can grab a newer version of Hadoop 3 if one is out by the time you're reading this note!
Download the binary (not aarch64, unless your machine is an ARM64 machine, like newer Macs, etc. in which case you hopefully knew what aarch64 meant!)
Unzip the tar.gz file wherever you want, like, /opt/hadoop for example. Or extract it in your home directory and it'll be at ~/hadoop-.../ Or anywhere you want. Just remember where you put it.

Step 5: Install Spark

Go here: https://spark.apache.org/downloads.html - make sure you grab 3.5.x - spark releases are pretty frequent but the minor version should not break things. DO NOT GET 4.0
For "package type" select "Pre-built with user-provided Hadoop"
Download the binary
Unzip the tar.gz file wherever you want, like, /opt/spark for example. Or anywhere you want.

Step 6: Configuration

Add this to your shell's rc file (e.g. .bashrc)

export HADOOP_HOME={path where you unzipped hadoop}
export SPARK_HOME={path where you unzipped spark}
export PATH=$HADOOP_HOME/bin:$SPARK_HOME/bin:$PATH

You *might* also need to do this to $SPARK_HOME/conf/spark-env.sh: add/uncomment the line

export SPARK_DIST_CLASSPATH=$(hadoop classpath)

I have a note that says to do it, but this might be out of date.

Step 7: Try it out

Try running spark-shell You should get a scala prompt. See if it runs Spark code without errors:

scala> sc.parallelize(Array(1,2,3,4,5)).map(_*2).collect
res0: Array[Int] = Array(2, 4, 6, 8, 10)

At that point you're good to go!

Distributed Hadoop Cluster: Datasci

In addition to running "toy" Hadoop on a single node (which obviously defeats the point of a distributed framework), we're going to using the school's modest Hadoop teaching cluster called Datasci.

Warning

If you've added setup scripts for other courses in the past, you might have to remove that from your .bashrc, .bash_profile, .profile, or wherever else you put them! In particular, CS348's setup is breaking people's $PATH variable and leaving them unable to do anything on datasci! (CS246 setup fails to display a cow fortune, which is probably nothing to worry about unless you really wanted your cow fortune). If you have datasci problems, and say you have never modified your profile files, but you, in fact, have done so, I will make this face: 😞

Actually, I've been told some courses' setup scripts modify your startup - so if you only had to run it once, instead of each login, it probably messed with your environment and you'll have to undo this!

Accounts are already set up for students enrolled in the course. You should be able to log into the cluster as follows:

ssh <your userid>@datasci-login.cs.uwaterloo.ca

NOTE: You must configure a public/private keypair. You can find directions here: MFCF - Creating SSH keys. Datasci shares the file system with student.cs so adding to your authorized_keys file on student.cs will also add it to datasci.

If you're using PuTTY on Windows, the program to use is called "PuTTYGen". Save the private key somewhere, and copy public key text shown into "~/.ssh/authorized_keys". Unlike under Linux, putty does not default to trying private keys, you have to configure this manually. Under "Connection" > "SSH" > "Auth", enter the location of your private key in the box labeled "Private key for authentication".

Warning: datasci-login only accepts connections from on-campus. To connect from home you must use a VPN, like the School of Computer Science VPN. Follow the directions on that page to get started. The campus VPN should also work. Note: You can also ssh to student.cs from off-campus then ssh to datasci-login, but you really should install the VPN as you also need to be on the VPN to view the cluster monitoring page!

NOTE: Do not set up the environment in Datasci. The path is already set.

Cluster Monitor Page

You can monitor the status for running jobs at the Cluster Management Page. They're sorted by submission time so if it's busy you might need to scroll down, or scroll to the right to find the "search" bar to search by userid. Clicking on the "ID" link will take you to the details of the job. For Spark jobs, the "Tracking UI" column will link to the Spark Application page for your job (but only while the job is running, if it's already finished you just get the summary, no cool diagrams). You'll familiarize yourself with this on A0. It's very useful. Like SSH, you must be on campus / a campus VPN to view the page (again, this is why you can't just ssh from student.cs to datasci-login.cs)

Software (CS451/651 only)Data-Intensive Distributed Computing (Fall 2025)