Single-Node Hadoop: Linux Student CS Environment

A single-node Hadoop cluster (also called "local" mode) comes pre-configured in the linux.student.cs.uwaterloo.ca environment. We will ensure that everything works correctly in this environment. (But note: I'm not actually sure whether non-CS students have a cs-teaching account, so you might not be able to do this!)

TL;DR. Just set up your environment as follows (in bash; adapt accordingly for your shell of choice):

export PATH=/usr/lib/jvm/java-8-openjdk-amd64/jre/bin:/u3/cs451/packages/spark3/bin:/u3/cs451/packages/hadoop/bin:/u3/cs451/packages/maven/bin:/u3/cs451/packages/scala/bin:$PATH
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
export PYTHONPATH=/u/cs451/packages/spark/python
export SPARK_HOME=/u/cs451/packages/spark
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
export PYSPARK_PYTHON=/usr/bin/python3
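
To sanity-check that the setup took effect (assuming the packages live at the paths above), you can run:

which hadoop spark-shell mvn    # should all resolve under /u3/cs451/packages/
java -version                   # should report version 1.8
hadoop version                  # should report Hadoop 3.x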

Note that we advise you not to add the above lines to your shell config file (e.g., .bash_profile), but rather to set up your environment explicitly every time you log in. This reduces the possibility of conflicts.

Alternative: these commands live in ~cs451/pub/setup-431.bash, so you can just run

source ~cs451/pub/setup-431.bash
It's easier than copying and pasting those lines every time. I've also included a guard that only runs them if you're on one of the student.cs Ubuntu hosts, so it is safe to add to your .bashrc.
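
If you're curious, the guard is just a hostname check; a sketch of the idea (not the exact contents of the file) looks something like this:

if [[ "$(hostname -f)" == *.student.cs.uwaterloo.ca ]]; then
    # only set up the course toolchain on student.cs hosts
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
    export PATH=/usr/lib/jvm/java-8-openjdk-amd64/jre/bin:/u3/cs451/packages/spark3/bin:/u3/cs451/packages/hadoop/bin:/u3/cs451/packages/maven/bin:/u3/cs451/packages/scala/bin:$PATH
    # ...plus the remaining export lines from above
fi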

Single-Node Hadoop: Personal Install

You can complete all assignments using student.cs or Google Colab. However, student.cs has a sometimes-slow filesystem that can make Maven builds take a long time, which is painful when doing dev work. If you have a computer with enough free space (~10 GB), I strongly suggest you follow the next steps and install Hadoop and Spark yourself.
Download Spark from the Apache Spark downloads page; you should get whatever the newest 3.x.y release is. Get the version "Pre-built for Hadoop 3.3 and later", which is the first option. You will need to have Java installed to use Spark. You do not need to install PySpark separately, as it comes bundled with the Spark tarball (you will, however, need a Python interpreter on your machine).

Download the above packages, unpack the tarballs, add their respective bin/ directories to your PATH (and to your shell config), and you should be good to go.

Alternatively, you can also install the various packages using a package manager, e.g., apt-get, MacPorts, etc. However, make sure you get the right version.
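
Either way, a quick way to confirm that everything works is to start the PySpark shell and run a tiny job (the shell creates the SparkContext, sc, for you):

pyspark
>>> sc.parallelize(range(100)).sum()
4950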

Installing Hadoop and Spark on Ubuntu

Step 1 is to install Java if it's not already installed. Here's how to do it on Ubuntu:
sudo apt-get install openjdk-8-jdk-headless
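You can confirm the JDK installed correctly before moving on:

java -version    # should print something like: openjdk version "1.8.0_..."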
Next you need to unpack the Spark tarball. Just in your home directory is fine. If you want, you can rename the folder it creates to "spark" instead of the full name (which is what I did). Finally, you'll need to update your environment by editing .bashrc. Here is mine for reference:
export PATH=$PATH:/home/djholtby/spark/bin/
export SPARK_HOME=/home/djholtby/spark/
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
Note that the SPARK_DIST_CLASSPATH line invokes hadoop classpath, so it assumes Hadoop is also installed and on your PATH. That "should" be all you need to do for VS Code to run your notebook files.
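
After reloading your shell config, you can double-check that everything resolves:

source ~/.bashrc
echo $SPARK_HOME          # should print the directory you unpacked Spark into
spark-submit --version    # should report Spark 3.x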

Using VS Code

This is my preferred editor for notebook files. When you first open one, you'll be prompted to pick a "kernel". I suggest you follow the directions and create a virtual environment, but you can also just use the global install if you like. Make sure to install findspark by running pip3 install findspark. This package lets you do the following:
import findspark
findspark.init()
That will automatically find Spark and add its libraries, so you can use Spark without having to configure all of the PySpark paths manually before launching Python:

from pyspark import SparkContext, SparkConf
(The assignment notebooks will have blocks that do this for you).
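
For reference, a complete minimal cell might look like this (a sketch, assuming findspark is installed and can locate Spark, e.g., via SPARK_HOME):

import findspark
findspark.init()  # finds Spark and adds its Python libraries to sys.path

from pyspark import SparkContext, SparkConf

# run a trivial local job to confirm everything is wired up
conf = SparkConf().setAppName("smoke-test").setMaster("local[*]")
sc = SparkContext(conf=conf)
print(sc.parallelize(range(100)).sum())  # prints 4950
sc.stop()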