
A single-node Hadoop cluster (also called "local" mode) comes
pre-configured in the linux.student.cs.uwaterloo.ca
environment. We will ensure that everything works correctly in this
environment. (But: I'm not actually sure whether non-CS students have a cs-teaching account, so you might not be able to do this!)
TL;DR. Just set up your environment as follows (in bash; adapt accordingly for your shell of choice):
export PATH=/usr/lib/jvm/java-8-openjdk-amd64/jre/bin:/u3/cs451/packages/spark3/bin:/u3/cs451/packages/hadoop/bin:/u3/cs451/packages/maven/bin:/u3/cs451/packages/scala/bin:$PATH
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
export PYTHONPATH=/u/cs451/packages/spark/python
export SPARK_HOME=/u/cs451/packages/spark
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
export PYSPARK_PYTHON=/usr/bin/python3
Note that we advise against adding the above lines to your
shell config file (e.g., .bash_profile); instead, set up your
environment explicitly every time you log in. The reason for
this is to reduce the possibility of conflicts.
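Once your environment is set up, a quick sanity check (these are just the standard version-reporting commands, nothing course-specific) confirms that the tools are being picked up from the paths above:

java -version
hadoop version
spark-submit --version
mvn -version
scala -version

Each of these should print version information rather than a "command not found" error.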
Alternative: these commands live in ~cs451/pub/setup-431.bash, so you can just run

source ~cs451/pub/setup-431.bash

It's easier than copying and pasting those lines every time. I've also included a guard that only runs them if you're on one of the student.cs Ubuntu hosts, so it is safe to add to your
.bashrc.
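The script itself isn't reproduced here, but the guard is roughly of this shape (a sketch of the idea only, not the actual contents of setup-431.bash; the host test it uses may differ):

# Sketch: apply the course environment only on student.cs hosts.
if [[ "$(hostname -f)" == *.student.cs.uwaterloo.ca ]]; then
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
    export SPARK_HOME=/u/cs451/packages/spark
    # ... plus the remaining exports from the TL;DR above
fi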
Download the above packages, unpack the tarballs, add their
respective bin/ directories to your path (and your shell
config), and you should be good to go.
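As a rough sketch (the tarball and directory names below are illustrative only; substitute whatever versions you actually downloaded):

tar xzf spark-3.1.2-bin-hadoop3.2.tgz -C "$HOME"     # example Spark tarball name
tar xzf hadoop-3.3.0.tar.gz -C "$HOME"               # example Hadoop tarball name
export PATH="$PATH:$HOME/spark-3.1.2-bin-hadoop3.2/bin:$HOME/hadoop-3.3.0/bin"

Put that last export line in your shell config so it persists across sessions.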
Alternatively, you can install the various packages using a
package manager, e.g., apt-get, MacPorts, etc. However,
make sure you get the right versions.
sudo apt-get install openjdk-8-jdk-headless

Next you need to unpack the Spark tarball. Just unpacking it in your home directory is fine. If you want, you can rename the folder it creates to "spark" instead of the full name (which is what I did). Finally, you'll need to update your environment by editing
.bashrc. Here is mine for reference:
export PATH=$PATH:/home/djholtby/spark/bin/
export SPARK_HOME=/home/djholtby/spark/
export SPARK_DIST_CLASSPATH=$(hadoop classpath)

That "should" be all you need to do for VS Code to run your notebook files.
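To double-check the local setup (standard commands; open a fresh terminal first so the new .bashrc settings take effect):

echo "$SPARK_HOME"        # should print your Spark directory
spark-submit --version    # should print the Spark version banner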
You'll also want the findspark package:

pip3 install findspark

This package lets you do the following:
import findspark
findspark.init()

That will automatically find Spark and add its libraries, so you can use Spark without having to configure all of the PySpark paths manually before launching Python. After that, the usual imports work:

from pyspark import SparkContext, SparkConf

(The assignment notebooks will have blocks that do this for you.)