Apache Spark is one of the most popular frameworks among Big Data practitioners. In a few words, Spark is a fast and powerful framework that provides an API to perform massive distributed processing over resilient sets of data.
Jupyter Notebook is a popular application that enables you to edit, run, and share Python code in a web view, visualizing the output of each cell as you go. This is why it is one of the most widely used platforms among programmers in the Data Science field, including Kaggle competitions.
Installation: I am assuming that you have already installed Spark and Hadoop; if you would like a blog post on that, please mention it in the comments section and we can cover it in a coming post. Before installing PySpark, you must have Python and Spark installed, so from here on we assume this environment is in place.
First, connect to the remote environment:
- Log in to the remote environment using the Bitvise SSH Client or any other remote-connectivity client
- Once logged in with the SSH client, let us check which databases we have access to
PySpark installation: let us check whether PySpark is installed correctly.
- Set the Anaconda path: PATH=/cloudera/parcels/Anaconda/bin:$PATH; export PATH. The exact path may differ depending on your environment setup
- Type pyspark
- The PySpark shell starts, confirming that PySpark is installed in our environment
Working with Jupyter Notebook integrated with PySpark: before moving to Jupyter Notebook, a few environment-setup steps are needed. Run all of the following commands in the remote environment's command line.
a) Path setup
1. locate spark-env.sh
2. PATH=/cloudera/parcels/Anaconda/bin:$PATH; export PATH
3. export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
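Putting the path steps together, the shell fragment below is a sketch of the full setup. Note that PYSPARK_DRIVER_PYTHON=jupyter is my assumption — it is not listed in the steps above, but the *_OPTS variable only takes effect once the driver Python is set to Jupyter — and the Anaconda path should be adjusted to your environment:

```shell
# Add Anaconda's Python to the PATH (adjust the path for your environment)
PATH=/cloudera/parcels/Anaconda/bin:$PATH; export PATH

# Launch Jupyter Notebook as the PySpark driver. PYSPARK_DRIVER_PYTHON
# is an assumed addition: the OPTS variable below only applies when it is set.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
```

With these variables exported, running pyspark on the remote machine starts a Jupyter Notebook server instead of the plain shell.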
b) Set up an SSH tunnel to your remote machine, so that the notebook served on port 8888 there (or any other port you specify) can be opened in your local environment. From your local command prompt, run:
ssh -N -L localhost:8888:localhost:8888 id@remoteservername
This command opens a new SSH session in the terminal. I've added the -N option to tell SSH that I'm not going to execute any remote commands; this ensures the connection cannot be used that way, so see it as an added security measure. I've also added the -L option, which tells SSH to forward port 8888 on my local machine to port 8888 on the remote machine. In case you are using a service account, replace your account name with the service account name.
c) Jupyter Notebook will open like this:
d) Sometimes you may get a "multiple SparkContexts" error. This happens because typing pyspark in the terminal automatically initializes a SparkContext object, so you must either reuse that context or stop it before creating a new one:
sc = SparkContext.getOrCreate() to reuse the existing context, or sc.stop() to stop it first.
Now you can run your code in Jupyter Notebook with PySpark. We used a Linux environment for all of the above setup.
Please provide us with your feedback.
Regards, Team Kite4Sky