0

How-to: Use IPython Notebook with Apache Spark

IPython Notebook and Spark’s Python API are a powerful combination for data science.

[Editor’s note (added April 25,2016): See updated docs on this subject here.]

The developers of Apache Spark have given thoughtful consideration to Python as a language of choice for data analysis. They have developed the PySpark API for working with RDDs in Python, and further support using the powerful IPythonshell instead of the builtin Python REPL.

The developers of IPython have invested considerable effort in building the IPython Notebook, a system inspired by Mathematica that allows you to create “executable documents.” IPython Notebooks can integrate formatted text (Markdown), executable code (Python), mathematical formulae (LaTeX), and graphics/visualizations (matplotlib) into a single document that captures the flow of an exploration and can be exported as a formatted report or an executable script. Below are a few pieces on why IPython Notebooks can improve your productivity:

Here I will describe how to set up IPython Notebook to work smoothly with PySpark, allowing a data scientist to document the history of her exploration while taking advantage of the scalability of Spark and Apache Hadoop.

Software Prerequisites

  • IPython: I used IPython 1.x, since I’m running Python 2.6 on CentOS 6. This required me to install a few extra dependencies, like Jinja2, ZeroMQ, pyzmq, and Tornado, to allow the notebook functionality, as detailed in the IPython docs. :These requirements are only for the node on which IPython Notebook (and therefore the PySpark driver) will be running.
  • PySpark: I used the CDH-installed PySpark (1.x) running through YARN-client mode, which is our recommended method on CDH 5.1. It’s easy to use a custom Spark (or any commit from the repo) through YARN as well. Finally, this will also work with Spark standalone mode.

IPython Configuration

This installation workflow loosely follows the one contributed by Fernando Perez here. This should be performed on the machine where the IPython Notebook will be executed, typically one of the Hadoop nodes.

First create an IPython profile for use with PySpark.

1
ipython profile create pyspark

This should have created the profile directory ~/.ipython/profile_pyspark/. Edit the file ~/.ipython/profile_pyspark/ipython_notebook_config.py to have:

1
2
3
4
5
c = get_config()
 
c.NotebookApp.ip = ‘*’
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8880 # or whatever you want; be aware of conflicts with CDH

If you want a password prompt as well, first generate a password for the notebook app:

1
python c ‘from IPython.lib import passwd; print passwd()’ > ~/.ipython/profile_pyspark/nbpasswd.txt

and set the following in the same .../ipython_notebook_config.py file you just edited:

1
2
PWDFILE=‘~/.ipython/profile_pyspark/nbpasswd.txt’
c.NotebookApp.password = open(PWDFILE).read().strip()

Finally, create the file ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py with the following contents:

1
2
3
4
5
6
7
8
9
import os
import sys
 
spark_home = os.environ.get(‘SPARK_HOME’, None)
  if not spark_home:
      raise ValueError(‘SPARK_HOME environment variable is not set’)
sys.path.insert(0, os.path.join(spark_home, ‘python’))
  sys.path.insert(0, os.path.join(spark_home, ‘python/lib/py4j-0.8.1-src.zip’))
execfile(os.path.join(spark_home, ‘python/pyspark/shell.py’))

Starting IPython Notebook with PySpark

IPython Notebook should be run on a machine from which PySpark would be run on, typically one of the Hadoop nodes.

First, make sure the following environment variables are set:

1
2
3
4
5
# for the CDH-installed Spark
export SPARK_HOME=‘/opt/cloudera/parcels/CDH/lib/spark’
 
# this is where you specify all the options you would normally add after bin/pyspark
  export PYSPARK_SUBMIT_ARGS=‘–master yarn –deploy-mode client –num-executors 24 –executor-memory 10g –executor-cores 5’

Note that you must set whatever other environment variables you want to get Spark running the way you desire. For example, the settings above are consistent with running the CDH-installed Spark in YARN-client mode. If you wanted to run your own custom Spark, you could build it, put the JAR on HDFS, set the SPARK_JAR environment variable, along with any other necessary parameters. For example, see here for running a custom Spark on YARN.

Finally, decide from what directory to run the IPython Notebook. This directory will contain the .ipynb files that represent the different notebooks that can be served. See the IPython docs for more information. From this directory, execute:

1
ipython notebook profile=pyspark

Note that if you just want to serve the notebooks without initializing Spark, you can start IPython Notebook using a profile that does not execute the shell.py script in the startup file.

Example Session

At this point, the IPython Notebook server should be running. Point your browser to http://my.host.com:8880/, which should open up the main access point to the available notebooks. This should look something like this:

How-to: Use IPython Notebook with Apache Spark

This will show the list of possible .ipynb files to serve. If it is empty (because this is the first time you’re running it) you can create a new notebook, which will also create a new .ipynb file. As an example, here is a screenshot from a session that uses PySpark to analyze the GDELT event data set:

How-to: Use IPython Notebook with Apache Spark

The full .ipynb file can be obtained as a GitHub gist.

The notebook itself can be viewed (but not executed) using the public IPython Notebook Viewer.

Learn more about Spark’s role in an enterprise data hub (EDH) here.

Uri Laserson (@laserson) is a data scientist at Cloudera.

keven

Leave a Reply

Your email address will not be published. Required fields are marked *