It’s increasingly said that “notebooks” are the new spreadsheets when it comes to exploratory data analysis. The Apache Zeppelin project (https://zeppelin.incubator.apache.org/) is one such promising notebook-style interface for performing advanced interactive querying of Hadoop data (whether via Hive, Spark, Shell or other scripting languages).
At the time of writing, Zeppelin is not completely mature. For example, it lacks the ability to connect to a Kerberos-secured Hive service, which may make things difficult in an enterprise environment. Nonetheless, it’s worth a look as a new type of workflow for data scientists and other data analysts.
Binaries for Zeppelin can be obtained here (version 0.5.5):
Normal startup steps for Zeppelin are:
cd ~/zeppelin-0.5.5-incubating-bin-all
bin/zeppelin-daemon.sh restart
If WASB (Windows Azure Blob Storage) is the default Hadoop filesystem, however, the following step should be run before starting the Zeppelin daemon with the commands above. Adding these JARs to the Classpath tells Zeppelin how to read from the WASB filesystem:
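As a rough illustration of that step, here is a minimal sketch that joins every JAR in a directory into a colon-separated classpath string. It assumes the Classpath means the CLASSPATH environment variable of the shell that launches the daemon, and that the required JARs have already been copied into one directory. The helper name `build_classpath` and the directory `$HOME/wasb-jars` are hypothetical; the actual JAR files depend on your cluster and are not listed here.

```shell
# Join every .jar file in a directory into a colon-separated classpath string.
# Assumption: the WASB-related JARs have been collected into that directory.
build_classpath() {
  dir="$1"
  cp=""
  for jar in "$dir"/*.jar; do
    [ -e "$jar" ] || continue      # skip the unexpanded pattern when no JARs match
    cp="${cp:+$cp:}$jar"           # add a ':' separator only between entries
  done
  printf '%s\n' "$cp"
}

# Example usage (hypothetical directory), run before starting the daemon:
# export CLASSPATH="$(build_classpath "$HOME/wasb-jars")"
# bin/zeppelin-daemon.sh restart
```

The export has to happen in the same shell session that starts Zeppelin, otherwise the daemon will not inherit it.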
The following config should be set for running Spark as a YARN job on an existing HDP 2.3 Hadoop cluster (via the Interpreter tab in Zeppelin):
Running the following code in a Zeppelin Notebook should succeed:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
The Spark instance should then be kicked off and remain running in the YARN Resource Manager:
Zeppelin may have trouble reading from the WASB filesystem if the above classpath is not added prior to starting Zeppelin:
java.io.IOException: No FileSystem for scheme: wasb
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2644)
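One cheap way to catch this before it surfaces in a notebook is to sanity-check the classpath in the shell that will start Zeppelin. The sketch below uses a hypothetical helper, `missing_classpath_entries`, which reports classpath entries that do not exist on disk. It only catches an empty classpath or mistyped paths; it cannot tell whether the JARs are the right ones for WASB.

```shell
# Print each entry of a colon-separated classpath that does not exist on disk.
# Returns non-zero if the classpath is empty or any entry is missing.
# The body runs in a subshell so the IFS and globbing changes do not leak out.
missing_classpath_entries() (
  cp="$1"
  if [ -z "$cp" ]; then
    echo "classpath is empty"
    return 1
  fi
  status=0
  IFS=":"     # split the classpath on colons
  set -f      # disable globbing, so wildcard entries are checked literally
  for entry in $cp; do
    [ -n "$entry" ] || continue    # ignore empty segments such as "a::b"
    if [ ! -e "$entry" ]; then
      echo "missing: $entry"
      status=1
    fi
  done
  return "$status"
)

# Example usage, run before bin/zeppelin-daemon.sh restart:
# missing_classpath_entries "$CLASSPATH" || echo "fix CLASSPATH before starting Zeppelin"
```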
Thanks to the Microsoft support team for assisting with finding the right JAR files to add to the Classpath!