Spark is a great way to make use of the available RAM on a Hadoop cluster to run fast in-memory analysis and queries, and H2O is a great project for running distributed machine learning algorithms on data stored in Hadoop. Together they form “Sparkling Water” (Spark + H2O, obviously!).
Easy-to-follow instructions for setting up Sparkling Water are available here: http://h2o-release.s3.amazonaws.com/sparkling-water/master/103/index.html
Running Spark on YARN is a good way to utilise an existing Hadoop cluster. However, with the “live” method below it is challenging to keep the Sparkling Water H2O Flow interface running permanently, which is what you want if a number of data scientists are to use the notebook-style interface to run machine learning tasks. Luckily, invoking spark-submit with the water.SparklingWaterDriver class keeps the web UI online even after the shell session which kicked it off exits (see the Persistent method below).
Live method – doesn’t stay online after exiting shell session
- Create a shell script:
sparkling-water-1.3.5/bin/sparkling-shell --num-executors 3 --executor-memory 2g --master yarn-client
- Run the script to open sparkling-shell, then start H2O at the Scala prompt:
import org.apache.spark.h2o._
val h2oContext = new H2OContext(sc).start()
Persistent method – stays online even after exiting shell session
To start a “persistent” H2O cluster on YARN (i.e. one which doesn’t exit as soon as your shell session does), simply run this command at the command line of a node where the Spark client and Sparkling Water are installed:
nohup bin/spark-submit --class water.SparklingWaterDriver --master yarn-client --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 ../sparkling-water-0.2.1-58/assembly/build/libs/*.jar &
The H2O Flow UI should be available on its usual port (http://XXX.XXX.XXX.XXX:54321) and should remain there even if the shell session which started it dies!
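If you restart the cluster often, it may be handy to wrap the spark-submit command above in a small shell function. The following is only a sketch, not something from the Sparkling Water docs: the function name launch_sparkling_water and the variables NUM_EXECUTORS, DRIVER_MEMORY, EXECUTOR_MEMORY, EXECUTOR_CORES, DRY_RUN and LOG_FILE are all made up for illustration, with defaults that mirror the values used above. It assumes you run it from the Spark install directory, so that bin/spark-submit resolves.

```shell
# Hypothetical helper: launch the persistent Sparkling Water driver on YARN.
# All variable names below are illustrative; defaults match the command above.
launch_sparkling_water() {
  jar="$1"   # path to the Sparkling Water assembly jar

  # Build the argument list once so that the dry run and the real run agree.
  set -- bin/spark-submit \
    --class water.SparklingWaterDriver \
    --master yarn-client \
    --num-executors "${NUM_EXECUTORS:-3}" \
    --driver-memory "${DRIVER_MEMORY:-4g}" \
    --executor-memory "${EXECUTOR_MEMORY:-2g}" \
    --executor-cores "${EXECUTOR_CORES:-1}" \
    "$jar"

  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Only print the command, so it can be checked before launching.
    echo "$*"
  else
    # nohup keeps the driver alive after the shell exits; output goes to a log file.
    nohup "$@" > "${LOG_FILE:-sparkling-water.log}" 2>&1 &
    echo "started with PID $!"
  fi
}
```

Setting DRY_RUN=1 before calling launch_sparkling_water with the jar path prints the command without starting anything. Without it, the driver starts in the background and its output lands in sparkling-water.log rather than nohup.out.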
Thanks to the helpful and responsive folks at H2O.ai for the above tip (originally answered here)!