Sparkling Water – keeping the web UI alive

Spark is a great way to make use of the available RAM on a Hadoop cluster to run fast in-memory analysis and queries, and H2O is a great project for running distributed machine learning algorithms on data stored in Hadoop.  Together they form “Sparkling Water” (Spark + H2O, obviously!).

Easy to follow instructions for setting up Sparkling Water are available here: http://h2o-release.s3.amazonaws.com/sparkling-water/master/103/index.html

Running Spark on YARN is a good way to utilise an existing Hadoop cluster, but with the “live” method below it is difficult to keep the Sparkling Water H2O Flow interface running permanently.  Keeping it online lets a number of data scientists use the notebook-style interface to run machine learning tasks.  Luckily, invoking spark-submit with the water.SparklingWaterDriver class ensures the web UI remains online even after the shell session which launched it exits (see the persistent method below).

Live method – doesn’t stay online after exiting shell session

  1. Create a shell script:

    #!/bin/bash
    export SPARK_HOME='/usr/hdp/current/spark-client/'
    export HADOOP_CONF_DIR=/etc/hadoop/conf
    export MASTER="yarn-client"
    sparkling-water-1.3.5/bin/sparkling-shell --num-executors 3 --executor-memory 2g --master yarn-client

  2. Run sparkling-shell, then start an H2OContext from within the shell:

    import org.apache.spark.h2o._
    val h2oContext = new H2OContext(sc).start()
    import h2oContext._

Persistent method – stays online even after exiting shell session

To start a “persistent” H2O cluster on YARN (i.e. one which doesn’t exit immediately), simply run this command on a node where the Spark client and Sparkling Water are installed:

nohup bin/spark-submit --class water.SparklingWaterDriver --master yarn-client --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 ../sparkling-water-0.2.1-58/assembly/build/libs/*.jar &

The H2O Flow web UI should be available on its usual port (http://XXX.XXX.XXX.XXX:54321) and should remain there even if the shell session which started it dies!

Thanks to the helpful and responsive folks at H2O.ai for the above tip (originally answered here)!

Avoiding “add jar” to load custom SerDe when using Excel or Beeswax on Hortonworks Hadoop

Intro – analysing tweets with Hive

Following various tutorial examples online (e.g. Hortonworks – How To Refine and Visualize Sentiment Data and Microsoft – Analyze Twitter data using Hive in HDInsight) it is possible to expose semi-structured Twitter feed data in tabular format via Hadoop and Hive.  Once the data is available in Hive it is possible to visualise it via Excel and the Power BI toolset.

Even after obtaining this data from Twitter’s streaming API (or a small sample from Twitter’s API testing console), a custom SerDe is required for Hive to be able to read an external table based on the JSON files.  The SerDe takes the form of a JAR file which must be made available whenever the Hive query is executed, usually via this command:

add jar json-serde-1.3-jar-with-dependencies.jar;

select * from vw_tweets […]

PS – The jar in question is Roberto Congiu’s JSON SerDe (binaries available here; the CDH5 version appears to be compatible with Hortonworks).
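
For context, the external table over the raw tweet JSON files is declared against this SerDe.  The statement below is only a sketch: the table name, column list and HDFS location are illustrative, not the exact schema behind vw_tweets.

-- Illustrative only: the columns and location will differ for a real feed.
add jar json-serde-1.3-jar-with-dependencies.jar;

CREATE EXTERNAL TABLE tweets_raw (
  id BIGINT,
  created_at STRING,
  tweet_text STRING,
  `user` STRUCT<screen_name:STRING, followers_count:INT>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/tweets_raw';

Views such as vw_tweets can then be layered over a raw table like this; any session that touches them needs the SerDe on its classpath, which is exactly what the fix below automates.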

The Problem?

When using client tools such as the Hive CLI and Beeswax, it’s necessary to execute the “add jar” command before querying any tables based on the custom SerDe; otherwise, errors such as this occur:

hive> select * from vw_tweets_hashtagsPerDay limit 2;

FAILED: RuntimeException MetaException(message:java.lang.ClassNotFoundException Class org.openx.data.jsonserde.JsonSerDe not found)

Similarly, errors such as this occur when using the Hortonworks Hive ODBC driver in Excel 2013 with Power Query, where it is not possible to manually load the JAR as part of the ODBC connection query:

DataSource.Error: ODBC: ERROR [HY000] [Hortonworks][HiveODBC] (35) Error from Hive: error code: '40000' error message: 'Error while compiling statement: FAILED: RuntimeException MetaException(message:java.lang.ClassNotFoundException Class org.openx.data.jsonserde.JsonSerDe not found)'.
Details:
Message=ERROR [HY000] [Hortonworks][HiveODBC] (35) Error from Hive: error code: '40000' error message: 'Error while compiling statement: FAILED: RuntimeException MetaException(message:java.lang.ClassNotFoundException Class org.openx.data.jsonserde.JsonSerDe not found)'.
ErrorCode=-2146232009

The solution

It is possible to fix the above issues by copying the SerDe JAR file to a location on the local filesystem of the Hadoop node running HiveServer2, e.g.:

/home/hive/json-serde-1.3-jar-with-dependencies.jar

Two config changes must then be made to the cluster:

  1. Add this line to the end of hive-env.sh:
    export HIVE_AUX_JARS_PATH=${HIVE_AUX_JARS_PATH}:/home/hive/json-serde-1.3-jar-with-dependencies.jar

    hive-env

  2. Add this config to the custom hive-site.xml file:

    hive.aux.jars.path = file:///home/hive/json-serde-1.3-jar-with-dependencies.jar
    hive-aux-jars-path

  3. Restart Hive components.

Finally – confirm ODBC is working by querying your Hive table / view in Excel or Beeswax!  It should work without first running the “add jar” command, because the custom SerDe responsible for reading the JSON files is now loaded automatically when a session is opened.

Visualising Solar Generation Data in a Custom Histogram using D3.js

Using the “brush” feature of the D3 Javascript library again proves handy for creating an interactive, animated histogram.  This type of visualisation helps to analyse and explore the distribution of time-series data.

For this demo, home solar PV generation data has been obtained from United Energy’s Energy Easy portal in CSV format.  For convenience in dealing with the raw data, which usually comes in half-hourly intervals, it has been loaded into a Pentaho data warehouse instance (more details in a later post, perhaps!) and converted to day-by-day figures.

D3 - Interactive Histogram - Preparing Solar PV generation data with Saiku
Preparing Solar PV generation data with Saiku

Analysis can help explore whether the solar panels are getting less efficient over time, or even determine what a “good” day of production looks like in summer vs winter (by looking at the relative frequency of each in the histogram).

Features:

  • Drag and scroll date region which affects histogram above:
    D3 - Interactive Histogram - Selectable Date Range
  • Changeable histogram buckets:
    D3 - Interactive Histogram - Adjustable Buckets
  • Snap-shotting of one selected date range for visual comparison with another (e.g. summer vs winter comparison):
    D3 - Interactive Histogram - Summer vs Winter Comparison

Key techniques – how it’s achieved:

  1. Data is embedded into the HTML page:

    <pre id="csvdata">
    Day,Quantity
    2013-01-01,9.288
    2013-01-02,10.494
    2013-01-03,7.376

    2014-12-29,3.832
    2014-12-30,7.752
    2014-12-31,7.212
    </pre>

    It’s hidden visually using the CSS display:none directive…

    #csvdata {
    display: none;
    }

    …and then read into a Javascript variable for display and computation via D3.

    // Read the embedded CSV text and parse it into an array of row objects
    var raw = d3.select("#csvdata").text();
    var dataset_copy = d3.csv.parse(raw);

  2. The maximum value is found in the selected region of the dataset:

    max = d3.max(dataset, function (d) { return +d.Quantity; });

    …and the histogram is created based on the maximum value (a sketch of how the resulting bins can be drawn as bars follows this list):

    // Generate a histogram using uniformly-spaced bins.
    datasetHist = d3.layout.histogram()
        .frequency(false)
        .range([0, max])
        .bins(buckets)
        (dataset.map(function (d) {
            return d.Quantity;
        }));

  3. An HTML form input slider is used to select the number of buckets:

    <label style="width: 240px; text-align: right; font-size: 10px; font-family: sans-serif;">Number of buckets: <span id="nBuckets-value">10</span></label>
    <input type="range" id="nBuckets" value="10">

    The value of the slider is passed into Javascript (to the variable “buckets”) via this code:

    d3.select("#nBuckets").on("input", function () {
        buckets = +this.value;
        d3.select("#nBuckets-value").text(+this.value);
        clearSnapshot();
        render();
    });
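
Binding the resulting bins to SVG bars then follows the usual D3 enter/update/exit pattern.  The snippet below is a simplified sketch rather than the exact fiddle code: “svg”, “width” and “height” are assumed to be an existing SVG group selection and its drawing dimensions.

// Scales mapping bin positions (kWh) and relative frequencies to pixels.
var xScale = d3.scale.linear()
    .domain([0, max])
    .range([0, width]);

var yScale = d3.scale.linear()
    .domain([0, d3.max(datasetHist, function (d) { return d.y; })])
    .range([height, 0]);

// Standard enter/update/exit data join against the histogram bins.
var bars = svg.selectAll(".bar").data(datasetHist);

bars.enter().append("rect").attr("class", "bar");

bars.attr("x", function (d) { return xScale(d.x); })
    .attr("width", function (d) { return Math.max(0, xScale(d.x + d.dx) - xScale(d.x) - 1); })
    .attr("y", function (d) { return yScale(d.y); })
    .attr("height", function (d) { return height - yScale(d.y); });

bars.exit().remove();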

Demo / code:

Try out the demo and check out the code here:
http://jsfiddle.net/hzmj24d1/

Re-pivoting data using OpenRefine’s Columns to Rows feature

Problem

A frequent challenge when transforming time-series data (e.g. weather or meter data) is converting columns representing multiple times of the day into a single column: in OLAP terms, what might generically be described as an “Hour of the Day” or “Interval” dimension.

Example input schema:

Date, 00:00, 00:30, 01:00, 01:30, ..., 23:00, 23:30
01-01-2015,0,0,0,1,...,1,0
02-01-2015,0,0,1,2,...,2,1

Example output schema:

Date, Interval, Value
01-01-2015,00:00,0
01-01-2015,00:30,0
...
02-01-2015,23:00,2
02-01-2015,23:30,1

Luckily, this is quite straightforward with a tool designed for dealing with messy tabular (e.g. CSV) data such as the free, web-based tool OpenRefine.

Solution

  1. Open the input CSV file in OpenRefine:

    OpenRefine Demo - Columns to Rows - Before

  2. Right-click one of the interval / time-based columns and select Transpose > Transpose cells across columns into rows:

    OpenRefine Demo - Columns to Rows - Transpose

  3. Select all the interval columns and choose to transpose the cells into one new column called Interval.  Also select the option to prepend “:” to each cell (this is useful in a subsequent step):

    OpenRefine Demo - Columns to Rows - Columns to Rows

    Measure columns have now become a single column called “Interval” and the number of rows has multiplied greatly:

    OpenRefine Demo - Columns to Rows - Semicolon character

  4. Now, use Edit column > Split into several columns to ensure the Interval name and Value figures are separate:

    OpenRefine Demo - Columns to Rows - Split into columns

    Select “:” as the separator and click OK:
    OpenRefine Demo - Columns to Rows - Split into columns 2

    …and voilà: after renaming the new column, the data is now in the desired row-per-interval form, with a single “Value” column containing the measure for each day and interval:

    OpenRefine Demo - Columns to Rows - After

Visualising energy consumption profile (by hour of day) using D3.js

With the benefit of smart electricity meters it’s possible to obtain half-hourly data showing household consumption in kWh. I downloaded this dataset for my own house in CSV format from United Energy’s EnergyEasy portal.

With some massaging, the data can be formatted into a structure which makes aggregation easier.  The excellent tool OpenRefine made this task straightforward, effectively unpivoting half-hourly measures spread across many columns into a single column, so that the data looks like this:

Day,Interval,Quantity
2012-01-01,0000,0.05
2012-01-01,0030,0.05
2012-01-01,0100,0.044
2012-01-01,0130,0.05
2012-01-01,0200,0.044
[...]
2013-12-31,2130,0.025
2013-12-31,2200,0.019
2013-12-31,2230,0.025
2013-12-31,2300,0.025
2013-12-31,2330,0.025

Using the D3 Javascript visualisation library, it’s possible to create an interactive visualisation which can interrogate this data in arbitrary ways to find patterns and answer basic questions about household energy consumption.  For example:

  1. During which hours of the day is average energy consumption highest? Is this different in summer vs winter?  Has this changed from 2012 to 2013?
  2. Has the minimum energy consumption overnight changed?  Is the new (and slightly annoying) energy saving power board purchased in mid 2013 doing its job to reduce standby power use?
  3. During which hours of the day is power usage the most variable?

Features

  • Selectable date range – e.g. to compare a rolling 12 month period. This uses a “context” graphics section in D3.js with brush functionality to trigger realtime recalculation of data in the “focus” section when a user selects a range using their mouse.  The live update of the hourly consumption profile means it’s easy to see trends over time in the “focus” area of the screen (shown in the following point):

    D3 - Visualising hourly energy consumption profile - Brush
    D3 selectable time range using “brush” technique
  • Plotting of max / min / mean / standard deviation of kWh consumption per half-hourly interval of the day (a sketch of this per-interval calculation follows this list):

    D3 - Visualising hourly energy consumption profile - Mean Min Max StdDev
    D3 Mean Max Min and Standard Deviation calculations for each half-hourly time interval of the day
  • “Snapshotting” of date range – e.g. to compare two consecutive years in an interactive way:

    D3 snapshot time comparison
    D3 snapshot time comparison
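
The per-interval statistics behind this plot come from a simple grouping pass over the parsed rows.  The sketch below is not the exact fiddle code; it assumes the CSV shown above has been parsed (e.g. with d3.csv.parse) into a “dataset” array of {Day, Interval, Quantity} objects.

// Group readings by half-hourly interval and compute summary statistics per group.
var statsByInterval = d3.nest()
    .key(function (d) { return d.Interval; })
    .rollup(function (rows) {
        var values = rows.map(function (d) { return +d.Quantity; });
        var mean = d3.mean(values);
        var variance = d3.mean(values.map(function (v) {
            return (v - mean) * (v - mean);
        }));
        return {
            mean: mean,
            min: d3.min(values),
            max: d3.max(values),
            stdDev: Math.sqrt(variance)
        };
    })
    .entries(dataset);

// statsByInterval is now an array like:
// [{ key: "0000", values: { mean: ..., min: ..., max: ..., stdDev: ... } }, ...]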

Demo

Check out a live example here:
http://jsfiddle.net/ugxo00bu/7/

Problem starting HBase master on Hadoop with Cloudera

After formatting the Hadoop HDFS NameNode and trying to restart the Hadoop cluster in Cloudera, I encountered this fatal error on the HBase master, preventing HBase from starting at all:

Unhandled exception. Starting shutdown.
org.apache.hadoop.hbase.TableExistsException: hbase:namespace
at org.apache.hadoop.hbase.master.handler.CreateTableHandler.prepare(CreateTableHandler.java:133)
at org.apache.hadoop.hbase.master.TableNamespaceManager.createNamespaceTable(TableNamespaceManager.java:232)
at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:86)
at org.apache.hadoop.hbase.master.HMaster.initNamespace(HMaster.java:1069)
at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:942)
at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:613)
at java.lang.Thread.run(Thread.java:745)

After unsuccessfully trying to fix this error by removing the /hbase directory on HDFS, I stumbled across the solution: clear the /hbase znode via the ZooKeeper service client:

azureuser@master:/$ zookeeper-client
Connecting to localhost:2181
2015-01-24 02:17:31,535 [myid:] - INFO  [main:Environment@100] - Client environment:zookeeper.version=3.4.5-cdh5.3.0--1, built on 12/17/2014 02:46 GMT
2015-01-24 02:17:31,540 [myid:] - INFO  [main:Environment@100] - Client environment:host.name=master.hadoopnet
[...]
2015-01-24 02:17:31,737 [myid:] - INFO  [main-SendThread(localhost:2181):ClientCnxn$SendThread@1235] - Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x14b19b8aa4f000b, negotiated timeout = 30000
[zk: localhost:2181(CONNECTED) 0]
WATCHER::

WatchedEvent state:SyncConnected type:None path:null

[zk: localhost:2181(CONNECTED) 1] rmr /hbase
[zk: localhost:2181(CONNECTED) 2] quit
Quitting...
2015-01-24 02:17:56,579 [myid:] - INFO  [main:ZooKeeper@684] - Session: 0x14b19b8aa4f000b closed
2015-01-24 02:17:56,579 [myid:] - INFO  [main-EventThread:ClientCnxn$EventThread@512] - EventThread shut down

This fixed the issue! Restarting HBase via Cloudera now succeeds without the above error.

It turns out ZooKeeper has a filesystem-like concept called znodes (see http://blog.cloudera.com/blog/2013/10/what-are-hbase-znodes/), which is not to be confused with HDFS in a Hadoop cluster.

Using Mondrian’s CurrentDateMember to show current day’s data in MDX

Let’s say we have the following MDX query to show data for a particular date (in this case the quantity measure of the cube Electricity):

WITH
SET [~ROWS] AS {[Time].[Day].[2014-01-01]}
SELECT
NON EMPTY {[Measures].[Quantity]} ON COLUMNS,
NON EMPTY [~ROWS] ON ROWS
FROM [Electricity]

Works OK:

Saiku MDX - Current Date Member 1

But what if we want the date to be dynamic, reflecting today’s date?

We can change the MDX to use Mondrian’s CurrentDateMember function:

WITH
SET [~ROWS] AS {CurrentDateMember([Time], """[Time].[Day]""\.[yyyy-mm-dd]")}
SELECT
NON EMPTY {[Measures].[Quantity]} ON COLUMNS,
NON EMPTY [~ROWS] ON ROWS
FROM [Electricity]

Now the date selected is today’s date (Note: date members in the time hierarchy are in the format of yyyy-mm-dd):

Saiku MDX - Current Date Member 2

Extra tip – rolling date range up to current day
We can also change the MDX query to select a rolling 365-day date range (ending with the current day):

...
SET [~ROWS] AS {CurrentDateMember([Time], """[Time].[Day]""\.[yyyy-mm-dd]").Lag(365):CurrentDateMember([Time], """[Time].[Day]""\.[yyyy-mm-dd]")}
...

Now a rolling 365-day date range is shown:

Saiku MDX - Current Date Member 3