Avoiding “add jar” to load a custom SerDe when using Excel or Beeswax on Hortonworks Hadoop

Intro – analysing tweets with Hive

Following various tutorial examples online (e.g. Hortonworks – How To Refine and Visualize Sentiment Data and Microsoft – Analyze Twitter data using Hive in HDInsight), it is possible to expose semi-structured Twitter feed data in tabular format via Hadoop and Hive.  Once the data is available in Hive, it can be visualised via Excel and the Power BI toolset.

Even after getting this data from Twitter’s streaming API, or as a small sample from Twitter’s API testing console, a custom SerDe is required for Hive to read an external table based on the JSON files.  This SerDe takes the form of a JAR file, which must be made available whenever the Hive query is executed, usually via this command:

add jar json-serde-1.3-jar-with-dependencies.jar;

select * from vw_tweets […]

PS – the JAR in question is Roberto Congiu’s JSON SerDe (binaries available here; the CDH5 version appears to be compatible with Hortonworks).
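
For context, the external table over the raw tweet JSON files declares this SerDe as its row format. The definition below is a minimal illustrative sketch only – the table name, columns and HDFS location are placeholders rather than the exact definitions from the tutorials above:

-- Minimal sketch: external table over raw tweet JSON files using the OpenX JSON SerDe.
-- Table name, columns and HDFS location are illustrative placeholders.
CREATE EXTERNAL TABLE tweets_raw (
  id BIGINT,
  created_at STRING,
  `text` STRING,
  entities STRUCT<hashtags:ARRAY<STRUCT<text:STRING>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/user/hive/tweets_raw';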

The Problem?

In the case of client tools such as the Hive CLI and Beeswax, it’s necessary to execute the “add jar” command before querying any tables based on the custom SerDe; otherwise, errors such as this occur:

hive> select * from vw_tweets_hashtagsPerDay limit 2;

FAILED: RuntimeException MetaException(message:java.lang.ClassNotFoundException Class org.openx.data.jsonserde.JsonSerDe not found)

Similarly, errors such as this occur when using the Hortonworks Hive ODBC driver in Excel 2013 with Power Query, where there is no way to manually load the JAR as part of the Excel ODBC connection query:

DataSource.Error: ODBC: ERROR [HY000] [Hortonworks][HiveODBC] (35) Error from Hive: error code: '40000' error message: 'Error while compiling statement: FAILED: RuntimeException MetaException(message:java.lang.ClassNotFoundException Class org.openx.data.jsonserde.JsonSerDe not found)'.
Details:
Message=ERROR [HY000] [Hortonworks][HiveODBC] (35) Error from Hive: error code: '40000' error message: 'Error while compiling statement: FAILED: RuntimeException MetaException(message:java.lang.ClassNotFoundException Class org.openx.data.jsonserde.JsonSerDe not found)'.
ErrorCode=-2146232009

The solution

It is possible to fix the above issues by copying the SerDe JAR file to a location on the local filesystem of the Hadoop node running HiveServer2, e.g.:

/home/hive/json-serde-1.3-jar-with-dependencies.jar
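
For example, something along these lines would place it there (a sketch only – it assumes the JAR has already been downloaded to the current directory on the HiveServer2 node, and the hive:hadoop ownership is an assumption that may differ in your environment):

# Sketch: copy the SerDe JAR onto the HiveServer2 node's local filesystem
# (assumes the JAR is already in the current directory; adjust paths/ownership as needed)
sudo mkdir -p /home/hive
sudo cp json-serde-1.3-jar-with-dependencies.jar /home/hive/
sudo chown hive:hadoop /home/hive/json-serde-1.3-jar-with-dependencies.jar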

The following changes must then be made to the cluster:

  1. Add this line to the end of hive-env.sh:
    export HIVE_AUX_JARS_PATH=${HIVE_AUX_JARS_PATH}:/home/hive/json-serde-1.3-jar-with-dependencies.jar


  2. Add this config to the custom hive-site.xml file (an equivalent raw XML entry is sketched after this list):

    hive.aux.jars.path = file:///home/hive/json-serde-1.3-jar-with-dependencies.jar

  3. Restart Hive components.
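
If hive-site.xml is edited by hand rather than through a management UI such as Ambari, the property from step 2 would be an entry along these lines (illustrative snippet):

<!-- Illustrative entry: make the SerDe JAR available to Hive as an auxiliary JAR -->
<property>
  <name>hive.aux.jars.path</name>
  <value>file:///home/hive/json-serde-1.3-jar-with-dependencies.jar</value>
</property>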

Finally – confirm ODBC is working by querying your Hive table or view in Excel or Beeswax!  It should now work without first running the “add jar” command, because the custom SerDe responsible for reading the JSON files is loaded automatically when a session is opened.
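
For example, from the Hive CLI (reusing the view from the earlier error message), this should now return rows without any preceding “add jar” statement:

hive> select * from vw_tweets_hashtagsPerDay limit 2;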

Problem starting the HBase master on Hadoop with Cloudera

After formatting the Hadoop HDFS NameNode and trying to restart the Hadoop cluster in Cloudera, I encountered this fatal error on the HBase master, preventing HBase from starting at all:

Unhandled exception. Starting shutdown.
org.apache.hadoop.hbase.TableExistsException: hbase:namespace
at org.apache.hadoop.hbase.master.handler.CreateTableHandler.prepare(CreateTableHandler.java:133)
at org.apache.hadoop.hbase.master.TableNamespaceManager.createNamespaceTable(TableNamespaceManager.java:232)
at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:86)
at org.apache.hadoop.hbase.master.HMaster.initNamespace(HMaster.java:1069)
at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:942)
at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:613)
at java.lang.Thread.run(Thread.java:745)

After unsuccessfully trying to fix this error by removing the /hbase directory on HDFS, I stumbled across the solution: clearing the /hbase znode via the ZooKeeper service client:

azureuser@master:/$ zookeeper-client
Connecting to localhost:2181
2015-01-24 02:17:31,535 [myid:] - INFO  [main:Environment@100] - Client environment:zookeeper.version=3.4.5-cdh5.3.0--1, built on 12/17/2014 02:46 GMT
2015-01-24 02:17:31,540 [myid:] - INFO  [main:Environment@100] - Client environment:host.name=master.hadoopnet
[…]
2015-01-24 02:17:31,737 [myid:] - INFO  [main-SendThread(localhost:2181):ClientCnxn$SendThread@1235] - Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x14b19b8aa4f000b, negotiated timeout = 30000
[zk: localhost:2181(CONNECTED) 0]
WATCHER::

WatchedEvent state:SyncConnected type:None path:null

[zk: localhost:2181(CONNECTED) 1] rmr /hbase
[zk: localhost:2181(CONNECTED) 2] quit
Quitting...
2015-01-24 02:17:56,579 [myid:] - INFO  [main:ZooKeeper@684] - Session: 0x14b19b8aa4f000b closed
2015-01-24 02:17:56,579 [myid:] - INFO  [main-EventThread:ClientCnxn$EventThread@512] - EventThread shut down

This fixed the issue! Restarting HBase via Cloudera now succeeds without the above error.

It seems ZooKeeper has a filesystem-like concept called znodes (see http://blog.cloudera.com/blog/2013/10/what-are-hbase-znodes/), which is not to be confused with HDFS in a Hadoop cluster.
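
A quick way to see the distinction (illustrative commands, assuming default paths): the HBase znodes live in ZooKeeper’s own namespace and are browsed with the ZooKeeper client, whereas the HBase data directory lives on HDFS and is browsed with the HDFS client:

# Browse the HBase znodes in ZooKeeper (ZooKeeper's own namespace, not HDFS)
zookeeper-client
[zk: localhost:2181(CONNECTED) 0] ls /hbase

# Browse the HBase data directory on HDFS (a separate filesystem entirely)
hdfs dfs -ls /hbase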