Using Azure Blob storage with Hadoop

Cloud providers such as Amazon (AWS) and Microsoft (Azure) provide fault-tolerant distributed storage services which can literally “take the load” off a Hadoop installation, providing some compelling advantages.  In the case of Microsoft Azure’s blob storage, however, this is not without its pitfalls. With the release of Hadoop version 2.7.0 (and vendor packaged versions such… Continue reading Using Azure Blob storage with Hadoop

Managing Yarn memory with multiple Hive users

Out of the box (e.g. a standard Hortonworks HDP 2.2 install), Hive does not come configured optimally to manage multiple users running queries simultaneously.  This means it is possible for a single Hive query to use up all available Yarn memory, preventing other users from running a query simultaneously. This high memory consumption can be… Continue reading Managing Yarn memory with multiple Hive users

Sparkling-water – keeping the web UI alive

Spark is a great way to make use of the available RAM on a Hadoop cluster to run fast in-memory analysis and queries, and H2O is a great project for running distributed machine learning algorithms on data stored in Hadoop.  Together they form “Sparkling Water” (Spark + H2O, obviously!). Easy to follow instructions for setting… Continue reading Sparkling-water – keeping the web UI alive

Avoiding “add jar” to load custom SerDe when using Excel or Beeswax on Hortonworks Hadoop

Intro – analysing tweets with Hive

Following various tutorial examples online (e.g. Hortonworks – How To Refine and Visualize Sentiment Data and Microsoft – Analyze Twitter data using Hive in HDInsight) it is possible to expose semi-structured Twitter feed data in tabular format via Hadoop and Hive.  Once the data is available in Hive… Continue reading Avoiding “add jar” to load custom SerDe when using Excel or Beeswax on Hortonworks Hadoop

Visualising Solar Generation Data in a Custom Histogram using D3.js

Using the “brush” feature of the D3 JavaScript library again proves handy for creating an interactive, animated histogram.  This type of visualisation helps to analyse and explore the distribution of time-series data. For this demo, home solar PV generation data has been obtained from United Energy’s Energy Easy portal in CSV format.  For the sake… Continue reading Visualising Solar Generation Data in a Custom Histogram using D3.js

Re-pivoting data using OpenRefine’s Columns to Rows feature

Problem

A frequent challenge when transforming time-series data (e.g. weather, meter data) is changing columns representing multiple times of the day into a single column, or in OLAP terms what might generically be described as an “Hour of the Day” or “Interval” dimension.

Example input schema:

Date, 00:00, 00:30, 01:00, 01:30, …, 23:00, 23:30
01-01-2015,0,0,0,1,…,1,0… Continue reading Re-pivoting data using OpenRefine’s Columns to Rows feature
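For readers who want to see the shape of the columns-to-rows transformation outside OpenRefine, here is a minimal Python sketch. It is not the OpenRefine method the post describes; the function name `columns_to_rows` and the sample values are hypothetical, and the sample only assumes the "Date" plus one-column-per-interval layout shown above.

```python
import csv
import io


def columns_to_rows(wide_csv_text):
    """Turn one column per interval into one row per (date, interval) pair."""
    # skipinitialspace tolerates headers written as "Date, 00:00, 00:30"
    reader = csv.reader(io.StringIO(wide_csv_text), skipinitialspace=True)
    header = next(reader)
    intervals = [name.strip() for name in header[1:]]  # every column after "Date"

    long_rows = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        date, values = row[0].strip(), row[1:]
        for interval, value in zip(intervals, values):
            long_rows.append((date, interval, float(value)))
    return long_rows


# Hypothetical sample in the same layout as the example schema (truncated to four intervals)
SAMPLE = """Date, 00:00, 00:30, 01:00, 01:30
01-01-2015,0,0,0,1
02-01-2015,2,0,1,0
"""

for record in columns_to_rows(SAMPLE):
    print(record)
```

Each output tuple carries the date, the interval label taken from the original column header, and the reading, which is the "Interval" dimension layout described above.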

Visualising energy consumption profile (by hour of day) using D3.js

With the benefit of smart electricity meters it’s possible to obtain hourly data showing household consumption in kWh. I downloaded this dataset for my own house in CSV format from United Energy’s EnergyEasy portal. With some massaging, the data can be formatted to a structure which makes aggregation easier.  The excellent tool OpenRefine made… Continue reading Visualising energy consumption profile (by hour of day) using D3.js

Problem starting HBASE master on Hadoop with Cloudera

After formatting the Hadoop HDFS Namenode and trying to restart the Hadoop cluster in Cloudera I encountered this fatal error on the HBASE master, preventing HBASE from starting at all:

Unhandled exception. Starting shutdown.
org.apache.hadoop.hbase.TableExistsException: hbase:namespace
    at org.apache.hadoop.hbase.master.handler.CreateTableHandler.prepare(CreateTableHandler.java:133)
    at org.apache.hadoop.hbase.master.TableNamespaceManager.createNamespaceTable(TableNamespaceManager.java:232)
    at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:86)
    at org.apache.hadoop.hbase.master.HMaster.initNamespace(HMaster.java:1069)
    at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:942)
    at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:613)
    at java.lang.Thread.run(Thread.java:745)

After unsuccessfully trying to fix this… Continue reading Problem starting HBASE master on Hadoop with Cloudera

Using Mondrian’s CurrentDateMember to show current day’s data in MDX

Let’s say we have the following MDX query to show data for a particular date (in this case the quantity measure of the cube Electricity):

WITH SET [~ROWS] AS {[Time].[Day].[2014-01-01]}
SELECT NON EMPTY {[Measures].[Quantity]} ON COLUMNS,
  NON EMPTY [~ROWS] ON ROWS
FROM [Electricity]

Works OK: But what if we want the date to be dynamic,… Continue reading Using Mondrian’s CurrentDateMember to show current day’s data in MDX
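To make the "dynamic date" problem concrete, here is a small Python sketch that builds the same query with the day member filled in on the client side. This is a swapped-in illustration, not the CurrentDateMember approach the post's title refers to; the helper names `day_member` and `build_query` are hypothetical, and it assumes day members are named in `yyyy-mm-dd` form as in the query above.

```python
from datetime import date


def day_member(day):
    """Format a date as a member of the [Time].[Day] level, e.g. [Time].[Day].[2014-01-01]."""
    return "[Time].[Day].[{}]".format(day.isoformat())


def build_query(day):
    """Return the excerpt's MDX query with the hard-coded date replaced by `day`."""
    return (
        "WITH SET [~ROWS] AS {" + day_member(day) + "} "
        "SELECT NON EMPTY {[Measures].[Quantity]} ON COLUMNS, "
        "NON EMPTY [~ROWS] ON ROWS "
        "FROM [Electricity]"
    )


# Build the query for whatever today's date is when the script runs
print(build_query(date.today()))
```

Substituting the date in the client keeps the MDX itself static per request, at the cost of moving date logic out of the cube; resolving the current date inside MDX avoids that.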