Using Azure Blob storage with Hadoop

Cloud providers such as Amazon (AWS) and Microsoft (Azure) provide fault-tolerant distributed storage services which can literally “take the load” off a Hadoop installation, providing some compelling advantages.  In the case of Microsoft Azure’s blob storage, however, this is not without its pitfalls.

With the release of Hadoop version 2.7.0 (and vendor packaged versions such as Hortonworks HDP 2.3) Windows Azure Blob storage can be used as either default or secondary storage for Hadoop as instead of HDFS.  See Alexei Khalyako’s description of how to configure both of these options here.

These are some benefits of using Blob storage instead of HDFS (see also Microsoft’s opinion):

  • Separate storage from compute – data can exist with 1, 10, 1000 or even zero Hadoop nodes, meaning compute resources can be scaled freely as there is no reliance on having HDFS services running on a minimum number of nodes with locally attached disks in order to simply access the data.  Equally there is no need to “rebalance” data when nodes are added or removed from the cluster.
  • Relatively Low cost of storage – at the time of writing this is roughly $414.10 AUD per Terabyte per year (see here).  This is quite impressive given the cost of hardware / electricity required to maintain this data even on commodity equipment.
  • Automatic replication – Azure storage can be replicated long-distance with the click of a mouse (by choosing geographically replicated storage).  This means the Azure cloud layer will take care of replicating data to another of its data-centres for disaster recovery purposes.  With the cheapest form of replication (local replication) data is stored 3 times in a single data centre, ensuring High Availability in the event of a single disk failure in the azure Datacentre.  With geo-replicated storage the data is also copied another 3 times into a secondary datacentre (although it would require a declaration from Microsoft to make the secondary copy accessible after a disaster – something completely out of an Azure customer’s control).

Questions you might ask about using Azure Blob storage instead of HDFS:

Q: Wouldn’t it be really slow having an on-premise Hadoop cluster connecting to storage accessibly only over the internet via TCP/IP?
A: Yes.  For this reason this architecture is not recommended.  Instead, it’s worth thinking about Azure storage only for clusters which are stood up in Azure itself (as VM’s or as the platform as a service offering HDInsight).  It is assumed the TCP/IP connectivity within an Azure datacentre is fast enough not to worry about network bottlenecks – i.e. from machines to storage, even despite being over TCP/IP.

Q: Hadoop is all about moving compute closer to storage – doesn’t using blob go against this principle?
A: Microsoft’s answer to this seems to be that the backbone connecting Azure compute VMs to blob storage should provide performance similar to what would be seen with locally attached disks (see Cindy Gross’s useful blog post here).  In other words, the architecture is like having a big disk attached to many nodes simultaneously and directly.

Q: If a blob storage account behaves like a hard disk, won’t it get overloaded with multiple nodes connecting to it simultaneously?
A: No – luckily per Microsoft blob storage accounts apparently do not behave like disks.  The performance characteristics of blob storage allow many Hadoop nodes to be simultaneously reading / writing data.  Microsoft claims a target of 60MB/s throughput (see here) per blob which might correspond to a single chunked file of a Hive table, as well as 15 Gbps overall read performance for a single storage account – i.e. allowing for approx 31 nodes, each reading from a blob account simultaneously at 60MB/s.

Pitfalls of blob storage instead of HDFS?

A very significant pitfall of using blob storage with Hadoop (despite the above advantages) is that whilst…:

“File owner and group are persisted, but the permissions model is not enforced.” (https://hadoop.apache.org/docs/stable/hadoop-azure/index.html)

This presents enormous challenges at the enterprise level in providing access to multiple users or even self-service access to unstructured or semi-structured data in a Hadoop-based data lake.  Given the frequent need to protect sensitive data within an organisation (e.g. customer, employee, financial data) it seems a severe limitation that the Hadoop interface to the blob storage APIs has not been augmented with the ability to enforce the file and folder permissions which it so dutifully records!

An example of the problem can be seen here by comparing native HDFS storage behaviour with blob storage behaviour (both when acting as the default hadoop filesystem):

Using HDFS – Authorisations working correctly:

[azureuser@hdplinux4 tmp]$ id
uid=500(azureuser) gid=500(azureuser) groups=500(azureuser)
[azureuser@hdplinux4 tmp]$ hdfs dfs -ls /tmp/testonly/
Found 1 items
-rw-------   1 hdfs hdfs         13 2015-08-26 00:57 /tmp/testonly/test.txt
[azureuser@hdplinux4 tmp]$ pwd
/tmp
[azureuser@hdplinux4 tmp]$ hdfs dfs -copyToLocal /tmp/testonly/test.txt
copyToLocal: Permission denied: user=azureuser, access=READ, inode="/tmp/testonly/test.txt":hdfs:hdfs:-rw-------

Using Blob – Authorisations are completely ignored (despite being visible via a hdfs dfs -ls command, permissions which say that only user hdfs should be allowed to read test.txt are completely ignored when user azureuser tries to copy the file):

azureuser@hdplinuxblob:/tmp> id
uid=1002(azureuser) gid=100(users) groups=100(users),16(dialout),33(video)
azureuser@hdplinuxblob:/tmp> hdfs dfs -ls /tmp/testonly/test.txt
-rw-------   1 hdfs hdfs         13 2015-08-26 01:00 /tmp/testonly/test.txt
azureuser@hdplinuxblob:/tmp> hdfs dfs -copyToLocal /tmp/testonly/test.txt
azureuser@hdplinuxblob:/tmp> ls -la test.txt
-rw-r--r-- 1 azureuser users 13 Aug 26 01:09 test.txt

Only time will tell whether Microsoft will rectify the severe gap currently in the Azure Blob storage integration into Hadoop.  There is some indication that they intend to close the gap with the Azure Data Lake service due for eventual release, which promises compatibility with many flavours of Hadoop (e.g. Hortonworks and Cloudera) as well as integration into Active Directory to allow for files and folders to be secured.  The challenge still remains, however, of providing a security mechanism which is compatible with the wider Hadoop ecosystem, and this gives pause to think about the choosing blob over HDFS when it should otherwise be an easy decision.

Managing Yarn memory with multiple Hive users

Out of the box (e.g. a standard Hortonworks HDP 2.2 install), Hive does not come configured optimally to manage multiple users running queries simultaneously.  This means it is possible for a single Hive query to use up all available Yarn memory, preventing other users from running a query simultaneously.

This high memory consumption can be observed via the resource manager HTTP management screen – e.g. http://<resourcemanagerIP>:8088/cluster

Almost all yarn memory used
Almost all yarn memory used

Also in Ambari…

Yarn used memory at 100%
Yarn used memory at 100%

Minimum queue memory per user

To guarantee the ability for more users to run Hive queries simultaneously (assuming capacity scheduler is used with default queue configuration), we can make a simple config settings change via Ambari:

Ambari Yarn config for capacity scheduler
Ambari Yarn config for capacity scheduler

Change from:

yarn.scheduler.capacity.root.default.user-limit-factor=1

To:

yarn.scheduler.capacity.root.default.user-limit-factor=0.33

This now means that each user of Hive will now receive a maximum of a third (or close to it) of Yarn memory resources.

Only a third of yarn memory used
Only a third of yarn memory used
Yarn used memory at 39%
Yarn used memory at 39%

This enables a better user experience for multi-user interactive querying of Hive – for example, by enabling 2-3 users to simultaneously use the cluster.

Another option

There is, however, one potential disadvantage to the above — namely cluster memory is potentially being wasted (by not being allocated) if the job queue contains only a single user’s jobs.  A related parameter change can alleviate this – namely by setting:

yarn.scheduler.capacity.default.minimum-user-limit-percent=33

The “minimum user limit percent” means that each user is guaranteed a certain percentage of the yarn job queue’s memory if there is a mix of different users’ jobs waiting in the queue.  In other words, 3 users will each get 33% of the queue memory for execution if their jobs are all waiting in the queue at the same time. If however, there is only one user with jobs waiting in the queue, his / her jobs will execute and consume all available memory in the queue.  For User A this means a better use of memory overall, but possibly at the expense of User B who might return from their lunch break and must wait for one of User A’s jobs to finish before getting the guaranteed percentage memory allocation.

Finding the balance

The above, along with other parameters can be used to ensure users make the most of available cluster memory but do not effectively lock out other users by filling the queue with long running jobs.

For example – these settings allow a single user to use up to 90% of available yarn queue memory, and up to 4 users (each with 25%) to eventually be running in the cluster (the 5th, 6th, 7th users will have to wait for other users’ jobs to be fully completed):

yarn.scheduler.capacity.root.default.user-limit-factor=0.90
yarn.scheduler.capacity.default.minimum-user-limit-percent=25

Sparkling-water – keeping the web UI alive

Spark is a great way to make use of the available RAM on a Hadoop cluster to run fast in-memory analysis and queries, and H2O is a great project for running distributed machine learning algorithms on data stored in Hadoop.  Together they form “Sparkling Water” (Spark + H2O, obviously!).

Easy to follow instructions for setting up Sparkling Water are available here: http://h2o-release.s3.amazonaws.com/sparkling-water/master/103/index.html

Running spark on Yarn is a good way to utilise an existing Hadoop cluster, however it’s challenging using the “live” method below to keep the Sparkling Water H2O Flow interface running permanently.  Doing so can let a number of data scientists use the notebook style interface to run machine learning tasks.  Luckily, using the spark-submit invocation with the water.SparklingWaterDriver class can ensure the web UI remains online even after the shell session which kicked it off exits (see below Persistent method).

Live method – doesn’t stay online after exiting shell session

  1. Create a shell script:

    #!/bin/bash
    export SPARK_HOME=’/usr/hdp/current/spark-client/’
    export HADOOP_CONF_DIR=/etc/hadoop/conf
    export MASTER=”yarn-client”
    sparkling-water-1.3.5/bin/sparkling-shell –num-executors 3 –executor-memory 2g –master yarn-client

  2. Run sparkling-shell

    import org.apache.spark.h2o._
    val h2oContext = new H2OContext(sc).start()
    import h2oContext._

Persistent method – stays online even after exiting shell session

To start a “persistent” H2O cluster on Yarn (i.e. one which doesn’t exit immediately) simply run this command at the command line of a node where the spark client and sparkling water is installed:

nohup bin/spark-submit –class water.SparklingWaterDriver –master yarn-client –num-executors 3 –driver-memory 4g –executor-memory 2g –executor-cores 1 ../sparkling-water-0.2.1-58/assembly/build/libs/*.jar &

The Spark UI should be available on it’s usual port (http://XXX.XXX.XXX.XXX:54321) and should remain there even if the shell session which started the UI dies!

Thanks to the helpful and responsive folks at H2Oai for the above tip (originally answered here)!