Using Azure Blob storage with Hadoop

Cloud providers such as Amazon (AWS) and Microsoft (Azure) provide fault-tolerant distributed storage services which can literally “take the load” off a Hadoop installation, providing some compelling advantages.  In the case of Microsoft Azure’s blob storage, however, this is not without its pitfalls. With the release of Hadoop version 2.7.0 (and vendor packaged versions such… Continue reading Using Azure Blob storage with Hadoop

Managing Yarn memory with multiple Hive users

Out of the box (e.g. a standard Hortonworks HDP 2.2 install), Hive does not come configured optimally to manage multiple users running queries simultaneously.  This means it is possible for a single Hive query to use up all available Yarn memory, preventing other users from running a query simultaneously. This high memory consumption can be… Continue reading Managing Yarn memory with multiple Hive users

Sparkling-water – keeping the web UI alive

Spark is a great way to make use of the available RAM on a Hadoop cluster to run fast in-memory analysis and queries, and H2O is a great project for running distributed machine learning algorithms on data stored in Hadoop.  Together they form “Sparkling Water” (Spark + H2O, obviously!). Easy to follow instructions for setting… Continue reading Sparkling-water – keeping the web UI alive