Using Azurite blob storage emulator with Spark

Sometimes it’s handy to be able to test Apache Spark developments locally. This might include testing cloud storage such as WASB (Windows Azure Storage Blob).

These steps describe the process for testing WASB locally without the need for an Azure account. These steps make use of the Azurite Storage Emulator.

Steps

  1. Prerequisites
    • Download and extract Apache Spark (spark-3.1.2-bin-hadoop3.2.tgz)
    • Download and install Docker
    • Start the Docker service – e.g. on Linux:
      sudo service docker start
    • (Optionally) Download and install Azure Storage Explorer
  2. Create a new directory and start the Azurite Storage Emulator Docker container – e.g.:

    mkdir ~/blob

    docker run -p 10000:10000 -p 10001:10001 -v /home/david/blob/:/data mcr.microsoft.com/azure-storage/azurite

    NB – in the above example, data will be persisted to the local Linux directory /home/david/blob (adjust the path to match your own home directory).
  3. Upload files with Storage Explorer:

    Connect Storage Explorer to the Local Storage emulator (keep defaults when adding the connection):





    Upload a sample file – e.g. to the “data” container:

  4. Start Spark using the --packages option to include the libraries needed to access Blob storage. The Maven coordinates shown here were for the latest hadoop-azure package at the time of writing:

    cd ~/spark/spark-3.1.2-bin-hadoop3.2/bin

    ./pyspark --packages org.apache.hadoop:hadoop-azure:3.3.1

    The PySpark shell should start as per normal after downloading hadoop-azure and its dependencies.

    Troubleshooting:
    The following stack trace indicates the hadoop-azure driver or dependencies were not loaded successfully:
    ...
    py4j.protocol.Py4JJavaError: An error occurred while calling o33.load.
    : java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2595)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3269)
        ...
    Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2499)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2593)
        ... 25 more
    ...

    Ensure the “packages” option is correctly set when invoking pyspark above.
  5. Query the data using the emulated Blob storage location from the PySpark shell:

    df=spark.read.format("csv").option("header",True).load("wasb://data@storageemulator/iris.csv")

    df.show()


    Notes:
    data – container where the data was uploaded earlier
    @storageemulator – this is a fixed string used to tell the WASB connector to point to the local emulator
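The path convention in the notes above can be wrapped in a small helper. This is only a minimal sketch: `emulator_uri` and `read_csv_from_emulator` are hypothetical names, and `spark` is assumed to be the session object that the PySpark shell already provides.

```python
EMULATOR_ACCOUNT = "storageemulator"  # fixed account name described in the notes


def emulator_uri(container: str, blob_path: str) -> str:
    """Build a wasb:// URI that points at the local emulator."""
    return f"wasb://{container}@{EMULATOR_ACCOUNT}/{blob_path.lstrip('/')}"


def read_csv_from_emulator(spark, container: str, blob_path: str):
    """Read a CSV with a header row; `spark` is an existing SparkSession."""
    return (
        spark.read.format("csv")
        .option("header", True)
        .load(emulator_uri(container, blob_path))
    )


print(emulator_uri("data", "iris.csv"))  # wasb://data@storageemulator/iris.csv
```

In the PySpark shell this could be used as `read_csv_from_emulator(spark, "data", "iris.csv").show()`.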

    Example output:

Conclusion

Local storage emulation allows testing of WASB locations without the need to connect to a remote Azure subscription / storage account.

Point in time Delta Lake table restore after S3 object deletion

Background

The Delta Lake format in Databricks provides a helpful way to restore table data using “time-travel” in case a DML statement removed or overwrote some data.

The goal of a restore is to bring back table data to a consistent version.

Delta lake timetravel

This allows accidental table operations to be reverted.

Example

Original table – contains 7 distinct diamond colour types including color = “G”:

Original table

Then, an accidental deletion occurs:

Accidental SQL delete statement

The table is now missing some data:

Modified table

However, we can bring back the deleted data by checking the Delta Lake history and restoring to a version or timestamp prior to when the delete occurred – in this case version 0 of mytable:

Delta Lake table history

Restoring the original table based on a timestamp (after version 0, but prior to version 1):

%sql
DROP TABLE IF EXISTS mytable_deltarestore;

CREATE TABLE mytable_deltarestore
USING DELTA
LOCATION "s3a://<mybucket>/mytable_deltarestore"
AS SELECT * FROM default.mytable TIMESTAMP AS OF "2021-07-25 12:20:00"; 

Now, the original data is available in the restored table, thanks to Delta Lake time-travel:

Restored data – via Timetravel
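Choosing the restore point from the table history can also be scripted. The sketch below assumes the history has been collected into a list of dictionaries with `version` and `operation` fields, as shown by `DESCRIBE HISTORY`; the function name and the sample rows are hypothetical.

```python
from typing import Iterable, Mapping, Optional


def last_version_before(history: Iterable[Mapping], operation: str = "DELETE") -> Optional[int]:
    """Return the version just before the earliest commit with `operation`."""
    rows = list(history)
    bad = [row["version"] for row in rows if row["operation"] == operation]
    if not bad:
        return None  # nothing to roll back from
    first_bad = min(bad)
    earlier = [row["version"] for row in rows if row["version"] < first_bad]
    return max(earlier) if earlier else None


# Hypothetical rows mirroring the example: version 0 is the load, version 1 the delete.
history = [
    {"version": 1, "operation": "DELETE"},
    {"version": 0, "operation": "CREATE TABLE AS SELECT"},
]
print(last_version_before(history))  # 0
```

The returned number could then be used in a `VERSION AS OF` query instead of a hand-picked timestamp.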

Challenge

What happens if table files (parquet data files or transaction log files) have been deleted in the underlying storage?

This might occur if a user or administrator accidentally deletes objects from S3 cloud storage.

Two types of files might get deleted manually.

Delta Lake data files

Symptom – table is missing data and can’t be queried:

%sql
SELECT * FROM mytable@v0;

(1) Spark Jobs
FileReadException: Error while reading file s3a://<mybucket>/mytable/part-00000-1932f078-53a0-4cbe-ac92-1b7c48f4900e-c000.snappy.parquet. A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table `DELETE` statement. For more information, see https://docs.microsoft.com/azure/databricks/delta/delta-intro#frequently-asked-questions
Caused by: FileNotFoundException: No such file or directory: s3a://<mybucket>/mytable/part-00000-1932f078-53a0-4cbe-ac92-1b7c48f4900e-c000.snappy.parquet

Delta Lake transaction logs

Symptom – table state is inconsistent and can’t be queried:

%sql
FSCK REPAIR TABLE mytable DRY RUN

Error in SQL statement: FileNotFoundException: s3a://<mybucket>/mytable/_delta_log/00000000000000000000.json: Unable to reconstruct state at version 1 as the transaction log has been truncated due to manual deletion or the log retention policy (delta.logRetentionDuration=30 days) and checkpoint retention policy (delta.checkpointRetentionDuration=2 days)
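A quick way to see which commit files have gone missing is to look at the object keys under `_delta_log`. The sketch below is an illustration only: it relies on commit files being named with a 20-digit zero-padded version (as in the error above), ignores checkpoint files, and the key list is hypothetical. A gap is not always fatal if a later checkpoint covers it.

```python
import re
from typing import Iterable, List

# Commit files look like _delta_log/00000000000000000000.json
_COMMIT_RE = re.compile(r"_delta_log/(\d{20})\.json$")


def missing_commit_versions(keys: Iterable[str]) -> List[int]:
    """Return commit versions absent between 0 and the highest version seen."""
    present = set()
    for key in keys:
        match = _COMMIT_RE.search(key)
        if match:
            present.add(int(match.group(1)))
    if not present:
        return []
    return [v for v in range(max(present) + 1) if v not in present]


# Hypothetical listing of the table prefix after version 0 was deleted.
keys = [
    "mytable/_delta_log/00000000000000000001.json",
    "mytable/part-00000-1932f078-53a0-4cbe-ac92-1b7c48f4900e-c000.snappy.parquet",
]
print(missing_commit_versions(keys))  # [0]
```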

Solution

Versioning can be enabled for S3 buckets via the AWS management console:

S3 bucket configuration – Bucket Versioning enabled

This means that if any current object versions are deleted after the above configuration is set, it may be possible to restore them.

Databricks Delta Lake tables are stored on S3 under a given folder / prefix – e.g.:

s3a://<mybucket>/<mytable>

If this prefix can be restored to a “point in time”, this can be used to restore a non-corrupted version of a table.

NB: Restoring will mean all data added after deletion occurs will be lost and would need to be reloaded from an upstream source. This also assumes that previous object versions are available on S3.

The following steps can be used in Databricks to restore past S3 object versions to a new location and re-read the table at the restore point:

  1. Install the s3-pit-restore python library in a new Databricks notebook cell:
    %pip install s3-pit-restore
  2. Run the restore command with a timestamp prior to the deletion:
    %sh
    export AWS_ACCESS_KEY_ID="<access_key_id>"
    export AWS_SECRET_ACCESS_KEY="<secret_access_key>"
    export AWS_DEFAULT_REGION="<aws_region>"
    s3-pit-restore -b <mybucket> -B <mybucket> -p mytable/ -P mytable_s3restore -t "25-07-2021 23:26:00 +10"
  3. Create a new table pointing to the restore location:
    %sql
    CREATE TABLE mytable_s3restore
    USING DELTA
    LOCATION "s3a://<mybucket>/mytable_s3restore/mytable";
  4. Verify the table contents are again available and no longer corrupted:
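Conceptually, a point-in-time restore picks, for each object key, the version that was current at the chosen timestamp. The sketch below illustrates that selection only. It is not how s3-pit-restore is implemented, and the dictionary fields (`key`, `version_id`, `last_modified`, `is_delete_marker`) are hypothetical names chosen for the example.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, Mapping, Optional


def versions_as_of(versions: Iterable[Mapping], restore_point: datetime) -> Dict[str, Optional[str]]:
    """Map each key to the version id that was current at `restore_point`.

    A key maps to None if its newest entry at that time was a delete marker.
    """
    latest: Dict[str, Mapping] = {}
    for v in versions:
        if v["last_modified"] > restore_point:
            continue  # written after the restore point, so ignore it
        current = latest.get(v["key"])
        if current is None or v["last_modified"] > current["last_modified"]:
            latest[v["key"]] = v
    return {
        key: (None if v["is_delete_marker"] else v["version_id"])
        for key, v in latest.items()
    }


utc = timezone.utc
versions = [
    {"key": "mytable/part-0.parquet", "version_id": "v1",
     "last_modified": datetime(2021, 7, 25, 12, 0, tzinfo=utc), "is_delete_marker": False},
    {"key": "mytable/part-0.parquet", "version_id": "v2",
     "last_modified": datetime(2021, 7, 25, 14, 0, tzinfo=utc), "is_delete_marker": True},
]
# 23:26 at UTC+10 is 13:26 UTC, i.e. before the delete marker was written.
print(versions_as_of(versions, datetime(2021, 7, 25, 13, 26, tzinfo=utc)))
```

With the sample data, the object maps to version `v1`, which is the copy that existed before the accidental deletion.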

Conclusion

Other techniques like Table Access Control may be preferable for preventing Databricks users from deleting underlying S3 data. However, point-in-time restore techniques may be possible where table corruption has occurred and S3 bucket versioning is enabled.

Realtime Solar PV charting

Solar PV inverters often have their own web-based monitoring solutions. However, some of these do not make it easy to view current generation or consumption due to refresh delays. Out-of-the-box monitoring is usually good for looking at long-term time periods, but lacks the granularity to see the consumption of appliances over the short term.

The challenge

Realtime monitoring of solar generation and net export helps to maximise self-consumption, for example by coordinating appliances to make best use of solar PV.

Existing inverter monitoring does not show granular data over recent history – for example, to be able to tell when a dishwasher has finished its heating cycle and whether another high-consumption appliance should be turned on:

Solution

This sample Android application allows realtime monitoring whilst charting consumption, generation and net export:

Solar Watch screenshot

The chart shows recent data over time and is configurable for SMA and Enphase inverters. In both cases the local interface of each inverter is used to pull raw data:

  • SMA: https://<inverter_ip_address>/dyn/getDashValues.json
    • NB – Smart Inverter Screen must be enabled
  • Enphase: http://<envoy_ip_address>/production.json
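As an illustration of the kind of processing involved, the sketch below derives generation, consumption and net export from an Enphase-style payload. The JSON layout (`production` and `consumption` arrays whose entries carry `measurementType` and `wNow`) is an assumption about what a metered Envoy returns, and the sample values are made up; this is not the app's actual code.

```python
from typing import Iterable, Mapping, Tuple


def _w_now(entries: Iterable[Mapping], measurement_type: str) -> float:
    """Return wNow for the entry with the given measurementType, else 0."""
    for entry in entries:
        if entry.get("measurementType") == measurement_type:
            return float(entry.get("wNow", 0.0))
    return 0.0


def summarise(payload: Mapping) -> Tuple[float, float, float]:
    """Return (generation_w, consumption_w, net_export_w)."""
    generation = _w_now(payload.get("production", []), "production")
    consumption = _w_now(payload.get("consumption", []), "total-consumption")
    return generation, consumption, generation - consumption


# Hypothetical payload, trimmed to the fields used above.
sample = {
    "production": [
        {"type": "inverters", "wNow": 3150.0},
        {"type": "eim", "measurementType": "production", "wNow": 3200.0},
    ],
    "consumption": [
        {"type": "eim", "measurementType": "total-consumption", "wNow": 1200.0},
        {"type": "eim", "measurementType": "net-consumption", "wNow": -2000.0},
    ],
}
print(summarise(sample))  # (3200.0, 1200.0, 2000.0)
```

In practice the payload would be fetched from the local production.json URL and parsed with the standard `json` module before being passed to `summarise`.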

Code

https://github.com/niftimus/SolarWatch

Features

  • Interactive UI:

  • Configurable settings:

Limitations / areas for future improvement

  • Improve security handling of SSL – the current code imports a self-signed SMA inverter certificate and disables hostname verification to allow the SMA local data to be retrieved
  • Refine code and publish to an app store
  • Remove hard-coding for extraction of metrics
  • Better error handling
  • Add a data export function

Conclusion

This sample app is really handy to monitor appliances in realtime and allows making informed decisions about when to start appliances.

Time of Use vs Flat Rate Electricity – which is cheaper?

Electricity retailers sometimes give the choice of paying a flat rate for electricity, or a so-called Time of Use (ToU) rate. Time of Use pricing usually has peak, off-peak and shoulder prices. These can also vary by time of year and by weekend or weekday.

For the consumer, Time of Use pricing may be beneficial if consumption can be shifted to off-peak hours, but this is potentially offset by more expensive rates during peak times.

Assuming a retailer gives the ability to choose – which one is cheaper?

Solution

This web calculator gives the ability to simulate costs based on historical meter data usage and configurable pricing and peak/off-peak definition:

http://members.iinet.net.au/~energyanalyser/

Energy Analyser – Screenshot

Note: Beta only. Default prices may be different depending on the retailer or electricity plan, but the sliders allow adjustment to configure unit prices to match any real plan for comparison.

Features

  • Calculate costs, potential savings and get a recommendation:
  • Fully client-side, JavaScript and HTML – no server upload required
  • Ability to drag-and-drop upload a Victorian Energy Compare formatted CSV:
  • Focus on a particular date range within the uploaded meter data:
  • Ability to configure time of use definitions (i.e. peak, off-peak and shoulder times):
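The core comparison can be sketched in a few lines of Python. This is not the calculator's code (which is client-side JavaScript). The band definitions, rates and readings below are hypothetical, and fixed daily supply charges are ignored.

```python
from datetime import datetime
from typing import Dict, List, Tuple

Reading = Tuple[datetime, float]  # (interval start, kWh used in that interval)


def tou_band(ts: datetime) -> str:
    """Hypothetical band definition; real plans differ."""
    if ts.weekday() >= 5:          # Saturday or Sunday
        return "off_peak"
    if 15 <= ts.hour < 21:
        return "peak"
    if 7 <= ts.hour < 15 or 21 <= ts.hour < 22:
        return "shoulder"
    return "off_peak"


def compare(readings: List[Reading], flat_rate: float,
            tou_rates: Dict[str, float]) -> Tuple[float, float]:
    """Return (flat_cost, tou_cost) in dollars for the same readings."""
    flat_cost = sum(kwh for _, kwh in readings) * flat_rate
    tou_cost = sum(kwh * tou_rates[tou_band(ts)] for ts, kwh in readings)
    return flat_cost, tou_cost


readings = [
    (datetime(2021, 7, 26, 18, 0), 2.0),  # Monday evening: peak
    (datetime(2021, 7, 26, 2, 0), 4.0),   # Monday overnight: off-peak
    (datetime(2021, 7, 25, 18, 0), 2.0),  # Sunday evening: off-peak
]
flat, tou = compare(readings, 0.25, {"peak": 0.40, "shoulder": 0.25, "off_peak": 0.15})
print(round(flat, 2), round(tou, 2))  # 2.0 1.7
```

With this sample, most consumption falls outside peak hours, so the Time of Use plan comes out cheaper. Shifting the overnight usage into the weekday evening would reverse the result.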

Potential future improvements

The following future improvements could make the solution more useful:

  • Cope with different data formats (different States’ data)
  • Ability to compare two (or n) different plans
  • Automatic comparison of available plans from multiple retailers (pulling prices automatically)
  • Inclusion of solar feed-in tariff as a comparison point
  • Provide recommendations for changing energy usage behaviour
  • Simulate the impact of having a home battery

Creating a virtual solar PV plug for EV charging – Part 2

In Part 1 we explored the idea of using a smart plug together with home solar monitoring to save money when charging a plug-in hybrid car.

This post details a technical approach so the plug only turns on when excess solar is available.

The code

See here for the code on GitHub:

https://github.com/niftimus/SmartPlugAutomate

Notes:

  • The code is experimental and proof of concept only – it has not been fully tested
  • The code runs as a Linux service
  • It features a web UI
  • It checks home energy consumption and decides whether to turn the plug on or off based on a threshold

The logic

For each check interval the code checks the current state of the plug and decides whether to:

  • Do nothing
  • Leave on
  • Leave off
  • Turn on
  • Turn off

Here’s a flowchart showing the decision-making process:

The Web UI

The features

  • Ability to disable / enable automatic control
    • This is useful where the plug needs to be manually controlled via its physical button
  • Configurable Min power threshold
    • This is useful where it’s acceptable to use some grid power as well as solar (e.g. partly cloudy weekends with cheaper electricity rates)
  • Minimum on / off buffer periods to reduce switching (e.g. for devices which do not benefit from being powered on and off continually)
  • Monitoring messages to see how many times the switch has been controlled and its last state
  • Overall net ( W )
    • Useful for seeing current net household energy consumption
  • Automatic recovery if the plug, solar monitoring API or Wifi network goes offline temporarily
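The decision logic and the settings listed above can be sketched as a single function. This is an illustrative reconstruction, not the code from the repository: the names, default values and exact conditions are assumptions, and how the measured net export relates to the plug's own load is glossed over.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    DO_NOTHING = "do nothing"
    LEAVE_ON = "leave on"
    LEAVE_OFF = "leave off"
    TURN_ON = "turn on"
    TURN_OFF = "turn off"


@dataclass
class Settings:
    auto_control: bool = True   # automatic control enabled?
    min_power_w: float = 0.0    # net export required before switching on
    min_on_s: float = 300.0     # minimum time to stay on (hypothetical default)
    min_off_s: float = 300.0    # minimum time to stay off (hypothetical default)


def decide(plug_on: bool, net_export_w: float, seconds_in_state: float, s: Settings) -> Action:
    """Pick one of the five outcomes for the current check interval."""
    if not s.auto_control:
        return Action.DO_NOTHING
    if plug_on:
        if net_export_w < s.min_power_w and seconds_in_state >= s.min_on_s:
            return Action.TURN_OFF
        return Action.LEAVE_ON
    if net_export_w >= s.min_power_w and seconds_in_state >= s.min_off_s:
        return Action.TURN_ON
    return Action.LEAVE_OFF


print(decide(False, 1500.0, 600.0, Settings()))  # Action.TURN_ON
```

The buffer periods mean a short dip in solar output (a passing cloud, say) leaves the plug in its current state rather than toggling it straight away.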

The result

So far this solution works great.

On a partially cloudy day, the plug automatically turns on or off as excess solar rises above or drops below the min power threshold. Similarly, the plug will turn off when household consumption is high – for example, during the heating cycle of a washing machine / dishwasher or when an electric kettle is used.

We got an interesting email from our electricity retailer after setting up this solution:

Solar health status email from electricity retailer. This shows the solution is working in increasing self-consumption.
Email from our electricity retailer

The message indicates we have successfully boosted our self-consumption – i.e. more solar energy is being self-consumed rather than being exported to the grid, giving the appearance to the retailer that the solar PV system is underperforming. Success!

Conclusion

This is not quite as good as having a home battery or a dedicated (and much more refined) device like the Zappi; however, it comes close. It is a great way to boost self-consumption of excess solar PV energy using software and a low-cost smart plug. With around a year of weekly charging, this solution can pay for the cost of the smart plug by reducing the effective cost of electricity.

Creating a virtual solar PV plug for EV charging – Part 1

A while ago the Fully Charged show featured a great device called the Zappi, which can charge an EV using surplus solar:

https://www.youtube.com/watch?v=0EtegQfZQRw

This is pretty amazing for EV owners who also have solar PV.

It means that instead of exporting surplus energy at a reduced rate ($0.12/kWh) it is possible to avoid importing energy at a higher rate ($0.25/kWh). This can effectively double the benefit of having solar PV by boosting self-consumption.

However, at the time of writing, the Zappi V2 costs $1,395 (for example, from EVolution).

Challenge

Is it possible to create a software virtual plug to charge an EV using only self-generated solar PV?

The idea

Charging the EV using only rooftop solar costs $0.12/kWh. This is the opportunity cost of the feed-in tariff which would otherwise be earned for feeding energy into the grid.

Charging the EV using grid power alone costs around $0.25/kWh.

Depending on the proportion of PV generation at a given time, the effective cost per kWh may be somewhere in between.
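As a rough sketch, the blended cost can be treated as a weighted average of the two rates above. The function name is hypothetical and the linear blend is a simplifying assumption.

```python
FEED_IN_TARIFF = 0.12  # $/kWh forgone when solar is self-consumed
GRID_RATE = 0.25       # $/kWh paid for imported energy


def effective_cost(solar_fraction: float) -> float:
    """Blended $/kWh when `solar_fraction` of the charge comes from solar."""
    if not 0.0 <= solar_fraction <= 1.0:
        raise ValueError("solar_fraction must be between 0 and 1")
    return solar_fraction * FEED_IN_TARIFF + (1.0 - solar_fraction) * GRID_RATE


for fraction in (0.0, 0.5, 1.0):
    print(fraction, round(effective_cost(fraction), 3))
```

At a 50/50 split the effective cost sits halfway between the two rates, at about $0.185/kWh.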

What if we can turn on the charger only at times when the solar is generating 100% or more of what the EV will use?

A custom software program could query net solar export and control a smart plug to generate savings.

Equipment

Mitsubishi Outlander PHEV
Envoy S Metered Solar house monitor
TP Link HS110 Smart Plug

Potential benefits

  • Cheaper EV charging (approximately 50% savings)
  • No need to manually enable / disable charging when:
    • Weather is variable
    • Household consumption is high (e.g. boiling a kettle or running the dishwasher)

Things to consider

There are also some risks to consider when designing a DIY software control:

  • The PHEV plug safety instructions say not to plug anything in between the wall socket and charger plug – i.e. where the SmartPlug should go.
  • The PHEV charger expects to be plugged in and left alone – will it be happy with power being enabled / disabled?

Another thing to consider: is it worth buying a smart plug to do this?

Assuming the plug can be purchased for a reasonable price (for example, $40 including shipping) and weekly EV charging from nearly empty, the plug pays itself off in less than a year:

Plug cost ($): 40.00
Opportunity cost / lost export ($/kWh): 0.12
Saved expense ($/kWh): 0.25
Net saving ($/kWh): 0.13
kWh savings to pay off: 307.69
Average charging session (kWh): 8.00
Number of charges: 38.46
Back of the envelope calculations
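The same arithmetic, written out as a short script so the inputs can be changed (the variable names are only illustrative):

```python
plug_cost = 40.00        # $
feed_in_tariff = 0.12    # $/kWh lost by not exporting
grid_rate = 0.25         # $/kWh avoided by not importing
session_kwh = 8.00       # average charging session (kWh)

net_saving = grid_rate - feed_in_tariff            # $/kWh
kwh_to_pay_off = plug_cost / net_saving            # kWh needed to recoup the plug
charges_to_pay_off = kwh_to_pay_off / session_kwh  # number of charging sessions

print(f"{net_saving:.2f} {kwh_to_pay_off:.2f} {charges_to_pay_off:.2f}")
# 0.13 307.69 38.46
```

At one charge per week, roughly 38 charges works out to a little under nine months.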

Continued…

See Part 2 for an approach to implement this solution in Python…

Workaround for com.microsoft.aad.adal4j.AuthenticationException when accessing SQL Server table via Active Directory in Databricks

Symptom

When using Databricks 5.5 LTS to read a table from SQL Server using Azure Active Directory (AAD) authentication, the following exception occurs:

Error : java.lang.NoClassDefFoundError: com/microsoft/aad/adal4j/AuthenticationException Error : java.lang.NoClassDefFoundError: com/microsoft/aad/adal4j/AuthenticationException
 at com.microsoft.sqlserver.jdbc.SQLServerConnection.getFedAuthToken(SQLServerConnection.java:3609)
 at com.microsoft.sqlserver.jdbc.SQLServerConnection.onFedAuthInfo(SQLServerConnection.java:3580)
 at com.microsoft.sqlserver.jdbc.SQLServerConnection.processFedAuthInfo(SQLServerConnection.java:3548)
 at com.microsoft.sqlserver.jdbc.TDSTokenHandler.onFedAuthInfo(tdsparser.java:261)
 at com.microsoft.sqlserver.jdbc.TDSParser.parse(tdsparser.java:103)
 at com.microsoft.sqlserver.jdbc.SQLServerConnection.sendLogon(SQLServerConnection.java:4290)
 at com.microsoft.sqlserver.jdbc.SQLServerConnection.logon(SQLServerConnection.java:3157)
 at com.microsoft.sqlserver.jdbc.SQLServerConnection.access$100(SQLServerConnection.java:82)
 at com.microsoft.sqlserver.jdbc.SQLServerConnection$LogonCommand.doExecute(SQLServerConnection.java:3121)
 at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7151)
 at ...io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.microsoft.aad.adal4j.AuthenticationException
 at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
 ... 59 more...

Cause

https://github.com/Azure/azure-sqldb-spark/issues/28

Workaround steps

1 – Create a new init script which will remove legacy MSSQL drivers from the cluster. The following commands create a new directory on DBFS and then create a shell script with a single command to remove mssql driver JARs:

%sh
mkdir /dbfs/myInitScriptDir
echo "rm /databricks/jars/*mssql*" > /dbfs/myInitScriptDir/myInitScript.sh

2 – Add the cluster init script in Clusters > Cluster > Edit > Advanced Options:

3 – Add the following two libraries to the cluster via Clusters > Cluster > Libraries > Install new:

com.microsoft.azure:adal4j:1.6.5
com.microsoft.sqlserver:mssql-jdbc:8.4.1.jre8

4 – Restart the cluster.

5 – Run the following R code in a notebook cell to validate that AAD authentication is working. NB – Replace the placeholder values (server, database, user, password and table) with your own:

library(sparklyr)

connection <- spark_connect(method = "databricks")

x <- spark_read_jdbc(
connection,
name = 'mytemptable',
options = list(
url = 'jdbc:sqlserver://myazuresqlserver.database.windows.net:1433;database=myazuresqldatabase;authentication=ActiveDirectoryPassword;',
driver = 'com.microsoft.sqlserver.jdbc.SQLServerDriver',
user = 'myuser@example.com',
password = 'XXXXXXXX',
hostNameInCertificate = '*.database.windows.net',
dbtable = 'dbo.mytable'
)
)

x

After running the command “x” above, the table data should be displayed.
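For Python notebooks on the same cluster, an equivalent read might look like the sketch below. It simply mirrors the options from the R example. The helper names are hypothetical, `spark` is assumed to be the notebook's existing SparkSession, and the approach has not been verified against a live database here.

```python
def aad_jdbc_options(server: str, database: str, user: str, password: str, table: str) -> dict:
    """Build the same option map that the R example passes to spark_read_jdbc()."""
    return {
        "url": (
            f"jdbc:sqlserver://{server}:1433;"
            f"database={database};authentication=ActiveDirectoryPassword;"
        ),
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        "user": user,
        "password": password,
        "hostNameInCertificate": "*.database.windows.net",
        "dbtable": table,
    }


def read_table(spark, **kwargs):
    """Read the table with an existing SparkSession (e.g. `spark` in a notebook)."""
    return spark.read.format("jdbc").options(**aad_jdbc_options(**kwargs)).load()


opts = aad_jdbc_options(
    "myazuresqlserver.database.windows.net", "myazuresqldatabase",
    "myuser@example.com", "XXXXXXXX", "dbo.mytable",
)
print(opts["url"])
```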

Conclusion

The Azure SQL Database table can now be read and the AuthenticationException no longer occurs:

Successful table query after spark_read_jdbc()

Credit: This workaround is based on thereverand’s very helpful post on GitHub.

Automatically tagging, captioning and categorising locally stored images using the Azure Computer Vision API

It’s easy in the digital age to amass tens of thousands of photos (or more!). Categorising these can be a challenging task, let alone searching through them to find that one happy snap from 10 years ago.

Significant advances in machine learning over the past decade have made it possible to automatically tag and categorise photos without user input (assuming a machine learning model has been pre-trained). Many social media and photo sharing platforms make this functionality available for their users – for example, Flickr’s “Magic View”. But what if a user has a large number of files stored locally on a hard disk?

The problem

  • 49,049 uncategorised digital images stored locally
  • Manual categorisation
  • No easy way to search (e.g. “red dress”, “mountain”, “cat on a mat”)

The solution

Steps

  1. Obtain a Microsoft Azure cloud subscription (note – Azure is not free, however free trials may be available):
    https://azure.microsoft.com/en-us/free/
  2. Create a Cognitive Services account from the Azure portal and take note of one of the “Keys” (keys are interchangeable):
    https://portal.azure.com/
  3. Log in to your Linux machine and ensure you have python3 installed:
    user@host.site:~> which python3
    /usr/bin/python3
  4. Ensure you have these python libraries installed:
    sudo su -
    pip3 install python-xmp-toolkit
    pip3 install argparse
    pip3 install Pillow
    exit
  5. Obtain a copy of the image-auto-tag script:
    git clone https://github.com/niftimusmaximus/image-auto-tag
  6. Automatically tag, caption and categorise an image (e.g. image.jpg):
    cd image-auto-tag
    ./image-auto-tag.py --key XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX \
      --captionConfidenceLevel 0.50 --tagConfidenceLevel 0.5 \
      --categoryConfidenceLevel 0.5 image.jpg

    Note – replace key with one of the ones obtained from the Azure Portal above

    Script will process the image:

    INFO: [image.jpg] Reading input file 1/1                                                                                                                      
    INFO: [image.jpg] Temporarily resized to 800x600                                                                                                              
    INFO: [image.jpg] Uploading to Azure Computer Vision API
                      (length: 107330 bytes)                                                                               
    INFO: [image.jpg] Response received from Azure Computer Vision API
                      (length: 1026 bytes)                                                                       
    INFO: [image.jpg] Appended caption 'a river with a mountain in the
                      background' (confidence: 0.67 >= 0.50)                                                     
    INFO: [image.jpg] Appended category 'outdoor_water'
                      (confidence: 0.84 >= 0.50)                                                                                
    INFO: [image.jpg] Appending tag 'nature' (confidence: 1.00 >= 0.50)                                                                                           
    INFO: [image.jpg] Appending tag 'outdoor' (confidence: 1.00 >= 0.50)                                                                                          
    INFO: [image.jpg] Appending tag 'water' (confidence: 0.99 >= 0.50)                                                                                            
    INFO: [image.jpg] Appending tag 'mountain' (confidence: 0.94 >= 0.50)                                                                                         
    INFO: [image.jpg] Appending tag 'river' (confidence: 0.90 >= 0.50)                                                                                            
    INFO: [image.jpg] Appending tag 'rock' (confidence: 0.89 >= 0.50)                                                                                             
    INFO: [image.jpg] Appending tag 'valley' (confidence: 0.75 >= 0.50)                                                                                           
    INFO: [image.jpg] Appending tag 'lake' (confidence: 0.60 >= 0.50)                                                                                             
    INFO: [image.jpg] Appending tag 'waterfall' (confidence: 0.60 >= 0.50)                                                                                        
    INFO: [image.jpg] Finished writing XMP data to file 1/1
  7. Verify the results:
    Auto tagging

    API has applied “tags” which can be searched

    Auto captioning

    API has captioned this image as “a beach with palm trees”

    Auto categorisation

    API has applied the category “plant_tree” to this image

    Note – the API uses an 86-category taxonomy

Script features

  • Writes to standard XMP metadata tags within JPG images which can be read by image management applications such as XnView MP and digiKam
  • Sends downsized images to Azure to improve performance

    Example – only send an image of width 640 pixels (the original image will retain its dimensions):

    ./image-auto-tag.py --key XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX \
    --azureResizeWidth 640 image.jpg
  • Allows customisation of confidence thresholds for tags, categories and captions. This is useful because, whilst good, the API is not perfect!

    Example – only caption image if caption confidence score from API is 0.5 or above:

    ./image-auto-tag.py --key XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX \
    --captionConfidenceLevel 0.5 image.jpg
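The confidence thresholds can be pictured as a simple filter over the API response. The sketch below is not taken from the script. It assumes the response follows the Computer Vision "analyze" JSON layout (`description.captions`, `tags`, `categories`), and the sample values are partly borrowed from the log output earlier and partly made up.

```python
from typing import Dict, Mapping


def filter_analysis(response: Mapping, caption_min: float = 0.5,
                    tag_min: float = 0.5, category_min: float = 0.5) -> Dict[str, object]:
    """Keep only the caption, tags and categories that meet their thresholds."""
    captions = response.get("description", {}).get("captions", [])
    best = max(captions, key=lambda c: c["confidence"], default=None)
    caption = best["text"] if best and best["confidence"] >= caption_min else None
    tags = [t["name"] for t in response.get("tags", []) if t["confidence"] >= tag_min]
    categories = [c["name"] for c in response.get("categories", []) if c["score"] >= category_min]
    return {"caption": caption, "tags": tags, "categories": categories}


# Sample response: the 'snow' tag is hypothetical and falls below the threshold.
sample = {
    "description": {"captions": [
        {"text": "a river with a mountain in the background", "confidence": 0.67}]},
    "tags": [
        {"name": "nature", "confidence": 1.00},
        {"name": "lake", "confidence": 0.60},
        {"name": "snow", "confidence": 0.30},
    ],
    "categories": [{"name": "outdoor_water", "score": 0.84}],
}
print(filter_analysis(sample))
```

Raising `caption_min` to 0.7 would drop the caption from this sample while keeping the tags and category.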

Useful queries for the Hive metastore

Hive metastore tables

The Hive metastore stores metadata about objects within Hive.  Usually this metastore sits within a relational database such as MySQL.

Sometimes it’s useful to query the Hive metastore directly to find out what databases, tables and views exist in Hive and how they’re defined. For example, say we want to expose a report to users about how many Hive tables are currently in a Hadoop cluster.  Or perhaps we want to run a script which performs some bulk operation on all tables in a particular Hive database.

Luckily, it’s easy to query the metastore using a tool such as MySQL Workbench using appropriate connectors – e.g. MySQL JDBC drivers.

Here’s a rough database diagram showing how the Hive metastore hangs together:

Hive metastore database diagram (ERD, from HDP 2.3)

Handy metastore SQL queries

Show all Hive databases

SELECT * FROM hive.DBS;

Output:

DB_ID | DESC | DB_LOCATION_URI | NAME | OWNER_NAME | OWNER_TYPE
1 | Default Hive database | hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse | default | public | ROLE
6 | NULL | hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/xademo.db | xademo | hive | USER

List tables in a given database

SELECT t.* FROM hive.TBLS t
JOIN hive.DBS d
ON t.DB_ID = d.DB_ID
WHERE d.NAME = 'default';

Output:

TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER | RETENTION | SD_ID | TBL_NAME | TBL_TYPE | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT | LINK_TARGET_ID
1 | 1439988377 | 1 | 0 | hue | 0 | 1 | sample_07 | MANAGED_TABLE | NULL | NULL | NULL
2 | 1439988387 | 1 | 0 | hue | 0 | 2 | sample_08 | MANAGED_TABLE | NULL | NULL | NULL

Show the storage location of a given table

SELECT s.* FROM hive.TBLS t
JOIN hive.DBS d
ON t.DB_ID = d.DB_ID
JOIN hive.SDS s
ON t.SD_ID = s.SD_ID
WHERE TBL_NAME = 'sample_07'
AND d.NAME='default';

Output:

SD_ID | CD_ID | INPUT_FORMAT | IS_COMPRESSED | IS_STOREDASSUBDIRECTORIES | LOCATION | NUM_BUCKETS | OUTPUT_FORMAT | SERDE_ID
1 | 1 | org.apache.hadoop.mapred.TextInputFormat | 0 | 0 | hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/sample_07 | -1 | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | 1

Find out how a given view has been defined

SELECT t.* FROM hive.TBLS t
JOIN hive.DBS d
ON t.DB_ID = d.DB_ID
WHERE TBL_NAME = 'vw_sample_07'
AND d.NAME='default';

Output:

TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER | RETENTION | SD_ID | TBL_NAME | TBL_TYPE | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT | LINK_TARGET_ID
31 | 1471788438 | 1 | 0 | hue | 0 | 31 | vw_sample_07 | VIRTUAL_VIEW | select count(*) from `default`.`sample_07` | select count(*) from default.sample_07 | NULL

Get column names, types and comments of a given table

SELECT c.* FROM hive.TBLS t
JOIN hive.DBS d
ON t.DB_ID = d.DB_ID
JOIN hive.SDS s
ON t.SD_ID = s.SD_ID
JOIN hive.COLUMNS_V2 c
ON s.CD_ID = c.CD_ID
WHERE TBL_NAME = 'sample_07'
AND d.NAME='default'
ORDER BY INTEGER_IDX;

Output:

CD_ID | COMMENT | COLUMN_NAME | TYPE_NAME | INTEGER_IDX
1 | NULL | code | string | 0
1 | NULL | description | string | 1
1 | NULL | total_emp | int | 2
1 | NULL | salary | int | 3
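The "how many tables per database" report mentioned earlier follows the same join pattern. The sketch below uses an in-memory SQLite database purely as a stand-in for the MySQL metastore so that it runs anywhere. Only the columns needed for the query are created, and the rows are a small hypothetical subset of the sample data above.

```python
import sqlite3

# Stand-in for the metastore: just the columns this report needs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DBS (DB_ID INTEGER PRIMARY KEY, NAME TEXT);
    CREATE TABLE TBLS (TBL_ID INTEGER PRIMARY KEY, DB_ID INTEGER, TBL_NAME TEXT, TBL_TYPE TEXT);
    INSERT INTO DBS VALUES (1, 'default'), (6, 'xademo');
    INSERT INTO TBLS VALUES
        (1, 1, 'sample_07', 'MANAGED_TABLE'),
        (2, 1, 'sample_08', 'MANAGED_TABLE'),
        (31, 1, 'vw_sample_07', 'VIRTUAL_VIEW');
""")

# Count tables (excluding views) per database, including empty databases.
TABLES_PER_DB = """
    SELECT d.NAME, COUNT(t.TBL_ID) AS table_count
    FROM DBS d
    LEFT JOIN TBLS t
      ON t.DB_ID = d.DB_ID AND t.TBL_TYPE <> 'VIRTUAL_VIEW'
    GROUP BY d.NAME
    ORDER BY d.NAME
"""

report = conn.execute(TABLES_PER_DB).fetchall()
print(report)  # [('default', 2), ('xademo', 0)]
```

Against a real metastore, the same SELECT would be run through a MySQL connection, with the `hive.` schema prefix used in the queries above.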

Conclusion

It’s possible to query metadata from the Hive metastore, which can be handy for understanding what data is available in a Hive instance. It’s also possible to edit this information, although this would usually be inadvisable: the schema of the metastore may change between Hive versions, and the results of modifying Hive internals could be unexpected at best and catastrophic at worst.

Python + JDBC = Dynamic Hive scripting

Working with Hive can be challenging without the benefit of a procedural language (such as T-SQL or PL/SQL) in order to do things with data in between Hive statements or run dynamic hive statements in bulk.  For example – we may want to do a rowcount of all tables in one of our Hive databases, without having to code a fixed list of tables in our Hive code.

We can compile Java code to run queries against hive dynamically, but this can be overkill for smaller requirements. Scripting can be a better way to code more complex Hive tasks.

Python to the rescue

Python code can be used to execute dynamic Hive statements, which is useful in these sorts of scenarios:

  1. Code branching depending on results of a Hive query – e.g. ensuring Hive query A successfully executes before running Hive query B
  2. Using looked-up data to form a filter in a Hive query – e.g. selecting data from the latest partition in a Hive table without needing to perform a nested query to get the latest partition
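The second scenario can be sketched as a small helper that turns looked-up partition names into a filter. It assumes a single-level partition whose names look like `dt=2016-08-21` (the `key=value` form stored in the metastore) and whose values sort correctly as strings. The function name and sample values are hypothetical.

```python
from typing import Iterable


def latest_partition_filter(part_names: Iterable[str]) -> str:
    """Build a WHERE-clause fragment that selects the newest partition."""
    names = list(part_names)
    if not names:
        raise ValueError("table has no partitions")
    column, value = max(names).split("=", 1)
    return f"{column} = '{value}'"


names = ["dt=2016-08-19", "dt=2016-08-21", "dt=2016-08-20"]
print("select * from default.mytable where " + latest_partition_filter(names))
# select * from default.mytable where dt = '2016-08-21'
```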

There are several Python libraries available for connecting to Hive such as PyHive and Pyhs2 (the latter unfortunately now unmaintained). Some major Hadoop vendors, however, decline to explicitly support this type of direct integration. They do, however, still strongly support ODBC and JDBC interfaces.

Python + JDBC

We can, in fact, connect Python to sources including Hive and also the Hive metastore using the JayDeBeApi package. This is effectively a wrapper allowing Java DB drivers to be used in Python scripts.

Example:

  1. The shell code (setting environment variables)

    First, we need to set the classpath to include the library directories where Hive JDBC drivers can be found, and also where the Python JayDeBeApi module can be found:

    export CLASSPATH=$CLASSPATH:`hadoop classpath`:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hive-client/*:/usr/hdp/current/hadoop-client/client/*
    export PYTHONPATH=$PYTHONPATH:/home/me/jaydebeapi/build/
  2. The Python code

    Connections can be established to Hive and Hive metastore using jaydebeapi’s connect() method:

    import jaydebeapi

    # JayDeBeApi 1.x signature: connect(jclassname, url, driver_args, jars).
    # (Older 0.x releases took [url, user, password] as a single list.)

    # Connect to Hive (Kerberos principal in the URL, so user/password are empty)
    conn_hive = jaydebeapi.connect('org.apache.hive.jdbc.HiveDriver',
            'jdbc:hive2://myhiveserver.mydomain.local/default;principal=hive/_HOST@MYDOMAIN.LOCAL;',
            ['', ''], '/path/to/hive-jdbc.jar')
    curs_hive = conn_hive.cursor()

    # Connect to Hive metastore
    conn_mysql = jaydebeapi.connect('com.mysql.jdbc.Driver',
            'jdbc:mysql://metastoremysqlserver.mydomain.local:3306/hive',
            ['mysql_username', 'mysql_password'],
            '/path/to/mysql-jdbc-connector.jar')
    curs_mysql = conn_mysql.cursor()

    A metastore query can be run to retrieve the names of matching tables in the default database into a list (mysql_query_output):

    # Query the metastore to get matching tables in the default database
    # (triple quotes are needed for a string that spans several lines)
    mysql_query_string = """select t.TBL_NAME
    from TBLS t join DBS d
    on t.DB_ID = d.DB_ID
    where t.TBL_NAME like '%mytable%'
    and d.NAME='default'"""

    curs_mysql.execute(mysql_query_string)

    mysql_query_output = curs_mysql.fetchall()

    Hive queries can be dynamically generated and executed to retrieve row counts for all the tables found above:

    # Perform a row count of each Hive table found and output it to the screen
    for i in mysql_query_output:

            # i[0] is the table name returned by the metastore query
            hive_query_string = ("select '" + i[0] + "' as tabname, "
                                 "count(*) as cnt "
                                 "from default." + i[0])

            curs_hive.execute(hive_query_string)

            hive_query_output = curs_hive.fetchall()

            print(hive_query_output)

    Done! Output from Hive queries now should be printed to the screen.
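Because the table names are spliced directly into the HQL text, it may be worth validating them and catching failures per table. The sketch below shows one possible way to do that. The helper names are hypothetical, and `cursor` is assumed to be any DB-API style cursor such as `curs_hive`.

```python
import re
from typing import Dict, Iterable

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def rowcount_query(database: str, table: str) -> str:
    """Build a row-count statement, rejecting names that are not plain identifiers."""
    for name in (database, table):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return f"select '{table}' as tabname, count(*) as cnt from {database}.{table}"


def count_rows(cursor, database: str, tables: Iterable[str]) -> Dict[str, object]:
    """Run one count per table; a failure is recorded instead of stopping the loop."""
    results: Dict[str, object] = {}
    for table in tables:
        try:
            cursor.execute(rowcount_query(database, table))
            results[table] = cursor.fetchall()[0][1]
        except Exception as exc:  # keep going so one bad table does not end the run
            results[table] = exc
    return results


print(rowcount_query("default", "sample_07"))
```

With the connections from the example, `count_rows(curs_hive, "default", [row[0] for row in mysql_query_output])` would return a dictionary of counts keyed by table name.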

Pros and cons of the solution

Pros:

  • Provides a nice way of scripting whilst using Hive data
  • Basic error handling is possible through Python after each HQL is executed
  • Connection to a wide variety of JDBC compatible databases

Cons:

  • Relies on client memory to store query results – not suitable for big data volumes (Spark would be a better solution on this front, as all processing is done in parallel and not brought back to the client unless absolutely necessary)
  • Minimal control / visibility over Hive query whilst running