Python + JDBC = Dynamic Hive scripting

Working with Hive can be challenging without the benefit of a procedural language (such as T-SQL or PL/SQL) in order to do things with data in between Hive statements or run dynamic hive statements in bulk.  For example – we may want to do a rowcount of all tables in one of our Hive databases, without having to code a fixed list of tables in our Hive code.

We can compile Java code to run queries against hive dynamically, but this can be overkill for smaller requirements. Scripting can be a better way to code more complex Hive tasks.

Python to the rescue

Python code can be used to execute dynamic Hive statements, which is useful in these sorts of scenarios:

  1. Code branching depending on results of a Hive query – e.g. ensuring Hive query A successfully executes before running Hive query B
  2. Using looked-up data to form a filter in a Hive query – e.g. selecting data from the latest partition in a Hive table without needing to perform a nested query to get the latest partition

There are several Python libraries available for connecting to Hive such as PyHive and Pyhs2 (the latter unfortunately now unmanaged).  Some major Hadoop vendors however decline to support this type of direct integration explicitly.  They do, however, still strongly support ODBC and JDBC interfaces.

Python + JDBC

We can, in fact, connect Python to sources including Hive and also the Hive metastore using the package JayDeBe API. This is effectively a wrapper allowing Java DB drivers to be used in Python scripts.

Example:

  1. The shell code (setting environment variables)

    First, we need to set the classpath to include the library directories where Hive JDBC drivers can be found, and also where the Python JayDeBe API module can be found:

    export CLASSPATH=$CLASSPATH:`hadoop classpath`:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hive-client/*:/usr/hdp/current/hadoop-client/client/*
    export PYTHONPATH=$PYTHONPATH:/home/me/jaydebeapi/build/
  2. The Python code

    Connections can be established to Hive and Hive metastore using jaydebeapi’s connect() method:

    # Connect to Hive
    conn_hive = jaydebeapi.connect('org.apache.hive.jdbc.HiveDriver',
            ['jdbc:hive2://myhiveserver.mydomain.local/default;principal=hive/_HOST@MYDOMAIN.LOCAL;',
            '', ''], '/path/to/hive-jdbc.jar',)
    curs_hive = conn_hive.cursor()
    
    # Connect to Hive metastore
    conn_mysql = jaydebeapi.connect('com.mysql.jdbc.Driver',
            ['jdbc:mysql://metastoremysqlserver.mydomain.local:3306/hive',
             'mysql_username', 'mysql_password'],
            '/path/to/mysql-jdbc-connector.jar',)
    curs_mysql = conn_mysql.cursor()

    A metastore query can be run to retrieve the names of all tables in the default database into an arry (mysql_query_output):

    # Query the metastore to get all tables in defined databases
    mysql_query_string = "select t.TBL_NAME
    from TBLS t join DBS d
    on t.DB_ID = d.DB_ID
    where t.TBL_NAME like '%mytable%'
    and d.NAME='default'"
    
    curs_mysql.execute(mysql_query_string)
    
    mysql_query_output = curs_mysql.fetchall()

    Hive queries can be dynamically generated and executed to retrieve row counts for all the tables found above:

    # Perform a row count of each hive table found and output it to the screen
    for i in mysql_query_output:
            
            hive_query_string = "select '" + i[0] + "' as tabname,
            count(*) as cnt
            from default." + i[0]
    
            curs_hive.execute(hive_query_string)
    
            hive_query_output = curs_hive.fetchall()
    
            print hive_query_output

    Done! Output from Hive queries now should be printed to the screen.

Pros and cons of the solution

Pros:

  • Provides a nice way of scripting whilst using Hive data
  • Basic error handling is possible through Python after each HQL is executed
  • Connection to a wide variety of JDBC compatible databases

Cons:

  • Relies on client memory to store query results – not suitable for big data volumes (Spark would be a better solution on this front, as all processing is done in parallel and not brought back to the client unless absolutely necessary)
  • Minimal control / visibility over Hive query whilst running
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s