A handy feature of Apache Hive is the ability to expose the path and filename of the underlying data files as a column in a view or table, using the virtual column INPUT__FILE__NAME. This is particularly useful for external tables where metadata about the files is embedded in their HDFS location or in the filenames themselves.
The problem
Say we have some files sitting under a nested folder structure in HDFS:
The files each contain two columns, a string and an integer. We can create an external Hive table over them as follows:
CREATE EXTERNAL TABLE default.test_filename
(
col0 string,
col1 int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION '/tmp/testonly/';
A simple SELECT on the above table shows data from all subdirectories, but gives no indication of which file or folder each row came from:
The solution – including file metadata in Hive
Hive features a virtual column, INPUT__FILE__NAME, which can be used to query the location and name of the underlying data file for each row returned:
SELECT INPUT__FILE__NAME, * FROM default.test_filename;
The full HDFS path is now included as a column in each row of data (e.g. “hdfs://sandbox.hortonworks.com:8020/tmp/testonly/vic/2016-01-01.txt”):
We can go one step further and use standard Hive functions to create two new artificial columns – file_date (from the filename) and state (from the HDFS subfolder).
SELECT substring(INPUT__FILE__NAME,length(INPUT__FILE__NAME)-13,10) as file_date,
regexp_extract(INPUT__FILE__NAME, '.*/(.*)/.*.txt', 1) as state,
*
FROM default.test_filename;
Now there are two extra columns in our Hive resultset derived solely from the HDFS path of the underlying files:
Apache Hive is great for enabling SQL-like querying over flat files. This is trivial in the case of tabular formatted files such as CSVs, where we can set custom row and field delimiters out of the box (e.g. Hive’s inbuilt CSV SerDe). Even more complex files can be read and converted to a desired row and column format using Hive’s regexp_extract() function, so long as we can operate on a single row at a time.
The problem
What if the rows we want in Hive aren’t rows in the input files? That is, we need to read the file as a whole and decode it to produce the output we want to see in Hive.
An example is the Australian Bureau of Meteorology’s ASCII Grid format. These are fixed-format text files with a header section which effectively describes how to read the rest of the file. In the data section, each data row corresponds to a line of latitude on a map (with the starting coordinates identified in the header) and, similarly, each column corresponds to a line of longitude. Read as a whole, the file contains a grid of readings for a particular weather observation – e.g. rainfall for a given time period:
Example decoding of ASCII grid format file
To read this data in Hive it might be possible to define a table which hard-codes column values to their corresponding longitudes, but this leaves the problem of reading similarly formatted files with a different geographical granularity or a different starting position on the globe. We would also struggle at the Hive query language layer to determine the appropriate latitude of a given data row, because the header contains the metadata describing which row in the file corresponds to which latitude.
To make the grid data easier to consume in Hive we may wish to transform files into a format such as this:
Geospatial data – desired tabular format
This means we can query a file by filtering on particular lat / long combinations. One way to transform the file into this format is to create custom Hive InputFormat and RecordReader Java classes which are invoked at query time.
InputFormat / RecordReader vs SerDe
A key distinction when creating custom classes to use with Hive is the following:
InputFormat and RecordReader – take files as input and generate rows
SerDe – takes rows as input and generates columns
Here, ASCII grid formatted files cannot be de-serialised row-by-row because the latitude of a given data row depends on its position in the file and on information in the header, so a SerDe is likely not the best option. Instead, an InputFormat Java class can be written to convert the input ASCII grid formatted files into the desired tabular format above, making it possible to query arbitrary lat / long coordinates via Hive.
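To make that decoding step concrete, below is a minimal, illustrative sketch of the kind of whole-file transformation involved. It assumes an ESRI-style ASCII grid header (ncols, nrows, xllcorner, yllcorner, cellsize, NODATA_value) and emits one tab-separated row per grid cell in the shape lat1, long1, lat2, long2, measurement, taking lat1/long1 and lat2/long2 to be opposite corners of each cell. The class and method names are invented for illustration; this is not the original code:
import java.nio.charset.StandardCharsets;

// Illustrative helper: decodes a whole ASCII grid file (header + data) into
// tab-separated rows of lat1, long1, lat2, long2, measurement.
public class AsciiGridDecoder {

    public static byte[] decode(byte[] asciiGridFile) {
        String[] lines = new String(asciiGridFile, StandardCharsets.US_ASCII).split("\r?\n");
        int ncols = 0, nrows = 0, headerLines = 0;
        double xll = 0, yll = 0, cellsize = 0, nodata = -9999;

        // The header tells us how to interpret every data row that follows.
        for (String line : lines) {
            String[] kv = line.trim().split("\\s+");
            String key = kv[0].toLowerCase();
            if (key.equals("ncols")) ncols = Integer.parseInt(kv[1]);
            else if (key.equals("nrows")) nrows = Integer.parseInt(kv[1]);
            else if (key.equals("xllcorner")) xll = Double.parseDouble(kv[1]);
            else if (key.equals("yllcorner")) yll = Double.parseDouble(kv[1]);
            else if (key.equals("cellsize")) cellsize = Double.parseDouble(kv[1]);
            else if (key.equals("nodata_value")) nodata = Double.parseDouble(kv[1]);
            else break; // first data row reached
            headerLines++;
        }

        StringBuilder out = new StringBuilder();
        // Data rows run north to south; columns run west to east.
        for (int row = 0; row < nrows; row++) {
            String[] cells = lines[headerLines + row].trim().split("\\s+");
            double latTop = yll + (nrows - row) * cellsize;
            for (int col = 0; col < ncols; col++) {
                double measurement = Double.parseDouble(cells[col]);
                if (measurement == nodata) continue; // skip cells with no reading
                double lonLeft = xll + col * cellsize;
                out.append(latTop).append('\t')
                   .append(lonLeft).append('\t')
                   .append(latTop - cellsize).append('\t')
                   .append(lonLeft + cellsize).append('\t')
                   .append(measurement).append('\n');
            }
        }
        return out.toString().getBytes(StandardCharsets.US_ASCII);
    }
}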
Creating a custom InputFormat
An InputFormat compatible with Hive can be built by writing classes which implement and extend standard mapred library classes:
CustomTextInputFormat.java – extends FixedLengthInputFormat. Returns a CustomTextRecordReader, which Hive plugs in behind the scenes at runtime (a trimmed sketch of this wiring appears after this list).
CustomTextRecordReader.java – implements the mapred RecordReader<LongWritable, BytesWritable> interface. Reads (and decompresses, if required) files from the Hadoop filesystem and calls ReadASCIIGridFile to do the actual transformation.
ReadASCIIGridFile.java – contains a static class which performs the transformation from input (a byte array in ASCII grid format) to output (a byte array of Hive-formatted rows).
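For orientation, here is a heavily trimmed sketch of how the first two classes might fit together using the Hadoop mapred API. It is not the original source: for brevity it extends FileInputFormat directly rather than FixedLengthInputFormat, skips compression handling (covered in the notes below), and calls the illustrative AsciiGridDecoder from the earlier sketch in place of ReadASCIIGridFile:
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Sketch only – in practice each class lives in its own .java file under
// com.analyticsanvil.custominputformat.
public class CustomTextInputFormat extends FileInputFormat<LongWritable, BytesWritable> {

    // Never split an ASCII grid file: every data row needs the header to be decoded.
    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;
    }

    @Override
    public RecordReader<LongWritable, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new CustomTextRecordReader(split, job);
    }
}

class CustomTextRecordReader implements RecordReader<LongWritable, BytesWritable> {

    private final String[] rows; // decoded rows, one tab-separated line per grid cell
    private int current = 0;

    CustomTextRecordReader(InputSplit genericSplit, JobConf job) throws IOException {
        Path path = ((FileSplit) genericSplit).getPath();
        FileSystem fs = path.getFileSystem(job);
        // Read the whole file into memory, then decode it in one pass.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        IOUtils.copyBytes(fs.open(path), buffer, 4096, true);
        rows = new String(AsciiGridDecoder.decode(buffer.toByteArray()),
                StandardCharsets.US_ASCII).split("\n");
    }

    // Hand Hive one decoded row per call.
    @Override
    public boolean next(LongWritable key, BytesWritable value) {
        if (current >= rows.length) {
            return false;
        }
        byte[] row = rows[current].getBytes(StandardCharsets.US_ASCII);
        key.set(current);
        value.set(row, 0, row.length);
        current++;
        return true;
    }

    @Override public LongWritable createKey() { return new LongWritable(); }
    @Override public BytesWritable createValue() { return new BytesWritable(); }
    @Override public long getPos() { return current; }
    @Override public float getProgress() { return rows.length == 0 ? 1.0f : (float) current / rows.length; }
    @Override public void close() { }
}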
Notes:
The code uses the mapred rather than the mapreduce API of Hadoop, as Hive only supports mapred-style InputFormat objects
CustomTextInputFormat.java sets all files to be non-splittable. This is done because a file must be read in full with its header to properly convert to the target format.
The CustomTextRecordReader copes with compressed input by using the org.apache.hadoop.io.compress.CompressionCodec class to decompress input files where necessary. This is advantageous for ASCII grid formatted files, which lend themselves well to compression (e.g. via GZIP) before being uploaded to HDFS.
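As an illustration of that decompression step (again, not the original code), opening the file might look something like this, using CompressionCodecFactory to pick a codec based on the file extension:
import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Illustrative helper: open an HDFS file, transparently decompressing it when its
// extension matches a known codec (e.g. .gz), otherwise reading it as-is.
public class CompressedFileOpener {
    public static InputStream open(Path path, Configuration conf) throws IOException {
        FileSystem fs = path.getFileSystem(conf);
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);
        InputStream raw = fs.open(path);
        return (codec != null) ? codec.createInputStream(raw) : raw;
    }
}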
Compiling the custom InputFormat
Copy the text out of the above files and save them to a folder on the filesystem, then compile them and package the classes into a JAR file (note – a Java JDK must be installed to run the compilation, preferably the same version as the one the target Hadoop system is running):
Note – the classpath in the javac command assumes that necessary Hadoop library jar files are installed in certain locations. The locations mentioned are for the Hortonworks HDP 2.3.0 Sandbox VM, but can be changed to suit other versions / distributions.
Using the custom InputFormat with Hive
Run hive at the command line.
In the Hive session, add the newly created InputFormat JAR:
hive> add jar /tmp/CustomTextInputFormat.jar;
Added [/tmp/CustomTextInputFormat.jar] to class path
Added resources: [/tmp/CustomTextInputFormat.jar]
Create an external table on an HDFS directory containing ASCII grid formatted files:
hive> create external table default.test_ascii
(lat1 float, long1 float, lat2 float, long2 float, measurement float)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS
INPUTFORMAT 'com.analyticsanvil.custominputformat.CustomTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/tmp/';
OK
Time taken: 13.192 seconds
Querying the new table now returns data in the desired tabular format.
Conclusion
By creating an InputFormat Java class which reads and transforms fixed-format files at Hive query time, we can effectively convert data into forms which are better suited to analytical purposes.
Similarly, writing a custom input format allows almost any data to be read by invoking a custom Java class on each mapper, translating the input into tabular format for use in Hive. In the above example ReadASCIIGridFile.java converts an ASCII grid formatted file into a long list of lat / long combinations and readings, but equally, a new Java class could be written to read more exotic forms of input data – e.g. MP3 audio files, JPEGs or other types of binary file. So long as a developer knows how to code the translation in Java, input files can be converted into Hive-queryable tabular data on the fly and in parallel (e.g. MP3 files with timestamps and spectral / frequency analysis).
It’s worth noting that further improvements could be made to the Java code above – for example, more efficient parsing of input files (currently done with regular expressions), better error checking, improved memory utilisation, and a mechanism to combine input files for a given input split to improve performance.