It’s easy in the digital age to amass tens of thousands of photos (or more!). Categorising these can be a challenging task, let alone searching through them to find that one happy snap from 10 years ago. Significant advances in machine learning over the past decade have made it possible to automatically tag and categorise… Continue reading Automatically tagging, captioning and categorising locally stored images using the Azure Computer Vision API
Useful queries for the Hive metastore
The Hive metastore stores metadata about objects within Hive. Usually this metastore sits within a relational database such as MySQL. Sometimes it’s useful to query the Hive metastore directly to find out what databases, tables and views exist in Hive and how they’re defined. For example, say we want to expose a report to users… Continue reading Useful queries for the Hive metastore
Python + JDBC = Dynamic Hive scripting
Working with Hive can be challenging without the benefit of a procedural language (such as T-SQL or PL/SQL) in order to do things with data in between Hive statements or run dynamic hive statements in bulk. For example – we may want to do a rowcount of all tables in one of our Hive databases,… Continue reading Python + JDBC = Dynamic Hive scripting
Useful date formulas for Hive
Hive comes with some handy functions for transforming dates. These can be helpful when working with date dimension tables and performing time-based comparisons and aggregations. e.g. Convert a native Hive date formatted date string: date_format(myDate,’dd-MM-yyyy’) Return the week number (within the year) of a particular date – i.e. first week of the year is 1,… Continue reading Useful date formulas for Hive
Finding a sequence of events in Hive using analytic functions
Hadoop Hive features several useful functions for efficiently performing analytics over ordered sets of rows — these are known as the windowing and analytics functions. For example, lead and lag functions can be used to produce rolling averages or cumulative sums over a window of time (e.g. hourly averages for some metric over the preceeding… Continue reading Finding a sequence of events in Hive using analytic functions
Using HDFS path and filename as columns in a Hive table
A handy feature of Hadoop Hive is the ability to use the filename and path of underlying files as columns in a view or table using the virtual Hive column INPUT__FILE__NAME. This is particularly handy in the case of external tables where some metadata about files is embedded in the location on HDFS or the… Continue reading Using HDFS path and filename as columns in a Hive table
Creating a custom Hive input format and record reader to read fixed format flat files
Apache Hive is great for enabling SQL-like queryability over flat files. This is trivial in the case of tabular formatted files such as CSV files where we can set custom row and field delimiters out-of-the-box (e.g. Hive’s inbuilt CSV serde). Even more complex files can be read and converted to a desired row and column… Continue reading Creating a custom Hive input format and record reader to read fixed format flat files