Finding a sequence of events in Hive using analytic functions

Hadoop Hive features several useful functions for efficiently performing analytics over ordered sets of rows — these are known as the windowing and analytics functions.  For example, lead and lag functions can be used to produce rolling averages or cumulative sums over a window of time (e.g. hourly averages for some metric over the preceeding rolling 24 hours).

Another useful feature is the ability to introduce ordinality or sequence into SQL data where there is no strict or predictable sequence field.  This can help us search for chains of events over time.

Example

Imagine a supermarket tracking customer purchases. The following query can be used to find customers who have purchased an Apple in one transaction and in their immediate next transaction, an Orange (assuming transaction_id is a field which increases over time, such as a receipt number):

select x.customer_id from
(
    SELECT customer_id,
    product_name,
    row_number() OVER (
        PARTITION BY customer_id ORDER BY transaction_id
    ) as rn
    FROM default.tbl_product_sales
) x
join
(
    SELECT customer_id,
    product_name,
    row_number() OVER (
        PARTITION BY customer_id ORDER BY transaction_id
    ) as rn
    FROM default.tbl_product_sales
) y
on x.customer_id=y.customer_id
where y.rn=x.rn+1
and x.product_name='Apple'
and y.product_name='Orange';

So, even though transaction_id may not be contiguous or predictable (i.e. a single customer might have consecutive transactions with numbers 1234, 1255, 1257, etc.), we can still use PARTITION BY and ORDER BY to assign a new row number field which is contiguous – whereby each each customer will have their transactions grouped and ordered.  In the above query, rn and rn+1 represent any transaction for a given customer and the transaction immediately afterwards.

 

Advertisement