Point-in-time Delta Lake table restore after S3 object deletion

Background

The Delta Lake format in Databricks provides a helpful way to restore table data using “time-travel” when a DML statement has removed or overwritten data.

The goal of a restore is to bring back table data to a consistent version.

Delta Lake time-travel

This allows accidental table operations to be reverted.
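
A time-travel query can target either a version number or a timestamp. A minimal sketch in a Databricks notebook cell (the table name and timestamp are illustrative; spark is the session object provided by the notebook):

%python
# Query the table as of a specific version or timestamp.
spark.sql("SELECT * FROM default.mytable VERSION AS OF 0").show()
spark.sql("SELECT * FROM default.mytable TIMESTAMP AS OF '2021-07-25 12:20:00'").show()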

Example

Original table – contains 7 distinct diamond colour types including color = “G”:

Original table

Then, an accidental deletion occurs:

Accidental SQL delete statement

The table is now missing some data:

Modified table

However, we can bring back the deleted data by checking the Delta Lake history and restoring to a version or timestamp prior to when the delete occurred – in this case version 0 of mytable:

Delta Lake table history
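
The history itself can also be inspected programmatically; a minimal sketch (table name illustrative):

%python
# Inspect the commit history to find a version or timestamp
# prior to the accidental delete.
history = spark.sql("DESCRIBE HISTORY default.mytable")
history.select("version", "timestamp", "operation").show(truncate=False)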

Restoring the original table based on a timestamp (after version 0, but prior to version 1):

%sql
DROP TABLE IF EXISTS mytable_deltarestore;

CREATE TABLE mytable_deltarestore
USING DELTA
LOCATION "s3a://<mybucket>/mytable_deltarestore"
AS SELECT * FROM default.mytable TIMESTAMP AS OF "2021-07-25 12:20:00"; 

Now, the original data is available in the restored table, thanks to Delta Lake time-travel:

Restored data – via time-travel
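
As an alternative to creating a copy of the table, newer Databricks Runtime versions also support rolling a table back in place. A minimal sketch, assuming a runtime where RESTORE is available:

%python
# Roll default.mytable back to version 0 in place (requires a
# Databricks Runtime / Delta Lake version that supports RESTORE).
spark.sql("RESTORE TABLE default.mytable TO VERSION AS OF 0")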

Challenge

What happens if table files (parquet data files or transaction log files) have been deleted in the underlying storage?

This might occur if a user or administrator accidentally deletes objects from S3 cloud storage.

Two types of files might be deleted manually:

Delta Lake data files

Symptom – table is missing data and can’t be queried:

%sql
SELECT * FROM mytable@v0;

(1) Spark Jobs
FileReadException: Error while reading file s3a://<mybucket>/mytable/part-00000-1932f078-53a0-4cbe-ac92-1b7c48f4900e-c000.snappy.parquet. A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table `DELETE` statement. For more information, see https://docs.microsoft.com/azure/databricks/delta/delta-intro#frequently-asked-questions
Caused by: FileNotFoundException: No such file or directory: s3a://<mybucket>/mytable/part-00000-1932f078-53a0-4cbe-ac92-1b7c48f4900e-c000.snappy.parquet

Delta Lake transaction logs

Symptom – table state is inconsistent and can’t be queried:

%sql
FSCK REPAIR TABLE mytable DRY RUN

Error in SQL statement: FileNotFoundException: s3a://<mybucket>/mytable/_delta_log/00000000000000000000.json: Unable to reconstruct state at version 1 as the transaction log has been truncated due to manual deletion or the log retention policy (delta.logRetentionDuration=30 days) and checkpoint retention policy (delta.checkpointRetentionDuration=2 days)

Solution

Versioning can be enabled for S3 buckets via the AWS management console:

S3 bucket configuration – Bucket Versioning enabled

This means that if current object versions are deleted after versioning is enabled, it may be possible to restore them from earlier versions.
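
Versioning can also be enabled programmatically; a sketch using boto3, where the bucket name is a placeholder:

%python
import boto3

# Enable bucket versioning - equivalent to the console setting above.
s3 = boto3.client("s3")
s3.put_bucket_versioning(
    Bucket="<mybucket>",
    VersioningConfiguration={"Status": "Enabled"},
)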

Databricks Delta Lake tables are stored on S3 under a given folder / prefix – e.g.:

s3a://<mybucket>/<mytable>

If this prefix can be restored to a point in time prior to the deletion, a non-corrupted version of the table can be recovered.

NB: Restoring means any data added after the restore point will be lost and would need to be reloaded from an upstream source. This also assumes that previous object versions are still available on S3.

The following steps can be used in Databricks to restore past S3 object versions to a new location and re-read the table at the restore point:

  1. Install the s3-pit-restore python library in a new Databricks notebook cell:
    %pip install s3-pit-restore
  2. Run the restore command with a timestamp prior to the deletion:
    %sh
    export AWS_ACCESS_KEY_ID="<access_key_id>"
    export AWS_SECRET_ACCESS_KEY="<secret_access_key>"
    export AWS_DEFAULT_REGION="<aws_region>"
    s3-pit-restore -b <mybucket> -B <mybucket> -p mytable/ -P mytable_s3restore -t "25-07-2021 23:26:00 +10"
  3. Create a new table pointing to the restore location:
    %sql
    CREATE TABLE mytable_s3restore
    USING DELTA
    LOCATION "s3a://<mybucket>/mytable_s3restore/mytable";
  4. Verify the table contents are again available and no longer corrupted – a sanity-check sketch follows this list.
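
For step 4, a quick sanity check might look like the following sketch. The expected values come from the example earlier in this post – the original table had 7 distinct colour values, including color = “G” (the column name is taken from that example):

%python
# Sanity-check the restored table against the original example data:
# 7 distinct colour values, including color = "G".
restored = spark.table("mytable_s3restore")
assert restored.select("color").distinct().count() == 7
assert restored.filter("color = 'G'").count() > 0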

Conclusion

Preventative techniques such as Table Access Control may be preferable for stopping Databricks users from deleting underlying S3 data. However, where table corruption has already occurred and S3 bucket versioning is enabled, a point-in-time restore may still be possible.


Realtime Solar PV charting

Solar PV inverters often ship with their own web-based monitoring solutions. However, some of these make it hard to view current generation or consumption due to refresh delays. Out-of-the-box monitoring is usually good for long-term time periods but lacks the granularity to see appliance consumption over the short term.

The challenge

Realtime monitoring of solar generation and net export helps to maximise self-consumption – for example, by coordinating appliances to make best use of solar PV.

Existing inverter monitoring does not show granular data over recent history – for example, it cannot show when a dishwasher has finished its heating cycle and whether another high-consumption appliance can be turned on.

Solution

This sample Android application allows realtime monitoring whilst charting consumption, generation and net export:

Solar Watch screenshot

The chart shows recent data over time and is configurable for SMA and Enphase inverters. In both cases, the inverter’s local interface is used to pull raw data (a polling sketch follows the list below):

  • SMA: https://<inverter_ip_address>/dyn/getDashValues.json
    • NB – Smart Inverter Screen must be enabled
  • Enphase: http://<envoy_ip_address>/production.json
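
As an illustration of pulling raw data from the local interface, the following Python sketch polls the Enphase endpoint above. The JSON field names (“production”, “consumption”, “wNow”) are assumptions based on typical Envoy firmware responses; the app itself implements this natively on Android:

import requests

# Poll the Envoy's local production.json endpoint and derive net export.
# Field names are assumptions based on typical Envoy firmware.
data = requests.get("http://<envoy_ip_address>/production.json", timeout=5).json()
production_w = data["production"][1]["wNow"]    # current generation (W)
consumption_w = data["consumption"][0]["wNow"]  # current consumption (W)
print(f"Generation: {production_w:.0f} W | "
      f"Consumption: {consumption_w:.0f} W | "
      f"Net export: {production_w - consumption_w:.0f} W")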

Code

https://github.com/niftimus/SolarWatch

Features

  • Interactive UI
  • Configurable settings

Limitations / areas for future improvement

  • Improve security handling of SSL – the current code imports a self-signed SMA inverter certificate and disables hostname verification so that SMA local data can be retrieved
  • Refine code and publish to an app store
  • Remove hard-coding for extraction of metrics
  • Better error handling
  • Add a data export function

Conclusion

This sample app is handy for monitoring consumption in realtime and enables informed decisions about when to start appliances.