Point-in-time Delta Lake table restore after S3 object deletion

Background

The Delta Lake format in Databricks provides a helpful way to restore table data using “time-travel” in case a DML statement removed or overwrote some data.

The goal of a restore is to bring back table data to a consistent version.

Delta Lake time-travel

Time-travel allows accidental table operations to be reverted.

Example

The original table contains 7 distinct diamond color types, including color = “G”:

Original table

Then, an accidental deletion occurs:

Accidental SQL delete statement

The table is now missing some data:

Modified table

However, we can bring back the deleted data by checking the Delta Lake history and restoring to a version or timestamp prior to when the delete occurred – in this case version 0 of mytable:

Delta Lake table history

Restoring the original table based on a timestamp (after version 0, but prior to version 1):

%sql
DROP TABLE IF EXISTS mytable_deltarestore;

CREATE TABLE mytable_deltarestore
USING DELTA
LOCATION "s3a://<mybucket>/mytable_deltarestore"
AS SELECT * FROM default.mytable TIMESTAMP AS OF "2021-07-25 12:20:00"; 

Now, the original data is available in the restored table, thanks to Delta Lake time-travel:

Restored data via time-travel
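To make the choice of restore point concrete, here is a minimal Python sketch of the selection rule that a timestamp-based restore relies on: take the latest version committed at or before the requested timestamp. The function name `version_as_of` and the commit times below are hypothetical, chosen only to match the example (version 0 before 12:20, version 1 after); in practice the pairs would come from the `version` and `timestamp` columns of the table history.

```python
from datetime import datetime


def version_as_of(history, as_of):
    """Return the latest table version committed at or before `as_of`.

    `history` is a list of (version, commit_timestamp) pairs.
    """
    candidates = [version for version, committed in history if committed <= as_of]
    if not candidates:
        raise ValueError("no table version exists at or before %s" % as_of)
    return max(candidates)


# Hypothetical commit times, for illustration only.
history = [
    (0, datetime(2021, 7, 25, 12, 15, 0)),  # table created
    (1, datetime(2021, 7, 25, 12, 25, 0)),  # accidental DELETE
]

restore_version = version_as_of(history, datetime(2021, 7, 25, 12, 20, 0))
print(restore_version)
```

With these assumed commit times, the 12:20 timestamp resolves to version 0, the version before the delete.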

Challenge

What happens if table files (Parquet data files or transaction log files) have been deleted in the underlying storage?

This might occur if a user or administrator accidentally deletes objects from S3 cloud storage.

Two types of files might get deleted manually.

Delta Lake data files

Symptom – table is missing data and can’t be queried:

%sql
SELECT * FROM mytable@v0;

FileReadException: Error while reading file s3a://<mybucket>/mytable/part-00000-1932f078-53a0-4cbe-ac92-1b7c48f4900e-c000.snappy.parquet. A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table `DELETE` statement. For more information, see https://docs.microsoft.com/azure/databricks/delta/delta-intro#frequently-asked-questions
Caused by: FileNotFoundException: No such file or directory: s3a://<mybucket>/mytable/part-00000-1932f078-53a0-4cbe-ac92-1b7c48f4900e-c000.snappy.parquet
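To see why a single missing Parquet file breaks a table version, it can help to replay the transaction log by hand. The sketch below assumes the JSON commit files have already been read into strings (oldest first) and that no checkpoint files are involved; it relies only on the `add` and `remove` actions of the Delta log format. The helper names `live_files` and `missing_files`, and the file names in the example, are hypothetical.

```python
import json


def live_files(commits):
    """Replay commits (oldest first) and return the data file paths
    referenced by the resulting table version."""
    live = set()
    for commit_text in commits:
        for line in commit_text.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
            # Other actions (commitInfo, metaData, protocol) are ignored here.
    return live


def missing_files(commits, present):
    """Return referenced data files that are not in the `present` listing."""
    return sorted(live_files(commits) - set(present))


# Hypothetical log contents: version 0 adds one file, version 1 (the DELETE)
# replaces it with a rewritten file.
commits = [
    json.dumps({"add": {"path": "part-00000-aaaa.snappy.parquet"}}),
    "\n".join([
        json.dumps({"commitInfo": {"operation": "DELETE"}}),
        json.dumps({"remove": {"path": "part-00000-aaaa.snappy.parquet"}}),
        json.dumps({"add": {"path": "part-00000-bbbb.snappy.parquet"}}),
    ]),
]
present = ["part-00000-bbbb.snappy.parquet"]  # the older file was deleted from S3

print(missing_files(commits[:1], present))  # files version 0 needs but cannot find
print(missing_files(commits, present))      # files the latest version cannot find
```

In this made-up case the latest version is still readable, but version 0 is not, which is the same situation as the failing `mytable@v0` query above.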

Delta Lake transaction logs

Symptom – table state is inconsistent and can’t be queried:

%sql
FSCK REPAIR TABLE mytable DRY RUN

Error in SQL statement: FileNotFoundException: s3a://<mybucket>/mytable/_delta_log/00000000000000000000.json: Unable to reconstruct state at version 1 as the transaction log has been truncated due to manual deletion or the log retention policy (delta.logRetentionDuration=30 days) and checkpoint retention policy (delta.checkpointRetentionDuration=2 days)
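A quick way to spot this kind of damage is to look for gaps in the commit file sequence under `_delta_log`. The sketch below takes a plain list of file names and reports which commit versions are absent. It assumes no checkpoint covers the gap (older commit files can legitimately be cleaned up once a checkpoint exists), and the function name `missing_log_versions` is hypothetical.

```python
import re

# Delta commit files are named with a 20-digit, zero-padded version number.
LOG_FILE = re.compile(r"^(\d{20})\.json$")


def missing_log_versions(file_names):
    """Return commit versions missing between 0 and the highest version found."""
    versions = set()
    for name in file_names:
        match = LOG_FILE.match(name)
        if match:
            versions.add(int(match.group(1)))
    if not versions:
        return []
    return [v for v in range(max(versions) + 1) if v not in versions]


# Listing that mirrors the error above: the version 0 commit file is gone.
listing = ["00000000000000000001.json", "00000000000000000001.crc"]
print(missing_log_versions(listing))
```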

Solution

Versioning can be enabled for S3 buckets via the AWS Management Console:

S3 bucket configuration – Bucket Versioning enabled

Once versioning is enabled, deleting an object only adds a delete marker on top of its earlier versions, so objects deleted after that point may still be restorable.
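Versioning can also be switched on from code rather than the console. The following is a minimal sketch using the boto3 S3 client calls `put_bucket_versioning` and `get_bucket_versioning`; the wrapper name `enable_versioning` is hypothetical, and the client is passed in so that credentials and region are configured however suits your environment.

```python
def enable_versioning(s3_client, bucket):
    """Enable versioning on `bucket` and return the status S3 reports back."""
    s3_client.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )
    # The response has no "Status" key if versioning was never configured.
    return s3_client.get_bucket_versioning(Bucket=bucket).get("Status")


# Usage in a notebook (requires boto3 and suitable AWS credentials):
#   import boto3
#   print(enable_versioning(boto3.client("s3"), "<mybucket>"))
```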

Databricks Delta Lake tables are stored on S3 under a given folder / prefix – e.g.:

s3a://<mybucket>/<mytable>

If this prefix can be restored to a “point in time” before the deletion, a non-corrupted version of the table can be recovered.

NB: Any data added to the table after the restore point will be lost and will need to be reloaded from an upstream source. This also assumes that the previous object versions are still available on S3.
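The idea behind a point-in-time restore of a versioned prefix can be sketched in a few lines of Python. For each key, pick the newest version written at or before the restore timestamp, and skip the key if that version is a delete marker. This is an illustration of the concept only, not the actual implementation of any restore tool; the function name `objects_at`, the dictionary fields, and the sample keys and version IDs are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def objects_at(versions, point_in_time):
    """Return {key: version_id} for objects that existed at `point_in_time`."""
    latest = {}
    for version in versions:
        if version["last_modified"] > point_in_time:
            continue  # written after the restore point, so not part of it
        current = latest.get(version["key"])
        if current is None or version["last_modified"] > current["last_modified"]:
            latest[version["key"]] = version
    # A key whose newest qualifying version is a delete marker did not exist then.
    return {
        key: version["version_id"]
        for key, version in latest.items()
        if not version["is_delete_marker"]
    }


tz = timezone(timedelta(hours=10))
versions = [  # hypothetical object versions under the table prefix
    {"key": "mytable/part-0.parquet", "version_id": "v1",
     "last_modified": datetime(2021, 7, 25, 22, 0, tzinfo=tz), "is_delete_marker": False},
    {"key": "mytable/part-0.parquet", "version_id": "v2",
     "last_modified": datetime(2021, 7, 25, 23, 30, tzinfo=tz), "is_delete_marker": True},
    {"key": "mytable/part-1.parquet", "version_id": "v3",
     "last_modified": datetime(2021, 7, 25, 23, 40, tzinfo=tz), "is_delete_marker": False},
]

restored = objects_at(versions, datetime(2021, 7, 25, 23, 26, tzinfo=tz))
print(restored)
```

In this example the deleted file comes back at version `v1`, while the file written after the restore point is not included, which is why later data has to be reloaded.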

The following steps can be used in Databricks to restore past S3 object versions to a new location and re-read the table at the restore point:

  1. Install the s3-pit-restore Python library in a new Databricks notebook cell:
    %pip install s3-pit-restore
  2. Run the restore command with a timestamp prior to the deletion:
    %sh
    export AWS_ACCESS_KEY_ID="<access_key_id>"
    export AWS_SECRET_ACCESS_KEY="<secret_access_key>"
    export AWS_DEFAULT_REGION="<aws_region>"
    # -b/-B: source and destination bucket; -p: source prefix; -P: destination prefix; -t: restore timestamp
    s3-pit-restore -b <mybucket> -B <mybucket> -p mytable/ -P mytable_s3restore -t "25-07-2021 23:26:00 +10"
  3. Create a new table pointing to the restore location (the restored objects sit under the destination prefix followed by the original prefix):
    %sql
    CREATE TABLE mytable_s3restore
    USING DELTA
    LOCATION "s3a://<mybucket>/mytable_s3restore/mytable";
  4. Verify that the table contents are available again and that the table is no longer corrupted.

Conclusion

Preventive controls such as Table Access Control are preferable for stopping Databricks users from deleting underlying S3 data. However, where table corruption has already occurred and S3 bucket versioning is enabled, a point-in-time restore may be possible.
