The Delta Lake format in Databricks provides a way to restore table data using “time travel” when a DML statement has removed or overwritten data.
The goal of a restore is to bring back table data to a consistent version.
This allows accidental table operations to be reverted.
Original table – contains 7 distinct diamond colour types including color = “G”:
Then, an accidental deletion occurs:
The table is now missing some data:
However, we can bring back the deleted data by checking the Delta Lake history and restoring to a version or timestamp prior to when the delete occurred – in this case version 0 of mytable:
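The history check can be sketched in Python. This is a minimal sketch, assuming the rows returned by `DESCRIBE HISTORY mytable` have been collected into dictionaries with `version` and `operation` fields. The helper name `last_version_before` and the sample history rows are hypothetical, not part of Delta Lake or of the original notebook.

```python
from typing import Iterable, Mapping, Optional


def last_version_before(history: Iterable[Mapping], operation: str = "DELETE") -> Optional[int]:
    """Return the version committed just before the most recent `operation`.

    `history` mimics rows of DESCRIBE HISTORY: each row needs a `version`
    and an `operation` field. Returns None if no such operation is found,
    or if it was the very first commit (there is nothing earlier to restore).
    """
    matching = [row["version"] for row in history if row["operation"] == operation]
    if not matching:
        return None
    target = max(matching) - 1
    return target if target >= 0 else None


# Hypothetical history: version 0 created the table, version 1 deleted rows.
history = [
    {"version": 1, "operation": "DELETE"},
    {"version": 0, "operation": "CREATE TABLE AS SELECT"},
]

restore_version = last_version_before(history)
# Time-travel query that reads the table as it was before the delete.
query = f"SELECT * FROM mytable VERSION AS OF {restore_version}"
print(query)
```

If several deletes appear in the history, this picks the version before the latest one; choosing an earlier restore point is a judgment call that the sketch does not make.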
Restoring the original table based on a timestamp (after version 0, but prior to version 1):
%sql
DROP TABLE IF EXISTS mytable_deltarestore;

CREATE TABLE mytable_deltarestore
USING DELTA
LOCATION "s3a://<mybucket>/mytable_deltarestore"
AS SELECT * FROM default.mytable TIMESTAMP AS OF "2021-07-25 12:20:00";
Now, the original data is available in the restored table, thanks to Delta Lake time-travel:
What happens if table files (parquet data files or transaction log files) have been deleted in the underlying storage?
This might occur if a user or administrator accidentally deletes objects from S3 cloud storage.
Two types of files might get deleted manually.
Delta Lake data files
Symptom – table is missing data and can’t be queried:
%sql
SELECT * FROM mytable@v0;

FileReadException: Error while reading file s3a://<mybucket>/mytable/part-00000-1932f078-53a0-4cbe-ac92-1b7c48f4900e-c000.snappy.parquet. A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table `DELETE` statement. For more information, see https://docs.microsoft.com/azure/databricks/delta/delta-intro#frequently-asked-questions
Caused by: FileNotFoundException: No such file or directory: s3a://<mybucket>/mytable/part-00000-1932f078-53a0-4cbe-ac92-1b7c48f4900e-c000.snappy.parquet
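To see which data files the transaction log expects but storage no longer holds, the commit files can be replayed by hand. The following is a rough sketch, assuming the JSON commit files under `_delta_log` have already been read into a list of lines in commit order, and that a listing of the files present under the table prefix is available. It looks only at `add` and `remove` actions, ignores checkpoints, and covers the latest version only; the helper names and file names are hypothetical.

```python
import json
from typing import Iterable, List, Set


def active_files(commit_lines: Iterable[str]) -> Set[str]:
    """Replay Delta commit actions (one JSON object per line, in commit order)
    and return the relative paths of data files the table still references."""
    live: Set[str] = set()
    for line in commit_lines:
        line = line.strip()
        if not line:
            continue
        action = json.loads(line)
        if "add" in action:
            live.add(action["add"]["path"])
        elif "remove" in action:
            live.discard(action["remove"]["path"])
    return live


def missing_files(commit_lines: Iterable[str], existing_paths: Iterable[str]) -> List[str]:
    """Return referenced data files that are absent from `existing_paths`."""
    return sorted(active_files(commit_lines) - set(existing_paths))


# Hypothetical commit content: two files added, one of them later removed.
commits = [
    '{"add": {"path": "part-00000-aaa.snappy.parquet"}}',
    '{"add": {"path": "part-00001-bbb.snappy.parquet"}}',
    '{"remove": {"path": "part-00000-aaa.snappy.parquet"}}',
]

# Nothing is left in storage, so the one file still referenced is reported.
print(missing_files(commits, existing_paths=[]))
```

To check an older version such as `mytable@v0`, only the commits up to that version would be replayed.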
Delta Lake transaction logs
Symptom – table state is inconsistent and can’t be queried:
%sql
FSCK REPAIR TABLE mytable DRY RUN

Error in SQL statement: FileNotFoundException: s3a://<mybucket>/mytable/_delta_log/00000000000000000000.json: Unable to reconstruct state at version 1 as the transaction log has been truncated due to manual deletion or the log retention policy (delta.logRetentionDuration=30 days) and checkpoint retention policy (delta.checkpointRetentionDuration=2 days)
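A gap in the transaction log like the one reported above can also be spotted from a plain listing of object keys. This is a minimal sketch, assuming the keys under the table prefix have been listed already; `missing_commits` is a hypothetical helper. It relies only on the 20-digit, zero-padded commit file names visible in the error message, and it does not account for checkpoints, which can legitimately stand in for older commit files.

```python
import re
from typing import Iterable, List

# Commit files are named with a 20-digit, zero-padded version number.
_COMMIT_FILE = re.compile(r"_delta_log/(\d{20})\.json$")


def missing_commits(keys: Iterable[str]) -> List[int]:
    """Return commit versions absent between 0 and the highest version found.

    Only gaps below the newest surviving commit can be detected; if the most
    recent commits were deleted, this listing alone cannot reveal that.
    """
    versions = set()
    for key in keys:
        match = _COMMIT_FILE.search(key)
        if match:
            versions.add(int(match.group(1)))
    if not versions:
        return []
    return [v for v in range(max(versions)) if v not in versions]


# Mirrors the symptom above: the version 1 commit exists, version 0 was deleted.
keys = [
    "mytable/_delta_log/00000000000000000001.json",
    "mytable/part-00000-aaa.snappy.parquet",
]
print(missing_commits(keys))
```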
Versioning can be enabled for S3 buckets via the AWS management console:
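Versioning can also be switched on programmatically. The sketch below assumes `s3_client` is a boto3 S3 client (for example one created with `boto3.client("s3")`) whose credentials are allowed to change bucket versioning; the helper name `enable_versioning` is hypothetical. It uses the client's `put_bucket_versioning` and `get_bucket_versioning` calls.

```python
from typing import Optional


def enable_versioning(s3_client, bucket: str) -> Optional[str]:
    """Turn on versioning for `bucket` and return the status S3 reports back.

    `s3_client` is expected to behave like a boto3 S3 client.
    """
    s3_client.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )
    # The response carries no "Status" key for buckets that have never had
    # versioning configured, so .get() returns None in that case.
    return s3_client.get_bucket_versioning(Bucket=bucket).get("Status")
```

In a notebook this might be called as `enable_versioning(boto3.client("s3"), "<mybucket>")`. Passing the client in keeps the helper easy to exercise with a stub.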
Once versioning is enabled, S3 keeps the previous versions of objects that are deleted afterwards, so it may be possible to restore them.
Databricks Delta Lake tables are stored on S3 under a given folder / prefix – e.g.:
If this prefix can be restored to a “point in time”, this can be used to restore a non-corrupted version of a table – for example:
NB: The restored copy reflects the prefix as it was at the restore timestamp, so any data written after that point will be missing and would need to be reloaded from an upstream source. This also assumes that the previous object versions are still available on S3.
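The idea behind a point-in-time restore of a prefix can be sketched as picking, for every key, the object version that was current at the chosen timestamp. This is an illustration of the concept rather than the implementation of any particular tool. It assumes the input lists have the shape of the `Versions` and `DeleteMarkers` entries returned by S3's ListObjectVersions (`Key`, `VersionId`, `LastModified`); the helper name `versions_at` and the sample data are hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, Mapping


def versions_at(
    versions: Iterable[Mapping],
    delete_markers: Iterable[Mapping],
    cutoff: datetime,
) -> Dict[str, str]:
    """Map each key to the VersionId that was current at `cutoff`.

    Keys whose newest entry at `cutoff` is a delete marker are left out,
    as are keys first written after `cutoff`.
    """
    # key -> (LastModified, VersionId), with None standing for a delete marker
    newest = {}
    entries = [(v, v["VersionId"]) for v in versions]
    entries += [(d, None) for d in delete_markers]
    for entry, version_id in entries:
        if entry["LastModified"] > cutoff:
            continue  # written after the restore point
        seen = newest.get(entry["Key"])
        if seen is None or entry["LastModified"] > seen[0]:
            newest[entry["Key"]] = (entry["LastModified"], version_id)
    return {key: vid for key, (_, vid) in newest.items() if vid is not None}


def at_hour(hour: int) -> datetime:
    """Build a UTC timestamp on the example day."""
    return datetime(2021, 7, 25, hour, 0, tzinfo=timezone.utc)


# Hypothetical listing: one file written at 10:00 and deleted at 12:00,
# another file written at 14:00.
versions = [
    {"Key": "mytable/part-0.parquet", "VersionId": "v1", "LastModified": at_hour(10)},
    {"Key": "mytable/part-1.parquet", "VersionId": "v2", "LastModified": at_hour(14)},
]
delete_markers = [
    {"Key": "mytable/part-0.parquet", "VersionId": "d1", "LastModified": at_hour(12)},
]

# Restoring to 11:00 recovers the file that was deleted an hour later.
print(versions_at(versions, delete_markers, at_hour(11)))
```

Entries with identical timestamps for the same key are not disambiguated here; a real tool would need a tie-breaking rule.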
The following steps can be used in Databricks to restore past S3 object versions to a new location and re-read the table at the restore point:
- Install the s3-pit-restore Python library in a new Databricks notebook cell:
%pip install s3-pit-restore
- Run the restore command with a timestamp prior to the deletion:
s3-pit-restore -b <mybucket> -B <mybucket> -p mytable/ -P mytable_s3restore -t "25-07-2021 23:26:00 +10"
- Create a new table pointing to the restore location:
CREATE TABLE mytable_s3restore
- Verify the table contents are again available and no longer corrupted:
Other techniques, such as Table Access Control, may be preferable for preventing Databricks users from deleting underlying S3 data. Where table corruption has already occurred and S3 bucket versioning is enabled, however, a point-in-time restore may be possible.
- Delta Lake time travel in Databricks – https://databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html
- Delta Lake transaction logs – https://databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html
- Table access control – https://docs.databricks.com/security/access-control/table-acls/index.html#table-access-control
- PIT restore Python library – https://github.com/angeloc/s3-pit-restore