What is Delta Lake in Databricks Lakehouse Platform


WHAT IS DELTA LAKE?


Data lakes have enormous potential, but consistency and reliability issues sometimes limit their true value. 


Meet Delta Lake, an open-source storage framework that stores data in Parquet files and records changes in a JSON-based transaction log, bringing ACID transactions, scalable metadata management, and unified batch and streaming processing to data lakes.


Delta Lake integrates seamlessly with the Data Lakehouse architecture, acting as a foundational component of the Databricks Lakehouse Platform.


Delta Lake bridges the gap between data lakes and data warehouses, enabling reliable, performant, and flexible data management for a wide range of analytical requirements.


Let's look at Delta Lake, a fundamental component of the Databricks Lakehouse Platform that provides reliability and structure to your data lake. If you're interested, read the Data Lakehouse article first, then come back.




KEY FEATURES OF DELTA LAKE:


ACID Transactions

Imagine never seeing inconsistent data again! Delta Lake ensures data integrity through atomicity, consistency, isolation, and durability (ACID). This means that every update either completes fully or not at all, leaves the table in a consistent state, runs in isolation from concurrent operations, and survives failures once committed.
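
Here is a minimal sketch of what writing to a Delta table looks like in PySpark, assuming a Delta-enabled cluster such as the Databricks Runtime; the table path and sample data are illustrative, not from this article.

from pyspark.sql import SparkSession

# A minimal sketch, assuming Delta Lake is available on the cluster.
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Each write is a single atomic commit: readers see all of it or none of it,
# even if the job fails partway through.
df.write.format("delta").mode("append").save("/tmp/delta/customers")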


Scalable Metadata Handling

Say goodbye to metadata bottlenecks. Delta Lake handles petabyte-scale tables with billions of files, tracking them through the transaction log so that operations stay smooth even on very large datasets.


Schema Enforcement and Evolution

Data schemas can evolve, but changes shouldn't break your pipelines. Delta Lake enforces your chosen schema to prevent incorrect data from entering your system. 


Furthermore, it supports controlled schema evolution, allowing you to adapt to new requirements without jeopardizing data integrity.
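
A rough sketch of both behaviours, reusing the illustrative /tmp/delta/customers table from the earlier snippet:

# A DataFrame with a new 'tier' column that the table does not have yet.
new_df = spark.createDataFrame([(3, "carol", "gold")], ["id", "name", "tier"])

# Schema enforcement: this append would be rejected with a schema-mismatch error.
# new_df.write.format("delta").mode("append").save("/tmp/delta/customers")

# Schema evolution: explicitly allow the new column to be merged into the table schema.
new_df.write.format("delta").mode("append").option("mergeSchema", "true").save("/tmp/delta/customers")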


Time Travel

Need to explore previous versions of your data? Delta Lake's time travel capability gives you access to historical snapshots, letting you roll back data, audit changes, and run historical analyses. Imagine easily reverting to a specific version in time if needed!
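
In PySpark that can look like the sketch below; the version number and timestamp are placeholders that depend on your table's actual history.

# Read the table as of an earlier version number...
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/customers")

# ...or as of a point in time.
snapshot = spark.read.format("delta").option("timestampAsOf", "2024-01-01 00:00:00").load("/tmp/delta/customers")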


Upserts and Deletes

Use Delta Lake's efficient merge and delete capabilities to streamline your data updates. This feature enables Change Data Capture (CDC) operations, which process only the altered data, resulting in improved performance and efficiency.
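
Here is a hedged sketch of an upsert with the Delta Lake Python API; the table path, join key, and column names are all illustrative.

from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/tmp/delta/customers")
updates = spark.createDataFrame([(2, "bob", "silver")], ["id", "name", "tier"])

# MERGE applies the CDC-style changes: update matched rows, insert new ones.
(target.alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Deletes are just as direct.
target.delete("tier = 'inactive'")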


Open Format and Compression

Harness the power of Parquet! Delta Lake stores data in an efficient columnar format, which allows for considerable compression and quicker data access. This open format enables vendor neutrality and smooth interaction across several tools and ecosystems.


Unified Streaming and Batch Processing

Delta Lake unifies streaming and batch processing for both real-time and historical data. The same Delta table can serve as a streaming source or sink and as a batch source, letting you analyze data in real time while retaining the full historical record.
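
As a sketch, the same illustrative Delta tables can be read as a stream and queried in batch; the paths and checkpoint location below are made up for the example.

# Continuously copy new rows from one Delta table into another.
query = (spark.readStream.format("delta").load("/tmp/delta/events")
    .writeStream.format("delta")
    .option("checkpointLocation", "/tmp/delta/_checkpoints/events_copy")
    .start("/tmp/delta/events_copy"))

# At the same time, batch queries can read either table directly.
spark.read.format("delta").load("/tmp/delta/events_copy").count()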




DELTA LAKE ARCHITECTURE


Delta Lake ships as a default component of the Databricks Runtime and runs with the Databricks cluster, where it handles the creation of Delta files and Delta tables.




The Delta Lake framework and the Delta files/tables are stored independently: the framework is part of the cluster and manages everything from a central location, while the files and tables themselves live in the storage layer, which can be an existing data lake.


Delta data files are stored in Parquet, an already optimized and compressed format for data storage. Alongside the data files, Delta Lake generates JSON files known as Delta logs. Delta logs are essentially transaction logs recorded in JSON format; they allow Delta Lake to keep track of creates, inserts, updates, and deletes.




TRANSACTION LOG (DELTA LOG) IN DELTA LAKE


Let's look at the transaction log (Delta log) in detail. It maintains an ordered record of every transaction performed on the table and serves as the single source of truth for the table's current state.


Each JSON commit file records the operation that was executed, the predicates used, and the data files that were affected (added or removed).
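
You can peek at a commit file yourself; the sketch below reads the first log entry of the illustrative table used earlier.

# The _delta_log folder sits inside the table directory; log file names are zero-padded versions.
commit = spark.read.json("/tmp/delta/customers/_delta_log/00000000000000000000.json")

# Each row is one action from that commit: commitInfo, add, remove, metaData, protocol.
commit.show(truncate=False)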


Let's look at a few instances of how delta files and delta logs function in Delta Lake.


Reads and Writes

Assume the initial write creates two Parquet files, File 1.parquet and File 2.parquet. For the same write, a Delta log file, 000.json, is created to record the write transaction.


The next time the table is read, Databricks first consults the Delta log to find out which files in the storage layer are current. At this point, Files 1 and 2 will be read.
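
A rough way to reproduce this scenario on an illustrative table path (repartition(2) is only there to force exactly two data files):

# Initial write: two Parquet data files plus _delta_log/00000000000000000000.json.
df = spark.range(0, 1000)
df.repartition(2).write.format("delta").save("/tmp/delta/demo")

# A read consults the log first, then loads only the files it lists (Files 1 and 2 here).
spark.read.format("delta").load("/tmp/delta/demo").count()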




Updates

Let's imagine the user updates some data in File 1. In this case, a new file, File 3.parquet, is produced by copying the unchanged data from File 1 together with the updated data. A new Delta log file, 001.json, records the current data files.


If the user now reads the table, the data is retrieved from Files 2 and 3. File 1 is still available in storage but won't be read, because the latest Delta log, 001.json, has recorded the update transaction and marked File 1 as removed; it is kept only for history.
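
Continuing the illustrative /tmp/delta/demo table, an update like the one described might look as follows; the predicate and expression are made up for the example.

from delta.tables import DeltaTable

tbl = DeltaTable.forPath(spark, "/tmp/delta/demo")

# Rewrites the affected rows into a new Parquet file and commits the next log entry,
# which marks the old file as removed while keeping it on storage for history.
tbl.update(condition="id < 10", set={"id": "id + 1000"})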



Because this history is preserved in the storage layer, you can inspect earlier states or go back to historical versions of the data; this is the time travel capability of Delta Lake.
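
In practice, time travel on this illustrative table could look like the sketch below; restoreToVersion is available in recent Delta Lake releases, and the version number is a placeholder.

from delta.tables import DeltaTable

# Read an older snapshot without changing the table...
old = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/demo")

# ...or roll the table itself back to that version (the rollback is recorded as a new commit).
DeltaTable.forPath(spark, "/tmp/delta/demo").restoreToVersion(0)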



Error in Writes

The user attempts to write a new data file, File 5, but the job ends in an error. File 5.parquet may be left incompletely written in the storage layer, yet the transaction log contains no entry for it, because a log entry is written only after the operation completes successfully.



Now, when the user reads the data, Databricks consults the most recent transaction log, which has no information about File 5, so only the correct, up-to-date files are read.
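
A small sketch of the same idea: here the failure is triggered by a schema mismatch rather than a crashed job, but the effect for readers is identical, since nothing uncommitted is ever visible.

before = spark.read.format("delta").load("/tmp/delta/demo").count()

try:
    # This append is rejected (extra column, no mergeSchema), so nothing is committed.
    bad = spark.createDataFrame([(1, "oops")], ["id", "unexpected_column"])
    bad.write.format("delta").mode("append").save("/tmp/delta/demo")
except Exception as err:
    print("write failed:", type(err).__name__)

after = spark.read.format("delta").load("/tmp/delta/demo").count()
assert before == after  # readers never see the failed write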


 

I hope the blog is informative. Stay tuned for more interesting Delta Lake articles. Happy learning!!

