
Ingesting Data to Bronze using Auto Loader

Document Version: 1.0
Last Updated: 02-02-2026
Maintained By: Fauzaan



This guide explains how to use Databricks Auto Loader (cloudFiles) to ingest raw files from Azure Data Lake Storage (ADLS) into our Bronze Delta tables. Auto Loader processes only new files by maintaining a checkpoint log, which makes the pipeline idempotent.

Auto Loader is our standard for ingestion because it automatically handles:

  • New File Discovery: It finds new files as they arrive.
  • Schema Drift: It can adapt to changing columns or capture bad data.
  • State Management: It remembers exactly which files have been processed, so we never duplicate data.

1. Architecture & Best Practices

Before writing code, it is critical to understand where files should live. We use Unity Catalog Volumes to manage access securely.

The "Two Volume" Rule

To prevent errors and infinite loops, we must strictly separate our Source Data from our System Metadata (Checkpoints).

Critical: Checkpoint Location

NEVER place your checkpoint directory inside the same folder as your source data. If you do, Auto Loader might try to ingest its own checkpoint files, causing an infinite loop or permission errors.

Recommended Structure: We keep the checkpoint volumes in a separate schema called autoloader, and the volumes linked to source data in the bronze schema.

  • Source Volume: Contains the raw .csv, .json, or .unl files.
  • System Volume: Contains the checkpoints (state files) for the pipeline.

A Separate Volume for each Data Feed

In our architecture, we enforce a strict 1-to-1 relationship between a data feed and its storage volumes. This means every single data feed (e.g., bc_customer, bc_subscriber) gets its own dedicated Source Volume and Checkpoint Volume.

We do this for three critical reasons:

  1. Isolation: Dumping all source files into one directory would mean Auto Loader picks up files from different feeds, each with a different schema, and tries to write them all to a single table. Each data feed therefore gets its own source directory by design.
  2. Performance: Auto Loader lists files to detect new data. If we dumped all files into one giant volume, Auto Loader would waste time scanning millions of irrelevant files just to find the few new ones for your specific table. Keeping them separate ensures high performance.
  3. Safety: By physically separating the Source Volume (Input) from the Checkpoint Volume (State), we completely eliminate the risk of the "Infinite Loop" error, where Auto Loader accidentally tries to ingest its own checkpoint files.
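The safety rule above can also be enforced in code before a stream starts. A minimal sketch (the helper name is ours, not a shared utility; paths follow the volume layout described in this guide):

```python
import os

def assert_paths_separated(source_path: str, checkpoint_path: str) -> None:
    """Raise if either directory sits inside the other, which would let
    Auto Loader ingest its own checkpoint (state) files."""
    src = os.path.normpath(source_path)
    chk = os.path.normpath(checkpoint_path)
    if os.path.commonpath([src, chk]) in (src, chk):
        raise ValueError(
            f"Checkpoint path {chk!r} overlaps source path {src!r}; "
            "keep checkpoints in a separate volume."
        )

# Separate volumes (our standard layout) pass the check:
assert_paths_separated(
    "/Volumes/deal_dev/bronze/br_cbs_bc_subscriber",
    "/Volumes/deal_dev/autoloader/bc_subscriber",
)
```

Calling this at the top of an ingestion notebook turns a silent misconfiguration into an immediate, readable failure.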

2. Naming Conventions

To keep our Unity Catalog organized, we use a strict naming pattern that links the Source Volume, the Target Table, and the Checkpoint Volume.

1. Source Volume & Target Table

We use the same name for both the Source Volume and the Target Table. This makes it obvious which volume feeds which table.

  • Format: br_<source_system>_<data_feed_name>
  • Example: br_cbs_bc_subscriber

Why this format?
  • br_ indicates this is a Bronze layer asset.
  • cbs indicates the Source System (e.g., Core Banking System).
  • bc_subscriber is the specific Data Feed.

2. Checkpoint Volume

The checkpoint volume resides in a separate schema called autoloader and is named after the feed itself.

  • Format: <data_feed_name>
  • Example: bc_subscriber
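The convention above can be captured in a small helper so every notebook derives all three asset names from the same pair of inputs. A sketch (the function name is illustrative, not part of a shared library):

```python
def bronze_asset_names(source_system: str, feed: str) -> dict:
    """Derive the three asset names for one data feed, per our convention:
    the Source Volume and Target Table share a name in the bronze schema,
    and the Checkpoint Volume in the autoloader schema is named after the feed."""
    bronze_name = f"br_{source_system}_{feed}"
    return {
        "source_volume": f"deal_dev.bronze.{bronze_name}",
        "target_table": f"deal_dev.bronze.{bronze_name}",
        "checkpoint_volume": f"deal_dev.autoloader.{feed}",
    }

names = bronze_asset_names("cbs", "bc_subscriber")
```

Deriving names from one place keeps the Source Volume, Target Table, and Checkpoint Volume from drifting apart as feeds are added.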

3. Final Repository Structure

We organize our assets into two distinct schemas in Unity Catalog to separate "Business Data" from "System Infrastructure".

  • bronze (Schema, in the deal_dev catalog): Contains the actual data volumes and tables used by developers and analysts.
  • autoloader (Schema, in the deal_dev catalog): Contains the system state files (checkpoints) that users generally do not need to see.

deal_dev (Catalog)
├── bronze (Schema)
│   ├── Volumes
│   │   ├── br_cbs_bc_subscriber   <-- Source Volume (raw .unl files)
│   │   ├── br_cbs_invoices
│   │   └── br_crm_customers
│   └── Tables
│       ├── br_cbs_bc_subscriber   <-- Target Table (clean Delta table)
│       ├── br_cbs_invoices
│       └── br_crm_customers
└── autoloader (Schema)
    └── Volumes
        ├── bc_subscriber   <-- Checkpoint Volume (system state)
        ├── invoices
        └── customers

4. Setup (One-Time SQL)

Run this SQL once to define the relationship between each volume and its storage location in Azure.

Naming Convention

Replace bc_subscriber with your feed name.

-- Create the Source Volume (connected to where raw files land)
CREATE EXTERNAL VOLUME deal_dev.bronze.br_cbs_bc_subscriber
LOCATION 'abfss://source-files@dhidatalakestoreucdev.dfs.core.windows.net/BC_SUBSCRIBER';

-- Create the System Volume (where we store checkpoints)
CREATE EXTERNAL VOLUME deal_dev.autoloader.bc_subscriber
LOCATION 'abfss://autoloader-state@dhidatalakestoreucdev.dfs.core.windows.net/BC_SUBSCRIBER';
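To avoid hand-editing these statements for every new feed, they can be rendered from the feed name. A minimal sketch (the helper, and the assumption that raw files land in a folder named after the feed in upper case, are ours, based on the example above):

```python
def setup_sql(source_system: str, feed: str, storage_account: str) -> str:
    """Render the two one-time CREATE EXTERNAL VOLUME statements for a feed.
    Assumes raw files land under a folder named after the feed (upper-cased),
    mirroring the example above."""
    bronze_name = f"br_{source_system}_{feed}"
    landing_dir = feed.upper()
    return (
        f"-- Source Volume\n"
        f"CREATE EXTERNAL VOLUME deal_dev.bronze.{bronze_name}\n"
        f"LOCATION 'abfss://source-files@{storage_account}"
        f".dfs.core.windows.net/{landing_dir}';\n"
        f"-- Checkpoint Volume\n"
        f"CREATE EXTERNAL VOLUME deal_dev.autoloader.{feed}\n"
        f"LOCATION 'abfss://autoloader-state@{storage_account}"
        f".dfs.core.windows.net/{landing_dir}';\n"
    )

sql = setup_sql("cbs", "bc_subscriber", "dhidatalakestoreucdev")
```

Paste the rendered SQL into a notebook cell and review it before running; the generator only removes the copy-and-edit step.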


5. The Ingestion Code (PySpark)

Use the following template for your ingestion notebook. This example is configured for pipe-delimited (.unl) files without headers, which is how files will arrive from the new CBS.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType

# Config variables for volumes and target table
source_path = "/Volumes/deal_dev/bronze/br_cbs_bc_subscriber/"
checkpoint_path = "/Volumes/deal_dev/autoloader/bc_subscriber/"
target_table_name = "deal_dev.bronze.br_cbs_bc_subscriber"


# File header definition
# Since files have no header, you must define columns in the exact order they appear.
custom_schema = StructType([
    StructField("customer_id", StringType(), True),
    StructField("third_party_customer_key", StringType(), True),
    StructField("customer_type", StringType(), True),
    StructField("customer_segment", StringType(), True),
    # ... Add all your columns here ...
])

Auto Loader Logic

# Read Stream
df_stream = (spark.readStream
    .format("cloudFiles")

    # Use the CSV engine (because structurally it is a delimited text file)
    .option("cloudFiles.format", "csv")

    # Tell Auto Loader to look for .unl files specifically.
    # Without this, it will ignore your files because they don't end in .csv.
    .option("pathGlobFilter", "*.unl")

    # Define the custom delimiter (pipe)
    .option("sep", "|")

    # Files don't contain a header row
    .option("header", "false")

    # Match the data's date format (yyyyMMddHHmmss)
    # .option("timestampFormat", "yyyyMMddHHmmss")

    # Treat empty strings between pipes || as NULL, not ""
    .option("nullValue", "")

    .schema(custom_schema)
    .option("cloudFiles.rescuedDataColumn", "_rescued_data")
    .load(source_path)
)

# Write Stream
(df_stream.writeStream
    .format("delta")

    # State management
    .option("checkpointLocation", checkpoint_path)

    # Trigger strategy: process all available data once, then stop.
    # This turns the streaming pipeline into a batch run.
    .trigger(availableNow=True)

    .toTable(target_table_name)
)

Bonus: Event-Driven Automation (No More Schedules!)

You don't have to run this pipeline on a fixed schedule (e.g., every hour). You can make it Event-Driven to save costs and reduce latency.

How to do it: In Databricks Workflows, set the Trigger type to File Arrival.

  1. Grant the EventGrid EventSubscription Contributor role to the Databricks access connector on the storage account used for the streaming CDR data. This location should also be added as an external location in Databricks.
  2. Point the trigger at your Source Volume (ADLS Gen2 container) in Jobs.
  3. Set the job to run this notebook.

Why this is powerful:

  • Zero Waste: If no files arrive for 3 days, the cluster never turns on. You pay $0.
  • Instant Processing: The moment a file lands in ADLS, the pipeline wakes up, ingests it using availableNow=True, updates the table, and shuts down immediately.
  • Idempotency: Because Auto Loader tracks state with checkpoints, it doesn't matter if the trigger fires once or ten times—it will never duplicate data.
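If you define the job through the Jobs API or asset bundles instead of the UI, the file-arrival trigger is a small block in the job settings. A hedged sketch of its shape (field names reflect our understanding of Jobs API 2.1 and should be verified against the Databricks documentation; the URL is the illustrative external location behind our Source Volume):

```python
# Sketch of Jobs API job settings for a file-arrival trigger.
# Field names are our best understanding of Jobs API 2.1 -- verify before use.
job_trigger_settings = {
    "trigger": {
        "pause_status": "UNPAUSED",
        "file_arrival": {
            # External location that backs the Source Volume (illustrative URL)
            "url": "abfss://source-files@dhidatalakestoreucdev.dfs.core.windows.net/BC_SUBSCRIBER/",
        },
    }
}
```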

6. Nuances & Explanation

The _rescued_data Column

note

Always verify this column after your first run!

We enable .option("cloudFiles.rescuedDataColumn", "_rescued_data"). If a row in your source file is malformed (e.g., has a string "ABC" in an Integer column, or has extra columns), Auto Loader will not crash. Instead, it will:

  1. Set the affected column to NULL.
  2. Capture the mismatched values, together with the source file path, in the _rescued_data JSON column so you can debug them later.
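A quick way to run that check is to look for rows where the column is populated (table name from the example above):

```sql
-- Rows where Auto Loader rescued unparsable data
SELECT _rescued_data, *
FROM deal_dev.bronze.br_cbs_bc_subscriber
WHERE _rescued_data IS NOT NULL
LIMIT 100;
```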