Bronze Autoloader Generic
Document Version: 1.0
Last Updated: 20-04-2026
02_bronze_autoloader_generic
Purpose
02_bronze_autoloader_generic is the shared ingestion notebook that loads source files into a bronze Delta table using Databricks Auto Loader.
It is designed to be reusable across many feeds by changing widget parameters instead of cloning notebook logic.
What the active implementation does
The uploaded version of this notebook performs the following active steps:
- Reads widget parameters.
- Validates required core values.
- Resolves the schema JSON file path relative to the current notebook.
- Loads the schema from JSON into a Spark StructType.
- Configures a cloudFiles streaming reader.
- Applies source-format options such as CSV delimiter and header handling.
- Adds standard ingestion metadata columns.
- Writes the stream to the target Delta table using availableNow=True.
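The validation step in the list above can be sketched as follows. This is a minimal sketch, not the notebook's actual code: the helper name validate_required and the list REQUIRED_PARAMS are assumptions for illustration, and source_path is a hypothetical parameter name. In the notebook, the values would come from widgets (for example via dbutils.widgets.get).

```python
def validate_required(params, required):
    """Raise ValueError if any required parameter is missing or blank."""
    missing = [
        name for name in required
        if not str(params.get(name) or "").strip()
    ]
    if missing:
        raise ValueError("Missing required parameters: " + ", ".join(missing))
    return params


# Hypothetical set of core parameters; adjust to the notebook's real widget names.
REQUIRED_PARAMS = [
    "source_path",          # hypothetical name for the landing location
    "source_format",
    "schema_file_path",
    "checkpoint_path",
    "target_table_name",
]
```

Failing fast here keeps a misconfigured job from creating a checkpoint or a table under the wrong name.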
Read pattern
The notebook uses:
- spark.readStream
- format cloudFiles
- cloudFiles.format = source_format
- cloudFiles.schemaLocation = {checkpoint_path}/_schemas
- cloudFiles.rescuedDataColumn = rescued_data_column
This is Databricks Auto Loader, which is well suited for incremental file discovery in cloud storage / Unity Catalog volumes.
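A minimal sketch of how the reader options listed above might be assembled. The option keys are copied from this section; the function names autoloader_options and build_reader, and the source_path argument, are assumptions for illustration rather than the notebook's own identifiers.

```python
def autoloader_options(source_format, checkpoint_path, rescued_data_column):
    """Build the cloudFiles option map described in this section."""
    return {
        "cloudFiles.format": source_format,
        # Schema tracking lives under the checkpoint directory.
        "cloudFiles.schemaLocation": checkpoint_path.rstrip("/") + "/_schemas",
        "cloudFiles.rescuedDataColumn": rescued_data_column,
    }


def build_reader(spark, schema, source_path, options):
    """Return a streaming DataFrame; requires a Databricks Spark session."""
    return (
        spark.readStream
        .format("cloudFiles")
        .options(**options)
        .schema(schema)
        .load(source_path)
    )
```

Keeping the option map in a plain function makes it easy to inspect or log the effective configuration before the stream starts.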
Write pattern
The notebook writes with:
- .format("delta")
- .outputMode(output_mode)
- .option("checkpointLocation", checkpoint_path)
- .option("mergeSchema", str(merge_schema).lower())
- .trigger(availableNow=True)
- .toTable(target_table_name)
availableNow=True means each job run behaves like a bounded ingestion run that processes all currently available files and then stops.
Metadata columns added by the notebook
The notebook enriches ingested rows with standard bronze metadata:
- w_business_ts
- w_target_table_name
- w_load_type
- w_run_date
- w_ingest_ts
- w_source_file_name
- w_ingestion_run_id
- w_source_system
- w_job_name
- w_task_name
- w_job_id
- w_job_run_id
- w_task_run_id
- w_job_trigger_type
- w_job_start_ts
These fields make downstream traceability and support much easier.
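As an illustration, a few of these columns could be attached along the following lines. This is a sketch under assumptions, not the notebook's implementation: how each value is actually derived is not documented here, the helper names literal_metadata and add_metadata are hypothetical, and taking the file name from the _metadata column is one possible choice. Only a subset of the columns is shown.

```python
import uuid
from datetime import date


def literal_metadata(target_table_name, load_type, source_system,
                     run_date=None, run_id=None):
    """Collect per-run constant values keyed by metadata column name."""
    return {
        "w_target_table_name": target_table_name,
        "w_load_type": load_type,
        "w_source_system": source_system,
        # Assumption: run date defaults to today, run id to a random UUID.
        "w_run_date": run_date or date.today().isoformat(),
        "w_ingestion_run_id": run_id or uuid.uuid4().hex,
    }


def add_metadata(df, literals):
    """Attach constant and per-row metadata to a Spark DataFrame."""
    from pyspark.sql import functions as F  # lazy import; needs Spark

    for name, value in literals.items():
        df = df.withColumn(name, F.lit(value))
    return (
        df.withColumn("w_ingest_ts", F.current_timestamp())
          # Assumption: file name taken from the file-source _metadata column.
          .withColumn("w_source_file_name", F.col("_metadata.file_name"))
    )
```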
Schema handling
The notebook requires schema_file_path and reads the schema file as JSON.
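A minimal sketch of loading such a file, assuming the JSON uses the layout Spark itself produces for a StructType (a top-level object with "type": "struct" and a "fields" list). The function names are hypothetical. StructType.fromJson is the PySpark call that turns the parsed dictionary into a schema object.

```python
import json


def load_schema_dict(path):
    """Read a schema JSON file and check that it looks like a StructType."""
    with open(path, "r", encoding="utf-8") as fh:
        schema = json.load(fh)
    if (
        not isinstance(schema, dict)
        or schema.get("type") != "struct"
        or not isinstance(schema.get("fields"), list)
    ):
        raise ValueError(f"{path} is not a StructType JSON document")
    return schema


def to_struct_type(schema_dict):
    """Convert the parsed JSON into a Spark StructType (needs PySpark)."""
    from pyspark.sql.types import StructType  # lazy import

    return StructType.fromJson(schema_dict)
```

Checking the shape before handing the dictionary to Spark gives a clearer error message when the wrong file is deployed.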
Path resolution behavior
schema_file_path can be provided in one of these forms:
- /Workspace/...
- /some/workspace/relative/path
- ./Schemas/schema_x.json
If the path is relative, the notebook resolves it relative to the notebook directory in the workspace.
This is useful when keeping notebook and schema assets together in the same folder structure.
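The resolution rule can be sketched as below. This is an illustration under assumptions: the function name resolve_schema_path is hypothetical, notebook_dir stands for the notebook's workspace folder (however the notebook obtains it), and treating a leading slash without /Workspace as a workspace-rooted path is a guess about the intended behavior.

```python
import posixpath


def resolve_schema_path(schema_file_path, notebook_dir):
    """Resolve a schema path against the notebook's workspace directory."""
    path = schema_file_path.strip()
    if path.startswith("/Workspace/"):
        # Already a full workspace path: use as-is.
        return path
    if path.startswith("/"):
        # Assumption: a rooted path without the /Workspace prefix.
        return "/Workspace" + path
    # Relative path: join with the notebook directory and collapse ./ and ../
    return posixpath.normpath(posixpath.join(notebook_dir, path))
```

For example, with notebook_dir set to /Workspace/proj/notebooks, the relative path ./Schemas/schema_x.json resolves to /Workspace/proj/notebooks/Schemas/schema_x.json.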
CSV-specific behavior
When source_format is csv, the notebook applies:
- sep = delimiter
- header = header
- nullValue = null_value
For other source formats, those options are ignored unless you extend the notebook.
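A minimal sketch of that conditional, assuming the three widget values are passed in as plain Python values. The function name format_options and the default values are assumptions for illustration.

```python
def format_options(source_format, delimiter=",", header="true", null_value=""):
    """Return reader options for the given source format.

    Only CSV gets extra options; every other format returns an empty map.
    """
    if source_format.lower() != "csv":
        return {}
    return {
        "sep": delimiter,
        # Spark reader options are strings, so normalize booleans to "true"/"false".
        "header": str(header).lower(),
        "nullValue": null_value,
    }
```

Supporting another format would mean adding a branch here, which mirrors the note above about extending the notebook.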
Checkpointing and schema tracking
Two storage locations matter:
checkpoint_path
Used for Structured Streaming checkpoint state. Keep this path stable for the job: if it changes after go-live, the stream starts from a fresh checkpoint and Auto Loader treats already-ingested files as new.
schema_location
Derived automatically as:
{checkpoint_path}/_schemas
This is where Auto Loader stores schema tracking information.
Current limitations and reserved parameters
The notebook includes parameters such as:
- staging_table_name
- business_keys
- overwrite_schema
- cleanup_stage_after_finalize
The uploaded implementation currently does not actively use those parameters in the live execution path.
There are commented sections that suggest an extended design involving:
- staging table writes
- row counting from a staged run
- high-watermark processing
- finalize logic for snapshot / incremental handling
- cleanup of staged rows
Because those blocks are commented out, the generic notebook does not perform any of those steps unless your environment runs a modified version.
load_type in the current implementation
load_type is currently written as metadata (w_load_type) and forwarded to the audit notebook. In the uploaded active path, it does not yet change the write behavior by itself.
That means values like snapshot and incremental are still useful for lineage and future compatibility, but they do not independently change ingestion semantics in this version unless you extend the notebook.
Operational guidance
Keep one checkpoint path per source/table
Do not share the same checkpoint path across unrelated feeds. Each logical ingestion should have its own checkpoint directory.
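One way to catch accidental sharing is to scan the job configurations before deployment. The sketch below assumes each feed is described by a dictionary with checkpoint_path and target_table_name keys; that configuration shape and the function name shared_checkpoints are hypothetical.

```python
from collections import defaultdict


def shared_checkpoints(feeds):
    """Return checkpoint paths that are used by more than one target table."""
    tables_by_path = defaultdict(list)
    for feed in feeds:
        # Ignore trailing slashes so /chk/a and /chk/a/ count as the same path.
        key = feed["checkpoint_path"].rstrip("/")
        tables_by_path[key].append(feed["target_table_name"])
    return {
        path: tables
        for path, tables in tables_by_path.items()
        if len(tables) > 1
    }
```

An empty result means every feed has its own checkpoint directory.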
Keep schema files under source control
The schema JSON is part of the ingestion contract. Store it with the notebook project and update it through normal change control.
Use precise file patterns
source_file_pattern helps prevent accidentally ingesting unrelated files from the same landing folder.
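A pattern can be checked against sample file names before it goes live. The sketch below uses Python's fnmatch as an approximation, assuming source_file_pattern is a simple glob such as orders_*.csv. The glob syntax used by Spark file sources is not identical (for example, brace alternation like {a,b} is not understood by fnmatch), so treat this as a quick sanity check only.

```python
from fnmatch import fnmatchcase


def matching_files(file_names, source_file_pattern):
    """Return the file names that a simple glob pattern would select."""
    return [
        name for name in file_names
        if fnmatchcase(name, source_file_pattern)
    ]


# Example landing-folder listing (hypothetical file names).
sample = ["orders_20260420.csv", "orders_20260420.csv.tmp", "customers.csv"]
selected = matching_files(sample, "orders_*.csv")
```

Here selected contains only orders_20260420.csv; the temporary file and the unrelated customers file are excluded.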
Use a rescued data column
Keep _rescued_data enabled unless you have a strong reason not to. It helps preserve malformed or unexpected fields for investigation.
Common issues
No files loaded
Possible causes:
- landing path is empty
- file pattern does not match incoming files
- checkpoint already recorded those files
- source path is wrong
Schema file not found
Possible causes:
- relative path resolved incorrectly
- schema file not deployed with the notebooks
- incorrect workspace path syntax
Unexpected schema errors
Possible causes:
- schema JSON does not match actual file layout
- delimiter or header configuration is wrong
- merge behavior expectations do not match the notebook's active logic