
Global Freight Forwarders: Incremental Loading of Data

Image by Jameson Zimmer

Project Overview

This project focuses on building an automated data ingestion pipeline in Microsoft Fabric for Global Freight Forwarders (GFF). The goal was to eliminate manual data handling and ensure that only the latest log files are reliably available for analysis. The dataset consists of JSON logs generated during daily shipping operations.


Business Problem

GFF's operations team is struggling to analyse near-real-time shipment patterns due to an inefficient, manual data workflow. Every day, numerous raw shipping logs in JSON format are added to their data store. The team must manually identify and process only the new files, a slow and error-prone task that delays critical analysis and cannot scale with data growth. Our mandate is to automate and streamline this process.


Project Objective

  • To architect and deploy a robust, automated, and incremental pipeline that efficiently identifies and consolidates only the new log files each day into a single, queryable Delta Lake table in the Bronze layer of the Lakehouse.

  • Successful Pipeline Execution: The deployed pipeline must run end-to-end without errors.

  • Verified Incremental Logic: The pipeline correctly identifies and processes only new files on subsequent runs, preventing data duplication.

  • Data Integrity: The resulting table contains the combined data from all newly ingested JSON files after each run.

  • Correct Destination: The data is loaded into the ShippingLogs table within the Bronze layer of the Lakehouse.


Project Design

  • Developed an incremental ingestion pipeline using watermarking logic to capture only new or updated shipment logs from the source Lakehouse.

  • JSON logs are then consolidated into a single Delta Lake table in the Bronze layer of the destination Lakehouse.

  • Created a watermark table in the Lakehouse to manage incremental loads. This table serves as the reference point for tracking what has already been ingested (a sketch of the table follows this list).

  • Designed the pipeline with a Lookup activity that reads the watermark table. When executed, the Lookup retrieves the latest ingested timestamp for each source, which gives the pipeline its starting point for the Copy Data activity.

  • Using the retrieved watermark value, the Copy Data activity pulls only the logs created after that timestamp and writes them to the Bronze layer of the Lakehouse (the equivalent filter logic is sketched after this list).

  • After the copy completes, a Notebook activity runs PySpark code that updates the watermark value to the pipeline trigger time, so the next run starts from where this one finished (see the notebook sketch after this list).

  • To eliminate manual operations, the Pipeline was scheduled to refresh automatically at set intervals. 
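
The watermark table referenced in the list above can be created directly from a Fabric notebook attached to the Lakehouse. The following is a minimal sketch, assuming a table named watermark_table with table_name and watermark_value columns; the actual names and seed value used in the project may differ.

```python
# Minimal sketch of the watermark table, run from a Fabric notebook attached to the
# Lakehouse. Table and column names here are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Delta table that stores one watermark row per ingested source.
spark.sql("""
    CREATE TABLE IF NOT EXISTS watermark_table (
        table_name      STRING,
        watermark_value TIMESTAMP
    ) USING DELTA
""")

# Seed the watermark far in the past so the first pipeline run picks up every log file.
spark.sql("""
    INSERT INTO watermark_table
    VALUES ('ShippingLogs', TIMESTAMP '1900-01-01 00:00:00')
""")
```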

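The Lookup and Copy Data activities themselves are configured in the pipeline designer rather than in code, but the incremental logic they implement is equivalent to the PySpark below. The source folder path and the last_modified column are assumptions made for illustration.

```python
# Illustration of the incremental logic performed by the Lookup and Copy Data
# activities: read the stored watermark, load only newer logs, append to Bronze.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# What the Lookup activity does: fetch the current watermark for this source.
watermark = (
    spark.table("watermark_table")
         .filter(F.col("table_name") == "ShippingLogs")
         .agg(F.max("watermark_value").alias("wm"))
         .collect()[0]["wm"]
)

# What the Copy Data activity does: read only logs newer than the watermark
# and append them to the ShippingLogs Delta table in the Bronze layer.
new_logs = (
    spark.read.json("Files/shipping_logs/")                  # assumed source folder
         .filter(F.col("last_modified") > F.lit(watermark))  # assumed timestamp column
)
new_logs.write.format("delta").mode("append").saveAsTable("ShippingLogs")
```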

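Finally, the notebook that advances the watermark after a successful copy might look like the sketch below. It assumes the pipeline passes its trigger time to the notebook as a parameter named trigger_time (for example via the @pipeline().TriggerTime expression); the parameter name is an assumption for illustration.

```python
# Sketch of the watermark-update notebook. trigger_time is assumed to arrive as a
# notebook parameter set from the pipeline's trigger time; the value below is only
# a default for interactive runs.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

trigger_time = "2024-01-01 00:00:00"  # overridden by the pipeline's parameter value

# Move the watermark forward so the next run ingests only files that arrive later.
spark.sql(f"""
    UPDATE watermark_table
    SET watermark_value = to_timestamp('{trigger_time}')
    WHERE table_name = 'ShippingLogs'
""")
```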

Business Problem Solved

  • Automating the pipeline refresh removed the need for daily manual intervention.

  • The team no longer has to identify, filter, and upload the latest logs by hand, reducing the potential for errors.

  • Data is now available on time, improving decision-making and removing reporting lags.

  • The pipeline is designed to handle a large volume of data and can scale as per business requirements. 


Screenshots: pipeline overview, watermark table creation, Lookup settings, Copy Data source and destination settings, watermark update notebook, and scheduled refresh.

Contact Information

Whether you’re looking to modernise data platforms, improve analytics reliability, or explore data engineering and analytics engineering opportunities, I’d be glad to connect.

​

If you have a project in mind or are building a team focused on scalable, business-ready data solutions, let’s start a conversation.
