Metropolis - Metadata-Driven File Ingestion and Quarantine for Central Data Inbox

Project Overview
The City of Metropolis launched a digital transformation initiative to unify public-sector data across departments such as Police, Transportation, and Citizen Services. A shared "data inbox" was introduced as a central drop zone for all reports and files, with the aim of simplifying data collection and analysis. However, rapid adoption led to data chaos. The inbox became cluttered with hundreds of files in varying formats (CSV, JSON, Excel, PDF) and with inconsistent naming conventions. Analysts spent hours manually identifying and loading data, undermining the promise of automation. The tipping point came when a critical enforcement report went unnoticed for days because of naming inconsistencies, exposing a significant gap in the city's data ingestion process: the "last mile" where data enters the system.
Business Problem
Analysts manually checked the inbox each day, identified files by name, and triggered pipelines themselves. Misnamed files were missed, some files were loaded twice, and unknown files were ignored with no record. The city needed an automated, reliable way to classify and process files while maintaining a clean and auditable inbox. The data engineering team faced a deceptively complex challenge that went to the heart of modern data platform design: architecting a solution that could operate autonomously and handle the unpredictable nature of real-world data ingestion without constant human supervision. The core requirements were precise but technically demanding.
Project Objective
- To architect and deploy a robust, automated Single Dynamic Router pipeline that retrieves a complete list of all files in the inbox, then iterates through that list, using conditional logic (a Switch activity) to inspect each filename.
- If the name matches a known pattern, the pipeline routes the file to the appropriate Copy activity and, on success, moves it to an archive folder. If no match is found, the file is moved directly to a quarantine folder (a sketch of this routing logic follows this list).
- This approach requires a sophisticated initial design and a solid understanding of control-flow activities and dynamic expressions (@item().name, etc.).
- Successful pipeline execution: the deployed pipeline must run end-to-end without errors.
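The routing decision itself can be pictured with a short, hypothetical Python sketch. The deployed solution uses Data Factory control-flow activities rather than Python, and the filename patterns and department names below are illustrative assumptions, not the city's real conventions:

import re

# Illustrative routing table: the patterns and department names are assumptions
# made for this sketch, not the city's actual naming conventions.
ROUTES = {
    r"^police_.*\.(csv|json)$": "Police",
    r"^transport_.*\.(csv|xlsx)$": "Transportation",
    r"^citizen_.*\.(csv|json|xlsx)$": "Citizen Services",
}


def classify(filename: str) -> str | None:
    """Return the target department for a filename, or None to quarantine it.

    This mirrors the Switch activity: each case tests the filename (which the
    pipeline reads via the @item().name expression) against a known pattern.
    """
    name = filename.lower()
    for pattern, department in ROUTES.items():
        if re.match(pattern, name):
            return department
    return None  # no case matched -> default branch -> quarantine


if __name__ == "__main__":
    for f in ("police_incidents_2024-05-01.csv", "random_notes.pdf"):
        print(f, "->", classify(f) or "quarantine")

Each pattern plays the role of one Switch case; returning None corresponds to the default case that sends the file to quarantine.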
Project Design
- Designed a Single Dynamic Router, metadata-driven pipeline that scans all incoming files, dynamically routes them by format using Switch logic, and transfers valid files to department-specific folders before ingesting them into lakehouse tables. Files with unrecognised formats are automatically quarantined, and all processed files are deleted from the inbox to prevent duplication.
- Using the Get Metadata activity, the pipeline was configured to scan the central inbox and retrieve the metadata of all incoming files. Its output feeds the ForEach activity directly, so no filenames are hardcoded.
- The ForEach activity iterates over every file returned by Get Metadata and passes each one to the Switch activity, where it is evaluated by name and extension.
- The Switch activity conditionally tests each file's name and extension and routes it to the appropriate department folder. After successful routing, the file is deleted from the central inbox to prevent duplication and ensure reliable downstream ingestion.
- To eliminate manual operations, the pipeline was scheduled to refresh automatically at set intervals (an end-to-end sketch of this flow follows this list).
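As a rough local-filesystem analogue of this design, one scheduled run might look like the following Python sketch. The folder names are assumptions, routing is shown on extension only, and Python stands in for the actual Get Metadata, ForEach, Switch, Copy, and Delete activities:

import shutil
from pathlib import Path

# Assumed local folders standing in for the central inbox, the quarantine area,
# and the department-specific drop zones used by the real pipeline.
INBOX = Path("inbox")
QUARANTINE = Path("quarantine")
EXTENSION_ROUTES = {
    ".csv": Path("departments/transportation"),
    ".json": Path("departments/police"),
    ".xlsx": Path("departments/citizen_services"),
}


def run_router() -> None:
    """One scheduled run: Get Metadata -> ForEach -> Switch -> Copy -> Delete."""
    for folder in (QUARANTINE, *EXTENSION_ROUTES.values()):
        folder.mkdir(parents=True, exist_ok=True)

    # "Get Metadata" analogue: enumerate every file currently in the inbox.
    for file in sorted(INBOX.glob("*")):                          # "ForEach" analogue
        if not file.is_file():
            continue
        destination = EXTENSION_ROUTES.get(file.suffix.lower())   # "Switch" analogue
        if destination is None:
            # Default case: unrecognised format goes to quarantine.
            shutil.move(str(file), str(QUARANTINE / file.name))
        else:
            shutil.copy2(file, destination / file.name)           # "Copy" analogue
            file.unlink()                # delete from the inbox to prevent duplicates


if __name__ == "__main__":
    INBOX.mkdir(exist_ok=True)
    run_router()

In the deployed pipeline this loop is expressed declaratively and fired by a scheduled trigger; running a script like this under cron or a task scheduler would be the manual equivalent of that trigger.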
Business Problem Solved
- The implemented solution was the Single Dynamic "Router" pipeline. This architecture was chosen because it provided the optimal balance of power, scalability, and maintainability.
- The pipeline immediately automated the entire triage process, eliminating hours of daily manual work. The active file management was particularly well received.
- Successfully processed files were moved to the archive, creating a clear audit trail. Unrecognised files were moved to the quarantine folder, creating an automated exception queue for the data governance team to review. This ensured no data was ever lost or ignored (see the audit-trail sketch after this list).
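A sketch of how the archive and quarantine moves can double as an audit trail, again as a hypothetical local analogue: the log path and column names below are assumptions, while the production pipeline records the same facts through its run history and folder contents.

import csv
import shutil
from datetime import datetime, timezone
from pathlib import Path

ARCHIVE = Path("archive")
QUARANTINE = Path("quarantine")
AUDIT_LOG = Path("audit_log.csv")   # hypothetical log location


def record_disposition(file: Path, matched: bool) -> None:
    """Move a file to the archive (matched) or quarantine (unmatched) and log it."""
    destination = ARCHIVE if matched else QUARANTINE
    destination.mkdir(parents=True, exist_ok=True)
    shutil.move(str(file), str(destination / file.name))

    write_header = not AUDIT_LOG.exists()
    with AUDIT_LOG.open("a", newline="") as handle:
        writer = csv.writer(handle)
        if write_header:
            writer.writerow(["processed_at_utc", "filename", "disposition"])
        writer.writerow([
            datetime.now(timezone.utc).isoformat(),
            file.name,
            "archived" if matched else "quarantined",
        ])


if __name__ == "__main__":
    demo = Path("unknown_report.pdf")
    demo.write_text("placeholder")
    record_disposition(demo, matched=False)   # lands in quarantine and is logged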