
Innovative Solutions - Building a People Analytics Engine with PySpark

Image by Jameson Zimmer

Project Overview

At Innovative Solutions, employee and department information was fragmented across multiple spreadsheets and disconnected HR systems, creating data silos and manual dependencies. This led to inconsistent workforce reporting and slow response times when addressing critical questions from HR, Finance, and leadership teams. As the organisation experienced rapid growth, the existing setup failed to provide timely visibility into key workforce metrics, including headcount, hiring patterns, and departmental cost allocation. These limitations hindered both day-to-day operational planning and longer-term strategic decision-making, highlighting the need for a centralised, automated, and analytics-ready workforce data solution.


Business Problem

The company's core employee and department data exists but is fragmented across different systems and is not centrally analysed. Basic questions from leadership about departmental costs, hiring trends, and employee tenure take days of manual effort to answer, and the answers are prone to error. This lack of insight presents several critical business problems:

  • Budget Uncertainty: The finance team cannot get a clear, real-time picture of the salary budget distribution across departments, making financial planning difficult.

  • Talent Management Gaps: HR has no efficient way to analyze employee tenure, identify hiring patterns, or compare performance metrics across different parts of the organization.

  • Operational Inefficiency: Without a centralised analytics engine, answering even simple questions requires manually pulling data, hindering the ability to make fast, informed decisions.


Project Objective

  • To build a foundational analytics engine using PySpark.

  • Given a backlog of critical questions from HR, Finance, and executive leadership, the objective is to ingest core company data and write production-quality PySpark code that answers each question, transforming raw data into actionable intelligence.

  • The final deliverable is a single, well-commented Jupyter Notebook (.ipynb) containing the complete PySpark code and the corresponding outputs for all assigned questions.


Project Design

  • Designed and implemented a PySpark-based analytics notebook that directly leverages existing Lakehouse tables to address critical HR and Finance use cases.

  • Intentionally avoided creating new ingestion pipelines, focusing instead on maximizing value from already curated datasets, reducing complexity and time to delivery.

  • Built clean, modular, and well-documented PySpark code, making the notebook easy to understand, maintain, and extend by other team members.

  • Integrated multiple data domains—including employee master data, departmental structures, budget allocations, and performance indicators—through optimized joins and transformations.

  • Structured the notebook into logical sections, with each section answering a specific business question such as headcount trends, cost distribution, or performance analysis.

  • Ensured that all calculations and aggregations are consistent and reusable, eliminating ad-hoc logic and discrepancies across analyses.

  • Enabled fast, repeatable insights for stakeholders by replacing manual analysis with automated, code-driven analytics.

  • Created an analytics asset that can be directly reused for dashboards, self-service analysis, or future People Analytics enhancements without rework.


Business Problem Solved

  • Immediate Workforce Visibility: HR and Finance teams can now view reliable headcount, team composition, and cost metrics on demand—without relying on manual reconciliation or spreadsheet merges.

  • Faster Executive Insights: Leadership receives answers to critical workforce questions within minutes instead of waiting days for static reports, accelerating decision-making cycles.

  • Elimination of Manual Effort: Automated PySpark pipelines replaced error-prone spreadsheet workflows, freeing teams from repetitive data preparation tasks and reducing operational overhead.

  • Consistent, Trustworthy Metrics: Centralized and standardized logic for tenure, hiring velocity, attrition, and departmental costing removed inconsistencies across reports and stakeholders.

  • Proactive Talent Planning: HR teams can now identify retention risks, hiring gaps, and workforce trends early, enabling data-driven talent strategies rather than reactive planning.

  • Stronger Financial Governance: Finance gains continuous visibility into salary distributions and department-level spending, improving forecast accuracy and budget control.

  • Scalable Analytics Foundation: The People Analytics Engine is modular and extensible, making it easy to power dashboards, enable self-service analytics, and support future workforce modeling initiatives.


Notebook screenshots: import functions and data reads, plus the answers to questions 1–15 (CS5 - Import Functions & read Data.png, CS5 - Q1-3.png, CS5 - Q4-7.png, CS5 - Q8-11.png, CS5 - Q12-15.png).

Contact Information

Whether you’re looking to modernise data platforms, improve analytics reliability, or explore data engineering and analytics engineering opportunities, I’d be glad to connect.


If you have a project in mind or are building a team focused on scalable, business-ready data solutions, let’s start a conversation.
