Data Engineering Projects

Production Data Platforms

Real-Time Data Pipeline Platform — NatWest Bank

Duration: September 2025 - Present
Role: Data Engineer

Building and maintaining production-grade data pipelines that process millions of transactions daily, supporting analytics, reporting, and regulatory requirements across the organization.

Architecture:

  • Ingestion Layer: Real-time streaming via Apache Kafka from transaction systems
  • Storage Layer: Raw data landing in Amazon S3 with partitioning strategies
  • Processing Layer: PySpark distributed processing for transformation and validation
  • Warehouse Layer: Curated datasets in Snowflake with optimized data models
  • Orchestration: Apache Airflow DAGs managing 40+ interdependent pipelines
  • Analytics: Tableau dashboards connecting to Snowflake for business intelligence
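
To make the orchestration layer concrete, here is a minimal Airflow DAG sketch of the pattern above (Airflow 2.4+ style). The DAG id, task names, and callables are hypothetical placeholders, not the production code:

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    default_args = {"owner": "data-engineering", "retries": 2,
                    "retry_delay": timedelta(minutes=5)}

    def validate_raw_landing(ds, **_):
        print(f"checking raw-zone S3 partition for {ds}")             # placeholder check

    def run_spark_transform(ds, **_):
        print(f"submitting PySpark raw -> curated job for {ds}")      # placeholder submit

    def load_snowflake(ds, **_):
        print(f"loading curated partition into Snowflake for {ds}")   # placeholder load

    with DAG(
        dag_id="transactions_raw_to_refined",   # hypothetical name
        start_date=datetime(2025, 9, 1),
        schedule="@hourly",
        catchup=False,
        default_args=default_args,
    ) as dag:
        validate = PythonOperator(task_id="validate_raw", python_callable=validate_raw_landing)
        transform = PythonOperator(task_id="spark_transform", python_callable=run_spark_transform)
        load = PythonOperator(task_id="load_to_snowflake", python_callable=load_snowflake)

        validate >> transform >> load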

Key Features:

  • Real-time data ingestion handling peak loads of 10K+ events/second
  • Three-zone architecture (Raw → Curated → Refined) for data reliability
  • Data quality framework with automated validation and alerting
  • Incremental processing patterns for efficient compute usage
  • Monitoring and observability with SLA tracking
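
The incremental-processing pattern from the list above boils down to curating only the current run's partition instead of rescanning history. A PySpark sketch, with bucket names, columns, and the validation rule as illustrative assumptions:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("incremental-curation").getOrCreate()

    run_date = "2025-11-01"   # normally injected by the orchestrator

    # Read only the current partition rather than the full history
    raw = spark.read.parquet(f"s3://raw-zone/transactions/ingest_date={run_date}")

    curated = (
        raw.dropDuplicates(["transaction_id"])        # keeps re-runs idempotent
           .filter(F.col("amount").isNotNull())       # simple quality gate
           .withColumn("processed_at", F.current_timestamp())
    )

    curated.write.mode("overwrite").parquet(
        f"s3://curated-zone/transactions/ingest_date={run_date}"
    )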

Technologies: Apache Kafka, PySpark, Amazon S3, Snowflake, Apache Airflow, Tableau, Python, SQL

Impact:

  • Enables real-time analytics for business decision-making
  • Supports regulatory reporting with audit trails and data lineage
  • Reduced data processing latency from hours to minutes
  • Improved data quality through automated validation
  • Recovery time reduced from 6 hours to 30 minutes via the three-zone architecture

Enterprise Cloud Data Platform — Accenture

Duration: July 2023 - August 2025
Role: Data Engineer

Delivered large-scale cloud data engineering solutions for Fortune 500 clients across multiple industries, building scalable platforms on Azure and AWS.

Solutions Delivered:

Azure Databricks Data Lakehouse:

  • Unified analytics platform combining data lake and warehouse capabilities
  • Delta Lake for ACID transactions and time travel
  • Unity Catalog for data governance and access control
  • AutoLoader for incremental ingestion from cloud storage
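
As an illustration of the AutoLoader piece, a sketch of an incremental-ingestion job; paths and table names are invented, and `spark` is the session the Databricks runtime provides:

    # Runs inside a Databricks job; the runtime supplies `spark`.
    source = "abfss://landing@examplestorage.dfs.core.windows.net/orders/"

    stream = (
        spark.readStream.format("cloudFiles")                 # Auto Loader source
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/Volumes/main/bronze/_schemas/orders")
        .load(source)
    )

    (
        stream.writeStream
        .option("checkpointLocation", "/Volumes/main/bronze/_checkpoints/orders")
        .trigger(availableNow=True)     # process the backlog, then stop
        .toTable("main.bronze.orders")  # Unity Catalog three-level name
    )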

Snowflake Data Warehouse:

  • Multi-tenant architecture supporting multiple business units
  • Zero-copy cloning for dev/test environments
  • Time Travel and Fail-safe for data recovery
  • Secure data sharing with external partners
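
Zero-copy cloning and Time Travel are single SQL statements in Snowflake; the sketch below drives them from Python with the official connector. Database, role, and credential values are placeholders:

    import snowflake.connector

    # Credentials are placeholders; in practice they come from a secrets manager.
    conn = snowflake.connector.connect(
        account="example_account", user="deploy_user", password="...",
        role="SYSADMIN", warehouse="DEV_WH",
    )
    with conn.cursor() as cur:
        # Zero-copy clone: metadata-only copy of production for dev/test
        cur.execute("CREATE OR REPLACE DATABASE ANALYTICS_DEV CLONE ANALYTICS_PROD")
        # Time Travel: read a table as it looked one hour ago
        cur.execute(
            "SELECT COUNT(*) FROM ANALYTICS_PROD.SALES.ORDERS AT(OFFSET => -3600)"
        )
        print(cur.fetchone())
    conn.close()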

Microsoft Fabric Unified Analytics:

  • Lakehouse architecture with OneLake storage
  • Data pipelines for orchestration
  • Power BI integration for self-service analytics
  • Real-time analytics with KQL databases

Key Achievements:

  • Built end-to-end data platforms serving 1,000+ users
  • Migrated legacy ETL to cloud-native ELT patterns
  • Implemented CI/CD pipelines reducing deployment time by 70%
  • Designed data mesh architecture with domain-oriented ownership

Technologies: Azure Databricks, Snowflake, Azure Data Factory, Azure Data Lake, Microsoft Fabric, PySpark, dbt, Terraform, GitHub Actions, Python, SQL

Impact:

  • Enabled data-driven decision making for enterprise clients
  • Reduced infrastructure costs through cloud-native optimization
  • Improved time-to-insight with self-service analytics
  • Established data governance and security frameworks

Business Intelligence Platform — Dpoint Group

Duration: May 2022 - June 2023
Location: Barcelona, Spain
Role: Data Engineer

Developed BI and analytics solutions supporting operational reporting and executive dashboards for manufacturing and logistics operations.

Solutions Built:

ETL Pipeline Architecture:

  • SSIS packages extracting data from SAP BW
  • Transformation layer cleaning and standardizing data
  • Azure SQL Database as centralized data warehouse
  • Incremental loading strategies for efficiency
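
The pipelines themselves ran in SSIS, but the incremental-loading strategy is easy to sketch in Python: keep a high-watermark per source and pull only rows changed since. Driver, tables, and columns below are all illustrative:

    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};"
        "SERVER=example.database.windows.net;DATABASE=dw;UID=etl;PWD=..."
    )
    cur = conn.cursor()

    # 1. Fetch the last successful watermark for this source
    cur.execute("SELECT last_loaded_at FROM etl.watermarks WHERE source = ?", "sap_orders")
    watermark = cur.fetchone()[0]

    # 2. Load only rows that changed after the watermark
    cur.execute(
        "INSERT INTO dw.fact_orders (order_id, amount, changed_at) "
        "SELECT order_id, amount, changed_at FROM stg.sap_orders WHERE changed_at > ?",
        watermark,
    )

    # 3. Advance the watermark only after the load succeeds
    cur.execute(
        "UPDATE etl.watermarks SET last_loaded_at = "
        "(SELECT MAX(changed_at) FROM stg.sap_orders) WHERE source = ?",
        "sap_orders",
    )
    conn.commit()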

Power BI Dashboards:

  • Executive KPI dashboards with drill-down capabilities
  • Operational reports for supply chain visibility
  • Financial analytics with budget vs. actuals tracking
  • Mobile-optimized dashboards for field teams

Automation Initiatives:

  • Python scripts for recurring report generation
  • Excel VBA macros for data preparation
  • Scheduled workflows in Azure Data Factory
  • Email distribution of reports to stakeholders
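
A minimal sketch of the report-automation pattern, assuming a pandas summary written to Excel and mailed via SMTP; file names, addresses, and the SMTP host are placeholders (writing .xlsx requires openpyxl):

    import smtplib
    from email.message import EmailMessage

    import pandas as pd

    df = pd.read_csv("daily_production.csv")   # stand-in for a SQL extract
    df.groupby("plant")["units"].sum().reset_index().to_excel(
        "daily_summary.xlsx", index=False
    )

    msg = EmailMessage()
    msg["Subject"] = "Daily production summary"
    msg["From"] = "reports@example.com"
    msg["To"] = "ops-team@example.com"
    msg.set_content("Attached: automated daily production summary.")
    with open("daily_summary.xlsx", "rb") as f:
        msg.add_attachment(
            f.read(), maintype="application",
            subtype="vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            filename="daily_summary.xlsx",
        )

    with smtplib.SMTP("smtp.example.com", 587) as smtp:
        smtp.starttls()
        smtp.send_message(msg)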

Technologies: SSIS, SAP BW, Power BI, Azure Data Factory, Azure SQL Database, Python, Excel VBA, SQL

Impact:

  • Automated 30+ manual reporting processes
  • Reduced report generation time from days to hours
  • Improved data accuracy through standardized processes
  • Enabled self-service analytics for business users

Personal Projects & Learning

Real-Time Data Quality Monitoring with ML

GitHub: kalluripradeep/realtime-data-quality-monitor
Status: ✅ Production-Ready

ML-powered real-time data quality monitoring system that detects anomalies in streaming data with sub-10ms latency, preventing data quality issues before they reach production systems.

Architecture:

  • Streaming Ingestion: Apache Kafka consuming 600+ events per minute
  • Real-Time Processing: Spark Structured Streaming with micro-batch processing
  • ML Anomaly Detection: Isolation Forest algorithm detecting quality issues in real-time
  • Data Storage: PostgreSQL for metrics, checkpointing for exactly-once semantics
  • Quality Scoring: Multi-dimensional quality assessment (completeness, accuracy, freshness)
  • Monitoring: Real-time dashboards tracking quality scores and anomaly rates
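
A condensed sketch of how the streaming spine above fits together. The topic, schema, and checkpoint path are illustrative, and the Kafka source needs the spark-sql-kafka package on the classpath:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("dq-monitor").getOrCreate()

    schema = StructType([
        StructField("order_id", StringType()),
        StructField("amount", DoubleType()),
        StructField("event_time", StringType()),
    ])

    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "orders")                       # illustrative topic
        .load()
        .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
        .select("e.*")
    )

    def score_batch(batch_df, batch_id):
        # Hook where quality metrics / ML scores are computed per micro-batch
        # and persisted (e.g. to PostgreSQL) idempotently, keyed on batch_id.
        print(f"batch {batch_id}: {batch_df.count()} events")

    (
        events.writeStream
        .foreachBatch(score_batch)
        .option("checkpointLocation", "/tmp/checkpoints/dq-monitor")  # replay-safe restarts
        .start()
        .awaitTermination()
    )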

Key Features:

  • Sub-10ms latency for ML-based quality scoring on streaming data
  • 93% quality scores maintained across all test pipelines
  • 332,000+ orders processed through the monitoring pipeline
  • Automatic anomaly detection using Isolation Forest trained on historical patterns
  • Multi-dimensional quality checks: Completeness, uniqueness, validity, consistency, accuracy, timeliness
  • Exactly-once processing with Kafka checkpointing and idempotent operations
  • Real-time alerting when quality scores drop below thresholds
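
The multi-dimensional scoring reduces each micro-batch to a vector of 0-1 scores. A pandas sketch covering three of the dimensions listed above; the column names and five-minute freshness window are assumptions:

    import pandas as pd

    def quality_dimensions(df: pd.DataFrame, key: str, ts_col: str) -> dict:
        """Score a micro-batch on a few dimensions, each in [0, 1]."""
        now = pd.Timestamp.now(tz="UTC")
        return {
            "completeness": 1.0 - df.isna().mean().mean(),        # non-null share of cells
            "uniqueness": df[key].nunique() / max(len(df), 1),    # duplicate keys lower this
            "timeliness": (pd.to_datetime(df[ts_col], utc=True)
                           > now - pd.Timedelta(minutes=5)).mean(),
        }

    batch = pd.DataFrame({
        "order_id": ["a1", "a2", "a2"],                 # one duplicate key
        "amount": [10.0, None, 7.5],                    # one missing value
        "event_time": [pd.Timestamp.now(tz="UTC")] * 3,
    })
    print(quality_dimensions(batch, key="order_id", ts_col="event_time"))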

Technical Implementation:

  • Kafka consumer with configurable batch sizes for optimal throughput
  • scikit-learn Isolation Forest model trained on 51+ historical quality samples
  • PostgreSQL for storing quality metrics with time-series analysis
  • Configurable quality thresholds per data source
  • Comprehensive logging and error handling
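
The anomaly-detection core is small. A sketch with scikit-learn's IsolationForest, trained here on synthetic stand-ins for the stored history (the real model trains on persisted quality samples):

    import numpy as np
    from sklearn.ensemble import IsolationForest

    # Columns: completeness, uniqueness, timeliness, events/minute (synthetic history)
    rng = np.random.default_rng(42)
    history = np.column_stack([
        rng.normal(0.97, 0.01, 200),
        rng.normal(0.99, 0.005, 200),
        rng.normal(0.95, 0.02, 200),
        rng.normal(600, 50, 200),
    ])

    model = IsolationForest(contamination=0.02, random_state=42).fit(history)

    # predict() returns -1 for anomalies, 1 for normal batches
    new_batch = np.array([[0.80, 0.99, 0.40, 610]])   # degraded completeness/freshness
    if model.predict(new_batch)[0] == -1:
        print("anomaly: batch deviates from historical quality profile -> alert")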

Technologies: Apache Kafka, Spark Structured Streaming, Python, scikit-learn, PostgreSQL, Isolation Forest ML, pandas

Impact:

  • 332K+ orders processed with 93% quality scores maintained
  • Sub-10ms latency demonstrates feasibility for high-throughput production systems
  • Demonstrates production-ready ML integration with streaming data
  • Provides reusable framework for data quality monitoring

Modern ETL / Data Platform

GitHub: kalluripradeep/modern-etl-stack
Status: ✅ Open Source

A complete modern data platform built from scratch on open-source components, demonstrating enterprise-grade architecture at zero licensing cost. It serves as a cost-effective alternative to commercial enterprise tools that can run to £100K+ annually.

Architecture:

  • Orchestration: Apache Airflow managing end-to-end pipeline workflows
  • Change Data Capture: Apache Kafka with Debezium for real-time CDC from source databases
  • Processing: Apache Spark for distributed transformation at scale
  • Transformation Layer: dbt with Bronze → Silver → Gold medallion architecture
  • Storage: PostgreSQL as source system, MinIO as S3-compatible object store
  • Monitoring: Prometheus + Grafana for pipeline observability and alerting
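
To show the CDC shape concretely: Debezium writes one change event per row mutation to Kafka. A hedged consumer sketch using kafka-python; the topic name follows Debezium's server.schema.table convention, and the envelope layout assumes the default JSON converter:

    import json
    from kafka import KafkaConsumer   # kafka-python

    consumer = KafkaConsumer(
        "shop.public.orders",                       # illustrative Debezium topic
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
    )

    for message in consumer:
        event = message.value
        if event is None:                 # tombstone record following a delete
            continue
        payload = event.get("payload", event)
        op = payload["op"]                # c=create, u=update, d=delete, r=snapshot
        row = payload["before"] if op == "d" else payload["after"]
        print(op, row)                    # the project lands these in the Bronze layer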

Key Features:

  • 1,000+ orders processed with real-time CDC and sub-second latency
  • Medallion architecture (Bronze/Silver/Gold) for reliable, layered data quality
  • Real-time CDC via Debezium — captures every insert, update, and delete
  • Full observability with Prometheus metrics and Grafana dashboards
  • Fully containerised — runs locally with Docker Compose
  • Cost-effective — replicates £100K+ enterprise tooling with open source components

Technologies: Apache Airflow, Apache Kafka, Debezium, Apache Spark, dbt, PostgreSQL, MinIO, Prometheus, Grafana, Docker

Impact:

  • Demonstrates a complete modern data stack that any team can self-host
  • Open sourced to help data engineers learn and adopt best practices
  • Shows end-to-end CDC pipeline patterns rarely documented in full

E-commerce Data Pipeline

GitHub: kalluripradeep/ecommerce-data-pipeline

End-to-end data pipeline demonstrating modern data engineering practices. It simulates e-commerce transaction processing with a focus on data quality and testing.

Features:

  • PySpark transformations with unit tests
  • Apache Airflow orchestration
  • Data quality validation framework
  • dbt data modeling
  • Docker containerization
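
As a flavour of the testing approach, a pytest sketch for a PySpark transformation; the function under test is illustrative, not lifted from the repo:

    import pytest
    from pyspark.sql import SparkSession, functions as F

    def add_order_total(df):
        """Transformation under test: line total = quantity * unit_price."""
        return df.withColumn("total", F.col("quantity") * F.col("unit_price"))

    @pytest.fixture(scope="session")
    def spark():
        return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

    def test_add_order_total(spark):
        df = spark.createDataFrame([(2, 9.99)], ["quantity", "unit_price"])
        assert add_order_total(df).collect()[0]["total"] == pytest.approx(19.98)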

Technologies: PySpark, Apache Airflow, dbt, Docker, Python, SQL


Open Source Contributions

Beyond client and personal projects, I actively contribute to open-source data engineering tools:

  • Apache Airflow - 3 merged PRs: documentation, bug fixes, and community reviews (#58587, #59938, #61005)
  • dbt-core - 1 merged PR and 2 PRs in progress

View all open source contributions →


Technical Writing

I document lessons learned from these projects through technical writing:

  • 71,000+ views across Medium and Dev.to
  • 5 published articles on data pipeline architecture and debugging
  • Featured discussions on Reddit r/dataengineering

Read my technical articles →


Speaking Engagements

I share production lessons through conference talks and meetups:

  • Oxford Microsoft Data Platform Group (January 2026) - Completed
  • Topic: “From Raw to Refined: Building Production Data Pipelines That Scale”
  • Invited back for a dedicated Apache Airflow session

View all speaking engagements →

