Available for Big Data roles

Hager Elkahlawy

Big Data Engineer Trainee

Building scalable data pipelines & real-time systems with Kafka, Spark, and Hadoop.

3TB+
Data processed daily
99%
Pipeline reliability
+25%
Performance gain
Kafka stream live
Spark job · 25% faster
About

From Computer Engineering to Data Systems

I'm a Big Data Engineer with a Computer Engineering background and 2 years of hands-on experience designing scalable ETL pipelines, optimizing Spark jobs, and turning raw data into reliable signals teams can act on.

What started as curiosity for algorithms became a deep interest in how systems handle terabytes at speed. Today I work across Hadoop, Spark, Hive and Python to ship pipelines that process 3TB+ daily with 99% reliability in production.

I care about clean architecture, observability, and pipelines that don't wake engineers up at 3 AM. I'm now expanding into real-time streaming with Kafka and cloud-native deployments — building toward systems that scale with the business, not against it.

Computer Engineering

Bachelor's from The Open University, UK (2025) — strong foundations in distributed systems and algorithms.

Problem-Solving Mindset

I break problems down to their data flow primitives — sources, throughput, latency, failure modes.

Big Data in Production

2 years building pipelines that process 3TB+ daily with 99% reliability across Hadoop and Spark.

Always Shipping & Learning

From Kaggle competitions to IBM Data Science training — I learn by building real systems end-to-end.

Skills

Tools that move data at scale

A stack tuned for distributed processing, fault tolerance, and real-time analytics.

Big Data Engineer
Kafka
Spark
Hadoop
Python
SQL
Docker
Linux
AWS
Hive
HBase

Big Data Tools

  • Hadoop: 90%
  • Apache Spark: 90%
  • Hive: 75%
  • Kafka: 70%

Programming

  • Python: 90%
  • SQL: 90%
  • Bash: 70%

Data Engineering

  • ETL Pipelines: 90%
  • Data Cleaning: 88%
  • Data Modeling: 78%

Databases & Storage

  • HDFS: 85%
  • PostgreSQL: 80%
  • HBase: 65%

Cloud & Systems

  • Linux: 82%
  • Docker: 75%
  • AWS basics: 65%

ML & Analytics

  • ML Algorithms: 70%
  • Neural Networks: 65%
  • Visualization: 80%
Projects

Systems I've shipped

Real engineering work — not tutorials. Each project below highlights the problem, the architecture choices, and the measurable impact.

Featured architecture

Real-Time Supply Chain Analytics Platform

End-to-end streaming pipeline turning supply events into live operational decisions.

Pipeline: API → Kafka → Spark → Data Lake → Dashboard
Problem

Logistics teams reacted to delays hours after they occurred. Batch reporting couldn't surface bottlenecks fast enough to act on shipment SLAs.

Solution

Built a streaming-first architecture: event APIs feed Kafka topics, Spark Structured Streaming enriches and aggregates events, results land in a data lake, and a dashboard surfaces KPIs in seconds.

Key Features
  • Streaming + batch unified
  • Late-data handling with watermarks
  • Health checks & alerting
  • Containerized for reproducible deploys
Impact
  • Cut delay-detection latency from hours to seconds
  • Single source of truth for ops + finance
  • Auto-alerts on shipment SLAs
Kafka · Spark Structured Streaming · Python · HDFS · Docker · PostgreSQL
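
The late-data handling listed under Key Features can be illustrated with a small sketch in plain Python. It is not the Spark Structured Streaming code from the project; it only mimics the idea of a watermark, and the window size, allowed lateness, and function name are assumptions chosen for the example.

```python
from collections import defaultdict

WINDOW_SECONDS = 60     # tumbling window size (assumed for illustration)
ALLOWED_LATENESS = 5    # how far the watermark trails the newest event (assumed)


def aggregate_with_watermark(events):
    """Count events per (window, key) and drop events that arrive too late.

    `events` is an iterable of (event_time_seconds, key) in arrival order.
    The watermark is the largest event time seen so far minus the allowed
    lateness. An event is dropped once its window has ended at or before
    the watermark. This is a simplification of what a streaming engine does.
    """
    counts = defaultdict(int)
    dropped = []
    max_event_time = float("-inf")

    for event_time, key in events:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - ALLOWED_LATENESS

        window_start = (event_time // WINDOW_SECONDS) * WINDOW_SECONDS
        window_end = window_start + WINDOW_SECONDS

        if window_end <= watermark:
            # The window is already considered final: record and skip.
            dropped.append((event_time, key))
            continue

        counts[(window_start, key)] += 1

    return dict(counts), dropped
```

With these settings an event stamped at second 59 that arrives just after one stamped at second 62 is still counted in the first window, while an event stamped at second 30 that arrives after second 70 has been seen is dropped.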
Graduation Project · 2025

Diabetic Retinopathy Detection — Web System

Web-based AI system that detects diabetic retinopathy from retinal images, backed by a Big Data pipeline.

Pipeline: Retinal Images → HDFS → Spark + Hive → Model → Web UI
Problem

Screening thousands of retinal images per day overwhelms specialists and delays early diagnosis.

Solution

Leveraged Hadoop, Spark and Hive to process large-scale retinal-image datasets efficiently, with scalable pipelines feeding a model that achieves high prediction accuracy for early diagnosis.

Key Features
  • Distributed image preprocessing
  • Scalable Big Data pipeline
  • Web UI for clinicians
  • Built on Hadoop / Spark / Hive
Impact
  • High prediction accuracy enabling early diagnosis
  • Processes thousands of retinal images daily
  • Better patient outcomes via faster screening
Hadoop · Spark · Hive · Python
Production · 2024

Big Data Pipeline Optimization

Robust ETL pipeline capable of handling 3TB of data daily, tuned for throughput and reliability.

Pipeline: Sources → Hive Staging → Spark ETL → Curated Layer → BI
Problem

ETL jobs were the bottleneck as data volume grew toward 3TB/day, slowing reports and stretching prep time.

Solution

Established a robust ETL pipeline on Hadoop and Spark, then optimized Spark workflows — query tuning, partitioning, and join refactors — to lift data-processing performance by 25%.

Key Features
  • Skew-aware Spark joins
  • Hive partition pruning
  • Idempotent retries
  • Per-job metrics & SLAs
Impact
  • +25% data-processing performance
  • −30% data-preparation time
  • 99% data reliability in production
Hadoop · Spark · Hive · Python · SQL
System Architecture

How the data actually flows

The mental model I use when designing real-time analytics pipelines.

Pipeline diagram
Source → Kafka → Spark → HDFS → Insights
01 · Sources: APIs, app logs, retinal-image batches
02 · Kafka: Streaming buffer for 3TB/day ingest
03 · Spark: Distributed ETL & transformations
04 · Hive on HDFS: Partitioned warehouse layer
05 · Curated Data Lake: 99% reliable, query-ready datasets
06 · Dashboards & ML: Reporting + diabetic-retinopathy model
3TB+ processed daily
+25% Spark performance gain
99% production reliability

Why Kafka

Pipelines ingest 3TB+ of mixed batch and streaming data daily. Kafka decouples producers from Spark, absorbs traffic spikes, and keeps a replayable log so reprocessing never hits the source systems.

Why Spark

Spark is the workhorse for the heavy ETL: distributed joins, aggregations and image-batch processing for the retinopathy project. Tuning shuffles, partitions and caching delivered the 25% performance gain noted in my CV.

Hive + Python ETL

Hive handles SQL-style transformations on HDFS while Python orchestrates cleaning, validation and feature prep — the same workflow that cut data-preparation time by 30%.

Performance Tuning

Query rewrites, broadcast joins, partition pruning and right-sized executors turn long-running Spark jobs into predictable ones — measured, not guessed.
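
To make partition pruning concrete, here is a minimal sketch in plain Python. It is only an illustration of the idea, not how Hive or Spark implement it: the paths, the `dt` and `region` partition columns, and the helper names are all hypothetical.

```python
def parse_partition(path):
    """Extract Hive-style key=value directory segments from a path."""
    parts = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            parts[key] = value
    return parts


def prune(paths, **predicates):
    """Keep only the paths whose partition values satisfy every predicate.

    Each predicate is a callable that receives the partition value as a
    string. Files in partitions that fail a predicate are never opened,
    which is where the saving comes from.
    """
    kept = []
    for path in paths:
        parts = parse_partition(path)
        if all(key in parts and check(parts[key])
               for key, check in predicates.items()):
            kept.append(path)
    return kept


# Hypothetical warehouse layout partitioned by date and region.
paths = [
    "/warehouse/orders/dt=2024-01-01/region=eu/part-0.parquet",
    "/warehouse/orders/dt=2024-01-02/region=eu/part-0.parquet",
    "/warehouse/orders/dt=2024-01-02/region=us/part-0.parquet",
]
selected = prune(
    paths,
    dt=lambda value: value >= "2024-01-02",
    region=lambda value: value == "eu",
)
```

Only one of the three files survives the filter, so a query with that predicate reads a third of the data in this toy layout.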

99% Reliability

Idempotent writes, retries and checkpointing keep production pipelines stable. When a stage fails, it replays from the last safe offset instead of corrupting downstream tables.
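
The replay behaviour described above can be sketched in a few lines of plain Python. This is an illustrative model under stated assumptions, not the production code: `IdempotentSink`, `process`, and the in-memory list standing in for a Kafka partition are hypothetical.

```python
class IdempotentSink:
    """Stores rows keyed by (partition, offset), so a replay overwrites
    existing rows instead of appending duplicates."""

    def __init__(self):
        self.rows = {}
        self.committed_offset = -1  # checkpoint: last offset known to be safe

    def write_batch(self, partition, batch):
        for offset, value in batch:
            self.rows[(partition, offset)] = value

    def commit(self, offset):
        self.committed_offset = offset


def process(log, sink, batch_size, fail_at_batch=None):
    """Process `log` (a list where index == offset) from the last checkpoint.

    If `fail_at_batch` matches the current batch number, raise after the
    write but before the commit, to simulate a crash mid-stage.
    """
    start = sink.committed_offset + 1
    batch_no = 0
    for begin in range(start, len(log), batch_size):
        end = min(begin + batch_size, len(log))
        batch = [(offset, log[offset]) for offset in range(begin, end)]
        sink.write_batch(0, batch)
        if fail_at_batch == batch_no:
            raise RuntimeError("simulated crash before commit")
        sink.commit(batch[-1][0])
        batch_no += 1
```

After a simulated crash, calling `process` again resumes from `committed_offset + 1`. The rows written before the crash are rewritten under the same keys, so the sink ends up with exactly one row per offset.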

Built for Scale

Partition-aware design across Kafka topics, HDFS layout and Spark executors — the same architecture scales from thousands of retinal images to multi-terabyte daily loads.

Real-Time Demo

A glimpse of the system in motion

Mock telemetry from a streaming pipeline — the same shape of dashboards I build in production.

Events / sec
12,480
Kafka throughput
Batch latency
612ms
Spark micro-batch
Pipeline health
99.4%
Last 24h
Live stream — orders.events
tail -f
[INFO] kafka: consumer group=ingest-1 offset=10238421
[INFO] spark: micro-batch 8421 processed 12,488 events in 612ms
[INFO] sink: wrote 12,488 rows to s3://lake/curated/orders
[WARN] watermark: late event delta=3.1s within tolerance
Monitoring
  • Consumer lag: OK
  • Disk usage: 62%
  • Late events: 0.3%
  • Failed jobs (24h): 0
Engineering Thinking

How I think as a Data Engineer

Volume first

What is the data shape today and at 10x? Decisions made for MB break at TB.

Latency budget

Real-time vs near-real-time vs batch is a product question, not a tech one.

Reliability over cleverness

Idempotent writes, retries, and clear failure modes beat exotic tricks.

Clean pipeline design

Bronze → Silver → Gold layers; transformations are pure and testable.
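
As a rough illustration of what "pure and testable" means here, the sketch below models two layer transitions as plain Python functions. The field names (`id`, `region`, `amount`) and the cleaning rules are assumptions made up for the example.

```python
def to_silver(bronze_rows):
    """Bronze -> Silver: drop rows without an id, normalise types,
    and deduplicate by id (the last occurrence wins).

    Returns new dicts and never mutates the input rows.
    """
    cleaned = {}
    for row in bronze_rows:
        if not row.get("id"):
            continue
        cleaned[row["id"]] = {
            "id": row["id"],
            "region": str(row.get("region", "unknown")).lower(),
            "amount": float(row.get("amount", 0)),
        }
    return list(cleaned.values())


def to_gold(silver_rows):
    """Silver -> Gold: aggregate cleaned rows into total amount per region."""
    totals = {}
    for row in silver_rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


# Hypothetical raw input with a duplicate, a missing id, and mixed casing.
bronze = [
    {"id": "1", "region": "EU", "amount": "10"},
    {"id": "1", "region": "EU", "amount": "12"},
    {"id": None, "region": "US", "amount": "5"},
    {"id": "2", "region": "us", "amount": "3.5"},
]
gold = to_gold(to_silver(bronze))
```

Because each function depends only on its input, every layer can be unit-tested with a handful of literal rows and no cluster.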

Scalability mindset

Partition on access patterns. Avoid hot keys. Plan for skew before it bites.

Observable by default

If a stage isn't logged and metered, it doesn't exist when something breaks.

Challenges Solved

Real bugs, real fixes

Problem

Kafka consumers stalled under traffic spikes

What I did

Tuned partition count + consumer group concurrency, switched to manual offset commits, added lag-based autoscaling triggers.

Result

Stable consumer lag under 2s during peaks; no missed events.
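
A lag-based scaling trigger of this kind can be sketched as follows. This is a simplified model, not the actual autoscaler: the function names, the per-consumer throughput figure, and the drain target are assumptions. The cap at the partition count reflects how Kafka consumer groups work, since extra consumers beyond the number of partitions sit idle.

```python
import math


def total_lag(end_offsets, committed_offsets):
    """Sum of (log end offset - committed offset) across partitions."""
    return sum(
        max(end_offsets[p] - committed_offsets.get(p, 0), 0)
        for p in end_offsets
    )


def desired_consumers(lag, per_consumer_rate, target_seconds,
                      num_partitions, min_consumers=1):
    """Consumers needed to drain `lag` messages within `target_seconds`.

    `per_consumer_rate` is messages per second one consumer can handle
    (an assumed, measured figure). The result is clamped between
    `min_consumers` and `num_partitions`.
    """
    if lag <= 0:
        needed = min_consumers
    else:
        needed = math.ceil(lag / (per_consumer_rate * target_seconds))
    return max(min_consumers, min(needed, num_partitions))


# Hypothetical snapshot of a three-partition topic.
end_offsets = {0: 1000, 1: 2000, 2: 1500}
committed = {0: 900, 1: 1000, 2: 1500}
lag = total_lag(end_offsets, committed)
consumers = desired_consumers(lag, per_consumer_rate=200,
                              target_seconds=2, num_partitions=3)
```

If the computed value keeps hitting the partition cap during peaks, that is a signal to raise the partition count rather than add more consumers.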

Problem

Docker containers conflicted in local dev environment

What I did

Standardized service ports, externalized configs to .env, and built a single docker-compose covering Kafka, Spark, and Postgres.

Result

One-command spin-up; new contributors productive on day one.

Problem

Skewed joins killed Spark job performance

What I did

Profiled stages with the Spark UI, salted hot keys, and broadcast small dimension tables.

Result

Job runtime dropped ~40%, freeing the cluster for downstream work.
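
The salting idea can be shown without a cluster. The sketch below is plain Python rather than Spark code, and the salt count, function names, and sample rows are assumptions. It shows the two halves of the technique: hot keys on the large side get a random salt, and the matching rows on the small side are replicated once per salt so the join still finds them.

```python
import random

NUM_SALTS = 8  # number of buckets a hot key is spread over (assumed)


def salt_fact_key(key, hot_keys, rng):
    """Attach a random salt to hot keys; other keys always get salt 0."""
    if key in hot_keys:
        return (key, rng.randrange(NUM_SALTS))
    return (key, 0)


def explode_dim_key(key, hot_keys):
    """Replicate a dimension key once per salt if it is hot, else once."""
    salts = range(NUM_SALTS) if key in hot_keys else range(1)
    return [(key, salt) for salt in salts]


def salted_join(fact_rows, dim_rows, hot_keys, seed=0):
    """Join (key, payload) facts to (key, value) dimensions on salted keys.

    Returns tuples of (key, payload, dim_value, salt). In a distributed
    engine the (key, salt) pair would decide the partition, so a hot key
    is spread over up to NUM_SALTS partitions instead of one.
    """
    rng = random.Random(seed)
    dim_index = {}
    for key, value in dim_rows:
        for salted in explode_dim_key(key, hot_keys):
            dim_index[salted] = value

    joined = []
    for key, payload in fact_rows:
        salted = salt_fact_key(key, hot_keys, rng)
        if salted in dim_index:
            joined.append((key, payload, dim_index[salted], salted[1]))
    return joined
```

The trade-off is that the dimension side grows by a factor of `NUM_SALTS` for hot keys, so salting is only worth applying to the keys that profiling shows are actually skewed.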

Problem

Dirty source data corrupted reports

What I did

Added a typed validation layer, quarantined bad records, and surfaced a daily quality report instead of failing silently.

Result

Trust restored — analytics team stopped hand-checking outputs.
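
A typed validation layer with quarantine can look roughly like the sketch below. It is a minimal illustration in plain Python: the `Order` schema, its fields, and the validation rules are hypothetical, not the real source schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Order:
    order_id: str
    amount: float
    created_at: datetime


def parse_order(raw):
    """Convert a raw dict into a typed Order, raising ValueError if invalid."""
    order_id = str(raw.get("order_id") or "").strip()
    if not order_id:
        raise ValueError("missing order_id")
    try:
        amount = float(raw["amount"])
        created_at = datetime.fromisoformat(raw["created_at"])
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"bad field: {exc}") from exc
    if amount < 0:
        raise ValueError("negative amount")
    return Order(order_id, amount, created_at)


def validate(records):
    """Split records into typed valid rows and quarantined rows with a reason."""
    valid, quarantined = [], []
    for raw in records:
        try:
            valid.append(parse_order(raw))
        except ValueError as exc:
            quarantined.append({"record": raw, "reason": str(exc)})
    return valid, quarantined


def quality_report(valid, quarantined):
    """Summarise one run so bad data is visible instead of silently dropped."""
    total = len(valid) + len(quarantined)
    return {
        "total": total,
        "valid": len(valid),
        "quarantined": len(quarantined),
        "valid_ratio": len(valid) / total if total else 1.0,
    }
```

Keeping the original record and the failure reason together in the quarantine output means bad rows can be inspected and replayed once the source is fixed.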

Impact

Numbers and outcomes that matter

3TB+
processed daily across pipelines
+25%
faster Spark processing after tuning
−30%
less data prep time via ETL refactors
99%
pipeline reliability in production
  • Real-time insights replaced overnight reports for ops teams
  • Faster monitoring with proactive alerts instead of post-mortems
  • Better decisions from a single curated source of truth
  • Monitoring automation freed engineering time for shipping
  • Scalable pipeline design — same code, more data, no rewrites
What's Next

Future improvements I'm building toward

Airflow orchestration

Move ad-hoc cron jobs to typed DAGs with retries and lineage.

Deploy on AWS / Huawei Cloud

Managed Kafka (MSK) + EMR/Spark to remove ops overhead.

Integrate real APIs

Replace mock producers with live partner APIs and schema registry.

Improve dashboard analytics

Add cohort, funnel, and anomaly views on top of curated data.

Monitoring & alerting

Prometheus + Grafana + paging on SLO breaches and pipeline lag.

Experience

A timeline of building & learning

Jan 2024 – Present

Big Data Engineer

Software Company
  • Designed and maintained scalable data pipelines processing 3TB+ daily on Hadoop & Spark.
  • Improved data processing performance by 25% via Spark tuning and query optimization.
  • Built ETL workflows in Hive + Python — cut data preparation time by 30%.
  • Partnered with engineering teams to ship data solutions to production at 99% reliability.
2025

B.Eng., Computer Engineering

The Open University, UK
  • Graduation Project: Diabetic Retinopathy Detection — AI + Big Data web system.
  • Coursework spanning distributed systems, data structures, and machine learning.
2023 – 2024

Certifications & Training

Coursera · IBM · Udemy
  • Big Data Analytics Professional Certificate — Hadoop, Spark, Hive, Pipelines.
  • IBM Data Science & Big Data Tools Training — Python, SQL, Visualization.
  • AI & Machine Learning Certificate — Algorithms, Neural Networks, Python ML.
Tech Stack

The toolbox

Apache Kafka · Apache Spark · Hadoop · Hive · HBase · ClickHouse · Python · SQL · Bash · Docker · Linux · Airflow · PostgreSQL · HDFS · AWS · Huawei Cloud · TensorFlow · Pandas
Contact

Let's build scalable data systems together 🚀

Open to Big Data Engineering roles, internships, and collaborations on real-time data systems.