Available for Big Data roles

Hager Elkahlawy

Big Data Engineer Trainee

Building scalable data pipelines & real-time systems with Kafka, Spark, and Hadoop.

3TB+

Data processed daily

99%

Pipeline reliability

+25%

Performance gain

Kafka stream live

Spark job · 25% faster

About

From Computer Engineering to Data Systems

I'm a Big Data Engineer with a Computer Engineering background and 2 years of hands-on experience designing scalable ETL pipelines, optimizing Spark jobs, and turning raw data into reliable signals teams can act on.

What started as curiosity for algorithms became a deep interest in how systems handle terabytes at speed. Today I work across Hadoop, Spark, Hive and Python to ship pipelines that process 3TB+ daily with 99% reliability in production.

I care about clean architecture, observability, and pipelines that don't wake engineers up at 3 AM. I'm now expanding into real-time streaming with Kafka and cloud-native deployments — building toward systems that scale with the business, not against it.

Computer Engineering

Bachelor's from The Open University, UK (2025) — strong foundations in distributed systems and algorithms.

Problem-Solving Mindset

I break problems down to their data flow primitives — sources, throughput, latency, failure modes.

Big Data in Production

2 years building pipelines that process 3TB+ daily with 99% reliability across Hadoop and Spark.

Always Shipping & Learning

From Kaggle competitions to IBM Data Science training — I learn by building real systems end-to-end.

Skills

Tools that move data at scale

A stack tuned for distributed processing, fault tolerance, and real-time analytics.

Big Data

Engineer

Kafka

Spark

Hadoop

Python

SQL

Docker

Linux

AWS

Hive

HBase

Big Data Tools

Hadoop: 90%
Apache Spark: 90%
Hive: 75%
Kafka: 70%

Programming

Python: 90%
SQL: 90%
Bash: 70%

Data Engineering

ETL Pipelines: 90%
Data Cleaning: 88%
Data Modeling: 78%

Databases & Storage

HDFS: 85%
PostgreSQL: 80%
HBase: 65%

Cloud & Systems

Linux: 82%
Docker: 75%
AWS basics: 65%

ML & Analytics

ML Algorithms: 70%
Neural Networks: 65%
Visualization: 80%

Projects

Systems I've shipped

Real engineering work — not tutorials. Each project below highlights the problem, the architecture choices, and the measurable impact.

Featured architecture

Real-Time Supply Chain Analytics Platform

End-to-end streaming pipeline turning supply events into live operational decisions.

Pipeline

API

Kafka

Spark

Data Lake

Dashboard

Problem

Logistics teams reacted to delays hours after they occurred. Batch reporting couldn't surface bottlenecks fast enough to act on shipment SLAs.

Solution

Built a streaming-first architecture: event APIs feed Kafka topics, Spark Structured Streaming enriches and aggregates events, results land in a data lake, and a dashboard surfaces KPIs in seconds.

Key Features

· Streaming + batch unified
· Late-data handling with watermarks
· Health checks & alerting
· Containerized for reproducible deploys
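
The late-data handling listed above can be illustrated without a cluster. This is a minimal pure-Python sketch of the watermark idea, not Spark's API: events are counted into tumbling event-time windows, and an event is dropped once the watermark (highest event time seen minus an allowed lateness) has passed the end of its window. The function name, the 10-second window, and the 5-second lateness are assumptions chosen for illustration.

```python
from collections import defaultdict

def window_counts(events, window_s=10, allowed_lateness_s=5):
    """Count events per tumbling event-time window, dropping events that arrive too late.

    `events` is an iterable of event-time timestamps (seconds) in arrival order.
    Returns (counts_by_window_start, dropped_events).
    """
    counts = defaultdict(int)
    dropped = []
    max_event_time = float("-inf")
    for ts in events:
        max_event_time = max(max_event_time, ts)
        watermark = max_event_time - allowed_lateness_s
        window_start = (ts // window_s) * window_s
        window_end = window_start + window_s
        if window_end <= watermark:
            dropped.append(ts)          # the window is already considered closed
        else:
            counts[window_start] += 1   # on time, or late but within tolerance
    return dict(counts), dropped

# Arrival order differs from event-time order: 8 and 3 arrive after newer events.
counts, dropped = window_counts([1, 4, 12, 8, 21, 3])
```

In this run the event at t=8 is still counted, because the watermark (7) has not yet passed the end of its window (10), while the event at t=3 arrives after the watermark has moved to 16 and is dropped.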

Impact

→ Cut delay-detection latency from hours to seconds
→ Single source of truth for ops + finance
→ Auto-alerts on shipment SLAs

Kafka · Spark Structured Streaming · Python · HDFS · Docker · PostgreSQL

Graduation Project · 2025

Diabetic Retinopathy Detection — Web System

Web-based AI system that detects diabetic retinopathy from retinal images, backed by a Big Data pipeline.

Pipeline

Retinal Images

HDFS

Spark + Hive

Model

Web UI

Problem

Screening thousands of retinal images per day overwhelms specialists and delays early diagnosis.

Solution

Leveraged Hadoop, Spark and Hive to process large-scale retinal-image datasets efficiently, with scalable pipelines feeding a model that achieves high prediction accuracy for early diagnosis.

Key Features

· Distributed image preprocessing
· Scalable Big Data pipeline
· Web UI for clinicians
· Built on Hadoop / Spark / Hive

Impact

→ High prediction accuracy enabling early diagnosis
→ Processes thousands of retinal images daily
→ Better patient outcomes via faster screening

Hadoop · Spark · Hive · Python

Production · 2024

Big Data Pipeline Optimization

Robust ETL pipeline capable of handling 3TB of data daily, tuned for throughput and reliability.

Pipeline

Sources

Hive Staging

Spark ETL

Curated Layer

Problem

ETL jobs were the bottleneck as data volume grew toward 3TB/day, slowing reports and stretching prep time.

Solution

Established a robust ETL pipeline on Hadoop and Spark, then optimized Spark workflows — query tuning, partitioning, and join refactors — to lift data-processing performance by 25%.

Key Features

· Skew-aware Spark joins
· Hive partition pruning
· Idempotent retries
· Per-job metrics & SLAs
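
Partition pruning, one of the features listed above, comes down to deciding which partition directories to read before any data is scanned. The sketch below shows that selection step in plain Python, assuming Hive-style `dt=YYYY-MM-DD` directory names; the paths and the function name are hypothetical and not taken from the production pipeline.

```python
from datetime import date

def prune_partitions(partition_paths, start, end):
    """Keep only partitions whose dt=YYYY-MM-DD value falls in [start, end]."""
    selected = []
    for path in partition_paths:
        # Expect a Hive-style final segment such as ".../dt=2024-03-02".
        segment = path.rstrip("/").split("/")[-1]
        key, _, value = segment.partition("=")
        if key != "dt":
            continue
        if start <= date.fromisoformat(value) <= end:
            selected.append(path)
    return selected

# Hypothetical warehouse layout, for illustration only.
paths = [
    "/warehouse/events/dt=2024-03-01",
    "/warehouse/events/dt=2024-03-02",
    "/warehouse/events/dt=2024-03-03",
    "/warehouse/events/dt=2024-03-04",
]
to_scan = prune_partitions(paths, date(2024, 3, 2), date(2024, 3, 3))
```

Only two of the four directories are selected, so a query filtered on that date range would touch half of the data in this toy layout.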

Impact

→ +25% data-processing performance
→ −30% data-preparation time
→ 99% data reliability in production

Hadoop · Spark · Hive · Python · SQL

System Architecture

How the data actually flows

The mental model I use when designing real-time analytics pipelines.

Pipeline diagram

Source → Kafka → Spark → HDFS → Insights

Live in production · 3TB/day

Sources

APIs, app logs, retinal-image batches

Kafka

Streaming buffer for 3TB/day ingest

Spark

Distributed ETL & transformations

Hive on HDFS

Partitioned warehouse layer

Curated Data Lake

99% reliable, query-ready datasets

Dashboards & ML

Reporting + diabetic-retinopathy model

3TB+ processed daily

+25% Spark performance gain

99% production reliability

Why Kafka

Pipelines ingest 3TB+ of mixed batch and streaming data daily. Kafka decouples producers from Spark, absorbs traffic spikes, and keeps a replayable log, so reprocessing within the retention window does not hit the source systems.
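
The replayable-log idea can be shown with a toy model. This is a pure-Python sketch, not the Kafka client API: an append-only list stands in for a topic partition, and each consumer group keeps its own offset that can be rewound. The class and method names are assumptions made for illustration.

```python
class ReplayableLog:
    """Append-only log where each consumer group tracks its own read offset."""

    def __init__(self):
        self._records = []
        self._offsets = {}

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def poll(self, group, max_records=10):
        start = self._offsets.get(group, 0)
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch

    def seek(self, group, offset):
        """Rewind (or fast-forward) a group so records can be read again."""
        self._offsets[group] = offset

log = ReplayableLog()
for event in ["order-1", "order-2", "order-3"]:
    log.append(event)

first_pass = log.poll("etl")
log.seek("etl", 1)            # reprocess from offset 1 without asking the producer again
replayed = log.poll("etl")
```

The producer writes each event once; the second read is served entirely from the log, which is the property that keeps reprocessing away from the source systems.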

Why Spark

Spark is the workhorse for the heavy ETL — distributed joins, aggregations and image-batch processing for the retinopathy project. Tuning shuffles, partitions and caching delivered the 25% performance gain documented in the CV.

Hive + Python ETL

Hive handles SQL-style transformations on HDFS while Python orchestrates cleaning, validation and feature prep — the same workflow that cut data-preparation time by 30%.

Performance Tuning

Query rewrites, broadcast joins, partition pruning and right-sized executors turn long-running Spark jobs into predictable ones — measured, not guessed.
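
To make the broadcast-join point concrete, here is a minimal sketch of the underlying technique in plain Python rather than Spark: the small table is turned into an in-memory lookup, and the large side is streamed past it with no shuffle. The table contents and the function name are hypothetical, and the sketch assumes join keys are unique on the small side.

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Join each large-side row to the small side via an in-memory lookup table."""
    lookup = {row[key]: row for row in small_rows}   # the "broadcast" copy
    joined = []
    for row in large_rows:
        match = lookup.get(row[key])
        if match is not None:                         # inner-join semantics
            joined.append({**match, **row})
    return joined

# Hypothetical data, for illustration only.
shipments = [
    {"carrier_id": 1, "shipment": "A"},
    {"carrier_id": 2, "shipment": "B"},
    {"carrier_id": 9, "shipment": "C"},  # no matching carrier
]
carriers = [
    {"carrier_id": 1, "name": "North"},
    {"carrier_id": 2, "name": "South"},
]
result = broadcast_hash_join(shipments, carriers, "carrier_id")
```

The trade-off is memory: the approach only pays off when the small side fits comfortably on every worker.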

99% Reliability

Idempotent writes, retries and checkpointing keep production pipelines stable. When a stage fails, it replays from the last safe offset instead of corrupting downstream tables.
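
A small simulation shows why idempotent writes and checkpointing work together. This is a sketch under stated assumptions, not the production code: the sink is a keyed in-memory table, the checkpoint is the next offset to process, and the crash is injected after a write but before the checkpoint advances. All names are hypothetical.

```python
class Sink:
    def __init__(self):
        self.rows = {}        # keyed table: upserts make repeated writes harmless
        self.checkpoint = 0   # next offset to process
        self.writes = 0       # total write attempts, tracked for illustration

def run(source, sink, crash_after_write_at=None):
    """Process records from the last checkpoint, optionally simulating one crash."""
    for offset in range(sink.checkpoint, len(source)):
        key, value = source[offset]
        sink.rows[key] = value
        sink.writes += 1
        if offset == crash_after_write_at:
            raise RuntimeError(f"simulated crash at offset {offset}")
        sink.checkpoint = offset + 1   # only advance after a completed step

source = [("s1", "created"), ("s2", "created"), ("s3", "delayed")]
sink = Sink()
try:
    run(source, sink, crash_after_write_at=1)
except RuntimeError:
    pass
run(source, sink)  # the retry resumes from the checkpoint
```

The record at offset 1 is written twice (four write attempts for three records), yet the table ends with exactly three rows because the second write overwrites the first instead of appending a duplicate.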

Built for Scale

Partition-aware design across Kafka topics, HDFS layout and Spark executors — the same architecture scales from thousands of retinal images to multi-terabyte daily loads.

Real-Time Demo

A glimpse of the system in motion

Mock telemetry from a streaming pipeline — the same shape of dashboards I build in production.

Events / sec

12,480

Kafka throughput

Batch latency

612ms

Spark micro-batch

Pipeline health

99.4%

Last 24h

Live stream — orders.events

tail -f

[INFO] kafka: consumer group=ingest-1 offset=10238421
[INFO] spark: micro-batch 8421 processed 12,488 events in 612ms
[INFO] sink: wrote 12,488 rows to s3://lake/curated/orders
[WARN] watermark: late event delta=3.1s within tolerance

Monitoring

Consumer lag: OK
Disk usage: 62%
Late events: 0.3%
Failed jobs (24h): 0

Engineering Thinking

How I think as a Data Engineer

Volume first

What is the data shape today and at 10x? Decisions made for MB break at TB.

Latency budget

Real-time vs near-real-time vs batch is a product question, not a tech one.

Reliability over cleverness

Idempotent writes, retries, and clear failure modes beat exotic tricks.

Clean pipeline design

Bronze → Silver → Gold layers; transformations are pure and testable.
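
As a rough illustration of layered, pure transformations, the sketch below models each layer as a function that takes rows and returns new rows without mutating its input. The field names and cleaning rules are assumptions for the example, not the schema of any real pipeline.

```python
def to_silver(bronze_rows):
    """Clean raw rows: drop records without an id, normalize types and casing."""
    return [
        {"id": r["id"], "status": r["status"].strip().lower(), "qty": int(r["qty"])}
        for r in bronze_rows
        if r.get("id") is not None
    ]

def to_gold(silver_rows):
    """Aggregate cleaned rows into total quantity per status."""
    totals = {}
    for r in silver_rows:
        totals[r["status"]] = totals.get(r["status"], 0) + r["qty"]
    return totals

# Hypothetical raw input, for illustration only.
bronze = [
    {"id": 1, "status": " Delivered ", "qty": "3"},
    {"id": None, "status": "delayed", "qty": "1"},
    {"id": 2, "status": "DELAYED", "qty": "2"},
    {"id": 3, "status": "delivered", "qty": "4"},
]
gold = to_gold(to_silver(bronze))
```

Because neither function touches shared state, each layer can be unit-tested with a handful of literal rows, which is what "testable" means in practice here.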

Scalability mindset

Partition on access patterns. Avoid hot keys. Plan for skew before it bites.

Observable by default

If a stage isn't logged and metered, it doesn't exist when something breaks.

Challenges Solved

Real bugs, real fixes

Problem

Kafka consumers stalled under traffic spikes

What I did

Tuned partition count + consumer group concurrency, switched to manual offset commits, added lag-based autoscaling triggers.

Result

Stable consumer lag under 2s during peaks; no missed events.
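
A lag-based scaling trigger of the kind described above might be sketched as follows. This is an assumption-laden illustration in plain Python, not the Kafka or autoscaler API: lag is the end offset minus the committed offset per partition, and the target consumer count is capped at the partition count because extra consumers in a group would sit idle. The capacity figure and function names are hypothetical.

```python
import math

def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: latest offset in the log minus the group's committed offset."""
    return {p: end_offsets[p] - committed_offsets.get(p, 0) for p in end_offsets}

def desired_consumers(lag_by_partition, per_consumer_capacity, max_consumers):
    """Scale to cover total lag, capped by max_consumers and the partition count."""
    total = sum(lag_by_partition.values())
    needed = max(1, math.ceil(total / per_consumer_capacity))
    return min(needed, max_consumers, len(lag_by_partition))

lag = consumer_lag({0: 1200, 1: 900, 2: 1500}, {0: 1000, 1: 900, 2: 700})
replicas = desired_consumers(lag, per_consumer_capacity=400, max_consumers=8)
```

With a total lag of 1,000 records and an assumed capacity of 400 per consumer, the sketch asks for three consumers, which also happens to be the partition count and therefore the ceiling.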

Problem

Docker containers conflicted in local dev environment

What I did

Standardized service ports, externalized configs to .env, and built a single docker-compose covering Kafka, Spark, and Postgres.

Result

One-command spin-up; new contributors productive on day one.

Problem

Skewed joins killed Spark job performance

What I did

Profiled stages with the Spark UI, salted hot keys, and broadcast small dimension tables.

Result

Job runtime dropped ~40%, freeing the cluster for downstream work.
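
Key salting can be demonstrated on a toy dataset. The sketch below is plain Python rather than Spark and uses hypothetical key names: the large side appends a random salt to each key so one hot key spreads over several groups, and the small side is replicated once per salt so every salted key still finds its match.

```python
import random
from collections import Counter

def salt_key(key, num_salts, rng):
    """Large side: spread one key across num_salts sub-keys."""
    return f"{key}#{rng.randrange(num_salts)}"

def explode_small_side(rows, num_salts):
    """Small side: replicate each row once per salt value."""
    return {f"{key}#{s}": value for key, value in rows for s in range(num_salts)}

rng = random.Random(0)                       # seeded so the run is repeatable
events = ["hot"] * 1000 + ["cold"] * 10      # one key dominates the large side
salted = [salt_key(k, 4, rng) for k in events]
dims = explode_small_side([("hot", "H"), ("cold", "C")], 4)

largest_group = max(Counter(salted).values())
all_matched = all(k in dims for k in salted)
```

Before salting, the largest group holds all 1,000 "hot" rows; afterwards it holds roughly a quarter of them, at the cost of a small side that is four times larger.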

Problem

Dirty source data corrupted reports

What I did

Added a typed validation layer, quarantined bad records, and surfaced a daily quality report instead of failing silently.

Result

Trust restored — analytics team stopped hand-checking outputs.
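
One possible shape for a typed validation layer with quarantine is sketched below. The `Order` fields, the non-negative rule, and the report format are assumptions for illustration; the point is that bad records are set aside with a reason instead of failing the job or passing through silently.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Order:
    order_id: int
    amount: float

def parse_order(raw):
    """Convert a raw dict into a typed Order, raising ValueError on bad input."""
    try:
        order = Order(order_id=int(raw["order_id"]), amount=float(raw["amount"]))
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"unparseable record: {exc!r}") from exc
    if order.amount < 0:
        raise ValueError("amount must be non-negative")
    return order

def validate(raw_records):
    """Split records into valid rows and a quarantine list with the reason attached."""
    valid, quarantine = [], []
    for raw in raw_records:
        try:
            valid.append(parse_order(raw))
        except ValueError as exc:
            quarantine.append({"record": raw, "reason": str(exc)})
    return valid, quarantine

valid, quarantine = validate([
    {"order_id": "1", "amount": "19.5"},
    {"order_id": "x", "amount": "3"},    # bad id
    {"order_id": "2"},                   # missing amount
    {"order_id": "3", "amount": "-4"},   # fails the business rule
])
quality_report = {"valid": len(valid), "quarantined": len(quarantine)}
```

The counts in `quality_report` are the kind of figures a daily quality report could surface, alongside the quarantined records and their reasons.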

Impact

Numbers and outcomes that matter

3TB+

processed daily across pipelines

+25%

faster Spark processing after tuning

−30%

less data prep time via ETL refactors

99%

pipeline reliability in production

· Real-time insights replaced overnight reports for ops teams
· Faster monitoring with proactive alerts instead of post-mortems
· Better decisions from a single curated source of truth
· Monitoring automation freed engineering time for shipping
· Scalable pipeline design: same code, more data, no rewrites

What's Next

Future improvements I'm building toward

Airflow orchestration

Move ad-hoc cron jobs to typed DAGs with retries and lineage.
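
What a DAG with retries adds over cron can be sketched with the standard library alone. This is not Airflow's API; it is a minimal illustration, using hypothetical task names, of running tasks in dependency order and retrying a failed task a bounded number of times.

```python
from graphlib import TopologicalSorter

def run_dag(tasks, dependencies, max_retries=2):
    """Run callables in dependency order, retrying each failed task up to max_retries times."""
    attempts = {}
    for name in TopologicalSorter(dependencies).static_order():
        for attempt in range(1, max_retries + 2):
            attempts[name] = attempt
            try:
                tasks[name]()
                break
            except Exception:
                if attempt == max_retries + 1:
                    raise   # retries exhausted: fail the run loudly
    return attempts

order = []
transform_calls = 0

def extract():
    order.append("extract")

def transform():
    global transform_calls
    transform_calls += 1
    if transform_calls == 1:
        raise RuntimeError("transient failure")   # fails once, then succeeds
    order.append("transform")

def load():
    order.append("load")

# Each key maps a task to the set of tasks it depends on.
attempts = run_dag(
    {"extract": extract, "transform": transform, "load": load},
    {"transform": {"extract"}, "load": {"transform"}},
)
```

The transform step fails once and succeeds on its second attempt, and the load step only runs after it, which is the ordering and retry behaviour that a set of independent cron entries cannot guarantee.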

Deploy on AWS / Huawei Cloud

Managed Kafka (MSK) + EMR/Spark to reduce ops overhead.

Integrate real APIs

Replace mock producers with live partner APIs and schema registry.

Improve dashboard analytics

Add cohort, funnel, and anomaly views on top of curated data.

Monitoring & alerting

Prometheus + Grafana + paging on SLO breaches and pipeline lag.

Experience

A timeline of building & learning

Jan 2024 – Present

Big Data Engineer

Software Company

· Designed and maintained scalable data pipelines processing 3TB+ daily on Hadoop & Spark.
· Improved data processing performance by 25% via Spark tuning and query optimization.
· Built ETL workflows in Hive + Python, cutting data preparation time by 30%.
· Partnered with engineering teams to ship data solutions to production at 99% reliability.

2025

B.Eng., Computer Engineering

The Open University, UK

· Graduation Project: Diabetic Retinopathy Detection, an AI + Big Data web system.
· Coursework spanning distributed systems, data structures, and machine learning.

2023 – 2024

Certifications & Training

Coursera · IBM · Udemy

· Big Data Analytics Professional Certificate: Hadoop, Spark, Hive, Pipelines.
· IBM Data Science & Big Data Tools Training: Python, SQL, Visualization.
· AI & Machine Learning Certificate: Algorithms, Neural Networks, Python ML.

Tech Stack

The toolbox

Apache Kafka · Apache Spark · Hadoop · Hive · HBase · ClickHouse · Python · SQL · Bash · Docker · Linux · Airflow · PostgreSQL · HDFS · AWS · Huawei Cloud · TensorFlow · Pandas

Contact

Let's build scalable data systems together 🚀

Open to Big Data Engineering roles, internships, and collaborations on real-time data systems.

Hager Elkahlawy

Big Data Engineer · Open to roles

hagerelkahlawey23@gmail.com

Phone

01229678856

Location

Cairo, Egypt

GitHub

github.com/hager1223

linkedin.com/in/hager-elkahlawy