Hi, I'm Sunil

Senior Backend & Data Engineer | AWS Certified Data Engineer | Ex-Fivetran

I bring ~5 years of industry experience in data engineering and backend systems, designing and operating scalable, cloud-native, and high-performance data platforms used in production by enterprise customers. I recently completed an MS in Computer Science at the University of Illinois Chicago, where I specialised in data engineering, cloud computing, distributed systems, and big data technologies, building directly on my industry background.

I am an AWS Certified Data Engineer – Associate with strong hands-on experience building secure, cost-efficient, and high-throughput data pipelines using services such as Amazon S3, Glue, EMR, Kinesis, Redshift, Athena, and DynamoDB. My expertise spans ETL/ELT pipelines, data lake and lakehouse architectures, data governance, and real-time analytics at scale.

Previously, I worked as a Senior Software Engineer at Fivetran, collaborating in a startup environment across multiple teams within the data pipeline platform and contributing to both source connectors and destination writers. My work spanned API-based connectors as well as database connectors such as DynamoDB and MongoDB, focusing on scalability, correctness, and performance. I led the design of a high-performance DynamoDB incremental sync engine, achieving 15× faster syncs, and implemented MongoDB Change Streams–based CDC incremental syncing, delivering a 5× performance improvement. On the destination side, I worked on data warehouse writers, including BigQuery and Snowflake.

Beyond individual connectors, I authored and designed a reliability framework that was adopted across 10+ engineering teams, improving consistency and fault tolerance across the data pipeline platform. I also won multiple internal hackathons, delivering features focused on product improvements, developer productivity, and platform innovation. In addition, I mentored and onboarded interns through a structured training program and regularly participated in technical interviews for engineering roles.

I enjoy working on high-impact data infrastructure problems, building systems that are scalable, reliable, and cost-efficient from day one. I am currently open to full-time roles in data engineering, backend systems, and cloud infrastructure, particularly in fast-moving, product-focused teams.

Core Strengths

  • Data Engineering & Pipelines: Designing and implementing scalable ETL/ELT pipelines, schema evolution, data modeling, connector development, orchestration, and real-time data processing.
  • Distributed Systems & Processing: Apache Spark, Hadoop MapReduce, Apache Flink, Kafka, AWS Kinesis, and gRPC for processing large-scale datasets in both batch and real-time systems.
  • Cloud Platforms & Services:
    • AWS: EC2, Lambda, S3, EMR, Glue, Step Functions, RDS, DynamoDB, Redshift, Athena, EventBridge, IAM, KMS, CloudWatch, CloudTrail.
    • GCP: BigQuery, Compute Engine, Pub/Sub, Cloud Storage.
    • Azure: Azure VM, Azure Blob Storage.
  • Databases & Warehousing: Experience with modern data warehouses and databases like Snowflake, BigQuery, Redshift, DynamoDB, MySQL, PostgreSQL, MongoDB, and SQL Server.
  • Programming & Backend Development: Java (advanced), SQL (advanced), Scala, Python, C++, and shell scripting, with expertise in backend architecture, REST APIs, and service frameworks.
  • DevOps & Infrastructure: Skilled in Docker, Kubernetes, Terraform, GitHub Actions, and CI/CD, using New Relic and CloudWatch for observability and SonarQube for code quality in production systems.
  • Data Quality & Governance: Ensuring data integrity and reliability through validation frameworks, governance practices, and monitoring across the data lifecycle.
  • Collaboration & Leadership: Onboarded interns with structured training, led design efforts, and contributed to engineering hiring processes.

Certification

AWS Certified Data Engineer – Associate

Verified on Credly
Demonstrates ability to design, build, secure, and maintain data analytics solutions on AWS that are efficient, scalable, and cost-optimized. Proficient in:

  • Data lake and lakehouse architecture
  • Real-time and batch data ingestion
  • Data transformation using Glue, EMR
  • Querying with Athena, Redshift
  • Secure access via IAM, encryption, and governance

Education

University of Illinois Chicago

Master of Science in Computer Science | Aug 2024 – Dec 2025

RV College of Engineering, Bengaluru

Bachelor of Engineering in Computer Science | Aug 2016 – May 2020

Experience

Senior Software Engineer

Fivetran · Bengaluru, India
Mar 2023 – Aug 2024 · 1 yr 6 mos

  • Redesigned and developed a new BigQuery data writer aligned with SQL-based writers, eliminating 90% of maintenance overhead.
  • Enhanced Warehouse Data Writer throughput by 30% by implementing multithreaded concurrent processing for split files.
  • Added support for JSON data types in BigQuery, ensuring seamless schema evolution and data compatibility.
  • Introduced partitioning and clustering in BigQuery writer to reduce customer costs by ~90% — a hackathon-winning optimization.
  • Led infrastructure improvements across distributed data pipelines and contributed to system-level performance gains.
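The multithreaded split-file loading mentioned above can be sketched roughly as below. This is an illustrative sketch only, not Fivetran's actual writer code: `load_chunk` is a hypothetical stand-in for uploading one split file to the warehouse.

```python
from concurrent.futures import ThreadPoolExecutor

def load_chunk(chunk):
    """Hypothetical stand-in for uploading one split file.

    A real writer would stream the chunk to BigQuery/Snowflake; here it
    just counts rows so the sketch stays self-contained.
    """
    return len(chunk)

def load_concurrently(chunks, max_workers=4):
    """Load split files concurrently instead of sequentially.

    I/O-bound uploads overlap, so wall-clock time approaches the
    slowest chunk rather than the sum of all chunks.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(load_chunk, chunks))

# Example: three "split files" of rows
chunks = [["r1", "r2"], ["r3"], ["r4", "r5", "r6"]]
print(load_concurrently(chunks))  # [2, 1, 3]
```

`pool.map` preserves input order, so results line up with the original split files even though uploads finish out of order.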

Software Engineer 2

Fivetran · Bengaluru, India
Sep 2021 – Mar 2023 · 1 yr 7 mos

  • Engineered a high-performance DynamoDB connector with 15× speedup in incremental syncs.
  • Improved MongoDB connector using Change Streams to achieve 5× faster data ingestion with reduced latency.
  • Designed support for Azure CosmosDB for MongoDB API, expanding Fivetran’s connector catalog.
  • Built Data Preview functionality using the IES framework to simplify customer onboarding and demo experiences.
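The Change Streams–based CDC sync above boils down to replaying a stream of change events against a replica. A minimal sketch, assuming simplified event dicts shaped like MongoDB change-stream documents (real connectors also handle resume tokens, `updateDescription` deltas, retries, and schema changes):

```python
def apply_change(replica, event):
    """Apply one simplified change-stream event to a local replica.

    `event` mimics the shape of MongoDB change-stream documents
    ({"operationType", "documentKey", "fullDocument"}); this sketch
    assumes fullDocument is always present for non-delete ops.
    """
    key = event["documentKey"]["_id"]
    op = event["operationType"]
    if op in ("insert", "update", "replace"):
        replica[key] = event["fullDocument"]
    elif op == "delete":
        replica.pop(key, None)
    return replica

replica = {}
events = [
    {"operationType": "insert", "documentKey": {"_id": 1},
     "fullDocument": {"_id": 1, "name": "a"}},
    {"operationType": "update", "documentKey": {"_id": 1},
     "fullDocument": {"_id": 1, "name": "b"}},
    {"operationType": "delete", "documentKey": {"_id": 1}},
]
for e in events:
    apply_change(replica, e)
print(replica)  # {} -- the insert and update were superseded by the delete
```

Because each event is applied idempotently by `_id`, the replica converges to the source state no matter how the events are batched between syncs.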

Software Engineer

Fivetran · Bengaluru, India
Jun 2020 – Aug 2021 · 1 yr 3 mos

  • Authored Isolated Endpoint Sync (IES) — a hackathon-winning framework now adopted by 500+ connectors and 10+ teams.
  • Built a public Shopify connector app with OAuth-based merchant onboarding, GraphQL extraction, and failover capabilities.
  • Enhanced Stripe connector with multithreading and connected accounts support for scale and fault-tolerance.
  • Developed an ETL connector for ADP REST APIs with complete ERD-based schema documentation.

Software Engineering Intern

Fivetran · Bengaluru, India
Jan 2020 – May 2020 · 5 mos

  • Built webhook-based incremental sync mechanism for Recharge connector, achieving a 10× increase in extract performance.
  • Benchmarked performance of full ETL pipelines using Snowflake, delivering optimization insights for production rollouts.
  • Contributed to multiple API-based connectors and gained hands-on experience with Fivetran’s connector lifecycle.
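Incremental syncs like the Recharge one above typically persist a cursor between runs so each sync pulls only new records. A hedged sketch with hypothetical names (`incremental_sync`, `fake_fetch`) and an in-memory stand-in for the source API:

```python
def incremental_sync(fetch_page, state):
    """Pull only records newer than the stored cursor.

    `fetch_page(cursor)` is a stand-in for a webhook- or API-backed
    source returning (records, new_cursor). The connector persists
    state["cursor"] between syncs so each run resumes where the
    previous one stopped.
    """
    records, new_cursor = fetch_page(state.get("cursor"))
    state["cursor"] = new_cursor
    return records

# In-memory stand-in for the source API
DATA = [{"id": 1, "updated_at": 10}, {"id": 2, "updated_at": 20}]

def fake_fetch(cursor):
    cursor = cursor or 0
    fresh = [r for r in DATA if r["updated_at"] > cursor]
    new_cursor = max([r["updated_at"] for r in fresh], default=cursor)
    return fresh, new_cursor

state = {}
first = incremental_sync(fake_fetch, state)   # both records on the initial sync
second = incremental_sync(fake_fetch, state)  # nothing new the second time
print(len(first), len(second))  # 2 0
```

The speedup in the bullet above comes from exactly this shape: after the first full pull, every subsequent sync touches only the delta.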

Projects

Mini Database Management System Internals Implementation

GitHub: View Project
Implemented core database engine internals including page-based storage, buffer management, record storage, and B+ tree indexing as part of an academic DBMS project.
Tech: C, Storage Manager, Buffer Manager (FIFO/LRU), Record Manager, B+ Tree, Valgrind
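The LRU buffer manager at the heart of this project can be sketched as follows (in Python rather than the project's C, purely for illustration; pinning, dirty-page write-back, and concurrency are omitted):

```python
from collections import OrderedDict

class LRUBufferPool:
    """Tiny sketch of an LRU page-replacement buffer pool.

    Pages are fetched from `read_page` on a miss; when the pool is
    full, the least-recently-used frame is evicted.
    """
    def __init__(self, capacity, read_page):
        self.capacity = capacity
        self.read_page = read_page   # page_id -> page contents
        self.frames = OrderedDict()  # page_id -> contents, in LRU order

    def fetch(self, page_id):
        if page_id in self.frames:
            self.frames.move_to_end(page_id)  # hit: mark recently used
            return self.frames[page_id]
        if len(self.frames) >= self.capacity:
            self.frames.popitem(last=False)   # miss on full pool: evict LRU
        self.frames[page_id] = self.read_page(page_id)
        return self.frames[page_id]

pool = LRUBufferPool(2, read_page=lambda pid: f"page-{pid}")
pool.fetch(1); pool.fetch(2); pool.fetch(1)
pool.fetch(3)  # pool is full, so page 2 (least recently used) is evicted
print(list(pool.frames))  # [1, 3]
```

An OrderedDict gives O(1) hit, insert, and evict; a FIFO policy (also in the project) would simply drop the `move_to_end` call on hits.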


AWS vs GCP Data Pipeline Benchmarking

GitHub: View Project
Benchmarks real-time data pipelines on AWS and GCP using a common IoT workload. Evaluates performance, cost, and sustainability.
Tech: AWS Kinesis, GCP Pub/Sub, Lambda, Dataflow, Python


Visual Analytics and Interactive Dashboards for LinkedIn Postings

GitHub: View Project

Developed an interactive visual analytics platform analyzing 124K+ LinkedIn job postings to uncover trends in skill demand, salaries, geography, experience levels, and remote work. Built reproducible data pipelines in Python and designed linked dashboards using Altair/Vega-Lite, including geospatial salary maps, skill–salary–industry views, and embedding-based job similarity exploration using PCA and UMAP.
Tech: Python, Pandas, Altair, Vega-Lite, PCA, UMAP, Jupyter, GitHub Pages


AWS Bedrock LLM Conversation API with Ollama

GitHub: View Project
Built a cloud-native conversational API using AWS Bedrock and Ollama for multi-turn LLM-based dialogue.
Tech: Scala, Akka HTTP, gRPC, AWS Lambda, Docker


Distributed Neural Network Training & Sentence Generation

GitHub: View Project
Built a Spark-based deep learning pipeline to train and generate text using DL4J and AWS EMR.
Tech: Scala, Apache Spark, DL4J, AWS EMR


Social-Aware Movie Revenue Prediction

GitHub: View Project
A machine learning pipeline that predicts movie box office revenue by combining traditional metadata (e.g., budget, genre, cast) with sentiment and emotion signals extracted from Reddit and YouTube.
Tech: Python, scikit-learn, NLP, Reddit & YouTube API, Data Visualization, EDA


Hadoop-based LLM Tokenization & Embeddings

GitHub: View Project
Created a distributed NLP pipeline using custom tokenizers and Hadoop MapReduce to generate text embeddings.
Tech: Scala, Hadoop, AWS EMR


Help Session Activity Management System

GitHub: View Project
Designed the backend data model for scheduling and managing academic help sessions between TAs and students.
Tech: SQL, Database Design, ER Diagram
