Cloud Data Engineer
- Capgemini
- Full-Time
- Feb 2020 - Present · 2 yrs 2 mos
• Conceptualized, designed and developed a robust ETL pipeline for ingesting and publishing terabytes of data daily
across a variety of big data and cloud data warehouse (DWH) platforms, improving agility, elasticity and data processing capacity.
• Collaborated with the Product Owner, Business Owner, and development and testing teams to analyze requirements,
design and develop pipelines, and create meaningful dashboards for users.
• Developed an in-house data testing tool in PySpark that reconciles and validates data and emails logs after each stage of
the data pipeline, reducing development and testing time, along with monitoring effort, by 70%.
• Built an XML shredding tool in PySpark as a POC, using EMR and S3, to convert XML into Avro for consumption and analysis.
• Analyzed performance and fixed issues in Spark jobs to optimize execution time and reduce resource costs.