Satyajeet
Data Engineer
To apply my potential in a competitive environment and to seek challenges that provide both personal and professional growth and development.
SYNOPSIS
AWS Certified Solutions Architect – Associate and Data Engineer with 8+ years of experience in designing and developing big data solutions using AWS and Python.
Hands-on expertise in AWS services such as EMR, Kinesis, Lambda, Elasticsearch, Kibana, S3, DMS, Redshift and Glue; the ETL tool Talend; and the programming languages Python and Scala. Big data processing using Spark, Hudi and Databricks.
Experience setting up new AWS environments using Terraform and Jenkins, with Bitbucket as the source code repository.
Good understanding of advanced Python libraries such as NumPy, Pandas and visualisation libraries.
Presently working with Deltacubes Technology in AWS Data Engineering, responsible for building end-to-end data lake solutions on AWS.
Previously associated with Synechron Technologies Pvt. Ltd. in Pune as part of the core team responsible for developing big data solutions and building deployment pipelines using Terraform and Jenkins.
Previously associated with HCL Technologies Ltd. as a Lead Engineer for the client CISCO, responsible for Python automation.
Previously associated with Cognizant Technology Solutions in Chennai as Programmer Analyst for the client CCC Information Services.
Technical Expertise –
7+ years of hands-on experience with AWS and Python.
Around 3 years of experience in the big data ecosystem: Hadoop, Hive, Scala, Spark, PySpark, AWS Glue, Athena and Redshift.
1+ years of hands-on experience with Jenkins and Terraform.
Good exposure to writing SQL queries.
Well versed in advanced Python libraries.
Working in Agile methodology.
WORK EXPERIENCE
Project: Data Lake using Spark and Scala; Hudi / Databricks
Position: Senior Associate – Technology
Period: May 2020 – Current
Description:
As per the current architecture, we maintain two separate layers of data: a speed layer for real-time data, and a batch layer, a replica of the speed layer lagging by 6 hours, used for historical purposes.
The requirement is to build a data lake using Scala as the programming language to process files, utilising the parallel computing capacity of Spark on the Databricks platform.
This is intended to replace our existing Talend / Apache Hudi ETL pipeline, owing to its limitations in compute capacity and the heavy volume of files processed by our platform. The data lake allows different vendors to pull and analyse data, determine user behaviour, and make the necessary business decisions accordingly.
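To illustrate the kind of processing this data lake layer performs, the following is a minimal PySpark sketch of an incremental upsert from an S3 landing zone into a curated Hudi table (the project itself used Scala on Databricks; the bucket names, table name and key columns here are hypothetical placeholders, not the actual ones).

from pyspark.sql import SparkSession

# Entry point; on Databricks a SparkSession is normally provided as `spark`.
spark = SparkSession.builder.appName("datalake-upsert").getOrCreate()

# Read the latest batch of CSV files landed in the S3 landing zone.
incoming = (
    spark.read
    .option("header", "true")
    .csv("s3://example-landing-zone/events/")  # placeholder bucket/prefix
)

# Hudi write configuration; record key, precombine and partition columns are assumed.
hudi_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "event_date",
    "hoodie.datasource.write.operation": "upsert",
}

# Upsert the batch into the curated zone of the data lake on S3.
(
    incoming.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-datalake/curated/events/")  # placeholder path
)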
The sources for the Python scripts running on AWS EC2/Lambda are SAP HANA, Google Sheets, Google Drive files, user badge-swipe data, COVID vaccination details, etc. S3 acts as the landing zone for these files in CSV format; each arrival triggers an AWS Lambda event which reads the data and performs a truncate/load or upsert into the Redshift tables. Views in Redshift are then used in Tableau dashboards.
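A minimal sketch of that S3-triggered load follows, assuming psycopg2 is supplied through a Lambda layer and an IAM role is available for the Redshift COPY; the bucket, schema, table and column names are placeholders, not the actual ones.

import os
import psycopg2  # assumed to be provided to the Lambda via a layer

def lambda_handler(event, context):
    # The S3 put event carries the bucket and key of the newly landed CSV file.
    record = event["Records"][0]["s3"]
    s3_path = f"s3://{record['bucket']['name']}/{record['object']['key']}"

    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"],
        dbname=os.environ["REDSHIFT_DB"],
        user=os.environ["REDSHIFT_USER"],
        password=os.environ["REDSHIFT_PASSWORD"],
        port=5439,
    )
    with conn, conn.cursor() as cur:
        # Stage the file with COPY, then upsert (delete + insert) into the target table.
        cur.execute("TRUNCATE staging.badge_swipes;")
        cur.execute(
            f"COPY staging.badge_swipes FROM '{s3_path}' "
            f"IAM_ROLE '{os.environ['COPY_ROLE_ARN']}' "
            "FORMAT AS CSV IGNOREHEADER 1;"
        )
        cur.execute(
            "DELETE FROM public.badge_swipes "
            "USING staging.badge_swipes s "
            "WHERE public.badge_swipes.swipe_id = s.swipe_id;"
        )
        cur.execute(
            "INSERT INTO public.badge_swipes SELECT * FROM staging.badge_swipes;"
        )
    conn.close()
    return {"status": "loaded", "file": s3_path}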
Other data sources for this requirement are traditional databases such as MySQL, Oracle and SQL Server, which are replicated using the AWS DMS service, making the files available in S3.
Responsibilities:
Solution Environment:
Tools: Databricks notebook, IntelliJ IDEA, SBT
Project Type: Data Pipeline
Project: Building an ETL pipeline using Talend
Client: Asurion LLC, Nashville, USA
Period: Oct 2019 – April 2020
Description: