Satyajeet
SYNOPSIS
- AWS Certified Solutions Architect – Associate with 9 years of experience in designing and developing big data solutions using AWS and Python.
- Hands-on expertise in AWS services such as EMR, Kinesis, Lambda, Elasticsearch, Kibana, S3, DMS, Redshift and Glue; the ETL tool Talend; and the programming languages Python and Scala. Big data processing using Spark, Hudi and Databricks.
- Experience setting up new AWS environments using Terraform and Jenkins, with Bitbucket as the source code repository.
- Good understanding of advanced Python libraries such as NumPy and pandas, and the visualisation libraries Matplotlib and seaborn.
- Presently working with Incedo Inc. as Technical Lead – AWS Data Engineering in Pune responsible for building end-to-end data lake solutions on AWS.
- Previously associated with Synechron Technologies Pvt Ltd. in Pune as part of the core team responsible for developing big data solutions and developing deployment pipelines using Terraform and Jenkins.
- Previously associated with HCL Technologies Ltd. as a Lead Engineer for the client Cisco, responsible for Python automation.
- Previously associated with Cognizant Technology Solutions in Chennai as Programmer Analyst for the client CCC Information Services.
Technical Expertise –
- 7+ years of hands-on experience with AWS and Python.
- Around 3 years of experience in the big data ecosystem: Hadoop, Hive, Scala, Spark, PySpark, AWS Glue, Athena and Redshift.
- 1+ years of hands-on experience with Jenkins and Terraform.
- Good exposure to writing SQL queries.
- Well versed in advanced Python libraries.
- Experience working in Agile methodology.
WORK EXPERIENCE
#1 : WEX Analytics
Description
This requirement is to build a data lake that lets different vendors pull and analyse data, determine user behaviour and make business decisions accordingly. Python scripts running on AWS EC2/Lambda pull data from sources such as SAP HANA, Google Sheets, Google Drive files, user badge swipe data and COVID vaccination details. S3 acts as the landing zone for these files in CSV format. Each new file triggers an AWS Lambda event, which reads the data and performs a truncate/load or an upsert into the Redshift tables. Views in Redshift are then used in Tableau dashboards.
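A minimal sketch of the load step described above, not the project's actual code. It assumes the Lambda receives a standard S3 event notification and that the CSV files carry a header row. The function names, the staging-table naming and the `key_column` parameter are hypothetical. The functions only build the SQL statements; executing them against Redshift is left out.

```python
from urllib.parse import unquote_plus


def parse_s3_event(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def build_load_statements(table, s3_uri, iam_role, mode="truncate", key_column=None):
    """Build the SQL for a truncate/load or a staging-table upsert (hypothetical helper)."""
    # Assumption: CSV files with one header row.
    copy_options = f"IAM_ROLE '{iam_role}' FORMAT AS CSV IGNOREHEADER 1"
    if mode == "truncate":
        return [
            f"TRUNCATE TABLE {table};",
            f"COPY {table} FROM '{s3_uri}' {copy_options};",
        ]
    if mode == "upsert":
        if not key_column:
            raise ValueError("upsert mode needs a key_column")
        # Temporary tables take no schema prefix, so keep only the table name.
        stage = table.split(".")[-1] + "_stage"
        return [
            f"CREATE TEMP TABLE {stage} (LIKE {table});",
            f"COPY {stage} FROM '{s3_uri}' {copy_options};",
            # Delete rows that are about to be replaced, then insert the new versions.
            f"DELETE FROM {table} USING {stage} WHERE {table}.{key_column} = {stage}.{key_column};",
            f"INSERT INTO {table} SELECT * FROM {stage};",
        ]
    raise ValueError(f"unknown mode: {mode}")
```

In a real handler the upsert statements would run inside a single transaction, so a failed COPY does not leave the target table half-updated.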
Role: Technical Lead
Responsibilities:
- Gathering and analysing requirements based on the Business Requirement Document (BRD).
- Implementing the module that moves files from the different sources all the way to Redshift.
- Sole ownership of all the data pipelines worked on.
- Monitoring and failure analysis of the jobs, based on SNS/SES notifications triggered by the jobs.
Solution Environment
- AWS (S3, Lambda, EC2, Glue, Redshift, CloudWatch, SNS, Secrets Manager), Python boto3
Tools: AWS, PyCharm, PuTTY, IntelliJ, Sublime Text
Project Type: Data pipeline on AWS
#2 : Data Lake using Spark and Scala; Hudi/Databricks
Description
As per the current architecture, we need to maintain two separate layers of data: a speed layer for real-time data, and a batch layer, which is a replica of the speed layer lagging by 6 hours and used for historical purposes. This requirement is to build a data lake using Scala as the programming language, processing files with the parallel computing capacity of Spark on the Databricks platform. It is intended to replace our existing Talend ETL pipeline with Apache Hudi, because of limits on compute capacity and the large number of files processed by our platform. The data sources for this requirement are traditional databases such as MySQL, Oracle and SQL Server, which are processed using AWS DMS to make files available in S3.
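The 6-hour lag between the two layers can be illustrated with a small Python sketch (the project itself uses Scala). It assumes each record carries an `event_time` timestamp and that the batch layer simply holds everything older than the cutoff. The field name and both helper functions are hypothetical, not taken from the project.

```python
from datetime import datetime, timedelta, timezone

BATCH_LAG = timedelta(hours=6)  # lag stated in the description


def batch_cutoff(now):
    """Latest event time that the batch layer is expected to contain."""
    return now - BATCH_LAG


def visible_in_batch_layer(records, now):
    """Keep the records old enough to have reached the batch layer."""
    cutoff = batch_cutoff(now)
    return [r for r in records if r["event_time"] <= cutoff]
```

Under this assumption, the speed layer holds every record and the batch layer holds the subset returned by `visible_in_batch_layer`.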
Responsibilities
- Developing various modules in Scala to interact with AWS services.
- Integrating these modules to build an end-to-end data pipeline to be processed on Databricks using Spark's parallel processing capability.
- Writing test cases for the Scala code to achieve at least 80% code coverage using scoverage.
Solution Environmen