Satyajeet
Data Engineer
To apply my potential in a competitive environment and to seek challenges that provide both personal and professional growth and development.
SYNOPSIS
AWS Certified Solutions Architect – Associate and Data Engineer with 8+ years of experience in designing and developing big data solutions using AWS and Python.
Hands-on expertise in AWS services such as EMR, Kinesis, Lambda, Elasticsearch, Kibana, S3, DMS, Redshift and Glue; the ETL tool Talend; and the programming languages Python and Scala. Big data processing using Spark, Hudi and Databricks.
Experience setting up new AWS environments using Terraform and Jenkins, with Bitbucket as the source code repository.
Good understanding of advanced Python libraries such as NumPy, Pandas and visualisation libraries.
Presently working with Deltacubes Technology in AWS Data Engineering, responsible for building end-to-end data lake solutions on AWS.
Previously associated with Synechron Technologies Pvt. Ltd. in Pune as part of the core team responsible for developing big data solutions and building deployment pipelines using Terraform and Jenkins.
Previously associated with HCL Technologies Ltd. as a Lead Engineer for the client CISCO, responsible for Python automation.
Previously associated with Cognizant Technology Solutions in Chennai as Programmer Analyst for the client CCC Information Services.
Technical Expertise –
7+ years of hands-on experience with AWS and Python.
Around 3 years of experience in the big data ecosystem: Hadoop, Hive, Scala, Spark, PySpark, AWS Glue, Athena and Redshift.
1+ years of hands-on experience with Jenkins and Terraform.
Good exposure to writing SQL queries.
Well versed in advanced Python libraries.
Working in Agile methodology.
WORK EXPERIENCE
Project: Data Lake using Spark and Scala; Hudi / Databricks
Position: Senior Associate – Technology
Period: May 2020 – Current
Description:
As per the current architecture, we maintain two separate layers of data: a speed layer for real-time data, and a batch layer, a replica of the speed layer lagging by 6 hours, used for historical purposes.
The requirement is to build a data lake using Scala as the programming language to process files, utilising the parallel computing capacity of Spark on the Databricks platform.
This is intended to replace our existing Talend / Apache Hudi ETL pipeline, owing to its limitations in compute capacity and the heavy volume of files processed by our platform. The data lake allows different vendors to pull and analyse data, determine user behaviour, and make the necessary business decisions accordingly.
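To illustrate the kind of processing this data lake layer performs, the following is a minimal PySpark sketch of an incremental upsert from an S3 landing zone into a curated Hudi table (the project itself used Scala on Databricks; the bucket names, table name and key columns here are hypothetical placeholders, not the actual ones).

from pyspark.sql import SparkSession

# Entry point; on Databricks a SparkSession is normally provided as `spark`.
spark = SparkSession.builder.appName("datalake-upsert").getOrCreate()

# Read the latest batch of CSV files landed in the S3 landing zone.
incoming = (
    spark.read
    .option("header", "true")
    .csv("s3://example-landing-zone/events/")  # placeholder bucket/prefix
)

# Hudi write configuration; record key, precombine and partition columns are assumed.
hudi_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "event_date",
    "hoodie.datasource.write.operation": "upsert",
}

# Upsert the batch into the curated zone of the data lake on S3.
(
    incoming.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-datalake/curated/events/")  # placeholder path
)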
The sources for the Python scripts running on AWS EC2/Lambda are SAP HANA, Google Sheets, Google Drive files, user badge-swipe data, COVID vaccination details, etc. S3 acts as the landing zone for these files in CSV format; each arrival triggers an AWS Lambda event which reads the data and performs a truncate/load or upsert into the Redshift tables. Views in Redshift are then used in Tableau dashboards.
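A minimal sketch of that S3-triggered load follows, assuming psycopg2 is supplied through a Lambda layer and an IAM role is available for the Redshift COPY; the bucket, schema, table and column names are placeholders, not the actual ones.

import os
import psycopg2  # assumed to be provided to the Lambda via a layer

def lambda_handler(event, context):
    # The S3 put event carries the bucket and key of the newly landed CSV file.
    record = event["Records"][0]["s3"]
    s3_path = f"s3://{record['bucket']['name']}/{record['object']['key']}"

    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"],
        dbname=os.environ["REDSHIFT_DB"],
        user=os.environ["REDSHIFT_USER"],
        password=os.environ["REDSHIFT_PASSWORD"],
        port=5439,
    )
    with conn, conn.cursor() as cur:
        # Stage the file with COPY, then upsert (delete + insert) into the target table.
        cur.execute("TRUNCATE staging.badge_swipes;")
        cur.execute(
            f"COPY staging.badge_swipes FROM '{s3_path}' "
            f"IAM_ROLE '{os.environ['COPY_ROLE_ARN']}' "
            "FORMAT AS CSV IGNOREHEADER 1;"
        )
        cur.execute(
            "DELETE FROM public.badge_swipes "
            "USING staging.badge_swipes s "
            "WHERE public.badge_swipes.swipe_id = s.swipe_id;"
        )
        cur.execute(
            "INSERT INTO public.badge_swipes SELECT * FROM staging.badge_swipes;"
        )
    conn.close()
    return {"status": "loaded", "file": s3_path}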
Other data sources for this requirement are traditional databases such as MySQL, Oracle and SQL Server, which are replicated using the AWS DMS service, making the files available in S3.
Responsibilities:
Solution Environment:
Tools: Databricks notebook, IntelliJ IDEA, SBT
Project Type: Data Pipeline
Project: Building an ETL pipeline using Talend
Client: Asurion LLC, Nashville, USA
Period: Oct 2019 – April 2020
Description: