Sam is available for hire

Sam Rogers

Verified Expert in Engineering

Data Engineer and Developer

Location

Boston, MA, United States

Toptal Member Since

June 24, 2020

Sam is a data engineer who specializes in creating AWS solutions for ETL. Due to his attentiveness and drive for excellence, he has continuously provided scalable, repeatable, and cost-effective solutions to process data at scale. Sam excels on projects that run on his Python code and AWS resources, but he also has working knowledge of Google Cloud and Azure.

Data Warehousing, Data Warehouse Design, Business Intelligence (BI), Python, SQL, ETL, Data Pipelines, Snowflake, Docker, Spark, PySpark, Apache Airflow, Data Modeling, Amazon API Gateway, PostGIS, Dashboard, Dask

Portfolio

Starry Internet

Salesforce, Stripe, NetSuite, Amazon Web Services (AWS), PostGIS, PostgreSQL...

Drift

Amazon Web Services (AWS), Redshift, Docker, Apache Airflow, Snowflake, SQL...

Liberty Mutual

Amazon Web Services (AWS), Geospatial Data, Dask, PostGIS, EMR, Redshift, SQL...

Experience

Python - 3 years, SQL - 3 years, ETL - 3 years, Snowflake - 2 years, Docker - 2 years, Spark - 2 years, Apache Airflow - 2 years

Availability

Part-time

Preferred Environment

Bash, PyCharm, Slack

The most amazing...

...thing that I've ever done was migrate a data warehouse in just two weeks. This involved 20 data sources, over 10TB of data, and hundreds of different reports.

Work Experience

Data Engineer

2019 - PRESENT

Starry Internet

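Designed and implemented a large-scale data-processing platform utilizing Spark and AWS EMR. This included building a pipeline to process and aggregate IoT data arriving every minute from tens of thousands of devices.
Changed the team's code deployment process from manual builds and uploads to a CI/CD pipeline running in AWS CodeBuild, reducing the engineering effort per deployment from 10 minutes to under 1 minute.
Led the design, implementation, testing, and migration to a highly scalable Airflow environment. This environment is the core platform for all ETL, running 800+ containerized tasks every hour.
Increased accountability and reduced error response time by implementing an incident response framework and ticketing system for the broader data engineering team.
Rearchitected the use of Snowflake for large-scale data processing. Moved workloads from Snowflake to PySpark running on EMR, saving 30% in cost and cutting pipeline runtime by 50%.
Developed data quality tooling that runs 1,200+ checks against the data warehouse every hour to ensure data meets expectations. This led to a shift from reactive error handling to proactive monitoring and incident management.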

Technologies: Salesforce, Stripe, NetSuite, Amazon Web Services (AWS), PostGIS, PostgreSQL, Apache Airflow, Docker, Geospatial Data, Spark, Scala, Snowflake, SQL, Python

Data Engineer

2018 - 2019

Drift

Managed and maintained all aspects of ETL, data warehousing, analytics tooling, and infrastructure, and was responsible for onboarding new data sources, data quality, and availability (was also the data team's hire #1).
Stood up the Airflow back end using ECS, Fargate, RDS, and Redis to serve as the core ETL tool for all data processing and pipelines.
Led the migration from Redshift to Snowflake involving 17 separate streaming data sources, 1,000+ tables, and over 20 different teams reliant on the warehouse. The migration resulted in no increase in cost and a 75% decrease in query time.
Developed a reliable Spark pipeline to process 100GB+ of data daily and produce clean, manageable aggregations of end-user interaction data.
Built a success factor score, a statistical model that determines customer health based on usage, interaction, and engagement data. This score serves as a key business metric that customer success managers are evaluated on.
Evaluated, implemented, and trained a team on Looker, a powerful data definition management and BI tool that enables non-technical users to access and analyze data.

Technologies: Amazon Web Services (AWS), Redshift, Docker, Apache Airflow, Snowflake, SQL, Python

Data Science Engineer

2017 - 2018

Liberty Mutual

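Developed infrastructure to process and understand the impact of aircraft noise on the livability of a given location (over 10 billion records).
Produced a prototype to enable executives to quickly ingest and understand 1,000+ comments from monthly employee opinion surveys. Developed a front-end web app for easy access.
Architected and built a data pipeline to support automated summarization of customer service calls.

Technologies: Amazon Web Services (AWS), Geospatial Data, Dask, PostGIS, EMR, Redshift, SQL, Python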

Analytics Associate

2016 - 2017

Liberty Mutual

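Developed market sizing models from numerous disparate sources to estimate the potential business value of various new product concepts.
Evaluated an opportunity and developed a model to intelligently select which no-fault claims should be sent to litigation. The model is projected to increase recovery dollars by $700,000.
Gathered use cases for cloud-based infrastructure from leaders across the organization and prioritized them, ultimately creating a cloud transition strategy.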

Technologies: R, SQL

Experience

Total Home Score Data Pipeline

http://www.totalhomescore.com

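Total Home Score is a product designed to help prospective home buyers and renters understand what it is like to live at a particular property before making a decision.

In order to scale this product and calculate scores for millions of properties, I built a large-scale data pipeline to perform complex geospatial calculations and aggregations.

The pipeline uses Spark and EMR to run calculations on road traffic data and produce aggregations of how drivers typically drive on a given stretch of road. Addresses are then loaded into Dask and processed across thousands of partitions to determine how many "dangerous" roads exist within a certain radius of a given address.

Additionally, I developed a pipeline to process aircraft position data (over 10 billion points) and determine the expected aircraft noise level at a given property.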
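
The radius count for "dangerous" roads described above can be illustrated with a minimal sketch. This is not the pipeline's actual code: the record layout (a single reference point and a boolean flag per road), the function names, and the sample coordinates are all assumptions made for illustration, and the real pipeline may well have used PostGIS or other geospatial tooling instead of a hand-written haversine distance.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def count_dangerous_roads(address, roads, radius_m):
    """Count roads flagged dangerous whose reference point is within radius_m of the address."""
    lat, lon = address
    return sum(
        1
        for road in roads
        if road["dangerous"] and haversine_m(lat, lon, road["lat"], road["lon"]) <= radius_m
    )


# Hypothetical sample data: one reference point per road segment.
roads = [
    {"lat": 42.3601, "lon": -71.0589, "dangerous": True},   # at the address itself
    {"lat": 42.3611, "lon": -71.0589, "dangerous": True},   # roughly 111 m north
    {"lat": 42.3601, "lon": -71.0579, "dangerous": False},  # nearby but not flagged
    {"lat": 42.4601, "lon": -71.0589, "dangerous": True},   # roughly 11 km north
]
address = (42.3601, -71.0589)
nearby = count_dangerous_roads(address, roads, radius_m=500)
print(nearby)
```

At scale, a function like this would be applied to each partition of addresses rather than to one address at a time, and a spatial index would normally replace the linear scan over roads.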

End User Analytics Cache

A Python-based application running on AWS Lambda and Redis that supports millisecond-level retrieval of aggregated product usage data.

While working at a marketing technology provider, our product team wanted the ability to surface product usage data to our customers. Customers did not want to wait for a query to run in the data warehouse before getting results. The solution I designed was to run a set of predefined aggregations and place them in a cache so that customers could receive and visualize results almost instantly.

Not only was this faster than running aggregations on demand, but it was also more cost-effective: instead of running thousands of aggregation queries in Snowflake per day, only one query needed to run to generate the output data and place it into our cache.
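
The precompute-then-cache pattern described above might look roughly like the following sketch. It is an illustration under stated assumptions, not the original application: `InMemoryCache` is a hypothetical stand-in for a Redis client (only string `get`/`set` are modeled), and the row fields, key format, and function names are invented for the example.

```python
import json
from collections import defaultdict


class InMemoryCache:
    """Hypothetical stand-in for a Redis client; only get/set on string keys are modeled."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


def refresh_cache(cache, usage_rows):
    """Aggregate usage for all customers in one pass and write one cache entry per customer."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in usage_rows:
        totals[row["customer_id"]][row["feature"]] += row["events"]
    for customer_id, per_feature in totals.items():
        cache.set(f"usage:{customer_id}", json.dumps(per_feature, sort_keys=True))


def get_usage(cache, customer_id):
    """Serve a customer's aggregates from the cache without querying the warehouse."""
    raw = cache.get(f"usage:{customer_id}")
    return json.loads(raw) if raw is not None else None


# Hypothetical rows, standing in for the result of the single scheduled warehouse query.
rows = [
    {"customer_id": "acme", "feature": "chat", "events": 3},
    {"customer_id": "acme", "feature": "chat", "events": 2},
    {"customer_id": "acme", "feature": "email", "events": 2},
    {"customer_id": "globex", "feature": "chat", "events": 7},
]
cache = InMemoryCache()
refresh_cache(cache, rows)
print(get_usage(cache, "acme"))
```

The design choice is that the expensive work happens once per refresh, so each customer request is a single key lookup regardless of how much raw usage data sits behind it.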

Containerized Airflow Processing

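Airflow is an open-source tool designed to orchestrate, schedule, and execute ETL jobs. The tool was originally designed to run these processes within it as well. However, many problems can arise when Airflow performs the actual processing rather than just coordinating between resources. For example, all dependencies need to be installed on the same instance, memory leaks can bring down the entire cluster, and additional security measures may be required if customer data is touched by the cluster.

My solution was to have Airflow serve only as a container execution tool rather than doing the actual processing within the application. In addition, all configuration of which jobs to execute and which parameters to pass to them is still contained in the Airflow code, but executed elsewhere. This makes for a simple interface for other data engineers to implement new pipelines.

For example, if an engineer has a file in S3 that they want to be loaded to a database on a schedule, they simply utilize the loader operator class that already exists in the Airflow repository. When executed, the class launches a task running in AWS Fargate that executes a process using the configuration passed to it.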
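
As a rough illustration of the idea, the sketch below builds the request an operator of this kind could send to launch a Fargate task. The interface of the loader operator is not shown in the profile, so every name here (function names, cluster, task definition, container name, command, subnet ID, environment variable) is hypothetical. Only the dictionary shape follows the ECS RunTask API.

```python
def build_fargate_task_request(cluster, task_definition, container_name, command, env, subnets):
    """Build keyword arguments in the shape accepted by the ECS RunTask API."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",
        "networkConfiguration": {
            "awsvpcConfiguration": {"subnets": list(subnets), "assignPublicIp": "DISABLED"}
        },
        "overrides": {
            "containerOverrides": [
                {
                    "name": container_name,
                    "command": list(command),
                    # RunTask expects environment variables as name/value pairs.
                    "environment": [{"name": k, "value": v} for k, v in sorted(env.items())],
                }
            ]
        },
    }


def s3_loader_request(s3_uri, target_table):
    """Hypothetical loader configuration: the scheduler only describes the job to run."""
    return build_fargate_task_request(
        cluster="etl-cluster",            # assumed cluster name
        task_definition="s3-loader",      # assumed task definition
        container_name="loader",          # assumed container name
        command=["load", "--source", s3_uri, "--table", target_table],
        env={"LOG_LEVEL": "INFO"},
        subnets=["subnet-00000000"],      # placeholder subnet ID
    )


request = s3_loader_request("s3://example-bucket/data.csv", "analytics.events")
print(request["overrides"]["containerOverrides"][0]["command"])
```

In a real deployment, a dictionary like this could be passed to `boto3.client("ecs").run_task(**request)` from inside an Airflow operator's `execute` method, so the scheduler holds the configuration while the container does the processing.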

Skills

Languages

Python, SQL, Snowflake, Bash, R, Scala, SAS

Frameworks

Spark, Flask, Serverless Framework, Django

Libraries/APIs

PySpark, Flask-RESTful, Dask, Stripe, Luigi

Tools

Apache Airflow, Looker, Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Container Registry (ECR), Amazon Elastic MapReduce (EMR), Slack, PyCharm, Amazon Simple Notification Service (Amazon SNS), Amazon Simple Queue Service (SQS), AWS CodeBuild, AWS IAM

Paradigms

Business Intelligence (BI), ETL, REST, DevOps

Platforms

Docker, Amazon Web Services (AWS), AWS Lambda, Amazon EC2, Salesforce

Storage

Data Pipelines, PostGIS, MySQLdb, Databases, Redis, MySQL, PostgreSQL, Amazon S3 (AWS S3), Redshift

Other

Pipelines, Data Warehousing, Dashboards, Web Dashboards, Data Warehouse Design, Data Modeling, Geospatial Data, GeoSpark, Amazon API Gateway, Dash, EMR, NetSuite, Singer ETL, Data Build Tool (dbt)

Education

2013 - 2016

Bachelor's Degree in Economics

University at Buffalo - Buffalo, NY, USA
