In a hurry? Check out my projects!


Aaron Richter, PhD

data scientist // data engineer

I'm passionate about all things data! I'm happiest when I'm optimizing business processes with data pipelines and analytics, or teaching a room full of data nerds about a cool algorithm. I write code for production and make sure research is reproducible. I serve as technical lead for a team of data professionals and hold a PhD in machine learning for healthcare applications.

Want to see more? Check out my projects!

What I Do

Talks & Publications

Talks


Selected publications

Find all of my publications on my Google Scholar profile, and contact me if you can't get past the paywalls.

Predicting Melanoma Risk

Built and evaluated several machine learning models to predict an individual patient's risk of developing melanoma from routinely captured electronic health records. This involved processing de-identified data from over 20 million patients into a research dataset with over 100,000 features for building predictive models. A minimal sketch of the modeling workflow follows the tool list below.

I wrote all of the code and conducted all of the experiments myself.

  • Languages: Python, R, SQL, LaTeX
  • Platforms: AWS (EC2, S3, EMR), Databricks
  • Data prep: Apache Spark, tidyverse, pandas/numpy/scipy
  • ML: logistic regression, decision trees, random forest, XGBoost, SHAP
  • Analysis: RMarkdown, ggplot2, Shiny
  • Writing: Overleaf / ShareLaTeX, Lucidchart
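
The research code is not public because of the sensitivity of the data, but a minimal sketch of the modeling step, under illustrative assumptions (random stand-in data and made-up feature names), looks like this:

# Illustrative sketch only: fit an XGBoost classifier on a wide feature matrix
# and explain it with SHAP. The data and column names are random stand-ins.
import numpy as np
import pandas as pd
import xgboost as xgb
import shap
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Stand-in for the real (de-identified) patient-by-feature matrix.
X = pd.DataFrame(np.random.rand(1000, 50),
                 columns=[f"feature_{i}" for i in range(50)])
y = np.random.binomial(1, 0.05, size=1000)  # rare positive class, like melanoma

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = xgb.XGBClassifier(
    n_estimators=200,
    max_depth=4,
    # weight the rare positive class more heavily
    scale_pos_weight=(y_train == 0).sum() / max((y_train == 1).sum(), 1),
)
model.fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# SHAP gives per-patient, per-feature contributions to the predicted risk.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)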

Download PDF here!


Spark ETL

This project is the pride and joy of my work at ModMed. It started as a way to generate different datasets that had shared ETL components, and grew to power the entire business's enterprise data warehouse and data products.

The main concept is object-oriented wrappers around the functional logic of ETL transformations. Re-usable operations are encapsulated, and storage/IO logic is de-coupled from the transformations themselves. The framework is portable and can run on any environment that supports Spark, so engineers can focus solely on data transformations without worrying about scaffolding or boilerplate code, and can develop and unit test locally before deploying against data in the wild. Metadata and data quality are first-class citizens: a table is not "complete" until data quality rules have been written and are applied at runtime. Data dictionaries are generated from code, so documentation is always up to date. Find out more in the video below from my Spark+AI Summit talk.
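
The framework itself is proprietary, so the following is only a rough sketch of the wrapper idea with hypothetical class and method names: a subclass owns the functional transformation logic, while the base class handles inputs, data quality rules, and writing.

# Hypothetical sketch of the pattern: pure transform logic in a subclass,
# with IO and data quality checks handled outside the transformation itself.
from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.functions as F


class Table:
    """Base wrapper: subclasses implement only transform(); the framework
    reads inputs, applies data quality rules, and writes the output."""

    name = None
    quality_rules = []  # list of (description, predicate Column) pairs

    def transform(self, inputs: dict) -> DataFrame:
        raise NotImplementedError

    def run(self, inputs: dict, writer) -> DataFrame:
        df = self.transform(inputs)
        for description, predicate in self.quality_rules:
            bad = df.filter(~predicate).count()
            if bad > 0:
                raise ValueError(f"Data quality rule failed ({description}): {bad} rows")
        writer(self.name, df)
        return df


class PatientVisits(Table):
    name = "patient_visits"
    quality_rules = [("visit_id is never null", F.col("visit_id").isNotNull())]

    def transform(self, inputs: dict) -> DataFrame:
        # Pure transformation logic: no paths, buckets, or formats here.
        return (inputs["raw_visits"]
                .filter(F.col("status") == "complete")
                .select("visit_id", "patient_id", "visit_date"))


if __name__ == "__main__":
    spark = SparkSession.builder.master("local[*]").getOrCreate()
    raw = spark.createDataFrame(
        [(1, 100, "2020-01-01", "complete"), (2, 100, "2020-01-02", "draft")],
        ["visit_id", "patient_id", "visit_date", "status"])
    # In production the writer would target S3; locally it can just show().
    PatientVisits().run({"raw_visits": raw}, writer=lambda name, df: df.show())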

Most recently, I refactored the core code to support Databricks as an execution environment in addition to Amazon EMR. As part of this process, I built an Airflow DAG that generates tables concurrently, with each table blocking only until its parent table(s) are complete. This decreased the runtime of our nightly data warehouse pipeline by 75%, enabling a new data product to be delivered to our clients that was not previously possible.
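
The real DAG is generated from the framework's table metadata; a hand-rolled sketch of the dependency wiring, with a made-up table graph and Airflow 2-style imports, might look like this:

# Hypothetical sketch: one Airflow task per table, wired so each table only
# runs after its parent tables have been generated.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator


def build_table(table_name, **_):
    print(f"building {table_name}")  # stand-in for the real Spark job submission


# Made-up dependency graph: table -> list of parent tables.
PARENTS = {
    "dim_patient": [],
    "dim_provider": [],
    "fact_visit": ["dim_patient", "dim_provider"],
    "agg_visits_monthly": ["fact_visit"],
}

with DAG(
    dag_id="warehouse_nightly",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    tasks = {
        name: PythonOperator(
            task_id=f"build_{name}",
            python_callable=build_table,
            op_kwargs={"table_name": name},
        )
        for name in PARENTS
    }
    # Independent tables run concurrently; children block on their parents.
    for name, parents in PARENTS.items():
        for parent in parents:
            tasks[parent] >> tasks[name]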

  • Languages: Python, JSON, YAML, Java (minimal)
  • Platforms: Databricks, AWS (EMR, S3)
  • Frameworks: Spark, Airflow
  • Tables: 500+ tables for 20+ data products
  • Code: 40,000+ lines of code and 500+ pull requests (30,000 and 350 by yours truly)

Data Team @ ModMed

I started at ModMed as a software engineering intern focused on building out data pipelines, and was eventually brought on full time as a data scientist, becoming the first member of the data team. We are a dynamic team within a dynamic company, exemplified by our growth (and the fact that my desk has been moved more than 10 times!). I have been fortunate to see the team expand from just me to over 10 data professionals.

With intimate knowledge of the company's data models and ETL pipelines, I serve as a technical lead and subject matter expert for the data warehouse and analytics products. This involves architecture work, writing best practices and documentation, and software lifecycle activities.

  • Conducted countless interviews
  • Recruited several data engineers from my personal network
  • Interviewed and onboarded multiple data engineers, biostatisticians, data analysts, and engineering interns
  • Maintain software lifecycle for four major code repositories

Data team

AWS ML Experimenter

This package was built to support machine learning experimentation for my PhD research. The data was too big (and too sensitive) to fit on a personal workstation, so all data preparation and model training were performed in the cloud on AWS, and the results were then pulled down from S3 for local analysis in R.

I built a framework around these processes to make it seamless to develop the ML code locally on test data, then deploy it to the full data in the cloud when ready. I also built a Shiny dashboard to monitor the experiments in real time, as some of them would take days to complete. Check out the video walkthrough below!
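
The full framework is linked below; as a flavor of the local-versus-cloud switch, here is a stripped-down sketch with made-up bucket names and paths:

# Simplified sketch of the local/cloud split: the experiment code stays the
# same, only the data source and result destination change. Paths are made up.
import argparse
import json

import boto3
import pandas as pd


def run_experiment(df: pd.DataFrame) -> dict:
    # Stand-in for the real scikit-learn/keras training code.
    return {"n_rows": len(df), "score": 0.5}


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--cloud", action="store_true",
                        help="read full data from S3 and write results back to S3")
    args = parser.parse_args()

    if args.cloud:
        # Full dataset in the cloud (reading s3:// paths requires s3fs).
        data = pd.read_csv("s3://my-research-bucket/full/train.csv")
    else:
        # Small local sample for developing and testing the experiment code.
        data = pd.read_csv("data/sample.csv")

    results = run_experiment(data)

    if args.cloud:
        boto3.client("s3").put_object(
            Bucket="my-research-bucket",
            Key="results/experiment.json",
            Body=json.dumps(results),
        )
    else:
        print(results)


if __name__ == "__main__":
    main()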

  • Languages: Python, R, Bash
  • Platforms: AWS (EC2, S3), shinyapps.io
  • ML: scikit-learn, keras
  • Analysis: RMarkdown, Shiny

The project is open source and available here.


PyData Miami

PyData Miami is a meetup and conference chapter of the global PyData network. PyData focuses on computational and data applications of the Python language, along with others such as R and Julia.

I have been a frequent speaker at and organizer of the meetup (formerly the Miami Machine Learning meetup) since 2018, and was a co-organizer and program chair for the inaugural PyData Miami conference in 2019. The 2020 conference is on hold due to our (least) favorite human coronavirus, but you can watch my PyData Miami 2019 talk below!

  • Spoke at several meetups and the PyData Miami 2019 conference
  • Program Chair for 2019 and 2020 conferences
  • Witnessed the Miami data community grow, and met so many awesome people!

About Me

I started my coding journey way back in middle school, fighting Lego robots and scraping Flash games off websites. After I received my bachelor's degree in computer science, I started an internship at Modernizing Medicine while also pursuing my PhD in machine learning. Little did I know that I would find my passion in data! I did a lot of Hadoop work early on at Modernizing Medicine while teaching myself Python and R for research at school. I was on the bleeding edge of Spark as it took over the big data world, and am a Databricks certified developer. Recently I jumped over to the Python-native big data space, and am becoming an expert in Dask and RAPIDS through my work at Saturn Cloud.

Outside of work I like to contribute to data communities through meetups and conferences. I am an organizer for PyData Miami, have spoken at Spark+AI Summit, and have mentored at hackathons at Cornell University and Lynn University. Outside of work and data I like to travel the world (virtually for now, thanks corona), ride my motorcycle, and appreciate artwork. I'm also an aspiring ukulele star (not really).

Work

2020 - Present: Saturn Cloud
  • Senior Data Scientist
  • Solutions Architect
  • Developer Advocate

2014 - 2020: Modernizing Medicine
  • Data Scientist
  • Data Engineer
  • Tech Lead

Education

2014 - 2019: PhD, Computer Science
  • Florida Atlantic University
  • 13 peer-reviewed articles
  • Published by IEEE, ACM

2014 - 2019: MS, Computer Science
  • Received with PhD

2012 - 2014: BS, Computer Science
  • Florida Atlantic University

Community

2018 - Present: PyData Miami
  • Conference Organizer
  • Speaker

Ongoing: Speaking
  • Spark+AI Summit 2018
  • Academic conferences
  • South Florida ML meetups

Ongoing: Mentorship
  • Cornell Health Tech Hackathon
  • Lynn University Hackathon