Talks & Publications

Conferences, meetups

Talks & Publications

Talks

Watson Institute master course: Intro to machine learning
Watson Institute at Lynn University, April 2020
Running and analyzing machine learning experiments in the cloud
IEEE International Conference on Machine Learning & Applications, December 2019
Intro to machine learning + ML for mobile
South Florida Mobile Developers meetup, December 2019
Predicting melanoma risk from electronic health records with machine learning techniques
PhD dissertation defense, July 2019
I’m a data scientist, ask me anything! (Live AMA)
Ironhack Miami, March 2019
Data @ modmed (Ironhack DataXperience)
Ironhack Miami, February 2019
Your data fits in RAM: How to avoid cluster computing
PyData Miami, January 2019
Shaping the future of data analytics
Panel member, Lynn University Business Symposium, November 2018
A modern big data architecture for healthcare research
FAU Big Data Science conference, October 2018
Doing big data with Spark (+R)
Data.miami bootcamp, July 2018
Dynamic healthcare dataset generation, curation, and quality with PySpark
Spark+AI Summit, June 2018
Predicting melanoma risk from electronic health record data
Miami Machine Learning meetup, May 2018
Feature learning with matrix factorization and neural networks
Ft. Lauderdale Machine Learning meetup, February 2018
Machine learning in the cloud with Amazon Web Services
Miami Machine Learning meetup, February 2018
Machine learning with big data using Spark
Ft. Lauderdale Machine Learning meetup, November 2015
Big data analytics with Hadoop (lecturer)
FAU course COT6930, Fall 2015
Communicating Experimental Results with R
FAU Big Data Lab, June 2015

Selected publications

Sample size determination for biomedical big data with limited labels
Network Modeling Analysis in Health Informatics and Bioinformatics, January 2020
Predicting melanoma risk from electronic health records with machine learning techniques
PhD dissertation, July 2019
Efficient learning from big data for cancer risk modeling: A case study with melanoma
Computers in Biology and Medicine, April 2019
Melanoma risk prediction with structured electronic health records
ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, August 2018
A review of statistical and machine learning methods for modeling cancer risk using structured clinical data
Artificial Intelligence in Medicine, July 2018
Predicting sentinel node status in melanoma from a real-world EHR dataset
IEEE International Conference on Bioinformatics and Biomedicine, November 2017

Find all publications at my Google scholar and contact me if you can't get through paywalls.

Predicting Melanoma Risk

PhD dissertation

Predicting Melanoma Risk

Built and evaluated several machine learning models to predict individual patient risk of developing melanoma from routinely captured electronic health records. Involved processing de-identified data from over 20 million patients into a research dataset to create over 100,000 features used for building predictive models.

Wrote all code and conducted all experiments myself.

Languages: Python, R, SQL, LaTeX
Platforms: AWS (EC2, S3, EMR), Databricks
Data prep: Apache Spark, tidyverse, pandas/numpy/scipy
ML: logistic regression, decision trees, random forest, XGBoost, SHAP
Analysis: RMarkdown, ggplot2, Shiny
Writing: Overleaf / ShareLaTeX, Lucidchart

Download PDF here!

Spark ETL

Object-oriented ETL framework

Spark ETL

This project is the pride and joy of my work at ModMed. It started as a way to generate different datasets that had shared ETL components, and grew to power the entire business's enterprise data warehouse and data products.

The main concept is having object-oriented wrappers around the functional logic of ETL transformations. This allows re-usable operations to be encapsulated and de-couples storage/IO logic from the ETL. It is portable and can run on any environment that supports Spark. Engineers can focus solely on data transformations without worrying about scaffolding or boilerplate code. Additionally, it allows for local development and unit testing before deploying to data in the wild. Metadata and data quality are first class citizens, with a table not "complete" until data quality rules have been written and applied at runtime. Data dictionaries are generated from code so documentation is always up-to-date. Find out more in the video below from my Spark+AI Summit talk.

Most recently, I refactored the core code to support Databricks as an execution environment in addition to Amazon EMR. As part of this process, I built an Airflow DAG that allowed tables to be generated asynchronously, blocking until each table's parent(s) were generated. This decreased the runtime of our nightly data warehouse pipeline by 75%, enabling a new data product to be delivered to our clients that was not previously possible.

Languages: Python, JSON, YAML, Java (minimal)
Platforms: Databricks, AWS (EMR, S3)
Frameworks: Spark, Airflow
Tables: 500+ tables for 20+ data products
Code: 40,000+ lines of code and 500+ pull requests (30,000 and 350 by yours truly)

Data Team @ ModMed

Tech lead

Data Team @ ModMed

I started at ModMed as a software engineering intern focused on building out data pipelines. I was eventually brought on full time as a data scientist, becoming the first member of the data team. The team is a dynamic team within a dynamic company, exemplified by our growth (and the fact that my desk has been moved over 10 times!). I have been fortunate to see the team expand from just me to over 10 data professionals.

As an individual with intricate knowledge of the data models and ETL pipelines at the company, I serve as a technical lead and subject matter expert for the data warehouse and analytics products. This involves architecture work, best practice/documentation writing, and software lifecycle activities.

Conducted countless interviews
Recruited several data engineers from my personal network
Interviewed and onboarded multiple data engineers, biostatisticians, data analysts, and engineering interns
Maintain software lifecycle for four major code repositories

AWS ML Experimenter

scikit-learn + EC2

AWS ML Experimenter

This package was built to support machine learning experimentation for my PhD research. The data was too big (and sensitive) to fit on a personal workstation, so all data preparation and model training was performed in the cloud with AWS. Then, the results were pulled down from S3 files for local analysis with R.

I built a framework around these processes to make it seamless to develop the ML code locally on test data, then deploy it to the full data in the cloud when ready. I also built a Shiny dashboard to monitor the experiments in real time, as some of them would take days to complete. Check out the video walkthrough below!

Languages: Python, R, Bash
Platforms: AWS (EC2, S3), shinyapps.io
ML: scikit-learn, keras
Analysis: RMarkdown, Shiny

The project is open-source and available here .

PyData Miami

Conference/meetup organizer, speaker

PyData Miami

PyData Miami is a meetup and conference chapter of the global PyData network. PyData focuses on computational and data applications of the Python language, along with others such as R and Julia.

I have been a frequent speaker and organizer of the meetup since 2018 (formerly the Miami Machine Learning meetup), and was a co-organizer and program chair for the inaugural PyData Miami conference in 2019. The 2020 conference is on hold due to our (least) favorite human coronavirus, but you can watch my PyData Miami 2019 talk below!

Spoke at several meetups and the PyData Miami 2019 conference
Program Chair for 2019 and 2020 conferences
Witnessed the Miami data community grow, and met so many awesome people!

About Me

I started my coding journey way back in middle school- fighting Lego robots and scraping Flash games off websites. After I received my bachelor's degree in computer science I started an internship at Modernizing Medicine while also pursuing my PhD in machine learning. Little did I know that I would find my passion in data! I did a lot of Hadoop during my early work at Modernizing Medicine while teaching myself Python and R for research at school. I was on the bleeding edge of Spark as it took over the big data world, and am a Databricks certified developer. Recently I jumped over to the Python-native big data space, and am becoming an expert in Dask and RAPIDS through my work at Saturn Cloud.

Outside of work I like to contribute to data communities through meetups and conferences. I am an organizer for PyData Miami, have spoken at Spark+AI Summit, and mentored hackathons at Cornell University and Lynn University. Outside of work and data I like to travel the world (virtually for now, thanks corona), ride my motorcycle, and appreciate artwork. I'm also an aspiring ukelele star (not really).

Work

2020 - Present

Saturn Cloud

Senior Data Scientist
Solutions Architect
Developer Advocate

2014 - 2020

Modernizing Medicine

Data Scientist
Data Engineer
Tech Lead

Education

2014 - 2019

PhD, Computer Science

Florida Atlantic University
13 peer-reviewed articles
Published by IEEE, ACM

2014 - 2019

MS, Computer Science

Received with PhD

2012 - 2014

BS, Computer Science

Florida Atlantic University

Community

2018 - Present

PyData Miami

Conference Organizer
Speaker

Ongoing

Speaking

Spark+AI Summit 2018
Academic conferences
South Florida ML meetups

Ongoing

Mentorship

Cornell Health Tech Hackathon
Lynn University Hackathon

Aaron Richter, PhD

data scientist // data engineer

What I Do

Talks & Publications

Talks & Publications

Talks

Selected publications

Predicting Melanoma Risk

Predicting Melanoma Risk

Spark ETL

Spark ETL

Data Team @ ModMed

Data Team @ ModMed

AWS ML Experimenter

AWS ML Experimenter

PyData Miami

PyData Miami

About Me

Work

Education

Community