Migrating a machine learning pipeline to Kubernetes

Zach Lipp

he/him

Senior Software Engineer, Lumere

19 February 2020

Problem overview

We want to help our team of expert medical researchers classify hospital purchases

Field	Data Type	Example
Cost	Float	`0.01`
Description	String	`SUT SILK 3-0 SA74H`
Contract	String	`SUTURE PRODUCTS`
Department	String	`SURGERY`

Category: Sutures

You don’t need to be an expert to classify some of these

Field	Data Type	Example
Contract	String	`SUTURE PRODUCTS`

Category: Sutures

Enter machine learning!

We can use the text descriptions as inputs to classification models. This is called short text classification.
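A minimal sketch of short text classification with scikit-learn. The talk does not say which model was used, so the TF-IDF plus logistic regression combination here is an assumption, and the descriptions and categories are invented toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy, made-up purchase descriptions and categories (not real data).
descriptions = [
    "SUT SILK 3-0 SA74H",
    "SUT VICRYL 2-0 CT-1",
    "GLOVE EXAM NITRILE LG",
    "GLOVE SURG LATEX 7.5",
]
categories = ["Sutures", "Sutures", "Gloves", "Gloves"]

# Character n-grams tolerate the abbreviations and part numbers
# that are common in purchase descriptions.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(descriptions, categories)

# Classify a description the model has not seen.
prediction = model.predict(["SUT SILK 4-0"])[0]
print(prediction)
```

Character n-grams are one reasonable choice for terse, abbreviated text; word-level tokens may work just as well on longer descriptions.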

Machine learning deployment

Modeling	Delivery	Pros	Cons
Jupyter	Excel	It works!	Time intensive (for all parties); manual
ECS	Django	Delivery much simpler; does not require a data scientist to run models	Expensive; error-prone; scaling problems
Kubernetes	Django	Delivery the same; fault-tolerant; built for scale	Distributing software is hard; TBD

Results

Our reconfigured pipeline is faster end-to-end
We no longer require manual modeling runs
Improved monitoring and observability
Models are written to disk
We parallelized model training, predicting, and preprocessing
We distribute and schedule work with Dask
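A rough sketch of how preprocessing can be split into independent tasks with Dask. The function names and batches are hypothetical, and the local threaded scheduler used here is an assumption made so the example runs on one machine; the pipeline described in this talk runs Dask workers and a scheduler as separate Deployments.

```python
from dask import compute, delayed


def preprocess(description: str) -> str:
    """Hypothetical cleaning step: lowercase and collapse whitespace."""
    return " ".join(description.lower().split())


@delayed
def preprocess_batch(batch):
    """One unit of work that Dask can schedule independently."""
    return [preprocess(description) for description in batch]


# Made-up batches of purchase descriptions.
batches = [
    ["SUT  SILK 3-0 SA74H", "SUT VICRYL 2-0"],
    ["GLOVE EXAM  NITRILE LG"],
]

# Build the task graph lazily, then run every batch in parallel.
tasks = [preprocess_batch(batch) for batch in batches]
results = compute(*tasks, scheduler="threads")

# compute returns one result per task; flatten them back into one list.
cleaned = [description for batch in results for description in batch]
print(cleaned)
```

The same task graph could be sent to a cluster by creating a `dask.distributed` client before calling `compute`.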

Configuration

Two Deployments (Dask workers, scheduler)
Three CronJobs (training, predicting, refreshing training data)

Lessons learned

1. Know your APIs

scikit-learn has great functionality for building pipelines
- Pipeline
- FunctionTransformer
- ColumnTransformer
pandas can save your database some munging
- DataFrame.groupby
- DataFrame.to_sql
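A sketch of how the three scikit-learn classes fit together. The column names follow the example record earlier in the deck; the data, the log transform on cost, and the choice of classifier are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Made-up purchases shaped like the example record.
purchases = pd.DataFrame(
    {
        "Cost": [0.01, 0.02, 1.50, 2.25],
        "Description": [
            "SUT SILK 3-0 SA74H",
            "SUT VICRYL 2-0 CT-1",
            "GLOVE EXAM NITRILE LG",
            "GLOVE SURG LATEX 7.5",
        ],
        "Contract": [
            "SUTURE PRODUCTS",
            "SUTURE PRODUCTS",
            "GLOVE PRODUCTS",
            "GLOVE PRODUCTS",
        ],
    }
)
labels = ["Sutures", "Sutures", "Gloves", "Gloves"]

# ColumnTransformer routes each column to its own transformer.
# A bare column name (not a list) hands the text vectorizer a 1-D column.
features = ColumnTransformer(
    [
        ("description", TfidfVectorizer(), "Description"),
        ("contract", TfidfVectorizer(), "Contract"),
        # FunctionTransformer wraps a plain function as a pipeline step.
        ("log_cost", FunctionTransformer(np.log1p), ["Cost"]),
    ]
)

# Pipeline chains feature extraction and the classifier into one estimator.
pipeline = Pipeline(
    [
        ("features", features),
        ("classifier", LogisticRegression(max_iter=1000)),
    ]
)
pipeline.fit(purchases, labels)
predictions = pipeline.predict(purchases)
print(list(predictions))
```

Keeping preprocessing inside the pipeline means the fitted object can be saved and reused for prediction without repeating the feature code elsewhere.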

2. Treat ML code like application code

Figure from "Hidden Technical Debt in Machine Learning Systems" (Sculley et al., NIPS 2015)

By adapting old code to our new data model and using pandas instead of SQL, we avoided some costly joins and aggregations, which gave a speedup of 5-6 orders of magnitude
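A sketch of the general pattern: read the rows once, aggregate in pandas, and write the result back. The table names, columns, and the in-memory SQLite database are all hypothetical; the talk does not show the actual queries or schema.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the real database (an assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical purchases table with made-up rows.
pd.DataFrame(
    {
        "department": ["SURGERY", "SURGERY", "CARDIOLOGY"],
        "description": ["SUT SILK 3-0 SA74H", "SUT VICRYL 2-0", "STENT 3.0MM"],
        "cost": [0.01, 0.02, 5.00],
    }
).to_sql("purchases", conn, index=False)

# One simple SELECT instead of joins and GROUP BY in the database.
purchases = pd.read_sql_query(
    "SELECT department, description, cost FROM purchases", conn
)

# Aggregate in memory with DataFrame.groupby.
summary = purchases.groupby("department", as_index=False).agg(
    n_purchases=("description", "count"),
    total_cost=("cost", "sum"),
)

# Write the aggregated result back with DataFrame.to_sql.
summary.to_sql("department_summary", conn, index=False, if_exists="replace")

stored = pd.read_sql_query(
    "SELECT * FROM department_summary ORDER BY department", conn
)
print(stored)
```

Whether this beats doing the work in SQL depends on data size and schema; it helps most when the data fits in memory and the database-side query is the bottleneck.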

3. Avoid premature optimization

One success:
We focused on migrating models as-is while independently researching better models

One ~~failure~~ opportunity for improvement:
Dask fails silently and fails often

Lessons learned

1. Know your APIs
2. Treat ML code like application code
3. Avoid premature optimization

Fin

lippingoff.netlify.app/talks/machine-learning-kubernetes-migration/