Migrating a machine learning pipeline to Kubernetes

Zach Lipp

he/him

Senior Software Engineer, Lumere

19 February 2020

Problem overview

We want to help our team of expert medical researchers classify hospital purchases

Field        Data Type  Example
Cost         Float      0.01
Description  String     SUT SILK 3-0 SA74H
Contract     String     SUTURE PRODUCTS
Department   String     SURGERY

Category: Sutures

Problem overview

You don’t need to be an expert for some of these

Field     Data Type  Example
Contract  String     SUTURE PRODUCTS

Category: Sutures

Enter machine learning!

We can use the text descriptions as inputs to classification models. This is called short text classification.
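A minimal sketch of short text classification, assuming scikit-learn and a handful of made-up descriptions and categories. The talk does not say which vectorizer or model the team used, so the character n-gram TF-IDF features and logistic regression here are illustrative choices only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up purchase descriptions and categories, for illustration only.
descriptions = [
    "SUT SILK 3-0 SA74H",
    "SUT VICRYL 4-0 J214H",
    "GLOVE EXAM NITRILE MED",
    "GLOVE SURG LATEX 7.5",
]
categories = ["Sutures", "Sutures", "Gloves", "Gloves"]

# Character n-grams within word boundaries often cope with abbreviations
# and part numbers better than whole-word tokens do.
classifier = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
classifier.fit(descriptions, categories)

# Classify a description the model has not seen.
prediction = classifier.predict(["SUT SILK 2-0 K833H"])[0]
print(prediction)
```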

Machine learning deployment

Modeling: Jupyter, Delivery: Excel
  • Pros
    • It works!
  • Cons
    • Time intensive (for all parties)
    • Manual

Modeling: ECS, Delivery: Django
  • Pros
    • Delivery much simpler
    • Does not require data scientist to run models
  • Cons
    • Expensive
    • Error-prone
    • Scaling problems

Modeling: Kubernetes, Delivery: Django
  • Pros
    • Delivery the same
    • Fault-tolerant
    • Built for scale
  • Cons
    • Distributing software is hard
    • TBD

Results

  • Our reconfigured pipeline is faster end-to-end
  • We no longer require manual modeling runs
  • Improved monitoring and observability
  • Models are written to disk
  • We parallelized model training, predicting, and preprocessing
  • We distribute and schedule work with Dask
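As a rough illustration of distributing preprocessing with Dask, the sketch below splits descriptions into chunks and cleans each chunk as a separate task. The `preprocess` function and the sample data are hypothetical, and the code runs on Dask's local scheduler rather than a cluster.

```python
import dask
from dask import delayed


def preprocess(description: str) -> str:
    """Hypothetical cleaning step: lowercase and collapse whitespace."""
    return " ".join(description.lower().split())


@delayed
def preprocess_chunk(chunk):
    # Each chunk becomes one task that Dask can schedule independently.
    return [preprocess(description) for description in chunk]


# Made-up data, already split into chunks.
chunks = [
    ["SUT  SILK 3-0 SA74H", "SUT VICRYL 4-0"],
    ["GLOVE EXAM  NITRILE MED"],
]

# Build the task graph lazily, then execute all tasks together.
tasks = [preprocess_chunk(chunk) for chunk in chunks]
results = dask.compute(*tasks)

# Flatten the per-chunk results back into one list.
cleaned = [description for chunk in results for description in chunk]
print(cleaned)
```

On a cluster, creating a `dask.distributed.Client` connected to the scheduler would send the same `dask.compute` call to the workers instead of local threads.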

Configuration

  • Two Deployments (Dask workers, scheduler)
  • Three CronJobs (training, predicting, refreshing training data)
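The three CronJobs could be described along the following lines. This sketch builds the manifests as Python dictionaries and serializes them to JSON, which `kubectl apply -f` accepts alongside YAML. The job names, schedules, image, and commands are all hypothetical; the talk does not give the real values.

```python
import json


def cron_job(name, schedule, command, image="example.com/ml-pipeline:latest"):
    """Build a CronJob manifest. The image and all arguments are placeholders."""
    return {
        # batch/v1 is stable from Kubernetes 1.21; older clusters use batch/v1beta1.
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            # Skip a run if the previous one is still in progress.
            "concurrencyPolicy": "Forbid",
            "jobTemplate": {
                "spec": {
                    "template": {
                        "spec": {
                            "restartPolicy": "OnFailure",
                            "containers": [
                                {"name": name, "image": image, "command": command}
                            ],
                        }
                    }
                }
            },
        },
    }


# One CronJob per recurring task; schedules are illustrative.
jobs = [
    cron_job("refresh-training-data", "0 0 * * *", ["python", "-m", "pipeline.refresh"]),
    cron_job("train-models", "0 2 * * *", ["python", "-m", "pipeline.train"]),
    cron_job("predict", "0 4 * * *", ["python", "-m", "pipeline.predict"]),
]

# Wrap the manifests in a List so they can be applied together.
manifest = json.dumps({"apiVersion": "v1", "kind": "List", "items": jobs}, indent=2)
print(manifest)
```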

Lessons learned

1. Know your APIs

  • scikit-learn has great functionality for building pipelines
    • Pipeline
    • FunctionTransformer
    • ColumnTransformer
  • pandas can save your database some munging
    • DataFrame.groupby
    • .to_sql()

Lessons learned

2. Treat ML code like application code

From "Hidden Technical Debt in Machine Learning Systems", NIPS 2015

Lessons learned

2. Treat ML code like application code

By adapting old code to fit our new data model and using pandas instead of SQL, we avoided some costly joins and aggregations, which led to a speedup of 5-6 orders of magnitude.
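The general idea of aggregating in pandas and writing only the result to the database might look like the sketch below. The table and column names are hypothetical, an in-memory SQLite database stands in for the real one, and nothing here reproduces the original queries or their timings.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database as a stand-in for the application database.
conn = sqlite3.connect(":memory:")

# Made-up purchases, for illustration only.
purchases = pd.DataFrame(
    {
        "department": ["SURGERY", "SURGERY", "CARDIOLOGY"],
        "category": ["Sutures", "Sutures", "Stents"],
        "cost": [0.01, 0.03, 1200.00],
    }
)

# Aggregate in memory instead of asking the database to join and group.
summary = purchases.groupby(["department", "category"], as_index=False).agg(
    total_cost=("cost", "sum"),
    purchase_count=("cost", "count"),
)

# Write the finished summary back in a single call.
summary.to_sql("category_summary", conn, index=False, if_exists="replace")

rows = conn.execute(
    "SELECT department, category, purchase_count "
    "FROM category_summary ORDER BY department"
).fetchall()
print(rows)
```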

Lessons learned

3. Avoid premature optimization

One success:
We focused on migrating models as-is while independently researching better models

One failure (or rather, opportunity for improvement):
Dask fails silently and fails often

Lessons learned

  1. Know your APIs
  2. Treat ML code like application code
  3. Avoid premature optimization

Fin