
How to export a Dataiku DSS (or any scikit-learn) model to PMML

By Andy Rossmeissl

This post is part of our data science series

At Faraday we use Dataiku to do ad hoc exploratory data science work, and especially for investigating new predictive techniques before building them into our platform.

Dataiku is awesome and has an incredibly responsive team. One drawback for me, however, has been Dataiku's lack of support for PMML, a standard serialization format for predictive models and their associated apparatus.

Luckily, with a little hacking you can export a Dataiku model to PMML. The same technique should work anywhere you have a scikit-learn-based model you're trying to export.

Prerequisites

We're going to use Dataiku's built-in Python environment, which lives in your DSS data directory (generally /Users/username/Library/DataScienceStudio/dss_home on a Mac; the commands below refer to it as $DSS_DATA_DIR). We need to add a couple of libraries first:

$ cd $DSS_DATA_DIR
$ ./bin/pip install sklearn_pandas
$ ./bin/pip install git+https://github.com/jpmml/sklearn2pmml.git
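
Before moving on, it can be worth confirming that both installs landed in Dataiku's environment rather than your system Python. Here's a minimal sketch, not part of the original recipe: missing_modules is a hypothetical helper name, and it assumes the import names are sklearn_pandas and sklearn2pmml, matching the packages installed above.

```python
import importlib


def missing_modules(names):
    """Return the entries of `names` that raise ImportError in this interpreter."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing
```

Run from $DSS_DATA_DIR/bin/python, missing_modules(["sklearn_pandas", "sklearn2pmml"]) should come back as an empty list if both libraries are available.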

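You'll also need a working JDK. Check for one like this:

$ java -version
java version "1.8.0_121"

If that fails instead of printing a version, install a JDK. (On a Mac with Homebrew: brew cask install java. Newer Homebrew versions have dropped the brew cask subcommand in favor of brew install --cask.)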
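
Since the export step later on needs to launch Java, you may prefer to run the same check from inside Python, in the environment that will do the export. A rough sketch, assuming java is expected on your PATH; java_version is a hypothetical helper, not something Dataiku or sklearn2pmml provides.

```python
import subprocess


def java_version(java_cmd="java"):
    """Return the text printed by `java -version`, or None if Java is not usable."""
    try:
        proc = subprocess.Popen(
            [java_cmd, "-version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # `java -version` writes to stderr
        )
        output, _ = proc.communicate()
    except OSError:
        # The executable was not found or could not be started.
        return None
    if proc.returncode != 0:
        # The command exists but failed, e.g. a launcher stub with no JDK behind it.
        return None
    return output.decode("utf-8", "replace").strip()
```

A None result means the same thing as a failing java -version at the shell: install a JDK before continuing.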

Locate your classifier

OK, now let's get our hands on the model you're trying to export. Maybe it's already in memory, but more likely it's pickled on disk. With Dataiku, you'll find your pickled classifier in a path that looks like this:

$DSS_DATA_DIR/analysis-data/PROJECTKEY/abcdefgh/ijklmnop/sessions/s1/pp1/m1

Inside that directory is your classifier, clf.pkl. It's helpful to copy this file into your working dir so you don't accidentally disturb the original.
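
If you aren't sure which analysis, session, or model directory you're after, a short script can list every candidate for you. This is only a sketch: find_classifiers and copy_to_workdir are hypothetical helper names, and it assumes the pickles are all named clf.pkl and sit somewhere under the analysis-data directory, as in the path above.

```python
import os
import shutil


def find_classifiers(dss_data_dir, filename="clf.pkl"):
    """Return sorted paths of every `filename` under <dss_data_dir>/analysis-data."""
    root = os.path.join(dss_data_dir, "analysis-data")
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            matches.append(os.path.join(dirpath, filename))
    return sorted(matches)


def copy_to_workdir(src, workdir="."):
    """Copy the pickle into `workdir` and return the copy's path.

    The original file is left where it is.
    """
    dest = os.path.join(workdir, os.path.basename(src))
    shutil.copy2(src, dest)
    return dest
```

For example, find_classifiers("/Users/username/Library/DataScienceStudio/dss_home") returns one path per trained model, and you can pass the one you want to copy_to_workdir.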

Export the model to PMML

Now let's start up an interactive Python console — again using Dataiku's built-in environment:

$ cd $DSS_DATA_DIR
$ ./bin/python
Python 2.7.10 (default, Oct 23 2015, 19:19:21)  
>>>

First let's load up some libraries:

>>> from sklearn.externals import joblib
>>> from sklearn2pmml import PMMLPipeline
>>> from sklearn2pmml import sklearn2pmml

Now we'll unmarshal the model using joblib, a pickle-compatible serialization library:

>>> clf = joblib.load('/path/to/clf.pkl')
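
One caveat if you try this outside Dataiku's bundled environment: newer scikit-learn releases (0.23 and later) no longer ship sklearn.externals.joblib, and joblib is installed as a standalone package instead. Here's a small sketch of an import that copes with both layouts; import_joblib is a hypothetical helper name, not something either library provides.

```python
def import_joblib():
    """Return the joblib module from wherever it lives in this environment."""
    try:
        # Standalone package; the only option on newer scikit-learn releases.
        import joblib
    except ImportError:
        # Older scikit-learn releases bundle their own copy, as used in this post.
        from sklearn.externals import joblib
    return joblib
```

You'd then load the model with import_joblib().load('/path/to/clf.pkl'). Keep in mind that a pickle generally needs to be loaded with the same Python and scikit-learn versions that wrote it, so loading from Dataiku's own environment remains the safest bet.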

Here's the only tricky part: we have to wrap the trained estimator in a Pipeline-like object that sklearn2pmml understands. (This is likely to get less tricky soon.)

>>> pipeline = PMMLPipeline([
...   ("estimator", clf)
... ])

And finally perform the export:

>>> sklearn2pmml(pipeline, "clf.pmml")
INFO: Parsing PKL..  
[snip]
INFO: Marshalled PMML in 714 ms.  
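
To sanity-check the result without a PMML consumer at hand, you can parse the file and look at its top-level elements. This is a sketch using only the standard library; pmml_summary is a hypothetical helper, and it only confirms that the file is well-formed XML with a PMML root element, not that it validates against the PMML schema.

```python
import xml.etree.ElementTree as ET


def _local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def pmml_summary(pmml_path):
    """Return the PMML version and the names of the root's child elements."""
    root = ET.parse(pmml_path).getroot()
    if _local_name(root.tag) != "PMML":
        raise ValueError("root element is %r, expected PMML" % root.tag)
    return {
        "version": root.get("version"),
        "elements": [_local_name(child.tag) for child in root],
    }
```

Calling pmml_summary("clf.pmml") on a successful export should list at least a Header and a DataDictionary, plus a model element whose name depends on the estimator you exported.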

All done! The heavy lifting here is done by sklearn2pmml, which wraps the JPMML-SkLearn library. Thanks to Villu Ruusmann in particular for his help.