AI, startup hacks, and engineering miracles from your friends at Faraday

How to export a Dataiku DSS (or any scikit-learn) model to PMML

Andy Rossmeissl on

This post is part of our data science series

At Faraday we use Dataiku to do ad hoc exploratory data science work, and especially for investigating new predictive techniques before building them into our platform.

Dataiku is awesome and has an incredibly responsive team. One drawback for me, however, has been Dataiku's lack of support for PMML, a standard serialization format for predictive models and their associated apparatus.

Luckily with a little hacking you can export a Dataiku model to PMML. And this technique can work anywhere you have a scikit-learn-based model you're trying to export.

Prerequisites

We're going to use Dataiku's built-in Python environment, which lives in your DSS data directory (generally /Users/username/Library/DataScienceStudio/dss_home on a Mac). We need to add a couple libraries first:

$ cd $DSS_DATA_DIR
$ ./bin/pip install sklearn_pandas
$ ./bin/pip install git+https://github.com/jpmml/sklearn2pmml.git

You'll also need a working JDK. If this doesn't work:

$ java -version
java version "1.8.0_121"  

Then install a JDK. (On Mac: brew cask install java.)

Locate your classifier

OK, now let's get our hands on the model you're trying to export. Maybe it's already in memory, but more likely it's pickled on disk. With Dataiku, you'll find your pickled classifier in a path that looks like this:

$DSS_DATA_DIR/analysis-data/PROJECTKEY/abcdefgh/ijklmnop/sessions/s1/pp1/m1

There it is, clf.pkl. It's helpful to copy this file into your working dir so we don't accidentally disturb it.

Export the model to PMML

Now let's start up an interactive Python console — again using Dataiku's built-in environment:

$ cd $DSS_DATA_DIR
$ ./bin/python
Python 2.7.10 (default, Oct 23 2015, 19:19:21)  
>>>

First let's load up some libraries:

>>> from sklearn.externals import joblib
>>> from sklearn2pmml import PMMLPipeline
>>> from sklearn2pmml import sklearn2pmml

Now we'll unmarshal the model using joblib, a pickle-compatible serialization library:

>>> clf = joblib.load('/path/to/clf.pkl')

Here's the only tricky part: we have to wrap the trained estimator in a Pipeline-like object that sklearn2pmml understands. (This is likely to get less tricky soon.)

>>> pipeline = PMMLPipeline([
...   ("estimator", clf)
... ])

And finally perform the export:

>>> sklearn2pmml(pipeline, "clf.pmml")
INFO: Parsing PKL..  
[snip]
INFO: Marshalled PMML in 714 ms.  

All done! The heavy lifting here is done by sklearn2pmml, which wraps the JPMML-SkLearn library. Thanks to Villu Ruusmann in particular for his help.

Meet Dealbot, the open-source sales cadence automation system for Pipedrive

Andy Rossmeissl on

Dealbot illustration

We've just released Dealbot, our lightweight sales automation system for Pipedrive, as open source. Over the coming weeks we'll introduce Dealbot's features along with its cadence ecosystem in a series of blog posts.

Here's the secret behind the most successful sales development teams out there:

They're using cadences.

A cadence is a named, prescribed schedule of activities like phone calls and emails that you can use to engage and qualify your leads.

7x7 cadence

SalesLoft's classic 7x7 cadence

Using the cadence approach to outreach is one of those cases where a good tool makes all the difference.

Dealbot applies cadences to your deals in Pipedrive

Paid services like SalesLoft and Outreach are the gold standard for cadence management. But these tools might be overboard for your team, and can be expensive.

Faraday's Dealbot provides everything you need to get started using cadences with your Pipedrive CRM in an open-source package that can be hosted at Heroku for free.

Here's how to get started quickly

  1. Create a free Pipedrive account if you don't already have one.

  2. Click the magic button:
    Deploy

  3. Fill out the required config fields and click "Deploy for Free." Don't close the tab! You'll get an email telling you what to do next.

Stay tuned

Over the next few weeks we'll dive into Dealbot features, use cases, cadences, and technology. In the meantime, check out the Dealbot site for more, or sign up below for future Dealbot blog posts.

Cover letters are writing samples

Andy Rossmeissl on

Right, right, we know that cover letters are dead and that nobody reads them, etc.

Don't listen to these clowns. Whether you're hiring or applying, please don't skip the cover letter. That's because (repeat after me):

Cover letters are writing samples

In many cases it's the only shot you're going to get to prove (if you're applying) or learn (if you're hiring) that the candidate knows how to convey complex ideas (a whole person!) briefly and persuasively (you're selling yourself, after all).