AI, startup hacks, and engineering miracles from your friends at Faraday

How to export a Dataiku DSS (or any scikit-learn) model to PMML

This post is part of our data science series

At Faraday we use Dataiku to do ad hoc exploratory data science work, and especially for investigating new predictive techniques before building them into our platform.

Dataiku is awesome and has an incredibly responsive team. One drawback for me, however, has been Dataiku's lack of support for PMML, a standard serialization format for predictive models and their associated apparatus.

Luckily, with a little hacking you can export a Dataiku model to PMML, and the same technique works anywhere you have a scikit-learn-based model to export.


We're going to use Dataiku's built-in Python environment, which lives in your DSS data directory (generally /Users/username/Library/DataScienceStudio/dss_home on a Mac). We need to add a couple of libraries first, running pip from that directory:

$ ./bin/pip install sklearn_pandas
$ ./bin/pip install git+

You'll also need a working JDK. If this command fails instead of printing a version:

$ java -version
java version "1.8.0_121"

Then install a JDK. (On Mac: brew cask install java.)
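
If you'd rather run the same check from Python, here's a minimal sketch. The java_available helper is my own name for illustration, not part of Dataiku or sklearn2pmml; all it does is run java -version and report whether that succeeded.

```python
import os
import subprocess


def java_available(command="java"):
    """Return True if `<command> -version` runs and exits with status 0."""
    try:
        with open(os.devnull, "w") as devnull:
            # `java -version` writes to stderr, so silence both streams.
            status = subprocess.call(
                [command, "-version"], stdout=devnull, stderr=devnull
            )
    except OSError:
        # Raised when the executable cannot be found or started.
        return False
    return status == 0


print("JDK found" if java_available() else "No working JDK found; install one first")
```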

Locate your classifier

OK, now let's get our hands on the model you're trying to export. Maybe it's already in memory, but more likely it's pickled on disk. With Dataiku, you'll find your pickled classifier in a path that looks like this:


There it is, clf.pkl. It's helpful to copy this file into your working dir so we don't accidentally disturb it.

Export the model to PMML

Now let's start up an interactive Python console — again using Dataiku's built-in environment:

$ ./bin/python
Python 2.7.10 (default, Oct 23 2015, 19:19:21)

First let's load up some libraries:

>>> from sklearn.externals import joblib
>>> from sklearn2pmml import PMMLPipeline
>>> from sklearn2pmml import sklearn2pmml

Now we'll unmarshal the model using joblib, a pickle-compatible serialization library:

>>> clf = joblib.load('/path/to/clf.pkl')

Here's the only tricky part: we have to wrap the trained estimator in a Pipeline-like object that sklearn2pmml understands. (This is likely to get less tricky soon.)

>>> pipeline = PMMLPipeline([
...   ("estimator", clf)
... ])

And finally perform the export:

>>> sklearn2pmml(pipeline, "clf.pmml")
INFO: Parsing PKL..
INFO: Marshalled PMML in 714 ms.

All done! The heavy lifting here is done by sklearn2pmml, which wraps the JPMML-SkLearn library. Thanks to Villu Ruusmann in particular for his help.
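
As a quick sanity check on the exported file, here's a sketch that parses it with Python's standard library and confirms the root element really is PMML. The pmml_summary and local_name helpers are hypothetical names for illustration; they aren't part of sklearn2pmml, and this only checks the file's outer shape, not whether the model inside is correct.

```python
import os
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the "{namespace}" prefix ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def pmml_summary(path):
    """Return the PMML version and the names of the top-level elements."""
    root = ET.parse(path).getroot()
    if local_name(root.tag) != "PMML":
        raise ValueError("%s does not look like a PMML document" % path)
    return {
        "version": root.get("version"),
        "elements": [local_name(child.tag) for child in root],
    }


# Only runs when the export above has produced clf.pmml in this directory.
if os.path.exists("clf.pmml"):
    print(pmml_summary("clf.pmml"))
```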

Meet Dealbot, the open-source sales cadence automation system for Pipedrive


We've just released Dealbot, our lightweight sales automation system for Pipedrive, as open source. Over the coming weeks we'll introduce Dealbot's features along with its cadence ecosystem in a series of blog posts.

Here's the secret behind the most successful sales development teams out there:

They're using cadences.

A cadence is a named, prescribed schedule of activities like phone calls and emails that you can use to engage and qualify your leads.
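
To make the idea concrete, here's a rough sketch of a cadence as a data structure. It's purely illustrative: the Step and Cadence classes and the three-touch example are made up for this post, and they are not Dealbot's actual format or SalesLoft's 7x7 schedule.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Step:
    day: int       # 1-based day of the cadence on which the activity is due
    activity: str  # e.g. "call" or "email"


@dataclass(frozen=True)
class Cadence:
    name: str
    steps: tuple

    def schedule(self, start):
        """Turn day offsets into calendar dates, counting `start` as day 1."""
        return [
            (start + timedelta(days=step.day - 1), step.activity)
            for step in self.steps
        ]


# A hypothetical three-touch cadence, invented for this example.
example = Cadence(
    "Example three-touch",
    (Step(1, "email"), Step(3, "call"), Step(7, "email")),
)

for due, activity in example.schedule(date(2017, 1, 2)):
    print(due.isoformat(), activity)
```

Applying a cadence to a deal then amounts to picking a start date and working through the resulting list.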


SalesLoft's classic 7x7 cadence

Using the cadence approach to outreach is one of those cases where a good tool makes all the difference.

Dealbot applies cadences to your deals in Pipedrive

Paid services like SalesLoft and Outreach are the gold standard for cadence management. But these tools might be overkill for your team, and can be expensive.

Faraday's Dealbot provides everything you need to get started using cadences with your Pipedrive CRM in an open-source package that can be hosted at Heroku for free.

Here's how to get started quickly

  1. Create a free Pipedrive account if you don't already have one.

  2. Click the magic button:

  3. Fill out the required config fields and click "Deploy for Free." Don't close the tab! You'll get an email telling you what to do next.

Stay tuned

Over the next few weeks we'll dive into Dealbot features, use cases, cadences, and technology. In the meantime, check out the Dealbot site for more, or sign up below for future Dealbot blog posts.

Cover letters are writing samples

Right, right, we know that cover letters are dead and that nobody reads them, etc.

Don't listen to these clowns. Whether you're hiring or applying, please don't skip the cover letter. That's because (repeat after me):

Cover letters are writing samples

In many cases it's the only shot you're going to get to prove (if you're applying) or learn (if you're hiring) that the candidate knows how to convey complex ideas (a whole person!) briefly and persuasively (you're selling yourself, after all).

How to migrate your Hubspot blog to GitHub Pages, Jekyll, or somewhere else

Hubspot can be a great tool depending on the size/structure of your business, but it's not for everybody. If you find yourself wanting to move your blog off of Hubspot's COS, you've probably already found Hubspot's export documentation—and probably (like me) found the resulting data lacking.

(If you haven't done this, here's a spoiler: Hubspot exports each post as a separate HTML file, fully rendered and complete with all of your template code. This isn't very helpful if you've built a new template somewhere else and want to insert your old post content into it.)

Here's how to migrate your blog

There are two pieces of info you need before you get started:

  1. Hubspot API key — You can get this here
  2. Hubspot blog ID — From your Hubspot dashboard, choose Content → Blog and choose the blog you want to export from the dropdown at the top. You'll find the blog's numeric ID in the URL. (The URL will have 2 numbers in it: you want the second, likely larger, number, as the first is your Hubspot account ID.)

Next, you should download this gist by clicking the "Download ZIP" button and unzip it somewhere. You should review the code here to make sure I'm not doing anything nefarious to your Hubspot data.

Then, to perform the export, you'll run the following commands in your shell (assuming you have a modern Ruby with Bundler installed):

$ cd path-to-gist
$ bundle
$ HUBSPOT_API_KEY=XXX HUBSPOT_BLOG_ID=YYY bundle exec ruby export_hubspot_blog_posts.rb

There should now be a blog directory in there with a markdown file for each post you exported.

Customizing for your migration target

By default the script assumes you're trying to move to GitHub Pages or some other Jekyll-powered blog host. If you're trying to go somewhere else with your post content, you may find the script to be a good starting point. Inside the loop, you have easy access to all of the data you need to generate a corresponding new post within the new blog.
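
The export script itself is Ruby, but the general shape of "turn one post's data into one file" is easy to sketch. Here's a rough Python illustration, not a port of the gist: the render_post function and the title, date, and body field names are assumptions made up for this example, so check the script for the fields it actually gives you.

```python
import json


def render_post(post):
    """Build a markdown document with YAML front matter from a post dict."""
    front_matter = [
        "---",
        # json.dumps produces a double-quoted string that is also valid YAML,
        # which keeps titles containing colons or quotes from breaking parsing.
        "title: " + json.dumps(post["title"]),
        "date: " + post["date"],
        "---",
    ]
    return "\n".join(front_matter) + "\n\n" + post["body"].strip() + "\n"


# Hypothetical post data, standing in for whatever the export loop provides.
sample = {"title": "Hello world", "date": "2017-01-02", "body": "Body text\n"}
print(render_post(sample))
```

For a target that isn't Jekyll, you'd swap the front matter for whatever metadata format that platform expects.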

Moving posts to GitHub Pages/Jekyll

Just copy the blog directory over to your blog repo. You'll probably want to make sure all these posts get put inside a layout by putting something like this in your _config.yml:

defaults:
  - scope:
      path: "blog"
    values:
      layout: "post"

And that's it!


If you have ideas to make this better (or have found an error) please tweet us at @faradayio.

Here's a free, easy way to track your MRR

I was once pitching an investor who asked what our current MRR was. I returned to the traction slide in our deck and started to repeat the (modest) figure, but he stopped me: "Not what's in the deck from a couple weeks ago—I mean right this minute."

I bungled the response. Don't let it happen to you.

The reality is that in early stages of a SaaS startup, MRR is tricky. You're trying out a few different pricing models, a single customer churning can represent a huge percentage of your revenue, and, if you're lucky, your co-founder is adding new subscriptions you don't even know about.

There are tools like InsightSquared that can help with this, but as an early stage startup you probably aren't going to like the price tag. You can try to reconstruct things from your Stripe dashboard, but churn timing complicates the picture.

Without further ado . . .

The Faraday MRR Changelog

It's a simple, free Google Sheets doc that your team can share to keep a real-enough-time tally of MRR and customer count.


How to install

  1. Go here for the template
  2. Choose File ➞ Make a copy
  3. Fill in row 6 with your current effective MRR

How to use

The key thing is to add a new row every time something happens. If you add a customer and they later churn, don't delete the original row: add a new one. Check out the second tab in the doc for examples.

  • What's the difference between log date and effective date? The log date is whenever you add the row. The effective date is a little trickier: it's the date at which the change in MRR or customer count takes effect. If you close a deal, the effective date will probably be the same as the log date. If a customer on an annual plan notifies you halfway through the year that they won't be renewing, the log date of the churn is today, but the effective date is the last day of their annual billing cycle.

  • Should I ever change an old row? Probably not, unless you're correcting an old mistake, but even then I'd recommend against it: just create a new row with the changes necessary to reconcile. Changelogs work best when they're (to borrow a programming term) immutable.
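
If it helps to see the arithmetic, here's a small sketch of how an append-only changelog turns into a "right this minute" number. The Entry class, its field names, and the figures are invented for illustration and won't match the template's columns exactly; the point is just that you sum every row whose effective date has arrived.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: rows are never edited, only appended
class Entry:
    log_date: date        # when the row was added
    effective_date: date  # when the change actually takes effect
    mrr_change: int       # positive for new revenue, negative for churn
    customer_change: int  # positive for new customers, negative for lost ones


def totals_as_of(entries, as_of):
    """Sum every change whose effective date is on or before `as_of`."""
    live = [e for e in entries if e.effective_date <= as_of]
    return sum(e.mrr_change for e in live), sum(e.customer_change for e in live)


# Made-up numbers for the example.
log = [
    Entry(date(2017, 1, 1), date(2017, 1, 1), 5000, 10),    # starting point
    Entry(date(2017, 2, 15), date(2017, 2, 15), 500, 1),    # deal closed
    Entry(date(2017, 3, 1), date(2017, 6, 30), -1200, -1),  # annual plan not renewing
]

print(totals_as_of(log, date(2017, 3, 1)))   # (5500, 11): churn logged, not yet effective
print(totals_as_of(log, date(2017, 6, 30)))  # (4300, 10): churn has taken effect
```

Because the churn row carries a future effective date, you can log it the day you hear about it without understating today's MRR.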

Bugs? Ideas?

Drop me a line at my first name at