AI, startup hacks, and engineering miracles from your friends at Faraday

Buy or build: things we built

Seamus Abshere on

Here are Faraday's contributions to open source that we use every day in production. No experiments here; this is the stuff that we looked for on the shelf, found the options wanting, and built ourselves.

A new standard for secrets: Secretfile

secret_garden (Ruby), vault-env (JS), and credentials-to-env (Rust) all implement a standard we call Secretfile(s):

# /app/Secretfile
DATABASE_URL secrets/database/$VAULT_ENV:url
REDIS_URL secrets/redis/$VAULT_ENV:url

Then you use it like this: SecretGarden.fetch('DATABASE_URL').
Clients implementing this standard are meant to first check the environment for DATABASE_URL and then, failing that, look up the secret in Hashicorp Vault (interpolating $VAULT_ENV into production, staging, etc. first). It's very useful for development, where your DATABASE_URL is just postgres://seamus@127.0.0.1:5432/myapp - you can save this in a local .env file and only mess with Vault in production/staging.
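For local development, that means a .env as simple as this (values are illustrative):

# /app/.env
DATABASE_URL=postgres://seamus@127.0.0.1:5432/myapp
REDIS_URL=redis://127.0.0.1:6379/0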

Lightning fast CSV processing: catcsv and scrubcsv

catcsv is a very fast CSV concatenation tool that gracefully handles headers and compression, including Google's Snappy. We store everything on S3 and GCS szip'ed using BurntSushi's szip.

$ cat a.csv
city,state
burlington,vt

$ cat b.csv
city,state
madison,wi

$ szip a.csv

$ ls
a.csv.sz
b.csv

$ catcsv a.csv.sz b.csv
city,state
burlington,vt
madison,wi

Of course, before you cat files, sometimes you need to clean them up with scrubcsv:

$ scrubcsv giant.csv > scrubbed.csv
3000001 rows (1 bad) in 51.58 seconds, 72.23 MiB/sec

Lightning-fast fixed-width to CSV: fixed2csv

fixed2csv converts fixed-width files to CSV very fast. You start with this:

first     last      middle
John      Smith     Q
Sally     Jones

You should be able to run:

$ fixed2csv -v 10 10 6 < input.txt
first,last,middle
John,Smith,Q
Sally,Jones,

World's fastest geocoder: node_smartystreets

node_smartystreets is the world's fastest geocoder client. We shell out to its binary rather than using it as a library. It will do 10k records/second against the SmartyStreets geocoding API. If you don't have an Unlimited plan, use it with extreme caution.

Better caching: lock_and_cache

lock_and_cache (Ruby) and lock_and_cache_js (JS) go beyond normal caching libraries: they lock the calculation while it's being performed. Most caching libraries don't do locking, meaning that more than one process can be calculating the same cached value at the same time. Since you presumably cache things because they cost CPU, database reads, or money, doesn't it make sense to lock while caching?

def expensive_thing
  @expensive_thing ||= LockAndCache.lock_and_cache("expensive_thing/#{id}", expires: 30) do
    # do expensive calculation
  end
end

It uses Redis for distributed caching and locking, so this is not only cross-process but also cross-machine.
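Setup is minimal - you just hand it a Redis connection. A sketch (check the gem's README for the exact configuration call):

# e.g. in config/initializers/lock_and_cache.rb
LockAndCache.storage = Redis.new(url: ENV['REDIS_URL'])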

Better state machine: status_workflow

status_workflow handles state transitions with distributed locking using Redis. Most state machine libraries either don't do locking or use Postgres advisory locks.

class Document < ActiveRecord::Base
  include StatusWorkflow
  status_workflow(
    archive_requested: [:archiving],
    archiving: [:archived],
  )
end

Then you can do:

document.enter_archive_requested!

It's safe to use in a horizontally scaled environment because it uses distributed locking - the second process that tries to make the same transition will get an InvalidTransition error, even if it tries in the same microsecond.
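If you want to handle the losing side of that race gracefully, a sketch like this works (assuming the error class is namespaced under StatusWorkflow):

begin
  document.enter_archive_requested!
rescue StatusWorkflow::InvalidTransition
  # another process already requested the archive; nothing to do
end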

Rust build tools: rust-musl-builder and heroku-buildpack-rust

rust-musl-builder is how we build Rust apps on top of Alpine. It also drives heroku-buildpack-rust, the preeminent way of running Rust on Heroku.
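Typical usage is to mount your project into the builder image and run cargo inside it - something like this, loosely following the project's README:

$ docker run --rm -it -v "$(pwd)":/home/rust/src ekidd/rust-musl-builder \
    cargo build --release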

Minimal postgres for node: simple-postgres

simple-postgres (JS) is just the essentials to talk to Postgres from Node. We particularly love its use of template literals for apparently magical escaping:

let account = await db.row`
  SELECT *
  FROM accounts
  WHERE id = ${id}
`

Yes, that's safe!

Minimal HTTP server: srvr

srvr (JS) is a small HTTP server that speaks for itself:

  • everything express does
  • better
  • less code
  • no dependencies
  • websockets

Proper Docker API support for Rust: boondock

boondock is a rewrite of rust-docker to be more correct.

Coordinate docker-compose: cage

cage boots multiple docker-compose.ymls, each as a pod. It's sort of like a local k8s. You configure it with a bunch of docker-compose files:

pods/
├── admin.yml (a pod containing adminweb and horse)
├── common.env (common env vars)
├── donkey.yml (a pod containing donkey)
├── placeholders.yml (development-only pod with redis, db, etc.)
[...]

Local development looks like this:

$ cage pull
==== Fetching secrets from vault into config/secrets.yml
==== Logging into ECR
Fetching temporary AWS 'administrator' credentials from vault
Pulling citus        ... done
Pulling citusworker1 ... done
Pulling citusworker2 ... done
Pulling queue        ... done
Pulling redis        ... done
Pulling s3           ... done
Pulling smtp         ... done
Pulling vault        ... done
Pulling horse        ... done
Pulling adminweb     ... done
[...]
$ cage up
Starting fdy_citusworker2_1 ... done
Starting fdy_smtp_1         ... done
Starting fdy_citus_1        ... done
Starting fdy_vault_1        ... done
Starting fdy_citusworker1_1 ... done
Starting fdy_queue_1        ... done
Starting fdy_s3_1           ... done
Starting fdy_redis_1        ... done
Starting fdy_horse_1        ... done
Starting fdy_adminweb_1     ... done
[...]
$ cage stop
Stopping fdy_citusworker2_1 ... done
Stopping fdy_vault_1        ... done
Stopping fdy_citus_1        ... done
Stopping fdy_s3_1           ... done
[...]

Fixed up rust crates: rust-amqp

rust-amqp@tokio (Rust) is our rewrite of the internals of the rust-amqp crate using proper tokio. It is much more reliable, and it still needs to be merged upstream.

(beta release) 3rd gen batch processing on k8s: falconeri

falconeri is a distributed batch job runner for Kubernetes (k8s). It is compatible with Pachyderm pipeline definitions, but it is simpler and handles autoscaling and similar concerns properly.

(alpha release) Seamless transfer between Postgres/Citus and BigQuery: dbcrossbar

dbcrossbar handles all the details of transferring tables and data to and from Postgres and Google BigQuery. Additionally, it knows about Citus, the leading Postgres horizontal sharding solution - so it can do highly efficient transfers between Citus clusters and BigQuery.

Conclusion

That's it. I only mentioned tools that we use every day.

Geochunk: fast, intelligent splitting for piles of address data

Bill Morris on


This post is part of our practical cartography and data science series.

The problem: you want to split up a few million U.S. address records into equally-sized chunks that retain spatial hierarchy. You want to do this without anything other than a street address (geocoding is expensive!). Maybe you want to do this as part of a map/reduce process (we certainly do), maybe you want to do some sampling, who knows?

The solution: Muthaflippin' Geochunk

Anyone who's ever used U.S. ZIP codes as a way to subdivide datasets can tell you: 60608 (pop 79,607) is a totally different beast than 05851 (pop 525). They're not census tracts; it's not really appropriate to compare them statistically or thematically.

Our solution - largely the work of platform wizard and Rust enthusiast Eric Kidd - is to bake census data into a tool that does the splitting for you at a level that allows for easy comparison. More specifically:

It provides a deterministic mapping from ZIP codes to "geochunks" that you can count on remaining stable.

Check out the Jupyter notebook that explains the algorithm in detail, but it works like so:

Install

Install Rust first if you don't have it:

curl https://sh.rustup.rs -sSf | sh

. . . then geochunk, using the Rust package manager:

cargo install geochunk

. . . or install from one of the prepackaged binaries.

Use 1: Indexing

Build a table that assigns every U.S. ZIP code to a geochunk that contains 250,000 people:

geochunk export zip2010 250000 > chunks_of_250k_people.csv

Use 2: List processing

Alternately, let's try a pipeline example that uses geochunk csv: say you want to parallel-process every address in the state of Colorado, and you need equal-size but contiguous slices to do it.

  • Grab the statewide address file from OpenAddresses:

wget -c https://s3.amazonaws.com/data.openaddresses.io/runs/283082/us/co/statewide.zip && unzip statewide.zip

  • Pipe the full file through geochunk, into slices of about 250,000 people each:

cat us/co/statewide.csv | geochunk csv zip2010 250000 POSTCODE > statewide_chunks_150k.csv

. . . and now you have 2 million addresses, chopped into ~8 equally-sized slices with rough contiguity:

[Map: Denver-area addresses colored by geochunk]

Geochunk works at this scale in 1.38s (Have you heard us evangelizing about Rust yet?), leaving you plenty of time for the real processing.

This tool is serious dogfood for us; it's baked into our ETL system, and we use it to try to make a tiny dent in the Modifiable Areal Unit Problem. We hope you'll find it useful too.

scrubcsv: now with null value removal

Seamus Abshere on

This is part of our series on data science because it belongs in your toolchain. Happy Null Removal!

The latest version of scrubcsv has built-in null value removal:

$ cat a.csv
name,breed,age
jerry,beagle,n/a
tater,null,1

$ scrubcsv -n 'null|n/a' a.csv
name,breed,age
jerry,beagle,
tater,,1

See how null and n/a went away?

Get the latest version with

$ cargo install scrubcsv -f

How we made our CSV processing 142x faster

Bill Morris on

This post is part of our data science hacks series

At Faraday, we've long used csvkit to understand, transform, and beat senseless our many streams of data. However, even this inimitable Swiss Army knife can be improved on - we've switched to xsv.

xsv is a fast CSV-parsing toolkit written in Rust that mostly matches the functionality of csvkit (including the clutch ability to pipe between modules), with a few extras tacked on (like smart sampling). Did I mention it's fast? In a head-to-head comparison, I ran the "stats" module of xsv against "csvstat" from csvkit, on a 30k-line, 400-column CSV file:

  • Python-based csvkit chews through it in a respectable-and-now-expected 4m16s.

  • xsv takes 1.8 seconds. I don't even have time for a sip of my coffee.

The difference between csvkit and xsv is partly defined by scale; both tools are plenty fast on smaller datasets. But once you get into the 10MB-and-upward range, xsv's processing speed pulls away dramatically.

If you've been using csvkit forever (like me), or if you want to be able to transform and analyze CSVs without loading them into a DB, give xsv a shot:

Install Rust

curl https://sh.rustup.rs -sSf | sh

. . . which also gives you cargo, the Rust package manager, so you can:

Install xsv

cargo install xsv

Then be sure your PATH is configured correctly:

export PATH=~/.cargo/bin:$PATH

. . . and try it out on a demo CSV with 10k rows, some messy strings, and multiple data types:

curl https://gist.githubusercontent.com/wboykinm/044e2af62fc0c7f77e17f6ccd55b8fb0/raw/fca391e6c03a06a7be770fefca6c47a9acdd2305/mock_data.csv \
| xsv stats \
| xsv table

(xsv table formats the data so it's readable in the console):

field           type     sum                 min                  max                  min_length  max_length  mean                stddev
id              Integer  5005000             1                    1000                 1           4           500.49999999999994  288.6749902572106
first_name      Unicode                      Aaron                Willie               3           11                              
last_name       Unicode                      Adams                Young                3           10                              
email           Unicode                      aadamsp5@senate.gov  wwrightd8@upenn.edu  12          34                              
gender          Unicode                      Female               Male                 4           6                               
ip_address      Unicode                      0.111.40.87          99.50.37.244         9           15                              
value           Unicode                      $1007.98             $999.37              0           8                               
company         Unicode                      Abata                Zoovu                0           13                              
lat             Float    243963.82509999987  -47.75034            69.70287             0           9           24.42080331331331   24.98767816017553
lon             Float    443214.19009999954  -179.12198           170.29993            0           10          44.36578479479489   71.16647723898215
messed_up_data  Unicode                      !@#$%^&*()           𠜎𠜱𠝹𠱓𠱸𠲖𠳏       0           393                             
version         Unicode                      0.1.1                9.99                 3           14                              
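
As for the smart sampling mentioned earlier: assuming you've saved the demo file locally as mock_data.csv, grabbing a random subset looks something like this:

xsv sample 1000 mock_data.csv > sample.csv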

Happy parsing!

scrubcsv: clean CSVs, drop bad lines

Seamus Abshere on

This is part of our series on things that are obvious once you see them - and our data science series because it belongs in your toolchain.

Lies, damn lies, and commercial CSV export modules. Who wrote these things? On what planet would this be acceptable? Whatever.

Name,What's wrong
"Robert "Bob" Smith",quotes inside quotes
Robert "Bob" Smith,quotes in the middle
Robert Bob" Smith,unmatched quote

Ruby dies immediately trying to read it:

$ irb
irb(main):001:0> require 'csv'
=> true
irb(main):002:0> CSV.read('broken.csv')
CSV::MalformedCSVError: Missing or stray quote in line 2

Introducing scrubcsv, a lightning-fast static binary written in Rust that best-effort parses CSV and then immediately dumps back out 100% guaranteed standards-compliant CSV. Top speed? About 67 MB/s.

$ scrubcsv broken.csv > fixed.csv
4 rows (0 bad) in 0.00 seconds, 787.13 KiB/sec

$ cat fixed.csv
Name,What's wrong
"Robert Bob"" Smith""",quotes inside quotes
"Robert ""Bob"" Smith",quotes in the middle
"Robert Bob"" Smith",unmatched quote

It uses BurntSushi's world-beating CSV parser, which is almost certainly faster than your SSD.