AI, startup hacks, and engineering miracles from your friends at Faraday

How we made our CSV processing 142x faster

This post is part of our data science hacks series.

At Faraday, we've long used csvkit to understand, transform, and beat senseless our many streams of data. However, even this inimitable Swiss Army knife can be improved on - we've switched to xsv.

xsv is a fast CSV-parsing toolkit written in Rust that mostly matches the functionality of csvkit (including the clutch ability to pipe between modules), with a few extras tacked on (like smart sampling). Did I mention it's fast? In a head-to-head comparison, I ran the "stats" module of xsv against "csvstat" from csvkit, on a 30k-line, 400-column CSV file:

  • Python-based csvkit chews through it in a respectable-and-now-expected 4m16s.

  • xsv takes 1.8 seconds. I don't even have time for a sip of my coffee.
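
The headline number is just the ratio of those two wall-clock times. A quick check of the arithmetic, using only the timings quoted above:

```shell
# Timings quoted above: csvkit 4m16s, xsv 1.8s
csvkit_seconds=$((4 * 60 + 16))   # 256 seconds
xsv_seconds=1.8

# awk does the floating-point division that plain shell arithmetic can't
speedup=$(awk -v a="$csvkit_seconds" -v b="$xsv_seconds" 'BEGIN { printf "%.0f", a / b }')
echo "${speedup}x faster"
```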

The gap between csvkit and xsv depends on scale; both tools are plenty fast on smaller datasets. But once you get into the 10MB-and-upward range, xsv pulls far ahead: on the file above it finished about 142 times faster.

If you've been using csvkit forever (like me), or if you want to be able to transform and analyze CSVs without loading them into a DB, give xsv a shot:

Install Rust

curl https://sh.rustup.rs -sSf | sh

. . . which also gives you the Rust package manager, cargo, which lets you:

Install xsv

cargo install xsv

Then be sure your PATH is configured correctly:

export PATH=~/.cargo/bin:$PATH
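
That export only lasts for the current shell session, and running it repeatedly keeps stacking the same entry onto PATH. A minimal sketch of a repeatable version, assuming the default ~/.cargo/bin install location (the helper name add_cargo_to_path is made up for illustration):

```shell
# Prepend cargo's bin directory to PATH, but only if it isn't already there.
add_cargo_to_path() {
  cargo_bin="$HOME/.cargo/bin"
  case ":$PATH:" in
    *":$cargo_bin:"*) ;;                    # already present, nothing to do
    *) export PATH="$cargo_bin:$PATH" ;;
  esac
}

add_cargo_to_path
```

Dropping the function and the call into your shell's startup file (~/.bashrc, for example) should keep xsv reachable in new terminals.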

. . . and try it out on a demo CSV with 10k rows, some messy strings, and multiple data types:

curl https://gist.githubusercontent.com/wboykinm/044e2af62fc0c7f77e17f6ccd55b8fb0/raw/fca391e6c03a06a7be770fefca6c47a9acdd2305/mock_data.csv \
| xsv stats \
| xsv table

(xsv table formats the data so it's readable in the console):

field           type     sum                 min                  max                  min_length  max_length  mean                stddev
id              Integer  5005000             1                    1000                 1           4           500.49999999999994  288.6749902572106
first_name      Unicode                      Aaron                Willie               3           11                              
last_name       Unicode                      Adams                Young                3           10                              
email           Unicode                      aadamsp5@senate.gov  wwrightd8@upenn.edu  12          34                              
gender          Unicode                      Female               Male                 4           6                               
ip_address      Unicode                      0.111.40.87          99.50.37.244         9           15                              
value           Unicode                      $1007.98             $999.37              0           8                               
company         Unicode                      Abata                Zoovu                0           13                              
lat             Float    243963.82509999987  -47.75034            69.70287             0           9           24.42080331331331   24.98767816017553
lon             Float    443214.19009999954  -179.12198           170.29993            0           10          44.36578479479489   71.16647723898215
messed_up_data  Unicode                      !@#$%^&*()           𠜎𠜱𠝹𠱓𠱸𠲖𠳏       0           393                             
version         Unicode                      0.1.1                9.99                 3           14                              
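
The same piping works between other xsv modules, since they read CSV on stdin when no file is given. A rough sketch, assuming the demo file has been saved locally as mock_data.csv; the function name xsv_preview is ours, and the column names come from the stats output above:

```shell
# Show a random handful of rows, restricted to a few columns, as an aligned table.
# $1 = path to a CSV file, $2 = number of rows to sample (defaults to 5)
xsv_preview() {
  xsv select id,lat,lon "$1" \
    | xsv sample "${2:-5}" \
    | xsv table
}

# Usage (requires xsv on your PATH):
#   xsv_preview mock_data.csv 10
```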

Happy parsing!

How to reverse geocode in bulk


This post is part of our practical cartography series.

We just rebuilt our Argo reverse-geocoding module as a proper command-line tool. Got a pile of coordinates in a table like this?

Pipe them through argo to get the context of an address assigned to each of them:

npm install argo-geo -g
argo -i myfile.csv -a "blahblahmapzenauthtoken"

Using Mapzen search, that'll churn through your table at 6 queries per second, appending results to each coordinate pair until it's done.
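
At that rate it's worth estimating the wall-clock time before kicking off a big job. A back-of-the-envelope sketch, assuming a steady 6 queries per second and one query per row (the function name geocode_hours is illustrative):

```shell
# Estimate the hours needed to reverse geocode a number of rows at 6 queries/second.
# $1 = number of coordinate pairs
geocode_hours() {
  awk -v rows="$1" 'BEGIN { printf "%.1f\n", rows / 6 / 3600 }'
}

geocode_hours 1000000   # one million coordinate pairs
```

A million rows works out to roughly 46 hours, so a multi-million-row file is a multi-day job.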

We built this to process millions of rooftop coordinates that a vendor provided to us without addresses, but you could just as easily use it for any position-only datasets:

  • Bird sightings from the field
  • Cars auto-extracted from imagery
  • GPS tracks from that pub crawl where you forgot the names of the bars
  • Mobile-collected reports of voter intimidation

We named it "Argo" to follow the Greek mythology pattern of Mapzen's geocoding engine "Pelias". Google and Mapbox each offer reverse-geocoding services as well, but those are just that: services. They come with terms of use that restrict caching of the results, and man, did we want to cache these. The good folks at Mapzen built their search architecture on some truly amazing open datasets, and they match the spirit of the source by allowing storage and repurposing.

Thanks, Mapzen!

How to crunch lots of geodata in parallel

This post is part of our data science and practical cartography series.

GNU parallel + ogr2ogr = happy data scientists

These power tools in combination make it very easy to process lots of geodata at once, in as many parallel operations as your local machine or server can support.

Reprojecting in bulk

Here's an example, assuming you have a folder full of shapefiles you want to reproject into geographic coordinates (WGS84, EPSG:4326). Make a directory for the output, then pipe every shapefile through ogr2ogr in parallel:

mkdir wgs84
ls *.shp | parallel ogr2ogr -t_srs 'EPSG:4326' wgs84/{} {}
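
By default GNU parallel runs roughly one job per CPU core. On a shared server you may want to cap that; here is a sketch using parallel's -j option and its ::: argument syntax in place of ls (the function name reproject_all is made up for this example):

```shell
# Reproject every shapefile in the current directory into wgs84/,
# running at most $1 ogr2ogr jobs at a time (defaults to 4).
reproject_all() {
  mkdir -p wgs84
  parallel -j "${1:-4}" ogr2ogr -t_srs 'EPSG:4326' wgs84/{} {} ::: *.shp
}

# Usage (requires GNU parallel and GDAL):
#   reproject_all 8
```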

Running a sequence of commands on many files

In order to build whole data workflows, you can wrap your sequence of commands in a bash function. Here's an example, where we:

  1. Download each state landmarks file from the census FTP
  2. Extract each file
  3. Create a new file for each consisting of only airport landmarks, projected to WGS84

# grab this handy list of all state FIPS codes
wget -c https://gist.githubusercontent.com/wboykinm/6c514e9caf1fc3158e350fa926ea02bd/raw/f742515fd06824dafd0a88c62b4de11fa1e39fa1/state_fips_codes.txt

# define the function
get_airports() {
  # grab the data from the census server
  wget -c http://www2.census.gov/geo/tiger/TIGER2016/POINTLM/tl_2016_$1_pointlm.zip
  unzip tl_2016_$1_pointlm.zip
  # extract just airports (code K2451) and reproject to WGS84
  ogr2ogr -t_srs "EPSG:4326" -where "MTFCC = 'K2451'" tl_2016_$1_airports.shp tl_2016_$1_pointlm.shp
  echo "done with state $1"
}
export -f get_airports

# kick off the parallel processing!
cat state_fips_codes.txt | parallel get_airports {}

This crunches through 52 states and territories in 21.8 seconds on a small EC2 server, limited only by network speed.
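
With that many downloads in flight, a dropped connection is easy to miss. One way to check the results, assuming the commands above were run in the current directory (the function name missing_airports is ours):

```shell
# Print every FIPS code from the list that has no airports shapefile yet.
# $1 = file of FIPS codes, one per line (defaults to state_fips_codes.txt)
missing_airports() {
  while IFS= read -r fips || [ -n "$fips" ]; do
    [ -n "$fips" ] || continue                          # skip blank lines
    [ -f "tl_2016_${fips}_airports.shp" ] || echo "$fips"
  done < "${1:-state_fips_codes.txt}"
}

# Usage: re-run only the states that are missing output
#   missing_airports | parallel get_airports {}
```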

[Image: airports]

Install the tools

  • GNU parallel
    • OSX: brew install parallel
    • Ubuntu: sudo apt-get install parallel
  • ogr2ogr
    • OSX: brew install gdal --HEAD
    • Ubuntu: sudo apt-get install gdal-bin

Bonus toolkit: From Derek Watkins, here are a few dozen examples of the awesome geoprocessing you can do with GDAL/OGR.

Happy mapping!

How to do histograms in PostgreSQL

This post is part of our data science and PostgreSQL series.

UPDATED FOR 2017: Now with easy subquery support and a more sensible argument order!

We adapted this excellent piece by Dmitri Fontaine and turned it into a function.

It can be invoked like this:

SELECT * FROM histogram($table_name_or_subquery, $column_name);

. . . to give sweet results like this, in a check of the distribution of 2016 political contributions in Vermont:

fec=# SELECT * FROM histogram('(SELECT * FROM small_donors_vt LIMIT 50000)', 'transaction_amt');

 bucket |   range   | freq |       bar       
--------+-----------+------+-----------------
      1 | [0,9]     | 2744 | ******
      2 | [10,19]   | 5630 | *************
      3 | [20,29]   | 6383 | ***************
      4 | [30,39]   | 1290 | ***
      5 | [40,49]   |  369 | *
      6 | [50,59]   | 3541 | ********
      7 | [60,69]   |  174 | 
      8 | [70,79]   |  313 | *
      9 | [80,89]   |  171 | 
     10 | [90,99]   |   65 | 
     11 | [100,109] | 2363 | ******
     12 | [110,119] |   51 | 
     13 | [120,129] |  115 | 
     14 | [130,139] |   32 | 
     15 | [140,146] |   11 | 
     16 | [150,159] |  187 | 
     17 | [160,169] |   24 | 
     18 | [170,177] |   33 | 
     19 | [180,189] |   19 | 
     20 | [191,199] |   24 | 
     21 | [200,200] |  795 | **

Use it yourself by adding this to your postgres setup:

CREATE OR REPLACE FUNCTION histogram(table_name_or_subquery text, column_name text)
RETURNS TABLE(bucket int, "range" numrange, freq bigint, bar text)
AS $func$
BEGIN
RETURN QUERY EXECUTE format('
  WITH
  source AS (
    SELECT * FROM %s AS src
  ),
  min_max AS (
    SELECT min(%s) AS min, max(%s) AS max FROM source
  ),
  histogram AS (
    SELECT
      width_bucket(%s, min_max.min, min_max.max, 20) AS bucket,
      numrange(min(%s)::numeric, max(%s)::numeric, ''[]'') AS "range",
      count(%s) AS freq
    FROM source, min_max
    WHERE %s IS NOT NULL
    GROUP BY bucket
    ORDER BY bucket
  )
  SELECT
    bucket,
    "range",
    freq::bigint,
    repeat(''*'', (freq::float / (max(freq) over() + 1) * 15)::int) AS bar
  FROM histogram',
  table_name_or_subquery,
  column_name,
  column_name,
  column_name,
  column_name,
  column_name,
  column_name,
  column_name
  );
END
$func$ LANGUAGE plpgsql;
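
The bar column is the least obvious arithmetic in there: each bucket's frequency is divided by the largest frequency plus one, scaled to 15 characters, and rounded. A sketch of the same calculation in shell, checked against the sample output above (largest bucket: 6383). The rounding here is awk's, which may differ from PostgreSQL's cast on exact .5 ties:

```shell
# Length of the bar for one bucket, mirroring
#   (freq::float / (max(freq) over() + 1) * 15)::int
# $1 = bucket frequency, $2 = largest frequency in the histogram
bar_length() {
  awk -v freq="$1" -v max="$2" 'BEGIN { printf "%d\n", int(freq / (max + 1) * 15 + 0.5) }'
}

bar_length 2744 6383   # bucket 1
bar_length 6383 6383   # bucket 3, the tallest
```

Those two calls print 6 and 15, matching the asterisks shown for buckets 1 and 3.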

Note:

  1. You don't need to use a subquery - you can also just provide a table name as the first argument.
  2. Subqueries are useful for sampling. For example, (SELECT * FROM bigtable ORDER BY RANDOM() LIMIT 500). Don't forget the parentheses!
  3. You can call this as SELECT histogram() or SELECT * FROM histogram(). The latter form is much more legible!
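
For ad hoc use from a terminal, the sampling subquery from note 2 can be assembled in a small shell helper. This is only a sketch: histogram_sql is a hypothetical name, and it does no escaping, so feed it only table and column names you trust:

```shell
# Build a histogram() call over a random sample of a table.
# $1 = table name, $2 = column name, $3 = sample size (defaults to 500)
histogram_sql() {
  printf "SELECT * FROM histogram('(SELECT * FROM %s ORDER BY RANDOM() LIMIT %d)', '%s');\n" \
    "$1" "${3:-500}" "$2"
}

# Usage (assumes $DB_URL points at a database with the function installed):
#   psql "$DB_URL" -c "$(histogram_sql bigtable transaction_amt 50000)"
```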

Happy querying!

How to preview PostGIS maps on your command line

This is part of our practical cartography and PostgreSQL series. Put a map on it!

Sometimes it's a pain to open up QGIS and load a PostGIS-enabled DB. Sometimes I don't feel like writing a custom tileserver and hooking it up to Leaflet or Mapbox GL just so I can see if my map looks right.

Sometimes I use the psql command line and a nifty tool by Morgan Herlocker called "geotype" to view my map data.

npm install -g geotype

. . . which enables fast and simple maps like this:

[Image: geotype rendering of New York]

Yep. That's New York, alright.

[Image: geotype rendering of the District of Columbia]

. . . and that sure looks like the population distribution of the District of Columbia.

These maps are nothing to show to customers, but they make QA/QC a lot easier. Here's the syntax, piping psql output directly into geotype:

psql $DB_URL -t -c "SELECT ST_AsGeoJSON(ST_Collect(the_geom)) FROM mytable" | geotype

(The -t flag strips psql's headers and footers, and ST_Collect() gathers every row's geometry into a single geometry, so the output is one GeoJSON object that geotype can read.)
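
If you do this often, the one-liner wraps up nicely. A sketch, assuming $DB_URL is set as above; the function names preview_sql and mapit are ours, and the geometry column defaults to the_geom:

```shell
# Build the query that collapses a table's geometries into a single GeoJSON value.
# $1 = table name, $2 = geometry column (defaults to the_geom)
preview_sql() {
  printf 'SELECT ST_AsGeoJSON(ST_Collect(%s)) FROM %s' "${2:-the_geom}" "$1"
}

# Render a table in the terminal; -t keeps psql from printing headers and row counts.
mapit() {
  psql "$DB_URL" -t -c "$(preview_sql "$@")" | geotype
}

# Usage (requires psql and geotype):
#   mapit mytable
#   mapit parcels geom
```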

Happy mapping!