Startup hacks and engineering miracles from your exhausted friends at Faraday

Be not afraid of ZCTAs

Bill Morris on

This post is part of our practical cartography series.

Most American geographers will note that - as much as we'd like it to be otherwise - ZIP Codes are not polygons. Rather, they're constantly-changing lines used by the USPS to coordinate delivery in an efficient network. Many of us polygon-happy mappers use ZIP Code Tabulation Areas (ZCTAs) instead; these are provided by the US Census as a reasonable open data alternative to ZIPs. They're particularly nice for thematic mapping (though their shortcomings have also been well-documented):

map

But why use ZCTAs if they can never be reconciled with their ground-truth ZIP cousins?

Because the difference is small.

Faraday has address and location records for every household in the country, and it was straightforward to check for disagreement between the ZIP Code of each physical address and the ZCTA polygon that contains it.
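Conceptually, the check is just a point-in-polygon join followed by a string comparison. Here's a minimal sketch of what that might look like in PostGIS - the addresses and zctas tables and their columns below are hypothetical stand-ins, not our actual schema:

-- Hypothetical tables: addresses(state, zip, geom) and zctas(zcta5, geom)
-- Share of addresses whose ZIP Code disagrees with the containing ZCTA, by state
SELECT
  a.state,
  avg((a.zip <> z.zcta5)::int) AS error_rate
FROM addresses a
JOIN zctas z ON ST_Contains(z.geom, a.geom)
GROUP BY a.state
ORDER BY error_rate DESC;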

Here are the results, broken down by state:

The national error rate of ZCTAs is 1.4%. That might be too high for some use cases, but perfectly acceptable for others. There's some regional variation, too: you're usually safe to use ZCTAs in Hawaii and Maine, but might want to exercise caution in Oregon and Utah.

Happy mapping!

Plancha: how to flatten multi-sheet excel workbooks

Bill Morris on

This is part of our series on data science because it belongs in your toolchain.

If you work with data long enough - actually scratch that; if you work with data for more than a week - you'll run into the dreaded multi-sheet (or multi-tab) Excel workbook. Sometimes the sheets are unrelated, but other times they should really all be stacked together in the same table, ideally in a more interoperable format than .xlsx:

in

Enter plancha. Named for the trusty tortilla press, this simple CLI tool flattens multi-sheet Excel files, resolves header mismatches, and returns a pipeline-friendly CSV, like this:

out

Install

This is a Node.js tool, so install it with npm:

npm install plancha -g

Usage

Just feed it an input .xlsx file:

plancha -i myfile.xlsx


Happy data-pressing!

How we made our CSV processing 142x faster

Bill Morris on

This post is part of our data science hacks series

At Faraday, we've long used csvkit to understand, transform, and beat senseless our many streams of data. However, even this inimitable swiss army knife can be improved on - we've switched to xsv.

xsv is a fast CSV-parsing toolkit written in Rust that mostly matches the functionality of csvkit (including the clutch ability to pipe between modules), with a few extras tacked on (like smart sampling). Did I mention it's fast? In a head-to-head comparison, I ran the "stats" module of xsv against "csvstat" from csvkit on a 30k-line, 400-column CSV file:

  • Python-based csvkit chews through it in a respectable-and-now-expected 4m16s.

  • xsv takes 1.8 seconds. I don't even have time for a sip of my coffee.
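For reference, the head-to-head was nothing fancier than timing the two equivalent commands against the same file (the filename below is a stand-in):

time csvstat wide_30k_rows.csv
time xsv stats wide_30k_rows.csv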

The difference between csvkit and xsv is partly a matter of scale; both tools are plenty fast on smaller datasets. But once you get into the 10MB-and-up range, xsv's processing speed pulls away dramatically.

If you've been using csvkit forever (like me), or if you want to be able to transform and analyze CSVs without loading them into a DB, give xsv a shot:

Install Rust

curl https://sh.rustup.rs -sSf | sh  

. . . which also gives you the Rust package manager cargo, so you can:

Install xsv

cargo install xsv  

Then be sure your PATH is configured correctly:

export PATH=~/.cargo/bin:$PATH  

. . . and try it out on a demo CSV with 10k rows, some messy strings, and multiple data types:

curl https://gist.githubusercontent.com/wboykinm/044e2af62fc0c7f77e17f6ccd55b8fb0/raw/fca391e6c03a06a7be770fefca6c47a9acdd2305/mock_data.csv \  
| xsv stats \
| xsv table

(xsv table formats the data so it's readable in the console):

field           type     sum                 min                  max                  min_length  max_length  mean                stddev  
id              Integer  5005000             1                    1000                 1           4           500.49999999999994  288.6749902572106  
first_name      Unicode                      Aaron                Willie               3           11  
last_name       Unicode                      Adams                Young                3           10  
email           Unicode                      aadamsp5@senate.gov  wwrightd8@upenn.edu  12          34  
gender          Unicode                      Female               Male                 4           6  
ip_address      Unicode                      0.111.40.87          99.50.37.244         9           15  
value           Unicode                      $1007.98             $999.37              0           8  
company         Unicode                      Abata                Zoovu                0           13  
lat             Float    243963.82509999987  -47.75034            69.70287             0           9           24.42080331331331   24.98767816017553  
lon             Float    443214.19009999954  -179.12198           170.29993            0           10          44.36578479479489   71.16647723898215  
messed_up_data  Unicode                      !@#$%^&*()           𠜎𠜱𠝹𠱓𠱸𠲖𠳏       0           393  
version         Unicode                      0.1.1                9.99                 3           14  

Happy parsing!

How to reverse geocode in bulk

Bill Morris on

This post is part of our practical cartography series.

We just rebuilt our Argo reverse-geocoding module as a proper command-line tool. Got a pile of coordinates in a table like this?
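For example, the input can be as minimal as an id and a coordinate pair per row - the columns and values below are purely hypothetical:

id,lat,lon
1,44.4759,-73.2121
2,40.7484,-73.9857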

Pipe them through argo to get an address assigned to each of them:

npm install argo-geo -g  
argo -i myfile.csv -a "blahblahmapzenauthtoken"  

Using Mapzen search, that'll churn through your table at 6 queries per second, appending results to each coordinate pair until it's done:

We built this to process millions of rooftop coordinates that a vendor provided to us without addresses, but you could just as easily use it for any position-only datasets:

  • Bird sightings from the field
  • Cars auto-extracted from imagery
  • GPS tracks from that pub crawl where you forgot the names of the bars
  • Mobile-collected reports of voter intimidation

We named it "Argo" to follow the Greek-mythology pattern of Mapzen's geocoding engine, "Pelias". Google and Mapbox each offer reverse-geocoding services as well, but those are just that: services. Their terms of use restrict caching of the results, and man, did we want to cache these. The good folks at Mapzen built their search architecture on some truly amazing open datasets, and they match the spirit of the source by allowing storage and repurposing.

Thanks, Mapzen!

How to crunch lots of geodata in parallel

Bill Morris on

This post is part of our data science and practical cartography series.

GNU parallel + ogr2ogr = happy data scientists

These power tools in combination make it very easy to process lots of geodata at once, in as many parallel operations as your local machine or server can support.

Reprojecting in bulk

Here's an example, assuming you have a folder full of shapefiles you want to reproject into Geographic coordinates. Make a directory for the output, then pipe every shapefile through ogr2ogr in parallel:

mkdir wgs84  
ls *.shp | parallel ogr2ogr -t_srs 'EPSG:4326' wgs84/{} {}  

Running a sequence of commands on many files

In order to build whole data workflows, you can wrap your sequence of commands in a bash function. Here's an example, where we:

  1. Download each state landmarks file from the census FTP
  2. Extract each file
  3. Create a new file for each consisting of only airport landmarks, projected to WGS84

# grab this handy list of all state FIPS codes
wget -c https://gist.githubusercontent.com/wboykinm/6c514e9caf1fc3158e350fa926ea02bd/raw/f742515fd06824dafd0a88c62b4de11fa1e39fa1/state_fips_codes.txt

# define the function
get_airports() {  
  # grab the data from the census server
  wget -c http://www2.census.gov/geo/tiger/TIGER2016/POINTLM/tl_2016_$1_pointlm.zip
  unzip tl_2016_$1_pointlm.zip
  # extract just airports (code K2451) and reproject to WGS84
  ogr2ogr -t_srs "EPSG:4326" -where "MTFCC = 'K2451'" tl_2016_$1_airports.shp tl_2016_$1_pointlm.shp
  echo "done with state $1"
}
export -f get_airports

# kick off the parallel processing!
cat state_fips_codes.txt | parallel get_airports {}

This crunches through 52 states and territories in 21.8 seconds on a small EC2 server, limited only by network speed.

airports

Install the tools

  • GNU parallel
    • OSX: brew install parallel
    • Ubuntu: apt-get install parallel
  • ogr2ogr
    • OSX: brew install gdal --HEAD
    • Ubuntu: sudo apt-get install gdal-bin

Bonus toolkit: From Derek Watkins, here are a few dozen examples of the awesome geoprocessing you can do with GDAL/OGR.

Happy mapping!

How to do histograms in PostgreSQL

Bill Morris on

This post is part of our data science and PostgreSQL series.

UPDATED FOR 2017: Now with easy subquery support and a more sensible argument order!

We adapted this excellent piece by Dimitri Fontaine and turned it into a function.

It can be invoked like this:

SELECT * FROM histogram($table_name_or_subquery, $column_name);  

. . . to give sweet results like this, in a check of the distribution of 2016 political contributions in Vermont:

fec=# SELECT * FROM histogram('(SELECT * FROM small_donors_vt LIMIT 50000)', 'transaction_amt');

 bucket |   range   | freq |       bar       
--------+-----------+------+-----------------
      1 | [0,9]     | 2744 | ******
      2 | [10,19]   | 5630 | *************
      3 | [20,29]   | 6383 | ***************
      4 | [30,39]   | 1290 | ***
      5 | [40,49]   |  369 | *
      6 | [50,59]   | 3541 | ********
      7 | [60,69]   |  174 | 
      8 | [70,79]   |  313 | *
      9 | [80,89]   |  171 | 
     10 | [90,99]   |   65 | 
     11 | [100,109] | 2363 | ******
     12 | [110,119] |   51 | 
     13 | [120,129] |  115 | 
     14 | [130,139] |   32 | 
     15 | [140,146] |   11 | 
     16 | [150,159] |  187 | 
     17 | [160,169] |   24 | 
     18 | [170,177] |   33 | 
     19 | [180,189] |   19 | 
     20 | [191,199] |   24 | 
     21 | [200,200] |  795 | **

Use it yourself by adding this to your postgres setup:

CREATE OR REPLACE FUNCTION histogram(table_name_or_subquery text, column_name text)  
RETURNS TABLE(bucket int, "range" numrange, freq bigint, bar text)  
AS $func$  
BEGIN  
RETURN QUERY EXECUTE format('  
  WITH
  source AS (
    SELECT * FROM %s
  ),
  min_max AS (
    SELECT min(%s) AS min, max(%s) AS max FROM source
  ),
  histogram AS (
    SELECT
      width_bucket(%s, min_max.min, min_max.max, 20) AS bucket,
      numrange(min(%s)::numeric, max(%s)::numeric, ''[]'') AS "range",
      count(%s) AS freq
    FROM source, min_max
    WHERE %s IS NOT NULL
    GROUP BY bucket
    ORDER BY bucket
  )
  SELECT
    bucket,
    "range",
    freq::bigint,
    repeat(''*'', (freq::float / (max(freq) over() + 1) * 15)::int) AS bar
  FROM histogram',
  table_name_or_subquery,
  column_name,
  column_name,
  column_name,
  column_name,
  column_name,
  column_name,
  column_name
  );
END  
$func$ LANGUAGE plpgsql;

Note:

  1. You don't need to use a subquery - you can also just provide a table name as the first argument.
  2. Subqueries are useful for sampling - for example, (SELECT * FROM bigtable ORDER BY RANDOM() LIMIT 500). Don't forget the parentheses! (See the example after this list.)
  3. You can call this as SELECT histogram() or SELECT * FROM histogram(). The latter form is much more legible!
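Putting notes 1 and 2 together, a random-sample call against the table from the example above might look like this (the sample size here is arbitrary):

SELECT * FROM histogram('(SELECT * FROM small_donors_vt ORDER BY RANDOM() LIMIT 500)', 'transaction_amt');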

Happy querying!

How to preview PostGIS maps on your command line

Bill Morris on

This is part of our practical cartography and PostgreSQL series. Put a map on it!

Sometimes it's a pain to open up QGIS and load a PostGIS-enabled DB. Sometimes I don't feel like writing a custom tileserver and hooking it up to Leaflet or Mapbox GL just so I can see if my map looks right.

Sometimes I use the psql command line and a nifty tool by Morgan Herlocker called "geotype" to view my map data.

npm install -g geotype  

. . . which enables fast and simple maps like this:

ny

Yep. That's New York, alright.

dc

. . . and that sure looks like the population distribution of the District of Columbia.

These maps are nothing to show to customers, but they make QA/QC a lot easier. Here's the syntax, piping psql output directly into geotype:

psql $DB_URL -t -c "SELECT ST_AsGeoJSON(ST_Collect(the_geom)) FROM mytable" | geotype  

(The psql -t flag returns tuples only - no headers or row counts - and ST_Collect() merges the geometries into a single GeoJSON object that geotype can read.)

Happy mapping!

Mountains of Census geodata for all

Bill Morris on

U.S. Census data gives our modeling a good predictive boost, and it's a robust quality assurance tool for all the third-party data we've got flowing through our wires.

The Census offers its geographic data in easy-to-get, familiar formats via the TIGER portal, but distribution is split up by state for the largest datasets: blocks and block groups. There's a pretty simple reason for this: they're big. The census block shapefile for Indiana alone is 116MB compressed.

eastcoast

Ours is probably not a common use case, but we need all of the blocks and block groups in our database - merged, indexed and queryable. It took a significant amount of work to get them there, so in case anyone else needs them too, we're sharing national 2015 datasets in PostGIS dumpfile format, downloadable and ready to use here:


Census block groups

.pg_dump (426MB) | .sql (1.2GB)


Census blocks

.pg_dump (4.7GB) | .sql (12GB)


Add these to your local PostgreSQL database like so:

pg_restore --no-owner --no-privileges --dbname <dbname> <filename>.pg_dump

# OR

psql <dbname> -f <filename>.sql  

To keep things simple, these are just geometries and GeoIDs (CREATE TABLE schemas can be perused here). Detailed analysis will require joining attributes separately.
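Once an attribute table keyed on the same GeoIDs is loaded alongside them, that join is straightforward - the table and column names below are hypothetical:

-- Hypothetical tables: blockgroups(geoid, geom) and acs_attributes(geoid, median_income)
SELECT bg.geoid, acs.median_income, bg.geom
FROM blockgroups bg
JOIN acs_attributes acs ON acs.geoid = bg.geoid;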

Side note: I can't recommend censusreporter.org enough for census-based sanity checks.

Happy mapping!

The big picture: exportable maps

Bill Morris on

Today we're introducing exportable audience maps . . .

map

Do you use maps in your Faraday workflow? Prints? Presentations? Twitter? Let us know how these maps can help, and how we can improve them.

Getting a map image from the platform is now simple:

export

From any saved audience, just click the "Export" button and you'll be able to download a high-res image of the geography in context. Share it with colleagues and partners, add it to reports, or put it on your wall.

Getting bite-sized chunks of OpenStreetMap

Bill Morris on

At Faraday, we dig OSM.

OpenStreetMap (OSM) is the foundation of our basemap and a model of the power of open data. It guides customers on our platform to their ideal audiences . . .

basemap

. . . and it serves as building blocks for geospatial analysis, both the kind we already do and the kind we want to do more of.

The problem is that it's big. The entire OSM database is portable, but at 50GB it's not very friendly. Sometimes we just want the driveway network of one county, or the building footprints in a zip code. Whole companies have sprung up around this workflow, but we have a few tried-and-true-and-cheap tools that we rely on:

  • Mapzen-hosted metro extracts - If your desired zone is on the list of regularly-updated cities, just grab the shapefiles and go!
  • OSM vector tiles - Use these with toolsets like tilereduce for distributed geoprocessing at tile scale.
  • Overpass API - This tunable endpoint works great for specific queries in minutely-defined regions (e.g. find all the one-way streets in Park Slope), but it can be a bit opaque. Use the query-overpass node module to spit out GeoJSON with minimal fuss (see the sketch after this list).
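To give a flavor of the Overpass query language, here's a rough sketch of that one-way-streets example as a raw request against the public Overpass endpoint - the bounding box is only an approximation of Park Slope:

curl -s https://overpass-api.de/api/interpreter \
  --data-urlencode 'data=[out:json][timeout:25];way["oneway"="yes"](40.66,-73.99,40.68,-73.97);out geom;'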

Happy mapping!