AI, startup hacks, and engineering miracles from your friends at Faraday

Geochunk: fast, intelligent splitting for piles of address data

Bill Morris on


This post is part of our practical cartography and data science series.

The problem: you want to split up a few million U.S. address records into equally-sized chunks that retain spatial hierarchy. You want to do this without anything other than a street address (geocoding is expensive!). Maybe you want to do this as part of a map/reduce process (we certainly do), maybe you want to do some sampling, who knows?

The solution: Muthaflippin' Geochunk

Anyone who's ever used U.S. ZIP codes as a way to subdivide datasets can tell you: 60608 (pop 79,607) is a totally different beast than 05851 (pop 525). They're not census tracts; it's not really appropriate to compare them statistically or thematically.

Our solution - largely the work of platform wizard and Rust enthusiast Eric Kidd - is to bake census data into a tool that does the splitting for you at a level that allows for easy comparison. More specifically:

It provides a deterministic mapping from zip codes to "geochunks" that you can count on remaining stable.
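
For intuition only, here is a minimal sketch of the general idea of packing ZIP codes into population-balanced groups. It is a simplified greedy pass, not geochunk's actual algorithm. The function name `pack_zips`, the two-column `zip,population` input, and every sample population other than the two quoted above are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Simplified illustration, NOT geochunk's real algorithm:
# sort ZIP codes so numeric neighbors sit together, then greedily fill
# each chunk until adding the next ZIP would exceed the target population.
pack_zips() {
  local target="$1"
  LC_ALL=C sort -t, -k1,1 | awk -F, -v target="$target" '
    BEGIN { chunk = 0; total = 0 }
    {
      # open a new chunk when this ZIP would push the current one past the target
      if (total > 0 && total + $2 > target) { chunk++; total = 0 }
      total += $2
      print $1 "," chunk
    }'
}

# Input is "zip,population". Only 60608 and 05851 use figures from the post;
# the other rows are made up.
printf '%s\n' '60608,79607' '05851,525' '05819,7000' '60607,30000' '60609,60000' \
  | pack_zips 100000
```

Because the input is sorted before packing, the same input always produces the same assignment, which is the property that makes a mapping like this safe to rely on between runs.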

Check out the Jupyter notebook that explains the algorithm in detail.

Install

Install Rust first if you don't have it:

curl https://sh.rustup.rs -sSf | sh

. . . then geochunk, using Cargo, the Rust package manager:

cargo install geochunk

. . . or install from one of the prepackaged binaries.

Use 1: Indexing

Build a table that assigns every U.S. ZIP code to a geochunk of roughly 250,000 people:

geochunk export zip2010 250000 > chunks_of_250k_people.csv

Use 2: List processing

Alternatively, here's a pipeline example that uses geochunk csv. Say you want to process every address in the state of Colorado in parallel, and you need equal-sized, contiguous slices to do it.

  • Grab the statewide address file from OpenAddresses and unzip it:

wget -c https://s3.amazonaws.com/data.openaddresses.io/runs/283082/us/co/statewide.zip && unzip statewide.zip

  • Pipe the full file through geochunk, into slices of about 250,000 people each:

cat us/co/statewide.csv | geochunk csv zip2010 250000 POSTCODE > statewide_chunks_250k.csv
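
Once every row carries a chunk ID, a common next step is to write one file per chunk so each can be handed to its own worker. The sketch below assumes the chunk ID ends up in the last column and that no field contains a quoted comma; check both against your own output before relying on it. The function name `split_by_chunk`, the `chunk` header, and the sample rows and IDs are hypothetical.

```shell
#!/usr/bin/env bash
# Split a chunked CSV into one file per chunk ID, repeating the header in each.
# Assumptions: the chunk ID is the LAST column; fields contain no quoted commas.
split_by_chunk() {
  local input="$1" outdir="$2"
  mkdir -p "$outdir"
  awk -F, -v dir="$outdir" '
    NR == 1 { header = $0; next }
    {
      out = dir "/chunk_" $NF ".csv"
      # first time we see a chunk, start its file with the header row
      if (!(out in seen)) { print header > out; seen[out] = 1 }
      print > out
    }' "$input"
}

# Hypothetical sample standing in for real geochunk output.
cat > sample_chunks.csv <<'EOF'
NUMBER,STREET,POSTCODE,chunk
100,MAIN ST,80202,a
200,ELM ST,80203,a
300,OAK ST,81501,b
EOF

split_by_chunk sample_chunks.csv chunks
wc -l chunks/*.csv
```

With only a handful of chunks, keeping every output file open inside awk is fine; for thousands of chunks you would want to sort by chunk ID first and close each file as you finish it.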

. . . and now you have 2 million addresses, chopped into ~8 equally-sized slices with rough contiguity:

[Image: map of the resulting slices around Denver]

Geochunk works on this scale in 1.38s (have you heard us evangelizing about Rust yet?), leaving you plenty of time for the real processing.

This tool is serious dogfood for us; it's baked into our ETL system, and we use it to try making a tiny dent in the Modifiable Areal Unit Problem. We hope you'll find it useful too.

Be not afraid of ZCTAs

Bill Morris on

This post is part of our practical cartography series.

Most American geographers will note that - as much as we'd like it to be otherwise - ZIP Codes are not polygons. Rather, they're constantly-changing lines used by the USPS to coordinate delivery in an efficient network. Many of us polygon-happy mappers use ZIP Code Tabulation Areas (ZCTAs) instead; these are provided by the US Census as a reasonable open data alternative to ZIPs. They're particularly nice for thematic mapping (though their shortcomings have also been well-documented):

[Image: thematic map drawn with ZCTAs]

But why use ZCTAs if they can never be reconciled with their ground-truth ZIP cousins?

Because the difference is small.

Faraday has address and location records for every household in the country, and it was straightforward to check for disagreement between the ZIP Code of each physical address and the ZCTA polygon that contains it.

Here are the results, broken down by state

The national error rate of ZCTAs is 1.4%. That might be too high for some use cases, but perfectly acceptable for others. There's some regional variation, too: you're usually safe to use ZCTAs in Hawaii and Maine, but might want to exercise caution in Oregon and Utah.
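
A check like this boils down to comparing two codes per address and tallying the disagreements by state. Here is a rough sketch with made-up records; it is not the query we ran. The function name `mismatch_rate_by_state`, the `state,zip,zcta` column layout, and every sample row are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Per-state share (in percent) of addresses whose ZIP differs from the
# ZCTA polygon that contains them.
# Input columns (hypothetical): state,zip,zcta with a header row.
mismatch_rate_by_state() {
  awk -F, '
    NR == 1 { next }
    {
      total[$1]++
      # append "" to force string comparison so leading zeros are respected
      if (($2 "") != ($3 "")) bad[$1]++
    }
    END { for (s in total) printf "%s,%.1f\n", s, 100 * bad[s] / total[s] }
  ' "$1" | LC_ALL=C sort
}

# Made-up sample records, purely illustrative.
cat > zip_vs_zcta.csv <<'EOF'
state,zip,zcta
VT,05401,05401
VT,05401,05408
VT,05851,05851
VT,05819,05819
ME,04101,04101
ME,04102,04102
EOF

mismatch_rate_by_state zip_vs_zcta.csv
```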

Happy mapping!

How to reverse geocode in bulk

Bill Morris on


This post is part of our practical cartography series.

We just rebuilt our Argo reverse-geocoding module as a proper command-line tool. Got a pile of coordinates in a table like this?

Pipe them through argo to get the context of an address assigned to each of them:

npm install argo-geo -g
argo -i myfile.csv -a "blahblahmapzenauthtoken"

Using Mapzen search, that'll churn through your table at 6 queries per second, appending results to each coordinate pair until it's done:

We built this to process millions of rooftop coordinates that a vendor provided to us without addresses, but you could just as easily use it for any position-only datasets:

  • Bird sightings from the field
  • Cars auto-extracted from imagery
  • GPS tracks from that pub crawl where you forgot the names of the bars
  • Mobile-collected reports of voter intimidation

We named it "Argo" to follow the Greek mythology pattern of Mapzen's geocoding engine "Pelias". Google and Mapbox each offer reverse-geocoding services as well, but those are just that: services. They come with terms of use that restrict caching of the results, and man, did we want to cache these. The good folks at Mapzen built their search architecture on some truly amazing open datasets, and they match the spirit of the source by allowing storage and repurposing.

Thanks, Mapzen!

How to crunch lots of geodata in parallel

Bill Morris on

This post is part of our data science and practical cartography series.

GNU parallel + ogr2ogr = happy data scientists

These power tools in combination make it very easy to process lots of geodata at once, in as many parallel operations as your local machine or server can support.

Reprojecting in bulk

Here's an example, assuming you have a folder full of shapefiles you want to reproject into geographic coordinates (WGS84, EPSG:4326). Make a directory for the output, then pipe every shapefile through ogr2ogr in parallel:

mkdir wgs84
ls *.shp | parallel ogr2ogr -t_srs 'EPSG:4326' wgs84/{} {}

Running a sequence of commands on many files

In order to build whole data workflows, you can wrap your sequence of commands in a bash function. Here's an example, where we:

  1. Download each state landmarks file from the census FTP
  2. Extract each file
  3. Create a new file for each consisting of only airport landmarks, projected to WGS84

# grab this handy list of all state FIPS codes
wget -c https://gist.githubusercontent.com/wboykinm/6c514e9caf1fc3158e350fa926ea02bd/raw/f742515fd06824dafd0a88c62b4de11fa1e39fa1/state_fips_codes.txt

# define the function
get_airports() {
  # grab the data from the census server
  wget -c "http://www2.census.gov/geo/tiger/TIGER2016/POINTLM/tl_2016_${1}_pointlm.zip"
  # -o overwrites without prompting, so reruns don't stall waiting for input
  unzip -o "tl_2016_${1}_pointlm.zip"
  # extract just airports (code K2451) and reproject to WGS84
  ogr2ogr -t_srs "EPSG:4326" -where "MTFCC = 'K2451'" "tl_2016_${1}_airports.shp" "tl_2016_${1}_pointlm.shp"
  echo "done with state $1"
}
export -f get_airports

# kick off the parallel processing!
cat state_fips_codes.txt | parallel get_airports {}

This crunches through 52 states and territories in 21.8 seconds on a small ec2 server, limited only by network speed.
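
If GNU parallel isn't available, the same exported-function pattern can be approximated with `xargs -P`. This is a swapped-in alternative, not what the timing above measures. In the sketch, `fake_job` is a hypothetical stand-in for `get_airports` so it runs without network access or GDAL, and the three state codes are just sample input.

```shell
#!/usr/bin/env bash
# Stand-in for get_airports: just reports which state code it received.
fake_job() {
  echo "done with state $1"
}
# export the function so the child bash processes started by xargs can see it
export -f fake_job

# Run up to 4 jobs at once; each input line becomes the function's $1.
# Completion order is not guaranteed, so sort the output for display.
printf '%s\n' 08 50 23 \
  | xargs -P 4 -I{} bash -c 'fake_job "$1"' _ {} \
  | sort
```

The `bash -c '... "$1"' _ {}` form passes each code as a real argument instead of splicing it into the command string, which keeps odd characters in the input from being interpreted by the shell.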

[Image: map of the extracted airports]

Install the tools

  • GNU parallel
    • OSX: brew install parallel
    • Ubuntu: sudo apt-get install parallel
  • ogr2ogr
    • OSX: brew install gdal --HEAD
    • Ubuntu: sudo apt-get install gdal-bin

Bonus toolkit: From Derek Watkins, here are a few dozen examples of the awesome geoprocessing you can do with GDAL/OGR.

Happy mapping!

How to preview PostGIS maps on your command line

Bill Morris on

This is part of our practical cartography and PostgreSQL series. Put a map on it!

Sometimes it's a pain to open up QGIS and load a PostGIS-enabled DB. Sometimes I don't feel like writing a custom tileserver and hooking it up to Leaflet or Mapbox GL just so I can see if my map looks right.

Sometimes I use the psql command line and a nifty tool by Morgan Herlocker called "geotype" to view my map data.

npm install -g geotype

. . . which enables fast and simple maps like this:

[Image: geotype rendering of New York]

Yep. That's New York, alright.

[Image: geotype rendering of the District of Columbia]

. . . and that sure looks like the population distribution of the District of Columbia.

These maps are nothing to show to customers, but they make QA/QC a lot easier. Here's the syntax, piping psql output directly into geotype:

psql $DB_URL -t -c "SELECT ST_AsGeoJSON(ST_Collect(the_geom)) FROM mytable" | geotype

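(The -t flag tells psql to print rows only, with no headers or footers, and ST_Collect() aggregates every row's geometry into a single collection, so what reaches geotype is one clean GeoJSON object.)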

Happy mapping!