Geochunk: fast, intelligent splitting for piles of address data

Bill Morris on


This post is part of our practical cartography and data science series.

The problem: you want to split up a few million U.S. address records into equally-sized chunks that retain spatial hierarchy. You want to do this without anything other than a street address (geocoding is expensive!). Maybe you want to do this as part of a map/reduce process (we certainly do), maybe you want to do some sampling, who knows?

The solution: Muthaflippin' Geochunk

Anyone who's ever used U.S. ZIP codes as a way to subdivide datasets can tell you: 60608 (pop 79,607) is a totally different beast than 05851 (pop 525). They're not census tracts; it's not really appropriate to compare them statistically or thematically.

Our solution - largely the work of platform wizard and Rust enthusiast Eric Kidd - is to bake census data into a tool that does the splitting for you at a level that allows for easy comparison. More specifically:

It provides a deterministic mapping from ZIP codes to "geochunks" that you can count on remaining stable.
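To make "deterministic mapping" concrete, here is a minimal Python sketch of the general idea: walk ZIP codes in sorted order and close a chunk once it holds roughly the target population. This is an illustration only and not geochunk's actual algorithm. The function name `chunk_zips` and every population figure except the two quoted above (60608 and 05851) are made up for the example.

```python
def chunk_zips(populations, target):
    """Greedily group sorted ZIP codes into chunks of roughly `target` people.

    Illustrative only: NOT the algorithm geochunk uses.
    """
    mapping = {}
    chunk_id = 0
    running = 0
    # Sorting makes the result independent of input order (deterministic),
    # and keeps ZIP codes that share a prefix next to each other.
    for zip_code in sorted(populations):
        # Start a new chunk once the current one has reached the target.
        if running >= target:
            chunk_id += 1
            running = 0
        mapping[zip_code] = chunk_id
        running += populations[zip_code]
    return mapping


populations = {
    "60608": 79607,  # from the text above
    "05851": 525,    # from the text above
    "05850": 400,    # hypothetical
    "60607": 30000,  # hypothetical
    "60609": 60000,  # hypothetical
}
mapping = chunk_zips(populations, target=80000)
```

ZIP codes stay strings throughout so that leading zeros (as in 05851) survive.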

Check out the Jupyter notebook that explains the algorithm in detail, but it works like so:


Install Rust first if you don't have it:

curl -sSf | sh  

. . . then geochunk, using Cargo, the Rust package manager:

cargo install geochunk  

. . . or install from one of the prepackaged binaries.

Use 1: Indexing

Build a table that assigns every U.S. ZIP code to a geochunk that contains about 250,000 people:

geochunk export zip2010 250000 > chunks_of_250k_people.csv  
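Once you have that table, using it is a plain lookup. The Python sketch below assumes the exported CSV has a header row followed by two columns, ZIP code first and chunk ID second. That layout, the helper name `load_chunk_table`, and the sample chunk IDs are all assumptions for illustration, so check them against your own export.

```python
import csv
import io


def load_chunk_table(csv_file):
    """Read a two-column ZIP -> geochunk table into a dict of strings."""
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row (assumed to be present)
    # Keep ZIP codes as strings so leading zeros are not lost.
    return {row[0]: row[1] for row in reader if row}


# Hypothetical sample standing in for chunks_of_250k_people.csv
sample = io.StringIO("zip,geochunk\n05851,058_0\n60608,606_1\n")
zip_to_chunk = load_chunk_table(sample)
print(zip_to_chunk.get("60608"))
```

With a real file, replace the `io.StringIO` sample with `open("chunks_of_250k_people.csv", newline="")`.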

Use 2: List processing

Alternatively, let's try a pipeline example that uses geochunk csv: say you want to parallel-process every address in the state of Colorado, and you need equal-size but contiguous slices to do it.

wget -c && unzip  
  • Pipe the full file through geochunk, into slices of about 250,000 people each:
cat us/co/statewide.csv | geochunk csv zip2010 250000 POSTCODE > statewide_chunks_250k.csv  

. . . and now you have 2 million addresses, chopped into ~8 equally-sized slices with rough contiguity:


Geochunk works on this scale in 1.38s (Have you heard us evangelizing about Rust yet?), leaving you plenty of time for the real processing.
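As for the real processing, here is a hedged Python sketch of the map step: group the chunked rows by their geochunk value and hand each slice to a worker. The name of the appended column (`geochunk`), the `NUMBER` and `STREET` columns, the sample rows, and the placeholder `process_slice` are assumptions for illustration. Only `POSTCODE` comes from the command above.

```python
import csv
import io
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_chunk(rows, chunk_column):
    """Bucket rows (dicts) by the value in their geochunk column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[chunk_column]].append(row)
    return dict(groups)


def process_slice(rows):
    """Placeholder for the real per-slice work; here it just counts rows."""
    return len(rows)


# Hypothetical sample standing in for the chunked statewide file.
sample = io.StringIO(
    "NUMBER,STREET,POSTCODE,geochunk\n"
    "100,MAIN ST,80202,802_0\n"
    "200,MAIN ST,80203,802_0\n"
    "300,PIKES PEAK AVE,80903,809_0\n"
)
rows = list(csv.DictReader(sample))
slices = group_by_chunk(rows, "geochunk")

# pool.map returns results in input order, so zipping with the keys is safe.
with ThreadPoolExecutor() as pool:
    counts = dict(zip(slices, pool.map(process_slice, slices.values())))
```

For CPU-heavy work you would likely swap in a process pool or a proper map/reduce framework. The grouping step stays the same.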

This tool is serious dogfood for us; it's baked into our ETL system, and we use it to try making a tiny dent in the Modifiable Areal Unit Problem. We hope you'll find it useful too.

Be not afraid of ZCTAs

Bill Morris on

This post is part of our practical cartography series.

Most American geographers will note that - as much as we'd like it to be otherwise - ZIP Codes are not polygons. Rather, they're constantly-changing lines used by the USPS to coordinate delivery in an efficient network. Many of us polygon-happy mappers use ZIP Code Tabulation Areas (ZCTAs) instead; these are provided by the US Census as a reasonable open data alternative to ZIPs. They're particularly nice for thematic mapping (though their shortcomings have also been well-documented):


But why use ZCTAs if they can never be reconciled with their ground-truth ZIP cousins?

Because the difference is small.

Faraday has address and location records for every household in the country, and it was straightforward to check for disagreement between the ZIP Code of each physical address and the ZCTA polygon that contains it.
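The comparison itself is simple once each address has been matched to its containing ZCTA. The Python sketch below assumes that point-in-polygon step is already done and that each record is a `(state, zip_code, zcta)` tuple. The function name and the sample records are invented for illustration and are not Faraday's data or code.

```python
from collections import Counter


def disagreement_by_state(records):
    """Return the share of addresses per state whose ZIP differs from their ZCTA."""
    totals = Counter()
    mismatches = Counter()
    for state, zip_code, zcta in records:
        totals[state] += 1
        if zip_code != zcta:
            mismatches[state] += 1
    return {state: mismatches[state] / totals[state] for state in totals}


# Made-up records: (state, ZIP of the address, ZCTA polygon containing it)
records = [
    ("VT", "05851", "05851"),
    ("VT", "05851", "05851"),
    ("IL", "60608", "60608"),
    ("IL", "60608", "60616"),
]
rates = disagreement_by_state(records)
```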

Here are the results, broken down by state

The national error rate of ZCTAs is 1.4%. That might be too high for some use cases, but perfectly acceptable for others. There's some regional variation, too: you're usually safe to use ZCTAs in Hawaii and Maine, but might want to exercise caution in Oregon and Utah.

Happy mapping!

How to reverse geocode in bulk

Bill Morris on

This post is part of our practical cartography series.

We just rebuilt our Argo reverse-geocoding module as a proper command-line tool. Got a pile of coordinates in a table like this?

Pipe them through argo to get the context of an address assigned to each of them:

npm install argo-geo -g  
argo -i myfile.csv -a "blahblahmapzenauthtoken"  

Using Mapzen search, that'll churn through your table at 6 queries per second, appending results to each coordinate pair until it's done:
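If you are curious what a 6-queries-per-second loop looks like, here is a minimal Python sketch of the throttling pattern. It is not Argo's implementation and makes no real API calls: `throttled` and `fake_lookup` are hypothetical names, and the lookup is a stand-in for whatever reverse-geocoding call you use.

```python
import time


def throttled(items, lookup, per_second=6, clock=time.monotonic, sleep=time.sleep):
    """Yield (item, lookup(item)) pairs, starting at most `per_second` lookups per second."""
    interval = 1.0 / per_second
    next_start = clock()
    for item in items:
        # Wait until the next allowed start time, if we are ahead of schedule.
        delay = next_start - clock()
        if delay > 0:
            sleep(delay)
        next_start = max(next_start, clock()) + interval
        yield item, lookup(item)


def fake_lookup(coord):
    """Stand-in for a real reverse-geocoding request."""
    return "address near %s,%s" % coord


coords = [(44.47, -73.21), (39.74, -104.99)]
results = list(throttled(coords, fake_lookup))
```

The `clock` and `sleep` parameters exist so the pacing can be tested without actually waiting.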

We built this to process millions of rooftop coordinates that a vendor provided to us without addresses, but you could just as easily use it for any position-only datasets:

  • Bird sightings from the field
  • Cars auto-extracted from imagery
  • GPS tracks from that pub crawl where you forgot the names of the bars
  • Mobile-collected reports of voter intimidation

We named it "Argo" to follow the Greek mythology pattern of Mapzen's geocoding engine "Pelias". Google and Mapbox each offer reverse-geocoding services as well, but those are just that: services. Their terms of use restrict caching of the results, and man, did we want to cache these. The good folks at Mapzen built their search architecture on some truly amazing open datasets, and they match the spirit of the source by allowing storage and repurposing.

Thanks, Mapzen!