AI, startup hacks, and engineering miracles from your friends at Faraday

How to get U.S. Census data as CSV — censusapi2csv

Bill Morris

This post is part of our data science series.

The U.S. Census and American Community Survey (ACS) are the crown jewels of open data (bother your Representative today to make sure they stay that way), but working with data from the Census API isn't always intuitive. Here's an example response to an API call for ACS per capita income data:

[["B19301_001E","state","county","tract","block group"],
["25611","50","007","000100","1"],
["36965","50","007","000100","2"],
["29063","50","007","000200","1"],
. . .
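For reference, a response like that comes from an API call along these lines. This is a hedged sketch - the exact dataset path, vintage, and geography syntax vary by year, and heavy use requires a free API key:

curl 'https://api.census.gov/data/2015/acs5?get=B19301_001E&for=block%20group:*&in=state:50%20county:007'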

That response isn't a CSV, it's not exactly JSON, it's just . . . data. We tend to use CSVs as our basic building blocks, so we built a tool to nudge this output into plain CSV. Here's how to use it:

Install

npm install censusapi2csv -g  

Usage

Let's grab a few things from the ACS API: total population (B01001) and per capita income (B19301), for every block group in Chittenden County, Vermont:

censusapi2csv -l 'block group' -f B01001,B19301 -s 50 -c 007  

. . . we can even pipe this into our favorite CSV-parsing tool, xsv:

censusapi2csv -l 'block group' -f B01001,B19301 -s 50 -c 007 | xsv table  

. . . and we get a formatted look at the data:

B01001_001E  B19301_001E  state  county  tract   block group  
3057         25611        50     007     000100  1  
1200         36965        50     007     000100  2  
1641         29063        50     007     000200  1  
1882         28104        50     007     000200  2  
699          61054        50     007     000200  3  
. . .
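Since the output is plain CSV, the rest of the xsv toolkit applies too. As a quick sketch (same county, just summarizing the per capita income column):

censusapi2csv -l 'block group' -f B19301 -s 50 -c 007 | xsv select B19301_001E | xsv stats | xsv table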

This is just a tiny step in the process of working with census data - and there are many alternative approaches - but we thought it was worth sharing.

Geochunk: fast, intelligent splitting for piles of address data

Bill Morris


This post is part of our practical cartography and data science series.

The problem: you want to split up a few million U.S. address records into equally-sized chunks that retain spatial hierarchy. You want to do this without anything other than a street address (geocoding is expensive!). Maybe you want to do this as part of a map/reduce process (we certainly do), maybe you want to do some sampling, who knows?

The solution: Muthaflippin' Geochunk

Anyone who's ever used U.S. ZIP codes as a way to subdivide datasets can tell you: 60608 (pop 79,607) is a totally different beast than 05851 (pop 525). They're not census tracts, which are drawn to hold roughly comparable populations; comparing ZIP codes statistically or thematically just isn't appropriate.

Our solution - largely the work of platform wizard and Rust enthusiast Eric Kidd - is to bake census data into a tool that does the splitting for you at a level that allows for easy comparison. More specifically:

It provides a deterministic mapping from zip codes to "geochunks" that you can count on remaining stable.

Check out the Jupyter notebook that explains the algorithm in detail, but it works like so:

Install

Install rust first if you don't have it:

curl https://sh.rustup.rs -sSf | sh  

. . . then geochunk, using the rust package manager:

cargo install geochunk  

. . . or install from one of the prepackaged binaries.

Use 1: Indexing

Build a table that assigns every U.S. ZIP code to a geochunk containing roughly 250,000 people:

geochunk export zip2010 250000 > chunks_of_250k_people.csv  
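A quick sanity check on the result (the exact column layout is up to geochunk, so this is just a peek at the first few rows):

head -n 5 chunks_of_250k_people.csv | xsv table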

Use 2: List processing

Alternatively, let's try a pipeline example that uses geochunk csv: say you want to parallel-process every address in the state of Colorado, and you need equal-size but contiguous slices to do it.

  • Download every address in Colorado from OpenAddresses:

wget -c https://s3.amazonaws.com/data.openaddresses.io/runs/283082/us/co/statewide.zip && unzip statewide.zip

  • Pipe the full file through geochunk, into slices of about 250,000 people each:

cat us/co/statewide.csv | geochunk csv zip2010 250000 POSTCODE > statewide_chunks_150k.csv

. . . and now you have 2 million addresses, chopped into ~8 equally-sized slices with rough contiguity:

[Image: Denver-area address points colored by geochunk]

Geochunk works on this scale in 1.38s (Have you heard us evangelizing about Rust yet?), leaving you plenty of time for the real processing.
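If parallel processing is the goal, here's one sketch of how to fan the chunked file out. Two assumptions: we're guessing the chunk-ID column that geochunk appends is named geochunk (check with xsv headers), and process_one_chunk.sh stands in for whatever your real worker is:

# Confirm what geochunk named its chunk-ID column (assumed to be "geochunk" below)
xsv headers statewide_chunks_150k.csv

# Write one CSV per chunk, then run a worker per chunk, 8 at a time
xsv partition geochunk chunks/ statewide_chunks_150k.csv
ls chunks/*.csv | xargs -n 1 -P 8 ./process_one_chunk.sh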

This tool is serious dogfood for us; it's baked into our ETL system, and we use it to try to make a tiny dent in the Modifiable Areal Unit Problem. We hope you'll find it useful too.

Be not afraid of ZCTAs

Bill Morris

This post is part of our practical cartography series.

Most American geographers will note that - as much as we'd like it to be otherwise - ZIP Codes are not polygons. Rather, they're constantly changing collections of delivery routes that the USPS uses to coordinate an efficient delivery network. Many of us polygon-happy mappers use ZIP Code Tabulation Areas (ZCTAs) instead; these are provided by the U.S. Census Bureau as a reasonable open-data alternative to ZIPs. They're particularly nice for thematic mapping (though their shortcomings have also been well-documented):

[Image: thematic map of ZCTAs]

But why use ZCTAs if they can never be reconciled with their ground-truth ZIP cousins?

Because the difference is small.

Faraday has address and location records for every household in the country, and it was straightforward to check for disagreement between the ZIP Code of each physical address and the ZCTA polygon that contains it.
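For the curious, here's a minimal sketch of that kind of check, assuming the address points (with their ZIP codes) and the Census ZCTA polygons are already loaded into PostGIS; the addresses and zctas tables and their column names are hypothetical, not our actual schema:

psql -c "
SELECT round(100.0 * count(*) FILTER (WHERE a.zip <> z.zcta5ce10) / count(*), 2) AS pct_mismatch
FROM addresses a
JOIN zctas z ON ST_Contains(z.geom, a.geom);
"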

Here are the results, broken down by state.

The national error rate of ZCTAs is 1.4%. That might be too high for some use cases, but perfectly acceptable for others. There's some regional variation, too: you're usually safe to use ZCTAs in Hawaii and Maine, but might want to exercise caution in Oregon and Utah.

Happy mapping!