AI, startup hacks, and engineering miracles from your friends at Faraday

How to get U.S. Census data as CSV — censusapi2csv

Bill Morris on

This post is part of our data science series.

The U.S. Census and American Community Survey (ACS) are the crown jewels of open data (bother your Representative today to make sure they stay that way), but working with data from the Census API isn't always intuitive. Here's an example response to an API call for ACS per capita income data:

[["B19301_001E","state","county","tract","block group"],
["25611","50","007","000100","1"],
["36965","50","007","000100","2"],
["29063","50","007","000200","1"],
. . .

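It's not a CSV, and while it is technically valid JSON (an array of arrays with the header in the first row), it's not the one-object-per-record shape most tools expect; it's just . . . data. We tend to use CSVs as our basic building blocks, so we built a tool to nudge this response into plain CSV. Here's how to use it: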

Install

npm install censusapi2csv -g  

Usage

Let's grab a few things from the ACS API: total population (B01001) and per capita income (B19301), for every block group in Chittenden County, Vermont:

censusapi2csv -l 'block group' -f B01001,B19301 -s 50 -c 007  

. . . we can even pipe this into our favorite CSV-parsing tool, xsv:

censusapi2csv -l 'block group' -f B01001,B19301 -s 50 -c 007 | xsv table  

. . . and we get a formatted look at the data:

B01001_001E  B19301_001E  state  county  tract   block group  
3057         25611        50     007     000100  1  
1200         36965        50     007     000100  2  
1641         29063        50     007     000200  1  
1882         28104        50     007     000200  2  
699          61054        50     007     000200  3  
. . .
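Once the data is in CSV form, a common next step is to build a single join key from the geography columns. Here is a minimal Python sketch, assuming the column names shown in the table above; the `geoid` column name and the `add_geoid` helper are our own inventions for illustration. A block group identifier is the state, county, tract, and block group codes concatenated, so the values must be kept as strings to preserve leading zeros.

```python
import csv
import io


def add_geoid(csv_text: str) -> list[dict]:
    """Read census CSV text and add a concatenated block group identifier.

    DictReader keeps every value as a string, so leading zeros in codes
    such as county "007" survive the round trip.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # "geoid" is a hypothetical column name chosen for this sketch.
        row["geoid"] = (
            row["state"] + row["county"] + row["tract"] + row["block group"]
        )
        rows.append(row)
    return rows


# Two rows copied from the table above, as comma-separated text.
sample = """B01001_001E,B19301_001E,state,county,tract,block group
3057,25611,50,007,000100,1
1200,36965,50,007,000100,2
"""

for row in add_geoid(sample):
    print(row["geoid"], row["B01001_001E"], row["B19301_001E"])
```

The resulting 12-character key is handy for joining against other block group tables.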

This is just a tiny step in the process of working with census data - and there are many alternative approaches - but we thought it was worth sharing.
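As one example of an alternative approach, the core reshaping can be sketched in a few lines of Python. This is not how censusapi2csv is implemented; it only assumes the response is a JSON array of arrays with the header in the first row, as in the sample above. The `census_json_to_csv` name is made up for this sketch.

```python
import csv
import io
import json


def census_json_to_csv(response_text: str) -> str:
    """Convert a Census API-style JSON array of arrays into CSV text."""
    rows = json.loads(response_text)  # the first row holds the column names
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Rows copied from the sample response above, with the closing bracket
# added so that the snippet parses on its own.
sample = """[["B19301_001E","state","county","tract","block group"],
["25611","50","007","000100","1"],
["36965","50","007","000100","2"],
["29063","50","007","000200","1"]]"""

print(census_json_to_csv(sample), end="")
```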

Geochunk: fast, intelligent splitting for piles of address data

Bill Morris on


This post is part of our practical cartography and data science series.

The problem: you want to split up a few million U.S. address records into equally-sized chunks that retain spatial hierarchy. You want to do this without anything other than a street address (geocoding is expensive!). Maybe you want to do this as part of a map/reduce process (we certainly do), maybe you want to do some sampling, who knows?

The solution: Muthaflippin' Geochunk

Anyone who's ever used U.S. ZIP codes as a way to subdivide datasets can tell you: 60608 (pop 79,607) is a totally different beast than 05851 (pop 525). They're not census tracts; it's not really appropriate to compare them statistically or thematically.

Our solution - largely the work of platform wizard and Rust enthusiast Eric Kidd - is to bake census data into a tool that does the splitting for you at a level that allows for easy comparison. More specifically:

It provides a deterministic mapping from ZIP codes to "geochunks" that you can count on remaining stable.
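To make the idea of a deterministic mapping concrete, here is a deliberately simplified Python sketch. It is not geochunk's actual algorithm (the notebook mentioned in this post covers that); it just packs ZIP codes, in sorted order, into chunks that stay under a population target. The `assign_chunks` name is hypothetical, and every population figure below except the one for 60608 is made up for illustration.

```python
def assign_chunks(zip_populations: dict[str, int], target: int) -> dict[str, int]:
    """Greedily pack sorted ZIP codes into numbered chunks.

    A new chunk starts whenever adding the next ZIP code would push the
    running population past the target. Sorting first makes the result
    independent of input order, so the mapping is deterministic.
    """
    mapping = {}
    chunk_id = 0
    running = 0
    for zip_code in sorted(zip_populations):
        population = zip_populations[zip_code]
        if running and running + population > target:
            chunk_id += 1
            running = 0
        mapping[zip_code] = chunk_id
        running += population
    return mapping


# Only the 60608 figure comes from this post; the rest are hypothetical.
populations = {
    "60605": 25000,
    "60607": 24000,
    "60608": 79607,
    "60609": 65000,
    "60616": 48000,
}

for zip_code, chunk in assign_chunks(populations, 100_000).items():
    print(zip_code, chunk)
```

Numerically adjacent ZIP codes are often, but not always, geographically adjacent, which is why a sketch like this only gives rough contiguity.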

Check out the Jupyter notebook that explains the algorithm in detail, but it works like so:

Install

Install Rust first if you don't have it:

curl https://sh.rustup.rs -sSf | sh  

. . . then geochunk, using Cargo, the Rust package manager:

cargo install geochunk  

. . . or install from one of the prepackaged binaries.

Use 1: Indexing

Build a table that assigns every U.S. ZIP code to a geochunk containing roughly 250,000 people:

geochunk export zip2010 250000 > chunks_of_250k_people.csv  
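With a lookup table like that on disk, tagging your own records is a dictionary lookup. The Python sketch below assumes the export is a two-column CSV with a header row (ZIP code first, chunk label second); the header names and the `chunk_a`/`chunk_b` labels are placeholders, not real geochunk output, and both helper functions are hypothetical.

```python
import csv
import io


def load_chunk_index(csv_text: str) -> dict[str, str]:
    """Build a ZIP -> chunk lookup from a two-column CSV with a header row."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header, whatever its column names are
    return {row[0]: row[1] for row in reader}


def normalize_zip(raw: str) -> str:
    """Drop any ZIP+4 suffix and restore leading zeros lost upstream."""
    return raw.strip().split("-")[0].zfill(5)


# Placeholder table contents; real chunk labels will look different.
index_csv = """zip,chunk
05401,chunk_a
05851,chunk_b
"""

index = load_chunk_index(index_csv)
for raw_zip in ["5401", "05851-1234", "99999"]:
    # ZIP codes missing from the table come back as None.
    print(raw_zip, index.get(normalize_zip(raw_zip)))
```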

Use 2: List processing

Alternatively, let's try a pipeline example that uses geochunk csv: say you want to process every address in the state of Colorado in parallel, and you need equal-size but contiguous slices to do it.

  • Download and unzip the statewide address file from OpenAddresses:
wget -c https://s3.amazonaws.com/data.openaddresses.io/runs/283082/us/co/statewide.zip && unzip statewide.zip
  • Pipe the full file through geochunk, into slices of about 250,000 people each:
cat us/co/statewide.csv | geochunk csv zip2010 250000 POSTCODE > statewide_chunks_250k.csv

. . . and now you have 2 million addresses, chopped into ~8 equally-sized slices with rough contiguity:

[Image: denver]

Geochunk gets through a file of this size in 1.38s (have you heard us evangelizing about Rust yet?), leaving you plenty of time for the real processing.
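As for the real processing, a first step is usually to group the chunked rows so each slice can go to its own worker. Here is a minimal Python sketch, assuming the chunked file carries the chunk label in an extra column; the `geochunk` column name, the chunk labels, and the addresses below are all invented for illustration.

```python
import csv
import io
from collections import defaultdict


def group_by_chunk(csv_text: str, chunk_column: str) -> dict[str, list[dict]]:
    """Bucket CSV rows by the value in their chunk column."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row[chunk_column]].append(row)
    return dict(groups)


# Invented rows; a real file would have many more columns and rows.
sample = """NUMBER,STREET,POSTCODE,geochunk
100,Example St,80202,chunk_a
200,Sample Ave,80202,chunk_a
300,Placeholder Rd,81601,chunk_b
"""

for chunk, rows in group_by_chunk(sample, "geochunk").items():
    # Each bucket could be handed to a separate worker from here.
    print(chunk, len(rows))
```

For a file with millions of rows you would more likely stream each row to a per-chunk output file than hold everything in memory.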

This tool is serious dogfood for us; it's baked into our ETL system, and we use it to try to make a tiny dent in the Modifiable Areal Unit Problem. We hope you'll find it useful too.

Plancha: how to flatten multi-sheet Excel workbooks

Bill Morris on

This is part of our series on data science because it belongs in your toolchain.

If you work with data long enough - actually scratch that; if you work with data for more than a week - you'll run into the dreaded multi-sheet (or multi-tab) Excel workbook. Sometimes the sheets are unrelated, but other times they should really all be stacked together in the same table, ideally in a more interoperable format than .xlsx:

[Image: input workbook]

Enter plancha. Named for the trusty tortilla press, this simple CLI tool flattens multi-sheet Excel files, resolves header mismatches, and returns a pipeline-friendly CSV, like this:

[Image: flattened CSV output]
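The flattening idea itself is easy to sketch. The Python below is not plancha's implementation; it assumes "resolving header mismatches" means taking the union of all column names and leaving blanks where a sheet lacks a column. The sheets are plain lists of dictionaries rather than a real .xlsx file so the snippet runs on its own, and the sheet names, columns, and values are invented.

```python
import csv
import io


def flatten_sheets(sheets: dict[str, list[dict]]) -> str:
    """Stack rows from every sheet into one CSV with a shared header."""
    # Collect column names in first-seen order across all sheets.
    headers = []
    for rows in sheets.values():
        for row in rows:
            for key in row:
                if key not in headers:
                    headers.append(key)

    buffer = io.StringIO()
    # restval fills in columns that a given row does not have.
    writer = csv.DictWriter(
        buffer, fieldnames=headers, restval="", lineterminator="\n"
    )
    writer.writeheader()
    for rows in sheets.values():
        writer.writerows(rows)
    return buffer.getvalue()


# An invented two-sheet "workbook" with mismatched headers.
workbook = {
    "2016": [{"name": "a", "sales": 1}],
    "2017": [{"name": "b", "region": "north"}],
}

print(flatten_sheets(workbook), end="")
```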

Install

This is a Node.js tool, so use npm:

npm install plancha -g

Usage

Just feed it an input .xlsx file:

plancha -i myfile.xlsx


Happy data-pressing!