AI, startup hacks, and engineering miracles from your friends at Faraday

How to get U.S. Census data as CSV — censusapi2csv

Bill Morris

This post is part of our data science series.

The U.S. Census and American Community Survey (ACS) are the crown jewels of open data (bother your Representative today to make sure they stay that way), but working with data from the Census API isn't always intuitive. Here's an example response to an API call for ACS per capita income data:

[["B19301_001E","state","county","tract","block group"],
["25611","50","007","000100","1"],
["36965","50","007","000100","2"],
["29063","50","007","000200","1"],
. . .
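
(For context, the response above comes from an API call roughly like the one below; the dataset path, vintage, and geography syntax are illustrative only, so check the Census API docs for the exact form.)

curl 'https://api.census.gov/data/2015/acs/acs5?get=B19301_001E&for=block%20group:*&in=state:50%20county:007'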

It's not a CSV, it's not exactly JSON, it's just . . . data. We tend to use CSVs as our basic building blocks, so we built a tool to nudge this response into a pure format. Here's how to use it:

Install

npm install censusapi2csv -g  

Usage

Let's grab a few things from the ACS API: total population (B01001) and per capita income (B19301), for every block group in Chittenden County, Vermont:

censusapi2csv -l 'block group' -f B01001,B19301 -s 50 -c 007  

. . . we can even pipe this into our favorite CSV-parsing tool, xsv:

censusapi2csv -l 'block group' -f B01001,B19301 -s 50 -c 007 | xsv table  

. . . and we get a formatted look at the data:

B01001_001E  B19301_001E  state  county  tract   block group  
3057         25611        50     007     000100  1  
1200         36965        50     007     000100  2  
1641         29063        50     007     000200  1  
1882         28104        50     007     000200  2  
699          61054        50     007     000200  3  
. . .

This is just a tiny step in the process of working with census data - and there are many alternative approaches - but we thought it was worth sharing.

Geochunk: fast, intelligent splitting for piles of address data

Bill Morris

This post is part of our practical cartography and data science series.

The problem: you want to split up a few million U.S. address records into equally-sized chunks that retain spatial hierarchy. You want to do this without anything other than a street address (geocoding is expensive!). Maybe you want to do this as part of a map/reduce process (we certainly do), maybe you want to do some sampling, who knows?

The solution: Muthaflippin' Geochunk

Anyone who's ever used U.S. ZIP codes as a way to subdivide datasets can tell you: 60608 (pop 79,607) is a totally different beast than 05851 (pop 525). Unlike census tracts, ZIP codes were never designed to be compared statistically or thematically.

Our solution - largely the work of platform wizard and Rust enthusiast Eric Kidd - is to bake census data into a tool that does the splitting for you at a level that allows for easy comparison. More specifically:

It provides a deterministic mapping from zip codes to "geochunks" that you can count on remaining stable.

Check out the Jupyter notebook that explains the algorithm in detail, but it works like so:

Install

Install rust first if you don't have it:

curl https://sh.rustup.rs -sSf | sh  

. . . then geochunk, using the rust package manager:

cargo install geochunk  

. . . or install from one of the prepackaged binaries.

Use 1: Indexing

Build a table that assigns every U.S. zipcode to a geochunk that contains 250,000 people:

geochunk export zip2010 250000 > chunks_of_250k_people.csv  
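
To sanity-check the result, you can peek at the columns and the first few ZIP-to-chunk assignments with xsv and head (output not shown here):

xsv headers chunks_of_250k_people.csv
head -5 chunks_of_250k_people.csv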

Use 2: List processing

Alternately, let's try a pipeline example that uses geochunk csv: say you want to parallel-process every address in the state of Colorado, and you need equal-size but contiguous slices to do it.

  • Grab the statewide address file from OpenAddresses:

wget -c https://s3.amazonaws.com/data.openaddresses.io/runs/283082/us/co/statewide.zip && unzip statewide.zip  

  • Pipe the full file through geochunk, into slices of about 250,000 people each:

cat us/co/statewide.csv | geochunk csv zip2010 250000 POSTCODE > statewide_chunks_250k.csv  

. . . and now you have 2 million addresses, chopped into ~8 equally-sized slices with rough contiguity:

[map of the Denver area showing the resulting geochunks]
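
To confirm the slices really are about equal, count rows per chunk. The name of the column geochunk appends isn't shown here, so treat CHUNK_COLUMN below as a placeholder and check the header first:

# find the column geochunk added
xsv headers statewide_chunks_250k.csv
# then count rows per chunk (swap CHUNK_COLUMN for the actual column name)
xsv frequency --limit 0 -s CHUNK_COLUMN statewide_chunks_250k.csv | xsv table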

Geochunk works on this scale in 1.38s (Have you heard us evangelizing about Rust yet?), leaving you plenty of time for the real processing.

This tool is serious dogfood for us; it's baked into our ETL system, and we use it to try making a tiny dent in the Modifiable Areal Unit Problem. We hope you'll find it useful too.

Be not afraid of ZCTAs

Bill Morris

This post is part of our practical cartography series.

Most American geographers will note that - as much as we'd like it to be otherwise - ZIP Codes are not polygons. Rather, they're constantly-changing lines used by the USPS to coordinate delivery in an efficient network. Many of us polygon-happy mappers use ZIP Code Tabulation Areas (ZCTAs) instead; these are provided by the US Census as a reasonable open data alternative to ZIPs. They're particularly nice for thematic mapping (though their shortcomings have also been well-documented):

[thematic map of ZCTA polygons]

But why use ZCTAs if they can never be reconciled with their ground-truth ZIP cousins?

Because the difference is small.

Faraday has address and location records for every household in the country, and it was straightforward to check for disagreement between the ZIP Code of each physical address and the ZCTA polygon that contains it.
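
Here's a minimal sketch of that check, assuming a PostGIS database with hypothetical addresses (zip, geom) and zctas (zcta5ce10, geom) tables; the database, table, and column names are ours for illustration, not a real schema:

psql mydb -c "
  -- fraction of addresses whose ZIP disagrees with the ZCTA that contains them
  SELECT avg((a.zip <> z.zcta5ce10)::int) AS zcta_error_rate
  FROM addresses a
  JOIN zctas z ON ST_Contains(z.geom, a.geom);"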

Here are the results, broken down by state:

The national error rate of ZCTAs is 1.4%. That might be too high for some use cases, but perfectly acceptable for others. There's some regional variation, too: you're usually safe to use ZCTAs in Hawaii and Maine, but might want to exercise caution in Oregon and Utah.

Happy mapping!

Plancha: how to flatten multi-sheet excel workbooks

Bill Morris

This is part of our series on data science because it belongs in your toolchain.

If you work with data long enough - actually scratch that; if you work with data for more than a week - you'll run into the dreaded multi-sheet (or multi-tab) Excel workbook. Sometimes the sheets are unrelated, but other times they should really all be stacked together in the same table, ideally in a more interoperable format than .xlsx:

[screenshot: a multi-sheet Excel workbook]

Enter plancha, named for the trusty tortilla press: a simple CLI tool we built to flatten multi-sheet Excel files, resolve header mismatches, and return a pipeline-friendly CSV, like this:

[screenshot: the flattened CSV output]

Install

This is a node.js tool, so use npm:

npm install plancha -g

Usage

Just feed it an input .xlsx file:

plancha -i myfile.xlsx


Happy data-pressing!

scrubcsv: now with null value removal

Seamus Abshere

This is part of our series on data science because it belongs in your toolchain. Happy Null Removal!

The latest version of scrubcsv has built-in null value removal:

$ cat a.csv
name,breed,age  
jerry,beagle,n/a  
tater,null,1

$ scrubcsv -n 'null|n/a' a.csv
name,breed,age  
jerry,beagle,  
tater,,1  

See how null and n/a went away?

Get the latest version with

$ cargo install scrubcsv -f

How we made our CSV processing 142x faster

Bill Morris

This post is part of our data science hacks series

At Faraday, we've long used csvkit to understand, transform, and beat senseless our many streams of data. However, even this inimitable swiss army knife can be improved on - we've switched to xsv.

xsv is a fast CSV-parsing toolkit written in Rust that mostly matches the functionality of csvkit (including the clutch ability to pipe between modules), with a few extras tacked on (like smart sampling). Did I mention it's fast? In a head-to-head comparison, I ran xsv's "stats" command against csvkit's "csvstat" on a 30k-line, 400-column CSV file:

  • Python-based csvkit chews through it in a respectable (and, by now, expected) 4m16s.

  • xsv takes 1.8 seconds. I don't even have time for a sip of my coffee.
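
If you want to reproduce the comparison, it comes down to timing the two equivalent commands on the same file (the filename here is a stand-in):

time csvstat wide_file.csv > /dev/null
time xsv stats wide_file.csv > /dev/null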

The difference between csvkit and xsv is partly a matter of scale; both tools are plenty fast on smaller datasets. But once you get into the 10 MB-and-up range, xsv's processing speed pulls dramatically ahead.

If you've been using csvkit forever (like me), or if you want to be able to transform and analyze CSVs without loading them into a DB, give xsv a shot:

Install Rust

curl https://sh.rustup.rs -sSf | sh  

. . . which also gives you the rust package manager cargo, which lets you:

Install xsv

cargo install xsv  

Then be sure your PATH is configured correctly:

export PATH=~/.cargo/bin:$PATH  

. . . and try it out on a demo CSV with 10k rows, some messy strings, and multiple data types:

curl https://gist.githubusercontent.com/wboykinm/044e2af62fc0c7f77e17f6ccd55b8fb0/raw/fca391e6c03a06a7be770fefca6c47a9acdd2305/mock_data.csv \  
| xsv stats \
| xsv table

(xsv table formats the data so it's readable in the console):

field           type     sum                 min                  max                  min_length  max_length  mean                stddev  
id              Integer  5005000             1                    1000                 1           4           500.49999999999994  288.6749902572106  
first_name      Unicode                      Aaron                Willie               3           11  
last_name       Unicode                      Adams                Young                3           10  
email           Unicode                      aadamsp5@senate.gov  wwrightd8@upenn.edu  12          34  
gender          Unicode                      Female               Male                 4           6  
ip_address      Unicode                      0.111.40.87          99.50.37.244         9           15  
value           Unicode                      $1007.98             $999.37              0           8  
company         Unicode                      Abata                Zoovu                0           13  
lat             Float    243963.82509999987  -47.75034            69.70287             0           9           24.42080331331331   24.98767816017553  
lon             Float    443214.19009999954  -179.12198           170.29993            0           10          44.36578479479489   71.16647723898215  
messed_up_data  Unicode                      !@#$%^&*()           𠜎𠜱𠝹𠱓𠱸𠲖𠳏       0           393  
version         Unicode                      0.1.1                9.99                 3           14  
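
And since smart sampling came up above: if you save the demo file locally first (the filename below is just an example), xsv can pull a random subset before any heavier processing:

xsv sample 1000 mock_data.csv > mock_sample.csv
xsv count mock_sample.csv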

Happy parsing!

scrubcsv: clean CSVs, drop bad lines

Seamus Abshere

This is part of our series on things that are obvious once you see them - and our data science series because it belongs in your toolchain.

Lies, damn lies, and commercial CSV export modules. Who wrote these things? On what planet would this be acceptable? Whatever.

Name,What's wrong  
"Robert "Bob" Smith",quotes inside quotes
Robert "Bob" Smith,quotes in the middle  
Robert Bob" Smith,unmatched quote  

Ruby dies immediately trying to read it:

$ irb
irb(main):001:0> require 'csv'  
=> true
irb(main):002:0> CSV.read('broken.csv')  
CSV::MalformedCSVError: Missing or stray quote in line 2  

Introducing scrubcsv: a lightning-fast static binary written in Rust that best-effort parses CSV and then immediately dumps back out 100% guaranteed standards-compliant CSV. Top speed? About 67 MB/s.
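
It's a Rust tool, so cargo (which you'll have if you installed xsv above) installs it:

cargo install scrubcsv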

$ scrubcsv broken.csv > fixed.csv
4 rows (0 bad) in 0.00 seconds, 787.13 KiB/sec

$ cat fixed.csv
Name,What's wrong  
"Robert Bob"" Smith""",quotes inside quotes
"Robert ""Bob"" Smith",quotes in the middle
"Robert Bob"" Smith",unmatched quote

It uses BurntSushi's world-beating CSV parser which is almost certainly faster than your SSD.

How to crunch lots of geodata in parallel

Bill Morris

This post is part of our data science and practical cartography series.

GNU parallel + ogr2ogr = happy data scientists

These power tools in combination make it very easy to process lots of geodata at once, in as many parallel operations as your local machine or server can support.

Reprojecting in bulk

Here's an example, assuming you have a folder full of shapefiles you want to reproject into Geographic coordinates. Make a directory for the output, then pipe every shapefile through ogr2ogr in parallel:

mkdir wgs84  
ls *.shp | parallel ogr2ogr -t_srs 'EPSG:4326' wgs84/{} {}  

Running a sequence of commands on many files

In order to build whole data workflows, you can wrap your sequence of commands in a bash function. Here's an example, where we:

  1. Download each state landmarks file from the census FTP
  2. Extract each file
  3. Create a new file for each consisting of only airport landmarks, projected to WGS84

# grab this handy list of all state FIPS codes
wget -c https://gist.githubusercontent.com/wboykinm/6c514e9caf1fc3158e350fa926ea02bd/raw/f742515fd06824dafd0a88c62b4de11fa1e39fa1/state_fips_codes.txt

# define the function
get_airports() {  
  # grab the data from the census server
  wget -c http://www2.census.gov/geo/tiger/TIGER2016/POINTLM/tl_2016_$1_pointlm.zip
  unzip tl_2016_$1_pointlm.zip
  # extract just airports (code K2451) and reproject to WGS84
  ogr2ogr -t_srs "EPSG:4326" -where "MTFCC = 'K2451'" tl_2016_$1_airports.shp tl_2016_$1_pointlm.shp
  echo "done with state $1"
}
export -f get_airports

# kick off the parallel processing!
cat state_fips_codes.txt | parallel get_airports {}

This crunches through 52 states and territories in 21.8 seconds on a small ec2 server, limited only by network speed.

[map of the extracted airport landmarks]
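
As a quick sanity check that each state and territory produced an output file (using the naming pattern from the function above):

ls tl_2016_*_airports.shp | wc -l   # should be close to 52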

Install the tools

  • GNU parallel
    • OSX: brew install parallel
    • Ubuntu: apt-get install parallel
  • ogr2ogr
    • OSX: brew install gdal --HEAD
    • Ubuntu: sudo apt-get install gdal-bin

Bonus toolkit: from Derek Watkins, here are a few dozen examples of the awesome geoprocessing you can do with GDAL/OGR.

Happy mapping!

How to do histograms in PostgreSQL

Bill Morris

This post is part of our data science and PostgreSQL series.

UPDATED FOR 2017: now with easy subquery support and a more sensible argument order!

We adapted this excellent piece by Dmitri Fontaine and turned it into a function.

It can be invoked like this:

SELECT * FROM histogram($table_name_or_subquery, $column_name);  

. . . to give sweet results like this, in a check of the distribution of 2016 political contributions in Vermont:

fec=# SELECT * FROM histogram('(SELECT * FROM small_donors_vt LIMIT 50000)', 'transaction_amt');

 bucket |   range   | freq |       bar       
--------+-----------+------+-----------------
      1 | [0,9]     | 2744 | ******
      2 | [10,19]   | 5630 | *************
      3 | [20,29]   | 6383 | ***************
      4 | [30,39]   | 1290 | ***
      5 | [40,49]   |  369 | *
      6 | [50,59]   | 3541 | ********
      7 | [60,69]   |  174 | 
      8 | [70,79]   |  313 | *
      9 | [80,89]   |  171 | 
     10 | [90,99]   |   65 | 
     11 | [100,109] | 2363 | ******
     12 | [110,119] |   51 | 
     13 | [120,129] |  115 | 
     14 | [130,139] |   32 | 
     15 | [140,146] |   11 | 
     16 | [150,159] |  187 | 
     17 | [160,169] |   24 | 
     18 | [170,177] |   33 | 
     19 | [180,189] |   19 | 
     20 | [191,199] |   24 | 
     21 | [200,200] |  795 | **

Use it yourself by adding this to your postgres setup:

CREATE OR REPLACE FUNCTION histogram(table_name_or_subquery text, column_name text)  
RETURNS TABLE(bucket int, "range" numrange, freq bigint, bar text)  
AS $func$  
BEGIN  
RETURN QUERY EXECUTE format('  
  WITH
  source AS (
    SELECT * FROM %s
  ),
  min_max AS (
    SELECT min(%s) AS min, max(%s) AS max FROM source
  ),
  histogram AS (
    SELECT
      width_bucket(%s, min_max.min, min_max.max, 20) AS bucket,
      numrange(min(%s)::numeric, max(%s)::numeric, ''[]'') AS "range",
      count(%s) AS freq
    FROM source, min_max
    WHERE %s IS NOT NULL
    GROUP BY bucket
    ORDER BY bucket
  )
  SELECT
    bucket,
    "range",
    freq::bigint,
    repeat(''*'', (freq::float / (max(freq) over() + 1) * 15)::int) AS bar
  FROM histogram',
  table_name_or_subquery,
  column_name,
  column_name,
  column_name,
  column_name,
  column_name,
  column_name,
  column_name
  );
END  
$func$ LANGUAGE plpgsql;

Note:

  1. You don't need to use a subquery - you can also just provide a table name as the first argument.
  2. Subqueries are useful for sampling. For example, (SELECT * FROM bigtable ORDER BY RANDOM() LIMIT 500). Don't forget the parentheses! (See the example after these notes.)
  3. You can call this as SELECT histogram() or SELECT * FROM histogram(). The latter form is much more legible!
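
For instance, here's a sampled run against the same FEC table from the example above, wrapped in a psql call:

psql fec -c "SELECT * FROM histogram('(SELECT * FROM small_donors_vt ORDER BY RANDOM() LIMIT 500)', 'transaction_amt');"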

Happy querying!

How to convert a fixed-width file into CSV

Seamus Abshere

This is part of our data science series. How predictive!

(The more valuable and massive a data set is, the less likely it's in a format you can just parse. Has anybody else noticed that?)

Here's how to convert a fixed-width file to CSV with gawk (GNU awk):

Theoretical (i.e., see "real life" below)

Thanks to Stack Overflow (reproduced verbatim):

gawk '$1=$1' OFS=, FIELDWIDTHS='4 2 5 1 1' infile > outfile.csv  

Where FIELDWIDTHS is a list of field widths and OFS is the output field separator.
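
As a quick illustration, here's the same recipe on a made-up 13-character record (widths 4+2+5+1+1):

printf '1234ABCDEFGHI\n' > sample.txt
gawk '$1=$1' OFS=, FIELDWIDTHS='4 2 5 1 1' sample.txt
# -> 1234,AB,CDEFG,H,I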

Real life

In real life, fixed width files contain commas and double quotes.

# put this in a file called fixed2csv.awk
{
  for (i=1;i<=NF;i++) {
    sub(/\s+$/,"",$i)                      # strip trailing whitespace from each field
    gsub(/"/,"\"\"",$i)                    # escape any embedded double quotes
    printf "\"%s\"%s", $i, (i<NF?OFS:ORS)  # quote the field; separate with OFS, end the line with ORS
  }
}

Then run it on your data:

gawk -f fixed2csv.awk OFS=, FIELDWIDTHS='4 2 5 1 1' infile > outfile.csv  

Thanks to Ed Morton on Stackoverflow for inspiration!