Startup hacks and engineering miracles from your exhausted friends at Faraday

How we made our CSV processing 142x faster

Bill Morris on

This post is part of our data science hacks series

At Faraday, we've long used csvkit to understand, transform, and beat senseless our many streams of data. However, even this inimitable Swiss Army knife can be improved on: we've switched to xsv.

xsv is a fast CSV-parsing toolkit written in Rust that mostly matches the functionality of csvkit (including the clutch ability to pipe between modules), with a few extras tacked on (like smart sampling). Did I mention it's fast? In a head-to-head comparison, I ran the "stats" module of xsv against "csvstat" from csvkit, on a 30k-line, 400-column CSV file:

  • Python-based csvkit chews through it in a respectable-and-now-expected 4m16s.

  • xsv takes 1.8 seconds. I don't even have time for a sip of my coffee.
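For the record, the headline number falls straight out of those two timings. A quick sketch of the arithmetic in the shell (awk does the division, since shell arithmetic is integer-only; the variable names are just mine):

```shell
# Timings from the comparison above: csvkit took 4m16s, xsv took 1.8s.
csvkit_seconds=$((4 * 60 + 16))   # 256 seconds
xsv_seconds=1.8

# Shell arithmetic is integer-only, so hand the division to awk.
speedup=$(awk -v slow="$csvkit_seconds" -v fast="$xsv_seconds" \
  'BEGIN { printf "%.0f", slow / fast }')

echo "xsv was roughly ${speedup}x faster"
```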

The difference between csvkit and xsv is partly defined by scale; both tools are plenty fast on smaller datasets. But once you get into the 10MB-and-upward range, xsv's processing speed pulls dramatically ahead.
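If you want to see where the gap opens up on your own machine, here's a rough sketch. It generates a synthetic CSV and times whichever of the two tools is installed. The file name, row count, and columns are arbitrary choices of mine, not the file from the comparison above, so don't expect the same numbers.

```shell
# Build a throwaway CSV; bump `rows` to push the file past the 10MB mark.
rows=50000
bench_csv="${TMPDIR:-/tmp}/xsv_bench.csv"   # hypothetical scratch file

awk -v n="$rows" 'BEGIN {
  srand(42)
  print "id,lat,lon,label"
  for (i = 1; i <= n; i++)
    printf "%d,%.5f,%.5f,row_%d\n", i, rand() * 180 - 90, rand() * 360 - 180, i
}' > "$bench_csv"

# Time each tool only if it is actually on the PATH.
for tool in "csvstat" "xsv stats"; do
  if command -v "${tool%% *}" >/dev/null 2>&1; then
    echo "== $tool =="
    time $tool "$bench_csv" >/dev/null
  else
    echo "skipping $tool (not installed)"
  fi
done
```

Raising `rows` by a factor of ten or so should make the difference hard to miss.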

If you've been using csvkit forever (like me), or if you want to be able to transform and analyze CSVs without loading them into a DB, give xsv a shot:

Install Rust

curl https://sh.rustup.rs -sSf | sh  

. . . which also gives you the Rust package manager cargo, which lets you:

Install xsv

cargo install xsv  

Then be sure your PATH is configured correctly:

export PATH=~/.cargo/bin:$PATH  

. . . and try it out on a demo CSV with 10k rows, some messy strings, and multiple data types:

curl https://gist.githubusercontent.com/wboykinm/044e2af62fc0c7f77e17f6ccd55b8fb0/raw/fca391e6c03a06a7be770fefca6c47a9acdd2305/mock_data.csv \
| xsv stats \
| xsv table

xsv table formats the output so it's readable in the console:

field           type     sum                 min                  max                  min_length  max_length  mean                stddev  
id              Integer  5005000             1                    1000                 1           4           500.49999999999994  288.6749902572106  
first_name      Unicode                      Aaron                Willie               3           11  
last_name       Unicode                      Adams                Young                3           10  
email           Unicode                      aadamsp5@senate.gov  wwrightd8@upenn.edu  12          34  
gender          Unicode                      Female               Male                 4           6  
ip_address      Unicode                      0.111.40.87          99.50.37.244         9           15  
value           Unicode                      $1007.98             $999.37              0           8  
company         Unicode                      Abata                Zoovu                0           13  
lat             Float    243963.82509999987  -47.75034            69.70287             0           9           24.42080331331331   24.98767816017553  
lon             Float    443214.19009999954  -179.12198           170.29993            0           10          44.36578479479489   71.16647723898215  
messed_up_data  Unicode                      !@#$%^&*()           𠜎𠜱𠝹𠱓𠱸𠲖𠳏       0           393  
version         Unicode                      0.1.1                9.99                 3           14  
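Stats is only one module. As a sketch of the piping and sampling mentioned earlier, here's a small pipeline run against a tiny inline CSV. The rows are invented for illustration (only the column names mirror the demo file), and the scratch file path is my own choice. It uses the count, select, sample, and table subcommands, and does nothing but print a hint if xsv isn't on your PATH.

```shell
# Tiny stand-in CSV; these rows are made up, not taken from mock_data.csv.
sample_csv="${TMPDIR:-/tmp}/xsv_demo.csv"
printf '%s\n' \
  'id,first_name,last_name,lat,lon' \
  '1,Aaron,Adams,44.47,-73.21' \
  '2,Willie,Young,-12.05,130.84' \
  '3,Maria,Lopez,19.43,-99.13' \
  '4,Chen,Wei,31.23,121.47' > "$sample_csv"

if command -v xsv >/dev/null 2>&1; then
  # How many records are there (header excluded)?
  xsv count "$sample_csv"

  # Keep three columns, draw a random sample of 2 rows, pretty-print.
  xsv select first_name,lat,lon "$sample_csv" \
    | xsv sample 2 \
    | xsv table
else
  echo "xsv not found; install it with: cargo install xsv"
fi
```

Because the sample is random, the two rows you get back will vary from run to run.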

Happy parsing!