This post is part of our data science hacks series.
xsv is a fast CSV-parsing toolkit written in Rust that mostly matches the functionality of csvkit (including the clutch ability to pipe between modules), with a few extras tacked on (like smart sampling). Did I mention it's fast? In a head-to-head comparison, I ran the "stats" module of xsv against "csvstat" from csvkit on a 30k-line, 400-column CSV file:
Python-based csvkit chews through it in a respectable-and-now-expected 4m16s.
xsv takes 1.8 seconds. I don't even have time for a sip of my coffee.
The difference between csvkit and xsv is partly defined by scale; both tools are plenty fast on smaller datasets. But once you get into the 10MB-and-upward range, xsv pulls far ahead: on the file above, that's minutes versus seconds.
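If you want to try a comparison like this yourself, here is a rough sketch. The file name bench.csv and the row and column counts are made up for illustration, and the generated data is just integers, so don't expect it to reproduce the timings above. It assumes a shell with a time keyword (bash or zsh), and it skips any tool that isn't installed.

```shell
# Generate a synthetic CSV for benchmarking (hypothetical file name and sizes).
ROWS=1000   # raise to 30000 to approximate the file described above
COLS=50     # raise to 400 to approximate the file described above
CSV_FILE=bench.csv

awk -v rows="$ROWS" -v cols="$COLS" 'BEGIN {
  # Header row: c1,c2,...,cN
  for (c = 1; c <= cols; c++) printf "c%d%s", c, (c < cols ? "," : "\n")
  # Data rows: simple deterministic integers
  for (r = 1; r <= rows; r++)
    for (c = 1; c <= cols; c++) printf "%d%s", r * c, (c < cols ? "," : "\n")
}' > "$CSV_FILE"

# Time each tool only if it is installed. Output is discarded so that
# terminal rendering does not skew the comparison.
if command -v csvstat >/dev/null 2>&1; then
  time csvstat "$CSV_FILE" > /dev/null
fi
if command -v xsv >/dev/null 2>&1; then
  time xsv stats "$CSV_FILE" > /dev/null
fi
```

Sending the output to /dev/null keeps the measurement about parsing and computing stats rather than about printing.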
If you've been using csvkit forever (like me), or if you want to be able to transform and analyze CSVs without loading them into a DB, give xsv a shot:
curl https://sh.rustup.rs -sSf | sh
. . . which also gives you the Rust package manager, cargo, which lets you:
cargo install xsv
Then be sure your PATH is configured correctly:
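A minimal sketch of that PATH step, assuming cargo's default install directory of $HOME/.cargo/bin (adjust it if you've pointed cargo somewhere else):

```shell
# Assumes cargo's default install directory, $HOME/.cargo/bin.
export PATH="$HOME/.cargo/bin:$PATH"

# Check whether the shell can now find the binary.
# Prints the path to xsv, or nothing if xsv is not installed.
command -v xsv || true
```

This only lasts for the current shell session; to make it stick, the same export line can go in your shell's startup file.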
. . . and try it out on a demo CSV with 10k rows, some messy strings, and multiple data types:
curl https://gist.githubusercontent.com/wboykinm/044e2af62fc0c7f77e17f6ccd55b8fb0/raw/fca391e6c03a06a7be770fefca6c47a9acdd2305/mock_data.csv \
  | xsv stats \
  | xsv table
(xsv table formats the data so it's readable in the console):
field           type     sum                 min                             max                min_length  max_length  mean                stddev
id              Integer  5005000             1                               1000               1           4           500.49999999999994  288.6749902572106
first_name      Unicode                      Aaron                           Willie             3           11
last_name       Unicode                      Adams                           Young              3           10
email           Unicode                      firstname.lastname@example.org  email@example.com  12          34
gender          Unicode                      Female                          Male               4           6
ip_address      Unicode                      0.111.40.87                     220.127.116.11     9           15
value           Unicode                      $1007.98                        $999.37            0           8
company         Unicode                      Abata                           Zoovu              0           13
lat             Float    243963.82509999987  -47.75034                       69.70287           0           9           24.42080331331331   24.98767816017553
lon             Float    443214.19009999954  -179.12198                      170.29993          0           10          44.36578479479489   71.16647723898215
messed_up_data  Unicode                      !@#$%^&*()                      𠜎𠜱𠝹𠱓𠱸𠲖𠳏     0           393
version         Unicode                      0.1.1                           9.99               3           14
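The piping and sampling mentioned at the top work the same way as the stats example. Here's a small sketch, assuming a hypothetical local file called people.csv with made-up contents; it chains the select, sample, and table subcommands, and does nothing if xsv isn't installed.

```shell
# Build a tiny stand-in CSV (hypothetical file name and contents).
cat > people.csv <<'EOF'
id,first_name,gender,lat
1,Aaron,Male,24.4
2,Willie,Female,-47.7
3,Ada,Female,69.7
4,Grace,Female,10.0
EOF

if command -v xsv >/dev/null 2>&1; then
  # Keep two columns, draw a random sample of 2 rows, and pretty-print.
  xsv select first_name,gender people.csv | xsv sample 2 | xsv table
fi
```

The sample is random, so the two rows you get back will vary from run to run.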