AI, startup hacks, and engineering miracles from your friends at Faraday

scrubcsv: clean CSVs, drop bad lines

Seamus Abshere on

This is part of our series on things that are obvious once you see them - and our data science series because it belongs in your toolchain.

Lies, damn lies, and commercial CSV export modules. Who wrote these things? On what planet would this be acceptable? Whatever.

Name,What's wrong
"Robert "Bob" Smith",quotes inside quotes
Robert "Bob" Smith,quotes in the middle
Robert Bob" Smith,unmatched quote

Ruby dies immediately trying to read it:

$ irb
irb(main):001:0> require 'csv'
=> true
CSV::MalformedCSVError: Missing or stray quote in line 2

Introducing scrubcsv, a is a lightning-fast static binary written in Rust that best-effort parses CSV and then immediately dumps back out 100% guaranteed standards-compliant CSV. Top speed? About 67mb/s.

$ scrubcsv broken.csv > fixed.csv
4 rows (0 bad) in 0.00 seconds, 787.13 KiB/sec

$ cat fixed.csv
Name,What's wrong
"Robert Bob"" Smith""",quotes inside quotes
"Robert ""Bob"" Smith",quotes in the middle
"Robert Bob"" Smith",unmatched quote

It uses BurntSushi's world-beating CSV parser which is almost certainly faster than your SSD.

How to convert a fixed-width file into CSV

Seamus Abshere on

This is part of our data science series. How predictive!

(The more valuable and massive a data set is, the less likely it's in a format you can just parse. Has anybody else noticed that?)

Here's how to convert a fixed-width file to CSV with the standard GNU unix tool gawk:

Theoretical (i.e., see "real life" below)

Thanks to stackoverflow: (reproducing verbatim)

gawk '$1=$1' OFS=, FIELDWIDTHS='4 2 5 1 1' infile > outfile.csv

Where FIELDWIDTHS is a list of field widths and OFS is the output file separator.

Real life

In real life, fixed width files contain commas and double quotes.

# put this in a file called fixed2csv.awk
  for (i=1;i<=NF;i++) {
    printf "\"%s\"%s", $i, (i<NF?OFS:ORS)

Then run it on your data:

gawk -f fixed2csv.awk OFS=, FIELDWIDTHS='4 2 5 1 1' infile > outfile.csv

Thanks to Ed Morton on Stackoverflow for inspiration!