Startup hacks and engineering miracles from your exhausted friends at Faraday

scrubcsv: clean CSVs, drop bad lines

Seamus Abshere on

This is part of our series on things that are obvious once you see them - and our data science series because it belongs in your toolchain.

Lies, damn lies, and commercial CSV export modules. Who wrote these things? On what planet would this be acceptable? Whatever.

Name,What's wrong  
"Robert "Bob" Smith",quotes inside quotes
Robert "Bob" Smith,quotes in the middle  
Robert Bob" Smith,unmatched quote  

Ruby dies immediately trying to read it:

$ irb
irb(main):001:0> require 'csv'  
=> true
irb(main):002:0> CSV.read('broken.csv')  
CSV::MalformedCSVError: Missing or stray quote in line 2  

Introducing scrubcsv, a is a lightning-fast static binary written in Rust that best-effort parses CSV and then immediately dumps back out 100% guaranteed standards-compliant CSV. Top speed? About 67mb/s.

$ scrubcsv broken.csv > fixed.csv
4 rows (0 bad) in 0.00 seconds, 787.13 KiB/sec

$ cat fixed.csv
Name,What's wrong  
"Robert Bob"" Smith""",quotes inside quotes
"Robert ""Bob"" Smith",quotes in the middle
"Robert Bob"" Smith",unmatched quote

It uses BurntSushi's world-beating CSV parser which is almost certainly faster than your SSD.

How to permanently delete versioned objects from S3

Seamus Abshere on

This is part of our cloud security and things that are obvious once you see them series. Duhh... safe!

Amazon's explanation of deleting a versioned object and the SDK documentation do not give an example of permanently deleting a versioned object. Here's how to do it:

require 'aws-sdk'

s3 = Aws::S3::Resource.new(  
  region: 'us-east-1',
  access_key_id: ACCESS_KEY_ID,
  secret_access_key: SECRET_ACCESS_KEY
)
bucket = s3.bucket('my-versioned-bucket')

bucket.objects.each do |object_summary|  
  o = bucket.object object_summary.key
  # this is the secret: specify the version while deleting
  o.delete version_id: o.version_id
end  

If you don't specify the version, you get a delete marker, which you can proceed to delete infinite times and it will not go away :)

Obvious once you see it: 1-click Excel autofilter + split

Seamus Abshere on

This is part of our series on things that are obvious once you see them. Duh!

Excel's autofilter and freeze panes are a big reason we keep buying Microsoft Office. Add both with 1 click (hint: the smiley face)

the macro in action

Sub faraday_io_autofilter_split()  
  Rows("1:1").Select
  Selection.AutoFilter
  Rows("2:2").Select
  ActiveWindow.FreezePanes = True
  Cells(1, 1).Select
End Sub  
  1. Tools -> Macro -> Visual Basic Editor -> double click Module1 of Personal Macro Workbook
  2. Paste the code above (starts with Sub faraday_io_autofilter_split())
  3. File -> Save Personal Macro Workbook
  4. Excel -> Close and Return to Microsoft Excel
  5. Right-click toolbar -> Customize Toolbars and Menus -> Commands -> Macros -> drag Custom Button to the menubar (it's a ☺)
  6. Right-click new button -> Assign Macro... -> select 'Personal Macro Workbook'!faraday_io_autofilter_split

a screencast of the instructions

How to merge JSON fields in PostgresSQL

Seamus Abshere on

This is part of our series on PostgreSQL and things that are obvious once you see them. Nice!

How do you combine (merge) JSON fields in Postgres? You define a new function json_collect()

# select * from greetings;
         data
-----------------------
 {"es":"Saludos"}
 {"en":"Hello"}
 {"ru":"Здравствуйте"}
(3 rows)

# select json_collect(data) from greetings;
                        json_collect
-------------------------------------------------------------
 { "es" : "Saludos", "en" : "Hello", "ru" : "Здравствуйте" }
(1 row)

json_collect is not built into Postgres, though. You have to create it yourself:

-- needed by json_collect
CREATE FUNCTION json_merge(data json, merge_data json) RETURNS json LANGUAGE sql IMMUTABLE  
AS $$  
  SELECT json_object_agg(key, value)
  FROM (
    WITH to_merge AS (
      SELECT * FROM json_each(merge_data)
    )
    SELECT *
    FROM json_each(data)
    WHERE key NOT IN (SELECT key FROM to_merge)
    UNION ALL
    SELECT * FROM to_merge
  ) t;
$$;

CREATE AGGREGATE json_collect(json) (  
  SFUNC = json_merge,
  STYPE = json,
  INITCOND = '{}'
);

Thanks to @matheusoliveira!