Startup hacks and engineering miracles from your exhausted friends at Faraday

scrubcsv: now with null value removal

Seamus Abshere on

This is part of our series on data science because it belongs in your toolchain. Happy Null Removal!

The latest version of scrubcsv has built-in null value removal:

$ cat a.csv
name,breed,age  
jerry,beagle,n/a  
tater,null,1

$ scrubcsv -n 'null|n/a' a.csv
name,breed,age  
jerry,beagle,  
tater,,1  

See how null and n/a went away?

Get the latest version with

$ cargo install scrubcsv -f

scrubcsv: clean CSVs, drop bad lines

Seamus Abshere on

This is part of our series on things that are obvious once you see them - and our data science series because it belongs in your toolchain.

Lies, damn lies, and commercial CSV export modules. Who wrote these things? On what planet would this be acceptable? Whatever.

Name,What's wrong  
"Robert "Bob" Smith",quotes inside quotes
Robert "Bob" Smith,quotes in the middle  
Robert Bob" Smith,unmatched quote  

Ruby dies immediately trying to read it:

$ irb
irb(main):001:0> require 'csv'  
=> true
irb(main):002:0> CSV.read('broken.csv')  
CSV::MalformedCSVError: Missing or stray quote in line 2  

Introducing scrubcsv: a lightning-fast static binary written in Rust that best-effort parses CSV and then immediately dumps back out 100% guaranteed standards-compliant CSV. Top speed? About 67 MB/s.

$ scrubcsv broken.csv > fixed.csv
4 rows (0 bad) in 0.00 seconds, 787.13 KiB/sec

$ cat fixed.csv
Name,What's wrong  
"Robert Bob"" Smith""",quotes inside quotes
"Robert ""Bob"" Smith",quotes in the middle
"Robert Bob"" Smith",unmatched quote

It uses BurntSushi's world-beating CSV parser which is almost certainly faster than your SSD.

No MD5, SHA1, or SHA256 collisions for US addresses

Seamus Abshere on

I calculated hashes of single-family home addresses in the United States:

create table hashtest as (  
  select
    house_number_and_street,
    city,
    state,
    digest(upper(house_number_and_street || ',' || city || ',' || state), 'md5') as "md5",
    digest(upper(house_number_and_street || ',' || city || ',' || state), 'sha1') as "sha1",
    digest(upper(house_number_and_street || ',' || city || ',' || state), 'sha256') as "sha256"
  from households
)

E.g.,

=> select upper(house_number_and_street || ',' || city || ',' || state) "key", digest(upper(house_number_and_street || ',' || city || ',' || state), 'md5') "md5" from households limit 1;
             key               |                md5
-------------------------------+------------------------------------
 1024 PENINSULA DR,WESTWOOD,CA | \x511cdfb25d6b77d45742ed0407b5c2ef
(1 row)
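
If you ever need to reproduce one of these hashes outside Postgres, Ruby's Digest standard library yields the same bytes for the same key string. A quick sketch, assuming the key is built exactly as above:

require 'digest'

key = '1024 PENINSULA DR,WESTWOOD,CA'
puts Digest::MD5.hexdigest(key)    # should match the hex after the \x in the md5 column
puts Digest::SHA256.hexdigest(key) # likewise for the sha256 column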

Then I counted the distinct hashes:

=> select count(distinct md5) md5, count(distinct sha1) sha1, count(distinct sha256) sha256, count(*) from hashtest;
   md5    |   sha1   |  sha256  |  count
----------+----------+----------+----------
 78224992 | 78224992 | 78224992 | 81087108
(1 row)

Some of the addresses are repeated in the database because the APNs are identical, but the conclusion is that we have 78 million uniques and no hash collisions with the algorithms tested.

Open letter to Slack: Fitts's law and channel weeding

Seamus Abshere on

Update: They fixed this! Thanks so much!

I've got a beef with Slack and I back it up with Fitts's law:

This scientific law predicts that the time required to rapidly move to a target area is a function of the ratio between the distance to the target and the width of the target.

Here's the beef:

it's hard to choose what messages to delete

As a remote team manager, I try to weed our Slack channels by deleting bot explosions and repetitive error dumps. Two requests:

  1. Obey Fitts's law and let me click anywhere on the message to select it.
  2. Let me specify a regex (e.g. Unknown account) that will mass-select messages for deletion.

Huzzah!

How to use Postgres cursors and multi-insert in Ruby with Sequel

Seamus Abshere on

This post is part of our PostgreSQL series.

SQL cursors are a great way to stream data out of a database without loading it all into memory.

Reading using a cursor

Sequel has use_cursor:

BATCH_SIZE = 1000
DB[:widgets].
  select(:id, :name, :power).
  use_cursor(rows_per_fetch: BATCH_SIZE).each do |row|
    row[:id] # whatever
  end

Writing using cursors and multi-insert

Here's a real-world example - we had to copy things from widgets to widgets_copy.

We thought we were being really clever, keeping all the data inside Postgres:

# NOTE: this is not the recommended method, but it works too
batches = (desired_count.to_f / BATCH_SIZE).ceil  
batches.times do |i|  
  offset = i * BATCH_SIZE
  DB.run <<-SQL
    INSERT INTO widgets_copy (
      id,
      name,
      power,
      created_at,
      updated_at
    ) (
      SELECT
        uuid_generate_v4(),
        name,
        power,
        now(),
        now()
      FROM widgets
      ORDER BY id
      LIMIT #{BATCH_SIZE}
      OFFSET #{offset}
    )
  SQL
end  

Even with 244 GB of RAM, this sometimes stopped dead 75% of the way through. We rewrote it (pulling the data into Ruby, even) and it's both faster and doesn't stall partway through. We're using Sequel's multi_insert.

batch = []
now = Sequel.function :now
uuid_generate_v4 = Sequel.function :uuid_generate_v4
DB[:widgets].
  select(:name, :power).
  order(:id).
  use_cursor(rows_per_fetch: BATCH_SIZE).each do |row|
    batch << row.merge!(
      id: uuid_generate_v4,
      updated_at: now,
      created_at: now
    )
    if batch.length == BATCH_SIZE
      DB[:widgets_copy].multi_insert batch
      batch.clear
    end
  end
# don't forget the final partial batch
DB[:widgets_copy].multi_insert batch if batch.any?
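
A cheap way to confirm the whole copy landed, assuming desired_count is the number of rows you expected to move:

raise 'copy came up short' unless DB[:widgets_copy].count == desired_count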

Thanks to @dkastner and @devTristan who promoted these ideas internally!

3 reasons Citus is the best (non-Heroku) Postgres host available today

Seamus Abshere on

Citus Cloud is the hosted Postgres that our app uses. We tried EnterpriseDB Postgres Plus Cloud and couldn't even get going; we used Amazon RDS for a year but migrated away about 6 months ago. Heroku Postgres wasn't bare-metal enough for us.

1. Log directly to Papertrail

It's not a world-shattering invention, but a feature like sending your Postgres logs to a syslog host is very useful for debugging.

Citus takes feature requests and actually implements them (this was one of mine).

RDS has a weird home-grown in-browser log viewer which is useless without things like rds-pgbadger.

2. citus extension preinstalled

You know that feeling where you're so happy you can use SQL, but you worry just a bit 🙏🏽 about your horizontal scaling story? Citus is the company behind the citus extension to Postgres:

Citus parallelizes incoming queries by breaking it into multiple fragment queries which run on the worker shards in parallel [...] The workers are simply running extended PostgreSQL servers and they apply PostgreSQL’s standard planning and execution logic to run these fragment SQL queries. Therefore, any optimization that helps PostgreSQL also helps Citus.

(from Querying Distributed Tables)
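
To make that concrete: once the extension is loaded, distributing a table is a single function call, and ordinary queries against it get split into per-shard fragments. A minimal sketch from Ruby via Sequel, assuming a citus version that ships create_distributed_table and a purely illustrative widgets table:

# DB is a Sequel connection to the citus coordinator node
DB.run "SELECT create_distributed_table('widgets', 'id')"

# from here on, queries against widgets fan out to the worker shards
DB[:widgets].where { power > 9000 }.count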

Speaking of extensions, EnterpriseDB didn't even come with PostGIS the last time we checked; to get it you needed a Windows-based (!!) "StackBuilder" tool. Support was based in India and used an ancient Salesforce interface. You get the idea.

3. Everything is negotiable

Citus gave me a great price to move over from RDS.

You're not juggling Amazon on-demand and reserved instances, or forgetting your snapshot settings and paying $2000/mo for obsolete EBS snapshot storage (did that!), or being a tiny fish in the massive AWS pond.

You're not spending a month of negotiations on the phone with EnterpriseDB "execs," paying a huge up-front fee, and then seeing the project fail utterly when your top devops engineer couldn't install a basic Postgres extension.

This is a company with people like Craig Kerstiens (helped build Heroku Postgres) on the other end of the phone, live support chat, and a real value prop.

How to create an RDS instance with Terraform

Seamus Abshere on

This post is part of our PostgreSQL series.

Terraform's RDS support makes it easy to create a database instance. Here's a cheatsheet:

resource "aws_db_instance" "mydb1" {  
  allocated_storage        = 256 # gigabytes
  backup_retention_period  = 7   # in days
  db_subnet_group_name     = "${var.rds_public_subnet_group}"
  engine                   = "postgres"
  engine_version           = "9.5.4"
  identifier               = "mydb1"
  instance_class           = "db.r3.large"
  multi_az                 = false
  name                     = "mydb1"
  parameter_group_name     = "mydbparamgroup1" # if you have tuned it
  password                 = "${trimspace(file("${path.module}/secrets/mydb1-password.txt"))}"
  port                     = 5432
  publicly_accessible      = true
  storage_encrypted        = true # you should always do this
  storage_type             = "gp2"
  username                 = "mydb1"
  vpc_security_group_ids   = ["${aws_security_group.mydb1.id}"]
}

Here's the security group you need:

resource "aws_security_group" "mydb1" {  
  name = "mydb1"

  description = "RDS postgres servers (terraform-managed)"
  vpc_id = "${var.rds_vpc_id}"

  # Only postgres in
  ingress {
    from_port = 5432
    to_port = 5432
    protocol = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # Allow all outbound traffic.
  egress {
    from_port = 0
    to_port = 0
    protocol = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

You can get these values from the EC2 console (don't forget to fill in your own!):

variable "rds_vpc_id" {  
  default = "vpc-XXXXXXXX"
  description = "Our default RDS virtual private cloud (rds_vpc)."
}

variable "rds_public_subnets" {  
  default = "subnet-YYYYYYYY,subnet-YYYYYYYY,subnet-YYYYYYYY,subnet-YYYYYYYY"
  description = "The public subnets of our RDS VPC rds-vpc."
}

variable "rds_public_subnet_group" {  
  default = "default-vpc-XXXXXXXX"
  description = "Apparently the group name, according to the RDS launch wizard."
}

How to convert a fixed-width file into CSV

Seamus Abshere on

This is part of our data science series. How predictive!

(The more valuable and massive a data set is, the less likely it's in a format you can just parse. Has anybody else noticed that?)

Here's how to convert a fixed-width file to CSV with gawk, the GNU version of the standard unix tool awk:

Theoretical

Thanks to Stack Overflow (reproduced verbatim):

gawk '$1=$1' OFS=, FIELDWIDTHS='4 2 5 1 1' infile > outfile.csv  

Where FIELDWIDTHS is a list of field widths and OFS is the output field separator.

Real life

In real life, fixed width files contain commas and double quotes.

# put this in a file called fixed2csv.awk
{
  for (i=1; i<=NF; i++) {
    sub(/\s+$/, "", $i)                    # trim trailing padding
    gsub(/"/, "\"\"", $i)                  # double any embedded quotes (CSV escaping)
    printf "\"%s\"%s", $i, (i<NF?OFS:ORS)  # quote the field; OFS between fields, newline at end of record
  }
}

Then run it on your data:

gawk -f fixed2csv.awk OFS=, FIELDWIDTHS='4 2 5 1 1' infile > outfile.csv  

Thanks to Ed Morton on Stack Overflow for the inspiration!
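
If you'd rather stay in Ruby than shell out to gawk, String#unpack does the fixed-width slicing and the CSV library handles the quoting. A sketch, assuming the same field widths of 4, 2, 5, 1, 1 and hypothetical file names:

require 'csv'

widths = [4, 2, 5, 1, 1]
format = widths.map { |w| "a#{w}" }.join # "a4a2a5a1a1" = consecutive fixed-width slices

CSV.open('outfile.csv', 'w') do |csv|
  File.foreach('infile') do |line|
    # unpack cuts the line at each width; rstrip trims the padding; CSV quotes as needed
    csv << line.chomp.unpack(format).map(&:rstrip)
  end
end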

How to permanently delete versioned objects from S3

Seamus Abshere on

This is part of our cloud security and things that are obvious once you see them series. Duhh... safe!

Amazon's explanation of deleting a versioned object and the SDK documentation do not give an example of permanently deleting a versioned object. Here's how to do it:

require 'aws-sdk'

s3 = Aws::S3::Resource.new(  
  region: 'us-east-1',
  access_key_id: ACCESS_KEY_ID,
  secret_access_key: SECRET_ACCESS_KEY
)
bucket = s3.bucket('my-versioned-bucket')

bucket.objects.each do |object_summary|  
  o = bucket.object object_summary.key
  # this is the secret: specify the version while deleting
  o.delete version_id: o.version_id
end  

If you don't specify the version, you just get a delete marker, which you can delete as many times as you like and it will never go away :)
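
Note that the loop above only removes each object's current version. If you also want to wipe older versions and leftover delete markers, the resource API exposes them as object_versions. A hedged sketch (double-check that batch_delete! exists in your aws-sdk version before leaning on it):

# permanently removes every version and delete marker in the bucket
bucket.object_versions.each do |version|
  version.delete
end
# or, with fewer API round trips:
# bucket.object_versions.batch_delete!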

How to enable S3 server-side encryption for existing objects

Seamus Abshere on

This is part of our cloud security series.

Do you have unencrypted S3 objects lying around? Don't! Here's the safe way to retroactively enable server-side encryption:

Step 1: Make a backup bucket

The AWS management console is easiest. Call it [my-bucket]-backup.
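
If you'd rather script it, it's one call with the s3 resource handle defined in Step 2 (a sketch; us-east-1 needs no location constraint):

s3.create_bucket(bucket: 'my-bucket-backup')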

Step 2: Copy one way

require 'aws-sdk'

s3 = Aws::S3::Resource.new(region: 'us-east-1', access_key_id: ACCESS_KEY_ID, secret_access_key: SECRET_ACCESS_KEY)  
b1 = s3.bucket('my-bucket')  
b2 = s3.bucket('my-bucket-backup')

# or no prefix if you want everything
b1.objects(prefix: 'xyz').each do |object_summary|  
  o1 = b1.object object_summary.key
  o2 = b2.object object_summary.key
  o1.copy_to o2, server_side_encryption: 'AES256'
end  

Step 3: Sanity check

Now look at [my-bucket]-backup - it's probably 100% perfect, but just reassure yourself.
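
If eyeballing the console isn't reassuring enough, a quick scripted spot check works too (a sketch, reusing b1 and b2 from Step 2; server_side_encryption should come back as "AES256" on the copies):

puts b1.objects(prefix: 'xyz').count
puts b2.objects(prefix: 'xyz').count
# confirm the encryption header on a handful of copied objects
b2.objects(prefix: 'xyz').take(5).each do |summary|
  puts b2.object(summary.key).server_side_encryption
end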

Step 4: Copy back over

There are 2 changes here (we iterate over b2 and copy back onto b1), so copy-paste carefully:

b2.objects.each do |object_summary|  
  o1 = b1.object object_summary.key
  o2 = b2.object object_summary.key
  o2.copy_to o1, server_side_encryption: 'AES256'
end  

Step 5: (optional) Clean up

Delete [my-bucket]-backup.
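
If you want to script this step too, the resource API can do it. A hedged sketch (delete! is assumed to empty the bucket, versions included, before deleting it, so be very sure you're pointing at the backup):

b2 = s3.bucket('my-bucket-backup')
b2.delete! # removes every object and version, then the bucket itself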