Startup hacks and engineering miracles from your exhausted friends at Faraday

Use headless chromium with capybara and selenium webdriver - today!

Seamus Abshere on

UPDATE: we have a new version of this post out that resolves some of the gotchas below.

Here's a Rubyist's magic incantation to run headless chromium with selenium-webdriver and capybara (it will be similar in other languages):

require 'capybara'
require 'selenium-webdriver'

Capybara.register_driver :headless_chromium do |app|
  caps = Selenium::WebDriver::Remote::Capabilities.chrome(
    "chromeOptions" => {
      'binary' => '/home/myuser/chrome-linux-440004/chrome',
      'args' => ['headless', 'disable-gpu']
    }
  )
  Capybara::Selenium::Driver.new app,
    browser: :chrome,
    desired_capabilities: caps
end


Gotchas:

  1. You need a chromium (chrome) binary that reports version 57 (version 59 is too new). For example, snapshot 440004 - just download and unzip.
  2. You need a recent chromedriver binary with support for headless chrome. For example, snapshot 460342 - just download and unzip.
  3. If you get the error "unrecognized chrome version", see (1) above - you probably have a too-recent chromium.

Thanks to @dkastner!

Antipattern: Using Ruby's Hash#[]

Seamus Abshere on

This is part of our antipatterns series. Ouch! Updated for 2017!

Ask yourself why you're using Hash#[]. It is a great way to introduce silent bugs into your app.

Use Hash#fetch if you expect the value to exist

That way you get sensible error messages.

#> params = {}

#> params.fetch('really').fetch('important')
KeyError: key not found: "really"  

Use Hash#dig if you don't care

Because you don't get idiotic, non-semantic NoMethodError: undefined method '[]' for nil:NilClass errors.

#> params.dig('really', 'important')
=> nil

Avoid Hash#[] because... just... why?

#> params['really']['important']
NoMethodError: undefined method `[]' for nil:NilClass  

Special case: ENV

The Twelve-Factor App has us all using environment variables. But most of us default to ENV#[] to look stuff up... even if it's critical. Bad idea! Use fetch!

#> ENV['REALLY_IMPORTANT'] == 'thing'
=> false # well i hope you didn't need that

#> ENV.fetch('REALLY_IMPORTANT') == 'thing'
KeyError: key not found: "REALLY_IMPORTANT"  
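Hash#fetch (and ENV.fetch) also take a default as a second argument, for keys that are genuinely optional: you declare the fallback instead of silently getting nil. A small sketch (the config hash and keys are hypothetical):

```ruby
# fetch with no default raises on missing keys; fetch with a default
# makes the fallback explicit; dig is for "missing is fine".
config = { 'host' => 'db.example.com' }

config.fetch('host')          #=> "db.example.com"
config.fetch('port', 5432)    #=> 5432 (default used)
config.dig('pool', 'size')    #=> nil (missing, and that's fine here)
```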

scrubcsv: now with null value removal

Seamus Abshere on

This is part of our series on data science because it belongs in your toolchain. Happy Null Removal!

The latest version of scrubcsv has built-in null value removal. Given a file like this:

$ cat a.csv
name,state
Bob,null
Sue,n/a
Tim,VT

the -n flag takes a regex of values to blank out:

$ scrubcsv -n 'null|n/a' a.csv
name,state
Bob,
Sue,
Tim,VT

See how null and n/a went away?

Get the latest version with

$ cargo install scrubcsv -f
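Once scrubcsv has blanked the nulls, Ruby's CSV parser hands the empty fields back as nil, so downstream code sees real missing values. A sketch (inline data stands in for the scrubbed a.csv):

```ruby
require 'csv'

# Former "null"/"n/a" cells are now empty fields, which Ruby parses as nil.
rows = CSV.parse("name,state\nBob,\nTim,VT\n", headers: true)
rows[0]['state']  #=> nil
rows[1]['state']  #=> "VT"
```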

scrubcsv: clean CSVs, drop bad lines

Seamus Abshere on

This is part of our series on things that are obvious once you see them - and our data science series because it belongs in your toolchain.

Lies, damn lies, and commercial CSV export modules. Who wrote these things? On what planet would this be acceptable? Whatever.

Name,What's wrong  
"Robert "Bob" Smith",quotes inside quotes
Robert "Bob" Smith,quotes in the middle  
Robert Bob" Smith,unmatched quote  

Ruby dies immediately trying to read it:

$ irb
irb(main):001:0> require 'csv'
=> true
irb(main):002:0> CSV.read('broken.csv')
CSV::MalformedCSVError: Missing or stray quote in line 2

Introducing scrubcsv, a lightning-fast static binary written in Rust that best-effort parses CSV and then immediately dumps back out 100% guaranteed standards-compliant CSV. Top speed? About 67 MB/s.

$ scrubcsv broken.csv > fixed.csv
4 rows (0 bad) in 0.00 seconds, 787.13 KiB/sec

$ cat fixed.csv
Name,What's wrong  
"Robert Bob"" Smith""",quotes inside quotes
"Robert ""Bob"" Smith",quotes in the middle
"Robert Bob"" Smith",unmatched quote

It uses BurntSushi's world-beating CSV parser which is almost certainly faster than your SSD.
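If you can't shell out to scrubcsv, Ruby's own escape hatch is the liberal_parsing option (Ruby >= 2.4), which tolerates stray quotes inside unquoted fields. A sketch using one of the bad lines above:

```ruby
require 'csv'

bad = 'Robert "Bob" Smith,quotes in the middle'

# The strict parser rejects a bare quote inside an unquoted field...
begin
  CSV.parse_line(bad)
rescue CSV::MalformedCSVError => e
  puts e.message
end

# ...but liberal_parsing keeps the stray quotes as literal text:
CSV.parse_line(bad, liberal_parsing: true)
#=> ["Robert \"Bob\" Smith", "quotes in the middle"]
```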

No MD5, SHA1, or SHA256 collisions for US addresses

Seamus Abshere on

I calculated hashes of single-family home addresses in the United States:

create table hashtest as (
  select
    digest(upper(house_number_and_street || ',' || city || ',' || state), 'md5') as "md5",
    digest(upper(house_number_and_street || ',' || city || ',' || state), 'sha1') as "sha1",
    digest(upper(house_number_and_street || ',' || city || ',' || state), 'sha256') as "sha256"
  from households
);


=> select upper(house_number_and_street || ',' || city || ',' || state) "key", digest(upper(house_number_and_street || ',' || city || ',' || state), 'md5') "md5" from households limit 1;
              key               |                md5
--------------------------------+------------------------------------
 1024 PENINSULA DR,WESTWOOD,CA  | \x511cdfb25d6b77d45742ed0407b5c2ef
(1 row)

Then I counted the distinct hashes:

=> select count(distinct md5) md5, count(distinct sha1) sha1, count(distinct sha256) sha256, count(*) from hashtest;
   md5    |   sha1   |  sha256  |  count
----------+----------+----------+----------
 78224992 | 78224992 | 78224992 | 81087108
(1 row)

Some of the addresses are repeated in the database because the APNs are identical, but the conclusion is that we have 78 million uniques and no hash collisions with the algorithms tested.
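That outcome is exactly what the birthday bound predicts: with n distinct keys and a b-bit hash, the expected number of colliding pairs is roughly n(n-1)/2 divided by 2^b, which for 78 million keys against even the weakest algorithm here (128-bit MD5) is astronomically small. A quick sanity check:

```ruby
# Expected number of colliding pairs under the birthday approximation:
# n*(n-1)/2 pairs, each colliding with probability 2**-128 for MD5.
n = 78_224_992
expected = (n.to_f * (n - 1) / 2) / (2.0**128)
# on the order of 1e-23 -- effectively zero, consistent with the counts above
```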

Open letter to Slack: Fitts's law and channel weeding

Seamus Abshere on

Update: They fixed this! Thanks so much!

I've got a beef with Slack and I back it up with Fitts's law:

This scientific law predicts that the time required to rapidly move to a target area is a function of the ratio between the distance to the target and the width of the target.

Here's the beef:

it's hard to choose what messages to delete

As a remote team manager, I try to weed our Slack channels by deleting bot explosions and repetitive error dumps. Two changes would make that easy:

  1. Obey Fitts's law and let me click anywhere on the message to select it.
  2. Let me specify a regex (e.g. Unknown account) that will mass-select messages for deletion.


How to use Postgres cursors and multi-insert in Ruby with Sequel

Seamus Abshere on

This post is part of our PostgreSQL series.

SQL cursors are a great way to stream data out of a database without loading it all into memory.

Reading using a cursor

Sequel has use_cursor:

BATCH_SIZE = 1000

DB[:widgets].
  select(:id, :name, :power).
  use_cursor(rows_per_fetch: BATCH_SIZE).each do |row|
    row[:id] # whatever you need from each row
  end

Writing using cursors and multi-insert

Here's a real-world example - we had to copy things from widgets to widgets_copy.

We thought we were being really clever, keeping all the data inside Postgres:

# NOTE: this is not the recommended method, but it works too
batches = (desired_count.to_f / BATCH_SIZE).ceil
batches.times do |i|
  offset = i * BATCH_SIZE
  DB.run <<-SQL
    INSERT INTO widgets_copy (
      id,
      name,
      power,
      created_at,
      updated_at
    ) (
      SELECT uuid_generate_v4(), name, power, now(), now()
      FROM widgets
      ORDER BY id
      LIMIT #{BATCH_SIZE}
      OFFSET #{offset}
    )
  SQL
end
Even with 244 GB of RAM, this sometimes stopped dead 75% of the way through. We rewrote it (pulling the data into Ruby, even) and it's both faster and doesn't stop in the middle. We're using Sequel's multi_insert.

batch = []
now = Sequel.function :now
uuid_generate_v4 = Sequel.function :uuid_generate_v4

DB[:widgets].
  select(:name, :power).
  use_cursor(rows_per_fetch: BATCH_SIZE).each do |row|
    batch << row.merge!(
      id: uuid_generate_v4,
      updated_at: now,
      created_at: now
    )
    if batch.length == BATCH_SIZE
      DB[:widgets_copy].multi_insert batch
      batch = []
    end
  end
DB[:widgets_copy].multi_insert batch if batch.any? # flush the remainder

Thanks to @dkastner and @devTristan who promoted these ideas internally!

3 reasons Citus is the best (non-Heroku) Postgres host available today

Seamus Abshere on

Citus Cloud is the hosted Postgres that our app uses. We tried EnterpriseDB Postgres Plus Cloud and couldn't even get going; we used Amazon RDS for a year but migrated away about 6 months ago. Heroku Postgres wasn't bare-metal enough for us.

1. Log directly to Papertrail

It's not a world-shattering invention, but features like sending your Postgres logs to a syslog host are very useful for debugging.

Citus takes feature requests and actually implements them (this was one of mine).

RDS has a weird home-grown in-browser log viewer which is useless without things like rds-pgbadger.

2. citus extension preinstalled

You know that feeling where you're so happy you can use SQL, but you worry just a bit 🙏🏽 about your horizontal scaling story? Citus is the company behind the citus extension to Postgres:

Citus parallelizes incoming queries by breaking it into multiple fragment queries which run on the worker shards in parallel [...] The workers are simply running extended PostgreSQL servers and they apply PostgreSQL’s standard planning and execution logic to run these fragment SQL queries. Therefore, any optimization that helps PostgreSQL also helps Citus.

(from Querying Distributed Tables)

Speaking of extensions, EnterpriseDB didn't even come with PostGIS the last time we checked; to get it you needed a Windows-based (!!) "StackBuilder" tool. Support was based in India and used an ancient Salesforce interface. You get the idea.

3. Everything is negotiable

Citus gave me a great price to move over from RDS.

You're not juggling Amazon on-demand and reserved instances, or forgetting your snapshot settings and paying $2000/mo for obsolete EBS snapshot storage (did that!), or being a tiny fish in the massive AWS pond.

You're not spending a month of negotiations on the phone with EnterpriseDB "execs," paying a huge up-front fee, and then seeing the project fail utterly when your top devops engineer couldn't install a basic Postgres extension.

This is a company with people like Craig Kerstiens (helped build Heroku Postgres) on the other end of the phone, live support chat, and a real value prop.

How to create an RDS instance with Terraform

Seamus Abshere on

This post is part of our PostgreSQL series.

Terraform's RDS support makes it easy to create a database instance. Here's a cheatsheet:

resource "aws_db_instance" "mydb1" {
  allocated_storage        = 256 # gigabytes
  backup_retention_period  = 7   # in days
  db_subnet_group_name     = "${var.rds_public_subnet_group}"
  engine                   = "postgres"
  engine_version           = "9.5.4"
  identifier               = "mydb1"
  instance_class           = "db.r3.large"
  multi_az                 = false
  name                     = "mydb1"
  parameter_group_name     = "mydbparamgroup1" # if you have tuned it
  password                 = "${trimspace(file("${path.module}/secrets/mydb1-password.txt"))}"
  port                     = 5432
  publicly_accessible      = true
  storage_encrypted        = true # you should always do this
  storage_type             = "gp2"
  username                 = "mydb1"
  vpc_security_group_ids   = ["${aws_security_group.mydb1.id}"]
}

Here's the security group you need:

resource "aws_security_group" "mydb1" {
  name = "mydb1"

  description = "RDS postgres servers (terraform-managed)"
  vpc_id = "${var.rds_vpc_id}"

  # Only postgres in
  ingress {
    from_port = 5432
    to_port = 5432
    protocol = "tcp"
    cidr_blocks = [""]
  }

  # Allow all outbound traffic.
  egress {
    from_port = 0
    to_port = 0
    protocol = "-1"
    cidr_blocks = [""]
  }
}

You can get these values from the EC2 console (don't forget them!):

variable "rds_vpc_id" {
  default = "vpc-XXXXXXXX"
  description = "Our default RDS virtual private cloud (rds_vpc)."
}

variable "rds_public_subnets" {
  default = "subnet-YYYYYYYY,subnet-YYYYYYYY,subnet-YYYYYYYY,subnet-YYYYYYYY"
  description = "The public subnets of our RDS VPC rds-vpc."
}

variable "rds_public_subnet_group" {
  default = "default-vpc-XXXXXXXX"
  description = "Apparently the group name, according to the RDS launch wizard."
}
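After terraform apply, you'll want the connection endpoint. A hypothetical output block (the output name is my choice) makes `terraform output` print it:

```hcl
output "mydb1_endpoint" {
  value = "${aws_db_instance.mydb1.endpoint}"
}
```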

How to convert a fixed-width file into CSV

Seamus Abshere on

This is part of our data science series. How predictive!

(The more valuable and massive a data set is, the less likely it's in a format you can just parse. Has anybody else noticed that?)

Here's how to convert a fixed-width file to CSV with the standard GNU unix tool gawk:


Thanks to stackoverflow: (reproducing verbatim)

gawk '$1=$1' OFS=, FIELDWIDTHS='4 2 5 1 1' infile > outfile.csv  

Where FIELDWIDTHS is a list of field widths and OFS is the output field separator.

Real life

In real life, fixed width files contain commas and double quotes.

# put this in a file called fixed2csv.awk
{
  for (i=1; i<=NF; i++) {
    gsub(/"/, "\"\"", $i)                     # escape embedded double quotes
    printf "\"%s\"%s", $i, (i<NF ? OFS : ORS)
  }
}

Then run it on your data:

gawk -f fixed2csv.awk OFS=, FIELDWIDTHS='4 2 5 1 1' infile > outfile.csv  

Thanks to Ed Morton on Stackoverflow for inspiration!
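For the Ruby-inclined, the same conversion can be sketched with String#unpack plus the standard CSV library, which handles quoting of embedded commas and quotes for you. The field widths are borrowed from the gawk example; the sample record is hypothetical:

```ruby
require 'csv'

# Widths from the gawk example: FIELDWIDTHS='4 2 5 1 1'
widths = [4, 2, 5, 1, 1]
spec = widths.map { |w| "A#{w}" }.join  # "A4A2A5A1A1"; "A" trims trailing spaces

line = "1234AB12345XY"                  # hypothetical fixed-width record
fields = line.unpack(spec)              #=> ["1234", "AB", "12345", "X", "Y"]

# CSV.generate_line quotes any field containing commas or double quotes:
puts CSV.generate_line(fields)          # 1234,AB,12345,X,Y
```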