Startup hacks and engineering miracles from your exhausted friends at Faraday

How to read CREATE TABLE SQL with pg_query

Seamus Abshere on

pg_query is a really cool Ruby library that uses Postgres's query parser to give you programmatic access to SQL queries.

One thing you can do with it is read a CREATE TABLE statement to map column names to column types:

sql = <<-SQL  
  CREATE TABLE example (
    my_text text,
    my_int int,
    my_float float
  )
SQL

require 'pg_query'  
types = PgQuery.parse(sql).tree[0].dig('RawStmt', 'stmt', 'CreateStmt', 'tableElts').map do |c|  
  c.fetch('ColumnDef')
end.inject({}) do |memo, c|  
  memo[c.fetch('colname')] = c.dig('typeName', 'TypeName', 'names').detect { |td| td.dig('String', 'str') != 'pg_catalog' }.dig('String', 'str')
  memo
end  

The output will be:

{
  "my_text"  => "text",
  "my_int"   => "int4",
  "my_float" => "float8"
}

Thanks to Lukas Fittl for this gem!

How to generate UUID ranges for SQL queries

Seamus Abshere on

So you've got a table with a UUID primary key. How do you search for only 1/4, or 1/8, or 1/128 of it?

BETWEEN '00000000-0000-0000-0000-000000000000' AND '3fffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '40000000-0000-0000-0000-000000000000' AND '7fffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '80000000-0000-0000-0000-000000000000' AND 'bfffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'c0000000-0000-0000-0000-000000000000' AND 'ffffffff-ffff-ffff-ffff-ffffffffffff'  

And how do you generate those ranges? Here it is in Ruby, with n=4:

UUID_MAX = 2**128  
def to_uuid(int)  
  memo = int.to_s(16)
  memo = memo.rjust(32, '0')
  [ 8, 13, 18, 23 ].each do |p|
    memo = memo.insert(p, '-')
  end
  memo
end  
def sql_uuid_ranges(n)  
  chunk = UUID_MAX / n
  (n+1).times.map do |i|
    [chunk*i, chunk*i-1]
  end.each_cons(2).map do |bottom, top|
    a1, _ = bottom
    _, b2 = top
    "BETWEEN '#{to_uuid(a1)}' AND '#{to_uuid(b2)}'"
  end
end  
puts sql_uuid_ranges(4)  

Note that we have to do some fancy stuff with chunk*i-1 because BETWEEN is inclusive on both sides. Not that that will cause a collision any time before the heat death of the universe, but still.

Here's the same thing, with n=64:

BETWEEN '00000000-0000-0000-0000-000000000000' AND '03ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '04000000-0000-0000-0000-000000000000' AND '07ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '08000000-0000-0000-0000-000000000000' AND '0bffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '0c000000-0000-0000-0000-000000000000' AND '0fffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '10000000-0000-0000-0000-000000000000' AND '13ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '14000000-0000-0000-0000-000000000000' AND '17ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '18000000-0000-0000-0000-000000000000' AND '1bffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '1c000000-0000-0000-0000-000000000000' AND '1fffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '20000000-0000-0000-0000-000000000000' AND '23ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '24000000-0000-0000-0000-000000000000' AND '27ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '28000000-0000-0000-0000-000000000000' AND '2bffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '2c000000-0000-0000-0000-000000000000' AND '2fffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '30000000-0000-0000-0000-000000000000' AND '33ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '34000000-0000-0000-0000-000000000000' AND '37ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '38000000-0000-0000-0000-000000000000' AND '3bffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '3c000000-0000-0000-0000-000000000000' AND '3fffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '40000000-0000-0000-0000-000000000000' AND '43ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '44000000-0000-0000-0000-000000000000' AND '47ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '48000000-0000-0000-0000-000000000000' AND '4bffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '4c000000-0000-0000-0000-000000000000' AND '4fffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '50000000-0000-0000-0000-000000000000' AND '53ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '54000000-0000-0000-0000-000000000000' AND '57ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '58000000-0000-0000-0000-000000000000' AND '5bffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '5c000000-0000-0000-0000-000000000000' AND '5fffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '60000000-0000-0000-0000-000000000000' AND '63ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '64000000-0000-0000-0000-000000000000' AND '67ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '68000000-0000-0000-0000-000000000000' AND '6bffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '6c000000-0000-0000-0000-000000000000' AND '6fffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '70000000-0000-0000-0000-000000000000' AND '73ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '74000000-0000-0000-0000-000000000000' AND '77ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '78000000-0000-0000-0000-000000000000' AND '7bffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '7c000000-0000-0000-0000-000000000000' AND '7fffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '80000000-0000-0000-0000-000000000000' AND '83ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '84000000-0000-0000-0000-000000000000' AND '87ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '88000000-0000-0000-0000-000000000000' AND '8bffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '8c000000-0000-0000-0000-000000000000' AND '8fffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '90000000-0000-0000-0000-000000000000' AND '93ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '94000000-0000-0000-0000-000000000000' AND '97ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '98000000-0000-0000-0000-000000000000' AND '9bffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '9c000000-0000-0000-0000-000000000000' AND '9fffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'a0000000-0000-0000-0000-000000000000' AND 'a3ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'a4000000-0000-0000-0000-000000000000' AND 'a7ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'a8000000-0000-0000-0000-000000000000' AND 'abffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'ac000000-0000-0000-0000-000000000000' AND 'afffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'b0000000-0000-0000-0000-000000000000' AND 'b3ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'b4000000-0000-0000-0000-000000000000' AND 'b7ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'b8000000-0000-0000-0000-000000000000' AND 'bbffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'bc000000-0000-0000-0000-000000000000' AND 'bfffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'c0000000-0000-0000-0000-000000000000' AND 'c3ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'c4000000-0000-0000-0000-000000000000' AND 'c7ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'c8000000-0000-0000-0000-000000000000' AND 'cbffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'cc000000-0000-0000-0000-000000000000' AND 'cfffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'd0000000-0000-0000-0000-000000000000' AND 'd3ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'd4000000-0000-0000-0000-000000000000' AND 'd7ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'd8000000-0000-0000-0000-000000000000' AND 'dbffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'dc000000-0000-0000-0000-000000000000' AND 'dfffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'e0000000-0000-0000-0000-000000000000' AND 'e3ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'e4000000-0000-0000-0000-000000000000' AND 'e7ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'e8000000-0000-0000-0000-000000000000' AND 'ebffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'ec000000-0000-0000-0000-000000000000' AND 'efffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'f0000000-0000-0000-0000-000000000000' AND 'f3ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'f4000000-0000-0000-0000-000000000000' AND 'f7ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'f8000000-0000-0000-0000-000000000000' AND 'fbffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'fc000000-0000-0000-0000-000000000000' AND 'ffffffff-ffff-ffff-ffff-ffffffffffff'  

You get the idea.

ActiveRecord ↔︎ Sequel raw sql cheat sheet

Seamus Abshere on

UPDATED with fixes from @jeremyevans

Going back and forth between ActiveRecord and Sequel can be very confusing... especially because the Sequel documentation assumes that you use "datasets" instead of raw sql.

What ActiveRecord Sequel
One value ActiveRecord::Base.connection.select_value(sql) DB[sql].get
One row ActiveRecord::Base.connection.select_row(sql) DB[sql].first
One column across many rows ActiveRecord::Base.connection.select_values(sql) DB[sql].select_map
Many rows as hashes ActiveRecord::Base.connection.select_all(sql) DB[sql].all
Quote value ActiveRecord::Base.connection.quote(value) DB.literal(value)
Quote identifier (e.g. table name, column name) ActiveRecord::Base.connection.quote_ident(ident) DB.quote_identifier(ident)

Please email us if you think of any other ones!

How to aggregate JSONB in PostgreSQL 9.5+

Seamus Abshere on

This is part of our series on PostgreSQL and things that are obvious once you see them. It's a 2017 update to our 2016 article How to merge JSON fields in Postgres Nice!

Update We used to call this jsonb_collect, but then we realized it was very similar to json_object_agg(name, value)... except that there is no version of that function that just takes an expression. So, copying jsonb_agg(expression), we give you...

How do you aggregate (aka merge aka combine aka collect) JSONB fields in Postgres 9.5+? You define a new aggregate jsonb_object_agg(expression):

CREATE AGGREGATE jsonb_object_agg(jsonb) (  
  SFUNC = 'jsonb_concat',
  STYPE = jsonb,
  INITCOND = '{}'
);

Here's how you use it:

# select * from greetings;
         data
-----------------------
 {"es":"Saludos"}
 {"en":"Hello"}
 {"ru":"Здравствуйте"}
(3 rows)

# select jsonb_object_agg(data) from greetings;
                        jsonb_object_agg
-------------------------------------------------------------
 { "es" : "Saludos", "en" : "Hello", "ru" : "Здравствуйте" }
(1 row)

If you're curious about the aggregate we just added, note that jsonb_concat(jsonb, jsonb) is the function backing || that was introduced in Postgres 9.5.

It's just that simple!

Use headless chromium with capybara and selenium webdriver - today!

Seamus Abshere on

UPDATE: we have a new version of this post out that resolves some of the gotchas below.

Here's a Rubyist's magic incantation to run headless chromium with selenium-webdriver and capybara: (it will be similar in other languages)

require 'selenium-webdriver'

Capybara.register_driver :headless_chromium do |app|  
  caps = Selenium::WebDriver::Remote::Capabilities.chrome(
    "chromeOptions" => {
      'binary' => '/home/myuser/chrome-linux-440004/chrome',
      'args' => ['headless', 'disable-gpu']
    }
  )
  driver = Capybara::Selenium::Driver.new(
    app,
    browser: :chrome,
    desired_capabilities: caps
  )
end  

Hints

  1. You need a chromium (chrome) binary that reports version 57 (version 59 is too new). For example, snapshot 440004 - just download and unzip.
  2. You need a recent chromedriver binary with support for headless chrome. For example, snapshot 460342 - just download and unzip.
  3. If you get the error unrecognized chrome version, then see (1) above - you probably have a too-recent chromium.

Thanks to @dkastner!

Antipattern: Using Ruby's Hash#[]

Seamus Abshere on

This is part of our antipatterns series. Ouch! Updated for 2017!

Ask yourself why you're using Hash#[]. It is is a great way to introduce silent bugs into your app.

Use Hash#fetch if you expect the value to exist

That way you get sensible error messages.

#> params = {}

#> params.fetch('really').fetch('important')
KeyError: key not found: "really"  

Use Hash#dig if you don't care

Because you don't get idiotic, non-semantic NoMethodError: undefined method '[]' for nil:NilClass errors.

#> params.dig('really', 'important')
=> nil

Avoid Hash#[] because... just... why?

#> params['really']['important']
NoMethodError: undefined method `[]' for nil:NilClass  

Special case: ENV

The Twelve-Factor App has us all using environment variables. But most of us default to ENV#[] to look stuff up... even if it's critical. Bad idea! Use fetch!

#> ENV['REALLY_IMPORTANT'] == 'thing'
=> false # well i hope you didn't need that

#> ENV.fetch('REALLY_IMPORTANT') == 'thing'
KeyError: key not found: "REALLY_IMPORTANT"  

scrubcsv: now with null value removal

Seamus Abshere on

This is part of our series on data science because it belongs in your toolchain. Happy Null Removal!

The latest version of scrubcsv has built-in null value removal:

$ cat a.csv
name,breed,age  
jerry,beagle,n/a  
tater,null,1

$ scrubcsv -n 'null|n/a' a.csv
name,breed,age  
jerry,beagle,  
tater,,1  

See how null and n/a went away?

Get the latest version with

$ cargo install scrubcsv -f

scrubcsv: clean CSVs, drop bad lines

Seamus Abshere on

This is part of our series on things that are obvious once you see them - and our data science series because it belongs in your toolchain.

Lies, damn lies, and commercial CSV export modules. Who wrote these things? On what planet would this be acceptable? Whatever.

Name,What's wrong  
"Robert "Bob" Smith",quotes inside quotes
Robert "Bob" Smith,quotes in the middle  
Robert Bob" Smith,unmatched quote  

Ruby dies immediately trying to read it:

$ irb
irb(main):001:0> require 'csv'  
=> true
irb(main):002:0> CSV.read('broken.csv')  
CSV::MalformedCSVError: Missing or stray quote in line 2  

Introducing scrubcsv, a is a lightning-fast static binary written in Rust that best-effort parses CSV and then immediately dumps back out 100% guaranteed standards-compliant CSV. Top speed? About 67mb/s.

$ scrubcsv broken.csv > fixed.csv
4 rows (0 bad) in 0.00 seconds, 787.13 KiB/sec

$ cat fixed.csv
Name,What's wrong  
"Robert Bob"" Smith""",quotes inside quotes
"Robert ""Bob"" Smith",quotes in the middle
"Robert Bob"" Smith",unmatched quote

It uses BurntSushi's world-beating CSV parser which is almost certainly faster than your SSD.

No MD5, SHA1, or SHA256 collisions for US addresses

Seamus Abshere on

I calculated hashes of single-family home addresses in the United States:

create table hashtest as (  
  select
    house_number_and_street,
    city,
    state,
    digest(upper(house_number_and_street || ',' || city || ',' || state), 'md5') as "md5",
    digest(upper(house_number_and_street || ',' || city || ',' || state), 'sha1') as "sha1",
    digest(upper(house_number_and_street || ',' || city || ',' || state), 'sha256') as "sha256"
  from households
)

E.g.,

=> select upper(house_number_and_street || ',' || city || ',' || state) "key", digest(upper(house_number_and_street || ',' || city || ',' || state), 'md5') "md5" from households limit 1;
             key               |                md5
-------------------------------+------------------------------------
 1024 PENINSULA DR,WESTWOOD,CA | \x511cdfb25d6b77d45742ed0407b5c2ef
(1 row)

Then I counted the distinct hashes:

=> select count(distinct md5) md5, count(distinct sha1) sha1, count(distinct sha256) sha256, count(*) from hashtest;
   md5    |   sha1   |  sha256  |  count
----------+----------+----------+----------
 78224992 | 78224992 | 78224992 | 81087108
(1 row)

Some of the addresses are repeated in the database because the APNs are identical, but the conclusion is that we have 78 million uniques and no hash collisions with the algorithms tested.

Open letter to Slack: Fitt's law and channel weeding

Seamus Abshere on

Update: They fixed this! Thanks so much!

I've got a beef with Slack and I back it up with Fitt's law:

This scientific law predicts that the time required to rapidly move to a target area is a function of the ratio between the distance to the target and the width of the target.

Here's the beef:

it's hard to choose what messages to delete

As a remote team manager, I try to weed our Slack channels by deleting bot explosions and repetitive error dumps.

  1. Obey Fitt's law and let me click anywhere on the message to select it.
  2. Let me specify a regex (e.g. Unknown account) that will mass-select messages for deletion.

Huzzah!