Startup hacks and engineering miracles from your exhausted friends at Faraday

How to read CREATE TABLE SQL with pg_query

Seamus Abshere on

pg_query is a really cool Ruby library that uses Postgres's query parser to give you programmatic access to SQL queries.

One thing you can do with it is read a CREATE TABLE statement to map column names to column types:

sql = <<-SQL  
  CREATE TABLE example (
    my_text text,
    my_int int,
    my_float float
  )
SQL

require 'pg_query'  
types = PgQuery.parse(sql).tree[0].dig('RawStmt', 'stmt', 'CreateStmt', 'tableElts').map do |c|  
  c.fetch('ColumnDef')
end.inject({}) do |memo, c|  
  memo[c.fetch('colname')] = c.dig('typeName', 'TypeName', 'names').detect { |td| td.dig('String', 'str') != 'pg_catalog' }.dig('String', 'str')
  memo
end  

The output will be:

{
  "my_text"  => "text",
  "my_int"   => "int4",
  "my_float" => "float8"
}

Thanks to Lukas Fittl for this gem!

Geochunk: fast, intelligent splitting for piles of address data

Bill Morris on

aurora

This post is part of our practical cartography and data science series.

The problem: you want to split up a few million U.S. address records into equally-sized chunks that retain spatial hierarchy. You want to do this without anything other than a street address (geocoding is expensive!). Maybe you want to do this as part of a map/reduce process (we certainly do), maybe you want to do some sampling, who knows?

The solution: Muthaflippin' Geochunk

Anyone who's ever used U.S. ZIP codes as a way to subdivide datasets can tell you: 60608 (pop 79,607) is a totally different beast than 05851 (pop 525). They're not census tracts; it's not really appropriate to compare them statistically or thematically.

Our solution - largely the work of platform wizard and Rust enthusiast Eric Kidd - is to bake census data into a tool that does the splitting for you at a level that allows for easy comparison. More specifically:

It provides a deterministic mapping from zip codes to "geochunks" that you can count on remaining stable.

Check out the Jupyter notebook that explains the algorithm in detail, but it works like so:

Install

Install rust first if you don't have it:

curl https://sh.rustup.rs -sSf | sh  

. . . then geochunk, using the rust package manager:

cargo install geochunk  

. . . or install from one of the prepackaged binaries.

Use 1: Indexing

Build a table that assigns every U.S. zipcode to a geochunk that contains 250,000 people:

geochunk export zip2010 250000 > chunks_of_250k_people.csv  

Use 2: List processing

Alternately, let's try a pipeline example that uses geochunk csv: say you want to parallel-process every address in the state of Colorado, and you need equal-size but contiguous slices to do it.

wget -c https://s3.amazonaws.com/data.openaddresses.io/runs/283082/us/co/statewide.zip && unzip statewide.zip  
  • Pipe the full file through geochunk, into slices of about 250,000 people each:
cat us/co/statewide.csv | geochunk csv zip2010 250000 POSTCODE > statewide_chunks_150k.csv  

. . . and now you have 2 million addresses, chopped into ~8 equally-sized slices with rough contiguity:

denver

Geochunk works on this scale in 1.38s (Have you heard us evangelizing about Rust yet?), leaving you plenty of time for the real processing.

This tool is serious dogfood for us; it's baked into our ETL system, and we use it to try making a tiny dent in the Modifiable Areal Unit Problem. We hope you'll find it useful too.

How to generate UUID ranges for SQL queries

Seamus Abshere on

So you've got a table with a UUID primary key. How do you search for only 1/4, or 1/8, or 1/128 of it?

BETWEEN '00000000-0000-0000-0000-000000000000' AND '3fffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '40000000-0000-0000-0000-000000000000' AND '7fffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '80000000-0000-0000-0000-000000000000' AND 'bfffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'c0000000-0000-0000-0000-000000000000' AND 'ffffffff-ffff-ffff-ffff-ffffffffffff'  

And how do you generate those ranges? Here it is in Ruby, with n=4:

UUID_MAX = 2**128  
def to_uuid(int)  
  memo = int.to_s(16)
  memo = memo.rjust(32, '0')
  [ 8, 13, 18, 23 ].each do |p|
    memo = memo.insert(p, '-')
  end
  memo
end  
def sql_uuid_ranges(n)  
  chunk = UUID_MAX / n
  (n+1).times.map do |i|
    [chunk*i, chunk*i-1]
  end.each_cons(2).map do |bottom, top|
    a1, _ = bottom
    _, b2 = top
    "BETWEEN '#{to_uuid(a1)}' AND '#{to_uuid(b2)}'"
  end
end  
puts sql_uuid_ranges(4)  

Note that we have to do some fancy stuff with chunk*i-1 because BETWEEN is inclusive on both sides. Not that that will cause a collision any time before the heat death of the universe, but still.

Here's the same thing, with n=64:

BETWEEN '00000000-0000-0000-0000-000000000000' AND '03ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '04000000-0000-0000-0000-000000000000' AND '07ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '08000000-0000-0000-0000-000000000000' AND '0bffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '0c000000-0000-0000-0000-000000000000' AND '0fffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '10000000-0000-0000-0000-000000000000' AND '13ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '14000000-0000-0000-0000-000000000000' AND '17ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '18000000-0000-0000-0000-000000000000' AND '1bffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '1c000000-0000-0000-0000-000000000000' AND '1fffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '20000000-0000-0000-0000-000000000000' AND '23ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '24000000-0000-0000-0000-000000000000' AND '27ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '28000000-0000-0000-0000-000000000000' AND '2bffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '2c000000-0000-0000-0000-000000000000' AND '2fffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '30000000-0000-0000-0000-000000000000' AND '33ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '34000000-0000-0000-0000-000000000000' AND '37ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '38000000-0000-0000-0000-000000000000' AND '3bffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '3c000000-0000-0000-0000-000000000000' AND '3fffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '40000000-0000-0000-0000-000000000000' AND '43ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '44000000-0000-0000-0000-000000000000' AND '47ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '48000000-0000-0000-0000-000000000000' AND '4bffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '4c000000-0000-0000-0000-000000000000' AND '4fffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '50000000-0000-0000-0000-000000000000' AND '53ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '54000000-0000-0000-0000-000000000000' AND '57ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '58000000-0000-0000-0000-000000000000' AND '5bffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '5c000000-0000-0000-0000-000000000000' AND '5fffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '60000000-0000-0000-0000-000000000000' AND '63ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '64000000-0000-0000-0000-000000000000' AND '67ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '68000000-0000-0000-0000-000000000000' AND '6bffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '6c000000-0000-0000-0000-000000000000' AND '6fffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '70000000-0000-0000-0000-000000000000' AND '73ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '74000000-0000-0000-0000-000000000000' AND '77ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '78000000-0000-0000-0000-000000000000' AND '7bffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '7c000000-0000-0000-0000-000000000000' AND '7fffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '80000000-0000-0000-0000-000000000000' AND '83ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '84000000-0000-0000-0000-000000000000' AND '87ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '88000000-0000-0000-0000-000000000000' AND '8bffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '8c000000-0000-0000-0000-000000000000' AND '8fffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '90000000-0000-0000-0000-000000000000' AND '93ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '94000000-0000-0000-0000-000000000000' AND '97ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '98000000-0000-0000-0000-000000000000' AND '9bffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN '9c000000-0000-0000-0000-000000000000' AND '9fffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'a0000000-0000-0000-0000-000000000000' AND 'a3ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'a4000000-0000-0000-0000-000000000000' AND 'a7ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'a8000000-0000-0000-0000-000000000000' AND 'abffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'ac000000-0000-0000-0000-000000000000' AND 'afffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'b0000000-0000-0000-0000-000000000000' AND 'b3ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'b4000000-0000-0000-0000-000000000000' AND 'b7ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'b8000000-0000-0000-0000-000000000000' AND 'bbffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'bc000000-0000-0000-0000-000000000000' AND 'bfffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'c0000000-0000-0000-0000-000000000000' AND 'c3ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'c4000000-0000-0000-0000-000000000000' AND 'c7ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'c8000000-0000-0000-0000-000000000000' AND 'cbffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'cc000000-0000-0000-0000-000000000000' AND 'cfffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'd0000000-0000-0000-0000-000000000000' AND 'd3ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'd4000000-0000-0000-0000-000000000000' AND 'd7ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'd8000000-0000-0000-0000-000000000000' AND 'dbffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'dc000000-0000-0000-0000-000000000000' AND 'dfffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'e0000000-0000-0000-0000-000000000000' AND 'e3ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'e4000000-0000-0000-0000-000000000000' AND 'e7ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'e8000000-0000-0000-0000-000000000000' AND 'ebffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'ec000000-0000-0000-0000-000000000000' AND 'efffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'f0000000-0000-0000-0000-000000000000' AND 'f3ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'f4000000-0000-0000-0000-000000000000' AND 'f7ffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'f8000000-0000-0000-0000-000000000000' AND 'fbffffff-ffff-ffff-ffff-ffffffffffff'  
BETWEEN 'fc000000-0000-0000-0000-000000000000' AND 'ffffffff-ffff-ffff-ffff-ffffffffffff'  

You get the idea.

ActiveRecord ↔︎ Sequel raw sql cheat sheet

Seamus Abshere on

UPDATED with fixes from @jeremyevans

Going back and forth between ActiveRecord and Sequel can be very confusing... especially because the Sequel documentation assumes that you use "datasets" instead of raw sql.

What ActiveRecord Sequel
One value ActiveRecord::Base.connection.select_value(sql) DB[sql].get
One row ActiveRecord::Base.connection.select_row(sql) DB[sql].first
One column across many rows ActiveRecord::Base.connection.select_values(sql) DB[sql].select_map
Many rows as hashes ActiveRecord::Base.connection.select_all(sql) DB[sql].all
Quote value ActiveRecord::Base.connection.quote(value) DB.literal(value)
Quote identifier (e.g. table name, column name) ActiveRecord::Base.connection.quote_ident(ident) DB.quote_identifier(ident)

Please email us if you think of any other ones!

Customer Success, cookies, and diminishing returns: don't announce features all at once

Thomas Bryenton on

Chocolate chip cookies

Right this moment, how much would you pay for a cookie at a coffeeshop? Let's say the amount is $3.

picture_of_cookies

(I could really go for one, so my personal estimate might be higher. I mean, check out how good that looks.)

How much would you be open to paying the coffeeshop for a subsequent cookie? A third cookie? A fifth?

Probably it would be steadily less than $3. Why? Later cookies have lower value.

Well . . . for now at least. If you come back to the coffeeshop tomorrow, you could be ready for more!

Customer Success

A related principle applies in Customer Success: people love hearing when you ship a feature they wanted. But tell them about a third feature or a fifth? Their eyes will glaze right over.

(Speaking of glaze, if cookies aren't your thing, how about these glazed scones?)

scone_picture

Sure, sometimes an engineering team works in bursts. Progress can come all at once. Then, the temptation for Customer Success is to "catch up" by reporting things to clients exactly when they happen.

This doesn't mean that as a Customer Success team you can't be strategic about who to announce features to, and when.

So:

  • Notice each client's rhythm of hunger for new features
  • Keep a stash of "feature snacks" to give clients
  • Treat yourself to a literal or figurative cookie (or scone)

How to aggregate JSONB in PostgreSQL 9.5+

Seamus Abshere on

This is part of our series on PostgreSQL and things that are obvious once you see them. It's a 2017 update to our 2016 article How to merge JSON fields in Postgres Nice!

Update We used to call this jsonb_collect, but then we realized it was very similar to json_object_agg(name, value)... except that there is no version of that function that just takes an expression. So, copying jsonb_agg(expression), we give you...

How do you aggregate (aka merge aka combine aka collect) JSONB fields in Postgres 9.5+? You define a new aggregate jsonb_object_agg(expression):

CREATE AGGREGATE jsonb_object_agg(jsonb) (  
  SFUNC = 'jsonb_concat',
  STYPE = jsonb,
  INITCOND = '{}'
);

Here's how you use it:

# select * from greetings;
         data
-----------------------
 {"es":"Saludos"}
 {"en":"Hello"}
 {"ru":"Здравствуйте"}
(3 rows)

# select jsonb_object_agg(data) from greetings;
                        jsonb_object_agg
-------------------------------------------------------------
 { "es" : "Saludos", "en" : "Hello", "ru" : "Здравствуйте" }
(1 row)

If you're curious about the aggregate we just added, note that jsonb_concat(jsonb, jsonb) is the function backing || that was introduced in Postgres 9.5.

It's just that simple!

Be not afraid of ZCTAs

Bill Morris on

This post is part of our practical cartography series.

Most American geographers will note that - as much as we'd like it to be otherwise - ZIP Codes are not polygons. Rather, they're constantly-changing lines used by the USPS to coordinate delivery in an efficient network. Many of us polygon-happy mappers use ZIP Code Tabulation Areas (ZCTAs) instead; these are provided by the US Census as a reasonable open data alternative to ZIPs. They're particularly nice for thematic mapping (though their shortcomings have also been well-documented):

map

But why use ZCTAs if they can never be reconciled with their ground-truth ZIP cousins?

Because the difference is small.

Faraday has address and location records for every household in the country, and it was straightforward to check for disagreement between the ZIP Code of each physical address and the ZCTA polygon that contains it.

Here are the results, broken down by state

The national error rate of ZCTAs is 1.4%. That might be too high for some use cases, but perfectly acceptable for others. There's some regional variation, too: you're usually safe to use ZCTAs in Hawaii and Maine, but might want to exercise caution in Oregon and Utah.

Happy mapping!

How to finally use headless Chrome to power your automated tests

Derek Kastner on

Google Chrome version 59 will ship with the headless
option
. This means you can test your web applications using chrome without needing xvfb. One problem: the latest chromedriver (version 2.29) doesn't support versions of Chrome higher than 58.

The solution is to build the latest chromedriver that supports the latest chrome/chromium. Google does not make nightly builds of chromedriver public, you have to download the chromium source and build
chromedriver yourself.

How all the pieces work together

Cucumber uses the capybara gem to send commands to selenium-webdriver. Selenium-webdriver in turn starts up your local copy of chromedriver, which then starts up chrome and controls the browser through a special debug port.

To get the latest chromium, I used a tool on GitHub that downloads the latest snapshot compiled for Linux.

To get the latest chromedriver, I followed the
build instructions.

Configuring the tests

When configuring capybara, you need to tell selenium-webdriver the path to your custom chromium binary and send the --headless flag, along with other flags
you'll likely need in a CI build node environment.

For running in docker:

Capybara.register_driver :headless_chromium do |app|  
  caps = Selenium::WebDriver::Remote::Capabilities.chrome(                       
    "chromeOptions" => {                                                         
      'binary' => "/chromium-latest-linux/466395/chrome-linux/chrome",           
      'args' => %w{headless no-sandbox disable-gpu}                              
    }                                                                            
  )                                                                              
  driver = Capybara::Selenium::Driver.new(                                       
    app,                                                                         
    browser: :chrome,                                                            
    desired_capabilities: caps                                                   
  )                                                                              
end                                                                              

Capybara.default_driver = :headless_chromium  

Now, capybara will drive a headless chromium!

Two magic words for greater independence and communication

Thomas Bryenton on

What if there were two words you could add to any email to get your team to weigh in quickly?

There are: DEFAULT DO

How default-do works

  1. Write up a final version of what you'll be doing (your "default do")
  2. Tell teammates you're about to do this thing
  3. Do the thing

Just default do it

I use default-do every day to keep Faraday moving:

  • Emails
  • Code
  • Mockups
  • Decisions
  • Policies
  • Blog posts!

It doesn't matter what it is: if I'm confident it's the right thing to do, I'll tell my team I'm about to do it, pause, then just do it.

Don't wear it out. If you don't really want feedback, don't ask, and if you need it, your default isn't "do," it's "don't."

Use headless chromium with capybara and selenium webdriver - today!

Seamus Abshere on

UPDATE: we have a new version of this post out that resolves some of the gotchas below.

Here's a Rubyist's magic incantation to run headless chromium with selenium-webdriver and capybara: (it will be similar in other languages)

require 'selenium-webdriver'

Capybara.register_driver :headless_chromium do |app|  
  caps = Selenium::WebDriver::Remote::Capabilities.chrome(
    "chromeOptions" => {
      'binary' => '/home/myuser/chrome-linux-440004/chrome',
      'args' => ['headless', 'disable-gpu']
    }
  )
  driver = Capybara::Selenium::Driver.new(
    app,
    browser: :chrome,
    desired_capabilities: caps
  )
end  

Hints

  1. You need a chromium (chrome) binary that reports version 57 (version 59 is too new). For example, snapshot 440004 - just download and unzip.
  2. You need a recent chromedriver binary with support for headless chrome. For example, snapshot 460342 - just download and unzip.
  3. If you get the error unrecognized chrome version, then see (1) above - you probably have a too-recent chromium.

Thanks to @dkastner!