Startup hacks and engineering miracles from your exhausted friends at Faraday

How to do histograms in PostgreSQL

Bill Morris

This post is part of our data science and PostgreSQL series.

We wanted an easy way to look at data distribution from the command line, so we adapted this excellent piece by Dmitri Fontaine and turned it into a function.

It can be invoked like this:

SELECT * FROM histogram('column_name', 'table_name');  

. . . to give sweet results like this, in a check of the distribution of 2016 political contributions in Vermont:

fec=# SELECT * FROM histogram('transaction_amt', 'small_donors_vt');

 bucket |   range   | freq |       bar       
--------+-----------+------+-----------------
      1 | [0,9]     | 2744 | ******
      2 | [10,19]   | 5630 | *************
      3 | [20,29]   | 6383 | ***************
      4 | [30,39]   | 1290 | ***
      5 | [40,49]   |  369 | *
      6 | [50,59]   | 3541 | ********
      7 | [60,69]   |  174 | 
      8 | [70,79]   |  313 | *
      9 | [80,89]   |  171 | 
     10 | [90,99]   |   65 | 
     11 | [100,109] | 2363 | ******
     12 | [110,119] |   51 | 
     13 | [120,129] |  115 | 
     14 | [130,139] |   32 | 
     15 | [140,146] |   11 | 
     16 | [150,159] |  187 | 
     17 | [160,169] |   24 | 
     18 | [170,177] |   33 | 
     19 | [180,189] |   19 | 
     20 | [191,199] |   24 | 
     21 | [200,200] |  795 | **

Use it yourself by adding this to your Postgres setup:

DROP FUNCTION IF EXISTS histogram(text,regclass);  
CREATE OR REPLACE FUNCTION histogram(columnname text, tablename regclass)  
RETURNS TABLE(bucket int, "range" numrange, freq bigint, bar text)  
AS $func$  
BEGIN  
RETURN QUERY EXECUTE format('WITH  
  min_max AS (
    SELECT min(%s) AS min, max(%s) AS max FROM %s
  ),
  histogram AS (
    SELECT
      width_bucket(%s, min_max.min, min_max.max, 20) AS bucket,
      numrange(min(%s)::numeric, max(%s)::numeric, ''[]'') AS "range",
      count(%s) AS freq
    FROM %s, min_max
    WHERE %s IS NOT NULL
    GROUP BY bucket
    ORDER BY bucket
  )
  SELECT
    bucket,
    "range",
    freq::bigint,
    repeat(''*'', (freq::float / (max(freq) over() + 1) * 15)::int) AS bar
  FROM histogram', 
  columnname,
  columnname,
  tablename,
  columnname,
  columnname,
  columnname,
  columnname,
  tablename,
  columnname
  );
END  
$func$ LANGUAGE plpgsql;
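Since the function takes a table name (a regclass, which also accepts views), you can histogram a filtered subset by pointing it at a view. A minimal sketch, reusing the table from the example above:

-- Hypothetical: histogram over a filtered subset, via a temporary view
CREATE TEMP VIEW donors_over_50 AS
  SELECT * FROM small_donors_vt WHERE transaction_amt > 50;

SELECT * FROM histogram('transaction_amt', 'donors_over_50');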

Happy querying!

How to preview PostGIS maps on your command line

Bill Morris

This is part of our practical cartography and PostgreSQL series. Put a map on it!

Sometimes it's a pain to open up QGIS and load a PostGIS-enabled DB. Sometimes I don't feel like writing a custom tileserver and hooking it up to Leaflet or Mapbox GL just so I can see if my map looks right.

Sometimes I use the psql command line and a nifty tool by Morgan Herlocker called "geotype" to view my map data.

npm install -g geotype  

. . . which enables fast and simple maps like this:

[terminal map: New York state]

Yep. That's New York, alright.

[terminal map: District of Columbia]

. . . and that sure looks like the population distribution of the District of Columbia.

These maps are nothing to show to customers, but they make QA/QC a lot easier. Here's the syntax, piping psql output directly into geotype:

psql $DB_URL -t -c "SELECT ST_AsGeoJSON(ST_Collect(the_geom)) FROM mytable" | geotype  

(The -t flag strips psql's headers and row-count footer, and ST_Collect() merges the geometries into a single GeoJSON object, which is the kind of input geotype can read.)
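On a huge table, collecting every geometry can take a while, and a sample is usually enough for an eyeball check. A sketch (same hypothetical table and column as above; the LIMIT is arbitrary):

# Preview a 1000-row sample instead of the whole table
psql $DB_URL -t -c "SELECT ST_AsGeoJSON(ST_Collect(the_geom)) FROM (SELECT the_geom FROM mytable LIMIT 1000) AS sample" | geotype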

Happy mapping!

Good, strict default warnings for Rust code with Clippy

Eric Kidd

Do you maintain a project written in Rust? Would you like to enable really aggressive warnings to make code review easier?

We can set this up using a combination of Rust's built-in warnings and the excellent Clippy package. For best results with Clippy, it's easiest to install the nightly release of Rust:

curl https://sh.rustup.rs -sSf | sh  
rustup update nightly  

You can then update your Cargo.toml file as follows:

[features]
unstable = ["clippy"]

[dependencies]
clippy = { version = "0.0.*", optional = true }  

Near the top of your project's src/lib.rs file, add this enormous list of warnings:

// Enable clippy if our Cargo.toml file asked us to do so.
#![cfg_attr(feature="clippy", feature(plugin))]
#![cfg_attr(feature="clippy", plugin(clippy))]

// Enable as many useful Rust and Clippy warnings as we can stand.  We'd
// also enable `trivial_casts`, but we're waiting for
// https://github.com/rust-lang/rust/issues/23416.
#![warn(missing_copy_implementations,
        missing_debug_implementations,
        missing_docs,
        trivial_numeric_casts,
        unsafe_code,
        unused_extern_crates,
        unused_import_braces,
        unused_qualifications)]
#![cfg_attr(feature="clippy", warn(cast_possible_truncation))]
#![cfg_attr(feature="clippy", warn(cast_possible_wrap))]
#![cfg_attr(feature="clippy", warn(cast_precision_loss))]
#![cfg_attr(feature="clippy", warn(cast_sign_loss))]
#![cfg_attr(feature="clippy", warn(missing_docs_in_private_items))]
#![cfg_attr(feature="clippy", warn(mut_mut))]
// Disallow `println!`. Use `debug!` for debug output
// (which is provided by the `log` crate).
#![cfg_attr(feature="clippy", warn(print_stdout))]
// This allows us to use `unwrap` on `Option` values (because doing so
// makes working with Regex matches much nicer) and when compiling in
// test mode (because using `unwrap` in tests is idiomatic).
#![cfg_attr(all(not(test), feature="clippy"), warn(result_unwrap_used))]
#![cfg_attr(feature="clippy", warn(unseparated_literal_suffix))]
#![cfg_attr(feature="clippy", warn(wrong_pub_self_convention))]

For documentation about individual warnings, run rustc -W help and consult the Clippy lint list; between them they cover every available warning and explain what each one does.
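If one of these lints misfires in a single spot, you can relax it for that item alone instead of dropping it from the list. A sketch using the same plugin-era attribute style as above (dump_debug_info is a hypothetical function):

// Allow one Clippy lint on one item, not project-wide.
#[cfg_attr(feature="clippy", allow(print_stdout))]
fn dump_debug_info() {
    // This println! is deliberate, so `print_stdout` is allowed here only.
    println!("debug dump");
}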

If you also want to turn warnings into compilation errors, you can add:

// Fail hard on warnings.  This will be automatically disabled when we're
// used as a dependency by other crates, thanks to Cargo magic.
#![deny(warnings)]

To see your new warnings, run:

rustup run nightly cargo build --features unstable  

This will compile your code using the nightly build of Rust, activating the unstable feature that we defined in Cargo.toml, which will in turn activate the clippy feature. Surprisingly, even with all these warnings activated, I see very few false positives in practice.

Because Clippy uses nightly Rust, you may occasionally get compilation errors. If this happens, get the latest nightly build and the latest version of Clippy:

rustup update nightly  
cargo update  

If this still doesn't work, just leave off --features unstable and try again tomorrow.

How to convert a fixed-width file into CSV

Seamus Abshere

This is part of our data science series. How predictive!

(The more valuable and massive a data set is, the less likely it's in a format you can just parse. Has anybody else noticed that?)

Here's how to convert a fixed-width file to CSV with gawk, the GNU implementation of the standard unix tool awk:

Theoretical

Thanks to Stack Overflow (reproduced verbatim):

gawk '$1=$1' OFS=, FIELDWIDTHS='4 2 5 1 1' infile > outfile.csv  

Here FIELDWIDTHS is a space-separated list of field widths and OFS is the output field separator.
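For instance, a quick sanity check on one made-up record:

echo '1234AB55555XY' | gawk '$1=$1' OFS=, FIELDWIDTHS='4 2 5 1 1'
# => 1234,AB,55555,X,Y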

Real life

In real life, fixed-width files contain commas and double quotes, so every field needs to be quoted and every embedded quote escaped.

# put this in a file called fixed2csv.awk
{
  for (i=1; i<=NF; i++) {
    sub(/\s+$/, "", $i)                        # trim the fixed-width padding
    gsub(/"/, "\"\"", $i)                      # escape every embedded quote, CSV-style
    printf "\"%s\"%s", $i, (i<NF ? OFS : ORS)  # quote each field
  }
}

Then run it on your data:

gawk -f fixed2csv.awk OFS=, FIELDWIDTHS='4 2 5 1 1' infile > outfile.csv  

Thanks to Ed Morton on Stack Overflow for the inspiration!

How to permanently delete versioned objects from S3

Seamus Abshere

This is part of our cloud security and things that are obvious once you see them series. Duhh... safe!

Amazon's explanation of deleting a versioned object and the SDK documentation don't actually show how to permanently delete one. Here's how to do it:

require 'aws-sdk'

s3 = Aws::S3::Resource.new(  
  region: 'us-east-1',
  access_key_id: ACCESS_KEY_ID,
  secret_access_key: SECRET_ACCESS_KEY
)
bucket = s3.bucket('my-versioned-bucket')

bucket.objects.each do |object_summary|  
  o = bucket.object object_summary.key
  # this is the secret: specify the version while deleting
  o.delete version_id: o.version_id
end  

If you don't specify the version, you just get a delete marker, which you can delete as many times as you like and it will never go away :)
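Note that the loop above iterates current objects, so each pass removes the latest version of every key. To wipe all versions (and any lingering delete markers) in one sweep, the resource API can iterate versions directly; a minimal sketch, assuming aws-sdk v2:

bucket.object_versions.each do |version|
  # each entry is a specific version (or delete marker), so this is permanent
  version.delete
end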

How to enable S3 server-side encryption for existing objects

Seamus Abshere

This is part of our cloud security series.

Do you have unencrypted S3 objects lying around? Don't! Here's the safe way to retroactively enable server-side encryption:

Step 1: Make a backup bucket

The AWS management console is easiest. Call it [my-bucket]-backup.

Step 2: Copy one way

require 'aws-sdk'

s3 = Aws::S3::Resource.new(region: 'us-east-1', access_key_id: ACCESS_KEY_ID, secret_access_key: SECRET_ACCESS_KEY)  
b1 = s3.bucket('my-bucket')  
b2 = s3.bucket('my-bucket-backup')

# or no prefix if you want everything
b1.objects(prefix: 'xyz').each do |object_summary|  
  o1 = b1.object object_summary.key
  o2 = b2.object object_summary.key
  o1.copy_to o2, server_side_encryption: 'AES256'
end  

Step 3: Sanity check

Now look at [my-bucket]-backup - it's probably 100% perfect, but just reassure yourself.
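If eyeballing isn't enough, you can ask S3 directly whether a copy came out encrypted; a quick sketch (the key name is hypothetical):

puts b2.object('some/key.csv').server_side_encryption
# => "AES256"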

Step 4: Copy back over

There are two changes here (the collection you iterate and the direction of the copy), so you might want to copy-paste rather than edit Step 2 by hand:

b2.objects.each do |object_summary|  
  o1 = b1.object object_summary.key
  o2 = b2.object object_summary.key
  o2.copy_to o1, server_side_encryption: 'AES256'
end  

Step 5: (optional) Clean up

Delete [my-bucket]-backup.

Postgres strftime (or: how to group by month)

Seamus Abshere

This post is part of our data science and PostgreSQL series.

"How do you do strftime in postgres?"

The answer: to_char(date, format).

If you want to group by month, this is what you're looking for:

psql=> select count(*), to_char(created_at, 'YYYY-MM') from employees group by to_char(created_at, 'YYYY-MM') order by to_char(created_at,'YYYY-MM') desc;  
 count | to_char
-------+---------
    27 | 2016-08
    32 | 2016-07
    58 | 2016-06
    17 | 2016-05
    57 | 2016-04
    44 | 2016-03
    28 | 2016-02
    45 | 2016-01
    10 | 2015-12
    10 | 2015-11
    24 | 2015-10
    15 | 2015-09
    32 | 2015-08
    38 | 2015-07
    31 | 2015-06
    18 | 2015-05
    19 | 2015-04
     5 | 2015-03
     8 | 2015-02
    10 | 2015-01
     7 | 2014-12
    22 | 2014-11
(22 rows)
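If you'd rather group by a real timestamp instead of a string (handy for charting or date arithmetic), date_trunc does the same job:

select count(*), date_trunc('month', created_at) as month
from employees
group by month
order by month desc;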

That's it.

Cache headers served by Google, Facebook, AWS in 2016

Seamus Abshere

Recently I realized that our single-page app was misbehaving: our JS and CSS had cache-busting digests in the URLs, but browsers were caching the underlying index.html.

What do the big apps send?

 Service                    | Cache-Control                                  | Expires                       | Pragma
----------------------------+------------------------------------------------+-------------------------------+----------
 AWS management console     | no-cache, no-store, must-revalidate            | -1                            | no-cache
 Facebook                   | private, no-cache, no-store, must-revalidate   | Sat, 01 Jan 2000 00:00:00 GMT | no-cache
 Google (the search engine) | private, max-age=0                             | -1                            |
 Gmail                      | no-cache, no-store, max-age=0, must-revalidate | Mon, 01 Jan 1990 00:00:00 GMT | no-cache
 Google Drive               | no-cache, no-store, max-age=0, must-revalidate | Mon, 01 Jan 1990 00:00:00 GMT | no-cache
 Intercom                   | max-age=0, private, must-revalidate            |                               |
 LinkedIn                   | no-cache, no-store                             | Thu, 01 Jan 1970 00:00:00 GMT | no-cache

Implementation in nginx

In the end, I just used nginx's expires directive with its special epoch value:

expires epoch;  

That produced

Cache-Control: no-cache  
Expires: Thu, 01 Jan 1970 00:00:01 GMT  
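Only index.html actually needs to be revalidated; the digest-stamped JS and CSS can be cached forever. A hypothetical scoping, assuming the assets are served from /assets/:

location = /index.html {
  expires epoch;
}

location /assets/ {
  expires max;
}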

:D

Mountains of Census geodata for all

Bill Morris

U.S. Census data gives our modeling a good predictive boost, and it's a robust quality assurance tool for all the third-party data we've got flowing through our wires.

The Census offers its geographic data in easy-to-get, familiar formats via the TIGER portal, but the largest datasets (blocks and block groups) are distributed state by state. There's a pretty simple reason for this: they're big. The census block shapefile for Indiana alone is 116MB compressed.

[image: census geodata along the East Coast]

Ours is probably not a common use case, but we need all of the blocks and block groups in our database - merged, indexed and queryable. It took a significant amount of work to get them there, so in case anyone else needs them too, we're sharing national 2015 datasets in PostGIS dumpfile format, downloadable and ready to use here:


Census block groups: .pg_dump (426MB) | .sql (1.2GB)

Census blocks: .pg_dump (4.7GB) | .sql (12GB)


Add these to your local PostgreSQL database like so:

pg_restore --no-owner --no-privileges --dbname <dbname> <filename>.pg_dump

# OR

psql <dbname> -f <filename>.sql  

To keep things simple, these are just geometries and GeoIDs (CREATE TABLE schemas can be perused here). Detailed analysis will require joining attributes separately.
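Once restored, a quick per-state rollup makes a decent smoke test, since each GeoID starts with the two-digit state FIPS code. A sketch (the table and column names here are guesses; check the schemas linked above):

SELECT substring(geoid from 1 for 2) AS state_fips, count(*)
FROM blocks
GROUP BY state_fips
ORDER BY count(*) DESC;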

Side note: I can't recommend censusreporter.org enough for census-based sanity checks.

Happy mapping!