Startup hacks and engineering miracles from your exhausted friends at Faraday

How to use Postgres cursors and multi-insert in Ruby with Sequel

Seamus Abshere

This post is part of our PostgreSQL series.

SQL cursors are a great way to stream data out of a database without loading it all into memory.

Reading using a cursor

Sequel has use_cursor:

BATCH_SIZE = 1000
DB[:widgets].
  select(:id, :name, :power).
  use_cursor(rows_per_fetch: BATCH_SIZE).each do |row|
    row[:id] # whatever
  end
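Under the hood, use_cursor is just Postgres's own cursor machinery: DECLARE once, then FETCH in batches inside a transaction. A bare-SQL sketch of what Sequel is managing for you (query and batch size are just examples):

BEGIN;

DECLARE widgets_cur CURSOR FOR
  SELECT id, name, power FROM widgets;

-- repeat until a fetch comes back empty
FETCH FORWARD 1000 FROM widgets_cur;

CLOSE widgets_cur;
COMMIT;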

Writing using cursors and multi-insert

Here's a real-world example: we had to copy rows from widgets to widgets_copy.

We thought we were being really clever, keeping all the data inside Postgres:

# NOTE: this is not the recommended method; see below for what went wrong
batches = (desired_count.to_f / BATCH_SIZE).ceil
batches.times do |i|
  offset = i * BATCH_SIZE
  DB.run <<-SQL
    INSERT INTO widgets_copy (
      id,
      name,
      power,
      created_at,
      updated_at
    ) (
      SELECT
        uuid_generate_v4(),
        name,
        power,
        now(),
        now()
      FROM widgets
      ORDER BY id
      LIMIT #{BATCH_SIZE}
      OFFSET #{offset}
    )
  SQL
end

Even with 244 GB of RAM, this sometimes stopped dead 75% of the way through. Each successive OFFSET forces Postgres to scan and discard more rows, so every batch is slower than the last. We rewrote it to pull the data into Ruby, and it's both faster and doesn't stall partway through. We're using Sequel's multi_insert.

batch = []
now = Sequel.function :now
uuid_generate_v4 = Sequel.function :uuid_generate_v4
DB[:widgets].
  select(:name, :power).
  order(:id).
  use_cursor(rows_per_fetch: BATCH_SIZE).each do |row|
    batch << row.merge!(
      id: uuid_generate_v4,
      updated_at: now,
      created_at: now
    )
    if batch.length >= BATCH_SIZE
      DB[:widgets_copy].multi_insert batch
      batch.clear
    end
  end
# flush the final partial batch
DB[:widgets_copy].multi_insert batch unless batch.empty?
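For reference, each multi_insert call compiles to a single multi-row INSERT, so you pay one round trip per BATCH_SIZE rows instead of one per row. Roughly what one batch looks like (names and values invented for illustration):

INSERT INTO widgets_copy (name, power, id, updated_at, created_at)
VALUES
  ('sprocket', 9000, uuid_generate_v4(), now(), now()),
  ('gizmo', 200, uuid_generate_v4(), now(), now());
  -- ...up to BATCH_SIZE rows per statement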

Thanks to @dkastner and @devTristan who promoted these ideas internally!

3 reasons Citus is the best (non-Heroku) Postgres host available today

Seamus Abshere

Citus Cloud is the hosted Postgres that our app uses. We tried EnterpriseDB Postgres Plus Cloud and couldn't even get going; we used Amazon RDS for a year but migrated away about 6 months ago. Heroku Postgres wasn't bare-metal enough for us.

1. Log directly to Papertrail

[Screenshot of the logging setup]

It's not a world-shattering invention, but a feature like sending your Postgres logs to a syslog host is very useful for debugging.

Citus takes feature requests and actually implements them (this was one of mine).

RDS has a weird home-grown in-browser log viewer, which is useless without tools like rds-pgbadger.

2. citus extension preinstalled

You know that feeling where you're so happy you can use SQL, but you worry just a bit 🙏🏽 about your horizontal scaling story? Citus is the company behind the citus extension to Postgres:

Citus parallelizes incoming queries by breaking it into multiple fragment queries which run on the worker shards in parallel [...] The workers are simply running extended PostgreSQL servers and they apply PostgreSQL’s standard planning and execution logic to run these fragment SQL queries. Therefore, any optimization that helps PostgreSQL also helps Citus.

(from Querying Distributed Tables)
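We won't reproduce the Citus docs here, but the core of the API is a single call: pick a distribution column and Citus spreads the table across worker shards. A sketch, reusing our widgets table as a hypothetical:

-- shard widgets by id across the worker nodes
SELECT create_distributed_table('widgets', 'id');

-- queries against widgets now fan out to the shards in parallel
SELECT power, count(*) FROM widgets GROUP BY power;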

Speaking of extensions, EnterpriseDB didn't even come with PostGIS the last time we checked; to get it you needed a Windows-based (!!) "StackBuilder" tool. Support was based in India and used an ancient Salesforce interface. You get the idea.

3. Everything is negotiable

Citus gave me a great price to move over from RDS.

You're not juggling Amazon on-demand and reserved instances, or forgetting your snapshot settings and paying $2000/mo for obsolete EBS snapshot storage (did that!), or being a tiny fish in the massive AWS pond.

You're not spending a month of negotiations on the phone with EnterpriseDB "execs," paying a huge up-front fee, and then seeing the project fail utterly when your top devops engineer couldn't install a basic Postgres extension.

This is a company with people like Craig Kerstiens (helped build Heroku Postgres) on the other end of the phone, live support chat, and a real value prop.

How to create an RDS instance with Terraform

Seamus Abshere

This post is part of our PostgreSQL series.

Terraform's RDS support makes it easy to create a database instance. Here's a cheatsheet:

resource "aws_db_instance" "mydb1" {  
  allocated_storage        = 256 # gigabytes
  backup_retention_period  = 7   # in days
  db_subnet_group_name     = "${var.rds_public_subnet_group}"
  engine                   = "postgres"
  engine_version           = "9.5.4"
  identifier               = "mydb1"
  instance_class           = "db.r3.large"
  multi_az                 = false
  name                     = "mydb1"
  parameter_group_name     = "mydbparamgroup1" # if you have tuned it
  password                 = "${trimspace(file("${path.module}/secrets/mydb1-password.txt"))}"
  port                     = 5432
  publicly_accessible      = true
  storage_encrypted        = true # you should always do this
  storage_type             = "gp2"
  username                 = "mydb1"
  vpc_security_group_ids   = ["${aws_security_group.mydb1.id}"]
}

Here's the security group you need:

resource "aws_security_group" "mydb1" {  
  name = "mydb1"

  description = "RDS postgres servers (terraform-managed)"
  vpc_id = "${var.rds_vpc_id}"

  # Only postgres in
  ingress {
    from_port = 5432
    to_port = 5432
    protocol = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # Allow all outbound traffic.
  egress {
    from_port = 0
    to_port = 0
    protocol = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

You can get these values from the EC2 console (don't forget to fill them in!):

variable "rds_vpc_id" {  
  default = "vpc-XXXXXXXX"
  description = "Our default RDS virtual private cloud (rds_vpc)."
}

variable "rds_public_subnets" {  
  default = "subnet-YYYYYYYY,subnet-YYYYYYYY,subnet-YYYYYYYY,subnet-YYYYYYYY"
  description = "The public subnets of our RDS VPC rds-vpc."
}

variable "rds_public_subnet_group" {  
  default = "default-vpc-XXXXXXXX"
  description = "Apparently the group name, according to the RDS launch wizard."
}

How to do histograms in PostgreSQL

Bill Morris

This post is part of our data science and PostgreSQL series.

We wanted an easy way to look at data distribution from the command line, so we adapted this excellent piece by Dmitri Fontaine and turned it into a function.

It can be invoked like this:

SELECT * FROM histogram('column_name', 'table_name');  

. . . to give sweet results like this, in a check of the distribution of 2016 political contributions in Vermont:

fec=# SELECT * FROM histogram('transaction_amt', 'small_donors_vt');

 bucket |   range   | freq |       bar       
--------+-----------+------+-----------------
      1 | [0,9]     | 2744 | ******
      2 | [10,19]   | 5630 | *************
      3 | [20,29]   | 6383 | ***************
      4 | [30,39]   | 1290 | ***
      5 | [40,49]   |  369 | *
      6 | [50,59]   | 3541 | ********
      7 | [60,69]   |  174 | 
      8 | [70,79]   |  313 | *
      9 | [80,89]   |  171 | 
     10 | [90,99]   |   65 | 
     11 | [100,109] | 2363 | ******
     12 | [110,119] |   51 | 
     13 | [120,129] |  115 | 
     14 | [130,139] |   32 | 
     15 | [140,146] |   11 | 
     16 | [150,159] |  187 | 
     17 | [160,169] |   24 | 
     18 | [170,177] |   33 | 
     19 | [180,189] |   19 | 
     20 | [191,199] |   24 | 
     21 | [200,200] |  795 | **

Use it yourself by adding this to your postgres setup:

DROP FUNCTION IF EXISTS histogram(text, regclass);
CREATE OR REPLACE FUNCTION histogram(columnname text, tablename regclass)
RETURNS TABLE(bucket int, "range" numrange, freq bigint, bar text)
AS $func$
BEGIN
RETURN QUERY EXECUTE format('WITH
  min_max AS (
    SELECT min(%1$I) AS min, max(%1$I) AS max FROM %2$s
  ),
  histogram AS (
    SELECT
      width_bucket(%1$I, min_max.min, min_max.max, 20) AS bucket,
      numrange(min(%1$I)::numeric, max(%1$I)::numeric, ''[]'') AS "range",
      count(%1$I) AS freq
    FROM %2$s, min_max
    WHERE %1$I IS NOT NULL
    GROUP BY bucket
  )
  SELECT
    bucket,
    "range",
    freq::bigint,
    repeat(''*'', (freq::float / (max(freq) OVER () + 1) * 15)::int) AS bar
  FROM histogram
  ORDER BY bucket',
  columnname, tablename);
END
$func$ LANGUAGE plpgsql;
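The workhorse in there is width_bucket(), which assigns a value to one of N equal-width buckets between a min and a max. A quick illustration:

-- width_bucket(value, low, high, bucket_count)
SELECT width_bucket(35, 0, 200, 20);
-- → 4, i.e. the fourth of 20 ten-unit buckets, covering [30,40)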

Happy querying!

How to preview PostGIS maps on your command line

Bill Morris

This is part of our practical cartography and PostgreSQL series. Put a map on it!

Sometimes it's a pain to open up QGIS and load a PostGIS-enabled DB. Sometimes I don't feel like writing a custom tileserver and hooking it up to Leaflet or Mapbox GL just so I can see if my map looks right.

Sometimes I use the psql command line and a nifty tool by Morgan Herlocker called "geotype" to view my map data.

npm install -g geotype  

. . . which enables fast and simple maps like this:

[terminal map output: New York]

Yep. That's New York, alright.

[terminal map output: District of Columbia]

. . . and that sure looks like the population distribution of the District of Columbia.

These maps are nothing to show to customers, but they make QA/QC a lot easier. Here's the syntax, piping psql output directly into geotype:

psql $DB_URL -t -c "SELECT ST_AsGeoJSON(ST_Collect(the_geom)) FROM mytable" | geotype  

(The -t flag strips psql's headers, footers, and column names, and ST_Collect() merges all the rows into a single geometry, so geotype receives one clean GeoJSON object.)
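To see that coercion in isolation, here's a self-contained sanity check with two made-up points standing in for a real geometry column:

-- ST_Collect rolls the rows up into a single geometry,
-- so ST_AsGeoJSON emits one document instead of one per row
SELECT ST_AsGeoJSON(ST_Collect(geom))
FROM (VALUES (ST_MakePoint(0, 0)), (ST_MakePoint(1, 1))) AS t(geom);
-- → {"type":"MultiPoint","coordinates":[[0,0],[1,1]]}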

Happy mapping!

Postgres strftime (or: how to group by month)

Seamus Abshere

This post is part of our data science and PostgreSQL series.

"How do you do strftime in postgres?"

The answer: to_char(date, format).

If you want to group by month, this is what you're looking for:

psql=> select count(*), to_char(created_at, 'YYYY-MM') from employees group by to_char(created_at, 'YYYY-MM') order by to_char(created_at,'YYYY-MM') desc;  
 count | to_char
-------+---------
    27 | 2016-08
    32 | 2016-07
    58 | 2016-06
    17 | 2016-05
    57 | 2016-04
    44 | 2016-03
    28 | 2016-02
    45 | 2016-01
    10 | 2015-12
    10 | 2015-11
    24 | 2015-10
    15 | 2015-09
    32 | 2015-08
    38 | 2015-07
    31 | 2015-06
    18 | 2015-05
    19 | 2015-04
     5 | 2015-03
     8 | 2015-02
    10 | 2015-01
     7 | 2014-12
    22 | 2014-11
(22 rows)
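If you'd rather group by a real timestamp than a string (handy for sorting or charting), date_trunc gets you the same buckets:

select count(*), date_trunc('month', created_at) as month
from employees
group by month
order by month desc;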

That's it.

How to merge JSON fields in PostgreSQL

Seamus Abshere

This is part of our series on PostgreSQL and things that are obvious once you see them. Nice!

How do you combine (merge) JSON fields in Postgres? You define a new function, json_collect():

# select * from greetings;
         data
-----------------------
 {"es":"Saludos"}
 {"en":"Hello"}
 {"ru":"Здравствуйте"}
(3 rows)

# select json_collect(data) from greetings;
                        json_collect
-------------------------------------------------------------
 { "es" : "Saludos", "en" : "Hello", "ru" : "Здравствуйте" }
(1 row)

json_collect is not built into Postgres, though. You have to create it yourself:

-- needed by json_collect: merges two json objects, right-hand keys win
CREATE FUNCTION json_merge(data json, merge_data json) RETURNS json LANGUAGE sql IMMUTABLE
AS $$
  SELECT json_object_agg(key, value)
  FROM (
    WITH to_merge AS (
      SELECT * FROM json_each(merge_data)
    )
    -- keep the keys of data that aren't being overridden...
    SELECT *
    FROM json_each(data)
    WHERE key NOT IN (SELECT key FROM to_merge)
    UNION ALL
    -- ...then take everything from merge_data
    SELECT * FROM to_merge
  ) t;
$$;

CREATE AGGREGATE json_collect(json) (  
  SFUNC = json_merge,
  STYPE = json,
  INITCOND = '{}'
);
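A quick sanity check of the merge helper on its own (literal values are just for illustration):

SELECT json_merge('{"a": 1, "b": 2}'::json, '{"b": 3}'::json);
-- → { "a" : 1, "b" : 3 }  (keys from the second argument win)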

Thanks to @matheusoliveira!