AI, startup hacks, and engineering miracles from your friends at Faraday

How to use Postgres cursors and multi-insert in Ruby with Sequel

Seamus Abshere

This post is part of our PostgreSQL series.

SQL cursors are a great way to stream data out of a database without loading it all into memory.

Reading using a cursor

Sequel has use_cursor:

BATCH_SIZE = 1000
DB[:widgets].
  select(:id, :name, :power).
  use_cursor(rows_per_fetch: BATCH_SIZE).each do |row|
    row[:id] # whatever
  end
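
Under the hood, use_cursor keeps the query inside a transaction and speaks Postgres's native cursor protocol. A minimal sketch of the equivalent SQL (the cursor name is illustrative, not necessarily what Sequel uses):

BEGIN;
DECLARE widgets_cursor NO SCROLL CURSOR FOR
  SELECT id, name, power FROM widgets;
-- repeat this FETCH until it returns fewer than 1000 rows
FETCH FORWARD 1000 FROM widgets_cursor;
CLOSE widgets_cursor;
COMMIT;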

Writing using cursors and multi-insert

Here's a real-world example: we had to copy rows from widgets to widgets_copy.

We thought we were being really clever, keeping all the data inside Postgres:

# NOTE: this is not the method we recommend - see below for why
batches = (desired_count.to_f / BATCH_SIZE).ceil
batches.times do |i|
  offset = i * BATCH_SIZE
  DB.run <<-SQL
    INSERT INTO widgets_copy (
      id,
      name,
      power,
      created_at,
      updated_at
    ) (
      SELECT
        uuid_generate_v4(),
        name,
        power,
        now(),
        now()
      FROM widgets
      ORDER BY id
      LIMIT #{BATCH_SIZE}
      OFFSET #{offset}
    )
  SQL
end

Even with 244 GB of RAM, this sometimes stopped dead 75% of the way through: every successive OFFSET forces Postgres to re-scan and re-sort all the rows it skips. We rewrote it, pulling the data into Ruby of all places, and the new version is both faster and finishes reliably. We're using Sequel's multi_insert.

batch = []
now = Sequel.function :now
uuid_generate_v4 = Sequel.function :uuid_generate_v4
DB[:widgets].
  select(:name, :power).
  order(:id).
  use_cursor(rows_per_fetch: BATCH_SIZE).each do |row|
    batch << row.merge!(
      id: uuid_generate_v4,
      updated_at: now,
      created_at: now
    )
    if batch.length == BATCH_SIZE
      DB[:widgets_copy].multi_insert batch
      batch.clear
    end
  end
# don't forget to flush the final partial batch
DB[:widgets_copy].multi_insert batch if batch.any?
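
For each batch, multi_insert sends one multi-row INSERT instead of a round trip per row. Roughly the following SQL, with illustrative values:

INSERT INTO widgets_copy (name, power, id, updated_at, created_at) VALUES
  ('sprocket', 9000, uuid_generate_v4(), now(), now()),
  ('flange', 300, uuid_generate_v4(), now(), now());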

Thanks to @dkastner and @devTristan who promoted these ideas internally!

3 reasons Citus is the best (non-Heroku) Postgres host available today

Seamus Abshere

Citus Cloud is the hosted Postgres that our app uses. We tried EnterpriseDB Postgres Plus Cloud and couldn't even get going; we used Amazon RDS for a year but migrated away about 6 months ago. Heroku Postgres wasn't bare-metal enough for us.

1. Log directly to Papertrail

[Screenshot of the logging setup]

It's not a world-shattering invention, but a feature like sending your Postgres logs to a syslog host is very useful for debugging.

Citus takes feature requests and actually implements them (this was one of mine).

RDS has a weird home-grown in-browser log viewer which is useless without things like rds-pgbadger.
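
Citus Cloud wires this up from its console. If you're running Postgres yourself, the rough equivalent (an assumption-laden sketch: you have superuser access and a local syslog daemon that forwards to Papertrail) is:

-- send server logs to the local syslog daemon
ALTER SYSTEM SET log_destination = 'syslog';
-- tag entries so they're easy to filter downstream
ALTER SYSTEM SET syslog_ident = 'postgres';
SELECT pg_reload_conf();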

2. citus extension preinstalled

You know that feeling where you're so happy you can use SQL, but you worry just a bit 🙏🏽 about your horizontal scaling story? Citus is the company behind the citus extension to Postgres:

Citus parallelizes incoming queries by breaking it into multiple fragment queries which run on the worker shards in parallel [...] The workers are simply running extended PostgreSQL servers and they apply PostgreSQL’s standard planning and execution logic to run these fragment SQL queries. Therefore, any optimization that helps PostgreSQL also helps Citus.

(from Querying Distributed Tables)
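
Concretely, distributing a table looks something like this (a sketch: the events table and tenant_id column are hypothetical, and the exact helper function depends on your Citus version):

CREATE EXTENSION citus;
SELECT create_distributed_table('events', 'tenant_id');
-- this aggregate now fans out to the worker shards in parallel
SELECT tenant_id, count(*) FROM events GROUP BY tenant_id;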

Speaking of extensions, EnterpriseDB didn't even come with PostGIS the last time we checked; to get it you needed a Windows-based (!!) "StackBuilder" tool. Support was based in India and used an ancient Salesforce interface. You get the idea.

3. Everything is negotiable

Citus gave me a great price to move over from RDS.

You're not juggling Amazon on-demand and reserved instances, or forgetting your snapshot settings and paying $2000/mo for obsolete EBS snapshot storage (did that!), or being a tiny fish in the massive AWS pond.

You're not spending a month of negotiations on the phone with EnterpriseDB "execs," paying a huge up-front fee, and then seeing the project fail utterly when your top devops engineer couldn't install a basic Postgres extension.

This is a company with people like Craig Kerstiens (who helped build Heroku Postgres) on the other end of the phone, live support chat, and a real value prop.

How to create an RDS instance with Terraform

Seamus Abshere

This post is part of our PostgreSQL series.

Terraform's RDS support makes it easy to create a database instance. Here's a cheatsheet:

resource "aws_db_instance" "mydb1" {
  allocated_storage        = 256 # gigabytes
  backup_retention_period  = 7   # in days
  db_subnet_group_name     = "${var.rds_public_subnet_group}"
  engine                   = "postgres"
  engine_version           = "9.5.4"
  identifier               = "mydb1"
  instance_class           = "db.r3.large"
  multi_az                 = false
  name                     = "mydb1"
  parameter_group_name     = "mydbparamgroup1" # if you have tuned it
  password                 = "${trimspace(file("${path.module}/secrets/mydb1-password.txt"))}"
  port                     = 5432
  publicly_accessible      = true
  storage_encrypted        = true # you should always do this
  storage_type             = "gp2"
  username                 = "mydb1"
  vpc_security_group_ids   = ["${aws_security_group.mydb1.id}"]
}

Here's the security group you need:

resource "aws_security_group" "mydb1" {
  name = "mydb1"

  description = "RDS postgres servers (terraform-managed)"
  vpc_id = "${var.rds_vpc_id}"

  # Only postgres in
  ingress {
    from_port = 5432
    to_port = 5432
    protocol = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # Allow all outbound traffic.
  egress {
    from_port = 0
    to_port = 0
    protocol = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

You can get these values from the EC2 console (don't forget to fill in your own IDs):

variable "rds_vpc_id" {
  default = "vpc-XXXXXXXX"
  description = "Our default RDS virtual private cloud (rds_vpc)."
}

variable "rds_public_subnets" {
  default = "subnet-YYYYYYYY,subnet-YYYYYYYY,subnet-YYYYYYYY,subnet-YYYYYYYY"
  description = "The public subnets of our RDS VPC rds-vpc."
}

variable "rds_public_subnet_group" {
  default = "default-vpc-XXXXXXXX"
  description = "Apparently the group name, according to the RDS launch wizard."
}
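
Once terraform apply finishes, a quick sanity check from psql doesn't hurt (expected values mirror the config above):

SHOW server_version;  -- expect 9.5.4
SHOW port;            -- expect 5432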

How to do histograms in PostgreSQL

Bill Morris

This post is part of our data science and PostgreSQL series.

UPDATED FOR 2017: now with easy subquery support and a more sensible argument order!

We adapted this excellent piece by Dimitri Fontaine and turned it into a function.

It can be invoked like this:

SELECT * FROM histogram($table_name_or_subquery, $column_name);

. . . to give sweet results like this, in a check of the distribution of 2016 political contributions in Vermont:

fec=# SELECT * FROM histogram('(SELECT * FROM small_donors_vt LIMIT 50000)', 'transaction_amt');

 bucket |   range   | freq |       bar       
--------+-----------+------+-----------------
      1 | [0,9]     | 2744 | ******
      2 | [10,19]   | 5630 | *************
      3 | [20,29]   | 6383 | ***************
      4 | [30,39]   | 1290 | ***
      5 | [40,49]   |  369 | *
      6 | [50,59]   | 3541 | ********
      7 | [60,69]   |  174 | 
      8 | [70,79]   |  313 | *
      9 | [80,89]   |  171 | 
     10 | [90,99]   |   65 | 
     11 | [100,109] | 2363 | ******
     12 | [110,119] |   51 | 
     13 | [120,129] |  115 | 
     14 | [130,139] |   32 | 
     15 | [140,146] |   11 | 
     16 | [150,159] |  187 | 
     17 | [160,169] |   24 | 
     18 | [170,177] |   33 | 
     19 | [180,189] |   19 | 
     20 | [191,199] |   24 | 
     21 | [200,200] |  795 | **

Use it yourself by adding this to your postgres setup:

CREATE OR REPLACE FUNCTION histogram(table_name_or_subquery text, column_name text)
RETURNS TABLE(bucket int, "range" numrange, freq bigint, bar text)
AS $func$
BEGIN
RETURN QUERY EXECUTE format('
  WITH
  source AS (
    SELECT * FROM %s
  ),
  min_max AS (
    SELECT min(%s) AS min, max(%s) AS max FROM source
  ),
  histogram AS (
    SELECT
      width_bucket(%s, min_max.min, min_max.max, 20) AS bucket,
      numrange(min(%s)::numeric, max(%s)::numeric, ''[]'') AS "range",
      count(%s) AS freq
    FROM source, min_max
    WHERE %s IS NOT NULL
    GROUP BY bucket
    ORDER BY bucket
  )
  SELECT
    bucket,
    "range",
    freq::bigint,
    repeat(''*'', (freq::float / (max(freq) over() + 1) * 15)::int) AS bar
  FROM histogram',
  table_name_or_subquery,
  column_name,
  column_name,
  column_name,
  column_name,
  column_name,
  column_name,
  column_name
  );
END
$func$ LANGUAGE plpgsql;
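
The bucketing is done by Postgres's built-in width_bucket(), which assigns a value to one of N equal-width buckets between a min and a max. For example:

SELECT width_bucket(5.35, 0.024, 10.06, 5);  -- returns 3, i.e. the third of 5 buckets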

Note:

  1. You don't need to use a subquery - you can also just provide a table name as the first argument.
  2. Subqueries are useful for sampling, e.g. (SELECT * FROM bigtable ORDER BY random() LIMIT 500). Don't forget the parentheses! (See the full invocation after this list.)
  3. You can call this as SELECT histogram() or SELECT * FROM histogram(). The latter form is much more legible!
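
Putting note 2 into practice, a sampled invocation looks like this (bigtable and mycolumn are hypothetical names):

SELECT * FROM histogram('(SELECT * FROM bigtable ORDER BY random() LIMIT 500)', 'mycolumn');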

Happy querying!

How to preview PostGIS maps on your command line

Bill Morris

This is part of our practical cartography and PostgreSQL series. Put a map on it!

Sometimes it's a pain to open up QGIS and load a PostGIS-enabled DB. Sometimes I don't feel like writing a custom tileserver and hooking it up to Leaflet or Mapbox GL just so I can see if my map looks right.

Sometimes I use the psql command line and a nifty tool by Morgan Herlocker called "geotype" to view my map data.

npm install -g geotype

. . . which enables fast and simple maps like this:

[ASCII map of New York state rendered by geotype]

Yep. That's New York, alright.

[ASCII map of the District of Columbia rendered by geotype]

. . . and that sure looks like the population distribution of the District of Columbia.

These maps are nothing to show to customers, but they make QA/QC a lot easier. Here's the syntax, piping psql output directly into geotype:

psql $DB_URL -t -c "SELECT ST_AsGeoJSON(ST_Collect(the_geom)) FROM mytable" | geotype

(The -t flag strips psql's headers and footers, and ST_Collect() merges the geometries into the single GeoJSON object that geotype expects.)
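
If the table is big, simplifying the geometry first keeps the pipe snappy. Swap something like this into the -c query above (the 0.01 tolerance is an arbitrary starting point, in degrees):

SELECT ST_AsGeoJSON(ST_Collect(ST_Simplify(the_geom, 0.01)))
FROM mytable;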

Happy mapping!