Startup hacks and engineering miracles from your exhausted friends at Faraday

How to do histograms in PostgreSQL

Bill Morris on

This post is part of our data science and PostgreSQL series.

We wanted an easy way to look at data distribution from the command line, so we adapted this excellent piece by Dmitri Fontaine and turned it into a function.

It can be invoked like this:

SELECT * FROM histogram('column_name', 'table_name');  

. . . to give sweet results like this, in a check of the distribution of 2016 political contributions in Vermont:

fec=# SELECT * FROM histogram('transaction_amt', 'small_donors_vt');

 bucket |   range   | freq |       bar       
--------+-----------+------+-----------------
      1 | [0,9]     | 2744 | ******
      2 | [10,19]   | 5630 | *************
      3 | [20,29]   | 6383 | ***************
      4 | [30,39]   | 1290 | ***
      5 | [40,49]   |  369 | *
      6 | [50,59]   | 3541 | ********
      7 | [60,69]   |  174 | 
      8 | [70,79]   |  313 | *
      9 | [80,89]   |  171 | 
     10 | [90,99]   |   65 | 
     11 | [100,109] | 2363 | ******
     12 | [110,119] |   51 | 
     13 | [120,129] |  115 | 
     14 | [130,139] |   32 | 
     15 | [140,146] |   11 | 
     16 | [150,159] |  187 | 
     17 | [160,169] |   24 | 
     18 | [170,177] |   33 | 
     19 | [180,189] |   19 | 
     20 | [191,199] |   24 | 
     21 | [200,200] |  795 | **

Use it yourself by adding this to your postgres setup:

DROP FUNCTION IF EXISTS histogram(text,regclass);  
CREATE OR REPLACE FUNCTION histogram(columnname text, tablename regclass)  
RETURNS TABLE(bucket int, "range" numrange, freq bigint, bar text)  
AS $func$  
BEGIN  
RETURN QUERY EXECUTE format('WITH  
  min_max AS (
    SELECT min(%s) AS min, max(%s) AS max FROM %s
  ),
  histogram AS (
    SELECT
      width_bucket(%s, min_max.min, min_max.max, 20) AS bucket,
      numrange(min(%s)::numeric, max(%s)::numeric, ''[]'') AS "range",
      count(%s) AS freq
    FROM %s, min_max
    WHERE %s IS NOT NULL
    GROUP BY bucket
    ORDER BY bucket
  )
  SELECT
    bucket,
    "range",
    freq::bigint,
    repeat(''*'', (freq::float / (max(freq) over() + 1) * 15)::int) AS bar
  FROM histogram', 
  columnname,
  columnname,
  tablename,
  columnname,
  columnname,
  columnname,
  columnname,
  tablename,
  columnname
  );
END  
$func$ LANGUAGE plpgsql;

Happy querying!