Friday, December 19, 2014

Sharing records with PostgreSQL Row Security

We can also share protected rows. We can borrow PostgreSQL's built-in aclitem (Access Control List Item) type and aclexplode() function.

This is in no way optimized for performance.

aclitem is an internal type, so it may change at any time.

create table protected.shareable_docs (
  id int primary key,
  name text not null,
  owner name not null default current_user,
  row_acl aclitem[]
);

/* No joins in updateable views */
create view shareable_docs with ( security_barrier ) as 
  select *
  from protected.shareable_docs
  where owner = current_user
     or exists (select 1 from aclexplode(row_acl) where privilege_type = 'SELECT' and pg_has_role(grantee, 'member') )
with check option;

grant delete, insert, select, update on shareable_docs to alice, bob;

set role alice;

/* alice creates a doc and shares it with bob */
insert into shareable_docs (id, name, row_acl) values (1, 'shared doc 1', '{bob=r/alice}');

set role bob;

select count(*) from shareable_docs; --result = 1

You'd need to write triggers to prevent users from deleting shared items though.

An ACL item is structured like this: grantee=privileges/grantor . The grantee is the role (a user or group) that receives the privileges, and the grantor is the role that grants them. Here are some of the privilege codes:
r = SELECT (read)
w = UPDATE (write)
a = INSERT (append)
d = DELETE
A * after a code means WITH GRANT OPTION. So if alice wanted to grant SELECT (WITH GRANT OPTION), UPDATE, and DELETE privileges to the role editors, the aclitem would look like this: editors=r*wd/alice
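To make the format concrete, here's a tiny illustrative parser for the textual aclitem form. This is just string handling for the grantee=privileges/grantor pattern, not anything PostgreSQL exposes:

```python
# Parse the textual aclitem form "grantee=privileges/grantor".
# A "*" after a privilege code means WITH GRANT OPTION.
PRIV_NAMES = {"r": "SELECT", "w": "UPDATE", "a": "INSERT", "d": "DELETE"}

def parse_aclitem(item):
    grantee, rest = item.split("=", 1)
    privs, grantor = rest.split("/", 1)
    parsed = []
    i = 0
    while i < len(privs):
        code = privs[i]
        grant_option = i + 1 < len(privs) and privs[i + 1] == "*"
        parsed.append((PRIV_NAMES.get(code, code), grant_option))
        i += 2 if grant_option else 1
    return {"grantee": grantee, "grantor": grantor, "privileges": parsed}

print(parse_aclitem("editors=r*wd/alice"))
# {'grantee': 'editors', 'grantor': 'alice',
#  'privileges': [('SELECT', True), ('UPDATE', False), ('DELETE', False)]}
```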

Thursday, December 18, 2014

Simple Row Security with PostgreSQL 9.4

PostgreSQL 9.4 makes row security a whole lot easier:
  • security_barrier views are update-able
  • WITH CHECK OPTION prevents users from inserting, updating, or deleting rows that they can't see (or wouldn't be able to see after the change)
create role alice;

create role bob;

create schema protected;

create table protected.bank_accounts (
 id int primary key,
 name text not null,
 owner name not null default current_user,
 balance decimal(19,2) not null
);

create view bank_accounts with ( security_barrier ) as 
 select *
 from protected.bank_accounts
 where owner = current_user
with check option;

grant delete, insert, select, update on bank_accounts to alice, bob;

Users can't set owner to a role in which they don't have membership:
set role alice;

insert into bank_accounts values (1, 'chequeing', 'bob', 500);

ERROR:  new row violates WITH CHECK OPTION for view "bank_accounts"
Users can only see their own stuff:
set role alice;

insert into bank_accounts values (1, 'chequeing', 'alice', 500);

select count(*) from bank_accounts; --result = 1

set role bob;

select count(*) from bank_accounts; --result = 0

delete from bank_accounts; --0 rows affected

Monday, December 8, 2014

MS SQL Server to PostgreSQL Quick Copy

You can quickly copy data from an MS SQL Server database to PostgreSQL using the bcp command, iconv, and psql's \copy command.

I copied 100,000 records from one server, to my laptop, to another server in 1.5 seconds.

Sadly, bcp must write to a file first, so you can't pipe it directly. Delete the file when you're done.

In Linux, you could create a named pipe and pretend it's a file, using the mkfifo command.
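A minimal sketch of that named-pipe trick (Linux only; the pipe path and downstream commands are placeholders):

```python
# Create a named pipe that bcp can treat as an output "file".
# Whatever bcp writes streams straight through to a reader, so no
# real file ever lands on disk.
import os

pipe_path = "/tmp/bcp_results.tsv"
os.mkfifo(pipe_path)  # same effect as running: mkfifo /tmp/bcp_results.tsv

# A reader must consume the pipe while bcp "writes the file", e.g. in another shell:
#   iconv -f UTF16 -t UTF8 < /tmp/bcp_results.tsv | psql -c "\copy some_table from STDIN"
#   bcp "select * from some_table" queryout /tmp/bcp_results.tsv -T -S serverHostName -w

os.remove(pipe_path)  # clean up the pipe when done
```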

bcp can only output Unicode as UTF-16, so we must use iconv to convert the output to UTF-8.

I assume that you have psql installed, and that it's configured to connect to your PostgreSQL server.

  1. Windows: if bcp is not installed (run bcp.exe -v to check), install the Microsoft Command Line Utilities 11 for SQL Server
  2. Windows: if win_iconv is not installed, download it, rename it to iconv.exe and put it in your PATH
  3. Linux: if you have not installed the Microsoft ODBC Driver for SQL Server, you can get it here
In your shell, run the following (this assumes you are using ActiveDirectory/Kerberos auth; use -U and -P instead of -T if not):
bcp "select * from some_table" queryout results.tsv -T -S serverHostName -w

iconv -f UTF16 -t UTF8 results.tsv | psql -c "\copy some_table from STDIN"
Delete results.tsv

Saturday, December 6, 2014

Fuzzy Record Matching in SQL, Part 1

Let's say you have some existing customer records like this:


And you have some new customer records like this (say from another company you bought):


And you want to find possible matches. Now obviously Haystack1 and Needles1 match, but Haystack2 and Haystack3 are possible matches too.

Here's a basic way to find relevant matches, ranked:

select n.id as needle_id, h.id as haystack_id, --column names are illustrative; match on whichever columns your records share
   case when n.name = h.name then 1 else 0 end 
 + case when n.phone = h.phone then 1 else 0 end as relevance
from needles n
join haystack h 
  on n.name = h.name
  or n.phone = h.phone
order by 
  relevance desc;
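The same scoring idea, sketched outside the database (the field names name and phone are placeholders, like the columns above):

```python
# One relevance point per matching field; pairs with zero matches are dropped.
def relevance(needle, haystack):
    return sum(1 for field in ("name", "phone")
               if needle.get(field) and needle.get(field) == haystack.get(field))

def rank_matches(needles, haystacks):
    pairs = [(n["id"], h["id"], relevance(n, h))
             for n in needles for h in haystacks]
    # keep only pairs that matched on at least one field, best matches first
    return sorted((p for p in pairs if p[2] > 0), key=lambda p: -p[2])
```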

This gives us:


Thursday, November 20, 2014

How To Load SQL data After Hibernate Creates Your Schema

Hibernate can create your database schema for you from Java @Entities using the hbm2ddl tool.

You can also tell it to run one or more SQL files after it has created your schema, for example to load test or reference data.

In persistence.xml, inside <persistence-unit><properties> add

<!-- "value" should be "create" or "create-drop". Warning, this is destructive! -->
<property name="hibernate.hbm2ddl.auto" value="create-drop"/>

<!-- relative to main/resources -->
<property name="hibernate.hbm2ddl.import_files" value="/import.sql, /import2.sql"/>

<!-- Lets you have multiple statements on multiple lines -->
<property name="hibernate.hbm2ddl.import_files_sql_extractor" value="org.hibernate.tool.hbm2ddl.MultipleLinesSqlCommandExtractor" />

Then in src/main/resources, add your SQL files, and they will be run.

Tuesday, August 26, 2014

TF-IDF Text Search in Postgres

/* see */
create extension if not exists smlar;

/* your basic table to search against: */
create table documents (
  document_id int primary key,
  body text not null
);

/* I created 100,000 "lorem ipsum" documents here.
   In retrospect, not a great choice, due to the small number of unique words used to generate the dataset. */
copy documents 
from program 'curl ""' 
with (format csv, header true);

/* this table holds document frequencies (# of docs in which a term appears) for the documents.body column: */
create table documents_body_stats (
  value text unique,
  ndoc int not null
);

/* used ts_stat for convenience, not ideal, but good for quick n dirty: */
insert into documents_body_stats
  select word, ndoc
  from ts_stat( 'select to_tsvector(''simple'', body) from documents' );

/* the smlar tfdif table needs the total document count as well. It's added as a row with null in the value column: */
insert into documents_body_stats values 
  (null, (select count(*) from documents) );

/* turn documents into array of words. you could also use tsvector2textarray( to_tsvector(...) ) : */
create or replace function tokenize(text) returns text[] language sql strict immutable as $$
 select regexp_split_to_array( lower($1), '[^[:alnum:]]' );
$$;

/* use smlar's text array opclass. gist is a little more flexible than gin in this case (allows 'n' tf_method): */
create index on documents using gist ( tokenize(body) _text_sml_ops ); --24 seconds

/* setup smlar: */
set smlar.type = 'tfidf';
set smlar.stattable = 'documents_body_stats';
set smlar.tf_method = 'n';
set smlar.threshold = 0.4;

/* the query: */
select
  smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] )
from documents
where
  tokenize(body) % '{fringilla,malesuada,euismod}'::text[] --where TF-IDF similarity >= smlar.threshold
order by
  smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] ) desc
limit 10;

Wednesday, August 20, 2014

PL/V8 vs PL/pgSQL

A not-so-scientific test comparing PL/V8 and PL/pgSQL.

Actual database access seems about the same, but from everything I have seen, PL/V8 always beats PL/pgSQL in terms of performance for logic.

create extension if not exists plv8;

create or replace function fizz_buzz_plv8() returns text volatile language plv8 as $$

 var result = "";

 for(var i=0; i <= 100000; i++) {

  if(i % 3 == 0) result += "Fizz";

  if(i % 5 == 0) result += "Buzz";

  if(result == "") result += i;
 }

 return result;

$$;

create or replace function fizz_buzz_plpgsql() returns text volatile language plpgsql as $$
declare
 result text = '';
begin
 for i in 0 .. 100000 loop
  if i % 3 = 0 then result = result || 'Fizz'; end if;

  if i % 5 = 0 then result = result || 'Buzz'; end if;

  if result = '' then result = result || i; end if;
 end loop;

 return result;
end $$;

select fizz_buzz_plv8(); -- 52 ms

select fizz_buzz_plpgsql(); -- 1292 ms

Saturday, August 16, 2014

pgAgent wishlist

  1. Be able to LISTEN to notifications from a Postgres database, and run a job depending on which notification is fired
  2. Allow sending email after a job finishes, configurable to send only on error. No need to build an MTA: just drop a message in localhost:25
  3. A yum package
  4. Use RFC 5545 RRULES, as it's the standard for scheduling
  5. Fix the bug that shows failure when a Windows batch file returns success

Send email on error with pgAgent

pgAgent is a scheduler for Postgres. It's not bad, but doesn't have the ability to email an admin when there's an error.

Here's a hackish work-around to get that going.

Install pgAgent

On Windows, this is most easily done with Stack Builder from EnterpriseDB. Make sure the pgAgent service is running after you install pgAgent

(Optional) Install and set up a localhost Mail Transfer Agent (MTA)

Configuring your SMTP server settings in multiple programs isn't DRY, so it's better to set up a localhost MTA and have it relay to your SMTP server. It's also dumb to have your program wait for a response from the SMTP server. A localhost MTA gives you a free queue where you can drop a message in localhost:25 and not wait.

For developing/testing on Windows, you can use smtp4dev, which will act as a localhost MTA, but open incoming messages in your default email client, such as Outlook.
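The "drop a message in localhost:25 and don't wait" step looks something like this from any language; a hedged Python sketch (the addresses, host, and port are placeholders):

```python
# Hand a message to a localhost MTA and return immediately;
# the MTA queues it and relays to the real SMTP server on its own time.
import smtplib
from email.message import EmailMessage

def build_alert(subject, body, sender="pgagent@localhost", to="admin@localhost"):
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = to
    msg.set_content(body)
    return msg

def send_alert(msg, host="localhost", port=25):
    # Connecting to the local MTA is fast; we never block on the upstream server.
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(msg)
```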

(Windows only) Get a command line SMTP client

I recommend blat. It's a simple command line SMTP client.

You can send an email like this

blat -t -f -subject "The Subject" -body "The body" -server host:port

Add blat to the PATH.

Create Some Views in the postgres database

Get the last time that the email job was run successfully, or -infinity:

create view email_job_last_success as 
    SELECT coalesce(
    ( SELECT a.jlgstart
      FROM pgagent.pga_joblog a
      JOIN pgagent.pga_job b 
        ON a.jlgjobid = b.jobid
      WHERE b.jobname = 'email_job' 
        AND a.jlgstatus = 's' --success
      ORDER BY a.jlgstart DESC
      LIMIT 1 ), 
    '-infinity'::timestamptz) AS jlgstart;

Get the list of steps that failed since the email job was last run successfully:

create view failed_steps AS 
  SELECT *
  FROM pgagent.pga_jobsteplog a
  JOIN pgagent.pga_jobstep b 
    ON a.jsljstid = b.jstid
  JOIN pgagent.pga_job c 
    ON b.jstjobid = c.jobid
  WHERE jslstatus = 'f' --failure
    AND jobname <> 'email_job'
    AND jslstart > ( SELECT jlgstart FROM email_job_last_success )
  ORDER BY jslstart DESC;

Create the Email Job

  1. Create a job called email_job (if you use a different name, you need to change the views above)
  2. Create a schedule for the job. Have it run say every five minutes. 
  3. Add a step
    1. Name: send_email , or whatever
    2. Connection type: local
    3. Database: postgres 
    4. Kind: SQL
    5. On error: Fail
    6. Definition:

do $$
begin

  /* if no failed steps, then don't send email: */
  if (select count(*) = 0 from failed_steps) then return; end if;

  copy (
    select
      jstdbname as "Database",
      jobname as "Job",
      jstname as "Step",
      jslstart as "Step Started At",
      jslresult as "Result Code",
      jsloutput as "Output"
    from failed_steps
  )
  to program 
    'blat -t -f -server localhost:25 -subject "pgAgent job failure"' 
  with (
    format csv, 
    header true, 
    delimiter E'\t'
  );

end $$;

Change the line after "to program" above to customize your email message. Use sendmail instead of blat on Linux. The email job will run every X minutes. It will see if there are any steps (other than its own) that failed since the email job was last run successfully.

If there are failed steps, then it will COPY them (via stdin) to the email program as CSV. The CSV will appear as the email's body. The email program will send one email with all of the failed steps.

Friday, August 15, 2014

My wish list for Postgres 9.5

  1. Upsert, as always
  2. Replication with SQLite over WebSocket, with full Row Security support. Now every mobile company on the planet wants to use Postgres to sync with Android and iPhone
  3. Automatic HATEOAS RESTful API for data and schemas: JSON and CSV. CouchDB killer.
  4. Incremental / Differential backup & restore over HTTPS (like to S3)
  5. Access O/S environment variables in PL/pgSQL. Use case: So I can keep COPY paths the same in Windows and *Nix
  6. bcrypt or scrypt role passwords (instead of MD5)
  7. skip_lines (default 0) and skip_comments (default '#') options for COPY so I can skip more than the first line in a CSV. Popular use case: geonames
  8. Make it dead easy for extension developers to compile on both *Nix and Windows. There are too many extensions that don't work on Windows
  9. SQL:2011 Temporal features
  10. Make Text Search a little more like Solr, basically just integrate the smlar extension

Monday, August 11, 2014

Getting Canadian Postal Codes into Postgres

--download and unzip (9MB) to /srv/

create table canadian_postal_codes (
  postal_code text primary key,
  latitude numeric(8,6) not null,
  longitude numeric(9,6) not null,
  place_name text,
  province_code char(2) not null
);

copy canadian_postal_codes from '/srv/' with ( format csv, null '', encoding 'latin1' );

--there are two postal codes that are too long:
delete from canadian_postal_codes where length(postal_code) > 6;

--there are some that are lower case:
update canadian_postal_codes set postal_code = upper(postal_code) where postal_code ~ '[a-z]';

--some are "impossible", i.e. start with wrong letter for province, will update here later. 

Tuesday, August 5, 2014

geonames continents

create table continents (
 continent_code char(2) primary key,
 name text not null unique,
 geoname_id int not null unique
);

insert into continents values
('AF', 'Africa', 6255146),
('AS', 'Asia', 6255147),
('EU', 'Europe', 6255148),
('NA', 'North America', 6255149),
('OC', 'Oceania', 6255151),
('SA', 'South America', 6255150),
('AN', 'Antarctica', 6255152);

Sunday, August 3, 2014

Norvig-Style Spelling Correction in Postgres. It's reasonably fast.

I wanted to do some data matching but the input data had errors.

I decided to try Peter Norvig's spelling corrector directly in the database.

I got pretty good results. The same accuracy as Norvig, and only marginally slower (13.5 Hz vs 17 Hz). It started out about 10x slower, but I was able to speed it up after a few hours of tweaking.

Response time for a single correctly spelled word is < 10ms. For a word with edit distance 1, the response time is also < 10ms. For edit distance 2, the response time is around 100ms.

I used PL/V8 for most of it, as it seems to be the fastest of the PL languages, for calculations at least.

Here's the code:

This function finds all words with an edit distance of 1 of the target word. There will be 54*length+25 words. For example, if the target word is "united" (length 6), then there will be 349 words returned. Some will be nonsense words, so we match it against our corpus later.

/* inspired by */
create or replace function edits1(text) returns text[] language plv8 strict immutable as $$

 var i, results = [], letters = "abcdefghijklmnopqrstuvwxyz".split("");

 // deletion
 for (i=0; i < $1.length; i++)
     results.push($1.slice(0, i) + $1.slice(i+1));

 // transposition
 for (i=0; i < $1.length-1; i++)
     results.push($1.slice(0, i) + $1.slice(i+1, i+2) + $1.slice(i, i+1) + $1.slice(i+2));

 // alteration
 for (i=0; i < $1.length; i++)
     letters.forEach(function (l) {
         results.push($1.slice(0, i) + l + $1.slice(i+1));
     });

 // insertion
 for (i=0; i <= $1.length; i++)
     letters.forEach(function (l) {
         results.push($1.slice(0, i) + l + $1.slice(i));
     });

 return results;

$$;
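A quick Python port of the same logic, just to sanity-check the 54*length+25 count (n deletions, n-1 transpositions, 26n alterations, and 26(n+1) insertions):

```python
# Generate every string within edit distance 1 of a word.
# Duplicates are kept, matching the PL/V8 version above.
import string

def edits1(word):
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    alters = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return deletes + transposes + alters + inserts

assert len(edits1("united")) == 54 * len("united") + 25  # 349 candidates
```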

This one finds all the words with an edit distance of 2. We find the 2-distance words of all the 1-distance words, so there are about 100,000 unique words returned. Most are nonsense, so we join them against the corpus later.

create or replace function edits2(text[]) returns text[] language plv8 strict immutable as $$

 var edits1 = plv8.find_function("edits1");

 var words = {}; //using object keys avoids duplicates, dropping edits2 from ~200k to 100k

 for(var i = 0; i < $1.length; i++) {

  var e2 = edits1($1[i]); 

  for(var j = 0; j < e2.length; j++) {
   words[e2[j]] = null;
  }
 }

 return Object.keys(words);

$$;


And this one does the actual correcting. It has to be volatile due to the temp table creation. Create the temp table outside the function and you can mark it as stable, which will let Postgres cache results in the same transaction.

"big" is a table with unique terms and their counts, based on the large corpus here (6M chars, ~30,000 unique words).

create or replace function correct(text) returns text volatile language plpgsql as $$
declare
 tmp text;
 e1 text[];
begin

  --if the word is in the corpus, return it:
  if exists( select 1 from big where word = $1 ) then 
    return $1; 
  end if;

  --get all the 1-distance words:
  e1 = edits1($1);
  select word
  from big 
  where word = any ( select unnest( e1 ) ) 
  order by 
    nentry desc 
  limit 1 
  into tmp;
  --if there are 1-distance words that match, return the most common one:
  if found then 
    return tmp; 
  end if;

  --using a session temp table is much faster than comparing to an array
  create temp table if not exists edits2temp ( word text );
  truncate table edits2temp;

  --get all the 2-distance edits (~100k of them) and put in temp table:
  insert into edits2temp select unnest( edits2(e1) );

  select big.word
  from big
  inner join edits2temp edits on big.word = edits.word
  order by big.nentry desc
  limit 1
  into tmp;

  --if there are 2-distance words that match, return the most common one:
  if found then 
    return tmp; 
  end if;

  truncate table edits2temp; 

  --nothing found, return the original word:
 return $1;

end $$;

Saturday, August 2, 2014

postgresql where short circuit

PostgreSQL makes no promises about the order in which it evaluates AND-ed WHERE conditions, but a CASE expression does evaluate its branches in order, so you can use one to make sure a cheap test runs before an expensive one:

select *
from some_table --placeholder table name
where
  case
    when cheap_function() then true
    when expensive_function() then true
    else false
  end;

Monday, July 28, 2014

Searching multiple columns using PostgreSQL text search

You can do multiple column text search out-of-the-box with PostgreSQL. Well, you can do it on up to four columns, since tsvector offers four weight labels (A, B, C, D).

The trick is to use the tsvector "weight" label as a label representing the column.

Our schema:

create table multi_fields (
  persons_initials text not null,
  phone text not null,
  province char(2) not null
);

insert into multi_fields values

Now we create a function to label our columns, and put an index on our table using that function. This example bypasses the text search parser. You'll probably want to use setweight() and to_tsvector() || to_tsvector() ... in production.

create or replace function three_column_ts_vector(text, text, text) returns tsvector strict immutable language sql as '
  select ( $1 || '':1A '' || $2 || '':1B '' || $3 || '':1C'' )::tsvector;
';

create index on multi_fields using gin( three_column_ts_vector(persons_initials, phone, province) );

Some sample queries:

--where any column contains BC (3 results):
select * from multi_fields
where three_column_ts_vector(persons_initials, phone, province) @@ 'BC'::tsquery;

--where only the province column (label "C") contains BC (2 results):
select * from multi_fields
where three_column_ts_vector(persons_initials, phone, province) @@ 'BC:C'::tsquery;

--where only the persons_initials column (label "A") contains BC (1 result):
select * from multi_fields
where three_column_ts_vector(persons_initials, phone, province) @@ 'BC:A'::tsquery;

--where only the phone column (label "B") contains BC (0 results):
select * from multi_fields
where three_column_ts_vector(persons_initials, phone, province) @@ 'BC:B'::tsquery;

--where province is BC and the name is ON (2 results):
select * from multi_fields
where three_column_ts_vector(persons_initials, phone, province) @@ 'BC:C & ON:A'::tsquery;

Loading Geonames Country Info into Postgres

Download the countryInfo.txt file and strip lines beginning with the "#" character, as Postgres can only skip one line using the COPY command.

mkdir /srv/geonames

cd /srv/geonames


sed '/^#/ d' < countryInfo.txt > countryInfoNoComments.txt

In Postgres:

create table country_info (
  iso char(2) primary key,
  iso3 char(3) not null unique,
  iso_numeric char(3) not null,
  fips char(2),
  country text not null unique,
  capital text,
  area_in_sq_km numeric,
  population int not null,
  continent char(2) not null,
  tld text,
  currency_code char(3),
  currency_name text,
  phone text,
  postal_code_format text,
  postal_code_regex text,
  languages text,
  geonameid int,
  neighbours text,
  equivalent_fips_code text
);

copy country_info
from '/srv/geonames/countryInfoNoComments.txt'
with (
  format text,
  null ''
);

Steps to install the pg_similarity extension into Postgres

The pg_similarity extension lets you run a variety of similarity functions, such as Jaro Winkler, in Postgres.

In your shell:

cd /usr/local/src

git clone

cd pg_similarity

export USE_PGXS=1

make

make install

In Postgres:

create extension pg_similarity;

Test it out:

select jarowinkler('acme, inc', 'acme incorporated');

--returns a double precision similarity score between 0 and 1

Steps to install the smlar extension into Postgres

The smlar extension lets you do TF-IDF and cosine similarity calculations in Postgres.

If you figure out how to compile these on Windows (64-bit), please let me know.

In your shell:

cd /usr/local/src

git clone git://

cd smlar

export USE_PGXS=1

make

make install

In Postgres:

create extension smlar;

Test it out:

select smlar( '{a,b}'::text[], '{c,a}'::text[] );


Wednesday, July 23, 2014

Using the same script for Postgres COPY in Windows and Ubuntu

As per my previous blog post, you should store shared application data in %ProgramData% (Windows) or /srv (Ubuntu).

Let's say you downloaded Geonames' file, unzipped it, and put it into your shared application data folder. How can we access these different O/S paths using the same SQL script with Postgres?

One way would be to create a symlink on Windows from /srv to %ProgramData%, then you can use the following script on either Windows or Ubuntu:

copy geonames
from '/srv/geonames/allCountries.txt'
with (
  format text,
  null ''
);

To create a symlink in Windows, run Command Prompt as an admin:

mklink /d c:\srv %ProgramData%

It would be great if the Postgres guys would let us use O/S environment variables (envvars) in file path, like '$SharedAppData/geonames/allCountries.txt' , then you'd only have to define a common envvar in Linux and Windows.

Shared Data files, Java

On Windows (Vista and up, at least) the correct place to store shared data for applications is %ProgramData% . For example, say we want MaxMind's GeoLite2 database file available to several applications. It should go at %ProgramData%\maxmind\GeoLite2-City.mmdb .

On Ubuntu, the correct place to store shared program data is /srv . So GeoLite would go at /srv/maxmind/GeoLite2-City.mmdb .

It would be nice in Java to have a cross-platform system property to access these locations easily. Here's how.

In Windows:


In Ubuntu:

sudo nano /etc/environment

Add/update the following:

Then in a Java/Spring program you can use the system property like:


Sunday, July 6, 2014

A Probably-Secure Tomcat HTTPS Setup

Edit %CATALINA_HOME%/conf/server.xml. We are doing a few things here:

1. Setting up port 443 for HTTPS and port 80 to redirect to it
2. Disabling insecure protocols (SSL)
3. Disabling HTTP compression (HTTP compression with TLS may reveal session ids or CSRF tokens)
4. using O/S User Environment Variables instead of putting passwords into files
5. Using only good ciphers

Warning: Old clients might not work

<Connector
        port         = "80" 
        protocol     = "org.apache.coyote.http11.Http11NioProtocol"
        redirectPort = "443" />

<Connector
        port                           = "443" 
        protocol                       = "org.apache.coyote.http11.Http11NioProtocol"
        SSLEnabled                     = "true" 
        scheme                         = "https" 
        secure                         = "true"
        sslEnabledProtocols            = "TLSv1.2, TLSv1.1, TLSv1"
        compression                    = "off"
        allowUnsafeLegacyRenegotiation = "false"
        keyAlias                       = "localhost"
        keyPass                        = "${PWD_KEY_LOCALHOST}"
        keystoreFile                   = "${user.home}/.keystore"
        keystorePass                   = "${PWD_KEYSTORE}"
        keystoreType                   = "JKS" />

You would want to add the following to Tomcat's web.xml to redirect http traffic to https:

<security-constraint>
    <web-resource-collection>
        <web-resource-name>Everything</web-resource-name>
        <url-pattern>/*</url-pattern>
    </web-resource-collection>
    <user-data-constraint>
        <transport-guarantee>CONFIDENTIAL</transport-guarantee>
    </user-data-constraint>
</security-constraint>

To pass O/S User Environment Variables to Tomcat as system properties, edit %CATALINA_BASE%/bin/setenv.bat, for example:

set "CATALINA_OPTS=%CATALINA_OPTS% -DPWD_KEYSTORE=%PWD_KEYSTORE% -DPWD_KEY_LOCALHOST=%PWD_KEY_LOCALHOST%"

To create a keypair for your localhost development server, type:

keytool -genkeypair -validity 365 -keyalg EC -keysize 256 -alias localhost

When it (bizarrely) asks for your first and last name, type in localhost

To set O/S User Environment Variables, run cmd as that user, and type, for example SETX PWD_KEYSTORE s3cr3t

Wednesday, June 25, 2014

Draft Entities with Hibernate Validator

Let's say we want to fire different validation rules, depending on whether a record is a draft or not.

The Entity:

@Entity
public class BlogPost {

  @Id
  private Long id;

  //always enforce this constraint:
  @NotBlank
  private String title;

  // only enforce this constraint if our validation group is "Publish"
  @NotBlank( groups = { Publish.class } )
  private String body;

  //...getters and setters...
}

//the "Publish" validation group:
public interface Publish extends Default {};

The HTML form (Spring MVC):

  <form:label path="title">Title <form:errors path="title"/></form:label>
  <form:input path="title"/>

  <form:label path="body">Body <form:errors path="body"/></form:label>
  <form:textarea path="body"></form:textarea>

  <button type="submit" name="action" value="save-draft">Save Draft</button>
  <button type="submit" name="action" value="publish">Publish</button>


The Spring MVC Controller:

public class BlogPostController {
  /* this method is called if we hit the Save Draft button. It validates the blogPost with the Default group (ignoring the constraint on Body) */

  @RequestMapping( method = RequestMethod.POST, params="action=save-draft" )
  String saveDraft( @Validated BlogPost blogPost, BindingResult result) {

  /* this method is called if we hit the Publish button. It validates the blogPost with the Publish group, checking the @NotBlank constraint on Body */

  @RequestMapping( method = RequestMethod.POST, params="action=publish" )
  String publish( @Validated({Publish.class}) BlogPost blogPost, BindingResult result) { 


Wednesday, June 18, 2014

How to import Geonames data into PostgreSQL on Windows

  1. Download (251MB) to %ProgramData%\geonames\, and unzip it.
  2. In PostgreSQL, create a table to import the data into:
create table geonames (
  geoname_id int primary key,
  name text,
  ascii_name text,
  alternate_names text,
  latitude numeric(8,6),
  longitude numeric(9,6),
  feature_class char(1),
  feature_code text,
  country_code char(2),
  cc2 TEXT,
  admin1_code text,
  admin2_code text,
  admin3_code text,
  admin4_code text,
  population bigint,
  elevation int,
  dem text,
  timezone text,
  modification_date date
);

Run the COPY command:

copy geonames
from 'c:/programdata/geonames/allCountries/allCountries.txt'
with (
  format text,
  null ''
);

On my laptop it took ~65 seconds. There were 9,071,443 rows.

Let's say we just want to view the ISO 3166-1 countries. Let's create an index on the feature_code column so we can quickly filter on it (warning, this can take a while):

create index on geonames (feature_code);

And now a view to show only the ISO countries:

create view iso_countries as 
select *
from geonames 
where feature_code IN ( 'PCL', 'PCLD', 'PCLF', 'PCLI', 'PCLIX', 'PCLS' )
   or geoname_id in (2461445, 5880801, 607072, 6697173) --western sahara, american samoa, svalbard and jan mayen, antarctica
  and geoname_id <> 831053 --kosovo
order by
  name;

Sunday, June 8, 2014

Pulling Exchange Rates From A JSON Web API with PostgreSQL

  1. This probably isn't a good idea on a busy server, as it might block
  2. Requires PL/PythonU
  3. Requires an App ID from Open Exchange Rates, which is free (US base currency only)

A Function To Make An HTTP Request

create or replace function http_get(url text) returns text strict volatile language plpythonu AS $$
  import urllib2;
  return urllib2.urlopen(url).read();
$$;


A Function to Return a Monetary Amount in USD

CREATE TABLE exchange_rates_usd (
  counter_currency char(3),
  close_date date,
  midmarket_rate numeric(19,10) NOT NULL,
  PRIMARY KEY (counter_currency, close_date)
);

create function usd(amount numeric(19,4), _counter_currency char(3), _close_date date = current_date) returns numeric(19,4) language plpgsql strict volatile as $$
declare
 result numeric(19,4);
 response json;
begin

 select midmarket_rate
 from exchange_rates_usd 
 where counter_currency = _counter_currency 
   and close_date = _close_date 
 into result;

 if found then return amount / result; end if;

 response = http_get( format('', _close_date, current_setting('my.OPENEXCHANGERATES_APP_ID') ) );

 insert into exchange_rates_usd (counter_currency, close_date, midmarket_rate) 
 select rates.key
  , _close_date
  , rates.value::text::numeric(19,10) 
 from json_each(response->'rates') rates;

 select midmarket_rate
 from exchange_rates_usd 
 where counter_currency = _counter_currency 
   and close_date = _close_date 
 into result;

 return amount / result;

end $$;

The Output

/* you can set this value in postgresql.conf as well: */
set my.OPENEXCHANGERATES_APP_ID = 'whatever your app id is';

select usd(100, 'CAD', '2014-08-14');


Sunday, April 20, 2014

Broadcasting PostgreSQL NOTIFY messages to WebSocket Clients

In this post, I look at how to broadcast NOTIFY messages from a Postgres database to WebSocket clients using Spring MVC 4.

Source code here

Basically, if you change data in your database, it can notify web browser clients, without polling.

Technologies used:

The system works like this:

Client subscribes to a WebSocket topic...

NOTIFY event on database server ->
  PGNotificationListener on web server ->
      Send Websocket notification on server ->
         Receive Websocket event on browser. 

With the code below, if you call NOTIFY dml_events, 'some message'; in Postgres, it will be broadcast to all WebSocket clients.

Follow this answer regarding proper listener setup.

Project Structure:


<?xml version="1.0" encoding="UTF-8"?>
<project xmlns=""




            <!-- PostgreSQL JDBC driver -->

            <!-- Your JSON library -->












import com.impossibl.postgres.api.jdbc.PGConnection;
import com.impossibl.postgres.api.jdbc.PGNotificationListener;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.messaging.simp.SimpMessagingTemplate;

import javax.sql.DataSource;
import java.sql.Statement;

/**
 * @since 7/22/2014
 */
public class PGNotifyToWebSocket {

    @Autowired
    private SimpMessagingTemplate messagingTemplate;

    private PGConnection pgConnection;

    public PGNotifyToWebSocket(DataSource dataSource) throws Throwable {

        pgConnection = (PGConnection) dataSource.getConnection();

        pgConnection.addNotificationListener(new PGNotificationListener() {
            public void notification(int processId, String channelName, String payload) {
                messagingTemplate.convertAndSend("/channels/" + channelName, payload);
            }
        });
    }

    public void init() throws Throwable {
        Statement statement = pgConnection.createStatement();
        statement.execute("LISTEN dml_events");
        statement.close();
    }

    public void destroy() throws Throwable {
        Statement statement = pgConnection.createStatement();
        statement.execute("UNLISTEN dml_events");
        statement.close();
    }
}

<beans xmlns=""


    <context:component-scan base-package="com.databasepatterns"/>


    <mvc:view-controller path="/" view-name="/WEB-INF/index.jsp"/>

    <bean class="com.impossibl.postgres.jdbc.PGDataSource" id="dataSource">
        <property name="host" value="${PGHOST:localhost}"/>
        <property name="database" value="${PGDATABASE:postgres}"/>
        <property name="password" value="${PGPASSWORD}"/>
        <property name="port" value="${PGPORT:5432}"/>
        <property name="user" value="${PGUSER:postgres}"/>
    </bean>

    <bean class="PGNotifyToWebSocket" init-method="init" destroy-method="destroy">
        <constructor-arg ref="dataSource"/>
    </bean>

    <websocket:message-broker application-destination-prefix="/app">
        <websocket:stomp-endpoint path="/hello">
            <websocket:sockjs/>
        </websocket:stomp-endpoint>
        <websocket:simple-broker prefix="/channels"/>
    </websocket:message-broker>
</beans>





<p>Run <kbd>NOTIFY dml_events, 'some message';</kbd> in Postgres (in the <code>$PGDATABASE</code> or <code>postgres</code> database). Tested with PG 9.3, on Windows 7, Chrome 36.</p>

<div id="out"></div>

<script src="//"></script>
<script src="//"></script>


<script>
  var socket = new SockJS("/hello");

  var stompClient = Stomp.over(socket);

  stompClient.connect({}, function(frame) {

    stompClient.subscribe("/channels/dml_events", function(response) {
      //response.body holds the NOTIFY payload
      document.getElementById("out").innerText += response.body + "\r\n\r\n";
    });

  });
</script>



<web-app version="3.0"
         ...>
    ...
</web-app>

Monday, March 3, 2014

Some Basic Unit Test Patterns In Postgres

Let's say you want to write a unit test for PostgreSQL that expects an exception. Here's one way to do it:
create table my_table (
  id int primary key,
  name text not null
);
create schema if not exists test;

create or replace function test.throws_exception_when_name_is_null() returns void as $$
begin
  insert into my_table values (1, null);
  raise exception 'expected a not_null_violation, but the insert succeeded'; --make the test fail if no exception was thrown
exception when not_null_violation then return; --swallow the exception, because we are hoping for it.
end $$ language plpgsql;

select test.throws_exception_when_name_is_null();
Some other common exception names:
  • foreign_key_violation
  • unique_violation
  • check_violation
See the full list of exceptions in the PostgreSQL documentation (Appendix A, Error Codes).
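The same pattern works for any of these. For example, a sketch of a unique_violation test against the my_table created above (the function name is just illustrative):

```sql
create or replace function test.throws_exception_when_id_is_duplicated() returns void as $$
begin
  insert into my_table values (2, 'first');
  insert into my_table values (2, 'second'); --same primary key
  raise exception 'expected a unique_violation, but both inserts succeeded';
exception when unique_violation then return; --the expected outcome
end $$ language plpgsql;

select test.throws_exception_when_id_is_duplicated();
```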

Let's say you want to run a timed test. Throw an exception if the test takes too long:
create or replace function a_long_running_function() returns void as $$
begin
  perform pg_sleep(2);
end $$ language plpgsql;

create or replace function test.test_timed() returns void as $$
begin
  perform a_long_running_function();
end $$ language plpgsql set statement_timeout to 1900; --milliseconds; the test errors out if it runs longer

select test.test_timed();
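Once you have a few of these, you can generate the calls for every function in the test schema from the catalogs. A sketch (it only prints the statements; you'd still execute them, e.g. from a script):

```sql
select format('select %I.%I();', n.nspname, p.proname) as test_to_run
from pg_proc p
join pg_namespace n on n.oid = p.pronamespace
where n.nspname = 'test';
```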

Friday, February 14, 2014

Trees and Paths using WITH RECURSIVE in PostgreSQL

Let's say we had some data like this - from the United Nations geoscheme for world regions:

create table subregions (
  id smallint primary key,
  name text not null,
  parent_id smallint null references subregions(id)
);
insert into subregions values
(1,'World',null),
(2,'Africa',1),
(5,'South America',419),
(9,'Oceania',1),
(11,'Western Africa',2),
(13,'Central America',419),
(14,'Eastern Africa',2),
(15,'Northern Africa',2),
(17,'Middle Africa',2),
(18,'Southern Africa',2),
(19,'Americas',1),
(21,'Northern America',19),
(29,'Caribbean',419),
(30,'Eastern Asia',142),
(34,'Southern Asia',142),
(35,'South-Eastern Asia',142),
(39,'Southern Europe',150),
(53,'Australia and New Zealand',9),
(54,'Melanesia',9),
(57,'Micronesia',9),
(61,'Polynesia',9),
(142,'Asia',1),
(143,'Central Asia',142),
(145,'Western Asia',142),
(150,'Europe',1),
(151,'Eastern Europe',150),
(154,'Northern Europe',150),
(155,'Western Europe',150),
(419,'Latin America and the Caribbean',19);
And you wanted to make a pretty tree like this:

World
   Africa
      Eastern Africa
      Middle Africa
      Northern Africa
      Southern Africa
      Western Africa
   Americas
      Latin America and the Caribbean
         Caribbean
         Central America
         South America
      Northern America
   Asia
      Central Asia
      Eastern Asia
      South-Eastern Asia
      Southern Asia
      Western Asia
   Europe
      Eastern Europe
      Northern Europe
      Southern Europe
      Western Europe
   Oceania
      Australia and New Zealand
      Melanesia
      Micronesia
      Polynesia

Here's how you'd do it:

with recursive my_expression as (
  --start with the "anchor", i.e. all of the nodes whose parent_id is null:
  select
    id,
    name as path,
    name as tree,
    0 as level
  from subregions
  where parent_id is null

  union all

  --then the recursive part:
  select
    current.id as id,
    previous.path || ' > ' || current.name as path,
    repeat('   ', previous.level + 1) || current.name as tree,
    previous.level + 1 as level
  from subregions as current
  join my_expression as previous on current.parent_id = previous.id
)
select tree
from my_expression
order by path;

You can think of WITH RECURSIVE as a chain of UNION statements. A good explanation here: How does a Recursive CTE run, line by line?
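To make that concrete, here is a hand-unrolled sketch of the same query. It stops at level 2, whereas the CTE keeps iterating until a pass produces no new rows:

```sql
--level 0: the anchor
select id, name as path, 0 as level
from subregions
where parent_id is null

union all

--level 1: children of level-0 rows
select c.id, p.name || ' > ' || c.name, 1
from subregions c
join subregions p on c.parent_id = p.id
where p.parent_id is null

union all

--level 2: children of level-1 rows, and so on...
select c.id, gp.name || ' > ' || p.name || ' > ' || c.name, 2
from subregions c
join subregions p on c.parent_id = p.id
join subregions gp on p.parent_id = gp.id
where gp.parent_id is null;
```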

You can also show paths, by selecting the path column from the same recursive expression:

select path
from my_expression
order by path;

World > Africa
World > Africa > Eastern Africa
World > Africa > Middle Africa
World > Africa > Northern Africa
World > Africa > Southern Africa
World > Africa > Western Africa
World > Americas
World > Americas > Latin America and the Caribbean
World > Americas > Latin America and the Caribbean > Caribbean
World > Americas > Latin America and the Caribbean > Central America
World > Americas > Latin America and the Caribbean > South America
World > Americas > Northern America
World > Asia
World > Asia > Central Asia
World > Asia > Eastern Asia
World > Asia > South-Eastern Asia
World > Asia > Southern Asia
World > Asia > Western Asia
World > Europe
World > Europe > Eastern Europe
World > Europe > Northern Europe
World > Europe > Southern Europe
World > Europe > Western Europe
World > Oceania
World > Oceania > Australia and New Zealand
World > Oceania > Melanesia
World > Oceania > Micronesia
World > Oceania > Polynesia
Fiddle with it.

Saturday, February 1, 2014

Call NOTIFY on a DDL Event

create or replace function on_ddl_event() returns event_trigger language plpgsql as $$
begin
  perform pg_notify('ddl_events', format('{"tg_tag":"%s","statement_timestamp":"%s"}', tg_tag, statement_timestamp()));
end $$;

create event trigger on_sql_drop on sql_drop
execute procedure on_ddl_event();

create event trigger on_ddl_command_end on ddl_command_end 
execute procedure on_ddl_event();

It might send a NOTIFY like this:

Asynchronous notification of 'ddl_events' received from backend pid 6940
   Data: {"tg_tag":"ALTER TABLE","statement_timestamp":"2014-02-01 01:31:23.651-08"}
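Any session that wants these notifications just listens on the channel. A quick way to try it (the table name throwaway is only for illustration):

```sql
listen ddl_events;

create table throwaway (id int); --fires the ddl_command_end trigger
drop table throwaway;            --fires the sql_drop trigger, then ddl_command_end
```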

Tuesday, January 28, 2014

Installing the debugger for pgAdmin on Windows

Software used:
  • Windows 7 64-bit
  • PostgreSQL 9.3 installed from EnterpriseDB installer
  • pgAdmin III 1.18

1. Open c:\program files\postgresql\9.3\data\postgresql.conf
2. Un-comment or add this line:

shared_preload_libraries = '$libdir/plugin_debugger.dll'

3. Restart PostgreSQL server
4. In the database you want to debug, run create extension pldbgapi;
5. Restart pgAdmin III
6. In pgAdmin III, you should now have a Debugging option on context menus for user-defined functions/procedures. This debugger only handles PL/pgSQL; you can't debug plain SQL or PL/V8 language functions with it.

Saturday, January 18, 2014

Using the Ispell dictionary in PostgreSQL Text Search

You can use Ispell-compatible dictionaries in Postgres Text Search to get the infinitive form of words. For example, it will convert "tallest" to "tall". It doesn't do spell-checking.

This example uses the Canadian English dictionary, but you can try it with others as well. These steps are for Windows.


  1. Combine en_CA.dic with en_CA.dic_delta, sort them, and save them as en_ca.dict with UTF-8 encoding to the right folder
  2. Save en_CA.aff as en_ca.affix to the right folder
  3. Create the Ispell dictionary in Postgres


  1. Open en_CA.dic
  2. Select all of the text, copy it, and paste it into Text Mechanic. Add a line break at the end
  3. Open en_CA.dic_delta
  4. Select all of the text, copy it, and paste it below the previously pasted text in Text Mechanic.
  5. Scroll to the top, and get rid of the first line. It should be a 5-digit number (the word count)
  6. Click the Alphabetical button, and wait for the text to sort
  7. Select all of the text and copy it to the clipboard
  8. Open Windows Notepad as an administrator
  9. Paste the text from Step 7 into Notepad
  10. Save the file as en_ca.dict (with UTF-8 encoding) to your Postgres text search folder. Mine is C:\Program Files\PostgreSQL\9.3\share\tsearch_data .
  11. Open en_CA.aff, and copy it to Notepad. Save the file as en_ca.affix to your Postgres text search folder.
In PgAdmin, run the following SQL:
create text search dictionary ispell_en_ca (
  template = ispell,
  dictfile = en_ca,    --uses en_ca.dict
  afffile = en_ca,     --uses en_ca.affix
  stopwords = english  --uses english.stop
);

--make sure it works:
select * from ts_lexize('ispell_en_ca', 'colours');


You will need to create a new text search configuration to use the dictionary.
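A sketch of that configuration (the name en_ca and the token types mapped here are choices, not requirements; english_stem is the built-in Snowball stemmer used as a fallback):

```sql
create text search configuration en_ca ( copy = english );

alter text search configuration en_ca
  alter mapping for asciiword, asciihword, hword_asciipart
  with ispell_en_ca, english_stem;

--try it out:
select to_tsvector('en_ca', 'the tallest colours');
```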

Wednesday, January 15, 2014

On Creating Draft Entities

  • Constraints are not in effect when the entity is in DRAFT state
  • Constraints are in effect when the entity is in other states
  • Entity has a stable primary key, so that you can share the entity with others (you'll note that Gmail drafts get different IDs on every save)
  • Allows "child" entities to link to the draft entity and not move around


create table email_messages (
  id serial primary key,
  subject varchar(78) null, --let me be null in drafts, but not otherwise
  status text not null check (status in ('DRAFT', 'OUTBOX', 'DELETED', 'SENT')),

  check (case when status='DRAFT' then true else ( subject is not null ) end )
);
insert into email_messages (subject, status) values (null, 'DRAFT') returning id; --returns 1

insert into email_messages (subject, status) values (null, 'OUTBOX') returning id; --error

Another way to do it:

create table email_messages (
  id serial primary key,
  subject varchar(78) null check ( is_draft or subject is not null ),
  is_draft boolean not null
);
insert into email_messages (subject, is_draft) values (null, true) returning id; --returns 1

insert into email_messages (subject, is_draft) values (null, false); --error
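The same check also guards the state transition, not just the insert. Continuing the second example (assuming the draft above got id 1):

```sql
update email_messages set is_draft = false where id = 1; --error: subject is still null

update email_messages set subject = 'hello', is_draft = false where id = 1; --ok: subject is set first
```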