Previous Articles

There's also an index of all entries.
 

Happy Numbers

After discovering the excellent Gwene service, which allows you to subscribe to newsgroups to read RSS content (blogs, planets, commits, etc), I came to read this nice article about Happy Numbers. That's a little problem that fits well an interview style question, so I first solved it yesterday evening in Emacs Lisp as that's the language I use the most those days.

A happy number is defined by the following process. Starting with any positive integer, replace the number by the sum of the squares of its digits, and repeat the process until the number equals 1 (where it will stay), or it loops endlessly in a cycle which does not include 1. Those numbers for which this process ends in 1 are happy numbers, while those that do not end in 1 are unhappy numbers (or sad numbers).

Now, what about implementing the same in pure SQL, for more fun? Now that's interesting! After all, we didn't get WITH RECURSIVE for tree traversal only, did we?

Unfortunately, we need a little helper function first, if only to ease the reading of the recursive query. I didn't try to inline it, but here it goes:

create or replace function digits(x bigint)
  returns setof int
  language sql
as $$
  select substring($1::text from i for 1)::int
    from generate_series(1, length($1::text)) as t(i)
$$;

That was easy: it will output one row per digit of the input number — and rather than resorting to powers of ten and divisions and remainders, we do use plain old text representation and substring. Now, to the real problem. If you're read what is an happy number and already did read the fine manual about Recursive Query Evaluation, it should be quite easy to read the following:

with recursive happy(n, seen) as (
    select 7::bigint, '{}'::bigint[]
  union all
    select sum(d*d), h.seen || sum(d*d)
      from (select n, digits(n) as d, seen
              from happy
           ) as h
  group by h.n, h.seen
    having not seen @> array[sum(d*d)]
)
  select * from happy;
  n  |       seen
-----+------------------
   7 | {}
  49 | {49}
  97 | {49,97}
 130 | {49,97,130}
  10 | {49,97,130,10}
   1 | {49,97,130,10,1}
(6 rows)

Time: 1.238 ms

That shows how it works for some happy number, and it's easy to test for a non-happy one, like for example 17. The query won't cycle thanks to the seen array and the having filter, so the only difference between an happy and a sad number will be that in the former case the last line output by the recursive query will have n = 1. Let's expand this knowledge into a proper function (because we want to be able to have the number we test for happiness as an argument):

create or replace function happy(x bigint)
  returns boolean
  language sql
as $$
with recursive happy(n, seen) as (
    select $1, '{}'::bigint[]
  union all
    select sum(d*d), h.seen || sum(d*d)
      from (select n, digits(n) as d, seen
              from happy
           ) as h
  group by h.n, h.seen
    having not seen @> array[sum(d*d)]
)
  select n = 1 as happy
    from happy
order by array_length(seen, 1) desc nulls last
   limit 1
$$;

We need the desc nulls last trick in the order by because the array_length() of any dimension of an empty array is NULL, and we certainly don't want to return all and any number as unhappy on the grounds that the query result contains a line input, {}. Let's now play the same tricks as in the puzzle article:

=# select array_agg(x) as happy from generate_series(1, 50) as t(x) where happy(x);
              happy
----------------------------------
 {1,7,10,13,19,23,28,31,32,44,49}
(1 row)

Time: 24.527 ms

=# explain analyze select x from generate_series(1, 10000) as t(x) where happy(x);
                      QUERY PLAN
----------------------------------------------------------------------------------------
 Function Scan on generate_series t  (cost=0.00..265.00 rows=333 width=4)
                          (actual time=2.938..3651.019 rows=1442 loops=1)
   Filter: happy((x)::bigint)
 Total runtime: 3651.534 ms
(3 rows)

Time: 3652.178 ms

(Yes, I tricked the EXPLAIN ANALYZE output so that it fits on the page width here). For what it's worth, finding the first 10000 happy numbers in Emacs Lisp on the same laptop takes 2830 ms, also running a recursive version of the code.

 

Playing with bit strings

The idea of the day ain't directly from me, I'm just helping with a very thin subpart of the problem. The problem, I can't say much about, let's just assume you want to reduce the storage of MD5 in your database, so you want to abuse bit strings. A solution to use them works fine, but the datatype is still missing some facilities, for example going from and to hexadecimal representation in text.

create or replace function hex_to_varbit(h text)
 returns varbit
 language sql
as $$
  select ('X' || $1)::varbit;
$$;

create or replace function varbit_to_hex(b varbit)
 returns text
 language sql
as $$
  select array_to_string(array_agg(to_hex((b << (32*o))::bit(32)::bigint)), '')
    from (select b, generate_series(0, n-1) as o
            from (select $1, octet_length($1)/4) as t(b, n)) as x
$$;

To understand the magic in the second function, let's walk through the tests one could do when wanting to grasp how things work in the bitstring world (using also some reading of the fine documentation, too).

=# select ('101011001011100110010110'::varbit << 0)::bit(8);
   bit
----------
 10101100
(1 row)

=# select ('101011001011100110010110'::varbit << 8)::bit(8);
   bit
----------
 10111001
(1 row)

=# select ('101011001011100110010110'::varbit << 16)::bit(8);
   bit
----------
 10010110
(1 row)

=# select * from *TEMP VERSION OF THE FUNCTION FOR TESTING*
 o |                b                 |    x
---+----------------------------------+----------
 0 | 10101100101111010001100011011011 | acbd18db
 1 | 01001100110000101111100001011100 | 4cc2f85c
 2 | 11101101111011110110010101001111 | edef654f
 3 | 11001100110001001010010011011000 | ccc4a4d8
(4 rows)

What do we get from that, will you ask? Let's see a little example:

=# select hex_to_varbit(md5('foo'));
                                                          hex_to_varbit
----------------------------------------------------------------------------------------------------------------------------------
 10101100101111010001100011011011010011001100001011111000010111001110110111101111011001010100111111001100110001001010010011011000
(1 row)

=# select md5('foo'), varbit_to_hex(hex_to_varbit(md5('foo')));
               md5                |          varbit_to_hex
----------------------------------+----------------------------------
 acbd18db4cc2f85cedef654fccc4a4d8 | acbd18db4cc2f85cedef654fccc4a4d8
(1 row)

Storing varbits rather than the text form of the MD5 allows us to go from 6510 MB down to 4976 MB on a sample table containing 100 millions rows. We're targeting more that that, so that's a great win down here!

In case you wonder, querying the main index on varbit rather than the one on text for a single result row, the cost of doing the conversion with varbit_to_hex seems to be around 28 µs. We can afford it.

Hope this helps!

 

Editing constants in constraints

We're using constants in some constraints here, for example in cases where several servers are replicating to the same federating one: each origin server has his own schema, and all is replicated nicely on the central host, thanks to Londiste, as you might have guessed already.

For bare-metal recovery scripts, I'm working on how to change those constants in the constraints, so that pg_dump -s plus some schema tweaking would kick-start a server. Here's a PLpgSQL snippet to do just that:

  FOR rec IN EXECUTE
$s$
SELECT schemaname, tablename, conname, attnames, def
  FROM (
   SELECT n.nspname, c.relname, r.conname,
          (select array_accum(attname)
             from pg_attribute
            where attrelid = c.oid and r.conkey @> array[attnum]) as attnames,
          pg_catalog.pg_get_constraintdef(r.oid, true)
   FROM pg_catalog.pg_constraint r
        JOIN pg_class c on c.oid = r.conrelid
        JOIN pg_namespace n ON n.oid = c.relnamespace
   WHERE r.contype = 'c'
ORDER BY 1, 2, 3
       ) as cons(schemaname, tablename, conname, attnames, def)
WHERE attnames @> array['server']::name[]
$s$
  LOOP
    rec.def := replace(rec.def, 'server = ' || old_id,
                                'server = ' || new_id);

    sql := 'ALTER TABLE ' || rec.schemaname || '.' || rec.tablename
        || ' DROP CONSTRAINT ' || rec.conname;
    RAISE NOTICE '%', sql;
    RETURN NEXT;
    EXECUTE sql;

    sql := 'ALTER TABLE ' || rec.schemaname || '.' || rec.tablename
        || ' ADD ' || rec.def;
    RAISE NOTICE '%', sql;
    RETURN NEXT;
    EXECUTE sql;

  END LOOP;

This relies on the fact that our constraints are on the column server. Why would this be any better than a sed one-liner, would you ask me? I'm fed up with having pseudo-parsing scripts and taking the risk that the simple command will change data I didn't want to edit. I want context aware tools, pretty please, to feel safe.

Otherwise I'd might have gone with pg_dump -s| sed -e 's:\(server =\) 17:\1 18:' but this one-liner already contains too much useless magic for my taste (the space before 17 ain't in the group match to allow for having \1 18 in the right hand side. And this isn't yet parametrized, and there I'll need to talk to the database, as that's were I store the servers name and their id (a bigserial — yes, the constraints are all generated from scripts). I don't want to write an SQL parser and I don't want to play loose, so the PLpgSQL approach is what I'm thinking as the best tool here. Opinionated answers get to my mailbox!

 

debian packaging PostgreSQL extensions

In trying to help an extension debian packaging effort, I've once again proposed to handle it. That's because I now begin to know how to do it, as you can see in my package overview page at debian QA facility. There's a reason why I proposed myself here, it's that yet another tool of mine is now to be found in debian, and should greatly help extension packaging there. You can already check for the postgresql-server-dev-all package page if you're that impatient!

Back? Ok, so I used to have two main gripes against debian support for PostgreSQL. The first one, which is now feeling alone, is that both project release support policy aren't compatible enough for debian stable to include all currently supported stable PostgreSQL major version. That's very bad that debian stable will only propose one major version, knowing that the support for several of them is in there.

The problem is two fold: first, debian stable has to maintain any distributed package. There's no deprecation policy allowing for droping the ball. So the other side of this coin is that debian developers must take on themselves maintaining included software for as long as stable is not renamed oldstable. And it so happens that there's no debian developer that feels like maintaining end of lined PostgreSQL releases without help from PostgreSQL Core Team. Or, say, without official statement that they would help.

Now, why I don't like this situation is because I'm pretty sure there's very few software development group offering as long and reliable maintenance policy as PostgreSQL is doing, but debian will still happily distribute unknown-maintenance-policy pieces of code in its stable repositories. So the uncertainty excuse is rather poor. And highly frustrating.

Note: you have to admit that the debian stable management model copes very well with all the debian included software. You can't release stable with a new PostgreSQL major version unless each and every package depending on PostgreSQL will actually work with the newer version, and the debian scripts will care for upgrading the cluster. Where it's not working good is when you're using debian for a PostgreSQL server for a proprietary application, which happens quite frequently too.

The consequence of this fact leads to my second main gripe against debian support for PostgreSQL: the extensions. It so happens that the PostgreSQL extensions are developped for supporting several major versions from the same source code. So typically, all you need to do is recompile the extension against the new major version, and there you go.

Now, say debian new stable is coming with 8.4 rather than 8.3 as it used to. You should be able to just build the extensions (like prefix), without changing the source package, nor droping postgresql-8.3-prefix from the distribution on the grounds that 8.3 ain't in debian stable anymore.

I've been ranting a lot about this state of facts, and I finally provided a patch to the postgresql-common debian packaging, which made it into version 110: welcome pg_buildext. An exemple of how to use it can be found in the git branch for prefix, it shows up in debian/pgversions and debian/rules files.

As you can see, the pg_buildext tool allows you to list the PostgreSQL major versions the extension you're packaging supports, and only those that are both in your list and in the current debian supported major version list will get built. pg_buildext will do a VPATH build of your extension, so it's capable of building the same extension for multiple major versions of PostgreSQL. Here's how it looks:

        # build all supported version
        pg_buildext build $(SRCDIR) $(TARGET) "$(CFLAGS)"

        # then install each of them
        for v in `pg_buildext supported-versions $(SRCDIR)`; do \
                dh_install -ppostgresql-$$v-prefix ;\
        done

And the files are to be found in those places:

dim ~/dev/prefix cat debian/postgresql-8.3-prefix.install
debian/prefix-8.3/prefix.so usr/lib/postgresql/8.3/lib
debian/prefix-8.3/prefix.sql usr/share/postgresql/8.3/contrib

dim ~/dev/prefix cat debian/postgresql-8.4-prefix.install
debian/prefix-8.4/prefix.so usr/lib/postgresql/8.4/lib
debian/prefix-8.4/prefix.sql usr/share/postgresql/8.4/contrib

So you still need to maintain debian/pgversions and the postgresql-X.Y-extension.* files, but then a change in debian support for PostgreSQL major versions will be handled automatically (there's a facility to trigger automatic rebuild when necessary).

All this ranting to explain that pretty soon, the extenion's packages that I maintain will no longer have to be patched when dropping a previously supported major version of PostgreSQL. I'm breathing a little better, so thanks a lot Martin!

 

Querying the Catalog to plan an upgrade

Some user on IRC was reading the releases notes in order to plan for a minor upgrade of his 8.3.3 installation, and was puzzled about potential needs for rebuilding GIST indexes. That's from the 8.3.5 release notes, and from the 8.3.8 notes you see that you need to consider hash indexes on interval columns too. Now the question is, how to find out if any such beasts are in use in your database?

It happens that PostgreSQL is letting you know those things by querying its system catalogs. That might look hairy at first, but it's very worth getting used to those system tables. You could compare that to introspection and reflexive facilities of some programming languages, except much more useful, because you're reaching all the system at once. But, well, here it goes:

SELECT schemaname, tablename, relname, amname, indexdef
  FROM pg_indexes i
       JOIN pg_class c ON i.indexname = c.relname and c.relkind = 'i'
       JOIN pg_am am ON c.relam = am.oid
 WHERE amname = 'gist';

Now you could replace the WHERE clause with WHERE amname IN ('gist', 'hash') to check both conditions at once. What about pursuing the restriction on the hash indexes rebuild to schedule, as they should only get done to indexes on interval columns. Well let's try it:

SELECT schemaname, tablename, relname as indexname, amname, indclass
  FROM pg_indexes i
       JOIN pg_class c on i.indexname = c.relname and c.relkind = 'i'
       JOIN pg_am am on c.relam = am.oid
       JOIN pg_index x on x.indexrelid = c.oid
 WHERE amname in ('btree', 'gist')
       and schemaname not in ('pg_catalog', 'information_schema');

We're not there yet, because as you notice, the catalogs are somewhat optimized and not always in a normal form. That's good for the system's performance, but it makes querying a bit uneasy. What we want is to get from the indclass column if there's any of them (it's an oidvector) that applies to an interval data type. There's a subtlety here as the index could store interval data even if the column is not of an interval type itself, so we have to find both cases.

Well the subtlety applies after you know what an operator class is: “An operator class defines how a particular data type can be used with an index” is what the CREATE OPERATOR CLASS manual page teaches us. What we need to know here is that an index will talk to an operator class to get to the data type, either the column data type or the index storage one.

SELECT schemaname, tablename, relname as indexname, amname, indclass, opcname, typname
  FROM pg_indexes i
       JOIN pg_class c on i.indexname = c.relname and c.relkind = 'i'
       JOIN pg_am am on c.relam = am.oid
       JOIN pg_index x on x.indexrelid = c.oid
       JOIN pg_opclass o
         on string_to_array(x.indclass::text, ' ')::oid[] @> array[o.oid]::oid[]
       JOIN pg_type t on o.opckeytype = t.oid
WHERE amname = 'hash' and t.typname = 'interval'

UNION ALL

SELECT schemaname, tablename, relname as indexname, amname, indclass, opcname, typname
  FROM pg_indexes i
       JOIN pg_class c on i.indexname = c.relname and c.relkind = 'i'
       JOIN pg_am am on c.relam = am.oid
       JOIN pg_index x on x.indexrelid = c.oid
       JOIN pg_opclass o
         on string_to_array(x.indclass::text, ' ')::oid[] @> array[o.oid]::oid[]
       JOIN pg_type t on o.opcintype = t.oid
WHERE amname = 'hash' and t.typname = 'interval';

Most certainly this query will return no row for you, as hash indexes are not widely used, mainly because they are not crash tolerant. For seeing some results you could remove the amname restriction of course, that would show the query is working, but don't forget to add the restriction back to plan for the upgrade!

But hey, why walking the extra mile here, would you ask me? After all, in the second query we would already have had the information we needed should we added the indexdef column, albeit in a human reader friendly way: the resultset would then contain the CREATE INDEX command you need to issue to build the index from scratch. That would be enough for checking only the catalog, but the extra mile allows you to produce a SQL script to build the indexes that need your attention post upgrade. That last step is left as an exercise for the reader, though.

 

Database Virtual Machines

Today I'm being told once again about SQLite as an embedded database software. That one ain't a database server but a software library that you can use straight into your main program. I'm yet to use it, but it looks like its SQL support is good enough for simple things — and that covers loads of things. I guess read-only cache and configuration storage would be the obvious ones, because it seems that SQLite use cases aren't including mixed concurrency, that is workloads with concurrent readers and writers.

The part that got my full attention is The Virtual Database Engine of SQLite, as this blog title would imply. It seems to be the same idea as what MonetDB calls their MonetDB Assembly Language, and I've been trying to summarize some idea about it in my Next Generation PostgreSQL article.

The main thing is how to further optimize PostgreSQL given what we have. It seems that among the major road blocks in the performance work is how we get the data from disk and to the client. We're still spending so many time in the CPU that the disk bandwidth are not always saturated, and that's a problem. Further thoughts on the full length article, but that's just about a one page section now!

 

Partitioning: relation size per “group”

This time, we are trying to figure out where is the bulk of the data on disk. The trick is that we're using DDL partitioning, but we want a “nice” view of size per partition set. Meaning that if you have for example a parent table foo with partitions foo_201006 and foo_201007, you would want to see a single category foo containing the accumulated size of all the partitions underneath foo.

Here we go:

select groupe, pg_size_pretty(sum(bytes)::bigint) as size, sum(bytes)
  from (
select relkind as k, nspname, relname, tablename, bytes,
         case when relkind = 'r' and relname ~ '[0-9]{6}$'
              then substring(relname from 1 for length(relname)-7)

              when relkind = 'i' and  tablename ~ '[0-9]{6}$'
              then substring(tablename from 1 for length(tablename)-7)

              else 'core'
          end as groupe
  from (
  select nspname, relname,
         case when relkind = 'i'
              then (select relname
                      from pg_index x
                           join pg_class xc on x.indrelid = xc.oid
                           join pg_namespace xn on xc.relnamespace = xn.oid
                     where x.indexrelid = c.oid
                    )
              else null
           end as tablename,
         pg_size_pretty(pg_relation_size(c.oid)) as relation,
         pg_total_relation_size(c.oid) as bytes,
         relkind
    from pg_class c join pg_namespace n on c.relnamespace = n.oid
   where c.relkind in ('r', 'i')
         and nspname in ('public', 'archive')
         and pg_total_relation_size(c.oid) > 32 * 1024
order by 5 desc
       ) as s
       ) as t
group by 1
order by 3 desc;

Note that by simply removing those last two lines here, you will get a detailed view of the indexes and tables that are taking the most volume on disk at your place.

Now, what about using window functions here so that we get some better detailed view of historic changes on each partition? With some evolution figure in percentage from the previous partition of the same year, accumulated size per partition and per year, yearly sum, you name it. Here's another one you might want to try, ready for some tuning (schema name, table name, etc):

WITH s AS (
  select relname,
         pg_relation_size(c.oid) as size,
         pg_total_relation_size(c.oid) as tsize,
         substring(substring(relname from '[0-9]{6}$') for 4)::bigint as year
    from pg_class c
         join pg_namespace n on n.oid = c.relnamespace
   where c.relkind = 'r'
     -- and n.nspname = 'public'
     -- and c.relname ~ 'stats'
     and substring(substring(relname from '[0-9]{6}$') for 4)::bigint >= 2008
order by relname
),
     sy AS (
  select relname,
         size,
         tsize,
         year,
         (sum(size) over w_year)::bigint as ysize,
         (sum(size) over w_month)::bigint as cumul,
         (lag(size) over (order by relname))::bigint as previous
    from s
  window w_year  as (partition by year),
         w_month as (partition by year order by relname)
),
     syp AS (
  select relname,
         size,
         tsize,
         rank() over (partition by year order by size desc) as rank,
         case when ysize = 0 then ysize
              else round(size / ysize::numeric * 100, 2) end as yp,
         case when previous = 0 then previous
              else round((size / previous::numeric - 1.0) * 100, 2) end as evol,
         cumul,
         year,
         ysize
    from sy
)
  SELECT relname,
         pg_size_pretty(size) as size,
         pg_size_pretty(tsize) as "+indexes",
         evol, yp as "% annuel", rank,
         pg_size_pretty(cumul) as cumul, year,
         pg_size_pretty(ysize) as "yearly sum",
         pg_size_pretty((sum(size) over())::bigint) as total
    FROM syp
ORDER BY relname;

Hope you'll find it useful, I certainly do!

 

Emacs and PostgreSQL

Those are my two all times favorite Open Source Software. Or Free Software in the GNU sense of the world, as both the BSD and the GPL are labeled free there. Even if I prefer the The Debian Free Software Guidelines as a global definition and the WTFPL license. But that's a digression.

I think that Emacs and PostgreSQL do share a lot in common. I'd begin with the documentation, which quality is amazing for both projects. Then of course the extensibility with Emacs Lisp on the one hand and catalog-driven operations on the other hand. Whether you're extending Emacs or PostgreSQL you'll find that it's pretty easy to tweak the system while it's running. The other comparison points are less important, like the fact the both the systems get about the same uptime on my laptop (currently 13 days, 23 hours, 57 minutes, 10 seconds).

So of course I'm using Emacs to edit PostgreSQL .sql files, including stored procedures. And it so happens that line numbering in plpgsql is not as straightforward as one would naively think, to the point that we'd like to have better tool support there. So I've extended Emacs linum-mode minor mode to also display the line numbers as computed per PostgreSQL, and here's what it looks like:

Now, here's also the source code, dim-pgsql.el. Hope you'll enjoy!

 

Background writers

There's currently a thread on hackers about bg worker: overview and a series of 6 patches. Thanks a lot Markus! This is all about generalizing a concept already in use in the autovacuum process, where you have an independent subsystem that require having an autonomous daemon running and able to start its own workers.

I've been advocating about generalizing this concept for awhile already, in order to have postmaster able to communicate to subsystems when to shut down and start and reload, etc. Some external processes are only external because there's no need to include them by default in to the database engine, not because there's no sense to having them in there.

So even if Markus work is mainly about generalizing autovacuum so that he has a coordinator to ask for helper backends to handle broadcasting of writesets for Postgres-R, it still could be a very good first step towards something more general. What I'd like to see the generalization handle are things like PGQ, or the pgagent scheduler. In some cases, pgbouncer too.

What we're missing there is an API for everybody to be able to extend PostgreSQL with its own background processes and workers. What would such a beast look like? I have some preliminary thoughts about this in my Next Generation PostgreSQL article, but that's still early thoughts. The main idea is to steal as much as sensible from Erlang Generic Supervisor Behaviour, and maybe up to its Generic Finite State Machines behavior. In the Erlang world, a behavior is a generic process.

The FSM approach would allow for any user daemon to provide an initial state and register functions that would do some processing then change the state. My feeling is that if those functions are exposed at the SQL level, then you can talk to the daemon from anywhere (the Erlang ideas include a globally —cluster wide— unique name). Of course the goal would be to provide an easy way for the FSM functions to have a backend connected to the target database handle the work for it, or be able to connect itself. Then we'd need something else here, a way to produce events based on the clock. I guess relying on SIGALRM is a possibility.

I'm not sure about how yet, but I think getting back in consultancy after having opened 2ndQuadrant France has some influence on how I think about all that. My guess is that those blog posts are a first step on a nice journey!

 

Logs analysis

Nowadays to analyze logs and provide insights, the more common tool to use is pgfouine, which does an excellent job. But there has been some improvements in logs capabilities that we're not benefiting from yet, and I'm thinking about the CSV log format.

So the idea would be to turn pgfouine into a set of SQL queries against the logs themselves once imported into the database. Wait. What about having our next PostgreSQL version, which is meant (I believe) to include CSV support in SQL/MED, to directly expose its logs as a system view?

A good thing would be to expose that as a ddl-partitioned table following the log rotation scheme as setup in postgresql.conf, or maybe given in some sort of a setup, in order to support logrotate users. At least some facilities to do that would be welcome, and I'm not sure plain SQL/MED is that when it comes to source partitioning.

Then all that remains to be done is a set of SQL queries and some static or dynamic application to derive reports from there.

This is yet again an idea I have in mind but don't have currently time to explore myself, so I talk about it here in the hope that others will share the interest. Of course, now that I work at 2ndQuadrant, you can make it so that we consider the idea in more details, up to implementing and contributing it!

 

Using indexes as column store?

There's a big trend nowadays about using column storage as opposed to what PostgreSQL is doing, which would be row storage. The difference is that if you have the same column value in a lot of rows, you could get to a point where you have this value only once in the underlying storage file. That means high compression. Then you tweak the executor to be able to load this value only once, not once per row, and you win another huge source of data traffic (often enough, from disk).

Well, it occurs to me that maybe we could have column oriented storage support without adding any new storage facility into PostgreSQL itself, just using in new ways what we already have now. Column oriented storage looks somewhat like an index, where any given value is meant to appear only once. And you have links to know where to find the full row associated in the main storage.

There's a work in progress to allow for PostgreSQL to use indexes on their own, without having to get to the main storage for checking the visibility. That's known as the Visibility Map, which is still only a hint in released versions. The goal is to turn that into a crash-safe trustworthy source in the future, so that we get covering indexes. That means we can use an index and skip getting to the full row in main storage and get the visibility information there.

Now, once we have that, we could consider using the indexes in more queries. It could be a win to get the column values from the index when possible and if you don't output more columns from the heap, return the values from there. Scanning the index only once per value, not once per row.

There's a little more though on the point in the Next Generation PostgreSQL article I've been referencing already, should you be interested.

 

MVCC in the Cloud

At CHAR(10) Markus had a talk about Using MVCC for Clustered Database Systems and explained how Postgres-R does it. The scope of his project is to maintain a set of database servers in the same state, eventually.

Now, what does it mean to get "In the Cloud"? Well there are more than one answer I'm sure, mine would insist on including this "Elasticity" bit. What I mean here is that it'd be great to be able to add or lose nodes and stay online. Granted, that what's Postgres-R is providing. Does that make it ready for the "Cloud"? Well it happens so that I don't think so.

Once you have elasticity, you also want scalability. That could mean lots of thing, and Postgres-R already provides a great deal of it, at the connect and reads level: you can do your business unlimited on any node, the others will eventually (eagerly) catch-up, and you can do your select on any node too, reading from the same data set. Eventually.

What's still missing here is the hard sell, write scalability. This is the idea that you don't want to sustain the same write load on all the members of the "Cloud cluster". It happens that I have some idea about how to go on this, and this time I've been trying to write them down. You might be interested into the MVCC in the Cloud part of my Next Generation PostgreSQL notes.

My opinion is that if you want to distribute the data, this is a problem that falls in the category of finding the data on disk. This problem is already solved in the executor, it knows which operating system level file to open and where to seek inside that in order to find a row value for a given relation. So it should be possible to teach it that some relation's storage ain't local, to get the data it needs to communicate to another PostgreSQL instance.

I would call that a remote tablespace. It allows for distributing both the data and their processing, which could happen in parallel. Of course that means there's now some latency concerns, and that some JOIN will get slow if you need to retrieve the data from the network each time. For that what I'm thinking about is the possibility to manage a local copy of a remote tablespace, which would be a mirror tablespace. But that's for another blog post.

Oh, if that makes you think a lot of SQL/MED, that would mean I did a good enough job at explaining the idea. The main difference though would be to ensure transaction boundaries over the local and remote data: it's one single distributed database we're talking about here.

 

Back from CHAR(10)

It surely does not feel like a full month and some more went by since we were enjoying PGCon 2010, but in fact it was already the time for CHAR(10). The venue was most excellent, as Oxford is a very beautiful city. Also, the college was like a city in the city, and having the accomodation all in there really smoothed it all.

On a more technical viewpoint, the range of topics we talked about and the even broader one in the "Hall Track" make my mind full of ideas, again. So I'm preparing a quite lengthy article to summarise or present all those ideas, and I think a post series should cover the points in there. When trying to label things, it appears that my current obsessions are mainly about PostgreSQL in the Cloud and Further Optimising PostgreSQL, so that's what I'll be talking about those next days.

Meanwhile I'm going to search for existing solutions on how to use the Paxos algorithm to generate a reliable distributed sequence, using libpaxos for example. The goal would be to see if it's feasible to have a way to offer some global XID from a network of servers in a distributed fashion, ideally in such a way that new members can join in at any point, and of course that losing a member does not cause downtime for the online ones. It sounds like this problem has been extensively researched and is solved, either by the Global Communication Systems or the underlying algorithms. Given the current buy-in lack of our community for GCS my guess is that bypassing them would be a pretty good move, even if that mean implementing a limited form of GCS ourselves.

 

Back from PgCon2010

This year's edition has been the best pgcon ever for me. Granted, it's only my third time, but still :) As Josh said the "Hall Track" in particular was very good, and the Dev Meeting has been very effective!

Extensions

This time I prepared some slides to present the extension design and I tried hard to make it so that we get to agree on a plan, even recognizing it's not solving all of our problems from the get go. I had been talking about the concept and design with lots of people already, and continued to do so while in Ottawa on Monday evening and through all Tuesday. So Wednesday, I felt prepared. It proved to be a good thing, as I edited the slides with ideas from several people I had the chance to expose my ideas to! Thanks Greg Stark and Heikki Linnakangas for the part we talked about at the meeting, and a lot more people for the things we'll have to solve later (Hi Stefan!).

So the current idea for extensions is for the backend support to start with a file in `pg_config --sharedir`/extensions/foo/control containing the foo extension's metadata. From that we know if we can install an extension and how. Here's an example:

name = foo
version = 1.0
custom_variable_classes = 'foo'
depends  = bar (>= 1.1), baz
conflicts = bla (< 0.8)

The other files should be install.sql, uninstall.sql and foo.conf. The only command the user will have to type in order for using the extension in his database will then be:

  INSTALL EXTENSION foo;

For that to work all that needs to happen is for me to write the code. I'll keep you informed as soon as I get a change to resume my activities on the git branch I'm using. You can already find my first attempt at a pg_execute_from_file() function there.

Building atop that backend support we already have two gentlemen competing on features to offer to distribute and package extensions! That will complete the work just fine, thanks guys.

Hot Standby

Heikki's talk about Built-in replication in PostgreSQL 9.0 left me with lots of thinking. In particular it seems we need two projects out of core to complete what 9.0 has to offer, namely something very simple to prepare a base backup and something more involved to manage a pool of standbys.

pg_basebackup

The idea I had listening to the talk was that it might be possible to ask the server, in a single SQL query, for the list of all the files it's using. After all, there's those pg_ls_files() and pg_read_file() functions, we could put them to good use. I couldn't get the idea out of my head, so I had to write some code and see it running: pg_basebackup is there at github, grab a copy!

What it does is very simple, in about 100 lines of self-contained python code it get all the files from a running server through a normal PostgreSQL connection. That was my first recursive query. I had to create a new function to get the file contents as the existing one returns text, and I want bytea here, of course.

Note that the code depends on the bytea representation in use, so it's only working with 9.0 as of now. Can be changed easily though, send a patch or just ask me to do it!

Lastly, note that even if pg_basebackup will compress each chunk it sends over the libpq connection, it won't be your fastest option around. Its only advantage there is its simplicity. Get the code, run it with 2 arguments: a connection string and a destination directory. There you are.

wal proxy, wal relay

The other thing that we'll miss in 9.0 is the ability to both manage more than a couple of standby servers and to manage failover gracefully. Here the idea would be to have a proxy server acting as both a walreceiver and a walsender. Its role would be to both archive the WAL and relay them to the real standbys.

Then in case of master's failure, we could instruct this proxy to be fed from the elected new master (manual procedure), the other standbys not being affected. Well apart than apparently changing the timeline (which will happen as soon as you promote a standby to master) while streaming is not meant to be supported. So the proxy would also disconnect all the slaves and have them reconnect.

If we need such a finesse, we could have the restore_command on the standbys prepared so that it'll connect to the proxy's archive. Now on failover, the standbys are disconnected from the stream, get a WAL file with a new timeline from the archive, replay it, and reconnect.

That means that for a full HA scenario you could get on with three servers. You're back to two servers at failover time and need to rebuild the crashed master as a standby, running a base backup again.

If you've followed the idea, I hope you liked it! I still have to motivate some volunteers so that some work gets done here, as I'm probably not the one to ask to as far as coding this is concerned, if you want it out before 9.1 kicks in!

Queuing

We also had a nice Hall Tack session with Jan Wieck, Marko Kreen and Jim Nasby about how to get a single general (enough) queueing solution for PostgreSQL. It happens that the Slony queueing ideas made their way into PGQ and that we'd want to add some more capabilities to this one.

What we talked about was adding more interfaces (event producers, event format translating at both ends of the pipe) and optimising how many events from the past we keep in the queue for the subscribers, in a cascading environment.

It seems that the basic architecture of the queue is what PGQ 3 provides already, so it could even be not that much of a hassle to get something working out of the ideas exchanged.

Of course, one of those ideas has been discussed at the Dev Meeting, it's about deriving the transaction commit order from the place which already has the information rather than reconstructing it after the fact. We'll see how it goes, but it started pretty well with a design mail thread.

Other talks

I went to some other talks too, of course, unfortunately with an attention span far from constant. Between the social events (you should read that as beer drinking evenings) and the hall tracks, more than once my brain were less present than my body in the talks. I won't risk into commenting them here, but overall it was very good: in about each talk, new ideas popped into my head. And I love that.

Conclusion: I'm addicted.

The social aspect of the conference has been very good too. Once more, a warm welcome from the people that are central to the project, and who are so easily available for a chat about any aspect of it! Or just for sharing a drink.

Meeting our users is very important too, and pgcon allows for that also. I've met some people I'm used to talk to via IRC, and it was good fun sharing a beer over there.

All in all, I'm very happy I made it to Ottawa despite the volcano activity, there's so much happening over there! Thanks to all the people who made it possible by either organizing the conference or attending to it! See you next year, I'm addicted...

 

Import fixed width data with pgloader

So, following previous blog entries about importing fixed width data, from Postgres Online Journal and David (perl) Fetter, I couldn't resist following the meme and showing how to achieve the same thing with pgloader.

I can't say how much I dislike such things as the following, and I can't help thinking that non IT people are right looking at us like this when encountering such prose.

  map {s/\D*(\d+)-(\d+).*/$a.="A".(1+$2-$1). " "/e} split(/\n/,<<'EOT');

So, the pgloader way. First you need to have setup a database, I called it pgloader here. Then you need the same CREATE TABLE as on the original article, here is it for completeness:

CREATE TABLE places(usps char(2) NOT NULL,
    fips char(2) NOT NULL,
    fips_code char(5),
    loc_name varchar(64));

Now the data file I've taken here: http://www.census.gov/tiger/tms/gazetteer/places2k.txt.

Then we translate the file description into pgloader setup:

[pgsql]
host = localhost
port = 5432
base = pgloader
user = dim
pass = None

log_file            = /tmp/pgloader.log
log_min_messages    = DEBUG
client_min_messages = WARNING

client_encoding = 'latin1'
lc_messages         = C
pg_option_standard_conforming_strings = on

[fixed]
table           = places
format          = fixed
filename        = places2k.txt
columns         = *
fixed_specs     = usps:0:2, fips:2:2, fips_code:4:5, loc_name:9:64, p:73:9, h:82:9, land:91:14, water:105:14, ldm:119:14, wtm:131:14, lat:143:10, long:153:11

We're ready to import the data now:

dim ~/PostgreSQL/examples pgloader -vsTc pgloader.conf
pgloader     INFO     Logger initialized
pgloader     WARNING  path entry '/usr/share/python-support/pgloader/reformat' does not exists, ignored
pgloader     INFO     Reformat path is []
pgloader     INFO     Will consider following sections:
pgloader     INFO       fixed
pgloader     INFO     Will load 1 section at a time
fixed        INFO     columns = *, got [('usps', 1), ('fips', 2), ('fips_code', 3), ('loc_name', 4)]
fixed        INFO     Loading threads: 1
fixed        INFO     closing current database connection
fixed        INFO     fixed processing
fixed        INFO     TRUNCATE TABLE places;
pgloader     INFO     All threads are started, wait for them to terminate
fixed        INFO     COPY 1: 10000 rows copied in 5.769s
fixed        INFO     COPY 2: 10000 rows copied in 5.904s
fixed        INFO     COPY 3: 5375 rows copied in 3.187s
fixed        INFO     No data were rejected
fixed        INFO      25375 rows copied in 3 commits took 14.907 seconds
fixed        INFO     No database error occured
fixed        INFO     closing current database connection
fixed        INFO     releasing fixed semaphore
fixed        INFO     Announce it's over

Table name        |    duration |    size |  copy rows |     errors
====================================================================
fixed             |     14.901s |       - |      25375 |          0

Note the -T option is for TRUNCATE, which you only need when you want to redo the loading, I've come to always mention it in interactive usage. The -v option is for some more verbosity and the -s for the summary at end of operations.

With the pgloader.conf and places2k.txt in the current directory, and an empty table, just typing in pgloader at the prompt would have done the job.

Oh, the pg_option_standard_conforming_strings bit is from the git HEAD, the current released version has no support for setting any PostgreSQL knob yet. Still, it's not necessary here, so you can forget about it.

You will also notice that pgloader didn't trim the data for you, which ain't funny for the places column. That's a drawback of the fixed width format that you can work on two ways here, either by means of

UPDATE places SET loc_name = trim(loc_name) ;
or a custom reformat module for pgloader. I guess the latter solution is overkill, but it allows for pipe style processing of the data and a single database write.

Send me a mail if you want me to show here how to setup such a reformatting module in a next blog entry!

 

pgloader activity report

Yes. This pgloader project is still maintained and somewhat active. Development happens when I receive a complaint, either about a bug in existing code or a feature in yet-to-write code. If you have a bug to report, just send me an email!

If you're following the development of it, the sources just moved from CVS at pgfoundry to http://github.com/dimitri/pgloader. I will still put the releases at pgfoundry, and the existing binary packages maintenance should continue. See also the development version documentation, which contains not yet released stuff.

This time it's about new features, the goal being to open pgloader usage without describing all the file format related details into the pgloader.conf file. This time around, Simon is giving feedback and told me he would appreciate that pgloader would work more like the competition.

We're getting there with some new options. The first one is that rather than only Sections, now your can give a filename as an argument. pgloader will then create a configuration section for you, considering the file format to be CSV, setting columns = *. The default field separator is |, so you have also the -f, --field-separator option to set that from the command line.

As if that wasn't enough, pgloader now supports any PostgreSQL option either in the configuration file (prefix the real name with pg_option_) or on the command line, via the -o, --pg-options switch, that you can use more than once. Command line setting will take precedence over any other setup, of course. Consider for example -o standard_conforming_strings=on.

While at it, some more options can now be set on the command line, including -t, --section-threads and -m, --max-parallel-sections on the one hand and -r, --reject-log and -j, --reject-data on the other hand. Those two last must contain a %s place holder which will get replaced by the section name, or the filename if you skipped setting up a section for it.

Your pgloader usage is now more command line friendly than ever!

 

Finding orphaned sequences

This time we're having a database where sequences were used, but not systematically as a default value of a given column. It's mainly an historic bad idea, but you know the usual excuse with bad ideas and bad code: the first 6 months it's experimental, after that it's historic.

Still, here's a query for 8.4 that will allow you to list those sequences you have that are not used as a default value in any of your tables:

WITH seqs AS (
  SELECT n.nspname, relname as seqname
    FROM pg_class c
         JOIN pg_namespace n on n.oid = c.relnamespace
   WHERE relkind = 'S'
),
     attached_seqs AS (
  SELECT n.nspname,
         c.relname as tablename,
         (regexp_matches(pg_get_expr(d.adbin, d.adrelid), '''([^'']+)'''))[1] as seqname
    FROM pg_class c
         JOIN pg_namespace n on n.oid = c.relnamespace
         JOIN pg_attribute a on a.attrelid = c.oid
         JOIN pg_attrdef d on d.adrelid = a.attrelid
                            and d.adnum = a.attnum
                            and a.atthasdef
  WHERE relkind = 'r' and a.attnum > 0
        and pg_get_expr(d.adbin, d.adrelid) ~ '^nextval'
)

 SELECT nspname, seqname, tablename
   FROM seqs s
        LEFT JOIN attached_seqs a USING(nspname, seqname)
  WHERE a.tablename IS NULL;

I hope you don't need the query...

 

Getting out of SQL_ASCII, part 2

So, if you followed the previous blog entry, now you have a new database containing all the static tables encoded in UTF-8 rather than SQL_ASCII. Because if it was not yet the case, you now severely distrust this non-encoding.

Now is the time to have a look at properly encoding the live data, those stored in tables that continue to receive write traffic. The idea is to use the UPDATE facilities of PostgreSQL to tweak the data, and too fix the applications so as not to continue inserting badly encoded strings in there.

Finding non UTF-8 data

First you want to find out the badly encoded data. You can do that with this helper function that RhodiumToad gave me on IRC. I had a version from the archives before that, but the regexp was hard to maintain and quote into a PL function. This is avoided by two means, first one is to have a separate pure SQL function for the regexp checking (so that you can index it should you need to) and the other one is to apply the regexp to hex encoded data. Here we go:

create or replace function public.utf8hex_valid(str text)
 returns boolean
 language sql immutable
as $f$
   select $1 ~ $r$(?x)
                  ^(?:(?:[0-7][0-9a-f])
                     |(?:(?:c[2-9a-f]|d[0-9a-f])
                        |e0[ab][0-9a-f]
                        |ed[89][0-9a-f]
                        |(?:(?:e[1-9abcef])
                           |f0[9ab][0-9a-f]
                           |f[1-3][89ab][0-9a-f]
                           |f48[0-9a-f]
                          )[89ab][0-9a-f]
                       )[89ab][0-9a-f]
                    )*$
                $r$;
$f$;

Now some little scripting around it in order to skip intense manual and boring work (and see, some more catalog queries). Don't forget we will have to work on a per-column basis here...

create or replace function public.check_encoding_utf8
 (
   IN schemaname text,
   IN tablename  text,
  OUT relname    text,
  OUT attname    text,
  OUT count      bigint
 )
 returns setof record
 language plpgsql
as $f$
DECLARE
  v_sql text;
BEGIN
  FOR relname, attname
   IN SELECT c.relname, a.attname
        FROM pg_attribute a
             JOIN pg_class c on a.attrelid = c.oid
             JOIN pg_namespace s on s.oid = c.relnamespace
             JOIN pg_roles r on r.oid = c.relowner
       WHERE s.nspname = schemaname
         AND atttypid IN (25, 1043) -- text, varchar
         AND relkind = 'r'          -- ordinary table
         AND r.rolname = 'some_specific_role'
         AND CASE WHEN tablename IS NOT NULL
                  THEN c.relname ~ tablename
                  ELSE true
              END
  LOOP
    v_sql := 'SELECT count(*) '
          || '  FROM ONLY '|| schemaname || '.' || relname
          || ' WHERE NOT public.utf8hex_valid(encode(textsend('
          || attname
          || '), ''hex''))';

    -- RAISE NOTICE 'Checking: %.%', relname, attname;
    -- RAISE NOTICE 'SQL: %', v_sql;
    EXECUTE v_sql INTO count;
    RETURN NEXT;
  END LOOP;
END;
$f$;

Note that the tablename is compared using the ~ operator, so that's regexp matching there too. Also note that I wanted only to check those tables that are owned by a specific role, your case may vary.

The way I used this function was like this:

create table leon.check_utf8 as
 select *
   from public.check_encoding_utf8();

Then you need to take action on those lines in leon.check_utf8 table which have a count > 0. Rince and repeat, but you may soon realise building the table over and over again is costly.

Cleaning up the data

Up for some more helper tools? Unless you really want to manually fix this huge amount of columns where some data ain't UTF-8 compatible... here's some more:

create or replace function leon.nettoyeur
 (
  IN  action      text,
  IN  encoding    text,
  IN  tablename   text,
  IN  columname   text,

  OUT orig        text,
  OUT utf8        text
 )
 returns setof record
 language plpgsql
as $f$
DECLARE
  p_convert text;
BEGIN
  IF encoding IS NULL
  THEN
    p_convert := 'translate('
              || columname || ', '
              || $$'\211\203\202'$$
              || ', '
              || $$'   '$$
              || ') ';
  ELSE
    -- in 8.2, write convert using, in 8.3, the other expression
    -- p_convert := 'convert(' || columname || ' using ' || conversion || ') ';
    p_convert := 'convert(textsend(' || columname || '), '''|| encoding ||''', ''utf-8'' ) ';
  END IF;

  IF action = 'select'
  THEN
    FOR orig, utf8
     IN EXECUTE 'SELECT ' || columname || ', '
         || p_convert
         || '  FROM ONLY ' || tablename
         || ' WHERE not public.utf8hex_valid('
         || 'encode(textsend('|| columname ||'), ''hex''))'
    LOOP
      RETURN NEXT;
    END LOOP;

  ELSIF action = 'update'
  THEN
    EXECUTE 'UPDATE ONLY ' || tablename
         || ' SET ' || columname || ' = ' || p_convert
         || ' WHERE not public.utf8hex_valid('
         || 'encode(textsend('|| columname ||'), ''hex''))';

    FOR orig, utf8
     IN SELECT *
          FROM leon.nettoyeur('select', encoding, tablename, columname)
    LOOP
      RETURN NEXT;
    END LOOP;

  ELSE
    RAISE EXCEPTION 'Léon, Nettoyeur, veut de l''action.';

  END IF;
END;
$f$;

As you can see, this function allows to check the conversion process from a given supposed encoding before to actually convert the data in place. This is very useful as even when you're pretty sure the non-utf8 data is latin1, sometime you find it's windows-1252 or such. So double check before telling leon.nettoyeur() to update your precious data!

Also, there's a facility to use translate() when none of the encoding match your expectations. This is a skeleton just replacing invalid characters with a space, tweak it at will!

Conclusion

Enjoy your clean database now, even if it still accepts new data that will probably not pass the checks, so we still have to be careful about that and re-clean every day until the migration is effective. Or maybe add a CHECK clause that will reject badly encoded data...

In fact here we're using Londiste to replicate the live data from the old to the new server, and that means the replication will break each time there's new data written in non-utf8, as the new server is running 8.4, which by design ain't very forgiving. Our plan is to clean-up as we go (remove table from the subscriber, fix it, add it again) and migrate as soon as possible!

Bonus points to those of you getting the convoluted reference :)

 

Getting out of SQL_ASCII, part 1

It happens that you have to manage databases designed by your predecessor, and it even happens that the team used to not have a DBA. Those histerical raisins can lead to having a SQL_ASCII database. The horror!

What SQL_ASCII means, if you're not already familiar with the consequences of such a choice, is that all the text and varchar data that you put in the database is accepted as-is. No checks. At all. It's pretty nice when you're lazy enough to not dealing with strange errors in your application, but if you think that t's a smart move, please go read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky now. I said now, I'm waiting for you to get back here. Yes, I'll wait.

The problem of course is not being able to read the data you just stored, which is seldom the use case anywhere you use a database solution such as PostgreSQL.

Now, it happens too that it's high time to get off of SQL_ASCII, the infamous. In our case we're lucky enough in that the data are all in fact latin1 or about that, and this comes from the fact that all the applications connecting to the database are sharing some common code and setup. Then we have some tables that can be tagged archives and some other live. This blog post will only deal with the former category.

For those tables that are not receiving changes anymore, we will migrate them by using a simple but time hungry method: COPY OUT|recode|COPY IN. I've tried to use iconv for recoding our data, but it failed to do so in lots of cases, so I've switched to using the GNU recode tool, which works just fine.

The fact that it takes so much time doing the conversion is not really a problem here, as you can do it offline, while the applications are still using the SQL_ASCII database. So, here's the program's help:

recode.sh [-npdf0TI] [-U user ] -s schema [-m mintable] pattern
        -d    debug
        -n    dry run, only print table names and expected files
        -s    schema
        -m    mintable, to skip already processed once
        -U    connect to PostgreSQL as user
        -f    force table loading even when export files do exist
        -0    only (re)load tables with zero-sized copy files
        -T    Truncate the tables before COPYing recoded data
        -I    Temporarily drop the indexes of the table while COPYing
   pattern    ^table_name_, e.g.

The -I option is neat enough to create the indexes in parallel, but with no upper limit on the number of index creation launched. In our case it worked well, so I didn't have to bother.

Take a look at the recode.sh script, and don't hesitate editing it for your purpose. It's missing some obvious options to get useful in the large, such as the recode request which is currently hardcoded to l1..utf8. If there's any demand about it, I'll setup a GitHub project for the little script.

We'll get back to the subject of this entry in part 2, dealing with how to recode your data in the database itself, thanks to some insane regexp based queries and helper functions. And thanks to a great deal of IRC based helping, too.

 

Resetting sequences. All of them, please!

So, after restoring a production dump with intermediate filtering, none of our sequences were set to the right value. I could have tried to review the process of filtering the dump here, but it's a one-shot action and you know what that sometimes mean. With some pressure you don't script enough of it and you just crawl more and more.

Still, I think how I solved it is worthy of a blog entry. Not that it's about a super unusual clever trick, quite the contrary, because questions involving this trick are often encountered on the support IRC.

The idea is to query the catalog for all sequences, and produce from there the SQL command you will have to issue for each of them. Once you have this query, it's quite easy to arrange from the psql prompt as if you had dynamic scripting capabilities. Of course in 9.0 you will have inline anonymous DO blocks.

#> \o /tmp/sequences.sql
#> \t
Showing only tuples.
#> YOUR QUERY HERE
#> \o
#> \t
Tuples only is off.

Once you have the /tmp/sequences.sql file, you can ask psql to execute its command as you're used to, that's using \i in an explicit transaction block.

Now, the interresting part if you got here attracted by the blog entry title is in fact the query itself. A nice way to start is to \set ECHO_HIDDEN then describe some table, you now have a catalog example query to work with. Then you tweak it somehow and get this:

  SELECT 'select '
          || trim(trailing ')'
             from replace(pg_get_expr(d.adbin, d.adrelid),
                          'nextval', 'setval'))
          || ', (select max( ' || a.attname || ') from only '
          || nspname || '.' || relname || '));'
    FROM pg_class c
         JOIN pg_namespace n on n.oid = c.relnamespace
         JOIN pg_attribute a on a.attrelid = c.oid
         JOIN pg_attrdef d on d.adrelid = a.attrelid
                            and d.adnum = a.attnum
                            and a.atthasdef
  WHERE relkind = 'r' and a.attnum > 0
        and pg_get_expr(d.adbin, d.adrelid) ~ '^nextval';

Coming next, a recode based script in order to get from SQL_ASCII to UTF-8, and some strange looking queries too.

recode.sh [-npdf0TI] [-U user ] -s schema [-m mintable] pattern

Stay tuned!

 

pg_staging's bird view

One of the most important feedback I got about the presentation of pgstaging were the lack of pictures, something like a bird-view of how you operate it. Well, thanks to ditaa and Emacs picture-mode here it is:

Hope you enjoy, it should not be necessary to comment much if I got to the point!

Of course I commited the text source file to the Git repository. The only problem I ran into is that ditaa defaults to ouputing a quite big right margin containing only white pixels, and that didn't fit well, visually, in this blog. So I had to resort to ImageMagik crop command in order to avoid any mouse usage in the production of this diagram.

convert .../pg_staging/bird-view.png -crop '!550' bird-view.png
mv bird-view-0.png pg_staging.png

Quicker than learning to properly use a mouse, at least for me :)

 

PGday.eu feedback

At pgday there was this form you could fill to give speakers some feedback about their talks. And that's a really nice way as a speaker to know what to improve. And as Magnus was searching a nice looking chart facility in python and I spoke about matplotlib, it felt like having to publish something.

Here is my try at some nice graphics. Well I'll let you decide how nice the result is:

If you want to see the little python script I used, here it is: feedback.py, with the data embedded and all...

Now, how to read it? Well, the darker the color the better the score. For example I had 5 people score me 5 for Topic Importance on the Hi-Media talk (in french) and only 3 people at this same score and topic for pg_staging talk. The scores are from 1 to 5, 5 being the best.

The comitee accepted interesting enough topics and it seems I managed to deliver acceptable content from there. Not very good content, because reading the comments I missed some nice birds-eye pictures to help the audience get into the subject. As I'm unable to draw (with or without a mouse) I plan to fix this in latter talks by using ditaa, the DIagrams Through Ascii Art tool. I already used it and together with Emacs picture-mode it's very nice.

Oh yes the baseline of this post is that there will be later talks. I seem to be liking those and the audience feedback this time is saying that it's not too bad for them. See you soon :)

 

prefix 1.1.0

So I had two bug reports about prefix in less than a week. It means several things, one of them is that my code is getting used in the wild, which is nice. The other side of the coin is that people do find bugs in there. This one is about the behavior of the btree opclass of the type prefix range. We cheat a lot there by simply having written one, because a range does not have a strict ordering: is [1-3] before of after [2-4]? But when you know you have no overlapping intervals in your prefix_range column, being able to have it part of a primary key is damn useful.

Note: in 8.5 we should have a way to express contraint exclusion and have PostgreSQL forbids overlapping entries for us. Not being there yet, you could write a constraint trigger and use the GiST index to have nice speed there, which is exactly what this constraint exclusion support is about.

It turns out the code change required is pretty simple:

-    return (a->first == b->first) ? (a->last - b->last) : (a->first - b->first);
+    /*
+     * we are comparing e.g. '1' and '12' (the shorter contains the
+     * smaller), so let's pretend '12' < '1' as it contains less elements.
+     */
+    return (alen == mlen) ? 1 : -1;

This happens in the compare support function (see Interfacing Extensions To Indexes) so that means you now have to rebuild your prefix_range btree indexes, hence the version number bump.

 

Yet Another PostgreSQL tool hits debian

So there it is, this newer contribution of mine that I presented at PGDay is now in debian NEW queue. pg_staging will empower you with respect to what you do about those nightly backups (pg_dump -Fc or something).

The tool provides a lot of commands to either dump or restore a database. It comes with documentation covering about it all, except for the londiste support part, which will be there in time for 1.0.0 release. The Todo list is getting smaller and smaller, the version you'll soon find in debian sid is already called 0.9.

So, how do you go about using this software, and what service it implements?

it's all about deriving a staging environment from your backups

To validate backups, you want to restore them and check the database you get from them. And your developers will want to sometime refresh the database they're working with. And you could have both an integration environment and a pre-live one: On the former, you develop new code atop a stable set of data; while on the latter you test stable enough code (ready to go live) on a set of data as near as live data as possible.

And you want to be flexible about it, so that there's not a fulltime job to handle retoring databases each and every days, for project A integration or project B pre-live testing, or project C accounting snapshot. Or you name it.

And of course you want to have a single point of control of all your databases. Let's call it the controler.

setting up pg_staging

The pg_staging setup consists of one pg_staging.ini file wherein you describe your different target databases (those dev and prelive ones), and of course where to get the production backups from. Currently you have to serve the backups file in a format suitable for pg_restore (that means you use either pg_dump -Ft or pg_dump -Fc) on an apache folder. The produced HTML will get parsed.

So you setup the DEFAULT section with common settings, then one section per target: the databases you want to restore. Tell pg_staging where they are (host), etc, and it'll be able to drive them.

In order to being able to host more than a single restored dump on a staging server, for the same database, we use pgbouncer:

pg_staging> pgbouncer some_db.dev
              some_db      some_db_20091029 :5432
     some_db_20090717      some_db_20090717 :5432
     some_db_20091029      some_db_20091029 :5432

So as explained into the pg_staging(1) man page, you have to open non-interactive SSH connection from the controler to the hosts where the databases will get restored. Then you have to do a minimal setup pgbouncer on the hosts with a trust connection. It'll get used from pg_staging for adding newly restored database and have them accessible. Then you can also switch the new database to being the virtual some_db so that you avoid editing any connection string on your softwares.

Also, install the pgstaging-client package on every host you target. The client is a simple shell script that must run as root (sudo is used) in order to replace your pgbouncer setup or manage your londiste services.

See man 5 pg_staging for available options, including schemas to filter out either completely or just skipping data restoring in those.

pg_staging usage

Now you're all setup, you can begin to enjoy using pgstaging. Enter the console and see what you have in there.

$ pg_staging
Welcome to pg_staging 0.9.
pg_staging> databases
...
pg_staging> restore some_db.dev
...
pg_staging> pgbouncer some_db.dev
...
pg_staging> dbsizes --all some_db.dev
...
pg_staging> psql some_db.dev
some_db_20091125=#

And as you can see in man pg_staging there are a lot of commands already. You can for example obtain a new pg_restore catalog from a dump file, with some schemas commented out. It will even comment out triggers that are using a function which is defined in a filtered out schema, for example a PGQ trigger. And much much more.

pg_staging will even allow you to dump your production databases, but consider installing a separate instance of it on the machine serving the backups to your local network thanks to an apache directory listing!

Roadmap to 1.0.0

What's remain to be done is testing and having PITR based restoring to work, and adding some documentation (tutorial, which this blog post about is; and londiste support). At this point, unless some reader here asks for a new feature (set), I'll consider pg_staging ready for 1.0.0. After all, we're using it about daily here :)

Consider commenting, you should be able to easily spot my private mail address...

 

PGDay.eu, Paris: it was awesome!

PGDay.eu was held this week-end in Paris, and it really was a great moment. Lots of attendees, lots of quality talks (slides are online), good food, great party: all the ingredients were there!

It also was for me the occasion to first talk about this tool I've been working on for months, called pg_staging, which aims to empower those boring production backups to help maintaining staging environments (for your developers and testers).

All in all such events keep reminding me what it means exactly when we way that one of the greatest things about PostgreSQL is its community. If you don't know what I'm talking about, consider joining!

 

prefix 1.0.0

So there it is, at long last, the final 1.0.0 release of prefix! It's on its way into the debian repository (targetting sid, in testing in 10 days) and available on pgfoundry to.

In order to make it clear that I intend to maintain this version, the number has 3 digits rather than 2... which is also what PostgreSQL users will expect.

The only last minute change is that you can now use the first version of the two following rather than the second one:

-  create index idx_prefix on prefixes using gist(prefix gist_prefix_range_ops);
+  create index idx_prefix on prefixes using gist(prefix);

For you information, I'm thinking about leaving pgfoundry as far as the source code management goes, because I'd like to be done with CVS. I'd still use the release file hosting though at least for now. It's a burden but it's easier for the users to find them, when they are not using plain apt-get install. That move would lead to host prefix and pgloader and the backports over there at github, where my next pet project, pg_staging, will be hosted too.

The way to see this pgfoundry leaving is that if everybody does the same, then migrating the facility to some better or more recent hosting software will be easier. Maybe some other parts of the system are harder than the sources to migrate, though. If that's the case I'll consider moving them out too, maybe getting listed on the PostgreSQL Software Catalogue will prove enough as far as web presence goes?

 

hstore-new & preprepare reach debian too

It seems like debian developers are back from annual conference and holiday, so they have had a look at the NEW queue and processed the packages in there. Two of them were mines and waiting to get in unstable, hstore-new and preprepare.

Time to do some bug fixing already, as hstore-new packaging is using a bash'ism I shouldn't rely on (or so the debian buildfarm is telling me) and for preprepare I was waiting for inclusion before to go improving the GUC management, stealing some code from Selena's pgGearman :)

As some of you wonder about prefix 1.0 scheduling, it should soon get there now it's been in testing long enough and no bug has been reported. Of course releasing 1.0 in august isn't good timing, so maybe I should just wait some more weeks.

 

prefix 1.0~rc2 in debian testing

At long last, here it is. With binary versions both for postgresal-8.3 and postgresal-8.4! Unfortunately my other packaging efforts are still waiting on the NEW queue, but I hope to soon see hstore-new and preprepare enter debian too.

Anyway, the plan for prefix is to now wait something like 2 weeks, then, baring showstopper bugs, release the 1.0 final version. If you have a use for it, now is the good time for testing it!

About upgrading a current prefix installation, the advice is to save data as text instead of prefix_range, remove prefix support, install new version, change again the columns data type:

BEGIN;
  ALTER TABLE foo
     ALTER COLUMN prefix
             TYPE text USING text(prefix);

  DROP TYPE prefix_range CASCADE;
  \i prefix.sql

  ALTER TABLE foo
     ALTER COLUMN prefix
             TYPE prefix_range USING prefix_range(prefix);

  CREATE INDEX idx_foo_prefix ON foo
         USING gist(prefix gist_prefix_range_ops);
COMMIT;

Note: I just added the gist_prefix_range_ops as default for type prefix_range so it'll be optional to specify this in final 1.0. I got so used to typing it I didn't realize we don't have to :)

 

prefix 1.0~rc2-1

I've been having problem with building both postgresql-8.3-prefix and postgresql-8.4-prefix debian packages from the same source package, and fixing the packaging issue forced me into modifying the main prefix Makefile. So while reaching rc2, I tried to think about missing pieces easy to add this late in the game: and there's one, that's a function length(prefix_range), so that you don't have to cast to text no more in the following wildspread query:

  SELECT foo, bar
    FROM prefixes
   WHERE prefix @> '012345678'
ORDER BY length(prefix) DESC
   LIMIT 1;

And here's a simple stupid benchmark of the new function, here in prefix-1.0~rc2.tar.gz. And it'll soon reach debian, if my QA dept agrees (my sponsor is a QA dept all by himself!).

First some preparation:

dim=#   create table prefixes (
dim(#          prefix    prefix_range primary key,
dim(#          name      text not null,
dim(#          shortname text,
dim(#          status    char default 'S',
dim(#
dim(#          check( status in ('S', 'R') )
dim(#   );
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "prefixes_pkey" for
 table "prefixes"
CREATE TABLE
Time: 74,357 ms
dim=#   \copy prefixes from 'prefixes.fr.csv' with delimiter ; csv quote '"'
Time: 200,982 ms
dim=# select count(*) from prefixes ;
 count
-------
 11966
(1 row)
Time: 3,047 ms

And now for the micro-benchmark:

dim=# \o /dev/null
dim=# select length(prefix) from prefixes;
Time: 16,040 ms
dim=# select length(prefix::text) from prefixes;
Time: 23,364 ms
dim=# \o

Hope you enjoy!

 

prefix extension reaches 1.0 (rc1)

At long last, after millions and millions of queries just here at work and some more in other places, the prefix project is reaching 1.0 milestone. The release candidate is getting uploaded into debian at the moment of this writing, and available at the following place: prefix-1.0~rc1.tar.gz.

If you have any use for it (as some VoIP companies have already), please consider testing it, in order for me to release a shiny 1.0 next week! :)

Recent changes include getting rid of those square brackets output when it's not neccesary, fixing btree operators, adding support for more operators in the GiST support code (now supported: @>, <@, =, &&). Enjoy!

 

PgCon 2009

I can't really compare PgCon 2009 with previous years versions, last time I enjoyed the event it was in 2006, in Toronto. But still I found the experience to be a great one, and I hope I'll be there next year too!

I've met a lot of known people in the community, some of them I already had the chance to run into at Toronto or Prato, but this was the first time I got to talk to many of them about interresting projects and ideas. That only was awesome already, and we also had a lot of talks to listen to: as others have said, it was really hard to get to choose to go to only one place out of three.

I'm now back home and seems to be recovering quite fine from jet lag, and I even begun to move on the todo list from the conference. It includes mainly Skytools 3 testing and contributions (code and documentation), Extension Packaging work (Stephen Frost seems to be willing to help, which I highly appreciate) begining with search_path issues, and posting some backtrace to help fix some SPI_connect() bug at _PG_init() time in an extension.

The excellent lightning talk about How not to Review a Patch by Joshua Tolley took me out of the dim, I'll try to be bright enough and participate as a reviewer in later commit fests (well maybe not the first next ones as some personal events on the agenda will take all my "free" time)...

Oh and the Golconde presentation gave some insights too: this queueing based solution is to compare to the listen/notify mechanisms we already have in PostgreSQL, in the sense that's it's not transactional, and the events are kept in memory only to achieve very high distribution rates. So it's a very fine solution to manage a distributed caching system, for example, but not so much for asynchronous replication (you need not to replicate events tied to rollbacked transactions).

So all in all, spending last week in Ottawa was a splendid way to get more involved in the PostgreSQL community, which is a very fine place to be spending ones free time, should you ask me. See you soon!

 

Prepared Statements and pgbouncer

On the performance mailing list, a recent thread drew my attention. It devired to be about using a connection pool software and prepared statements in order to increase scalability of PostgreSQL when confronted to a lot of concurrent clients all doing simple select queries. The advantage of the pooler is to reduce the number of backends needed to serve the queries, thus reducing PostgreSQL internal bookkeeping. Of course, my choice of software here is clear: PgBouncer is an excellent top grade solution, performs real well (it won't parse queries), reliable, flexible.

The problem is that while conbining pgbouncer and prepared statements is possible, it requires the application to check at connection time if the statements it's interrested in are already prepared. This can be done by a simple catalog query of this kind:

  SELECT name
    FROM pg_prepared_statements
   WHERE name IN ('my', 'prepared', 'statements');

Well, this is simple but requires to add some application logic. What would be great would be to only have to EXECUTE my_statement(x, y, z) and never bother if the backend connection is a fresh new one or an existing one, as to avoid having to check if the application should prepare.

The preprepare pgfoundry project is all about this: it comes with a prepare_all() function which will take all statements present in a given table (SET preprepare.relation TO 'schema.the_table';) and prepare them for you. If you now tell pgbouncer to please call the function at backend creation time, you're done (see connect_query).

There's even a detailed README file, but no release yet (check out the code in the CVS, pgfoundry project page has clear instruction about how to do so.

 

Skytools 3.0 reaches alpha1

It's time for Skytools news again! First, we did improve documentation of current stable branch with hosting high level presentations and tutorials on the PostgreSQL wiki. Do check out the Londiste Tutorial, it seems that's what people hesitating to try out londiste were missing the most.

The other things people miss out a lot in current stable Skytools (version 2.1.9 currently) are cascading replication (which allows for switchover and failover) and DDL support. The new incarnation of skytools, version 3.0 reaches alpha1 today. It comes with full support for cascading and DDL, so you might want to give it a try.

It's a rough release, documentation is still to get written for a large part of it, and bugs are still to get fixed. But it's all in the Skytools spirit: simple and efficient concepts, easy to use and maintain. Think about this release as a developer preview and join us :)

 

Prefix GiST index now in 8.1

The prefix project is about matching a literal against prefixes in your table, the typical example being a telecom routing table. Thanks to the excellent work around generic indexes in PostgreSQL with GiST, indexing prefix matches is easy to support in an external module. Which is what the prefix extension is all about.

Maybe you didn't come across this project before, so here's the typical query you want to run to benefit from the special indexing, where the @> operator is read contains or is a prefix of:

  SELECT * FROM prefixes WHERE prefix @> '0123456789';

Now, a user asked about an 8.1 version of the module, as it's what some distributions ship (here, Red Hat Enterprise Linux 5.2). It turned out it was easy to support 8.1 when you already support 8.2, so the CVS now hosts 8.1 support code. And here's what the user asking about the feature has to say:

It's works like a charm now with 3ms queries over 200,000+ rows. The speed also stays less than 4ms when doing complex queries designed for fallback, priority shuffling, and having multiple carriers.

 

Importing XML content from file

The problem was raised this week on IRC and this time again I felt it would be a good occasion for a blog entry: how to load an XML file content into a single field?

The usual tool used to import files is COPY, but it'll want each line of the file to host a text representation of a database tuple, so it doesn't apply to the case at hand. RhodiumToad was online and offered the following code to solve the problem:

create or replace function xml_import(filename text)
  returns xml
  volatile
  language plpgsql as
$f$
    declare
        content bytea;
        loid oid;
        lfd integer;
        lsize integer;
    begin
        loid := lo_import(filename);
        lfd := lo_open(loid,262144);
        lsize := lo_lseek(lfd,0,2);
        perform lo_lseek(lfd,0,0);
        content := loread(lfd,lsize);
        perform lo_close(lfd);
        perform lo_unlink(loid);

        return xmlparse(document convert_from(content,'UTF8'));
    end;
$f$;

As you can see, the trick here is to use the large objects API to load the file content into memory (content variable), then to parse it knowing it's an UTF8 encoded XML file and return an XML datatype object.

 

Asko Oja talks about Skype architecture

In this russian page you'll see a nice presentation of Skype databases architectures by Asko Oja himself. It's the talk at Russian PostgreSQL Community meeting, October 2008, Moscow, and it's a good read.

The presentation page is in russian but the slides are in English, so have a nice read!

 

Skytools ticker daemon and londiste

One of the difficulties in getting to understand and configure londiste reside in the relation between the ticker and the replication. This question was raised once more on IRC yesterday, so I made a new FAQ entry about it: How do this ticker thing relates to londiste?

 

Comparing Londiste and Slony

In the page about Skytools I've encouraged people to ask some more questions in order for me to be able to try and answer them. That just happened, as usual on the #postgresql IRC, and the question is What does londiste lack that slony has?

 

Controling HOT usage in 8.3

As it happens, I've got some environments where I want to make sure HOT (aka Heap Only Tuples) is in use. Because we're doing so much updates a second that I want to get sure it's not killing my database server. I not only wrote some checking view to see about it, but also made a quick article about it in the French PostgreSQL website. Handling around in #postgresql means that I'm now bound to write about it in English too!

So HOT will get used each time you update a row without changing an indexed value of it, and the benefit is skipping index maintenance, and as far as I understand it, easying vacuum hard work too. To get the benefit, HOT will need some place where to put new version of the UPDATEd tuple in the same disk page, which means you'll probably want to set your table fillfactor to something much less than 100.

Now, here's how to check you're benefitting from HOT:

SELECT schemaname, relname,
       n_tup_upd,n_tup_hot_upd,
       case when n_tup_upd > 0
            then ((n_tup_hot_upd::numeric/n_tup_upd::numeric)*100.0)::numeric(5,2)
            else NULL
       end AS hot_ratio

 FROM pg_stat_all_tables;

 schemaname | relname | n_tup_upd | n_tup_hot_upd | hot_ratio
------------+---------+-----------+---------------+-----------
 public     | table1  |         6 |             6 |    100.00
 public     | table2  |   2551200 |       2549474 |     99.93

Here's even an extended version of the same request, displaying the fillfactor option value for the tables you're inquiring about. This comes separated from the first example because you get the fillfactor of a relation into the pg_class catalog reloptions field, and to filter against a schema qualified table name, you want to join against pg_namespace too.

SELECT t.schemaname, t.relname, c.reloptions,
       t.n_tup_upd, t.n_tup_hot_upd,
       case when n_tup_upd > 0
            then ((n_tup_hot_upd::numeric/n_tup_upd::numeric)*100.0)::numeric(5,2)
            else NULL
        end AS hot_ratio
FROM pg_stat_all_tables t
      JOIN (pg_class c JOIN pg_namespace n ON c.relnamespace = n.oid)
        ON n.nspname = t.schemaname AND c.relname = t.relname

 schemaname | relname |   reloptions    | n_tup_upd | n_tup_hot_upd | hot_ratio
------------+---------+-----------------+-----------+---------------+-----------
 public     | table1  | {fillfactor=50} |   1585920 |       1585246 |     99.96
 public     | table2  | {fillfactor=50} |   2504880 |       2503154 |     99.93

Don't let the HOT question affect your sleeping no more!

 

Londiste Trick

So, you're using londiste and the ticker has not been running all night long, due to some restart glitch in your procedures, and the on call admin didn't notice the restart failure. If you blindly restart the replication daemon, it will load in memory all those events produced during the night, at once, because you now have only one tick where to put them all.

The following query allows you to count how many events that represents, with the magic tick numbers coming from pgq.subscription in columns sub_last_tick and sub_next_tick.

SELECT count(*)
  FROM pgq.event_1,
      (SELECT tick_snapshot
         FROM pgq.tick
        WHERE tick_id BETWEEN 5715138 AND 5715139
      ) as t(snapshots)
WHERE txid_visible_in_snapshot(ev_txid, snapshots);

In our case, this was more than 5 millions and 400 thousands of events. With this many events to care about, if you start londiste, it'll eat as many memory as needed to have them all around, which might be more that what your system is able to give it. So you want a way to tell londiste not to load all events at once. Here's how: add the following knob to your .ini configuration file before to restart the londiste daemon:

    pgq_lazy_fetch = 500

Now, londiste will lazyly fetch 500 events at once or less, even if a single batch (which contains all events between two ticks) contains a huge number of events. This number seems a good choice as it's the default PGQ setting of number of events in a single batch. This number is only outgrown when the ticker is not running or when you're producing more events than that in a single transaction.

Hope you'll find the tip useful!

 

Fake entry

This is a test of a fake entry to see how muse will manage this.

With some SQL inside:

SELECT * FROM planet.postgresql.org WHERE author = "dim";