<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
  <channel>
    <title>tail -f /dev/dim</title>
    <link>http://blog.tapoueh.org/blog.dim.html</link>
    <description>dim's PostgreSQL blog</description>
    <language>en-us</language>
    <generator>Emacs Muse</generator>

<item>
<title> Happy Numbers</title>
<link>http://blog.tapoueh.org/blog.dim.html#%20Happy%20Numbers</link>
<description><![CDATA[
<p><a name="20100830-11:00" id="20100830-11:00"></a>
After discovering the excellent <a href="http://gwene.org/">Gwene</a> service, which allows you to subscribe
to <em>newsgroups</em> to read <code>RSS</code> content (<em>blogs</em>, <em>planets</em>, <em>commits</em>, etc), I came to
read this nice article about <a href="http://programmingpraxis.com/2010/07/23/happy-numbers/">Happy Numbers</a>. That's a little problem that
fits an interview-style question well, so I first solved it yesterday
evening in <a href="static/happy-numbers.el">Emacs Lisp</a> as that's the language I use the most these days.</p>

<blockquote>
<p class="quoted">
A happy number is defined by the following process. Starting with any
positive integer, replace the number by the sum of the squares of its
digits, and repeat the process until the number equals 1 (where it will
stay), or it loops endlessly in a cycle which does not include 1. Those
numbers for which this process ends in 1 are happy numbers, while those
that do not end in 1 are unhappy numbers (or sad numbers).</p>

</blockquote>

<p>Now, what about implementing the same in pure <code>SQL</code>, for more fun? Now that's
interesting! After all, we didn't get <code>WITH RECURSIVE</code> for tree traversal
only, <a href="http://archives.postgresql.org/message-id/e08cc0400911042333o5361b21cu2c9438f82b1e55ce@mail.gmail.com">did we</a>?</p>

<p>Unfortunately, we need a little helper function first, if only to ease the
reading of the recursive query. I didn't try to inline it, but here it goes:</p>

<pre class="src">
<span style="color: #729fcf; font-weight: bold;">create</span> <span style="color: #729fcf; font-weight: bold;">or</span> replace <span style="color: #729fcf; font-weight: bold;">function</span> <span style="color: #edd400; font-weight: bold; font-style: italic;">digits</span>(x bigint)
  <span style="color: #729fcf; font-weight: bold;">returns</span> setof <span style="color: #8ae234; font-weight: bold;">int</span>
  <span style="color: #729fcf; font-weight: bold;">language</span> <span style="color: #729fcf; font-weight: bold;">sql</span>
<span style="color: #729fcf; font-weight: bold;">as</span> $$
  <span style="color: #729fcf; font-weight: bold;">select</span> <span style="color: #729fcf;">substring</span>($1::text <span style="color: #729fcf; font-weight: bold;">from</span> i <span style="color: #729fcf; font-weight: bold;">for</span> 1)::<span style="color: #8ae234; font-weight: bold;">int</span>
    <span style="color: #729fcf; font-weight: bold;">from</span> generate_series(1, <span style="color: #729fcf; font-weight: bold;">length</span>($1::text)) <span style="color: #729fcf; font-weight: bold;">as</span> t(i)
$$;
</pre>

<p>That was easy: it will output one row per digit of the input number, and
rather than resorting to powers of ten, divisions and remainders, we
use the plain old text representation and <code>substring</code>. Now, to the real
problem. If you've read what a happy number is and have already read the
fine manual about <a href="http://www.postgresql.org/docs/8.4/interactive/queries-with.html">Recursive Query Evaluation</a>, it should be quite easy to
read the following:</p>

<pre class="src">
<span style="color: #729fcf; font-weight: bold;">with</span> <span style="color: #729fcf; font-weight: bold;">recursive</span> happy(n, seen) <span style="color: #729fcf; font-weight: bold;">as</span> (
    <span style="color: #729fcf; font-weight: bold;">select</span> 7::bigint, <span style="color: #ad7fa8; font-style: italic;">'{}'</span>::bigint[]
  <span style="color: #729fcf; font-weight: bold;">union</span> <span style="color: #729fcf; font-weight: bold;">all</span>
    <span style="color: #729fcf; font-weight: bold;">select</span> <span style="color: #729fcf;">sum</span>(d*d), h.seen || <span style="color: #729fcf;">sum</span>(d*d)
      <span style="color: #729fcf; font-weight: bold;">from</span> (<span style="color: #729fcf; font-weight: bold;">select</span> n, digits(n) <span style="color: #729fcf; font-weight: bold;">as</span> d, seen
              <span style="color: #729fcf; font-weight: bold;">from</span> happy
           ) <span style="color: #729fcf; font-weight: bold;">as</span> h
  <span style="color: #729fcf; font-weight: bold;">group</span> <span style="color: #729fcf; font-weight: bold;">by</span> h.n, h.seen
    <span style="color: #729fcf; font-weight: bold;">having</span> <span style="color: #729fcf; font-weight: bold;">not</span> seen @&gt; <span style="color: #8ae234; font-weight: bold;">array</span>[<span style="color: #729fcf;">sum</span>(d*d)]
)
  <span style="color: #729fcf; font-weight: bold;">select</span> * <span style="color: #729fcf; font-weight: bold;">from</span> happy;
  n  |       seen
<span style="color: #888a85;">-----+------------------
</span>   7 | {}
  49 | {49}
  97 | {49,97}
 130 | {49,97,130}
  10 | {49,97,130,10}
   1 | {49,97,130,10,1}
(6 <span style="color: #729fcf; font-weight: bold;">rows</span>)

<span style="color: #8ae234; font-weight: bold;">Time</span>: 1.238 ms
</pre>

<p>That shows how it works for a <em>happy</em> number, and it's easy to test a
non-happy one, for example <code>17</code>. The query won't loop forever thanks to the <code>seen</code>
array and the <code>having</code> filter, so the only difference between a <em>happy</em> and a
<em>sad</em> number is that in the former case the last line output by the
recursive query has <code>n = 1</code>. Let's turn this knowledge
into a proper function (because we want to pass the number we
test for happiness as an argument):</p>

<pre class="src">
<span style="color: #729fcf; font-weight: bold;">create</span> <span style="color: #729fcf; font-weight: bold;">or</span> replace <span style="color: #729fcf; font-weight: bold;">function</span> <span style="color: #edd400; font-weight: bold; font-style: italic;">happy</span>(x bigint)
  <span style="color: #729fcf; font-weight: bold;">returns</span> <span style="color: #8ae234; font-weight: bold;">boolean</span>
  <span style="color: #729fcf; font-weight: bold;">language</span> <span style="color: #729fcf; font-weight: bold;">sql</span>
<span style="color: #729fcf; font-weight: bold;">as</span> $$
<span style="color: #729fcf; font-weight: bold;">with</span> <span style="color: #729fcf; font-weight: bold;">recursive</span> happy(n, seen) <span style="color: #729fcf; font-weight: bold;">as</span> (
    <span style="color: #729fcf; font-weight: bold;">select</span> $1, <span style="color: #ad7fa8; font-style: italic;">'{}'</span>::bigint[]
  <span style="color: #729fcf; font-weight: bold;">union</span> <span style="color: #729fcf; font-weight: bold;">all</span>
    <span style="color: #729fcf; font-weight: bold;">select</span> <span style="color: #729fcf;">sum</span>(d*d), h.seen || <span style="color: #729fcf;">sum</span>(d*d)
      <span style="color: #729fcf; font-weight: bold;">from</span> (<span style="color: #729fcf; font-weight: bold;">select</span> n, digits(n) <span style="color: #729fcf; font-weight: bold;">as</span> d, seen
              <span style="color: #729fcf; font-weight: bold;">from</span> happy
           ) <span style="color: #729fcf; font-weight: bold;">as</span> h
  <span style="color: #729fcf; font-weight: bold;">group</span> <span style="color: #729fcf; font-weight: bold;">by</span> h.n, h.seen
    <span style="color: #729fcf; font-weight: bold;">having</span> <span style="color: #729fcf; font-weight: bold;">not</span> seen @&gt; <span style="color: #8ae234; font-weight: bold;">array</span>[<span style="color: #729fcf;">sum</span>(d*d)]
)
  <span style="color: #729fcf; font-weight: bold;">select</span> n = 1 <span style="color: #729fcf; font-weight: bold;">as</span> happy
    <span style="color: #729fcf; font-weight: bold;">from</span> happy
<span style="color: #729fcf; font-weight: bold;">order</span> <span style="color: #729fcf; font-weight: bold;">by</span> array_length(seen, 1) <span style="color: #729fcf; font-weight: bold;">desc</span> nulls <span style="color: #729fcf; font-weight: bold;">last</span>
   <span style="color: #729fcf; font-weight: bold;">limit</span> 1
$$;
</pre>
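
<p>A quick sanity check, assuming the function above has been created as shown.
Working the sad case by hand, <code>17</code> goes through <code>50, 25, 29, 85, 89, 145, 42, 20, 4, 16, 37, 58</code>
and the next value would be <code>89</code> again, which is already in <code>seen</code>, so the
recursion stops with <code>n = 58</code>:</p>

<pre class="src">
-- 7 is happy (see the trace above), 17 is not
select happy(7) as seven, happy(17) as seventeen;
 seven | seventeen
-------+-----------
 t     | f
(1 row)
</pre>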

<p>We need the <code>desc nulls last</code> trick in the <code>order by</code> because the <code>array_length()</code>
of any dimension of an empty array is <code>NULL</code>, and we certainly don't want to
return all and any number as unhappy on the grounds that the query result
contains a line <code>input, {}</code>. Let's now play the same tricks as in the puzzle
article:</p>

<pre class="src">
=# <span style="color: #729fcf; font-weight: bold;">select</span> array_agg(x) <span style="color: #729fcf; font-weight: bold;">as</span> happy <span style="color: #729fcf; font-weight: bold;">from</span> generate_series(1, 50) <span style="color: #729fcf; font-weight: bold;">as</span> t(x) <span style="color: #729fcf; font-weight: bold;">where</span> happy(x);
              happy
<span style="color: #888a85;">----------------------------------
</span> {1,7,10,13,19,23,28,31,32,44,49}
(1 <span style="color: #8ae234; font-weight: bold;">row</span>)

<span style="color: #8ae234; font-weight: bold;">Time</span>: 24.527 ms

=# explain analyze <span style="color: #729fcf; font-weight: bold;">select</span> x <span style="color: #729fcf; font-weight: bold;">from</span> generate_series(1, 10000) <span style="color: #729fcf; font-weight: bold;">as</span> t(x) <span style="color: #729fcf; font-weight: bold;">where</span> happy(x);
                      QUERY PLAN
<span style="color: #888a85;">----------------------------------------------------------------------------------------
</span> <span style="color: #729fcf; font-weight: bold;">Function</span> Scan <span style="color: #729fcf; font-weight: bold;">on</span> generate_series t  (cost=0.00..265.00 <span style="color: #729fcf; font-weight: bold;">rows</span>=333 width=4)
                          (actual <span style="color: #8ae234; font-weight: bold;">time</span>=2.938..3651.019 <span style="color: #729fcf; font-weight: bold;">rows</span>=1442 loops=1)
   Filter: happy((x)::bigint)
 Total runtime: 3651.534 ms
(3 <span style="color: #729fcf; font-weight: bold;">rows</span>)

<span style="color: #8ae234; font-weight: bold;">Time</span>: 3652.178 ms
</pre>

<p>(Yes, I tricked the <code>EXPLAIN ANALYZE</code> output so that it fits on the page width
here). For what it's worth, finding the first <code>10000</code> happy numbers in <em>Emacs
Lisp</em> on the same laptop takes <code>2830 ms</code>, also running a recursive version of
the code.</p>
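
<p>As a footnote to the <code>nulls last</code> remark above, here is a small sketch of the
sort order involved. The array values are made up for the illustration; in a
descending sort <code>NULL</code> comes first by default, which is what would put the
<code>{}</code> row on top:</p>

<pre class="src">
-- array_length() of an empty array is NULL, so push NULLs to the end
select seen, array_length(seen, 1) as len
  from (values ('{}'::bigint[]), ('{49}'), ('{49,97}')) as t(seen)
 order by array_length(seen, 1) desc nulls last;
  seen   | len
---------+-----
 {49,97} |   2
 {49}    |   1
 {}      |
(3 rows)
</pre>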

]]></description>
<author>Dimitri Fontaine</author>
<pubDate>Mon, 30 Aug 2010 11:00:00 CEST</pubDate>
<guid>http://blog.tapoueh.org/blog.dim.html#%20Happy%20Numbers</guid>

</item>

<item>
<title> Playing with bit strings</title>
<link>http://blog.tapoueh.org/blog.dim.html#%20Playing%20with%20bit%20strings</link>
<description><![CDATA[
<p><a name="20100826-17:45" id="20100826-17:45"></a>
The idea of the day ain't directly from me, I'm just helping with a very
thin subpart of the problem. I can't say much about the problem itself, so let's just
assume you want to reduce the storage of <code>MD5</code> in your database, so you want
to abuse <a href="http://www.postgresql.org/docs/8.4/interactive/datatype-bit.html">bit strings</a>. Using them works fine, but the datatype is
still missing some facilities, for example converting to and from the hexadecimal
representation in text.</p>

<pre class="src">
<span style="color: #729fcf; font-weight: bold;">create</span> <span style="color: #729fcf; font-weight: bold;">or</span> replace <span style="color: #729fcf; font-weight: bold;">function</span> <span style="color: #edd400; font-weight: bold; font-style: italic;">hex_to_varbit</span>(h text)
 <span style="color: #729fcf; font-weight: bold;">returns</span> varbit
 <span style="color: #729fcf; font-weight: bold;">language</span> <span style="color: #729fcf; font-weight: bold;">sql</span>
<span style="color: #729fcf; font-weight: bold;">as</span> $$
  <span style="color: #729fcf; font-weight: bold;">select</span> (<span style="color: #ad7fa8; font-style: italic;">'X'</span> || $1)::varbit;
$$;

<span style="color: #729fcf; font-weight: bold;">create</span> <span style="color: #729fcf; font-weight: bold;">or</span> replace <span style="color: #729fcf; font-weight: bold;">function</span> <span style="color: #edd400; font-weight: bold; font-style: italic;">varbit_to_hex</span>(b varbit)
 <span style="color: #729fcf; font-weight: bold;">returns</span> text
 <span style="color: #729fcf; font-weight: bold;">language</span> <span style="color: #729fcf; font-weight: bold;">sql</span>
<span style="color: #729fcf; font-weight: bold;">as</span> $$
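  -- to_hex() does not zero-pad, so lpad() each 32-bit chunk to 8 hex digits
  <span style="color: #729fcf; font-weight: bold;">select</span> array_to_string(array_agg(lpad(to_hex((b &lt;&lt; (32*o))::<span style="color: #8ae234; font-weight: bold;">bit</span>(32)::bigint), 8, <span style="color: #ad7fa8; font-style: italic;">'0'</span>)), <span style="color: #ad7fa8; font-style: italic;">''</span>)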
    <span style="color: #729fcf; font-weight: bold;">from</span> (<span style="color: #729fcf; font-weight: bold;">select</span> b, generate_series(0, n-1) <span style="color: #729fcf; font-weight: bold;">as</span> o
            <span style="color: #729fcf; font-weight: bold;">from</span> (<span style="color: #729fcf; font-weight: bold;">select</span> $1, <span style="color: #729fcf;">octet_length</span>($1)/4) <span style="color: #729fcf; font-weight: bold;">as</span> t(b, n)) <span style="color: #729fcf; font-weight: bold;">as</span> x
$$;
</pre>

<p>To understand the magic in the second function, let's walk through the tests
one could do when wanting to grasp how things work in the <code>bitstring</code> world
(along with some reading of the fine documentation).</p>

<pre class="src">
=# select ('101011001011100110010110'::varbit &lt;&lt; 0)::bit(8);
   bit
----------
 10101100
(1 row)

=# select ('101011001011100110010110'::varbit &lt;&lt; 8)::bit(8);
   bit
----------
 10111001
(1 row)

=# select ('101011001011100110010110'::varbit &lt;&lt; 16)::bit(8);
   bit
----------
 10010110
(1 row)

=# select * from *TEMP VERSION OF THE FUNCTION FOR TESTING*
 o |                b                 |    x
---+----------------------------------+----------
 0 | 10101100101111010001100011011011 | acbd18db
 1 | 01001100110000101111100001011100 | 4cc2f85c
 2 | 11101101111011110110010101001111 | edef654f
 3 | 11001100110001001010010011011000 | ccc4a4d8
(4 rows)
</pre>

<p>What do we get from that, you may ask? Let's see a little example:</p>

<pre class="src">
=# select hex_to_varbit(md5('foo'));
                                                          hex_to_varbit
----------------------------------------------------------------------------------------------------------------------------------
 10101100101111010001100011011011010011001100001011111000010111001110110111101111011001010100111111001100110001001010010011011000
(1 row)

=# select md5('foo'), varbit_to_hex(hex_to_varbit(md5('foo')));
               md5                |          varbit_to_hex
----------------------------------+----------------------------------
 acbd18db4cc2f85cedef654fccc4a4d8 | acbd18db4cc2f85cedef654fccc4a4d8
(1 row)
</pre>
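
<p>One detail worth checking when reusing this: <code>to_hex</code> does not zero-pad its
result, so a 32-bit chunk starting with zero bits yields fewer than eight
hexadecimal digits. A minimal sketch with a made-up value; <code>lpad</code> is one way
to restore the fixed width:</p>

<pre class="src">
-- x'0000000f' is a made-up 32-bit chunk with leading zero bits
select to_hex(x'0000000f'::bit(32)::bigint) as unpadded,
       lpad(to_hex(x'0000000f'::bit(32)::bigint), 8, '0') as padded;
 unpadded |  padded
----------+----------
 f        | 0000000f
(1 row)
</pre>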

<p>Storing <code>varbits</code> rather than the <code>text</code> form of the <code>MD5</code> allows us to go from
<code>6510 MB</code> down to <code>4976 MB</code> on a sample table containing 100 million
rows. We're targeting more than that, so that's a great win down here!</p>
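
<p>To get a rough per-value idea of the difference, <code>pg_column_size</code> can be
compared on both forms. This is only a sketch: it looks at a single computed
value, not at a stored table, so it says nothing about row headers or indexes:</p>

<pre class="src">
-- size in bytes of one MD5 as text, and of the same value as a bit string
select pg_column_size(md5('foo')) as text_size,
       pg_column_size(('X' || md5('foo'))::varbit) as varbit_size;
</pre>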

<p>In case you wonder: when querying the main index on <code>varbit</code> rather than the one on
<code>text</code> for a single result row, the cost of doing the conversion with
<code>varbit_to_hex</code> seems to be around <code>28 µs</code>. We can afford it.</p>

<p>Hope this helps!</p>

]]></description>
<author>Dimitri Fontaine</author>
<pubDate>Thu, 26 Aug 2010 17:45:00 CEST</pubDate>
<guid>http://blog.tapoueh.org/blog.dim.html#%20Playing%20with%20bit%20strings</guid>

</item>

<item>
<title> Editing constants in constraints</title>
<link>http://blog.tapoueh.org/blog.dim.html#%20Editing%20constants%20in%20constraints</link>
<description><![CDATA[
<p><a name="20100809-14:30" id="20100809-14:30"></a>
We're using constants in some constraints here, for example in cases where
several servers are replicating to the same <em>federating</em> one: each origin
server has its own schema, and all is replicated nicely on the central host,
thanks to <a href="http://wiki.postgresql.org/wiki/Londiste_Tutorial#Federated_database">Londiste</a>, as you might have guessed already.</p>

<p>For bare-metal recovery scripts, I'm working on how to change those
constants in the constraints, so that <code>pg_dump -s</code> plus some schema tweaking
would kick-start a server. Here's a <code>PLpgSQL</code> snippet to do just that:</p>

<pre class="src">
  <span style="color: #729fcf; font-weight: bold;">FOR</span> rec <span style="color: #729fcf; font-weight: bold;">IN</span> <span style="color: #729fcf; font-weight: bold;">EXECUTE</span>
$s$
<span style="color: #729fcf; font-weight: bold;">SELECT</span> schemaname, tablename, conname, attnames, def
  <span style="color: #729fcf; font-weight: bold;">FROM</span> (
   <span style="color: #729fcf; font-weight: bold;">SELECT</span> n.nspname, c.relname, r.conname,
          (<span style="color: #729fcf; font-weight: bold;">select</span> array_accum(attname)
             <span style="color: #729fcf; font-weight: bold;">from</span> pg_attribute
            <span style="color: #729fcf; font-weight: bold;">where</span> attrelid = c.oid <span style="color: #729fcf; font-weight: bold;">and</span> r.conkey @&gt; <span style="color: #8ae234; font-weight: bold;">array</span>[attnum]) <span style="color: #729fcf; font-weight: bold;">as</span> attnames,
          pg_catalog.pg_get_constraintdef(r.oid, <span style="color: #729fcf; font-weight: bold;">true</span>)
   <span style="color: #729fcf; font-weight: bold;">FROM</span> pg_catalog.pg_constraint r
        <span style="color: #729fcf; font-weight: bold;">JOIN</span> pg_class c <span style="color: #729fcf; font-weight: bold;">on</span> c.oid = r.conrelid
        <span style="color: #729fcf; font-weight: bold;">JOIN</span> pg_namespace n <span style="color: #729fcf; font-weight: bold;">ON</span> n.oid = c.relnamespace
   <span style="color: #729fcf; font-weight: bold;">WHERE</span> r.contype = <span style="color: #ad7fa8; font-style: italic;">'c'</span>
<span style="color: #729fcf; font-weight: bold;">ORDER</span> <span style="color: #729fcf; font-weight: bold;">BY</span> 1, 2, 3
       ) <span style="color: #729fcf; font-weight: bold;">as</span> cons(schemaname, tablename, conname, attnames, def)
<span style="color: #729fcf; font-weight: bold;">WHERE</span> attnames @&gt; <span style="color: #8ae234; font-weight: bold;">array</span>[<span style="color: #ad7fa8; font-style: italic;">'server'</span>]::<span style="color: #729fcf; font-weight: bold;">name</span>[]
$s$
  LOOP
    rec.def := replace(rec.def, <span style="color: #ad7fa8; font-style: italic;">'server = '</span> || old_id,
                                <span style="color: #ad7fa8; font-style: italic;">'server = '</span> || new_id);

    -- quote_ident() keeps names that need quoting from breaking the statement
    <span style="color: #729fcf; font-weight: bold;">sql</span> := <span style="color: #ad7fa8; font-style: italic;">'ALTER TABLE '</span> || quote_ident(rec.schemaname) || <span style="color: #ad7fa8; font-style: italic;">'.'</span> || quote_ident(rec.tablename)
        || <span style="color: #ad7fa8; font-style: italic;">' DROP CONSTRAINT '</span> || quote_ident(rec.conname);
    RAISE NOTICE <span style="color: #ad7fa8; font-style: italic;">'%'</span>, <span style="color: #729fcf; font-weight: bold;">sql</span>;
    <span style="color: #729fcf; font-weight: bold;">RETURN</span> <span style="color: #729fcf; font-weight: bold;">NEXT</span>;
    <span style="color: #729fcf; font-weight: bold;">EXECUTE</span> <span style="color: #729fcf; font-weight: bold;">sql</span>;

    <span style="color: #729fcf; font-weight: bold;">sql</span> := <span style="color: #ad7fa8; font-style: italic;">'ALTER TABLE '</span> || quote_ident(rec.schemaname) || <span style="color: #ad7fa8; font-style: italic;">'.'</span> || quote_ident(rec.tablename)
        || <span style="color: #ad7fa8; font-style: italic;">' ADD '</span> || rec.def;
    RAISE NOTICE <span style="color: #ad7fa8; font-style: italic;">'%'</span>, <span style="color: #729fcf; font-weight: bold;">sql</span>;
    <span style="color: #729fcf; font-weight: bold;">RETURN</span> <span style="color: #729fcf; font-weight: bold;">NEXT</span>;
    <span style="color: #729fcf; font-weight: bold;">EXECUTE</span> <span style="color: #729fcf; font-weight: bold;">sql</span>;

  <span style="color: #729fcf; font-weight: bold;">END</span> LOOP;
</pre>

<p>This relies on the fact that our constraints are on the column <code>server</code>. Why
would this be any better than a <code>sed</code> one-liner, you might ask? I'm fed up
with having pseudo-parsing scripts and taking the risk that the simple
command will change data I didn't want to edit. I want context-aware tools,
pretty please, to <em>feel</em> safe.</p>

<p>Otherwise I might have gone with <code>pg_dump -s| sed -e 's:\(server =\)
17:\1 18:'</code> but this one-liner already contains too much useless magic
for my taste (the space before <code>17</code> ain't in the group match to allow for
having <code>\1 18</code> in the right hand side). And this isn't parametrized yet, and
there I'll need to talk to the database, as that's where I store the server
names and their ids (a <code>bigserial</code>; yes, the constraints are all generated from
scripts). I don't want to write an <em>SQL parser</em> and I don't want to play
loose, so the <code>PLpgSQL</code> approach is what I think is the best tool
here. Opinionated answers get to my mailbox!</p>
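
<p>Before running the loop for real, it can help to preview the statements. Here
is a sketch of a read-only query producing them as text, using the <code>17</code> and <code>18</code>
ids from the <code>sed</code> example as stand-ins for <code>old_id</code> and <code>new_id</code>. Unlike the
snippet above it names the re-created constraint explicitly, which is an
assumption about what you want:</p>

<pre class="src">
-- naive text replace, same as in the loop: 'server = 17' also matches 'server = 170'
SELECT 'ALTER TABLE ' || quote_ident(n.nspname) || '.' || quote_ident(c.relname)
       || ' DROP CONSTRAINT ' || quote_ident(r.conname) || ';' AS drop_sql,
       'ALTER TABLE ' || quote_ident(n.nspname) || '.' || quote_ident(c.relname)
       || ' ADD CONSTRAINT ' || quote_ident(r.conname) || ' '
       || replace(pg_catalog.pg_get_constraintdef(r.oid, true),
                  'server = 17', 'server = 18') || ';' AS add_sql
  FROM pg_catalog.pg_constraint r
       JOIN pg_catalog.pg_class c ON c.oid = r.conrelid
       JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
 WHERE r.contype = 'c'
       -- only check constraints that involve a column named "server"
   AND EXISTS (SELECT 1
                 FROM pg_catalog.pg_attribute a
                WHERE a.attrelid = c.oid
                  AND a.attnum = ANY (r.conkey)
                  AND a.attname = 'server')
 ORDER BY n.nspname, c.relname, r.conname;
</pre>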

]]></description>
<author>Dimitri Fontaine</author>
<pubDate>Mon, 09 Aug 2010 14:45:00 CEST</pubDate>
<guid>http://blog.tapoueh.org/blog.dim.html#%20Editing%20constants%20in%20constraints</guid>

</item>

<item>
<title> debian packaging PostgreSQL extensions</title>
<link>http://blog.tapoueh.org/blog.dim.html#%20debian%20packaging%20PostgreSQL%20extensions</link>
<description><![CDATA[
<p><a name="20100806-13:00" id="20100806-13:00"></a>
In trying to help an extension <em>debian packaging</em> effort, I've once again
offered to handle it. That's because I'm beginning to know how to do it, as
you can see in my <a href="http://qa.debian.org/developer.php?login=dim%40tapoueh.org">package overview</a> page at the <em>debian QA</em> facility. There's a
reason why I volunteered here: yet another tool of mine is now
to be found in <em>debian</em>, and should greatly help <em>extension packaging</em>
there. You can already check the <a href="http://packages.debian.org/sid/postgresql-server-dev-all">postgresql-server-dev-all</a> package page
if you're that impatient!</p>

<p>Back? Ok, so I used to have two main gripes against debian support for
<a href="http://www.postgresql.org/">PostgreSQL</a>. The first one, which is now feeling alone, is that the two projects'
<a href="http://wiki.postgresql.org/wiki/PostgreSQL_Release_Support_Policy">release support policies</a> aren't compatible enough for debian stable to include
all currently supported stable PostgreSQL major versions. It's very bad
that debian stable will only offer one major version, knowing that the
support for several of them is in there.</p>

<p>The problem is twofold: first, debian stable has to maintain any
distributed package. There's no <em>deprecation policy</em> allowing for dropping the
ball. So the other side of this coin is that debian developers must take it upon
themselves to maintain included software for as long as stable is not
renamed <code>oldstable</code>. And it so happens that there's no debian developer who
feels like maintaining <em>end of life</em> PostgreSQL releases without help from the
<a href="http://www.postgresql.org/community/contributors/">PostgreSQL Core Team</a>. Or, say, without an official statement that they would
help.</p>

<p>Now, why I don't like this situation is because I'm pretty sure there are very
few software development groups offering as long and reliable a maintenance
policy as PostgreSQL does, but debian will still happily distribute
<em>unknown-maintenance-policy</em> pieces of code in its stable repositories. So the
<em>uncertainty</em> excuse is rather poor. And highly frustrating.</p>

<blockquote>
<p class="quoted">
<strong><em>Note:</em></strong> you have to admit that the debian stable management model copes very
well with all the software included in debian. You can't release stable with
a new PostgreSQL major version unless each and every package depending on
PostgreSQL actually works with the newer version, and the debian
scripts take care of upgrading the cluster. Where it doesn't work well
is when you're using debian for a PostgreSQL server for a proprietary
application, which happens quite frequently too.</p>

</blockquote>

<p>This leads to my second main gripe against debian
support for PostgreSQL: the extensions. It so happens that PostgreSQL
extensions are developed to support several major versions from the
same source code. So typically, all you need to do is recompile the
extension against the new major version, and there you go.</p>

<p>Now, say the new debian stable comes with <a href="http://packages.debian.org/squeeze/postgresql-8.4">8.4</a> rather than <a href="http://packages.debian.org/lenny/postgresql-8.3">8.3</a> as it used
to. You should be able to just build the extensions (like <a href="http://packages.debian.org/squeeze/postgresql-8.4-prefix">prefix</a>), without
changing the source package or dropping <code>postgresql-8.3-prefix</code> from the
distribution on the grounds that <code>8.3</code> ain't in debian stable anymore.</p>

<p>I've been ranting a lot about this state of affairs, and I finally provided a
patch to the <a href="http://packages.debian.org/sid/postgresql-common">postgresql-common</a> debian packaging, which made it into version
<code>110</code>: welcome <a href="http://packages.debian.org/sid/postgresql-server-dev-all">pg_buildext</a>. An example of how to use it can be found in the
git branch for <a href="http://github.com/dimitri/prefix">prefix</a>, where it shows up in the <a href="http://github.com/dimitri/prefix/blob/master/debian/pgversions">debian/pgversions</a> and <a href="http://github.com/dimitri/prefix/blob/master/debian/rules">debian/rules</a>
files.</p>

<p>As you can see, the <code>pg_buildext</code> tool allows you to list the PostgreSQL major
versions the extension you're packaging supports, and only those that are
both in your list and in the current debian supported major version list
will get built. <code>pg_buildext</code> will do a <code>VPATH</code> build of your extension, so it's
capable of building the same extension for multiple major versions of
PostgreSQL. Here's how it looks:</p>

<pre class="src">
        # build all supported versions
        pg_buildext build $(SRCDIR) $(TARGET) <span style="color: #ad7fa8; font-style: italic;">"$(CFLAGS)"</span>

        # then install each of them
        for v in `pg_buildext supported-versions $(SRCDIR)`; do \
                dh_install -ppostgresql-$$v-prefix ;\
        done
</pre>

<p>And the files are to be found in those places:</p>

<pre class="src">
dim ~/dev/prefix cat debian/postgresql-8.3-prefix.install
debian/prefix-8.3/prefix.so usr/lib/postgresql/8.3/lib
debian/prefix-8.3/prefix.sql usr/share/postgresql/8.3/contrib

dim ~/dev/prefix cat debian/postgresql-8.4-prefix.install
debian/prefix-8.4/prefix.so usr/lib/postgresql/8.4/lib
debian/prefix-8.4/prefix.sql usr/share/postgresql/8.4/contrib
</pre>

<p>So you still need to maintain <a href="http://github.com/dimitri/prefix/blob/master/debian/pgversions">debian/pgversions</a> and the
<code>postgresql-X.Y-extension.*</code> files, but then a change in debian support for
PostgreSQL major versions will be handled automatically (there's a facility
to trigger automatic rebuild when necessary).</p>

<p>All this ranting to explain that pretty soon, the extension packages that I
maintain will no longer have to be patched when dropping a previously
supported major version of PostgreSQL. I'm breathing a little better, so
thanks a lot <a href="http://www.piware.de/category/debian/">Martin</a>!</p>

]]></description>
<author>Dimitri Fontaine</author>
<pubDate>Fri, 06 Aug 2010 13:00:00 CEST</pubDate>
<guid>http://blog.tapoueh.org/blog.dim.html#%20debian%20packaging%20PostgreSQL%20extensions</guid>

</item>

<item>
<title> Querying the Catalog to plan an upgrade</title>
<link>http://blog.tapoueh.org/blog.dim.html#%20Querying%20the%20Catalog%20to%20plan%20an%20upgrade</link>
<description><![CDATA[
<p><a name="20100805-11:00" id="20100805-11:00"></a>
Some user on <code>IRC</code> was reading the release notes in order to plan for a minor
upgrade of his <code>8.3.3</code> installation, and was puzzled about the potential need for
rebuilding <code>GIST</code> indexes. That's from the <a href="http://www.postgresql.org/docs/8.3/static/release-8-3-5.html">8.3.5 release notes</a>, and from the
<a href="http://www.postgresql.org/docs/8.3/static/release-8-3-8.html">8.3.8 notes</a> you see that you need to consider <em>hash</em> indexes on <em>interval</em>
columns too. Now the question is, how to find out if any such beasts are in
use in your database?</p>

<p>It happens that <a href="http://www.postgresql.org/">PostgreSQL</a> lets you find out about those things by querying its
<a href="http://www.postgresql.org/docs/8.4/static/catalogs.html">system catalogs</a>. That might look hairy at first, but it's well worth getting
used to those system tables. You could compare that to the introspection and
reflection facilities of some programming languages, except much more useful,
because you're reaching the whole system at once. But, well, here it goes:</p>

<pre class="src">
<span style="color: #729fcf; font-weight: bold;">SELECT</span> schemaname, tablename, relname, amname, indexdef
  <span style="color: #729fcf; font-weight: bold;">FROM</span> pg_indexes i
       <span style="color: #729fcf; font-weight: bold;">JOIN</span> pg_class c <span style="color: #729fcf; font-weight: bold;">ON</span> i.indexname = c.relname <span style="color: #729fcf; font-weight: bold;">and</span> c.relkind = <span style="color: #ad7fa8; font-style: italic;">'i'</span>
       <span style="color: #729fcf; font-weight: bold;">JOIN</span> pg_namespace n <span style="color: #729fcf; font-weight: bold;">ON</span> n.oid = c.relnamespace <span style="color: #729fcf; font-weight: bold;">and</span> n.nspname = i.schemaname
       <span style="color: #729fcf; font-weight: bold;">JOIN</span> pg_am am <span style="color: #729fcf; font-weight: bold;">ON</span> c.relam = am.oid
 <span style="color: #729fcf; font-weight: bold;">WHERE</span> amname = <span style="color: #ad7fa8; font-style: italic;">'gist'</span>;
</pre>

<p>Now you could replace the <code>WHERE</code> clause with <code>WHERE amname IN ('gist', 'hash')</code>
to check both conditions at once. What about narrowing down the
<em>hash</em> index rebuilds to schedule, as they only need to be done for indexes on
<code>interval</code> columns? Well, let's try it:</p>

<pre class="src">
<span style="color: #729fcf; font-weight: bold;">SELECT</span> schemaname, tablename, relname <span style="color: #729fcf; font-weight: bold;">as</span> indexname, amname, indclass
  <span style="color: #729fcf; font-weight: bold;">FROM</span> pg_indexes i
       <span style="color: #729fcf; font-weight: bold;">JOIN</span> pg_class c <span style="color: #729fcf; font-weight: bold;">on</span> i.indexname = c.relname <span style="color: #729fcf; font-weight: bold;">and</span> c.relkind = <span style="color: #ad7fa8; font-style: italic;">'i'</span>
       <span style="color: #729fcf; font-weight: bold;">JOIN</span> pg_namespace n <span style="color: #729fcf; font-weight: bold;">on</span> n.oid = c.relnamespace <span style="color: #729fcf; font-weight: bold;">and</span> n.nspname = i.schemaname
       <span style="color: #729fcf; font-weight: bold;">JOIN</span> pg_am am <span style="color: #729fcf; font-weight: bold;">on</span> c.relam = am.oid
       <span style="color: #729fcf; font-weight: bold;">JOIN</span> pg_index x <span style="color: #729fcf; font-weight: bold;">on</span> x.indexrelid = c.oid
 <span style="color: #729fcf; font-weight: bold;">WHERE</span> amname <span style="color: #729fcf; font-weight: bold;">in</span> (<span style="color: #ad7fa8; font-style: italic;">'btree'</span>, <span style="color: #ad7fa8; font-style: italic;">'gist'</span>)
       <span style="color: #729fcf; font-weight: bold;">and</span> schemaname <span style="color: #729fcf; font-weight: bold;">not</span> <span style="color: #729fcf; font-weight: bold;">in</span> (<span style="color: #ad7fa8; font-style: italic;">'pg_catalog'</span>, <span style="color: #ad7fa8; font-style: italic;">'information_schema'</span>);
</pre>

<p>We're not there yet, because as you notice, the catalogs are somewhat
optimized and not always in a normal form. That's good for the system's
performance, but it makes querying a bit awkward. What we want to find out is
whether any entry in the <code>indclass</code> column (it's an <code>oidvector</code>) applies
to an <code>interval</code> data type. There's a subtlety here as the index could store
<code>interval</code> data even if the column is not of an <code>interval</code> type itself, so we
have to find both cases.</p>

<p>Well, the <em>subtlety</em> only makes sense once you know what an <a href="http://www.postgresql.org/docs/8.4/static/xindex.html">operator class</a> is: <em>“An
operator class defines how a particular data type can be used with an
index”</em> is what the <a href="http://www.postgresql.org/docs/8.4/static/sql-createopclass.html">CREATE OPERATOR CLASS</a> manual page teaches us. What we
need to know here is that an index goes through an operator class to get to
the data type, either the <em>column</em> data type or the index <em>storage</em> one.</p>

<pre class="src">
<span style="color: #729fcf; font-weight: bold;">SELECT</span> schemaname, tablename, relname <span style="color: #729fcf; font-weight: bold;">as</span> indexname, amname, indclass, opcname, typname
  <span style="color: #729fcf; font-weight: bold;">FROM</span> pg_indexes i
       <span style="color: #729fcf; font-weight: bold;">JOIN</span> pg_class c <span style="color: #729fcf; font-weight: bold;">on</span> i.indexname = c.relname <span style="color: #729fcf; font-weight: bold;">and</span> c.relkind = <span style="color: #ad7fa8; font-style: italic;">'i'</span>
       <span style="color: #729fcf; font-weight: bold;">JOIN</span> pg_am am <span style="color: #729fcf; font-weight: bold;">on</span> c.relam = am.oid
       <span style="color: #729fcf; font-weight: bold;">JOIN</span> pg_index x <span style="color: #729fcf; font-weight: bold;">on</span> x.indexrelid = c.oid
       <span style="color: #729fcf; font-weight: bold;">JOIN</span> pg_opclass o
         <span style="color: #729fcf; font-weight: bold;">on</span> string_to_array(x.indclass::text, <span style="color: #ad7fa8; font-style: italic;">' '</span>)::oid[] @&gt; <span style="color: #8ae234; font-weight: bold;">array</span>[o.oid]::oid[]
       <span style="color: #729fcf; font-weight: bold;">JOIN</span> pg_type t <span style="color: #729fcf; font-weight: bold;">on</span> o.opckeytype = t.oid
<span style="color: #729fcf; font-weight: bold;">WHERE</span> amname = <span style="color: #ad7fa8; font-style: italic;">'hash'</span> <span style="color: #729fcf; font-weight: bold;">and</span> t.typname = <span style="color: #ad7fa8; font-style: italic;">'interval'</span>

<span style="color: #729fcf; font-weight: bold;">UNION</span> <span style="color: #729fcf; font-weight: bold;">ALL</span>

<span style="color: #729fcf; font-weight: bold;">SELECT</span> schemaname, tablename, relname <span style="color: #729fcf; font-weight: bold;">as</span> indexname, amname, indclass, opcname, typname
  <span style="color: #729fcf; font-weight: bold;">FROM</span> pg_indexes i
       <span style="color: #729fcf; font-weight: bold;">JOIN</span> pg_class c <span style="color: #729fcf; font-weight: bold;">on</span> i.indexname = c.relname <span style="color: #729fcf; font-weight: bold;">and</span> c.relkind = <span style="color: #ad7fa8; font-style: italic;">'i'</span>
       <span style="color: #729fcf; font-weight: bold;">JOIN</span> pg_am am <span style="color: #729fcf; font-weight: bold;">on</span> c.relam = am.oid
       <span style="color: #729fcf; font-weight: bold;">JOIN</span> pg_index x <span style="color: #729fcf; font-weight: bold;">on</span> x.indexrelid = c.oid
       <span style="color: #729fcf; font-weight: bold;">JOIN</span> pg_opclass o
         <span style="color: #729fcf; font-weight: bold;">on</span> string_to_array(x.indclass::text, <span style="color: #ad7fa8; font-style: italic;">' '</span>)::oid[] @&gt; <span style="color: #8ae234; font-weight: bold;">array</span>[o.oid]::oid[]
       <span style="color: #729fcf; font-weight: bold;">JOIN</span> pg_type t <span style="color: #729fcf; font-weight: bold;">on</span> o.opcintype = t.oid
<span style="color: #729fcf; font-weight: bold;">WHERE</span> amname = <span style="color: #ad7fa8; font-style: italic;">'hash'</span> <span style="color: #729fcf; font-weight: bold;">and</span> t.typname = <span style="color: #ad7fa8; font-style: italic;">'interval'</span>;
</pre>

<p>Most certainly this query will return no rows for you, as <em>hash</em> indexes are
not widely used, mainly because they are not crash safe. To see some
results you could of course remove the <code>amname</code> restriction. That would show
that the query is working, but don't forget to add the restriction back to plan
for the upgrade!</p>

<p>But hey, why walk the extra mile here, you might ask? After all, the
second query would already have given us the information we needed had
we added the <code>indexdef</code> column, albeit in a human-reader-friendly way: the
<em>resultset</em> would then contain the <code>CREATE INDEX</code> command you need to issue to
build the index from scratch. That would be enough for checking only the
catalog, but the extra mile allows you to produce a <code>SQL</code> script to build the
indexes that need your attention post upgrade. That last step is left as an
exercise for the reader, though.</p>
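
<p>As a starting point for that exercise, here's a minimal sketch rather than
a finished script. It assumes that a plain <code>REINDEX</code> is what your upgrade
needs (that's an assumption of this sketch, do check it against your release notes),
joins <code>pg_namespace</code> to get schema-qualified names, and outputs one statement per
<em>hash</em> index whose operator class input or storage type is <code>interval</code>:</p>

<pre class="src">
-- Sketch only: emits one REINDEX command per hash index on interval data.
-- The choice of REINDEX over DROP INDEX + CREATE INDEX is an assumption.
SELECT 'REINDEX INDEX ' || quote_ident(n.nspname)
       || '.' || quote_ident(c.relname) || ';' AS command
  FROM pg_index x
       JOIN pg_class c ON c.oid = x.indexrelid AND c.relkind = 'i'
       JOIN pg_namespace n ON n.oid = c.relnamespace
       JOIN pg_am am ON am.oid = c.relam
 WHERE am.amname = 'hash'
       -- same oidvector unpacking trick as in the query above
   AND EXISTS (SELECT 1
                 FROM pg_opclass o
                WHERE o.oid = ANY (string_to_array(x.indclass::text, ' ')::oid[])
                  AND 'interval'::regtype::oid IN (o.opcintype, o.opckeytype))
 ORDER BY 1;
</pre>

<p>Running it through <code>psql -At</code> should give you bare commands without headers,
ready to be reviewed and saved as a script.</p>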

]]></description>
<author>Dimitri Fontaine</author>
<pubDate>Thu, 05 Aug 2010 11:00:00 CEST</pubDate>
<guid>http://blog.tapoueh.org/blog.dim.html#%20Querying%20the%20Catalog%20to%20plan%20an%20upgrade</guid>

</item>

<item>
<title> Database Virtual Machines</title>
<link>http://blog.tapoueh.org/blog.dim.html#%20Database%20Virtual%20Machines</link>
<description><![CDATA[
<p><a name="20100803-13:30" id="20100803-13:30"></a>
Today I'm being told once again about <a href="http://www.sqlite.org/">SQLite</a> as an embedded database
software. That one ain't a <em>database server</em> but a <em>software library</em> that you
can use straight from your main program. I have yet to use it, but it looks
like <a href="http://www.sqlite.org/lang.html">its SQL support</a> is good enough for simple things, and that covers
<em>loads</em> of things. I guess read-only cache and configuration storage would be
the obvious ones, because it seems that <a href="http://www.sqlite.org/whentouse.html">SQLite use cases</a> don't include
<a href="http://www.sqlite.org/lockingv3.html">mixed concurrency</a>, that is workloads with concurrent readers and writers.</p>

<p>The part that got my full attention is
<a href="http://www.sqlite.org/vdbe.html">The Virtual Database Engine of SQLite</a>, as this post's title implies. It
seems to be the same idea as what <a href="http://monetdb.cwi.nl/">MonetDB</a> calls their
<a href="http://monetdb.cwi.nl/MonetDB/Documentation/MAL-Synopsis.html">MonetDB Assembly Language</a>, and I've been trying to summarize some ideas about
it in my <a href="http://tapoueh.org/char10.html#sec11">Next Generation PostgreSQL</a> article.</p>

<p>The main thing is how to further optimize <a href="http://www.postgresql.org/">PostgreSQL</a> given what we have. It
seems that one of the major roadblocks in the performance work is how we get
the data from disk to the client. We're still spending so much time in
the <code>CPU</code> that the disk bandwidth is not always saturated, and that's a
problem. Further thoughts are in the <a href="http://tapoueh.org/char10.html#sec11">full length article</a>, but that's just about
a one page section for now!</p>

]]></description>
<author>Dimitri Fontaine</author>
<pubDate>Tue, 03 Aug 2010 13:30:00 CEST</pubDate>
<guid>http://blog.tapoueh.org/blog.dim.html#%20Database%20Virtual%20Machines</guid>

</item>

<item>
<title> Partitioning: relation size per “group”</title>
<link>http://blog.tapoueh.org/blog.dim.html#%20Partitioning%3A%20relation%20size%20per%20%201Cgroup%201D</link>
<description><![CDATA[
<p><a name="20100726-17:00" id="20100726-17:00"></a>
This time, we are trying to figure out where the bulk of the data is on
disk. The trick is that we're using <a href="http://www.postgresql.org/docs/current/static/ddl-partitioning.html">DDL partitioning</a>, but we want a “nice”
view of size per <em>partition set</em>. Meaning that if you have for example a
parent table <code>foo</code> with partitions <code>foo_201006</code> and <code>foo_201007</code>, you would want
to see a single category <code>foo</code> containing the accumulated size of all the
partitions underneath <code>foo</code>.</p>

<p>Here we go:</p>

<pre class="src">
<span style="color: #729fcf; font-weight: bold;">select</span> groupe, pg_size_pretty(<span style="color: #729fcf;">sum</span>(bytes)::bigint) <span style="color: #729fcf; font-weight: bold;">as</span> <span style="color: #729fcf; font-weight: bold;">size</span>, <span style="color: #729fcf;">sum</span>(bytes)
  <span style="color: #729fcf; font-weight: bold;">from</span> (
<span style="color: #729fcf; font-weight: bold;">select</span> relkind <span style="color: #729fcf; font-weight: bold;">as</span> k, nspname, relname, tablename, bytes,
         <span style="color: #729fcf; font-weight: bold;">case</span> <span style="color: #729fcf; font-weight: bold;">when</span> relkind = <span style="color: #ad7fa8; font-style: italic;">'r'</span> <span style="color: #729fcf; font-weight: bold;">and</span> relname ~ <span style="color: #ad7fa8; font-style: italic;">'[0-9]{6}$'</span>
              <span style="color: #729fcf; font-weight: bold;">then</span> <span style="color: #729fcf;">substring</span>(relname <span style="color: #729fcf; font-weight: bold;">from</span> 1 <span style="color: #729fcf; font-weight: bold;">for</span> <span style="color: #729fcf; font-weight: bold;">length</span>(relname)-7)

              <span style="color: #729fcf; font-weight: bold;">when</span> relkind = <span style="color: #ad7fa8; font-style: italic;">'i'</span> <span style="color: #729fcf; font-weight: bold;">and</span>  tablename ~ <span style="color: #ad7fa8; font-style: italic;">'[0-9]{6}$'</span>
              <span style="color: #729fcf; font-weight: bold;">then</span> <span style="color: #729fcf;">substring</span>(tablename <span style="color: #729fcf; font-weight: bold;">from</span> 1 <span style="color: #729fcf; font-weight: bold;">for</span> <span style="color: #729fcf; font-weight: bold;">length</span>(tablename)-7)

              <span style="color: #729fcf; font-weight: bold;">else</span> <span style="color: #ad7fa8; font-style: italic;">'core'</span>
          <span style="color: #729fcf; font-weight: bold;">end</span> <span style="color: #729fcf; font-weight: bold;">as</span> groupe
  <span style="color: #729fcf; font-weight: bold;">from</span> (
  <span style="color: #729fcf; font-weight: bold;">select</span> nspname, relname,
         <span style="color: #729fcf; font-weight: bold;">case</span> <span style="color: #729fcf; font-weight: bold;">when</span> relkind = <span style="color: #ad7fa8; font-style: italic;">'i'</span>
              <span style="color: #729fcf; font-weight: bold;">then</span> (<span style="color: #729fcf; font-weight: bold;">select</span> relname
                      <span style="color: #729fcf; font-weight: bold;">from</span> pg_index x
                           <span style="color: #729fcf; font-weight: bold;">join</span> pg_class xc <span style="color: #729fcf; font-weight: bold;">on</span> x.indrelid = xc.oid
                           <span style="color: #729fcf; font-weight: bold;">join</span> pg_namespace xn <span style="color: #729fcf; font-weight: bold;">on</span> xc.relnamespace = xn.oid
                     <span style="color: #729fcf; font-weight: bold;">where</span> x.indexrelid = c.oid
                    )
              <span style="color: #729fcf; font-weight: bold;">else</span> <span style="color: #729fcf; font-weight: bold;">null</span>
           <span style="color: #729fcf; font-weight: bold;">end</span> <span style="color: #729fcf; font-weight: bold;">as</span> tablename,
         pg_size_pretty(pg_relation_size(c.oid)) <span style="color: #729fcf; font-weight: bold;">as</span> relation,
         pg_total_relation_size(c.oid) <span style="color: #729fcf; font-weight: bold;">as</span> bytes,
         relkind
    <span style="color: #729fcf; font-weight: bold;">from</span> pg_class c <span style="color: #729fcf; font-weight: bold;">join</span> pg_namespace n <span style="color: #729fcf; font-weight: bold;">on</span> c.relnamespace = n.oid
   <span style="color: #729fcf; font-weight: bold;">where</span> c.relkind <span style="color: #729fcf; font-weight: bold;">in</span> (<span style="color: #ad7fa8; font-style: italic;">'r'</span>, <span style="color: #ad7fa8; font-style: italic;">'i'</span>)
         <span style="color: #729fcf; font-weight: bold;">and</span> nspname <span style="color: #729fcf; font-weight: bold;">in</span> (<span style="color: #ad7fa8; font-style: italic;">'public'</span>, <span style="color: #ad7fa8; font-style: italic;">'archive'</span>)
         <span style="color: #729fcf; font-weight: bold;">and</span> pg_total_relation_size(c.oid) &gt; 32 * 1024
<span style="color: #729fcf; font-weight: bold;">order</span> <span style="color: #729fcf; font-weight: bold;">by</span> 5 <span style="color: #729fcf; font-weight: bold;">desc</span>
       ) <span style="color: #729fcf; font-weight: bold;">as</span> s
       ) <span style="color: #729fcf; font-weight: bold;">as</span> t
<span style="color: #729fcf; font-weight: bold;">group</span> <span style="color: #729fcf; font-weight: bold;">by</span> 1
<span style="color: #729fcf; font-weight: bold;">order</span> <span style="color: #729fcf; font-weight: bold;">by</span> 3 <span style="color: #729fcf; font-weight: bold;">desc</span>;
</pre>

<p>Note that by simply removing those last two lines here, you will get a
detailed view of the <em>indexes</em> and <em>tables</em> that are taking the most volume on
disk at your place.</p>

<p>Now, what about using <a href="http://www.postgresql.org/docs/8.4/static/functions-window.html">window functions</a> here so that we get a more
detailed view of historic changes on each partition? With the growth
in percent from the previous partition,
accumulated size per partition and per year, yearly sum, you name it. Here's
another one you might want to try, ready for some tuning (schema name, table
name, etc):</p>

<pre class="src">
<span style="color: #729fcf; font-weight: bold;">WITH</span> s <span style="color: #729fcf; font-weight: bold;">AS</span> (
  <span style="color: #729fcf; font-weight: bold;">select</span> relname,
         pg_relation_size(c.oid) <span style="color: #729fcf; font-weight: bold;">as</span> <span style="color: #729fcf; font-weight: bold;">size</span>,
         pg_total_relation_size(c.oid) <span style="color: #729fcf; font-weight: bold;">as</span> tsize,
         <span style="color: #729fcf;">substring</span>(<span style="color: #729fcf;">substring</span>(relname <span style="color: #729fcf; font-weight: bold;">from</span> <span style="color: #ad7fa8; font-style: italic;">'[0-9]{6}$'</span>) <span style="color: #729fcf; font-weight: bold;">for</span> 4)::bigint <span style="color: #729fcf; font-weight: bold;">as</span> <span style="color: #729fcf; font-weight: bold;">year</span>
    <span style="color: #729fcf; font-weight: bold;">from</span> pg_class c
         <span style="color: #729fcf; font-weight: bold;">join</span> pg_namespace n <span style="color: #729fcf; font-weight: bold;">on</span> n.oid = c.relnamespace
   <span style="color: #729fcf; font-weight: bold;">where</span> c.relkind = <span style="color: #ad7fa8; font-style: italic;">'r'</span>
     <span style="color: #888a85;">-- and n.nspname = 'public'
</span>     <span style="color: #888a85;">-- and c.relname ~ 'stats'
</span>     <span style="color: #729fcf; font-weight: bold;">and</span> <span style="color: #729fcf;">substring</span>(<span style="color: #729fcf;">substring</span>(relname <span style="color: #729fcf; font-weight: bold;">from</span> <span style="color: #ad7fa8; font-style: italic;">'[0-9]{6}$'</span>) <span style="color: #729fcf; font-weight: bold;">for</span> 4)::bigint &gt;= 2008
<span style="color: #729fcf; font-weight: bold;">order</span> <span style="color: #729fcf; font-weight: bold;">by</span> relname
),
     sy <span style="color: #729fcf; font-weight: bold;">AS</span> (
  <span style="color: #729fcf; font-weight: bold;">select</span> relname,
         <span style="color: #729fcf; font-weight: bold;">size</span>,
         tsize,
         <span style="color: #729fcf; font-weight: bold;">year</span>,
         (<span style="color: #729fcf;">sum</span>(<span style="color: #729fcf; font-weight: bold;">size</span>) over w_year)::bigint <span style="color: #729fcf; font-weight: bold;">as</span> ysize,
         (<span style="color: #729fcf;">sum</span>(<span style="color: #729fcf; font-weight: bold;">size</span>) over w_month)::bigint <span style="color: #729fcf; font-weight: bold;">as</span> cumul,
         (lag(<span style="color: #729fcf; font-weight: bold;">size</span>) over (<span style="color: #729fcf; font-weight: bold;">order</span> <span style="color: #729fcf; font-weight: bold;">by</span> relname))::bigint <span style="color: #729fcf; font-weight: bold;">as</span> previous
    <span style="color: #729fcf; font-weight: bold;">from</span> s
  window w_year  <span style="color: #729fcf; font-weight: bold;">as</span> (partition <span style="color: #729fcf; font-weight: bold;">by</span> <span style="color: #729fcf; font-weight: bold;">year</span>),
         w_month <span style="color: #729fcf; font-weight: bold;">as</span> (partition <span style="color: #729fcf; font-weight: bold;">by</span> <span style="color: #729fcf; font-weight: bold;">year</span> <span style="color: #729fcf; font-weight: bold;">order</span> <span style="color: #729fcf; font-weight: bold;">by</span> relname)
),
     syp <span style="color: #729fcf; font-weight: bold;">AS</span> (
  <span style="color: #729fcf; font-weight: bold;">select</span> relname,
         <span style="color: #729fcf; font-weight: bold;">size</span>,
         tsize,
         rank() over (partition <span style="color: #729fcf; font-weight: bold;">by</span> <span style="color: #729fcf; font-weight: bold;">year</span> <span style="color: #729fcf; font-weight: bold;">order</span> <span style="color: #729fcf; font-weight: bold;">by</span> <span style="color: #729fcf; font-weight: bold;">size</span> <span style="color: #729fcf; font-weight: bold;">desc</span>) <span style="color: #729fcf; font-weight: bold;">as</span> rank,
         <span style="color: #729fcf; font-weight: bold;">case</span> <span style="color: #729fcf; font-weight: bold;">when</span> ysize = 0 <span style="color: #729fcf; font-weight: bold;">then</span> ysize
              <span style="color: #729fcf; font-weight: bold;">else</span> round(<span style="color: #729fcf; font-weight: bold;">size</span> / ysize::<span style="color: #8ae234; font-weight: bold;">numeric</span> * 100, 2) <span style="color: #729fcf; font-weight: bold;">end</span> <span style="color: #729fcf; font-weight: bold;">as</span> yp,
         <span style="color: #729fcf; font-weight: bold;">case</span> <span style="color: #729fcf; font-weight: bold;">when</span> previous = 0 <span style="color: #729fcf; font-weight: bold;">then</span> previous
              <span style="color: #729fcf; font-weight: bold;">else</span> round((<span style="color: #729fcf; font-weight: bold;">size</span> / previous::<span style="color: #8ae234; font-weight: bold;">numeric</span> - 1.0) * 100, 2) <span style="color: #729fcf; font-weight: bold;">end</span> <span style="color: #729fcf; font-weight: bold;">as</span> evol,
         cumul,
         <span style="color: #729fcf; font-weight: bold;">year</span>,
         ysize
    <span style="color: #729fcf; font-weight: bold;">from</span> sy
)
  <span style="color: #729fcf; font-weight: bold;">SELECT</span> relname,
         pg_size_pretty(<span style="color: #729fcf; font-weight: bold;">size</span>) <span style="color: #729fcf; font-weight: bold;">as</span> <span style="color: #729fcf; font-weight: bold;">size</span>,
         pg_size_pretty(tsize) <span style="color: #729fcf; font-weight: bold;">as</span> "+indexes",
         evol, yp <span style="color: #729fcf; font-weight: bold;">as</span> "% annuel", rank,
         pg_size_pretty(cumul) <span style="color: #729fcf; font-weight: bold;">as</span> cumul, <span style="color: #729fcf; font-weight: bold;">year</span>,
         pg_size_pretty(ysize) <span style="color: #729fcf; font-weight: bold;">as</span> "yearly <span style="color: #729fcf;">sum</span>",
         pg_size_pretty((<span style="color: #729fcf;">sum</span>(<span style="color: #729fcf; font-weight: bold;">size</span>) over())::bigint) <span style="color: #729fcf; font-weight: bold;">as</span> total
    <span style="color: #729fcf; font-weight: bold;">FROM</span> syp
<span style="color: #729fcf; font-weight: bold;">ORDER</span> <span style="color: #729fcf; font-weight: bold;">BY</span> relname;
</pre>

<p>Hope you'll find it useful, I certainly do!</p>

]]></description>
<author>Dimitri Fontaine</author>
<pubDate>Mon, 26 Jul 2010 17:00:00 CEST</pubDate>
<guid>http://blog.tapoueh.org/blog.dim.html#%20Partitioning%3A%20relation%20size%20per%20%201Cgroup%201D</guid>

</item>

<item>
<title> Emacs and PostgreSQL</title>
<link>http://blog.tapoueh.org/blog.dim.html#%20Emacs%20and%20PostgreSQL</link>
<description><![CDATA[
<p><a name="20100722-09:30" id="20100722-09:30"></a>
Those are my two all-time favorite pieces of Open Source Software. Or <a href="http://www.gnu.org/philosophy/free-sw.html">Free Software</a>
in the <a href="http://www.gnu.org/">GNU</a> sense of the word, as both the <em>BSD</em> and the <em>GPL</em> are labeled free
there. Even if I prefer the <a href="http://www.debian.org/social_contract">Debian Free Software Guidelines</a> as a global
definition and the <a href="http://sam.zoy.org/wtfpl/">WTFPL</a> license. But that's a digression.</p>

<p>I think that <a href="http://www.gnu.org/software/emacs/">Emacs</a> and <a href="http://www.postgresql.org/">PostgreSQL</a> have a lot in common. I'd begin with
the documentation, whose quality is amazing for both projects. Then of
course the extensibility with <a href="http://www.gnu.org/software/emacs/emacs-lisp-intro/html_node/Preface.html#Preface">Emacs Lisp</a> on the one hand and
<a href="http://www.postgresql.org/docs/8.4/static/extend.html">catalog-driven operations</a> on the other hand. Whether you're extending Emacs
or PostgreSQL you'll find that it's pretty easy to tweak the system <em>while
it's running</em>. The other comparison points are less important, like the fact
that both systems get about the same uptime on my laptop (currently <em>13
days, 23 hours, 57 minutes, 10 seconds</em>).</p>

<p>So of course I'm using <em>Emacs</em> to edit <em>PostgreSQL</em> <code>.sql</code> files, including stored
procedures. And it so happens that <a href="http://archives.postgresql.org/pgsql-hackers/2010-07/msg01067.php">line numbering in plpgsql</a> is not as
straightforward as one would naively think, to the point that we'd like to
have better tool support there. So I've extended Emacs <a href="http://www.gnu.org/software/emacs/manual/html_node/emacs/Minor-Modes.html">linum-mode minor mode</a>
to also display the line numbers as computed by PostgreSQL, and here's what
it looks like:</p>

<center>
<p><img src="../images/emacs-pgsql-line-numbers.png" alt=""></p>
</center>

<p>Now, here's also the source code, <a href="static/dim-pgsql.el">dim-pgsql.el</a>. Hope you'll enjoy!</p>

]]></description>
<author>Dimitri Fontaine</author>
<pubDate>Thu, 22 Jul 2010 09:30:00 CEST</pubDate>
<guid>http://blog.tapoueh.org/blog.dim.html#%20Emacs%20and%20PostgreSQL</guid>

</item>

<item>
<title> Background writers</title>
<link>http://blog.tapoueh.org/blog.dim.html#%20Background%20writers</link>
<description><![CDATA[
<p><a name="20100719-16:30" id="20100719-16:30"></a>
There's currently a thread on <a href="http://archives.postgresql.org/pgsql-hackers/">hackers</a> about <a href="http://archives.postgresql.org/pgsql-hackers/2010-07/msg00493.php">bg worker: overview</a> and a series
of 6 patches. Thanks a lot <strong><em>Markus</em></strong>! This is all about generalizing a concept
already in use in the <em>autovacuum</em> process, where you have an independent
subsystem that requires an autonomous <em>daemon</em> running and able to start
its own <em>workers</em>.</p>

<p>I've been advocating generalizing this concept for a while already, in
order to have <em>postmaster</em> able to tell subsystems when to shut down,
start, reload, etc. Some external processes are only external because
there's no need to include them <em>by default</em> in the database engine, not
because there's no sense in having them in there.</p>

<p>So even if <strong><em>Markus</em></strong>'s work is mainly about generalizing <em>autovacuum</em> so that he
has a <em>coordinator</em> to ask for helper backends to handle broadcasting of
<em>writesets</em> for <a href="http://postgres-r.org/">Postgres-R</a>, it still could be a very good first step towards
something more general. What I'd like to see the generalization handle are
things like <a href="http://wiki.postgresql.org/wiki/PGQ_Tutorial">PGQ</a>, or the <em>pgagent scheduler</em>. In some cases, <a href="http://pgbouncer.projects.postgresql.org/doc/usage.html">pgbouncer</a> too.</p>

<p>What we're missing there is an <em>API</em> for everybody to be able to extend
PostgreSQL with its own background processes and workers. What would such a
beast look like? I have some preliminary thoughts about this in my
<a href="char10.html#sec16">Next Generation PostgreSQL</a> article, but those are still early thoughts. The
main idea is to steal as much as sensible from
<a href="http://www.erlang.org/doc/man/supervisor.html">Erlang Generic Supervisor Behaviour</a>, and maybe up to its
<a href="http://www.erlang.org/doc/design_principles/fsm.html">Generic Finite State Machines</a> <em>behavior</em>. In the <em>Erlang</em> world, a <em>behavior</em> is a
generic process.</p>

<p>The <em>FSM</em> approach would allow for any user daemon to provide an initial state
and register functions that would do some processing then change the
state. My feeling is that if those functions are exposed at the SQL level,
then you can <em>talk</em> to the daemon from anywhere (the Erlang ideas include a
globally —cluster wide— unique name). Of course the goal would be to
provide an easy way for the <em>FSM</em> functions to have a backend connected to the
target database handle the work for it, or be able to connect itself. Then
we'd need something else here, a way to produce events based on the clock. I
guess relying on <code>SIGALRM</code> is a possibility.</p>

<p>I'm not sure about how yet, but I think getting back in consultancy after
having opened <a href="http://2ndQuadrant.com">2ndQuadrant</a> <a href="http://2ndQuadrant.fr">France</a> has some influence on how I think about all
that. My guess is that those blog posts are a first step on a nice journey!</p>

]]></description>
<author>Dimitri Fontaine</author>
<pubDate>Mon, 19 Jul 2010 16:30:00 CEST</pubDate>
<guid>http://blog.tapoueh.org/blog.dim.html#%20Background%20writers</guid>

</item>

<item>
<title> Logs analysis</title>
<link>http://blog.tapoueh.org/blog.dim.html#%20Logs%20analysis</link>
<description><![CDATA[
<p><a name="20100713-14:15" id="20100713-14:15"></a>
Nowadays to analyze logs and provide insights, the most common tool to use
is <a href="http://pgfouine.projects.postgresql.org/">pgfouine</a>, which does an excellent job. But there have been some
improvements in logging capabilities that we're not benefiting from yet, and
I'm thinking about the <code>CSV</code> log format.</p>

<p>So the idea would be to turn <em>pgfouine</em> into a set of <code>SQL</code> queries against the
logs themselves once imported into the database. Wait. What about having our
next PostgreSQL version, which is meant (I believe) to include CSV support
in <em>SQL/MED</em>, to directly expose its logs as a system view?</p>

<p>A good thing would be to expose that as a ddl-partitioned table following
the log rotation scheme as set up in <code>postgresql.conf</code>, or maybe given in some
sort of a setup, in order to support <code>logrotate</code> users. At least some
facilities to do that would be welcome, and I'm not sure plain <em>SQL/MED</em>
provides that when it comes to <em>source</em> partitioning.</p>

<p>Then all that remains to be done is a set of <code>SQL</code> queries and some static or
dynamic application to derive reports from there.</p>
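
<p>To give an idea of what such a query could look like, here's a minimal
sketch. It assumes the <code>CSV</code> logs have already been loaded into a table named
<code>postgres_log</code> with <code>log_time</code> and <code>error_severity</code> columns, as in the sample
table definition of the PostgreSQL documentation about <code>csvlog</code>. The table name
and the one day window are only there for illustration:</p>

<pre class="src">
-- Assumes a postgres_log table filled from the csvlog files (e.g. with COPY).
-- Counts log messages per hour and per severity over the last day.
SELECT date_trunc('hour', log_time) AS hour,
       error_severity,
       count(*) AS messages
  FROM postgres_log
 WHERE log_time &gt;= now() - interval '1 day'
 GROUP BY 1, 2
 ORDER BY 1, 2;
</pre>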

<p>This is yet again an idea I have in mind but don't currently have time to
explore myself, so I talk about it here in the hope that others will share
the interest. Of course, now that I work at <a href="http://2ndQuadrant.com">2ndQuadrant</a>, you can make it so
that we consider the idea in more detail, up to implementing and
contributing it!</p>

]]></description>
<author>Dimitri Fontaine</author>
<pubDate>Tue, 13 Jul 2010 14:15:00 CEST</pubDate>
<guid>http://blog.tapoueh.org/blog.dim.html#%20Logs%20analysis</guid>

</item>

<item>
<title> Using indexes as column store?</title>
<link>http://blog.tapoueh.org/blog.dim.html#%20Using%20indexes%20as%20column%20store%3F</link>
<description><![CDATA[
<p><a name="20100708-11:15" id="20100708-11:15"></a>
<a name="%20Using%20indexes%20as%20column%20store%3F" id="%20Using%20indexes%20as%20column%20store%3F"></a>
There's a big trend nowadays about using column storage as opposed to what
PostgreSQL is doing, which would be row storage. The difference is that if
you have the same column value in a lot of rows, you could get to a point
where you have this value only once in the underlying storage file. That
means high compression. Then you tweak the <em>executor</em> to be able to load this
value only once, not once per row, and you cut another huge source of data
traffic (often enough, from disk).</p>

<p>Well, it occurs to me that maybe we could have column oriented storage
support without adding any new storage facility into PostgreSQL itself, just
using in new ways what we already have now. Column oriented storage looks
somewhat like an index, where any given value is meant to appear only
once. And you have <em>links</em> to know where to find the full row associated in
the main storage.</p>

<p>There's a work in progress to allow PostgreSQL to use indexes on their
own, without having to get to the main storage for checking the
visibility. That relies on the <a href="http://www.postgresql.org/docs/8.4/static/storage-vm.html">Visibility Map</a>, which is still only a hint
in released versions. The goal is to turn that into a crash-safe trustworthy
source in the future, so that we get <em>covering indexes</em>. That means we can use
an index and skip going to the full row in main storage just to get the
visibility information.</p>

<p>Now, once we have that, we could consider using the indexes in more
queries. It could be a win to get the column values from the index when
possible and, if you don't <em>output</em> more columns from the <em>heap</em>, to return the
values from there, scanning the index only once per value, not once per row.</p>

<p>There's a little more thought on the point in the <a href="char10.html#sec10">Next Generation PostgreSQL</a>
article I've been referencing already, should you be interested.</p>

]]></description>
<author>Dimitri Fontaine</author>
<pubDate>Thu, 08 Jul 2010 11:15:00 CEST</pubDate>
<guid>http://blog.tapoueh.org/blog.dim.html#%20Using%20indexes%20as%20column%20store%3F</guid>

</item>

<item>
<title> MVCC in the Cloud</title>
<link>http://blog.tapoueh.org/blog.dim.html#%20MVCC%20in%20the%20Cloud</link>
<description><![CDATA[
<p><a name="20100706-10:50" id="20100706-10:50"></a>
<a name="%20MVCC%20in%20the%20Cloud" id="%20MVCC%20in%20the%20Cloud"></a>
At <a href="http://char10.org/">CHAR(10)</a> <strong><em>Markus</em></strong> gave a talk about
<a href="http://char10.org/talk-schedule-details#talk13">Using MVCC for Clustered Database Systems</a> and explained how <a href="http://postgres-r.org/">Postgres-R</a> does
it. The scope of his project is to maintain a set of database servers in the
same state, eventually.</p>

<p>Now, what does it mean to get &quot;In the Cloud&quot;? Well, there is more than one
answer I'm sure; mine would insist on including this &quot;Elasticity&quot; bit. What
I mean here is that it'd be great to be able to add or lose nodes and stay
<em>online</em>. Granted, that's what <em>Postgres-R</em> is providing. Does that make it
ready for the &quot;Cloud&quot;? Well, as it happens, I don't think so.</p>

<p>Once you have elasticity, you also want <em>scalability</em>. That could mean lots of
things, and <em>Postgres-R</em> already provides a great deal of it, at the connect
and reads level: you can do your business <em>unlimited</em> on any node, the others
will eventually (<em>eagerly</em>) catch-up, and you can do your <code>select</code> on any node
too, reading from the same data set. Eventually.</p>

<p>What's still missing here is the hard sell, <em>write scalability</em>. This is the
idea that you don't want to sustain the same <em>write load</em> on all the members
of the &quot;Cloud cluster&quot;. It happens that I have some ideas about how to go about
this, and this time I've been trying to write them down. You might be
interested in the <a href="http://tapoueh.org/char10.html#sec3">MVCC in the Cloud</a> part of my <a href="http://tapoueh.org/char10.html">Next Generation PostgreSQL</a>
notes.</p>

<p>My opinion is that if you want to distribute the data, this is a problem
that falls in the category of finding the data on disk. This problem is
already solved in the executor, it knows which operating system level file
to open and where to seek inside that in order to find a row value for a
given relation. So it should be possible to teach it that some relation's
storage ain't local, and that to get the data it needs to communicate with another
PostgreSQL instance.</p>

<p>I would call that a <em>remote tablespace</em>. It allows for distributing both the
data and their processing, which could happen in parallel. Of course that
means there are now some latency concerns, and that some <em>JOINs</em> will get slow if
you need to retrieve the data from the network each time. For that what I'm
thinking about is the possibility to manage a local copy of a remote
tablespace, which would be a <em>mirror tablespace</em>. But that's for another blog
post.</p>
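<p>Here's a toy Python model of the idea, only to show where the latency comes
from and what a local copy buys you. The three class names are hypothetical,
nothing like this exists in PostgreSQL, and the sketch deliberately ignores
visibility, invalidation and transaction boundaries:</p>

<pre class="src">
class LocalTablespace:
    """Rows live in this process, standing in for files on local disk."""

    def __init__(self, relations):
        self.relations = relations

    def read(self, relation, position):
        return self.relations[relation][position]


class RemoteTablespace:
    """Rows live on another node; every read is a (simulated) network round trip."""

    def __init__(self, node):
        self.node = node
        self.round_trips = 0

    def read(self, relation, position):
        self.round_trips += 1
        return self.node.read(relation, position)


class MirrorTablespace:
    """Naive local copy of a remote tablespace: only the first read goes remote."""

    def __init__(self, remote):
        self.remote = remote
        self.cache = {}

    def read(self, relation, position):
        key = (relation, position)
        if key not in self.cache:
            self.cache[key] = self.remote.read(relation, position)
        return self.cache[key]


other_node = LocalTablespace({"orders": ["row 0", "row 1"]})
remote = RemoteTablespace(other_node)
mirror = MirrorTablespace(remote)

# A join probing the same remote row three times pays for one round trip only.
for _ in range(3):
    mirror.read("orders", 1)
print(remote.round_trips)
</pre>

<p>The executor-side caller only ever sees <code>read()</code>, which is the whole point:
where the row actually lives is a storage detail.</p>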

<p>Oh, if that makes you think a lot of <a href="http://wiki.postgresql.org/wiki/SQL/MED">SQL/MED</a>, that would mean I did a good
enough job at explaining the idea. The main difference though would be to
ensure transaction boundaries over the local and remote data: it's one
single distributed database we're talking about here.</p>

]]></description>
<author>Dimitri Fontaine</author>
<pubDate>Tue, 06 Jul 2010 10:50:00 CEST</pubDate>
<guid>http://blog.tapoueh.org/blog.dim.html#%20MVCC%20in%20the%20Cloud</guid>

</item>

<item>
<title> Back from CHAR(10)</title>
<link>http://blog.tapoueh.org/blog.dim.html#%20Back%20from%20CHAR</link>
<description><![CDATA[
<p><a name="20100705-09:30" id="20100705-09:30"></a>
<a name="%20Back%20from%20CHAR" id="%20Back%20from%20CHAR"></a>
It surely does not feel like a full month and some more went by since we
were enjoying <a href="http://www.pgcon.org/2010/">PGCon 2010</a>, but in fact it was already the time for
<a href="http://char10.org/talk-schedule">CHAR(10)</a>. The venue was most excellent, as Oxford is a very beautiful
city. Also, the college was like a city in the city, and having the
accommodation all in there really smoothed it all.</p>

<p>From a more technical viewpoint, the <a href="http://char10.org/talk-schedule">range of topics</a> we talked about and the
even broader one in the <em>&quot;Hall Track&quot;</em> filled my mind with ideas, again. So
I'm preparing a quite lengthy article to summarise or present all those
ideas, and I think a post series should cover the points in there. When
trying to label things, it appears that my current obsessions are mainly
about <em>PostgreSQL in the Cloud</em> and <em>Further Optimising PostgreSQL</em>, so that's
what I'll be talking about those next days.</p>

<p>Meanwhile I'm going to search for existing solutions on how to use the
<a href="http://en.wikipedia.org/wiki/Paxos_algorithm">Paxos algorithm</a> to generate a reliable distributed sequence, using <a href="http://libpaxos.sourceforge.net/">libpaxos</a>
for example. The goal would be to see if it's feasible to have a way to
offer some global <code>XID</code> from a network of servers in a distributed fashion,
ideally in such a way that new members can join in at any point, and of
course that losing a member does not cause downtime for the online ones. It
sounds like this problem has been extensively researched and is solved,
either by the <em>Group Communication Systems</em> or the underlying
algorithms. Given our community's current lack of buy-in for <code>GCS</code>, my guess
is that bypassing them would be a pretty good move, even if that means
implementing a limited form of <code>GCS</code> ourselves.</p>

]]></description>
<author>Dimitri Fontaine</author>
<pubDate>Mon, 05 Jul 2010 09:30:00 CEST</pubDate>
<guid>http://blog.tapoueh.org/blog.dim.html#%20Back%20from%20CHAR</guid>

</item>

<item>
<title> Back from PgCon2010</title>
<link>http://blog.tapoueh.org/blog.dim.html#%20Back%20from%20PgCon2010</link>
<description><![CDATA[
<p><a name="20100527-14:26" id="20100527-14:26"></a>
<a name="%20Back%20from%20PgCon2010" id="%20Back%20from%20PgCon2010"></a>
This year's edition has been the <a href="http://www.pgcon.org/2010/">best pgcon</a> ever for me. Granted, it's only
my third time, but still :) As <a href="http://blog.endpoint.com/2010/05/pgcon-hall-track.html">Josh said</a> the <em>&quot;Hall Track&quot;</em> in particular was
very good, and the <a href="http://wiki.postgresql.org/wiki/PgCon_2010_Developer_Meeting">Dev Meeting</a> has been very effective!</p>

<h3>Extensions</h3>

<p class="first">This time I prepared some <a href="http://wiki.postgresql.org/wiki/Image:Pgcon2010-dev-extensions.pdf">slides to present the extension design</a> and I tried
hard to make it so that we get to agree on a plan, even recognizing it's not
solving all of our problems from the get go. I had been talking about the
concept and design with lots of people already, and continued to do so while in
Ottawa on Monday evening and through all Tuesday. So Wednesday, I felt
prepared. It proved to be a good thing, as I edited the slides with ideas from
several people I had the chance to expose my ideas to! Thanks <em>Greg Stark</em> and
<em>Heikki Linnakangas</em> for the part we talked about at the meeting, and a lot more
people for the things we'll have to solve later (Hi <em>Stefan</em>!).</p>

<p>So the current idea for <strong>extensions</strong> is for the <em>backend</em> support to start with a
file in <code>`pg_config --sharedir`/extensions/foo/control</code> containing
the <em>foo</em> extension's <em>metadata</em>. From that we know if we can install an extension
and how. Here's an example:</p>

<pre class="src">
name = foo
version = 1.0
custom_variable_classes = 'foo'
depends  = bar (&gt;= 1.1), baz
conflicts = bla (&lt; 0.8)
</pre>
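<p>As a rough illustration of how little is needed to read such a file, here's
a minimal Python sketch. The format is still only a proposal, so the parsing
rules below (one <code>key = value</code> per line, comma separated dependency lists,
an optional constraint in parentheses) are my assumptions, and
<code>parse_control</code> and <code>parse_requirement</code> are made-up names:</p>

<pre class="src">
import re


def parse_requirement(item):
    """Split 'bar (constraint)' into (name, constraint); constraint may be None."""
    match = re.match(r"\s*(\w+)\s*(?:\((.*)\))?\s*$", item)
    if match is None:
        raise ValueError("cannot parse requirement: %r" % item)
    return match.group(1), match.group(2)


def parse_control(text):
    """Parse 'key = value' lines into a dict of extension metadata."""
    meta = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Only the first '=' separates key from value, so '(&gt;= 1.1)' survives.
        key, _, value = line.partition("=")
        meta[key.strip()] = value.strip().strip("'")
    for key in ("depends", "conflicts"):
        if key in meta:
            meta[key] = [parse_requirement(item) for item in meta[key].split(",")]
    return meta


control = """
name = foo
version = 1.0
custom_variable_classes = 'foo'
depends  = bar (&gt;= 1.1), baz
conflicts = bla (&lt; 0.8)
"""

for key, value in sorted(parse_control(control).items()):
    print(key, value)
</pre>

<p>The constraint is kept as an opaque string here; comparing version numbers
is a separate problem that this sketch doesn't try to solve.</p>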

<p>The other files should be <code>install.sql</code>, <code>uninstall.sql</code> and <code>foo.conf</code>. The only
command the user will have to type in order to use the extension in their
database will then be:</p>

<pre class="src">
  INSTALL EXTENSION foo;
</pre>

<p>For that to work all that needs to happen is for me to write the code. I'll
keep you informed as soon as I get a chance to resume my activities on the
<a href="http://git.postgresql.org/gitweb?p=postgresql-extension.git;a=shortlog;h=refs/heads/extension">git branch</a> I'm using. You can already find my first attempt at a
<code>pg_execute_from_file()</code> function <a href="http://git.postgresql.org/gitweb?p=postgresql-extension.git;a=commitdiff;h=6eed4eca0179cbdeb737b9783084e9f03fcb7470">there</a>.</p>

<p>Building atop that backend support we already have two gentlemen competing on
features to offer to <a href="http://justatheory.com/computers/databases/postgresql/pgan-bikeshedding.html">distribute</a> and <a href="http://petereisentraut.blogspot.com/2010/05/postgresql-package-management.html">package</a> extensions! That will complete the
work just fine, thanks guys.</p>


<h3>Hot Standby</h3>

<p class="first">Heikki's talk about <a href="http://www.pgcon.org/2010/schedule/events/264.en.html">Built-in replication in PostgreSQL 9.0</a> left me with a lot to
think about. In particular it seems we need two projects out of core to complete
what <code>9.0</code> has to offer, namely something very simple to prepare a base backup
and something more involved to manage a pool of standbys.</p>

<h4>pg_basebackup</h4>

<p class="first">The idea I had listening to the talk was that it might be possible to ask the
server, in a single SQL query, for the list of all the files it's using. After
all, there are those <code>pg_ls_dir()</code> and <code>pg_read_file()</code> functions, we could put
them to good use. I couldn't get the idea out of my head, so I had to write
some code and see it running: <a href="http://github.com/dimitri/pg_basebackup">pg_basebackup</a> is there at <code>github</code>, grab a copy!</p>

<p>What it does is very simple: in about 100 lines of self-contained Python code
it gets all the files from a running server through a normal PostgreSQL
connection. That was my first <a href="http://www.postgresql.org/docs/8.4/interactive/queries-with.html">recursive query</a>. I had to create a new function
to get the file contents as the existing one returns text, and I want <code>bytea</code>
here, of course.</p>
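<p>The traversal itself can be sketched in a few lines of Python. This is not
the code from the repository, only an illustration of the walk, written
against two callables so that it runs without a server: <code>ls_dir(path)</code> and
<code>is_dir(path)</code> are stand-ins that, against a real connection, I'd expect to
wrap queries on <code>pg_ls_dir()</code> and <code>pg_stat_file()</code>. The fake directory tree
is invented for the example:</p>

<pre class="src">
def list_files(ls_dir, is_dir, root="."):
    """Return every regular file below `root`, walking directories iteratively.

    ls_dir(path) returns the entry names of a directory and is_dir(path)
    tells directories from files.
    """
    files = []
    pending = [root]
    while pending:
        path = pending.pop()
        for name in ls_dir(path):
            child = path + "/" + name
            if is_dir(child):
                pending.append(child)
            else:
                files.append(child)
    return sorted(files)


# Fake data directory standing in for a running server (hypothetical layout).
tree = {
    ".": ["PG_VERSION", "base", "global"],
    "./base": ["1"],
    "./base/1": ["1259", "PG_VERSION"],
    "./global": ["pg_control"],
}

for path in list_files(tree.__getitem__, tree.__contains__):
    print(path)
</pre>

<p>Doing the same walk server side in one recursive query saves a round trip
per directory, which is why the single query approach is attractive.</p>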

<p>Note that the code depends on the <code>bytea</code> representation in use, so it's only
working with <code>9.0</code> as of now. Can be changed easily though, send a patch or just
ask me to do it!</p>

<p>Lastly, note that even if <code>pg_basebackup</code> will compress each chunk it sends over
the <code>libpq</code> connection, it won't be your fastest option around. Its only
advantage there is its simplicity. Get the code, run it with 2 arguments: a
connection string and a destination directory. There you are.</p>


<h4>wal proxy, wal relay</h4>

<p class="first">The other thing that we'll miss in <code>9.0</code> is the ability to both manage more than
a couple of <em>standby</em> servers and to manage failover gracefully. Here the idea
would be to have a proxy server acting as both a <em>walreceiver</em> and a
<em>walsender</em>. Its role would be to both <em>archive</em> the WAL and <em>relay</em> them to the real
standbys.</p>

<p>Then in case of master's failure, we could instruct this <em>proxy</em> to be fed from
the elected new master (manual procedure), the other standbys not being
affected. Well, apart from the fact that changing the <em>timeline</em> (which will happen
as soon as you promote a standby to master) while streaming is apparently not meant to be
supported. So the <em>proxy</em> would also disconnect all the <em>slaves</em> and have them
reconnect.</p>

<p>If we need such finesse, we could have the <code>restore_command</code> on the <em>standbys</em>
prepared so that it'll connect to the <em>proxy's archive</em>. Now on failover, the
<em>standbys</em> are disconnected from the stream, get a <code>WAL</code> file with a new <em>timeline</em>
from the <em>archive</em>, replay it, and reconnect.</p>

<p>That means that for a full <code>HA</code> scenario you could get on with three
servers. You're back to two servers at failover time and need to rebuild the
crashed master as a standby, running a base backup again.</p>

<p>If you've followed the idea, I hope you liked it! I still have to motivate some
volunteers so that some work gets done here, as I'm probably not the one to ask
as far as coding this is concerned, if you want it out before <code>9.1</code> kicks in!</p>



<h3>Queuing</h3>

<p class="first">We also had a nice <em>Hall Track</em> session with <em>Jan Wieck</em>, <em>Marko Kreen</em> and <em>Jim Nasby</em>
about how to get a single general (enough) queueing solution for PostgreSQL. It
happens that the Slony queueing ideas made their way into <code>PGQ</code> and that we'd
want to add some more capabilities to this one.</p>

<p>What we talked about was adding more interfaces (event producers, event format
translating at both ends of the pipe) and optimising how many events from the
past we keep in the queue for the subscribers, in a cascading environment.</p>

<p>It seems that the basic architecture of the queue is what <code>PGQ 3</code> provides
already, so it could even be not that much of a hassle to get something working
out of the ideas exchanged.</p>

<p>Of course, one of those ideas has been discussed at the <a href="http://wiki.postgresql.org/wiki/PgCon_2010_Developer_Meeting">Dev Meeting</a>, it's about
deriving the transaction commit order from the place which already has the
information rather than <em>reconstructing</em> it after the fact. We'll see how it
goes, but it started pretty well with a design mail thread.</p>


<h3>Other talks</h3>

<p class="first">I went to some other talks too, of course, unfortunately with an attention span
far from constant. Between the social events (you should read that as <em>beer
drinking evenings</em>) and the hall tracks, more than once my brain was less
present than my body in the talks. I won't risk commenting on them here, but
overall it was very good: in about each talk, new ideas popped into my
head. And I love that.</p>


<h3>Conclusion: I'm addicted.</h3>

<p class="first">The social aspect of the conference has been very good too. Once more, a warm
welcome from the people that are central to the project, and who are so easily
available for a chat about any aspect of it! Or just for sharing a drink.</p>

<p>Meeting our users is very important too, and <a href="http://www.pgcon.org/2010/">pgcon</a> allows for that also. I've
met some people I'm used to talking to via <code>IRC</code>, and it was good fun sharing a beer
over there.</p>

<p>All in all, I'm very happy I made it to Ottawa despite the volcano activity,
there's so much happening over there! Thanks to all the people who made it
possible by either organizing the conference or attending it! See you next
year, I'm addicted...</p>



]]></description>
<author>Dimitri Fontaine</author>
<pubDate>Thu, 27 May 2010 14:26:00 CEST</pubDate>
<guid>http://blog.tapoueh.org/blog.dim.html#%20Back%20from%20PgCon2010</guid>

</item>

<item>
<title> Import fixed width data with pgloader</title>
<link>http://blog.tapoueh.org/blog.dim.html#%20Import%20fixed%20width%20data%20with%20pgloader</link>
<description><![CDATA[
<p><a name="20100427-12:01" id="20100427-12:01"></a>
So, following previous blog entries about importing <em>fixed width</em> data, from
<a href="http://www.postgresonline.com/journal/index.php?/archives/157-Import-fixed-width-data-into-PostgreSQL-with-just-PSQL.html">Postgres Online Journal</a> and <a href="http://people.planetpostgresql.org/dfetter/index.php?/archives/58-psql&#44;-Paste&#44;-Perl-Pefficiency&#33;.html">David (perl) Fetter</a>, I couldn't resist following
the meme and showing how to achieve the same thing with <a href="http://pgloader.projects.postgresql.org/#toc9">pgloader</a>.</p>

<p>I can't say how much I dislike such things as the following, and I can't
help thinking that non IT people are right looking at us like this when
encountering such prose.</p>

<pre class="src">
  map {s<span style="color: #ad7fa8; font-style: italic;">/\D*(\d+)-(\d+).*/$a.="A".(1+$2-$1). " "/</span>e} split(<span style="color: #ad7fa8; font-style: italic;">/\n/</span>,&lt;&lt;<span style="color: #ad7fa8; font-style: italic;">'EOT'</span>);
</pre>

<p>So, the <em>pgloader</em> way. First you need to have set up a database, I called it
<code>pgloader</code> here. Then you need the same <code>CREATE TABLE</code> as in the original
article, here it is for completeness:</p>

<pre class="src">
<span style="color: #729fcf; font-weight: bold;">CREATE</span> <span style="color: #729fcf; font-weight: bold;">TABLE</span> <span style="color: #edd400; font-weight: bold; font-style: italic;">places</span>(usps <span style="color: #8ae234; font-weight: bold;">char</span>(2) <span style="color: #729fcf; font-weight: bold;">NOT</span> <span style="color: #729fcf; font-weight: bold;">NULL</span>,
    fips <span style="color: #8ae234; font-weight: bold;">char</span>(2) <span style="color: #729fcf; font-weight: bold;">NOT</span> <span style="color: #729fcf; font-weight: bold;">NULL</span>,
    fips_code <span style="color: #8ae234; font-weight: bold;">char</span>(5),
    loc_name <span style="color: #8ae234; font-weight: bold;">varchar</span>(64));
</pre>

<p>Now the data file I've taken here:
<a href="http://www.census.gov/tiger/tms/gazetteer/places2k.txt">http://www.census.gov/tiger/tms/gazetteer/places2k.txt</a>.</p>

<p>Then we translate the file description into <em>pgloader</em> setup:</p>

<pre class="src">
[<span style="color: #8ae234; font-weight: bold;">pgsql</span>]
<span style="color: #eeeeec;">host</span> = localhost
<span style="color: #eeeeec;">port</span> = 5432
<span style="color: #eeeeec;">base</span> = pgloader
<span style="color: #eeeeec;">user</span> = dim
<span style="color: #eeeeec;">pass</span> = None

<span style="color: #eeeeec;">log_file</span>            = /tmp/pgloader.log
<span style="color: #eeeeec;">log_min_messages</span>    = DEBUG
<span style="color: #eeeeec;">client_min_messages</span> = WARNING

<span style="color: #eeeeec;">client_encoding</span> = <span style="color: #ad7fa8; font-style: italic;">'latin1'</span>
<span style="color: #eeeeec;">lc_messages</span>         = C
<span style="color: #eeeeec;">pg_option_standard_conforming_strings</span> = on

[<span style="color: #8ae234; font-weight: bold;">fixed</span>]
<span style="color: #eeeeec;">table</span>           = places
<span style="color: #eeeeec;">format</span>          = fixed
<span style="color: #eeeeec;">filename</span>        = places2k.txt
<span style="color: #eeeeec;">columns</span>         = *
<span style="color: #eeeeec;">fixed_specs</span>     = usps:0:2, fips:2:2, fips_code:4:5, loc_name:9:64, p:73:9, h:82:9, land:91:14, water:105:14, ldm:119:12, wtm:131:12, lat:143:10, long:153:11
</pre>
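<p>If the <code>name:start:length</code> notation isn't obvious, here's a small Python
sketch of what such a specification means. It's an illustration of the
notation with zero-based offsets, not <em>pgloader</em>'s own code; the helper names
and the sample record are made up:</p>

<pre class="src">
def parse_fixed_specs(specs):
    """Turn 'name:start:length, ...' into a list of (name, start, length)."""
    fields = []
    for item in specs.split(","):
        name, start, length = item.strip().split(":")
        fields.append((name, int(start), int(length)))
    return fields


def split_fixed(line, fields):
    """Slice one fixed width line into a dict of raw, untrimmed values."""
    return {name: line[start:start + length] for name, start, length in fields}


fields = parse_fixed_specs("usps:0:2, fips:2:2, fips_code:4:5, loc_name:9:64")

# Illustrative record in the same layout, space padded to the column width.
line = "AL" + "01" + "00124" + "Abbeville city".ljust(64)
row = split_fixed(line, fields)
print(row["usps"], row["fips"], row["fips_code"], repr(row["loc_name"]))
</pre>

<p>Each field starts where the previous one ends, so <code>start + length</code> of one
entry should equal the <code>start</code> of the next.</p>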

<p>We're ready to import the data now:</p>

<pre class="src">
dim ~/PostgreSQL/examples pgloader -vsTc pgloader.conf
pgloader     INFO     Logger initialized
pgloader     WARNING  path entry '/usr/share/python-support/pgloader/reformat' does not exists, ignored
pgloader     INFO     Reformat path is []
pgloader     INFO     Will consider following sections:
pgloader     INFO       fixed
pgloader     INFO     Will load 1 section at a time
fixed        INFO     columns = *, got [('usps', 1), ('fips', 2), ('fips_code', 3), ('loc_name', 4)]
fixed        INFO     Loading threads: 1
fixed        INFO     closing current database connection
fixed        INFO     fixed processing
fixed        INFO     TRUNCATE TABLE places;
pgloader     INFO     All threads are started, wait for them to terminate
fixed        INFO     COPY 1: 10000 rows copied in 5.769s
fixed        INFO     COPY 2: 10000 rows copied in 5.904s
fixed        INFO     COPY 3: 5375 rows copied in 3.187s
fixed        INFO     No data were rejected
fixed        INFO      25375 rows copied in 3 commits took 14.907 seconds
fixed        INFO     No database error occured
fixed        INFO     closing current database connection
fixed        INFO     releasing fixed semaphore
fixed        INFO     Announce it's over

Table name        |    duration |    size |  copy rows |     errors
====================================================================
fixed             |     14.901s |       - |      25375 |          0
</pre>

<p>Note the <code>-T</code> option is for <code>TRUNCATE</code>, which you only need when you want to
redo the loading; I've come to always mention it in interactive usage. The
<code>-v</code> option is for some more <em>verbosity</em> and the <code>-s</code> for the <em>summary</em> at the end of
operations.</p>

<p>With the <code>pgloader.conf</code> and <code>places2k.txt</code> in the current directory, and an
empty table, just typing in <code>pgloader</code> at the prompt would have done the job.</p>

<p>Oh, the <code>pg_option_standard_conforming_strings</code> bit is from the <a href="http://github.com/dimitri/pgloader">git HEAD</a>, the
current released version has no support for setting any PostgreSQL knob
yet. Still, it's not necessary here, so you can forget about it.</p>

<p>You will also notice that <em>pgloader</em> didn't trim the data for you, which ain't
funny for the <em>loc_name</em> column. That's a drawback of the fixed width format
that you can work around in two ways here, either by means of</p>

<pre class="src">
<span style="color: #729fcf; font-weight: bold;">UPDATE</span> places <span style="color: #729fcf; font-weight: bold;">SET</span> loc_name = <span style="color: #729fcf;">trim</span>(loc_name);
</pre>

<p>or a custom reformat module for <em>pgloader</em>. I guess the latter solution is overkill, but
it allows for <em>pipe</em> style processing of the data and a single database write.</p>

<p>Send me a mail if you want me to show here how to set up such a reformatting
module in a future blog entry!</p>

]]></description>
<author>Dimitri Fontaine</author>
<pubDate>Tue, 27 Apr 2010 12:01:00 CEST</pubDate>
<guid>http://blog.tapoueh.org/blog.dim.html#%20Import%20fixed%20width%20data%20with%20pgloader</guid>

</item>

<item>
<title> pgloader activity report</title>
<link>http://blog.tapoueh.org/blog.dim.html#%20pgloader%20activity%20report</link>
<description><![CDATA[
<p><a name="20100406-09:10" id="20100406-09:10"></a>
Yes. This <a href="http://pgloader.projects.postgresql.org/">pgloader</a> project is still maintained and somewhat
active. Development happens when I receive a complaint, either about a bug
in existing code or a feature in yet-to-write code. If you have a bug to
report, just send me an email!</p>

<p>If you're following the development of it, the sources just moved from <code>CVS</code>
at <a href="http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/pgloader/pgloader/">pgfoundry</a> to <a href="http://github.com/dimitri/pgloader">http://github.com/dimitri/pgloader</a>. I will still put the
releases at <a href="http://pgfoundry.org/projects/pgloader">pgfoundry</a>, and the existing binary packages maintenance should
continue. See also the <a href="http://pgloader.projects.postgresql.org/dev/pgloader.1.html">development version documentation</a>, which contains not
yet released stuff.</p>

<p>This time it's about new features, the goal being to open <em>pgloader</em> usage
without describing all the file format related details in the
<code>pgloader.conf</code> file. This time around, <a href="http://database-explorer.blogspot.com/">Simon</a> is giving feedback and told me
he would appreciate it if pgloader worked more like the competition.</p>

<p>We're getting there with some new options. The first one is that rather than
only <code>Sections</code>, now you can give a <code>filename</code> as an argument. <em>pgloader</em> will
then create a configuration section for you, considering the file format to
be <code>CSV</code>, setting <code>columns = *</code>. The default <em>field separator</em> is <code>|</code>,
and you also have the <code>-f, --field-separator</code> option to set that from the
command line.</p>

<p>As if that wasn't enough, <em>pgloader</em> now supports any <a href="http://www.postgresql.org/">PostgreSQL</a> option either
in the configuration file (prefix the real name with <code>pg_option_</code>) or on the
command line, via the <code>-o, --pg-options</code> switch, that you can use more than
once. Command line setting will take precedence over any other setup, of
course. Consider for example <code>-o standard_conforming_strings=on</code>.</p>
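<p>The precedence rule is easy to picture with a few lines of Python. This is
a sketch of the behaviour described above, not the actual <em>pgloader</em> source;
<code>effective_pg_options</code> is a made-up name and the sample settings are only
there for the example:</p>

<pre class="src">
def effective_pg_options(config, command_line):
    """Merge pg_option_* config entries with -o name=value command line settings."""
    prefix = "pg_option_"
    options = {
        key[len(prefix):]: value
        for key, value in config.items()
        if key.startswith(prefix)
    }
    for setting in command_line:
        name, _, value = setting.partition("=")
        options[name.strip()] = value.strip()  # the command line wins
    return options


config = {
    "client_encoding": "latin1",
    "pg_option_standard_conforming_strings": "off",
}
print(effective_pg_options(config, ["standard_conforming_strings=on"]))
</pre>

<p>Entries without the <code>pg_option_</code> prefix, such as <code>client_encoding</code> here, are
regular <em>pgloader</em> settings and are left alone by this merge.</p>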

<p>While at it, some more options can now be set on the command line, including
<code>-t, --section-threads</code> and <code>-m, --max-parallel-sections</code> on the one hand and
<code>-r, --reject-log</code> and <code>-j, --reject-data</code> on the other hand. Those last two
must contain a <code>%s</code> placeholder which will get replaced by the <em>section</em> name,
or the <code>filename</code> if you skipped setting up a <em>section</em> for it.</p>

<p>Your <em>pgloader</em> usage is now more command line friendly than ever!</p>

]]></description>
<author>Dimitri Fontaine</author>
<pubDate>Tue, 06 Apr 2010 09:10:00 CEST</pubDate>
<guid>http://blog.tapoueh.org/blog.dim.html#%20pgloader%20activity%20report</guid>

</item>

<item>
<title> Finding orphaned sequences</title>
<link>http://blog.tapoueh.org/blog.dim.html#%20Finding%20orphaned%20sequences</link>
<description><![CDATA[
<p><a name="20100317-13:35" id="20100317-13:35"></a>
<a name="%20Finding%20orphaned%20sequences" id="%20Finding%20orphaned%20sequences"></a>
This time we're having a database where <em>sequences</em> were used, but not
systematically as a <em>default value</em> of a given column. It's mainly an historic
bad idea, but you know the usual excuse with bad ideas and bad code: the
first 6 months it's experimental, after that it's historic.</p>

<p>Still, here's a query for <code>8.4</code> that will allow you to list those <em>sequences</em>
you have that are not used as a default value in any of your tables:</p>

<pre class="src">
<span style="color: #729fcf; font-weight: bold;">WITH</span> seqs <span style="color: #729fcf; font-weight: bold;">AS</span> (
  <span style="color: #729fcf; font-weight: bold;">SELECT</span> n.nspname, relname <span style="color: #729fcf; font-weight: bold;">as</span> seqname
    <span style="color: #729fcf; font-weight: bold;">FROM</span> pg_class c
         <span style="color: #729fcf; font-weight: bold;">JOIN</span> pg_namespace n <span style="color: #729fcf; font-weight: bold;">on</span> n.oid = c.relnamespace
   <span style="color: #729fcf; font-weight: bold;">WHERE</span> relkind = <span style="color: #ad7fa8; font-style: italic;">'S'</span>
),
     attached_seqs <span style="color: #729fcf; font-weight: bold;">AS</span> (
  <span style="color: #729fcf; font-weight: bold;">SELECT</span> n.nspname,
         c.relname <span style="color: #729fcf; font-weight: bold;">as</span> tablename,
         (regexp_matches(pg_get_expr(d.adbin, d.adrelid), <span style="color: #ad7fa8; font-style: italic;">'''([^'']+)'''</span>))[1] <span style="color: #729fcf; font-weight: bold;">as</span> seqname
    <span style="color: #729fcf; font-weight: bold;">FROM</span> pg_class c
         <span style="color: #729fcf; font-weight: bold;">JOIN</span> pg_namespace n <span style="color: #729fcf; font-weight: bold;">on</span> n.oid = c.relnamespace
         <span style="color: #729fcf; font-weight: bold;">JOIN</span> pg_attribute a <span style="color: #729fcf; font-weight: bold;">on</span> a.attrelid = c.oid
         <span style="color: #729fcf; font-weight: bold;">JOIN</span> pg_attrdef d <span style="color: #729fcf; font-weight: bold;">on</span> d.adrelid = a.attrelid
                            <span style="color: #729fcf; font-weight: bold;">and</span> d.adnum = a.attnum
                            <span style="color: #729fcf; font-weight: bold;">and</span> a.atthasdef
  <span style="color: #729fcf; font-weight: bold;">WHERE</span> relkind = <span style="color: #ad7fa8; font-style: italic;">'r'</span> <span style="color: #729fcf; font-weight: bold;">and</span> a.attnum &gt; 0
        <span style="color: #729fcf; font-weight: bold;">and</span> pg_get_expr(d.adbin, d.adrelid) ~ <span style="color: #ad7fa8; font-style: italic;">'^nextval'</span>
)

 <span style="color: #729fcf; font-weight: bold;">SELECT</span> nspname, seqname, tablename
   <span style="color: #729fcf; font-weight: bold;">FROM</span> seqs s
        <span style="color: #729fcf; font-weight: bold;">LEFT</span> <span style="color: #729fcf; font-weight: bold;">JOIN</span> attached_seqs a <span style="color: #729fcf; font-weight: bold;">USING</span>(nspname, seqname)
  <span style="color: #729fcf; font-weight: bold;">WHERE</span> a.tablename <span style="color: #729fcf; font-weight: bold;">IS</span> <span style="color: #729fcf; font-weight: bold;">NULL</span>;
</pre>
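<p>If the two <code>CTE</code>s look dense, the logic is just an anti-join on a name
extracted by regexp. Here's the same reasoning as a minimal Python sketch
working on plain tuples instead of the catalogs; the function names and the
sample data are hypothetical:</p>

<pre class="src">
import re


def sequence_in_default(expr):
    """Return the quoted name inside a nextval(...) default, or None."""
    if not expr.startswith("nextval"):
        return None
    match = re.search(r"'([^']+)'", expr)
    return match.group(1) if match else None


def orphaned(sequences, defaults):
    """sequences: set of (schema, seqname); defaults: (schema, table, expr) tuples."""
    attached = set()
    for schema, _table, expr in defaults:
        name = sequence_in_default(expr)
        if name is not None:
            attached.add((schema, name))
    # What the LEFT JOIN ... IS NULL keeps: sequences no default refers to.
    return sorted(sequences - attached)


sequences = {("public", "users_id_seq"), ("public", "legacy_seq")}
defaults = [
    ("public", "users", "nextval('users_id_seq'::regclass)"),
    ("public", "users", "now()"),
]
print(orphaned(sequences, defaults))
</pre>

<p>Like the query, this pairs the extracted name with the schema of the table,
so a default that points at a sequence in another schema would not match.</p>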

<p>I hope you don't need the query...</p>

]]></description>
<author>Dimitri Fontaine</author>
<pubDate>Wed, 17 Mar 2010 12:35:00 CET</pubDate>
<guid>http://blog.tapoueh.org/blog.dim.html#%20Finding%20orphaned%20sequences</guid>

</item>

<item>
<title> Getting out of SQL_ASCII, part 2</title>
<link>http://blog.tapoueh.org/blog.dim.html#%20Getting%20out%20of%20SQL_ASCII%2C%20part%202</link>
<description><![CDATA[
<p><a name="20100223-17:30" id="20100223-17:30"></a>
<a name="%20Getting%20out%20of%20SQL_ASCII%2C%20part%202" id="%20Getting%20out%20of%20SQL_ASCII%2C%20part%202"></a>
So, if you followed the previous blog entry, now you have a new database
containing all the <em>static</em> tables encoded in <code>UTF-8</code> rather than
<code>SQL_ASCII</code>. Because if it was not yet the case, you now severely distrust
this non-encoding.</p>

<p>Now is the time to have a look at properly encoding the <em>live</em> data, those
stored in tables that continue to receive write traffic. The idea is to use
the <code>UPDATE</code> facilities of PostgreSQL to tweak the data, and to fix the
applications so as not to continue inserting badly encoded strings in there.</p>

<h3>Finding non UTF-8 data</h3>

<p class="first">First you want to find out the badly encoded data. You can do that with this
helper function that <a href="http://blog.rhodiumtoad.org.uk/">RhodiumToad</a> gave me on IRC. I had a version from the
archives before that, but the <em>regexp</em> was hard to maintain and quote into a
<code>PL</code> function. This is avoided by two means, first one is to have a separate
pure <code>SQL</code> function for the <em>regexp</em> checking (so that you can index it should
you need to) and the other one is to apply the regexp to <code>hex</code> encoded
data. Here we go:</p>

<pre class="src">
<span style="color: #729fcf; font-weight: bold;">create</span> <span style="color: #729fcf; font-weight: bold;">or</span> replace <span style="color: #729fcf; font-weight: bold;">function</span> <span style="color: #edd400; font-weight: bold; font-style: italic;">public.utf8hex_valid</span>(str text)
 <span style="color: #729fcf; font-weight: bold;">returns</span> <span style="color: #8ae234; font-weight: bold;">boolean</span>
 <span style="color: #729fcf; font-weight: bold;">language</span> <span style="color: #729fcf; font-weight: bold;">sql</span> immutable
<span style="color: #729fcf; font-weight: bold;">as</span> $f$
   <span style="color: #729fcf; font-weight: bold;">select</span> $1 ~ $r$(?x)
                  ^(?:(?:[0-7][0-9a-f])
                     |(?:(?:c[2-9a-f]|d[0-9a-f])
                        |e0[ab][0-9a-f]
                        |ed[89][0-9a-f]
                        |(?:(?:e[1-9abcef])
                           |f0[9ab][0-9a-f]
                           |f[1-3][89ab][0-9a-f]
                           |f48[0-9a-f]
                          )[89ab][0-9a-f]
                       )[89ab][0-9a-f]
                    )*$
                $r$;
$f$;
</pre>
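
<p>A quick sanity check of the function, as a sketch: the inputs are hand
written <code>hex</code> strings, not data from the post. <code>c3a9</code> is <em>é</em> encoded in <code>UTF-8</code>
and should be accepted, while the lone <code>latin1</code> byte <code>e9</code> should be rejected
because a lead byte without its continuation bytes can't match the <em>regexp</em>.</p>

<pre class="src">
select public.utf8hex_valid('c3a9') as utf8_bytes,   -- expected: true
       public.utf8hex_valid('e9')   as latin1_byte;  -- expected: false

-- the same check applied to a text value, the way the scripts below do it
select public.utf8hex_valid(encode(textsend('hello'::text), 'hex'));
</pre>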

<p>Now some little scripting around it in order to skip intense manual and
boring work (and see, some more catalog queries). Don't forget we will have
to work on a per-column basis here...</p>

<pre class="src">
<span style="color: #729fcf; font-weight: bold;">create</span> <span style="color: #729fcf; font-weight: bold;">or</span> replace <span style="color: #729fcf; font-weight: bold;">function</span> <span style="color: #edd400; font-weight: bold; font-style: italic;">public.check_encoding_utf8</span>
 (
   <span style="color: #729fcf; font-weight: bold;">IN</span> schemaname text,
   <span style="color: #729fcf; font-weight: bold;">IN</span> tablename  text,
  <span style="color: #729fcf; font-weight: bold;">OUT</span> relname    text,
  <span style="color: #729fcf; font-weight: bold;">OUT</span> attname    text,
  <span style="color: #729fcf; font-weight: bold;">OUT</span> <span style="color: #729fcf;">count</span>      bigint
 )
 <span style="color: #729fcf; font-weight: bold;">returns</span> setof record
 <span style="color: #729fcf; font-weight: bold;">language</span> plpgsql
<span style="color: #729fcf; font-weight: bold;">as</span> $f$
<span style="color: #729fcf; font-weight: bold;">DECLARE</span>
  v_sql text;
<span style="color: #729fcf; font-weight: bold;">BEGIN</span>
  <span style="color: #729fcf; font-weight: bold;">FOR</span> relname, attname
   <span style="color: #729fcf; font-weight: bold;">IN</span> <span style="color: #729fcf; font-weight: bold;">SELECT</span> c.relname, a.attname
        <span style="color: #729fcf; font-weight: bold;">FROM</span> pg_attribute a
             <span style="color: #729fcf; font-weight: bold;">JOIN</span> pg_class c <span style="color: #729fcf; font-weight: bold;">on</span> a.attrelid = c.oid
             <span style="color: #729fcf; font-weight: bold;">JOIN</span> pg_namespace s <span style="color: #729fcf; font-weight: bold;">on</span> s.oid = c.relnamespace
             <span style="color: #729fcf; font-weight: bold;">JOIN</span> pg_roles r <span style="color: #729fcf; font-weight: bold;">on</span> r.oid = c.relowner
       <span style="color: #729fcf; font-weight: bold;">WHERE</span> s.nspname = schemaname
         <span style="color: #729fcf; font-weight: bold;">AND</span> atttypid <span style="color: #729fcf; font-weight: bold;">IN</span> (25, 1043) <span style="color: #888a85;">-- text, varchar
</span>         <span style="color: #729fcf; font-weight: bold;">AND</span> relkind = <span style="color: #ad7fa8; font-style: italic;">'r'</span>          <span style="color: #888a85;">-- ordinary table
</span>         <span style="color: #729fcf; font-weight: bold;">AND</span> r.rolname = <span style="color: #ad7fa8; font-style: italic;">'some_specific_role'</span>
         <span style="color: #729fcf; font-weight: bold;">AND</span> <span style="color: #729fcf; font-weight: bold;">CASE</span> <span style="color: #729fcf; font-weight: bold;">WHEN</span> tablename <span style="color: #729fcf; font-weight: bold;">IS</span> <span style="color: #729fcf; font-weight: bold;">NOT</span> <span style="color: #729fcf; font-weight: bold;">NULL</span>
                  <span style="color: #729fcf; font-weight: bold;">THEN</span> c.relname ~ tablename
                  <span style="color: #729fcf; font-weight: bold;">ELSE</span> <span style="color: #729fcf; font-weight: bold;">true</span>
              <span style="color: #729fcf; font-weight: bold;">END</span>
  LOOP
    v_sql := <span style="color: #ad7fa8; font-style: italic;">'SELECT count(*) '</span>
          || <span style="color: #ad7fa8; font-style: italic;">'  FROM ONLY '</span>|| schemaname || <span style="color: #ad7fa8; font-style: italic;">'.'</span> || relname
          || <span style="color: #ad7fa8; font-style: italic;">' WHERE NOT public.utf8hex_valid(encode(textsend('</span>
          || attname
          || <span style="color: #ad7fa8; font-style: italic;">'), ''hex''))'</span>;

    <span style="color: #888a85;">-- RAISE NOTICE 'Checking: %.%', relname, attname;
</span>    <span style="color: #888a85;">-- RAISE NOTICE 'SQL: %', v_sql;
</span>    <span style="color: #729fcf; font-weight: bold;">EXECUTE</span> v_sql <span style="color: #729fcf; font-weight: bold;">INTO</span> <span style="color: #729fcf;">count</span>;
    <span style="color: #729fcf; font-weight: bold;">RETURN</span> <span style="color: #729fcf; font-weight: bold;">NEXT</span>;
  <span style="color: #729fcf; font-weight: bold;">END</span> LOOP;
<span style="color: #729fcf; font-weight: bold;">END</span>;
$f$;
</pre>

<p>Note that the <code>tablename</code> is compared using the <code>~</code> operator, so that's <em>regexp</em>
matching there too. Also note that I wanted only to check those tables that
are owned by a specific role, your case may vary.</p>

<p>The way I used this function was like this:</p>

<pre class="src">
<span style="color: #729fcf; font-weight: bold;">create</span> <span style="color: #729fcf; font-weight: bold;">table</span> <span style="color: #edd400; font-weight: bold; font-style: italic;">leon.check_utf8</span> <span style="color: #729fcf; font-weight: bold;">as</span>
 <span style="color: #729fcf; font-weight: bold;">select</span> *
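   <span style="color: #729fcf; font-weight: bold;">from</span> public.check_encoding_utf8(<span style="color: #ad7fa8; font-style: italic;">'public'</span>, <span style="color: #729fcf; font-weight: bold;">NULL</span>); <span style="color: #888a85;">-- example schema name; NULL tablename checks every table</span>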
</pre>

<p>Then you need to take action on those lines in the <code>leon.check_utf8</code> table which
have a <code>count &gt; 0</code>. Rinse and repeat, but you may soon realise that building the
table over and over again is costly.</p>
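
<p>Here's a sketch of how that can look. The schema <code>public</code> and the table
<code>customers</code> are made-up names for the example. The second query relies on the
<code>tablename</code> <em>regexp</em> to check a single table again rather than rebuilding
everything.</p>

<pre class="src">
-- columns that still hold badly encoded data, worst first
select relname, attname, "count"
  from leon.check_utf8
 where "count" &gt; 0
 order by "count" desc;

-- check only one table again (hypothetical schema and table names)
select *
  from public.check_encoding_utf8('public', '^customers$');
</pre>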


<h3>Cleaning up the data</h3>

<p class="first">Up for some more helper tools? Unless you really want to manually fix this
huge amount of columns where some data ain't <code>UTF-8</code> compatible... here's some
more:</p>

<pre class="src">
<span style="color: #729fcf; font-weight: bold;">create</span> <span style="color: #729fcf; font-weight: bold;">or</span> replace <span style="color: #729fcf; font-weight: bold;">function</span> <span style="color: #edd400; font-weight: bold; font-style: italic;">leon.nettoyeur</span>
 (
  <span style="color: #729fcf; font-weight: bold;">IN</span>  <span style="color: #729fcf; font-weight: bold;">action</span>      text,
  <span style="color: #729fcf; font-weight: bold;">IN</span>  encoding    text,
  <span style="color: #729fcf; font-weight: bold;">IN</span>  tablename   text,
  <span style="color: #729fcf; font-weight: bold;">IN</span>  columname   text,

  <span style="color: #729fcf; font-weight: bold;">OUT</span> orig        text,
  <span style="color: #729fcf; font-weight: bold;">OUT</span> utf8        text
 )
 <span style="color: #729fcf; font-weight: bold;">returns</span> setof record
 <span style="color: #729fcf; font-weight: bold;">language</span> plpgsql
<span style="color: #729fcf; font-weight: bold;">as</span> $f$
<span style="color: #729fcf; font-weight: bold;">DECLARE</span>
  p_convert text;
<span style="color: #729fcf; font-weight: bold;">BEGIN</span>
  IF encoding <span style="color: #729fcf; font-weight: bold;">IS</span> <span style="color: #729fcf; font-weight: bold;">NULL</span>
  <span style="color: #729fcf; font-weight: bold;">THEN</span>
    p_convert := <span style="color: #ad7fa8; font-style: italic;">'translate('</span>
              || columname || <span style="color: #ad7fa8; font-style: italic;">', '</span>
              || $$<span style="color: #ad7fa8; font-style: italic;">'\211\203\202'</span>$$
              || <span style="color: #ad7fa8; font-style: italic;">', '</span>
              || $$<span style="color: #ad7fa8; font-style: italic;">'   '</span>$$
              || <span style="color: #ad7fa8; font-style: italic;">') '</span>;
  <span style="color: #729fcf; font-weight: bold;">ELSE</span>
    <span style="color: #888a85;">-- in 8.2, write convert using, in 8.3, the other expression
</span>    <span style="color: #888a85;">-- p_convert := 'convert(' || columname || ' using ' || conversion || ') ';
</span>    p_convert := <span style="color: #ad7fa8; font-style: italic;">'convert(textsend('</span> || columname || <span style="color: #ad7fa8; font-style: italic;">'), '''</span>|| encoding ||<span style="color: #ad7fa8; font-style: italic;">''', ''utf-8'' ) '</span>;
  <span style="color: #729fcf; font-weight: bold;">END</span> IF;

  IF <span style="color: #729fcf; font-weight: bold;">action</span> = <span style="color: #ad7fa8; font-style: italic;">'select'</span>
  <span style="color: #729fcf; font-weight: bold;">THEN</span>
    <span style="color: #729fcf; font-weight: bold;">FOR</span> orig, utf8
     <span style="color: #729fcf; font-weight: bold;">IN</span> <span style="color: #729fcf; font-weight: bold;">EXECUTE</span> <span style="color: #ad7fa8; font-style: italic;">'SELECT '</span> || columname || <span style="color: #ad7fa8; font-style: italic;">', '</span>
         || p_convert
         || <span style="color: #ad7fa8; font-style: italic;">'  FROM ONLY '</span> || tablename
         || <span style="color: #ad7fa8; font-style: italic;">' WHERE not public.utf8hex_valid('</span>
         || <span style="color: #ad7fa8; font-style: italic;">'encode(textsend('</span>|| columname ||<span style="color: #ad7fa8; font-style: italic;">'), ''hex''))'</span>
    LOOP
      <span style="color: #729fcf; font-weight: bold;">RETURN</span> <span style="color: #729fcf; font-weight: bold;">NEXT</span>;
    <span style="color: #729fcf; font-weight: bold;">END</span> LOOP;

  ELSIF <span style="color: #729fcf; font-weight: bold;">action</span> = <span style="color: #ad7fa8; font-style: italic;">'update'</span>
  <span style="color: #729fcf; font-weight: bold;">THEN</span>
    <span style="color: #729fcf; font-weight: bold;">EXECUTE</span> <span style="color: #ad7fa8; font-style: italic;">'UPDATE ONLY '</span> || tablename
         || <span style="color: #ad7fa8; font-style: italic;">' SET '</span> || columname || <span style="color: #ad7fa8; font-style: italic;">' = '</span> || p_convert
         || <span style="color: #ad7fa8; font-style: italic;">' WHERE not public.utf8hex_valid('</span>
         || <span style="color: #ad7fa8; font-style: italic;">'encode(textsend('</span>|| columname ||<span style="color: #ad7fa8; font-style: italic;">'), ''hex''))'</span>;

    <span style="color: #729fcf; font-weight: bold;">FOR</span> orig, utf8
     <span style="color: #729fcf; font-weight: bold;">IN</span> <span style="color: #729fcf; font-weight: bold;">SELECT</span> *
          <span style="color: #729fcf; font-weight: bold;">FROM</span> leon.nettoyeur(<span style="color: #ad7fa8; font-style: italic;">'select'</span>, encoding, tablename, columname)
    LOOP
      <span style="color: #729fcf; font-weight: bold;">RETURN</span> <span style="color: #729fcf; font-weight: bold;">NEXT</span>;
    <span style="color: #729fcf; font-weight: bold;">END</span> LOOP;

  <span style="color: #729fcf; font-weight: bold;">ELSE</span>
    RAISE <span style="color: #729fcf; font-weight: bold;">EXCEPTION</span> <span style="color: #ad7fa8; font-style: italic;">'L&#233;on, Nettoyeur, veut de l''action.'</span>;

  <span style="color: #729fcf; font-weight: bold;">END</span> IF;
<span style="color: #729fcf; font-weight: bold;">END</span>;
$f$;
</pre>

<p>As you can see, this function lets you check the conversion from a
supposed source encoding before actually converting the data in place. This
is very useful, as even when you're pretty sure the non-utf8 data is <code>latin1</code>,
sometimes you find it's <code>windows-1252</code> or such. So double check before telling
<code>leon.nettoyeur()</code> to update your precious data!</p>

<p>Also, there's a facility to use <code>translate()</code> when none of the encodings matches
your expectations. This is a skeleton just replacing invalid characters with
a <code>space</code>, tweak it at will!</p>
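
<p>A typical session could look like the following sketch. The table
<code>public.customers</code> and its column <code>name</code> are made-up names, and <code>latin1</code> is only
the guess being tested. The dry run comes first, the update only once the
converted values look right.</p>

<pre class="src">
-- dry run: compare the stored value with its supposed utf8 conversion
select *
  from leon.nettoyeur('select', 'latin1', 'public.customers', 'name')
 limit 20;

-- convert in place; the function then returns the rows that still fail
-- the check, so a count of 0 means the column is clean
begin;
select count(*)
  from leon.nettoyeur('update', 'latin1', 'public.customers', 'name');
commit;

-- no encoding fits: pass NULL to get the translate() based cleaning
select *
  from leon.nettoyeur('select', NULL, 'public.customers', 'name');
</pre>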


<h3>Conclusion</h3>

<p class="first">Enjoy your clean database now, even if it still accepts new data that will
probably not pass the checks, so we still have to be careful about that and
re-clean every day until the migration is effective. Or maybe add a <code>CHECK</code>
clause that will reject badly encoded data...</p>
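
<p>Such a <code>CHECK</code> clause could be sketched like this, reusing the helper
function from above. The table, column and constraint names are made up, and
adding the constraint will fail as long as the column still contains badly
encoded rows, so clean it first.</p>

<pre class="src">
alter table public.customers
  add constraint customers_name_utf8_check
      check (public.utf8hex_valid(encode(textsend(name), 'hex')));
</pre>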

<p>In fact here we're using <a href="http://wiki.postgresql.org/wiki/Londiste_Tutorial">Londiste</a> to replicate the <em>live</em> data from the old to
the new server, and that means the replication will break each time there's
new data written in non-utf8, as the new server is running <code>8.4</code>, which by
design ain't very forgiving. Our plan is to clean-up as we go (remove table
from the <em>subscriber</em>, fix it, add it again) and migrate as soon as possible!</p>

<p>Bonus points to those of you getting the convoluted reference :)</p>



]]></description>
<author>Dimitri Fontaine</author>
<pubDate>Tue, 23 Feb 2010 16:30:00 CET</pubDate>
<guid>http://blog.tapoueh.org/blog.dim.html#%20Getting%20out%20of%20SQL_ASCII%2C%20part%202</guid>

</item>

<item>
<title> Getting out of SQL_ASCII, part 1</title>
<link>http://blog.tapoueh.org/blog.dim.html#%20Getting%20out%20of%20SQL_ASCII%2C%20part%201</link>
<description><![CDATA[
<p><a name="20100218-11:37" id="20100218-11:37"></a>
<a name="%20Getting%20out%20of%20SQL_ASCII%2C%20part%201" id="%20Getting%20out%20of%20SQL_ASCII%2C%20part%201"></a>
It happens that you have to manage databases <em>designed</em> by your predecessor,
and it even happens that the team used to not have a <em>DBA</em>. Those <em>histerical
raisins</em> can lead to having a <code>SQL_ASCII</code> database. The horror!</p>

<p>What <code>SQL_ASCII</code> means, if you're not already familiar with the consequences
of such a choice, is that all the <code>text</code> and <code>varchar</code> data that you put in the
database is accepted as-is. No checks. At all. It's pretty nice when you're
too lazy to deal with <em>strange</em> errors in your application, but if
you think that it's a smart move, please go read
<a href="http://www.joelonsoftware.com/articles/Unicode.html">The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)</a>
by <a href="http://www.joelonsoftware.com/">Joel Spolsky</a> now. I said now, I'm waiting for you to get back here. Yes,
I'll wait.</p>

<p>The problem of course is not being able to read the data you just stored,
which is seldom the use case anywhere you use a database solution such as
<a href="http://www.postgresql.org/">PostgreSQL</a>.</p>

<p>Now, it happens too that it's high time to get off of <code>SQL_ASCII</code>, the
infamous. In our case we're lucky enough in that the data are all in fact
<code>latin1</code> or about that, and this comes from the fact that all the applications
connecting to the database are sharing some common code and setup. Then we
have some tables that can be tagged <em>archives</em> and some other <em>live</em>. This blog
post will only deal with the former category.</p>

<p>For those tables that are not receiving changes anymore, we will migrate
them by using a simple but time hungry method: <code>COPY OUT|recode|COPY IN</code>. I've
tried to use <code>iconv</code> for recoding our data, but it failed to do so in lots of
cases, so I've switched to using the <a href="http://www.gnu.org/software/recode/recode.html">GNU recode</a> tool, which works just fine.</p>

<p>The fact that it takes so much time doing the conversion is not really a
problem here, as you can do it <em>offline</em>, while the applications are still
using the <code>SQL_ASCII</code> database. So, here's the program's help:</p>

<pre class="src">
recode.sh [-npdf0TI] [-U user ] -s schema [-m mintable] pattern
        -d    debug
        -n    dry run, only print table names and expected files
        -s    schema
        -m    mintable, to skip already processed once
        -U    connect to PostgreSQL as user
        -f    force table loading even when export files do exist
        -0    only (re)load tables with zero-sized copy files
        -T    Truncate the tables before COPYing recoded data
        -I    Temporarily drop the indexes of the table while COPYing
   pattern    ^table_name_, e.g.
</pre>

<p>The <code>-I</code> option is neat enough to create the indexes in parallel, but with no
upper limit on the number of index creations launched at once. In our case it
worked well, so I didn't have to bother.</p>

<p>Take a look at the <a href="static/recode.sh">recode.sh</a> script, and don't hesitate to edit it for your
purpose. It's missing some obvious options to be useful in general, such
as the <code>recode</code> <em>request</em> which is currently hardcoded to <code>l1..utf8</code>. If there's
any demand for it, I'll set up a <a href="http://github.com/dimitri">GitHub</a> project for the little script.</p>

<p>We'll get back to the subject of this entry in <em>part 2</em>, dealing with how to
recode your data in the database itself, thanks to some insane regexp based
queries and helper functions. And thanks to a great deal of IRC based
helping, too.</p>

]]></description>
<author>Dimitri Fontaine</author>
<pubDate>Thu, 18 Feb 2010 10:37:00 CET</pubDate>
<guid>http://blog.tapoueh.org/blog.dim.html#%20Getting%20out%20of%20SQL_ASCII%2C%20part%201</guid>

</item>

<item>
<title> Resetting sequences. All of them, please!</title>
<link>http://blog.tapoueh.org/blog.dim.html#%20Resetting%20sequences%2E%20All%20of%20them%2C%20please%21</link>
<description><![CDATA[
<p><a name="20100216-16:23" id="20100216-16:23"></a>
<a name="%20Reseting%20sequences%2E%20All%20of%20them%2C%20please%21" id="%20Reseting%20sequences%2E%20All%20of%20them%2C%20please%21"></a>
So, after restoring a production dump with intermediate filtering, none of
our sequences were set to the right value. I could have tried to review the
process of filtering the dump here, but it's a <em>one-shot</em> action and you know
what that sometimes means. With some pressure you don't script enough of it
and you just crawl more and more.</p>

<p>Still, I think how I solved it is worthy of a blog entry. Not that it's
about a super unusual <em>clever</em> trick, quite the contrary, because questions
involving this trick are often encountered on the support <code>IRC</code>.</p>

<p>The idea is to query the catalog for all sequences, and produce from there
the <code>SQL</code> command you will have to issue for each of them. Once you have this
query, it's quite easy to arrange from the <code>psql</code> prompt as if you had dynamic
scripting capabilities. Of course in <code>9.0</code> you will have <em>inline anonymous</em> <code>DO</code>
blocks.</p>

<pre class="src">
#&gt; \o /tmp/sequences.sql
#&gt; \t
Showing only tuples.
#&gt; YOUR QUERY HERE
#&gt; \o
#&gt; \t
Tuples only is off.
</pre>

<p>Once you have the <code>/tmp/sequences.sql</code> file, you can ask <code>psql</code> to execute its
commands as usual, that is using <code>\i</code> in an explicit transaction block.</p>
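
<p>For the record, that step is nothing more than the following at the <code>psql</code>
prompt:</p>

<pre class="src">
BEGIN;
\i /tmp/sequences.sql
-- have a look at a few sequences here before making it permanent
COMMIT;
</pre>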

<p>Now, the interesting part, if you got here attracted by the blog entry title,
is in fact the query itself. A nice way to start is to <code>\set ECHO_HIDDEN</code> then
describe some table: you now have a catalog example query to work with. Then
you tweak it somehow and get this:</p>

<pre class="src">
  <span style="color: #729fcf; font-weight: bold;">SELECT</span> <span style="color: #ad7fa8; font-style: italic;">'select '</span>
          || <span style="color: #729fcf;">trim</span>(<span style="color: #729fcf; font-weight: bold;">trailing</span> <span style="color: #ad7fa8; font-style: italic;">')'</span>
             <span style="color: #729fcf; font-weight: bold;">from</span> replace(pg_get_expr(d.adbin, d.adrelid),
                          <span style="color: #ad7fa8; font-style: italic;">'nextval'</span>, <span style="color: #ad7fa8; font-style: italic;">'setval'</span>))
          || <span style="color: #ad7fa8; font-style: italic;">', (select max( '</span> || a.attname || <span style="color: #ad7fa8; font-style: italic;">') from only '</span>
          || nspname || <span style="color: #ad7fa8; font-style: italic;">'.'</span> || relname || <span style="color: #ad7fa8; font-style: italic;">'));'</span>
    <span style="color: #729fcf; font-weight: bold;">FROM</span> pg_class c
         <span style="color: #729fcf; font-weight: bold;">JOIN</span> pg_namespace n <span style="color: #729fcf; font-weight: bold;">on</span> n.oid = c.relnamespace
         <span style="color: #729fcf; font-weight: bold;">JOIN</span> pg_attribute a <span style="color: #729fcf; font-weight: bold;">on</span> a.attrelid = c.oid
         <span style="color: #729fcf; font-weight: bold;">JOIN</span> pg_attrdef d <span style="color: #729fcf; font-weight: bold;">on</span> d.adrelid = a.attrelid
                            <span style="color: #729fcf; font-weight: bold;">and</span> d.adnum = a.attnum
                            <span style="color: #729fcf; font-weight: bold;">and</span> a.atthasdef
  <span style="color: #729fcf; font-weight: bold;">WHERE</span> relkind = <span style="color: #ad7fa8; font-style: italic;">'r'</span> <span style="color: #729fcf; font-weight: bold;">and</span> a.attnum &gt; 0
        <span style="color: #729fcf; font-weight: bold;">and</span> pg_get_expr(d.adbin, d.adrelid) ~ <span style="color: #ad7fa8; font-style: italic;">'^nextval'</span>;
</pre>
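
<p>For a table <code>public.foo</code> whose column <code>id</code> defaults to
<code>nextval('foo_id_seq'::regclass)</code> (made-up names), this query produces the line
<code>select setval('foo_id_seq'::regclass, (select max( id) from only public.foo));</code>.
With the <code>DO</code> blocks of <code>9.0</code> the intermediate file could be skipped
altogether. The following is an untested sketch of that idea, wrapping the
very same query:</p>

<pre class="src">
DO $$
DECLARE
  v_sql text;
BEGIN
  FOR v_sql IN
      SELECT 'select '
              || trim(trailing ')'
                 from replace(pg_get_expr(d.adbin, d.adrelid),
                              'nextval', 'setval'))
              || ', (select max( ' || a.attname || ') from only '
              || nspname || '.' || relname || '));'
        FROM pg_class c
             JOIN pg_namespace n on n.oid = c.relnamespace
             JOIN pg_attribute a on a.attrelid = c.oid
             JOIN pg_attrdef d on d.adrelid = a.attrelid
                                and d.adnum = a.attnum
                                and a.atthasdef
       WHERE relkind = 'r' and a.attnum &gt; 0
             and pg_get_expr(d.adbin, d.adrelid) ~ '^nextval'
  LOOP
    -- run each generated setval() statement, discarding its result
    EXECUTE v_sql;
  END LOOP;
END;
$$;
</pre>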

<p>Coming next, a <code>recode</code> based script in order to get from <code>SQL_ASCII</code> to <code>UTF-8</code>,
and some strange looking queries too.</p>

<pre class="src">
recode.sh [-npdf0TI] [-U user ] -s schema [-m mintable] pattern
</pre>

<p>Stay tuned!</p>

<h2>20091208-12:04 pg_staging's bird view</h2>

<p><a name="20091208-12:04" id="20091208-12:04"></a>
<a name="%20pg_staging%27s%20bird%20view" id="%20pg_staging%27s%20bird%20view"></a>
The most important piece of feedback I got about the presentation of <a href="pgstaging.html">pgstaging</a>
was the lack of pictures, something like a bird's-eye view of how you operate
it. Well, thanks to <a href="http://ditaa.sourceforge.net/">ditaa</a> and Emacs <code>picture-mode</code> here it is:</p>

<center>
<p><img src="../images/pg_staging.png" alt=""></p>
</center>

<p>Hope you enjoy, it should not be necessary to comment much if I got to the
point!</p>

<p>Of course I committed the <a href="http://github.com/dimitri/pg_staging/blob/master/bird-view.txt">text source file</a> to the <code>Git</code> repository. The only
problem I ran into is that <code>ditaa</code> defaults to outputting a quite big right
margin containing only white pixels, and that didn't fit well, visually, in
this blog. So I had to resort to the <a href="http://www.imagemagick.org/script/command-line-options.php#crop">ImageMagick crop command</a> in order to avoid
any mouse usage in the production of this diagram.</p>

<pre class="src">
convert .../pg_staging/bird-view.png -crop <span style="color: #ad7fa8; font-style: italic;">'!550'</span> bird-view.png
mv bird-view-0.png pg_staging.png
</pre>

<p>Quicker than learning to properly use a mouse, at least for me :)</p>


<h2>20091201-16:45 PGday.eu feedback</h2>

<p><a name="20091201-16:45" id="20091201-16:45"></a>
<a name="%20PGday%2Eeu%20feedback" id="%20PGday%2Eeu%20feedback"></a>
At <a href="http://2009.pgday.eu/">pgday</a> there was this form you could fill in to give speakers some <em>feedback</em>
about their talks. And that's a really nice way as a speaker to know what to
improve. And as <a href="http://blog.hagander.net/archives/157-Feedback-from-pgday.eu.html">Magnus</a> was searching for a nice looking chart facility in python
and I spoke about <a href="http://matplotlib.sourceforge.net/gallery.html">matplotlib</a>, I felt I had to publish something.</p>

<p>Here is my try at some nice graphics. Well I'll let you decide how nice the
result is:</p>

<center>
<p><a class="image-link" href="../images/feedback.png">
<img src="../images/feedback.png"></a></p>
</center>

<p>If you want to see the little python script I used, here it is: <a href="http://pgsql.tapoueh.org/confs/pgday_2009/feedback.py">feedback.py</a>,
with the data embedded and all...</p>

<p>Now, how to read it? Well, the darker the color the better the score. For
example I had <code>5</code> people score me <code>5</code> for <em>Topic Importance</em> on the Hi-Media talk
(in French) and only <code>3</code> people giving that same score on that topic for the <code>pg_staging</code>
talk. The scores are from <code>1</code> to <code>5</code>, <code>5</code> being the best.</p>

<p>The committee accepted interesting enough topics and it seems I managed to
deliver acceptable content from there. Not very good content, because
reading the comments I missed some nice bird's-eye pictures to help the
audience get into the subject. As I'm unable to draw (with or without a
mouse) I plan to fix this in later talks by using <a href="http://ditaa.sourceforge.net/">ditaa</a>, the <em>DIagrams
Through Ascii Art</em> tool. I already used it and together with <a href="news.dim.html">Emacs</a>
<code>picture-mode</code> it's very nice.</p>

<p>Oh yes, the bottom line of this post is that there will be later talks. I seem
to like giving them, and the audience feedback this time says that it's
not too bad for them. See you soon :)</p>


<h2>20091130-12:10 prefix 1.1.0</h2>

<p><a name="20091130-12:10" id="20091130-12:10"></a>
<a name="%20prefix%201%2E1%2E0" id="%20prefix%201%2E1%2E0"></a>
So I had two <a href="http://archives.postgresql.org/pgsql-general/2009-11/msg01042.php">bug</a> <a href="http://lists.pgfoundry.org/pipermail/prefix-users/2009-November/000005.html">reports</a> about <a href="prefix.html">prefix</a> in less than a week. It means several
things, one of them is that my code is getting used in the wild, which is
nice. The other side of the coin is that people do find bugs in there. This
one is about the behavior of the <code>btree opclass</code> of the type <code>prefix range</code>. We
cheat a lot there by simply having written one, because a range does not
have a strict ordering: is <code>[1-3]</code> before or after <code>[2-4]</code>? But when you know
you have no overlapping intervals in your <code>prefix_range</code> column, being able to
have it part of a <em>primary key</em> is damn useful.</p>

<p>Note: in <code>8.5</code> we should have a way to express <em>exclusion constraints</em> and have
PostgreSQL forbid overlapping entries for us. Not being there yet, you
could write a <em>constraint trigger</em> and use the <em>GiST index</em> to have nice speed
there, which is exactly what this <em>exclusion constraint</em> support is about.</p>

<p>It turns out the code change required is pretty simple:</p>

<pre class="src">
-    <span style="color: #729fcf; font-weight: bold;">return</span> (a-&gt;first == b-&gt;first) ? (a-&gt;last - b-&gt;last) : (a-&gt;first - b-&gt;first);
+    <span style="color: #888a85;">/*</span><span style="color: #888a85;">
+     * we are comparing e.g. '1' and '12' (the shorter contains the
+     * smaller), so let's pretend '12' &lt; '1' as it contains less elements.
+     </span><span style="color: #888a85;">*/</span>
+    <span style="color: #729fcf; font-weight: bold;">return</span> (alen == mlen) ? 1 : -1;
</pre>

<p>This happens in the <em>compare support function</em> (see
<a href="http://www.postgresql.org/docs/8.4/interactive/xindex.html">Interfacing Extensions To Indexes</a>) so that means you now have to rebuild
your <code>prefix_range</code> btree indexes, hence the version number bump.</p>


<h2>20091125-11:49 Yet Another PostgreSQL tool hits debian</h2>

<p><a name="20091125-11:49" id="20091125-11:49"></a>
<a name="%20Yet%20Another%20PostgreSQL%20tool%20hits%20debian" id="%20Yet%20Another%20PostgreSQL%20tool%20hits%20debian"></a>
So there it is, this newer contribution of mine that I presented at <a href="http://2009.pgday.eu">PGDay</a> is
now in the <code>debian NEW</code> queue. <a href="pgstaging.html">pg_staging</a> will empower you with respect to what
you do about those nightly backups (<code>pg_dump -Fc</code> or something).</p>

<p>The tool provides a lot of commands to either <code>dump</code> or <code>restore</code> a database. It
comes with documentation covering just about all of it, except for the <em>londiste</em>
support part, which will be there in time for the <code>1.0.0</code> release. The <a href="http://github.com/dimitri/pg_staging/blob/master/TODO">Todo list</a>
is getting smaller and smaller, the version you'll soon find in <code>debian sid</code>
is already called <code>0.9</code>.</p>

<p>So, how do you go about using this software, and what service does it implement?</p>

<h3>it's all about deriving a staging environment from your backups</h3>

<p class="first">To validate backups, you want to restore them and check the database you get
from them. And your developers will sometimes want to refresh the database
they're working with. And you could have both an integration environment and
a pre-live one: on the former, you develop new code atop a stable set of
data; while on the latter you test stable enough code (ready to go live) on
a set of data as close to live data as possible.</p>

<p>And you want to be flexible about it, so that restoring databases each and
every day doesn't turn into a full-time job, be it for project A integration or
project B pre-live testing, or project C accounting snapshot. Or you name
it.</p>

<p>And of course you want to have a single point of control of all your
databases. Let's call it the <em>controller</em>.</p>


<h3>setting up pg_staging</h3>

<p class="first">The <a href="pgstaging.html">pg_staging</a> setup consists of one <code>pg_staging.ini</code> file wherein you
describe your different target databases (those <code>dev</code> and <code>prelive</code> ones), and
of course where to get the production backups from. Currently you have to
serve the backup files in a format suitable for <code>pg_restore</code> (that means you
use either <code>pg_dump -Ft</code> or <code>pg_dump -Fc</code>) from an <code>apache</code> folder. The produced
<code>HTML</code> listing will get parsed.</p>

<p>So you set up the <code>DEFAULT</code> section with common settings, then one section per
target: the databases you want to restore. Tell <code>pg_staging</code> where they are
(<code>host</code>), etc, and it'll be able to drive them.</p>

<p>In order to be able to host more than a single restored dump on a staging
server, for the same database, we use <code>pgbouncer</code>:</p>

<pre class="src">
pg_staging&gt; pgbouncer some_db.dev
              some_db      some_db_20091029 :5432
     some_db_20090717      some_db_20090717 :5432
     some_db_20091029      some_db_20091029 :5432
</pre>

<p>So as explained in the <code>pg_staging(1)</code> man page, you have to allow
non-interactive <code>SSH</code> connections from the <em>controller</em> to the <em>hosts</em> where the
databases will get restored. Then you have to do a minimal pgbouncer setup
on the <em>hosts</em> with a <code>trust</code> connection. It'll get used by <code>pg_staging</code> for
adding newly restored databases and making them accessible. Then you can also
<code>switch</code> the new database to being the virtual <em>some_db</em> so that you avoid
editing any connection string in your software.</p>

<p>Also, install the <code>pgstaging-client</code> package on every host you target. The
client is a simple shell script that must run as root (<code>sudo</code> is used) in
order to replace your <code>pgbouncer</code> setup or manage your <code>londiste</code> services.</p>

<p>See <code>man 5 pg_staging</code> for available options, including the <em>schemas</em> to filter out,
either completely or by only skipping the restoring of their data.</p>


<h3>pg_staging usage</h3>

<p class="first">Now that you're all set up, you can begin to enjoy using <code>pg_staging</code>. Enter the
console and see what you have in there.</p>

<pre class="src">
$ pg_staging
Welcome to pg_staging 0.9.
pg_staging&gt; databases
...
pg_staging&gt; restore some_db.dev
...
pg_staging&gt; pgbouncer some_db.dev
...
pg_staging&gt; dbsizes --all some_db.dev
...
pg_staging&gt; psql some_db.dev
some_db_20091125=#
</pre>

<p>And as you can see in <code>man pg_staging</code> there are a lot of commands
already. You can for example obtain a new <em>pg_restore catalog</em> from a dump
file, with some <em>schemas</em> commented out. It will even comment out <code>triggers</code>
that are using a <code>function</code> which is defined in a filtered out <code>schema</code>, for
example a <code>PGQ</code> trigger. And much much more.</p>

<p><a href="pgstaging.html">pg_staging</a> will even allow you to <code>dump</code> your production databases, but
consider installing a separate instance of it on the machine serving the
backups to your local network thanks to an <code>apache</code> directory listing!</p>


<h3>Roadmap to <code>1.0.0</code></h3>

<p class="first">What remains to be done is testing, getting <code>PITR</code> based restoring to work,
and adding some documentation (a tutorial, which is what this blog post is about, and
<em>londiste</em> support). At this point, unless some reader here asks for a new
feature (set), I'll consider <code>pg_staging</code> ready for <code>1.0.0</code>. After all, we're
using it almost daily here :)</p>

<p>Consider commenting, you should be able to easily spot my private mail
address...</p>



<h2>20091109-09:50 PGDay.eu, Paris: it was awesome!</h2>

<p><a name="20091109-09:50" id="20091109-09:50"></a>
<a name="%20PGDay%2Eeu%2C%20Paris%3A%20it%20was%20awesome%21" id="%20PGDay%2Eeu%2C%20Paris%3A%20it%20was%20awesome%21"></a>
<a href="http://2009.pgday.eu/">PGDay.eu</a> was held this week-end in Paris, and it really was a great
moment. Lots of <a href="http://2009.pgday.eu/_media/group_2009_1.jpg?cache=">attendees</a>, lots of quality talks (<a href="http://wiki.postgresql.org/wiki/PGDay.EU%2C_Paris_2009">slides</a> are online), good
food, great party: all the ingredients were there!</p>

<p>It was also the occasion for me to first talk about this tool I've been
working on for months, called <a href="pgstaging.html">pg_staging</a>, which aims to empower those boring
production backups to help maintain <em>staging</em> environments (for your
developers and testers).</p>

<p>All in all, such events keep reminding me what it means exactly when we say
that one of the greatest things about <a href="http://www.postgresql.org/">PostgreSQL</a> is its community. If you
don't know what I'm talking about, consider <a href="http://www.postgresql.org/community/">joining</a>!</p>


<h2>20091006-15:56 prefix 1.0.0</h2>

<p><a name="20091006-15:56" id="20091006-15:56"></a>
<a name="%20prefix%201%2E0%2E0" id="%20prefix%201%2E0%2E0"></a>
So there it is, at long last, the final <code>1.0.0</code> release of prefix! It's on its
way into the debian repository (targeting sid, in testing in 10 days) and
available on <a href="http://pgfoundry.org/frs/?group_id=1000352">pgfoundry</a> too.</p>

<p>In order to make it clear that I intend to maintain this version, the number
has 3 digits rather than 2... which is also what <a href="http://www.postgresql.org/support/versioning">PostgreSQL</a> users will
expect.</p>

<p>The only last-minute change is that naming the operator class is now
optional, so you can use the second of the two following forms rather than
the first one:</p>

<pre class="src">
-  <span style="color: #729fcf; font-weight: bold;">create</span> index idx_prefix <span style="color: #729fcf; font-weight: bold;">on</span> prefixes <span style="color: #729fcf; font-weight: bold;">using</span> gist(<span style="color: #729fcf; font-weight: bold;">prefix</span> gist_prefix_range_ops);
+  <span style="color: #729fcf; font-weight: bold;">create</span> index idx_prefix <span style="color: #729fcf; font-weight: bold;">on</span> prefixes <span style="color: #729fcf; font-weight: bold;">using</span> gist(<span style="color: #729fcf; font-weight: bold;">prefix</span>);
</pre>

<p>For your information, I'm thinking about leaving <code>pgfoundry</code> as far as the
source code management goes, because I'd like to be done with <code>CVS</code>. I'd still
use the release file hosting though, at least for now. It's a burden, but it
makes releases easier for users to find when they are not using plain <code>apt-get
install</code>. That move would lead to hosting <a href="http://pgfoundry.org/projects/prefix/">prefix</a> and <a href="http://pgfoundry.org/projects/pgloader">pgloader</a> and the <a href="http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/backports/">backports</a>
over there at <a href="http://github.com/dimitri">github</a>, where my next pet project, <code>pg_staging</code>, will be hosted
too.</p>

<p>The way I see this move away from <em>pgfoundry</em> is that if everybody does the same,
then migrating the facility to some better or more recent hosting software
will be easier. Maybe some other parts of the system are harder to migrate
than the sources, though. If that's the case I'll consider moving them out
too; maybe getting listed on the <a href="http://www.postgresql.org/download/product-categories">PostgreSQL Software Catalogue</a> will prove
enough as far as web presence goes?</p>



<h2>20090818-09:14 hstore-new &amp; preprepare reach debian too</h2>

<p><a name="20090818-09:14" id="20090818-09:14"></a>
<a name="%20hstore%2Dnew%20%26%20preprepare%20reach%20debian%20too" id="%20hstore%2Dnew%20%26%20preprepare%20reach%20debian%20too"></a>
It seems like debian developers are back from their annual conference and holiday,
so they have had a look at the <code>NEW</code> queue and processed the packages in
there. Two of them were mine and waiting to get in <code>unstable</code>, <a href="http://packages.debian.org/hstore-new">hstore-new</a> and
<a href="http://packages.debian.org/preprepare">preprepare</a>.</p>

<p>Time to do some bug fixing already, as <code>hstore-new</code> packaging is using a
<em>bash'ism</em> I shouldn't rely on (or so the debian buildfarm is <a href="https://buildd.debian.org/~luk/status/package.php?p=hstore-new">telling me</a>) and
for <code>preprepare</code> I was waiting for inclusion before improving the <code>GUC</code>
management, stealing some code from <a href="http://blog.endpoint.com/search/label/postgres">Selena</a>'s <a href="http://blog.endpoint.com/2009/07/pggearman-01-release.html">pgGearman</a> :)</p>

<p>As some of you wonder about <code>prefix 1.0</code> scheduling, it should soon get there
now that it's been in testing long enough and no bug has been reported. Of course
releasing <code>1.0</code> in August isn't good timing, so maybe I should just wait a few
more weeks.</p>


<h2>20090803-14:50 prefix 1.0~rc2 in debian testing</h2>

<p><a name="20090803-14:50" id="20090803-14:50"></a>
<a name="%20prefix%201%2E0%7Erc2%20in%20debian%20testing" id="%20prefix%201%2E0%7Erc2%20in%20debian%20testing"></a>
At long last, <a href="http://packages.debian.org/search?searchon=sourcenames&amp;keywords=prefix">here it is</a>. With binary versions both for <code>postgresql-8.3</code> and
<code>postgresql-8.4</code>! Unfortunately my other packaging efforts are still waiting
in the <code>NEW</code> queue, but I hope to soon see <code>hstore-new</code> and <code>preprepare</code> enter
debian too.</p>

<p>Anyway, the plan for <code>prefix</code> is to now wait something like 2 weeks, then,
barring showstopper bugs, release the <code>1.0</code> final version. If you have a use
for it, now is a good time for testing it!</p>

<p>About upgrading a current <code>prefix</code> installation, the advice is to save data as
<code>text</code> instead of <code>prefix_range</code>, remove prefix support, install the new version,
then change the column's data type back:</p>

<pre class="src">
<span style="color: #729fcf; font-weight: bold;">BEGIN</span>;
  <span style="color: #729fcf; font-weight: bold;">ALTER</span> <span style="color: #729fcf; font-weight: bold;">TABLE</span> <span style="color: #edd400; font-weight: bold; font-style: italic;">foo</span>
     <span style="color: #729fcf; font-weight: bold;">ALTER</span> <span style="color: #729fcf; font-weight: bold;">COLUMN</span> <span style="color: #729fcf; font-weight: bold;">prefix</span>
             <span style="color: #729fcf; font-weight: bold;">TYPE</span> text <span style="color: #729fcf; font-weight: bold;">USING</span> text(<span style="color: #729fcf; font-weight: bold;">prefix</span>);

  <span style="color: #729fcf; font-weight: bold;">DROP</span> <span style="color: #729fcf; font-weight: bold;">TYPE</span> <span style="color: #edd400; font-weight: bold; font-style: italic;">prefix_range</span> <span style="color: #729fcf; font-weight: bold;">CASCADE</span>;
  \i prefix.sql

  <span style="color: #729fcf; font-weight: bold;">ALTER</span> <span style="color: #729fcf; font-weight: bold;">TABLE</span> <span style="color: #edd400; font-weight: bold; font-style: italic;">foo</span>
     <span style="color: #729fcf; font-weight: bold;">ALTER</span> <span style="color: #729fcf; font-weight: bold;">COLUMN</span> <span style="color: #729fcf; font-weight: bold;">prefix</span>
             <span style="color: #729fcf; font-weight: bold;">TYPE</span> prefix_range <span style="color: #729fcf; font-weight: bold;">USING</span> prefix_range(<span style="color: #729fcf; font-weight: bold;">prefix</span>);

  <span style="color: #729fcf; font-weight: bold;">CREATE</span> INDEX idx_foo_prefix <span style="color: #729fcf; font-weight: bold;">ON</span> foo
         <span style="color: #729fcf; font-weight: bold;">USING</span> gist(<span style="color: #729fcf; font-weight: bold;">prefix</span> gist_prefix_range_ops);
<span style="color: #729fcf; font-weight: bold;">COMMIT</span>;
</pre>

<p>Note: I just added the <code>gist_prefix_range_ops</code> as default for type
<code>prefix_range</code> so it'll be optional to specify this in final <code>1.0</code>. I got so
used to typing it I didn't realize we don't have to :)</p>


<h2>20090709-12:48 prefix 1.0~rc2-1</h2>

<p><a name="20090709-12:48" id="20090709-12:48"></a>
<a name="%20prefix%201%2E0%7Erc2%2D1" id="%20prefix%201%2E0%7Erc2%2D1"></a>
I've been having problems with building both <code>postgresql-8.3-prefix</code> and
<code>postgresql-8.4-prefix</code> debian packages from the same source package, and
fixing the packaging issue forced me into modifying the main <code>prefix</code>
<code>Makefile</code>. So while reaching <code>rc2</code>, I tried to think about missing pieces easy
to add this late in the game: and there's one, that's a function
<code>length(prefix_range)</code>, so that you no longer have to cast to text in the
following widespread query:</p>

<pre class="src">
  <span style="color: #729fcf; font-weight: bold;">SELECT</span> foo, bar
    <span style="color: #729fcf; font-weight: bold;">FROM</span> prefixes
   <span style="color: #729fcf; font-weight: bold;">WHERE</span> <span style="color: #729fcf; font-weight: bold;">prefix</span> @&gt; <span style="color: #ad7fa8; font-style: italic;">'012345678'</span>
<span style="color: #729fcf; font-weight: bold;">ORDER</span> <span style="color: #729fcf; font-weight: bold;">BY</span> <span style="color: #729fcf; font-weight: bold;">length</span>(<span style="color: #729fcf; font-weight: bold;">prefix</span>) <span style="color: #729fcf; font-weight: bold;">DESC</span>
   <span style="color: #729fcf; font-weight: bold;">LIMIT</span> 1;
</pre>

<p>And here's a simple, stupid benchmark of the new function, which is available in
<a href="http://prefix.projects.postgresql.org/prefix-1.0~rc2.tar.gz">prefix-1.0~rc2.tar.gz</a>. And it'll soon reach debian, if my QA dept agrees (my
<a href="http://julien.danjou.info/blog/">sponsor</a> is a QA dept all by himself!).</p>

<p>First some preparation:</p>

<pre class="src">
dim=#   <span style="color: #729fcf; font-weight: bold;">create</span> <span style="color: #729fcf; font-weight: bold;">table</span> prefixes (
dim(#          <span style="color: #729fcf; font-weight: bold;">prefix</span>    prefix_range <span style="color: #729fcf; font-weight: bold;">primary</span> <span style="color: #729fcf; font-weight: bold;">key</span>,
dim(#          <span style="color: #729fcf; font-weight: bold;">name</span>      text <span style="color: #729fcf; font-weight: bold;">not</span> <span style="color: #729fcf; font-weight: bold;">null</span>,
dim(#          shortname text,
dim(#          status    <span style="color: #8ae234; font-weight: bold;">char</span> <span style="color: #729fcf; font-weight: bold;">default</span> <span style="color: #ad7fa8; font-style: italic;">'S'</span>,
dim(#
dim(#          <span style="color: #729fcf; font-weight: bold;">check</span>( status <span style="color: #729fcf; font-weight: bold;">in</span> (<span style="color: #ad7fa8; font-style: italic;">'S'</span>, <span style="color: #ad7fa8; font-style: italic;">'R'</span>) )
dim(#   );
NOTICE:  <span style="color: #729fcf; font-weight: bold;">CREATE</span> <span style="color: #729fcf; font-weight: bold;">TABLE</span> / <span style="color: #729fcf; font-weight: bold;">PRIMARY</span> <span style="color: #729fcf; font-weight: bold;">KEY</span> will <span style="color: #729fcf; font-weight: bold;">create</span> implicit index "prefixes_pkey" <span style="color: #729fcf; font-weight: bold;">for</span>
 <span style="color: #729fcf; font-weight: bold;">table</span> "prefixes"
<span style="color: #729fcf; font-weight: bold;">CREATE</span> <span style="color: #729fcf; font-weight: bold;">TABLE</span>
<span style="color: #8ae234; font-weight: bold;">Time</span>: 74,357 ms
dim=#   \copy prefixes <span style="color: #729fcf; font-weight: bold;">from</span> <span style="color: #ad7fa8; font-style: italic;">'prefixes.fr.csv'</span> <span style="color: #729fcf; font-weight: bold;">with</span> delimiter ; csv quote <span style="color: #ad7fa8; font-style: italic;">'"'</span>
<span style="color: #8ae234; font-weight: bold;">Time</span>: 200,982 ms
dim=# <span style="color: #729fcf; font-weight: bold;">select</span> <span style="color: #729fcf;">count</span>(*) <span style="color: #729fcf; font-weight: bold;">from</span> prefixes ;
 <span style="color: #729fcf;">count</span>
<span style="color: #888a85;">-------
</span> 11966
(1 <span style="color: #8ae234; font-weight: bold;">row</span>)
<span style="color: #8ae234; font-weight: bold;">Time</span>: 3,047 ms
</pre>

<p>And now for the micro-benchmark:</p>

<pre class="src">
dim=# \o /dev/<span style="color: #729fcf; font-weight: bold;">null</span>
dim=# <span style="color: #729fcf; font-weight: bold;">select</span> <span style="color: #729fcf; font-weight: bold;">length</span>(<span style="color: #729fcf; font-weight: bold;">prefix</span>) <span style="color: #729fcf; font-weight: bold;">from</span> prefixes;
<span style="color: #8ae234; font-weight: bold;">Time</span>: 16,040 ms
dim=# <span style="color: #729fcf; font-weight: bold;">select</span> <span style="color: #729fcf; font-weight: bold;">length</span>(<span style="color: #729fcf; font-weight: bold;">prefix</span>::text) <span style="color: #729fcf; font-weight: bold;">from</span> prefixes;
<span style="color: #8ae234; font-weight: bold;">Time</span>: 23,364 ms
dim=# \o
</pre>

<p>Hope you enjoy!</p>


<h2>20090623-10:53 prefix extension reaches 1.0 (rc1)</h2>

<p><a name="20090623-10:53" id="20090623-10:53"></a>
<a name="%20prefix%20extension%20reaches%201%2E0%20" id="%20prefix%20extension%20reaches%201%2E0%20"></a>
At long last, after millions and millions of queries just here at work and
some more in other places, the <a href="prefix.html">prefix</a> project is reaching the <code>1.0</code> milestone. The
release candidate is getting uploaded into debian at the time of this
writing, and is available at the following place: <a href="http://prefix.projects.postgresql.org/prefix-1.0~rc1.tar.gz">prefix-1.0~rc1.tar.gz</a>.</p>

<p>If you have any use for it (as some <em>VoIP</em> companies have already), please
consider testing it, in order for me to release a shiny <code>1.0</code> next week! :)</p>

<p>Recent changes include getting rid of the square brackets in the output when they're
not necessary, fixing btree operators, and adding support for more operators in
the <code>GiST</code> support code (now supported: <code>@&gt;</code>, <code>&lt;@</code>, <code>=</code>, <code>&amp;&amp;</code>). Enjoy!</p>


<h2>20090527-14:30 PgCon 2009</h2>

<p><a name="20090527" id="20090527"></a>
<a name="%20PgCon2009" id="%20PgCon2009"></a>
I can't really compare <a href="http://www.pgcon.org/2009/">PgCon 2009</a> with previous years' editions; the last time I
enjoyed the event was in 2006, in Toronto. But still I found the
experience to be a great one, and I hope I'll be there next year too!</p>

<p>I've met a lot of known people in the community, some of whom I already had
the chance to run into in Toronto or <a href="http://2008.pgday.org/en/">Prato</a>, but this was the first time I
got to talk to many of them about interesting projects and ideas. That alone
was awesome already, and we also had a lot of talks to listen to: as others
have said, it was really hard to choose only one place to go to out
of three.</p>

<p>I'm now back home and seem to be recovering quite fine from jet lag, and I've
even begun to work through the todo list from the conference. It includes mainly
<code>Skytools 3</code> testing and contributions (code and documentation),
<a href="http://wiki.postgresql.org/wiki/ExtensionPackaging">Extension Packaging</a> work (Stephen Frost seems to be willing to help, which I
highly appreciate) beginning with <a href="http://archives.postgresql.org/pgsql-hackers/2009-05/msg00912.php">search_path issues</a>, and posting some
backtrace to help fix some <a href="http://archives.postgresql.org/pgsql-hackers/2009-05/msg00923.php">SPI_connect()</a> bug at <code>_PG_init()</code> time in an
extension.</p>

<p>The excellent <a href="http://wiki.postgresql.org/wiki/PgCon_2009_Lightning_talks">lightning talk</a> about <u>How not to Review a Patch</u> by Joshua
Tolley took me out of the <em>dim</em>, I'll try to be <em>bright</em> enough and participate
as a reviewer in later commit fests (well maybe not the first next ones as
some personal events on the agenda will take all my <em>&quot;free&quot;</em> time)...</p>

<p>Oh and the <a href="http://code.google.com/p/golconde/">Golconde</a> presentation gave some insights too: this queueing based
solution is to be compared to the <code>listen/notify</code> mechanisms we already have in
<a href="http://www.postgresql.org/docs/current/static/sql-listen.html">PostgreSQL</a>, in the sense that it's not transactional, and the events are
kept in memory only to achieve very high distribution rates. So it's a very
fine solution to manage a distributed caching system, for example, but not
so much for asynchronous replication (where you must not replicate events tied
to rolled back transactions).</p>

<p>So all in all, spending last week in Ottawa was a splendid way to get more
involved in the PostgreSQL community, which is a very fine place to be
spending one's free time, should you ask me. See you soon!</p>


<h2>20090514 Prepared Statements and pgbouncer</h2>

<p><a name="20090514" id="20090514"></a>
<a name="%20Prepared%20Statements%20and%20pgbouncer" id="%20Prepared%20Statements%20and%20pgbouncer"></a>
On the performance mailing list, a recent <a href="http://archives.postgresql.org/pgsql-performance/2009-05/msg00026.php">thread</a> drew my attention. It
turned out to be about using connection pooling software and prepared statements
in order to increase the scalability of PostgreSQL when faced with a lot of
concurrent clients all doing simple <code>select</code> queries. The advantage of the
<em>pooler</em> is to reduce the number of <em>backends</em> needed to serve the queries, thus
reducing PostgreSQL internal bookkeeping. Of course, my choice of software
here is clear: <a href="https://developer.skype.com/SkypeGarage/DbProjects/PgBouncer">PgBouncer</a> is an excellent top grade solution that performs really
well (it won't parse queries) and is reliable and flexible.</p>

<p>The problem is that while combining <code>pgbouncer</code> and <a href="http://www.postgresql.org/docs/current/static/sql-prepare.html">prepared statements</a> is
possible, it requires the application to check at connection time if the
statements it's interested in are already prepared. This can be done by a
simple catalog query of this kind:</p>

<pre class="src">
  <span style="color: #729fcf; font-weight: bold;">SELECT</span> <span style="color: #729fcf; font-weight: bold;">name</span>
    <span style="color: #729fcf; font-weight: bold;">FROM</span> pg_prepared_statements
   <span style="color: #729fcf; font-weight: bold;">WHERE</span> <span style="color: #729fcf; font-weight: bold;">name</span> <span style="color: #729fcf; font-weight: bold;">IN</span> (<span style="color: #ad7fa8; font-style: italic;">'my'</span>, <span style="color: #ad7fa8; font-style: italic;">'prepared'</span>, <span style="color: #ad7fa8; font-style: italic;">'statements'</span>);
</pre>

<p>Well, this is simple but requires adding some application logic. What would
be great would be to only have to <code>EXECUTE my_statement(x, y, z)</code> and never
bother whether the <code>backend</code> connection is a fresh new one or an existing one, so
as to avoid having to check if the application should <code>prepare</code>.</p>

<p>The <a href="http://preprepare.projects.postgresql.org/">preprepare</a> pgfoundry project is all about this: it comes with a
<code>prepare_all()</code> function which will take all statements present in a given
table (<code>SET preprepare.relation TO 'schema.the_table';</code>) and prepare them for
you. If you now tell <code>pgbouncer</code> to please call the function at <code>backend</code>
creation time, you're done (see <code>connect_query</code>).</p>
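<p>To make the moving parts concrete, here is a minimal sketch of such a
setup. The <code>pre.statements</code> table, its <code>name</code> and <code>statement</code> columns and the
sample statement are assumptions made up for the illustration, so check the
project <code>README</code> for the layout <code>prepare_all()</code> really expects:</p>

<pre class="src">
-- hypothetical table holding the statements to prepare (layout assumed)
create schema pre;
create table pre.statements (
  name      text primary key,
  statement text not null
);

insert into pre.statements (name, statement)
     values ('my_statement',
             'prepare my_statement(int, int, int) as select $1 + $2 + $3');

-- what every fresh backend has to run once
set preprepare.relation to 'pre.statements';
select prepare_all();

-- from then on the application only ever does this
execute my_statement(1, 2, 3);
</pre>

<p>In this sketch the <code>select prepare_all();</code> call is the part that would go into
the <code>pgbouncer</code> <code>connect_query</code>, so that the application itself never issues
it.</p>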

<p>There's even a detailed <a href="http://preprepare.projects.postgresql.org/README.html">README</a> file, but no release yet (check out the code
in the <a href="http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/preprepare/preprepare/">CVS</a>; the <code>pgfoundry</code> project page has <a href="http://pgfoundry.org/scm/?group_id=1000442">clear instructions</a> about how to do so).</p>


<h2>20090414 Skytools 3.0 reaches alpha1</h2>

<p><a name="20090414" id="20090414"></a>
<a name="%20Skytools%203%2E0%20reaches%20alpha1" id="%20Skytools%203%2E0%20reaches%20alpha1"></a>
It's time for <a href="http://wiki.postgresql.org/wiki/Skytools">Skytools</a> news again! First, we improved the documentation of
the current stable branch by hosting high level presentations and <a href="http://wiki.postgresql.org/wiki/Londiste_Tutorial">tutorials</a> on
the <a href="http://wiki.postgresql.org/">PostgreSQL wiki</a>. Do check out the <a href="http://wiki.postgresql.org/wiki/Londiste_Tutorial">Londiste Tutorial</a>, it seems that's
what people hesitating to try out londiste were missing the most.</p>

<p>The other things people miss a lot in the current stable Skytools (version
<code>2.1.9</code> currently) are cascading replication (which allows for <em>switchover</em> and
<em>failover</em>) and <code>DDL</code> support. The new incarnation of skytools, version <code>3.0</code>,
<a href="http://lists.pgfoundry.org/pipermail/skytools-users/2009-April/001029.html">reaches alpha1</a> today. It comes with full support for <em>cascading</em> and <em>DDL</em>, so
you might want to give it a try.</p>

<p>It's a rough release, documentation is still to get written for a large part
of it, and bugs are still to get fixed. But it's all in the Skytools spirit:
simple and efficient concepts, easy to use and maintain. Think about this
release as a <em>developer preview</em> and join us :)</p>


<h2>20090210 Prefix GiST index now in 8.1</h2>

<p><a name="20090210" id="20090210"></a>
<a name="%20Prefix%20GiST%20index%20now%20in%208%2E1" id="%20Prefix%20GiST%20index%20now%20in%208%2E1"></a>
The <a href="http://blog.tapoueh.org/prefix.html">prefix</a> project is about matching a <em>literal</em> against <em>prefixes</em> in your
table, the typical example being a telecom routing table. Thanks to the
excellent work around <em>generic</em> indexes in PostgreSQL with <a href="http://www.postgresql.org/docs/current/static/gist-intro.html">GiST</a>, indexing
prefix matches is easy to support in an external module. Which is what
the <a href="http://prefix.projects.postgresql.org/">prefix</a> extension is all about.</p>

<p>Maybe you didn't come across this project before, so here's the typical
query you want to run to benefit from the special indexing, where the <code>@&gt;</code>
operator is read <em>contains</em> or <em>is a prefix of</em>:</p>

<pre class="src">
  <span style="color: #729fcf; font-weight: bold;">SELECT</span> * <span style="color: #729fcf; font-weight: bold;">FROM</span> prefixes <span style="color: #729fcf; font-weight: bold;">WHERE</span> <span style="color: #729fcf; font-weight: bold;">prefix</span> @&gt; <span style="color: #ad7fa8; font-style: italic;">'0123456789'</span>;
</pre>
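<p>For readers who have never used the module, here is a small sketch of
the whole setup. The table layout and the sample rows are made up for the
example, and it assumes the <code>prefix</code> module is already installed in the
database:</p>

<pre class="src">
create table prefixes (
  prefix  prefix_range primary key,
  carrier text not null   -- hypothetical payload column
);

create index idx_prefix on prefixes
       using gist(prefix gist_prefix_range_ops);

insert into prefixes values ('0123',  'carrier A');
insert into prefixes values ('01234', 'carrier B');
insert into prefixes values ('0987',  'carrier C');

-- '0123' and '01234' are both prefixes of the literal, '0987' is not
select * from prefixes where prefix @&gt; '0123456789';
</pre>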

<p>Now, a user asked about an <code>8.1</code> version of the module, as it's what some
distributions ship (here, Red Hat Enterprise Linux 5.2). It turned out it
was easy to support <code>8.1</code> when you already support <code>8.2</code>, so the <code>CVS</code> now hosts
<a href="http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/prefix/prefix/">8.1 support code</a>. And here's what the user asking about the feature has to
say:</p>

<blockquote>
<p class="quoted">
It's works like a charm now with 3ms queries over 200,000+ rows.  The speed
also stays less than 4ms when doing complex queries designed for fallback,
priority shuffling, and having multiple carriers.</p>

</blockquote>


<h2>20090205 Importing XML content from file</h2>

<p><a name="20090205" id="20090205"></a>
<a name="%20Importing%20XML%20content%20from%20file" id="%20Importing%20XML%20content%20from%20file"></a>
The problem was raised this week on <a href="http://www.postgresql.org/community/irc">IRC</a> and this time again I felt it would
be a good occasion for a blog entry: how to load an <code>XML</code> file content into a
single field?</p>

<p>The usual tool used to import files is <a href="http://www.postgresql.org/docs/current/interactive/sql-copy.html">COPY</a>, but it'll want each line of the
file to host a text representation of a database tuple, so it doesn't apply
to the case at hand. <a href="http://blog.rhodiumtoad.org.uk/">RhodiumToad</a> was online and offered the following code
to solve the problem:</p>

<pre class="src">
<span style="color: #729fcf; font-weight: bold;">create</span> <span style="color: #729fcf; font-weight: bold;">or</span> replace <span style="color: #729fcf; font-weight: bold;">function</span> <span style="color: #edd400; font-weight: bold; font-style: italic;">xml_import</span>(filename text)
  <span style="color: #729fcf; font-weight: bold;">returns</span> xml
  volatile
  <span style="color: #729fcf; font-weight: bold;">language</span> plpgsql <span style="color: #729fcf; font-weight: bold;">as</span>
$f$
    <span style="color: #729fcf; font-weight: bold;">declare</span>
        content bytea;
        loid oid;
        lfd <span style="color: #8ae234; font-weight: bold;">integer</span>;
        lsize <span style="color: #8ae234; font-weight: bold;">integer</span>;
    <span style="color: #729fcf; font-weight: bold;">begin</span>
        loid := lo_import(filename);
        lfd := lo_open(loid,262144);
        lsize := lo_lseek(lfd,0,2);
        perform lo_lseek(lfd,0,0);
        content := loread(lfd,lsize);
        perform lo_close(lfd);
        perform lo_unlink(loid);

        <span style="color: #729fcf; font-weight: bold;">return</span> xmlparse(document convert_from(content,<span style="color: #ad7fa8; font-style: italic;">'UTF8'</span>));
    <span style="color: #729fcf; font-weight: bold;">end</span>;
$f$;
</pre>

<p>As you can see, the trick here is to use the <a href="http://www.postgresql.org/docs/current/interactive/largeobjects.html">large objects</a> API to load the
file content into memory (<code>content</code> variable), then to parse it knowing it's
a <code>UTF8</code> encoded <code>XML</code> file and return an <a href="http://www.postgresql.org/docs/current/interactive/datatype-xml.html">XML</a> datatype object.</p>
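<p>Using the function could then look like the following sketch. The <code>docs</code>
table and the file path are hypothetical. Keep in mind that <code>lo_import()</code>
reads the file on the database server, with the privileges of the server
process, so the path must exist there rather than on the client:</p>

<pre class="src">
-- hypothetical target table, one XML document per row
create table docs (
  id      serial primary key,
  content xml
);

-- the path is resolved on the server side
insert into docs (content)
     values (xml_import('/tmp/example.xml'));

-- check what got loaded
select id, content is document as is_document from docs;
</pre>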


<h2>20090204 Asko Oja talks about Skype architecture</h2>

<p><a name="20090204" id="20090204"></a>
<a name="%20Asko%20Oja%20talks%20about%20Skype%20architecture" id="%20Asko%20Oja%20talks%20about%20Skype%20architecture"></a>
In this <a href="http://postgresqlrussia.org/articles/view/131">Russian page</a> you'll see a nice presentation of Skype database
architectures by Asko Oja himself. It's the talk he gave at the Russian PostgreSQL
Community meeting, October 2008, Moscow, and it's a good read.</p>

<center>
<p><a class="image-link" href="http://postgresqlrussia.org/articles/view/131">
<img src="../images/Moskva_DB_Tools.v3.png"></a></p>
</center>

<p>The presentation page is in Russian but the slides are in English, so have a
nice read!</p>


<h2>20090203 Skytools ticker daemon and londiste</h2>

<p><a name="20090203" id="20090203"></a>
<a name="20090203%20Skytools%20ticker%20daemon%20and%20londiste" id="20090203%20Skytools%20ticker%20daemon%20and%20londiste"></a>
One of the difficulties in getting to understand and configure <code>londiste</code>
resides in the relation between the <code>ticker</code> and the replication. This question
was raised once more on IRC yesterday, so I made a new FAQ entry about it:
<a href="http://blog.tapoueh.org/skytools.html#ticker">How do this ticker thing relates to londiste?</a></p>


<h2>20090131 Comparing Londiste and Slony</h2>

<p><a name="20090131" id="20090131"></a>
<a name="%20Skytools%20ticker%20daemon%20and%20londiste" id="%20Skytools%20ticker%20daemon%20and%20londiste"></a>
In the page about <a href="skytools.html">Skytools</a> I've encouraged people to ask some more questions
in order for me to be able to try and answer them. That just happened, as
usual on the <code>#postgresql</code> IRC, and the question is
<a href="skytools.html#slony">What does londiste lack that slony has?</a></p>


<h2>20090128 Controlling HOT usage in 8.3</h2>

<p><a name="20090128" id="20090128"></a>
<a name="%20Controling%20HOT%20usage%20in%208%2E3" id="%20Controling%20HOT%20usage%20in%208%2E3"></a>
As it happens, I've got some environments where I want to make sure <code>HOT</code> (<em>aka
Heap Only Tuples</em>) is in use, because we're doing so many updates per second
that I want to be sure it's not killing my database server. I not only
wrote a checking view for it, but also made a <a href="http://www.postgresql.fr/support:trucs_et_astuces:controler_l_utilisation_de_hot_a_partir_de_la_8.3">quick article</a>
about it on the <a href="http://postgresql.fr/">French PostgreSQL website</a>. Hanging around in <code>#postgresql</code>
means that I'm now bound to write about it in English too!</p>

<p>So <code>HOT</code> gets used each time you update a row without changing any of its indexed
values. The benefit is skipping index maintenance and, as far as I
understand it, easing the hard work of <code>vacuum</code> too. To get the benefit, <code>HOT</code>
needs some room to put the new version of the <code>UPDATEd</code> tuple in the same
disk page, which means you'll probably want to set your table <a href="http://www.postgresql.org/docs/8.3/static/sql-createtable.html#SQL-CREATETABLE-STORAGE-PARAMETERS">fillfactor</a> to
something much less than <code>100</code>.</p>
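
<p>As a minimal sketch, assuming a table named <code>table1</code> and picking <code>50</code> purely for
illustration, lowering the <code>fillfactor</code> could look like this. Note that the
setting only applies to pages written afterwards: existing pages keep their
layout until the table gets rewritten.</p>

<pre class="src">
-- table1 and the value 50 are placeholders, tune them for your own workload
ALTER TABLE table1 SET (fillfactor = 50);
</pre>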

<p>Now, here's how to check you're benefitting from <code>HOT</code>:</p>

<pre class="src">
<span style="color: #729fcf; font-weight: bold;">SELECT</span> schemaname, relname,
       n_tup_upd,n_tup_hot_upd,
       <span style="color: #729fcf; font-weight: bold;">case</span> <span style="color: #729fcf; font-weight: bold;">when</span> n_tup_upd &gt; 0
            <span style="color: #729fcf; font-weight: bold;">then</span> ((n_tup_hot_upd::<span style="color: #8ae234; font-weight: bold;">numeric</span>/n_tup_upd::<span style="color: #8ae234; font-weight: bold;">numeric</span>)*100.0)::<span style="color: #8ae234; font-weight: bold;">numeric</span>(5,2)
            <span style="color: #729fcf; font-weight: bold;">else</span> <span style="color: #729fcf; font-weight: bold;">NULL</span>
       <span style="color: #729fcf; font-weight: bold;">end</span> <span style="color: #729fcf; font-weight: bold;">AS</span> hot_ratio

 <span style="color: #729fcf; font-weight: bold;">FROM</span> pg_stat_all_tables;

 schemaname | relname | n_tup_upd | n_tup_hot_upd | hot_ratio
<span style="color: #888a85;">------------+---------+-----------+---------------+-----------
</span> <span style="color: #729fcf; font-weight: bold;">public</span>     | table1  |         6 |             6 |    100.00
 <span style="color: #729fcf; font-weight: bold;">public</span>     | table2  |   2551200 |       2549474 |     99.93
</pre>

<p>Here's an extended version of the same query, which also displays the
<code>fillfactor</code> option value for the tables you're inquiring about. It comes
separately from the first example because the <code>fillfactor</code> of a
relation is stored in the <code>reloptions</code> field of the <code>pg_class</code> catalog, and to filter against a
schema-qualified table name you want to join against <code>pg_namespace</code> too.</p>

<pre class="src">
<span style="color: #729fcf; font-weight: bold;">SELECT</span> t.schemaname, t.relname, c.reloptions,
       t.n_tup_upd, t.n_tup_hot_upd,
       <span style="color: #729fcf; font-weight: bold;">case</span> <span style="color: #729fcf; font-weight: bold;">when</span> n_tup_upd &gt; 0
            <span style="color: #729fcf; font-weight: bold;">then</span> ((n_tup_hot_upd::<span style="color: #8ae234; font-weight: bold;">numeric</span>/n_tup_upd::<span style="color: #8ae234; font-weight: bold;">numeric</span>)*100.0)::<span style="color: #8ae234; font-weight: bold;">numeric</span>(5,2)
            <span style="color: #729fcf; font-weight: bold;">else</span> <span style="color: #729fcf; font-weight: bold;">NULL</span>
        <span style="color: #729fcf; font-weight: bold;">end</span> <span style="color: #729fcf; font-weight: bold;">AS</span> hot_ratio
<span style="color: #729fcf; font-weight: bold;">FROM</span> pg_stat_all_tables t
      <span style="color: #729fcf; font-weight: bold;">JOIN</span> (pg_class c <span style="color: #729fcf; font-weight: bold;">JOIN</span> pg_namespace n <span style="color: #729fcf; font-weight: bold;">ON</span> c.relnamespace = n.oid)
        <span style="color: #729fcf; font-weight: bold;">ON</span> n.nspname = t.schemaname <span style="color: #729fcf; font-weight: bold;">AND</span> c.relname = t.relname;

 schemaname | relname |   reloptions    | n_tup_upd | n_tup_hot_upd | hot_ratio
<span style="color: #888a85;">------------+---------+-----------------+-----------+---------------+-----------
</span> <span style="color: #729fcf; font-weight: bold;">public</span>     | table1  | {fillfactor=50} |   1585920 |       1585246 |     99.96
 <span style="color: #729fcf; font-weight: bold;">public</span>     | table2  | {fillfactor=50} |   2504880 |       2503154 |     99.93
</pre>
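
<p>To look at a single table only, the filter on a schema-qualified name can be
sketched as follows. Here <code>public</code> and <code>table1</code> are placeholders taken from
the sample output, not names you necessarily have:</p>

<pre class="src">
-- reloptions is NULL when no storage parameter has been set on the table
SELECT n.nspname, c.relname, c.reloptions
  FROM pg_class c
       JOIN pg_namespace n ON c.relnamespace = n.oid
 WHERE n.nspname = 'public'
   AND c.relname = 'table1';
</pre>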

<p>Don't let the <code>HOT</code> question affect your sleep any more!</p>


<h2>20090121 Londiste Trick</h2>

<p><a name="20090121" id="20090121"></a>
<a name="%20Londiste%20Trick" id="%20Londiste%20Trick"></a>
So, you're using <code>londiste</code> and the <code>ticker</code> has not been running all night
long, due to some restart glitch in your procedures, and the <em>on call</em> admin
didn't notice the restart failure. If you blindly restart the replication
daemon, it will load into memory all the events produced during the night
at once, because you now have only one tick to put them all in.</p>

<p>The following query allows you to count how many events that represents,
with the magic tick numbers coming from <code>pgq.subscription</code> in columns
<code>sub_last_tick</code> and <code>sub_next_tick</code>.</p>

<pre class="src">
<span style="color: #729fcf; font-weight: bold;">SELECT</span> <span style="color: #729fcf;">count</span>(*)
  <span style="color: #729fcf; font-weight: bold;">FROM</span> pgq.event_1,
      (<span style="color: #729fcf; font-weight: bold;">SELECT</span> tick_snapshot
         <span style="color: #729fcf; font-weight: bold;">FROM</span> pgq.tick
        <span style="color: #729fcf; font-weight: bold;">WHERE</span> tick_id <span style="color: #729fcf; font-weight: bold;">BETWEEN</span> 5715138 <span style="color: #729fcf; font-weight: bold;">AND</span> 5715139
      ) <span style="color: #729fcf; font-weight: bold;">as</span> t(snapshots)
<span style="color: #729fcf; font-weight: bold;">WHERE</span> txid_visible_in_snapshot(ev_txid, snapshots);
</pre>
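
<p>The two tick numbers in the <code>BETWEEN</code> clause are the ones read from
<code>pgq.subscription</code>. A minimal sketch of that lookup, assuming a single
subscription (with several of them you would need to add a filter for your
own queue and consumer):</p>

<pre class="src">
-- returns the tick boundaries to paste into the BETWEEN clause above
SELECT sub_last_tick, sub_next_tick
  FROM pgq.subscription;
</pre>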

<p>In our case, this was more than <em>5 million 400 thousand</em> events. With
this many events to care about, if you start londiste, it'll eat as much
memory as needed to hold them all, which might be more than what your
system is able to give it. So you want a way to tell <code>londiste</code> not to load
all events at once. Here's how: add the following knob to your <em>.ini</em>
configuration file before restarting the londiste daemon:</p>

<pre class="src">
    pgq_lazy_fetch = 500
</pre>

<p>Now, <code>londiste</code> will lazily fetch at most <code>500</code> events at a time, even if a single
<code>batch</code> (which contains all <em>events</em> between two <em>ticks</em>) contains a huge number
of events. This number seems a good choice as it's the default <code>PGQ</code> setting
for the number of events in a single <em>batch</em>. This number is only exceeded when the
ticker is not running or when you're producing more <em>events</em> than that in a
single transaction.</p>

<p>Hope you'll find the tip useful!</p>



<h2>20081204 Fake entry</h2>

<p><a name="20081204" id="20081204"></a>
<a name="20081204%20Fake%20entry" id="20081204%20Fake%20entry"></a>
This is a test of a fake entry to see how muse will manage this.</p>

<p>With some <code>SQL</code> inside:</p>

<blockquote>
<p class="quoted"></p>

<pre class="src">
<span style="color: #729fcf; font-weight: bold;">SELECT</span> * <span style="color: #729fcf; font-weight: bold;">FROM</span> planet.postgresql.org <span style="color: #729fcf; font-weight: bold;">WHERE</span> author = 'dim';

</pre>

</blockquote>



]]></description>
<author>Dimitri Fontaine</author>
<pubDate>Tue, 16 Feb 2010 15:23:00 CET</pubDate>
<guid>http://blog.tapoueh.org/blog.dim.html#%20Resetting%20sequences%2E%20All%20of%20them%2C%20please%21</guid>

</item>

  </channel>
</rss>
