Next Generation PostgreSQL
CHAR(10) was a great venue with great people. The conference itself and the "Hall's Talks" have been a huge source of ideas popping up. Again...
As I've been talking to several people about those ideas over the time and past conferences, I figured I should try to put them down now. And the new ones too. So in this article I'll try to organise and share those ideas that I'm growing in my head, in the hope that some people will think that maybe the ideas are worthwile enough to think about them, make them theirs, then why not work on them!
Organising ideas is all about putting them togother in boxes and attach to them the more meaningfull label you can think of. Here, I think the labels are MVCC in the Cloud, and How to Further Optimise PostgreSQL.
DISCLAIMER
Those ideas that I try to organise and present in this documents are expressed like if they were good designs. I realise that may well be not true, but that's the best way I can think of to share them. So not only that is a Work In Progress but you're welcome to find this collection nothing but a waste of time. Apologies if that's the case.
Also those are all bird's eye view and lots of implementation problems are certainly hidden all around, but my current knowlegde of the internals of the system won't allow me to be the first to catch them. Hint, hint.
MVCC in the Cloud
Nowadays all the rage is in the virtualised environments, dynamic sizing of resources and pay-only-what-you-use. They even found a funny name to sell this, welcome to the Cloud!
The problem is that while it's quite easy to understand how to apply such a reasonning to stateless services, such as webservers, it's a lot more difficult to apply the same to persistent storage services, like, say, a traditional relational database. Let's not forget that the main services of an RDBMS ain't interpreting SQL, but offering transactions and durability.
I think forgetting about that would be awful and could lead to strange solutions appearing, where you play with highly dynamic non-structured non-normalized caches that people would be happy to use as persistent storage services. Oh and of course only provide APIs, so that the application has to know all about the storage and data location, rather than having to describe what data they need. Yeah, declarative languages are not cool anymore, we don't want no Lisp, no Haskell, no RegExp, no CSS, no XML, no XSLT, NoSQL!
So, well, is it possible to apply those Cloud concepts to ACID persistent storage services such as PostgreSQL, and get some benefits out of that? I think so, and will try to explain how. If you're a web indexer robot of some kind, please consider attaching that to your distributed computing label, would you have one.
Transparent cluster
The first problem I'd try to have an idea about is how to offer a transparent service to users in the presence of replication. Say you're running a master slave Hot Standby setup, using Streaming Replication for transfering the data. Now the application code will either connect to the master or to the slave, and depending on this choice will not be able to benefit from the same service.
So the idea would be for the service offered by both the nodes of this cluster to be the same, as far as the user is concerned. This simplification could come in two steps. First, you need XID feedback from the slave to the master. That means that the older snapshot still in use in any slave should be known by the master, so that it will refrain from cleaning up the house while the party's still ongoing. It's rude to do that, even more so to friends.
Once you can trust that any snapshot on the slave is still valid on the
master, the second step would be for the slave to forward a transaction to
its master (primary_coninfo) as soon as it can't serve it itself. I think
this is as soon as the transaction would require the Virtual XID to turn
into a real one. Oh, and as you know the current transaction's snapshot is
still valid on the master, the only thing you need is being able to open a
transaction on this very snapshot over there.
And it so happens that being able to connect to a system and ask for a given snapshot you know about already would be useful for other projects (parallel dumps and distributing query processing). I think some hackers are already contemplating on how to offer such a facility. Thanks guys!
But I want my ticket to the cloud!
The previous idea goes a long way to help offering a global service to users but does nothing to help solve the real complex Cloud challenges. Those are related to the flexibility you want from the Cloud, the marketing name of it being Elasticity. To be a Cloud service, you need to be able to add and remove nodes to your service and stay online.
So, let's continue thinking, but focusing a little more.
Lots of efforts are ongoing in the domains of replications and remote acces
to data. I'm thinking about Postgres-XC, Postgres-R and SQL/MED related
work. I like those projects and architecture but I don't think getting them
to the Cloud will be practical, except for SQL/MED.
If you missed it, SQL/MED is the part of the standard that allows a database
server to expose objects from the outer space. Meaning anything from a local
text file to another database server, of the same or another
technology. It's very powerful and valuable, but the limitation is that you
can't reach outer space within your transaction boundaries. That's a pretty
good limit, though, because it means you now get autonomous transactions
(you embed a transaction into another and their COMMIT/ROLLBACK statuses are
independant, or, say, autonomous).
To provide elastic clustering in the Cloud, with some kind of transparency for the user, you don't want to lose MVCC, that thing that allows the database to respect the ACID guarantees. Maintaining MVCC in a cluster is, on the other hand, exactly what Postgres-XC, Postgres-R and Middle-R are working on.
The problem there is that either you have full support of PostgreSQL features, that's Postgres-R, or you have data distribution, that's Postgres-XC (which does not yet provide node fault-tolerance, by the way).
In the latter case, it's even a big problem to offer all the PostgreSQL features (user defined functions, triggers, rules, etc) because for parallelizing query processing (you have to do that if you're distributing the data) the vast majority of the work they're doing is in the planner and optimiser. I don't even want to think how they will make it possible for a trigger to touch a non-local table in a distributed cluster.
And one of the thing Markus showed at CHAR(10) is that what's killing replicated systems really is the communication overhead. In this light, really, the worst case would be Two Phase Commit, but it appears that Postgres-XC might not be that far away. The real big winner is Postgres-R which shows very low communication overhead, detailed next.
Remote tablespace
The idea here would be to keep the global MVCC facilities that those project provide. I'll confess that I like the Postgres-R implementation of it, even if it relies on a Group Communication System to deliver a total commit ordering of transactions in the cluster. But the advantage is that the design has been made with the idea of supporting dynamic cluster configuration: you can accept new nodes and lose some others online.
Now the GCS is a hard-sell in our community, because finding a good
implementation of one ain't an easy task, and it seems that some of
PostgreSQL developers either burnt themselves on a work-in-progress
implementation, or just don't trust the theory. So I'd like to find out some
trustworthy and light way to derive a distributed sequence that would be a
reference for XID ordering in the cluster. I'm reading docs on Vector Clock
and Paxos algorithm now.
So, say we have a global MVCC we can trust. Then, the data distribution would be easier to address at the executor level I think. What you need to be able to say is that some data are stored on another node. You need the catalogs to exists on all the nodes, and I think that the physical location property of the data is already well defined in the notion of a tablespace. So we should extend this notion to remote PostgreSQL nodes that happen to share the MVCC details with us.
That certainly already means that any and all DDL targeting the remote
tablespace needs to happen using 2 Phase Commit, or maybe an asynchronous
model would fit here, I'm not sure. But I don't really see how. The data
would exist only on the master node for any given tablespace, but the
catalogs should certainly be in-sync, unless you want to accept a whole new
error domain, but which could be made to look like a serialisation failure
after all: if the DDL changes are asynchronous, then your problem is the
lack of remote exclusive lock, I guess.
That something I've been thinking about for a long time, you can find the distributing PostgreSQL wiki page and see its history for the details. If you see so you'll notice that the page needs some refreshing (at least the ideas behind the wiki page got some refresh here and there, the most recent and important one being CHAR(10).
Then the planner would work as it's doing now, but certain parts of the plan would have to be handled to another PostgreSQL instance. And as we're working with a global MVCC idea, we keep transactional behavior.
This does not provide data distributing off itself, for that to happen you
would use current ddl partitioning facilities and have the partitions sit on
different remote tablespaces. As soon as you do that, you can distribute the
processing by having the network IO happen asynchronously, which I think
would be required to achieve any kind of performances. Plus we already have
something in this spirit with effective_io_concurrency.
Mirrored tablespace
Now of course we all like to have a choice. Here the choice would be about
data locality. Certain systems already exists (distributed file systems,
including GlusterFS or to some extend HAMMER) that propose to mix and match
data replication and data distribution. Or simply think about RAID-10. You
have both striping and mirroring there, so that's somewhat expected to want
to have that in your database product too...
That leads me to think we should offer mirror tablespaces while at it. Implementing that can be as complex as optimising 2 Phase Commit or as simple (ahah) as having a per tablespace Streaming Replication solution.
Now of course you have to explain data locality in the network to the
planner, so that it can compare the cost of running a JOIN over there then
retrieving the result set against doing that locally. It seems clean what's
best until you think that while the other node is hashing your fact table
you're free to locally execute another part of the same query plan...
It's so cloudy now I can feel the rain coming
And as all the rage is about offering elastic capabilities to your clustering solution, the next step after that is having some smart agent in the system that will notice that we're doing this or that data transfers to solve user queries so often that we should setup a mirror of this remote tablespace now.
Of course, you also want newcomers to be able to extend your elasticity, like "hey guys I'm new there, how can I help?".
While at it, your smart cluster manager program should be able to arrange
things so that the loss of any node at any time is just an optimisation
problem, not a data loss one. Nothing more critical than a serialisation
error, but maybe we should invent a new SQLERR here, I'd like to propose the
following error message to the transactions you need to abort in case of
some node unexpected disapearing: the elastic just shrinked.
How to Further Optimise PostgreSQL
It seems harder and harder to find good ideas that look effective and that are not too complex to understand. It's simply because about all of them already have been implemented into the product, if you ask me. But that does not mean there isn't a way to go, that means finding it is getting tough, and that when pursuing the ideas used to be some 1-man work for some consolidated weeks, it's getting to collaborative working of several talented people sharing a common goal and able to dedicate months of their time.
I hope it does not mean we should be happy with what we have but we rather continue improving our prefered product.
On the usage of indexes as column store
When you're handling very big tables, there's a very good chance that some column's value will get repeated all over the place. But if you index that column, you'll see the value only once and a list of pointers to the rows that are hosting it. Column storage is about compressing that data set in a way that you store the column value only once, and also about avoiding to pass the data around in the executor.
The idea here would need to first have the ability to use the index in an authoritative way, without refering the main table storage to confirm the visibility. That's already a work in progress.
When we have that, then we could think about how we're using the indexes now. Their only purpose in life is to help solve restrictive queries by accessing directly to the rows of interrest, avoiding to scan all those data that won't fit in main memory, so that you rely on your slow drives to get access to it.
Another idea here would be to be able to use the index to solve queries in other cases than just applying the restrictions and filtering they require. By that I mean that we could use our indexes to retrieve some column's value, rather than the main table's data, when we have the idea that the columns cardinality is low enough.
I guess that would mean we have optional column storage, in the case all the values you need from the table are in the available index(es).
Gavin reports that this idea may not provide the bangs for the bucks, but warns that it could be he's too much used to thinking on datawarehouse terms. Also that before engaging into such an effort, it might be good to have an idea of the benefit you're running after, which is the classic Amdahl's law or the old saying that you should always profile before to spend time on optimizing.
So it's impossible without some preliminary work to assess how useful this kind of index as a column store would be in the general case, but that wouldn't stop me to share the idea, see...
The Executor as a Virtual Machine
I've been talking a lot with Gavin, and his mind is full of datawarehouse
optimisation problems. One of them is how to speed up the data retrieval
pipeline from disk. Given what are modern CPU, we should be saturating read
capabilities of any hard drive without a sweat, and we're not there yet. It
happens that MonetDB is about there, with their MonetDB Assembly Language,
further described on the MAL reference page.
The big difference in architecture between them and us is how they execute
the queries, with code path that contain very little branches, loops, or
function calls. That alone would be responsible for an incredible
performance benefit, like, a factor of 25 to 50 times what we have now, have
I been hinted.
If you think about it in the right angle, to be able to suppress all the looping and branching and function calling we currently have into the executor code, it could be that the simplest solution (ahah) would be to expose the executor capabilities as opcodes or assembly and have the planner and optimiser be just in time compilers for this new Virtual Machine.
And while at it, we then might benefit from some intermediate representation
of the plan tree, which could start out as what the planner works with in
term of data structure. Out of that, the optimiser job would be to generate
the executor code, and I think it should be possible to define an
optimisation effort target at this point, akin to -O0 to -O3 option of
gcc. This is another much wanted feature, being able to set the amount of
effort to put into finding the best plan possible, and this executor virtual
machine or executor assembly idea may not be the most simple way to get
there, but I see it as a conceptually simple fallout.
Another thing we might want to look at is how we batch the disk level
reads. The current executor will fetch a row at a time, and fetching
something like 2 MB at a time (maybe only for seqscans) might offer some
great speed benefits, that should even balance out the waste in case you
filter out some of those data.
Automatic use of Materialized VIEWs
The initial title of this section was Matching a query against some "template" at runtime, which is the hard part of the problem, I think.
The goal is simple and easy to understand, when you maintain some materialized views with triggers, you go a long way to ensure that the content is trustworthy in a transactional way. That means using that table or the normalized ones would not change any query result. So the best would be for the relational engine we love and trust to simply (from a user perspective) be able to use them when having to solve a query that could benefit.
As of course the dynamic query will not always be written the exact same way
the VIEW is, you can't just hash the query text and compare with some
index. Also, the user queries will not embed the VIEW anywhere in their
definition, potentially, so you're not searching for a substring match
either. It's more like matching a subplan of the query. Then I guess the
planner and optimiser should consider, when evaluating the cost of any plan,
if there's a materialized view somewhere that could help running it.
In case we have manually refreshed materialized views, I guess we could still benefit from the same mecanisms, but we would need a way to invalidate the plans when the data is not fresh enough, so that we decouple the refreshing interval to the acceptable lag in usage. Some metadata and auto-analyze support would certainly do.
Analyzing VIEWs as a correlated-statistics solution
This one is tricky. Oh well you may find all the ideas here as tricky, more often than not tricky enough that they don't need pursuing. But the other ones I've been thinking about them for long enough that they just feel natural for my brain. Aha.
So one of the major problem of PostgreSQL currently is related to how its
planner and optimiser heavily depend on ANALYZE statistics. The problem is
not so much having fresh ones nowadays, thanks to autovacuum, but to their
quality. And the first quality problem we have is tied to correlated data.
What that means is that if you have a couple of colums a and b set in a way
that each time a is not null, b = 2 * a, then PostgreSQL has no
way to realize that. So if you have a query WHERE a=1 AND b=2
then it will think having the AND in there means you double the output
restrictivity, and it's not the case.
There was a thread about exactly this problem on the mailing lists, where
Simon proposed that we implement ANALYZE foo [WHERE .... ] in order to
address this problem. I think a generalisation of this proposal is to be
able to ANALYSE a VIEW, as I said here.
So if we ever get to implement a way to match a user query dynamically against some materialized view existing in the system, I suppose we could also benefit from this view's statistics in the case it's not materialized.
Planner costs and system usage statistics, or admission control
Kevin Grittner on the mailing lists proposed that we read section 2.4 of the Architecture of a Database System by Joseph M. Hellerstein, Michael Stonebraker and James Hamilton. The idea revolves around accepting queries depending on the estimated resource usage they should need to execute, and the current availability of those on the system.
Now, it could be that a way to approach that would be to have some OS-specific statistic views in PostgreSQL, we already have a lot of those now. So we could query PostgreSQL for the current system load average, io wait, main memory usage and whatnot. The usual answer to that is that this is in no way the job of a database engine to care about such things, so please try and find the right tool for the right job, thank you very much.
Well, the idea here would be that the planner would now have access to those
information, by calling a function. And we could expose some GUC thresholds
not to overcome in the planning stage.
That means that the plan itself should store the information of the GUC as
they were at the very moment of the planning, so that you possibly get plan
invalidation to kick in and force a re-plan when you EXECUTE a PREPARE
STATEMENT in the case the current resource usage changed enough.
Now the remaining issue is that you're using the system's resources usage indicators at planning time, to be able to prune some plans, but for complex queries there can be a meaningfull delay between the planning and the execution, so you only get illusions about resource availability.
Well, unless and until you get the possibility of issuing more than one plan for any given query and then have the executor pick one. In fact you want to apply this idea recursively, so that at any given part of the plan the executor has some way to check between what environment the planner was running with and what the executor is running with. Having such a feature would also greatly enhance data locality access patterns, that also would just be a choice of subplan left to executor to make.
Oh and should we get distributed query processing on a single node, then again we would benefit here by having the planner propose plans using the capability or bypassing it, and the executor check whether it makes sense to use it now given the estimated costs. At all plan levels.
The main drawback against such a line of thought is the major overhaul in executor design this represent, and in the case of only providing distributed sorting, e.g., then the cost / benefit ratio ain't foreseen as good enough to attack it.
In the context of this document though, it's some more reasons to consider the benefits of architecting the current executor code into a Virtual Machine set of instructions. Then some of the instructions would allow for checking the current environment and branching in the plan, following the best compromise as of current execution time. Sure, that's not symplifying the implementation. But that's a very good reason to have the use case in mind before making any low-level choice.
And some other thoughts
You know the famous joke saying there are 10 Kinds of People in the World, those who understand binary and those who don't. You though that counting to 3 in decimal was so much easier, didn't you? (Hint, I told about organizing this list of ideas under two big labels, of which this is the third)
Supervized guest deamons, with an API please
There are several projects that could benefit from being integrated
alongside core processes. By that I mean being part of the start, reload,
stop and restart procedures. The current list includes autovacuum, a pgagent
scheduler, and a ticker. And helper backgrounds like PostgreSQL had in some
6.x branch, and that Markus implemented in Postgres-R. There's even an
argument about including a connection pool into the mix, but Jan won that
over a beer in Ottawa this year, so I won't insist. You'll notice there's a
lucky winner here, we needed autovacuum so bad that it's thankfully already
in core.
Providing an API that would allow to register user defined deamon processes would allow for including those other projects, and maybe some more, in a very easy way for the user.
My current thinking about that would be to steal as much as meaningfull from
the Erlang/OTP supervisor processes API, including the MaxR and MaxT
variables to protect the main system to suffer from user deamons
mis-behaviors.
It would even make sense to only provide support for gen_fsm kind of hosted
processes, meaning the API is to register a global unique name per process,
a state and code entry points (transitions). Now that we have a mecanism to
send signals to backends, with a payload, it's called LISTEN/NOTIFY. We
could certainly reuse that to send messages to the hosted state machines:
that would be the events that trigger te transitions — the event name
would match the code function.
In the case of a pgagent scheduler, we would need to be able to produce
events internally, without user interaction, but my understanding of
SIGALARM is that it's made for that. It's not clear what the best design
would be here, but maybe registering a pgagent clock service that would
NOTIFY the pgagent launcher service would do. It would also have to LISTEN
for changes on its underlying job scheduling tables, I guess.
And for querying the database, we'd use SPI, I guess.
Logs analysis
Nowadays to analyze logs and provide insights, the more common tool to use
is pgfouine, which does an excellent job. But there has been some
improvements in logs capabilities that we're not benefiting from yet, and
I'm thinking about the CSV log format.
So the idea would be to turn pgfouine into a set of SQL queries against the
logs themselves once imported into the database. Wait. What about having our
next PostgreSQL version, which is meant to include CSV support in SQL/MED,
to directly expose its logs as a system view?
A good thing would be to expose that as a ddl-partitioned table following
the log rotation scheme as setup in postgresql.conf, or maybe given in some
sort of a setup, in order to support logrotate users. At least some
facilities to do that would be welcome, and I'm not sure plain SQL/MED is
that when it comes to source partitioning.
Then all that remains to be done is a set of SQL queries and some static or
dynamic application to derive reports from there.
Conclusion
I hope some of those ideas are viable and interresting to some people, and should that be the case, seeing progress made on those would be awesome! Meanwhile, thanks for reading.
