Aquameta Chapter 2: filesystem - PostgreSQL <==> File System Bridge

Follow @aquameta

Aquameta is a web development platform built entirely in PostgreSQL. This is chapter two (introduction, chapter 1) of our tour of Aquameta's architecture.

If you want to jump right to the virtual file system demo, it's here.

File-centric vs. data-centric development

Today, generally speaking, web application development is file-centric: We store code (*.py, *.js, etc.) in files on the file system. We store tool configurations (/etc/*/*, ~/.ssh/*, etc) in files. We version-control files. Sometimes we put code or configuration in the database, but it's more of an exception than a rule. In 2016, the file is king.

Aquameta is entirely different. It is an experiment in a data-centric dev stack. Each layer in Aquameta is made up of relational data and procedural functions, and in terms of the developer experience we're designing, files in a sense don't even exist.

Of course the file system is still there under the hood, and most of our tools as software developers are still there too. As such, Aquameta's file system integration layer is more about "backwards" compatibility than a central part of the architecture.

In this chapter, we'll talk about how Aquameta is data-centric instead of file-centric, explain why we think this is a good idea, and then show Aquameta's file system integration via pgfs, a tool that makes PostgreSQL accessible as a virtual file system.

The database's killer feature

Once upon a time, people used to debate about whether to use the database or flat files to manage data. Today, we all but take it for granted that the database is better. ACID is a pretty compelling feature set, so is having a query language.

But with Aquameta, we're not talking about managing user data. We're talking about using the database to design a developer experience. In this context, the database's killer feature is a little different.

The biggest difference between the database and the file system is that the database takes a stand about the "structure of structure". It has an information model, and provides basic mechanisms for defining structure via the DDL.

The file system wouldn't claim to provide structure-defining mechanisms, but if it did, it would be as follows:

The world shall be organized as a strict heirarchy of "directories", each of which can contain other directories and/or named byte sequences called "files".

Beyond this, the file system is agnostic about structure; it leaves it up to the developer to choose from one of the many file formats for representing structured data, or design a new one.

The relational database, on the other hand, provides basic structuring mechanisms at a foundational level, via the DDL. In layman's terms, they might be described as follows:

The world shall be organized as a graph of inter-related Classifications ("tables"), which are spaces for Things ("rows") that belong together because they are in some way the same. Each Classification may have within it any number of Distinctions ("columns"), which are ways the Things in the Classification can be different from each other. Every Thing in the Classification can then have a Value ("field") for each column, that describes the particular row's difference.

This simple meta-structure of sameness and differentness is, as it turns out, wildly ubiquitous throughout how we as humans organize information. We have taken the relational database into countless domains, hospitals, warehouses, banks, social networks, fisheries, and modeled them fairly successfully.

Evidence seems to suggest that there is a common meta-structure to how we organize. Sameness and differentness seem to be the bedrock of organization itself.

Why file-centric holds us back

The file system doesn't provide structure-making mechanisms, and we see the consequences throughout our file-centric stack.

  • Web framework files like urls.py have a structure that could be trivially modeled relationally, but because they're in files, we have to write code to query them and edit them by hand.
  • The "query" to inspect say /etc/passwd takes a completely different form than the query to inspect say .git/config
  • Scripts like /sbin/adduser have to implement from scratch things that the database does out of the box.
  • Most layers in the stack have their own plugin/module system (apt, pip, JQuery plugins, nginx modules, git extensions, etc) but they only provide modularity only within their own layer.
  • Tools like Docker try to work around our abject failure as an industry to achieve cross-layer modularity by pushing the idea that an application requires exclusive control of an entire isolated file system

These are all anti-patterns that result from the lack of a shared information model. The result is a dev stack that is unnecessarily complex to program.

Why data-centric simplifies everything

A stack with an information model doesn't have the problems above, and it also opens up some incredible new opportunities.

  • We can use consistent access mechanisms across the whole stack for creating data, changing data, and querying data.
  • We aren't limited to linear files to express complex, multi-dimensional structures like graphs.
  • Since building UI against an all-data stack is straight-forward and simple, it opens up the door to all manner of visual, interface-driven development environments.
  • When all our developer tools share a common information model, they can interoperate and integrate with each other naturally and cheaply.
  • There is a kind of full-stack modularity that is possible, to combine components from different layers in the stack into one multi-layer Thing made up entirely of data.

Is this really new?

Lots of people have seen this problem and pushed for a unified information model. This classic XKCD comic comes to mind:

How is this different from yet another data standard?

First, the relational database is more than just a file format. It implements an abstract model, and its "primary" representation format is in the language of relational algebra, not syntax. In other words, while file system is agnostic about structure, the relational model is agnostic about serialization. Data structures can in a sense exist in any number of different file formats at the same time.

Second, PostgreSQL supports foreign data wrappers, which allow the primary representation to be something outside the database, but have a readable and writable interface into it, that is in the language of data. That way, instead of having to convert all our tools to this glorious new approach, we have the option to integrate with them as well. We can of course build "native" pure data application, but we can also integrate with existing systems, and build apps that exist simultaneously in a file-centric world, and a data-centric world. We can have the best of both worlds.

Conclusion

Today's file-centric stack is suboptimal because it doesn't have stucture-defining meachanisms at its foundation. There exists a common meta-structure shared between all our development tools. Aquameta's approach to rethinking the stack is to use the language of data as a universal integration pattern, and elevate our tool chain beyond ones, zeros, directories, and files, and into a world that treats structured information as a first-class citizen.

We've been experimenting with this paradigm for a while now, building prototypes of various layers and seeing how far we can push it. The more this stack takes shape, the more we realize that we are really onto something here. I think we've discovered a way to make software development simpler, and not just incrementally simpler, but a quantum leap forward.

Code!

Ok, let's demo some code.

pgfs.py - PostgreSQL as a virtual file system

pgfs is a Python daemon that makes PostgreSQL accessible as a virtual file system via fuse.

It maps the database to a tree structure as follows:

/mountpoint/{schema}/{table}/{row_primary_key}/{column_name}

Interactive demo:

First, let's start it up:

$ cd /s/aquameta/core/002-filesystem/pgfs
$ ./pgfs.py -u postgres -d aquameta /mnt/aquameta &
[1] 18922

Ok it's running, and mounted at /mnt/aquameta.

Now let's see what's in there:

$ cd /mnt/aquameta              
$ ls                                                   
bundle    filesystem          meta        pg_temp_4        pg_toast_temp_4  test  
endpoint  http_client         pg_catalog  pg_toast         public           widget  
event     information_schema  pg_temp_1   pg_toast_temp_1  semantics  
$

^^These are all the schemas in the database.

Now let's go into a schema:

$ cd widget
$ ls
dependency_css  input    widget                 widget_dependency_js  widget_view  
dependency_js   machine  widget_dependency_css  widget_fsm  
$

^^ These are all the tables in the widget schema.

Now let's take a look inside one of these tables, the widget table:

$ cd widget
$ ls
025c9178-6e78-403b-a6ee-6006a4417a87  
034c818d-bee4-420c-b395-804c7852a5ca  
06bc98a1-2be3-4414-8177-9ef22eb79be8  
080806df-d1fc-4f78-bcda-be42c09e6dae  
097acf98-baf1-4e32-98eb-4b20b8198b80  
...
$

^^ These are the rows in the widget table. They're directories, one directory for each row.

$ cd 025c9178-6e78-403b-a6ee-6006a4417a87
$ ls
css  help  html  id  name  post_js  pre_js  
$

^^ These are the fields in the widget table. They are editable files:

$ cat html
<tr id="{{= id }}" class="{{= name }}">  
    <td><button class="save">Save</button><button class="cancel">Cancel</button></td>
</tr>  
$ echo "<!-- hi mom -->" >> html
$
$ cat html
<tr id="{{= id }}" class="{{= name }}">  
    <td><button class="save">Save</button><button class="cancel">Cancel</button></td>
</tr>  
<!-- hi mom -->  
$

You can do all manner of crazy things, like grep the database, run sed scripts, etc.

A few implementation note: It only supports tables that meta can see, namely tables with a primary key. We'd like to add support for "smart" filenames, so that the database can be decorated with metadata (via meta ids) that overrides default naming with file extensions and human-readable row-ids (issue 5).

Planned development

Also in the works is fs_fdw, a foreign data wrapper that makes the file system accessible from within PostgreSQL.

The End

That's it. In chapter 3, we'll introduce Aquameta's events module.

Comments

comments powered by Disqus