Files As Metadata Format

YAML and TOML suck. Long live the FAMF!

What is FAMF?

Files As Metadata Format (FAMF) is nothing horribly new. It is an extension of the Unix way of dealing with information.

Everything is a file.

Basically, each field of data is its own file; the key is the file name, and the content of the file is the field content. The exception is when the field is itself another set of fields, or a list of such sets; in that case it becomes a directory.

As an example, in TOML the metadata for this blog post could be structured like this:

title = "Files As Metadata format"
slug = "files-as-metadata-format"
published_at = "2024-07-05T21:17:36+03:30"
created_at = "2024-07-05T21:17:36+03:30"
updated_at = "2024-07-05T21:17:36+03:30"
group = "posts"
tags = ["famf", "information", "metadata"]

This TOML would then be added as a header at the top of a Markdown or Djot document.

Now, if you are on Linux and you want to see the title of every post, you are kind of out of luck. You either have to write a pretty fragile script that uses awk, sed, and regex and won’t work tomorrow, or you have to create or modify your own markup parser that gives you those titles. Never mind editing them.

Instead, imagine this file structure:

/posts
└──/files_as_metadata_format
   ├── content.dj 
   ├── upd_at
   ├── cre_at
   ├── title
   ├── slug
   └── tags

Each file contains only the information asked for. What is the content of upd_at? 2024-07-05T21:17:36+03:30. What is the content of title? Files As Metadata format.
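To make the "title of every post" case concrete, here is a minimal sketch. The function name list_titles and the throwaway directory layout are assumptions made for illustration, not part of any existing tool.

```shell
#!/usr/bin/env bash
# Build a throwaway FAMF tree so the example is self-contained.
# (The second post is hypothetical sample data.)
root=$(mktemp -d)
mkdir -p "$root/posts/files_as_metadata_format" "$root/posts/another_post"
printf '%s\n' 'Files As Metadata format' > "$root/posts/files_as_metadata_format/title"
printf '%s\n' 'Another post' > "$root/posts/another_post/title"

# Print "<directory name><TAB><title>" for every post that has a title file.
list_titles() {
    local posts_dir=$1 dir
    for dir in "$posts_dir"/*/; do
        [ -f "${dir}title" ] || continue
        printf '%s\t%s\n' "$(basename "$dir")" "$(cat "${dir}title")"
    done
}

list_titles "$root/posts"
```

Only the title files are opened; content.dj and the other fields are never read.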

If I want to publish it, I just do this:

date --iso-8601=seconds > pub_at # 2024-07-05T23:02:34+03:30

The application I built checks for the existence of the file and the validity of the date, and we are good to go. The folder would now look like this:

/posts
└──/files_as_metadata_format
   ├── content.dj 
   ├── upd_at
   ├── pub_at <-- Here
   ├── cre_at
   ├── title
   ├── slug
   └── tags
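The check described above (does pub_at exist, and does it hold a valid date?) could be sketched like this. This is not the actual application's code; is_published is a hypothetical name, and the sketch assumes GNU date, the same one that provides --iso-8601.

```shell
#!/usr/bin/env bash
# Succeed only if pub_at exists, is non-empty, and holds a date that
# GNU date can parse. The non-empty check matters because GNU date
# would otherwise accept an empty string as "today".
is_published() {
    local post_dir=$1
    [ -s "$post_dir/pub_at" ] || return 1
    date -d "$(cat "$post_dir/pub_at")" >/dev/null 2>&1
}

post=$(mktemp -d)
is_published "$post" || echo "draft"

# Publish the post, exactly as in the one-liner above.
date --iso-8601=seconds > "$post/pub_at"
is_published "$post" && echo "published"
```

Unpublishing is then just a matter of deleting the pub_at file.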

A more advanced structure may look like this:

uses
├── cre_at
├── description
├── pub_at
├── tags
├── title
├── upd_at  
└──/things
   ├──/editor
   │  ├── main
   │  └── substitute
   ├──/languages
   │  ├── main
   │  └── substitute
   └──/os
      ├── main
      └── substitute

Imagine a page like my uses page: you have a list of stuff, plus the information and content of the page itself.

Here you can see a key-value structure under things, where each value is its own object.
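As a rough sketch of how such a nested structure can be walked, the snippet below flattens every entry under things into key.field=value lines. The function name dump_things and the file contents are placeholders invented for the example.

```shell
#!/usr/bin/env bash
# Recreate the "things" part of the tree with placeholder values.
uses=$(mktemp -d)
for thing in editor languages os; do
    mkdir -p "$uses/things/$thing"
    printf 'example main %s\n' "$thing" > "$uses/things/$thing/main"
    printf 'example substitute %s\n' "$thing" > "$uses/things/$thing/substitute"
done

# Print "<key>.<field>=<value>" for every field of every entry under things/.
dump_things() {
    local things_dir=$1 entry field
    for entry in "$things_dir"/*/; do
        for field in "$entry"*; do
            [ -f "$field" ] || continue
            printf '%s.%s=%s\n' "$(basename "$entry")" "$(basename "$field")" "$(cat "$field")"
        done
    done
}

dump_things "$uses/things"
```

Each directory name plays the role of a key, and the files inside it are the fields of that key's object.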

Now, you might have noticed the tags file. Isn’t that a list? Well, yes, it is, but it is just a list of single values, separated by line breaks.

famf
information
metadata

You can easily parse this using awk, sed, fzf, your language’s split or lines function, and so on.
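For instance, a line-separated tags file can be read with nothing more than a shell loop and grep. The helper name has_tag is an assumption made for this sketch.

```shell
#!/usr/bin/env bash
# Create a tags file with one tag per line, as shown above.
post=$(mktemp -d)
printf '%s\n' famf information metadata > "$post/tags"

# Succeed if the post's tags file contains the given tag as a whole line.
# -x matches whole lines only, -F treats the tag as a literal string.
has_tag() {
    grep -qxF -- "$2" "$1/tags"
}

# Loop over the tags one line at a time.
while IFS= read -r tag; do
    printf 'tag: %s\n' "$tag"
done < "$post/tags"

has_tag "$post" metadata && echo "tagged with metadata"
```

Adding a tag is an append (echo newtag >> tags), with no surrounding brackets or quotes to keep balanced.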

Why?

It is extremely simple: you don’t need to adhere to any kind of syntax, and you won’t need formatters or special editors. You have the best tooling: bash, awk, cp, >, cat, sed, etc. And if that’s not enough for you: you won’t need weird templates, and the files are easier to read, program against, and track. You don’t need any parsers. You are free to use any language you want. A field can hold photos, zip files, binaries, text, etc. You can also read just the titles, publication dates, or any other field without needing to parse all the files.

Wouldn’t that just reduce performance because of all the IOs?

Yeah, sure. I made a simple benchmark, actually. Here is the result:

bench    fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ famf  9.591 µs      │ 1.182 ms      │ 10.18 µs      │ 11.12 µs      │ 1000000 │ 1000000
╰─ toml  7.852 µs      │ 726.9 µs      │ 8.132 µs      │ 8.673 µs      │ 1000000 │ 1000000

As you can see, the difference is small: about 2 µs at the median (10.18 µs versus 8.132 µs). And this benchmark favors TOML, since the TOML is not embedded in another markup file, in which case you would first have to extract the TOML from that file before you could parse it, while the FAMF structure works the same either way. On top of that, we are not counting the cases where you just want a list of titles. In those cases FAMF only opens the title files, while TOML, in the best-case scenario, has to use a specialized parser to stream the content, parse it until it reaches the end of the TOML, and only then give you the value. And that is not even what happens in most real-world cases.

How about all the tooling?

Yes, TOML, YAML, JSON and others have great tooling. You know what else has great tooling? Files! Every file manager, file manipulation tool, and file search tool is helping you navigate those bad boys.

The only thing lacking is schema validation tools.

And that should not be that hard to implement either.
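As a very rough sketch of what the simplest form of validation might look like, the function below only checks that a set of required fields exist and are non-empty. The name validate_post and the choice of required fields are assumptions; a real schema tool would also want to check types, dates, and nested directories.

```shell
#!/usr/bin/env bash
# Print each missing or empty required field; succeed only if none is missing.
# Usage: validate_post <post_dir> <field>...
validate_post() {
    local post_dir=$1 field status=0
    shift
    for field in "$@"; do
        if [ ! -s "$post_dir/$field" ]; then
            printf 'missing field: %s\n' "$field"
            status=1
        fi
    done
    return "$status"
}

# A post that has a title and a slug, but no dates yet.
post=$(mktemp -d)
printf 'Files As Metadata format\n' > "$post/title"
printf 'files-as-metadata-format\n' > "$post/slug"

validate_post "$post" title slug cre_at upd_at || echo "post is incomplete"
```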

Is this actually new?

I am not sure. It should not be. But it is not a common practice outside of Unix file systems. And even then, .conf and .ini files, as well as other weird formats, are not uncommon.

If it is a better idea than JSON, why doesn’t everyone use it?

I don’t know. Who cares! Shut up!

Update on 2024-07-08

Some people on lobste.rs started a very interesting discussion, one that is worth considering. It is about the limitations of how the files are stored and accessed, file-system limitations in general, and the lack of atomicity in changes.

The first point is the impact of a high number of inodes, and the limit on the number of inodes in Unix-like systems. A very valid point. To test it, I created 4 million directories, each containing 40 text files with different text content. I replaced a random file’s content using good ol’ sed and, of course, timed it. Then I deleted every directory but one, did the same thing, and timed it again. I repeated that manually a few times. The normal fluctuations in the system’s performance made more of a difference than the existence of 3,999,999 other directories.

Atomicity in changing the information is another interesting point. For example, if I want to update the content.dj file, I also need to change the upd_at file. If I suddenly get interrupted, I might have changed content.dj without changing upd_at. Another program might also have had the same idea of changing it at the same time. That is not desirable. However, TOML, YAML and friends only offer those nifty features if you are putting everything in one giant file. And that is because you have to load all the information at once, lock the file, and then rewrite the whole thing. You may think that you could change them line by line instead. But that assumes that you know which lines, that you have the proper serializer, and that you don’t mind at least loading all the information into memory or streaming the file, which only helps you with data races.

In all of these cases, the answer is not TOML, YAML or JSON, or FAMF for what it’s worth. It is a goddamn database.