The /pkg Hierarchy

Introduction

This document describes a filesystem organizational technique that solves several problems associated with software package management and distribution under a Unix-like operating system. Though the document uses examples from development in a GNU/Linux (hereafter referred to simply as "Linux") environment, it is straightforward to mimic the process on other Unix systems.

Motivation

The original motivation for the /pkg hierarchy was to find a generic solution for situations such as this:

To install package A, I needed library L version n (L.n), but I only had version m (L.m) installed. So I downloaded and installed L.n, but this overwrote L.m, which broke package B. In order to upgrade package B to work with library L.n, I had to perform a system-wide (distribution) upgrade, which left package C in an unusable state. So I downloaded the source to package C, but when I tried to compile it against library L.n, it reported the following errors... [etc]

A brief search through the Web or Usenet reveals that this is hardly an uncommon situation, and that no Linux distribution is entirely immune to this problem of "dependency management". The approach Linux distributors have generally taken in solving this problem is to find a collection of software packages that more-or-less work together, and then version the collection (i.e. give a version number to the distribution). However, this approach has problems of its own. The two most prominent are that (1) it is often difficult to integrate new software packages that were not in the original distribution, and (2) third-party library version upgrades can potentially put the entire system into an unstable state.

Problems Addressed

The /pkg hierarchy has its roots in being a solution to dependency management; however, it turns out to be an adequate solution for several common problems:

Library versioning: Under the /pkg hierarchy, it is possible to have several versions of a shared library simultaneously installed, even if the author has made incompatible changes between versions without bumping the library (so) number. Under traditional distributions, this is an inevitable point of vulnerability, and requires constant vigilance on the part of the distribution maintainers. Under the /pkg hierarchy, this is not an issue at all.
Version rollbacks: The common Linux distributions do not provide a reliable version rollback mechanism. For example, in the event of a version upgrade gone bad, it can be difficult (if not impossible) to roll back to a previous version, since files will generally be overwritten during the upgrade. Under the /pkg hierarchy, files are never overwritten, and previous versions of a package remain fully intact. Thus full, fail-safe rollbacks are guaranteed.
Package namespacing: Currently there is no standardized way of avoiding naming conflicts between package commands, or even between different versions of the same package. For example, 'make' 3.79.1 will happily overwrite 'make' 3.77, and 'hostname' from net-tools will happily overwrite 'hostname' from inetutils. This can obviously lead to ugly consequences if the commands do not behave in exactly the same manner. The /pkg hierarchy has inherent namespacing, which prevents like-named files from different versions or packages from overwriting one another.
Distribution lock-in: It is typically impossible to install packages from multiple distributors on the same system, or to change distributions seamlessly on a live system. This has the ultimate effect of locking one in to a certain distribution. However, under the /pkg hierarchy, the packages from different distributors may be simultaneously installed without conflict [this is achieved by making the "Distributor ID" part of the namespacing; this is covered in detail in the next section].
Cross-compilation: Cross-compiling has traditionally involved a lot of attention and pre-planning to do efficiently. The /pkg hierarchy allows cross-compilation to be performed in exactly the same manner as standard compilation, as the foundation for cross-compilation is built into the hierarchy itself.

While many of these problems have already been solved independently, the advantage of the /pkg hierarchy is that it simultaneously addresses all of these problems in an elegant and comprehensive manner.

Technical Overview

The /pkg hierarchy derives its name from the way packages are installed on the system. Every time a package is compiled from source, it is installed in a unique location similar to the following:

/pkg/glibc/2.2.5/.karmaki686/.000

These path elements will be referred to in this document as:

/pkg: The package root.
glibc: The package name.
2.2.5: The package version.
.karmaki686: The package distribution.
.000: The package build.
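
As an illustration of how regular these paths are, the following is a minimal sketch of a parser that splits a full build path into the five elements named above. The names PkgPath and parse_pkg_path are hypothetical and are not part of any existing /pkg tooling; the only assumption taken from this document is that the distribution and build elements are dot-directories.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class PkgPath(NamedTuple):
    """The five path elements described above (hypothetical helper type)."""
    root: str          # package root, e.g. "/pkg"
    name: str          # package name, e.g. "glibc"
    version: str       # package version, e.g. "2.2.5"
    distribution: str  # package distribution, e.g. ".karmaki686"
    build: str         # package build, e.g. ".000"


def parse_pkg_path(path: str) -> PkgPath:
    """Split a full build path into its five named elements."""
    parts = PurePosixPath(path).parts
    # For a full build path, parts is ('/', root, name, version, dist, build).
    if len(parts) != 6 or parts[0] != "/":
        raise ValueError(f"not a full build path: {path!r}")
    _, root, name, version, distribution, build = parts
    if not (distribution.startswith(".") and build.startswith(".")):
        raise ValueError(f"distribution and build must be dot-directories: {path!r}")
    return PkgPath("/" + root, name, version, distribution, build)


print(parse_pkg_path("/pkg/glibc/2.2.5/.karmaki686/.000"))
```

Because every element has a fixed position, a script never needs a package database to answer "which version and build is this file from?"; the path itself carries the answer.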

It is beneath a path like this that all files related to a given package are confined. The traditional root-level directories are re-created as subdirectories here, giving something like:

/pkg/glibc/2.2.5/.karmaki686/.000/
                                 |-bin/
                                 |-etc/
                                 |-include/
                                 |-lib/
                                 |-var/

Once a package is installed using this technique, symlinks are created to the package subdirectories all the way up the hierarchy. The resulting structure looks like the following:

/pkg/glibc/
          |-bin -> 2.2.5/bin/
          |-etc -> 2.2.5/etc/
          |-lib -> 2.2.5/lib/
          |-2.2.5/
                 |-bin -> .karmaki686/bin/
                 |-etc -> .karmaki686/etc/
                 |-lib -> .karmaki686/lib/
                 |-.karmaki686/
                 |            |-bin -> .002/bin/
                 |            |-etc -> .002/etc/
                 |            |-lib -> .002/lib/
                 |            |-.001/
                 |            |-.002/
                 |
                 |-.johndoei386/
                               |-.000/
                               |-.001/
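
The symlink chain in the diagram above can be sketched as follows. This is only an illustration under stated assumptions, not the actual tooling: the function names link_level and activate are hypothetical, only the bin, etc and lib subdirectories from the diagram are linked, and the demonstration runs in a temporary directory on a POSIX system rather than under a real /pkg.

```python
import os
import tempfile

SUBDIRS = ("bin", "etc", "lib")  # the subdirectories linked in the diagram


def link_level(level_dir, target):
    """In level_dir, point bin/etc/lib at target/<subdir> using relative links."""
    for sub in SUBDIRS:
        link = os.path.join(level_dir, sub)
        if os.path.islink(link):
            os.remove(link)  # only the link is replaced; installed files are untouched
        os.symlink(os.path.join(target, sub), link)


def activate(pkg_root, name, version, dist, build):
    """Create the three levels of symlinks: distribution, version, package."""
    pkg_dir = os.path.join(pkg_root, name)
    ver_dir = os.path.join(pkg_dir, version)
    dist_dir = os.path.join(ver_dir, dist)
    link_level(dist_dir, build)    # .karmaki686/bin -> .002/bin
    link_level(ver_dir, dist)      # 2.2.5/bin       -> .karmaki686/bin
    link_level(pkg_dir, version)   # glibc/bin       -> 2.2.5/bin


# Demonstration in a scratch directory standing in for /pkg.
root = tempfile.mkdtemp()
for sub in SUBDIRS:
    os.makedirs(os.path.join(root, "glibc", "2.2.5", ".karmaki686", ".002", sub))
activate(root, "glibc", "2.2.5", ".karmaki686", ".002")

# Following glibc/lib through all three links ends in the build directory.
resolved = os.path.realpath(os.path.join(root, "glibc", "lib"))
print(resolved)
```

Calling activate again with a different build, distribution, or version only rewrites links, which is why switching between installed builds never touches the files of either one.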

Directory Explanations

/pkg/glibc/: By putting all packages under /pkg, we get rid of the mess that has become /opt/package, /usr/package, /home/package, and the process of spreading package contents all over the filesystem to the point where a custom tool and a database are required to track it all.
/pkg/glibc/2.2.5/: Each version of a package is given its own directory. This makes compiling and installing new programs incredibly easy. With minimal effort, it completely fixes problems with failed dependencies or breaking packages with an upgrade.
/pkg/glibc/2.2.5/.karmaki686/: Dotfile directories at this level represent distributions, and packages from several distributions may be intermixed without conflict. Each version directory will be symlinked to the subdirectories of a particular distribution, the preference of which is easy to set on both a system-wide and individual-package basis.
/pkg/glibc/2.2.5/.karmaki686/.000/: Dotfile directories at this level allow for sequential package builds. By giving each build its own directory, we guarantee that essential files are never overwritten. If a faulty build gets inadvertently installed or distributed, it is trivial to perform a rollback to the working build.
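
To make the idea of sequential builds concrete, here is a minimal sketch of how a build script might allocate the next build directory. It assumes that builds are always named with a dot followed by three zero-padded digits, as in the examples in this document; the function name next_build_dir is hypothetical, and the demonstration uses a temporary directory in place of a real distribution directory.

```python
import os
import re
import tempfile

# Assumed naming convention, taken from the examples: ".000", ".001", ...
BUILD_RE = re.compile(r"^\.(\d{3})$")


def next_build_dir(dist_dir):
    """Create and return the next sequential build directory in dist_dir."""
    highest = -1
    for entry in os.listdir(dist_dir):
        match = BUILD_RE.match(entry)
        if match and os.path.isdir(os.path.join(dist_dir, entry)):
            highest = max(highest, int(match.group(1)))
    new_dir = os.path.join(dist_dir, ".%03d" % (highest + 1))
    os.mkdir(new_dir)  # raises if the directory exists, so nothing is ever reused
    return new_dir


# Demonstration: a distribution directory that already holds two builds.
dist_dir = tempfile.mkdtemp()
os.mkdir(os.path.join(dist_dir, ".000"))
os.mkdir(os.path.join(dist_dir, ".001"))
new_build = next_build_dir(dist_dir)
print(os.path.basename(new_build))
```

Since a new build always lands in a fresh directory, the earlier builds stay intact and remain available as rollback targets.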

Symlinks

Consider the ldd output from the ping binary:

karmak@ariel$ ldd /bin/ping
    libm.so.6 => /pkg/glibc/2.2.5/.karmaki686/lib/libm.so.6 (0x40016000)
    libreadline.so.4.1 => /pkg/readline/4.3/.karmaki686/lib/libreadline.so.4.1 (0x40033000)
    libresolv.so.2 => /pkg/glibc/2.2.5/.karmaki686/lib/libresolv.so.2 (0x40059000)
    libnsl.so.1 => /pkg/glibc/2.2.5/.karmaki686/lib/libnsl.so.1 (0x40068000)
    libncurses.so.5 => /pkg/ncurses/5.2/.karmaki686/lib/libncurses.so.5 (0x4007f000)
    libc.so.6 => /pkg/glibc/2.2.5/.karmaki686/lib/libc.so.6 (0x400c1000)
    /pkg/glibc/2.2.5/.karmaki686/lib/ld.so => /pkg/glibc/2.2.5/.karmaki686/lib/ld.so (0x40000000)

What we see here is that packages in the /pkg hierarchy are not linked against the standard locations (/lib and /usr/lib), but instead are linked against the distribution directories. Thus it is possible to have different applications linked against different library versions, even when those libraries share the same name. By taking the linking as far as the distribution directory, we can support multiple distributions under the same hierarchy, and cross-compilation becomes simply a matter of a few changes to the standard build scripts. Furthermore, by not linking against the build directories, we are free to rebuild a package as many times as necessary, and freely experiment with cross-distributor package compatibility.
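
Because the library paths encode package, version, and distribution, output like the above is easy to process mechanically. The following is a minimal sketch, assuming ldd lines of the form shown in this document; the regular expression and the function name library_origins are illustrative only, and the sample text is copied from the listing above rather than produced by running ldd.

```python
import re

# Matches lines such as:
#   libm.so.6 => /pkg/glibc/2.2.5/.karmaki686/lib/libm.so.6 (0x40016000)
LDD_LINE = re.compile(
    r"^\s*(?P<soname>\S+) => "
    r"/pkg/(?P<name>[^/]+)/(?P<version>[^/]+)/(?P<dist>\.[^/]+)/lib/\S+"
)

SAMPLE = """\
    libm.so.6 => /pkg/glibc/2.2.5/.karmaki686/lib/libm.so.6 (0x40016000)
    libreadline.so.4.1 => /pkg/readline/4.3/.karmaki686/lib/libreadline.so.4.1 (0x40033000)
    libncurses.so.5 => /pkg/ncurses/5.2/.karmaki686/lib/libncurses.so.5 (0x4007f000)
"""


def library_origins(ldd_output):
    """Map each shared library name to its (package, version, distribution)."""
    origins = {}
    for line in ldd_output.splitlines():
        match = LDD_LINE.match(line)
        if match:
            origins[match.group("soname")] = (
                match.group("name"),
                match.group("version"),
                match.group("dist"),
            )
    return origins


origins = library_origins(SAMPLE)
for soname, origin in sorted(origins.items()):
    print(soname, origin)
```

A report like this shows at a glance which package versions a binary depends on, something that normally requires querying a package database.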

The symlinks may appear to be a point of vulnerability in the system, but this is not the case. As the ldd output shows, almost all of the symlinks are there for the user's convenience. The only exceptions are the symlinks to the build directories, which require only a statically linked version of 'ln' or 'sash' to repair. The alternative, overwriting files during an upgrade, is no less error-prone and much harder to fix when things go wrong.

Benefits

Because of the highly structured layout, it is easy to write scripts that automate everything from the build procedure to nightly backups. In the long run, this structure is much more efficient than the traditional filesystem hierarchy. Some examples of the efficiency and power:

Going from the author's source code on a remote server to a ready-to-redistribute binary build typically takes less than ten commands. Subsequent builds of the same source are fully automated. All build information is automatically embedded in the binary distribution. Recipients of the binary package can repeat the entire build with three commands.
It is virtually impossible for anything on the system to break as a result of installing a package. So, no more dependency problems. Need three different versions of glibc installed? No problem.
Assigning every package (at the name level) a unique user ID can be fully automated and dynamically managed. Thus ends the nobody/nogroup fiasco.

Michael Carmack
karmak@karmak.org