\documentclass[DIVcalc]{scrartcl}

\author{J\"org Sonnenberger}
\title{The redesign of pkg\_install for pkgsrc}
\date{October 15, 2006}

\begin{document}

\maketitle

\begin{abstract}
Pkgsrc is a framework for building third party software on a variety of
systems.  It is the system of choice on DragonFly and NetBSD.

Pkgsrc was originally derived from FreeBSD ports and many features were
added to that foundation. One central component is ``pkg\_install'', a
collection of small programs to install and remove packages and other related
tasks.  While it has been extended over time, the original code base is still
mostly present, together with a number of limitations.

During Google's Summer of Code 2006 program this component was rewritten
to better fulfill the needs of pkgsrc:

\begin{itemize}
\item Integrated archive handling.
\item Full specifications of file formats and algorithms.
\item Versioned, extensible meta data.
\item Better integration of the install framework.
\end{itemize}

In the paper a comparison of the old approaches, the new solution and the
rationale, as well as the state of integration in pkgsrc and of the
conversion tools are given.
\end{abstract}

\section{Introduction}

The NetBSD Packages Collection or ``pkgsrc'' is a framework for building
third party software. Over the years it was extended to support not
only NetBSD, but a great variety of Operating Systems, ranging from
Apple's MacOS X to Interix (Microsoft Services for Unix). Beside NetBSD,
pkgsrc is the system of choice on DragonFly.

The pkgsrc infrastructure is originally derived from FreeBSD's ports
framework. Many features like the wrapper system and buildlink were 
added over the years. One specific piece is ``pkg\_install'', a 
collection of small programs to install and remove packages and manage 
related tasks. While it has been extended over time, the original code 
base is still mostly present.

Several problems have shown up with different severity, like
\begin{itemize}
\item use of external programs for the extraction of packages,
\item use of a temporary directory during extraction, followed by 
moving/copy\-ing every file to the real location,
\item missing documentation of file formats and precise syntax,
\item redundancy of installation/deinstallation scripts,
\item advanced updating facilities,
\item incoherencies between packages built from source and those 
installed via binary packages,
\item difficult interaction with high-level tools.
\end{itemize}

The Google's Summer of Code 2006 project provided an opportunity to work
on redesigning ``pkg\_install'' to fix most, if not all of the
aforementioned problems.

This paper discusses the results in comparison with the older
approaches and looks at the state of integration into the pkgsrc system.

\section{Package metadata}
\subsection{Package patterns}
\label{patterns}

The ability to match package names is needed in a number of situations.
This includes dependencies and conflicts, but also checks for security 
vulnerabilities.

In pkgsrc four different pattern types are currently used:
\begin{itemize}
\item

Plain package names form exact matches.

\item

Dewey patterns like ``gdm\textgreater{}=2.14\textless{}2.14.8'' consist
of the package base name and relation operations. Version numbers are
parsed according to a complicated rule set modeled after common
practice.

\item

Fnmatch patterns allow shell-like wildcards (``pear-5.0.[0-9]*'') and
are most commonly used to match any version of a package.

\item

Csh-style alternatives (``sun-\{jre,jdk\}\textless{}1.3.1.0.2'') are
expanded to elementary patterns. If any of those matches, the
alternative itself is matching.

\end{itemize}

All four types have at least one major limitation. Plain matches are 
actively discouraged, since they can't even deal with local patch 
versions (``estd-0.5nb1''), making them almost useless.

Csh-style alternatives are needed to handle multiple packages providing 
common functionality like ghostscript-afpl and ghostscript-gnu.

Dewey patterns are the most expressive pattern, but can't represent a 
match to all versions. Matching e.g. all sub-versions of 4.3 is a problem 
as well, since release candidates and patch level complicate the matter. 
Dewey pattern only work well, when upper or lower bound are precise. The 
old implementation also had some interesting validation bugs, e.g. 
``php\textless{}5\textgreater{}4'' is matched by ``php-4''.

Fnmatch patterns have the downside of matching more than intended. If 
there's ever a PHP module which name starts with a digit, the common 
``php-[0-9]*'' pattern for the PHP interpreter itself would match the 
PHP module as well.

The situation is complicated further as multiple patterns are sometimes 
used to reduce the number of matching packages. A dependency on PHP  4.x
for example introduces at least two patterns: ``php-4.4.*'' to  match
the API and ``php\textgreater{}=4.4.1nb3'' to specify the ABI. The
evaluation order by ``pkg\_add'' for missing dependencies is critical.
When the second pattern is evaluated first, PHP 5 would be installed
and the first pattern would be unsatisfiable as PHP 4 and PHP 5
conflict with  each other.

To reduce this mess a way to unify the four styles was needed. One more 
desirable criterion exists, which wasn't satisfied by the existing 
rules. ``pkg\_add'' has to choose a package, when more than one package 
matches a pattern. As long as they have the same base name, a built-in 
rule is used (see PHP 4/5 earlier) which selects the highest available 
version. There's no deterministic rule for csh-style alternatives though.
User interaction can be used to resolve such conflicts, but they are 
often either undesirable or unavailable (e.g. automatic package 
installation during bulk builds). The order should therefore follow 
explicitly from the pattern.

As most of the patterns in pkgsrc follow the Dewey-style it was useful
to keep it as base. The generalized version consists of a package base
name and zero or more operator/version pairs. Zero operators provide a
full wildcard match and each pair is processed in order as long as they
match. This means the incorrectly parsed pattern
``php\textless{}5\textgreater{}4'' now is valid and behaves as expected.
Beside the normal relational operators ``\textless'', ``\textless{}=''
and so on, ``\~{}'' is introduced as prefix match. ``php\~{}4.4'' matches
``php-4.4'', but also ``php-4.4pl1'' and ``php-4.4rc1''. Finally
multiple of this simple patterns can be joined using ``\textbar'' to
form alternatives. Ordering of two matches is done by the first matching
alternative first and by ordering the versions themselves if they match
the same one.

While the given rules allow easy merging of two basic patterns, it gets
more complicated, when alternatives are involved. As this is not
typically used in pkgsrc (yet), the problems are left unresolved for now
and will be revisited later. A possible solution is to consider a
package version as matching only if it matches all requirements.

\subsection{Dependencies, conflicts and compatibility}

Packages often need other packages to function properly, e.g. because
they are dynamically linked against them or call a program from them. In
a similar way, some packages can't work when installed at the same
file. Historically two packages has to be marked as conflicting, when
the package content overlapped, as the ``pkg\_add' program didn't handle
it as failure.

Another use case of patterns are explicit compatibility hints. In pkgsrc
the buildlink framework knows two kinds of dependencies -- for ABI and
API. The latter are the classic way to describe that a certain (minimal)
version is needed by a package, e.g. because a new functionality was
added in it. ABI dependencies are more complicated though. As
dependencies are normally open-ended (all later versions match), it is
hard to describe properly when the interface is compatible.

To solve this a package can explicitly specify what it is compatible to.
So instead of requiring ``libfoo\textgreater{}=1.0'', an exact match can
be used by packages depending on libfoo. The maintainer of libfoo is now
responsible for specifying what the oldest compatible version is. This
can be used for ABIs as well as module interfaces in scripting languages
like Python. Support for maintaining the compatibility list based on ELF
``sonames'' or libtool archives is planned.

\subsection{Package lists}

The heart of a package are the files within. The package list (plist for
short) contains all the files in the package, which are supposed to be
``static''. For each file a checksum is stored and it can be used to
detect undesired modifications. The old plist format also contains
modifiers to remove directories on removal and execute single line
commands. The functionality to specify permissions or ownership existed,
but was never used.

The old plists had three major issues:
\begin{itemize}

\item

It contains some package metadata, but not all. The ability to execute
commands was mentioned already. Another example is that dependencies and
conflicts are listed in the plist. The on-line description, the full
description, install and deinstall scripts, the package maintainer and
all the other information are stored separately though.

Checksums have been added as afterthought using special comments.

\item

Commands don't belong into a plist, that's what the install/deinstall
scripts are for. Firstly, it increases the number of places to audit
and secondly, it also provides a different environment.

\item

Handling of shared directories is flawed as it is often impossible or
very unpractical to factor out a base package to "own" the shared
directories. In the past most common directories have been created using
mtree from a template and were considered sticky (e.g. never to be
removed).

\end{itemize}

For the new ``pkg\_install'', @exec and @unexec are no longer supported
by unanimous consent. All non-plist rated information have been moved
and the other statements have been made local to each entry. A field for
checksums has been added as well as a field to tag entries to belong to
specific classes. The latter allows special scripts to run on the tagged
entries e.g. to register a font with fontconfig or add a texinfo page to
the local index.

The second important change is the classification of entries. Inspired
by the Solaris package tools, other types of plist entries beside simple
files are support.

Configuration files are first-class entries. When the file does not
exist at install time, it is copied from a template or created as empty
file (e.g. for logfiles). On removal, the management tools can either
keep it as is, remove it on user request or archive it for later use.

Similar to configuration files, volatile files have a template. They
are not archived or even checked for modification, but instead assumed
to be modified by the package at vim. This is useful for fixed indices
like texindex's info/dir file.

Beside files directories can be contained in the plist as well. As the
new ``pkg\_add'' creates them on demand and ``pkg\_delete'' removes them
when no other package is referring to them, this is seldom needed. It
is needed when empty directories should be part of a package or when
special permissions are required.

Two special kinds of directories are also supported. Configuration
directories can contain only configuration files and directories as
entries and are a way to mark a whole directory hierarchy as containing
only configuration files. They are supposed to be handled as whole (e.g.
archived). Exclusive directories place a directory under the sole
control of a package. No further plist entries are allowed and the system
doesn't care about the content. The package is responsible for removing
the content at deinstall time. This makes it possible to properly handle
e.g. shared-mime-info's share/mime.

Last but not least are symbolic and hard links recorded. The former
should not change its target and the latter might be converted down to a
symlink if necessary, e.g. when target and plist entry are not on the
same filesystem.

\subsection{Essential and non-essential metadata}

Some of the data attached to a package has been mentioned already -- the
package name, the list of dependencies and conflicts, the plist. Other
items are:
\begin{itemize}

\item

The prefix a package is installed to and which it is supposed to stay in
with some exceptions,

\item

How to reach the maintainer of the package.

\item

The OS version and architecture the package was built.

\item

The license(s) it can be distributed under.

\item

The short and long descriptions, both in English and local languages.

\end{itemize}

All this data can be classified as essential or as non-essential. The
former category describes what directly affects ``pkg\_install'' and the
basic user experience. Having translated descriptions is nice to have,
but the English version will always be authoritative and required. Just
because a field is essential doesn't mean that it has to be present
though. A typical example is the license field which will be missing for
most packages, but is critical for determining whether a package can be
distributed.

The separation between both classes is useful as it reflects the need of
correctly managing and preserving the meaning of a field. As the list of
metadata will change in the future, backwards-compatibility will be
needed. At the very least it must cover all the essential fields and
those have be updated as easily as possible. To achieve this, each field
has strong validation rules, which are relaxed for the non-essential
metadata.

\subsection{Package format}

The old ``pkg\_install'' just compressed tar archives containing all
files in the plist and normally one file for each of short and long
description, the plist, install and deinstall script, size infos. The
latter set is also the metadata kept in the package database (typically
/var/db/pkg or .pkg in the prefix). 

For a typical installation this easily takes a few thousand inodes. To
avoid the associated overhead, a format to keep them in one file was
needed which doesn't compromise the extensibility. Two generic markup
languages were considered, namely XML and YAML. Since white-space handling in
XML is awful and YAML is also much human-friendlier, it was preferred by
the author.

The serialized package content uses a shallow hierarchy which emphasizes
the importance of the various fields. The package itself and the plist
entries are explicitly tagged and thereby also versioned. This allows
the package tools to easily detect and convert older versions when
necessary.

Binary packages are still (compressed) tar archives. The content is
different though. In the top level directory, there's an index file
containing the serialized package description (as above). This is also
required to be first entry of the archive. Signatures will be stored as
second entry, but as no light-weight gpg verifying exists and X.509
certificates don't play nicely with the (current) setup of pkgsrc bulk
builds, this is not finalized yet. After the index file the normal files
from the plist are stored in plist order. The files are stored with the
relative path under a directory named like the package. All other plist
entries are synthesized during extraction.

Enforcing a strict order on the packages makes it possible to extract a
tarball with minimal buffering and read the content without having to
process more than the index size (up-rounded to compression blocks). The
construct of using a subdirectory for the actual file allows later
bundling of multiple packages into a single archive, with minimal
changes.

\section{The programming interface}

The implementation of ``pkg\_install'' consists of a library core and
small bindings on top. The core consists of four major components: the
pattern related functions, the package-related functions, the
plist-related functions and the package database functions.

\subsection{Pattern functions}

The pattern API provide simple accessor functions for easy access in
common situations. Both matching a pattern against a package name and
ordering two package names with regard to a pattern are supported. The
allocation and freeing of resources are kept internal.

The convenience functions are wrappers for the full implementation.
Parsing of a package name or pattern is a separate task to allow later
reuse. Functions to extract to the base package name or the list of
matched base package names for a pattern are provided. Those are useful
e.g. for a bulk build as they can reduce the quadratic runtime in the
number of packages and patterns to linearly.

\subsection{Package functions}

The package functions deal with in-memory package description and
related functions. Functions to create one from scratch or destroy it
with freeing all associated resources are provided as well as functions
to get or set the meta-data. Multi-value fields can be read either using
a temporary array or an iterator interface.

The finished package description can be validated either for basic
compliance or for the full package conformance. Descriptions which pass
the latter can be serialized using a callback interface. In the same way
package descriptions can be read back and parsed. A function to create a
binary package from a package descriptions and the files relative to
given prefix completes the interface.

Errors are classified depending on whether they are input-related or
internal. For internal errors like failing memory allocations or
violations of the API contract, the program can provide a callback which
is called with the current package descriptions, a failure code and
optional context-depending arguments. The callback is expected to
terminate the application, otherwise it is abort(3)ed. For input-related
and other "soft" errors, a different concept is used. The error callback
has the same arguments, but can return a value to decide whether or not
the processing should continued. This is a ternary value--on error the
processing can continue as long as it makes sense to diagnose further
problems, but the initial error is sticky. Alternatively the processing
will directly bail out. The callbacks are provided on package creation
or parsing, it is not yet intended to modify them.

\subsection{Plist functions}

The plist API allows the addition and removal of individual entries. The
interface is strongly typed and each type has independent accessor
functions. The implemented makes heavy use of the preprocessor to keep
redundancy in code minimal. Similiar to the generic package interface,
the plist access is mostly done using iterative callback interfaces.

\subsection{Package database functions}

The database functions are still in the progress of being revamped. The
desired interface has three components:
\begin{itemize}

\item

Functions to query the database. This should be generic enough to work
with package repositories as well.

\item

Functions to regenerate all internal state like the hash databases of
all files and the forest of packages and their relationship.

\item

Functions to modify the database as set of add/remove operations.

\end{itemize}

The first category is rudimentary implemented by providing an iterator
interface over all packages. The requirement for generalization is
important here as the same functions to decide whether a dependency is
installed can be used to find the best match in a binary repository.
Most query functions should work on binary package repositories as well
as the package database.

The second category is implemented, but has to be moved from the
standalone command into the library.

The third category is the most challenging. Single add and remove
operations work, but impose a severe limitations. Updates of non-leaf
packages would have to either remove all depending packages or leave the
database temporarily in an inconsistent state. To solve this, complex
updates should be done as sets of add and remove operations, which are
atomic from the point of the package database.

The downside is that the logic for verifying whether all dependencies
are resolved, no conflicts are present and the plists of all
to-be-installed packages are non overlapping gets a lot more
complicated. As the use of index databases is still necessary for
installations with multiple hundred packages, the usage of memory to
keep the changes in memory is increasing as well.

It is open whether it is possible and helpful to split such transactions
into minimal blocks, which keep the database in a consistent state. It
will not help when e.g. xorg-libs changes, but is useful for the generic
``update-my-system'' case.

\section{Integration and conversion}
\subsection{Staged installation}

The first step for the integration of the new ``pkg\_install'' is the
elimination of direct installation into the prefix. This makes it much
simpler to ensure that all directories created are either requested by
the administrator or handled by the framework.

Another important desire is to ensure consistent permissions as many
packages don't use the pkgsrc INSTALL\_* variables, but random
combinations of cp, pax/tar and install.

Therefore the facilities to install into a subdirectory of the working
directory were added. As pkgsrc already provided just-in-time su, it
was desirable to allow full user package builds. Many packages just use
default ownership for files and the aforementioned override directives
can be used to provide the functionality even in the old
``pkg\_install''. Some care had to be applied for packages which install
setuid/setgid binaries as the access permissions are extracted by tar
and the ownership is later changed by ``pkg\_add'', removing the
setuid/setgid bits as side effect.

\subsection{Pattern conversion}

The need to convert old patterns to the new style is an independent
effort. Both for the integration and the conversion patterns have to be
converted, but it can mostly be done on demand.

As written in section \ref{patterns}, esp. fnmatch patterns are often
not precise. A perfect automatic conversion is therefore not possible,
but the intent of most patterns can be accurately represented.

The conversion mechanism is based on type-specific rules.
Csh-alternative style patterns are expanded, each expanded pattern is
converted and the list joined with ``\textbar''. Simple package names
are converted by replacing the last hyphen with ``==''. Dewey patterns
are unchanged as they are a subset of the new grammar. The edge cases
are working as humans would expect them, so the change in functionality
is justified.

The most difficult case is the conversion of fnmatch patterns. For those
a number of heuristics are used. The pattern is matched against regular
expressions representing common use in pkgsrc. For example, when
``\^{}(.*)-\textbackslash{}[0-9]\textbackslash{}*\$'' is matched, it
means that the patterns applies to any version of the captured first
sub-expression. As such it is converted simply to the that sub-expression.
Other cases which are handled automatically are ``php-4.4.[0-9]*'' and
``php-4.4.*'', which are converted to ``php~4.4>=4.4''. ``php-4.4nb*''
are ``php-4.4nb[0-9]*'' are converted to ``php~4.4nb''.

The given rules can be used to convert all but 30 patterns used by
packages in the ``pkgsrc-2006Q2'' branch and the rest are all somewhat bogus
special cases. It is not clear, whether they will end as hard-coded
special cases or are left for human intervention.

\subsection{The new pkg flavour}

In preparation for better support of multiple packaging systems Johnny
Lam refactored the package installation and creation code over the last
summer. This dramatically simplifies the initial efforts needed for integrating
a different ``pkg\_install'' implementation. Using compatibility
wrappers for ``pkg\_info'' and ``pkg\_admin'', the changes are
concentrated to two places:
\begin{itemize}
\item mk/flavour/pkg or a copy thereof
\item mk/pkginstall
\end{itemize}

The former code has to be modified to use the new calling conventions
and use individual arguments for each dependency instead of a
space-separated list.

The latter code provides the install/deinstall script framework. Most of
the functionality has to conditionally tag corresponding items for
the new ``pkg\_install' instead of expanding the shell scripts directly.
This will be done incrementally to allow better testing.

As the interface of the package management commands is not finalized,
the implementation of this code is still a work-in-progress and not part
of the pkgsrc tree.

\subsection{Converting existing packages and installations}

The creating of package descriptions for testing a new implementation is
tiresome and with the implementation of ``pkg\_create'' a shell script
for converting existing packages was written. This script has been
extended over time to stay in sync with the feature set of
``pkg\_create''.

The biggest missing item right now is the handling of old install
scripts. Those fall in one of two categories. Either they are created
from the install/deinstall script framework or they are custom rules
for a specific package.

The first class is relatively easy to handle as the scripts create
individual entries in the package tarball or package database. The
metadata can be extracted from the bottom of each file to handle
appropriately.

The second class is more involved as there are no fixed marker in the
scripts to annotate the beginning or ending of the common fragments
(which are already handled). A second problem is that the scripts work
both as pre-installation and post-installation scripts and the calling
convention has to be emulated. It is an open question how far and in
which an automatic conversion can be successful.

\section{Conclusion}

The redesign of ``pkg\_install'' allowed fixing many of the problems of
the old implementations. The installation of packages can be done
in-place. The toolchain itself is much more self-contained, typically not
requiring external programs, but for additional features. Building
blocks for better high-level update mechanisms are provided. The modular
architecture will allow further improvements and extensions with minimal
redundancy and in a straight-forward fashion.

As side-effect of this work, pkgsrc itself has been improved in a number
of ways. During the development of the pattern conversion tools, many
bogus dependencies have been fixed. The staged installation has been
desired for years and allows catching up with OpenBSD's ports system in
that area. Beside the ability to build packages entirely as normal user,
it will allow pkgsrc to sub-packages as well.

\end{document}
