Not-Forking Upstream Source Code Tracker

Not-Forking semi-automatically incorporates software from upstream projects by tracking and merging. Designed for use in build and test systems, Not-forking can combine an arbitrary number of upstreams accessible over the internet by any or all of git, fossil and web download. Not-Forking was originally developed for the LumoSQL project and is now fully independent.

Simple Use Case: A project needs a particular cryptographic or database library, and the library maintainers have irregular releases with many bugfixes in between. In order to protect the project from being influenced by the bugfix cycles of the library, the project will often decide to copy the library in-tree, and then periodically hand-port new library versions to suit the project's release cycle. This is effectively a fork. With Not-forking, the project never needs to face this decision.

More Advanced Use Case: An embedded software product requires a single image containing operating system and applications, with a new version twice a year. Embedded software is frequently shipped with wildly out-of-date versions because this is a hard problem and often not a priority for the product manufacturer. Not-forking reduces the maintenance of pulling many upstreams into the image, while still giving control over how this is done in the build process (e.g. by adding a product-specific version string to imported code).

The overall effect is something like Fossil merge --cherry-pick or git-cherry-pick except that it additionally copes with the messiness of software including:

Not-forking is the grease that helps projects cooperate when creating bugs in the same codebase, rather than creating mutually incompatible sets of bugs.

Forking Regarded as Bad

Forking is often a social rather than a technical issue. The not-forking tool assists in keeping friction low and helping developers only fork when they absolutely need to. Not-forking helps avoid projects carrying a fork of external code they need, even if the upstream code/library is not maintained in a way that is compatible with the project. Not-forking also pushes back against Git/GitHub's "Fork by default" development philosophy.

In 2021 the most commonly used public source code repositories are based on the git SCM, especially GitHub and GitLab. GitHub puts a prominent "Fork" button on every project and says "Forking is at the core of social coding at GitHub". As a result, as of January 2021, GitHub hosts 43 million distinct software projects, most of them not original but created by GitHub's Fork, and very many of them abandoned. GitHub is used to maintain some wonderful code, but 43 million projects is an impossibly large number.

In contrast, the Fossil SCM discourages forking and encourages regular remerging of branches. Fossil tries to make it easier to nudge users and contributors into more engagement, rather than increasingly-divergent code forks that don't talk to each other. Where GitHub's flagship feature is "Fork", the Fossil timeline is key to how Fossil works (demonstrated here with the source code of SQLite, a very busy project). The Fossil Timeline supports the Fossil development philosophy.

Overall not-forking configuration

A not-forking configuration is a directory containing one or more subdirectories, each one defining a different project to track.

Each project tracked by not-forking needs to define what to track, and what changes to apply. This is done by providing a number of files in the project subdirectory of the configuration: the minimum requirement is an upstream definition file; other files can also be present indicating what modifications to apply (if none are provided, the upstream sources are used unchanged).

The not-forking tool refers to each project using the name of the subdirectory containing it; when handling all projects at once, it essentially lists this directory to see what's there.

Upstream definition file

This file describes the nature of the upstream. What version control system does it use? Where are its repositories? What style of version string does it use?

The file upstream.conf has a simple "key = value" format with one such key, value pair per line: blank lines and lines whose first nonblank character is a hash (#) are ignored; long lines can be split into multiple lines by ending a line with a backslash meaning continuation into the next line.
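As a rough illustration of the line format just described, the following Python sketch reads "key = value" text into a dictionary. It is not the tool's own parser: the function name parse_key_values is hypothetical, conditionals are not handled, and the treatment of a backslash at the end of a comment line is an assumption. Later keys overwrite earlier ones, which matches the last-value-wins rule described in this section.

```python
def parse_key_values(text):
    """Parse 'key = value' lines into a dict (conditionals are not handled)."""
    # Join continuation lines: a trailing backslash continues onto the next line.
    logical_lines = []
    pending = ""
    for raw in text.splitlines():
        if raw.endswith("\\"):
            pending += raw[:-1]
            continue
        logical_lines.append(pending + raw)
        pending = ""
    if pending:
        logical_lines.append(pending)

    settings = {}
    for line in logical_lines:
        stripped = line.strip()
        # Skip blank lines and lines whose first nonblank character is '#'.
        if not stripped or stripped.startswith("#"):
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = stripped.partition("=")
        if not sep:
            raise ValueError(f"not a key = value line: {line!r}")
        settings[key.strip()] = value.strip()  # repeated key: last value wins
    return settings


example = "# upstream definition\nvcs = git\nrepos = https://github.com/\\\nsqlite/sqlite.git\n"
print(parse_key_values(example))
```

Splitting on the first equals sign only means that a value such as ">= 1.0" survives intact.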

There is a special line format to indicate conditionals, with the general form:

if (condition)
...
[else if (condition)
...]
[else
...]
endif

Note that conditionals cannot at present be nested. The following conditions are available at the time of writing:

If a key is present more than once, the last value seen wins; therefore, it is possible to define a key inside a conditional block, and then to define it again outside the block to provide a default value.

The only key which must be present is vcs, and there is no default. It indicates what kind of version control system to use to obtain upstream sources; the value is the name of a version control module defined by the not-forking mechanism. At the time of writing, git, fossil and download are valid values. In general, the documentation for the corresponding version control module defines what else is present in the upstream.conf file; this document briefly describes the configuration for these modules.

Optionally, other keys can be present: compare, subtree and version_filter.

The compare key indicates what method to use to compare two different version numbers; if omitted, it defaults to version, which compares "normal" software version numbers: sequences of digits compare numerically, and sequences of letters compare alphabetically, with the exception that a suffix "-alpha" or "-beta" causes the version to be considered before the string without such a suffix. Examples of version numbers in order are:

0.9a < 0.9z < 0.10 < 1.0 < 1.1-alpha < 1.1-beta < 1.1 < 1.1a

This definition will even cope with the numbering scheme used by TeX and METAFONT which are "Pi" and "e" respectively. The definition can be extended to deal with version numbering schemes used by normal software. However, it will never work correctly with the version numbers used by some software such as the CLC-INTERCAL compilers (where for example 0.26 < 1.26 < 0.27).
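The ordering rules above can be sketched as a Python sort key. This is only an approximation written from the description, not the tool's implementation: the name version_key is hypothetical, and placing a run of digits before a run of letters when the two meet at the same position is an assumption the text does not settle.

```python
import re

# Rank of the optional suffix: "-alpha" and "-beta" sort before no suffix at all.
_SUFFIX_RANK = {"-alpha": 0, "-beta": 1}
_NO_SUFFIX = 2


def version_key(version):
    """Return a tuple that sorts version strings in the order described above."""
    rank = _NO_SUFFIX
    for suffix, suffix_rank in _SUFFIX_RANK.items():
        if version.endswith(suffix):
            version = version[: -len(suffix)]
            rank = suffix_rank
            break
    tokens = []
    # Split into runs of digits (compared numerically) and runs of letters
    # (compared alphabetically); punctuation such as '.' only separates runs.
    for digits, letters in re.findall(r"(\d+)|([A-Za-z]+)", version):
        if digits:
            tokens.append((0, int(digits), ""))
        else:
            tokens.append((1, 0, letters))  # assumption: letters sort after digits
    return (tuple(tokens), rank)


versions = ["1.1", "0.10", "1.1-beta", "0.9z", "1.1a", "1.0", "0.9a", "1.1-alpha"]
print(sorted(versions, key=version_key))
```

Because the digit runs are compared as integers, successive TeX-style versions such as 3.14 and 3.141 also sort in release order under this key.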

The subtree key indicates a directory inside the sources to use instead of the top level.

The version_filter key has the same format as if version and means that only project versions which make the condition true will be considered; for example version_filter = >= 1.0 could indicate that versions of the project before 1.0 did not provide required functionality and will not be used.

Finally a line starting with the word block is special as it introduces multiple upstream definitions related to the same project; the file will be considered divided into blocks, with the special "block" line separating them; the first block is used as "base" and concatenated with each of the subsequent blocks in turn; when looking for a particular version of the project, the first block containing it will be used; for example, a simplified version of the "LMDB" project contains:

vcs = git

block
repos = https://github.com/openldap/openldap

block
repos = https://github.com/LMDB/lmdb

This is equivalent to two separate upstream files, containing:

vcs = git
repos = https://github.com/openldap/openldap

and

vcs = git
repos = https://github.com/LMDB/lmdb

When asked for a particular version of LMDB, the program will look for it first in the OpenLDAP repository and, if it is not found there, in the LMDB repository (which contains only older versions up to 0.9.15). As a result, one can obtain any version available without having to know that they come from different places.
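The expansion of a file with "block" lines into separate upstream definitions can be sketched as follows. This is an illustration under assumptions: the function name expand_blocks is hypothetical, and "concatenated" is taken to mean placing the base lines before the lines of each block.

```python
def expand_blocks(text):
    """Split an upstream definition at 'block' lines and prepend the base to each block."""
    sections = [[]]  # sections[0] is the base; the rest are the blocks
    for line in text.splitlines():
        words = line.split()
        if words and words[0] == "block":
            sections.append([])
        elif words and not words[0].startswith("#"):
            sections[-1].append(line.strip())
    base, blocks = sections[0], sections[1:]
    if not blocks:
        return [base]  # no "block" line: the file is a single definition
    return [base + block for block in blocks]


lmdb = """\
vcs = git

block
repos = https://github.com/openldap/openldap

block
repos = https://github.com/LMDB/lmdb
"""
for definition in expand_blocks(lmdb):
    print(definition)
```

The returned list keeps the order of the blocks in the file, which is the order in which a requested version would be looked up.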

Another (fictional) example would be a project which switched from GitHub to Fossil and at the same time did a bit of reorganisation of the sources:

# nothing here, the two parts have nothing in common!
block
vcs = fossil
repos = https://project.org/src/project
block
vcs = git
repos = https://github.com/some/project
subtree = PROJECT

The not-forking tool will then obtain the sources from git for older versions, looking inside the "PROJECT" directory for them; and for later versions use fossil instead, and look at the top of the directory checked out (if a version is available on both, the fossil one will be preferred because it is listed first).

Note that if one knows at which exact version number things changed, it is also possible to use conditionals. However, the --list-versions option will not necessarily work correctly when using conditionals, while it works when using multiple blocks. If required, a version_filter can be added to one or more blocks to make particular versions come from a particular source.

git

The upstream sources are available via a public git repository; the following keys need to be present:

A software version can be identified by a generic git commit ID, or by a version string similar to the one described for the compare key, if the repository offers that as an option.

fossil

The upstream sources are available via a public fossil repository; the following keys need to be present:

A software version can be identified by a generic fossil artifact ID, or by a version string similar to the one described for the compare key, if the repository offers that as an option.

download

The upstream sources are released as published versions and downloaded directly; the following keys need to be present:

At the time of writing, the program uses file to figure out how to unpack the sources, and then tar, gunzip, etc. as necessary; a future version may allow control over the process if the program cannot figure out what to do with a particular download.

Modification definition file

This file contains instructions for modifying files, followed by the data that the instructions can use to make the modifications. The data may be patches or complete file contents, and the instructions are operations such as "patch" or "replace".

There can be zero or more modification definition files in the configuration directory; each file has a name ending in .mod and they are processed in lexicographic order according to the "C" locale (rather than the current locale, to guarantee consistent ordering). Note that only files are considered; if the configuration directory contains subdirectories, these are ignored, but files in there can be referenced by the .mod files.
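To make the ordering concrete, here is a small Python sketch that lists the .mod files of a configuration directory in "C" locale order, which compares names byte by byte. The function name mod_files_in_order is hypothetical and this is not the tool's own code.

```python
import os


def mod_files_in_order(config_dir):
    """Return the names of the .mod files in config_dir in "C" locale order."""
    with os.scandir(config_dir) as entries:
        names = [
            entry.name
            for entry in entries
            # Only plain files count; subdirectories are ignored even if
            # their name happens to end in .mod.
            if entry.is_file() and entry.name.endswith(".mod")
        ]
    # Sorting on the encoded bytes is independent of the current locale.
    return sorted(names, key=os.fsencode)
```

One consequence of byte ordering is that digits sort before uppercase letters, which in turn sort before lowercase letters, so a numeric prefix such as 10- sorts before 2-.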

The contents of each modification definition file are an initial part with a format similar to the Upstream definition file described above ("key = value" pairs, possibly with conditional blocks and conditions on the applicability of the whole file, which have a special format); this initial part ends with a line containing just dashes and the rest of the file, referred to as the "final part", is interpreted based on information from the initial part.
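The split between the initial part and the final part can be sketched like this in Python. The function name split_mod_file is hypothetical, and the sketch assumes that the first line consisting only of dashes is the separator, so that later lines such as the "---" header of a patch stay in the final part.

```python
import re


def split_mod_file(text):
    """Return (initial_lines, final_lines), split at the first line of just dashes."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if re.fullmatch(r"-+", line.strip()):
            # The separator itself belongs to neither part.
            return lines[:index], lines[index + 1:]
    # No separator found: everything is the initial part.
    return lines, []


initial, final = split_mod_file("method = replace\n--\nsrc/btree.c = files/btree.c\n")
print(initial)
print(final)
```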

The applicability conditions have the exact same format as the if which introduces a conditional block, without the word if; the overall effect is to ignore the whole file if the condition is false; refer to the discussion of conditionals above for the precise syntax and meaning.

One use of the applicability conditions is to indicate that some modifications are only necessary up to a particular version, because for example that modification has been accepted by upstream and is no longer necessary; or that a modification is only necessary on a particular operating system; another use of these conditions is to identify versions in which substantial upstream changes make it difficult to specify a modification which works for every possible version.

If a file is modified by more than one modification definition file, the standard ordering of the files determines the order in which the modifications are applied. This means that anything which replaces a file with a whole new one (as the "replace" method described below does) normally belongs in a file which is very early in the lexicographic order, as it would make no sense to put it at the end where it can undo any previous modifications.

The following key is currently understood:

Other keys are interpreted depending on the value of method.

the "patch" method

The final part of the modification definition file is in a format suitable for passing as standard input to the "patch" program; the following additional keys are understood in the initial part:

the "fragment patch" method

The final part of the modification definition file is a series of patches which will be applied to sections of files rather than whole files; this may make it easier to provide a patch working on many versions of upstream sources by replacing the simple context (a few lines before and after the part to be modified) by a potentially more complex processing (for example, finding a particular function, or an easily identifiable block of code).

The fragment-diff tool can generate these starting from an "old" and a "new" version of a file and optionally a set of regular expressions which determine how to split a source into fragments.

Since this method uses the "patch" program on each fragment, it also accepts the same options as the "patch" method described above.

the "replace" method

This method indicates that one or more files in the upstream must be completely replaced; the final part of the file contains one or more lines with format "old-file = new-file", where both are relative paths, the first relative to the root of the extracted upstream sources; the second path is relative to the configuration directory.
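A minimal Python sketch of what applying such a final part could look like is shown below. It is an assumption-laden illustration, not the tool's code: apply_replace is a hypothetical name, the two directory arguments stand for the extracted upstream sources and the configuration directory, and file names containing an equals sign are not handled.

```python
import shutil
from pathlib import Path


def apply_replace(final_lines, upstream_root, config_dir):
    """Copy each new file from config_dir over the matching file in upstream_root."""
    replaced = []
    for line in final_lines:
        if not line.strip():
            continue
        old, sep, new = line.partition("=")
        if not sep:
            raise ValueError(f"expected 'old-file = new-file', got {line!r}")
        # First path is relative to the upstream sources,
        # second path is relative to the configuration directory.
        target = Path(upstream_root) / old.strip()
        source = Path(config_dir) / new.strip()
        shutil.copyfile(source, target)
        replaced.append(target)
    return replaced
```

With the example configuration later in this document, the line "src/btree.c = files/btree.c" would copy files/btree.c from the configuration directory over src/btree.c in the extracted sources.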

There are no special options in the initial part of the modification definition file.

the "append" method

This method indicates that some extra text needs to be appended to an existing file; the final part is one or more blocks, separated by lines of dashes; the block starts with a file name (relative to the root of the extracted upstream sources) followed by the text to add; if a line containing just dashes needs to be added, prepend a single dash and space, for example to add the line "----" specify it as "- ----".
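The block structure and the dash escaping can be sketched in Python as follows. The name parse_append_blocks is hypothetical, and details the description leaves open are assumptions here: a separator is any line made only of dashes, and blank lines after the file name are kept as part of the text to append.

```python
import re


def parse_append_blocks(final_lines):
    """Return a list of (file_name, text_to_append) pairs from the final part."""
    blocks, current = [], []
    for line in final_lines:
        if re.fullmatch(r"-+", line):
            # A line of just dashes separates two blocks.
            if current:
                blocks.append(current)
            current = []
        else:
            current.append(line)
    if current:
        blocks.append(current)

    result = []
    for block in blocks:
        file_name, body = block[0].strip(), block[1:]
        # "- ----" is the escaped form of a literal "----" line.
        unescaped = [line[2:] if re.fullmatch(r"- -+", line) else line for line in body]
        result.append((file_name, "\n".join(unescaped) + "\n"))
    return result


print(parse_append_blocks(["README", "- ----", "extra"]))
```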

There are no special options in the initial part of the modification definition file.

the "sed" method

This method uses a sed-like set of replacements, with the final part of the file containing lines with format "file-glob: regular-expression = replacement" (the regular expression can contain spaces and equal signs if they are quoted with a backslash); the replacement is always done on the whole file at once.

There are no special options in the initial part of the modification definition file.

Example Configuration directory

This set of files obtains SQLite sources and replaces btree.c and btreeInt.h with the ones from sqlightning, applying a patch to vdbeaux.c and adding a line at the end of the (original) btree.h.

File upstream.conf:

vcs   = git
repos = https://github.com/sqlite/sqlite.git

File btree.mod:

method = replace
--
src/btree.c    = files/btree.c
src/btreeInt.h = files/btreeInt.h

File vdbeaux.mod:

method = patch
--
--- sqlite-git/src/vdbeaux.c    2020-02-17 19:53:07.030886721 +0100
+++ new/src/vdbeaux.c      2020-03-21 13:52:24.861586555 +0100
@@ -2778,7 +2778,7 @@
      for(i=0; i<db->nDb; i++){
        Btree *pBt = db->aDb[i].pBt;
        if( sqlite3BtreeIsInTrans(pBt) ){
-        char const *zFile = sqlite3BtreeGetJournalname(pBt);
+        char const *zFile = BackendGetJournal(pBt);
          if( zFile==0 ){
            continue;  /* Ignore TEMP and :memory: databases */
          }

File btree.h.mod:

method = append
--
src/btree.h

#include "lumo-btree-additions.h"

Files files/btree.c and files/btreeInt.h: the entire files with new contents.

A more complete example can be found in the LumoSQL directory "not-fork.d/sqlite" which tracks upstream updates from SQLite.

Not-forking tool

The tool directory contains a script, not-fork, which runs the not-forking mechanism on a directory. Usage is:

not-fork [OPTIONS] [NAME]...

where the following options are available:

If neither VERSION nor COMMIT_ID is specified, the default is the latest available version, if it can be determined, or else an error message. If more than one NAME is specified, VERSION and COMMIT_ID need to be provided before each NAME: the assumption is that different software projects use different version numbers.

If one or more NAMEs are specified, the tool will obtain the upstream sources as described in INPUT_DIRECTORY/NAME for each of the NAMEs specified, and attempt to apply all the required modifications; if that succeeds, OUTPUT_DIRECTORY/NAME will contain the modified sources ready to use; if that fails, an error message will explain the problem and if possible suggest corrective action (for example, if patch determines that a file has changed so much that it cannot figure out how to apply a patch supplied, the error message will indicate this and suggest obtaining a new patch for that version of the sources).

If no NAMEs are specified, the tool will process all subdirectories of INPUT_DIRECTORY. In this special case, any VERSION or COMMIT_ID specified will apply to all rather than just the name immediately following them.

The program will refuse to overwrite the output directory if it cannot determine that it has been created by a previous run and that files have not been modified since; in this case, delete the output directory completely, or rename it to something else, and run the program again. There is currently no option to override this safety feature.

The tool looks for a configuration file located at $HOME/.config/LumoSQL/not-fork.conf to read defaults; if the file exists and is readable, any non-comment, non-empty lines are processed before any command-line options with an implicit -- prepended and with spaces around the first = removed, if present: so for example a file containing:

cache = /var/cache/LumoSQL/not-fork

would change the default cache from .cache/LumoSQL/not-fork in the user's home directory to the above directory inside /var/cache; it can still be overridden by specifying -c/--cache on the command line.
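The translation from configuration lines to options can be sketched in Python as below. The name config_to_options is hypothetical, and treating lines that start with a hash as comments is an assumption, since the comment syntax of this file is not spelled out here.

```python
import re


def config_to_options(text):
    """Turn configuration lines into the equivalent long command-line options."""
    options = []
    for line in text.splitlines():
        stripped = line.strip()
        # Assumption: comments start with '#', as in the other configuration files.
        if not stripped or stripped.startswith("#"):
            continue
        # Remove the spaces around the first '=' only, then prepend "--".
        stripped = re.sub(r"\s*=\s*", "=", stripped, count=1)
        options.append("--" + stripped)
    return options


print(config_to_options("cache = /var/cache/LumoSQL/not-fork\n"))
```

Since these options are processed before the real command line, anything given explicitly on the command line can still override them.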

To help testing the tool, a special option --test-version=DIRECTORY can only appear in the configuration file, not the command line, and tells the tool to run the program and libraries found in that directory instead of itself: the directory is expected to be a working copy such as obtained from fossil.

We plan to add logging to the not-forking tool, in which all messages are written to a log file (under control of configuration), while the subset of messages selected by the verbosity setting will go to standard output; this will allow us to increase the amount of information provided and make it available if there is a processing error; however in the current version this is just planned, and not yet implemented.

The tool may need to access the network to obtain sources; this can be stopped in two ways:

There is a plan to add a time-based default if neither --update nor --no-update is specified (in the command line or in the configuration file): instead of always defaulting to --update, the tool would check the time of the last update, and default to --no-update if that time is "recent"; this is not yet implemented, and also we need to decide what "recent" actually means in this context.

Fragment-diff tool

This command-line tool can help generate files for the "fragment_patch" modification method; the generic usage is:

fragment-diff [OPTIONS] OLD NEW NAME [OLD NEW NAME]...

where OLD and NEW are the two files to compare, and NAME is the name which will be written in the fragment patch; so for example:

fragment-diff orig/vdbe.c new/vdbe.c src/vdbe.c orig/pragma.c new/pragma.c src/pragma.c

will compare each file in the orig directory with its counterpart in the new directory, and emit a fragment patch to convert the orig versions into the new ones; the patch itself will refer to the files as though they are found in the src directory (this command is actually what generated the file vdbe-changes.mod in LumoSQL).

The following options are currently accepted by the program:

The tool requires patterns to split the files into fragments; by default, if no patterns are provided, this will consider the whole file as a single fragment, and the output will be similar to the one produced by the standard "diff" program.

Patterns can be added by using -b to add all patterns from the program's own library, or -t to add them from a file; currently, the program's own library is empty, but there are plans to develop patterns for common cases like splitting C programs into functions. These two options can be repeated as many times as they are required.

To add patterns with -t just list regular expressions, one per line, in the file (comments starting with # and blank lines are ignored); a fragment starts on each line in the file which matches the pattern; each pattern must contain a captured sub-pattern which will be used to identify it if it occurs more than once: for example the pattern:

^((?:static\s+)?(?:void|int)\s+\w+)\b

will match lines like:

static void func1(int a, int b);
int func2(void);

and the captured sub-patterns will be:

static void func1
int func2

respectively: these will identify these two functions even though the pattern is likely to match many more lines in a C program source.
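How a pattern file splits a source into fragments can be sketched in Python. This is an illustration, not the tool's algorithm: split_fragments is a hypothetical name, lines before the first match are collected under the identifier None, and the pattern here uses \w+ for the function name so that the capture stops before the opening parenthesis.

```python
import re


def split_fragments(source, patterns):
    """Split source into (identifier, lines) fragments starting at matching lines."""
    compiled = [re.compile(pattern) for pattern in patterns]
    fragments = [(None, [])]  # text before the first matching line
    for line in source.splitlines():
        for regex in compiled:
            match = regex.search(line)
            if match:
                # A new fragment starts here, identified by the captured sub-pattern.
                fragments.append((match.group(1), []))
                break
        fragments[-1][1].append(line)
    if not fragments[0][1]:
        fragments.pop(0)
    return fragments


c_source = """\
#include <stdio.h>
static void func1(int a, int b)
{
}
int func2(void)
{
    return 0;
}
"""
pattern = r"^((?:static\s+)?(?:void|int)\s+\w+)\b"
for identifier, lines in split_fragments(c_source, [pattern]):
    print(identifier, len(lines))
```

Identifying fragments by the captured text rather than by position is what lets a fragment patch still find its target when other functions are added or moved in a later upstream version.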

Another tool, fragment-patch, can be used to apply the output of this tool directly, rather than as part of the extraction of upstream sources; call it using:

fragment-patch [OPTIONS] PATCH_FILE [PATCH_FILE]...

Updates all files mentioned in any of the PATCH_FILEs provided (note that this overwrites the original files, just like the standard "patch" program). Options are: