Notes Related to the New/Imagined LumoSQL Storage API
Storage API Bugs Kept in Tickets
- Remove sqlite3BtreeGetJournalname() from src/vdbeaux.c
- Commit re-ordered btree.h to sqlite trunk to help us stay sane
- Sqlite src/backup.c assert looks up internal BTree mutex
Analysis of sqlite3 structure linkage
These aren't specific bugs; we need a strategy first.
The sqlite3 structure is used for:
- The unlock-notify feature (arguments to sqlite3ConnectionBlocked() are database handles)
- Accessing the SQLITE_ReadUncommitted flag
- Invoking the busy-handler callback
- During sqlite3BtreeOpen(), to find the VFS to use
- Accessing the SQLITE_SharedCache flag (for setting it),
- To check the same B-Tree is not attached more than once in shared-cache mode,
- To link the B-Tree into the pointer-order list of shared-cache b-trees used by the same handle (used for mutexes).
- To determine if an in-memory sub-journal should be used.
- To know how many savepoints are open in BtreeBeginTrans().
- Many, many times, to assert() that the db mutex is held when the b-tree layer is accessed.
Backend API Selector Mechanism
Problem: need to have a way of selecting the lower-level store. Right now we have just two (native SQLite btree, and LMDB) and the way we select in this early design is by replacing btree.c to be one or the other. Even this implies that there is a common API with the higher layers, and an early goal is to formalise this API. drh says he is open to committing it depending on the details.
But in addition we have discussed:
- A backend API that includes a selector, so that (for example) a new table can be created in the user's choice of backend. In the trivial case, this would be a configuration parameter specifying the backend to be used for the next session. In the simplest useful case, this would allow reading data from one backend and writing it out to another for backup or other batch-type purposes.
- The ability to have multiple backends open for concurrent and asynchronous access. In the first version this still needs to be largely invisible to any higher layer, which involves defining what the term "selector" means.
- An API for more than key-value stores; to start with, however, one that handles only local key-value stores will be sufficient. This suggests that the architecture may extend to a backend type selector, where all local key-value stores are considered to be very similar and other kinds of backend have somewhat different APIs.
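The selector idea above could be sketched as a table of named backends looked up from a configuration parameter. This is purely illustrative; none of these names exist in SQLite or LumoSQL today.

```c
#include <string.h>

/* Hypothetical backend descriptor; the names and signatures here are
 * assumptions for illustration, not a real LumoSQL API. */
typedef struct LumoBackend {
    const char *name;                 /* e.g. "btree", "lmdb" */
    int (*open)(const char *path);    /* open a store at path */
    int (*close)(void);               /* close the store */
} LumoBackend;

static int btreeOpen(const char *path) { (void)path; return 0; }
static int btreeClose(void) { return 0; }
static int lmdbOpen(const char *path) { (void)path; return 0; }
static int lmdbClose(void) { return 0; }

static const LumoBackend lumoBackends[] = {
    { "btree", btreeOpen, btreeClose },
    { "lmdb",  lmdbOpen,  lmdbClose  },
};

/* Select a backend by name, e.g. from a configuration parameter
 * naming the backend for the next session. */
const LumoBackend *lumoBackendFind(const char *name) {
    size_t i;
    for (i = 0; i < sizeof(lumoBackends)/sizeof(lumoBackends[0]); i++) {
        if (strcmp(lumoBackends[i].name, name) == 0) return &lumoBackends[i];
    }
    return 0; /* unknown backend */
}
```

A batch copy between backends would then amount to opening one entry of the table for reading and another for writing.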
Key Size and Key Handling
In this sqlightning bug from 2015 ("indices get corrupted if keys are too big"), hyc talks about the key-size issue as follows:
The hash approach is obviously a dead end. The new approach will be to adopt an OpenLDAP-style nested index. Long keys will be broken into chunks and chained together. This will handle arbitrarily long keys and will preserve the ordering as well.
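The chunking idea can be sketched as follows. This is an assumption about how chunking might look, not the OpenLDAP or sqlightning implementation; the 511-byte limit is LMDB's default maximum key size.

```c
#include <string.h>

/* Hypothetical sketch: split a long key into fixed-size chunks for a
 * store with a maximum key size. Chunks would be chained in order
 * (e.g. keyed by (key-id, chunk-index)), preserving the lexicographic
 * ordering of the original keys. */
#define LUMO_MAX_KEY 511   /* LMDB's default key-size limit */

/* Number of chunks needed for a key of n bytes. */
size_t lumoKeyChunks(size_t n) {
    return n == 0 ? 1 : (n + LUMO_MAX_KEY - 1) / LUMO_MAX_KEY;
}

/* Copy chunk i of the key into out; returns the chunk length. */
size_t lumoKeyChunk(const unsigned char *key, size_t n, size_t i,
                    unsigned char *out) {
    size_t off = i * LUMO_MAX_KEY;
    size_t len = (n - off > LUMO_MAX_KEY) ? LUMO_MAX_KEY : n - off;
    memcpy(out, key + off, len);
    return len;
}
```

Because chunks compare in the same order as the bytes of the original key, a cursor walking the chunked index still sees keys in the original order.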
We have had other thoughts (which need to be transferred here).
OP_Savepoint is not called on SQL BEGIN
OP_Savepoint in vdbe.c calls sqlite3BtreeSavepoint(), so that needs to be provided by all backends - or replaced with backend-independent code.
It's a bit weird, though: it seems to call the btree savepoint function only on release or rollback, not on begin.
How does it know what to roll back to...?
btree.h include chain needs to be changed
The following needs to be fixed as part of splitting out the backend.
These all include btreeInt.h:
backup.c btmutex.c btree.c notify.c
btreeInt.h includes btree.h
sqliteInt.h includes btree.h
Btree-specific opcode in vdbe.c calls routine in pager.c (PRAGMA journal_mode)
In pager.c/sqlite3PagerSetJournalMode()
/* This routine is only called from the OP_JournalMode opcode, and
** the logic there will never allow a temporary file to be changed
** to WAL mode.
*/
Not sure what sqlightning does with this. I think PRAGMA journal_mode should be a NOP.
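A NOP here could look something like the sketch below: a backend that does not use SQLite's pager journal (e.g. LMDB) ignores the requested mode and always reports what it actually does. The names are assumptions for illustration.

```c
/* Hypothetical journal-mode shim for a backend with its own
 * durability mechanism; not a real SQLite or LumoSQL interface. */
typedef enum {
    LUMO_JOURNAL_OFF = 0,   /* backend manages durability itself */
    LUMO_JOURNAL_WAL = 1
} LumoJournalMode;

/* Ignore the requested mode and report the backend's fixed mode,
 * so PRAGMA journal_mode becomes a no-op for this backend. */
LumoJournalMode lumoBackendSetJournalMode(LumoJournalMode requested) {
    (void)requested;
    return LUMO_JOURNAL_OFF;
}
```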
File Structure in New Backend API
In order to support multiple backends, LumoSQL needs to have a more general way of matching capabilities to what is available, whether a superset or a subset of what SQLite currently does. This needs to be done in such a way that it remains easy to track upstream SQLite.
The SQLite architecture has the SQL virtual machine in the middle of everything:
vdbeapi.c
has all the functions called by the parser
vdbe.c
is the implementation of the virtual machine, and it is
from here that calls are made into btree.c
All changes to SQLite storage code will be in vdbe.c, to insert an API shim layer for arbitrary backends. All BtreeXX function calls will be replaced with backendXX calls.
lumo-backend.c
will contain:
- a switch between different backends
- a virtual method table of function calls that can be stacked, for layering generic functionality on any backend that needs it, as follows
lumo-index-handler.c
is for backends that need help with index
and/or key handling. For example, some backends, such as LMDB, cannot
have arbitrary-length keys; RocksDB and others do not suffer from this.
lumo-transaction-handler.c
is for backends that do not have full
transaction support. RocksDB, for example, is not MVCC, and this
layer will add that. Similarly, this is where we can implement
functionality to upgrade read-only transactions to read-write with a
commit counter.
lumo-crypto.c
provides encryption services transparently to backends,
depending on a decision made in lumo-backend.c, and will cover
everything except backend-specific metadata. Full-disk encryption of
everything has to happen at a much lower layer, such as SQLite's idea
of a VFS.
lumo-vfs.c
provides VFS services to backends, and is invoked by
backends. lumo-vfs.c may call lumo-crypto for full file encryption,
including backend metadata.
Backend implementations will be in files such as backend-lmdb.c, backend-btree.c, backend-mdb.c, etc.
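The stackable method table mentioned above might be sketched like this: each layer (e.g. a crypto layer) holds a pointer to the layer below and delegates after doing its own work. All names here are assumptions for illustration, not existing LumoSQL code.

```c
#include <stddef.h>

/* Hypothetical stackable method table: a generic layer wraps another
 * backend's methods and delegates downward. */
typedef struct LumoMethods LumoMethods;
struct LumoMethods {
    int (*put)(LumoMethods *self, const void *key, size_t klen,
               const void *val, size_t vlen);
    LumoMethods *next;   /* the layer below, if this is a wrapper */
};

/* Base backend: here it just counts calls so the sketch is testable. */
static int basePuts = 0;
static int basePut(LumoMethods *self, const void *key, size_t klen,
                   const void *val, size_t vlen) {
    (void)self; (void)key; (void)klen; (void)val; (void)vlen;
    basePuts++;
    return 0;
}

/* Generic layer (e.g. lumo-crypto): transform, then delegate down. */
static int cryptoPut(LumoMethods *self, const void *key, size_t klen,
                     const void *val, size_t vlen) {
    /* a real layer would encrypt val into a scratch buffer here */
    return self->next->put(self->next, key, klen, val, vlen);
}

static LumoMethods baseBackend = { basePut, 0 };
static LumoMethods cryptoLayer = { cryptoPut, &baseBackend };
```

Calling `cryptoLayer.put(&cryptoLayer, ...)` runs the crypto layer and then the base backend, which is the "stacking" behaviour lumo-backend.c would arrange.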
MVCC-lite in lumo-transaction-handler.c works as follows:
- We add version numbers to all values stored (either as part of the key, or as sorted values with the same key; this needs to be decided). Versions can be "incomplete", meaning there is a read-write transaction updating them which hasn't committed yet, or "complete", meaning that there are no read-write transactions on that particular version.
- A read-only transaction notes the latest "complete" version number at the time of the "start"; further accesses within the transaction will select the newest values for each key which are at least as old as that version. This means that in-progress, incomplete transactions are not visible to readers, and readers see a consistent state.
- Upgrading to read-write will check if a new (complete or incomplete) version has appeared since the read-only transaction was started; if yes, fail(); if not, create a new version but mark it as "incomplete" so a new read-only transaction will not see it yet and no new read-write transactions can start until we commit or rollback. Later we may want to go ahead and create a new version anyway, with the commit modified to check for collisions and fail/rollback if so; this will increase concurrency at the cost of rather more complexity and possibly extra commit overheads in case of collisions - but I expect no overhead in the absence of collisions.
- Any updates will of course use the new version number from the transaction to store new values.
- Rollback is easy: delete the version and remove any trace of it (by deleting any values which have that version number); this could take some time for a big transaction, but the idea is that one optimises for commit, not for rollback...
- Commit is also easy: keep the version and mark it complete so it becomes visible - and maybe see if we can prune any old versions (yes, that gives us a versioning backend for free... when I realised this I decided this needs to be done!)
- The "vacuum" operation can be extended to purge old versions if desired; I'm expanding on this below.
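The visibility and upgrade rules above can be sketched in a few lines. This is a minimal sketch of the MVCC-lite rules as described, with hypothetical names; the real storage layout (version in key vs. sorted duplicate values) is still undecided.

```c
/* Hypothetical sketch of the MVCC-lite rules described above.
 * A stored value carries a version number and a complete flag. */
typedef struct {
    unsigned version;
    int complete;   /* 1 once the writing transaction has committed */
} LumoVersion;

/* A reader that started at snapshot S sees a value iff the value is
 * complete and its version is at least as old as S, so incomplete
 * (in-progress) transactions are invisible and reads are consistent. */
int lumoVisible(LumoVersion v, unsigned snapshot) {
    return v.complete && v.version <= snapshot;
}

/* Upgrading a read-only transaction (snapshot S) to read-write fails
 * if any version newer than S has appeared, complete or incomplete. */
int lumoCanUpgrade(unsigned newest_version, unsigned snapshot) {
    return newest_version <= snapshot;
}
```

Rollback then deletes every value tagged with the aborted version, and commit simply flips the version's complete flag, matching the "optimise for commit, not rollback" point above.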