Lumosql: Native B-tree vs LMDB: Early results on which is better for what purpose

Summary. These are early comments comparing the LMDB 0.9x backend for LumoSQL with the native SQLite backend. Optimisation and input from experts may change this, as will better measurements and the LMDBv1.0 backend with encryption. These are very stable, highly optimised codebases giving pretty consistent timings for the same test across versions.

LMDB provides fast keyed access, crash safety, lock-free concurrent reads, fast random access and fast appending. SQLite native B-tree remains ahead on sequential scans and bulk writes that are not appends. At the SQL level, LMDB beats SQLite btree at the same SQLite version by an average of 1.7x on the speedtest1 suite, consistently on datasizes from 2M to 30M rows.

Applications don't work on averages, so individual test results for SQLite 3.53.1 backed by LMDB 0.9.34 follow... LMDB-backed wins 5.9x on indexed SELECT, 4.8x on INSERT-FROM-SELECT, 5x on big-INSERT-after-big-DELETE, 8x on DROP TABLE, 20x on read-to-write txn upgrade. LMDB-backed loses on database create (0.6x), CREATE INDEX (0.77x), and unindexed full-table UPDATE (0.58x). Earlier and later versions of both give similar results. The wins concentrate in keyed access and bulk operations (architectural: mmap'd B+tree traversal beats a page-cached one); the losses concentrate in operations that write many small pages (LMDB's copy-on-write amplifies).

The information here comes from doing massive nXm matrix builds and tests. The cross product testing is extremely useful, but not helpful for finding some magic ultimate combination, because there is no such combination. Measurements given here will hopefully improve a lot as more people look at the source code (note that /tool/benchmark-filter will merge benchmark results, which is handy for sharing. It also does summaries and some statistics.) These illustrative benchmark results are in loft/large-matrix-runs/benchmarks.sqlite.

Keyed and random access — LMDB wins. A single lookup or short keyed range is a focused root-to-leaf descent, and LMDB's memory-mapped pages mean that descent traverses already-mapped memory with no page cache layer in between, so it's very fast. count(*) is a special win, answered from LMDB's per-subdatabase entry count without walking rows at all and is approximately 11x faster than SQLite's native backend, which has to scan.

Large sequential scans — LMDB loses. LMDB's read traversal is ~225 instructions per row against native's ~57, because every row visit re-walks the node structure. Native SQLite has a packed page with everything it needs right there. This relates to the data structure, not how LumoSQL uses the data structure.

Writes — LMDB is slower due to its design LMDB's copy-on-write machinery is _mdb_cursor_put, mdb_page_touch, mdb_node_add, and mdb_mid2l_search. COW is fundamental to LMDB's crash-safe, lock-free-reader design but the price is slower writes. We did find one optimisation for appends (ie monotonic rowid inserts, the common bulk-load and log-append pattern). SQLite passes a hint called BTREE_APPEND, and if it sees this hint the LumoSQL btree.c shim uses LMDB's MDB_APPEND to skip the insert-position search, so this kind of write doesn't have a penalty. SQLite's hint can be wrong so we have to detect that too, falling back to the LMDB normal behaviour. If people with deep expertise of SQLite and/or LMDB internals can do some profiling there might be more of these.

LMDB is slow for apparent reads Clean reads never touch LMDB's COW. But two cases look like reads but actually perform writes, and so COW is used. The first case is transient tables (eg CTE, sort, DISTINCT, or maybe ORDER BY ... LIMIT, and undoubtedly other dusty corners of SQLite). We implement these scratch rows in an LMDB temp environment, so reading through a CTE is in fact inserting every row into a COW B+tree. Amateur-hour callgrind work suggested the memset overhead of zeroing pages was 40% in a pathological example. The second case is optimistic read-->write transaction upgrade, when an SQL statement starts reading then writes, (it seems a read cursor can held open across an UPDATE, but further investigation needed) and this promotes the read txn to a write txn. This second case could theoretically be avoided by moving the temporary environment off LMDB into a second cursor backend, but we are not even thinking about it at this stage.

Never mind ROWSUM. The Rowsum checksum code (which is per-row, not to be confused with the SQLite checksum VFS which is per-page, or the LMDBv1.0 checksum which is also per-page) has roughly equal absolute results on all backends. You might think that is because it is applied at a higher level and the backends never see it, which is true but not useful here. At SQLite 3.53.1, ds=10, rowsum-on/rowsum-off is 1.85x on stock and 2.44x on LMDB. The LMDB ratio looks worse but the baseline is faster, so absolute LMDB rowsum-on time (175 s) is still well below stock rowsum-on (231 s). This is an ideal place for optimisation work. Even just moving from SHA3_256 to BLAKE3_256 reduced the rowsum overhead by about 60%.

Rowsum has a regression at SQLite 3.35. Pre-3.35 (3.30 through 3.34.1) rowsum-on adds nothing measurable: the on/off ratio is 0.96-1.04x across every datasize from 2M to 30M rows. From 3.35.0 onwards the rowsum-on/rowsum-off ratio jumps to 1.5-1.7x at small data and climbs with datasize, reaching 2.4-2.6x at 30M rows on stock btree. Rowsum-off shows no cliff (3.34.1 and 3.35.0 are within noise), so the regression is in how rowsum interacts with whatever changed in 3.35. Per-test on stock btree, 3.34.1 to 3.35.0 rowsum-on: 1000 SELECTs without an index goes 29s --> 88s (3x); 10000 UPDATEs without an index goes 2s --> 16s (8x); INSERT-FROM-SELECT goes 2.9s --> 6.1s (2x). 3.53.1 partly recovers (1.85x at ds=10 vs 2.03x at 3.35.0) but the cliff is not closed. The LMDB shim's own btree.c does not take this regression: 3.36.0+lmdb-0.9.34 to 3.53.1+lmdb-0.9.34 rowsum-on drifts ~3% across the range.

LMDB version stability. Across 0.9.25 to 0.9.34, at fixed 3.36.0 and ds=10, rowsum-off averages 66.8 to 69.9 seconds: 5% spread, statistically indistinguishable. Rowsum-on is similar at 169.5 to 176.2 seconds (4%). Pick the latest LMDB for fixes and maintenance, not for speed.

No shared cache sqlite3_enable_shared_cache() returns 0, so LMDB-backed connections never share a cache. SQLite upstream has long deprecated shared cache, so while the source does mention it is unsupported, it is also not something you are ever likely to want.

PRAGMA integrity_check This is delegated to LMDB, no part of the SQLite integrity check is replicated or implemented in LumoSQL.