Skip to content

Storage layer

Single SQLite file per project: <workspace>/<slug>.agl.

journal_mode = DELETE (a rollback journal that exists only during a write, leaving no -wal/-shm siblings), synchronous = NORMAL, foreign_keys = ON, and a 5 s busy_timeout. One writer at a time; each worker process holds its own connection (opened lazily on first message) and serializes on the busy-timeout.

Module map

lib/storage/
__init__.py re-exports
db.py open_db / ensure_schema / PRAGMAs
schema/0001_initial.sql … 0005_backfill_ocr_stale.sql (applied in order)
repo.py ProjectRepo / PipelineRepo / CalibrationRepo /
ImageRepo / ThumbRepo / ScanRepo / NodeRepo /
BranchRepo / OcrRepo / DebugRepo
persister.py encode_image / make_thumb / Persister

Tables (overview)

TablePurpose
projectSingleton: name, slug, schema_version, notes.
pipeline_versionsFrozen YAML snapshots. Active row is current pipeline. Reproducibility.
calibrationsFrozen camera-calibration snapshots. Active row is current.
imagesEncoded JPG/PNG blob + width/height/dpi/type. sha256-deduped.
thumbsPer-image rasterized thumbnail (one per max_dim, e.g. 256).
scansOne row per capture root (webcam frame or PDF page). Soft-delete via deleted_at.
nodesOne row per pipeline-step output. Self-FK forms a tree.
branchesPer-branch user-selectable output. Independent step-back/forward per branch. Soft-delete via trashed_at.
ocr_runsOne row per OCR pass over a branch: engine, languages, status, result_json, timestamps. is_stale flags a result that no longer matches the branch’s current node.
debug_artifactsOptional debug images attached to a node (e.g. PageDewarper span overlays).

Node tree shape

flowchart TD
    Scan["scan (table row)"]
    N0["nodes(step_idx=0, depth=0)<br/>[raw capture]"]
    N1["nodes(step_idx=1, depth=1)<br/>[first pipeline step]"]
    N2["nodes(step_idx=2, depth=2)<br/>[branch point — PageDetector]"]
    A["nodes(step_idx=3, …)<br/>…leaf A"]
    B["nodes(step_idx=3, …)<br/>…leaf B"]
    Chosen["branches.chosen_node_id<br/>(per branch)"]
    Scan -->|root_node_id| N0
    N0 --> N1 --> N2
    N2 -->|"branch_label='A'"| A
    N2 -->|"branch_label='B'"| B
    A --> Chosen
    B --> Chosen

Invariants:

  • step_idx=0 is the raw root for every scan.
  • parent_id IS NULL only for raw roots.
  • branch_label is set only on nodes that are the direct output of a branch-emitting processor (e.g. PageDetector). Linear-step nodes inherit their owning branch context implicitly via ancestry.
  • is_leaf maintained by trigger nodes_set_parent_not_leaf.
  • is_branch_point flipped to 1 when a child node is inserted under it AND the parent has >1 children (marked explicitly by the chain worker).

Branches table

A branches row exists per terminal branch:

ColumnMeaning
scan_idOwning scan
branch_path"" if no split, otherwise "A", "B", "A.1"
terminal_node_idDeepest node reached by the original pipeline run
chosen_node_idWhich node represents this branch in exports/UI
trashed_atSoft-delete timestamp; non-null hides the branch from the gallery/table/grid views (per-page trash, finer than scan-level deleted_at)

Lifecycle:

  • Pipeline finishes a branch → BranchRepo.upsert(scan_id, branch_path, terminal). Both terminal_node_id and chosen_node_id set to the deepest node.
  • User clicks “back arrow” → step_back. Moves chosen_node_id one node toward root. Refuses to cross a branch point with siblings.
  • User clicks “forward arrow” → step_forward. Walks one node toward terminal_node_id.
  • Reset → chosen_node_id = terminal_node_id.

Sibling branches are not affected by step_back/step_forward on a peer branch.

Image + thumb dedup

Image bytes hashed (sha256). Re-inserting identical bytes returns the same images.id. Pipeline steps that are no-ops therefore don’t bloat storage; the node row is cheap (~200 bytes), the image is shared.

Thumbs are keyed by (image_id, max_dim). Default max_dim=256 (JPEG q=80).

PRAGMAs at open

PRAGMA journal_mode = DELETE; -- journal only exists mid-transaction:
-- no permanent -journal/-wal/-shm sidecar
-- next to the .agl file
PRAGMA synchronous = NORMAL;
PRAGMA foreign_keys = ON;
PRAGMA temp_store = MEMORY;
PRAGMA cache_size = -64000; -- 64 MiB
PRAGMA mmap_size = 268435456; -- 256 MiB
PRAGMA busy_timeout = 5000;

Query cookbook

-- 1. Full subtree for a scan
WITH RECURSIVE t(id) AS (
SELECT id FROM nodes WHERE id = (SELECT root_node_id FROM scans WHERE id = ?)
UNION ALL
SELECT n.id FROM nodes n JOIN t ON n.parent_id = t.id
)
SELECT n.* FROM nodes n JOIN t ON n.id = t.id ORDER BY n.depth, n.id;
-- 2. Project export set (post-user-override)
SELECT b.id, b.branch_path, n.filestem, i.format, i.width, i.height
FROM branches b
JOIN nodes n ON n.id = b.chosen_node_id
JOIN images i ON i.id = n.image_id
JOIN scans s ON s.id = b.scan_id
WHERE s.deleted_at IS NULL
ORDER BY s.idx, b.branch_path;
-- 3. All nodes of a step
SELECT * FROM nodes WHERE step_name = '05_pages_bw';
-- 4. Re-run a branch from a node (delete subtree)
WITH RECURSIVE sub(id) AS (
SELECT id FROM nodes WHERE id = ?
UNION ALL
SELECT n.id FROM nodes n JOIN sub ON n.parent_id = sub.id
)
DELETE FROM nodes WHERE id IN (SELECT id FROM sub) AND id != ?;

Blob-size caveat

At ~5–15 MB per page × ~8 pipeline steps, multi-thousand-page projects can reach 50–100 GB. SQLite handles up to ~140 TB in theory but row-blob fetching past several GB starts to feel sluggish. If a real project crosses ~50 GB, consider an external blob store (path column on images plus filesystem-level storage).