Storage layer

Mondher's persistence layer is a small set of async traits — one per domain slice — plus production implementations against PostgreSQL, S3-compatible blob storage, and Redis. The same traits are implemented by in-memory backends for tests.

This page documents the storage layer's shape, conventions, and operational characteristics. Source lives in crates/mondher-storage/.

Traits

Five traits cover everything persistent in Mondher today:

| Trait | What it persists | Production backend |
|---|---|---|
| UserStore | User accounts | PostgreSQL users table |
| WorkspaceStore | Workspaces and analyses | PostgreSQL workspaces + analyses |
| AnnotationStore | Annotations and threaded comments | PostgreSQL annotations + comments |
| ContextBlob | Raw context bytes keyed by hash | S3 / MinIO bucket |
| LatticeCache | Computed lattices keyed by hash | Redis with TTL |

Every method is async fn via the async_trait macro so the traits remain object-safe and can be used as Arc<dyn TraitName>. Errors are returned as StorageResult<T> — a Result<T, StorageError> alias.
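To make the object-safety point concrete, here is a minimal std-only sketch of roughly what the async_trait macro produces: each async method becomes a plain method returning a boxed future, which is what lets the trait sit behind Arc<dyn ...>. The names UserLookup, InMemoryLookup, BoxFuture and block_on are hypothetical and exist only for this illustration; the real traits in crates/mondher-storage/ use the macro and may have different methods and signatures.

```rust
use std::collections::HashMap;
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

#[derive(Debug, PartialEq)]
pub enum StorageError {
    NotFound(String),
    Conflict(String),
    Backend(String),
    Unavailable(String),
}

pub type StorageResult<T> = Result<T, StorageError>;

// A boxed, type-erased future: approximately what `#[async_trait]`
// turns the return type of an `async fn` into.
pub type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

// Hypothetical trait, written by hand instead of with the macro.
pub trait UserLookup: Send + Sync {
    fn get<'a>(&'a self, id: u64) -> BoxFuture<'a, StorageResult<String>>;
}

// Hypothetical in-memory backend.
pub struct InMemoryLookup {
    pub rows: Mutex<HashMap<u64, String>>,
}

impl UserLookup for InMemoryLookup {
    fn get<'a>(&'a self, id: u64) -> BoxFuture<'a, StorageResult<String>> {
        Box::pin(async move {
            self.rows
                .lock()
                .unwrap()
                .get(&id)
                .cloned()
                .ok_or_else(|| StorageError::NotFound(format!("user {id}")))
        })
    }
}

// Tiny executor for the sketch: polls in a loop with a waker that does
// nothing. Only suitable for futures that are ready immediately.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

pub fn block_on<T>(mut fut: BoxFuture<'_, T>) -> T {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}
```

Because get returns a concrete boxed type rather than an opaque impl Future, an Arc<dyn UserLookup> can be built and called through dynamic dispatch; the cost is one heap allocation per call.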

Errors

StorageError is deliberately small: four tuple variants, one for each outcome callers need to distinguish.

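pub enum StorageError {
    // Expected: the requested record does not exist (404 in an API).
    NotFound(String),
    // Expected: the write collides with existing state (409 in an API).
    Conflict(String),
    // Unexpected: the backend reported a failure (500-class).
    Backend(String),
    // Unexpected: the backend could not be reached or answered too late (500-class).
    Unavailable(String),
}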

NotFound and Conflict are expected — callers should match on them and respond accordingly (404 vs 409 in an API). Backend and Unavailable are unexpected — they propagate as 500-class errors.

Driver-specific errors (sqlx, AWS SDK, redis) are mapped to these four variants at the backend boundary. A consumer never sees a sqlx::Error or an aws_sdk_s3::Error; they see only StorageError.
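A minimal sketch of both halves of this contract, under stated assumptions: std::io::Error stands in for a driver error (the real backends map sqlx, AWS SDK and redis errors), and http_status is a hypothetical helper, not part of the crate. The split of Backend to 500 and Unavailable to 503 is also an assumption; this page only says both are 500-class.

```rust
use std::io;

#[derive(Debug, PartialEq)]
pub enum StorageError {
    NotFound(String),
    Conflict(String),
    Backend(String),
    Unavailable(String),
}

// Boundary mapping: the driver error type stops here.
// `io::Error` is only a stand-in for a real driver error.
impl From<io::Error> for StorageError {
    fn from(err: io::Error) -> Self {
        let msg = err.to_string();
        match err.kind() {
            io::ErrorKind::NotFound => StorageError::NotFound(msg),
            io::ErrorKind::AlreadyExists => StorageError::Conflict(msg),
            io::ErrorKind::TimedOut | io::ErrorKind::ConnectionRefused => {
                StorageError::Unavailable(msg)
            }
            _ => StorageError::Backend(msg),
        }
    }
}

// Hypothetical helper: how an API layer might pick a status code.
pub fn http_status(err: &StorageError) -> u16 {
    match err {
        StorageError::NotFound(_) => 404,
        StorageError::Conflict(_) => 409,
        StorageError::Backend(_) => 500,
        // Assumption: "temporarily unreachable" is reported as 503.
        StorageError::Unavailable(_) => 503,
    }
}
```

The match in http_status is exhaustive on purpose: adding a fifth variant would fail to compile until every caller decides how to handle it.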

Composition

Storage is the composition root: one struct holding Arc<dyn ...> for each trait. Two constructors:

  • Storage::from_env(): reads DATABASE_URL, REDIS_URL, MINIO_URL etc. and builds production backends. Uses StorageConfig::from_env() for pool sizing and timeouts.
  • Storage::in_memory(): builds an all-in-memory stack with no external services. Use for unit tests and CI.

// Production binary
let storage = Storage::from_env().await?;

// Tests
let storage = Storage::in_memory();

The Arc<dyn ...> indirection means consumers can be tested by handing them mock implementations of each trait. The HTTP API (Days 113+) holds Arc<Storage> in its app state and clones it into each handler.

Configuration

StorageConfig exposes seven tunables, all settable via environment variables. Defaults are sized for the bundled docker compose stack on a developer laptop.

| Setting | Default | Env var |
|---|---|---|
| Postgres max connections | 10 | MONDHER_PG_MAX_CONNECTIONS |
| Postgres acquire timeout | 5 s | MONDHER_PG_ACQUIRE_TIMEOUT_MS |
| Postgres connection max lifetime | 30 min | MONDHER_PG_MAX_LIFETIME_SECS |
| Postgres idle timeout | 10 min | MONDHER_PG_IDLE_TIMEOUT_SECS |
| S3 call timeout | 10 s | MONDHER_S3_CALL_TIMEOUT_MS |
| Redis call timeout | 2 s | MONDHER_REDIS_CALL_TIMEOUT_MS |
| Postgres max retries | 3 | MONDHER_PG_MAX_RETRIES |
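As an illustration of how one of the *_MS settings could be read, the sketch below parses a millisecond value and falls back to the default. The helpers millis_or and env_millis are hypothetical, and the silent fallback on an unparsable value is an assumption; the real StorageConfig::from_env() may reject bad input instead.

```rust
use std::env;
use std::time::Duration;

// Parses a millisecond count. Falls back to `default` when the value is
// missing or not a non-negative integer (assumed behaviour).
pub fn millis_or(raw: Option<&str>, default: Duration) -> Duration {
    raw.and_then(|s| s.trim().parse::<u64>().ok())
        .map(Duration::from_millis)
        .unwrap_or(default)
}

// Reads one timeout from the environment, e.g.
// env_millis("MONDHER_S3_CALL_TIMEOUT_MS", Duration::from_secs(10)).
pub fn env_millis(var: &str, default: Duration) -> Duration {
    millis_or(env::var(var).ok().as_deref(), default)
}
```

Keeping the parsing in a function that takes the raw string makes it testable without touching the process environment.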

The defaults are conservative; production tuning depends on workload and infrastructure. For example, if the Postgres server allows M total connections and N application instances run behind a load balancer, set MONDHER_PG_MAX_CONNECTIONS on each instance to somewhat less than M/N, so that the pools together leave headroom for migrations and one-off scripts.
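The sizing rule can be worked through as arithmetic. The function below is a hypothetical helper, not part of the crate: it reserves a fixed number of connections for migrations and scripts first, then divides the remainder across instances.

```rust
// Per-instance value for MONDHER_PG_MAX_CONNECTIONS.
// `server_max` is the total the Postgres server allows (M),
// `instances` is the number of application instances (N),
// `reserved` is headroom kept free for migrations and one-off scripts.
pub fn per_instance_pool(server_max: u32, reserved: u32, instances: u32) -> u32 {
    if instances == 0 {
        return 0;
    }
    server_max.saturating_sub(reserved) / instances
}
```

With 100 server connections, 10 reserved and 4 instances, each instance gets 22, so the pools can hold at most 88 connections between them.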

Retries and timeouts

Two helpers wrap storage operations with operational guarantees:

with_retry

Retries transient PostgreSQL errors with bounded exponential backoff. Transient means:

  • StorageError::Unavailable — any.
  • StorageError::Backend containing one of these SQLSTATEs:
    • 40001 serialization failure
    • 40P01 deadlock detected
    • 08006 connection failure
    • 08000 connection exception (generic)
    • 57P03 cannot connect now (server starting)
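The classification above can be sketched as a single predicate. This assumes the SQLSTATE code appears verbatim somewhere in the Backend message, which is how the list reads; is_transient and TRANSIENT_SQLSTATES are hypothetical names, and the real check may be structured differently.

```rust
#[derive(Debug, PartialEq)]
pub enum StorageError {
    NotFound(String),
    Conflict(String),
    Backend(String),
    Unavailable(String),
}

// SQLSTATE codes treated as transient, as listed in this section.
const TRANSIENT_SQLSTATES: [&str; 5] = ["40001", "40P01", "08006", "08000", "57P03"];

pub fn is_transient(err: &StorageError) -> bool {
    match err {
        // Any Unavailable error is worth retrying.
        StorageError::Unavailable(_) => true,
        // Assumption: the SQLSTATE is embedded in the message text.
        StorageError::Backend(msg) => TRANSIENT_SQLSTATES
            .iter()
            .any(|code| msg.contains(*code)),
        StorageError::NotFound(_) | StorageError::Conflict(_) => false,
    }
}
```

Substring matching is simple but loose: a message that merely mentions one of these codes would also count as transient, so a structured code field would be more precise if the error type carried one.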

Non-transient errors (NotFound, Conflict, malformed input) are returned immediately. The default budget is 3 attempts (initial + 2 retries), capped at 1 second of backoff per attempt.

let user = with_retry(3, || storage.users.get(id)).await?;
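For a sense of what "bounded exponential backoff" means in numbers, here is a sketch of a delay schedule. The 1-second cap comes from this page; the 100 ms base delay and the doubling factor are assumptions, and backoff_delay and worst_case_backoff are hypothetical names.

```rust
use std::time::Duration;

const BASE: Duration = Duration::from_millis(100); // assumed base delay
const CAP: Duration = Duration::from_secs(1); // cap stated in this section

// Delay before retry number `retry` (0 = first retry): BASE * 2^retry,
// never more than CAP.
pub fn backoff_delay(retry: u32) -> Duration {
    let factor = 1u32 << retry.min(20); // clamp the shift to avoid overflow
    BASE.saturating_mul(factor).min(CAP)
}

// Total time spent sleeping if every attempt fails.
// `attempts` counts the initial try, so there are `attempts - 1` sleeps.
pub fn worst_case_backoff(attempts: u32) -> Duration {
    (0..attempts.saturating_sub(1)).map(backoff_delay).sum()
}
```

Under these assumptions a 3-attempt budget sleeps for 100 ms and then 200 ms, 300 ms in total, before the final error is returned.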

with_timeout

Wraps any async operation with a hard deadline. Operations that don't complete within the budget return StorageError::Unavailable.

let user = with_timeout(Duration::from_secs(5), storage.users.get(id))
    .await?;

Typical HTTP handlers allocate part of their request budget to storage: "I have 30 seconds total; storage gets 5 seconds before I 503."

The retry and timeout helpers compose; wrap a retry inside a timeout so retries can't cumulatively exceed the deadline.
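The reason for putting the deadline on the outside can be shown with a synchronous stand-in. This is not the crate's with_retry or with_timeout: retry_before is a hypothetical function that uses std::thread::sleep and checks the deadline cooperatively before each backoff, whereas an async timeout cancels the operation outright. The bound on total time is the same idea.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
pub enum StorageError {
    NotFound(String),
    Conflict(String),
    Backend(String),
    Unavailable(String),
}

// Retries `op` while it returns Unavailable, but gives up as soon as the
// next backoff sleep would run past `deadline`.
pub fn retry_before<T>(
    deadline: Instant,
    max_attempts: u32,
    backoff: Duration,
    mut op: impl FnMut() -> Result<T, StorageError>,
) -> Result<T, StorageError> {
    let mut attempt = 1;
    loop {
        match op() {
            Err(StorageError::Unavailable(msg)) if attempt < max_attempts => {
                if Instant::now() + backoff >= deadline {
                    return Err(StorageError::Unavailable(format!(
                        "deadline reached after: {msg}"
                    )));
                }
                sleep(backoff);
                attempt += 1;
            }
            // Success, a non-retryable error, or the last attempt.
            other => return other,
        }
    }
}
```

If the order were reversed, with a timeout applied to each attempt inside the retry loop, the worst case would be the per-attempt timeout multiplied by the number of attempts, plus all the backoff sleeps.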

Swap-in patterns

For unit tests

Use Storage::in_memory(). Every trait method has an in-memory implementation that matches the production semantics (idempotency, not-found behavior, conflict detection, FK-like cascades).
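As a sketch of what matching production semantics involves, the store below keeps rows in a Mutex<HashMap> and reproduces two of the behaviors named above: Conflict on a duplicate key and NotFound on a miss. It is synchronous for brevity, and the User fields and InMemoryUsers type are hypothetical; the real in-memory backends implement the async traits.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Debug, PartialEq)]
pub enum StorageError {
    NotFound(String),
    Conflict(String),
    Backend(String),
    Unavailable(String),
}

// Hypothetical record shape, for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct User {
    pub id: u64,
    pub name: String,
}

#[derive(Default)]
pub struct InMemoryUsers {
    rows: Mutex<HashMap<u64, User>>,
}

impl InMemoryUsers {
    // Mirrors a primary-key constraint: a second insert with the same id
    // is a Conflict, not an overwrite.
    pub fn create(&self, user: User) -> Result<(), StorageError> {
        let mut rows = self.rows.lock().unwrap();
        if rows.contains_key(&user.id) {
            return Err(StorageError::Conflict(format!(
                "user {} already exists",
                user.id
            )));
        }
        rows.insert(user.id, user);
        Ok(())
    }

    // A miss is NotFound rather than a None, matching the error contract.
    pub fn get(&self, id: u64) -> Result<User, StorageError> {
        self.rows
            .lock()
            .unwrap()
            .get(&id)
            .cloned()
            .ok_or_else(|| StorageError::NotFound(format!("user {id}")))
    }
}
```

The check and the insert happen under one lock acquisition, so two concurrent creates with the same id cannot both succeed.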

For integration tests with one specific backend

Construct each store directly and inject:

let pool = sqlx::PgPool::connect(&database_url).await?;
let users = Arc::new(PostgresUserStore::new(pool));
let storage = Storage {
    users,
    workspaces: Arc::new(InMemoryWorkspaceStore::default()),
    // ...
};

This lets you exercise the real Postgres path while keeping other stores in-memory.

For mocking specific behavior

Implement a trait directly:

struct AlwaysConflictUserStore;

#[async_trait]
impl UserStore for AlwaysConflictUserStore {
    async fn create(&self, _: User) -> StorageResult<()> {
        Err(StorageError::Conflict("forced conflict".into()))
    }
    // ... other methods
}

Arc<dyn UserStore> accepts any type implementing the trait. The HTTP handlers don't care whether they're talking to Postgres, an in-memory HashMap, or a deliberately-broken mock.

Migrations

Migrations live in crates/mondher-storage/migrations/ and are managed by sqlx-cli. Filenames are timestamp-prefixed; SQLx applies them in order.

Common operations:

# Apply pending migrations
make db-migrate

# Drop everything and re-migrate from scratch
make db-reset

# Add a new migration
sqlx migrate add --source crates/mondher-storage/migrations my_new_table

After writing any new sqlx::query! macro:

make sqlx-prepare

This regenerates .sqlx/, the offline query cache that CI uses. Commit the changes; CI requires it.

Fixtures

make fixtures-load populates the dev database with one researcher (Alice), one workspace, one analysis, two annotations, and one two-message comment thread. All IDs are deterministic (00000000-0000-0000-0000-000000000001 for Alice, etc.) so manual exploration and tests can hard-code references.

The fixtures are not loaded in CI and are not part of any production deployment. They exist strictly for local development.

Benchmarks

cargo bench -p mondher-storage runs a criterion suite measuring the in-memory backends. Results land in target/criterion/; the HTML report at target/criterion/report/index.html includes plots and statistical analysis.

We do not benchmark PostgreSQL, Redis, or S3/MinIO. Those numbers depend on external service performance and have no stable meaning across hardware. Production performance is measured through application-level observability, not microbenchmarks.

Typical in-memory numbers (on a 2024-class developer laptop):

  • UserStore::create: ~500 ns - 1.5 µs
  • UserStore::get (hit): ~80 ns
  • WorkspaceStore::get_workspace: ~80 ns
  • ContextBlob::put (1 KB): ~1 µs
  • LatticeCache::get (hit): ~150 ns

Sub-microsecond reads, low-microsecond writes. The trait dispatch and lock acquisition dominate; the data structures themselves cost almost nothing.

When something goes wrong

| Symptom | Likely cause | Fix |
|---|---|---|
| Tests pass locally, fail in CI | .sqlx/ cache outdated | make sqlx-prepare, commit |
| cargo build fails with "DATABASE_URL not set" | sqlx macro can't find live DB and no offline cache | export DATABASE_URL or set SQLX_OFFLINE=true |
| Tests pass, then suddenly start failing | Stale migration state in dev DB | make db-reset |
| make fixtures-load errors on duplicate keys | Fixtures already loaded | Run make db-reset first |
| Storage tests time out in CI | Service slow to start (rare) | Rerun the job; if persistent, paste the log |

Phase 3 considerations

The current layer is sized for v0.2. Things we know we'll want before production but haven't implemented:

  • Read replicas: PgPoolOptions only takes one URL. Phase 3 may introduce a PgRouter that picks read vs write pools.
  • Distributed tracing: SQL queries should report span context for observability. Tracing is added in Phase 3 alongside the API server.
  • Connection draining: graceful shutdown should let in-flight queries complete before closing pools. Currently we rely on process exit cleanup.
  • Auditing: a who-changed-what-when log table. Phase 4 territory.

These are intentional gaps. Shipping v0.2 with the current shape is the priority; the layer is built to extend without breaking.