Storage layer

Mondher's persistence layer is a small set of async traits — one per domain slice — plus production implementations against PostgreSQL, S3-compatible blob storage, and Redis. The same traits are implemented by in-memory backends for tests.

This page documents the storage layer's shape, conventions, and operational characteristics. Source lives in crates/mondher-storage/.

Traits

Five traits cover everything persistent in Mondher today:

Trait	What it persists	Production backend
`UserStore`	User accounts	PostgreSQL `users` table
`WorkspaceStore`	Workspaces and analyses	PostgreSQL `workspaces` + `analyses`
`AnnotationStore`	Annotations and threaded comments	PostgreSQL `annotations` + `comments`
`ContextBlob`	Raw context bytes keyed by hash	S3 / MinIO bucket
`LatticeCache`	Computed lattices keyed by hash	Redis with TTL

Every method is async fn via the async_trait macro so the traits remain object-safe and can be used as Arc<dyn TraitName>. Errors are returned as StorageResult<T> — a Result<T, StorageError> alias.
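To make the object-safety point concrete, here is a minimal sketch of roughly what an async_trait method looks like after expansion: a plain method returning a boxed, Send future. The User fields, the two-variant error enum, the InMemoryUserStore internals, and the tiny block_on executor are all assumptions made for illustration; the real definitions live in crates/mondher-storage/.

```rust
use std::collections::HashMap;
use std::future::Future;
use std::pin::{pin, Pin};
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

// Simplified stand-ins; the real types carry more fields and variants.
#[derive(Debug, Clone, PartialEq)]
pub struct User {
    pub id: u64,
    pub name: String,
}

#[derive(Debug, PartialEq)]
pub enum StorageError {
    NotFound(String),
    Conflict(String),
}

pub type StorageResult<T> = Result<T, StorageError>;

// Approximately the return type async_trait generates for an async fn.
type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

// No generics and no `impl Trait` returns, so `dyn UserStore` is allowed.
pub trait UserStore: Send + Sync {
    fn create(&self, user: User) -> BoxFuture<'_, StorageResult<()>>;
    fn get(&self, id: u64) -> BoxFuture<'_, StorageResult<User>>;
}

#[derive(Default)]
pub struct InMemoryUserStore {
    users: Mutex<HashMap<u64, User>>,
}

impl UserStore for InMemoryUserStore {
    fn create(&self, user: User) -> BoxFuture<'_, StorageResult<()>> {
        Box::pin(async move {
            let mut users = self.users.lock().unwrap();
            if users.contains_key(&user.id) {
                return Err(StorageError::Conflict(format!("user {} exists", user.id)));
            }
            users.insert(user.id, user);
            Ok(())
        })
    }

    fn get(&self, id: u64) -> BoxFuture<'_, StorageResult<User>> {
        Box::pin(async move {
            let found = self.users.lock().unwrap().get(&id).cloned();
            found.ok_or_else(|| StorageError::NotFound(format!("user {id}")))
        })
    }
}

// Minimal executor so the sketch runs without an async runtime.
// It busy-polls, which is only acceptable for futures that never wait.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

pub fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            Poll::Pending => std::thread::yield_now(),
        }
    }
}
```

A caller would hold the store as Arc<dyn UserStore> and never name the concrete type. In the real crate the macro writes the boxing for you, and a proper runtime replaces block_on.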

Errors

StorageError is deliberately small: four tuple variants covering the failure classes callers need to tell apart.

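pub enum StorageError {
    /// The requested entity does not exist (expected; 404 in an API).
    NotFound(String),
    /// The write collides with existing state (expected; 409 in an API).
    Conflict(String),
    /// The backend failed unexpectedly (500-class).
    Backend(String),
    /// The backend is unreachable or did not respond in time (500-class).
    Unavailable(String),
}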

NotFound and Conflict are expected — callers should match on them and respond accordingly (404 vs 409 in an API). Backend and Unavailable are unexpected — they propagate as 500-class errors.

Driver-specific errors (sqlx, AWS SDK, redis) are mapped to these four variants at the backend boundary. A consumer never sees a sqlx::Error or an aws_sdk_s3::Error; they see only StorageError.
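The two mappings described above (driver error to StorageError at the backend boundary, and StorageError to HTTP status at the API boundary) might look like the following sketch. DriverError is a hypothetical stand-in for sqlx::Error and similar types, and the split of Backend to 500 and Unavailable to 503 is an assumption; the text only says both are 500-class.

```rust
#[derive(Debug, PartialEq)]
pub enum StorageError {
    NotFound(String),
    Conflict(String),
    Backend(String),
    Unavailable(String),
}

/// Hypothetical driver error, standing in for sqlx::Error and friends.
#[derive(Debug)]
pub enum DriverError {
    RowNotFound,
    UniqueViolation { constraint: String },
    PoolTimedOut,
    Other(String),
}

// Backend boundary: collapse driver detail into the four variants.
impl From<DriverError> for StorageError {
    fn from(err: DriverError) -> Self {
        match err {
            DriverError::RowNotFound => StorageError::NotFound("row not found".into()),
            DriverError::UniqueViolation { constraint } => {
                StorageError::Conflict(format!("unique constraint {constraint} violated"))
            }
            DriverError::PoolTimedOut => {
                StorageError::Unavailable("connection pool timed out".into())
            }
            DriverError::Other(msg) => StorageError::Backend(msg),
        }
    }
}

// API boundary: expected variants get specific codes, the rest are 500-class.
pub fn http_status(err: &StorageError) -> u16 {
    match err {
        StorageError::NotFound(_) => 404,
        StorageError::Conflict(_) => 409,
        StorageError::Backend(_) => 500,     // assumption: generic server error
        StorageError::Unavailable(_) => 503, // assumption: service unavailable
    }
}
```

With a From impl like this, backend code can use the ? operator on driver calls and the conversion happens at the boundary automatically.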

Composition

Storage is the composition root: one struct holding Arc<dyn ...> for each trait. Two constructors:

Storage::from_env() — reads DATABASE_URL, REDIS_URL, MINIO_URL etc. and builds production backends. Uses StorageConfig::from_env() for pool sizing and timeouts.
Storage::in_memory() — builds an all-in-memory stack with no external services. Use for unit tests and CI.

// Production binary
let storage = Storage::from_env().await?;

// Tests
let storage = Storage::in_memory();

The Arc<dyn ...> indirection means consumers can be tested by handing them mock implementations of each trait. The HTTP API (Days 113+) holds Arc<Storage> in its app state and clones it into each handler.
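As a rough picture of the composition root, the sketch below shows a two-field Storage built from in-memory backends and shared through Arc<Storage>. It is deliberately simplified: the traits are synchronous, the method signatures are assumed, the field names blobs and lattices are hypothetical, and handle_upload is an invented stand-in for an HTTP handler.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Synchronous stand-ins for two of the five traits (signatures assumed).
pub trait ContextBlob: Send + Sync {
    fn put(&self, hash: &str, bytes: Vec<u8>);
    fn get(&self, hash: &str) -> Option<Vec<u8>>;
}

pub trait LatticeCache: Send + Sync {
    fn put(&self, hash: &str, lattice: Vec<u8>);
    fn get(&self, hash: &str) -> Option<Vec<u8>>;
}

// One in-memory map type backs both traits in this sketch.
#[derive(Default)]
pub struct InMemoryBytes {
    entries: Mutex<HashMap<String, Vec<u8>>>,
}

impl ContextBlob for InMemoryBytes {
    fn put(&self, hash: &str, bytes: Vec<u8>) {
        self.entries.lock().unwrap().insert(hash.to_string(), bytes);
    }
    fn get(&self, hash: &str) -> Option<Vec<u8>> {
        self.entries.lock().unwrap().get(hash).cloned()
    }
}

impl LatticeCache for InMemoryBytes {
    fn put(&self, hash: &str, lattice: Vec<u8>) {
        self.entries.lock().unwrap().insert(hash.to_string(), lattice);
    }
    fn get(&self, hash: &str) -> Option<Vec<u8>> {
        self.entries.lock().unwrap().get(hash).cloned()
    }
}

// Composition root: one trait object per domain slice.
pub struct Storage {
    pub blobs: Arc<dyn ContextBlob>,
    pub lattices: Arc<dyn LatticeCache>,
}

impl Storage {
    pub fn in_memory() -> Self {
        Storage {
            blobs: Arc::new(InMemoryBytes::default()),
            lattices: Arc::new(InMemoryBytes::default()),
        }
    }
}

// Hypothetical handler: it receives its own Arc clone of the shared Storage.
pub fn handle_upload(storage: Arc<Storage>, hash: &str, bytes: Vec<u8>) {
    storage.blobs.put(hash, bytes);
}
```

Cloning the Arc copies a pointer, not the backends, so every handler sees the same pools and the same in-memory state.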

Configuration

StorageConfig exposes seven tunables, all settable via environment variables. Defaults are sized for the bundled docker compose stack on a developer laptop.

Setting	Default	Env var
Postgres max connections	10	`MONDHER_PG_MAX_CONNECTIONS`
Postgres acquire timeout	5 s	`MONDHER_PG_ACQUIRE_TIMEOUT_MS`
Postgres connection max lifetime	30 min	`MONDHER_PG_MAX_LIFETIME_SECS`
Postgres idle timeout	10 min	`MONDHER_PG_IDLE_TIMEOUT_SECS`
S3 call timeout	10 s	`MONDHER_S3_CALL_TIMEOUT_MS`
Redis call timeout	2 s	`MONDHER_REDIS_CALL_TIMEOUT_MS`
Postgres max retries	3	`MONDHER_PG_MAX_RETRIES`
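Reading one of these tunables could look like the sketch below, which uses MONDHER_PG_ACQUIRE_TIMEOUT_MS and its 5 s default from the table. The helper names are hypothetical, and falling back to the default on a malformed value is an assumption; the real StorageConfig::from_env() may reject bad input instead.

```rust
use std::time::Duration;

/// Parse an optional raw value, falling back to `default` when the
/// variable is unset or not a valid unsigned integer.
pub fn parse_or_default(raw: Option<&str>, default: u64) -> u64 {
    raw.and_then(|s| s.trim().parse::<u64>().ok())
        .unwrap_or(default)
}

/// Postgres acquire timeout; the table above lists a 5 s default.
pub fn pg_acquire_timeout(raw: Option<&str>) -> Duration {
    Duration::from_millis(parse_or_default(raw, 5_000))
}

/// Same thing, reading the process environment.
pub fn pg_acquire_timeout_from_env() -> Duration {
    let raw = std::env::var("MONDHER_PG_ACQUIRE_TIMEOUT_MS").ok();
    pg_acquire_timeout(raw.as_deref())
}
```

Keeping the parsing separate from the environment lookup makes the default and fallback behavior testable without mutating process state.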

The defaults are conservative; production tuning depends on workload and infrastructure. For example, if N application instances share a Postgres connection budget of M total connections, set MONDHER_PG_MAX_CONNECTIONS to at most (M - H) / N, where H is the headroom reserved for migrations and one-off scripts.
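The sizing rule is simple arithmetic; a small helper makes the headroom explicit. The function name and the separate headroom parameter are illustrative assumptions, not part of StorageConfig.

```rust
/// Per-instance pool size when `instances` application instances share a
/// budget of `total_connections`, with `headroom` connections held back
/// for migrations and one-off scripts. Rounds down; returns 0 if nothing
/// is left to share.
pub fn per_instance_pool_size(total_connections: u32, instances: u32, headroom: u32) -> u32 {
    if instances == 0 {
        return 0;
    }
    total_connections.saturating_sub(headroom) / instances
}
```

For a budget of 100 connections, 4 instances, and 10 connections of headroom, this gives 22 per instance, leaving 12 connections unused by the application pools.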

Retries and timeouts

Two helpers wrap storage operations with operational guarantees:

`with_retry`

Retries transient PostgreSQL errors with bounded exponential backoff. Transient means:

- StorageError::Unavailable: always.
- StorageError::Backend containing one of these SQLSTATEs:
  - 40001 serialization failure
  - 40P01 deadlock detected
  - 08006 connection failure
  - 08000 connection exception (generic)
  - 57P03 cannot connect now (server starting)

Non-transient errors (NotFound, Conflict, malformed input) are returned immediately. The default budget is 3 attempts (the initial call plus 2 retries), and each backoff delay is capped at 1 second.

let user = with_retry(3, || storage.users.get(id)).await?;
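The transient check and the capped backoff can be sketched as two pure functions. This is an illustration of the rules listed above, not the crate's actual code: matching SQLSTATEs by substring in the Backend message and the 100 ms base delay used in the note below are assumptions.

```rust
use std::time::Duration;

#[derive(Debug)]
pub enum StorageError {
    NotFound(String),
    Conflict(String),
    Backend(String),
    Unavailable(String),
}

// SQLSTATEs treated as transient, as listed above.
const TRANSIENT_SQLSTATES: [&str; 5] = ["40001", "40P01", "08006", "08000", "57P03"];

/// True when retrying the operation has a chance of succeeding.
pub fn is_transient(err: &StorageError) -> bool {
    match err {
        StorageError::Unavailable(_) => true,
        // Assumption: the SQLSTATE appears somewhere in the message text.
        StorageError::Backend(msg) => TRANSIENT_SQLSTATES.iter().any(|code| msg.contains(code)),
        StorageError::NotFound(_) | StorageError::Conflict(_) => false,
    }
}

/// Delay before retry number `retry` (0-based): base * 2^retry, capped at 1 s.
pub fn backoff_delay(retry: u32, base: Duration) -> Duration {
    let factor = 2u32.saturating_pow(retry);
    base.saturating_mul(factor).min(Duration::from_secs(1))
}
```

With a 100 ms base, the delays run 100 ms, 200 ms, 400 ms, 800 ms, and then stay at the 1 s cap.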

`with_timeout`

Wraps any async operation with a hard deadline. Operations that don't complete within the budget return StorageError::Unavailable.

let user = with_timeout(Duration::from_secs(5), storage.users.get(id))
    .await?;

Typical HTTP handlers allocate part of their request budget to storage: "I have 30 seconds total; storage gets 5 seconds before I 503."

The retry and timeout helpers compose; wrap a retry inside a timeout so retries can't cumulatively exceed the deadline.
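With the helpers above, the composition would presumably read with_timeout(budget, with_retry(3, ...)). To show why the deadline has to sit on the outside, here is a synchronous analogue in which one overall budget bounds the whole retry loop. retry_within and its parameters are hypothetical. Unlike the async with_timeout, this version cannot interrupt an attempt that is already running; it only refuses to start another backoff that would overrun the deadline.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
pub enum StorageError {
    NotFound(String),
    Unavailable(String),
}

/// Retry `op` on Unavailable, but never sleep past `budget`.
pub fn retry_within<T>(
    budget: Duration,
    max_attempts: u32,
    backoff: Duration,
    mut op: impl FnMut() -> Result<T, StorageError>,
) -> Result<T, StorageError> {
    let deadline = Instant::now() + budget;
    let mut attempt = 0;
    loop {
        attempt += 1;
        match op() {
            // Transient failure with attempts left: back off, unless the
            // backoff itself would carry us past the overall deadline.
            Err(StorageError::Unavailable(msg)) if attempt < max_attempts => {
                if Instant::now() + backoff > deadline {
                    return Err(StorageError::Unavailable(format!(
                        "deadline exceeded after {attempt} attempts: {msg}"
                    )));
                }
                std::thread::sleep(backoff);
            }
            // Success, a non-transient error, or the last attempt.
            other => return other,
        }
    }
}
```

If the timeout were instead applied inside the retry, each attempt would get a fresh budget and the total could grow to roughly attempts times budget, plus backoff.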

Swap-in patterns

For unit tests

Use Storage::in_memory(). Every trait method has an in-memory implementation that matches the production semantics (idempotency, not-found behavior, conflict detection, FK-like cascades).

For integration tests with one specific backend

Construct each store directly and inject:

// Real Postgres pool for the store under test
let pool = sqlx::PgPool::connect(&database_url).await?;
let users = Arc::new(PostgresUserStore::new(pool));

// Every other store stays in-memory
let storage = Storage {
    users,
    workspaces: Arc::new(InMemoryWorkspaceStore::default()),
    // ...
};

This lets you exercise the real Postgres path while keeping other stores in-memory.

For mocking specific behavior

Implement a trait directly:

struct AlwaysConflictUserStore;

#[async_trait]
impl UserStore for AlwaysConflictUserStore {
    async fn create(&self, _: User) -> StorageResult<()> {
        Err(StorageError::Conflict("forced conflict".into()))
    }
    // ... other methods
}

Arc<dyn UserStore> accepts any type implementing the trait. The HTTP handlers don't care whether they're talking to Postgres, an in-memory HashMap, or a deliberately-broken mock.

Migrations

Migrations live in crates/mondher-storage/migrations/ and are managed by sqlx-cli. Filenames are timestamp-prefixed; SQLx applies them in order.

Common operations:

# Apply pending migrations
make db-migrate

# Drop everything and re-migrate from scratch
make db-reset

# Add a new migration
sqlx migrate add --source crates/mondher-storage/migrations my_new_table

After writing any new sqlx::query! macro:

make sqlx-prepare

This regenerates .sqlx/, the offline query cache that CI uses. Commit the changes; CI requires it.

Fixtures

make fixtures-load populates the dev database with one researcher (Alice), one workspace, one analysis, two annotations, and one two-message comment thread. All IDs are deterministic (00000000-0000-0000-0000-000000000001 for Alice, etc.) so manual exploration and tests can hard-code references.

The fixture loader does not run in CI, and the fixture data is not part of any production deployment. It exists strictly for local development.

Benchmarks

cargo bench -p mondher-storage runs a criterion suite measuring the in-memory backends. Results land in target/criterion/; the HTML report at target/criterion/report/index.html includes plots and statistical analysis.

We do not benchmark PostgreSQL, Redis, or S3/MinIO. Those numbers depend on external service performance and have no stable meaning across hardware. Production performance is measured through application-level observability, not microbenchmarks.

Typical in-memory numbers (on a 2024-class developer laptop):

UserStore::create: ~500 ns to 1.5 µs
UserStore::get (hit): ~80 ns
WorkspaceStore::get_workspace: ~80 ns
ContextBlob::put (1 KB): ~1 µs
LatticeCache::get (hit): ~150 ns

Sub-microsecond reads, low-microsecond writes. The trait dispatch and lock acquisition dominate; the data structures themselves cost almost nothing.

When something goes wrong

Symptom	Likely cause	Fix
Tests pass locally, fail in CI	`.sqlx/` cache outdated	`make sqlx-prepare`, commit
`cargo build` fails with "DATABASE_URL not set"	sqlx macro can't find live DB and no offline cache	export `DATABASE_URL` or set `SQLX_OFFLINE=true`
Tests pass, then suddenly start failing	Stale migration state in dev DB	`make db-reset`
`make fixtures-load` errors on duplicate keys	Fixtures already loaded	Run `make db-reset` first
Storage tests time out in CI	Service slow to start (rare)	Rerun the job; if persistent, paste the log

Phase 3 considerations

The current layer is sized for v0.2. Things we know we'll want before production but haven't implemented:

Read replicas: a sqlx pool connects to a single database URL. Phase 3 may introduce a PgRouter that picks between read and write pools.
Distributed tracing: SQL queries should report span context for observability. Tracing will be added in Phase 3 alongside the API server.
Connection draining: graceful shutdown should let in-flight queries complete before closing pools. Currently we rely on process exit cleanup.
Auditing: a who-changed-what-when log table. Phase 4 territory.

These are intentional gaps. Shipping v0.2 with the current shape is the priority; the layer is built to extend without breaking.