Storage layer
Mondher's persistence layer is a small set of async traits — one per domain slice — plus production implementations against PostgreSQL, S3-compatible blob storage, and Redis. The same traits are implemented by in-memory backends for tests.
This page documents the storage layer's shape, conventions, and
operational characteristics. Source lives in crates/mondher-storage/.
Traits
Five traits cover everything persistent in Mondher today:
| Trait | What it persists | Production backend |
|---|---|---|
| UserStore | User accounts | PostgreSQL users table |
| WorkspaceStore | Workspaces and analyses | PostgreSQL workspaces + analyses |
| AnnotationStore | Annotations and threaded comments | PostgreSQL annotations + comments |
| ContextBlob | Raw context bytes keyed by hash | S3 / MinIO bucket |
| LatticeCache | Computed lattices keyed by hash | Redis with TTL |
Every method is async fn via the async_trait macro so the traits
remain object-safe and can be used as Arc<dyn TraitName>. Errors are
returned as StorageResult<T> — a Result<T, StorageError> alias.
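To illustrate why the macro keeps the traits object-safe: async_trait rewrites each async fn into a method that returns a boxed future, and a method with that shape can be called through dyn. The sketch below hand-writes roughly that shape using only the standard library. UserStoreSketch, InMemoryUsers, lookup, and the tiny block_on helper are hypothetical names for illustration; the real traits use the macro and run on an async runtime.

```rust
use std::collections::HashMap;
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, RwLock};
use std::task::{Context, Poll, Wake, Waker};

/// Boxed future: roughly what an `async fn` in a trait is rewritten to return.
type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

/// Hypothetical, cut-down stand-in for one store trait.
trait UserStoreSketch: Send + Sync {
    fn get<'a>(&'a self, id: u64) -> BoxFuture<'a, Option<String>>;
}

/// Hypothetical in-memory backend: a map behind a lock.
#[derive(Default)]
struct InMemoryUsers {
    rows: RwLock<HashMap<u64, String>>,
}

impl UserStoreSketch for InMemoryUsers {
    fn get<'a>(&'a self, id: u64) -> BoxFuture<'a, Option<String>> {
        Box::pin(async move {
            let rows = self.rows.read().unwrap();
            rows.get(&id).cloned()
        })
    }
}

/// Waker that does nothing; enough for futures that never actually wait.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Minimal executor for the sketch: polls in a loop until the future is ready.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

/// The caller sees only the trait object, never the concrete backend.
fn lookup(store: Arc<dyn UserStoreSketch>, id: u64) -> Option<String> {
    block_on(store.get(id))
}
```

Because get returns a concrete boxed type rather than an anonymous future, the trait has a fixed vtable entry for it, which is what makes Arc<dyn UserStoreSketch> possible.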
Errors
StorageError is deliberately small: four tuple variants covering the
failure modes callers care about distinguishing.
    pub enum StorageError {
        NotFound(String),
        Conflict(String),
        Backend(String),
        Unavailable(String),
    }
NotFound and Conflict are expected — callers should match on them
and respond accordingly (404 vs 409 in an API). Backend and
Unavailable are unexpected — they propagate as 500-class errors.
Driver-specific errors (sqlx, AWS SDK, redis) are mapped to these four
variants at the backend boundary. A consumer never sees a sqlx::Error
or an aws_sdk_s3::Error; they see only StorageError.
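A minimal sketch of both ideas, assuming a made-up driver error type: FakeDriverError stands in for the real driver errors, and lift and http_status are hypothetical helpers. The source only states that unexpected errors are 500-class, so the split between 500 and 503 below is an assumption.

```rust
/// Mirrors the four variants of StorageError described above.
#[derive(Debug, PartialEq)]
pub enum StorageError {
    NotFound(String),
    Conflict(String),
    Backend(String),
    Unavailable(String),
}

pub type StorageResult<T> = Result<T, StorageError>;

/// Hypothetical driver error, standing in for sqlx / AWS SDK / redis errors.
#[derive(Debug)]
pub enum FakeDriverError {
    RowNotFound,
    UniqueViolation(String),
    PoolTimedOut,
    Other(String),
}

/// The backend boundary: every driver error collapses into one of four variants.
impl From<FakeDriverError> for StorageError {
    fn from(err: FakeDriverError) -> Self {
        match err {
            FakeDriverError::RowNotFound => StorageError::NotFound("row not found".into()),
            FakeDriverError::UniqueViolation(name) => StorageError::Conflict(name),
            FakeDriverError::PoolTimedOut => StorageError::Unavailable("pool timed out".into()),
            FakeDriverError::Other(msg) => StorageError::Backend(msg),
        }
    }
}

/// Backend code can use `?` and let the From impl do the mapping.
pub fn lift<T>(raw: Result<T, FakeDriverError>) -> StorageResult<T> {
    Ok(raw?)
}

/// Caller side: expected errors get specific codes, unexpected ones are 5xx.
pub fn http_status(err: &StorageError) -> u16 {
    match err {
        StorageError::NotFound(_) => 404,
        StorageError::Conflict(_) => 409,
        // Assumption: the text only says "500-class" for these two.
        StorageError::Backend(_) => 500,
        StorageError::Unavailable(_) => 503,
    }
}
```

Keeping the conversion in a From impl means the mapping lives in one place per driver, and consumers only ever match on four cases.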
Composition
Storage is the composition root: one struct holding Arc<dyn ...>
for each trait. Two constructors:
- Storage::from_env(): reads DATABASE_URL, REDIS_URL, MINIO_URL, etc. and builds production backends. Uses StorageConfig::from_env() for pool sizing and timeouts.
- Storage::in_memory(): builds an all-in-memory stack with no external services. Use for unit tests and CI.
    // Production binary
    let storage = Storage::from_env().await?;

    // Tests
    let storage = Storage::in_memory();
The Arc<dyn ...> indirection means consumers can be tested by handing
them mock implementations of each trait. The HTTP API (Days 113+) holds
Arc<Storage> in its app state and clones it into each handler.
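The composition-root idea can be sketched with the standard library alone. Everything below is a simplified stand-in: the traits are synchronous (the real ones are async), and UserStoreSketch, WorkspaceStoreSketch, StorageSketch, register, and AlwaysConflict are hypothetical names, with only the users and workspaces field names taken from this page.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Synchronous stand-ins for two of the store traits.
trait UserStoreSketch: Send + Sync {
    /// Fails if the id is already taken.
    fn create(&self, id: u32, name: &str) -> Result<(), String>;
}

trait WorkspaceStoreSketch: Send + Sync {
    fn count(&self) -> usize;
}

#[derive(Default)]
struct InMemoryUsers(Mutex<HashMap<u32, String>>);

impl UserStoreSketch for InMemoryUsers {
    fn create(&self, id: u32, name: &str) -> Result<(), String> {
        let mut rows = self.0.lock().unwrap();
        if rows.contains_key(&id) {
            return Err(format!("user {id} already exists"));
        }
        rows.insert(id, name.to_string());
        Ok(())
    }
}

#[derive(Default)]
struct InMemoryWorkspaces(Mutex<Vec<String>>);

impl WorkspaceStoreSketch for InMemoryWorkspaces {
    fn count(&self) -> usize {
        self.0.lock().unwrap().len()
    }
}

/// Composition root: one trait object per domain slice.
#[derive(Clone)]
struct StorageSketch {
    users: Arc<dyn UserStoreSketch>,
    workspaces: Arc<dyn WorkspaceStoreSketch>,
}

impl StorageSketch {
    fn in_memory() -> Self {
        Self {
            users: Arc::new(InMemoryUsers::default()),
            workspaces: Arc::new(InMemoryWorkspaces::default()),
        }
    }
}

/// A consumer that depends on the trait only, not on any backend.
fn register(storage: &StorageSketch, id: u32, name: &str) -> &'static str {
    match storage.users.create(id, name) {
        Ok(()) => "created",
        Err(_) => "conflict",
    }
}

/// A mock that always fails; it slots in without changing `register`.
struct AlwaysConflict;

impl UserStoreSketch for AlwaysConflict {
    fn create(&self, _: u32, _: &str) -> Result<(), String> {
        Err("forced conflict".into())
    }
}
```

Cloning StorageSketch only bumps reference counts, which is why handing a copy to every handler is cheap.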
Configuration
StorageConfig exposes seven tunables, all settable via environment
variables. Defaults are sized for the bundled docker compose stack on a
developer laptop.
| Setting | Default | Env var |
|---|---|---|
| Postgres max connections | 10 | MONDHER_PG_MAX_CONNECTIONS |
| Postgres acquire timeout | 5 s | MONDHER_PG_ACQUIRE_TIMEOUT_MS |
| Postgres connection max lifetime | 30 min | MONDHER_PG_MAX_LIFETIME_SECS |
| Postgres idle timeout | 10 min | MONDHER_PG_IDLE_TIMEOUT_SECS |
| S3 call timeout | 10 s | MONDHER_S3_CALL_TIMEOUT_MS |
| Redis call timeout | 2 s | MONDHER_REDIS_CALL_TIMEOUT_MS |
| Postgres max retries | 3 | MONDHER_PG_MAX_RETRIES |
The defaults are conservative; production tuning depends on workload
and infrastructure. For example, if the Postgres server allows M total
connections and N application instances run behind a load balancer, set
MONDHER_PG_MAX_CONNECTIONS on each instance to at most M/N, rounded down,
after first reserving headroom out of M for migrations and one-off scripts.
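As a worked version of that arithmetic, here is a small sketch. env_u32 and per_instance_pool are hypothetical helpers, and falling back to the default on an unset or unparsable variable is an assumption about how StorageConfig::from_env() behaves, not something this page confirms.

```rust
use std::env;

/// Reads an unsigned integer from the environment, falling back to `default`
/// when the variable is unset or does not parse.
fn env_u32(name: &str, default: u32) -> u32 {
    env::var(name)
        .ok()
        .and_then(|raw| raw.trim().parse().ok())
        .unwrap_or(default)
}

/// Per-instance pool size: (total - headroom) / instances, rounded down,
/// and never less than 1.
fn per_instance_pool(total: u32, instances: u32, headroom: u32) -> u32 {
    let usable = total.saturating_sub(headroom);
    (usable / instances.max(1)).max(1)
}
```

With 100 server connections, 8 held back for migrations and scripts, and 4 instances, per_instance_pool(100, 4, 8) gives 23 connections per instance.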
Retries and timeouts
Two helpers wrap storage operations with operational guarantees:
with_retry
Retries transient PostgreSQL errors with bounded exponential backoff. Transient means:
- StorageError::Unavailable: any.
- StorageError::Backend containing one of these SQLSTATEs:
  - 40001 serialization failure
  - 40P01 deadlock detected
  - 08006 connection failure
  - 08000 connection exception (generic)
  - 57P03 cannot connect now (server starting)
Non-transient errors (NotFound, Conflict, malformed input) are
returned immediately. The default budget is 3 attempts (initial + 2
retries), capped at 1 second of backoff per attempt.
    let user = with_retry(3, || storage.users.get(id)).await?;
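The classification and backoff rules above might look roughly like the following. This is a synchronous sketch under stated assumptions, not the real with_retry: is_transient, backoff_delay, and retry_sync are hypothetical names, the 50 ms base delay is invented for illustration, and detecting SQLSTATEs by substring match on the Backend message is a guess at the mechanism.

```rust
use std::thread;
use std::time::Duration;

#[derive(Debug)]
pub enum StorageError {
    NotFound(String),
    Conflict(String),
    Backend(String),
    Unavailable(String),
}

/// SQLSTATEs listed above as transient.
const TRANSIENT_SQLSTATES: [&str; 5] = ["40001", "40P01", "08006", "08000", "57P03"];

fn is_transient(err: &StorageError) -> bool {
    match err {
        StorageError::Unavailable(_) => true,
        // Assumption: the SQLSTATE appears somewhere in the message text.
        StorageError::Backend(msg) => TRANSIENT_SQLSTATES.iter().any(|code| msg.contains(code)),
        StorageError::NotFound(_) | StorageError::Conflict(_) => false,
    }
}

/// Delay before retry number `retry` (1-based): doubles each time, capped at 1 s.
fn backoff_delay(retry: u32) -> Duration {
    const BASE_MS: u64 = 50; // assumed base; only the 1 s cap comes from the text
    const CAP_MS: u64 = 1_000;
    let factor = 1u64 << retry.saturating_sub(1).min(20);
    Duration::from_millis((BASE_MS * factor).min(CAP_MS))
}

/// Calls `op` up to `max_attempts` times, sleeping between transient failures.
fn retry_sync<T>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, StorageError>,
) -> Result<T, StorageError> {
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) if attempt < max_attempts && is_transient(&err) => {
                thread::sleep(backoff_delay(attempt));
                attempt += 1;
            }
            // Non-transient error, or the attempt budget is spent.
            Err(err) => return Err(err),
        }
    }
}
```

With a budget of 3, the operation runs at most three times and sleeps at most twice; a NotFound or Conflict returns after the first call.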
with_timeout
Wraps any async operation with a hard deadline. Operations that don't
complete within the budget return StorageError::Unavailable.
    let user = with_timeout(Duration::from_secs(5), storage.users.get(id))
        .await?;
Typical HTTP handlers allocate part of their request budget to storage: "I have 30 seconds total; storage gets 5 seconds before I 503."
The retry and timeout helpers compose; wrap a retry inside a timeout so retries can't cumulatively exceed the deadline.
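To show why the nesting order matters, here is a synchronous approximation of a retry loop bounded by an overall deadline. It is not how with_timeout works: an async timeout cancels the in-flight future, whereas this sketch only checks the clock between attempts. retry_within and RetryError are hypothetical names.

```rust
use std::thread;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum RetryError<E> {
    /// All attempts were used; carries the last error.
    Failed(E),
    /// Another backoff sleep would run past the overall budget.
    DeadlineExceeded,
}

/// Retries `op` up to `max_attempts` times, but never sleeps past `budget`.
fn retry_within<T, E>(
    budget: Duration,
    max_attempts: u32,
    backoff: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, RetryError<E>> {
    let deadline = Instant::now() + budget;
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) if attempt >= max_attempts => return Err(RetryError::Failed(err)),
            Err(_) => {
                // The outer deadline wins over the remaining retry budget.
                if Instant::now() + backoff >= deadline {
                    return Err(RetryError::DeadlineExceeded);
                }
                thread::sleep(backoff);
                attempt += 1;
            }
        }
    }
}
```

The deadline sits outside the loop, so a generous attempt count cannot stretch the total time; in Mondher the equivalent outcome would surface as StorageError::Unavailable from with_timeout.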
Swap-in patterns
For unit tests
Use Storage::in_memory(). Every trait method has an in-memory
implementation that matches the production semantics (idempotency,
not-found behavior, conflict detection, FK-like cascades).
For integration tests with one specific backend
Construct each store directly and inject:
    let pool = sqlx::PgPool::connect(&database_url).await?;
    let users = Arc::new(PostgresUserStore::new(pool));
    let storage = Storage {
        users,
        workspaces: Arc::new(InMemoryWorkspaceStore::default()),
        // ...
    };
This lets you exercise the real Postgres path while keeping other stores in-memory.
For mocking specific behavior
Implement a trait directly:
    struct AlwaysConflictUserStore;

    #[async_trait]
    impl UserStore for AlwaysConflictUserStore {
        async fn create(&self, _: User) -> StorageResult<()> {
            Err(StorageError::Conflict("forced conflict".into()))
        }
        // ... other methods
    }
Arc<dyn UserStore> accepts any type implementing the trait. The HTTP
handlers don't care whether they're talking to Postgres, an in-memory
HashMap, or a deliberately-broken mock.
Migrations
Migrations live in crates/mondher-storage/migrations/ and are managed
by sqlx-cli. Filenames are timestamp-prefixed; SQLx applies them in
order.
Common operations:
    # Apply pending migrations
    make db-migrate

    # Drop everything and re-migrate from scratch
    make db-reset

    # Add a new migration
    sqlx migrate add --source crates/mondher-storage/migrations my_new_table
After writing any new sqlx::query! macro:
    make sqlx-prepare
This regenerates .sqlx/, the offline query cache that CI uses. Commit
the changes; CI requires it.
Fixtures
make fixtures-load populates the dev database with one researcher
(Alice), one workspace, one analysis, two annotations, and one
two-message comment thread. All IDs are deterministic
(00000000-0000-0000-0000-000000000001 for Alice, etc.) so manual
exploration and tests can hard-code references.
The fixtures are not loaded in CI and are not part of any production deployment. They exist strictly for local development.
Benchmarks
cargo bench -p mondher-storage runs a criterion suite measuring the
in-memory backends. Results land in target/criterion/; the HTML
report at target/criterion/report/index.html includes plots and
statistical analysis.
We do not benchmark PostgreSQL, Redis, or S3/MinIO. Those numbers depend on external service performance and have no stable meaning across hardware. Production performance is measured through application-level observability, not microbenchmarks.
Typical in-memory numbers (on a 2024-class developer laptop):
- UserStore::create: ~500 ns - 1.5 µs
- UserStore::get (hit): ~80 ns
- WorkspaceStore::get_workspace: ~80 ns
- ContextBlob::put (1 KB): ~1 µs
- LatticeCache::get (hit): ~150 ns
Sub-microsecond reads, low-microsecond writes. The trait dispatch and lock acquisition dominate; the data structures themselves cost almost nothing.
When something goes wrong
| Symptom | Likely cause | Fix |
|---|---|---|
| Tests pass locally, fail in CI | .sqlx/ cache outdated | make sqlx-prepare, commit |
| cargo build fails with "DATABASE_URL not set" | sqlx macro can't find live DB and no offline cache | export DATABASE_URL or set SQLX_OFFLINE=true |
| Tests pass, then suddenly start failing | Stale migration state in dev DB | make db-reset |
| make fixtures-load errors on duplicate keys | Fixtures already loaded | Run make db-reset first |
| Storage tests time out in CI | Service slow to start (rare) | Rerun the job; if persistent, paste the log |
Phase 3 considerations
The current layer is sized for v0.2. Things we know we'll want before production but haven't implemented:
- Read replicas: PgPoolOptions only takes one URL. Phase 3 may introduce a PgRouter that picks read vs write pools.
- Distributed tracing: SQL queries should report span context for observability. Tracing is added in Phase 3 alongside the API server.
- Connection draining: graceful shutdown should let in-flight queries complete before closing pools. Currently we rely on process exit cleanup.
- Auditing: a who-changed-what-when log table. Phase 4 territory.
These are intentional gaps. Shipping v0.2 with the current shape is the priority; the layer is built to extend without breaking.