# Scaling Guide
Contract Lucidity is designed to scale from a single-server demo to an enterprise deployment serving thousands of users. This guide covers vertical scaling, horizontal scaling, and the architectural patterns that support each.
## Architecture Overview

## Vertical Scaling

The simplest way to increase capacity: give your existing server more resources.

### Worker Concurrency
The most impactful vertical scaling lever is `CELERY_CONCURRENCY` -- the number of document-processing worker processes running in the worker container. It is set as an environment variable (default: 2).
```bash
# In .env or docker-compose.yml
CELERY_CONCURRENCY=8
```
Or override in docker-compose.yml:

```yaml
cl-worker:
  command: celery -A app.celery_app worker --loglevel=info --concurrency=8
```
After changing, restart the worker:

```bash
docker compose restart cl-worker
```
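To confirm the new pool size took effect, you can query the running worker; the `app.celery_app` module path matches the compose command shown above:

```bash
# Ask the running worker for its stats; the "pool" section reports max-concurrency.
docker compose exec cl-worker celery -A app.celery_app inspect stats
```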
### Concurrency Sizing Matrix
| Celery Workers | RAM | CPUs | Typical Throughput | Use Case |
|---|---|---|---|---|
| 2 | 4 GB | 2 | ~5 docs/hour | Demo / small team (< 10 users) |
| 4 | 8 GB | 2-4 | ~12 docs/hour | Small firm (10-50 users) |
| 8 | 16 GB | 4-8 | ~25 docs/hour | Mid-size firm (50-200 users) |
| 16 | 32 GB | 8+ | ~50 docs/hour | Am Law 200 (200-500 users) |
| 32+ | 64 GB+ | 16+ | ~100+ docs/hour | Am Law 100 (500+ users) |
The throughput numbers above assume your AI provider's rate limits can sustain the load. Each document makes 3-6 AI API calls. At 16 concurrent workers, you need a minimum of roughly 100 requests per minute (RPM) from your AI provider. See the AI Provider docs for per-tier rate limit details.
### Memory Considerations
Each Celery worker process consumes approximately:
| Component | Memory per Worker |
|---|---|
| Base Python process | ~150 MB |
| Document text in memory | ~10-50 MB (depends on document size) |
| AI SDK overhead | ~50 MB |
| Total per worker | ~250-350 MB |
Formula: `Required RAM = (CELERY_CONCURRENCY * 350 MB) + 2 GB` (OS + other containers)

For example, with 8 workers: (8 * 350) + 2000 = 4800 MB, or roughly 5 GB minimum.
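As a quick sketch, the formula can be evaluated in the shell; the 350 MB and 2 GB figures are the per-worker and overhead estimates from the table above:

```bash
# Estimate minimum RAM in MB for a given CELERY_CONCURRENCY.
CONCURRENCY=8
PER_WORKER_MB=350     # base process + document text + AI SDK overhead
OVERHEAD_MB=2000      # OS and the other containers
echo "Minimum RAM: $(( CONCURRENCY * PER_WORKER_MB + OVERHEAD_MB )) MB"
# → Minimum RAM: 4800 MB
```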
Setting `CELERY_CONCURRENCY` higher than your available CPU cores will cause contention and may slow down processing rather than speed it up. The extraction stage (OCR via Tesseract) is CPU-intensive.
## Horizontal Scaling
When a single server reaches its limits, scale horizontally by adding more instances.
### Multiple Worker Instances
The easiest horizontal scaling path. Celery workers are stateless and compete for tasks from the same Redis queue.
```yaml
# docker-compose.override.yml for multiple workers
services:
  cl-worker-1:
    extends:
      service: cl-worker
    container_name: cl-worker-1
    environment:
      - CELERY_CONCURRENCY=8
  cl-worker-2:
    extends:
      service: cl-worker
    container_name: cl-worker-2
    environment:
      - CELERY_CONCURRENCY=8
  cl-worker-3:
    extends:
      service: cl-worker
    container_name: cl-worker-3
    environment:
      - CELERY_CONCURRENCY=8
```
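Because all the workers drain one queue, you can watch the backlog directly in Redis. The container name `cl-redis` and the default Celery queue name `celery` are assumptions here; adjust to your deployment:

```bash
# Number of tasks waiting in the default Celery queue.
docker exec cl-redis redis-cli llen celery
```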
Requirements for multi-worker scaling:
- All workers must share the same Redis instance (broker)
- All workers must share the same PostgreSQL database
- All workers must have access to the same document storage volume (`/data/storage`)

If workers cannot access the same `/data/storage` path, the extraction stage will fail with "Package not found at /data/storage/...". Use NFS, EFS (AWS), Azure Files, or a similar shared filesystem.
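A quick way to verify the volume really is shared is to write a marker file from one worker container and read it from another (container names follow the override file above):

```bash
# Write from worker 1, read from worker 2; the second command should print "ok".
docker exec cl-worker-1 sh -c 'echo ok > /data/storage/.shared-check'
docker exec cl-worker-2 cat /data/storage/.shared-check
docker exec cl-worker-1 rm /data/storage/.shared-check
```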
### Multiple Backend Instances
The backend is stateless (sessions use JWT tokens, not server-side state). Add instances behind a load balancer:
```yaml
services:
  cl-backend-1:
    extends:
      service: cl-backend
    container_name: cl-backend-1
  cl-backend-2:
    extends:
      service: cl-backend
    container_name: cl-backend-2
```
When running multiple backend instances, only one should run database migrations on startup. Use a leader election mechanism or run migrations manually before scaling:
```bash
docker exec cl-backend-1 alembic upgrade head
```
Then start additional instances with migrations disabled (or accept that redundant migration runs are safe -- Alembic uses a version table to prevent re-running).
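One way to gate migrations is a small entrypoint wrapper. This is only a sketch: the `RUN_MIGRATIONS` flag and the `uvicorn app.main:app` launch target are assumptions for illustration, not part of the shipped image:

```bash
#!/bin/sh
# Hypothetical entrypoint: run Alembic migrations only on the designated
# instance, then start the API server.
set -e
if [ "${RUN_MIGRATIONS:-false}" = "true" ]; then
  alembic upgrade head
fi
exec uvicorn app.main:app --host 0.0.0.0 --port 8000
```

Set `RUN_MIGRATIONS=true` on exactly one instance and leave it unset everywhere else.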
### Frontend Scaling
The Next.js frontend is stateless. Scale by adding instances behind a load balancer:
```yaml
services:
  cl-frontend-1:
    extends:
      service: cl-frontend
    container_name: cl-frontend-1
    ports:
      - "3001:3000"
  cl-frontend-2:
    extends:
      service: cl-frontend
    container_name: cl-frontend-2
    ports:
      - "3002:3000"
```
Place behind a reverse proxy (Nginx, Caddy, Traefik) or cloud load balancer.
## Database Scaling

### Connection Pooling
As you add backend and worker instances, database connections multiply. PostgreSQL's default `max_connections` (100) can be exhausted.
Options:

- Increase `max_connections` in the PostgreSQL config (simple but limited)
- Use PgBouncer as a connection pooler (recommended for > 8 total service instances)
```yaml
# Add PgBouncer to docker-compose
cl-pgbouncer:
  image: edoburu/pgbouncer:latest
  container_name: cl-pgbouncer
  environment:
    DATABASE_URL: "postgresql://cl_user:cl_password_change_me@cl-postgres:5432/contract_lucidity"
    MAX_CLIENT_CONN: 500
    DEFAULT_POOL_SIZE: 25
    POOL_MODE: transaction
  ports:
    - "6432:6432"
  depends_on:
    - cl-postgres
  networks:
    - cl-network
```
Then set `POSTGRES_HOST=cl-pgbouncer` and `POSTGRES_PORT=6432` in your `.env`.
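To confirm traffic is flowing through the pooler, open a connection via port 6432 using the credentials from the snippet above (here borrowing the `psql` client inside the Postgres container):

```bash
# A successful "SELECT 1" through cl-pgbouncer proves the pool is routing queries.
docker exec cl-postgres psql "postgresql://cl_user:cl_password_change_me@cl-pgbouncer:6432/contract_lucidity" -c "SELECT 1;"
```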
### Read Replicas
For read-heavy workloads (large teams viewing documents simultaneously), offload read queries to PostgreSQL replicas.
Read replicas require application-level routing (separate connection strings for reads vs writes). This is not currently built into CL but can be implemented with a PostgreSQL proxy like PgPool-II or at the infrastructure level with AWS RDS read replicas or Azure read replicas.
## Cloud-Specific Scaling Patterns

### AWS
| Component | Service | Scaling Method |
|---|---|---|
| Frontend | ECS Fargate / EKS | Auto-scaling based on CPU |
| Backend | ECS Fargate / EKS | Auto-scaling based on request count |
| Worker | ECS Fargate / EKS | Auto-scaling based on Redis queue depth |
| Database | RDS PostgreSQL | Vertical (instance class) + read replicas |
| Storage | EFS | Automatic (shared across instances) |
| Redis | ElastiCache | Vertical (node type) |
### Azure
| Component | Service | Scaling Method |
|---|---|---|
| Frontend | Azure Container Apps | Auto-scaling based on HTTP traffic |
| Backend | Azure Container Apps | Auto-scaling based on HTTP traffic |
| Worker | Azure Container Apps | KEDA scaling based on Redis queue length |
| Database | Azure Database for PostgreSQL Flexible Server | Vertical + read replicas |
| Storage | Azure Files (Premium) | Shared across instances |
| Redis | Azure Cache for Redis | Vertical (tier) |
### Kubernetes (Any Cloud)
```yaml
# HPA for worker pods based on Redis queue depth
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cl-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cl-worker
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: redis_celery_queue_length
        target:
          type: AverageValue
          averageValue: "5"
```
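External metrics are not built into Kubernetes: an HPA like this assumes a metrics adapter (KEDA or prometheus-adapter, for example) is exporting `redis_celery_queue_length`. You can sanity-check the raw value the adapter should report by reading it straight from Redis; the `cl-redis` Deployment name and the default Celery queue name `celery` are assumptions:

```bash
# Length of the default Celery queue, read directly from the Redis pod.
kubectl exec deploy/cl-redis -- redis-cli llen celery
```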
## Scaling Decision Flowchart

## Benchmarking

Before scaling, establish baselines:
```bash
# Measure pipeline throughput:
# upload 10 test documents and measure total time.
START=$(date +%s)
# ... upload documents ...
# ... wait for all to complete ...
END=$(date +%s)
echo "Throughput: 10 documents in $((END-START)) seconds"

# Monitor during load test
docker stats --no-stream --filter "name=cl-"
```
| Metric | How to Measure | Target |
|---|---|---|
| Pipeline throughput | Documents completed per hour | Scales linearly with workers |
| API response time (p95) | Load testing with k6/vegeta | < 500ms for read endpoints |
| Time to first result | Upload to COMPLETE | < 3 min for a 20-page document |
| Concurrent users | Load test with realistic browsing | Scale frontend/backend instances |
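For the p95 API target in the table, a quick first pass with vegeta looks like this; the endpoint path and port are placeholders, so substitute a real read endpoint from your deployment:

```bash
# 50 req/s for 30 s against a read endpoint, followed by a latency report.
echo "GET http://localhost:8000/api/v1/documents" | \
  vegeta attack -rate=50 -duration=30s | \
  vegeta report
```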