# Scaling Guide
Contract Lucidity is designed to scale from a single-server demo to an enterprise deployment serving thousands of users. This guide covers vertical scaling, horizontal scaling, and the architectural patterns that support each.
## Architecture Overview

## Vertical Scaling

The simplest way to increase capacity: give your existing server more resources.

### Worker Concurrency
The most impactful vertical scaling lever is `CELERY_CONCURRENCY` -- the number of document-processing worker processes running in the worker container. It is set as an environment variable (default: 2).
```bash
# In .env or docker-compose.yml
CELERY_CONCURRENCY=8
```
Or override in docker-compose.yml:

```yaml
cl-worker:
  command: celery -A app.celery_app worker --loglevel=info --concurrency=8
```
After changing, restart the worker:

```bash
docker compose restart cl-worker
```
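To confirm the new pool size took effect, you can query the running worker; the `app.celery_app` module path matches the compose command shown above:

```bash
# Ask the running worker for its stats; the "pool" section reports max-concurrency.
docker compose exec cl-worker celery -A app.celery_app inspect stats
```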
### Concurrency Sizing Matrix
| Celery Workers | RAM | CPUs | Typical Throughput | Use Case |
|---|---|---|---|---|
| 2 | 4 GB | 2 | ~5 docs/hour | Demo / small team (< 10 users) |
| 4 | 8 GB | 2-4 | ~12 docs/hour | Small firm (10-50 users) |
| 8 | 16 GB | 4-8 | ~25 docs/hour | Mid-size firm (50-200 users) |
| 16 | 32 GB | 8+ | ~50 docs/hour | Am Law 200 (200-500 users) |
| 32+ | 64 GB+ | 16+ | ~100+ docs/hour | Am Law 100 (500+ users) |
The throughput numbers above assume your AI provider's rate limits can sustain the load. Each document makes 3-6 AI API calls. At 16 concurrent workers, you need a minimum of roughly 100 requests per minute (RPM) from your AI provider. See the AI Provider docs for per-tier rate limit details.
### Memory Considerations
Each Celery worker process consumes approximately:
| Component | Memory per Worker |
|---|---|
| Base Python process | ~150 MB |
| Document text in memory | ~10-50 MB (depends on document size) |
| AI SDK overhead | ~50 MB |
| Total per worker | ~250-350 MB |
Formula: `Required RAM = (CELERY_CONCURRENCY * 350 MB) + 2 GB` (OS + other containers)

For example, with 8 workers: (8 * 350) + 2000 = 4800 MB, or roughly 5 GB minimum.
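As a quick sketch, the formula can be evaluated in the shell; the 350 MB and 2 GB figures are the per-worker and overhead estimates from the table above:

```bash
# Estimate minimum RAM in MB for a given CELERY_CONCURRENCY.
CONCURRENCY=8
PER_WORKER_MB=350     # base process + document text + AI SDK overhead
OVERHEAD_MB=2000      # OS and the other containers
echo "Minimum RAM: $(( CONCURRENCY * PER_WORKER_MB + OVERHEAD_MB )) MB"
# → Minimum RAM: 4800 MB
```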
Setting `CELERY_CONCURRENCY` higher than your available CPU cores will cause contention and may slow down processing rather than speed it up. The extraction stage (OCR via Tesseract) is CPU-intensive.
## Horizontal Scaling
When a single server reaches its limits, scale horizontally by adding more instances.
### Multiple Worker Instances
The easiest horizontal scaling path. Celery workers are stateless and compete for tasks from the same Redis queue.
```yaml
# docker-compose.override.yml for multiple workers
services:
  cl-worker-1:
    extends:
      service: cl-worker
    container_name: cl-worker-1
    environment:
      - CELERY_CONCURRENCY=8
  cl-worker-2:
    extends:
      service: cl-worker
    container_name: cl-worker-2
    environment:
      - CELERY_CONCURRENCY=8
  cl-worker-3:
    extends:
      service: cl-worker
    container_name: cl-worker-3
    environment:
      - CELERY_CONCURRENCY=8
```
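Because all the workers drain one queue, you can watch the backlog directly in Redis. The container name `cl-redis` and the default Celery queue name `celery` are assumptions here; adjust to your deployment:

```bash
# Number of tasks waiting in the default Celery queue.
docker exec cl-redis redis-cli llen celery
```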
Requirements for multi-worker scaling:
- All workers must share the same Redis instance (broker)
- All workers must share the same PostgreSQL database
- All workers must have access to the same document storage volume (`/data/storage`)

If workers cannot access the same `/data/storage` path, the extraction stage will fail with "Package not found at /data/storage/...". Use NFS, EFS (AWS), Azure Files, or a similar shared filesystem.
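A quick way to verify the volume really is shared is to write a marker file from one worker container and read it from another (container names follow the override file above):

```bash
# Write from worker 1, read from worker 2; the second command should print "ok".
docker exec cl-worker-1 sh -c 'echo ok > /data/storage/.shared-check'
docker exec cl-worker-2 cat /data/storage/.shared-check
docker exec cl-worker-1 rm /data/storage/.shared-check
```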
### Multiple Backend Instances
The backend is stateless (sessions use JWT tokens, not server-side state). Add instances behind a load balancer:
```yaml
services:
  cl-backend-1:
    extends:
      service: cl-backend
    container_name: cl-backend-1
  cl-backend-2:
    extends:
      service: cl-backend
    container_name: cl-backend-2
```
When running multiple backend instances, only one should run database migrations on startup. Use a leader election mechanism or run migrations manually before scaling:
```bash
docker exec cl-backend-1 alembic upgrade head
```
Then start additional instances with migrations disabled (or accept that redundant migration runs are safe -- Alembic uses a version table to prevent re-running).
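One way to gate migrations is a small entrypoint wrapper. This is only a sketch: the `RUN_MIGRATIONS` flag and the `uvicorn app.main:app` launch target are assumptions for illustration, not part of the shipped image:

```bash
#!/bin/sh
# Hypothetical entrypoint: run Alembic migrations only on the designated
# instance, then start the API server.
set -e
if [ "${RUN_MIGRATIONS:-false}" = "true" ]; then
  alembic upgrade head
fi
exec uvicorn app.main:app --host 0.0.0.0 --port 8000
```

Set `RUN_MIGRATIONS=true` on exactly one instance and leave it unset everywhere else.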
### Frontend Scaling
The Next.js frontend is stateless. Scale by adding instances behind a load balancer:
```yaml
services:
  cl-frontend-1:
    extends:
      service: cl-frontend
    container_name: cl-frontend-1
    ports:
      - "3001:3000"
  cl-frontend-2:
    extends:
      service: cl-frontend
    container_name: cl-frontend-2
    ports:
      - "3002:3000"
```
Place behind a reverse proxy (Nginx, Caddy, Traefik) or cloud load balancer.
## Database Scaling

### Connection Pooling
As you add backend and worker instances, database connections multiply. PostgreSQL's default `max_connections` (100) can be exhausted.
Options:

- Increase `max_connections` in the PostgreSQL config (simple but limited)
- Use PgBouncer as a connection pooler (recommended for > 8 total service instances)
```yaml
# Add PgBouncer to docker-compose
cl-pgbouncer:
  image: edoburu/pgbouncer:latest
  container_name: cl-pgbouncer
  environment:
    DATABASE_URL: "postgresql://cl_user:cl_password_change_me@cl-postgres:5432/contract_lucidity"
    MAX_CLIENT_CONN: 500
    DEFAULT_POOL_SIZE: 25
    POOL_MODE: transaction
  ports:
    - "6432:6432"
  depends_on:
    - cl-postgres
  networks:
    - cl-network
```
Then set `POSTGRES_HOST=cl-pgbouncer` and `POSTGRES_PORT=6432` in your `.env`.
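To confirm traffic is flowing through the pooler, open a connection via port 6432 using the credentials from the snippet above (here borrowing the `psql` client inside the Postgres container):

```bash
# A successful "SELECT 1" through cl-pgbouncer proves the pool is routing queries.
docker exec cl-postgres psql "postgresql://cl_user:cl_password_change_me@cl-pgbouncer:6432/contract_lucidity" -c "SELECT 1;"
```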
### Read Replicas
For read-heavy workloads (large teams viewing documents simultaneously), offload read queries to PostgreSQL replicas.
Read replicas require application-level routing (separate connection strings for reads vs writes). This is not currently built into CL but can be implemented with a PostgreSQL proxy like PgPool-II or at the infrastructure level with AWS RDS read replicas or Azure read replicas.
## Cloud-Specific Scaling Patterns

### AWS
| Component | Service | Scaling Method |
|---|---|---|
| Frontend | ECS Fargate / EKS | Auto-scaling based on CPU |
| Backend | ECS Fargate / EKS | Auto-scaling based on request count |
| Worker | ECS Fargate / EKS | Auto-scaling based on Redis queue depth |
| Database | RDS PostgreSQL | Vertical (instance class) + read replicas |
| Storage | EFS | Automatic (shared across instances) |
| Redis | ElastiCache | Vertical (node type) |
### Azure
| Component | Service | Scaling Method |
|---|---|---|
| Frontend | Azure Container Apps | Auto-scaling based on HTTP traffic |
| Backend | Azure Container Apps | Auto-scaling based on HTTP traffic |
| Worker | Azure Container Apps | KEDA scaling based on Redis queue length |
| Database | Azure Database for PostgreSQL Flexible Server | Vertical + read replicas |
| Storage | Azure Files (Premium) | Shared across instances |
| Redis | Azure Cache for Redis | Vertical (tier) |
### Kubernetes (Any Cloud)
```yaml
# HPA for worker pods based on Redis queue depth
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cl-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cl-worker
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: redis_celery_queue_length
        target:
          type: AverageValue
          averageValue: "5"
```
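External metrics are not built into Kubernetes: an HPA like this assumes a metrics adapter (KEDA or prometheus-adapter, for example) is exporting `redis_celery_queue_length`. You can sanity-check the raw value the adapter should report by reading it straight from Redis; the `cl-redis` Deployment name and the default Celery queue name `celery` are assumptions:

```bash
# Length of the default Celery queue, read directly from the Redis pod.
kubectl exec deploy/cl-redis -- redis-cli llen celery
```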
## Scaling Decision Flowchart

## Benchmarking

Before scaling, establish baselines:
```bash
# Measure pipeline throughput:
# upload 10 test documents and measure total time.
START=$(date +%s)
# ... upload documents ...
# ... wait for all to complete ...
END=$(date +%s)
echo "Throughput: 10 documents in $((END-START)) seconds"

# Monitor during load test
docker stats --no-stream --filter "name=cl-"
```
| Metric | How to Measure | Target |
|---|---|---|
| Pipeline throughput | Documents completed per hour | Scales linearly with workers |
| API response time (p95) | Load testing with k6/vegeta | < 500ms for read endpoints |
| Time to first result | Upload to COMPLETE | < 3 min for a 20-page document |
| Concurrent users | Load test with realistic browsing | Scale frontend/backend instances |
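For the p95 API target in the table, a quick first pass with vegeta looks like this; the endpoint path and port are placeholders, so substitute a real read endpoint from your deployment:

```bash
# 50 req/s for 30 s against a read endpoint, followed by a latency report.
echo "GET http://localhost:8000/api/v1/documents" | \
  vegeta attack -rate=50 -duration=30s | \
  vegeta report
```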