
How auth-server works

The Go auth server is the single source of truth for identity. Every client and every backend service that cares about “who is this user and what can they do?” ultimately gets its answer from here.

The one job

Issue JWTs, validate JWTs, and store the users, organizations, roles, and permissions those JWTs describe. Two pressures shape every design choice:

  1. Don’t be a bottleneck. Every API request downstream depends on being able to validate a token cheaply — so validation is local (HMAC signature) or cache-backed (Redis blacklist lookup), not a database hit per request.
  2. Carry enough claims in the token to answer the common questions inline. Roles, permissions, org context, and user identity are all in the access token, so consumers can answer “is this user a seller in this org?” without asking the auth-server.
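As an illustration of the second point, a consumer could answer "is this user a seller in this org?" purely from the decoded token payload. This is a minimal sketch: org_id is named later in this page, but the "roles" JSON key and the accessClaims, parseClaims, and hasRoleInOrg names are assumptions made for the example, and signature verification is assumed to have already happened.

```go
package main

import "encoding/json"

// accessClaims holds the subset of claims this sketch needs.
// The "roles" key is an assumed name; "org_id" comes from the text.
type accessClaims struct {
	OrgID string   `json:"org_id"`
	Roles []string `json:"roles"`
}

// parseClaims decodes an already-verified JWT payload (raw JSON bytes).
func parseClaims(payloadJSON []byte) (accessClaims, error) {
	var c accessClaims
	err := json.Unmarshal(payloadJSON, &c)
	return c, err
}

// hasRoleInOrg reports whether the token is scoped to orgID and
// carries the given role. No call to the auth-server is involved.
func hasRoleInOrg(c accessClaims, orgID, role string) bool {
	if c.OrgID != orgID {
		return false
	}
	for _, r := range c.Roles {
		if r == role {
			return true
		}
	}
	return false
}
```

A downstream handler would call hasRoleInOrg(claims, "org_acme", "seller") after verifying the signature, with no network round trip.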

The big picture

flowchart LR
  CLIENTS["browser · mobile · API"] -->|/api/v1/auth/*| SERVER
  SERVER -.-> PG[(Postgres<br/>users, orgs, roles, perms,<br/>refresh_tokens, sessions)]
  SERVER -.-> REDIS[(Redis<br/>token cache · blacklist<br/>rate limits · SSO state)]
  SERVER -.->|optional| EMAIL[Email provider]
  SERVER["<b>auth-server</b><br/>(Go · stateless HTTP)"]:::server
  classDef server fill:#2d4a2b,stroke:#5b8c4a,stroke-width:3px,color:#e8eaf0

The auth server is stateless HTTP. No persistent connections, WebSockets, or long-polling. Every request is independent. Horizontal scaling is “run more replicas”.

Redis is optional at the interface level (NoOpTokenCache fallback when Redis is down), but every production deployment uses it.

The JWT lifecycle

Issuance

When a user authenticates (password, SSO, refresh), the JWT service builds claims from:

  • The user’s row in users (uid, email, names).
  • If an organization_id was supplied: matching organization_members row + linked roles. Login + refresh fail loudly if the org isn’t found, the user isn’t a member, or the org is suspended.
  • If no org was supplied: the user’s global roles via user_base_roles.
  • If an app_code was supplied: matching apps row. User must have an active user_apps membership (auto-granted on first login when apps.auto_grant_on_signup=true).
  • Permissions: union across services in app.service_codes plus the always-included core service, filtered by the user’s roles.
  • Current per-user token-version counter (auth:user_tv:{user_id} in Redis) → tv claim.

These claims go into an HS256-signed JWT with JWT_ACCESS_SECRET. Access lifetime 15m, refresh 7d (30d with remember_me). Password-reset and email-verify tokens use distinct purpose-derived secrets (HMAC-SHA256 over the access secret with "ven-auth:purpose:password_reset" / "ven-auth:purpose:email_verification") plus distinct audiences — three layers prevent cross-purpose presentation.
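The purpose-derived secrets can be sketched with the standard library. The two label strings are quoted from the paragraph above; the function name is hypothetical, and the choice of the access secret as the HMAC key with the label as the message is an assumption about how "HMAC-SHA256 over the access secret" is wired.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
)

// Purpose labels as quoted in the text.
const (
	purposePasswordReset     = "ven-auth:purpose:password_reset"
	purposeEmailVerification = "ven-auth:purpose:email_verification"
)

// derivePurposeSecret returns HMAC-SHA256(key=accessSecret, msg=purpose).
// Assumption: the access secret is the key and the label is the message.
func derivePurposeSecret(accessSecret []byte, purpose string) []byte {
	mac := hmac.New(sha256.New, accessSecret)
	mac.Write([]byte(purpose))
	return mac.Sum(nil)
}
```

Because each purpose yields a different 32-byte key, a token signed for password reset fails signature verification when presented as an email-verification token or as an access token.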

Zero-downtime rotation. When JWT_ACCESS_SECRET_PREVIOUS / JWT_REFRESH_SECRET_PREVIOUS are set, validators try active first and fall back to previous on signature mismatch only. Each side rotates independently. See Development § Rotate JWT secrets for the operator runbook.
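The active-then-previous fallback might look like the sketch below. It only checks the HS256 signature of a compact JWT using the standard library; claim checks (exp, aud, alg header) are omitted, and all function and error names are hypothetical rather than taken from the auth-server code.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"errors"
	"strings"
)

var (
	errMalformed    = errors.New("malformed token")
	errBadSignature = errors.New("signature mismatch")
)

// signHS256 returns base64url(HMAC-SHA256(secret, signingInput)), unpadded.
func signHS256(signingInput string, secret []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(signingInput))
	return base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
}

// verifyHS256 checks the signature segment of header.payload.signature
// against a single secret, using a constant-time comparison.
func verifyHS256(token string, secret []byte) error {
	if strings.Count(token, ".") != 2 {
		return errMalformed
	}
	i := strings.LastIndex(token, ".")
	want := signHS256(token[:i], secret)
	if !hmac.Equal([]byte(want), []byte(token[i+1:])) {
		return errBadSignature
	}
	return nil
}

// verifyWithRotation tries the active secret first and falls back to the
// previous secret only when the failure is a signature mismatch.
func verifyWithRotation(token string, active, previous []byte) error {
	err := verifyHS256(token, active)
	if errors.Is(err, errBadSignature) && len(previous) > 0 {
		return verifyHS256(token, previous)
	}
	return err
}
```

Restricting the fallback to signature mismatches means a malformed token is rejected once instead of being retried against the old key.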

Refresh + rotation (family-aware)

Each refresh_tokens row carries family_id (original issuance this chain descends from) and parent_id (the row it rotated from). On POST /auth/refresh:

  1. Validate the JWT cryptographically.
  2. Look up the stored row by the tid claim.
  3. If the row is already revoked: reuse → presumed theft. Revoke every live row sharing family_id and return TokenRevoked (RFC 6819 §5.2.2.3). The legitimate user and the attacker both lose access; the user re-authenticates from scratch.
  4. Otherwise: revoke the presented row and mint a child with the same family_id + parent_id pointing back.
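Steps 2 to 4 can be sketched against an in-memory map. This is an illustration under stated assumptions, not the server's implementation: the store type, ID scheme, and error values are hypothetical, the family ID is assumed to equal the first row's ID, and a real version would run inside a database transaction.

```go
package main

import (
	"errors"
	"strconv"
)

var (
	errUnknownToken = errors.New("unknown refresh token")
	errTokenRevoked = errors.New("TokenRevoked")
)

// refreshRow mirrors the columns named in the text; other columns omitted.
type refreshRow struct {
	ID       string
	FamilyID string
	ParentID string
	Revoked  bool
}

// store is an in-memory stand-in for the refresh_tokens table.
type store struct {
	rows map[string]*refreshRow
	next int
}

func newStore() *store { return &store{rows: map[string]*refreshRow{}} }

func (s *store) newID() string {
	s.next++
	return "rt-" + strconv.Itoa(s.next)
}

// issue starts a new family, as a fresh login would.
func (s *store) issue() *refreshRow {
	id := s.newID()
	r := &refreshRow{ID: id, FamilyID: id}
	s.rows[id] = r
	return r
}

// rotate takes the tid claim of an already-verified refresh JWT.
func (s *store) rotate(tid string) (*refreshRow, error) {
	row, ok := s.rows[tid]
	if !ok {
		return nil, errUnknownToken
	}
	if row.Revoked {
		// Reuse of a revoked token: revoke the whole family.
		for _, r := range s.rows {
			if r.FamilyID == row.FamilyID {
				r.Revoked = true
			}
		}
		return nil, errTokenRevoked
	}
	// Normal path: revoke the presented row and mint a child.
	row.Revoked = true
	child := &refreshRow{ID: s.newID(), FamilyID: row.FamilyID, ParentID: row.ID}
	s.rows[child.ID] = child
	return child, nil
}
```

Replaying the first token after it has been rotated revokes the child as well, which is what forces both the legitimate user and the attacker back to a full login.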

Revocation — four mechanisms

  1. Per-row refresh_tokens.revoked=true — used on logout, terminate session, rotation.
  2. Family revoke — bulk revoke every live row sharing a family_id.
  3. Per-user token-version bump (auth:user_tv:{user_id} INCR in Redis). Validators reject access tokens with a stale tv claim. Because the counter lives in Redis, a bump applies across all replicas.
  4. Redis access-token blacklist (auth:blacklist:{jti}). Legacy per-jti path.
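The token-version check (mechanism 3) reduces to comparing the token's tv claim with the current counter. The sketch below uses an in-memory map in place of Redis; the interface, type, and function names are hypothetical, and treating "stale" as "claim lower than the current counter" is an assumption.

```go
package main

// tvKey builds the Redis key named in the text for a user's counter.
func tvKey(userID string) string { return "auth:user_tv:" + userID }

// versionStore abstracts the counter lookup so a Redis-backed
// implementation could be swapped in.
type versionStore interface {
	Current(userID string) (int64, error)
}

// memVersions is an in-memory stand-in for Redis, keyed by tvKey.
type memVersions map[string]int64

func (m memVersions) Current(userID string) (int64, error) {
	return m[tvKey(userID)], nil
}

// Bump mimics INCR: it increments and returns the new counter value.
func (m memVersions) Bump(userID string) int64 {
	m[tvKey(userID)]++
	return m[tvKey(userID)]
}

// tokenVersionFresh reports whether a token's tv claim is still current.
// Assumption: a token is stale when its claim is below the stored counter.
func tokenVersionFresh(vs versionStore, userID string, tvClaim int64) (bool, error) {
	cur, err := vs.Current(userID)
	if err != nil {
		return false, err
	}
	return tvClaim >= cur, nil
}
```

A single Bump invalidates every outstanding access token for that user at once, without tracking individual jti values.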

The multi-tenant model

Users live independently of organizations. A user can belong to zero, one, or many organizations, with different roles in each.

User (jane@acme.com)
├── base role: "base_user" (global)
├── membership in Org "Acme"
│   └── roles: ["org_admin", "seller"]
└── membership in Org "BuildCo"
    └── roles: ["buyer"]

The access token is scoped to at most one organization at a time:

  • No org context: token carries only base_user global roles.
  • Org context: token claims include org_id, org_slug, org_name, plus the roles and permissions the user has in that org.

Switching orgs means calling POST /auth/refresh with a different organization_id. The server issues a new token pair scoped to the new org. The client doesn’t need to log out and back in.

Seeded roles

Code          Level  Scope
system_admin  0      platform
super_admin   5      platform
org_admin     10     org
org_manager   20     org
seller        50     org
buyer         60     org
org_member    80     org
base_user     100    platform

SSO flow (with PKCE)

  1. POST /auth/sso/url { provider, redirect_url, code_challenge?, code_challenge_method? } → server returns { auth_url, state }.
  2. Client navigates to auth_url; provider redirects back to redirect_url with ?code=...&state=....
  3. POST /auth/sso/callback { code, state, provider } → server validates state (atomic GETDEL), exchanges code, fetches user info, upserts user.
  4. Without PKCE → response is the standard LoginResponse. With PKCE → response is { auth_code, expires_in: 60 }.
  5. (PKCE only) POST /auth/sso/exchange { auth_code, code_verifier } → server verifies BASE64URL(SHA256(verifier)) == challenge, mints token pair from fresh user/org state.
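The check in step 5 corresponds to the S256 method from RFC 7636. A minimal sketch with the standard library follows; the function names are hypothetical, and unpadded base64url is used as RFC 7636 specifies.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/base64"
)

// challengeS256 computes BASE64URL(SHA256(verifier)) without padding.
func challengeS256(verifier string) string {
	sum := sha256.Sum256([]byte(verifier))
	return base64.RawURLEncoding.EncodeToString(sum[:])
}

// verifyPKCE compares the recomputed challenge with the one stored at
// /auth/sso/url time, using a constant-time comparison.
func verifyPKCE(verifier, storedChallenge string) bool {
	got := challengeS256(verifier)
	return subtle.ConstantTimeCompare([]byte(got), []byte(storedChallenge)) == 1
}
```

The client generates the verifier, sends only the challenge in step 1, and reveals the verifier in step 5, so an intercepted auth_code is useless on its own.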

Rate limiting — three layers

  1. Per-IP (auth:ratelimit:{client_ip}, 100/min default).
  2. Per-account sliding window keyed on sha256(email) (20/h default).
  3. Account lockout in Postgres (5 failures in 15m → 15m lockout).

When any layer trips: 429 Too Many Requests + Retry-After.
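The per-account layer can be illustrated with an in-memory sliding window. This is a sketch only: the type and function names are hypothetical, lowercasing and trimming the email before hashing is an assumption, and a production limiter would keep its state in Redis rather than a Go map.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"strings"
)

// accountKey hashes the email so raw addresses never appear in limiter keys.
// Assumption: the email is normalized (trimmed, lowercased) before hashing.
func accountKey(email string) string {
	sum := sha256.Sum256([]byte(strings.ToLower(strings.TrimSpace(email))))
	return hex.EncodeToString(sum[:])
}

// slidingWindow allows at most limit attempts per key in any window.
type slidingWindow struct {
	limit  int
	window int64 // seconds
	hits   map[string][]int64
}

func newSlidingWindow(limit int, windowSeconds int64) *slidingWindow {
	return &slidingWindow{limit: limit, window: windowSeconds, hits: map[string][]int64{}}
}

// allow records an attempt at time now (unix seconds). It returns whether
// the attempt is allowed and, if not, a Retry-After value in seconds.
func (s *slidingWindow) allow(key string, now int64) (bool, int64) {
	if s.limit <= 0 {
		return false, s.window
	}
	cutoff := now - s.window
	kept := s.hits[key][:0]
	for _, t := range s.hits[key] {
		if t > cutoff { // keep only attempts still inside the window
			kept = append(kept, t)
		}
	}
	if len(kept) >= s.limit {
		s.hits[key] = kept
		// The oldest kept attempt leaves the window at kept[0]+window.
		return false, kept[0] + s.window - now
	}
	s.hits[key] = append(kept, now)
	return true, 0
}
```

With the documented default, newSlidingWindow(20, 3600) would model the 20/h account limit, and the second return value maps directly onto the Retry-After header.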

What it doesn’t do

  • Not an OIDC provider. No JWKS, no RS256. HS256 with a shared secret because every consumer is first-party.
  • No WebSockets.
  • No native multi-environment pooling. One Go binary talks to one Postgres. Different environments run separate deployments.
  • Soft-deletes only. Hard delete requires POST /admin/users/{id}/hard plus ownership transfer.