# The Agent's Dictionary

> AI builds it working. This makes it safe. Opinionated rules for building a secure, production-ready web app without over-engineering it. Point your coding agent here before it builds.

## Core principle

> Software is simple. The hard part is the idea, not the building. Build the simplest thing that ships the idea, correctly and securely, for where you are now, without cornering yourself later. Get the foundations that are cheap now and expensive later right from day one. Don't add things you don't yet need.
>
> Defaults this dictionary assumes throughout: boring proven tech over clever tech; one app and one Postgres database and no cache until something measured forces otherwise; build for the next 12 months, not the next 5 years; validate input at the edges.

## How an agent should use this

Follow this before and during any build.

1. New build, or existing app? A new build (an idea, nothing built yet) follows the build order in Foundations, reading each section's rules as you reach that phase. An existing app (code already exists) follows "Working on an existing app" in Foundations: audit against the rules, report the gaps, then change the least needed to close them safely.
2. Which pathway? Establish it and state it (see "Which pathway are you on?" in Foundations): AWS, a managed host (Railway/Render/Fly), or non-dev. Default to managed if unstated, and say so.
3. Consult per phase, not once. Re-read the relevant section's global ruling (always) plus the one pathway block for the chosen path at each phase: schema, auth, API, frontend, security, deploy. Reading it after you've written the code just documents the mistake.
4. The rules are binding defaults. Follow them unless the user explicitly overrides one; then the user wins. State the pathway and any override you are operating under so the user can correct you.
5. Fetch only what you need. This is an index plus per-section pages.
Fetch the sections the task touches rather than loading everything every time. If you'd rather have the whole dictionary in one call, fetch `/llms-full.txt`.

## Not a developer? Start here

New to this and you just want your idea live? You're the non-dev pathway. A "pathway" is just where your app runs: AWS (most power, most complexity), a managed host like Railway or Render (easiest for a real app with a database), or non-dev (get it live with the least fuss). When in doubt, tell your agent to use the managed pathway.

Get-live shortlist (as of June 2026, platforms change their plans, so check current pricing): deploy from a GitHub repo to Railway or Render. Both deploy straight from your repo and offer a managed Postgres. Budget a few dollars a month once you add an always-on database, not zero: the free tiers are for trying it out, not running it, since they pause the app when it's idle or expire the database after a month. For a purely static site (no server or database), use Netlify or Cloudflare Pages, which have genuine free tiers. Pick one, don't shop forever.

Never (the short list): don't put passwords or API keys in your code, don't save uploaded files on the server, don't build your own login or payments, don't skip database backups, and don't put your database on the public internet. Your agent handles all of these, and the sections below tell it how.

## 1. Foundations

### Build order (starting from an idea)

For a new build, work in this order, reading each section as you reach it. Earlier choices constrain later ones, so don't skip ahead.

1. Pathway and the data-architecture questions: establish where it runs and what data is in play. These drive everything, so answer them before writing a line of schema.
2. Data & database: model the data, pick IDs, plan migrations.
3. Auth & access control: provider, sessions, RBAC, tenancy.
4. APIs: endpoints, validation, error shape, idempotency.
5. Frontend & rendering, then UI, forms & UX: only what the product needs.
6. Security pass: walk Security against everything you have built (validation, SSRF if you fetch URLs, secrets, IAM, uploads). This is not optional polish.
7. Deployment & CI/CD plus Observability: the production-ready checklist is the gate to ship.
8. Accessibility, SEO, performance, privacy: apply as the surfaces they cover get built, don't bolt them on at the end.

Don't add anything from Scaling yet. Those are deferred until a measured need appears.

### Working on an existing app

When the code already exists, you are not rebuilding it to match this dictionary. You are making the smallest safe change that moves it toward the rulings, security first.

**Do:**

- Audit before you touch anything. Check the area you're working in against the relevant sections and list the gaps, especially security: secrets in code, missing validation, SQL built by string, public buckets, sequential IDs in URLs, no rate limiting on auth.
- Report the gaps and let the user pick what to fix, rather than silently rewriting.
- Fix security gaps first. Those are the ones that hurt; style and structure gaps are optional.
- Make the smallest diff that closes the gap. Match the codebase's existing patterns where they are sound, and don't reformat or re-architect working code as a side effect.
- For anything risky (schema changes, auth changes, swapping a dependency), use the safe paths already in this dictionary: expand-contract migrations, backward-compatible deploys, a tested rollback. Assume the table is large and live.

**Never:**

- Rewrite a working module to "bring it up to standard" when the user asked for one change.
- Apply a ruling in a way that breaks current behaviour. A rule that takes the app down is worse than the gap it closed.
- Introduce a breaking schema or API change in one step. Expand-contract, always.

**Why:** Existing apps have users, data, and working behaviour.
The dictionary's value here is catching real risks and closing them without becoming the thing that broke production.

### Which pathway are you on?

**Do:** Before building, establish the deployment pathway and state it. Ask the user: "Where will this run, AWS, a managed host (Railway/Render/Fly), or do you just want it live with the least fuss (you're not a developer)?" If they don't answer or don't know, default to the Managed platform pathway and say so. For each topic, read the Global ruling (always) plus the single Pathway block matching the chosen path, and ignore the other pathways.

**Never:** Guess the pathway silently, or apply one platform's specifics (such as AWS RDS Proxy) on a different platform.

**Why:** The rulings are universal, but secrets, storage, email, pooling, queues, and deploy differ by platform. Applying the wrong platform's specifics is as bad as ignoring the rule.

**Escape hatch:** A user who names their stack overrides everything. The default is only for when they don't.

### Before you build anything, the data-architecture questions

**Do:** Before writing any schema or CRUD, answer these for the data in play (and where it will run), and let the answers drive the model:

1. Do we even NEED to store this at all? (The cheapest data is the data you don't keep.)
2. What is the data, and what shape?
3. How much, and at what growth rate?
4. Who owns the truth, what is the system of record?
5. How available must it be?
6. How sensitive is it, PII, secrets, regulated?
7. Can we trust it, where and how is it validated? (See the edge-validation rule.)
8. Who can read it, and who can write it?
9. How long is it kept, retention and deletion?
10. How does it get IN and OUT, ingest and export?
11. Where will this run, AWS, a managed platform (Railway/Render/Fly), or non-dev (just get it live)? This selects the pathway (see "Which pathway are you on?").

**Never:** Jump straight to a table plus generated CRUD because the entity "obviously" needs storing.
**Why:** Schema, indexes, access control, and retention are nearly impossible to retrofit cleanly once data exists; these answers are the design, not paperwork before it.

**Before (agent cold):**

```sql
-- "We have users, so:"
CREATE TABLE users (
  id         SERIAL PRIMARY KEY,
  email      TEXT,
  password   TEXT,
  ssn        TEXT,
  created_at TIMESTAMP DEFAULT now()
);
-- + auto-generated create/read/update/delete for every column
```

**After (this dictionary):**

```text
Q1 Need it?     Yes, system of record for accounts.
Q4 Owner:       us.
Q6 Sensitivity: email = PII; SSN = regulated, do we even need it? No. Drop it.
Q7 Trust:       email validated at the API edge.
Q8 Access:      user reads own row; only auth service writes credentials.
Q9 Retention:   delete 30 days after account closure.
```

```sql
CREATE EXTENSION IF NOT EXISTS citext;  -- citext is a contrib extension, not a built-in type

CREATE TABLE users (
  id            uuid PRIMARY KEY DEFAULT uuidv7(),    -- time-ordered PK; PG 18+ (see note)
  email         citext NOT NULL UNIQUE,               -- PII, validated at edge
  password_hash text NOT NULL,                        -- never store plaintext
  created_at    timestamptz NOT NULL DEFAULT now(),
  deleted_at    timestamptz                           -- soft-delete drives 30-day retention job
);
-- No SSN column: the question "do we need it?" removed an entire liability.
-- uuidv7() is built in on PostgreSQL 18+ (GA Sep 2025) and is time-ordered, so it stays
-- index-friendly as a primary key. On PG < 18, generate UUIDv7 app-side or with the
-- pg_uuidv7 extension (see IDs). gen_random_uuid() (UUIDv4, built in since PG 13, no
-- extension) is correct but randomly distributed, which fragments the PK index.
```

### The boring-tech default

**Do:** Choose proven, widely-deployed technology with well-understood failure modes over new or clever technology. Default to Postgres, a monolith, and a managed auth provider.

**Never:** Reach for the novel database/framework/runtime because it's interesting or benchmarks well in a blog post.

**Why:** Novelty is a real cost paid in unknown failure modes, thin documentation, and a small pool of people (and answers) when it breaks at 2am.
**Escape hatch:** Adopt the new thing only when the boring option provably cannot do the job, measured, not assumed.

## 2. Project setup

### Getting started: where to deploy

**Global (every pathway):**

**Do:** Deploy from a Git repo via CI to a platform that runs your app as a stateless container (or static bundle) next to a managed Postgres. The deploy must be repeatable and one-command; the database must be managed (automatic backups + point-in-time recovery you don't hand-roll). Choose the platform by the pathway, not by hype.

**Never:** Deploy by SSH-ing into a box and editing files, run Postgres on the same disk as the app with no backups, or pick a platform you can't get logs and a one-click rollback from.

**Why:** Where you deploy decides the *how* of half this dictionary (pooling, secrets, storage, email, jobs); fix the pathway up front so every later choice has an answer.

**Pathway: AWS**

**Do:** Run the container on ECS on Fargate (use ECS Express Mode for the simplest single-service path); managed Postgres on RDS (or Aurora); an ALB for TLS and health checks. Define infrastructure as code (CDK/Terraform) once you run more than one service.

**Never:** Start on raw EC2 you patch by hand, or reach for EKS/Kubernetes for a single app.

**Why:** ECS on Fargate gives containers without managing servers; RDS gives the managed-database guarantees. (Avoid App Runner, it stopped taking new customers in 2026; ECS Express Mode is its replacement.)

**Pathway: Managed (Railway / Render / Fly)**

**Do:** Point the platform at your GitHub repo and let it build, run the container, and provision its managed Postgres. Use the `DATABASE_URL` it gives you; it does rolling deploys and health checks for you.

**Never:** Bring your own orchestration here, use what the platform provides.

**Why:** These platforms collapse deploy + database + TLS + rollout into `git push`; that is the whole point of choosing one.
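
On any of these platforms the app's side of the database wiring is one injected variable. As a rough sketch (assuming a Node.js service; `loadDatabaseConfig` is a hypothetical helper, not a platform API), the app can validate `DATABASE_URL` once at startup and fail fast instead of failing at the first query:

```javascript
// Hypothetical helper: validate the platform-injected DATABASE_URL at startup.
function loadDatabaseConfig(env = process.env) {
  const raw = env.DATABASE_URL;
  if (!raw) throw new Error("DATABASE_URL is not set");

  let url;
  try {
    url = new URL(raw);
  } catch {
    throw new Error("DATABASE_URL is not a valid URL");
  }
  if (url.protocol !== "postgres:" && url.protocol !== "postgresql:") {
    throw new Error("DATABASE_URL must use the postgres:// or postgresql:// scheme");
  }

  // Return the raw string for the driver, plus non-secret fields that are safe to log.
  return {
    connectionString: raw,
    host: url.hostname,
    database: url.pathname.slice(1),
  };
}
```

Hand `connectionString` to the database driver unchanged, and log only `host` and `database`, because the raw URL contains the password.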

**Pathway: Non-dev (just get it live)**

**Agent:** Put the code in a GitHub repo, provision managed Postgres on the chosen platform, wire `DATABASE_URL`, deploy from the repo, and confirm automatic backups are on.

**Tell the user:** "Go to railway.app (or render.com), sign in with GitHub, click New Project → Deploy from your repo, and add a Postgres database from their menu. It then redeploys automatically every time we push."

**Never** tell the user to copy files onto a server or run something on their own computer to keep the site up.

### Which database

**Do:** Default to PostgreSQL for essentially everything: relational data, JSON via `jsonb`, full-text search (`tsvector` + GIN index), geo via PostGIS, and queues at small scale (`SELECT ... FOR UPDATE SKIP LOCKED` over a jobs table with a composite index on `(status, created_at)`).

**Never:** Reach for MongoDB, DynamoDB, Elasticsearch, or a dedicated queue broker as the *first* datastore because the data "feels" document-y or because scale is anticipated.

**Why:** One engine you know deeply beats four you half-know; Postgres covers the long tail (JSON, search, geo, queue) well enough that the second datastore is almost never needed in the first 12 months.

**Escape hatch:** A genuine, specific, *measured* need, heavy document workloads, true high-volume time-series, or a search-first product, justifies a specialised store. Even then, start on Postgres and let measurement force the move (see Source of truth: derived stores must rebuild from Postgres).

### Monolith vs services

**Do:** Ship one deployable monolith, organised internally by domain (see Repo structure). Split only when a concrete force demands it: a hot path that must scale independently, a hard team-ownership boundary, or genuinely divergent runtime/compliance needs.

**Never:** Start with microservices "to be ready to scale."

**Why:** Premature services buy a distributed system's failure modes, network partitions, partial failures, distributed transactions, deploy choreography, without the scale that would justify paying for them.

**Escape hatch:** When you do split, split along one of the named boundaries above, not by technical layer.

### Language / runtime

**Do:** Use the language the team already ships in, pinned to one runtime version across the whole codebase (lock it in the project manifest and CI, e.g. `.nvmrc` / `.python-version` / `go.mod`). No second language without a hard, named reason.

**Never:** Adopt a new or trendy language/runtime for a production system to learn it; run a polyglot stack "because the right tool for the job"; or let local, CI, and prod drift onto different runtime versions.

**Why:** The runtime is the floor everything else stands on, not where you innovate. Version drift between environments is a top source of "works on my machine" bugs, pin it once, everywhere.

### ORM vs raw SQL vs query builder

**Do:** Use a mature ORM or query builder for ordinary app CRUD; drop to parameterised raw SQL for the few complex or performance-critical queries.

**Never:** Hand-concatenate SQL strings or interpolate user input into a query (injection); never let the ORM silently emit N+1 queries, eager-load or batch instead (see N+1).

**Why:** ORMs kill boilerplate and keep parameterisation automatic; raw SQL keeps the hard 5% readable and fast. Use each where it wins.

**Escape hatch:** Raw SQL is still parameterised SQL, pass values as bind parameters, never via string formatting, even when you've left the ORM.

### Repo structure

**Do:** Start with one repository, organised by feature/domain (e.g. `billing/`, `accounts/`) rather than by technical layer (`controllers/`, `models/`, `services/`).

**Never:** Split into multiple repos before there is a real team or ownership reason.

**Why:** Feature-first layout keeps a change to one capability in one place; per-layer folders scatter every feature across the tree. Multi-repo adds versioning and cross-repo-change cost you don't need yet.

### Config & environments

**Do:** Read all config from environment variables, keep secrets out of the repo, and maintain separate config per stage (dev/staging/prod). Build the artifact once and promote that *same* artifact through stages; only config differs between them.

**Never:** Commit config or secrets, hardcode per-environment values, or rebuild a separate artifact per stage.

**Why:** Promoting one artifact means the binary you tested in staging is byte-for-byte the one in prod, divergence can only come from config, which is far easier to audit.

**Escape hatch:** Secrets belong in a managed secrets store (your platform's secret manager) injected as env vars at runtime, not in committed `.env` files, commit only a `.env.example` with empty values.

### Source of truth

**Do:** Treat the database as the single source of truth. Caches, search indexes, denormalised tables, and client state are derived copies, and you must always be able to rebuild every one of them from the database.

**Never:** Treat a cache, index, or client-held value as canonical, or let a derived store drift with no rebuild path.

**Why:** When (not if) a derived copy goes stale or corrupt, "rebuild from the source of truth" is the recovery plan. If the copy *is* the only truth, there is no recovery.

## 3. Data & database

### IDs: UUID vs auto-increment

**Do:** Make every externally-visible identifier (URLs, API payloads) non-sequential. Use a **UUIDv7** (time-ordered) value: either as the primary key directly, or keep an internal `bigint` identity key for joins plus a separate `uuid` external id.

**Never:** Expose raw auto-increment integer PKs in URLs or APIs (`/users/123`).
Don't reach for random **UUIDv4** as a PK either, its randomness destroys B-tree index locality, bloating writes and index size.

**Why:** Sequential ids leak row counts and let anyone enumerate your data; UUIDv7 keeps ids opaque while staying index-friendly because the leading bits are time-ordered.

**Escape hatch:** Internal-only tables that are never addressed by an outside caller can stay on plain `bigint identity`, the rule is about what crosses the trust boundary.

**Before (agent cold):**

```sql
CREATE TABLE users ( id serial PRIMARY KEY, ... );
-- route: GET /users/123   ← enumerable, leaks "we have ~123 users"
```

**After (this dictionary):**

```sql
-- PG 18+: native uuidv7(); PG 17 or older: pg_uuidv7 extension or app-side gen
CREATE TABLE users (
  id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- internal joins
  external_id uuid NOT NULL DEFAULT uuidv7() UNIQUE             -- the public id
);
-- route: GET /users/018f3c9a-7e6b-7c41-9a2e-2f1b6d4c8a55   ← opaque, non-sequential
```

### Schema design defaults

**Do:** Normalise first, aim for 3NF. Default columns to `NOT NULL`; allow `NULL` only when "unknown/absent" is a real, meaningful state. Index every foreign key (see Indexing).

**Never:** Reach for JSONB to dodge modelling. Use JSONB only for genuinely schemaless or highly variable data, not as a junk drawer for fields you were too lazy to define.

**Why:** A real schema gives you constraints, types, and query planning; a JSONB blob gives you none of that and silently rots.

### Migrations

**Do:** Treat every schema change as something that must run safely against a live production table with traffic on it. Assume the table is large and in use.

- Adding a column WITH a default is safe on Postgres 11+ when the default is non-volatile (a constant, or `now()`): the default is stored in the catalog and applied on read, with no table rewrite. A volatile default such as `gen_random_uuid()` still rewrites the table. Don't avoid defaults out of habit; the old "never add a column with a default" rule is obsolete.
- Make changes additive first. Add new things; never rename or drop in the same step.
- For any rename or type change, use expand-contract across separate deploys: (1) add the new column, (2) backfill in batches, (3) switch the app to read/write the new column, (4) drop the old column in a later deploy once nothing references it.
- Create indexes with `CREATE INDEX CONCURRENTLY`. If it fails it leaves an `INVALID` index, drop it and retry.
- Set a short `lock_timeout` (e.g. 5s) before DDL so a migration fails fast instead of queuing behind a lock and freezing the table.

**Never:**

- `ALTER COLUMN ... SET NOT NULL` directly on a populated table, it scans the whole table under an `ACCESS EXCLUSIVE` lock. Instead: `ADD CONSTRAINT ... CHECK (col IS NOT NULL) NOT VALID` (instant), then `VALIDATE CONSTRAINT` (doesn't block reads/writes), then `SET NOT NULL` (no scan on PG 12+).
- Rename a column in one step, it breaks every query using the old name. Use expand-contract.
- Drop a column the current app version still reads. Deploy the code that stops using it first.
- A single `UPDATE` backfilling millions of rows. Batch it.
- Add an index without `CONCURRENTLY` in production.

**Why:** A migration that locks a busy table is an outage.

**Escape hatch:** At 0 users / empty tables, do the simple thing. The moment real data is in the table, these rules apply.

**Before (agent cold):** `ALTER TABLE users ALTER COLUMN email SET NOT NULL;` scans the whole table under an exclusive lock, blocking all reads and writes: downtime on a large table.

**After (this dictionary):** `ALTER TABLE users ADD CONSTRAINT users_email_nn CHECK (email IS NOT NULL) NOT VALID;` then `ALTER TABLE users VALIDATE CONSTRAINT users_email_nn;` then `ALTER TABLE users ALTER COLUMN email SET NOT NULL;`. Each step holds its strong lock only briefly, and the full-table check runs without blocking reads or writes, so there is no downtime.

### Money / decimals

**Do:** Store money as integer minor units (pence/cents) or as exact `NUMERIC`/`DECIMAL`. Store the currency code (ISO 4217) alongside every amount.
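
A minimal sketch of the minor-units rule in application code, assuming JavaScript on the server. The helper names (`money`, `add`, `allocate`) are hypothetical, and the currency check only tests the three-letter shape, not whether the code is a real ISO 4217 entry:

```javascript
// Money as an integer count of minor units plus a currency code.
function money(minorUnits, currency) {
  if (!Number.isSafeInteger(minorUnits)) {
    throw new TypeError("amount must be an integer number of minor units");
  }
  if (!/^[A-Z]{3}$/.test(currency)) {
    throw new TypeError("currency must be a three-letter ISO 4217 code");
  }
  return Object.freeze({ minorUnits, currency });
}

// Adding across currencies is a bug, so refuse it.
function add(a, b) {
  if (a.currency !== b.currency) throw new Error("currency mismatch");
  return money(a.minorUnits + b.minorUnits, a.currency);
}

// Split an amount into n parts that sum exactly to the original (no lost pennies).
function allocate(m, n) {
  if (!Number.isInteger(n) || n < 1) throw new RangeError("n must be a positive integer");
  const base = Math.floor(m.minorUnits / n);
  const remainder = m.minorUnits - base * n;
  return Array.from({ length: n }, (_, i) =>
    money(base + (i < remainder ? 1 : 0), m.currency));
}

console.log(0.1 + 0.2 === 0.3);                                          // false
console.log(add(money(10, "GBP"), money(20, "GBP")).minorUnits === 30);  // true
```

Splitting 100 pence three ways with `allocate` yields 34, 33 and 33, so the parts always add back up to the total.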

**Never:** `float` or `double` for money, binary floating point cannot represent `0.10` exactly and rounding errors compound.

### Timestamps & timezones

**Do:** Store every timestamp as `timestamptz`. Store and compute in UTC; convert to a local timezone only at the display/reporting edge. When a query means "today" or "this month" in a user's local zone, truncate in that zone with the 3-arg `date_trunc(field, ts, zone)` (PG 12+), which returns a `timestamptz` you can compare directly.

**Never:** Store naive (`timestamp` without tz) values, and never filter date ranges with `created_at::date = current_date`, that casts using the session's `TimeZone` setting (normally the server's) and is off by a day whenever that zone differs from the user's.

**Why:** UTC storage with edge conversion is the only model that survives DST shifts, multi-region servers, and users in different zones.

**Before (agent cold):**

```sql
created_at timestamp,  -- naive, ambiguous zone
-- "orders today" using server-local time:
SELECT * FROM orders WHERE created_at::date = current_date;  -- wrong unless server tz == user tz
```

**After (this dictionary):**

```sql
created_at timestamptz NOT NULL DEFAULT now(),
-- "orders today" for a user in America/New_York (PG 12+ 3-arg date_trunc):
SELECT * FROM orders
WHERE created_at >= date_trunc('day', now(), 'America/New_York')
  AND created_at <  date_trunc('day', now(), 'America/New_York') + interval '1 day';
```

### Soft delete vs hard delete

**Do:** Default to hard delete, backed by foreign-key integrity (see Foreign keys & cascades) and reliable backups.

**Never:** Add a `deleted_at` column reflexively. Use soft delete only when you genuinely need recovery, audit, or legal retention, and when you do, enforce "exclude deleted" globally via a view or default scope, never per-query.

**Why:** The first place someone forgets the `WHERE deleted_at IS NULL` filter is a data leak that ships deleted rows to a user.
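
Where soft delete is justified, the "exclude deleted" filter has to live in exactly one place. One way to sketch a default scope in application code, assuming a driver that accepts a `{ text, values }` parameterised query; the function name `scopedSelect` and the table allowlist are hypothetical:

```javascript
// Hypothetical allowlist: the only tables that carry a deleted_at column.
const SOFT_DELETE_TABLES = new Set(["users", "invoices"]);

// Build a parameterised SELECT that excludes soft-deleted rows unless the
// caller explicitly opts in. Identifiers are checked; values are bound.
function scopedSelect(table, filters = {}, { includeDeleted = false } = {}) {
  if (!SOFT_DELETE_TABLES.has(table)) throw new Error(`unknown table: ${table}`);

  const clauses = [];
  const values = [];
  for (const [column, value] of Object.entries(filters)) {
    if (!/^[a-z_][a-z0-9_]*$/.test(column)) throw new Error(`bad column: ${column}`);
    values.push(value);
    clauses.push(`${column} = $${values.length}`);
  }
  if (!includeDeleted) clauses.push("deleted_at IS NULL");

  const where = clauses.length ? ` WHERE ${clauses.join(" AND ")}` : "";
  return { text: `SELECT * FROM ${table}${where}`, values };
}
```

Callers can no longer forget the filter, because leaving it out requires passing `includeDeleted: true`. A database view that hides deleted rows gives the same guarantee one layer lower and also covers ad-hoc queries.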

### Enums

**Do:** Use a lookup table (foreign key) for value sets that may change or carry metadata. Use a `CHECK` constraint for tiny, fixed sets (e.g. `status IN ('active','inactive')`).

**Never:** Use native Postgres `ENUM` types. Adding a value means `ALTER TYPE ... ADD VALUE` (which can't be used in the same transaction it's added in), reordering is impossible, and a value can never be removed.

### Foreign keys & cascades

**Do:** Always declare foreign keys and let the database enforce integrity. Choose `ON DELETE` deliberately: `CASCADE` only where the child truly cannot exist without the parent; otherwise `RESTRICT`/`NO ACTION` (the default) or `SET NULL`.

**Never:** Sprinkle `ON DELETE CASCADE` for convenience, one deleted parent row can silently wipe out half the database.

**Why:** Cascade is irreversible and invisible until it fires; `RESTRICT` fails loudly and safely instead.

### Indexing

**Do:** Index the columns you filter, join, and sort on, and every foreign key. For multi-column filters use a composite index, and remember column order matters (leftmost-prefix wins). Confirm with `EXPLAIN (ANALYZE)` that the planner actually uses it.

**Never:** Index every column. Each index taxes every write and consumes storage; unused indexes are pure overhead.

### N+1 queries

**Do:** Load related data in one query, a join, an eager load, or a batched `WHERE id IN (...)`.

**Never:** Issue one query per row inside a loop. The classic trap is an ORM lazily loading a relation inside a `.map`/`forEach`.

**Why:** Query count that scales with row count turns a 1-query page into a 1000-query page under real data.

**Escape hatch:** Detect it by logging query count per request and alerting when the count grows with the result set.

### Pagination

**Do:** Use keyset (cursor) pagination for large or unbounded lists, page on an indexed `WHERE (created_at, id) < (:last_ts, :last_id) ORDER BY created_at DESC, id DESC LIMIT n`.
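
The keyset query needs the last row's `(created_at, id)` to travel to the client and back. A minimal sketch of an opaque cursor, assuming Node.js; the function names and the `orders` table are hypothetical. It also assumes `created_at` is carried as a full-precision string, because a JavaScript `Date` keeps only milliseconds while `timestamptz` keeps microseconds, and a truncated key can skip or repeat rows:

```javascript
// Encode the last row's sort key as an opaque base64url cursor.
function encodeCursor(row) {
  return Buffer.from(JSON.stringify([row.created_at, row.id])).toString("base64url");
}

// Decode a cursor from the client. It is untrusted input, so validate its shape.
function decodeCursor(cursor) {
  let parsed;
  try {
    parsed = JSON.parse(Buffer.from(cursor, "base64url").toString("utf8"));
  } catch {
    throw new Error("invalid cursor");
  }
  if (!Array.isArray(parsed) || parsed.length !== 2 ||
      parsed.some(v => typeof v !== "string")) {
    throw new Error("invalid cursor");
  }
  return { lastTs: parsed[0], lastId: parsed[1] };
}

// Build the parameterised page query; the page size is clamped to 1..100.
function pageQuery(cursor, limit = 50) {
  const n = Math.min(Math.max(1, Math.trunc(limit) || 1), 100);
  if (!cursor) {
    return {
      text: "SELECT * FROM orders ORDER BY created_at DESC, id DESC LIMIT $1",
      values: [n],
    };
  }
  const { lastTs, lastId } = decodeCursor(cursor);
  return {
    text: "SELECT * FROM orders WHERE (created_at, id) < ($1, $2) " +
          "ORDER BY created_at DESC, id DESC LIMIT $3",
    values: [lastTs, lastId, n],
  };
}
```

The response for each page would return `encodeCursor(lastRow)` as the next cursor, and the client sends it back unchanged.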

**Never:** Use `OFFSET` for deep or unbounded paging, it scans and discards every skipped row and drifts (skips/repeats rows) under concurrent writes.

**Escape hatch:** `OFFSET`/`LIMIT` is fine for small, bounded admin lists where deep pages never happen.

### Connection pooling

**Global (every pathway):**

**Do:** Put a connection pooler in front of Postgres (PgBouncer, RDS Proxy, or your framework's built-in pool). Default to **transaction-pooling mode**, it gives the best connection reuse. Serverless/Lambda multiplies raw connections, so pool externally there.

**Never:** Assume transaction mode is "fire and forget." It still doesn't carry true session state across statements: `LISTEN`/`NOTIFY`, session-level advisory locks (`pg_advisory_lock`), and a `SET` meant to persist across a request all break or silently pin. Switch to **session mode** only for the connections that genuinely need those.

**Why:** Postgres connections are heavyweight; without pooling a burst of clients exhausts `max_connections` and the database stops accepting work.

**Escape hatch:** Prepared statements work in transaction mode on PgBouncer 1.21+ (set `max_prepared_statements > 0`). With PgBouncer 1.22+ and Prisma Client 5.10+ you no longer need Prisma's legacy `?pgbouncer=true` flag; on older versions you do. RDS Proxy multiplexes the extended-query protocol but pins on `SET` and session-level prepared statements.

**Pathway: AWS**

**Do:** Use RDS Proxy as the managed pooler in front of RDS/Aurora, it multiplexes connections and rides out failovers. Reach for your own PgBouncer only if you need transaction-pooling behaviour RDS Proxy doesn't provide.

**Never:** Open a fresh connection per Lambda invocation straight to Postgres, that exhausts `max_connections`; put RDS Proxy in between.

**Why:** RDS Proxy is the AWS-native answer to serverless connection storms (it pins on `SET` and session-level prepared statements, so keep those off the hot path).

**Pathway: Managed (Railway / Render / Fly)**

**Do:** Use the platform's pooled connection string (e.g. Supabase's transaction-mode port, or the platform's pooled URL); default to transaction mode.

**Never:** Run your own PgBouncer next to a platform that already pools for you.

**Why:** The pooled URL is the supported path; transaction mode carries the same session-state caveats as the Global block.

**Pathway: Non-dev (just get it live)**

**Agent:** Use the platform's provided (pooled) `DATABASE_URL`; do not deploy a separate pooler.

**Tell the user:** "Your platform handles database connections for you, nothing to set up here."

## 4. Auth & access control

### Auth: provider vs roll-your-own

**Do:** Use a managed auth provider (Cognito, Clerk, Auth0, Supabase Auth, WorkOS). Let it own login, password reset, MFA, email verification, and session management.

**Never:** Hand-roll login, password reset, MFA, or session logic for a new app.

**Why:** "Simple" auth has dozens of subtle ways to be insecure (timing attacks, token replay, reset-link reuse, account enumeration), and you will not maintain it as well as a provider.

**Escape hatch:** Essentially none for a new app. Self-host an existing battle-tested system (e.g. Keycloak) only if a hard compliance or air-gap requirement forbids SaaS, still not custom code.

### Token / session storage

**Do:** Keep the session token in an `httpOnly`, `Secure`, `SameSite=Lax` cookie. For SPAs, hold the short-lived access token in memory (a JS variable, not web storage) and the refresh token in an `httpOnly` cookie. Because `httpOnly` cookies are auto-sent, add CSRF protection (a custom header the server requires, or double-submit token).

**Never:** Store session, access, or refresh tokens in `localStorage` or `sessionStorage`. Never set `SameSite=None` without `Secure`, and never rely on the browser's default `SameSite`.

**Why:** Any XSS can read web storage; `httpOnly` cookies are unreadable from JS, which contains the blast radius. The trade-off is CSRF, which `SameSite` plus a required custom header closes.

**Escape hatch:** A native mobile app uses the platform secure keystore (Keychain / Keystore), not web storage rules. Use `SameSite=Strict` when no cross-site navigation needs the cookie (defaults to `Lax` otherwise).

### Password handling

**Do:** If you must store passwords, hash with Argon2id (e.g. m=64 MiB, t=3, p=1, which is above OWASP's minimum of m=19 MiB, t=2, p=1; tune to ~100ms/hash on your hardware). bcrypt at cost ≥12 is an acceptable fallback. See "Auth: provider vs roll-your-own", a provider should own this entirely.

**Never:** Store plaintext, and never use fast/general-purpose hashes (MD5, SHA-1, SHA-256, plain HMAC) for passwords. Never feed bcrypt inputs over 72 bytes without pre-hashing, it silently truncates.

**Why:** Fast hashes are trivially brute-forced; password hashing must be deliberately slow and memory-hard.

### Authorization / RBAC

**Do:** Model roles and permissions explicitly. Enforce authorization server-side on every request, in a centralised middleware/policy layer, and deny by default.

**Never:** Rely on a hidden UI element, a disabled button, or any client-side check as the access control. Never trust an `is_admin`/role claim sent from the client, derive it server-side from the authenticated identity.

**Why:** The client is attacker-controlled; the only enforcement that exists is the one on the server.

### Row-level security + connection pooler

**Do:** When combining Postgres RLS with per-request tenant context, set the context as transaction-local with `SET LOCAL` (or `set_config(..., true)`) inside an explicit transaction, and verify isolation under the real pooler.

**Never:** `SET app.current_tenant` (session-level) on a connection drawn from a transaction-mode pooler (PgBouncer `pool_mode = transaction`, Supabase's transaction port, RDS Proxy pinning aside).
Never use statement-mode pooling with session GUCs at all, even `SET LOCAL` can land on a different connection.

**Why:** Pooled connections are reused across requests; a session-level variable can leak from one tenant's request to the next, turning the isolation feature into a cross-tenant data leak. `SET LOCAL` is discarded at `COMMIT`/`ROLLBACK`, so it cannot outlive the transaction that owns the pooled connection.

**Escape hatch:** Session-level `SET` is only safe if the connection is exclusively held for the request's lifetime (e.g. a dedicated, non-transaction-pooled connection, or `pool_mode = session`) and reset on checkout, confirm, don't assume.

**Before (agent cold):**

```sql
-- per request, on a transaction-pooled connection (e.g. Supabase/PgBouncer)
SET app.current_tenant = '42';
SELECT * FROM invoices;  -- relies on RLS using app.current_tenant
-- connection returns to pool STILL set to '42'; next tenant inherits it
```

**After (this dictionary):**

```sql
BEGIN;
SET LOCAL app.current_tenant = '42';  -- scoped to this transaction only
SELECT * FROM invoices;               -- RLS sees the correct tenant
COMMIT;                               -- variable is discarded with the txn
```

### Multi-tenancy

**Do:** Default to a shared database with a `tenant_id` column on every tenant-owned row, enforced in every query and ideally backed by RLS (see "Row-level security + connection pooler").

**Never:** Reach for a separate database or schema per tenant by default.

**Why:** Per-tenant databases multiply migration and operational cost linearly with customers.

**Escape hatch:** Use a separate database/schema per tenant only when hard isolation or a specific compliance requirement demands it.

## 5. Security

### Input validation

**Do:** Parse all external input at the edge with a schema (zod, Pydantic, class-validator). Reject unknown fields (`.strict()` in zod, `extra="forbid"` in Pydantic); pass typed, trusted data inward.

**Never:** Trust the client, query string, headers, path params, or webhook bodies.
Don't sprinkle ad-hoc `if (!x) throw` checks deep in business logic.

**Why:** One validated boundary means everything inside is typed and safe; scattered checks always miss a path.

### SSRF / user-supplied URLs

**Do:** For any feature that fetches a user-supplied URL (scraper, webhook tester, image/avatar proxy, import-from-URL), defend in depth: allowlist schemes (`http`/`https`) and, where you can, hosts; resolve DNS and reject if **any** resolved address is private/loopback/link-local/unique-local/CGNAT for BOTH IPv4 and IPv6, re-resolving after every redirect; disable redirects or re-validate each hop; run the fetcher with least-privilege egress and no ambient credentials, isolated from credentialed infra. Where the platform supports it, also block the metadata endpoint at the network layer (e.g. enforce IMDSv2 with hop limit 1).

**Never:** `fetch(userUrl)` directly; validate the hostname once and then follow redirects blindly; or check only the *first* resolved address.

**Why:** A raw fetch can be pointed at cloud metadata (`169.254.169.254`) to steal task-role credentials, or at internal services behind your perimeter. Hostname-string blocking is bypassed via DNS rebinding, redirects, decimal/hex/octal IPs, IPv4-mapped IPv6, and multi-record DNS; you must block on **every resolved** address, not the name.

**Escape hatch:** If the destination set is fixed and known (e.g. a single partner's API), a strict host allowlist alone is enough.

**Before (agent cold):**

```js
// fetches whatever the user gives us
const res = await fetch(userUrl);
const body = await res.text();
```

**After (this dictionary):**

```js
import { lookup } from "node:dns/promises";
import net from "node:net";

function isBlockedIp(ip) {
  // Normalize IPv4-mapped IPv6 (e.g. ::ffff:a9fe:a9fe or ::ffff:169.254.169.254)
  // down to the embedded IPv4 and re-check it. new URL() may store the hex form,
  // so a string match on "::ffff:" is NOT enough.
  if (net.isIPv6(ip) && ip.toLowerCase().includes("::ffff:")) {
    const tail = ip.slice(ip.lastIndexOf(":") + 1);
    if (net.isIPv4(tail)) return isBlockedIp(tail); // dotted form
    const hex = ip.toLowerCase().split("::ffff:")[1] || ""; // hex form a9fe:a9fe
    const parts = hex.split(":");
    if (parts.length === 2 && parts.every(p => /^[0-9a-f]{1,4}$/.test(p))) {
      const n = (parseInt(parts[0], 16) << 16) | parseInt(parts[1], 16);
      return isBlockedIp([24, 16, 8, 0].map(s => (n >>> s) & 255).join("."));
    }
  }
  if (net.isIPv4(ip)) {
    const [a, b] = ip.split(".").map(Number);
    return a === 0 || a === 127 || a === 10 || // this-host, loopback, private
      (a === 172 && b >= 16 && b <= 31) ||     // private
      (a === 192 && b === 168) ||              // private
      (a === 169 && b === 254) ||              // link-local + cloud metadata
      (a === 100 && b >= 64 && b <= 127);      // CGNAT / common k8s pod CIDR
  }
  if (net.isIPv6(ip)) {
    const v6 = ip.toLowerCase();
    return v6 === "::1" || v6 === "::" ||          // loopback, unspecified
      /^fe[89ab][0-9a-f]:/.test(v6) ||             // link-local, the whole fe80::/10 range
      v6.startsWith("fc") || v6.startsWith("fd");  // unique-local (incl. fd00:ec2::254, fd20:ce::254)
  }
  return true; // unparseable -> reject
}

async function safeFetch(userUrl) {
  const u = new URL(userUrl);
  if (u.protocol !== "http:" && u.protocol !== "https:") throw new Error("scheme not allowed");
  // all:true -> check EVERY A/AAAA record, not just the first (defeats multi-record rebinding)
  const records = await lookup(u.hostname, { all: true });
  if (records.length === 0 || records.some(r => isBlockedIp(r.address))) {
    throw new Error("blocked address");
  }
  // redirect:"error" throws on any 3xx (no redirect-time re-resolution gap); hard timeout; no ambient creds.
  // Run on a restricted-egress worker so a residual bypass can't reach metadata/internal hosts.
  return fetch(u, { redirect: "error", signal: AbortSignal.timeout(5000) });
}
```

> Note: even with this check there is a DNS-rebind TOCTOU window between `lookup` and `fetch`.
The load-bearing control is restricted egress on the worker; treat the address check as defence-in-depth, not the sole barrier. If you must follow redirects, re-run `safeFetch` against each `Location` rather than using `redirect: "follow"`.

### Secrets

**Global (every pathway):**

**Do:** Keep secrets in the platform secret store (AWS Secrets Manager / SSM Parameter Store, Vault, your host's secret store); inject at runtime as env vars or mounted files; rotate them.

**Never:** Commit secrets to git, bake them into Docker images, or ship them in a client bundle (anything in the browser is public).

**Escape hatch:** Local dev uses a `.env` file that is gitignored and never the source of prod values.

**Pathway: AWS**

**Do:** Store secrets in AWS Secrets Manager (built-in rotation) or SSM Parameter Store (cheaper `SecureString`); grant the task role read on only its own secrets; inject as task-definition secrets at runtime.

**Never:** Put secrets as plain environment variables in the task definition, or grant `secretsmanager:*` on `*`.

**Why:** A scoped task role keeps secrets out of images and out of every other workload's reach (see IAM / least privilege).

**Pathway: Managed (Railway / Render / Fly)**

**Do:** Set secrets in the platform's Environment Variables / Secrets UI (Fly: `fly secrets set`); reference them at runtime.

**Never:** Commit a `.env` with real values or paste secrets into the build command.

**Why:** The platform's secret store injects them into the running process only, which is the supported, repo-free path.

**Pathway: Non-dev (just get it live)**

**Agent:** Set every key in the platform's Environment Variables / Secrets settings; never in code or committed files.

**Tell the user:** "API keys and passwords go in your platform's 'Variables' / 'Secrets' settings page (Railway: your service → Variables; Render: Environment), never in the code, a screenshot, or a chat message. If a key ever leaks, rotate it: create a new one, delete the old."
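The runtime half of the Secrets ruling can be sketched as follows, assuming a Node.js app whose platform injects secrets as environment variables. The helper names `requireSecret` and `loadSecrets`, and the variable names `DATABASE_URL` and `THIRD_PARTY_API_KEY`, are illustrative assumptions, not part of any platform's API. The idea is to read every secret once at boot and fail fast, so a misconfigured deploy crashes on start rather than at the first request.

```javascript
// Hypothetical helper: read one required secret from the environment, or fail fast.
function requireSecret(name, env = process.env) {
  const value = env[name];
  if (typeof value !== "string" || value.trim() === "") {
    // Name the missing variable; never echo a secret value into an error or log.
    throw new Error(`Missing required secret: ${name}`);
  }
  return value;
}

// Hypothetical loader: collect every secret the app needs in one place at boot.
// The variable names below are placeholders for whatever your app actually uses.
function loadSecrets(env = process.env) {
  return Object.freeze({
    databaseUrl: requireSecret("DATABASE_URL", env),
    thirdPartyApiKey: requireSecret("THIRD_PARTY_API_KEY", env),
  });
}
```

Calling `loadSecrets()` once at startup and passing the frozen object inward keeps `process.env` reads out of business logic, and the same code works unchanged with a gitignored `.env` file in local development.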
### IAM / least privilege

**Do:** Scope every role to the specific actions and resources it needs, one narrowly-scoped role per workload.

**Never:** `"Action": "*"` or `"Resource": "*"` in a production policy.

**Why:** A wildcard role turns any app compromise into account compromise; see SSRF for how a leaked task role gets exfiltrated.

### SQL injection

**Do:** Use parameterised queries / bound parameters for every value.

**Never:** Build SQL by concatenating or interpolating user input, including "just this once" in a raw query. Note that table/column names can't be bound; allowlist those against a fixed set, never interpolate from input.

**Why:** ORMs and query builders parameterise for you; the risk reappears the instant you drop to raw SQL.

### XSS / output encoding

**Do:** Rely on your framework's automatic template escaping (React, Jinja, ERB, etc.). If you must render user-supplied HTML, sanitise it with DOMPurify (use `isomorphic-dompurify` for SSR) before rendering. Set a restrictive `Content-Security-Policy`.

**Never:** Concatenate user data into HTML, or pass it to `innerHTML` / `dangerouslySetInnerHTML` unsanitised.

**Why:** Auto-escaping is on by default; the only XSS you ship is the escaping you deliberately bypass.

### CORS

**Do:** Default to same-origin. If cross-origin is required, allowlist specific known origins and echo back only a match.

**Never:** Reflect an arbitrary `Origin` header, and never combine `Access-Control-Allow-Origin: *` with `Access-Control-Allow-Credentials: true`.

**Why:** Reflecting the origin (or `*` with credentials) lets any site make authenticated requests as your user; it's same-origin policy with the lock taped open.

### Rate limiting

**Do:** Rate-limit at the edge (gateway/CDN/proxy), then add per-identity limits on expensive or abuse-prone endpoints (login, signup, password reset, search, write APIs). Return `429` when exceeded.

**Never:** Rely on client-side throttling, or leave auth endpoints unlimited.
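A per-identity limit of the kind described above might look like this minimal sketch. It assumes a single app process and keeps its counters in an in-memory `Map`, so it is not suitable as-is for multiple instances, and it never prunes old keys. `createLimiter`, `limit`, `windowMs`, and the injected `now` clock are hypothetical names chosen for illustration.

```javascript
// Hypothetical fixed-window limiter. State lives in this process only,
// and old keys are never pruned; a real deployment needs both addressed.
function createLimiter({ limit, windowMs, now = Date.now }) {
  const windows = new Map(); // key -> { start, count }
  return function check(key) {
    const t = now();
    const w = windows.get(key);
    if (!w || t - w.start >= windowMs) {
      // First hit for this key, or the previous window has expired: start a new one.
      windows.set(key, { start: t, count: 1 });
      return { allowed: true, retryAfterSec: 0 };
    }
    w.count += 1;
    if (w.count > limit) {
      // Over the limit: the caller should answer 429 with a Retry-After header.
      return { allowed: false, retryAfterSec: Math.ceil((w.start + windowMs - t) / 1000) };
    }
    return { allowed: true, retryAfterSec: 0 };
  };
}
```

A handler would call `check` with a key such as an account id or client IP and return `429` when `allowed` is false; the edge limit stays in place in front of it.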
**Escape hatch:** Brute-forceable endpoints (login, OTP, reset) need per-account *and* per-IP limits even if global edge limits exist.

### File uploads

**Do:** Validate type by actual content (magic bytes), not the extension or `Content-Type`; enforce a size cap; generate your own filename; store uploads in object storage off the app server; serve and accept via signed URLs.

**Never:** Trust the supplied extension/`Content-Type`, keep the client's filename, or write uploads into a web-served or executable path.

**Why:** A `.jpg` that is really a `.php`/`.html` dropped in a public directory becomes remote code execution or stored XSS.

### Dependencies / supply chain

**Do:** Commit a lockfile, pin versions, install only from official registries, and run automated dependency/vulnerability scanning (Dependabot/Renovate plus an `npm audit`/`pip-audit` step) in CI.

**Never:** Add a dependency for a few lines you can write yourself, or `install` from an arbitrary git URL/tarball.

**Why:** Every dependency is code you run with your privileges; fewer, pinned, scanned deps shrink the attack surface.

### Dependency cooldown

**Do:** Don't install brand-new releases the moment they publish. Enforce a release-age cooldown so a version must be a few days old before it's installable: `min-release-age` (npm), `minimumReleaseAge` (pnpm, Bun), or `npmMinimalAgeGate` (Yarn). A week is the cautious setting; a day is the practical floor. Keep committing lockfiles, use `npm ci` or frozen installs, and consider disabling install scripts in CI.

**Never:** Auto-adopt the latest release the instant it publishes, especially via an agent that silently bumps versions.

**Why:** Most malicious releases of popular packages are caught and pulled within hours, so even a one-day delay filters them out at the install layer. Agent-written code makes this worse, because it's hard to track which versions got pulled in.
**Escape hatch:** Fast-track a genuine emergency security fix past the cooldown for that one package; the cooldown is for routine upgrades.

## 6. Building AI features

If your app calls a model (an LLM or similar), that model is the newest untrusted, non-deterministic, metered dependency in your stack. Treat it like one. These rulings are global; where a mechanism differs by provider, the provider is named as a current example, not a default. See also: input validation and SSRF (Security), data sent to third parties (Privacy), output rendering (XSS in Security), and prompt regression tests (Testing).

### Treat model output as untrusted input

**Do:** Validate every model response against a strict schema (zod, Pydantic) before you act on it. Use the provider's structured-output or JSON mode where it exists, then validate anyway. This is the "validate at the edge" rule from Security, applied to the model.

**Never:** Branch on, execute, store, or forward raw model text as if it were trusted. Never `eval` it, never build SQL, shell, or HTTP calls out of it, and never assume a field exists just because you asked for it.

**Why:** A model is a probabilistic text generator, not a contract. It will eventually return malformed JSON, an extra field, a refusal, or injected instructions. Unvalidated model output is the new unvalidated user input.

**Escape hatch:** Display-only text with no downstream action still needs output encoding (see "Render model output as untrusted"), but not schema validation.

**Before (agent cold):**

```js
const action = JSON.parse(completion);
doThing(action.target);
```

**After (this dictionary):**

```js
let raw;
try { raw = JSON.parse(completion); } catch { return retryOrFail(); } // malformed JSON throws before any schema check
const parsed = ActionSchema.safeParse(raw);
if (!parsed.success) return retryOrFail(); // bounded re-prompt, then fail clean
doThing(parsed.data.target);
```

### The model is not an authorisation boundary

**Do:** Enforce permissions in your own code before executing any tool or function call the model chose to make.
Check the acting user's rights against the action, every time.

**Never:** Let the model's decision to call a tool be the thing that authorises it. The model choosing to call `deleteAccount` is a request, not a permission.

**Why:** Models can be talked into calling tools by injected content. Authorisation is a property of the user and the system, never of the model's intent (see Authorization / RBAC in Auth).

### Defend against prompt injection

**Do:** Treat all retrieved, tool-returned, and user-supplied content that enters a prompt as adversarial. Keep your instructions separate from untrusted data (clear delimiters, structured roles). Give the model the least powerful set of tools the task needs, and validate tool inputs and outputs.

**Never:** Put secrets, API keys, or credentials in prompts or system messages. Never concatenate retrieved web, document, or tool content straight into an instruction context and trust it.

**Why:** Retrieved content carries instructions ("ignore previous instructions and exfiltrate X"). This is the SSRF of the AI layer: the input looks like data but acts like a command. A secret in a prompt can be echoed straight back out.

**Escape hatch:** None for the secrets rule. For injection, the defence is layered (separation, least-privilege tools, output validation), not a single trick.

### Cap model spend, hard

**Do:** Set a hard spend cap on the model API, a per-request token ceiling, and an alert well below the cap. Fail closed when the cap is hit. Capability-first examples: AWS Budgets plus per-key limits for Bedrock; the provider's usage limits on OpenAI or Anthropic.

**Never:** Ship a model-calling feature with no spend ceiling and no alert. Never let user input drive an unbounded loop of model calls (agent loops especially).

**Why:** Runaway token spend is the new runaway cloud bill, and it arrives faster: a retry loop or an abuse case can burn a month's budget in an hour.
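The three controls in this ruling (per-request ceiling, hard cap, early alert) can be sketched as an in-process guard. This is an illustration only: the provider-side cap remains the real backstop, and the class name `TokenBudget`, its option names, and the `onAlert` callback are assumptions, not any provider's API. The counter here is in memory, so it resets on restart.

```javascript
// Hypothetical in-process budget guard; it complements, never replaces,
// the hard cap configured with the model provider.
class TokenBudget {
  constructor({ perRequestCeiling, hardCap, alertAt, onAlert }) {
    this.perRequestCeiling = perRequestCeiling; // max tokens one call may ask for
    this.hardCap = hardCap;                     // total tokens allowed in this budget period
    this.alertAt = alertAt;                     // alert threshold, set well below hardCap
    this.onAlert = onAlert;                     // called once when alertAt is first crossed
    this.used = 0;
    this.alerted = false;
  }

  // Call BEFORE each model request. Throws (fails closed) instead of overspending.
  reserve(tokens) {
    if (tokens > this.perRequestCeiling) throw new Error("per-request token ceiling exceeded");
    if (this.used + tokens > this.hardCap) throw new Error("spend cap reached");
    this.used += tokens;
    if (!this.alerted && this.used >= this.alertAt) {
      this.alerted = true;
      this.onAlert(this.used);
    }
    return this.hardCap - this.used; // tokens remaining
  }
}
```

Reserving before the call, rather than counting after it, is what makes the guard fail closed: a request that would cross the cap is never sent.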
**Non-dev:** Tell the user, "AI calls cost money every time they run. Set a spending limit in your provider's dashboard on day one, or a bug or some abuse could run up a large bill fast." Agent: configure a hard cap and a low-threshold alert before shipping.

### Assume non-determinism; retry deliberately

**Do:** Retry on rate limits and transient errors with exponential backoff and jitter. On a schema-validation failure, re-prompt a bounded number of times, then fail cleanly. Set sensible timeouts.

**Never:** Hammer the API with immediate retries, retry unboundedly, or assume the same prompt returns the same output twice.

**Why:** Providers rate-limit, and outputs vary run to run. Structured output plus validate-and-retry beats trusting free text or crashing on the first malformed response.

**Containing autonomous action.** The moment an agent can call tools on its own, you owe it a safety model: bound what it can do, make its actions safe to retry, keep a record of what it did, and be able to halt it fast (the kill switch lives in Observability & ops). The next two rulings are the first half of that.

### Bound autonomous loops

**Do:** Cap every agentic task with a hard ceiling on steps and tool-calls, a wall-clock timeout, and loop or no-progress detection (the same action or state repeating means stop). Fail safe when a ceiling is hit: halt and surface it, don't silently continue.

**Never:** Run a tool-calling loop with no maximum steps and no timeout, or let the model decide on its own when it's done with no external bound.

**Why:** Capping spend handles tokens, not iteration. An agent can loop, retry, or thrash within budget and still cause harm or lock up. The bound is the seatbelt on autonomy.

### Model-chosen side effects must be idempotent

**Do:** Any side-effecting action the model triggers (charge, trade, send, write) must carry an idempotency key so a retried or duplicated tool call runs once.
Reuse the rule you already have (see Idempotency in APIs); the key is generated by your deterministic code, never by the model.

**Never:** Retry or re-issue a model-chosen side effect without an idempotency key. Never let "retry deliberately" and non-determinism combine into a double-charge, double-trade, or double-send.

**Why:** Your own rules mean tool calls will be retried and outputs will vary, so without a key that is a duplicate real-world action. Highest stakes for balances and ledgers (see Money and ledgers).

### Render model output as untrusted

**Do:** Encode or sanitise model output before rendering it, exactly as you would user content. If you render model-produced markdown or HTML, sanitise it with a vetted library first.

**Never:** Pass raw model output to `innerHTML` or `dangerouslySetInnerHTML`.

**Why:** Model output is user-influenced content, so rendering it unescaped is an XSS hole (see XSS / output encoding in Security).

**Non-dev:** Agent only: never inject model text into the page as raw HTML.

### Mind what you send, and pin the model

**Do:** Treat the prompt as data leaving your system. Send the minimum PII to a third-party model API, know the provider's data-retention and training-use terms, and redact what you can (see Privacy). Pin the model version and test before moving to a new one.

**Never:** Send regulated or sensitive data to a model API without checking it's allowed and necessary. Never silently ride "latest": a model update is an untested dependency upgrade that can change behaviour under you.

**Why:** Sending data to a provider is a data-sharing decision, and a model version is a dependency like any other.

### Keep an auditable, redacted trail of model calls

**Do:** Store the prompt and response (or a reference to them) for each model call, with secrets and PII redacted, so you can reconstruct what the model was asked and what it returned when something misfires (see Observability & ops).
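As a rough illustration of a redacted trail, the sketch below masks known secret values and email-like strings before a record is stored. Everything here is an assumption made for the example: the function names `redactForLog` and `buildModelCallRecord`, the record shape, and the single email regex, which stands in for real PII detection and will miss most of it.

```javascript
// Hypothetical redactor: masks exact known secret values and email-like strings.
// Real PII detection needs far more than one regex; this only shows the shape.
function redactForLog(text, knownSecrets = []) {
  let out = String(text);
  for (const secret of knownSecrets) {
    // Skip empty strings: splitting on "" would break the text apart.
    if (secret) out = out.split(secret).join("[REDACTED_SECRET]");
  }
  return out.replace(/[^\s@]+@[^\s@]+\.[^\s@]+/g, "[REDACTED_EMAIL]");
}

// Hypothetical record shape: one entry per model call, keyed to the request id
// so it can be joined with the rest of the request's logs.
function buildModelCallRecord({ requestId, model, prompt, response }, knownSecrets = []) {
  return {
    requestId,
    model,
    prompt: redactForLog(prompt, knownSecrets),
    response: redactForLog(response, knownSecrets),
    at: new Date().toISOString(),
  };
}
```

Redacting at the point the record is built, rather than at query time, means the raw values never reach storage in the first place.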
**Never:** Log raw prompts or responses that contain secrets or personal data, or run autonomous actions with no record of what drove them.

**Why:** When a model-driven action goes wrong you need its input and output, and that trail is also the audit record for any side effect it caused (see Money and ledgers).

### Test prompts like code

**Do:** Keep a small eval set (representative inputs with the properties a good answer must have) and run it in CI when you change a prompt, model, or schema (see Testing). Assert on shape and key invariants, not exact strings.

**Never:** Ship a prompt or model change with no way to tell whether it got better or worse.

**Why:** Prompts are logic. A change that helps one case quietly breaks another, and without evals you only find out in production.

## 7. APIs

### REST defaults

**Do:** Use plural resource-noun URLs (`/v1/orders`, `/v1/orders/{id}`), HTTP verbs for action (GET read, POST create, PUT/PATCH update, DELETE remove), and JSON request/response bodies. Return the right status: 200 OK (read/update), 201 Created (with `Location` header pointing to the new resource), 204 No Content (delete or genuinely empty body); 400 malformed (unparseable JSON, bad Content-Type, broken HTTP), 401 unauthenticated, 403 authenticated-but-forbidden, 404 not found, 409 conflict (duplicate, version/optimistic-lock clash), 422 syntactically valid but semantically rejected (field-level validation failure); 500 server fault. Version from the very first endpoint with a `/v1` URL prefix. Paginate every list endpoint (see Pagination).

**Never:** Verbs in URLs (`/getOrders`, `/createOrder`), 200-with-`{ "error": ... }`, or shipping unversioned so the first breaking change forces a scramble. Don't blur 400 and 422: 400 is "I couldn't parse this", 422 is "I parsed it and it's wrong".

**Why:** Verbs and status codes are the contract; clients, proxies, and caches act on them without reading your prose.
A `/v1` you never break is free; retrofitting versioning onto a live unversioned API is not.

**Escape hatch:** A genuinely non-CRUD action (`POST /v1/orders/{id}/refund`) may be a POST to a sub-resource verb; that is the documented exception, not licence to verb everything.

### Error handling

**Do:** Return one consistent error envelope on every failure path, `{ "error": { "code": "string_slug", "message": "human readable", "request_id": "..." } }`, with the matching HTTP status (see REST defaults). Log full detail (stack, SQL, params) server-side keyed to the same correlation/request id, and return that id to the client.

**Never:** Leak stack traces, SQL, exception class names, file paths, or raw ORM errors to clients. Never return a bare string or a different shape per endpoint.

**Why:** A stable, machine-readable `code` lets clients branch without parsing prose; the request id turns "it broke" into a one-line log lookup. Leaked internals are both a support nightmare and an attacker's map.

**Escape hatch:** If you want a ratified standard instead of a house envelope, use RFC 9457 Problem Details (`application/problem+json`, with `type`/`title`/`status`/`detail`/`instance`); it obsoletes RFC 7807. Pick one shape and use it everywhere; do not mix.

### Idempotency

**Do:** For unsafe, retryable operations (POST that charges, signs up, or creates), require a client-supplied `Idempotency-Key` header (a client-generated UUID). On first request, persist `key -> (status, response body)` in Postgres inside the same transaction as the side effect, with a `UNIQUE` constraint on the key; on any replay of the same key, return the stored result instead of re-executing. Scope keys per endpoint and per authenticated user, and expire them (24h is typical). Handle the concurrent-replay race: a second in-flight request with the same key must block or return 409, not run in parallel.

**Never:** Assume the network won't double-deliver.
A client that times out WILL retry, and a non-idempotent charge endpoint double-charges. Never store the key only after the side effect succeeds; a crash in between leaves it replayable.

**Why:** Timeouts and retries are normal, not edge cases; the key is the only thing that makes "did my POST land?" answerable safely.

**Escape hatch:** Naturally idempotent verbs (GET, PUT to a known id, DELETE) need no key.

### Request validation

**Do:** Validate the full request (body, path params, and query string) against a schema at the boundary, before any business logic or DB call runs (see Input validation). Reject unknown/extra fields. On failure return 422 with field-level errors: `{ "error": { "code": "validation_failed", "fields": { "email": "must be a valid email" } } }`.

**Never:** Trust the client, reach into `req.body.whatever` ad hoc deep in a handler, or silently ignore unexpected fields (mass-assignment risk).

**Why:** One boundary check means business logic only ever sees well-formed input; rejecting unknown fields stops clients from quietly setting columns you never exposed.

**Framework note:** The schema library is stack-specific: zod (Express / Hono / Fastify / Next.js), class-validator DTOs + `ValidationPipe({ whitelist: true })` (NestJS), Pydantic models (FastAPI). See the framework page for the exact wiring.

### Minimal response fields

**Do:** Return only the fields the client needs. Define an explicit output shape per endpoint (a serializer / DTO / response schema / column `select`) and map to it; default to excluding everything until you deliberately add it.

**Never:** Serialize a raw ORM entity or `SELECT *` straight to JSON. That leaks internal columns (password hashes, internal flags, soft-delete timestamps, other rows' foreign keys) the instant someone adds a column, and bloats payloads.

**Why:** An allowlisted output shape *cannot* accidentally leak a newly-added sensitive column; a "return the row" handler leaks it the day the column lands.
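The allowlist idea can be sketched in a few lines of plain JavaScript. The field list and the names `PUBLIC_USER_FIELDS` and `toPublicUser` are hypothetical; the point is only that the output is built from an explicit list, so a column added to the row later stays out of the response until someone adds it here on purpose.

```javascript
// Hypothetical allowlist: the only fields this endpoint may return.
const PUBLIC_USER_FIELDS = ["id", "displayName", "createdAt"];

// Build the response from the allowlist, never from the row's own keys.
function toPublicUser(row) {
  const out = {};
  for (const field of PUBLIC_USER_FIELDS) {
    if (row[field] !== undefined) out[field] = row[field];
  }
  return out;
}
```

A handler would return `toPublicUser(row)` instead of the row itself; a schema library's output type can play the same role where one is already in use.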
**Framework note:** Node/Next.js: map to a plain object or a zod `.pick()` output schema. NestJS: a response DTO + `ClassSerializerInterceptor` with `@Expose`/`@Exclude`. FastAPI: a Pydantic `response_model`. See the framework page.

### Shared types between frontend and API

**Do:** Make the API the single source of truth for its types and derive the client's types from it; derive request/response types from the same schema that validates them.

**Never:** Hand-maintain a second copy of the response shape in the frontend; it silently drifts from the server the first time a field changes.

**Why:** One definition turns a breaking API change into a compile error in the frontend, not a runtime surprise in production.

**Framework note:** Strongest in Node/TypeScript, weaker elsewhere. TS monorepo: share a types package, or use tRPC (end-to-end inference, no codegen) when one team owns both ends. Cross-language or public APIs: generate clients from an OpenAPI spec the server emits (FastAPI emits OpenAPI automatically; add a generator for Node). See the framework page.

## 8. Frontend & rendering

### Framework choice

**Do:** React via Next.js for app-like products (auth, dashboards, real-time). Astro for content-first/mostly-static sites (blog, docs, marketing). Pick one, app-wide.

**Never:** Hand-roll a framework, or mix several in one app.

**Why:** A boring, well-trodden framework gives you hiring, docs, and answered questions; a custom one gives you a maintenance burden nobody else understands.

**Escape hatch:** Vue or Svelte if that is already the team's stack; known beats optimal. New project with no existing stack: use the defaults above.

### Rendering

**Do:** Match the rendering mode to the page, not the app. Static/SSG for content; SSR only where SEO or first paint actually matters; client-render only the genuinely interactive islands.
**Never:** Ship a heavy client-side SPA for what is really a content site, or SSR a logged-in dashboard that no crawler will ever see.

**Why:** Sending a megabyte of JS to render an article tanks load time and SEO for no benefit; SSR-ing a private dashboard burns server cost for nothing.

### State management

**Do:** Framework-built-in state (component/context) for UI state. TanStack Query for all server data: caching, refetching, mutations, invalidation. That covers ~90% of real apps.

**Never:** Reach for Redux-style global state until local + server state genuinely cannot cope. Don't store fetched server data in a global store and hand-wire its cache.

**Why:** Most "global state" is really server cache; a query library handles staleness and refetch with far less code than a hand-rolled store.

**Escape hatch:** A small global store (Zustand for one shared blob, Jotai for many independent atoms) for truly cross-cutting *client* state: theme, auth session, a complex editor. Still not Redux ceremony.

### Optimistic UI / perceived speed

**Do:** Update the UI instantly on user action and reconcile with the server in the background. Show skeletons/placeholders, not blank screens or spinners. On server rejection, roll back the optimistic change visibly and surface the error.

**Never:** Block the UI behind a spinner for an action that almost always succeeds, or swallow a failed reconcile so the user believes a write landed when it didn't.

**Why:** Perceived speed is a feature. Instant feedback is the difference between an app that feels alive and one that feels broken, but a silent failed write is worse than a slow one.

---

Deep frontend (component architecture, design systems, animation, accessibility internals) is out of scope here; see **The boundary**.

## 9. UI, forms & UX

### Validation

**Do:** Validate on the client for instant feedback and re-validate everything on the server as the only authority.
Share the schema (zod, Valibot) across both so the rules can't drift.

**Never:** Treat a client-side check as a security or integrity guarantee. The browser is attacker-controlled.

**Why:** Client validation is UX; server validation is correctness. Skip the server half and a curl request walks straight past your form (see security, data).

**Escape hatch:** None. Even a purely internal tool gets server validation.

### Inline errors

**Do:** Put each error next to the field that caused it, say what's wrong and how to fix it, and tie it to the input with `aria-describedby` plus a live region so screen readers hear it (see accessibility).

**Never:** Dump a single generic "Invalid input" banner at the top for a fixable field-level problem, or signal an error with red colour alone.

**Why:** A vague banner makes the user hunt for the broken field, and specificity is the whole point of validation.

### Validation timing

**Do:** Validate a field on blur once the user has finished with it, and re-check the whole form on submit. Clear a field's error the moment it becomes valid.

**Never:** Fire errors on every keystroke while someone is still typing an email or password.

**Why:** Yelling at a half-typed field reads as the form being broken, not the input.

### Preserve input

**Do:** Keep everything the user typed when validation fails, and return focus to the first bad field. On a server round-trip, echo the submitted values back into the form.

**Never:** Clear the form, reset selects, or lose a long text body because one field failed.

**Why:** Making someone retype a working answer because of an unrelated error is the fastest way to lose the submission.

### Double-submit

**Do:** Disable the submit control the instant a mutating request is in flight and re-enable it only when the request settles. Show progress on the button itself so the click clearly registered.

**Never:** Leave a live submit button during the request and rely on the user not clicking twice.
**Escape hatch:** If you can't disable in time (slow JS, no-JS fallback), the server idempotency key below is your real defence.

### Idempotent submits

**Do:** Pair every create-style submit with an idempotency key generated client-side and honoured server-side, so a retry, refresh, or double-click resolves to one record (see api).

**Never:** Assume a disabled button is enough. Network retries, the back-then-forward dance, and impatient reloads all bypass the UI.

**Why:** Disabling the button prevents the common case; idempotency prevents the duplicate charge.

### Submission feedback

**Do:** Confirm success explicitly with a toast, a redirect, or a visibly updated view, and on failure keep the data, explain what happened, and offer a retry.

**Never:** Return the user to a quiet, unchanged screen on success, or swallow a rejected request silently.

**Why:** Silence after a submit reads as failure, so people resubmit. A dead error reads as a dead app.

### The four states

**Do:** Design empty, loading, error, and success for every async view and treat them as first-class work, not afterthoughts. A view that fetches has all four.

**Never:** Ship a blank screen while loading or a spinner that can spin forever with no timeout and no error path.

**Why:** "Happy path only" is the single most common UI defect, and the other three states are where users actually live.

### Empty & error states

**Do:** Make empty states explain what goes here and offer the action that fills it. Give error states a concrete way out: a retry, a back link, a support route, never a dead end.

**Never:** Render a bare "No data" or an unhandled stack trace.

**Escape hatch:** A truly empty list inside a richer view can be a one-line hint, not a full illustrated zero-state.

### No layout shift

**Do:** Reserve space for incoming content with skeletons that match its shape, so the page doesn't jump as data arrives.
This is also what protects your CLS (the Core Web Vitals "good" target is 0.1 or below at the 75th percentile) (see performance).

**Never:** Render at zero height and let content shove everything down, and don't swap a spinner for content of a different size.

### Input types & keyboards

**Do:** Use the correct input type so mobile gets the right keyboard and the browser gets free validation: `email`, `tel`, `url`, `number`, `date`, `search`. Set `inputmode` where the type alone is wrong (a numeric PIN that isn't a `number`).

**Never:** Use a plain text box for an email or a numeric code, or a `number` input for things like phone numbers and card numbers that aren't quantities.

### Labels & autocomplete

**Do:** Give every input a real `