Database Normalization & Denormalization

Learn how database normalization eliminates data anomalies using normal forms (1NF, 2NF, 3NF, BCNF) and how denormalization optimizes read performance in backend systems.

BeginnerDatabasesChapter: Database Systems10 min read

The Problem: Data Anomalies and Inconsistency

When designing a database schema, the way you group attributes into tables dictates how your application maintains data integrity over time. A naive, un-normalized table design that lumps customer profiles, product prices, order histories, and supplier locations into a single wide table causes structural bugs known as relational anomalies:

Update Anomaly: If a customer changes their name, the application must locate and update every order row that customer has ever placed. If the network drops or a write fails halfway, the database enters an inconsistent state where the same customer has different names in different rows.
Insertion Anomaly: If you want to insert a new product and its supplier details into the database, you cannot do so until a customer actually purchases the product, because the primary key requires an order ID.
Deletion Anomaly: If a customer cancels their order, deleting that order row from the database also deletes the customer's email, name, and supplier's physical address, losing valuable business context.

To prevent these anomalies, relational database schemas are structured using a process called normalization.

Normalization Steps and Normal Forms

Normalization systematically splits wide tables into smaller, cohesive tables linked by foreign key relationships. The process is governed by a series of rules called normal forms. Each normal form builds upon the rules of the previous one.

First Normal Form (1NF)

To satisfy 1NF, tables must contain only atomic values.

Atomic Values: A column cannot store comma-separated list values, arrays, or nested records (e.g. storing 'Laptop, Mouse' in a single items string column).
No Repeating Groups: Multiple columns cannot store the same type of data (e.g. item_1, item_2, item_3).
Primary Key: A primary key must be defined to uniquely identify each row.

Second Normal Form (2NF)

To satisfy 2NF, a table must be in 1NF, and all non-key columns must depend on the entire primary key.

Partial Dependency: If a table has a composite primary key (e.g. OrderID + ProductID), but a non-key column like CustomerEmail depends only on OrderID, that column has a partial dependency. Under 2NF, attributes depending on only a subset of a composite primary key must be split out into their own tables.

Third Normal Form (3NF)

To satisfy 3NF, a table must be in 2NF, and no non-key columns can depend on other non-key columns.

Transitive Dependency: If $X \rightarrow Y$ (X determines Y) and $Y \rightarrow Z$ (Y determines Z), then $X \rightarrow Z$ is a transitive dependency. For example, if ProductID (X) determines SupplierID (Y), and SupplierID determines SupplierCountry (Z), then SupplierCountry has a transitive dependency on ProductID. Under 3NF, these transitively dependent attributes must be isolated into a separate table (e.g. a suppliers table).

Boyce-Codd Normal Form (BCNF)

A strict version of 3NF, BCNF requires that for every functional dependency $X \rightarrow Y$, the determinant $X$ must be a candidate key (a column or set of columns that could serve as a primary key). BCNF resolves anomalies that occur in tables with multiple, overlapping composite candidate keys.

Normalization Mathematics and Decomposition

Normalization is grounded in database theory through functional dependencies: if a value in column X uniquely determines a value in column Y, we write $X \rightarrow Y$.

When spliting tables during normalization, the database engine must ensure:

Loss-less Join Decomposition: Splitting table $R$ into tables $R_1$ and $R_2$ is lossless if joining them back together reconstruction produces the exact original records without introducing fake (spurious) rows. Mathematically, this requires that the intersection of $R_1$ and $R_2$ is a candidate key for at least one of the tables.
Dependency Preservation: All functional dependencies defined on the original table must still be enforceable by applying constraints to individual normalized tables, preventing the need to join tables during write checks.

Denormalization Trade-offs

While normalized schemas guarantee data integrity, they introduce performance costs.

Join Latency: To display a customer checkout page, a 3NF schema requires joining orders, order_items, customers, products, and suppliers tables. As database sizes grow, executing complex multi-table joins uses significant CPU and memory.
Write Costs: Updating normalized data is clean, but inserting a new business event requires writing to multiple tables, updating separate B-Tree indexes, and managing distributed transaction locks.

Denormalization is the process of selectively reintroducing redundancy into a normalized schema to speed up read operations.

Dimension	Normalization (3NF)	Denormalization
Write Performance	High (single-row updates, no duplicate syncs)	Low (must update multiple locations to prevent drift)
Read Performance	Low (requires resource-heavy joins)	High (direct queries on a single table)
Storage Overhead	Minimal (no duplicated data attributes)	High (redundant copies stored on disk)
Data Integrity	Enforced at database engine schema level	Requires application-level synchronization

Relational Views vs Materialized Views

Rather than altering the physical schema to denormalize, backend systems can utilize views:

Relational Views

A standard view is a saved SQL query. When queried, it acts as a virtual table. The database engine executes the underlying query on the fly, performing necessary joins.

Benefit: Views simplify queries for developers while maintaining normalized storage on disk. However, they do not improve performance because they execute the joins on every request.

Materialized Views

A materialized view physically caches the query results on disk.

Benefit: Reading from a materialized view is extremely fast because it bypasses the join queries. However, the data can become stale. Materialized views must be refreshed explicitly (e.g., in PostgreSQL using REFRESH MATERIALIZED VIEW) or updated dynamically through write triggers or database event loops.

Prerequisites

Database Indexes

Code Examples

Core Literature References

A Relational Model of Data for Large Shared Data Banks

by E. F. Codd — Communications of the ACM, pp. 377-387

View source

Continue learning

ACID & Isolation Levels

Deep dive into database transaction guarantees, isolation levels, concurrency anomalies like write skew, and control mechanisms such as MVCC, 2PL, and SSI.

API Gateways

Understand the API Gateway pattern as the central ingress point for microservices, handling routing, auth, rate limiting, and protocol translation.

API Security & OAuth 2.0

Understand API authentication and authorization mechanisms, JWT security, and the OAuth 2.0 framework including Authorization Code Flow with PKCE.