sorc-lab.github.io

www.sorclab.com

View on GitHub

Sorc Lab

Relational Database: Data Normalization

This article will cover the main normalization “forms”: 1NF, 2NF, 3NF, 3.5NF (Boyce-Codd Normal Form or “BCNF”), 4NF, 5NF, and 6NF.

Data normalization best practices are defined in tiers known as “Normal Forms”. Each Normal Form (NF) contains rules to achieve that form, and the next tier/form includes more rules, and also include the rules of the form before it. For example, if you comply with 2NF, that means you also much comply with 1NF, etc.

Most practical examples of a real world database only cover up to 3NF, and the rest are more theoretical, or apply to a “data warehouse”, which will be covered here. For most cases, you can stop after 3NF to get a good idea of how your data relations should work.

The following are examples of what each form requires and how they can be violated.

1NF: First Normal Form

Rule:

❌ Violation of 1NF:

Table: Orders
+------------+-------------------+-----------------------+
| OrderID    | CustomerName      | Products              |
+------------+-------------------+-----------------------+
| 101        | Alice             | Apple, Banana, Orange |
| 102        | Bob               | Milk, Bread           |
+------------+-------------------+-----------------------+

Problem: The Products column stores a list of items in a single field.

✅ Correct 1NF:

Table: Orders
+------------+-------------------+-----------+
| OrderID    | CustomerName      | Product   |
+------------+-------------------+-----------+
| 101        | Alice             | Apple     |
| 101        | Alice             | Banana    |
| 101        | Alice             | Orange    |
| 102        | Bob               | Milk      |
| 102        | Bob               | Bread     |
+------------+-------------------+-----------+

Fix: Each row has one product per record (atomic values).

2NF: Second Normal Form

Rule:

❌ Violation of 2NF:

Table: OrderDetails
Composite Key = (OrderID, ProductID)

+------------+-----------+-------------+--------------+
| OrderID    | ProductID | ProductName | CustomerName |
+------------+-----------+-------------+--------------+
| 101        | P01       | Apple       | Alice        |
| 101        | P02       | Banana      | Alice        |
| 102        | P03       | Milk        | Bob          |
+------------+-----------+-------------+--------------+

Problem:

✅ Correct 2NF:
Split into separate tables:

Orders
+------------+--------------+
| OrderID    | CustomerName |
+------------+--------------+
| 101        | Alice        |
| 102        | Bob          |
+------------+--------------+

Products
+-----------+-------------+
| ProductID | ProductName |
+-----------+-------------+
| P01       | Apple       |
| P02       | Banana      |
| P03       | Milk        |
+-----------+-------------+

OrderDetails
+------------+-----------+
| OrderID    | ProductID |
+------------+-----------+
| 101        | P01       |
| 101        | P02       |
| 102        | P03       |
+------------+-----------+

Fix: Each non-key attribute depends on the whole key, not part of it.

3NF: Third Normal Form

Rule:

❌ Violation of 3NF:

Table: Customers
Primary Key = CustomerID

+------------+--------------+-------------+----------------+
| CustomerID | CustomerName | ZipCode     | City           |
+------------+--------------+-------------+----------------+
| C01        | Alice        | 12345       | New York       |
| C02        | Bob          | 54321       | Los Angeles    |
+------------+--------------+-------------+----------------+

Problem:

✅ Correct 3NF:

Customers
+------------+--------------+-------------+
| CustomerID | CustomerName | ZipCode     |
+------------+--------------+-------------+
| C01        | Alice        | 12345       |
| C02        | Bob          | 54321       |
+------------+--------------+-------------+

ZipCodes
+-------------+----------------+
| ZipCode     | City           |
+-------------+----------------+
| 12345       | New York       |
| 54321       | Los Angeles    |
+-------------+----------------+

Fix: Now City depends on ZipCode, and ZipCode is properly factored into its own table.

Summary: 1NF, 2NF, 3NF

What is a Dependency?

If you are not familiar with the terms “dependency” or “transitive dependency”, this section might help.

Using a previous example with ZipCode inside Customers table:

Table: Customers
Primary Key = CustomerID

+------------+--------------+-------------+----------------+
| CustomerID | CustomerName | ZipCode     | City           |
+------------+--------------+-------------+----------------+
| C01        | Alice        | 12345       | New York       |
| C02        | Bob          | 54321       | Los Angeles    |
+------------+--------------+-------------+----------------+

When we say “X depends on Y”, it means:

Example: City depends on ZipCode

So whenever you know a ZipCode, you can uniquely determine the City. That’s why we say:

ZipCode → City

This is a transitive dependency if ZipCode is not a primary key in the Customers table but is instead just another attribute.

Why it matters

If you store City inside a Customers table:

By splitting ZipCodeCity into a separate table, those anomalies disappear.
The Answer: ZipCodes table:

ZipCodes
+-------------+----------------+
| ZipCode     | City           |
+-------------+----------------+
| 12345       | New York       |
| 54321       | Los Angeles    |
+-------------+----------------+

Another Way to Look at It

A City depends on a ZipCode, not the other way around. This is because a ZipCode uniquely identifies a City. If we tried to use the City as a key we would quickly realize that a City can have multiple ZipCode associated with it. Thus, a City cannot uniquely identify a ZipCode:

Los Angeles → 90001, 90002, 90003, ...

Instead, have City depend on ZipCode to uniquely identify a City:

3.5NF: Boyce-Codd Normal Form (BCNF)

Rule: For every non-trivial functional dependency X → Y, X must be a superkey. Why it exists: Sometimes 3NF still allows anomalies.

BCNF Violation

Table: Courses
+-----------+----------+----------+
| CourseID  | Teacher  | Room     |
+-----------+----------+----------+
| MATH101   | Alice    | R1       |
| CS102     | Bob      | R2       |
| ENG201    | Alice    | R1       |
+-----------+----------+----------+

Dependencies:

Here, Teacher is not a key, but it determines Room.

✅ BCNF Fix
Split into two tables:

Courses(CourseID, Teacher)
TeacherRooms(Teacher, Room)

4NF: Fourth Normal Form

Rule: No multi-valued dependencies (MVDs).

Example violation:

StudentCourseClub
+----------+---------+---------+
| Student  | Course  | Club    |
+----------+---------+---------+
| Alice    | Math    | Drama   |
| Alice    | Math    | Chess   |
| Alice    | History | Drama   |
| Alice    | History | Chess   |
+----------+---------+---------+

Alice independently takes multiple courses and joins multiple clubs.
This creates a Cartesian explosion.

Fix (4NF): Split into two tables:

StudentCourse(Student, Course)
StudentClub(Student, Club)

Rule: No multi-valued dependencies (MVDs).

4NF Violation

Table: StudentInterests
+---------+----------+----------+
| Student | Sport    | Hobby    |
+---------+----------+----------+
| Alice   | Soccer   | Chess    |
| Alice   | Soccer   | Painting |
| Alice   | Tennis   | Chess    |
| Alice   | Tennis   | Painting |
+---------+----------+----------+

4NF Fix Split into two tables:

StudentSports(Student, Sport)
StudentHobbies(Student, Hobby)

5NF: Fifth Normal Form, aka Project-Join Normal Form (PJ/NF)

Rule: No join dependencies — a table should not be reconstructable only by joining multiple tables in ways that cause redundancy.

5NF Violation

Imagine we track which Suppliers can provide which Products for which Projects:

SupplierProductProject
+----------+---------+---------+
| Supplier | Product | Project |
+----------+---------+---------+
| S1       | P1      | J1      |
| S1       | P2      | J1      |
| S1       | P1      | J2      |
| S1       | P2      | J2      |
+----------+---------+---------+

Here:

5NF Fix

Split into three separate relationships:

SupplierProduct(Supplier, Product)
ProductProject(Product, Project)
SupplierProject(Supplier, Project)

Now we can reconstruct the full relation by joins, without redundancy.

6NF: Sixth Normal Form

Rule: Every join dependency is trivial; tables are decomposed to the smallest possible granularity (often 2 columns).

6NF Violation

Table: EmployeeSalary
+---------+-----------+-------------+------------------+
| EmpID   | Salary    | Dept        | ValidFrom        |
+---------+-----------+-------------+------------------+
| E1      | 60000     | IT          | 2023-01-01       |
| E1      | 65000     | IT          | 2024-01-01       |
| E1      | 65000     | HR          | 2024-01-01       |
+---------+-----------+-------------+------------------+

6NF Fix

Break into separate tables by fact type:

EmployeeDept(EmpID, Dept, ValidFrom)
EmployeeSalary(EmpID, Salary, ValidFrom)

Now each table tracks one independent time-varying attribute.

Summary: 3.5NF/BCNF, 4NF, 5NF, 6NF

How necessary is it to check all the boxes?