Theory of Normalization Theory

Normal Forms and Theory of Normalization

by K. Yue

1. Normal Forms Using Functional Dependencies

1.1 First Normal Form

A relation is in 1NF if all attribute values are atomic: no repeating group, no composite attributes, no internal structures.
Semantically, an attribute is atomic if it cannot be broken down to smaller pieces with individual meanings.
Theoretically, a relation may only have atomic attributes. Thus, all pure 'relations' satisfy 1NF.
In practice, DBMS may allow data types with composite values and internal structures, e.g. set, JSON, spatial, XML, etc.

Example

Consider the following table with 3 records. It is not in 1 NF.

The value "10000, 12000, 13000" of the field EmpIds can be broken into three components, "10000", "12000", and "13000" with individual meanings.
The same is true for the field Names.
Note the plural forms of EmpIds and Names.

DeptId	ManagerId	EmpIds	Names
D123	110	10000, 12000, 13000	Lady Gaga, Eminem, Lebron James
D225	440	21000, 22000	Rajiv Gandhi, Bill Clinton
D337	300	31000	John Smithson

An alternate design of the relation in 1NF is shown below. The following instance has six rows.

DeptId	ManagerId	EmpId	Name
D123	110	10000	Lady Gaga
D123	110	12000	Eminem
D123	110	13000	Lebron James
D225	440	21000	Rajiv Gandhi
D225	440	22000	Bill Clinton
D337	300	31000	John Smithson

Why atomic?
- Relational theory and operations treat attributes as atomic.
- No need to worry about the correctness and consistency of internal structures.
Relations not satisfying 1NF have undesirable redundancy and anomalies.

Example

Consider the tuple (EmpId: 12345, OSSkills: {Windows, Linux, Solaris}).

It will be difficult to identify all Employees with Linux skills.
It will be difficult to join other tables using OSSkills.
Data entry problems and issues, e.g. Linux linux, linx, etc., may further degrade data quality and introduce inconsistency.
On the other hand, relations may be in Non-First Normal Form (NFNF of NF²), mainly for performance considerations. E.g., NoSQL DB.

1.2 Second Normal Form

A relation R is in 2NF if
1. R is in 1NF, and
2. all non-prime attributes are fully dependent on the candidate keys.
Review: A prime attribute (also called key attribute) appears in one or more candidate keys. Otherwise, it is a non-prime (non-key) attribute. Note that a relation may have many candidate keys.
There is no partial dependency in 2NF.

Example

The following relation is not in 2NF. (Assume that the number of credits of a given course does not change). Note the redundancy and anomalies.

Enroll(Course, Credit, Student, Grade)

CourseId	Credit	StudentId	Grade
C1	3	S1	A
C1	3	S2	B
C1	3	S3	B
C2	2	S1	A
C2	2	S4	D

We assume the following FD.

CourseId -> Credit
CourseId, StudentId -> Grade

Thus,

{CourseId, StudentId} is the only candidate key.
Prime attributes: CourseId, StudentId
Non-prime attribute: Credit, Grade.
FD (1), CourseId -> Credit, is a partial dependency that violates 2NF.
- CourseId (a proper subset of a CK) -> Credit (a non-prime attribute)

To convert to relations in 2NF, decompose Enroll into

Enroll(CourseId, StudentId, Grade) {CourseId, StudentId -> Grade}
Class(CourseId, CreditId) {CourseId -> Credit}

Both tables are in 2NF.

{CourseId -> Credit} violates 2NF in Enroll(CourseId, Credit, StudentId, Grade)

CourseId is a proper subset of a CK (CourseId, StudentId) in Enroll

{CourseId -> Credit} does not violate 2NF in Class(CourseId, Credit)
- CourseId is a CK in Class. It is not a proper subset of a CK.

Example from Ricardo:

NewClass(courseNo, stuId, stuLastName, facId, schedule, room, grade). We have:

courseNo, stuId -> grade
stuId -> stuLastName
courseNo -> facId, schedule, room

StuId -> stuLastName and courseNo -> facId, schedule, room violate 2NF

To convert to 2NF, decomposition:

Course(courseNo, facId, schedule, room) { courseNo -> facId, schedule, room } The FD is no longer violating 2NF in the new table Course since courseNo is a CK in Course.
Student(stuId, stuLastName) { StuId -> stuLastName } The FD is no longer violating 2NF in the new Student table since StuId is a CK in Student
Enroll(courseNo, stuId, grade) { courseNo, stuId -> grade }

1.3 Third Normal Form

(New definition) A relation R is said to be in the third normal form if for every non-trivial functional dependency X -> A,
1. X is a superkey, or
2. A is a prime (key) attribute.
(Old definition: included for historical reason. Do not use it for normalization analysis.) A relation R is in 3NF if
1. R is in 2NF, and
2. There is no transitive dependency of non-prime attributes on the candidate keys.
Review: A superkey S of a relation R satisfies S -> R (uniqueness). It does not need to satisfy the minimal property.
3NF can identify unnecessary redundancy that 2NF cannot identify.
3NF still cannot eliminate all redundancies due to functional dependencies.

Example

The following relationis in 2NF, but is not in 3NF.

DeptId	ManagerId	EmpId	Name
D123	110	10000	Lady Gaga
D123	110	12000	Eminem
D123	110	13000	Lebron James
D225	205	21000	Rajiv Gandhi
D225	205	22000	Bill Clinton
D337	333	31000	John Smithson

If we assume the following canonical set of FDs:
1. EmpId -> Name, DeptId
2. DeptId -> ManagerId
then
- There is only one candidate key: EmpId
- Prime attribute: EmpId
- Non-prime attributes: Name, DeptId, ManagerId.
- The relation is in 2NF.
The relation is not in 3NF because:
- EmpId is the only candidate key.

EmpId is prime.
DeptId and ManagerId are non-prime.
DeptId -> ManagerId violates 3NF because:
1. DeptId is not a SK.
2. ManagerId is non-prime.

To resolve, decompose the relation into:
1. Department(DeptId, MangaerId) { DeptId -> ManagerId }
2. Employee(EmpId, Name, DeptId) { EmpId -> Name, DeptId }

Example

Consider the relation R(CITY, STREET, ZIP) with the FDs:

CITY STREET -> ZIP, and
ZIP -> CITY.

There are two candidate keys:

CITY STREET, and
ZIP STREET

Hence, all attributes are prime attributes, and the relation is in both 2NF and 3NF.

Note that a relation such as Employee(EmePId, Name, Street, City, Zip, State) is not in 3NF.

This is a classical example you can find in many database textbooks. Note that the two FDs may not actually be correct in the United States.

3NF does not eliminate all redundancies due to functional dependencies.

1.4 BCNF (Boyce-Codd Normal Form)

A relation R is said to be in BCNF if for every non-trivial functional dependency X -> Y in R, X is a superkey.

Example

Consider the relation

S(SId, PId, SName, Quantity) with the following assumptions:

SId is unique for every supplier.
SName is unique for every supplier.
Quantity is the accumulated quantities of a part supplied by a supplier. Given a supplier and a part, the Quantity is unique.
A supplier can supply more than one part.
A part can be supplied by more than one supplier.

We have the following non-trivial FD:

SId -> SName
SName -> SId
SId PId -> Quantity
SName PId -> Quantity

Note that SId and SName are equivalent.

The candidate keys are:

SId PId
SName PId

Prime attributes: SId, PId, SName

Non-prime attribute: Quantity.

The candidate keys are:

SId PId
SName PId

Prime attributes: SId, PId, SName

Non-prime attribute: Quantity.

The relation is in 3NF. Note:

SId -> SName does not violate 3NF as SName is prime.
SName -> SId does not violate 3NF as SId is prime.
SId PId -> Quantity does not violate 3NF as {SId, PId} is a CK and also a SK.
SName PId -> Quantity does not violate 3NF as {SName, PId} is a CK and also a SK.

However, there are unnecessary redundancy.

SId	SName	PId	Quantity
S1	ABC	P1	10
S1	ABC	P2	20
S1	ABC	P3	21
S2	DEF	P1	40
S2	DEF	P4	13
S3	XYK	P3	18

Thus, 3NF does not detect all design problems using FD.

However, S is not in BCNF because, for example, the functional dependency

SId -> SName is

non-trivial, and
SId is not a superkey.

To deal with it, we can decompose S(SId, PId, SName, Quantity) into

(1) Supplier(SId, SName) with

SId -> SName
SName -> SId

with two candidate keys:

SId
SName

(2) Supply(SId, PId, Quantity) with

SId, PId -> Quantity.

Both are in BCNF.

Example:

Consider the relation R(A, B, C, D) with

A -> B, B -> C, C -> A and C -> D.

There are three candidate keys:

Since every left hand side of any non-trivial functional dependency is a superkey, R is in BCNF.

1.5 Checking Highest Normal Form by Violations

To find the highest normal form for a relation R, check every non-trivial FD X->Y of R for violation.

Note that in the table below, A is a single attribute. Use the decomposition rule if necessary. For example, for AB->CD, check AB->C and AB->D.

Normal Form's Violation	Non-trivial FD X -> A
2NF	(1) X is a proper subset of a candidate key of R, and (2) A is a non-prime attribute.
3NF	(1) X is not a superkey of R, and (2) A is a non-prime attribute.
BCNF	X is not a superkey.

If there is no violation of a normal form, then R is in that normal form.
If there is one violation of a normal form, then R is not in that normal form.

Example:

Consider R(A,B,C,D) {A->B, B->AC, C->D}

Using decomposition rule, we have {A->B, B->A, B->C, C->D}

We find two CK: [1] A, [2] B
Prime attributes: A, B
Non-prime attributes: C, D

Checking for violation:

FD	Ok with 2NF	Ok with 3NF	OK with BCNF
A->B	Yes	Yes	Yes
B->A	Yes	Yes	Yes
B->C	Yes	Yes	Yes
C->D	Yes	No	No

Thus, the highest NF is 2NF

1.6 Motivation of BCNF

The purpose of BCNF is to eliminate any unnecessary redundancy that FD can create in a relation.
- In a BCNF relation, no value can be predicted from any sets of non-unique attributes, using only FD.
- This is because in a BCNF relation, using FD only,
  - any attribute value can only be determined by a superkey,
  - but the superkey is unique.
- However, there are other type of dependencies.
- Therefore, there are higher normal forms.

Example

Consider the relation R(CITY, ZIP, STREET) again

Using the code for the postal office, we have

CITY STREET -> ZIP, and ZIP -> CITY.

Hence, there are two candidate keys:

CITY STREET, and
ZIP STREET

Therefore, R is not in BCNF since in ZIP -> CITY, ZIP is not a superkey.

However, if we decompose R into two relations, each with two attributes, then the FD

CITY STREET -> ZIP is lost (i.e. cannot be assured within a single relation)

Therefore, we better leave the relation alone.

Sometimes it is not possible for a relation to be in BCNF => need to settle in a less strict normal form (3NF).

1.7 Normalization Theory Using Functional Dependencies

To use the theory on FD:
1. For a relation of a set of attributes, we analyze the assumptions of the applications.
2. From the assumptions, we obtain a set of FDs.
3. Find a canonical cover of the set of FDs.
4. Find all candidate keys, prime and non-prime attributes.
5. If the relation is not in BCNF, we perform decomposition.
6. If BCNF cannot be satisfied, we aim for 3NF.

Example

Consider the following relation:

Supply(SupplierId, SupplierName, ProductId, ProductDesc, Quantity, ArrivalTime)

The relation stores the quantities and arrival times of shipments of products (identified by ProductId) from suppliers (Identified by SupplierId). A supplier may not have a unique name. Furthermore, the product description, ProductDesc, may be the same for two products. A supplier may supply the same product many times, each with a different ArrivalTime.

The functional dependencies (FD) of the relation:

SupplierId -> SupplierName
ProductId -> ProductDesc
SuplierId, ProductId, ArrivalTime -> Quantity

CK: {SupplierId, ProductId, ArrivalTime}

Non-prime attributes: SupplierName, ProductDesc, Quantity

Highest Normal Form: 1NF

SupplierId -> SupplierName violates 2NF since

SupplierId is a part (proper subset) of a candidate key, i.e., {SupplierId} ⊂ {SupplierId, ProductId, ArrivalTime}, and
Quantity is non-prime.

Not that A ⊂ B means that A is a proper subset of B.

2. Decomposition

Decomposition is a major tool for constructing relations satisfying high enough normal forms.
Decomposition should be disciplined:
1. More relations may be less efficient in storage.
2. More relations may be less efficient in executing queries.
More importantly, some decompositions are harmful:
1. Lossy decompositions.
2. Decompositions that do not preserve dependencies.
Hence, it is important to have lossless dependency-preserving decomposition (good decomposition).

2.1 Lossy Decomposition

Example:

Consider the relation Emp(EmpId, DeptId, ManagerId) with

EmpId -> DeptId
DeptId -> ManagerId

Note that we do not have ManagerId -> DeptId in this example, since this organization allows a manager to manage more than one Departments. Note that ManagerId 90000 manages two Departments.

EmpId	DeptId	ManagerId
E1	ACCT	M3
E2	HR	M3
E3	ENG	M6
E4	ENG	M6

The relation is not in 3NF because of the FD

DeptId (not a SK) -> ManagerId (non-prime)

Suppose we decompose the relation into

Emp1(EmpId, ManagerId)
Dept(DeptId, ManagerId)

The common attribute for the component relations is ManagerId. The relations are obtained by projections from Emp:

Emp1:

EmpId	ManagerId
E1	M3
E2	M3
E3	M6
E4	M6

Dept:

DeptId	ManagerId
ACCT	M3
HR	M3
ENG	M6

If we do not lose any information by the decomposition, we should get the original relation using the natural join.

However, Emp1 |x| Dept is

EmpId	DeptId	ManagerId
E1	ACCT	M3
E1	HR	M3
E2	ACCT	M3
E2	HR	M3
E3	ENG	M6
E4	ENG	M6

This is not the same as the original relation Emp. Spurious rows are incorrectly included in the result.

Hence, the decomposition of Emp(EmpId, DeptId, ManagerId) into

Emp1(EmpId, ManagerId) and
Dept(DeptId, ManagerId)

is lossy. It is not a good decomposition.

2.2 Lossless Decomposition

Example:

Consider now the following decomposition of Emp(EmpId, DeptId, ManagerId):

Emp2(EmpId, DeptId) and
Emp3(EmpId, ManagerId)

The common attribute is EmpId. We have Emp2 and Emp3:

Emp2:

EmpId	DeptId
E1	ACCT
E2	HR
E3	ENG
E4	ENG

Emp3:

EmpId	ManagerId
E1	M3
E2	M3
E3	M6
E4	M6

Hence, Emp2 |x| Emp3:

EmpId	DeptId	ManagerId
E1	ACCT	M3
E2	HR	M3
E3	ENG	M6
E4	ENG	M6

This is the same as the original relation Emp. Therefore, the decomposition does not lose any information. It is a lossless decomposition.

Definition. A decomposition is lossless if the natural joins of the component relations result in the original relation. Otherwise, it is lossy.

2.3 Theory of Lossless Decomposition

Example:

Why is the decomposition of Emp(EmpId, Dept, ManagerId) into

(1) Emp1(EmpId, ManagerId) and Dept(DeptId, ManagerId) lossy, and

(2) Emp2(EmpId, DeptId) and Emp3(EmpId, ManagerId) lossless?

Theorem: Suppose R(X, Y, Z) is decomposed into R1(X, Y) and R2(X, Z). X is the set of common attributes in R1 and R2. The decomposition is lossless if and only if

(a) X -> Y, or
(b) X -> Z.

Example:

In case (1), X is ManagerId, Y is EmpId, Z is DeptId.

Neither condition (a) nor (b) is satisfied. Hence, (1) is lossy.

In case (2), X is EmpId, Y is DeptId, Z is ManagerId.

Both conditions (a) and (b) are satisfied. Hence, (2) is lossless.

For decompositions into more than two relations, use the chase matrix algorithm, which is not covered in this course.

2.4 Dependency-Preserving Decomposition

Example:

For the relation Emp(EmpId, DeptId, ManagerId) with

EmpId -> DeptId
DeptId -> ManagerId,

The decomposition of Emp into

Emp2(EmpId, DeptId) and
Emp3(EmpId, ManagerId)

is lossless but it does not preserve dependencies:

the FD DeptId -> ManagerId

cannot be assured within a single relation after the decomposition. No relation contains both attributes.

For example, if we add the information Emp E6 work in the ACCT Department under manager M9 carelessly, we may have the following table.

Since ACCT has M3 as manager already, it cannot also have M9 as its manager.

Emp2:

EmpId	DeptId
E1	ACCT
E2	HR
E3	ENG
E4	ENG
E6	ACCT

Emp3:

EmpId	ManagerId
E1	M3
E2	M3
E3	M6
E4	M6
E6	M9

As a result, the FD DeptId -> ManagerId is violated.

Department with DeptId ACCT has two ManagerId
1. M3 (via EmpId E1)
2. M9 (via EmpId E6)

Thus, for the relation Emp(EmpId, DeptId, ManagerId) with

EmpId -> DeptId
DeptId -> ManagerId,

the best decomposition is

Emp1(EmpId, DeptId) and
Dept(DeptId, ManagerId)

It is easy to show that, the decomposition is lossless, preserves dependencies, and that Emp1 and Dept are both in BCNF.

2.4.2 Decomposition Algorithms

It is possible to decompose a relation such that
1. all member relations are in 3NF,
2. the decomposition is lossless, and
3. all FDs are preserved.
It is also possible to decompose a relation such that
1. all member relations are in BCNF, and
2. the decomposition is lossless, but
3. not all FDs may be preserved.

2.5 Algorithm for decomposition into 3NF relations

There are many algorithms for decomposition.
We will not cover the details of the algorithms, but they are illustrated by the example below.
In particular, the following examples show the steps of an lossless, FD preserving algorithm that guarantees 3NF.

Example:

Consider R(A,B,C,D,E) with F = {A->BC, CD -> E, BA -> C, D->B}.

Step 1. Find a canonical cover G for F.

The FD BA->C is redundant.

G = {A->BC, CD -> E, D->B}.

We may perform normalization analysis to see whether decomposition is necessary.

L/NR: A, D
M: C
R: B, E

We have: AD+ = AD BC E

Thus, CK: [1] AD
prime: A, D
non-prime: B, C, E

Normalization analysis:

Non-trivial FD	2NF	3NF	BCNF
A -> B: [1] A ⊂ AD, [2] A is not a SK, [3] B is non-prime	violate	violate	violate
A -> C: [1] A ⊂ AD, [2] A is not a SK, [3] C is non-prime	violate	violate	violate
CD -> E: [1] CD ⊄ AD, [2] C is not a SK, [3] E is non-prime	ok	violate	violate
D -> B: [1] D ⊂ AD, [2] D is not a SK, [3] B is non-prime	violate	violate	violate

Thus, the highest normal form of R is 1NF. Decomposition is necessary.

Step 2. For every FD X->Y in G, create a relation with the schema XY and add it to the result D. This step preserves FD and resolves NF violations.

Relations created:

R1(A,B,C) with A->BC
R2(C,D,E) with CD->E
R3(B,D) with D->B

It can be seen very easily that R1, R2 and R3 are all in 3NF and BCNF. Furthermore, all FDs are preserved.

Step 3. If no relation in D contains a candidate key of R, create a new relation with a candidate key of R being the schema, and add it to the result D. This step assures losslessness.

There is only one candidate key of R: AD. Since none of R1, R2 and R3 contains AD, create the relation

R4(A,D) with no FD: {}

Step 4. Simplify the decomposition D by removing relations that are redundant (i.e. that its schema is a subset of the schema of another relation).

No action as there is no redundant relation.

The result relations are all in BCNF.

Example:

Consider R(A,B,C,D,E) with {A->BCD, BC->D, D->C}

Using the algorithm,

(1) Canonical cover: {A->BC, BC->D, D->C}; A->D is removed since it is a redundant FD.

(2) The following relations are created:

R1(A,B,C) with {A-> BC},
R2(B,C,D) with {BC->D, D->C},
R3(C,D) with {D->C}

(3) There is only one candidate key AE. Since it is not in any of R1, R2 or R3, R4 is created.

R4(A,E)

(4) R3(C,D) is removed as redundant.

As in result, we have:

R1(A,B,C) with {A-> BC}, in BCNF
R2(B,C,D) with {BC->D, D->C}, in 3NF but not in BCNF
R4(A,E) with {}, in BCNF

There are other decomposition algorithms.
Sometimes, it is not possible to decompose a relation into two relations losslessly and preserve all FD, just to achieve BCNF.

Example:

Consider the relation R(A, B, C) with A -> B and C -> B.

R is not in 2NF. It is not possible to decompose R into two relations losslessly while preserving all functional dependencies.

However, it is possible to decompose into three BCNF relations losslessly and with all functional dependencies preserved:

R1(A, B),
R2(B, C) and
R3(A, C).

Consider the relation R(A, B, C) with A -> B and BC -> A.

R is not in BCNF. It is not possible to decompose R into BCNF relations losslessly while preserving all FD.