Normal Forms and Theory of Normalization

by K. Yue

1. Normal Forms Using Functional Dependencies

1.1 First Normal Form

Example

Consider the following table with 3 records. It is not in 1 NF.

DeptId ManagerId EmpIds Names
D123 110 10000, 12000, 13000 Lady Gaga, Eminem, Lebron James
D225 440 21000, 22000 Rajiv Gandhi, Bill Clinton
D337 300 31000 John Smithson

An alternate design of the relation in 1NF is shown below. The following instance has six rows.

DeptId ManagerId EmpId Name
D123 110 10000 Lady Gaga
D123 110 12000 Eminem
D123 110 13000 Lebron James
D225 440 21000 Rajiv Gandhi
D225 440 22000 Bill Clinton
D337 300 31000 John Smithson

Example

Consider the tuple (EmpId: 12345, OSSkills: {Windows, Linux, Solaris}).

1.2 Second Normal Form

Example

The following relation is not in 2NF.  (Assume that the number of credits of a given course does not change). Note the redundancy and anomalies.

Enroll(Course, Credit, Student, Grade)

CourseId Credit StudentId Grade
C1 3 S1 A
C1 3 S2 B
C1 3 S3 B
C2 2 S1 A
C2 2 S4 D

We assume the following FD.

  1. CourseId -> Credit
  2. CourseId, StudentId -> Grade

Thus,

  1. {CourseId, StudentId} is the only candidate key.
  2. Prime attributes: CourseId, StudentId
  3. Non-prime attribute: Credit, Grade.
  4. FD (1), CourseId -> Credit, is a partial dependency that violates 2NF.

To convert to relations in 2NF, decompose Enroll into

  1. Enroll(CourseId, StudentId, Grade) {CourseId, StudentId -> Grade}
  2. Class(CourseId, CreditId) {CourseId -> Credit}

Both tables are in 2NF.

  1. {CourseId -> Credit} violates 2NF in Enroll(CourseId, Credit, StudentId, Grade)
  2. {CourseId -> Credit} does not violate 2NF in Class(CourseId, Credit)

Example from Ricardo:

NewClass(courseNo, stuId, stuLastName, facId, schedule, room, grade). We have:

courseNo, stuId -> grade
stuId -> stuLastName
courseNo -> facId, schedule, room

StuId -> stuLastName and courseNo -> facId, schedule, room violate 2NF

To convert to 2NF, decomposition:

  1. Course(courseNo, facId, schedule, room) { courseNo -> facId, schedule, room } The FD is no longer violating 2NF in the new table Course since courseNo is a CK in Course.
  2. Student(stuId, stuLastName) { StuId -> stuLastName } The FD is no longer violating 2NF in the new Student table since StuId is a CK in Student
  3. Enroll(courseNo, stuId, grade) { courseNo, stuId -> grade }

1.3 Third Normal Form

Example

DeptId ManagerId EmpId Name
D123 110 10000 Lady Gaga
D123 110 12000 Eminem
D123 110 13000 Lebron James
D225 205 21000 Rajiv Gandhi
D225 205 22000 Bill Clinton
D337 333 31000 John Smithson

Example

Consider the relation R(CITY, STREET, ZIP) with the FDs:

  1. CITY STREET -> ZIP, and
  2. ZIP -> CITY.

There are two candidate keys:

  1. CITY STREET, and
  2. ZIP STREET

Hence, all attributes are prime attributes, and the relation is in both 2NF and 3NF.

Note that a relation such as Employee(EmePId, Name, Street, City, Zip, State) is not in 3NF.

This is a classical example you can find in many database textbooks. Note that the two FDs may not actually be correct in the United States.

1.4 BCNF (Boyce-Codd Normal Form)

Example

Consider the relation

S(SId, PId, SName, Quantity) with the following assumptions:

  1. SId is unique for every supplier.
  2. SName is unique for every supplier.
  3. Quantity is the accumulated quantities of a part supplied by a supplier. Given a supplier and a part, the Quantity is unique.
  4. A supplier can supply more than one part.
  5. A part can be supplied by more than one supplier.

We have the following non-trivial FD:

  1. SId -> SName
  2. SName -> SId
  3. SId PId -> Quantity
  4. SName PId -> Quantity

Note that SId and SName are equivalent.

The candidate keys are:

  1. SId PId
  2. SName PId

Prime attributes: SId, PId, SName

Non-prime attribute: Quantity.

The candidate keys are:

  1. SId PId
  2. SName PId

Prime attributes: SId, PId, SName

Non-prime attribute: Quantity.

The relation is in 3NF. Note:

  1. SId -> SName does not violate 3NF as SName is prime.
  2. SName -> SId does not violate 3NF as SId is prime.
  3. SId PId -> Quantity does not violate 3NF as {SId, PId} is a CK and also a SK.
  4. SName PId -> Quantity does not violate 3NF as {SName, PId} is a CK and also a SK.

However, there are unnecessary redundancy.

SId SName PId Quantity
S1 ABC P1 10
S1 ABC P2 20
S1 ABC P3 21
S2 DEF P1 40
S2 DEF P4 13
S3 XYK P3 18

Thus, 3NF does not detect all design problems using FD.

However, S is not in BCNF because, for example, the functional dependency 

SId -> SName is

  1. non-trivial, and
  2. SId is not a superkey.

To deal with it, we can decompose S(SId, PId, SName, Quantity) into

(1) Supplier(SId, SName) with  

SId -> SName
SName -> SId

with two candidate keys:

  1. SId
  2. SName

(2) Supply(SId, PId, Quantity)  with 

SId, PId -> Quantity.

Both are in BCNF.

Example: 

Consider the relation R(A, B, C, D) with

A -> B,  B -> C, C -> A and C -> D.

There are three candidate keys:

  1. A
  2. C

Since every left hand side of any non-trivial functional dependency is a superkey, R is in BCNF.

1.5 Checking Highest Normal Form by Violations

To find the highest normal form for a relation R, check every non-trivial FD X->Y of R for violation.

Normal Form's Violation Non-trivial FD X -> A
2NF (1) X is a proper subset of a candidate key of R, and
(2) A is a non-prime attribute.
3NF (1) X is not a superkey of R, and
(2) A is a non-prime attribute.
BCNF X is not a superkey.

Example:

Consider R(A,B,C,D) {A->B, B->AC, C->D}

Using decomposition rule, we have {A->B, B->A, B->C, C->D}

We find two CK: [1] A, [2] B
Prime attributes: A, B
Non-prime attributes: C, D

Checking for violation:

FD Ok with 2NF Ok with 3NF OK with BCNF
A->B Yes Yes Yes
B->A Yes Yes Yes
B->C Yes Yes Yes
C->D Yes No No

Thus, the highest NF is 2NF

1.6 Motivation of BCNF

Example 

Consider the relation R(CITY, ZIP, STREET) again
       
Using the code for the postal office, we have

CITY STREET -> ZIP, and ZIP -> CITY.

Hence, there are two candidate keys:

  1. CITY STREET, and
  2. ZIP STREET

Therefore, R is not in BCNF since in ZIP -> CITY, ZIP is not a superkey.

However, if we decompose R into two relations, each with two attributes, then the FD

CITY STREET -> ZIP is lost (i.e. cannot be assured within a single relation)

Therefore, we better leave the relation alone.

1.7 Normalization Theory Using Functional Dependencies

Example

Consider the following relation:

Supply(SupplierId, SupplierName, ProductId, ProductDesc, Quantity, ArrivalTime)

The relation stores the quantities and arrival times of shipments of products (identified by ProductId) from suppliers (Identified by SupplierId). A supplier may not have a unique name. Furthermore, the product description, ProductDesc, may be the same for two products. A supplier may supply the same product many times, each with a different ArrivalTime.

The functional dependencies (FD) of the relation:

SupplierId -> SupplierName
ProductId -> ProductDesc
SuplierId, ProductId, ArrivalTime -> Quantity

CK:  {SupplierId, ProductId, ArrivalTime}

Non-prime attributes: SupplierName, ProductDesc, Quantity

Highest Normal Form: 1NF

SupplierId -> SupplierName violates 2NF since

  1. SupplierId is a part (proper subset) of a candidate key, i.e., {SupplierId} ⊂ {SupplierId, ProductId, ArrivalTime}, and
  2. Quantity is non-prime.

Not that A ⊂ B means that A is a proper subset of B.

2. Decomposition

2.1 Lossy Decomposition

Example:

Consider the relation Emp(EmpId, DeptId, ManagerId) with 

EmpId ->  DeptId
DeptId ->  ManagerId

Note that we do not have ManagerId -> DeptId in this example, since this organization allows a manager to manage more than one Departments. Note that ManagerId 90000 manages two Departments.

EmpId DeptId ManagerId
E1 ACCT M3
E2 HR M3
E3 ENG M6
E4 ENG M6

          
The relation is not in 3NF because of the FD

DeptId (not a SK) -> ManagerId (non-prime)

Suppose we decompose the relation into

Emp1(EmpId, ManagerId)
Dept(DeptId, ManagerId)

The common attribute for the component relations is ManagerId. The relations are obtained by projections from Emp:

Emp1:                     

EmpId ManagerId
E1 M3
E2 M3
E3 M6
E4 M6

Dept:

DeptId ManagerId
ACCT M3
HR M3
ENG M6

If we do not lose any information by the decomposition, we should get the original relation using the natural join.

However,  Emp1 |x| Dept is     

EmpId DeptId ManagerId
E1 ACCT M3
E1 HR M3
E2 ACCT M3
E2 HR M3
E3 ENG M6
E4 ENG M6

         
This is not the same as the original relation Emp. Spurious rows are incorrectly included in the result.

Hence, the decomposition of Emp(EmpId, DeptId, ManagerId) into
 
Emp1(EmpId, ManagerId) and
Dept(DeptId, ManagerId)

is lossy.  It is not a good decomposition.

2.2 Lossless Decomposition

Example:

Consider now the following decomposition of Emp(EmpId, DeptId, ManagerId):

Emp2(EmpId, DeptId)  and
Emp3(EmpId, ManagerId)

The common attribute is EmpId. We have Emp2 and Emp3:

Emp2:                     

EmpId DeptId
E1 ACCT
E2 HR
E3 ENG
E4 ENG

Emp3:

EmpId ManagerId
E1 M3
E2 M3
E3 M6
E4 M6

Hence, Emp2 |x| Emp3:

EmpId DeptId ManagerId
E1 ACCT M3
E2 HR M3
E3 ENG M6
E4 ENG M6

This is the same as the original relation Emp.  Therefore, the decomposition does not lose any information.  It is a lossless decomposition.

Definition. A decomposition is lossless if the natural joins of the component relations result in the original relation. Otherwise, it is lossy.

2.3 Theory of Lossless Decomposition

Example:

Why is the decomposition of Emp(EmpId, Dept, ManagerId) into

(1) Emp1(EmpId, ManagerId) and Dept(DeptId, ManagerId) lossy, and

(2) Emp2(EmpId, DeptId) and Emp3(EmpId, ManagerId) lossless?

Theorem: Suppose R(X, Y, Z) is decomposed into R1(X, Y) and R2(X, Z).  X is the set of common attributes in R1 and R2.  The decomposition is lossless if and only if

(a) X -> Y, or
(b) X -> Z.

Example:

In case (1), X is ManagerId, Y is EmpId, Z is DeptId.

Neither condition (a) nor (b) is satisfied.  Hence, (1) is lossy.

In case (2), X is EmpId, Y is DeptId, Z is ManagerId.

Both conditions (a) and (b) are satisfied.  Hence, (2) is lossless.

2.4 Dependency-Preserving Decomposition

Example:                        

For the relation Emp(EmpId, DeptId, ManagerId) with 

EmpId ->  DeptId
DeptId ->  ManagerId,

The decomposition of Emp into

Emp2(EmpId, DeptId)  and
Emp3(EmpId, ManagerId)

is lossless but it does not preserve dependencies:

the FD  DeptId -> ManagerId

cannot be assured within a single relation after the decomposition. No relation contains both attributes.

For example, if we add the information Emp E6 work in the ACCT Department under manager M9 carelessly, we may have the following table.

Emp2:                     

EmpId DeptId
E1 ACCT
E2 HR
E3 ENG
E4 ENG
E6 ACCT

Emp3:

EmpId ManagerId
E1 M3
E2 M3
E3 M6
E4 M6
E6 M9

As a result, the FD  DeptId ->  ManagerId is violated.

Thus, for the relation Emp(EmpId, DeptId, ManagerId) with 

EmpId ->  DeptId
DeptId ->  ManagerId,

the best decomposition is

Emp1(EmpId, DeptId)  and
Dept(DeptId, ManagerId)

It is easy to show that, the decomposition is lossless, preserves dependencies, and that Emp1 and Dept are both in BCNF.

2.4.2 Decomposition Algorithms

  1. It is possible to decompose a relation such that
    1. all member relations are in 3NF,
    2. the decomposition is lossless, and
    3. all FDs are preserved.
  2. It is also possible to decompose a relation such that
    1. all member relations are in BCNF, and
    2. the decomposition is lossless, but
    3. not all FDs may be preserved.

2.5 Algorithm for decomposition into 3NF relations

Example:

Consider R(A,B,C,D,E) with F = {A->BC, CD -> E, BA -> C, D->B}.

Step 1. Find a canonical cover G for F.

The FD BA->C is redundant.

G = {A->BC, CD -> E, D->B}.

We may perform normalization analysis to see whether decomposition is necessary.

L/NR: A, D
M: C
R: B, E

We have: AD+ = AD BC E

Thus, CK: [1] AD
prime: A, D
non-prime: B, C, E

Normalization analysis:

Non-trivial FD 2NF 3NF BCNF
A -> B: [1] A ⊂ AD, [2] A is not a SK, [3] B is non-prime violate violate violate
A -> C: [1] A ⊂ AD, [2] A is not a SK, [3] C is non-prime violate violate violate
CD -> E: [1] CD ⊄ AD, [2] C is not a SK, [3] E is non-prime ok violate violate
D -> B: [1] D ⊂ AD, [2] D is not a SK, [3] B is non-prime violate violate violate

Thus, the highest normal form of R is 1NF. Decomposition is necessary.

Step 2. For every FD X->Y in G, create a relation with the schema XY and add it to the result D. This step preserves FD and resolves NF violations.

Relations created:

R1(A,B,C) with A->BC
R2(C,D,E) with CD->E
R3(B,D) with D->B

It can be seen very easily that R1, R2 and R3 are all in 3NF and BCNF. Furthermore, all FDs are preserved.

Step 3. If no relation in D contains a candidate key of R, create a new relation with a candidate key of R being the schema, and add it to the result D. This step assures losslessness.

There is only one candidate key of R: AD. Since none of R1, R2 and R3 contains AD, create the relation

R4(A,D) with no FD: {}

Step 4. Simplify the decomposition D by removing relations that are redundant (i.e. that its schema is a subset of the schema of another relation).

No action as there is no redundant relation.

The result relations are all in BCNF.

Example:

Consider R(A,B,C,D,E) with {A->BCD, BC->D, D->C}

Using the algorithm,

(1) Canonical cover: {A->BC, BC->D, D->C}; A->D is removed since it is a redundant FD.

(2) The following relations are created:

R1(A,B,C) with {A-> BC},
R2(B,C,D) with {BC->D, D->C},
R3(C,D) with {D->C}

(3) There is only one candidate key AE. Since it is not in any of R1, R2 or R3, R4 is created.

R4(A,E)

(4) R3(C,D) is removed as redundant.

As in result, we have:

R1(A,B,C) with {A-> BC}, in BCNF
R2(B,C,D) with {BC->D, D->C}, in 3NF but not in BCNF
R4(A,E) with {}, in BCNF

Example:

Consider the relation R(A, B, C) with A -> B and C -> B.

R is not in 2NF.  It is not possible to decompose R into two relations losslessly while preserving all functional dependencies.

However, it is possible to decompose into three BCNF relations losslessly and with all functional dependencies preserved:

R1(A, B),
R2(B, C) and
R3(A, C).

Consider the relation R(A, B, C) with A -> B and BC -> A.

R is not in BCNF.  It is not possible to decompose R into BCNF relations losslessly while preserving all FD.