Normal Forms and Theory of Normalization
by K. Yue
1. Normal Forms Using Functional Dependencies
1.1 First Normal Form
Example
Consider the following table with 3 records. It is not in 1 NF.
DeptId | ManagerId | EmpIds | Names |
D123 | 110 | 10000, 12000, 13000 | Lady Gaga, Eminem, Lebron James |
D225 | 440 | 21000, 22000 | Rajiv Gandhi, Bill Clinton |
D337 | 300 | 31000 | John Smithson |
An alternate design of the relation in 1NF is shown below. The following instance has six rows.
DeptId | ManagerId | EmpId | Name |
D123 | 110 | 10000 | Lady Gaga |
D123 | 110 | 12000 | Eminem |
D123 | 110 | 13000 | Lebron James |
D225 | 440 | 21000 | Rajiv Gandhi |
D225 | 440 | 22000 | Bill Clinton |
D337 | 300 | 31000 | John Smithson |
Example
Consider the tuple (EmpId: 12345, OSSkills: {Windows, Linux, Solaris}).
1.2 Second Normal Form
Example
The following relation is not in 2NF. (Assume that the number of credits of a given course does not change). Note the redundancy and anomalies.
Enroll(Course, Credit, Student, Grade)
CourseId | Credit | StudentId | Grade |
C1 | 3 | S1 | A |
C1 | 3 | S2 | B |
C1 | 3 | S3 | B |
C2 | 2 | S1 | A |
C2 | 2 | S4 | D |
We assume the following FD.
Thus,
To convert to relations in 2NF, decompose Enroll into
Both tables are in 2NF.
Example from Ricardo:
NewClass(courseNo, stuId, stuLastName, facId, schedule, room, grade). We have:
courseNo, stuId -> grade
stuId -> stuLastName
courseNo -> facId, schedule, room
StuId -> stuLastName and courseNo -> facId, schedule, room violate 2NF
To convert to 2NF, decomposition:
1.3 Third Normal Form
Example
DeptId | ManagerId | EmpId | Name |
D123 | 110 | 10000 | Lady Gaga |
D123 | 110 | 12000 | Eminem |
D123 | 110 | 13000 | Lebron James |
D225 | 205 | 21000 | Rajiv Gandhi |
D225 | 205 | 22000 | Bill Clinton |
D337 | 333 | 31000 | John Smithson |
Example
Consider the relation R(CITY, STREET, ZIP) with the FDs:
There are two candidate keys:
Hence, all attributes are prime attributes, and the relation is in both 2NF and 3NF.
Note that a relation such as Employee(EmePId, Name, Street, City, Zip, State) is not in 3NF.
This is a classical example you can find in many database textbooks. Note that the two FDs may not actually be correct in the United States.
1.4 BCNF (Boyce-Codd Normal Form)
Example
Consider the relation
S(SId, PId, SName, Quantity) with the following assumptions:
We have the following non-trivial FD:
Note that SId and SName are equivalent.
The candidate keys are:
Prime attributes: SId, PId, SName
Non-prime attribute: Quantity.
The candidate keys are:
Prime attributes: SId, PId, SName
Non-prime attribute: Quantity.
The relation is in 3NF. Note:
However, there are unnecessary redundancy.
SId | SName | PId | Quantity |
S1 | ABC | P1 | 10 |
S1 | ABC | P2 | 20 |
S1 | ABC | P3 | 21 |
S2 | DEF | P1 | 40 |
S2 | DEF | P4 | 13 |
S3 | XYK | P3 | 18 |
Thus, 3NF does not detect all design problems using FD.
However, S is not in BCNF because, for example, the functional dependency
SId -> SName is
To deal with it, we can decompose S(SId, PId, SName, Quantity) into
(1) Supplier(SId, SName) with
SId -> SName
SName -> SId
with two candidate keys:
(2) Supply(SId, PId, Quantity) with
SId, PId -> Quantity.
Both are in BCNF.
Example:
Consider the relation R(A, B, C, D) with
A -> B, B -> C, C -> A and C -> D.
There are three candidate keys:
Since every left hand side of any non-trivial functional dependency is a superkey, R is in BCNF.
1.5 Checking Highest Normal Form by Violations
To find the highest normal form for a relation R, check every non-trivial FD X->Y of R for violation.
Normal Form's Violation | Non-trivial FD X -> A |
2NF | (1) X is a proper subset of a candidate key of R, and (2) A is a non-prime attribute. |
3NF | (1) X is not a superkey of R, and (2) A is a non-prime attribute. |
BCNF | X is not a superkey. |
Example:
Consider R(A,B,C,D) {A->B, B->AC, C->D}
Using decomposition rule, we have {A->B, B->A, B->C, C->D}
We find two CK: [1] A, [2] B
Prime attributes: A, B
Non-prime attributes: C, D
Checking for violation:
FD | Ok with 2NF | Ok with 3NF | OK with BCNF |
A->B | Yes | Yes | Yes |
B->A | Yes | Yes | Yes |
B->C | Yes | Yes | Yes |
C->D | Yes | No | No |
Thus, the highest NF is 2NF
1.6 Motivation of BCNF
Consider the relation R(CITY, ZIP, STREET) again
Using the code for the postal office, we have
CITY STREET -> ZIP, and ZIP -> CITY.
Hence, there are two candidate keys:
Therefore, R is not in BCNF since in ZIP -> CITY, ZIP is not a superkey.
However, if we decompose R into two relations, each with two attributes, then the FD
CITY STREET -> ZIP is lost (i.e. cannot be assured within a single relation)
Therefore, we better leave the relation alone.
1.7 Normalization Theory Using Functional Dependencies
Example
Consider the following relation:
Supply(SupplierId, SupplierName, ProductId, ProductDesc, Quantity, ArrivalTime)
The relation stores the quantities and arrival times of shipments of products (identified by ProductId) from suppliers (Identified by SupplierId). A supplier may not have a unique name. Furthermore, the product description, ProductDesc, may be the same for two products. A supplier may supply the same product many times, each with a different ArrivalTime.
The functional dependencies (FD) of the relation:
SupplierId -> SupplierName
ProductId -> ProductDesc
SuplierId, ProductId, ArrivalTime -> Quantity
CK: {SupplierId, ProductId, ArrivalTime}
Non-prime attributes: SupplierName, ProductDesc, Quantity
Highest Normal Form: 1NF
SupplierId -> SupplierName violates 2NF since
Not that A ⊂ B means that A is a proper subset of B.
2. Decomposition
2.1 Lossy Decomposition
Example:
Consider the relation Emp(EmpId, DeptId, ManagerId) with
EmpId -> DeptId
DeptId -> ManagerId
Note that we do not have ManagerId -> DeptId in this example, since this organization allows a manager to manage more than one Departments. Note that ManagerId 90000 manages two Departments.
EmpId | DeptId | ManagerId |
E1 | ACCT | M3 |
E2 | HR | M3 |
E3 | ENG | M6 |
E4 | ENG | M6 |
The relation is not in 3NF because of the FD
DeptId (not a SK) -> ManagerId (non-prime)
Suppose we decompose the relation into
Emp1(EmpId, ManagerId)
Dept(DeptId, ManagerId)
The common attribute for the component relations is ManagerId. The relations are obtained by projections from Emp:
Emp1:
EmpId | ManagerId |
E1 | M3 |
E2 | M3 |
E3 | M6 |
E4 | M6 |
Dept:
DeptId | ManagerId |
ACCT | M3 |
HR | M3 |
ENG | M6 |
If we do not lose any information by the decomposition, we should get the original relation using the natural join.
However, Emp1 |x| Dept is
EmpId | DeptId | ManagerId |
E1 | ACCT | M3 |
E1 | HR | M3 |
E2 | ACCT | M3 |
E2 | HR | M3 |
E3 | ENG | M6 |
E4 | ENG | M6 |
This is not the same as the original relation Emp. Spurious rows are incorrectly included in the result.
Hence, the decomposition of Emp(EmpId, DeptId, ManagerId) into
Emp1(EmpId, ManagerId) and
Dept(DeptId, ManagerId)
is lossy. It is not a good decomposition.
2.2 Lossless Decomposition
Example:
Consider now the following decomposition of Emp(EmpId, DeptId, ManagerId):
Emp2(EmpId, DeptId) and
Emp3(EmpId, ManagerId)
The common attribute is EmpId. We have Emp2 and Emp3:
Emp2:
EmpId | DeptId |
E1 | ACCT |
E2 | HR |
E3 | ENG |
E4 | ENG |
Emp3:
EmpId | ManagerId |
E1 | M3 |
E2 | M3 |
E3 | M6 |
E4 | M6 |
Hence, Emp2 |x| Emp3:
EmpId | DeptId | ManagerId |
E1 | ACCT | M3 |
E2 | HR | M3 |
E3 | ENG | M6 |
E4 | ENG | M6 |
This is the same as the original relation Emp. Therefore, the decomposition does not lose any information. It is a lossless decomposition.
Definition. A decomposition is lossless if the natural joins of the component relations result in the original relation. Otherwise, it is lossy.
2.3 Theory of Lossless Decomposition
Example:
Why is the decomposition of Emp(EmpId, Dept, ManagerId) into
(1) Emp1(EmpId, ManagerId) and Dept(DeptId, ManagerId) lossy, and
(2) Emp2(EmpId, DeptId) and Emp3(EmpId, ManagerId) lossless?
Theorem: Suppose R(X, Y, Z) is decomposed into R1(X, Y) and R2(X, Z). X is the set of common attributes in R1 and R2. The decomposition is lossless if and only if
(a) X -> Y, or
(b) X -> Z.
Example:
In case (1), X is ManagerId, Y is EmpId, Z is DeptId.
Neither condition (a) nor (b) is satisfied. Hence, (1) is lossy.
In case (2), X is EmpId, Y is DeptId, Z is ManagerId.
Both conditions (a) and (b) are satisfied. Hence, (2) is lossless.
2.4 Dependency-Preserving Decomposition
Example:
For the relation Emp(EmpId, DeptId, ManagerId) with
EmpId -> DeptId
DeptId -> ManagerId,
The decomposition of Emp into
Emp2(EmpId, DeptId) and
Emp3(EmpId, ManagerId)
is lossless but it does not preserve dependencies:
the FD DeptId -> ManagerId
cannot be assured within a single relation after the decomposition. No relation contains both attributes.
For example, if we add the information Emp E6 work in the ACCT Department under manager M9 carelessly, we may have the following table.
Emp2:
EmpId | DeptId |
E1 | ACCT |
E2 | HR |
E3 | ENG |
E4 | ENG |
E6 | ACCT |
Emp3:
EmpId | ManagerId |
E1 | M3 |
E2 | M3 |
E3 | M6 |
E4 | M6 |
E6 | M9 |
Thus, for the relation Emp(EmpId, DeptId, ManagerId) with
EmpId -> DeptId
DeptId -> ManagerId,
the best decomposition is
Emp1(EmpId, DeptId) and
Dept(DeptId, ManagerId)
It is easy to show that, the decomposition is lossless, preserves dependencies, and that Emp1 and Dept are both in BCNF.
2.4.2 Decomposition Algorithms
2.5 Algorithm for decomposition into 3NF relations
Example:
Consider R(A,B,C,D,E) with F = {A->BC, CD -> E, BA -> C, D->B}.
Step 1. Find a canonical cover G for F.
The FD BA->C is redundant.
G = {A->BC, CD -> E, D->B}.
We may perform normalization analysis to see whether decomposition is necessary.
L/NR: A, D
M: C
R: B, E
We have: AD+ = AD BC E
Thus, CK: [1] AD
prime: A, D
non-prime: B, C, E
Normalization analysis:
Non-trivial FD | 2NF | 3NF | BCNF |
A -> B: [1] A ⊂ AD, [2] A is not a SK, [3] B is non-prime | violate | violate | violate |
A -> C: [1] A ⊂ AD, [2] A is not a SK, [3] C is non-prime | violate | violate | violate |
CD -> E: [1] CD ⊄ AD, [2] C is not a SK, [3] E is non-prime | ok | violate | violate |
D -> B: [1] D ⊂ AD, [2] D is not a SK, [3] B is non-prime | violate | violate | violate |
Thus, the highest normal form of R is 1NF. Decomposition is necessary.
Step 2. For every FD X->Y in G, create a relation with the schema XY and add it to the result D. This step preserves FD and resolves NF violations.
Relations created:
R1(A,B,C) with A->BC
R2(C,D,E) with CD->E
R3(B,D) with D->B
It can be seen very easily that R1, R2 and R3 are all in 3NF and BCNF. Furthermore, all FDs are preserved.
Step 3. If no relation in D contains a candidate key of R, create a new relation with a candidate key of R being the schema, and add it to the result D. This step assures losslessness.
There is only one candidate key of R: AD. Since none of R1, R2 and R3 contains AD, create the relation
R4(A,D) with no FD: {}
Step 4. Simplify the decomposition D by removing relations that are redundant (i.e. that its schema is a subset of the schema of another relation).
No action as there is no redundant relation.
The result relations are all in BCNF.
Example:
Consider R(A,B,C,D,E) with {A->BCD, BC->D, D->C}
Using the algorithm,
(1) Canonical cover: {A->BC, BC->D, D->C}; A->D is removed since it is a redundant FD.
(2) The following relations are created:
R1(A,B,C) with {A-> BC},
R2(B,C,D) with {BC->D, D->C},
R3(C,D) with {D->C}
(3) There is only one candidate key AE. Since it is not in any of R1, R2 or R3, R4 is created.
R4(A,E)
(4) R3(C,D) is removed as redundant.
As in result, we have:
R1(A,B,C) with {A-> BC}, in BCNF
R2(B,C,D) with {BC->D, D->C}, in 3NF but not in BCNF
R4(A,E) with {}, in BCNF
Example:
Consider the relation R(A, B, C) with A -> B and C -> B.
R is not in 2NF. It is not possible to decompose R into two relations losslessly while preserving all functional dependencies.
However, it is possible to decompose into three BCNF relations losslessly and with all functional dependencies preserved:
R1(A, B),
R2(B, C) and
R3(A, C).
Consider the relation R(A, B, C) with A -> B and BC -> A.
R is not in BCNF. It is not possible to decompose R into BCNF relations losslessly while preserving all FD.