Normalization

by K. Yue

1. Introduction

Example:

If a relation R is in BCNF, then R is also in 3NF, 2NF and 1NF.

If a relation is in 2NF, then

  1. It is in 1NF,
  2. it may or may not be in 3NF, and
  3. it may or may not be in BCNF.

If a relations is not in 3NF, then

  1. It is not in BCNF.
  2. It may or may not be in 1NF or 2NF.

General Overview

2. Functional Dependencies (FD)

Example

Many to one relationships.

For many applications, the relationship between SSNum and NAME are many to one in a relation R(..,SSNum, Name, ...)

SSNum        ->     Name
(many)                 (one)

Assumptions:

  1. A SSN uniquely identifies a person.
  2. Given a SSNum, there can only be one Name associated with it (not allowing alias, etc.)
  3. Many different SSNum's (persons) may have the same Name.
  4. There should not be two tuples with the same SSNum, but different NAME in all instances of R.

Terms:

  1. SSNum uniquely determines Name.
  2. Name is functionally determined by SSNum.
  3. There is a functional dependency SSNum -> NAME.
  4. Hence, a functional dependency specifies a many to one relationship between two sets of attributes.

For example,

SSNum NAME PHONE
123456789 Peter 123-456-7890
123456789 Paul 713-283-7066
222229999 Mary 713-283-7066

is not allowed if we assume SSNum -> NAME.

Example

In a university, there may be a many-to-one relationship between {CourseId, StudentId} and Grade.

Interpretations:

  1. A student may have only one grade for a course.
  2. We say that there is a functional dependency:
  3. Note that under different assumptions, the functional dependency may not be true.
  4. For example, if a student is allowed to retake a course, then he may have two grades for the same course (in different semesters), then CourseId, StudentId -> Grade  is false.
  5. We may actually have CourseId, StudentId, Semester -> Grade

Example:

In most applications, we have

SSNum -> Name             (i.e.  a person has only one SSNum.)

However, in a criminal database, several bad guys may use the same fake SSNum, and thus

SSNum -> Name  is not true.

Or, if you are dealing with an international data base with many countries.  Each country may has its own SSNum.  Two countries may issue the same SSNum.  Hence,

SSNum -> Name   is not true.

We may instead have  SSNum, Country -> Name.

Definition of FD:

Example:

DEPT_NO -> MANAGER_NO:

There are no two tuples with the same DEPT_NO but different MANAGER_NO.  A department has only one manager.

CourseId, StudentId, Semester -> Grade

There are no two tuples with the same CourseId, StudentId and Semester but different Grade.  That is, any student taking a course in a semester has an unique grade. Note that it may not be true for a university. Instead, the following may be true:

CourseId, StudentId, Year, Semester -> Grade

Keys and Superkeys

Some properties of Functional Dependency

Example

In EMPLOYEE(EMP_NO, DEPT_NO, MANAGER_NO) with

EMP_NO -> DEPT_NO and
DEPT_NO -> MANAGER_NO.

By transitivity, EMP_NO -> MANAGER_NO
By union rule, EMP_NO -> EMP_NO, DEPT_NO, MANAGER_NO
By augmentation, EMP_NO, MANAGER_NO -> DEPT_NO, MANAGER_NO
                       
Hence, EMP_NO is a (candidate) key of EMPLOYEE(EMP_NO, DEPT_NO, MANAGER_NO).

On the other hand, DEPT_NO is not a candidate key since we do not have DEPT_NO -> EMP_NO.

Furthermore, there are four superkeys:

  1. EMP_NO
  2. EMP_NO, DEPT_NO
  3. EMP_NO, MANAGER_NO
  4. EMP_NO, DEPT_NO, MANAGER_NO

Closure of Attributes

Example

Consider R(A,B,C,D) with

F = {B->A, A->C, AB->D, D->AC}

A+ = AC
B+ = ABCD
C+ = C
D+ = ACD

Thus, B is a candidate key (CK).

No proper superset of B is a candidate key (since it will not be minimal).

Remaining non-empty subset of ABCD to check:

AC+ = AC
AD+ = ACD
CD+ = ACD
ACD+ = ACD

Thus, B is the only CK.

Example

Consider R(A,B,C,D,E) with

F = {A-> B, AB -> CD, BC->A, D->E}

We have

A+ = ABCDE (candidate key: unique and minimal)
B+ = B
C+ = C
D+ = DE
E+ = E
(AB)+ = ABCDE
(AC)+ = ABCDE (unique but not minimal)
(AD+ = ABCDE (unique but not minimal)
(AE)+ = ABCDE (unique but not minimal)
(BC)+ = ABCDE (candidate key: unique and minimal)
(BD)+ = BDE
(BE)+ = BE
(CD)+ = CDE
(CE)+ = CE
(DE)+ = DE
(ABC)+ = ABCDE (unique but not minimal)
(ABD)+ = ABCDE (unique but not minimal)
(ABE)+ = ABCDE (unique but not minimal)
(ACD)+ = ABCDE (unique but not minimal)
(ACE)+ = ABCDE (unique but not minimal)
(ADE)+ = ABCDE (unique but not minimal)
(BCD)+ = ABCDE (unique but not minimal)
(BCE)+ = ABCDE (unique but not minimal)
(BDE)+ = BDE
(CDE)+ = ABCDE (unique but not minimal)
(ABCD)+ = ABCDE (unique but not minimal)
(ABCE)+ = ABCDE (unique but not minimal)
(ABDE)+ = ABCDE (unique but not minimal)
(ACDE)+ = ABCDE (unique but not minimal)
(ABCDE)+ = ABCDE (unique but not minimal)
(ABCDE)+ = ABCDE (unique but not minimal)

Finding Candidate keys

Example:

For R(A,B,C), need to check A, B, C, AB, AC, BC and ABC for candidate keys.

Thus, the problem is O(en).

To find all candidate keys for a set of FD, F:

  1. Find the canonical cover, FC, first. This simplifies F. (This step is optional.)
  2. Use heuristics to cut down the number of sets of attributes to check.
    1. If X does not appear in the RHS of any f in FC, every candidate key must include X.
    2. If X appears in the RHS of a fd in FC but does not appear in the LHS of any f in FC, then x is not a part of any candidate key.
    3. If X is found to be a CK, then any proper superset of X is not a CK and needs not be checked.

Example

Consider the following relation:

Supply(SupplierId, SupplierName, ProductId, ProductDesc, Quantity, ArrivalTime)

The relation stores the quantities and arrival times of shipments of products (identified by ProductId) from suppliers (Identified by SupplierId). A supplier may not have a unique name. Furthermore, the product description, ProductDesc, may be the same for two products. A supplier may supply the same product many times, each with a different ArrivalTime.

The functional dependencies (FD) of the relation:

SupplierId -> SupplierName
ProductId -> ProductDesc
SuplierId, ProductId, ArrivalTime -> Quantity

Decomposition:

Supplier(SupplierId, SupplierName) {SupplierId -> SupplierName}
Product(ProductId, ProductDesc) {ProductId -> ProductDesc}
Supply(SuplierId, ProductId, ArrivalTime, Quantity) {SuplierId, ProductId, ArrivalTime -> Quantity}

2. Normal Forms Using Functional Dependencies

First Normal Form

Example

Consider the following table with 3 records.  It is not in 1 NF.

DEPT_NO MANAGER_NO EMP_NO NAME
D123 54321

10000, 12000, 13000

Lady Gaga, Eminem, Lebron James

D225 42315 21000, 22000 Rajiv Gandhi, Bill Clinton
D337 33323 31000 John Smithson

The corresponding relation with 6 tuples is in 1 NF:

DEPT_NO MANAGER_NO EMP_NO NAME
D123 54321 10000 Lady Gaga
D123 54321 12000 Eminem
D123 54321 13000 Lebron James
D225 42315 21000 Rajiv Gandhi
D225 42315 22000 Bill Clinton
D337 33323 31000 John Smithson

Example

Consider the tuple (EmpId: 12345, OSSkills: {Windows, Linux, Solaris}).

Second Normal Form

Example

The following relation is not in 2NF.  (Assume the number of credits of a given course does not change).  Note the redundancy and anomalies.

Enroll(Course, Credit, Student, Grade)

Course Credit Student Grade
C1 3 S1 A
C1 3 S2 B
C1 3 S3 B
C2 2 S1 A
C2 2 S4 D

That is, we assume the following FDs.

  1. Course -> Credit
  2. Course, Student -> Grade

Thus,

  1. Course, Student is the only candidate key.
  2. Prime attributes: Course, Student
  3. Non-prime attribute: Credit, Grade.
  4. FD (1) is a violation of 2NF.

To convert to 2NF, decompose Enroll into

  1. Enroll(Course, Student, Grade)
  2. Class(Course, Credit)

Example from Hoffer (Partial):

Invoice(OrderId, OrderDate, ProductId, ProductName, Quantity)

FD:

  1. OrderId -> OrderDate
  2. ProductId -> ProductName
  3. OrderId, ProductId -> Quantity

FD 1 and 2 violate 2NF

To convert to 2NF, decomposition:

  1. Order(OrderId, OrderDate) with {OrderId -> OrderDate}
  2. Product(ProductId, ProductName) with {ProductId -> ProductName}
  3. OrderLine(OrderId, ProductId, Quantity) {OrderId, ProductId -> Quantity}

Third Normal Form

Example

DEPT_NO MANAGER_NO EMP_NO NAME
D123 54321 10000 Lady Gaga
D123 54321 12000 Eminem
D123 54321 13000 Lebron James
D225 42315 21000 Rajiv Gandhi
D225 42315 22000 Bill Clinton
D337 33323 31000 John Smithson

Example

Consider the relation

S(SNUM, PNUM, SNAME, QUANTITY) with the following assumptions:

  1. SNUM is unique for every supplier.
  2. SNAME is unique for every supplier.
  3. QUANTITY is the accumulated quantities of a part supplied by a supplier. Given a supplier and a part, the quantity is unique.
  4. A supplier can supply more than one part.
  5. A part can be supplied by more than one supplier.

We have the following non-trivial functional dependencies:

  1. SNUM -> SNAME
  2. SNAME -> SNUM
  3. SNUM PNUM -> QUANTITY
  4. SNAME PNUM -> QUANTITY

Note that SNUM and SNAME are equivalent.

The candidate keys are:

  1. SNUM PNUM
  2. SNAME PNUM

Prime attributes: SNUM, PNUM, SNAME

Non-prime attribute: QUANTITY.

The relation is in 3NF. However, there are unnecessary redundancy.

SNUM SNAME PNUM QUANTITY
S1 ABC P1 10
S1 ABC P2 20
S1 ABC P3 21
S2 DEF P1 40
S2 DEF P4 13
S3 XYK P3 18

Example

Consider the relation R(CITY, STREET, ZIP) with the FDs:

  1. CITY STREET -> ZIP, and
  2. ZIP -> CITY.

There are two candidate keys:

  1. CITY STREET, and
  2. ZIP STREET

Hence, all attributes are prime attributes and the relation is in both 2NF and 3NF.

Note that a relation such as EMPLOYEE(EMP_ID, EMP_NAME, Street, City, Zip, State) is not in 3NF.

This is a classical example you can find in many database textbook. The FDs may not be correct in the United States. See, for example: Why all 5-digit ZIP Code™ lists are obsolete.

BCNF (Boyce-Codd Normal Form)

Example                  

EMPLOYEE(EMP_NO, NAME, DEPT_NO, MANAGER_NO) with

EMP_NO -> NAME
EMP_NO -> DEPT_NO
DEPT_NO -> MANAGER_NO

is not in BCNF.

The functional dependency  DEPT_NO -> MANAGER_NO is

(1)  non-trivial, and
(2)  DEPT_NO is not a superkey.

We can decompose

EMPLOYEE(EMP_NO, NAME, DEPT_NO, MANAGER_NO) into

EMP(EMP_NO, NAME, DEPT_NO) with

EMP_NO -> NAME, DEPT_NO

and

DEPARTMENT(DEPT_NO, MANAGER_NO) with

DEPT_NO -> MANAGER_NO

Both relations are in BCNF since

Recall that these are the good relations without the anomalies in the previous example.

Example    

Consider again the relation

S(SNUM, PNUM, SNAME, QUANTITY) with the following non-trivial functional dependencies:

  1. SNUM -> SNAME
  2. SNAME -> SNUM
  3. SNUM PNUM -> QUANTITY
  4. SNAME PNUM -> QUANTITY

Note that SNUM and SNAME are equivalent.

The candidate keys are:

  1. SNUM PNUM
  2. SNAME PNUM

Prime attributes: SNUM, PNUM, SNAME

Non-prime attribute: QUANTITY.

S is not in BCNF because, for example, the functional dependency 

SNUM -> SNAME is

To deal with it, we can decompose S(SNUM, PNUM, SNAME, QUANTITY) into

(1) SUPPLIER(SNUM, SNAME) with  

SNUM -> SNAME
SNAME -> SNUM

with two candidate keys:

  1. SNUM
  2. SNAME

(2) SUPPLY(SNUM, PNUM, QUANTITY)  with 

SNUM, PNUM -> QUANTITY.

Both are in BCNF.

Example: 

Consider the relation R(A, B, C, D) with

A -> B,  B -> C, C -> A and C -> D.

There are three candidate keys:

  1. A
  2. B
  3. C

Since every left hand side of any non-trivial functional dependency is a superkey,  R is in BCNF.

Motivation of BCNF

Example 

Consider the relation R(CITY, ZIP, STREET)
       
Using the code for the postal office, we have

CITY STREET -> ZIP, and ZIP -> CITY.

Hence, there are two candidate keys:

  1. CITY STREET, and
  2. ZIP STREET

Therefore, R is not in BCNF since in ZIP -> CITY, ZIP is not a superkey.

However, if we decompose R into two relations, each with two attributes, then the functional dependency

CITY STREET -> ZIP is lost (i.e. cannot be enforced within a single relation)

Therefore, we better leave the relation alone.

Normalization Theory Using Functional Dependencies

Example

Consider the following relation:

Supply(SupplierId, SupplierName, ProductId, ProductDesc, Quantity, ArrivalTime)

The relation stores the quantities and arrival times of shipments of products (identified by ProductId) from suppliers (Identified by SupplierId). A supplier may not have a unique name. Furthermore, the product description, ProductDesc, may be the same for two products. A supplier may supply the same product many times, each with a different ArrivalTime.

The functional dependencies (FD) of the relation:

SupplierId -> SupplierName
ProductId -> ProductDesc
SuplierId, ProductId, ArrivalTime -> Quantity

CK:  {SupplierId, ProductId, ArrivalTime}

Non-prime attributes: SupplierName, ProductDesc, Quantity

Highest Normal Form: 1NF

SupplierId -> SupplierName violates 2NF since SupplierId is a part of a candidate key and Quantity is non-prime.

3. Decomposition (Only partially covered)

Lossy Decomposition

Example:

Consider the relation EMP(EMP_NO, DEPT_NO, MGR_NO) with 

EMP_NO ->  DEPT_NO
DEPT_NO ->  MGR_NO

Note that we do not have MGR_NO -> DEPT_NO in this example, since a manager can manage more than one departments under the assumptions made for this example.

EMP_NO DEPT_NO MGR_NO
12345 ACCT 90000
12399 HR 90000
30000 ENG 98000

          
The relation is not in BCNF because of the FD

DEPT_NO -> MGR_NO

Suppose we decompose the relation into

EMP1(EMP_NO, MGR_NO)
DEPT(DEPT_NO, MGR_NO)

The common attribute is MGR_NO. They are obtained by projections from EMP:

EMP1:                     

EMP_NO MGR_NO
12345 90000
12399 90000
30000 98000

DEPT:

DEPT_NO MGR_NO
ACCT 90000
HR 90000
ENG 98000

If we do not loss any information by the decomposition, we should get the original relation from the natural join.

However,  EMP1 |x| DEPT is     

EMP_NO DEPT_NO MGR_NO
12345 ACCT 90000
12345 HR 90000
12399 ACCT 90000
12399 HR 90000
30000 ENG 98000

         
This is not the same as the original relation EMP. Spurious tuples were incorrectly created.

Hence, the decomposition of EMP(EMP_NO, DEPT_NO, MGR_NO) into
 
EMP1(EMP_NO, MGR_NO) and
DEPT(DEPT_NO, MGR_NO)

is lossy.  It is not a good decomposition.

Lossless Decomposition

Example:

Consider now the following decomposition of EMP(EMP_NO, DEPT_NO, MGR_NO):

EMP2(EMP_NO, DEPT_NO)  and
EMP3(EMP_NO, MGR_NO)

The common attribute is EMP_NO. We have EMP2 and EMP3:

EMP2:                     

EMP_NO DEPT_NO
12345 ACCT
12399 HR
30000 ENG

EMP3:

EMP_NO MGR_NO
12345 90000
12399 90000
30000 98000

Hence, EMP2 |x| EMP3:

EMP_NO DEPT_NO MGR_NO
12345 ACCT 90000
12399 HR 90000
30000 ENG 98000

This is exactly the same as the original relation EMP.  Therefore, the decomposition does not loss any information.  It is a lossless decomposition.

Theory of Lossless Decomposition

Example:

Why is the decomposition of EMP(EMP_NO, DEPT, MGR_NO) into

(1) EMP1(EMP_NO, MGR_NO) and DEPT(DEPT_NO, MGR_NO) lossy, and

(2) EMP2(EMP_NO, DEPT) and EMP3(EMP_NO, MGR_NO) lossless?

Theorem: Suppose R(X, Y, Z) is decomposed into R1(X, Y) and R2(X, Z).  X is the set of common attributes in R1 and R2.  The decomposition is lossless if and only if

(a) X -> Y, or
(b) X -> Z.

Example:

In case (1), X is MGR_NO, Y is EMP_NO, Z is DEPT.

Neither condition (a) not (b) is satisfied.  Hence, (1) is lossy.

In case (2), X is EMP_NO, Y is DEPT_NO, Z is MGR_NO.

Both conditions (a) and (b) are satisfied.  Hence, (2) is lossless.

Dependency-Preserving Decomposition

Example:                        

For the relation EMP(EMP_NO,DEPT_NO,MGR_NO) with 

EMP_NO ->  DEPT_NO
DEPT_NO ->  MGR_NO,

The decomposition of EMP into

EMP2(EMP_NO, DEPT_NO)  and
EMP3(EMP_NO, MGR_NO)

is lossless but does not preserve dependencies:

the FD  DEPT_NO -> MGR_NO

cannot be enforced by any relation after the decomposition. No relation contains both attributes.

For example, if we add the information EMP 23000 work in the ACCT department under manager 97000 and are not careful, we may have:

 EMP2:                     

EMP_NO DEPT
12345 ACCT
12399 HR
30000 ENG
23000 ACCT

EMP3:

EMP_NO MGR_NO
12345 90000
12399 90000
30000 98000
23000 97000

The FD  DEPT_NO ->  MGR_NO is violated.

Thus, for the relation EMP(EMP_NO,DEPT_NO,MGR_NO) with 

EMP_NO ->  DEPT_NO
DEPT_NO ->  MGR_NO,

the best decomposition is into

EMP1(EMP_NO, DEPT_NO)  and
DEPT(DEPT_NO, MGR_NO)

It is easy to show that, the decomposition is lossless, preserves dependencies, and that EMP1 and DEPT are both in BCNF.

Algorithm for decomposition in 3NF relations (not covered)

Example:

Consider R(A,B,C,D,E) with F = {A->BC, CD -> E, BA -> C, D->B}.

Step 1. Find a canonical cover G for F. (Loosely speaking, an equivalence of F with the least number of FD and attributes)

The FD BA->C is redundant.

G = {A->BC, CD -> E, D->B}.

Step 2. For every FD X->Y in G, create a relation with the schema XY and add it to the result D.

Relations created:

R1(A,B,C) with A->BC
R2(C,D,E) with CD->E
R3(B,D) with D->B

It can be seen very easily that R1, R2 and R3 are all in 3NF. Furthermore, all FDs are preserved.

Step 3. If no relation in D contains a candidate key of R, create a new relation with a candidate key of R being the schema and add it to the result D.

There is only one candidate key of R: AD. Since none of R1, R2 and R3 contains A, create the relation

R4(A,D) with no FD

Step 4. Simplify D by removing relations that are redundant (i.e. that its schema is a subset of the schema of another relation).

No action as there is no redundant relation.

The result relations are all in BCNF.

Example:

Consider R(A,B,C,D,E) with {A->BCD, BC->D, D->C}

Using the algorithm,

(1) Canonical cover: {A->BC, BC->D, D->C}

(2) The following relations are created:

R1(A,B,C) with {A-> BC},
R2(B,C,D) with {BC->D, D->C},
R3(C,D) with {D->C}

(3) There is only one candidate key AE. Since it is not in any of R1, R2 or R3, R4 is created.

R4(A,E)

(4) R3(C,D) is removed as redundant.

As in result, we have:

R1(A,B,C) with {A-> BC}, in BCNF
R2(B,C,D) with {BC->D, D->C}, in 3NF but not in BCNF
R4(A,E) with {}, in BCNF

Example:

Consider the relation R(A, B, C) with A -> B and C -> B.

R is not in 2NF.  It is not possible to decompose R into two relations losslessly while preserving all functional dependencies.

However, it is possible to decompose into three relations losslessly and with all functional dependencies preserved:

R1(A, B),
R2(B, C) and
R3(A, C).