Normalization

by K. Yue

1. Normal Forms

A set of rules to avoid redundancy and inconsistency.
Require the concepts of
- functional dependency (FD, most important: up to BCNF)
- multivalued dependency (MVD for 4NF)
- join dependency (5NF)
Seven Common Normal Forms in ascending order: 1NF, 2NF, 3NF, BCNF, 4NF, 5NF, DKNF.
Higher normal forms are more restrictive.
A relation is in a higher normal form implies that it is in a lower normal form, but not vice versa.

Example:

If a relation is in BCNF, then it is also in 3NF, 2NF and 1NF.

If a relation is in 2NF, then

It is in 1NF,
it may or may not be in 3NF, and
it may or may not in BCNF.

If a relations is not in 3NF, then

It is not in BCNF.
It may or may not be in 1NF or 2NF.

In general, the higher the normal forms a relation is in, the better the design of the relation in terms of avoiding redundancy and inconsistency.
However, it may be necessary to consider other issues, especially performances.
- Higher normal forms may be achieved by decomposition, resulting in more relations. More joins may thus be needed to provide the data for a query.
1NF and 2NF are more interesting for historical reasons.
4NF and 5NF involves the concept of multivalued and join dependencies (MVD and JD). They are hard to understand and even harder to use in most situations.
Domain Key Normal Form (DKNF) involves the concept of constraints.
Based on the concept of functional dependencies (FD), the most important normal forms are
- 3NF and
- BCNF (Boyce-Codd Normal Form).

Functional Dependencies (FD):

Types of relationship between attributes:
- Many to one (0..* to 0..1)
- Many to many (0..* to 0..*)

Example

Many to many relationships.

Consider the relation Enrol:

Course	Student	Grade
C1	S1	A
C1	S2	B
C1	S3	B
C2	S1	A
C2	S4	D

Under reasonable assumptions, there are many to many relationships between the attributes

Course and Student: course may enrol many students; a student may take many courses)
Course and Grade
Student and Grade
{Course, Grade} and Student: both S2 and S3 have a grade of B in Course C1.

However, the relationship between {Course, Student} and Grade is not a many-to-many relationship if we assume that a student can only has one grade for a given course.

A many to many relationship between two attributes means that there is no constraint and no dependency between the values of the attributes.

Example

Many to one relationships.

For many applications, the relationship between SS# and NAME are many to one.

SS# -> NAME
(many) (one)

Interpretations and terminology:

Many different SS#'s (persons) may have the same NAME.
Given a SS#, there can only be one NAME associated with it (not allowing alias, etc).
There should not be two tuples with the same SS#, but different NAME.
SS# uniquely determines NAME.
NAME is functionally determined by SS#.
There is a functional dependency SS# --> NAME.
Hence, a functional dependency specifies a many to one relationship between attributes.

For example,

SS#	NAME	PHONE
123456789	Peter	A
123456789	Paul	B
222229999	Mary	B

is not allowed if we assume SS# -> NAME.

Example

In a university, there may be a many-to-one relationship between {COURSE#, STUDENT#} and GRADE.

Interpretations:

A student may have only one grade for a course.
We say there is a functional dependency:
- COURSE# STUDENT# --> GRADE
- {COURSE#, STUDENT#} determines GRADE.
Note that under different assumptions, the functional dependency may not be true.
For example, if a student is allowed to retake a course, then he may have two grades for the same course (in different semesters), then COURSE# STUDENT# --> GRADE is false.
We may actually have COURSE# STUDENT# SEMESTER --> GRADE

Hence, functional dependency is a result of the requirements and business logics of the applications.
There is no universally true non-trivial functional dependency.
In other words, functional dependencies depend on the semantic of the problems.

Example

In most application, we have

SS# --> NAME (i.e. a person has only one SS#.)

However, in a criminal database, several bad guys may use the same fake SS#, and thus

SS# --> NAME is not true.

Or, if you are dealing with an international data base with many countries. Each country may has its own SS#. Two countries may issue the same SS#. Hence,

SS# --> NAME is not true.

We may instead have SS# COUNTRY --> NAME.

A relation scheme R is said to satisfy the functional dependency X --> Y if for any relation r that uses R, if there are two tuples s and t in r such that s[X] = t[X], then s[Y] = t[Y].
i.e. the same value of X implies the same value of Y.
Definition of FD from EN:

FD definition

Example

SS# --> SNAME:

There are no two tuples with the same SS# but different names.

DEPT-NO --> MANAGER-NO:

There are no two tuples with the same DEPT-NO but different MANAGER-NO. A department has only one manager.

SUPPLIER# PART# DATE --> QUANTITY

There are no two tuples with the same SUPPLIER#, PART# and DATE but different QUANTITY. That is, any supplier has only one shipment of a part on a given date.

Armstrong's axioms

A set of axioms for inference with FD: http://en.wikipedia.org/wiki/Armstrong%27s_axioms.
Axioms: 'self-evidence' or 'assumed'.
Three basic axioms:
1. Reflexivity: If X and Y are sets of attributes and Y is a subset of X, then X --> Y.
2. Augmentation: If X --> Y then X Z --> Y Z.
3. Transitivity: If X --> Y and Y --> Z then X --> Z
Three additional rules that can be proven by the basic axioms.

Pseudo-transivitiy Rule: If X--> Y, YZ -> A then XZ -> A
Decomposition Rule: If X --> Y Z, then X --> Y and X --> Z.
Union Rule: If X --> Y and X --> Z then X --> Y Z.

Armstrong's axioms are sound and complete.
- Sound: implies only FD that are correct.
- Complete: can be used to imply all correct FD.
CS students need to know how to infer using a formal mathematical model.

Example

Let X be CITY STREET, Y be STREET, then Y is a subset of X, and X --> Y or CITY STREET --> STREET. (Reflexivity).

If two tuples have the same value of CITY STREET, then they have the same value of STREET.
This is so trivial that we call a functional dependency likes CITY STREET --> STREET a trivial functional dependency. They do not actually specify a business requirement

A --> A and B C --> B are trivial.

Since trivial functional dependencies do not actually give you any information, we are only interested in non-trivial functional dependency.

If EMP-NO --> DEPT-NO and DEPT-NO --> MANAGER-NO
then EMP-NO --> MANAGER-NO.

Interpretation: If

every employee works for only one department, and
every department has only one manager, then

then every employee has only one manager.

Example

Prove the union Rule.

Proof.

(1) X -> Z (given)
(2) X X -> X Z (augmentation of (1) with X)
(3) X -> XZ (simplification of (2))
(4) X -> Y (given)
(5) XZ -> YZ (augmentation of (4) with Z)
(6) X -> YZ (transitivity on (3) and (5))

Exercise

Prove the pseudo transitivity rule.