Normalization Theory
by K. Yue
1. Functional Dependencies
Example:
If a relation R is in BCNF, then R is also in 3NF, 2NF and 1NF.
If a relation is in 2NF, then
If a relations is not in 3NF, then
Functional Dependencies (FD):
Example
Many to many relationships.
Consider an instance of the relation Enroll:
Course | Student | Grade |
C1 | S1 | A |
C1 | S2 | B |
C1 | S3 | B |
C2 | S1 | A |
C2 | S4 | D |
Under reasonable assumptions, there are many to many relationships between these sets of attributes:
However, the relationship between {Course, Student} and Grade may not be a many-to-many relationship if we assume that a student can only has one grade for a given course.
Example
Many to one relationships.
For many applications, the relationship between the two sets of attributes, SSNUM and NAME, are many to one.
SSNUM -> NAME
(many) (one)
Assumptions:
Terms:
For example, the relation instance:
SSNum | NAME | PHONE |
123456789 | Peter | 123-456-7890 |
123456789 | Paul | 713-283-7066 |
222229999 | Mary | 713-283-7066 |
is not allowed if we assume SSNum -> NAME.
Example
In a university, there may be a many-to-one relationship between {CourseId, StudentId} and Grade.
In a university, there may be a many-to-one relationship between {CourseId, StudentId} and Grade.
Interpretations:
Note that AB->CD is a shorthand notation for {A,B} -> {C,D}
Example
In most applications, we have
SSNum -> Name (i.e. a person has only one SSNum.)
However, in a criminal database, several bad guys may use the same fake SSNum, and thus
SSNum -> Name is not true.
Or, if you are dealing with an international data base with many countries. Each country may has its own SSNum. Two countries may issue the same SSNum. Hence,
SSNum -> Name is not true.
We may instead have SSNum, Country -> Name.
Definition of FD (from EN):
A relation scheme R is said to satisfy the functional dependency X -> Y if for any relation r that uses R, if there are two tuples s and t in r such that s[X] = t[X], then s[Y] = t[Y].
Example: This instance of R does not violate X->Y.
X | Y | Z |
'A' | 1 | 110 |
'A' | 1 | 123 |
'A' | 1 | 345 |
'B' | 2 | 232 |
'C' | 1 | 110 |
This instance of R violates X->Z.
X | Y | Z |
'A' | 1 | 110 |
'A' | 1 | 123 |
'A' | 1 | 345 |
'B' | 2 | 232 |
'C' | 1 | 110 |
'C' | 2 | 212 |
In order to have X-> Y, all instances must not violate the conditions.
Example
SSNUM -> SNAME:
There are no two tuples with the same SSNUM but different names.
DEPT_NO -> MANAGER_NO:
There are no two tuples with the same DEPT_NO but different MANAGER_NO. A department has only one manager.
SNUM, PNUM, DATE -> QUANTITY
There are no two tuples with the same SNUM, PNUM and DATE but different QUANTITY. That is, any supplier has only one shipment of a part on a given date.
In the example of poorly designed database:
EMP_NO | NAME | DEPT_NO | MANAGER_NO |
10000 | Lady Gaga | D123 | 54321 |
12000 | Aamir Khan | D123 | 54321 |
13000 | Lebron James | D123 | 54321 |
21000 | Narendra Modi | D225 | 42315 |
22000 | Aishwarya Rai | D225 | 42315 |
31000 | John Smithson | D337 | 33323 |
If we assume that a department has only one manager, we have:
DEPT_NO -> MANAGER_NO
Note that we also have:
NAME, DEPT_NO -> MANAGER_NO
EMP_NO, DEPT_NO -> MANAGER_NO
and so on.
Armstrong's axioms
Example
Let X be CITY STREET, Y be STREET, then Y is a subset of X, and X -> Y or CITY STREET -> STREET. (Reflexivity).
A -> A and BC -> B are trivial.
If EMP_NO -> DEPT_NO, and
DEPT_NO -> MANAGER_NO
then EMP_NO -> MANAGER_NO
Interpretation: If
then every employee has only one manager.
Proof with Armstrong axioms.
Example
Prove the union Rule.
Proof.
(1) X -> Z (given)
(2) X X -> X Z (augmentation of (1) with X)
(3) X -> XZ (simplification of (2))
(4) X -> Y (given)
(5) XZ -> YZ (augmentation of (4) with Z)
(6) X -> YZ (transitivity on (3) and (5))
Exercise
Prove the pseudo-transitivity rule.
Examples
(1) Prove that F = AC->B, B->D, AE->C}
implies AE->D
Proof. For example:
[1] AE-> C (given)
[2] AC -> B (given)
[3] AAE -> B (pseudo-transitivity on (1) and (2))
[4] AE -> B (simplification of (3))
[5] B -> D (given)
(2) Prove that F = {AB->C, AC->D, BD->E}
implies AB->E
Proof. For example:
[1] AB-> C (given)
[2] AC -> D (given)
[3] AAB -> D (pseudo-transitivity on (1) and (2))
[4] AB -> D (simplification of (3))
[5] BD -> E (given)
[6] ABB-> E (pseudo-transitivity on (4) and (5))
[7] AB-> E (simplification of (6))
Keys and Superkeys
Example
In EMPLOYEE(EMP_NO, DEPT_NO, MANAGER_NO) with
EMP_NO -> DEPT_NO and
DEPT_NO -> MANAGER_NO.
By the transitivity axiom, EMP_NO -> MANAGER_NO.
By the union rule, EMP_NO -> EMP_NO DEPT_NO, MANAGER_NO.
Hence, EMP_NO is a candidate key of EMPLOYEE(EMP_NO, DEPT_NO, MANAGER_NO).
On the other hand, DEPT_NO is not a candidate key since we do not have DEPT_NO -> EMP_NO.
Furthermore, there are four superkeys:
Closure of Attributes
Example
Consider R(A,B,C,D) with
F = {B->A, A->C, AB->D, D->AC}
A+ = AC
B+ = ABCD
C+ = C
D+ = ACD
Thus, B is a candidate key (CK).
No proper superset of B is a candidate key (since it will not be minimal).
Remaining non-empty subset of ABCD to check:
AC+ = AC
AD+ = ACD
CD+ = ACD
ACD+ = ACD
Thus, B is the only CK.
Example (A more degenerate case)
Consider:
F = {A-> B, BC -> DA, BD -> C, E-> A, AC -> DE}
We have
A+ = AB
B+ = B
C+ = C
D+ = D
E+ = EAB
(AB)+ = AB
(AC)+ = ABCDE
(AD+ = ABCDE
(AE)+ = ABE
(BC)+ = ABCDE
(BD)+ = ABCDE
(BE)+ = ABE
(CD)+ = CD
(CE)+ = ABCDE
(DE)+ = ABCDE
(ABC)+ = ABCDE
...
(ABE)+ = ABE
...
Algorithm for finding X+ for a set of FDs F.
X+ <- X
while (there exists a FD P -> Q such that P is a subset of X+ and there are attributes K in Q that is not in X+) {
X+ <- X+ U Q
}
Examples:
(1) Relation R(A,B,C,D,E) has exactly four superkeys. Can you deduce from this statement the number of candidate keys? If yes, how many CKs are there? Justify your answer.
Solution:
No. If ABC is the only CK of R, then there are four superkeys: ABC, ABCD, ABCE and ABCDE. On the other hand, if there are three CKs: ABCD, ABCE and ABDE, there are also four superkeys: ABCD, ABCE, ABDE and ABCDE.
(2) A relation R of four attributes has two candidate keys, what are the maximum and minimum numbers of superkeys R may have?
Solution:
Minimum: 3; e.g. when the candidate keys are ABC and ABD.
Maximum: 12; e.g. when the candidate keys are A and B.
(3) Consider the following valid instance of a relation R(A,B,C). Can you deduct from it all candidate keys of R? If yes, what are the candidate keys? If not, why?
A |
B |
C |
a1 |
b1 |
c1 |
a2 |
b2 |
c1 |
a2 |
b2 |
c2 |
a3 |
b1 |
c1 |
a3 |
b3 |
c1 |
Solution:
Yes, there is only one candidate key: ABC. This is because for all proper subsets of ABC, there are two or more tuples with the same values and thus no proper subsets of ABC can be a candidate key. This leaves only ABC.
Closure of a set of functional dependencies
Consider F = {A->B, B->C}
F+ = {
A->{}, A->A,
A->B,
A->C, A-> AB, A-> AC, A-> BC, A->ABC,
B->{}, B->B, B->C, B->BC,
C->{}, C->C,
AB->{}, AB->A, AB->B, AB->C, AB->AB, AB->AC, AB->BC, AB->ABC,
AC->{}, AC->A, AC->B, AC->C, AC->AB, AC->AB, AC->BC, AC->ABC,
BC->{}, BC->B, BC->C, BC->BC,
ABC->{}, ABC->A,
ABC->B,
ABC->C, ABC-> AB, ABC-> AC, ABC-> BC, ABC->ABC }
Note that
Equivalence and cover
Canonical and Minimal Covers
Example
Consider the F = {A->B, AB->C}.
B is extraneous since for G = {A->B, A->C}, and F+ = G+.
Example
In F = {A->B, AB->C, B->C},
AB->C is redundant since for
G = {A->B, B->C}, AB+ = ABC.
Alternatively,
G |- AB-> C.
In F = {A->B, AB->C, B->C, A->D},
G1 = {A->B, B->C, A->D} is a minimal cover.
G2 = {A->BD, B->C} is a canonical cover.
Exercise:
Consider F: {A->C, BCD->A, C->E, CD-> A, AB->C}
Example:
Find a canonical cover for F = {BC->AE, AD->BCE, A->E, AE->D, BCD->F, AB->C}
Solution:
Basically, we iteratively remove all extraneous attributes and redundant function dependencies.
We use decomposition rule to ensure the RHS to contain only a single attribute so we can work on them one by one. F becomes:
(1) BC -> A
(2) BC -> E
(3) AD -> B
(4) AD -> C
(5) AD -> E
(6) A -> E
(7) AE -> D
(8) BCD -> F
(9) AB -> C
To investigate whether B or C is extraneous in BC -> A, we note that in F:
B+ = B
C+ = C
This means B alone and C alone cannot determine A and neither of them is extraneous.
On the other hand, in F:
A+ = ABCDEF
That means A alone can determine all other attributes. Any other attributes in the LHS with A in a FD are thus extraneous, we thus have the following by removing D in [2], [3] and [4], and B in [9].
(1) BC -> A
(2) BC -> E
(3) A -> B
(4) A -> C
(5) A -> E
(6) A -> E
(7) A -> D
(8) BCD -> F
(9) A -> C
Removing identical FD, we have F:
(1) BC -> A
(2) BC -> E
(3) A -> B
(4) A -> C
(5) A -> E
(6) A -> D
(7) BCD -> F
For (7), since B+ = B, C+ = C and D+ = D. However, BC+ = ABCDEF, and thus D is extraneous. Thus, we now have:
(1) BC -> A
(2) BC -> E
(3) A -> B
(4) A -> C
(5) A -> E
(6) A -> D
(7) BC -> F
To remove redundant FD, we consider whether we can deduce a FD when it is removed.
For (1) BC -> A, removing it result in F':
(1) BC -> E
(2) A -> B
(3) A -> C
(4) A -> E
(5) A -> D
(6) BC -> F
In F': we have
BC+ = BCE, which does not include A. Thus, F' does not imply BC -> A and it is not redundant.
For (2) BC -> E, removing it and we have F':
(1) BC -> A
(2) A -> B
(3) A -> C
(4) A -> E
(5) A -> D
(6) BC -> F
In F', we have BC+ = ABCDEF. Thus, F' |= BC -> E and BC -> E is redundant. Remove it and we have:
(1) BC -> A
(2) A -> B
(3) A -> C
(4) A -> E
(5) A -> D
(6) BC -> F
Using this method, we can find that there are no more redundant FD.
Finally, we use the union rule to merge FD with the same LHS and get the canonical cover:
{BC -> AF, A-> BCDE}
Note that the canonical cover is not unique. Another canonical cover is:
{BC -> A, A-> BCDEF}
Exercise:
Consider F: {AB->CE, BC->D, D->BC, C->E, A->C, A->E}
Find:
Exercise:
Can there be more than one canonical covers for a set of FDs?
Finding Candidate keys
Example
For R(A,B,C), need to check A, B, C, AB, AC, BC and ABC for candidate keys.
Thus, the problem is O(e^N).
To find all candidate keys of R with a set of FD, F:
Example:
For R(A,B,C,D,E,F) with F = {BC->AE, AD->BC, A->E, AE->D, BCD->F, AB->C}, find all candidate keys.
Solution:
There are 31 candidates for the CK. These are the non-empty subsets of ABCDEF.
In F, we have:
A+ = ABCDEF
B+ = B
C+ = C
D+ = D
E+ = E
F+ = F
Thus, A is a CK. As a result, it is not necessary to check all proper supersets of A as they will not be minimal.
Continue to check sets of two attributes:
BC+ = ABCDEF
BD+ = BD
BE+ = BE
BF+ = BF
CD+ = CD
CE+ = CE
CF+ = CF
DE+ = DE
DF+ = DF
EF+ = EF
Thus, BC is another CK.
Further checking on sets of three attributes:
BDE+ = BDE
BDF+ = BDF
BEF+ = BEF
CDE+ = CDE
CDF+ = CDF
CEF+ = CEF
DEF+ = DEF
Note that it is not necessary to check proper subsets of A or BC.
We check sets of four attributes:
BDEF+ = BDEF
CDEF+ = CDEF
We have exhausted all candidates and there are only two CKs: A and BC.
Alternatively, we can use the canonical cover:
{BC -> A, A-> BCDEF}
It can be seen that A and BC are CK. Furthermore, DEF appears in RHS of some FD but not the LHS of any FD, thus any of them can not appear in any CK.
2. Normal Forms using Functional Dependencies
First Normal Form
Consider the following table with 3 records. It is not in 1 NF.
DEPT_NO | MANAGER_NO | EMP_NO | NAME |
D123 | 110 | 10000, 12000, 13000 |
Lady Gaga, Eminem, Lebron James |
D225 | 42315 | 21000, 22000 | Rajiv Gandhi, Bill Clinton |
D337 | 33323 | 31000 | John Smithson |
The corresponding relation with 6 tuples is in 1 NF:
DEPT_NO | MANAGER_NO | EMP_NO | NAME |
D123 | 110 | 10000 | Lady Gaga |
D123 | 110 | 12000 | Eminem |
D123 | 110 | 13000 | Lebron James |
D225 | 42315 | 21000 | Rajiv Gandhi |
D225 | 42315 | 22000 | Bill Clinton |
D337 | 33323 | 31000 | John Smithson |
Example
Consider the tuple (Empid: 12345, OSSkills: "Windows, Linux, Solaris").
Second Normal Form
Example
The following relation is not in 2NF. (Assume the number of credits of a given course does not change). Note the redundancy and anomalies.
Enroll(Course, Credit, Student, Grade)
Course | Credit | Student | Grade |
C1 | 3 | S1 | A |
C1 | 3 | S2 | B |
C1 | 3 | S3 | B |
C2 | 2 | S1 | A |
C2 | 2 | S4 | D |
That is, we assume the following FDs.
Thus,
To convert to 2NF, decompose Enroll into
Third Normal Form
Example
DEPT_NO | MANAGER_NO | EMP_NO | NAME |
D123 | 54321 | 10000 | Lady Gaga |
D123 | 54321 | 12000 | Eminem |
D123 | 54321 | 13000 | Lebron James |
D225 | 42315 | 21000 | Rajiv Gandhi |
D225 | 42315 | 22000 | Bill Clinton |
D337 | 33323 | 31000 | John Smithson |
Using the new definition.
The functional dependency DEPT_NO -> MANAGER_NO is
(1) non-trivial,
(2) DEPT_NO is not a superkey, and
(3) MANAGER_NO is not a prime attribute.
Thus, it violates the 3NF.
Example
For the relation R(CITY, ZIP, STREET)
Using the code for the postal office, we may have:
CITY STREET -> ZIP, and ZIP -> CITY.
Hence, there are two candidate keys:
Hence,
Prime attributes: STREET, CITY, ZIP
R is in the 3NF because
Note that a relation such as EMPLOYEE(EMP_ID, EMP_NAME, Street, City, Zip, State) is not in 3NF.
This is a classical example you can find in many database textbooks. The FDs are actually not valid in the United States. See, for example: Why all 5-digit ZIP Code™ lists are obsolete.
Note:
Example
Consider the relation
S(SNUM, PNUM, SNAME, QUANTITY) with the following assumptions:
We have the following non-trivial functional dependencies:
Note that SNUM and SNAME are equivalent.
The candidate keys are:
Prime attributes: SNUM, PNUM, SNAME
Non-prime attribute: QUANTITY.
S is in 3NF because
The relation is in 3NF. However, there are unnecessary redundancy.
SNUM | SNAME | PNUM | QUANTITY |
S1 | ABC | P1 | 10 |
S1 | ABC | P2 | 20 |
S1 | ABC | P3 | 21 |
S2 | DEF | P1 | 40 |
S2 | DEF | P4 | 13 |
S3 | XYK | P3 | 18 |
BCNF (Boyce-Codd Normal Form)
Example
EMPLOYEE(EMP_NO, NAME, DEPT_NO, MANAGER_NO) with
EMP_NO -> NAME
EMP_NO -> DEPT_NO
DEPT_NO -> MANAGER_NO
is not in BCNF.
The functional dependency DEPT_NO -> MANAGER_NO is
(1) non-trivial, and
(2) DEPT_NO is not a superkey.
We can decompose
EMPLOYEE(EMP_NO, NAME, DEPT_NO, MANAGER_NO) into
EMP(EMP_NO, NAME, DEPT_NO) with
EMP_NO -> NAME, DEPT_NO
and
DEPARTMENT(DEPT_NO, MANAGER_NO) with
DEPT_NO -> MANAGER_NO
Both relations are in BCNF since
Recall that these are the good relations without the anomalies in the previous example.
Example
Consider again the relation
S(SNUM, PNUM, SNAME, QUANTITY) with the following non-trivial functional dependencies:
Note that SNUM and SNAME are equivalent.
The candidate keys are:
Prime attributes: SNUM, PNUM, SNAME
Non-prime attribute: QUANTITY.
S is not in BCNF because, for example, the functional dependency
SNUM -> SNAME is
To deal with it, we can decompose S(SNUM, PNUM, SNAME, QUANTITY) into
(1) SUPPLIER(SNUM, SNAME) with
SNUM -> SNAME
SNAME -> SNUM
with two candidate keys:
(2) SUPPLY(SNUM, PNUM, QUANTITY) with
SNUM, PNUM -> QUANTITY.
Both are in BCNF.
Example:
Consider the relation R(A, B, C, D) with
A -> B, B -> C, C -> A and C -> D.
There are three candidate keys:
Since every left hand side of any non-trivial functional dependency is a superkey, R is in BCNF.
Motivation of BCNF
Consider the relation R(CITY, ZIP, STREET) again
Using the code for the postal office, we have
CITY STREET -> ZIP, and ZIP -> CITY.
Hence, there are two candidate keys:
Therefore, R is not in BCNF since in ZIP -> CITY, ZIP is not a superkey.
However, if we decompose R into two relations, each with two attributes, then the functional dependency
CITY STREET -> ZIP is lost (i.e. cannot be enforced within a single relation)
Therefore, we better leave the relation alone.
Normalization Theory Using Functional Dependencies
Example:
[1] Consider R(A, B, C, D, E) with
F = {BC->D, A->C, C->BD, AD->E}
(a) What are A+, B+, C+, D+ and E+?
(b) What are the candidate keys? Why?
(c) Show all prime attributes and non-prime attributes?
(d) Give a canonical cover of F?
(e) What is the highest normal form (up to BCNF) of R? Why?
(f) If R is not in BCNF, can you provide a lossless FD preserving decompositions of R into BCNF relations?
Solution:
(a) A+=ABCDE, B+=B, C+= BCD, D+=D, E+=E
(b) The candidate key is A
(c) Prime: A, non-prime: BCDE
(d) {A->CE, C->BD}
(e) 2NF since C->D violates 3NF: D is non-prime and C is not a Sk.
(f) Yes, the decomposition to R(A,C,E) {A->CE} and R2(B,C,D) {C->BD} satisfy the requirement.
[2] Consider R(A, B, C, D, E, F) with
F = {CD->E, A->BD, AC->EF. C->BD, F->E, EF->D}
(a) What are A+, B+, C+, D+, E+, F+?
(b) What are the candidate keys? Why?
(c) Show all prime attributes and non-prime attributes?
(d) Give a canonical cover of F?
(e) What is the highest normal form (up to BCNF) of R? Why?
(f) If R is not in BCNF, can you provide a lossless FD preserving decompositions of R into BCNF relations?
Solution:
(a) A+=ABD, B+=B, C+=BCDE, D+=D, E+=E, F+=DEF
(b) The candidate key is AC
(c) Prime: AC; non-prime: BDEF
(d) {A->BD, AC->F, C->BDE, F->DE}
(e) 1NF. A->B violates 2NF as A is a part of the a CK and B is non-prime.
(f) Yes, the following decomposition satisfies the requirement:
R1(B,C,D,E) {C->BDE}
R2(D,E,F) {F->DE}
R3(A,B,D) {A->BD}
R4(A,C,F) {AC->F}
3. Decomposition
Lossy Decomposition
Example:
Consider the relation EMP(EMP_NO, DEPT_NO, MANAGER_NO) with
EMP_NO -> DEPT_NO
DEPT_NO -> MANAGER_NO
Note that we do not have MANAGER_NO -> DEPT_NO in this example, since a manager can manage more than one departments under the assumptions made for this example.
EMP_NO | DEPT_NO | MANAGER_NO |
12345 | ACCT | 90000 |
12399 | HR | 90000 |
30000 | ENG | 98000 |
The relation is not in BCNF because of the FD
DEPT_NO -> MANAGER_NO
Suppose we decompose the relation into
EMP1(EMP_NO, MANAGER_NO)
DEPT(DEPT_NO, MANAGER_NO)
The common attribute is MANAGER_NO. They are obtained by projections from EMP:
EMP1:
EMP_NO | MANAGER_NO |
12345 | 90000 |
12399 | 90000 |
30000 | 98000 |
DEPT:
DEPT_NO | MANAGER_NO |
ACCT | 90000 |
HR | 90000 |
ENG | 98000 |
If we do not lose any information by the decomposition, we should get the original relation from the natural join.
However, EMP1 |x| DEPT is
EMP_NO | DEPT_NO | MANAGER_NO |
12345 | ACCT | 90000 |
12345 | HR | 90000 |
12399 | ACCT | 90000 |
12399 | HR | 90000 |
30000 | ENG | 98000 |
This is not the same as the original relation EMP. Spurious tuples were incorrectly created.
Hence, the decomposition of EMP(EMP_NO, DEPT_NO, MANAGER_NO) into
EMP1(EMP_NO, MANAGER_NO) and
DEPT(DEPT_NO, MANAGER_NO)
is lossy. It is not a good decomposition.
Example: A lossy decomposition using the supply database:
supply(snum, pnum, quantity) {snum, pnum} -> quantity
decomposed into:
s1(snum, quantity)
s2(pnum, quantity)
select *
from (select snum, quantity from supply) as s1
natural join
(select snum, pnum from supply) as s2;
Lossless Decomposition
Example:
Consider now the following decomposition of EMP(EMP_NO, DEPT_NO, MANAGER_NO):
EMP2(EMP_NO, DEPT_NO) and
EMP3(EMP_NO, MANAGER_NO)
The common attribute is EMP_NO. We have EMP2 and EMP3:
EMP2:
EMP_NO | DEPT_NO |
12345 | ACCT |
12399 | HR |
30000 | ENG |
EMP3:
EMP_NO | MANAGER_NO |
12345 | 90000 |
12399 | 90000 |
30000 | 98000 |
Hence, EMP2 |x| EMP3:
EMP_NO | DEPT_NO | MANAGER_NO |
12345 | ACCT | 90000 |
12399 | HR | 90000 |
30000 | ENG | 98000 |
This is exactly the same as the original relation EMP. Therefore, the decomposition does not lose any information. It is a lossless decomposition.
Theory of Lossless Decomposition
Example:
Why is the decomposition of EMP(EMP_NO, DEPT_NO, MANAGER_NO) into
(1) EMP1(EMP_NO, MANAGER_NO) and DEPT(DEPT_NO, MANAGER_NO) lossy, and
(2) EMP2(EMP_NO, DEPT_NO) and EMP3(EMP_NO, MANAGER_NO) lossless?
Theorem: Suppose R(X, Y, Z) is decomposed into R1(X, Y) and R2(X, Z). X is the set of common attributes in R1 and R2. The decomposition is lossless if and only if
(a) X -> Y, or
(b) X -> Z.
Example:
In case (1), X is MANAGER_NO, Y is EMP_NO, Z is DEPT_NO.
Neither condition (a) not (b) is satisfied. Hence, (1) is lossy.
In case (2), X is EMP_NO, Y is DEPT_NO, Z is MANAGER_NO.
Both conditions (a) and (b) are satisfied. Hence, (2) is lossless.
Example:
Consider R(A,B,C,D,E) with {A->BC, CD -> E, BA -> C, D->B}.
It is decomposed into R1(A,B), R2(A,C), R3(C,D,E) and R4(B,E).
Step 1. Create a table of 5 columns (number of columns) and 4 rows (number of relations). Populate it with b(i,j).
Relation | A |
B |
C |
D |
E |
R1 | b(1,1) |
b(1,2) |
b(1,3) |
b(1,4) |
b(1,5) |
R2 | b(2,1) |
b(2,2) |
b(2,3) |
b(2,4) |
b(2,5) |
R3 | b(3,1) |
b(3,2) |
b(3,3) |
b(3,4) |
b(3,5) |
R4 | b(4,1) |
b(4,2) |
b(4,3) |
b(4,4) |
b(4,5) |
Step 2. For each relation Ri, set all attribute Aj that appears in Ri from b(i,j) to a(j).
Relation | A |
B |
C |
D |
E |
R1 | a(1) |
a(2) |
b(1,3) |
b(1,4) |
b(1,5) |
R2 | a(1) |
b(2,2) |
a(3) |
b(2,4) |
b(2,5) |
R3 | b(3,1) |
b(3,2) |
a(3) |
a(4) |
a(5) |
R4 | b(4,1) |
a(2) |
b(4,3) |
b(4,4) |
a(5) |
Step 3. While changes can be made with a FD X-> Y, with two rows in the table having the common X values in the following manner:
for every attribute W in Y:
Note that a specific FD can be applied more than once.
Applying A-> BC:
Relation | A |
B |
C |
D |
E |
R1 | a(1) |
a(2) |
a(3) |
b(1,4) |
b(1,5) |
R2 | a(1) |
a(2) |
a(3) |
b(2,4) |
b(1,5) |
R3 | b(3,1) |
b(3,2) |
a(3) |
a(4) |
a(5) |
R4 | b(4,1) |
a(2) |
b(4,3) |
b(4,4) |
a(5) |
Applying CD -> E: no change since no two rows has the same values in CD.
Relation | A |
B |
C |
D |
E |
R1 | a(1) |
a(2) |
a(3) |
b(1,4) |
b(1,5) |
R2 | a(1) |
a(2) |
a(3) |
b(2,4) |
b(2,5) |
R3 | b(3,1) |
b(3,2) |
a(3) |
a(4) |
a(5) |
R4 | b(4,1) |
a(2) |
b(4,3) |
b(4,4) |
a(5) |
Applying BA -> C: no change since R1 and R2 already have the same a's value: a(3).
Relation | A |
B |
C |
D |
E |
R1 | a(1) |
a(2) |
a(3) |
b(1,4) |
b(1,5) |
R2 | a(1) |
a(2) |
a(3) |
b(2,4) |
b(2,5) |
R3 | b(3,1) |
b(3,2) |
a(3) |
a(4) |
a(5) |
R4 | b(4,1) |
a(2) |
b(4,3) |
b(4,4) |
a(5) |
Applying D->B: no change. No D's have the same value.
Relation | A |
B |
C |
D |
E |
R1 | a(1) |
a(2) |
a(3) |
b(1,4) |
b(1,5) |
R2 | a(1) |
a(2) |
a(3) |
b(2,4) |
b(2,5) |
R3 | b(3,1) |
b(3,2) |
a(3) |
a(4) |
a(5) |
R4 | b(4,1) |
a(2) |
b(4,3) |
b(4,4) |
a(5) |
In fact, no FD can be applied again to change the matrix.
Step 4. If there is a row with only a's, the decomposition is lossless. Otherwise, there is no row with only a's and the decomposition is lossy.
Since there is no row with only a's, the decomposition is lossy.
Example:
Now suppose that C->DE is also in the FDs. That is, we have:
R(A,B,C,D,E) with {A->BC, CD -> E, BA -> C, D->B, C->DE}.
We will now have one more step.
Applying C->DE:
Relation | A |
B |
C |
D |
E |
R1 | a(1) |
a(2) |
a(3) |
a(4) |
a(5)
|
R2 | a(1) |
a(2) |
a(3) |
a(4) |
a(5)
|
R3 | b(3,1) |
b(3,2) |
a(3) |
a(4) |
a(5)
|
R4 | b(4,1) |
a(2) |
b(4,3) |
b(1,4) |
a(5) |
Now we have two rows with only a's and thus the decomposition is lossless.
Dependency-Preserving Decomposition
Example:
For the relation EMP(EMP_NO,DEPT_NO,MANAGER_NO) with
EMP_NO -> DEPT_NO
DEPT_NO -> MANAGER_NO,
The decomposition of EMP into
EMP2(EMP_NO, DEPT_NO) and
EMP3(EMP_NO, MANAGER_NO)
is lossless but does not preserve dependencies:
the FD DEPT_NO -> MANAGER_NO
cannot be enforced by any relation after the decomposition. No relation contains both attributes.
For example, if we add the information EMP 23000 work in the ACCT department under manager 97000 and are not careful, we may have:
EMP2:
EMP_NO | DEPT_NO |
12345 | ACCT |
12399 | HR |
30000 | ENG |
23000 | ACCT |
EMP3:
EMP_NO | MANAGER_NO |
12345 | 90000 |
12399 | 90000 |
30000 | 98000 |
23000 | 97000 |
The FD DEPT_NO -> MANAGER_NO is violated.
Thus, for the relation EMP(EMP_NO,DEPT_NO,MANAGER_NO) with
EMP_NO -> DEPT_NO
DEPT_NO -> MANAGER_NO,
the best decomposition is into
EMP1(EMP_NO, DEPT_NO) and
DEPT(DEPT_NO, MANAGER_NO)
It is easy to show that, the decomposition is lossless, preserves dependencies, and that EMP1 and DEPT are both in BCNF.
Algorithm for decomposition in 3NF relations
Example:
Consider R(A,B,C,D,E) with F = {A->BC, CD -> E, BA -> C, D->B}.
Step 1. Find a canonical cover (as opposed to a minimal cover in EN) G for F.
The FD BA->C is redundant.
G = {A->BC, CD -> E, D->B} is a canonical cover.
Step 2. For every FD X->Y in G, create a relation with the schema XY and add it to the result D.
Relations created:
R1(A,B,C) with A->BC
R2(C,D,E) with CD->E
R3(B,D) with D->B
This step ensures that all FDs are preserved.
Step 3. If no relation in D contains a candidate key of R, create a new relation with a candidate key of R being the schema and add it to the result D.
There is only one candidate key of R: AD. Since none of R1, R2 and R3 contains both A and D, create the relation
R4(A,D) with no FD
Step 4. Simplify D by removing relations that are redundant (i.e. that its schema is a subset of the schema of another relation).
No action as there is no redundant relation.
The result relations are all in BCNF.
Example:
Consider R(A,B,C,D,E) with {A->BCD, BC->D, D->C}
Using the algorithm,
(1) Canonical cover: {A->BC, BC->D, D->C}
(2) The following relations are created:
R1(A,B,C) with {A-> BC},
R2(B,C,D) with {BC->D, D->C},
R3(C,D) with {D->C}
(3) There is only one candidate key AE. Since it is not in any of R1, R2 or R3, R4 is created.
R4(A,E)
(4) R3(C,D) is removed as redundant since it is a subset of R2.
As the result, we have:
R1(A,B,C) with {A-> BC}, in BCNF
R2(B,C,D) with {BC->D, D->C}, in 3NF but not in BCNF
R4(A,E) with {}, in BCNF
Example:
Consider the relation R(A, B, C) with A -> B and C -> B.
R is not in 2NF. It is not possible to decompose R into two relations losslessly while preserving all functional dependencies.
However, it is possible to decompose into three relations losslessly and with all functional dependencies preserved:
R1(A, B),
R2(B, C) and
R3(A, C).
4. Higher Normal Forms (Additional Materials only; will not be in the exam.)
Multivalued Dependencies
Example 1:
Consider the following instance of the relation R(Emp_No, Dept_NO, Skill):
EMP_NO | DEPT_NO | SKILL |
100 | D101 | PHP |
100 | D102 | PHP |
100 | D101 | MySQL |
100 | D102 | MySQL |
200 | D101 | PHP |
300 | D103 | Graphics |
300 | D104 | Graphics |
400 | D102 | PHP |
400 | D102 | Graphics |
400 | D102 | MySQL |
There are no non-trivial functional dependencies. R is in BCNF.
If the department an employee is working on is independent of the skill that he has, there is redundancy. For example, the fact that employee 100 has the skill PHP is stored twice.
Example 2:
Under the assumption of the previous example, we have
Emp_No ->-> Dept_NO and
Emp_No ->-> Skill.
Suppose we have the following relation instance under a different set of assumptions:
EMP_NO | DEPT_NO | SKILL |
100 | D101 | PHP |
100 | D102 | PHP |
100 | D101 | MySQL |
300 | D103 | Graphics |
Without knowing the underlying assumptions, this instance of the relation implies that the multivalued dependency Emp_No ->-> Dept_NO does not hold true since
Note:
A more precise definition of MVD:
Example 3:
Consider Example 1 again. In the data modeling, there may be two classes with a many-to-many association.
If one follows the data modeling and the mapping to relation guideline, two relations will be created (instead of one).
This shows the importance of good data modeling.
Some Properties Of Multivalued Dependencies
Fourth Normal Form
Example 4:
The relation R(Emp_No, Dept_NO, Skill) in Example 1 is in BCNF but not in 4NF. It should be decomposed into:
R1(Emp_No, Dept_NO) and
R2(Emp_No, Skill).
Note that these two relations are created with good data modeling and mapping to relations.
Embedded Multivalued Dependencies
Example 5:
Consider the following relation R(Proj_No, Emp_No, Skill):
PROJ_NO | EMP_NO | SKILL |
P1 | E1 | PHP |
P1 | E2 | PHP |
P1 | E1 | MySQL |
P1 | E2 | MySQL |
P2 | E1 | Graphics |
P2 | E3 | Graphics |
P3 | E3 | Graphics |
P3 | E3 | PHP |
P4 | E4 | MySQL |
For this application, each project (Proj_No) has a number of employees (Emp_No) and each project requires a list of skills (Skill). We have:
Proj_No ->-> Emp_No
Proj_No ->-> Skill.
An employee provides some skills that a project needs. If an employee has a skill that is not needed by the project (e.g. employee E1 may have the skill of 'Internet'), the skill is not stored in the tuples of project P1, which does not require the skill 'Internet'.
Note that Emp_No ->-> Skill does not hold in R.
R is not in 4NF.
We can decompose the relation into two relations:
R1(Proj_No, Emp_No) and
R2(Proj_No, Skill)
R1:
PROJ_NO | EMP_NO |
P1 | E1 |
P1 | E2 |
P2 | E1 |
P2 | E3 |
P3 | E3 |
P4 | E4 |
R2:
PROJ_NO | SKILL |
P1 | PHP |
P1 | MySQL |
P2 | Graphics |
P3 | Graphics |
P3 | PHP |
P4 | MySQL |
Both R1 and R2 are now in 4NF.
However, if skill is a multi-valued attribute of an employee, then we should have
Emp_No ->-> Skill
It does not show up in R because it is embedded. If we project R to remove PROJ_NO, this relationship appears.
These embedded MVD's are not enforced by the relations R1 and R2.
These MVD's only display themselves after projection.
We may have the following three classes with three many-to-many associations between each pair of them:
Decomposition of R into R1 and R2 will lose this embedded multivalued dependency.
Example 6:
There are (trivial) embedded MVD's Emp_No ->-> Skill | φ and Emp_No ->-> Proj_No | φ in R of the previous example.
To remedy the problem, decompose the relation R into:
R1(Proj_No, Emp_No)
R2(Proj_No, Skill) and
R3(Emp_No, Skill).
All relations are in 4NF (5NF too) and the embedded MVD are not lost.
Example 7:
Consider an application with four classes A, B, C and D with primary keys A_ID, B_ID, C_ID and D_ID. There are many to many binary associations between A and C as well as B and C. Furthermore, there is a ternary association between A, B and D.
If somebody has not performed a good data modeling, it is possible to come up with a relation R(A_ID, B_ID, C_ID, D_ID).
However, the MVD C_ID ->-> A_ID (or C_ID ->-> B_ID) is not true in R because of the additional attribute D_ID. If the attribute D_ID is removed by projection, then the independence between A_ID and B_ID for a given value of C_ID will show up. Hence, there is a embedded MVD C_ID ->-> A_ID | B_ID in R.
Join Dependencies
Example 8:
In Example 5 with embedded MVD, there is a non-trivial JD of R(Proj_No, Emp_No, Skill):
{{Proj_No, Emp_No}, {Proj_No, Skill}, {Emp_No, Skill}}
Fifth (Project-Join) Normal Form
Example 9:
The relation R(Proj_No, Emp_No, Skill) of Example 3 does not satisfy 5NF.
Domain-Key Normal Form
Example 10:
Consider the relation scheme Enrollment(Course_No, Student_No, Grade)
The key is {Course_No, Student_No}.
KC: Course_No, Student_No -> Grade.
DC:
Domain(Course_No): 001..999 (i.e. In(Course_No, {001..999}
Domain(Grade): {A,B,C,D,F,I,P}
Domain(Student_No): string(1..10)
GC:
One of the GC may be:
if Course_No mod 10 >= 8 then
(Grade ε {'P', 'F', 'I'}
else
(Grade ε {'A', 'B', 'C', 'D', 'F', 'I'};
end if;
The relation Enrollment is in PJNF.
However, the relation Enrollment is not in DKNF since the GC is not a natural consequence of the KC and DC.
To solve the problem, we may make the following decomposition:
Pass_Fail_Course_Enrollment(Course_No, Student_No, Grade) with
DC:
Domain(Course_No): {I | I ε 1..999 and I mod 10 >= 8}
Domain(Grade): {'F', 'I', 'P'}
Regular_Course_Enrollment(Course_No, Student_No, Grade) with
DC:
Domain(Course_No): {I | I ε 1..999 and I mod 10 < 8}
Domain(Grade): {'A', 'B', 'C', 'D', 'F', 'I'}
Both relations are now in DKNF. However, this is usually not done. Instead, stored procedures may be used to enforce the constraint.
Example 11:
A non-trivial FD is a GC.
A FD with the determinant being a key is a KC.
Hence, if a relation satisfies BCNF, then all its FD can be deducted from KC.
If a relation does not satisfy BCNF, then there is a FD with a non-key determinant. Thus, it will not satisfy DKNF too.
Example 12:
A non-trivial MVD is a GC.
There is no way to express a MVD using KC or DC.
Thus, if a relation is not in 4NF, it is also not in DKNF.