CSCI 5333.4 DBMS
Homework #5

Due date: 10/26 at the beginning of the class. (updated: 10/26/2011)

(1) As the volume and complexity of data increase quickly, data preparation becomes an important issue for database and data mining. This assignment combines data preparation and simple database design and will serve as the template of the next assignment on SQL/PHP.

Eventually, we will build a Web application to show selected statistics of education for different countries in the world. There are many public datasets available, we will use a dataset from United Nation Data:

http://data.un.org/Explorer.aspx?d=SOWC

In "Global Indicator Database", there are five datasets:

Your task is to download all five datasets, clean and simplify them, design relational schema and store all information of the data in the database.

The site provides various download options, select the ones that are most suitable for your tasks.

Study the data to clean and simplify it. For examples, selected fields may have the sane values for all entries and are thus not needed to store in a separate column in a relation. Some field values may be derivable from other fields.

Turn in:

(1) A report on how you obtain, clean and store the data.

(a) What format of the data have you downloaded and why?

(b) What properties have you found about the five datasets that can be used to prepare and simplify the data?

(c) What mechanisms have you employed to clean and prepare the data and why?

(d) What are the relation schema of the tables you have designed to store the data? Identify all candidate keys.

(e) What are the SQL commands you have used to create the tables and populate them?

(f) What are the format of the cleaned data files you used to populate the tables, and why?

The answer to question (1) should be a MS Word file with the name of <<Yourlastname>>_h5report.docx.

(2) Turn in the text file that contains the answer of (e). Call your file <<Yourlastname>>_h5sql.txt. Also turn in the cleaned data files with the names <<Yourlastname>>_primary.txt, <<Yourlastname>>_secondary.txt, <<Yourlastname>>_tertiary.txt, <<Yourlastname>>_GNIl.txt and <<Yourlastname>>_expenditure.txt respectively.

Send all seven files in a single email to the TA with the subject: HW#5: <<Last Name>>, <<First Name>>, <<Student ID>>.

Notes and tips: