CSCI 5733
XML Application Development
Spring 2006
Homework #1
Due date: Feb 9, 2006 (Thursday)
(a) XML Generation (20%) Write a simple Perl program that reads a TSV (Tab Separated Values) text file storing library information and writes a KML file to the standard output stream. The first line of the TSV file gives the column names. The remaining lines contain one record per line, with fields separated by tabs. Keyhole Markup Language (KML) is the native markup language of the popular Google Earth software: http://www.keyhole.com/kml/kml_doc.html.
There are many fields in the input file and many elements in KML. Study the following example carefully to find out what to extract and what to include in the output KML.
Your program should take one command line argument, the input file name, e.g.,
h1a.pl input.tsv.txt > output.kml
For this input file input.tsv.txt, you get this output.kml.
You must name your file h1a.pl and put it in a top-level directory "hw" (not under "pages") in your csci (i.e. dcm) account so the TA can test it with a standard test set. Create a bat file h1a.bat so the program can be run as below, with the input and output files as command-line arguments.
h1a input.tsv.txt output.kml
You may also use Java for this project. If so, name your program h1a.java and have h1a.bat call it.
If you do not follow the naming convention, your program may not be graded, since an automatic grading program may be used.
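The TSV-to-KML conversion in (a) can be sketched as follows. This is only an illustration in Python, not the required Perl solution. The column names (name, latitude, longitude), the sample record, the choice of KML elements, and the namespace URI are all assumptions; take the real ones from the example input and output files.

```python
import csv
import io
from xml.sax.saxutils import escape


def tsv_to_kml(tsv_text):
    """Convert TSV text (column names on the first line) to a KML string."""
    # DictReader takes the first line as the column names.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        # Assumed namespace URI; check the KML documentation.
        '<kml xmlns="http://earth.google.com/kml/2.0">',
        "<Folder>",
    ]
    for row in reader:
        # Hypothetical column names; replace with those of the real file.
        lines.append("<Placemark>")
        lines.append("<name>%s</name>" % escape(row["name"]))
        # KML coordinates are written as longitude,latitude,altitude.
        lines.append(
            "<Point><coordinates>%s,%s,0</coordinates></Point>"
            % (row["longitude"], row["latitude"])
        )
        lines.append("</Placemark>")
    lines += ["</Folder>", "</kml>"]
    return "\n".join(lines) + "\n"


# Made-up sample record, for illustration only.
sample = "name\tlatitude\tlongitude\nSample & Co. Library\t29.5\t-95.1\n"
kml = tsv_to_kml(sample)
```

Escaping the field values matters because characters such as & and < in a library name would otherwise make the output ill-formed.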
In the next two questions, we will use an XML dataset storing article information from SIGMOD Record: http://www.cs.washington.edu/research/xmldatasets/www/repository.html#sigmod-record. The dataset is http://www.cs.washington.edu/research/xmldatasets/data/sigmod-record/SigmodRecord.xml. However, it is not well-formed. I have thus converted it to a well-formed document for our use: SigmodRecord_alt.xml. This quick and dirty conversion is semantically incorrect, since all potentially violating entities are converted to '. Since we are only interested in having a good dataset of an appropriate size, it will serve our programming assignment well.
(b) XML consumption and storage (35%). Store SigmodRecord_alt.xml in the hw directory of your dcm account. Write a Perl program, h1b.pl, to extract the information and store it in a MySQL database, which will then be used in the next question. You should store it in your own database in MySQL. You can design your own tables. You may choose to store the column values in XML-encoded format or as regular strings. However, you need to state the relation schema and your decision.
Your tables should be created by your program. You may assume that the tables do not exist before the program is run (so you do not need to drop the tables inside your program). Your program should be run like this:
h1b.pl SigmodRecord_alt.xml
It is not necessary to use an XML parser. The idea is to show one of the advantages of XML: it is very easy to understand and parse. Regular expressions should be sufficient. The solution will use regular expressions. We will have plenty of opportunities to use XML parsers in the coming assignments.
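The regular-expression approach in (b) can be sketched as follows. This is an illustration in Python, not the required Perl program. It assumes the dataset nests issue, volume, number, article, title and author elements as in the small sample below; verify this against SigmodRecord_alt.xml. The two-table schema is hypothetical, and sqlite3 stands in for MySQL so that the sketch runs without a server.

```python
import re
import sqlite3

# Assumed element layout; check it against the real file.
ISSUE_RE = re.compile(
    r"<issue>\s*<volume>(.*?)</volume>\s*<number>(.*?)</number>(.*?)</issue>",
    re.S,
)
ARTICLE_RE = re.compile(r"<article>(.*?)</article>", re.S)
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.S)
# \b keeps this pattern from matching the enclosing <authors> tag.
AUTHOR_RE = re.compile(r"<author\b[^>]*>(.*?)</author>", re.S)


def load(xml_text, conn):
    """Create the (hypothetical) tables and fill them from the XML text."""
    conn.execute(
        "CREATE TABLE article (id INTEGER PRIMARY KEY, "
        "volume TEXT, number TEXT, title TEXT)"
    )
    conn.execute("CREATE TABLE author (article_id INTEGER, name TEXT)")
    for volume, number, body in ISSUE_RE.findall(xml_text):
        for article in ARTICLE_RE.findall(body):
            title = TITLE_RE.search(article).group(1).strip()
            cur = conn.execute(
                "INSERT INTO article (volume, number, title) VALUES (?, ?, ?)",
                (volume.strip(), number.strip(), title),
            )
            for name in AUTHOR_RE.findall(article):
                conn.execute(
                    "INSERT INTO author (article_id, name) VALUES (?, ?)",
                    (cur.lastrowid, name.strip()),
                )
    conn.commit()


# Small sample in the assumed layout, for illustration only.
SAMPLE = """<SigmodRecord><issue><volume>21</volume><number>4</number>
<articles><article>
<title>Database Research at the Queensland University of Technology.</title>
<authors><author position="00">Mike P. Papazoglou</author></authors>
</article></articles></issue></SigmodRecord>"""

conn = sqlite3.connect(":memory:")
load(SAMPLE, conn)
articles = conn.execute("SELECT volume, number, title FROM article").fetchall()
authors = conn.execute("SELECT name FROM author").fetchall()
```

The sketch stores values exactly as they appear in the file, that is, in XML-encoded form. Decoding them to regular strings before insertion is the other option mentioned above.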
(c) Simple XML Web Service (45%) Write a CGI-Perl program (PHP acceptable) to serve information about Sigmod Records publications in XML format.
Your program should be called h1c.pl (or h1c.php if you are using PHP). The program accepts five different HTTP parameters which are elaborated by the following examples:
http://dcm.cl.uh.edu/youraccount/h1c.pl?authors=yes
should return:
<?xml version='1.0' encoding='ISO-8859-1' ?>
<authors>
<author>'zg'r Ulusoy</author>
<author>A. Bolour</author>
<author>A. Desai Narasimhalu</author>
<author>A. Mahboob</author>
<author>A. Min Tjoa</author>
<author>A. Prasad Sistla</author>
<author>A. R. Hurson</author>
<author>Aaron M. Tenenbaum</author>
<author>Aaron Watters</author>
...
</authors>
http://dcm.cl.uh.edu/youraccount/h1c.pl?author=Mike+P.+Papazoglou
should return:
<?xml version='1.0' encoding='ISO-8859-1' ?>
<articles author="Mike P. Papazoglou">
<article volume="21" number="4">Database Research at the Queensland University of Technology.</article>
<article volume="28" number="1">Contextualizing the Information Space in Federated Digital Libraries.</article>
</articles>
http://dcm.cl.uh.edu/youraccount/h1c.pl?issues=yes
should return:
<?xml version='1.0' encoding='ISO-8859-1' ?>
<issues>
<issue volume="11" number="1" />
<issue volume="11" number="3" />
<issue volume="11" number="4" />
<issue volume="12" number="1" />
<issue volume="12" number="2" />
<issue volume="12" number="3" />
...
</issues>
http://dcm.cl.uh.edu/youraccount/h1c.pl?volume=13&number=03
should return:
<?xml version='1.0' encoding='ISO-8859-1' ?>
<issue volume="13" number="03">
<article>Adding Database Management to Ada.</article>
<article>An Informal Approach to Formal Specifications.</article>
<article>Comparison of Four Relational Data Model Specifications.</article>
<article>How the DBA Stole Christmas, or The Case for Separately Compiled Access Modules.</article>
<article>Interfacing a Query Language to a CODASYL DBMS.</article>
<article>Research Issues in Database Specification.</article>
<article>The Devolution of Functional Analysis.</article>
</issue>
Your program does not need to handle erroneous input values.
The data must be extracted from the MySQL database you created in question (b). Study the output of the examples very carefully, as the program may be graded automatically.
This approach is similar to Web Services provided through Representational State Transfer (REST). In REST, the basic idea is to use only HTTP to provide Web Services, with results returned in XML format. All input parameters to the Web Services are stored in the URL. However, there are also differences. For example, in REST, the recommended URL format for:
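The parameter handling in (c) can be sketched as follows. This is an illustration in Python, not the required CGI-Perl or PHP program. The ROWS list is a hypothetical stand-in for the results of MySQL queries and is assumed to hold plain, unencoded strings. Volume and number are compared as strings here, which is also an assumption.

```python
from urllib.parse import parse_qs
from xml.sax.saxutils import escape, quoteattr

HEADER = "<?xml version='1.0' encoding='ISO-8859-1' ?>"

# Hypothetical rows (volume, number, title, author) standing in for the database.
ROWS = [
    ("21", "4",
     "Database Research at the Queensland University of Technology.",
     "Mike P. Papazoglou"),
    ("28", "1",
     "Contextualizing the Information Space in Federated Digital Libraries.",
     "Mike P. Papazoglou"),
]


def respond(query_string):
    """Build the XML body for one request from its raw query string."""
    # parse_qs decodes '+' and %XX escapes; keep the first value of each key.
    params = {k: v[0] for k, v in parse_qs(query_string).items()}
    out = [HEADER]
    if "authors" in params:
        out.append("<authors>")
        for name in sorted({row[3] for row in ROWS}):
            out.append("<author>%s</author>" % escape(name))
        out.append("</authors>")
    elif "author" in params:
        name = params["author"]
        out.append("<articles author=%s>" % quoteattr(name))
        for volume, number, title, author in ROWS:
            if author == name:
                out.append('<article volume="%s" number="%s">%s</article>'
                           % (volume, number, escape(title)))
        out.append("</articles>")
    elif "issues" in params:
        out.append("<issues>")
        for volume, number in sorted({(row[0], row[1]) for row in ROWS}):
            out.append('<issue volume="%s" number="%s" />' % (volume, number))
        out.append("</issues>")
    elif "volume" in params and "number" in params:
        volume, number = params["volume"], params["number"]
        out.append('<issue volume="%s" number="%s">' % (volume, number))
        titles = {row[2] for row in ROWS if row[:2] == (volume, number)}
        for title in sorted(titles):
            out.append("<article>%s</article>" % escape(title))
        out.append("</issue>")
    return "\n".join(out) + "\n"


body = respond("author=Mike+P.+Papazoglou")
```

A real CGI script would also print a Content-Type header line and a blank line before this body.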
http://dcm.cl.uh.edu/youraccount/h1c.pl?author=Mike+P.+Papazoglou
is something like:
http://dcm.cl.uh.edu/youraccount/authors/Mike%20P.%20Papazoglou
REST is popular. For example, Amazon.com provides Web Services using either REST or SOAP. The vast majority of developers use REST because of its simplicity.
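The difference between the two URL styles above can be sketched as a small rewrite function. This is an illustration in Python; the function name to_rest_path is hypothetical, and the mapping from the script name to an authors/ path segment simply follows the example URLs.

```python
from urllib.parse import parse_qs, quote, urlsplit


def to_rest_path(url):
    """Rewrite .../h1c.pl?author=NAME as .../authors/NAME (percent-encoded)."""
    parts = urlsplit(url)
    # parse_qs decodes '+' to a space in the author name.
    name = parse_qs(parts.query)["author"][0]
    # Drop the script name, keeping the directory part of the path.
    base = parts.path.rsplit("/", 1)[0]
    return "%s://%s%s/authors/%s" % (
        parts.scheme, parts.netloc, base, quote(name)
    )


rest_url = to_rest_path(
    "http://dcm.cl.uh.edu/youraccount/h1c.pl?author=Mike+P.+Papazoglou"
)
```

In the path form a space must be written as %20, since the + shorthand is only understood inside a query string.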
Hints and notes:
Turn in your work in an envelope with your name, section and student ID clearly specified.
Dr. Kwok-Bun Yue
Professor, Computer Science and Computer Information Systems
Chair, Division of Computing and Mathematics
University of Houston-Clear Lake
2700 Bay Area Boulevard
Houston, TX 77058
Yue's home page
yue@uhcl.edu
281-283-3864