CSCI 4230
 Internet Application Development
 Spring 2001
 Homework #3

Due date: February 19, 2001.

(1)    Write a function hashify that takes any number of parameters and returns a hash.  Each parameter is a string in the format "student;major;grade".  The key of the hash is the major field.  The value is a reference to a hash whose keys are student names and whose values are the corresponding grades.

For example, for

my @students = ("Bun Yue;CSCI;A",
                "Mary Matalin;CSCI;B",
                "Joe Go;SWEN;B+",
                "Kenny Bee;CSCI;B",
                "Sadegh Davari;SWEN;A",
                "Mary Matalin;CSCI;C");

my %result = hashify(@students);

After executing the code above, %result will have the following contents:

$result{'CSCI'} = {"Bun Yue" => "A", "Kenny Bee" => "B", "Mary Matalin" => "C"};
$result{'SWEN'} = {"Joe Go" => "B+", "Sadegh Davari" => "A"};
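
One possible starting point for question (1) is sketched below.  It assumes the ';' separator used in the sample data and that a later record for the same student and major replaces the earlier grade, as the Mary Matalin entries suggest.  The variable names are illustrative only.

```perl
use strict;
use warnings;

# Sketch: build a hash keyed by major; each value is a reference to a
# hash that maps student name to grade.
sub hashify {
    my %by_major;
    for my $record (@_) {
        # Assumption: fields are separated by ';' as in the sample data.
        my ($student, $major, $grade) = split /;/, $record, 3;
        # A later record for the same student and major overwrites the grade.
        $by_major{$major}{$student} = $grade;
    }
    return %by_major;
}

my %demo = hashify("Bun Yue;CSCI;A", "Joe Go;SWEN;B+", "Bun Yue;CSCI;B");
print $demo{CSCI}{'Bun Yue'}, "\n";    # prints B
```

The inner hashes come into existence automatically the first time a major is seen, so no explicit initialization is needed.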

(2)    Write the Perl code to print %result of question (1) to generate the following output:

Major CSCI:
   Bun Yue => A
   Kenny Bee => B
   Mary Matalin => C
Major SWEN:
   Joe Go => B+
   Sadegh Davari => A
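
The output above lists majors and students in alphabetical order, which suggests sorting the keys at both levels.  A minimal sketch follows; format_majors is a hypothetical helper name, and the three-space indentation is copied from the sample output.

```perl
use strict;
use warnings;

# Sample data in the shape produced by question (1).
my %result = (
    CSCI => { "Bun Yue" => "A", "Kenny Bee" => "B", "Mary Matalin" => "C" },
    SWEN => { "Joe Go" => "B+", "Sadegh Davari" => "A" },
);

# Hypothetical helper: return the listing as one string so that it can be
# printed or compared against the expected output.
sub format_majors {
    my (%majors) = @_;
    my $out = '';
    for my $major (sort keys %majors) {
        $out .= "Major $major:\n";
        # Dereference the inner hash and walk its keys in sorted order.
        for my $student (sort keys %{ $majors{$major} }) {
            $out .= "   $student => $majors{$major}{$student}\n";
        }
    }
    return $out;
}

print format_majors(%result);
```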

(3)    Write a simple Perl program, PageDownloader.pl, for downloading a Web page together with the pages it links to.  The format of the program call is:

PageDownloader.pl url dir report

The program retrieves the page specified by url, saves it into the directory dir, and writes a report to the file report in that directory.  The program creates the subdirectory dir in the current directory and generates an error message if the directory already exists.  The program saves the page as main.html.  It also extracts all HTML links (with extensions .htm or .html) and saves the linked pages in the subdirectory as link00001.html, link00002.html, and so on.

For example, executing

>PageDownloader.pl http://sce.uhcl.edu/yue/courses/csci4230/Spring2001/h3q3test.html h3 report.txt

will read http://sce.uhcl.edu/yue/courses/csci4230/Spring2001/h3q3test.html and create the subdirectory h3 under the current directory and save the report to h3/report.txt.
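
The directory handling described above might be sketched as follows.  make_output_dir is a hypothetical helper, the wording of the error messages is an assumption, and the demonstration uses a temporary directory so that it leaves nothing behind.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: create $dir and refuse to reuse an existing one.
# Returns an empty string on success or an error message on failure.
sub make_output_dir {
    my ($dir) = @_;
    return "Error: directory '${dir}' already exists." if -e $dir;
    mkdir $dir or return "Error: cannot create directory '${dir}': $!";
    return '';
}

# Demonstration inside a scratch directory that is removed at exit.
my $base = tempdir(CLEANUP => 1);
my $dir  = File::Spec->catdir($base, 'h3');

my $first  = make_output_dir($dir);
my $second = make_output_dir($dir);
print "first call:  ", ($first  eq '' ? "created" : $first),  "\n";
print "second call: ", ($second eq '' ? "created" : $second), "\n";

# The report file would then live inside the new subdirectory.
print "report path: ", File::Spec->catfile($dir, 'report.txt'), "\n";
```

Returning a message instead of calling die keeps the helper easy to test; the real program would print the message and exit.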

The source code for http://sce.uhcl.edu/yue/courses/csci4230/Spring2001/h3q3test.html:

<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>
   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
   <meta name="Author" content="Bun Yue">
   <meta name="GENERATOR" content="Mozilla/4.72 [en] (Windows NT 5.0; U) [Netscape]">
   <title>Test Case for Homework #3 Question #3 CSCI 4230 Spring 2001</title>
</head>
<body text="#000000" bgcolor="#FFFFCC" link="#0000EE" vlink="#551A8B" alink="#FF0000">
Some links:
<ul>
<li>
<a href="h1.html">homwork #1</a></li>

<li>
<a href="h2.html">homework #2</a></li>

<li>
<a href="http://turquoise.rocks.uhcl.edu/yue/index.html">Yue's home</a></li>

<li>
<a href="http://turquoise.rocks.uhcl.edu/yue/a404.html">An 404 at turquoise</a>.</li>

<li>
<a href="http://www.yahoo.com/index.html">http://www.yahoo.com/</a>.</li>

<li>
<a href="http://sce.uhcl.edu/yue/index.html">Yue's nas home</a></li>

<li>
<a href="http://www.yahoo.com/a404.html">A yahoo's 404</a>.</li>
</ul>

</body>
</html>

Out of the seven links, five are good.  The two a404.html links are broken.  Thus the page itself is saved as main.html and the five link pages are saved as link00001.html to link00005.html.
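
Link extraction and file naming could be approached as in the sketch below.  Both helper names are hypothetical, and the regular expression is a simplification that only recognizes double-quoted href attributes, as used in the test page; a real HTML parser would be more robust.

```perl
use strict;
use warnings;

# Hypothetical helper: return the href values that end in .htm or .html.
# Simplification: only double-quoted href attributes inside <a> tags match.
sub extract_html_links {
    my ($html) = @_;
    my @links;
    while ($html =~ /<a\s[^>]*?href\s*=\s*"([^"]+)"/gi) {
        my $href = $1;
        push @links, $href if $href =~ /\.html?$/i;
    }
    return @links;
}

# Hypothetical helper: file name for the n-th successfully saved link page.
sub link_file_name {
    my ($n) = @_;
    return sprintf "link%05d.html", $n;
}

my $page = '<a href="h1.html">one</a> <a href="pic.gif">img</a> '
         . '<a href="http://www.yahoo.com/index.html">y</a>';
my @links = extract_html_links($page);
print "$_\n" for @links;
print link_file_name(1), "\n";    # prints link00001.html
```

Judging from the sample report, the counter advances only when a page is actually saved, so a broken link does not consume a file number.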

The directory h3 then contains main.html, link00001.html through link00005.html, and report.txt.

Your program should generate a report (report.txt) that is as close as possible to the one below.

This report: report.txt
URL for extraction: http://sce.uhcl.edu/yue/courses/csci4230/Spring2001/h3q3test.html
Subdirectory: h3
Extraction time: Thu Feb  8 18:33:28 2001

Main URL http://sce.uhcl.edu/yue/courses/csci4230/Spring2001/h3q3test.html is saved as: main.html.

Total links: 7
Total pages saved: 5
Total pages failed to be saved: 2

Saved links are:

http://sce.uhcl.edu/yue/courses/csci4230/Spring2001/h1.html => link00001.html.
http://sce.uhcl.edu/yue/courses/csci4230/Spring2001/h2.html => link00002.html.
http://turquoise.rocks.uhcl.edu/yue/index.html => link00003.html.
http://www.yahoo.com/index.html => link00004.html.
http://sce.uhcl.edu/yue/index.html => link00005.html.

Links that cannot be saved:

http://turquoise.rocks.uhcl.edu/yue/a404.html
http://www.yahoo.com/a404.html
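
The counts and the time stamp in the report above could be produced along these lines.  report_summary is a hypothetical helper that takes references to the lists of saved and failed URLs; the time format shown in the sample matches what Perl's localtime returns in scalar context.

```perl
use strict;
use warnings;

# Hypothetical helper: build the three "Total ..." lines of the report.
# $saved and $failed are references to arrays of URLs.
sub report_summary {
    my ($saved, $failed) = @_;
    my $n_saved  = scalar @$saved;
    my $n_failed = scalar @$failed;
    my $total    = $n_saved + $n_failed;
    return "Total links: $total\n"
         . "Total pages saved: $n_saved\n"
         . "Total pages failed to be saved: $n_failed\n";
}

my @saved  = ("http://sce.uhcl.edu/yue/index.html");
my @failed = ("http://www.yahoo.com/a404.html");

# localtime in scalar context yields a string such as the one in the sample.
print "Extraction time: ", scalar(localtime), "\n\n";
print report_summary(\@saved, \@failed);
```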

Several assumptions/requirements can be made:

(1)    It needs to handle both relative and absolute URLs, but not default pages.  For example, it does not need to handle http://sce.uhcl.edu/yue, but will need to handle http://sce.uhcl.edu/yue/index.html.

(2)    Proper error messages should be given if the main url is broken or the directory already exists.

(3)    Only link pages with the extensions .html and .htm should be checked.

(4)    Tips: refer to homework #3 of the Fall 2000 semester.
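
For assumption (1), a relative link has to be combined with the URL of the main page before it can be fetched.  The sketch below is a hand-rolled simplification under a hypothetical name, absolutize; it covers plain relative names and host-relative paths but not forms such as "../".  The URI module from CPAN offers URI->new_abs for the general case.

```perl
use strict;
use warnings;

# Hypothetical helper: turn $link into an absolute URL relative to $base.
# Simplification: "../" segments and query strings are not handled.
sub absolutize {
    my ($link, $base) = @_;
    # Already absolute: starts with a scheme such as http://
    return $link if $link =~ m{^[a-z][a-z0-9+.-]*://}i;
    if ($link =~ m{^/}) {
        # Host-relative: keep only the scheme and host of the base URL.
        my ($root) = $base =~ m{^(\w+://[^/]+)};
        return $root . $link;
    }
    # Plain relative name: replace the file name at the end of the base URL.
    (my $dir = $base) =~ s{[^/]*$}{};
    return $dir . $link;
}

my $base = "http://sce.uhcl.edu/yue/courses/csci4230/Spring2001/h3q3test.html";
print absolutize("h1.html", $base), "\n";
print absolutize("http://www.yahoo.com/index.html", $base), "\n";
```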

Turn in:

(1)    Source code
(2)    A listing of the report file using the test case above.
(3)    A disk containing the source code and the test run.
(4)    Create a top level directory "hw" under your account (not under the subdirectory pages).  Create "h3" under "hw". Put all programs into hw/h3.  The T.A. will access your account to test run your program.