Internet Publishing
"Materials used in this course are the property of the author. These lessons may be used only by course participants for self-study purposes. Application for permission to use these materials for other educational purposes such as for teaching or as a basis for teaching should be directly submitted to the author."
9. Indexing the Web - More HTML (Frames)
This lesson covers search engines on the Web. The Web is large, dynamic, and unstructured. It can therefore be difficult to find things. Luckily, we have many forms of assistance. The first thing we can do is create our own personal list, or surfboard. We do this by maintaining "bookmarks" or by keeping a list of URLs we have visited or been referred to.
But how do we find places we have never been before? Is there a telephone directory for the Web? The answer is "yes and no". More and more providers have created hierarchical outlines of the Web, or of parts of it. One way to locate something on the Web is to start at the root directory of one of the large Internet providers: EUnet, CompuServe, etc.
There are also many search engines and independent directories, such as Alta Vista and Yahoo. These are services where one can search for a word in an index, not much different from using a card catalog at the library.
Directories
Using directories is a great way to find information which lends itself to hierarchical description. However, information can be described in many different ways, and there is no one correct way to classify web pages. This is the same problem library technicians have when they classify books: some books fit into several categories, and other books don't fit into any category. The only taxonomy where one can speak of one correct solution is the classification of species, which have all developed from the first cell. The evolution of species forms a tree, in which the species belonging to a group share a common origin that no species outside the group shares. For books and web pages there is no such strict categorization. We therefore have reason to be in doubt when we begin at the root of a tree and must search the branches: which branch should we climb?
Human memory is associative, not hierarchical. Even though I count myself as a "hierarchical reductionist" (someone who believes that everything can be explained by what it is made up of), I often find it difficult to locate the desired information when using directories "manually". Fortunately, the directories are searchable, so one doesn't need to browse the entire hierarchy; one can search the whole directory as a single plane. Imagine that someone chops down the tree and indexes all the branches and leaves according to content: that would be much easier to search. The directories also behave associatively. If you know one daily newspaper on the Web, you can find it in the directory by searching for it, and beside the known newspaper you will find all the other newspapers on the Web. This associative quality would be lost if the tree were chopped up and organized alphabetically in a woodpile.
Directories (search trees) seem to grow partly because users add things to the tree, and partly because whoever maintains the tree (the gardeners) adds things they come across.
Search Engines (robots)
The other primary way to search is to go to a database in which the entire Web is indexed. Since the Web is large and dynamic, it is of course impossible for one database to have a complete overview of the whole Net. Lycos maintains that its index grows faster than the Web does, so that the index will eventually cover the entire Web. But the world is not linear: it becomes steadily more difficult to discover unknown pages as fewer uncharted pages remain to be discovered (the law of diminishing returns). In addition, the old charted pages will have changed, so they must be indexed anew.
But how does this happen? Fortunately, they have robots to do the job.
Robots have operated on the Net for some years now. They do a good deal of important work, work which couldn't be done without them. It is actually misleading to call them robots: the word "robot" gives the impression of a program which moves itself around (read more about agents). In reality, the robot stands still. It is a program which contacts WWW servers and reads all the documents it finds. A robot handles a spectrum of useful tasks:
Maintenance: A robot can search all the documents on a server and report on links that don't go anywhere (dead branches).
Statistical Analysis: The first robots were used to discover the web servers, count how many there were, and measure what kind of traffic load they had.
Mirroring: Mirroring is common in FTP and is also used on the Web. The idea is that files/archives should exist in several places around the world, so that users can retrieve them from the place closest to them. (An example of mirrored web pages is the WWW FAQ.)
Discovery: This is perhaps the most exciting use of robots. The robot snoops around, finds and indexes web trees. The user can then search the database which the robot has created.
All this is, of course, not free: it costs bandwidth (capacity on the network). A robot which doesn't behave can also overload a server by asking for many documents in rapid succession ("rapid fire"). Netscape is guilty of this when the client requests many inline images simultaneously.
We can also imagine robots implemented as part of the client. Wouldn't it be nice to be able to start a Web search from your own client, and have it search every server it found for the information you were after? These are called client robots and are, fortunately, not in use, for several obvious reasons. (Working out why is left as an exercise for the student.)
"Help! I'm being counted!"
In Norway, we are all familiar with a fairy tale about the kid (as in baby goat, not human child) who could count ("One for the calf, two for the pig, three for the cow...", written by Alf Prøysen). Most of the animals had never heard of counting, and they were a little anxious that it might be dangerous. ("Now, he's counted you too!") I experienced a little of the same angst when I discovered that one of my WWW servers was being visited by a search robot. ("Help! I'm being indexed!")
How can you find out if you have been visited by a robot? Every transaction leaves a trace in the log. Normally the log contains just IP numbers, but with a little training one can learn to recognize foreign numbers which access dusty pages no one is interested in. (Many WWW servers, for example WebSite, can be set up to do inverse lookup of IP numbers, so that the domain name is recorded in the log instead of the IP number.) That's how I discovered Scooter. I identified it with a program I have on the PC called IP resolver (a Novell Windows program); I could also have logged in on a UNIX machine and used the command nslookup (name server lookup). I found the address 204.123.2.54 in my log, and IP resolver told me that the address corresponded to scooter.pa-x.dec.com. So I looked it up and found Scooter's home page at http://scooter.pa-x.dec.com/.
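If you have access to a UNIX machine, the same lookup can be done like this (a sketch; the exact output of nslookup varies between versions):

$ nslookup 204.123.2.54
Name:    scooter.pa-x.dec.com
Address: 204.123.2.54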
Scooter insists that he is a friendly robot. He says that he is registered in a list of robots at http://info.webcrawler.com/mak/projects/robots/active.html. That page also gives advice on how to set up a file robots.txt in the root of the web server, so that robots which come to visit can find out where they may and may not go. Scooter also claims that it will not find any new web servers: if Scooter has visited you, your server is already listed on a well-known list. That thought is almost just as frightening: someone knows where I live! Scooter is used by, among others, Alta Vista, which is among the most used index databases.
Read more about robots!
If I don't have a file called robots.txt, anyone asking for such a file will get an error message. I can find these requests by looking in the error log:

grep "c:\httpd\htdocs\robots.txt" error.log > sniffers.txt

In this way, I found that I had also been visited by The Architext Spider at http://www.atext.com/.
Here is an example of a robots.txt file:
# /robots.txt for http://fweb.idb.hist.no/
User-agent: *               # attention all robots:
Disallow: /~fredrik/lest/   # private notes
Disallow: /temp/            # temporary or not original files
The star means that the following lines apply to all robots. Then follows a list of directories where the robot is not allowed to go. In this way, I can ensure that the robot doesn't waste time on worthless documents.
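A well-behaved robot asks for this file before anything else. On the wire this is just an ordinary HTTP request (the User-Agent string below is invented for the illustration):

GET /robots.txt HTTP/1.0
User-Agent: ExampleRobot/1.0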
It is difficult for the robot itself to classify the documents it finds. An alternative strategy, used by aliweb, assumes that the person running a web server creates a description of the documents found on that server, and then sends a message to an aliweb server telling it where the description is. Aliweb then fetches these descriptions and places them in a central database. Obviously, the quality of the data in this database will be much higher than that gathered by the blind robots. However, since aliweb depends on active support from the people running all the web servers, the database will never be complete.
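Such a description might look roughly like the sketch below. Aliweb builds on the IAFA index templates, so the field names follow that scheme, but the entry itself (title, path and keywords) is invented for the illustration:

Template-Type: DOCUMENT
Title:         Internet Publishing, lesson 9
URI:           /lessons/lesson9.htm
Description:   Course lesson about search engines, robots and frames
Keywords:      search engines, robots, frames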
Offer Search Capabilities on Your Server
For those using WebSite, an indexing program is included: WebIndex. With this program you can easily create an index of your files. Tools for making HTML pages that use the index also come with WebSite. This is well enough explained that you can try it on your own, but it is not part of the assignments for this lesson.
Problem 1: Compare Lycos, WebCrawler and Alta Vista

Pick a fairly narrow search topic, that is, a search with special or seldom-used words. Search in Lycos, WebCrawler and Alta Vista (or other search engines of your choice) and compare the results. How do the results square with what the search engines claim about the number of documents they have indexed? (Are they bragging?)
OBS-1! The answers here should be posted in the HyperNews forum "General discussions about Internet Publishing" with the title "Assignment lesson 9.1".
Due date: 20 May 1997
OBS-2! See also the assignment in part 2 of this lesson, which shall be turned in to your assigned teacher's assistant.
References
Robots (search programs)
Alta Vista
HotBot
WebCrawler
Searching
Lycos, Inc. Home Page
For Norwegians: Kvasir (Schibstednett)
World Wide Web Robots, Wanderers, and Spiders
Architext Spider
A Standard for Robot Exclusion
W3 Search Engines (list with many search engines)
Directories
For Norwegians: Kvasir (Schibstednett)
For Norwegians: Origo (Telenor)
For Norwegians: HOVEDMENY NORWEB
For Norwegians: MIMES BRØNN (Telenor)
Yahoo
Frames

Frames are a Netscape-specific extension of HTML, implemented in Netscape Navigator 2.0 and 3.0. Netscape has its own description of the FRAMES implementation, which I use here. Otherwise, an excellent guide to frames has been written by Charlton Rose.
Frames are not part of the HTML 3.2 standard and have therefore not become part of standardised web technology. Many began using frames as soon as they were introduced, because they give the web author some new possibilities. That is why I thought it a good idea to include them in this course. The trend at the moment, however, is away from frames, precisely because the technique has not yet been standardised.
Frames make it possible to specify several independent regions, called frames, within the same window. Each frame can be given its own URL, so it loads content independently of the other frames; it can be given a name, so it can be targeted by links in other documents; and it can resize dynamically when the user resizes the window (resizing can also be turned off).
Frames make a few new presentation techniques possible. The most important is targeted windows, a general mechanism built on named windows: a document can be directed to a named window, which can be a client window or a frame.
Frames are defined in frame documents. These are HTML documents where the BODY tags are replaced by FRAMESET tags:
<HTML>
<HEAD>
</HEAD>
<FRAMESET>
</FRAMESET>
</HTML>
The syntax for defining frames is similar to that of tables.
<FRAMESET>. .</FRAMESET>
These are the tags for defining the frames. There are two attributes, ROWS and COLS. ROWS divides the window into horizontal rows, and COLS divides it into vertical columns. The value is a comma-separated list giving the size of each row or column, either in percent, in pixels, or as * (whatever space remains).
In order to divide a window into anything more complex than simple rows or columns, we must use nested FRAMESETs. More about this under layout.
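For example, this frameset (file names invented for the illustration) makes three rows: one 100 pixels high, one taking whatever space remains, and one taking 20% of the window:

<FRAMESET ROWS="100,*,20%">
  <FRAME SRC="top.htm">     <!-- 100 pixels high -->
  <FRAME SRC="middle.htm">  <!-- whatever remains -->
  <FRAME SRC="bottom.htm">  <!-- 20% of the window -->
</FRAMESET>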
<FRAME>
This tag defines a single frame in a FRAMESET. It is not a container, so it has no end tag. The possible attributes are: SRC (the URL of the document to be shown in the frame), NAME (a name so the frame can be targeted by links), MARGINWIDTH and MARGINHEIGHT (the margins inside the frame, in pixels), SCROLLING (yes, no or auto) and NORESIZE (the user may not resize the frame).
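Put together, a frame definition can look like this (the file name is made up):

<FRAME SRC="menu.htm" NAME="menu" MARGINWIDTH="5" SCROLLING="auto" NORESIZE>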
<NOFRAMES>. .</NOFRAMES>
These tags can be used to write alternative content for clients that don't support frames; see the sketch below.
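A frame document with such a fallback might look like this (a minimal sketch with invented file names):

<FRAMESET COLS="50%,50%">
  <FRAME SRC="left.htm">
  <FRAME SRC="right.htm">
  <NOFRAMES>
  <BODY>
  <P>This page uses frames, but your browser doesn't support them.
  Try the <A HREF="noframes.htm">frameless version</A>.</P>
  </BODY>
  </NOFRAMES>
</FRAMESET>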
Below is an example of dividing a document into frames:
<FRAMESET COLS="50%,50%">
  <FRAMESET ROWS="50%,50%">
    <FRAME SRC="frame_1.htm">
    <FRAME SRC="frame_2.htm">
  </FRAMESET>
  <FRAMESET ROWS="33%,33%,33%">
    <FRAME SRC="frame_3.htm">
    <FRAME SRC="frame_4.htm">
    <FRAME SRC="frame_5.htm">
  </FRAMESET>
</FRAMESET>

+-----------------------------+-----------------------------+
|                             |                             |
|                             |           frame 3           |
|                             |                             |
|           frame 1           +-----------------------------+
|                             |                             |
|                             |           frame 4           |
+-----------------------------+                             |
|                             +-----------------------------+
|                             |                             |
|           frame 2           |           frame 5           |
|                             |                             |
+-----------------------------+-----------------------------+
With regular HTML links, a new document pops up in the same frame where the link was clicked. (OBS! The back button in the client does not work at the frame level; it works on the main window.) A new attribute for steering documents to named frames has therefore been introduced:
TARGET = "frame_name"
where frame_name is the name given in a frame definition. This attribute can be used in several HTML tags, among them A, BASE, AREA and FORM.
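For example, a link in a table-of-contents frame can send its document to a content frame named "main" (the frame and file names are invented):

<!-- in the frame document: -->
<FRAME SRC="toc.htm" NAME="toc">
<FRAME SRC="lesson1.htm" NAME="main">

<!-- in toc.htm: -->
<A HREF="lesson2.htm" TARGET="main">Lesson 2</A>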
Problem 2: Create a Page with Frames

Create a page divided into three frames. The top 10% shall be a frame containing a heading. The area underneath shall be divided into two columns, where the left column contains a table of contents with links to the first 8 lessons in Internet Publishing, and the right column displays the lesson pointed to in the table of contents. At start-up, Lesson 1 shall appear in the right-hand column.
The layout should look roughly like the sketch below:
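+----------------------------------------------------------+
|                      heading (10%)                       |
+----------------+-----------------------------------------+
| Lesson 1       |                                         |
| Lesson 2       |                                         |
| ...            |          the selected lesson            |
| Lesson 8       |         (Lesson 1 at start-up)          |
|                |                                         |
+----------------+-----------------------------------------+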
Due date: 20 May 1997.
The following should be sent to your assigned teacher's assistant: the HTML code for the page definition, for the main frame, and for the frame with the table of contents.