WEBTOOLBOX
Administrator and User Guide


1. INTRODUCTION

1.1 Overview

WebToolbox is a powerful tool for those who manage web sites/pages. It provides five functions, each described below.

The Edit Strings function edits a set of web documents, replacing all instances of string X by string Y. It can operate on:

The Edit Tags function edits a set of web documents, changing all tags to upper/lower case. It can operate on:

The Check Links function lists and/or checks all hyperlinks in a set of web documents. It can operate on:

The Check Metatags function lists all metatags in a set of web documents. It can operate on:

The Submit URL function submits a URL to a set of search engines, and may be used to promote a website.
Details of search engines are contained in a data file, referred to as the Search Engine Database. See Section 3.7.4.2 for details.

Note: a "web document" is a file written in a web markup language (such as HTML, SHTML, XML). These are plain text languages employing plain text tags. In fact, some of the functions may be used to operate on plain text files which contain other markup languages (eg SGML) or plain text files which do not contain markup language.

1.2 Contents

    1. INTRODUCTION
       1.1 Overview
       1.2 Contents
       1.3 Document References
       1.4 Document Control
       1.5 Terms and Terminology
       1.6 Copyright
    2. WEBTOOLBOX: INTRODUCTION AND OVERVIEW
    3. WEBTOOLBOX: USER GUIDE
       3.1 Introduction
       3.2 Invocation and Arguments
       3.3 Action: Edit Strings
       3.4 Action: Edit Tags
       3.5 Action: Check Links
       3.6 Action: Check Metatags
       3.7 Action: Submit URL
    4. WEBTOOLBOX: ADMINISTRATOR GUIDE
       4.1 Introduction
       4.2 Product Distribution & Contents
       4.3 Installing and Configuring WebToolbox
       4.4 The Perl Script
       4.5 Restrictions and Deficiencies
       4.6 Test Status
       4.7 Errors and Diagnostics

1.3 Document References

  [1] Learning Perl (2nd Edition), Schwartz & Christiansen, O'Reilly & Associates, Inc.
  [2] Perl Cookbook, Christiansen & Torkington, O'Reilly & Associates, Inc.

1.4 Document Control

Document version: 1.1
Date: 3 August 2000
Script version (to which this document relates): 1.3
Document changes:

1.5 Terms and Terminology

1.6 Copyright

The following copyright applies to this document and associated script.
Copyright (c) 1999,2000 Beaumont Systems Ltd. All rights reserved.

The script may or may not be provided in source code form. This source code is the intellectual property of Beaumont Systems Ltd., and, as such, is subject to copyright and legal protection. It may not be copied or redistributed, in whole or in part, without the express permission of Beaumont Systems Ltd.


2. WEBTOOLBOX: INTRODUCTION AND OVERVIEW


3. WEBTOOLBOX: USER GUIDE

3.1 Introduction

3.2 Invocation and Arguments

WebToolbox is invoked from the command line as follows:
  webtool.pl [-e] [-xd xxx] [-xf xxx] [-xt]
  webtool.pl [-t] [-tl] [-ta] [-xd xxx] [-xf xxx]
  webtool.pl [-l] [-lo xxx] [-lg] [-lt ttt] [-xt] [-xr]
  webtool.pl [-m] [-xf xxx] [-xt]
  webtool.pl [-s] [-sd xxx] [-xf xxx] [-xr]
The first argument (-e,-t,-l,-m,-s) specifies the basic action, and one of these must be selected. There is no default action.
Subsequent arguments qualify this selection and are used as necessary.

The arguments are as follows...

Argument   Used with  Use                                                     Default
-e         -          Action: Edit Strings (edit set of web documents to      -
                      replace strings).
-t         -          Action: Edit Tags (edit set of web documents to change  -
                      tags to upper/lower case).
-l         -          Action: Check Links (in set of web documents).          -
-m         -          Action: Check Metatags (in set of web documents).       -
-s         -          Action: Submit URL (to set of search engines).          -
-tl        -t         Specifies the use of lower case.                        Upper case
-ta        -t         Specifies that attributes (types and values) are to be  Tag types only
                      converted also.
-lo xxx    -l         Specifies the extent of checking, where xxx is one of:  Check all links
                        0 - list links (without verification)
                        1 - check some links (local files only)
                        2 - check all links
-lg        -l         Report good links (as well as bad).                     Report bad links only
-lt ttt    -l         Set timeout value = ttt seconds.                        20
-sd xxx    -s         Specifies the filename of the search engine database.   se.dat
-xd xxx    -          Specifies the root directory.                           -
-xf xxx    -          Specifies the input filename.                           -
-xt        -e,-l,-m   Operation on ALL text files.                            Some text files only (web documents)
-xr        -          Restart log file (previous contents discarded).         Append
-v         -          Specifies verbose mode.                                 -
-h         -          Requests help.                                          -
Table 2-1 : Program Arguments

Use WebToolbox with no arguments or webtool -h to obtain help information.
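
The argument set listed in Table 2-1 can be modelled with a standard option parser. The following Python sketch is an illustration only: WebToolbox itself is a Perl script and its parsing code is not reproduced here. The function name build_parser and the help strings are assumptions made for this example.

```python
import argparse


def build_parser():
    """Build a parser mirroring the arguments listed in Table 2-1 (illustrative)."""
    parser = argparse.ArgumentParser(prog="webtool")

    # Exactly one basic action must be selected; there is no default action.
    action = parser.add_mutually_exclusive_group(required=True)
    action.add_argument("-e", action="store_true", help="Edit Strings")
    action.add_argument("-t", action="store_true", help="Edit Tags")
    action.add_argument("-l", action="store_true", help="Check Links")
    action.add_argument("-m", action="store_true", help="Check Metatags")
    action.add_argument("-s", action="store_true", help="Submit URL")

    # Qualifying arguments, with the defaults given in the table.
    parser.add_argument("-tl", action="store_true", help="use lower case")
    parser.add_argument("-ta", action="store_true", help="convert attributes also")
    parser.add_argument("-lo", type=int, choices=[0, 1, 2], default=2)
    parser.add_argument("-lg", action="store_true", help="report good links")
    parser.add_argument("-lt", type=int, default=20, help="timeout in seconds")
    parser.add_argument("-sd", default="se.dat", help="search engine database")
    parser.add_argument("-xd", help="root directory")
    parser.add_argument("-xf", help="input filename")
    parser.add_argument("-xt", action="store_true", help="operate on ALL text files")
    parser.add_argument("-xr", action="store_true", help="restart log file")
    parser.add_argument("-v", action="store_true", help="verbose mode")
    return parser


# Example: check local links only, reporting good links as well as bad.
args = build_parser().parse_args(["-l", "-lo", "1", "-lg"])
print(args.lo, args.lg, args.lt)
```

In this sketch the five actions form a mutually exclusive, required group, which reflects the rule that one action must be selected and that there is no default.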

Upon invocation, WebToolbox prompts the user for further parameters, depending on the action selected. See the following subsections for further information.

3.3 Action: Edit Strings

3.3.1 General

When invoked with the -e argument, WebToolbox edits a set of web documents, replacing all instances of string X by string Y.
The function can operate on:

This function is particularly useful for maintaining links in a collection of web documents (HTML documents).
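
The effect of Edit Strings can be pictured as a search-and-replace over every web document beneath a directory. The following Python sketch is an illustration under stated assumptions, not the WebToolbox implementation (which is a Perl script): the function name edit_strings and the list of file extensions treated as web documents are hypothetical.

```python
import os

# Assumption: files with these extensions are treated as web documents.
WEB_EXTENSIONS = (".html", ".htm", ".shtml", ".xml")


def edit_strings(root, old, new):
    """Replace every occurrence of `old` by `new` in web documents under `root`.

    Returns the list of files that were changed.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(WEB_EXTENSIONS):
                continue  # not a web document, leave it alone
            path = os.path.join(dirpath, name)
            # newline="" preserves the file's original line endings.
            with open(path, encoding="utf-8", newline="") as handle:
                text = handle.read()
            if old in text:
                with open(path, "w", encoding="utf-8", newline="") as handle:
                    handle.write(text.replace(old, new))
                changed.append(path)
    return changed
```

A typical use would be to repair links after a page is renamed, for example edit_strings("site", "old.html", "new.html").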

3.3.2 Inputs

On startup, WebToolbox issues some/all of the following prompts:
  Directory or filename ? :
  String (to be found/replaced) ? :
  String (new) ? :
A directory or filename can be specified via the command-line arguments -xd and -xf respectively. The first prompt (Directory or filename ? :) appears when neither -xd nor -xf is used, or when one is used but the name given is found not to be a valid directory or filename.

3.3.3 Outputs

3.4 Action: Edit Tags

3.4.1 General

When invoked with the -t argument, WebToolbox edits a set of web documents, changing all tags to upper/lower case.
The function can operate on:

The function converts normal tags, and within these, the tag types and (optionally, using -ta) attribute types and values. It does not convert "special" tags, these being: <!...> (comment/script code block), <%...>.
The following table gives some examples of conversions...

Before                              After                               Arguments
<center>                            <CENTER>                            (none*)
<hr noshade>                        <HR noshade>                        (none)
                                    <HR NOSHADE>                        -ta
<body bgcolor=white text="black">   <BODY bgcolor=white text="black">   (none)
                                    <BODY BGCOLOR=WHITE TEXT="BLACK">   -ta
<h1>WEBTOOLBOX<br>Administrator and User Guide</h1>
                                    <H1>WEBTOOLBOX<BR>Administrator and User Guide</H1>   (none*)

* Note: -ta not relevant here, because there are no tag attributes present.
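
The conversion rules can be sketched with a regular expression that changes the case of tag types (and, optionally, attributes) while leaving special tags untouched. This Python example is a simplified illustration, not the WebToolbox code (which is a Perl script); the function name convert_tags and its parameters are assumptions, with lower standing in for -tl and attributes standing in for -ta.

```python
import re

# Special tags are matched first so that they can be copied through unchanged.
TAG = re.compile(
    r"<!--.*?-->"                              # comments
    r"|<![^>]*>"                               # other <!...> blocks
    r"|<%.*?%>"                                # <%...%> blocks
    r"|<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>",  # normal tags
    re.DOTALL,
)


def convert_tags(text, lower=False, attributes=False):
    """Convert tag types (and optionally attributes) to upper or lower case."""
    change = str.lower if lower else str.upper

    def replace(match):
        slash, name, rest = match.groups()
        if name is None:
            return match.group(0)  # special tag: leave as it is
        if attributes:
            # Attribute types AND values are converted, as the guide describes.
            rest = change(rest)
        return "<" + slash + change(name) + rest + ">"

    return TAG.sub(replace, text)


print(convert_tags("<hr noshade>"))
print(convert_tags("<hr noshade>", attributes=True))
```

Note that converting attribute values also alters values such as URLs, so in this sketch the attributes option is best reserved for documents whose attribute values are not case-sensitive.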

3.4.2 Inputs

On startup, WebToolbox issues some/all of the following prompts:
  Directory or filename ? :
A directory or filename can be specified via the command-line arguments -xd and -xf respectively. The first prompt (Directory or filename ? :) appears when neither -xd nor -xf is used, or when one is used but the name given is found not to be a valid directory or filename.

3.4.3 Outputs

3.5 Action: Check Links

3.5.1 General

When invoked with the -l argument, WebToolbox checks links (hyperlinks) in a set of web documents.

There are a number of other arguments that may be used in conjunction with -l. See Table 2-1 for a summary.
Use -lo to specify/restrict the extent of checking, ie to list links (without verification), check some links only (local files) or check all links.
Use -lg to report good links (as well as bad).
Use -lt ttt to alter the timeout value.
Use -xr to restart the log file.

3.5.2 Inputs

On startup, it issues the following prompt:
  URL/local directory/local filename ? :
The response may take one of three forms: a URL, a local directory or a local filename.

3.5.3 Outputs

WebToolbox directs its output to a log file, webtool.log.

The output takes the following form when listing links (using -lo 0) in a local file...

  --------------------
  WEBTOOLBOX started. Wed Sep 15 23:38:56 1999.
  Filename: /usr/arf/web/html/andyf/aflinks.html, # links: 46
    Link: tagtype=a, name=href, value=http://cm.bell-labs.com/cm/cs/who/dmr/st.html.
    Link: tagtype=a, name=href, value=http://playground.sun.com/pub/ipng/html/INET-IPng-Paper.html.
    Link: tagtype=a, name=href, value=http://www.nexor.co.uk/public/rfc/index/rfc.html.
    Link: tagtype=a, name=href, value=http://www.cs.utah.edu/csinfo/texinfo/gnats/gnats.html.
    Link: tagtype=a, name=href, value=http://www.hwg.org/resources/html/index.html.
    Link: tagtype=a, name=href, value=http://WWW.Stars.com/Authoring/HTML/. (Full URL: http://www.stars.com/Authoring/HTML/).
    Link: tagtype=a, name=href, value=http://www.gamelan.com/.
    Link: tagtype=a, name=href, value=http://www.cgi-resources.com/.
    Link: tagtype=a, name=href, value=http://www.perl.com/.
    Link: tagtype=a, name=href, value=http://www.iconbazaar.com/.
    Link: tagtype=a, name=href, value=http://www.clipart.co.uk/.
    Link: tagtype=a, name=href, value=http://www.econ.cbs.dk/~gemal/urlheaven/index.html.
    Link: tagtype=a, name=href, value=http://www.data.com/.
    Link: tagtype=a, name=href, value=http://www.byte.com/.
The output takes the following form when checking links in a remote web page (HTML document)...
  --------------------
  WEBTOOLBOX started. Thu Sep 16 11:11:57 1999.
  URL: http://members.netscapeonline.co.uk/beaumontsystems. URL OK, base=http://members.netscapeonline.co.uk/beaumontsystems/, contenttype=text/html, # links: 7
    Link: tagtype=a, name=href, value=http://members.netscapeonline.co.uk/beaumontsystems/products/uwipnifss.html.
      ERROR: Problem accessing http://members.netscapeonline.co.uk/beaumontsystems/products/uwipnifss.html, response code=404, message=Not Found
    Link: tagtype=a, name=href, value=http://members.netscapeonline.co.uk/beaumontsystems/products/netdbov.html.
      ERROR: Problem accessing http://members.netscapeonline.co.uk/beaumontsystems/products/netdbov.html, response code=404, message=Not Found
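
Listing links (as with -lo 0) amounts to extracting link-bearing attributes from each tag and reporting the tag type, attribute name and value. The following Python sketch illustrates the idea using the standard html.parser module; it is not the WebToolbox implementation (a Perl script), and the set of tag/attribute pairs treated as links, along with the names list_links and format_links, are assumptions.

```python
from html.parser import HTMLParser

# Assumption: these (tag type, attribute name) pairs are treated as links.
LINK_ATTRIBUTES = {("a", "href"), ("img", "src"), ("link", "href"), ("frame", "src")}


class LinkLister(HTMLParser):
    """Collect (tagtype, name, value) triples for every link found."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser supplies tag and attribute names in lower case.
        for name, value in attrs:
            if (tag, name) in LINK_ATTRIBUTES and value:
                self.links.append((tag, name, value))


def list_links(html_text):
    """Return all links in `html_text`, in document order."""
    lister = LinkLister()
    lister.feed(html_text)
    lister.close()
    return lister.links


def format_links(filename, links):
    """Render links in a layout similar to the log excerpt above."""
    lines = [f"Filename: {filename}, # links: {len(links)}"]
    lines += [f"  Link: tagtype={t}, name={n}, value={v}." for t, n, v in links]
    return lines
```

For example, format_links("aflinks.html", list_links(text)) produces one header line followed by one line per link.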

3.5.4 Notes

It is important to note that this action can take a significant amount of time. It depends on the number of files to be checked, the total number of links to be checked and the amount of time required to check each link.
The checking of a link involves a finite amount of time. In particular, it involves sending a request to the web server and awaiting a response. WebToolbox specifies a timeout period with the request to ensure that the request is terminated if a response is not received from the remote server within N seconds. At best, the server may respond immediately. At worst, the server may take some time to respond (because it is busy), or it may not respond at all (because it is extremely busy, currently 'down' or no longer exists).
WebToolbox assumes a default timeout value of 20 seconds. The timeout value is important. If it is too short, then WebToolbox may abandon a check before the remote server has had a chance to respond, and it may report an error when there isn't one. On the other hand, if the timeout value is too long, then WebToolbox may waste time awaiting a response that will never come.
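
The role of the timeout can be illustrated with a short Python sketch that checks a single link. This is an assumption-laden illustration rather than the WebToolbox code (which is a Perl script): the function name check_link and the wording of the returned messages are hypothetical, and only the 20-second default is taken from this guide.

```python
import urllib.error
import urllib.request


def check_link(url, timeout=20):
    """Try to open `url`, giving up after `timeout` seconds.

    Returns (True, "OK") on success, otherwise (False, description).
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True, "OK"
    except urllib.error.HTTPError as err:
        # The server responded, but with an error status such as 404.
        return False, f"response code={err.code}, message={err.reason}"
    except OSError as err:
        # No usable response: timeout, refused connection, unknown host, etc.
        return False, f"no response: {err}"
```

With this structure, a slow but healthy server is reported as bad only if it takes longer than the timeout to respond, which is the trade-off described above.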

If you anticipate a large number of links, you may prefer to run WebToolbox during a less busy period (perhaps overnight).

3.6 Action: Check Metatags

3.6.1 General

When invoked with the -m argument, WebToolbox checks metatags in a set of web documents.
It lists all metatags that it finds, with details (tag name, tag value) of each. Optionally, it also lists web documents that do not contain any metatags.

HTML meta tags contain meta-information about the page itself. This information is not displayed normally (when a person views the web page with a browser), but applications (and this includes search engines) can, and do, make use of it.
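
As an illustration of how metatags might be collected, the following Python sketch extracts the name and content of each meta tag using the standard html.parser module. It is not the WebToolbox implementation (a Perl script); the names MetatagLister and list_metatags are assumptions, as is the choice to consider only meta tags that carry a name attribute.

```python
from html.parser import HTMLParser


class MetatagLister(HTMLParser):
    """Collect (tag name, tag value) pairs from <meta name=... content=...>."""

    def __init__(self):
        super().__init__()
        self.metatags = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        fields = dict(attrs)
        name = fields.get("name")
        if name:  # assumption: meta tags without a name are ignored
            self.metatags.append((name, fields.get("content") or ""))


def list_metatags(html_text):
    """Return all named metatags in `html_text`, in document order."""
    lister = MetatagLister()
    lister.feed(html_text)
    lister.close()
    return lister.metatags


page = '<head><meta name="keywords" content="XFILE, file utility"></head>'
for tag_name, tag_value in list_metatags(page):
    print(f"[TagName] {tag_name} [TagValue] {tag_value}")
```

A document for which list_metatags returns an empty list is one that contains no metatags.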

3.6.2 Inputs

3.6.3 Outputs

The output takes the following form...
--------------------
  WEBTOOLBOX started. Thu Nov 25 14:13:24 1999.
  Action: 3.
  Filename: xfileov.html
    [TagName] copyright [TagValue] (c) 1999 Beaumont Systems Ltd.
    [TagName] description [TagValue] XFILE, a utility for managing files
    [TagName] keywords [TagValue] XFILE, file utility, link verification, Perl
  Filename: netdbov.html
    [TagName] copyright [TagValue] (c) 1999 Beaumont Systems Ltd.
    [TagName] keywords [TagValue] NETDB, network publication, network application, CGI
  Filename: wtbov.html
    [TagName] copyright [TagValue] (c) 1999 Beaumont Systems Ltd.
    [TagName] description [TagValue] WEBTOOLBOX, a utility for website management
    [TagName] keywords [TagValue] WEBTOOLBOX, website management, link verification, URL submission, website promotion, search engine, Perl
  Filename: wtb.html
    [TagName] copyright [TagValue] (c) 1999 Beaumont Systems Ltd.
  Filename: xfile.html
    [TagName] copyright [TagValue] (c) 1999 Beaumont Systems Ltd.
  Filename: wtbov_g.html
    [TagName] copyright [TagValue] (c) 1999 Beaumont Systems Ltd.
    [TagName] description [TagValue] WEBTOOLBOX, a utility for website management
    [TagName] keywords [TagValue] WEBTOOLBOX, website management, link verification, URL submission, website promotion, search engine, Perl
  Filename: wtbov_f.html
    [TagName] copyright [TagValue] (c) 1999 Beaumont Systems Ltd.
    [TagName] description [TagValue] WEBTOOLBOX, a utility for website management
    [TagName] keywords [TagValue] WEBTOOLBOX, website management, link verification, URL submission, website promotion, search engine, Perl

3.6.4 Notes

3.7 Action: Submit URL

3.7.1 General

When invoked with the -s argument, WebToolbox submits a URL to a set of search engines.

There are a number of other arguments that may be used in conjunction with -s. See Table 2-1 for a summary.
Use -sd xxx to specify a search engine database other than the default one (se.dat).
Use -xr to restart the log file.
Use -xf xxx to provide the submission parameters in an input file (rather than interactively).

Search engines: Information relating to the search engines is contained in a separate data file, referred to as the Search Engine Database. See Section 3.7.4.2.

Submission parameters: The search engines require various parameters relating to submission. See Section 3.7.4.1. There are two ways to provide these parameters: interactively (the default method, not using -xf xxx) or via an input file (using -xf xxx).

3.7.2 Inputs

WebToolbox requires two sets of inputs:

On startup (without the -xf xxx argument), it issues the following prompts:
  URL to be submitted ? :
  E-mail to be submitted ? :
  Name to be submitted ? :
  Title to be submitted ? :
  Description to be submitted ? :
  Keywords to be submitted ? :
If invoked with the -xf xxx argument, it reads the specified input file in order to obtain the parameters.

3.7.3 Outputs

WebToolbox generates output in the form of log messages. These are directed to STDOUT (if -v is used) and a log file, webtool.log.

The output takes the following form...

--------------------
  WEBTOOLBOX started. Sun Nov 14 09:39:22 1999.
  Action: 3.
  Number of search engines: 40/40/0.
  URL: http://www.qwerty.com
  Submitting URL to search engine (Altavista)...successful.
  Submitting URL to search engine (Excite)...successful.
  Submitting URL to search engine (Hotbot)...successful.
  Submitting URL to search engine (Infoseek)...successful.
  Submitting URL to search engine (Lycos)...successful.
  Submitting URL to search engine (Webcrawler)...successful.
  Submitting URL to search engine (Search United Kingdom)...successful.
  Submitting URL to search engine (Excite United Kingdom)...successful.
  Submitting URL to search engine (Lycos United Kingdom)...successful.
  Submitting URL to search engine (Cyber Britain United Kingdom)...successful.
  Submitting URL to search engine (Acoon Germany)...successful.
  Submitting URL to search engine (Altavista Germany)...successful.
  Submitting URL to search engine (Blitzsuche Germany)...successful.
  Submitting URL to search engine (Infoseek Germany)...successful.
  Submitting URL to search engine (Lotse Germany)...successful.
  Skipping search engine (Rex Germany), requires description.
  Submitting URL to search engine (Spider Germany)...successful.
  Submitting URL to search engine (Voila France)...successful.
  Submitting URL to search engine (Info Tiger)...successful.
  Submitting URL to search engine (Aeiwi)...successful.
  Submitting URL to search engine (Anzwers)...successful.
  Submitting URL to search engine (Canada)...FAILED!
    (Response Code=404, Message=Object Not Found)
  Submitting URL to search engine (Claymont)...successful.
  Submitting URL to search engine (Crawler Germany)...successful.
  Submitting URL to search engine (E-Special Germany)...successful.
  Submitting URL to search engine (Euro Ferret)...successful.
  Submitting URL to search engine (Excite Australia)...successful.
  Submitting URL to search engine (Funky Cat)...successful.
  Submitting URL to search engine (Google)...successful.
  Submitting URL to search engine (ICQ-It)...successful.
  Submitting URL to search engine (Infomak)...successful.
  Submitting URL to search engine (Magellan)...successful.
  Submitting URL to search engine (Northern Light)...successful.
  Skipping search engine (REX), requires description.
  Submitting URL to search engine (Sear)...successful.
  Submitting URL to search engine (Surf Gopher)...successful.
  Submitting URL to search engine (UK MAX Search)...successful.
  Submitting URL to search engine (Voila)...successful.
  Submitting URL to search engine (What-U-Seek)...successful.
  Submitting URL to search engine (World Search Engine)...FAILED!
    (Response Code=500, Message=read timeout,  chunk 6.)
  Number of SE entries in database: 40
  Number of valid entries: 40
  Number of valid entries with missing parameters: 2
  Number of invalid entries: 0
  Number of SEs contacted: 38
  Number of SEs skipped: 2
  Number of successful submissions: 36
  Number of unsuccessful submissions: 2
The above example shows: (mostly) successful submissions; some failed submissions; some search engines being skipped because of insufficient parameters; at the end, a summary (statistics).

3.7.4 Notes

3.7.4.1 Search Engines
There are a lot of search engines in existence, some more well known than others. There is also considerable variation in the way that they handle URL submissions. Some use HTTP GET, others use HTTP POST. They vary in the arguments they require/accept and the names associated with these arguments. For instance: all require a URL; many require an e-mail address, others don't; some take arguments that are specific to them.
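
The variation described above (GET versus POST, differing argument names, differing required arguments) can be sketched as follows. This Python example is purely illustrative: the engine record layout, the field names and the example URLs are hypothetical and do not represent the format of se.dat or the WebToolbox code (a Perl script). The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_submission(engine, params):
    """Return a Request for `engine`, or None if a required parameter is missing.

    `engine` is a hypothetical record with keys: method ("GET" or "POST"),
    submit_url, fields (our parameter name -> the engine's argument name)
    and required (set of our parameter names the engine insists on).
    """
    fields = {}
    for param, field_name in engine["fields"].items():
        value = params.get(param)
        if not value:
            if param in engine["required"]:
                return None  # caller would log "Skipping search engine ..."
            continue
        fields[field_name] = value
    query = urlencode(fields)
    if engine["method"] == "GET":
        return Request(engine["submit_url"] + "?" + query)
    # Supplying a body makes urllib issue a POST.
    return Request(engine["submit_url"], data=query.encode("ascii"))


# Hypothetical engine entry and parameters, for illustration only.
engine = {
    "method": "POST",
    "submit_url": "http://search.example/addurl",
    "fields": {"url": "u", "email": "e", "description": "d"},
    "required": {"url", "description"},
}
request = build_submission(engine, {"url": "http://www.qwerty.com"})
print(request)  # no description was supplied, so this engine is skipped
```

Returning None for a missing required parameter corresponds to the case where an engine has to be skipped because the user did not supply everything it needs.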
3.7.4.2 The Search Engine Database
The product comes with a separate data file (se.dat) which contains details of search engines. This file is referred to as the Search Engine Database. Currently, it contains more than 40 entries and includes all the major search engines (eg AltaVista, Excite, Lycos, etc) ...and many others.
In essence, it contains a list of search engines, with one entry/line per search engine, and each entry contains information on a particular search engine - information such as search engine name, URLs of home and submission form pages, information relating to how the search engine handles submissions, etc.
The file is a plain text file. This facilitates maintenance, so that it is possible to add details of new search engines, modify existing entries (if a search engine changes the way it handles submissions) and/or disable certain entries (eg country-specific search engines).

The -sd xxx argument makes it possible to create and use your own search engine database. You might want to do this to: define a specific subset of search engines (eg the main ones, or country-specific ones); define a single entry, for test purposes.

3.7.4.3 When a submission comes into effect
If a search engine receives a submission and the parameters appear acceptable, it returns an 'OK' response, acknowledging the submission. This response does not indicate or imply that the submission has been put into immediate effect, nor that it will be put into effect in the immediate future.
There is usually a significant delay (often several days or more) between submitting a URL to a search engine and the submission coming into effect (ie a search showing the URL in a list of findings). In the intervening period, the search engine checks and indexes the URL.
3.7.4.4 The effectiveness of submission
If you want to increase the chances of someone finding your website via search engines, there are two main things you can do:

It should be noted that these actions increase the chances of success, but they do not guarantee it.
3.7.4.5 Using a file to provide submission parameters
If you want to provide submission parameters in a file, rather than interactively, then the file must contain lines of the following form:
  [URL]xxx
  [EMAIL]xxx
  [NAME]xxx
  [TITLE]xxx
  [DESCRIPTION]xxx
  [KEYWORDS]xxx
In all cases, 'xxx' is an appropriate value. The URL parameter is mandatory, the e-mail address is optional but recommended and other parameters are optional. However, you should note that the more parameters you provide, the more search engines that webtool will be able to contact.
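
A file in this form is straightforward to read. The following Python sketch shows one way of doing so; it is an illustration, not the WebToolbox code (which is a Perl script). The function name parse_submission_params is an assumption, as is the decision to ignore lines that do not match the form above.

```python
import re

# One parameter per line: a bracketed keyword followed directly by its value.
PARAM_LINE = re.compile(r"^\[(URL|EMAIL|NAME|TITLE|DESCRIPTION|KEYWORDS)\](.*)$")


def parse_submission_params(lines):
    """Return a dict of submission parameters read from `lines`.

    Raises ValueError if the mandatory URL parameter is absent or empty.
    """
    params = {}
    for line in lines:
        match = PARAM_LINE.match(line.strip())
        if match:  # assumption: lines of any other form are ignored
            params[match.group(1)] = match.group(2).strip()
    if not params.get("URL"):
        raise ValueError("the [URL] parameter is mandatory")
    return params


example = ["[URL]http://www.qwerty.com", "[EMAIL]webmaster@qwerty.com"]
print(parse_submission_params(example))
```

The optional parameters are simply absent from the result when they are not supplied, which allows a caller to decide which search engines can still be contacted.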


4. WEBTOOLBOX: ADMINISTRATOR GUIDE

4.1 Introduction

This is the WEBTOOLBOX Administrator Guide.

4.2 Product Distribution & Contents

4.2.1 Distribution Form

WebToolbox is distributed in the form of a single archive file. (An archive file is one that contains a set of other files, and these other files must be extracted from the archive before they can be used).
It is available in one of two archive formats:

(Note, however, that some Win32 zip utilities (such as WinZip) can handle .tar files, and there are some Unix utilities that can handle .zip files).

4.2.2 Distribution Contents

The distribution consists of...

4.2.3 File List

Filename   Type           Use                                           Notes
wtb.pl     application    -                                             (Executable) Perl script
wtb.exe    application    -                                             Win32 executable
se.dat     data           Search engine database.                       Maintain as necessary.
                          Contains information on all search engines.
wtb.html   documentation  Administrator and User Guide (this document)  -
Table 3-1 : File List

4.3 Installing and Configuring WebToolbox

4.3.1 Prerequisites

The prerequisites are...

Perl. Perl packages are available for Unix, Win32 (Win95/98/NT) and other platforms. If the system on which you are intending to install WebToolbox does not have a Perl package already installed, then you can obtain a Perl package in a number of ways...

For more information on packages/libraries/modules, see document reference [2]. It includes a chapter on packages/libraries/modules, and a section on building and installing CPAN modules.

4.4 The Perl Script

4.4.1 Organisation

The webtool Perl script is organised as follows:

4.4.2 Perl Features

WebToolbox employs the following Perl features and techniques:

In addition, the script employs tips, techniques and optimizations obtained from the following sources:

4.5 Restrictions and Deficiencies

  1. Perl version: webtool was written using Perl 5 (Perl Release 5, 5.000 onwards).
    It has not been tested on earlier (ie pre- Release 5) Perl systems, and so may or may not work on such systems.

4.6 Test Status

WebToolbox has been tested with and without the -w and -T flags.

WebToolbox has been tested on the following platforms:

4.7 Errors and Diagnostics

