
6.7. The targeted web spider - Operational issues

The targeted web spider used for Project DIAL is the interaction of three programs - Gromit (the robot or spider program), Wallace (the harness program that controls Gromit) and the Feathers web indexing software (in which the sites to be made searchable are identified). Some technical details of their operation are contained in the Appendix.

In summary, the editors of the Internet indexes (such as DIAL Index) identify web sites with high value legal content and, when indexing them, also issue an instruction that the web spider is to download the content of the site for indexing, specifying the page or pages at which the spider is to start. The harness program reads the list of instructions from the web indexing software, and then dispatches multiple instances of the web spider program, each to download the content of a particular web site. The harness ensures that only one instance of the web spider is ever downloading from a particular site, so as not to saturate that site with spider requests and deny access to other users. In this way the harness keeps the web spider `well behaved', causing minimum impact on the sites from which it downloads web pages.
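The harness-and-spider arrangement described above can be sketched as follows. This is a minimal illustration, not the actual Wallace or Gromit code: the instruction list, the `crawl_site` function and the `fetch` callback are all hypothetical, and link extraction is omitted. The key property shown is that each site is crawled by exactly one spider instance, while different sites are crawled concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

# Instructions as the harness might read them from the web indexing
# software: one start URL per site to be made searchable.
INSTRUCTIONS = [
    "http://actag.canberra.edu.au/actag/",
    "http://www.example.org/laws/",
]

def crawl_site(start_url, fetch):
    """One spider instance: fetch pages from a single site serially,
    so that the site never sees more than one request at a time.
    (Link extraction is omitted in this sketch.)"""
    visited = []
    queue = [start_url]
    while queue:
        url = queue.pop(0)
        if url in visited:
            continue
        visited.append(url)
        fetch(url)  # download the page for later indexing
    return visited

def run_harness(instructions, fetch):
    """Launch one spider instance per site; different sites are
    crawled concurrently, but each site by a single instance only."""
    with ThreadPoolExecutor(max_workers=len(instructions)) as pool:
        futures = {url: pool.submit(crawl_site, url, fetch)
                   for url in instructions}
        return {url: f.result() for url, f in futures.items()}
```

Serialising requests within a site while parallelising across sites is what keeps the spider `well behaved' without making the overall download unduly slow.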

Once copies of the pages from the remote site are downloaded, every word on those pages is added to the word-occurrence index by the SINO Search Engine, and can then be searched with all the normal SINO search functions using search interfaces such as DIAL Search. When a user searches the downloaded pages and obtains a set of search results, the documents to which the user is taken are those on the original host, not those cached on the Project DIAL server. The cached downloaded documents are not available for browsing via the Project DIAL server.
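The indexing arrangement described above can be illustrated with a toy word-occurrence index. This is a sketch of the general technique only, not the SINO Search Engine itself; the function names and sample pages are invented. The point it demonstrates is that the index maps each word to the *original* URLs of the pages containing it, so that search results lead users to the source host rather than to the cached copies.

```python
def build_index(pages):
    """Build a word-occurrence index from downloaded pages.
    `pages` maps original URL -> cached page text; the index maps
    each word to the set of ORIGINAL URLs whose pages contain it."""
    index = {}
    for url, text in pages.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index

def search(index, word):
    """Return the original URLs of pages containing `word`,
    sending the user to the source site, not the local cache."""
    return sorted(index.get(word.lower(), set()))
```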

We call this a targeted web spider because it is not designed to traverse the web generally: its downloading is limited to the site specified in the original URL with which it is invoked. For example, if the web spider is instructed to download the URL http://actag.canberra.edu.au/actag/ (ie the A.C.T. Lawnet site), any linked pages that are at the same level as, or below, the original URL in the file hierarchy on the same server will be downloaded, but any other linked pages will be ignored. Pages in a different directory such as http://actag.canberra.edu.au/users/ would be ignored, for example, as would those in the root directory http://actag.canberra.edu.au/. The web spider is not allowed to wander `off site'.
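The targeting rule above amounts to a simple test on each linked URL. The following sketch (the function name is ours, not Gromit's) accepts a link only if it is on the same server and at or below the directory of the start URL:

```python
import posixpath
from urllib.parse import urlparse

def in_scope(start_url, link_url):
    """True if link_url is at the same level as, or below, the
    directory of start_url on the same server - the targeting
    rule described above."""
    start, link = urlparse(start_url), urlparse(link_url)
    if start.netloc != link.netloc:
        return False                      # never wander off site
    base = posixpath.dirname(start.path)  # directory of the start URL
    return link.path.startswith(base.rstrip("/") + "/")
```

Applied to the A.C.T. Lawnet example, a link to a page under /actag/ is followed, while links to /users/ or to the server's root directory are ignored.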

6.7.1. Problems of targeting

It is this `targeting' feature, essential to the value of this approach, that also causes complexity in its operation. Project DIAL provides the first extensive opportunity to test the targeting of this web spider. Our initial aim is to index every word of high value legislation-related sites located around the world on the web, particularly (but not limited to) those which do not have their own search facilities. At this prototype stage we have been deferring indexing some very large legislation sites which have their own search engines, but are concentrating on sites which are not otherwise searchable.

The range of legislation sites against which the web spider has been sent can be seen on the page `World Law Search Libraries: Legislation Library (Project DIAL)'[134] (http://www.austlii.edu.au/links/World_Law_Search_Libraries/Legislation_Library_(Project_DIAL)/), and there is a list of the countries involved on the DIAL Search page under `Coverage of DIAL Search'[135] (http://www.austlii.edu.au/au/special/dial/DIALsearch.html#Coverage). Other sites against which the web spider has been sent can be identified by the button next to their names in DIAL Index, and a general list of their contents appears on the DIAL Search page under `Coverage of DIAL Search'[136] (http://www.austlii.edu.au/au/special/dial/DIALsearch.html#Coverage). Some of these sites are in fact not yet searchable in DIAL Search because, although the web spider has been sent to download them, it has not yet succeeded in doing so.

Not all legislation which is available on the web can as yet be made searchable by this approach. Our experience to date, while largely successful, has revealed difficulties such as the following in targeting the web spider:

All of these problems can be overcome by refinements to the targeting mechanisms, development of which is now underway. These changes will greatly expand the range of legislation and other documents which can be added to DIAL Search.

A more difficult problem is that some collections of legislative materials are stored not as web pages but as contents of a database to which there is a web interface[137]. It is usually not possible to `go behind' the web interface in order to download the full contents of the database for indexing.

6.7.2. Issues concerning permissions to index

There are two other impediments to the range of documents accessible to the web spider, both of which must be overcome, not primarily by technical means, but by obtaining permission to index.

Robot exclusion standards

It is a convention of the web that any site may be indexed by web spiders unless the site operator indicates a contrary intention[138]. Some sites, including some legislation sites, place a file on their site telling web spiders (robots) which parts of the site they do not wish to have indexed, and in some cases this may exclude the whole site. Well-behaved robots such as Gromit check for robot exclusions and observe them. Details of the operation and protocols of web robots can be obtained from The Web Robots Page[139] (http://info.webcrawler.com/mak/projects/robots/robots.html).
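Checking a robot exclusion file of the kind described is straightforward; the sketch below uses Python's standard `urllib.robotparser` rather than anything from Gromit itself, and the exclusion file and paths are invented for illustration. A real spider would fetch the site's /robots.txt before crawling.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robot exclusion file excluding two areas of a site
# from all robots.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

def may_fetch(url, agent="Gromit"):
    """Return True if the exclusion file permits `agent` to fetch url."""
    return rp.can_fetch(agent, url)
```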

Where a site which we wish to index excludes web spiders, there is no alternative but to ask the site operator to make an exception for our web spider, on the basis that DIAL Search is a search facility designed specifically for legal researchers, not merely one of the many general purpose web spiders traversing the web. The site operator can then make a simple change to the robot exclusion file which informs our web spider that it has permission to index even though others do not.
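The `simple change' involved might look like the following robot exclusion file, sketched here for illustration (the user-agent name Gromit is that of our spider; the rest of the file is hypothetical). It admits the named robot while continuing to exclude all others:

```
# Allow the Project DIAL spider in; exclude all other robots.
User-agent: Gromit
Disallow:

User-agent: *
Disallow: /
```

An empty Disallow line in the Gromit record means nothing is excluded for that robot, while the catch-all record excludes everyone else from the whole site.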

Commercial and other password-controlled sites

The number of law sites on the web which are operated on a commercial basis and require users to use a password to access material on their site is still small but is growing and can be expected to expand substantially. There are significant examples of such sites in developing countries such as India, China, Malaysia and Turkey, not only in countries such as the USA and Australia.

The technical issues involved in having a web spider provide a password before it indexes a site can be overcome, but the issue is whether the site operator is willing to have the site indexed and therefore to provide a password. It may be to the considerable advantage of the operators of commercial sites to have their sites made searchable via DIAL Search, as it means that DIAL Search users will be able to find that pages relevant to their research exist on a commercial site, but will not be able to access those pages unless they first obtain a password. In effect, this amounts to advertising for the commercial site. In some cases, payment of an annual subscription may be required in order to obtain access for the web spider, but the costs involved would usually not be substantial.
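Technically, supplying a password can be as simple as attaching credentials to each request. The sketch below shows one common mechanism, HTTP Basic authentication, using Python's standard library; the function name and credentials are invented, and the source does not say which authentication scheme any particular commercial site uses. As the text notes, the real question is obtaining the password, not sending it.

```python
import base64
import urllib.request

def build_authorised_request(url, username, password):
    """Build a request carrying HTTP Basic authentication credentials,
    so a spider can fetch pages from a password-controlled site
    (assuming the site operator has agreed to supply a password)."""
    credentials = base64.b64encode(
        f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url, headers={"Authorization": "Basic " + credentials})
```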




[137] The web interface is provided through the Common Gateway Interface (CGI) protocols.

[138] This raises some interesting copyright issues concerning implied licences which have not yet been explored fully by the Courts.

