In summary, the facility works as follows. The editors of the Internet indexes (such as DIAL Index) identify those web sites with high value legal content and, when indexing them, also issue an instruction that the web spider is to download the content of the site for indexing, specifying the page or pages at which the web spider is to start. The harness program reads the list of instructions from the web indexing software and then launches multiple instances of the web spider program, each to download the content of a particular web site. The harness program ensures that only one instance of the web spider software is ever downloading from a particular site, to avoid saturating that site with spider requests and denying access to other users. In this way the harness ensures that the web spider is `well behaved', causing minimum impact on the sites from which it downloads web pages.
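The one-spider-per-site rule described above can be sketched as follows. This is a minimal illustration only, not the Project DIAL harness itself: the function names `group_by_site` and `run_harness`, the `spider` callable and the `max_workers` parameter are all assumptions introduced for the example. The idea is simply that start URLs are grouped by host before any spider is launched, so that each host is handled by exactly one job.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


def group_by_site(start_urls):
    """Group start URLs by host name (hypothetical helper).

    Each host ends up with a single list of start pages, so it can be
    handed to a single spider instance.
    """
    jobs = defaultdict(list)
    for url in start_urls:
        host = urlsplit(url).netloc.lower()
        jobs[host].append(url)
    return dict(jobs)


def run_harness(start_urls, spider, max_workers=4):
    """Run one spider job per host, several hosts in parallel.

    `spider` is an assumed callable taking (host, list_of_start_urls).
    Because there is exactly one job per host, no two spider instances
    ever download from the same site at the same time.
    """
    jobs = group_by_site(start_urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            host: pool.submit(spider, host, urls)
            for host, urls in jobs.items()
        }
        return {host: future.result() for host, future in futures.items()}
```

A real harness would also need to pace the requests made within each job; that is not shown here.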
Once copies of the pages from the remote site are downloaded, every word on those pages is added to the word-occurrence index by the SINO Search Engine, and can then be searched with all the normal SINO search functions using search interfaces such as DIAL Search. When a user searches the downloaded pages and obtains a set of search results, the documents to which the user is taken are those on the original host, not those cached on the Project DIAL server. The cached downloaded documents are not available for browsing via the Project DIAL server.
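The relationship between the downloaded copies and the search results can be illustrated with a toy word-occurrence index. This is not the SINO Search Engine, whose internals are not described here; `build_index` and `search` are hypothetical names, and the tokenisation rule (lower-case runs of letters and digits) is an assumption. The point of the sketch is that the index is keyed on the text of the downloaded copy but stores only the URL of the page on its original host, so that is where a search result leads.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Build a word -> set-of-URLs index (hypothetical helper).

    `pages` maps the ORIGINAL URL of each page to the text of its
    downloaded copy. Only the original URL is stored in the index.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return the original URLs of pages containing every query word."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)
```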
We call this a targeted web spider because it is not designed to traverse the web generally: its downloading is limited to the site identified by the original URL specified when it is invoked. For example, if the web spider is instructed to download the URL http://actag.canberra.edu.au/actag/ (i.e. the A.C.T. Lawnet site), any linked pages that are at the same level as or below the original URL in the file hierarchy on the same server will be downloaded, but any other linked pages will be ignored. Pages in a sub-directory such as http://actag.canberra.edu.au/users/ would be ignored, as would those in the root directory http://actag.canberra.edu.au/. The web spider is not allowed to wander `off site'.
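The scoping rule just described can be expressed as a short test on URLs. The following is a sketch under stated assumptions rather than the spider's actual code: the function name `in_scope` is hypothetical, and candidate URLs are assumed to be already normalised (no `..` segments). A linked page is in scope only if it is on the same server and its path lies in or below the directory of the start URL.

```python
from urllib.parse import urlsplit


def in_scope(start_url, candidate_url):
    """Return True if candidate_url is at the same level as, or below,
    start_url on the same server (hypothetical helper)."""
    start = urlsplit(start_url)
    cand = urlsplit(candidate_url)

    # Different scheme or server: the spider would be wandering off site.
    if (start.scheme, start.netloc.lower()) != (cand.scheme, cand.netloc.lower()):
        return False

    # Directory of the start URL: its path up to and including the last "/".
    base_dir = start.path[: start.path.rfind("/") + 1] or "/"

    # In scope only if the candidate path lies in or below that directory.
    return (cand.path or "/").startswith(base_dir)
```

With http://actag.canberra.edu.au/actag/ as the start URL, a page such as http://actag.canberra.edu.au/actag/acts/index.html (an invented path, for illustration) is in scope, while http://actag.canberra.edu.au/users/ and http://actag.canberra.edu.au/ are not.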
The range of legislation sites against which the web spider has been sent can be seen on the page `World Law Search Libraries:Legislation Library (Project DIAL)'[134] [http://www.austlii.edu.au/links/World_Law_Search_Libraries/Legislation_Library_(Project_DIAL)/], and there is a list of countries involved on the DIAL Search page under `Coverage of DIAL Search'[135] [http://www.austlii.edu.au/au/special/dial/DIALsearch.html#Coverage]. Other sites against which the web spider has been sent can be seen from the presence of the button next to their names in DIAL Index, and a general list of their contents is on the DIAL Search page under `Coverage of DIAL Search'[136] [http://www.austlii.edu.au/au/special/dial/DIALsearch.html#Coverage]. Some of these sites are in fact not yet searchable in DIAL Search because, although the web spider has been sent to download them, it has not yet succeeded in doing so.
Not all legislation which is available on the web can as yet be made searchable by this approach. Our experience to date, while largely successful, has revealed difficulties such as the following in targeting the web spider:
A more difficult problem is that some collections of legislative materials are stored not as web pages but as contents of a database to which there is a web interface[137]. It is usually not possible to `go behind' the web interface in order to download the full contents of the database for indexing.
Where a site which we wish to index excludes web spiders, there is no alternative but to ask the site operator to make an exception for our web spider, on the basis that DIAL Search is a search facility designed specifically for legal researchers, not merely one of the many general purpose web spiders traversing the web. The site operator can then make a simple change to the robot exclusion file which informs our web spider that it has permission to index even though others do not.
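The kind of change to the robot exclusion file contemplated above can be illustrated as follows. This is a sketch only: the user-agent name `dial-spider` is a placeholder invented for the example (the name actually announced by our web spider is not given here), and the check uses the `urllib.robotparser` module from the Python standard library rather than the spider's own code. The file excludes all robots from the whole site, but carries a separate record, with an empty `Disallow` line, that permits the named spider.

```python
from urllib.robotparser import RobotFileParser

# Example robot exclusion file: everything is closed to robots in general,
# but the (hypothetically named) dial-spider may fetch any page.
ROBOTS_TXT = """\
User-agent: dial-spider
Disallow:

User-agent: *
Disallow: /
"""


def allowed(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

Under this file, `allowed(ROBOTS_TXT, "dial-spider", url)` is true for any page on the site, while the same call for any other robot name is false.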
The technical issues involved in having a web spider provide a password before it indexes a site can be overcome, but the issue is whether the site operator is willing to have the site indexed and therefore to provide a password. It may be to the considerable advantage of the operators of commercial sites to have their sites made searchable via DIAL Search, as it means that DIAL Search users will be able to find that pages relevant to their research exist on a commercial site, but will not be able to access those pages unless they first obtain a password. In effect, this amounts to advertising for the commercial site. In some cases, payment of an annual subscription may be required in order to obtain access for the web spider, but the costs involved would usually not be substantial.
[137] The web interface is provided through the Common Gateway Interface (CGI) protocols.
[138] This raises some interesting copyright issues concerning implied licences which have not yet been explored fully by the Courts.