In summary, the editors of the internet indexes (such as DIAL Index) identify web sites with high-value legal content and, when indexing them, also issue an instruction that the web spider is to download the content of the site for indexing. The harness program reads the list of instructions from the web indexing software, and then sends off multiple instances of the web spider program, each to download the content of a particular web site. The harness program ensures that only one instance of the web spider is ever downloading from a particular site, to avoid saturating that site with spider requests and denying access to other users. In this way the harness ensures that the web spider is `well behaved', causing minimum impact on the sites from which it downloads web pages.
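The harness's one-spider-per-site rule can be illustrated with a minimal sketch. This is not the actual harness program: the names `group_by_site` and `run_harness`, the `spider` callable and the `max_spiders` limit are all hypothetical, and the sketch assumes that "site" can be approximated by the host name in each start URL.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


def group_by_site(start_urls):
    """Group start URLs by host so that each site gets exactly one queue."""
    queues = defaultdict(list)
    for url in start_urls:
        # Assumption: a "site" is identified by its (lower-cased) host name.
        host = (urlsplit(url).hostname or "").lower()
        queues[host].append(url)
    return dict(queues)


def run_harness(start_urls, spider, max_spiders=4):
    """Call spider(url) for every start URL, never twice at once on one host.

    All URLs for a host are handled one after another by a single worker;
    different hosts are processed in parallel, up to max_spiders at a time.
    """
    def crawl_site(urls):
        return [spider(url) for url in urls]

    queues = group_by_site(start_urls)
    with ThreadPoolExecutor(max_workers=max_spiders) as pool:
        results = pool.map(crawl_site, queues.values())
        return dict(zip(queues.keys(), results))
```

Because each host's queue is drained sequentially by one worker, a site is never contacted by two spider instances at the same time, whatever the number of instructions issued for it.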
Once copies of the pages from the remote site have been downloaded, every word on those pages is indexed by AustLII's SINO Search Engine, and the pages can then be searched with all the normal SINO search functions using interfaces such as DIAL Search (discussed below). When a user searches the downloaded pages and obtains a set of search results, the documents to which the user is taken are those on the original host, not those cached on the AustLII server. The cached downloaded documents are not available for browsing via AustLII's servers.
We call this a targeted web spider, as it is not designed to traverse the web generally: its downloading is limited to the site identified by the URL specified when it is invoked. For example, if the web spider is instructed to download the URL http://actag.canberra.edu.au/actag/ (i.e. the A.C.T. Lawnet site), any linked pages that are at the same level or below the original URL in the file hierarchy on the same server will be downloaded, but any other linked pages will be ignored. The web spider is not allowed to wander `off site'.
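The `same level or below' rule can be expressed as a simple test on each link the spider encounters. The following is a sketch only, not the spider's actual code: the function name `in_scope` and its parameters are hypothetical, "same server" is assumed to mean the same host name, and a start URL that does not end in `/` is assumed to name a file within its directory.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def in_scope(start_url, link, found_on=None):
    """Return True if link is on the same server as start_url and at the
    same level or below it in the file hierarchy.

    found_on is the page on which the link appeared (defaults to start_url);
    relative links are resolved against it.
    """
    start = urlsplit(start_url)
    target = urlsplit(urljoin(found_on or start_url, link))

    # Only follow web links, and only on the same host (assumption).
    if target.scheme not in ("http", "https"):
        return False
    if (target.hostname or "").lower() != (start.hostname or "").lower():
        return False

    # Directory containing the start URL, always ending in "/":
    # "/actag/" -> "/actag/", "/actag/index.html" -> "/actag/", "" -> "/".
    root = start.path.rsplit("/", 1)[0] + "/"

    # Collapse any "." or ".." segments before comparing paths.
    path = posixpath.normpath(target.path or "/")
    return (path + "/").startswith(root)
```

With the start URL http://actag.canberra.edu.au/actag/ from the example above, a relative link such as `laws/act1.html` (a made-up path) is in scope, while `../other/index.html` or any link to a different host is not.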
At this prototype stage we are not attempting to index very large legislation sites which have their own search engines, but are concentrating on smaller sites which are not otherwise searchable.
In testing to date, legislation from Vietnam (two extensive sets), Zambia, Mongolia and Israel has been indexed successfully. Further testing is underway on legislation from India, Portugal, Alberta (Canada), and Nova Scotia (Canada). More legislation is added as suitable targets are found.
Not all legislation which is available on the web can as yet be indexed by this approach. Our experience to date, while largely successful, has revealed difficulties such as the following in targeting the web spider: