
6. The targeted web spider - From indexing to searching


AustLII's web spider

AustLII's targeted web spider is the interaction of three programs: Gromit (the robot or spider program), Wallace (the harness program that controls Gromit), and the Feathers web indexing software (in which the sites to be indexed are identified). Some technical details of their operation are contained in the Appendix.

In summary, the editors of the Internet indexes (such as the DIAL Index) identify web sites with high value legal content and, when indexing them, also issue an instruction that the web spider is to download the content of the site for indexing. The harness program reads the list of instructions from the web indexing software and then sends off multiple instances of the web spider program, each to download the content of a particular web site. The harness ensures that only one instance of the web spider is ever downloading from a particular site at any time, to avoid saturating that site with spider requests and denying access to other users. In this way the harness keeps the web spider `well behaved', causing minimum impact on the sites from which it downloads web pages.
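By way of illustration only, the following Python sketch shows the general shape of such a harness: one worker per target site, a single request to any one host at a time, and a pause between requests. It is a sketch under our own assumptions, not the actual Wallace or Gromit code; the names TARGET_SITES, REQUEST_DELAY and download_site are illustrative.

import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Hypothetical list of target sites, standing in for the instructions
# produced by the web indexing software.
TARGET_SITES = [
    "http://actag.canberra.edu.au/actag/",
    # ... further sites nominated by the index editors
]

REQUEST_DELAY = 2.0  # seconds to wait between requests to the same host


def download_site(start_url):
    """Fetch pages from a single site sequentially, pausing between
    requests so the host is never saturated. (Link extraction is
    omitted here; see the targeting sketch later in this section.)"""
    pending = [start_url]
    fetched = {}
    while pending:
        url = pending.pop(0)
        try:
            with urllib.request.urlopen(url, timeout=30) as response:
                fetched[url] = response.read()
        except OSError:
            pass                      # skip unreachable pages
        time.sleep(REQUEST_DELAY)     # stay `well behaved' towards the host
    return fetched


# One worker per site: no site ever receives concurrent requests from
# the spider, but different sites can be downloaded in parallel.
with ThreadPoolExecutor(max_workers=len(TARGET_SITES)) as pool:
    results = list(pool.map(download_site, TARGET_SITES))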

Once copies of the pages from the remote site are downloaded, every word on those pages is indexed by AustLII's SINO Search Engine, and can then be searched with all the normal SINO search functions using interfaces such as DIAL Search (discussed below). When a user searches the downloaded pages and obtains a set of search results, the documents to which the user is taken are those on the original host, not those cached on the AustLII server. The cached documents are not available for browsing via AustLII's servers.
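A simple way to picture this is the mapping from a cached copy back to its source. The sketch below assumes that downloaded pages are stored under a local cache directory named after the host they came from; this layout and the names CACHE_ROOT and cached_path_to_url are assumptions for illustration, not a description of AustLII's actual cache.

CACHE_ROOT = "/var/spool/spider"   # hypothetical location of the cached pages


def cached_path_to_url(cached_path):
    """Map a cached file path back to the URL on the original host, so
    that search results always point the user to the source site."""
    relative = cached_path[len(CACHE_ROOT):].lstrip("/")   # host/dir/file.html
    host, _, path = relative.partition("/")
    return "http://" + host + "/" + path


# A hit in the full-text index on a cached file ...
hit = "/var/spool/spider/actag.canberra.edu.au/actag/welcome.html"
# ... is presented to the user as a link to the original site:
print(cached_path_to_url(hit))
# http://actag.canberra.edu.au/actag/welcome.html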

We call this a targeted web spider because it is not designed to traverse the web generally: its downloading is limited to the site specified in the URL with which it is invoked. For example, if the web spider is instructed to download the URL http://actag.canberra.edu.au/actag/ (ie the A.C.T. Lawnet site), any linked pages that are at the same level or below the original URL in the file hierarchy on the same server will be downloaded, but any other linked pages will be ignored. The web spider is not allowed to wander `off site'.
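The targeting rule can be expressed in a few lines. The following sketch is our own paraphrase of that rule, not AustLII's code; within_target is an illustrative name.

from urllib.parse import urljoin, urlparse


def within_target(start_url, linked_url):
    """Return True only if linked_url is on the same server as start_url
    and at the same level or below it in the file hierarchy."""
    start = urlparse(start_url)
    link = urlparse(urljoin(start_url, linked_url))
    if link.netloc != start.netloc:
        return False                                # never wander `off site'
    base_dir = start.path.rsplit("/", 1)[0] + "/"   # e.g. "/actag/"
    return (link.path + "/").startswith(base_dir)


start = "http://actag.canberra.edu.au/actag/"
print(within_target(start, "legislation/act1.html"))      # True: below the start URL
print(within_target(start, "http://actag.canberra.edu.au/other/page.html"))  # False: outside /actag/
print(within_target(start, "http://www.austlii.edu.au/")) # False: different server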

The DIAL prototype tests - problems of targeting

It is this `targeting' feature, essential to the value of this approach, that also causes complexity in its operation. Project DIAL provides the first extensive opportunity to test the targeting of a web spider. Our initial aim is to index every word of high value legislation sites located around the world on the web, particularly those which do not have their own search facilities.

At this prototype stage we are not attempting to index very large legislation sites which have their own search engines, but are concentrating on smaller sites which are not otherwise searchable.

In testing to date, legislation from Vietnam (two extensive sets), Zambia, Mongolia and Israel has been indexed successfully. Further testing is underway on legislation from India, Portugal, Alberta (Canada) and Nova Scotia (Canada). More legislation is added as suitable targets are found.

Not all legislation which is available on the web can as yet be indexed by this approach. Our experience to date, while largely successful, has revealed difficulties such as the following in targeting the web spider:

Some of these problems will be able to be overcome by refinements to the targeting mechanisms.

