Project DIAL Report - 2.4. Automated indexes and Internet-wide search engines

2.4. Automated indexes and Internet-wide search engines

2.4.1. Operation of web spiders and search engines
2.4.2. Robot indexes are not comprehensive
2.4.3. Robot indexes contain too much `noise'
2.4.4. Robot indexes are difficult to search for particular countries
2.4.5. Users find searching difficult

Internet-wide search engines based on data collected by web spiders, such as Alta Vista, Hot Bot and others, are remarkable technical achievements. If they allow every word of even a substantial portion of the estimated 150 million pages on the world-wide web to be searched effectively, that is astonishing enough. They allow forms of legal research (and every other type of research) never before possible.

However, there are limitations on the effectiveness of Internet-wide search engines for legal research, and it is important that they be understood so that the continuing relevance of intellectual indexes (directories) to legal research can be appreciated. These limitations are also an important element in the design of the relationship between search engines and intellectual indexes adopted in the Project DIAL prototype.

2.4.1. Operation of web spiders and search engines

Details on all aspects of Internet-wide search engines and on web robots may be obtained from the Search Engine Watch site [40] (http://searchenginewatch.com/) and from the Web Robots Page [41] (http://info.webcrawler.com/mak/projects/robots/robots.html), particularly the Web Robots FAQ [42] (http://info.webcrawler.com/mak/projects/robots/faq.html).

While many readers will be familiar with the operation of Internet-wide search engines, `How Search Engines Work' [43] (http://searchenginewatch.com/work.htm), on Search Engine Watch, gives a simple explanation:

Search engines have three major elements. First is the spider, also called the crawler. The spider visits a web page, reads it, and then follows links to other pages within the site. This is what it means when someone refers to a site being "spidered" or "crawled." The spider returns to the site on a regular basis, such as every month or two, to look for changes.

Everything the spider finds goes into the second part of a search engine, the index. The index, sometimes called the catalog, is like a giant book containing a copy of every web page that the spider finds. If a web page changes, then this book is updated with new information.

Sometimes it can take a while for new pages or changes that the spider finds to be added to the index. Thus, a web page may have been "spidered" but not yet "indexed." Until it is indexed -- added to the index -- it is not available to those searching with the search engine.

Search engine software is the third part of a search engine. This is the program that sifts through the millions of pages recorded in the index to find matches to a search and rank them in order of what it believes is most relevant.
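The three elements described in this passage can be sketched in a few lines of Python. This is a minimal illustration only, not how Alta Vista or any other engine is implemented: the three-page `SITE`, the function names and the ranking rule (total occurrences of the search terms) are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical three-page site: URL -> (page text, outgoing links).
SITE = {
    "/index.html": ("law reports and legislation", ["/acts.html", "/cases.html"]),
    "/acts.html": ("legislation of the parliament", ["/index.html"]),
    "/cases.html": ("court decisions and law reports", []),
}

def spider(start):
    """Element 1: visit a page, read it, and follow its links within the site."""
    seen, queue, fetched = set(), [start], {}
    while queue:
        url = queue.pop(0)
        if url in seen or url not in SITE:
            continue
        seen.add(url)
        text, links = SITE[url]
        fetched[url] = text
        queue.extend(links)
    return fetched

def build_index(fetched):
    """Element 2: record, for each word, the pages containing it and how often."""
    index = defaultdict(dict)
    for url, text in fetched.items():
        for word in text.split():
            index[word][url] = index[word].get(url, 0) + 1
    return index

def search(index, query):
    """Element 3: find matching pages and rank them (assumed rule: term counts)."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for url, count in index.get(word, {}).items():
            scores[url] += count
    # Highest score first; ties broken alphabetically by URL.
    return sorted(scores, key=lambda u: (-scores[u], u))

fetched = spider("/index.html")   # pages are now "spidered"
index = build_index(fetched)      # only after this step are they "indexed"
print(search(index, "law reports"))
```

Keeping `spider` and `build_index` as separate steps mirrors the point made in the quotation: a page that has been fetched cannot be found by `search` until it has also been added to the index.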

2.4.2. Robot indexes are not comprehensive

The notion that one can search `every word on every web site in the world' is encouraged by the publicity for search engines. However, Internet-wide robot indexes, such as Alta Vista, are not as comprehensive as people often assume, and they definitely do not index `every word on the web'.

There are a number of reasons for this [44] (http://searchenginewatch.com/size.htm):

Some robots index only a sample of the pages on a particular site (at least at any one time), and do not continue indexing until they have covered every page on the site in one session. In 1996 it was claimed that Alta Vista indexed only about 10% of the pages of moderately large web sites (600 of 6,000 pages in the example cited), a claim that Alta Vista did not deny [45] (http://www5.zdnet.com/anchordesk/talkback/talkback_11638.html). Alta Vista now claims to index sites without any limit on pages.
Some robots do not index every word on a page, but only index certain information on a page, such as titles, information in meta-tags and an arbitrary number of words at the start of the page.
Well-behaved robots [46] (http://info.webcrawler.com/mak/projects/robots/robots.html) adhere to the robot exclusion standard [47] (http://info.webcrawler.com/mak/projects/robots/exclusion.html), by which web servers tell robots which pages they may not index on a site. Because of the effects of some robots on server performance, and for other reasons, many servers exclude robots. All major search engines observe robot exclusions.
There are some technical problems with frames and with dynamically created web pages that mean that some implementations cannot be included in web spider indexing.
Some web spiders (including Alta Vista) re-index some sites as infrequently as every three months, so new pages added in that period may not be indexed.
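The robot exclusion mechanism mentioned in the list above can be illustrated with Python's standard `urllib.robotparser` module. This is a sketch under stated assumptions: the `robots.txt` contents, the `www.example.org` URLs and the robot name `ExampleSpider` are invented for the example and do not describe any real site or search engine.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary legal site: all robots are
# asked to stay out of two directories.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""

# Hypothetical URLs a spider might discover on that site.
URLS = [
    "http://www.example.org/legislation/act1.html",
    "http://www.example.org/private/draft.html",
    "http://www.example.org/cgi-bin/search",
]

def allowed_urls(robots_txt, agent, urls):
    """Return only the URLs that a well-behaved robot may fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(agent, u)]

print(allowed_urls(ROBOTS_TXT, "ExampleSpider", URLS))
```

Only the first URL survives the check, which is one reason why the pages a search engine can find are fewer than the pages a site actually publishes.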

Such factors led to estimates in 1996 that even the largest Internet-wide search engines indexed only about 20% of the estimated 150 million web pages. However, the most recent figures published by Search Engine Watch [48] (http://searchenginewatch.com/features.htm) include claims by Alta Vista to index 100 M pages, HotBot 80 M, Excite 55 M and others fewer than that. The total number of web pages is now estimated to be in excess of 200 M [49] (http://searchenginewatch.com/size.htm), so, whatever the exact situation may be, it is still the case that no search engine can claim to index all pages on the web.
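The arithmetic behind this conclusion is simple enough to set out explicitly. The sketch below uses only the figures quoted in the paragraph above; because the total is given as `in excess of 200 M', each result is an upper bound on the engine's coverage, and the variable and function names are chosen for the example.

```python
# Claimed index sizes quoted in the text, in millions of pages.
CLAIMED_INDEX_SIZE = {"Alta Vista": 100, "HotBot": 80, "Excite": 55}

# Estimated size of the web in millions of pages. The text says
# "in excess of 200 M", so 200 is used here as a lower bound.
ESTIMATED_WEB_SIZE = 200

def coverage(indexed, total):
    """Fraction of the estimated web covered by an index of the given size."""
    return indexed / total

for engine, size in CLAIMED_INDEX_SIZE.items():
    share = coverage(size, ESTIMATED_WEB_SIZE)
    print(f"{engine}: at most {share:.1%} of the estimated web")
```

On these figures, even the largest claimed index covers no more than half of the estimated web.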

Despite these limitations, searches over Internet-wide search engines can be very effective. For example, all four of the known legislation collections (or at least some pages from them) tested above in relation to intellectual indexes were found in searches over Alta Vista, but only by using ILRG's LawRunner, which customises searches over Alta Vista for particular country domains [50] (http://www.ilrg.com/nations/).

2.4.3. Robot indexes contain too much `noise'

It is difficult to make searches precise enough to find only legal materials using Internet-wide robot indexes, because they index predominantly non-legal material. Unless a search is for very unusual terms used only in a legal context, the search results will include a large number of acontextual items. Given that search engines such as Alta Vista will often find some thousands of items satisfying a search request, then unless the relevance ranking facility can distinguish which items concern law, the relevant items can be difficult to find among the irrelevant ones. It is usually necessary to try to impose some ad hoc search limitation (in addition to the real search terms) such as `law or legislation or code or court' or some such, to try to stem the flood of irrelevant information (or more likely, to trick the relevance ranking into putting legally oriented material first). However, this does not work well and is beyond the search abilities of most users.
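Why an ad hoc limitation of this kind does not work well can be shown with a small sketch. It applies the four terms suggested above as a filter over some result titles. The titles are hypothetical, and filtering on the user's side is an assumption made for illustration; it is not how Alta Vista processes or ranks a query.

```python
# The ad hoc legal terms suggested in the text.
LEGAL_TERMS = {"law", "legislation", "code", "court"}

def mentions_legal_term(text):
    """True if any word of the text is one of the ad hoc legal terms."""
    words = {w.strip(".,;:()'\"").lower() for w in text.split()}
    return bool(words & LEGAL_TERMS)

# Hypothetical result titles for a search on "copyright".
RESULTS = [
    "Copyright legislation of Vietnam",
    "Copyright notice for our shareware",
    "Source code and copyright headers",
]

kept = [title for title in RESULTS if mentions_legal_term(title)]
print(kept)
```

The filter removes the shareware notice but keeps the page about source code, because `code' is as common in computing as it is in law. Terms broad enough to catch most legal material will also catch a good deal of non-legal material.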

2.4.4. Robot indexes are difficult to search for particular countries

It is also difficult for most users to limit searches to materials concerning laws of particular countries[51], and failure to do so will usually result in the search being flooded with material from North America and other `content rich' parts of the Internet. For example, on Alta Vista, a search for Vietnamese legal materials requires a search which is limited to materials which are located on a server in Vietnam (the `domain:vn' delimiter) or contain `Vietnam or Viet Nam' (because there are valuable Vietnamese legal materials not contained in Vietnam) - and this is still somewhat hit or miss. FindLaw and LawRunner fix part of this problem by automating the use of the domain limitation in Alta Vista.
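A country limitation of the kind described for Vietnam can be assembled mechanically, which is in essence what a front end that automates the domain limitation has to do. In the sketch below only the `domain:vn' delimiter comes from the text. The `AND' and `OR' operators, the quoting of multi-word names and the function name `country_query` are assumptions made for illustration and should be checked against the search engine's own documentation.

```python
def country_query(terms, domain, names):
    """Combine search terms with a country limitation.

    A page qualifies if it is on a server in the country's domain OR
    mentions one of the country's names. The operator syntax used here
    is assumed for illustration, not taken from any engine's documentation.
    """
    # Quote multi-word names so they are treated as phrases.
    quoted = [f'"{n}"' if " " in n else n for n in names]
    scope = " OR ".join([f"domain:{domain}"] + quoted)
    return f"({terms}) AND ({scope})"

print(country_query("copyright law", "vn", ["Vietnam", "Viet Nam"]))
```

For the example given, this produces `(copyright law) AND (domain:vn OR Vietnam OR "Viet Nam")`. The same function serves other countries by changing the domain and the list of names, but the result remains somewhat hit or miss for the reasons given above.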

2.4.5. Users find searching difficult

Where it is necessary to resort to sophisticated search techniques such as those discussed under the last two points in order to `screen out' irrelevant material, most users are likely to find search engines too difficult to use. There are no automated indexes which index legal materials only, so users are required to perform this function through their searches.

[44] See `How big are the search engines?' (Search Engine Watch) at http://searchenginewatch.com/size.htm and references linked therefrom, for detailed discussion of all these matters.

[45] The claim by John Pike, webmaster of the Federation of American Scientists, and the reply by Alta Vista are available at http://www5.zdnet.com/anchordesk/talkback/talkback_11638.html and discussed in `The Alta Vista Size Controversy' on Search Engine Watch.

[46] See Martijn Koster `The Web Robots Pages' at http://info.webcrawler.com/mak/projects/robots/robots.html for details of the operation of web robots.

[47] See the `Robots Exclusion' page, dealing with both the standard and the Meta Tag for robot exclusion, at http://info.webcrawler.com/mak/projects/robots/exclusion.html

[48] `Search Engine Features Chart', 5 November 1997 at http://searchenginewatch.com/features.htm

[49] This page provides an estimate from Alta Vista of 150M as at June 1997, from which a very conservative 1/3 addition of pages within the last 6 month period to December 1997 has been extrapolated.

[50] http://www.ilrg.com/nations/

[51] For example, on Alta Vista, a search for Vietnamese legal materials requires a search which is limited to materials which are located on a server in Vietnam (the `domain:vn' delimiter) or contain `Vietnam or Viet Nam' - and this is still somewhat hit or miss.
