2.4. Automated indexes and Internet-wide search engines
Internet-wide search engines based on data collected by web spiders, such as
Alta Vista, HotBot and others, are remarkable technical achievements. If they
allow every word of even a substantial portion of the estimated 150 million
pages on the world-wide web to be searched effectively, that is astonishing
enough. They allow forms of legal research (and every other type of research)
never before possible.
However, there are limitations on the effectiveness of Internet-wide search
engines for legal research, and it is important that they be understood so that
the continuing relevance of intellectual indexes (directories) to legal
research can be appreciated. These limitations are also an important element in
the design of the relationship between search engines and intellectual indexes
adopted in the Project DIAL prototype.
Details on all aspects of Internet-wide search engines and on web robots may be
obtained from the Search Engine Watch site [40] <http://searchenginewatch.com/>
and from the Web Robots Page [41] <http://info.webcrawler.com/mak/projects/robots/robots.html>
(particularly the Web Robots FAQ [42] <http://info.webcrawler.com/mak/projects/robots/faq.html>).
While many readers will be familiar with the operation of Internet-wide search
engines, `How Search Engines Work' [43] <http://searchenginewatch.com/work.htm>
(on Search Engine Watch) gives a simple explanation:
Search engines have three major elements. First is the spider, also
called the crawler. The spider visits a web page, reads it, and then follows
links to other pages within the site. This is what it means when someone refers
to a site being "spidered" or "crawled." The spider returns to the site on a
regular basis, such as every month or two, to look for changes.
Everything the spider finds goes into the second part of a search
engine, the index. The index, sometimes called the catalog, is like a giant
book containing a copy of every web page that the spider finds. If a web page
changes, then this book is updated with new information.
Sometimes it can take a while for new pages or changes that the
spider finds to be added to the index. Thus, a web page may have been
"spidered" but not yet "indexed." Until it is indexed -- added to the index --
it is not available to those searching with the search engine.
Search engine software is the third part of a search engine. This
is the program that sifts through the millions of pages recorded in the index
to find matches to a search and rank them in order of what it believes is most
relevant.
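The three elements described in the quotation can be illustrated with a minimal sketch. It is a toy under stated assumptions, not a description of how Alta Vista or any other engine is built: the three-page `PAGES` "web", the `example.org` URLs and the word-count ranking are all hypothetical, and no network access is involved.

```python
import re
from collections import defaultdict

# Hypothetical three-page "web": URL -> (page text, outgoing links).
PAGES = {
    "http://example.org/": (
        "Index of legislation and court decisions",
        ["http://example.org/acts", "http://example.org/cases"],
    ),
    "http://example.org/acts": (
        "Copyright legislation and amendments",
        ["http://example.org/"],
    ),
    "http://example.org/cases": ("Court decisions on copyright", []),
}


def spider(start_url):
    """Element 1: visit a page, read it, and follow its links."""
    seen, queue, fetched = set(), [start_url], {}
    while queue:
        url = queue.pop(0)
        if url in seen or url not in PAGES:
            continue
        seen.add(url)
        text, links = PAGES[url]
        fetched[url] = text
        queue.extend(links)
    return fetched


def build_index(fetched):
    """Element 2: record, for each word, the pages that contain it."""
    index = defaultdict(set)
    for url, text in fetched.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Element 3: find matching pages and rank them.

    The ranking is an assumption made for illustration: a page scores
    one point for each query word it contains.
    """
    scores = defaultdict(int)
    for word in re.findall(r"[a-z]+", query.lower()):
        for url in index.get(word, ()):
            scores[url] += 1
    return sorted(scores, key=lambda u: (-scores[u], u))


index = build_index(spider("http://example.org/"))
results = search(index, "copyright legislation")
```

Here `results` lists the `/acts` page first, because it is the only page containing both query words. A page that the spider has fetched but that has not yet passed through `build_index` would not be found, which is the "spidered but not yet indexed" situation the quotation describes.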
The notion that one can search `every word on every web site in the world' is
encouraged by the publicity for search engines. However, Internet-wide robot
indexes, such as Alta Vista, are not as comprehensive as people often
assume, and they definitely do not index `every word on the web'.
There are a number of reasons for this [44] <http://searchenginewatch.com/size.htm>:
- Some robots only index a sample of pages on a particular site (at least at
any one time), rather than continuing until they have indexed every page on the
site in one session. In 1996 it was claimed that Alta Vista only indexed
about 10% of the pages of moderately large web sites (600 of 6,000 pages in the
example cited), and Alta Vista did not deny this [45] <http://www5.zdnet.com/anchordesk/talkback/talkback_11638.html>.
Alta Vista now claims to index sites without any limit on pages.
- Some robots do not index every word on a page, but only index certain
information on a page, such as titles, information in meta-tags and an
arbitrary number of words at the start of the page.
- Well-behaved robots [46] <http://info.webcrawler.com/mak/projects/robots/robots.html>
adhere to the robot exclusion standard [47] <http://info.webcrawler.com/mak/projects/robots/exclusion.html>,
by which web servers tell robots which pages they may not index on a site.
Because of the effects of some robots on server performance, and for other
reasons, many servers exclude robots. All major search engines observe robot
exclusions.
- Technical problems with frames and with dynamically created web pages mean
that pages built with some implementations of them cannot be included in web
spider indexing.
- Some web spiders (including Alta Vista) re-index some sites as
infrequently as every three months, so pages added in that
period may not be indexed.
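The robot exclusion mechanism mentioned in the list above can be sketched with the `urllib.robotparser` module from the Python standard library. The exclusion file contents, the robot name `ExampleSpider` and the `example.org` URLs are assumptions made for illustration, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical exclusion file, as a server might publish it at /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /drafts/
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_index(url, robot_name="ExampleSpider"):
    """Return True if a well-behaved robot is permitted to fetch this URL."""
    return parser.can_fetch(robot_name, url)


allowed = may_index("http://example.org/acts/copyright.html")
blocked = may_index("http://example.org/drafts/bill.html")
```

With this file, `allowed` is true and `blocked` is false. Anything under an excluded path stays out of every compliant search engine's index, however valuable it may be for legal research.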
Such factors led to estimates in 1996 that even the largest Internet-wide
search engines only indexed about 20% of the estimated 150 million web pages.
However, the most recent figures published by
Search Engine Watch [48] <http://searchenginewatch.com/features.htm>
include claims by Alta Vista to index 100 M pages, HotBot 80 M, Excite 55 M and
others less than that. The total number of web pages is now estimated
to be in excess of 200 M [49] <http://searchenginewatch.com/size.htm>,
so, whatever the exact situation may be, it is still the case that no search
engine can claim to index all pages on the web.
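The arithmetic behind this conclusion can be made explicit. The sketch below uses only the claimed index sizes and the 200 M estimate quoted above. Because the total is stated to be "in excess of" 200 M, the computed shares are best read as upper bounds.

```python
# Figures quoted in the text, in millions of pages.
ESTIMATED_WEB_PAGES_M = 200
CLAIMED_INDEX_SIZE_M = {"Alta Vista": 100, "HotBot": 80, "Excite": 55}


def coverage(indexed_m, total_m=ESTIMATED_WEB_PAGES_M):
    """Share of the estimated web covered by an index of the given size."""
    return indexed_m / total_m


shares = {name: coverage(size) for name, size in CLAIMED_INDEX_SIZE_M.items()}

for name, share in shares.items():
    print(f"{name}: at most {share:.1%} of the estimated web")
```

On these figures even the largest claimed index covers no more than about half of the estimated web, and the others cover 40% or less.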
Despite these limitations, searches over Internet-wide search engines can be
very effective. For example, all four of the known legislation collections (or
at least some pages from them) tested above in relation to intellectual
indexes were found in searches over Alta Vista, but only by using ILRG's
LawRunner, which customises searches over Alta Vista for particular country
domains [50] <http://www.ilrg.com/nations/>.
It is difficult to make searches precise enough to find only legal materials
using Internet-wide robot indexes, because they index predominantly non-legal
material. Unless a search is for very unusual terms used only in a legal
context, the search results will include a large number of acontextual items.
Given that search engines such as Alta Vista will often find some thousands of
items satisfying a search request, then unless the relevance ranking facility
can distinguish which items concern law, the relevant items can be difficult to
find among the irrelevant ones. It is usually necessary to impose some
ad hoc search limitation (in addition to the real search terms), such as `law or
legislation or code or court', to try to stem the flood of
irrelevant information (or, more likely, to trick the relevance ranking into
putting legally oriented material first). However, this does not work well and
is beyond the search abilities of most users.
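The kind of ad hoc limitation described above amounts to wrapping the user's real search terms in an extra clause. The following sketch shows the idea. The function name `add_legal_filter` is hypothetical, and the `AND`/`OR` operators and parentheses are a generic Boolean syntax assumed for illustration, not the confirmed query language of any particular search engine.

```python
# The ad hoc legal terms suggested in the text.
LEGAL_TERMS = ["law", "legislation", "code", "court"]


def add_legal_filter(user_query, legal_terms=LEGAL_TERMS):
    """Combine the real search terms with a clause of legal vocabulary."""
    legal_clause = " OR ".join(legal_terms)
    return f"({user_query}) AND ({legal_clause})"


query = add_legal_filter("copyright AND software")
```

The resulting string is `(copyright AND software) AND (law OR legislation OR code OR court)`. Even this small example requires the searcher to think in nested Boolean expressions, which illustrates why the technique is beyond most users.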
It is also difficult for most users to limit searches to materials concerning
laws of particular countries[51], and failure
to do so will usually result in the search being flooded with material from
North America and other `content rich' parts of the Internet. For example, on
Alta Vista, a search for Vietnamese legal materials requires a search which is
limited to materials which are located on a server in Vietnam (the `domain:vn'
delimiter) or contain `Vietnam or Viet Nam' (because there are valuable
Vietnamese legal materials held on servers outside Vietnam) - and this is still
somewhat hit or miss. FindLaw and LawRunner address part of this problem by
automating the use of the domain limitation in Alta Vista.
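The automation that such tools provide can be sketched as a function that appends the country restriction on the user's behalf. This is an illustration under assumptions, not the actual implementation of FindLaw or LawRunner: the function name `add_country_filter` is hypothetical, the `domain:vn` delimiter is the one quoted in the text, and the Boolean operators and phrase quoting are generic syntax assumed for the example.

```python
def add_country_filter(user_query, domain, country_names):
    """Limit a query to a country's domain or to mentions of its name."""
    alternatives = [f"domain:{domain}"]
    for name in country_names:
        # Quote multi-word names so they are treated as a phrase (assumed syntax).
        alternatives.append(f'"{name}"' if " " in name else name)
    country_clause = " OR ".join(alternatives)
    return f"({user_query}) AND ({country_clause})"


query = add_country_filter("investment law", "vn", ["Vietnam", "Viet Nam"])
```

The user types only `investment law`. The country clause, `domain:vn OR Vietnam OR "Viet Nam"`, is supplied for them.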
Where it is necessary to resort to sophisticated search techniques, such as
those discussed in the last two paragraphs, in order to `screen out' irrelevant
material, most users are likely to find search engines too difficult
to use. There are no automated indexes which index legal materials only, so
users are required to perform this function through their searches.
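[40] <http://searchenginewatch.com/>
[41] <http://info.webcrawler.com/mak/projects/robots/robots.html>
[42] <http://info.webcrawler.com/mak/projects/robots/faq.html>
[43] <http://searchenginewatch.com/work.htm>
[44] See `How big are the search engines?'
(Search Engine Watch) at <http://searchenginewatch.com/size.htm> and references
linked therefrom, for detailed discussion of all these matters.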
[45] The claim by John Pike, webmaster of the
Federation of American Scientists, and the reply by Alta Vista are available at
<http://www5.zdnet.com/anchordesk/talkback/talkback_11638.html>
and are discussed in `The Alta Vista Size Controversy' on Search Engine Watch.
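[46] See Martijn Koster `The Web Robots Pages'
at <http://info.webcrawler.com/mak/projects/robots/robots.html> for details of
the operation of web robots.
[47] See the `Robots Exclusion' page, dealing
with both the standard and the Meta Tag for robot exclusion, at
<http://info.webcrawler.com/mak/projects/robots/exclusion.html>.
[48] `Search Engine Features Chart', 5
November 1997 at <http://searchenginewatch.com/features.htm>.
[49] The page at <http://searchenginewatch.com/size.htm> provides an estimate
from Alta Vista of 150M pages as at June 1997, to which a very conservative
addition of one third has been made for pages added in the six months to
December 1997.
[50] <http://www.ilrg.com/nations/>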
[51] For example, on Alta Vista, a search for
Vietnamese legal materials requires a search which is limited to materials
which are located on a server in Vietnam (the `domain:vn' delimiter) or contain
`Vietnam or Viet Nam' - and this is still somewhat hit or miss.