Reading Guide:
Hypertext and Retrieval
3. Text retrieval - principles and evaluation
'Text retrieval' (also sometimes known as 'full text searching' or 'free text searching')
refers to computerised systems which allow users to find particular combinations
of words in large bodies of text, whether or not those texts have any uniform structure.
Howard Turtle (see below) gives a more technical (and accurate) definition:
'a retrieval system applies some matching function to the representation of
the information needs and the representation of each document to determine which
documents to retrieve'.
For those unfamiliar with text retrieval, the following glossaries may be helpful,
either as a starting point for reading, or to come back to later:
- Search Engine Watch's Search Engine Glossary - simple but useful.
- I-Search Digest Search Engine Terms - a very comprehensive glossary with
links to many of the resources it discusses. Includes brief analyses of all
the main web search engines, as well as technical terms. If you read this
glossary you will know just about every bit of jargon there is relating to
web searching.
- Scott Weiss' Glossary for Information Retrieval - a much more technical
glossary, oriented toward information retrieval research. Not for the
faint-hearted.
In the late 1950s, the basic technique underlying most computerised searching
of large bodies of text was developed, variously known as the 'concordance', 'inverted
file' or 'word occurrence index'. In summary, it involves the construction of
an alphabetical list of every different word in each document in a set of documents,
with the locations of each occurrence of that word recorded next to it. 'Searches'
of the documents are in fact carried out over this 'concordance', not over the
actual texts. (See a short
history of text retrieval in law for more details.)
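To make the mechanism concrete, here is a minimal sketch in Python of building
and searching such a concordance. It is an illustration only: the documents,
the tokenisation, and the choice to record (document id, word position) pairs
are all assumptions, not any particular system's format.

    import re
    from collections import defaultdict

    def build_concordance(documents):
        """Build an inverted file: for every distinct word, record the
        location (document id, word position) of each occurrence."""
        concordance = defaultdict(list)
        for doc_id, text in documents.items():
            for position, word in enumerate(re.findall(r"[a-z']+", text.lower())):
                concordance[word].append((doc_id, position))
        return concordance

    # Illustrative document set (hypothetical)
    docs = {
        1: "The tenant may terminate the lease.",
        2: "The landlord may not terminate without notice.",
    }
    index = build_concordance(docs)
    # Searches run over this index, not the raw text:
    print(index["terminate"])  # [(1, 3), (2, 4)]

A fuller system would also record locations such as paragraph and sentence
numbers (compare the 'five-place' concordance mentioned below), while a
simpler one might record only document numbers.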
There seem to be few explanations of concordances or the underlying operation
of search engines available on the web. The following reference, although it
is old (1988), does explain the essentials of concordance-based searching:
Graham Greenleaf, Andrew Mowbray and David Lewis,
Retrieval Techniques (extract from Chapter 2, 'Basic principles of legal
information retrieval', Australasian Computerised Legal Information Handbook,
Butterworths, 1988).
The following points should be noted:
- The text retrieval system described in this book uses a relatively complex
five-place concordance. Many text retrieval systems in use on the internet use
far less complex concordances, sometimes recording only the number of a document
in which a word is found, but not the location of the word within the document
(much less the location of paragraphs or sentences).
- For discussion of some basic concepts of text retrieval, and concepts used
in indexing, see Dabney 'II. A Basic Model for Document Retrieval
Systems' (in Dabney, cited below) [No longer on the web]
- Howard Turtle 'Text retrieval in the legal world' - 5.1 Exact match models
Artificial Intelligence & Law, (1995) Vol 3 Nos 1-2 (see LAWS 4609
Course Materials #3 at the Law Reserve Desk).
How do you determine the effectiveness of a text retrieval system? Although Jon
Bing says 'the science of information retrieval lacks a comprehensive theoretical
foundation', a lot of experimental effort has been made, with 'recall' and 'precision'
as the two main measurements of the quality of retrieval results.
The effectiveness of full text retrieval is often measured in terms of precision
and recall. The following table illustrates this. Assume that there are 100 documents
in a collection being searched. The search retrieves 10 documents, only 4 of which
(after inspection) are found to be relevant. However, after inspecting all the
other 90 documents, we find that there are 2 relevant documents not found by the
search.
                 | Relevant        | Not relevant         | Total
Retrieved        | Hits: a = 4     | False drops: b = 6   | a + b = 10
Not retrieved    | Misses: c = 2   | Dodged: d = 88       | c + d = 90
Total            | a + c = 6       | b + d = 94           | a + b + c + d = 100

Precision and recall table (derived from Miranda Pao - reference below)
- 'Precision' is the ratio of a to (a + b), or the proportion of relevant
material retrieved to all material retrieved. Retrieval of irrelevant material
(ie b) lowers this ratio. Here, the ratio is 4 / 10 = 40%.
- 'Recall' is the ratio of a to (a + c), or the proportion of relevant
material retrieved to all relevant material retrievable. Failure to retrieve
relevant material (ie c) lowers this ratio. Here, the ratio is 4 / 6 ≈ 67%.
Ideally, both precision and recall should be as close to 1/1 (ie 1, or 100%)
as possible. We want to retrieve all relevant documents (perfect recall) and no
irrelevant documents (perfect precision).
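As a quick worked check of these figures, here is a minimal sketch in Python
(the variable names a, b and c follow the cells of the table above):

    def precision_recall(a, b, c):
        """a = relevant retrieved (hits), b = irrelevant retrieved (false
        drops), c = relevant not retrieved (misses)."""
        precision = a / (a + b)
        recall = a / (a + c)
        return precision, recall

    p, r = precision_recall(a=4, b=6, c=2)
    print(f"precision = {p:.0%}, recall = {r:.0%}")  # precision = 40%, recall = 67%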
See Precision and recall (in Greenleaf, Mowbray and Lewis, cited above).
For a lengthier discussion, see Dabney 'IV. Key Ideas in the Evaluation of
Document Retrieval Systems' (in Dabney, cited below).
The principal problem in the science of information retrieval is caused by the
simple fact that no information retrieval system delivers both perfect recall
and perfect precision.
Many who have studied information retrieval have alleged that there is an inverse
relationship between precision and recall: the more we do to improve precision,
the more recall drops; the more we do to increase recall, the more precision drops.
Among others, this is asserted by Pao (reference below, p12), and suggested by
Blair and Maron (reference below). Summarising 30 years of information retrieval
research, Pao says 'recall and precision are bound to lie within the 40 to 60
percent range', but says that the inverse relationship is often noted but not proven.
This leads to two main issues:
- How do we define the most desirable 'balance' between precision and recall
(if you can't have 100% of each)? Does the answer depend on the type of
documents, and the purpose of the search task?
- What can we do to change retrieval systems so that both precision and
recall improve?
Bing classifies the search features of text retrieval systems as 'recall devices'
or 'precision devices' (Jon Bing in an extract in LAWS 4609 Course Materials 1996
#3 p37 at the Law Reserve Desk), depending on whether their use enhances recall
or precision. For example, truncation and thesauri enhance recall, as does use
of synonyms (and the OR connector), whereas use of the AND connector, and limiting
a search to specified databases or parts of databases (eg headnotes), improves
precision.
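The effect of the two connectors can be sketched against a toy inverted index
(the index contents here are invented for illustration): OR takes the union of
the documents containing each term, while AND takes the intersection.

    def search_or(index, terms):
        """Recall device: documents containing ANY of the terms (union)."""
        return set().union(*(index.get(term, set()) for term in terms))

    def search_and(index, terms):
        """Precision device: documents containing ALL of the terms (intersection)."""
        sets = [index.get(term, set()) for term in terms]
        return set.intersection(*sets) if sets else set()

    # Toy index: word -> set of document numbers (illustrative)
    index = {"lease": {1, 2, 3}, "tenancy": {2, 4}}
    print(search_or(index, ["lease", "tenancy"]))   # {1, 2, 3, 4} - broader set, higher recall
    print(search_and(index, ["lease", "tenancy"]))  # {2} - narrower set, higher precision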
The following materials are not available on the internet but provide valuable
discussion:
- Howard Turtle 'Text retrieval in the legal world' - 3. Evaluation in IR
Artificial Intelligence & Law, (1995) Vol 3 Nos 1-2 (see LAWS 4609
Course Materials 1996 #3 p46 at the Law Reserve Desk).
- Miranda Pao `Evaluation and Measurement' extracts from Chapter 11, Concepts
of Information Retrieval, Libraries Unlimited, Colorado, 1989 (see LAWS
4609 Course Materials 1996 #3 p34 at the Law Reserve Desk)
D Blair and M E Maron 'An evaluation of retrieval effectiveness for a full-text
document retrieval system' (1985) 28(3) Communications of the ACM 289 (see
LAWS 4609 Course Materials 1996 #3 p38 at the Law Reserve Desk for an extract)
is the most famous and controversial article in the field of text retrieval and
law. This is simply because it is one of very few large-scale empirical studies
in text retrieval, and just about the only one related to legal information.
Blair and Maron attacked conventional wisdom by arguing that recall was much
lower than the 75% recall that users of this system (lawyers) demanded and thought
they were getting (because that is when they stopped searching). They were in
fact only getting 20% recall, but with 75% precision. This gives some support
for the alleged inverse precision/recall relationship.
Some additional comments on the experiment:
- It is one of few experiments in a very large real-life legal database
(40,000 documents, 350,000 pages) and shows the care and expense involved in
serious information retrieval experiments (which is why there are relatively few).
- Some aspects of the experiment are questionable. Relevance [both
A and C] was determined by the users themselves (blind controls on samples)
(see p29). To determine recall they couldn't test the relevance of all
non-retrieved documents, so they only tested samples; it is unclear whether they
only compared what was retrieved from the samples with what wasn't (p15),
but whatever they did, this approach may have been flawed because (i) the
'sample frames' that were tested were those 'rich in relevant documents',
which must skew the result against recall if searches are over the whole
database; and (ii) if they only compared part of a database they are using
'unrealistically small sample sizes', like others they criticise. However,
they insist that their experiment 'gives a maximum value for recall for each
result' (p15) - but we might question whether in fact it minimises recall.
- Were the searches a fair test? They seem to assume that the lawyers
/ paralegals were being rational (from experience) in keeping their queries
as they did. The searchers often used 5 concept terms connected by AND, because
they thought that to broaden the search would risk 'output overload'. However,
if this is how they searched, then unless they used synonyms extensively,
recall must be low. Blair and Maron's defence of the searchers used is essentially
that these were the best and most experienced searchers available - if they
can't use free text retrieval, who could?
- They assert that recall also 'decreases as the size of the database
increases' (p31), but don't explain why. They refer to 'output overload'
as a 'frequent problem of full-text retrieval systems' (though that is the
exact reverse of their results!) and say that users counter it by using the
AND operator to limit the number of documents retrieved.
- They conclude that the assumptions underlying full text retrieval are
wrong, and that earlier studies supporting the effectiveness of full
text retrieval were wrong because they were not based on large databases.
It is possible that Blair & Maron's experiment is mainly significant in showing
that Boolean retrieval or 'exactness' retrieval (as Bing calls it) without
a relevance ranking method gives psychological encouragement to attempt to reduce
large potentially relevant sets by inappropriate means. The use of relevance ranking
to overcome the deficiencies of Boolean retrieval will be discussed in later parts
of this topic.
There is considerable dispute, prompted to a large extent by Blair and Maron's
experiment, over what lawyers actually want and demand from computerised
retrieval: do they prefer to maximise recall or precision? Dabney and Burson
have contributed to this debate.
Daniel Dabney 'The Curse of Thamus' - see 'Ramifications for the Users
of CALR Systems' in 'The Curse of Thamus: An Analysis of Full-Text Legal
Document Retrieval' (1986) 78(5) Law Library Journal 5-40 (was on Yale Law
School web site - no longer on the web)
Some notes on Dabney's approach (GG):
- He accepts (like Blair & Maron) that lawyers want recall above all else,
but unlike B&M he is talking about case research; he considers they will
accept relatively low precision.
- He gives reasons why case/statute systems may behave differently from litigation
support systems in text retrieval performance, but the factors he gives go
both ways. The most sensible conclusion is that another experiment is needed
to prove anything significant about legal research systems (as opposed to
litigation support systems).
- He identifies 4 types of queries where computerised retrieval is most effective,
particularly fact retrieval and citation retrieval.
- He makes 5 suggestions as to how retrieval performance may be improved, but he
takes a somewhat defeatist approach, and seems to largely accept B&M's
conclusions, even for research systems.
Scott F Burson 'A reconstruction of Thamus: comments on the evaluation
of legal information retrieval systems' (1987) 79 Law Library Journal 134 (see
LAWS 4609 Course Materials 1996 #3 p94 at the Law Reserve Desk)
Some notes on Burson's approach (GG):
- He starts by interpreting Dabney as saying that Lexis and Westlaw probably
only provide 20% recall - and that this 'conjecture' is 'almost certainly
correct'. In doing so he accepts the similarity of the use of text retrieval
for litigation support (as by Blair and Maron) and for legal research - and
this is a dubious assumption.
- He says Dabney is mistaken in arguing for improved recall, as this will degrade
performance. He argues that lawyers want precision, not recall, in computerised
tools - but since these are not the only tools available, they can obtain
recall by other means.
- He argues high recall was necessary in the B&M study, as an information
retrieval system was the only method of recall available. He then argues this
isn't necessary in case/statute systems, simply because other (manual)
research tools are also available which can complement the computerised tools.
- He asserts that changing tools to increase recall will lower precision,
but without evidence, and is against such changes. Is he just assuming an
immutable inverse relationship?
- His opinion is that lawyers prefer high precision to high recall in computerised
research tools, because they then have other tools available to obtain high
recall (eg following up citations). In doing so he simply avoids the question
of whether they think they are getting high recall, as in B&M's study,
or any evidence of whether they know what they can effectively get.
- Again, this is a defeatist attitude, with no discussion of possible improvements
to tools so as to improve recall without degrading precision, as this is assumed
to be an impossible 'will o' the wisp'. Burson simply accepts low recall in
computer-assisted legal research [cf litigation support] but isn't concerned.
As a result of these discussions, it is clear that the supposed inverse relationship
between precision and recall poses a very difficult dilemma for the use of text
retrieval systems in law. It may well be that the use of relevance ranking systems,
either in conjunction with boolean retrieval or separately from it, provides the
way out of this dilemma. This will be discussed in the subsequent readings on
legal research via the internet, where the greatest use of relevance ranking systems
has been made.
- Daniel Dabney 'A Reply to West Publishing Company and Mead Data Central
on The Curse of Thamus' (1986) 78 Law Library Journal 349 (was on Yale Law
School web site - no longer on the web)
One of the most important innovations in text retrieval, which can be used both
as an alternative and an enhancement to boolean retrieval, is relevance ranking.
Such systems are also sometimes called 'best match' or 'probability' models
(Turtle) or a 'statistical interface' (Feldman). Relevance ranking is also discussed
in Part 5 of this Reading Guide in the context of internet legal research, where
it has had the most effect.
The common feature of the various approaches to relevance ranking is the attempt
to rank documents by the probability that they will be relevant to the query,
such that the most relevant document is first in the list of retrieved documents,
the next most relevant is displayed second, and so on.
The simplest relevance ranking system would be to count the number of times
each search term occurred in a document, sum those counts across all the search
terms, and list the document with the highest total as 'most relevant'. This
would be a very crude measure.
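As a sketch of that crude measure (documents shown as pre-tokenised word
lists, an illustrative simplification):

    def crude_rank(documents, query_terms):
        """Rank by the total number of occurrences of the query terms:
        the 'most occurrences = most relevant' measure described above."""
        scores = {
            doc_id: sum(words.count(term) for term in query_terms)
            for doc_id, words in documents.items()
        }
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)

    docs = {
        "A": ["lease", "lease", "tenant", "notice"],
        "B": ["lease", "landlord"],
    }
    print(crude_rank(docs, ["lease", "tenant"]))  # [('A', 3), ('B', 1)]

Its crudeness is obvious: long documents and very common words dominate the
score, which is what the weightings discussed below attempt to correct.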
Howard Turtle 'Text retrieval in the legal world' Parts 5.5 and 5.8 in Artificial
Intelligence & Law, (1995) Vol 3 Nos 1-2 discusses the basic elements
of relevance ranking systems.
The main question is how to calculate the 'significance' of a particular occurrence
of a word in a document, as a means of assessing the relative likely relevance
of the document in which it occurs. The second question is how to compound the
measures of significance of each occurrence of each search term so as to give
an overall measure of significance of a document.
Turtle says that two measures of word occurrence significance are used in
most relevance ranking systems:
(i) Inverse document frequency = a measure based on the proportion of
documents in the whole database in which the term occurs, inverted so that
the rarer the term, the higher the measure. This means (very roughly) that
terms which occur in relatively fewer documents in the whole collection are
given greater weight (also called the 'discrimination value' of the term).
(ii) Within document frequency = the number of occurrences of a term
in a document divided by the number of words in the whole document. Therefore
(again in rough terms) a word that occurs a lot in a short document gets a high
score on this criterion.
One approach is then that the weighting to be given to an occurrence of a
term in a particular document is the product of these measures for that term.
As a result, for example, a term that occurs a lot in a short document (and
so has high within document frequency), but doesn't occur very often in the
whole database (and so has a high inverse document frequency) will get a very
high overall weighting.
Finally, each occurrence of a search term in a document is multiplied
by its appropriate weighting. The measure of relevance of the whole document
would then be the sum of the weights of each occurrence of each search term
in the document. So, a document which has many occurrences of search terms which
have high overall weightings will be regarded as a document high in relevance.
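A minimal sketch of this scheme, under stated assumptions: documents are word
lists, the inverse document frequency is given the common logarithmic form
log(N / df) (one of several possible formulations; Turtle does not prescribe
a particular formula here), and a document's score is the sum, over the query
terms, of within-document frequency multiplied by inverse document frequency.

    import math

    def score(document, query_terms, all_documents):
        """Sum over the query terms of (within-document frequency) x
        (inverse document frequency)."""
        n_docs = len(all_documents)
        total = 0.0
        for term in query_terms:
            tf = document.count(term) / len(document)       # within-document frequency
            df = sum(term in doc for doc in all_documents)  # documents containing the term
            if df:
                total += tf * math.log(n_docs / df)         # rarer terms weigh more
        return total

    docs = [
        ["lease", "tenant", "lease"],
        ["landlord", "notice"],
        ["lease", "notice", "landlord", "tenant"],
    ]
    query = ["lease", "tenant"]
    for doc in sorted(docs, key=lambda d: score(d, query, docs), reverse=True):
        print(round(score(doc, query, docs), 3), doc)

As the text predicts, the short document in which the query terms occur most
densely ranks first.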
Many search engines implement measures of relevance more complex than this
simple one.
AustLII implements relevance ranking in its SINO
search engine in two ways (selected from the 'Find' setting), as both an alternative
to boolean searching and as an enhancement to it:
- 'Any of these words' searches, which are based solely on
the relevance ranking of search terms, with no boolean or proximity operators
able to be used. The relevance ranking algorithm is described briefly in
Freeform Search Help ('any of these words' used to be called 'Freeform'
on AustLII).
- 'Boolean query' searches, which are in fact not just boolean queries,
but have results ordered by likely relevance. A boolean search is first carried
out, and only those documents which satisfy the boolean search (ie an 'exact
match' search) are then ranked into likely order of relevance.
The availability of relevance ranking can alter a user's search strategy using
boolean retrieval. It is often wise to use a broad boolean search in order to
maximise recall, relying upon the relevance ranking to supply the precision that
the boolean search lacks.
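A sketch of this combined strategy (an illustration of the general approach
only, not AustLII's actual SINO algorithm): a broad boolean AND filter supplies
the exact-match set, and a relevance score, here a deliberately naive one,
orders it.

    def naive_score(doc, terms):
        # Naive relevance: fraction of the document made up of query terms
        return sum(doc.count(term) for term in terms) / len(doc)

    def boolean_then_rank(documents, required_terms):
        """'Boolean query' style: exact-match filter first, then order
        the matching documents by likely relevance."""
        matches = [doc for doc in documents
                   if all(term in doc for term in required_terms)]
        return sorted(matches, key=lambda d: naive_score(d, required_terms),
                      reverse=True)

    docs = [
        ["lease", "tenant"],
        ["lease", "tenant", "notice", "lease"],
        ["landlord", "notice"],
    ]
    print(boolean_then_rank(docs, ["lease", "tenant"]))
    # [['lease', 'tenant'], ['lease', 'tenant', 'notice', 'lease']]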
There is a brief discussion of how popular internet search engines rank web
pages in
How Search Engines Rank Web Pages (on Search Engine Watch).
If a search engine uses relevance ranking in any way, it is no longer possible
to evaluate its results using simple measures of recall and precision. For example,
if the first 50 documents retrieved by a relevance-ranked search are all or most
of the relevant documents in a database, it does not much matter that the search
retrieves 100 documents and the last 50 are not very relevant, because the user
can just keep working down the list until the degree of relevance is no longer
high enough to warrant reading additional items.
Turtle (see 3.2 in his article) suggests methods of constructing a precision
/ recall curve, which shows precision values at pre-defined recall points (eg from
0.1 to 1.0 of all relevant documents). One way of thinking about such a curve
is that it measures such things as 'if a user stops browsing when precision
drops below 50% (ie only one in two documents is regarded as relevant), does
this occur when recall of only 30% of relevant documents has occurred, or not
until 90% of relevant documents have been recalled?'
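One way to construct such a curve, sketched under the assumption that the
relevance of each ranked document is known (as it would be in an evaluation
experiment), is to record precision each time a further relevant document is
found while working down the ranked list:

    def precision_recall_curve(ranked_relevance, total_relevant):
        """ranked_relevance: booleans, True where the i-th ranked document
        is relevant. Returns (recall, precision) pairs, one per relevant
        document found."""
        curve, found = [], 0
        for rank, is_relevant in enumerate(ranked_relevance, start=1):
            if is_relevant:
                found += 1
                curve.append((found / total_relevant, found / rank))
        return curve

    # Illustrative ranked result over a collection with 6 relevant documents
    ranking = [True, True, False, True, False, False, True, False, True, True]
    for recall, precision in precision_recall_curve(ranking, total_relevant=6):
        print(f"recall {recall:.2f}  precision {precision:.2f}")

In this invented example precision falls as the user reads further down the
list in pursuit of higher recall, which is the trade-off such curves make
visible.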
Westlaw's WIN and Lexis' 'Freestyle' were two pre-internet examples of so-called
'natural language' searching, which were early applications of relevance ranking
to legal documents. In the LAWS 4609 Course Materials #3 at the UNSW Law Reserve
Desk there are extracts from search manuals: WestLaw 'Retrieving documents with
WIN - WestLaw is natural' (Chapter 5 of Introduction to WestLaw (1993),
WestLaw) at p105, and Lexis 'Freestyle' searching (information from Lexis help
screens) at p110.
For discussion of evaluations, see:
- Sheilla Desert, 'Westlaw's Natural v. Boolean Searching: A Performance
Study' 85 Law Library Journal 713 (was on Yale Law School web site
- no longer on the web)
The purpose of these extracts is to act as an introduction to how hypertext and
text retrieval may be combined in unusual ways to produce a result where 'the
sum is greater than the parts'. In section 5 following, dealing with legal research
on the internet, there is a similar theme: it is necessary to use 'intellectual
indexing' (essentially structures created using hypertext) in addition to full
text retrieval, in order to get the best results.
The main reading for this section is extracts from Graham Greenleaf, Andrew Mowbray
and Peter van Dijk 'Representing
and using legal knowledge in integrated decision support systems: DataLex WorkStations'
Artificial Intelligence and Law, Kluwer, Vol 3, Nos 1-2, 1995, 97-124.
This pre-AustLII paper outlines an approach to the integration of hypertext and
text retrieval in legal applications, much but not all of which has subsequently
been implemented on AustLII. In a later reading guide, the integration of both
these technologies with inferencing systems is discussed.
It is only necessary to read the parts listed below at this stage.
Paquin, Blanchard and Thomasset 'Loge-expert: From a legal expert system to
an information system for non-lawyers' Proceedings of the 4th International
Conference on Artificial Intelligence and Law, ACM Press 1991, p254 (copy
available at the UNSW Law Reserve desk in LAWS 4609 Materials #3 (1996) Theories
of computerising law: hypertext and text retrieval at p20 - ignore the expert
systems aspects for the moment).
This article contains one of the few attempts to compare the benefits and
costs of using text retrieval, hypertext and expert systems to computerise legal
information.
Diagram of relative costs and effectiveness of different forms of computerisation
of law (following Paquin, Blanchard, and Thomasset 1991)
- By 'noise' they mean the opposite of precision (items retrieved but not
relevant). By 'silence' they mean the opposite of recall (items relevant
but not retrieved). (See the small sketch after this list.)
- These authors developed their theory (in pre-internet times) before the widespread
use of relevance ranking (which makes arguments about the high level of 'noise'
in text retrieval much harder to sustain), and before it was clear that large-scale
automation of the creation of hypertext links was possible (which dramatically
reduces the costs of creating hypertexts).
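In terms of the contingency table used earlier (a hits, b false drops, c
misses), 'noise' and 'silence' are simply the complements of precision and
recall, as this small sketch shows:

    def noise(a, b):
        """Items retrieved but not relevant, as a fraction of all retrieved (= 1 - precision)."""
        return b / (a + b)

    def silence(a, c):
        """Relevant items not retrieved, as a fraction of all relevant (= 1 - recall)."""
        return c / (a + c)

    print(noise(4, 6), silence(4, 2))  # 0.6 and 0.333..., for the earlier worked example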
There are many enhancements to basic boolean ('exact match') retrieval systems,
and quite a few alternative approaches to boolean retrieval. Some have been tested
for legal information retrieval.
Two surveys of the various approaches that have been taken to text retrieval
in law are:
Howard Turtle 'Text retrieval in the legal world' Artificial Intelligence &
Law, (1995) Vol 3 Nos 1-2 (see LAWS 4609 Course Materials #3 at the UNSW Law
Reserve Desk) is one of the most systematic surveys of pre-internet approaches
to legal information retrieval. It is often quite technical but is recommended
highly. Howard Turtle played an important role in developing the WIN (WestLaw
is Natural) relevance ranking retrieval system for WestLaw.
Eric Schweighofer
'The Revolution in Legal Information Retrieval or: The Empire Strikes Back'
- 1999 (1) The Journal of Information, Law and Technology (JILT). Schweighofer's
paper is for the most part less technical than Turtle's. Among the many projects
summarised and discussed, Schweighofer includes summaries of: