Information retrieval
Zipf's Law
We can analyse part of a sentence, such as a subphrase describing a protein–protein interaction or part of a sentence containing a gene and a protein name, but we run into Zipf's law as soon as we write down the rules for how the extraction is done (http://dx.doi.org/10.1371/journal.pbio.0030065.g002).
A small number of patterns describe a reasonable portion of protein–protein interactions, gene names, or mutations, but many of those entities are described by a pattern of words that's only ever used once. Even if we could collect them all, which is impossible, we can't stop new phrases from being used. (Rebholz-Schuhmann, Dietrich & Kirsch, Harald & Couto, Francisco, Facts from Text---Is Text Mining Ready to Deliver?, in 3 PLoS Biology, 2, e65 (2005), http://dx.doi.org/10.1371/journal.pbio.0030065)
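The rank-frequency relation this refers to can be made concrete with a few lines of code. The sketch below (Python, purely illustrative, with a toy token list standing in for real extraction patterns or phrases) counts how often each item occurs and lists frequency by rank; under Zipf's law frequency falls off roughly as 1/rank, so a handful of items cover most occurrences while many items occur exactly once.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent item first."""
    counts = Counter(tokens)
    return list(enumerate(sorted(counts.values(), reverse=True), start=1))

# Toy data; in practice the items would be the words or extraction
# patterns observed across a document collection.
tokens = "the cat sat on the mat the dog sat on the rug".split()
pairs = rank_frequency(tokens)
for rank, freq in pairs:
    # Under Zipf's law freq is roughly proportional to 1/rank.
    print(rank, freq)

# The long tail: the fraction of distinct items that occur only once.
hapaxes = sum(1 for _, freq in pairs if freq == 1)
print("used only once:", hapaxes / len(pairs))
```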
Index-, topic-, and content words
It is generally accepted within the text retrieval community that the words used in documents relate to the documents' content, and that documents with similar content will tend to use a similar vocabulary. Therefore, text retrieval systems index and compare documents by the words they contain. The quality of these index words is usually measured by how well the words can discriminate a topic or document from all other topics or documents.
Several researchers have attempted to evaluate index words by their content. Damerau, for example, uses word frequency to extract domain-oriented vocabulary, which he defines as "a list of content words (not necessarily complete) that would characteristically be used in talking about a particular subject, say education, as opposed to the list of words used to talk about, say aviation". However, his evaluation method does not guarantee that the words are domain-oriented. It only shows that the selected words are used more often in the domain's documents. Therefore, good discriminators can easily be accepted as topic words by this test if they happen to be used more often in one domain than in the other domains tested. But a good discriminating word is not necessarily a content word.
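A minimal sketch of the frequency-comparison idea behind such approaches (an illustration, not Damerau's actual procedure; the corpora, the smoothing, and the threshold are assumptions): words whose relative frequency in a domain corpus is much higher than in a reference corpus are taken as candidate domain vocabulary. As noted above, passing this test only shows that a word is used more often in the domain's documents, not that it is a genuine content word.

```python
from collections import Counter

def domain_candidates(domain_tokens, reference_tokens, min_ratio=5.0):
    """Words used much more often (relatively) in the domain corpus
    than in the reference corpus."""
    dom, ref = Counter(domain_tokens), Counter(reference_tokens)
    n_dom, n_ref = sum(dom.values()), sum(ref.values())
    candidates = {}
    for word, count in dom.items():
        p_dom = count / n_dom
        # Add-one smoothing so words absent from the reference corpus
        # do not cause a division by zero.
        p_ref = (ref[word] + 1) / (n_ref + 1)
        if p_dom / p_ref >= min_ratio:
            candidates[word] = p_dom / p_ref
    return candidates
```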
Manning and Schütze define non-content words informally as "words that taken in isolation … do not give much information about the contents of the document". Note that although a content word is usually relevant to the topic being discussed in the document, it does not need to be.
Many researchers opt to remove stopwords using a preset stopword list for efficiency and effectiveness. However, these stopword lists vary from one database to another and from one language to another, so it is desirable that topic relevance measures be able to identify these non-content words as such regardless of the language or database used. (Al-Halimi, Reem K. & Tompa, Frank W., Using Word Position in Documents for Topic Characterization, Technical Report CS-2003-36, University of Waterloo, Canada, October 2003, http://www.cs.uwaterloo.ca/research/tr/2003/36/relevance_measures_TR.pdf)
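One simple corpus-driven heuristic in this spirit (an illustration only, not the measure proposed in the cited report) is to flag words that occur in nearly every document of the collection: such words discriminate almost nothing, whatever the language or database.

```python
from collections import Counter

def candidate_stopwords(docs, df_threshold=0.8):
    """Flag words occurring in more than df_threshold of all documents.

    docs is a list of token lists; the threshold is an arbitrary choice.
    """
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    n_docs = len(docs)
    return {word for word, df in doc_freq.items() if df / n_docs > df_threshold}
```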
Luhn used [Zipf's Law] as a null hypothesis to enable him to specify two cut-offs, an upper and a lower, thus excluding non-significant words. The words exceeding the upper cut-off were considered to be common and those below the lower cut-off rare, and therefore not contributing significantly to the content of the article. He thus devised a counting technique for finding significant words. Consistent with this, he assumed that the resolving power of significant words, by which he meant the ability of words to discriminate content, reached a peak at a rank-order position halfway between the two cut-offs and from the peak fell off in either direction, reducing to almost zero at the cut-off points. A certain arbitrariness is involved in determining the cut-offs. There is no oracle which gives their values. They have to be established by trial and error.
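A minimal sketch of that counting technique follows. The cut-off values used here are arbitrary placeholders; as the passage says, in practice they have to be established by trial and error for each collection.

```python
from collections import Counter

def significant_words(tokens, lower_cutoff=2, upper_cutoff=50):
    """Luhn-style selection: keep words that are neither too common
    (above the upper cut-off) nor too rare (below the lower cut-off)."""
    counts = Counter(tokens)
    return {word for word, freq in counts.items()
            if lower_cutoff <= freq <= upper_cutoff}
```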
The fundamental hypothesis made now is that a content-bearing word is a word that distinguishes more than one class of documents with respect to the extent to which the topic referred to by the word is treated in the documents in each class.
In the simple case of removing high-frequency words by means of a 'stop' word list, we are attempting to increase the level of discrimination between documents. (van Rijsbergen, C. J., Information Retrieval, Butterworths, London, 2nd ed., 1979, http://www.dcs.gla.ac.uk/Keith/Preface.html)
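One well-known way to make "level of discrimination between documents" operational is the term discrimination value: compare the average pairwise document similarity with and without a given term. The sketch below assumes documents are sparse term-weight dictionaries and uses cosine similarity; this particular formulation is illustrative and is not spelled out in the passage quoted above.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def average_similarity(docs, drop=None):
    """Average cosine similarity over all document pairs, optionally
    with one term removed from every document."""
    vecs = [{t: w for t, w in d.items() if t != drop} for d in docs]
    pairs = list(combinations(vecs, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def discrimination_value(docs, term):
    """Positive when removing the term makes the documents look more
    alike on average, i.e. the term helps tell documents apart; near
    zero (or negative) for words that occur almost everywhere."""
    return average_similarity(docs, drop=term) - average_similarity(docs)
```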
TF * IDF
There are various ways of combining term frequencies and inverse document frequencies, and empirical studies (Salton, Gerard & Yang, Chung-Shu, On the specification of term values in automatic indexing, in 29 Journal of Documentation, 4, 351-372 (1973)) show that the optimal combination may vary from collection to collection. Generally, tf is multiplied by idf to obtain a combined term weight. An alternative would be, for instance, to entirely discard terms with idf below a set threshold, which seems to be slightly better for searchers that require high precision. Both measures are usually smoothed by taking logarithms (or some similar simple transformation) rather than using the straight measure, to avoid dramatic effects of small numbers. (Karlgren, Jussi, The Basics of Information Retrieval: Statistics and Linguistics, 2000, http://www.sics.se/ jussi/Undervisning/texter/ir-textbook.pdf)
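A worked sketch of the usual combination described above, using one common log-smoothed variant, (1 + log tf) * log(N / df). Whether this variant, a different smoothing, or an idf threshold works best depends on the collection, per the studies cited.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Log-smoothed tf*idf weights for every document in a collection.

    docs is a list of token lists; returns one {term: weight} dict per
    document. Terms occurring in every document get weight 0.
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency of each term
    weighted = []
    for doc in docs:
        tf = Counter(doc)            # raw term frequency in this document
        weighted.append({
            term: (1 + math.log(count)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weighted
```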
Document length
Most algorithms in use introduce document length as a normalization factor of some sort, if the documents in the document base vary in length (Salton, Gerard & Buckley, Christopher, Term weighting approaches in automatic text retrieval, in 24 Information Processing and Management, 5, 513-523 (1988)). (Karlgren, Jussi, The Basics of Information Retrieval: Statistics and Linguistics, op. cit.)
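The source only speaks of "a normalization factor of some sort", so the following is one common choice rather than the method of the papers cited: cosine (unit-length) normalization of the weight vector, so that longer documents do not score higher simply because they contain more terms.

```python
import math

def length_normalise(weights):
    """Divide each term weight by the vector's Euclidean norm, so that
    weight vectors are comparable across documents of different lengths.
    Other normalizations (e.g. dividing by raw document length) are
    also used in practice."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {term: w / norm for term, w in weights.items()}
```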
Stopwords