Application of Ranganathan's Laws to the Web

Abstract
This paper analyzes the Web and raises a significant question: "Does the Web save the time of the users?" This question is analyzed in the context of the Five Laws of the Web. What do these laws mean? The laws are meant to be elemental, to convey a deep understanding and capture the essential meaning of the World Wide Web. These laws may seem simplistic, but in fact they express a simple, crystal-clear vision of what the Web ought to be. Moreover, we intend to echo the simplicity of Ranganathan's Five Laws of Library Science, which inspired them.
Keywords
World Wide Web, Ranganathan's laws, Five Laws of Library Science

--------------------------------------------------------------------------------



Introduction
The World Wide Web is an Internet system that distributes graphical, hyperlinked information, based on the hypertext transfer protocol (HTTP). The Web is a global hypertext system providing access to documents written in Hypertext Markup Language (HTML), which allows their contents to be interlinked, locally and remotely. The Web was designed in 1989 by Tim Berners-Lee at the European Organization for Nuclear Research (CERN) in Geneva (Noruzi, 2004).

We live in exciting times. The Web, whose history spans a mere dozen years, will surely figure amongst the most influential and important technologies of this new century. The information revolution not only supplies the technological horsepower that drives the Web, but fuels an unprecedented demand for storing, organizing, disseminating, and accessing information. If information is the currency of the knowledge-based economy, the Web will be the bank where it is invested. A powerful added value of the Web is that users can access resources electronically that, for whatever reason, are not in traditional paper-based collections. The Web provides materials and makes them accessible online so they can be used. This is the real difference between the Web and libraries. Therefore, webmasters build web collections not for vanity but for use.

The Web is interested in its cybercitizens (users) using its resources for all sorts of reasons: education, creative recreation, social justice, democratic freedoms, improvement of the economy and business, support for literacy, lifelong learning, cultural enrichment, etc. The outcome of this use is the betterment of the individual and the community in which we live: the social, cultural, economic and environmental well-being of our world. So the Web must recognize and meet the information needs of the users, and provide broad-based services.

The Five Laws of Library Science
Shiyali Ramamrita Ranganathan (1892-1972) was considered the father of Library Science in India. He developed what has been widely accepted as the definitive statement of ideal library service. His Five Laws of Library Science (1931) is a classic of library science literature, as fresh today as it was in 1931. These brief statements remain as valid today, in substance if not in expression, as when they were promulgated, concisely representing the ideal service and organizational philosophy of most libraries today:

Books are for use.
Every reader his or her book.
Every book its reader.
Save the time of the reader.
The Library is a growing organism.
Although these statements might seem self-evident today, they certainly were not to librarians in the early part of the 20th century. The democratic library tradition we currently enjoy had arisen in America and England only in the latter part of the nineteenth century (Sayers, 1957). For Ranganathan and his followers, the five laws were a first step toward putting library work on a scientific basis, providing general principles from which all library practices could be deduced (Garfield, 1984).

In 1992, James R. Rettig posited a Sixth Law, an extension of Ranganathan's laws. He conceived of that Sixth Law, "Every reader his freedom," as applicable only to the type of service provided (i.e., instruction or provision of information).

New information and communication technologies suggest that the scope of Ranganathan's laws may appropriately be extended to the Web. Nowadays the same five laws are discussed and reused in many different contexts. Since 1992, the 100th anniversary of Ranganathan's birth, several modern scholars of library science have attempted to update his five laws, or they reworded them for other purposes.

'Book, reader, and library' are the basic elements of Ranganathan's laws. Even if we replace these keywords with other elements, Ranganathan's laws still work very well. Based on Ranganathan's laws, several researchers have presented different principles and laws. For instance, "Five new laws of librarianship" by Michael Gorman (1995); "Principles of distance education" by Sanjaya Mishra (1998); "Five laws of the software library" by Mentor Cana (2003); "Five laws of children's librarianship" by Virginia A. Walter (2004); "Five laws of web connectivity" by Lennart Björneborn (2004); and "Five laws of diversity/affirmative action" by Tracie D. Hall (2004).

Gorman's laws are the most famous. He has reinterpreted Ranganathan's laws in the context of today's library and its likely future. Michael Gorman has given us his five new laws of librarianship:

Libraries serve humanity.
Respect all forms by which knowledge is communicated.
Use technology intelligently to enhance service.
Protect free access to knowledge; and
Honor the past and create the future (Crawford & Gorman, 1995).
Gorman (1998a, b) credits S.R. Ranganathan with inventing the term 'Library Science' and beautifully demonstrates how Ranganathan's laws apply to the future issues and challenges that librarians will face. Gorman's laws are not a revision of Ranganathan's laws, but another completely separate set, written from the point of view of a librarian practicing in a technological society (Middleton, 1999).

Furthermore, in protest against the commercialization of library services, Jim Thompson (1992) revised Ranganathan's laws into the following statements:

Books are for profit.
Every reader his bill.
Every copy its bill.
Take the cash of the reader.
The library is a groaning organism.
Whether one looks to Ranganathan's original Five Laws of Library Science or to any one of the many new interpretations of them, one central idea is immediately clear: Libraries and the Web exist to serve people's information needs.

The Five Laws of the Web
The Five Laws of the Web are inspired by the “Five Laws of Library Science” which were the seed of all of Ranganathan's practice. These laws form the foundation for the Web by defining its minimum requirements. While the laws seem simple on first reading, think about some of the conversations on the Web and how neatly these laws summarize much of what the Web community believes. Although they are simply stated, the laws are nevertheless deep and flexible. These laws are:

Web resources are for use.
Every user his or her web resource.
Every web resource its user.
Save the time of the user.
The Web is a growing organism.
The Web consists of contributions from anyone who wishes to contribute, and the quality of information or the value of knowledge is opaque, due to the lack of any kind of peer reviewing. Moreover, the Web is an unstructured and highly complex conglomerate of all types of information carriers produced by all kinds of people and searched by all kinds of users (Björneborn & Ingwersen, 2001).

This new revised version of Ranganathan's laws gives us a grounding for the library profession just as the 1931 original did. The Web exists to help users achieve success through serving user information needs in support of the world community. Information needs are met through web pages and documents appropriate to web users. In fact, the Five Laws of the Web are really the foundations for any user-friendly web information system. What they require is universal access as a right of cybercitizenship in the information age. Like most laws, they look simple until you think about them. We explain each law here:

1. Web resources are for use
The Web was designed to meet the human need to share information resources, knowledge, and experience. Webmasters want people to interact with their web sites and pages, click on them, read them, print them if they need to, and have fun. So web sites are not statues or temples users admire from a distance. This law implies that the Web is for using and learning, and that information is there to be used. This law is very important because information serves no purpose if it is not used, or at least made available for people to learn from. The role of the Web is to serve the individual, community and service, and to maximize social utility in the communication process.

The dominant ethic of the Web is service to society in general. The question "How will this change improve the service that the Web gives?" is a very effective analytical tool. Another aspect of this law is its emphasis on a mission of use both by the individual seeker of truth and for the wider goals and aspirations of society. So "information is for use and should not be hidden or altered from people" (Middleton, 1999).

The Web is central to intellectual, social, and political freedom. A truly free society without the Web freely available to all is an oxymoron. A society that censors the Web is a society open to tyranny. For this reason, the Web must contain and preserve all records of all societies, communities and languages and make these records available to all. We should put the emphasis on free access to information. Old web pages should be preserved by the Internet Archive (www.archive.org) and national libraries for future users. The Web of the future must be one that retains not only the best of the past but also a sense of the history of the Web and of scholarly communication.

The Web must acquire materials and make them accessible so they can be used. The Web needs to be accessible to users. A webmaster who has faith in this law is happy only when the users read and use his or her web pages. As some webmasters currently close their files behind password-protected systems, while others charge fees and introduce fines, the first law admonishes: Web resources are for use.

What we are producing and delivering via the Web and how well we are doing that, are the tangible results of the Web. So what is best practice now and what does this indicate for the future of the Web?

Just as Newton's first law of motion ("A body at rest remains at rest unless acted upon by an outside force") is a statement of the obvious, the first law of the Web also puts forth an obvious and elemental principle. But even so, it is a law that is often violated in the practice and use of the Web. Medieval and monastic libraries, as an extreme example, chained books to the shelves. The books literally were attached to the shelves with brass chains and could only be used in a single location. Obviously, this was done primarily for preservation of the books rather than to facilitate their use. On the other hand, it might be argued that this method of controlling access helped prevent theft and thereby facilitated use!

But you don't have to go all the way back to medieval times to find ways by which librarians can obstruct the use of library materials. Limiting access to books and information resources has prevailed through time, and exists even today. Maintaining special web collections with limited access; storing materials off-site; restricting access to web resources based on memberships, fees, or even by selecting materials that are contracted in such a way as to limit use to particular classes of users (such as when a public library, or a library that is open to the public, eliminates print resources in favor of an electronic version of the material that is only accessible to certain users with passwords) are all modern equivalents of chaining books to the shelves (Leiter, 2003). And all bring into question whether the Web is adhering to the first law: Web resources are for use.

Another aspect of this first law is that either the Web is about service or it is about nothing. In order to deliver and reap the rewards of services, the Web must identify the benefits that society can reasonably expect and then devise means of delivering those benefits. Service always has a purpose and, of course, a price, and the Web has a purpose. If web resources are for use, what happens to unused resources?

The Web relies on user-orientation to justify and develop the Web operations. Suominen (2002) called this 'userism'. At the outset, let us distinguish between good and valuable user-orientation on the one hand, and naive, biased and ideological userism on the other hand. One can speak of the latter when users' interests are assumed, self-evidently, as the only possible rationale for the Web operations, to the extent that no other rationales are even considered. This can be illustrated by a simple example. There is something particularly convincing in the claim that

The Web exists for users. Therefore, the interests of users must be the basis of the Web operations;
The Web exists for researchers and writers, so the interests of researchers and writers should be central in the Web policies;
The Web exists for society, and it should serve the interests of society.
It can be argued that these three assertions are not mutually exclusive, for surely the interests of society are those of the cybercitizens, so claims 1 and 2 are included in claim 3.

Furthermore, one might assume that these three different categories are collective, in that individual interests reduce to collective interests by way of the collective culture contributing to the creation of individuals: 'culture speaks in us' (Suominen, 2002).

This law dictates the development of systems that accommodate the use of web resources. For instance, updating and regular indexing of web site resources facilitates the use of site resources and the Web in general.

2. Every user his or her web resource
This law has many important implications for the Web. This law reveals the fundamental need for balance between building web collections and the basic right of all users to have access to the web resources they need, anywhere in the world. This makes diffusion and dissemination very important; each web resource should call to mind a potential user.

A web site must formulate access policies that ensure that the collection it is building and maintaining is appropriate and adequate to fulfill the expectations of its community of users. In other words, the collection must be appropriate to the web site's mission. A web site must contain resources appropriate to the needs of all its users. Any web site that limits access in any way must ensure that this restriction does not prevent adequate access to the collection by the users that web site was created to serve. Access policies also have implications for search engines.

However, there is an even more practical aspect to this law. Webmasters must know their users well if they are to provide them with the materials they need for their research or that they wish to read. A responsibility, therefore, of any webmaster is to instruct and guide users in the process of searching for the web documents they need for enjoyment, education or research. Clearly, it is the business of webmasters to know the user, to know the web resources, to actively help every user find and retrieve his or her web resource, and to help search engines in the process of indexing web sites. Webmasters need to ask themselves:

Who might want to access information resources?
Who will or won't have access?
What are the issues surrounding access to printing, passwords, etc.?
Webmasters must acknowledge that users of web sites, themselves included, use and value different means of communications in the pursuit of knowledge, information and entertainment. Web sites must value all means of preserving and communicating the records and achievements of the human mind and heart. This second law dictates that the Web serves all users, regardless of social class, sex, age, ethnic group, religion, or any other factor. Every cybercitizen has a right to information. Webmasters and search engine designers should do their best to meet cybercitizens' needs.

3. Every web resource its user
When a web user searches the Web, or gains access to the Web's services, there are certain web resources that will meet his or her needs. It is webmasters' job to ensure that the connection between the user and the web resources is made and that connection is as practical, easy and speedy as possible. Appropriate arrangement of documents in a web site is also an important means of achieving this objective of the third law.

If a web resource is published by a web site but its diffusion and dissemination are otherwise kept quiet, the web resource may not be readily discovered and retrieved until the user has reached a crisis in his or her research. At such a time, a frustrated user may seek out a webmaster or someone else with knowledge of the needed web resource's existence, or may simply stumble upon it by serendipity. While either scenario may represent a happy ending for the user, neither is the preferred model of web service. And in the worst case, the web resource may remain invisible indefinitely.

How can a webmaster find a user for every web resource? There are many ways in which a web site can actively work to connect its resources to its users:

Distribution of new web resources via mailing lists, listservs and discussion groups;
Making a list of new web resources available on the home page of the site;
Submitting resources to popular search engines and directories, which is the most common way of indexing the new resources of a web site.
The use of a structured, well-organized and well-categorized site map/index is a necessity, as it ensures uniformity of treatment of various web resources on similar topics. It should be simple and easy to use. This is something most webmasters probably feel that they already do, but their site maps are not always clear and easy to use. Also important is a correct link to each web resource, as mislinking or misindexing a resource can make it all but invisible to the user and, for all practical purposes, lost. To help users find resources that are topically related, web site designers should use navigational links.

The point here is that webmasters should add content with specific user needs in mind, and they should make sure that users can find the content they need easily. They should make certain that their content is something their users have identified as a need, and at the same time make sure they do not clutter up their web site with content no one seems to care about (Steckel, 2002). Webmasters need to continue adding unique content to their web sites, because high-quality content is everything.

This third law is the most sensible, and it is consistently broken by most webmasters and web writers on most subjects. This law stipulates that a web resource exists for every user, and that resource should be well described and indexed in the search engines' indexes, displayed in an attractive manner on the site, and made readily available to users. This law leads naturally to such practices as open access rather than closed files, a coherent site arrangement, an adequate site map, and a search engine for each site. "It should be easy for users to search for information from any page on a site. Every page should include a search box or at least a link to a search page" (Google, 2003).

4. Save the time of the user
This law presents the biggest challenge to Web administrators, webmasters and search engine designers. Webmasters should always bear in mind that the time of users is very important and precious. A web site must always formulate policies with the information needs of its users in mind. A web site's collection must be designed and arranged in an inviting, obvious, and clear way so as not to waste the time of users as they search for the web resources they need.

This law has both a front-end component (make sure people quickly find what they are looking for) and a back-end component (make sure our data is structured in a way that information can be retrieved quickly). It is also imperative that we understand what goals our users are trying to achieve on our site (Steckel, 2002).

Webmasters help save the time of the user by creating a user-friendly web site. When a site has been finished, uploaded and tested with users, their experiences will be worth reading. Perhaps, then, the question is: "Is the web site user-friendly?" A webmaster should think about users and how to attract them, develop for them, and cater to them, if s/he wants to satisfy the Web community. We need to remember that the webmasters' job is to help web users research effectively and efficiently, to update web sites, and to make them easy to navigate. So user-friendliness and usefulness are important.

Perhaps this law is not as self-evident as the others. Nonetheless, it has been responsible for many reforms in web site administration. A web site must examine every aspect of its policies, rules, and systems with the one simple criterion that saving the time of the user is vital to the web site's mission.

There are other ways to satisfy this law. A well-planned and executed site map saves the time of the user. Saving the time of the user means providing efficient, thorough access to web resources. It means satisfied web users. This is the prime measure of the web site's success; disappointed or frustrated users mean that web site has failed in its duty and its responsibility. This law might be restated as: Serve the user well.

In order to save the time of the user, web sites need to effectively and efficiently design systems that will enable users to find what they are looking for quickly and accurately, as well as to explore the vast collection of potentially useful information available. This fourth law emphasizes efficient service to users, which implies a well-designed and easy-to-understand map/index to the site.

5. The Web is a growing organism
The Web reflects the changes in our world and will continue to grow as we move along in life and contribute to its riches. It is indeed a growing organism. We need to plan and build with the expectation that the Web and its users will grow and change over time. Similarly we need to keep our own skill levels moving forward (Steckel, 2002).

The Web presents an interesting dilemma for librarians. While only about 50,000 books are published each year in the United States, the World Wide Web contains an ever-growing and changing pool of about 320 million web pages. When a book is published, it has been assessed by editors and publishers, and hopefully has some value. By contrast, when a web page is published, it has simply been uploaded to a server somewhere. There are no guidelines for the Web. Anyone can publish--and does. Librarians can play an important role in weeding through the dross and establishing annotated lists of links that patrons can feel confident about using. The boundless resources found on the Web benefit from a librarian's expertise in such areas as indexing and cataloguing, as well as search techniques; there will be an increased demand for these types of skills as users demand more value from the searches that they conduct (Syracuse University, 2004).

Today, the Google index of the Web contains over 8 billion web pages (Google, 2004) and the Web is growing at a rapid rate, providing a huge source of information for users and a huge potential client base for businesses that have a web presence (Thelwall, 2000). The Internet Archive is building a digital library of web sites and other cultural artifacts in digital form. Like a paper library, it provides free access to researchers, historians, scholars, and the general public. Its information collection contains 30 billion web pages. Its Wayback Machine, which currently contains over 100 terabytes of data and is growing at a rate of 12 terabytes per month, is the largest known database in the world, containing multiple copies of the entire publicly available Web (Internet Archive, 2004). For better or for worse, the Web plays an important role in all countries and societies.

The fifth law tells us about the last vital characteristic of the Web and stresses the need for a constant adjustment of our outlook in dealing with it. The Web grows and changes and will do so always. Change and growth go together, and require flexibility in the management of the Web collection, in the use of cyberspace, in the retention and deployment of users, and in the nature of web programs. The Web collection increases and changes, information technologies change and people will change. So this fifth law recognizes that growth will undoubtedly occur and must be planned for systematically.

Discussion
The Five Laws of the Web help to identify the Web as a powerful inspiration for technological, educational and social change. The user is rightly the center of attention in this process. So, it is only through understanding user needs and characteristics that webmasters and search engine designers can build tools to help users meet their information needs. Saving the user's time by providing convenient access mechanisms is a principal concern of the Web. Furthermore, some writers and webmasters like to share their information and knowledge with others through web pages. This is because the Web is for use, and can provide a dynamic source of information for all kinds of users.

The growth of userism in recent Web thinking can be understood partly in relation to the prevailing neo-liberalistic view of society. When human beings are reduced to customers, consumers or users, society can be reduced to a market. A critique of userism is thus topical (Suominen, 2002).

Conclusion
What should we learn from these Five Laws of the Web? It is our hope that the reader has gained two things from this essay: first, a new appreciation for the work of the great Indian librarian; second, a renewed perspective on and appreciation of our work as information professionals and librarians. We started this paper with the question "What do these laws mean?" The first four of these laws reflect the way of thinking that we call userism. According to these laws, the Web's raison d'être lies in its relationship with users and use.

These laws are as applicable to the current practice of the Web as they will be to the Web of tomorrow. These laws are not only applicable to the Web in general but characterize the establishment, enhancement, and evaluation of online databases and digital library services as well. These five laws concisely represent the ideal service and organizational philosophy of the Web. Therefore, we can evaluate web sites by applying the Five Laws of the Web.

We end the paper with other questions for future readers. What will the next great age of the Web be? Is the Web a civilizing force or a cause of exclusion? Is it a bastion of intellectual freedom? Is it a vital force for social and cultural cohesion? Whatever it is now, it will assume the essential roles libraries have had throughout the ages.

Acknowledgements
The author wishes to thank Mrs. Marjorie Sweetko for her helpful comments.

References
Berners-Lee, T. (1989, September 24). Information management: a proposal. CERN, March 1989, May 1990. Retrieved September 2, 2004, from http://www.w3.org/History/1989/proposal.html
Björneborn, L. (2004). Small-world link structures across an academic web space: a library and information science approach. Ph.D. Thesis. Royal School of Library and Information Science, Copenhagen, Denmark. p. 245-246.
Björneborn, L., & Ingwersen, P. (2001). Perspectives of webometrics. Scientometrics , 50 (1), 65-82.
Cana, M. (2003, July 5). Open source and Ranganathan's five laws of library science. Retrieved October 22, 2004, from http://www.kmentor.com/socio-tech-info/archives/000079.html
Crawford, W., & Gorman, M. (1995). Future libraries: dreams, madness & reality. Chicago and London, American Library Association.
Garfield, E. (1984). A tribute to S.R. Ranganathan, the father of Indian Library Science. Part 1. Life and Works. Current Contents, 6, February 6, 5-12.
Google (2003). 10 Tips for enterprise search: a best practices tip sheet. Retrieved October 15, 2004, from http://www.google.com/appliance/pdf/google_10_tips.pdf
Google (2004, November 15). Google's index nearly doubles. Retrieved November 15, 2004, from http://www.google.com/googleblog/
Gorman, M. (1995). Five new laws of librarianship. American Libraries, 26 (8), 784-785.
Gorman, M. (1998a). Our singular strengths: mediations for librarians. Chicago, IL: American Library Association.
Gorman, M. (1998b). The five laws of library science: then & now. School Library Journal, 44 (7), 20-23.
Hall, T.D. (2004). Making the starting line-up: best practices for placing diversity at the center of your library. 2004 National Diversity in Libraries Conference" Diversity in Libraries Making It Real". May, 4-5, Atlanta, Georgia. Retrieved October 15, 2004, from http://www.librarydiversity.org/MakingtheStartingLine.pdf
Internet Archive (2004). Web Archive. Retrieved October 15, 2004, from http://www.bibalex.org/english/initiatives/internetarchive/web.htm and http://www.archive.org/web/web.php
Leiter, R.A. (2003). Reflections on Ranganathan's five laws of library science. Law Library Journal, 95 (3), 411-418.
Middleton, T. (1999, October 14). The five laws of librarianship. Retrieved October 15, 2004, from http://www2.hawaii.edu/~trishami/610a.html
Mishra, S. (1998, October 12). Principles of distance education. Retrieved October 15, 2004, from http://hub.col.org/1998/cc98/0051.html
Noruzi, A. (2004). Introduction to Webology. Webology, 1(1). Article 1. Retrieved October 5, 2004, from http://www.webology.ir/2004/v1n1/a1.html
Ranganathan, S.R. (1931). The five laws of library science. Madras: Madras Library Association.
Rettig, J.R. (1992). Self-determining information seekers. RQ, 32 (2), winter, 158-63. Retrieved October 11, 2004, from http://archive.ala.org/rusa/forums/rettig.pdf
Sayers, W.C.B. (1957). Introduction to the first edition, (Ranganathan, S.R.) The five laws of library science. London: Blunt and Sons Ltd., p. 13-17.
Steckel, M. (2002). Ranganathan for information architects. Boxes and Arrows, 7 October. Retrieved October 20, 2004, from http://www.boxesandarrows.com/archives/ranganathan_for_ias.php
Suominen, V. (2002). User interests as the rationale of library operations: a critique. Public Library Quarterly, 35 (2). Retrieved October 15, 2004, from http://www.splq.info/issues/vol35_2/07.htm
Syracuse University, School of Information Studies (2004). Librarians in the 21st century: libraries and the Internet. Retrieved October 19, 2004, from http://iststudents.syr.edu/~project21cent/
Thompson, J. (1992). The five laws of library science. Newsletter on Serials Pricing Issues, 47, September 13. Retrieved October 12, 2004, from http://www.lib.unc.edu/prices/1992/PRIC47.HTML#47.3
Thelwall, M. (2000). Who is using the .co.uk domain? Professional and media adoption of the Web. International Journal of Information Management, 20 (6), 441-453.
Walter, V.A. (2001). Children and libraries: getting it right. Chicago: American Library Association.

--------------------------------------------------------------------------------

Bibliographic information of this paper for citing:
Noruzi, A. (2004). "Application of Ranganathan's Laws to the Web." Webology, 1(2), Article 8. Available at: http://www.webology.ir/2004/v1n2/a8.html

--------------------------------------------------------------------------------


Vector Model Information Retrieval

Introduction
The discussion is divided into two parts; one covers basic math, and the second covers the math of the vector model. This document is meant to be used in conjunction with the textbook and other class material.

The first section covers essential mathematical building blocks:

Logarithms
Cosine
Summation
Dot Product (vector multiplication)
The second section explains the math behind the calculations required for the vector model:

Inverse Document Frequency (idf)
Normalized Frequency (f i,j)
Weight Calculations
Mathematical Building Blocks
Logarithms: log(N)
The logarithm is the first mathematical function we need to understand. The vector model has equations like log(N / ni). If you don't know what a logarithm is, then you cannot understand what that means.

First of all, what is a function? In its simplest form, a function is a computation that takes a number as input, performs a calculation on that input, and returns the value of the calculation. Consider the square function, "sq". sq(2) returns 4. sq(9) returns 81. So sq(N) just means to return the value of N * N.

Following that line of thinking, log (2) means to return the logarithm of the number 2. Log (N) means to return the value of the logarithm of the variable N, for whatever value N might have.

We know how to square a number but what is the logarithm function? How is it computed?

We won't go into the derivation but if you play with Google a little, you can see the types of values that the logarithm function returns. Google has a built-in calculator that we will use. Type "log 0.5" (without the quotes) into Google and you get -0.301029996. Try "log 95" and you get 1.97772361. Do a few more. What is log 6666 ? Just poking at random numbers gives us a little feel for the function, but an organized set of numbers reveals the true pattern:

N log ( N )
1 0
10 1
100 2
1000 3
10000 4

10 to the power of log (N) = N. So, if you have a number, and you take its logarithm, and then raise 10 to that power, you get back to your original number.

The chart shows that the logarithm function is useful in compressing numbers from very big to manageable sizes; it's an "order of magnitude" reduction calculation. Also, we note that log(1) is equal to zero. Both of these are useful and necessary in calculating weights, as we will soon see.

One more interesting characteristic of the logarithm function is that logs of values between 0 and 1 are negative numbers. For instance, log (0.5) = -0.301029996. A lot of times you'll see -log (X) in an equation. If X is a variable that ranges from zero to one, now you know why the negative sign may be there: the problem may demand that the number be positive. Putting in the negative sign will turn a negative value positive.

So now if you see log (N) in an equation, you have a handle on what it means and can estimate some values for it! A logarithm just measures the order of magnitude of a number N. It's just a smaller number substituting for the original one. And in the special case where N is one, then the log (N) will be zero.
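If you would rather poke at the function in code than in Google's calculator, here is a small Python sketch (Python is my choice here, not something the textbook requires) using the standard math module's base-10 logarithm. It reproduces the chart above and the other values we tried:

```python
import math

# Base-10 logarithms: the same values the chart above shows.
for n in [1, 10, 100, 1000, 10000]:
    print(n, math.log10(n))        # 0, 1, 2, 3, 4

print(math.log10(0.5))             # about -0.30103: logs of numbers between 0 and 1 are negative
print(math.log10(95))              # about 1.9777
print(10 ** math.log10(95))        # raising 10 to the log gets you back to (about) 95
```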

Cosine
The cosine is a trigonometric function with some properties that are very useful in analyzing vectors.

We are representing documents in our collection as vectors. We also represent the query as a vector. If we could somehow compare the query vector to the vectors of each of the documents, then we would be able to say which of the documents is "closest" to the query. This would give us a ranking for our search. See the class notes for a good description of how that works.

So how do you rank the closeness of vectors? One way to consider (but it will get more complicated, of course!) is to just take the angle between them. If the angle is very small, then the two vectors are pretty close. If the angle is big, then they are not. Easy. However, if we want to do math on this and we are dealing with angles, we get a lot of weird numbers like 72 degrees. Just like the log function smooths out numbers, angles are easier to deal with if they are normalized to vary in a regular way... like between one and zero, for instance. 0 could mean "totally different" and 1 could mean "identical" with a full range of values (similarities) between 0 and 1.

Now, if we make a little chart of some angles and their cosines, we can see how the cosine smooths out the angles:

Angle A Cosine (A)
0 1
30 .86
45 .70
60 .5
90 0

The table shows that when the angle between two vectors is very small, the cosine approaches 1. Indeed, if the two vectors match perfectly, the cosine equals 1. In our IR application, that would be a perfect match between the query vector and a document vector. As the angle increases, the cosine (the degree of similarity between the vectors) diminishes until it reaches zero when the vectors are at 90 degrees.

For those who like pictures: [the original handout included a figure here, not reproduced in this text version.]

In a coordinate system where every coordinate had equal weight, that might be an approximate measure of similarity. However, for our vector model, we want to value different coordinates (search terms) differently, so we have to make it a little more complicated. We'll do that pretty soon!

For now, we see that cosine is a function that measures a relationship between two vectors. The cosine of the angle between them approaches one as the vectors converge and approaches zero as the angle between the vectors approaches ninety degrees.
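As a quick check on the table above, a few lines of Python do the same arithmetic (math.cos works in radians, so the degree values are converted first; the results are rounded, so 30 degrees shows 0.87 rather than the table's truncated .86):

```python
import math

# Cosine of some angles given in degrees, as in the table above.
for degrees in [0, 30, 45, 60, 90]:
    similarity = math.cos(math.radians(degrees))
    print(degrees, round(similarity, 2))   # 1.0, 0.87, 0.71, 0.5, 0.0
```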

Summation

Now we need to review the summation operator, Σ (pronounced "sigma"). It looks scary but it isn't that bad!

The sigma is an operator, just like a multiplication sign or a division sign. It says to add together a series of numbers. When you see a multiplication sign, you know you are going to be multiplying numbers. When you see a sigma, you know you will be adding together a series of numbers.

The full expression looks like this:

$$\sum_{k=1}^{n} a_k$$

This symbol denotes the sum of the terms: a1 + a2 + a3 + ... + an. "a1" means the first number in the series. "a2" is the second, etc. "an" is the final term in the series. You know lots of series: 1, 2, 3, 4, ... is a series. 2, 4, 6, 8, 10, ... is a series. 1, 3, 2, 4, 3, 5, 4, 6, ... is a series. "a1 + a2 + a3 + ... + an" and the sigma expression above are two different ways of saying the same thing.

There are a few variables associated with that expression that you need to understand as well. The variables determine how many terms get added and which members of the series are included:

The variable k is the index of summation. The values of k run from 1 to n in the expression above.
The a's are the terms that get added together. They can be very complex expressions, but they are always members of a series and they always get added together.
The number 1 is the lower limit of summation. That tells you which term to start with.
The variable n tells you when to stop, the upper limit of summation.
When you see an expression like that, the English translation would be "Take the summation from k equals 1 to k equals n of a sub k." (Finney et al., 1995)

Let's look at some examples:

Example 1:

$$\sum_{k=1}^{5} k$$

In English: Take the sum of all k's from k=1 to k=5

Numerically: 1 + 2 + 3 + 4 + 5 = 15 ... fifteen is the answer.

Example 2:

$$\sum_{k=3}^{6} k^2$$

In English: Take the sum of the squares of k as k goes from k=3 to k=6

Numerically: 9 + 16 + 25 + 36 = 86 ... 86 is the answer.

Example 3:

$$\sum_{k=2}^{4} \frac{k-2}{2}$$

In English: Take the sum from k=2 to k=4 of (k - 2) divided by 2

Numerically: 0 + 1/2 + 1 = 1.5 ... 1.5 is the answer.

So, a sigma just means "take the sum of this series" and when you see a sigma, you just figure out what the terms are, and then add them together. You'll always be adding them together; that's what it means.
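The three examples above translate directly into Python's built-in sum() over a range of k values; this little sketch just re-does the arithmetic:

```python
# Example 1: sum of k for k = 1..5
print(sum(k for k in range(1, 6)))              # 15

# Example 2: sum of k squared for k = 3..6
print(sum(k ** 2 for k in range(3, 7)))         # 86

# Example 3: sum of (k - 2) / 2 for k = 2..4
print(sum((k - 2) / 2 for k in range(2, 5)))    # 1.5
```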

Dot Product
Fasten your seat belt! This section gets a little challenging, but it is the essence of how the vector model works. This section describes yet another mathematical operator.

When mathematicians (or physicists, or engineers, or librarians) want to multiply two vectors together, one way they do it is with an operator called a "dot product." Its symbol is a big huge fat dot: "•". If you had two vectors A and B and wanted to multiply them together, you'd write A • B. You'd pronounce that expression "A dot product B." The result of a dot product operation is a scalar (a single number), not a vector.

If the vectors are perfectly aligned, you calculate a dot product by multiplying their corresponding terms and adding the products together. There will be as many products as there are dimensions in the coordinate system. Thus, for a three dimensional vector, the dot product would be the sum (see, it's a scalar!) of the corresponding coordinates multiplied with one another.

If A = (a1 , a2, a3 ) and B = (b1 , b2, b3 ), then

A • B = a1b1 + a2b2 + a3b3 (where a1b1 means the first coordinate in A times the first coordinate in B, etc.)

Example: If A was a vector (3, 4, 7) and B was a vector (9, 2, 1) then

A • B = (3 * 9) + (4 * 2) + (7 * 1) = 27 + 8 + 7 = 42.

Hint: in our document world, the vectors have one coordinate per index term (the book calls that number t), and there are N documents, so there are N document vectors to compare against the query!

You may recognize this as a sum of a series and recall the summation operator:

$$A \cdot B = \sum_{k=1}^{n} a_k b_k$$

This just says that for two vectors, A and B, that are perfectly aligned, A • B equals the sum of the products of the corresponding coordinates of the two vectors.
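In code, the dot product of two aligned vectors is exactly that coordinate-by-coordinate multiply-and-add; here is a minimal Python version (my own sketch), checked against the A = (3, 4, 7), B = (9, 2, 1) example above:

```python
def dot(a, b):
    """Dot product of two equal-length coordinate lists: the sum of a[k] * b[k]."""
    return sum(x * y for x, y in zip(a, b))

print(dot([3, 4, 7], [9, 2, 1]))   # 27 + 8 + 7 = 42
```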

However, if vectors are not perfectly aligned, we need a fudge factor to account for the skew in their alignment. Thus the generalized formula for a dot product brings in our friend the cosine:

A • B = |A| |B| cos θ

where θ measures the angle between the vectors A and B and |A| means the absolute value of the vector. [The derivation of this comes from the Law of Cosines but you can just believe that it is true!]

If you think about some cases, you'll see that this makes sense. When the vectors are pointing at right angles, cos θ is zero, so the dot product is zero. When the vectors are very close, the cosine approaches one, so it reduces the product only slightly. This is used in physics when calculating work: the vectors are force and distance, and the angle is the angle at which the force is applied. If you push a dresser with your shoulder so the force is parallel to the floor, it will slide easily (but will scratch the floor!) If you lift the dresser up and move it, it's a lot more work (but you don't scratch the floor!)

One variation of this will show up later. If we divide both sides of this equation by |A| |B|, we get:

$$\cos\theta = \frac{A \cdot B}{|A|\,|B|}$$

Since we reasoned earlier that cosine was a good proxy for measuring the similarity between vectors, now we have discovered a relationship that we can use to describe the similarity between two vectors. Using "sim(A, B)" to mean that similarity, substitute it in for the cosine above and you get:

$$\mathrm{sim}(A, B) = \frac{A \cdot B}{|A|\,|B|}$$

This is the key mathematical expression we need for the vector model, because it provides a formula for calculating similarity between vectors in terms of those vectors. There are, however, a couple of other dot product formulas we need to see before we get into the vector model. This uses some algebra to make variations on the equations we've already seen.

Starting from A • B = |A| |B| cos θ, if you substitute A for B (in other words, use the same vector for both A and B) we get:

A • A = |A| |A| cos θ

But if it is the same vector, then it is lined up with itself, so the angle between them is zero, so the cosine of the angle equals 1.

Substituting 1 for cos θ gives us A • A = |A| |A| = |A|^2

If we take the square root of each side we get

$$|A| = \sqrt{A \cdot A}$$

Recalling that A • B = a1b1 + a2b2 + a3b3, we can see that A • A = a1a1 + a2a2 + a3a3

But a1a1 is a1^2, so a more compact way of saying this is:

A • A = a1^2 + a2^2 + a3^2

By substituting the series for the dot product, we get

$$|A| = \sqrt{a_1^2 + a_2^2 + a_3^2 + \cdots + a_n^2}$$

Under the square root sign we see a series, the sum of the squares. Thus this equation can be written more compactly as:

$$|A| = \sqrt{\sum_{k=1}^{n} a_k^2}$$

We will see very soon how this helps calculate the degree of match between a query vector and a document vector.

Well, I told you to buckle your seatbelt! This is all the background you need for dot products, and it is all the background you need to understand the vector model of information retrieval.
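Pulling the pieces of this section together, here is a hedged Python sketch of cosine similarity, sim(A, B) = A • B / (|A| |B|), with the magnitude computed as the square root of A • A exactly as derived above:

```python
import math

def dot(a, b):
    # Sum of the products of corresponding coordinates.
    return sum(x * y for x, y in zip(a, b))

def magnitude(a):
    # |A| = square root of A . A (the sum of the squared coordinates).
    return math.sqrt(dot(a, a))

def sim(a, b):
    # Cosine of the angle between the vectors:
    # 1.0 means same direction, 0.0 means at right angles.
    return dot(a, b) / (magnitude(a) * magnitude(b))

print(sim([1, 0], [1, 0]))   # 1.0  (identical direction)
print(sim([1, 0], [0, 1]))   # 0.0  (90 degrees apart)
```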

Vector Model Calculations
The vector model of Information Retrieval relies on three sets of calculations. This model can work on selected index words or on full text. In the discussion below, it really doesn't matter which.

The calculations needed for the vector model are:

The weight of each index word across the entire document set needs to be calculated. This answers the question of how important the word is in the entire collection.
The weight of every index word within a given document (in the context of that document only) needs to be calculated for all N documents. This answers the question of how important the word is within a single document.
For any query, the query vector is compared to every one of the document vectors. The results can be ranked. This answers the question of which document comes closest to the query, and ranks the others as to the closeness of the fit.
Throughout all the discussion, we need to have the basic vector model definitions at our disposal:

N: The total number of documents in the system. If your database contains 1,100 documents that are indexed, then N = 1,100.
ki: Some particular index term.
ni: The number of documents in which the term ki appears. In the case above (where N = 1,100), this can range from 0 to 1,100.
freq i,j: The number of times a given keyword (ki) appears in a certain document (document "j").
w i,j: The term weight for ki in document j.
w i,q: The term weight for ki in the query vector.
q = (w 1,q, w 2,q, ..., w t,q): The query vector. These are coordinates in t-dimensional space. There are t terms in the index system; the query vector has a weight for every one of them. (The book puts little vector signs over the variable name; I am bolding it instead.)
dj = (w 1,j, w 2,j, ..., w t,j): A document vector. There are N documents in the system, so there are N document vectors. Each one has a weight for each keyword in the indexing system. If the system is large, many of these values will be zero, since many keywords will not appear in any given document. In mathematical terms, the vectors are "sparse".
f i,j: Normalized frequency of ki in dj.
dj: A representative document, the "j th" one.
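To keep the symbols straight, here is one possible way of holding these quantities in Python (the variable names and the toy numbers are mine, not the textbook's); a document vector and the query vector are simply lists of t weights:

```python
# A toy document set with t = 3 index terms and N = 2 documents.
index_terms = ["web", "library", "law"]      # the t index terms k_1 .. k_t
N = 2                                        # total number of documents in the system

# dj = (w 1,j, ..., w t,j): one weight per index term; many entries may be zero ("sparse").
document_vectors = [
    [0.8, 0.0, 0.3],   # d_1
    [0.0, 0.5, 0.7],   # d_2
]

# q = (w 1,q, ..., w t,q): the query vector, also of length t.
query_vector = [0.6, 0.0, 0.4]
```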


The vector model definition on p. 27 describes these terms (Baeza-Yates & Ribeiro-Neto, 1999). In addition, it explains the query vector and document vectors. These consist of coordinate sets where the coordinates are the weights for each term in the entire system. With very little discussion, the book then lays out this rather intimidating pair of equations to describe similarity between the vectors:

$$\mathrm{sim}(d_j, q) = \frac{d_j \cdot q}{|d_j|\,|q|} = \frac{\sum_{i=1}^{t} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2}\;\sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$$

We are going to work our way through this, and you will soon see what this means.

Similarity
Starting back with our discussion of dot products, we remember that we could use cosine as a proxy for similarity between vectors, and arrived at this equation:

$$\mathrm{sim}(A, B) = \frac{A \cdot B}{|A|\,|B|}$$

Translated into English, that says that the similarity between two vectors is equal to the dot product of the vectors divided by the product of the absolute values of the vectors. In the vector model problem, we know that our two vectors are the query vector and a document vector. So, letting A be a document vector (dj) and B be the query vector (q), we can substitute for A and B to get the starting point that the book uses:

$$\mathrm{sim}(d_j, q) = \frac{d_j \cdot q}{|d_j|\,|q|}$$

Now we use the dot product math that we learned earlier to expand this out. Why do we want to do that?

The basic problem is that the formula does not incorporate the weights of the terms in the query vector and the document vectors. We want to use the weights to find out which document(s) are closest to the query vector. To solve for the similarity, then, we have to use a series of equations to express this same formula in terms of the vector coordinates (the weights).

First, recall that:

A • B = a1b1 + a2b2 + a3b3 from the definition of dot product above. This just means that to calculate the value of a dot product, you multiply the first coordinates together, then add them to the product of the second coordinates, then add that sum to the product of the third coordinates, and you keep going until you hit the end of the coordinates. You end up with a single number, not a vector.

Now, in our vector model, we are saying that A is the document vector, dj = (w 1,j, w 2,j, ..., w t,j)

and B is our query vector q = (w 1,q, w 2,q, ..., w t,q)

So we substitute for A and B, and expand out the terms, taking each term of dj , multiplying it by the corresponding term of q, and adding up all these products:

dj • q = (w 1,j * w 1,q) + (w 2,j * w 2,q) + ... + (w t,j * w t,q)

So we are starting to incorporate the weights of the terms here. Putting this back into our equation for similarity we get:

$$\mathrm{sim}(d_j, q) = \frac{(w_{1,j}\, w_{1,q}) + (w_{2,j}\, w_{2,q}) + \cdots + (w_{t,j}\, w_{t,q})}{|d_j|\,|q|}$$

Now what about that denominator, |dj| |q|? What can we do with that? If we start with equation 2:

$$|A| = \sqrt{A \cdot A}$$

and we substitute a document vector for A, we get

$$|d_j| = \sqrt{d_j \cdot d_j}$$
for the document component

and

$$|q| = \sqrt{q \cdot q}$$
for the query component

Substituting these into our monster equation, we have:

$$\mathrm{sim}(d_j, q) = \frac{(w_{1,j}\, w_{1,q}) + (w_{2,j}\, w_{2,q}) + \cdots + (w_{t,j}\, w_{t,q})}{\sqrt{d_j \cdot d_j}\;\sqrt{q \cdot q}}$$

Well, we already know how to get rid of dot products: expand them out into the sum of the products of corresponding terms. So we need to expand out the two square roots in the denominator.

Remember, though, that when you take the dot product of a vector with itself, you get a series of squares. If it is D • D, the answer will be d1d1 + d2d2 + d3d3, which is the same as d1^2 + d2^2 + d3^2. For our document vector, that turns into a series of the sums of the squares of the weights:

dj • dj = (w 1,j * w 1,j) + (w 2,j * w 2,j) + ... + (w t,j * w t,j)

or

dj • dj = w 1,j^2 + w 2,j^2 + ... + w t,j^2

From our little work on sigma, we recognize that as the sum of the squares, a very simple mathematical series. Since for the document vector the terms are the weights w i,j, we end up with:

$$|d_j| = \sqrt{\sum_{i=1}^{t} w_{i,j}^2}$$

The query vector similarly reduces to

$$|q| = \sqrt{\sum_{i=1}^{t} w_{i,q}^2}$$

Each of these looks intimidating, but the meaning is pretty simple. You just work through the coordinates one at a time. For each one, you square it and add it to the running total, and at the end you take the square root of the total.

So, putting our two sigmas back into the equation gives us:

$$\mathrm{sim}(d_j, q) = \frac{(w_{1,j}\, w_{1,q}) + (w_{2,j}\, w_{2,q}) + \cdots + (w_{t,j}\, w_{t,q})}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2}\;\sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$$

Now we are almost there. One last substitution and we'll be where the book ended up. The numerator is also a series that can be expressed through sigma notation:

(w 1,j * w 1,q) + (w 2,j * w 2,q) + ... + (w t,j * w t,q)

Looking at this we see a series being summed, where each term is the product of the weight in the document vector times the weight in the query vector for the given keyword. In sigma notation that is:

$$\sum_{i=1}^{t} w_{i,j}\, w_{i,q}$$

and if we put that back into our equation we get the intimidating equation in the book:

$$\mathrm{sim}(d_j, q) = \frac{\sum_{i=1}^{t} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2}\;\sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$$

Now, though, we know what it means and where it came from. We could even calculate some sample values if we felt so inclined. Given a document vector and a query vector, all we have to do is multiply some pairs of weights, add up those products, do a little division, and we have a measure of similarity. It looks hard but it is really pretty easy. The class notes give a numerical example, and my vector utility program shows the values and how these sums are calculated for sample queries you enter.
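If you do feel so inclined, here is a small Python sketch of exactly that calculation (the weights below are made up, just to have numbers to multiply): given a document vector and a query vector of weights, it multiplies pairs of weights, sums the products, and divides by the two square roots, as in the formula above.

```python
import math

def similarity(doc_weights, query_weights):
    """sim(dj, q) = sum(w_i,j * w_i,q) / (sqrt(sum(w_i,j^2)) * sqrt(sum(w_i,q^2)))"""
    numerator = sum(wd * wq for wd, wq in zip(doc_weights, query_weights))
    doc_norm = math.sqrt(sum(wd ** 2 for wd in doc_weights))
    query_norm = math.sqrt(sum(wq ** 2 for wq in query_weights))
    if doc_norm == 0 or query_norm == 0:
        return 0.0                       # an all-zero vector shares nothing with anything
    return numerator / (doc_norm * query_norm)

# Toy example: three index terms, invented weights.
d_j = [0.8, 0.0, 0.3]
q   = [0.6, 0.0, 0.4]
print(round(similarity(d_j, q), 3))      # 0.974 for these made-up weights
```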

Word values and weights
Up to this point we have thrown around a lot of expressions like (w t,j * w t,q) while ignoring exactly what those weights are. We have seen that using weights we can calculate the similarity between vectors, and thus the closeness of fit between a document vector and a query vector. By comparing the query vector to many document vectors, we can rank the documents as to the "goodness" of the fit against the query. But just what are those weights and where do they come from? The next sections describe how the weights are calculated. There are actually a lot of weights that need calculating!

One set of weights that we need, obviously, are the weights used in the query vector. The query vector contains every index term usable in the document set. The weight in the query vector reflects the keyword's importance in the context of the entire document set. When N is large, this set of weights would probably be fairly stable. A new document could be added to the document set without significantly changing the values in the query vector.

The second set of weights that we need is potentially enormous. For each of N documents in the document set, we need to calculate a weight for every index term in that document. Each document has a document vector containing the weight for every index term appearing in the document. If you have 1000 documents and 1000 index terms then you need to calculate 1,000,000 weights. Every time you add a new document to the document set, you need to calculate a set of weights for the words in the new document.

What is it that makes a word important to a search? It can be important in two contexts: in the setting of the original document, and in the context of the entire collection. A word that appears in every document, for instance, would have no value across the document set. To calculate keyword weights, then, we need a way to measure both the importance of a word in a document and the importance of a word in the entire document set, and then combine the two. Once we have those values, we can calculate the weights. The table below shows how these factors combine in general terms:

Importance in document | Importance in document set | Weight
High | High | Very high
Low | High | Medium
High | Low | Medium
Low | Low | Very low

Information retrieval specialists use different methods for calculating document vector weights than for calculating query vector weights, and have a variety of techniques of measuring word importance in different contexts. In general, though, they use statistical techniques incorporating word frequency analysis to calculate these values. We will look at the frequency algorithms included in the textbook and then finally see how they are combined into weights.
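The textbook's actual weighting formula is developed in the sections that follow; purely as a toy illustration (numbers invented by me) of why combining the two signals multiplicatively reproduces the pattern in the table above, consider this sketch:

```python
# Toy illustration only: multiply a word's importance inside one document
# by its importance across the document set. High * high is very high,
# low * low is very low, and the mixed cases land in between.
def combined_weight(importance_in_document, importance_in_document_set):
    return importance_in_document * importance_in_document_set

print(combined_weight(0.9, 0.9))   # 0.81  "very high"
print(combined_weight(0.1, 0.9))   # 0.09  "medium" relative to the extremes
print(combined_weight(0.9, 0.1))   # 0.09  "medium"
print(combined_weight(0.1, 0.1))   # 0.01  "very low"
```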

f_{i,j}: Normalized term frequency (tf) of k_i in d_j
The first frequency we examine is the term frequency of a word within a document. We are going to learn how to calculate the normalized frequency of k_i in d_j.

Normalization is a mathematical process of smoothing values or confining them to a set range. As explained in the textbook, a raw frequency count is essentially useless. In a very large document, a rare word may still appear 25 times. A small document in the same document set may have no words that occur 25 times. How can you compare those counts? We need to normalize the frequency count, so that it measures how frequent a word is relative to its own document, regardless of the document's size.

One way to do that is to compare the number of times a given word appears (freq_{i,j}) to the number of times the most popular word in the text appears (max_l freq_{l,j}). This is a way of scaling down the numbers and adjusting for the raw text size. If you divide them, you get a very controlled range of values, less than or equal to one. So we quickly get the formula in the book:

    f_{i,j} = freq_{i,j} / max_l(freq_{l,j})

This just says that the normalized frequency for a given word in a given document (f_{i,j}) is equal to the raw frequency of the word in the document (freq_{i,j}) divided by the raw frequency of the most common word in the document (max_l freq_{l,j}).

To see how nicely this normalizes the raw frequency counts, look at an example. As always, freq_{i,j} is the raw count of a given term k_i in a given document d_j. Furthermore, we will pretend we have computed the value of freq_{i,j} for all the terms in d_j and know the count of the most common word, max_l(freq_{l,j}). Often that will be the word "the"; here we pretend it occurs 100 times.

word         | freq_{i,j} | max_l(freq_{l,j}) | f_{i,j}
interception | 1          | 100               | 0.01
resolution   | 10         | 100               | 0.1
of           | 50         | 100               | 0.5
the          | 100        | 100               | 1.0

Again, "the" appears 100 times and is the most popular phrase. That earns it a normalized frequency of 1.0. The word "resolution" appeared ten times in this document, so its normalized frequency is 10/100 or 0.1. The word "interception" occurs but once and has a normalized frequency of .01.

In general, we see that common words have higher normalized frequencies and rarer words have lower normalized frequencies. Zipf's law predicts that there will be a few words close to one and a lot of words in the very low range, with a fair number in the middle, but that is a different topic. Finally, for a word that appears in the larger document set but not in the document under examination, f_{i,j} is equal to zero, since freq_{i,j} = 0 and that count is the numerator of the normalized frequency.

So that's what normalized frequency is: a measurement of frequency ranging from zero to 1 for each term in a document. This formula assigns a higher value to words that appear more often than to words that appear less often; in some sense, the more common words are more important or valuable than the words that only show up once or twice.
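
Here is a short Python sketch of this calculation, using made-up token counts that mirror the table above (the function name is mine):

    from collections import Counter

    def normalized_tf(tokens):
        """Normalized term frequency f_{i,j} for every term in one document.

        Each raw count freq_{i,j} is divided by the count of the most
        frequent term in the document, so values fall between 0 and 1.
        """
        counts = Counter(tokens)
        max_count = max(counts.values())
        return {term: count / max_count for term, count in counts.items()}

    # Toy document; counts chosen to mirror the table above.
    tokens = ["the"] * 100 + ["of"] * 50 + ["resolution"] * 10 + ["interception"]
    tf = normalized_tf(tokens)
    print(tf["the"], tf["of"], tf["resolution"], tf["interception"])
    # 1.0 0.5 0.1 0.01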

Inverse document frequency
Now we need to come up with a measure of frequency in the context of the entire document set. This is called the "inverse document frequency" and is given by the formula idf_i = log(N/n_i). What does it mean and what's that formula all about?

First, let's consider that "inverse" part. Unlike the case of normalized frequency within a document, in the context of the entire document set we want the importance of a word to go down as the frequency of the word goes up. When the word appears in every document, we want this value to be zero. After all, if it appears in every document, it is worthless as a differentiator between documents. When the word only appears in a few documents, we want this to have a high value. Hence, the inverse: high count, low value; low count, high value. This is just the opposite of the normalized frequency.

So how do we put a number on this? Remember the definitions:

N   is the total # of documents
n_i is the # of documents in which the search term of interest (k_i) appears

We can make a ratio of N divided by n_i, (N / n_i), that will express this inverse relationship pretty nicely.

To see it, make a little table:

N    | n_i  | N/n_i | Discussion
1000 | 1    | 1000  | Low count, high value! If the word only shows up once, the ratio turns out to be a very big number. Hmm, if we used this for a weight it might dominate too much.
1000 | 100  | 10    | This is a reasonable number, perhaps...
1000 | 500  | 2     | High count, low value.
1000 | 1000 | 1     | Oops! We said that when the word appears in every document, we wanted the weight to be zero. Here our inverse ratio is equal to one. This is a problem.

So, we are close, but we have some problems. How do we fix the problems with our inverse ratio? We will use the logarithm function to normalize the values.

Instead of N/n_i, consider log(N/n_i):

N    | n_i  | log(N/n_i)  | Discussion
1000 | 1    | 3           | Well, 3 seems like a nice number to use as a weighting factor for a really rare word.
1000 | 100  | 1           | A nice moderate value for a moderately common word.
1000 | 500  | 0.301029996 | If a word shows up in half the documents, it isn't worth much... hmmm, 0.3 sounds pretty good.
1000 | 1000 | 0           | Bingo! We said that when the word appears in every document, we wanted the weight to be zero. Look at this! It's gonna work!

So this looks like a pretty good idf. It is zero when the word appears in all the documents, and scales up slowly even as our collection size gets enormous.
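
The same calculation as a small Python sketch, assuming base-10 logarithms as in the table above (the function name is mine):

    import math

    def idf(N, n_i):
        """Inverse document frequency: log10(N / n_i).

        N   -- total number of documents in the collection
        n_i -- number of documents containing term k_i
        """
        return math.log10(N / n_i)

    for n_i in (1, 100, 500, 1000):
        print(n_i, round(idf(1000, n_i), 9))
    # 1    3.0
    # 100  1.0
    # 500  0.301029996
    # 1000 0.0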

To use a diving analogy, the "idf" is comparable to the "degree of difficulty" of a dive. The idf of a keyword is constant across a document collection. It only needs to be calculated once for a given set of documents. The degree of difficulty of a dive does not change as different divers do it, and the idf of a search term does not change as the keyword is used in searches on different documents.

Weights
Now, finally, we get to calculate weights!

We got a preview earlier of how we want weights to behave. If we add some sample data we see how well a simple multiplication of factors will work in creating weights:

Normalized frequency | Inverse document frequency | Weight    | Sample nf | Sample idf | Sample weight (nf * idf)
High                 | High                       | Very high | 0.9       | 1.2        | 1.08
Low                  | High                       | Medium    | 0.2       | 0.9        | 0.18
High                 | Low                        | Medium    | 0.85      | 0.04       | 0.034
Low                  | Low                        | Very low  | 0.15      | 0.09       | 0.0135

As illustrated, since both terms have been normalized, we can simply multiply them together to get one weight value that follows the overall rules in the table.

This, in fact, is the formula recommended for calculating document weights:

    w_{i,j} = f_{i,j} * idf_i = f_{i,j} * log(N/n_i)

This just says that the weight for k_i in d_j equals its normalized frequency (nf) times its inverse document frequency (idf). It works well for document weights.
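
Here is a sketch of the document weight calculation, combining the two pieces above. The function name and sample counts are mine, for illustration only.

    import math
    from collections import Counter

    def document_weights(tokens, doc_freq, N):
        """tf-idf weights w_{i,j} = f_{i,j} * log10(N / n_i) for one document.

        tokens   -- the words of document d_j
        doc_freq -- dict mapping each term to n_i, its document count in the collection
        N        -- total number of documents in the collection
        """
        counts = Counter(tokens)
        max_count = max(counts.values())
        return {
            term: (count / max_count) * math.log10(N / doc_freq[term])
            for term, count in counts.items()
        }

    weights = document_weights(
        ["the"] * 100 + ["resolution"] * 10,
        doc_freq={"the": 1000, "resolution": 100},
        N=1000,
    )
    print(weights)   # {'the': 0.0, 'resolution': 0.1}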

For use in a query weight, of course, the normalized frequency would need to be calculated over the entire document set, not just a single document. The text suggests an enhancement in the calculation of the query vector:

    w_{i,q} = (0.5 + 0.5 * freq_{i,q} / max_l(freq_{l,q})) * log(N/n_i)

This is saying the weight of term i in query vector q equals a smoothed version of the normalized frequency of the search term in the "text of the information request q", times the idf. Since the raw normalized frequency would usually be a low number, there is a fudge factor of 0.5: the recalculated normalized frequency gets half weight and a free 0.5 is added in. This would increase the value of a search term that was entered more than one time. If you really wanted to find "jump" you could search for "jump jump jump" and this would increase the value of that term. I have my doubts about the efficacy of that as a user interface.
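
For completeness, a sketch of that query weighting, assuming the smoothed term frequency is multiplied by the idf as in the formula above. The names and sample data are mine.

    import math
    from collections import Counter

    def query_weights(query_tokens, doc_freq, N):
        """Smoothed query weights:

            w_{i,q} = (0.5 + 0.5 * freq_{i,q} / max_l freq_{l,q}) * log10(N / n_i)

        The 0.5 terms keep a single occurrence of a query word from being
        weighted too close to zero.
        """
        counts = Counter(query_tokens)
        max_count = max(counts.values())
        return {
            term: (0.5 + 0.5 * count / max_count) * math.log10(N / doc_freq[term])
            for term in counts
        }

    print(query_weights(["jump", "jump", "resolution"],
                        doc_freq={"jump": 50, "resolution": 100}, N=1000))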

In my test software, I used the unmodified normalized term frequency for the query vector weight. Looking at the suggested formula above, it seems as if the left multiplicand would be nearly constant.

Conclusion
The vector model of Information Retrieval is a powerful tool for constructing search machinery. This discussion provided an introduction to the mathematical concepts required for understanding the vector model and showed the application of those concepts in the development of the model.

References
Baeza-Yates, R. & Ribeiro-Neta, B. (1999). Modern information retrieval. New York : ACM Press.

Finney, R., Thomas, G., Demana, F., & Waits, B. (1995). Calculus: Graphical, numerical, algebraic. New York: Addison-Wesley Publishing Company.




Information Retrieval Models
2.1 Introduction
The purpose of this chapter is two-fold: First, we want to set the stage for the problems in information retrieval that we try to address in this thesis. Second, we want to give the reader a quick overview of the major textual retrieval methods, because the InfoCrystal can help to visualize the output from any of them. We begin by providing a general model of the information retrieval process. We then briefly describe the major retrieval methods and characterize them in terms of their strengths and shortcomings.

2.2 General Model of Information Retrieval

The goal of information retrieval (IR) is to provide users with those documents that will satisfy their information need. We use the word "document" as a general term that could also include non-textual information, such as multimedia objects. Figure 2.1 provides a general overview of the information retrieval process, which has been adapted from Lancaster and Warner (1993). Users have to formulate their information need in a form that can be understood by the retrieval mechanism. There are several steps involved in this translation process that we will briefly discuss below. Likewise, the contents of large document collections need to be described in a form that allows the retrieval mechanism to identify the potentially relevant documents quickly. In both cases, information may be lost in the transformation process leading to a computer-usable representation. Hence, the matching process is inherently imperfect.

Information seeking is a form of problem solving [Marcus 1994, Marchionini 1992]. It proceeds according to the interaction among eight subprocesses: problem recognition and acceptance, problem definition, search system selection, query formulation, query execution, examination of results (including relevance feedback), information extraction, and reflection/iteration/termination. To be able to perform effective searches, users have to develop the following expertise: knowledge about various sources of information, skills in defining search problems and applying search strategies, and competence in using electronic search tools.

Marchionini (1992) contends that some sort of spreadsheet is needed that supports users in the problem definition as well as other information seeking tasks. The InfoCrystal is such a spreadsheet because it assists users in the formulation of their information needs and the exploration of the retrieved documents, using a visual interface that supports a "what-if" functionality. He further predicts that advances in computing power and speed, together with improved information retrieval procedures, will continue to blur the distinctions between problem articulation and examination of results. The InfoCrystal is both a visual query language and a tool for visualizing retrieval results.

The information need can be understood as forming a pyramid, where only its peak is made visible by users in the form of a conceptual query (see Figure 2.1). The conceptual query captures the key concepts and the relationships among them. It is the result of a conceptual analysis that operates on the information need, which may be well or vaguely defined in the user's mind. This analysis can be challenging, because users are faced with the general "vocabulary problem" as they are trying to translate their information need into a conceptual query. This problem refers to the fact that a single word can have more than one meaning, and, conversely, the same concept can be described by surprisingly many different words. Furnas, Landauer, Gomez and Dumais (1983) have shown that two people use the same main word to describe an object only 10 to 20% of the time. Further, the concepts used to represent the documents can be different from the concepts used by the user. The conceptual query can take the form of a natural language statement, a list of concepts that can have degrees of importance assigned to them, or it can be a statement that coordinates the concepts using Boolean operators. Finally, the conceptual query has to be translated into a query surrogate that can be understood by the retrieval system.




--------------------------------------------------------------------------------




Figure 2.1: represents a general model of the information retrieval process, where both the user's information need and the document collection have to be translated into the form of surrogates to enable the matching process to be performed. This figure has been adapted from Lancaster and Warner (1993).



--------------------------------------------------------------------------------

Similarly, the meanings of documents need to be represented in the form of text surrogates that can be processed by computer. A typical surrogate can consist of a set of index terms or descriptors. The text surrogate can consist of multiple fields, such as the title, abstract, and descriptor fields, to capture the meaning of a document at different levels of resolution or to focus on different characteristic aspects of a document. Once the specified query has been executed by the IR system, a user is presented with the retrieved document surrogates. Either the user is satisfied by the retrieved information or he will evaluate the retrieved documents and modify the query to initiate a further search. The process of query modification based on user evaluation of the retrieved documents is known as relevance feedback [Lancaster and Warner 1993]. Information retrieval is an inherently interactive process, and the users can change direction by modifying the query surrogate, the conceptual query or their understanding of their information need.

It is worth noting here the results of studies investigating the information-seeking process, which describe information retrieval in terms of the cognitive and affective symptoms commonly experienced by a library user. The findings by Kuhlthau et al. (1990) indicate that thoughts about the information need become clearer and more focused as users move through the search process. Similarly, uncertainty, confusion, and frustration are nearly universal experiences in the early stages of the search process, and they decrease as the search process progresses and feelings of being confident, satisfied, sure and relieved increase. The studies also indicate that cognitive attributes may affect the search process. Users' expectations of the information system and the search process may influence the way they approach searching and therefore affect their intellectual access to information.

Analytical search strategies require the formulation of specific, well-structured queries and a systematic, iterative search for information, whereas browsing involves the generation of broad query terms and a scanning of much larger sets of information in a relatively unstructured fashion. Campagnoni et al. (1989) have found in information retrieval studies in hypertext systems that the predominant search strategy is "browsing" rather than "analytical search". Many users, especially novices, are unwilling or unable to precisely formulate their search objectives, and browsing places less cognitive load on them. Furthermore, their research showed that search strategy is only one dimension of effective information retrieval; individual differences in visual skill appear to play an equally important role.

These two studies argue for information displays that provide a spatial overview of the data elements and that simultaneously provide rich visual cues about the content of the individual data elements. Such a representation is less likely to increase the anxiety that is a natural part of the early stages of the search process and it caters for a browsing interaction style, which is appropriate especially in the beginning, when many users are unable to precisely formulate their search objectives.

2.3 Major Information Retrieval Models

The following major models have been developed to retrieve information: the Boolean model, the Statistical model, which includes the vector space and the probabilistic retrieval model, and the Linguistic and Knowledge-based models. The first model is often referred to as the "exact match" model; the latter ones as the "best match" models [Belkin and Croft 1992]. The material presented here is based on the textbooks by Lancaster and Warner (1993) as well as Frakes and Baeza-Yates (1992), the review article by Belkin and Croft (1992), and discussions with Richard Marcus, my thesis advisor and mentor in the field of information retrieval.

Queries generally are less than perfect in two respects: First, they retrieve some irrelevant documents. Second, they do not retrieve all the relevant documents. The following two measures are usually used to evaluate the effectiveness of a retrieval method. The first one, called the precision rate, is equal to the proportion of the retrieved documents that are actually relevant. The second one, called the recall rate, is equal to the proportion of all relevant documents that are actually retrieved. If searchers want to raise precision, then they have to narrow their queries. If searchers want to raise recall, then they have to broaden their queries. In general, there is an inverse relationship between precision and recall. Users need help to become knowledgeable in how to manage the precision and recall trade-off for their particular information need [Marcus 1991].
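
In code, the two measures reduce to a couple of set operations; the sketch below uses hypothetical document-id sets purely for illustration:

    def precision_recall(retrieved, relevant):
        """Precision and recall for one query.

        retrieved -- set of document ids returned by the query
        relevant  -- set of document ids actually relevant to the information need
        """
        hits = len(retrieved & relevant)
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    # Hypothetical result set: 4 of the 6 retrieved documents are relevant,
    # but 4 other relevant documents were missed.
    print(precision_recall({1, 2, 3, 4, 5, 6}, {3, 4, 5, 6, 7, 8, 9, 10}))
    # (0.666..., 0.5)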

2.3.1.1 Standard Boolean

In Table 2.1 we summarize the defining characteristics of the standard Boolean approach and list its key advantages and disadvantages. It has the following strengths: 1) It is easy to implement and it is computationally efficient [Frakes and Baeza-Yates 1992]. Hence, it is the standard model for the current large-scale, operational retrieval systems and many of the major on-line information services use it. 2) It enables users to express structural and conceptual constraints to describe important linguistic features [Marcus 1991]. Users find that synonym specifications (reflected by OR-clauses) and phrases (represented by proximity relations) are useful in the formulation of queries [Cooper 1988, Marcus 1991]. 3) The Boolean approach possesses a great expressive power and clarity. Boolean retrieval is very effective if a query requires an exhaustive and unambiguous selection. 4) The Boolean method offers a multitude of techniques to broaden or narrow a query. 5) The Boolean approach can be especially effective in the later stages of the search process, because of the clarity and exactness with which relationships between concepts can be represented.

The standard Boolean approach has the following shortcomings: 1) Users find it difficult to construct effective Boolean queries for several reasons [Cooper 1988, Fox and Koll 1988, Belkin and Croft 1992]. Users interpret the terms AND, OR and NOT according to their natural language meanings, which differ from their meanings in a query. Thus, users will make errors when they form a Boolean query, because they resort to their knowledge of English.




--------------------------------------------------------------------------------





Table 2.1: summarizes the defining characteristics of the standard Boolean approach and lists its key advantages and disadvantages.
--------------------------------------------------------------------------------

For example, in ordinary conversation a noun phrase of the form "A and B" usually refers to more entities than would "A" alone, whereas when used in the context of information retrieval it refers to fewer documents than would be retrieved by "A" alone. Hence, one of the common mistakes made by users is to substitute the AND logical operator for the OR logical operator when translating an English sentence to a Boolean query. Furthermore, to form complex queries, users must be familiar with the rules of precedence and the use of parentheses. Novice users have difficulty using parentheses, especially nested parentheses. Finally, users are overwhelmed by the multitude of ways a query can be structured or modified, because of the combinatorial explosion of feasible queries as the number of concepts increases. In particular, users have difficulty identifying and applying the different strategies that are available for narrowing or broadening a Boolean query [Marcus 1991, Lancaster and Warner 1993]. 2) Only documents that satisfy a query exactly are retrieved. On the one hand, the AND operator is too severe because it does not distinguish between the case when none of the concepts are satisfied and the case where all except one are satisfied. Hence, no or very few documents are retrieved when more than three or four criteria are combined with the Boolean operator AND (referred to as the Null Output problem). On the other hand, the OR operator does not reflect how many concepts have been satisfied. Hence, often too many documents are retrieved (the Output Overload problem). 3) It is difficult to control the number of retrieved documents. Users are often faced with the null-output or the information overload problem and they are at a loss as to how to modify the query to retrieve a reasonable number of documents. 4) The traditional Boolean approach does not provide a relevance ranking of the retrieved documents, although modern Boolean approaches can make use of the degree of coordination, field level and degree of stemming present to rank them [Marcus 1991]. 5) It does not represent the degree of uncertainty or error due to the vocabulary problem [Belkin and Croft 1992].

2.3.1.2 Narrowing and Broadening Techniques

As mentioned earlier, a Boolean query can be described in terms of the following four operations: degree and type of coordination, proximity constraints, field specifications and degree of stemming as expressed in terms of word/string specifications. If users want to (re)formulate a Boolean query then they need to make informed choices along these four dimensions to create a query that is sufficiently broad or narrow depending on their information needs. Most narrowing techniques lower recall as well as raise precision, and most broadening techniques lower precision as well as raise recall. Any query can be reformulated to achieve the desired precision or recall characteristics, but generally it is difficult to achieve both. Each of the four kinds of operations in the query formulation has particular operators, some of which tend to have a narrowing or broadening effect. For each operator with a narrowing effect, there is one or more inverse operators with a broadening effect [Marcus 1991]. Hence, users require help to gain an understanding of how changes along these four dimensions will affect the broadness or narrowness of a query.




--------------------------------------------------------------------------------




Figure 2.2: captures how coordination, proximity, field level and stemming affect the broadness or narrowness of a Boolean query. By moving in the direction in which the wedges are expanding, the query is broadened.


--------------------------------------------------------------------------------

Figure 2.2 shows how the four dimensions affect the broadness or narrowness of a query: 1) Coordination: the different Boolean operators AND, OR and NOT have the following effects when used to add a further concept to a query: a) the AND operator narrows a query; b) the OR broadens it; c) the effect of the NOT depends on whether it is combined with an AND or OR operator. Typically, in searching textual databases, the NOT is connected to the AND, in which case it has a narrowing effect like the AND operator. 2) Proximity: The closer together two terms have to appear in a document, the more narrow and precise the query. The most stringent proximity constraint requires the two terms to be adjacent. 3) Field level: current document records have fields associated with them, such as the "Title", "Index", "Abstract" or "Full-text" field: a) the more fields that are searched, the broader the query; b) the individual fields have varying degrees of precision associated with them, where the "title" field is the most specific and the "full-text" field is the most general. 4) Stemming: The shorter the prefix that is used in truncation-based searching, the broader the query. By reducing a term to its morphological stem and using it as a prefix, users can retrieve many terms that are conceptually related to the original term [Marcus 1991].

Using Figure 2.2, we can easily read off how to broaden a query. We just need to move in the direction in which the wedges are expanding: we use the OR operator (rather than the AND), impose no proximity constraints, search over all fields and apply a great deal of stemming. Similarly, we can formulate a very narrow query by moving in the direction in which the wedges are contracting: we use the AND operator (rather than the OR), impose proximity constraints, restrict the search to the title field and perform exact rather than truncated word matches. In Chapter 4 we will show how Figure 2.2 indicates how the broadness or narrowness of a Boolean query could be visualized.
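
The narrowing and broadening effect of coordination is easy to see in a toy evaluator. The following Python sketch (documents and query trees invented for illustration) shows the AND form retrieving fewer documents than the OR form over the same two terms:

    # A toy collection: each document is reduced to its set of index terms.
    docs = {
        "d1": {"information", "retrieval", "boolean"},
        "d2": {"information", "visualization"},
        "d3": {"retrieval", "ranking", "vector"},
    }

    def matches(query, doc_terms):
        """Evaluate a tiny Boolean query tree against one document's term set.

        A query is either a term (string) or a tuple ('AND'|'OR'|'NOT', subquery, ...).
        """
        if isinstance(query, str):
            return query in doc_terms
        op, *args = query
        if op == "AND":
            return all(matches(a, doc_terms) for a in args)
        if op == "OR":
            return any(matches(a, doc_terms) for a in args)
        if op == "NOT":
            return not matches(args[0], doc_terms)
        raise ValueError(op)

    narrow = ("AND", "information", "retrieval")   # AND narrows
    broad = ("OR", "information", "retrieval")     # OR broadens
    print([d for d, t in docs.items() if matches(narrow, t)])  # ['d1']
    print([d for d, t in docs.items() if matches(broad, t)])   # ['d1', 'd2', 'd3']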

2.3.1.3 Smart Boolean

There have been attempts to help users overcome some of the disadvantages of the traditional Boolean approach discussed above. We will now describe such a method, called Smart Boolean, developed by Marcus [1991, 1994] that tries to help users construct and modify a Boolean query as well as make better choices along the four dimensions that characterize a Boolean query. We are not attempting to provide an in-depth description of the Smart Boolean method, but to use it as a good example that illustrates some of the possible ways to make Boolean retrieval more user-friendly and effective. Table 2.2 provides a summary of the key features of the Smart Boolean approach.

Users start by specifying a natural language statement that is automatically translated into a Boolean Topic representation that consists of a list of factors or concepts, which are automatically coordinated using the AND operator. If the user at the initial stage can or wants to include synonyms, then they are coordinated using the OR operator. Hence, the Boolean Topic representation connects the different factors using the AND operator, where the factors can consist of single terms or several synonyms connected by the OR operator. One of the goals of the Smart Boolean approach is to make use of the structural knowledge contained in the text surrogates, where the different fields represent contexts of useful information. Further, the Smart Boolean approach wants to use the fact that related concepts can share a common stem. For example, the concepts "computers" and "computing" have the common stem comput*.




--------------------------------------------------------------------------------



Table 2.2: summarizes the defining characteristics of the Smart Boolean approach and lists its key advantages and disadvantages.


--------------------------------------------------------------------------------

The initial strategy of the Smart Boolean approach is to start out with the broadest possible query within the constraints of how the factors and their synonyms have been coordinated. Hence, it modifies the Boolean Topic representation into the query surrogate by using only the stems of the concepts and searches for them over all the fields. Once the query surrogate has been executed, users are guided in the process of evaluating the retrieved document surrogates. They choose from a list of reasons to indicate why they consider certain documents as relevant. Similarly, they can indicate why other documents are not relevant by interacting with a list of possible reasons. This user feedback is used by the Smart Boolean system to automatically modify the Boolean Topic representation or the query surrogate, whichever is more appropriate. The Smart Boolean approach offers a rich set of strategies for modifying a query based on the received relevance feedback or the expressed need to narrow or broaden the query. The Smart Boolean retrieval paradigm has been implemented in the form of a system called CONIT, which is one of the earliest expert retrieval systems that was able to demonstrate that ordinary users, assisted by such a system, could perform equally well as experienced search intermediaries [Marcus 1983]. However, users have to navigate through a series of menus listing different choices, where it might be hard for them to appreciate the implications of some of these choices. A key limitation of the previous versions of the CONIT system has been that they lacked a visual interface. The most recent version has a graphical interface and it uses the tiling metaphor suggested by Anick et al. (1991), and discussed in section 10.4, to visualize Boolean coordination [Marcus 1994]. This visualization approach suffers from the limitation that it enables users to visualize only specific queries, whereas we will propose a visual interface that represents a whole range of related Boolean queries in a single display, making changes in Boolean coordination more user-friendly. Further, the different strategies of modifying a query in CONIT require a better visualization metaphor to enable users to make use of these search heuristics. In Chapter 4 we show how some of these modification techniques can be visualized.

2.3.1.4 Extended Boolean Models

Several methods have been developed to extend the Boolean model to address the following issues: 1) The Boolean operators are too strict and ways need to be found to soften them. 2) The standard Boolean approach has no provision for ranking. The Smart Boolean approach and the methods described in this section provide users with relevance ranking [Fox and Koll 1988, Marcus 1991]. 3) The Boolean model does not support the assignment of weights to the query or document terms. We will briefly discuss the P-norm and the Fuzzy Logic approaches that extend the Boolean model to address the above issues.




--------------------------------------------------------------------------------



Table 2.3: summarizes the defining characteristics of the Extended Boolean approach and lists its key advantages and disadvantages.


--------------------------------------------------------------------------------

The P-norm method developed by Fox (1983) allows query and document terms to have weights, which have been computed by using term frequency statistics with the proper normalization procedures. These normalized weights can be used to rank the documents in the order of decreasing distance from the point (0, 0, ... , 0) for an OR query, and in order of increasing distance from the point (1, 1, ... , 1) for an AND query. Further, the Boolean operators have a coefficient P associated with them to indicate the degree of strictness of the operator (from 1 for least strict to infinity for most strict, i.e., the Boolean case). The P-norm uses a distance-based measure and the coefficient P determines the degree of exponentiation to be used. The exponentiation is an expensive computation, especially for P-values greater than one.
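
The distance-based idea is easy to sketch for the unweighted two-term case (the full P-norm model also attaches weights to the query terms; the function names below are mine):

    def pnorm_or(weights, p):
        """P-norm OR: normalized L_p distance from the all-zeros point."""
        n = len(weights)
        return (sum(w ** p for w in weights) / n) ** (1.0 / p)

    def pnorm_and(weights, p):
        """P-norm AND: 1 minus the normalized L_p distance from the all-ones point."""
        n = len(weights)
        return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)

    # Document term weights for the two query terms, in [0, 1].
    w = [0.9, 0.0]
    for p in (1, 2, 10):
        print(p, round(pnorm_or(w, p), 3), round(pnorm_and(w, p), 3))
    # As p grows, OR approaches max(w) and AND approaches min(w),
    # i.e. the strict (fuzzy) Boolean interpretation.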

In Fuzzy Set theory, an element has a varying degree of membership to a set instead of the traditional binary membership choice. The weight of an index term for a given document reflects the degree to which this term describes the content of a document. Hence, this weight reflects the degree of membership of the document in the fuzzy set associated with the term in question. The degree of membership for union and intersection of two fuzzy sets is equal to the maximum and minimum, respectively, of the degrees of membership of the elements of the two sets. In the "Mixed Min and Max" model developed by Fox and Sharat (1986) the Boolean operators are softened by considering the query-document similarity to be a linear combination of the min and max weights of the documents.
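
A small sketch of the fuzzy operators follows; the "mixed" function is a simplified linear blend meant only to illustrate the softening idea, not the exact Fox and Sharat formulation:

    def fuzzy_and(memberships):
        """Fuzzy intersection: degree of membership is the minimum."""
        return min(memberships)

    def fuzzy_or(memberships):
        """Fuzzy union: degree of membership is the maximum."""
        return max(memberships)

    def mixed_min_max(memberships, lam):
        """Softened operator: a linear blend of min and max.

        lam = 1 gives the strict fuzzy AND, lam = 0 the strict fuzzy OR.
        (Coefficient handling is simplified here for illustration.)
        """
        return lam * min(memberships) + (1.0 - lam) * max(memberships)

    print(fuzzy_and([0.8, 0.3]), fuzzy_or([0.8, 0.3]), mixed_min_max([0.8, 0.3], 0.7))
    # 0.3 0.8 0.45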

2.3.2 Statistical Model

The vector space and probabilistic models are the two major examples of the statistical retrieval approach. Both models use statistical information in the form of term frequencies to determine the relevance of documents with respect to a query. Although they differ in the way they use the term frequencies, both produce as their output a list of documents ranked by their estimated relevance. The statistical retrieval models address some of the problems of Boolean retrieval methods, but they have disadvantages of their own. Table 2.4 provides a summary of the key features of the vector space and probabilistic approaches. We will also describe Latent Semantic Indexing and clustering approaches that are based on statistical retrieval approaches, but their objective is to respond to what the user's query did not say, could not say, but somehow made manifest [Furnas et al. 1983, Cutting et al. 1991].

2.3.2.1 Vector Space Model

The vector space model represents the documents and queries as vectors in a multidimensional space, whose dimensions are the terms used to build an index to represent the documents [Salton 1983]. The creation of an index involves lexical scanning to identify the significant terms, where morphological analysis reduces different word forms to common "stems", and the occurrence of those stems is computed. Query and document surrogates are compared by comparing their vectors, using, for example, the cosine similarity measure. In this model, the terms of a query surrogate can be weighted to take into account their importance, and they are computed by using the statistical distributions of the terms in the collection and in the documents [Salton 1983]. The vector space model can assign a high ranking score to a document that contains only a few of the query terms if these terms occur infrequently in the collection but frequently in the document. The vector space model makes the following assumptions: 1) The more similar a document vector is to a query vector, the more likely it is that the document is relevant to that query. 2) The words used to define the dimensions of the space are orthogonal or independent. While it is a reasonable first approximation, the assumption that words are pairwise independent is not realistic.
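
A compact sketch of ranking by cosine similarity over sparse term-weight vectors follows (the document vectors and query are invented tf-idf-style weights, for illustration only):

    import math

    def cosine(u, v):
        """Cosine of the angle between two sparse vectors (dicts of term -> weight)."""
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
        return dot / norm if norm else 0.0

    def rank(documents, query):
        """Return document ids sorted by decreasing cosine similarity to the query."""
        return sorted(documents, key=lambda d: cosine(documents[d], query), reverse=True)

    # Hypothetical weighted vectors.
    documents = {
        "d1": {"vector": 0.8, "space": 0.6},
        "d2": {"boolean": 0.9},
        "d3": {"vector": 0.2, "ranking": 0.7},
    }
    print(rank(documents, {"vector": 1.0, "ranking": 0.5}))  # ['d1', 'd3', 'd2']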

2.3.2.2 Probabilistic Model

The probabilistic retrieval model is based on the Probability Ranking Principle, which states that an information retrieval system is supposed to rank the documents based on their probability of relevance to the query, given all the evidence available [Belkin and Croft 1992]. The principle takes into account that there is uncertainty in the representation of the information need and the documents. There can be a variety of sources of evidence that are used by the probabilistic retrieval methods, and the most common one is the statistical distribution of the terms in both the relevant and non-relevant documents.

We will now describe the state-of-the-art system developed by Turtle and Croft (1991) that uses Bayesian inference networks to rank documents by using multiple sources of evidence to compute the conditional probability
P(Info need|document) that an information need is satisfied by a given document. An inference network consists of a directed acyclic dependency graph, where edges represent conditional dependency or causal relations between propositions represented by the nodes. The inference network consists of a document network, a concept representation network that represents indexing vocabulary, and a query network representing the information need. The concept representation network is the interface between documents and queries. To compute the rank of a document, the inference network is instantiated and the resulting probabilities are propagated through the network to derive a probability associated with the node representing the information need. These probabilities are used to rank documents.

The statistical approaches have the following strengths: 1) They provide users with a relevance ranking of the retrieved documents. Hence, they enable users to control the output by setting a relevance threshold or by specifying a certain number of documents to display. 2) Queries can be easier to formulate because users do not have to learn a query language and can use natural language. 3) The uncertainty inherent in the choice of query concepts can be represented. However, the statistical approaches have the following shortcomings: 1) They have a limited expressive power. For example, the NOT operation cannot be represented because only positive weights are used. It can be proven that only 2^(N^2) of the 2^(2^N) possible Boolean queries can be generated by the statistical approaches that use weighted linear sums to rank the documents. This result follows from the analysis of Linear Threshold Networks or Boolean Perceptrons [Anthony and Biggs 1992]. For example, the very common and important Boolean query ((A and B) or (C and D)) can not be represented by a vector space query (see section 5.4 for a proof). Hence, the statistical approaches do not have the expressive power of the Boolean approach. 2) The statistical approach lacks the structure to express important linguistic features such as phrases. Proximity constraints are also difficult to express, a feature that is of great use for experienced searchers. 3) The computation of the relevance scores can be computationally expensive. 4) A ranked linear list provides users with a limited view of the information space and it does not directly suggest how to modify a query if the need arises [Spoerri 1993, Hearst 1994]. 5) The queries have to contain a large number of words to improve the retrieval performance. As is the case for the Boolean approach, users are faced with the problem of having to choose the appropriate words that are also used in the relevant documents.

Table 2.4 summarizes the advantages and disadvantages that are specific to the vector space and probabilistic model, respectively. This table also shows the formulas that are commonly used to compute the term weights. The two central quantities used are the inverse document frequency of a term in the collection (idf), and the frequency of a term i in a document j (freq(i,j)). In the probabilistic model, the weight computation also considers how often a term appears in the relevant and irrelevant documents, but this presupposes that the relevant documents are known or that these frequencies can be reliably estimated.




--------------------------------------------------------------------------------







Table 2.4: summarizes the defining characteristics of the statistical retrieval approach, which includes the vector space and the probabilistic model, and lists their key advantages and disadvantages.


--------------------------------------------------------------------------------

If users provide the retrieval system with relevance feedback, then this information is used by the statistical approaches to recompute the weights as follows: the weights of the query terms in the relevant documents are increased, whereas the weights of the query terms that do not appear in the relevant documents are decreased [Salton and Buckley 1990]. There are multiple ways of computing and updating the weights, where each has its advantages and disadvantages. We do not discuss these formulas in more detail, because research on relevance feedback has shown that significant effectiveness improvements can be gained by using quite simple feedback techniques [Salton and Buckley 1990]. Furthermore, what is important to this thesis is that the statistical retrieval approach generates a ranked list; how this ranking has been computed in detail is immaterial for the purpose of this thesis.
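
One standard member of that family of simple feedback techniques is a Rocchio-style query update; the sketch below (parameter values and vectors chosen arbitrarily) shows terms from relevant documents gaining weight and terms from non-relevant documents losing it:

    def rocchio(query, relevant_docs, nonrelevant_docs, alpha=1.0, beta=0.75, gamma=0.15):
        """Rocchio-style query update (one common simple feedback technique).

        Adds the centroid of the relevant document vectors to the query and
        subtracts the centroid of the non-relevant ones; negative weights are
        clipped to zero.
        """
        terms = set(query)
        for d in relevant_docs + nonrelevant_docs:
            terms |= set(d)
        new_query = {}
        for t in terms:
            pos = sum(d.get(t, 0.0) for d in relevant_docs) / len(relevant_docs) if relevant_docs else 0.0
            neg = sum(d.get(t, 0.0) for d in nonrelevant_docs) / len(nonrelevant_docs) if nonrelevant_docs else 0.0
            w = alpha * query.get(t, 0.0) + beta * pos - gamma * neg
            if w > 0:
                new_query[t] = w
        return new_query

    print(rocchio({"vector": 1.0},
                  relevant_docs=[{"vector": 0.5, "space": 0.8}],
                  nonrelevant_docs=[{"boolean": 0.9}]))
    # e.g. {'vector': 1.375, 'space': 0.6} (term order may vary)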

2.3.2.3 Latent Semantic Indexing

Several statistical and AI techniques have been used in association with domain semantics to extend the vector space model to help overcome some of the retrieval problems described above, such as the "dependence problem" or the "vocabulary problem". One such method is Latent Semantic Indexing (LSI). In LSI the associations among terms and documents are calculated and exploited in the retrieval process. The assumption is that there is some "latent" structure in the pattern of word usage across documents and that statistical techniques can be used to estimate this latent structure. An advantage of this approach is that queries can retrieve documents even if they have no words in common. The LSI technique captures deeper associative structure than simple term-to-term correlations and is completely automatic. The only difference between LSI and vector space methods is that LSI represents terms and documents in a reduced dimensional space of the derived indexing dimensions. As with the vector space method, differential term weighting and relevance feedback can improve LSI performance substantially.
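
Mechanically, LSI amounts to a truncated singular value decomposition of the term-document matrix. The following numpy sketch uses one common convention for folding a query into the reduced space; the matrix and the number of dimensions are invented for illustration:

    import numpy as np

    def lsi(term_doc_matrix, k):
        """Latent Semantic Indexing sketch: truncated SVD of a term-document matrix.

        Returns the rank-k document representations (k latent dimensions) and a
        function that folds a term-space query vector into the same reduced space.
        """
        U, s, Vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
        U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
        docs_k = (np.diag(s_k) @ Vt_k).T      # one row per document, k latent dimensions

        def fold_in(query_vector):
            """Project a term-space query vector into the k-dimensional latent space."""
            return query_vector @ U_k

        return docs_k, fold_in

    # Tiny illustrative matrix: rows = terms, columns = documents.
    A = np.array([[1.0, 1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 1.0, 1.0],
                  [0.0, 0.0, 1.0]])
    docs_k, fold_in = lsi(A, k=2)
    q = fold_in(np.array([1.0, 1.0, 0.0, 0.0]))   # query using the first two terms
    # Compare the query and the documents with cosine similarity in the reduced space.
    scores = docs_k @ q / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q))
    print(scores.round(2))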

Foltz and Dumais (1992) compared four retrieval methods that are based on the vector-space model. The four methods were the result of crossing two factors, the first factor being whether the retrieval method used Latent Semantic Indexing or keyword matching, and the second factor being whether the profile was based on words or phrases provided by the user (Word profile), or documents that the user had previously rated as relevant (Document profile). The LSI match-document profile method proved to be the most successful of the four methods. This method combines the advantages of both LSI and the document profile. The document profile provides a simple, but effective, representation of the user's interests. Indicating just a few documents that are of interest is as effective as generating a long list of words and phrases that describe one's interest. Document profiles have an added advantage over word profiles: users can just indicate documents they find relevant without having to generate a description of their interests.

2.3.3 Linguistic and Knowledge-based Approaches

In the simplest form of automatic text retrieval, users enter a string of keywords that are used to search the inverted indexes of the document keywords. This approach retrieves documents based solely on the presence or absence of exact single word strings as specified by the logical representation of the query. Clearly this approach will miss many relevant documents because it does not capture the complete or deep meaning of the user's query. The Smart Boolean approach and the statistical retrieval approaches, each in their specific way, try to address this problem (see Table 2.5). Linguistic and knowledge-based approaches have also been developed to address this problem by performing a morphological, syntactic and semantic analysis to retrieve documents more effectively [Lancaster and Warner 1993]. In a morphological analysis, roots and affixes are analyzed to determine the part of speech (noun, verb, adjective etc.) of the words. Next complete phrases have to be parsed using some form of syntactic analysis. Finally, the linguistic methods have to resolve word ambiguities and/or generate relevant synonyms or quasi-synonyms based on the semantic relationships between words. The development of a sophisticated linguistic retrieval system is difficult and it requires complex knowledge bases of semantic information and retrieval heuristics. Hence these systems often require techniques that are commonly referred to as artificial intelligence or expert systems techniques.

2.3.3.1 DR-LINK Retrieval System

We will now describe in some detail the DR-LINK system developed by Liddy et al., because it represents an exemplary linguistic retrieval system. DR-LINK is based on the principle that retrieval should take place at the conceptual level and not at the word level. Liddy et al. attempt to retrieve documents on the basis of what people mean in their query and not just what they say in their query. The DR-LINK system employs sophisticated linguistic text processing techniques to capture the conceptual information in documents. Liddy et al. have developed a modular system that represents and matches text at the lexical, syntactic, semantic, and the discourse levels of language. Some of the modules that have been incorporated are: The Text Structurer is based on discourse linguistic theory that suggests that texts of a particular type have a predictable structure which serves as an indication of where certain information can be found. The Subject Field Coder uses an established semantic coding scheme from a machine-readable dictionary to tag each word with its disambiguated subject code (e.g., computer science, economics) and to then produce a fixed-length, subject-based vector representation of the document and the query. The Proper Noun Interpreter uses a variety of processing heuristics and knowledge bases to produce: a canonical representation of each proper noun; a classification of each proper noun into thirty-seven categories; and an expansion of group nouns into their constituent proper noun members. The Complex Nominal Phraser provides means for precise matching of complex semantic constructs when expressed as either adjacent nouns or a non-predicating adjective and noun pair. Finally, the Natural Language Query Constructor takes as input a natural language query and produces a formal query that reflects the appropriate logical combination of text structure, proper noun, and complex nominal requirements of the user's information need. This module interprets a query into pattern-action rules that translate each sentence into a first-order logic assertion, reflecting the Boolean-like requirements of queries.



--------------------------------------------------------------------------------



Table 2.5: characterizes the major retrieval methods in terms of how they deal with lexical, morphological, syntactic and semantic issues.


--------------------------------------------------------------------------------

To summarize, the DR-LINK retrieval system represents content at the conceptual level rather than at the word level to reflect the multiple levels of human language comprehension. The text representation combines the lexical, syntactic, semantic, and discourse levels of understanding to predict the relevance of a document. DR-LINK accepts natural language statements, which it translates into a precise Boolean representation of the user's relevance requirements. It also produces summary-level, semantic vector representations of queries and documents to provide a ranking of the documents.

2.4 Conclusion

There is a growing discrepancy between the retrieval approach used by existing commercial retrieval systems and the approaches investigated and promoted by a large segment of the information retrieval research community. The former is based on the Boolean or Exact Matching retrieval model, whereas the latter ones subscribe to statistical and linguistic approaches, also referred to as the Partial Matching approaches. First, the major criticism leveled against the Boolean approach is that its queries are difficult to formulate. Second, the Boolean approach makes it possible to represent structural and contextual information that would be very difficult to represent using the statistical approaches. Third, the Partial Matching approaches provide users with a ranked output, but these ranked lists obscure




--------------------------------------------------------------------------------



Table 2.6: lists some of the key problems in the field of information retrieval and possible solutions.


--------------------------------------------------------------------------------

valuable information. Fourth, recent retrieval experiments have shown that the Exact and Partial matching approaches are complementary and should therefore be combined [Belkin et al. 1993].

In Table 2.6 we summarize some of the key problems in the field of information retrieval and possible solutions to them. We will attempt to show in this thesis: 1) how visualization can offer ways to address these problems; 2) how to formulate and modify a query; 3) how to deal with large sets of retrieved documents, commonly referred to as the information overload problem. In particular, this thesis overcomes one of the major "bottlenecks" of the Boolean approach by showing how Boolean coordination and its diverse narrowing and broadening techniques can be visualized, thereby making it more user-friendly without limiting its expressive power. Further, this thesis shows how both the Exact and Partial Matching approaches can be visualized in the same visual framework to enable users to make effective use of their respective strengths.