Information Access for A Digital Library: Cheshire II and the Berkeley Environmental Digital Library

 

Ray R. Larson

School of Information Management and Systems, University of California, Berkeley, Berkeley, California

 

Chad Carson

Computer Science Division, EECS, University of California, Berkeley, Berkeley, California

 

 

Abstract

 

 The Cheshire II system was originally developed to provide a bridge from conventional online library catalogs to full-text online resources. Recently we have begun using the system to implement full-text and fielded searching of bibliographic information for the UC Berkeley Digital Library Initiative project sponsored by NSF, NASA and ARPA. The Cheshire system is also being used to provide scalable performance for image querying using the “Blobworld” image representation. This paper will review the characteristics of the Cheshire II system and examine its performance and behavior when applied to a collection of large full-text documents in the TREC Interactive Retrieval Track and its performance in Blobworld image searching.

 

INTRODUCTION

 

Digital libraries, and the technologies that support them, are increasingly finding application in the burgeoning internet environment. These digital libraries range from simple aggregations of HTML pages on individual WWW sites to multimedia databases with complex structure and supporting applications. Research is still ongoing as to what constitutes a Digital Library and its facilities, organizational principles, technical support mechanisms and contents.

Digital documents -- whether simple text files, HTML-based WWW pages, digital photographs, a multimedia encyclopedia, or a collection of simple page images -- will usually participate in larger aggregations of other digital documents. These aggregations of documents can be organized based on a variety of criteria, ranging from simple “file name order” to complex multimedia databases and richly interlinked hypertext structures on WWW sites. With increasing frequency, these collections of digital documents are being described as “Digital Libraries.”

 

The Digital Library Initiative, sponsored by NSF, NASA and DARPA, began in 1996 and recently completed the first cycle of funding for projects researching Digital Libraries. This paper reports on part of the work sponsored by that program at the University of California, Berkeley in developing search services for full text and content-based image searching in the Environmental Digital Library. In the following discussion we will concentrate on the use of the Cheshire II information retrieval system as it has been applied to the information access needs of the UC Berkeley Environmental Digital Library. We will also examine some preliminary work on evaluation of the Cheshire II information retrieval algorithms and their performance in the TREC interactive track and in providing efficient content-based image retrieval using the Blobworld image segmenting system developed by Carson, et al. (1998).

 

We begin by describing the Cheshire II system and its characteristics. Then we describe its use in the UC Berkeley Environmental Digital Library for full-text and content-based image retrieval and present some results from evaluation of the Cheshire II system in conjunction with the TREC conference and some preliminary evaluation of content-based image searching.

THE CHESHIRE II SYSTEM

When the Cheshire II system was first conceived (in the late 1980s) the aim was to develop a “next-generation” online library catalog system that could provide ranked retrieval based on probabilistic retrieval methods, while still supporting the Boolean retrieval methods expected in the online catalog systems of that era. Since that time the system has been constantly redesigned and updated to accommodate the information retrieval needs of a much broader world. The Cheshire II system now finds its primary usage in full text or structured metadata collections based on SGML and XML, often as the search engine behind a variety of WWW-based “search pages” or as a Z39.50 (1995) server for particular applications. Current applications of the Cheshire II system include:

 

1.        Use as a web-based library catalog for the UC Berkeley Physical Sciences libraries.

2.        Use as the primary search engine for the UC Berkeley NSF/NASA/ARPA Digital Library Project (where the system is used for searching full-text OCR from scanned page image files linked to SGML bibliographic records, as well as supporting image content retrieval as part of the “Blobworld” research effort discussed later in this paper).

3.        Use as the search engine for the History Data Service at the University of Essex in the UK, as part of the Z39.50-based Arts and Humanities Data Service (AHDS).

4.        Use as the search engine for collections of archival documents (in the Encoded Archival Description DTD) at the University of Liverpool in the UK.

5.        Use as the metadata search engine for the Networked Social Science Tools and Resources (NESSTAR) project based at The Data Archive (Essex University) in the UK, the Danish Data Archives and Norwegian Social Science Data Services.

6.        Use for California Sheet Music Project at UC Berkeley, linking MARC metadata with page images of early California music publications.

7.        Use in the DARPA-sponsored Unfamiliar Metadata Vocabularies project at UC Berkeley where it is being used for probabilistic cross-language entry vocabulary mapping.

8.        Use as the search engine in the ChaCha intranet web search system used for the UC Berkeley domain (Alper 1998; http://cha-cha.berkeley.edu/).

 

The Cheshire II system includes the following features:

 

1.        It supports SGML and XML as the primary database format of the underlying search engine. The system also provides support for full-text data linked to SGML or XML metadata records. MARC format records for traditional online catalog databases are supported using MARC to SGML conversion software developed for the project.

2.        It is a client/server application where the interfaces (clients) communicate with the search engine (server) using the Z39.50 v.3 Information Retrieval Protocol. The system also provides a general Z39.50 Gateway with support for mapping Z39.50 queries to local Cheshire databases and to relational databases.

3.        It includes a programmable graphical direct manipulation interface under X on Unix and Windows NT. There is also a CGI interpreter version that combines client and server capabilities. These interfaces permit searches of the Cheshire II search engine as well as any other Z39.50-compatible search engine on the network.

4.        It permits users to enter natural language queries, which may be combined with Boolean logic for users who wish to use it.

5.        It uses probabilistic ranking methods based on the Logistic Regression research carried out at Berkeley to match the user's initial query with documents in the database. In some databases it can provide two-stage searching where a set of “classification clusters” (Larson 1991) is first retrieved in decreasing order of probable relevance to the user's search statement. These clusters can then be used to provide feedback about the primary topical areas of the query, and retrieve documents within the topical area of the selected clusters. This aids the user in subject focusing and topic/treatment discrimination. Similar facilities are used in the Unfamiliar Metadata Vocabularies project at Berkeley for mapping users’ natural language expressions of topics to appropriate controlled vocabularies (http://sims.berkeley.edu/research/metadata).

6.        It supports open-ended, exploratory browsing through following dynamically established linkages between records in the database, in order to retrieve materials related to those already found. These can be dynamically generated “hypersearches” that let users issue a Boolean query with a mouse click to find all items that share some field with a displayed record.

7.        It uses the user's selection of relevant citations to refine the initial search statement and automatically construct new search statements for relevance feedback searching.

8.        All of the client and server facilities can be adapted to specific applications using the Tcl scripting language.

 

The Cheshire II Search Engine

 

The Cheshire II search engine supports both probabilistic and Boolean searching. The design rationale and features of the Cheshire II search engine have been discussed elsewhere (Larson, et al. 1996; Larson & McDonough, 1997) and will only be briefly repeated here.

 

The search engine typically functions as a Z39.50 information retrieval protocol server providing access to a set of databases. Alternatively the search engine can be accessed via the HTTP CGI mechanism using the scriptable version of the server software (called WebCheshire). In either case, the system supports various methods for translating a searcher's query into the terms used in indexing the database. These methods include elimination of unused words using field-specific stopword lists, particular field-specific query-to-key conversion or “normalization” functions, standard stemming algorithms (Porter stemmer) and support for mapping database and query text words to single forms based on the WordNet dictionary and thesaurus using an adaptation of the WordNet “Morphing” algorithm and exception dictionary.

 

However, the primary functionality that distinguishes the Cheshire II search engine is support for probabilistic searching on any indexed element of the database. This means that a natural language query can be used to retrieve the records that have the highest estimated probability of being relevant given the user's query. In both cluster searching and direct probabilistic searching of the database, the Cheshire II search engine supports a very simple form of relevance feedback, where any items found in an initial search (Boolean or probabilistic) can be selected and used as queries in a relevance feedback search.

 


The probabilistic retrieval algorithm used in the Cheshire II search engine is based on the logistic regression algorithms developed by Berkeley researchers (Cooper, et al. 1992, 1994a, 1994b). Formally, the probability of relevance given a particular query and a particular record in the database P(R | Q, D) is calculated and the records are presented to the user ranked in order of decreasing values of that probability. In the Cheshire II system P(R | Q, D) is calculated as the “log odds” of relevance logO(R | Q, D), where for any events A and B the odds O(A | B) is the ratio P(A | B) / P(~A | B). The Logistic Regression method provides estimates for a set of coefficients, ci, associated with a set of S statistics, Xi, derived from the query and database, such that
$$\log O(R \mid Q, D) = c_0 + \sum_{i=1}^{S} c_i X_i$$
where c0 is the intercept term of the regression, for the set of M terms (i.e., words, stems or phrases) that occur in both a particular query and a given document. The regression equation and coefficients used in the TREC-7 Interactive Track are the same as used in our TREC-6 entry. These are based on the TREC-3 Adhoc entry from Berkeley (Cooper, et al. 1994b) where the regression coefficients were estimated using relevance judgements from the TIPSTER test collection. The basic elements are:

 

$$X_1 = \frac{1}{M}\sum_{j=1}^{M} \log\, qtf_{t_j}$$
This is the log of the absolute frequency of occurrence for term tj in the query averaged over the M terms in common between the query and the document. The coefficient c1 used in the current version of the Cheshire II system is 1.269.

$$X_2 = \sqrt{QL}$$
This is the square root of the query length (i.e., the number of terms in the query disregarding stopwords). The c2 coefficient used is -0.310.

$$X_3 = \frac{1}{M}\sum_{j=1}^{M} \log\, dtf_{t_j}$$
This is the log of the absolute frequency of occurrence for term tj in the document averaged over the M common terms. The c3 coefficient used is 0.679.

$$X_4 = \sqrt{DL}$$
This is the square root of the document length. In Cheshire II the raw size of the document in bytes is used for the document length. The c4 coefficient used is -0.0674.

 

$$X_5 = \frac{1}{M}\sum_{j=1}^{M} \log\, IDF_{t_j}$$
This is the log of the inverse document frequency (IDF) for term tj in the document averaged over the M common terms. The IDF is calculated as the total number of documents in the database, divided by the number of documents that contain term tj. The c5 coefficient used is 0.223.

$$X_6 = \log M$$
This is the log of the number of common terms. The c6 coefficient used in Cheshire II is 2.01.

 


These coefficients and elements of the ranking algorithm have proven to be quite robust and useful across a broad range of document types.
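As a concrete illustration of how these statistics combine, the following sketch (ours, not the Cheshire II source) computes the log odds score for one document from precomputed term statistics. The intercept c0 is not restated in this paper, so it is left as a parameter; all variable names are our own.

import math

# Coefficients c1..c6 as reported above.
C1, C2, C3, C4, C5, C6 = 1.269, -0.310, 0.679, -0.0674, 0.223, 2.01

def log_odds_relevance(query_tf, doc_tf, doc_len_bytes, n_docs, doc_freq, c0=0.0):
    """Estimate logO(R | Q, D) for one document.

    query_tf -- {term: frequency in the query (stopwords removed)}
    doc_tf   -- {term: frequency in the document}
    doc_freq -- {term: number of documents containing the term}
    """
    common = [t for t in query_tf if t in doc_tf]
    M = len(common)
    if M == 0:
        return float("-inf")                                      # no overlap, no evidence of relevance
    x1 = sum(math.log(query_tf[t]) for t in common) / M           # mean log query term frequency
    x2 = math.sqrt(sum(query_tf.values()))                        # sqrt of query length
    x3 = sum(math.log(doc_tf[t]) for t in common) / M             # mean log document term frequency
    x4 = math.sqrt(doc_len_bytes)                                 # sqrt of document length (bytes)
    x5 = sum(math.log(n_docs / doc_freq[t]) for t in common) / M  # mean log IDF
    x6 = math.log(M)                                              # log of number of common terms
    return c0 + C1*x1 + C2*x2 + C3*x3 + C4*x4 + C5*x5 + C6*x6

Documents are then presented in decreasing order of this score (equivalently, of the corresponding probability of relevance).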

 

Probabilistic searching, as noted above, requires only a natural language statement of the searcher's topic, and thus no formal query language or Boolean logic is needed for such searches. However, the Cheshire II search engine also supports complete Boolean operations on indexed elements in the database, and supports searches that combine probabilistic and Boolean elements. At present, combined probabilistic and Boolean search results are evaluated using the assumption that the Boolean retrieved set has an estimated P(R | Qbool,D) = 1.0 for each document in the set, and 0 for the rest of the collection. The final estimate for the probability of relevance used for ranking the results of a search combining Boolean and probabilistic strategies is simply:

$$P(R \mid Q, D) = P(R \mid Q_{prob}, D) \times P(R \mid Q_{bool}, D)$$
where P(R | Qprob,D) is the probability estimate from the probabilistic portion of the search, and P(R | Qbool,D) the estimate from the Boolean. This has the effect of restricting the results to those items that match the Boolean portion, with ordering based on the probabilistic portion.
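A minimal sketch of this combination, assuming the probabilistic scores and the Boolean result set have already been computed separately:

def combine_boolean_probabilistic(prob_scores, boolean_set):
    """Keep only documents in the Boolean result set (P(R | Qbool, D) = 1 there,
    0 elsewhere) and order them by the probabilistic estimate."""
    hits = {doc: score for doc, score in prob_scores.items() if doc in boolean_set}
    return sorted(hits, key=hits.get, reverse=True)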

 

Relevance feedback is implemented quite simply, as probabilistic retrieval based on extraction of content-bearing elements (such as titles, subject headings, etc.) from any items that have already been seen and selected by a user. Similarly, multiple records may be selected and submitted for feedback searching. In this case the contents of all those records are merged into a single query and submitted for searching. A feedback search is accomplished by parsing the selected record(s) and extracting the record elements specified for the index used for topical searching (as specified in the database configuration file). Each of these record elements is combined to form a single query, which is then submitted to the same probabilistic retrieval process described above. At the present time we do not use any methods for eliminating poor search terms from the selected records, nor special enhancements for terms common between multiple selected records (Salton & Buckley 1990), but we plan to experiment further with various enhancements to our relevance feedback method.
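The following sketch illustrates that feedback query construction; the topical field names are placeholders, since Cheshire actually takes the record elements from the database configuration file:

def feedback_query(selected_records, topical_fields=("title", "subject", "abstract")):
    """Merge the content-bearing elements of the user-selected records into a
    single query string for a new probabilistic search. No term pruning or
    boosting of terms shared between records is performed, as noted above."""
    parts = []
    for record in selected_records:      # each record is a parsed field -> text mapping
        for field in topical_fields:
            parts.append(record.get(field, ""))
    return " ".join(p for p in parts if p)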

 

Cheshire II and the UC Berkeley Digital Library

 

The UC Berkeley Environmental Digital Library has had a primary research and implementation focus of developing “Work-Centered Information Services” to support the work practices of persons involved in environmental planning and evaluation for the State of California (Wilensky 1995, 1996). The collaborative partners in developing the contents and functionality of the Environmental Digital Library include the State of California’s Resources Agency (and especially the Department of Water Resources and the Department of Fish and Game) as well as various regional and county agencies within the state.

 

The Environmental Digital Library collections began as a testbed for research in computer science and information technology; it has since become a valuable repository of environmental and biological information. As of mid 1999, the collection represents about three quarters of a terabyte of data, including over 70,000 digital images, over 300,000 pages of environmental documents, and over a million records in geographical and botanical databases. All of these data are accessible in online searchable databases; they are also freely available for the purpose of research and experimentation. The types of data included are:

 

•       Botanical Data: The CalFlora Database contains taxonomical and distribution information for more than 8000 native California plants. The Occurrence Database includes over 300,000 records of California plant sightings from many federal, state, and private sources. The botanical databases are linked to our CalPhotos collection of California plants, and are also linked to external collections of data, maps, and photos.

 

•       Geographical Data: Much of the geographical data in our collection is being used to develop our web-based GIS Viewer. The Street Finder uses 500,000 Tiger records of S.F. Bay Area streets along with the 70,000-record USGS GNIS database. California Dams is a database of information about the 1395 dams under state jurisdiction. An additional 11 GB of geographical data represents maps and imagery that have been processed for inclusion as layers in our GIS Viewer. This includes Digital Ortho Quads and DRG maps for the S.F. Bay Area.

 

•       Documents: Most of the 300,000 pages of digital documents are environmental reports and plans that were provided by California state agencies. This collection includes documents, maps, articles, and reports on the California environment including Environmental Impact Reports (EIRs), educational pamphlets, water usage bulletins, and county plans. Documents in this collection come from the California Department of Water Resources (DWR), California Department of Fish and Game (DFG), San Diego Association of Governments (SANDAG), and many other agencies. Among the most frequently accessed documents are County General Plans for every California county and a survey of 125 Sacramento Delta fish species. The collection also includes about 20 MB of full-text (HTML) documents from the World Conservation Digital Library. In addition to providing online access to important environmental documents, the document collection is the testbed for our Multivalent Document research.

 

•       Photographs: The photo collection includes 17,000 images of California natural resources from the state Department of Water Resources, several hundred aerial photos, 17,000 photos of California native plants from St. Mary's College, the California Academy of Sciences, and others, a small collection of California animals, and 40,000 Corel stock photos. The Corel images are used within the project for computer vision research such as the Blobworld work discussed later.

 

This paper is concerned primarily with access to the Document and Photograph collections of the Environmental Digital Library. For the Documents collection the primary search method is the Cheshire II system. The preparation of  text documents for searching includes the following steps:

 

1.        A document is scanned on a XEROX 620 Document Capture Machine. The scan software, Xerox Documents on Demand, runs on a PC. Following is a list of files created by the scan process. (XXX is the Elib ID, assigned to it at scan time.)

a.        XXX.con -- directory containing tiff files and the bib.elib file

b.       00000000.tif -- 600 dpi tiff files, one file per page in the document 

c.      bib.elib -- file containing document information (author, title, Elib ID, number of pages, date, etc.). This information is entered by the person running the scanning operation. The file is used as the basis for the SGML metadata record used in Cheshire II indexing.

 

2.        The XXX.con directory is moved to a UNIX machine's local disk.

 

3.        From the bib.elib file, a similar file called bib.rfc1357 is created. The bib.rfc1357 file is for compatibility with a Dienst document server, though we do not currently use a Dienst document server.

 

4.        GIF files are made from the 600 dpi tiffs using PBM utilities.

 

5.        The 600 dpi tiffs are run twice through OCR processing, using ScanWorx, to produce the following output, stored in separate sub-directories:

a.        OCR ASCII text -- plain ASCII text 

b.       OCR XDOC output -- contains word position information

 

6.        A file called hyperOCR is created. This file contains all the plain OCR output for a document plus hyperlinks to the GIF images.

 

7.        The document is moved to its final location from which it will be accessed via the Digital Library Web pages and the Cheshire II search engine.

 

8.        The document is indexed for Cheshire II searches. First the basic bib.elib record created in step 1-c above is converted to an SGML record that includes additional tagged fields containing the location of the hyperOCR file for the document and the location of the directory containing the individual page OCR text files (a sketch of this conversion appears after the index list below). This new record is appended to the Cheshire database file for the Digital Library documents collection. Next, indexing is run on batches of new records added to the Cheshire database. The indexes created by the indexing process include:

 

a.        Author: This keyword index is derived from the author fields (both personal and corporate) of the SGML record.

b.       Title: A keyword index derived from the title fields of the SGML.

c.        Xtitle: A phrase index (exact title with terms in original order) derived from the title fields of the SGML.

d.       Doctype: A keyword index derived from the document type field of the SGML records (Types include “County General Plan”, “Environmental Impact Statement”, “Report”, etc.)

e.        Localnum: A keyword index containing the Environmental Digital Library ID number for the document.

f.         Project: A keyword index containing the name of the project associated with this document. Projects include, for example: SNEP (the Sierra Nevada Ecosystem Project) and CORMP (the California Ocean Resources Management Program).

g.       Series: A keyword index for series information from the SGML record.

h.       Date: A keyword index for the date of the document from the SGML record.

i.         Topic: A keyword full-text index containing all words from all of the SGML record fields combined with all words extracted from the hyperocr version of the document described above.

j.         Pages: A keyword full-text index containing all words from each individual OCR’ed page created in step 5-a. above. This index is used to provide page-level full-text access to the document.
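The sketch below gives a rough picture of the record-building part of step 8: it assembles an SGML metadata record from bib.elib-style fields plus the hyperOCR and page-OCR locations. The tag names are illustrative only and do not reproduce the project's actual DTD.

def make_sgml_record(bib, hyperocr_path, page_ocr_dir):
    """Build a metadata record to append to the Cheshire database file.
    `bib` holds the fields entered at scan time (author, title, Elib ID,
    document type, date, etc.); the tag names here are hypothetical."""
    return (
        "<DOC>"
        f"<AUTHOR>{bib['author']}</AUTHOR>"
        f"<TITLE>{bib['title']}</TITLE>"
        f"<DOCTYPE>{bib['doctype']}</DOCTYPE>"
        f"<LOCALNUM>{bib['elib_id']}</LOCALNUM>"
        f"<DATE>{bib['date']}</DATE>"
        f"<HYPEROCR>{hyperocr_path}</HYPEROCR>"
        f"<PAGEDIR>{page_ocr_dir}</PAGEDIR>"
        "</DOC>"
    )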

 

Access to the documents collection is provided by an HTML form that provides entry widgets for the indexes above (either as text boxes for keyword full-text indexes or selection “pick-lists” for limited vocabulary indexes such as Project names). The user is also given the option of searching the full text at the document or the individual page level. During a search, the user-provided information in the HTML form is passed to the CGI version of Cheshire II where a Tcl script directs retrieval and formatting of the selected SGML records for display. The Environmental Digital Library document search page is available at http://elib.cs.berkeley.edu/docs/query.shtml. Access to all collections and methods is available via http://elib.cs.berkeley.edu/matrix.html.

 

Most of the photographs in the Environmental Digital Library collection were originally slides that were processed by a photo lab and provided to the Digital Library Project on Kodak PhotoCD. (The exception is the Corel stock photo collection, which is already on CD.) This technology provides up to six resolutions of each image in Kodak's PCD format, which is stable and convenient for automatically processing large numbers of images. The images arrive with about 100 on a CD and they are typically processed in batches of several hundred at a time. The PCD images are copied onto magnetic disk storage and two of the resolutions are converted to JPEG format for web browsing. The JPEGs are stored on disk in a UNIX filesystem by PhotoCD number.

 

In addition to the images themselves, we are also provided with a description of each image by the photographer or institution that owns the collection. Typically this arrives as a delimited text dump of the collection's database, often a PC-based database such as FileMaker Pro or Microsoft Access. For example, the minimum we look for on our flora and fauna pictures is a taxonomic name for each picture, as well as an identifying number that maps the description to the actual photograph - usually this is the PhotoCD number. If available, we also like to have the photographer's name, the date and place of the photograph, and any other information that would be useful to people who are looking for pictures of plants and animals. In some cases, additional processing is done to extract the color histograms from an image and convert them to a form that can be queried in the database. The descriptions for each picture that we receive from photographers and institutions are automatically reformatted and loaded into a relational database table within our Informix Universal Server database.

 

Each database record contains a photo ID that maps to the actual photo stored on disk. For queries against the database metadata (including the principal colors that appear in a photo when available), an HTML query form is used. This query form accesses a CGI script that creates an SQL (Structured Query Language) query, delivers the query to the Informix database, and processes the results to create a new web page to display matching images from the collection(s). For content-based image retrieval, however, we do not rely on provided metadata, but instead use the Blobworld image segmentation and analysis system (Carson 1998).

 

BLOBWORLD IMAGE INDEXING AND RETRIEVAL

 

Perhaps the largest collections of data in digital libraries today (in terms of data storage size) are made up of images. These image collections are increasing in size, and more collections are made available as time goes by. The contents of these collections range from art and museum images to photographs of people and of the natural world. Some collections are proprietary and some are public access, but they usually share a common problem: they often contain very diverse images and they are usually poorly indexed as to the contents of the images. While text retrieval has been studied for more than 30 years, image retrieval systems are relatively new and the techniques for automatically indexing and retrieving images have not kept pace with the collections they are searching. There are many shortcomings of current image retrieval systems, largely due to their choice of image representations and to their methods of accessing those representations to find images. When users are searching a collection of images, we assume that they usually do so in hopes of finding images that contain particular objects (“things”) (Enser 1993; Forsythe, Malik & Wilensky 1997), but most existing image retrieval systems represent images based only on their low-level features (“stuff”, such as color histograms or textures), with little regard for the spatial organization of those features.

 

Most image retrieval systems provide few clues to users about why particular images are returned as the results of a query, and usually can provide no insight about refining the query for more accurate retrieval. Often the user knows only that he has submitted a query for, say, a brown bear but in return has retrieved many irrelevant images and very few pictures of bears. In this section we briefly describe an image representation (“Blobworld”) based on finding regions of coherent color and texture in images. These regions usually correspond to objects or parts of objects; thus querying in Blobworld is more meaningful than it is with simpler “stuff” representations.

 

The Blobworld representation uses the Expectation-Maximization (EM) algorithm to perform automatic segmentation based on image features. EM iteratively models the joint distribution of color and texture with a mixture of Gaussians; the resulting pixel-cluster memberships provide a segmentation of the image. After the image is segmented into regions, a description of each region's color, texture, and spatial characteristics is produced. In a querying task, the user can access the regions directly, in order to see the segmentation of the query image and specify which aspects of the image are important to the query. When query results are returned, the user sees the Blobworld representation of the returned images; this assists greatly in understanding “false drops” and in refining the query.

 

Portions of this work have been published in (Carson, et al. 1997; Carson, et al. 1998). A more detailed examination of the use of Cheshire with Blobworld is in preparation (Larson & Carson 2000).

 

Related Work

 

The best-known image database system is IBM's Query by Image Content (QBIC) (Flickner, et al. 1995), which allows an operator to specify various properties of a desired image. The system then displays a selection of potential matches to those criteria, sorted by a score of the appropriateness of the match. Region segmentation is largely manual. Photobook (Pentland, Picard & Sclaroff 1996) incorporates more sophisticated representations of texture and a degree of automatic segmentation. Other examples of systems that identify materials using low-level image properties include Virage (Gupta & Jain 1997), VisualSEEk (Smith & Chang 1995), Candid (Kelly, Cannon & Hush 1995), and Chabot (Carson & Ogle 1996; Ogle & Stonebraker 1995). None of these systems codes spatial organization in a way that supports object queries.

 

Promising work by Lipson et al. (1997) retrieves images based on spatial and photometric relationships within and across image regions. Little or no segmentation is done; the regions are derived from low-resolution images. Jacobs et al. (1995) have used multiresolution wavelet decompositions to perform queries based on iconic matching.

 

Classical object recognition techniques usually rely on clean segmentation of the object from the rest of the image or are designed for fixed geometric objects such as machine parts. Neither constraint holds in the case of natural images: the shape, size, and color of objects like tigers and airplanes are quite variable, and segmentation is imperfect. Clearly, classical object recognition does not apply. More recent techniques can identify specific objects drawn from a finite (on the order of 100) collection, but no present technique is effective at the general image analysis task, which requires both image segmentation and image classification.

 

The Blobworld approach uses the EM algorithm for segmentation based on color and texture features jointly. Earlier work has used EM and/or the Minimum Description Length (MDL) principle to perform segmentation based on motion (Ayer & Sawhney 1995), or scaled intensities (Wells, et al. 1995), but EM has not previously been used on joint color and texture. Related approaches such as deterministic annealing (Puzicha & Buhmann 1998) and classical clustering (Jain & Farrokhnia 1991) have been applied to texture segmentation without color.

 

Blobworld

 

The Blobworld representation is related to the notion of photographic or artistic scene composition. In the sense discussed by Shaft and Ramakrishnan (1996), the Blobworld descriptors constitute an example of a summary representation because they are concise and relatively easy to process in a querying framework.

 

Blobworld is distinct from color-layout matching as in QBIC (Flickner, et al. 1995) in that it is designed to find objects or parts of objects. Each image may be treated as an ensemble of “blobs” possessing a number of attributes. The average number of blobs in an image is six. Each blob represents a region of the image which is roughly homogeneous with respect to color and/or texture. A blob is described by its color distribution, mean texture descriptors, spatial centroid, and contour description.

 Extracting color and texture features

Our goal is to assign the pixels in the original image to a relatively small number of groups, where each group represents a set of pixels that are coherent in their color and local texture properties; the motivation is to reduce the amount of raw data while preserving the information needed to understand the image. Given the unconstrained nature of the images in our database, it is important that the tools we employ to meet this goal be as general as possible without sacrificing an undue amount of descriptive power.

 

Color

 

Color is a very important cue in extracting information from images. Color histograms are commonly used in content-based retrieval systems (Flickner, et al. 1995; Ogle & Stonebraker 1995; Swain & Ballard 1991) and have proven very useful. However, the global characterization of a color histogram is poor at, for example, distinguishing between a field of orange flowers and a tiger, because it lacks information about how the color is distributed spatially.  We believe that for image retrieval it is vital to group color in localized regions and to fuse color and textural properties.

 

Each image pixel has a three-dimensional color descriptor in the L*a*b* color space (Wyszecki & Stiles 1982). We use this color space because it is approximately perceptually uniform; thus distances in this space are meaningful.

 

 Texture

 

Texture is a well-researched property of image regions, and many texture descriptors have been proposed, including multi-orientation filter banks (Malik & Perona 1990) and the second-moment matrix (Foerstner 1994; Gaarding & Lindeberg 1995). We will not elaborate here on the classical approaches to texture segmentation and classification, both of which are challenging and well-studied tasks. Rather, we discuss a new perspective related to texture descriptors and texture grouping motivated by the content-based retrieval task. Whereas color is a point property, texture is a local-neighborhood property. It does not make sense to talk about the texture of zebra stripes at a particular pixel without specifying a neighborhood around that pixel. In order for a texture descriptor to be useful, it must provide an adequate description of the underlying texture parameters, and it must be computed in a neighborhood which is appropriate to the local structure being described.

 

The first requirement could be met to an arbitrary degree of satisfaction by using multi-orientation filter banks such as steerable filters; we chose a simpler method that is sufficient for our purposes. The second requirement, which may be thought of as the problem of scale selection, has not received the same level of attention in the literature. This is unfortunate, since texture descriptors computed at the wrong scale only confuse the issue.

 

In Blobworld we use a novel method of scale selection which works in tandem with a fairly simple but informative set of texture descriptors. The scale selection method is based on edge/bar polarity stabilization, and the texture descriptors arise from the windowed second moment matrix. Both are derived from the gradient of the image intensity, which we denote by ∇I. We compute ∇I using the first difference approximation along each dimension. This operation is often accompanied by smoothing, but we have found this preprocessing operation unnecessary for the images in our collection.

 


To make the notion of scale concrete, we define the scale to be the width of the Gaussian window within which the gradient vectors of the image are pooled. The second moment matrix for the vectors within this window, computed about each pixel in the image, can be approximated using
$$M_\sigma(x, y) = G_\sigma(x, y) \ast \left(\nabla I\right)\left(\nabla I\right)^{T}$$
where Gσ(x,y) is a separable binomial approximation to a Gaussian smoothing kernel with variance σ².

 

At each pixel location, Mσ(x,y) is a 2 × 2 symmetric positive semidefinite matrix; thus it provides us with three pieces of information about each pixel. Rather than work with the raw entries in Mσ, it is more common to deal with its eigenstructure (Bigun, et al. 1991; Foerstner 1994). Consider a fixed scale σ and pixel location; let λ1 and λ2 (λ1 ≥ λ2) denote the eigenvalues of Mσ at that location, and let φ denote the argument of the principal eigenvector of Mσ. When λ1 is large compared to λ2, the local neighborhood possesses a dominant orientation, as specified by φ. When the eigenvalues are comparable, there is no preferred orientation, and when both eigenvalues are negligible, the local neighborhood is approximately constant.

 

 Scale selection

 

We may think of σ as controlling the size of the integration window around each pixel within which the outer product of the gradient vectors is averaged. σ has been called the integration scale or artificial scale by various authors (Foerstner 1994; Gaarding & Lindeberg 1995) to distinguish it from the natural scale used in linear smoothing of raw image intensities. Note that σ = σ(x,y); the scale varies across the image.

 

In order to select the scale at which Mσ is computed, i.e., to determine the function σ(x,y), we make use of a local image property known as polarity. The polarity is a measure of the extent to which the gradient vectors in a certain neighborhood all point in the same direction. (In the computation of second moments, this information is lost in the outer product operation; i.e., gradient vector directions differing by 180° are indistinguishable.) The polarity at a given pixel is computed with respect to the dominant orientation φ in the neighborhood of that pixel. For ease of notation, let us consider a fixed scale and pixel location.

 

The process of selecting a scale is based on the derivative of the polarity with respect to scale. First, we compute the polarity at every pixel in the image for σk = k/2, k = 0, 1, …, 7, thus producing a “stack” of polarity images across scale. Then, for each k, the polarity image computed at scale σk is convolved with a Gaussian with standard deviation 2σk to yield a smoothed polarity image pσk. For each pixel, we select the scale as the first value of σk for which the difference between successive values of polarity (pσk − pσk−1) is less than 2%. By this process we are performing a soft version of local spatial frequency estimation, since the smoothed polarity tends to stabilize once the scale window encompasses one approximate period. Since we stop at σk = 3.5, the largest period we can detect is approximately 10 pixels. Note that in uniform regions the selected scale is not meaningful and is set to zero. We declare a region to be uniform if its mean contrast across scale is less than 0.1.
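A sketch of this scale-selection loop, assuming a helper polarity(image, sigma) that returns the per-pixel polarity map at a given scale (its internals, and the zeroing of uniform regions, are omitted here):

import numpy as np
from scipy.ndimage import gaussian_filter

def select_scales(image, polarity, k_max=7):
    """For each pixel, pick the first sigma_k = k/2 at which the smoothed
    polarity changes by less than 2% from the previous scale."""
    sigmas = [k / 2.0 for k in range(k_max + 1)]
    smoothed = [gaussian_filter(polarity(image, s), 2 * s) for s in sigmas]
    selected = np.full(smoothed[0].shape, sigmas[-1])
    settled = np.zeros(smoothed[0].shape, dtype=bool)
    for k in range(1, len(sigmas)):
        stable = (np.abs(smoothed[k] - smoothed[k - 1]) < 0.02) & ~settled
        selected[stable] = sigmas[k]
        settled |= stable
    return selected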

 

Texture features

 

Once a scale σ* is selected for each pixel, that pixel is assigned three texture descriptors. The first is the polarity at that scale, p = pσ*. The other two, which are taken from Mσ*, are the anisotropy, defined as a = 1 − λ2/λ1, and the normalized texture contrast, defined as c = 2√(λ1 + λ2).
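The sketch below shows one way these quantities can be computed (a Gaussian filter stands in for the separable binomial kernel, and the polarity map at the selected scale is assumed to be available already):

import numpy as np
from scipy.ndimage import gaussian_filter

def texture_descriptors(gray, sigma, polarity_at_scale):
    """Windowed second-moment matrix at scale sigma, then the anisotropy
    a = 1 - l2/l1 and normalized contrast c = 2*sqrt(l1 + l2)."""
    Iy, Ix = np.gradient(gray.astype(float))
    Mxx = gaussian_filter(Ix * Ix, sigma)
    Mxy = gaussian_filter(Ix * Iy, sigma)
    Myy = gaussian_filter(Iy * Iy, sigma)
    # Eigenvalues of the 2x2 symmetric matrix [[Mxx, Mxy], [Mxy, Myy]]
    mean = (Mxx + Myy) / 2.0
    root = np.sqrt(((Mxx - Myy) / 2.0) ** 2 + Mxy ** 2)
    l1, l2 = mean + root, mean - root
    anisotropy = 1.0 - l2 / np.maximum(l1, 1e-12)
    contrast = 2.0 * np.sqrt(np.maximum(l1 + l2, 0.0))
    # The three texture features used later: p*c, a*c, and c.
    return polarity_at_scale * contrast, anisotropy * contrast, contrast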

 

Combining color and texture features

 

Grouping based on the raw color and texture features would cause oversegmentation due to local color variation: for example, each stripe of a tiger would become its own region because its color is distinct from the color of its neighboring stripes. In order to avoid this problem, we smooth the color features using a Gaussian at the selected scale. In effect, a given textured patch in an image first has its texture properties extracted and then is replaced by a smooth patch of averaged color. In this manner, the color and texture properties in a given region are decoupled; for example, a zebra is a gray horse plus stripes.

 

The final color/texture descriptor for a given pixel consists of six values: three for color and three for texture. The three color components are the smoothed L*a*b* coordinates. The three texture components are ac, pc, and c, computed at the selected scale; the anisotropy and polarity are each modulated by the contrast, since they are meaningless in regions of low contrast. Note that the orientation and selected scale do not appear in the feature vector; as a result, grouping can occur across variations in scale and orientation.

 

Finally, we append the (x,y) position of the pixel to the feature vector. Including the position generally decreases oversegmentation and leads to smoother regions.

 

Grouping with the EM Algorithm

 

Once an image has been processed using the above feature extraction scheme, the result is a large set of feature vectors, which we may regard as points in an eight-dimensional feature space. In order to divide these points into groups, we make use of the Expectation-Maximization (EM) algorithm (Dempster, Laird & Rubin 1977) to determine the maximum likelihood parameters of a mixture of K Gaussians inside the feature space.

 

The EM algorithm is used for finding maximum likelihood parameter estimates when there is missing or incomplete data. In our case, the missing data is the region to which the points in the feature space belong. We estimate values to fill in for the incomplete data (the “E-Step”), compute the maximum likelihood parameter estimates using this data (the “M-Step”), and repeat until a suitable stopping criterion is reached. In the case where EM is applied to learning the parameters for a mixture of Gaussians, both steps can be combined into a single update step. A full formal description of the EM algorithm used and its application can be found in (Carson, et al. 1998). It is important to note that we are not trying to find the “true” K or “true” Gaussian distributions. There is no such thing; the images are obviously not produced by drawing pixels from a mixture of Gaussian distributions. Rather, we are simply trying to choose clusters that allow us to segment the images effectively.
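Where a concrete implementation is helpful, scikit-learn's GaussianMixture (which runs EM internally) can stand in for this step; the choice of K below is purely illustrative:

import numpy as np
from sklearn.mixture import GaussianMixture

def cluster_pixels(features, k=4, seed=0):
    """Fit a mixture of K Gaussians to the per-pixel feature vectors
    (color + texture + position) and return each pixel's most likely
    cluster label."""
    h, w, d = features.shape
    gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=seed)
    labels = gmm.fit_predict(features.reshape(-1, d))
    return labels.reshape(h, w)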

 

Once a model is selected, the next step is to perform spatial grouping of those pixels belonging to the same color/texture cluster. We first produce a K-level image which encodes pixel-cluster memberships by replacing each pixel with the label of the cluster for which it attains the highest likelihood. To enforce a small amount of spatial smoothness in this representation, we apply a 3 × 3 maximum-vote filter to the raw cluster-membership image. Finally, we run the resulting image through a connected-components algorithm to produce a set of labeled image regions.
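A sketch of this post-processing, using SciPy as a stand-in for the filters described above:

import numpy as np
from scipy import ndimage

def spatial_grouping(labels, k):
    """Apply a 3 x 3 maximum-vote filter to the pixel-cluster label image,
    then run connected components within each cluster to obtain the final
    labeled regions."""
    def majority(window):
        values, counts = np.unique(window.astype(int), return_counts=True)
        return values[np.argmax(counts)]
    smoothed = ndimage.generic_filter(labels, majority, size=3)
    regions = np.zeros(labels.shape, dtype=int)
    next_label = 0
    for c in range(k):
        comp, n = ndimage.label(smoothed == c)
        regions[comp > 0] = comp[comp > 0] + next_label
        next_label += n
    return regions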

 

A simple description of each region's color, texture, and spatial characteristics is stored for matching and retrieval. We store a color histogram of the pixels within the region. This histogram is based on bins with width 20 in each dimension of L*a*b* space. (L* ranges from 0 to 100; a* and b* range approximately from -100 to +100.) The gamut corresponding to 0 ≤ R, G, B ≤ 1 contains 218 of these bins.

 

To match the color of two regions, we use the quadratic distance between their histograms x and y (Hafner, et al. 1995):

$$d^{2}_{hist}(x, y) = (x - y)^{T} A\, (x - y)$$
where A = [aij] is a matrix of weights between 0 and 1 representing the similarity between bins i and j based on the distance between the bin centers. This quadratic distance measure allows us to give a high score to two regions with similar colors, even if the colors fall in different histogram bins.
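For reference, a direct implementation of this distance (A is whatever bin-similarity matrix the system has precomputed):

import numpy as np

def quadratic_distance(x, y, A):
    """Cross-bin quadratic distance (x - y)^T A (x - y) between two color
    histograms; A[i, j] in [0, 1] encodes the similarity of bins i and j."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(diff @ A @ diff)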

For each blob, the mean texture contrast and anisotropy are also stored. We do not store the selected scale, since we want to be invariant to scales in the range σk = 0, …, 3.5. Although polarity is used for scale selection, we discard it here, since in any textured or uniform region it is approximately zero due to the scale selection process. The distance between two texture descriptors is defined as the Euclidean distance between their respective values of (c, ac), since anisotropy is meaningless in regions of low contrast.

 

We use Fourier descriptors to represent the shape of a region. We use the FDs of the (x,y) representation of the region's outer contour. These descriptors have an arbitrary phase offset due to the arbitrary starting point on the contour. Simply ignoring the phase won't work, since phase carries much of the information about the shape. In order to remove this starting point ambiguity, we convert the Fourier series to a superposition of ellipses of increasing frequency. By ignoring the absolute phase of the first-order ellipse but retaining the relative phases of successive ellipses, we remove the ambiguity about the starting point. After converting these ellipses back to a Fourier series, we store the first ten components of the series. (Since each frequency has an x part and a y part, and each of these has real and imaginary parts, specifying these first ten components requires 40 parameters.) The distance between two shapes is the Euclidean distance between their 40-dimensional feature vectors.

 

Querying in Blobworld

 

In the Blobworld system, the user composes a query by submitting an image to the segmentation/feature extraction algorithm in order to see its Blobworld representation, selecting the blobs to match, and finally specifying the relative importance of the blob features. The user may also submit blobs from several different images; for example, a query might be the disjunction of the blobs corresponding to airplanes in several images, in order to provide a query that looks for airplanes of several shades.

 

We define an “atomic query” as one which specifies a particular blob to match (e.g., “like-blob-1”). A “compound query” is defined as either an atomic query or a conjunction or disjunction of compound queries (“like-blob-1 and like-blob-2”). In the future, we might expand this definition to include negation (“not-like-blob-1”) and to allow the user to specify two blobs with a particular spatial relationship as an atomic query (“like-blob-1-left-of-blob-2”). One recent addition is a query that specifies a particular blob against a particular background. The “background blob” is computed by combining and averaging the colors of all other blobs in an image. Thus each blob will have a corresponding background blob for any given image.

 

Once a compound query is specified, we score each database image based on how closely it satisfies the compound query. The score mi for each atomic query (like-blob-i) is calculated as follows:

 

1.        For each blob bj in the database image (with feature vector vj)

a.        Find the feature vector vi for the desired blob bi. This vector consists of the stored color, texture, position, and shape descriptors.

b.       Find the Mahalanobis distance dij between vi and vj using the diagonal covariance matrix Σ (the feature weights) set by the user:

$$d_{ij} = \sqrt{(v_i - v_j)^{T}\,\Sigma^{-1}\,(v_i - v_j)}$$
c.        Measure the similarity between bi and bj using mij = e^(−dij/2). This score is ~1 if the blobs are identical in all relevant features; it decreases as the match becomes less perfect.

2.        Take mi = maxj(mij).

 

The compound query score for the database image is calculated using fuzzy-logic operations. For example, if the query is “like-blob-1 and (like-blob-2 or like-blob-3),” the overall score for the image is min{m1, max{m2, m3}}. The user can also specify a weighting si for each atomic query. If “like-blob-i” is part of a disjunction in the compound query, the weighted score for atomic query i is mi′ = si·mi; if it is in a conjunction, its weighted score is mi′ = 1 − si·(1 − mi).
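Putting the atomic and compound scoring together, a rough sketch (variable names ours) might look like this:

import numpy as np

def atomic_score(v_query, image_blobs, feature_weights):
    """Score one atomic query ("like-blob-i") against a database image:
    Mahalanobis distance under the user-set diagonal covariance, similarity
    m = exp(-d/2), then the best-matching blob in the image."""
    inv_var = 1.0 / np.asarray(feature_weights, dtype=float)
    best = 0.0
    for v in image_blobs:                 # stored feature vectors of the image's blobs
        diff = np.asarray(v, dtype=float) - np.asarray(v_query, dtype=float)
        d = np.sqrt(np.sum(diff * diff * inv_var))
        best = max(best, float(np.exp(-d / 2.0)))
    return best

def compound_score(atomic_scores, s_weights, op="and"):
    """Fuzzy-logic combination of weighted atomic scores: max over s_i * m_i
    for a disjunction, min over 1 - s_i * (1 - m_i) for a conjunction."""
    if op == "or":
        return max(s * m for s, m in zip(s_weights, atomic_scores))
    return min(1.0 - s * (1.0 - m) for s, m in zip(s_weights, atomic_scores))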

 

We then rank the images according to overall score and return the best matches, indicating for each image which set of blobs provided the highest score; this information helps the user refine the query by showing a Blobworld representation of the matching image with the high-ranking region highlighted. After reviewing the query results, the user may change the weighting of the blob features or may specify new blobs to match. Figure 1 shows the results of a Blobworld search against a database of 10000 images from the Corel stock photo collection (using Cheshire II indexing as described below). The blobs used to query the system are in the top left-hand corner, the weights selected by the user are shown in the top right-hand corner, and the resulting images (with the matching blob highlighted) are shown in the remainder of the figure.

 


Figure 1. Tiger Image Search Results

Indexing Blobworld using Cheshire II

 


In the above matching process the careful reader will note that there appears to be no way of obtaining the full ranking of the database against the query blobs other than sequentially processing all of the blobs (and images) in the database. This obviously provides the best possible matching using the above algorithm, but it is not a scalable solution to the image retrieval problem. While the algorithm works quite rapidly on current server hardware, once the database size exceeds a few thousand images the response time becomes too great for regular interactive use. The Cheshire II system is being used to provide one solution to this dilemma.

 

Our approach to indexing blob representations in the Cheshire system involves further reducing the feature vectors of color, texture and spatial values generated for each blob. This involves transforming the feature vector into a set of discrete values (that we call bins) representing ranges of real values in the underlying feature vector. The color descriptor is already discretized, since it is a histogram of color values. We simply store each filled bin, weighted by the value of that bin. We discretize contrast into five bins and (contrast × anisotropy) into four bins. We discretize (x,y) location into nine bins (three in each direction). We discretize the area and eccentricity of the first Fourier ellipse into five bins each.
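The sketch below illustrates the kind of reduction involved; the bin boundaries, token spellings, and the repetition used to encode weight are our own guesses for illustration, not the project's actual values:

def discretize(value, lo, hi, n_bins):
    """Map a real value into one of n_bins equal-width bins over [lo, hi)."""
    step = (hi - lo) / n_bins
    return min(n_bins - 1, max(0, int((value - lo) / step)))

def blob_bins(blob):
    """Reduce one blob's descriptors to discrete 'bin' tokens for indexing:
    filled color-histogram bins (weight expressed by repetition), contrast,
    contrast*anisotropy, a 3 x 3 location grid, and the first Fourier
    ellipse's area and eccentricity. All ranges below are placeholders."""
    tokens = []
    for i, weight in enumerate(blob["color_hist"]):
        tokens += ["COL%d" % i] * int(round(weight * 10))
    tokens.append("TEX%d" % discretize(blob["contrast"], 0.0, 1.0, 5))
    tokens.append("TEXA%d" % discretize(blob["contrast"] * blob["anisotropy"], 0.0, 1.0, 4))
    loc = 3 * discretize(blob["y"], 0.0, 1.0, 3) + discretize(blob["x"], 0.0, 1.0, 3)
    tokens.append("LOC%d" % loc)
    tokens.append("SHPA%d" % discretize(blob["ellipse_area"], 0.0, 1.0, 5))
    tokens.append("SHPB%d" % discretize(blob["ellipse_eccentricity"], 0.0, 1.0, 5))
    return tokens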

 

The reduced feature vector, along with a unique ID for the blob and for the image containing the blob, is stored as a simple SGML record that looks like:

 

<BLOB>

<IMAGEID>10000</IMAGEID>

<IMAGESEQ>0</IMAGESEQ>

<BLOBID>1</BLOBID>

<BINS> COL141 COL142 COL203 COL203 COL203 COL203 COL203 COL204 COL204 COL204

COL204 COL204 COL204 TEX1 LOC868 SHPA3 SHPB4 </BINS>

<BACKGROUND> COL3 COL4 COL7 COL8 COL26 COL27 COL31 COL74 COL141 COL142

COL142 COL203 COL203 COL204 COL204 COL204 COL204</BACKGROUND>

<FEATURES> </FEATURES>

</BLOB>

 

In the above record the “<BINS>” tag contains the discretized color (COLxxx), texture (TEXx), location (LOCxxx) and shape (SHPAx and SHPBx) values for blob #1 of image #10000. Color values with higher weights in the record are indicated by frequency of occurrence within the field (for example, the color bin COL203 has a high weight in the record).

 

These records for each blob in the image collection are treated as a Cheshire SGML file, and each of the bins for the blob is indexed as if it were a word in a text (accumulating frequency information from duplicated bins). During retrieval, users select a sample blob from an image to search (either from a sample set of images or from previous search results). The user may set controls to indicate the relative importance of color, texture, position, and shape of the blob, and the importance of the blob overall. (The Blobworld search interface is available at http://elib.cs.berkeley.edu/photos/blobworld/). The user is also given a choice (on the advanced query form) of whether to use the full scan search of the database or an indexed search using the Cheshire II system.

 

When a search requesting the Cheshire II system is submitted (via the CGI scriptable interface), the first step is to retrieve the matching blob record(s) for the query blob(s) from the database. Then, the blob weights set by the user determine the weights given to the various types of bins in the next step. The weighted bins (weighting is indicated by duplication of bins in the query) are submitted as a probabilistic query against the Cheshire II blob database, and the resulting blob records are returned ranked using the same ranking method as text (described above). This results in a restricted set of blobs that are similar to the query blob by having color, texture, shape or location bins in common with the query blob. The ranking algorithm tends to promote those database blobs with the most features in common, influenced by the frequency weighting of the query blob bins and the database blob bins. For multiple blob queries the preceding steps are performed for each query blob, and the results are merged by intersecting the result lists on the image ID number and ranking those images with multiple matching blobs ahead of those with only one matching blob. The final result of Cheshire II processing is a list of “candidate” image IDs that are likely to contain matching blobs. This list is then passed to the sequential ranking algorithm described above for detailed comparison of the full feature vectors of the query blob(s) and the candidate blobs.
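As a rough sketch of the query-construction step, the user-set importance values can be expressed simply by repeating the corresponding bin tokens before the query is handed to the probabilistic search (the weight-to-repetition mapping here is illustrative):

def build_blob_query(query_blob_bins, importance):
    """Expand a query blob's bin tokens into a weighted probabilistic query:
    bins are duplicated in proportion to the user-set importance of color,
    texture, location, and shape (small integer repeat counts assumed)."""
    repeats = {"COL": importance.get("color", 1),
               "TEX": importance.get("texture", 1),
               "LOC": importance.get("location", 1),
               "SHP": importance.get("shape", 1)}
    terms = []
    for token in query_blob_bins:
        terms += [token] * repeats.get(token[:3], 1)
    return " ".join(terms)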

 

Obviously, bin distribution in blobs is not the same as term distribution in text. Also, the full ranking algorithm is more flexible in cases of approximate color and texture matching, so the combined retrieval method will not (usually) be as effective in terms of precision and recall as the full ranking alone. The question remains whether it is sufficiently accurate given the efficiency and scalability benefits. Some testing results (for both text and image retrieval) are described below.

 

EVALUATION

 

This section briefly discusses the UC Berkeley entry in the TREC-7 Interactive Track to give some insight into Cheshire II performance for text retrieval. We then describe some image retrieval results from a database of 10000 images.

 

The TREC interactive track is designed to investigate user interaction with information retrieval systems in performing a prescribed set of queries during a limited time span. In our TREC-7 study eight searchers conducted eight searches each, half on the Cheshire II system and the other half on the ZPRISE system, for a total of 64 searches. Questionnaires were administered to gather information about basic demographics and searching experience, about each search, about each of the systems, and finally, about the users’ perceptions of the systems. We will concentrate on the overall performance of the Cheshire II system and will briefly describe the interfaces used in the study and how they differ in design goals and implementation. More details on the study may be found in Gey, et al. (1999).

Text Retrieval: The TREC Interactive Track

 

The administration of the interactive track followed the protocols set down in the track guidelines. This mandated a minimum group of 8 participant searchers, each of whom conducted 8 searches, half on the control system (ZPRISE, identified as “Z”) and half on the experimental system (Cheshire II, identified as “C”). The database contents were 3 years of the full text of the Financial Times. Each searcher was asked to use the features of the respective interfaces to select documents that they considered to be relevant to one or more aspects of the specific topic. Each search was limited to 15 minutes.

 

System	352i	353i	357i	362i	365i	366i	387i	392i	Grand Total
a	0.4418	0.3860	0.4870	0.5913	0.9168	0.9168	0.8125	0.6110	0.6454
b	0.7448	0.3483	0.4125	0.7500	0.9018	0.8875	0.7375	0.8310	0.7017
C	0.7993	0.2145	0.6368	0.7058	0.9793	0.9375	0.8458	0.8068	0.7407
clus	0.4355	0.3594	0.5514	0.6773	0.8750	0.4291	0.5938	0.6276	0.5686
irisa	0.7448	0.4333	0.5715	0.7500	0.7223	1.0000	0.9063	0.8088	0.7421
irisp	0.7333	0.1750	0.5533	0.6368	1.0000	0.9168	0.7500	0.7085	0.6842
iriss	0.6745	0.6021	0.5803	0.6194	0.8875	0.8611	0.7884	0.7303	0.7179
J24	0.6250	0.5178	0.5083	0.5568	0.9375	0.8750	0.6615	0.7315	0.6767
list	0.2811	0.1916	0.4200	0.5209	0.9584	0.3750	0.8084	0.7540	0.5387
MB	0.8831	0.4876	0.4164	0.7249	0.7524	0.9679	0.6494	0.6020	0.6855
MR	0.6342	0.4309	0.2857	0.7056	1.0000	1.0000	0.7944	0.7190	0.6962
ok_noRF	0.9018	0.4984	0.3033	0.4981	0.8840	0.8750	0.7710	0.4874	0.6524
ok_withRF	0.8578	0.5865	0.3818	0.3558	0.7520	1.0000	0.8335	0.7475	0.6893
RUINQ-G	0.5844	0.5109	0.3618	0.4628	0.8814	0.7918	0.8854	0.6228	0.6357
RUINQ-R	0.5558	0.3160	0.4391	0.6055	0.8674	0.7889	0.7689	0.6819	0.6286
Z	0.9500	0.4405	0.5030	0.6043	1.0000	0.8333	0.9333	0.7093	0.7467
ZP	0.5863	0.4045	0.4285	0.8928	0.9688	0.9500	0.8873	0.8493	0.7459
zp_noRF	0.8875	0.2323	0.4303	0.7640	0.8750	0.7918	0.9015	0.8250	0.7134
Grand Total	0.6733	0.4090	0.4367	0.6324	0.8928	0.8354	0.7776	0.6916	0.6689

Table 1: Aspectual Precision for all TREC-7 Interactive Track Participants

The pooled results for all systems were evaluated at NIST by the TREC evaluators and “Aspectual Precision” and “Aspectual Recall” for each searcher was calculated. Table 1 shows the values for Aspectual Precision by TREC topic for all systems in the TREC Interactive Track. Table 2 shows the values for Aspectual Recall for all of the participating systems. Note that these two tables were derived from the per-search Recall and Precision figures reported by NIST. Note also that in these tables all of the system usages were combined in the calculations; therefore “ok_noRF”, which was used in two separate experiments, has the results from both experiments combined. The two Berkeley systems (“C” and “Z”, the Cheshire II and ZPRISE systems respectively) are shown in boldface in Tables 1 and 2. The control system “Z” performed marginally better than the experimental system in terms of Aspectual Precision. It is also interesting to note that virtually identical performance was achieved by the “ZP” system from NMSU and the “zp_noRF” system from the Okapi group; we believe that both of these systems, like “Z”, are unmodified ZPRISE systems.

Aspectual Recall by Topic
System	352i	353i	357i	362i	365i	366i	387i	392i	Grand Total
a	0.1250	0.2728	0.3658	0.3540	0.8023	0.5355	0.4723	0.3403	0.4085
b	0.2413	0.1593	0.3080	0.2918	0.8543	0.5000	0.3888	0.2433	0.3733
C	0.2768	0.1820	0.3080	0.3123	0.8543	0.5358	0.4443	0.4373	0.4188
clus	0.0893	0.1024	0.2310	0.1459	0.6874	0.2144	0.1110	0.1840	0.2207
irisa	0.2858	0.1593	0.2888	0.2708	0.5208	0.2860	0.1943	0.2430	0.2811
irisp	0.2413	0.1138	0.3655	0.1873	0.7918	0.5358	0.2220	0.3403	0.3497
iriss	0.1653	0.2161	0.2791	0.2605	0.7186	0.3571	0.3193	0.2985	0.3268
J24	0.1875	0.2275	0.3080	0.1875	0.8750	0.2503	0.4445	0.5280	0.3760
list	0.0534	0.0683	0.2791	0.1354	0.6770	0.0715	0.2499	0.2846	0.2274
MB	0.3368	0.1950	0.2090	0.3274	0.6161	0.4490	0.4522	0.2401	0.3532
MR	0.3061	0.1949	0.1155	0.2737	0.7260	0.3776	0.3171	0.4603	0.3464
ok_noRF	0.3349	0.2161	0.2406	0.3020	0.8489	0.4288	0.4305	0.2675	0.3837
ok_withRF	0.4105	0.2275	0.3655	0.3123	0.8230	0.3930	0.3610	0.2848	0.3972
RUINQ-G	0.2619	0.2340	0.2909	0.2499	0.8288	0.3930	0.4723	0.2918	0.3817
RUINQ-R	0.2365	0.2427	0.2214	0.3333	0.7449	0.3651	0.4073	0.2560	0.3489
Z	0.2233	0.2955	0.2695	0.2918	0.8230	0.4285	0.6113	0.4723	0.4269
ZP	0.2768	0.1593	0.3465	0.1668	0.8543	0.4643	0.5555	0.4308	0.4068
zp_noRF	0.3838	0.1138	0.3080	0.1878	0.8333	0.3930	0.5000	0.4723	0.3990
Grand Total	0.2478	0.1875	0.2573	0.2592	0.7503	0.3749	0.3750	0.3238	0.3472

Table 2. Aspectual Recall for all TREC-7 Interactive Track Participants
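
As a concrete illustration of the aggregation described above, the fragment below sketches one way the per-search figures reported by NIST could be rolled up into the per-system, per-topic means and grand totals shown in Tables 1 and 2. The input file name and column names are hypothetical stand-ins; this is not the script actually used to produce the tables.

# Minimal sketch of the aggregation behind Tables 1 and 2. The input file and
# its columns (system, topic, asp_precision) are hypothetical stand-ins for the
# per-search figures reported by NIST, one row per (system, topic, searcher).
import pandas as pd

searches = pd.read_csv("trec7_interactive_per_search.csv")  # hypothetical file

# Mean aspectual precision for each system on each topic...
table1 = searches.pivot_table(index="system", columns="topic",
                              values="asp_precision", aggfunc="mean")

# ...plus a per-system grand mean over all of that system's searches.
table1["Grand Total"] = searches.groupby("system")["asp_precision"].mean()

print(table1.round(4))

Applying the same pivot to a per-search aspectual recall column would yield the layout of Table 2.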

The Cheshire II system also did not perform as well as the control ZPRISE system in these experiments. This can largely be attributed to the more complex interactions required to perform the search tasks on the generic Cheshire II interface than on the ZPRISE system. In addition, there were some specific search failures due to misspelling (one searcher had 0 Precision and Recall for one search for this reason).

 


The design of the Cheshire II client interface (shown with the TREC FT database in Figure 2) was driven by a number of goals:

1.        To support a consistent interface to a wide variety of Z39.50 servers and to adapt dynamically to the particular server, thereby reducing the cognitive load on users who wish to interact with multiple distributed information retrieval systems by providing a single interface for them all (a minimal sketch of this kind of dynamic adaptation appears after this list);

2.        To minimize the use of additional windows during users' interactions with the client, allowing them to concentrate on formulating queries and evaluating the results rather than expending additional mental effort and time switching their focus of attention from the search interface to separate display clients;

3.        To provide functions not immediately related to searching, such as print and e-mail facilities, to facilitate users' ability to 'take the results home'; and

4.        To design a help system within the interface that would assist users not only in the mechanics of operating the Cheshire II client, but also in the more general tasks of selecting appropriate resources for searching, formulating appropriate queries, and employing various search tactics.
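
As an illustration of the first goal, the sketch below shows one way a client might adapt its search form to a particular server: ask the server's Z39.50 Explain database which Bib-1 Use attributes it supports and build the index menu from the reply. The client object and its explain_use_attributes() method are hypothetical stand-ins and do not correspond to the actual Cheshire II client code or any particular Z39.50 toolkit API.

# Illustrative sketch only: the client object and its explain_use_attributes()
# method are hypothetical, not the actual Cheshire II client code or a real
# Z39.50 toolkit API. The Bib-1 Use attribute numbers are the standard values
# (4 = Title, 1003 = Author, 21 = Subject heading, 1016 = Any).
from dataclasses import dataclass
from typing import List

@dataclass
class IndexChoice:
    label: str          # text shown in the client's index menu
    use_attribute: int  # Bib-1 Use attribute sent with the query term

KNOWN_LABELS = {1016: "Keyword (anywhere)", 4: "Title",
                1003: "Author", 21: "Subject"}

def build_index_menu(client) -> List[IndexChoice]:
    """Adapt the search form to whatever the target server reports.

    Ask the server's Explain database which Bib-1 Use attributes it supports;
    if Explain is unavailable, fall back to a generic set so the interface
    still presents something usable.
    """
    try:
        supported = client.explain_use_attributes()   # hypothetical call
    except Exception:
        supported = [1016, 4, 1003]                   # generic fallback
    return [IndexChoice(KNOWN_LABELS.get(a, "Bib-1 Use %d" % a), a)
            for a in supported]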

 

However, the initial design goals for the client interface assumed that most of the information viewed in the search interface would be brief metadata records for documents rather than full-text documents themselves. The ability to view full-text documents such as the FT articles used in the TREC-7 Interactive Track experiments was added to the existing interface easily. But, as the comments and questionnaire responses indicated, this was probably not an optimal implementation for the tasks posed by the experiments, which involved reading or scanning the full text of retrieved documents to find relevant "aspects" and noting them. We plan to provide a new interface that will facilitate the sort of reading and scanning activities required for the TREC interactive task and to test it in the TREC-8 interactive track.

 

Image Search Results

 

We tested image retrieval using a sample of the Corel stock photo collection consisting of 10,000 images ranging from automobile racing to pictures of gardens and houses. Ten query images were selected, and the relevant images in the database were determined by hand comparison against the descriptive metadata. The relevance judgment task is much simpler for images than for text: if the query image is a tiger and the database image contains a tiger, then that image is considered relevant. Our primary concern was to compare the full matching process described above with the indexed search using Cheshire II. Figures 3 and 4 show the average results over the ten queries for the full search (labeled all_full_rp_data in Figure 3 and all_full_fpn_data in Figure 4), the indexed Cheshire II search with no subsequent processing (labeled all_raw_rp_data in Figure 3 and all_raw_fpn_data in Figure 4), and the indexed Cheshire II search followed by full processing of the subset returned by that search (labeled all_partial_rp_data in Figure 3 and all_partial_fpn_data in Figure 4). As may be seen in these figures, the raw retrieval performs worst (though it is the most efficient), while the partial retrieval performs nearly as well as the full search yet requires scanning only a selectable number of images rather than the entire database. Overall, the performance of image retrieval does not approach the levels seen for text retrieval. Further research is being conducted on many aspects of image retrieval, and we hope both to increase the effectiveness of the process as a whole and to make it efficient enough for general use.
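
To make the comparison concrete, the fragment below sketches the three strategies compared in Figures 3 and 4. The scoring and index callables are hypothetical stand-ins for the Blobworld full matching process and the Cheshire II indexed search; this is an illustration of the approach, not the project's actual code.

# Sketch of the three image retrieval strategies compared in Figures 3 and 4.
# The callables passed in (score_full, index_search) are hypothetical stand-ins
# for the Blobworld full matching process and the Cheshire II indexed search.
from typing import Callable, Dict, List

def rank_full(query, database: Dict[str, object],
              score_full: Callable[[object, object], float], k: int) -> List[str]:
    """Exhaustive baseline: apply the full matching score to every image."""
    ranked = sorted(database,
                    key=lambda img_id: score_full(query, database[img_id]),
                    reverse=True)
    return ranked[:k]

def rank_raw(query, index_search: Callable[[object, int], List[str]],
             k: int) -> List[str]:
    """'Raw' run: take the indexed search ranking as-is, with no re-scoring."""
    return index_search(query, k)

def rank_partial(query, database: Dict[str, object],
                 index_search: Callable[[object, int], List[str]],
                 score_full: Callable[[object, object], float],
                 candidates: int, k: int) -> List[str]:
    """'Partial' run: fetch a candidate subset from the index, then re-rank
    only that subset with the full matching process."""
    subset = index_search(query, candidates)
    reranked = sorted(subset,
                      key=lambda img_id: score_full(query, database[img_id]),
                      reverse=True)
    return reranked[:k]

The "partial" strategy bounds the amount of full matching by the size of the candidate subset, which is why it can approach the effectiveness of the exhaustive search at a fraction of the cost.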

CONCLUSIONS

 

Naturally, it is impossible to draw any firm general conclusions about text retrieval in the Digital Library from our small sample of searchers and searches in TREC, or from the preliminary image retrieval results reported here. In the interactive retrieval task, as observed in our TREC results, the overall performance of the Cheshire II system was quite good, although it was not dramatically better than the control system on average. Such results, as has often been noted in previous TREC interactive evaluations, tend to be highly influenced by individual searcher behavior and search techniques. Testing of content-based image retrieval methods is still in its very early stages, and there is much work to do in determining which methods will provide the most effective and efficient performance for Digital Libraries. The Cheshire II system is an ongoing research project that will continue to incorporate new techniques, aiming to provide an effective general tool for information access in Digital Libraries.

Acknowledgements

The original development of the Cheshire II system was sponsored by the College Library Technology and Cooperation Grants Program, HEA-IIA, Research and Demonstration Grant #R197D30040 from the U.S. Department of Education. Further development work on the Cheshire II system and on BlobWorld was supported as part of Berkeley's NSF/NASA/ARPA Digital Library Initiative Grant #IRI-9411334. Current work is being supported as part of the "Search Support for Unfamiliar Metadata Vocabularies" research project at UC Berkeley, sponsored by DARPA contract N66001-97-C-8541; AO# F477.

REFERENCES

 

Ayer, S., & Sawhney, H. (1995). Layered representation of motion video using robust maximum-likelihood estimation of mixture models and MDL encoding. In Proceedings of the international conference on computer vision (pp. 777-784). New York.

Figure 2: Cheshire II Long Display


Belongie, S., Carson, C., Greenspan, H., & Malik, J. (1998). Color- and texture-based image segmentation using EM and its application to content-based image retrieval. In Proceedings of the international conference on computer vision. New York.

 

Bigun, J., Granlund, G., & Wiklund, J. (1991). Multidimensional orientation estimation with applications to texture analysis and optical flow. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(8), 775-790.

 

Carson, C., Belongie, S., Greenspan, H., & Malik, J. (1997). Region-based image querying. In IEEE workshop on the content-based access of image and video libraries. New York.

 

Carson, C., Belongie, S., Greenspan, H., & Malik, J. (1998). Color- and texture-based image segmentation using EM and its application to image querying and classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, In Press.

 

Carson, C., & Ogle, V. (1996). Storage and retrieval of feature data for a very large online image collection. IEEE Data Engineering Bulletin, 19(4).

 

Cooper, W. S., Chen, A., & Gey, F. C. (1994a). Experiments in the probabilistic retrieval of full text documents. In Text Retrieval Conference (TREC-3) Draft Conference Papers. Gaithersburg, MD: National Institute of Standards and Technology.

 

Cooper, W. S., Gey, F. C., & Chen, A. (1994b). Full text retrieval based on a probabilistic equation with coefficients fitted by logistic regression. In D. K. Harman (Ed.), Second Text Retrieval Conference (TREC-2), Gaithersburg, MD, USA, 31 Aug.-2 Sept. 1993 (NIST-SP 500-215) (pp. 57-66). Washington: National Institute of Standards and Technology.

 

Cooper, W. S., Gey, F. C., & Dabney, D. P. (1992). Probabilistic retrieval based on staged logistic regression. In 15th annual international ACM SIGIR conference on research and development in information retrieval, Copenhagen, Denmark, June 21-24 (pp. 198-210). New York: ACM.

 

Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Soc., Ser. B, 39(1), 1-38.

 

Enser, P. (1993). Query analysis in a visual information retrieval context. J. Doc. and Text Management, 1(1), 25-52.

 

Flickner, M., Sawhney, H., Niblack, W., Ashley, J., et al. (1995). Query by image and video content: The QBIC system. IEEE Computer, 28(9), 23-32.

 

Foerstner, W. (1994). A framework for low level feature extraction. In Proceedings of the European conference on computer vision (pp. 383-394). New York.

 

Forsythe, D., Malik, J., & Wilensky, R. (1997). Searching for digital pictures. Scientific American, 276(6), 88-93.

 

Garding, J., & Lindeberg, T. (1996). Direct computation of shape cues using scale-adapted spatial derivative operators. International J. Computer Vision, 17(2), 163-191.

 

Gey, F. C., Jiang, H., Chen, A., & Larson, R. R. (1999). Manual queries and machine translation in cross-language retrieval and interactive retrieval with Cheshire II at TREC-7. To appear in E. M. Voorhees (Ed.), The Seventh Text Retrieval Conference (TREC-7). NIST.

 

Gupta, A., & Jain, R. (1997). Visual information retrieval. Communications of the ACM, 40(5), 70-79.

 

Hafner, J., Sawhney, H., Equitz, W., Flickner, M., & Niblack, W. (1995). Efficient color histogram indexing for quadratic form distance functions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(7), 729-736.

 

Jacobs, C., Finkelstein, A., & Salesin, D. (1995). Fast multiresolution image querying. In Proceedings of SIGGRAPH. New York.

 

Jain, A. (1989). Fundamentals of digital image processing. Prentice Hall.

 

Jain, A. K., & Farrokhnia, F. (1991). Unsupervised texture segmentation using Gabor filters. Pattern Recognition, 24(12), 1167-1186.

 

Kelly, P., Cannon, M., & Hush, D. (1995). Query by image example: The CANDID approach. In SPIE Proceedings storage and retrieval for image and video databases (pp. 238-248). New York.

 

Larson, R. R. (1991). Classification clustering, probabilistic information retrieval, and the online catalog. Library Quarterly, 61(2), 133-173.

 

Larson, R. R., & McDonough, J. (1998). Cheshire II at TREC 6: Interactive probabilistic retrieval. In E. Voorhees & D. Harman (Eds.), Information Technology: The Sixth Text Retrieval Conference (TREC-6) (NIST special publication 500-240, pp. 649-649). Gaithersburg, MD: NIST, August 1998.

 

Larson, R. R., McDonough, J., O'Leary, P., Kuntz, L., & Moon, R. (1996). Cheshire II: Designing a next-generation online catalog. Journal of the American Society for Information Science, 47(7), 555-567.

 

Larson, R. R., & Carson, C. (2000). Text and image retrieval for a digital library: Cheshire II and BlobWorld. In preparation.

 

Lipson, P., Grimson, E., & Sinha, P. (1997). Configuration based scene classification and image indexing. In Proceedings IEEE computer soc. conference computer vision and patt. rec. (pp. 1007-1013). New York.

 

Malik, J., & Perona, P. (1990). Preattentive texture discrimination with early vision mechanisms. Journal of the Optical Society of America A, 7(5), 923-932.

 

Ogle, V., & Stonebraker, M. (1995). Chabot: Retrieval from a relational database of images. IEEE Computer, 28(9), 40-48.

 

Pentland, A., Picard, R., & Sclaroff, S. (1996). Photobook: Content-based manipulation of image databases. International J. Computer Vision, 18(3), 233-254.

 

Puzicha, J., & Buhmann, J. M. (1998). Multiscale annealing for real-time unsupervised texture segmentation. In Proceedings of the international conference on computer vision, New York.

 

Salton, G., & Buckley, C. (1990). Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science, 41, 288-297.

 

Shaft, U., & Ramakrishnan, R. (1996). Data modeling and querying in the PIQ image DBMS. IEEE Data Engineering Bulletin, 19(4), 28-36.

 

Smith, J. R., & Chang, S.-F. (1995). Single color extraction and image query. In Proceedings of the IEEE international conference on image processing (pp. 528-531). New York.

 

Swain, M., & Ballard, D. (1991). Color indexing. International J. Computer Vision, 7(1), 11-32.

 

Wells, W., Kikinis, R., Grimson, W., & Jolesz, F. (1995). Adaptive segmentation of MRI data. In International conference on computer vision, virtual reality and robotics in medicine (pp. 59-69). New York.

 

Wilensky, R. (1995). UC Berkeley's digital library project. Communications of the ACM, 38(4), 60-61.

 

Wilensky, R. (1996). Toward work-centered information services. Computer, 29(5), 37-43.

 

Wyszecki, G., & Stiles, W. (1982). Color science: Concepts and methods, quantitative data and formulae (Second ed.). Wiley.

 

Z39.50. (1995). Information retrieval (Z39.50): Application service definition and protocol specification (ANSI/NISO Z39.50-1995). Bethesda, MD: NISO.


         

Figure 3. Precision and Recall for Image Searching.

         

Figure 4. Precision vs. Number Retrieved for Image Searching.