1. Introduction
Cheshire is a client/server information retrieval system that brings
modern information retrieval techniques to a wide array of data domains.
Cheshire provides uniform document storage in the form of SGML, supports
probabilistic search, and supports Z39.50 interoperability with dozens
of library information systems around the world.
Cheshire is now several years old. Its program logic is coded entirely
in C and most of its user interface is done in Tcl/Tk. Further development
is being inhibited by system complexity that has outgrown the original
design and by dated software technologies. New technologies are now available
that can dramatically reduce coding effort and enhance robustness, maintainability,
and interoperability. This paper proposes to re-engineer Cheshire into
a modern software system, with the hope of ensuring its future viability
as a platform for information retrieval research. The next section outlines
the system objectives of a next generation system. Section 3 discusses
the technologies available for meeting those objectives. Section 4 proposes
a modular reorganization of Cheshire functionality into packages. Section
5 discusses the directory structure for deployment. Section 6 examines
issues in migrating the existing system to the new design. Section 7 concludes.
2. Design Objectives
To continue Cheshire's viability as a research platform, the system
must appeal to users and developers alike. It must satisfy the information
needs of users, and it must also make it easy for developers to modify
and experiment with the system. I identify below seven sets of features
I would like to see in a next generation Cheshire.
-
Distributed Queries. We want Cheshire to be able to satisfy a user's
cross-domain information need. To do that, a Cheshire server needs to look
at not only the database it maintains, but also other information sources
reachable over the Internet. Each server needs access to meta information
about other information sources, from which it can decide whether to query
that particular source to satisfy a particular information need.
-
Interoperability. To maximize its domain reach, a Cheshire system
needs to interoperate not only with other Cheshire systems, but with as many
different types of systems as possible. Cheshire needs to support international
standards such as Z39.50 and emerging protocols such as SDLIP, and be ready to
adopt future interfaces.
-
Concurrency, Scalability and Robustness. A research system is most
useful if the results of good research can be immediately deployed to serve
a large number of users. Cheshire should be able to efficiently use machine
resources to serve concurrent requests, grow linearly in throughput as
more hardware resources are added, and offer reliable operations suitable
for academic environments.
-
Web-Based System Administration. The system should be easy to deploy
and easy to administer. The administrator should be presented with
a unified view of system operations and given easy means to customize them.
A web-based administration tool will give administrators the most flexibility
in accessing the system.
-
Dynamic Databases. An administrator should be able to incrementally
grow the database without interrupting user services. The database should
be automatically indexed as it is grown.
-
Simplified, Structured, Maintainable Code. A research platform exists
to serve innovation. It is constantly evolving as new ideas are tested
and old ideas discarded. Developers come and go. This process becomes more
vibrant if the system is accessible to new developers and is amenable to
change.
-
Unprivileged Deployment. A research system is a toy for everyone.
Its users will not always have privileged access to the operating system.
Cheshire should not require such access for deployment.
3. New Technologies
A number of new software technologies exist today that can help achieve
the objectives outlined in the previous section.
-
Java Programming Language. Java has
taken the computing world by storm. It is elegant in design, easy to learn,
and a pleasure to use. A developer writes Java code once, and it
is instantly portable to almost every platform. Java is strongly typed
and object oriented from the ground up, giving us programs that are more
robust, more maintainable, and more expandable. Networking is at the core
of Java. Everything from naming, to its execution environment, to API design
is tailored to networked computing. A rich set of reusable Java components
is available to provide infrastructure support, and Java integration is
available for nearly every major language. Java performance is more than
adequate to provide the systems glue, with performance critical components
written in other languages. Java 1.3 promises to provide significant improvements
in client-side performance. Java tools are mostly free. Java programmers
are easy to find. And because Java is one of the most successful computing
platforms of our era, users can expect continued expansions, upgrades, and
community support.
-
Java RMI.
Remote Method Invocation makes it extremely simple to implement client/server
communications. A method on a remote object is called in almost exactly the
same way as a local method; the caller only has to allow for a RemoteException.
The network is nearly invisible.
-
Java HotSpot Server
(V2.0 available for Solaris and Linux in fall, 2000). High performance
multithreaded Virtual Machine for server applications. Can be used in Cheshire
for high performance concurrency.
-
Berkeley DB 3.0.55. Better concurrency
support. Java API included.
-
JavaServer Pages.
A Java architecture for generating dynamic content, to be delivered through
standard web servers. We can use this to build a web interface for Cheshire.
More and more thin clients such as PDAs will have web browsing capability.
Cheshire may see radically different kinds of use in the future. For example,
a bookstore patron may want to use a PalmPilot to find out whether an
expensive book is available at a public library and, if so, immediately
make a time-limited reservation.
-
DOM Compliant XML Parsers. Available
with Java APIs. Conform to the DOM standard for efficient (fewer passes)
parsing of XML.
-
Forte Integrated Development
Environment. A Java IDE that includes an integrated debugger, GUI access
to object tree, etc.
-
Java Naming and
Directory Interface. Comes with a reference directory service. Cheshire
servers connected to the Internet can discover each other through this
service, allowing them to cooperate both in the local area and in the wide
area.
-
SDLIP.
A simple information retrieval protocol available with Java transport and
CORBA and HTTP bindings. SDLIP is simple to understand and elegant in design.
SDLIP is far simpler to implement than Z39.50 and may find more support
from a wider array of information sources, particularly from the fast evolving
web search engines. The simplicity of SDLIP can encourage experimentation
in user interface, server architecture, and retrieval algorithms. SDLIP
may reduce the barrier to entry for distributed library information systems
the way the web reduced barriers to publishing. The web paradigm has been
a move away from structure, careful selection, and long-range planning, and
toward universal, low-barrier access, with powerful computers mining structure
from information so that its human authors are spared the effort.
All of the technologies listed above are free. Source code is available
for most of them.
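As a rough illustration of the DOM style of parsing listed above, the following sketch parses a small record with the standard javax.xml.parsers API and extracts one field. The record layout, the element names (record, title) and the class name DomSketch are assumptions made for this example; they are not Cheshire's actual document format.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class DomSketch {
    /** Returns the text of the first <title> element, or null if there is none. */
    public static String firstTitle(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // The whole document is parsed into an in-memory tree in one call.
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList titles = doc.getElementsByTagName("title");
            return titles.getLength() == 0 ? null : titles.item(0).getTextContent();
        } catch (Exception e) {
            // Parser configuration, syntax, and I/O errors all end up here.
            throw new IllegalArgumentException("could not parse record", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical record; the element names are illustrative only.
        String xml = "<record id=\"r1\"><title>Alice in Wonderland</title></record>";
        System.out.println(firstTitle(xml));
    }
}
```

Once the tree is built, any number of fields can be read from it without re-reading the input, which is the property the indexing code would rely on.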
After some examination, I propose not to adopt the following technologies.
-
Java Enterprise Edition. Emphasizes
support for enterprise computing needs such as reliable operations and
security. Includes Enterprise Java Beans component infrastructure. EJB
provides component support services such as life cycle management, persistence,
and load balancing. I decided against using this technology in Cheshire.
Most of its power is not needed for Cheshire and EJB adds unnecessary complexity
to the development process, which may in the end inhibit experimentation
in a research system rather than promote it.
-
CORBA. CORBA is a distributed objects
architecture used for building distributed systems. For the foreseeable
future, SDLIP provides all the server-to-server interoperability
Cheshire will need. In fact, SDLIP itself is a distributed information
systems architecture that sits on top of CORBA. For our purposes, SDLIP
interoperability subsumes CORBA interoperability. At present, there
isn't a clear need to write distributed objects that sit below SDLIP. Even
if that need arises in the future, it may be simpler to convert the object
interfaces to RMI, which provides a distributed architecture native to
the Java language.
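To make the RMI option concrete, here is a minimal sketch of a remote interface, an implementation, and a client lookup, all in one process. The names SearchService, TitleSearch and "cheshire-search", as well as the use of port 1099 on localhost, are assumptions for illustration and are not part of any existing Cheshire interface.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RmiSketch {
    /** Remote interface: every remotely callable method declares RemoteException. */
    public interface SearchService extends Remote {
        List<String> search(String term) throws RemoteException;
    }

    /** Toy implementation that scans an in-memory list of titles. */
    public static class TitleSearch implements SearchService {
        private final List<String> titles;

        public TitleSearch(List<String> titles) { this.titles = titles; }

        @Override
        public List<String> search(String term) {
            List<String> hits = new ArrayList<>();
            for (String t : titles) {
                if (t.toLowerCase().contains(term.toLowerCase())) hits.add(t);
            }
            return hits;
        }
    }

    public static void main(String[] args) throws Exception {
        TitleSearch impl = new TitleSearch(
            Arrays.asList("Alice in Wonderland", "Through the Looking-Glass"));

        // Server side: export the object and register its stub under a name.
        Registry registry = LocateRegistry.createRegistry(1099);
        SearchService stub = (SearchService) UnicastRemoteObject.exportObject(impl, 0);
        registry.rebind("cheshire-search", stub);

        // Client side: look up the stub, then call it like a local object.
        SearchService remote = (SearchService)
            LocateRegistry.getRegistry("localhost", 1099).lookup("cheshire-search");
        System.out.println(remote.search("alice"));

        // Unexport so the JVM can exit.
        UnicastRemoteObject.unexportObject(impl, true);
        UnicastRemoteObject.unexportObject(registry, true);
    }
}
```

The client-side call differs from a local call only in that it may throw RemoteException, which is why converting an existing Java object interface to RMI tends to be a small change.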
4. Source Code Organization
Java provides a "package" based naming system in which the location
of a file in the directory tree, in either source or object form, corresponds
to its package name. The idea is that classes with like functionality are
grouped together in a package. Classes in the same package have privileged
access to each other's non-private fields and methods, whereas classes in
other packages can access only public fields and methods (plus protected ones,
if they are subclasses).
Thus, the "package" provides a level of encapsulation higher than the "class".
A package is a collection of classes that do similar work. Packages themselves
can be grouped into hierarchies.
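The correspondence between package names and directories can be shown with a few lines of code. This is only an illustration for top-level classes; the class name IndexBuilder is hypothetical.

```java
public class PackagePathSketch {
    /** Maps a fully qualified top-level class name to its path under the source root. */
    public static String sourcePath(String qualifiedClassName) {
        return qualifiedClassName.replace('.', '/') + ".java";
    }

    public static void main(String[] args) {
        // IndexBuilder is a made-up class name used only for this example.
        System.out.println(sourcePath("edu.berkeley.cheshire.index.IndexBuilder"));
        // prints edu/berkeley/cheshire/index/IndexBuilder.java
    }
}
```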
For a system as complex as Cheshire, a package oriented organization
can be very useful. It makes the broader purpose of a class self-documenting,
while discouraging the developer from combining in a single class
functionality that belongs in different classes. The diagram
below gives the proposed modular view of the new Cheshire, with the modules
named by the packages that will implement them. A module contained in another
module is a subpackage of the container package. Arrows indicate data flow.
Cheshire packages should be rooted at edu.berkeley.cheshire
to ensure their uniqueness across the Internet. Symbolic links and aliases
can be used to get around any inconveniences caused by long package names.
The edu.berkeley prefix was omitted in the diagram for display
considerations. The packages are described below.
-
edu.berkeley.cheshire.document Contains access classes
to the document collection. Uses javax.xml.parsers routines for
XML parsing and various utility functions in edu.berkeley.cheshire.util
for format conversions and parsing of other data types.
-
edu.berkeley.cheshire.document.index Contains record index
classes for the document collection. Since a file may contain multiple
records (or documents), and records may be of varying length, a record
index allows random access to a named record.
-
edu.berkeley.cheshire.index Contains indexing classes that
provide key-based access to the document collection. Uses com.sleepycat.db
for the underlying database that stores the index.
-
edu.berkeley.cheshire.index.util Various utilities for
examining the contents of an index, recovering from an incomplete index
build, etc.
-
edu.berkeley.cheshire.retrieval Contains classes used for
query planning, query traversal, result set merging, result set storage,
etc.
-
edu.berkeley.cheshire.boolean Contains classes used only
during boolean search.
-
edu.berkeley.cheshire.probabilistic Contains classes used
only during probabilistic search.
-
edu.berkeley.cheshire.interface Contains the multi-threaded
Cheshire server. Its threads are used to execute user queries as well as
administrative updates. Uses classes in SDLIP, Z39.50 and JSP subpackages
to translate protocol requests into an internal query plan representation.
In this context, a user may be a Cheshire client or another Cheshire server
executing a distributed query. Indexes are also built using this interface.
This is the only access point into the Cheshire system for both users and
administrators.
-
edu.berkeley.cheshire.interface.sdlip Contains classes
used by the Cheshire server to provide SDLIP service. Uses the edu.stanford.sdlip
toolkit.
-
edu.berkeley.cheshire.interface.z3950 Contains classes
used by the Cheshire server to provide Z39.50 service. May use legacy C
modules for Z39.50 functionality.
-
edu.berkeley.cheshire.interface.jsp Contains Java Servlet
classes that create dynamic JSP content to be delivered by an external
web server.
-
edu.berkeley.cheshire.interface.admin Contains classes
used by the Cheshire server to provide the administrative interface. Uses
a private RMI based protocol.
-
edu.berkeley.cheshire.interface.jndi Interface to JNDI
services for discovering other Cheshire servers on the Internet. Also used
to make its own services known to other servers.
-
edu.berkeley.cheshire.client
-
edu.berkeley.cheshire.client.user A Java Applet user interface
for submitting search requests. May locate servers from both a configuration
file and JNDI services.
-
edu.berkeley.cheshire.client.user.sdlip Implements the
SDLIP protocol, using edu.stanford.sdlip.
-
edu.berkeley.cheshire.client.user.z3950 Implements the
Z39.50 protocol. May use legacy C modules.
-
edu.berkeley.cheshire.client.user.jndi Interface for discovering
Cheshire servers on the Internet.
-
edu.berkeley.cheshire.client.admin A Java Applet for system
administration. A command line interface is also available. Uses a private
RMI based protocol. Indexing is done through this interface. If the administrative
client shares a file system with the Cheshire server, document collections
may be specified as local files. Otherwise, document URLs may be given.
-
edu.berkeley.cheshire.util Contains data structures and
utilities shared across packages that do not have a natural primary home.
Classes that do have a natural home stay there even when they are shared.
For example, classes in the retrieval package use document
package classes, but the classes used to access document databases have
a natural home in document and thus do not belong here.
-
edu.berkeley.cheshire.util.log Utility classes for logging
system messages.
-
edu.berkeley.cheshire.util.sgml Parsing or conversion routines.
-
edu.berkeley.cheshire.util.marc Parsing or conversion routines.
-
edu.berkeley.cheshire.util.z3950 Z39.50 routines shared
by server and client.
-
javax.xml.parsers DOM compliant XML parsing package.
-
edu.stanford.sdlip SDLIP Java transport from Stanford.
-
com.sleepycat.db Java API for Berkeley DB embedded database
system.
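The record index described for edu.berkeley.cheshire.document.index can be pictured as a map from record name to a byte range. The sketch below is a minimal in-memory version, assuming a byte array stands in for the collection file; the class and method names are hypothetical, and a real implementation would seek within the file instead of slicing an array.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class RecordIndexSketch {
    /** Byte range of one record inside the collection file. */
    private static final class Extent {
        final int offset;
        final int length;
        Extent(int offset, int length) { this.offset = offset; this.length = length; }
    }

    private final byte[] file;   // stands in for the on-disk file
    private final Map<String, Extent> extents = new HashMap<>();

    public RecordIndexSketch(byte[] file) { this.file = file; }

    /** Registers the byte range occupied by the named record. */
    public void add(String name, int offset, int length) {
        extents.put(name, new Extent(offset, length));
    }

    /** Fetches one record by name without scanning the records before it. */
    public String fetch(String name) {
        Extent e = extents.get(name);
        if (e == null) return null;
        return new String(file, e.offset, e.length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two records of different lengths stored back to back (ASCII only,
        // so character counts equal byte counts here).
        String r1 = "<rec>short</rec>";
        String r2 = "<rec>a somewhat longer record</rec>";
        RecordIndexSketch index =
            new RecordIndexSketch((r1 + r2).getBytes(StandardCharsets.UTF_8));
        index.add("r1", 0, r1.length());
        index.add("r2", r1.length(), r2.length());
        System.out.println(index.fetch("r2"));
    }
}
```

Because each entry stores both offset and length, records of varying length can share one file and still be read directly by name.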
5. Deployment Organization
Using standard UNIX naming conventions for directories, I propose the
following directory tree for deployment.
-
$CHESHIRE_HOME/bin Scripts for starting servers or clients.
-
$CHESHIRE_HOME/lib JAR files and native libraries.
-
$CHESHIRE_HOME/src Root of source tree.
-
$CHESHIRE_HOME/doc System documentation, including javadoc source code documentation.
-
$CHESHIRE_HOME/etc Default configuration files for servers and
clients. Configuration files in users' home directories may override the
options set here.
-
$CHESHIRE_HOME/var
-
$CHESHIRE_HOME/var/log System log files.
-
$CHESHIRE_HOME/var/pool Berkeley DB shared buffer pool backing
files.
This deployment scheme does not require root access.
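One simple way to get the override behavior described for $CHESHIRE_HOME/etc is the defaults mechanism of java.util.Properties. The sketch below assumes a properties-style file format and made-up option names (server.port, log.level); neither is specified in this proposal. Strings stand in for the two files so the example is self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class ConfigSketch {
    /** Loads site defaults, then layers user settings on top of them. */
    public static Properties layered(String siteText, String userText) {
        try {
            Properties site = new Properties();
            site.load(new StringReader(siteText));
            // Site values act as fallbacks for anything the user file omits.
            Properties user = new Properties(site);
            user.load(new StringReader(userText));
            return user;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical option names; a real deployment would read files instead.
        Properties p = layered("server.port=2100\nlog.level=info\n", "log.level=debug\n");
        System.out.println(p.getProperty("server.port") + " " + p.getProperty("log.level"));
        // prints: 2100 debug
    }
}
```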
6. Migration Path
Having outlined a redesign of the Cheshire system using completely new
technologies, some important questions now need to be answered. What is
the right implementation strategy? Do we integrate the new technologies
and new source code with the existing system one piece at a time, or do
we build the system completely from scratch all at once?
The integration approach would be done from the top down: first we implement
the glue that holds the system together by building interfaces to legacy
components, then we replace those components one at a time. The opposite,
bottom-up approach makes little sense for our situation; we would see none
of the benefits of the new technology at the front end until the entire system
is complete. The integration approach has the benefit of quickly realizing
the benefits of modern server technologies in a deployable system. It also
forces us to consider the full set of supporting interfaces as each new
interface is implemented, reducing the chance that important functionality
becomes incompatible with the new interface.
The integration approach has two disadvantages. The first is that integrating
legacy components at each step of the way forces us to deal with the old
architecture, so the system will not fully benefit from the cleanness and
consistency of a new design. The second is the enormous amount of programming
effort that will be required to devise, implement and test these interfaces. This
author is of the opinion that these two disadvantages far outweigh the
benefits of the integration approach. The existing Cheshire system is usable
and there isn't an immediate need for a new server. An interface that doesn't
work can always be redesigned, and such a redesigned interface will still
be more self-consistent and will have required less implementation effort
than an interface built on integration with legacy components at every
step of the way.
Rebuilding the system from scratch does seem to be the more sensible
approach. Due to the complexity of the Cheshire system, however, it will
be difficult to sustain a happy development effort if we go on to build
a massive system and see nothing working whatsoever for three months. I
propose that we take a mini-system-evolves-to-full-system approach. We
start by building a simple multi-threaded, boolean search, SDLIP service,
written all from scratch. From there we expand to where Cheshire is today,
and beyond. Under this approach, it is important to have a carefully designed
skeleton structure established up-front, as I believe I have done in the
previous sections. Each piece of new code may be solving a tiny piece of
the puzzle, but with the skeleton for reference, we always know where we
are and what our purpose is.
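To give a sense of how small the starting mini-system could be, here is a sketch of its core: an in-memory inverted index with boolean AND, queried from a thread pool. It assumes the index is fully built before queries are served, and it leaves out the SDLIP protocol layer entirely. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MiniBooleanSearch {
    // term -> sorted ids of the documents containing it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Indexes one document by splitting its text on non-letter characters. */
    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("[^a-z]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Boolean AND: ids of documents that contain every term. */
    public Set<Integer> and(String... terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> ids =
                postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
            if (result == null) result = new TreeSet<>(ids);
            else result.retainAll(ids);
        }
        return result == null ? new TreeSet<>() : result;
    }

    /** Runs each query on a worker thread and returns results in query order. */
    public List<Set<Integer>> searchAll(List<String[]> queries, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Set<Integer>>> pending = new ArrayList<>();
            for (String[] q : queries) pending.add(pool.submit(() -> and(q)));
            List<Set<Integer>> results = new ArrayList<>();
            for (Future<Set<Integer>> f : pending) results.add(f.get());
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        MiniBooleanSearch s = new MiniBooleanSearch();
        s.add(1, "information retrieval systems");
        s.add(2, "probabilistic information retrieval");
        s.add(3, "library systems");
        List<String[]> queries = Arrays.asList(
            new String[] {"information", "retrieval"},
            new String[] {"systems", "library"});
        System.out.println(s.searchAll(queries, 2));   // prints [[1, 2], [3]]
    }
}
```

Each later piece (probabilistic ranking, Z39.50, persistent indexes) would then replace or wrap one part of this core while the rest keeps working.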
7. Conclusion
I have proposed a re-design of the Cheshire information retrieval system.
While the design does not in itself achieve the system objectives outlined
in section 2, it does achieve the architectural objective of accommodating
all seven of those system objectives. We enable distributed queries through
a clean inter-server discovery and search interface. We design for multiple
interfaces and extensibility from the ground up, with clean plug-ins
for interoperability. We take advantage of Java and Java tools to write
maintainable, concurrent, high performance code. Java web technologies
enable a broad base of clients and allow Cheshire to be searched and administered
from anywhere, without having to install software locally. The groundwork
has been laid for a next generation platform for information retrieval.