-
Cheshire3: a focus on high-performance, network-of-workstations-style operation.
A scalable, extensible platform for information retrieval research. Scalable
performance allows us to explore more resource-intensive IR techniques.
-
JavaSpaces - a high-level coordination mechanism for distributed systems.
Provides a lightweight publish/subscribe distributed programming model.
-
"Messages" in a JavaSpace should be small. Use direct, point-to-point
mechanisms for bulk data transfers.
-
Cheshire will have low volume transactions.
-
Distributed transactions, leasing, and events come for "free".
-
Use JavaSpaces to govern distributed transactions.
-
A single operational model for Cheshire that encompasses single node installations,
uniformly administered clusters, as well as independently administered
federations.
-
every operation is a distributed operation
-
an operation is applied over a set of collections
-
collections
-
single node or cluster: can be partitions of other collections
-
federation: can be partitions or subsets of other collections. In other
words, collections in a loosely coupled federation may have overlapping
records.
-
Virtual Collection: the external interface (or view) to collections. A
VC may present only part of the underlying real collection in its interface.
A VC may grow or shrink dynamically within the bounds of the real collection.
A search only needs to be done over documents in VC, not all documents
in the collection. This gives us a way to logically partition a collection
across a number of machines for performance increase, but with built in
redundancy in the case of node failures. When a node fails, its VC is
simply redistributed (logically) to other nodes in the cluster.
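The failover behavior described above can be sketched as follows. This is only an illustration, not Cheshire code: the class `VirtualCollectionCluster`, its method names, and the round-robin redistribution policy are all assumptions, as is the premise that every surviving node can reach the records of the underlying real collection.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: each node searches only the record ids in its
// virtual collection (VC), a subset of the real collection.
class VirtualCollectionCluster {
    // node name -> record ids currently in that node's VC
    private final Map<String, Set<Integer>> vcByNode = new HashMap<>();

    void addNode(String node) {
        vcByNode.put(node, new TreeSet<>());
    }

    // Grow a node's VC by one record id.
    void assign(String node, int recordId) {
        vcByNode.get(node).add(recordId);
    }

    Set<Integer> vcOf(String node) {
        return vcByNode.get(node);
    }

    // On failure, deal the failed node's record ids round-robin to the
    // surviving nodes (assumed policy; survivors are taken in name order).
    void failNode(String failed) {
        Set<Integer> orphaned = vcByNode.remove(failed);
        if (orphaned == null || vcByNode.isEmpty()) {
            return;
        }
        List<String> survivors = new ArrayList<>(new TreeSet<>(vcByNode.keySet()));
        int i = 0;
        for (int id : orphaned) {
            vcByNode.get(survivors.get(i % survivors.size())).add(id);
            i++;
        }
    }
}
```

Only the logical assignment moves; no record data is copied, which matches the idea that a VC is a view rather than a store.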
-
Cheshire servers can be organized into server groups. A server group can
be thought of as an administrative unit.
-
Designing for parallelism and scalability
-
One I/O worker per disk (conceptually). The philosophy here is that we
don't want the OS to attempt more parallelism in software than actually
exists in hardware. If Berkeley DB is intelligent about concurrent accesses
to disk, we already have this effect and don't need to do anything special
here.
-
One data worker for each data resource. Here, a data resource should not
straddle multiple disks. The philosophy is that there is an optimal way
to concurrently access each resource. And the data worker is responsible
for scheduling this access.
-
One compute worker for each CPU (conceptually). The OS can be thought to
be exactly this kind of worker. The OS knows about its CPU resources and
schedules them accordingly.
-
One task worker for each task. A task may be protocol translation, search,
display, update, etc.
-
A distinction should be made between task and data workers in the same
address space and those in different address spaces. In general, data
intensive transfers should happen only between data and task workers in
the same address space. Task workers have direct read access to data. Writes
are handled by background data workers.
-
Use profiling tools for performance tuning.
-
Building an open system
-
Loosely coupled object architecture
-
distinguish between structural and behavioral patterns in an object architecture
-
Superclasses capture the core contract, while interfaces capture ancillary
contracts.
-
objects can be created on demand. If this is done (could be an attractive
option for Document objects), it should be hidden from clients.
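A minimal sketch of on-demand creation hidden behind an ordinary accessor follows. `LazyDocument` and its `Supplier`-based loader are hypothetical names chosen for illustration; the notes do not specify how Document objects would be materialized.

```java
import java.util.function.Supplier;

// Hypothetical sketch: the document body is loaded on first access, and
// clients cannot tell whether it was already in memory.
class LazyDocument {
    private final String id;
    private final Supplier<String> loader; // assumed to fetch the body from storage
    private String text;                   // null until first requested

    LazyDocument(String id, Supplier<String> loader) {
        this.id = id;
        this.loader = loader;
    }

    String id() {
        return id;
    }

    // Loads the body at most once; later calls return the cached value.
    synchronized String text() {
        if (text == null) {
            text = loader.get();
        }
        return text;
    }
}
```

Clients call `text()` the same way in both cases, so switching between eager and on-demand creation does not change their code.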
-
object composition (black box reuse) is more reusable than class inheritance
(white box reuse)
-
Reduce the proliferation of types with "unsupported operation" exceptions.
That is, define broad, general types and leave it to implementation to
support or refuse operations.
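The approach above resembles the way the standard Java collection interfaces treat optional operations. A small sketch, with the hypothetical names `RecordStore`, `ReadOnlyStore`, and `MemoryStore`, might look like this:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical broad type: one interface covers read-only and writable
// stores; implementations refuse what they do not support.
interface RecordStore {
    String fetch(int id);

    default void store(int id, String record) {
        throw new UnsupportedOperationException("store not supported");
    }

    default void delete(int id) {
        throw new UnsupportedOperationException("delete not supported");
    }
}

// Supports only fetch; store and delete fall through to the defaults.
class ReadOnlyStore implements RecordStore {
    private final Map<Integer, String> records;

    ReadOnlyStore(Map<Integer, String> records) {
        this.records = Map.copyOf(records);
    }

    @Override
    public String fetch(int id) {
        return records.get(id);
    }
}

// Supports every operation.
class MemoryStore implements RecordStore {
    private final Map<Integer, String> records = new HashMap<>();

    @Override
    public String fetch(int id) {
        return records.get(id);
    }

    @Override
    public void store(int id, String record) {
        records.put(id, record);
    }

    @Override
    public void delete(int id) {
        records.remove(id);
    }
}
```

The trade-off is that a refusal surfaces at run time rather than at compile time, in exchange for a single type instead of separate readable, writable, and deletable variants.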
-
Override uncaughtException in the root ThreadGroup for error handling. All
threads should be descendants of the root group.
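A sketch of this error-handling hook is shown below. Since the JVM's own system thread group cannot be subclassed, the sketch assumes an application-level root group that all worker threads are created under; `RootGroup`, its `failures` list, and the `runAndWait` helper are hypothetical.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical application root group: any exception that escapes a
// thread in this group (or a descendant group) is recorded here.
class RootGroup extends ThreadGroup {
    final List<String> failures = new CopyOnWriteArrayList<>();

    RootGroup() {
        super("cheshire-root");
    }

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        // A real system would log or trigger recovery; this records only.
        failures.add(t.getName() + ": " + e.getMessage());
    }

    // Convenience helper for the sketch: run a task on a thread in this
    // group and wait for the thread to finish.
    void runAndWait(String name, Runnable task) {
        Thread t = new Thread(this, task, name);
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Threads that install their own `UncaughtExceptionHandler` bypass the group, so this only works as a catch-all if worker threads leave the per-thread handler unset.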
-
Follow these
code conventions where practical.
-
Packages contain functionally related classes and do not necessarily reflect
on structural patterns. For example, although index and retrieval objects
are aggregated under (managed by) document objects, the index and retrieval
packages are not subpackages of the document package. Using packages to
document structural patterns is both unnecessary and insufficient. They
are, however, a useful way to define behavioral groups: the document package
deals with documents, the index package deals with indexing, etc.