Running Cheshire

Cheshire consists of three main programs that are run to build a database, two main client programs, two different servers, and several programs that display information about the indexes or database. This page covers the building and informational programs. The clients are discussed in the scripting page.

The buildnew.sh script (see the bottom of the page) is set up to run these programs as needed in the indexing process and can be used as a reference when doing this by hand.


buildassoc
The buildassoc program is the first to be run and ascertains the locations of the individual records to be indexed. It records these in an associator file, referenced in the 'assocfil' field in the configuration file. If it is run over a directory, rather than a single file, it will also produce a '(directory name).cont' file which contains the files in the 'filecont' format needed by the configuration file.

Buildassoc can therefore be run in two ways, on a single file or over a directory. The choice changes how the data source is given on the command line; the other two arguments keep the same meaning.

To run it on a single SGML file:
buildassoc [-q] sgmlfile [assocfile] sgmltag

And to run it recursively over a directory:
buildassoc [-q] -r directory assocfile sgmltag

The '-q' flag stands for 'quiet'. If it is given, the record lengths normally printed to standard output are suppressed.
The sgmlfile argument is the file name where the sgml data is kept.
'-r directory' gives the directory over which to run. This should be given as a full path, as this will be what is recorded in the DATA.cont file.
assocfile is the name of the associator file to create, typically the name of your database with '.assoc' appended. If it is omitted in the single file version, the associator file will be given the same name as your sgml file with '.assoc' appended.
sgmltag is the name of the top level sgml tag in your DTD. For example <EAD>, <TEI.2>, <USMARC> and so forth. Don't include the brackets around the tag name, as Unix shells treat < and > as redirection operators.

Notes:

Example: buildassoc -r /home/cheshire/cheshire/quickstep/DATA quickstep.assoc email
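
The two ways of running buildassoc can be wrapped in a small shell function, sketched below. The function name build_assoc and its argument order are hypothetical and not part of Cheshire; only the buildassoc command lines themselves follow the usage given above.

```shell
#!/bin/sh
# Sketch: pick the buildassoc form depending on whether the source is a directory.
# build_assoc is a hypothetical helper name, not a Cheshire program.
build_assoc() {
  src=$1      # SGML file, or data directory (give directories as a full path)
  assoc=$2    # associator file to create, e.g. quickstep.assoc
  tag=$3      # top level SGML tag, without the angle brackets

  if [ -d "$src" ]; then
    # Directory form: recurse over the directory and also write the .cont file.
    buildassoc -q -r "$src" "$assoc" "$tag"
  else
    # Single file form.
    buildassoc -q "$src" "$assoc" "$tag"
  fi
}
```

With this in place, the example above would be written as: build_assoc /home/cheshire/cheshire/quickstep/DATA quickstep.assoc email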

index_cheshire
The index_cheshire program is what does most of the indexing work once the record locations have been established. Using the configuration file, it builds the indexes and component indexes from the data.

It is run as:
index_cheshire [-b] [-T tempdirname] [-L logfilename ] configfilename [startrec] [maxrec]

The -b flag is recommended. With it, index_cheshire uses batch loading techniques that greatly increase the indexing speed, at the cost of a lot more disk space while indexing.
'-T tempdirname' gives index_cheshire an alternate directory in which to create its temporary files.
'-L logfilename' gives a new name to the log file generated during the indexing process. The default is INDEX_LOGFILE.
configfilename is the name of the configuration file, typically DBCONFIGFILE.
startrec, if given, makes index_cheshire start at that record number. If not given, index_cheshire starts at the first record.
maxrec is the last record number to process. It defaults to the last record in the database.

During indexing, the file INDEX_LOGFILE (or the log file named with -L) is created, or appended to if it already exists. It records any errors which occur during the indexing process, and is hence vital for debugging a misperforming system.

To add records to an existing database, run the buildassoc program first without removing the old assoc file. The new records will then be appended to the end. Then run index_cheshire with startrec set to the number of the first new record.

Notes:

Example: index_cheshire -b DBCONFIGFILE
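
The procedure for adding records to an existing database could be sketched as follows. The helper name add_records and its argument order are hypothetical, and the caller is assumed to know the number of the first new record. When running over a directory, the configuration file is assumed to have been updated already to list the new files, as the buildnew.sh script at the bottom of the page does.

```shell
#!/bin/sh
# Sketch: append new records to an existing database, then index only those.
# add_records is a hypothetical helper name, not a Cheshire program.
add_records() {
  datadir=$1    # full path of the data directory
  assoc=$2      # existing associator file; it must not be removed first
  tag=$3        # top level SGML tag, without the angle brackets
  config=$4     # configuration file, typically DBCONFIGFILE
  startrec=$5   # number of the first new record

  # Stop if the old associator is missing: there would be nothing to append to.
  if [ ! -f "$assoc" ]; then
    echo "add_records: $assoc not found" >&2
    return 1
  fi

  # Append the new record locations, then index from the first new record on.
  buildassoc -q -r "$datadir" "$assoc" "$tag" &&
    index_cheshire "$config" "$startrec"
}
```

The record number is passed in rather than computed, since how best to find it depends on how the new data was added.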

index_clusters
While index_cheshire handles the components and main indexes, the index_clusters program must be run if any clusters are defined. It is run in much the same way as index_cheshire:

index_clusters [-b] [-T tempfiledir] [-L logfilename ] configfilename [INDEXONLY]

If the INDEXONLY argument is given, then index_clusters will not try to create the cluster data, but just index it. This is for cases when the cluster data has already been created, but needs to have the indexes for it rebuilt.
The other arguments to index_clusters are the same as for index_cheshire.
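
As a rough sketch, a full build for a database with clusters, and a later rebuild of only the cluster indexes, might look like this. The function names are hypothetical, and running index_clusters after index_cheshire is an assumption about ordering that the text above does not state explicitly.

```shell
#!/bin/sh
# Sketch: index a database that defines clusters.
# Both function names are hypothetical helpers, not Cheshire programs.
build_all_indexes() {
  config=$1   # configuration file, typically DBCONFIGFILE

  # Main and component indexes first (assumed order), then the clusters.
  index_cheshire -b "$config" || return 1
  index_clusters -b "$config"
}

# Rebuild only the indexes over cluster data that has already been created.
reindex_clusters() {
  index_clusters -b "$1" INDEXONLY
}
```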


countdb
Countdb gives frequency information about the items in an index. It reports the total number of entries, the average number of records per entry, the maximum records in an entry and the top 200 frequencies.

Usage: countdb configfilename mainfile_name index_name

Configfilename is the path of the configuration file for the database.
Mainfile_name is the name of the database file, as recorded in the 'filetag' field in the configuration file.
Index_name is the name of the index, as recorded in the 'indextag' field.

For example:
countdb DBCONFIGFILE archives subject

dumpdb
Dumpdb takes the same arguments as countdb, but shows the entire index with every key and the corresponding frequencies and total records.

dumppost
Dumppost, like dumpdb, shows the entire index, but gives the record numbers and frequency of occurrence within the record for every key in the database. It takes the same arguments as countdb.

highpost
The highpost program produces a list of keys which have a frequency above the minimum given in the minpost argument. Otherwise the arguments are the same as countdb.

Usage: highpost minpost configfilename mainfile_name index_name
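
Since these programs can produce a lot of output for a large index, it may be convenient to capture their reports in files. The sketch below assumes the programs write their reports to standard output; the function name index_report, the 'reports' directory and the output file names are all hypothetical.

```shell
#!/bin/sh
# Sketch: save countdb and highpost reports for one index.
# index_report and the output file names are hypothetical.
index_report() {
  config=$1             # configuration file, typically DBCONFIGFILE
  filetag=$2            # value of the 'filetag' field
  indextag=$3           # value of the 'indextag' field
  minpost=$4            # minimum frequency for highpost
  outdir=${5:-reports}  # where to put the reports

  mkdir -p "$outdir" || return 1

  # Note that highpost takes minpost before the arguments shared with countdb.
  countdb "$config" "$filetag" "$indextag" > "$outdir/$indextag.count" &&
    highpost "$minpost" "$config" "$filetag" "$indextag" > "$outdir/$indextag.high"
}
```

For the countdb example above, this could be called as: index_report DBCONFIGFILE archives subject 100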

dtd_parser
This program will test a DTD to see if it is valid. If so, it will dump a pattern file to '(dtd).completeout'. If this file is empty and very little output was produced, or an error message was given, then the DTD is invalid. Be sure to check that the DTD has a DOCTYPE, as a lot are distributed without one.

Usage: dtd_parser dtd_name [sgml_catalog_name]

dtd_name is the file name of the dtd, not the public identifier for it.
If the dtd needs to reference other files, then you will also need to give the file path for the catalog file in the second argument.
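
The validity check described above could be scripted roughly as follows. This is a sketch under two assumptions: that the pattern file is written next to the DTD under the DTD's file name with '.completeout' appended, and that a non-empty pattern file is a good enough sign of success. The function name check_dtd is hypothetical.

```shell
#!/bin/sh
# Sketch: run dtd_parser and judge the result by the size of the pattern file.
# check_dtd is a hypothetical helper; the output file location is an assumption.
check_dtd() {
  dtd=$1           # file name of the DTD (not its public identifier)
  catalog=${2:-}   # optional SGML catalog file
  out="$dtd.completeout"

  # Warn early about a missing DOCTYPE, a common cause of failure.
  if ! grep -qi '<!DOCTYPE' "$dtd"; then
    echo "check_dtd: warning: no DOCTYPE found in $dtd" >&2
  fi

  # Remove any pattern file left over from a previous run.
  rm -f "$out"

  if [ -n "$catalog" ]; then
    dtd_parser "$dtd" "$catalog"
  else
    dtd_parser "$dtd"
  fi

  # Succeed only if a non-empty pattern file was produced.
  [ -s "$out" ]
}
```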

test_config
The test_config program will, surprisingly enough, test the validity of a configuration file. Information about the file is reported as it runs through it.

Usage: test_config configfilename


Sample Buildnew.sh Script

Here is a sample, only slightly non-trivial buildnew script. If given an argument, it will treat this as the record number at which index_cheshire should continue processing. It writes its logs to a 'logs' directory and otherwise is fairly tidy.

#!/bin/sh

# Set up environment
cd /home/cheshire/cheshire/metadata
export PATH=$PATH:/home/cheshire/cheshire/bin

# Remove generated files (-f: no error if they do not exist yet)
rm -f DBCONFIGFILE DATA.cont metadata.assoc

# Trash old index and log files for complete rebuild
if [ "$1" = "" ]; then
  rm -f logs/INDEX_LOGFILE indexes/*.index
fi

# Rebuild associator.
buildassoc -r /home/cheshire/cheshire/metadata/DATA metadata.assoc cheshireRecord

cp DBCONFIGFILE.top DBCONFIGFILE
cat DATA.cont >> DBCONFIGFILE
cat DBCONFIGFILE.bottom >> DBCONFIGFILE

# Index from the first record, or from the start record given as argument
if [ "$1" = "" ]; then
  index_cheshire -L logs/INDEX_LOGFILE DBCONFIGFILE
else
  index_cheshire -L logs/INDEX_LOGFILE DBCONFIGFILE "$1"
fi