Cheshire II Commands

buildassoc - Create or extend a CheshireII associator file


SYNOPSIS

buildassoc [-q][-s] sgmlfile [assocfile] sgmltag
or:
buildassoc [-q][-s] -r directory assocfile sgmltag

DESCRIPTION

Buildassoc creates or extends the associator file for an SGML data file used in the Cheshire II system. The associator file contains information about the starting position and length of the SGML records in the SGML data file, and associates a logical record ID number with the record. This logical ID number is used to identify and retrieve records from the SGML data file(s).
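The idea of an associator entry (logical ID, starting position, length) can be illustrated with a small sketch. This is not the buildassoc implementation and does not reproduce the on-disk associator format, which is not described here. The function name scan_records, the case-insensitive tag matching, and the assumption that records are not nested are all illustrative choices.

```python
import re


def scan_records(data: bytes, tag: str):
    """Return (logical_id, offset, length) for each <tag>...</tag> span.

    Hypothetical helper: assumes records are not nested and that tag
    names may be matched case-insensitively.
    """
    name = re.escape(tag.encode("ascii"))
    open_re = re.compile(rb"<" + name + rb"(?=[\s>])", re.IGNORECASE)
    close_re = re.compile(rb"</" + name + rb"\s*>", re.IGNORECASE)

    entries = []
    pos = 0
    while True:
        start = open_re.search(data, pos)
        if start is None:
            break
        end = close_re.search(data, start.end())
        if end is None:
            break  # unterminated record: stop rather than guess
        # Logical IDs are assigned sequentially, starting at 1.
        entries.append((len(entries) + 1, start.start(), end.end() - start.start()))
        pos = end.end()
    return entries


sample = b"<REC><T>one</T></REC>\n<REC><T>two</T></REC>\n"
for rec_id, offset, length in scan_records(sample, "REC"):
    # Seeking to `offset` and reading `length` bytes yields the record.
    print(rec_id, offset, length, sample[offset:offset + length])
```

With a table like this, a record can be fetched by its logical ID with one seek and one read, without rescanning the data file.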

There are two forms of this command, one for single SGML data files (where the file contains all of the SGML data), and the other for situations where there are multiple SGML files each containing multiple records within a directory subtree. In the second form (indicated by the -r) only the top-level directory is provided as an argument, and the subtree is recursively scanned for files which contain the identifying SGML tag sgmltag. This recursive form of the command generates a special file of "FILECONT" tags for inclusion in the configuration file for the database.

The sgmlfile argument should be the name of the (single) SGML file to be processed.

The directory argument, used with the -r flag, should be the pathname of the root of the directory subtree to be processed.

The assocfile argument should be the name of the associator output file to be generated. In the -r form of the command, this argument is required. In the single-file form it is optional; if omitted, the associator file name will be the same as the sgmlfile name with the extension ".assoc" attached.

The sgmltag argument should be the top-level tag defining a record in the SGML data file.

If the -q flag is used, information about each record processed is suppressed; otherwise the program reports each record found and its length.

If the -s flag is used, extra header information is skipped and not counted as part of the record. This is useful when, for example, XML records that you want to index as documents are "wrapped" with an XML declaration, DOCTYPE and extra begin and end tags at the beginning and end of the file to make the whole file a single valid XML document. In extraction, these leading and end tags are ignored and only the tags asked for are used as begin and end tags of "documents".
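As an illustration of the kind of "wrapped" file the -s flag is meant for, the following sketch separates the leading wrapper, the run of records, and the trailing wrapper. It is only a model of the behaviour described above, not the buildassoc code. The helper name split_wrapper, the sample element names (collection, doc), and the DTD file name are hypothetical.

```python
import re


def split_wrapper(text: str, tag: str):
    """Split text into (header, records, trailer) around the <tag> records.

    Hypothetical helper: everything before the first <tag> and after the
    last </tag> is treated as wrapper material to be ignored.
    """
    name = re.escape(tag)
    first = re.search(r"<" + name + r"(?=[\s>])", text)
    closes = list(re.finditer(r"</" + name + r"\s*>", text))
    if first is None or not closes:
        raise ValueError("no <%s> records found" % tag)
    end = closes[-1].end()
    return text[:first.start()], text[first.start():end], text[end:]


wrapped = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE collection SYSTEM "collection.dtd">\n'
    '<collection>\n'
    '<doc><title>one</title></doc>\n'
    '<doc><title>two</title></doc>\n'
    '</collection>\n'
)

header, records, trailer = split_wrapper(wrapped, "doc")
print(repr(header))   # XML declaration, DOCTYPE and opening wrapper tag
print(repr(trailer))  # closing wrapper tag
```

Only the middle part contains the "documents"; the header and trailer correspond to the material that is skipped and not counted as part of any record.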

OUTPUT FILES

When the first (single SGML file) form of the command is used, only the associator file is created.

When the recursive (-r) form of the command is used, a second file is generated in addition to the associator file. It has the same name as the directory argument, with the extension ".cont", and contains a sequence of "<FILECONT>" statements for inclusion in the configuration file. If a directory named /TESTDATA was used, these might look like:

<FILECONT ID=1 MIN=1 MAX=4> /TESTDATA/test1/ft1 </FILECONT>
<FILECONT ID=2 MIN=5 MAX=13> /TESTDATA/test1/ft2 </FILECONT>
<FILECONT ID=3 MIN=14 MAX=27> /TESTDATA/test1/ft3 </FILECONT>
<FILECONT ID=4 MIN=28 MAX=61> /TESTDATA/test1/ft4 </FILECONT>
<FILECONT ID=5 MIN=62 MAX=77> /TESTDATA/test2/ft5 </FILECONT>
<FILECONT ID=6 MIN=78 MAX=93> /TESTDATA/test2/ft6 </FILECONT>

Each line generated fills in the minimum and maximum logical record numbers in the file. These lines are added to the configuration file for the database following the <FILENAME> tag (which should contain only the root directory name - /TESTDATA).
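To show how the MIN and MAX values map a logical record number to a data file, here is a minimal sketch that parses lines of the form shown above and looks up the file holding a given record. It is not part of Cheshire II; the names FileCont, parse_filecont and file_for_record are hypothetical, and the regular expression assumes the attribute order and unquoted values used in the example.

```python
import re
from typing import List, NamedTuple, Optional


class FileCont(NamedTuple):
    file_id: int
    min_id: int
    max_id: int
    path: str


# Assumes the attribute order ID, MIN, MAX and unquoted integer values.
FILECONT_RE = re.compile(
    r"<FILECONT\s+ID=(\d+)\s+MIN=(\d+)\s+MAX=(\d+)\s*>\s*(.*?)\s*</FILECONT>",
    re.IGNORECASE,
)


def parse_filecont(text: str) -> List[FileCont]:
    """Parse every FILECONT statement found in text."""
    return [
        FileCont(int(fid), int(lo), int(hi), path)
        for fid, lo, hi, path in FILECONT_RE.findall(text)
    ]


def file_for_record(entries: List[FileCont], record_id: int) -> Optional[str]:
    """Return the path whose MIN..MAX range contains record_id, or None."""
    for entry in entries:
        if entry.min_id <= record_id <= entry.max_id:
            return entry.path
    return None


cont = """\
<FILECONT ID=1 MIN=1 MAX=4> /TESTDATA/test1/ft1 </FILECONT>
<FILECONT ID=2 MIN=5 MAX=13> /TESTDATA/test1/ft2 </FILECONT>
<FILECONT ID=3 MIN=14 MAX=27> /TESTDATA/test1/ft3 </FILECONT>
"""

entries = parse_filecont(cont)
print(file_for_record(entries, 10))
```

In this sample, logical record 10 falls in the range 5 to 13, so the lookup returns /TESTDATA/test1/ft2.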

ERROR INFORMATION

Errors are reported to stderr.

BUGS

None known

SEE ALSO

Configuration file documentation, index_cheshire

AUTHOR

Ray R. Larson