Go to the previous, next section.

Comparisons

The COMPARE command is used to specify that a comparison is to be performed. The comparison may be done within the current calculation or may be done with a different PSF, a different coordinate set, a different parameter set, a different hydrogen bond list, a different non-bonded list, a different set of constraints, a different set of averaging data structures, a different Poisson-Boltzmann data structure or any combination thereof. In addition, the COMPARE command can be used to reorient an entire coordinate trajectory with respect to a reference coordinate set.

Before one uses the COMPARE command, it is important to understand what the the program needs to do a comparison. When the analysis facility is called, it needs at least seven data structures; a PSF, a coordinate set, a set of harmonic constraints, a parameter set, a parameter coding set, hydrogen bond list, and a non-bonded list. It may also require the Poisson-Boltzmann data structure or one of dynamical average data structures. The function of the COMPARE command is to create another set of these seven data structures to be used for comparison calculations, and to establish a relationship between the two sets of data structures. Let us examine each of these tasks in turn.

The creation of the comparison data structures is a very straightforward operation. The COMPARE command allows one to specify Fortran unit numbers which refer to binary files for the comparison PSF, coordinate set, harmonic constraints, parameter set, hydrogen bond list, and non-bonded list. The parameter coding list is generated without user intervention. In addition, the user may generate a new hydrogen bond list or non-bonded list with different selectional rules than were used in the main calculation.

If a user does not specify how to create a particular data structure, the data structures are taken from the main calculation (in the case of the PSF, constraints, coordinates, and parameters) or generate them using parameters as specified in the main calculation (the case for the hydrogen bond and non-bonded list). In addition, if any dynamical averaging data structures or Poisson-Boltzmann data structures have been created and the comparison PSF is the same as the main, then the comparison dynamical averaging data structures and Poisson-Boltzmann data structures will be assigned to those of the main.

The relating of the two calculations is more involved. Basically, if one has a pairing between the atoms in the main PSF and the atoms in the comparison PSF, it is possible to relate any internal coordinate or hydrogen bond. This pairing is generated in a three step process where we first match segments, then residues, and finally atoms --- mirroring the hierarchy of organization used in the PSF.

Compare Command Syntax: Syntax of the COMPARE command
Creation of Comparison Data: Creation of comparison data structures
Matching of Comparison Data: Matched between data structures
Reorienting a Coordinate Trajectory: Reorienting a coordinate trajectory
Comparison Messages: Messages produced during the comparison

`COMPARE` Command Syntax

COMPARE

    [PSF unit-number] [PARM unit-number] [CONS unit-number]

          {   MAIN    }
    [COOR { coor-spec } [STORe] del]
          {  COMPare  }

    [SEGMATCH repeat(segid segid) del ]

    [repeat(resmatch-command)]

    [repeat(atomatch-command)]

                                               [FIRSTU  unit-number]
               [SIDE ]                         [NUNIT integer      ]
    [COORMATCH [BACK ] [SAVE name] [ DYNAMICS  [OUTPUTU unit-number] [MASS] ]
               [NOROT]                         [NFILE integer      ]
                                               [SKIP integer       ]

    [HBOND [UNIT unit-number] hbond-specs del]
           [      COPY      ]

    [NBOND [UNIT unit-number] non-bond-specs del]
           [      COPY      ]

    [PBE [UNIT unit-number] del]
         [NAME word       ]

    [VERBOSE]

resmatch-command ::=

    RESMATCH {  ALLSEG   } [PRINT] [               NOHOMOLOGY            ]
             {segid segid}         [HOMOLOGY [range-spec] repeat(cons-spec)]

             repeat(resid resid) deldel

range-spec ::= RANGES repeat(range-set) del

range-set ::= resid resid  resid resid

cons-spec ::= CONSERVE repeat(resname) del


atomatch-command ::=

             [  ALLSEG   ] [        ALLRES         ]
    ATOMATCH [segid segid] [   RESID resid resid   ]
                           [RESNAME resname resname]

             [ONLY] repeat(iupac iupac) del

name ::= word

Syntactic ordering: The operands of the `resmatch-command' and `atomatch-command' must be specified in the given order, except for the PRINT option in the resmatch-command.

The `coor-spec' is a file specification for coordinates. See section Syntax of READ Command, for the detailed syntax.

The `non-bond-specs' are described in section Syntax for Generation of Non-bonded Interactions.

The `hbond-specs' are described in section Syntax of the Hydrogen Bond Command.

NOTE: The syntax above requires that single delimiters sometimes be placed next to double delimiters. In this case, be sure to leave a space between the single delimiter and the succeeding double delimiter. $ $$ is OK; $$$ or $$ $ is not.

The Creation of the Comparison Data Structures.

The PSF, CONS, and PARM keywords give the unit numbers of the comparison protein structure file, harmonic constraints, and parameter set; respectively. They must all be binary files created by a WRITE command to the main part of CONGEN, see section WRITE -- Writes Data Structures to External Files. If any are not specified, the corresponding comparison data structure is presumed to be the same as the main data structure.

The COOR keyword specifies where the comparison coordinates come from and, optionally, where they are to be placed when the command finishes execution. The comparison coordinates can come from three sources; the main set, the comparison set in the main part of CONGEN, or from an external file. If the COOR option is left out or if you use the option MAIN, the comparison coordinates come from the main set. If you specify COMPARE in the COOR option, the coordinates come from the comparison coordinates in the main part of CONGEN, see section Data Structures, and section The Coordinate Manipulation Commands, for more details about this set. Otherwise, the coordinates will be read from a file. You are free to use either card image or binary files for these coordinates; but be sure that all coordinates you need are specified. See section Reading coordinates, on the format of the coordinate files.

The NBOND and HBOND sub-commands are used to specify the comparison non-bonded and hydrogen bond list, respectively. The UNIT option is each of commands allows the specification of a Fortran unit number which refers to the binary file containing the lists. The COPY option requests that the main data structure be assigned to the comparison data structure. If neither option is specified, the program will generate a new list. The `nbond-specs', see section Syntax for Generation of Non-bonded Interactions, and `hbond-specs', see section Syntax of the Hydrogen Bond Command, specify generation and sigmoid switching function parameters.

When either of these data structures are read in, it is possible to modify the settings for the various switching function and energy evaluation parameters. Simply specify the values you want changed. You cannot change the selection method when something is read in (that doesn't make sense anyway). In the case of the hydrogen bonds, the file does not contain any of the selection and switching function parameters. If you specify a selection option, it will ignored even though you will get a message telling you what the values the program read are. Please ignore such information.

If any of the options or cutoffs for the HBOND or NBOND commands are not specified, the reference structure parameters will be used. If no HBOND or NBOND command is given, the list will be generated for a comparison PSF or coordinate set using the method employed for the reference structure.

The PBE keyword is used to specify the Poisson-Boltzmann data structure for the comparison structure. The data structure may be read from a disk file or may be recalled from an internal symbol. The UNIT keyword specifies that the Poisson-Boltzmann data structure be read from a binary file created using the PBE WRITE command, see section PBE WRITE Command. The NAME keyword specifies that the data structure be taken from a symbol as specified by the PBE SAVE command, see section PBE SAVE Command. If no option is specified and if the comparison PSF is the same as the main PSF, then the comparison PBE data structure will be assigned to the main PBE data structure

Matching of the Comparison Data Structures

The matching process begins after the PSF, COOR, and PARM options are handled. Segments are matched first.

The SEGMATCH command specifies how segments are to be matched. In the absence of a SEGMATCH command, the segment are paired by matching the segments which have the same identifier. If the command is given, the pairs of segment identifiers dictate the matching of segments.

Once the segments are matched, we must match the residues in every matched segment. Central to this problem is the finding of homologies between two sequences.(19)

The problem of finding homologies has been attacked by many others. The method used in the analysis facility is a modification of the string to string correction problem,(20) a problem well known in the computer science literature. The string to string correction problem can be stated as follows: Given a cost function for deleting any character from a string, a cost function for inserting a character into a string, and cost function for changing some character in the string into another character; find the lowest cost sequence of insertions, deletions, and changing operations that transform one string into the other. The cost functions in this algorithm are general in that they specify a different value for every possible character or pairs of characters. This problem is solvable in time and computer storage proportional to the product of the lengths of the strings.

We can see that this problem is applicable to finding a homology. As a protein genetic code evolves, insertions, deletions, and changes occur in its sequence. The best homology is one which minimizes these mutations. The algorithm calculates what is needed, a matching of residues that gives the lowest cost transformation of one sequence into another. Further, by establishing an appropriate change function, we can find homologies which permit conservative changes.

With the homology finding process understood, we can turn to the RESMATCH command. Any number of RESMATCH commands may be specified up to the number of segment pairs. Each command can specify the matching of residues in one segment pair or all. The first part of a RESMATCH command specifies what segment pair the command applies to. A segment identifier pair refers to that pair only; ALLSEG refers to all segments pairs which have not yet been processed. The RESMATCH keeps a record of which segment pairs have been specified in a command; any which are not specified are matched by finding the homology between their sequences. Further, each segment can be matched only once.

The next part of the RESMATCH command specifies how the homology should be found. If NOHOMOLOGY is specified, then the homology finding step is bypassed. If HOMOLOGY is not specified either, then the homology is taken over the complete sequences in the segment pairs. When HOMOLOGY is specified, one may specify the range of residues over which the homology should be taken. The ranges are specified as pairs of pairs of residue identifiers. The first pair gives the starting and ending residues in the first segment; the second pair gives the starting and ending residues for the second segment. More than one set of ranges may be specified. Any number of conserved sets of residues may be specified using the CONSERVE command. Each use of CONSERVE specifies that all residues whose names are given may be considered equivalent when the homology is found. The PRINT option, if specified, causes all calls to the homology finding subroutine in a RESMATCH command to print the matching of the sequences. Finally, in addition to the homology operations, one may specify that certain residues are to be matched. The residue identifier pairs at the end of a RESMATCH command match those residues.

Once the matching of residues with segment pairs has taken place, we can tackle the final step of matching the two PSF's --- matching their atoms. If no ATOMATCH commands are given, the default action is to use the IUPAC nomenclature for each atom in the residues and match atoms which have the same IUPAC name. This is usually acceptable as the IUPAC nomenclature results in many equivalent atoms being properly matched.

With ATOMATCH commands, one can tailor the atom matching process as well. Any number of ATOMATCH commands, up to the number of residue pairs, may be specified. As with RESMATCH, the ATOMATCH command processor keeps a record of what residue pairs have had their atoms matched. Any residue pairs which have not been matched at the end of ATOMATCH processing have their atoms matched by the default algorithm.

The first part of an ATOMATCH command specifies what segment pair the command applies to. If ALLSEG is specified or if no segment identifier pair is given, the command will apply to the residue pairs in all segment pairs. If a segment identifier pair is given, the command will be applied to that pair only.

The next part of the command specifies what residue pairs the command applies to. ALLRES or no specification of residue results in the command applying to all residues within the selected segment pairs. RESID means that the next two words are the residue identifiers of the residues whose atoms are to be matched. RESNAME means that the following two words are residue names and that the atom matching specified in the command applies to all such pairs found in the given segment pairs. For this option, the program will check for the residue name pair in either order, e.g. if we say RESMATCH RESNAME ALA GLY ..., it does not matter which sequence has GLY and which has ALA. On the other hand, the order in the pair does matter with the RESID option.

The option, ONLY, specifies that the default operation of matching atoms by IUPAC names should not take place. Normally, the atoms in a residu pair are matched by IUPAC names before the user matching of atoms is executed. Specifying ONLY will override the preliminary matching.

Finally, the pairs of words at the end of the command give the matching of atoms specified through their IUPAC names.

Once the atom pairing is complete, it is checked for duplicate references to any atom as the pairing must be an equivalence relation. Any pair which refers to an atom which is referred to by another pair is deleted. Should this occur, it is important to see what command is causing a duplication and to fix the problem. The most common source for this error is to forget to specify ONLY in the ATOMATCH command. Usually, the pair deleted from a duplicate is the pair one is interested in.

It should be noted that the matching process is performed even if just coordinates are being compared. The default action results in an atom pairing that matches every atom to itself.

Once the atom pairing is established, one can specify that the coordinates of paired atoms be matched. There are two operations involved in fitting coordinates -- translation and rotation. The translation operation is as follows: The geometric center or center of gravity of the main atoms in the atom pair list and the comparison atoms in the atom pair list are calculated. The difference between the centers is added to the comparison coordinates to minimize the least squares translational differences between the coordinates. The rotation operation is as follows: The comparison coordinate set can be rotated about its center to minimize the least square differences between the two sets. The rotation can be done with respect with several different sets of atoms.

The COORMATCH option specifies how the coordinate matching take place. If COORMATCH is not specified, no attempt is made to match the coordinates. If COORMATCH is specified without any modifier, it results in a translation followed by a least square rotational fit of all paired atoms. If MASS is specified, the center of gravities are used for the translational fit; otherwise, the geometric centers are used. If SIDE is specified, the least square rotational fit takes place with all atoms pairs where at least one of the atoms in the pair does not have an IUPAC name that corresponds to a protein backbone. If BACK is specified, the least square rotational fit takes place with the complement of atoms that would be matched by SIDE. I.e. BACK will do the matching using only backbone atoms, and SIDE will do the match using side chain atoms. If NOROT is specified, no rotation takes place. Finally, the SAVE option will save the coordinate transformation necessary to fit the comparison coordinates onto the reference coordinates. This can be used later by the coordinate transformation commands to move atoms around according to this transformation, see section The Coordinate Manipulation Commands.

Reorienting a Coordinate Trajectory

If one is interested in reorienting every set of coordinates found in a dynamics trajectory with respect to some reference structure, one can use the DYNAMICS option in conjunction with the COORMATCH options in the COMPARE command. Whatever options specified in the COORMATCH command are used to translate and/or rotate all the coordinates in the trajectory. The matching of the two PSF's determines which atom pairs are used in the fit.

The process of reorienting a coordinate trajectory works as follows: A series of files containing the trajectory are assigned to successive units prior to a CONGEN run. The coordinates stored therein are presumed to have been written every NSAVC steps. CONGEN will read each coordinate, select some periodically, reorient them, and write them to successive units where each output file will have a user specified number of coordinates. The following table lists the options involved:

Option  Default  Purpose

FIRSTU     51    The first unit of the trajectory to be read.

NUNIT      1     The number of units to be read starting with FIRSTU.

SKIP       1     Only those coordinate whose dynamics step number
                 modulo SKIP will be reoriented and written out.

OUTPUTU    61    The first unit number of the output trajectory.

NFILE            The number of coordinates written to each output file.
                 If left out, this will be set to the number of
                 coordinates in the first input file.

The title of the output trajectory will be copied from the input trajectory.

If atoms were fixed during the dynamics, see section Fixing Atoms in Place, the new trajectory produced will not have fixed atoms because the rotations applied to each coordinate set will be different thereby yielding different coordinates for the fixed atoms. Fixing the coordinates leads to a large space reductions, so the reorientation process will therefore result in potentially much larger trajectory files.

When the process is complete, the comparison coordinates will be the last set of coordinates written to the output trajectory files. This is true even if `COOR unit' is specified. Also, if the SAVE option is used, then the saved transformation will be the one used for the last set of coordinates.

Comparison Messages

As the COMPARE command processor proceeds, it will output messages on its progress. If you make a mistake, the messages may give you some idea of what you did wrong. In interpreting any error messages you receive, keep in mind the fact that as each part of the command is interpreted, it is removed from the command line. If you make a typo so something doesn't get deleted, it may be interpreted as something else. For example, if one were to specify,

COMPARE SEGMATCH A B $ -
        RESMATCH A B RANGES 1 100 10 110 $$

the error messages would say that RANGES is too long to be a residue identifier and that the residue id's are not all paired. The error in the command is that the RANGES command is not terminated by a single delimiter. This results in RANGES being left in, and it and the rest of the command being interpreted as resid pairs. If this seems hopelessly obscure, it unfortunately is.

If one desires a detailed print out of the coordinate matching and atom matching, specify the option, VERBOSE. Your printout will weigh pound instead of ounces.