Program  glimmer  takes two inputs:  a sequence file (in FASTA format)
and a collection of Markov models for genes as produced by the program
 build-imm .  It outputs a list of all open reading frames (orfs) together
with scores for each as a gene.

The first few lines of output specify the settings of various
parameter in the program:

 Minimum gene length is the length of the smallest fragment
   considered to be a gene.  The length is measured from the first base
   of the start codon to the last base *before* the stop codon.
   This value can be specified when running the program with the  -g  option.

 Minimum overlap length is a lower bound on the number of bases overlap
   between 2 genes that is considered a problem.  Overlaps shorter than
   this are ignored.

 Minimum overlap percent is another lower bound on the number of bases
   overlap that is considered a problem.  Overlaps shorter than this
   percentage of *both* genes are ignored.

 Threshold score is the minimum in-frame score for a fragment to be
   considered a potential gene.

 Use independent scores indicates whether the last column that scores each
   fragment using independent base probabilities is present.

 Use first start codon indicates whether the first possible start codon
   is used or not.  If not, the function  Choose_Start  is called to
   choose the start codon.  Currently it computes hybridization energy
   between the string  Ribosome_Pattern  and the region in front of
   the start codon, and if this is above a threshold, that start site
   is chosen.  The ribosome pattern string can be set by the  -s  option.
   Presumably function  Choose_Start  should be modified to do something
   cleverer.


The next portion of the output is the result for each orf:

 Column 1 is an ID number for reference purposes.  It is assigned
   sequentially starting with  1  to all orfs whose  Gene Score  is
   at least  90 .  I'll make this a command-line option when I decide
   what letter to use.

 Column 2 is the reading frame of the orf.  Three forward (F1, F2 and F3)
   and three reverse (R1, R2 and R3).  These correspond with the headings
   for the scores in columns 9-14.

 Column 3 is the start position of the orf, i.e., the first base *after*
   the previous stop codon.

 Column 4 is the position of the first base of the first start codon in
   the orf.  Currently I use atg, ctg, gtg and ttg as start codons.

 Column 5 is the position of the last base *before* the stop codon.  Stop
   codons are taa, tag, and tga.  Note that for orfs in the reverse
   reading frames have their start position higher than the end position.
   The order in which orfs are listed is in increasing order by
   Max {OrfStart, End}, i.e., the highest numbered position in the orf,
   except for orfs that "wrap around" the end of the sequence.

 Columns 6 and 7 are the lengths of the orf and gene, respectively, i.e.,
   1 + |OrfStart - End|  and  1 + |GeneStart - End| .

 Column 8 is the score for the gene region.  It is the probability (as
   a percent) that the Markov model in the correct frame generated this
   sequence.  This value matches the value in the corresponding column
   of frame scores--an orf in reading frame R1 has a Gene Score equal to
   the value in the R1 column of frame scores for that orf.

 Columns 9-14 are the scores for the gene region in each of the 6 reading
   frames.  It is the probability (as a percent) that the Markov model in
   that frame generated this sequence.

 Column 15 is the probability as a percent that the gene sequence was generated
   by a model of independent probabilities for each base, and represents to
   some extent the probability that the sequence is "random".


When two genes with ID numbers overlap by at least a sufficient
amount (as determined by  Min_Olap  and  Min_Olap_Percent ), a line
beginning with  ***  is printed and scores for the overlap region
are printed.  If the frame of the high score of the overlap
region matches the frame of the longer gene, then a message is
printed that the shorted gene is rejected.  Otherwise, a message
is printed that *both* genes are "suspect".  A suspect or reject
message for any gene is only printed once, however.

A message is also printed if a gene with an ID number wholly contains another
gene with an ID number.  The longer "shadows" the shorter.


At the end a list of "putative" gene positions is produced.  The first
column is the ID number, the second is the start position, the third
is the end position.  For "suspect" genes, a notation in  [] 's follows:

 [Bad Olap  a  b  c]  means that gene number  a  overlapped this one and
   was shorter but scored higher on the overlap region.   b  is the length
   of the overlap region and  c  is the score of *this* gene on the overlap
   region.  There should be a  [Shorter ...]  notation with gene  a
   giving its score.

 [Shorter  a  b  c]  means that gene number  a  overlapped this one and
   was longer but scored lower on the overlap region.   b  is the length
   of the overlap region and  c  is the score of *this* gene on the overlap
   region.  There should be a  [Bad olap ...]  notation with gene  a
   giving its score.

 [Shadowed by  a]  means that this gene was completed contained as part
   of gene  a 's region, but in another frame.

 [Delay by  a  b  c  d]  means that this gene was tentatively rejected
   because of an overlap with gene  b , but if the start codon is postponed
   by  a  positions, then this would be a valid gene.  The start position
   reported for this gene includes the delay.   c  is the length of the overlap
   region that caused the rejection and  d  is the score in this gene's frame
   on that overlap region.

Note that a gene marked as rejected may appear in this list.  This can
occur if the gene that caused the rejection was itself rejected.  The
actual algorithm to produce the list is as follows:

  Consider the genes in decreasing order by length.  If gene  x  is to
  be rejected because of an overlap with longer gene  y   that has not been
  rejected, then gene  x  is rejected and does not appear in the list.
  Otherwise, all notations for gene  x  that are not caused by rejected
  genes are reported.

I think a "delayed" gene might incorrectly be listed as causing a problem
by the part of it that was eliminated by the delay.  Probably the remaining
portion should be reinserted into the sorted list base on its now-shorter
length, and any notations caused by it should be re-checked to see if
they're affected by shortening the gene.  Let's save this for the next
version.


To compile the program:

      g++ orf-score.c -lm -o orf-score

  Uses include files  delcher.h  context.h  strarray.h  gene.h


To run the program:

  First run  build-imm  on a set of sequences to make the Markov models.

      build-imm train.seq

  This will produce a file  train.seq.deltas.context

  Then run  glimmer

      glimmer hflu.seq train.seq

  Options can be specified after the 2nd file name

      glimmer hflu.seq train.seq <options>

  Options are:
    -f     Use ribosome-binding energy to choose start codon
    +f     Use first codon in orf as start codon
    -g n   Set minimum gene length to n
    -o n   Set minimum overlap length to n.  Overlaps shorter than this
           are ignored.
    -p n   Set minimum overlap percentage to n%.  Overlaps shorter than
           this percentage of *both* strings are ignored.
    -r     Don't use independent probability score column
    +r     Use independent probability score column
    -s s   Use string s as the ribosome binding pattern to find start codons.
    -t n   Set threshold score for calling as gene to n.  If the in-frame
           score >= n, then the region is given a number and considered
           a potential gene.