Program glimmer takes two inputs: a sequence file (in FASTA format) and a collection of Markov models for genes as produced by the program build-imm . It outputs a list of all open reading frames (orfs) together with scores for each as a gene. The first few lines of output specify the settings of various parameter in the program: Minimum gene length is the length of the smallest fragment considered to be a gene. The length is measured from the first base of the start codon to the last base *before* the stop codon. This value can be specified when running the program with the -g option. Minimum overlap length is a lower bound on the number of bases overlap between 2 genes that is considered a problem. Overlaps shorter than this are ignored. Minimum overlap percent is another lower bound on the number of bases overlap that is considered a problem. Overlaps shorter than this percentage of *both* genes are ignored. Threshold score is the minimum in-frame score for a fragment to be considered a potential gene. Use independent scores indicates whether the last column that scores each fragment using independent base probabilities is present. Use first start codon indicates whether the first possible start codon is used or not. If not, the function Choose_Start is called to choose the start codon. Currently it computes hybridization energy between the string Ribosome_Pattern and the region in front of the start codon, and if this is above a threshold, that start site is chosen. The ribosome pattern string can be set by the -s option. Presumably function Choose_Start should be modified to do something cleverer. The next portion of the output is the result for each orf: Column 1 is an ID number for reference purposes. It is assigned sequentially starting with 1 to all orfs whose Gene Score is at least 90 . I'll make this a command-line option when I decide what letter to use. Column 2 is the reading frame of the orf. Three forward (F1, F2 and F3) and three reverse (R1, R2 and R3). These correspond with the headings for the scores in columns 9-14. Column 3 is the start position of the orf, i.e., the first base *after* the previous stop codon. Column 4 is the position of the first base of the first start codon in the orf. Currently I use atg, ctg, gtg and ttg as start codons. Column 5 is the position of the last base *before* the stop codon. Stop codons are taa, tag, and tga. Note that for orfs in the reverse reading frames have their start position higher than the end position. The order in which orfs are listed is in increasing order by Max {OrfStart, End}, i.e., the highest numbered position in the orf, except for orfs that "wrap around" the end of the sequence. Columns 6 and 7 are the lengths of the orf and gene, respectively, i.e., 1 + |OrfStart - End| and 1 + |GeneStart - End| . Column 8 is the score for the gene region. It is the probability (as a percent) that the Markov model in the correct frame generated this sequence. This value matches the value in the corresponding column of frame scores--an orf in reading frame R1 has a Gene Score equal to the value in the R1 column of frame scores for that orf. Columns 9-14 are the scores for the gene region in each of the 6 reading frames. It is the probability (as a percent) that the Markov model in that frame generated this sequence. Column 15 is the probability as a percent that the gene sequence was generated by a model of independent probabilities for each base, and represents to some extent the probability that the sequence is "random". When two genes with ID numbers overlap by at least a sufficient amount (as determined by Min_Olap and Min_Olap_Percent ), a line beginning with *** is printed and scores for the overlap region are printed. If the frame of the high score of the overlap region matches the frame of the longer gene, then a message is printed that the shorted gene is rejected. Otherwise, a message is printed that *both* genes are "suspect". A suspect or reject message for any gene is only printed once, however. A message is also printed if a gene with an ID number wholly contains another gene with an ID number. The longer "shadows" the shorter. At the end a list of "putative" gene positions is produced. The first column is the ID number, the second is the start position, the third is the end position. For "suspect" genes, a notation in [] 's follows: [Bad Olap a b c] means that gene number a overlapped this one and was shorter but scored higher on the overlap region. b is the length of the overlap region and c is the score of *this* gene on the overlap region. There should be a [Shorter ...] notation with gene a giving its score. [Shorter a b c] means that gene number a overlapped this one and was longer but scored lower on the overlap region. b is the length of the overlap region and c is the score of *this* gene on the overlap region. There should be a [Bad olap ...] notation with gene a giving its score. [Shadowed by a] means that this gene was completed contained as part of gene a 's region, but in another frame. [Delay by a b c d] means that this gene was tentatively rejected because of an overlap with gene b , but if the start codon is postponed by a positions, then this would be a valid gene. The start position reported for this gene includes the delay. c is the length of the overlap region that caused the rejection and d is the score in this gene's frame on that overlap region. Note that a gene marked as rejected may appear in this list. This can occur if the gene that caused the rejection was itself rejected. The actual algorithm to produce the list is as follows: Consider the genes in decreasing order by length. If gene x is to be rejected because of an overlap with longer gene y that has not been rejected, then gene x is rejected and does not appear in the list. Otherwise, all notations for gene x that are not caused by rejected genes are reported. I think a "delayed" gene might incorrectly be listed as causing a problem by the part of it that was eliminated by the delay. Probably the remaining portion should be reinserted into the sorted list base on its now-shorter length, and any notations caused by it should be re-checked to see if they're affected by shortening the gene. Let's save this for the next version. To compile the program: g++ orf-score.c -lm -o orf-score Uses include files delcher.h context.h strarray.h gene.h To run the program: First run build-imm on a set of sequences to make the Markov models. build-imm train.seq This will produce a file train.seq.deltas.context Then run glimmer glimmer hflu.seq train.seq Options can be specified after the 2nd file name glimmer hflu.seq train.seq Options are: -f Use ribosome-binding energy to choose start codon +f Use first codon in orf as start codon -g n Set minimum gene length to n -o n Set minimum overlap length to n. Overlaps shorter than this are ignored. -p n Set minimum overlap percentage to n%. Overlaps shorter than this percentage of *both* strings are ignored. -r Don't use independent probability score column +r Use independent probability score column -s s Use string s as the ribosome binding pattern to find start codons. -t n Set threshold score for calling as gene to n. If the in-frame score >= n, then the region is given a number and considered a potential gene.