William C. Ray, Robert S. Munson Jr. and Charles J. Daniels
Children's Research Institute and The Department of Pediatrics,
The Ohio State University, 700 Childrens Dr. Columbus, OH 43205 USA,
and The Department of Microbiology, The Ohio State University,
484 West 12th Ave. Columbus, OH 43210 USA
Motivation:
The process of determining the functional sequence content
of an organism is confounded by several factors. Large protein coding sequences
are relatively easy to find by statistical methods. Smaller proteins, however,
may escape detection because their size falls below some arbitrary researcher-defined
minimum cutoff, or because a promoter or translational start cannot be precisely
defined. Promoter and regulatory sequences themselves are difficult to define
due to a significant amount of allowable sequence variation, as well as a probable
lack of any completely accurate whole-organismal gene catalogs to date. Finally,
certain genes coding functional RNAs may have insufficient structural or sequence
constraints to be detectable by normal sequence structure/pattern searching
methods. Where multiple closely related organisms have been sequenced,
additional information is available for the investigation of sequence
content: functional sequences may be conserved between the
organisms.
We present a method that uses this conservation to detect genes and other potentially functional sequences that may be missed by standard ORF-calling, RNA-finding, and pattern-matching software. The tricross programs produce a multi-way cross comparison of three sets of sequences, determine which are conserved in all three sets, and produce a graphical (VRML) representation as well as alignments of all sequence triples found. The software can also be applied to a pair of sequence sets, though the noise in the results increases.
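The three-way cross comparison can be pictured with a small sketch. This is not the tricross code (which is Perl and drives FASTA and ClustalW); it is a minimal Python illustration that assumes the pairwise similarity hits between the three sets have already been computed and thresholded elsewhere. The function name `conserved_triples`, the input layout, and the sequence ids are all hypothetical.

```python
from collections import defaultdict


def conserved_triples(hits_ab, hits_ac, hits_bc):
    """Return (a, b, c) triples in which every pair is a reported hit.

    hits_ab holds (a, b) id pairs, hits_ac holds (a, c) pairs and
    hits_bc holds (b, c) pairs. How hits are scored and filtered is
    outside the scope of this sketch.
    """
    # Index the hits of set A against sets B and C by the A-side id.
    b_by_a = defaultdict(set)
    for a, b in hits_ab:
        b_by_a[a].add(b)

    c_by_a = defaultdict(set)
    for a, c in hits_ac:
        c_by_a[a].add(c)

    bc = set(hits_bc)

    # A triple is kept only if the B and C partners of `a` also hit each other.
    triples = []
    for a in sorted(b_by_a):
        for b in sorted(b_by_a[a]):
            for c in sorted(c_by_a.get(a, ())):
                if (b, c) in bc:
                    triples.append((a, b, c))
    return triples


# Made-up ids: only pf1/ph7/pa3 are mutually similar in all three comparisons.
example_ab = [("pf1", "ph7"), ("pf2", "ph9")]
example_ac = [("pf1", "pa3"), ("pf2", "pa4")]
example_bc = [("ph7", "pa3")]
print(conserved_triples(example_ab, example_ac, example_bc))
```

Requiring the third comparison to agree is what distinguishes a triple from two unrelated pairwise matches, which is one way to read the remark that results are noisier when only two sequence sets are used.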
Results:
Tricross has been used to examine the intergenic-sequence
content of the three archaeal Pyrococcus genomes to determine the most
highly related sequences remaining between the annotated protein and RNA coding
sequences. Set to relatively stringent similarity requirements for the search,
tricross found 101 intergenic sequences conserved
between the three organisms. Interestingly, 29 of these appear to contain members
of a family of small RNA molecules only recently discovered in the Archaea.
While some of the remaining 72 appear to be individual highly conserved promoter
sequences, others have no currently known biological significance. Although
tricross was originally developed to facilitate the examination of intergenic
sequences, none of its logic is inherently specific to them.
The software can also be applied to gene sequences,
and has been used to produce inter-genomic gene order dot-plots for Haemophilus
influenzae vs H. ducreyi (unpublished data), and Neisseria meningitidis
Z2491 (serogroup A) vs N. meningitidis MC58 (serogroup B) vs N. gonorrhoeae.
The research described in the abstract and results contained on this page have been submitted to Bioinformatics for publication.
The tricross software is available as a .tar.gz file from this link (minor FASTA format bugfix version, 20021203). Local installations of FASTA, ClustalW, and potentially some Perl modules will be required to support the tricross software. The code is currently in stable beta condition, meaning that it executes and produces correct output on the genome sets on which I have tried it, but much of the configuration is still done by the clunky method of editing constants directly in the source. Look to the top of tri2analyzenew.pl and tricross.pl (as well as in the README) for information on where changes need to be made for your site. The archive as delivered is configured to analyze the intergenic regions of the three Pyrococcus genomes, and includes example data files to support this analysis.
Data described in the Bioinformatics submission is available through the links provided below.
Tricross detects and indicates multi-way relationships between sequences in multiple organisms. Applied to the intergenic regions of the three sequenced Pyrococcus genomes, out of roughly 6 billion possible triples of sequences, tricross finds 134 that appear to be conserved between all three organisms.
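For a sense of scale, the number of candidate triples is the product of the three set sizes, whereas the number of pairwise comparisons grows only with the products of two sizes at a time. The sketch below uses a hypothetical round figure of 1,800 sequences per genome, chosen only because it lands near the "roughly 6 billion" range; the actual intergenic-region counts are not stated on this page.

```python
def triple_search_space(n_a, n_b, n_c):
    """Number of (a, b, c) combinations across three sequence sets."""
    return n_a * n_b * n_c


def pairwise_comparisons(n_a, n_b, n_c):
    """Number of pairwise comparisons covering all three pairs of sets."""
    return n_a * n_b + n_a * n_c + n_b * n_c


# Hypothetical set size, NOT taken from the Pyrococcus data.
n = 1800
print(triple_search_space(n, n, n))   # 5832000000
print(pairwise_comparisons(n, n, n))  # 9720000
```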
List of all tricross-discovered 5- and 6-way sequence neighbor triples in the P. furiosus, P. horikoshii and P. abyssi archaeal genomes.
VRML world graph of the conserved sequences in sequence-id space. Each sphere links to the Clustal-W alignment of the sequence participants.
Macintosh and PC versions of the Cosmo VRML viewer are linked for download from my software page.
Many of the sequences discovered fall into apparent sequence family groups, but other than the occurrence of certain highly conserved promoter sequences, only one family is of currently known biological significance. The Archaea appear to display similarities to Eucaryotes not only in their transcriptional machinery, but also in the maintenance of a stable-RNA modification scheme facilitated by small RNAs known as snoRNAs. Until recently, the presence of snoRNA homologs in the Archaea was unknown, as snoRNAs contain insufficient conserved sequence content to be discovered by traditional sequence pattern finding methods. Tricross demonstrates that functional sequences such as these may be detected by use of the potentially significant conservation among a group of closely related organisms. In the set of conserved sequences detected by tricross, 30 appear to be Archaeal snoRNAs.
List of tricross-discovered conserved C/D-Box snoRNA-like sequences in the triples found above. The sequences have been trimmed to just the snoRNA sequence, and aligned on the C/D'/C'/D box regions.