Tricross: Using dot-plots in sequence-id space to detect uncataloged intergenic features.

William C. Ray, Robert S. Munson Jr. and Charles J. Daniels

Children's Research Institute and The Department of Pediatrics,
The Ohio State University, 700 Childrens Dr. Columbus, OH 43205 USA,
and The Department of Microbiology, The Ohio State University,
484 West 12th Ave. Columbus, OH 43210 USA

Abstract:

Motivation:
Several factors confound the process of determining the functional sequence content of an organism. Large protein-coding sequences are relatively easy to find by statistical methods. Smaller proteins, however, may escape detection because their size falls below some arbitrary researcher-defined minimum cutoff, or because a promoter or translational start cannot be precisely defined. Promoter and regulatory sequences themselves are difficult to define because they tolerate a significant amount of sequence variation, and because there is probably no completely accurate whole-organism gene catalog to date. Finally, certain genes coding for functional RNAs may have insufficient structural or sequence constraints to be detectable by normal sequence structure/pattern searching methods. Where multiple closely related organisms have been sequenced, additional information is available for investigating sequence content: functional sequences may be conserved between the organisms.

We present a method that uses this conservation to detect genes and other potentially functional sequences that may be missed by standard ORF-calling, RNA-finding, and pattern-matching software. The tricross programs produce a multi-way cross comparison of three sets of sequences, determine which sequences are conserved in all three sets, and produce a graphical (VRML) representation as well as alignments of all sequence triples found. The software can also be applied to a pair of sequence sets, though the noise in the results increases.
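The three-way conservation test described above can be illustrated with a minimal Python sketch. This is not the tricross implementation (the distributed programs are Perl scripts that rely on FASTA and ClustalW). It assumes the pairwise similarity searches have already been run and reduced to sets of sequence-id pairs. The function name `conserved_triples` and the example ids are hypothetical.

```python
from collections import defaultdict

def conserved_triples(hits_ab, hits_bc, hits_ac):
    """Return sorted (a, b, c) triples supported by all three pairwise hit sets.

    Each argument is a set of (id, id) pairs, e.g. hits_ab holds (a, b) pairs.
    """
    # Index B -> C partners so each A-B hit can be extended cheaply.
    c_partners = defaultdict(set)
    for b, c in hits_bc:
        c_partners[b].add(c)

    triples = []
    for a, b in hits_ab:
        for c in c_partners.get(b, ()):
            # Keep the triple only if the A-C comparison closes the triangle.
            if (a, c) in hits_ac:
                triples.append((a, b, c))
    return sorted(triples)

# Hypothetical intergenic-region ids for three genomes A, B and C.
hits_ab = {("A1", "B7"), ("A2", "B3")}
hits_bc = {("B7", "C4"), ("B3", "C9")}
hits_ac = {("A1", "C4")}

triples = conserved_triples(hits_ab, hits_bc, hits_ac)
print(triples)
```

Here only A1/B7/C4 survives, because the A2/B3/C9 chain lacks a supporting A-C hit. Dropping the A-C requirement gives the noisier two-set behaviour mentioned above.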

Results:
Tricross has been used to examine the intergenic-sequence content of the three archaeal Pyrococcus genomes to determine the most highly related sequences remaining between the annotated protein and RNA coding sequences. Set to relatively stringent similarity requirements for the search, tricross found 101 intergenic sequences conserved among the three organisms. Interestingly, 29 of these appear to contain members of a family of small RNA molecules only recently discovered in the Archaea. While some of the remaining 72 appear to be individual highly conserved promoter sequences, others have no currently known biological significance. Although tricross was originally developed to facilitate the examination of intergenic sequences, none of its logic is inherently specific to them. The software can also be applied to gene sequences, and has been used to produce inter-genomic gene order dot-plots for Haemophilus influenzae vs H. ducreyi (unpublished data), and Neisseria meningitidis Z2491 (serogroup A) vs N. meningitidis Z58 (serogroup B) vs N. gonorrhoeae.

Availability:

The research described in the abstract, and the results contained on this page, have been submitted to Bioinformatics for publication.

The tricross software is available as a .tar.gz file from this link (minor FASTA format bugfix version, 20021203). Local installations of FASTA, ClustalW, and potentially some Perl modules will be required to support the tricross software. The code is currently in stable Beta condition, meaning that it executes and produces correct output on genome sets for which I have tried it, but much of the configuration is still done by the clunky method of editing constants directly in the source. Look to the top of tri2analyzenew.pl and tricross.pl (as well as in the README) for information on where changes need to be made for your site. The archive as delivered is configured to analyze the intergenic regions of the three Pyrococcus genomes, and includes example data files to support this analysis.

Data described in the Bioinformatics submission is available through the links provided below.

Data:

Tricross detects and indicates multi-way relationships between sequences in multiple organisms. Applied to the intergenic regions of the three sequenced Pyrococcus genomes, out of roughly 6 billion possible triples of sequences, tricross finds 134 that appear to be conserved among all three organisms.
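The size of that search space follows from simple multiplication: every choice of one sequence from each genome is a candidate triple. The short sketch below uses a hypothetical count of 1,800 intergenic regions per genome. That figure is an assumption, chosen only because it lands near "roughly 6 billion"; the actual per-genome counts are not given on this page.

```python
def candidate_triples(n_a, n_b, n_c):
    """Number of ways to pick one sequence from each of three sets."""
    return n_a * n_b * n_c

# Hypothetical, equal-sized sets; the real per-genome counts are not stated here.
total = candidate_triples(1800, 1800, 1800)
print(f"{total:,}")  # 5,832,000,000
```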

List of all tricross-discovered 5- and 6-way sequence neighbor triples in the P. furiosus, P. horikoshii and P. abyssi archaeal genomes.

VRML world graph of the conserved sequences in sequence-id space. Each sphere links to the ClustalW alignment of the participating sequences.
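For readers unfamiliar with VRML, the following Python sketch shows one way such a world could be written: each conserved triple becomes a sphere whose coordinates are the three sequence ids, wrapped in an Anchor node so that clicking it opens an alignment page. This is an illustration only, not the code tricross uses. The function name `vrml_world`, the sphere radius and colour, the example coordinates, and the alignment file names are all hypothetical.

```python
def vrml_world(triples, radius=0.5):
    """Build VRML97 text with one clickable sphere per (x, y, z, url) entry."""
    parts = ["#VRML V2.0 utf8"]
    for x, y, z, url in triples:
        parts.append(
            "Anchor {\n"
            f'  url "{url}"\n'
            "  children [\n"
            "    Transform {\n"
            # The sequence ids in the three genomes serve as x, y, z coordinates.
            f"      translation {x} {y} {z}\n"
            "      children [\n"
            "        Shape {\n"
            "          appearance Appearance { material Material { diffuseColor 0.8 0.2 0.2 } }\n"
            f"          geometry Sphere {{ radius {radius} }}\n"
            "        }\n"
            "      ]\n"
            "    }\n"
            "  ]\n"
            "}"
        )
    return "\n".join(parts) + "\n"

# Hypothetical triples: (id in genome 1, id in genome 2, id in genome 3, alignment page).
world = vrml_world([(12, 40, 7, "align_0001.html"), (13, 41, 8, "align_0002.html")])
print(world)
```

Because the axes are sequence ids rather than base positions, runs of neighbouring spheres correspond to conserved sequences that keep the same order in all three genomes.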

Macintosh and PC versions of the Cosmo VRML viewer are linked for download from my software page.

Many of the sequences discovered fall into apparent sequence family groups, but other than the occurrence of certain highly conserved promoter sequences, only one family is of currently known biological significance. The Archaea appear to display similarities to Eucaryotes not only in their transcriptional machinery, but also in the maintenance of a stable-RNA modification scheme facilitated by small RNAs known as snoRNAs. Until recently, the presence of snoRNA homologs in the Archaea was unknown, as snoRNAs contain insufficient conserved sequence content to be discovered by traditional sequence pattern finding methods. Tricross demonstrates that functional sequences such as these may be detected by use of the potentially significant conservation amongst a group of closely related organisms. In the set of conserved sequences detected by tricross, 30 appear to be Archaeal snoRNAs.

List of tricross-discovered conserved C/D-Box snoRNA-like sequences in the triples found above. The sequences have been trimmed to just the snoRNA sequence, and aligned on the C/D'/C'/D box regions.