Further details on the contig construction/Disclaimer.

Because our main aim was to identify candidate genes involved in pycnodysostosis, strong emphasis was placed on extending the sequences of the contigs as far as possible, since this could provide essential extra information. Therefore, it was necessary to use all available sequences, even if they were probably not completely accurate. In addition, both the 5' and the 3' ends of the EST clones have been used, even if the 5' and 3' sequences of the final contig could not be made to overlap. In such cases, formal proof that these sequences really belong to the same gene/cDNA is lacking, although in many cases several ESTs spanned the same gap in the consensus sequence, making a chimeric contig extremely unlikely. Additional gaps may occur in the 5' portion of the contig as a result of widely differing insert lengths, which leave gaps between the various 5' sequences, and in the 3' portion as a result of alternative polyadenylation sites. In some cases, relatively short overlaps between ESTs have been used if sufficiently convincing. In other cases, the contig is entirely dependent on a single EST (being the only one that crosses a certain gap), making the composition and the sequence of the contig very sensitive to any type of error that might have been made during the entire process. For instance, I have encountered several cases in which a 5' sequence such as ym34a02.r1 turned out to belong with the 3' sequence ym35a02.s1 (and vice versa). In short: although we feel that most contigs give a reasonably accurate view of the actual cDNA sequence, there is every reason to regard them as no more than helpful intermediates on the road toward a more definitive, full-length sequence. The fact that the great majority of the sequences have been determined on a single strand only is just one consideration that should make one extremely cautious, especially in those instances in which the open reading frame cannot be unequivocally identified.
Several other possible sources of errors are discussed in the following paragraphs.

EST sequences are single-pass sequences that have been read automatically, which means that they are likely to contain errors, especially toward their ends and in GC-rich regions. Most of the errors turn out to involve additional bases, especially Gs and Ts, in runs of these bases. This is most clearly the case for the older EST sequences of the WashU/Merck project (codes like ym23a04 or zc34g05, with accession numbers starting with T, R, H or N). The AA sequences, especially the murine ones, are much better. Multiple errors toward the end of a sequence prevent automatic merging of overlapping sequences, which will lead to multiple contigs for the same cDNA if such methods are employed without human oversight. It is especially this problem that still gives "manual" contig building (using e.g. FastA to test low-scoring Blast matches that might represent poorly overlapping ends) an advantage over completely automated groupings of ESTs. Because the extra G/T sequencing errors are almost systematic, the correct sequence in areas of conflict can often be recognized, and this correction process may lead to the identification of additional overlapping sequences.
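The idea that conflicts caused by extra Gs or Ts in runs can be recognized as such may be illustrated with a small Python sketch. This is not the procedure used for these contigs, which was manual; the function names and the choice to collapse every G or T run to a single base are assumptions made purely for illustration.

```python
import re

def collapse_gt_runs(seq: str) -> str:
    """Collapse every run of Gs or Ts to a single base.

    Illustrative heuristic only: it throws away run-length information,
    so it can flag a conflict as 'run-length only' but cannot say which
    read has the correct length.
    """
    return re.sub(r"([GT])\1+", r"\1", seq.upper())

def differ_only_in_gt_runs(read_a: str, read_b: str) -> bool:
    """True if two overlapping read segments disagree, but agree once
    G/T run lengths are ignored."""
    a, b = read_a.upper(), read_b.upper()
    return a != b and collapse_gt_runs(a) == collapse_gt_runs(b)

# Hypothetical overlapping segments: the first carries one extra G.
print(differ_only_in_gt_runs("ACGGGTTA", "ACGGTTA"))  # run-length conflict
print(differ_only_in_gt_runs("ACGGTTA", "ACCGTTA"))   # genuine base difference
```

A conflict flagged in this way would still need to be resolved by eye, for example by checking which run length is supported by the majority of overlapping reads.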

The additional (incorrect) nucleotides at the 3' ends of ESTs will lead to frameshifts in regions that are covered by only one or two ESTs, because these portions of the sequence cannot always be corrected. The possible coding character of a sequence may be obscured in this way. Fortunately, with the increasing number of mouse EST sequences, it is now possible to determine whether or not a human sequence encodes a protein and, if so, which of the reading frames is used. This is decided by judging the pattern of mismatches between the human and the murine sequence. In general, most of the differences between coding sequences from different species will occur at the third position of codons, thereby leaving the encoded amino acid sequence intact. Most of the frameshifts in the region of overlap between the human and murine EST sequences can therefore be detected by checking whether these third-position mismatches remain in the same register in both sequences. However, the exact position of the mistake can be more difficult to determine, especially when both the human and the murine sequences contain the same error in a region that is completely conserved. Nevertheless, comparison of human and murine sequences has corrected a large number of frameshift errors in the human sequences and has helped enormously to detect protein-coding sequences. It should be realized, however, that only portions of the human consensus sequences were covered by murine ESTs and could be checked in this way. Other portions will probably still contain frameshifts. This is also the reason why the files containing the protein sequences are often derived from only part of the consensus sequence or are even completely absent. In these cases, the sequence of the other regions or of the entire contig was not considered to be of sufficient quality to be used for the derivation of the tentative protein sequences.
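The register check described above can be sketched in Python. The sketch assumes that a gapped pairwise alignment of the human and murine sequences is already available as two equal-length strings with "-" for gaps; the function names, the window size and the example sequences are hypothetical, and the actual assessment for these contigs was made by eye.

```python
from collections import Counter

def mismatch_phases(human_aln: str, mouse_aln: str):
    """Return (human_position, human_position % 3) for every mismatch column.

    Positions are counted along the ungapped human sequence, so an extra
    base in the human read shifts the phase of all later mismatches.
    """
    if len(human_aln) != len(mouse_aln):
        raise ValueError("aligned sequences must have equal length")
    result = []
    pos = 0
    for h, m in zip(human_aln.upper(), mouse_aln.upper()):
        if h == "-":
            continue  # gap in human: no human coordinate is consumed
        if m != "-" and h != m:
            result.append((pos, pos % 3))
        pos += 1
    return result

def dominant_phase_per_window(human_aln: str, mouse_aln: str, window: int = 30):
    """Most common mismatch phase in each window of human coordinates.

    A change of the dominant phase between neighbouring windows suggests
    a frameshift somewhere near the boundary.
    """
    by_window = {}
    for pos, phase in mismatch_phases(human_aln, mouse_aln):
        by_window.setdefault(pos // window, Counter())[phase] += 1
    return {w: c.most_common(1)[0][0] for w, c in sorted(by_window.items())}

# Hypothetical example: the human read has one extra G after position 5.
human = "ATGGCCGAAAGGG"
mouse = "ATGGCT-AAGGGA"
print(mismatch_phases(human, mouse))
print(dominant_phase_per_window(human, mouse, window=6))
```

In the example, the mismatch before the extra base falls in phase 2 and the mismatches after it fall in phase 0, which is the kind of register change that points to a frameshift. As noted in the text, this locates the error only approximately.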

Another common problem encountered during the analysis was the presence of alternatively spliced or unspliced forms of some cDNAs. It was sometimes difficult to determine which of the two situations applied. Making the correct assessment was quite important, because it made a considerable difference in those cases in which a portion of the cDNA contig ended with such an EST. Alternative splicing involves additional exons of the same cDNA, which might again lead to the identification of additional overlapping ESTs. The presence of an intron in the cDNA sequence usually not only made it impossible to extend the sequence but, more importantly, usually disrupted the reading frame because of stop codons in all three frames. Still, instances of apparently unspliced but polyadenylated cDNAs, which contained what seemed to be an intron bordered by consensus splice donor or acceptor sites, were surprisingly common in some genes/cDNAs, suggesting that such an intron might have some biological significance. Perhaps not surprisingly, several of these introns were present in the 5' UTR of an mRNA, where they would not interrupt the reading frame.
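The observation that a retained intron tends to carry stop codons in all three frames can be checked with a few lines of Python. This is a minimal sketch that looks at the forward strand only and uses the standard stop codons; the function names and the example sequences are hypothetical, and a positive result is merely consistent with an intron, not proof of one.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def stop_positions(seq: str, frame: int):
    """Start positions of stop codons in one forward reading frame (0, 1 or 2)."""
    seq = seq.upper()
    return [i for i in range(frame, len(seq) - 2, 3) if seq[i:i + 3] in STOP_CODONS]

def stops_in_all_frames(seq: str) -> bool:
    """True if every forward reading frame contains at least one stop codon."""
    return all(stop_positions(seq, frame) for frame in range(3))

# Hypothetical segments: the first is closed in all three frames,
# the second is open in every frame.
print(stops_in_all_frames("ATGTAAATAGATGA"))
print(stops_in_all_frames("ATGGCCAAAGGG"))
```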

Alternative polyadenylation sites were also frequently observed, but these pose less of a problem: unlike putative introns, they do not require difficult decisions, since both variants should be incorporated in the cDNA sequence. Nevertheless, the distance between the polyA sites may become so large that none of the ESTs cover the distance between them, leading to separate cDNA sequences for clones that are derived from the same gene.

Of course, there are also administrative errors that may have led to mistakes in the consensus sequences. Some of the more troubling errors involve different names (usually differing by a single character) for the 5' and 3' ends of the cDNA inserts or, much worse, the same name for different cDNAs. The latter situation may result in the creation of chimeric consensus sequences. Finally, the orientations of the inserts were not always compatible. Sometimes this was the result of the presence of two different genes with overlapping 3' ends, but in other cases it should probably be attributed to priming of the oligo(dT) primer on the wrong strand (following first-strand synthesis) or again to some administrative error.
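Naming errors of the first kind, where the 5' and 3' reads of one insert carry names that differ by a single character, can be searched for mechanically. The Python sketch below assumes that read names have the form clone.suffix, with a suffix starting with "r" for 5' reads and "s" for 3' reads, as in the ym34a02.r1 / ym35a02.s1 example; treating that convention as general is an assumption, and the function names are hypothetical. The result is only a list of candidates that would still have to be confirmed by sequence overlap.

```python
from itertools import product

def split_read(name: str):
    """Split 'ym34a02.r1' into ('ym34a02', 'r1')."""
    clone, _, suffix = name.rpartition(".")
    return clone, suffix

def one_char_apart(a: str, b: str) -> bool:
    """True if two names have equal length and differ at exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def near_miss_pairs(read_names):
    """Pair 5' clones lacking a 3' read with 3' clones lacking a 5' read
    whenever their names differ by a single character."""
    parsed = [split_read(n) for n in read_names]
    five = {clone for clone, suffix in parsed if suffix.startswith("r")}
    three = {clone for clone, suffix in parsed if suffix.startswith("s")}
    unpaired_five = five - three
    unpaired_three = three - five
    return sorted(
        (f, t) for f, t in product(unpaired_five, unpaired_three)
        if one_char_apart(f, t)
    )

reads = ["ym34a02.r1", "ym35a02.s1", "zc34g05.r1", "zc34g05.s1"]
print(near_miss_pairs(reads))
```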

Last but not least, the fact that I often had only a schematic mental picture of the gene/cDNA under analysis, in combination with the large number of genes, has probably resulted in errors that are entirely attributable to deletions, insertions and crossing-overs that have occurred in my brain. I really hope that you will look at the sequences with a healthy amount of skepticism and report likely errors to me, so that they may be corrected.

The existence of two or more RH markers covering the same gene and the fact that their mapping positions are sometimes quite different prove two things: first, that the RH map is approaching a high density (although many genes that have been mapped to 1q21 by other means are not yet represented on the RH map) and second, that the resolution of RH mapping is not as accurate as is suggested by the precision of its coordinates. Since many of the markers that cover the same gene are separated by markers from several other genes, alternative possibilities such as breakage suppression near the centromere (which could result in inflated RH distances), or large introns can be excluded. It is in general more likely that the differences have been caused by false positive or negative results in the radiation hybrid mapping of individual markers.