
This WWW page is a spin-off of a screen for genes that might be involved in the bone disease pycnodysostosis. All ESTs (expressed sequence tags) that had been mapped to the most centromere-proximal region of chromosome 1q in release 11 of the radiation hybrid (RH) map of the Whitehead Institute were used for the identification of genes (follow this link to the WI RH map). The borders of the screen were the centromere and WI-11851. The RH map was chosen as the starting point because it has the highest density of actual genes (as opposed to anonymous polymorphic markers) and a relatively high resolution. The following procedure was used to convert an EST-based STS (sequence-tagged site) to a cDNA sequence and, if possible, to a (partial) protein sequence:
The results of this effort are given below as a service to the research community. A short description of each sequence is provided and the consensus sequences are made available. Wherever possible, the partial protein sequence and the murine cDNA and protein sequences have also been determined and can be accessed as well. It should be noted that EST contigs have also been built (on a much larger and more automated scale) for the Unigene collection and for the SCIENCE96 human genetic map. However, these efforts are at present mainly aimed at grouping sequences together, not at providing a consensus sequence that has been filtered for errors. The sequences presented here are, in general, longer and more accurate than the EST clusters identified in Unigene. For instance, different transcription units that overlap at their 3' ends, which occurs several times among the investigated transcription units, are usually grouped together in the Unigene collection. In addition, we have performed an analysis of the encoded protein whenever possible.
A drawback of the present page is that it is not fully cross-referenced, for instance with respect to the ESTs that have been used to construct a contig. This turned out to be too much work for something that was not the main goal of our study. Users are therefore urged to use the consensus sequences as queries to screen the databases of interest themselves. Given the very rapid growth of dbEST, this will be necessary in any case, because novel sequences that correct or significantly extend the present consensus sequence may appear any day. We will try to improve the page in the coming months and are open to suggestions.
It turns out that the gene involved in pycnodysostosis is the cathepsin K gene, which encodes a protease involved in bone remodelling (Gelb et al., Science 273, 1236-1238 (1996); Johnson et al., Genome Research 6, 1051-1055 (1996)). The gene is indeed present in this interval and had been RH mapped.
Some practical remarks about the sequence-files:
All files are in the GCG-format (5 blocks of 10 nucleotides in a row). For most purposes, the most important feature of this format is that information about the construction history of the file is contained WITHIN the sequence. For users working with other programs that do not contain a "GCG->other program" conversion routine, this may be very inconvenient, because the intervening text has to be removed manually using a text editor. In the future, we will try to remove this information from the files.
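For users without a conversion routine, the manual clean-up described above could be automated along the following lines. This is only a minimal sketch in Python, not a tool supplied with this page. It assumes that every genuine sequence line starts with a position number followed by up to five blocks of at most 10 nucleotides, and that the intervening text never has that shape. The function name strip_gcg_text is hypothetical, and ambiguity codes other than N are not handled.

```python
import re

# Assumed shape of a sequence line: a position number, then one to five
# whitespace-separated blocks of at most 10 nucleotide letters (A, C, G, T, U, N).
SEQ_LINE = re.compile(
    r"^\s*\d+\s+(?:[ACGTUN]{1,10}\s+){0,4}[ACGTUN]{1,10}\s*$",
    re.IGNORECASE,
)

def strip_gcg_text(text):
    """Return the bare nucleotide string, dropping all non-sequence lines."""
    residues = []
    for line in text.splitlines():
        if SEQ_LINE.match(line):
            # Remove the position number and the spaces between the blocks.
            residues.append(re.sub(r"[^A-Za-z]", "", line))
    return "".join(residues).upper()

# Invented example text; the annotation lines are placeholders.
example = """Contig built from overlapping ESTs
       1  ACGTACGTAC GGTTAACCGG TTAACCGGTT AACCGGTTAA CCGGTTAACC
Edited after comparison with the murine consensus
      51  ACGTN
"""
sequence = strip_gcg_text(example)
```

It is advisable to compare the length of the result with the length stated in the file, because an annotation line that happens to look like a sequence line would be kept by mistake.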
ALL markers have been hypertext-linked to a file with the nucleotide sequence of the cDNA with the extension ".seq". This was done even for those markers for which the full-length cDNA sequence was already available in the database. In that case, a copy of the GenBank database entry has simply been provided, which in a few cases has been extended with additional sequence information derived from the ESTs. In all other cases, the contents of the sequence-file have been based on overlapping EST sequences. The cDNA represents the consensus sequence based on all ESTs, but in some cases it has been edited for various reasons, e.g. because a comparison to the murine consensus sequence revealed a likely frame-shift in the human sequence. The name of the cDNA-file is nearly always identical to that of the marker, with the hyphen in the marker name omitted. There are two exceptions to this rule. Firstly, when two or more markers cover the same cDNA, the name of only one of the markers has been chosen to represent the entire set. Secondly, the file names of the TIGR-markers would exceed 8 characters, which is inconvenient for people who make use of DOS-programs. TIGR-A002G29 has therefore been abbreviated to T002G29, for instance.
When intron sequences are available (some ESTs seem to be derived from genomic fragments or incompletely spliced mRNAs), these have been included in a separate ".seq"-file. These "genomic" files carry the same name as the cDNA-files followed by a "g", although the name of the marker sometimes had to be abbreviated, again because of the DOS 8-character-limit. The protein sequences also carry the marker name, and are followed by the extension ".pep". Only protein sequences that have a reasonable chance of being correct have been included. In many cases, there was an unacceptable risk of multiple frame-shifts in the sequence. Despite this restriction, the ends of incomplete protein sequences may sometimes be in error.
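The naming conventions for the cDNA, genomic and protein files can be summarised in a short Python sketch. The helper names (cdna_stem, candidate_filenames) are hypothetical. The TIGR rule is generalised from the single example TIGR-A002G29 -> T002G29, which is an assumption. The sketch does not guess how a genomic name was abbreviated when it would exceed the DOS limit, and it cannot know which marker name was chosen when several markers cover the same cDNA.

```python
DOS_STEM_LIMIT = 8  # maximum length of a DOS file name without its extension

def cdna_stem(marker):
    """Return the file name stem for a marker, following the rules above."""
    if marker.startswith("TIGR-A"):
        # Assumption: every TIGR marker is shortened like TIGR-A002G29 -> T002G29.
        return "T" + marker[len("TIGR-A"):]
    # All other markers: simply omit the hyphen.
    return marker.replace("-", "")

def candidate_filenames(marker):
    """Return the expected cDNA, genomic and protein file names for a marker.

    The genomic entry is None when stem + "g" exceeds the DOS limit, because
    the abbreviation used in that case cannot be derived from the marker name.
    """
    stem = cdna_stem(marker)
    genomic_stem = stem + "g"
    return {
        "cdna": stem + ".seq",
        "genomic": genomic_stem + ".seq" if len(genomic_stem) <= DOS_STEM_LIMIT else None,
        "protein": stem + ".pep",
    }

names = candidate_filenames("TIGR-A002G29")
```

A file name produced this way is only a candidate: genomic and protein files exist only for some markers.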
The partial murine cDNA sequences are also available, provided that murine ESTs highly similar to the human cDNA-contig were present in the database. The file-names of the murine cDNAs start with "mus" or "m", followed by the number-part of the name of the marker. So far, this has not led to confusion. All numbers between 10000 and 20000 are derived from WI-markers, all numbers between 30000 and 40000 are derived from SGC-markers. For numbers below 10000, the human marker may be derived from either the WI, IB, or NIB-series. All files starting with "00" are based on TIGR-markers.
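The number ranges above can be used to trace a murine file back to the series of its human marker. The following Python sketch illustrates this. The function name marker_series and the example file names are hypothetical, the range borders are treated as inclusive (an assumption), and "starting with 00" is read as referring to the number-part that follows "mus" or "m".

```python
import re

def marker_series(murine_filename):
    """Return the possible human marker series for a murine cDNA file name."""
    stem = murine_filename.rsplit(".", 1)[0]
    # "mus" or "m", followed by a number-part that starts with a digit.
    match = re.fullmatch(r"(?:mus|m)(\d\w*)", stem)
    if match is None:
        raise ValueError("not a murine cDNA file name: " + murine_filename)
    number_part = match.group(1)
    if number_part.startswith("00"):
        return ["TIGR"]
    value = int(re.match(r"\d+", number_part).group())
    if value < 10000:
        return ["WI", "IB", "NIB"]  # ambiguous below 10000
    if 10000 <= value <= 20000:
        return ["WI"]
    if 30000 <= value <= 40000:
        return ["SGC"]
    return []  # outside every range mentioned in the text

series = marker_series("mus11851.seq")
```

For numbers below 10000 the result remains ambiguous, so the marker list itself has to be consulted in that case.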
Because only the 5' ends of the murine ESTs have been determined, most murine cDNA-files contain a single contiguous sequence. In the cases in which two contigs are included in a single file, this is based on a clear similarity of both sequences to the SAME human cDNA sequence. For the moment, there is therefore no formal proof that the two parts are really derived from the same cDNA. They might also originate from two highly related, yet distinct, cDNAs. The murine genomic and protein files follow the same conventions as their human counterparts. The same applies to the cDNA- and protein-files from other species. In general, these files are based on one or two ESTs only and are therefore more sensitive to errors.
To help interested researchers estimate how long ago a file was updated, we have added both the date of the last revision or check and the most recent EST from which sequence-information has been incorporated in the contig (this does not necessarily mean that the contig was changed by this addition). For contigs that apparently represent rare transcripts, this most recent EST may already be quite old. The "age" of most ESTs can be derived from the letter in their accession number. The order is T, R, H, N, W, AA going from the oldest to the most recent ESTs. Knowing the latest EST that has been added should help people focus on the more recent matches whenever they run a sequence against the EST database.
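The accession-letter order can be used to filter database hits down to those that may be newer than the latest EST in a contig. The Python sketch below is one way to do this. The function names and accession numbers are invented for illustration. Hits with the same prefix as the latest EST are kept, because the order given above says nothing about ages within one letter. Accessions with a prefix outside the list are dropped here and should be checked by hand.

```python
import re

# Oldest to most recent, as stated in the text.
AGE_ORDER = ["T", "R", "H", "N", "W", "AA"]

def age_rank(accession):
    """Return the position of the accession prefix in AGE_ORDER, or None."""
    match = re.match(r"[A-Za-z]+", accession)
    if match is None:
        return None
    prefix = match.group().upper()
    return AGE_ORDER.index(prefix) if prefix in AGE_ORDER else None

def possibly_newer(hits, latest_in_contig):
    """Keep the hits whose prefix is as recent as, or more recent than, the latest EST."""
    threshold = age_rank(latest_in_contig)
    if threshold is None:
        raise ValueError("unknown accession prefix: " + latest_in_contig)
    kept = []
    for hit in hits:
        rank = age_rank(hit)
        if rank is not None and rank >= threshold and hit != latest_in_contig:
            kept.append(hit)
    return kept

# Invented accession numbers, for illustration only.
hits = ["T12345", "N54321", "AA000111", "W22222", "X99999"]
recent = possibly_newer(hits, "N54321")
```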