
1q21 EST Sequence Properties (Release 2.0)


WI-6908. Last revised/checked: May 29, 1997 Most recent EST: GB:H16889 Unigenelink Sequence name: wi6908.seq A 0.9 kb sequence with a central gap. No obvious similarities and no long open reading frames. The location of this marker on 1q is not certain: its RH value is very different from those of the other 1q markers, but is also quite different from those on 1p. It is apparently very close to the centromere. TIGR-A006Y17 is also part of this sequence.
SGC32297. Last revised/checked: July 2, 1997 Most recent EST: GB:R44888 Unigenelink Sequence name: sgc32297.seq This sequence might also be located on 1p; it is in any case close to the centromere of Chr1. The sequence is 1.1 kb in length, but it consists only of the 5' (GB:R24291) and 3' (GB:R44888) ends of a single clone. The 3' end contains an Alu repeat and a MER20 repeat, which have been masked by Ns. No similarities to coding sequences.
WI-8997. Last revised/checked: Feb 4, 1997 Most recent EST: GB:AA031328 Unigenelink Sequence name: wi8997.seq One of the three FCGR1 genes, coding for the high affinity Fc gamma receptor. There are three highly related FCGR1 genes mapping around the centromere, one on 1p12, two on 1q21 (Maresco et al., Cytogenet. Cell Genet. 73, 157-163 (1996)). WI-8997 is supposed to be identical to the 3' end of the FCGR1A gene (GB:M91645), which has been mapped on Chromosome 1q21, compatible with the present localization. The B and C forms are 99 and 98% similar over the 150 bp covered by WI-8997 (1 and 2 bp difference, respectively), but ESTs representing the B form lack certain parts that are present in the A form. The mapping of the marker to the position closest to the centromere is possibly the result of the fact that the gene in that position will be present in the largest number of radiation hybrids.
SGC30600. Last revised/checked: May 29, 1997 Most recent EST: GB:AA338960 Unigenelink Sequence name: sgc30600.seq A 1.0 kb sequence with a single gap based on the ends of three EST inserts. The sequence does not show any clear similarities to other database entries.
WI-12966. Last revised/checked: May 21, 1997 Most recent EST: GB:AA321779 Unigenelink Sequence name: wi12966.seq A 1.3 kb sequence that probably mainly or entirely consists of 3' UTR. Contains a masked Alu-repeat, which has been crossed. One clone (yw91g08) ends in an L1 repeat. There are two polyA addition sites, the most 5' with the canonical AATAAA, the 3' site contains the much less common AATACA-signal. Two sequences, GB:F02101 and GB:F02102 seem to diverge from the consensus-sequence of the contig at their downstream ends, which suggests that alternative splicing occurs at the sequence TTTAG|CTTTT, but the stretch of alternative nucleotides is quite short. TIGR-A001Z01 (GB:G19805) is also part of this contig.
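The polyA addition signals discussed for WI-12966 (and for several entries further down) can be located with a plain hexamer scan. The following is only a minimal sketch: the function name `find_polya_signals` and the toy sequence are hypothetical, and the signal list contains just the hexamers named in these notes, not a complete catalogue of variants.

```python
# Hexamers named in these notes: the canonical AATAAA plus less common variants.
POLYA_SIGNALS = ("AATAAA", "ATTAAA", "AATACA", "AATATA")

def find_polya_signals(seq, signals=POLYA_SIGNALS):
    """Return sorted (position, hexamer) pairs, 0-based, for every occurrence."""
    seq = seq.upper()
    hits = []
    for signal in signals:
        start = seq.find(signal)
        while start != -1:
            hits.append((start, signal))
            # Step by one base so overlapping occurrences are not missed.
            start = seq.find(signal, start + 1)
    return sorted(hits)

# Toy sequence (invented, not real data): one canonical and one variant signal.
toy = "GGCTAATAAAGCTTGCCGGAATACATTGC"
print(find_polya_signals(toy))
```

A real analysis would also check the distance from each hit to the end of the clone, since a signal is only convincing when it lies a short distance upstream of the polyA tail.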
D1S442. Last revised/checked: May 1, 1997 Sequence name: D1S442.seq A polymorphic marker, not known to contain coding sequences. Not similar to anything else. This marker has also been mapped on the genetic map.
SGC33871. Last revised/checked: July 1, 1997 Most recent EST: GB:AA479148 Unigenelink Sequence name: sgc33871.seq Genomic sequence: Protein sequence: Other species: mus33871.seq Most recent EST: GB:AA423669 Protein sequence: Not part of a known gene, although part of the contig is clearly coding based on the similarities to several mouse ESTs, one of which skips an exon. The cDNA also contains tigr-005K39 and Bdab5d06. The sequence has a length of 1.4 kb with one gap.
WI-17395. Last revised/checked: May 29, 1997 Most recent EST: GB:N39874 Unigenelink Sequence name: wi17395.seq Not part of a known gene and no rodent similarities. 0.9 kb of sequence with a gap and a repetitive part (which I have partially removed, the borders of this region still show detectable similarity to a few repeats).
NIB736. Last revised/checked: May 29, 1997 Most recent EST: GB:H17808 Unigenelink Sequence name: nib736.seq A 0.9 kb sequence with a central gap and without any clear characteristics. SHGC-3211 is also part of the contig.
WI-8000. Last revised/checked: June 2, 1997 Most recent EST: GB: AA420558 Unigenelink Sequence name: wi8000.seq Genomic sequence: wi8000g.seq Protein sequence: wi8000.pep Other species: mus8000.seq Most recent EST: GB:AA466243 Protein sequence: mus8000.pep Other species: rat8000.seq Identical to the cDNA encoding a "brain-expressed HHCPA78 homolog", GB:S73591, now named VDUP1 for Vitamin D3 upregulated protein 1. It encodes a 391 aa protein (GP:S73591), which is highly similar to the proteins encoded by the rat N27 cDNA gb:U30789 and the Mustela vison cycloheximide-induced cDNAs gb:U13891 and gb:U13888. The latter two sequences are probably derived from the same mRNA. The protein is also similar to several C. elegans proteins. The previously mentioned HHCPA78 does not feature on the list of protein hits. Remarkably, the rat cDNA-sequence is about 750 bp longer at the 5' end. The WI-8000 cDNA is covered by over 250 EST sequences. I have therefore not attempted to screen this huge list for splice variants and genomic clones. What has been done, however, is to screen the WI-8000 protein sequence against the EST database. It appears that many s1-clones start on an internal (coding) A-rich sequence AAAAAAGAAAAGAAA. A related sequence is encoded by ESTs GB:H08947 and GB:W68215 (among several others). EST za87b01 is alternatively spliced or internally deleted. Rat (GB:C06727 and GB:H32712) and porcine (GB:Z81181) clones are also available. The WI-8000 cDNA is also covered by STS TIGR-A002G31 (see below) and Bda44g03 and tigr-A002O32, which have not been mapped by Whitehead and are not contained in dbSTS. b-44g03 is among the EST hits, however.
WI-15443. Last revised/checked: June 2, 1997 Most recent EST: GB:AA333799 Unigenelink Sequence name: wi15443.seq Genomic sequence: wi15443g.seq Protein sequence: wi15443.pep Other species: mus15443.seq Most recent EST: GB:AA250187 Protein sequence: mus15443.pep Other species: bru15443.seq Most recent EST: GB:AA161573 A 1.8 kb sequence that shows a strong similarity to the putative C. elegans protein C11H1.2 (gp:Z70205) and several other predicted C. elegans and yeast proteins, which are all relatively hydrophobic in character. The human protein is at least 250 aa long. Highly related proteins from mouse, from Drosophila (GB:AA391343) and from the filarial parasite Brugia malayi can also be found in the EST database. The 5' end of the WI-15443 sequence is quite similar to the sequence of the CpG-clone 45G1 (gb:Z61133), but the number of mismatches makes it likely that this sequence is derived from a related gene, somewhere else in the genome. Several EST clones contain an L1-like sequence, other clones (yf21d07) contain an Alu-repeat. Several other clones contain intronic sequences. GB:R08969 and GB:AA090641 diverge from one another at a probable splice site. One of the two may be an intron, the other should then be an alternatively spliced exon. GB:D20759 skips an exon, which seems to be located in the 3' UTR. Splicing in the 3' UTR is very uncommon. The alternative exon might also represent a rarely spliced intron, since it ends with a good acceptor site (C/TnCAG), but it starts with the sequence AG|GCCAGT. GC-donor sites are not unheard of, though. One of the s1-clones, which normally start on the polyA-tail at the 3' end of the mRNA, has primed on an A-rich sequence toward the 5' end of the contig.
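The splice-site reasoning used for WI-15443 (a donor that begins with GT or, less commonly, GC; an acceptor matching C/TnCAG) can be sketched as two small checks. This is an illustration only: the function names and the toy intron are hypothetical, and the patterns are limited to the ones quoted in the entry above rather than a full splice-site model.

```python
import re

def donor_type(intron):
    """Classify the first two intron bases: 'GT' (canonical), 'GC' (rare) or None."""
    start = intron[:2].upper()
    return start if start in ("GT", "GC") else None

def has_acceptor(intron):
    """True when the intron ends with the (C/T)nCAG pattern quoted in the notes."""
    return re.search(r"[CT][ACGT]CAG$", intron.upper()) is not None

# Toy intron: starts with the GCCAGT bases quoted above; the remainder is invented.
toy_intron = "GCCAGTAAGCTTTCCTTTCTGCAG"
print(donor_type(toy_intron), has_acceptor(toy_intron))
```

Under these assumptions the toy intron is reported as having a GC donor and a matching acceptor, which is the situation described for the putative rarely spliced intron.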
TIGR-A003P17. Last revised/checked: May 29, 1997 Unigenelink Sequence name: wi8668.seq Not part of a known gene, but this STS is part of the WI-8668 contig (see directly below).
WI-8668. Last revised/checked: May 29, 1997 Most recent EST: GB:AA428433 Unigenelink Sequence name: wi8668.seq Genomic sequence: Protein sequence: Other species: mus8668.seq Most recent EST: GB:AA146436 Protein sequence: 0.9 kb of sequence with a central gap. Not part of a known gene nor similar to other genes. This sequence also contains TIGR-A003P17. The mouse sequence is derived from the 3' UTR and contains a repeated element which has been masked by Ns in the sequence.
TIGR-A002G31. Last revised/checked: June 2, 1997 Unigenelink Sequence name: wi8000.seq This STS is contained in the WI-8000 contig (see above).
SGC35000. Last revised/checked: May 29, 1997 Most recent EST: GB:H72274 Unigenelink Sequence name: sgc35000.seq 0.9 kb with a central gap. Not part of a known gene. Only represented by a single cDNA in the EST database. A clear similarity to a sequence present on 4p16 in the Huntington region (gb:Z49237). A rare repeat maybe. A relatively high similarity of questionable significance has been observed for more markers, usually in their 3' UTR and especially when this region is quite long. This suggests that there are still a large number of medium-to-low-copy number repeats in the human genome waiting to be discovered.
WI-497. Last revised/checked: May 1, 1997 Unigenelink Sequence name: wi497.seq A polymorphic genomic marker without any clear similarities. No overlap with EST sequences.
WI-7969. Last revised/checked: Sept 20, 1996 Most recent EST: GB:H60267 Unigenelink Sequence name: wi7969.seq Derived from the FMO5 gene, encoding flavin-containing monooxygenase 5 (positions 1684-2326). FMO1 and 2 are involved in detoxification; a defect in FMO3 leads to a fishy odor in the person affected. Other FMO-genes have also been mapped to 1q. See also WI-18060, which might be derived from another position in the FMO5 gene, but which shows some peculiarities. Other STSs from this cDNA are SHGC-162 and SHGC-12943 (according to Unigene).
WI-11526. Last revised/checked: May 29, 1997 Most recent EST: GB:R12571 Unigenelink Sequence name: wi11526.seq Genomic sequence: Protein sequence: wi11526.pep Other species: Most recent EST: GB: Protein sequence: Not derived from a known gene nor convincingly similar to another protein. This EST-contig is 0.9 kb long with a central gap and represented by only 3 cDNA clones. This STS marker overlaps with WI-11405 (see below).
SGC31941. Last revised/checked: May 29, 1997 Most recent EST: GB:AA382383 Unigenelink Sequence name: sgc31941.seq Genomic sequence: s31941g.seq Protein sequence: sgc31941.pep Other species: mus31941.seq Most recent EST: GB:AA139291 Protein sequence: mus31941.pep A 2.5 kb contig with a single gap. The cDNA might encode a motor protein. It has a domain that is similar to myosin and it also shows similarities to kinesin and an endosomal protein. The highest score is with pir:S44243. These similarities are mainly the result of the high Glu- and Gln- content and may be of little functional significance. In the human contig, GB:AA090859 is either a chimeric cDNA or an alternatively spliced form. The unique part of this EST is not similar to anything in the database. In the murine contig, EST GB:AA072145 shows a lot of differences to the other mouse ESTs that are difficult to explain.
WI-11405. Last revised/checked: May 29, 1997 Unigenelink Sequence name: wi11526.seq See above at WI-11526. The two STSs overlap, but they use different primer sets.
WI-8440. Last revised/checked: May 12, 1997 Most recent EST: GB:AA292816 Unigenelink Sequence name: wi8440.seq Genomic sequence: wi8440g.seq Protein sequence: wi8440.pep Other species: mus8440.seq Most recent EST: GB:AA289705 Protein sequence: mus8440.pep Represents the c-jun leucine zipper interactive protein (PIR:B46132), which was identified in a two-hybrid screen using the c-jun leucine zipper as the bait. However, the protein encoded by the 1.1 kb EST-contig represented by wi8440.seq is much larger than the PIR entry, which probably only covers the part of the cDNA encoding the jun-leucine zipper binding domain of the protein. Most human EST sequences derived from this gene contain an intron that contains stop codons in all three reading frames toward its 3' end. This intron is absent from the mouse ESTs, however, and the wi8440.seq-sequence is based on the spliced variant, since this yields an uninterrupted reading frame that also contains the c-jun binding region. The intron is included in the genomic sequence wi8440g.seq (g stands for genomic). A second possible intron is present at the extreme 5' end of the human sequence, since the similarity to the mouse sequence drops sharply at a point in the sequence that shows a good similarity to a splice acceptor site. GB:AA251605 probably skips an exon between CCCCGCACA|GT and .......|ATATTAAA. GB:W16619 contains 13 additional basepairs between TTGGATGAG and GTCACCGTT. This is also the location of an intron and the 13 additional basepairs are the result of the use of an upstream AG dinucleotide. Based on its association with c-jun, the protein is likely to represent a transcription factor, but there are also some similarities to akt kinase sequences in the mouse protein, which extends more to the 5' end and is 1.6 kb long. The quality of these matches is relatively low, however, making this a tentative assignment. The WI-8440 EST-contig also covers STS tigr-A005Z14.
The mouse ESTs also show alternative splicing in the form of exon skipping.
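The argument that an insert is an intron rather than a coding exon rests on stop codons being present in all three reading frames, as described for WI-8440. A minimal sketch of that check follows; the function names and toy sequences are hypothetical, and only the three forward frames are examined.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def stop_positions_by_frame(seq):
    """Map each forward frame (0, 1, 2) to the 0-based positions of its stop codons."""
    seq = seq.upper()
    stops = {0: [], 1: [], 2: []}
    for i in range(len(seq) - 2):
        if seq[i:i + 3] in STOP_CODONS:
            stops[i % 3].append(i)
    return stops

def blocked_in_all_frames(seq):
    """True when every forward frame contains at least one stop codon."""
    return all(stop_positions_by_frame(seq).values())

# Toy sequences (invented): the first has one stop codon in each frame.
print(stop_positions_by_frame("TAACTAGCTGAC"))
print(blocked_in_all_frames("TAACTAGCTGAC"), blocked_in_all_frames("ATGGCCGCC"))
```

A sequence that is blocked in all three frames cannot be read through in any frame, which is why retaining such an insert interrupts the open reading frame.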
SGC32182. Last revised/checked: May 20, 1997 Most recent EST: GB:AA339201 Unigenelink Sequence name: sgc32182.seq Genomic sequence: Protein sequence: sgc32182.pep Other species: mus32182.seq Most recent EST: GB:AA068884 Protein sequence: mus32182.pep The 1.3 kb sequence with a central gap seems to encode a transmembrane protein with a domain that is similar to carbonic anhydrase. However, such a domain is also present on a pair of human proteins called p54/58N, and on the amino-terminal end of proteoglycans (phosphacan) and receptor-type tyrosine phosphatases. The function of this domain is not clear. Two sequences (GB:AA08433 and AA080898), derived from the same EST clone are labeled as human clones, but show a very high level of similarity to murine ESTs, whereas they are clearly different from the other human clones. Therefore they have been included in the murine sequence, although the existence of a second human gene that is very similar to the mouse gene can of course not be excluded. GB:AA322820 skips an exon in the human sequence between TGAGG|GCCCA and TGCAG|CCCCA. The murine SGC32182 gene probably overlaps in a 3'-3' orientation with the murine SGC31751 gene, which contains a long 3' extension in comparison with the human gene.
WI-7732. Last revised/checked: Sept 20, 1996 Unigenelink Sequence name: wi7732.seq The histone H2B.1 gene. This gene is also covered by Cda0ab09 (according to Unigene).
TIGR-A003N45. Last revised/checked: Nov 27, 1996 Most recent EST: GB:AA044343 Unigenelink Sequence name: t003n45.seq Genomic sequence: t003n45g.seq Protein sequence: t003n45.pep Other species: m003n45.seq Most recent EST: GB:AA108610 Genomic sequence: m003n45g.seq Protein sequence: m003n45.pep The transcript from this gene contains an unspliced intron in at least four clones, and this also is the case for the mouse EST sequences. The intron is located in the 5' UTR of the cDNA. The protein sequence is very clearly similar to other proteins, namely to the bacterial BolA proteins, which are putative regulators of murein gene expression. The best match is with gp:Z37111. Related proteins are also found in yeasts, plants and animals. Since murein is a component of the bacterial cell wall, this means that the encoded protein should have another function in these organisms. The protein is only 125 amino acids long, and the complete sequences of both the human and mouse proteins are available. The proteins are over 85% similar. The mouse protein might start three amino acids upstream from the human Met-startcodon. The TIGR-A003N45-contig also contains WI-16272 (see below). A related cDNA (t003n45b.seq) is also available upon request. This sequence also shows complex splicing and polyadenylation patterns and there is some evidence for yet a third cDNA. The t003n45b.pep-sequence is most similar to the yeast YGL220w protein (gp:Z72742).
WI-18060. Last revised/checked: Nov 14, 1996 Most recent EST: GB:H51750 Unigenelink Sequence name: wi18060.seq The FMO5 gene, see also WI-7969 (above). These two markers are quite widely spaced on the RH map, considering the fact that they are supposed to cover the same gene, but I have encountered several similar situations (see the table). The WI-18060 sequence has been derived from an EST that is identical to the 5' end of the FMO5 mRNA (GB:Z47533) for most of its sequence. There is no overlap with other FMO5-ESTs, which are all derived from more 3' positions in the gene. I have made some confusing observations regarding the WI-18060-EST (yp81c12.r1/s1; GB:H51749 and GB:H51750, respectively): it is in the reverse orientation with respect to the FMO5 gene and all the other ESTs. This would suggest that the 5' end of the FMO5 gene is transcribed on both strands, an unlikely situation. A mix-up of the orientations of the insert (which happens now and then) is unlikely in this case, because the 3' (s1)-sequence (that partially overlaps with the extreme 5' end of the FMO5 cDNA) contains an AATAAA polyA-signal at the expected distance from the end, suggesting that it is derived from a genuine mRNA. A chimeric cDNA (derived from two unrelated mRNAs) is also unlikely, because STS WI-18060, which is derived from the same s1-sequence (GB:H51750), is completely upstream of the full-length(?) FMO5 cDNA and its localization to 1q21 is therefore not based on known FMO5 sequences. The chance that two independent cDNAs, derived from two genes located very close to each other on 1q21, would end up in the same clone seems very small indeed, and it is therefore unlikely that WI-18060 has been derived from a chimeric EST. One exotic possibility is that this represents a genomic sequence after all, and that the polyA-tail and the polyadenylation signal are derived from a partial pseudogene that is present just upstream from the FMO5 gene or in an intron of the gene.
Such a scenario is also suggested by the fact that the length of the EST (1070 bp) is larger than the distance covered on the FMO5 cDNA, suggesting that the middle of the EST-insert contains an intron. Resolving this issue is possible by sequencing the FMO5 gene.
SGC30813. Last revised/checked: Jan 16, 1997 Most recent EST: GB:AA065094 Unigenelink Sequence name: sgc30813.seq Genomic sequence: Protein sequence: sgc30813.pep Other species: rat30813.seq Most recent EST: GB:H33534 Protein sequence: rat30813.pep A 2.1 kb sequence with four gaps with a clear similarity to the Xenopus elav-like ribonucleoprotein (etr1; gp:U16800). The human (>140 aa) and rat proteins are identical in the regions of overlap, with the exception of a variation in the length of a CAG-rich repeat sequence, which encodes (Gln)n. This repeat is masked in the file with the nucleotide sequence, because it gives a lot of background in the database searches.
WI-15174. Last revised/checked: Nov 14, 1996 Most recent EST: GB:H51507 Unigenelink Sequence name: wi15174.seq This gene is only covered by two EST sequences and the compiled 0.75 kb sequence, that contains a central gap, does not have any similarity to known proteins or sequences in other species. A clear open reading frame is also lacking.
WI-16232. Last revised/checked: March 27, 1997 Most recent EST: GB:AA227883 Unigenelink Sequence name: wi16232.seq This contig is difficult to explain: it contains most of the same sequences as WI-13356, but the large majority of ESTs are in the reverse orientation. Based on the length of the sequence-contig (more than 1.3 kb) and the presence of several sequences that seem to be alternatively spliced, it appears that the clones representing this contig are somehow in the wrong orientation. I have not been able to find a plausible explanation for this. The WI-16232 alternative sequences are also present in the rat homologue of WI-13356, which encodes a PI4-kinase. For more details, see at WI-13356 below. Again, note how far markers derived from the same gene are separated from one another (about 10 cR, with many intervening markers). The gene encoding the PI4-kinase might well be large, and the markers are derived from the 5' and 3' end of the gene, but this still does not sufficiently explain the difference (let alone the intervening markers).
WI-8386. Last revised/checked: Feb 12, 1997 Most recent EST: GB:AA043991 Unigenelink Sequence name: wi8386.seq Genomic sequence: Protein sequence: Other species: Most recent EST: GB: Protein sequence: A 1.4 kb contiguous sequence that only shows some similarities to simple sequence and other repeats. Therefore probably derived from a gene with a long 3' UTR. This is also suggested by the absence of murine ESTs (3' UTRs may be quite poorly conserved) and by the fact that only the extreme 5' end of the contig shows similarity to another human EST (gb:AA203595) in a way that suggests the presence of a coding domain. Predictions regarding the properties of the protein are not yet possible. The sequence of the related sequence wi8386b.seq is available upon request.
WI-8123. Last revised/checked: Feb 12, 1997 Most recent EST: GB:AA025534 Unigenelink Sequence name: wi8123.seq Genomic sequence: Protein sequence: wi8123.pep Other species: Most recent EST: GB: Protein sequence: A 1.2 kb sequence with a central gap. No convincing similarities to other DNA or protein sequences. This EST-contig overlaps with markers SHGC-15372 (Stanford Human Genome Center; GB:G15089) and with TIGR-A005M02 (GB:G20321). The SHGC marker does not feature on the Stanford RH map (at least not in this position on 1q), but rather has been mapped to Chr14, despite the fact that it overlaps with the other two markers.
SGC32664. Last revised/checked: Dec 16, 1996 Most recent EST: GB:AA136468 Unigenelink Sequence name: sgc32664.seq Genomic sequence: s32664g.seq Protein sequence: sgc32664.pep Other species: mus32664.seq Most recent EST: GB:AA110993 Protein sequence: mus32664.pep Other species: rat32664.seq Most recent EST: GB:H32182 This 0.6 kb cDNA, which is probably full length, encodes a protein that is related to bacterial ribosomal S21 proteins, a yeast protein and a Salmonella rhamnulose kinase. The similarities are not extremely strong. SGC32206 (see below) is derived from a rare variant-cDNA that has an extended 3' UTR (ESTs GB:N63268, GB:R20655, and GB:H53531). Other cDNAs end at upstream AATAAA and non-optimal AATATA poly-adenylation sites. The longest 3' UTR, which covers SGC32206, contains an Alu-repeat. An intron is present in quite a number of the ESTs, and in the almost full-length cDNA GB:U79258. This cDNA lacks the short upstream exon, however. The intron contains several ATG-codons, and the U79258-file contains a putative translation product derived from this intron, but most ATGs are not part of a good Kozak-consensus sequence, due to the absence of purines on the -3 and/or +4 positions. Moreover, this intron is never present in the murine cDNAs. None of the possible open reading frames starting in this intron overlap with the S21-like open reading frame. The genomic sequence, including this intron, is contained in the s32664g.seq-file. Remarkably, the mouse S21-protein, which shows 89% identity to the human protein in the first 62 amino acids, has a different carboxyl-terminus. The sequences from the 3' ends of the cDNAs are also clearly divergent, which makes a simple error in the reading frame less likely. It is therefore possible that the mouse sequences are derived from a paralogous, non-orthologous gene. In conclusion, the situation is rather complex.
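The Kozak argument made for the intronic ATG codons of SGC32664 can be sketched as a simple context check. This is a hedged illustration, not the procedure actually used for these notes: the function name and toy sequence are hypothetical, and the test follows the criterion as worded above (a purine at -3 and at +4, counting the A of the ATG as +1), which is looser than the usual requirement for a G at +4.

```python
PURINES = "AG"

def kozak_context(seq):
    """Return (atg_index, purine_at_minus3, purine_at_plus4) for each ATG with flanks."""
    seq = seq.upper()
    results = []
    i = seq.find("ATG")
    while i != -1:
        # Skip ATGs that lack a base at -3 or at +4 within the sequence.
        if i >= 3 and i + 3 < len(seq):
            results.append((i, seq[i - 3] in PURINES, seq[i + 3] in PURINES))
        i = seq.find("ATG", i + 1)
    return results

# Toy sequence (invented): a favourable ATG followed by an unfavourable one.
toy = "GCCACCATGGCGTTTCTCATGTCC"
print(kozak_context(toy))
```

An ATG that fails at both positions, like the second one in the toy sequence, would be considered an unlikely start codon by this criterion.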
SGC31751. Last revised/checked: May 30, 1997 Most recent EST: GB:AA411081 Unigenelink Sequence name: sgc31751.seq Genomic sequence: Protein sequence: sgc31751.pep Other species: mus31751.seq Most recent EST: GB:AA432911 Protein sequence: mus31751.pep This 1.8 kb contig consists of a large number of ESTs. It is for a large part coding, based on the changes in the third positions in one rat EST and many mouse ESTs, but the encoded protein does not show a clear similarity to well-characterized proteins. There is a very distant similarity to a neuronal Glu transporter, but this seems to reflect the high percentage of hydrophobic amino acids. The human and murine sequences show several frame-shifts with respect to one another and the very high level of similarity between the two contigs makes it difficult to decide which frame is correct between two consecutive frame-shifts. The human sequence contains an intron that is present in most of the ESTs, but is absent in some. The reason that I describe this insert as an intron, and not as an alternatively spliced exon, is that it is bordered by consensus splice sites, CTTGT|GTAAG and CCTCCCTTCCCCTCTGCAG|GCCGA. More importantly, there are multiple stop codons in all three frames. However, the putative open reading frame in the spliced cDNAs is not much longer. Unfortunately, the sequences around the splice sites do not agree very well with one another, which makes it difficult to decide which frame is correct. For the moment, the intron has therefore been retained in the human sequence. The murine sequence also contains the putative intron. It is also much more extensive at the 3' side, but this is based on a rather short overlap (of approximately 40 bp) of only a single EST (GB:W74838) and this part of the contig may therefore be erroneous.
This 3' extension in the mouse does overlap with several murine clones in the opposite orientation, which are most like the human SGC32182 gene, which has been mapped not very far from here. This makes it more likely that the murine extension is correct after all, and shows again how close the genes are to one another, especially when they are in a 3'-3' orientation. Related human and murine sequences, which might in the end help to establish the correct reading frame, are also present in the EST database and are available upon request. The human cDNA also covers SHGC-10321 (GB:G14461). The two STSs cover separate parts of the contig, however.
WI-16272. Last revised/checked: Nov 14, 1996 Most recent EST: GB: Unigenelink Sequence name: t003n45.seq Derived from the same contig as TIGR-A003N45 (see above), but quite a bit separated from it on the RH-map.
WI-14283. Last revised/checked: May 1, 1997 Most recent EST: GB:AA223344 Unigenelink Sequence name: wi14283.seq Genomic sequence: wi14283g.seq Protein sequence: wi14283.pep Other species: mus14283.seq Most recent EST: GB:AA237591 Genomic sequence: m14283g.seq Protein sequence: mus14283.pep The cDNA encodes a protein that is similar to proteins involved in vacuolar transport of proteins. The present contig is highly related to another cDNA sequence (GB:U35246) which is also claimed to be a human cDNA. I have some doubts about the species-designation, however. The last 1500 basepairs of U35246 are virtually identical (3 mismatches in over 1500 bp or 99.8%) to the recently published rat sequence (GB:U81160), whereas the remaining 240 bp at the 5' end show many more differences to the rat sequence, all at the third position in codons. This suggests that a rat cDNA has inadvertently been sequenced and that the 5' end has been obtained via 5' RACE on human mRNA. Of note, the other cDNAs in the same publication are also derived from rat. The view that U35246 is derived from a non-human species is also supported by the fact that not a single EST in the database is derived from this cDNA (there are no rat ESTs in dbEST covering this cDNA). The murine sequence is largely based on GB:U66865, but I have added a 3' (untranslated) extension that is based on EST sequences. All ESTs are identical in sequence to U66865 in the region of overlap, with the exception of two ESTs, GB:AA183278 and AA237591, that probably contain an intron and an alternatively spliced exon (or another cDNA cloned in the same vector), respectively. The WI-14283 cDNA and the homologous sequences from mouse and rat encode a vacuolar protein sorting protein similar to the yeast VPS45 protein. This protein is quite similar to the mammalian proteins, and the C. elegans gene, which is encoded on cosmid C44C1 (GB:U41030), is readily detected even at the nucleotide level. The related C. elegans protein UNC-18 is labeled as a vesicle transport protein and an acetylcholine regulator. A human homolog(ue) of this protein also exists. The human wi14283.seq-sequence also covers NIB1471, which has been used for the Whitehead YAC-STS map.
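The identity figure quoted for U35246 against the rat sequence (3 mismatches in about 1500 bp) follows from a simple ratio. A small sketch, assuming an ungapped alignment; the function names are hypothetical and the aligned strings in the second call are invented.

```python
def percent_identity(mismatches, aligned_length):
    """Percent identity for an ungapped alignment, from a mismatch count."""
    if aligned_length <= 0:
        raise ValueError("aligned_length must be positive")
    return 100.0 * (aligned_length - mismatches) / aligned_length

def count_mismatches(a, b):
    """Count differing positions between two equal-length aligned strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(1 for x, y in zip(a.upper(), b.upper()) if x != y)

# Figures quoted in the entry above: 3 mismatches over roughly 1500 bp.
print(round(percent_identity(3, 1500), 1))
# Invented toy alignment with a single mismatch.
print(count_mismatches("ACGTACGT", "ACGTACGA"))
```

Identity at this level between a supposedly human cDNA and a rat sequence is far higher than expected for orthologues, which is the basis for doubting the species designation.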
IB3045. Last revised/checked: May 20, 1997 Most recent EST: GB:AA345236 Unigenelink Sequence name: ib3045.seq Genomic sequence: Protein sequence: ib3045.pep Other species: mus3045.seq Most recent EST: GB: Protein sequence: mus3045.pep The protein encoded by this EST-contig is related to the yeast (S. pombe) longevity assurance protein (LAG1, gp:U76608) and even more to LAG1-related C. elegans proteins (gp:U42438 and gp:U40415). A 1.7 kb contiguous sequence and a protein of at least 250 aa. The mouse contig overlaps at its 3' end with the 3' end of the mus288.seq-sequence (see NIB288). There are several related cDNAs that can be compiled from the ESTs in the database.
SGC32206. Last revised/checked: Nov 19, 1996 Unigenelink Sequence name: sgc32664.seq This marker is located in a rare Alu-repeat containing extension of the 3' UTR of SGC32664 (see above), which probably encodes a protein with similarity to the ribosomal S21 proteins.
WI-11473. Last revised/checked: March 25, 1997 Most recent EST: GB:AA194147 Unigenelink Sequence name: wi11473.seq Genomic sequence: Protein sequence: Other species: mus11473.seq Most recent EST: GB:AA274819 Protein sequence: One of two convergently transcribed genes, of which the longest 3' UTRs (there are alternative polyadenylation sites) overlap by 100 bp with the gene represented by WI-13356, which encodes a phosphatidylinositol 4-kinase (see below). The WI-11473 primers are specific for the WI-11473 cDNA, since one of them is located outside the WI-13356 transcription unit. The compiled sequence is presently 760 bp and is only represented by a small number of ESTs. The sequence ends with an ATTAAA signal. The murine sequence is more extended toward the 5' end and shows some similarity to zinc finger proteins. So far, the murine protein is only 41 amino acids, however, making a definite assignment a risky undertaking.
WI-5177. Last revised/checked: Dec 9, 1996 Unigenelink Sequence name: wi5177.seq A random genomic STS, with no similarity to EST sequences. It is present in the same YAC contig as D1S2343, WI-7217 (profilaggrin), WI-7815 (trichohyalin), UTR-9853 (S100A10, calpactin light chain), GATA51H09, WI-9245 (a SPRR), WI-7842 (another SPRR, see below).
WI-12245. Last revised/checked: Sept 20, 1996 Most recent EST: GB: Unigenelink Sequence name: wi12245.seq This marker is derived from the Cathepsin K/O/X gene (GB:U13665; all different names for the same gene). It is derived from the 5' end of the gene. Another marker, WI-9390 (GB:G07268), is clearly derived from the 3' end of the same gene, but has surprisingly been mapped to a much more telomeric position on 1q via YAC-STS mapping (in WC1.20). The RH location of this marker is not available. A clear discrepancy. See also SGC35262 (below) which is derived from the cathepsin S gene. One would expect the cathepsin genes to be clustered.
SGC34368. Last revised/checked: Nov 19, 1996 Most recent EST: GB:W86675 Unigenelink Sequence name: sgc34368.seq Genomic sequence: Protein sequence: sgc34368.pep Other species: mus34368.seq Most recent EST: GB:W36491 Protein sequence: Other species: rat34368.seq Most recent EST: GB:H34823 Protein sequence: This marker is part of a cDNA with a good similarity to bacterial 50S ribosomal L9 proteins, although the eukaryotic protein is more extended at the amino-terminal end (the complete protein sequence has most probably been obtained). The best match is with E. coli L9 (sp:P02418). I am a bit surprised that there are no mammalian matches. One would expect that the cDNAs of such abundant proteins would have been cloned some time (ages!!!) ago. On the other hand, the message is not extremely abundant, as judged from the number of EST clones, although it is apparently very well represented in a retina cDNA library. One EST (GB:W26374) does not match the consensus sequence from 1 to 370. It may be an alternatively spliced form.
WI-18164. Last revised/checked: Feb 14, 1997 Most recent EST: GB:AA149863 Unigenelink Sequence name: wi18164.seq Only characterized by a few EST-clones. A 0.7 kb sequence. No matches to the protein database. The contig shows significant similarity to sequences present in various genomic clones, although a number of gaps need to be introduced. It seems likely therefore that this contig is mainly composed of 3' UTR.
SGC35293. Last revised/checked: Sept 20, 1996 Most recent EST: GB: Unigenelink Sequence name: sgc35293.seq This marker was directly derived from the sequence for Small Proline Rich Protein 2a (SPRR2), GB:M20030. Remarkably, WI-17060, which encodes another SPRR-protein (GB:M21539), has been mapped much further downstream (see below), whereas the SPRR genes are known to be clustered.
WI-13356. Last revised/checked: April 9, 1997 Most recent EST: GB:AA282706 Unigenelink Sequence name: wi13356.seq Genomic sequence: Protein sequence: wi13356.pep Other species: mus13356.seq Most recent EST: GB:AA260873 Protein sequence: mus13356.pep Other species: rat13356.seq Protein sequence: rat13356.pep This marker covers a transcription unit that overlaps with WI-11473 (see above). Several clones (e.g. GB:R98965 and GB:44360) overlap for 100 bp with the 3' end of the WI-11473 sequence. These clones represent the 3' end of the transcription unit, which ends with an AATAAA consensus polyadenylation signal. Most clones end 300 bp further upstream at another polyadenylation signal and do not overlap with WI-11473. The WI-13356 cDNA, which was recently published (Meyers and Cantley, J. Biol. Chem. 272, 4384-4390 (1997) GB:U81802) and which ends at the upstream polyadenylation signal, encodes a PI4-kinase. Alternative splicing occurs, especially at the 5' side of the mRNA, as evidenced by the homologous rat cDNA (GB:D84667) and by EST GB:AA282706. Several mouse ESTs contain part of intron sequences. The most surprising feature of the WI-13356 sequence is that it largely overlaps with the sequence of WI-16232, but in the reverse orientation. Since this overlap is present at the 5' end of the PI4-kinase cDNA and most probably spans several exons, a real reverse transcript seems unlikely. It is more likely that for some reason the orientation of the inserts has been reversed, although there is no obvious reason why this should have happened (e.g. there is no A-rich stretch on which oligodT-priming might have occurred). Some of the rat alternative exons are present in the WI-16232 transcription unit, which in fact constitutes the evidence that this transcript spans several exons. For the moment the WI-16232 contig has been retained as a separate entry.
Other details about the WI-13356 contig: one EST (GB:W52129) skips a 37 bp sequence in the cDNA, leading to a frame-shift and a premature stop. This is probably a recombined clone, since the borders of the deletion are identical over a 6 bp stretch. The WI-13356 sequence shows identity to the first 120 bp (in the reverse orientation) of GB:U15590, which encodes human heat-shock protein 27. This probably represents an error on the part of the HSP27 sequence, since this part is clearly similar to various PI4-kinase sequences and is present in several independent clones.
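The frame-shift reasoning for GB:W52129 is plain arithmetic: a deletion (or insertion) keeps the reading frame only if its length is a multiple of three. A minimal Python sketch of that check follows; the function name is illustrative and is not part of any tool used to build these contigs.

```python
def preserves_reading_frame(indel_length: int) -> bool:
    """Return True if an insertion or deletion of this length keeps the codon frame."""
    return indel_length % 3 == 0


# The 37 bp segment skipped in GB:W52129 is not a multiple of 3,
# so all codons downstream of the deletion are read in a shifted frame.
print(preserves_reading_frame(37))  # False
print(37 % 3)                       # 1 (the frame is shifted by one position)
```

A shifted frame usually runs into a stop codon fairly quickly, which is consistent with the premature stop noted above.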
WI-6771. Last revised/checked: May 30, 1997 Most recent EST: GB:AA187628 Unigenelink Sequence name: wi6771.seq This marker is part of a gene that encodes a protein of unknown function that has similarity to ankyrin-containing proteins. The marker is at the 3' end of a transcript that is a non-spliced and alternatively poly-adenylated variant of the NIB288/WI-7370 (GB:G06543) transcript described below. It is situated around position 1200 in the full-length transcript. The marker is part intron, part exon, the perfectly normal AATAAA polyA site being located in an intron. A spliced EST (GB:186759 (5') and GB:AA187628 (3')) that ends at this polyadenylation site also exists. For the moment, I have retained WI-6771 as a separate sequence, but it is also contained in the genomic sequence of NIB288, nib288g.seq (see below). WI-6771 has been mapped to the same YACs as WI-8118, WI-9627, WI-7370 (=NIB288), and D1S498. It is also known as D1S2372.
SGC31587. Last revised/checked: Jan 10, 1997 Most recent EST: GB:AA128568 Unigenelink Sequence name: s31587hv.seq This sequence is derived from the cDNA of DNA binding regulatory factor (GB:X85786). However, the ESTs cover mainly the 3' UTR, which is not contained in the Genbank entry. In addition, there are some consistent differences between the ESTs and the database entry, which is why I have compiled my own sequence. The differences are in GC-rich regions, and may be caused by compressions in the gel and subsequent misinterpretations of the EST sequences, but at least one difference is present on both strands. Because all differences are downstream from the stop codon, the protein is not (yet?) affected. The s31587hv.seq-sequence is the complete coding sequence combined with "my own" 3' UTR, which still contains a gap. The complete sequence is at least 3.0 kb long.
TIGR-A002I04. Last revised/checked: Feb 14, 1997 Most recent EST: GB:AA206559 Unigenelink Sequence name: t002i04.seq Genomic sequence: Protein sequence: t002i04.pep Other species: m002i04.seq Most recent EST: GB:AA071777 Protein sequence: A compiled 1.9 kb sequence that represents the human variant of the rat and bovine p87 transport-like protein or SV2 form A protein (form B is highly similar, but clearly distinct; the latter protein is represented by WashU ESTs GB:R53361 and GB:T80035 and mouse GB:R74749). The 1.9 kb sequence encodes the last 200 aa of the 742 aa protein, which is very highly conserved between species.
NIB288. Last revised/checked: May 30, 1997 Most recent EST: GB:AA430987 Unigenelink Sequence name: nib288.seq Genomic sequence: nib288g.seq Protein sequence: nib288.pep Other species: mus288.seq Most recent EST: GB:AA396251 Protein sequence: Derived from the same gene as WI-6771 (see above), but from the normal transcript. The full length (4333 bp) cDNA is of unknown function (GB:D31891). It is expressed in the immature myeloid cell line KG1 and its gene product (of about 1300 aa) is related to the G9a protein, an ankyrin-repeat containing protein that is encoded somewhere in the MHC-complex and is, again, of unknown function. The NIB288 cDNA is even more related to a putative DNA topoisomerase II from C. elegans. Other DNA-interacting proteins also feature on the list of related proteins. Another human match with a relatively high score is the MG44 protein, another DNA binding protein with similarity to SRY. There are several EST sequences that are similar to this protein, but this might reflect the presence of ankyrin repeats. The presence of various well-conserved domains also makes it somewhat difficult to discriminate the homologous murine ESTs from those that are simply related to NIB288. The ESTs from a related gene may show a higher level of similarity to NIB288 than real mus288 ESTs, if the latter are derived from a less-conserved part of the gene. This has led to an initial misassignment of some murine ESTs. Some murine ESTs are, however, even similar to the 3' untranslated region of NIB288, and these ESTs are present in the mus288.seq-file, which has been entirely based on nucleotide sequence similarities. The mus288-sequence overlaps at its extreme 3' end with the 3' sequence of mus3045 (see at IB3045 below). The human mRNAs do not seem to overlap. WI-7370 (GB:G06543) is another marker derived from this cDNA. This marker has also been used for the Whitehead YAC-STS map. It is present on YACs 764-A-1, 789-E-5, 854-D-5 and 947-E-1.
These YACs are positive for WI-8118, WI-9627, WI-6771 (see above), D1S498 and D1S2347.
WI-15024. Last revised/checked: May 13, 1997 Most recent EST: GB:H20877 Unigenelink Sequence name: wi15024.seq Genomic sequence: wi15024g.seq Protein sequence: wi15024.pep Other species: mus15024.seq Most recent EST: GB:W13076 Protein sequence: Other species: rat15024.seq Most recent EST: GB:H32838 Protein sequence: rat15024.pep This marker is only covered by a single EST clone, but this clone has good similarity to both rat (GB:H32838) and mouse EST sequences (GB:W13076 and W11461). The contig also overlaps with a genomic (CpG-island) fragment (GB:HS58D6R). The other end of this CpG island clone, HS58D6F, is not similar to anything. The deduced protein sequence is weakly, but convincingly, related to yeast YPL191c (GP:Z73547) and YGL082w (GP:Z72604). However, none of this leads to a clearly defined function of the putative protein, which is at least 100 aa long. A second, related cDNA is also present in the database. Remarkably, this sequence seems to be described in the opposite direction, since all ESTs are in the reverse orientation. The sequence of this contig (wi15024b.seq) is available upon request.
WI-17680. Last revised/checked: May 13, 1997 Most recent EST: GB:AA416763 Unigenelink Sequence name: wi17680.seq Genomic sequence: Protein sequence: Other species: mus17680.seq Most recent EST: GB:AA209852 Protein sequence: Other species: pig17680.seq Most recent EST: GB:Z84190 Protein sequence: A 2.6 kb sequence, with two gaps and two masked T-rich repeats. The correct order of the first two human fragments is by no means certain. There are two polyadenylation sites, one about 1.1 kb upstream of the other. Both are preceded by canonical AATAAA polyA signal sites. The similarity to the mouse sequence is restricted to a non-coding part of the sequence, which means it has not been proven that the two sequences really encode a homologous protein. It is however still possible that the human sequence mainly consists of 3' UTR and that only part of this 3' UTR has been conserved in the mouse. Although the 1.4 kb murine sequence has an open reading frame of some length, it lacks a methionine start codon. The extreme 3' end of the human cDNA is quite similar to a pig EST (in the reverse orientation), GB:SSZ84190 (pig17680). GB:N98426 is a chimeric cDNA that is only for the first 110 bp identical to the 3' end of WI-17680.
WI-16732. Last revised/checked: May 13, 1997 Most recent EST: GB:R08189 Unigenelink Sequence name: wi16732.seq A 0.6 kb sequence based on a small number of ESTs. The sequence contains an Alu repeat, which has been removed. Most or all of the sequence of this contig is derived from the 3' UTR of this cDNA. It does not show any similarity to other genes. tigr-A008P08 is also part of this contig.
SGC35262. Last revised/checked: Sept 20, 1996 Most recent EST: GB: Unigenelink Sequence name: sgc35262.seq Derived from the cathepsin S gene (GB:M90696). See also WI-12245, which is derived from another cathepsin family member, Cathepsin K. One would expect that these genes would be relatively close to one another in the genome, which is usually the case for genes that arose by duplication and are still present on the same chromosome.
WI-15828. Last revised/checked: June 4, 1997 Most recent EST: GB:H43073 Unigenelink Sequence name: wi15828.seq Genomic sequence: Protein sequence: Other species: mus15828.seq Most recent EST: GB:AA268358 Protein sequence: This cDNA contains an Alu-repeat in its 3' UTR and none of the present EST clones crosses this repeat. This makes it impossible to reach the coding part of the transcript. A recently identified mouse EST may in the end help to circumvent this problem. At present, this EST is still on its own and only covers the extreme 3' end of the mRNA, however. The fact that it is detectable suggests that the level of conservation of this transcription unit should be quite strong.
WI-17569. Last revised/checked: June 4, 1997 Most recent EST: GB:AA402743 Unigenelink Sequence name: wi17569.seq Genomic sequence: Protein sequence: wi17569.pep Other species: mus17569.seq Most recent EST: GB:AA213063 Protein sequence: The ongoing sequencing of the human genome has revealed that the entire 3' end of this 1.4 kb contig is quite similar to other genomic sequences. The major part of the 3' end of WI-17569 has now been removed. The 5' end of the contig overlaps with a small CpG clone, 63f1 (GB:Z55775 and GB:Z55776). The last part of GB:F00707 is not compatible with the other human ESTs. The cause of this discrepancy is not obvious. WI-17569 is similar to several mouse ESTs, but this gives relatively little additional information. The putative protein encoded by the human mRNA might be related to a yeast protein with the accession number sp:P40098.
SGC34816. Last revised/checked: Sept 20, 1996 Most recent EST: GB: Unigenelink Sequence name: sgc34816.seq The ROR-gamma gene (GB:U16997), encoding a retinoid-receptor-like orphan receptor (so without a known ligand). Always an interesting candidate for a hereditary disease!
WI-16757. Last revised/checked: Jan 20, 1997 Most recent EST: GB:AA031777 Unigenelink Sequence name: wi16757.seq Genomic sequence: Protein sequence: Other species: mus16757.seq Most recent EST: GB:AA190161 Protein sequence: A 2.9 kb sequence that mainly consists of the 3' UTR of the Aryl Hydrocarbon Receptor Nuclear Translocator (ARNT) gene. The 3' UTR is not present in the GB:M69238 entry of the ARNT cDNA. The length of the 3' UTR is at least 2.5 kb and it contains a CA-repeat and a long stretch of Ts, which have both been replaced by Ns in the sequence. This transcription unit also covers SGC/WI-30626 (GB:G21200; see below), SHGC10310 (GB:G11399), CHLC.UTR_01521_M69238.P56088 (GB:G15891), SHGC-11178 (GB:G11246), and SHGC-19027 (GB:G31040). In addition, Bos taurus minisatellite marker RME23 (GB:BTU15433) is probably also derived from this gene. The murine sequence contains a large gap with respect to the human sequence in its 3' UTR. ARNT KO mice show defects in angiogenesis and in their response to glucose and oxygen deprivation (Maltepe et al., Nature 386, 403-407 (1997)).
SGC30626. Last revised/checked: Jan 20, 1997 Unigenelink Sequence name: wi16757.seq Identical to WI-16757 and several other STSs (see directly above).
SGC34987. Last revised/checked: Sept 20, 1996 Most recent EST: GB: Unigenelink Sequence name: sgc34987.seq Identical to the MCL1 gene (myeloid cell differentiation protein, involved in leukemias or lymphomas). The full-length MCL1 cDNA is contained in GB:L08246. The MCL1 transcript also covers SHGC-15108 (GB:G14999).
WI-14860. Last revised/checked: Dec 2, 1996 Most recent EST: GB: Unigenelink Sequence name: wi14860.seq Genomic sequence: Derived from the human (mitochondrial) ubiquinol cytochrome-c reductase core I protein cDNA (GB:L16842 and GB:D26485), although the actual STS WI-14860 does not overlap with these files. In fact, WI-14860 does not match to anything in dbEST except itself. The match with the mitochondrial sequences is entirely based on the 5' end of the EST insert from which WI-14860 is derived. Possibly, WI-14860 represents a rare 3' end. Remarkably, the protein sequence also shows significant similarity to mitochondrial proteases, such as rat mitochondrial processing protease P52 (GB:D13907), which seems to be an altogether different function.
SGC35405. Last revised/checked: Sept 20, 1996 Most recent EST: GB: Unigenelink Sequence name: sgc35405.seq Small Proline Rich Protein 2C (SPRR2C; GB:M21539). Quite similar, but not identical to SGC35293 (SPRR2A; GB:M20030), which has been located to a more centromeric position on 1q (see above).
WI-7842. Last revised/checked: Sept 20, 1996 Most recent EST: GB: Unigenelink Sequence name: wi7842.seq Also a SPRR (GB:M19888). SPRR stands for Small Proline-Rich Protein. There are many genes encoding these proteins, which together form a small family. All genes are known to be clustered in a short interval on 1q21. At least one of the YACs positive for WI-7842 is also positive for D1S2343, WI-7217 (profilaggrin), WI-5177, WI-7815 (trichohyalin), UTR-9853 (S100A10, calpactin light chain), GATA51H09, WI-9245 (another SPRR), and D1S2346.
WI-17060. Last revised/checked: May 12, 1997 Most recent EST: GB:AA321239 Unigenelink Sequence name: wi17060.seq Genomic sequence: wi17060g.seq The 0.5 kb cDNA for CAAF1 (calcium-binding protein in human amniotic fluid 1; GB:D83664 (cDNA) and D83657 (gene)), also named calgranulin C (GB:X97859) or S100A12 (GB:X98288, X98289, X98290). Calgranulin A and B (S100A8 and S100A9) have been mapped closest to loricrin, which is also located on 1q21 in the Epidermal Differentiation Complex (Marenholz et al., Genomics 37, 295-302 (1996)). It is clear that the S100-gene complex contains additional genes of the same family, that have not yet been mapped by other means (see also WI-16548 below), or have only been mapped recently, such as this gene. The bovine homolog(ue) is also available as GB:D49548.
WI-2862. Last revised/checked: June 4, 1997 Unigenelink Sequence name: wi2862.seq A random genomic STS, not similar to other entries in the database. This marker has also been used for the Whitehead YAC-STS project. Although most hits are ambiguous, it probably maps to YACs 713_h_12, 717_c_3 and 870_c_5, since these YACs also contain other STSs in the region, such as D1S305 and IB3580 (SGC33740).
WI-11760. Last revised/checked: April 16, 1997 Most recent EST: GB: Unigenelink Sequence name: wi11760.seq Genomic sequence: wi11760g.seq Protein sequence: wi11760.pep Other species: mus11760.seq Most recent EST: GB: Protein sequence: mus11760.pep A 0.5 kb full-length cDNA encoding an unknown, but full-length protein. The 137 aa protein (136 aa in mouse) has a clear signal peptide and several hydrophobic stretches. ESTs GB:N30852 and GB:AA044232 are polyadenylated, but unspliced forms of the cDNA. The single intron is located just downstream of the two methionines at ATGATGG|TG. Both ATGs might function as a start codon. In the murine sequence only the second methionine is present. The upstream (5') ends of the two unspliced clones (GB:N41379 and AA044371) are probably located in an intron. The position of the intron in the mouse sequence seems to be conserved. It is present in five EST clones, whereas the spliced form is represented by more than ten clones. However, the murine situation is more complex, because the unspliced variants contain CG instead of the expected AG in an otherwise acceptable splice acceptor site, although the site contains an AG dinucleotide 5 nt upstream of the CG. To complicate matters even more, one splice variant (GB:W67111) not only lacks the "intron", but also the first 3 nucleotides, TAG, of the downstream exon. The borders of the murine intron are ATGG|GTAACC.....polyYGTAGCCTCG|TAG|TCGG. Since the splice acceptor site at the identical position in the human ESTs is absolutely perfect, my present interpretation is that some of the murine alleles carry a mutated splice acceptor site which leads to defective/alternative splicing. The 80 bp insert sequence contains an in-frame stop codon, but no ATG codon, which suggests that a protein cannot be translated from these mRNAs.
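The description of the murine acceptor site can be checked directly against the border sequence quoted above (...GTAGCCTCG|TAG|TCGG). The Python sketch below is only an illustration of that reading: the function name and the returned keys are invented here, and the input is limited to the few nucleotides given in the text.

```python
def describe_acceptor(intron_end: str, exon_start: str) -> dict:
    """Summarise AG dinucleotides around an annotated intron/exon boundary.

    intron_end : last nucleotides of the intron (5' to 3')
    exon_start : first nucleotides of the downstream exon (5' to 3')
    """
    intron_end, exon_start = intron_end.upper(), exon_start.upper()
    terminal = intron_end[-2:]
    terminal_start = len(intron_end) - 2
    # Nearest AG upstream of the terminal dinucleotide (excluding it).
    upstream_ag = intron_end.rfind("AG", 0, terminal_start)
    # First AG inside the exon; splicing there would remove the exon bases up to it.
    downstream_ag = exon_start.find("AG")
    return {
        "terminal_dinucleotide": terminal,
        "canonical": terminal == "AG",
        "upstream_ag_distance": None if upstream_ag == -1 else terminal_start - upstream_ag,
        "exon_nt_lost_at_next_ag": None if downstream_ag == -1 else downstream_ag + 2,
    }


print(describe_acceptor("GTAGCCTCG", "TAGTCGG"))
```

On these inputs the intron ends in CG rather than AG, the nearest upstream AG starts 5 nt before the CG, and using the first AG inside the exon would remove exactly the 3 nt TAG, as in GB:W67111.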
WI-15199. Last revised/checked: June 4, 1997 Most recent EST: GB:H46842 Unigenelink Sequence name: wi15199.seq Covered by only two ESTs. No similarities to other genes or proteins.
TIGR-A004W05. Last revised/checked: June 13, 1997 Unigenelink Sequence name: wi6903.seq Identical to WI-6903, which is downstream from MUC1 (WI-5995). Encodes a family-member of the rat SCAMP (Secretory Carrier Membrane Protein 37) gene. For additional information see at WI-6903.
WI-4536. Last revised/checked: May 12, 1997 Most recent EST: GB:AA337698 Unigenelink Sequence name: wi4536.seq Genomic sequence: wi4536g.seq Other species: pig4536.seq Most recent EST: GB: Protein sequence: wi4536.pep A random genomic STS, that seems to overlap with an exon. The sequence is similar to pig EST GB:F15007. The genomic sequence contains a good splice acceptor site at the position of divergence with the cDNA sequence. WI-4536 has been mapped to YACs 717-C-3, 955-E-11 and 736-H-10 and is present in the YAC-contig that also contains IB3262 and WI-9711 (see below and directly below at RP_S27_2).
RP_S27_2. Last revised/checked: June 25, 1997 Sequence name: wi9711.seq Genomic sequence: wi9711g.seq Based on GB:L19739, which encodes ribosomal protein S27 or metallo-panstimulin. See WI-9711 below for additional details.
WI-15073. Last revised/checked: July 2, 1997 Unigenelink Sequence name: wi17491.seq From the same gene, but from another position in the cDNA as WI-17491 (see below for more information).
SGC32326. Last revised/checked: Jan 30, 1997 Most recent EST: GB:AA122417 Unigenelink Sequence name: sgc32326.seq Genomic sequence: Protein sequence: sgc32326.pep Other species: mus32326.seq Most recent EST: GB:AA199119 Protein sequence: mus32326.pep The 2.1 kb sequence encodes a protein with the highest similarity to sp:P38753, a hypothetical yeast protein with an SH3 domain. Several other SH3-domain containing proteins also feature on the "hit-list", such as Grb2. The SGC32326 protein sequence is highly related to the protein encoded by the cDNA contig represented by GB:L49705. I have also compiled a contig of this transcript (s32326b.seq), which is available upon request.
WI-16548. Last revised/checked: June 26, 1997 Most recent EST: GB: Unigenelink Sequence name: wi16548.seq Genomic sequence: wi16548g.seq Protein sequence: wi16548.pep Other species: mus16548.seq Most recent EST: GB: Protein sequence: mus16548.pep Another novel S100-like protein, just like WI-17060. WI-16548 is most similar to S100B, which is on chromosome 21; S100A1 is the most related member on 1q21. There are at least 10 S100 genes on this part of the chromosome, most of which have been physically mapped on YAC contigs. These genes are part of the Epidermal Differentiation Complex (EDC), which also contains other genes expressed in the epidermis. Few of these genes are present on the RH map, possibly because they were already mapped by other means. The present human sequence contains two alternative non-coding first exons; the mouse sequence even contains an additional 5' extension. These 5' UTRs are quite GC-rich and it is likely that the sequence contains some mistakes, as the various EST sequences do not agree very well at some points. It is possible that one of the murine 5' UTRs represents a transcription start in an intron or a genomic fragment, since the last part of this sequence is C/T-rich and ends in AG. The human EST GB:AA038823 lacks exon 2, and GB:T98187 also skips part of the mRNA. The markers WI-8650 (GB:G11630) and CHLC.GCT15E11.P17446 (see also below under GCT15E11) are also derived from this transcription unit. The cDNA contains a (CAG)n (=GCT) element, and this part of the cDNA is also present in the genomic sequence GB:U23859. WI-8650 has been mapped to YAC 955-E-11, which also contains D1S2346, WI-6071 (=S100A9, calgranulin B), IB3262, D1S2463 (=WI-4536, see above), WI-9711 (see below), D1S2418 (=WI-9245, SPRR), D1S2400 (=WI-7842, SPRR, see above), GCT15E11 (see below), and possibly WI-8190.
TIGR-A002G29. Last revised/checked: July 1, 1997 Most recent EST: GB:AA220223 Unigenelink Sequence name: t002g29.seq Genomic sequence: Protein sequence: t002g29.pep Other species: m002g29.seq Most recent EST: GB: Protein sequence: m002g29.pep A 1.7 kb cDNA that is covered by more than 100 matching sequences in dbEST. Many of the ESTs are claimed to be similar (in this case: identical) to GB:M35718, the fibroblast growth factor receptor BFR2 or K-sam. Although this is true, this is likely to be due to a stray EcoRI fragment in the cDNA sequence of the K-sam clone, since this fragment is not present in several other Genbank files that contain full-length cDNAs of the same receptor, and that are otherwise identical in sequence. The ORF of TIGR-A002G29, which is most likely full length, shows some similarity to bacterial 6-phosphogluconate dehydrogenases. The best match, which is still quite weak, is with the E. coli enzyme (GP:U14430). The conclusion that the open reading frame is probably full length is mainly based on the observation that the level of similarity to the murine sequence drops dramatically upstream of an ATG-codon, that is in a good Kozak-consensus (GCCATGG). There are at least four exons in this gene, since several ESTs skip one (e.g. GB:T32028) or even two exons (GB:Z42265). The latter event results in a frame-shift. One of the mouse ESTs (GB:W51350) is alternatively spliced and lacks the second human exon skipped in GB:Z42265. The contig is almost identical in sequence to a cDNA derived from glioblastoma cells, that is present in a Japanese patent application (GB:E08542 and GB:E08543). No details on its function were given. The cDNA also covers STS SHGC-11135 (GB:G13549). One EST, za86a07 (r1= , s1=GB:N76101), is quite remarkable with regard to its structure; the s1-file covers the 5' end of SGC33740 (see below) in the same orientation, the r1-file covers the 5' end of TIGR-A002G29, also in the same orientation. 
This is only possible if the clone contains both genes, if TIGR-A002G29 is present in an intron of SGC33740, or if the clone is scrambled in some way. Again, as in previous cases, the scrambling option seems the most likely, but it also suggests that the two genes are probably neighbors in the genome.
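The remark that the TIGR-A002G29 ATG lies in a good Kozak consensus (GCCATGG) can be made concrete with a small check of the two positions usually considered most important: a purine at -3 and a G at +4, counting the A of the ATG as +1. The Python sketch below assumes that convention; the function name and the strong/adequate/weak labels are illustrative choices, not something taken from the analysis described here.

```python
def kozak_strength(context: str) -> str:
    """Classify the context of the first ATG in `context`.

    Only two positions are scored: a purine (A or G) at -3 and a G at +4,
    with the A of ATG counted as +1.
    """
    seq = context.upper()
    atg = seq.find("ATG")
    if atg < 3 or atg + 3 >= len(seq):
        raise ValueError("need at least 3 nt before and 1 nt after the ATG")
    purine_minus3 = seq[atg - 3] in "AG"
    g_plus4 = seq[atg + 3] == "G"
    if purine_minus3 and g_plus4:
        return "strong"
    if purine_minus3 or g_plus4:
        return "adequate"
    return "weak"


print(kozak_strength("GCCATGG"))  # strong
```

In GCCATGG the -3 position is G and the +4 position is G, so both scored positions match.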
SGC34058. Last revised/checked: June 17, 1997 Most recent EST: GB:AA411756 Unigenelink Sequence name: sgc34058.seq Genomic sequence: s34058g.seq Protein sequence: sgc34058.pep Other species: mus34058.seq Most recent EST: GB:AA387333 Genomic sequence: m34058g.seq Protein sequence: mus34058.pep Other species: rat34058.seq Most recent EST: GB:H32493 Protein sequence: rat34058.pep Other species: dro34058.seq Protein sequence: dro34058.pep This STS-marker is located in a very complex region that consists of two overlapping genes, which may both have alternative 3' ends. The two 3' ends of SGC34058 are about 500 bp apart. The STS itself is present at the 3' end of the longer of the two mRNA-species in this gene. Both transcripts overlap with the transcription unit defined by STS WI-16155 (see below). In fact, the two last introns of both genes do almost completely overlap, although the coding regions are clearly separate. A clear polyadenylation signal is lacking from the WashU/Merck s-files that are derived from the longer SGC34058 transcript. The sequences that normally would be expected to "begin" with a polyA-tail, "start" at different positions within an approximately 100 bp region. It cannot at present be excluded that these clones are in fact genomic in origin, but I cannot identify an A-rich sequence motif that might explain the presence of such a large number (there are more than 10 such clones) in various cDNA-libraries. One clone, GB:AA279439, contains a diverging A/T rich sequence at its 5' end, which would correspond to the 3' end of the putative transcript, but it is at present unclear whether this AT-rich region is really present at this position. Regarding splicing, the situation is also rather complex. Several human clones contain what seems to be an intron sequence, and the intron seems to contain two different splice acceptor sites, one of which will result in a truncated protein as a result of a frame-shift.
This upstream splice site is used in at least two cDNAs, although a poly-pyrimidine tract at this splice site is not very obvious. The most likely explanation for the presence of these cDNAs is that this splice site is used because it represents the first AG downstream from the branch point sequence, which has been defined by the polypyrimidine tract of the very nice second splice acceptor site, which is located 25 bp downstream. This second splice site is also used in the murine sequence, but some unspliced forms (containing an upstream AG as well) are also found among the murine cDNA clones. Both murine clones (GB:W80183 and GB:AA254293) are in the reverse orientation, however, which might indicate that they are derived from the mus16155 transcription unit or are genomic in origin. The SGC34058 cDNA is presently 2.3 kb in length. It encodes a protein that is quite similar to the recently identified Drosophila misato protein, which has tubulin- and myosin-like motifs and is involved in cell division (Miklos et al., PNAS 94, 5189-5195 (1997); GB:U80043). Other tubulin beta-like sequences are also identified in a database search. The alternative splice site upstream of the last exon would of course lead to a different carboxyl-terminus. Although Miklos et al. claim that there is no true counterpart of this protein in yeast, the human protein readily identifies the hypothetical protein YMR211w (pir:S55093) as the next-best match.
WI-9711. Last revised/checked: June 26, 1997 Most recent EST: GB: Unigenelink Sequence name: wi9711.seq Genomic sequence: wi9711g.seq Protein sequence: wi9711.pep Other species: mus9711.seq Most recent EST: GB: Genomic sequence: mus9711g.seq Protein sequence: The original STS is a genomic fragment that covers part of the gene encoding the ribosomal protein S27 (GB:U57847), also known as metallopanstimulin (GB:L19739). This gene also covers marker WI-8593 (GB:G06949). WI-9711 has also been used for the YAC-STS mapping project of the Whitehead lab; it maps to the same YACs (717-C-3, 736-H-10, 955-E-11) as IB3262 (not included in the RH map) and WI-4536, which is included in the RH map and which has been mapped close to the other STS marker representing WI-9711, RP_S27_2 (see above). Usually, when one starts screening the various databases using a full-length cDNA sequence, one comes across a few variant clones, which often represent splice variants or alternatively polyadenylated forms. In this case, however, the situation turns out to be almost bizarre. First of all, it turns out that a processed pseudogene is present on Chr7q31 on the BAC clone GS274A07 (GB:AC002075). This sequence is more than 95% identical. It contains a deletion of one amino acid and two amino acid substitutions with respect to the gene sequence. It is probably not expressed, because there are no ESTs matching it, whereas there are many ESTs matching the WI-9711 sequence. A second problem is that the last 100 bp of the WI-9711 sequence match the 5' ends of both the cDNA (GB:M31520) and the genomic (GB:U12202) sequence of the human ribosomal protein S24 in the reverse orientation. This does not seem possible; it may be that the cDNA clone of the S24 gene was chimeric and that the genomic sequence has not been determined entirely independently. There are many WI-9711 sequences crossing the putative junction, meaning that an error in the WI-9711 cDNA sequences can be ruled out.
Finally, the splicing pattern of the human and mouse cDNAs in this part of the cDNA seems rather wild. Whereas the second intron in the human genomic sequence is only derived from the original STS, which was genomic in origin, the first intron is represented in many EST clones. In the human situation there seem to be three classes of clones: one type skips the entire intron and splices ATGCCT|GTGAGT to TCTTAG|CTCGCA (where the ATG in the first exon is the start codon); a second type seems to use an alternative splice donor site in the intron, with the sequence TCCCCG|GTGTGT (the T in position 4 in the intron makes this a non-consensus splice site). Finally, a third type does not remove the intron at all. In the murine clones, the situation is even more complex. All the human splice events can also be traced among the murine ESTs, but there is one additional complication: there are murine clones that use an alternative splice acceptor site in the intron, which is 24 bp (8 aa) upstream from the other acceptor sites. The upstream site ends with TAAG|TC, the downstream one with TTAG|CT, and both motifs are preceded by a good polypyrimidine tract. Based on the level of conservation between the human and the mouse sequence and on the pattern of mismatches, I would not be surprised if part of the intron was coding. Translation may start at the upstream ATG or it may start at an ATG that is present in the intron itself. Some of the protein products might be translated in another reading frame. This clearly needs to be studied with real experiments. Anyone interested?
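The comparison of the two human donor sites can be reproduced by lining up the first six intron bases against the usual donor consensus GTRAGT (R = A or G). The Python sketch below does exactly that; the consensus string, the small IUPAC table and the function name are assumptions made for illustration and are not taken from the analysis above.

```python
# Minimal IUPAC table: only the codes needed for the donor consensus.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG"}

# Commonly quoted consensus for the first six intron bases of a splice donor.
DONOR_CONSENSUS = "GTRAGT"


def donor_mismatches(junction: str) -> list:
    """Return 1-based intron positions that deviate from the donor consensus.

    `junction` is written as exon|intron, e.g. "ATGCCT|GTGAGT".
    """
    _exon, intron = junction.upper().split("|")
    intron = intron[: len(DONOR_CONSENSUS)]
    return [
        position
        for position, (base, code) in enumerate(zip(intron, DONOR_CONSENSUS), start=1)
        if base not in IUPAC[code]
    ]


print(donor_mismatches("ATGCCT|GTGAGT"))  # []
print(donor_mismatches("TCCCCG|GTGTGT"))  # [4]
```

The second site deviates only at intron position 4, in line with the remark about the T at that position. The 24 bp offset between the two murine acceptor sites is a multiple of three (24 / 3 = 8 codons), so using either site keeps the same reading frame.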
WI-17491. Last revised/checked: July 2, 1997 Most recent EST: GB:AA378659 Unigenelink Sequence name: wi17491.seq Genomic sequence: wi17491g.seq Protein sequence: wi17491.pep Other species: mus17491.seq Most recent EST: GB:AA407990 Genomic sequence: m17491g.seq Protein sequence: mus17491.pep A 1.0 kb sequence. The same gene as covered by WI-15073 (see above), which is more to the 3' end of the cDNA than WI-17491. GB:W23551 skips an exon. The same exon is present in the unspliced, but polyadenylated EST GB:AA017325, which features, together with some additional genomic clones, in the genomic sequence wi17491g.seq. The genomic sequence now contains one complete and one partial intron. The first intron is also present in the murine clone GB:AA389117. The partial open reading frame of WI-17491 (the 5' end of the contig is GC-rich and the problems are not yet completely resolved) shows a weak similarity to the yeast myosin-like protein (GP:L01992). The extreme 3' end of the WI-17491 mRNA probably overlaps with the 3' end of another gene (WI-17491b) that is identical to NF-AT (or NF45), a transcription factor (GB:U10323). The overlap is only about 60 bp and the two polyadenylation signals are contained in the sequence tTTTATTcAATAAAt. The STS SHGC-11878 (GB:11259) is located in the NF-AT cDNA. The NF-AT cDNA has also been mapped to 1q21.
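The statement that two polyadenylation signals sit in tTTTATTcAATAAAt can be illustrated by scanning both strands for the canonical AATAAA hexamer. This is a minimal sketch; that the two signals lie on opposite strands is an interpretation, based on the entry's description of two genes whose 3' ends overlap, and the function names are hypothetical.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string (case is preserved)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def polya_signals(seq, signal="AATAAA"):
    """Return (0-based offset, strand) for each hit of `signal`.

    A "-" hit means the reverse complement of `signal` was found on the
    given strand, i.e. the signal itself reads on the opposite strand.
    """
    s = seq.upper()
    rc_signal = reverse_complement(signal)
    hits = []
    for i in range(len(s) - len(signal) + 1):
        window = s[i:i + len(signal)]
        if window == signal:
            hits.append((i, "+"))
        if window == rc_signal:
            hits.append((i, "-"))
    return hits


# Sequence quoted in the WI-17491 entry above.
print(polya_signals("tTTTATTcAATAAAt"))
```

In this short stretch the scan finds TTTATT (the reverse complement of AATAAA) at offset 1 and AATAAA at offset 8, so each strand carries one candidate signal, separated by a single base.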
SGC34808. Last revised/checked: May 27, 1997 Unigenelink Sequence name: wi6812.seq Probably the 3' UTR of the human IL6 receptor and part of the same cDNA as WI-6812 (see below), but the STS itself is far from STS WI-6812 (GB:G05547) in this contig.
WI-6812. Last revised/checked: May 27, 1997 Most recent EST: GB:AA381253 Unigenelink Sequence name: wi6812.seq The present sequence is more than 2.2 kb in length and probably represents the 3' UTR of the human IL6 receptor, since it is similar to the same region of the rat IL6 receptor (GB:M58587), which is several kb long (see also SGC34808 above). The sequence of the 3' UTR of the human IL6 receptor cDNA is not available in the database. The IL6 receptor gene is known to be located in the 1q21 region, however. Interestingly, the IL6 receptor gene is most related to the CNTF receptor and the IL11 receptor, which are both located on Chr9 and lie in close proximity to each other. The IB3580 gene, which is present on the Whitehead YAC-STS map and which is represented by SGC33740 on the RH map (see directly below), also has a counterpart on Chr9. This suggests that part of 1q21 and part of Chr9 have a common evolutionary origin. They might, for example, be the result of an ancient duplication event. WI-6812 alias D1S2375 has been mapped to YACs 778-H-2 (D1S303, WI-7978) and probably 811-H-6 (WI-7978).
SGC33740. Last revised/checked: May 26, 1997 Most recent EST: GB:AA410854 Unigenelink Sequence name: sgc33740.seq Genomic sequence: s33740g.seq Protein sequence: sgc33740.pep Other species: mus33740.seq Most recent EST: GB:AA403904 Protein sequence: mus33740.pep This sequence is a very complex transcription unit. The present cDNA contig contains GB:D63478, the KIAA0144 myeloblast mRNA. The protein encoded by this mRNA is extremely Ser/Thr-rich and is highly conserved in the mouse. However, it turns out that the cDNA present in the database entry is just one of a number of possible splice variants. At present, there seem to be three different 5' ends that diverge from one another at the same position in the cDNA, each represented by a single EST. The GB:D63478 5' end is represented by GB:N76101, which extends more to the 5' end. It seems likely that this clone is a (truncated) genomic clone, since it is an s1-clone (which normally corresponds to the 3' end of a cDNA) and because the other end of this clone represents the 5' end of the TIGR-A002G29 contig (see above). Other 5' ends are represented by GB:AA341150 and GB:AA247130. Both clones diverge from the GB:D63478 sequence 5' from the sequence TAGGCGCAG|TATTCTACC. The unique parts are very GC-rich, but are not related to one another. At the 3' end the situation is also complex. Several ESTs end at the same polyadenylation signal as GB:D63478, but a large number are derived from an alternative 3' end that is reached by alternative splicing. This alternative 3' end is also present in the murine sequence. Since the splicing takes place in the open reading frame of GB:D63478 (at the sequence GATACAACACTG|GAAGAAAA), the resulting protein will also have an alternative carboxyl-terminus.
However, it has been very difficult to establish the correct reading frame, because the conservation with the mouse sequence is so extremely high that there are many regions that do not contain a single third position change. Nevertheless, frame shifts do occur in this region (based on discrepancies between the ESTs and on the position of third position changes further downstream), and the lack of third position changes makes it hard to identify the exact site of the mistake. Another factor that makes life rather complex is the presence of quite a number of genomic fragments (or of mRNAs that have read through the first polyadenylation signal), which made it difficult at first to discern the meaning of all these sequences. A definitive assignment of the correct reading frame must therefore await the appearance of some additional ESTs in this region. The two Chr1-markers SGC33740 and IB3580 (the latter present on the Whitehead YAC-STS map) are located in the alternative 3' untranslated region. SGC33740, D1S305, and WI-7978 (NIB241) are all present on the same YACs (950_e_2 and 951_f_6) and have very similar RH coordinates, suggesting that they are really quite close to one another. A highly related sequence is present on Chr9. Its sequence is available upon request.
SGC32043. Last revised/checked: June 26, 1997 Most recent EST: GB:R38652 Unigenelink Sequence name: sgc32043.seq A 0.96 kb sequence with a single gap. Only represented by the 5' and 3' ends of two EST clones. No clear similarity to other sequences, except for the last 300 bp, which are derived from a repeated sequence, that is present several times in the database.
D1S305. Last revised/checked: Dec 9, 1996 Sequence name: d1s305.seq One of the few genetic markers in this region that has also been mapped using radiation hybrids. It is also known as AFM220xf8. This marker has also been used in the YAC-STS project and maps to the same doubly-linked contig as WI-7978 (directly below) and IB3580 (=SGC33740).
WI-7978. Last revised/checked: June 30, 1997 Unigenelink Sequence name: nib241.seq The DRADA gene. See below at NIB241. WI-7978 is present in the YAC-STS map of the Whitehead lab and is present on the same YACs (950-E-2 and 951-F-6) as IB3580 (SGC33740) and D1S305. Both markers are present in the RH map as well. Other YACs that contain the WI-7978 gene are 778-H-2 (D1S2375 (=WI-6812), D1S303) and 811-H-6 (WI-6812).
SGC32441. Last revised/checked: June 17, 1997 Most recent EST: GB:F11455 Unigenelink Sequence name: sgc32441.seq A 0.65 kb sequence, that consists of the 5' and 3' end of a single EST clone. The lower strand might encode a protein with some similarity to RING proteins, but this observation is not very trustworthy.
NIB241. Last revised/checked: June 30, 1997 Unigenelink Sequence name: nib241.seq Genomic sequence: nib241g.seq Protein sequence: nib241.pep Identical to the double-stranded RNA adenosine deaminase gene (DRADA; GB:U18121 and GB:U10439), which is also covered by WI-7978 (see that entry for some details on the position of this gene in the Whitehead YAC-STS map). A 6.5 kb mRNA in total. The genomic sequence contains additional 5' and 3' sequences based on GB:U32571 and GB:U32347 and some genomic sequences based on various EST sequences. The EST yv25d07 is most probably a genomic sequence, whereas the status of AA:057407 and AA449679 is less clear. They contain the same sequence divergence and are both in the reverse orientation, but the site of sequence divergence is not compatible with a splice acceptor site. Finally, GB:AA096321 skips a 2 kb part of the mRNA, possibly as a result of an internal deletion: the sequence CTTCT is present at both ends of the deletion, suggesting internal recombination.
GCT15E11. Last revised/checked: June 26, 1997 Sequence name: wi16548.seq A simple sequence repeat from CHLC that is part of the WI-16548 cDNA (see above). It is possibly located on YAC 955-E-11, which is also positive for a number of other 1q21 STSs, such as D1S2346, WI-8650, WI-6071 (S100A9), IB3262, WI-4536, WI-9711, WI-9245, WI-7842 (the latter two STSs are derived from SPRR genes), and WI-8190.
WI-12606. Last revised/checked: Dec 12, 1996 Most recent EST: GB:AA137125 Unigenelink Sequence name: wi12606.seq Genomic sequence: The THBS3 or thrombospondin3 gene (GB:L38969). This gene is known to reside about 3 kb upstream of MUC1 (WI-5995; GB:J05581; see directly below) in the same transcriptional orientation (Vos et al., J. Biol. Chem. 267, 12192-12196 (1992)). Several ESTs (ym97f11, yo38e08 and yo36f07) are unspliced and in the reverse orientation with respect to the mRNA and probably represent genomic DNA fragments.
WI-5995. Last revised/checked: Sept 20, 1996 Most recent EST: GB: Unigenelink Sequence name: wi5995.seq Genomic sequence: The MUC1 gene (GB:J05581), present 3 kb downstream of the THBS3 gene (WI-12606; GB:L38969; see directly above). Also mapped to YAC 887_h_8 by the Whitehead people. This YAC is also positive for GATA-P19287 (=GATA85H08), IB1251 and is probably positive for D1S2358 (WI-6296), IB708 (not RH mapped) and WI-7160 (the glucocerebrosidase gene GBA), which is about 30 kb from the MUC1 gene (Long et al., Genomics 33, 177-184 (1996)). The MUC1 gene encodes mucin1 or episialin, an O-glycosylated protein that is overexpressed in many carcinomas.
WI-16155. Last revised/checked: June 17, 1997 Most recent EST: GB:AA418434 Unigenelink Sequence name: wi16155.seq Genomic sequence: wi16155g.seq Protein sequence: wi16155.pep Other species: mus16155.seq Most recent EST: GB:AA397083 Protein sequence: mus16155.pep Other species: pig16155.seq A very complex arrangement of genes and polyadenylation sites. At present, the best interpretation seems to be that this marker is located in a region of overlap between the 3' ends of two transcription units. One of the transcription units (wi16155) is defined by only two ESTs, GB:H24035 (ym54a02.s1) and GB:AA056343 (zl66f04.s1). The other transcription unit is SGC34058 (see above). Because there are many ESTs derived from this unit (as judged from their reverse orientation with respect to the WI-16155 sequence) it is still possible to derive the entire last exon from the latter gene. The exon border is defined by the murine EST GB:387333, which is highly similar to the wi16155/sgc34058 sequence downstream from a point in the sequence that is a very good splice acceptor site: CCCTCCTCCTTTCTTAG|GTTTCCC. In addition, the pattern of mismatches between the human and murine sequences clearly indicates that this region is coding on the WI-16155 strand. The murine sequence can be extended quite a bit to the 5' end, although this is in part based on the sequence from an EST, GB:W48478, that does not perfectly match the two files that overlap with it on both sides. The GB:W48478 sequence is derived from a Life Tech mouse cDNA; these for some reason sometimes (certainly not always!) show a number of differences with respect to the sequences from the Merck-mice. This means that this part of the mus16155 sequence should be considered as tentative, although it is not excluded that the final sequence will be very similar to the present one in this region. The putative protein product is not obviously similar to anything in the protein database.
The information given above is likely to be more or less correct; the remaining part, however, is not guaranteed to remain in the contig much longer. The problem is the following: the r1-files of the two s1-files that contain the WI-16155 polyadenylation signal seem to be derived from different cDNAs. Even worse, it is impossible to decide which one is correct, and it is even possible that both sequences are incorrect. The upstream part of the file that has been used for STS WI-16155, ym54a02.r1 (GB:H22755), is part of a cDNA encoding a protein that is clearly conserved in mouse and pig (GB:F15055) and is cys-rich. In fact, there are four repeats of a unit that consists of two cysteines in a CX9C unit that is clearly conserved on other positions as well. The same organization can be found in the yeast (S. cerevisiae) ORF YDR031w protein (gp:Z74327), which encodes a 117 aa hypothetical protein that contains at least three units encoding CX9CIRD/E. The other file that is clearly derived from the same cDNA as STS WI-16155, zl66f04.s1, has an r-file (GB:AA056710) that is present in the cDNA encoding the tetraspan transmembrane protein SFA-1. Neither this cDNA, which has been completely sequenced (GB:D29963/GB:U14650), nor the many overlapping ESTs show clear evidence for alternative splicing/polyadenylation that might connect them to the WI-16155 3' end. The chromosomal localization of SFA-1 is unfortunately not known. The situation for the other candidate, the CX9C protein mRNA, is more complex. In this case, there is evidence for alternative splicing and polyadenylation. Some clones are alternatively spliced and have 3' ends (s1-clones) that end in an L1-like sequence. When I screen this 3' end against the EST database, I get some matching ESTs and even an STS that are clearly identical except for some likely sequence errors. Their scores are clearly above the background of other L1 repeats present in the genome. The STS is present on Chr2.
This would clearly rule out the CX9C protein as a candidate for the 5' end of WI-16155, were it not for two facts: it is possible that I am looking at a very highly related L1 repeat, and one of the few clones that shows such a high similarity is named ym55a02.s1, whereas WI-16155 is named ym54a02.s1. Since I have encountered several instances in which the numbers of two different ESTs had been interchanged, I have obviously tried to see whether ym55a02.r1 is present in SFA-1. This is not the case; rather, it is present in yet another sequence that shows some similarity to a C. elegans protein, but on the bottom strand. I have therefore decided that the situation is too confusing to merit any further work at this stage (this literally takes hours of work). Additional ESTs matching the WI-16155 3' end are required before this frustrating problem can be resolved. For the moment, the CX9C cDNA is still contained in the WI-16155 file. Some additional data on the CX9C protein: The WI-16155 cDNA is also alternatively spliced (e.g. GB:N39730). The present 5' end of the transcript overlaps with three virtually identical CpG island clones, HS34E8R, HS26H3R and HS96E1R (GB:Z60793, GB:Z55221, GB:Z66422, respectively). The other ends (F-files) of these clones, which should be located in a more downstream direction, do not overlap with other database sequences, except each other. The 5' end of wi16155.seq also overlaps with an EST (GB:AA236097) in the reverse orientation. The sequence of this EST is virtually identical to that of the CpG clones, but the similarity abruptly ends at a position that is compatible with a splice acceptor sequence ON THE BOTTOM STRAND. Although this makes the sequence difficult to interpret, the other (s1) end of GB:AA236097 has been included in the genomic wi16155g.seq-file.
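The CX9C unit mentioned above (two cysteines separated by nine residues) is easy to search for with a regular expression. The following is a minimal sketch; the toy peptide is invented for illustration and is not the WI-16155 or YDR031w protein sequence, and the function name is hypothetical.

```python
import re

# Two cysteines separated by exactly nine residues of any kind.
# The lookahead lets overlapping units be reported as well.
CX9C = re.compile(r"(?=(C.{9}C))")


def find_cx9c(protein):
    """Return (0-based start, matched unit) for every CX9C unit."""
    return [(m.start(), m.group(1)) for m in CX9C.finditer(protein.upper())]


# Hypothetical toy peptide with two CX9C units, built so the spacing is obvious.
toy = "MK" + "C" + "A" * 9 + "C" + "GG" + "C" + "S" * 9 + "C" + "K"
for start, unit in find_cx9c(toy):
    print(start, unit)
```

Applied to a real translation such as wi16155.pep, the same scan would list candidate units, which could then be compared by eye with the repeats in the yeast protein; the pattern only encodes the cysteine spacing and none of the other conserved positions.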
WI-15725. Last revised/checked: June 17, 1997 Most recent EST: GB:H04931 Unigenelink Sequence name: wi15725.seq A 1.0 kb sequence with a central gap, that is covered by only two ESTs (four sequences).
WI-6903. Last revised/checked: June 13, 1997 Most recent EST: GB:AA448423 Unigenelink Sequence name: wi6903.seq Genomic sequence: wi6903g.seq Protein sequence: wi6903.pep Other species: mus6903.seq Most recent EST: GB:AA388176 Protein sequence: mus6903.pep Other species: rat6903.seq Related to the rat SCAMP37 gene (GB:L22079), but not its counterpart, since another EST contig is much more closely related to the rat protein. SCAMP stands for Secretory Carrier Membrane Protein. Most of the frame shifts in the human cDNA seem to have been corrected by now, but there is still a region just 5' of the putative stop codon that is relatively low in quality. EST sequences GB:N32438 and GB:AA164368 skip an approximately 80 bp exon in the human sequence. The position of two other introns can be inferred from GB:H25083, which contains an exon bordered by splice consensus sites. WI-6903 has also been mapped on the Whitehead YAC-STS map to YAC 795_h_5, which probably also contains GATA-P19287 (GATA85H08), D1S2358 (WI-6296; directly below), and IB1251. STS TIGR-A004W05 is also a member of the WI-6903 contig, whereas chicken STS ADL277 (GB:G01697) is possibly derived from a similar chicken gene. The WI-6903 protein is also related to a C. elegans protein (gp:AF003739).
WI-6296. Last revised/checked: June 18, 1997 Most recent EST: GB:AA311650 Unigenelink Sequence name: wi6296.seq Genomic sequence: Protein sequence: wi6296.pep Other species: mus6296.seq Most recent EST: GB:AA265237 Genomic sequence: mus6296g.seq Protein sequence: mus6296.pep The human sequence does not show any clear similarities to other sequences in the database. Comparison with the 1.2 kb murine sequence clearly indicates that both contigs contain coding sequences. In fact, even the 3' UTRs are quite well conserved. No clear similarities to other proteins are discernible. A mouse clone (GB:AA198177) contains a sequence that is absent from the human clones and that is bordered by consensus splice sites (CAG|GTCAGG....TTTCTTTCTCTCTCTCTGCAG|GG). This sequence has been removed from mus6296.seq, but is present in mus6296g.seq. The mouse clone GB:AA168722 is probably human in origin, since it is identical to the human sequence (except for some sequencing errors). WI-6296 has also been mapped to various YACs by the Whitehead group. It is probably present (all hits are listed as ambiguous, but these are clearly the best candidates) on 796-H-5, 887-H-8 and 954-A-11. At least one of these YACs contains the markers WI-6903, WI-7160, GATA85H08, IB1251, WI-5995, IB708, D1S305, and WI-7978, many of which are present in the vicinity of WI-6296 (=D1S2358) on the RH map as well.
WI-14846. Last revised/checked: Sept 20, 1996 Most recent EST: GB: Unigenelink Sequence name: wi14846.seq Genomic sequence: Identical to GB:L77213, phosphomevalonate kinase, a 1.0 kb cDNA.
TIGR-A002G08. Last revised/checked: Sept 20, 1996 Most recent EST: GB: Unigenelink Sequence name: t002g08.seq This EST is derived from the DAP-3 gene, which encodes an ionizing radiation resistance conferring protein (GB:X83544, GB:U18321; the C-termini of the proteins in these files differ due to a frame-shift in one of them). The protein has also been identified as a mediator of gamma-interferon-mediated cell death. The mRNA is 1.6 kb long and the protein is about 385 aa.
GATA85H08. Last revised/checked: Sept 20, 1996 Sequence name: gata85h08.seq A CHLC tetranucleotide repeat sequence, close to an Alu repeat. Quite similar to GATA25B02 (GB:G07801), which has been mapped on 1p. This might be an error, but there is some evidence for a duplication of these parts. GATA85H08 is identical to GATA-P19287 in the Whitehead YAC-STS map and maps to the same YAC (876_B_11) as IB1251 (directly below) and WI-8330 (WI-7325) which has been mapped some distance away from the former two markers. D1S303 is probably also present on these YACs.
IB1251. Last revised/checked: Feb 6, 1997 Most recent EST: GB: Unigenelink Sequence name: ib1251.seq This sequence is represented by many ESTs and by GB:D38522 (KIAA0080), which is a 4001 bp long sequence that encodes the last 105 amino acids of a protein that is most similar to mouse and rat synaptotagmin IV, which is 425 aa. None of the ESTs reach into the coding region, which is 3.5 kb upstream from the polyadenylation site. The level of similarity of the rodent sequences to D38522 is not extremely high (59%), which suggests that the D38522 may represent another synaptotagmin gene. The 3' UTR contains an Alu-repeat (pos. 530-866), which has been masked by Ns. IB1251 has also been used for the Whitehead YAC-STS map, and it is located close to D1S303, GATA85H08 and WI-8330/WI-7325. The sequence also covers TIGR-A002N39 (see below).
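Masking a repeat by Ns, as done for the Alu element at positions 530-866, amounts to a simple string substitution. This is a minimal sketch, assuming the quoted positions are 1-based and inclusive (the entry does not state the coordinate convention); the function name and the placeholder sequence are hypothetical.

```python
def mask_region(seq, start, end, mask_char="N"):
    """Replace positions start..end (assumed 1-based, inclusive) by mask_char."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError("region lies outside the sequence")
    return seq[:start - 1] + mask_char * (end - start + 1) + seq[end:]


# Placeholder 1000 bp sequence; the real input would be the ib1251.seq contents.
toy = "ACGT" * 250
masked = mask_region(toy, 530, 866)
print(len(masked), masked.count("N"))
```

Under this convention the masked stretch is 866 - 530 + 1 = 337 bases long, and the sequence length is unchanged, so coordinates downstream of the repeat remain valid for similarity searches.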
SGC34121. Last revised/checked: June 17, 1997 Most recent EST: GB:AA403168 Unigenelink Sequence name: sgc34121.seq Genomic sequence: Protein sequence: sgc34121.pep Other species: mus34121.seq Most recent EST: GB:AA461783 Protein sequence: mus34121.pep The human sequence is 2.1 kb in length and has a single gap, the mouse sequence is 1750 bp long at present. The human EST GB:W17304 has a deletion with respect to GB:AA100623 and several overlapping mouse clones (GB:W59271 and several others). GB:AA100623 also skips an exon, but at another location. EST GB:AA403168 at present seems to be either chimeric or genomic. In the latter case, the 3' end of the intron (the splice acceptor site) would be remarkably like the 3' end of the preceding exon. The putative protein product of the SGC34121 cDNA is very proline-rich and highly conserved between humans and mice. It does show significant similarity to the hypothetical S. pombe protein C30D11.14 (sp:Q09911). The murine sequence consists of two parts, both of which are highly similar to the human sequence, but there is no formal proof that both parts are derived from the same mRNA. Two murine clones, GB:AA138556 and GB:AA154458 show alternative splicing or rearrangements. The latter possibility is quite realistic in this case, because the two ends of the deletion show a high level of similarity, suggesting that a recombination may have occurred.
TIGR-A002N39. Last revised/checked: Feb 6, 1997 Unigenelink Sequence name: ib1251.seq Same sequence as IB1251 above.
SGC30383. Last revised/checked: April 28, 1997 Most recent EST: GB:AA167653 Unigenelink Sequence name: sgc30383.seq Genomic sequence: Protein sequence: Other species: Most recent EST: GB: Protein sequence: A 2.4 kb sequence, but this contig may be incorrectly assembled; its structure is entirely dependent on the correct assignment of the two zq39b01 sequences (GB:AA167653 and GB:AA166633). This EST is the only EST that connects the 5' and 3' parts of the contig. The 3' part is similar, but not identical, to a sequence on Chr11 that is present in the PAC pDJ356d6 (GB:AC002036). It probably represents a mildly repetitive sequence.
SGC34568. Last revised/checked: May 30, 1997 Most recent EST: GB:AA379449 Unigenelink Sequence name: sgc34568.seq A 1.3 kb sequence without any clear characteristics. Most ESTs are derived from a fetal brain library. There is no obvious similarity to other database entries on the protein level.
WI-8330. Last revised/checked: Sept 20, 1996 Most recent EST: GB: Unigenelink Sequence name: wi8330.seq Genomic sequence: The gamma subunit of chaperonin TCP-1 or Cctg (chaperonin containing TCP-1, gamma subunit) (GB:U17104 and GB:X74801). The equivalent mouse cDNA is present in GB:L20509 (matricin) and GB:Z31556. WI-7325, which has been used for the Whitehead YAC-STS map (just like WI-8330), is another STS covered by this cDNA. WI-7325 has been mapped to YACs 876-B-11 (also positive for WI-8330 (this entry), GATA85H08, IB1251, IB708 and D1S303) and 927-B-9 (positive for WI-9029 (lamin A, not present on the RH-map) and IB708).
WI11851. Last revised/checked: March 5, 1997 Most recent EST: GB:AA218694 Unigenelink Sequence name: wi11851.seq Genomic sequence: Protein sequence: wi11851.pep Other species: mus11851.seq Most recent EST: GB:AA208332 Protein sequence: mus11851.pep Other species: rab11851.seq Protein sequence: rab11851.pep A 1.1 kb sequence that is the human counterpart of the rabbit rab25 protein (gb:L03303; sp:P46629). The 213 aa human protein is 93% identical to the rabbit protein. Two ESTs, GB:AA195079 and GB:W25368 show alternative splicing or rearrangements. STS A005X41 (GB:G20582) is also a member of this contig.