1q21 EST Sequence Properties (Release 2.0)
WI-6908.
Last revised/checked: May 29, 1997
Most recent EST: GB:H16889
Unigenelink
Sequence name: wi6908.seq
A 0.9 kb sequence with a central gap. No obvious similarities and
without any long open reading frames. The location of this marker
on 1q is not certain: its RH value is very different from that of the
other 1q markers, but is also quite different from those on 1p. It is
apparently very close to the centromere. TIGR-A006Y17 is also part
of this sequence.
SGC32297.
Last revised/checked: July 2, 1997
Most recent EST: GB:R44888
Unigenelink
Sequence name: sgc32297.seq
This sequence might also be located on 1p; it is in any case
close to the centromere of Chr1. The sequence is 1.1 kb in length,
but it only consists of the 5' (GB:R24291) and 3' (GB:R44888) ends
of a single clone. The 3' end contains an Alu repeat and a MER20
repeat, which have been masked by Ns. No similarities to coding
sequences.
WI-8997.
Last revised/checked: Feb 4, 1997
Most recent EST: GB:AA031328
Unigenelink
Sequence name: wi8997.seq
One of the three FCGR1 genes, coding for the high affinity Fc gamma
receptor. There are three highly related FCGR1 genes mapping around
the centromere, one on 1p12, two on 1q21 (Maresco et al., Cytogenet.
Cell Genet. 73, 157-163 (1996)). WI-8997 is supposed to be identical
to the 3' end of the FCGR1A gene (GB:M91645), which has been mapped
on Chromosome 1q21, compatible with the present localization. The B
and C forms are 99 and 98% similar over the 150 bp covered by WI-8997
(1 and 2 bp difference, respectively), but ESTs representing the
B form lack certain parts that are present in the A form. The mapping
of the marker to the position closest to the centromere is possibly
due to the fact that the gene in that position will be
present in the largest number of radiation hybrids.
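The similarity figures quoted for the B and C forms follow from the mismatch counts. A minimal sketch of that arithmetic, assuming an ungapped alignment (the function name is hypothetical and not part of the original analysis):

```python
def percent_identity(aligned_length: int, mismatches: int) -> float:
    """Percent identity of an ungapped alignment (assumption: no gaps)."""
    if aligned_length <= 0:
        raise ValueError("aligned_length must be positive")
    if not 0 <= mismatches <= aligned_length:
        raise ValueError("mismatches must lie between 0 and aligned_length")
    return 100.0 * (aligned_length - mismatches) / aligned_length


# 150 bp covered by WI-8997, with 1 (B form) and 2 (C form) mismatches
for form, mismatch_count in (("B", 1), ("C", 2)):
    print(form, round(percent_identity(150, mismatch_count), 1))
```

This gives about 99.3% and 98.7%, which the entry quotes as 99 and 98%.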
SGC30600.
Last revised/checked: May 29, 1997
Most recent EST: GB:AA338960
Unigenelink
Sequence name: sgc30600.seq
A 1.0 kb sequence with a single gap based on the ends
of three EST inserts. The sequence does not show any clear
similarities to other database entries.
WI-12966.
Last revised/checked: May 21, 1997
Most recent EST: GB:AA321779
Unigenelink
Sequence name: wi12966.seq
A 1.3 kb sequence that probably mainly or entirely consists
of 3' UTR. Contains a masked Alu-repeat, which has been crossed.
One clone (yw91g08) ends in an L1 repeat. There are two polyA
addition sites: the 5'-most has the canonical AATAAA, whereas the 3' site
contains the much less common AATACA signal. Two sequences,
GB:F02101 and GB:F02102 seem to diverge from the consensus-sequence
of the contig at their downstream ends, which suggests that
alternative splicing occurs at the sequence TTTAG|CTTTT, but the
stretch of alternative nucleotides is quite short. TIGR-A001Z01
(GB:G19805) is also part of this contig.
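Locating candidate polyadenylation signals such as the two described for this contig can be sketched as a plain string scan. The hexamer list below contains only signals named in this document; treating exactly these as signals, and the function name, are assumptions for illustration.

```python
# Hexamers named in this document; the choice of this set is an assumption.
POLYA_SIGNALS = ("AATAAA", "ATTAAA", "AATACA", "AATATA")


def find_polya_signals(seq: str, signals=POLYA_SIGNALS):
    """Return sorted (0-based position, hexamer) pairs for every occurrence."""
    seq = seq.upper()
    hits = []
    for hexamer in signals:
        start = seq.find(hexamer)
        while start != -1:
            hits.append((start, hexamer))
            # continue one base further so overlapping hits are not missed
            start = seq.find(hexamer, start + 1)
    return sorted(hits)


# Made-up toy sequence, NOT taken from wi12966.seq
toy = "GGCTAATAAAGCTTGCCGTACGGAATACATTTGC"
print(find_polya_signals(toy))
```

In a real contig one would additionally check that each hit lies a short distance upstream of an observed polyA tail.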
D1S442.
Last revised/checked: May 1, 1997
Sequence name: D1S442.seq
A polymorphic marker, not known to contain coding sequences.
Not similar to anything else. This marker has also been mapped
on the genetic map.
SGC33871.
Last revised/checked: July 1, 1997
Most recent EST: GB:AA479148
Unigenelink
Sequence name: sgc33871.seq
Genomic sequence:
Protein sequence:
Other species: mus33871.seq
Most recent EST: GB:AA423669
Protein sequence:
Not part of a known gene, although part of the contig is clearly
coding based on the similarities to several mouse ESTs, one of which
skips an exon. The cDNA also contains tigr-005K39 and Bdab5d06. The
sequence has a length of 1.4 kb with one gap.
WI-17395.
Last revised/checked: May 29, 1997
Most recent EST: GB:N39874
Unigenelink
Sequence name: wi17395.seq
Not part of a known gene and no rodent similarities. 0.9 kb of
sequence with a gap and a repetitive part (which I have partially
removed; the borders of this region still show detectable similarity
to a few repeats).
NIB736.
Last revised/checked: May 29, 1997
Most recent EST: GB:H17808
Unigenelink
Sequence name: nib736.seq
A 0.9 kb sequence with a central gap and without any clear
characteristics. SHGC-3211 is also part of the contig.
WI-8000.
Last revised/checked: June 2, 1997
Most recent EST: GB:AA420558
Unigenelink
Sequence name: wi8000.seq
Genomic sequence: wi8000g.seq
Protein sequence: wi8000.pep
Other species: mus8000.seq
Most recent EST: GB:AA466243
Protein sequence: mus8000.pep
Other species: rat8000.seq
Identical to the cDNA encoding a "brain-expressed HHCPA78
homolog", GB:S73591, now named VDUP1 for Vitamin D3 upregulated
protein 1. It encodes a 391 aa protein (GP:S73591), which is
highly similar to the proteins encoded by the rat N27 cDNA gb:U30789
and the Mustela vison cycloheximide-induced cDNAs gb:U13891 and
gb:U13888. The latter two sequences are probably derived from the
same mRNA. The protein is also similar to several C. elegans proteins.
The previously mentioned HHCPA78 does not feature on the list of
protein hits. Remarkably, the rat cDNA-sequence is about 750 bp
longer at the 5' end. The WI-8000 cDNA is covered by over 250 EST
sequences. I have therefore not attempted to screen this huge list
for splice variants and genomic clones. What has been done, however,
is to screen the WI-8000 protein sequence against the EST database. It
appears that many s1-clones start on an internal (coding) A-rich sequence
AAAAAAGAAAAGAAA. A related sequence is encoded by ESTs GB:H08947 and
GB:W68215 (among several others). EST za87b01 is alternatively spliced
or internally deleted. Rat (GB:C06727 and GB:H32712) and porcine
(GB:Z81181) clones are also available. The WI-8000 cDNA is also covered
by STS TIGR-A002G31 (see below) and Bda44g03 and tigr-A002O32,
which have not been mapped by Whitehead and are not contained in dbSTS.
b-44g03 is among the EST hits, however.
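Internal priming sites like the A-rich stretch mentioned above can be flagged with a sliding window that counts adenines. This is only a sketch: the window size, the threshold and the function name are assumptions, and the flanking bases in the example are made up.

```python
def a_rich_windows(seq: str, window: int = 15, min_a: int = 12):
    """Return 0-based starts of windows holding at least min_a adenines."""
    seq = seq.upper()
    if window <= 0 or len(seq) < window:
        return []
    starts = []
    count = seq[:window].count("A")
    for start in range(len(seq) - window + 1):
        if start > 0:
            # slide by one base: add the entering base, drop the leaving one
            count += (seq[start + window - 1] == "A") - (seq[start - 1] == "A")
        if count >= min_a:
            starts.append(start)
    return starts


# The A-rich motif from the text, embedded in made-up flanking sequence
example = "GCGTC" + "AAAAAAGAAAAGAAA" + "CTGAC"
print(a_rich_windows(example))
```

A window that exactly covers the motif holds 13 adenines out of 15, so it passes the assumed threshold of 12.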
WI-15443.
Last revised/checked: June 2, 1997
Most recent EST: GB:AA333799
Unigenelink
Sequence name: wi15443.seq
Genomic sequence: wi15443g.seq
Protein sequence: wi15443.pep
Other species: mus15443.seq
Most recent EST: GB:AA250187
Protein sequence: mus15443.pep
Other species: bru15443.seq
Most recent EST: GB:AA161573
A 1.8 kb sequence that shows a strong similarity to the putative
C. elegans protein C11H1.2 (gp:Z70205) and several other predicted
C. elegans and yeast proteins, that are all relatively hydrophobic
in character. The human protein is at least 250 aa long. Highly
related proteins from mouse, from Drosophila (GB:AA391343) and
from the filarial parasite Brugia malayi can also be found in
the EST database. The 5' end of the WI-15443 sequence is quite
similar to the sequence of the CpG-clone 45G1 (gb:Z61133), but
the number of mismatches makes it likely that this sequence is
derived from a related gene, somewhere else in the genome.
Several EST clones contain an L1-like sequence; other clones
(yf21d07) contain an Alu-repeat. Several other clones contain
intronic sequences. GB:R08969 and GB:AA090641 diverge from one
another at a probable splice site. One of the two may be
an intron, the other should then be an alternatively spliced
exon. GB:D20759 skips an exon, which seems to be located in
the 3' UTR. Splicing in the 3' UTR is very uncommon. The
alternative exon might also represent a rarely spliced
intron, since it ends with a good acceptor site (C/TnCAG),
but it starts with the sequence AG|GCCAGT. GC-donor sites
are not unheard of, though. One of the s1-clones, that
normally start on the polyA-tail at the 3' end of the mRNA,
has primed on an A-rich sequence toward the 5' end of the
contig.
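The splice-site reasoning above can be illustrated with a small check on junctions written in the `LEFT|RIGHT` notation used in these notes. The helper names are hypothetical, and only the two intronic bases next to the junction are inspected, which is a simplification of a real splice-site consensus.

```python
def donor_class(junction: str) -> str:
    """Classify a donor junction written as 'EXON|INTRON'."""
    _, intron = junction.upper().split("|")
    first_two = intron[:2]
    if first_two == "GT":
        return "canonical GT donor"
    if first_two == "GC":
        return "GC donor (less common)"
    return "non-consensus donor"


def acceptor_ends_with_ag(junction: str) -> bool:
    """Check that the intron side of an 'INTRON|EXON' junction ends in AG."""
    intron, _ = junction.upper().split("|")
    return intron.endswith("AG")


print(donor_class("AG|GCCAGT"))    # junction quoted in this entry
print(donor_class("CTTGT|GTAAG"))  # donor quoted for SGC31751
print(acceptor_ends_with_ag("CCTCCCTTCCCCTCTGCAG|GCCGA"))
```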
TIGR-A003P17.
Last revised/checked: May 29, 1997
Unigenelink
Sequence name: wi8668.seq
Not part of a known gene, but this STS is part of the WI-8668
contig (see directly below).
WI-8668.
Last revised/checked: May 29, 1997
Most recent EST: GB:AA428433
Unigenelink
Sequence name: wi8668.seq
Genomic sequence:
Protein sequence:
Other species: mus8668.seq
Most recent EST: GB:AA146436
Protein sequence:
0.9 kb of sequence with a central gap. Not part of a known gene
nor similar to other genes. This sequence also contains TIGR-A003P17.
The mouse sequence is derived from the 3' UTR and contains a repeated
element which has been masked by Ns in the sequence.
TIGR-A002G31.
Last revised/checked: June 2, 1997
Unigenelink
Sequence name: wi8000.seq
This STS is contained in the WI-8000 contig (see above).
SGC35000.
Last revised/checked: May 29, 1997
Most recent EST: GB:H72274
Unigenelink
Sequence name: sgc35000.seq
0.9 kb with a central gap. Not part of a known gene. Only
represented by a single cDNA in the EST database. A clear similarity
to a sequence present on 4p16 in the Huntington region (gb:Z49237). A
rare repeat maybe. A relatively high similarity of questionable
significance has been observed for more markers, usually in their
3' UTR and especially when this region is quite long. This suggests
that there are still a large number of medium-to-low-copy number
repeats in the human genome waiting to be discovered.
WI-497.
Last revised/checked: May 1, 1997
Unigenelink
Sequence name: wi497.seq
A polymorphic genomic marker without any clear similarities. No
overlap with EST sequences.
WI-7969.
Last revised/checked: Sept 20, 1996
Most recent EST: GB:H60267
Unigenelink
Sequence name: wi7969.seq
Derived from the FMO5 gene, encoding flavin-containing
monooxygenase 5 (positions 1684-2326). FMO1 and 2 are involved in
detoxification; a defect in FMO3 leads to a fishy odor in the affected
person. Other FMO-genes have also been mapped to 1q. See also
WI-18060, which might be derived from another position in the FMO5
gene, but which shows some peculiarities. Other STS from this cDNA are
SHGC-162 and SHGC-12943 (according to Unigene).
WI-11526.
Last revised/checked: May 29, 1997
Most recent EST: GB:R12571
Unigenelink
Sequence name: wi11526.seq
Genomic sequence:
Protein sequence: wi11526.pep
Other species:
Most recent EST: GB:
Protein sequence:
Not derived from a known gene nor convincingly similar to another
protein. This EST-contig is 0.9 kb long with a central gap and
represented by only 3 cDNA clones. This STS marker overlaps with
WI-11405 (see below).
SGC31941.
Last revised/checked: May 29, 1997
Most recent EST: GB:AA382383
Unigenelink
Sequence name: sgc31941.seq
Genomic sequence: s31941g.seq
Protein sequence: sgc31941.pep
Other species: mus31941.seq
Most recent EST: GB:AA139291
Protein sequence: mus31941.pep
A 2.5 kb contig with a single gap. The cDNA might encode a motor
protein. It has a domain that is similar to myosin and it also shows
similarities to kinesin and an endosomal protein. The highest score
is with pir:S44243. These similarities are mainly the result of
the high Glu- and Gln- content and may be of little functional
significance. In the human contig, GB:AA090859 is either a chimeric
cDNA or an alternatively spliced form. The unique part of this EST
is not similar to anything in the database. In the murine contig,
EST GB:AA072145 shows a lot of differences to the other mouse ESTs
that are difficult to explain.
WI-11405.
Last revised/checked: May 29, 1997
Unigenelink
Sequence name: wi11526.seq
See above at WI-11526. The two STSs overlap, but they use
different primer sets.
WI-8440.
Last revised/checked: May 12, 1997
Most recent EST: GB:AA292816
Unigenelink
Sequence name: wi8440.seq
Genomic sequence: wi8440g.seq
Protein sequence: wi8440.pep
Other species: mus8440.seq
Most recent EST: GB:AA289705
Protein sequence: mus8440.pep
Represents the c-jun leucine zipper interactive protein
(PIR:B46132), which was identified in a two-hybrid screen using the
c-jun leucine zipper as the bait. However, the protein encoded by
the 1.1 kb EST-contig represented by wi8440.seq is much larger than
the PIR:entry, which probably only covers the part of the cDNA
encoding the jun-leucine zipper binding domain of the
protein. Most human EST sequences derived from this gene contain an
intron that contains stop codons in all three reading frames toward
its 3' end. This intron is absent from the mouse ESTs, however, and
the wi8440.seq-sequence is based on the spliced variant, since this
yields an uninterrupted reading frame that also contains the c-jun
binding region. The intron is included in the genomic sequence
wi8440g.seq (g stands for genomic). A second possible intron is
present at the extreme 5' end of the human sequence, since the
similarity to the mouse sequence drops sharply at a point in the
sequence that shows a good similarity to a splice acceptor site.
GB:AA251605 probably skips an exon between CCCCGCACA|GT and
.......|ATATTAAA. GB:W16619 contains 13 additional basepairs
between TTGGATGAG and GTCACCGTT. This is also the location of an
intron and the 13 additional basepairs are the result of the use
of an upstream AG dinucleotide.
Based on its association with c-jun, the protein is likely to
represent a transcription factor, but there are also some
similarities to akt kinase sequences in the mouse protein, which
extends more to the 5' end and is 1.6 kb long. The quality of these
matches is relatively low, however, making this a tentative assignment.
The WI-8440 EST-contig also covers STS tigr-A005Z14. The mouse ESTs
also show alternative splicing in the form of exon skipping.
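The argument that the retained intron interrupts the reading frame rests on finding stop codons in all three forward frames. A minimal sketch of that test follows; the function name and the toy sequences are assumptions, and real data would be read from a file such as wi8440g.seq.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def frames_with_stop(seq: str) -> set:
    """Return the forward reading frames (0, 1, 2) that contain a stop codon."""
    seq = seq.upper()
    frames = set()
    for frame in range(3):
        # step through complete codons only
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                frames.add(frame)
                break
    return frames


# Made-up toy sequences: one closed in every frame, one fully open
print(frames_with_stop("TAACTGACTAGC"))
print(frames_with_stop("ATGGCCGCC"))
```

A segment for which all three frames are returned cannot be part of an uninterrupted open reading frame on that strand.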
SGC32182.
Last revised/checked: May 20, 1997
Most recent EST: GB:AA339201
Unigenelink
Sequence name: sgc32182.seq
Genomic sequence:
Protein sequence: sgc32182.pep
Other species: mus32182.seq
Most recent EST: GB:AA068884
Protein sequence: mus32182.pep
The 1.3 kb sequence with a central gap seems to encode a
transmembrane protein with a domain that is similar to carbonic
anhydrase. However, such a domain is also present on a pair of human
proteins called p54/58N, and on the amino-terminal end of
proteoglycans (phosphacan) and receptor-type tyrosine phosphatases.
The function of this domain is not clear. Two sequences (GB:AA08433
and AA080898), derived from the same EST clone, are labeled as human
clones, but show a very high level of similarity to murine ESTs,
whereas they are clearly different from the other human clones.
Therefore they have been included in the murine sequence, although
the existence of a second human gene that is very similar to the
mouse gene can of course not be excluded. GB:AA322820 skips an exon
in the human sequence between TGAGG|GCCCA and TGCAG|CCCCA. The
murine SGC32182 gene probably overlaps in a 3'-3' orientation with the
murine SGC31751 gene, which contains a long 3' extension in comparison
with the human gene.
WI-7732.
Last revised/checked: Sept 20, 1996
Unigenelink
Sequence name: wi7732.seq
The histone H2B.1 gene. This gene is also covered by Cda0ab09
(according to Unigene).
TIGR-A003N45.
Last revised/checked: Nov 27, 1996
Most recent EST: GB:AA044343
Unigenelink
Sequence name: t003n45.seq
Genomic sequence: t003n45g.seq
Protein sequence: t003n45.pep
Other species: m003n45.seq
Most recent EST: GB:AA108610
Genomic sequence: m003n45g.seq
Protein sequence: m003n45.pep
The transcript from this gene contains an unspliced intron in at
least four clones, and this also is the case for the mouse EST
sequences. The intron is located in the 5' UTR of the cDNA. The
protein sequence is very clearly similar to other proteins, namely
to the bacterial BolA proteins, which are putative regulators of
murein gene expression. The best match is with gp:Z37111. Related
proteins are also found in yeasts, plants and animals. Since murein
is a component of the bacterial cell wall, this means that the
encoded protein should have another function in these organisms. The
protein is only 125 amino acids long, and the complete sequences of both
the human and mouse proteins are available. The proteins are over
85% similar. The mouse protein might start three amino acids
upstream from the human Met-startcodon. The TIGR-A003N45-contig also
contains WI-16272 (see below). A related cDNA (t003n45b.seq) is also
available upon request. This sequence also shows complex splicing and
polyadenylation patterns and there is some evidence for yet a third
cDNA. The t003n45b.pep-sequence is most similar to the yeast YGL220w
protein (gp:Z72742).
WI-18060.
Last revised/checked: Nov 14, 1996
Most recent EST: GB:H51750
Unigenelink
Sequence name: wi18060.seq
The FMO5 gene, see also WI-7969 (above). These two markers are
quite widely spaced on the RH map, considering the fact that they
are supposed to cover the same gene, but I have encountered several
similar situations (see the table). The WI-18060 sequence has been
derived from an EST, that is identical to the 5' end of the FMO5
mRNA (GB:Z47533) for most of its sequence. There is no overlap with
other FMO5-ESTs, which are all derived from more 3' positions in the
gene. I have made some confusing observations regarding the WI-18060-EST
(yp81c12.r1/s1; GB:H51749 and GB:H51750, respectively): it is in
the reverse orientation with respect to the FMO5 gene and all the
other ESTs. This would suggest that the 5' end of the FMO5 gene
is transcribed on both strands, an unlikely situation. A mix-up of
the orientations of the insert (which happens now and then) is
unlikely in this case, because the 3' (s1)-sequence (that partially
overlaps with the extreme 5' end of the FMO5 cDNA) contains an
AATAAA polyA-signal at the expected distance from the end, suggesting
that it is derived from a genuine mRNA. A chimeric cDNA (derived from
two unrelated mRNAs) is also unlikely, because STS WI-18060, which is
derived from the same s1-sequence (GB:H51750), is completely upstream
of the full-length(?) FMO5 cDNA and its localization to 1q21 is therefore
not based on known FMO5 sequences. The chance that two independent
cDNAs, derived from two genes located very close to each other on 1q21
would end up in the same clone seems very small indeed, and it is
therefore unlikely that WI-18060 has been derived from a chimeric
EST. One exotic possibility is that this represents a genomic sequence
after all, and that the polyA-tail and the polyadenylation signal
are derived from a partial pseudogene that is present just upstream
from the FMO5 gene or in an intron of the gene. Such a scenario is
also suggested by the fact that the length of the EST (1070 bp) is larger
than the distance covered on the FMO5 cDNA, suggesting that the middle
of the EST-insert contains an intron. Resolving this issue is possible
by sequencing the FMO5 gene.
SGC30813.
Last revised/checked: Jan 16, 1997
Most recent EST: GB:AA065094
Unigenelink
Sequence name: sgc30813.seq
Genomic sequence:
Protein sequence: sgc30813.pep
Other species: rat30813.seq
Most recent EST: GB:H33534
Protein sequence: rat30813.pep
A 2.1 kb sequence with four gaps with a clear similarity to the
Xenopus elav-like ribonucleoprotein (etr1; gp:U16800). The human
(>140 aa) and rat proteins are identical in the regions of overlap,
with the exception of a variation in the length of a CAG-rich repeat
sequence, which encodes (Gln)n. This repeat is masked in the file
with the nucleotide sequence, because it gives a lot of background
in the database searches.
WI-15174.
Last revised/checked: Nov 14, 1996
Most recent EST: GB:H51507
Unigenelink
Sequence name: wi15174.seq
This gene is only covered by two EST sequences and the compiled
0.75 kb sequence, that contains a central gap, does not have any
similarity to known proteins or sequences in other species. A clear
open reading frame is also lacking.
WI-16232.
Last revised/checked: March 27, 1997
Most recent EST: GB:AA227883
Unigenelink
Sequence name: wi16232.seq
This contig is difficult to explain: it contains most of the same
sequences as WI-13356, but the large majority of ESTs are in the
reverse orientation. Based on the length of the sequence-contig
(more than 1.3 kb) and the presence of several sequences that
seem to be alternatively spliced, it appears that the clones
representing this contig are somehow in the wrong orientation.
I have not been able to find a plausible explanation for this.
The WI-16232 alternative sequences are also present in the rat homologue
of WI-13356, which encodes a PI4-kinase. For more details, see at
WI-13356 below. Again, note how far markers derived from the same
gene are separated from one another (about 10 cR, with many
intervening markers). The gene encoding the PI4-kinase might well
be large, and the markers are derived from the 5' and 3' end of the
gene, but this still does not sufficiently explain the difference (let
alone the intervening markers).
WI-8386.
Last revised/checked: Feb 12, 1997
Most recent EST: GB:AA043991
Unigenelink
Sequence name: wi8386.seq
Genomic sequence:
Protein sequence:
Other species:
Most recent EST: GB:
Protein sequence:
A 1.4 kb contiguous sequence that only shows some similarities
to simple sequence and other repeats. Therefore probably derived
from a gene with a long 3' UTR. This is also suggested by the
absence of murine ESTs (3' UTRs may be quite poorly conserved) and
by the fact that only the extreme 5' end of the contig shows similarity
to another human EST (gb:AA203595) in a way that suggests the
presence of a coding domain. Predictions regarding the properties of
the protein are not yet possible. The related sequence
wi8386b.seq is available upon request.
WI-8123.
Last revised/checked: Feb 12, 1997
Most recent EST: GB:AA025534
Unigenelink
Sequence name: wi8123.seq
Genomic sequence:
Protein sequence: wi8123.pep
Other species:
Most recent EST: GB:
Protein sequence:
A 1.2 kb sequence with a central gap. No convincing similarities
to other DNA or protein sequences. This EST-contig overlaps with
markers SHGC-15372 (Stanford Human Genome Center; GB:G15089) and
with TIGR-A005M02 (GB:G20321). The SHGC marker does not feature on
the Stanford RH map (at least not in this position on 1q), but rather
has been mapped to Chr14, despite the fact that it overlaps with
the other two markers.
SGC32664.
Last revised/checked: Dec 16, 1996
Most recent EST: GB:AA136468
Unigenelink
Sequence name: sgc32664.seq
Genomic sequence: s32664g.seq
Protein sequence: sgc32664.pep
Other species: mus32664.seq
Most recent EST: GB:AA110993
Protein sequence: mus32664.pep
Other species: rat32664.seq
Most recent EST: GB:H32182
This 0.6 kb cDNA, which is probably full length, encodes a
protein that is related to bacterial ribosomal S21 proteins, a yeast
protein and a Salmonella rhamnulose kinase. The similarities are not
extremely strong. SGC32206 (see below) is derived from a rare
variant-cDNA that contains an extended 3' UTR (ESTs GB:N63268,
GB:R20655, and GB:H53531). Other cDNAs end at upstream AATAAA and
non-optimal AATATA poly-adenylation sites. The longest 3' UTR, that
covers SGC32206, contains an Alu-repeat.
An intron is present in quite a number of the ESTs,
and in the almost full-length cDNA GB:U79258. This cDNA lacks the
short upstream exon, however. The intron contains several ATG-codons,
and the U79258-file contains a putative translation product derived
from this intron, but most ATGs are not part of a good Kozak-consensus
sequence, due to the absence of purines on the -3 and/or +4 positions.
Moreover, this intron is never present in the murine cDNAs. None of
the possible open reading frames starting in this intron overlap with
the S21-like open reading frame. The genomic sequence, including this
intron is contained in the s32664g.seq-file. Remarkably, the mouse
S21-protein, which shows 89% identity to the human protein in the
first 62 amino acids, has a different carboxyl-terminus. The sequences
from the 3' ends of the cDNAs are also clearly divergent, which makes
a simple error in the reading frame less likely. It is therefore possible
that the mouse sequences are derived from a paralogous, non-orthologous
gene. In conclusion, the situation is rather complex.
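The Kozak argument used above for the intronic ATG codons can be sketched as a check of the -3 and +4 positions, with the A of the ATG numbered +1. The sketch follows the wording of this entry (a purine at both positions); the commonly cited consensus is stricter and asks for G at +4. The function name and the toy sequences are assumptions.

```python
PURINES = {"A", "G"}


def kozak_context(seq: str, atg_index: int) -> dict:
    """Report the -3 and +4 bases around an ATG (the A of ATG is +1)."""
    seq = seq.upper()
    if seq[atg_index:atg_index + 3] != "ATG":
        raise ValueError("no ATG at the given index")
    # there is no position 0, so -3 is three bases before the A
    minus3 = seq[atg_index - 3] if atg_index >= 3 else None
    plus4 = seq[atg_index + 3] if atg_index + 3 < len(seq) else None
    return {
        "minus3": minus3,
        "plus4": plus4,
        "minus3_purine": minus3 in PURINES,
        "plus4_purine": plus4 in PURINES,
    }


# Made-up examples: a favourable and an unfavourable context
print(kozak_context("GCCACCATGG", 6))
print(kozak_context("TTTTTTATGT", 6))
```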
SGC31751.
Last revised/checked: May 30, 1997
Most recent EST: GB:AA411081
Unigenelink
Sequence name: sgc31751.seq
Genomic sequence:
Protein sequence: sgc31751.pep
Other species: mus31751.seq
Most recent EST: GB:AA432911
Protein sequence: mus31751.pep
This 1.8 kb contig consists of a large number of ESTs. It is
for a large part coding, based on the changes in the third positions
in one rat EST and many mouse ESTs, but the encoded protein does
not show a clear similarity to well-characterized proteins. There is a
very distant similarity to a neuronal Glu transporter, but this seems
to reflect the high percentage of hydrophobic amino acids. The human
and murine sequences show several frame-shifts with respect to one
another and the very high level of similarity between the two contigs
makes it difficult to decide which frame is correct between two
consecutive frame-shifts. The human sequence contains an intron that is
present in most of the ESTs, but is absent in some. The reason that
I describe this insert as an intron, and not as an alternatively
spliced exon, is that it is bordered by consensus splice sites,
CTTGT|GTAAG and CCTCCCTTCCCCTCTGCAG|GCCGA. More importantly,
there are multiple stop codons in all three frames. However, the putative
open reading frame in the spliced cDNAs is not much longer.
Unfortunately, the sequences around the splice sites do not agree
very well with one another, which makes it difficult to decide which
frame is correct. For the moment, the intron has therefore been retained
in the human sequence. The murine sequence also contains the putative
intron. It is also much more extensive at the 3' side, but this is based
on a rather short overlap (of approximately 40 bp) of only a single
EST (GB:W74838) and this part of the contig may therefore be
erroneous. This 3' extension in the mouse does overlap with several
murine clones in the opposite orientation, that are most like the
human SGC32182 gene, which has been mapped not very far from here.
This makes it more likely that the murine extension is correct after all, and
shows again how close the genes are to one another, especially when they are
in a 3'-3' orientation.
Related human and murine sequences, which might in the end
help to establish the correct reading frame, are also present in the
EST database and are available upon request. The human cDNA also
covers SHGC-10321 (GB:G14461). The two STSs cover separate parts of
the contig, however.
WI-16272.
Last revised/checked: Nov 14, 1996
Most recent EST: GB:
Unigenelink
Sequence name: t003n45.seq
Derived from the same contig as TIGR-A003N45 (see above), but
quite a bit separated from it on the RH-map.
WI-14283.
Last revised/checked: May 1, 1997
Most recent EST: GB:AA223344
Unigenelink
Sequence name: wi14283.seq
Genomic sequence: wi14283g.seq
Protein sequence: wi14283.pep
Other species: mus14283.seq
Most recent EST: GB:AA237591
Genomic sequence: m14283g.seq
Protein sequence: mus14283.pep
The cDNA encodes a protein that is similar to proteins
involved in vacuolar transport of proteins. The present contig is
highly related to another cDNA sequence (GB:U35246) which is also claimed
to be a human cDNA. I have some doubts about the species-designation,
however. The last 1500 basepairs of U35246 are virtually identical (3
mismatches in over 1500 bp or 99.8%) to the recently published
rat sequence (GB:U81160), whereas the remaining 240 bp at the 5'
end show many more differences to the rat sequence, all at the third
position in codons. This suggests that a rat cDNA has inadvertently
been sequenced and that the 5' end has been obtained via 5' RACE
on human mRNA. Of note, the other cDNAs in the same publication are
also derived from rat. The view that U35246 is derived from a non-human
species is also supported by the fact that not a single EST in the database
is derived from this cDNA (there are no rat ESTs in dbEST covering this
cDNA).
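The observation that the 5' differences all fall on third codon positions can be checked by tallying mismatches per codon position. A minimal sketch, assuming two equal-length, ungapped, in-frame aligned coding sequences; the function name and the toy sequences are hypothetical and not taken from U35246 or U81160.

```python
def mismatches_by_codon_position(a: str, b: str, frame: int = 0) -> list:
    """Count mismatches at codon positions 1, 2 and 3 of an ungapped alignment."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    counts = [0, 0, 0]
    for i, (x, y) in enumerate(zip(a.upper(), b.upper())):
        if i < frame:
            continue  # bases before the first complete codon
        if x != y:
            counts[(i - frame) % 3] += 1
    return counts


# Made-up example: ATG GCT GAA AAG versus ATG GCC GAG AAA
print(mismatches_by_codon_position("ATGGCTGAAAAG", "ATGGCCGAGAAA"))
```

A tally concentrated in the third position, as in this toy example, is the pattern expected for coding sequence under selection and is what the entry reports for the 5' end.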
The murine sequence is largely based on GB:U66865, but I have added
a 3' (untranslated) extension that is based on EST sequences. All ESTs
are identical in sequence to U66865 in the region of overlap, with the
exception of two ESTs, GB:AA183278 and AA237591, that probably contain
an intron and an alternatively spliced exon (or another cDNA cloned in
the same vector), respectively.
The WI-14283 cDNA and the homologous sequences from mouse and rat
encode a vacuolar protein sorting protein similar to the yeast VPS45
protein. This protein is quite similar to the mammalian proteins and
the C. elegans gene, which is encoded on cosmid C44C1 (GB:U41030) is
readily detected even at the nucleotide level. The related C. elegans
protein UNC-18 is labeled as a vesicle transport protein and an
acetylcholine regulator. A human homologue of this protein also
exists. The human wi14283.seq-sequence also covers NIB1471, which
has been used for the Whitehead YAC-STS map.
IB3045.
Last revised/checked: May 20, 1997
Most recent EST: GB:AA345236
Unigenelink
Sequence name: ib3045.seq
Genomic sequence:
Protein sequence: ib3045.pep
Other species: mus3045.seq
Most recent EST: GB:
Protein sequence: mus3045.pep
The protein encoded by this EST-contig is related to the yeast
(S. pombe) longevity assurance protein (LAG1, gp:U76608) and even
more to LAG1-related C. elegans proteins (gp:U42438 and gp:U40415).
A 1.7 kb contiguous sequence and a protein of at least 250 aa. The
mouse contig overlaps at its 3' end with the 3' end of the mus288.seq-
sequence (see NIB288). There are several related cDNAs, that can be
compiled from the ESTs in the database.
SGC32206.
Last revised/checked: Nov 19, 1996
Unigenelink
Sequence name: sgc32664.seq
This marker is located in a rare, Alu-repeat-containing extension
of the 3' UTR of SGC32664 (see above), which probably encodes
a protein with similarity to the ribosomal S21 proteins.
WI-11473.
Last revised/checked: March 25, 1997
Most recent EST: GB:AA194147
Unigenelink
Sequence name: wi11473.seq
Genomic sequence:
Protein sequence:
Other species: mus11473.seq
Most recent EST: GB:AA274819
Protein sequence:
One of two convergently transcribed genes, of which the longest
3' UTRs (there are alternative polyadenylation sites) overlap by 100 bp
with the gene represented by WI-13356, which encodes a
phosphoinositol-4-kinase (see below). The WI-11473 primers are
specific for the WI-11473 cDNA, since one of them is located
outside the WI-13356 transcription unit. The compiled sequence
is presently 760 bp and is only represented by a small number of
ESTs. The sequence ends with an ATTAAA signal. The murine
sequence is more extended toward the 5' end and shows some similarity
to zinc finger proteins. So far, the murine protein is only 41 amino
acids, however, making a definite assignment a risky undertaking.
WI-5177.
Last revised/checked: Dec 9, 1996
Unigenelink
Sequence name: wi5177.seq
A random genomic STS, with no similarity to EST sequences.
It is present in the same YAC contig as D1S2343, WI-7217
(profilaggrin), WI-7815 (trichohyalin), UTR-9853 (S100A10,
calpactin light chain), GATA51H09, WI-9245 (a SPRR), WI-7842
(another SPRR, see below).
WI-12245.
Last revised/checked: Sept 20, 1996
Most recent EST: GB:
Unigenelink
Sequence name: wi12245.seq
This marker is derived from the Cathepsin K/O/X gene (GB:U13665;
all different names for the same gene). It is derived from the 5'
end of the gene. Another marker, WI-9390 (GB:G07268), is clearly
derived from the 3' end of the same gene, but has surprisingly
been mapped to a much more telomeric position on 1q via YAC-STS
mapping (in WC1.20). The RH location of this marker is not available.
A clear discrepancy. See also SGC35262 (below) which is derived
from the cathepsin S gene. One would expect the cathepsin genes
to be clustered.
SGC34368.
Last revised/checked: Nov 19, 1996
Most recent EST: GB:W86675
Unigenelink
Sequence name: sgc34368.seq
Genomic sequence:
Protein sequence: sgc34368.pep
Other species: mus34368.seq
Most recent EST: GB:W36491
Protein sequence:
Other species: rat34368.seq
Most recent EST: GB:H34823
Protein sequence:
This marker is part of a cDNA with a good similarity to bacterial
50S ribosomal L9 proteins, although the eukaryotic protein is more
extended at the amino-terminal end (the complete protein sequence
has most probably been obtained). The best match is with E. coli L9
(sp:P02418). I am a bit surprised that there are no mammalian
matches. One would expect that the cDNAs of such abundant proteins
would have been cloned some time (ages!!!) ago. On the other hand,
the message is not extremely abundant, as judged from the number of
EST clones, although it is apparently very well represented in a
retina cDNA library. One EST (GB:W26374) does not match the
consensus sequence from 1 to 370. It may be an alternatively
spliced form.
WI-18164.
Last revised/checked: Feb 14, 1997
Most recent EST: GB:AA149863
Unigenelink
Sequence name: wi18164.seq
Only characterized by a few EST-clones. A 0.7 kb sequence.
No matches to the protein database. The contig shows significant
similarity to sequences present in various genomic clones, although
a number of gaps need to be introduced. It seems likely therefore
that this contig is mainly composed of 3' UTR.
SGC35293.
Last revised/checked: Sept 20, 1996
Most recent EST: GB:
Unigenelink
Sequence name: sgc35293.seq
This marker was directly derived from the sequence for Small
Proline Rich Protein 2a (SPRR2), GB:M20030. Remarkably, SGC35405,
which encodes another SPRR-protein (GB:M21539), has been mapped much
further downstream (see below), whereas the SPRR genes are known to
be clustered.
WI-13356.
Last revised/checked: April 9, 1997
Most recent EST: GB:AA282706
Unigenelink
Sequence name: wi13356.seq
Genomic sequence:
Protein sequence: wi13356.pep
Other species: mus13356.seq
Most recent EST: GB:AA260873
Protein sequence: mus13356.pep
Other species: rat13356.seq
Protein sequence: rat13356.pep
This marker covers a transcription unit that overlaps with
WI-11473 (see above). Several clones (e.g. GB:R98965 and GB:44360)
overlap for 100 bp with the 3' end of the WI-11473 sequence.
These clones represent the 3' end of the transcription unit,
which ends with an AATAAA consensus polyadenylation signal.
Most clones end 300 bp further upstream at another polyadenylation
signal and do not overlap with WI-11473. The WI-13356 cDNA, which
was recently published (Meyers and Cantley, J. Biol. Chem. 272,
4384-4390 (1997) GB:U81802) and which ends at the upstream
polyadenylation signal, encodes a PI4-kinase. Alternative splicing
occurs especially at the 5' side of the mRNA, as evidenced by the
homologous rat cDNA (GB:D84667) and by EST GB:AA282706. Several mouse
ESTs contain parts of intron sequences. The most surprising feature
of the WI-13356 sequence is that it largely overlaps with the sequence
of WI-16232, but in the reverse orientation. Since this overlap is
present at the 5' end of the PI4 kinase cDNA and most probably
spans several exons, a real reverse transcript seems unlikely.
It is more likely that for some reason the orientation of the
inserts has been reversed, although there is no obvious reason
why this should have happened (e.g. there is no A-rich stretch
on which oligodT-priming might have occurred). Some of the rat
alternative exons are present in the WI-16232 transcription unit,
which in fact constitutes the evidence that this transcript
spans several exons. For the moment the WI-16232 contig has been
retained as a separate entry.
Other details about the WI-13356 contig: one EST (GB:W52129)
skips a 37 bp sequence in the cDNA, leading to a frame-shift and
a premature stop. This is probably a recombined clone, since the
borders of the deletion are identical over a 6 bp stretch.
The WI-13356 sequence shows identity to the first 120 bp (in the
reverse orientation) of GB:U15590, which encodes human heat-shock
protein 27. This probably represents an error on the part of the HSP27
sequence, since this part is clearly similar to various PI4-kinase
sequences and is present in several independent clones.
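The frame-shift noted for GB:W52129 follows from simple arithmetic:
a deletion within the coding region shifts the reading frame unless
its length is a multiple of three. A minimal Python sketch of that
check (the function name frame_effect is purely illustrative and is
not part of any tool used for this compilation):

```python
def frame_effect(indel_length):
    """Classify an insertion or deletion in a coding region by its length."""
    if indel_length <= 0:
        raise ValueError("indel_length must be a positive number of bases")
    # Codons are three bases long, so only multiples of 3 keep the frame.
    return "in-frame" if indel_length % 3 == 0 else "frame-shift"


# The 37 bp sequence skipped by GB:W52129
print(frame_effect(37))
```

The same check applies to any other insert or deletion length quoted
in these notes, provided the event lies inside the coding region.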
WI-6771.
Last revised/checked: May 30, 1997
Most recent EST: GB:AA187628
Unigenelink
Sequence name: wi6771.seq
This marker is part of a gene that encodes a protein of unknown
function that has similarity to ankyrin-containing proteins. The
marker is at the 3' end of a transcript that is a non-spliced and
alternatively poly-adenylated variant of the NIB288/WI-7370 (GB:G06543)
transcript described below. It is situated around position 1200 in
the full-length transcript. The marker is part intron, part exon, the
perfectly normal AATAAA polyA site being located in an intron.
A spliced EST (GB:186759 (5') and GB:AA187628 (3')) that ends at
this polyadenylation site also exists. For the moment, I have
retained WI-6771 as a separate sequence, but it is also contained
in the genomic sequence of NIB288, nib288g.seq (see below).
WI-6771 has been mapped to the same YACs as WI-8118, WI-9627,
WI-7370 (=NIB288), and D1S498. It is also known as D1S2372.
SGC31587.
Last revised/checked: Jan 10, 1997
Most recent EST: GB:AA128568
Unigenelink
Sequence name: s31587hv.seq
This sequence is derived from the cDNA of DNA binding regulatory
factor (GB:X85786). However, the ESTs cover mainly the 3' UTR, which
is not contained in the Genbank entry. In addition, there are some
consistent differences between the ESTs and the database entry,
which is why I have compiled my own sequence. The differences
are in GC-rich regions, and may be caused by compressions in the
gel and subsequent misinterpretations of the EST sequences, but
at least one difference is present on both strands. Because all
differences are downstream from the stop codon, the protein is
not (yet?) affected. The s31587hv.seq-sequence is the complete
coding sequence combined with "my own" 3' UTR, which still
contains a gap. The complete sequence is at least 3.0 kb long.
TIGR-A002I04.
Last revised/checked: Feb 14, 1997
Most recent EST: GB:AA206559
Unigenelink
Sequence name: t002i04.seq
Genomic sequence:
Protein sequence: t002i04.pep
Other species: m002i04.seq
Most recent EST: GB:AA071777
Protein sequence:
A compiled 1.9 kb sequence that represents the human variant of the
rat and bovine p87 transport-like protein or SV2 form A protein
(form B is highly similar, but clearly distinct; the latter
protein is represented by WashU ESTs GB:R53361 and GB:T80035 and mouse
GB:R74749). The 1.9 kb sequence encodes the last 200 aa of the 742
aa protein, which is very highly conserved between species.
NIB288.
Last revised/checked: May 30, 1997
Most recent EST: GB:AA430987
Unigenelink
Sequence name: nib288.seq
Genomic sequence: nib288g.seq
Protein sequence: nib288.pep
Other species: mus288.seq
Most recent EST: GB:AA396251
Protein sequence:
Derived from the same gene as WI-6771 (see above), but from the
normal transcript. The full length (4333 bp) cDNA is of unknown
function (GB:D31891). It is expressed in the immature
myeloid cell line KG1 and its gene product (of about 1300 aa) is
related to the G9a protein, an ankyrin-repeat containing protein
that is encoded somewhere in the MHC-complex and is, again, of
unknown function. The NIB288 cDNA is even more related to
a putative DNA topoisomerase II from C. elegans.
Other DNA-interacting proteins also feature on the list of related
proteins. Another human match with a relatively high score is the
MG44 protein, another DNA binding protein with similarity to SRY.
There are several EST sequences that are similar to this protein,
but this might reflect the presence of ankyrin repeats. This makes
it also a little bit difficult to discriminate the homologous
murine ESTs from those that are simply related to NIB288, because
of the presence of various well-conserved domains. The ESTs from
a related gene may show a higher level of similarity to NIB288 than
real mus288 ESTs, if the latter are derived from a less-conserved part
of the gene. This has led to an initial misassignment of some murine
ESTs. Some murine ESTs are, however, even similar to the 3' untranslated
region of NIB288, and these ESTs are present in the mus288.seq-file,
which has been entirely based on nucleotide sequence similarities.
The mus288-sequence overlaps at its extreme 3' end with the 3' sequence
of mus3045 (see at IB3045 below). The human mRNAs do not seem to
overlap. WI-7370 (GB:G06543) is another marker derived from this cDNA.
This marker has also been used for the Whitehead YAC-STS map. It is
present on YACs 764-A-1, 789-E-5, 854-D-5 and 947-E-1. These YACs are
positive for WI-8118, WI-9627, WI-6771 (see above), D1S498 and D1S2347.
WI-15024.
Last revised/checked: May 13, 1997
Most recent EST: GB:H20877
Unigenelink
Sequence name: wi15024.seq
Genomic sequence: wi15024g.seq
Protein sequence: wi15024.pep
Other species: mus15024.seq
Most recent EST: GB:W13076
Protein sequence:
Other species: rat15024.seq
Most recent EST: GB:H32838
Protein sequence: rat15024.pep
This marker is only covered by a single EST clone, but this clone
has good similarity to both rat (GB:H32838) and mouse EST sequences
(GB:W13076 and W11461). The contig also overlaps with a genomic
(CpG-island) fragment (GB:HS58D6R). The other end of this CpG island
clone, HS58D6F, is not similar to anything. The deduced protein
sequence is weakly, but convincingly, related to yeast YPL191c
(GP:Z73547) and YGL082w (GP:Z72604). However, none of this leads to
a clearly defined function of the putative protein, which is at least
100 aa long. A second, related cDNA is also present in the database.
Remarkably, this sequence seems to be described in the opposite
direction, since all ESTs are in the reverse orientation. The sequence
of this contig (wi15024b.seq) is available upon request.
WI-17680.
Last revised/checked: May 13, 1997
Most recent EST: GB:AA416763
Unigenelink
Sequence name: wi17680.seq
Genomic sequence:
Protein sequence:
Other species: mus17680.seq
Most recent EST: GB:AA209852
Protein sequence:
Other species: pig17680.seq
Most recent EST: GB:Z84190
Protein sequence:
A 2.6 kb sequence, with two gaps and two masked T-rich repeats.
The correct order of the first two human fragments is by no means certain.
There are two polyadenylation sites, one about 1.1 kb upstream of the
other. Both are preceded by canonical AATAAA polyA signal sites.
The similarity to the mouse sequence is restricted to a non-coding
part of the sequence, which means it has not been proven that the two
sequences really encode a homologous protein. It is however still
possible that the human sequence mainly consists of 3' UTR and that
only part of this 3' UTR has been conserved in the mouse. Although
the 1.4 kb murine sequence has an open reading frame of some length,
it lacks a methionine start codon. The extreme 3' end of the human
cDNA is quite similar to a pig EST (in the reverse orientation),
GB:SSZ84190 (pig17680). GB:N98426 is a chimeric cDNA that is identical
to the 3' end of WI-17680 over its first 110 bp only.
WI-16732.
Last revised/checked: May 13, 1997
Most recent EST: GB:R08189
Unigenelink
Sequence name: wi16732.seq
A 0.6 kb sequence based on a small number of ESTs. The sequence
contains an Alu repeat, which has been removed. Most or all of the
sequence of this contig is derived from the 3' UTR of this cDNA. It
does not show any similarity to other genes. tigr-A008P08 is also
part of this contig.
SGC35262.
Last revised/checked: Sept 20, 1996
Most recent EST: GB:
Unigenelink
Sequence name: sgc35262.seq
Derived from the cathepsin S gene (GB:M90696). See also WI-12245,
which is derived from another cathepsin family member, Cathepsin K.
One would expect that these genes would be relatively close to one
another in the genome, which is usually the case for genes that
arose by duplication and are still present on the same chromosome.
WI-15828.
Last revised/checked: June 4, 1997
Most recent EST: GB:H43073
Unigenelink
Sequence name: wi15828.seq
Genomic sequence:
Protein sequence:
Other species: mus15828.seq
Most recent EST: GB:AA268358
Protein sequence:
This cDNA contains an Alu-repeat in its 3' UTR and none of the
present EST clones crosses this repeat. This makes it
impossible to reach the coding part of the transcript. A recently
identified mouse EST may in the end help to circumvent this
problem. At present, this EST is still on its own and only covers
the extreme 3' end of the mRNA, however. The fact that it is
detectable suggests that the level of conservation of this
transcription unit should be quite strong.
WI-17569.
Last revised/checked: June 4, 1997
Most recent EST: GB:AA402743
Unigenelink
Sequence name: wi17569.seq
Genomic sequence:
Protein sequence: wi17569.pep
Other species: mus17569.seq
Most recent EST: GB:AA213063
Protein sequence:
The ongoing sequencing of the human genome has revealed that
the entire 3' end of this 1.4 kb contig is quite similar to other genomic
sequences. The major part of the 3' end of WI-17569 has now been
removed. The 5' end of the contig overlaps with a small CpG clone,
63f1 (GB:Z55775 and GB:Z55776). The last part of GB:F00707 is
not compatible with the other human ESTs. The cause of this
discrepancy is not very obvious. WI-17569 is similar to several
mouse ESTs, but this gives relatively little additional information.
The putative protein encoded by the human mRNA might be related
to a yeast protein with the accession number sp:P40098.
SGC34816.
Last revised/checked: Sept 20, 1996
Most recent EST: GB:
Unigenelink
Sequence name: sgc34816.seq
The ROR-gamma gene (GB:U16997), encoding a retinoid-receptor-like
orphan receptor (so without a known ligand). Always an interesting
candidate for a hereditary disease!
WI-16757.
Last revised/checked: Jan 20, 1997
Most recent EST: GB:AA031777
Unigenelink
Sequence name: wi16757.seq
Genomic sequence:
Protein sequence:
Other species: mus16757.seq
Most recent EST: GB:AA190161
Protein sequence:
A 2.9 kb sequence that mainly consists of the 3' UTR of the Aryl
Hydrocarbon Receptor Nuclear Translocator (ARNT) gene. The 3' UTR is
not present in the GB:M69238 entry of the ARNT cDNA. The length of
the 3' UTR is at least 2.5 kb and it contains a CA-repeat and a long
stretch of Ts, which have both been replaced by Ns in the sequence.
This transcription unit also covers SGC/WI-30626 (GB:G21200; see
below), SHGC10310 (GB:G11399), CHLC.UTR_01521_M69238.P56088
(GB:G15891), SHGC-11178 (GB:G11246), and SHGC-19027 (GB:G31040).
In addition, Bos taurus minisatellite marker RME23 (GB:BTU15433)
is probably also derived from this gene. The murine sequence
contains a large gap with respect to the human sequence in its
3' UTR. ARNT KO mice show defects in angiogenesis and in their
response to glucose and oxygen deprivation (Maltepe et al., Nature
386, 403-407 (1997)).
SGC30626.
Last revised/checked: Jan 20, 1997
Unigenelink
Sequence name: wi16757.seq
Identical to WI-16757 and several other STSs (see directly
above).
SGC34987.
Last revised/checked: Sept 20, 1996
Most recent EST: GB:
Unigenelink
Sequence name: sgc34987.seq
Identical to the MCL1 gene (myeloid cell differentiation protein,
involved in leukemias or lymphomas). The full-length MCL1 cDNA is
contained in GB:L08246. The MCL1 transcript also covers SHGC-15108
(GB:G14999).
WI-14860.
Last revised/checked: Dec 2, 1996
Most recent EST: GB:
Unigenelink
Sequence name: wi14860.seq
Genomic sequence:
Derived from the human (mitochondrial) ubiquinol cytochrome-c
reductase core I protein cDNA (GB:L16842 and GB:D26485), although
the actual STS WI-14860 does not overlap with these files. In fact,
WI-14860 does not match anything in dbEST except itself. The match
with the mitochondrial sequences is entirely based on the 5' end of
the EST insert from which WI-14860 is derived. Possibly, WI-14860
represents a rare 3' end. Remarkably, the protein sequence also shows
significant similarity to mitochondrial proteases, such as rat
mitochondrial processing protease P52 (GB:D13907), which seems to be
an altogether different function.
SGC35405.
Last revised/checked: Sept 20, 1996
Most recent EST: GB:
Unigenelink
Sequence name: sgc35405.seq
Small Proline Rich Protein 2C (SPRR2C; GB:M21539). Quite similar,
but not identical to SGC35293 (SPRR2A; GB:M20030), which has been
located to a more centromeric position on 1q (see above).
WI-7842.
Last revised/checked: Sept 20, 1996
Most recent EST: GB:
Unigenelink
Sequence name: wi7842.seq
Also a SPRR (GB:M19888). SPRR stands for Small Proline-Rich
Protein. There are many genes encoding these proteins, which
form a small family. All genes are known to be clustered in a
short interval on 1q21. At least one of the YACs positive for
WI-7842 is also positive for D1S2343, WI-7217 (profilaggrin),
WI-5177, WI-7815 (trichohyalin), UTR-9853 (S100A10, calpactin
light chain), GATA51H09, WI-9245 (another SPRR), and D1S2346.
WI-17060.
Last revised/checked: May 12, 1997
Most recent EST: GB:AA321239
Unigenelink
Sequence name: wi17060.seq
Genomic sequence: wi17060g.seq
The 0.5 kb cDNA for CAAF1 (calcium-binding protein in human
amniotic fluid 1; GB:D83664 (cDNA) and D83657 (gene)), also named
calgranulin C (GB:X97859) or S100A12 (GB:X98288, X98289, X98290).
Calgranulin A and B (S100A8 and S100A9) have been mapped
closest to loricrin, which is also located on 1q21 in the
Epidermal Differentiation Complex (Marenholz et al., Genomics 37,
295-302 (1996)). It is clear that the S100-gene complex contains
additional genes of the same family, that have not yet been mapped
by other means (see also WI-16548 below), or have only been mapped
recently, such as this gene. The bovine homolog(ue) is also
available as GB:D49548.
WI-2862.
Last revised/checked: June 4, 1997
Unigenelink
Sequence name: wi2862.seq
A random genomic STS, not similar to other entries in the
database. This marker has also been used for the Whitehead
YAC-STS project. Although most hits are ambiguous, it probably maps
to YACs 713_h_12, 717_c_3 and 870_c_5, since these YACs also
contain other STSs in the region, such as D1S305 and IB3580 (SGC33740).
WI-11760.
Last revised/checked: April 16, 1997
Most recent EST: GB:
Unigenelink
Sequence name: wi11760.seq
Genomic sequence: wi11760g.seq
Protein sequence: wi11760.pep
Other species: mus11760.seq
Most recent EST: GB:
Protein sequence: mus11760.pep
A 0.5 kb full-length cDNA encoding a complete protein of unknown
function. The 137 aa protein (136 aa in mouse) has a clear signal
peptide and several hydrophobic stretches. ESTs GB:N30852
and GB:AA044232 are polyadenylated, but unspliced forms of the cDNA.
The single intron is located just downstream of the two methionines
at ATGATGG|TG. Both ATGs might function as a start codon. In the murine
sequence only the second methionine is present. The upstream (5') ends
of the two unspliced clones (GB:N41379 and AA044371) are probably located
in an intron. The position of the intron in the mouse sequence seems to
be conserved. It is present in five EST clones, whereas the spliced form
is represented by more than ten clones. However, the murine situation is
more complex, because the unspliced variants contain CG instead of the
expected AG in an otherwise acceptable splice acceptor site, although they
contain an AG dinucleotide 5 nt upstream of the CG. To complicate matters
even more, one splice variant (GB:W67111) not only lacks the "intron", but
also the first 3 nucleotides, TAG, of the downstream exon. The borders of
the murine intron are ATGG|GTAACC.....polyYGTAGCCTCG|TAG|TCGG. Since the
splice acceptor site at the identical position in the human ESTs
is absolutely perfect, my present interpretation is that some of
the murine alleles carry a mutated splice acceptor site which leads
to defective/alternative splicing. The 80 bp insert sequence contains
an in-frame stop-codon, but no ATG codon, which suggests that
a protein cannot be translated from these mRNAs.
WI-15199.
Last revised/checked: June 4, 1997
Most recent EST: GB:H46842
Unigenelink
Sequence name: wi15199.seq
Covered by only two ESTs. No similarities to other genes or
proteins.
TIGR-A004W05.
Last revised/checked: June 13, 1997
Unigenelink
Sequence name: wi6903.seq
Identical to WI-6903, which is downstream from MUC1 (WI-5995).
Encodes a family-member of the rat SCAMP (Secretory Carrier Membrane
Protein 37) gene. For additional information see at WI-6903.
WI-4536.
Last revised/checked: May 12, 1997
Most recent EST: GB:AA337698
Unigenelink
Sequence name: wi4536.seq
Genomic sequence: wi4536g.seq
Other species: pig4536.seq
Most recent EST: GB:
Protein sequence: wi4536.pep
A random genomic STS, that seems to overlap with an exon.
The sequence is similar to pig EST GB:F15007. The genomic
sequence contains a good splice acceptor site at the position
of divergence with the cDNA sequence. WI-4536 has been mapped
to YACs 717-C-3, 955-E-11 and 736-H-10 and is present in the
YAC-contig that also contains IB3262 and WI-9711 (see below
and directly below at RP_S27_2).
RP_S27_2.
Last revised/checked: June 25, 1997
Sequence name: wi9711.seq
Genomic sequence: wi9711g.seq
Based on GB:L19739, which encodes ribosomal protein S27
or metallo-panstimulin. See WI-9711 below for additional details.
WI-15073.
Last revised/checked: July 2, 1997
Unigenelink
Sequence name: wi17491.seq
From the same gene, but from another position in the cDNA
as WI-17491 (see below for more information).
SGC32326.
Last revised/checked: Jan 30, 1997
Most recent EST: GB:AA122417
Unigenelink
Sequence name: sgc32326.seq
Genomic sequence:
Protein sequence: sgc32326.pep
Other species: mus32326.seq
Most recent EST: GB:AA199119
Protein sequence: mus32326.pep
The 2.1 kb sequence encodes a protein with the highest similarity
to sp:P38753, a hypothetical yeast protein with an SH3 domain.
Several other SH3-domain containing proteins also feature on the
"hit-list", such as Grb2. The SGC32326 protein sequence is highly
related to the protein encoded by the cDNA contig represented by
GB:L49705. I have also compiled a contig of this transcript
(s32326b.seq), which is available upon request.
WI-16548.
Last revised/checked: June 26, 1997
Most recent EST: GB:
Unigenelink
Sequence name: wi16548.seq
Genomic sequence: wi16548g.seq
Protein sequence: wi16548.pep
Other species: mus16548.seq
Most recent EST: GB:
Protein sequence: mus16548.pep
Another novel S100-like protein just like WI-17060. WI-16548 is most
similar to S100B, which is on chromosome 21; S100A1 is the most related
member on 1q21. There are at least 10 S100 genes on this part of the
chromosome, most of which have been physically mapped on YAC contigs.
These genes are part of the Epidermal Differentiation Complex (EDC),
which also contains other genes expressed in the epidermis. Few of these
genes are present on the RH map, possibly because they were already
mapped by other means. The present human sequence contains two
alternative non-coding first exons; the mouse sequence even contains
an additional 5' extension. These 5' UTRs are quite GC-rich and it is
likely that the sequence contains some mistakes, as the various EST
sequences do not agree very well at some points. It is possible that
one of the murine 5' UTRs represents a transcription start in an
intron or a genomic fragment, since the last part of this sequence
is C/T-rich and ends in AG. The human EST GB:AA038823 lacks exon 2,
and GB:T98187 also skips part of the mRNA. The markers WI-8650
(GB:G11630) and CHLC.GCT15E11.P17446 (see also below under
GCT15E11) are also derived from this transcription unit. The
cDNA contains a (CAG)n (=GCT) element, and this part of the
cDNA is also present in the genomic sequence GB:U23859. WI-8650
has been mapped to YAC 955-E-11, which also contains D1S2346,
WI-6071 (=S100A9, calgranulin B), IB3262, D1S2463 (=WI-4536,
see above), WI-9711 (see below), D1S2418 (=WI-9245, SPRR),
D1S2400 (=WI-7842, SPRR, see above), GCT15E11 (see below),
and possibly WI-8190.
TIGR-A002G29.
Last revised/checked: July 1, 1997
Most recent EST: GB:AA220223
Unigenelink
Sequence name: t002g29.seq
Genomic sequence:
Protein sequence: t002g29.pep
Other species: m002g29.seq
Most recent EST: GB:
Protein sequence: m002g29.pep
A 1.7 kb cDNA that is covered by more than 100 matching sequences
in dbEST. Many of the ESTs are claimed to be similar (in this case:
identical) to GB:M35718, the fibroblast growth factor receptor BFR2
or K-sam. Although this is true, this is likely to be due to a stray
EcoRI fragment in the cDNA sequence of the K-sam clone, since this
fragment is not present in several other Genbank files that contain
full-length cDNAs of the same receptor, and that are otherwise
identical in sequence. The ORF of TIGR-A002G29, which is most
likely full length, shows some similarity to bacterial
6-phosphogluconate dehydrogenases. The best match, which is still
quite weak, is with the E. coli enzyme (GP:U14430). The conclusion that
the open reading frame is probably full length is mainly based on the
observation that the level of similarity to the murine sequence
drops dramatically upstream of an ATG-codon, that is in a good
Kozak-consensus (GCCATGG). There are at least four exons in this gene,
since several ESTs skip one (e.g. GB:T32028) or even two exons
(GB:Z42265). The latter event results in a frame-shift. One of the
mouse ESTs (GB:W51350) is alternatively spliced and lacks the second
human exon skipped in GB:Z42265. The contig is almost identical in
sequence to a cDNA derived from glioblastoma cells, that is present in
a Japanese patent application (GB:E08542 and GB:E08543). No details
on its function were given. The cDNA also covers STS SHGC-11135
(GB:G13549).
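The remark about the ATG lying in a good Kozak consensus (GCCATGG)
can be illustrated with a small check of the two positions usually
considered most important: a purine at -3 and a G at +4, counting the
A of the ATG as +1. The sketch below is a simplification under that
assumption; the function name kozak_strength and the
strong/adequate/weak labels are illustrative conventions, not a
method used for this compilation.

```python
def kozak_strength(context, atg_index):
    """Rate the context of the ATG starting at atg_index (0-based)."""
    seq = context.upper()
    if seq[atg_index:atg_index + 3] != "ATG":
        raise ValueError("no ATG at the given index")
    if atg_index < 3 or atg_index + 3 >= len(seq):
        raise ValueError("need 3 bases upstream and 1 base downstream")
    purine_at_minus3 = seq[atg_index - 3] in "AG"  # position -3
    g_at_plus4 = seq[atg_index + 3] == "G"         # position +4
    hits = int(purine_at_minus3) + int(g_at_plus4)
    return {2: "strong", 1: "adequate", 0: "weak"}[hits]


# The context quoted for TIGR-A002G29: GCC ATG G
print(kozak_strength("GCCATGG", 3))
```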
One EST, za86a07 (r1= , s1=GB:N76101), is quite remarkable
with regard to its structure; the s1-file covers the 5' end of SGC33740
(see below) in the same orientation, the r1-file covers the 5' end of
TIGR-A002G29, also in the same orientation. This is only possible if the
clone contains both genes, if TIGR-A002G29 is present in an
intron of SGC33740, or if the clone is scrambled in some way. Again, as
in previous cases, the scrambling option seems to be the most likely, but
it also suggests that the two genes are probably neighbors in the genome.
SGC34058.
Last revised/checked: June 17, 1997
Most recent EST: GB:AA411756
Unigenelink
Sequence name: sgc34058.seq
Genomic sequence: s34058g.seq
Protein sequence: sgc34058.pep
Other species: mus34058.seq
Most recent EST: GB:AA387333
Genomic sequence: m34058g.seq
Protein sequence: mus34058.pep
Other species: rat34058.seq
Most recent EST: GB:H32493
Protein sequence: rat34058.pep
Other species: dro34058.seq
Protein sequence: dro34058.pep
This STS-marker is located in a very complex region that consists
of two overlapping genes, both of which may have alternative 3' ends.
The two 3' ends of SGC34058 are about 500 bp apart. The STS itself
is present at the 3' end of the longest of the two mRNA-species in
this gene. Both transcripts overlap with the transcription unit
defined by STS WI-16155 (see below). In fact, the two last introns
of both genes do almost completely overlap, although the coding regions
are clearly separate.
A clear polyadenylation signal is lacking from the WashU/
Merck s-files that are derived from the longer SGC34058 transcript. The
sequences that normally would be expected to "begin" with a polyA-tail,
"start" at different positions within an approximately 100 bp region.
It cannot at present be excluded that these clones are in fact genomic
in origin, but I cannot identify an A-rich sequence motif that might
explain the presence of such a large number (there are more than 10
such clones) in various cDNA-libraries. One clone, GB:AA279439,
contains a diverging A/T rich sequence at its 5' end, which would
correspond to the 3' end of the putative transcript, but it is at present
unclear whether this AT-rich region is really present at this position.
Regarding splicing, the situation is also rather complex. Several human
clones contain what seems to be an intron sequence, and the intron seems
to contain two different splice acceptor sites, one of which will result
in a truncated protein as a result of a frame-shift. This upstream splice
site is used in at least two cDNAs, although a poly-pyrimidine tract at
this splice site is not very obvious. The most likely explanation for the
presence of these cDNAs is that this splice site is used because it
represents the first AG downstream from the branch point sequence, which
has been defined by the polypyrimidine tract of the very nice second
splice acceptor site, which is located 25 bp downstream. This second
splice site is also used in the murine sequence, but some unspliced forms
(containing an upstream AG as well) are also found among the murine cDNA
clones. Both murine clones (GB:W80183 and GB:AA254293) are in the reverse
orientation, however, which might indicate that they are derived from the
mus16155 transcription unit or are genomic in origin.
The SGC34058 cDNA is presently 2.3 kb in length. It encodes a
protein that is quite similar to the recently identified Drosophila
misato protein, which has tubulin- and myosin-like
motifs and is involved in cell division (Miklos et al., PNAS 94,
5189-5195 (1997); GB:U80043). Other tubulin beta-like sequences are also
identified in a database search. The alternative splice site upstream
of the last exon would of course lead to a different carboxyl-terminus.
Although Miklos et al. claim that there is no true counterpart of this
protein in yeast, the human protein readily identifies the hypothetical
protein YMR211w (pir:S55093) as the next-best match.
WI-9711.
Last revised/checked: June 26, 1997
Most recent EST: GB:
Unigenelink
Sequence name: wi9711.seq
Genomic sequence: wi9711g.seq
Protein sequence: wi9711.pep
Other species: mus9711.seq
Most recent EST: GB:
Genomic sequence: mus9711g.seq
Protein sequence:
The original STS is a genomic fragment that covers part of
the gene encoding the ribosomal protein S27 (GB:U57847),
also known as metallopanstimulin (GB:L19739). This gene also
covers marker WI-8593 (GB:G06949). WI-9711 has also been used
for the YAC-STS mapping project of the Whitehead lab; it maps
to the same YACs (717-C-3, 736-H-10, 955-E-11) as IB3262 (not
included in the RH map) and WI-4536, which is included in the
RH map and which has been mapped close to the other STS marker
representing WI-9711, RP_S27_2 (see above).
Usually, when one starts screening the various databases using
a full-length cDNA sequence, one comes across a few variant clones
that often represent splice variants or alternatively polyadenylated
forms. In this case, however, the situation turns out to be almost
bizarre. First of all, it turns out that a processed pseudogene is
present on Chr7q31 on the BAC clone GS274A07 (GB:AC002075). This
sequence is more than 95% identical. It contains a deletion of one
amino acid and two amino acid substitutions with respect to the gene-
sequence. It is probably not expressed, because there are no ESTs
matching it, whereas there are many ESTs matching the WI-9711
sequence. A second problem is that the last 100 bp of the WI-9711
sequence match the 5' ends of both the cDNA (GB:M31520)
and the genomic (GB:U12202) sequence of the human ribosomal protein
S24 in the reverse orientation. This does not seem possible; it may
be that the cDNA clone of the S24 gene was chimeric and that the
genomic sequence has not been determined entirely independently.
There are many WI-9711 sequences crossing the putative junction,
meaning that an error in the WI-9711 cDNA sequences can be ruled out.
Finally, the splicing pattern of the human and mouse cDNAs in this
part of the cDNA seems rather wild. Whereas the second intron in the
human genomic sequence is only derived from the original STS, which
was genomic in origin, the first intron is represented in many EST
clones. In the human situation there seem to be three classes of clones:
one type skips the entire intron and splices ATGCCT|GTGAGT to
TCTTAG|CTCGCA (where the ATG in the first exon is the start codon),
a second type seems to use an alternative splice donor site in the
intron, with the sequence TCCCCG|GTGTGT (the T in position 4 in the
intron makes this a non-consensus splice site). Finally, a third type
does not remove the intron at all. In the murine clones, the situation
is even more complex. All the human splice events can also be traced
among the murine ESTs, but there is one additional complication: there
are murine clones that use an alternative splice acceptor site in the
intron, which is 24 bp (8 aa) upstream from the other acceptor sites.
The upstream site ends with TAAG|TC, the downstream one with TTAG|CT
and both motifs are preceded by a good polypyrimidine tract. Based on
the level of conservation between the human and the mouse sequence and
on the pattern of mismatches, I would not be surprised if part of the
intron was coding. Translation may start at the upstream ATG or it may
start at an ATG that is present in the intron itself. Some of the
protein products might be translated in another reading frame. This
clearly needs to be studied with real experiments. Anyone interested?
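The splice-site observations in this entry can be checked mechanically. What follows is a minimal sketch in Python, assuming the usual mammalian donor consensus GTRAGT for the first six intron bases (R = A or G) and the invariant AG at the intron end. The function names are hypothetical and are not part of any tool used for this entry.

```python
# Consensus for the first six intron bases of a splice donor: G T R A G T
# (R = purine). This consensus is an assumption of the sketch, not a
# statement taken from the entry itself.
DONOR_CONSENSUS = ["G", "T", "AG", "A", "G", "T"]


def donor_mismatches(intron_start):
    """Return the 1-based intron positions that deviate from the consensus."""
    bases = intron_start.upper()[:6]
    return [
        i + 1
        for i, (base, allowed) in enumerate(zip(bases, DONOR_CONSENSUS))
        if base not in allowed
    ]


def acceptor_ok(intron_end):
    """True if the intron ends with the invariant AG dinucleotide."""
    return intron_end.upper().endswith("AG")


def keeps_frame(shift_bp):
    """True if moving a splice site by shift_bp preserves the reading frame."""
    return shift_bp % 3 == 0


print(donor_mismatches("GTGAGT"))  # donor used by the fully spliced class
print(donor_mismatches("GTGTGT"))  # alternative donor inside the intron
print(acceptor_ok("TCTTAG"), acceptor_ok("TAAG"), acceptor_ok("TTAG"))
print(keeps_frame(24), 24 // 3)    # murine alternative acceptor, 24 bp upstream
```

Under these assumptions only position 4 of the alternative donor deviates from the consensus, and the 24 bp shift of the murine acceptor adds 8 codons without changing the reading frame.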
WI-17491.
Last revised/checked: July 2, 1997
Most recent EST: GB:AA378659
Unigenelink
Sequence name: wi17491.seq
Genomic sequence: wi17491g.seq
Protein sequence: wi17491.pep
Other species: mus17491.seq
Most recent EST: GB:AA407990
Genomic sequence: m17491g.seq
Protein sequence: mus17491.pep
A 1.0 kb sequence. The same gene as covered by WI-15073 (see
above), which is more to the 3' end of the cDNA than WI-17491.
GB:W23551 skips an exon. The same exon is present in the unspliced,
but polyadenylated EST GB:AA017325, which features in the genomic
sequence wi17491g.seq together with some additional genomic clones.
The genomic sequence now contains one complete and one partial
intron. The first intron is also present in the murine clone GB:AA389117.
The partial open reading frame (the 5' end of the contig is GC-rich
and the problems are not yet completely resolved) of WI-17491
shows a weak similarity to the yeast myosin-like protein
(GP:L01992). The extreme 3' end of the WI-17491 mRNA probably overlaps
with the 3' end of another gene (WI-17491b) that is identical to NF-AT
(or NF45), a transcription factor (GB:U10323). The overlap is only
about 60 bp and the two polyadenylation signals are contained in the
sequence tTTTATTcAATAAAt. The STS SHGC-11878 (GB:11259) is located
in the NF-AT cDNA. The NF-AT cDNA has also been mapped to 1q21.
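The statement that the short overlap tTTTATTcAATAAAt holds the polyadenylation signals of both genes can be illustrated with a small sketch: AATAAA is read on the top strand, and TTTATT is its reverse complement, i.e. an AATAAA signal on the bottom strand. The helper names below are hypothetical, and the canonical AATAAA hexamer is the only signal considered.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_all(seq, motif):
    """0-based start positions of every (possibly overlapping) occurrence."""
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)
    return positions


overlap = "tTTTATTcAATAAAt".upper()
signal = "AATAAA"

# Signal as read on the top strand.
top_hits = find_all(overlap, signal)
# Signal on the bottom strand appears as its reverse complement on the top strand.
bottom_hits = find_all(overlap, reverse_complement(signal))

print(reverse_complement(signal), top_hits, bottom_hits)
```

The two hexamers do not overlap each other in this 15 bp stretch, which is consistent with two converging transcripts each ending at its own signal.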
SGC34808.
Last revised/checked: May 27, 1997
Unigenelink
Sequence name: wi6812.seq
Probably the 3' UTR of the human IL6 receptor and part of the
same cDNA as WI-6812 (see below), but the STS itself is far from
STS WI-6812 (GB:G05547) in this contig.
WI-6812.
Last revised/checked: May 27, 1997
Most recent EST: GB:AA381253
Unigenelink
Sequence name: wi6812.seq
The present sequence is more than 2.2 kb in length and probably
represents the 3' UTR of the human IL6 receptor, since it is similar
to the same region of the rat IL6 receptor (GB:M58587), which is
several kb long (see also SGC34808 above). The sequence of the
3' UTR of the human IL6 receptor cDNA is not available in the
database. The IL6 receptor gene is known to be located in the
1q21 region, however. Interestingly, the IL6 receptor gene is most
related to the CNTF receptor and the IL11 receptor, which are both
located on Chr9 and lie in close proximity to each other. The
IB3580 gene, which is present on the Whitehead YAC-STS map and which
is represented by SGC33740 on the RH map (see directly below) also
has a counterpart on Chr 9. This suggests that part of 1q21 and
part of Chr9 have a common evolutionary origin. They might e.g. be
the result of an ancient duplication event. WI-6812 alias D1S2375
has been mapped to YACs 778-H-2 (D1S303, WI-7978) and probably
811-H-6 (WI-7978).
SGC33740.
Last revised/checked: May 26, 1997
Most recent EST: GB:AA410854
Unigenelink
Sequence name: sgc33740.seq
Genomic sequence: s33740g.seq
Protein sequence: sgc33740.pep
Other species: mus33740.seq
Most recent EST: GB:AA403904
Protein sequence: mus33740.pep
This sequence is a very complex transcription unit. The present
cDNA contig contains GB:D63478, the KIAA0144 myeloblast mRNA. The
protein encoded by this mRNA is extremely Ser/Thr-rich and is highly
conserved in the mouse. However, it turns out that the cDNA present
in the database entry is just one of a number of possible splice
variants. At present, there seem to be three different 5' ends that
diverge from one another at the same position in the cDNA. At present,
each end is represented by a single EST. The GB:D63478 5' end is
represented by GB:N76101, which extends more to the 5' end. It seems
likely that this clone is a (truncated) genomic clone, since it is an
s1-clone (that normally corresponds to the 3' end of a cDNA) and because
the other end of this clone represents the 5' end of the TIGR-A002G29
contig (see above). Other 5' ends are represented by GB:AA341150 and
GB:AA247130. Both clones diverge from the GB:D63478 sequence 5' from
the sequence TAGGCGCAG|TATTCTACC. The unique parts are very GC-rich,
but are not related to one another.
At the 3' end the situation is also complex. Several ESTs end at the
same polyadenylation signal as GB:D63478, but a large number are derived
from an alternative 3' end, that is reached by alternative splicing. This
alternative 3' end is also present in the murine sequence. Since the
splicing takes place in the open reading frame of GB:D63478 (at the
sequence GATACAACACTG|GAAGAAAA), the resulting protein will also have an
alternative carboxyl-terminus. However, it has been very difficult
to establish the correct reading frame, because the conservation
with the mouse sequence is so extremely high that there are many
regions that do not contain a single third position change. However,
frame shifts do occur in this region (based on discrepancies between
the ESTs and on the position of third position changes further
downstream), which makes it hard to identify the exact site of
the mistake. Another factor that makes life rather complex is
the presence of quite a number of genomic fragments (or of mRNAs
that have read through the first polyadenylation signal), that
made it difficult at first to discern the meaning of all these
sequences. A definitive assignment of the correct reading frame
must therefore await the appearance of some additional ESTs in this
region. The two Chr1-markers SGC33740 and IB3580
(the latter present on the Whitehead YAC-STS map) are located
in the alternative 3' untranslated region. SGC33740, D1S305, and
WI-7978 (NIB241) are all present on the same YACs (950_e_2 and
951_f_6) and have very similar RH coordinates, suggesting that they
are really quite close to one another.
A highly related sequence is present on Chr9. Its sequence is
available upon request.
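The use of third position changes to find the reading frame, as described in this entry, can be sketched as follows. The idea is that in a conserved coding region most human/mouse mismatches fall on the third codon position, so the phase that collects the mismatches points to the frame. The sequences below are hypothetical toy data, not the SGC33740 sequences, and the sketch assumes an ungapped alignment of equal length.

```python
def mismatches_by_phase(seq_a, seq_b):
    """Count mismatches at alignment index % 3 == 0, 1, 2 (ungapped input)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    counts = [0, 0, 0]
    for i, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper())):
        if a != b:
            counts[i % 3] += 1
    return counts


def likely_frame(counts):
    """Frame offset (0, 1 or 2) whose third codon position collects the
    most mismatches, or None when the counts are uninformative."""
    best = max(counts)
    if counts.count(best) != 1:
        # No mismatches at all, or a tie: the region cannot decide the frame.
        return None
    wobble_phase = counts.index(best)
    # Codons start two bases before their third position.
    return (wobble_phase + 1) % 3


# Hypothetical toy alignment: changes only at every third base.
human_toy = "GCTGCTGCTGCT"
mouse_toy = "GCAGCCGCGGCT"

counts = mismatches_by_phase(human_toy, mouse_toy)
print(counts, likely_frame(counts))
```

A stretch without a single third position change returns None here, which mirrors the difficulty noted above: extremely high conservation leaves nothing to anchor the frame on.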
SGC32043.
Last revised/checked: June 26, 1997
Most recent EST: GB:R38652
Unigenelink
Sequence name: sgc32043.seq
A 0.96 kb sequence with a single gap. Only represented by the
5' and 3' ends of two EST clones. No clear similarity to other
sequences, except for the last 300 bp, which are derived from a
repeated sequence, that is present several times in the database.
D1S305.
Last revised/checked: Dec 9, 1996
Sequence name: d1s305.seq
One of the few genetic markers in this region that has
also been mapped using radiation hybrids. It is also known
as AFM220xf8. This marker has also been used in the YAC-STS
project and maps to the same doubly-linked contig as WI-7978
(directly below) and IB3580 (=SGC33740).
WI-7978.
Last revised/checked: June 30, 1997
Unigenelink
Sequence name: nib241.seq
The DRADA gene. See below at NIB241. WI-7978 is present in
the YAC-STS map of the Whitehead lab and is present on the same
YACs (950-E-2 and 951-F-6) as IB3580 (SGC33740) and D1S305. Both
markers are present in the RH map as well. Other YACs that contain
the WI-7978 gene are 778-H-2 (D1S2375 (=WI-6812), D1S303) and
811-H-6 (WI-6812).
SGC32441.
Last revised/checked: June 17, 1997
Most recent EST: GB:F11455
Unigenelink
Sequence name: sgc32441.seq
A 0.65 kb sequence, that consists of the 5' and 3' end of
a single EST clone. The lower strand might encode a protein
with some similarity to RING proteins, but this observation
is not very trustworthy.
NIB241.
Last revised/checked: June 30, 1997
Unigenelink
Sequence name: nib241.seq
Genomic sequence: nib241g.seq
Protein sequence: nib241.pep
Identical to the double-stranded RNA adenosine deaminase
gene (DRADA; GB:U18121 and GB:U10439), that is also covered by
WI-7978 (see over there for some details on the position of this
gene in the Whitehead YAC-STS map). A 6.5 kb mRNA in total.
The genomic sequence contains additional 5' and 3' sequences
based on GB:U32571 and GB:U32347 and some genomic sequences
based on various EST sequences. The EST yv25d07 is most probably
a genomic sequence, whereas the status of AA:057407 and AA449679
is less clear. They contain the same sequence divergence and are both in
the reverse orientation, but the site of sequence divergence is not
compatible with a splice acceptor site. Finally, GB:AA096321 skips
a 2 kb part of the mRNA, possibly as a result of an internal deletion,
because the sequence CTTCT is present at both ends of the deletion,
suggesting internal recombination.
GCT15E11.
Last revised/checked: June 26, 1997
Sequence name: wi16548.seq
A simple sequence repeat from CHLC, that is part of the WI-16548
cDNA (see above). It is possibly located on YAC 955-E-11, which is also
positive for a number of other 1q21 STSs, such as D1S2346, WI-8650,
WI-6071 (S100A9), IB3262, WI-4536, WI-9711, WI-9245, WI-7842 (the
latter two STSs are derived from SPRR genes), and WI-8190.
WI-12606.
Last revised/checked: Dec 12, 1996
Most recent EST: GB:AA137125
Unigenelink
Sequence name: wi12606.seq
Genomic sequence:
The THBS3 or thrombospondin3 gene (GB:L38969). This gene is known
to reside about 3 kb upstream of MUC1 (WI-5995; GB:J05581; see
directly below) in the same transcriptional orientation (Vos et al.,
J. Biol. Chem. 267, 12192-12196 (1992)). Several ESTs (ym97f11,
yo38e08 and yo36f07) are unspliced and in the reverse orientation
with respect to the mRNA and probably represent genomic DNA
fragments.
WI-5995.
Last revised/checked: Sept 20, 1996
Most recent EST: GB:
Unigenelink
Sequence name: wi5995.seq
Genomic sequence:
The MUC1 gene (GB:J05581), present 3 kb downstream
of the THBS3 gene (WI-12606; GB:L38969; see directly above).
Also mapped to YAC 887_h_8 by the Whitehead people. This YAC is
also positive for GATA-P19287 (=GATA85H08), IB1251 and is
probably positive for D1S2358 (WI-6296), IB708 (not RH mapped) and
WI-7160 (the glucocerebrosidase gene GBA), which is about 30 kb from
the MUC1 gene (Long et al., Genomics 33, 177-184 (1996)). The MUC1
gene encodes mucin1 or episialin, an O-glycosylated protein that is
overexpressed in many carcinomas.
WI-16155.
Last revised/checked: June 17, 1997
Most recent EST: GB:AA418434
Unigenelink
Sequence name: wi16155.seq
Genomic sequence: wi16155g.seq
Protein sequence: wi16155.pep
Other species: mus16155.seq
Most recent EST: GB:AA397083
Protein sequence: mus16155.pep
Other species: pig16155.seq
A very complex arrangement of genes and polyadenylation sites. At
present, the best interpretation seems to be that this marker is
located in a region of overlap between the 3' ends of two
transcription units. One of the transcription units (wi16155) is
defined by only two ESTs, GB:H24035 (ym54a02.s1) and GB:AA056343
(zl66f04.s1). The other transcription unit is SGC34058 (see above).
Because there are many ESTs derived from this unit (as judged from
their reverse orientation with respect to the WI-16155 sequence) it
is still possible to derive the entire last exon from the latter
gene. The exon border is defined by the murine EST GB:387333, that
is highly similar to wi16155/sgc34058 sequence downstream from
a point in the sequence that is a very good splice acceptor site:
CCCTCCTCCTTTCTTAG|GTTTCCC. In addition, the pattern of mismatches
between the human and murine sequences clearly indicates that this
region is coding on the WI-16155 strand. The murine sequence can be
extended quite a bit to the 5' end, although this is in part based on
the sequence from an EST, GB:W48478, that does not perfectly match
the two files that overlap with it on both sides. The GB:W48478
sequence is derived from a Life Tech mouse cDNA; these cDNAs for some
reason sometimes (certainly not always!) show a number of differences with
respect to the sequences from the Merck-mice. This means that this part
of the mus16155-sequence should be considered as tentative, although it
is not excluded that the final sequence will be very similar to the
present one in this region. The putative protein product is not obviously
similar to anything in the protein database.
The information given above is likely to be more or less correct,
the remaining part, however, is not guaranteed to remain in the contig
much longer. The problem is the following: The r1-files of the two
s1-files that contain the WI-16155 polyadenylation signal, seem to
be derived from different cDNAs. Even worse, it is impossible to decide
which one is correct, and it is even possible that both sequences
are incorrect. The upstream part of the file that has been used for
STS WI-16155, ym54a02.r1 (GB:H22755), is part of a cDNA encoding a
protein that is clearly conserved in mouse and pig (GB:F15055) and
is cys-rich. In fact, there are four repeats of a unit that consists
of two cysteines in a CX9C unit that is clearly conserved on other
positions as well. The same organization can be found in the yeast
(S. cerevisiae) ORF YDR031w protein (gp:Z74327), which encodes an
117 aa hypothetical protein, that contains at least three units
encoding CX9CIRD/E.
The other file, that is clearly derived from the same cDNA
as STS WI-16155, zl66f04.s1, has an r-file (GB:AA056710) that is
present in the cDNA encoding the tetraspan transmembrane protein
SFA-1. Neither this cDNA, which has been completely sequenced
(GB:D29963/GB:U14650), nor the many overlapping ESTs show clear
evidence for alternative splicing/polyadenylation, that might
connect them to the WI-16155 3' end. The chromosomal localization
of SFA-1 is unfortunately not known.
The situation for the other candidate, the CX9C protein mRNA,
is more complex. In this case, there is evidence for alternative
splicing and polyadenylation. Some clones are alternatively spliced
and have 3' ends (s1-clones) that end in an L1-like sequence.
When I screen this 3' end against the EST database, I get some
matching ESTs and even an STS, that are clearly identical except
for some likely sequence errors. Their scores are clearly above
the background of other L1 repeats present in the genome. The
STS is present on Chr2. This would clearly rule out the CX9C protein
as a candidate for the 5' end of WI-16155, were it not for two facts:
there is a possibility that I am looking at a very highly related L1
repeat, and one of the few clones that shows such a high similarity
is named ym55a02.s1, whereas WI-16155 is named ym54a02.s1. Since
I have encountered several instances in which the numbers of two
different ESTs had been interchanged, I have obviously tried to
see whether ym55a02.r1 is present in SFA-1. This is not the case;
rather, it is present in yet another sequence that shows some
similarity to a C. elegans protein, but on the bottom strand. I
have therefore decided that the situation is too confusing to merit
any further work at this stage (this literally takes hours of work).
Additional ESTs matching the WI-16155 3' end are required before this
frustrating problem can be resolved. For the moment, the CX9C cDNA is
still contained in the WI-16155 file.
Some additional data on the CX9C protein:
The WI-16155 cDNA is also alternatively spliced (e.g. GB:N39730). The
present 5' end of the transcript overlaps with three virtually
identical CpG island clones, HS34E8R, HS26H3R and HS96E1R (GB:Z60793,
GB:Z55221, GB:Z66422, respectively). The other
ends (F-files) of these clones, that should be located in a more
downstream direction, do not overlap with other database sequences,
except each other. The 5' end of wi16155.seq also overlaps with an
EST (GB:AA236097) in the reverse orientation. The sequence of this
EST is virtually identical to that of the CpG clones, but the
similarity abruptly ends at a position that is compatible with a
splice acceptor sequence ON THE BOTTOM STRAND. Although this makes
the sequence difficult to interpret, the other (s1) end of
GB:AA236097 has been included in the genomic wi16155g.seq-file.
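The repeated CX9C unit described for this protein (two cysteines separated by nine residues) is easy to locate with a regular expression. The sketch below uses a hypothetical toy peptide, not the wi16155.pep sequence, and a lookahead so that overlapping units would also be reported.

```python
import re

# C, any nine residues, C. The lookahead makes overlapping matches visible.
CX9C = re.compile(r"(?=(C[A-Z]{9}C))")


def find_cx9c(peptide):
    """Return (0-based start, matched unit) for every CX9C unit."""
    return [(m.start(), m.group(1)) for m in CX9C.finditer(peptide.upper())]


# Hypothetical toy peptide with two CX9C units, each followed by IRD/IRE.
toy_peptide = "MACKLMNPQRSTCIRDGGCAEFGHIKLMCIRE"

for start, unit in find_cx9c(toy_peptide):
    print(start, unit, toy_peptide[start + 11:start + 14])
```

Printing the three residues after each unit makes it simple to see whether the CX9CIRD/E extension mentioned for the yeast protein is present as well.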
WI-15725.
Last revised/checked: June 17, 1997
Most recent EST: GB:H04931
Unigenelink
Sequence name: wi15725.seq
A 1.0 kb sequence with a central gap, that is covered by only two
ESTs (four sequences).
WI-6903.
Last revised/checked: June 13, 1997
Most recent EST: GB:AA448423
Unigenelink
Sequence name: wi6903.seq
Genomic sequence: wi6903g.seq
Protein sequence: wi6903.pep
Other species: mus6903.seq
Most recent EST: GB:AA388176
Protein sequence: mus6903.pep
Other species: rat6903.seq
Related to the rat SCAMP37 gene (GB:L22079), but not its
counterpart, since another EST contig is much more closely
related to the rat protein. SCAMP stands for Secretory Carrier
Membrane Protein 37. Most of the frame shifts in the human cDNA
seem to have been corrected by now, but there is still a region just
5' of the putative stop codon that is still relatively low in
quality. EST sequences GB:N32438 and GB:AA164368 skip an
approximately 80 bp exon in the human sequence. The position
of two other introns can be inferred from GB:H25083,
which contains an exon bordered by splice consensus
sites. WI-6903 has also been mapped on the Whitehead YAC-STS
map to YAC 795_h_5, which probably also contains GATA-P19287
(GATA85H08), D1S2358 (WI-6296; directly below), and IB1251.
STS TIGR-A004W05 is also a member of the WI-6903 contig,
whereas chicken STS ADL277 (GB:G01697) is possibly derived
from a similar chicken gene. The WI-6903 protein is also related to
a C. elegans protein (gp:AF003739).
WI-6296.
Last revised/checked: June 18, 1997
Most recent EST: GB:AA311650
Unigenelink
Sequence name: wi6296.seq
Genomic sequence:
Protein sequence: wi6296.pep
Other species: mus6296.seq
Most recent EST: GB:AA265237
Genomic sequence: mus6296g.seq
Protein sequence: mus6296.pep
The human sequence does not show any clear similarities to other
sequences in the database. Comparison with the 1.2 kb murine sequence
clearly indicates that both contigs contain coding sequences. In fact,
even the 3' UTRs are quite well conserved. No clear similarities to
other proteins are discernible. A mouse clone (GB:AA198177) contains
a sequence that is absent from the human clones and that is bordered
by consensus splice sites (CAG|GTCAGG....TTTCTTTCTCTCTCTCTGCAG|GG).
This sequence has been removed from mus6296.seq, but is present in
mus6296g.seq. The mouse clone GB:AA168722 is probably human in origin,
since it is identical to the human sequence (except for some sequencing
errors). WI-6296 has also been mapped to various YACs by the
Whitehead group. It is probably present (all hits are listed as
ambiguous, but these are clearly the best candidates) on 796-H-5,
887-H-8 and 954-A-11. At least one of these YACs contains the
markers WI-6903, WI-7160, GATA85H08, IB1251, WI-5995, IB708,
D1S305, and WI-7978, many of which are present in the vicinity
of WI-6296 (=D1S2358) on the RH map as well.
WI-14846.
Last revised/checked: Sept 20, 1996
Most recent EST: GB:
Unigenelink
Sequence name: wi14846.seq
Genomic sequence:
Identical to GB:L77213, phosphomevalonate kinase, a 1.0 kb cDNA.
TIGR-A002G08.
Last revised/checked: Sept 20, 1996
Most recent EST: GB:
Unigenelink
Sequence name: t002g08.seq
This EST is derived from the DAP-3 gene, that encodes an
ionizing radiation resistance conferring protein (GB:X83544,
GB:U18321, the C-termini of the proteins in these files differ due
to a frame-shift in one of them). The protein has also been
identified as a mediator of gamma-interferon-mediated cell death.
The mRNA is 1.6 kb long and the protein is about 385 aa.
GATA85H08.
Last revised/checked: Sept 20, 1996
Sequence name: gata85h08.seq
A CHLC tetranucleotide repeat sequence, close to an Alu repeat.
Quite similar to GATA25B02 (GB:G07801), which has been mapped on
1p. This might be an error, but there is some evidence for a duplication
of these parts. GATA85H08 is identical to GATA-P19287 in the
Whitehead YAC-STS map and maps to the same YAC (876_B_11) as
IB1251 (directly below) and WI-8330 (WI-7325) which has been mapped
some distance away from the former two markers. D1S303 is probably
also present on these YACs.
IB1251.
Last revised/checked: Feb 6, 1997
Most recent EST: GB:
Unigenelink
Sequence name: ib1251.seq
This sequence is represented by many ESTs and by GB:D38522
(KIAA0080), which is a 4001 bp long sequence that encodes the
last 105 amino acids of a protein that is most similar to mouse and
rat synaptotagmin IV, which is 425 aa. None of the ESTs reach into
the coding region, which is 3.5 kb upstream from the polyadenylation
site. The level of similarity of the rodent sequences to D38522
is not extremely high (59%), which suggests that the D38522 may
represent another synaptotagmin gene. The 3' UTR contains an
Alu-repeat (pos. 530-866), which has been masked by Ns. IB1251 has
also been used for the Whitehead YAC-STS map, and it is located close
to D1S303, GATA85H08 and WI-8330/WI-7325. The sequence also covers
TIGR-A002N39 (see below).
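Masking a repeat with Ns, as done for the Alu-repeat at pos. 530-866, amounts to overwriting a coordinate range. A minimal sketch, assuming the positions are 1-based and inclusive (the entry does not state the convention) and using a hypothetical placeholder sequence rather than ib1251.seq:

```python
def mask_region(seq, start, end):
    """Replace positions start..end (1-based, inclusive) with N."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError("region lies outside the sequence")
    return seq[:start - 1] + "N" * (end - start + 1) + seq[end:]


# Hypothetical placeholder sequence of 1000 bases.
placeholder = "A" * 1000
masked = mask_region(placeholder, 530, 866)

print(len(masked), masked.count("N"))
```

Masking keeps the sequence length and all downstream coordinates unchanged, so positions quoted elsewhere for the same file remain valid after the repeat has been hidden from similarity searches.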
SGC34121.
Last revised/checked: June 17, 1997
Most recent EST: GB:AA403168
Unigenelink
Sequence name: sgc34121.seq
Genomic sequence:
Protein sequence: sgc34121.pep
Other species: mus34121.seq
Most recent EST: GB:AA461783
Protein sequence: mus34121.pep
The human sequence is 2.1 kb in length and has a single gap, the
mouse sequence is 1750 bp long at present. The human EST GB:W17304
has a deletion with respect to GB:AA100623 and several overlapping
mouse clones (GB:W59271 and several others). GB:AA100623 also skips
an exon, but at another location. EST GB:AA403168 at present seems
to be either chimeric or genomic. In the latter case, the 3' end of
the intron (the splice acceptor site) would be remarkably like the
3' end of the preceding exon. The putative protein product of
the SGC34121 cDNA is very proline-rich and highly conserved between
humans and mice. It does show significant similarity to the hypothetical
S. pombe protein C30D11.14 (sp:Q09911). The murine sequence consists
of two parts, both of which are highly similar to the human sequence,
but there is no formal proof that both parts are derived from the same
mRNA. Two murine clones, GB:AA138556 and GB:AA154458 show alternative
splicing or rearrangements. The latter possibility is quite realistic
in this case, because the two ends of the deletion show a high level
of similarity, suggesting that a recombination may have occurred.
TIGR-A002N39.
Last revised/checked: Feb 6, 1997
Unigenelink
Sequence name: ib1251.seq
Same sequence as IB1251 above.
SGC30383.
Last revised/checked: April 28, 1997
Most recent EST: GB:AA167653
Unigenelink
Sequence name: sgc30383.seq
Genomic sequence:
Protein sequence:
Other species:
Most recent EST: GB:
Protein sequence:
A 2.4 kb sequence but this contig may be incorrectly assembled;
its structure is entirely dependent on the correct assignment of
the two zq39b01 sequences (GB:AA167653 and GB:AA166633). This EST
is the only EST that connects the 5' and 3' parts of the contig.
The 3' part is similar, but not identical to a sequence on Chr11, that
is present in the PAC pDJ356d6 (GB:AC002036). It probably represents
a mildly repetitive sequence.
SGC34568.
Last revised/checked: May 30, 1997
Most recent EST: GB:AA379449
Unigenelink
Sequence name: sgc34568.seq
A 1.3 kb sequence without any clear characteristics. Most ESTs
are derived from a fetal brain library. There is no obvious
similarity to other database entries on the protein level.
WI-8330.
Last revised/checked: Sept 20, 1996
Most recent EST: GB:
Unigenelink
Sequence name: wi8330.seq
Genomic sequence:
The gamma subunit of chaperonin TCP-1 or Cctg (chaperonin
containing TCP-1, gamma subunit) (GB:U17104 and GB:X74801).
The equivalent mouse cDNA is present in GB:L20509 (matricin) and
GB:Z31556. WI-7325, which has been used for the Whitehead YAC-STS
map (just like WI-8330), is another STS covered by this cDNA.
WI-7325 has been mapped to YACs 876-B-11 (also positive for
WI-8330 (this entry), GATA85H08, IB1251, IB708 and D1S303)
and 927-B-9 (positive for WI-9029 (lamin A, not present on
the RH-map) and IB708).
WI-11851.
Last revised/checked: March 5, 1997
Most recent EST: GB:AA218694
Unigenelink
Sequence name: wi11851.seq
Genomic sequence:
Protein sequence: wi11851.pep
Other species: mus11851.seq
Most recent EST: GB:AA208332
Protein sequence: mus11851.pep
Other species: rab11851.seq
Protein sequence: rab11851.pep
A 1.1 kb sequence that is the human counterpart of the
rabbit rab25 protein (gb:L03303; sp:P46629). The 213 aa
human protein is 93% identical to the rabbit protein. Two ESTs,
GB:AA195079 and GB:W25368 show alternative splicing or
rearrangements. STS A005X41 (GB:G20582) is also a member
of this contig.