

Chromosome 1 mapping and sequencing at The Sanger Centre

(Prepared by Simon Gregory)


Dramatic progress has been made in the construction of the physical map and the generation of sequence data since the last chromosome 1 workshop, C1W5 (White et al., 1999a). Supplementing the Sanger Centre's chromosome-specific physical map strategy (Gregory et al., 1998; D.R. Bentley, manuscript in preparation) with whole-genome fingerprinting resources (Marra et al., 1997; J. McPherson, manuscript in preparation) increased the physical map coverage from 165 Mb in August 1999 to 223 Mb in September 2000 (Fig. 1). Minimum tiling path clones from the physical map were the primary source for the production of 'working draft' and 'finished' sequence data.


Mapping

To date, a total of 8375 STSs (64% RH mapped) have been placed on the clone map. The increase of 3354 markers over last year's total can largely be attributed to in silico positioning of novel STSs by electronic PCR (ePCR) against finished or unfinished sequence. The criteria for an ePCR match require that both primers of a pair have 100% sequence identity to the genomic sequence (on opposite strands) and that the two primers lie within 1 kb of one another. Without accounting for potential clustering of markers, the current collection of localized STSs gives the map a marker density of 1 STS/29 kb (8375 STSs across the 243 Mb euchromatic portion of chromosome 1), well above our target density of 1 STS/75 kb.
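The ePCR matching criteria described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the software used for the chromosome 1 map: the function names are hypothetical, the primers are assumed to be given 5' to 3', and "within 1 kb" is interpreted here as a total product span of at most 1000 bp.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def find_all(haystack: str, needle: str) -> list:
    """Return every start index at which needle occurs exactly in haystack."""
    positions = []
    start = haystack.find(needle)
    while start != -1:
        positions.append(start)
        start = haystack.find(needle, start + 1)
    return positions


def epcr_hits(genomic: str, forward: str, reverse: str, max_span: int = 1000) -> list:
    """Hypothetical ePCR check: both primers must match with 100% identity,
    on opposite strands, and the product must span at most max_span bases
    (assumed reading of the 1 kb criterion)."""
    genomic = genomic.upper()
    forward, reverse = forward.upper(), reverse.upper()
    hits = set()
    # Try both orientations of the STS relative to the genomic sequence.
    for left, right in ((forward, reverse), (reverse, forward)):
        # A primer on the opposite strand appears as its reverse complement.
        right_rc = reverse_complement(right)
        for i in find_all(genomic, left):
            for j in find_all(genomic, right_rc):
                end = j + len(right_rc)
                # The second primer must lie downstream without overlapping.
                if j >= i + len(left) and end - i <= max_span:
                    hits.add((i, end))
    return sorted(hits)


# Toy sequence (invented for illustration, not real chromosome 1 data).
toy_genomic = "TT" + "ATGCCGTT" + "C" * 10 + "GAAGATCC" + "AA"
toy_hits = epcr_hits(toy_genomic, "ATGCCGTT", "GGATCTTC")
```

Each hit is reported as a (start, end) interval on the genomic sequence. Requiring exact identity keeps false placements low at the cost of missing markers that overlap sequencing errors in unfinished sequence.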

Chromosome coverage of the physical map was estimated in two ways: from the STS content of existing bacterial clone contigs, and from cumulative contig size compared with the estimated size of chromosome 1. A total of 4849 STSs (94% of RH mapped markers) were placed on the physical map either by ePCR or by experimental association with bacterial clones within contigs. The placement of 89% of 850 randomly derived flow sorted STSs (generated during the course of the project) onto the physical map gives a less biased approximation of STS localization. Summing the PAC and BAC clones that contribute to the minimum tiling paths of the contigs yields an estimated coverage of 223 Mb (92% of the euchromatic region of the chromosome), with an average contig size of 2.1 Mb.
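As a worked check of the size-based estimate, the short Python sketch below recomputes the coverage percentage from the figures quoted in the text (223 Mb covered, 243 Mb euchromatic). The function names are hypothetical, and the contig count it derives is only implied by the reported average contig size; it is not a number reported here.

```python
EUCHROMATIC_MB = 243.0   # euchromatic size of chromosome 1 quoted in the text
COVERED_MB = 223.0       # minimum tiling path coverage quoted in the text
MEAN_CONTIG_MB = 2.1     # average contig size quoted in the text


def coverage_percent(covered_mb: float, total_mb: float) -> float:
    """Percentage of the region covered by the cumulative contig size."""
    if total_mb <= 0:
        raise ValueError("total_mb must be positive")
    return 100.0 * covered_mb / total_mb


def implied_contig_count(covered_mb: float, mean_contig_mb: float) -> int:
    """Contig count implied by total coverage and mean contig size
    (an inference for illustration, not a reported figure)."""
    if mean_contig_mb <= 0:
        raise ValueError("mean_contig_mb must be positive")
    return round(covered_mb / mean_contig_mb)


percent = coverage_percent(COVERED_MB, EUCHROMATIC_MB)
contigs = implied_contig_count(COVERED_MB, MEAN_CONTIG_MB)
print(f"coverage: {percent:.0f}% of the euchromatic region")
print(f"implied number of contigs: about {contigs}")
```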

We are continuing to close gaps between contigs by generating novel STSs from publicly available or in-house end sequences and using them to walk across the 50 genomic equivalents represented by the RP11, RP13 or CIT BAC libraries.

Sequencing

There were several changes to the detail of the Sanger Centre sequencing strategy in the period following C1W5. Shotgun sequencing now utilizes only 1.4-2.2 kb inserts subcloned in a pUC vector as sequence templates and primarily Energy Transfer terminators for sequencing reactions. Molecular Dynamics MegaBACE or Applied Biosystems 3700 sequencing machines have replaced the more traditional slab gels of the Applied Biosystems 377s.

A total of 150 Mb of intermediate phase 'working draft' sequence and 24 Mb of 'finished' sequence have been generated since C1W5 (Fig. 1). The 'working draft' comprises genomic sequence from a minimum tiling path of bacterial clones in the physical map, determined at an average 3X depth of coverage. All clones currently represented by draft sequence will have complete shotgun sequence coverage by June 2001, as agreed by the Human Genome Project consortium. All chromosome 1 minimum tiling path clones are targeted for sequence 'finishing', to the agreed consortium standard (White et al., 1999a), by the end of 2002.
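To give a feel for what an average 3X depth of coverage means, the Python sketch below computes depth from read count and read length, and the expected fraction of a clone covered under the idealized Lander-Waterman model (reads placed uniformly at random). The model is a textbook approximation chosen for illustration, not the consortium's definition of 'working draft', and the read count, read length and clone size are hypothetical round numbers.

```python
import math


def depth_of_coverage(n_reads: int, mean_read_len: float, clone_len: float) -> float:
    """Average depth: total sequenced bases divided by clone length."""
    if clone_len <= 0:
        raise ValueError("clone_len must be positive")
    return n_reads * mean_read_len / clone_len


def expected_fraction_covered(depth: float) -> float:
    """Expected covered fraction under the idealized Lander-Waterman model,
    1 - exp(-depth); real shotgun data deviate from this assumption."""
    return 1.0 - math.exp(-depth)


# Hypothetical example: 900 reads of 500 bases from a 150 kb clone.
example_depth = depth_of_coverage(900, 500, 150_000)
example_fraction = expected_fraction_covered(example_depth)
print(f"depth: {example_depth:.1f}X, expected fraction covered: {example_fraction:.3f}")
```

Under this idealization a 3X draft is expected to leave roughly 5% of a clone unsampled, which is consistent with draft clones still requiring complete shotgun coverage and finishing.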

Analysis

Scale-up in the production of draft sequence data required the development of analysis tools to predict potential coding features within unfinished sequence; these features, in turn, may also give the order and orientation of contigs within a sequence clone. Ensembl, written by Ewan Birney and Tim Hubbard as a joint project between the EMBL-EBI and the Sanger Centre, provides automatic annotation of human genome data. Ensembl takes assembled DNA sequence as its primary information and then runs a number of computer programs to determine sequence annotation. In particular, Ensembl determines the location of genes, transcripts and exons, DNA repeats and STSs within the draft sequence data. The final results are stored in a database that is accessible through the Web (see Data access/release). Ensembl provides an iterative analysis of draft sequence data using frozen datasets from publicly available human genome data. The data set used in the analysis includes a large number of known genes, an estimated 75% of all genes. Analysis of available sequence data prior to C1W6 predicted 3711 genes with EST and protein homology, of which 2108 genes were supported by strong protein evidence.

We are continuing with detailed manual annotation and experimental analysis of 'finished' sequence clones (White et al., 1999a). A suite of prediction programs provides evidence for the existence of gene features, which are then supported by an experimental approach to isolate corresponding cDNA sequences. PCR-based screening of cDNA libraries is initiated upon strong evidence from in silico prediction. Both computational and experimental coding sequences are collated and graphically displayed in an ACEDB format (see Data access/release). A total of 30 Mb from 269 finished clones has been submitted to EMBL, an increase over C1W5 of 12 Mb and 99 clones, respectively.

Data access/release

Sanger Centre mapping and sequence data are released freely into the public domain. Guidelines and conditions on use of the data are available at:

The Sanger Centre chromosome 1 homepage contains links to Sanger Centre map and sequence data, including a chromosome 1 specific BLAST search:

A weekly dump of the chromosome 1ace database is available for query as Webace:

Finished and unfinished sequence data are available on the Sanger ftp site:

Ensembl automated analysis of draft sequence is linked from the Sanger Centre homepage:

or is available directly from the Ensembl homepage: