
The Sanger Centre human chromosome 1 project aims, in close collaboration with the international chromosome 1 community, to construct a comprehensive map of the entire chromosome that will include all genes and other biologically important features, down to the level of the DNA sequence itself.
The mapping and sequencing strategies and the initial targeted regions are described in detail in the report of the 3rd International chromosome 1 workshop (Vance et al.; 1997).
This report details the progress of the mapping and sequencing, describes chromosome 1 collaborations established with the Sanger Centre since the last workshop and provides information on data release and access.
Mapping
A landmark approach has been adopted for the construction of a sequence-ready chromosome 1 map. Publicly available markers and those derived from sequencing small insert libraries, at a combined density of 15 per Mb, are being placed on the Sanger Centre radiation hybrid (RH) map and used to screen large insert bacterial clone libraries. Further details on the RH strategy are available at: http://www.sanger.ac.uk/HGP/Rhmap/ and the latest version of the chromosome 1 RH map is available at: http://www.sanger.ac.uk/cgi-bin/rhtop?chr=1.
Fluorescent restriction digest fingerprinting (Gregory et al.; 1997) is used to determine the extent of overlap between clones of common marker content and to identify overlap between clones where no marker data is available. At present, screening with the 1p STS collection is in progress and will be followed shortly by similar screening for 1q. Progress at the different stages of the mapping pipeline in the period following the fourth chromosome workshop is illustrated in Figure 1.
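The idea of inferring clone overlap from fingerprints can be illustrated with a minimal sketch. It assumes each clone is represented by a list of restriction fragment sizes and that two bands are treated as the same fragment when their sizes agree within a fixed tolerance. The function names, the tolerance value and the simple shared-band fraction are illustrative assumptions; they are not taken from Gregory et al. and are not the scoring used by the Sanger Centre software.

```python
def shared_bands(bands_a, bands_b, tolerance=1.0):
    """Count bands in clone A that pair with a distinct band in clone B.

    Two bands are paired when their sizes differ by at most `tolerance`
    (hypothetical units; the tolerance is an assumption of this sketch).
    """
    a = sorted(bands_a)
    b = sorted(bands_b)
    shared = 0
    j = 0
    for size in a:
        # Skip bands in B that are too small to match the current band.
        while j < len(b) and b[j] < size - tolerance:
            j += 1
        if j < len(b) and abs(b[j] - size) <= tolerance:
            shared += 1
            j += 1  # each band in B is used at most once
    return shared


def overlap_fraction(bands_a, bands_b, tolerance=1.0):
    """Shared bands as a fraction of the smaller fingerprint."""
    smaller = min(len(bands_a), len(bands_b))
    if smaller == 0:
        return 0.0
    return shared_bands(bands_a, bands_b, tolerance) / smaller


# Made-up fingerprints for two hypothetical clones.
clone_a = [100.0, 152.5, 210.0, 305.0]
clone_b = [100.4, 210.8, 400.0]
print(shared_bands(clone_a, clone_b))
print(overlap_fraction(clone_a, clone_b))
```

A high shared-band fraction between two clones suggests, but does not prove, that they overlap; in practice a statistical score that accounts for chance matches would be preferred over this raw fraction.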
Sequencing
The main approach continues to be the shotgun sequencing of 1.4-2.2 kb fragments from minimally overlapping PAC and BAC clones, subcloned into M13 and pUC vectors. In the last year the proportion of pUC clones sequenced per project has increased to about 80%. Big Dye terminator chemistry (PE) has largely replaced previous dye terminator sequencing chemistries, resulting in an average increase in read length of 50-100 bases. All data from the sequencing machines is now transferred automatically to the UNIX network once it is collected. A detailed description of the process and software used can be found at: http://www.sanger.ac.uk/Software/Sequencing/ASD/overview.shtml
XGAP (Staden) has been replaced as the editing interface by the new GAP4 program (Staden). Further details can be found at: http://www.mrc-lmb.cam.ac.uk:80/pubseq/. Sequencing progress is shown in Figure 2. A total of 4.8 Mb of finished, annotated sequence has been submitted to the public databases and a further 5.4 Mb is available as unfinished sequence. DNA databases containing the available human chromosome-specific sequence data can be searched using the Sanger Centre human chromosome-specific blast server at: http://www.sanger.ac.uk/HGP/Chrom_blast_server.shtml.
The searchable data contains:
i) Sequence submitted to the EMBL database and all unsubmitted finished sequence from the Sanger Centre, and submitted sequence supplied by collaborators on the chromosome-specific sequencing projects.
ii) Unfinished sequence contigs over one thousand bases from the Sanger Centre.
iii) Genomic, cDNA, EST and other sequences extracted from the EMBL database, or produced at the Sanger Centre.
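The length cut-off in item ii) can be illustrated with a minimal sketch that keeps only contigs longer than one thousand bases. It assumes the contigs are held in FASTA format; the function names, the contig names and the example records are hypothetical and do not describe how the Sanger Centre builds its searchable database.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def contigs_over(lines, min_length=1000):
    """Return contigs whose length is strictly greater than min_length."""
    return [(h, s) for h, s in read_fasta(lines) if len(s) > min_length]


# Made-up records: one contig at the cut-off, one just above it.
example = [
    ">contig_short",
    "A" * 1000,
    ">contig_long",
    "ACGT" * 250,
    "A",
]
for name, seq in contigs_over(example):
    print(name, len(seq))
```

The comparison is strict because the text specifies contigs "over" one thousand bases; a contig of exactly 1000 bases is excluded in this sketch.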
Analysis
The strategy for sequence analysis involves semi-automatic analysis of finished sequence by various search and prediction tools, interactive annotation of genes and features, and generation of an EMBL entry. The automatic analysis results are collated into an ACEDB database from which they can be displayed graphically, allowing interactive gene building and other annotation. The annotation takes into account the results of the external gene prediction programs and compares these with homology data. At present our gene prediction submissions are conservative, requiring a strong protein or good cDNA match. These criteria could be relaxed in the future by taking advantage of improvements in prediction algorithms and greater experience with larger regions of human genomic sequence in which genes have been characterized experimentally.
Future plans include a program of reanalysis using these improvements and also a program to perform automatic analysis of the unfinished sequence. A full analysis of a particular clone can be viewed via "Webace" at: http://webace.sanger.ac.uk/cgi-bin/webace?db=acedb1
Collaborations
The Sanger Centre actively encourages mapping and sequencing collaborations. The standard data release policy of the Centre is adopted for all such projects (see below). A summary of all the regions being mapped or sequenced in collaboration with the Sanger Centre is shown in Figure 3. The two most advanced collaborations are the VWS region in 1q32 (B. Schutte; see abstract this workshop and "disease mapping" in this report), and an in-house collaboration in 1q24-25 (H. Williams; see abstract this workshop).
Data access
A weekly copy of our 1ace database can now be viewed on-line via the recently created interface "Webace" at: http://webace.sanger.ac.uk and at: http://www.sanger.ac.uk/HGP/Chr1
Data for sequence-ready contigs and for markers and associated information can be searched at: http://www.sanger.ac.uk/HGP/db_query/query.shtml
A public release version of the database is available by ftp at: ftp://ftp.sanger.ac.uk/pub/human/chr1/ or, for registered HGMP-RC users, at: http://www.hgmp.mrc.ac.uk/
Information on finished or unfinished sequenced clones is available at: ftp://ftp.sanger.ac.uk/pub/human/sequences/Chr_1/ or ftp://ftp.sanger.ac.uk/pub/human/sequences/Chr_1/unfinished_sequence/
A DNA database containing all human sequence data available from the Sanger Centre can be searched at: http://www.sanger.ac.uk/HGP/blast_server.shtml