Frequently Asked Questions (FAQ)

How are the mutation frequencies shown in COSMIC calculated?

The mutation frequency of a gene or tissue on the COSMIC webpages is a simple division of the number of samples with observed mutations, over the number of samples examined, from our curations. There are two different contexts for this data, between the published literature and the Cancer Genome Consortium data. The Cancer Genome Consortium data can be considered fully objective, where every gene has been fully sequenced through every sample. However, for the genes with full literature curation ('classic' genes), the % frequencies will reflect the samples and mutations as they are published. Since it is more difficult to publish studies which find no mutations, it is likely these frequencies are less accurate, simply representing the best current knowledge.

What if I need to use old versions of COSMIC for referencing?

Our non-commercial terms and conditions changed in November 2021. The following statement has been removed: ‘If I now need a licence for my use of COSMIC data, instead of licensing I can use/continue to use an old unsupported version of COSMIC’. This means that you are not permitted to use old and unsupported versions of COSMIC.

Unfortunately, we don't have capacity to support older versions of COSMIC. COSMIC is designed as a 'living tool' that is constantly evolving in line with the latest research and information.

It's also important to note that old versions of COSMIC aren't kept up to date. As a result, many of the links are broken and the data isn't accurate.

Why can’t I see SNP data anymore?

In June 2014 (v69) we introduced a 'noise reduction' strategy for whole genome screens which included SNP filtering. Variants reported in whole genome screens were matched to an internal SNP panel defined from the 1000 Genomes Project and normal (non-cancer) samples sequenced at Sanger.

While variants flagged as SNPs were excluded from the website, they were still included in the mutation downloads, with the SNP flag (y/n) shown. This strategy continued until 2020 but began to change as a result of user feedback. Three main points were raised:

  1. User defined filtering is preferable to excluding data
  2. It is preferable for the website and download datasets to be the same
  3. The SNP panel isn't current and over time tracing the origin of SNPs in the panel has become very difficult

In August 2020 (v92) we addressed points 1 and 2 by including all mutations, including those flagged as SNPs, on the website. In v95 we will partially address point 3 by removing the SNP flag from the mutation downloads and the website. In v97 we have granted non-commercial access to the complete Cancer Mutation Census (CMC) dataset which includes gnomAD and ExAC allele frequencies and other metrics to predict mutation significance. Please see the CMC section on the Downloads page.

Why can’t I see FATHMM scores anymore?

Our initial strategy to assign significance to mutations was based on SNP filtering and the inclusion of FATHMM predictions. These methods have now been superseded by the Cancer Mutation Census (CMC). This resource, released in v92, is our current best approach towards identifying the most significant variants driving different types of cancer.

The CMC integrates all coding somatic mutations collected by COSMIC with biological and biochemical information from multiple sources, combining data obtained from manual curation and computational analyses. Metrics like ClinVar significance, dN/dS ratios, and variant frequencies in normal populations (gnomAD) have been integrated into this resource. They have been used alongside COSMIC data on mutations' prevalence across 1,500 forms of human cancer. This helps to predict candidates for driver mutations in the coding portion of the genome.

Why aren’t TERT promoter mutations showing under the TERT gene or sample specific pages?

There was a major upgrade to the annotation system in v90. This enabled us to standardise annotations and map mutations to a more recent and defined set of transcripts (Gencode basic, Ensembl release 93). The annotation of mutations is now standardised using Ensembl VEP using the latest HGVS rules.

TERT promoter mutations from whole genome screens have always been curated in COSMIC using genomic annotations only (i.e. no c. syntax), as this is a standard annotation for non-coding mutations processed from whole genome screens. However, one unintended consequence of the upgrade in v90 was that the manually curated TERT promoter mutations, which had previously been annotated at the transcript level (and given TERT c.-124C>T style syntaxes), were converted to COSN mutations identified with genomic (g.*) details only.

We recognise the additional value of these transcript level c.* annotations from manual curation and we are currently working to reincorporate them, but at the moment there is no release date.

In order to maintain a standardised dataset, we will continue to show the VEP genomic annotations for all mutations, but we have now produced a mapping file to allow the non-coding variant (NCV) genomic annotations to be linked back to the CDS syntaxes. The new mapping file NCV_CDS_syntax_mapping.tsv released in v97 can be cross referenced with the CosmicNonCodingVariants.vcf.gz or CosmicNCV.tsv.gz download files to link CDS syntaxes with LEGACY_ID or COSV identifiers.

Generally, on the website we focus on coding mutations, but non-coding variants are displayed on the Genome Browser and can also be viewed directly by searching for the COSN identifier e.g.
https://cancer.sanger.ac.uk/cosmic/search?q=COSN32285790

In v97, the new mapping file contains only TERT promoter mutations, but we plan to include non-coding mutation mapping for other genes in future releases.
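The cross-referencing described in this answer can be sketched in Python as a simple join on a shared identifier. This is only an illustration: the column layouts of NCV_CDS_syntax_mapping.tsv and CosmicNCV.tsv.gz are not described here, so the column names are passed in as arguments and the function names are hypothetical. Check the header row of each file before using it.

```python
import csv
from typing import Dict, Iterable, Iterator


def load_mapping(mapping_lines: Iterable[str], key_column: str,
                 syntax_column: str) -> Dict[str, str]:
    """Read the mapping file into {identifier: CDS syntax}.

    key_column and syntax_column are assumptions supplied by the caller;
    they must match the real header of the mapping file.
    """
    reader = csv.DictReader(mapping_lines, delimiter="\t")
    return {row[key_column]: row[syntax_column] for row in reader}


def annotate_ncv(ncv_lines: Iterable[str], mapping: Dict[str, str],
                 key_column: str, out_column: str = "cds_syntax") -> Iterator[dict]:
    """Yield each non-coding variant row with the CDS syntax attached.

    Rows without an entry in the mapping get an empty string.
    """
    reader = csv.DictReader(ncv_lines, delimiter="\t")
    for row in reader:
        row[out_column] = mapping.get(row[key_column], "")
        yield row
```

For the real, compressed download, the lines can be supplied with gzip.open(path, "rt", newline="") from the standard library; the uncompressed mapping file can be opened with open(path, newline="").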

Where does the data in COSMIC come from?

There are two types of data in COSMIC: expert manual curation data and systematic screen data. It is useful to understand the differences between these data types and use them appropriately.

Expert curation data

  1. Manually input from peer reviewed publications by COSMIC expert curators
  2. Consists of comprehensive literature curation of selected Cancer Gene Census genes, introduced at a given release then updated at subsequent releases
  3. Includes additional data points relevant to each disease and publication
  4. Provides accurate frequency data as mutation negative samples are specified
  5. Also called non-systematic or targeted screen data

Genome-wide screen data

  1. Uploaded from publications reporting large scale genome screening data or imported from other databases such as TCGA and ICGC
  2. Provides unbiased molecular profiling of diseases while covering the whole genome
  3. Provides objective frequency data by interpreting non-mutant genes across each genome
  4. Facilitates finding novel driver genes in cancer

Can I download the full COSMIC dataset?

Yes. Files containing all data for each variant type (simple mutations, gene fusions, non-coding, structural, copy number, expression and methylation variants) are available for download and these are updated for each COSMIC release. Instructions on how to download, and details of all the files available can be found on the Download page. CosmicMutantExport.tsv.gz contains all the samples analysed for every gene in COSMIC found with mutations. For targeted gene screens there is also a file CosmicCompleteTargetedScreensMutantExport.tsv.gz containing both positive and negative data.
Note that you will need to register and login before you can download data.

Where can I download all the mutations from COSMIC?

The file can be found on the download page. It is called CosmicMutantExport.tsv.gz. This file contains all the samples analysed for every gene in COSMIC found with mutations.

How can I find the latest COSMIC release version?

The COSMIC home page shows the version number and release date for the current, most recent COSMIC release. You can find more information about the release in the news and release notes sections, which can be found under the News menu on every page.

Which version of the human reference sequence does COSMIC use?

The default version is GRCh38/hg38 but this can be changed to GRCh37/hg19 from the 'Genome Version' menu on the main navigation bar at the top of each page. The version currently selected in the menu will be ticked. When GRCh37 is selected an icon is displayed in the page header which indicates that the 'GRCh37 Archive' is being used.

How are samples counted in COSMIC?

Each sample has its own name and ID. Multiple instances of the same sample name can exist as separate entries, indicating that it was unclear during curation that these samples were identical, apart from their name. To account for the duplication of probably identical samples during curation, we attempt to combine samples that have identical names and disease descriptions. Please see the help documentation for more details.
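As a rough illustration of the combining step, the sketch below groups sample entries that share exactly the same name and disease description. The record layout (sample ID, name, disease description) and the function name are assumptions made for this example; the actual rules are described in the help documentation and may be more involved.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_probable_duplicates(
    records: Iterable[Tuple[str, str, str]]
) -> Dict[Tuple[str, str], List[str]]:
    """Group sample IDs by (sample name, disease description).

    Each record is (sample_id, sample_name, disease_description).
    Entries that share both name and description end up in one group
    and would be counted as a single sample.
    """
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for sample_id, name, disease in records:
        groups[(name, disease)].append(sample_id)
    return dict(groups)


def count_samples(records: Iterable[Tuple[str, str, str]]) -> int:
    """Number of samples after combining probable duplicates."""
    return len(group_probable_duplicates(records))
```

Matching here is exact and case-sensitive, which is a simplification chosen for the sketch.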

What does 'NS' mean in tissue/histology classifications?

NS means 'not specified'.

Can I download data from older versions of COSMIC?

Yes. We make data files available to download for the current release and for at least the three previous releases only. Files for the current release are available from the Download page, but previous versions can be downloaded programmatically or using a command line interface. For instructions please see the 'Useful Links' section at the top right of the Download page or use the following links:
Downloading using command line tools
Download programmatically
Available download files

I am preparing a manuscript for publication and I am including some COSMIC data. How should I cite COSMIC?

We are very happy for you to use the data, and any tabulations or graphic screenshots which support your work. Please cite the website address (cancer.sanger.ac.uk) and the paper 'COSMIC: the Catalogue of Somatic Mutations in Cancer'. Thank you.

How often is COSMIC updated?

COSMIC is continually adding to and refining the content of the database, but in order to maximise our use of resources we are currently limiting our major core data releases (versioned) to two per year. However, as COSMIC has expanded beyond the core dataset we have now moved to a rolling schedule of new product releases and software updates and the Actionability database is updated regularly and independently of the core data releases.

Please see the News page for updates about future releases or the Release Notes page for details about the content of the current and previous releases.

What is the difference between cell line data in COSMIC and the Cell Line Project?

The main COSMIC and Cell Lines Project datasets are separate, and we partition the download files and website as two distinct datasets.

While there are cell lines which appear in both COSMIC and the Cell Lines Project, the mutation data are from different sources. In COSMIC the data are sourced from publications and whole genome uploads, e.g. ICGC/TCGA. The Cell Lines Project data come entirely from a Sanger Institute exome sequencing study on 1015 cancer cell lines described here. We do not include that cell line exome study in the main COSMIC dataset. Please note that, compared to the Cell Lines Project, many more cell lines have been examined in the scientific literature for somatic mutations and these are recorded in the main COSMIC dataset.

What are Census genes and where is the updated version of the census?

The Cancer Gene Census is a list of genes known to be involved in cancer. They are listed here.

Can I submit data to COSMIC?

All mutation data in COSMIC is currently entered by our curators. If you would like to submit data for one of your publications, or even pre-publication, please contact cosmic@sanger.ac.uk and one of our curators will be happy to help.

How are papers selected from the literature?

To identify papers reporting somatic mutations, PubMed is broadly searched for papers containing relevant mutation data (example search: (ras OR genes, ras) AND human AND mutation). Those papers that are identified from their abstracts to include somatic mutation information relating to cancer or pre-cancerous conditions are then selected for curating. After examination of the information in the full text of the paper, the sample and mutation data are extracted. Any papers containing incomplete data (e.g. mutations that are reported but not fully described) or data of insufficient quality (e.g. errors identified in the data) are not fully curated but are added to a list of "additional references containing somatic mutation information".

How do I calculate gene mutation frequencies across multiple transcript variants?

Currently COSMIC uses a 'one-to-one' model for genes and transcripts so mutation frequency counts on the website are calculated on a per-transcript basis, rather than across gene loci. As each unique genomic mutation can be annotated to multiple different splice variants of the same gene, adding the prevalence counts of the splice variants together can potentially add the same genomic mutation more than once. However, from v90 (Sept. 2019) we began uniquely identifying variants at the genome level using COSV identifiers. This means that with some basic bioinformatics/data analysis skills it is now possible to calculate frequencies by downloading and analysing complete datasets. A unique list of COSVs for a locus can be extracted from the CosmicCodingMuts.vcf file and the number of unique mutated samples can be counted using the COSV list and the CosmicMutantExport.tsv.gz file. Please also refer to the question 'How do I calculate mutant frequencies' on this page.
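The two steps described above (collect the unique COSVs for a locus, then count the unique samples carrying any of them) could be sketched as follows. This is a minimal example under stated assumptions: it relies on the standard VCF layout (CHROM, POS, ID as the first three tab-separated columns) with the COSV identifier in the ID column, and the export column names 'GENOMIC_MUTATION_ID' and 'ID_sample' are assumptions that should be checked against the header of your download. The function names are hypothetical.

```python
import csv
from typing import Iterable, Set


def cosv_ids_in_region(vcf_lines: Iterable[str], chrom: str,
                       start: int, end: int) -> Set[str]:
    """Collect unique variant IDs for a locus from VCF lines.

    The same genomic variant may appear on several lines; using a set
    keeps each COSV once.
    """
    ids: Set[str] = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if fields[0] == chrom and start <= int(fields[1]) <= end:
            ids.add(fields[2])
    return ids


def count_mutated_samples(export_lines: Iterable[str], cosv_ids: Set[str],
                          id_column: str = "GENOMIC_MUTATION_ID",
                          sample_column: str = "ID_sample") -> int:
    """Count unique samples with at least one of the given variants.

    id_column and sample_column are assumed names; adjust them to the
    real header of the export file.
    """
    reader = csv.DictReader(export_lines, delimiter="\t")
    samples = {row[sample_column] for row in reader
               if row[id_column] in cosv_ids}
    return len(samples)
```

Because samples are collected in a set, a sample annotated against several splice variants of the same gene is counted only once. The result is a count of mutated samples, not a frequency; the number of samples examined still has to be obtained separately.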

What is the difference between census and classic genes?

A census gene is one that is known to be involved in cancer. The list of these genes is used to prioritise the literature curation for the COSMIC database. Once the literature for a census gene has been completely curated, it is released and sometimes termed a 'COSMIC classic' gene.

There are no mutations in the full text of the paper. Are these extracted from the supplementary material?

Yes. We utilize supplementary material for curation when it contains additional information.

What are the rules for mutation syntax in COSMIC?

Mutations are annotated using syntax derived from HGVS nomenclature recommendations (see http://varnomen.hgvs.org ).

How do I examine a histology or cancer type?

COSMIC may use an alternative histology terminology, for example small cell carcinoma instead of neuroendocrine carcinoma (or, for some sites, neuroendocrine carcinoma instead of small cell carcinoma). More information about our classification system can be found at the URL below and all COSMIC tumour site and histology translations are available to view as an Excel spreadsheet or tab-delimited text file in the Classification documents found here. Note: You may also want to use our search to find the matching disease classification for the alternative terminologies.

How do you define Mutation somatic status?

  • Confirmed Somatic: The variant allele from the tumour sample differs from the germ-line alleles of the same individual who provided the tumour sample.
  • Previously Reported: There is no germ-line allele information provided for the tumour sample for the same individual, but the same variant has been found to be a 'Confirmed Somatic' variant in a normal-tumour sample pair from another patient. Please note that the same variant from multiple samples from the same patient should always get the same somatic status, because all the samples share the same germ-line alleles in individuals who are not genetic mosaics.
  • Variant of unknown origin: There is no information provided on the germ-line alleles in the data source to help determine if it is either a germ-line or somatic variant.

How do I examine colon cancer?

COSMIC uses Large Intestine for this cancer site. You can find information about our classification system and download all COSMIC tumour site and histology translations as an Excel spreadsheet or tab-delimited text file in the Classification documentation page.

How do I examine a tumour site?

COSMIC may use an alternative site, e.g. Colon versus Large Intestine. More information about our classification system can be found at the URL below and all COSMIC tumour site and histology translations are available to view as an Excel spreadsheet or tab-delimited text file in the Classification documents found here.

Why is my search bringing back fewer records than expected?

Check that you have not unexpectedly left a filter on (filters are displayed in the sidebar) that limits the gene region, tumour type, site etc. Some genes have been curated as part of systematic screens, so they have some data in COSMIC, but have not yet been manually curated and will therefore have less data than has been published. This may be because they are in the list of genes waiting to be manually curated, or because they are not included in the Cancer Gene Census.

Where can I find patient age information?

If a paper gives the precise age then this is entered and displayed in years to 2 decimal places in the sample overview page, for example COSS1735169. Less precise age information is added as a remark and displayed on the sample overview page as, for example, Age=Adult; Age=Child; Age=Elderly; Age=young adult; Age=more than 65 years; Age=Adult 20-60 years. For an example see COSS1757821. If the paper uses the term “paediatric”, this is added as a remark Age=Child. In the past some papers reporting paediatric or adult leukaemias have had this information included in the Tumour remark section. This information is now included in the Individual Remark section Age=Child or Age=Adult as described above.

Has the whole gene been screened?

Not necessarily. Sometimes the entire coding sequence and the intron-exon boundaries of a gene will have been screened, but at other times only specific exons, codons or a specific single nucleotide change in 1 codon will have been analysed. This information is not visible in COSMIC but can be obtained from the original publication from which the data was extracted.

How can I tell what part of a gene has been screened?

Sometimes the entire coding sequence and the intron-exon boundaries of a gene will have been screened, but at other times only specific exons, codons or a specific single nucleotide change in 1 codon will have been analysed. This information is not visible in COSMIC but can be obtained from the original publication from which the data was extracted.

Is my gene fully curated in COSMIC? How are genes selected for manual curation in COSMIC?

As new cancer genes are identified from the literature these are added to the Cancer Gene Census list. A gene which is not currently in the manually curated Classic Gene List may be awaiting completion of the initial curation process, thus the data will not yet have been released; a gene may not have been confirmed as a true cancer gene according to our selection criteria and is awaiting more evidence; alternatively we may have missed the gene in question. We welcome suggestions for missing genes at cosmic@sanger.ac.uk.

Why can’t I find a particular publication in COSMIC?

Publications are identified for manual curation of genes from the Classic Gene List by using weekly PubMed and PubCrawler searches. If data from a specific publication is missing it may have been missed from these searches or the paper may be awaiting curation, especially for some of the older well known cancer genes. Alternatively, the publication may be recorded in COSMIC but as a reference only if, for instance, the data was unclear or not presented in a format which was compatible with the COSMIC data entry system. We welcome suggestions concerning missing publications at cosmic@sanger.ac.uk.

Is mutation data from cell lines included in COSMIC?

Cell lines are included in COSMIC if they have been screened for mutations. You might also want to check the COSMIC Cell Line Project, where the genetics and genomics of large numbers of cancer cell lines have been systematically characterised.

Are mutations analysed by immunohistochemistry included in COSMIC?

Mutations analysed solely by immunohistochemistry using mutation specific antibodies are not currently included in COSMIC.

Why does COSMIC contain data on overgrowth syndromes as they are not really cancer?

Somatic mutations detected in tissues associated with overgrowth syndromes such as Proteus and CLOVES syndromes are included in COSMIC. Not all somatic mutations give a growth advantage to the cells, but the mutations that have been identified in the context of these syndromes clearly do. Including these mutations in COSMIC will help us further define and understand cancer.

What does Inferred Breakpoint mean?

This is the genomic breakpoint for a gene fusion. For many fusions this is not reported in detail so it is necessary to infer the position based on the reported mRNA transcripts in a given sample. To do this, it is assumed that each sample's breakpoint lies between the most 3' expressed exon of the 5' gene and the most 5' exon of the 3' gene, from the mRNAs reported in that sample. However, if the genomic breakpoint position is reported in detail for the sample then this is input as the Inferred Breakpoint.

What does Observed mRNA transcript mean?

Many papers determine fusions between genes using expression technologies such as RT-PCR. A number of these studies have identified more than one transcript per sample, some finding over four different products between the same gene pair in one tumour. This implies significant alternative splicing of the mRNAs expressed from the fused gene pair. These alternative transcripts are input as Observed mRNA transcripts.

What are Related Breakpoints?

These are either all the Inferred Breakpoints for a selected mRNA transcript mutation, or all the Observed mRNA transcripts for a selected inferred breakpoint mutation.

What is a Translocation Name?

This is the syntax format describing the portions of mRNA present (in HGVS "r." format) from each gene in a fusion.

How is an inverted sequence annotated in a fusion?

An "o" before a gene name is used to indicate an inverted sequence, e.g. FUS{NM_004960.2}:r.1_597_oCREB3L2{NM_194071.2}:r.979-18_991_CREB3L2{NM_194071.2}:r.1049_7455
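To make the structure of this syntax concrete, the sketch below splits a translocation name into its segments and flags the inverted ones. The pattern is generalised from the single example above (optional "o", gene name, accession in braces, then an "r." range, with segments joined by "_"), so it is an assumption rather than a complete grammar, and the names FusionSegment and parse_translocation_name are hypothetical.

```python
import re
from typing import List, NamedTuple


class FusionSegment(NamedTuple):
    gene: str
    accession: str
    rna_range: str   # text following "r." for this segment
    inverted: bool   # True when the gene name is prefixed with "o"


# One segment: optional "o", gene name, {accession}, ":r.", then the range.
# The range ends where the next "_GENE{" or "_oGENE{" begins, or at the end.
_SEGMENT = re.compile(
    r"(?P<inv>o?)(?P<gene>[A-Z0-9][A-Za-z0-9-]*)"
    r"\{(?P<acc>[^}]+)\}:r\."
    r"(?P<range>[^{}]+?)"
    r"(?=_o?[A-Z0-9][A-Za-z0-9-]*\{|$)"
)


def parse_translocation_name(name: str) -> List[FusionSegment]:
    """Split a translocation name into gene segments."""
    return [
        FusionSegment(m["gene"], m["acc"], m["range"], m["inv"] == "o")
        for m in _SEGMENT.finditer(name.strip())
    ]
```

Applied to the example above, this yields three segments (FUS, CREB3L2, CREB3L2), with only the middle one marked as inverted. The pattern assumes gene names start with an upper-case letter or digit, which is how the leading "o" is told apart from the gene name.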

Why can't I find any information in COSMIC on a particular gene fusion pair?

The curation of fusion data is on-going and the list of fusions currently curated in COSMIC can be found here. Sometimes an alternative transcript needs to be used to annotate a fusion so it may be necessary to search all transcripts for a gene to find any curated for fusions e.g. NOTCH1 and NOTCH1_ENST00000277541.

How do I calculate mutant frequencies?

This is possible, but you will need some basic bioinformatics/data analysis skills and the following download files:

  1. CosmicCompleteTargetedScreensMutantExport.tsv.gz
  2. CosmicGenomeScreensMutantExport.tsv.gz
  3. CosmicSample.tsv.gz

File 1. Lists samples with mutations (positives) and samples with no mutation (negatives) from targeted screens. You can calculate mutant frequencies (%) for genes as follows:

Positives ÷ (Positives + Negatives) x 100

However, whole genome screen data is not included.

File 2. Lists samples with mutations (positives) from whole genome screens but samples without mutations (negatives) are not included. The number of samples analysed by whole genome screening can be extracted from File 3. by selecting rows where the 'whole genome screen' column is equal to 'y'. Frequencies can be calculated as follows -

Positives in File 2 ÷ Whole Genome Screen Samples in File 3 x 100

For the total dataset (targeted and whole genome screens), calculate the frequencies as follows -

(Positives in File 1 + Positives in File 2) ÷ (Negatives in File 1 + Positives in File 1 + Whole Genome Screen Samples in File 3) x 100

To calculate frequencies for a specific tissue/histology of interest, the sample set must be restricted to those matching the tissue/histology classification. Please see the help section on Sample Counting for a description of how we count samples in COSMIC, and also the Mutation Frequency section on the same page.
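The three formulas above can be written as a short Python sketch. It assumes the positive and negative sample counts have already been extracted from Files 1 to 3 (and, if needed, restricted to one tissue/histology); the function names are hypothetical and the counts in the usage note are invented purely for illustration.

```python
def percent(positives: int, examined: int) -> float:
    """Positives as a percentage of the samples examined."""
    if examined <= 0:
        raise ValueError("no samples examined")
    return 100.0 * positives / examined


def targeted_frequency(pos_targeted: int, neg_targeted: int) -> float:
    """File 1 only: Positives / (Positives + Negatives) x 100."""
    return percent(pos_targeted, pos_targeted + neg_targeted)


def genome_screen_frequency(pos_wgs: int, wgs_samples: int) -> float:
    """Positives in File 2 / whole genome screen samples in File 3 x 100."""
    return percent(pos_wgs, wgs_samples)


def combined_frequency(pos_targeted: int, neg_targeted: int,
                       pos_wgs: int, wgs_samples: int) -> float:
    """Total dataset: targeted and whole genome screens together."""
    return percent(pos_targeted + pos_wgs,
                   pos_targeted + neg_targeted + wgs_samples)
```

For example, with 30 positive and 70 negative targeted samples, plus 20 positives among 400 whole genome screen samples, targeted_frequency(30, 70) gives 30.0, genome_screen_frequency(20, 400) gives 5.0 and combined_frequency(30, 70, 20, 400) gives 10.0.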

What does the term 'Whole Genome Screen' mean?

We use the term 'Whole Genome Screen' to describe any study which has surveyed all genes in the genome, in contrast to a 'Targeted Screen' which surveys a smaller subset of genes. This term does not differentiate between whole genome sequencing and whole exome sequencing.

How are mutations mapped to gene sequences?

We attempt to map every mutation to a single version of a gene, but where this is not possible we map to an alternative transcript. The gene sequences are held in COSMIC and available in the download section.

What mutation detection method was employed?

Mutation screening methods differ in their sensitivity and the sensitivity of a particular method can vary from laboratory to laboratory. Some methods identify all classes of small intragenic mutation (base substitutions and small insertions/deletions). However, the protein truncation test will not detect mutations that cause missense amino acid substitutions.

Was the whole gene screened?

Some genes are characterised by mutation hot spots, for example BRAF, RAS and TP53. These genes are often screened for somatic mutations only in the region most likely to contain mutations. This strategy will obviously miss mutations located elsewhere in the gene and hence will provide a distorted view of the distribution of mutations in the gene and perhaps underestimate the frequency of mutations.

Has the sample been screened before?

There are examples where the same data is reported twice, perhaps in a follow-up study with reference to further data or as a positive control, for example using cell lines with known mutations. Where possible we have noted sample names and, within papers, have removed any redundancy. However, between papers it is not possible to confirm that two samples with the same name are indeed the same sample. We have therefore included both samples and both results in COSMIC. If you want to review this information, the sample name, mutation and paper reference are displayed in the Mutation Details view.

Are all the mutations real?

For many putative somatic mutations that have been reported in the published literature, definitive evidence that they are somatically acquired (through demonstration of their absence in normal DNA from the same individual as the tumour) is not available. Therefore, occasional germline variants may have inadvertently been represented in publications as somatic mutations and entered in the database. In addition, simple laboratory errors which result in an incorrect normal DNA sample (i.e. from a different individual) being analysed as a control for a particular tumour sample may provide apparently persuasive, but misleading, evidence of somatic origin. Finally, DNA amplification methods have an intrinsic error rate, and these errors may subsequently be interpreted as somatic mutations. There is some evidence that this may be a particular problem in analyses of archival formalin-fixed, paraffin-embedded material.