This document describes known issues with the CPC algorithm and our solutions.
If you want to use our CPC online, please refer to our
Recognizing Potential UTR Regions
In most mammalian genomes, the 3' UTR of a coding transcript may
extend for several kb, and such regions are abundant in many EST libraries.
The current version of CPC's SVM classifier cannot accurately distinguish
transcripts that fall entirely within UTR regions from truly noncoding
transcripts. To address this limitation, we provide the option of
BLAST searching against UTRdb on the
CPC web server.
For example, AK057932 is a partial GenBank mRNA derived entirely from the 3' UTR
of the human protein-coding gene pantothenate kinase 1 (PANK1). Since it cannot
encode a peptide, CPC classified it as "noncoding". However, the presence of
five BLAST hits in UTRdb suggests that this transcript is likely derived
from a UTR region.
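The check described above can be sketched as a simple rule: a transcript that CPC calls "noncoding" but that also has hits in UTRdb deserves a second look. This is only an illustration. The function name `flag_possible_utr` and the `min_hits` threshold are hypothetical and not part of CPC; the hit count is assumed to come from a separate BLAST search against UTRdb.

```python
def flag_possible_utr(cpc_label, utrdb_hit_count, min_hits=1):
    """Return True when a "noncoding" call coincides with UTRdb BLAST hits.

    cpc_label       -- CPC classification, "coding" or "noncoding"
    utrdb_hit_count -- number of BLAST hits found in UTRdb (obtained elsewhere)
    min_hits        -- hypothetical threshold, not a CPC parameter
    """
    return cpc_label == "noncoding" and utrdb_hit_count >= min_hits


# AK057932 from the text: classified "noncoding", five hits in UTRdb.
print(flag_possible_utr("noncoding", 5))
```

A flagged transcript is not necessarily UTR-derived; the flag only indicates that the "noncoding" call should be interpreted with care.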
Performance on Short Peptides
Recent reports suggest that short peptides of no more than 100
amino acids may play key roles in many biological processes and are
abundant in the mammalian proteome (
Frith, M.C., Forrest, A.R., Nourbakhsh, E., Pang, K.C., Kai, C., Kawai, J.,
Carninci, P., Hayashizaki, Y., Bailey, T.L. and Grimmond, S.M. (2006)
The abundance of short proteins in the mammalian proteome.
PLoS Genet, 2, e52.).
To assess CPC's performance in identifying short peptides,
we derived a test dataset by retrieving eukaryotic proteins with
no more than 100 residues from the NCBI Entrez system.
To ensure the quality of the dataset, only RefSeq entries with the status
label "Validated" or "Reviewed" were kept for subsequent analysis.
The corresponding RefSeq mRNAs were then fetched using the "Nucleotide Links"
function of the Entrez system. This yielded a dataset of 2,849 mRNA
sequences that encode small peptides of no more than 100 amino acids.
CPC correctly predicted 92.00% (2,621 out of 2,849) of these
short-peptide transcripts.
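The reported percentage follows directly from the two counts given above. A minimal sketch of the arithmetic (the helper name `fraction_correct` is ours, used only for illustration):

```python
def fraction_correct(correct, total):
    """Fraction of test sequences that were predicted correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


# Counts taken from the text: 2,621 correct out of 2,849 sequences.
accuracy = fraction_correct(2621, 2849)
print(f"{accuracy:.2%}")  # prints 92.00%
```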
The Prediction Gray Area
CPC summarizes its main output in a table. Each row corresponds to one
input sequence. The columns show the sequence ID, the coding/noncoding
classification, and the SVM score (the "distance" to the SVM classification
hyperplane in the feature space). In general, the farther the score
is from zero, the more reliable the prediction.
As a rule of thumb from our experience, transcripts with a score
between -1 and 1 are marked as
"weak noncoding" ( [-1, 0] )
or "weak coding" ( (0, 1] ).
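The rule of thumb can be written as a small lookup on the SVM score, reading the "weak coding" interval as (0, 1]. This is a sketch of the labelling rule as stated here, not CPC's own code. The function name `confidence_label` is hypothetical, and using plain "coding" and "noncoding" for scores outside [-1, 1] is our assumption.

```python
def confidence_label(score):
    """Map an SVM score to a label using the gray-area rule of thumb.

    [-1, 0] -> "weak noncoding"
    (0, 1]  -> "weak coding"
    Scores outside [-1, 1] keep the plain label (assumed here).
    """
    if score > 1:
        return "coding"
    if score > 0:
        return "weak coding"
    if score >= -1:
        return "weak noncoding"
    return "noncoding"


# Example scores, chosen for illustration only.
for s in (2.5, 0.4, 0.0, -1.0, -3.2):
    print(s, confidence_label(s))
```

Note that a score of exactly 0 falls in the "weak noncoding" interval, because that interval is closed at 0.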