About iMPI

Introduction

Non-coding RNAs (ncRNAs) were long thought to lack protein-coding ability. Recent advances challenge this view: numerous microprotein-encoding small open reading frames (sORFs) have been discovered in various ncRNAs, including pre-miRNAs, circular RNAs, and long non-coding RNAs. Introns, another major class of ncRNA, account for about 40% of total gene length on average and were once regarded as genomic junk. A growing body of literature has shown that introns can escape splicing and be retained in mature mRNA. This raises the question of whether retained introns contain sORFs that encode microproteins.

To systematically explore translatable introns, we performed a genome-wide search for intron sORFs (iORFs) with coding potential and validated the iORF-encoded microproteins using large-scale proteomic mass spectrometry (MS) data. We then developed the iMPI database to give researchers an accessible resource for iORF-encoded microproteins.

The genome-wide search located 209,091 introns in the human GRCh38 genome, of which 15,975 were identified as having high coding potential. Ultimately, 4,751 introns containing a total of 5,832 iORFs were shown to be potentially translatable according to proteomic MS evidence. The iMPI database provides detailed information on the translatable iORFs, including nucleotide sequence, peptide sequence, genomic position, coding potential, and information on the relevant introns, transcripts, and genes. Incorporating protein MS data from over 27 cancer types, the iMPI database provides cancer-specific MS evidence of iORF-encoded microproteins. In addition, the iMPI database contains detailed annotation of the subcellular location, post-translational modifications (83,863 phosphorylation sites, 5,141 ubiquitylation sites, 3,574 methylation sites, 3,363 dimethylation sites, 1,996 trimethylation sites, and 735 acetylation sites), secondary structure, surface accessibility, and three-dimensional structure of iORF-encoded microproteins.
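To put the counts above in proportion, the following minimal Python sketch computes the share of introns at each stage and the total number of annotated PTM sites. The numbers are copied from the text; the variable and function names are illustrative only and are not part of iMPI.

```python
# Counts quoted in the text above; names are illustrative, not an iMPI API.
TOTAL_INTRONS = 209_091
HIGH_CODING_POTENTIAL_INTRONS = 15_975
TRANSLATABLE_INTRONS = 4_751
TRANSLATABLE_IORFS = 5_832

PTM_SITES = {
    "phosphorylation": 83_863,
    "ubiquitylation": 5_141,
    "methylation": 3_574,
    "dimethylation": 3_363,
    "trimethylation": 1_996,
    "acetylation": 735,
}


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


def summarize() -> dict:
    """Derive simple proportions from the published counts."""
    return {
        "high_coding_potential_pct": round(
            percent(HIGH_CODING_POTENTIAL_INTRONS, TOTAL_INTRONS), 2
        ),
        "translatable_pct": round(percent(TRANSLATABLE_INTRONS, TOTAL_INTRONS), 2),
        "iorfs_per_translatable_intron": round(
            TRANSLATABLE_IORFS / TRANSLATABLE_INTRONS, 2
        ),
        "ptm_sites_total": sum(PTM_SITES.values()),
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value}")
```

Both percentages are taken relative to all 209,091 introns, so the sketch does not assume anything about how the intermediate sets nest.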

209,091 introns from human GRCh38 genome

Incorporating protein MS data from over 27 cancer types.

15,975 introns with high coding potential

4,751 introns containing 5,832 iORFs were potentially translatable.

Various annotations

Integrating detailed annotations of iORF-encoded microproteins.

Data statistics of iMPI:
(A) Number of IR-encoded peptides with MS evidence across each cancer type. (B) Distribution spectrum of MPI based on their occurrence across 27 cancer types. MPI were classified into four categories: Tumor-specific (1 cancer type), Shared (2–5 cancer types), Widespread (6–10 cancer types) and Pan-cancer (≥11 cancer types). (C) The inter-cancer overlap of MPI. The vertical axis represents the number of shared MPI among different cancer types, and the horizontal axis lists the number of MPI for each cancer type. The black dots and lines represent the cancer types included in each intersection combination. (D) Distribution of peptide length ranging from 1 to 1,000 amino acids (AA). (E) Distribution of the CPAT score for predicting the coding potential of iORFs. (F) Gene types of introns. (G) Number of IR-encoded peptides localized in different subcellular compartments. (H) Number of PTM sites on IR-encoded peptides.
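The four occurrence categories in panel (B) amount to a simple threshold rule on the number of cancer types in which an MPI is detected. A minimal Python sketch of that rule follows; the function name and the example input are hypothetical, and only the category boundaries come from the caption.

```python
from collections import Counter


def classify_mpi(n_cancer_types: int) -> str:
    """Map the number of cancer types with MS evidence to a panel (B) category."""
    if n_cancer_types < 1:
        raise ValueError("an MPI must be observed in at least one cancer type")
    if n_cancer_types == 1:
        return "Tumor-specific"
    if n_cancer_types <= 5:
        return "Shared"
    if n_cancer_types <= 10:
        return "Widespread"
    return "Pan-cancer"


if __name__ == "__main__":
    # Hypothetical occurrence counts, for illustration only.
    occurrences = {"mpi_a": 1, "mpi_b": 4, "mpi_c": 8, "mpi_d": 19}
    print(Counter(classify_mpi(n) for n in occurrences.values()))
```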
Data Sources

Mass spectrometry data of specific cancer types used in iMPI

| Cancer type | Abbreviation | Repository | Database identifier/PMID |
| --- | --- | --- | --- |
| Colorectal cancer | CRC | iProx | IPX0000832003 |
| Colorectal cancer | CRC | iProx | IPX0000832010 |
| Hepatocellular carcinoma | HCC | iProx | IPX0000859001 |
| Lung cancer | LUNG | iProx | IPX0001451001 |
| Lung adenocarcinoma | LUAD | iProx | IPX0001804001 |
| Lung cancer | LUNG | MassIVE | MSV000083969 |
| Bladder cancer | BLCA | MassIVE | MSV000087370 |
| Melanoma | MEL | PRIDE | PXD001048 |
| Colorectal cancer | CRC | PRIDE | PXD001676 |
| Colorectal cancer | CRC | PRIDE | PXD001794 |
| Non-small cell lung cancer | NSCLC | PRIDE | PXD002612 |
| Ascites | Ascites | PRIDE | PXD003351 |
| Prostate cancer | PCA | PRIDE | PXD003497 |
| Plasma-derived exosomes in chronic lymphocytic leukemia | PLASMA | PRIDE | PXD004420 |
| Lung cancer | LUNG | PRIDE | PXD004682 |
| Lung cancer | LUNG | PRIDE | PXD004683 |
| Lung cancer | LUNG | PRIDE | PXD004684 |
| Hepatocellular carcinoma | HCC | PRIDE | PXD004873 |
| Breast cancer | BRCA | PRIDE | PXD005214 |
| Hepatocellular carcinoma | HCC | PRIDE | PXD005571 |
| Breast cancer | BRCA | PRIDE | PXD007088 |
| Breast cancer | BRCA | PRIDE | PXD007217 |
| Colorectal liver metastasis | CRLM | PRIDE | PXD008383 |
| Bladder cancer | BLCA | PRIDE | PXD009203 |
| Ascites | Ascites | PRIDE | PXD009382 |
| Colorectal cancer | CRC | PRIDE | PXD009602 |
| Bladder cancer | BLCA | PRIDE | PXD010260 |
| Cholangiocarcinoma | CHOL | PRIDE | PXD010294 |
| Squamous cell carcinoma | SCC | PRIDE | PXD011609 |
| Renal cell carcinoma | RCC | PRIDE | PXD011681 |
| Breast cancer | BRCA | PRIDE | PXD012162 |
| Colorectal cancer | CRC | PRIDE | PXD012254 |
| Breast cancer | BRCA | PRIDE | PXD012431 |
| Meningiomas | MGM | PRIDE | PXD012923 |
| Hepatocellular carcinoma | HCC | PRIDE | PXD013057 |
| Head and neck squamous cell carcinoma; Breast cancer | HNSCC; BRCA | PRIDE | PXD013311 |
| Lung cancer | LUNG | PRIDE | PXD013649 |
| Breast cancer | BRCA | PRIDE | PXD013795 |
| Prostate cancer | PCA | PRIDE | PXD013837 |
| Breast cancer | BRCA | PRIDE | PXD014414 |
| Cholangiocarcinoma | CHOL | PRIDE | PXD017906 |
| Colorectal cancer | CRC | PRIDE | PXD019103 |
| Esophageal squamous cell carcinoma | ESCC | PRIDE | PXD021701 |
| Renal cell carcinoma | RCC | PRIDE | PXD019123 |
| High-grade serous ovarian cancer | HGSOC | PRIDE | PXD020557 |
| Acute myeloid leukemia | AML | CPTAC | 38232702 |
| Clear cell renal cell carcinoma | CCRCC | CPTAC | 31675502 |
| Glioblastoma | GBM | CPTAC | 33577785 |
| Hepatocellular carcinoma | HCC | CPTAC | 31585088 |
| Head and neck squamous cell carcinoma | HNSCC | CPTAC | 33417831 |
| Lung adenocarcinoma | LUAD | CPTAC | 32649874 |
| Oral squamous cell carcinoma | OSCC | CPTAC | 28878238 |
| Ovarian serous cystadenocarcinoma | OV | CPTAC | 33086064 |
| Pediatric brain tumors | PBT | CPTAC | 33242424 |
| Pancreatic ductal adenocarcinoma | PDAC | CPTAC | 34534465 |
| Uterine corpus endometrial carcinoma | UCEC | CPTAC | 32059776 |
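
The identifiers in the table above follow a visible pattern: iProx entries start with IPX, MassIVE entries with MSV, PRIDE entries with PXD, and CPTAC datasets are referenced by a purely numeric PMID. A minimal Python sketch, assuming only the patterns seen in this table (it is not an official validator for any of these repositories), sorts an identifier into one of those groups:

```python
import re
from typing import Optional

# Prefix patterns inferred from the table above; an assumption, not a
# specification published by the repositories themselves.
_IDENTIFIER_PATTERNS = [
    (re.compile(r"^IPX\d+$"), "iProx"),
    (re.compile(r"^MSV\d+$"), "MassIVE"),
    (re.compile(r"^PXD\d+$"), "PRIDE"),
    (re.compile(r"^\d+$"), "PMID"),
]


def identifier_kind(identifier: str) -> Optional[str]:
    """Return the repository (or 'PMID') suggested by the identifier's form."""
    cleaned = identifier.strip()
    for pattern, kind in _IDENTIFIER_PATTERNS:
        if pattern.match(cleaned):
            return kind
    return None  # Unknown form: leave the decision to the caller.


if __name__ == "__main__":
    for example in ("IPX0000832003", "MSV000083969", "PXD001048", "38232702"):
        print(example, "->", identifier_kind(example))
```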

Annotation sources used in iMPI

| Data type | Resource | Description | URL |
| --- | --- | --- | --- |
| Basic information | UniProt | Universal protein resource | https://www.uniprot.org/ |
| Annotation of iORF-encoded microproteins | qPTM | Quantification of post-translational modifications | https://qptm.omicsbio.info/ |

Tools used in iMPI

| Tool | Description | URL |
| --- | --- | --- |
| KMA | An R package that performs intron retention estimation and detection. | https://github.com/pachterlab/kma |
| CPAT | Coding-potential assessment tool using an alignment-free logistic regression model. | https://github.com/liguowang/cpat |
| MaxQuant | A quantitative proteomics software package designed for analyzing large mass-spectrometric data sets. | https://www.maxquant.org/ |
| AlphaFold | Predicts a protein's 3D structure from its amino acid sequence. | https://alphafold.com/ |
| MULocDeep | A deep-learning framework for protein subcellular and suborganellar localization prediction. | https://github.com/yuexujiang/MULocDeep |
| DSSP | Application to assign secondary structure to proteins. | https://github.com/PDB-REDO/dssp |
Pipeline Construction
1. MS data collection

Collecting MS data from CPTAC Data Portal, iProx, MassIVE, and PRIDE databases.

2. IR identification

Extracting all introns from the human genome data by KMA (Keep Me Around) algorithm.

3. Prediction of sequence coding potential

Predicting the coding potential of the introns by CPAT.

4. Identification of iORF coding ability

Validating the coding potential by performing a database search for protein identification via MaxQuant software.

5. Annotation of iORF-encoded microproteins

The subcellular location, PTM, tertiary structure, secondary structure, and surface accessibility were predicted and annotated via MULocDeep, qPTM, AlphaFold, and DSSP.
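
To make the notion of an intron sORF concrete, here is a minimal Python sketch that scans the three forward reading frames of a nucleotide sequence for ATG-to-stop open reading frames and translates them with the standard genetic code. This is a generic illustration, not the iMPI pipeline, which relies on KMA, CPAT, and MaxQuant as described above; the function names and the `min_aa` length cutoff are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str) -> str:
    """Translate a nucleotide sequence; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def find_orfs(seq: str, min_aa: int = 10) -> list:
    """Return (start, end, peptide) for ATG..stop ORFs in the forward frames.

    start and end are 0-based, end-exclusive, and end includes the stop codon.
    min_aa is an illustrative cutoff on peptide length.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i  # Open an ORF at the first in-frame ATG.
            elif CODON_TABLE.get(codon) == "*":
                peptide = translate(seq[start:i])
                if len(peptide) >= min_aa:
                    orfs.append((start, i + 3, peptide))
                start = None  # Look for the next ORF in this frame.
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGGCTTGATT", min_aa=2))
```

ORFs that run off the end of the sequence without a stop codon are ignored in this sketch, and only the given strand is scanned.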

About the authors
This study was performed by Jiamin Hu, Yongqiang Zheng, and Ze-Xian Liu, who are from

Sun Yat-sen University Cancer Center,

Building 2#, 651 Dongfeng East Road,

Guangzhou 510060, P. R. China


Email: liuzx AT sysucc.org.cn