About RNAct

Our database enables a global view of the protein–RNA interactome. RNAct currently covers the human, mouse and yeast genomes and contains a total of 5.87 billion pairwise interactions, reflecting nearly 120 years of computation time on the CRG's high-performance computing cluster. It combines experimentally identified interactions (e.g. from ENCODE) with ab initio predictions, enabling full coverage of the RNA-binding proteome. An in-depth description of our protein–RNA interaction prediction algorithm (catRAPID) is available here.


Background

To compute the interaction propensity scores, we used the catRAPID approach (Bellucci et al., Nature Methods 2011) with the fragmentation procedure (Cirillo et al., RNA 2013; Agostini et al., Nucleic Acids Research 2013) and normalised the scores for sequence length in a manner similar to previous work (Agostini et al., Nucleic Acids Research 2013). ViennaRNA (Gruber et al., Nucleic Acids Research 2008) was used internally for RNA secondary structure prediction. For each protein–RNA pair, the fragment with the maximum interaction propensity score is used to assess overall binding ability (Figure 1, Left). We stress that the method was trained on X-ray and NMR data, and that its performance on the experimental eCLIP data (212,256 high-confidence interactions observed in all available eCLIP replicates, against a background sampled from slightly over 2 billion protein–RNA pairs) therefore reflects its predictive power (Figure 1, Right). More details on our prediction method are available here.
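The "maximum over fragments" step described above can be illustrated with a minimal sketch. This is not the catRAPID implementation: the windowing scheme, the window sizes, and the scoring function `toy_score` are all hypothetical placeholders chosen for illustration, and no length normalisation is applied here.

```python
from itertools import product
from typing import Callable, List


def fragments(seq: str, size: int, step: int) -> List[str]:
    """Split a sequence into overlapping windows (hypothetical toy scheme)."""
    if len(seq) <= size:
        return [seq]
    starts = list(range(0, len(seq) - size + 1, step))
    # Add a final window so the end of the sequence is always covered.
    if starts[-1] != len(seq) - size:
        starts.append(len(seq) - size)
    return [seq[s:s + size] for s in starts]


def toy_score(protein_fragment: str, rna_fragment: str) -> float:
    """Placeholder score, NOT catRAPID: counts K/R residues times G/U bases."""
    basic = sum(c in "KR" for c in protein_fragment)
    gu = sum(c in "GU" for c in rna_fragment)
    return float(basic * gu)


def max_fragment_score(
    protein: str,
    rna: str,
    score: Callable[[str, str], float],
    protein_size: int = 4,
    rna_size: int = 6,
    step: int = 2,
) -> float:
    """Score every protein/RNA fragment pair and keep the maximum."""
    protein_frags = fragments(protein, protein_size, step)
    rna_frags = fragments(rna, rna_size, step)
    return max(score(p, r) for p, r in product(protein_frags, rna_frags))


best = max_fragment_score("AAKRKA", "ACGUGUAC", toy_score)
```

Any real scoring function with the same two-argument signature could be passed in place of `toy_score`; the point is only that the pair-level value is the maximum over all fragment combinations.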

[Figure 1 panels: density plot (Left), ROC curve (Right)]
Figure 1 (Left) Interaction propensity scores for the background (pink, sampled from slightly over 2 billion human protein–RNA pairs) and positive set (cyan, 212,256 high-confidence protein–RNA interactions revealed by eCLIP in all available replicates). The z-scores (standard scores) reported on the RNAct pages are based on the blue distribution, with the solid cyan line indicating the mean and the dashed line indicating a z-score of +1 (one standard deviation above the mean). (Right) The length-normalised catRAPID score shows a receiver operating characteristic (ROC) area under the curve (AUC) of 0.72 for the high-confidence ENCODE eCLIP data (i.e. interactions detected in all available replicates, resulting in 212,256 interactions with human GENCODE "basic" RNAs). Prior to normalisation for fragment sequence lengths, the catRAPID score showed a ROC AUC of 0.77. When including all eCLIP interactions regardless of replication (723,881 interactions for GENCODE "basic" RNAs), this AUC was 0.76. These values indicate a strong predictive performance of the catRAPID method, which was trained on X-ray and NMR data, on recent high-confidence experimental data.
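As a reminder of what a z-score (standard score) expresses, here is a minimal sketch: a raw score is rescaled by the mean and standard deviation of a reference distribution. The function name and the example numbers are illustrative assumptions; which distribution RNAct uses as its reference is stated in the caption above and is not implied by this code.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def z_scores(scores: Sequence[float], reference: Sequence[float]) -> List[float]:
    """Express each score in standard deviations from the reference mean."""
    mu = mean(reference)
    sigma = pstdev(reference)  # population standard deviation of the reference
    if sigma == 0:
        raise ValueError("reference distribution has zero variance")
    return [(s - mu) / sigma for s in scores]


# Toy reference distribution with mean 5 and standard deviation 2.
reference = [2, 4, 4, 4, 5, 5, 7, 9]
standardised = z_scores([7, 5, 3], reference)
```

A z-score of +1, as marked by the dashed line in the figure, corresponds to a raw score one standard deviation above the reference mean.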


Why study the protein–RNA interactome?

RNA-binding proteins (RBPs) are implicated in a number of physiological and pathological processes, with molecular mechanisms ranging from defects in splicing, localisation and translation to the formation of aggregates (Marchese et al. 2016). Examples include heterogeneous and life-threatening diseases such as amyotrophic lateral sclerosis, spinocerebellar ataxia and retinitis pigmentosa, among others (Markmiller et al. 2018). The RNAct database reports — for the first time — the map of all possible protein–RNA interactions in the human, mouse and yeast genomes.

How many RBPs are there? What are the specific binding partners of each RBP?

Around 1,400 human proteins have been experimentally determined to bind RNA (Hentze et al. 2018). This catalogue has expanded quickly in recent years thanks to biochemical advances. Many proteins contain one or more RNA-binding regions, either in the form of canonical globular domains or of more recently discovered, intrinsically disordered RNA-binding regions. Additionally, protein–protein interaction interfaces and even enzyme active sites are sometimes employed for RNA binding. In the RNAct database we provide the interactome of each protein and RNA in the genome, even where experimental evidence of RNA-binding activity is not (yet) available. We report all protein–RNA interactions revealed by the ENCODE project, currently for 150 proteins (Van Nostrand et al. 2016), and show that they agree well with our predictions (ROC AUC of 0.72 for the high-confidence set; Figure 1, Right), which indicates that our database may be of great value to the scientific community.

Are the predictions accurate enough?

Our group developed the catRAPID approach (Bellucci et al. 2011, Cirillo et al. 2017), which is the most accurate ab initio method for prediction of protein–RNA interactions, based on X-ray and NMR structures. catRAPID and its variants have been run more than 130,000 times by external users and provide a score for the association of a protein–RNA pair (details are available here). The agreement between the original catRAPID approach (Bellucci et al. 2011) and the eCLIP experiments (Van Nostrand et al. 2016) is strong (AUC=0.72, see Figure 1, Right), and we stress that our method was trained on selected X-ray and NMR structures, not on recent high-throughput data.
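For readers unfamiliar with the AUC figure quoted above, a minimal sketch of how a ROC AUC can be computed from two sets of scores follows. It uses the pairwise (Mann-Whitney) formulation: the AUC equals the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half. The function name and the toy scores are illustrative assumptions and are not RNAct data.

```python
from typing import Sequence


def roc_auc(positives: Sequence[float], negatives: Sequence[float]) -> float:
    """ROC AUC as the fraction of positive/negative pairs ranked correctly."""
    if not positives or not negatives:
        raise ValueError("both score sets must be non-empty")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy scores: 11 of the 12 positive/negative pairs are ordered correctly.
auc = roc_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. The double loop is quadratic in the number of scores, so for sets the size of the eCLIP data a rank-based implementation would be the practical choice.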

What type of sequences do you use?

We run our protein–RNA interaction predictions on mature spliced transcripts as annotated by GENCODE and Ensembl, including 5' and 3' UTRs. Including introns would increase the computation time needed by a large factor. The eCLIP data, however, includes peaks within introns as well as in the 5' and 3' UTRs.

What will be our future directions?

We plan to cover the major model organisms, particularly C. elegans, Drosophila melanogaster and Arabidopsis thaliana, in the near future. Our research targets genetic disease risk via variants in the human protein–RNA interactome and is enabled by recent data on experimentally determined protein–RNA interactions (Marchese et al. 2017). We use predictions to explore protein–RNA interactomes in a genome-wide manner, beyond current experimental data.

Updates

Feel free to follow us on Twitter at @tartaglialab for updates.

Contact

Please feel free to email Ben Lang (benjamin.lang@crg.eu) and Gian Gaetano Tartaglia (gian@tartaglialab.com) — any questions, ideas, doubts and feedback are very welcome.

How to cite RNAct

Please reference Lang, B., Armaos, A., and Tartaglia, G.G. (2018). RNAct: Protein–RNA interaction predictions for model organisms with supporting experimental data. Nucleic Acids Res. 8, 14741.

Funding

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No 727658 and 793135.

Acknowledgements

Template and CSS from Bootstrap, various icons from Font Awesome, species icons by Danil Polshin (human), needumee (mouse), and Luiz Carvalho (yeast) from the Noun Project, and table export to CSV files using ExcellentExport by Jordi Burgos.

Licence

Our own work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Licence.

See also: the CRG's legal notice. © 2018 tartaglialab.com