Non-coding RNAs identification with Hybrid Random Forest Ensemble algorithm

Abstract

Non-coding RNAs (ncRNAs) span a wide spectrum, from short to long ncRNAs, which makes identifying ncRNA signals within genomic regions a great challenge. This study developed a classification tool based on a hybrid of a random forest (RF) and a logistic regression model that efficiently discriminates not only short ncRNA sequences but also long and complex ncRNA classes. The RF-based classifier is trained on well-balanced training data with a discriminative set of features and achieves an accuracy of 92.11%, a sensitivity of 90.7% and a specificity of 93.5%. The selected feature set includes a newly proposed feature called SCORE, which is generated by a logistic regression function that combines five significant features (structure, sequence, modularity, structural robustness and coding potential) to better characterize long ncRNA (lncRNA) functional elements. In this study, SCORE proved to be a useful criterion for finding lncRNAs, improving the performance of the RF-based classifier in classifying Rfam’s lncRNA families. A genome-wide ncRNA classification framework was applied to a wide variety of organisms, especially genomes with potential economic, social, public health, environmental and agricultural impact, such as various bacterial genomes, the Spirulina genome, and rice and human genomic region sequences. The results show that our framework identifies known ncRNAs with sensitivity levels above 90% and 77.7% for prokaryotic and eukaryotic sequences, respectively.
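As a rough illustration of the two ideas in the abstract, the sketch below computes a SCORE-like composite feature by passing five feature values through a logistic function, and derives accuracy, sensitivity and specificity from binary predictions. It is a minimal sketch under stated assumptions, not the published implementation: the weights, the intercept, the feature keys and the function names are hypothetical placeholders, since the abstract does not report the fitted coefficients or how each feature is scaled.

```python
import math

# Hypothetical feature keys for the five feature groups named in the abstract.
FEATURE_NAMES = (
    "structure",
    "sequence",
    "modularity",
    "structural_robustness",
    "coding_potential",
)

# Hypothetical coefficients: placeholders for illustration only. In practice
# they would be fitted by logistic regression on labelled training data.
WEIGHTS = {
    "structure": 1.2,
    "sequence": 0.8,
    "modularity": 0.5,
    "structural_robustness": 0.9,
    "coding_potential": -1.5,
}
INTERCEPT = -0.3


def composite_score(features):
    """Combine five feature values into one value in (0, 1) via a logistic function."""
    z = INTERCEPT + sum(WEIGHTS[name] * features[name] for name in FEATURE_NAMES)
    return 1.0 / (1.0 + math.exp(-z))


def classification_metrics(y_true, y_pred):
    """Return accuracy, sensitivity and specificity for binary labels (1 = ncRNA)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        # Sensitivity: fraction of true ncRNAs that are recovered.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # Specificity: fraction of negatives that are correctly rejected.
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


if __name__ == "__main__":
    example = {name: 0.5 for name in FEATURE_NAMES}  # made-up feature values
    print(composite_score(example))
    print(classification_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1]))
```

In a hybrid setup of the kind the abstract describes, a value like this would be appended to each sequence's feature vector before the random forest is trained, so the forest can split on the composite score alongside the individual features.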

Supplementary Files: Source Code, README, Example files

Identification of non-coding RNAs with a new composite feature in the Hybrid Random Forest Ensemble algorithm. Nucleic Acids Research 42(11), e93.