HLRF: NON-CODING RNAS IDENTIFICATION WITH HYBRID RANDOM FOREST ENSEMBLE ALGORITHM
------------------------------------------------------------------------------------------------------------------------------------

System requirements:
===================
- The framework was developed on a Linux platform (Redhat/Fedora Release 18; 64-bit; Intel Core i7-3770 CPU @ 3.4GHz x 8).
- Perl, Java, and a C compiler must be installed on your system.
- RNAfold (ViennaRNA package), the Bioperl perl module, the Statistic-basic perl module, and the Numerical shuffle perl module must be installed on your system.
- The BLAST and Infernal programs must be installed on your system.

Running the classifier:
======================
- Go to the directory with the command
    > cd NCPred
- Execute the script with the command
    > perl HLRF_NcPred.pl [input_file]
  Example:
    > perl HLRF_NcPred.pl Ecoli_1.fasta

INPUT
=====
The input file must be in FASTA format, with each nucleotide sequence on a single line (one-line FASTA) and containing only the capital letters A, C, G, and U.
If your file is not in this layout, run the Fasta1line.pl script (from http://wiki.bioinformatics.ucdavis.edu/index.php/Fasta1line.pl) before running HLRF to put all nucleotides of each sequence on one line following its header line.

Example input fasta file: (Each sequence in a single-line)
------------------------
>name1
UGUCUAGUGCCUUUAUGUGUAGUGCCUUAUUCAGGAAGGUGUUACUUAAUGUAUUAACAUUUGUAAGGCACCCUUCUGAGUAGAGUAAUGUGCAACAUGGAUAUCAUUUAAUUUGGCCCUUUUCCAAUC
>name2
GUGCUGUGUGUAGUGCUUCACUUCAAUAAGUGCCAUUCAUGUGUCUAGAAAUAUGUUUUGCACCUUUUGGAGUGAAAUAAUGCACAACAGGUAC
>name3
UGAGUGUAGUGCUCUACUCCAGAGGGCGUCAAUCACAUAAACUAAAACAUGAUUGUCACCUUUUUGAGUAGAGCAAUACACAUCA
>name4
UGUACAGUGCCUUUCACAGGGAGGUGUCAUUUGUGUGAACUAAACUAUAAAUGUCACCUUUCUGGGAAGUGUAAUGUACA
>name5
AUGUUGUCUGUGGUACCCUACUCUGGAGAGUGACAAUCAUGUAUAAUUAAAUUUGAUUGACACUUCUGUGAGUAGAGUAACGUAUGACACGU
......
......
......

For a genome scan, use the sliding window module to cut the genome sequence into multiple FASTA sequences.
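The windowing step for a genome scan can be sketched as follows. This is a minimal illustration, not the sliding window module shipped with HLRF: the function names are hypothetical, and the window length (120) and step (60) are assumptions chosen to match the coordinates in this README's genome-scan example headers. The conversion of T to U is likewise an assumption, made so that DNA input satisfies the A, C, G, U requirement.

```python
def sliding_windows(seq, window=120, step=60):
    """Yield (start, end, subsequence) using 1-based inclusive coordinates.

    The last window is truncated at the end of the sequence, and no
    further window is produced once the end has been reached.
    """
    n = len(seq)
    start = 0
    while start < n:
        end = min(start + window, n)
        yield start + 1, end, seq[start:end]
        if end == n:
            break
        start += step


def window_records(name, seq, window=120, step=60):
    """Return (header, sequence) pairs, one per window.

    Assumption: input may be DNA, so it is upper-cased and T is
    replaced by U to satisfy the A/C/G/U-only input requirement.
    """
    rna = seq.upper().replace("T", "U")
    records = []
    for i, (start, end, sub) in enumerate(sliding_windows(rna, window, step), 1):
        # Header carries the window index and its coordinates in the region.
        records.append((f">{name}_{i} {start} to {end}", sub))
    return records


if __name__ == "__main__":
    # Hypothetical region name and toy sequence, for illustration only.
    for header, sub in window_records("region-1", "ACGT" * 50):
        print(header)
        print(sub)  # each sequence stays on a single line
```

The headers produced here contain only the window index and the window coordinates within the region. The example headers in this README also carry the accession, the genomic coordinates of the region, and the strand, so those fields would need to be appended to match that layout.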
Note that the genomic coordinates should be included in the name (header line) of each FASTA record. Again, each sequence must be on a single line and contain only the capital letters A, C, G, and U.

Example input fasta file of genome-scan (in Example files)
---------------------------------------
>U00096-ign-0_1 1 to 81 U00096 256-336 +
CGCGUACAGGAAACACAGAAAAAAGCCCGCACCUGACAGUGCGGGCUUUUUUUUUCGACCAAAGGUAACGAGGUAACAACC
>U00096-ign-3_1 1 to 120 U00096 5021-5233 +
AAUCUAUUCAUUAUCUCAAUCAGGCCGGGUUUGCUUUUAUGCAGCCCGGCUUUUUUAUGAAGAAAUUAUGGAGAAAAAUGACAGGGAAAAAGGAGAAAUUCUCAAUAAAUGCGGUAACUU
>U00096-ign-3_2 61 to 180 U00096 5021-5233 +
AGAAAUUAUGGAGAAAAAUGACAGGGAAAAAGGAGAAAUUCUCAAUAAAUGCGGUAACUUAGAGAUUAGGAUUGCGGAGAAUAACAACCGCCGUUCUCAUCGAGUAAUCUCCGGAUAUCG
>U00096-ign-3_3 121 to 213 U00096 5021-5233 +
AGAGAUUAGGAUUGCGGAGAAUAACAACCGCCGUUCUCAUCGAGUAAUCUCCGGAUAUCGACCCAUAACGGGCAAUGAUAAAAGGAGUAACCU
>U00096-ign-4_1 1 to 120 U00096 5531-8237 +
AUGACAAAUGCCGGGUAACAAUCCGGCAUUCAGCGCCUGAUGCGACGCUGGCGCGUCUUAUCAGGCCUACGUUAAUUCUGCAAUAUAUUGAAUCUGCAUGCUUUUGUAGGCAGGAUAAGG
>U00096-ign-4_2 61 to 180 U00096 5531-8237 +
UCAGGCCUACGUUAAUUCUGCAAUAUAUUGAAUCUGCAUGCUUUUGUAGGCAGGAUAAGGCGUUCACGCCGCAUCCGGCAUUGACUGCAAACUUAACGCUGCUCGUAGCGUUUAAACACC
>U00096-ign-4_3 121 to 240 U00096 5531-8237 +
CGUUCACGCCGCAUCCGGCAUUGACUGCAAACUUAACGCUGCUCGUAGCGUUUAAACACCAGUUCGCCAUUGCUGGAGGAAUCUUCAUCAAAGAAGUAACCUUCGCUAUUAAAACCAGUC
>U00096-ign-4_4 181 to 300 U00096 5531-8237 +
AGUUCGCCAUUGCUGGAGGAAUCUUCAUCAAAGAAGUAACCUUCGCUAUUAAAACCAGUCAGUUGCUCUGGUUUGGUCAGCCGAUUUUCAAUAAUGAAACGACUCAUCAGACCGCGUGCU
..........
..........
..........

OUTPUT
======
If there are no errors, the prediction results are printed at the command-line prompt as follows:
___________________________________________________________________________
Input file: Ecoli_1.fasta; There are 13 query sequences in the input file.
.... Completed Extracting All Features ....
All extracted features are written to the file: Ecoli_1.fasta.csv.
############ Prediction results by Heterogeneous Ensemble #############
Sequence no. 1 predict as 1:ncRNA - with a probability of 1
Sequence no. 2 predict as 2:Other - with a probability of 0.8
Sequence no. 3 predict as 2:Other - with a probability of 0.9
Sequence no. 4 predict as 1:ncRNA - with a probability of 0.7
Sequence no. 5 predict as 1:ncRNA - with a probability of 0.8
Sequence no. 6 predict as 2:Other - with a probability of 0.9
Sequence no. 7 predict as 2:Other - with a probability of 0.8
Sequence no. 8 predict as 2:Other - with a probability of 0.7
Sequence no. 9 predict as 2:Other - with a probability of 0.9
Sequence no. 10 predict as 2:Other - with a probability of 1
Sequence no. 11 predict as 2:Other - with a probability of 0.9
Sequence no. 12 predict as 2:Other - with a probability of 1
Sequence no. 13 predict as 2:Other - with a probability of 0.9
[jeabbo@localhost NCPred]$
_____________________________________________________________________________

In addition, the program generates 3 files: [input_file].out, [input_file].csv, and [input_file].SeqList.
(We have included some examples of input and output files in the folder: Example files)

--1) [input_file].out -- the prediction output.
     Ex: Ecoli_1.fasta.out, Ecoli_16.fasta.out.
     (Please ignore the last line of the output, which is the control sample.)
--2) [input_file].csv -- the extracted features. In this step, all samples are labeled as ncRNA (in the last column), except the last line, which is the negative control sample. This file is the result of feature extraction, not the prediction result from the model.
     Ex: Ecoli_1.fasta.csv, Ecoli_16.fasta.csv.
--3) [input_file].SeqList -- the list of input sequences for which prediction completed. If a sequence is not in a suitable format (it contains characters other than A, C, G, U), or an error occurs during its prediction, its name will not appear in this file.
     Ex: Ecoli_1.fasta.SeqList, Ecoli_16.fasta.SeqList.
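
The result lines printed on the command line follow a regular pattern, so they can be collected into a table for downstream filtering. The following is a minimal sketch that parses lines of the form shown in the sample output above. The function name and the dictionary keys are hypothetical, and the layout of the [input_file].out file is not described in this README, so the pattern may need adjusting before it is applied to that file.

```python
import re

# Matches lines such as:
#   Sequence no. 4 predict as 1:ncRNA - with a probability of 0.7
LINE_RE = re.compile(
    r"Sequence no\.\s*(\d+)\s+predict as\s+(\d+):(\w+)"
    r"\s+-\s+with a probability of\s+(\d+(?:\.\d+)?)"
)


def parse_predictions(text):
    """Return one dict per prediction line found in the captured output."""
    results = []
    for match in LINE_RE.finditer(text):
        results.append(
            {
                "index": int(match.group(1)),       # sequence number in the input
                "class_id": int(match.group(2)),    # 1 or 2 in the sample output
                "label": match.group(3),            # "ncRNA" or "Other"
                "probability": float(match.group(4)),
            }
        )
    return results


if __name__ == "__main__":
    sample = (
        "Sequence no. 1 predict as 1:ncRNA - with a probability of 1\n"
        "Sequence no. 2 predict as 2:Other - with a probability of 0.8\n"
    )
    for row in parse_predictions(sample):
        print(row)
```

For example, the console output could be saved with `perl HLRF_NcPred.pl Ecoli_1.fasta > run.log` (run.log is an arbitrary file name) and the contents of that file passed to `parse_predictions`.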
=============================================================================================================================================

References:
------------
- Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403-10.
- Eddy, S. R., and Durbin, R. (1994) RNA sequence analysis using covariance models. Nucleic Acids Res., 22, 2079-88.
- Nawrocki, E. P., Kolbe, D. L., and Eddy, S. R. (2009) Infernal 1.0: Inference of RNA alignments. Bioinformatics, 25, 1335-7.
- Yao, Z., Weinberg, Z., and Ruzzo, W. L. (2006) CMfinder--a covariance model based RNA motif finding algorithm. Bioinformatics, 22, 445-52.
- Lee, M. T., and Kim, J. (2008) Self Containment, a Property of Modular RNA Structures, Distinguishes microRNAs. PLoS Comput. Biol., 4, e1000150.
- Lertampaiporn, S., Thammarongtham, C., Nukoolkit, C., Kaewkamnerdpong, B., and Ruengjitchatchawalya, M. (2013) Heterogeneous ensemble approach with discriminative features and modified-SMOTEbagging for pre-miRNA classification. Nucleic Acids Res., 41, e21.
- Xue, C., Li, F., He, T., Liu, G., Li, Y., and Zhang, X. (2005) Classification of Real and Pseudo MicroRNA Precursors Using Local Structure-Sequence Features and Support Vector Machine. BMC Bioinformatics, 6, 310.
- Batuwita, R., and Palade, V. (2009) microPred: Effective classification of pre-miRNAs for human miRNA gene prediction. Bioinformatics, 25, 989-995.
- Loong, K., and Mishra, S. (2007) De novo SVM classification of precursor microRNAs from genomic pseudo hairpins using global and intrinsic folding measures. Bioinformatics, 23, 1321-1330.
- Hofacker, I. L. (2003) Vienna RNA secondary structure server. Nucleic Acids Res., 31, 3429-3431.
- Markham, N. R., and Zuker, M. (2005) DINAMelt web server for nucleic acid melting prediction. Nucleic Acids Res., 33, W577-W581.
- Chang, C., and Lin, C. (2001) LIBSVM: a library for support vector machines.
  Available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
- Hofacker, I. L., Fontana, W., Stadler, P. F., Bonhoeffer, S., Tacker, M., and Schuster, P. (1994) Fast folding and comparison of RNA secondary structures. Monatshefte f. Chemie, 125, 167-188.
- Kong, L., Zhang, Y., Ye, Z., Liu, X., Zhao, S., Wei, L., and Gao, G. (2007) CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine. Nucleic Acids Res., 35, W345-9.
- Slater, G. C. (2000) Algorithms for the Analysis of Expressed Sequence Tags. Cambridge: University of Cambridge.
- Fasta1line.pl: Joseph Fass (2008) The Bioinformatics Core at UC Davis Genome Center. http://wiki.bioinformatics.ucdavis.edu/index.php/Fasta1line.pl