_PROBLEM CoEPrA-2006_Classification_003
_GROUP_NAME Wuju Li
_GROUP_MEMBERS Hua Li Yanyan Hou Wuju Li
_ADDRESS Center of Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, 100850, China
_MODELING_PROCEDURE
Description of the Tclass classification system

The Tclass classification system was used to determine the class (-1/+1) of each member of the test dataset CoEPrA-2006_Classification_003_Prediction_peptides.txt. Tclass was developed in our center and was originally written for sample classification based on gene expression profiles (Bioinformatics 2002, 18:325-326). It integrates the Fisher and Naive Bayes prediction methods with feature selection methods such as forward feature selection. For the present task, the Naive Bayes method was used to determine the class (-1/+1) of each member of the test dataset.

Processes of classification

After the training dataset and its class information (-1/+1) were input into Tclass (CoEPrA-2006_Classification_003_Calibration_Data.txt), the three optimal feature sets were found using leave-one-out cross-validation (LOOCV) accuracy as the objective function. The prediction accuracy is 0.8421. There are only three nona-peptides misclassified. The three optimal feature sets are as follows.

The first set: 323, 753, 1926, 4524, 4825, 4878, and 5345
The second set: 323, 753, 1926, 4363, 4524, 4878, and 5345
The third set: 323, 753, 1926, 4008, 4524, 4878, and 5345

Here the numbers such as 323 and 753 identify descriptors; for example, 323 denotes the 323rd descriptor.

To find the best of the three optimal feature sets, a stability analysis was performed for each feature set as follows. First, the 89 nona-peptides in the training dataset were randomly divided into two groups with a partition ratio of 75%, and the major part was used to construct the classifier. Second, the minor part was used to test the classifier.
Third, the above steps were repeated 1000 times, and the average prediction accuracy over the 1000 minor parts was taken as the stability index for the feature set. The stability indexes are 0.7977, 0.7991, and 0.7999 for the first, second, and third sets, respectively. Therefore, the third feature set (323, 753, 1926, 4008, 4524, 4878, and 5345) was selected as the final feature set, and the 1000 classifiers built for it during the stability analysis were taken as the final classifiers. Detailed information on the 1000 classifiers is provided in supplement 1.

To determine the class of a new nona-peptide, all 1000 classifiers were used. If more than 500 classifiers assign the nona-peptide to class -1, it is predicted to belong to class -1. Otherwise, it is predicted to belong to class 1.

Classification information for the training dataset

Applying the 1000 classifiers to the training dataset, we found that 21 nona-peptides were misclassified (the 13th, 23rd, 25th, 39th, 42nd, 45th, 47th, 60th, 62nd, 66th, 67th, 74th, 80th, 83rd, 87th, 90th, 93rd, 105th, 109th, 110th, and 125th nona-peptides). Therefore, the numbers of TP, TN, FP, and FN nona-peptides are 56, 56, 11, and 10, respectively, which give the sensitivity (Se), specificity (Sp), and prediction accuracy as follows.

Se = TP / (TP+FN) = 84.85%
Sp = TN / (TN+FP) = 83.58%
Accuracy = (TP+TN) / (TP+TN+FP+FN) = 84.21%

Classification information for the test dataset

The test dataset contains 133 nona-peptides. The classification obtained with the 1000 classifiers is provided in table 2, which shows 77 nona-peptides in class 1 and 56 nona-peptides in class -1.
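
The 500-vote rule and the training-set metrics described above can be illustrated with a minimal Python sketch. The function names `majority_vote` and `summarize` are hypothetical and are not part of Tclass; the counts are the TP, TN, FP, and FN values reported above, and the votes in the example are made up to show the tie case.

```python
from typing import Dict, Sequence


def majority_vote(votes: Sequence[int], threshold: int = 500) -> int:
    """Return -1 if more than `threshold` votes are -1, otherwise +1.

    `votes` holds one prediction (-1 or +1) per classifier. The default
    threshold of 500 matches the rule stated for 1000 classifiers.
    """
    negatives = sum(1 for v in votes if v == -1)
    return -1 if negatives > threshold else 1


def summarize(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Sensitivity, specificity, and accuracy, each as a percentage."""
    total = tp + tn + fp + fn
    return {
        "Se": 100.0 * tp / (tp + fn),
        "Sp": 100.0 * tn / (tn + fp),
        "Accuracy": 100.0 * (tp + tn) / total,
    }


if __name__ == "__main__":
    # Counts reported for the training dataset.
    for name, value in summarize(tp=56, tn=56, fp=11, fn=10).items():
        print(f"{name} = {value:.2f}%")

    # Illustrative votes only: exactly 500 of 1000 classifiers vote -1.
    print(majority_vote([-1] * 500 + [1] * 500))
```

Because the rule requires strictly more than 500 votes for class -1, an even 500/500 split is assigned to class +1.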
_PREDICTION
Obj_00001 +1
Obj_00002 +1
Obj_00003 -1
Obj_00004 +1
Obj_00005 +1
Obj_00006 -1
Obj_00007 -1
Obj_00008 -1
Obj_00009 -1
Obj_00010 -1
Obj_00011 +1
Obj_00012 -1
Obj_00013 -1
Obj_00014 +1
Obj_00015 -1
Obj_00016 -1
Obj_00017 +1
Obj_00018 +1
Obj_00019 +1
Obj_00020 +1
Obj_00021 -1
Obj_00022 +1
Obj_00023 -1
Obj_00024 +1
Obj_00025 -1
Obj_00026 +1
Obj_00027 -1
Obj_00028 +1
Obj_00029 +1
Obj_00030 +1
Obj_00031 +1
Obj_00032 +1
Obj_00033 +1
Obj_00034 -1
Obj_00035 -1
Obj_00036 +1
Obj_00037 -1
Obj_00038 +1
Obj_00039 -1
Obj_00040 +1
Obj_00041 +1
Obj_00042 -1
Obj_00043 +1
Obj_00044 +1
Obj_00045 +1
Obj_00046 +1
Obj_00047 -1
Obj_00048 -1
Obj_00049 +1
Obj_00050 +1
Obj_00051 +1
Obj_00052 +1
Obj_00053 -1
Obj_00054 +1
Obj_00055 +1
Obj_00056 -1
Obj_00057 +1
Obj_00058 +1
Obj_00059 +1
Obj_00060 -1
Obj_00061 -1
Obj_00062 -1
Obj_00063 +1
Obj_00064 +1
Obj_00065 +1
Obj_00066 -1
Obj_00067 -1
Obj_00068 +1
Obj_00069 +1
Obj_00070 +1
Obj_00071 +1
Obj_00072 +1
Obj_00073 +1
Obj_00074 -1
Obj_00075 +1
Obj_00076 +1
Obj_00077 +1
Obj_00078 -1
Obj_00079 -1
Obj_00080 +1
Obj_00081 +1
Obj_00082 -1
Obj_00083 -1
Obj_00084 +1
Obj_00085 -1
Obj_00086 +1
Obj_00087 -1
Obj_00088 -1
Obj_00089 +1
Obj_00090 +1
Obj_00091 +1
Obj_00092 -1
Obj_00093 +1
Obj_00094 +1
Obj_00095 -1
Obj_00096 +1
Obj_00097 -1
Obj_00098 -1
Obj_00099 -1
Obj_00100 +1
Obj_00101 -1
Obj_00102 +1
Obj_00103 -1
Obj_00104 +1
Obj_00105 +1
Obj_00106 +1
Obj_00107 +1
Obj_00108 +1
Obj_00109 +1
Obj_00110 -1
Obj_00111 -1
Obj_00112 +1
Obj_00113 -1
Obj_00114 -1
Obj_00115 +1
Obj_00116 +1
Obj_00117 +1
Obj_00118 -1
Obj_00119 +1
Obj_00120 +1
Obj_00121 +1
Obj_00122 -1
Obj_00123 +1
Obj_00124 -1
Obj_00125 -1
Obj_00126 -1
Obj_00127 +1
Obj_00128 -1
Obj_00129 -1
Obj_00130 +1
Obj_00131 -1
Obj_00132 -1
Obj_00133 +1