_PROBLEM CoEPrA-2006_Classification_003
_GROUP_NAME Matt Segall
_GROUP_MEMBERS
Joelle Gola
Olga Obrezanova
Matt Segall
_ADDRESS
Inpharmatica
127 Cambridge Science Park
Milton Road, Cambridge, CB4 0GD, UK
tel. +44(0)1223 706177
e-mail: o.obrezanova@inpharmatica.co.uk
_MODELING_PROCEDURE
1. Calculated one additional descriptor: molecular weight
(for example, see http://www.expasy.org/tools/protparam.html).

2. Descriptor pre-selection.
Descriptors with low standard deviation or low occurrence, and descriptors
highly correlated with another descriptor, were filtered out.
Excluded descriptors:
- with standard deviation < 5.0E-04
- with occurrence < 0.5%
- with correlation coefficient >= 0.9 (only one of each pair was kept).
After filtering, 2044 descriptors remained.

3. Modelling technique.
Decision tree using the C4.5 algorithm. The gain ratio criterion is used to
select descriptors. The original tree is converted to a set of rules. Each
rule is simplified, that is, irrelevant conditions are deleted based on the
results of statistical tests on contingency tables. The whole set of rules is
then simplified by eliminating unnecessary rules. A prediction is made by
summing the votes of all rules that apply, each weighted by the rule's
accuracy (Laplace ratio).

4. Validation.
A cross-validation procedure (5 groups) was used to choose the best model and
to obtain an estimate of the misclassification rate on an unseen set.

5. Final model and statistics.
Calibration set: 133 cases, 2044 descriptors.
Ruleset on calibration set:
Rule 1: Desc_00196 > 1.02, Desc_03110 > 1.78, class=1 (27/0) accuracy=0.965517
Rule 2: Desc_00196 <= 1.02, Desc_01901 > -0.22, Desc_03474 > -0.29, class=-1 (25/0) accuracy=0.962963
Rule 3: Desc_00196 <= 1.02, Desc_02950 <= 0.581, Desc_05160 <= 8.249, Desc_00687 > 0.32, class=-1 (20/1) accuracy=0.909091
Rule 4: Desc_00753 > 0, Desc_01317 <= 0, class=-1 (6/0) accuracy=0.875
Rule 5: Desc_01901 <= -0.22, Desc_02957 > -0.16, Desc_03688 <= 0, class=1 (15/2) accuracy=0.823529
Rule 6: Desc_03110 <= 1.78, Desc_02101 > 0.94, class=-1 (31/8) accuracy=0.727273
Rule 7: Desc_00196 <= 1.02, Desc_02950 <= 0.581, class=-1 (72/21) accuracy=0.702703
Rule 8: Desc_01317 > 0, class=1 (44/16) accuracy=0.630435
Rule 9: Desc_01901 <= -0.22, class=1 (86/32) accuracy=0.625
Rule 10 (default): class=1 (133/66) accuracy=0.503704

Number of misclassified cases = 16.
Misclassification error = 12.0301%.

Confusion matrix:
Class -1   Class 1   <-- predicted
--------   -------
      61         5   | Class -1
      11        56   | Class 1

Kappa statistic: 0.759548.
Mutual Information: 0.47846.

Estimated misclassification rate on unseen set is 37.66% (obtained by cross-validation).
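The rule accuracies and summary statistics reported above can be re-derived from the counts in the listing. The sketch below is not the authors' code. It assumes that "(n/e)" means cases covered / cases misclassified, that the Laplace ratio is the usual two-class form (n - e + 1) / (n + 2), and that the mutual information is given in bits; under those assumptions the reported figures are reproduced. The function names, the `vote` helper and its tie-breaking rule are hypothetical, and the submission does not say whether the default rule takes part in the vote.

```python
import math

# (covered, errors) pairs copied from the ruleset listing, rules 1-10.
RULE_COUNTS = [(27, 0), (25, 0), (20, 1), (6, 0), (15, 2),
               (31, 8), (72, 21), (44, 16), (86, 32), (133, 66)]


def laplace_ratio(covered, errors):
    """Laplace-corrected accuracy of a two-class rule: (n - e + 1) / (n + 2)."""
    return (covered - errors + 1) / (covered + 2)


def confusion_stats(matrix):
    """Return (error rate, Cohen's kappa, mutual information in bits).

    matrix[i][j] is the number of cases of actual class i predicted as class j.
    """
    total = sum(sum(row) for row in matrix)
    row_tot = [sum(row) for row in matrix]
    col_tot = [sum(col) for col in zip(*matrix)]
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    expected = sum(r * c for r, c in zip(row_tot, col_tot)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    mi = 0.0
    for i, row in enumerate(matrix):
        for j, n in enumerate(row):
            if n:  # empty cells contribute nothing
                mi += n / total * math.log2(n * total / (row_tot[i] * col_tot[j]))
    return 1 - observed, kappa, mi


def vote(applicable):
    """Hypothetical voting step: sum the accuracy-weighted votes of firing rules.

    applicable is a list of (predicted_class, accuracy) with classes +1 / -1.
    Ties are sent to class -1 here, which is an arbitrary assumption.
    """
    score = sum(cls * acc for cls, acc in applicable)
    return 1 if score > 0 else -1


if __name__ == "__main__":
    for k, (n, e) in enumerate(RULE_COUNTS, start=1):
        print(f"Rule {k}: accuracy={laplace_ratio(n, e):.6f}")
    error, kappa, mi = confusion_stats([[61, 5], [11, 56]])
    print(f"error={error:.6f} kappa={kappa:.6f} MI={mi:.5f} bits")
```

For example, a case matched by Rule 1 (class 1, weight 0.965517) and Rule 6 (class -1, weight 0.727273) would receive a positive score and be assigned class 1 under this voting scheme.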
_PREDICTION
Obj_00001 -1
Obj_00002 -1
Obj_00003 +1
Obj_00004 +1
Obj_00005 -1
Obj_00006 +1
Obj_00007 +1
Obj_00008 -1
Obj_00009 +1
Obj_00010 -1
Obj_00011 -1
Obj_00012 -1
Obj_00013 -1
Obj_00014 -1
Obj_00015 -1
Obj_00016 -1
Obj_00017 -1
Obj_00018 -1
Obj_00019 -1
Obj_00020 -1
Obj_00021 +1
Obj_00022 +1
Obj_00023 -1
Obj_00024 +1
Obj_00025 -1
Obj_00026 -1
Obj_00027 -1
Obj_00028 -1
Obj_00029 +1
Obj_00030 +1
Obj_00031 +1
Obj_00032 -1
Obj_00033 -1
Obj_00034 -1
Obj_00035 +1
Obj_00036 +1
Obj_00037 -1
Obj_00038 +1
Obj_00039 +1
Obj_00040 -1
Obj_00041 +1
Obj_00042 -1
Obj_00043 +1
Obj_00044 -1
Obj_00045 +1
Obj_00046 -1
Obj_00047 -1
Obj_00048 -1
Obj_00049 +1
Obj_00050 -1
Obj_00051 +1
Obj_00052 -1
Obj_00053 +1
Obj_00054 -1
Obj_00055 +1
Obj_00056 -1
Obj_00057 -1
Obj_00058 +1
Obj_00059 -1
Obj_00060 -1
Obj_00061 -1
Obj_00062 +1
Obj_00063 +1
Obj_00064 +1
Obj_00065 -1
Obj_00066 -1
Obj_00067 -1
Obj_00068 +1
Obj_00069 -1
Obj_00070 -1
Obj_00071 +1
Obj_00072 +1
Obj_00073 +1
Obj_00074 -1
Obj_00075 +1
Obj_00076 +1
Obj_00077 -1
Obj_00078 -1
Obj_00079 -1
Obj_00080 -1
Obj_00081 +1
Obj_00082 -1
Obj_00083 -1
Obj_00084 +1
Obj_00085 -1
Obj_00086 +1
Obj_00087 -1
Obj_00088 -1
Obj_00089 +1
Obj_00090 -1
Obj_00091 +1
Obj_00092 -1
Obj_00093 +1
Obj_00094 +1
Obj_00095 -1
Obj_00096 +1
Obj_00097 -1
Obj_00098 -1
Obj_00099 -1
Obj_00100 +1
Obj_00101 -1
Obj_00102 +1
Obj_00103 -1
Obj_00104 +1
Obj_00105 +1
Obj_00106 -1
Obj_00107 -1
Obj_00108 -1
Obj_00109 +1
Obj_00110 +1
Obj_00111 -1
Obj_00112 +1
Obj_00113 -1
Obj_00114 -1
Obj_00115 +1
Obj_00116 +1
Obj_00117 -1
Obj_00118 +1
Obj_00119 +1
Obj_00120 +1
Obj_00121 +1
Obj_00122 +1
Obj_00123 -1
Obj_00124 +1
Obj_00125 +1
Obj_00126 -1
Obj_00127 +1
Obj_00128 -1
Obj_00129 +1
Obj_00130 -1
Obj_00131 +1
Obj_00132 -1
Obj_00133 +1