_PROBLEM
CoEPrA-2006_Classification_001

_GROUP_NAME
Matt Segall

_GROUP_MEMBERS
Joelle Gola
Olga Obrezanova
Matt Segall

_ADDRESS
Inpharmatica
127 Cambridge Science Park
Milton Road, Cambridge, CB4 0GD, UK
tel. +44(0)1223 706177
e-mail: o.obrezanova@inpharmatica.co.uk

_MODELING_PROCEDURE
1. Additional descriptor. One additional descriptor was calculated: molecular weight (for example, see http://www.expasy.org/tools/protparam.html).

2. Descriptor pre-selection. Descriptors with low standard deviation, low occurrence, or high correlation with another descriptor were filtered out.
   Excluded descriptors:
   - standard deviation < 5.0E-04
   - occurrence < 0.5%
   - correlation coefficient >= 0.9 (only one descriptor of each correlated pair was kept)
   After filtering, 1864 descriptors remained.

3. Modelling technique. Decision tree built with the C4.5 algorithm. The gain ratio criterion is used to select descriptors. Stopping rules used: the minimum number of cases on a tree leaf is 2; the upper limit for the relative frequency of the majority class on a node is 0.9.

4. Validation. Cross-validation (5 groups) was used to obtain an estimate of the misclassification rate on an unseen set.

5. Final model and statistics.
   Calibration set: 89 cases, 1864 descriptors.

   Decision tree on calibration set:

   |Desc_02030 <= 0.8,
   | |Desc_05258 <= 1.5,
   | | |Desc_01704 <= 1.33,
   | | | |Desc_00717 <= -34.5, class=-1 (3/0)
   | | | |Desc_00717 > -34.5, class=1 (44/3)
   | | |Desc_01704 > 1.33, class=-1 (4/0)
   | |Desc_05258 > 1.5, class=-1 (6/0)
   |Desc_02030 > 0.8, class=-1 (32/3)

   Total number of cases = 89.
   Number of misclassified cases = 6.
   Misclassification error = 6.74157%.

   Confusion matrix:

   Class -1  Class +1   <-- predicted
   --------  --------
      42         3      | Class -1
       3        41      | Class +1

   Kappa statistic: 0.865152
   Mutual Information: 0.643711

   Estimated misclassification rate on unseen set is 31.46% (obtained by cross-validation).
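
The calibration-set statistics reported in step 5 can be recomputed from the confusion matrix alone. The following is a minimal Python sketch, not the group's own code: the function name summary_statistics is hypothetical, kappa is taken to be Cohen's kappa, and the mutual information is assumed to be measured in bits (base-2 logarithm), which is the convention that appears to reproduce the reported value.

```python
import math

# Confusion matrix from the calibration-set report.
# Rows are the actual class (-1, +1); columns are the predicted class (-1, +1).
confusion = [[42, 3],
             [3, 41]]


def summary_statistics(matrix):
    """Return (misclassification error, Cohen's kappa, mutual information in bits)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]

    observed = correct / total
    error = 1.0 - observed

    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1.0 - expected)

    # Mutual information between actual and predicted class (assumed base 2).
    mutual_information = 0.0
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if count:
                p = count / total
                ratio = count * total / (row_totals[i] * col_totals[j])
                mutual_information += p * math.log2(ratio)

    return error, kappa, mutual_information


error, kappa, mutual_information = summary_statistics(confusion)
print(f"Misclassification error = {100 * error:.5f}%")
print(f"Kappa statistic: {kappa:.6f}")
print(f"Mutual Information: {mutual_information:.6f}")
```

The misclassification error is simply 6/89; kappa corrects the observed agreement (83/89) for the agreement expected by chance given the class marginals.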

_PREDICTION
Obj_00001 -1
Obj_00002 -1
Obj_00003 -1
Obj_00004 -1
Obj_00005 +1
Obj_00006 +1
Obj_00007 -1
Obj_00008 -1
Obj_00009 -1
Obj_00010 +1
Obj_00011 +1
Obj_00012 +1
Obj_00013 +1
Obj_00014 +1
Obj_00015 -1
Obj_00016 -1
Obj_00017 +1
Obj_00018 +1
Obj_00019 +1
Obj_00020 -1
Obj_00021 +1
Obj_00022 -1
Obj_00023 -1
Obj_00024 +1
Obj_00025 +1
Obj_00026 +1
Obj_00027 +1
Obj_00028 +1
Obj_00029 +1
Obj_00030 +1
Obj_00031 +1
Obj_00032 +1
Obj_00033 -1
Obj_00034 -1
Obj_00035 +1
Obj_00036 -1
Obj_00037 -1
Obj_00038 +1
Obj_00039 -1
Obj_00040 +1
Obj_00041 +1
Obj_00042 -1
Obj_00043 -1
Obj_00044 +1
Obj_00045 +1
Obj_00046 -1
Obj_00047 +1
Obj_00048 +1
Obj_00049 +1
Obj_00050 +1
Obj_00051 +1
Obj_00052 +1
Obj_00053 -1
Obj_00054 -1
Obj_00055 +1
Obj_00056 +1
Obj_00057 +1
Obj_00058 -1
Obj_00059 +1
Obj_00060 -1
Obj_00061 -1
Obj_00062 -1
Obj_00063 -1
Obj_00064 -1
Obj_00065 +1
Obj_00066 -1
Obj_00067 +1
Obj_00068 -1
Obj_00069 +1
Obj_00070 -1
Obj_00071 +1
Obj_00072 +1
Obj_00073 -1
Obj_00074 -1
Obj_00075 +1
Obj_00076 -1
Obj_00077 -1
Obj_00078 +1
Obj_00079 -1
Obj_00080 +1
Obj_00081 +1
Obj_00082 -1
Obj_00083 -1
Obj_00084 +1
Obj_00085 +1
Obj_00086 -1
Obj_00087 +1
Obj_00088 -1
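
Each prediction above is the class assigned by the final decision tree to one object. As an illustration only, the tree from the modelling procedure can be written as nested threshold tests; this is a Python sketch under assumptions, not the software used for the submission. The function names classify and prediction_line and the descriptor values in the example are hypothetical; only the descriptor names, thresholds, and leaf classes are taken from the reported tree.

```python
def classify(descriptors):
    """Apply the reported decision tree to one object.

    `descriptors` maps a descriptor name (e.g. "Desc_02030") to its value.
    Returns the predicted class, -1 or +1.
    """
    if descriptors["Desc_02030"] > 0.8:
        return -1                      # leaf (32/3)
    if descriptors["Desc_05258"] > 1.5:
        return -1                      # leaf (6/0)
    if descriptors["Desc_01704"] > 1.33:
        return -1                      # leaf (4/0)
    if descriptors["Desc_00717"] <= -34.5:
        return -1                      # leaf (3/0)
    return 1                           # leaf (44/3)


def prediction_line(index, label):
    """Format one line in the style of the prediction list, e.g. 'Obj_00005 +1'."""
    return f"Obj_{index:05d} {label:+d}"


# Hypothetical descriptor values, chosen only to exercise the +1 leaf.
example = {
    "Desc_02030": 0.5,
    "Desc_05258": 1.0,
    "Desc_01704": 1.0,
    "Desc_00717": -10.0,
}
print(prediction_line(5, classify(example)))
```

Only four of the 1864 retained descriptors appear in the tree, so any other descriptor values for an object have no effect on its predicted class.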