_PROBLEM
CoEPrA-2006_Classification_004

_GROUP_NAME
Matt Segall

_GROUP_MEMBERS
Joelle Gola
Olga Obrezanova
Matt Segall

_ADDRESS
Inpharmatica
127 Cambridge Science Park
Milton Road, Cambridge, CB4 0GD, UK
tel. +44(0)1223 706177
e-mail: o.obrezanova@inpharmatica.co.uk

_MODELING_PROCEDURE

1. Calculated one additional descriptor - molecular weight
(for example, see http://www.expasy.org/tools/protparam.html).

2. Descriptor pre-selection.
Descriptors with low standard deviation, low occurrence, or high correlation with another descriptor were filtered out.
Excluded descriptors:
 - with standard deviation < 5.0E-04
 - with occurrence < 0.5%
 - with correlation coefficient >= 0.9 (only one of a pair left).
After filtering, 2302 descriptors remained.

3. Modelling technique.
A regression model was built by the Gaussian Processes technique (see [1] for the details).
Afterwards, a threshold was chosen to classify the predicted values of the training set so as to achieve the highest MCC statistic.
The same threshold was used to classify predicted values on the validation and prediction data sets.

4. Details of the regression model.
The covariance matrix is taken in the form (see [1]):

                          K    (x^(n)_i-x^(m)_i)^2
C_nm = a_1 * EXP( -0.5 * SUM ( ------------------- ) ) + a_2 + a_3 * delta_nm
                         i=1          r_i^2
n,m=1..N

Here N is the number of cases in a data set, K is the number of descriptors,
x^(n) is the vector of descriptor values for observation n (a row in the matrix X of descriptors),
and delta_nm is the Kronecker delta.
Hyperparameters a_1, a_2, a_3, r_i (i=1..K) are found by maximizing the marginal likelihood,
i.e. minimizing its negative logarithm (see [1]).

[1] D. J. C. MacKay. Introduction to Gaussian processes.
    In C. M. Bishop, editor, Neural Networks and Machine Learning,
    volume 168 of NATO ASI Series, pages 133-165. Springer, Berlin, 1998.
    (Can be found online: http://www.inference.phy.cam.ac.uk/mackay/GP/ )

5. Validation.
The calibration set was divided randomly into 2 sets: training and validation sets.
The validation set consists of 15 cases:
validation set=[Obj_00002,Obj_00003,25,31,46,52,55,62,67,73,74,79,93,95,100].
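
As a rough illustration of steps 4 and 5, the covariance function and the Gaussian process prediction built from it can be sketched in Python as below. This is not the code used for the submission. The toy data, the fixed hyperparameter values and the helper names (covariance, solve_spd, gp_fit, gp_predict) are all assumptions made for the example, and the hyperparameter optimisation described in step 4 is omitted.

```python
import math
import random


def covariance(xa, xb, a1, a2, r, noise=0.0):
    """a1 * exp(-0.5 * sum_i (xa_i - xb_i)^2 / r_i^2) + a2, plus `noise`
    (the a3 * delta_nm term, which the caller adds only when n == m)."""
    s = sum((p - q) ** 2 / ri ** 2 for p, q, ri in zip(xa, xb, r))
    return a1 * math.exp(-0.5 * s) + a2 + noise


def solve_spd(C, t):
    """Solve C w = t for a symmetric positive-definite C via Cholesky."""
    n = len(C)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = C[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    # Forward substitution: L y = t
    y = [0.0] * n
    for i in range(n):
        y[i] = (t[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    # Back substitution: L^T w = y
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (y[i] - sum(L[k][i] * w[k] for k in range(i + 1, n))) / L[i][i]
    return w


def gp_fit(X, t, a1, a2, a3, r):
    """Build the N x N training covariance matrix C and return w = C^-1 t."""
    n = len(X)
    C = [[covariance(X[i], X[j], a1, a2, r, a3 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve_spd(C, t)


def gp_predict(x_new, X, w, a1, a2, r):
    """Predictive mean at x_new: sum_n k(x_new, x^(n)) * w_n."""
    return sum(covariance(x_new, xn, a1, a2, r) * wn for xn, wn in zip(X, w))


# Hypothetical toy data: 20 cases, 3 descriptors, classes -1/+1 used as targets.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(20)]
t = [1.0 if x[0] + 0.5 * x[1] > 0 else -1.0 for x in X]

# Random split into training and validation cases, in the spirit of step 5.
idx = list(range(len(X)))
rng.shuffle(idx)
val_idx, train_idx = idx[:5], idx[5:]
X_tr = [X[i] for i in train_idx]
t_tr = [t[i] for i in train_idx]

# Illustrative fixed hyperparameters (assumed values, not optimised).
a1, a2, a3, r = 1.0, 0.1, 0.01, [1.0, 1.0, 1.0]
w = gp_fit(X_tr, t_tr, a1, a2, a3, r)
val_pred = [gp_predict(X[i], X_tr, w, a1, a2, r) for i in val_idx]
```

The values in val_pred are continuous regression outputs. They become class labels only after a threshold is applied, as described in step 3.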
The model was trained on training set and tested on validation set.

6. Final model statistics.
Training set contains 96 cases and 2302 descriptors.
Validation set contains 15 cases.

On training set -
MCC=0.933
number of misclassified=2
misclass. error=2.08%
confusion matrix:
 Class -1   Class 1    <-- predicted
 --------   -------
    77         2     | Class -1
     0        17     | Class 1

On validation set -
MCC=0.78446
number of misclassified=1
misclass. error=6.66%
confusion matrix:
 Class -1   Class 1    <-- predicted
 --------   -------
    12         1     | Class -1
     0         2     | Class 1

_PREDICTION
Obj_00001 -1
Obj_00002 -1
Obj_00003 -1
Obj_00004 -1
Obj_00005 -1
Obj_00006 -1
Obj_00007 -1
Obj_00008 -1
Obj_00009 -1
Obj_00010 -1
Obj_00011 -1
Obj_00012 -1
Obj_00013 -1
Obj_00014 -1
Obj_00015 -1
Obj_00016 +1
Obj_00017 -1
Obj_00018 -1
Obj_00019 -1
Obj_00020 -1
Obj_00021 -1
Obj_00022 -1
Obj_00023 -1
Obj_00024 -1
Obj_00025 -1
Obj_00026 -1
Obj_00027 -1
Obj_00028 -1
Obj_00029 -1
Obj_00030 -1
Obj_00031 +1
Obj_00032 -1
Obj_00033 +1
Obj_00034 -1
Obj_00035 +1
Obj_00036 -1
Obj_00037 -1
Obj_00038 -1
Obj_00039 -1
Obj_00040 -1
Obj_00041 -1
Obj_00042 -1
Obj_00043 -1
Obj_00044 -1
Obj_00045 -1
Obj_00046 +1
Obj_00047 -1
Obj_00048 -1
Obj_00049 -1
Obj_00050 -1
Obj_00051 -1
Obj_00052 -1
Obj_00053 -1
Obj_00054 -1
Obj_00055 +1
Obj_00056 +1
Obj_00057 -1
Obj_00058 -1
Obj_00059 -1
Obj_00060 -1
Obj_00061 -1
Obj_00062 -1
Obj_00063 -1
Obj_00064 +1
Obj_00065 -1
Obj_00066 +1
Obj_00067 -1
Obj_00068 -1
Obj_00069 -1
Obj_00070 -1
Obj_00071 +1
Obj_00072 -1
Obj_00073 -1
Obj_00074 -1
Obj_00075 -1
Obj_00076 -1
Obj_00077 -1
Obj_00078 -1
Obj_00079 -1
Obj_00080 -1
Obj_00081 -1
Obj_00082 -1
Obj_00083 +1
Obj_00084 -1
Obj_00085 -1
Obj_00086 -1
Obj_00087 -1
Obj_00088 -1
Obj_00089 -1
Obj_00090 -1
Obj_00091 +1
Obj_00092 -1
Obj_00093 -1
Obj_00094 -1
Obj_00095 -1
Obj_00096 -1
Obj_00097 -1
Obj_00098 -1
Obj_00099 -1
Obj_00100 -1
Obj_00101 -1
Obj_00102 -1
Obj_00103 +1
Obj_00104 -1
Obj_00105 -1
Obj_00106 -1
Obj_00107 -1
Obj_00108 -1
Obj_00109 -1
Obj_00110 +1
Obj_00111 +1
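
Note on the MCC statistic used in steps 3 and 6: the reported values can be recomputed from the confusion matrices, and the threshold choice can be sketched in the same terms. The Python below is a minimal sketch, not the code used for the submission. The function names (mcc, best_threshold), the choice of midpoints between sorted predicted values as candidate cut-offs, and the five toy scores are assumptions made for the example.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def best_threshold(scores, labels):
    """Return the cut-off on predicted values that maximises MCC.

    Labels are -1/+1; a case is assigned to Class 1 when its score exceeds
    the cut-off. Assumes at least two distinct score values.
    """
    ordered = sorted(set(scores))
    candidates = [(lo + hi) / 2 for lo, hi in zip(ordered, ordered[1:])]

    def mcc_at(thr):
        pred = [1 if s > thr else -1 for s in scores]
        tp = sum(p == 1 and y == 1 for p, y in zip(pred, labels))
        tn = sum(p == -1 and y == -1 for p, y in zip(pred, labels))
        fp = sum(p == 1 and y == -1 for p, y in zip(pred, labels))
        fn = sum(p == -1 and y == 1 for p, y in zip(pred, labels))
        return mcc(tp, tn, fp, fn)

    return max(candidates, key=mcc_at)


# Counts read off the confusion matrices in step 6 (Class 1 taken as positive).
mcc_train = mcc(tp=17, tn=77, fp=2, fn=0)   # about 0.9339, reported as 0.933
mcc_valid = mcc(tp=2, tn=12, fp=1, fn=0)    # about 0.78446

# Hypothetical predicted values and true classes for the threshold search.
toy_scores = [-0.9, -0.4, 0.1, 0.3, 0.8]
toy_labels = [-1, -1, -1, 1, 1]
toy_threshold = best_threshold(toy_scores, toy_labels)
```

In the toy case the selected cut-off falls between 0.1 and 0.3, where the two classes separate perfectly.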