_PROBLEM
CoEPrA-2006_Regression_001

_GROUP_NAME
Matt Segall

_GROUP_MEMBERS
Joelle Gola
Olga Obrezanova
Matt Segall

_ADDRESS
Inpharmatica
127 Cambridge Science Park
Milton Road, Cambridge, CB4 0GD, UK
tel. +44(0)1223 706177
e-mail: o.obrezanova@inpharmatica.co.uk

_MODELING_PROCEDURE
1. Calculated one additional descriptor: molecular weight.
(For example, see http://www.expasy.org/tools/protparam.html)

2. Descriptor pre-selection.
Descriptors with low standard deviation or low occurrence, and highly correlated descriptors, were filtered out.
Excluded descriptors
- with standard deviation < 5.0E-04
- with occurrence < 0.5%
- with correlation coefficient >= 0.9 (only one of each pair kept).
After filtering, 1864 descriptors remained.

3. Modelling technique.
Gaussian Processes technique. (See [1] for the details.)
The covariance matrix is taken in the form (see [1]):

C_{nm} = a_1 \exp\left( -\frac{1}{2} \sum_{i=1}^{K} \frac{(x^{(n)}_i - x^{(m)}_i)^2}{r_i^2} \right) + a_2 + a_3 \delta_{nm}, \quad n,m = 1..N

Here N is the number of cases in the data set, K is the number of descriptors, and x^(n) is the vector of descriptor values for observation n (a row in the descriptor matrix X).
The hyperparameters a_1, a_2, a_3, r_i (i=1..K) are found by minimizing the negative log marginal likelihood (see [1]).

[1] D. J. C. MacKay. Introduction to Gaussian processes. In C. M. Bishop, editor, Neural Networks and Machine Learning, volume 168 of NATO ASI Series, pages 133-165. Springer, Berlin, 1998.
(Available online: http://www.inference.phy.cam.ac.uk/mackay/GP/)

4. Descriptor selection procedure.
Keeping the hyperparameters a_1, a_2, a_3, r_i fixed, a forward variable selection procedure is performed. Descriptors are added for as long as each addition improves the model. The negative log marginal likelihood is used as the indicator of model performance, and the procedure seeks its minimum.

5. Validation.
The calibration set was divided into two sets: a training set and a validation set.
Every 5th case of the calibration set was assigned to the validation set, that is,
validation set=[Obj_00005,Obj_00010,...,Obj_00085].
The model was trained on the training set and tested on the validation set.

6. Final model statistics.
The training set contains 72 cases and 1864 descriptors.
The validation set contains 17 cases.
The descriptor selection procedure chose 137 descriptors for the model.
Statistics
on training set - RMSE_tr= 0.37972 Rsqr_tr= 0.86521
on validation set - RMSE= 0.41581 Rsqr= 0.75404

_PREDICTION
Obj_00001 5.564
Obj_00002 5.922
Obj_00003 4.802
Obj_00004 5.105
Obj_00005 6.678
Obj_00006 6.199
Obj_00007 5.281
Obj_00008 3.420
Obj_00009 5.567
Obj_00010 4.646
Obj_00011 6.265
Obj_00012 3.230
Obj_00013 5.815
Obj_00014 4.101
Obj_00015 6.391
Obj_00016 5.865
Obj_00017 4.949
Obj_00018 5.888
Obj_00019 3.941
Obj_00020 3.967
Obj_00021 4.352
Obj_00022 4.670
Obj_00023 4.933
Obj_00024 5.118
Obj_00025 4.415
Obj_00026 6.483
Obj_00027 5.934
Obj_00028 6.136
Obj_00029 5.436
Obj_00030 4.800
Obj_00031 5.640
Obj_00032 5.856
Obj_00033 5.938
Obj_00034 5.776
Obj_00035 5.764
Obj_00036 6.185
Obj_00037 4.437
Obj_00038 5.686
Obj_00039 5.481
Obj_00040 4.531
Obj_00041 6.132
Obj_00042 4.709
Obj_00043 6.675
Obj_00044 4.740
Obj_00045 5.901
Obj_00046 5.366
Obj_00047 4.579
Obj_00048 5.794
Obj_00049 5.085
Obj_00050 5.070
Obj_00051 5.774
Obj_00052 6.079
Obj_00053 5.368
Obj_00054 4.168
Obj_00055 5.038
Obj_00056 5.892
Obj_00057 4.844
Obj_00058 4.617
Obj_00059 6.608
Obj_00060 3.617
Obj_00061 5.757
Obj_00062 5.242
Obj_00063 5.509
Obj_00064 5.666
Obj_00065 5.867
Obj_00066 6.326
Obj_00067 6.076
Obj_00068 5.777
Obj_00069 5.679
Obj_00070 4.824
Obj_00071 3.620
Obj_00072 6.523
Obj_00073 5.526
Obj_00074 5.353
Obj_00075 5.620
Obj_00076 4.307
Obj_00077 5.948
Obj_00078 6.163
Obj_00079 3.923
Obj_00080 5.981
Obj_00081 5.616
Obj_00082 6.353
Obj_00083 4.671
Obj_00084 5.239
Obj_00085 6.826
Obj_00086 6.595
Obj_00087 5.330
Obj_00088 5.089
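
Note on step 3 of the modelling procedure: as an illustration only, the covariance matrix defined there can be written out as a short pure-Python sketch. The function name gp_covariance, the toy descriptor matrix and the hyperparameter values below are hypothetical and are not taken from the submitted model, whose hyperparameters were fitted as described in step 3.

```python
import math


def gp_covariance(X, a1, a2, a3, r):
    """Build the N x N covariance matrix C_nm of step 3.

    X  : list of N rows, each holding K descriptor values
    a1 : amplitude of the squared-exponential term
    a2 : constant offset added to every entry
    a3 : term multiplied by delta_nm (diagonal only)
    r  : list of K length scales, one per descriptor
    """
    n_cases = len(X)
    C = [[0.0] * n_cases for _ in range(n_cases)]
    for n in range(n_cases):
        for m in range(n_cases):
            # Sum over descriptors of the squared difference scaled by r_i^2.
            d2 = sum(((xn - xm) / ri) ** 2
                     for xn, xm, ri in zip(X[n], X[m], r))
            C[n][m] = a1 * math.exp(-0.5 * d2) + a2
            if n == m:
                # delta_nm is 1 only when n and m index the same case.
                C[n][m] += a3
    return C


# Hypothetical toy data: 3 cases, 2 descriptors (not from the CoEPrA set).
X_toy = [[0.0, 1.0], [0.0, 1.0], [2.0, 1.0]]
C_toy = gp_covariance(X_toy, a1=1.0, a2=0.1, a3=0.01, r=[1.0, 2.0])
```

Cases 1 and 2 of the toy data have identical descriptors, so their off-diagonal entry equals a_1 + a_2, while each diagonal entry additionally carries a_3. This shows that the delta_nm term depends on the case index and not on the descriptor values.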