_PROBLEM
CoEPrA-2006_Regression_002

_GROUP_NAME
Matt Segall

_GROUP_MEMBERS
Joelle Gola
Olga Obrezanova
Matt Segall

_ADDRESS
Inpharmatica
127 Cambridge Science Park
Milton Road, Cambridge, CB4 0GD, UK
tel. +44(0)1223 706177
e-mail: o.obrezanova@inpharmatica.co.uk

_MODELING_PROCEDURE
1. Additional descriptor.
One additional descriptor was calculated: molecular weight (for example, see http://www.expasy.org/tools/protparam.html).

2. Descriptor pre-selection.
Descriptors with low standard deviation, low occurrence, or high correlation with another descriptor were filtered out. Excluded descriptors:
- standard deviation < 5.0E-04
- occurrence < 0.5%
- correlation coefficient >= 0.9 (only one descriptor of each pair was kept)
After filtering, 1289 descriptors remained.

3. Modelling technique.
Gaussian Processes technique (see [1] for the details). The covariance matrix is taken in the form (see [1]):

    C_{nm} = a_1 \exp\left( -\frac{1}{2} \sum_{i=1}^{K} \frac{(x_i^{(n)} - x_i^{(m)})^2}{r_i^2} \right) + a_2 + a_3 \delta_{nm},    n, m = 1, ..., N

Here N is the number of cases in the data set, K is the number of descriptors, x^{(n)} is the vector of descriptor values for observation n (a row of the descriptor matrix X), and \delta_{nm} is the Kronecker delta.
The hyperparameters a_1, a_2, a_3 and r_i (i = 1, ..., K) are found by maximizing the marginal likelihood, or equivalently by minimizing its negative logarithm (see [1]).

[1] D. J. C. MacKay. Introduction to Gaussian processes. In C. M. Bishop, editor, Neural Networks and Machine Learning, volume 168 of NATO ASI Series, pages 133-165. Springer, Berlin, 1998. (Can be found online: http://www.inference.phy.cam.ac.uk/mackay/GP/)

4. Descriptor selection procedure.
After the hyperparameters have been optimized, the lengthscale r_i of each descriptor is compared to a constant proportional to the standard deviation of that descriptor in the training set. Descriptors whose lengthscale r_i is much larger than this constant have almost no influence on the covariance and are eliminated from the model.

5. Validation.
The calibration set was divided into two sets: a training set and a validation set.
Every fifth case of the calibration set was assigned to the validation set, that is, validation set = [Obj_00003, Obj_00008, ..., Obj_00073]. After the split, descriptors with low standard deviation or low occurrence in the training set were removed from both sets (1287 descriptors remained). The model was trained on the training set and tested on the validation set.

6. Final model statistics.
The training set contains 61 cases and 1287 descriptors. The validation set contains 15 cases.
The descriptor selection procedure chose 891 descriptors for the model.
Statistics on the training set:   RMSE_tr = 0.12550   Rsqr_tr = 0.97287
Statistics on the validation set: RMSE = 0.46003   Rsqr = 0.64368
Values of the hyperparameters and names of the descriptors used in the model are given in the file Regression002_Model_Segall.txt.

_PREDICTION
Obj_00001 7.770
Obj_00002 6.876
Obj_00003 7.823
Obj_00004 7.871
Obj_00005 7.855
Obj_00006 8.021
Obj_00007 8.052
Obj_00008 7.865
Obj_00009 6.850
Obj_00010 4.346
Obj_00011 8.027
Obj_00012 5.312
Obj_00013 7.564
Obj_00014 7.316
Obj_00015 7.112
Obj_00016 7.860
Obj_00017 7.886
Obj_00018 6.768
Obj_00019 6.983
Obj_00020 7.574
Obj_00021 7.029
Obj_00022 7.635
Obj_00023 7.827
Obj_00024 5.904
Obj_00025 6.917
Obj_00026 7.779
Obj_00027 7.794
Obj_00028 7.788
Obj_00029 8.069
Obj_00030 7.858
Obj_00031 7.685
Obj_00032 7.847
Obj_00033 7.642
Obj_00034 7.596
Obj_00035 6.944
Obj_00036 7.820
Obj_00037 7.821
Obj_00038 7.517
Obj_00039 7.859
Obj_00040 4.470
Obj_00041 3.080
Obj_00042 7.727
Obj_00043 7.931
Obj_00044 7.836
Obj_00045 7.892
Obj_00046 8.127
Obj_00047 6.480
Obj_00048 7.850
Obj_00049 8.143
Obj_00050 4.471
Obj_00051 7.894
Obj_00052 8.060
Obj_00053 7.416
Obj_00054 8.057
Obj_00055 7.684
Obj_00056 6.260
Obj_00057 7.865
Obj_00058 6.776
Obj_00059 7.619
Obj_00060 7.849
Obj_00061 7.778
Obj_00062 8.092
Obj_00063 8.095
Obj_00064 7.819
Obj_00065 6.595
Obj_00066 7.871
Obj_00067 7.858
Obj_00068 6.875
Obj_00069 7.765
Obj_00070 7.702
Obj_00071 7.989
Obj_00072 7.842
Obj_00073 7.903
Obj_00074 7.755
Obj_00075 7.886
Obj_00076 8.016
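
Illustrative sketch (not part of the submitted model). The covariance function of step 3 and the every-fifth-case split of step 5 of the modelling procedure can be written out in a few lines of plain Python. This is only a minimal sketch under stated assumptions: the function names covariance_matrix and split_every_fifth, the toy descriptor matrix X and all hyperparameter values are hypothetical and chosen for illustration; the fitted values are those in Regression002_Model_Segall.txt, and the hyperparameter optimization itself is not shown.

```python
import math


def covariance_matrix(X, a1, a2, a3, r):
    """Covariance of step 3:
    C[n][m] = a1 * exp(-0.5 * sum_i ((x_i^(n) - x_i^(m)) / r_i)^2) + a2 + a3 * delta_nm
    X is a list of N rows with K descriptor values each; r holds K lengthscales.
    """
    N = len(X)
    C = [[0.0] * N for _ in range(N)]
    for n in range(N):
        for m in range(N):
            # Squared distance between cases n and m, scaled per descriptor.
            d2 = sum(((xn - xm) / ri) ** 2 for xn, xm, ri in zip(X[n], X[m], r))
            noise = a3 if n == m else 0.0  # Kronecker delta term
            C[n][m] = a1 * math.exp(-0.5 * d2) + a2 + noise
    return C


def split_every_fifth(n_cases=76, first=3, step=5):
    """Split of step 5: cases first, first+step, ... go to the validation set."""
    ids = [f"Obj_{i:05d}" for i in range(1, n_cases + 1)]
    validation = ids[first - 1::step]
    held_out = set(validation)
    training = [case for case in ids if case not in held_out]
    return training, validation


# Toy example: 3 cases, 2 descriptors, made-up hyperparameters (assumptions).
X = [[0.0, 1.0], [1.0, 1.0], [0.0, 3.0]]
C = covariance_matrix(X, a1=1.0, a2=0.1, a3=0.01, r=[1.0, 2.0])

training, validation = split_every_fifth()
print(len(training), len(validation), validation[0], validation[-1])
# prints: 61 15 Obj_00003 Obj_00073
```

The split reproduces the set sizes reported in step 6 (61 training cases, 15 validation cases). In the covariance, a large lengthscale r_i makes the contribution of descriptor i to the scaled distance negligible, which is the property that the descriptor selection of step 4 relies on.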