_PROBLEM CoEPrA-2006_Regression_003_Dataset_2
_GROUP_NAME Matt Segall
_GROUP_MEMBERS
Joelle Gola
Olga Obrezanova
Matt Segall
_ADDRESS
Inpharmatica
127 Cambridge Science Park
Milton Road, Cambridge, CB4 0GD, UK
tel. +44(0)1223 706177
e-mail: o.obrezanova@inpharmatica.co.uk
_MODELING_PROCEDURE
1. Calculated one additional descriptor: molecular weight
   (for example, see http://www.expasy.org/tools/protparam.html).

2. Descriptor pre-selection.
   Descriptors with low standard deviation, low occurrence, or high mutual
   correlation were filtered out. Excluded descriptors:
   - with standard deviation < 5.0E-04
   - with occurrence < 0.5%
   - with correlation coefficient >= 0.9 (only one descriptor of each pair was kept).
   After filtering, 2044 descriptors remained.

3. Modelling technique.
   Gaussian Processes technique (see [1] for details).
   The covariance matrix is taken in the form (see [1]):

   C_{nm} = a_1 \exp\left( -\frac{1}{2} \sum_{i=1}^{K} \frac{(x^{(n)}_i - x^{(m)}_i)^2}{r_i^2} \right) + a_2 + a_3 \delta_{nm}, \qquad n, m = 1, \dots, N

   Here N is the number of cases in the data set, K is the number of descriptors,
   and x^(n) is the vector of descriptor values for observation n
   (a row of the descriptor matrix X).
   The hyperparameters a_1, a_2, a_3, r_i (i=1..K) are found by maximizing the
   marginal likelihood, i.e. minimizing its negative logarithm (see [1]).

   [1] D. J. C. MacKay. Introduction to Gaussian processes. In C. M. Bishop, editor,
       Neural Networks and Machine Learning, volume 168 of NATO ASI Series,
       pages 133-165. Springer, Berlin, 1998.
       (Can be found online: http://www.inference.phy.cam.ac.uk/mackay/GP/)

4. Descriptor selection procedure.
   After optimization of the hyperparameters, the lengthscale r_i of each descriptor
   is compared to a constant proportional to the standard deviation of that
   descriptor in the training set. Descriptors whose lengthscales r_i are much
   larger than this constant are eliminated from the model.

5. Validation.
   The model was built on the whole calibration set.
   Cross-validation was performed using 7 groups.

6. Final model statistics.
   Training set contains 133 cases and 2044 descriptors.
   Descriptor selection procedure chose 730 descriptors for the model.
   Statistics on
   training set     - RMSE_tr=0.27600  Rsqr_tr=0.88458
   cross-validation - RMSE_cv=0.50488  Rsqr_cv=0.61375
_PREDICTION
Obj_00001 6.946
Obj_00002 7.280
Obj_00003 6.841
Obj_00004 7.252
Obj_00005 7.486
Obj_00006 8.048
Obj_00007 6.715
Obj_00008 7.016
Obj_00009 7.773
Obj_00010 8.364
Obj_00011 6.346
Obj_00012 6.892
Obj_00013 5.415
Obj_00014 7.353
Obj_00015 6.467
Obj_00016 7.148
Obj_00017 5.614
Obj_00018 6.292
Obj_00019 7.501
Obj_00020 7.410
Obj_00021 6.903
Obj_00022 6.502
Obj_00023 6.502
Obj_00024 7.027
Obj_00025 5.801
Obj_00026 6.118
Obj_00027 7.886
Obj_00028 6.396
Obj_00029 6.502
Obj_00030 7.182
Obj_00031 6.657
Obj_00032 7.615
Obj_00033 7.099
Obj_00034 7.914
Obj_00035 6.558
Obj_00036 6.978
Obj_00037 7.653
Obj_00038 7.254
Obj_00039 7.928
Obj_00040 4.624
Obj_00041 4.851
Obj_00042 7.839
Obj_00043 7.942
Obj_00044 6.936
Obj_00045 7.240
Obj_00046 6.719
Obj_00047 7.230
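
Illustrative note (not part of the submitted model). The covariance matrix described in step 3 of the modelling procedure could be computed along the following lines in Python with NumPy. This is a minimal sketch under stated assumptions: the function name gp_covariance, its argument names, and the toy inputs are hypothetical and do not come from the submission, and the hyperparameter optimization described in [1] is not shown.

```python
import numpy as np


def gp_covariance(X, a1, a2, a3, r):
    """Sketch of C_nm = a1 * exp(-0.5 * sum_i (x_i^(n) - x_i^(m))^2 / r_i^2)
    + a2 + a3 * delta_nm for an (N, K) descriptor matrix X.

    The names X, a1, a2, a3, r are assumptions chosen to mirror the formula.
    """
    X = np.asarray(X, dtype=float)   # shape (N, K): N cases, K descriptors
    r = np.asarray(r, dtype=float)   # shape (K,): one lengthscale per descriptor

    # Scale each descriptor by its lengthscale, so that the squared Euclidean
    # distance between scaled rows equals sum_i (x_i^(n) - x_i^(m))^2 / r_i^2.
    Z = X / r
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T)
    d2 = np.maximum(d2, 0.0)         # guard against tiny negative round-off

    n_cases = X.shape[0]
    return a1 * np.exp(-0.5 * d2) + a2 + a3 * np.eye(n_cases)


# Toy usage with made-up numbers (3 cases, 2 descriptors).
X_demo = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, -1.0]])
r_demo = np.array([1.0, 2.0])
C_demo = gp_covariance(X_demo, a1=1.5, a2=0.2, a3=0.1, r=r_demo)
```

In this form a very large r_i makes descriptor i contribute almost nothing to the exponent, which is consistent with step 4 eliminating descriptors whose lengthscales are much larger than the reference constant.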