_PROBLEM CoEPrA-2006_Regression_003

_GROUP_NAME Matt Segall

_GROUP_MEMBERS
Joelle Gola
Olga Obrezanova
Matt Segall

_ADDRESS
Inpharmatica
127 Cambridge Science Park
Milton Road, Cambridge, CB4 0GD, UK
tel. +44(0)1223 706177
e-mail: o.obrezanova@inpharmatica.co.uk

_MODELING_PROCEDURE

1. Additional descriptor.
   One additional descriptor was calculated: molecular weight
   (for example, see http://www.expasy.org/tools/protparam.html).

2. Descriptor pre-selection.
   Descriptors with low standard deviation, low occurrence, or high
   correlation with another descriptor were filtered out.
   Excluded descriptors:
   - with standard deviation < 5.0E-04
   - with occurrence < 0.5%
   - with correlation coefficient >= 0.9 (only one descriptor of each
     such pair was kept).
   After filtering, 2044 descriptors remained.

3. Modelling technique.
   Gaussian Processes technique (see [1] for details).
   The covariance matrix is taken in the form (see [1]):

   C_{nm} = a_1 \exp\left( -\frac{1}{2} \sum_{i=1}^{K} \frac{(x_i^{(n)} - x_i^{(m)})^2}{r_i^2} \right) + a_2 + a_3 \delta_{nm}, \qquad n, m = 1, \dots, N

   Here N is the number of cases in the data set, K is the number of
   descriptors, x^{(n)} is the vector of descriptor values for
   observation n (a row in the descriptor matrix X), and \delta_{nm} is
   the Kronecker delta.
   The hyperparameters a_1, a_2, a_3, r_i (i = 1, ..., K) are found by
   maximizing the marginal likelihood (equivalently, minimizing its
   negative logarithm; see [1]).

   [1] D. J. C. MacKay. Introduction to Gaussian processes.
       In C. M. Bishop, editor, Neural Networks and Machine Learning,
       volume 168 of NATO ASI Series, pages 133-165. Springer, Berlin, 1998.
       (Available online: http://www.inference.phy.cam.ac.uk/mackay/GP/)

4. Descriptor selection procedure.
   After optimization of the hyperparameters, the lengthscale r_i of each
   descriptor is compared with a constant proportional to the standard
   deviation of that descriptor in the training set. Descriptors whose
   lengthscales r_i are much larger than this constant are eliminated
   from the model.

5. Validation.
   The model was built on the whole calibration set.
   Cross-validation was performed using 7 groups.

6. Final model statistics.
   The training set contains 133 cases and 2044 descriptors.
   The descriptor selection procedure chose 730 descriptors for the model.
   Statistics on
   training set     - RMSE_tr=0.27600  Rsqr_tr=0.88458
   cross-validation - RMSE_cv=0.50488  Rsqr_cv=0.61375

   The values of the hyperparameters and the names of the descriptors
   used in the model are given in the file Regression003_Model_Segall.txt.

_PREDICTION
Obj_00001 7.785
Obj_00002 7.053
Obj_00003 7.181
Obj_00004 6.736
Obj_00005 7.195
Obj_00006 7.075
Obj_00007 7.329
Obj_00008 7.186
Obj_00009 7.927
Obj_00010 6.545
Obj_00011 6.628
Obj_00012 7.935
Obj_00013 7.933
Obj_00014 6.945
Obj_00015 6.321
Obj_00016 6.746
Obj_00017 7.096
Obj_00018 7.147
Obj_00019 7.651
Obj_00020 8.005
Obj_00021 6.791
Obj_00022 7.215
Obj_00023 6.950
Obj_00024 7.457
Obj_00025 7.547
Obj_00026 7.578
Obj_00027 6.660
Obj_00028 6.242
Obj_00029 7.008
Obj_00030 6.314
Obj_00031 8.189
Obj_00032 7.964
Obj_00033 7.187
Obj_00034 5.939
Obj_00035 5.641
Obj_00036 5.967
Obj_00037 6.456
Obj_00038 8.064
Obj_00039 6.707
Obj_00040 6.443
Obj_00041 8.261
Obj_00042 7.635
Obj_00043 7.142
Obj_00044 6.211
Obj_00045 7.762
Obj_00046 6.484
Obj_00047 6.389
Obj_00048 7.215
Obj_00049 7.782
Obj_00050 6.672
Obj_00051 5.078
Obj_00052 6.615
Obj_00053 6.634
Obj_00054 6.364
Obj_00055 6.650
Obj_00056 6.108
Obj_00057 7.007
Obj_00058 6.948
Obj_00059 7.630
Obj_00060 7.322
Obj_00061 6.489
Obj_00062 8.284
Obj_00063 6.528
Obj_00064 5.547
Obj_00065 7.819
Obj_00066 6.551
Obj_00067 5.825
Obj_00068 7.079
Obj_00069 6.487
Obj_00070 7.196
Obj_00071 8.233
Obj_00072 6.415
Obj_00073 7.964
Obj_00074 7.295
Obj_00075 7.433
Obj_00076 6.925
Obj_00077 7.151
Obj_00078 7.232
Obj_00079 7.406
Obj_00080 7.447
Obj_00081 7.922
Obj_00082 7.430
Obj_00083 6.903
Obj_00084 7.543
Obj_00085 7.083
Obj_00086 6.469
Obj_00087 7.398
Obj_00088 6.483
Obj_00089 6.613
Obj_00090 5.897
Obj_00091 7.600
Obj_00092 7.015
Obj_00093 7.451
Obj_00094 7.230
Obj_00095 5.881
Obj_00096 6.650
Obj_00097 7.429
Obj_00098 6.363
Obj_00099 7.436
Obj_00100 7.179
Obj_00101 7.715
Obj_00102 7.272
Obj_00103 6.690
Obj_00104 8.565
Obj_00105 6.814
Obj_00106 6.453
Obj_00107 6.991
Obj_00108 7.618
Obj_00109 7.927
Obj_00110 6.591
Obj_00111 7.422
Obj_00112 7.129
Obj_00113 7.896
Obj_00114 7.265
Obj_00115 7.414
Obj_00116 7.120
Obj_00117 6.542
Obj_00118 7.086
Obj_00119 7.699
Obj_00120 5.862
Obj_00121 6.914
Obj_00122 6.585
Obj_00123 7.074
Obj_00124 7.791
Obj_00125 6.829
Obj_00126 6.779
Obj_00127 7.923
Obj_00128 7.109
Obj_00129 6.480
Obj_00130 6.744
Obj_00131 7.775
Obj_00132 7.165
Obj_00133 7.211
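
Note on the covariance function (step 3 of the modelling procedure).
The covariance matrix described in step 3 can be sketched in a few lines of NumPy. This is only an illustration of the formula, not the code used to build the submitted model: the function name gp_covariance, the argument names, and the toy values in the usage note are assumptions introduced here.

```python
import numpy as np


def gp_covariance(X, a1, a2, a3, r):
    """Return the N x N covariance matrix C for an N x K descriptor matrix X.

    C_nm = a1 * exp(-0.5 * sum_i (x_i^(n) - x_i^(m))^2 / r_i^2)
           + a2 + a3 * delta_nm

    a1, a2, a3 are scalar hyperparameters; r holds one lengthscale per
    descriptor (length K). All names are illustrative assumptions.
    """
    X = np.asarray(X, dtype=float)
    r = np.asarray(r, dtype=float)

    # Scale every descriptor column by its own lengthscale r_i.
    Z = X / r

    # Pairwise differences between all rows, shape (N, N, K).
    diff = Z[:, None, :] - Z[None, :, :]

    # Sum of squared scaled differences over the K descriptors, shape (N, N).
    sq_dist = np.sum(diff ** 2, axis=-1)

    # Squared-exponential term, constant offset a2, and diagonal noise term a3.
    return a1 * np.exp(-0.5 * sq_dist) + a2 + a3 * np.eye(X.shape[0])
```

For example, gp_covariance(X, a1=2.0, a2=0.5, a3=0.1, r=[1.0, 2.0]) with a 3 x 2 matrix X returns a symmetric 3 x 3 matrix whose diagonal entries equal a1 + a2 + a3. A descriptor with a very large lengthscale r_i contributes almost nothing to the sum in the exponent, which is the reason the selection procedure in step 4 removes such descriptors.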