_PROBLEM CoEPrA-2006_Regression_001
_GROUP_NAME Liao Quan
_GROUP_MEMBERS Liao Quan
_ADDRESS
liaoq@mail.sioc.ac.cn
Lab of Computer Chemistry & Chemoinformatics, Shanghai Institute of Organic Chemistry, CAS, 354 Fenglin Rd., Shanghai, 200032, P. R. China
_MODELING_PROCEDURE
1. Descriptor Selection

No new descriptors were added. The distribution of each of the 5787 descriptors was checked as follows:
a) If the descriptor has a constant value, it is eliminated.
b) If the normalized descriptor (mean=0 and standard deviation=1) has "outliers" (values outside the range (-7,7)), it is eliminated.
Step a) eliminated 7 descriptors and step b) eliminated 106 descriptors. The remaining 5674 descriptors were used to construct the final model.

2. Modeling Procedure

In this study, PLS and SVM were combined to build the final model: PLS gave the weights of the descriptors and SVM gave the regression model.

a) PLS analysis
The PLS method can handle high-dimensional descriptors and gives a linear regression model in which each descriptor j has a coefficient b(j). The NIPALS algorithm was used, implemented in our in-house C++ program. Leave-one-out (LOO) cross-validation was used to select the best number of latent variables. The best q2 was 0.660, obtained with 8 latent variables.

b) Rescaling the descriptors
The PLS model was not used as the final model. Its coefficients b(j) were used only to rescale the raw descriptors as follows:
x'(j)=x(j)*b(j)
where x(j) and x'(j) are the raw and the rescaled descriptor j, respectively.

c) SVM regression
Using the rescaled descriptors, SVM regression was performed with the EPSILON-SVR method in the LIBSVM 2.82 toolbox. Only the RBF kernel was considered. Three parameters (C, epsilon for the epsilon-insensitive loss function, and gamma for the RBF kernel) were optimized by grid search and LOO cross-validation. Other parameters were kept at their default values. The optimal parameters for the final model were C=3325, epsilon=0.13, and gamma=0.026.

3. Validation procedures

Within the individual PLS and SVM steps, LOO cross-validation was used for parameter selection to achieve the highest q2. To check the predictive ability of the final model, another independent LOO cross-validation was performed for the whole procedure. In this procedure, the leave-one-out splitting was made at the very beginning (before the descriptor selection). The q2 for this independent cross-validation was 0.748, which was better than using PLS only or SVM only in this case.

The parameters of the final model (including the regression coefficients for PLS and the support vectors and weights for SVM) are available upon request.
_PREDICTION
Obj_00001 5.37
Obj_00002 6.00
Obj_00003 4.73
Obj_00004 5.18
Obj_00005 7.02
Obj_00006 6.45
Obj_00007 5.58
Obj_00008 3.47
Obj_00009 5.89
Obj_00010 4.85
Obj_00011 6.25
Obj_00012 3.07
Obj_00013 5.75
Obj_00014 4.06
Obj_00015 6.63
Obj_00016 5.84
Obj_00017 4.89
Obj_00018 6.17
Obj_00019 4.19
Obj_00020 3.94
Obj_00021 4.20
Obj_00022 5.17
Obj_00023 4.78
Obj_00024 5.18
Obj_00025 4.15
Obj_00026 6.95
Obj_00027 6.13
Obj_00028 6.15
Obj_00029 5.01
Obj_00030 4.70
Obj_00031 5.46
Obj_00032 5.64
Obj_00033 5.68
Obj_00034 5.88
Obj_00035 6.08
Obj_00036 6.51
Obj_00037 4.41
Obj_00038 5.93
Obj_00039 5.15
Obj_00040 4.03
Obj_00041 6.54
Obj_00042 4.66
Obj_00043 6.97
Obj_00044 4.46
Obj_00045 5.55
Obj_00046 5.44
Obj_00047 3.68
Obj_00048 5.84
Obj_00049 4.48
Obj_00050 4.66
Obj_00051 5.89
Obj_00052 5.72
Obj_00053 6.05
Obj_00054 3.70
Obj_00055 4.86
Obj_00056 6.09
Obj_00057 4.55
Obj_00058 4.21
Obj_00059 7.07
Obj_00060 3.20
Obj_00061 5.91
Obj_00062 5.41
Obj_00063 4.78
Obj_00064 5.77
Obj_00065 5.86
Obj_00066 6.71
Obj_00067 6.17
Obj_00068 5.83
Obj_00069 5.80
Obj_00070 4.96
Obj_00071 3.54
Obj_00072 6.04
Obj_00073 4.83
Obj_00074 5.30
Obj_00075 5.88
Obj_00076 4.38
Obj_00077 6.18
Obj_00078 6.47
Obj_00079 3.83
Obj_00080 6.01
Obj_00081 5.90
Obj_00082 6.98
Obj_00083 4.14
Obj_00084 5.38
Obj_00085 7.04
Obj_00086 6.40
Obj_00087 5.54
Obj_00088 5.03
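
The descriptor filter (step 1) and the PLS-coefficient rescaling (step 2b) of the modeling procedure can be sketched as below. This is a minimal illustration in Python, not the in-house C++ program. The function names, the use of the sample standard deviation for normalization, and the q2 formula (1 - PRESS/SS about the observed mean) are assumptions, since the description does not specify them.

```python
from statistics import mean, stdev


def select_descriptors(columns, z_limit=7.0):
    """Return the indices of descriptor columns that pass both filters.

    columns: list of descriptor columns, each a list of values (one per compound).
    z_limit: assumed cutoff; a normalized value with |z| > z_limit is an "outlier".
    """
    kept = []
    for j, col in enumerate(columns):
        # Step a): drop descriptors with a constant value.
        if max(col) == min(col):
            continue
        # Step b): normalize to mean 0 and standard deviation 1
        # (sample standard deviation is an assumption), then drop outliers.
        m, s = mean(col), stdev(col)
        if any(abs((v - m) / s) > z_limit for v in col):
            continue
        kept.append(j)
    return kept


def rescale(row, coefficients):
    """Apply x'(j) = x(j) * b(j) to one compound's descriptor row."""
    return [x * b for x, b in zip(row, coefficients)]


def q2(observed, predicted):
    """Cross-validated q2, assumed here to be 1 - PRESS / SS about the mean."""
    m = mean(observed)
    press = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    total = sum((y - m) ** 2 for y in observed)
    return 1.0 - press / total
```

In this sketch the rescaled rows would then be passed to an epsilon-SVR with an RBF kernel, and q2 would be evaluated on the leave-one-out predictions.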