_PROBLEM CoEPrA-2006_Regression_001
_GROUP_NAME Hendrik Blockeel
_GROUP_MEMBERS
Joaquin Vanschoren
Leander Schietgat
Elisa Fromont
Jan Ramon
Hendrik Blockeel
Kurt De Grave
Luc Dehaspe
Walter Luyten
_ADDRESS
Dept. of Computer Science, Katholieke Universiteit Leuven
Celestijnenlaan 200A, 3001 Leuven, Belgium
_MODELING_PROCEDURE
1) construction of different views on the data

As data representation is known to have an important impact on learning results, we constructed different views on the same data set.

View 1 is the raw data, with 643 descriptors for each amino acid.

View 2 contains information on "frequent subsequences at specific positions" in the peptide sequences. Patterns of length 1, 2 and 3 with a minimum frequency of 5% (i.e., occurring in at least 5% of all examples) were mined using the Warmr algorithm [Dehaspe99]. An example of such a pattern is (7,V,T), which means that VT occurs at positions 7 and 8 in the peptide sequence. These frequent patterns were mined in the calibration and prediction sets together. Then every example was checked against all the frequent patterns: a 1 was written if the pattern occurred in the example, a 0 otherwise.
After consulting a domain expert, we decided to make a variant of this view (called ``view2_inter'') that looks for alternating patterns in the peptide sequence. Apparently the interactions between alternating amino acids are sometimes more important, since their side-chains point in the same direction. An example of such a pattern is (4,P,P,T), which means that a P occurs at position 4, a second P at position 6, and a T at position 8. The corresponding view was constructed in the same way. The two variants of View 2 were also combined into a single view (called ``view2_combined'').

View 3 (called ``view3_names'' in our results) focused on different properties of amino acids.
Instead of using the given (unknown) descriptors, we used some of the most popular amino acid properties (i.e. size, charge, polarity, hydrophobicity) and built, for each of the 9 amino acids of a peptide, a row with all of its properties.

View 4 is similar to View 2 but contains information on the occurrence of subsequences independent of their position. For a pattern (P,F), for example, it does not matter where the pattern occurs in the sequence. The same strategy was applied to alternating patterns, and a combined view was also constructed (these views are ``view4'', ``view4_inter'' and ``view4_combined'').

These views contain complementary information. For instance, a condition such as "the subsequence PPT occurs somewhere in the peptide" would be very complex to express using any view but View 4, whereas "the 5th amino acid has a positive charge" can be expressed using View 3 but not using View 4, etc. We therefore constructed some combinations. A first combination (called ``cl1_view2_view1_combined'') merges the 5787 descriptors of the calibration data with the frequent patterns of view2_combined. A second combination (called ``re1_view2_view3_combined'') merges the information of view3 and view2_combined.

2) construction of the predictive model

The following procedure was followed:
1. the different views on the data were constructed, for the calibration set as well as the test set.
2. 62 different algorithms implemented in the WEKA data mining tool were trained with their default parameters, using 10-fold cross-validation, on the different views.
3. for each classification and regression problem, the best 15 couples (view, algorithm) were chosen according to their root mean squared error on the training set.
The couples retained for the first regression problem are:
   (re1_view2_combined_and_view3, AdditiveRegression) rmse = 61.45643
   (re1_view2_combined_and_view3, SMOreg) rmse = 69.4302
   (re1_view3_names, AdditiveRegression) rmse = 71.01758
   (re1_view1_view2_combined, M5P) rmse = 74.84666
   (re1_view2, AdditiveRegression) rmse = 74.88904
   (re1_view2_inter, AdditiveRegression) rmse = 76.15944
   (re1_view2, LinearRegression) rmse = 77.061966
   (re1_view1, M5P) rmse = 77.475685
   (re1_view2, RegressionByDiscretization) rmse = 77.731064
   (re1_view1, LWL) rmse = 79.24143
   (re1_view2_combined_and_view3, LWL) rmse = 80.4978
   (re1_view2_inter, RegressionByDiscretization) rmse = 80.922356
   (re1_view2_combined_and_view3, RegressionByDiscretization) rmse = 80.999985
   (re1_view2_combined_and_view3, IBk) rmse = 81.3017
4. for each problem, the 15 algorithms were trained and tested with 10-fold cross-validation on the training set to compute a meta view on the training data that contains the predictions of each model as attributes. In other words, the meta view contains one line for each example and one column for each model, and lists for each example the predictions that the 15 models made for that example. The true target value was added as a 16th column.
5. we then trained all WEKA algorithms on this meta view, and the best one (tested with 10-fold cross-validation) was selected. This was KStar (rmse = 0.5550731). A model was then learned with KStar (again with default parameter settings) from the meta view; this amounts to learning how to combine the predictions of the multiple models into a single prediction. This procedure is known as stacking [Wolpert92].
6. the 15 models were applied to the test set, yielding 15 predictions for each test example. The model learned in step 5 was then applied to these predictions to get the final prediction for the example.
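
The binary pattern views from part 1 (View 2, its alternating variant, and the position-independent View 4) can be illustrated with a small Python sketch. It is not the Warmr run described above: candidate patterns are simply enumerated by brute force, and the function names `patterns`, `frequent_patterns` and `binary_view` as well as the toy peptides are hypothetical.

```python
from collections import Counter

def patterns(seq, max_len=3, step=1, positional=True):
    """Collect the patterns of length 1..max_len that occur in one peptide.

    step=1 gives contiguous patterns, step=2 gives alternating ones.
    positional=True prefixes the 1-based start position (View 2 style);
    positional=False drops it (View 4 style).
    """
    found = set()
    for length in range(1, max_len + 1):
        span = (length - 1) * step
        for start in range(len(seq) - span):
            residues = tuple(seq[start + k * step] for k in range(length))
            found.add((start + 1,) + residues if positional else residues)
    return found

def frequent_patterns(seqs, min_freq=0.05, **kw):
    """Keep the patterns that occur in at least min_freq of all peptides."""
    counts = Counter()
    for s in seqs:
        counts.update(patterns(s, **kw))
    threshold = min_freq * len(seqs)
    return sorted(p for p, c in counts.items() if c >= threshold)

def binary_view(seqs, pats, **kw):
    """One row per peptide, one 0/1 column per frequent pattern."""
    rows = []
    for s in seqs:
        present = patterns(s, **kw)
        rows.append([1 if p in present else 0 for p in pats])
    return rows

# Toy peptides (made up for illustration only).
peptides = ["AAAAAAVTA", "CCCCCCVTC"]
pats = frequent_patterns(peptides, min_freq=1.0)
print(pats)
print(binary_view(peptides, pats))
```

Passing `step=2` to all three functions yields the alternating variant, and `positional=False` the position-independent one; a combined view would concatenate the resulting columns.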
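
Steps 4 to 6 (stacking) can be sketched in the same spirit. The code below is a toy under several assumptions, not the WEKA setup: three simple hand-written learners stand in for the 15 retained models, a 3-nearest-neighbour combiner stands in for KStar, folds are assigned by index rather than at random, the target is kept in a separate list instead of a 16th column, and the base models are refit on the full training set before being applied to new examples. All function names and the toy data are hypothetical.

```python
def folds_by_index(n, k):
    """Deterministic k-fold split: example i goes to fold i % k."""
    return [list(range(f, n, k)) for f in range(k)]

def fit_mean(X, y):
    """Baseline learner: always predicts the training mean."""
    m = sum(y) / len(y)
    return lambda x: m

def fit_linear(X, y):
    """Least-squares line on the first feature only."""
    xs = [row[0] for row in X]
    mx, my = sum(xs) / len(xs), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, y)) / sxx if sxx else 0.0
    return lambda x: my + slope * (x[0] - mx)

def fit_knn(k):
    """Return a fit function for a k-nearest-neighbour regressor."""
    def fit(X, y):
        def predict(x):
            order = sorted(range(len(X)),
                           key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], x)))
            nearest = order[:k]
            return sum(y[i] for i in nearest) / len(nearest)
        return predict
    return fit

def meta_view(learners, X, y, k=10):
    """Step 4: one row per example, one column per model, filled with
    predictions made while that example was held out."""
    meta = [[None] * len(learners) for _ in X]
    for fold in folds_by_index(len(X), k):
        held = set(fold)
        X_tr = [x for i, x in enumerate(X) if i not in held]
        y_tr = [t for i, t in enumerate(y) if i not in held]
        for j, fit in enumerate(learners):
            model = fit(X_tr, y_tr)
            for i in fold:
                meta[i][j] = model(X[i])
    return meta

def stack(learners, meta_fit, X, y, k=10):
    combiner = meta_fit(meta_view(learners, X, y, k), y)  # step 5
    base = [fit(X, y) for fit in learners]                # assumption: refit on all data
    return lambda x: combiner([m(x) for m in base])       # step 6

# Toy regression data (made up for illustration only).
X = [[float(i)] for i in range(30)]
y = [2.0 * i + 1.0 for i in range(30)]
learners = [fit_mean, fit_linear, fit_knn(1)]
predict = stack(learners, fit_knn(3), X, y)
print(predict([10.0]))
```

The point of building the meta view from held-out predictions is that the combiner sees how each model behaves on examples it was not trained on, rather than on examples it has memorised.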
_PREDICTION
Obj_00001 5.660
Obj_00002 5.452
Obj_00003 4.976
Obj_00004 4.938
Obj_00005 6.082
Obj_00006 7.549
Obj_00007 5.147
Obj_00008 5.136
Obj_00009 5.278
Obj_00010 5.570
Obj_00011 5.991
Obj_00012 4.789
Obj_00013 5.591
Obj_00014 4.400
Obj_00015 5.895
Obj_00016 5.551
Obj_00017 5.981
Obj_00018 5.581
Obj_00019 4.378
Obj_00020 4.590
Obj_00021 5.540
Obj_00022 5.434
Obj_00023 5.473
Obj_00024 5.591
Obj_00025 4.203
Obj_00026 6.390
Obj_00027 5.890
Obj_00028 5.706
Obj_00029 4.712
Obj_00030 4.940
Obj_00031 5.464
Obj_00032 5.718
Obj_00033 5.720
Obj_00034 6.075
Obj_00035 7.536
Obj_00036 5.752
Obj_00037 4.807
Obj_00038 5.309
Obj_00039 5.450
Obj_00040 4.044
Obj_00041 5.328
Obj_00042 5.015
Obj_00043 5.700
Obj_00044 4.485
Obj_00045 5.420
Obj_00046 5.572
Obj_00047 4.962
Obj_00048 5.496
Obj_00049 4.002
Obj_00050 4.089
Obj_00051 5.575
Obj_00052 5.718
Obj_00053 5.581
Obj_00054 4.938
Obj_00055 4.964
Obj_00056 5.518
Obj_00057 4.928
Obj_00058 4.761
Obj_00059 6.100
Obj_00060 4.913
Obj_00061 5.470
Obj_00062 5.151
Obj_00063 5.389
Obj_00064 5.980
Obj_00065 5.369
Obj_00066 7.647
Obj_00067 5.891
Obj_00068 5.750
Obj_00069 5.505
Obj_00070 4.950
Obj_00071 4.710
Obj_00072 5.382
Obj_00073 4.555
Obj_00074 5.540
Obj_00075 5.891
Obj_00076 4.017
Obj_00077 6.366
Obj_00078 6.351
Obj_00079 5.280
Obj_00080 5.417
Obj_00081 5.060
Obj_00082 6.402
Obj_00083 5.281
Obj_00084 7.621
Obj_00085 6.058
Obj_00086 5.101
Obj_00087 5.305
Obj_00088 5.455