_PROBLEM
CoEPrA-2006_Regression_003

_GROUP_NAME
Hendrik Blockeel

_GROUP_MEMBERS
Joaquin Vanschoren
Leander Schietgat
Elisa Fromont
Jan Ramon
Hendrik Blockeel
Kurt De Grave
Luc Dehaspe
Walter Luyten

_ADDRESS
Dept. of Computer Science, Katholieke Universiteit Leuven
Celestijnenlaan 200A, 3001 Leuven, Belgium

_MODELING_PROCEDURE

1) construction of different views on the data

As data representation is known to have an important impact on learning results, we constructed different views on the same data set.

View 1 is the raw data, with 643 descriptors for each amino acid (9 x 643 = 5787 descriptors per peptide).

View 2 contains information on "frequent subsequences at specific positions" in the peptide sequences. Patterns of length 1, 2 and 3 with a minimum frequency of 5% (i.e., occurring in at least 5% of all examples) were mined using the Warmr algorithm [Dehaspe99]. An example of such a pattern is (7,V,T), which means that V occurs at position 7 and T at position 8 of the peptide sequence. These frequent patterns were mined on the calibration and prediction sets together. Then every example was checked against all the frequent patterns: a 1 was recorded if the pattern occurs in the example, and a 0 otherwise. After consulting a domain expert, we decided to make a variant of this view (called ``view2_inter'') that looks for alternating patterns in the peptide sequence. Interactions between alternating amino acids are apparently sometimes more important, since their side chains point in the same direction. An example of such a pattern is (4,P,P,T), which means that a P occurs at position 4, a second P at position 6, and a T at position 8. The corresponding view was constructed in the same way. The two variants of View 2 were also combined into a single view (called ``view2_combined'').

View 3 (called ``view3_names'' in our results) focuses on different properties of amino acids.
Instead of using the given (unknown) descriptors, we used some of the most popular amino acid properties (i.e. size, charge, polarity, hydrophobicity) and described each of the 9 amino acids in a peptide by all of these properties.

View 4 is similar to View 2 but contains information on the occurrence of subsequences independent of their position. For a pattern such as (P,F), it does not matter where the pattern occurs in the sequence. The same strategy was applied to alternating patterns, and a combined view was also constructed (these views are ``view4'', ``view4_inter'' and ``view4_combined'').

These views contain complementary information. For instance, a condition such as "the subsequence PPT occurs somewhere in the peptide" would be very complex to express using any view but View 4, whereas "the 5th amino acid has a positive charge" can be expressed using View 3 but not using View 4. We therefore constructed some combinations. A first combination (called ``re3_view2_view1_combined'') merges the 5787 descriptors of the calibration data with the frequent patterns of view2_combined. A second combination (called ``re3_view2_view3_combined'') merges the information of view3 and view2_combined.

2) construction of the predictive model

The following procedure was followed:

1. the different views on the data were constructed, for the calibration as well as the test set

2. 62 different algorithms implemented in the WEKA data mining tool were trained with their default parameters using 10-fold cross-validation on the different views.

3. the best 17 couples (view, algorithm) were chosen according to their root mean squared error on the training set.
The couples retained for the third regression problem are:
"CoEPrA-2006_Regression_003_Calibration_Peptides","LWL"
"re3_view4","M5P"
"re3_view4","M5Rules"
"re3_view2_combined","LWL"
"re3_view4_inter","RBFNetwork"
"re3_view2_combined","RBFNetwork"
"CoEPrA-2006_Regression_003_Calibration_Peptides","M5P"
"re3_view3","AdditiveRegression"
"re3_view4_combined","RBFNetwork"
"re3_view3","RBFNetwork"
"re3_view2_inter","RBFNetwork"
"re3_view2_combined_and_view3","AdditiveRegression"
"re3_view4_combined","AdditiveRegression"
"re3_view2_combined_and_raw_data","AdditiveRegression"
"re3_view2_combined_and_view3","M5P"
"CoEPrA-2006_Regression_003_Calibration_Peptides","KStar"
"re3_view4_inter","AdditiveRegression"

4. for each problem, the 17 algorithms were trained and tested with 10-fold cross-validation on the training set to compute a meta view on the training data that contains the predictions of each model as attributes. In other words, the meta view contains one row for each example and one column for each model, and lists for each example the predictions that the 17 models made for that example. The target value was added as an 18th column.

5. we then trained all WEKA algorithms on this meta view, and the best one (tested with 10-fold cross-validation) was selected. This was AdditiveRegression (RMSE = 0.44315475). A model was then learned with AdditiveRegression (again with default parameter settings) from the meta view; this amounts to learning how best to combine the predictions of the multiple models into a single prediction. This procedure is known as stacking [Wolpert92].

6. the 17 models were applied to the test set, yielding 17 predictions for each test example. The model learned in step 5 was then applied to these predictions to get the final prediction for the example.
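
The shape of steps 4 to 6 can be illustrated with a minimal sketch in plain Python. It is not the WEKA pipeline used for this submission: the two base learners (MeanModel, NearestNeighbour), the inverse-error weighting used as combiner in place of AdditiveRegression, the toy data and all function names are hypothetical stand-ins chosen only to keep the example self-contained.

```python
from statistics import mean


def k_folds(n, k):
    """Split indices 0..n-1 into k interleaved folds."""
    return [list(range(i, n, k)) for i in range(k)]


class MeanModel:
    """Hypothetical base learner: always predicts the training mean."""
    def fit(self, X, y):
        self.mu = mean(y)
        return self

    def predict(self, X):
        return [self.mu for _ in X]


class NearestNeighbour:
    """Hypothetical base learner: 1-nearest neighbour, squared distance."""
    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict(self, X):
        def closest(row):
            d = [sum((a - b) ** 2 for a, b in zip(row, ref)) for ref in self.X]
            return self.y[d.index(min(d))]
        return [closest(row) for row in X]


def build_meta_view(factories, X, y, k=10):
    """Step 4: one row per example, one column per model, target last.

    Each prediction comes from a model that never saw that example.
    """
    n = len(X)
    meta = [[None] * len(factories) + [y[i]] for i in range(n)]
    for fold in k_folds(n, k):
        held_out = set(fold)
        train = [i for i in range(n) if i not in held_out]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        for j, make in enumerate(factories):
            preds = make().fit(X_tr, y_tr).predict([X[i] for i in fold])
            for i, p in zip(fold, preds):
                meta[i][j] = p
    return meta


def fit_combiner(meta):
    """Step 5 stand-in: weight each model by its inverse out-of-fold MSE.

    (The submission learned an AdditiveRegression model here instead.)
    """
    n_models = len(meta[0]) - 1
    mse = [mean([(r[j] - r[-1]) ** 2 for r in meta]) for j in range(n_models)]
    inv = [1.0 / (m + 1e-12) for m in mse]
    return [w / sum(inv) for w in inv]


def predict_stacked(factories, weights, X_train, y_train, X_test):
    """Step 6: refit base models on all training data, then combine."""
    cols = [make().fit(X_train, y_train).predict(X_test) for make in factories]
    return [sum(w * c[i] for w, c in zip(weights, cols))
            for i in range(len(X_test))]


# Toy data (an assumption, not CoEPrA data): y = 2x + 1 on one descriptor.
X = [[float(i)] for i in range(20)]
y = [2.0 * row[0] + 1.0 for row in X]
factories = [MeanModel, NearestNeighbour]
meta = build_meta_view(factories, X, y, k=5)
weights = fit_combiner(meta)
final = predict_stacked(factories, weights, X, y, [[3.2], [15.7]])
```

The point of the out-of-fold construction is that the combiner is trained on predictions for examples the base models did not see, so it is less likely to favour a base model that merely memorised the training set.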
_PREDICTION Obj_00001 6.864 Obj_00002 7.913 Obj_00003 7.913 Obj_00004 7.913 Obj_00005 7.728 Obj_00006 7.913 Obj_00007 6.864 Obj_00008 7.913 Obj_00009 7.913 Obj_00010 6.864 Obj_00011 7.913 Obj_00012 6.864 Obj_00013 6.864 Obj_00014 7.913 Obj_00015 8.291 Obj_00016 6.864 Obj_00017 6.864 Obj_00018 7.728 Obj_00019 6.864 Obj_00020 7.913 Obj_00021 7.913 Obj_00022 7.913 Obj_00023 8.291 Obj_00024 7.913 Obj_00025 6.864 Obj_00026 6.864 Obj_00027 7.913 Obj_00028 6.864 Obj_00029 7.913 Obj_00030 7.913 Obj_00031 7.913 Obj_00032 6.864 Obj_00033 7.913 Obj_00034 6.864 Obj_00035 8.291 Obj_00036 7.913 Obj_00037 7.728 Obj_00038 6.864 Obj_00039 7.913 Obj_00040 7.913 Obj_00041 6.864 Obj_00042 7.913 Obj_00043 6.864 Obj_00044 6.864 Obj_00045 6.864 Obj_00046 7.913 Obj_00047 8.291 Obj_00048 6.864 Obj_00049 7.728 Obj_00050 7.913 Obj_00051 7.913 Obj_00052 6.486 Obj_00053 6.864 Obj_00054 7.913 Obj_00055 6.864 Obj_00056 6.864 Obj_00057 6.864 Obj_00058 6.864 Obj_00059 6.864 Obj_00060 7.913 Obj_00061 6.864 Obj_00062 6.864 Obj_00063 7.728 Obj_00064 6.864 Obj_00065 6.864 Obj_00066 7.913 Obj_00067 7.913 Obj_00068 6.864 Obj_00069 7.913 Obj_00070 8.291 Obj_00071 6.864 Obj_00072 6.864 Obj_00073 6.864 Obj_00074 7.728 Obj_00075 7.350 Obj_00076 8.291 Obj_00077 8.291 Obj_00078 7.913 Obj_00079 6.864 Obj_00080 6.864 Obj_00081 7.913 Obj_00082 6.864 Obj_00083 6.864 Obj_00084 6.864 Obj_00085 6.864 Obj_00086 7.913 Obj_00087 7.728 Obj_00088 6.486 Obj_00089 7.913 Obj_00090 6.864 Obj_00091 8.291 Obj_00092 7.913 Obj_00093 6.864 Obj_00094 7.913 Obj_00095 8.291 Obj_00096 6.864 Obj_00097 6.864 Obj_00098 7.728 Obj_00099 7.913 Obj_00100 7.913 Obj_00101 6.864 Obj_00102 7.728 Obj_00103 6.864 Obj_00104 6.864 Obj_00105 7.913 Obj_00106 8.291 Obj_00107 7.913 Obj_00108 7.728 Obj_00109 7.350 Obj_00110 7.913 Obj_00111 6.864 Obj_00112 7.913 Obj_00113 6.864 Obj_00114 7.913 Obj_00115 7.913 Obj_00116 6.864 Obj_00117 7.728 Obj_00118 8.291 Obj_00119 7.728 Obj_00120 7.913 Obj_00121 7.728 Obj_00122 6.864 Obj_00123 7.913 Obj_00124 6.864 
Obj_00125 7.728 Obj_00126 7.728 Obj_00127 7.913 Obj_00128 6.864 Obj_00129 6.864 Obj_00130 8.291 Obj_00131 6.864 Obj_00132 6.864 Obj_00133 6.864