_PROBLEM CoEPrA-2006_Regression_002
_GROUP_NAME Hendrik Blockeel
_GROUP_MEMBERS Joaquin Vanschoren, Leander Schietgat, Elisa Fromont, Jan Ramon, Hendrik Blockeel, Kurt De Grave, Luc Dehaspe, Walter Luyten
_ADDRESS Dept. of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, 3001 Leuven, Belgium
_MODELING_PROCEDURE

1) construction of different views on the data

As data representation is known to have an important impact on learning results, we have constructed different views on the same data set.

View 1 is the raw data, with 643 descriptors for each amino acid.

View 2 contains information on "frequent subsequences at specific positions" in the peptide sequences. Patterns of length 1, 2 and 3 with a minimum frequency of 5% (i.e., occurring in at least 5% of all examples) were mined using the Warmr algorithm [Dehaspe99]. An example of such a pattern is (7,V,T), which means that V occurs at position 7 and T at position 8 in the peptide sequence. These frequent patterns were mined in the calibration and prediction sets together. Then, every example was checked against all the frequent patterns: 1 was written if the pattern occurred in the example, 0 otherwise. After consulting a domain expert, we decided to make a variant of this view (called ``view2_inter'') that looks for alternating patterns in the peptide sequence. Apparently the interactions between alternating amino acids are sometimes more important, since their side-chains point in the same direction. An example of such a pattern is (4,P,P,T), which means that a P occurs at position 4, a second P at position 6, and a T at position 8. The corresponding view was constructed in the same way. The two variants of View 2 were also combined in a single view (called ``view2_combined'').

View 3 (called ``view3_names'' in our results) focused on different properties of amino acids.
Instead of using the given (unknown) descriptors, we used some of the most popular amino acid properties (i.e., size, charge, polarity and hydrophobicity) and constructed, for each of the 9 amino acids in a peptide, a row with all of its properties.

View 4 is similar to View 2 but contains information on the occurrence of subsequences independent of their position. For a pattern (P,F), for example, it does not matter where the pattern occurs in the sequence. The same strategy was applied for alternating patterns, and a combined view was also constructed (these views are ``view4'', ``view4_inter'' and ``view4_combined'').

These views contain complementary information, in the sense that, for instance, a condition "the subsequence PPT occurs somewhere in the peptide" would be very complex to express using any view but View 4, whereas "the 5th amino acid has a positive charge" can be expressed using View 3 but not using View 4, etc. We therefore constructed some combinations. A first combination (called ``re2_view2_view1_combined'') merges the 5787 descriptors (9 x 643) of the calibration data with the frequent patterns of view2_combined. A second combination (called ``re2_view2_view3_combined'') merges the information of view3 and view2_combined.

2) construction of the predictive model

The following procedure was followed:

1. the different views on the data were constructed, for the calibration set as well as the test set.
2. 62 different algorithms implemented in the WEKA data mining tool were trained with their default parameters using 10-fold cross-validation on the different views.
3. for each classification and regression problem, the best 13 couples (view, algorithm) were chosen according to their root mean squared error on the training set.
The couples retained for this regression problem are:

(CoEPrA-2006_regression_002_Calibration_Peptides, SMOreg) rmse = 0.5577
(CoEPrA-2006_regression_002_Calibration_Peptides, IBk) rmse = 0.5623
(re2_view3_names, IBk) rmse = 0.5862
(re2_view4_combined, KStar) rmse = 0.5938
(re2_view2_combined, KStar) rmse = 0.5940
(re2_view2_inter, IBk) rmse = 0.5956
(re2_view2_combined, AdditiveRegression) rmse = 0.5960
(re2_view2_inter, M5Rules) rmse = 0.5975
(re2_view4_combined, M5Rules) rmse = 0.6041
(re2_view4_combined, M5P) rmse = 0.6134
(re2_view2_inter, LinearRegression) rmse = 0.6143
(re2_view2_combined_and_raw_data, SMOreg) rmse = 0.6185
(re2_view2_combined_and_view3, AdditiveRegression) rmse = 0.6319

4. for each problem, the 13 algorithms were trained and tested with 10-fold cross-validation on the training set to compute a meta view on the training data that contains the predictions of each model as attributes. In other words, the meta view contains one line for each example and one column for each model, and lists for each example the predictions that the 13 models made for that example. The true target value was added as a 14th column.
5. we then trained all WEKA algorithms on this meta view, and the best one (tested with 10-fold cross-validation) was selected. This was KStar (rmse = 0.3360). A model was then learned with KStar (again with default parameter settings) from the meta view; this amounts to learning how to combine the predictions of the multiple models into a single prediction. This procedure is known as stacking [Wolpert92].
6. the 13 models were applied to the test set, yielding 13 predictions for each test example. The model learned in step 5 was then applied to these predictions to get the final prediction for the example.
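
Steps 4 to 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for this submission: the WEKA learners (SMOreg, IBk, KStar, ...) are replaced by two plain stand-ins (a k-nearest-neighbour regressor and a mean predictor), the data are a synthetic toy set, and all function names (k_folds, knn_fit, mean_fit, meta_view, fit_stack) are hypothetical. The true target is kept in a separate list y rather than as an extra column of the meta view.

```python
import random

def k_folds(n, k, seed=0):
    """Split indices 0..n-1 into k shuffled folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_fit(X, y, k=1):
    """Stand-in learner: average the targets of the k nearest training rows."""
    def predict(x):
        dist = lambda i: sum((a - b) ** 2 for a, b in zip(X[i], x))
        near = sorted(range(len(X)), key=dist)[:k]
        return sum(y[i] for i in near) / len(near)
    return predict

def mean_fit(X, y):
    """Stand-in learner: always predict the training mean."""
    m = sum(y) / len(y)
    return lambda x: m

def meta_view(couples, y, folds):
    """One row per example, one column per (view, learner) couple.
    Each cell is a prediction made by a model that never saw that example."""
    n = len(y)
    meta = [[0.0] * len(couples) for _ in range(n)]
    for j, (X, fit) in enumerate(couples):
        for fold in folds:
            held_out = set(fold)
            train = [i for i in range(n) if i not in held_out]
            model = fit([X[i] for i in train], [y[i] for i in train])
            for i in fold:
                meta[i][j] = model(X[i])
    return meta

def fit_stack(couples, y, meta_fit, n_folds=10, seed=0):
    """Learn the meta model on the meta view, then refit base models on all data."""
    meta = meta_view(couples, y, k_folds(len(y), n_folds, seed))
    meta_model = meta_fit(meta, y)
    base_models = [fit(X, y) for X, fit in couples]
    def predict(rows):
        # rows: the feature vector of one test example in each view, in couple order
        return meta_model([m(x) for m, x in zip(base_models, rows)])
    return predict, meta

# Toy data (assumed, not from the competition): two views of 20 examples.
xs = list(range(20))
y = [7.0 + 0.05 * x for x in xs]
view_a = [[float(x)] for x in xs]
view_b = [[float(x % 5)] for x in xs]
couples = [(view_a, lambda X, t: knn_fit(X, t, k=1)), (view_b, mean_fit)]
predict, meta = fit_stack(couples, y, lambda X, t: knn_fit(X, t, k=3))
final = predict([[10.0], [0.0]])  # one test example, described in both views
```

The meta view is built from held-out predictions only, so the combining model is trained on predictions of the kind it will see on the test set rather than on predictions for examples the base models were fitted on.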
_PREDICTION
Obj_00001 7.493
Obj_00002 7.853
Obj_00003 8.040
Obj_00004 8.018
Obj_00005 7.833
Obj_00006 8.105
Obj_00007 8.028
Obj_00008 8.076
Obj_00009 7.814
Obj_00010 7.013
Obj_00011 8.029
Obj_00012 7.416
Obj_00013 7.724
Obj_00014 7.636
Obj_00015 8.035
Obj_00016 7.802
Obj_00017 8.036
Obj_00018 7.588
Obj_00019 7.814
Obj_00020 7.689
Obj_00021 7.827
Obj_00022 7.637
Obj_00023 7.889
Obj_00024 7.814
Obj_00025 7.721
Obj_00026 8.126
Obj_00027 7.644
Obj_00028 8.037
Obj_00029 8.154
Obj_00030 7.995
Obj_00031 7.844
Obj_00032 7.894
Obj_00033 7.627
Obj_00034 7.716
Obj_00035 7.143
Obj_00036 7.978
Obj_00037 7.915
Obj_00038 7.943
Obj_00039 8.093
Obj_00040 7.075
Obj_00041 7.814
Obj_00042 7.426
Obj_00043 7.898
Obj_00044 7.845
Obj_00045 7.814
Obj_00046 8.034
Obj_00047 5.029
Obj_00048 7.833
Obj_00049 7.990
Obj_00050 7.075
Obj_00051 8.169
Obj_00052 8.042
Obj_00053 7.013
Obj_00054 8.028
Obj_00055 7.594
Obj_00056 7.158
Obj_00057 8.053
Obj_00058 8.013
Obj_00059 7.718
Obj_00060 8.070
Obj_00061 7.393
Obj_00062 8.091
Obj_00063 8.035
Obj_00064 8.036
Obj_00065 7.983
Obj_00066 7.884
Obj_00067 7.832
Obj_00068 7.945
Obj_00069 7.814
Obj_00070 7.713
Obj_00071 8.000
Obj_00072 7.696
Obj_00073 7.896
Obj_00074 8.116
Obj_00075 7.900
Obj_00076 7.814