_PROBLEM CoEPrA-2006_Classification_004
_GROUP_NAME Hendrik Blockeel
_GROUP_MEMBERS Joaquin Vanschoren, Leander Schietgat, Elisa Fromont, Jan Ramon, Hendrik Blockeel, Kurt De Grave, Luc Dehaspe, Walter Luyten
_ADDRESS Dept. of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, 3001 Leuven, Belgium
_MODELING_PROCEDURE

1) construction of different views on the data

As data representation is known to have an important impact on learning results, we constructed different views on the same data set.

View 1 is the raw data, with 643 descriptors for each amino acid.

View 2 contains information on "frequent subsequences at specific positions" in the peptide sequences. Patterns of length 1, 2 and 3 with a minimum frequency of 5% (i.e., occurring in at least 5% of all examples) were mined using the Warmr algorithm [Dehaspe99]. An example of such a pattern is (7,V,T), which means that V occurs at position 7 and T at position 8 of the peptide sequence. These frequent patterns were mined from the calibration and prediction sets together. Then, every example was checked against all the frequent patterns: a 1 was written if the pattern occurred in the example, and a 0 otherwise.

After consulting a domain expert, we decided to make a variant of this view (called ``view2_inter'') that looks for alternating patterns in the peptide sequence. Apparently the interactions between alternating amino acids are sometimes more important, since their side chains point in the same direction. An example of such a pattern is (4,P,P,T), which means that a P occurs at position 4, a second P at position 6, and a T at position 8. The corresponding view was constructed in the same way. These two variants of View 2 were also combined in a single view (called ``view2_combined'').

View 3 (called ``view3_names'' in our results) focused on different properties of amino acids.
Instead of using the given (unknown) descriptors, we used some of the most popular amino acid properties (i.e., size, charge, polarity, hydrophobicity) and converted each of the 9 amino acids of a peptide into a row containing all of its properties.

View 4 is similar to View 2 but contains information on the occurrence of subsequences independent of their position. For a pattern (P,F), for example, it does not matter where this pattern occurs in the sequence. The same strategy was applied for alternating patterns, and a combined view was also constructed (these views are ``view4'', ``view4_inter'' and ``view4_combined'').

These views contain complementary information, in the sense that, for instance, a condition "the subsequence PPT occurs somewhere in the peptide" would be very complex to express using any view but View 4, whereas "the 5th amino acid has a positive charge" can be expressed using View 3 but not using View 4, etc. We therefore constructed some combinations. A first combination (called ``cl4_view2_view1_combined'') merges the 5787 descriptors (9 x 643) of the calibration data with the frequent patterns of view2_combined. A second combination (called ``cl4_view2_view3_combined'') merges the information of view3 and view2_combined.

2) construction of the predictive model

The following procedure was followed:
1. The different views on the data were constructed, for the calibration set as well as the test set.
2. 62 different algorithms implemented in the WEKA data mining tool were trained with their default parameters using 10-fold cross-validation on the different views. However, to enhance stacking performance (see point 5), we used the confidence values of the predictions of these algorithms instead of their class predictions.
3. Because the class distribution was heavily skewed, we resampled each of our datasets to correct this. We then repeated the procedure of step 2, evaluating the predictions on the original datasets.
4. We also tried different parameter settings for three of the algorithms:
   J48 -> confidence value
   Logistic -> ridge parameter
   SMO -> complexity value, exponent for polynomial kernels, and gamma for RBF kernels
5. One of the best of these models was RandomForest (with default parameters) on the resampled view2_inter dataset. This model was then used to predict the target values of the test set.
_PREDICTION
Obj_00001 -1
Obj_00002 -1
Obj_00003 -1
Obj_00004 -1
Obj_00005 -1
Obj_00006 -1
Obj_00007 -1
Obj_00008 -1
Obj_00009 -1
Obj_00010 +1
Obj_00011 -1
Obj_00012 -1
Obj_00013 -1
Obj_00014 -1
Obj_00015 -1
Obj_00016 +1
Obj_00017 -1
Obj_00018 -1
Obj_00019 -1
Obj_00020 -1
Obj_00021 -1
Obj_00022 -1
Obj_00023 -1
Obj_00024 -1
Obj_00025 -1
Obj_00026 -1
Obj_00027 -1
Obj_00028 -1
Obj_00029 -1
Obj_00030 -1
Obj_00031 -1
Obj_00032 -1
Obj_00033 -1
Obj_00034 -1
Obj_00035 +1
Obj_00036 +1
Obj_00037 -1
Obj_00038 -1
Obj_00039 -1
Obj_00040 -1
Obj_00041 -1
Obj_00042 -1
Obj_00043 -1
Obj_00044 -1
Obj_00045 -1
Obj_00046 -1
Obj_00047 -1
Obj_00048 -1
Obj_00049 -1
Obj_00050 -1
Obj_00051 -1
Obj_00052 -1
Obj_00053 -1
Obj_00054 -1
Obj_00055 -1
Obj_00056 -1
Obj_00057 -1
Obj_00058 -1
Obj_00059 -1
Obj_00060 -1
Obj_00061 -1
Obj_00062 -1
Obj_00063 -1
Obj_00064 -1
Obj_00065 -1
Obj_00066 -1
Obj_00067 -1
Obj_00068 -1
Obj_00069 -1
Obj_00070 -1
Obj_00071 +1
Obj_00072 -1
Obj_00073 -1
Obj_00074 +1
Obj_00075 -1
Obj_00076 -1
Obj_00077 -1
Obj_00078 -1
Obj_00079 -1
Obj_00080 -1
Obj_00081 -1
Obj_00082 -1
Obj_00083 -1
Obj_00084 -1
Obj_00085 -1
Obj_00086 -1
Obj_00087 -1
Obj_00088 -1
Obj_00089 +1
Obj_00090 -1
Obj_00091 -1
Obj_00092 -1
Obj_00093 -1
Obj_00094 -1
Obj_00095 -1
Obj_00096 -1
Obj_00097 -1
Obj_00098 -1
Obj_00099 -1
Obj_00100 -1
Obj_00101 -1
Obj_00102 -1
Obj_00103 +1
Obj_00104 -1
Obj_00105 -1
Obj_00106 -1
Obj_00107 -1
Obj_00108 -1
Obj_00109 -1
Obj_00110 -1
Obj_00111 -1