_PROBLEM CoEPrA-2006_Classification_001
_GROUP_NAME Hendrik Blockeel
_GROUP_MEMBERS
Joaquin Vanschoren
Leander Schietgat
Elisa Fromont
Jan Ramon
Hendrik Blockeel
Kurt De Grave
Luc Dehaspe
Walter Luyten
_ADDRESS
Dept. of Computer Science, Katholieke Universiteit Leuven
Celestijnenlaan 200A, 3001 Leuven, Belgium
_MODELING_PROCEDURE

1) construction of different views on the data

As data representation is known to have an important impact on learning results, we have constructed different views on the same data set.

View 1 is the raw data, with 643 descriptors for each amino acid.

View 2 contains information on "frequent subsequences at specific positions" in the peptide sequences. Patterns of length 1, 2 and 3 with a minimum frequency of 5% (i.e., occurring in at least 5% of all examples) were mined using the Warmr algorithm [Dehaspe99]. An example of such a pattern is (7,V,T), which means that VT occurs at positions 7 and 8 in the peptide sequence. These frequent patterns were mined on the calibration and prediction sets together. Then every example was checked against all the frequent patterns: a 1 was written if the pattern occurred in the example, and a 0 otherwise.

After consulting a domain expert, we decided to make a variant of this view (called ``view2_inter'') that looks for alternating patterns in the peptide sequence. Apparently, the interactions between alternating amino acids are sometimes more important, since their side chains point in the same direction. An example of such a pattern is (4,P,P,T), which means that a P occurs at position 4, a second P at position 6, and a T at position 8. The corresponding view was constructed in the same way. The two variants of View 2 were also combined into a single view (called ``view2_combined'').

View 3 (called ``view3_names'' in our results) focuses on different properties of amino acids.
Instead of using the given (unknown) descriptors, we used some of the most popular amino acid properties (i.e., size, charge, polarity, hydrophobicity) and built, for each of the 9 amino acids, a row with all of its properties.

View 4 is similar to View 2 but contains information on the occurrence of subsequences independent of their position. For a pattern such as (P,F), it does not matter where the pattern occurs in the sequence. The same strategy was applied to alternating patterns, and a combined view was also constructed (these views are ``view4'', ``view4_inter'' and ``view4_combined'').

These views contain complementary information, in the sense that, for instance, a condition such as "the subsequence PPT occurs somewhere in the peptide" would be very complex to express using any view but View 4, whereas "the 5th amino acid has a positive charge" can be expressed using View 3 but not using View 4, etc. We therefore constructed some combinations. A first combination (called ``cl1_view2_view1_combined'') merges the 5787 descriptors of the calibration data (9 amino acids x 643 descriptors, as in View 1) with the frequent patterns of view2_combined. A second combination (called ``re1_view2_view3_combined'') merges the information of view3 and view2_combined.

2) construction of the predictive model

The following procedure was followed:

1. The different views on the data were constructed, for the calibration set as well as the test set.

2. 62 different algorithms implemented in the WEKA data mining tool were trained with their default parameters using 10-fold cross-validation on the different views.

3. For each classification and regression problem, the best 15 couples (view, algorithm) were chosen according to their area under the ROC curve (AUC).
The couples retained for the first classification problem are:

(cl1_view2_view1_combined, VFI) auc = 0.85353535
(cl1_view1, VFI) auc = 0.8424242
(cl1_view3_names, NaiveBayes) auc = 0.83181816
(cl1_view2, Winnow) auc = 0.8181818
(cl1_view4_combined, NBTree) auc = 0.8181818
(cl1_view2_inter, MultilayerPerceptron) auc = 0.8068182
(cl1_view2, LogitBoost) auc = 0.8068182
(cl1_view2_combined, Winnow) auc = 0.8068182
(cl1_view2_combined, LogitBoost) auc = 0.8068182
(cl1_view3_names, BayesNet) auc = 0.7989899
(cl1_view1, Bagging) auc = 0.79848486
(cl1_view3_names, JRip) auc = 0.7979798
(cl1_view1, LogitBoost) auc = 0.7977273
(cl1_view1, PART) auc = 0.79747474
(cl1_view1, ClassificationViaRegression) auc = 0.79747474

4. For each problem, the 15 algorithms were trained and tested with 10-fold cross-validation on the training set to compute a meta-view on the training data that contains the predictions of each model as attributes. In other words, the meta-view contains one line for each example and one column for each model, and lists for each example the predictions that the 15 models made for that example. The true classification was added as a 16th column.

5. We then trained all WEKA algorithms on this meta-view, and the best one (tested with 10-fold cross-validation) was selected. This was AODE (auc = 0.87651515). A model was then learned with AODE (again with default parameter settings) from the meta-view; this amounts to learning to combine the predictions of the multiple models into a single prediction in an optimal way. This procedure is known as stacking [Wolpert92].

6. The 15 models were applied to the test set, yielding 15 predictions for each test example. The model learned in step 5 was then applied to these predictions to get the final prediction for the example.
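
The stacking procedure of steps 4 to 6 can be illustrated with a minimal Python sketch. It is not the WEKA setup used for this submission: the nearest-centroid base learner and the naive Bayes combiner are simple stand-ins (the latter in place of AODE), the three (view, algorithm) couples and the 40 synthetic examples are made up for illustration, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def k_folds(n, k):
    """Split indices 0..n-1 into k interleaved folds."""
    return [list(range(start, n, k)) for start in range(k)]

def train_centroid(X, y):
    """Toy base learner (stand-in for a WEKA algorithm):
    predict the class whose feature centroid is nearest."""
    counts = Counter(y)
    sums = {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def predict(row):
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(row, centroids[c]))
        return min(centroids, key=sq_dist)
    return predict

def train_naive_bayes(M, y):
    """Toy meta-learner (stand-in for AODE): naive Bayes over the
    base-model predictions, with add-one smoothing."""
    classes = sorted(set(y))
    prior = Counter(y)
    cond = defaultdict(Counter)  # (column, class) -> counts of predicted labels
    for row, label in zip(M, y):
        for j, value in enumerate(row):
            cond[(j, label)][value] += 1

    def predict(row):
        def log_posterior(c):
            lp = math.log(prior[c] / len(y))
            for j, value in enumerate(row):
                # each column holds a predicted class label, so it has
                # len(classes) possible values
                lp += math.log((cond[(j, c)][value] + 1) / (prior[c] + len(classes)))
            return lp
        return max(classes, key=log_posterior)
    return predict

def make_meta_view(X, y, couples, k=10):
    """Step 4: out-of-fold predictions, one row per example, one column per couple."""
    n = len(X)
    meta = [[None] * len(couples) for _ in range(n)]
    for test_idx in k_folds(n, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        y_train = [y[i] for i in train_idx]
        for col, (view, trainer) in enumerate(couples):
            model = trainer([view(X[i]) for i in train_idx], y_train)
            for i in test_idx:
                meta[i][col] = model(view(X[i]))
    return meta

def fit_stack(X, y, couples, meta_trainer, k=10):
    """Steps 4-6: learn the combiner on the meta-view, then refit the base
    models on all training data for use on new examples."""
    meta = make_meta_view(X, y, couples, k)
    combiner = meta_trainer(meta, y)
    base = [(view, trainer([view(row) for row in X], y)) for view, trainer in couples]

    def predict(row):
        return combiner([model(view(row)) for view, model in base])
    return predict, meta

# Synthetic data (an assumption for illustration): 40 examples, labels in {+1, -1}.
y = [1 if i % 2 == 0 else -1 for i in range(40)]
X = [[label + 0.1 * (i % 3),                    # reliable feature
      float(i % 4),                             # uninformative feature
      -2.0 if i % 10 == 0 else 2.0 * label]     # misleading for every 10th example
     for i, label in enumerate(y)]

# Each couple pairs a view (here: one selected column) with a learner.
couples = [(lambda r: [r[0]], train_centroid),
           (lambda r: [r[1]], train_centroid),
           (lambda r: [r[2]], train_centroid)]

stacked, meta = fit_stack(X, y, couples, train_naive_bayes)
print(meta[0])                       # out-of-fold predictions for example 0
print(stacked([1.1, 2.0, -2.0]))     # final prediction for a new example
```

In this toy setting the combiner learns from the meta-view that the first couple is more reliable than the third, so it follows the first when the two disagree. The key design point, as in step 4 above, is that the meta-view is built only from predictions on held-out folds, so the combiner is not trained on predictions that the base models made for examples they had already seen.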
_PREDICTION
Obj_00001 -1
Obj_00002 -1
Obj_00003 -1
Obj_00004 -1
Obj_00005 +1
Obj_00006 +1
Obj_00007 -1
Obj_00008 -1
Obj_00009 -1
Obj_00010 +1
Obj_00011 +1
Obj_00012 +1
Obj_00013 +1
Obj_00014 +1
Obj_00015 -1
Obj_00016 -1
Obj_00017 -1
Obj_00018 +1
Obj_00019 -1
Obj_00020 -1
Obj_00021 +1
Obj_00022 -1
Obj_00023 -1
Obj_00024 +1
Obj_00025 +1
Obj_00026 +1
Obj_00027 +1
Obj_00028 -1
Obj_00029 +1
Obj_00030 +1
Obj_00031 +1
Obj_00032 -1
Obj_00033 -1
Obj_00034 -1
Obj_00035 +1
Obj_00036 -1
Obj_00037 -1
Obj_00038 +1
Obj_00039 -1
Obj_00040 +1
Obj_00041 -1
Obj_00042 -1
Obj_00043 -1
Obj_00044 -1
Obj_00045 +1
Obj_00046 -1
Obj_00047 +1
Obj_00048 +1
Obj_00049 +1
Obj_00050 -1
Obj_00051 +1
Obj_00052 -1
Obj_00053 -1
Obj_00054 +1
Obj_00055 +1
Obj_00056 +1
Obj_00057 -1
Obj_00058 -1
Obj_00059 -1
Obj_00060 -1
Obj_00061 -1
Obj_00062 +1
Obj_00063 +1
Obj_00064 -1
Obj_00065 +1
Obj_00066 -1
Obj_00067 -1
Obj_00068 -1
Obj_00069 +1
Obj_00070 -1
Obj_00071 -1
Obj_00072 -1
Obj_00073 -1
Obj_00074 -1
Obj_00075 +1
Obj_00076 -1
Obj_00077 -1
Obj_00078 +1
Obj_00079 -1
Obj_00080 +1
Obj_00081 +1
Obj_00082 -1
Obj_00083 -1
Obj_00084 +1
Obj_00085 -1
Obj_00086 -1
Obj_00087 -1
Obj_00088 +1