_PROBLEM CoEPrA-2006_Classification_001
_GROUP_NAME Joao Aires-de-Sousa
_GROUP_MEMBERS
Goncalo Carrera
Sunil Gupta
Yuri Binev
Joao Aires-de-Sousa
_ADDRESS
REQUIMTE, CQFB, Departamento de Quimica, Faculdade de Ciencias e Tecnologia, Universidade Nova de Lisboa, 2829-516 Caparica, Portugal. E-mail: jas@fct.unl.pt; Fax: +351 21 2948550
_MODELING_PROCEDURE
From the initial pool of 5787 descriptors given by the organizers, 433 descriptors (file descriptors.txt) were retained after excluding:
- descriptors inter-correlated above a Pearson correlation coefficient of 0.62;
- descriptors with values in the prediction set outside the range of the training set.

With the selected descriptors, classification models were built with Support Vector Machines (SVMs). An ensemble of 5 SVMs was used for classification. The R program was employed (version 2.2.1 2005-12-20 r36812) with the kernlab library (version 0.8-1). After manual optimization of the parameters C and nu, the values C=8 and nu=0.008 were chosen on the basis of 3-fold cross-validation results for the training set. The default value of epsilon (0.1) was used, together with a Gaussian radial basis function kernel. The output files are named class_svmx_output.txt (x from 1 to 5), and the parameters are in the file class_parameters_svm.txt.

Experiments with Random Forests were performed to select the most relevant descriptors to train SVMs. These experiments always resulted in higher 3-fold cross-validation errors than the experiments with the 433 descriptors and no selection of features. This suggests that different mechanisms of action are involved, and led us to use the SVMs trained with all 433 descriptors.

The nona-peptides in the training set were aligned with the CLUSTALW program (multiple sequence alignment program for DNA or proteins) using default parameters at the EBI web site (http://www.ebi.ac.uk/clustalw/). The alignment produced a cladogram that was compared with the classification labels.
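The descriptor selection described at the start of this procedure (correlation filter at 0.62 and range check against the training set) can be sketched as follows. This is a minimal Python illustration, not the procedure actually used: the function names, the greedy keep-first order, the use of the absolute correlation, and the computation of correlations on the training set only are all assumptions, since the text does not specify them.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant column: treated as uncorrelated (assumption)
    return sxy / sqrt(sxx * syy)


def select_descriptors(train, pred, threshold=0.62):
    """Return the names of the descriptors that pass both filters.

    train, pred: dicts mapping descriptor name -> list of values
    (hypothetical layout, one list per descriptor column).
    """
    # Filter 1: drop descriptors whose prediction-set values fall
    # outside the range observed in the training set.
    in_range = [
        name for name, values in train.items()
        if min(values) <= min(pred[name]) and max(pred[name]) <= max(values)
    ]
    # Filter 2: greedily keep a descriptor only if its absolute
    # correlation with every descriptor kept so far is <= threshold.
    kept = []
    for name in in_range:
        if all(abs(pearson(train[name], train[k])) <= threshold for k in kept):
            kept.append(name)
    return kept
```

With a greedy filter of this kind the result depends on the order in which descriptors are visited, so a different ordering of the 5787 columns could retain a different subset.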
The good general agreement observed between clusters of the cladogram and the classes of the peptides suggested that the classification of peptides in the prediction set could be assisted by the cladograms.

The final predictions for the peptides of the prediction set were obtained by majority vote of the 5 SVMs of the ensemble when the sum of the 5 probabilities was more than 3.5. When the sum of the 5 probabilities was less than 3.5, the peptide was predicted from the class of its neighbors in the cladogram built with the sequences of peptides in the training and prediction sets together (file cladogram-clustalw_tr_te.jpg). For that reason, the predictions obtained from the ensemble of SVMs were changed for the peptides 16, 28, 30, 41, 44, 53, 62 and 75.
_PREDICTION
Obj_00001 -1
Obj_00002 +1
Obj_00003 -1
Obj_00004 -1
Obj_00005 +1
Obj_00006 +1
Obj_00007 -1
Obj_00008 -1
Obj_00009 -1
Obj_00010 +1
Obj_00011 +1
Obj_00012 +1
Obj_00013 +1
Obj_00014 +1
Obj_00015 -1
Obj_00016 -1
Obj_00017 -1
Obj_00018 +1
Obj_00019 -1
Obj_00020 -1
Obj_00021 +1
Obj_00022 +1
Obj_00023 -1
Obj_00024 +1
Obj_00025 +1
Obj_00026 +1
Obj_00027 +1
Obj_00028 +1
Obj_00029 +1
Obj_00030 +1
Obj_00031 +1
Obj_00032 +1
Obj_00033 -1
Obj_00034 -1
Obj_00035 +1
Obj_00036 -1
Obj_00037 -1
Obj_00038 -1
Obj_00039 -1
Obj_00040 +1
Obj_00041 +1
Obj_00042 -1
Obj_00043 -1
Obj_00044 +1
Obj_00045 +1
Obj_00046 -1
Obj_00047 +1
Obj_00048 +1
Obj_00049 +1
Obj_00050 +1
Obj_00051 +1
Obj_00052 +1
Obj_00053 -1
Obj_00054 -1
Obj_00055 +1
Obj_00056 +1
Obj_00057 +1
Obj_00058 -1
Obj_00059 +1
Obj_00060 -1
Obj_00061 -1
Obj_00062 +1
Obj_00063 +1
Obj_00064 -1
Obj_00065 +1
Obj_00066 +1
Obj_00067 +1
Obj_00068 -1
Obj_00069 +1
Obj_00070 -1
Obj_00071 -1
Obj_00072 +1
Obj_00073 -1
Obj_00074 -1
Obj_00075 -1
Obj_00076 -1
Obj_00077 -1
Obj_00078 +1
Obj_00079 -1
Obj_00080 +1
Obj_00081 +1
Obj_00082 -1
Obj_00083 -1
Obj_00084 +1
Obj_00085 -1
Obj_00086 -1
Obj_00087 +1
Obj_00088 -1