_PROBLEM CoEPrA-2006_Classification_001

_GROUP_NAME Alexander Zelikovsky

_GROUP_MEMBERS
Alexander Zelikovsky
Dumitru Brinza

_ADDRESS
Alexander Zelikovsky
Department of Computer Science, Georgia State University
Phone: (404) 651-0676
Fax: (815) 642-0052
Email: alexz@cs.gsu.edu
Office: 1443, Peachtree Str. 34
web:http://www.cs.gsu.edu/~cscazz/

Dumitru Brinza
Department of Computer Science, Georgia State University
Phone: (404) 463-2808
Email: dima@cs.gsu.edu
Office: 1415, Peachtree Str. 34
web:http://www.cs.gsu.edu/~cscdubx/

_MODELING_PROCEDURE

METHOD DESCRIPTION

The following approach was successfully used for disease susceptibility prediction based on genotype data. A full description of the method is available at:
http://suez.cs.gsu.edu/%7Ecscazz/postscript/wabi06.pdf

For the CoEPrA competition we slightly modified the method to handle the amino acid alphabet in place of the values 0, 1, 2 that we used to encode SNPs.

Short description of the proposed method:

1. For each protein sequence s we find the most significant multi-amino-acid combination that attributes s to class +1 (MAAC+) and the most significant combination that attributes s to class -1 (MAAC-). These combinations are found by exhaustive search on the training (calibration) dataset. For datasets with longer protein sequences we use combinatorial and greedy searches, which are described in
http://suez.cs.gsu.edu/%7Ecscazz/postscript/bgrs06.pdf

2. Based on the significance (p-value) of MAAC+ (SG+) and the significance of MAAC- (SG-), we decide whether s belongs to class +1 or class -1. If (SG+)/(SG-) < CONST, then s is assigned to class -1; otherwise it is assigned to class +1. The value of CONST is computed as described in step 3.

3. On the training dataset we perform a leave-one-out test, computing CC = (SG+)/(SG-) for each left-out sequence. We then sort all sequences by CC and find a position such that all sequences below it are classified as -1 and all sequences above it as +1. The position is chosen to minimize the classification error.
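Steps 2 and 3 can be illustrated with a minimal Python sketch. It is not our actual implementation: the function names `choose_const` and `classify` are hypothetical, the SG+ and SG- values are assumed to be already computed positive numbers, and placing CONST halfway between the two CC values adjacent to the chosen position is an assumption, since any value between them gives the same split of the training data.

```python
from typing import Sequence, Tuple


def choose_const(cc: Sequence[float], labels: Sequence[int]) -> Tuple[float, int]:
    """Pick CONST for the rule 'CC < CONST -> -1, otherwise +1'.

    cc     -- leave-one-out ratios CC = (SG+)/(SG-), one per training sequence
    labels -- true classes (+1 or -1) of the same sequences
    Returns (CONST, number of training errors at that cut).
    """
    if not cc or len(cc) != len(labels):
        raise ValueError("cc and labels must be non-empty and of equal length")

    pairs = sorted(zip(cc, labels))
    n = len(pairs)

    # Cut below the smallest CC: everything is predicted +1,
    # so every true -1 sequence is an error.
    errors = sum(1 for _, y in pairs if y == -1)
    best_errors, best_const = errors, pairs[0][0]

    i = 0
    while i < n:
        # Move all sequences sharing this CC value below the cut together.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            # A +1 moved below the cut becomes an error; a -1 stops being one.
            errors += 1 if pairs[j][1] == 1 else -1
            j += 1
        if errors < best_errors:
            best_errors = errors
            # Assumption: midpoint between neighbours; infinity if the cut
            # lies above every sequence (everything predicted -1).
            best_const = (pairs[i][0] + pairs[j][0]) / 2 if j < n else float("inf")
        i = j
    return best_const, best_errors


def classify(sg_plus: float, sg_minus: float, const: float) -> int:
    """Apply the decision rule of step 2 to one sequence."""
    return -1 if sg_plus / sg_minus < const else 1
```

For example, with the made-up ratios `[0.25, 0.5, 1.0, 1.5, 2.0, 3.0]` and classes `[-1, -1, +1, -1, +1, +1]`, `choose_const` returns `(0.75, 1)`: the cut falls between 0.5 and 1.0 and one training sequence (CC = 1.5, class -1) is misclassified.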

METHOD VALIDATION

We validated our method by performing a leave-one-out test on the training (calibration) data CoEPrA-2006_Classification_001_Calibration_Peptides.txt. The result of the run can be found below. The estimated accuracy is 76.4%.

_PREDICTION
Obj_00001 +1
Obj_00002 +1
Obj_00003 +1
Obj_00004 -1
Obj_00005 -1
Obj_00006 -1
Obj_00007 -1
Obj_00008 -1
Obj_00009 -1
Obj_00010 -1
Obj_00011 -1
Obj_00012 -1
Obj_00013 -1
Obj_00014 -1
Obj_00015 -1
Obj_00016 -1
Obj_00017 -1
Obj_00018 -1
Obj_00019 +1
Obj_00020 -1
Obj_00021 +1
Obj_00022 -1
Obj_00023 -1
Obj_00024 +1
Obj_00025 -1
Obj_00026 +1
Obj_00027 -1
Obj_00028 +1
Obj_00029 +1
Obj_00030 -1
Obj_00031 -1
Obj_00032 +1
Obj_00033 +1
Obj_00034 -1
Obj_00035 -1
Obj_00036 -1
Obj_00037 -1
Obj_00038 +1
Obj_00039 -1
Obj_00040 +1
Obj_00041 +1
Obj_00042 +1
Obj_00043 -1
Obj_00044 +1
Obj_00045 +1
Obj_00046 +1
Obj_00047 +1
Obj_00048 -1
Obj_00049 +1
Obj_00050 +1
Obj_00051 +1
Obj_00052 +1
Obj_00053 -1
Obj_00054 +1
Obj_00055 +1
Obj_00056 +1
Obj_00057 +1
Obj_00058 +1
Obj_00059 +1
Obj_00060 +1
Obj_00061 +1
Obj_00062 +1
Obj_00063 +1
Obj_00064 -1
Obj_00065 +1
Obj_00066 +1
Obj_00067 +1
Obj_00068 -1
Obj_00069 +1
Obj_00070 +1
Obj_00071 +1
Obj_00072 +1
Obj_00073 -1
Obj_00074 +1
Obj_00075 +1
Obj_00076 +1
Obj_00077 +1
Obj_00078 +1
Obj_00079 +1
Obj_00080 +1
Obj_00081 +1
Obj_00082 +1
Obj_00083 +1
Obj_00084 +1
Obj_00085 +1
Obj_00086 +1
Obj_00087 +1
Obj_00088 +1