_PROBLEM CoEPrA-2006_Classification_004
_GROUP_NAME David Farrelly
_GROUP_MEMBERS
David Farrelly
Ernestine Lee
Kevin Hestir
Loren Hansen
_ADDRESS
Dr. David Farrelly
Department of Chemistry and Biochemistry
Utah State University
Logan, UT 84322-0300
USA
_MODELING_PROCEDURE
Name of method: GENTREE

First we separated the training data into two groups of size 61 and size 50. The size-61 group was used as a new training set and the size-50 group was set aside as a new test set. This let us grade how well the method selects descriptors: descriptors were selected using the 61 cases and tested on the 50 cases, which gave us a feel for how well we were doing.

The classification method consists of two parts: a heuristic optimization algorithm and a random forest of decision trees. First, the given training set is randomly split into a smaller training set and a trial set; in this case, the initial 61 cases were split into a training set of size 20 and a trial set of size 41. The heuristic optimization algorithm then tries to find the subset of descriptors that best classifies the data in the trial set, judged by how well the random forest of decision trees performs using that subset. The output of the optimization algorithm is a binary string of length 5787: a one means use the given descriptor, a zero means do not use it.

The two smaller datasets are then recombined, and a new random selection produces a new training set of size 20 and a new trial set of size 41. The optimization algorithm again tries to find the subset of descriptors that best classifies the new trial set. This process is repeated for many different random splits of the data into training and trial sets. Finally, a count is made of how often each descriptor was part of the subset obtained by the optimization algorithm.
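The repeated-split counting described above can be sketched as follows. This is a minimal illustration in Python and not the GENTREE code itself: select_subset is a hypothetical stand-in for the genetic algorithm plus random forest, and toy_selector is a placeholder that returns a three-descriptor mask instead of the full 5787-bit string. Only the split sizes (61 cases, 20 for training, 41 for trial) come from the procedure; the number of splits and the seed are assumptions.

```python
import random
from collections import Counter


def selection_frequency(n_cases, n_train, n_splits, select_subset, seed=0):
    """Count how often each descriptor index is selected across random splits.

    select_subset(train_idx, trial_idx, rng) is a hypothetical callable that
    stands in for the optimization step. It must return a binary list where
    1 means "use this descriptor" and 0 means "do not use".
    """
    rng = random.Random(seed)
    counts = Counter()
    cases = list(range(n_cases))
    for _ in range(n_splits):
        # Recombine and re-split: shuffle all cases, then cut into two parts.
        rng.shuffle(cases)
        train_idx, trial_idx = cases[:n_train], cases[n_train:]
        mask = select_subset(train_idx, trial_idx, rng)
        # Tally every descriptor whose bit is set in this split's mask.
        counts.update(i for i, bit in enumerate(mask) if bit)
    return counts


def toy_selector(train_idx, trial_idx, rng):
    # Placeholder only: descriptor 0 is always chosen, descriptor 1 never,
    # descriptor 2 about half the time. A real selector would search masks
    # and score each one by classifier performance on the trial cases.
    return [1, 0, 1 if rng.random() < 0.5 else 0]


counts = selection_frequency(
    n_cases=61, n_train=20, n_splits=100, select_subset=toy_selector
)
ranked = counts.most_common()  # descriptors ordered by selection count
```

The ranked list puts the most frequently selected descriptors first, which is the quantity the procedure uses to decide which descriptors to keep.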
We then looked to see which descriptors were selected most often regardless of how the data was split, and these became our final set. For example, if 100 random splits were done, descriptor x might be used in 75 of those splits as part of the subset that best classifies the data, while descriptor y might be used only 10 times. We would then conclude that descriptor x is important and descriptor y is probably not. The final step was to take the subset of descriptors used most often and see how well they did on an out-of-sample test set. By using this approach we hoped to minimize overfitting.

The optimization algorithm used in this case was a genetic algorithm. We measured how well a given subset of descriptors classifies using a variety of metrics, including area under the ROC curve, sensitivity, specificity, and the error rate.

Descriptors Used
A total of 28 descriptors were used in the final model:
Desc_01413 21
Desc_01605 28
Desc_01733 37
Desc_01875 26
Desc_02581 31
Desc_02582 33
Desc_02583 46
Desc_02604 65
Desc_02605 20
Desc_02681 57
Desc_02722 42
Desc_03021 20
Desc_03046 43
Desc_03047 50
Desc_03076 83
Desc_03077 43
Desc_03161 50
Desc_03360 20
Desc_04574 108
Desc_05154 27
Desc_05155 34
Desc_05281 23
Desc_05590 25
Desc_05591 22
Desc_05592 24
Desc_05596 26
Desc_05627 22
Desc_05734 21
_PREDICTION
Obj_00001 -1
Obj_00002 +1
Obj_00003 -1
Obj_00004 -1
Obj_00005 -1
Obj_00006 -1
Obj_00007 -1
Obj_00008 -1
Obj_00009 -1
Obj_00010 +1
Obj_00011 -1
Obj_00012 -1
Obj_00013 -1
Obj_00014 +1
Obj_00015 +1
Obj_00016 -1
Obj_00017 -1
Obj_00018 +1
Obj_00019 -1
Obj_00020 -1
Obj_00021 -1
Obj_00022 +1
Obj_00023 -1
Obj_00024 +1
Obj_00025 -1
Obj_00026 +1
Obj_00027 -1
Obj_00028 +1
Obj_00029 -1
Obj_00030 +1
Obj_00031 -1
Obj_00032 -1
Obj_00033 -1
Obj_00034 -1
Obj_00035 +1
Obj_00036 -1
Obj_00037 +1
Obj_00038 -1
Obj_00039 +1
Obj_00040 -1
Obj_00041 -1
Obj_00042 -1
Obj_00043 +1
Obj_00044 -1
Obj_00045 +1
Obj_00046 +1
Obj_00047 +1
Obj_00048 -1
Obj_00049 -1
Obj_00050 -1
Obj_00051 -1
Obj_00052 -1
Obj_00053 -1
Obj_00054 +1
Obj_00055 +1
Obj_00056 +1
Obj_00057 +1
Obj_00058 -1
Obj_00059 +1
Obj_00060 +1
Obj_00061 -1
Obj_00062 -1
Obj_00063 -1
Obj_00064 +1
Obj_00065 -1
Obj_00066 -1
Obj_00067 -1
Obj_00068 -1
Obj_00069 -1
Obj_00070 -1
Obj_00071 +1
Obj_00072 -1
Obj_00073 +1
Obj_00074 +1
Obj_00075 -1
Obj_00076 +1
Obj_00077 +1
Obj_00078 -1
Obj_00079 -1
Obj_00080 -1
Obj_00081 -1
Obj_00082 -1
Obj_00083 +1
Obj_00084 +1
Obj_00085 -1
Obj_00086 +1
Obj_00087 -1
Obj_00088 -1
Obj_00089 +1
Obj_00090 -1
Obj_00091 -1
Obj_00092 -1
Obj_00093 +1
Obj_00094 -1
Obj_00095 +1
Obj_00096 -1
Obj_00097 -1
Obj_00098 -1
Obj_00099 -1
Obj_00100 -1
Obj_00101 -1
Obj_00102 +1
Obj_00103 +1
Obj_00104 -1
Obj_00105 -1
Obj_00106 +1
Obj_00107 +1
Obj_00108 -1
Obj_00109 -1
Obj_00110 +1
Obj_00111 -1