_PROBLEM CoEPrA-2006_Classification_001
_GROUP_NAME David Farrelly
_GROUP_MEMBERS David Farrelly, Ernestine Lee, Kevin Hestir, Loren Hansen
_ADDRESS
Dr. David Farrelly
Department of Chemistry and Biochemistry
Utah State University
Logan, UT 84322-0300
USA
_MODELING_PROCEDURE
Name of method: GENTREE

First, we randomly separated the training data into two groups of 60 and 29 cases. The group of 60 served as a new training set, and the group of 29 was set aside as a new test set. This gave us a way to grade how well the method selects descriptors: descriptors were selected using the 60 cases and tested on the 29 cases. For the final run, all 89 cases were combined, a new set of descriptors was selected from the combined 89, and these descriptors were used to make the final predictions that were turned in.

The classification method consists of two parts: a heuristic optimization algorithm and a random forest of decision trees. First, the given training set is randomly split into a smaller training set and a trial set. In this case, the initial 60 cases were split into a smaller training set of size 40 and a trial set of size 20. The heuristic optimization algorithm then tries to find the subset of descriptors that best classifies the data in the trial set, judged by how well the random forest of decision trees performs using that subset. The output of the optimization algorithm is a binary string of length 5787: a one means the corresponding descriptor is used, and a zero means it is not. The two smaller datasets are then recombined, and a new random selection produces a new training set of size 40 and a new trial set of size 20. The optimization algorithm again tries to find the subset of descriptors that best classifies the new trial set. This process is repeated for many different random splits of the data into training and trial sets.
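The repeated 40/20 splitting and the binary descriptor string can be illustrated with a minimal Python sketch. It is not the GENTREE code: the function names `random_split` and `apply_mask`, the fixed seed, and the choice of five splits are assumptions made only for illustration, and the genetic algorithm and random forest are left out.

```python
import random

N_CASES = 60          # cases used for descriptor selection (from the text)
N_TRAIN = 40          # size of the smaller training set (from the text)
N_DESCRIPTORS = 5787  # length of the binary descriptor string (from the text)


def random_split(n_cases, n_train, rng):
    """Randomly partition range(n_cases) into training and trial indices."""
    indices = list(range(n_cases))
    rng.shuffle(indices)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


def apply_mask(row, mask):
    """Keep only the descriptor values whose mask bit is 1."""
    return [value for value, bit in zip(row, mask) if bit == 1]


# Hypothetical settings: a fixed seed and 5 splits, chosen only so the
# sketch is repeatable and short. The source does not state either value.
rng = random.Random(0)
splits = [random_split(N_CASES, N_TRAIN, rng) for _ in range(5)]

for train_idx, trial_idx in splits:
    # Each split leaves 40 training cases and 20 trial cases.
    print(len(train_idx), len(trial_idx))
```

In a full run, each split would be handed to the optimizer, which would return one binary string of length `N_DESCRIPTORS` for that split.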
Finally, a count is made of how often each descriptor appears in the subset of descriptors returned by the optimization algorithm. We then look for the descriptors that are selected most often regardless of how the data was split, and these become our final set. For example, suppose 100 random splits were done, descriptor x was used in 75 of them as part of the subset that best classifies the data, and descriptor y was used in only 10. We would then conclude that descriptor x is important and descriptor y is probably not. The final step was to take the subset of most frequently used descriptors and see how well it does on an out-of-sample test set. By using this approach, we hoped to minimize overfitting. The optimization algorithm used in this case is a genetic algorithm. We measured how well a given subset of descriptors classifies the data using a variety of metrics, including the area under the ROC curve, sensitivity, specificity, and the error rate.
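The counting step and the simpler metrics can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used for the submission: the names `selection_counts`, `frequent_descriptors`, and `confusion_metrics` are hypothetical, the cutoff `min_fraction` is an assumption (the procedure does not say how "most often" was decided), and the area under the ROC curve is omitted because it needs classifier scores rather than hard labels.

```python
from collections import Counter


def selection_counts(masks):
    """Count, per descriptor index, how many binary masks selected it."""
    counts = Counter()
    for mask in masks:
        counts.update(i for i, bit in enumerate(mask) if bit == 1)
    return counts


def frequent_descriptors(masks, min_fraction):
    """Indices selected in at least min_fraction of the masks.

    min_fraction is a hypothetical cutoff, not a value from the source.
    """
    counts = selection_counts(masks)
    threshold = min_fraction * len(masks)
    return sorted(i for i, c in counts.items() if c >= threshold)


def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity and error rate for labels coded +1 / -1.

    Assumes both classes occur in y_true, otherwise a division by zero
    is raised.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "error_rate": (fp + fn) / len(pairs),
    }


# Toy data for illustration only: 4 splits over 3 descriptors.
toy_masks = [[1, 0, 1], [1, 0, 0], [1, 1, 0], [1, 0, 1]]
print(selection_counts(toy_masks))
print(frequent_descriptors(toy_masks, 0.5))
```

With the toy masks, descriptor 0 is selected in every split and descriptor 1 in only one, which mirrors the descriptor x versus descriptor y example in the text.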
_PREDICTION
Obj_00001 -1
Obj_00002 +1
Obj_00003 -1
Obj_00004 -1
Obj_00005 +1
Obj_00006 +1
Obj_00007 -1
Obj_00008 -1
Obj_00009 -1
Obj_00010 +1
Obj_00011 +1
Obj_00012 +1
Obj_00013 +1
Obj_00014 +1
Obj_00015 -1
Obj_00016 -1
Obj_00017 -1
Obj_00018 +1
Obj_00019 +1
Obj_00020 -1
Obj_00021 +1
Obj_00022 +1
Obj_00023 -1
Obj_00024 +1
Obj_00025 -1
Obj_00026 +1
Obj_00027 +1
Obj_00028 -1
Obj_00029 +1
Obj_00030 +1
Obj_00031 +1
Obj_00032 +1
Obj_00033 -1
Obj_00034 -1
Obj_00035 +1
Obj_00036 -1
Obj_00037 -1
Obj_00038 +1
Obj_00039 -1
Obj_00040 +1
Obj_00041 -1
Obj_00042 -1
Obj_00043 -1
Obj_00044 -1
Obj_00045 +1
Obj_00046 -1
Obj_00047 +1
Obj_00048 +1
Obj_00049 +1
Obj_00050 +1
Obj_00051 -1
Obj_00052 -1
Obj_00053 -1
Obj_00054 -1
Obj_00055 +1
Obj_00056 +1
Obj_00057 -1
Obj_00058 -1
Obj_00059 -1
Obj_00060 -1
Obj_00061 -1
Obj_00062 +1
Obj_00063 +1
Obj_00064 -1
Obj_00065 +1
Obj_00066 -1
Obj_00067 +1
Obj_00068 +1
Obj_00069 +1
Obj_00070 -1
Obj_00071 +1
Obj_00072 -1
Obj_00073 -1
Obj_00074 -1
Obj_00075 +1
Obj_00076 -1
Obj_00077 -1
Obj_00078 -1
Obj_00079 -1
Obj_00080 +1
Obj_00081 +1
Obj_00082 -1
Obj_00083 -1
Obj_00084 +1
Obj_00085 -1
Obj_00086 -1
Obj_00087 +1
Obj_00088 -1