_PROBLEM
CoEPrA-2006_Classification_002

_GROUP_NAME
David Farrelly

_GROUP_MEMBERS
David Farrelly
Ernestine Lee
Kevin Hestir
Loren Hansen

_ADDRESS
Dr. David Farrelly
Department of Chemistry and Biochemistry
Utah State University
Logan, UT 84322-0300
USA

_MODELING_PROCEDURE
Name of method: GENTREE

First, we randomly separated the 76-case training set into two groups of 56 and 20 cases. The 56-case group served as a new training set and the 20-case group was set aside as a new test set. This let us grade how well the method selects descriptors: descriptors were selected using the 56 cases and tested on the 20 cases, which gave us a feel for how well the method was doing. For the final run, all 76 cases were recombined and a new set of descriptors was selected from the combined 76; these were used to make the final predictions that were submitted.

The classification method consists of two parts: a heuristic optimization algorithm and a random forest of decision trees. First, the given training set is randomly split into a smaller training set and a trial set. In this case, the initial 56 cases were split into a smaller training set of 40 and a trial set of 16. The heuristic optimization algorithm then tries to find the subset of descriptors that best classifies the data in the trial set, judged by how well the random forest of decision trees performs using that subset. The output of the optimization algorithm is a binary string of length 5144: a one means use the corresponding descriptor and a zero means do not use it. The two smaller datasets are then recombined, and a new random selection produces a new training set of 40 and a new trial set of 16. The optimization algorithm again tries to find the subset of descriptors that best classifies the new trial set. This process is repeated for many different random splits of the data into training and trial sets.
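As a rough illustration of the splitting and encoding steps described above, the following Python sketch shows one random 40/16 split of 56 cases and a 5144-bit descriptor mask. It is not the GENTREE code: the function names, the bit-flip mutation operator and the 0.01 mutation rate are assumptions made for illustration, and the random forest evaluation is left out entirely.

```python
import random

N_DESCRIPTORS = 5144  # length of the binary string, as stated in the text


def random_split(case_ids, n_train, rng):
    """Randomly partition case_ids into a training part and a trial part."""
    shuffled = list(case_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def selected_descriptors(mask):
    """Translate a 0/1 mask into the 0-based indices of the descriptors in use."""
    return [i for i, bit in enumerate(mask) if bit == 1]


def mutate(mask, rate, rng):
    """Flip each bit independently with probability `rate`.

    Bit-flip mutation is one common genetic-algorithm operator; the text
    does not say which operators were actually used.
    """
    return [1 - bit if rng.random() < rate else bit for bit in mask]


rng = random.Random(0)        # fixed seed so the sketch is repeatable
cases = list(range(56))       # the 56 cases kept for descriptor selection
train, trial = random_split(cases, 40, rng)

# A random starting mask and one mutated child (hypothetical GA step).
mask = [rng.randint(0, 1) for _ in range(N_DESCRIPTORS)]
child = mutate(mask, 0.01, rng)
```

Calling `random_split` again with the same `cases` gives the next training/trial pair, so repeating the process over many splits is a simple loop around these two steps.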
Finally, a count is made of how often each descriptor appeared in the subsets returned by the optimization algorithm. The descriptors selected most often, regardless of how the data were split, become our final set. For example, if 100 random splits were done and descriptor x appeared in the best-classifying subset for 75 of them while descriptor y appeared in only 10, we would conclude that descriptor x is important and descriptor y probably is not. The final step was to take the most frequently used descriptors and see how well they perform on an out-of-sample test set. We hoped this approach would minimize overfitting. The optimization algorithm used here is a genetic algorithm. We measured how well a given subset of descriptors classifies using a variety of metrics, including area under the ROC curve, sensitivity, specificity, and error rate.
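The counting step and some of the metrics can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the 0.5 selection cutoff is invented for the example (the text gives no cutoff), and the area under the ROC curve is omitted because it needs classifier scores rather than hard +1/-1 labels.

```python
from collections import Counter


def selection_counts(masks):
    """Count, per descriptor index, how many masks switched it on."""
    counts = Counter()
    for mask in masks:
        counts.update(i for i, bit in enumerate(mask) if bit == 1)
    return counts


def frequent_descriptors(masks, min_fraction):
    """Indices selected in at least min_fraction of the splits.

    The cutoff is a hypothetical parameter, not a value from the text.
    """
    counts = selection_counts(masks)
    return sorted(i for i, c in counts.items() if c >= min_fraction * len(masks))


def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity and error rate for labels coded +1 / -1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "error_rate": (fp + fn) / len(pairs),
    }


# The example from the text: 100 splits, descriptor x (index 0) used in 75
# of them and descriptor y (index 1) used in only 10.
masks = [[1 if k < 75 else 0, 1 if k < 10 else 0] for k in range(100)]
counts = selection_counts(masks)
kept = frequent_descriptors(masks, 0.5)

# A made-up four-case check of the metrics.
metrics = classification_metrics([1, 1, -1, -1], [1, -1, -1, -1])
```

With the toy masks, only index 0 passes the assumed 0.5 cutoff, which matches the conclusion in the text that descriptor x is important and descriptor y probably is not.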

Descriptors Used
A total of 131 descriptors were used in the final model:
Desc_00012
Desc_00065
Desc_00185
Desc_00361
Desc_00381
Desc_00448
Desc_00450
Desc_00479
Desc_00572
Desc_00637
Desc_00652
Desc_00673
Desc_00675
Desc_00698
Desc_00706
Desc_00752
Desc_00755
Desc_00761
Desc_00793
Desc_00861
Desc_00872
Desc_00875
Desc_00893
Desc_00936
Desc_00941
Desc_00955
Desc_00962
Desc_00982
Desc_00996
Desc_01002
Desc_01048
Desc_01090
Desc_01091
Desc_01092
Desc_01117
Desc_01118
Desc_01147
Desc_01186
Desc_01213
Desc_01236
Desc_01237
Desc_01248
Desc_01273
Desc_01274
Desc_01421
Desc_01732
Desc_01880
Desc_02064
Desc_02066
Desc_02683
Desc_02695
Desc_02698
Desc_02746
Desc_02877
Desc_02880
Desc_02884
Desc_02915
Desc_02918
Desc_02944
Desc_02958
Desc_02974
Desc_03085
Desc_03089
Desc_03123
Desc_03180
Desc_03182
Desc_03194
Desc_03245
Desc_03249
Desc_03280
Desc_03300
Desc_03302
Desc_03304
Desc_03321
Desc_03326
Desc_03342
Desc_03343
Desc_03361
Desc_03367
Desc_03399
Desc_03447
Desc_03465
Desc_03475
Desc_03484
Desc_03555
Desc_03616
Desc_03659
Desc_03733
Desc_03734
Desc_03741
Desc_03754
Desc_03760
Desc_03766
Desc_03789
Desc_03798
Desc_03802
Desc_03803
Desc_03805
Desc_03835
Desc_03870
Desc_03871
Desc_03908
Desc_04009
Desc_04055
Desc_04058
Desc_04074
Desc_04211
Desc_04306
Desc_04307
Desc_04310
Desc_04332
Desc_04337
Desc_04341
Desc_04358
Desc_04360
Desc_04362
Desc_04363
Desc_04405
Desc_04428
Desc_04447
Desc_04495
Desc_04642
Desc_04697
Desc_04825
Desc_04887
Desc_04950
Desc_05019
Desc_05077
Desc_05090
Desc_05093
Desc_05094

_PREDICTION
Obj_00001 +1
Obj_00002 +1
Obj_00003 +1
Obj_00004 +1
Obj_00005 +1
Obj_00006 -1
Obj_00007 +1
Obj_00008 -1
Obj_00009 +1
Obj_00010 -1
Obj_00011 -1
Obj_00012 +1
Obj_00013 +1
Obj_00014 +1
Obj_00015 +1
Obj_00016 -1
Obj_00017 +1
Obj_00018 -1
Obj_00019 +1
Obj_00020 +1
Obj_00021 +1
Obj_00022 -1
Obj_00023 +1
Obj_00024 +1
Obj_00025 -1
Obj_00026 -1
Obj_00027 +1
Obj_00028 +1
Obj_00029 -1
Obj_00030 +1
Obj_00031 -1
Obj_00032 +1
Obj_00033 -1
Obj_00034 -1
Obj_00035 -1
Obj_00036 -1
Obj_00037 +1
Obj_00038 -1
Obj_00039 +1
Obj_00040 +1
Obj_00041 -1
Obj_00042 +1
Obj_00043 -1
Obj_00044 +1
Obj_00045 +1
Obj_00046 -1
Obj_00047 -1
Obj_00048 +1
Obj_00049 +1
Obj_00050 -1
Obj_00051 +1
Obj_00052 +1
Obj_00053 +1
Obj_00054 +1
Obj_00055 +1
Obj_00056 -1
Obj_00057 -1
Obj_00058 +1
Obj_00059 -1
Obj_00060 +1
Obj_00061 -1
Obj_00062 +1
Obj_00063 +1
Obj_00064 +1
Obj_00065 -1
Obj_00066 +1
Obj_00067 -1
Obj_00068 +1
Obj_00069 -1
Obj_00070 +1
Obj_00071 +1
Obj_00072 +1
Obj_00073 +1
Obj_00074 +1
Obj_00075 +1
Obj_00076 +1