_PROBLEM CoEPrA-2006_Classification_003
_GROUP_NAME David Farrelly
_GROUP_MEMBERS David Farrelly, Ernestine Lee, Kevin Hestir, Loren Hansen
_ADDRESS Dr. David Farrelly, Department of Chemistry and Biochemistry, Utah State University, Logan, UT 84322-0300, USA
_MODELING_PROCEDURE
Name of method: GENTREE

First, we randomly separated the training data into two groups of size 100 and size 33. The size-100 group was used as a new training set, and the size-33 group was set aside as a new test set. This gave us a way to grade how well the method selects descriptors: descriptors are selected using the 100 cases and then tested on the 33 cases.

The classification method consists of two parts: a heuristic optimization algorithm and a random forest of decision trees. First, the given training set is randomly split into a smaller training set and a trial set. In this case, the initial 100 cases are split into a training set of size 70 and a trial set of size 30. The heuristic optimization algorithm then tries to find the subset of descriptors that best classifies the data in the trial set, judged by how well the random forest of decision trees performs using that subset. The output of the optimization algorithm is a binary string of length 5787, in which a one means the corresponding descriptor is used and a zero means it is not. The two smaller datasets are then recombined, and a new random selection produces a new training set of size 70 and a new trial set of size 30. The optimization algorithm again tries to find the subset of descriptors that best classifies the new trial set. This process is repeated for many different random splits of the data into training and trial sets. Finally, a count is made of how often each descriptor appeared in the subset of descriptors obtained by the optimization algorithm.
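The repeated split, search, and count loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the GENTREE code: the data are synthetic, a nearest-centroid classifier stands in for the random forest so that the example needs only the standard library, and every function name and genetic-algorithm setting (population size, number of generations, mutation rate) is hypothetical.

```python
import random
from collections import Counter

def centroid_accuracy(mask, train, trial):
    """Trial-set accuracy of a nearest-centroid classifier restricted to the
    descriptors switched on in `mask` (assumed stand-in for the random forest)."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0
    centroids = {}
    for label in (+1, -1):
        rows = [x for x, y in train if y == label]
        if not rows:
            return 0.0
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    correct = 0
    for x, y in trial:
        dist = {lab: sum((x[j] - c) ** 2 for j, c in zip(cols, cen))
                for lab, cen in centroids.items()}
        correct += (min(dist, key=dist.get) == y)
    return correct / len(trial)

def genetic_search(train, trial, n_desc, rng,
                   pop_size=20, generations=15, p_mut=0.05):
    """Tiny genetic algorithm over binary strings (1 = use the descriptor)."""
    def fitness(mask):
        return centroid_accuracy(mask, train, trial)

    pop = [[rng.randint(0, 1) for _ in range(n_desc)] for _ in range(pop_size)]
    for _ in range(generations):
        # Truncation selection: keep the better half as parents.
        parents = sorted(pop, key=fitness, reverse=True)[:pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            child = [bit ^ (rng.random() < p_mut) for bit in child]  # bit-flip mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

def selection_counts(data, n_desc, n_splits, train_size, rng):
    """Count how often each descriptor is in the best subset across random splits."""
    counts = Counter()
    for _ in range(n_splits):
        shuffled = data[:]
        rng.shuffle(shuffled)
        train, trial = shuffled[:train_size], shuffled[train_size:]
        best = genetic_search(train, trial, n_desc, rng)
        counts.update(j for j, bit in enumerate(best) if bit)
    return counts

def make_toy_data(n_cases, n_desc, rng):
    """Synthetic cases in which only descriptor 0 carries class information."""
    data = []
    for _ in range(n_cases):
        y = rng.choice((+1, -1))
        x = [rng.gauss(0, 1) for _ in range(n_desc)]
        x[0] += 3 * y
        data.append((x, y))
    return data

rng = random.Random(0)
data = make_toy_data(100, 8, rng)  # 100 cases, 8 toy descriptors
counts = selection_counts(data, n_desc=8, n_splits=10, train_size=70, rng=rng)
print(counts.most_common())
```

On this toy problem the informative descriptor should be selected in nearly every split, while the noise descriptors come and go with the particular split. That contrast is the signal the counting step is meant to expose.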
We then look at which descriptors are selected most often regardless of how the data was split, and these become our final set. For example, if 100 random splits were done, descriptor x might appear in 75 of the best-classifying subsets while descriptor y appears in only 10. We would then conclude that descriptor x is important and descriptor y is probably not. The final step was to take the subset of descriptors used most often and see how well it performs on an out-of-sample test set. It was hoped that this approach would minimize overfitting.

The optimization algorithm used in this case is a genetic algorithm. We measured how well a given subset of descriptors classifies using a variety of metrics, including the area under the ROC curve, sensitivity, specificity, and the error rate.

Descriptors Used
A total of 120 descriptors were used in the final model:
Desc_00010 Desc_00106 Desc_00137 Desc_00160 Desc_00182 Desc_00186 Desc_00323 Desc_00347 Desc_00416 Desc_00450 Desc_00541 Desc_00580 Desc_00652 Desc_00675 Desc_00686 Desc_00706 Desc_00710 Desc_00716 Desc_00734 Desc_00752 Desc_00755 Desc_00793 Desc_00808 Desc_00857 Desc_00966 Desc_01001 Desc_01017 Desc_01026 Desc_01027 Desc_01069 Desc_01070 Desc_01074 Desc_01090 Desc_01093 Desc_01117 Desc_01118 Desc_01126 Desc_01147 Desc_01148 Desc_01157 Desc_01161 Desc_01162 Desc_01206 Desc_01213 Desc_01219 Desc_01220 Desc_01224 Desc_01232 Desc_01236 Desc_01335 Desc_01395 Desc_01509 Desc_01652 Desc_01736 Desc_01773 Desc_01801 Desc_01812 Desc_01826 Desc_01897 Desc_01980 Desc_02240 Desc_02308 Desc_02483 Desc_02573 Desc_02644 Desc_02707 Desc_02709 Desc_02717 Desc_02727 Desc_02821 Desc_02843 Desc_02881 Desc_02927 Desc_02972 Desc_03077 Desc_03144 Desc_03160 Desc_03163 Desc_03203 Desc_03398 Desc_03869 Desc_03923 Desc_03930 Desc_03951 Desc_03970 Desc_03992 Desc_04038 Desc_04048 Desc_04191 Desc_04258 Desc_04310 Desc_04377 Desc_04512
Desc_04520 Desc_04539 Desc_04574 Desc_04619 Desc_04638 Desc_04726 Desc_04735 Desc_04753 Desc_04758 Desc_04759 Desc_04844 Desc_04854 Desc_04865 Desc_04874 Desc_04928 Desc_04947 Desc_04949 Desc_04950 Desc_04951 Desc_04952 Desc_04953 Desc_04954 Desc_05015 Desc_05029 Desc_05034 Desc_05093 Desc_05686
_PREDICTION
Obj_00001 +1 Obj_00002 +1 Obj_00003 -1 Obj_00004 +1 Obj_00005 -1 Obj_00006 -1 Obj_00007 -1 Obj_00008 -1 Obj_00009 -1 Obj_00010 -1 Obj_00011 -1 Obj_00012 -1 Obj_00013 -1 Obj_00014 -1 Obj_00015 -1 Obj_00016 -1 Obj_00017 +1 Obj_00018 -1 Obj_00019 +1 Obj_00020 -1 Obj_00021 +1 Obj_00022 +1 Obj_00023 -1 Obj_00024 -1 Obj_00025 +1 Obj_00026 -1 Obj_00027 -1 Obj_00028 -1 Obj_00029 -1 Obj_00030 +1 Obj_00031 +1 Obj_00032 -1 Obj_00033 -1 Obj_00034 -1 Obj_00035 +1 Obj_00036 -1 Obj_00037 -1 Obj_00038 -1 Obj_00039 -1 Obj_00040 -1 Obj_00041 -1 Obj_00042 -1 Obj_00043 -1 Obj_00044 -1 Obj_00045 -1 Obj_00046 +1 Obj_00047 -1 Obj_00048 -1 Obj_00049 -1 Obj_00050 +1 Obj_00051 +1 Obj_00052 -1 Obj_00053 +1 Obj_00054 -1 Obj_00055 +1 Obj_00056 -1 Obj_00057 +1 Obj_00058 +1 Obj_00059 -1 Obj_00060 -1 Obj_00061 -1 Obj_00062 -1 Obj_00063 +1 Obj_00064 +1 Obj_00065 +1 Obj_00066 -1 Obj_00067 -1 Obj_00068 +1 Obj_00069 -1 Obj_00070 +1 Obj_00071 +1 Obj_00072 +1 Obj_00073 +1 Obj_00074 -1 Obj_00075 +1 Obj_00076 +1 Obj_00077 +1 Obj_00078 -1 Obj_00079 -1 Obj_00080 -1 Obj_00081 +1 Obj_00082 +1 Obj_00083 -1 Obj_00084 -1 Obj_00085 -1 Obj_00086 -1 Obj_00087 -1 Obj_00088 -1 Obj_00089 +1 Obj_00090 -1 Obj_00091 +1 Obj_00092 -1 Obj_00093 -1 Obj_00094 -1 Obj_00095 -1 Obj_00096 +1 Obj_00097 -1 Obj_00098 -1 Obj_00099 -1 Obj_00100 -1 Obj_00101 -1 Obj_00102 +1 Obj_00103 -1 Obj_00104 +1 Obj_00105 +1 Obj_00106 -1 Obj_00107 -1 Obj_00108 -1 Obj_00109 -1 Obj_00110 -1 Obj_00111 -1 Obj_00112 -1 Obj_00113 -1 Obj_00114 -1 Obj_00115 -1 Obj_00116 -1 Obj_00117 +1 Obj_00118 +1 Obj_00119 -1 Obj_00120 +1 Obj_00121 +1 Obj_00122 -1 Obj_00123 -1 Obj_00124 -1 Obj_00125 -1 Obj_00126 -1 Obj_00127 +1 Obj_00128 -1 Obj_00129 +1 Obj_00130 +1
Obj_00131 -1 Obj_00132 +1 Obj_00133 +1