_PROBLEM CoEPrA-2006_Classification_002
_GROUP_NAME Shikha Varma-O’Brien
_GROUP_MEMBERS Shikha Varma-O’Brien
_ADDRESS Accelrys/Scitegic 10188 Telesis Court, Suite 100, San Diego CA, 92121
_MODELING_PROCEDURE
All data modeling was done within Scitegic’s Pipeline Pilot software. We generated two different models for this classification problem, first using the descriptors provided by the competition and then using our own descriptors. Both descriptor sets gave good results. We decided to report the latter first because the model a) uses very simple descriptors and b) produced a better cross-validated ROC score. Both models/results are provided for your review with details below.

Model #1: We took the peptide dataset and calculated very simple position-specific descriptors, that is, which residue occurs at position 1, 2, etc. in the given set of peptides. Once the descriptors were generated using a simple Pilot script in Pipeline Pilot, we generated a Bayesian model.

The Bayesian analysis method available in Pipeline Pilot is a method for the binary categorization of molecular data. The scientist presents the data to the method, with some subset marked as “good”; the system builds a model that returns a number which can be used to rank compounds from most to least likely to be members of this “good” subset. The learning process starts by generating a large set of Boolean (yes/no) features from the input descriptors, then collects the frequency of occurrence of each feature in the “good” subset and in all data samples. To apply the model to a particular sample, the features of the sample are generated, and a weight is calculated for each feature using a Laplacian-adjusted probability estimate. The weights are summed to provide a probability estimate, which is a relative predictor of the likelihood of that sample being from the “good” subset.
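The procedure described above (position-specific Boolean features, per-feature frequency counts, summed weights) can be sketched in a few lines of Python. This is only an illustration, not the Pipeline Pilot implementation: the peptide strings and labels are made up, the function names are hypothetical, and the weight formula log((A + 1) / (B * P(Active) + 1)) is the commonly cited form of the Laplacian-corrected estimator, assumed here rather than taken from this submission. In it, B is the number of training samples containing a feature, A the number of those that are “good”, and P(Active) the overall fraction of “good” samples.

```python
import math
from collections import Counter

def position_features(peptide):
    """One Boolean feature per (position, residue) pair, e.g. (1, 'A')."""
    return {(i + 1, residue) for i, residue in enumerate(peptide)}

def train_weights(peptides, labels):
    """Count feature occurrences overall (B) and among 'good' samples (A),
    then turn the counts into per-feature weights."""
    count_all, count_good = Counter(), Counter()
    for peptide, good in zip(peptides, labels):
        for feature in position_features(peptide):
            count_all[feature] += 1
            if good:
                count_good[feature] += 1
    p_active = sum(labels) / len(labels)  # baseline fraction of "good" samples
    # ASSUMPTION: weight = log((A + 1) / (B * P(Active) + 1))
    return {
        feature: math.log((count_good[feature] + 1) / (b * p_active + 1))
        for feature, b in count_all.items()
    }

def score(weights, peptide):
    """Sum the weights of the features present; unseen features add 0."""
    return sum(weights.get(f, 0.0) for f in position_features(peptide))

# Hypothetical toy data: 1 marks the "good" class.
toy_peptides = ["ALK", "ALR", "GLK", "GVR"]
toy_labels = [1, 1, 0, 0]
toy_weights = train_weights(toy_peptides, toy_labels)
for peptide in toy_peptides:
    print(peptide, round(score(toy_weights, peptide), 3))
```

A higher summed score ranks a peptide as more likely to belong to the “good” subset. A feature that is equally common in both classes, such as K at position 3 in the toy data, gets a weight of zero under this formula.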
The Laplacian-corrected estimator is used to adjust the uncorrected probability estimate of a feature to account for the different sampling frequencies of different features. The derivation is as follows: assume that N samples are available for training, of which M are “good” (active). An estimate of the baseline probability of a randomly chosen sample being active, P(Active), is M/N. Next, assume we are given a feature F contained in B samples, and that A of those B samples are active. The uncorrected estimate of activity, P(Active|F), is A/B. Further details of these Bayesian statistics are available in a recent application paper, J. Med. Chem., 47 (2004), 4463-4470, and other related articles.

Data manipulation, descriptor calculation, and model building are very easy and extremely fast in Pipeline Pilot. The whole protocol ran in less than 1 minute. The cross-validated ROC score of the Bayesian model is 0.96.

Cross-Validation Results
This model was built using 76 samples and validated using leave-one-out cross-validation. Each sample was left out one at a time, a model was built using the rest of the samples, and that model was used to predict the left-out sample. Once all the samples had predictions, a ROC plot was generated and the area under the curve (XV ROC AUC) was calculated. The best split was calculated by picking the split that minimized the sum of the percent misclassified for category members and for category nonmembers, using the cross-validated score for each sample. Using that split, a contingency table is constructed, containing the number of true positives (TP), false negatives (FN), false positives (FP), and true negatives (TN).
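The validation steps described above (leave-one-out scoring, ROC AUC, and the best split with its contingency table) can be sketched as follows. This is a minimal illustration under stated assumptions, not the Pipeline Pilot code: all function names are hypothetical, `fit` and `predict` stand in for whatever model is being validated, the AUC is computed as the fraction of correctly ordered positive/negative pairs, and the example scores are made up.

```python
def leave_one_out_scores(samples, labels, fit, predict):
    """Score each sample with a model trained on all the other samples."""
    scores = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        scores.append(predict(model, samples[i]))
    return scores

def roc_auc(scores, labels):
    """Area under the ROC curve: the fraction of (member, nonmember)
    pairs ranked in the right order, counting ties as half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_split(scores, labels):
    """Pick the cutoff minimizing (fraction of members misclassified)
    + (fraction of nonmembers misclassified); return it with (TP, FN, FP, TN)."""
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    best = None
    for cut in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= cut)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s >= cut)
        fn, tn = n_pos - tp, n_neg - fp
        cost = fn / n_pos + fp / n_neg
        if best is None or cost < best[0]:  # ties keep the lowest cutoff
            best = (cost, cut, (tp, fn, fp, tn))
    return best[1], best[2]

# Hypothetical cross-validated scores; 1 marks a category member.
example_scores = [0.9, 0.8, 0.5, 0.45, 0.6, 0.3, 0.1, 0.2]
example_labels = [1, 1, 1, 1, 0, 0, 0, 0]
print("XV ROC AUC:", roc_auc(example_scores, example_labels))
print("Best split and (TP, FN, FP, TN):", best_split(example_scores, example_labels))
```

In practice the scores passed to `roc_auc` and `best_split` would come from `leave_one_out_scores`, so that every sample is judged by a model that never saw it during training.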
_PREDICTION Obj_00001 +1 Obj_00002 +1 Obj_00003 +1 Obj_00004 +1 Obj_00005 +1 Obj_00006 -1 Obj_00007 +1 Obj_00008 -1 Obj_00009 +1 Obj_00010 -1 Obj_00011 -1 Obj_00012 +1 Obj_00013 +1 Obj_00014 +1 Obj_00015 +1 Obj_00016 -1 Obj_00017 +1 Obj_00018 -1 Obj_00019 +1 Obj_00020 +1 Obj_00021 +1 Obj_00022 +1 Obj_00023 +1 Obj_00024 +1 Obj_00025 -1 Obj_00026 -1 Obj_00027 +1 Obj_00028 +1 Obj_00029 -1 Obj_00030 +1 Obj_00031 -1 Obj_00032 +1 Obj_00033 +1 Obj_00034 -1 Obj_00035 -1 Obj_00036 +1 Obj_00037 +1 Obj_00038 +1 Obj_00039 +1 Obj_00040 +1 Obj_00041 -1 Obj_00042 -1 Obj_00043 -1 Obj_00044 -1 Obj_00045 +1 Obj_00046 -1 Obj_00047 +1 Obj_00048 +1 Obj_00049 +1 Obj_00050 -1 Obj_00051 +1 Obj_00052 +1 Obj_00053 +1 Obj_00054 +1 Obj_00055 +1 Obj_00056 -1 Obj_00057 -1 Obj_00058 +1 Obj_00059 +1 Obj_00060 +1 Obj_00061 -1 Obj_00062 +1 Obj_00063 +1 Obj_00064 +1 Obj_00065 -1 Obj_00066 +1 Obj_00067 -1 Obj_00068 +1 Obj_00069 -1 Obj_00070 +1 Obj_00071 +1 Obj_00072 +1 Obj_00073 +1 Obj_00074 +1 Obj_00075 +1 Obj_00076 -1