_PROBLEM CoEPrA-2006_Classification_004
_GROUP_NAME Artem Cherkasov
_GROUP_MEMBERS Emre Karakoc, Cenk Sahinalp, Artem Cherkasov
_ADDRESS University of British Columbia, Medicine; Simon Fraser University, Computer Science; Vancouver, BC, Canada
_MODELING_PROCEDURE
Descriptors: Based on our previous experience with QSAR modeling of peptides and with general QSAR clustering and classification of bioactivity properties, we adopted the following strategy. First, we optimized the geometry of the studied peptides using the MMFF94 force field; carboxylic groups were deprotonated, amino groups were protonated, and partial charges were computed according to [1]. We then computed QSAR descriptors that describe the entire peptide molecule (global parameters) as well as descriptors for the constituent amino acids considered in the context of their peptide environment (we did not use the 'isolated amino acids' approximation). Thus, for all peptides in the testing and training sets of the Classification_004 problem, we initially calculated more than 400 3D and 2D QSAR parameters, which included:
- 50 global 'inductive' QSAR descriptors, as described in [2];
- 10 local 'inductive' QSAR descriptors (computed toward the CA atom) for each amino acid of a given 8-mer, giving 80 additional 'inductive' QSAR descriptors;
- 260 global atomic type-specific 'inductive' QSAR descriptors (previously unpublished parameters), computed additively for the specific atomic types present in the studied peptides;
- ~90 conventional 3D and 2D global QSAR parameters implemented in the MOE modeling package [3].
All of the 'inductive' QSAR descriptors described above were calculated with our own SVL scripts for MOE; most of them can be freely downloaded through the SVL exchange.
Modeling Procedure: We used our linear optimization method based on a distance measure [4] to calculate our prediction model.
Given the calibration data set, our optimization approach aims to find a weighted Minkowski distance that maximizes the difference between active and inactive peptides. The separation between active and inactive compounds is written as a linear program (LP) to which all the descriptors are supplied. To avoid the problem of descriptors that are missing for some compounds (for example, those for certain atom types), these descriptors were removed. The final set contains 414 descriptors, all of which are normalized. Although all the descriptors are given to the LP formulation, only 100 of them are used in the optimal distance measure; these descriptors are listed in the attachment. We trained our prediction model on the whole calibration data set and assessed its quality by the accuracy of its predictions, which were obtained with a modification of k-nearest-neighbor (kNN) classification. Standard kNN classification assigns to a peptide P the majority activity among the k nearest neighbors of P, using the distance measure determined by the LP optimization. Instead of looking at only k compounds, we consider all the compounds and calculate the average distance between P and the active compounds as well as the average distance between P and the inactive compounds. A test compound is assigned to the class with the smaller of these two average distances. For the test data we have:
Sensitivity: 0.84
Specificity: 0.97
Accuracy: 0.95
[1] Cherkasov, A. Inductive Electronegativity Scale. Iterative Calculation of Inductive Partial Charges. Journal of Chemical Information and Computer Sciences, 2003, 43, 2039-2047. Cherkasov, A., Z. Shi, Y. Li, S.M. Jones, M. Fallahi, G.L. Hammond. 'Inductive' Charges on Atoms in Proteins: Comparative Docking with the Extended Steroid Benchmark Set and Discovery of a Novel SHBG Ligand. Journal of Chemical Information and Modeling, 2005, 45, 1842-1853.
[2] Cherkasov, A. 'Inductive' Descriptors. 10 Successful Years in QSAR.
Current Computer-Aided Drug Design, 2005, 1, 21-42.
[3] Molecular Operating Environment, 2005, by Chemical Computing Group Inc., Montreal, Canada.
[4] Karakoc E., Cherkasov A., Sahinalp S. C. Distance Based Algorithms for Small Biomolecule Classification and Structural Similarity Search. ISMB'06, 14th Annual International Conference on Intelligent Systems for Molecular Biology, Fortaleza, Brazil, 2006.
[5] SNNS: Stuttgart Neural Network Simulator; Version 4.0, University of Stuttgart, 1995.
_PREDICTION
Obj_00001 -1
Obj_00002 -1
Obj_00003 +1
Obj_00004 -1
Obj_00005 -1
Obj_00006 -1
Obj_00007 -1
Obj_00008 -1
Obj_00009 -1
Obj_00010 -1
Obj_00011 -1
Obj_00012 -1
Obj_00013 -1
Obj_00014 -1
Obj_00015 -1
Obj_00016 +1
Obj_00017 -1
Obj_00018 -1
Obj_00019 -1
Obj_00020 -1
Obj_00021 -1
Obj_00022 -1
Obj_00023 -1
Obj_00024 -1
Obj_00025 -1
Obj_00026 +1
Obj_00027 -1
Obj_00028 -1
Obj_00029 -1
Obj_00030 -1
Obj_00031 -1
Obj_00032 -1
Obj_00033 -1
Obj_00034 +1
Obj_00035 +1
Obj_00036 -1
Obj_00037 -1
Obj_00038 -1
Obj_00039 +1
Obj_00040 -1
Obj_00041 -1
Obj_00042 -1
Obj_00043 -1
Obj_00044 -1
Obj_00045 +1
Obj_00046 +1
Obj_00047 -1
Obj_00048 -1
Obj_00049 +1
Obj_00050 -1
Obj_00051 -1
Obj_00052 +1
Obj_00053 -1
Obj_00054 +1
Obj_00055 -1
Obj_00056 -1
Obj_00057 +1
Obj_00058 -1
Obj_00059 +1
Obj_00060 +1
Obj_00061 -1
Obj_00062 -1
Obj_00063 -1
Obj_00064 +1
Obj_00065 +1
Obj_00066 -1
Obj_00067 -1
Obj_00068 -1
Obj_00069 -1
Obj_00070 -1
Obj_00071 -1
Obj_00072 -1
Obj_00073 -1
Obj_00074 -1
Obj_00075 -1
Obj_00076 -1
Obj_00077 +1
Obj_00078 -1
Obj_00079 -1
Obj_00080 +1
Obj_00081 -1
Obj_00082 -1
Obj_00083 -1
Obj_00084 +1
Obj_00085 -1
Obj_00086 -1
Obj_00087 -1
Obj_00088 -1
Obj_00089 -1
Obj_00090 -1
Obj_00091 -1
Obj_00092 -1
Obj_00093 +1
Obj_00094 -1
Obj_00095 -1
Obj_00096 -1
Obj_00097 -1
Obj_00098 -1
Obj_00099 -1
Obj_00100 +1
Obj_00101 +1
Obj_00102 -1
Obj_00103 -1
Obj_00104 -1
Obj_00105 -1
Obj_00106 -1
Obj_00107 -1
Obj_00108 -1
Obj_00109 -1
Obj_00110 -1
Obj_00111 -1
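
Illustrative note: the classification rule described under Modeling Procedure (assign a peptide to the class whose members are, on average, closer under a weighted Minkowski distance) can be sketched in Python as follows. This is a minimal sketch, not the code used for this submission. The function names, the toy descriptor vectors, the hand-picked weights, the exponent p = 1, and the tie-breaking toward -1 are all assumptions made for illustration; in the actual method the weights come from the LP optimization [4], which is not reproduced here.

```python
def weighted_minkowski(x, y, weights, p=1):
    """Weighted Minkowski distance between two descriptor vectors.

    p=1 is an assumption; the submission does not state the exponent.
    """
    total = sum(w * abs(a - b) ** p for w, a, b in zip(weights, x, y))
    return total ** (1.0 / p)


def mean_distance(peptide, compounds, weights, p=1):
    """Average distance from one peptide to every compound in a class."""
    distances = [weighted_minkowski(peptide, c, weights, p) for c in compounds]
    return sum(distances) / len(distances)


def classify(peptide, actives, inactives, weights, p=1):
    """Return +1 if the peptide is closer on average to the actives, else -1.

    Ties go to -1 here; that choice is an assumption of this sketch.
    """
    d_active = mean_distance(peptide, actives, weights, p)
    d_inactive = mean_distance(peptide, inactives, weights, p)
    return 1 if d_active < d_inactive else -1


def confusion_metrics(true_labels, predicted_labels):
    """Sensitivity, specificity and accuracy for labels coded as +1 / -1."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, q in pairs if t == 1 and q == 1)
    tn = sum(1 for t, q in pairs if t == -1 and q == -1)
    fp = sum(1 for t, q in pairs if t == -1 and q == 1)
    fn = sum(1 for t, q in pairs if t == 1 and q == -1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / len(pairs)
    return sensitivity, specificity, accuracy


# Hypothetical toy data: two normalized descriptors per peptide.
toy_actives = [[1.0, 0.9], [0.8, 1.0]]
toy_inactives = [[0.0, 0.1], [0.2, 0.0]]
toy_weights = [0.7, 0.3]  # hand-picked, NOT LP-derived

label = classify([0.9, 0.8], toy_actives, toy_inactives, toy_weights)
print("predicted class:", label)
```

Because every calibration compound contributes to the two averages, this rule has no k to tune, unlike standard kNN. The helper confusion_metrics shows how figures such as sensitivity, specificity and accuracy are derived from +1/-1 labels once reference activities are available.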