_PROBLEM
CoEPrA-2006_Classification_001

_GROUP_NAME
Elizabeth Jacob

_GROUP_MEMBERS
Saniya Abraham
Elizabeth Jacob

_ADDRESS
Computational Modeling & Simulation Group
Regional Research Laboratory (CSIR)
Trivandrum – 695019
Kerala State
INDIA
Email: lizjacb@yahoo.co.in
Phone: 0471-2515381, 2515264, 2437714

_MODELING_PROCEDURE
The Chaos Game Representation (CGR) technique has been used for this classification problem. Each amino acid in a nonapeptide is described by 643 descriptors, so there are a total of 643 x 9 = 5787 descriptors per nonapeptide. We have used only the data for these 643 descriptors for each amino acid. No other data has been used.

The Algorithm

Classification of amino acids based on the descriptor values

1. The 643 descriptors for each of the 20 amino acids are extracted into a matrix.
2. For each descriptor there are 20 values, one per amino acid. The range of these values is divided into 4 equal parts, and each value is replaced by 1, 2, 3 or 4 according to the part into which it falls.
3. From the matrix of integers in Step 2, a statistic is calculated for each pair of amino acids that measures how similar they are.
4. Based on the values of the statistic, the 20 amino acids are classified into 8 groups.

Classification of test data by comparison with CGRs

5. Unlike the conventional square CGR, we have formulated a CGR in a box. The 8 corners of the box represent the 8 groups of amino acids determined in Step 4.
6. The nonapeptides in CoEPrA-2006_Classification_001_Calibration_Data.txt are taken and the two classes are separated. All the peptides belonging to category +1 are concatenated to form one string and those belonging to category -1 are joined to form another string.
7. A CGR is drawn for each category (CGR-1 and CGR+1).
8. For each nonapeptide in the test data, a CGR with 9 points is drawn.
9. The test CGR is compared with the two CGRs in Step 7 using a minimum distance formula.
The category to which the test CGR is closer is declared the winner, and the test peptide is assigned to that category.

The whole algorithm has been implemented as a computer program. The test data file consisting of 88 nonapeptides was fed to this program. The classification yielded 48 nonapeptides belonging to Category -1 and 40 to Category +1.

The Detailed Design

Classification of amino acids based on the descriptor values

In a row of 5787 descriptors, the first 643 descriptors belong to the first amino acid in the peptide chain, the next 643 to the second, and so on. We isolate the 643 descriptors for each amino acid to form a 20 x 643 matrix (called the amino matrix) by reading the file CoEPrA-2006_Classification_001_Calibration_Data.txt.

     Descriptor 1   Descriptor 2   Descriptor 3   Descriptor 4
Y    4.60           1.88           0.39           0.68
L    4.17           1.53           3.23           2.93
F    4.66           2.02           1.96           2.03

Table 1: First 3 rows and 4 columns of the amino matrix

Table 1 shows the top left part of the amino matrix. The 20 rows of amino acids are arranged in the order Y, L, F, P, G, E, T, A, S, H, V, C, Q, I, W, N, R, M, K and D. For every descriptor, the range of its 20 values is divided equally into 4 classes, which reduces the real descriptor values to integers in [1,4].

     Descriptor 1   Descriptor 2   Descriptor 3   Descriptor 4
Y    4              3              1              1
L    2              3              4              4
F    4              4              3              3

Table 2: Integer values for descriptors

It may be noted that for Y the value of descriptor 1 is very high, whereas the values of descriptors 3 and 4 are low compared with the other amino acids. Reducing the data to this form helps in finding how similar any two amino acids are with respect to the descriptors. The similarity statistic between any two rows is the number of columns that have the same value in Table 2. For example, Y and D have a score of 152 (that is, 152 of the 643 descriptors have the same integer value for Y and D), while L and V have 335 descriptors alike. There are 19+18+…+1 (=190) pairs of this type. The amino acids were classified based on how similar they are to each other.
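The binning and the pairwise similarity count described above can be sketched as follows. This is a minimal illustration and not the program used for the submission: the function names are hypothetical, the toy matrix holds only the three rows and four columns shown in Table 1, and the handling of a value that lies exactly on a bin boundary (and of a constant column) is an assumption, since the text does not specify it.

```python
# Toy amino matrix: only the three rows and four columns shown in Table 1.
AMINO_MATRIX = {
    "Y": [4.60, 1.88, 0.39, 0.68],
    "L": [4.17, 1.53, 3.23, 2.93],
    "F": [4.66, 2.02, 1.96, 2.03],
}


def discretize_column(values, n_bins=4):
    """Replace each value by its equal-width bin number in 1..n_bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant descriptor puts every value in bin 1.
        return [1] * len(values)
    width = (hi - lo) / n_bins
    # Assumption: a value on a boundary goes to the upper bin; the column
    # maximum sits on the top edge, so it is clamped into the last bin.
    return [min(int((v - lo) / width) + 1, n_bins) for v in values]


def discretize_matrix(matrix):
    """Bin every descriptor (column) over the amino acids present (rows)."""
    names = list(matrix)
    columns = zip(*(matrix[name] for name in names))
    binned_columns = [discretize_column(list(col)) for col in columns]
    rows = zip(*binned_columns)
    return {name: list(row) for name, row in zip(names, rows)}


def similarity(row_a, row_b):
    """Number of descriptors on which two amino acids share a bin number."""
    return sum(1 for a, b in zip(row_a, row_b) if a == b)


binned = discretize_matrix(AMINO_MATRIX)
print(similarity(binned["Y"], binned["F"]))
```

In this toy the bins are computed over three values per descriptor rather than twenty, so the integers need not agree with Table 2. How the 190 pairwise scores are turned into 8 groups is not specified in the text, so that step is not sketched.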
The eight groups obtained are:

Group 1: L, V, F, I, M, W, C
Group 2: Q, R, K, E
Group 3: G, S, T
Group 4: D, N
Group 5: Y
Group 6: A
Group 7: P
Group 8: H

Classification of test data by comparison with CGRs

Here we have deviated from the conventional CGR, which consists of a square with square grid elements. The square CGR is well suited to the analysis of nucleotide sequences, which have only four letters (A, T, G and C), but it is not suitable for 20 amino acids divided into 8 groups. Hence a box domain is taken, in which the square grid elements are replaced by volume elements and the CGR points have 3D instead of 2D coordinates. The 8 amino acid groups are mapped onto the 8 corners of the box.

Following the chaos game algorithm, the first amino acid residue of the sequence is plotted halfway between the centre of the box and the vertex labeled with the group of the first residue. The second residue is then plotted halfway between the first point and the vertex labeled with the group of the second residue. The process is repeated until the last residue in the sequence is plotted.

Next, the parent CGRs for categories +1 and -1 have to be drawn. It has been verified that the characteristic pattern of a peptide class does not depend on the number of sequences added together or on the order in which they are concatenated. The calibration peptide data is made into 2 strings (one for +1 and one for -1). The concatenated +1 string is used to make the CGR for +1, and the concatenated -1 string is used likewise for -1. There are about 400 amino acids in each string and therefore about 400 CGR points in each CGR. The box is assumed to be 100x100x100. The centre (50,50,50) is taken as the starting point of the CGRs. The CGRs represent the patterns followed by amino acids in each category.

Now the system is ready to take test data. A CGR is drawn for a nonapeptide. This gives 9 points in the box. The task is to find which of the 2 categories it is closer to in terms of pattern, i.e. the test pattern has to be compared with CGR-1 and CGR+1.
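The box CGR construction described above can be sketched as follows. This is an illustration under stated assumptions, not the submitted program: the text does not say which group sits on which corner, so the corner assignment below (group index read as three bits) is hypothetical, as are the names `corner_of`, `CORNER` and `cgr_points`.

```python
# The eight amino acid groups, in the order Group 1 .. Group 8.
GROUPS = ["LVFIMWC", "QRKE", "GST", "DN", "Y", "A", "P", "H"]
SIDE = 100.0  # the box is 100 x 100 x 100


def corner_of(group_index):
    """Hypothetical corner assignment: the three bits of the group index
    (0..7) select 0 or SIDE for the x, y and z coordinates."""
    return tuple(SIDE * ((group_index >> bit) & 1) for bit in range(3))


# Map every amino acid letter to the corner of its group.
CORNER = {aa: corner_of(i) for i, group in enumerate(GROUPS) for aa in group}


def cgr_points(sequence, start=(50.0, 50.0, 50.0)):
    """Chaos game in the box: each residue moves the current point halfway
    towards the corner of that residue's group. One point per residue."""
    points = []
    current = start
    for residue in sequence:
        corner = CORNER[residue]
        current = tuple((c + v) / 2.0 for c, v in zip(current, corner))
        points.append(current)
    return points


# A nonapeptide gives 9 points; a concatenated class string gives one
# point per residue in the same way.
test_points = cgr_points("YLFPGETAS")
print(len(test_points))
```

The same function serves for both the parent CGRs and a test CGR, since the only difference between them is the length of the input string.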
Starting from (50,50,50), the first point of the test CGR represents a 1-mer. If there is a point very close to it in the parent CGR, then 1-mer similarity exists. Similarly, this can be extended to the n-mer case.

Two images can be compared through the distances between their corresponding points. A small distance between images indicates that the letters are used similarly in the two sequences. For each point in the test CGR, the minimum distance to the points of a parent CGR is computed, and these minimum distances are summed. The smaller this sum, the closer the test pattern is to that parent. If the test pattern has a minimum distance of d1 from CGR-1, a distance of d2 from CGR+1 and d1