Personalized drug-response prediction model for lung cancer patients using machine learning

—Lung cancer caused by mutations in the epidermal growth factor receptor (EGFR) is a major cause of cancer deaths worldwide. EGFR Tyrosine kinase inhibitors (TKIs) have been developed, and have shown increased survival rates and quality of life in clinical studies. However, drug resistance is a major issue, and treatment efﬁcacy is lost after about an year. Therefore, predicting the response to targeted therapies for lung cancer patients is a signiﬁcant research problem. In this work, we address this issue and propose a personalized model to predict the drug-response of lung cancer patients. This model uses clinical information, geometrical properties of the drug binding site, and the binding free energy of the drug-protein complex. The proposed model achieves state of the art performance with 97.5% accuracy, 100% recall, 95% precision, and 96.3% F1-score with a random forest classiﬁer. This model can also be tested on other types of cancer and diseases, and we believe that it may help in taking optimal clinical decisions for treating patients with targeted therapies.


I. INTRODUCTION
L UNG cancer is a leading cause of cancer deaths worldwide [1], and has the lowest survival rate among all cancer types. It is the second most common type of cancer, and often diagnosed at later stages when metastatic spread to other parts of the body may have occurred [2] [3].
In the last decade, great progress has been made in the management of non-small cell lung cancer NSCLC patients. Molecular targeting has made great advancements, and EGFR and ErbB family members have been identified as a useful therapeutic targets [4]. Over-expression of EGFR is found in about 60% of advanced NSCLC patients [5]. The US Food and Drug Administration (FDA) has approved small molecule inhibitors Gefitinib/Erlotinib as a first line treatment for lung cancer patients, harboring EGFR mutations [6].
Small molecule inhibitors produced encouraging results at the initial stage of therapy, and increased the survival rate and life quality of patients. However, drug resistance appeared due to secondary point mutation(s), which limited the effectiveness of the drug [7]. Using molecular dynamics, several computational studies have been conducted to decode the mechanism of drug resistance simulations [8], [9]. These studies provided useful insights about the conformational dynamics [10], stability [11], and structural changes [12] of EGFR and explaining several aspects of drug resistance. Recently, a framework is developed for the visualization of protein-drug interaction for The authors are with City University of Hong Kong. lung cancer drug resistance analysis in [13]. There is still much unexplained variation in a patient's response to these drugs and the patient's personal characteristics may play a significant role in the mechanism of drug resistance.
The completion of human genome project [14] has allowed a move from the traditional medical model of targeting a large population, as a one-size fits all approach [15], towards personalized therapies. Information from genomic and genetic data provides new opportunities for patient care, prevention, and diagnosis [16].
Predicting a patient's response to a drug treatment [17] [18], or identifying their optimal treatment strategy, is challenging for computational methods due to limited data sources. Disease outcomes have been modeled for breast [19], and lung [20] cancers, and for large B-cell lymphoma [21] using clinical and molecular structural information and for NSCLC patients using supervised machine learning [22].
As response to a drug is often mediated by a protein-drug interaction, the geometry of the drug binding site, or pocket, and molecular dynamics (MD) modeling of the binding site, can be useful predictors. MD simulation of the binding energy of drug-mutant complexes and patient's personal characteristics when combined with an extreme learning machine (ELM) [23] could classify the drug response level into two classes [24]. They achived an accuracy of 95.3%. In another interesting work, Local geometrical properties combined with energy related features in an Eigen binding site method achieved 69.35% accuracy for four classes of drug responses [25] and personal and geometric features were used to make a similar prediction in [26], while Bin et al. used proteindrug interaction footprints in a three level drug response classificationl [27].
These studies demonstrate the potential of combining dynamic molecular features with patient's personal characteristics to predicting drug responses, but the quality of the predictions needs to be improved for potential clinical use. In this work, we combine geometry, energy and patient's personal information to predict four classes of drug response. The proposed model achieves state of the art performance with 97.5% accuracy, 100% recall, 95% precision, and 96.3% F1 score. The main contributions of this work can be stated as follows.
• We have developed a drug response prediction model using molecular dynamics simulation and machine learning, which achieves state of the art performance. • Our work demonstrates the contribution of geometrical features to increase in prediction accuracy. , and (c) shows the molecular structure of a drug.
• As the proposed model is a general model, it can be tested on other types of cancer and and diseases and used in clinical decision support with minor modifications. The paper is organized as follows. In section II, we formulate an improved method to classify a patient's response to drug treatment. Sections III, IV and V present the proposed methodology, geometrical feature extraction and classification, while results and discussion are given in section VI. Conclusions and future work are given in section VII.

II. FORMULATION OF THE CLASSIFICATION MODEL
The framework to classify individual patient outcomes is divided into three modules; computational modeling of mutant/drug structures, MD simulations, and classification. Figure 1 (a) shows the crystal structure of the EGFR dimer in complex with Gefitinib and this is the template from which mutant structures are modeled. Figure 1 (b) and (c) show the key interactions between the protein and drug and the detailed chemical structure of Gefitinib. Prediction of mutant structures from the template by computational modeling, using the Rosetta modeling tools [28], is shown in Figure 2 (a), MD simulations are shown in Figure 2 (b), and the classification module is shown in Figure 2 (c).

A. Datasets and Patients
The clinical information to conduct such study is always a major challenge. The clinical information used in this study was taken from several published sources [24], [25], [29]- [31]. A dataset of 201 NSCLC patients was obtained. These patients had a median age of 63 years, 35% (71) were female and 65% (130) male, and about 75% were non-smokers. All patients received EGFR-TKIs as their first line of treatment. A total of 31 different EGFR mutations occurred in these patients at frequencies shown in Figure 3. The most common mutations were L858R, delE746−750 and L858R−T790M. All mutations were modelled into the EGFR 3D structure using Rosetta [28].
The potency of an inhibitor can be measured by a patient's survival time or their drug response level. Drug response is classified into four levels, based on the response evaluation criteria in solid tumors RECIST [32]. Response levels 1 and 2 indicate complete and partial responses to the drug. Response levels 3 and 4 correspond to stable and progressive disease. The dataset used here consisted of 19, 118, 30, 34 patients at response levels 1, 2, 3 and 4, respectively. The dataset was divided into training (80%, 163 patients) and testing (20%, 38 patients) subsets.

B. Personal features
The personalized information used for each patient was age, sex, smoking history, performance status, and drug response level. Age was coded as [0,4] based on the ranges (0,40), (41,50), (51,60), (61,70), and (70+). Figure 4 shows that the drug response level and survival time are not correlated with the age of a patient.
A detailed description of the features and their value ranges are presented in Table I. For a patient with specific personal and clinical information, the energy and geometric features of their mutant EGFR-Gefitinib complex were obtained and used to predict their drug response level through machine learning classifiers.

III. PROPOSED METHODOLOGY
The framework for predicting drug response level is divided into three parts ( Figure 2). Initially, the mutation is modeled into the EGFR 3D structures so that MD simulations [33] can be performed to extract the energy and geometric features of the protein-drug interaction which are passed to machine learning classifiers to predict the drug response level. The 3D mutant structures are predicted based on the crystal structure of wildtype EGFR, taken from Protein Data Bank (PDB) [34]. The high resolution ddgmonomer (HRDM) [35] protocol in Rosetta is used to predict point mutations, and the comparative modeling protocol is used to predict multi-point mutations [36]. Quality assessment of predicted structures is performed by Verify3D [37], and Q-mean [38].

B. Molecular dynamics simulation
MD simulations of the protein-drug complex were performed using the QM/MM method in Amber [39] with a surrounding waterbox neutralized using Na+ and Cl-atoms and the ff9SB [40] and gaff force fields. The total energy of the system is the sum of the bonded (stretch, bend, torsion) and non-bonded (electrostatic, van derWaals) terms.
(1) Energy minimization is performed to refine the modeled structure bfeore the MD run, which starts with a heating of the system from 0°K to 300°K, followed by density equilibrium for 50-ps and constant pressure for 500-ps. The SHAKE [41] algorithm was used to constrain bond stretching and for efficient temperature control. After achieving a stable state, production MD runs were performed at constant temperature (300°K) and pressure (1 atm) for 2-ns. The MD simulations were performed using a 12 core 3.47 Ghz I-7 processor, with 8 GB RAM [42]. A Tesla C2075 GPU [43] was used for production files. Each simulation was completed in about 12 hours. The CPPTRAJ [44] package in Amber was used to extract the trajectory with frames collected every 10-ps, giving 200 frames for each run.
1) Root mean square deviation: The root mean square deviation is used to measure the spatial variance between a reference structure and superimposed structures.
where N is the number of frames, X i is the target and Y i is the reference structure. RMSD [45] is used to measure the stability of the MD simulations. The trajectories of the RMSD values of EGFR and its mutants from the reference structure are shown in Figure 5.
2) Binding free energy: The free energy of binding [46] of a drug to a protein in a solvent environment estimates the binding affinity [47]. The parallel version of MM-GBSA [48] on a 12 core, 3.47 GHz processor is used for the simulation. The MD trajectory was input to the MM-GBSA, and each simulation took about 12 hours for computation. The binding free energy is calculated based on the theory of the thermodynamic cycle in vacuum and solvent environments [49], as: where ∆G is the binding free energy difference of the receptor-ligand system in a vacuum. ∆G Solv,Complex , ∆G Solv,ligand , and ∆G Solv,Receptor represent their energy differences between vacuum and solvent states. The energy component is composed of Van der Waals forces (VDW), electrostatic energy (EEL), the electrostatic contribution to solvation, and non-polar contributions to the solvation free energy (ESURF). The binding free energy and its components for EGFR and its mutants are shown in Figure 6.
In this work, we have used the binding free energy and its components as energy features in our prediction model. Drugsensitive mutants generally have higher binding energy values than drug resistant mutants. The energy features vs response level and survival are shown in Figure 7. Energy feature are not  Interactions between the binding site residues of a protein and small molecule inhibitors are commonly used in prediction methods [50]. Local geometric surface properties were determined based on the alpha shape [51] using the computational geometry algorithm library (CGAL) [52].

A. Alpha shape
The theory of the alpha shape algorithm is based on 3D Delaunay triangulation, which aims to maximize the minimum angle of all the angles of the triangle in triangulation [53]. Given four atoms A, B, C, D, in 3D space, after successful triangulation, no atom will be located in the circumcircle of any triangle. The Delaunay algorithm maximizes the angle using the following rule.
Here x A , y A shows the location of atom in a 2D-plane. If the determinant is positive, the atom D lies in the circumcircle of A, B, C, as shown in Figure 8.

1) Convex atoms:
Each atom in a drug-mutant system has a position and mass, represented as a = (p, w), where p is the position and w is the mass of the atom. Two atoms a 1 = (p 1 , w 1 ) and a 2 = (p 2 , w 2 ) are defined as orthogonal or sub-orthogonal using the following equation.
From the alpha shape, the solid angle [54] of atoms was determined to characterize the geometric properties of the local surface . If A, B, C, and D are the vertices of a tetrahedron, the solid angle, Ω i , is: where φ AB i , φ BC i , and φ BC i represent the dihedral angles of tetrahedron i. Ω = cos(Ω i ) 4 Ω results in a convex shape if positive and a concave shape if negative (Figure 9). The number of atoms in a convex shape at the local surface of the drug-dimer complex are used. 2) Matching rates: The atoms at the interface of the structures create the interaction between the drug and the target. The surface atoms are collected using the alpha shape algorithm, and named as point set A. After that, point sets B and C are obtained, to represents the surface atoms of the target and the drug, respectively. The interacting atoms (I) are obtained using set operations and further classified as, interacting atoms on the drug I d or target I t .
I represents the interacting atoms, I t the interacting atoms in the target, and I d the interacting atoms in the drug. The matching rate is determined by selecting atoms at the drug and the target. If one of the atoms is convex and other is concave, the pair is recorded as matched and there is a strong interaction between them. If both atoms are convex or concave, the pair is unmatched, and the interaction is weak. The matched and unmatched atoms are determined as: Matching rates are calculated for each frame of the MD trajectory as: M R represents the matching rate, f (B i , C i ) is a matched atom pair, and N is the total number of MD snapshots. The matching rate is used as a feature in this work, and low matching rates were linked to low drug responses.
3) Connectivity measure: Connectivity changes between binding site residues and the drug molecule throughout the MD simulation. The consistency of these connections could be used as a predictor of the drug response level.
Where A k,i,j represents the connection between the i th EGFR atom and the j th drug atom in the k th MD snapshot, and is 1 if there is a connection and zero otherwise.
D k represents number of connected atoms in the MD snapshot. The number of connected atoms over the entire trajectory is used as a feature.
4) Binding site positioning: The positioning is evaluated using the Euclidean distance between drug-binding site atoms and the center of the drug-molecule.
The binding site residues are represented by their CA atoms, and if there are 14 CA atoms at the binding site, and two atoms at the drug molecule center, then a 14 × 2 or 28 × 1 vector will represent this. The distance over the entire MD simulation of 200 frames can be represented as 200 × 28 matrix. The binding site position is represented as the average distance between the drug and the target.
where D avg shows the binding site position, D i shows the ith MD snapshot distance, and N shows the number of MD snapshots. All the feature values were normalized to [0,1]. Figure 10 shows the geometrical features relative to the response level. Generally, drug sensitive mutants have less distance between the drug and the target.

Response VS Geometrical Features
Distance Convex atoms Fig. 10. Geometric features by response level. As the distance and number of convex atom increases, the response level also increases. Fig. 11. Normalized values for energy, and geometrical features 5) Hydrogen Bonds: Hydrogen bonds contribute to the stability of a structure and can provides insights about interactions within the structure. Stable systems tend to have more hydrogen bonds. The number of hydrogen bonds in the EGFRdrug complex were calculated using the hbond command in Amber.

B. Composite Geometric Features
The geometric features were combined to make two composite features, with the matching rate, number of connected atoms, and number of hydrogen bonds as one feature, x g1 and the number of convex atoms and Euclidean distance, x g2 , as the other. Similarly, the personal features and energy features were combined as x p and x e1 , respectively.

C. Feature normalization
Each feature was normalized to the [0, 1] range using min -max normalization.
where z i represents the normalized value, min(x) represents the minimum value, and max(x) represents the maximum value for each feature. The normalized values are shown in Figure 11. Composite features and density distributions are presented in Figure 12.

V. CLASSIFICATION
For patients with clinical information, geometrical and energy features were obtained from their EGFR mutant drug complex, and classifiers were trained to predict one of the four-classes of drug response level. Five popular classifiers were tested using Rstudio with the CARET package [55] and 10-fold cross-validation.

A. Classification performance metric
The classification performance metrics used are precision, recall, F1-measure and balanced accuracy.
The terms TP, FP, FN, and TN denote true positive, false positive, false negative, and true negative respectively. The performance metrics are shown in Figure 13. We also used Kappa (κ) statistic to deal with multi-class classification and and imbalanced problems.

VI. RESULTS AND DISCUSSION
In this work, we have designed a drug-response prediction model for lung cancer patients. Clinical data, age, survival, sex, smoking history, and type of mutation, were collected from various previous studies [24], [25], [29]- [31] and were used with energy and geometric features from the EGFR mutant-drug complex to predict the drug-response level as one of complete response, partial response, stable disease and progressive disease. The accuracy of the five classifications methods on these data and their Kappa scores [56] showing the agreement between the reference and predicted values are shown in Figure 14. The average accuracy of Knn at 71% with Kappa at 39% is lowest, whereas the random forest achieves an accuracy of 97.5% with Kappa at 95%.
The confusion matrices for training and testing are shown in Figure 15 and 16 respectively. The random forest classifier achieves 100% accuracy with no mis-classification for training data and only one mis-classification in the testing data. Fig. 15. Confusion matrix for the training dataset. The labels correspond to complete response, no response, partial response and stable disease. Table II shows that the classification accuracy of the proposed method achieves state of the art performance relative to related works and for prediction using four classes of Fig. 16. Confusion matrix for testing dataset, labels as in Figure 17. drug response. The pioneer work [24] used binding free energy, and personal features of 168 patients to predict a twoclass drug response. Methods predicting fewer response levels performed better than earlier four-level predictors, however, our method outperforms all previous work. The combination of geometrical, energy, and personal features seems to be the optimal strategy for predicting the drug response. Figure 17, shows the contribution of each feature.

B. Discussion
Rapid developments in the field of bioinformatics and the large amount of genomic data available today, allows development of personalized drug response models [58]. Previous studies showed that drug response is associated with clinical information, the type of EGFR mutation, binding free energy, and geometrical properties of the drug binding pocket. Different studies combined different properties to achieve higher accuracy (Table II).
In this work, three features, clinical information, proteindrug interactions, and geometrical properties of the drugbinding pocket, were combined to achieve state of the art performance with 97.5% accuracy, 100% recall, 95% precision, and 96.3% F1 -score with a random forest classifier. The classifier performed well on all the classes, with only one mis-classification between stable disease and progressive disease.
Our main focus in this work was on modeling the geometry of the drug-binding pocket and combining this with clinical information. The geometrical features, convex atoms at the interaction surface of the complex, the matching rates of surface atoms, the distances between the center of the drug molecule and binding-site residues, and hydrogen bonds, were the most discriminative features. Combining clinical and molecular predictors to identify drug-sensitive patients was most effective. Further investigation of this model may result in mutation, age or gender specific therapies and the model can also help in selecting the optimal drug for specific patients.
A limitation of this study was that it contained only most common 33 EGFR mutations from a possible 594 EGFR mutations available in COSMIC database [59]. However, the mutations used account for about 90% of all mutations. It is difficult to determine drug sensitivity to rare mutations due to limited patient data. Another limitation is the small dataset of 201 patients. Since, obtaining clinical data is difficult due to privacy and ethical considerations, most clinical studies consist of fewer than 400 patients, e.g. Table II, and may have imbalanced numbers of patients at each response level. Despite this, our model achieved a highly accurate prediction rate.
Personalized or precision medicine separates patients into different groups based on their individual medical decisions, interventions, risk of disease and drug-responses. Our model provides a personalized drug response model with a highly accurate prediction rate that could be tested on other types of cancer and other diseases.

VII. CONCLUSION
Computational methods, especially machine learning [60]- [62] are widely used to analyze, visualize and predict responses to lung cancer drugs. In this work, we developed a systematic model that uses personal, energy, and geometrical features in machine learning classifiers to predict the four levels of drug response. This method achieved state of the art performance at 97.5% accuracy with a random forest classifier, even though only a small patient dataset was available. This demonstrates the potential of the random forest method to deal with difficult learning situations and to be implemented in daily clinical practice to optimize treatment strategies for individual patients. In the future, more clinical data will be collected to further refine the prediction model, and test it on other diseases.