Computational methods for the analysis and prediction of drug resistance in EGFR-mutated non-small cell lung cancer patients

—Lung cancer is the major cause of cancer deaths worldwide, and has a very low survival rate. Non-small cell lung cancer (NSCLC) is the largest subset of lung cancers, which accounts for about 85% of all the cases. It has been well established that the mutation in the epidermal growth factor receptor (EGFR) can lead to lung cancer. EGFR tyrosine kinase inhibitors (TKIs) are developed to target the kinase domain of EGFR. These TKIs produce promising results at the initial stage of therapy, but the efﬁcacy becomes limited due to the development of drug resistance. In this paper, we provide a comprehensive review of computational methods, for understanding drug resistance mechanism. The important EGFR mutants and different generations of EGFR–TKIs, with the survival and response rates are discussed. Next, we evaluate the role of important parameters in drug resistance, including struc- tural dynamics, hydrogen bonds, stability, binding free energies, and signaling pathways. Personalized drug resistance prediction models and the use of deep learning and big data analytics in drug design, and drug resistance analysis are also highlighted. In addition, we discuss limitations in the current methodologies, and explore potential research avenues for the development of future therapies, using computational methods. We hope, that this review will serve as a reference for early and advanced stage researchers, to apply computational techniques for drug resistance analysis and prediction in lung cancer patients.

Abstract-Lung cancer is the major cause of cancer deaths worldwide, and has a very low survival rate. Non-small cell lung cancer (NSCLC) is the largest subset of lung cancers, which accounts for about 85% of all the cases. It has been well established that the mutation in the epidermal growth factor receptor (EGFR) can lead to lung cancer. EGFR tyrosine kinase inhibitors (TKIs) are developed to target the kinase domain of EGFR. These TKIs produce promising results at the initial stage of therapy, but the efficacy becomes limited due to the development of drug resistance. In this paper, we provide a comprehensive review of computational methods, for understanding drug resistance mechanism. The important EGFR mutants and different generations of EGFR-TKIs, with the survival and response rates are discussed. Next, we evaluate the role of important parameters in drug resistance, including structural dynamics, hydrogen bonds, stability, binding free energies, and signaling pathways. Personalized drug resistance prediction models and the use of deep learning and big data analytics in drug design, and drug resistance analysis are also highlighted. In addition, we discuss limitations in the current methodologies, and explore potential research avenues for the development of future therapies, using computational methods. We hope, that this review will serve as a reference for early and advanced stage researchers, to apply computational techniques for drug resistance analysis and prediction in lung cancer patients.

I. BACKGROUD
Cancer is the second largest cause of deaths worldwide, resulting in a loss of 9.6 million lives in 2018 [1]. About 13% of all the new cancer cases are lung cancer and more than 65% of them are diagnosed at advanced cancer stages.
The over-expression of epidermal growth factor receptor (EGFR) is found in about 60% of non-small cell lung carcinomas (NSCLCs) [2]. EGFR Tyrosine kinase inhibitors (TKIs) are developed to target the kinase domain of EGFR. These inhibitors produce promising results at the initial stage of therapy, but the drug resistance develops in most of the cases after about a year [2]. Drug resistance is currently one of the biggest challenges in targeted cancer therapy.
The discovery of small molecule inhibitors targeting the kinase EGFR has generated much hope for the further advances in the treatment of lung cancer, but the limitations of their efficacy are apparent [3]. The upregulation and engagement of efflux pumps to remove the drug from the binding site is one of the primary reason for the acquired resistance [4]. Another reason is the genomic variations in the kinase domain of EGFR, such as T790M or C797S mutations. Other mechanisms include c-Met amplification, activation of alternate signaling pathways, and the co-activation of multiple receptor tyrosine kinases that can reduce the dependence of tumor cells on EGFR-mediated signaling [5].
One of the major causes of drug resistance to the 1st and 2nd generation drugs is a secondary acquired gatekeeper T790M mutation. The 3rd generation drug Osimertinib is designed specially to inhibit the T790M mutation through covalent binding of CYS 797 [6], but the efficacy of Osimertinib is lost upon the emergence of C797S mutation ( Figure 1 ). The 4th generation drug EAI045 appeared in 2016, targets both T790M and C797S giving hope to NSCLC patients. Table I presents  different generations of EGFR inhibitors with response and  progression free survival rate. EGFR mutations usually increase the activity of kinase domain, resulting in uncontrolled cell division that can eventually lead to lung cancer. The COSMIC database [7] is the largest open access database for EGFR mutations, which shows that about 93% of EGFR mutations are found in the four exons 19 -21 of the kinase domain. These mutations include inframe deletion in exon 19, insertion in exon 20, and a single point substitution in exon 21. Nucelotide substitution in exon 18 (G719S or G719C) accounts for about 5% of the EGFR mutations [8]. The statistics of different EGFR mutations is shown in Figure 1.
Experiment methods are expensive and time consuming, due to the necessity of multiple experiment conditions. Computational methods are also applied to quantitatively investigate the drug resistance process, virtual screening of compound, visualize the drug-ligand interactions [9] and the identification of drug binding sites [10].
Although, there are reviews on EGFR-mutated drug resistance in NSCLC [18]- [20], these studies mainly focus on clinical data and random trials. The methods using CT scan lung cancer images are also covered in [21]. The Nanotechnology based intelligent cancer drug design is discussed in [22]. This article presents a comprehensive review of computational studies on EGFR-mutated NSCLC patients and personalized drug resistance prediction models. The paper is organized as follows, we discuss computational methods for drug resistance analysis in Section 2. Personalized drug resistance prediction models are described in Section 3. The use of deep learning and big data analytics is highlighted in Section 4. The challenges and opportunities in the current methodologies are discussed in Section 5. Finally, we conclude this review with potential future research directions in Section 6. A list of acronyms with definitions, used in this paper is given in supplementary Table 1.

II. COMPUTATIONAL METHODS FOR EGFR-MUTATED DRUG RESISTANCE
Computational methods are widely used for the drug resistance analysis, due to its flexibility, low cost, easy implementation, and the ability to process a large amount of data [23]. Computational methods can provide deeper insights, generate novel hypothesis and devise new promising strategies. These methods can broadly be classified into structure based and ligand based. In this paper, we mainly focus on structure based drug design and analysis methods.

A. Molecular Dynamics Simulations
Molecular dynamics (MD) simulation is a powerful computational method for analyzing the interactions between the drug and the target at atomic scale [24]. MD simulation enables to examine the structural changes occurred due to genetic mutations, and it is widely used for drug resistance analysis, prediction and discovery [25]. The interactions between drug and protein is a dynamic process, involving several residues. MD simulations can reveal the atomic level details of drugprotein interactions.
Since its beginning, MD simulations have advanced from simulating several hundred atoms to systems with biological relevance, including entire protein with solvent, protein with membrane embedded, and large macromolecular complexes like nucleosomes [26] or ribosomes [27]. This immense improvement is due to the high performance computing (HPC) and basics of MD simulation algorithm [28]. The trajectories of positions, accelerations and velocities of each atom can be obtained using Newtons's second law of motion. The MD simulation generates a large amount of dynamics data, which can be used to understand the structure to function relationship.
Shan et al. [29] used long time scale MD simulation and showed that N-lobe dimerization of the wildtype (WT) EGFR is disordered, and it becomes ordered only upon dimerization. They also showed that some cancer specific mutations L858R/L834R may facilitate the dimerization by suppressing this local disorder. The L858R causes abnormally high kinase activity by promoting EGFR dimerization. They further performed unbiased, all-atom MD simulations of EGFR kinase domain [30], and showed that EGFR monomer is more stable in its inactive state than its active state.
Shunzhou et al. [31] used 10 micro-second long MD simulation for the investigation of conformational dynamics and interactions. The simulation result shows that the mutant type L858R binds to Gefitinib, rather than ATP. They used the MD simulation of [32] and applied PCA on the MD trajectories. The number of correct bound association were compared, which showed the preference of WT EGFR binding with ATP and L858R with Gefitinib. Fig. 1: EGFR frequent mutations (right). L85R drug sensitive mutant, T790M drug resistive mutant to the 1st and 2nd generation drugs, and C79S drug resistive mutant to 3rd generation drug. The frequency of EGFR mutations, acquired resistance and mutation rate in Asian patients [33], [34]. (a) L858R is the most common type of EGFR mutation. (b) The EGFR mutation are the most common mechanism associated with the drug resistance to 1st and 2nd generation of EGFR TKI. (c) Mutation rates in Asian patients.  [35]. EGFR TKI Binding mechanism, the WT EGFR shows the cancer growth, L858R mutant binds with the TKI, due to T790m mutation TKI is unable to bind the kinase (right most).

B. Binding Free Energy EGFR
The energy released due to bond formation, ligand and protein interactions is known as the binding free energy. The free energy of a favourable reaction is negative. The energetic contribution of individual residues provide useful quantitative information about the binding mode of a ligand and the protein [36]. Wang et al. [37] investigated the structural and energetic features of both active and inactive states of WT EGFR and L858R mutant using the molecular mechanics Generalized Born solvent accessible surface area MMGBSA [38] tool in Amber [39]. They further analyzed how the mutations affects the stability of several conformational states. They mapped the free energy landscape on to the principal components to identify population changes in various conformations on mutations. The study reveals that the L858R mutant induces conformational changes in active and inactive states, which affect the relative stability.
In another study, Wang et al. [40] profiled the correlation between the EGFR mutations and EGFR-TKI Afatinib (second generation drug). The progression free survival rate and drugresponse level was recorded. The L858R and the complex mutation L858R -T790M showed a response level of 3, 2 and a PFS rate of 15.87 and 9.59 months respectively, with the lower values showing low response. The WT and the single T790M mutation showed no response and no survival. Results of this study verifies the higher potency of Afatanib to classical EGFR mutation L858R and exon 19 deletion, and a lower one to T790M mutation, which are consistent with the clinical values.

C. Molecular Docking
Molecular docking is another important tool in structurebased drug design, due to its low-cost, and simplicity [41]. In molecular docking, the position and orientation of a molecule are predicted against the other molecule, when they are bound to each other to form a complex. The prediction can be useful for estimating the strength of binding affinity.
In [42], molecular docking and MD simulation are used for the investigation of structural and chemical features of EGFR inhibitors and binding pocket mechanism. They found that the binding pocket consists of three regions (P1, P2, P3). The Met 793 and ASP 855 may be the reason for the binding recognition through H-bond interactions with Phe 856.
Rajith et al. [43] investigated the G719S-T790M double mutation using 50-ns MD simulation and molecular docking. They observed the escalation in distance between P-loop and functional loop in T790M mutation compared with the G719S. They also verified that G719S mutant causes the ligand and hinge region to come closer and T790M mutant caused the ligand to escape from the binding pocket. This may be the reason for the aberrant function of EGFR-TKIs in T790M mutant.

D. Computational Modeling of EGFR Mutations
Structural information for only a few mutants is available, and a variety of rare EGFR mutations account for about 10 -20% of NSCLC patients. It is very difficult to treat patients harboring such rare mutations with EGFR TKIs. Robust prediction models are needed for such rare EGFR mutants to existing EGFR TKIs [44].
Starting from the structures found in the Protein Data Bank (PDB) [45] (www.rcsb.org), Ma et al. [46] created a 3D structure database of EGFR mutants with binding free energies of 112 EGFR mutation types collected from 942 NSCLC patients, using computational modeling and MD simulations. For residue substitution mutants, the Rosetta ddg -monomer protocol was employed. The side-chain of the mutated residue is first replaced and the rotamers of all residues are then optimized by the Rosetta's standard side-chain optimization module. For the mutations of amino acids, Rosetta comparative (CM) protocol was applied. The framework for building the EGFR mutant structure database is shown in Figure 3. The predicted models may not be accurate, and several computational tools, including Q-MEAN Z-score [47], Verify3D [48], and Ramachandaran plot [49] can be used to validate the predicted models. It will be interesting to compare the computationally predicted mutants with multiple methods, to increase the confidence level [50].

E. EGFR Dimerization and Signaling Pathways
When the extracellular ligand-binding domain is activated by its growth factor, a homo-dimer or a hetero-dimer is formed with another member of the ErbB family [51]. EGFR dimerization is an essential event in the EGF-signal transduction [52]. The dimerization stimulates the catalytic activity of tyrosine kinase domain, and promotes the autophosphorylation of several residues in the kinase domain [53]. These residues provide docking sites for downstream signaling molecules such as Shc, Grb2 and P13k. It is also shown that the EGFRs can form dimer on the cell surface independent of ligand binding [54]. Tumors, that are sensitive to EGFR TKIs are characterized by a rapid decrease in the Akt activity [55]. The failure to irregulate Akt causes drug insensitivity [56].
Wang et al. [35] investigated the contribution of EGFR and ErbB-3 heterodimerization based on binding free energy. The EGFR dimerization and downstream signaling mechanism is shown in Figure 2 (left). They characterized the EGFR mutations for 168 clinical subjects using molecular interactions for three EGFR dimers (ErbB-2, IGF-1R, c-Met) and two EGFR inhibitors, Gefitinib and Erlotinib. Simulation results show that the mutant -partner interactions increased in the L858R and L858R -T790M. The mutant delL747 -P753insS has the largest difference between the mutant interactions and inhibitors, and this may be the reason for the shorter progression free survival in this mutant type. They also investigated ErbB-3 and its interactions with EGFR mutants, IGF-1R, ErbB-2 and c-Met and found that the binding was remarkably strong between ErbB-3 and c-Met, which shows its role in regulating ErbB-3 signaling.

F. Long Range Communication Capability in EGFR
Allosteric Communication is an important phenomenon in many biological processes and considered to be a useful parameter in governing molecular motion and signal transduction [57], [58]. Recent studies [59], [60] show that allosteric networks of cooperative protein motion may be formed by sparsely connected group of residues.
In [57], Dixit and Verkhivker used (MD) simulation, PCA and signal propagation in protein to identify the allosteric communication in ABL and EGFR kinase domain with cancer mutations. They used the concept of absolute and relative long-range communication capability (LRCC) in residues for tracing the signal propagation in proteins. According to their algorithm, two remote protein residues (residue clusters) have strong communication if the mean square fluctuation of interresidues remain in a specific threshold over long time MD simulations.
The efficient long-range communication is possible not only because of the thermal fluctuations, but also a dynamic longrange interaction exists between the regions that are important in coordinating inter-lobe and inter domain motion.
It has been shown in [31] that free energy landscape is populated by conformational isomers, and extended sampling of the landscape indicates the flexibility of EGFR. First two principal components are used to describe the system dynamics [31]. Simulation results show that L858R has higher binding preference to Gefitinib than the ATP.

G. Hydrogen Bond Analysis
Hydrogen bonds are important for the analysis of biological and chemical interactions of molecules. Ghosh and Yan [61] Fig. 3: The framework for predicting the mutant structure of EGFR and their binding energies [46]. The Binding preference for ATP in WT EGFR and Gefitinib L858R mutant [31] investigated the EGFR and ErbB-3 heterodimers and their mutant structures. They performed 3-ns MD simulation of three EGFR dimers (WT, L858R, and L858R-T790M). They calculated the hydrogen bonds and found that the number of hydrogen bond changes throughout the structure. The mean value was 481 for WT, 484 for L858R and 477 in L858R -T790M structure. They concluded that the T790m mutant structure have a smaller number of hydrogen bonds, which changes the conformational stability of the system.

H. Drug Response Curves
The drug / dose response curve shows the response of an organism as a function of exposure to a drug after certain time [62]. The drug s half maximum inhibitory concentration IC 50 [63] of cancer cell viability is widely used to measure the potency. Another commonly used parameter is the half maximal effective concentration EC 50 , which measures the concentration of the drug, which induces a response halfway between the baseline and maximum, after a specified exposure time [64]. Computational methods, including support vector machine (SVM), neural networks (NN) and quantitative structure activity representation (QSAR) are widely used to determine the drug sensitivity and design [65]. Jang et al. [66] carried out a systematic assessment of analytical methods for drug sensitivity prediction using CCLE data. They evaluated more than 110,00 models, based on multifactorial experiment design. They found that the model input data, including molecule features and compound are important factor for model performance, followed by the type of algorithms.
In another study, Zou et al. [67] analyzed the 30 most common EGFR mutants. They used MD simulation, and proteinligand interactions footprint (IFP) to analyze the binding modes of EGFR-Gefitinib complexes [68]. Multilinear Principal Component Analysis (MPCA) was used for dimensionality reduction and feature selection. Target projection pursuit [69] was used to show drug sensitivities. The findings show that the IFP features of EGFR-mutant complexes and MPCA based tensor are useful candidates for prediction of drug sensitivity. They used five classifiers for predicting the drug sensitivity, and achieved greater than 90% accuracy.

I. PCA-3D QSAR
Quantitative structural activity representation (QSAR) is applied for establishing a relationship between chemical properties of a compound and their biological activities [70]. Principal component analysis (PCA) is a method to determine the essential dynamics of a protein. It is used to express the dominant motions in the protein system or a MD trajectory. PCA can be used for reducing the dimension of QSAR data matrix [71].
Sukrita et al. [72] performed MD simulations and applied PCA to WT EGFR and mutant T854A. 3-D QSAR model was built using the step forward multiple regression and advanced variable selection. The proposed model was validated using the statistical parameters.
PCA was used to simplify the motion and extract the important component of motion. For evaluation purpose, co-variance matrix was created from all the trajectories and diagonalized to identify set of Eigenvalues and Eigenvectors that correspond to displacement of atom and show the concentrated motion.
The simulation results show higher flexibility in mutant T854A compared to WT. Another parameter known as radius of gyration Rg was calculated and results show higher values of Rg for T854A mutant as well. The WT structure become flexible upon T854A mutation and losses stability in RMSD, RMSF, and Rg.
Computational methods [29], [30], [57], [72], [73] provide useful information about the structure and dynamics of EGFR mutants. The L858R genetic mutations in the kinase domain suppress the local disorder, and provides a series of conformations states, capable of binding Gefitinib. This provides an explanation for effectiveness of Gefitinib in L858R EGFR mutated NSCLC patients. While, the T790M mutant increases the ATP affinity and the escalation in the p-loop and long range communication capability of residues are some of the factors for the failure of the EGFR TKIs in this mutant. The loss of hydrogen bond between threonine and arginine in T790M mutant is also a reason for the drug resistance.
T854A residue is located at the bottom of ATP binding site on C-lobe and in contact with Erlotinib and Gefitinib. The substitution T854A results in the loss of contact and binding affinity to the inhibitors. The loss of stability in T854A and increase in the hinge region further explains the aberrant function. In the docking analysis, the G719S mutant causes the ligand and hinge region to come closer, and T790M mutant causes the ligand to escape from the binding pocket. In Table II, we list several important computational studies on EGFR mutants related to drug resistance.

III. PERSONALIZED DRUG RESISTANCE PREDICTION
Personalized medicine is a growing field of healthcare. It is an individual treatment approach based on the patient's unique clinical, genetic, epigenetic, and environmental information [85]. Disease are heterogeneous, and the ultimate objective of the personalized therapy is to define the disease at molecule level, so that therapeutic agents are targeted towards right population of the people [86], [87], [87]. Keeping this in view, Wang et al. [79] used binding energy of EGFR mutant complex and personal features (age, sex, smoking history, medical history) to build a personalized drug resistance prediction model. They used extreme learning machines (ELMS) [88] to predict the drug resistance level. EGFR-TKI interaction pattern and personal features are used in the prediction model, and an overall accuracy of 95% was achieved. The accuracy of the models is significantly increased with the addition of personal features.
Kureshi et al. [36] established the relationship between patient's personal characteristic and tumor response level. The drug response is associated with EGFR mutation status type and personal features. They applied four classifiers to predict the outcome of EGFR TKI, and achieved an overall accuracy of 76.54% with area under the curve (AUC) equal to 0.76. They showed that the SVM and the decision trees are potential candidates for personalized drug resistance prediction.
1) Geometric Analysis of Drug Binding Site: Analysis of geometrical shape at the drug binding site can also reveal interesting insights about the drug resistance mechanism and structure based drug design (SBDD). The geometrical properties of the protein-drug complex can also be a useful feature for drug response prediction [89]. The concave shape has a higher drug molecule affinity than the convex. Low convex degree shapes bind tightly than the high convex degree shapes.
Ma et al. [80] analyzed the properties of EGFR mutants using structural information. They computed the alpha shape model [90] and solid angle [91] to evaluate the properties of the atom at the binding sites.
MD simulations were performed using Amber [92] suite, and the (CGAL) [93] library is employed to compute the shapes of the EGFR mutants. They normalized the curvature value to [-1, 1], with values falling in the range of [-1, 0] is defined as concave and for [0, 1], the shape is defined as convex. To simplify the problem, they calculated the average convex degree at the drug binding site and obtained the average knob level of mutant by averaging the convex degrees for 200 frames. It is shown that the 90% of the mutant can be grouped together by the knob threshold. To validate the model, they compared the results with the clinical data, to be specific the L858R -T790M mutant showed no drug response which is consistent with our knowledge.
These studies provide useful reference for the personalized drug design for EGFR-mutated NSCLC patients. The simulation results show that the accuracy of the prediction model is significantly increased by adding the personal features. The geometrical features, drug binding site distance, energy and personal features may be combined to construct an efficient personalized drug resistance prediction model.

IV. DEEP LEARNING
Deep learning has shown remarkable results in solving many challenging problems of computer vision, natural language processing, financial model analysis, and bioinformatics [94], [95]. The use of deep learning has also shown promise in pharmaceutical research such as bio-activity prediction, and drug discovery. The convolutional neural network (CNN) has provided excellent results in image recognition problems [96], CNN can also be used to design protein-ligand scoring functions. Recently, a new antibiotics have been discovered from the pool of 100 million molecules using deep learning [97] A deep learning model is proposed to predict the mutation status form CT-scan lung images in [98]. The proposed model predicts the probability of a tumor being EGFR mutant using the CT-scan tumor image as an input. The model consists of two sub-networks, sub-network 1 shares the same structure as first 20 layers of Densenet [99] using transfer learning [100], and sub-network 2 was trained on the EGFR mutation dataset. The proposed model achieved high accuracy (AUC 0.85, 95% CI 0.83-0.88) and also revealed an association between high dimensional CT-scan images and EGFR genotype.
Deep learning can also be used for virtual screening, and to design novel EGFR inhibitors [101]. Recently, a multiinput deep neural network based on attention mechanism [102] is proposed for biological activity prediction of EGFR inhibitors [82]. The proposed model uses the dataset [103] of 3492 compounds labeled as inhibitors or non-inhibitors. The method uses SMILES (Simplified molecule input line entry system [104] which is a way to represent compound molecular structures in in silico study, as an input to the CNN. The CNN generates a 120 dimensional vector, which generates the attention map. The proposed model achieves state of the art accuracy with AUC equal to 90% by cross-validation. Another contribution of this model is the integration of attention layer, which may explain the contribution of each atom on the overall biological activity. The drug response prediction is related to precision medicine. A 1-dimensional CNN (1D CNN) DeepIC50 [83]  is proposed to predict the three-class drug response. The model used genomics profile and drug molecular features from massive drug data on cancer cell lines. The proposed model achieved better accuracy than the baseline methods. Training a deep neural network with a small number of points is a very active research topic in deep learning. One example is the one shot learning [105]. Such a network can be exploited for drug resistance prediction model with small number of clinical samples.

A. Big Data Analytics
Big data analysis and Artificial Intelligence are changing the drug discovery pipeline. The Cancer Therapeutics Response Portal (CTRP) [106] provides a resource to develop new insights into small molecule action mechanism, generate new hypothesis, and personalized drug discovery based on predictive bio-markers. Several big data projects, including Cancer Cell Line Encyclopedia (CCLE) [107], and Genomics of Drug Sensitivity in Cancer (GDSC) [108] have performed large scale molecule screens on panels of hundreds of molecularly characterized cancer cell lines. These projects demonstrates the potential of modern machine learning algorithms to de-velop drug response predictors based on molecular profiling measurements [97].
However, it is important to acknowledge the challenges of big data analytics. The current cancer data resources are not enough for providing an adequate answer to drug resistance and response. In fact, independent procedures analyzing the clinical data may reach different conclusions, aiming to answer the same biological question [109]. Another problem is the inconsistency between datasets and missing clinical information [110]. One solution to missing values is to use data imputation techniques [111].

V. CHALLENGES AND OPPORTUNITIES IN THE CURRENT METHODOLOGIES
Most of the studies discussed in this review are based on MD simulations. MD simulations are a valuable tool in structure-based drug design (SBDD), and play a vital role in understanding the molecular interactions, and conformational changes. However, it is important to acknowledge the limitations of MD simulations, including time limitation, force -field inadequacy and quantum effects [112]. The enormous amount of computational power has made it possible to carry out MD simulations for several microseconds for systems containing hundreds of millions of atoms, yet the time resolution may not be enough for relaxing the system to certain quantities.
Moreover, several biological properties such as protein folding, ligand binding and unbinding may occur at longer time scales. The issue of selecting the force fields remain a significant challenge and MD simulation results are reliable only if the force fields mimic the same force as experienced by the real-world systems. It is important to mention that force field files are parameterized, and they describe varied situation of the same atom-type.
Classical MD-simulations cannot model the chemical reaction of drug substrate and the bonding process of certain covalently bonded ligands. In such cases, quantum mechanics becomes a viable option, which models the system at the electron level. Nevertheless, they require more computational power than the classical method. The reactive force fields are developed to model the chemical activities.
Another challenge faced by the MD simulations is electronic polarization, and quantum effect. The bio-molecules are polarizable, and the electron clouds around the atom constantly changes with the chemical environment. In such a case, we can use quantum mechanics (QM) MD. QM-MD are computationally expensive and limited to small number of atoms. Computationally efficient QM approaches are needed to model large protein-ligand systems at the electron-level.
The binding free energies calculated are also not accurate, and there are errors reported around 1k/cal on average in applicable scenarios [109]. However, there is still merit in calculating binding free energies, as they allow one to distinguish between weak and tight binding. In the personalized models, the data dimensionality remains a significant challenge. Most cancer studies of anticancer drug response have small sample size (less than 100 patients) compared to variable size (human genes > 20,000). Computational methods might not produce robust results, and non-algorithmic solutions are necessary. The high dimensionality problem can be addressed by using feature filtering techniques or sparse principal component analysis [113]. Another solution is the data integration [114]. Integrating all studies together may increase the confidence in results.
Deep neural networks are considered black boxes outside the machine learning community, and often domain experts are needed to interpret the model output [115]. Since most of the studies are connected to patient's health, the logical reasoning of the model will provide useful information [116]. Furthermore, transforming the deep learning black-box to a white-box is an active research topic. Different methods have been developed to interpret the model including backpropagation, exaggeration of hidden representations, and activation maximization [117].

VI. CONCLUSION
In this paper, we explained the drug resistance mechanism in EGFR-mutated NSCLC patients. We discussed several important EGFR mutants and their interactions with EGFR-TKIs. The mutation changes the stability, binding free energy, dynamics and structure of EGFR. In our opinion, the stability is one of the most crucial factor for analyzing the drug response. The stability is the change in the net energy in the unfolded and folded states [118]. It will be interesting to see whether a state-transition matrix can be propose to model the changing states of a protein [119]. Computational methods have shown promise in analyzing the EGFR properties, and produced useful insights about drug resistance mechanisms.
There are various reasons for the failure of EGFR-TKIs. The increase in the ATP affinity, flexibility, loss of stability and the loss of hydrogen bonds at the site 790 are some of the reasons for the T790M mutation. However, some mutants are drug sensitive. The L858R mutant prefers binding to Gefitinib than ATP. The Del 19 mutant has a high drug sensitivity, and L858R and L861Q have moderate, and T790M and T790M-L858R have low drug sensitivity. The personal features also play an important role, and it is observed that the drug sensitivity also depends upon the personal characteristics. We believe that deep understanding of personalized models may result in age specific or gender specific treatment plans.
A variety of rare mutations occur in about 10 -20% of the NSCLC patients, due to higher diversity, proper medication for such mutants is difficult in the daily clinic. A little is known about the effect of these mutations on downstream signaling and interactome [120]. Robust prediction models are needed to predict the sensitivities to such mutants. Predicting the impact of mutations on protein-ligand affinity, structural changes is of great importance. Deep learning architectures based on time-series features can be used to predict the impact of mutations, which can help understand the mutation induced drug-resistance mechanism.
Diagnosing the EGFR-mutant using CT-scan images and deep learning provides non-invasive and an easy solution [121]. By using feature visualization or other method, interesting insights can be obtained. Despite of all these studies and findings, there are still unknown reasons of drug resistance and further studies are required to investigate, and validate the findings. The rise of deep learning, and the enormous amount of digital data, and large computational resources can provide efficient pipeline to improve drug discovery, understand the drug resistance process and provide optimal decision making in treating EGFR-mutated NSCLC patients.

A. Future Work
More clinical data is required to refine the prediction models and deep neural networks can be exploited to increase the prediction accuracy. The interpretation of deep neural networks may produce useful insights about the drug resistance mechanisms. The small dataset problem may be addressed by using the matching networks [122]. Moreover, data augmentation can be used to create virtual clinical samples. Combination of long MD simulations, and usage of recently developed tools, such as, DeepMD [123] with large clinical data may pave the way for further studies.
The high arithmetic and intrinsic parallelism of graphical processing units (GPUs), tensor processing units (TPUs) can be exploited to perform longer MD simulations. ACEMD is a MD simulation program, capable of providing supercomputer level performance of 40-ns/ day for protein systems with more than 23,000 atoms [124]. AceCloud [125] is another cloud-based MD program, capable of running hundreds of MD simulations. Moreover, the development of efficient open source libraries for the analysis of molecular dynamics trajectories, such as MDAnalysis [126], MDTraj [127], and online web applications [128] provides opportunity for flexible and fast framework for complex analysis programs. In addition, online freely drug-databases, such as drug-bank [129] can be exploited for future drug design. Another important research direction is to design decision support systems, based on clinical features. Such systems can help doctors in devising optimal treatment strategies.
We believe that data analysis methods will play a vital role in future cancer research [130]. Several Bio-tech companies are using artificial intelligence to enhance the drug development process, from candidate screening to trial management [131]. We speculate that decades from now, personalized drug models will be used in treatment of NSCLC patients. We further believe that the combination of computational methods and clinical studies can provide useful recommendations, for precision and personalized medicine. His research interests include novel radiation therapy, head and neck cancer, lung cancer, liver cancer, colorectal cancer and brain tumours, as well as genetic and molecular studies on nasopharyngeal cancer and lung cancer Hong Yan received his PhD degree from Yale University. He was Professor of Imaging Science at the University of Sydney and is currently Chair Professor of Computer Engineering at City University of Hong Kong. His research interests include image processing, pattern recognition and bioinformatics, and he has over 600 journal and conference publications in these areas. Professor Yan is an IEEE Fellow and IAPR Fellow, and he received the 2016 Norbert Wiener Award from the IEEE SMC Society for contributions to image and biomolecular pattern recognition techniques.

IX. HIGHLIGHTS
• Computational methods for drug resistance analysis in EGFR-mutated NSCLC patient are discussed in detail. • Personalized drug resistance prediction models are also evaluated. • Limitations in the current computational methodologies and the way forward are also highlighted. • Properties of EGFR mutant structures, including stability, binding free energy and dynamics and structure are critically evaluated. • This article provides useful information for understanding the drug resistance mechanism in EGFR-mutated NSCLC patients, using computational methods and sheds some light on the future.