Chatbot Analytics Based on Question Answering System: Movie Related Chatbot Case Analytics

Question answering (QA) systems are built to retrieve accurate and concise answers to human queries posed in natural language. The primary goal of a QA system is efficient and natural interaction between machines and humans. To this end, much research has been directed towards deep learning based Natural Language Processing (NLP). With the rise of a variety of deep NLP models, it is now possible to obtain vector representations of words and sentences that capture their meaning in context. These representations considerably aid deep learning models in understanding the semantics and syntax of natural human language. In this paper, research is conducted on a chatbot-based QA system. The sequence-to-sequence (seq2seq) model proposed by Ilya Sutskever et al., 2014, lays the foundation for the chatbot model built in this paper. The Cornell Movie-Dialogs Corpus created at Cornell University and the Movie Dialog Dataset created at Facebook are preprocessed and used to train the chatbot. The encoder and decoder of the seq2seq model consist of LSTM cells and are defined using the bidirectional dynamic RNN and dynamic RNN decoder functions of the TensorFlow library. Additionally, to ensure the chatbot performs well on long sentences, the attention mechanism from the TensorFlow library is applied to the decoder.


I. INTRODUCTION
Existing information retrieval systems retrieve a list of documents in response to queries posed by a human in natural language. An everyday example is a Google search. The list of documents presented to the user may contain the relevant data; however, the user still has to do a considerable amount of work to extract the valuable information from it [1] [2]. Hence, with the need for concise and exact answers to user queries, question answering (QA) systems came into existence. A QA system derives its answers from a variety of sources: unstructured data (e.g., Web pages, blogs), semi-structured data (e.g., Wikipedia), and structured knowledge bases [3]. Based on the domain of the questions it answers, a QA system can be classified as open-domain or closed-domain. A closed-domain QA system answers questions in a specific field, whereas an open-domain QA system gives the user the freedom to raise queries on any topic [4]. The QA system proposed for the travel domain in [5] is a suitable example of closed-domain QA, whereas the QA system developed for factoid questions (what, when, who, which, how) is an open-domain QA system [6]. One of the metrics for evaluating a QA system is the accuracy of the answer. One way to reach the desired answer is to classify the question into different categories [7]. Recent advancements in NLP and deep learning have led to the application of neural networks such as RNNs and BiLSTMs in building open-domain QA systems [8]. These neural networks are trained on natural human language, mostly represented through word embeddings. Today, QA is an influential discipline within NLP that relies on increasingly sophisticated techniques [9]. QA systems are grouped into four categories: chat robots, knowledge-base QA, QA retrieval systems, and QA based on free text [10]. This paper concentrates on a chat robot-based QA system, also known as a chatbot.

II. CRITICAL REVIEW
Research on the problem of human-machine interaction has been conducted for a very long time. The earliest chatbot, ELIZA, was developed at MIT and dates back to the 1960s. ELIZA follows a rule-based design and operates on pattern matching and substitution algorithms. It gives the illusion of a program that understands [11]. ELIZA was built to offer Rogerian psychotherapy to patients by asking personal questions and engaging in a long conversation [12]. The bot functions by analyzing the patient's response and performing keyword matching against a set of predefined templates to generate a formatted string response. The bot demonstrated the capability to provide aid to patients; however, its effectiveness could not match that of a human therapist. Additionally, because of its rule-based approach, ELIZA lacked the ability to learn new patterns from interaction and to perform logical reasoning.
Today, with recent innovations in machine learning and NLP, cloud-based chatbot platforms have emerged. Various commercial tools, such as the IBM Watson Conversation service, have been established to assist in developing chatbots. Neha Godse et al., 2018, showcased the application of the cloud-based chatbot platform IBM Watson Conversation service to Information Technology Service Management (ITSM) in software companies [13]. The authors propose a chatbot that helps an employee within the company resolve IT-related issues without raising a ticket. The chatbot takes user input in natural language, wraps it in a JSON object, and sends it to IBM Watson using plugins. In IBM Watson, intents, entities, and a dialog flow are defined for generating the response. Intents specify the purpose of the user query, entities specify the context of the intents, and the dialog flow represents the conversation nodes that are created based on the intents and entities. The conversation node's response is then passed on to the user. Heru Santoso et al., 2018, present the application of another cloud-based chatbot platform, DialogFlow, to university admission services [14]. The authors propose a chatbot called DINA (Dinus Intelligent Assistant) that responds to students' university admission queries. The methodology used by these authors is well suited to building a closed-domain chatbot, for example for hotel reservations, in which an expert can manually define the intents, entities, and dialog. However, it might not be an optimal solution for open-domain chatbots, as the conversation scope is too broad. Additionally, the chatbot's ability to learn might be limited.
With messaging services becoming ubiquitous and the rise of deep learning and NLP techniques, scalable and critical learning became achievable, which contributed to the emergence of intelligent chatbots, an instance of which is A Neural Conversational Model [15] [17]. These chatbots do not depend on a predefined knowledge base or rules. They are deep learning models trained on conversation samples, which gives them the ability to respond to open-domain queries. A chatbot may belong to the generative or the retrieval-based model category [18]. The Automated Thai-FAQ Chatbot using RNN-LSTM proposed by Panitan et al., 2018, is an example of a retrieval-based model [19]. A retrieval-based model learns to select a suitable response from a set of predefined responses, whereas a generative model generates a new response from scratch. A generative model trained on a vast corpus can generate better responses than a retrieval-based model. The Neural Conversational Model is a generative model. It involves training a neural network on extensive data to build a conversational model that can converse in natural human language. The sequence-to-sequence framework proposed by Ilya Sutskever et al., 2014, laid the path to the Neural Conversational Model [16] [17]. The initial application of this framework involved neural machine translation and archival [20] [17]. Huyen Nguyen et al., 2017, present an instance of a neural chatbot based on the seq2seq model [22]. The authors propose an open-domain response generator that imitates characters from popular TV shows. The model was trained using five different datasets and was evaluated using the automatic metrics BLEU and ROUGE as well as human judgment. Based on the results presented in that paper, the bot is able to communicate fluently, demonstrating the capability of a seq2seq model.
Taking advantage of the seq2seq model, this paper focuses on applying it to build a closed-domain chatbot for movie-related questions and answers.

III. PROBLEM DEFINITION
Traditional chatbots depend heavily on hand-written rules. Their architecture includes a fixed template and some NLP-based analytical method to generate the response. Also, these chatbots respond only to questions limited to a specific domain. Given the recent advances in neural networks and NLP, in this paper we experiment with a chatbot based on the Neural Conversational Model suggested in [17]. A Neural Conversational Model can be trained end-to-end, eliminating the bottleneck caused by predefined hand-written rules. It is based on the seq2seq framework, which involves the use of LSTMs. A well-known application of the seq2seq framework is machine translation.
In recent years, researchers have also started applying the seq2seq framework to build deep NLP based chatbots; most of these studies are performed on open-domain datasets that enable the chatbot to converse with a human in natural language. In this paper, the focus is on utilizing the power of a seq2seq model to build a chatbot that can respond to closed-domain questions related to movies. For this study, the Cornell Movie-Dialogs Corpus created at Cornell University and the Movie Dialog Dataset created at Facebook are used to train the chatbot.

IV. METHODOLOGY
In this research, we create a prototype of a seq2seq model-based chatbot by first training the model on the Cornell Movie-Dialogs Corpus. This corpus helps the model build general conversational ability. In the second stage, the model is trained on the closed-domain Movie Dialog Dataset, which enables the chatbot to respond to queries specific to films. The following steps are performed to build the seq2seq model-based chatbot.

A. Pre-processing
The preprocessing of the Cornell and Facebook movie datasets is divided into multiple steps.
a) Data Harmonization: Because the Cornell Movie-Dialogs Corpus and the Facebook Movie Dialog Dataset exist in different formats, the initial step is to convert both datasets into a single unified form. To achieve a standard data representation, we represent all the data in question and response format. The question and answer lists have a one-to-one mapping. This one-to-one mapping assists in the later preprocessing steps and also simplifies model training. This step is implemented using basic Python functionality.
b) Data Cleaning: As part of data cleaning, the questions and answers generated in the previous step are first converted to lower case; this reduces the size of the word vocabulary, making it easier and more efficient for the model to learn. The lower-case questions and answers are then processed to expand contractions and remove punctuation from the text. The list of contractions to be expanded and the punctuation to be removed is currently a static list that can be modified to add new conditions. For example, if the text is "That's all I've to say," the cleaned version would be "that is all i have to say".
c) Word Vocabulary: To generate the word vocabulary, the following steps are performed.
• Word Tokenization: In this step, the questions and answers are broken down into lists of words. Word tokenization can be achieved using the word_tokenize function of the Python NLTK library or using elementary Python functions. For example, the text "I told you, this is not you" would be tokenized to the list [i, told, you, this, is, not, you].
• Word occurrence count: At this step, we generate a dictionary of the words in the corpus together with their counts. The word count can be performed using NLTK's FreqDist class or by iterating over the whole dataset.
• Frequent words selection: The dictionary generated in the previous step is used to filter out words that occur fewer times than a given threshold. Limiting the vocabulary is essential because an overly large vocabulary may adversely affect model training.
• Mapping frequent words: Each frequent word identified is mapped to a unique integer.
• Tokens: The <SOS> and <EOS> tokens are used to indicate the start and the end of a sentence for a seq2seq model. Additionally, the <PAD> token is used to pad the questions and answers so that they are all of the same length, and the <OUT> token is used to represent the non-frequent words in a sentence. Each token is assigned a unique number.
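The cleaning and vocabulary steps above can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the study: the contraction list, the function names, the ids given to the special tokens, and the use of whitespace tokenization in place of NLTK are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical contraction list; the static list used in the study is not given.
CONTRACTIONS = {
    "that's": "that is",
    "i've": "i have",
    "i'm": "i am",
    "don't": "do not",
    "can't": "cannot",
}

# Special tokens named in the text; the ids assigned below are an assumption.
SPECIAL_TOKENS = ["<PAD>", "<EOS>", "<OUT>", "<SOS>"]


def clean_text(text):
    """Lowercase, expand contractions, then drop punctuation."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Keep ASCII letters, digits and whitespace only (non-ASCII letters are dropped too).
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return " ".join(text.split())


def tokenize(text):
    """Whitespace tokenization of an already cleaned sentence."""
    return text.split()


def build_vocab(sentences, min_count):
    """Map special tokens and words seen at least min_count times to integers."""
    counts = Counter(word for s in sentences for word in tokenize(s))
    vocab = {token: idx for idx, token in enumerate(SPECIAL_TOKENS)}
    for word, count in counts.items():
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab
```

For instance, `clean_text("That's all I've to say,")` yields the cleaned sentence from the example in b), and `build_vocab` with a threshold of 2 keeps only words that occur at least twice in the corpus.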

d) Dataset Vocabulary: The word vocabulary generated in the last step is used to represent each word in the question and answer lists as an integer. Since only frequent words are assigned a unique number, the non-frequent words that appear in the question and answer lists are replaced with the numeric representation of the <OUT> token.
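As a small illustration of this encoding step, the sketch below converts sentences to integer lists and substitutes the <OUT> id for unknown words. The toy vocabulary, the token ids, and the choice to append <EOS> to the answer are assumptions made for the example, not details taken from the study.

```python
# Hypothetical special-token ids; any unique integers would work.
PAD, EOS, OUT, SOS = 0, 1, 2, 3


def encode(sentence, vocab):
    """Replace each word by its id, using OUT for words missing from the vocabulary."""
    return [vocab.get(word, OUT) for word in sentence.split()]


# Toy vocabulary standing in for the one built from the corpus.
vocab = {"<PAD>": PAD, "<EOS>": EOS, "<OUT>": OUT, "<SOS>": SOS,
         "who": 4, "directed": 5, "the": 6, "movie": 7}

question = encode("who directed the film", vocab)   # "film" is not in vocab
answer = encode("the movie", vocab) + [EOS]         # mark the end of the answer
```

Here the word "film" falls below the frequency threshold in this toy setting, so it is encoded with the <OUT> id instead of being dropped, which preserves the sentence length.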

B. Model Building
This study makes use of the seq2seq framework proposed by Ilya Sutskever et al., 2014. The RNN serves as the base for this structure; however, since the vanilla RNN suffers from the vanishing gradient problem, the Long Short-Term Memory (LSTM) variant is preferred for building a conversational model. The seq2seq framework comprises two LSTMs, one for encoding and one for decoding [20]. The seq2seq model takes an input sequence, processes it one word at a time, and generates an output sequence one word at a time. Neha Godse et al., 2018, give the mathematical description of the seq2seq model as follows: given an input sequence X fed one element at a time to the seq2seq encoder, the encoder converts the input to a fixed-size vector c [13]. With c as input, the decoder then predicts the probability of the output sequence Y. The objective, which maximizes the generation probability of Y conditioned on X, is shown in Equation (1).

$$\rho(y_1, \ldots, y_T \mid x_1, \ldots, x_T) = \prod_{t=1}^{T} \rho(y_t \mid c, y_1, \ldots, y_{t-1}) \tag{1}$$

Fig. 1 below represents the general architecture of a seq2seq model. In the figure, the encoder takes as input the text "How is the food" one word at a time and generates a context vector. The context vector is then passed to the decoder, which maps the input words to the output words. For this study, the model is built using the TensorFlow library. Because we train the model on two different datasets, each section of the data preprocessing step is implemented as a Python class, with the section output stored in an instance variable of the class. This enables us to retrieve the output for both datasets at any stage of the implementation.
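To make the factorization in Equation (1) concrete, the following sketch computes the probability of an output sequence as the product of per-step conditional probabilities, accumulated in log space for numerical stability. The per-step distributions are made-up numbers standing in for decoder outputs; in a real model each distribution would depend on c and on the previously generated words.

```python
import math


def sequence_log_prob(step_distributions, target):
    """Sum the log of each step's conditional probability for the target word."""
    return sum(math.log(dist[word])
               for dist, word in zip(step_distributions, target))


# Toy decoder outputs: one distribution per output position (made-up numbers).
steps = [
    {"it": 0.5, "is": 0.25, "good": 0.25},
    {"it": 0.25, "is": 0.5, "good": 0.25},
    {"it": 0.25, "is": 0.25, "good": 0.5},
]

log_p = sequence_log_prob(steps, ["it", "is", "good"])
probability = math.exp(log_p)  # product of the three conditional probabilities
```

Training a seq2seq model by maximizing this quantity is equivalent to minimizing the summed cross-entropy of the target words.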

A. Model Encoder
We define the seq2seq encoder by first defining a basic LSTM cell using the BasicLSTMCell class of TensorFlow. A dropout of 0.5 is applied to the LSTM cell, and the LSTM cells are then stacked sequentially using the MultiRNNCell class of TensorFlow. The bidirectional_dynamic_rnn function is then used with the stacked LSTM cells in both the forward and backward directions to generate the encoder state and encoder output.
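The idea behind a bidirectional LSTM encoder can be illustrated without TensorFlow. The sketch below is a toy, pure-Python stand-in and not the TensorFlow code used in the study: the class and function names are hypothetical, the weights are random, and dropout and cell stacking are omitted for brevity.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTMCell:
    """Single LSTM cell with randomly initialised weights (illustration only)."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        cols = input_size + hidden_size
        # One weight matrix and bias per gate: input, forget, output, candidate.
        self.w = {g: [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                      for _ in range(hidden_size)] for g in "ifoc"}
        self.b = {g: [0.0] * hidden_size for g in "ifoc"}

    def _gate(self, name, xh, squash):
        return [squash(sum(w * v for w, v in zip(row, xh)) + b)
                for row, b in zip(self.w[name], self.b[name])]

    def step(self, x, h, c):
        xh = x + h  # concatenate the input with the previous hidden state
        i = self._gate("i", xh, sigmoid)
        f = self._gate("f", xh, sigmoid)
        o = self._gate("o", xh, sigmoid)
        g = self._gate("c", xh, math.tanh)
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new


def run(cell, inputs):
    """Unroll the cell over a sequence; return all outputs and the final state."""
    h = [0.0] * cell.hidden_size
    c = [0.0] * cell.hidden_size
    outputs = []
    for x in inputs:
        h, c = cell.step(x, h, c)
        outputs.append(h)
    return outputs, (h, c)


def bidirectional_encode(fw_cell, bw_cell, inputs):
    """Run one cell forward and one backward, then join outputs per position."""
    fw_out, fw_state = run(fw_cell, inputs)
    bw_out, bw_state = run(bw_cell, inputs[::-1])
    bw_out = bw_out[::-1]  # re-align backward outputs with input positions
    outputs = [f + b for f, b in zip(fw_out, bw_out)]
    return outputs, (fw_state, bw_state)
```

Each encoder output concatenates a forward summary of the words seen so far with a backward summary of the words still to come, which is what gives the decoder access to context from both directions.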

B. Model Decoder
The training and testing decoders are defined separately. An attention mechanism is applied to both so that the seq2seq model can operate better on long sequences. The training and testing decoders are defined using the attention_decoder_fn_train and attention_decoder_fn_inference functions of the TensorFlow library, respectively. Both decoders take the encoder state as input, along with the attention variables. Both decoder functions are then passed to the dynamic_rnn_decoder function defined in the seq2seq library of TensorFlow to generate the training and test predictions. Also, before the training decoder output is returned to the user-defined seq2seq function, dropout is applied to it, and it is fed to fully connected layers.
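The role of attention at a single decoder step can be sketched as follows. This example uses a simple dot-product score, which is a simplification chosen for illustration and not necessarily the scoring function used by the TensorFlow attention functions named above; the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_outputs):
    """Return attention weights and the context vector for one decoder step."""
    # Score each encoder output by its dot product with the decoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_outputs]
    weights = softmax(scores)
    size = len(encoder_outputs[0])
    # The context vector is the attention-weighted sum of the encoder outputs.
    context = [sum(w * enc[k] for w, enc in zip(weights, encoder_outputs))
               for k in range(size)]
    return weights, context
```

Because the context vector is recomputed at every decoder step, the decoder is not limited to a single fixed-size summary of the input, which is why attention helps on long sentences.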

C. Seq2Seq Model
A seq2seq function is defined to represent the seq2seq model. This function internally calls the encoder function and the training and testing decoder functions to obtain the training and testing predictions.
The training data used to train conversational models is generally in the form of natural human conversation. To apply a neural network such as seq2seq to a natural language task, each word in the training data has to be transformed into a numerical representation as part of data preprocessing [21]. Word embedding techniques convert each word into a vector of real numbers. Pre-trained models are also available for converting words to vectors. For this study, we use the TensorFlow embed_sequence layer to embed the input data before passing it to the encoder.
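Conceptually, an embedding layer is a lookup table from word ids to vectors. The sketch below shows only the lookup; the random table and the function names are assumptions for illustration, whereas in the actual model the table entries are parameters learned during training.

```python
import random


def make_embeddings(vocab_size, dim, seed=0):
    """Create a random embedding table with one row per word id."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(vocab_size)]


def embed(ids, table):
    """Look up the vector for each word id in an encoded sentence."""
    return [table[i] for i in ids]


table = make_embeddings(vocab_size=8, dim=4)
vectors = embed([4, 5, 4], table)  # repeated ids map to the same vector
```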

D. Model Training and Testing
The model is trained by splitting the inputs into batches and feeding the batches to the model for a given number of epochs. During training, the model is also fed with answers along with questions so that the model can learn through backpropagation.
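The batching described above can be sketched as a generator that yields padded question and answer batches. Padding each batch to the length of its longest sequence, the value of the padding id, and the function names are assumptions made for this example.

```python
PAD = 0  # hypothetical id of the <PAD> token


def pad_batch(sequences):
    """Right-pad every sequence in the batch to the length of the longest one."""
    longest = max(len(s) for s in sequences)
    return [s + [PAD] * (longest - len(s)) for s in sequences]


def batches(questions, answers, batch_size):
    """Yield (padded_questions, padded_answers) pairs of at most batch_size items."""
    for start in range(0, len(questions), batch_size):
        q = questions[start:start + batch_size]
        a = answers[start:start + batch_size]
        yield pad_batch(q), pad_batch(a)
```

One epoch corresponds to iterating once over all batches produced by `batches`; the last batch may be smaller than `batch_size`.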
To ensure that the model does not overfit, the data is divided into training and validation sets. The model is run on the validation set every 100th batch. The training and validation losses are computed, and the Adam optimizer is used to minimize the training loss. If the validation loss improves, the model weights are saved to a local file, and training continues. If the model repeatedly fails to improve on the validation set, early stopping is applied, indicating that the model cannot perform any better. The saved weight file is then used to test the model. The model is evaluated based on human judgment. Table 1 below represents the hyperparameters used for training the model.
Currently, due to limited GPU and computational power, the chatbot takes a considerable amount of time to train. As part of this experiment, we trained the chatbot for ten epochs. However, the results obtained were not good enough: during testing, the chatbot provided the same single answer for every question.
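The validation schedule and early stopping logic described in this section can be sketched as below. The sequence of validation losses is simulated, and the `patience` parameter (the number of consecutive non-improving checks tolerated) is a hypothetical name and value, since the text does not state the threshold used.

```python
def is_validation_step(batch_index, every=100):
    """True on every 100th batch, counting batches from 1."""
    return batch_index % every == 0


def early_stopping(validation_losses, patience):
    """Scan validation checks; return (best_loss, best_check, stopped_early)."""
    best_loss = float("inf")
    best_check = None
    bad_checks = 0
    for check, loss in enumerate(validation_losses):
        if loss < best_loss:
            # In a real run, the model weights would be saved to disk here.
            best_loss, best_check = loss, check
            bad_checks = 0
        else:
            bad_checks += 1
            if bad_checks >= patience:
                return best_loss, best_check, True
    return best_loss, best_check, False


# Simulated validation losses, one per validation check (made-up numbers).
losses = [2.0, 1.5, 1.6, 1.4, 1.45, 1.5, 1.41, 1.3]
result = early_stopping(losses, patience=3)
```

With a patience of 3, the scan stops after three consecutive checks without improvement and the weights saved at the best check are the ones used for testing; a larger patience lets training continue through temporary plateaus.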

VI. CONCLUSION
In this paper, an attempt has been made to understand the importance of a neural network-based chatbot system for movie-related queries. With the rise of neural networks and NLP, it is possible to extend the chatbot to automate other critical problems. The seq2seq model presented in the paper was built using the TensorFlow library. The performance of the model was limited by the short training; however, the study showcases the use of a data-driven approach to chatbot building. Better performance may be achievable with more powerful computational resources and modifications to the training hyperparameters. Additionally, the textual dataset passed to the model was preprocessed using user-defined functions. Advanced functions in libraries like NLTK and sklearn offer another area to explore for improving the data preprocessing.
ACKNOWLEDGMENT
I want to thank Dr. Sabah Mohammed for his encouragement and supervision throughout this research project. This research work is part of the COMP5800 Research Methodology Course at Computer Science, Lakehead University, Winter 2020, supervised by Prof. Dr. Sabah Mohammed.