Prompt scoring system for dialogue summarization using GPT-3

Recent results in natural language processing show that language models are capable of performing several natural language tasks without supervised learning. A challenging task for pre-trained language models is dialogue summarization. One way of generating summaries is to engineer prompt templates for few-shot training. However, a static approach to creating prompts leads to unreliable outcomes across different classes of dialogues. Focusing on the structural properties of dialogues, we propose a scoring system to improve few-shot training performance. We build tuned prompts composed of the highest-scored dialogue samples. Our evaluation, based on ROUGE scores and human judgment, shows an improvement in the experiments that use the scoring system. All experiments are performed within the framework of the GPT-3 API, using different engines for comparison. Moreover, the human evaluation we conducted showed that the number of failures decreased by 11% after applying our scoring system.


I. INTRODUCTION
Language models have evolved significantly in recent years. Whereas task-specific models showed that they can attain very good results in only one direction [1] [2], language models proved that they can handle a variety of NLP tasks without supervised learning. Colossal models such as GPT-3 are already used by thousands of developers. Few-shot learning without weight updates [3] is one reason for this, as it enables fast development of applications in several directions (classification, semantic search, content generation, summarization and so forth). The GPT series is based on the Transformer architecture [4], which relies on self-attention mechanisms [5] [6]. Dialogue summarization is a challenging problem that can be tackled using state-of-the-art NLP technologies. In this study we analyze the possibility of using such models for abstractive summarization of conversations. The aim is to perform generic and informative single-document summarization [7]. This can be done using GPT-3 in a few-shot setting, meaning that one constructs a prompt consisting of one or more summarized dialogues along with the input dialogue, which is given to the model for completion. The fact that this requires only a few training samples, without changing any of the model weights, has several advantages over classic fine-tuning: (i) there is no need to create another model, so we save memory space and time, and (ii) even if corpora for fine-tuning are lacking, we can still obtain reliable results based only on prompt tuning. Online communication platforms and mobile chat applications can implement functionalities based on dialogue summarization: multiple notifications can be replaced by a few summaries, providing a better user experience. Given the massive activity on chat applications nowadays, such a feature would match the needs of users.
However, one should take into account that processing an enormous volume of messages using models like GPT-3 demands large computational costs. This leads to the problem of cost minimization. Another objective of this work is therefore to study how summarization performance behaves when fewer computational resources are used.

A. Related work
As far as we know, no solutions have been explored in this direction (i.e. prompt tuning for dialogue summarization). Nevertheless, improving few-shot training performance is a problem of interest, as there can be instabilities in GPT-3's performance depending on the way the prompt is chosen [8]. A solution based on contextual calibration was proposed for tasks such as text classification, fact retrieval and information extraction [9]. Fine-tuning approaches have also been investigated, showing that prompt-based tuning increases language model performance [10] [11]. It is known that the GPT series of models are multitask learners [12], but different tasks demand different prompt tuning approaches. Our focus is to find the best way of tuning the prompts for dialogue summarization. Research in dialogue summarization has started to gain popularity in the last two years [13]. There is an impressive increase in data-sets for the summarization of chat conversations [14] [15] [16], and several models have been developed [17]-[26]. The main known problem of dialogue summarization occurs when the summary provides wrong references [13] [27], distorting the information from the original dialogue, a phenomenon also known as hallucination [28], [29].

B. Solution details
We aim to improve the quality of the summaries generated in a few-shot training regime by choosing the best picks for the training samples. For that, we establish a simple but efficient scoring system which takes into account the dialogue content, size and number of active participants in the conversation (Section 3) in order to find feature similarities between dialogues. Two similar conversations will obtain a higher score than two different ones. Firstly, we calibrate the scoring system configuration in order to achieve the best performance (Section 4). We separate each component and prove its relevance to the system. The scoring system is initialised with a weight distribution based on the initial experiments; the final distribution is established after several variations. Secondly, we vary the model temperature. We noticed a performance increase of around 25% for low temperatures, so we continued the experiments at a low temperature of 0.25. More precise determinations can be made for each parameter we consider in our evaluations. However, at the moment there is no proof that the scoring system configuration we use will behave at its best for any other data-set; further work will be needed to analyze that.

C. Evaluations
The performance evaluation is based on the ROUGE scores [30]. The established scoring system is evaluated with respect to the results obtained with randomly generated prompts (Section 5). We observe an improvement in the ROUGE scores for the summaries generated with the score system (SS) we propose. Apart from evaluating the scoring system, we also tested the summarization behaviour of different GPT engines. We noticed that SS improves the summarization performance for each engine. Lastly, we performed a human evaluation and compared the results with those obtained with the ROUGE-1 and ROUGE-L F1 metrics.

A. Datasets
The data-sets on which we rely in our experiments are briefly described below. They are large datasets that can be used in training for dialogue summarization:
1) SAMSum Corpus (SCd), a human-annotated dialogue dataset for abstractive summarization [14], published by Samsung researchers in 2019. It contains 16369 messenger conversations created by linguists.
2) DialogSum (DSd), a real-life scenario dialogue summarization dataset [15], similar to the SAMSum Corpus but summarizing spoken conversations instead. It contains 13460 manually labeled summarized dialogues.
We decided to use mainly the SAMSum Corpus (SCd) for most experiments, because it is currently the largest provider of conversations similar to those held in online chats. These conversations are generally natural dialogues in English. The topics are diverse: we can find discussions between friends and colleagues, more formal discussions, and even some inappropriate language. Another interesting aspect of SAMSum is that it was created by linguists, who were asked to reproduce common discussions reflecting reality. The fact that the dialogue samples are not very long (in token count) and the variety of topics are the two main reasons for using the SAMSum Corpus in our study.

B. Prompt generation
A preliminary processing step takes place for the dialogues considered for prompt shot selection, the selection pool (SP). We construct the term frequency matrix (TF) and compute the inverse document frequency coefficients (IDF). We also retrieve the token count and the number of persons participating in the conversation for each dialogue in SP. The input dialogue to be summarized is analysed in the same way. The scoring system then assigns each entry in the selection pool a score with respect to the features of the input dialogue. The k dialogues with the largest scores are picked, and, along with their reference summaries, form the prompt, ordered ascending by score. The prompt includes the following components: the instruction hint for summarization (first row), the selected examples and the delimiters ('"""' in our case). We provide an example of a two-shot prompt in Table VII.
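The prompt layout described above can be sketched as follows (a minimal illustration; the helper name and field layout are our assumptions, not the paper's actual code):

```python
# Hypothetical sketch of the prompt assembly: instruction hint first, then the
# k selected dialogue/summary examples in ascending score order, then the
# input dialogue, which the model completes with a summary.
DELIM = '"""'

def build_prompt(instruction, examples, input_dialogue):
    """examples: list of (dialogue, summary) pairs, pre-sorted ascending by score."""
    parts = [instruction]
    for dialogue, summary in examples:
        parts.append(f"{dialogue}\n{DELIM}\n{summary}\n{DELIM}")
    # the model's completion after the final delimiter is the generated summary
    parts.append(f"{input_dialogue}\n{DELIM}")
    return "\n".join(parts)
```

The highest-scored example is placed last, directly before the input dialogue, since it has the most influence on the completion.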

C. Implementation details
In our experiments SP consists of the training part of SCd, and the tests are taken one by one using input dialogues from the testing part of SCd. The data reduction of the SP is performed before testing. A diagram illustrating the whole process is shown in Figure 1. For each SP dialogue we compute and store the features mentioned earlier (TF, IDF, token count, attendance and ID). The scoring system retrieves information about the SP dialogues only by searching through these stored features. Then we use the IDs of the k highest-scored dialogues to retrieve their content and summaries. Finally, we construct the prompt in the order imposed by the scores and generate the summary using the GPT-3 engine. The prompt is engineered so that the summary corresponds to the prompt completion. For testing, we typically repeated this procedure for 50, 100 or 200 runs of the model in the same configuration.
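The selection step can be sketched as a top-k search over the precomputed SP features (the dictionary keys and the toy scoring function below are hypothetical):

```python
def top_k_ids(input_feats, sp_features, score_fn, k=2):
    """Return the IDs of the k highest-scoring SP dialogues, in ascending
    score order, matching the order in which they enter the prompt."""
    ranked = sorted(sp_features, key=lambda feats: score_fn(input_feats, feats))
    return [feats["id"] for feats in ranked[-k:]]

# Toy usage: score by token-count closeness only (illustrative, not the
# paper's full content/size/attendance score).
pool = [{"id": 1, "tokens": 80}, {"id": 2, "tokens": 100}, {"id": 3, "tokens": 300}]
closeness = lambda a, b: -abs(a["tokens"] - b["tokens"])
picked = top_k_ids({"tokens": 100}, pool, closeness, k=2)  # ascending by score
```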

III. SCORING SYSTEM
Three aspects are considered in order to pick the best dialogues for the few-shot training of the model: content, size and attendance. Each of them is characterised by a corresponding score and weight. We establish the weights based on empirical results. The quantities for which an optimal result is sought are the ROUGE coefficients [30] and the scores given by human evaluations.

A. Content
The content evaluation is based on the TF-IDF approach [31]. Using the BERT tokenizer from the transformers library [32], we determined the token distribution for each dialogue and computed the TF-IDF weights

w_td = (f_td / f_md) * log(N / n_t),

where f_td is the frequency of token t in dialogue d, f_md is the maximum frequency of a token in dialogue d, N is the total number of dialogues and n_t is the number of dialogues in which token t can be found. A score is given by the cosine similarity between an input dialogue and a dialogue from the training dataset. Let W_i and W_d be the corresponding TF-IDF weight vectors. Then the content score, κ, is the cosine similarity of the two vectors.

[Fig. 1. Diagram illustrating the prompt generation process. The preliminary data reduction is performed only once, before using the scoring system. The dataset was stored in a JSON file during our experiments. Before the prompt is handed to the model (a GPT-3 engine), the dialogues are ordered ascending by score. The model completes the prompt, returning the summary.]
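A minimal sketch of the content score κ, assuming sparse TF-IDF weight dictionaries (in the actual pipeline the token counts would come from the BERT tokenizer):

```python
import math
from collections import Counter

def tfidf_weights(tokens, idf):
    """TF-IDF weights with max-frequency normalisation: (f_td / f_md) * idf_t."""
    tf = Counter(tokens)
    f_max = max(tf.values())
    return {t: (f / f_max) * idf.get(t, 0.0) for t, f in tf.items()}

def content_score(w_i, w_d):
    """Cosine similarity kappa between two sparse TF-IDF weight vectors."""
    dot = sum(w_i[t] * w_d.get(t, 0.0) for t in w_i)
    n_i = math.sqrt(sum(v * v for v in w_i.values()))
    n_d = math.sqrt(sum(v * v for v in w_d.values()))
    return dot / (n_i * n_d) if n_i and n_d else 0.0
```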

B. Size
We aim to find dialogues having a length similar to that of the input dialogue. We can also define variables on which to rely for cost minimisation or faster queries; in these cases, the quality of the dialogues selected from the training data-set would not be the top priority. Let S be the set of dialogue pairs (d, d′) from which we search for the training shots. For each selected pair (d, d′) we compute a length similarity coefficient λ, given by a function f_λ which should obey several conditions. The last of these conditions increases the chances of choosing the shorter conversation when there are multiple ones with similar content: there is no reason to increase the number of tokens in the query handed to the model if the content score is the same. From the distribution of token counts (Figure 2), one can observe that short conversations are much more numerous, so the probability of finding content similarities among them is high. Thus, in most cases we can perform the search only through a pool of shorter conversations. The maximum score is achieved when we pick a conversation with exactly the same token count; if no such sample is found, we try to find one with a small error. As with a standard deviation, an interval can be defined within which samples score highly. Indeed, we could choose a normal distribution for scoring, yielding a symmetrical Gaussian curve, but this does not help if we are looking to minimise costs. Shorter conversations help in creating a low-token-count prompt. Therefore, we need an asymmetrical curve in order to boost the probability of picking short dialogues. In this work we tested the performance of the asymmetrical double sigmoid (ads) function in two configurations: narrow and broad.
Different ads trends can be obtained by modifying the free parameters c_1-c_6. The large number of free parameters allowed us to test different shapes and find a good set-up for our experiments.
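One common six-parameter form of the asymmetrical double sigmoid is sketched below; the paper does not give its exact parametrization of c_1-c_6, so the roles assigned to the parameters here are our assumption:

```python
import math

def ads(x, c1, c2, c3, c4, c5, c6):
    """Asymmetrical double sigmoid (one common six-parameter form): offset c1,
    amplitude c2, centre c3, plateau width c4, and independent rise/fall
    steepness c5 and c6 (c5 != c6 makes the curve asymmetrical)."""
    rise = 1.0 / (1.0 + math.exp(-(x - c3 + c4 / 2.0) / c5))
    fall = 1.0 - 1.0 / (1.0 + math.exp(-(x - c3 - c4 / 2.0) / c6))
    return c1 + c2 * rise * fall
```

Choosing different steepness on the two sides lets the score decay more slowly toward shorter dialogues, boosting their selection probability, which is exactly the asymmetry motivated above.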

C. Attendance
The number of participants in a dialogue is also a key factor. As suggested in [33], better results are obtained when the summarization system takes into account the number of persons involved in the discussion. Conversations between two people are usually easier to summarize than those involving more participants. Thus, we group the conversations of the training data-set into two classes: a 2-participant class and a 3+ participant class. The attendance score is simply

δ_id = 1 if the two dialogues are in the same class, 0 otherwise.

D. Weights
The final score is given by the weighted sum of the three scores:

s = w_1 κ + w_2 λ + w_3 δ.

In creating the tuned prompt, we place the scored dialogues in ascending order. The last shot should be the highest-scored one, as it has the most impact on the completion returned by the model.
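The weighted sum is straightforward to sketch; the 5:3:2 default below is the weight distribution reported in the experiments (Section 5):

```python
def final_score(kappa, lam, delta, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the content (kappa), size (lam) and attendance (delta)
    scores; 5:3:2 is the configuration used in the experiments."""
    w1, w2, w3 = weights
    return w1 * kappa + w2 * lam + w3 * delta
```

Because w_1 is largest, content similarity dominates the ranking, with size and attendance acting as tie-breakers.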

A. Weights distribution
Several weight configurations are tested while keeping the same GPT-3 configuration, by which we refer only to the engine, temperature and number of shots. The results for 50 tests of the GPT-3 curie-instruct-beta model at a temperature of 1 are presented in Table I (standard deviation is provided: score ± std). We notice a slight increase in the average token count when the size weight (w_2) is larger than the content weight (w_1). This confirms that the probability of finding content similarities is higher for shorter conversations. Once again, this is due to the nature of the data-set. Establishing whether this data-set reflects a real scenario is beyond the scope of this work and requires further investigation.

B. Size similarity function
The function which computes the size coefficient contributes significantly to cost minimisation. We test three functions: a normal distribution with σ = 20 tokens, i.e. a FWHM of 47.2 tokens (gauss); an asymmetrical double sigmoid with a FWHM of 61.8 tokens (ads narrow); and another asymmetrical double sigmoid with a larger FWHM of 85.6 tokens (ads broad). We find that a broader ads function helps with cost minimisation while performance is not affected significantly: the differences between the ROUGE-1 scores are very small.

C. Selection pool
The selection pool remains unchanged during the experiments. However, to investigate whether the scoring system results depend on the SP, we perform a series of separate experiments including a significant amount of foreign data. The results of these experiments are discussed in the next section. Software applications can be developed using the scoring system.

V. RESULTS AND DISCUSSIONS
In our experiments we mostly use the curie-instruct-beta engine of GPT-3, due to its improved performance with respect to the curie engine and for cost reasons. To settle on a temperature, we vary it while keeping the same scoring system configuration, with w_1 = 0.5, w_2 = 0.3 and w_3 = 0.2 (5:3:2), which is highlighted in Table I. We run the curie-instruct-beta model on 200 two-shot prompts (standard deviation is provided: score ± std) and obtain better ROUGE scores for lower temperatures, as presented in Table II. Therefore, we conducted the subsequent experiments at a low temperature setting (0.25).

A. Performance evaluation of the score system
We experimented further by running with different numbers of prompt tuning shots. As expected, performance increases as more tuning samples are provided in the prompt. The results at a temperature of 0.25 for the ROUGE-1 and ROUGE-L scores are shown in Figure 3. Multiple sessions of 100 tests were run in order to evaluate the fluctuations of the ROUGE scores. In the boxplot in Figure 4 we consider a baseline of randomly generated prompts using samples from the same SP. A slight improvement is observed for the prompts tuned using the proposed scoring system (5:3:2 weight distribution, ads broad as size function). If we assume a normal distribution for the ROUGE scores, there are no visible changes in the standard deviations.
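For reference, ROUGE-1 F1 (reported above, and computed in practice with the standard ROUGE implementation [30]) reduces to unigram-overlap precision and recall; a minimal whitespace-token sketch:

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Illustrative ROUGE-1 F1 on whitespace tokens: clipped unigram overlap
    divided by candidate length (precision) and reference length (recall)."""
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((c & r).values())  # per-token overlap, clipped by min count
    if overlap == 0:
        return 0.0
    prec = overlap / sum(c.values())
    rec = overlap / sum(r.values())
    return 2 * prec * rec / (prec + rec)
```

ROUGE-L is computed analogously from the longest common subsequence instead of unigram overlap.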

B. Monitoring the costs
We showed in the previous section that we can diminish the costs (i.e. the token count) by modifying the coefficients of the size scoring function f_λ, which in our case is the asymmetrical double sigmoid. Therefore, we can monitor the costs simply by defining the desired settings of the ads function. Obviously, there are many other options for the scoring function. If we are looking only to minimise costs, then we can choose a function for which f_λ(d, d′) = 0 for any d and d′ such that |d| > |d′|. Moreover, in order to increase the score of shorter conversations, we can give up the condition that the score is maximal when the dialogues have equal token counts. We study the behaviour of two such functions: a piecewise function, and a function (Equation 5) that can improve the chance of picking similar-size conversations depending on the value of a free parameter a. We run 100 tests for these functions using SS in the 5:3:2 configuration:
• Piecewise function: 119.77 average token count
• Equation 5 (a = 0.005): 204.64 average token count
Using a smaller model would also reduce the costs. We evaluated the performance of the ada, babbage, curie and curie-instruct-beta engines of GPT-3 with and without SS. As expected, the results (Figure 5) show that performance decreases significantly when using a smaller model.
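The cost-oriented piecewise idea can be sketched as follows, with |d| the candidate token count and |d′| that of the input (the paper's exact piecewise definition and Equation 5 are not reproduced here, so the non-zero branch is an illustrative assumption):

```python
def size_score_piecewise(cand_tokens, input_tokens):
    """Zero score for any candidate longer than the input, so only shorter
    (cheaper) dialogues can be selected; otherwise the score grows with the
    candidate's relative length. Illustrative sketch only."""
    if cand_tokens > input_tokens:
        return 0.0
    return cand_tokens / input_tokens
```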

C. Selection pool variations
We insert data from the DialogSum (DSd) data-set into the current SP, which was created using the training part of SCd. The new SP consists of 54% samples from SCd (14732) and 46% from DSd (12460). We ran the model again for 100 tests with the curie-instruct-beta engine at a temperature of 0.25 and the 5:3:2 ads-broad scoring system. We also ran the model only on tests from DSd. The results are shown in Table III. As expected, in the case of one-shot training, performance is slightly better for a larger SP. However, the SCd-only two-shot training scores higher than the other SPs. Also, there is a large difference in the score provided by SS between the DSd experiments and the others. This difference is effectively a measure of the discrepancies between SCd and DSd.

D. Human evaluation
For human evaluation (HE) we considered four criteria, following [34]: Coherence, which relates to the overall quality of the sentences and how well the summary is structured; Consistency, which measures the factual information transfer; Fluency, which scores the quality of individual sentences (i.e. grammar, formatting and so forth); and Relevance, which shows how important the selected information is, with an excess of information being penalised. We asked the annotators to rate different summaries for 100 dialogues on a scale from 1 to 5. The evaluation was blind and included the reference summary from the SAMSum dataset and the summaries generated by GPT-3's curie-instruct-beta (two shots, 5:3:2, low-temperature configuration) with and without the scoring system. The results are presented in Table V. The average scores of 3 × 100 dialogue summaries are provided, together with the number of dialogues rated within unit intervals. A significant increase due to the use of the scoring system can be observed. The largest difference in average score appears for Relevance, but an improvement is visible for each criterion. The number of failures (i.e. poor summaries rated under 2.5) decreases by 11% of the total number of dialogues after applying the scoring system. We calculated the Pearson coefficient to evaluate the correlation between human judgment and ROUGE scores (Table VI). We see that the ROUGE scores are not reliable in this case.
There are many examples, such as the one provided in Table IV, of very good summaries being rated with low ROUGE scores.
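The Pearson coefficient used for Table VI is the standard sample correlation; a self-contained sketch:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists,
    e.g. human ratings and ROUGE scores for the same set of summaries."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A value near zero, as observed in our comparison, indicates that the ROUGE scores move largely independently of the human judgments.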

VI. CONCLUSION
In this study we investigated possible prompt-based improvements to language models performing abstractive summarization. We showed that choosing the right dialogues can increase the quality of the summaries and reduce the number of failures by 11%. We tested the scoring system on several GPT-3 engines and obtained better results for each engine when applying it. The scoring system we proposed for selecting the best picks for creating prompts can also control the computational costs through the use of different size similarity functions. We proved that content similarities between dialogues are valuable for tuning the prompt, and as a result computational resources can be saved. We evaluated the scoring system using ROUGE metrics and by conducting a human evaluation. In our experiments, small variations in the average ROUGE score corresponded to larger discrepancies in the scores given by the annotators. However, both evaluations show that applying the scoring system increases the quality of the summaries. By studying the selection pool behaviour, we identified another research direction for future work: the way one gathers the dialogue samples in the selection pool can lead to different results. It may be necessary to engineer dynamic data-sets of dialogue samples depending on end-user behaviour and individual feedback. Thus, prompt tuning methods that do not require fine-tuning are advantageous, as we do not need storage space for a distinct fine-tuned version of the model for each user.

APPENDIX A TWO SHOT PROMPT EXAMPLE
In Table VII we show an example of a two-shot prompt to illustrate the prompt structure used for summary generation.