Analytical Techniques for Developing Argumentative Writing in STEM

Contribution: This article demonstrates how experiential learning could be used to develop argumentative essay-writing skills in STEM students. It illustrates the design, implementation and evaluation of an experiential learning project for undergraduate computer science and engineering students, and discusses the development of a natural language processing application designed to aid instructors in providing students with high-quality, prompt, formative feedback on writing tasks. Background: Written feedback, when delivered in a timely manner, is an effective way of advancing students’ understanding of the writing process. Unfortunately, large class sizes and the limited backgrounds of instructors do not always make formative feedback possible. STEM students are especially disadvantaged since approaches to teaching written communication tend to differ from the trial-and-error strategies compatible with many STEM areas. Intended Outcomes: An experiential learning approach to writing instruction can have a positive impact on developing writing skills in STEM learners. Implementing algorithms for providing STEM students with immediate, dependable, formative feedback is expected to improve their performance in writing. Application Design: An experiential learning project for teaching argumentative writing was delivered to computer science and engineering freshmen. The structure of the project is described: the teaching approach, essay assignments, the rubric used for grading the essays, and its reliability. Also discussed is the automated analysis of content and argumentation in the essays. Findings: The project was successful in producing a transformative writing experience for computer science and engineering students. It demonstrates ways of incorporating experiential education to help STEM students develop strategies for good essay writing.

have shown that graduates in computer science and engineering seldom have the required writing skills needed for work in a professional setting [2], [3]. Gibbs [4] argues that many students leave secondary school without proficiency in reading, writing and communication. Furthermore, even those who have good language skills run a risk of losing them while studying at university because there are too few opportunities to write essays and for getting good-quality feedback on writing assignments. This paper discusses an Experiential Learning project, which included designing, delivering and assessing two argumentative writing assignments for 141 freshmen studying computer science or computer engineering. Three researchers, including the course instructor, collaborate to investigate the development of a Natural Language Processing (NLP) application designed to assist instructors with providing students with prompt, reliable feedback on argumentative writing assignments. As part of their preliminary investigation, the project was planned for a freshman Academic Skills course at a university in the U.K. For these students there is sometimes a lack of contextual coherence due to the absence of opportunities for systematic inquiry when it comes to writing assignments. This is indicative of the fact that approaches to teaching writing traditionally differ from those used in STEM courses, in which students are assigned associated laboratory work allowing them to investigate hypotheses through actions and activities. Such explorations enable the development of knowledge through internal and external discourse; for example, by watching videos or through discussions with their peers.
Experiential learning involves immersing students in educational activities, and then encouraging them to reflect on the experience and develop new ways of thinking. Experiential learning is a constructivist process allowing the learner to expand their ideas through a process of inquiry and reflection, as is usually done in STEM. Students can work individually, as part of a group, or under the guidance of a facilitator. The ways in which HE institutions organise curriculum, integrate technology and infuse other resources to improve student outcomes have garnered scrutiny in recent decades [5]. Now that universities are being held accountable for the quality of their teaching through instruments such as the U.K. National Student Survey [6], the multidimensional nature of getting students to achieve desired outcomes has assumed a new urgency. Central to attaining the best academic results for students is constructive commentary that they can use as a scaffold. One concern is developing technology to assist instructors, especially those with large classes, in providing high-quality, timely and consistent feedback to guide students as they experiment with writing to develop their written communication skills. The project involved providing students with two tasks, each asking them to analyze source material critically prior to writing an argumentative essay.
John Dewey [7] presents the idea that the process of learning should involve a cycle of doing and reflection to produce an awareness of the problem at hand, formulating a response, experiencing the consequences and finally modifying or confirming a proposed solution. Such a process of transformation involves concrete experiences as opposed to abstract conceptualization. More recently, Vygotsky [8] has been credited for providing the foundations for experiential learning. He contends that knowing, understanding and thinking all happen within a sociocultural context. His arguments are expanded by the American scholar Kolb [9] through an exploration of processing information via concrete experiences. The resulting experiential learning model involves a cycle of observation, formulation, testing and experiencing. In other words, we do something, experience its consequences, take action in response to these and then repeat the process, this time with a more developed understanding of what the process involves.
This paper provides guidance for practitioners seeking to integrate experiential learning in writing courses for STEM students, as well as information for researchers concerned with using NLP for building applications to support writing instruction. The project was set up to explore the following: 1) How and to what extent can experiential learning be used to develop argumentative writing skills of students enrolled in STEM programs? 2) How can technologies that help promote experiential learning in argumentative writing be developed? 3) How can the holistic impact of using such an approach be evaluated?
The next two sections of the paper address the first two of these questions. Each part begins with a review of related literature. Section IV discusses the impact the project had on students and what might constitute a full appraisal of the impact of learning technologies that promote experiential education. Section V concludes the paper.

A. Making a Case for Experiential Learning
This literature review section discusses the multifaceted interpretations of experiential learning, ranging from on-the-job training to engaging in and reflecting on work. It provides evidence-based arguments in favour of learning-by-doing.
1) Deweyan Viewpoint: The construct of learning-by-doing refers to the work expounded by Dewey in his review of educational philosophy, Experience and Education. One of his distinguishing suggestions is that learning is developed from within and is based on ideas formed by performing certain actions. Experience, he argues, is the basis of understanding. He furthers this by explaining that an experience comprising certain qualities, such as uniqueness and wholeness, can expand human perception and increase the value one places on what is being experienced [10]. For him the ultimate purpose of an experience is to reawaken the senses, to see differently [11] ideas that might have been missed, and to validate what is being studied. In describing transformative experiences, Dewey discusses educational possibilities rather than actualities, which raises rather than answers questions about how such experiences could be fostered in teaching [12]. Nevertheless, the theoretical underpinning of experiential learning developed over fifty years ago is now being implemented in the public and private sectors to foster experiential innovation in industry [13], and in HE through activities such as internships and fieldwork.
2) Modern Perspectives: HE institutions worldwide use experiential learning to develop in students 21st century skills and competencies, including empathy, resilience and collaboration [14], to better prepare them for an unpredictable world. Through volunteering, internships and field studies involving local and global communities, some online, students are expected to harness communal traits that enable them to become more integrated and better connected as human beings. Although underutilized [15], such community partnerships are seen as beneficial in preparing students for a life of work.
Experiential learning is a strategy used to integrate active, structured, and meaningful reflection into teacher preparation programs [16]. The aim is to develop in teachers additional knowledge and skills in readiness for the challenging situations they might encounter in schools. Learning by experience is central also to nursing education; for example, where students spend half of their studies doing hands-on practice [17]. Roakes and Norris-Tirrell [18] argue that practical situations provide uncertainties and complexities that cannot be replicated in the classroom. Thus, as with students of engineering, high importance is placed on the cyclical process of experiential learning to help connect textbook theories with real-life.
The traditional standardised testing approach prevalent in K-12 settings leaves teachers little room for experience-based learning. Studies connecting instructional practices with policies centred on accountability and rankings [19] highlight the pressures put on teachers by large class sizes and a lack of time to complete the syllabus, stifling their desires to make learning interesting and engaging for students. Consequently, teaching to the test becomes the only option available to them.
The disconnect between work expectations of young employees and their academic training is underscored even more during university. Teacher-dominated instruction remains the primary mode of teaching in HE notwithstanding existing research showing that experiential learning promotes teamwork [20] and develops critical-thinking skills [21]. Instead, attempts are made to enhance students' chances of gaining employment by including employability in the curriculum and providing initiatives, such as inviting alumni and other professional speakers, to help students garner some understanding of pursuing a career. Unfortunately, these approaches involve telling students what to do instead of showing them how. Another approach to preparing students for employment is through work-based learning, often available at some point during their college career. However, there is little agreement within institutions and indeed across countries [22] on how and when to implement student internships in the curriculum.
B. An Experiential Learning Approach
1) The Setting and Assignments: In the project there are around 200 first-year computer science and engineering students at a public university in the U.K. At the beginning of their studies these students are required to complete Academic Skills, a semester-long freshman course designed to enhance academic writing. The course aims to develop in students proficiency in writing and communications skills necessary for success in college and for future employment. The university attracts students from surrounding towns and cities, and is recognised for widening participation in HE. Thus, the project participants come from a wide range of socioeconomic and academic backgrounds. Most study full-time, but a small number are part-time students. The learning outcomes of the course are drawn from the UNESCO [23] definition of literacy, which centers on ensuring that students are able to "identify, understand, interpret, create, communicate and compute, using printed and written materials, as well as ... to solve problems in an increasingly technological and information-rich environment". The course leader is supported by six tutors. Course meetings are scheduled over six hours per week, on two days, each with an hour-long lecture followed by a two-hour hands-on workshop. A highly structured course design, based on interactive lessons and hands-on practice, is used to help deepen students' understanding of the content. Academic Skills is run by the university each year, but this particular year the project took it over. The assessment included two argumentative essays described below. Students were required, first, to analyze critically the reading material provided and summarize it in 150 to 250 words; next, to write a short argumentative essay (300 to 500 words), based on the reading, in response to a given prompt. They could choose one of the following three topics for the first essay.
• Autonomous Vehicles (AV): will these change how we travel today?
• Cryptocurrencies (Crypto): are they the currencies of the future?
• Cybercrime (Cyber): will education and investment provide the solution?
The areas of specialization of these students include cybersecurity, information technology and computer engineering. Thus for the first essay they had the opportunity to choose a topic they were already familiar with or interested in. For the second essay, all students had to respond to the same question: Should artificial intelligence be used in teaching and learning? This second essay task limited students to making arguments based only on material from two articles they were given. The two essays were designed to be developmental exercises, in that the second assignment was more difficult.
2) Rubric - Motivation and Design: Writing scales arose in the early 20th century to compare performance of schools and teachers [24], and only later were they developed within classroom contexts to provide guidance for students. It has been shown that analytic rubrics, where scores are assigned to distinct dimensions, have greater reliability than holistic rubrics [25]. Still, many studies identify inconsistency among raters and scoring professionals [26] in applying analytic rubrics as a problem, attributing it to lack of training or familiarity. Despite these debates, analytic rubrics can serve as an instructional tool to improve students' writing quality [27]. An analytic rubric was developed and used both for instruction and grading.
The rubric was designed through a collaborative process by the three researchers working on the project. The instructor, who is one of the researchers, has a background in educational technology whereas the other two specialize in applying NLP to educational data. Part of the investigation involved understanding how the rubric supports instruction in argumentative writing. The rubric contained explicit descriptions of performance characteristics, each corresponding to a point on a rating scale. Table I shows the four dimensions of the rubric, the dimension weights and sub-dimensions with the points assigned. Design of the rubric was guided by Ferretti's well-known argument rubric [28] and the Source-based Argument Scoring Attributes (AWC) [29]. Research has shown that the range of a rubric scale is important because it affects reliability and ability to make meaningful distinctions; more than seven levels lead to cognitive difficulty, and fewer levels produce sharper classification. Timely feedback, as a means of supporting student learning, has long been advocated in the assessment literature. It provides a learner-centered approach in which, from a social-constructivist standpoint, students learn from one another.
A key component of authentic assessment, rubrics provide descriptive feedback and can also be used for self-assessment as a criterion of written work. It was therefore important that students were given the rubric together with the first assignment. The rubric was also used to provide formative feedback on the first essay before the second essay was assigned.
3) Essays - Assigning and Grading: A Universal Design for Learning framework guided the formulation of the assignments. This framework provides a structure for developing curriculum (learning outcomes, instructional methods, and assessment) and is composed of three main ideas: provide engagement, provide representation, and provide action and expression. Students were taught how to write good essays in two lectures. The first of these focused on the following four elements of writing argumentative essays: engaging with the prompt, formulating a claim, developing arguments and counterarguments, and concluding the essay [30]. Students spent the tutorial following the lecture focusing on each of these components. In [30] Black advocates that argumentative writing should be considered as an aesthetically pleasing art form and that on completion of the work, authors should have the satisfaction of knowing that they have made something.
The rubric was an instructional tool for explaining the purpose of the assignment. Six model essays on each of the three topics were written by sophomores who had formed part of a Wise Crowd. The exemplars were used to highlight how students could maximize their scores. The first assignment was scored by three course tutors, with each person scoring essays with the same title; the number of students per topic was capped for even distribution. All tutors were trained to use the rubric consistently.
Once scored, examples from the first essays were used to point out avoidable mistakes. As a more learner-centered approach the students were encouraged to reflect on their scored essays. Taking this first attempt at argumentative writing as a point of departure, the second essay was assigned. The project participants had experienced the consequences of their first attempt, reflected on the feedback received and now had to go through the process again.
4) Reliability of the Rubric: The use of writing rubrics has engendered debate about their reliability and purpose. Educational intervention studies apply rubrics whose reliability is usually quite good. For example, Graham and Perin [31] in a meta-analysis of educational interventions exclude interventions with reliability lower than 0.60. Yet a large body of research, including [32], has documented how trained raters can exhibit different levels of severity on analytic rubric categories. There has also been skepticism about applying rubrics to classroom grading, due to subjectivity in interpreting rubric criteria and over-reliance by teachers on the rubric as authority. Turley and Gallagher [24] argue that the debate should not be about whether rubrics are good or bad, but about how to use them. They discuss how the interpretation of a rubric depends, in part, on developing a community of users who understand the language of the rubric criteria. However, very little work has been done to compare how rubrics are used for instruction with how they are used in scoring, or to examine the difference between their reliable use and inconsistent classroom use. The present work investigates these problems.
Although having an analytic rubric for both instruction and grading is beneficial for students, it is difficult to apply an analytic rubric reliably in the context of large numbers of students. This motivates the view that development of algorithms to support the application of a rubric is an important goal. Development of automated methods is facilitated by creation of training data for a specific rubric, consisting of a large number of examples where the rubric has been applied.
Two advanced undergraduates were trained to use the rubric over a period of seven weeks. Subsequently, each of them spent 10 hours per week re-scoring half the essays written by students for the first assignment. Their training included understanding the structure of argument writing and completing both essay assignments; see Table II.
Pearson correlation on the content and argument components of the rubric was used to assess rater agreement. Their correlations with each other and with the assigned grades varied widely, from negative correlation to high correlation, after the raters applied the rubric to the first sample. This improved following the second round of three essays; the correlation between the raters was perfect on two, and poor on the third. After discussing these results, the raters were allowed to proceed independently with applying the rubric to the remaining essays.
TABLE II
1 Virtual meeting to review argument writing: assignment #1 and rubric #1
2 Raters write individual essays: one on AV, and one on Crypto or Cyber; then each rater applies rubric to essays written by the other rater
3 Virtual meeting to review raters' essays and assessments
4 Both raters assess the same three Crypto essays
5 Virtual meeting on their first round of assessment centering on discussion between raters and with all three researchers
6 Each rater assesses the same three additional Crypto essays
7 Feedback on the second round of assessment, with some discussion on assignment #2 and rubric #2
To check reliability, each rater scored 28 essays per week for three weeks and 31 in the fourth week. Ten randomly selected essays were assigned to both raters for continued monitoring of their reliability. The correlations for the content and argument dimensions on the ten essays were generally high. They ranged from one low outlier of -0.52 to 1, with an average of 0.75, or 0.89 after dropping the outlier. The reliable raters had lower correlations with the assigned grades for the three-essay assignment, with averages ρ equal to 0.72, 0.63 and 0.59, respectively.
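The agreement figures above rest on the Pearson correlation between two raters' scores on the same essays. A minimal sketch of that computation follows, together with an average taken after dropping the single lowest value. The scores and the helper names (`pearson`, `mean_without_lowest`) are hypothetical illustrations, not the study's data or tooling.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares of the deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one rater gives constant scores")
    return cov / sqrt(var_x * var_y)


def mean_without_lowest(values):
    """Average after dropping the single lowest value (a crude outlier rule)."""
    kept = sorted(values)[1:]
    return sum(kept) / len(kept)


# Hypothetical scores from two raters on the same five essays (not study data).
rater_a = [3, 4, 2, 5, 4]
rater_b = [3, 5, 2, 4, 4]
r = pearson(rater_a, rater_b)
print(f"inter-rater correlation: {r:.2f}")
```

With only a handful of essays per comparison, a single disagreement can swing the coefficient sharply, which is one reason to report the average both with and without an outlier.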
The reliability study shows that the rubric can be applied very reliably by specially trained raters and with moderate to low reliability in uncertain classroom contexts. It indicates the difficulty of using a fine-grained rubric in large classes, where teaching assistants do the grading, where students want to see their grades quickly, and where timely and specific feedback is beneficial.

III. TECHNOLOGY FOR EXPERIENTIAL LEARNING
A. Previous Work on Technology for Writing Instruction
A comprehensive review of research on instruction revealed that writing skills develop best given a formative assessment cycle. This involves successive stages of instruction to target specific learning goals, followed by assessments for which instructors provide feedback to help students scaffold their learning. Reliable and valid assessment is seen to be important as part of instruction. The time involved in regular assessments of writing, and the difficulties in assessing writing discussed in the previous section, both provide strong motivation for technological support for writing instruction.
A recent review of the impact of technology on writing instruction found the main strength to be increased student engagement in writing assignments, and support for peer collaboration [33]. The main drawback was that instructors found it challenging to integrate technology into the writing curriculum. A concurrent review compared forty-four tools intended to support academic writing instruction, the majority of which concern automated writing evaluation (AWE) [34]. Most of these tools focus on college-level English L1, or L1 combined with L2 learners, and are not tied to a specific domain or genre. Apart from pointing to the need to address languages other than English, the authors conclude with recommendations that align with several of their research goals; in particular, feedback linked to writing goals and genres, and to strategy instruction, meaning techniques for planning and revising text in general, or specific kinds of text such as persuasive writing. AWE has been used to generate feedback to support students' revisions. It also supports analytic or holistic rubrics, or content maps using specific tools and techniques, including Coh-Metrix [35], C-rater-ML [36], G-rubric [37], Coh-Viz [38], and PEG [39].
Automated support for revision feedback using analytic rubrics has been applied to second language learners' persuasive essays [40], college students' physics lab reports [41], and middle school students in English language arts (ELA) classes [42]. Liu et al. [40] developed machine-learned models by training on pairs of student sentences and teacher comments from a previously collected dataset of L2 learner essays. A comparison of teacher versus automated feedback for 104 students found that the automated feedback led to the same kinds of improvements between first and second drafts on four of seven classes. Park and Cho [41] investigated whether Coh-Metrix indices could predict peer reviews of lab reports in a study with 41 students. Eight out of fifty-four Coh-Metrix indices had modest but significant correlations with the human scores on the final drafts. Perin and Lauterbach [43] apply Coh-Metrix to community college students' summaries to predict four dimensions of an analytic rubric, and found a different set of Coh-Metrix features to be predictive than those identified in previous work. Wilson and Czik [44] conducted a study with eight 8th grade ELA classes where four classes received teacher feedback alone, and four received a combination of teacher and automated feedback from PEG [39]. PEG provides scores for six dimensions of writing quality (e.g., idea development, style, word choice), each on a 5-point scale. It combines natural language processing and machine learning techniques, using more than 500 variables to predict essay ratings assigned by expert raters. Results indicated that teachers gave more feedback on higher-level writing skills to students in the combined condition, and that reliance on PEG saved one-third to half the time it took to provide feedback without PEG. 
In a somewhat larger study, middle school students in ELA classes using PEG had more positive writing self-efficacy and higher scores on the state ELA test than a control group [42]. Other machine learning methods for predicting scores using all or some of the analytic rubric dimensions have also been investigated for college level L2 essays [45] and middle school argument writing [46].
Compared with analytic rubrics, there are fewer studies investigating automated support for holistic rubrics. A significant exception involves application of the ETS C-rater technology to facilitate teacher feedback on middle school students' revisions of short answers to science questions [47], [48]. In Gerard et al. [47], C-rater was adapted to analyze short answers in tests given in two sixth grade science units, so that the teacher could intervene more efficiently to strengthen student collaboration. This work extended a previous study involving seventh graders [48]. Other applications that investigate holistic rubrics include assessment of English proficiency in writing from sources [49], analysis of features of good writing in college-level students [50], [43], and analysis of science argumentation of high school seniors [51]. More recent work on similar short answer tasks compared three kinds of machine learning models, and found that pre-trained transformer models performed better than RNNs or support-vector machines [52].
Rubric-free methods have also been investigated. G-Rubric [53], [37] is a modification of latent semantic analysis (LSA) [54], a method to create numeric vector representations of the meanings of words, where the number of vector dimensions is up to the investigator. G-Rubric converts an LSA vector space with latent dimensions of meaning to a new vector space with semantic grounding in a fixed number of relevant concepts (e.g., 300). It has been used to give college students iterative feedback during revision of source-based summaries [53], and with business students in a MOOC [55]. Concept maps are another rubric-free feedback method. Concept maps, a visualization used in education for decades [56], are graphs that depict explanatory knowledge, where nodes represent concepts and edges represent relations between them. Sung et al. [57] compared four conditions of feedback for sixth graders writing summaries over six weeks: none, LSA-based visualization, concept maps, and LSA plus concept maps. Students who received feedback all improved between pre- and post-test, and students with concept-map feedback outperformed the other two conditions. The Coh-Viz tool automatically creates concept maps for individual sentences, similar to subject-predicate-object graphs [58], and has been tested with students studying education in a German university. Students' revisions based on concept map conditions had significantly greater improvements in cohesion over a baseline.
To summarize, studies show automated analysis can support formative assessment during writing instruction by helping the instructor to provide prompt feedback [40], [47], [48], to students while revising their drafts [44], [53], [55], which can lead to improved writing skills [42], [57]. Machine learning methods as used in PEG, C-rater-ML, G-Rubric and [40] generalize better than Coh-Metrix alone, although Coh-Metrix provides useful features for the machine learning approach used in [40]. Santamaría Lancho et al. [55] suggest that automated support could also be integrated with human grading to improve the consistency and reliability of summative assessment.

B. Content Annotation and Analysis: the Wise Crowd Method
Writing summaries of source texts has been found to be among the best instructional tools to develop students' reading and writing skills for conceptual knowledge [31]. Their use as a pedagogical tool requires a method to assess the conceptual quality of a summary, which in turn rests on the identification of the main ideas of the source texts being summarized. Many similar methods have been utilized in educational psychology, including expert consensus [59], ranking of propositional units in source texts [60], and successive elimination of less important propositional units in source texts [61]. All these methods elicit explicit judgements of propositions. The present study relies on exploiting the notion of a wise crowd of experts who summarize the same sources [62]. Ideas that are expressed in more of the wise crowd summaries have higher weight. Table III illustrates a summary content unit (SCU) from five wise crowd summaries of the Autonomous Vehicles article. Four of the five summaries expressed the idea that use of public transportation might decrease with increased reliance on autonomous vehicles. Although the idea is expressed in different ways, all four expert summaries clearly express the same idea. Ideas in student summaries that match an SCU are credited with the corresponding SCU weight. Given five reference summaries, SCU weights can range from 1 to 5. Ideas in student summaries that do not match an SCU are assigned a weight of 0. The score assigned to a student summary normalizes the total sum of the weights of their ideas by the number of ideas in the student's summary, and by the average number of ideas in a reference summary. Thus a summary gets a higher score if the ideas expressed by the student match more of the high weighted SCUs, and if the student expressed a good proportion of the high weighted SCUs. 
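The scoring just described can be sketched in a few lines. The text says only that the weight total is normalized by the number of ideas in the student's summary and by the average number of ideas in a reference summary. The sketch below assumes the normalization used in standard pyramid scoring, where the observed total is divided by the best total attainable with that many ideas. The function names, the averaged combined score and the example weights are hypothetical and are not taken from the study.

```python
def max_weight_sum(scu_weights, n):
    """Largest total weight attainable by expressing n distinct SCUs."""
    return sum(sorted(scu_weights, reverse=True)[:n])


def summary_scores(matched_weights, scu_weights, avg_ref_ideas):
    """Return (quality, coverage, combined) for one student summary.

    matched_weights: one entry per idea in the student summary; the weight of
        the SCU it matches, or 0 if it matches no SCU.
    scu_weights: weights of all SCUs found in the reference summaries.
    avg_ref_ideas: average number of ideas in a reference summary (an int).
    """
    observed = sum(matched_weights)
    # Assumed normalization 1: best total for a summary with as many ideas as the student's.
    best_same_size = max_weight_sum(scu_weights, len(matched_weights))
    # Assumed normalization 2: best total for a summary of average reference size.
    best_ref_size = max_weight_sum(scu_weights, avg_ref_ideas)
    quality = observed / best_same_size if best_same_size else 0.0
    coverage = observed / best_ref_size if best_ref_size else 0.0
    return quality, coverage, (quality + coverage) / 2


# Hypothetical pyramid built from five reference summaries (weights 1 to 5).
pyramid = [5, 4, 4, 3, 2, 2, 1, 1]
# A student summary with four ideas: three match SCUs, one matches nothing.
student = [5, 4, 2, 0]
quality, coverage, combined = summary_scores(student, pyramid, avg_ref_ideas=5)
print(f"quality={quality:.2f} coverage={coverage:.2f} combined={combined:.2f}")
```

Under these assumptions, an unmatched idea lowers the first ratio because it takes up a slot that a high-weight SCU could have filled, while a short summary lowers the second ratio because it cannot reach the total of an average-length reference.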
The scores can then be explained to the student or instructor in terms of the overlap of ideas in the student's summary with the full repertoire of SCUs for a given text.
TABLE III
A One of the main points is the displacement of public transport
B the rise of autonomous vehicles will disrupt the current standing of public transport
C even more people would switch from public transport
D it could also have a negative impact upon the public transport systems
In previous work, it was shown that the wise crowd method for identifying important ideas in source texts, ranking them, and using the resulting ranked list to assess student summaries, correlates very well with a main-ideas rubric used in an educational intervention (ρ = 0.88) [59]. The method itself has been found to be highly reliable given four or five reference summaries [63]. Originally this method was applied through a manual annotation procedure (see next paragraph). An automated approach to the assessment and feedback step has been developed [62]. More recently, a fully automated approach called PyrEval was developed; it identifies and ranks the SCUs from a set of reference summaries, then uses the weighted SCUs to assess new summaries [64]. PyrEval was tested on summaries from the Autonomous Vehicles and Cryptocurrency topics. Also described here is an extension to this annotation linking the SCUs from the summaries to propositions in the argument portion of a student essay. The analysis of the content of the students' essays asks how well the automated method of summary content analysis replicates the manual method, and how the automated method could support feedback on the rubric, either to help students revise the essay as a whole, or to give the instructor an overview of students' grasp of the content and their ability to draw on it to support their arguments.
Recall that the assignments first asked students to summarize the source text or texts in 150 to 250 words, then to construct an argument addressing one of the prompts. Fig. 1 illustrates the workflow of the manual annotation. The reference summaries are annotated first to identify the SCUs and to create a pyr file (a list of SCUs derived from reference summaries is referred to as a pyramid). The pyr file is used to assess student summaries, with one pan file per student summary; in this step the propositions the student expresses are matched to the weighted SCUs. The new aspect of content annotation that has been added concerns the argument part of a student's essay. Here, Elementary Discourse Units (EDUs) are annotated [65], [66]; these are essentially individual clauses or propositions (sep file). In contrast to a summary of a source text, the quality of a student's argument is not expected to depend on how much of the same content is expressed as in a reference argument. What is of interest, however, is how much of what students summarized from a source appears in their arguments, and what sort of content they use to frame their arguments. The last annotation step therefore involves matching the EDUs in a student's argument to the SCUs.
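Once the EDUs of an argument have been matched to SCUs, simple overlap measures can be read off the annotation. The sketch below is purely illustrative: the two ratios, the function name and the toy labels are assumptions introduced here, not measures reported in this study, and the code does not read the pyr, pan or sep file formats.

```python
def content_reuse(edu_to_scu, summary_scus):
    """Return (edu_share, scu_share) for one student essay.

    edu_to_scu: one entry per argument EDU, holding the matched SCU
        label, or None when the EDU matches no SCU.
    summary_scus: SCU labels matched in the same student's summary.
    """
    linked = [scu for scu in edu_to_scu if scu is not None]
    # Share of argument EDUs that express some SCU from the pyramid.
    edu_share = len(linked) / len(edu_to_scu) if edu_to_scu else 0.0
    # Share of the student's summarized SCUs that reappear in the argument.
    summarized = set(summary_scus)
    reused = set(linked) & summarized
    scu_share = len(reused) / len(summarized) if summarized else 0.0
    return edu_share, scu_share


# Toy annotation with hypothetical SCU labels.
edu_share, scu_share = content_reuse(
    ["pt", None, "cost", None, "pt"], {"pt", "safety"}
)
```

Here three of the five EDUs are linked to an SCU, and one of the two summarized SCUs is reused in the argument.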
The automated wise crowd method performs fairly well on these summaries, as described in [64]: comparing the manual and automated summary content assessments gives a Pearson correlation of 0.66 for the Autonomous Vehicles summaries and 0.72 for the Cryptocurrency summaries. Previously, the instructor found the content scores and justifications to be very useful, including in cases where the tool gave low scores to written work that, on reflection, had been scored favorably on the basis of writing fluency rather than content [67]. For the present study, neither the manual nor the automated content scores on the summaries correlate well with the content dimensions of the rubric. This is because the rubric content dimensions of quality and coherence relate to the essay as a whole, not to the summaries alone, and because the students consulted other sources of their choosing to find evidence to support their arguments. The content assessment of the students' summaries does, however, reflect their reading skills, which suggests further investigation into whether summarization skills might provide insight into how well students use external sources in their arguments.
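For readers unfamiliar with the statistic, the comparison between manual and automated content scores can be sketched as a plain Pearson correlation. The function below is a generic textbook implementation written for illustration; the score lists are invented toy values, not data from this study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length (at least 2 items)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (spread_x * spread_y)


# Invented toy scores for five summaries (manual vs. automated).
manual_scores = [0.40, 0.55, 0.62, 0.70, 0.81]
automated_scores = [0.35, 0.60, 0.58, 0.66, 0.90]
r = pearson(manual_scores, automated_scores)
```

A value near 1 means the automated scores rank and space the summaries much as the manual scores do; values such as the 0.66 and 0.72 reported above indicate substantial but imperfect agreement.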
Ongoing work on the content analysis of the students' essays includes investigation of the essays on the third topic (Cybercrime), and analysis of the relationship between the content and argumentation, particularly with regard to the overall structure of the students' essays. The next subsection describes the analysis of students' arguments.

C. Argumentation Annotation and Automated Analysis
Effective argumentative writing presents a claim, considers evidence in support of and against the claim, and demonstrates how the pros outweigh the cons. The project aimed to test whether argumentation features derived from coarse-grained argumentative discourse structure correlate well with the 6-point rubric scale that rates the quality of the argument. To do this, the first step was to label the argumentative part of the 37 Cryptocurrency essays using an annotation scheme generally used in argument mining [68]: main claim, claim and premise/evidence as argument components, and support and attack as argument relations. The advantage of a simple annotation scheme is twofold: more reliable human annotation, and better performance of automatic methods to detect the argument structure. Two expert annotators with backgrounds in linguistics and argumentation performed the annotation, which resulted in a gold standard set of 36 main claims, 559 claims, 277 premises, 560 support relations and 101 attack relations. A proposition was considered as the unit of annotation, given that premises are frequently propositions that conflate multiple clauses and sometimes even sentences [69].
The set of argumentative features introduced by Ghosh et al. [70] was used on the annotated essays to test whether the features correlate with the argument quality scores obtained in the reliability study. The features are grouped in three categories: 1) features related to argument components (ArgC), such as the proportion of argumentative sentences (i.e., sentences that contain a main claim, claims and/or premises) and the number of argument components in an essay; 2) features related to argument relations (ArgR), such as the number of supported and unsupported claims and the number of attack relations (counterarguments); and 3) features related to the typology of argument structure (Str), namely the number of argument chains and argument trees (see [70] for more details).
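To make the first two feature groups concrete, the following sketch counts a few ArgC and ArgR features from a small annotated essay. It is a simplified reading of the descriptions above, not the implementation of Ghosh et al. [70]: the data layout, the feature names and the rule that a claim counts as supported when it has at least one incoming support relation are all assumptions made for illustration. The structural (Str) features are omitted.

```python
def argument_features(components, relations, sentences):
    """Count simple ArgC and ArgR features for one essay.

    components: dict mapping component id to "main_claim", "claim" or "premise".
    relations: list of (source_id, target_id, "support" or "attack").
    sentences: list with, per sentence, the ids of the components it contains.
    """
    claims = {c for c, kind in components.items() if kind == "claim"}
    main_claims = {c for c, kind in components.items() if kind == "main_claim"}
    # Assumption: a claim is "supported" if any support relation targets it.
    supported = {tgt for _, tgt, rel in relations
                 if rel == "support" and tgt in claims}
    supporting_main = {src for src, tgt, rel in relations
                       if rel == "support" and src in claims and tgt in main_claims}
    argumentative = sum(1 for ids in sentences if ids)
    return {
        "n_components": len(components),
        "prop_arg_sentences": argumentative / len(sentences) if sentences else 0.0,
        "n_supported_claims": len(supported),
        "n_unsupported_claims": len(claims - supported),
        "n_attacks": sum(1 for _, _, rel in relations if rel == "attack"),
        "n_claims_supporting_main": len(supporting_main),
    }


# Toy essay: one main claim, three claims, two premises (hypothetical ids).
components = {"M": "main_claim", "C1": "claim", "C2": "claim",
              "C3": "claim", "P1": "premise", "P2": "premise"}
relations = [("C1", "M", "support"), ("C2", "M", "attack"),
             ("P1", "C1", "support"), ("P2", "C2", "support")]
sentences = [["M"], ["C1", "P1"], [], ["C2"], ["P2"], ["C3"]]
features = argument_features(components, relations, sentences)
```

In the toy essay, claim C3 has no supporting premise and is not linked to the main claim, which is the pattern the text associates with lower-scoring essays.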
While Ghosh et al. [70] showed that these features correlate with the holistic essay score (low, medium and high) when applied to TOEFL persuasive essays, this study aims to test the effectiveness of these argumentation features in predicting the argument quality scores (scale of 0-5) obtained in the reliability study. Logistic Regression (LR) learners were trained on the features and evaluated against the human scores using quadratic-weighted kappa (QWK), a metric that has been used for essay scoring [71], [70]. Table IV reports the results from a 5-fold cross-validation setting for the three argumentation feature groups and their combination. The baseline feature is the essay length in sentences, since it has been shown to be highly correlated with essay scores [72]. The best correlation is obtained when using all the argumentative features (ArgC+ArgR+Str), while ArgR is the best performing individual feature group. Moreover, all argumentation features outperform the baseline. Also, the argument tree feature correlates with high-scoring essays, which is not surprising as this feature captures the complexity of a well-written argument. In addition, top-scoring essays (with score 5) have a higher number of attack relations to the main claim, showing that these essays contain counterarguments, which is an aspect in the rubric. The number of claims supporting the main claim was negatively correlated with low-scoring essays, since students who received a low score, although forming arguments, failed to link them to their main claim. Similar to the work of Ghosh et al. [70], the number of supported claims correlates negatively with lower-scoring essays, which shows that students who receive low scores do not provide evidence for their claims.
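For reference, quadratic-weighted kappa can be computed as in the sketch below. This is a standard textbook formulation written in plain Python for illustration; it covers only the agreement metric, not the Logistic Regression training or the cross-validation, and it is not the code used in the study. The default score range of 0 to 5 matches the argument quality scale; the function and parameter names are assumptions.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score=0, max_score=5):
    """Agreement between two integer score lists, penalizing large gaps more."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty score lists of equal length")
    k = max_score - min_score + 1
    if k < 2:
        raise ValueError("the scale needs at least two score levels")
    n = len(rater_a)
    # Observed confusion matrix between the two sets of scores.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement weight
            weighted_observed += weight * observed[i][j]
            # Expected count if the two score distributions were independent.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n
    if weighted_expected == 0:
        return 1.0  # both lists are constant and identical
    return 1.0 - weighted_observed / weighted_expected


perfect = quadratic_weighted_kappa([0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5])
reversed_extremes = quadratic_weighted_kappa([0, 0, 5, 5], [5, 5, 0, 0])
```

Because the weights grow with the square of the score difference, predicting a 1 for an essay scored 5 is penalized far more heavily than predicting a 4, which suits an ordinal rubric scale. Libraries such as scikit-learn provide an equivalent through `cohen_kappa_score` with `weights="quadratic"`.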
Another interesting observation from this analysis is that in the best essays (score 5) the ratio of argumentative sentences to total number of sentences was higher than for essays with a score of 4, whereas essays with a score of 4 were generally longer than the essays with a score of 5. That could also explain why the baseline feature (essay length) performed so poorly, since length alone is not indicative of argument quality.
The correlation scores were lower than the ones reported by Ghosh et al. [70], a finding that could have several explanations: 1) the number of essays is smaller, 37 compared to 107; 2) a 6-point scale rather than a 3-point one was used; and 3) the scale used reflects the argument quality and not an overall score. Looking at argument structure alone might not be enough; instead, both the structure and the semantics of arguments need to be examined in order to predict the argument quality more reliably [73]. This approach will be pursued in future work.

IV. EVALUATING THE EXPERIENTIAL LEARNING PROJECT
The foregoing discussion has addressed the first two research questions. Central to these arguments is the importance of providing timely, formative feedback to enhance students' understanding of what is expected in argumentative writing. Rubrics provide an avenue for doing this, and may be used in a manner that integrates the feedback into instruction so that students begin to view their grades as more than just a score. However, the experiential aspects of the learning cycle hinge on the premise that the writing activities students engage in must also include reflection.
Expectancy-value theory [74] can be used to assess the holistic impact of experiential learning. This framework depicts learners' motivation as based on their expectancy of success and the value attributed to a given task. Students with low self-efficacy typically find understanding and acting on formative feedback difficult. As a result, they tend not to engage in reflection. Motivation and engagement are intrinsic to all learning. Engagement is currently receiving much attention in higher education (HE) because students' opinions now play a principal role in rating teaching and learning in tertiary education. Consequently, using innovative teaching techniques to produce positive academic outcomes for students is no longer optional, which gives impetus to experiential learning.
Students were asked to complete a questionnaire about the rubric after receiving feedback on the first of the two essay assignments. First, students were asked the following initial questions requiring Yes/No responses. 1) Did you get the mark you were expecting on Argumentative Essay 1? 2) Did you use the rubric? Those who had used the rubric were questioned further: 3) When was the rubric used (before starting the assignment, while doing it, after completing it, or some combination of these)? 4) What was it used for (to understand the requirements, as a guide, for checking, or some combination of these)? 5) Do you feel the rubric helped you achieve your desired score? If yes, explain how. Out of the 84 respondents, almost two-thirds (63%) reported using the rubric in one or more of the ways suggested above. Thirty-four percent of these students believed that the passing score they received was due to having access to the rubric. Others said they probably would have achieved the same score without using the rubric, with only 11% suggesting that the rubric did not help them at all. The Wise Crowd were also questioned about their experiences with using the rubric. In addition to saying it helped them understand different aspects of the writing process, one said that it made "very clear what the expectations were". Another suggested that without the rubric, he "wouldn't have known exactly what was expected". What is even more striking is that more than 65% of students who attempted both essays scored the same or a higher mark on the second essay. There is limited evidence to suggest that the feedback from the first assignment aided their performance on the second essay. Instead, what seems more likely is that the second essay allowed students to master the experiential learning approach they had been exposed to in the first assignment.

V. CONCLUSION
Discussions in this paper have centered on how argumentative writing instruction for STEM students could be aligned with the learn-by-doing approaches used in these fields. In sum, experiential learning provides an alternative to simply telling these students what is expected of them in college assignments. Furthering Dewey's conjecture (experience as reawakening), it is suggested that conceptual understanding of writing arguments could be fostered by engaging students in transformative experiences that allow them to confirm and extend their ideas. Part of this process involves providing them with a rubric for instruction and assessment. The reliability study showed that rubrics can be applied reliably outside classroom contexts, but that classroom grading tends to be less reliable, which can be attributed to time pressures and lack of training. The study also provides a benchmark for training and testing the algorithms that are being developed, ultimately to support instructors or raters. Previous work has shown that automated methods for applying analytic rubrics can reduce the demands on instructors' time, and can be used fruitfully to support students in revising their written work.
Source-based writing draws on reading comprehension as well as on writing skills, which are skills that support each other, but which require different kinds of instruction [75]. The automated analysis of the summaries that students included in their essays performs well when compared with manual annotation. It could therefore be used by instructors to provide feedback on students' understanding of sources. The same features that have proved useful for automated analysis of argument in previous work [76] are shown to be the most predictive of the feature sets used here as well. In the context of freshman writing courses, especially for STEM students, work on integrating automated assessment of argumentation and subject matter content is already in progress. The availability of 21st century educational technologies now makes it possible to support new pedagogical approaches through automation, at least for transparency, uniformity and competence. Although automated scoring is still subject to much debate, what is being advocated here is automation with human intervention. Automation needs to be designed with student-centred learning in mind.