
Dr. Vincent Ng’s Group Advances the State-of-the-Art in Automated Essay Grading

Dr. Vincent Ng, Associate Professor of Computer Science at the University of Texas at Dallas and a member of the UT Dallas Human Language Technology Research Institute, has been active in a range of research areas related to machine learning of natural languages, part of the broad area of natural language processing (NLP). NLP concerns the development of systems and algorithms that improve a user’s ability to find and extract information from text documents. Dr. Ng is also a popular instructor among Computer Science undergraduates, for many of whom he is the face of artificial intelligence: he has regularly taught senior-level courses on artificial intelligence and machine learning since arriving at UT Dallas a decade ago.

Dr. Ng’s group has published a series of papers at recent annual conferences of the Association for Computational Linguistics (ACL) describing the advances they have made in automated grading of student essays. Automated essay grading is one of the most important educational applications of NLP. For instance, the Educational Testing Service’s (ETS) automated essay scoring system, e-rater, has been employed in scoring SAT and GRE essays, where a holistic score is assigned to each essay that reflects its overall quality. In one of the models adopted to grade these essays, a human grader and e-rater first independently assign a score to each essay; if the two scores differ by more than a predefined margin of tolerance, a second human grader is asked to grade the essay. In other words, automated essay scoring systems can help reduce the amount of time humans spend on grading essays.
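As a rough illustration of that adjudication model, here is a minimal Python sketch; the one-point tolerance and the averaging of agreeing scores are hypothetical choices for illustration, not ETS’s actual policy:

```python
from typing import Optional


def resolve_essay_score(human_score: float, erater_score: float,
                        tolerance: float = 1.0) -> Optional[float]:
    """Combine a human grader's score with an e-rater score.

    Returns the average when the two scores agree within `tolerance`;
    returns None to signal that a second human grader is needed.
    (The one-point tolerance here is a hypothetical value.)
    """
    if abs(human_score - erater_score) <= tolerance:
        return (human_score + erater_score) / 2.0
    return None  # escalate to a second human grader


if resolve_essay_score(human_score=4.0, erater_score=5.5) is None:
    print("Scores disagree beyond the tolerance; a second human grades the essay.")
```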

For an essay grading system to be truly beneficial to students, however, merely providing a holistic score is by no means sufficient. Given a bad holistic score, a student will not be able to tell why he/she did not do well. For this reason, ETS offers an online automated grading service called Criterion. Through a web-based interface, a student can submit an essay written for one of a set of predefined prompts to Criterion, which then provides diagnostic trait feedback along five dimensions of essay grading: grammar, usage, mechanics, style, and organization and development.

As Dr. Ng explains, “our goal is to extend the capabilities of state-of-the-art essay grading systems such as Criterion. A major limitation of these systems is that they can only handle essay grading dimensions that focus primarily on the lexical and syntactic aspects of an essay. In particular, they cannot handle dimensions that require an understanding of essay content, such as thesis clarity (how clearly an author explains the thesis of her essay) and argument strength (the strength of the argument(s) an essay makes for its thesis). Grading essay content, however, is a substantially more challenging problem: it requires that a system focus on the semantic aspect of an essay and its discourse structure. For instance, to determine the strength of an argument, a system has to first identify the claims and the premises from an essay. Then it has to determine which of these premises was used to support which claim. After connecting the premises to the claims, it has to determine whether a premise provides the right kind of support for the associated claim (e.g., it does not make sense for someone to say that she supports abortion because it is murder), and if so, whether the premise convincingly supports the claim (i.e., how strong it is). None of these subtasks is easy to accomplish, however. For instance, identifying discourse elements such as claims and premises requires that a system understand the discourse structure of an essay, and determining whether a premise provides the right kind of support for its claim requires that a system determine the author’s stance on the essay topic. The challenge involved in these subtasks stems in part from the fact that they could demand a substantial amount of so-called background knowledge, that is, knowledge about the world that humans rely heavily on when understanding textual content.” In collaboration with Computer Science Ph.D. student Isaac Persing, Dr. Ng has begun working on solving these problems.
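To make the pipeline Dr. Ng describes concrete, the sketch below models its main objects and the final aggregation step. The class names, fields, and averaging scheme are illustrative assumptions, not the group’s actual representation or scoring model:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Premise:
    text: str
    right_kind_of_support: bool  # does it support, rather than contradict, its claim?
    strength: float              # how convincingly it supports the claim, in [0, 1]


@dataclass
class Claim:
    text: str
    premises: List[Premise] = field(default_factory=list)


def argument_strength(claims: List[Claim]) -> float:
    """Average the strengths of all premise-claim links in an essay.

    A premise that does not provide the right kind of support contributes
    zero, mirroring the last two checks in the pipeline described above.
    """
    scores = [p.strength if p.right_kind_of_support else 0.0
              for claim in claims for p in claim.premises]
    return sum(scores) / len(scores) if scores else 0.0
```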

While providing diagnostic trait feedback is an important aspect of automated essay grading, a longer term goal of Dr. Ng’s research effort is to correct the errors in an essay, as error correction will provide even more useful feedback to a student. Conceivably, however, correcting errors is more challenging than identifying errors. As Dr. Ng points out, “recent years have seen a surge of interest in grammatical error correction, such as correcting prepositional errors. We are interested in correcting discourse-level problems, such as incoherence in the ideas presented and weak arguments.”

Addressing these problems will not only advance the state of the art in automated essay grading but also have a significantly positive impact on students’ writing skills. Given the challenges they present, however, a natural question is: is there a good way to tackle them? Dr. Ng explains that “a good starting point is to employ machine learning. In a way, machines learn like humans. For instance, to enable a child to distinguish between cats and dogs, we can show him/her many pictures of cats and dogs. This ‘learn-by-examples’ paradigm has been shown to be very effective for machine learning. In the context of essay grading, if we want the machine to learn to identify claims and premises, we show it essays that are annotated with claims and premises; and if we want it to learn how to correct incoherence problems, we show it essay pairs in which one essay is incoherently written and the other is the same essay with the problem manually corrected.”
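A minimal sketch of this “learn-by-examples” paradigm applied to claim/premise identification, using scikit-learn on a toy handful of labeled sentences (a real system would train on thousands of annotated essays and use far richer features than bag-of-words):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy annotated sentences standing in for a real labeled corpus.
sentences = [
    "Schools should ban junk food.",                      # claim
    "Studies link sugary snacks to poor concentration.",  # premise
    "Homework should be optional.",                       # claim
    "Students with heavy homework loads sleep less.",     # premise
]
labels = ["claim", "premise", "claim", "premise"]

# Learn by examples: fit a simple text classifier on the annotations.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(sentences, labels)

print(model.predict(["Excess screen time harms attention spans."]))
```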

Given these annotated examples, algorithms that learn from them can be developed. Providing the annotated examples, however, is a labor-intensive and time-consuming task. While there have been attempts to develop prompt-independent essay-grading technologies, in practice it is unrealistic to expect a system to grade essays written for a new prompt well without annotated examples for that prompt. One significant challenge, then, is to design algorithms that require only a small number of annotated examples without significantly sacrificing performance. Dr. Ng hopes his research will lead to solutions to these issues and enhance the overall efficiency of the learning process.
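One generic way to stretch a small annotation budget is pool-based active learning: train on a few labeled essays, then ask a human to annotate only the essays the current model is least certain about. The sketch below illustrates the idea with uncertainty sampling; it is a standard technique shown under toy assumptions, not the group’s published method:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# A few labeled essays (toy stand-ins) and a pool of unlabeled ones.
seed_texts = ["This essay states its thesis clearly.",
              "The thesis is never stated."]
seed_labels = ["clear", "unclear"]
pool = ["The argument wanders without a point.",
        "A crisp thesis opens the essay.",
        "Some ideas are clear, others are not."]

vectorizer = TfidfVectorizer().fit(seed_texts + pool)
classifier = LogisticRegression().fit(vectorizer.transform(seed_texts), seed_labels)

# Pick the pool essay whose predicted probability is closest to 0.5,
# i.e., the one the model is least sure about, for human annotation.
probabilities = classifier.predict_proba(vectorizer.transform(pool))[:, 0]
most_uncertain = pool[int(np.argmin(np.abs(probabilities - 0.5)))]
print("Annotate next:", most_uncertain)
```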



About the UT Dallas Computer Science Department

The UT Dallas Computer Science program is one of the largest Computer Science departments in the United States, with, as of Fall 2015, over 1,600 bachelor’s-degree students, more than 1,100 master’s students, 160 PhD students, and 80 faculty members. With The University of Texas at Dallas’ unique history of starting as a graduate institution, the CS Department is built on a legacy of valuing innovative research and providing advanced training for software engineers and computer scientists.