This is my first time learning that AI-graded essays are a thing. Am I the only one who thinks that's insane? I feel like you'd probably have to have an AGI to meaningfully evaluate an essay.
I work in AI, and was very surprised when I heard about this (a few years ago). I don't think anyone who works in the area thinks the tech is ready for this kind of deployment. There is research on the subject [1], and NLP systems can do better than baseline methods, but the error rates are still pretty high.
A thing you quickly find if you try to download off-the-shelf NLP tools and apply them to anything is how little of it is reliable at all, unless you can constrain the domain. Even basic topic identification only works with low error rates when constrained to something like NYT stories or PubMed abstracts, not arbitrary text by arbitrary writers. And I would bet ETS is using worse tech than the research state of the art.
You've noticed, though, that the AI con is on. This damages your work: people get burned, and that's what will bring about the second "AI winter".
People making big decisions with a lot of money around computing know nothing about it and are marks for con artists. Think big consulting firms selling to senior public servants in Washington. "For a successful technology, reality must take precedence over public relations." But reality just gets in the way when conning a mark for a successful snake oil sale, right?
Call it out, publicly, and cite your credentials. Encourage colleagues, your competition, and everyone with a clue to pour scorn on whoever is selling this evil, toxic waste as drinkable.
Hmmm. I also work in AI, in fact professionally in information retrieval and NLP. I disagree strongly with what you say. Basic topic summarization and keyword / named entity extraction on unstructured sources of text work reasonably well. It’s easy to fine-tune BERT and GPT for smaller problems, and language classification is borderline totally solved by extremely easy-to-train neural network models.
I still agree that automatic essay grading is beyond the reach of SOTA NLP models today, but you make it sound like virtually nothing can be done in a production-grade manner that solves real-world unconstrained NLP problems. This is manifestly false.
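To be concrete about what "off-the-shelf" means here, something like the sketch below is all it takes. (This is a minimal illustration using the Hugging Face pipeline API with its default pretrained models; the example text and printed fields are placeholders of mine, not anything from a production system.)

    # Minimal sketch: entity extraction and classification with default
    # pretrained models via the Hugging Face pipeline API.
    from transformers import pipeline

    text = ("Apple is reportedly in talks with TSMC to secure extra "
            "chip capacity for next year.")

    # Named entity extraction on unconstrained text works reasonably well.
    ner = pipeline("ner", aggregation_strategy="simple")
    for ent in ner(text):
        print(ent["entity_group"], ent["word"], round(float(ent["score"]), 3))

    # Sentence-level classification is similarly turnkey.
    clf = pipeline("text-classification")
    print(clf(text))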
It's completely possible I'm not fully up on recent progress, especially since a bunch of stuff seems to have moved in the past 6 months. But I haven't seen any general models that can solve open-domain problems, without specifically retraining on each domain. Do you have any pointers? E.g. a single pretrained BERT model that can reliably extract topics from: tweets, paragraphs from 19th-century novels, mathematics journal articles, and Wikipedia articles? All the systems with very low error rates that I know of target one specific domain. The last time I looked into sentiment analysis (a year or so ago), it wasn't even that great on many individual domains, e.g. it would get tripped up by sentences from novels that used "negative" keywords in a humorous or ironic way.
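The sentiment failure mode, at least, is easy to reproduce yourself. A quick probe along these lines (off-the-shelf pipeline with its default model; the example sentence is invented by me) is the kind of thing I mean:

    # Sketch: probe an off-the-shelf sentiment model with "negative"
    # keywords used affectionately; review-trained models often misfire.
    from transformers import pipeline

    sentiment = pipeline("sentiment-analysis")  # default pretrained model
    print(sentiment('"You wretched, wonderful fool," she laughed, '
                    'hugging him tighter.'))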
In production problems that I work on, we don’t even really use things from within the past year. These problems are just incredibly well-solved with fairly vanilla LSTM networks from 2-3 years ago. Enough so that while it’s probably premature for fully automated essay grading, it’s not _crazy_ to make a product from models trained to solve this problem.
I have a grant where we are doing just that: implementing more or less SOTA research using fairly vanilla LSTM networks from 2-3 years ago (primarily Taghipour & Ng) to provide low-stakes feedback to students on their essays in one of our teaching tools at Purdue. It’s based on research using the Kaggle ASAP dataset, and we have found it to be pretty accurate across a variety of domains in early testing, though some essay prompts seem to do better with CNNs vs. RNNs. I doubt many of the systems in TFA are based on LSTMs or neural nets at all. They are probably doing regression on hand-crafted features.
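For anyone curious, the core of the Taghipour & Ng architecture is small enough to sketch. This is a rough Keras outline from memory, not our production code, and the sizes are illustrative:

    # Rough sketch of a Taghipour & Ng (2016) style essay scorer:
    # embeddings -> LSTM -> mean-over-time pooling -> sigmoid score.
    from tensorflow.keras import layers, Model

    VOCAB_SIZE, MAX_LEN = 4000, 500  # illustrative values

    tokens = layers.Input(shape=(MAX_LEN,), dtype="int32")
    x = layers.Embedding(VOCAB_SIZE, 50, mask_zero=True)(tokens)
    x = layers.LSTM(300, return_sequences=True)(x)
    x = layers.GlobalAveragePooling1D()(x)            # mean over time
    score = layers.Dense(1, activation="sigmoid")(x)  # normalized [0, 1]

    model = Model(tokens, score)
    model.compile(optimizer="rmsprop", loss="mse")
    # Train on essays with scores min-max scaled to [0, 1] per ASAP
    # prompt, then rescale predictions to the prompt's score range.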
Very interesting. Are there any meta-analyses / reviews that summarize progress in this area? And would it be possible to share your grant proposal? I'd be curious to get an idea of what is being attempted.
It's an internal grant and I'm not sure I'd be allowed to share it. We are adding AES to our peer-review app. Currently as an additional "grader" to the peer reviews since that's what the PI requested. Since the tool allows unlimited submissions until the review date, I hope to add it as a "pre-flight" estimate to give students a chance to get a rough prediction of the score they will receive and a metric they can use as they revise until the due date.
I'm not aware of any meta-analyses myself. I have been keeping up with the ASAP competition and various attempts to improve on the initial systems for a number of years. The two papers I believe are having the most success are [1] and [2]. [3] seems promising for balancing the opposing forces of high accuracy for true positives and the risk of false positives via adversarially crafted inputs.
I'm also vaguely aware of research happening around extracting features from neural nets. I'd love to be able to help students understand why the system is predicting a particular score.
We had this in my school for 8th and 9th grade, so 2008-2010. We had to type the essays in class and submit by the end of the hour. I would only get maybe 3 paragraphs in before time was up, because I was trying to build a strong argument for the prompts. Despite that I would usually get 3-4/6, and my teacher said she would read the essays and regrade, but she never actually did. My friend literally copied and pasted the Pledge of Allegiance 20-30 times and scored a perfect 6/6. Later we found out that if you repeated the words in the writing prompt you would get a guaranteed 5/6, and with a high enough word count you’d get 6/6. The essays were all bullshit and just a way for the teachers to get an extra free period once a week.
I totally agree that "AI" grading is totally bullshit. But I also have plenty of experience teaching/TAing large courses, and after reading too many essays they all become semantically saturated meaninglessness. One cannot help but skim them and grade according to a few quick heuristics. At that point one tries to be self-consistent and defensible in one's grading, but careful consideration is right out. I suspect state graders are dealing with way more than 100 essays per person and are probably on a tight schedule too. It's quite possible that an ML model is better than an exhausted human grader, as their cognitive strategies are mostly identical.
The solution isn't to do a better job at grading 'meaninglessness' but to stop requiring the production of it in the first place.
One major problem with algorithmic approaches, whether automated or not, is that they become the definition of good in the context and therefore become something that cannot be argued against. And of course it makes 'teaching to the test' an even more likely outcome.
If I were a conspiracy theorist I'd attribute this to wanting a dumbed down population. Unfortunately I think it is probably the other way round, the population is already dumbed down and a belief in AI unicorns is the result.
As Euclid said to Ptolemy, 'There is no royal road to geometry', and so it is with education; it's hard work for both the student and the educator, and no amount of AI/ML/algorithmic snake oil will change that without also changing the meaning of the word education.
I remember when I was in middle school 16 years ago, my English classes would have us submit some of our work to a web app. It would then grade the submission. I remember this distinctly because I asked my teacher to intervene on at least two occasions. The app failed to recognize the words "squirrelly" (as in "That guy in the corner has been acting squirrelly.") and "defragment". My teacher decided to subvert the app's recommended grades because she, as a human, understood the intent of my use of those weird words.
> I feel like you'd probably have to have an AGI to meaningfully evaluate an essay.
The reason this isn't the case is that there are very simple metrics that correlate highly with essay quality. It doesn't mean the grading bot is actually evaluating essay quality; it's just looking for properties that are statistically associated with good essays. Remember, at the end of the day as long as the bot's ranking is close enough to the human grader's ranking, nobody really cares about the internal logic.
A very straightforward example is spelling mistakes. People who make spelling mistakes aren't necessarily bad writers, and vice versa: there may be great spellers who can't write for shit. But by and large, the people who spell poorly also tend to write poorly. Easily detectable grammatical issues, like misplaced modifiers, subject-verb disagreement, or inconsistent tense, are also correlated indicators.
Another very simple metric is essay length, especially if it's a timed exam. Good writers tend to have verbal fluidity, with words easily flowing to paper. They don't struggle converting thoughts to sentences, so they tend to end up with the most words written down within a fixed time period. By and large, the longer a timed essay is, the more likely its actual quality is high.
Grading bots basically rely on these statistical relationships. They're not measuring anything intrinsic to good writing. But at the end of the day, their student rankings are usually pretty close to those of a typical human grader. In some cases the bot's ranking will be closer to a random human grader's than two random human graders' rankings are to each other.
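To make that concrete, here is a toy version of the approach. The features, lexicon, and scores are invented for illustration; the point is that nothing in it "understands" the essay, yet it tracks the human grades:

    # Toy feature-and-regression grader: shallow proxies only.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    def features(essay, lexicon):
        words = essay.lower().split()
        n = len(words)
        misspelled = sum(w.strip('.,;!?"') not in lexicon for w in words)
        avg_word_len = sum(len(w) for w in words) / max(n, 1)
        return [n, misspelled / max(n, 1), avg_word_len]

    lexicon = {"the", "cat", "sat", "on", "mat"}  # stand-in dictionary
    essays = ["the cat sat on the mat",
              "the cat sat",
              "teh catt saat onn teh matt"]
    human_scores = [5.0, 3.0, 2.0]

    X = np.array([features(e, lexicon) for e in essays])
    model = LinearRegression().fit(X, human_scores)
    print(model.predict(X))  # mirrors the human scores via proxies alone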
The biggest flaw here is Goodhart's law: when the test takers become aware of the kludges the bots use, they can exploit them, for example by dumping a bunch of verbal diarrhea with as many correctly spelled words as possible. But even then it doesn't really hurt the bot's ranking accuracy too much, because the kids who do the most test prep and learn all the tips and tricks are usually high achievers who do well on essays anyway.
Strongly (but respectfully) disagree with a lot of this!
This is related to current fairness-in-AI discussions. In many cases the basic problem is that ML systems leverage correlations to make what are effectively causal decisions. Here, there is a huge ethical difference between scoring a person based on "is this a good essay" and "do the features of this essay correlate with features of good essays", just like there is a huge fairness and discrimination difference between "is this person qualified for a loan" and "do the features of this person correlate with features of people who qualify for loans" (algorithmic redlining). Your last sentence has a big discrimination/fairness issue also, since you are testing even more for parental income and parental free time.
>Remember, at the end of the day as long as the bot's ranking is close enough to the human grader's ranking, nobody really cares about the internal logic.
This isn't true at all. Imagine you got a B or C on an essay that a human would have given an A to because you wrote it concisely and in plain language, or because you used language that's statistically correlated with being black. Does the fact that this is rare console you? "Sorry, but it's usually very close to the human grader's ranking." Close enough isn't good enough when you get the short end of the stick. "Sorry, you aren't going to get to go to the college you wanted because you use language statistically correlated with poor writing." Or just because you're different, so the statistical correlation doesn't apply to you, you filthy outlier. Just because it's a rare event doesn't make it okay.
In adulthood, this is like hiring or firing based on features statistically correlated with good work. Remember when Amazon rolled out the resume scorer? [0] Sure, it was biased against women, but it was close enough to human scores, so who cares about the internal logic?
>Grading bots basically rely on these statistical relationships. They're not measuring anything intrinsic to good writing.
At the end of the day, our goal here is to measure good writing. If the bots aren't measuring anything intrinsic to good writing, we shouldn't use them.
The problem with the bots is that while they agree with the humans on average, they can produce very different results for individuals. Fine if you're seeing how a school is doing; horrible if you're testing how a student is doing.
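A quick simulation shows how both things can be true at once (the noise model here is invented purely for illustration):

    # Aggregate agreement vs. individual damage, simulated.
    import numpy as np

    rng = np.random.default_rng(0)
    human = rng.integers(1, 7, size=10_000)                  # scores 1-6
    bot = np.clip(human + rng.integers(-2, 3, size=10_000), 1, 6)

    print("mean human:", human.mean(), "mean bot:", bot.mean())
    print("off by 2+ points:", (np.abs(bot - human) >= 2).mean())
    # The means (the school-level view) match closely; the 2+ point
    # misses are the individual students who get burned.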
Your last paragraph, and particularly the last sentence, epitomizes what is wrong with your whole thesis: the ultimate goal of the testing (and education itself, for that matter) is not to find people who can "do well on essays"; it is to develop analytical thinking.
That assumption lacks justification when the scoring does not actually measure analytical thinking. Any statistical evidence for it is suspect as a predictor of future outcomes when a high score can more easily be gamed than 'honestly' achieved.
Scoring is not the point here; the analytical thinker is gaming the test to pump the score, thus proving they are an analytical thinker. It's not a statistical argument; it's a suggestion that the screen works even when it is abused, because it will be abused.
It is absolutely insane. By no definition does the system understand what is written.
You could ask a student to write an essay taking a firm opinion on some subject, and they could change standpoint every paragraph and there's no way these systems would know.
If I was a student I would be extremely offended at people wasting my time like this.
I'm surprised people are surprised by it. I guess it just hasn't gotten talked about a lot? When I took the GRE in 2011, the rule was that my essay would be graded by one human and one automated grader, and a second human would become involved if the computer and the human differed by one point or more, IIRC.
Maybe nobody really makes a big deal about it because it is pretty much irrelevant anyway. Applicants provide a letter of intent that the grad dept people can, y'know, actually read for themselves, so I think unless you totally bombed the writing section nobody cared.
In a forum of CS people, I'm surprised this is one of the top opinions. Our field is full of super surprising results like this -- that you don't have to actually understand the text beyond basic grammar structures to reasonably accurately predict the score a human would give it.
Like this kind of thing should be cool, not insane. I mean wasn't it cool in your AI class when you learned that DFS could play Mario if you structured the search space right?
I came first in English for my school, many moons ago. Leading up to the finals, I regularly finished ahead of the hardcore English essay people, generally to my amusement. My exam essay responses were generally half the length of the prodigious writers' (sometimes even shorter). Although I've an OK vocabulary, I always made sure I made the right choice of word to hit a specific meaning, rather than choosing words with a high syllable count.
I'd find it highly interesting to see what kind of result I'd get using an automated system.
Why?
Because I once asked a teacher (also an examiner) why I got good grades above the others, and the answer surprised me: my answers were generally unique and refreshingly different, to the point, not too long, and easy to read.
I suspect with this new system, I'd be an average student. It'd also be interesting to find out, several years down the road, if the automated system could be gamed at all -- I suspect it could, and teachers would help students 'maximise' their scores as a result of that.
It seems plausible that, under this system, you would eventually have learned to write longer essays.
To my mind, that would be a school teaching you to be worse.
In fact, throughout the article I kept being surprised by the idea that long is good. When writing, I tend to prefer being brief.
When I hear a result like "software which understands basic grammar structures can predict what grade a human would give an essay" I think my views are roughly:
* 5% - cool, we could make a company that grades essays
* 15% - cool, we could make a company that grades essays and sell our source code to the test-prep industry
* 80% - fascinating, it sounds like the exam designers need to reevaluate what they are trying to measure with essay questions
Whatever we decide to measure, it needs to scale to millions of essay responses each year in a way such that scores are consistent across entire states or countries. With that in mind I'd imagine it's difficult to do much more than grade on grammar and basic semantics.
And if you succeed, you will simply be measuring an uninteresting but manageable subset of the problem, which will then become, in some people's eyes, the definition of the problem.
Education is supposed to be about teaching people to think, to give them the tools with which to do it, to be able to evaluate, criticise, invent, etc.
"...that you don't have to actually understand the text at beyond basic grammar structures to reasonably accurately predict the score a human would give it"
That only really shows that the humans they're training on are terrible at grading essays.
This problem is a first class demonstration of the difference between "can we?" and "should we?"
The fact that it's being implemented in society is insane because anyone who is paying attention to the state of AI today already knows how it will go wrong: without reading the article I already guessed that it systematically discriminated against certain demographics. Which was in fact what the article claimed.
It's interesting that it's possible to predict what the scorer would decide, but the moment you actually implement it is when all of the known problems become relevant, and the intellectual wonder must take a backseat to the human problems.
Teaching human-to-human communication by removing human inputs and having computers decide about quality... call me a skeptic. I feel bad for the students. Essay grading was bad enough before this.
Even narrowly, for grammar: is that a good thing? It probably helps scale grammar feedback to more students, but if those tools became ubiquitous in grading and editing, unique voices would just disappear, and a lot of potentially “great writers” might choose different careers because the machines don’t like them.
Adding further bias against the underprivileged is not "cool". Implementing this while avoiding publicity or providing a means to publicly audit the results is doubly not cool.
It is fine to play with "cool" techniques when you are doing consequence-free stuff like playing Mario. When you are creating systems that have significant and long-term effects on people's lives, a different standard applies.
Like how I felt when I was given low grades for my ugly handwriting. It was stupid to grade it, but it guaranteed that I would never get a top score in any literature class.
This is sort of like discovering the Excel spreadsheet at the heart of a system responsible for handling hundreds of millions of dollars of transactions for your bank.
Yeah, it's cool, but what about your savings account?