Yes, it's the latter sort, according to the article. The rubrics for standardized-test essays are very narrow and shallow, since they seek to eliminate as much subjective judgement as possible.

Even in the relatively innocent early '80s, when machine evaluation of exam essays was infeasible, the grading rubric for the US AP English exam's essays was a substantial document, several pages long. And those essays were exercises in literary criticism, which as such weighted substance rather more highly than the sort of "can Johnny write" test described here. The rubric allowed quite a bit of latitude but took pains to emphasize a consistent interpretation of the qualities the exam was supposed to test.

In the '90s, when portfolio evaluation was the rage in college composition classes, instructors would hold calibration sessions where they'd all evaluate the same set of portfolios, then compare and discuss the grades they assigned. That's a much more nuanced and useful way to get consistent human judging, but of course it's very resource-intensive.

There's a huge body of research on evaluating writing, which standardized testing companies blithely ignore. It's quite an active area, apparently, with plenty of contention. I have friends in the industry who don't believe evaluation, as it's currently conceived, is even meaningful. It's not that they think everyone's prose is equivalent - just that the ways in which we conceive of writing as "better" or "improving" are so subjective, culturally specific, and inconsistent that we should stop pretending the metrics we've been using actually mean anything.

So what this present study boils down to is the nearly tautological observation that if you can reduce the evaluation of writing to something sufficiently mechanical, then you can mechanize it. Well, yes. Whether you've achieved anything useful thereby is rather another question.

