A Book Contest – Evaluating the Evaluation

I’m sure most of us have had the experience of writing a government-sponsored exam. It may not be one of our favorite memories. However, not many of us have experience marking any of these dreaded exams. Over my thirty-plus years of teaching, I had several opportunities to work on marking teams for provincial writing assessments. After my involvement in the BookLife Prize (BLP), I started to think about their evaluations compared to the assessment practices used by educators.

The government starts by looking for a large group of teachers to mark their writing exam. Their goal is to have every writing sample given a reasonable and defensible mark. In an ideal process, every student’s work would get the same mark, irrespective of who on the marking team did the evaluation. I’m guessing that the BLP hopes their evaluators are also consistent. In order to select semi-finalists for the prize, they rely on the novels that got the highest marks. This implies a belief that all of their evaluators mark in a way that is consistent with their peers – that any given entry would get the same mark, irrespective of the evaluator. It’s actually a difficult thing to achieve.

The government looks at many applications and selects educators they feel have the best chance of doing a good job. In the BLP, five published authors each agree to vet and evaluate all the manuscripts in their category (Mystery/Thriller, Romance/Erotica, Fantasy/Science Fiction, General Fiction, YA). I assume they’re chosen because they’ve been successful at writing books that got published. So far, there’s no real difference except in the scale of hiring.

Everybody comes to the evaluation job with different beliefs and baggage. In teaching, there are people who believe that standards everywhere are plummeting. They say that there’s too much acceptance of shoddy work and the system is producing nothing but illiterates. They demand proper grammar and excellence in writing skills and are loath to give high marks, as their standards are very high. Let’s call these the A end of a personality scale. There are other teachers who believe every student deserves nothing but nurturing and praise. They think low marks are bad for a student’s self-esteem, so they are not likely to give low marks to anyone. They know a positive approach is the only way to give students the confidence to continue on to greater achievement. Let’s call these the Z end of the scale. Most teachers fall somewhere between A and Z. The perfect evaluator is one who approaches the work without an agenda, uses a defensible scale and does their best to balance criticism with positive encouragement. Of course anybody, no matter where on the scale, can have a ‘bad hair’ day and slip closer to one of the extremes. Let’s assume that BLP evaluators fall somewhere along a similar spectrum – the only difference being that they are assessing adults rather than students.

A very important step has to be considered next – establishing the benchmark of excellence you’re expecting. In teaching, it means tailoring your expectations to the age level you’re marking. In a government assessment, evaluators know in advance that even a great writing sample from a ten-year-old is not going to look nearly as good as one from an eighteen-year-old. You have to take this into account when you’re marking a set of writing from either age group.

You might think this is not an issue for the evaluators in the BLP contest – they’re all assessing adults. Perhaps this is so, but a contest designed for unpublished authors could set its benchmark in one of several ways. An evaluator might only give a ‘perfect’ mark to a novel they consider a contender for a Nobel, Booker or Pulitzer prize – something so wonderfully written that it has lasting impact and literary value. On the other hand, in a contest of unpublished authors, an evaluator might look at submissions from a ‘publishable’ perspective – with a ‘perfect’ score going to a professionally crafted and engaging story that really ought to find a publisher. Either approach is fine, but you can imagine the problems that could arise if some evaluators marked with one benchmark in mind while others used the other. I have no idea if the BLP establishes a common understanding of the benchmark to be used in evaluating a novel.

You’ve got a team and you’ve established the benchmark of excellence. The next trick is to establish the scale you want your evaluators to use in assessing the writing. Do the evaluators give a mark out of 5, 10, 100 or some other number? This is not like math, where you can give a test with 100 questions and however many are correct is the mark you get out of 100. Writing is a much more subjective thing to assess, and the more specific the criteria you use, the more valid your mark should be. Think of it this way – let’s say you choose to mark out of 100. In that case you should, in theory, have 100 specific attributes that distinguish each level/mark. That’s so you can validate giving one piece of writing 63/100 and another one 65/100. You would be able to say: “I gave this a 63 because it has 63 of the 100 things I’m looking for – the one with 65 had two more attributes than the other.” It would require a lot of work and preparation to come up with descriptors that clearly outline 100 specific attributes of writing.

If you think this is a bit boggling, you’re right! The government uses a 4-point scale. Each point on the scale has a small number of descriptors that evaluators use when assessing a writing sample. Each descriptor builds on and overlaps the one before it. For example, a 1 descriptor has the fewest criteria. A 2 descriptor includes everything the 1 had, plus something more/better, but is still missing things needed to get a 3. If you are using a 10-point scale and give a mark of 8, it is because the writing shows everything described for an 8 but is missing something found in the descriptor for 9. Obviously a 10-level scale is trickier to create and use than a 4-level scale.
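To make the cumulative-descriptor idea concrete, here’s a minimal sketch in Python (purely illustrative – the real marking guides are prose documents, and the descriptor wording below is invented for the example). The mark is simply the highest level whose descriptors, and everything beneath it, are satisfied.

```python
# Illustrative cumulative scale: each level assumes all lower levels plus something more.
# The descriptor wording is invented for the example, not taken from any real guide.
DESCRIPTORS = {
    1: ["ideas are present"],
    2: ["ideas are organized"],
    3: ["supporting detail is developed"],
    4: ["voice and style are controlled"],
}

def mark(satisfied: set) -> int:
    """Return the highest level whose descriptor, and every one below it, is met."""
    awarded = 0
    for level in sorted(DESCRIPTORS):
        if all(d in satisfied for d in DESCRIPTORS[level]):
            awarded = level
        else:
            break  # a gap at this level blocks every higher one
    return awarded

print(mark({"ideas are present", "ideas are organized"}))  # -> 2
```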

The BLP uses a set of 10-point scales: 10 for Plot, 10 for Prose/Style, 10 for Originality and 10 for Character Development, for a total of 40 marks. They then average the four marks to give an overall mark out of 10. In the ‘best practice’ evaluation noted above, there would be 10 descriptors in each of the four subsets – 40 in total. An evaluator using the descriptors should then be able to justify any mark they’ve given in each subset. I have no idea if the BLP has developed 40 such descriptors or whether an evaluator can justify their mark in this way.
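Assuming the arithmetic is the straightforward average described above (the contest publishes the four categories and the overall score, not a formula), the calculation for a hypothetical entry would look like this:

```python
# Hypothetical entry: four subscores out of 10, averaged into one overall mark.
subscores = {"Plot": 8, "Prose/Style": 7, "Originality": 9, "Character Development": 8}

total = sum(subscores.values())
overall = total / len(subscores)
print(f"Total: {total}/40, overall: {overall}/10")  # Total: 32/40, overall: 8.0/10
```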

Now you have evaluators, and you have given them the benchmark of expectation and a marking guide with a full set of descriptors. If you remember that your evaluator team has people with different ideas of how they should mark, you know that the next step is to minimize marking differences due to attitudes. The government attempts to do this by training the evaluators in a group setting. The entire evaluating group starts by examining the descriptors and their attributes, then discussing some samples that were previously marked. Then everyone evaluates a few more samples as a group, using the descriptors, and discusses the results with their peers. When the group begins to get reasonably consistent in their marking, they are sent to work.

I have no idea if the BLP engages its evaluators in a training process to ensure consistency in marking and so minimize their natural biases.

The government goes one step further to ensure consistency in marking. Each writing sample is evaluated twice, each time by a different evaluator. The first mark is hidden from the second evaluator. After the second evaluation, the marks are compared. It is considered acceptable if the two evaluations are within one degree of each other. If the two marks are more than one degree apart, the writing is sent to a third evaluator. Example: if the two marks are 3 and 4, that is acceptable. If the two marks are 1 and 3, that is not acceptable and the piece is sent for a third opinion. Double-blind marking adds considerably to the validity of any mark. It helps tone down any individual prejudices that might come from an ‘A’ or a ‘Z’ evaluator – or from someone in the middle having a bad hair day.
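Here’s a small sketch of that re-marking rule, assuming ‘one degree’ simply means the two blind marks differ by at most one point on the scale:

```python
def needs_third_read(first: int, second: int, tolerance: int = 1) -> bool:
    """True when two blind marks disagree by more than the allowed tolerance."""
    return abs(first - second) > tolerance

print(needs_third_read(3, 4))  # False: within one degree, the marks stand
print(needs_third_read(1, 3))  # True: the piece goes to a third evaluator
```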

It doesn’t seem remotely possible to do double-blind marking in a contest like the BLP. Novels typically run between 75,000 and 100,000 words. Depending on how many entries are received, a single evaluator might be hard pressed to read all the books in their category even once through. Indeed, one might wonder if some evaluations are based on reading a small section, balanced with a look at the plot synopsis submitted with the novel.

The other aspect of the BLP that makes the validity of its grading harder to establish is the lack of statistical information. Marks, in isolation, can be rather meaningless, as every mark is based on some kind of comparison. In order to understand a grade, you need to know how everyone else did and where the mark sits in relation to all the others. If your work got a mark of 4/10, you might think it’s terrible, but if there were only twenty other marks, with the highest given a 4.5 and an average of 3 or lower, then the 4 is actually really good. The BLP has a site where you can read evaluations, but only the ones that the authors have agreed to make public: https://booklife.com/prize/5
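Using the hypothetical numbers above (an invented field of twenty other marks averaging about 3 and topping out at 4.5), the comparison is easy to make explicit:

```python
# Invented field of twenty other marks, averaging 3.0 with a best of 4.5.
field = [1.5, 2.0, 2.0, 2.25, 2.5, 2.5, 2.75, 2.75, 3.0, 3.0,
         3.0, 3.0, 3.25, 3.25, 3.5, 3.5, 3.75, 4.0, 4.0, 4.5]
my_mark = 4.0

beaten = sum(m < my_mark for m in field)
print(f"Field average: {sum(field) / len(field):.2f}")         # Field average: 3.00
print(f"A 4/10 beats {beaten} of {len(field)} other entries")  # A 4/10 beats 17 of 20 other entries
```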

It seems logical that those receiving low marks and/or negative comments will not have much motivation to list them publicly on the site. Unfortunately, that also means you can’t know how well you really did in a proper comparative way – you’re left with only the mark itself and some comments, and have to guess at their reliability. The inclusion of written comments is actually an advantage the BLP system has over a bare mark. In the government test, the mark is accompanied by extensive comparative statistics to put it into its proper context. In the absence of such statistics, comments may help give greater context to the mark itself, especially in the absence of published descriptors or group training for the evaluators. Of course the comments should be in harmony with the mark. Terrific comments with a low mark, or critical comments with a high mark, won’t give confidence in either.

An example of this problem arose when I looked at the evaluations listed on their site. Among the five categories, there were far fewer evaluations in the ‘Romance/Erotica’ (R/E) category than in any of the others. In addition, all the other categories were awarding marks in the 8 to 9.5 range, while a 7.25 was the best mark in the R/E category. There are several possible reasons for this anomaly. Perhaps the R/E category is not as popular with authors, so few people enter it. Perhaps there were lots of entries, but most received poor marks and negative comments and so weren’t made public. Perhaps the entries are equally good but the R/E evaluator is an A type on the scale – holding higher standards than the others. Perhaps people writing in this category were less gifted authors and the marks reflected poorer work. Perhaps there is no shared understanding of what a 10 means, with the R/E evaluator reserving 8s and 9s for great literature while the others give 9s to good professional stories. Perhaps the other categories had Z-type evaluators who only like to give high marks. Perhaps it is some combination of these factors – the problem is, you’ll never know for sure.

Another example of this problem can be found in three evaluations I received for my own novels. The one with the lowest mark and most critical comments placed second highest in its category. My novel with the highest mark and most positive comments only came eighth in its category. Which one, then, should I conclude did better?

At the conclusion of the contest, some will be left feeling good about their evaluation and some won’t. It seems logical to conclude that in the BLP contest the marks may not be as defensible as the ones in government assessments – that they are based more on personal feelings and on the experience and prejudices of an individual evaluator. Not that this is an inherently bad thing, even if it makes a mark hard to justify. Unhappy entrants will have to come to their own conclusions. Was it that their writing is… shudder… just not as good as they thought? Did the critic have a bad hair day? Would their novel have received a better mark from a different evaluator, or if it had been put in a different category? In subjective marking, you never know for sure. We’ve all heard of international best-sellers that were repeatedly rejected by the evaluators at publishing companies. C’est la vie, as they say in France. Not that it matters – if you feel the need to write, you just keep writing and strive to improve as best you can, irrespective of what a contest evaluation tells you.
