It’s easy to pick on numerical review scores. In fact, the internet is positively stuffed with examples. Reviewers call them irrelevant, and developers call them arbitrary and meaningless. Both scoff at Metacritic, and developers in particular have damn good reasons to undermine numerical scoring. Even I take the odd potshot and insist on using my own scoring system.
You’ve probably already spotted the disconnect. Everyone seems to hate numerical scores, but nobody stops using them (well, I did, but you can bet that my principles end where a publisher’s paycheck begins). So one of two things must be true: either everyone in the world is now and always has been stupid, or there is some legitimate merit to numerical review scores that nobody bothers to talk about. This looks like a job for sophistry.
Numbers Are Irrelevant
Let’s say that you’re an English major who edits the text in scientific articles for a living. Science writing is a labyrinth of impenetrable jargon. It’s written by people who haven’t taken a writing course since high school. Most of their reading material comes from people who write as badly as they do. This job isn’t editing; it’s the rehabilitation of language.
Now imagine the feeling in your gut when you find out that most scientists don’t actually read articles. They just examine the figures, maybe read the captions, and move on to the next article. This is what you’re doing to game reviewers when you skip their 2,000-word, multi-section article in favor of the number at the top. For bonus points, you can also post an anonymous flame because the writer scored your favorite game an 8 instead of a 9.
This is what reviewers are talking about when they say that numerical scores are irrelevant. They squeeze the nuance out of criticism, and all you get in return is a redundant summary. Review aggregators like Metacritic go one step further by amalgamating individual author scores into a featureless mean.
The problem with this position is that it has absolutely no bearing on whether numerical review scores are useful at all. The unfortunate reality (for my ego) is that the individual games journalist is a blip. A thoughtful, entertaining, and engaging blip to be sure, but only a fraction of gamers have sufficiently overlapping tastes with my own to be usefully guided by my advice. So any number I, or any other reviewer, assign can only account for a portion of the gaming public.
Here’s the secret that each of Metacritic’s 137,000 daily users knows intuitively. As a group, reviewers reflect the opinions of the gaming public better than any individual reviewer can. So unless you want to dredge the review pools until you find the reviewer that’s right for you, an aggregate opinion makes for a passable approximation.
Admittedly, the principles of statistical sampling say it’d be better if the profession didn’t self-select for the kind of zealot who is willing to do the job. However, until we start press-ganging journalists by random sample, we’ll just have to muddle along with the ones we have.
Numbers Are Meaningless
Ask any two players if they enjoyed Spore, and you’ll get three answers. This is partly because games are complex products, but mostly because players game for different reasons. You can say that games are meant to entertain, but if you enjoy competing against human opponents and I enjoy exploration, there isn’t a single objective number that can inform us both if we should buy Street Fighter 4, let alone Katamari Damacy.
There’s also more at stake than a $60 purchase. As recently as a year ago, Microsoft threatened to delist underperforming XBLA games based, in part, on Metacritic numerical scoring. The ensuing internet hubbub practically ensured that delisting would never come to pass, but the fracas made the practical impact of review aggregators visible.
Is it really unfair to use an objective metric like numerical review scores to characterize a subjective product like video games? Just as multiple reviewers reflect the range of gamer opinion better than any one reviewer can, sampling multiple reviewers should better represent the diversity of subjective opinions about a game.
Here, unfortunately, we run into a snag. Consider the simple case of the universally acclaimed Braid. I extracted all the scores indexed by gamerankings.com, which lists Braid’s mean score as about 92%, and broke them down into a histogram. After truncating the graph at 50% (most game reviewers only use the top half of the scale anyway), it looked like this:
![]()
A handful of reviewers pegged Braid at a score of 85% or lower, but the vast majority think it’s a roughly 90% kind of game. Everything is nice and tidy.
Now let’s see what happens with a more divisive game like the 77%-rated Killer7:
![]()
Although the majority of scores cluster in the 75%-95% range, fully a third of reviewers sprinkled their scores between 50% and 75% (stats nerds might like to know that the SD for Braid is 5.8, and the SD for Killer7 is 13.7). So, one group of reviewers thinks that Killer7 is an 80% to 90% kind of game, and the remainder thinks it deserves worse than 75%.
Looking at these histograms makes it obvious that you’ll have a pretty good chance of loving Braid, but a fair risk that you’ll hate Killer7. The distinction isn’t obvious in the naked means, but the problem stems from neither numerical review scores nor their aggregation. The real problem with aggregated and numerical reviews is the way that they’re presented.
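For anyone who wants to try this at home, here is a rough Python sketch of the same kind of breakdown: a mean, a standard deviation, and a histogram truncated at 50%. The two score lists are invented for illustration and are not the GameRankings data for either game; the function names, the 5-point bin width, and the use of the population SD are likewise my assumptions.

```python
from statistics import mean, pstdev


def summarize(scores, lo=50, hi=100, width=5):
    """Return (mean, SD, bins); bins maps each bin's lower edge to a count."""
    bins = {edge: 0 for edge in range(lo, hi, width)}
    for s in scores:
        # Clamp so that scores below `lo` fall into the bottom bin and a
        # perfect `hi` falls into the top bin (the graph is truncated at lo).
        clamped = min(max(s, lo), hi - 1)
        edge = lo + ((clamped - lo) // width) * width
        bins[edge] += 1
    # pstdev is the population SD; use statistics.stdev for the sample SD.
    return mean(scores), pstdev(scores), bins


def render(bins):
    """Draw the bins as a sideways text histogram, highest scores on top."""
    rows = []
    for edge in sorted(bins, reverse=True):
        rows.append(f"{edge:>3}% | {'#' * bins[edge]}")
    return "\n".join(rows)


# Hypothetical integer review scores, made up for this example only.
consensus = [85, 88, 90, 90, 92, 92, 94, 95, 96, 98]
divisive = [55, 60, 65, 70, 80, 85, 85, 90, 90, 90]

for name, scores in (("consensus", consensus), ("divisive", divisive)):
    avg, sd, bins = summarize(scores)
    print(f"{name}: mean {avg:.1f}, SD {sd:.1f}")
    print(render(bins))
```

The point of printing the bars next to the mean is that the spread only becomes visible in the second half of the output; the mean alone hides it.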
2 responses so far ↓
by the way, speaking of aggregators…Prince of Persia movie trailer! http://www.rottentomatoes.com/dor/objects/664420/prince_of_persia_sands_of_time/videos/pop_mov_trl1_110209.html
I see some value in numerical scores, but it depends on how they’re done. I like them as a rule of thumb…they’re never sure fire. But I guess I prefer one sentence summaries of opinion more. If that makes sense.
You know those keen little bars that Amazon.com uses to show you how many people give a product a “5,” and so forth? They’re histograms. Histograms that have been turned on their side not to look scary to the non-math enabled populace.
They do their job. You look at them, and BAM, you have a sense of how people like the product in question. Video games can be polarizing, and seeing the histogram is an intuitive way of communicating concepts like “bimodal distributions.”