Amazon.com’s publishing arms CreateSpace and Kindle Direct Publishing will soon be more selective, a la the traditional publishers in New York.
Using software that evolved from research by a group of computer scientists from Stony Brook University in New York, CreateSpace and KDP will analyze each submission; if the submission doesn’t pass muster in terms of potential for literary or commercial success, then it will be rejected. The author then has the option of revising the manuscript and resubmitting it, or paying a fee to have it published anyway. The lower the score, the higher the fee.
After all, Jeff Bezos and Amazon.com are in the business of making money. So why accept all comers without any sort of screening process? With the software—tentatively dubbed “Thresher” (as in “separating the wheat from the chaff,” although some may see it in terms of the shark)—Amazon can maximize its profitability by monetizing publishing services that it currently offers free of charge.
Thresher will do in seconds what literary agents and acquisition editors at publishing houses take days, weeks and months (if ever) to accomplish. What’s more, Thresher will accomplish this with a significantly higher success rate than agents and editors can ever do—and deliver Amazon.com’s knock-out punch to the traditional publishing industry.
Unbelievable? In part. Because I’m pulling your leg.
While such a scenario is possible, and may turn out to be practical, I just made this up.
Mind you, the computer scientists at Stony Brook and their predictability algorithm do exist. That much is true.
And I did contact Amazon.com and asked about the application of these scientists’ research to screening uploads to CreateSpace and KDP. “It’s news to us,” the Amazon spokesperson said. (Ironic, because the researchers presented their study last October in Seattle.)
The broader question is whether this goal of accurately predicting a book’s success can be achieved by a computer. The Stony Brook researchers, apparently, believe it can; they described their methodology as “surprisingly effective.” Whether the concept has any commercial application or viability remains to be seen. (It turns out that Google, not Amazon, helped fund the research, in association with Project Gutenberg.)
So why am I wasting your time by spreading a false rumor? First, to get your attention; second, to point out that others have misrepresented the report recently published by that group of computer scientists at Stony Brook, making it seem as if the researchers can actually predict whether a book will be successful or not.
Among those reporting on this topic, I point my accusing finger at Matthew Sparkes, Deputy Head of Technology and formerly a reporter on the City desk at The Telegraph newspaper (UK)—someone who should know better.
On January 9, the newspaper published a Sparkes missive titled “Scientists find secret to writing a best-selling novel.”
His opening statement:
Scientists have developed an algorithm which can analyse a book and predict with 84 per cent accuracy whether or not it will be a commercial success.
That title and first sentence grabbed my attention, as I’m sure it captured the eyes of many an aspiring novelist. Wow! All I have to do is follow the Stony Brook Formula and—shazam!—my novel soars to the top of the New York Times best-seller list. I pop a cork and light cigars with $100 bills. I then take a trophy wife, buy a tropical island and live far from the madding crowd.
However, let’s now analyze the analysis. The headline and opening paragraph of Sparkes’ article are not just misleading, they are false. And anyone taking the time to actually read the report on this academic study will see this. You can read the report—“Success with Style: Using Writing Style to Predict the Success of Novels”—online.
(But before you dive in, you may want to familiarize yourself with “unigram,” “bigram,” “stylometry,” “clausal tag” and other terms tossed out by those immersed in the study of computational linguistics and predictability.)
In fact, the researchers qualified their assertion by stating: “. . . achieving accuracy up to 84% in the novel domain . . .” The operative phrase here is “up to.”
When one examines the table of results, one discovers that the 84 percent figure applies to a single subcategory (Adventure, Unigram) out of the 120 categories they tested. In truth, many of their results were little better than the flip of a coin. And in the case of Hemingway’s The Old Man and the Sea, they resorted to a bit of unscientific rationalization to explain away a low score, attributing it to Hemingway’s predilection for short, declarative sentences.
On average, the predictability rate landed closer to 70 percent (not bad, but nothing to bet the ranch on), and in one instance (Historical Fiction, POS) it attained a paltry 47 percent (i.e., less predictive value than a coin toss). The highest across-the-board percentage of any screening component of the study topped out at 73.5 percent, while bottoming out at a lackluster 64.5. (Keep in mind the baseline is 50 percent—the coin toss—not 0 percent.)
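To see why the baseline matters, it helps to ask how much of the gap between pure chance (50 percent) and perfection (100 percent) each figure actually closes. A minimal sketch, using only the percentages cited above (the labels are mine, not the researchers’):

```python
# How much of the gap between a coin toss (50%) and perfect
# prediction (100%) does each cited accuracy figure close?
BASELINE = 50.0

def lift_over_baseline(accuracy_pct):
    """Fraction of the chance-to-perfect gap actually closed."""
    return (accuracy_pct - BASELINE) / (100.0 - BASELINE)

figures = [
    ("Best subcategory (Adventure, Unigram)", 84.0),
    ("Typical result", 70.0),
    ("Best screening component overall", 73.5),
    ("Worst screening component overall", 64.5),
    ("Historical Fiction, POS", 47.0),
]

for label, acc in figures:
    # A negative value means worse than flipping a coin.
    print(f"{label}: {acc}% accuracy -> {lift_over_baseline(acc):+.0%} of the gap closed")
```

Even the headline 84 percent closes only about two-thirds of the distance from a coin toss to certainty, and the 47 percent result lands on the wrong side of the coin entirely.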
Casino bosses in Vegas would LOL at these odds.
The authors of the report also differentiated between literary success and commercial success. They acknowledged that literary success does not necessarily mean “commercial success” and even a mediocre or poorly written book may still become popular (witness 50 Shades of Grey and The Lost Symbol) and achieve commercial success.
The study analyzed 800 books, from classic literary works and best-selling novels to some of the worst-selling books available from Amazon.com (unbeknownst, apparently, to Amazon.com). It also evaluated some movie scripts.
“To the best of our knowledge, our work is the first that provides quantitative insights into the connection between the writing style and the success of literary works,” the researchers say. They also note that the study provides “insights into lexical, syntactic, and discourse patterns that characterize the writing styles commonly shared among the successful literature,” while acknowledging that “some elements of successful styles are genre-dependent.”
Thus, the researchers concluded that “. . . deep syntactic features expressed in terms of different encodings of production rules consistently yield good performance across almost all genres.”
Sparkes translated that into layman’s terms: “They found several trends that were often found in successful books, including heavy use of conjunctions such as ‘and’ and ‘but’ and large numbers of nouns and adjectives.”
Seriously? Aren’t conjunctions, nouns and adjectives characteristics of every book? Hemingway excepted, of course, when it comes to conjunctions.
“Less successful work tended to include more verbs and adverbs,” Sparkes wrote.
Did Mr. Sparkes actually think about what he’d written? He didn’t see any point in questioning such nonsense? Mind you, most discerning critics agree that frequent use of adverbs exemplifies lower-quality writing. But verbs?
So, if I understand this correctly, successful writers use tons of conjunctions (typical of a run-on sentence—think Faulkner), and tons of nouns (generally, a critical component of a complete sentence) and adjectives (a preponderance of which is also known pejoratively as “purple prose”). Meanwhile, unsuccessful writers use a lot of verbs (another critical component of a complete sentence). Imagine that. Ever try writing a sentence, let alone a book, without a verb?
So . . . what, if any, useful information can we glean from this study? (Drum roll, please! Aspiring writers pay attention.)
Less successful books used such unigrams as:
- Negative terms: never, risk, worse, slaves, hard, murdered, bruised, heavy, prison
- Body Parts: face, arm, body, skins
- Location: room, beach, bay, hills, avenue, boat, door
- Emotional/Action Verbs: want, went, took, promise, cry, shout, jump, glare, urge
- Extreme Words: never, very, breathless, sacred, slightest, absolutely, perfectly
- Love Related: desires, affairs
More successful books used such unigrams as:
- Negation: not
- Report/Quote: said, words, says
- Self Reference: I, me, my
- Connectives: and, which, though, that, as, after, but, where, what, whom, since, whenever
- Prepositions: up, into, out, after, in, within
- Thinking Verbs: recognized, remembered
Good luck with that. (And forget about writing romance novels.)
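For the morbidly curious, here is what “using” these word lists would amount to in practice. This is a toy illustration of my own, not the researchers’ actual classifier (they trained statistical models over thousands of features, not two hand-picked lists); the abbreviated word sets below are sampled from the lists above:

```python
# Toy illustration only -- NOT the Stony Brook classifier.
# Count how many words of a sentence land in each unigram list above.
LESS_SUCCESSFUL = {"never", "risk", "worse", "face", "room", "want",
                   "went", "took", "very", "desires"}
MORE_SUCCESSFUL = {"not", "said", "i", "me", "my", "and", "which",
                   "that", "but", "up", "into", "remembered"}

def tally(sentence):
    """Return (hits in the 'successful' list, hits in the 'less successful' list)."""
    words = sentence.lower().split()
    return (sum(w in MORE_SUCCESSFUL for w in words),
            sum(w in LESS_SUCCESSFUL for w in words))

good, bad = tally("I remembered that he said not to go")
print(good, bad)  # 5 "successful" words, 0 "less successful" ones
```

By this crude reckoning, a sentence of dull attribution and negation outscores anything with desire, action or a beach in it. Which is rather the point.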
The scientists wrapped up their report with this gem: “In sum, our analysis reveals an intriguing and unexpected observation on the connection between readability and the literary success—that they correlate into the opposite directions. Surely our findings only demonstrate correlation, not to be confused as causation, between readability and literary success. We conjecture that the conceptual complexity of highly successful literary work might require syntactic complexity that goes against readability.”
Translation: A highly successful literary work is unreadable.
Hmmm. That’s news?
My conclusion: In sum, Amazon will not be hiring these guys any time soon.
Footnote 1: Unigram. According to Wikipedia: “In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. An n-gram of size 1 is referred to as a ‘unigram’ . . .” Got that?
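If the Wikipedia definition didn’t land, a few lines of Python make it concrete (the function and sample text are mine, purely for illustration):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous length-n sequences from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the old man and the sea".split()

unigrams = Counter(ngrams(tokens, 1))  # n = 1: single words
bigrams = Counter(ngrams(tokens, 2))   # n = 2: adjacent word pairs

print(unigrams[("the",)])       # "the" occurs twice
print(bigrams[("the", "sea")])  # "the sea" occurs once
```

That is all a unigram is: a word count. The study’s “Unigram” models, in essence, predict success from which words appear and how often.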
Footnote 2: Grammatical irony. Incorrect usage found in the report’s Abstract: the verb “lead” rather than the correct past-tense conjugation “led.” But who needs verbs?
Footnote 3: Prediction. The report cited herein will become a highly successful literary work based on its low readability quotient.