Big Book Data for the Little Author Guy

Drawing a blank while trying to decide what to write about this month, I went back to a list of post ideas I’d started what feels like forever ago in internet terms. I found some notes about a New York Times article, excerpted by The Passive Voice, that talked about Google’s Ngram Viewer. This is cool stuff. Better late than never, right?

I’m going to briefly touch on two different areas in this post: what the Ngram Viewer does, and how it might be useful to an author.

According to Wikipedia: “The Google Ngram Viewer or Google Books Ngram Viewer is an online search engine that charts frequencies of any set of comma-delimited search strings using a yearly count of n-grams found in sources printed between 1500 and 2008.” In other words, the Ngram Viewer takes one or more words or phrases, evaluates how often they were used in books over time, and shows you the result on a graph.

If you’re already wondering why you might care, I can see two specific uses for an author. The first, and most likely, is research related to the content of your book. The second is marketing, specifically deciding on keywords to associate with your book on Amazon or elsewhere. I’ll give some examples later on.

You might have seen the term “big data” bandied about. The term can have a few meanings, but the one I’ve seen most lately is analyzing data about an extremely large number of things to learn something useful. There have been stories lately about big data being used in various ways in the US presidential election last year, for example. The Google Ngram Viewer is a big data tool that, while seeming simple at first glance, requires an immense amount of computing resources to support what it provides.

For example, this chart.

[Image: Google Ngram Viewer chart]

What does this tell us? A few things. One is that, at least in books, the term “traditionally published” has only come into use recently and, relatively speaking, hasn’t been used that much. “Self-publishing,” at least in the set of books Google had available to evaluate, started getting used more often before “traditionally published” did and has gained in popularity. “Indie” has been used a lot over the years, with multiple small peaks and valleys, until near the end of the period evaluated, where it became extremely popular.

Some of you are hopefully jumping up and down, eager to point out that “indie” could refer to a lot of things, not just publishing. All those indie-rock bands, businesses not part of a chain, or low-budget movies made by a small studio are indeed getting counted along with mentions of indie referring to self-publishing and small presses. You caught me. As with anything involving numbers and comparisons, you need to be careful that your comparisons aren’t apples to oranges.

If you want to try fixing this by changing “indie” to “indie publishing,” you’ll get the message I did: that they didn’t find any references to indie publishing in the books they evaluated. This points out one of the downsides to this tool. The books being evaluated (the “corpus” of books, to use their terminology) only go up to 2008. I’d be willing to bet that term has been used in one or two books since then. In the four years since that article came out, the ending date hasn’t changed. I suspect that might be related to the big to-do that resulted from Google letting it be known that they were scanning all the books they could get their hands on, which immediately sent publishers, the Authors Guild, and more than a few authors into a tizzy. Even though Google eventually won the case, it might have impacted their scanning program negatively.

I’ll leave it to you to play around with the various options and different capabilities beyond the very basic. Google has a page that can get you started.
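For the programmers in the audience, here is a rough sketch in Python of the kind of counting that sits behind a chart like this. It is not Google’s code. The tiny corpus, the `ngrams` helper, and the `yearly_frequency` function are all made up for illustration, and dividing by the total number of same-length phrases in each year is my assumption about how the percentages are worked out. It does show why a search for “indie” sweeps up bands and films, while “indie publishing” matches far less.

```python
from collections import Counter

# Hypothetical toy corpus: (year, text) pairs standing in for scanned books.
TOY_CORPUS = [
    (2005, "the indie band played an indie film festival"),
    (2005, "self-publishing was rare"),
    (2008, "indie publishing and indie rock both grew"),
]


def ngrams(words, n):
    """Return every run of n consecutive words, joined with spaces."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def yearly_frequency(corpus, phrase):
    """Share of each year's n-grams that match the phrase exactly.

    Assumption: frequency = matches / total n-grams of the same length
    in that year. The real tool's processing is far more involved.
    """
    n = len(phrase.split())
    hits = Counter()
    totals = Counter()
    for year, text in corpus:
        grams = ngrams(text.lower().split(), n)
        totals[year] += len(grams)
        hits[year] += grams.count(phrase.lower())
    return {year: hits[year] / totals[year]
            for year in sorted(totals) if totals[year]}


if __name__ == "__main__":
    # "indie" counts every sense of the word; "indie publishing" is narrower.
    print(yearly_frequency(TOY_CORPUS, "indie"))
    print(yearly_frequency(TOY_CORPUS, "indie publishing"))
```

In this toy data, “indie” shows up in both years because the band and the film count too, while “indie publishing” appears only in 2008. That is the apples-to-oranges problem in miniature.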

It seems to me the uses authors might put this tool to are obvious, but I’ll point out a few examples that will hopefully trigger additional ideas that fit your specific needs. As I said above, the main reason an author would use this tool is research. For example, if your book takes place at a specific point in time and you’re wondering what phrase people were using to say “that’s really neat-o” back then, you might compare some of the possible phrases.

[Image: Google Ngram Viewer chart]

Personally, I think it’s really rad that “far out” has been making a comeback.

One of the commenters at The Passive Voice suggested that playing with this might be useful for coming up with, or at least comparing, different possibilities for keywords to use on Amazon. That’s the other area where I think authors can find some value in this tool: marketing. Keywords are one possibility. Maybe it could also help you decide what to write about. The only question here is whether this is telling us which term is the most popular with readers, or only which was the most popular with authors and publishers.

[Image: Google Ngram Viewer chart]

I’ll let you ponder that question on your own.

I’ll bet all of you smart and creative people can think of lots of other ideas where this tool could be put to good use. Tell us all in the comments.

Author: Big Al

Big Al (who insists he only has one name, like Cher, Sting, and Madonna) spends his days writing computer programs that are full of typos, homonym errors, and incorrect verb usage. During his evenings, he writes reviews of indie books for BigAl’s Books and Pals and has recently taken over The IndieView, a website founded by indie author Simon Royle as a resource for indie authors, indie reviewers, and those who read either.

7 thoughts on “Big Book Data for the Little Author Guy”

  1. I used Google Ngram Viewer extensively in writing my latest novel, Harry Seven. The two main characters are an educated man from the contemporary US, and an upper-class woman from 1940s Britain (it’s a time-travel romance). Several of the minor characters are also from the 1940s, some American, some British, the latter not all upper-class.

    As an Aussie but long-time resident of the US, I can do both flavors/flavours of 21st century English, but 1940s British and American English were a whole different ball game, and I wanted to get the dialog right.

  2. Like any source, it has its uses and its limitations. For example, it lists “okay” as starting about 1940. Other sources list “OK” as going back to 1840. But Ngram doesn’t list “OK.”
    Second qualm: it isn’t what’s correct, it’s what people think is correct. So you don’t use “OK” in an 1880 Historical Fiction, because readers will stop dead and say, “Isn’t that a modern term?”
    But thanks for another site to while away the hours when I should be working 🙂
