nancylebov (nancylebov) wrote,

A whole lot of books...

Ngrams are a great toy, but I can't help wondering about possible systemic biases in which books form the initial pool.

I'm guessing that a book is more likely to end up in Google Books if modern people like it. An old book is more likely to have been found by a modern person if it was popular.

It's not modern people in general, though, or even literate modern people. There's probably a bias towards geeky modern people.

There will be a bias towards books that aren't in copyright.

Quality of OCR is a factor, and I have no idea how that plays out. Might some typefaces be problematic? If so, there could be cultural/temporal effects.

Each book presumably only shows up once, which means that the effects of popularity only show up indirectly.
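
To make the once-per-book point concrete, here's a rough sketch with a made-up three-book corpus. The titles, circulation numbers, and texts are all invented for illustration, and I'm only assuming (not asserting) that the real corpus counts each title once. It compares word shares when every title counts once against word shares weighted by how many copies were in circulation.

```python
from collections import Counter

# Hypothetical toy corpus: (title, copies in circulation, text).
# Every value here is invented for illustration.
books = [
    ("Bestseller", 1000, "the whale the sea the whale"),
    ("Obscure A", 10, "the garden the tea"),
    ("Obscure B", 10, "the garden the rain"),
]

def word_shares(books, weight_by_copies):
    """Return each word's share of all word occurrences in the corpus."""
    counts = Counter()
    for _title, copies, text in books:
        # Weight 1 means the book counts once, however widely it was read.
        weight = copies if weight_by_copies else 1
        for word in text.split():
            counts[word] += weight
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

once_per_title = word_shares(books, weight_by_copies=False)
by_copies = word_shares(books, weight_by_copies=True)

for word in ("whale", "garden"):
    print(f"{word}: once per title {once_per_title[word]:.3f}, "
          f"by copies {by_copies[word]:.3f}")
```

Counted once per title, "whale" and "garden" come out with the same share, even though the book containing "whale" was (in this invented data) read a hundred times as widely. Popularity would only leak in indirectly, for instance if popular books get imitated or quoted by other books.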

Anything else?

This entry was posted at http://nancylebov.dreamwidth.org/450669.html. Comments are welcome here or there.