nancylebov (nancylebov) wrote,

A whole lot of books.....

Ngrams are a great toy, but I can't help wondering about possible systemic biases in which books form the initial pool.

I'm guessing that a book is more likely to end up in Google Books if modern people like it. If the book is old, it was more likely to be found by a modern person if it was popular.

It's not modern people in general, though, or even literate modern people. There's probably a bias towards geeky modern people.

There will be a bias towards books that aren't in copyright.

Quality of OCR is a factor, and I have no idea how that plays out. Might some typefaces be problematic? If so, there could be cultural/temporal effects.

Each book presumably only shows up once, which means that the effects of popularity only show up indirectly.
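To make that last point concrete, here's a toy sketch. Everything in it is made up for illustration: the two titles, the copy counts, the word lists, and the `word_share` helper are all hypothetical, and I don't know how the real corpus is actually counted. It just compares a word's share of the text when every title counts once against its share when each title is weighted by how many copies were in circulation.

```python
from collections import Counter

# Hypothetical toy data: (title, copies in circulation, words in the book).
# None of these numbers come from the real corpus.
books = [
    ("Obscure Treatise", 100, ["whilst", "the", "reader"]),
    ("Runaway Bestseller", 100_000, ["the", "reader", "the", "end"]),
]


def word_share(books, word, weight_by_copies=False):
    """Return the fraction of all word tokens that are `word`.

    With weight_by_copies=False every title counts once (my guess at
    what an ngram corpus does). With True, each title's words are
    multiplied by its number of copies, as a crude stand-in for
    what people were actually reading.
    """
    counts = Counter()
    for _title, copies, words in books:
        weight = copies if weight_by_copies else 1
        for w in words:
            counts[w] += weight
    total = sum(counts.values())
    return counts[word] / total


once = word_share(books, "whilst")
weighted = word_share(books, "whilst", weight_by_copies=True)
print(f"each title once:    {once:.6f}")
print(f"weighted by copies: {weighted:.6f}")
```

In this made-up case, a word that appears only in the obscure book looks far more common when each title counts once than it would if readership were taken into account.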

Anything else?
