Input Junkie
December 19th, 2010, 11:04 am

A whole lot of books.....
Ngrams are a great toy, but I can't help wondering about possible systemic biases in which books form the initial pool.

I'm guessing that a book is more likely to end up in Google Books if modern people like it. If a book is old, it was more likely to be found by a modern person if it was popular.

It's not modern people in general, though, or even literate modern people. There's probably a bias towards geeky modern people.

There will be a bias towards books that aren't in copyright.

Quality of OCR is a factor, and I have no idea how that plays out. Might some typefaces be problematic? If so, there could be cultural/temporal effects.

Each book presumably only shows up once, which means that the effects of popularity only show up indirectly.

Anything else?

This entry was posted at http://nancylebov.dreamwidth.org/450669.html. Comments are welcome here or there.


Comments
 
From: madfilkentist
Date: December 19th, 2010 04:40 pm (UTC)
Much of the Google Books collection is from academic libraries, so I don't think the bias is quite that "modern people like it," but that a book is considered a "serious" book.

Typefaces are definitely a problem. Look at the graph for filk.
From: nancylebov
Date: December 19th, 2010 04:46 pm (UTC)
What would it be an OCRo (an OCR-induced typo) for?
From: thnidu
Date: December 19th, 2010 06:08 pm (UTC)
"Silk", with a long "s": ſilk. madfilkentist sent me a funny on that.

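To make the long-s point concrete, here's a toy Python sketch (not Google's actual OCR pipeline, and the sample phrase is made up) of how the long s turns "ſilk" into "filk":

# Toy illustration of the long-s problem: in pre-1800 typefaces the long s
# "ſ" looks like an "f", so OCR tends to read "ſilk" as "filk", inflating
# early counts of "filk".

def ocr_long_s_error(text: str) -> str:
    """Simulate the common OCR confusion: long s misread as f."""
    return text.replace("ſ", "f")

def normalize_long_s(text: str) -> str:
    """What a cleanup pass would do instead: long s is just an s."""
    return text.replace("ſ", "s")

original = "She ſpun the fineſt ſilk"
print(ocr_long_s_error(original))   # She fpun the fineft filk  <- spurious "filk"
print(normalize_long_s(original))   # She spun the finest silk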
From: richardthinks
Date: December 19th, 2010 05:25 pm (UTC)
"Each book presumably only shows up once"
I wouldn't be so sure of this. My (admittedly not well-informed) sense is that the metadata are bad enough that some books might well have been scanned twice or more, if only in different editions.

It's a big project, multiple institutions are involved, and I doubt anyone has been super-concerned about preventing multiple scans of individual titles. My guess is that the total picture is even messier than current pessimistic estimates.
From: madfilkentist
Date: December 19th, 2010 07:50 pm (UTC)
The general problem of identifying books uniquely is a very difficult one. Books are re-issued with different titles, or with the same title under a different version of the author's name. Multiple books may be combined into a single volume, or a one-volume monster broken down into several. An apparently new book may consist almost entirely of material from previous books.
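To illustrate why this is hard, here's a minimal Python sketch (the key scheme is hypothetical, not any catalog's actual method) showing how the obvious "normalized title plus author" dedup key misses re-titled re-issues:

# A naive dedup key: lowercase, strip punctuation, join title and author.
def naive_key(title: str, author: str) -> str:
    clean = lambda s: "".join(ch for ch in s.lower() if ch.isalnum())
    return clean(title) + "|" + clean(author)

# The same Bester novel was issued under both of these titles, so it gets
# two different keys and would be counted twice:
print(naive_key("The Stars My Destination", "Alfred Bester"))
# -> thestarsmydestination|alfredbester
print(naive_key("Tiger! Tiger!", "Alfred Bester"))
# -> tigertiger|alfredbester

And a key like this says nothing about combined volumes, split volumes, or "new" books recycled from old material.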
From: nancylebov
Date: December 19th, 2010 09:19 pm (UTC)
And a new edition of a book may be more or less revised or expanded.
From: bemused_leftist
Date: December 20th, 2010 12:00 am (UTC)
All this would introduce the factor of popularity, as a popular book would be more likely to go through several editions, have portions anthologized, etc.
From: (Anonymous)
Date: December 22nd, 2010 01:51 am (UTC)
Yes, the Google Books metadata are bad, so there are many duplicates, but the ngrams site uses a smaller set with much more effort put toward better metadata.
From: nancylebov
Date: December 22nd, 2010 11:25 am (UTC)
thnidu commented in some detail on my previous post: the metadata on the earlier books in the ngrams is horrendously bad.
From: thnidu
Date: December 19th, 2010 06:09 pm (UTC)
Also see my long comment on your previous post.