
Google Print versus Amazon's Search Inside the Book

Posted: Fri Oct 08, 2004 7:24 pm
by Jonathan
Win for Google: don't need to register to use their service.

Win for Amazon: no stupid hacks to keep you from saving the image.

Loss for both: both use cheesy UI tricks to keep you from reading a whole book straight through.

Another loss for both: images are nice, but clearly you have OCRed these pages to let me search them, so how about just giving me the text as ASCII (or UTF-8)?

My conclusion: nice try, guys. Keep it coming.

Posted: Fri Oct 08, 2004 9:49 pm
by Jonathan
Thank you for using Google Print.

You have either reached a page that is unavailable for viewing or reached your viewing limit for this book.

Google protects works that are under copyright by restricting access to certain pages and restricting the number of pages you can view. You may continue to take advantage of Google Print by clicking on About this Book. Thank you for using Google Print.
I knew Amazon had this "feature". Now I know Google does too. I viewed about 30 pages of a Dave Barry book using Google Print before I got this message. I could probably delete my cookies to continue, but that's an experiment for another time.

Posted: Fri Oct 08, 2004 9:57 pm
by Peijen
Stop trying to steal other people's hard work, thief.

Posted: Fri Oct 08, 2004 9:58 pm
by Peijen
couldn't you just write a wget script to download all the pages?

Posted: Fri Oct 08, 2004 10:02 pm
by Jonathan
They block wget. Presumably by the user agent string?

Also, there's no good way to discover all the image names. You'll hit the viewing limit for the book pretty quickly, regardless of the access method.

Granted, they have no defense against a slow, distributed attack. A thousand computers working in concert over a long time could probably suck them dry. Sounds like a distributed computing project!

Posted: Fri Oct 08, 2004 10:19 pm
by Jonathan
http://weblogs.mozillazine.org/gerv/

Reasonably good analysis of their protections.

I think a custom wget that pretends to be Firefox (or IE, or whatever) and offers a faked cookie might be able to programmatically download the images.

Now, looking at my Google cookie, I see it has an ID string. I have no idea how easy or difficult it is to produce a bunch of valid ID strings for your cookies. I wonder if you could generate them just by visiting Google with no cookies...

Posted: Fri Oct 08, 2004 10:36 pm
by VLSmooth
Dwindlehop wrote: I think a custom wget that pretends to be Firefox (or IE, or whatever) and offers a faked cookie might be able to programmatically download the images.
GetRight and similar products have been doing this for a long time. Great for automatically downloading screenshots / photo albums.

Posted: Fri Oct 08, 2004 10:37 pm
by Peijen
I think you might be better off getting access to their database and stealing the book from there.

Posted: Tue Oct 12, 2004 11:50 pm
by bob
Dwindlehop wrote: They block wget. Presumably by the user agent string?
wget --user-agent=MSIE/5.0
or
wget -U MSIE/5.0
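
For reference, here is a rough sketch of how the user-agent flag above could be combined with the cookie idea from earlier in the thread. It only prints the wget command line as a dry run and downloads nothing. The function name `build_wget_cmd`, the example.com URL, and the `ID=placeholder` cookie are all made up for illustration; whether the site actually keys on the user agent or on that cookie is still a guess.

```shell
# Dry-run sketch: print the wget command that would send a browser-like
# User-Agent plus a Cookie header. Nothing is fetched.
# Arguments: $1 = URL, $2 = user-agent string, $3 = cookie (name=value).
# All values used below are placeholders, not real Google Print values.
build_wget_cmd() {
    printf 'wget --user-agent="%s" --header="Cookie: %s" "%s"\n' "$2" "$3" "$1"
}

build_wget_cmd "http://example.com/page.png" "Mozilla/5.0" "ID=placeholder"
```

`--user-agent` is the long form of `-U`, and `--header` adds an arbitrary request header, which is one way to hand wget a cookie without a cookie file.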