Poster: Branko Collin Date: Sep 27, 2007 6:30am

Forum: texts Subject: Re: API?

I don't work for or represent the Internet Archive, so please do not take the following as gospel. I figured I would try and help you, because TIA people often reply late to questions, and sometimes not at all.

re: ISBN: In February of this year, the vast majority of books in the Toronto and Americana collections had been published before 1920. ISBN dates from 1966. You do the math: most of these books predate it and will not have one.

re: scraping. A quick look through the FAQ does not tell me how TIA would prefer you to minimize traffic. However, the fact that they discuss wget and offer RSS feeds of the most recent items suggests that scraping is indeed the way to go. (If that means what I think it means.)

I am not 100% sure about this, but it would seem that all items get a unique identifier. The item can then be found at http://www.archive.org/details/identifier.
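
A minimal sketch of that URL pattern in Python, assuming the pattern is exactly as described above. The helper name `details_url` is hypothetical, and the sample identifier is the one quoted elsewhere in this thread.

```python
from urllib.parse import quote

# URL pattern as quoted in this thread.
DETAILS_BASE = "http://www.archive.org/details/"


def details_url(identifier: str) -> str:
    """Build the item page URL for an identifier (hypothetical helper)."""
    identifier = identifier.strip()
    if not identifier:
        raise ValueError("identifier must not be empty")
    # Percent-encode anything unusual; plain alphanumeric identifiers
    # pass through unchanged.
    return DETAILS_BASE + quote(identifier, safe="")


print(details_url("mindcure00larsrich"))
```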

Poster: jrochkind Date: Oct 17, 2007 5:47am

Forum: texts Subject: Re: API?

Someone from the archive contacted me over email offering to discuss this further, but I'm afraid I lost their email address. Perhaps they will see this again and contact me again?

Poster: EmilPer Date: Oct 19, 2007 11:57pm

Forum: texts Subject: Re: API?

In case you had the conversation with "someone from the archive", would you care to share, or did they bring a lawyer with them :-D ?

There is a sort of API for searching: the search uses Lucene, so the rules for building the query string are in the open, and the results page is easy to parse.

Once you have identified the strings that uniquely identify an item, it's very easy to get the XML with the list of files, and after that the full text or the page images: see http://www.archive.org/about/faqs.php#140 .

Poster: AnnaN Date: May 13, 2009 10:30am

Forum: texts Subject: Re: API?

Archive.org encourages the use of the JSON API:
http://www.archive.org/help/
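
As a rough illustration of consuming JSON search results in Python: the payload below is a made-up sample, and the `response`/`docs` layout is an assumption about the result format, so verify it against the help pages. `extract_identifiers` is a hypothetical helper name.

```python
import json

# Illustrative payload only; the "response" -> "docs" layout is an assumption.
SAMPLE = '{"response": {"numFound": 1, "docs": [{"identifier": "mindcure00larsrich"}]}}'


def extract_identifiers(payload: str) -> list[str]:
    """Return the identifier of every result document in a JSON payload."""
    data = json.loads(payload)
    docs = data.get("response", {}).get("docs", [])
    # Skip any document that lacks an identifier instead of failing.
    return [doc["identifier"] for doc in docs if "identifier" in doc]


print(extract_identifiers(SAMPLE))
```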

Poster: EmilPer Date: May 13, 2009 10:46am

Forum: texts Subject: Re: API?

Thank you, I had not checked the docs recently.

Was there any change in the TOS, too? One that says what can and what cannot be done with the books in the archive?

Poster: marcus lucero Date: Oct 12, 2007 5:11pm

Forum: texts Subject: Re: API?

As Collin said above, each item in the texts archives does in fact receive a unique identifier, which in turn creates a persistent URL. What you can do, and I am not an engineer, is point to that unique URL.

http://www.archive.org/details/itemid

(e.g. http://www.archive.org/details/mindcure00larsrich which will always stay the same and never be replaced by other files)

Others have "scraped" our database from outside but have never really shared their techniques.

Marcus

Poster: EmilPer Date: Oct 20, 2007 12:51am

Forum: texts Subject: Re: API?

"Others ... have never really shared their techniques."

This could be because it's not that difficult to scrape the public domain books archive, and as a consequence there is not much code or technique to share.

It could also be because Archive.org does not say clearly what they allow and what they do not allow. For example, Project Gutenberg says clearly what can be done with their content, so there are many PG readers out there that read their text databases, process the book text, split it into pages, reformat etc. Archive.org does not, or not in a place that's easy to find, state what can be done and what cannot be done with the texts they host.

"Access to the Archive’s Collections is provided at no cost to you and is granted for scholarship and research purposes only." is very ambiguous. Would anyone spend a few hundred man-hours writing code to search, download MARC/DC/meta files, get the full-text files, index the text, cross-reference it, identify correlations, generate new searches etc., and then share the code, only to find out that "no, that's not allowed"? Most likely s/he would leech as much as possible, share only the code that does the leeching, and then claim s/he is writing a better spelling checker and needs raw data.

Ambiguous "Terms of Use" and "questions or comments regarding these terms ... at info@archive.org" means "don't bother unless you can afford to pay a lawyer full time".
