Poster: Branko Collin Date: Sep 27, 2007 6:30am
Forum: texts Subject: Re: API?

I don't work for or represent the Internet Archive, so please do not take the following as gospel. I figured I would try to help you, because TIA (The Internet Archive) people often reply late to questions, and sometimes not at all.

re: ISBN: as of February this year, the vast majority of books in the Toronto and Americana collections were published before 1920. ISBN only dates from the late 1960s, so most of those books will not have one.

re: scraping. A quick look through the FAQ does not tell me how TIA would prefer you to minimize traffic. However, the fact that they discuss wget and offer RSS feeds of the most recent items suggests that scraping is indeed the way to go. (If that means what I think it means.)

I am not 100% sure about this, but it would seem that all items get a unique identifier. The item can then be found at http://www.archive.org/details/identifier.
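A minimal sketch of that URL scheme, assuming the identifier is simply appended to the details path as described above. The function name `details_url` is hypothetical, and the sample identifier is one quoted elsewhere in this thread.

```python
from urllib.parse import quote

# Base path taken from the post above; everything else here is illustrative.
DETAILS_BASE = "http://www.archive.org/details/"

def details_url(identifier: str) -> str:
    """Return the details-page URL for an item identifier (hypothetical helper)."""
    identifier = identifier.strip()
    if not identifier:
        raise ValueError("identifier must not be empty")
    # Percent-encode anything unusual, including slashes, so the
    # identifier stays a single path segment.
    return DETAILS_BASE + quote(identifier, safe="")

print(details_url("mindcure00larsrich"))
```

The helper only builds the address; it does not check that the item actually exists.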

Poster: jrochkind Date: Oct 17, 2007 5:47am
Forum: texts Subject: Re: API?

Someone from the archive contacted me over email offering to discuss this further, but I'm afraid I lost their email address. Perhaps they will see this again and contact me again?

Poster: EmilPer Date: Oct 19, 2007 11:57pm
Forum: texts Subject: Re: API?

In case you had the conversation with "someone from the archive", would you care to share, or did they bring a lawyer with them :-D ?

There is a sort of API for searching: the search uses Lucene, so the rules for building the query string are in the open, and the results page is easy to parse.
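As a rough illustration of building such a query string, here is a sketch that joins `field:value` terms in Lucene syntax and URL-encodes the result. The field names, the `q` parameter name, and the base URL are all assumptions made for the example, not something documented by the Archive in this thread.

```python
from urllib.parse import urlencode

def lucene_term(field: str, value: str) -> str:
    """Format one field:value term, quoting values that contain whitespace."""
    if any(ch.isspace() for ch in value):
        # Escape backslashes and double quotes inside a quoted phrase.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'{field}:"{escaped}"'
    return f"{field}:{value}"

def lucene_query(terms: dict) -> str:
    """Join the terms with AND, in the order they were given."""
    return " AND ".join(lucene_term(f, v) for f, v in terms.items())

def search_url(base_url: str, terms: dict) -> str:
    """Append the encoded query as a 'q' parameter (parameter name assumed)."""
    return base_url + "?" + urlencode({"q": lucene_query(terms)})

# Hypothetical field names and a placeholder base URL.
print(search_url("http://example.org/search",
                 {"title": "mind cure", "mediatype": "texts"}))
```

The base URL is kept as a parameter so that the real search address can be substituted once it has been confirmed.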

Once you have identified the string that uniquely identifies an item, it's very easy to get the XML with the list of files, and after that the full text or the page images: see http://www.archive.org/about/faqs.php#140 .
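For what it's worth, reading such a file list could look roughly like the sketch below. The XML layout (a `files` root with `file` elements carrying a `name` attribute and a `format` child) and the sample file names are assumptions for illustration, so check them against a real item before relying on this.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the assumed layout; not copied from a real item.
SAMPLE = """<files>
  <file name="example_djvu.txt"><format>DjVuTXT</format></file>
  <file name="example.pdf"><format>Text PDF</format></file>
</files>"""

def list_files(xml_text: str) -> list:
    """Return (name, format) pairs for every <file> element in the document."""
    root = ET.fromstring(xml_text)
    return [
        (f.get("name"), f.findtext("format", default=""))
        for f in root.iter("file")
    ]

for name, fmt in list_files(SAMPLE):
    print(name, "-", fmt)
```

From the list you would then pick the entry whose format matches what you want, for example the full text, and download that file by name.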

Poster: AnnaN Date: May 13, 2009 10:30am
Forum: texts Subject: Re: API?

Archive.org encourages the use of the JSON API:
http://www.archive.org/help/

Poster: EmilPer Date: May 13, 2009 10:46am
Forum: texts Subject: Re: API?

Thank you, I had not checked the docs recently.

Was there any change in the TOS, too, to say what can and cannot be done with the books in the archive?

Poster: marcus lucero Date: Oct 12, 2007 5:11pm
Forum: texts Subject: Re: API?

As Collin said above, each item in the texts archives does in fact receive a unique identifier, which in turn creates a persistent URL. What you can do, and I am not an engineer, is point to that unique URL.

http://www.archive.org/details/itemid

(e.g. http://www.archive.org/details/mindcure00larsrich which will always stay the same and never be replaced by other files)

Others have "scraped" our database from outside but have never really shared their techniques.

Marcus



Poster: EmilPer Date: Oct 20, 2007 12:51am
Forum: texts Subject: Re: API?

"Others ... have never really shared their techniques."

This could be because it's not that difficult to scrape the public domain books archive, and consequently there is not much code or technique to share.

It could also be because Archive.org does not say clearly what they allow and what they do not allow. For example, Project Gutenberg says clearly what can be done with their content, so there are many PG readers out there that read their text databases, process the book text, split it into pages, reformat etc. Archive.org does not, or not in a place that's easy to find, state what can be done and what cannot be done with the texts they host.

"Access to the Archive’s Collections is provided at no cost to you and is granted for scholarship and research purposes only." is very ambiguous. Would anyone spend a few hundred man-hours writing code to search, download MARC/DC/meta files, get the full-text files, index the text, cross-reference it, identify correlations, generate new searches, etc., and then share the code, only to find out that "no, that's not allowed"? Most likely s/he would leech as much as possible, share only the code that does the leeching, and then claim s/he is writing a better spelling checker and needs raw data.

Ambiguous "Terms of Use" plus "questions or comments regarding these terms ... at info@archive.org" means "don't bother unless you can afford to pay a lawyer full time".