Poster: FamousLongAgo Date: Jul 14, 2003 2:00am
Forum: researchproposals Subject: Research Proposal from FamousLongAgo

Which collection would you like to work with: fullweb
Name: Maciej Ceglowski
Organization: NITLE
Email: maciej@ceglowski.com
Project name: NITLE Blog Census
Abstract: The blog census (http://www.blogcensus.net) is an attempt to identify and archive all weblogs on the Net. Our list currently holds 614K blogs, and we take a full snapshot of our database (including the HTML for every blog in the census) roughly every twelve days. The crawl has been active since May 2003, with the first snapshot taken June 28. If this material is of interest to the Internet Archive, we would like to donate it on an ongoing basis. Donating it would also ensure that the data is not lost (we lost our first snapshot, from June 10, to a weird RAID 5 error).
Description: The full database snapshot is about 3 GB compressed (12 GB uncompressed). Each record includes:

* crawl date
* language (identified from content)
* blogging tool used
* URL
* full HTML
* outbound links

Since the number of blogs is growing rapidly, a conservative estimate is 12 GB/month of compressed data.
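As a rough check on that figure, here is a minimal sketch of the arithmetic using only the numbers quoted in this post (about 3 GB compressed per snapshot, one snapshot roughly every twelve days). The 30-day month and the function name are assumptions made for illustration, and the post does not say how the 12 GB/month estimate was actually derived.

```python
# Figures quoted in the post.
SNAPSHOT_GB = 3.0            # compressed size of one full snapshot
SNAPSHOT_INTERVAL_DAYS = 12  # "every twelve days or so"
ESTIMATE_GB_PER_MONTH = 12.0 # the stated conservative estimate

DAYS_PER_MONTH = 30          # assumption: a 30-day month


def monthly_compressed_gb(snapshot_gb, interval_days, days_per_month=DAYS_PER_MONTH):
    """Compressed GB per month if every snapshot stays at its current size."""
    return snapshot_gb * days_per_month / interval_days


baseline = monthly_compressed_gb(SNAPSHOT_GB, SNAPSHOT_INTERVAL_DAYS)
headroom = ESTIMATE_GB_PER_MONTH - baseline

print(f"At current size: {baseline:.1f} GB/month")
print(f"Room for growth within the estimate: {headroom:.1f} GB/month")
```

With no growth at all, full snapshots come to about 7.5 GB/month under these assumptions, so the 12 GB/month figure leaves some margin for the census getting larger.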
We're working on storing diffs between snapshots, or some other way of reducing the storage requirements.
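One way the diff idea could look, as a sketch only: keep the first crawl of each blog in full, then for later crawls store nothing if the HTML is unchanged and a unified diff otherwise. This uses Python's standard `difflib` and is not the census's actual code; the function name `page_delta` and the sample pages are hypothetical.

```python
import difflib


def page_delta(old_html, new_html):
    """Return None if the page is unchanged, else a unified diff as one string."""
    if old_html == new_html:
        return None  # nothing new to store for this blog in this snapshot
    diff_lines = difflib.unified_diff(
        old_html.splitlines(keepends=True),
        new_html.splitlines(keepends=True),
        fromfile="previous_snapshot",
        tofile="current_snapshot",
    )
    return "".join(diff_lines)


# Hypothetical example: a blog front page that gained one post between crawls.
previous = "<html>\n<p>old post</p>\n</html>\n"
current = "<html>\n<p>new post</p>\n<p>old post</p>\n</html>\n"

delta = page_delta(previous, current)
print(delta)
```

Note that `difflib` only produces the diff. Rebuilding a later snapshot from a stored diff would need a separate patch step, which this sketch leaves out.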