Poster: Marina Santini Date: Jan 15, 2003 6:07pm
Forum: researchproposals Subject: Research Proposal from Marina Santini

Which collection would you like to work with: fullweb
Name: Marina Santini
Organization: ITRI (Brighton Univ., UK) - www.itri.brighton.ac.uk
Email: santinim@inwind.it
Project name: PhD Project: Automatic Detection of Genres on the Web
Abstract: Web documents fully exploit the new capabilities of the Web (layout, graphics, multimedia, etc.). The Web represents a new and challenging corpus: it is a huge repository containing a striking diversity of textual resources. But what sorts of texts can be found on the Web, and in what proportions? The Internet Archive helps freeze the mutability of the Web by providing snapshots of the Web at a given time. Methods must be found to use these Web snapshots as massive, stable corpora from which to derive linguistic knowledge. The final aim of the project is to create an automatic cybergenre classifier to support Information Retrieval on the Web.
Description: One of the main goals of my research is to exploit the Web snapshots provided by the Internet Archive and use them as electronic text corpora. The project will take an iterative approach, which means that a different random Web sample will be analysed at every cycle. For example, if we plan 5 cycles, I should be able to work on 5 different random samples taken from 5 different Web snapshots. From the Web snapshots, I want to filter out Web pages I'm not interested in, for example non-English Web pages or dynamic Web pages (CGI and the like). The random samples must be reproducible and manageable. Reproducible means that I would like to be able to get the same sets of streamable files every time I need to run further experiments on a Web sample. Manageable means that the size and the format should not hinder my experiments, which are based on the extraction of statistical, linguistic and stylistic features.
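To illustrate what I mean by filtering and by a reproducible sample, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not a description of how the Internet Archive stores or serves its snapshots: I assume each page is available as a record with a `url` and an already detected `lang` field, and the names `is_dynamic`, `reproducible_sample` and `DYNAMIC_MARKERS` are hypothetical. The check for dynamic pages is a crude URL heuristic.

```python
import random
from urllib.parse import urlparse

# Hypothetical markers of dynamically generated pages (an assumption, not a standard list).
DYNAMIC_MARKERS = ("/cgi-bin/", ".cgi", ".php", ".asp")


def is_dynamic(url):
    """Crude heuristic: a query string or a known marker in the path."""
    parsed = urlparse(url)
    path = parsed.path.lower()
    return bool(parsed.query) or any(marker in path for marker in DYNAMIC_MARKERS)


def reproducible_sample(records, size, seed):
    """Filter the records, then draw a sample that depends only on the seed.

    `records` is assumed to be an iterable of dicts with "url" and "lang" keys.
    Sorting by URL before sampling makes the result independent of the order
    in which the records happen to be read from the snapshot.
    """
    kept = sorted(
        (r for r in records if r["lang"] == "en" and not is_dynamic(r["url"])),
        key=lambda r: r["url"],
    )
    rng = random.Random(seed)  # private generator, so other code cannot disturb it
    return rng.sample(kept, min(size, len(kept)))
```

Using one fixed seed per cycle (for example, seed 1 for the first snapshot, seed 2 for the second) would let me regenerate exactly the same sample whenever I need to rerun an experiment, provided the snapshot contents and the Python version stay the same. The `size` argument is what keeps the sample manageable.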