

Poster: Branko Collin Date: Feb 22, 2005 8:24am
Forum: toronto Subject: Re: Universal OCR

"However, for Project Runeberg I need to know where the page breaks and line breaks are, and this information is lost in the PG e-text"

DP now tries to retain at least page numbers in its HTML versions, though they will not always appear at the exact page boundaries, because we reconnect words that were broken across page boundaries. Likewise, footnotes, columns and other items that span pages are unlikely to end up in their original position, so to speak.

In other words, when sending a text through DP, it is not unreasonable to ask our volunteers to retain page breaks.

"Is this enough for designing a utility that maps the coordinates of 'delightf-ul' to the corrected word 'delightful'?"

I don't see why not.
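
To make that a bit more concrete, here is a minimal sketch of such a mapping in Python. It is only an illustration under assumptions of mine: the token list, the box format (x, y, width, height) and the function name `map_boxes` are all hypothetical, and it is not how DP or Project Runeberg actually store or process anything. It aligns the OCR words with the corrected words using the standard library's `difflib.SequenceMatcher` and hands each corrected word the box of the OCR token it replaces.

```python
from difflib import SequenceMatcher

# Hypothetical OCR output: (word, (x, y, width, height)) for each token.
ocr_tokens = [
    ("a", (10, 50, 8, 12)),
    ("delightf-ul", (22, 50, 70, 12)),
    ("story", (96, 50, 34, 12)),
]

# Hypothetical proofread text for the same line.
corrected_words = ["a", "delightful", "story"]


def map_boxes(ocr_tokens, corrected_words):
    """Pair each corrected word with the box of the OCR token it replaces."""
    ocr_words = [word for word, _ in ocr_tokens]
    matcher = SequenceMatcher(a=ocr_words, b=corrected_words, autojunk=False)
    mapping = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        if tag == "equal" or (tag == "replace" and same_length):
            # One-to-one correspondence: reuse the OCR box as-is.
            for i, j in zip(range(i1, i2), range(j1, j2)):
                mapping.append((corrected_words[j], ocr_tokens[i][1]))
        # Insertions, deletions and uneven replacements are left
        # unmapped in this sketch; a real tool would need a policy here.
    return mapping


for word, box in map_boxes(ocr_tokens, corrected_words):
    print(word, box)
```

A word-level alignment like this only works when the proofread text keeps roughly the same word order as the scan, which is one more reason to retain page and line breaks.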

"Would this be useful?"

I think it would be.

Reply

Poster: Branko Collin Date: Feb 22, 2005 8:34am
Forum: toronto Subject: Re: Universal OCR

BTW, you could use DP just for proofreading. During proofreading rounds, we retain line breaks to make it easier for our volunteers to compare the text with the scan. Line breaks are only removed during the post-processing round.
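
As a rough illustration of what removing line breaks involves, here is a small Python sketch. The function name `unwrap_lines` and the sample lines are my own invention, not DP's actual post-processing tooling, and the rule is deliberately naive: a line ending in a hyphen is glued to the start of the next line.

```python
def unwrap_lines(lines):
    """Join proofread lines into one paragraph, reconnecting words
    that were hyphenated at the end of a line."""
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in this sketch
        if text.endswith("-"):
            # Drop the end-of-line hyphen and join without a space.
            text = text[:-1] + line
        elif text:
            text += " " + line
        else:
            text = line
    return text


# Hypothetical proofread lines, with line breaks as in the scan.
proofread = ["It was a delight-", "ful story about page", "breaks."]
print(unwrap_lines(proofread))
```

Note that this naive rule would also merge a genuine hyphenated compound that happens to break at the end of a line, so a human check is still needed. If the original line breaks are kept up to this step, the coordinates from the scan can still be matched line by line.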