Poster: Branko Collin Date: Feb 22, 2005 8:24am

Forum: toronto Subject: Re: Universal OCR

"However, for Project Runeberg I need to know where the page breaks and line breaks are, and this information is lost in the PG e-text"

DP (Distributed Proofreaders) now tries to retain at least the page numbers in its HTML versions, though they will not always appear at the exact page boundaries, because we rejoin words that were broken across page boundaries. Footnotes, columns and other items that span pages are also unlikely to end up in their original positions.

In other words, when sending a text through DP, it is not unreasonable to ask our volunteers to retain page breaks.
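To illustrate why a retained page marker can drift away from the exact page boundary, here is a rough Python sketch. It is not DP's actual tooling: the function name join_pages, the "[p. N]" marker format and the rule of treating any trailing hyphen as a broken word are all assumptions made for the example.

```python
def join_pages(pages):
    """Join per-page texts into one string, keeping a marker for each page.

    pages: list of (page_number, text) tuples (hypothetical input format).
    A word cut by a trailing hyphen at the end of a page is rejoined, and
    the marker for the following page is placed after the rejoined word.
    Assumption: every trailing hyphen marks a broken word, so a genuine
    compound split at the page end would lose its hyphen here.
    """
    out = []
    carry = ""  # first half of a word broken across the page boundary
    for number, text in pages:
        words = text.split()
        if carry and words:
            # Finish the broken word before announcing the new page.
            out.append(carry + words.pop(0))
            carry = ""
        out.append(f"[p. {number}]")
        if words and words[-1].endswith("-") and len(words[-1]) > 1:
            carry = words.pop()[:-1]
        out.extend(words)
    if carry:
        # Nothing followed the broken word, so keep it as it was scanned.
        out.append(carry + "-")
    return " ".join(out)


pages = [(1, "It was a delightf-"), (2, "ul day at the lake.")]
joined = join_pages(pages)
print(joined)
```

In this sketch the marker for page 2 lands after "delightful", one word later than the real boundary, which is the kind of small displacement described above.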

"Is this enough for designing a utility that maps the coordinates of "delightf-ul" to the corrected word "delightful"?"

I don't see why not.
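A minimal sketch of such a utility, in Python, could look like the following. Everything in it is hypothetical: the Fragment record, the (x0, y0, x1, y1) box format and the assumption that the OCR fragments and the corrected words come in the same reading order. Real OCR output would need a more forgiving alignment.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str    # OCR text as scanned, e.g. "delightf-"
    page: int    # page the fragment was found on
    box: tuple   # (x0, y0, x1, y1) coordinates on that page


def map_fragments(fragments, corrected_words):
    """Pair each corrected word with the OCR fragment(s) it came from.

    Assumes both sequences are in the same reading order. A fragment
    ending in a hyphen is merged with the next fragment when the two
    together spell the corrected word, with or without the hyphen.
    """
    mapping = []
    i = 0
    for word in corrected_words:
        if i >= len(fragments):
            break
        first = fragments[i]
        nxt = fragments[i + 1] if i + 1 < len(fragments) else None
        broken = (
            first.text != word
            and first.text.endswith("-")
            and nxt is not None
            and word in (first.text[:-1] + nxt.text, first.text + nxt.text)
        )
        if broken:
            mapping.append((word, [first, nxt]))
            i += 2
        else:
            # One fragment per word, even if proofreading changed the text.
            mapping.append((word, [first]))
            i += 1
    return mapping


fragments = [
    Fragment("a", 1, (10, 100, 20, 112)),
    Fragment("delightf-", 1, (24, 100, 90, 112)),
    Fragment("ul", 1, (10, 116, 22, 128)),
    Fragment("day", 1, (26, 116, 50, 128)),
]
mapping = map_fragments(fragments, ["a", "delightful", "day"])
for word, parts in mapping:
    print(word, [p.box for p in parts])
```

Here "delightful" ends up with two boxes, one at the end of a line and one at the start of the next, so a viewer could highlight both halves on the scan.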

"Would this be useful?"

I think it would be.

Poster: Branko Collin Date: Feb 22, 2005 8:34am

Forum: toronto Subject: Re: Universal OCR

BTW, you could use DP just for proofreading. During proofreading rounds, we retain line breaks to make it easier for our volunteers to compare the text with the scan. Line breaks are only removed during the post-processing round.
