Font adaptive word indexing of modern printed documents

We propose an approach for the word-level indexing of modern printed documents which are difficult to recognize using current OCR engines. By means of word-level indexing, it is possible to retrieve the position of words in a document, enabling queries involving proximity of terms. Web search engine...

Bibliographic Details
Published in: IEEE transactions on pattern analysis and machine intelligence. - 1998. - 28(2006), 8, 15 Aug., pages 1187-99
Main author: Marinai, Simone (Author)
Other authors: Marino, Emanuele, Soda, Giovanni
Format: Article
Language: English
Published: 2006
Access to the parent work: IEEE transactions on pattern analysis and machine intelligence
Subjects: Evaluation Study, Journal Article
LEADER 01000caa a22002652 4500
001 NLM164568271
003 DE-627
005 20250207124410.0
007 tu
008 231223s2006 xx ||||| 00| ||eng c
028 5 2 |a pubmed25n0549.xml 
035 |a (DE-627)NLM164568271 
035 |a (NLM)16886856 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Marinai, Simone  |e verfasserin  |4 aut 
245 1 0 |a Font adaptive word indexing of modern printed documents 
264 1 |c 2006 
336 |a Text  |b txt  |2 rdacontent 
337 |a ohne Hilfsmittel zu benutzen  |b n  |2 rdamedia 
338 |a Band  |b nc  |2 rdacarrier 
500 |a Date Completed 05.09.2006 
500 |a Date Revised 10.12.2019 
500 |a published: Print 
500 |a Citation Status MEDLINE 
520 |a We propose an approach for the word-level indexing of modern printed documents which are difficult to recognize using current OCR engines. By means of word-level indexing, it is possible to retrieve the position of words in a document, enabling queries involving proximity of terms. Web search engines implement this kind of indexing, allowing users to retrieve Web pages on the basis of their textual content. Nowadays, digital libraries hold collections of digitized documents that can be retrieved either by browsing the document images or relying on appropriate metadata assembled by domain experts. Word indexing tools would therefore increase the access to these collections. The proposed system is designed to index homogeneous document collections by automatically adapting to different languages and font styles without relying on OCR engines for character recognition. The approach is based on three main ideas: the use of Self Organizing Maps (SOM) to perform unsupervised character clustering, the definition of one suitable vector-based word representation whose size depends on the word aspect-ratio, and the run-time alignment of the query word with indexed words to deal with broken and touching characters. The most appropriate applications are for processing modern printed documents (17th to 19th centuries) where current OCR engines are less accurate. Our experimental analysis addresses six data sets containing documents ranging from books of the 17th century to contemporary journals 
650 4 |a Evaluation Study 
650 4 |a Journal Article 
700 1 |a Marino, Emanuele  |e verfasserin  |4 aut 
700 1 |a Soda, Giovanni  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on pattern analysis and machine intelligence  |d 1998  |g 28(2006), 8 vom: 15. Aug., Seite 1187-99  |w (DE-627)NLM098212257  |x 0162-8828  |7 nnns 
773 1 8 |g volume:28  |g year:2006  |g number:8  |g day:15  |g month:08  |g pages:1187-99 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d 28  |j 2006  |e 8  |b 15  |c 08  |h 1187-99
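
The abstract notes that word-level indexing stores the position of each word in a document, which is what makes proximity queries possible. A minimal sketch of that idea follows, assuming plain text tokens as input. The paper itself indexes word images without OCR, so this illustrates only the positional-index concept, not the authors' method. The names `build_positional_index` and `within` are hypothetical.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each word to {doc_id: [positions]}; docs is {doc_id: text}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        # Assumption: whitespace tokenization stands in for word segmentation.
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index


def within(index, a, b, k):
    """Return sorted doc ids where a and b occur at most k positions apart."""
    docs_a = index.get(a, {})
    docs_b = index.get(b, {})
    hits = set()
    # Only documents containing both words can match.
    for doc_id in docs_a.keys() & docs_b.keys():
        if any(abs(pa - pb) <= k
               for pa in docs_a[doc_id]
               for pb in docs_b[doc_id]):
            hits.add(doc_id)
    return sorted(hits)


docs = {
    1: "font adaptive word indexing of printed documents",
    2: "word images are clustered before indexing",
}
index = build_positional_index(docs)
print(within(index, "word", "indexing", 1))
print(within(index, "word", "indexing", 5))
```

With `k=1` only the document where the two words are adjacent is returned. Widening the window to `k=5` also admits the second document.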