Text Keyword Recognition (SCRIBBLE)

SRI International (SRI) has developed a prototype system, Scribble, to spot specific words (keywords) in scanned images of paper documents. The system can search for up to several thousand keywords over large sets of unconstrained-format documents. It is useful for finding and retrieving relevant information, classifying documents into various categories, or simply deciding which ones are of further interest.

Scribble is based on the holistic way humans tend to scan and read documents: instead of concentrating on individual characters, people tend to look at entire words or phrases. Our ability to read the degraded text in the figure below is one example. Words have more features than isolated characters; thus, the recognition of whole words is faster and more dependable than character recognition, especially in the presence of image "noise". Instead of recognizing individual characters, SRI's keyword spotting system detects entire words or word phrases as single entities directly in the scanned image of the document.

[Figure: example of degraded text]

How It Works

For machine-printed English text, the word shape of each keyword is represented by the presence of ascenders (characters with components that rise above the height of lowercase characters) and descenders (characters with components that fall below the baseline of the text line). Each document image is segmented into individual word images. As the image of each word is delimited, ascenders and descenders are detected, and their relative locations in the word image are compared with the predetermined ascender/descender locations in each keyword. If there is a possible match with one or more keywords, more detailed features of the word image are computed and compared with those of the candidate keyword(s). If the shape of the word image matches that of a keyword closely enough, the word image is flagged as a keyword.
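The first-stage screening described above can be illustrated with a minimal Python sketch. It is not SRI's implementation: it works on text strings rather than word images, approximates ascender/descender locations by character position, and the letter classes and function names (`shape_code`, `build_index`, `candidate_keywords`) are assumptions made for illustration.

```python
# Assumed letter classes for a typical Latin typeface (illustrative only).
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gjpqy")


def shape_code(word):
    """Encode a word as A (ascender), D (descender), or x (x-height) per character."""
    code = []
    for ch in word:
        if ch in DESCENDERS:
            code.append("D")
        elif ch in ASCENDERS or ch.isupper() or ch.isdigit():
            code.append("A")
        else:
            code.append("x")
    return "".join(code)


def build_index(keywords):
    """Group keywords by shape code so a word needs only one lookup."""
    index = {}
    for keyword in keywords:
        index.setdefault(shape_code(keyword), []).append(keyword)
    return index


def candidate_keywords(word_code, index):
    """Return the keywords whose coarse shape matches the observed word shape."""
    return index.get(word_code, [])


index = build_index(["keyword", "spot", "spat", "document"])
print(candidate_keywords(shape_code("keyword"), index))  # ['keyword']
print(candidate_keywords("xDxA", index))                 # ['spot', 'spat']
```

The second lookup returns two candidates with the same coarse shape, which is why a more detailed comparison is still needed before a word is flagged as a keyword.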

User Interface

An easy-to-use interface lets the user specify sets of lexicon words and sets of document images to be searched. It then displays the results of the search, showing the location of each detected keyword instance.


Examples


Comparison with OCR

The conventional approach to detecting keywords in scanned documents has been to use optical character recognition (OCR) to convert the document image into text data, and then to compare the recognized text with the set of keywords. Because OCR is a character-based rather than word-based process, a single misrecognized character will cause the keyword to be missed. (One could allow for single or multiple missed characters in both the OCR output and the Scribble system. This would increase the number of detected keywords, but would also increase the number of false hits.) The performance of an OCR-based approach to keyword spotting drops rapidly as the quality of the document decreases. By contrast, the Scribble system can spot words in spite of merged and fragmented characters because it uses shape features of whole words; its recognition performance is therefore more robust on poor-quality documents, such as faxes, multigenerational photocopies, or carbons.
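The trade-off between missed keywords and false hits in an OCR-based approach can be shown with a small Python sketch. It assumes the OCR output is already available as a list of words and uses Levenshtein edit distance as the tolerance measure. The function names and the sample words are hypothetical and do not come from the tests reported here.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def ocr_keyword_hits(ocr_words, keywords, max_errors=0):
    """Return (ocr_word, keyword) pairs within max_errors character edits."""
    return [(word, keyword)
            for word in ocr_words
            for keyword in keywords
            if edit_distance(word, keyword) <= max_errors]


# Hypothetical OCR output: "modcm" is a misread "modem"; "modern" is a different word.
ocr_words = ["modcm", "modern", "fax"]
keywords = ["modem"]

print(ocr_keyword_hits(ocr_words, keywords, max_errors=0))  # []
print(ocr_keyword_hits(ocr_words, keywords, max_errors=1))  # [('modcm', 'modem')]
print(ocr_keyword_hits(ocr_words, keywords, max_errors=2))  # adds ('modern', 'modem')
```

With exact matching the keyword is missed. Allowing one error recovers it, and allowing two also admits a false hit.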


In tests on 380 document images from the University of Washington CD-ROM English Document Image Database I, Scribble was 6% to 10% more accurate than a commercially available OCR package.

In addition, because the keyword spotting system finds only the specified set of keywords and does not attempt to recognize every word in the document, it can run 2 to 20 times faster than typical OCR processes.


Applications

SRI's keyword spotting software can significantly improve speed and throughput in document processing systems. It could be applied to document declassification tasks, such as systematic review, or automatic content-based routing of documents to an appropriate reviewer based on the subject area or the identity of referenced organizations. The word shape features that are used to characterize the document can also be used as a basis for rapidly identifying duplicate documents or highlighting the differences between two documents.
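One way to picture duplicate detection and difference highlighting is to compare two documents as sequences of per-word shape codes. The Python sketch below assumes each document has already been reduced to a list of such codes (here, strings in which `A` marks an ascender, `D` a descender, and `x` an x-height character). That encoding, the function names, and the use of the standard library's `difflib.SequenceMatcher` are illustrative assumptions, not a description of Scribble's internals.

```python
from difflib import SequenceMatcher


def shape_similarity(codes_a, codes_b):
    """Similarity in [0, 1] between two sequences of word shape codes."""
    return SequenceMatcher(None, codes_a, codes_b, autojunk=False).ratio()


def shape_differences(codes_a, codes_b):
    """List the non-matching regions as (tag, a_start, a_end, b_start, b_end)."""
    matcher = SequenceMatcher(None, codes_a, codes_b, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


# Hypothetical shape-code sequences for two scanned pages.
doc_a = ["AxDxxxA", "xDxA", "AxxxxxxA"]
doc_b = ["AxDxxxA", "xxxA", "AxxxxxxA"]

print(shape_similarity(doc_a, doc_a))   # 1.0 for identical sequences
print(shape_differences(doc_a, doc_b))  # [('replace', 1, 2, 1, 2)]
```

A similarity close to 1.0 would suggest a likely duplicate, and the reported regions indicate which word positions differ. Any threshold for calling two documents duplicates would need to be tuned on real data.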

Versatility and Extensibility

Scribble can detect keywords in almost any machine-printed font in English. SRI is currently extending the system to handle languages written in the Cyrillic and Arabic alphabets. Scribble can handle multi-columned documents and documents with skewed text. Its capabilities can easily be extended and customized for particular applications. For example, it could be integrated with a search engine that finds logical combinations of keywords (also using their positional relationships in the document) for more accurate document categorization. Punctuation, which is currently detected but discarded, could help to locate Social Security and telephone numbers. Scribble currently runs on UNIX-based platforms but could easily be ported to other computing environments.
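A search over logical combinations of keywords and their positional relationships could look something like the following Python sketch. The `Hit` record, its fields, the two query functions, and the sample data are all hypothetical; the sketch only assumes that each detected keyword comes with a page number and a location on that page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One detected keyword instance (hypothetical record layout)."""
    keyword: str
    page: int
    x: float  # horizontal position on the page, in pixels
    y: float  # vertical position on the page, in pixels


def contains_all(hits, keywords):
    """Logical AND: every listed keyword was detected somewhere in the document."""
    found = {hit.keyword for hit in hits}
    return all(keyword in found for keyword in keywords)


def near(hits, first, second, max_distance):
    """True if the two keywords occur on the same page within max_distance pixels."""
    for a in hits:
        if a.keyword != first:
            continue
        for b in hits:
            if b.keyword == second and a.page == b.page:
                distance = ((a.x - b.x) ** 2 + (a.y - b.y) ** 2) ** 0.5
                if distance <= max_distance:
                    return True
    return False


# Hypothetical detections for one document.
hits = [
    Hit("budget", 1, 100, 200),
    Hit("defense", 1, 160, 200),
    Hit("treaty", 3, 50, 50),
]

print(contains_all(hits, ["budget", "treaty"]))  # True
print(near(hits, "budget", "defense", 100))      # True: 60 pixels apart on page 1
print(near(hits, "budget", "treaty", 1000))      # False: different pages
```

Rules built from such predicates could then assign a document to a category only when the relevant keywords occur together, not merely somewhere in the same file.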

For More Information on Scribble

Gregory K. Myers, Program Director


 

©2011 SRI International 333 Ravenswood Avenue, Menlo Park, CA 94025-3493
SRI International is an independent, nonprofit corporation.