Intelligent Document Understanding and Optical Character Recognition (OCR)

The ability to extract, index, and search digitized document images for relevant information is a growing need in many business and government applications. Conventional approaches to this problem involve the use of page, line, and character segmentation followed by optical character recognition to convert the pixel information into symbol strings that can be manipulated. Document degradation, however, causes the loss of important information at the pixel level, which in turn affects the quality of the characters extracted from documents and therefore degrades the quality of the search results.
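As a rough illustration of the line-segmentation step in such a conventional pipeline, the sketch below splits a binarized page into text lines using a horizontal projection profile. This is one common textbook technique, not necessarily the one used in any particular system; the function name `segment_lines` and the toy `page` array are hypothetical, and the page is assumed to be a list of rows in which 1 marks an ink pixel and 0 marks background.

```python
def segment_lines(image):
    """Return (start_row, end_row) pairs, end exclusive, one per text line.

    Assumes `image` is a list of rows of 0/1 values (1 = ink). A text line
    is taken to be a maximal run of consecutive rows that contain any ink.
    """
    lines = []
    start = None  # first row of the line currently being tracked, if any
    for row_index, row in enumerate(image):
        has_ink = any(row)
        if has_ink and start is None:
            start = row_index                 # a new line begins here
        elif not has_ink and start is not None:
            lines.append((start, row_index))  # a blank row ends the line
            start = None
    if start is not None:
        lines.append((start, len(image)))     # line runs to the bottom edge
    return lines


# Toy page: two text lines separated by one blank row.
page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
print(segment_lines(page))
```

The sketch also shows why degradation hurts this kind of pipeline: speckle noise in a blank row merges two lines into one, and faded ink can split a single line in two, so errors made here propagate to every later stage.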

SRI has been conducting several related research efforts that explore how collateral or contextual information can compensate for the information lost to degradation. In general, these methods use shape information from entire words to complement character recognition; lexicons organized in domain-specific ways to enhance recognition; language models that can focus attention on parts of a document; and information combined from graphical and textual modalities within a single document.
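To make the lexicon idea concrete, here is a minimal sketch of how a domain-specific word list might be used to repair degraded OCR output. It is an illustrative assumption, not SRI's method: it relies only on Python's standard `difflib.get_close_matches`, and the function name `correct_with_lexicon`, the sample lexicon, and the 0.75 similarity cutoff are all hypothetical choices.

```python
import difflib


def correct_with_lexicon(tokens, lexicon, cutoff=0.75):
    """Replace each OCR token with its closest lexicon entry, if one is close enough.

    Tokens are lowercased before matching. A token with no lexicon entry
    at or above `cutoff` similarity is passed through unchanged.
    """
    corrected = []
    for token in tokens:
        matches = difflib.get_close_matches(
            token.lower(), lexicon, n=1, cutoff=cutoff
        )
        corrected.append(matches[0] if matches else token)
    return corrected


# Hypothetical domain lexicon and degraded OCR output.
lexicon = ["document", "recognition", "character"]
ocr_tokens = ["docurnent", "recogniti0n", "xyz"]
print(correct_with_lexicon(ocr_tokens, lexicon))
```

Restricting the lexicon to the vocabulary of a given domain keeps the candidate set small, which lowers the chance that a degraded word is matched to an unrelated entry.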

The goal of our research is to produce systems that can synergistically extract information from different parts of a printed document to produce a consistent and searchable representation of the total information content.

 

©2010 SRI International 333 Ravenswood Avenue, Menlo Park, CA 94025-3493
SRI International is an independent, nonprofit corporation.