Detecting and parsing embedded lightweight structures
Author(s)
Rha, Philip
Other Contributors
Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science.
Advisor
Rob Miller.
Abstract
Text documents, web pages, and source code all contain language structures that can be parsed with corresponding parsers. Some documents, such as JSP pages, Java tutorial pages, and Java source code, often have language structures nested within another language structure. Although parsers exist for the outer and inner language structures individually, neither is suited to parsing the embedded structures in the context of the document. This thesis presents a new technique for selectively applying existing parsers to intelligently transformed document content. The task of parsing these embedded structures can be broken into two phases: detecting the embedded structures and parsing them. To detect embedded structures, we take advantage of the fact that any given language has natural boundaries within which embedded structures can appear. We use these natural boundaries to narrow our search space for embedded structures. We further reduce the search space by statistically analyzing token frequency for different language types. By combining natural boundaries with token frequency analysis, we can, for any given document, generate a set of regions that have a high probability of being an embedded structure. To parse the embedded structures, the text of the region must often be transformed into a form that is readable by the intended parser. Our approach provides a systematic way to transform the document content into a form appropriate for the embedded structure parser using simple replacement rules. Using natural boundaries and token frequency analysis, we are able to locate regions of embedded structures. Combined with replacement rules that transform document content into a parsable form, this allows us to successfully parse a range of documents with embedded structures using existing parsers.
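The two phases described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation, and every concrete choice in it is an assumption made for illustration: the outer document is HTML, the natural boundary is a <pre> element, the embedded language is Python (so that the standard-library ast.parse can stand in for an "existing parser"), and the token list, the threshold, and the names code_score, apply_rules, and parse_embedded are hypothetical.

```python
import ast
import re

# Assumed natural boundary: embedded code may only appear inside <pre> elements.
BOUNDARY = re.compile(r"<pre>(.*?)</pre>", re.DOTALL | re.IGNORECASE)

# Hypothetical hand-picked tokens, assumed frequent in Python and rare in prose.
CODE_TOKENS = {"def", "return", "import", "class", "lambda", "elif",
               "(", ")", ":", "="}

# Crude tokenizer: identifiers plus a few punctuation marks.
TOKEN = re.compile(r"[A-Za-z_]\w*|[():=]")

# Simple replacement rules; "&amp;" goes last so "&amp;lt;" is not unescaped twice.
RULES = [("&lt;", "<"), ("&gt;", ">"), ("&quot;", '"'), ("&amp;", "&")]


def code_score(text):
    """Return the fraction of tokens in `text` that belong to CODE_TOKENS."""
    tokens = TOKEN.findall(text)
    if not tokens:
        return 0.0
    return sum(t in CODE_TOKENS for t in tokens) / len(tokens)


def apply_rules(text):
    """Transform region text into a form the embedded-language parser accepts."""
    for old, new in RULES:
        text = text.replace(old, new)
    return text


def parse_embedded(document, threshold=0.3):
    """Return (start, end, tree) for each candidate region the parser accepts."""
    results = []
    for match in BOUNDARY.finditer(document):
        raw = match.group(1)
        if code_score(raw) < threshold:
            continue  # token frequencies suggest this region is not code
        try:
            tree = ast.parse(apply_rules(raw))
        except SyntaxError:
            continue  # looked like code but the existing parser rejected it
        results.append((match.start(1), match.end(1), tree))
    return results


page = (
    "<p>Welcome to the tutorial.</p>\n"
    "<pre>Dear reader, thanks for visiting our site today</pre>\n"
    "<pre>def clamp(x):\n    return x if x &lt; 10 else 10\n</pre>\n"
)

results = parse_embedded(page)
for start, end, tree in results:
    print(start, end, type(tree.body[0]).__name__)
```

In this sketch the boundary pattern narrows the search to two regions, the token-frequency score discards the prose region, and the replacement rules turn "&lt;" back into "<" so that the unmodified parser can read the remaining region. A fixed token list with a threshold is only a stand-in for the statistical analysis the abstract refers to.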
Description
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005. Includes bibliographical references (p. 71-72).
Date issued
2005
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology
Keywords
Electrical Engineering and Computer Science.