A new class of functions for describing logical structures in text

Title: A new class of functions for describing logical structures in text
Author: Dao, Ngon D. (Ngon Dong), 1974-
Other Contributors: Harvard University--MIT Division of Health Sciences and Technology.
Advisor: C. Forbes Dewey, Jr.
Department: Harvard University--MIT Division of Health Sciences and Technology.
Publisher: Massachusetts Institute of Technology
Issue Date: 2004
Abstract: Text documents generally contain two forms of structures, logical structures and physical structures. Loosely speaking, logical structures are sections of text that are both visually and semantically distinct. For example, a document may have an "introduction", a "body", and a "conclusion" as its logical structures. These structures are so named because each section has a distinct purpose in conveying the document's logical arguments or intentions. Perfect machine recognition of logical structures in large collections of documents is an unsolved problem in computational linguistics. This thesis presents evidence that a new family of functions on text segments carries information that is useful for differentiating document logical structures. For any given text segment, a function of this form is referred to as the cadence, and it is based on a new interpretation of the vector space representation that Gerard Salton introduced in 1975. Cadence also differs from the original Salton representation in that it relies on three heuristic transformations based on authorship, location, and term coherence. To test the hypothesis that the cadence of a text segment carries information helpful to differentiating logical structures, a corpus was built containing 2800 documents with manually-annotated logical structures. Structures representing abstracts, introductions, bodies, and conclusions from this corpus were clustered with a k-means algorithm using cadence data. Precision and recall performances were computed for the results, and a chi-squared cross-tabulation test was used to determine the statistical significance of the clustering results. Precision and recall were highest for abstracts (P = 0.931 ± 0.025, R = 0.992 ± 0.026), followed by introductions (P = 0.747 ± 0.025, R = 0.802 ± 0.026) and conclusions (P = 0.737 ± 0.025, R = 0.813 ± 0.026), and lowest for bodies (P = 0.876 ± 0.03, R = 0.663 ± 0.026). These results suggest that cadence may have substantial promise for finding logical structures in un-annotated documents.
Description: Thesis (Ph. D.)--Harvard-MIT Division of Health Sciences and Technology, 2004. Includes bibliographical references (p. 49-51).
URI: http://hdl.handle.net/1721.1/28594
Keywords: Harvard University--MIT Division of Health Sciences and Technology.
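
The abstract above reports per-structure precision and recall for clusters produced by k-means. The record does not say how clusters were matched to structure types, so the following is only a minimal sketch of one common way to score such a clustering: each cluster is assigned the label that is most frequent among its members (an assumption, not necessarily the thesis's procedure), and precision and recall are then computed per label. The function name `precision_recall`, the toy labels, and the toy cluster ids are all hypothetical and are not data from the thesis.

```python
from collections import Counter


def precision_recall(gold, clusters):
    """Return {label: (precision, recall)} for a clustering.

    Assumption: each cluster is mapped to its majority gold label.
    """
    # Count the gold labels that fall into each cluster.
    members = {}
    for label, cluster_id in zip(gold, clusters):
        members.setdefault(cluster_id, Counter())[label] += 1

    # Majority vote: the most frequent gold label names the cluster.
    cluster_label = {
        cluster_id: counts.most_common(1)[0][0]
        for cluster_id, counts in members.items()
    }
    predicted = [cluster_label[cluster_id] for cluster_id in clusters]

    scores = {}
    for label in sorted(set(gold)):
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[label] = (precision, recall)
    return scores


# Toy example (hypothetical): eight segments, four clusters.
gold = ["abstract", "abstract", "introduction", "introduction",
        "body", "body", "conclusion", "conclusion"]
clusters = [0, 0, 1, 1, 2, 1, 3, 3]

scores = precision_recall(gold, clusters)
for label, (p, r) in scores.items():
    print(f"{label}: P = {p:.3f}, R = {r:.3f}")
```

In the toy data one body segment lands in the introduction cluster, which lowers precision for introductions and recall for bodies while leaving the other scores at 1.0. The same kind of trade-off appears in the reported figures, where bodies have high precision but the lowest recall.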

Files in this item

Preview, non-printable (open to all): 2.984Mb, PDF
Full printable version (MIT only): 2.988Mb, PDF
