A new class of functions for describing logical structures in text

Dao, Ngon D. (Ngon Dong), 1974-

dc.contributor.advisor	C. Forbes Dewey, Jr.	en_US
dc.contributor.author	Dao, Ngon D. (Ngon Dong), 1974-	en_US
dc.contributor.other	Harvard University--MIT Division of Health Sciences and Technology.	en_US
dc.date.accessioned	2005-09-27T17:12:58Z
dc.date.available	2005-09-27T17:12:58Z
dc.date.copyright	2004	en_US
dc.date.issued	2004	en_US
dc.identifier.uri	http://hdl.handle.net/1721.1/28594
dc.description	Thesis (Ph. D.)--Harvard-MIT Division of Health Sciences and Technology, 2004.	en_US
dc.description	Includes bibliographical references (p. 49-51).	en_US
dc.description.abstract	Text documents generally contain two forms of structures, logical structures and physical structures. Loosely speaking, logical structures are sections of text that are both visually and semantically distinct. For example, a document may have an "introduction", a "body", and a "conclusion" as its logical structures. These structures are so named because each section has a distinct purpose in conveying the document's logical arguments or intentions. Perfect machine recognition of logical structures in large collections of documents is an unsolved problem in computational linguistics. This thesis presents evidence that a new family of functions on text segments carries information that is useful for differentiating document logical structures. For any given text segment, a function of this form is referred to as the cadence, and it is based on a new interpretation of the vector space representation that Gerard Salton introduced in 1975. Cadence also differs from the original Salton representation in that it relies on three heuristic transformations based on authorship, location, and term coherence. To test the hypothesis that the cadence of a text segment carries information helpful to differentiating logical structures, a corpus was built containing 2800 documents with manually-annotated logical structures. Structures representing abstracts, introductions, bodies, and conclusions from this corpus were clustered with a k-means algorithm using cadence data. Precision and recall performances were computed for the results, and a chi-squared cross-tabulation test was used to determine the statistical significance of the clustering results. Precision and recall were highest for abstracts (P = 0.931 ± 0.025, R = 0.992 ± 0.026), followed by introductions (P = 0.747 ± 0.025, R = 0.802 ± 0.026) and conclusions (P = 0.737 ± 0.025, R = 0.813 ± 0.026), and lowest for bodies (P = 0.876 ± 0.03, R = 0.663 ± 0.026). These results suggest that cadence may have substantial promise for finding logical structures in un-annotated documents.	en_US
dc.description.statementofresponsibility	by Ngon D. Dao.	en_US
dc.format.extent	51 p.	en_US
dc.format.extent	2984150 bytes
dc.format.extent	2988167 bytes
dc.format.mimetype	application/pdf
dc.format.mimetype	application/pdf
dc.language.iso	en_US
dc.publisher	Massachusetts Institute of Technology	en_US
dc.rights	M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission.	en_US
dc.rights.uri	http://dspace.mit.edu/handle/1721.1/7582
dc.subject	Harvard University--MIT Division of Health Sciences and Technology.	en_US
dc.title	A new class of functions for describing logical structures in text	en_US
dc.type	Thesis	en_US
dc.description.degree	Ph.D.	en_US
dc.contributor.department	Harvard University--MIT Division of Health Sciences and Technology
dc.identifier.oclc	57509348	en_US
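
The abstract reports per-structure precision and recall for a k-means clustering of text segments. As a rough illustration of how such figures can be derived from a cluster assignment, the sketch below cross-tabulates clusters against annotated labels and scores each label. It is not the thesis's method: the function name `precision_recall`, the rule that maps each cluster to its most frequent label, and the toy data are all assumptions made for this example, and the cadence function itself is not defined in this record.

```python
from collections import Counter


def precision_recall(true_labels, cluster_ids):
    """Per-label (precision, recall) after naming each cluster by its majority label."""
    # Cross-tabulate: for each cluster, count the annotated labels it contains.
    table = {}
    for label, cluster in zip(true_labels, cluster_ids):
        table.setdefault(cluster, Counter())[label] += 1

    # Assumption: a cluster is named after its most frequent annotated label.
    cluster_to_label = {c: counts.most_common(1)[0][0] for c, counts in table.items()}
    predicted = [cluster_to_label[c] for c in cluster_ids]

    scores = {}
    for label in sorted(set(true_labels)):
        tp = sum(1 for t, p in zip(true_labels, predicted) if t == label and p == label)
        n_pred = sum(1 for p in predicted if p == label)
        n_true = sum(1 for t in true_labels if t == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true if n_true else 0.0
        scores[label] = (precision, recall)
    return scores


# Toy data invented for illustration; not taken from the thesis corpus.
true_labels = ["abstract", "abstract", "introduction", "introduction", "body", "body"]
cluster_ids = [0, 0, 1, 2, 2, 2]
scores = precision_recall(true_labels, cluster_ids)
for label, (p, r) in scores.items():
    print(f"{label}: P = {p:.3f}, R = {r:.3f}")
```

In this toy assignment one introduction falls into the cluster dominated by bodies, which lowers recall for introductions and precision for bodies while leaving abstracts unaffected.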

Files in this item

Name: 57509348-MIT.pdf
Size: 2.849 MB
Format: PDF
Description: Full printable version

This item appears in the following Collection(s)

Doctoral Theses
