Syllables and the M Language : improving unknown word guessing
Author(s): Jacokes, M. Brian (Michael Brian)
Unknown word guessing in a semantic data language
Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science.
David L. Brock.
Despite the huge amount of computer data that exists today, the task of sharing information between organizations is still tackled largely on a case-by-case basis. The M Language is a data language that improves data sharing and interoperability by building a platform on top of XML and a semantic dictionary. Because the M Language is specifically designed for real-world data applications, it gives rise to several unique problems in natural language processing. I approach the problem of understanding unknown words by devising a novel heuristic for word decomposition called "probabilistic chunking," which achieves a 70% success rate in word syllabification and has potential applications in automatically decomposing words into morphemes. I also create algorithms which use probabilistic chunking to syllabify unknown words and thereby guess their parts of speech and semantic relations. This work contributes valuable methods to the areas of natural language processing and automatic data processing.
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. Includes bibliographical references (p. 71-72).
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology
Electrical Engineering and Computer Science.