MIT OpenCourseWare

Assignments

File decompression software, such as Winzip® or StuffIt®, is required to open the .gz and .tar files in this section.

Homework 1

The goal of this homework is to design and evaluate a method for sentence segmentation of speech transcripts. Since raw speech transcripts do not contain sentence boundaries, a sentence segmentation tool is important for many applications that operate over speech transcripts, such as information retrieval and summarization.

For training, development and testing, you will be provided with 6.001 lecture transcripts manually annotated with sentence boundaries. The transcripts are also annotated with pause information that your model may use. Note that transcripts do not contain capitalization and punctuation, so your model should not rely on this information.

What to do?

Read Related Work

You will find abundant literature on the topic of sentence segmentation, and it is worth looking at some of the existing techniques before designing your own. The Manning & Schütze text gives a short summary of sentence segmentation for written language and provides some pointers. In addition, you may want to consider literature on sentence segmentation of spoken language. (Note that you do not have access to the prosodic features typically used in spoken language segmentation.)

Establish Upper and Lower Bounds

To establish the upper bound, manually segment the following file (TXT) into 20 sentences, and compare your segmentation with the "gold standard" (TXT). To establish the lower bound, randomly segment the file into 20 sentences and compare the result against the gold standard. Report precision, recall, and F-measure for both bounds.
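Precision, recall, and F-measure for a segmentation can be computed by treating each segmentation as a set of boundary positions. A minimal sketch (the word-index representation of boundaries is an assumption; adapt it to whatever format you use):

```python
# Score a predicted segmentation against a gold segmentation, where each
# segmentation is represented as a set of boundary positions (word indices).

def boundary_prf(predicted, gold):
    """Return (precision, recall, F-measure) for two sets of boundaries."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)                     # correctly placed boundaries
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# Toy example: 3 of 4 predicted boundaries match the 5 gold boundaries.
p, r, f = boundary_prf({10, 25, 40, 55}, {10, 25, 40, 60, 75})
# p = 0.75, r = 0.6, f = 2/3
```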

Design Your Method

Traditionally, sentence segmentation is cast as a binary classification task, where each potential boundary is classified either as the end of a sentence or not. If you decide to follow the traditional path, you will need to decide about the set of relevant features, and then apply one of the existing classifiers (see links below). You can also consider a model that takes into account the global properties of sentence segmentation (e.g., sentence length distribution). Tune all the parameters on the development set.
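As one possible starting point (not the required solution), the classification view can be sketched with a hand-rolled feature extractor and a plain perceptron. The feature names and the (word, pause) input format below are assumptions, since the actual transcript format may differ; in practice you would use one of the linked classifiers rather than this toy learner.

```python
# Sketch: each inter-word position is a candidate boundary, classified as
# end-of-sentence (+1) or not (-1) from lexical and pause features.

def features(words, pauses, i):
    """Features for the candidate boundary after position i (assumed format:
    parallel lists of words and after-word pause durations in seconds)."""
    return {
        "word=" + words[i]: 1.0,                                   # current word
        "next=" + (words[i + 1] if i + 1 < len(words) else "</s>"): 1.0,
        "pause": pauses[i],                                        # raw duration
        "long_pause": 1.0 if pauses[i] > 0.5 else 0.0,             # assumed cutoff
        "bias": 1.0,
    }

class Perceptron:
    def __init__(self):
        self.w = {}

    def score(self, feats):
        return sum(self.w.get(f, 0.0) * v for f, v in feats.items())

    def train(self, examples, epochs=10):
        # examples: list of (feature-dict, label) with label in {+1, -1}
        for _ in range(epochs):
            for feats, y in examples:
                if y * self.score(feats) <= 0:        # misclassified: update
                    for f, v in feats.items():
                        self.w[f] = self.w.get(f, 0.0) + y * v

# Fabricated toy data: long pauses mark sentence boundaries here.
words = ["so", "that", "works", "now", "consider", "this"]
pauses = [0.1, 0.05, 0.9, 0.1, 0.1, 1.2]
gold = [-1, -1, +1, -1, -1, +1]
examples = [(features(words, pauses, i), gold[i]) for i in range(len(words))]
clf = Perceptron()
clf.train(examples)
preds = [1 if clf.score(features(words, pauses, i)) > 0 else -1
         for i in range(len(words))]
```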

Analyze the Performance

Compute the learning curve of your method, and report the performance using various feature subsets. Consider other experiments that shed light on the merits and weaknesses of your method.
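A learning curve is obtained by training on growing prefixes of the training data and scoring each resulting model on the same held-out set. The sketch below uses a deliberately trivial pause-threshold segmenter as a stand-in for your model, and all data is fabricated; substitute your own classifier and transcripts.

```python
# Learning-curve scaffold: fit on prefixes of increasing size, score on test.

def fit_threshold(pauses, labels):
    """Pick the pause threshold that maximizes training accuracy."""
    best_t, best_acc = 0.0, -1.0
    for t in sorted(set(pauses)):
        acc = sum((p > t) == l for p, l in zip(pauses, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def f_measure(preds, golds):
    tp = sum(p and g for p, g in zip(preds, golds))
    fp = sum(p and not g for p, g in zip(preds, golds))
    fn = sum(not p and g for p, g in zip(preds, golds))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

train_pauses = [0.1, 0.9, 0.2, 1.1, 0.05, 0.8, 0.3, 1.0]
train_labels = [False, True, False, True, False, True, False, True]
test_pauses = [0.15, 0.95, 0.25, 1.3]
test_labels = [False, True, False, True]

curve = []
for n in (2, 4, 8):                          # growing training-set sizes
    t = fit_threshold(train_pauses[:n], train_labels[:n])
    preds = [p > t for p in test_pauses]
    curve.append((n, f_measure(preds, test_labels)))
# curve rises as the threshold is fit on more data
```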

What to Submit?

You have to submit a writeup that clearly explains your model, presents your results, and analyzes the model's performance. You also have to submit your code and the output of your model on the test set. The README file should clearly specify how to run your program.

Data

The data is in hw1-data.gz (GZ). It contains three directories with data for training, development, and testing; the data in each directory is annotated with pauses.

Relevant Links

Links to Classifiers

Homework 2

In this homework, you will explore corpus-based approaches to lexical semantics. More concretely, you will implement and analyze a method for clustering words based on their distributional properties. By evaluating the resultant clustering on two disambiguation tasks, you will explore the merits of different representations and study the properties of the learning method.

To train your method, you will use the lecture transcript corpus (GZ) from the first homework, and a 6.001 textbook source file (GZ).

What to do?

Implement

Your program has to cluster a given list of words into n groups based on their distributional patterns. You will first construct a word-by-word matrix that captures co-occurrence patterns of the given words. Then you will cluster the words using the EM algorithm. Your program should take as input a list of words to cluster and the number of clusters.
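The two steps above can be sketched as follows. This is an illustrative toy, not a reference solution: the corpus is fabricated, the context window (two words on each side) is an arbitrary choice, and the EM here fits a mixture of isotropic unit-variance Gaussians with a simple deterministic initialization; your own model may differ on all of these points.

```python
# Step 1: word-by-word co-occurrence matrix; step 2: soft clustering via EM.
import math

def cooccurrence(tokens, targets, window=2):
    """Rows: target words; columns: counts of vocabulary words in the window."""
    vocab = sorted(set(tokens))
    col = {w: j for j, w in enumerate(vocab)}
    mat = {w: [0.0] * len(vocab) for w in targets}
    for i, w in enumerate(tokens):
        if w not in mat:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                mat[w][col[tokens[j]]] += 1.0
    return mat

def em_cluster(rows, k, iters=20):
    """EM for k isotropic unit-variance Gaussians with uniform priors."""
    words = sorted(rows)
    # Deterministic init: means seeded from evenly spaced rows.
    means = [rows[words[c * len(words) // k]][:] for c in range(k)]
    for _ in range(iters):
        # E-step: responsibilities via softmax of negative squared distance.
        resp = {}
        for w in words:
            d = [sum((a - b) ** 2 for a, b in zip(rows[w], m)) for m in means]
            mn = min(d)
            e = [math.exp(-(x - mn) / 2.0) for x in d]
            z = sum(e)
            resp[w] = [x / z for x in e]
        # M-step: means become responsibility-weighted averages of the rows.
        for c in range(k):
            tot = sum(resp[w][c] for w in words)
            means[c] = [sum(resp[w][c] * rows[w][i] for w in words) / tot
                        for i in range(len(means[c]))]
    return {w: max(range(k), key=lambda c: resp[w][c]) for w in words}

# Fabricated corpus: "cat"/"dog" share contexts, as do "car"/"bus".
text = ("the cat sat here the dog sat here "
        "a car drove fast a bus drove fast").split()
rows = cooccurrence(text, ["cat", "dog", "car", "bus"])
assign = em_cluster(rows, k=2)
```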

Evaluate

The first evaluation task is pseudoword disambiguation. For each of the 50 words in this file (TXT), randomly substitute half of its occurrences in the corpus with its reverse (e.g., "procedure" becomes "erudecorp"). Now apply your clustering algorithm to the resulting list of 100 words, which contains the original words and their reverses. If you generate 50 clusters, how many of them will contain correct pairs (i.e., a word and its reverse)? The second evaluation task is part-of-speech disambiguation. This file (TXT) contains nouns, verbs, and adjectives. Apply your program to these words to create three clusters, and evaluate the quality of the clusters.
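The pseudoword substitution step can be sketched as below. The token-list representation of the corpus and the fixed seed (for reproducibility) are assumptions.

```python
# For each target word, reverse a random half of its occurrences in place.
import random

def make_pseudowords(tokens, targets, seed=0):
    rng = random.Random(seed)                  # seeded for reproducible splits
    out = list(tokens)
    for w in targets:
        idx = [i for i, t in enumerate(tokens) if t == w]
        for i in rng.sample(idx, len(idx) // 2):
            out[i] = w[::-1]                   # e.g. "procedure" -> "erudecorp"
    return out

tokens = "the procedure calls the procedure again".split()
mixed = make_pseudowords(tokens, ["procedure"])
# exactly one of the two occurrences is now "erudecorp"
```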

Analyze the Performance

Your analysis will focus on the impact of context representation and the features of training data on the quality of generated clusters. To analyze the contribution of contextual representation, consider different ways of constructing a word-by-word matrix (e.g., vary the dimensions of the matrix) and experiment with different definitions of context. To analyze the impact of training data, train your system on spoken and written parts of the corpus separately, and also on their combination. Do you observe any difference?

Do you reach similar conclusions when you analyze the performance of your clustering on the two evaluation tasks?

Can you find any regularity in the mistakes of your method?

What to Submit?

You have to submit a writeup that clearly explains the parameters of your models, presents your results, and analyzes their performance. You also have to submit your code and the output of your model. The README file should clearly specify how to run your program.