Improved source code editing for effective ad-hoc code reuse

Han, Sangmok

Author(s)

Han, Sangmok

DownloadFull printable version (16.53Mb)

Other Contributors

Massachusetts Institute of Technology. Dept. of Mechanical Engineering.

Advisor

David Wallace.

Terms of use

M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. http://dspace.mit.edu/handle/1721.1/7582

Metadata

Show full item record

Abstract

Code reuse is essential for productivity and software quality. Code reuse based on abstraction mechanisms in programming languages is a standard approach, but programmers also reuse code by taking an ad-hoc approach, in which text of code is reused without abstraction. This thesis focuses on improving two common ad-hoc code reuse approaches-code template reuse and code phrase reuse because they are not only frequent, but also, more importantly, they pose a risk to quality and productivity in software development, the original aims of code reuse. The first ad-hoc code reuse approach, code template reuse refers to programmers reusing an existing code fragment as a structural template for similar code fragments. Programmers use the code reuse approach because using abstraction mechanisms requires extra code and preplanning. When similar code fragments, which are only different by several code tokens, are reused just a couple of times, it makes sense to reuse text of one of the code fragments as a template for others. Unfortunately, code template reuse poses a risk to software quality because it requires repetitive and tedious editing steps. Should a programmer forget to perform any of the editing steps, he may introduce program bugs, which are difficult to detect by visual inspection, code compilers, or other existing bug detection methods. The second ad-hoc code reuse approach, code phrase reuse refers to programmers reusing common code phrases by retyping them, often regularly, using code completion. Programmers use the code reuse approach because no abstraction mechanism is available for reusing short yet common code phrases. Unfortunately, code phrase reuse poses a limitation on productivity because retyping the same code phrases is time-consuming even when a code completion system is used. Existing code completion systems completes only one word at a time. As a result, programmers have to repeatedly invoke code completion, review code completion candidates, and select a correct candidate as many times as the number of words in a code phrase. This thesis presents new models, algorithms, and user interfaces for effective ad-hoc code reuse. First, to address the risk posed by code template reuse, it develops a method for detecting program bugs in similar code fragments by analyzing sequential patterns of code tokens. To proactively reduce program bugs introduced during code template reuse, this thesis proposes an error-preventive code editing method that reduces the number of code editing steps based on cell-based text editing. Second, to address the productivity limitation posed by code phrase reuse, this thesis develops an efficient code phrase completion method. The code phrase completion accelerates reuse of common code phrases by taking non-predefined abbreviated input and expanding it into a full code phrase. The code phrase completion method utilizes a statistical model called Hidden Markov model trained on a corpus of code and abbreviation examples. Finally, the new methods for bug detection and code phrase completion are evaluated through corpus and user studies. In 7 well-maintained open source projects, the bug detection method found 87 previously unknown program bugs. The ratio of actual bugs to bug warnings (precision) was 47% on average, eight times higher than previous similar methods. The code phrase completion method is evaluated on the basis of accuracy and time savings. It achieved 99.3% accuracy in a corpus study and achieved 30.4% time savings and 40.8% keystroke savings in a user study when compared to a conventional code completion method. At a higher level, this work demonstrates the power of a simple sequence-based model of source code. Analyzing vertical sequences of code tokens across similar code fragments is found useful for accurate bug detection; learning to infer horizontal sequences of code tokens is found useful for efficient code completion. Ultimately, this work may aid the development of other sequence-based models of source code, as well as different analysis and inference techniques, which can solve previously difficult software engineering problems.

Description

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Mechanical Engineering, 2011.

Cataloged from PDF version of thesis.

Includes bibliographical references (p. 111-113).

Date issued

2011

URI

http://hdl.handle.net/1721.1/67583

Department

Massachusetts Institute of Technology. Department of Mechanical Engineering

Publisher

Massachusetts Institute of Technology

Keywords

Mechanical Engineering.

Collections

Doctoral Theses