Identifying and Merging Related Bibliographic Records

by Jeremy A. Hylton

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degrees of Master of Engineering in Electrical Engineering and Computer Science and Bachelor of Science in Computer Science and Engineering at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY, June 1996.

© Jeremy A. Hylton, 1996. All rights reserved. The author hereby grants to MIT permission to reproduce and distribute publicly paper and electronic copies of this thesis document in whole or in part, and to grant others the right to do so.

Author: Department of Electrical Engineering and Computer Science, February 13, 1996

Certified by: Jerome H. Saltzer, Professor of Computer Science and Engineering, Emeritus, Thesis Supervisor

Accepted by: Frederic R. Morgenthaler, Chairman, Departmental Committee on Graduate Theses

Abstract

Bibliographic records freely available on the Internet can be used to construct a high-quality, digital finding aid that provides the ability to discover paper and electronic documents.
The key challenge to providing such a service is integrating mixed-quality bibliographic records, coming from multiple sources and in multiple formats. This thesis describes an algorithm that automatically identifies records that refer to the same work and clusters them together; the algorithm clusters records for which both author and title match. It tolerates errors and cataloging variations within the records by using a full-text search engine and an n-gram-based approximate string matching algorithm to build the clusters. The algorithm identifies more than 90 percent of the related records and includes incorrect records in less than 1 percent of the clusters. It has been used to construct a 250,000-record collection of the computer science literature. This thesis also presents preliminary work on automatic linking between bibliographic records and copies of documents available on the Internet. The thesis will be published as M.I.T. Laboratory for Computer Science Technical Report 678.

Thesis Supervisor: Jerome H. Saltzer
Title: Professor of Computer Science and Engineering, Emeritus

Acknowledgments

I would like to thank:

- Jerry Saltzer, my thesis adviser, for his detailed criticism of this thesis. He helped me, in particular, to choose my words more carefully and use them more precisely.
- Mitchell Charity, for many conversations that helped to shape the design and implementation of the system described in this thesis. He frequently provided a valuable sounding board for ill-formed ideas.
- Tara Gilligan and my parents, Bill and Judi, for their patience and support as I finished a thesis that took longer than any of us expected. My father, who as a writer and editor is so familiar with finishing the manuscript, offered exactly the encouragement and commiseration I needed.
This work was supported in part by the IBM Corporation, in part by the Digital Equipment Corporation, and in part by the Corporation for National Research Initiatives, using funds from the Advanced Research Projects Agency of the United States Department of Defense under grant MDA972-92-J1029.

Contents

1 Introduction
  1.1 What is a digital library?
  1.2 Integrating bibliographic databases
  1.3 Overview of related work
    1.3.1 Networked information discovery and retrieval
    1.3.2 Libraries and cataloging
    1.3.3 Database systems
    1.3.4 Information retrieval
2 Cataloging and the Computer Science Collection
  2.1 Bibliographic relationships and cataloging
    2.1.1 Issues in cataloging and library science
    2.1.2 Taxonomy of relationships between records
    2.1.3 Expressing relations between Bibtex records
  2.2 Practical issues for identifying relationships
    2.2.1 Bibtex record format
    2.2.2 CS-TR record format
    2.2.3 MARC records
    2.2.4 Effects of cataloging practices
  2.3 Describing the contents of DIFWICS
    2.3.1 Subjects covered in collection
    2.3.2 Other characteristics of collection
3 Identifying Related Records
  3.1 Record comparison in the presence of errors
  3.2 Algorithm for author-title clusters
    3.2.1 Comparing author lists
    3.2.2 String comparisons with n-grams
    3.2.3 Performance of algorithm
  3.3 Other systems for identifying related records
    3.3.1 Duplicate detection in library catalogs
    3.3.2 The database merge/purge problem
  3.4 Analysis of author-title clusters
4 Merging Related Records
  4.1 Goals of merger and outline of process
  4.2 Creating the union record
    4.2.1 Problems with union records
    4.2.2 Refinements to merger process
  4.3 Cluster sizes and composition in DIFWICS
5 Presenting Relations and Clusters
  5.1 The basic Web interface
  5.2 Assessing the quality of union records
6 Automatic linking
  6.1 Searching for related Web citations
  6.2 Principles for fully-automated system
7 Conclusions and Future Directions
  7.1 Future directions
    7.1.1 Performance, portability, and production
    7.1.2 Improving the quality of records
    7.1.3 Identifying other bibliographic relationships
    7.1.4 Integrating non-bibliographic information
    7.1.5 Enabling librarianship and human input
  7.2 Conclusions

List of Figures

2-1 Number of
citations per year in bibliography collection and ACM Guide to Computing Literature
2-2 Distribution of records by size
2-3 Distribution of Bibtex records, arranged by the number of fields used
3-1 N-gram comparison of two strings with a letter-pair transposition
3-2 Errors in author-title clusters for 937-record sample
3-3 Two falsely matched records
4-1 CS-TR records for one TR from two publishers
4-2 Bibtex records exhibiting the conference-journal problem
4-3 Variation in field values within author-title clusters of the same type
5-1 Sample author-title cluster display from Web interface
5-2 Source match and field consensus ratios for cluster including a false match
5-3 Source match and field consensus ratios for correct cluster

List of Tables

2.1 Versions of the Whole Internet Catalog cataloged in the OCLC union catalog
2.2 Bibtex document types
3.1 Statistics for size of potential match pools
4.1 Cluster sizes
4.2 Source records, by type

Chapter 1

Introduction

The growth in the volume of information available via computer networks has increased both the usefulness of network information and the difficulty of managing it. The information freely available today offers the opportunity to provide library-like services.
Organizing the information, which is available from many sources and in many forms, is one of the key challenges of building a useful information service. This thesis presents a system for integrating bibliographic information from many heterogeneous sources that identifies related bibliographic records and presents them together. The system has been used to construct the Digital Index for Works in Computer Science (DIFWICS), a 240,000-record catalog of the computer science literature.

1.1 What is a digital library?

The term digital library is used widely, but little consensus exists about what exactly a digital library is. Before discussing the kind of information sources used to construct the library and how these sources were integrated, it is useful to clarify the term and explain how the particular vision of a digital library affected the design of the system.

The digital library envisioned here is a computer system that provides a single point of access to an organized collection of information. The digital library is, of course, not a single system, but a loose federation of many systems and services; nonetheless, in many cases it should appear to operate as a single system. Providing interoperability among these systems remains a major research topic [25]. One kind of interoperability that is sometimes overlooked is the interoperation between physical and digital objects in the library. Most documents exist today only on paper, and digital libraries must provide access to both paper and digital documents to be able to satisfy most users' needs.

The components of the digital library serve many purposes, including the storage of information in digital repositories, providing aids to discovering information that satisfies users' needs, and locating the information that users want. The integrated database of bibliographic records addresses the discovery process. It offers an index of document citations and abstracts.
1.2 Integrating bibliographic databases

The emphasis of the research reported here is to make effective use of the diversity of bibliographic information freely available on the Internet. At least 450,000 bibliographic records describing the computer science literature are freely available, mostly in the form of Bibtex citations; they describe a large part of the computer science literature, but they vary widely in quality and accuracy. Nonetheless, when combined with papers available from researchers' Web pages and servers and with on-line library catalogs, these records provide enough raw data to build a fairly useful library.

The bibliographic records present many challenges for creating an integrated information service. The records contain typographical and cataloging errors; there are many duplicate records; and there are few shared standards for entering information in the records. The records come from many sources: Some are prepared by librarians for their patrons, others come from individual authors' personal collections or are assembled from Usenet postings.

Although the heterogeneity of the source records poses a problem for combining all the records into a single collection, it can also be exploited to improve the quality of the overall collection. The heterogeneity provides considerable leverage on the problems of extracting the best possible information from records and providing links between closely related records (an observation made by Buckland, et al. [8]). By identifying related records, a union record can be created that combines the best information from each of the source records.

One of the primary contributions of this thesis is an algorithm for identifying related bibliographic records. The algorithm finds records that share the same author and title fields and groups them together in a cluster of records that all describe the same work; the algorithm works despite variations and errors in the source records.
The clusters produced by the algorithm may include several different but related documents, such as a paper published as a technical report and later as a journal article. A catalog of the records, with record clusters presented together, forms the core of DIFWICS.

DIFWICS is intended primarily as an aid to the discovery process, helping users to explore the information collected in the library. In addition to the index of bibliographic records, it provides a simple system for locating cited documents when they are available in digital form. This second system works equally well as a means of locating documents in a traditional library's catalog, which helps to locate physical copies, and provides the groundwork for a more ambitious automatic linking project. The preliminary automatic linking work, presented in Chapter 6, uses Web indexes like Alta Vista to find copies of papers on the Web. When papers are available on the Web, they are usually referenced by at least one other Web page, which includes a standard, human-readable citation along with a hypertext link. Some of the same techniques used to identify related bibliographic records in DIFWICS can be used to find related citations on the Web and to link the Web pages to the bibliographic records.

Two important characteristics of my work set it apart from related work. First, the collection is assembled without specific coordination among the information providers; this limits the overhead involved for authors and publishers to make their documents available and makes data from many existing services available for use in the current system. The second novel feature of this research is the algorithm for identifying related records. Systems for identifying related records in library catalogs and database systems use a single key for record comparison; my algorithm uses a full-text index of the records to help identify related records. The details of the duplicate detection algorithm are presented in Chapter 3.
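The exact matching procedure, with its real thresholds and its use of a full-text index, is presented in Chapter 3. As a rough illustration of the n-gram-based approximate string matching the thesis relies on, the sketch below scores two titles by the overlap of their character trigrams; the function names, the trigram size, and the thresholds in the assertions are illustrative assumptions, not the thesis's actual parameters.

```python
from collections import defaultdict

def ngrams(s, n=3):
    """Return the list of character n-grams in a normalized string."""
    s = " ".join(s.lower().split())  # fold case, collapse whitespace
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def ngram_similarity(a, b, n=3):
    """Dice-style overlap of two strings' n-gram multisets, in [0, 1]."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    counts = defaultdict(int)
    for g in ga:
        counts[g] += 1
    shared = 0
    for g in gb:
        if counts[g] > 0:   # consume one matching n-gram from the first string
            counts[g] -= 1
            shared += 1
    return 2.0 * shared / (len(ga) + len(gb))

# A letter-pair transposition (cf. Figure 3-1) barely lowers the score,
# while an unrelated title scores near zero.
t = "Identifying and Merging Related Bibliographic Records"
assert ngram_similarity(t, "Identifying and Mergnig Related Bibliographic Records") > 0.8
assert ngram_similarity(t, "Operating System Concepts") < 0.4
```

Because a single typo disturbs only the few n-grams that contain it, this kind of measure degrades gracefully with the errors and cataloging variations the source records actually exhibit.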
1.3 Overview of related work

This thesis builds on research in both computer science and library science. The problem of duplicate detection, for example, occurs in somewhat different forms in the database community (the multidatabase or semantic integration problem) and in the library community (creating union catalogs and the de-duplication problem). This section gives an overview of some related areas of research.

1.3.1 Networked information discovery and retrieval

Networked information discovery and retrieval (NIDR) is a broad category encompassing nearly any kind of information access using a large-scale computer network [26]. DIFWICS is an example NIDR application and is informed by a variety of work in this area.

This thesis touches at least tangentially on many NIDR issues (including locating, naming, and cataloging network resources), but the clearest connection is to several projects that have used the World-Wide Web and other Internet information resources to provide access to some part of the computer science literature. These systems work primarily with technical reports, because they are often freely available and organized for Internet access by the publishing organization.

The Unified Computer Science Technical Report Index (UCSTRI) [43] automatically collects information about technical reports distributed on the Internet and provides an index of that information with links to the original report.

The Harvest system [6] is a more general information discovery system that combines tools for building local, content-specific indexes and sharing them to build indexes that span many sites; these tools include support for replication and caching. The Harvest implementors developed a sample index of computer science technical reports. Harvest was designed to illustrate the principles for scalable access to Internet resources described in Bowman, et al. [7].
A third system is the Networked Computer Science Technical Report Library (NCSTRL) [12], which uses the Dienst protocol [21]. Dienst provides a repository for storing documents, a distributed indexing scheme, and a user interface for the documents in a repository.

All three systems rely, in varying degrees, on publishers for making documents available and providing bibliographic information. The publisher-centric model is limiting. Information not organized by publishers, including most online information, lacks standards for naming and cataloging.

Other systems, like Alta Vista and Lycos, index large quantities of information available via the World-Wide Web. They are not selective about the material they index, and are somewhat less useful for information retrieval purposes as a result. If a user is looking for a particular document, however, and has information like the author and title, these indexes can be quite useful for finding it (if it is available). Not only do these Web indexes index the papers themselves, but they index pages that point to the papers. Chapter 5 discusses some possibilities for integrating these citations with the more traditional bibliographic citations used to build the computer science library.

1.3.2 Libraries and cataloging

The recent development of library-like services for using network information, like UCSTRI or Harvest, parallels the traditional library community's development of large-scale union catalogs and online public access catalogs in the late 70s and early 80s. The OCLC Online Union Catalog merged several million bibliographic records and developed one of the first duplicate detection systems [18].

More recently, the library community has begun to re-evaluate its cataloging standards. Several papers [5, 15, 40, 48] suggest that catalogers should focus more on describing "works" (particular, identifiable intellectual works) rather than "documents" (the particular physical versions of a work).
For example, Shakespeare's play Hamlet is a clearly identifiable work; it has been published in many editions, each a "document." This thesis makes use of this distinction when it labels as duplicates records for different documents that instantiate a particular work. Levy and Marshall [24] raise similar questions about how people actually use libraries in their discussion of future digital libraries.

1.3.3 Database systems

Heterogeneous databases differ from more conventional database systems because they include distributed components that do not all share the same database model; the component databases may have different data models, query languages, or schemas [13]. One of the problems that arises in multidatabase systems is the integration of the underlying schemas to provide users with a standard interface.

Duplicate detection is closely related to integration of heterogeneous databases, but is complicated by the fact that bibliographic formats impose little structure on the data they contain; the wide variations in quality and accuracy that typify collections of Bibtex records further complicate the problem. Papakonstantinou, et al. [30] present a more thorough discussion of the differences between the integration of databases and the integration of information systems like bibliographic record collections.

The merge/purge system described by Hernandez and Stolfo [17] performs duplicate detection for mailing lists; it copes with variations and errors in the underlying data by making multiple passes over the data, each time using a different key to compare the records.

Mediators [45] are a different approach to the problem of integrating information from multiple sources. Mediators are part of a model of the networked information environment that includes database access as its lowest level and users and information-gathering applications at the top level.
The mediators operate in between the databases and the users, providing an abstraction boundary that captures enough knowledge about various underlying databases to present a new, uniform view of that data to users. Data warehousing extends the mediator model by creating a new database that contains the integrated contents of other databases rather than providing a dynamic mediation layer on top of them. DIFWICS integrates distributed bibliographies in a similar way.

1.3.4 Information retrieval

Duplicate detection in information retrieval is at the opposite end of the spectrum from database schema integration; information retrieval deals with the full text of documents with little or no formal structure. Duplicate takes on a wider meaning in this field: Consider a search for stories about an earthquake in a collection of newspaper articles; there are probably many stories about the earthquake, from many different news sources. The stories are duplicates because their content overlaps substantially and not because of some external feature like their title or date of publication. Yan and Garcia-Molina [49] provide a more detailed discussion of duplicate detection in this context.

Chapter 2

Cataloging and the Computer Science Collection

This chapter introduces the conceptual framework for creating, using, and relating bibliographic records. It also discusses some of the more practical issues associated with the specific records used in the experimental computer science library.

2.1 Bibliographic relationships and cataloging

Bibliographic records have traditionally described specific physical objects or units of publication. This thesis describes a different use of bibliographic information and a different focus for cataloging: I use bibliographic records to describe a work instead of a particular document.
The term work is used here to mean a unit of intellectual content, which may take on one or more published forms (or none at all); each published form is a document. A paper, presented at a conference and later revised and published in a journal or collected in a book, is a work, each version a different "document."

I emphasize the work over the document because I believe that the primary use of the library catalog is to find works. A patron looking for a copy of Hamlet is usually looking for the work Hamlet, and, depending on circumstance, any particular copy of Hamlet might do. The library catalog should help patrons identify the works available and then choose a particular document, based on the patron's needs.

The MIT library catalog, for example, returns a list of 17 items as the result of a title search for "Hamlet." It takes careful review to determine that the list contains eight copies of Shakespeare's play, three recorded performances, one book about performances, three musical works inspired by the play, and two copies of Faulkner's novel titled The Hamlet. (Copies that exist in various collected works do not appear in the search results.)

The Hamlet search illustrates the problem of cataloging particular documents but not the works they represent. Deciphering the search results took several minutes and required study of the source bibliographic records to determine exactly what each item was. The search results would be easy to understand if they were organized around works and relationships. The eight copies of the play are all versions of the same work, Hamlet, and the performances might be considered derivative works. In a music library, it might be useful to highlight the three distinct musical works, and note that each is related to the play.

This chapter discusses recent work in library science that suggests catalogers should focus on the work instead of the document.
It presents a taxonomy of bibliographic relations, which helps to clarify the difference between a work and a document, and it discusses the experimental collection these theories will be applied to, a collection of 250,000 computer science citations.

2.1.1 Issues in cataloging and library science

Standards for cataloging and for using bibliographic information are a current subject of research in the library community. Two major themes run through several recent papers [5, 15, 37, 48]:

- Library catalogs should make it easier for users to understand relationships between different entries. In particular, catalogs should identify particular works.
- Increasingly, bibliographic information is being used to find information in a networked environment, where the catalog of bibliographic records is less likely to describe the contents of a local library and more likely to be a union catalog.

The most recent theoretical framework for descriptive cataloging was formulated in the 1950s by Seymour Lubetzky. According to Wilson [48], Lubetzky suggested the library catalog serves two functions: the finding function and the collocation function.

The finding function. If a patron knows the author, the title, or the subject of the book, the catalog should enable him or her to determine whether the library has the book.

The collocation function. The catalog should show what the library has by a particular author or on a particular subject, and it should show the various editions or translations of a given work.

Current cataloging practice places more emphasis on the first function than the second. This emphasis is strange, Wilson says, because most discussions of the theoretical background conclude that patrons are not interested in particular documents, so much as in the works they represent.
The catalog standards result partly from the historical development of catalogs as simple shelf lists and partly from the ease of cataloging discrete physical objects, rather than works, which might constitute only part of an object or span several of them.

The emphasis on the physical object over the work is inadequate in several ways. The user's interest in the work rather than the document has already been noted. Smiraglia and Leazer [37] note that anecdotal evidence supports this claim and that catalog usage studies show that the bibliographic fields used to differentiate between variant editions are seldom used. Library catalogs and other collections of bibliographic records are used increasingly as networked information discovery and retrieval tools, where discovering what kinds of works exist is more important than finding out which works are in the local library. When a user wants an item, there are many retrieval options other than the local library, including an Internet search and inter-library borrowing.

Trends in publishing make the bibliographic model increasingly unwieldy: Electronic publishing and advances in paper publishing technologies have made it easy to change and update documents. As a result, it is now common for each new printing of a book to incorporate some corrections and additions or for authors to publish electronic copies of their papers that include changes made after print publication. OCLC cataloging rules require a new bibliographic record be created for each copy of a work that has a different date of impression and different text.

Heaney [15] cites a message from Bob Strauss to the Autocat mailing list that describes the problem; Strauss observes that between the original publication of Ed Krol's Whole Internet Catalog in 1992 and his search on Dec. 2, 1993, nine different versions had been cataloged in the OCLC union catalog (see Table 2.1).
The problem raises two questions: First, is it sensible to create new records to describe each version of a document? Second, should the average user be exposed to this level of detail? The answer to the first question is unclear; the answer to the second, in many cases, may be no.

Number  Holdings  Feature
1       323       1992
2       10        minor corr, 1992
3       883       [Corr ed]
4       968       1st ed (same as #1?)
5       51        July 1993, minor corr.
6       136       "May 1993"
7       116       [Corr ed] (1993)
8       19        [Corr ed]
9       132       "Feb. 1993; minor corr"
total   2638

Table 2.1: Versions of the Whole Internet Catalog cataloged in the OCLC union catalog

2.1.2 Taxonomy of relationships between records

Tillett [41] and others have developed taxonomies of bibliographic relationships. Tillett's taxonomy provides a useful vocabulary for discussing the different kinds of documents that describe the same work, as well as relationships between different works. The seven categories presented here are based on Tillett's taxonomy, although some of the categories are slightly different.

Equivalence relationship. Equivalence holds between records that describe the same document but in different media, e.g., reprint, microfilm, original book.

Derivative relationship. The derivative relation holds between different versions of the text, e.g., different editions or translations, arrangements or adaptations.

Referential relationship. The referential relationship holds when one document explicitly contains a reference or link to another document. Tillett describes reviews, critiques, abstracts, and other secondary references as "descriptive" relations; the referential category expands Tillett's definition to include other kinds of references between works, including citations of one work within another and survey articles. In an online environment, where a paper's citation list may be as accessible as standard bibliographic information, this expanded notion seems useful.

Sequential relationship.
The sequential, or chronological, relationship describes documents intentionally published as a sequence or group. Examples include the successive issues of a journal or volumes of a book series, like an almanac or encyclopedia.

Hierarchical relationship. The hierarchical relationship holds between the whole and the parts of a particular work. It applies to relations between a book and its chapters, and vice versa, or articles in a journal or collection. (Tillett prefers the term "whole-part" relationship.)

Accompanying relationship. The accompanying relationship holds between two documents that are intentionally linked, but not necessarily in a hierarchical relationship. Examples include a textbook and an instructional supplement, or a concordance, index, or catalog that describes another work (or group of works).

"Shared characteristic" relationship. The "shared-characteristic" relationship holds between bibliographic records that have the same value in a particular field. It is a kind of catch-all category that describes almost any sort of relationship; the relation seems to be most useful for fields like publisher, subject heading, or classification code, but year of publication or page count are also potentially shared characteristics. User queries are another example of a kind of shared-characteristic relationship. The results of a query all share the particular characteristic described by the query.

2.1.3 Expressing relations between Bibtex records

Identifying and representing each of the relationships described in the previous section is beyond the scope of this thesis. Instead, I focus on identifying a limited set of relationships and presenting the rough form of a user interface for those relationships. The algorithm presented in the next chapter identifies related records based on the author and title fields; if the fields are the same, it concludes the records describe the same work.
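The field-matching test can be pictured as exact comparison after simple normalization; the real algorithm, described in the next chapter, additionally tolerates errors. In the sketch below, the record dictionaries, the helper names, and the choice of normalization are illustrative assumptions only.

```python
import re

def normalize(s):
    """Fold case, collapse whitespace, and strip punctuation for comparison."""
    s = " ".join(s.lower().split())
    return re.sub(r"[^a-z0-9 ]", "", s)

def same_work(rec_a, rec_b):
    """Two records describe the same work if normalized author and title match."""
    return (normalize(rec_a["author"]) == normalize(rec_b["author"])
            and normalize(rec_a["title"]) == normalize(rec_b["title"]))

# Hypothetical records with trivial cataloging variations.
a = {"author": "J. A. Hylton",
     "title": "Identifying and Merging Related Bibliographic Records"}
b = {"author": "J. A. Hylton",
     "title": "Identifying and merging related bibliographic records."}
assert same_work(a, b)
```

Even this toy version makes the key design point visible: the comparison deliberately ignores every field except author and title, so records for, say, a technical report and the journal article it became still fall into the same cluster.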
Two relationships hold between records in such a cluster: equivalence and derivative. Some records will be duplicate citations of the same document and others will cite different documents that represent the same work.

Identifying works is difficult. Even when a human cataloger is reviewing two bibliographic records, it can be difficult to tell if they describe the same work without referring to actual copies of the cited work. A cluster is an algorithmically-generated set of related records, which may or may not be the same as the actual set of records for a particular work. A cluster generated by one algorithm may be better in one way or another than a cluster generated by a different algorithm. (Indeed, the next chapter describes several algorithms that identify only duplicate records and do not consider works.)

This thesis uses author-title clusters, which identify a work as a unique author and title combination. Any pair of records with the same title and same authors is considered equivalent for the purpose of creating an author-title cluster, although there may be unusual cases where this test does not discriminate between two different works.

Using the term equivalent requires some care; it sounds simple enough, but equivalence depends entirely on the context in which some equivalence test is applied. (Consider, for example, the four different equality tests in Common Lisp [38].) The records in an author-title cluster are equivalent for the purpose of identifying a work, but are probably not equivalent for the purpose of locating the work in a library or retrieving it across the network.

The two uses of equivalent above isolate two separate problems that must be addressed in a catalog that is work-centered, but constructed from bibliographic records that have not been prepared with this use in mind. The first problem is identifying the works described by the records in the catalog.
The second problem is identifying the separate documents in the author-title cluster and presenting them as different instances of the main work. This process involves identifying the different documents within the cluster and merging duplicate citations for each document into a single, composite record.

The derivative relation holds between the different documents in an author-title cluster, but identifying each different document is complicated by many factors, including the version problem and the difficulty of relying on Bibtex for finely-nuanced descriptions. I solve a simplified version of the problem by using the Bibtex entry type to identify the different, horizontal classes of records within a cluster. Thus, a cluster might be presented to the user as containing two document types, an article and a technical report, but would not distinguish between, say, different editions of a book.

2.2 Practical issues for identifying relationships

Bibliographic formats affect how well relationships and works can be identified. The format not only dictates what information can be recorded, but also tends to affect the practice of recording.

Many freely available bibliographic records use the Bibtex format [22], which is used to produce citation lists in the LaTeX document preparation system. About 90 percent of the records in DIFWICS are Bibtex records.

Being able to accept bibliographic records in any format is a design goal of DIFWICS, because it minimizes the need for coordination among publishers, catalogers, and libraries and maximizes the number of records available for immediate use. Because other bibliographic formats, like Refer or Tib, are less common than Bibtex records, the current implementation supports only one other bibliographic format, the CS-TR format developed as part of the Computer Science Technical Report Project (CS-TR) [19] and defined by RFC 1807 [23].
The two formats differ in both syntax and semantics, so using both formats interchangeably requires a common format that both can be converted into. The common format involves some information loss when one format captures more information about a particular field than the other format is capable of expressing. For example, the CS-TR format has separate fields for authors and corporate authors, but Bibtex has only a single author field for both kinds of author; the common format does not capture the distinction, because it is not possible to determine which kind of author is being referred to in Bibtex.

Several characteristics of the Bibtex and CS-TR formats affect the design of the library and the kinds of bibliographic relationships that can be identified. Bibtex is currently used as the common format, because Bibtex is capable of describing any document that can be described with a CS-TR record (albeit with some loss of information).

2.2.1 Bibtex record format

Bibtex files are used for organizing citations and preparing bibliographies. The format is organized around several different entry types (see Table 2.2), which describe how a document was published. Each type uses several of the two dozen standard fields to describe the publication.

Article         MastersThesis
Book            Misc
Booklet         PhDThesis
InBook          Proceedings
InCollection    TechReport
InProceedings   Unpublished
Manual

Table 2.2: Bibtex document types

The format is very flexible. There are few rules governing precisely how a field must be formatted, and users are encouraged to define their own fields as necessary. The individual fields fall into three categories, required, optional, and ignored, depending on the document type; the journal field is required for an Article, but ignored for a TechReport. Ignored fields allow users to define their own fields. Throughout this thesis, I call the required and optional fields the standard fields and the ignored fields non-standard.
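The three field categories can be illustrated with a small sketch. The field sets below are an illustrative subset, not the full Bibtex definition, and `classify_field` is my own name for the check, not code from the thesis:

```python
# Sketch of Bibtex's per-type field categories (illustrative subset only;
# real Bibtex defines thirteen entry types and about two dozen fields).
REQUIRED = {
    "article": {"author", "title", "journal", "year"},
    "techreport": {"author", "title", "institution", "year"},
}
OPTIONAL = {
    "article": {"volume", "number", "pages", "month", "note"},
    "techreport": {"type", "number", "address", "month", "note"},
}

def classify_field(entry_type, field):
    """Return 'required', 'optional', or 'ignored' for a field of an entry type."""
    t = entry_type.lower()
    if field in REQUIRED.get(t, set()):
        return "required"
    if field in OPTIONAL.get(t, set()):
        return "optional"
    # Ignored fields are where users are free to record their own data.
    return "ignored"
```

Under this sketch, `classify_field("Article", "journal")` yields `"required"` while `classify_field("TechReport", "journal")` yields `"ignored"`, mirroring the example in the text.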
Most of the specific fields are easy to use, process, and understand, like month, year, or journal, but a few fields contain unstructured information about the document being cited, notably note, annote, abstract, and keywords. In practice, the note, annote, and (non-standard) keywords fields often appear to be confused; the note field is intended for miscellaneous information to print with a citation and the annote field for comments about the cited document, such as would be included in an annotated bibliography.

2.2.2 CS-TR record format

The CS-TR format was designed specifically for universities and R&D organizations to exchange information about technical reports. The fields it uses are geared specifically towards describing technical reports, allowing a more detailed description of technical reports than the standard Bibtex fields, but this limits its usefulness for describing other documents.

CS-TR defines a few mandatory fields used for record management and 25 other fields, all of which are optional. Some of the fields can be easily converted to Bibtex: CS-TR date maps to Bibtex month and year (with loss of the day) and CS-TR title is the same as Bibtex title. Most of the CS-TR fields don't have an analogue in the standard Bibtex fields, and must be omitted or placed in a non-standard field or the note field.

2.2.3 MARC records

MARC is the predominant bibliographic format in the library community. While it is not used by the system presented here, the MARC record makes an interesting point of comparison.

The MARC record is a highly structured format; its use emphasizes precise labels for fields and detailed descriptions of the items being cataloged. Crawford [11] provides an overview of MARC and its use in libraries; he observes that all MARC records share five characteristics:

- Each record has a title or identifying name.
- Each object is produced, published, or released at a specific time, by a specific person or group.
- Each object is described physically.
- Most nonfiction objects have subjects: what the object is about.
- Notes are made about what is being cataloged, e.g. restrictions on use or notes about the reproduction.

The MARC format defines several hundred fields, many of which have subfields, that specify the format and content of the field values exactingly. The primary field used for author (field number 100, "Main Entry - Personal Name") has subfield codes for specifying the personal name, titles or dates associated with the name, and fuller forms of the name; another code indicates whether the personal name begins with a forename, single surname, or multiple surnames.

MARC's precision makes comparing records more difficult for several reasons. There is more opportunity for small errors in MARC; several different fields can be used to enter the same information; and there is some flexibility as to how much information must be entered. Users of electronic library catalogs will probably be familiar with the problem of determining when two author entries are the same: separate listings appear when one record has a date of birth, while another has dates of birth and death and a third may contain a fuller form of the author's name.

The practical implication of the differences between MARC records and citation-oriented records like Bibtex is that while Bibtex records are not as rich in information, they provide a much simpler structure for extracting information, like author and title, which are used to distinguish between different works.

2.2.4 Effects of cataloging practices

Bibtex's flexibility allows people to enter information in many ways. The use of abbreviations in field values is very common, which makes it difficult to compare two fields to see if they have the same value; ignoring capitalization, Communications of the ACM is abbreviated variously as "CACM", "C. ACM", "C.A.C.M.", "Comm. ACM", "Comm. of the ACM (CACM)", etc.

Notes about the document, e.g.
that it is an abstract only or that it is a revised edition, are entered in many different ways. Although the note field seems to be the most likely candidate for this information, it is variously entered in the title field (complicating comparisons), in the note or annote field, or in the edition field (where "2nd" is as likely as "second").

There appears to be less variation among the different sources of CS-TR records, because the definition of each field is fairly specific and because technical reports have fewer unusual cases than other document types. The problems of abbreviations and variations are less pronounced in CS-TR records because they are produced by the publishing institutions, which tend to be consistent within their own records. CS-TR records are also easier to handle because many of the fields are unused in the records available today; more than half the records use no more than seven descriptive fields.

2.3 Describing the contents of DIFWICS

DIFWICS incorporates bibliographic records from two major collections available on the Internet. The primary source is Alf-Christian Achilles' collection of 450,000 Bibtex records, titled "A Collection of Computer Science Bibliographies" [2]. This work organizes several hundred individual collections of varying size and quality. The second source is the CS-TR records produced by the five participants in the CS-TR project; the collection is composed of approximately 6,000 technical report records from Berkeley, Carnegie Mellon, Cornell, M.I.T., and Stanford.

The first collection requires some explanation to understand what kinds of records it provides and how they affect the system for identifying related records. The individual bibliographies fall into three major categories: personal bibliographies organized by individual researchers, journal and conference proceedings bibliographies, and bibliographies organized around a particular subject.
Although Bibtex is commonly used to prepare citation lists for papers, none of the source files suggest they were prepared for that purpose.

There are only a few personal bibliographies, but each is quite large and appears to have been created and checked with some care. Joel Seiferas's collection holds 43,000 theory citations and Gio Wiederhold's collection holds 10,000 citations, mostly about databases.

The journal and conference bibliographies tend to be fairly complete listings of the articles or papers published. The bibliography for the Journal of the ACM, for example, includes every article published from 1954 to 1995.

The topical collections range widely, from a 5,000-record collection on programming languages and compilers to a 24-record collection on fuzzy Petri nets.

2.3.1 Subjects covered in collection

Achilles has organized the collection into major subject areas and we have selected an arbitrary subset of the records in each category to include in DIFWICS, in all, about 240,000 records; the remainder of the collection had not been processed at the time of this writing. A brief description of each category and the number of records included from it follows. (Sizes are rounded to the nearest 5,000.)

Artificial Intelligence. 30,000 records covering most areas of AI, with the most extensive coverage of logic programming and natural language processing. Includes 13 journal bibliographies.

Compiler Technology and Type Theory. 20,000 records covering programming languages and compilers. Includes five journals and six conferences.

Databases. 20,000 records. Half from the personal collection of Gio Wiederhold. Includes one journal and one conference.

Distributed Systems. 10,000 records. Scattered coverage of networking, distributed systems (including Mach and Amoeba bibliographies), and mobile computing.

Graphics. 15,000 records covering topics including vision and ray tracing. Includes the complete SIGGRAPH bibliography and five journals.

Mathematics. 5,000 records.
Computer algebra, ACM Transactions on Mathematical Software, applied statistics.

Neural Networks. 5,000 records. Sampling of a few seemingly representative collections.

Object-Oriented Programming and Systems. 5,000 records. Includes OOPSLA and ECOOP proceedings.

Operating Systems. 10,000 records. Mostly from the Univ. of Erlangen library and a USENIX bibliography.

Parallel Processing. 15,000 records. Broad coverage, including transputers, multiprocessors, numerical methods, and parallel vision. Many journals and conferences.

Software Engineering. 10,000 records. Process modelling, specification, verification. Includes three journals.

Theory. 60,000 records. Major topics include concurrency, hashing, logic, term rewriting, cryptography, and computability. Mostly from the Seiferas collection, which includes citations from 34 journals and conferences. Also includes two journals and two conferences.

Miscellaneous. 35,000 records. Records that do not fit into the other categories or span multiple categories. Includes 5,000 records from the SEL-HPC archive, a 3,200-record Communications of the ACM bibliography, and 16 other journals.

Figure 2-1: Number of citations per year in bibliography collection and ACM Guide to Computing Literature

Half the records included in the computer science library cite documents pub-

Figure 2-2: Distribution of records by size

The source records range in size from 50 bytes to 10,000 bytes, but more

Figure 2-3: Distribution of Bibtex records, arranged by the number of fields used

Chapter 3

Identifying Related Records

Libraries have faced the problem of duplicate records for more than 20 years; as they developed computer systems for managing library catalogs and, in particular, for sharing bibliographic records in a networked environment, duplicate records were identified as a problem and techniques for eliminating them were developed.
Traditional de-duplication, however, is significantly different from identifying all the citations of a particular work; in many ways, identifying works is easier.

This chapter describes in detail the algorithm for identifying author-title clusters and the characteristics of the source records that make this problem hard. It presents a novel algorithm that uses randomized full-text searches to detect potential duplicates. The algorithm is analyzed and two kinds of failures are examined: including a record in a cluster when it cites a different work (false match) and creating two different clusters whose members describe the same work (missed match). The chapter also surveys other approaches to identifying related records: two early systems used for de-duplication in library catalogs and a recent multidatabase integration system.

3.1 Record comparison in the presence of errors

The record comparison test addresses three sources of variation between records that describe the same document: entry errors, which are primarily typographic; differences in cataloging standards; and abbreviations.

Entry errors are unintentional errors made during the creation and transcription of bibliographic records. The most common entry errors are simple misspellings and typographical errors, but improperly formatted records and records which omit required fields are also common. Typical format errors include placing information in the wrong field, such as putting the month and year in the year field, or not using the required syntax for a field, e.g. entering the year as a two-digit number instead of a four-digit number.

Misspellings and typographical errors are common to most text databases and the literature about them is substantial.
O'Neill and Vizine-Goetz's overview of quality control in text databases [29] provides a thorough, though somewhat dated, discussion of the sources of errors and some techniques for correcting them; Kukich [20] surveys techniques for automatically detecting and correcting spelling errors.

While entry errors are an unintentional source of variability, differing cataloging standards are often intentional. Many bibliographies apply a consistent set of cataloging rules, but the standards are quite different from one bibliography to another. Cataloging standards affect what kind of information is recorded in a citation and where it is recorded. If a citation describes an extended abstract, for example, some bibliographies record that fact in the title field and others record it in the note or annote field.

Cataloging standards also determine what document type the citation is assigned or whether a new citation is created for each instance of publication. When an article is published more than once, some databases include a record that describes one of the publications fully and includes information about the other publication in one of the comment fields. Similarly, many conference proceedings are published as issues of a journal; sometimes these articles are cataloged as Bibtex InProceedings citations and other times as Article.

Authors' names are a particular source of trouble. The same name can be cataloged many ways: last name only, first initial and last name, all initials and last name, first name and last name, etc. In a list of authors, some bibliographies include only the first author (only sometimes indicating that there are other authors) and others include all the names. Unusual names frequently cause problems. In a sample of 1,000 records, 14 uses of "Jr." were found, but none follow the specified format [22]. The correct placement of "Jr." within the author string is: "Steele, Jr., Guy L."
Abbreviations are another source of problems caused by differences in cataloging. Journal and conference names and publishers are abbreviated in many non-standard ways, making comparisons using these fields difficult. (Bibtex allows users to define abbreviations within a Bibtex file; the abbreviations are expanded when a bibliography is processed. These are not the abbreviations discussed here; rather, the problem lies with abbreviations in the actual text of the field.)

The problems posed by abbreviations are not addressed here. Abbreviations are seldom used in the author or title fields, so they do not present a problem for identifying related records. The problem is discussed in somewhat more detail in Chapter 6 in the section on authority control.

3.2 Algorithm for author-title clusters

The algorithm for identifying clusters of related records considers each record in the collection and determines whether it has (approximately) the same author and title fields as any other record. There are two primary concerns for the algorithm's design. First, the algorithm should avoid comparing each record against every other record, an O(n^2) proposition. Second, the algorithm, and in particular the test for similarity, must cope with the kinds of entry errors described above.

The algorithm uses two rounds of comparisons to identify author-title clusters. The first round creates a pool of potentially matching records using a full-text search of the entire collection; the pool consists of the results of three queries, with words randomly selected from the author and title of the source record. (The index, however, includes words that appear anywhere in a record, not just in the author and title fields.) In the second round, each potential match is compared to the source record, and an author-title cluster is created for the matching records.

Three different queries are used to construct the potential match pool.
Each query includes one of the authors' last names and two words from the title field. Title words that are only one or two letters long, or that are in the stoplist (which includes the 50 most common words in the index), are not used. In cases where there are not enough words to construct three full queries, queries use fewer words. (And in some extreme cases only one or two queries are performed.)

The second phase of testing compares the author and title fields of each record in the potential match pool against the source record. It uses an n-gram-based approximate string matching algorithm to compare the two fields. The string matching algorithm is explained in detail below.

The algorithm is applied to every source record, regardless of whether it has been placed in a cluster already; thus, a cluster with five records will be checked by five different passes. The algorithm enforces transitivity between records: if record A matches record B and record B matches C, all three are placed in the same cluster, whether or not A matches C.

The algorithm is tolerant of format errors and typographic errors. Given a pair of records that cite documents with the same author and title, the potential match pool for one record will contain the other as long as one of the three queries contains words that match the second record; a small number of errors is unlikely to cause a problem. The approximate string match will tolerate variations as large as transposed or missing words in a long title string.

3.2.1 Comparing author lists

Comparing two entries' author fields is more complex than comparing title fields, because there is much wider variation in how names are entered than in how titles are entered. Instead of comparing author strings directly, the algorithm performs a more complicated comparison.
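The first-round query construction described at the start of Section 3.2 can be sketched as follows. This is a sketch under stated assumptions, not the thesis implementation: `make_queries` is my own name, and the stoplist is assumed to be supplied by the caller (in the real system it is derived from the index).

```python
import random

def make_queries(last_names, title, stoplist, n_queries=3):
    """First-round query construction: each query pairs one author's last
    name with two randomly chosen usable title words."""
    # Title words of one or two letters, or in the stoplist, are not used.
    words = [w for w in title.lower().split()
             if len(w) > 2 and w not in stoplist]
    queries = []
    for _ in range(n_queries):
        query = []
        if last_names:
            query.append(random.choice(last_names).lower())
        # With too few usable title words, fall back to shorter queries.
        query.extend(random.sample(words, min(2, len(words))))
        if query:
            queries.append(query)
    return queries
```

Each resulting query is a small bag of terms handed to the full-text search engine; the union of the three result sets forms the potential match pool.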
To compare two author entries, each author string is separated into a list of individual authors and each name is compared; the name comparison considers each part of the name (first name, last name, etc.) separately. If the names and the order of the names both match, then the two author fields are considered to be the same. When one author list contains matching names, in the same order as the first list, but omits names from the end of the list, the two strings are also considered the same; testing suggested it was likely either that there was a cataloging error or that the same work had been published with slightly different author lists. Finally, an author list that includes a single name and the notation "and others" will match any list of two or more names that has a matching first author. (The Bibtex format specifies the use of "and others," so variants like "et al." are not handled.)

Individual author names are compared by separating the two name strings into four parts: first name, middle name, last name, and suffix (e.g. Jr.). Two parts of a name match if any of three conditions is met:

1. A trigram comparison reports the strings are the same.

2. One of the strings is an initial and the other string is a name that starts with that initial, or both strings are the same initial.

3. Either of the strings is blank.

Thus, "G. Steele" and "Guy L. Steele Jr." are considered the same. However, "B. Clifford Neuman" and "Clifford Neuman" are not considered the same (even though Neuman is cited both ways).

3.2.2 String comparisons with n-grams

An n-gram is a vector representation that includes all the n-letter combinations in a string. The n-gram vector has a component for every possible n-letter combination. (For an alphabet which contains the letters 'a' to 'z' and the numbers '0' to '9', the vector space is 36^n-dimensional.)
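The name-matching conditions of Section 3.2.1 can be sketched as below. The function names and the four-part tuple representation are mine, and `same` stands in for the trigram comparator described in Section 3.2.2:

```python
def initial_of(s):
    """Return the lowercased letter if s is an initial ('G' or 'G.'), else None."""
    s = s.rstrip(".")
    return s.lower() if len(s) == 1 else None

def name_part_matches(a, b, same):
    """Compare one part of a name (first, middle, last, or suffix).
    `same` is an approximate string comparator, e.g. the trigram test."""
    if not a or not b:                     # condition 3: either part blank
        return True
    ia, ib = initial_of(a), initial_of(b)
    if ia is not None or ib is not None:   # condition 2: initial vs. name
        ca = ia if ia is not None else a[0].lower()
        cb = ib if ib is not None else b[0].lower()
        return ca == cb
    return same(a, b)                      # condition 1: trigram comparison

def author_matches(parts_a, parts_b, same):
    """parts_* are (first, middle, last, suffix) tuples for one author."""
    return all(name_part_matches(x, y, same) for x, y in zip(parts_a, parts_b))
```

With this sketch, ("G.", "", "Steele", "") matches ("Guy", "L.", "Steele", "Jr."), while ("B.", "Clifford", "Neuman", "") fails against ("Clifford", "", "Neuman", "") because the first-name parts "B." and "Clifford" disagree, mirroring the examples above.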
The n-gram representation of a string has a non-zero component for each n-letter substring it contains, where the magnitude is equal to the number of times the substring occurs. The string comparison algorithm uses 3-grams, or trigrams, so I will limit further discussion to trigrams.

Formally, a trigram vector A_s for a string s is defined as

    A_s = { a_aaa, a_aab, ..., a_d5f, ..., a_999 },

where a_aaa is the number of times "aaa" appears in s.

The trigram vector for the string "record" contains four components: "rec", "eco", "cor", and "ord." If the trigram "eco" appeared twice in the string, then a_eco = 2.

The string comparison algorithm forms trigram vectors for the two input strings and subtracts one vector from the other. The magnitude of the resulting vector difference is compared to a threshold value; if the magnitude of the difference is less than the threshold, the two strings are declared to be the same. If each trigram appears only once in an input string, the difference vector is the same as the list of trigrams that appears in one string but not the other. In general, the magnitude of the difference vector for two inputs A and B is:

    ||D|| = sqrt( sum over i from aaa to 999 of (a_i - b_i)^2 )

The threshold value was determined experimentally, using several dozen pairs of strings which had been labelled the same or different, based on how the algorithm should compare them. The threshold that worked best varied linearly with the total number of distinct trigrams in the two input strings. The threshold T used for comparing two strings with a total of n trigrams is

    T = 2.486 + 0.025n

The variable threshold allowed the test to tolerate several errors in a long title string, such as missing words or multiple misspellings, and still work well with short title strings. A higher, fixed threshold would have admitted many short strings that had completely different words.
The threshold was chosen using a collection of about 50 pairs of strings, each of which was labelled a priori as the same or different. The computed threshold minimized the number of pairs that were incorrectly identified. The sample strings, however, were not chosen with great care and the effects of small variations in the threshold were not studied.

The trigrams are produced by converting all strings to lower case letters and numbers; all spaces and punctuation are removed. Most n-gram implementations pad the string with leading and trailing spaces, so that "record" would produce trigrams including "  r" and " re." My implementation, however, begins the first trigram with the first letter and ends with the trigram that includes the last three letters; this test is slightly more tolerant of errors at the beginning and end of strings.

Consider the example comparison in Figure 3-1 between the two strings "Machine Vision" and "Machien Vision."

    s1 = "Machine Vision"
    A_s1 = { mac, ach, chi, hin, ine, nev, evi, vis, isi, sio, ion }

    s2 = "Machien Vision"
    A_s2 = { mac, ach, chi, hie, ien, env, nvi, vis, isi, sio, ion }

    D = A_s1 - A_s2 = { hin, hie, ine, ien, nev, env, evi, nvi }

    ||D|| = sqrt(1^2 + 1^2 + 1^2 + 1^2 + 1^2 + 1^2 + 1^2 + 1^2) = 2.828
    T = 2.486 + 0.025 * 15 = 2.861

Figure 3-1: N-gram comparison of two strings with a letter-pair transposition

The difference vector D contains eight trigrams, each appearing once, so the magnitude of the difference is 2.828. The threshold, based on a count of 15 distinct trigrams in the input, is 2.861. The strings are declared to be the same, because the magnitude of the difference is just less than the threshold.

The trigram string comparator worked well enough to create author-title clusters that tolerated small errors, but did not include fields that a human cataloger would consider distinct. There were some problems, however, and it seems likely that it could be improved.
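The trigram comparator described in this section can be sketched in a few lines; the sketch reproduces the Figure 3-1 numbers. The function names are mine, and this is an illustration rather than the thesis's actual implementation:

```python
import math
import re
from collections import Counter

def trigrams(s):
    """Lowercase, strip everything but letters and digits, then take every
    3-letter window (no space padding, per the implementation described)."""
    s = re.sub(r"[^a-z0-9]", "", s.lower())
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def same_string(s1, s2):
    """Declare two strings the same when the trigram difference magnitude
    is within the linear threshold T = 2.486 + 0.025n, where n is the
    number of distinct trigrams in the two inputs."""
    a, b = trigrams(s1), trigrams(s2)
    keys = set(a) | set(b)               # n distinct trigrams overall
    diff = math.sqrt(sum((a[k] - b[k]) ** 2 for k in keys))
    return diff <= 2.486 + 0.025 * len(keys)
```

For "Machine Vision" vs. "Machien Vision" the difference magnitude is sqrt(8) = 2.828 against a threshold of 2.861, so the pair is accepted, exactly as in the worked example.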
Many similarity functions for string and document vectors are discussed in the information retrieval literature; Salton [33] offers a brief overview.

3.2.3 Performance of algorithm

The first round, which identifies a pool of potential matches, greatly reduces the number of comparisons that must be made to create author-title clusters. The brute force approach (compare each record to every other record) requires O(n^2) comparisons:

    (n-1) + (n-2) + (n-3) + ... + 1 = (n^2 - n) / 2

The pool of potential matches tends to be very small; the average pool contained fewer than 30 records when the entire 240,000-record collection was processed. The savings over pairwise comparison is big: 7.5 million comparisons instead of 31.2 billion.

The number of comparisons that must be performed grows with the average pool size, but it is difficult to judge how the collection size affects the average pool size. The pools are the results of a full-text search that includes one of the authors' last names (unless no authors are cited); although the number of occurrences of any word tends to grow linearly with collection size, names behave differently. The number of times "Szolovits" appears, for example, is a function of two things: how many papers Szolovits has written and how many duplicate citations for these papers are in the collection.

A closer examination of the potential pools that were produced while processing the full collection helps to map out the possibilities for collections up to hundreds of thousands of records, but does not provide a definitive answer. I recorded the size of the potential match pools for the first 5,800 records, based on a collection with 50,000, 125,000, and 240,000 records. (Only a single pool was created, but records outside the first 50,000 were ignored for the first sample and records outside the first 125,000 were ignored for the second sample.) The distribution of pool sizes is roughly exponential.
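The savings figures quoted above can be checked directly; the arithmetic below assumes the 250,000-record collection size given in the abstract and an average pool of 30 records:

```python
n = 250_000                      # records in the collection (abstract figure)
avg_pool = 30                    # approximate average potential-match pool

pairwise = (n * n - n) // 2      # brute-force: (n^2 - n) / 2 comparisons
pool_based = n * avg_pool        # one pool of ~30 candidates per record

print(f"{pairwise:,} vs {pool_based:,}")  # 31,249,875,000 vs 7,500,000
```

This recovers the "7.5 million comparisons instead of 31.2 billion" contrast in the text.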
The majority of the pools contained fewer than 13 records in each sample. The tail of the exponential grows longer as the collection size increases; the largest pool increases from 14,294 for the 50,000-record sample to 45,750 for the 240,000-record sample.

Size of collection   Average pool size   Median pool size
50,000               15.76               8
125,000              18.30               10
240,000              32.29               13

Table 3.1: Statistics for size of potential match pools

Most of the pools are quite small, but occasionally large pools are returned; the latter are often caused by books with some missing information, e.g. no author, or very common title words, e.g. "introduction to algorithms", or, worst of all, both. A reasonable optimization might be to ignore records that produce extremely large potential pools, and treat them as failures for which related records cannot be recognized.

3.3 Other systems for identifying related records

Several papers describe specific systems for identifying duplicate records. Two early duplicate detection projects are surveyed below, the IUCS scheme [47] and the original OCLC On-Line Union Catalog [18]; O'Neill et al. [28] present some characteristics of duplicate records in the OCLC catalog. Toney [42] describes a more recent effort, and provides a good overview of the design space for library duplicate detection systems.

The OCLC and IUCS projects were completed in the late 70s, when the library community had just begun sharing bibliographic records among institutions and building large online union catalogs. Subsequent work reflects the general approach mapped out in these two seminal studies.

The database merge/purge problem, described by Hernandez and Stolfo [17], is very similar to the duplicate detection problem. In the merge/purge problem, several different databases with similar records must be merged and the duplicates purged; an example is joining several mailing lists, which may contain inconsistent or incorrect addresses.
The recent information retrieval literature contains several reports from Stanford on duplicate detection in a Usenet awareness service [49] and several schemes for copy detection in a digital library [35, 36]. Duplicate detection in information retrieval is a substantially different problem, because the actual content of two documents is compared; the reported work discusses several different approaches to identifying overlapping content at the granularity of words and sentences.

3.3.1 Duplicate detection in library catalogs

The key difference between the OCLC and IUCS algorithms and my algorithm is that the former identify duplicate records. Records are duplicates when they describe exactly the same document and not just the same work. Although the precise rules for defining a duplicate vary somewhat between the algorithms, they both compare many fields. The OCLC rules require that the following fields match for a pair of records to be considered duplicates: author, title, publisher, copyright date, date of impression (if text differs), edition, medium, series, page or number of volumes, illustrator, and translator.

Most techniques for duplicate identification share the same gross structure: they use two rounds of comparisons, identifying possible duplicates on the first round and making a more careful examination on the second round. The first round uses a fixed-length key, created from one or more field values; records with the same key are considered potential duplicates. A longer key, incorporating several fields, is used for the second round. (The second-round keys seem to be used primarily to avoid loading entire records into memory for comparison, a concern more important in 1979 than today.)

Duplicate detection keys summarize the information contained in the record; they consist of a series of fixed-length encodings of field values. In the OCLC and IUCS schemes, different keys are used for round one and round two of duplicate identification.
Fields are typically encoded in one of three ways:

1. Selecting certain characters from the field. The OCLC scheme uses characters 1, 2, 3, 5, 8, 13, 21, and 34 from a normalized title string to construct one of its keys. Character selection rules can be more complicated, specifying certain characters from specific words, such as the first and fourth characters of the second word.

2. For fields like reproduction code, which has a limited range of values, each possible value can be assigned a specific code.

3. More complicated encodings construct hashes of normalized field values. The second round of the OCLC scheme hashes the entire title into a 109-bit key. The key is built by converting the trigrams that do not span words to integers and applying a hash function to the sequence of integers. The IUCS scheme builds a Harrison key [14] from trigrams. The Hamming distance between two keys can be compared, allowing tolerance of typographical errors and other small variations between two strings.

The IUCS scheme uses a short, fixed-length title-date key for its first round, taking eight characters from the beginning and end of titles plus some date information. (MacLaury [27] described the choice of title characters.) Records that have the same title-date key are compared using three more keys: the first five characters of the author field, a 72-bit Harrison key of the title field, and the highest number taken from the various page fields.

The OCLC scheme is similar. The first-round key looks for exact matches of four fields: the date, record type, reproduction code, and the 8-character title selection. The second round compares the entire key, but defines 16 different conditions under which the keys can be declared a match. These conditions examine different parts of the key, looking for either an exact match or a partial match.
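As a concrete illustration of the third encoding style, the sketch below builds a Harrison-style bit key from the trigrams of a title and compares keys by Hamming distance. The 72-bit width matches the IUCS title key mentioned above, but the hash function, normalization, and all other details are my assumptions, not the published schemes.

```python
import zlib

# Sketch of a Harrison-style trigram key: each trigram of the normalized
# title sets one bit in a fixed-width key. The CRC-based hash and the
# normalization are stand-ins; the real IUCS/OCLC encodings differ.

def trigram_key(title, width=72):
    key = 0
    text = "".join(c if c.isalnum() else " " for c in title.lower())
    for word in text.split():
        for i in range(len(word) - 2):
            trigram = word[i:i + 3]            # trigrams do not span words
            key |= 1 << (zlib.crc32(trigram.encode()) % width)
    return key

def hamming(a, b):
    """Small distances tolerate typos and minor cataloging variations."""
    return bin(a ^ b).count("1")

def partial_match(a, b):
    """OCLC-style partial match: one key's bits are a subset of the other's."""
    return (a & b) in (a, b)
```

With keys like these, a second round can declare a match on an exact key comparison, a small Hamming distance, or a subset (partial) match, depending on the rules in force.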
Partial matches are defined for only a few fields of the key; a partial match of Harrison keys occurs when the bits set in one key field are a subset of those set in the other key. One of the conditions for matching illustrates the approach: two records are duplicates if there is an exact match of the Government Document Number and page fields, and partial matches of the LCCN, ISBN, edition, and series fields; partial matches occur when one record has no value in a particular field.

The QUALCAT expert system [31] is an interesting alternative to the key-based comparisons. A team of catalogers developed a set of rules that describe whether a certain combination of field values makes it more or less likely that two records are duplicates. Each record pair is described by a possibility that it is a duplicate (poss) and the certainty that it is a duplicate (cert). Initially, cert is 0 and poss is 100. Each rule increases cert or lowers poss. Once all possible rules have been applied, a pair of records with sufficiently high cert and poss is automatically labelled duplicates, records with low values are not, and records in between are referred to a human cataloger for review.

The expert system was used to compare records during the second round of QUALCAT testing. The first round was a fairly typical key-based filter using the Universal Standard Bibliographic Code (USBC). The USBC is similar to the OCLC and IUCS first-round keys, but Toney [42] notes that it is more dependent on clean data. Quantitative results of the QUALCAT system were not provided.

In the OCLC study, Hickey and Rypka [18] observe three major causes of failure to match duplicates. A difference in the first 34 characters of the title string, from which the initial date-title key is drawn, causes 12 percent of the missed matches. The two other causes are differences in place of publication and pagination.
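The QUALCAT-style poss/cert bookkeeping described above can be sketched in a few lines. The two rules, their weights, and the decision thresholds below are invented for illustration; they are not taken from the published system.

```python
# Sketch of the QUALCAT-style scheme: rules raise cert or lower poss, and
# the final (cert, poss) pair decides the outcome. The rules, weights, and
# thresholds here are hypothetical, chosen only to show the mechanism.

def same_isbn(a, b):
    match = a.get("isbn") and a.get("isbn") == b.get("isbn")
    return (70, 0) if match else (0, 0)         # strong evidence: raise cert

def different_edition(a, b):
    if a.get("edition") and b.get("edition") and a["edition"] != b["edition"]:
        return (0, 50)                          # strong evidence: lower poss
    return (0, 0)

def assess(a, b, rules):
    poss, cert = 100, 0
    for rule in rules:
        d_cert, d_poss = rule(a, b)
        cert, poss = cert + d_cert, poss - d_poss
    if cert >= 60 and poss >= 60:
        return "duplicate"
    if poss < 40 or cert == 0:
        return "distinct"
    return "refer to cataloger"
```

The interesting property of this design is the middle band: pairs the rules cannot decide are routed to a human cataloger rather than forced into either bin.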
One question that remains to be answered is whether the probabilistic first round proposed here offers any improvement over the key-based approach. Because neither system was available for testing, we can at best hypothesize that the randomized text-search approach is more tolerant of errors. Word-order errors, differences in cataloging, and typographical errors would all cause problems for the date-title key. The randomized text-search approach can tolerate these errors as long as one of the three queries uses words that do not contain typographical errors. It is not affected by word-order errors or by cataloging variations that affect which words are included in the title.

3.3.2 The database merge/purge problem

The approach to the merge/purge problem used by Hernandez and Stolfo [17] is similar to the one used for duplicate detection. Their algorithm uses keys to partition the data into sets small enough that they are willing to apply computationally intensive comparisons. The source data varies widely, so any particular key is likely to miss many records; to overcome this problem, their algorithm makes several independent runs over the data with different keys and computes the transitive closure of the results. The accuracy of their test jumped from 70 percent with a single pass to 90 percent with three passes.

3.4 Analysis of author-title clusters

The algorithm for creating author-title clusters can fail in two ways. It can create clusters that include citations for more than one work (a false merge) or it can create two separate clusters that both contain citations for the same document (a missed match). The author-title cluster algorithm keeps these failures to a reasonable minimum; analysis of a sample set indicates that less than 1 percent of the clusters contained false merges and that missed matches occurred about 5 percent of the time.

The clusters were analyzed using a 937-record sample.
The sample was selected so that each record shares at least three title words with one or more other records; this characteristic was intended to increase the possibility of false merges. The clustering algorithm was run on the sample, and its results were examined by hand. The algorithm determined that the sample contained 554 unique clusters.

After the algorithm was run, I checked each cluster to verify that each record described the same document. I checked for missed matches by looking for records in other clusters that contained similar title fields. I identified potential missed matches of a record by doing independent searches for each word in the title. (The searches used agrep, an approximate string searching program that does not use an n-gram approach.) If a different record was returned by at least two-thirds of the agrep searches, it was compared by hand with the source record.

According to my manual check of the records, the sample actually contained 554 unique works, 250 of which were cited by more than two records. The author-title clusters contained four false merges and 30 missed matches.

Clusters identified   554
Missed matches        30
False merges          4

Figure 3-2: Errors in author-title clusters for 937-record sample

The four false merges all involved closely related bibliographic records. In three of the four clusters, the two different works were part of a series, where each paper was labelled part one or part two. Figure 3-3 shows an example. The papers had the same authors, and the approximate string match reported that the titles were the same because they differed only in the final trigram.

@Article{Mulmuley90,
  title = "A Fast Planar Partition Algorithm, I",
  author = "Mulmuley",
  year = "1990",
  journal = "Journal of Symbolic Computation",
  volume = "10",
}

@Article{Mulmul91,
  title = "A Fast Planar Partition Algorithm, II",
  author = "K. Mulmuley",
  year = "1991",
  month = "[1]",
  pages = "74--103",
  journal = "Journal of the ACM, JACM",
  volume = "38",
  number = "1",
}

Figure 3-3: Two falsely matched records

The final false merge involved two different papers with very similar titles, "Towards Dataflow Analysis of Communicating Finite State Machines" and "Dataflow Analysis of Communicating Finite State Machines", and the same authors.

The sample set is small enough that it is difficult to extrapolate the results to a large collection with great precision, but it appears that the failure rate would be acceptably low. Measured in percentages, false merges occurred in 0.7 percent of the clusters and 5.7 percent of the clusters were missed matches that should have been merged with another cluster.

I also used the results of two other algorithms to gauge the relative success of the author-title clustering algorithm presented here. The bibmerge program [1] creates title-date clusters; if two records have the same title and date, they are placed in the same cluster. Bibmerge has only limited utility as a benchmark, because it uses a very simple detection scheme that is not tolerant of formatting and typographical errors. The program's source contains the comment: "This script is written in the most unprofessional manner available to me, the only reason why I have not scrapped it is that it works." Nonetheless, it was the only other duplicate detection system available for testing on the same sample set.

Bibmerge found 650 unique clusters; it missed 126 matches and made one false merge. The title-date clusters identified 10 record pairs that were missed by the author-title clusters, because the title and date fields were the same but the author was different. In seven cases, the author fields were improperly formatted. In two cases, the authors were listed in a different order in each record.
The final case was bibmerge's lone false match: an issue of Computing Surveys contained two responses to an earlier article, both of which were titled "The File Assignment Problem."

A second check of the quality of the algorithm presented here is the results reported for the OCLC duplicate detection scheme described earlier, although the OCLC clusters were created under substantially different rules. The analysis of the OCLC algorithm is the most substantial reported in the literature. Using a set of 184 pairs of records that had been identified in advance as duplicates, the OCLC system successfully identified 127 pairs, a 69 percent success rate, compared to a 94 percent success rate for my author-title algorithm. The false match rate was measured by selecting 1,000 record pairs and applying the duplicate test to each pair; only some of the pairs were duplicates. The error rate on this sample was 1.3 percent (13 false matches), which is comparable to the 4 false merges out of 554 clusters (0.7 percent) in the author-title clusters.

Chapter 4
Merging Related Records

This chapter describes the creation of a composite record that summarizes the duplicate records in an author-title cluster. It introduces two terms, information dossier and union record, to describe two different ways of grouping related bibliographic information.

A union record is a composite record created by merging several bibliographic records from distinct sources. A different union record is created for each different type of document included in the cluster; thus, a particular cluster may have union records for a journal article, a technical report, and a conference paper.

An information dossier [8] is a collection of information objects, e.g. bibliographic records, related in some way to one another. Specifically, I use the term to describe the source records that form an author-title cluster and the union records generated for the cluster.
Although the current system does not include other objects, the next section presents a system for automatically linking records to electronic copies available on the Internet. The dossier would contain links to, or perhaps local copies of, relevant items; in this way, a dossier is distinct from the records in a cluster.

Union records can be an imprecise summary of the source records, because the quality of the source records is variable and because there are sometimes too few records to be able to resolve conflicts between records. When there are conflicts, a single representative value is chosen for the union record instead of omitting the field or creating a hybrid value. Because of this decision, I also calculate statistics that describe the quality of the composite. When the source records vary significantly from the union record, the user may wish to examine the source records. These statistics are described in Chapter 5.

4.1 Goals of merger and outline of process

There are three primary goals for the merger process that creates union records and dossiers:

- to eliminate redundancy in query results
- to identify multiple access paths to a work
- to create complete and accurate union records from conflicting and incomplete source records

Because some author-title clusters may also include false merges, the dossier should contain some information about how closely the union record matches each source record.

Duplicate listings in the results of a query limit the ease with which the results can be used. A long list takes longer to transmit, process, and read than a short list, but the real difficulty is that identifying the duplicate and related records in a long list can be quite difficult and error-prone for a person.

When the union record is created, there is an opportunity to eliminate errors and to create a record that is more complete than any of the sources.
An error in one or a few source records can be purged, if there are enough records with the correct information; the correct value for the field is chosen by voting. Fields that are missing in one record can be drawn from another record. Unfortunately, these kinds of quality control have limited utility, because in the common case there are only a few records in a cluster.

Finally, the dossier brings together information about each different instance of publication, which allows a user to choose the physical document that is easiest to retrieve. When works are published several times, in different journals, proceedings, or books, having an exhaustive list of these publications makes it easier to find the work when the local library does not hold all of the publications. Because many universities publish their technical reports in digital form, knowing that a work was issued as a technical report means the work is more likely to be found online.

A secondary goal for the dossier is to minimize the amount of information lost to the user when false merges occur. The dossier should include information about how closely individual records match the union record and how much variation there is in the value of a field across the source records. When a particular record differs substantially from the union record or when one field has a different value in each record, there may be cause for the user to suspect a false merge. The interface described in the next chapter provides access to all of the source records and a measure of the truthfulness of the union record.

The general strategy for merging records is based on the different kinds of entry types allowed by Bibtex. The source records are grouped by entry type and a union record is produced for each type. The field values in the union record are assigned, for most fields, by counting the occurrences of each value and choosing the most commonly occurring value for the union record. Some fields are treated differently.
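For a single field, this counting strategy can be sketched as follows. This is an illustrative reimplementation, not the thesis code; the longest-value tie-breaker it uses is the one adopted in Section 4.2.

```python
from collections import Counter

# Sketch of the per-field merge rule: keep the most frequently occurring
# value, breaking ties by preferring the longest value.

def merge_field(values):
    values = [v for v in values if v]           # ignore missing values
    if not values:
        return None
    counts = Counter(values)
    return max(counts, key=lambda v: (counts[v], len(v)))
```

Applied to the six pages values discussed in Section 4.2.1 ("53" and "53--70" once each, "53--79" four times), the rule returns "53--79".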
The author and title fields will be the same regardless of type, so the counting strategy can be applied globally. The author field is also different because the counting strategy is applied to the component names rather than the full author list.

4.2 Creating the union record

The merger process described here is very simple, and copes with only a few kinds of errors in the source records. However, a number of refinements are suggested for dealing with a wider range of errors; these refinements deal with specific kinds of errors but use the same basic strategy. (A few of these refinements have been implemented, but most have not.)

Records are merged a field at a time. For each field, the number of occurrences of each different value is counted and the most frequently occurring value is chosen. Sometimes, two or more values will occur with equal frequency, and a tie-breaker is needed; the longest value is chosen. The tie-breaker is arbitrary, although it is hoped that the longer values will be more likely to contain information that helps the user.

There are three exceptions to the rules for merging fields:

- Values for the month field are converted to the first three letters of the month, or they are ignored.
- Values other than a four-digit number are ignored in the year field.
- The author field is merged by considering each name individually, finding the longest form of that name, and assembling a list of these longest names.

There are three rough categories of fields, each of which will be affected somewhat differently by merging. The categories differ in how likely a field is to be the same in two different records for the same document.

The first category includes fields like title, date, or pages, which describe fixed, objective characteristics of the document. These fields are most common and the merger process is tailored to them.

The second category includes fields like note, annote, and keywords, which will vary widely from source record to source record (if they appear at all).
There are no specific guidelines for the use of these fields, so each source may describe some different characteristic of the document. Each occurrence of one of these fields in a source record is included in the union record. The abstract field is hard to classify. Although there should be a single abstract for each document, there is a lot of variation in what is actually recorded as the abstract. Currently, the longest abstract is chosen using the standard process.

The last category is fields which are used to manage bibliographic records or serve some other purpose specific to the record's creator. Standard fields like key and many non-standard fields, like bibdate or location, will appear in the union record, chosen by the standard counting scheme. However, it is unlikely that any of the field values are related and the value selected has little significance. The interface presented in the next chapter ignores these fields in the standard display.

4.2.1 Problems with union records

The counting approach does not work very well when there are only a few records of a particular type. Typographic and formatting errors also cause problems. In the six records for the article "Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism" by Anderson et al., three different values appear in the pages field: "53" and "53--70" each appear once and "53--79" appears four times. If the value "53--79" appeared only once, it would be impossible to distinguish between the correct value and the incorrect ones.

BIB-VERSION:: v2.0
ID:: MIT-LCS//MIT/LCS/TR-569
ENTRY:: February 25, 1995
ORGANIZATION:: Massachusetts Institute of Technology, Laboratory for Computer Science
TITLE:: Concurrent Garbage Collection of Persistent Heaps
TYPE:: Technical Report
AUTHOR:: Nettles, S.
AUTHOR:: O'Toole, J.
AUTHOR:: Gifford, D.
DATE:: June 1993
PAGES:: 22
ABSTRACT:: We describe the first concurrent compacting garbage collector for a persistent heap. Client threads read and write the heap in primary memory [...]

BIB-VERSION:: CS-TR-V2.0
ID:: CMU//CS-93-137
ENTRY:: September 13, 1995
ORGANIZATION:: Carnegie Mellon University, School of Computer Science
TITLE:: Concurrent Garbage Collection of Persistent Heaps
AUTHOR:: Nettles, Scott
AUTHOR:: O'Toole, James
AUTHOR:: Gifford, David
DATE:: April 1993
PAGES:: 22
ABSTRACT:: We describe the first concurrent compacting garbage collector for a persistent heap. Client threads read and write the heap in primary memory [...]

Figure 4-1: CS-TR records for one TR from two publishers

Another potential problem is the policy of creating a union record for each different document type in a cluster. The policy assumes that there will be only a single document of a particular type in a cluster, i.e. that we will not find two different articles with the same author and title. This assumption does not hold in some circumstances, resulting in misleading union records.

One example of a failure is a technical report written by authors from different institutions and issued independently by each institution. (See Figure 4-1.) The system will create a single union record for these reports, which correctly represents most information (author, title, abstract) but will obscure or confuse the issuing organizations and the report's number or identifier. The problem is serious, because the organization and report number are important for locating a copy of the document.

@Article{BNBTEAEDLHML89,
  title = "Lightweight Remote Procedure Call",
  author = "Brian N. Bershad and Thomas E. Anderson and Edward D. Lazowska and Henry M. Levy",
  year = "1989",
  month = dec,
  pages = "102--113",
  journal = "Proc. Twelfth ACM Symposium on Operating Systems",
  volume = "23",
  number = "5",
}

@Article{BershadAndersonLazowskaLevy90,
  title = "Lightweight Remote Procedure Call",
  author = "Brian N. Bershad and Thomas E. Anderson and Edward D. Lazowska and Henry M. Levy",
  year = "1990",
  month = feb,
  pages = "37--55",
  journal = "ACM Transactions on Computer Systems",
  volume = "8",
  number = "1",
}

Figure 4-2: Bibtex records exhibiting the conference-journal problem

Another example of a failure is caused by confusion about how to catalog the papers in a conference proceedings that are published as a journal issue, e.g. the SOSP proceedings printed in Operating Systems Review. The proceedings is an issue of the journal, so it would be quite reasonable to catalog the conference paper as a journal article. But if a paper is cataloged as an article and is also published in a journal (say the Transactions on Computer Systems), then the record for the SOSP paper and the record for the TOCS article will be merged into a single union record.

4.2.2 Refinements to merger process

The general strategy just described improves significantly with a few refinements. Three refinements have been implemented: the author field is treated separately, as it was during cluster identification, because of its special formatting, and two simple filters are used to prevent field values with detectable errors from being counted. Several other refinements are suggested, but have not been implemented.

The author list is constructed differently because all of the author fields in the source records must match (with the approximate match described in Section 3.2.1) for the records to be placed in the same cluster. The merge algorithm extracts as much information about each name as possible and creates a new author list. When names are compared, each part (i.e. first, middle, last name) is expanded wherever possible; a blank entry becomes an initial or a full name, and an initial becomes a full name.

Filters, which validate field values before creating the union record, prevent invalid data from being included in the union record; they can also normalize field values by testing for common mistakes and cataloging variants and attempting to correct them.
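The month and year filters that follow can be sketched as below. This is a simplified reimplementation of the behavior described in the text, not the thesis code, and the function names are mine.

```python
import re

# Sketch of the month and year filters: months normalize to Bibtex's
# three-letter abbreviations, years must contain a four-digit number;
# values that cannot be repaired are discarded (returned as None).

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

def filter_month(value):
    v = value.strip().lower().strip("[]().")
    if v.isdigit() and 1 <= int(v) <= 12:       # e.g. month = "[2]"
        return MONTHS[int(v) - 1]
    if v[:3] in MONTHS:                         # "Sept.", "september", "sep"
        return v[:3]
    return None

def filter_year(value):
    m = re.search(r"\b(\d{4})\b", value)        # e.g. "Sept. 1987" -> "1987"
    return m.group(1) if m else None
```

Note that a combined value like "Sept. 1987" in the year field still yields a usable year, while "199?" is discarded rather than guessed at.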
The month and year fields are merged using filters. Common problems in these fields include:

- The source record combines them, e.g. year = "Sept. 1987"
- The source record uses the number of the month instead of the name, e.g. month = "[2]"
- The source record uses question marks to indicate uncertainty, e.g. year = "199?"

The month filter normalizes all entries to the three-letter abbreviations used by Bibtex. Numbers are converted to text and full names shortened; entries that cannot be re-formatted are ignored. The year filter accepts only four-digit years, removes any data other than the year, and discards entries for which a valid year cannot be found.

More powerful heuristics for identifying mis-formatted and incorrect data and either discarding it or converting it to the correct format would further improve the quality of the union records. For example, it may be profitable to identify a group of field values that are similar but not exactly the same, e.g. two titles that differ in only a few positions or years that are similar, like "199?" and "1991". The merge process would then determine the most frequently occurring group, and then choose a representative element from that group. If three source records contained the year values "1989", "1991", and "199?", this strategy would choose 1991 as the most frequently occurring, correctly-formatted value.

When there are several values that occur with equal frequency, we choose the longest value for the union record. There are many other heuristics that could be used instead, like a strictly random choice or choosing the shortest value; heuristics could be applied on a per-field basis, e.g. using the highest number in the year field.

4.3 Cluster sizes and composition in DIFWICS

A brief analysis of the author-title clusters in DIFWICS suggests two broad observations. First, enough related records were found to justify the effort involved in identifying them.
Second, within a cluster there is substantial variation among the source field values.

The DIFWICS collection consists of 243,000 source records and 162,000 author-title clusters. More than half of the records belong to a cluster that contains two or more records. Fewer clusters contain more than one record of the same type: about 30,000 clusters, or 20 percent of the collection.

Cluster     Number of   Number of   Percentage
size        clusters    records     of all records
1           116,829     116,829     45.8%
2           28,865      57,730      22.6%
3           9,068       27,204      10.7%
4           3,885       15,540      6.1%
5           1,710       8,550       3.4%
6           878         5,268       2.1%
7           528         3,696       0.9%
8 or more   772         7,888      3.1%
total       162,535     242,705     100.0%

Table 4.1: Cluster sizes

Table 4.1 shows how many clusters of a particular size there are and what percentage of the total number of records are in clusters of that size. The average number of records in a cluster is 1.49.

The distribution of record types within the entire collection is basically the same as the distribution within clusters: articles are most common, followed by papers in conference proceedings. Table 4.2 shows how many source records of a particular type exist. Most of the clusters contain one type of record.

Type            Records    Percent
Article         108,985    (44.9%)
InProceedings   75,150     (31.0%)
TechReport      23,606     (9.7%)
Book            12,178     (5.0%)
InCollection    9,955      (4.1%)
PhdThesis       3,143      (1.3%)
other           9,688      (4.0%)
total           242,705    (100.0%)

Table 4.2: Source records, by type

Clusters with more than one type of source record represent less than 10 percent of the total number of clusters. The three most common combinations are Article and InProceedings (4,349 clusters), Article and TechReport (1,808 clusters), and InProceedings and TechReport (1,536 clusters). Fewer than 1,000 clusters contain three different record types and none contain four or more.

Within clusters that contained two or more records of the same type, I examined individual fields to see how often all the source records had the same value.
Field values were compared by normalizing them to lowercase alphanumeric strings, eliminating formatting and punctuation. The statistics were gathered using a 10 percent sample of the clusters, considering only those clusters that had two or more records of the same type. The sample included 3,753 Article clusters (11,836 records), 3,113 InProceedings clusters (9,705 records), and 1,335 TechReport clusters (3,345 records).

Table 4.3 summarizes the results of the analysis on several standard Bibtex fields. It shows the number of times the field appeared in the sample clusters and the number of times all the records in a cluster had the same normalized value. The variation in the title field is interesting because it suggests how many more clusters would exist if approximate string matching weren't used. About 10 or 15 percent of the titles don't contain the same normalized string, even though they are considered to be the same under the approximate string match.

              Article               InProceedings         TechReport
Field         Present  Uniform      Present  Uniform      Present  Uniform
title         3753     3197 (85%)   3113     2665 (86%)   1355     1232 (91%)
year          3753     3546 (94%)   3075     2850 (93%)   1352     1212 (90%)
month         3435     1417 (41%)   2496     1636 (66%)   1031     900 (87%)
pages         3683     2640 (72%)   2973     2177 (73%)   417      393 (94%)
journal       3752     1193 (32%)
institution                                               1350     763 (57%)
number        3578     3458 (97%)                         995      585 (59%)
booktitle                           3062     477 (16%)
publisher                           1769     1265 (72%)

Table 4.3: Variation in field values within author-title clusters of the same type

Among the other fields, the year is the most consistent; records have the same year more than 90 percent of the time. The month field shows much more variation because there are several different abbreviations used for each month. Abbreviations cause similar problems in several other fields with high variation: journal, institution, booktitle, and publisher. The filters for improving the merger process, described above, could increase the number of fields with uniform values.
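The normalization used for these comparisons amounts to stripping everything but lowercase alphanumerics; a sketch (function names are mine, not from the thesis):

```python
# Sketch of the comparison used in this analysis: field values are
# normalized to lowercase alphanumeric strings, so punctuation, spacing,
# and capitalization differences do not count as variation.

def normalize(value):
    return "".join(c.lower() for c in value if c.isalnum())

def uniform(values):
    """True when every value in the cluster normalizes identically."""
    return len({normalize(v) for v in values}) == 1
```

Under this definition, "53--70", "53-70", and "53 -- 70" count as uniform, while "jan" and "January 1991" do not.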
Chapter 5
Presenting Relations and Clusters

This chapter describes some preliminary work on presenting a collection of 240,000 records after author-title clusters are identified and information dossiers constructed. The Web-based interface allows simple full-text queries of the collection, presents related records together on the screen, and automatically creates links that invoke searches of Web indices, technical report archives, and the local library collection.

Although the user interface is only a prototype, it does illustrate two important design principles:

1. The results of a search should display the works that match the query, presenting the union records for different versions of the same work together. The presentation should cull the most complete and accurate information available from any duplicate records for a document, but should not hide variations in the underlying records.

2. The related records in a particular cluster should be used to help the user find an easily accessible copy of a document. The combination of bibliographic information, which allows well-defined queries into Web indices, library catalogs, and other databases, and clusters, which link related documents, should increase a user's ability to find a particular document (or a related one that represents the same abstract work).

5.1 The basic Web interface

The basic interface to the collection is a full-text index of all 240,000 source records that allows basic Boolean queries. (The underlying search engine allows for queries that look for words only when they appear in certain fields; a Web interface for this feature is underway.) The records are stored in a database that maintains a unique identifier for each record, a separate name space for clusters, and a mapping from record id to cluster id. The index uses record ids internally, but returns a list of clusters in response to a user query.
A query using the Web interface returns a summary showing the title, author, and year of all the matching works, followed by an expanded entry for each cluster. Figure 5-1 shows the expanded entry for the paper "Lightweight Remote Procedure Call." The first two lines of the cluster entry show the title and authors, the two defining characteristics of the document, followed by entries for the abstract and keywords, which will be the same for each document in the cluster. A bulleted list of three document citations follows the keywords; the citations are produced from the union records. The first two lines of each document citation are formatted to look like a citation in a traditional bibliography. In the first entry, an article in ACM Transactions on Computer Systems, it shows the date the article was published, the volume and number, and the pagination. The third line of the document citation contains a hypertext link to a search of the local reading room catalog. The search will look for the particular document being cited. In the first entry, it searches for the journal the article appeared in; for the second entry, a conference paper, it searches for the proceedings. The last line of the document citation contains two links to more information about the records for the document. The link to the expanded record displays a full union record, which includes nonstandard fields that are not displayed in the regular citation. The second link returns the source records used to make the union record; the link indicates the number of source records and returns them in their original form. The last line of the cluster entry (following the bulleted document citations) incorporates links to several other search services. There are links to the Alta Vista and Excite Web indices, which contain queries based on the title field and the authors' last names. When the work has been published as a technical report, links to the NCSTRL and UCSTRI technical report indices are included.
5.2 Assessing the quality of union records

The most serious shortcoming of the current interface is that it provides no warning to the user when there are differences between the union record and the source records. Preliminary work for detecting and measuring these differences is presented here, but the results are not integrated into the current interface. Because the displayed document citations are based on the union records created by merging duplicate records, there is a chance that the citation will contain mistakes or hide useful information that is contained in the source records. Problems could be the result of a record that describes a different document being included in the cluster by mistake. However, most problems arise because the source records contain different and conflicting information about the document. (These problems were discussed in greater detail in Chapter 4.)

Lightweight Remote Procedure Call
Brian N. Bershad, Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy

Abstract. Lightweight Remote Procedure Call (LRPC) is a communication facility designed and optimized for communication between protection domains on the same machine. In contemporary small-kernel operating systems, existing RPC systems incur an unnecessarily high cost when used for the type of communication that predominates --- between protection domains on the same machine. This cost leads system designers to coalesce weakly-related subsystems into the same protection domain, trading safety for performance. By reducing the overhead of same-machine communication, LRPC encourages both safety and performance. LRPC combines the control transfer and communication model of capability systems with the programming semantics and large-grained protection model of RPC. LRPC achieves a factor of three performance improvement over more traditional approaches based on independent threads exchanging messages, reducing the cost of same-machine communication to nearly the lower bound imposed by conventional hardware. LRPC has been integrated into the Taos operating system of the DEC SRC Firefly multiprocessor workstation.

Keywords. RPC-Modellierung. BERS 89, HUTC 89.

- Article in ACM Transactions on Computer Systems. Feb 1990. vol. 8, no. 1. pages 37-55. Check in LCS/AI reading room. View expanded record, source records (9).
- Appeared in sosp12. Dec 1989. pages 102-113. Check in LCS/AI reading room. View expanded record, source records (7).
- Technical report. Department of Computer Science, University of Washington, no. 89-04-02. Apr 1989. Check in LCS/AI reading room. View expanded record, source records (3).

Search for this paper in Alta Vista, in Excite NetSearch, in NCSTRL, in UCSTRI.

Figure 5-1: Sample author-title cluster display from Web interface

When a record is included in an author-title cluster by mistake, two different kinds of failure result.

1. In a cluster that contains several citations for a document type, the mis-matched record will be hidden. The union record will not describe the hidden record, and any query that matches the record will return the cluster it is mistakenly included in. The only way to discover the hidden record is to look at the source records.

2. In a cluster with only a few citations, the merger process may create a union record that includes some fields from the mis-matched record and some fields from the good records, because an arbitrary value is selected when a most frequently occurring field value can't be identified. The result is a confusing union record that mixes information from different documents or works.

The "source match" ratio can identify the first problem. It measures how closely the fields in the union record match the fields in a particular source record. It is computed by comparing the standard fields in the union record with the same fields in a source record.
(Fields not defined in the source record are ignored.) The value is a tuple of the number of fields that match and the number of fields that do not match; e.g., (4,3) indicates four fields match and three fields do not. The second problem can be identified with the "field consensus" ratio, which measures how many of the source records contain the same value as the union record for a particular field. For each union record, we can compute the source match value for each source record and the field consensus value for each of the major fields. The result in each case is a list of ratios, one source match ratio for each source record and one field consensus ratio for each standard field.

Unfortunately, these ratios provide only a very rough measure of the differences between records or fields. The variations in Bibtex records that make identifying duplicate and related records hard also complicate the analysis of union records. Variations, like abbreviations and typos, cause fields to fail to match. To limit the effects of formatting errors, the field values are normalized before comparison; all non-alphanumeric characters are eliminated and all letters are converted to lowercase. These problems make the ratios hard to interpret. A source match ratio of (2,5) could mean that the source describes a different document than does the union record. But it could just as well mean that the source record contains a number of non-standard abbreviations or typographical errors.

Despite the problems with creating and interpreting these ratios, some informal tests suggest that they are helpful for identifying clusters that contain false matches. Figure 5-2 shows a sample cluster where the statistics help to identify a false match. The cluster contains four technical report citations, which the union record reports as "Process Migration in the Sprite Operating System."
In fact, the cluster also contains a single citation for a different technical report by the same author, titled "Transparent Process Migration in the Sprite Operating System." The problem is clear in the source match ratios, which show that the third source record shares one common field value with the union record and differs on five other fields. The single differences that appear in the field consensus ratios also suggest that one of the source records may not belong.

In the previous example, the difference between source record and union record was rather pronounced. Often, the statistics are more ambiguous. Figure 5-3 shows a more representative set of ratios. Although the first source record matches on two fields and disagrees on five fields, the cluster is correct.

If the ratios are useful, it seems less clear how to use them to automatically detect problems and warn the user. Always presenting the statistics to the user also seems cumbersome, because they further complicate the display and because they complicate understanding the display. A middle ground that displays the statistics when they show significant variation and suppresses them when there is very little variation might strike the right balance. A warning indicator that showed one of three values (a mistake is likely, possible, or unlikely) would also display the information concisely.

Fredrick Douglis. Process Migration in the Sprite Operating System. Technical report. Computer Science Division, University of California, no. UCB/CSD/87/343. Feb 1987.

Source match ratios:
  record  source match
  #1      [5,1]
  #2      [3,4]
  #3      [1,5]
  #4      [5,1]

Field consensus ratios:
  title [3,1]   institution [2,2]   month [3,0]
  pages [1,1]   number [2,2]        year [3,1]

Figure 5-2: Source match and field consensus ratios for cluster including a false match

Michael L. Scott and others. Implementation Issues for the Psyche Multiprocessor Operating System. Appeared in Proceedings of the Symposium on Experiences with Distributed and Multiprocessor Systems. Oct 1989. pages 227-236.

Source match ratios:
  record  source match
  #1      [2,5]
  #2      [5,1]
  #3      [4,3]

Field consensus ratios:
  title [2,1]   booktitle [1,2]   month [1,1]
  pages [3,0]   year [3,0]        publisher [1,2]

Figure 5-3: Source match and field consensus ratios for correct cluster

Chapter 6
Automatic linking

The previous chapter described the current Web interface, which includes links to Web search services in its display of author-title clusters. The search links are a first step towards automatically linking records in DIFWICS directly to copies of documents on the Web. The general scheme for automatically linking a cluster to copies of the work available on-line is the same as the scheme for identifying the clusters in the bibliographic collection: a full-text search using some arbitrarily selected words from the author and title fields will turn up potential copies, and a more detailed comparison of those copies will find actual instances of the work.

6.1 Searching for related Web citations

The work reported on here does not perform automatic linking, but it makes a first step in that direction, and the preliminary results offer some insight on a full-scale automatic linking project. The basic insight is that any document that is accessible from the World-Wide Web has a hypertext link to it, either from another page or through some search system; the hypertext link is a citation for the document, and often the hypertext link is included from a traditional citation that describes the document. The Web catalog interface's links to other search services (the reading room catalog, the AltaVista and Excite Web indices, and the NCSTRL and UCSTRI technical report indices) provide the first-step full-text search for online documents.
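The first-step search (selecting a few words from the author and title fields, or quoting the full title for an engine that handles long phrases) can be sketched as follows. The helper names and query formats are illustrative assumptions, not the thesis code, and no actual 1996 search-service URL syntax is reproduced.

```python
import random

def title_words(record, n=3, seed=0):
    """Arbitrarily select a few title words for a keyword query,
    skipping very short (likely stop) words."""
    words = [w for w in record["title"].split() if len(w) > 3]
    random.Random(seed).shuffle(words)
    return words[:n]

def phrase_query(record):
    """Full quoted title plus author last names, for an engine
    (like AltaVista) that makes efficient use of long quoted strings."""
    last_names = [a.split()[-1] for a in record["authors"]]
    return '"%s" %s' % (record["title"], " ".join(last_names))

paper = {"title": "Obliq: A Language with Distributed Scope",
         "authors": ["Luca Cardelli"]}
print(phrase_query(paper))
# "Obliq: A Language with Distributed Scope" Cardelli
print(title_words(paper))
```

A keyword-only engine would receive the `title_words` form; a phrase-capable engine receives the quoted full title, which is the distinction exploited in the examples below.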
The current system requires that the user perform the second filter manually on the search results, but the process can be automated. The individual search links are created with some knowledge of the particular search interface and the format of citations on the Web. The AltaVista search engine makes efficient use of long quoted strings, so the search looks for occurrences of the full title. The reading room catalog interface does not catalog journal articles, but does catalog journals and uses the word "holdings" in each journal entry; for journal articles, the catalog search uses a few words from the journal name and the word holdings to check the journal's availability.

A pair of examples illustrates some of the successes and pitfalls of this approach to automatic linking. The first example is a search for the paper "Obliq: A Language with Distributed Scope" by Luca Cardelli. The paper was issued as a DEC SRC technical report, and is available from the author's personal Web pages. The results of a search for this paper are unusually good, because SRC's technical report archive includes a separate page for each technical report, which matches the queries very closely. Searches in AltaVista, Excite, UCSTRI, and the reading room catalog all return a link to the report in the SRC technical report archive as the best match for the search. (NCSTRL does not index SRC technical reports.) The two Web searches use relevance ranking to order all the Web pages that matched at least some of the query terms.
Among the most relevant pages are:

- several pages about use of the Obliq language
- a Web page listing the contents of the issue of Computing Systems in which the paper was published
- a bibliography of Cardelli's papers from a Web site in Germany
- lecture notes for a class that discusses Cardelli's work on the semantics of multiple inheritance
- several papers that cite the Obliq paper
- an FAQ on object-oriented programming languages

The Web searches returned Cardelli's personal page with a link to a Postscript copy of the paper, but it is not ranked highly in the list of search results; it appears in about 30th position.

A search for the paper "A Theory of Primitive Objects: Second-Order Systems" by Martin Abadi and Cardelli produces more representative results, because it is not an SRC technical report. The search illustrates the benefits of the AltaVista full-string search over the Excite keyword-only search. The Excite search locates many pages that mention the authors and their work on type theory, but no pages with links to the desired paper. The AltaVista search, on the other hand, locates two pages maintained by the authors with links to the paper. These pages appear to be several levels deep in their local filesystem, so it is likely they are not included in the smaller Excite index. (The reading room link shows that it has a copy of the conference proceedings that include the paper.)

6.2 Principles for a fully-automated system

These basic results suggest several things about how to design a system for automatically tracking down citations on other Web pages, instead of requiring the user to take the second step of examining individual Web pages.

1. Large-scale Web indexes are likely to index many pages that contain references to the paper being sought. These references will be a mix of normal citations, found in other papers, lecture notes, and other works, and of Web-based citations, like those found in personal publication lists.
Some but not all of these citations will contain hypertext links to a digital copy of the document.

2. The citations that include hypertext links are often found on Web pages several levels deeper than a server's main Web page. Many of the smaller Web indexes omit these pages, so searches of these indexes are more likely to return no relevant pages, or pages that lead to the relevant citation but do not actually contain it. For example, the second Excite search described above returned a page about a book on objects written by Abadi and Cardelli, which in turn contained links to the authors' personal pages, which contained links to lists of publications.

3. Very few papers are available on the Web in an easily indexed format like HTML or plain text. Most papers are available as Postscript, which is not easily indexed. As a result, it is uncommon to discover pages where the title or search summary clearly indicates that the page contains the sought-after document.

We experimented with two services that augmented the current interface for searching the Web. One service performed searches automatically, retrieved the first 10 pages returned, and searched the pages for citations that were similar to the document being sought. Another service interposed a Web proxy server between the user and the Web that tracked the user's examination of the search results and recorded what page the paper was actually found on. The proxy let the user navigate the Web as normal, but added a header to the top of each page that showed the author and title of the paper being sought; the proxy header also contained a link for the user to follow when the paper had been found. Neither of the experimental services was robust enough or successful enough to include in the current interface, but our brief use of them suggests that they would be interesting areas for future work.
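The first experimental service, which scanned retrieved pages for citations similar to the sought document, can be sketched roughly as follows. The character-trigram Dice coefficient here is an illustrative stand-in for the thesis's n-gram comparison, and the 0.7 threshold is an assumed value, not one taken from the thesis.

```python
import re

def ngrams(s, n=3):
    """Character n-grams of a normalized (lowercase alphanumeric) string."""
    s = re.sub(r"[^a-z0-9]", "", s.lower())
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def similarity(a, b):
    """Dice coefficient over character trigrams, from 0.0 to 1.0."""
    ga, gb = ngrams(a), ngrams(b)
    if not ga or not gb:
        return 0.0
    return 2.0 * len(ga & gb) / (len(ga) + len(gb))

def page_cites(page_text, title, threshold=0.7):
    """Does any line of a retrieved page look like a citation of the
    sought title? (Real pages would also need HTML stripping.)"""
    return any(similarity(line, title) >= threshold
               for line in page_text.splitlines())

page = "Publications\nObliq - a language with distributed scope (PS)\n"
print(page_cites(page, "Obliq: A Language with Distributed Scope"))  # True
```

Because the comparison runs on normalized n-grams, a citation line survives the punctuation and capitalization differences that defeat exact matching.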
Chapter 7
Conclusions and Future Directions

7.1 Future directions

7.1.1 Performance, portability, and production

The underlying database and environment for storing and comparing bibliographic records, in particular the n-gram string comparison, were not the primary focus of this thesis. If the system is going to support a very large collection (millions of records) or many simultaneous users, its performance needs to be improved. The system also needs a few other implementation changes to allow long-term use in a production environment.

The prototype system was implemented primarily in Perl 5, which allowed rapid prototyping at the cost of execution-time efficiency. The n-gram string comparison is particularly slow in the current implementation; during the second round of cluster creation, loading records from disk and performing the detailed comparison proceeds at less than 10 records per second (on a 25 MHz RS/6000).

One consequence of the implementation decisions is that it is difficult to identify the bottlenecks. The Perl implementation of, for example, n-gram comparisons is clearly slow, but implementing it in a different language might speed it up enough that some other part of the system becomes the bottleneck. The remaining comments on performance should be considered with this constraint in mind.

It appears that loading bibliographic records into the system is costly. The records are stored in their original text format, and parsed each time a record is loaded. The primary cost of loading a record is parsing the Bibtex, so it may be profitable to develop a more easily parsed intermediate format.

The cost of performing full-text queries during the construction of potential match pools appears to be the next most costly part of clustering, after the n-gram comparisons. In addition to optimizing the internal workings of the search engine, there may be an opportunity for global query optimizations.
Three three-word queries are created for each record in the collection; it seems probable that the same query would be generated more than once, both because of duplicate records and because authors are likely to generate different works with some of the same title words. If performing a query is a significant bottleneck, the queries could be re-ordered to take advantage of repeated queries.

One important limitation, independent of performance considerations, is that the system does not record the source of a bibliographic record. When a particular bibliography is integrated into the main collection, there is no way to record that it was originally part of, say, the USENIX bibliography. As a result, it is difficult to keep the collection up-to-date and incorporate changes and additions from a bibliography that has already been included. The production system should record the source in a consistent way, so that the main collection can continuously incorporate changes from external sources. A "data pump" could be set up to monitor sources of bibliographic records and add new records or update modified records. Recording the source of a record enables other value-added services, such as judging the quality of a record based on its source, which are described below.

7.1.2 Improving the quality of records

One approach to improving the quality of bibliographic information, the one described in this thesis, is to locate related bibliographic records and merge them in a way that improves the quality of information. A different approach is to use authority control. In a library catalog, authority control describes the process of identifying each of the unique names in the catalog, usually names of authors and names of subject headings, and finding all of the variant forms of the name within the catalog. An authority record describes the authoritative form of the name along with any variants.
Authority records can be integrated into the catalog, but more often they are used by librarians to help in the preparation of the catalog. New entries in the catalog can be checked against the authority records to determine the proper form of the name, and old records can be updated to use the authoritative form.

Using authority control to regularize the use of certain fields, notably author, journal, and publisher, would improve the quality of the records visible to the user and, in the case of the author field, the quality of the clustering algorithm. (Recall that problems parsing and comparing author lists accounted for most of the missed matches during clustering.) A system for authority control, however, would have to deal with some of the same problems the clustering algorithm handles now; it needs to identify as many variant entries as possible without being so aggressive that truly different entries are conflated.

Authority control for journals, publishers, and conferences would not affect the creation of author-title clusters, but would make it easier to produce union records and would improve the value of the "field consensus" and "source match" ratios. Creating the authority records, however, would be a labor-intensive process, requiring a human cataloger to generate a list of authoritative names and review possible variations to determine if they in fact refer to the same object. It should be possible to automate much of the process by looking for plausible variations on and abbreviations of the authoritative name, but some variations would be virtually impossible to detect automatically: The Journal of Library Automation, for example, changed its name to Information Technology and Libraries. On the other hand, it is possible that journals in different fields could be abbreviated the same way; possible conflicts in abbreviations should be reviewed by a human cataloger.
This observation about the need for human supervision of authority control applies to library cataloging in general. The identification of basic bibliographic information, the author and title of the work, the pages it appears on, etc., is a largely clerical process. (Fully automated cataloging is an active area for research, but little progress has been made [44].) Instead, cataloging should focus on information that is more difficult to obtain: whether two authors with similar names are in fact the same person, or whether two papers with similar but different titles actually represent the same work. Heaney makes the same case in his argument for an object-oriented cataloging standard [15].

7.1.3 Identifying other bibliographic relationships

The clustering algorithm identifies author-title clusters, in part because identifying equivalence and derivative bibliographic relationships has the most advantage for users and in part because they can be reliably identified in the presence of mixed-quality records. Identifying other bibliographic relationships would also be useful; if authority control (or some other mechanism) is used to improve the quality and consistency of field values, this problem would be easier to tackle.

The hierarchical relationship holds between a composite work and its parts: between a journal issue and the articles it contains, or between a conference proceedings and the papers it contains. The wide variation in the journal and booktitle fields makes this relationship hard to identify in the current collection, but authority control could make such comparisons possible. The relationship could be stored as journal issue clusters or proceedings clusters that contain all of the articles from a particular issue of a journal or all the papers presented at a conference. The hierarchical relationship would be a useful addition to the current search interface.
When a user finds an interesting paper, he could examine the proceedings cluster to see if any similar work was presented or check the journal cluster for an accompanying article. Clusters could also identify the sequential relationship by linking together the journal issue or proceedings clusters, which would provide a three-level hierarchy for browsing; users could move between clusters for individual articles, clusters for issues, and clusters for entire journals.

The referential relationship is interesting because it cannot, in general, be identified using bibliographic information alone. References that involve critique or review, e.g. a Computing Reviews article, might be identifiable, but citations do not contain enough information to determine that one paper is cited by another paper.

7.1.4 Integrating non-bibliographic information

The referential relationship could be identified if an information dossier contained information in addition to bibliographic records, in particular, if it contained the citation list or the entire text of the document. The information dossier is a particularly useful notion because it can include information of all sorts, such as the full text of the document or information on how to order it.

Non-bibliographic information can be included in a dossier if the author and title can be identified and matched with an existing author-title cluster. Extending the dossier allows a much richer set of interactions with the library collection. For example, a user browses a document and discovers a reference to another work that is potentially of interest. The user highlights the reference with his or her mouse and clicks a button, and the document that was referenced appears in a new window on the user's screen. Including the full text of a document (including the citations) enables many other applications as well.
Users can perform queries across the entire text of a document, which creates more opportunities for discovering relevant documents. Abstracts and summaries can be automatically generated for the documents [32], which can help the user quickly establish the relevance of a document to the current search. Citation indexes and graphs can be created that show how often and how widely a particular paper or conference proceedings is cited.

One related issue that does not seem to be well understood is machine processing of citations intended to be read by humans. It is difficult to design a general-purpose processor that can identify the distinct parts of a citation; possible problems include identifying the individual authors' names and distinguishing between different numeric values, like years, page numbers, and volume/issue numbers. Some leverage on the problem can be gained by looking for bibliographic records that are "similar" to the citation, using the structured information contained in bibliographic records to try to understand the unstructured citation. Eytan Adar and I [3] proposed one scheme for linking the two.

7.1.5 Enabling librarianship and human input

The automatic processes for identifying and merging bibliographic records work quite well in general, but human intervention would be helpful for correcting the errors that do occur. In general, the system should allow users to make corrections and changes to the bibliographic records and to the author-title clusters. There are at least two different actions that a librarian might want to perform. First, the librarian should be able to change the contents of an author-title cluster by explicitly labelling a pair of records as related or not related. Marking two records as related would cause the author-title cluster to contain all records that have the same author and title fields as one of the two records.
Second, librarians should be able to label the quality of a source record or a particular collection of source records. Even a simple quality control scheme that allowed records to be marked as high quality, low quality, or mixed quality would improve the creation of composite records.

The library collection would also benefit from other kinds of human interaction. The collection of bibliographic records and a system for managing information dossiers provide a basic infrastructure for supporting collaborative and cooperative work. An annotation service that allowed users to share reviews and critiques of documents is an example of such a service.

7.2 Conclusions

The two primary conclusions to draw from this work are that bibliographic relationships can be automatically identified in mixed-quality source records and that freely available bibliographic information can provide the basis for a useful and relatively complete index of the computer science literature.

The author-title clustering algorithm, described in Chapter 3, successfully identifies related bibliographic records that describe the same work. The algorithm tolerates errors in the records and variability in cataloging practices, but maintains a tolerably low error rate; testing the algorithm with a small, controlled sample showed that it identified more than 90 percent of the related records and mistakenly linked records for two different works less than 1 time in 100. The effects of mistaken links are mitigated by presenting the user with information about the amount of variation in the underlying records. The clustering algorithm uses a full-text index of the source records to limit the number of inter-record comparisons and to overcome errors in the author and title fields that have caused other algorithms to fail.
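The two-stage structure just described (a full-text index limiting which pairs of records are compared in detail, followed by an approximate match confirming the pair) can be sketched as follows. This is a schematic reconstruction, not the Chapter 3 algorithm: it clusters on titles alone rather than requiring both author and title to match, the word-overlap score stands in for the n-gram comparison, and the 0.5 threshold is assumed.

```python
import re
from collections import defaultdict

def words(title):
    return set(re.findall(r"[a-z0-9]+", title.lower()))

def overlap(a, b):
    """Word-overlap (Jaccard) score; a stand-in for the
    n-gram approximate string match."""
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb)

def cluster(titles, threshold=0.5):
    """Two-stage matching: an inverted index builds a small pool of
    potential matches; the detailed comparison confirms them."""
    index = defaultdict(set)          # word -> record ids seen so far
    cluster_of = {}
    next_cluster = 0
    for rid, title in enumerate(titles):
        candidates = set()
        for w in words(title):        # pool of potential matches only
            candidates |= index[w]
        match = next((c for c in candidates
                      if overlap(title, titles[c]) >= threshold), None)
        if match is None:             # no confirmed match: new cluster
            cluster_of[rid] = next_cluster
            next_cluster += 1
        else:
            cluster_of[rid] = cluster_of[match]
        for w in words(title):
            index[w].add(rid)
    return cluster_of

titles = ["Process Migration in the Sprite Operating System",
          "Transparent process migration in the Sprite operating system",
          "Lightweight Remote Procedure Call"]
print(cluster(titles))  # first two titles share a cluster
```

The inverted index is what keeps the work far below the quadratic all-pairs comparison: the third title shares no indexed word with the first two, so no detailed comparison is ever performed for it.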
The Digital Index for Works in Computer Science demonstrates that it is possible to create a useful information discovery service from heterogeneous sources of bibliographic information. It uses the clustering algorithm to integrate records from many sources and in multiple formats without any more coordination between sources than now exists. The system can automatically incorporate records from other collections and from individual citation lists without requiring that the creators of those records change their current practice. The 240,000-record DIFWICS collection is broad in scope: It covers a large part of the computer science literature, including most areas of speciality and a large percentage of the total literature cataloged by the ACM between 1977 and 1993.

The DIFWICS catalog identifies individual works, linking together duplicate records and different documents with the same author and title, and helps users find online documents. The work-centered catalog reduces redundancy in search results and makes inter-document relationships clearer, and the preliminary automatic linking work speeds the process of searching for papers on the Web and suggests that the process could be fully automated.

Bibliography

[1] Alf-Christian Achilles. bibmerge. [Program available via WWW], 1994. URL (version 28 Feb. 1995).

[2] Alf-Christian Achilles. A collection of computer science bibliographies. [WWW document], 1995. URL (visited 10 Feb. 1995).

[3] Eytan Adar and Jeremy Hylton. On-the-fly hyperlink creation for page images. In Shipman et al. [34], pages 173-176.

[4] Deborah Lines Andersen, Thomas J. Galvin, and Mark D. Giguere, editors. Navigating the Networks: Proceedings of the ASIS Mid-Year Meeting, Medford, NJ, 1994. American Society for Information Science (ASIS), Learned Information.

[5] Eva Bertha. Inter- and intrabibliographical relationships: A concept for a hypercatalog. In Helal [16], pages 211-223.
1Several documents cited in this thesis are available only in electronic form, via the World-Wide Web. It is dicult, however, to provide long-lasting citations for these documents. I have chosen to include the current URLs for the Web page and either the date of the most recent update to the page or the date of the last time I checked that the page was available (if the page was not dated). 93 [6] C. Mic Bowman, Peter B. Danzig, Darren R. Hardy, Udi Manber, and Michael F. Schwartz. The harvest information discovery and access system. In Committee [10]. [WWW document] URL (visited 10 Feb. 1995). [7] C. Mic Bowman, Peter B. Danzig, Udi Manber, and Michael F. Schwartz. Scalable internet resources discovery: Research problems and approaches. Communications of the ACM, 37(8):98{107, August 1994. [8] Michael K. Buckland, Mark H. Butler, Barbara A. Norhard, and Christian Plaunt. Union records and dossiers: Extended bibliographic information objects. In Andersen et al. [4], pages 42{57. [9] Michael J. Carey and Donovan A. Schneider, editors. Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, May 1995. Also published as SIGMOD Record 24(2), June 1995. [10] International World Wide Web Conference Committee, editor. Proceedings of the 2nd International Conference on the World-Wide Web, Chicago, December 1994. [WWW document, labeled \These documents are no longer being supported."] URL . [11] Walt Crawford. MARC for library use. G. K. Hall, Boston, second edition, 1989. [12] James R. Davis. Creating a networked computer science technical report library. D-Lib Magazine [Online journal], September 1995. URL . 94 [13] Ahmed K. Elmagarmid and Calton Pu. Introduction to the special issue on heterogeneous databases. ACM Computing Surveys, 22(3):175{178, September 1990. [14] M. C. Harrison. Implementation of the substring test by hashing. Communications of the ACM, 14(12):777{779, December 1971. [15] Michael Heaney. Object-oriented cataloging. 
Information Technology and Libraries, 14(3):135–153, September 1995.
[16] Ahmed H. Helal, editor. Opportunity 2000: Understanding and Serving Users in an Electronic Library. Essen University Library, 1993.
[17] Mauricio A. Hernandez and Salvatore J. Stolfo. The merge/purge problem for large databases. In Carey and Schneider [9], pages 127–138. Also published as SIGMOD Record 24(2), June 1995.
[18] Thomas B. Hickey and David J. Rypka. Automatic detection of duplicate monographic records. Journal of Library Automation, 12(2):125–142, June 1979.
[19] Robert E. Kahn. An introduction to the CS-TR project. [WWW document], December 1995. URL (version 11 Dec. 1995).
[20] Karen Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4):377–439, December 1992.
[21] Carl Lagoze and James R. Davis. Dienst: An architecture for distributed digital libraries. Communications of the ACM, 38(4):47, April 1995.
[22] Leslie Lamport. LaTeX: A Document Preparation System. Addison-Wesley, 2nd edition, 1994.
[23] Rebecca Lasher and Danny Cohen. A format for bibliographic records, June 1995. Internet Engineering Task Force, RFC 1807.
[24] David M. Levy and Catherine C. Marshall. Going digital: A look at assumptions underlying digital libraries. Communications of the ACM, 38(4):77–84, April 1995.
[25] Clifford Lynch and Hector Garcia-Molina, editors. Interoperability, Scaling and the Digital Libraries Research Agenda. HPCC/IITA Working Group, August 1995. A report on the May 18–19, 1995 IITA Digital Libraries Workshop.
[26] Clifford A. Lynch, Avra Michelson, Craig Summerhill, and Cecilia Preston. The nature of the NIDR challenge. Technical report, Coalition for Networked Information, 1995. URL (visited 12 Feb. 1995).
[27] Keith D. MacLaury. Automatic merging of monographic data bases–use of fixed-length keys derived from title strings. Journal of Library Automation, 12(2):143–155, June 1979.
[28] Edward T. O'Neill, Sally A. Rogers, and W. Michael Oskins.
Characteristics of duplicate records in OCLC's Online Union Catalog. Library Resources and Technical Services, 37(1):59–71, 1993.
[29] Edward T. O'Neill and Diane Vizine-Goetz. Quality control in online databases. In Williams [46], pages 125–156.
[30] Yannis Papakonstantinou, Hector Garcia-Molina, and Jennifer Widom. Object exchange across heterogeneous information sources. In Yu and Chen [50], pages 251–260.
[31] M. J. Ridley. An expert system for quality control and duplicate detection in bibliographic databases. Program, 26(1):1–18, January 1992.
[32] Gerard Salton, James Allan, Chris Buckley, and Amit Singhal. Automatic analysis, theme generation, and summarization of machine-readable text. Science, 264:1421–1426, June 1994.
[33] Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval, chapter 6.2, Vector Similarity Functions, pages 201–204. McGraw-Hill, 1983.
[34] Frank M. Shipman, III, Richard Furuta, and David M. Levy, editors. Proceedings of Digital Libraries '95, Department of Computer Science, Texas A&M University, College Station, TX 77843, June 1995. Hypermedia Research Laboratory.
[35] Narayanan Shivakumar and Hector Garcia-Molina. SCAM: A copy detection mechanism for digital documents. In Shipman et al. [34].
[36] Narayanan Shivakumar and Hector Garcia-Molina. The SCAM approach to copy detection in digital libraries. D-Lib Magazine [Online journal], November 1995. URL.
[37] Richard P. Smiraglia and Gregory H. Leazer. Toward the bibliographic control of works: Derivative bibliographic relationships in the online union catalog. In OCLC Research Bulletin, pages 56–59. 1994.
[38] Guy L. Steele, Jr. Common LISP, chapter 6.3, Equality Predicates, pages 103–110. Digital Press, 1990.
[39] Elaine Svenonius, editor. The Conceptual Foundations of Descriptive Cataloging. Academic Press, San Diego, Calif., 1989.
[40] Barbara B. Tillett. Bibliographic structures: The evolution of catalog entries, references, and tracings.
In Svenonius [39], pages 149–166.
[41] Barbara B. Tillett. A taxonomy of bibliographic relationships. Library Resources & Technical Services, 35(2):150–158, 1991.
[42] Stephen R. Toney. Cleanup and deduplication of an international bibliographic database. Information Technology and Libraries, 11(1):19–28, March 1992.
[43] Marc Van Heyningen. The unified computer science technical report index: Lessons in indexing diverse resources. In Committee [10]. [WWW document] URL (visited 10 Feb. 1995).
[44] Stuart Weibel. Automated cataloging: Implications for libraries and patrons. In F. W. Lancaster and Linda C. Smith, editors, Artificial Intelligence and Expert Systems: Will They Change the Library?, pages 67–80. University of Illinois, 1992.
[45] Gio Wiederhold. Mediators in the architecture of future information systems. IEEE Computer, 25(3):38–49, March 1992.
[46] Martha E. Williams, editor. Annual Review of Information Science and Technology, volume 23. Elsevier Science Publishers B.V., 1988.
[47] Martha E. Williams and Keith D. MacLaury. Automatic merging of monographic data bases–identification of duplicate records in multiple files: The IUCS scheme. Journal of Library Automation, 12(2):156–168, June 1979.
[48] Patrick Wilson. The second objective. In Svenonius [39], pages 5–16.
[49] Tak W. Yan and Hector Garcia-Molina. Duplicate detection in information dissemination. In Proceedings of the 21st International Very Large Database Conference (VLDB), September 1995. Zurich, Switzerland.
[50] P. S. Yu and A. L. P. Chen, editors. Proceedings of the 11th International Conference on Data Engineering, Taipei, Taiwan, March 1995. IEEE Computer Society Press.