Synthesis Lectures on Information Concepts, Retrieval, and Services Series

Automatic Disambiguation of Author Names in Bibliographic Repositories

by Anderson A Ferreira, Marcos Andre Goncalves, and Alberto H F Laender

Published 1 June 2020

This book deals with a hard problem that is inherent to human language: ambiguity.

In particular, we focus on author name ambiguity, a type of ambiguity that exists in digital bibliographic repositories, which occurs when an author publishes works under distinct names or distinct authors publish works under similar names. This problem may be caused by a number of reasons, including the lack of standards and common practices, and the decentralized generation of bibliographic content. As a consequence, the quality of the main services of digital bibliographic repositories such as search, browsing, and recommendation may be severely affected by author name ambiguity. The focal point of the book is on automatic methods, since manual solutions do not scale to the size of the current repositories or the speed in which they are updated. Accordingly, we provide an ample view on the problem of automatic disambiguation of author names, summarizing the results of more than a decade of research on this topic conducted by our group, which were reported in more than a dozen publications that received over 900 citations so far, according to Google Scholar. We start by discussing its motivational issues (Chapter 1). Next, we formally define the author name disambiguation task (Chapter 2) and use this formalization to provide a brief, taxonomically organized, overview of the literature on the topic (Chapter 3). We then organize, summarize and integrate the efforts of our own group on developing solutions for the problem that have historically produced state-of-the-art (by the time of their proposals) results in terms of the quality of the disambiguation results. Thus, Chapter 4 covers HHC - Heuristic-based Clustering, an author name disambiguation method that is based on two specific real-world assumptions regarding scientific authorship. Then, Chapter 5 describes SAND - Self-training Author Name Disambiguator and Chapter 6 presents two incremental author name disambiguation methods, namely INDi - Incremental Unsupervised Name Disambiguation and INC- Incremental Nearest Cluster. Finally, Chapter 7 provides an overview of recent author name disambiguation methods that address new specific approaches such as graph-based representations, alternative predefined similarity functions, visualization facilities and approaches based on artificial neural networks. The chapters are followed by three appendices that cover, respectively: (i) a pattern matching function for comparing proper names and used by some of the methods addressed in this book; (ii) a tool for generating synthetic collections of citation records for distinct experimental tasks; and (iii) a number of datasets commonly used to evaluate author name disambiguation methods. In summary, the book organizes a large body of knowledge and work in the area of author name disambiguation in the last decade, hoping to consolidate a solid basis for future developments in the field.

Theoretical Foundations for Digital Libraries

by Edward Fox, Marcos Andre Goncalves, and Rao Shen

Published 31 July 2012

In 1991, a group of researchers chose the term digital libraries to describe an emerging field of research, development, and practice. Since then, Virginia Tech has had funded research in this area, largely through its Digital Library Research Laboratory. This book is the first in a four book series that reports our key findings and current research investigations.

Underlying this book series are six completed dissertations (Goncalves, Kozievitch, Leidig, Murthy, Shen, Torres), eight dissertations underway, and many masters theses. These reflect our experience with a long string of prototype or production systems developed in the lab, such as CITIDEL, CODER, CTRnet, Ensemble, ETANA, ETD-db, MARIAN, and Open Digital Libraries. There are hundreds of related publications, presentations, tutorials, and reports. We have built upon that work so this book, and the others in the series, will address digital library related needs in many computer science, information science, and library science (e.g., LIS) courses, as well as the requirements of researchers, developers, and practitioners.

Much of the early work in the digital library field struck a balance between addressing real-world needs, integrating methods from related areas, and advancing an ever-expanding research agenda. Our work has fit in with these trends, but simultaneously has been driven by a desire to provide a firm conceptual and formal basis for the field.Our aim has been to move from engineering to science. We claim that our 5S (Societies, Scenarios, Spaces, Structures, Streams) framework, discussed in publications dating back to at least 1998, provides a suitable basis. This book introduces 5S, and the key theoretical and formal aspects of the 5S framework.

While the 5S framework may be used to describe many types of information systems, and is likely to have even broader utility and appeal, we focus here on digital libraries. Our view of digital libraries is broad, so further generalization should be straightforward.

We have connected with related fields, including hypertext/hypermedia, information storage and retrieval, knowledge management, machine learning, multimedia, personal information management, and Web 2.0. Applications have included managing not only publications, but also archaeological information, educational resources, fish images, scientific datasets, and scientific experiments/ simulations.

Key Issues Regarding Digital Libraries

by Rao Shen, Marcos Andre Goncalves, and Edward A Fox

Published 1 February 2013

This is the second book based on the 5S (Societies, Scenarios, Spaces, Structures, Streams) approach to digital libraries (DLs). Leveraging the first volume, on Theoretical Foundations, we focus on the key issues of evaluation and integration. These cross-cutting issues serve as a bridge for those interested in DLs, connecting the introduction and formal discussion in the first book, with the coverage of key technologies in the third book, and of illustrative applications in the fourth book. These two topics have central importance in the DL field, allowing it to be treated scientifically as well as practically. In the scholarly world, we only really understand something if we know how to measure and evaluate it. In the Internet era of distributed information systems, we only can be practical at scale if we integrate across both systems and their associated content.

Evaluation of DLs must take place atmultiple levels,so we can address the different entities and their associated measures. Thus, for digital objects, we assess accessibility, pertinence, preservability, relevance, significance, similarity, and timeliness. Other measures are specific to higher-level constructs like metadata, collections, catalogs, repositories, and services.We tie these together through a case study of the 5SQual tool, which we designed and implemented to perform an automatic quantitative evaluation of DLs. Thus, across the Information Life Cycle, we describe metrics and software useful to assess the quality of DLs, and demonstrate utility with regard to representative application areas: archaeology and education.

Though integration has been a challenge since the earliest work on DLs, we provide the first comprehensive 5S-based formal description of the DL integration problem, cast in the context of related work. Since archaeology is a fundamentally distributed enterprise, we describe ETANADL, for integrating Near Eastern Archeology sites and information. Thus, we show how 5S-based modeling can lead to integrated services and content.

While the first book adopts a minimalist and formal approach to DLs, and provides a systematic and functional method to design and implement DL exploring services, here we broaden to practical DLs with richer metamodels, demonstrating the power of 5S for integration and evaluation.