Udo Hahn & Inderjeet Mani
Research and development in automatic text summarization has gained increasing importance with the rapid growth of the Web and of on-line information services, which provide access to vast amounts of textual data. The goal of automatic text summarization is to take a partially structured source text, determine its information content, and present the most important content in a manner sensitive to the needs of the user and the task to be accomplished. This tutorial is intended to give an overview of the main methodologies and systems currently available to deal with these challenges, as well as recent evaluation efforts.
The tutorial begins with a discussion of the varieties of text summarization. Naturally occurring human summarization activities are contrasted with strategies underlying professional abstracting. Summarization methods and tasks are differentiated from the closely related ones found in other activities involving text analysis, such as information retrieval (document filtering), information extraction, or text mining. Both shallow approaches, which incorporate statistical and linguistic techniques, and deeper approaches, in which summarization is characterized as an AI reasoning task, are discussed (a minimal illustration of the shallow, statistical end of this spectrum follows below). This leads to the presentation of various system architectures for summarization, including a characterization of the key condensation operations involved. Evaluation metrics and current evaluation efforts, including the U.S. Government's TIPSTER SUMMAC evaluation, are discussed in detail. New research areas such as multi-document and multi-media summarization are also treated. In addition, we characterize the state of commercial summarization products and conclude by identifying outstanding problems which remain challenging topics for future Ph.D. theses.
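For readers unfamiliar with what a "shallow" statistical approach looks like in practice, the following sketch ranks sentences by the frequency of their content words and extracts the top-scoring ones. It is only an illustration of the general idea, not a system presented in the tutorial; the stopword list, the naive sentence splitting, and the scoring function are assumptions made for brevity.

    # Minimal sketch of frequency-based extractive summarization (illustrative only).
    import re
    from collections import Counter

    STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "that", "for"}

    def summarize(text: str, num_sentences: int = 2) -> str:
        # Naive sentence splitting on end-of-sentence punctuation.
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        # Term frequencies over content words (stopwords removed).
        words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
        freq = Counter(words)

        # Score a sentence by the summed corpus frequency of its content words.
        def score(sentence: str) -> int:
            return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower())
                       if w not in STOPWORDS)

        ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
        chosen = sorted(ranked[:num_sentences])  # keep extracted sentences in document order
        return " ".join(sentences[i] for i in chosen)

    if __name__ == "__main__":
        print(summarize("Text summarization condenses documents. "
                        "Summarization systems select important sentences. "
                        "The weather was nice yesterday."))

Deeper, knowledge-based approaches discussed in the tutorial go well beyond such surface statistics, reasoning over the content of the text rather than counting its words.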
The tutorial is aimed mainly at researchers, students, software developers, and research managers with an interest in sophisticated tools for taming the ever-increasing flow of textual data.
Prerequisite knowledge:
Some familiarity with questions relating to natural language processing and information retrieval techniques is helpful, but is not a necessary prerequisite for attending the tutorial. A background in general computer science is required, and prior exposure to artificial intelligence methodologies is desirable.
Udo Hahn is professor of computational linguistics at Albert-Ludwigs-Universität Freiburg, Germany. He works at the intersection of text understanding and information systems, including areas such as text summarization, intelligent text retrieval, acquisition of knowledge from texts, and text mining. He has been involved in the development of a German-language text summarization system (TOPIC). His most recent work aims at the incorporation of condensation operators into the formal framework of description logics. Udo Hahn has (co-)authored four books, thirty-five articles in journals and edited volumes, and more than ninety proceedings contributions.
Inderjeet Mani is a Principal Scientist in the Artificial Intelligence Laboratory at the MITRE Corporation in Reston, Virginia, where he has led a variety of projects in information retrieval, information extraction, and text summarization. He holds one patent and is the author of more than thirty refereed papers in the areas of text summarization, information retrieval, machine translation, natural language generation, natural language interfaces, and formal semantics. Dr. Mani's current summarization-related activities include assisting the U.S. Government on the TIPSTER Summarization Evaluation Task (SUMMAC) and co-editing a book on text summarization (Advances in Automatic Text Summarization, to be published by MIT Press).