
The history, evolution, and description of methods for text document summarization.


Human-quality text summarization systems are difficult to design and even more difficult to evaluate, in part because documents can differ along several dimensions, such as length, writing style and lexical usage. Nevertheless, certain cues can often help suggest the selection of sentences for inclusion in a summary. [19] presents an analysis of news article summaries generated by sentence selection. Sentences are ranked for potential inclusion in a summary using a weighted combination of statistical and linguistic features. The statistical features were adapted from standard IR methods; the linguistic ones were derived from an analysis of news-wire summaries. To evaluate these features, the authors use a modified version of precision-recall curves, with a baseline derived from a theoretical analysis of text-span overlap under random selection. They illustrate their discussion with empirical results showing the importance of corpus-dependent baseline summarization standards, compression ratios, and carefully crafted long queries.
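
As an illustration of this style of ranking, the sketch below scores sentences by a weighted combination of statistical and linguistic features. The particular features, weights, and cue-word list are assumptions made for the example; they are not the feature set of [19].

```python
# A minimal sketch of sentence ranking by a weighted combination of
# statistical and linguistic features. Features, weights, and the cue-word
# list are illustrative assumptions, not those of the cited paper.
import re
from collections import Counter

CUE_WORDS = {"significant", "important", "conclude", "results"}  # assumed cue list

def sentence_features(sentence, position, doc_term_freq, num_sentences):
    words = re.findall(r"[a-z']+", sentence.lower())
    tf_score = sum(doc_term_freq[w] for w in words) / (len(words) or 1)     # statistical (IR-style)
    position_score = 1.0 - position / max(num_sentences - 1, 1)             # earlier sentences score higher
    cue_score = sum(1 for w in words if w in CUE_WORDS) / (len(words) or 1) # linguistic cue phrases
    length_score = min(len(words) / 20.0, 1.0)                              # penalize very short sentences
    return [tf_score, position_score, cue_score, length_score]

def rank_sentences(sentences, weights=(0.4, 0.3, 0.2, 0.1)):
    doc_term_freq = Counter(re.findall(r"[a-z']+", " ".join(sentences).lower()))
    scored = []
    for i, s in enumerate(sentences):
        feats = sentence_features(s, i, doc_term_freq, len(sentences))
        score = sum(w * f for w, f in zip(weights, feats))
        scored.append((score, i, s))
    return sorted(scored, reverse=True)

if __name__ == "__main__":
    doc = [
        "The company reported significant growth in the first quarter.",
        "Analysts had expected flat results.",
        "We conclude that demand remains strong.",
    ]
    for score, idx, sent in rank_sentences(doc):
        print(f"{score:.3f}  [{idx}]  {sent}")
```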


Automated document summarization dates back at least to Luhn’s work at IBM in the fifties. Several researchers continued investigating various approaches to the problem through the seventies and eighties. The resources devoted to the problem grew by several orders of magnitude with the advent of the world-wide web and large-scale search engines. Several innovative approaches began to be explored: linguistic approaches, statistical and information-centric approaches, and combinations of the two. Almost all of this work focused on summarization by text-span extraction, with sentences as the most common type of text span. This technique creates document summaries by concatenating selected text-span excerpts from the original text. The paradigm transforms the problem of summarization, which in the most general case requires the ability to understand, interpret, abstract and generate a new document, into a different and possibly simpler one: ranking sentences from the original documents according to their salience or their likelihood of being part of a summary. This kind of summarization is closely related to the more general problem of information retrieval, where documents from a document set (rather than sentences from a document) are ranked in order to retrieve the best matches.
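
A minimal sketch of the extraction paradigm itself is given below: sentences are scored by a simple frequency-based salience measure (an assumption chosen only for illustration) and the top-ranked spans are concatenated in their original order.

```python
# A minimal sketch of summarization by text-span extraction: rank sentences
# by a crude term-frequency salience proxy, keep the top spans, and
# concatenate them in document order.
import re
from collections import Counter

def extractive_summary(text, num_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    term_freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def salience(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(term_freq[w] for w in words) / (len(words) or 1)

    # Rank by salience, keep the top spans, then restore document order.
    top = sorted(sentences, key=salience, reverse=True)[:num_sentences]
    return " ".join(s for s in sentences if s in top)

if __name__ == "__main__":
    article = ("Automated summarization dates back to Luhn. "
               "Extraction selects spans from the original text. "
               "Selected spans are concatenated to form a summary.")
    print(extractive_summary(article))
```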


Human summarization of documents produces a fixed-length generic summary that reflects the key points the abstractor deems important. In many situations users will be interested in facts other than those contained in the generic summary, motivating the need for query-relevant summaries. The approach to text summarization in [19] allows both generic and query-relevant summaries by scoring sentences with respect to both statistical and linguistic features. An ideal query-relevant text summary must contain the relevant information the user is looking for while eliminating irrelevant and redundant information. Unlike document information retrieval, text summarization evaluation has not extensively addressed the performance of different methodologies by evaluating the contributions of each component, even though most summarization systems use linguistic knowledge as well as a statistical component.
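
The sketch below illustrates, under simple assumptions, how a single scoring scheme can produce both generic and query-relevant extracts while suppressing redundancy. The cosine measure, the greedy MMR-style selection, and the trade-off parameter lam are illustrative choices, not the mechanism of [19].

```python
# A minimal sketch of generic versus query-relevant sentence selection with
# redundancy elimination (a greedy, MMR-like scheme; an assumption for
# illustration only).
import re
from collections import Counter

def vectorize(text):
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (sum(v * v for v in a.values()) ** 0.5) * (sum(v * v for v in b.values()) ** 0.5)
    return dot / norm if norm else 0.0

def summarize(sentences, query=None, k=2, lam=0.7):
    doc_vec = vectorize(" ".join(sentences))
    target = vectorize(query) if query else doc_vec          # query-relevant vs. generic target
    vecs = [vectorize(s) for s in sentences]
    selected = []
    while len(selected) < min(k, len(sentences)):
        best, best_score = None, float("-inf")
        for i, v in enumerate(vecs):
            if i in selected:
                continue
            relevance = cosine(v, target)
            redundancy = max((cosine(v, vecs[j]) for j in selected), default=0.0)
            score = lam * relevance - (1 - lam) * redundancy  # penalize repeated content
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return [sentences[i] for i in sorted(selected)]

if __name__ == "__main__":
    sents = ["The senate passed the budget bill.",
             "The budget bill was passed by the senate on Tuesday.",
             "Opposition parties criticized the spending plan."]
    print(summarize(sents))                                   # generic summary
    print(summarize(sents, query="opposition criticism"))     # query-relevant summary
```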


Document summarization by graph representation of meaningful text content.


Text summarization attempts to address the problem of information overload by taking a partially structured source text, extracting information content from it, and presenting the most important content to the user in a manner sensitive to the user’s needs. Clearly, some sort of summarization is indispensable for dealing with such massive and unprecedented amounts of information. In many modern information retrieval applications, a common problem is the existence of multiple documents covering similar information, as in the case of multiple news stories about an event or a sequence of events. A particular challenge for text summarization is to summarize the similarities and differences in information content among these documents.


A variety of approaches exist for extracting content for multi-document summarization, varying in their degree of domain dependence. In constrained domains, e.g., articles on terrorist events, natural language message understanding systems can extract relationships between entities, such as the location and target of a terrorist event. Such relationships can be used to identify areas of agreement and disagreement across texts. For arbitrary text, such techniques do not apply, and word-based content representations have traditionally been exploited instead. However, as recent progress in information extraction shows, it is possible to extract not just salient words but also phrases and proper names from unrestricted text in a highly scalable manner. As a result, such extraction techniques are now being exploited in general-purpose information retrieval tools.
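
The following sketch illustrates a word-based content representation of the kind described above: content words and crude proper-name candidates are collected per document, and their overlap points at areas of agreement. The capitalization heuristic stands in for a real information-extraction component and is purely an assumption of the example.

```python
# A minimal sketch of a word-based content representation for comparing two
# documents. Capitalized sequences serve as naive proper-name candidates
# (an assumption; a real system would use an information-extraction step).
import re
from collections import Counter

def content_terms(text):
    names = set(re.findall(r"(?:[A-Z][a-z]+ )*[A-Z][a-z]+", text))  # naive name candidates
    words = Counter(re.findall(r"[a-z']{4,}", text.lower()))        # crude content-word filter
    return words, names

def compare(doc_a, doc_b):
    words_a, names_a = content_terms(doc_a)
    words_b, names_b = content_terms(doc_b)
    return {
        "shared_words": set(words_a) & set(words_b),   # likely areas of agreement
        "shared_names": names_a & names_b,
        "only_in_a": set(words_a) - set(words_b),      # likely points of difference
        "only_in_b": set(words_b) - set(words_a),
    }

if __name__ == "__main__":
    a = "President Smith met union leaders in Geneva to discuss the strike."
    b = "Union leaders said the Geneva talks with Smith made little progress."
    for key, value in compare(a, b).items():
        print(key, sorted(value))
```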


The focus of the work in [18] is to provide a tool for analyzing document collections such as multiple news stories about an event or a sequence of events. Given a collection of such documents, the tool can detect and align similar regions of text among members of the collection and detect relevant differences among them. The context-sensitive aspect of summarization is particularly important in this task: depending on the user’s interests, there may be many different sets of similarities and differences. The summarization approach of [18] represents context in terms of a topic, a set of words which can be drawn from a user query or profile. Given a topic and a pair of related news stories, the method identifies salient regions of each story related to the topic and then compares them, summarizing similarities and differences.
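
A rough sketch of this topic-driven comparison, under simple bag-of-words assumptions (the overlap threshold and the treatment of a "region" as a sentence are choices made for the example, not the method of [18]), might look as follows.

```python
# A minimal sketch of topic-driven comparison of two related stories:
# sentences overlapping the topic words are treated as salient regions, and
# their vocabularies are compared to summarize similarities and differences.
import re

def salient_regions(story, topic_words, threshold=1):
    sentences = re.split(r"(?<=[.!?])\s+", story.strip())
    regions = []
    for s in sentences:
        words = set(re.findall(r"[a-z']+", s.lower()))
        if len(words & topic_words) >= threshold:
            regions.append((s, words))
    return regions

def compare_stories(story_a, story_b, topic):
    topic_words = set(topic.lower().split())
    regions_a = salient_regions(story_a, topic_words)
    regions_b = salient_regions(story_b, topic_words)
    vocab_a = set().union(*(w for _, w in regions_a)) if regions_a else set()
    vocab_b = set().union(*(w for _, w in regions_b)) if regions_b else set()
    return {
        "salient_in_a": [s for s, _ in regions_a],
        "salient_in_b": [s for s, _ in regions_b],
        "similarities": vocab_a & vocab_b,   # content the stories share
        "differences": vocab_a ^ vocab_b,    # content unique to one story
    }

if __name__ == "__main__":
    a = "The earthquake struck at dawn. Rescue teams reached the coastal town by noon."
    b = "Rescue workers arrived a day after the earthquake. Officials reported power outages."
    for key, value in compare_stories(a, b, topic="earthquake rescue").items():
        print(key, value)
```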


Traditionally, abstracts were written by authors or by professional abstractors with the goal of dissemination to a particular, usually broad, readership community. These “generic” abstracts were traditionally used as surrogates for full text. As our computing environments continue to accommodate increased full-text searching, browsing, and personalized information filtering, “user-focused” abstracts, customized to the user’s interests, have assumed increased importance. The work in [18] reports techniques for generating user-focused, indicative, moderately fluent, extract-based summaries for multiple sources. Automatic text summarization can be characterized as involving three phases of processing: analysis, refinement, and synthesis.
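
A skeleton of this three-phase view is sketched below; the concrete choices inside each phase are placeholders rather than the components of any particular cited system.

```python
# A minimal skeleton of the three-phase view of automatic summarization:
# analysis builds a representation, refinement decides what is salient,
# synthesis presents the salient content. Each phase here is a placeholder.
import re
from collections import Counter

def analysis(text):
    """Build a simple representation: sentences plus their term counts."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [(s, Counter(re.findall(r"[a-z']+", s.lower()))) for s in sentences]

def refinement(representation):
    """Score sentences by document-wide term frequency to decide salience."""
    doc_counts = Counter()
    for _, counts in representation:
        doc_counts.update(counts)
    return sorted(representation,
                  key=lambda item: sum(doc_counts[w] for w in item[1]),
                  reverse=True)

def synthesis(salient, k=2):
    """Extract the top-k salient sentences as the presented summary."""
    return " ".join(sentence for sentence, _ in salient[:k])

if __name__ == "__main__":
    text = ("Floods hit the region on Monday. Thousands were evacuated. "
            "Officials said the river crested overnight. Aid agencies requested donations.")
    print(synthesis(refinement(analysis(text))))
```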


In [18], the analysis phase builds a representation based on domain-independent information extraction techniques. Text items such as words, phrases, and proper names are extracted and represented in a graph: nodes represent word instances at different positions, with phrases and names formed out of words. The refinement phase exploits cohesion relationships between term instances to determine what is salient. Finally, the synthesis phase takes the set of salient items discovered by the refinement phase and uses it to extract text from the source and present a summary.
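
The sketch below gives a rough flavor of such a graph representation: nodes stand for word instances at particular positions, adjacency and repetition supply cohesion-like links, and salience is approximated by node degree. The edge definitions and the degree-based salience measure are assumptions of the example, not the exact relations used in [18].

```python
# A minimal sketch of a graph over word instances: adjacency links act as a
# phrase-like relation, repetition links connect instances of the same word,
# and degree approximates salience. These are illustrative assumptions.
import re
from collections import defaultdict

def build_word_graph(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    nodes = list(enumerate(tokens))                  # (position, word) = one word instance
    edges = defaultdict(set)
    for (i, _), (j, _) in zip(nodes, nodes[1:]):     # adjacency links
        edges[i].add(j)
        edges[j].add(i)
    positions = defaultdict(list)
    for i, w in nodes:
        positions[w].append(i)
    for occurrences in positions.values():           # repetition links
        for i in occurrences:
            for j in occurrences:
                if i != j:
                    edges[i].add(j)
    return nodes, edges

def salient_words(text, k=3):
    nodes, edges = build_word_graph(text)
    scored = sorted(nodes, key=lambda n: len(edges[n[0]]), reverse=True)
    seen, result = set(), []
    for _, word in scored:                           # report distinct words by best instance
        if word not in seen:
            seen.add(word)
            result.append(word)
        if len(result) == k:
            break
    return result

if __name__ == "__main__":
    text = ("The merger talks collapsed. Both companies blamed the merger terms. "
            "Shareholders questioned the talks.")
    print(salient_words(text))
```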


Of course, if a system is able to discover, given a topic and a pair of related documents, the salient items of text in each document that are related to the topic, then these salient items can be compared to establish similarities and differences between the document pair. This forms the basis of a general scheme for multi-document summarization.
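
Under the same simple assumptions as above, the general scheme can be sketched as a pairwise comparison over a collection: each document contributes its topic-related salient items, and the shared and unique items summarize similarities and differences.

```python
# A minimal sketch of the general multi-document scheme: compute each
# document's topic-related salient items and compare them pairwise. The
# word-overlap salience test is an assumption for illustration only.
import re
from itertools import combinations

def salient_items(document, topic_words):
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    return {s for s in sentences
            if set(re.findall(r"[a-z']+", s.lower())) & topic_words}

def multi_document_comparison(documents, topic):
    topic_words = set(topic.lower().split())
    salient = {name: salient_items(text, topic_words) for name, text in documents.items()}
    report = {}
    for a, b in combinations(salient, 2):
        report[(a, b)] = {
            "shared": salient[a] & salient[b],
            "unique_to_" + a: salient[a] - salient[b],
            "unique_to_" + b: salient[b] - salient[a],
        }
    return report

if __name__ == "__main__":
    docs = {
        "story1": "The strike ended today. Workers accepted the pay offer.",
        "story2": "The strike ended today. Management called the deal fair.",
    }
    for pair, diff in multi_document_comparison(docs, "strike pay").items():
        print(pair, diff)
```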


 