
Multi-document summarization. History and evolution. The main approaches and requirements.


Generating an effective summary requires the summarizer to select, evaluate, order and aggregate items of information according to their relevance to a particular subject or purpose. These tasks can either be approximated by IR techniques or done in greater depth with fuller natural language processing. Most previous work in summarization has attempted to deal with these issues by focusing on a related, but simpler, problem: text-span deletion, in which the system attempts to delete "less important" spans of text from the original document; the text that remains is deemed a summary. Work on automated document summarization by text-span extraction dates back at least to work at IBM in the fifties (Luhn, 1958). Most of the work in sentence extraction applied statistical techniques (frequency analysis, variance analysis, etc.) to linguistic units such as tokens, names, anaphora, etc. More recently, other approaches have investigated the utility of discourse structure (Marcu, 1997), the combination of information extraction and language generation (Klavans and Shaw, 1995; McKeown et al., 1995), and the use of machine learning to find patterns in text (Teufel and Moens, 1997; Barzilay and Elhadad, 1997; Strzalkowski et al., 1998). Some of these approaches to single-document summarization have been extended to deal with multi-document summarization (Mani and Bloedorn, 1997; Goldstein and Carbonell, 1998; TIPSTER, 1998b; Radev and McKeown, 1998; Mani and Bloedorn, 1999; McKeown et al., 1999; Stein et al., 1999).
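To make the early statistical approach concrete, the following is a minimal sketch of Luhn-style frequency-based sentence extraction in Python. The stop-word list and scoring details are illustrative assumptions, not a reconstruction of any cited system.

from collections import Counter

# Illustrative stop-word list; a real system would use a much fuller one.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are", "that", "for"}

def luhn_style_extract(sentences, num_sentences=3):
    """Score sentences by the average frequency of their significant words
    and return the top scorers in their original document order."""
    words = [w.lower().strip(".,;:") for s in sentences for w in s.split()]
    freq = Counter(w for w in words if w and w not in STOP_WORDS)

    def score(sentence):
        significant = [w.lower().strip(".,;:") for w in sentence.split()]
        significant = [w for w in significant if w and w not in STOP_WORDS]
        return sum(freq[w] for w in significant) / len(significant) if significant else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:num_sentences])]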


These include comparing templates filled in by extracting information from the documents, using specialized, domain-specific knowledge sources, and then generating natural language summaries from the templates (Radev and McKeown, 1998); comparing named entities, extracted using specialized lists, between documents and selecting the most relevant section (TIPSTER, 1998b); finding co-reference chains in the document set to identify common sections of interest (TIPSTER, 1998b); or building activation networks of related lexical items (identity mappings, synonyms, hypernyms, etc.) to extract text spans from the document set (Mani and Bloedorn, 1997). Another system (Stein et al., 1999) creates a multi-document summary from multiple single-document summaries, an approach that can be sub-optimal in some cases, because the process of generating the final multi-document summary takes as input the individual summaries rather than the complete documents, particularly when the single-document summaries contain much overlapping information.
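As a rough illustration of the named-entity comparison idea, the sketch below scores candidate sections by the overlap of their entities with entities seen across the collection. The capitalization-based extract_entities stand-in is an assumption made here for brevity; the cited systems used specialized entity lists.

def extract_entities(text):
    """Crude stand-in for a specialized named-entity extractor:
    treats capitalized tokens as entity mentions."""
    return {w.strip(".,;:") for w in text.split() if w[:1].isupper()}

def most_relevant_section(sections, collection):
    """Pick the section whose entities overlap most with the entities
    seen across the whole document collection."""
    collection_entities = set().union(*(extract_entities(d) for d in collection))
    return max(sections, key=lambda s: len(extract_entities(s) & collection_entities))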


The Columbia University system (McKeown et al., 1999) creates a multi-document summary using machine learning and statistical techniques to identify similar sections.


Consider the situation where the user issues a search query, for instance on a news topic, and the retrieval system finds hundreds of closely-ranked documents in response. Many of these documents are likely to repeat much the same information, while differing in certain parts. Summaries of the individual documents would help, but are likely to be very similar to each other, unless the summarization system takes into account other summaries that have already been generated. Multi-document summarization, capable of summarizing either complete document sets or single documents in the context of previously summarized ones, is likely to be essential in such situations. Ideally, multi-document summaries should contain the key shared relevant information among all the documents only once, plus other information unique to some of the individual documents that is directly relevant to the user's query. Though many of the same techniques used in single-document summarization can also be used in multi-document summarization, there are at least four significant differences:


1. The degree of redundancy in information contained within a group of topically-related articles is much higher than the degree of redundancy within an article, as each article is apt to describe the main point as well as necessary shared background. Hence anti-redundancy methods are more crucial (a sketch of one such method follows this list).
2. A group of articles may contain a temporal dimension, typical in a stream of news reports about an unfolding event. Here, later information may override earlier, more tentative or incomplete accounts.
3. The compression ratio (i.e. the size of the summary with respect to the size of the document set) will typically be much smaller for collections of dozens or hundreds of topically related documents than for single document summaries. Summarization becomes significantly more difficult when compression demands increase.
4. The co-reference problem in summarization presents even greater challenges for multi-document than for single-document summarization.
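To illustrate point 1, here is a minimal sketch of maximal-marginal-relevance (MMR) style passage selection in the spirit of Goldstein and Carbonell (1998): each candidate is scored by query relevance minus its similarity to passages already chosen. The word-overlap similarity and the lambda weighting are illustrative assumptions.

def word_overlap(a, b):
    """Illustrative Jaccard word-overlap similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def mmr_select(passages, query, k, lam=0.7):
    """Greedily pick k passages, trading off query relevance against
    similarity to the passages already selected."""
    selected, candidates = [], list(passages)
    while candidates and len(selected) < k:
        def mmr_score(p):
            redundancy = max((word_overlap(p, s) for s in selected), default=0.0)
            return lam * word_overlap(p, query) - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected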


There are two types of situations in which multi-document summarization would be useful: (1) the user is faced with a collection of dissimilar documents and wishes to assess the information landscape contained in the collection, or (2) there is a collection of topically-related documents, extracted from a larger, more diverse collection as the result of a query, or a topically-cohesive cluster. In the first case, if the collection is large enough, it only makes sense to first cluster and categorize the documents, and then sample from, or summarize, each cohesive cluster. Hence, a "summary" would consist of a visualization of the information landscape, where features could be clusters or summaries thereof. In the second case, it is possible to build a synthetic textual summary containing the main point(s) of the topic, augmented with non-redundant background information and/or query-relevant elaborations. This is the focus of the work reported here, including the necessity to eliminate redundancy among the information content of multiple related documents.
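As a sketch of the first case, one might cluster the collection and represent each cohesive cluster by its most central document. The TF-IDF plus k-means choice below is an assumption made for illustration, not a technique prescribed by the text, and picking the centroid-nearest document is a crude stand-in for a real cluster summary.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_then_sample(documents, n_clusters=5):
    """Cluster the collection, then represent each cohesive cluster
    by the document nearest its centroid."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform(documents)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(vectors)
    distances = km.transform(vectors)  # distance of each document to each centroid
    representatives = []
    for label in range(n_clusters):
        members = [i for i in range(len(documents)) if km.labels_[i] == label]
        representatives.append(documents[min(members, key=lambda i: distances[i][label])])
    return representatives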
Users' information-seeking needs and goals vary tremendously. When a group of three people created multi-document summaries of 10 articles about the Microsoft trial from a given day, one summary focused on the details presented in court, one on an overall gist of the day's events, and the third on a high-level view of the goals and outcome of the trial. Thus, an ideal multi-document summarization system would be able to address these different levels of detail, which is difficult without natural language understanding. An interface for the summarization system needs to permit the user to enter information-seeking goals, via a query, a background interest profile and/or a relevance feedback mechanism.


Following is a list of requirements for multi-document summarization:
• clustering: The ability to cluster similar documents and passages to find related information.
• coverage: The ability to find and extract the main points across documents.
• anti-redundancy: The ability to minimize redundancy between passages in the summary.
• summary cohesion criteria: The ability to combine text passages in a useful manner for the reader (the orderings below are sketched in code after this list). This may include:
– document ordering: All text segments of the highest-ranking document, then all segments from the next highest-ranking document, etc.
– news-story principle (rank ordering): Present the most relevant and diverse information first, so that the reader gets the maximal information content even if they stop reading the summary.
– topic-cohesion: Group the passages together by topic, clustering using passage similarity criteria, and present the information by each cluster's centroid passage rank.
– time line ordering: Text passages ordered based on the occurrence of events in time.
• coherence: Summaries generated should be readable and relevant to the user.
• context: Include sufficient context so that the summary is understandable to the reader.
• identification of source inconsistencies: Articles often have errors (such as billion reported as million, etc.); multi-document summarization must be able to recognize and report source inconsistencies.
• summary updates: A new multi-document summary must take into account previous summaries in generating new summaries. In such cases, the system needs to be able to track and categorize events.
• effective user interfaces:
– Attributability: The user needs to be able to easily access the source of a given passage. This could be the single-document summary.
– Relationship: The user needs to view passages related to the text passage shown, which can highlight source inconsistencies.
– Source Selection: The user needs to be able to select or eliminate various sources. For example, the user may want to eliminate information from some less reliable foreign news reporting sources.
– Context: The user needs to be able to zoom in on the context surrounding the chosen passages.
– Redirection: The user should be able to highlight certain parts of the synthetic summary and give a command to the system indicating that these parts are to be weighted heavily and that other parts are to be given a lesser weight.
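Three of the cohesion orderings listed above could be realized roughly as follows. The Passage record and its fields (doc_rank, position, relevance, timestamp) are hypothetical bookkeeping, assumed here for illustration.

from dataclasses import dataclass

@dataclass
class Passage:
    text: str
    doc_rank: int     # retrieval rank of the source document
    position: int     # position within the source document
    relevance: float  # relevance score of the passage
    timestamp: float  # time of the reported event

def document_ordering(passages):
    return sorted(passages, key=lambda p: (p.doc_rank, p.position))

def news_story_ordering(passages):
    # Most relevant information first, so a reader who stops early
    # still gets the maximal information content.
    return sorted(passages, key=lambda p: -p.relevance)

def time_line_ordering(passages):
    return sorted(passages, key=lambda p: p.timestamp)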


Multi-document summarization in specialized domains. Using event templates in information extraction.


[1] presents and evaluates the initial version of RIPTIDES, a system that combines information extraction (IE), extraction-based summarization, and natural language generation to support user-directed multi-document summarization. (RIPTIDES stands for RapIdly Portable Translingual Information extraction and interactive multiDocumEnt Summarization.) The authors hypothesize that IE-supported summarization will enable the generation of more accurate and targeted summaries in specific domains than is possible with current domain-independent techniques.


The system first requires that the user select (1) a set of documents in which to search for information, and (2) one or more scenario templates (extraction domains) to activate. The user optionally provides filters and preferences on the scenario template slots, specifying what information s/he wants to be reported in the summary. RIPTIDES next applies its Information Extraction subsystem to generate a database of extracted events for the selected domain and then invokes the Summarizer to generate a natural language summary of the extracted information subject to the user’s constraints. In the subsections below, we describe the IE system and the Summarizer in turn.
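A hedged sketch of this control flow is given below. Every function name is a hypothetical placeholder, since the paper does not publish the RIPTIDES API; the IE and generation steps are stubbed out to show only the pipeline shape.

def run_extraction(document, template):
    """Placeholder for the IE subsystem: return event structures
    extracted from one document for one scenario template."""
    return []

def matches_filters(event, slot_filters):
    """Placeholder check of an extracted event against the user's
    slot filters and preferences."""
    return all(event.get(slot) == value for slot, value in slot_filters.items())

def generate_summary(events):
    """Placeholder for the Summarizer's natural language generation step."""
    return "\n".join(str(e) for e in events)

def riptides_style_pipeline(documents, scenario_templates, slot_filters=None):
    event_db = [event
                for doc in documents
                for template in scenario_templates
                for event in run_extraction(doc, template)]
    if slot_filters:
        event_db = [e for e in event_db if matches_filters(e, slot_filters)]
    return generate_summary(event_db)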


The domain for the initial IE-supported summarization system and its evaluation is natural disasters. Very briefly, a top-level natural disasters scenario template contains: document-level information (e.g. docno, date-time); zero or more agent elements denoting each person, group, and organization in the text; and zero or more disaster elements. Agent elements encode standard information for named entities (e.g. name, position, geo-political unit). For the most part, disaster elements also contain standard event-related fields (e.g. type, number, date, time, location, damage sub-elements). The final product of the RIPTIDES system, however, is not a set of scenario templates, but a user-directed multi-document summary. This difference in goals influences a number of template design issues. First, disaster elements must distinguish different reports or views of the same event from multiple sources. As a result, the system creates a separate disaster event for each such account. Disaster elements should also include the reporting agent, date, time, and location whenever possible. In addition, damage elements (i.e. human and physical effects) are best grouped according to the reporting event. Finally, a slight broadening of the IE task was necessary in that extracted text was not constrained to noun phrases. In particular, adjectival and adverbial phrases that encode reporter confidence, and sentences and clauses denoting relief effort progress, appear beneficial for creating informed summaries.
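The template structure just described might be modeled with data structures along these lines, built only from the fields named above; the field types and defaults are assumptions.

from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str                     # standard named-entity information
    position: str = ""
    geo_political_unit: str = ""

@dataclass
class Damage:                     # human and physical effects
    human_effects: str = ""
    physical_effects: str = ""

@dataclass
class Disaster:
    type: str                     # e.g. "earthquake"
    number: str = ""
    date: str = ""
    time: str = ""
    location: str = ""
    damage: list = field(default_factory=list)   # Damage elements, grouped by reporting event
    reporting_agent: str = ""     # one Disaster element per source account

@dataclass
class ScenarioTemplate:
    docno: str                    # document-level information
    date_time: str
    agents: list = field(default_factory=list)    # Agent elements
    disasters: list = field(default_factory=list) # Disaster elements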




 