Title: A Tool Set for the Quick and Efficient Exploration of Large Document Collections
Authors: IGNAT Camelia; POULIQUEN Bruno; STEINBERGER Ralf; ERJAVEC Tomaz
Citation: Proceedings of the 27th Annual Meeting - The Symposium on Safeguards and Nuclear Material Management
Publisher: European Commission - Joint Research Centre - ESARDA
Publication Year: 2005
JRC Publication N°: JRC31097
URI: http://publications.jrc.ec.europa.eu/repository/handle/JRC31097
Type: Contributions to Conferences
Abstract: We present a set of multilingual text analysis tools that help analysts in any field to explore large document collections quickly, to determine whether the documents contain information of interest, and to find the relevant text passages. The tool set, which currently exists as a fully functional prototype, is expected to be particularly useful when users repeatedly have to sift through large collections of documents such as those downloaded automatically from the Internet. The proposed system takes a whole document collection as input, carries out a series of automatic analysis tasks, annotates the texts with the generated meta-information, stores the meta-information in a database, and provides the users with an interface that allows them to search and view the most pertinent text passages. In the first step, named entities (names of people, organisations and places) are recognised and stored. In the second step, highly similar documents are grouped into clusters of documents that are likely to report on the same event or subject domain. In the third step, domain-related terms are detected and extracted from each cluster of documents. In the fourth step, a zoomable map is generated for each document cluster to show the geographic coverage of the collected documents. Additionally, the names and specialist terminology found in each document cluster are listed. Hyperlinks allow the users to jump to the text passage where each name or term was found, and also to read the individual documents that are part of the cluster. All clusters of the document collection are made available in an HTML file that can be viewed with a standard web browser. When the system is used on a regular basis, it builds up a historical database that records which names have been mentioned together with which other names or places, and users can query this database to retrieve information extracted in the past.
JRC Institute: Institute for the Protection and Security of the Citizen
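
The historical database mentioned at the end of the abstract records which names have appeared together. The record does not describe how the prototype stores or queries this information, so the following is only a minimal sketch of the general idea, assuming that named entities have already been recognised for each document. The function names `build_cooccurrence` and `mentioned_with`, the in-memory dictionary layout, and the sample document identifiers are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence(docs):
    """Build a co-occurrence index from already-recognised names.

    docs: mapping of document id -> iterable of names found in that document.
    Returns a nested mapping: name -> other name -> set of document ids
    in which both names occur. (Hypothetical layout, for illustration only.)
    """
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, names in docs.items():
        # Each unordered pair of distinct names in a document counts once.
        for a, b in combinations(sorted(set(names)), 2):
            index[a][b].add(doc_id)
            index[b][a].add(doc_id)
    return index


def mentioned_with(index, name):
    """Return the names seen together with `name`, most frequent first.

    Ties are broken alphabetically; unknown names give an empty list.
    """
    partners = index.get(name, {})
    return sorted(partners, key=lambda other: (-len(partners[other]), other))


# Illustrative input only: invented document ids and entity lists.
docs = {
    "d1": ["Vienna", "IAEA"],
    "d2": ["IAEA", "Vienna", "ESARDA"],
    "d3": ["ESARDA", "Ispra"],
}
index = build_cooccurrence(docs)
print(mentioned_with(index, "IAEA"))
```

Keeping the supporting document ids for each pair, rather than a bare count, means a query result can link back to the documents in which the two names were found together, in the spirit of the hyperlinks described in the abstract.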
