Title: CORLEONE - Core Linguistic Entity Online Extraction
Authors: PISKORSKI JAKUB
Publisher: OPOCE
Publication Year: 2008
JRC Publication N°: JRC45952
ISSN: 1018-5593
Other Identifiers: EUR 23393 EN
URI: http://publications.jrc.ec.europa.eu/repository/handle/JRC45952
Type: EUR - Scientific and Technical Research Reports
Abstract: This report presents CORLEONE (Core Linguistic Entity Online Extraction), a pool of loosely coupled, general-purpose, lightweight linguistic processing resources that can be used independently to identify core linguistic entities and their features in free text. Currently, CORLEONE consists of five processing resources: (a) a basic tokenizer, (b) a tokenizer that performs fine-grained token classification, (c) a component for morphological analysis, (d) a memory-efficient database-like dictionary look-up component, and (e) a sentence splitter. Linguistic resources for several languages are provided. Additionally, CORLEONE includes a comprehensive library of string distance metrics relevant to the task of name variant matching. CORLEONE has been developed in the Java programming language and makes heavy use of state-of-the-art finite-state techniques. Notably, CORLEONE components are used as basic linguistic processing resources in ExPRESS, a pattern matching engine based on regular expressions over feature structures, and in the real-time news event extraction system, both of which were developed by the Web Mining and Intelligence Group of the Support to External Security Unit of IPSC. This report constitutes an end-user guide for CORLEONE and provides scientifically interesting details of how it was implemented.
JRC Institute: Institute for the Protection and Security of the Citizen

Files in This Item:
File: corleone_jrc_report_final.pdf
Size: 884.07 kB
Format: Adobe PDF


Items in repository are protected by copyright, with all rights reserved, unless otherwise indicated.