CORLEONE - Core Linguistic Entity Online Extraction
This report presents CORLEONE (Core Linguistic Entity Online Extraction), a
pool of loosely coupled, general-purpose, basic, lightweight linguistic processing resources that can be used independently
to identify core linguistic entities and their features in free text. Currently, CORLEONE consists of five processing resources:
(a) a basic tokenizer, (b) a tokenizer that performs fine-grained token classification, (c) a component for morphological analysis, (d) a memory-efficient database-like dictionary look-up component, and (e) a sentence splitter. Linguistic
resources for several languages are provided. Additionally, CORLEONE includes a comprehensive library of string distance metrics relevant to the task of name variant matching. CORLEONE was developed in the Java programming language and makes heavy use of
state-of-the-art finite-state techniques.
Notably, CORLEONE components are used as basic linguistic processing resources in ExPRESS, a pattern matching engine based on regular expressions over feature structures, and in the real-time news event extraction system, both
developed by the Web Mining and Intelligence Group of the Support to External Security Unit of IPSC.
This report constitutes an end-user guide for CORLEONE and provides scientifically interesting details of how it
was implemented.
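To illustrate what a string distance metric for name variant matching looks like, the following is a minimal sketch of the classic Levenshtein edit distance in Java. It is not taken from CORLEONE: the class name NameDistance and the methods levenshtein and similarity are hypothetical names chosen for this example, and the metrics actually shipped in the library may differ.

```java
// Illustrative sketch only; NameDistance is a hypothetical class, not part of CORLEONE.
public final class NameDistance {
    private NameDistance() {}

    /** Levenshtein edit distance between a and b, computed with two rolling rows. */
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting i characters of a yields the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Swap rows so that prev always holds the most recently completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Distance normalized to a similarity in [0, 1]; two empty strings count as identical. */
    public static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / max;
    }
}
```

For example, NameDistance.levenshtein("Piskorski", "Piskorsky") returns 1, since the two spellings differ by a single substitution. A name matcher could treat pairs whose similarity exceeds a chosen threshold as candidate variants of the same name.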
PISKORSKI Jakub
2008-07-28
OPOCE
JRC45952
1018-5593
EUR 23393 EN
https://publications.jrc.ec.europa.eu/repository/handle/JRC45952