CORLEONE - Core Linguistic Entity Online Extraction

PISKORSKI, Jakub

This report presents CORLEONE (Core Linguistic Entity Online Extraction), a pool of loosely coupled, general-purpose, basic, lightweight linguistic processing resources that can be used independently to identify core linguistic entities and their features in free text. Currently, CORLEONE consists of five processing resources: (a) a basic tokenizer, (b) a tokenizer which performs fine-grained token classification, (c) a component for performing morphological analysis, (d) a memory-efficient database-like dictionary look-up component, and (e) a sentence splitter. Linguistic resources for several languages are provided. Additionally, CORLEONE includes a comprehensive library of string distance metrics relevant for the task of name variant matching. CORLEONE has been developed in the Java programming language and makes heavy use of state-of-the-art finite-state techniques. Notably, CORLEONE components are used as basic linguistic processing resources in ExPRESS, a pattern matching engine based on regular expressions over feature structures, and in the real-time news event extraction system, both of which were developed by the Web Mining and Intelligence Group of the Support to External Security Unit of IPSC. This report constitutes an end-user guide for CORLEONE and provides scientifically interesting details of how it was implemented.

2008-07-28

OPOCE

JRC45952

1018-5593

EUR 23393 EN

https://publications.jrc.ec.europa.eu/repository/handle/JRC45952

https://publications.jrc.ec.europa.eu/repository/bitstream/JRC45952/corleone_jrc_report_final.pdf
