Graph and Embedding based Approach for Text Clustering: Topic Detection in a Large Multilingual Public Consultation
We present a novel algorithm for multilingual text clustering built upon two well-studied techniques: aligned multilingual embeddings and community detection in graphs. The aim of our algorithm is to discover the underlying topics in a multilingual dataset using clustering. We present both a numerical evaluation using the silhouette and V-measure metrics, and a qualitative evaluation for which we propose a new systematic approach. Our algorithm shows robust overall performance, and its results were empirically evaluated by an analyst. The work was done in the context of a large multilingual public consultation, for which our new algorithm was deployed and used on a daily basis.
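The general idea named in the abstract, embedding texts in a shared multilingual space and then detecting communities in a similarity graph, can be illustrated with a minimal sketch. The abstract does not state which embedding model or which community detection algorithm the authors use, so everything below is an assumption: the vectors are made-up stand-ins for aligned multilingual embeddings, the similarity threshold of 0.8 is arbitrary, and a simple deterministic label propagation stands in for whatever community detection method the paper actually applies.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_graph(vectors, threshold=0.8):
    """Connect two texts when their embeddings are similar enough.

    The threshold is a hypothetical value chosen for this toy example.
    """
    graph = defaultdict(dict)
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            sim = cosine(vectors[i], vectors[j])
            if sim >= threshold:
                graph[i][j] = sim
                graph[j][i] = sim
    return graph


def label_propagation(graph, n_nodes, max_iter=20):
    """Deterministic label propagation (an assumed stand-in technique).

    Each node repeatedly adopts the label with the highest total edge
    weight among its neighbours; ties go to the smallest label.
    """
    labels = list(range(n_nodes))
    for _ in range(max_iter):
        changed = False
        for node in range(n_nodes):
            neighbours = graph.get(node)
            if not neighbours:
                continue  # isolated texts keep their own label
            scores = defaultdict(float)
            for other, weight in neighbours.items():
                scores[labels[other]] += weight
            best = min(scores, key=lambda lab: (-scores[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels


# Toy vectors standing in for aligned multilingual embeddings (invented data).
vectors = [
    [1.00, 0.10, 0.00],  # e.g. an English text on one topic
    [0.90, 0.20, 0.00],  # e.g. a French text on the same topic
    [0.95, 0.00, 0.10],  # e.g. a German text on the same topic
    [0.00, 1.00, 0.10],  # e.g. an English text on a second topic
    [0.10, 0.90, 0.00],  # e.g. a French text on the second topic
]

graph = similarity_graph(vectors)
labels = label_propagation(graph, len(vectors))
print(labels)
```

Because the embeddings are assumed to be aligned across languages, texts on the same topic end up close together regardless of language, so the communities found in the graph can be read as candidate topics. Cluster quality could then be scored with metrics such as the silhouette and V-measure mentioned in the abstract.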
STEFANOVITCH Nicolas;
JACQUET Guillaume;
DE LONGUEVILLE Bertrand;
2025-12-22
Association for Computing Machinery (ACM)
JRC132590
https://dl.acm.org/doi/fullHtml/10.1145/3543873.3587627
https://publications.jrc.ec.europa.eu/repository/handle/JRC132590
10.1145/3543873.3587627 (online)