Name: Juliana Pinheiro Campos Pirovani
Type: PhD thesis
Publication date: 07/02/2019

Namesort descending Role
Elias Silva de Oliveira Advisor *

Examining board:

Namesort descending Role
Claudine Santos Badue Internal Examiner *
Elias Silva de Oliveira Advisor *
Eric Guy Claude Laporte External Examiner *
Patrick Marques Ciarelli External Examiner *
Priscila Machado Vieira Lima External Examiner *

Summary: Named Entity Recognition involves automatically identifying and classifying entities such as persons, places, and organizations, and it is a very important task in Information Extraction. Named Entity Recognition systems can be developed using the following approaches: linguistics, machine learning or hybrid. This work proposes the use of a hybrid approach, called CRF+LG, for Named Entity Recognition in Portuguese texts in order to explore the advantages of both linguistics and machine learning approaches.

The proposed approach uses Conditional Random Fields (CRF) considering the term classification obtained by a Local Grammar (LG) as an additional informed feature. Conditional Random Fields is a probabilistic method for structured prediction. Local grammars are handmade rules to identify expressions within the text. The aim was to study this way of including the human expertise (Local Grammar) in the machine learning Conditional Random Fields approach and to analyze how it can contribute to the performance of this approach.

To achieve this aim, a Local Grammar was built to recognize the 10 named entities categories of HAREM, a joint assessment for the Named Entity Recognition in Portuguese. Initially, the Golden Collection of the First and Second HAREM, considered as a reference for Named Entity Recognition systems in Portuguese, were used as training and test sets, respectively, for evaluation of the CRF+LG. After that, the proposed approach was evaluated in two other datasets.

The results obtained outperform the results of systems reported in the literature that were evaluated under equivalent conditions. This gain was approximately 8 percentage points in F-measure in comparison to a system that also used CRF and 2 points in comparison to a system that used Neural Networks. Some systems that used Neural Networks presented superior results, but using massive corpora for unsupervised learning of features, which was not the case of this work.

The Local Grammar built can be used individually when there is no training set available and in conjunction with other machine learning techniques to improve its performance. We also analyzed the boundaries (lower bound and upper bound) of the proposed approach. The lower bound indicates the minimum performance and the upper bound indicates the maximum gain that we can achieve for the task in question when using this approach.

Access to document

Acesso à informação
Transparência Pública

© 2013 Universidade Federal do Espírito Santo. Todos os direitos reservados.
Av. Fernando Ferrari, 514 - Goiabeiras, Vitória - ES | CEP 29075-910