SciELO - Scientific Electronic Library Online

 
vol.14 número2Análisis computarizado y comparativo del discurso en pacientes orgánicos crónicos índice de autoresíndice de assuntospesquisa de artigos
Home Pagelista alfabética de periódicos  

Serviços Personalizados

Journal

Artigo

Indicadores

  • Não possue artigos citadosCitado por SciELO

Links relacionados

  • Não possue artigos similaresSimilares em SciELO

Compartilhar


Subjetividad y procesos cognitivos

versão On-line ISSN 1852-7310

Resumo

ALONSO ALEMANY, Laura. Linguistic Insights on the Lexical Normalization of user-generated content. Subj. procesos cogn. [online]. 2010, vol.14, n.2, pp.20-31. ISSN 1852-7310.

We present work in progress on word normalization for user-generated content. The approach is simple and helps in reducing the amount of manual annotation characteristic of more classical approaches. First, orthographic variants of a word, mostly abbreviations, are grouped together. From these manually grouped examples, we learn an automated classifier that, given a previously unseen word, determines whether it is an orthographic variant of a known word or an entirely new word. To do that, we calculate the similarity between the unseen word and all known words, and classify the new word as an orthographic variant of its most similar word. The classifier applies a string similarity measure based on the Levenshtein edit distance. To improve the accuracy of this measure, we assign edit operations an error-based cost. This scheme of cost assigning aims to maximize the distance between similar strings that are variants of different words. This custom similarity measure achieves an accuracy of .68, an important improvement if we compare it with the .54 obtained by the Levenshtein distance.

Palavras-chave : Word normalization; Unseen words; Strings; Orthographic variants.

        · resumo em Espanhol     · texto em Espanhol     · Espanhol ( pdf )

 

Creative Commons License Todo o conteúdo deste periódico, exceto onde está identificado, está licenciado sob uma Licença Creative Commons