citizenvilla.blogg.se - Part of speech tagger online

From a linguistic point of view, meaning arises from the differences between linguistic units, including words, phrases, and so on, and these differences are of two kinds: paradigmatic (concerning substitution) and syntagmatic (concerning positioning). The distinction is a key one in structuralist semiotic analysis. Whereas syntagmatic relations are possibilities of combination, paradigmatic relations are functional contrasts: they involve differentiation. Both paradigmatic and syntagmatic lexical relations have a great impact on POS tagging, because the value of a word is determined by the two relations. For example, POS tags in the style of the Penn Chinese Treebank (CTB) (Xue et al. 2005) capture both paradigmatic and syntagmatic relations among words, because the annotation criterion is the syntactic distribution of words.
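The substitution versus positioning distinction can be made concrete with a small sketch. The toy English corpus and the function names below are invented for illustration and are not taken from the article: adjacent-pair counts stand in for syntagmatic relations, and the overlap of the (previous word, next word) slots that two words can fill stands in for paradigmatic relations.

```python
from collections import defaultdict

# Toy corpus, invented for illustration only.
corpus = [
    "the cat sleeps".split(),
    "the dog sleeps".split(),
    "the cat eats".split(),
    "the dog eats".split(),
]

def syntagmatic_pairs(sentences):
    """Count adjacent word pairs: which words combine in sequence."""
    counts = defaultdict(int)
    for sent in sentences:
        for left, right in zip(sent, sent[1:]):
            counts[(left, right)] += 1
    return dict(counts)

def context_sets(sentences):
    """Map each word to the set of (previous, next) slots it fills."""
    contexts = defaultdict(set)
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]
        for i in range(1, len(padded) - 1):
            contexts[padded[i]].add((padded[i - 1], padded[i + 1]))
    return dict(contexts)

def paradigmatic_overlap(contexts, a, b):
    """Jaccard overlap of the slots two words can fill (substitutability)."""
    union = contexts[a] | contexts[b]
    return len(contexts[a] & contexts[b]) / len(union) if union else 0.0

pairs = syntagmatic_pairs(corpus)
contexts = context_sets(corpus)
```

In this toy data, "cat" and "dog" fill exactly the same slots, so they are fully substitutable, while "cat" and "sleeps" never share a slot but do occur next to each other.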

In grammar, a part of speech (POS) is a linguistic category of words, generally defined by the syntactic or morphological behavior of the word in question. Automatically assigning POS tags to words plays an important role in parsing, word sense disambiguation, and many other NLP applications. Many successful tagging algorithms developed for English have been applied to other languages as well. In some cases, such as German, the methods work well without large modifications. But a number of augmentations and changes become necessary when dealing with highly inflected or agglutinative languages, as well as analytic languages, of which Chinese is the focus of this article. The Chinese language is characterized by the lack of formal devices such as morphological tense and number that often provide important clues for syntactic processing tasks. Although state-of-the-art tagging systems have achieved accuracies above 97% on English, Chinese POS tagging has proven to be more challenging, with accuracies of about 93–94% (Ng and Low 2004; Tseng, Jurafsky, and Manning 2005; Huang, Harper, and Wang 2007; Huang, Eidelman, and Harper 2009; Li et al.). It is generally accepted that Chinese POS tagging often requires more sophisticated language processing techniques that are capable of drawing inferences from more subtle linguistic knowledge.
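A gap of three or four accuracy points can look small, so it may help to restate the quoted figures as error rates. The snippet below is only arithmetic on the numbers given in the text (97% for English, 93–94% for Chinese); the function names are illustrative, and because the English figure is "above 97%", the ratios are lower bounds.

```python
def error_rate(accuracy_pct):
    """Per-token error rate in percent, given accuracy in percent."""
    return 100.0 - accuracy_pct

def error_ratio(accuracy_a_pct, accuracy_b_pct):
    """How many times more errors system A makes than system B."""
    return error_rate(accuracy_a_pct) / error_rate(accuracy_b_pct)

english_error = error_rate(97.0)      # at most 3 errors per 100 tokens
chinese_error_low = error_rate(94.0)  # about 6 errors per 100 tokens
chinese_error_high = error_rate(93.0) # about 7 errors per 100 tokens
```

Read this way, a Chinese tagger at 93–94% makes at least twice as many per-token errors as an English tagger at 97%.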

Despite their effectiveness in boosting accuracy, hybrid systems depend on computationally expensive parsers, which makes them inappropriate for many realistic NLP applications. In this article, we are also concerned with improving tagging efficiency at test time. In particular, we explore unlabeled data to transfer the predictive power of hybrid models to simple sequence models. Specifically, hybrid systems are utilized to create large-scale pseudo training data for cheap models. Experimental results illustrate that the re-compiled models not only achieve high per-token classification accuracy, but also serve well as a front-end to a parser.
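The idea of using an expensive system to create pseudo training data for a cheap model can be sketched as follows. This is a minimal illustration under stated assumptions, not the article's implementation: `toy_teacher` is a hypothetical stand-in for a slow hybrid system, the unigram lookup tagger is a deliberately tiny stand-in for a simple sequence model, and the words and tags are toy values.

```python
from collections import Counter, defaultdict

def make_pseudo_data(teacher_tag, unlabeled_sentences):
    """Label raw sentences with the expensive teacher to get pseudo training data."""
    return [list(zip(sent, teacher_tag(sent))) for sent in unlabeled_sentences]

def train_cheap_tagger(tagged_sentences, default_tag="NN"):
    """Train a unigram lookup tagger (assumed stand-in for a simple sequence model)."""
    counts = defaultdict(Counter)
    for sent in tagged_sentences:
        for word, tag in sent:
            counts[word][tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def tag(sentence):
        # Unknown words fall back to the default tag.
        return [lexicon.get(w, default_tag) for w in sentence]

    return tag

def toy_teacher(sentence):
    """Hypothetical teacher standing in for a slow tagger-plus-parser system."""
    rules = {"the": "DT", "dog": "NN", "cat": "NN", "runs": "VV", "sleeps": "VV"}
    return [rules.get(w, "NN") for w in sentence]

# Small hand-labeled set plus raw text that only the teacher has tagged.
gold = [[("the", "DT"), ("dog", "NN"), ("runs", "VV")]]
unlabeled = [["the", "cat", "sleeps"], ["the", "dog", "sleeps"]]

pseudo = make_pseudo_data(toy_teacher, unlabeled)
student = train_cheap_tagger(gold + pseudo)
```

Here the student has never seen "sleeps" in the gold data, yet it tags the word correctly because the teacher's predictions were added to its training set; at test time only the cheap student runs.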

From the perspective of structural linguistics, we explore paradigmatic and syntagmatic lexical relations for Chinese POS tagging, an important and challenging task for Chinese language processing. Paradigmatic lexical relations are explicitly captured by word clustering on large-scale unlabeled data and are used to design new features to enhance a discriminative tagger. Syntagmatic lexical relations are implicitly captured by syntactic parsing in the constituency formalism, and are utilized via system combination. Experiments on the Penn Chinese Treebank demonstrate the importance of both paradigmatic and syntagmatic relations. Our linguistically motivated, hybrid approaches yield a relative error reduction of 18% in total over state-of-the-art baselines.
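One way word clusters can enter a discriminative tagger is as extra features alongside the word forms. The sketch below assumes a precomputed word-to-cluster mapping; the cluster IDs, the feature templates, and the name `token_features` are all hypothetical, and the article's actual feature set is not given here.

```python
def token_features(sentence, i, word2cluster):
    """Build string features for position i, adding cluster IDs next to word forms."""
    def word(j):
        return sentence[j] if 0 <= j < len(sentence) else "<pad>"

    def cluster(j):
        # Words without a cluster (including padding) map to "<unk>".
        return word2cluster.get(word(j), "<unk>")

    return [
        f"w0={word(i)}",
        f"w-1={word(i - 1)}",
        f"w+1={word(i + 1)}",
        f"c0={cluster(i)}",
        f"c-1={cluster(i - 1)}",
        f"c+1={cluster(i + 1)}",
        f"c-1|c0={cluster(i - 1)}|{cluster(i)}",
    ]

# Hypothetical clusters, as if induced from unlabeled text:
# 猫 (cat) and 狗 (dog) share a cluster; 睡觉 (sleep) is in another.
word2cluster = {"猫": "C12", "狗": "C12", "睡觉": "C7"}

feats_dog = token_features(["狗", "睡觉"], 0, word2cluster)
feats_cat = token_features(["猫", "睡觉"], 0, word2cluster)
```

The two feature lists differ in their `w0` feature but share `c0=C12`, so whatever the tagger learns about one word's cluster carries over to the other, which is how substitutable words can support each other even when one of them is rare in the labeled data.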