Abstract:
This paper proposes an algorithm for Chinese domain term extraction based on linguistic features. Chinese domain terms exhibit three features: domain cohesiveness, domain relevancy, and domain consensus. The algorithm integrates three statistical models, each of which computes one of these features. Experimental results show that the algorithm achieves higher precision and recall than methods based on mutual information and log-likelihood. An automatic evaluation method based on the perplexity attenuation ratio is also proposed, and the above algorithms are evaluated with it.
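
To make the three features concrete, the following is a minimal sketch of one way such scores are commonly computed. The abstract does not give the paper's formulas, so everything here is an assumption: pointwise mutual information stands in for cohesiveness, a relative-frequency ratio across domains stands in for relevancy, and the entropy of a term's distribution over a domain's documents stands in for consensus. The function names and the toy corpora are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def pmi(pair_count, left_count, right_count, total):
    """Pointwise mutual information of two adjacent units.

    Assumption: used here only as a stand-in cohesiveness score.
    """
    p_xy = pair_count / total
    p_x = left_count / total
    p_y = right_count / total
    return math.log2(p_xy / (p_x * p_y))


def domain_relevance(term, domain, corpora):
    """Relative frequency of `term` in `domain`, divided by its highest
    relative frequency in any domain (assumed formulation).

    `corpora` maps a domain name to a non-empty list of per-document Counters.
    """
    def rel_freq(name):
        docs = corpora[name]
        total = sum(sum(doc.values()) for doc in docs)
        return sum(doc[term] for doc in docs) / total

    best = max(rel_freq(name) for name in corpora)
    return rel_freq(domain) / best if best > 0 else 0.0


def domain_consensus(term, docs):
    """Entropy (in bits) of the term's distribution over the documents of
    one domain (assumed formulation). Higher means more evenly spread use.
    """
    counts = [doc[term] for doc in docs if doc[term] > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts)


# Hypothetical toy data: two "law" documents and one "news" document.
corpora = {
    "law": [Counter({"contract": 3, "the": 5}), Counter({"contract": 3, "the": 2})],
    "news": [Counter({"contract": 1, "the": 9})],
}

relevance_law = domain_relevance("contract", "law", corpora)
relevance_news = domain_relevance("contract", "news", corpora)
consensus_law = domain_consensus("contract", corpora["law"])
cohesion = pmi(pair_count=8, left_count=16, right_count=16, total=64)
```

How the three scores are combined into a single ranking is not stated in the abstract, so the sketch leaves them separate. A weighted sum or a product would be one plausible choice, but that too is an assumption.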