Automatic Abstracting Based on Text Clustering and Natural Language Understanding

Abstract: To address shortcomings of current automatic abstracting methods, this paper proposes an approach based on text clustering and natural language understanding (NLU). Text clustering is introduced into the abstracting process so that abstracts can be generated for multiple documents. A two-pass automatic word segmentation algorithm based on the title and the first sentence of each paragraph is also proposed; in experiments, its precision and recall both exceed 95%. An automatic abstracting system built on this approach was implemented for the plastics industry, and its multi-document abstracting precision and recall both exceed 75%. The experiments indicate that the method is feasible, offers a useful reference for the design of automatic abstracting systems, and merits further study.
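The abstract reports precision and recall for word segmentation but does not say how they were computed. As an illustration only, the sketch below uses the common span-based definition: a predicted word counts as correct when its character span matches a word in the reference segmentation exactly. The function names `to_spans` and `precision_recall` and the sample sentence are hypothetical, not taken from the paper.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans


def precision_recall(predicted, reference):
    """Span-based precision and recall of a predicted segmentation.

    Assumption: both segmentations cover the same underlying string.
    """
    pred, ref = to_spans(predicted), to_spans(reference)
    correct = len(pred & ref)  # words whose boundaries match exactly
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    return precision, recall


# Hypothetical example: the segmenter merges the last two reference words.
reference = ["塑料", "行业", "自动", "文摘"]
predicted = ["塑料", "行业", "自动文摘"]
p, r = precision_recall(predicted, reference)
print(f"precision={p:.3f} recall={r:.3f}")
```

Here two of the three predicted words match the reference, so precision is 2/3, while only two of the four reference words are recovered, so recall is 1/2.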