Web Page Knowledge Acquisition Technology

Abstract: This paper describes a technique for automatically acquiring knowledge from Web text, based on pseudo-natural language understanding. Web page text is first described by a domain grammar. The domain grammar is then converted into rules that describe sentence-level information and conform to regular expression syntax. These rules are used to transform the Web page text into semantic triples that represent the knowledge in the page, and the triples are assembled into a domain knowledge base. Experimental data show that, across different types of Web page data, the domain knowledge base generated by this technique achieves an average recall of 71.5% and an average precision of 79.1%.
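The pipeline summarized in the abstract (domain grammar, then regular-expression rules, then semantic triples) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the two rules in `RULES`, the predicate names `locatedIn` and `foundedIn`, the function `extract_triples`, and the toy sentences are all hypothetical, and the hand-written patterns merely stand in for rules that the described technique would derive from a domain grammar.

```python
import re

# Hypothetical rule table (an assumption, not the paper's rule set).
# Each entry pairs a regular expression, standing in for a rule derived
# from a domain grammar, with the predicate of the triple it produces.
RULES = [
    (re.compile(r"^(?P<subj>[A-Z][\w ]*?) is located in (?P<obj>[A-Z][\w ]*?)\.?$"),
     "locatedIn"),
    (re.compile(r"^(?P<subj>[A-Z][\w ]*?) was founded in (?P<obj>\d{4})\.?$"),
     "foundedIn"),
]


def extract_triples(text):
    """Return (subject, predicate, object) triples for sentences matching a rule."""
    triples = []
    # Naive sentence splitting on whitespace that follows a period.
    for sentence in re.split(r"(?<=\.)\s+", text.strip()):
        for pattern, predicate in RULES:
            match = pattern.match(sentence)
            if match:
                triples.append((match.group("subj"), predicate, match.group("obj")))
                break  # the first matching rule wins; try the next sentence
    return triples


# Toy input; the third sentence matches no rule and yields no triple.
page_text = (
    "Acme Corp is located in Springfield. "
    "Acme Corp was founded in 1999. "
    "The weather is nice."
)
knowledge_base = extract_triples(page_text)
```

Sentences that match no rule contribute nothing to the knowledge base, which is one way rule coverage can limit recall in a scheme of this kind.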