Web Pages Data Extraction Based on Tree Automata

Abstract: To extract data from HTML Web pages automatically, this paper applies tree automata induction. The key idea is to transform the example pages into binary trees, construct a tree automaton that accepts these binary trees, and then extract data from a target page according to the automaton's accepting and rejecting states on that page. The method exploits the inherent tree structure of HTML documents and introduces a simple, convenient format for labeling the example pages. Experiments show that the method outperforms several other data extraction methods in recall and F-score.
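As an illustration of the first step named in the abstract (turning a page into a binary tree), the following is a minimal sketch that assumes the common first-child/next-sibling encoding; the abstract does not state which encoding the method uses. The names `Node`, `BinNode`, `TreeBuilder`, `parse_html`, `encode`, and `show` are hypothetical, and the input is assumed to be well-formed HTML in which every tag is explicitly closed.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import List, Optional


@dataclass
class Node:
    """Unranked tree node: an HTML element with any number of children."""
    tag: str
    children: List["Node"]


@dataclass
class BinNode:
    """Binary tree node: left = first child, right = next sibling (assumed encoding)."""
    tag: str
    left: Optional["BinNode"] = None
    right: Optional["BinNode"] = None


class TreeBuilder(HTMLParser):
    """Builds an unranked element tree; assumes every start tag has an end tag."""

    def __init__(self) -> None:
        super().__init__()
        self.root = Node("#root", [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, [])
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Never pop the synthetic root.
        if len(self.stack) > 1:
            self.stack.pop()


def parse_html(html: str) -> Node:
    """Parse HTML text and return the synthetic root of the element tree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def encode(siblings: List[Node]) -> Optional[BinNode]:
    """Encode a list of sibling nodes as one binary tree."""
    result = None
    # Walk right to left so each node can point at its already-built next sibling.
    for node in reversed(siblings):
        result = BinNode(node.tag, encode(node.children), result)
    return result


def show(b: Optional[BinNode]) -> str:
    """Render a binary tree as tag(left,right), with '-' for an empty subtree."""
    if b is None:
        return "-"
    return f"{b.tag}({show(b.left)},{show(b.right)})"


page = "<table><tr><td>a</td><td>b</td></tr><tr><td>c</td></tr></table>"
binary_tree = encode(parse_html(page).children)
print(show(binary_tree))
```

In this encoding every node has at most two successors, so an automaton over binary trees can process pages whose elements have arbitrarily many children. Inferring the automaton from labeled example pages is not shown here.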