Full-Page Wrapper Generation for Unsupervised Deep Web Data Extraction
Preprint posted on 2021-09-22, 22:53, authored by Chia-Hui Chang
Web data extraction is a key component in many business intelligence tasks, such as data transformation, exchange, and analysis. Many approaches have been proposed, using either labeled training examples (supervised) or annotation-free training pages (unsupervised). However, most research focuses on extraction effectiveness, and little attention has been paid to extraction efficiency. In fact, most unsupervised web data extraction methods skip wrapper generation because they can work without any supervision.
In this paper, we argue that wrapper generation for unsupervised web data extraction is as important as supervised wrapper induction, because generated wrappers can run more efficiently, without sophisticated analysis at test time. We consider two approaches to wrapper generation: schema-guided finite-state machine (FSM) approaches and data-driven machine learning (ML) approaches. We exploit unique mandatory templates to improve the FSM-based wrapper and propose two convolutional neural network (CNN)-based models for sequence labeling. The experimental results show that the FSM wrapper performs well even with little training data, while the CNN-based models require more training pages to achieve the same effectiveness but are more efficient with GPU support. Furthermore, FSM wrappers can work as a filter to reduce the number of training pages and advance the learning curve for wrapper generation.
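As a rough illustration of the schema-guided FSM idea, the sketch below treats an ordered list of mandatory template tokens as the states of a linear automaton and collects the tokens found between consecutive template tokens as data slots. It is not the wrapper described in the paper: the function name `fsm_extract`, the pre-tokenized page, and the example template are all hypothetical and chosen only to show why such a wrapper needs a single pass and no further analysis at test time.

```python
from typing import List, Optional


def fsm_extract(tokens: List[str], template: List[str]) -> Optional[List[List[str]]]:
    """Hypothetical linear FSM wrapper (illustrative assumption, not the paper's).

    State i means "waiting for template[i]". Tokens seen between two
    consecutive template tokens are collected as one data slot. Returns
    None if the page ends before every template token has been matched.
    """
    state = 0
    slots: List[List[str]] = []
    current: List[str] = []
    for tok in tokens:
        if state < len(template) and tok == template[state]:
            # Transition on a mandatory template token; close the open slot.
            if state > 0:
                slots.append(current)
            current = []
            state += 1
        elif 0 < state < len(template):
            # Between two template tokens: treat the token as data.
            current.append(tok)
    return slots if state == len(template) else None


# Assumed toy page, already split into tokens.
page = ["<html>", "<h1>", "Book", "Title", "</h1>",
        "<span>", "$12", "</span>", "</html>"]
result = fsm_extract(page, ["<h1>", "</h1>", "<span>", "</span>"])
```

Here `result` holds one slot per gap between template tokens, including the empty gap between `</h1>` and `<span>`. A page that lacks a mandatory template token yields `None`, which is one simple way such a wrapper could act as a filter on training pages.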