TechRxiv

Full-Page Wrapper Generation for Unsupervised Deep Web Data Extraction

preprint
posted on 2021-09-22, 22:53, authored by Chia-Hui Chang
Web data extraction is a key component in many business intelligence tasks, such as data transformation, exchange, and analysis. Many approaches have been proposed, using either labeled training examples (supervised) or annotation-free training pages (unsupervised). However, most research focuses on extraction effectiveness, and little attention has been paid to extraction efficiency. In fact, most unsupervised web data extraction methods skip wrapper generation because they can operate without any supervision.
In this paper, we argue that wrapper generation for unsupervised web data extraction is as important as supervised wrapper induction, because the generated wrappers can run more efficiently, without sophisticated analysis at test time. We consider two approaches to wrapper generation: schema-guided finite-state machine (FSM) approaches and data-driven machine learning (ML) approaches. We exploit unique mandatory templates to improve the FSM-based wrapper and propose two convolutional neural network (CNN)-based models for sequence labeling. The experimental results show that the FSM wrapper performs well even with little training data, while the CNN-based models require more training pages to achieve the same effectiveness but are more efficient with GPU support. Furthermore, FSM wrappers can serve as a filter to reduce the number of training pages and advance the learning curve for wrapper generation.

Funding

Ministry of Science and Technology, Taiwan. MOST-107-2221-E-008-085-MY2

Email Address of Submitting Author

chia@csie.ncu.edu.tw

ORCID of Submitting Author

0000-0002-1101-6337

Submitting Author's Institution

National Central University

Submitting Author's Country

Taiwan