Structured Web Data Extraction Dataset

2019-11-02

Motivation

This dataset is a real-world collection of web pages intended for research on the automatic extraction of structured data (e.g., attribute-value pairs of entities) from the Web. We hope it can serve as a useful benchmark for evaluating and comparing different methods for structured web data extraction.

Contents of the Dataset

Currently the dataset involves:

8 verticals with diverse semantics;
80 web sites (10 per vertical);
124,291 web pages (200~2,000 per web site), each containing a single data record with detailed information of an entity;
32 attributes (3~5 per vertical) associated with carefully labeled ground-truth of corresponding values in each web page. The goal of structured data extraction is to automatically identify the values of these attributes from web pages.

The involved verticals are summarized as follows:

Vertical	#Sites	#Pages	#Attributes	Attributes
Auto	10	17,923	4	model, price, engine, fuel_economy
Book	10	20,000	5	title, author, isbn_13, publisher, publication_date
Camera	10	5,258	3	model, price, manufacturer
Job	10	20,000	4	title, company, location, date_posted
Movie	10	20,000	4	title, director, genre, mpaa_rating
NBA Player	10	4,405	4	name, team, height, weight
Restaurant	10	20,000	4	name, address, phone, cuisine
University	10	16,705	4	name, phone, website, type

Format of Web Pages

Each web page in the dataset is stored as one .htm file (in UTF-8 encoding) where the first tag encodes the source URL of the page.
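
As an illustration of the page format above, here is a minimal sketch that pulls the first tag out of a page and looks for a URL inside it. The description does not say which tag or attribute carries the URL, so the `<base href=...>` tag in the sample and the function names `first_tag` and `url_in_tag` are assumptions made for this example only.

```python
import re
from typing import Optional


def first_tag(html: str) -> Optional[str]:
    """Return the first '<...>' tag found in the page text, or None."""
    match = re.search(r"<[^>]+>", html)
    return match.group(0) if match else None


def url_in_tag(tag: str) -> Optional[str]:
    """Return the first http(s) URL that appears inside a tag, or None."""
    match = re.search(r"https?://[^\s\"'>]+", tag)
    return match.group(0) if match else None


# Hypothetical page content; the real tag name and attribute may differ.
sample_page = '<base href="http://example.com/cars/1"/>\n<html><body>...</body></html>'

tag = first_tag(sample_page)
source_url = url_in_tag(tag) if tag else None
print(source_url)
```

To apply this to a real file, read it with `open(path, encoding="utf-8")` and pass the text to `first_tag`.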

Format of Ground-truth Files

For each web site, the page-level ground truth of attribute values was labeled using handcrafted regular expressions and stored in .txt files (in UTF-8 encoding) named "<vertical>-<site>-<attribute>.txt".

In each such file:

The first line stores the names of the vertical, site, and attribute, separated by TAB characters ('\t').
The second line stores some statistics (separated by TABs) w.r.t. the corresponding site and attribute, including:

the total number of pages,
the number of pages containing attribute values,
the total number of attribute values contained in the pages,
the number of unique attribute values.

Each remaining line stores the ground-truth information (separated by TABs) of one page, in sequence of:

page ID,
the number of attribute values in the page,
attribute values ("<NULL>" in case of non-existence).
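
The file layout described above can be read with a few lines of code. The following is a minimal sketch, not an official loader: the `GroundTruth` class, the `parse_ground_truth` function, and the sample site name `examplesite` are hypothetical, and the sketch assumes the statistics are plain integers.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GroundTruth:
    vertical: str
    site: str
    attribute: str
    # total pages, pages with values, total values, unique values
    stats: List[int]
    # page ID -> labeled values (empty list when the file says "<NULL>")
    values: Dict[str, List[str]]


def parse_ground_truth(text: str) -> GroundTruth:
    """Parse the text of one ground-truth file into a GroundTruth record."""
    # Drop a possible byte-order mark and skip blank lines.
    lines = [ln for ln in text.lstrip("\ufeff").splitlines() if ln.strip()]
    vertical, site, attribute = lines[0].split("\t")[:3]
    stats = [int(field) for field in lines[1].split("\t")[:4]]

    values: Dict[str, List[str]] = {}
    for line in lines[2:]:
        fields = line.split("\t")
        page_id = fields[0]
        # fields[1] is the value count; the values themselves follow it.
        values[page_id] = [v for v in fields[2:] if v != "<NULL>"]
    return GroundTruth(vertical, site, attribute, stats, values)


# Made-up file content that follows the layout described above.
sample = (
    "auto\texamplesite\tmodel\n"
    "3\t2\t3\t3\n"
    "0000\t1\tCivic\n"
    "0001\t0\t<NULL>\n"
    "0002\t2\tAccord\tAccord EX\n"
)

gt = parse_ground_truth(sample)
print(gt.attribute, gt.stats, gt.values["0002"])
```

Mapping "<NULL>" to an empty list keeps "no value on this page" distinct from an empty string, which simplifies counting pages that contain the attribute.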

Notes on Ground-truth Labeling

The ground-truth labeling was conducted at the DOM-node level. More specifically, the candidate attribute values in a web page are the non-empty strings contained in text nodes of the corresponding DOM tree.
One page (although containing a single data record) may contain multiple distinct values that correspond to an attribute (e.g., multiple authors of a book, multiple granularity levels of addresses).
Currently, when a text node presents a mixture of multiple attributes, its string value is labeled with each of these attributes, if no substitute is available.
Before being stored in .txt files, the raw attribute values were refined by removing redundant separators (e.g., ' ', '\t', '\n').
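
When comparing text extracted from a page against the stored ground truth, it may help to apply a similar cleanup to the extracted string first. The exact refinement rules used for the dataset are not specified here, so the sketch below is only one plausible reading: it collapses every run of whitespace into a single space and trims both ends. The function name `normalize_value` is hypothetical.

```python
import re


def normalize_value(raw: str) -> str:
    """Collapse runs of spaces, tabs, and newlines into one space and trim the ends."""
    return re.sub(r"\s+", " ", raw).strip()


cleaned = normalize_value("  Honda\t\tCivic \n EX ")
print(cleaned)
```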
