Structured Web Data Extraction Dataset

2019-11-02


This dataset is a collection of real-world web pages for research on the automatic extraction of structured data (e.g., attribute-value pairs of entities) from the Web. We hope it can serve as a useful benchmark for evaluating and comparing methods for structured web data extraction.

Contents of the Dataset

Currently the dataset involves:

  • 8 verticals with diverse semantics;

  • 80 web sites (10 per vertical);

  • 124,291 web pages (200~2,000 per web site), each containing a single data record with detailed information of an entity;

  • 32 attributes (3~5 per vertical) associated with carefully labeled ground-truth of corresponding values in each web page. The goal of structured data extraction is to automatically identify the values of these attributes from web pages.

The involved verticals are summarized as follows:

Vertical      Sites   Pages    Attributes
Auto          10      17,923   4 (model, price, engine, fuel_economy)
Book          10      20,000   5 (title, author, isbn_13, publisher, publication_date)
Camera        10       5,258   3 (model, price, manufacturer)
Job           10      20,000   4 (title, company, location, date_posted)
Movie         10      20,000   4 (title, director, genre, mpaa_rating)
NBA Player    10       4,405   4 (name, team, height, weight)
Restaurant    10      20,000   4 (name, address, phone, cuisine)
University    10      16,705   4 (name, phone, website, type)

Format of Web Pages

Each web page in the dataset is stored as one .htm file (in UTF-8 encoding) where the first tag encodes the source URL of the page.
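
As a rough illustration of reading the source URL back from a page, the sketch below looks at the first tag of the file. It assumes the URL is carried in an `href` attribute of that tag (for example a `<base>` tag); the description above only says that the first tag encodes the URL, so this detail and the helper name `source_url` are assumptions.

```python
import re
from typing import Optional

# Matches the first tag-like construct in the file.
FIRST_TAG = re.compile(r"<[^>]+>")
# Assumption: the URL is stored in an href attribute of that first tag.
HREF = re.compile(r"""href\s*=\s*["']([^"']*)["']""", re.IGNORECASE)


def source_url(html: str) -> Optional[str]:
    """Return the href value of the first tag, or None if it has none."""
    tag = FIRST_TAG.search(html)
    if tag is None:
        return None
    href = HREF.search(tag.group(0))
    return href.group(1) if href else None


# Made-up page content, for illustration only.
example_page = '<base href="http://example.com/page1"/>\n<html><body>Hi</body></html>'
example_url = source_url(example_page)
```

For a file on disk, the text would be read with `open(path, encoding="utf-8")`, matching the UTF-8 encoding stated above.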

Format of Ground-truth Files

For each web site, the page-level ground-truth of attribute values has been labeled using handcrafted regular expressions and stored in .txt files (in UTF-8 encoding) named "<vertical>-<site>-<attribute>.txt".

In each such file:

  • The first line stores the names of the vertical, site, and attribute, separated by TAB characters ('\t').

  • The second line stores some statistics (separated by TABs) w.r.t. the corresponding site and attribute, including:

    1. the total number of pages,

    2. the number of pages containing attribute values,

    3. the total number of attribute values contained in the pages,

    4. the number of unique attribute values.

  • Each remaining line stores the ground-truth information (separated by TABs) of one page, in sequence of:

    1. page ID,

    2. the number of attribute values in the page,

    3. attribute values ("<NULL>" in case of non-existence).
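
The file layout described above can be read with a few lines of code. The following is a minimal sketch, not an official loader: the names `GroundTruth` and `parse_groundtruth` are hypothetical, the sample content is made up, and the per-page value count in the second field is read past without being verified.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GroundTruth:
    vertical: str
    site: str
    attribute: str
    # total pages, pages with values, total values, unique values
    stats: List[int]
    # page ID -> list of labeled values (empty if the attribute is absent)
    values_by_page: Dict[str, List[str]]


def parse_groundtruth(text: str) -> GroundTruth:
    """Parse the contents of one ground-truth .txt file."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Line 1: vertical, site, attribute (TAB-separated).
    vertical, site, attribute = lines[0].split("\t")[:3]
    # Line 2: the four statistics (TAB-separated).
    stats = [int(x) for x in lines[1].split("\t")[:4]]
    values_by_page: Dict[str, List[str]] = {}
    for ln in lines[2:]:
        fields = ln.split("\t")
        page_id = fields[0]
        # fields[1] is the value count; fields[2:] are the values.
        # "<NULL>" marks a page without a value for this attribute.
        values_by_page[page_id] = [v for v in fields[2:] if v != "<NULL>"]
    return GroundTruth(vertical, site, attribute, stats, values_by_page)


# Made-up file content, for illustration only.
sample = (
    "auto\texample-site\tmodel\n"
    "2\t1\t1\t1\n"
    "0000\t1\tCivic LX\n"
    "0001\t0\t<NULL>\n"
)
gt = parse_groundtruth(sample)
```

Mapping "<NULL>" to an empty list keeps pages with and without values uniform, so downstream code can iterate over `values_by_page` without special cases.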

Notes on Ground-truth Labeling

  • The ground-truth labeling was conducted at the DOM-node level. More specifically, the candidate attribute values in a web page are the non-empty strings contained in text nodes of the corresponding DOM tree.

  • One page (although containing a single data record) may contain multiple distinct values that correspond to an attribute (e.g., multiple authors of a book, multiple granularity levels of addresses).

  • Currently, when a text node presents a mixture of multiple attributes, its string value is labeled with each of these attributes, if no substitute is available.

  • Before being stored in .txt files, the raw attribute values were refined by removing redundant separators (e.g., ' ', '\t', '\n').
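
To compare extracted strings with the labeled values, one option is to enumerate the text nodes of a page and apply a similar cleanup. The sketch below uses Python's standard `html.parser`. It rests on two assumptions that the notes above do not confirm: that "removing redundant separators" can be approximated by collapsing whitespace runs into a single space, and that text inside `script` and `style` elements should be ignored. The names `normalize`, `TextNodeCollector`, and `candidate_values` are hypothetical.

```python
from html.parser import HTMLParser
from typing import List


def normalize(value: str) -> str:
    """Collapse runs of spaces, tabs, and newlines (assumed refinement)."""
    return " ".join(value.split())


class TextNodeCollector(HTMLParser):
    """Collects the non-empty text nodes of a page as candidate values."""

    SKIP = {"script", "style"}  # assumption: not treated as candidates

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.candidates: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        text = normalize(data)
        if text:  # keep only non-empty strings
            self.candidates.append(text)


def candidate_values(html: str) -> List[str]:
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    return collector.candidates


# Made-up page content, for illustration only.
page = (
    "<html><body><h1> The  Hobbit\n</h1>"
    "<p>by\tJ. R. R. Tolkien</p>"
    "<script>var x = 1;</script></body></html>"
)
candidates = candidate_values(page)
```

Because a text node that mixes several attributes is labeled with each of them, a candidate such as "by J. R. R. Tolkien" would be matched as a whole string rather than split further.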

