Cdiscount Product Image Classification Competition [Kaggle Competition]

2020-01-19

Description:

Cdiscount.com generated nearly 3 billion euros last year, making it France’s largest non-food e-commerce company. While the company already sells everything from TVs to trampolines, the list of products is still rapidly growing. By the end of this year, Cdiscount.com will have over 30 million products up for sale. This is up from 10 million products only 2 years ago. Ensuring that so many products are well classified is a challenging task.

Currently, Cdiscount.com applies machine learning algorithms to the text description of the products in order to automatically predict their category. As these methods now seem close to their maximum potential, Cdiscount.com believes that the next quantitative improvement will be driven by the application of data science techniques to images.

In this challenge you will be building a model that automatically classifies the products based on their images. As a quick tour of Cdiscount.com's website can confirm, one product can have one or several images. The data set Cdiscount.com is making available is unique and characterized by superlative numbers in several ways:

  • Almost 9 million products: half of the current catalogue

  • More than 15 million images at 180x180 resolution

  • More than 5000 categories: yes this is quite an extreme multi-class classification!


Evaluation:

Goal

The goal of this competition is to predict the category of a product based on its image(s). Note that a product can have one or several images associated with it. For every product _id in the test set, you should predict the correct category_id.
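Because a product can carry several images while only one category_id is submitted per product, per-image model outputs have to be combined somehow. The sketch below shows one simple option, summing the class scores over a product's images and taking the best class. This is an illustrative assumption, not a method prescribed by the competition; the function name `predict_product_categories` and the input layout are hypothetical.

```python
from collections import defaultdict


def predict_product_categories(image_scores):
    """Combine per-image class scores into one category_id per product.

    image_scores: iterable of (product_id, {category_id: score}) pairs,
    one pair per image. Scores belonging to the same product are summed,
    which ranks categories the same way as averaging over its images.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for product_id, scores in image_scores:
        for category_id, score in scores.items():
            totals[product_id][category_id] += score
    # Pick the category with the highest summed score for each product.
    return {
        product_id: max(per_category, key=per_category.get)
        for product_id, per_category in totals.items()
    }
```

Other aggregation rules (for example, trusting only the first image, or taking the maximum score) fit the same interface.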

Metric

This competition is evaluated on the categorization accuracy of your predictions (the percentage of products you get correct).
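As a rough illustration of the metric, the following sketch computes the fraction of products whose predicted category matches the true one. The function name is hypothetical, and treating a product with no prediction as incorrect is an assumption made here, not something the competition text states.

```python
def categorization_accuracy(predicted, actual):
    """Return the fraction of products classified correctly.

    Both arguments map product _id -> category_id. A product that is
    missing from `predicted` is counted as wrong (an assumption).
    """
    if not actual:
        raise ValueError("actual must contain at least one product")
    correct = sum(
        1 for product_id, category_id in actual.items()
        if predicted.get(product_id) == category_id
    )
    return correct / len(actual)
```

Multiplying the result by 100 gives the percentage form used in the description above.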

Submission File

For each _id in the test set, you must predict a category_id. The file should contain a header and have the following format:

_id,category_id
2,1000000055
5,1000016018
6,1000016055
etc.
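A minimal sketch of producing a file in this format with Python's standard `csv` module follows. The helper name `write_submission` and the choice to sort rows by _id are assumptions for illustration; only the header and column order come from the format above.

```python
import csv
import io


def write_submission(predictions, out):
    """Write {_id: category_id} predictions as CSV with the required header.

    `out` is any writable text stream. When writing to a real file, open it
    with newline="" as the csv module documentation recommends.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["_id", "category_id"])
    for product_id in sorted(predictions):
        writer.writerow([product_id, predictions[product_id]])


# Usage with an in-memory buffer; pass an open text file to write to disk.
buffer = io.StringIO()
write_submission({5: 1000016018, 2: 1000000055, 6: 1000016055}, buffer)
print(buffer.getvalue())
```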

Data Description:

File Descriptions

Please Note: The train and test files are very large!

  • train.bson - (Size: 58.2 GB) Contains a list of 7,069,896 dictionaries, one per product. Each dictionary contains a product id (key: _id), the category id of the product (key: category_id), and between 1 and 4 images, stored in a list (key: imgs). Each image list contains a single dictionary per image, which uses the format: {'picture': b'...binary string...'}. The binary string corresponds to a binary representation of the image in JPEG format. The original competition page links to a kernel with an example of how to process the data.

  • train_example.bson - Contains the first 100 records of train.bson so you can start exploring the data before downloading the entire set.

  • test.bson - (Size: 14.5 GB) Contains a list of 1,768,182 products in the same format as train.bson, except there is no category_id included. The objective of the competition is to predict the correct category_id from the picture(s) of each product id (_id). The category_ids that are present in the Private Test split are also all present in the Public Test split.

  • category_names.csv - Shows the hierarchy of product classification. Each category_id has a corresponding level1, level2, and level3 name, in French. The category_id corresponds to the category tree down to its lowest level. This hierarchical data may be useful, but it is not necessary for building models and making predictions. All the absolutely necessary information is found in train.bson.

  • sample_submission.csv - Shows the correct format for submission. It is highly recommended that you zip your submission file before uploading for scoring.
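Since the .bson files are far too large to load into memory at once, it can help to stream them one record at a time. The sketch below relies only on the BSON format itself, in which every document begins with a little-endian 32-bit integer giving its total length in bytes, and assumes the files are plain concatenations of such documents. It yields raw bytes only; decoding each record into a dictionary with _id, category_id and imgs would need a BSON library, such as the `bson` package that ships with PyMongo. The function name is hypothetical.

```python
import io
import struct


def iter_raw_bson_documents(stream):
    """Yield each BSON document in a binary stream as raw bytes.

    Reads one document at a time, so memory use stays bounded by the
    largest single record rather than the size of the whole file.
    """
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of stream
        if len(header) < 4:
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack("<i", header)
        if length < 5:  # smallest valid document: 4-byte length + NUL
            raise ValueError("invalid BSON document length")
        body = stream.read(length - 4)
        if len(body) != length - 4:
            raise ValueError("truncated document")
        yield header + body


# Usage with an in-memory stream of three empty BSON documents; for the
# real data, pass a file opened with open("train_example.bson", "rb").
sample = b"\x05\x00\x00\x00\x00" * 3
print(sum(1 for _ in iter_raw_bson_documents(io.BytesIO(sample))))
```

Starting with train_example.bson is a cheap way to check such a reader before downloading the full files.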


