Bosch Production Line Defect-Rate Reduction Data [Kaggle Competition]

2019-12-25

Description:

A good chocolate soufflé is decadent, delicious, and delicate. But, it's a challenge to prepare. When you pull a disappointingly deflated dessert out of the oven, you instinctively retrace your steps to identify at what point you went wrong. Bosch, one of the world's leading manufacturing companies, has an imperative to ensure that the recipes for the production of its advanced mechanical components are of the highest quality and safety standards. Part of doing so is closely monitoring its parts as they progress through the manufacturing processes.

[Figure: Bosch production line]

Because Bosch records data at every step along its assembly lines, they have the ability to apply advanced analytics to improve these manufacturing processes. However, the intricacies of the data and complexities of the production line pose problems for current methods.

In this competition, Bosch is challenging Kagglers to predict internal failures using thousands of measurements and tests made for each component along the assembly line. This would enable Bosch to bring quality products at lower costs to the end user.


Evaluation:


Submissions are evaluated on the Matthews correlation coefficient (MCC) between the predicted and the observed response. The MCC is given by:


$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}},$$


where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.
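
As a rough illustration of the metric, the sketch below computes the MCC from the four confusion-matrix counts. The function names `mcc` and `mcc_from_labels` are hypothetical, and returning 0.0 when the denominator is zero is an assumed convention, not something the competition text specifies.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    if denom == 0:
        # Assumption: treat an undefined MCC (an empty row or column
        # in the confusion matrix) as 0.0.
        return 0.0
    return (tp * tn - fp * fn) / math.sqrt(denom)


def mcc_from_labels(y_true, y_pred) -> float:
    """Count TP/TN/FP/FN from paired 0/1 labels, then apply mcc()."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            fn += 1
    return mcc(tp, tn, fp, fn)
```

Under this sketch, a model that predicts 0 for every part has TP = FP = 0 and therefore scores 0.0, which is why MCC suits a highly imbalanced response better than plain accuracy.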

Submission File

For each Id in the test set, you must provide a binary prediction for the Response variable. The file should contain a header and have the following format:

Id,Response
1,0
2,1
3,0
etc.
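
A file in this format could be produced along the following lines. This is a minimal sketch using only the Python standard library; `write_submission` is a hypothetical helper, and the check that each response is 0 or 1 is an added safeguard rather than a stated competition rule.

```python
import csv
import io


def write_submission(predictions, out) -> None:
    """Write (Id, Response) pairs to a text file object as CSV with a header."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Id", "Response"])
    for part_id, response in predictions:
        if response not in (0, 1):
            raise ValueError(f"Response for Id {part_id} must be 0 or 1")
        writer.writerow([part_id, response])


# Usage: write to an in-memory buffer; a real run would open a file instead.
buffer = io.StringIO()
write_submission([(1, 0), (2, 1), (3, 0)], buffer)
```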


Data Description:


The data for this competition represents measurements of parts as they move through Bosch's production lines. Each part has a unique Id. The goal is to predict which parts will fail quality control (represented by a 'Response' = 1).

The dataset contains an extremely large number of anonymized features. Features are named according to a convention that tells you the production line, the station on the line, and a feature number. E.g. L3_S36_F3939 is a feature measured on line 3, station 36, and is feature number 3939.
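
The naming convention can be unpacked with a small parser. The sketch below is illustrative only: `FeatureName` and `parse_feature_name` are hypothetical names, and the regular expression assumes every column follows the `L<line>_S<station>_F<number>` pattern (with `D` in place of `F` for date columns).

```python
import re
from typing import NamedTuple


class FeatureName(NamedTuple):
    line: int      # production line, e.g. 3
    station: int   # station on that line, e.g. 36
    kind: str      # "F" for a feature, "D" for a date column
    number: int    # feature number, e.g. 3939


# Assumed pattern: L<line>_S<station>_<F or D><number>
_NAME_PATTERN = re.compile(r"^L(\d+)_S(\d+)_([FD])(\d+)$")


def parse_feature_name(name: str) -> FeatureName:
    """Split a column name such as 'L3_S36_F3939' into its parts."""
    match = _NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Not a recognised feature name: {name!r}")
    line, station, kind, number = match.groups()
    return FeatureName(int(line), int(station), kind, int(number))
```

Parsing names this way makes it straightforward to group columns by line or station, for example when engineering per-station features.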

On account of the large size of the dataset, we have separated the files by the type of feature they contain: numerical, categorical, and finally, a file with date features. The date features provide a timestamp for when each measurement was taken. Each date column ends in a number that corresponds to the previous feature number. E.g. the value of L0_S0_D1 is the time at which L0_S0_F0 was taken.
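
The pairing between date columns and measurements can be sketched as a simple name transformation. The helper `measured_feature_for` is hypothetical, and it assumes that "previous feature number" always means the date column's number minus one, as in the `L0_S0_D1` / `L0_S0_F0` example.

```python
def measured_feature_for(date_column: str) -> str:
    """Map a date column name to the feature it timestamps.

    Assumption: the date column number is always the feature number plus one,
    generalised from the single example L0_S0_D1 -> L0_S0_F0.
    """
    prefix, separator, number = date_column.rpartition("_D")
    if not separator or not number.isdigit() or int(number) < 1:
        raise ValueError(f"Not a recognised date column: {date_column!r}")
    return f"{prefix}_F{int(number) - 1}"
```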

In addition to being one of the largest datasets (in terms of number of features) ever hosted on Kaggle, the ground truth for this competition is highly imbalanced. Together, these two attributes are expected to make this a challenging problem.
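
Because the files are large, one way to check the class imbalance is to stream the training file row by row rather than load it whole. The sketch below uses only the standard library; `response_counts` is a hypothetical helper, and the three-row sample is made-up illustration data, not actual competition values.

```python
import csv
import io


def response_counts(csv_file) -> dict:
    """Count 0/1 values in the 'Response' column of an open CSV file object."""
    counts = {0: 0, 1: 0}
    for row in csv.DictReader(csv_file):
        counts[int(row["Response"])] += 1
    return counts


# Illustration with an in-memory sample (made-up rows). For the real data,
# pass open("train_numeric.csv", newline="") instead.
sample = io.StringIO("Id,L0_S0_F0,Response\n1,0.1,0\n2,,0\n3,0.3,1\n")
counts = response_counts(sample)
failure_rate = counts[1] / (counts[0] + counts[1])
```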

File descriptions

  • train_numeric.csv - the training set numeric features (this file contains the 'Response' variable)

  • test_numeric.csv - the test set numeric features (you must predict the 'Response' for these Ids)

  • train_categorical.csv - the training set categorical features

  • test_categorical.csv - the test set categorical features

  • train_date.csv - the training set date features

  • test_date.csv - the test set date features

  • sample_submission.csv - a sample submission file in the correct format

