Description:
A good chocolate soufflé is decadent, delicious, and delicate. But, it's a challenge to prepare. When you pull a disappointingly deflated dessert out of the oven, you instinctively retrace your steps to identify at what point you went wrong. Bosch, one of the world's leading manufacturing companies, has an imperative to ensure that the recipes for the production of its advanced mechanical components are of the highest quality and safety standards. Part of doing so is closely monitoring its parts as they progress through the manufacturing processes.
Because Bosch records data at every step along its assembly lines, they have the ability to apply advanced analytics to improve these manufacturing processes. However, the intricacies of the data and complexities of the production line pose problems for current methods.
In this competition, Bosch is challenging Kagglers to predict internal failures using thousands of measurements and tests made for each component along the assembly line. This would enable Bosch to bring quality products at lower costs to the end user.
Evaluation:
Submissions are evaluated on the Matthews correlation coefficient (MCC) between the predicted and the observed response. The MCC is given by:
MCC=(TP∗TN)−(FP∗FN)(TP+FP)(TP+FN)(TN+FP)(TN+FN)−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−√,
where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.
Submission File
For each Id in the test set, you must predict a binary prediction for the Response variable. The file should contain a header and have the following format:
Id,Response
1,0
2,1
3,0
etc.
Data Description:
The data for this competition represents measurements of parts as they move through Bosch's production lines. Each part has a unique Id. The goal is to predict which parts will fail quality control (represented by a 'Response' = 1).
The dataset contains an extremely large number of anonymized features. Features are named according to a convention that tells you the production line, the station on the line, and a feature number. E.g. L3_S36_F3939 is a feature measured on line 3, station 36, and is feature number 3939.
On account of the large size of the dataset, we have separated the files by the type of feature they contain: numerical, categorical, and finally, a file with date features. The date features provide a timestamp for when each measurement was taken. Each date column ends in a number that corresponds to the previous feature number. E.g. the value of L0_S0_D1 is the time at which L0_S0_F0 was taken.
In addition to being one of the largest datasets (in terms of number of features) ever hosted on Kaggle, the ground truth for this competition is highly imbalanced. Together, these two attributes are expected to make this a challenging problem.
File descriptions
train_numeric.csv - the training set numeric features (this file contains the 'Response' variable)
test_numeric.csv - the test set numeric features (you must predict the 'Response' for these Ids)
train_categorical.csv - the training set categorical features
test_categorical.csv - the test set categorical features
train_date.csv - the training set date features
test_date.csv - the test set date features
sample_submission.csv - a sample submission file in the correct format