Microsoft Malware Classification Challenge

2020-01-20

Description:

In recent years, the malware industry has become a well-organized market involving large amounts of money. Well-funded, multi-player syndicates invest heavily in technologies and capabilities built to evade traditional protection, requiring anti-malware vendors to develop counter mechanisms for finding and deactivating them. In the meantime, they inflict real financial and emotional pain on users of computer systems.

One of the major challenges that the anti-malware industry faces today is the vast amount of data and files that need to be evaluated for potential malicious intent. For example, Microsoft's real-time detection anti-malware products are present on over 160M computers worldwide and inspect over 700M computers monthly. This generates tens of millions of daily data points to be analyzed as potential malware. One of the main reasons for these high volumes of different files is that, in order to evade detection, malware authors introduce polymorphism to the malicious components. This means that malicious files belonging to the same malware "family", with the same forms of malicious behavior, are constantly modified and/or obfuscated using various tactics, so that they look like many different files.

To analyze and classify such large numbers of files effectively, we need to be able to group them and identify their respective families. In addition, the same grouping criteria may be applied to new files encountered on computers in order to detect them as malicious and as belonging to a certain family.

For this challenge, Microsoft is providing the data science community with an unprecedented malware dataset and encouraging open-source progress on effective techniques for grouping variants of malware files into their respective families.

Evaluation:

Submissions are evaluated using the multi-class logarithmic loss. Each file has been labeled with one true class. For each file, you must submit a set of predicted probabilities (one for every class):


$$\text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log(p_{ij}),$$


where $N$ is the number of files in the test set, $M$ is the number of labels, $\log$ is the natural logarithm, $y_{ij}$ is 1 if observation $i$ is in class $j$ and 0 otherwise, and $p_{ij}$ is the predicted probability that observation $i$ belongs to class $j$.

The submitted probabilities for a given file are not required to sum to one because they are rescaled prior to being scored (each row is divided by the row sum). In order to avoid the extremes of the log function, predicted probabilities are replaced with $\max(\min(p, 1 - 10^{-15}), 10^{-15})$.
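The scoring rule described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the official scorer: the function name `multiclass_log_loss` is hypothetical, labels are assumed to be zero-based class indices (so the dataset's classes 1 to 9 would be shifted by one), and the order of operations (rescale each row first, then clip) is an assumption, since the text does not state it.

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Multi-class log loss with row rescaling and clipping.

    y_true: zero-based class indices, shape (N,)  (assumption: 0..M-1)
    probs:  predicted scores, shape (N, M); rows need not sum to one
    """
    probs = np.asarray(probs, dtype=float)
    y_true = np.asarray(y_true, dtype=int)
    # Divide each row by its sum so the row forms a probability distribution.
    probs = probs / probs.sum(axis=1, keepdims=True)
    # Clip to [eps, 1 - eps] to avoid log(0).
    probs = np.clip(probs, eps, 1 - eps)
    n = probs.shape[0]
    # y_ij is 1 only for the true class, so only that column contributes.
    return -np.log(probs[np.arange(n), y_true]).mean()

# A uniform guess over 9 classes scores log(9), whatever the true labels are.
uniform_score = multiclass_log_loss([0, 3], [[1] * 9, [2] * 9])
print(uniform_score)
```

Because each row is rescaled, the unnormalized rows `[1] * 9` and `[2] * 9` are both treated as a uniform prediction.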

Submission Format

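For every file in the test set, submission files should contain 10 columns: the file Id followed by one predicted probability for each of the 9 classes (Prediction1 through Prediction9).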

The file should contain a header and have the following format:

Id,Prediction1,Prediction2,...,Prediction9
02IOCvYEy8mjiuAQHax3,0.2,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1
02K5GMYITj7bBoAisEmD,0.2,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1
02zcUmKV16Lya5xqnPGB,0.2,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1
03nJaQV6K2ObICUmyWoR,0.2,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1
04EjIdbPV5e1XroFOpiN,0.2,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1

.....
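A file in the format shown above could be produced with Python's standard `csv` module. The sketch below is only one way to do it; the helper name `write_submission` and the shape of the `predictions` argument (a dict from file Id to nine probabilities) are assumptions made for illustration.

```python
import csv
import io

NUM_CLASSES = 9

def write_submission(out, predictions):
    """Write a header plus one row per file Id to the text stream `out`.

    predictions: dict mapping file Id -> sequence of 9 probabilities
    (hypothetical input shape, chosen for this sketch).
    """
    writer = csv.writer(out)
    writer.writerow(["Id"] + [f"Prediction{k}" for k in range(1, NUM_CLASSES + 1)])
    for file_id, probs in predictions.items():
        if len(probs) != NUM_CLASSES:
            raise ValueError(f"{file_id}: expected {NUM_CLASSES} probabilities")
        writer.writerow([file_id, *probs])

# Demo with an in-memory buffer; a real run would pass an open file instead.
buf = io.StringIO()
write_submission(buf, {"02IOCvYEy8mjiuAQHax3": [0.2] + [0.1] * 8})
print(buf.getvalue())
```

When writing to disk, open the file with `newline=""` so the `csv` module controls line endings itself.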

Data Description:

Warning: this dataset is almost half a terabyte uncompressed! We have compressed the data using 7-Zip to achieve the smallest file size possible. Note that the rules do not allow sharing of the data outside of Kaggle, including via BitTorrent.

You are provided with a set of known malware files representing a mix of 9 different families. Each malware file has an Id, a 20-character hash value uniquely identifying the file, and a Class, an integer representing one of the 9 family names to which the malware may belong:

  1. Ramnit

  2. Lollipop

  3. Kelihos_ver3

  4. Vundo

  5. Simda

  6. Tracur

  7. Kelihos_ver1

  8. Obfuscator.ACY

  9. Gatak
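The integer-to-family mapping above is convenient to keep as a lookup table, for example when checking how the training labels are distributed. The sketch below assumes that the label file has a header row with columns named `Id` and `Class`, as the description suggests; the helper name `count_families` and the sample rows (with made-up Ids) are purely illustrative.

```python
import csv
import io
from collections import Counter

# Class integer -> family name, as listed in the description.
FAMILIES = {
    1: "Ramnit", 2: "Lollipop", 3: "Kelihos_ver3",
    4: "Vundo", 5: "Simda", 6: "Tracur",
    7: "Kelihos_ver1", 8: "Obfuscator.ACY", 9: "Gatak",
}

def count_families(label_file):
    """Count files per family from a CSV stream with Id and Class columns."""
    reader = csv.DictReader(label_file)
    return Counter(FAMILIES[int(row["Class"])] for row in reader)

# Made-up rows standing in for trainLabels.csv (Ids are placeholders).
sample = io.StringIO("Id,Class\nAAAAAAAAAAAAAAAAAAAA,1\nBBBBBBBBBBBBBBBBBBBB,9\nCCCCCCCCCCCCCCCCCCCC,1\n")
counts = count_families(sample)
print(counts)
```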

For each file, the raw data contains the hexadecimal representation of the file's binary content, without the PE header (to ensure sterility). You are also provided with a metadata manifest, which is a log containing various metadata extracted from the binary, such as function calls, strings, etc. This was generated using the IDA disassembler tool. Your task is to develop the best mechanism for classifying files in the test set into their respective family affiliations.

The dataset contains the following files:

  • train.7z - the raw data for the training set (MD5 hash = 4fedb0899fc2210a6c843889a70952ed)

  • trainLabels.csv - the class labels associated with the training set

  • test.7z - the raw data for the test set (MD5 hash = 84b6fbfb9df3c461ed2cbbfa371ffb43)

  • sampleSubmission.csv - a file showing the valid submission format

  • dataSample.csv - a sample of the dataset to preview before downloading
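Given the size of the archives, it is worth checking a download against the MD5 hashes listed above before extracting it. A minimal sketch using the standard `hashlib` module follows; the helper name `md5_of_stream` is an assumption, and the stream is read in chunks so that the archive never has to fit in memory.

```python
import hashlib
import io

# MD5 hashes as published in the file list.
EXPECTED_MD5 = {
    "train.7z": "4fedb0899fc2210a6c843889a70952ed",
    "test.7z": "84b6fbfb9df3c461ed2cbbfa371ffb43",
}

def md5_of_stream(stream, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a binary stream, 1 MiB at a time."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()

# Demo on an in-memory stream; for a real archive use open(path, "rb").
demo_digest = md5_of_stream(io.BytesIO(b"hello"))
print(demo_digest)
```

For a downloaded archive, the check would be `md5_of_stream(open("train.7z", "rb")) == EXPECTED_MD5["train.7z"]`, ideally inside a `with` block so the file is closed afterwards.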

