
KKBOX Music User Renewal Prediction Competition [Kaggle Competition]

2020-01-17

Overview

The 11th ACM International Conference on Web Search and Data Mining (WSDM 2018) is challenging you to build an algorithm that predicts whether a subscription user will churn, using a donated dataset from KKBOX. WSDM (pronounced "wisdom") is one of the premier conferences on web-inspired research involving search and data mining. They're committed to publishing original, high-quality papers and presentations, with an emphasis on practical but principled novel models.

For a subscription business, accurately predicting churn is critical to long-term success. Even slight variations in churn can drastically affect profits.

KKBOX is Asia’s leading music streaming service, holding the world’s most comprehensive Asia-Pop music library with over 30 million tracks. They offer a generous, unlimited version of their service to millions of people, supported by advertising and paid subscriptions. This delicate model is dependent on accurately predicting churn of their paid users.

In this competition you’re tasked with building an algorithm that predicts whether a user will churn after their subscription expires. Currently, the company uses survival analysis techniques to determine the residual membership lifetime for each subscriber. By adopting different methods, KKBOX anticipates they’ll discover new insights into why users leave, so they can be proactive in keeping users dancing.

Winners will present their findings at the WSDM conference, February 6-8, 2018, in Los Angeles, CA.


Evaluation

The evaluation metric for this competition is Log Loss:

$$\text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\left(y_i\log(p_i) + (1-y_i)\log(1-p_i)\right)$$

where $N$ is the number of observations, $\log$ is the natural logarithm, $y_i$ is the binary target, and $p_i$ is the predicted probability that $y_i$ equals 1.

Note: the actual submitted predicted probabilities are replaced with $\max(\min(p, 1-10^{-15}), 10^{-15})$.
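
As a rough illustration, the metric and the clipping rule from the note can be sketched in plain Python. The function name `log_loss` and the `eps` parameter are illustrative choices, not part of the competition tooling.

```python
import math

EPS = 1e-15  # clipping bound taken from the note on submitted probabilities

def log_loss(y_true, p_pred, eps=EPS):
    """Mean binary log loss, with each prediction clipped to [eps, 1 - eps]."""
    if len(y_true) != len(p_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = max(min(p, 1.0 - eps), eps)  # avoids log(0) for p = 0 or p = 1
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# A constant prediction of 0.5 scores log(2) regardless of the labels.
baseline = log_loss([1, 0, 0, 1], [0.5, 0.5, 0.5, 0.5])
```

Because of the clipping, a single confidently wrong prediction (for example 0 for a user who churned) costs about 34.5 rather than infinity, which is still large enough to dominate the average on a small set.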

Submission File

For each user id (msno) in the test set, you must predict the probability of churn (a number between 0 and 1). The file should contain a header and have the following format:

msno,is_churn
ugx0CjOMzazClkFzU2xasmDZaoIqOUAZPsH1q0teWCg=,0.5
zLo9f73nGGT1p21ltZC3ChiRnAVvgibMyazbCxvWPcg=,0.4
f/NmvEzHfhINFEYZTR05prUdr+E+3+oewvweYz9cCQE=,0.9
etc.
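
A minimal sketch of producing a file in this format with Python's standard `csv` module follows. The helper name `write_submission` and the range check are assumptions added for illustration; only the header and column order come from the description above.

```python
import csv
import io

def write_submission(predictions, out):
    """Write (msno, probability) pairs as CSV with the required header."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["msno", "is_churn"])
    for msno, prob in predictions:
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"probability out of range for {msno}: {prob}")
        writer.writerow([msno, prob])

# Written to an in-memory buffer here; pass an open file object
# (opened with newline="") to write to disk instead.
buf = io.StringIO()
write_submission(
    [
        ("ugx0CjOMzazClkFzU2xasmDZaoIqOUAZPsH1q0teWCg=", 0.5),
        ("zLo9f73nGGT1p21ltZC3ChiRnAVvgibMyazbCxvWPcg=", 0.4),
    ],
    buf,
)
```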

Data Description

In this challenge, you are asked to predict whether a user will churn after his/her subscription expires. Specifically, we want to forecast whether a user makes a new service subscription transaction within 30 days after the current membership expiration date.

KKBOX offers a subscription-based music streaming service. When users sign up for our service, they can choose to renew either manually or automatically. Users can actively cancel their membership at any time.

The churn/renewal definition can be tricky due to KKBOX's subscription model. Since the majority of KKBOX's subscriptions are 30 days long, a lot of users re-subscribe every month. The key fields to determine churn/renewal are transaction date, membership expiration date, and is_cancel. Note that the is_cancel field indicates whether a user actively cancels a subscription. Subscription cancellation does not imply the user has churned. A user may cancel a service subscription due to a change of service plans or other reasons. The criterion for "churn" is no new valid service subscription within 30 days after the current membership expires.

UPDATE: As of November 6, 2017, we have refreshed the test data to predict user churn in the month of April 2017.

The training and the test data are selected from users whose membership expires within a certain month. The train data consists of users whose subscription expires within the month of February 2017, and the test data consists of users whose subscription expires within the month of March 2017. This means we are looking at user churn or renewal roughly in the month of March 2017 for the train set, and roughly in the month of April 2017 for the test set. Train and test sets are split by transaction date, as are the public and private leaderboard data.

In this dataset, KKBOX has included more user behaviors than the ones in the train and test datasets, in order to enable participants to explore different user behaviors outside of the train and test sets. For example, a user could actively cancel the subscription, but renew within 30 days.


Tables

train.csv

the train set, containing the user ids and whether they have churned.

  • msno: user id

  • is_churn: This is the target variable. Churn is defined as whether the user did not continue the subscription within 30 days of expiration. is_churn = 1 means churn, is_churn = 0 means renewal.

train_v2.csv

same format as train.csv, refreshed 11/06/2017, contains the churn data for March, 2017.

sample_submission_zero.csv

the test set, containing the user ids, in the format that we expect you to submit

  • msno: user id

  • is_churn: This is what you will predict. Churn is defined as whether the user did not continue the subscription within 30 days of expiration. is_churn = 1 means churn, is_churn = 0 means renewal.

sample_submission_v2.csv

same format as sample_submission_zero.csv, refreshed 11/06/2017, contains the test data for April, 2017.

transactions.csv

transactions of users up until 2/28/2017.

  • msno: user id

  • payment_method_id: payment method

  • payment_plan_days: length of membership plan in days

  • plan_list_price: in New Taiwan Dollar (NTD)

  • actual_amount_paid: in New Taiwan Dollar (NTD)

  • is_auto_renew

  • transaction_date: format %Y%m%d

  • membership_expire_date: format %Y%m%d

  • is_cancel: whether or not the user canceled the membership in this transaction.
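
The following is a small sketch of reading rows with these columns into typed Python values, including the `%Y%m%d` date format. It assumes that `is_auto_renew` and `is_cancel` are encoded as `0`/`1` and that the price columns hold integers, neither of which is stated above. The sample row is made up for illustration and is not taken from the competition data.

```python
import csv
import io
from datetime import datetime

def parse_yyyymmdd(text):
    """Parse a date written in the %Y%m%d format used by the tables."""
    return datetime.strptime(text, "%Y%m%d").date()

def read_transactions(handle):
    """Read transaction rows into dicts with typed values."""
    rows = []
    for raw in csv.DictReader(handle):
        rows.append({
            "msno": raw["msno"],
            "payment_method_id": raw["payment_method_id"],
            "payment_plan_days": int(raw["payment_plan_days"]),
            "plan_list_price": int(raw["plan_list_price"]),        # NTD, assumed integer
            "actual_amount_paid": int(raw["actual_amount_paid"]),  # NTD, assumed integer
            "is_auto_renew": raw["is_auto_renew"] == "1",          # assumed 0/1 encoding
            "transaction_date": parse_yyyymmdd(raw["transaction_date"]),
            "membership_expire_date": parse_yyyymmdd(raw["membership_expire_date"]),
            "is_cancel": raw["is_cancel"] == "1",                  # assumed 0/1 encoding
        })
    return rows

# Hypothetical sample row, for demonstration only.
SAMPLE = """msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
user_a,41,30,149,149,1,20170201,20170301,0
"""
rows = read_transactions(io.StringIO(SAMPLE))
days_covered = (rows[0]["membership_expire_date"] - rows[0]["transaction_date"]).days
```

Converting the dates up front makes it straightforward to compute gaps between an expiration date and the next transaction, which is what the churn definition depends on.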

transactions_v2.csv

same format as transactions.csv, refreshed 11/06/2017, contains the transactions data until 3/31/2017.

user_logs.csv

daily user logs describing listening behaviors of a user. Data collected until 2/28/2017.

  • msno: user id

  • date: format %Y%m%d

  • num_25: # of songs played less than 25% of the song length

  • num_50: # of songs played between 25% and 50% of the song length

  • num_75: # of songs played between 50% and 75% of the song length

  • num_985: # of songs played between 75% and 98.5% of the song length

  • num_100: # of songs played over 98.5% of the song length

  • num_unq: # of unique songs played

  • total_secs: total seconds played
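
One possible way to turn the daily rows into per-user features is sketched below. It assumes the five `num_*` play buckets are disjoint, so that their sum is the total number of plays, and that there is at most one row per user per day. The feature names (`plays`, `completion_rate`, `active_days`) and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict

COUNT_COLUMNS = ("num_25", "num_50", "num_75", "num_985", "num_100")

def summarize_logs(handle):
    """Aggregate daily log rows into one summary dict per msno."""
    totals = defaultdict(lambda: {"plays": 0, "full_plays": 0,
                                  "total_secs": 0.0, "active_days": 0})
    for row in csv.DictReader(handle):
        entry = totals[row["msno"]]
        # Assumes the buckets are disjoint, so the sum counts every play once.
        entry["plays"] += sum(int(row[col]) for col in COUNT_COLUMNS)
        entry["full_plays"] += int(row["num_100"])
        entry["total_secs"] += float(row["total_secs"])
        entry["active_days"] += 1  # assumes one row per user per day
    for entry in totals.values():
        plays = entry["plays"]
        # Share of plays that went past 98.5% of the song length.
        entry["completion_rate"] = entry["full_plays"] / plays if plays else 0.0
    return dict(totals)

# Hypothetical sample rows, for demonstration only.
SAMPLE = """msno,date,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs
user_a,20170201,3,1,0,2,14,18,3600.5
user_a,20170202,0,0,1,0,4,5,900.0
"""
summary = summarize_logs(io.StringIO(SAMPLE))
```

Since the function reads one row at a time, the same loop can be pointed at a file object without loading the whole log into memory.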

user_logs_v2.csv

same format as user_logs.csv, refreshed 11/06/2017, contains the user logs data until 3/31/2017.

members.csv

user information. Note that member information is not available for every user in the dataset.

  • msno

  • city

  • bd: age. Note: this column has outlier values ranging from -7000 to 2015, please use your judgement.

  • gender

  • registered_via: registration method

  • registration_init_time: format %Y%m%d

  • expiration_date: format %Y%m%d, a snapshot taken when members.csv was extracted. It does not represent the actual churn behavior.
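
Given the outliers noted for the bd column, one simple option is to treat implausible ages as missing. The sketch below does that; the bounds of 1 and 100 are an arbitrary assumption for illustration, not a rule from the competition, and `clean_age` is a hypothetical helper name.

```python
def clean_age(bd, low=1, high=100):
    """Return bd as an int if it lies in [low, high], else None (missing)."""
    try:
        age = int(bd)
    except (TypeError, ValueError):
        return None  # empty or non-numeric values are treated as missing
    return age if low <= age <= high else None

# Values outside the assumed plausible range become None.
cleaned = [clean_age(value) for value in ["27", "-7000", "2015", "", "0"]]
```

Whether to drop, impute, or flag the missing ages afterwards is a separate modelling decision.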

members_v3.csv

Refreshed 11/13/2017, replaces members.csv data with the expiration date data removed.

Data Extraction Details

We include the code "WSDMChurnLabeller.scala" for generating labels for the users of interest. The code provided is the one we used to generate the labels for the test data set. Note that the date values in the script are modified so it is easier to run on personal laptops. On our cluster, the log history runs from 2015-01-01 to 2017-03-31. With the provision of the user label generator, we encourage participants to generate training labels using data not included in our sample training labels.

One important piece of information in the data extraction process is the definition of the membership expiration date. Suppose we have a sequence for a user with tuples of (transaction date, membership expiration date, is_cancel):

(2017-01-01, 2017-02-28, false)

(2017-02-25, 2017-03-15, false)

(2017-04-30, 2017-05-20, false)

(data used for demo only, not included in competition dataset)

This user is included in the dataset since the expiration date falls within our time period. Since the next subscription transaction is more than 30 days after 2017-03-15, the previous expiration date, we will count this user as a churned user.

Let's consider a more complex example derived from the last one. Suppose now a user has the following transaction sequence:

(2017-01-01, 2017-02-28, false)

(2017-02-25, 2017-04-03, false)

(2017-03-15, 2017-03-16, true)

(2017-04-01, 2017-06-30, false)

The above entries are quite typical for a user who changes his subscription plan. Entry 3 indicates that the membership expiration date is moved from 2017-04-03 back to 2017-03-16 due to the user making an active cancellation on the 15th. On April 1st, the user purchased a long-term (two-month) subscription, which is about 15 days after the "current" expiration date. So this user is not a churn user.

Now let's consider a sequence that indicates the user does not fall within our scope of prediction:

(2017-01-01, 2017-02-28, false)

(2017-02-25, 2017-04-03, false)

(2017-03-15, 2017-03-16, true)

(2017-03-18, 2017-04-02, false)

Note that even though the third entry has a membership expiration date of 2017-03-16, the fourth entry extends the membership expiration date to 2017-04-02, which is not between 2017-03-01 and 2017-03-31, so we will not make a prediction for this user.
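
The three examples above can be reproduced with a simplified labelling sketch. This is not the logic of WSDMChurnLabeller.scala, only an approximation under stated assumptions: the "current" expiration date is taken from the last transaction made on or before the end of the window, a renewal is any non-cancel transaction made after the window ends and no later than 30 days after that expiration date, and the 30-day boundary is treated as inclusive. The function name `label_user` is hypothetical.

```python
from datetime import date, timedelta

def label_user(transactions, window_start, window_end, grace_days=30):
    """Return None if the user is out of scope, True for churn, False for renewal.

    transactions: iterable of (transaction_date, membership_expire_date, is_cancel).
    """
    ordered = sorted(transactions, key=lambda t: t[0])
    history = [t for t in ordered if t[0] <= window_end]
    if not history:
        return None
    # Assumption: the latest transaction in the window defines the expiration date.
    current_expiry = history[-1][1]
    if not (window_start <= current_expiry <= window_end):
        return None  # expiration falls outside the prediction window
    deadline = current_expiry + timedelta(days=grace_days)  # inclusive (assumed)
    renewed = any(
        (not is_cancel) and window_end < tx_date <= deadline
        for tx_date, _expiry, is_cancel in ordered
    )
    return not renewed

MARCH_START, MARCH_END = date(2017, 3, 1), date(2017, 3, 31)

EXAMPLE_1 = [  # late re-subscription: churn
    (date(2017, 1, 1), date(2017, 2, 28), False),
    (date(2017, 2, 25), date(2017, 3, 15), False),
    (date(2017, 4, 30), date(2017, 5, 20), False),
]
EXAMPLE_2 = [  # cancellation followed by a new plan: renewal
    (date(2017, 1, 1), date(2017, 2, 28), False),
    (date(2017, 2, 25), date(2017, 4, 3), False),
    (date(2017, 3, 15), date(2017, 3, 16), True),
    (date(2017, 4, 1), date(2017, 6, 30), False),
]
EXAMPLE_3 = [  # expiration pushed into April: out of scope
    (date(2017, 1, 1), date(2017, 2, 28), False),
    (date(2017, 2, 25), date(2017, 4, 3), False),
    (date(2017, 3, 15), date(2017, 3, 16), True),
    (date(2017, 3, 18), date(2017, 4, 2), False),
]
labels = [label_user(ex, MARCH_START, MARCH_END)
          for ex in (EXAMPLE_1, EXAMPLE_2, EXAMPLE_3)]
```

Under these assumptions the sketch marks the first user as churned, the second as renewed, and returns `None` for the third, matching the outcomes described in the text. The provided Scala labeller remains the reference for edge cases this sketch does not model.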

