New championship: ML Boot Camp VI. Predicting audience response to an online survey



Today, June 25, ML Boot Camp VI starts with the task “Predicting audience response to an online survey” (if this is the first time you have heard of ML Boot Camp, look under the spoiler).

Spoiler
ML Boot Camp is a championship in solving machine learning problems. It works like this: we publish a task, and participants have one month to solve it and submit their solutions. The authors of the best solutions receive prizes. Last time we awarded a MacBook Pro for first place, an NVIDIA 1080 Ti for second, an NVIDIA 1060 for third, and a WD My Cloud 6 TB for places 4 to 6. By tradition, we also sent the top 50 participants T-shirts with the championship branding.

With each new competition, the ML Boot Camp audience grows significantly (7,000 participants from more than 20 countries are already registered).

At the start, participants receive the problem statement and a description of the available data, the training set. The training set consists of labeled examples: feature vectors describing each object, together with the known answer. Using whatever machine learning methods they know, participants train a model and evaluate it on a test set, which is split into two parts: rating and final. The winner is whoever achieves the best result on the final part.

On the last day of the championship, each participant can choose two solutions to represent them in the final. The better of the two goes to the leaderboard.

Rules and useful materials can be found on the championship website.

This time we invite you to plunge into the dark abyss of marketing: in this ML Boot Camp competition, you will predict user behavior in a large-scale marketing study.

We chose a task of an appropriate level and tried to make it interesting for both professionals and beginners. In this championship you will be doing real research work.

The format of the competition has not changed: the championship runs for one month, from June 25 to July 25, 2018. More details on the prizes and the task are below.

The task: “Predicting audience response to an online survey”


We have the results of an online survey. Part of the audience completed the survey fully and correctly. The rest completed it partially, completed it with errors, or declined to take part at all. The goal is to predict as accurately as possible which respondents belong to the first group, that is, those who completed the survey fully and without errors.

The main data file contains 19,528,597 lines (10 GB) and consists of 6 columns:

1. cuid: the identifier. The file may contain several entries for a single identifier;
2. cat_feature: a categorical variable with values in {0, 1, 2, 3, 4, 5};
3-5. counters collected from the person's behavior on the Internet. Format: {w_1: c_1, w_2: c_2, ...}, where w_i is an encoded token and c_i is the frequency of that token;
6. dt_diff: the number of days until the date when the value of the target variable was obtained.



A small piece of data as an example:

00000d2994b6df9239901389031acaac 5 {"809001":2,"848545":2,"565828":1,"490363":1} {"85789":1,"238490":1,"32285":1,"103987":1,"16507":2,"6477":1,"92797":2} {} 39
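To make the column layout concrete, here is a minimal sketch of parsing one such row in Python. It assumes the columns are separated by whitespace and that the JSON counters contain no inner spaces, as in the sample row; the real file may use a different delimiter. The names `Record` and `parse_line` are illustrative and are not part of the competition materials.

```python
import json
from typing import NamedTuple


class Record(NamedTuple):
    cuid: str
    cat_feature: int
    counters: tuple  # three dicts mapping token -> frequency
    dt_diff: int


def parse_line(line: str) -> Record:
    # Assumption: whitespace-separated columns, no spaces inside the JSON.
    parts = line.split()
    if len(parts) != 6:
        raise ValueError(f"expected 6 columns, got {len(parts)}")
    cuid, cat, c1, c2, c3, dt = parts
    counters = tuple(
        {token: int(freq) for token, freq in json.loads(raw).items()}
        for raw in (c1, c2, c3)
    )
    return Record(cuid, int(cat), counters, int(dt))


# The sample row from the text above.
sample = (
    '00000d2994b6df9239901389031acaac 5 '
    '{"809001":2,"848545":2,"565828":1,"490363":1} '
    '{"85789":1,"238490":1,"32285":1,"103987":1,"16507":2,"6477":1,"92797":2} '
    '{} 39'
)
record = parse_line(sample)
```

Since the full file is about 10 GB, a parser like this would normally be applied line by line while streaming the file rather than after loading it into memory.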

Predictions must be made for 181 thousand users. The training data set includes a table of identifiers and values of the target variable (427,995 records).
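Because the main file may contain several entries per cuid while a single prediction per cuid is required, the entries have to be combined in some way. The sketch below shows one possible approach, summing token frequencies per identifier. This is an assumption for illustration, not a method prescribed by the organizers; the function name `merge_counters` and the identifiers `user_a` and `user_b` are hypothetical.

```python
from collections import Counter, defaultdict


def merge_counters(rows):
    """Sum token frequencies over all rows that share a cuid.

    rows: iterable of (cuid, {token: frequency}) pairs.
    Returns a dict mapping cuid -> Counter of summed frequencies.
    """
    merged = defaultdict(Counter)
    for cuid, counts in rows:
        # Counter.update adds the frequencies to the existing totals.
        merged[cuid].update(counts)
    return dict(merged)


# Toy rows with hypothetical identifiers.
rows = [
    ("user_a", {"809001": 2, "848545": 2}),
    ("user_a", {"809001": 1, "565828": 1}),
    ("user_b", {}),
]
per_user = merge_counters(rows)
```

Summation discards the dt_diff of each entry. Whether that information matters for the target is something to check on the training data.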

The task metric is ROC AUC. This means that the answer for each cuid is a score of class membership in the range [0, 1]. In essence, this metric measures how correctly the classifier orders the objects with respect to one of the classes. We are not interested in the specific class label the algorithm produces, nor in the exact probability for each object. Only the correctness of the ordering matters.

Of course, in a specific applied problem one solution may be better than another even when their ROC AUC values are equal, but we decided not to complicate the task.
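The ordering property of ROC AUC can be checked on a toy example. The sketch below computes the metric directly from its pairwise definition: the share of (positive, negative) pairs in which the positive object gets the higher score, with ties counted as one half. The labels and scores are made up for illustration, and the function `roc_auc` is a plain-Python stand-in, not the organizers' scoring code.

```python
def roc_auc(labels, scores):
    """Pairwise ROC AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correctly ordered pair
    return wins / (len(pos) * len(neg))


# Made-up labels and scores for illustration.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.35, 0.2, 0.8]
auc = roc_auc(labels, scores)

# A monotone transformation changes the values but not their order.
rescaled = [s ** 3 for s in scores]
auc_rescaled = roc_auc(labels, rescaled)
```

Here 5 of the 6 positive-negative pairs are ordered correctly, so `auc` is 5/6, and `auc_rescaled` is the same because cubing preserves the order of the scores. The quadratic loop is fine for a toy example; for 181 thousand users a sort-based implementation from a library would be the practical choice.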

Prizes


This time the six prizes are distributed as follows:

Top1: Apple MacBook Pro 13
Top2: Apple MacBook Air 13
Top3: Western Digital My Cloud Mirror
Top4-5-6: Western Digital My Passport 4 TB

As always, the top 50 participants will receive T-shirts with the championship branding, and participants with the most interesting solutions will be invited to interview for a Data Scientist position at Mail.Ru Group.

MLBootCamp Community


Join our Telegram community, where you can always ask questions and get expert advice on Data Science. The Mail.Ru Group championship community is also a place for networking, where it is easy to find like-minded people.

Registration


The championship starts today at 19:00 Moscow time. Registration is open. We look forward to seeing everyone. Good luck!

Source: https://habr.com/ru/post/415191/

