ML-Blitz: analysis of the tasks of the first qualifying round

On June 23, 2018, the final of ML-Blitz, a machine learning competition organized by Yandex, took place. Earlier we announced it on Habr and described the kinds of problems one can encounter at such a competition.


Now we want to share with you the analysis of the problems of one of the qualifying rounds, the very first one. Two participants managed to solve all the problems of this round; 57 participants solved at least one problem, and 110 made at least one submission attempt.


Although the author of these lines took part in preparing problems for the competition, none of his problems appeared in the first qualifying round. So I am writing this analysis from the point of view of a contestant who saw the problem statements for the first time and wanted to score as many points as possible as quickly as possible.


Python was expected to be the most popular programming language among the contestants, so I also used it in all cases where I had to write code.


All my solutions are available on GitHub.




Problem A. Decision stump


Task
Python solution
C++ solution


A decision stump is one of the simplest decision functions in machine learning. It is a piecewise constant function defined as follows:


f(x) = \begin{cases} a, & x < c \\ b, & x \ge c \end{cases}


In this problem, it was necessary to build the optimal decision stump for a training set. That is, given pairs of numbers (x_1, y_1), ..., (x_n, y_n), it was required to determine the parameter values of the decision stump that minimize the quadratic loss functional


Q(f, x, y) = \sum_{i=1}^{n} (f(x_i) - y_i)^2


As the answer, it was required to output the optimal values of a, b, c.


Solution

So, we need to construct a piecewise constant approximation of a known function. If the parameter c is known, then finding the parameters a and b is simple enough. They can be written mathematically as solutions of the corresponding optimization problems. The parameter a is the prediction of the decision stump on those objects (x_i, y_i) of the training set for which x_i < c; similarly, b is the prediction on those objects (x_i, y_i) of the training set for which x_i >= c.


Let us introduce notation for the corresponding subsets of the training set: L(c) is the subset of objects to the left of the point c, and R(c) is the subset of objects to the right of it:


L(c) = \{ (x_i, y_i) \mid x_i < c \}


R(c) = \{ (x_i, y_i) \mid x_i \ge c \}


Then the optimal value of a equals the arithmetic mean of the answers in the set L(c), and the optimal value of b, correspondingly, equals the arithmetic mean of the answers in the set R(c).


This can be justified as follows.

a = \arg\min_{t \in \mathbb{R}} \sum_{(x_i, y_i) \in L(c)} (t - y_i)^2


b = \arg\min_{t \in \mathbb{R}} \sum_{(x_i, y_i) \in R(c)} (t - y_i)^2


It is well known that the answer in such optimization problems is the arithmetic average:


a = \frac{\sum_{(x_i, y_i) \in L(c)} y_i}{|L(c)|}


b = \frac{\sum_{(x_i, y_i) \in R(c)} y_i}{|R(c)|}


This is easy enough to prove. Take the partial derivative of the loss functional with respect to the prediction value:


\frac{\partial}{\partial t} \sum_{y \in Y} (t - y)^2 = 2 \cdot \sum_{y \in Y} (t - y) = 2 \cdot |Y| \cdot t - 2 \cdot \sum_{y \in Y} y


After equating the derivative to zero, we get that


t = \frac{1}{|Y|} \sum_{y \in Y} y


Setting the derivative to zero here is valid, since the function being optimized is convex, and for convex functions local extrema are guaranteed to also be global extrema.


So, it is now clear how to find the parameters a and b for a known parameter c. It remains to understand how to find the optimal value of the parameter c.


The first thing to notice: the parameter c can take any real value, but the set of distinct partitions is finite. A training set of n objects can be split into "left" and "right" parts in no more than n + 1 ways. In fact, there may be even fewer such splits: for example, the feature values of some objects may coincide. Moreover, the splits in which all objects of the training set end up in the left part or all in the right part are equivalent for our purposes.


To enumerate all possible splits, we can order the objects of the training set by non-decreasing feature value:


(x_{i_1}, y_{i_1}), \ldots, (x_{i_n}, y_{i_n})


x_{i_1} \le x_{i_2} \le \ldots \le x_{i_n}


It is now clear that the potentially interesting values of the parameter c are the midpoints between neighboring feature values:


\frac{x_{i_1} + x_{i_2}}{2}, \ \frac{x_{i_2} + x_{i_3}}{2}, \ \ldots, \ \frac{x_{i_{n-1}} + x_{i_n}}{2}


Now it is clear what needs to be done to obtain a solution: go through all potentially interesting values of the parameter c, for each of them determine the corresponding values of a and b as well as the value of the loss functional, and then select the set of parameters corresponding to the minimum value of the loss functional.


The only remaining question is how to make this solution efficient. A direct implementation of the described algorithm leads to quadratic complexity, which is unacceptable.
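For illustration, here is a sketch of that straightforward quadratic approach: it recomputes the means and the loss from scratch for every candidate split. This is not the submitted solution; the function name and structure are mine.

# A naive O(n^2) sketch: for every candidate split, recompute everything from scratch.
# items is a list of [x, y] pairs with float values; at least two distinct
# feature values are assumed (otherwise there is no split at all).
def best_stump_naive(items):
    items = sorted(items)
    best = None
    for i in range(len(items) - 1):
        if items[i][0] == items[i + 1][0]:
            continue  # equal feature values do not produce a new split
        c = (items[i][0] + items[i + 1][0]) / 2.
        left = [y for x, y in items if x < c]
        right = [y for x, y in items if x >= c]
        a = sum(left) / len(left)
        b = sum(right) / len(right)
        q = sum((a - y) ** 2 for y in left) + sum((b - y) ** 2 for y in right)
        if best is None or q < best[0]:
            best = (q, a, b, c)
    return best[1:]  # (a, b, c)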


To make the calculations efficient, recall that finding the parameters a and b only requires computing mean values over a set. When the next element is added to a set (or an element is removed from it), the mean can be recalculated in constant time if you store not the mean itself but the sum of the values and their count separately. The same holds for the sum of squared deviations: to compute it, according to the well-known formula for the variance, you can store the sum of the values and the sum of their squares separately. Alternatively, you can use any online calculation method. I have already written about this on Habr before.
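For reference, the well-known decomposition referred to above is the standard identity relating the sum of squared deviations to the sum and the sum of squares:

\sum_{y \in Y} (y - \bar{y})^2 = \sum_{y \in Y} y^2 - \frac{1}{|Y|} \Big( \sum_{y \in Y} y \Big)^2, \qquad \bar{y} = \frac{1}{|Y|} \sum_{y \in Y} y

So it is enough to maintain the count of elements, their sum, and the sum of their squares.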


Implementation

In my implementation I will use Welford's method, since I like it more than calculations by the "standard" formulas. In addition, I will not use numpy or any other libraries, to show that knowledge of elementary Python constructs is enough to obtain a solution.


So, we need to be able to incrementally compute the mean and the sum of squared deviations. To do this, we describe two classes:


class MeanCalculator:
    def __init__(self):
        self.Count = 0.
        self.Mean = 0.

    def Add(self, value, weight = 1.):
        self.Count += weight
        self.Mean += weight * (value - self.Mean) / self.Count

    def Remove(self, value, weight = 1.):
        self.Add(value, -weight)


class SumSquaredErrorsCalculator:
    def __init__(self):
        self.MeanCalculator = MeanCalculator()
        self.SSE = 0.

    def Add(self, value, weight = 1.):
        curDiff = value - self.MeanCalculator.Mean
        self.MeanCalculator.Add(value, weight)
        self.SSE += weight * curDiff * (value - self.MeanCalculator.Mean)

    def Remove(self, value, weight = 1.):
        self.Add(value, -weight)
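A tiny usage sketch of these calculators (the values are made up for illustration):

sse = SumSquaredErrorsCalculator()
for v in [1., 2., 4.]:
    sse.Add(v)
print sse.MeanCalculator.Mean   # 2.333..., the mean of 1, 2, 4
print sse.SSE                   # 4.666..., the sum of squared deviations from the mean
sse.Remove(4.)
print sse.MeanCalculator.Mean   # 1.5 again
print sse.SSE                   # 0.5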

Now we need a class for storing and processing the training set. First we describe its fields and input method:


class Instances:
    def __init__(self):
        self.Items = []
        self.OverAllSSE = SumSquaredErrorsCalculator()

    def Read(self):
        fin = open('stump.in')
        for line in fin.readlines()[1:]:
            x, y = map(float, line.strip().split())
            self.Items.append([x, y])
            self.OverAllSSE.Add(y)
        self.Items.sort()

While reading the data, we immediately build the SumSquaredErrorsCalculator structure for all objects of the sample. After loading the entire sample, we sort it by non-decreasing feature value.


Now the most interesting part: the method that finds the optimal parameter values.


Let's start with the initialization. In the initial state, all objects belong to the right subsample. Then the parameter a can be anything, the parameter b equals the mean of the answers over the entire sample, and the parameter c is such that all objects of the sample lie to the right of it.


In addition, we initialize the variables left and right - they will store statistics for the left and right subsamples, respectively.


left = SumSquaredErrorsCalculator()
right = self.OverAllSSE

bestA = 0
bestB = right.MeanCalculator.Mean
bestC = self.Items[0][0]
bestQ = right.SSE

Now let's move on to the incremental algorithm. We process the elements of the sample one by one; each element is moved to the left subset. Then we need to check that the corresponding split actually exists, that is, that the feature value of the current object differs from that of the next object.


for i in range(len(self.Items) - 1):
    item = self.Items[i]
    nextItem = self.Items[i + 1]

    left.Add(item[1])
    right.Remove(item[1])

    if item[0] == nextItem[0]:
        continue

    a = left.MeanCalculator.Mean
    b = right.MeanCalculator.Mean
    c = (item[0] + nextItem[0]) / 2

    q = left.SSE + right.SSE

    if q < bestQ:
        bestA = a
        bestB = b
        bestC = c
        bestQ = q
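For completeness, here is how the two fragments above presumably fit together into a single method of the Instances class. The method name BestSplit and the final return statement are inferred from the driver code below; the body simply combines the fragments:

# Assumed assembly of the two fragments above into one method of Instances;
# the name and the return value follow from the driver code below.
def BestSplit(self):
    left = SumSquaredErrorsCalculator()
    right = self.OverAllSSE

    bestA = 0
    bestB = right.MeanCalculator.Mean
    bestC = self.Items[0][0]
    bestQ = right.SSE

    for i in range(len(self.Items) - 1):
        item = self.Items[i]
        nextItem = self.Items[i + 1]

        left.Add(item[1])
        right.Remove(item[1])

        if item[0] == nextItem[0]:
            continue

        q = left.SSE + right.SSE
        if q < bestQ:
            bestA = left.MeanCalculator.Mean
            bestB = right.MeanCalculator.Mean
            bestC = (item[0] + nextItem[0]) / 2
            bestQ = q

    return bestA, bestB, bestC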

Now it only remains to write the driver code that produces the solution:


instances = Instances()
instances.Read()

a, b, c = instances.BestSplit()

print >> open('stump.out', 'w'), a, b, c

I note that the presented Python solution is indeed accepted by the system, but it comes close to the time limit: the largest tests take about 800 milliseconds to run. Of course, much more impressive performance can be achieved by using additional libraries.
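To illustrate what "using additional libraries" might look like, here is a possible vectorized sketch based on numpy prefix sums. This is not the author's C++ solution; file I/O is omitted, the function name is mine, and at least two objects are assumed.

import numpy as np

def best_stump_vectorized(x, y):
    # Sort the objects by feature value
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    n = len(xs)

    # Prefix sums of the answers and their squares are enough to get the mean
    # and the sum of squared deviations of any left/right part in O(1)
    csum, csum2 = np.cumsum(ys), np.cumsum(ys ** 2)
    total, total2 = csum[-1], csum2[-1]

    # Start from the degenerate split: all objects in the right part
    best_q = total2 - total ** 2 / n
    best = (0.0, total / n, xs[0])

    left_cnt = np.arange(1, n)              # sizes of the left parts
    right_cnt = n - left_cnt
    left_sum, left_sum2 = csum[:-1], csum2[:-1]
    right_sum, right_sum2 = total - left_sum, total2 - left_sum2
    q = (left_sum2 - left_sum ** 2 / left_cnt) + \
        (right_sum2 - right_sum ** 2 / right_cnt)

    # A split between equal feature values does not exist
    q[xs[1:] == xs[:-1]] = np.inf

    i = int(np.argmin(q))
    if q[i] < best_q:
        best = (left_sum[i] / left_cnt[i],
                right_sum[i] / right_cnt[i],
                (xs[i] + xs[i + 1]) / 2)
    return best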


Therefore, I also implemented the proposed algorithm in C++. At worst, this solution spent 60 milliseconds on one of the tests.


Problem B. Recovery of coefficients


Task
Sklearn python solution


It is necessary to recover the parameters a, b, c of a function f, given a set of known pairs (x_1, f(x_1)), ..., (x_n, f(x_n)) and knowing that the values of the function are determined by the following formula:


f(x) = \Big( (a + \varepsilon_a) \sin x + (b + \varepsilon_b) \ln x \Big)^2 + (c + \varepsilon_c) x^2


Solution

Expand the brackets, ignoring the random terms:


f(x) = a^2 \cdot \sin^2(x) + b^2 \cdot \ln^2(x) + 2ab \cdot \sin(x) \cdot \ln(x) + c \cdot x^2


We now have a multivariate linear regression problem without an intercept. The features in this problem are the values sin^2(x), ln^2(x), sin(x) * ln(x), and x^2.


Since the functional dependence has no intercept and all random components have zero mean, one can expect that the fitted model's intercept will be almost zero. However, it is worth checking this before submitting the solution.


Solving the multivariate linear regression problem yields coefficients for the constructed features; in effect, the following representation of the function f will be found:


f(x) = t_1 \cdot \sin^2(x) + t_2 \cdot \ln^2(x) + t_3 \cdot \sin(x) \cdot \ln(x) + t_4 \cdot x^2


After that, the coefficients a, b, c can be found:


a = \sqrt{t_1}, \quad b = \sqrt{t_2}, \quad c = t_4


Additionally, it is worth checking that 2ab \approx t_3.


Implementation

First, we need to read the data and construct the features:


import math

x = []
y = []

for line in open('data.txt'):
    xx, yy = map(float, line.strip().split())
    x.append(xx)
    y.append(yy)

features = []
for xx in x:
    curFeatures = [
        math.sin(xx) ** 2,            # a^2
        math.log(xx) ** 2,            # b^2
        math.sin(xx) * math.log(xx),  # 2ab
        xx ** 2                       # c
    ]
    features.append(curFeatures)

Now we need to solve the multivariate linear regression problem. There are many ways to do this, from tools like Weka and sklearn library functions to my own implementation. However, in this case I wanted a "closed" solution: a single script that solves the problem completely. Therefore, I used sklearn.


from sklearn import linear_model as lm

linearModel = lm.LinearRegression()
linearModel.fit(features, y)
coeffs = linearModel.coef_

a = math.sqrt(coeffs[0])
b = math.sqrt(coeffs[1])
c = coeffs[3]

print "free coeff: ", linearModel.intercept_
print "2ab error: ", coeffs[2] - 2 * a * b
print a, b, c

In this case, it turned out that the intercept equals -0.0007 and the error in determining t_3 is 0.00135, so the solution found is quite consistent.


The resulting line with the coefficients:


 3.14172883822 2.71842889253 3.999957864335599 

Submitting it as the answer to the problem, we get a well-deserved OK!


Problem C. Freshness detector


Task
Script to get a solution using CatBoost


In this task, it was required to build a detector of fresh queries, given a ready-made sample with factor values and values of the objective function. Each line of the input file described one query. The factors were how frequently the query had been asked in the past: during the last hour, two hours, six hours, 12, 24, and 72 hours. The target is binary: it equals one if a fresh document was clicked for the query and zero otherwise.


For each line of the test file, it was required to output either zero or one, depending on the prediction, and to achieve an F1-measure greater than 0.25 on the test set.


Solution

Since the required F1-measure value is not too high, surely some fairly simple method will work for this problem. So I tried simply running CatBoost on the provided factors and then binarizing its predictions.


For CatBoost to work, you need to provide two files, training and test, as well as a column description. Making the column description is not difficult: the first two columns are the query text and the timestamp, which are easiest to ignore; the last column is the answer. So we get the following column description:


0	Auxiliary
1	Auxiliary
8	Target

Since the test file contains no answers and therefore lacks the last column, we add this column, simply filling it with zeros. I use plain awk for this:


 awk -F "\t" -vOFS="\t" '{print $0, 0}' fr_test.tsv > fr_test.fixed 
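Before applying the model, it has to be trained on the training sample with the same column description. That step is not shown above; a minimal sketch, assuming the training file is named fr_train.tsv (a hypothetical name), whose resulting model is then picked up by the calc call below:

 catboost fit -f fr_train.tsv --cd fields.cd 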

Now we can apply the trained model to the test file:


 catboost calc --input-path fr_test.fixed --cd fields.cd 

After this, the predictions will be in the file output.tsv. However, these are real-valued predictions that still need to be binarized.


We will assume that the proportions of positive examples in the training and test samples coincide. In the training sample, about 3/4 of all queries have clicks on fresh documents. Therefore, we choose the classification threshold so that approximately 3/4 of the queries in the test sample receive positive predictions. For example, with a threshold of 0.04 this is the case for 178,925 queries out of 200,000.
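One possible way to pick such a threshold programmatically (not necessarily how the value 0.04 was chosen here) is to take the value below which a quarter of the predictions fall. The sketch assumes, like the awk command below, that output.tsv has a header line and the prediction in its second column:

preds = []
fin = open('output.tsv')
fin.readline()  # skip the header line written by CatBoost
for line in fin:
    preds.append(float(line.strip().split('\t')[1]))

# a value such that roughly 3/4 of the predictions lie above it
threshold = sorted(preds)[len(preds) // 4]
print threshold, sum(1 for p in preds if p > threshold)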


Therefore, we form the following solution file:


 awk -F "\t" -vOFS="\t" 'NR > 1 {if ($2 > 0.04) print 1; else print 0}' output.tsv > solution.tsv 

Here we had to skip the first line, because CatBoost writes its column names into it.


The resulting solution.tsv file is submitted to the checking system and receives a well-deserved OK verdict.


Problem D. Feature selection


Task
The script for the solution


In this task, participants were asked to choose no more than 50 features out of the 500 available in the sample, so that the CatBoost algorithm trained on them would then show the best possible quality on the test set.


Solution

As is well known, there is a wide variety of feature selection methods.


For example, you can use any ready-made method. Say, I tried running feature selection in Weka, and after a little parameter tuning I managed to get 1.8 points on this task.


In addition, I have my own feature selection script. It implements a greedy strategy: at each step exactly one factor is added to the set, namely the one whose addition improves the cross-validation score of the algorithm the most. However, in the context of the contest, running such a script would either take too long or require a large computing cluster.


However, when using regularized decision forests (including CatBoost), there is also one extremely fast feature selection method: select the factors that are used most often in the model. CatBoost has a built-in mode for evaluating the influence of factors on the model's predictions, and we will use it.


First you need to train the model:


 catboost fit --cd train.cd -f train.txt 

Then run the feature evaluation:


 catboost fstr --input-path train.txt --cd train.cd 

The feature strengths will then be written to the feature_strength.tsv file. The first column contains the importance of a feature, the second its name. The file is already sorted by non-increasing feature importance.


head feature_strength.tsv

9.897213004	f193
9.669603844	f129
7.500907599	f292
5.903810044	f48
5.268100711	f337
2.508377813	f283
2.024904488	f111
1.933500313	f208
1.878848285	f345
1.652808387	f110

It only remains to take the first few dozen features and form the answer. It makes sense to take as few features as possible: as is well known, model complexity has a negative effect on generalization ability.


For example, choosing the top 50 features gave 3.6 points on the public test set, while the top 40, top 30, and top 20 all gave exactly 4 points. Therefore, I chose the top 20 features as my solution, and it received 4 points on the closed test set.
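For example, the names of the top 20 features can be pulled out of feature_strength.tsv with a short script (a sketch; the exact format of the answer file is whatever the problem statement requires):

# Take the names of the 20 most important features from feature_strength.tsv.
names = []
for line in open('feature_strength.tsv'):
    strength, name = line.strip().split()
    names.append(name)

print ' '.join(names[:20])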


Finally, it is worth noting that the feature selection method considered here is not optimal in all situations. Often, "harmful" features have a significant impact on the model's predictions while degrading the generalization ability of the algorithm. Therefore, whenever the problem of feature selection arises, it is worth trying several approaches known to the researcher and choosing the best one.


In addition, one should keep in mind other ways of reducing the dimensionality of the feature space: for example, there is a wide variety of feature extraction methods.

Source: https://habr.com/ru/post/414997/

