In this implementation I will use Welford's method, since I like it more than calculations based on the "standard" formulas. In addition, I will not use numpy or any other libraries, to demonstrate that knowledge of elementary Python constructs is sufficient to obtain a solution.
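For reference, here are the weighted Welford updates that the code below implements: when an element $x_n$ with weight $w_n$ arrives, the total weight $W$, the mean $\mu$, and the sum of squared deviations $S$ are updated as

```latex
\begin{aligned}
W_n   &= W_{n-1} + w_n, \\
\mu_n &= \mu_{n-1} + \frac{w_n \, (x_n - \mu_{n-1})}{W_n}, \\
S_n   &= S_{n-1} + w_n \, (x_n - \mu_{n-1})(x_n - \mu_n).
\end{aligned}
```

Removing an element is the same update applied with weight $-w_n$.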
So, we need to be able to incrementally compute the mean and the sum of squared deviations. To do this, we describe two classes:
```python
class MeanCalculator:
    def __init__(self):
        self.Count = 0.
        self.Mean = 0.

    def Add(self, value, weight=1.):
        self.Count += weight
        self.Mean += weight * (value - self.Mean) / self.Count

    def Remove(self, value, weight=1.):
        self.Add(value, -weight)


class SumSquaredErrorsCalculator:
    def __init__(self):
        self.MeanCalculator = MeanCalculator()
        self.SSE = 0.

    def Add(self, value, weight=1.):
        curDiff = value - self.MeanCalculator.Mean
        self.MeanCalculator.Add(value, weight)
        self.SSE += weight * curDiff * (value - self.MeanCalculator.Mean)

    def Remove(self, value, weight=1.):
        self.Add(value, -weight)
```
Now we need a class for storing and processing the training set. First, let's describe its fields and the method for reading the input:
```python
class Instances:
    def __init__(self):
        self.Items = []
        self.OverAllSSE = SumSquaredErrorsCalculator()

    def Read(self):
        fin = open('stump.in')
        # Skip the first line of the input file, then read "x y" pairs.
        for line in fin.readlines()[1:]:
            x, y = map(float, line.strip().split())
            self.Items.append([x, y])
            self.OverAllSSE.Add(y)
        self.Items.sort()
```
Simultaneously with reading the data, we immediately build the SumSquaredErrorsCalculator structure over all objects in the sample. After the entire sample is loaded, we sort it in non-decreasing order of the feature values.
Now for the most interesting part: the method that finds the optimal parameter values.
Let's start with initialization. In the initial state, all objects belong to the right subsample. The parameter a can then be arbitrary, the parameter b equals the mean of the answers over the entire sample, and the parameter c is chosen so that all objects in the sample lie to the right of it.
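To fix notation (the stump model itself is introduced earlier in the article; this is my restatement of it as read off the code below): the stump predicts $a$ for objects whose feature value is to the left of the threshold $c$ and $b$ for the rest, and the quality of a split is the total squared error

```latex
Q(a, b, c) = \sum_{i \,:\, x_i < c} (y_i - a)^2 \;+\; \sum_{i \,:\, x_i \ge c} (y_i - b)^2 .
```

For a fixed $c$, $Q$ is minimized by taking $a$ and $b$ to be the means of the answers in the two subsamples, so minimizing $Q$ reduces to minimizing the sum of the two sums of squared deviations, which is exactly the quantity the code tracks.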
In addition, we initialize the variables left and right: they will store the statistics for the left and right subsamples, respectively.
```python
    def BestSplit(self):
        left = SumSquaredErrorsCalculator()
        # Note: right aliases OverAllSSE and is modified in place.
        right = self.OverAllSSE

        bestA = 0
        bestB = right.MeanCalculator.Mean
        bestC = self.Items[0][0]
        bestQ = right.SSE
```
Now we move on to the incremental algorithm itself. We process the elements of the sample one by one, moving each element into the left subsample. Then we need to check that the corresponding split actually exists, that is, that the feature value differs from the feature value of the next object.
```python
        for i in range(len(self.Items) - 1):
            item = self.Items[i]
            nextItem = self.Items[i + 1]

            # Move the current object from the right subsample to the left one.
            left.Add(item[1])
            right.Remove(item[1])

            # A split between equal feature values does not exist.
            if item[0] == nextItem[0]:
                continue

            a = left.MeanCalculator.Mean
            b = right.MeanCalculator.Mean
            c = (item[0] + nextItem[0]) / 2

            q = left.SSE + right.SSE
            if q < bestQ:
                bestA = a
                bestB = b
                bestC = c
                bestQ = q

        return bestA, bestB, bestC
```
Now it only remains to write the code that produces the solution:
```python
instances = Instances()
instances.Read()

a, b, c = instances.BestSplit()
print(a, b, c, file=open('stump.out', 'w'))
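To sanity-check the whole pipeline without touching the file system, the classes above can be exercised on a tiny synthetic sample. The classes are repeated here so the snippet runs standalone, and the AddItem helper and the data are my own illustration, not part of the original solution:

```python
class MeanCalculator:
    def __init__(self):
        self.Count = 0.
        self.Mean = 0.

    def Add(self, value, weight=1.):
        self.Count += weight
        self.Mean += weight * (value - self.Mean) / self.Count

    def Remove(self, value, weight=1.):
        self.Add(value, -weight)


class SumSquaredErrorsCalculator:
    def __init__(self):
        self.MeanCalculator = MeanCalculator()
        self.SSE = 0.

    def Add(self, value, weight=1.):
        curDiff = value - self.MeanCalculator.Mean
        self.MeanCalculator.Add(value, weight)
        self.SSE += weight * curDiff * (value - self.MeanCalculator.Mean)

    def Remove(self, value, weight=1.):
        self.Add(value, -weight)


class Instances:
    def __init__(self):
        self.Items = []
        self.OverAllSSE = SumSquaredErrorsCalculator()

    def AddItem(self, x, y):
        # Hypothetical helper for in-memory tests; Read() does the same
        # from 'stump.in' in the original solution.
        self.Items.append([x, y])
        self.OverAllSSE.Add(y)

    def BestSplit(self):
        left = SumSquaredErrorsCalculator()
        right = self.OverAllSSE

        bestA = 0
        bestB = right.MeanCalculator.Mean
        bestC = self.Items[0][0]
        bestQ = right.SSE

        for i in range(len(self.Items) - 1):
            item, nextItem = self.Items[i], self.Items[i + 1]
            left.Add(item[1])
            right.Remove(item[1])
            if item[0] == nextItem[0]:
                continue
            q = left.SSE + right.SSE
            if q < bestQ:
                bestA = left.MeanCalculator.Mean
                bestB = right.MeanCalculator.Mean
                bestC = (item[0] + nextItem[0]) / 2
                bestQ = q

        return bestA, bestB, bestC


instances = Instances()
for x, y in [(0., 0.), (1., 0.), (2., 10.), (3., 10.)]:
    instances.AddItem(x, y)
instances.Items.sort()

# Overall statistics: mean 5.0, SSE = 4 * 5^2 = 100 (Welford vs. direct formula).
assert abs(instances.OverAllSSE.MeanCalculator.Mean - 5.0) < 1e-9
assert abs(instances.OverAllSSE.SSE - 100.0) < 1e-9

a, b, c = instances.BestSplit()
# The optimal stump separates the two groups: a ≈ 0, b ≈ 10, c = 1.5.
```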
Note that the presented Python solution is indeed accepted by the judging system, but it comes close to the time limit: the largest tests require about 800 milliseconds to run. Of course, much more impressive performance could be achieved with additional libraries.
Therefore, I also implemented the proposed algorithm in C++. In the worst case, that solution spent 60 milliseconds on one of the tests.