
Problem setting: general ideas on function estimation

In supervised learning, given a dataset $\mathcal D = \{(\mathbf{x}_i,y_i)\}$, our goal is to find a function $\hat F(\mathbf{x})$ that estimates the true mapping $F^*(\mathbf{x})$ from $\mathbf{x}$ to $y$. To achieve this, we usually minimize the expected value of a loss function $L(y,F(\mathbf{x}))$ over the joint distribution of all $(\mathbf{x},y)$ values:

\[F^* = \text{arg}\min_F E_{y,\mathbf{x}}\,L(y,F(\mathbf{x})) = \text{arg}\min_F E_\mathbf{x}\left[E_y\left[L(y,F(\mathbf{x}))\mid\mathbf{x}\right]\right]\]
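As a quick illustration (a standard fact, not part of the text above): for the squared-error loss $L(y,F) = (y-F)^2$, the inner minimization can be solved for each $\mathbf{x}$ separately and yields the conditional mean,

\[F^*(\mathbf{x}) = \text{arg}\min_{F} E_y\left[(y - F)^2 \mid \mathbf{x}\right] = E[y \mid \mathbf{x}],\]

while for the absolute loss $L(y,F) = |y - F|$ the minimizer is the conditional median.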

In our case, we restrict $F(\mathbf{x})$ to a parameterized class of functions $F(\mathbf{x};\mathbf{P})$, where $\mathbf{P}$ is a set of parameters, given by the “additive” expansions of the form:

\[F(\mathbf{x};\{\beta_m,\mathbf{a}_m\}_1^M) = \sum_{m=1}^M \beta_m h(\mathbf{x};\mathbf{a}_m)\]

where $h(\mathbf{x};\mathbf{a}_m)$ is a simple parameterized function of the input, each term having its own set of parameters $\mathbf{a}_m$. Expansions of this form are at the heart of many methods, such as neural networks and support vector machines.
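As a minimal sketch of what such an expansion looks like in code (my own illustrative example; the basis function `h` and its parameters are hypothetical, not taken from the text):

```python
import numpy as np

def h(X, a):
    # A simple parameterized basis function: a threshold indicator on one feature.
    # Here a = (feature index, threshold); returns 1.0 where X[:, feature] > threshold.
    feature, threshold = a
    return (X[:, feature] > threshold).astype(float)

def additive_expansion(X, betas, params):
    # F(x; {beta_m, a_m}_1^M) = sum_m beta_m * h(x; a_m)
    return sum(beta * h(X, a) for beta, a in zip(betas, params))

# Toy usage with made-up parameters.
X = np.random.rand(5, 2)
F = additive_expansion(X, betas=[0.5, 1.2], params=[(0, 0.3), (1, 0.7)])
```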

Numerical optimization

Let us now reformulate the previous problem. In the case of a parameterized model $F(\mathbf{x};\mathbf{P})$, we can define the function

\[\Phi(\mathbf{P}) = E_{y,\mathbf{x}}\, L(y,F(\mathbf{x};\mathbf{P}))\]

and, using this function, the problem above is reformulated as obtaining the parameters $\mathbf{P}^*$ that minimize $\Phi$, that is:

\[\mathbf{P}^* =\text{arg}\min_{\mathbf{P}} \Phi(\mathbf{P}) = \text{arg}\min_{\mathbf{P}} E_{y,\mathbf{x}}\, L(y,F(\mathbf{x};\mathbf{P}))\]

so that our approximation becomes $F^*(\mathbf{x}) = F(\mathbf{x};\mathbf{P}^*)$. In many cases, this optimization problem must be solved numerically. To do so, we usually express the solution for the parameters as a sum of increments: \(\mathbf{P}^* = \sum_{m=0}^M \mathbf{p}_m,\)

where we make an initial guess $\mathbf{p}_0$ and then we iterate to obtain $\{\mathbf{p}_m\}_1^M$ incrementally, using the preceding steps.

One of the most widely used methods for this optimization is steepest-descent, a variant of gradient descent. In gradient descent, we would update the parameters as

\[\mathbf{P}_m \gets \mathbf{P}_{m-1} - \eta\, \nabla_{\mathbf{P}} \Phi(\mathbf{P}_{m-1}).\]

However, in steepest-descent, we choose the $\eta$ that yields the maximal decrease along the negative gradient direction, searching at each iteration for the most appropriate step size. This is known as line search.

Let us see how this method defines the increments $\{\mathbf{p}_m\}_1^M$. Firstly, expressing $\mathbf{P}_{m-1} = \sum_{i=0}^{m-1} \mathbf{p}_i$ as we did before, we compute the gradient $\mathbf{g}_m$ as follows:

\[\mathbf{g}_m = \{g_{jm}\} = \left\{ \frac{\partial \Phi(\mathbf{P}_{m-1})}{\partial P_j}\right\}.\]

Now, as we have said, we perform a line search to obtain the best step size $\rho_m$ for the update:

\[\mathbf{p}_m = -\rho_m \mathbf{g}_m.\]

The line search we have mentioned before aims to solve the following optimization problem:

\[\rho_m = \text{arg}\min_{\rho} \Phi (\mathbf{P}_{m-1} - \rho \mathbf{g}_m).\]

Recall that:

  • $\Phi$ is the expectation of the loss function over the joint distribution of $(y,\mathbf{x})$.
  • Given a fixed $\rho$, the argument of $\Phi$, which is $\mathbf{P}_{m-1} - \rho \mathbf{g}_m$, is exactly the gradient descent update rule.

Taking this into account, we can say that $\rho_m$ gives the steepest step that gradient descent can take along the negative gradient direction at each iteration.
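To make the procedure concrete, here is a minimal numerical sketch of steepest-descent with a line search (my own illustration: a simple quadratic stands in for $\Phi$, and SciPy's `minimize_scalar` performs the one-dimensional search for $\rho_m$):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical stand-in for Phi(P): a convex quadratic, so the minimum is known.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])

def phi(P):
    return 0.5 * P @ A @ P - b @ P

def grad_phi(P):
    return A @ P - b

P = np.zeros(2)                                        # initial guess P_0
for m in range(50):
    g = grad_phi(P)                                    # gradient g_m
    rho = minimize_scalar(lambda r: phi(P - r * g)).x  # line search for rho_m
    P = P - rho * g                                    # increment p_m = -rho_m * g_m
```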

Problems and solutions

The theoretical basis for this problem has now been set. However, in real-life problems the situation changes. The joint distribution of $(y,\mathbf{x})$ is usually represented only by a finite sample (the dataset $\mathcal D$), so we cannot estimate $E_y[\cdot \mid \mathbf{x}]$ accurately at each $\mathbf{x}_i$; and even if we could interpolate $F^*(\mathbf{x})$ at each $\mathbf{x}_i$ of the dataset, this would not help us predict at new points (points not in the original dataset).

A first approach would be to discretize the problem, replacing the expectation by a sum over the dataset and solving:

\[\{\beta_m, \mathbf{a}_m\}_1^M = \text{arg}\min_{\{\beta_m', \mathbf{a}_m'\}_1^M } \sum_{i=1}^N L\left(y_i, \sum_{m=1}^M \beta_m' h(\mathbf{x}_i; \mathbf{a}_m')\right).\]

When this is infeasible, we can try a different option: a greedy approach. For each $m = 1,\dots,M$, we define the optimal parameters as an improvement over the previous ones. That is,

\[(\beta_m,\mathbf{a}_m) = \text{arg}\min_{\beta,\mathbf{a} } \sum_{i=1}^N L(y_i, F_{m-1}(\mathbf{x}_i) + \beta h(\mathbf{x}_i; \mathbf{a})),\]

and then the estimated function $F$ is updated as: \(F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \beta_m h(\mathbf{x};\mathbf{a}_m).\)

The function $h(\mathbf{x};\mathbf{a})$ is called a weak learner. The term $\beta_m h(\mathbf{x};\mathbf{a}_m)$ can be seen as the best greedy step towards the data-based estimate of $F^*(\mathbf{x})$, so it can be regarded as a steepest-descent step. In this case, the negative gradient

\[-g_m(\mathbf{x}_i) = - \left[ \frac{\partial L(y_i,F(\mathbf{x}_i))}{\partial F(\mathbf{x}_i)}\right]_{F(\mathbf{x}) = F_{m-1}(\mathbf{x})}\]

gives the steepest-descent direction $-\mathbf{g}_m =\{-g_m(\mathbf{x}_i)\}_1^N$ in the $N$-dimensional data space. This gradient, however, is only defined at the data points, so it cannot be generalized to new values. The workaround is to choose the member of the weak learner class that is most parallel to $-\mathbf{g}_m \in \mathbb R^N$, which we obtain by solving the following optimization problem:

\[\mathbf{a}_m = \text{arg}\min_{\mathbf{a},\beta}\sum_{i=1}^N \left[ - g_m(\mathbf{x}_i) - \beta h(\mathbf{x}_i;\mathbf{a})\right]^2,\]

which essentially finds the weak learner that minimizes the squared distance to the negative gradient vector. Using this approximation to the gradient, we perform the line search to find $\rho_m$ at each step $m$:

\[\rho_m = \text{arg}\min_{\rho}\sum_{i=1}^N L(y_i, F_{m-1}(\mathbf{x}_i) + \rho h(\mathbf{x}_i;\mathbf{a}_m))\]

and we can update the approximation to our final $F^*(\mathbf{x})$ as:

\[F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \rho_m h(\mathbf{x};\mathbf{a}_m).\]

This can be interpreted as follows: instead of obtaining the solution directly under a smoothness constraint, we apply the constraint to a model $h(\mathbf{x};\mathbf{a})$ that has been fitted to the “pseudo-responses” $\{\tilde y_i = -g_m(\mathbf{x}_i)\}_1^N$ by least-squares minimization. The last consideration is then to choose models $h(\mathbf{x};\mathbf{a})$ for which this least-squares fit can be solved efficiently.
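Putting the pieces together, here is a minimal sketch of the generic procedure (my own illustrative code, not the implementation mentioned at the end of this section). It assumes squared-error loss, so the pseudo-responses are just the residuals; the weak learner is a least-squares decision stump, and the line search uses SciPy's `minimize_scalar`:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def loss(y, F):
    # Squared-error loss L(y, F) = 1/2 (y - F)^2, chosen here for simplicity.
    return 0.5 * (y - F) ** 2

def neg_gradient(y, F):
    # Pseudo-responses: -dL/dF = y - F for the squared-error loss.
    return y - F

def fit_stump(X, target):
    # Least-squares fit of a one-split decision stump to the pseudo-responses.
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or (~left).all():
                continue
            pred = np.where(left, target[left].mean(), target[~left].mean())
            err = ((target - pred) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, j, t, target[left].mean(), target[~left].mean())
    _, j, t, vl, vr = best
    return lambda Z: np.where(Z[:, j] <= t, vl, vr)

def gradient_boost(X, y, M=50):
    F0 = y.mean()                                       # constant initial guess F_0
    F = np.full(len(y), F0)
    learners, steps = [], []
    for m in range(M):
        residuals = neg_gradient(y, F)                  # pseudo-responses -g_m
        h = fit_stump(X, residuals)                     # weak learner h(x; a_m)
        rho = minimize_scalar(
            lambda r: loss(y, F + r * h(X)).sum()).x    # line search for rho_m
        F = F + rho * h(X)                              # F_m = F_{m-1} + rho_m * h
        learners.append(h)
        steps.append(rho)
    return F0, learners, steps

def predict(X_new, F0, learners, steps):
    # F_M(x) = F_0 + sum_m rho_m * h(x; a_m)
    return F0 + sum(rho * h(X_new) for h, rho in zip(learners, steps))
```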

Regression trees

Trees are known to perform well in classification and regression problems. In the gradient boosting case, we can use as our function $h$ a $J$-terminal-node regression tree. In this case, $h$ takes the form: \(h(\mathbf{x}; \{b_j,R_j\}_{1}^J) = \sum_{j=1}^J b_j \mathbb{1}(\mathbf{x} \in R_j),\) where $\{R_j\}_1^J$ are the disjoint regions induced by the tree and $\mathbb 1$ is the indicator function. The regions correspond to the terminal nodes of the tree and have to be learnt along with the coefficients $b_j$, resulting in the following gradient boosting step:

\[F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \rho_m \sum_{j=1}^J b_{jm} \mathbb{1} (\mathbf{x} \in R_{jm}).\]
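For the tree-based case, here is a sketch of a single boosting step using scikit-learn's `DecisionTreeRegressor` as the $J$-terminal-node weak learner (again assuming squared-error loss, so the pseudo-responses are the residuals; a fixed shrinkage factor stands in for the line search, a common simplification):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def tree_boosting_step(X, y, F_prev, shrinkage=0.1, n_leaves=8):
    # One tree-based gradient boosting step under squared-error loss.
    residuals = y - F_prev                                 # pseudo-responses -g_m
    tree = DecisionTreeRegressor(max_leaf_nodes=n_leaves)  # J-terminal-node regression tree
    tree.fit(X, residuals)                                 # leaf values play the role of b_{jm}
    return F_prev + shrinkage * tree.predict(X), tree
```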

This case has been highlighted because it is the version of gradient boosting that I have implemented. The implementation can be found on GitHub.