Generalized Additive Models (GAM) in statsmodels

This notebook illustrates the basic usage of the new GAM support.

We use formulas for the linear part and define an additive smooth basis for the nonlinear terms. The following illustrates Gaussian and Poisson regressions where categorical variables are treated as linear terms and the effect of two explanatory variables is captured by penalized B-splines.

Imports and Loading data

The classes and functions can be imported either through the api module or directly from the specific modules, similar to other model classes.

The data is from the automobile dataset https://archive.ics.uci.edu/ml/datasets/automobile which has miles per gallon as dependent variable and various characteristics of car models as explanatory variables. We can load a dataframe with selected columns from the unit test module.
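A minimal import and data-loading block could look like the following sketch; importing df_autos from the unit test module follows the description above, but the exact name of that dataframe should be treated as an assumption.

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.gam.api import GLMGam, BSplines

    # dataframe with selected columns of the automobile data,
    # as used in the GAM unit tests (name assumed)
    from statsmodels.gam.tests.test_penalized import df_autos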

Spline Basis and penalization weight alpha

We need to provide an instance of a spline basis class to the model. Currently there are two verified spline basis classes, BSplines and CyclicCubicSplines.

The model class also takes the penalization weight alpha as an argument. Alpha is currently a model attribute; changing it on an existing instance might cause the results instance to produce inconsistent results if it uses the changed attribute.

Here we specify alpha values from the unit tests, which were obtained with the R mgcv package. At the end we illustrate the methods for selecting an optimal alpha.
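As a sketch, assuming the columns weight and hp in the loaded dataframe and illustrative choices for df and degree, the basis and penalization weights could be set up as follows. The alpha values shown are only placeholders indicating the order of magnitude; the unit tests use values obtained with mgcv.

    # penalized B-spline basis for the two continuous explanatory variables
    x_spline = df_autos[['weight', 'hp']]
    bs = BSplines(x_spline, df=[12, 10], degree=[3, 3])

    # one penalization weight per smooth term (placeholder values)
    alpha = np.array([2e7, 6500.])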

Specifying and Estimating the Model

The model class for generalized additive models is GLMGam. Like other models, it can be used with numpy arrays, pandas DataFrames or the formula interface. The formula interface currently only supports the linear part; specifying penalized splines inside the formulas is not yet supported. (patsy supports unpenalized splines in formulas.)
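A sketch of the Gaussian model, assuming the column names city_mpg, fuel and drive in the loaded dataframe:

    # linear part from a formula, smooth part from the BSplines instance
    gam_bs = GLMGam.from_formula('city_mpg ~ fuel + drive', data=df_autos,
                                 smoother=bs, alpha=alpha)
    res_bs = gam_bs.fit()
    print(res_bs.summary())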

Plot the smooth components

The results classes have a plot_partial method that plots the partial linear prediction of a smooth component. The partial residual, or component plus residual, can be added as scatter points with cpr=True.
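For example, with the results instance from the Gaussian fit above:

    # partial effect of each smooth term, with component-plus-residual points
    res_bs.plot_partial(0, cpr=True)   # weight
    res_bs.plot_partial(1, cpr=True)   # hp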

The plots show that both smooth explanatory variables, weight and hp, have a downward sloping effect, i.e. cars with larger weight and horsepower have higher fuel consumption and lower miles per gallon. However, the effect levels off at higher values, and a linear function would not have been appropriate.


Poisson

GLMGam is a subclass of GLM and supports the same families. However, currently only Gaussian and Poisson have been tested and verified.

Miles per gallon can take on only positive values, so we can use Poisson with a log link to impose that the prediction is non-negative. Except for adding the family, everything works the same as in the Gaussian case. One difference is that the nonlinear link implies that predict_partial and plot_partial are not in terms of the response variable; they are still for the linear prediction, i.e. the values before applying the inverse link, exp in this case.
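A sketch of the Poisson case; the same alpha is reused here purely for illustration, whereas in practice the penalty weights would typically be re-selected for the Poisson model.

    # same specification with a Poisson family (log link)
    gam_pois = GLMGam.from_formula('city_mpg ~ fuel + drive', data=df_autos,
                                   smoother=bs, alpha=alpha,
                                   family=sm.families.Poisson())
    res_pois = gam_pois.fit()
    res_pois.plot_partial(0, cpr=True)   # still on the scale of the linear prediction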

Selection of penalty weight alpha

The model class has two methods for selecting the penalization weight alpha that optimizes a goodness-of-fit criterion.

select_penweight uses a one-step approximation to the leave-one-out change in parameters, which requires only one fit for each alpha. It uses a scipy optimizer, by default basinhopping, to minimize either an information criterion, 'aic' or 'bic', or the generalized cross-validation criterion 'gcv'. The default is 'aic'.

select_penweight_kfold uses k-fold cross validation on a fixed grid to find the alpha with the lowest value of a cost function. The cost function can be specified by the user; the default is the mean squared prediction error. There are currently no premade cost functions for non-Gaussian models.
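A rough sketch of both methods; the assumption here is that the selected alpha is the first element of each returned tuple.

    # criterion-based selection (default 'aic'), via basinhopping
    alpha_aic = gam_bs.select_penweight()[0]

    # k-fold cross-validation over a default grid of alpha values
    alpha_cv, res_cv = gam_bs.select_penweight_kfold()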


Some Details

Implicit Constant

Most spline bases, including B-splines and cyclic cubic splines, incorporate an implicit constant. This causes perfect collinearity if a spline basis is combined with an intercept term or if several spline terms are included in a model. To avoid the collinearity in the design matrix, the spline classes have options to remove the implicit constant and consequently reduce the number of columns by one. The spline classes in statsmodels remove this implicit constant by default. A keyword option is available to keep the implicit constant in the spline basis.

In general the implicit constant can be removed by a linear transformation that imposes that the predicted mean is zero, or by demeaning the spline basis columns and dropping one column. The availability of different methods and their defaults is specific to the type of spline basis, and the choices and defaults might still change in the statsmodels implementation. The spline basis functions do not have any obvious form or shape after the centering transform. Furthermore, untransformed B-spline basis functions have support, i.e. nonzero values, on only a small number of knot intervals; this locality is lost after a centering transform.

CyclicCubicSplines reuses its definition from patsy and delegates part of the computation to patsy. The only available method is the centering transform obtained with constraints='center'. BSplines implements the same kind of centering transform, but also implements a demeaning method. The current default in statsmodels is to demean the basis columns and drop the first column.
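A small sketch of the effect on the design matrix, assuming the include_intercept keyword of BSplines and a basis attribute on the smoother instance that holds the stacked design columns:

    # default: implicit constant removed, one column dropped per smooth term
    bs_default = BSplines(x_spline, df=[12, 10], degree=[3, 3])

    # keep the implicit constant (would be collinear with an intercept term)
    bs_with_const = BSplines(x_spline, df=[12, 10], degree=[3, 3],
                             include_intercept=True)

    # compare the number of basis columns (attribute name assumed)
    print(bs_default.basis.shape, bs_with_const.basis.shape)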

Effect of constant removal on standard errors and confidence interval

The removal of the implicit constant imposes an additional constraint on the spline basis and therefore on its contribution to the prediction. This implies that the nonlinear function represented by the spline term is anchored to have zero prediction either at the mean or at the dropped column. As a consequence, the standard deviation of the partial prediction is zero at that point. Wood recommends adding the estimate of the intercept to avoid this behavior when looking at partial prediction plots. This is the default in plot_partial.
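As a rough sketch, assuming plot_partial takes an include_constant keyword that controls whether the intercept estimate is added, the difference can be seen by plotting the same term with and without it:

    # with the intercept added back (assumed default): no zero-width band
    res_bs.plot_partial(0, plot_se=True)
    # without the intercept: the confidence band collapses to zero width
    # at the anchor point of the constrained spline term
    res_bs.plot_partial(0, plot_se=True, include_constant=False)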

Inference and standard errors

GLMGam currently provides the same types of covariance matrices (cov_type) as GLM, except that here they are based on the penalized score (gradient) and Hessian. The current default is a 'nonrobust' covariance matrix, which assumes that the loglikelihood is correctly specified and that the penalization reflects some prior information; Wood calls it the Bayesian covariance. It essentially uses the inverse penalized Hessian or penalized information matrix instead of the unpenalized ones as in GLM.
The other cov_types are penalized analogues of the robust sandwich covariance estimators, which do not assume that the variance and correlation are correctly specified in models based on the linear exponential family. One covariance type that is commonly used in Ridge regression and is currently missing is based on a sandwich form that does not assume heteroscedasticity. This is the standard form of the sandwich covariance for M-estimators, similar to the ones used in RLM. (Statsmodels has a linear model L2 penalized estimator, TheilGLS, which is essentially a linear generalized Ridge estimator and defaults to this sandwich form that is not heteroscedasticity robust.)

Two warnings about the current implementation of standard errors and inference

First, standard errors are based on the standard theory that takes the model as given and assumes a fixed, finite dimensional parameter space. This means that the data dependent selection of the penalty weights, which corresponds to smoothing parameter selection in semi- or nonparametric estimation, is not taken into account. Standard inference in these models does not take into account whether the model or the included variables were selected in a data dependent way. The same is true for the selection of the smoothing basis, such as the number and location of knots.

Second, the default for the covariance type and the standard errors is likely to change in future versions of statsmodels. Specifying cov_type='nonrobust' will maintain the current default behavior. The reason for this is that it is not yet clear what the most appropriate covariance type will be, also taking implementation issues into account. This is even less clear for more general cases of penalized estimation with variable selection. Finding appropriate standard errors for that case is still an active research area where statsmodels has not caught up with recent developments.
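A sketch of pinning down the covariance type explicitly, assuming GLMGam.fit accepts the cov_type argument in the same way as GLM.fit:

    # keep the current default explicitly so a future change of default
    # does not silently alter the reported standard errors
    res_nr = gam_bs.fit(cov_type='nonrobust')

    # a robust sandwich covariance, analogous to GLM
    res_hc = gam_bs.fit(cov_type='HC0')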

Penalty parameters and their selection

As mentioned above, GLMGam provides two methods for the selection of the penalization weight alpha.

In the first case, the generalized cross-validation type of method, a scipy optimizer is used to minimize an approximate leave-one-out criterion, either 'gcv', 'aic' or 'bic'. For each trial value the model is estimated using its nonlinear estimator. This means that we have two nested optimization problems, and local optima or nonconvergence are possible. The current default is to use basinhopping, which is a global optimizer. This is slower than just using a local optimizer like bfgs, but should in many cases be more robust to local curvature problems.

In the second case, a standard k-fold cross validation is used to find the penalization weight with the smallest value of a cost function on a grid of values. This method defines a default grid that will not be appropriate in all cases.

A related issue is that there currently does not seem to be an a priori meaningful interpretation of, or guess for, the scale or magnitude of the penalization parameter. It mainly depends on the definition of the spline basis, both the number of knots and the scaling of the splines, and on the number of observations. In examples such as the ones above, alpha can be anywhere in the range from 1 to 1e8. This makes it difficult to find either starting parameters or the range of grid points for cross-validation.

For a given alpha we can look at the effective degrees of freedom as a rough measure of how strong the penalization is.
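For example, assuming the results instance exposes the per-column effective degrees of freedom in an edf array:

    # total effective degrees of freedom as a rough measure of penalization strength
    # (attribute name assumed)
    print(res_bs.edf.sum())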

The literature on penalty parameter selection shows that some selection criteria have a tendency to over- or undersmooth. A visual inspection of the partial prediction plots and a comparison of the results from different selection methods can help in diagnosing this. A manual adjustment or a "selection of the selection method" might be appropriate. As an aside, in some cases the prediction can be insensitive to a range of values, and the exact choice of the penalization weight is not important; for example, in a case like the one above, the partial prediction with an alpha in the range of 1e6 to 1e7 will be close to the one with an alpha of 1e8.

General Penalization Support

Statsmodels now has a set of classes for more generic or general support of penalized maximum likelihood models that use nonlinear optimization based on the model methods loglike, score and hessian. GLMGam is implemented taking advantage of this framework and adds direct support for penalized spline terms in the model.

This penalization support consists of a model mixin class and of penalty classes. The PenalizedMixin class adds the penalization term to the methods related to the loglikelihood, i.e. loglike, score and hessian. A new penalized likelihood model can be created by subclassing this mixin class and the underlying model class, for example class GLMGam(PenalizedMixin, GLM).

PenalizedMixin subclasses require an instance of a Penalty class that defines the penalization function and corresponding methods. The current Penalty classes support smoothed L1, smoothed SCAD and L2 penalization. The GAM support includes additive penalty classes that implement the additive combination of several penalized spline terms instead of using a single penalization structure for the entire parameter vector.
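A minimal sketch of this pattern; the module paths and the penal and pen_weight keyword names are assumptions about the current internal layout of statsmodels:

    from statsmodels.base._penalized import PenalizedMixin
    from statsmodels.base._penalties import SCADSmoothed
    from statsmodels.genmod.generalized_linear_model import GLM

    class PenalizedGLM(PenalizedMixin, GLM):
        """GLM whose loglike, score and hessian include the penalty term."""
        pass

    # keyword names penal and pen_weight are assumptions based on the current code
    y = df_autos['city_mpg']
    x = sm.add_constant(df_autos[['weight', 'hp']])
    mod = PenalizedGLM(y, x, penal=SCADSmoothed(0.1, c0=0.0001), pen_weight=10.)
    # use a gradient-based optimizer so the penalized score and hessian are used
    res = mod.fit(method='bfgs')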

GAM uses only quadratic penalization, where the penalty matrices, which can be interpreted as an incomplete prior covariance, are provided by the spline classes. The current spline classes penalize based on the integrated squared second derivative of the nonlinear function. The penalty matrix is in this case a numerical approximation to the matrix of integrals of products of the second derivatives of the spline basis functions.

P-splines of Eilers and Marx are not fully supported yet. B-splines can be defined with equally spaced knots, but there is currently no built-in support for the difference penalty matrices.

Status and Extensions

The current GAM implementation includes all the main parts for defining and analyzing models with penalized spline terms. However, additional kinds of spline basis functions and additional options for penalization, penalty parameter selection and inference are still missing. Several spline types that are available in patsy and mgcv, like CubicRegressionSplines and multivariate splines such as tensor splines, are not yet supported, but can be added within the same framework.

GLMGam inherits a large number of options and results methods from GLM. Some of those are not covered by the current unit tests, and it is not clear whether they work correctly. Some options, such as frequency and variance weights, are not yet supported.