This function trains an interpretable boosted linear model.
The function combines a Generalized Linear Model (GLM) with an XGBoost booster model.
The "booster" model is trained on: - actual responses / GLM predictions, when the link function is log - actual responses - GLM predictions, when the link function is identity
Usage
train_iblm_xgb(
df_list,
response_var,
family = "poisson",
params = list(),
nrounds = 1000,
obj = NULL,
feval = NULL,
verbose = 0,
print_every_n = 1L,
early_stopping_rounds = 25,
maximize = NULL,
save_period = NULL,
save_name = "xgboost.model",
xgb_model = NULL,
callbacks = list(),
...,
strip_glm = TRUE
)

Arguments
- df_list
A named list containing training and validation datasets. Must have elements named "train" and "validate", each containing data frames with the same structure. This list is the natural output of [split_into_train_validate_test()]
- response_var
Character string specifying the name of the response variable column in the datasets. This column must be present in both `df_list$train` and `df_list$validate`.
- family
Character string specifying the distributional family for the model. Currently only "poisson", "gamma", "tweedie" and "gaussian" are fully supported. See Details for how this impacts fitting.
- params
Named list of additional parameters to pass to xgb.train. Note that train_iblm_xgb selects "objective" and "base_score" for you based on `family` (see the Details section). You may override these, but do so with caution.
- nrounds, obj, feval, verbose, print_every_n, early_stopping_rounds, maximize, save_period, save_name, xgb_model, callbacks, ...
These are passed directly to xgb.train.
- strip_glm
TRUE/FALSE, whether to strip superfluous data from the `glm_model` object saved within the returned `iblm` object. This only serves to reduce the memory footprint of the result.
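As background, a fitted glm object stores copies of its training data; stripping it usually looks something like the sketch below. This is illustrative only, and the exact components removed by train_iblm_xgb may differ:

fit <- glm(dist ~ speed, data = cars, family = gaussian())
fit$model <- NULL   # drop the stored model frame
fit$data  <- NULL   # drop the stored training data
fit$qr$qr <- NULL   # drop the bulky QR decomposition matrix
# coef(fit) and predict(fit, newdata = ...) still work after stripping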
Value
An object of class "iblm" containing:
- glm_model
The GLM model object, fitted on the `df_list$train` data that was provided
- booster_model
The booster model object, trained on the residuals left over from the glm_model
- data
A list containing the data that was used to train and validate this iblm model
- relationship
String that explains how to combine the `glm_model` and `booster_model` predictions. Currently either "Additive" or "Multiplicative"; see the sketch after this list
- response_var
A string describing the response variable used for this iblm model
- predictor_vars
A list describing the predictor variables used for this iblm model
- cat_levels
A list describing the categorical levels for the predictor vars
- coeff_names
A list describing the coefficient names
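A sketch of how `relationship` governs prediction (the `new_data` and `new_data_dmatrix` names here are hypothetical, and the package may wrap this logic in its own predict method):

glm_pred     <- predict(iblm_model$glm_model, newdata = new_data, type = "response")
booster_pred <- predict(iblm_model$booster_model, new_data_dmatrix)  # hypothetical xgb.DMatrix
final_pred <- if (iblm_model$relationship == "Multiplicative") {
  glm_pred * booster_pred
} else {  # "Additive"
  glm_pred + booster_pred
}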
Details
The `family` argument is passed to the GLM fitting. Default values for the XGBoost fitting are also selected based on `family`.
Note: any XGBoost configuration below will be overwritten by explicit arguments supplied via `params`.
For "poisson" family the link function is 'log' and XGBoost is configured with:
objective: "count:poisson"
base_score: 1
For "gamma" family the link function is 'log' and XGBoost is configured with:
objective: "reg:gamma"
base_score: 1
For "tweedie" family the link function is 'log' (with a var.power = 1.5) and XGBoost is configured with:
objective: "reg:tweedie"
base_score: 1
tweedie_variance_power = 1.5
For "gaussian" family the link function is 'identity' and XGBoost is configured with:
objective: "reg:squarederror"
base_score: 0
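For example, the defaults above can be combined with, or overridden by, explicit entries in `params` (the tuning values here are illustrative, not recommendations):

iblm_model <- train_iblm_xgb(
  df_list,
  response_var = "ClaimRate",
  family = "poisson",
  params = list(
    eta = 0.05,     # illustrative learning rate
    max_depth = 4   # illustrative tree depth
    # "objective" / "base_score" could also be set here, overriding the defaults above
  )
)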
Examples
df_list <- freMTPLmini |> split_into_train_validate_test(seed = 9000)
iblm_model <- train_iblm_xgb(
df_list,
response_var = "ClaimRate",
family = "poisson"
)
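The components documented under Value can then be inspected directly; for instance (output depends on the fitted model):

iblm_model$relationship            # presumably "Multiplicative", given the log link (see Details)
coef(iblm_model$glm_model)         # coefficients of the underlying GLM fit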
