This function trains an interpretable boosted linear model (IBLM).

The function combines a Generalized Linear Model (GLM) with an XGBoost booster model.

The booster model is trained on the residuals of the GLM with respect to `response_var`, such that:

  • when the link function is log, IBLM predictions = GLM predictions * Booster predictions

  • when the link function is identity, IBLM predictions = GLM predictions + Booster predictions
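The two combination rules can be illustrated with a minimal sketch in base R. The numbers below are made up for illustration and are not output from any fitted model; `glm_pred`, `booster_mult` and `booster_add` are hypothetical names, not objects created by the package.

```r
# Hypothetical GLM predictions on the response scale for three observations.
glm_pred <- c(0.10, 0.25, 0.40)

# Log link: the booster acts as a multiplicative correction (1 means "no change").
booster_mult <- c(1.20, 0.90, 1.00)
iblm_pred_log <- glm_pred * booster_mult

# Identity link: the booster acts as an additive correction (0 means "no change").
booster_add <- c(0.02, -0.03, 0.00)
iblm_pred_identity <- glm_pred + booster_add
```

In both cases a neutral booster prediction (1 for the log link, 0 for the identity link) leaves the GLM prediction unchanged.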

Usage

train_iblm_xgb(
  df_list,
  response_var,
  weight_var = NULL,
  offset_var = NULL,
  family = "poisson",
  params = list(),
  nrounds = 1000,
  objective = NULL,
  custom_metric = NULL,
  verbose = 0,
  print_every_n = 1L,
  early_stopping_rounds = 25,
  maximize = NULL,
  save_period = NULL,
  save_name = "xgboost.model",
  xgb_model = NULL,
  callbacks = list(),
  ...,
  strip_glm = TRUE
)

Arguments

df_list

A named list containing training and validation datasets. Must have elements named "train" and "validate", each containing a data frame with the same structure. A list of this form is returned by [split_into_train_validate_test()].

response_var

Character string specifying the name of the response variable column in the datasets. The string MUST appear in both `df_list$train` and `df_list$validate`.

weight_var

Character string specifying the name of a variable to weight by. Value of NULL (default) for no weighting. Any string MUST appear in both `df_list$train` and `df_list$validate`.

offset_var

Character string specifying the name of a variable to use as offset. Value of NULL (default) for no offset. Any string MUST appear in both `df_list$train` and `df_list$validate`.

Any transformations required (e.g. log) must be performed BEFORE `df_list` is passed to the function.

family

Character string specifying the distributional family for the model. Currently only "poisson", "quasipoisson", "gamma", "tweedie" and "gaussian" are fully supported. See Details for how this impacts fitting.

params

Named list of additional parameters to pass to xgb.train. Note that train_iblm_xgb will select "objective" and "base_score" for you depending on `family` (see Details). However, you may override these (do so with caution).

nrounds, objective, custom_metric, verbose, print_every_n, early_stopping_rounds, maximize, save_period, save_name, xgb_model, callbacks, ...

These are passed directly to xgb.train.

strip_glm

TRUE/FALSE, whether to strip superfluous data from the `glm_model` object saved within the `iblm` object that is output. This only serves to reduce memory usage.

Value

An object of class "iblm" containing:

glm_model

The GLM model object, fitted on the `df_list$train` data that was provided

booster_model

The booster model object, trained on the residuals left over from the glm_model

data

A list containing the data that was used to train and validate this iblm model

relationship

String that explains how to combine the `glm_model` and `booster_model`. Currently either "Additive" or "Multiplicative".

response_var

A string describing the response variable used for this iblm model

predictor_vars

A list describing the predictor variables used for this iblm model

cat_levels

A list describing the categorical levels for the predictor vars

coeff_names

A list describing the coefficient names

Details

The `family` argument will be fed into the GLM fitting. Default `params` values for the XGBoost fitting are also selected based on family:

  • For "poisson" family, the "objective" is set to "count:poisson"

  • For "quasipoisson" family, the "objective" is set to "count:poisson"

  • For "gamma" family, the "objective" is set to "reg:gamma"

  • For "tweedie" family, the "objective" is set to "reg:tweedie". Also, "tweedie_variance_power = 1.5".

  • For "gaussian" family, the "objective" is set to "reg:squarederror"

Note: Any xgboost configuration listed above will be overwritten by any explicit arguments passed to `train_iblm_xgb()`.
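The family-to-objective defaults, and the way explicit values take precedence over them, can be sketched as a simple lookup in base R. This is an illustration under stated assumptions rather than the package's internal code: `default_xgb_params()` is a hypothetical helper, and `base_score` is left out because its default is not specified here.

```r
# Hypothetical helper: maps a family name to the documented default objective.
default_xgb_params <- function(family) {
  switch(
    family,
    poisson      = list(objective = "count:poisson"),
    quasipoisson = list(objective = "count:poisson"),
    gamma        = list(objective = "reg:gamma"),
    tweedie      = list(objective = "reg:tweedie", tweedie_variance_power = 1.5),
    gaussian     = list(objective = "reg:squarederror"),
    stop("Unsupported family: ", family)
  )
}

# Explicit user values replace the defaults; other user entries are appended.
user_params <- list(tweedie_variance_power = 1.2, max_depth = 3)
final_params <- modifyList(default_xgb_params("tweedie"), user_params)
```

Here `final_params` keeps the "reg:tweedie" objective but uses the user's `tweedie_variance_power`, which mirrors the precedence rule described in the note.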

Examples

df_list <- freMTPLmini |>
  dplyr::mutate(LogExposure = log(Exposure), .keep = "unused") |>
  split_into_train_validate_test(seed = 9000)

iblm_model <- train_iblm_xgb(
  df_list,
  response_var = "ClaimNb",
  offset_var = "LogExposure",
  family = "poisson"
)
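For readers who build `df_list` by hand instead of using split_into_train_validate_test(), the sketch below shows the documented shape and a quick pre-flight check in base R. The toy values and the `VehAge` column are invented for illustration, and `required_cols` and `cols_ok` are hypothetical names; the check itself is not part of the package.

```r
# Toy data; the values are made up and do not come from freMTPLmini.
toy <- data.frame(
  ClaimNb  = c(0, 1, 0, 2, 0, 1),
  Exposure = c(1.0, 0.5, 0.8, 1.0, 0.3, 0.9),
  VehAge   = c(2, 5, 1, 10, 3, 7)
)

# The offset must already be transformed before the list is built.
toy$LogExposure <- log(toy$Exposure)
toy$Exposure <- NULL

# Named list with the two required elements, sharing the same columns.
df_list <- list(train = toy[1:4, ], validate = toy[5:6, ])

# Check that the response and offset columns appear in both data frames.
required_cols <- c("ClaimNb", "LogExposure")
cols_ok <- vapply(
  df_list[c("train", "validate")],
  function(d) all(required_cols %in% names(d)),
  logical(1)
)
```

If any entry of `cols_ok` is FALSE, the corresponding data frame is missing a column named in `response_var` or `offset_var`.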