This notebook summarizes key points from Max Kuhn and Julia Silge's Tidy Modeling with R. The book covers basic usage of tidymodels along with some dimension reduction techniques.
set.seed(52)

# To put 60% into training, 20% in validation, and 20% in testing:
ames_val_split <- rsample::initial_validation_split(ames, prop = c(0.6, 0.2))
ames_val_split
Independent experimental unit: the entity that is measured independently (in database terms, a primary identifier versus an alternate key), for example, a single patient.
Multi-level data: multiple rows per experimental unit, e.g., repeated measurements of the same patient over time.
Data splitting should occur at the level of the independent experimental unit!
Simple resampling across rows would put some data from an experimental unit in the training set and the rest in the test set, leaking information between the two.
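A minimal sketch of unit-level splitting with rsample's group_initial_split(), which keeps all rows of a group on one side of the split. The patient_id toy data is invented for illustration:

```r
library(rsample)

set.seed(123)
# Toy data (made up for illustration): 3 repeated measurements per patient
toy <- data.frame(
  patient_id = rep(1:10, each = 3),
  value      = rnorm(30)
)

# Split at the patient level so all rows for a patient land on one side
split <- group_initial_split(toy, group = patient_id, prop = 0.8)
train <- training(split)
test  <- testing(split)

intersect(unique(train$patient_id), unique(test$patient_id))
# → integer(0): no patient appears in both sets
```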
Practical Implication
The book acknowledges the practice of splitting into training and test sets first to validate the model, but then following up by refitting on all available data points for a better final estimate.
Fitting Model with Parsnip
linear_reg
rand_forest
Linear Regression Family
lm
glmnet: fits generalized linear models via penalized maximum likelihood.
stan
# switch computational backend for different models
linear_reg() |>
  set_engine("lm") |>
  translate()
Linear Regression Model Specification (regression)
Computational engine: lm
Model fit template:
stats::lm(formula = missing_arg(), data = missing_arg(), weights = missing_arg())
# regularized regression is the glmnet model
linear_reg(penalty = 1) |>
  set_engine("glmnet") |>
  translate()
Linear Regression Model Specification (regression)
Main Arguments:
penalty = 1
Computational engine: glmnet
Model fit template:
glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(),
family = "gaussian")
# To estimate with regularization, a Bayesian model can be fit
# using the rstanarm package:
linear_reg() |>
  set_engine("stan") |>
  translate()
Linear Regression Model Specification (regression)
Computational engine: stan
Model fit template:
rstanarm::stan_glm(formula = missing_arg(), data = missing_arg(),
weights = missing_arg(), family = stats::gaussian, refresh = 0)
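These notes print a ranger translation without the spec that produced it; a spec like the following, reconstructed from the printed arguments (so treat the exact call as an assumption), yields that output:

```r
library(parsnip)

# Reconstructed spec matching the printed arguments
# (trees = 1000, min_n = 5, engine "ranger", regression mode)
rand_forest(trees = 1000, min_n = 5) |>
  set_engine("ranger") |>
  set_mode("regression") |>
  translate()
```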
Random Forest Model Specification (regression)
Main Arguments:
trees = 1000
min_n = 5
Computational engine: ranger
Model fit template:
ranger::ranger(x = missing_arg(), y = missing_arg(), weights = missing_arg(),
num.trees = 1000, min.node.size = min_rows(~5, x), num.threads = 1,
verbose = FALSE, seed = sample.int(10^5, 1))
Capture Model Results
Raw, engine-level access (useful when checking the original engine's documentation)
lm_form_fit <- lm_model %>%
  # Recall that Sale_Price has been pre-logged
  fit(Sale_Price ~ Longitude + Latitude, data = ames_train)

lm_form_fit %>%
  extract_fit_engine() %>%
  vcov()
(Intercept) Longitude Latitude
(Intercept) 273.852441 2.052444651 -1.942540743
Longitude 2.052445 0.021122353 -0.001771692
Latitude -1.942541 -0.001771692 0.042265807
model_res <- lm_form_fit %>%
  extract_fit_engine() %>%
  summary()

# The model coefficient table is accessible via the `coef` method.
param_est <- coef(model_res)
class(param_est)
[1] "matrix" "array"
param_est
Estimate Std. Error t value Pr(>|t|)
(Intercept) -313.622655 16.5484876 -18.95174 5.089063e-73
Longitude -2.073783 0.1453353 -14.26896 8.697331e-44
Latitude 2.965370 0.2055865 14.42395 1.177304e-44
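A tidier route shown in the book is broom::tidy(), which returns the same coefficient table as a tibble with standardized column names (this sketch assumes the lm_form_fit object from above):

```r
library(broom)

# Same coefficients as the summary() table, but as a tibble with
# columns term, estimate, std.error, statistic, p.value
tidy(lm_form_fit)
```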
## set up parsnip linear model
lm_model <- linear_reg() %>%
  set_engine("lm")

## add this model to a workflow (pipeline)
lm_wflow <- workflow() %>%
  add_model(lm_model)

lm_wflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: None
Model: linear_reg()
── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Computational engine: lm
Each model must be checked individually to see what kind of pre-processing it requires.
Most packages for tree-based models use the formula interface but do not encode the categorical predictors as dummy variables.
Packages can use special inline functions that tell the model function how to treat the predictor in the analysis. For example, in survival analysis models, a formula term such as strata(site) would indicate that the column site is a stratification variable. This means it should not be treated as a regular predictor and does not have a corresponding location parameter estimate in the model.
A few R packages have extended the formula in ways that base R functions cannot parse or execute. In multilevel models (e.g., mixed models or hierarchical Bayesian models), a model term such as (week | subject) indicates that the column week is a random effect that has different slope parameter estimates for each value of the subject column.
A workflow is a general purpose interface. When add_formula() is used, how should the workflow preprocess the data? Since the pre-processing is model dependent, workflows attempts to emulate what the underlying model would do whenever possible. If it is not possible, the formula processing should not do anything to the columns used in the formula. Let’s look at this in more detail.
Special Formula/In-line Function
Because standard R methods cannot properly process this kind of formula, using it outside lme4 (e.g., in model.matrix()) results in an error.
library(lme4)
library(nlme)

data("Orthodont")
lmer(distance ~ Sex + (age | Subject), data = Orthodont)
Linear mixed model fit by REML ['lmerMod']
Formula: distance ~ Sex + (age | Subject)
Data: Orthodont
REML criterion at convergence: 471.1635
Random effects:
Groups Name Std.Dev. Corr
Subject (Intercept) 7.3915
age 0.6943 -0.97
Residual 1.3101
Number of obs: 108, groups: Subject, 27
Fixed Effects:
(Intercept) SexFemale
24.517 -2.146
model.matrix(distance ~ Sex + (age | Subject), data = Orthodont)
Warning in Ops.ordered(age, Subject): '|' is not meaningful for ordered factors
However, using add_variables() together with add_model() solves this problem:
library(multilevelmod)

multilevel_spec <- linear_reg() %>%
  set_engine("lmer")

multilevel_workflow <- workflow() %>%
  # Pass the data along as-is:
  add_variables(outcome = distance, predictors = c(Sex, age, Subject)) %>%
  add_model(multilevel_spec,
            # This formula is given to the model
            formula = distance ~ Sex + (age | Subject))

multilevel_fit <- fit(multilevel_workflow, data = Orthodont)
multilevel_fit
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Variables
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
Outcomes: distance
Predictors: c(Sex, age, Subject)
── Model ───────────────────────────────────────────────────────────────────────
Linear mixed model fit by REML ['lmerMod']
Formula: distance ~ Sex + (age | Subject)
Data: data
REML criterion at convergence: 471.1635
Random effects:
Groups Name Std.Dev. Corr
Subject (Intercept) 7.3915
age 0.6943 -0.97
Residual 1.3101
Number of obs: 108, groups: Subject, 27
Fixed Effects:
(Intercept) SexFemale
24.517 -2.146
…the modeling process encompasses more than just estimating the parameters of an algorithm that connects predictors to an outcome. This process also includes pre-processing steps and operations taken after a model is fit. We introduced a concept called a model workflow that can capture the important components of the modeling process. Multiple workflows can also be created inside of a workflow set. The last_fit() function is convenient for fitting a final model to the training set and evaluating with the test set.
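A sketch of the last_fit() pattern mentioned above, assuming an rsample split object (here called ames_split, which these notes do not define):

```r
library(tune)

# Fit on the training portion and evaluate on the test portion in one call
final_lm_res <- last_fit(lm_wflow, ames_split)

collect_metrics(final_lm_res)      # test-set metrics (e.g., rmse, rsq)
collect_predictions(final_lm_res)  # test-set predictions
```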
For the Ames data, the related code that we’ll see used again is:
## create a recipe object
simple_ames <- recipe(
  Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type,
  data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>%
  step_dummy(all_nominal_predictors())

## add the recipe to the workflow
lm_wflow %>%
  add_recipe(simple_ames)
Comparing a Recipe with a Standard Linear Model Formula
When this function is executed, the data are converted from a data frame to a numeric design matrix (also called a model matrix) and then the least squares method is used to estimate parameters.
#> ── Recipe ───────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> outcome:   1
#> predictor: 4
#>
#> ── Operations
#> • Log transformation on: Gr_Liv_Area
#> • Dummy variables from: all_nominal_predictors()
A recipe is more verbose than a formula but more flexible. Why prefer a recipe over a formula?
These computations can be recycled across models since they are not tightly coupled to the modeling function.
A recipe enables a broader set of data processing choices than formulas can offer.
The syntax can be very compact. For example, all_nominal_predictors() can be used to capture many variables for specific types of processing while a formula would require each to be explicitly listed.
All data processing can be captured in a single R object instead of in scripts that are repeated, or even spread across different files.
Note on removing an existing pre-processor before adding a recipe
lm_wflow %>%
  add_recipe(simple_ames)
Error in `add_recipe()`:
! A recipe cannot be added when a formula already exists.
You have to remove the existing preprocessor before adding a recipe.
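A sketch of swapping the preprocessor, assuming the workflow currently holds a formula; workflows provides remove_formula() (and remove_variables()) for this:

```r
# Drop the existing formula preprocessor, then attach the recipe
lm_wflow %>%
  remove_formula() %>%
  add_recipe(simple_ames)
```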
d <- ames_train |>
  count(Neighborhood) |>
  mutate(frequency = n / sum(n))

highest_n_at_0.01 <- d |>
  filter(frequency <= 0.01) |>
  filter(n == max(n)) |>
  pull(n)

d |>
  ggplot(aes(y = Neighborhood, x = n)) +
  geom_col() +
  gghighlight::gghighlight(n <= highest_n_at_0.01) +
  ggtitle("These low-frequency categories can be problematic")
Nominal values: consider collapsing low-frequency categories into an "other" level; for this step, use step_other().
Consider interaction terms: variables can interact with one another.
Interactions are defined in terms of their effect on the outcome and can be combinations of different types of data (e.g., numeric, categorical, etc). Chapter 7 of M. Kuhn and Johnson (2020) discusses interactions and how to detect them in greater detail.
… two or more predictors are said to interact if their combined effect is different (less or greater) than what we would expect if we were to add the impact of each of their effects when considered alone.
Think of an interaction as grouping by a categorical variable and re-estimating the regression within each group:
ggplot(ames_train, aes(x = Gr_Liv_Area, y = 10^Sale_Price)) +
  geom_point(alpha = .2) +
  facet_wrap(~ Bldg_Type) +
  geom_smooth(method = lm, formula = y ~ x, se = FALSE, color = "lightblue") +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Gross Living Area", y = "Sale Price (USD)")
simple_ames <- recipe(
  Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type,
  data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>%
  step_other(Neighborhood, threshold = 0.01) %>%
  step_dummy(all_nominal_predictors()) %>%
  # Gr_Liv_Area is on the log scale from a previous step
  step_interact(~ Gr_Liv_Area:starts_with("Bldg_Type_"))