Week 10 Completed File

Author

Biagio Palese

Introduction to Basic Modeling

Coding is fun!!!

In the second part of the course we will leverage the tidymodels framework and its supporting resources.

Class Objectives

  • Introduce the tidymodels framework, highlighting its comprehensive ecosystem designed to streamline the predictive modeling process in R.

  • Establish a foundation of modeling using tidymodels, emphasizing practical application through the exploration of the ames dataset.

Load packages

This is a critical task:

  • Every time you open a new R session you will need to load the packages.

  • Failing to do so will result in the most common errors among beginners (e.g., “could not find function ‘x’” or “object ‘y’ not found”).

  • So please always remember to load your packages by running library() for each package you will use in that specific session 🤝
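As a minimal sketch (assuming tidymodels and tidyverse are the packages used in this part of the course), the start of each session could look like this:

```r
# Load the packages used in this session (run this every time you open R)
library(tidyverse)   # data manipulation and visualization
library(tidymodels)  # recipes, parsnip, workflows, and friends
```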

The tidymodels Ecosystem in action

Tidymodels in Action

After the theoretical introduction, let’s see the tidymodels ecosystem in action, highlighting its components like recipes, parsnip, workflows, etc., and how it integrates with the tidyverse. But first, we need to get to know our data.

Exploring the ames Dataset

The ames housing dataset will be used in this portion of the course. Let’s explore the dataset to understand the variables and the modeling objective (predicting house prices).
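As a hedged sketch of how that exploration might start (assuming ames comes from the modeldata package, which is loaded with tidymodels), you could begin with something like:

```r
# The ames data ships with the modeldata package (loaded by tidymodels)
data(ames, package = "modeldata")

glimpse(ames)             # variable names and types
summary(ames$Sale_Price)  # distribution of the target variable

# Quick look at the relationship between living area and sale price
ggplot(ames, aes(x = Gr_Liv_Area, y = Sale_Price)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE)
```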

Activity 0: Get to know and Explore the ames dataset.

The objective here is to leverage what we learned so far to understand the data. Write all the code you think is needed to complete the task in the chunks below - 10 minutes

Warning

Keep in mind that Sale_Price will be our dependent variable in supervised modeling. Identify possible relevant independent variables by producing charts and descriptive stats. Also think about potential manipulations of those variables or new variables you might create.

The ames housing dataset will be used as a case study for this section of the course, demonstrating the practical application of predictive modeling techniques.

Part 2: Data Preprocessing with recipes

Recipes: Artwork by @allison_horst

Let’s dive into the concept of a recipe in the context of data preprocessing within the tidymodels framework, followed by creating a simple recipe for the ames dataset.

Understanding the Concept of a recipe in Data Preprocessing

In the tidymodels ecosystem, a recipe is a blueprint that outlines how to prepare your data for modeling. It’s a series of instructions or steps (full list available here) that transform raw data into a format more suitable for analysis, ensuring that the preprocessing is systematic and reproducible. Here’s why recipes are pivotal in the data science workflow:

  • Standardization and Normalization: Recipes can include steps to standardize, normalize, or transform numerical data, ensuring that variables are on comparable scales or have desired distributions (e.g., normal). Main functions:

    • step_log(): Applies a log transformation to numeric columns
    • step_center(): Centers numeric columns so their mean = 0
    • step_scale(): Scales numeric columns so their sd = 1
    • step_normalize(): Centers and scales numeric columns (mean = 0 & sd = 1)
  • Handling Missing Data: They allow you to specify methods for imputing missing values, ensuring the model uses a complete dataset for training. Main functions:

    • step_impute_median(): Imputes missing numeric values with the median
    • step_impute_mean(): Imputes missing numeric values with the average
    • step_impute_mode(): Imputes missing categorical values with the mode
  • Encoding Categorical Variables: Recipes describe how to convert categorical variables into a format that models can understand, typically through one-hot encoding or other encoding strategies. Main functions:

    • step_dummy(): Creates dummy variables for categorical predictors
    • step_other(): Collapses infrequent factors into an “other” category
  • Feature Engineering: Recipes can include steps for creating new features from existing ones, enhancing the model’s ability to learn from the data. Main functions:

    • step_mutate(): Creates a new feature from an existing one
    • step_interact(): Creates interaction terms between features
  • Data Filtering and Column Selection: They can also be used to select specific variables or filter rows based on certain criteria, tailoring the dataset to the modeling task. Main functions:

    • step_select(): Selects specific variables to retain in the dataset
    • step_slice(): Filters rows using their position
    • step_filter(): Filters rows using logical conditions
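Putting a few of these steps together, here is a minimal, hypothetical sketch of a recipe that imputes missing values, collapses rare categories, and creates dummy variables (my_data, outcome, num_var, and cat_var are placeholder names, not objects from this class):

```r
# Hypothetical sketch: my_data, outcome, num_var, and cat_var are placeholders
example_recipe <- recipe(outcome ~ ., data = my_data) %>%
  step_impute_median(num_var) %>%           # fill missing numeric values with the median
  step_other(cat_var, threshold = 0.05) %>% # collapse rare levels into an "other" category
  step_dummy(all_nominal_predictors())      # dummy-encode categorical predictors
```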
Important

By defining these steps in a recipe, you create a reproducible set of transformations that can be applied to any dataset of the same structure and across multiple models.

This reproducibility is crucial for maintaining the integrity of your modeling process, especially when moving from a development environment to production.

Example: Creating a recipe for the ames dataset

Let’s create a recipe for the ames housing dataset, focusing on preprocessing steps that are commonly required for this type of data.

Note

Our goal is to predict house sale prices, so we’ll include steps to log-transform the target variable (to address skewness) and normalize the numerical predictors.
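A minimal sketch of such a recipe, based on the steps described in this section (your class file may differ slightly in the exact arguments), could be:

```r
# Sketch of a preprocessing recipe for the ames data
ames_recipe <- recipe(Sale_Price ~ ., data = ames) %>%
  step_log(Sale_Price) %>%                     # log-transform the skewed target
  step_normalize(all_numeric_predictors()) %>% # center and scale numeric predictors
  step_dummy(all_nominal_predictors())         # convert factors to dummy variables

ames_recipe
```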

Normalize Distribution and More: Artwork by @allison_horst
Warning
  • I showed you just a few columns affected by the recipe preprocessing steps. However, due to the step_dummy(all_nominal_predictors()) step, all the factor/categorical variables are converted from factors into dummy variables.

  • Check the difference in the number of columns available in ames and preprocessed_ames to see the impact of that preprocessing step. You will see that this step results in many more columns, so always question whether you need all the existing columns in your analysis or whether you need to transform all of them into dummy variables.
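A hedged sketch of how preprocessed_ames might be created and compared with the original data (assuming ames_recipe is the recipe defined above):

```r
# prep() estimates the preprocessing parameters; bake() applies them to the data
preprocessed_ames <- ames_recipe %>%
  prep() %>%
  bake(new_data = NULL)    # new_data = NULL returns the preprocessed training data

ncol(ames)               # number of columns before preprocessing
ncol(preprocessed_ames)  # many more columns after step_dummy()
```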

Recap of the recipe steps

The ames_recipe outlines a series of preprocessing steps tailored to the ames dataset:

  • The step_log() function applies a log transformation to the Sale_Price, which is a common technique to normalize the distribution of the target variable in regression tasks.

  • The step_normalize() function centers and scales numerical predictors, ensuring they contribute equally to the model’s predictions.

  • The step_dummy() function converts categorical variables into a series of binary (0/1) columns, enabling models to incorporate this information.

By preparing the data with this recipe, we enhance the dataset’s suitability for predictive modeling, improving the potential accuracy and interpretability of the resulting models.

Important

It is critical that you verify that the preprocessing steps applied lead to the desired outcomes. For example, did the log transformation address the normality issue of your target variable?
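For instance, a quick hedged check (assuming preprocessed_ames was created as sketched above) might compare the distribution of the sale price before and after the log transformation:

```r
# Distribution of the raw target: typically right-skewed
ggplot(ames, aes(x = Sale_Price)) +
  geom_histogram(bins = 50)

# Distribution after the recipe's log transformation: closer to symmetric
ggplot(preprocessed_ames, aes(x = Sale_Price)) +
  geom_histogram(bins = 50)
```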

Recipes in tidymodels provide a flexible and powerful way to specify and execute a series of data preprocessing steps, ensuring that your data science workflow is both efficient and reproducible.

Activity 1: Preprocessing with recipes. Write the code to complete the below tasks - 10 minutes

[Write code just below each instruction; use the MS Teams R - Forum channel for help with the in-class activities/homework or if you have other questions]

Part 3: Building Models with parsnip in tidymodels

The parsnip package is a cornerstone of the tidymodels ecosystem designed to streamline and unify the process of model specification across various types of models and machine learning algorithms.

Unlike traditional approaches that require navigating the syntax and idiosyncrasies of different modeling functions and packages, parsnip abstracts this complexity into a consistent and intuitive interface. Here’s why parsnip stands out:

Unified Interface:

parsnip offers a single, cohesive syntax for specifying a wide range of models (full list available here), from simple linear regression to complex ensemble methods and deep learning. This uniformity simplifies learning and reduces the cognitive load when switching between models or trying new methodologies. Main models:

Regression Models:

Classification & Tree-Based Models

Important

Tree-based models can also be used for regression. We will see later how to change their purpose based on the analysis objective.

Deep Learning: Neural Network

When to Use: Best for capturing complex, nonlinear relationships in high-dimensional data. Neural networks excel in tasks like image recognition, natural language processing, and time series prediction.

Example: Recognizing handwritten digits where the model needs to learn from thousands of examples of handwritten digits.
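As a hedged illustration of the kinds of specifications summarized above (the object names, engines, and argument values here are just common choices for the sketch, not the only options):

```r
# Regression: ordinary least squares
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# Tree-based: random forest (works for regression or classification)
rf_spec <- rand_forest(mtry = 5, trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("regression")

# Neural network: single-hidden-layer perceptron
nn_spec <- mlp(hidden_units = 10) %>%
  set_engine("nnet") %>%
  set_mode("classification")
```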

What did we learn from the above?

  • Each of these model functions can be specified with parsnip’s unified interface, allowing you to easily switch between them or try new models with minimal syntax changes. The choice of model and the specific parameters/arguments (penalty, mtry, trees, etc.) should be guided by the nature of your data, the problem at hand, and possibly iterative model tuning processes like training-test sampling and cross-validation.

  • Engine Independence: Behind the scenes, many models can be fitted using different computational engines (e.g., lm, glmnet, ranger). parsnip allows you to specify the model once and then choose the computational engine separately, providing flexibility and making it easy to compare the performance of different implementations (see the sketch after this list).

  • Integration with tidymodels: parsnip models seamlessly integrate with other tidymodels packages, such as recipes for data preprocessing, workflows for bundling preprocessing and modeling steps, and tune for hyperparameter optimization.
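For example, a minimal sketch of engine independence: the same linear regression specification can target different underlying implementations (both engines require their packages to be installed, and the penalty/mixture values below are only illustrative):

```r
# Same model type, two different computational engines
ols_spec <- linear_reg() %>%
  set_engine("lm")

# glmnet fits a regularized version; penalty and mixture are its tuning arguments
glmnet_spec <- linear_reg(penalty = 0.01, mixture = 1) %>%
  set_engine("glmnet")
```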

Parsnip: Artwork by @allison_horst

Starting Simple: Specifying a Linear Regression Model Using parsnip

Linear regression is one of the most fundamental statistical and machine learning methods, used for predicting a continuous outcome based on one or more predictors. Let’s specify a linear regression model using parsnip for the ames housing dataset, aiming to predict the sale price of houses from their characteristics.
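A minimal sketch of that specification, stored here as linear_mod (the name used later in this file):

```r
# Specify (but do not yet fit) a linear regression model
linear_mod <- linear_reg() %>%
  set_engine("lm") %>%     # use base R's lm() to fit the model
  set_mode("regression")   # we are predicting a continuous outcome

linear_mod
```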

In this specification:

  • linear_reg() initiates the specification of a linear regression model. At this stage, we’re defining the type of model but not yet fitting it to data.

  • set_engine("lm") selects the computational engine to use for fitting the model. Here, we use “lm”, which stands for linear models, a base R function well-suited for fitting these types of models.

  • set_mode("regression") indicates that we are performing a regression task, predicting a continuous outcome.

Important
  • This process outlines the model we intend to use without binding it to specific data. The next steps typically involve integrating the prepared data (using a recipe), fitting the model to training data, and evaluating its performance.

parsnip’s design makes these steps straightforward and consistent across different types of models, enhancing the reproducibility and scalability of your modeling work.

Through the abstraction provided by parsnip, model specification becomes not only more straightforward but also more adaptable to different contexts and needs, facilitating a smoother workflow in predictive modeling projects.

Activity 2: Exploring modeling with parsnip. - 10 minutes:

[Write code just below each instruction; use the MS Teams R - Forum channel for help with the in-class activities/homework or if you have other questions]

Caution

As you probably noticed, the specification of the models in 2a and 2b and that of the models in 2c and 2d are identical.

Important

Why and how is that possible? The secret is that parsnip doesn’t require you to specify the variables or formula used by the model. That piece of information is supplied later, when we use workflows and actually fit the models. Because of this, it is extremely easy to reuse defined models across different datasets and analyses.
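A hedged sketch of that idea: the same specification can be reused, and the formula and data only appear at fit time (ames and the built-in mtcars dataset are used purely for illustration):

```r
# One model specification, with no variables or formula attached yet
lm_spec <- linear_reg() %>% set_engine("lm") %>% set_mode("regression")

# The formula and data are supplied only when fitting
fit_ames <- lm_spec %>% fit(Sale_Price ~ Gr_Liv_Area, data = ames)
fit_cars <- lm_spec %>% fit(mpg ~ wt, data = mtcars)
```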

Part 4: Integrating Preprocessing and Modeling with workflows

The workflows package is a powerful component of the tidymodels ecosystem designed to streamline the modeling process. It provides a cohesive framework that binds together preprocessing steps (recipes) and model specifications (parsnip) into a single, unified object. This integration enhances the modeling workflow by ensuring consistency, reproducibility, and efficiency.

Key Advantages of Using workflows:

  • Unified Process: By encapsulating both preprocessing (e.g., feature engineering, normalization) and modeling within a single object, workflows simplifies the execution of the entire modeling pipeline. This unified approach reduces the risk of mismatches or errors between the data preprocessing and modeling stages.

  • Reproducibility: workflows makes your analysis more reproducible by explicitly linking preprocessing steps to the model. This linkage ensures that anyone reviewing your work can see the complete path from raw data to model outputs.

  • Flexibility and Efficiency: It allows for easy experimentation with different combinations of preprocessing steps and models. Since preprocessing and model specification are encapsulated together, switching out components to test different hypotheses or improve performance becomes more streamlined.

Building and Fitting a Model Using workflows

To demonstrate the practical application of workflows, let’s consider the ames housing dataset, where our goal is to predict house sale prices based on various features. We’ll use the linear regression model specified with parsnip and the preprocessing recipe developed with recipes.

Warning

Make sure that ames_recipe, the preprocessing object created with recipes, and linear_mod, the linear regression model specified with parsnip, are both available in your environment (you can confirm this by running ls()).
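A minimal sketch of this workflow (assuming ames_recipe and linear_mod are defined as above):

```r
# Bundle the recipe and the model into a single workflow, then fit it
ames_workflow <- workflow() %>%
  add_recipe(ames_recipe) %>%  # preprocessing steps
  add_model(linear_mod)        # model specification

ames_fit <- fit(ames_workflow, data = ames)
ames_fit
```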

In this process:

  • We start by creating a new workflow object, to which we add our previously defined preprocessing recipe (ames_recipe) and linear regression model (linear_mod).

  • The add_recipe() and add_model() functions are used to incorporate the preprocessing steps and model specification into the workflow, respectively.

  • The fit() function is then used to apply this workflow to the ames dataset, executing the preprocessing steps on the data before fitting the specified model.

  • The result is a fitted model object that includes both the preprocessing transformations and the model’s learned parameters, ready for evaluation or prediction on new data.

This example underscores how workflows elegantly combines data preprocessing and model fitting into a cohesive process, streamlining the journey from raw data to actionable insights.

By leveraging workflows, data scientists can maintain a clear, organized, and efficient modeling pipeline. workflows epitomizes the philosophy of tidymodels in promoting clean, understandable, and reproducible modeling practices.

Through its structured approach to integrating preprocessing and modeling, workflows facilitates a seamless transition across different stages of the predictive modeling process.

Note

We will ignore the interpretation of the results for now because we don’t know how to evaluate the models yet, but you can see how simple it is to create and run a modeling workflow.

Activity 3: Streamline Modeling with Workflow. - 15 minutes

[Write code just below each instruction; use the MS Teams R - Forum channel for help with the in-class activities/homework or if you have other questions]

Important

This is enough for our first coding modeling class with tidymodels. More to come in the next weeks, but please make sure you understand everything we have covered so far! Please reach out for help and ask for clarification if needed. We are here to help ;-)

Congratulations on completing another R coding class!