Modeling Enzyme Activity Using Linear and Ridge Regression

By Rayane Bedri • 14 January 2026 • Case study • 2,023 words (9 pages)

Artificial Intelligence Mini Project, BEDRI Rayane

Abstract

Enzyme activity is a key parameter in biochemical processes and is influenced by environmental factors such as temperature, pH, and substrate concentration. This study aims to model and predict enzyme activity under varying experimental conditions using Linear Regression and Ridge Regression. A dataset comprising 1000 experimentally measured enzyme activity values, each corresponding to a unique combination of pH, temperature, and substrate concentration, was analyzed to investigate how these input variables affect catalytic performance. To account for potential non-linear relationships, polynomial and interaction terms were introduced, and the features were standardized to improve model stability.

The results indicate that the linear regression model explains approximately 48.6% of the variance in enzyme activity (R² = 0.486, RMSE = 11.64), while the ridge regression model performs marginally better (R² = 0.487, RMSE = 11.63), which is consistent with a small reduction in overfitting. Both models capture the general trends in the data, with ridge regression generalizing slightly better to unseen conditions.

This work demonstrates the potential of machine learning regression techniques for analyzing enzyme kinetics, offering a quantitative approach to understanding biochemical behaviors and optimizing experimental conditions.

Introduction

Enzymes are essential biological catalysts that accelerate the biochemical reactions on which living organisms depend. Their activity is strongly influenced by environmental factors such as temperature, pH, and substrate concentration (Akram et al., 2025), which affect their catalytic efficiency and structural stability (Yu et al., 2023). Understanding how these parameters influence enzyme activity is crucial both for basic research and for industrial and biotechnological processes (Ghorri et al., 2022; Yu et al., 2023), where well-chosen reaction conditions can significantly improve yields and process efficiency (Ramnath et al., 2017).

The central question of this project is how temperature, pH, and substrate concentration influence enzyme activity. Predicting this activity is difficult because the effects of these variables are not always linear and can interact significantly. Moreover, enzymatic responses vary with the experimental conditions and the nature of the substrates, which makes accurate modeling harder (D’Almeida et al., 2025). To address this challenge, the objective of this work is to build and compare predictive models, specifically multiple linear regression models that include quadratic and interaction terms and ridge regression models, in order to better understand and predict enzyme behavior under different experimental conditions.
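As an illustration, a regression model with quadratic and interaction terms in three inputs can be written as below. This is a sketch that assumes a full second-order expansion; the symbols (x_1 for temperature, x_2 for pH, x_3 for substrate concentration, and the β coefficients) are notation introduced here, and the exact set of terms retained in the study is not listed in this excerpt.

```latex
\hat{y} \;=\; \beta_0
\;+\; \sum_{i=1}^{3} \beta_i \, x_i
\;+\; \sum_{i=1}^{3} \beta_{ii} \, x_i^{2}
\;+\; \sum_{1 \le i < j \le 3} \beta_{ij} \, x_i x_j
```

Under this assumption the model has nine terms besides the intercept: three linear, three quadratic, and three pairwise interactions. It remains linear in the coefficients, which is why ordinary linear and ridge regression can both be used to fit it.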

To do this, we will compare the performance of classical linear regression, valued for its simplicity and interpretability, with that of ridge regression, which offers greater robustness in the face of multicollinearity and complex data (Sadiq et al., 2024). This approach aims to provide a decision-making tool for optimizing experimental and industrial conditions, while contributing to a better understanding of the mechanisms underlying enzyme activity (Yu et al., 2023).

In summary, this work contributes to the quantitative modeling of enzyme activity by applying linear and ridge regression techniques to analyze and predict how temperature, pH, and substrate concentration influence enzymatic behavior under varying experimental conditions.

Theoretical Background

Artificial intelligence encompasses methods for analyzing and predicting complex phenomena from data, among which regression plays a central role in modeling the relationship between continuous variables (Letzgus et al., 2022). Linear regression establishes an equation linking a dependent variable to one or more independent variables, where the coefficients reflect the influence of each factor (Stulp & Sigaud, 2015). To make the model more robust to overfitting and multicollinearity, ridge regression adds a regularization term controlled by the parameter α, which limits the magnitude of the coefficients and stabilizes the predictions (R). The performance of these models is evaluated using indicators such as the coefficient of determination R², which measures the proportion of variance explained, and the root mean square error (RMSE), which is the square root of the mean squared difference between observed and predicted values (Chikko et al., 2021). These approaches are well suited to continuous experimental data because they offer a compromise between interpretability, accuracy, and the ability to handle complex relationships between variables.
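The quantities described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the study: the function names ridge_fit, r2_score, and rmse are hypothetical, and the sketch assumes the standard closed-form ridge solution on centered data with an unpenalized intercept.

```python
import numpy as np


def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge fit (assumed formulation); the intercept is not penalized."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Center inputs and target so the intercept can be recovered afterwards.
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc.
    # With alpha = 0 this reduces to ordinary least squares.
    A = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    coef = np.linalg.solve(A, Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

Increasing alpha shrinks the coefficients toward zero, which is the stabilizing effect the regularization term is meant to provide. Because the penalty acts on coefficient size, the inputs are normally standardized before fitting so that all features are penalized on a comparable scale.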

Methodology

Data Description

The dataset used in this project, entitled “enzyme_dataset.csv”, was prepared as a CSV file and imported into Python for analysis. It contains experimental observations describing the effect of three environmental parameters (temperature, pH, and substrate concentration) on enzyme activity.

The dataset includes several combinations of these factors, representing realistic biochemical conditions under which enzyme performance may vary.

Independent variables (inputs):

Temperature (°C): represents the thermal condition of the reaction.
pH: measures the acidity or alkalinity of the reaction medium.
Substrate Concentration (mM): indicates the available amount of substrate for enzymatic catalysis.

Dependent variable (output):

Enzyme Activity (U/mL): quantifies the catalytic efficiency of the enzyme under given conditions.

Software and Libraries

All analyses were performed in Python, using Visual Studio Code as the development environment, with the following scientific libraries supporting data analysis, visualization, and machine learning:

NumPy: Provides efficient numerical operations and array handling, essential for mathematical computations.
Pandas: Used for data manipulation and analysis, especially with tabular data like CSV files.
Matplotlib: A core library for creating static, animated, and interactive plots.
Seaborn: Built on Matplotlib, it offers high-level functions for attractive and informative statistical graphics.

Data Preparation

The dataset was imported using the Pandas function pd.read_csv(), followed by a verification step to ensure that all required columns (Temperature, pH, Substrate_Concentration, and Enzyme_Activity) were present. Before modeling, the data underwent a cleaning process to ensure quality and reliability: missing values were checked and none were detected, duplicate entries were removed, and all columns were confirmed to contain valid numerical data (the dataset used was already clean and required no corrections).
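The preparation steps described above could be sketched as follows. The column names and the use of pd.read_csv() come from the text; the function name load_and_clean and the choice to raise an error when a check fails are assumptions made for this illustration.

```python
import pandas as pd

# Column names as listed in the text.
REQUIRED_COLUMNS = [
    "Temperature",
    "pH",
    "Substrate_Concentration",
    "Enzyme_Activity",
]


def load_and_clean(source):
    """Read the CSV, verify the expected columns, and apply the cleaning checks.

    Hypothetical helper: raising ValueError on a failed check is an assumption.
    """
    df = pd.read_csv(source)

    # Verification step: all required columns must be present.
    missing_cols = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing_cols:
        raise ValueError(f"Missing required columns: {missing_cols}")

    df = df[REQUIRED_COLUMNS]
    # Coerce to numeric so that any non-numeric entry becomes NaN
    # and is caught by the missing-value check below.
    df = df.apply(pd.to_numeric, errors="coerce")
    n_missing = int(df.isna().sum().sum())
    if n_missing:
        raise ValueError(f"{n_missing} missing or non-numeric values found")

    # Remove exact duplicate rows and renumber the index.
    return df.drop_duplicates().reset_index(drop=True)
```

A call such as load_and_clean("enzyme_dataset.csv") would then return a DataFrame ready for feature construction. Failing loudly instead of silently dropping rows makes it explicit when a dataset is not as clean as expected.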

...
