A Retrospective Examination of Machine Learning (ML) Techniques for Predicting Cycle Ergometer Peak Performance in a Clinical Setting of Cancer Survivors
Introduction
Cancer patients face numerous challenges, including the loss of physical performance due to the disease and its treatments. For example, lung resection is associated with a reduction of 12-20% in peak oxygen saturation of working muscles ^{1}, while certain tumors ^{2}, ^{3} or chemotherapies ^{4} are known to cause skeletal myopathy that can result in leg muscle weakness and exercise failure. In addition, certain chemotherapies, as well as targeted or immunologic therapies, may trigger peripheral vascular toxicities, such as systemic hypertension, pulmonary hypertension, thrombosis, and stenosis ^{5}. According to the International Agency for Research on Cancer (IARC), there were an estimated 19 million or more new cancer cases and almost 10 million cancer deaths worldwide in 2020 ^{6}. The beneficial impact of physical exercise on cancer patients, from the time of diagnosis through survivorship and beyond, is widely acknowledged ^{7}, ^{9}. To attain these benefits, however, an exercise specialist must create a customized exercise program that is tailored to individual-specific needs. This requires consideration of various factors, including the underlying pathophysiology of cancer, any potential comorbidities, past injuries, and current level of physical fitness. One of the initial stages in this process is clinical exercise testing (CET), whose primary objective is to provide comprehensive information about the physical performance and capabilities of the subject. Traditional methods of assessing physical performance in the clinical domain, such as the 6-minute walk test and the hand grip strength test, are not objective enough and may not predict physical performance accurately, especially in cancer patients.
In recent years, there has been a growing interest in sports therapy interventions for individuals diagnosed with cancer [^{10}, ^{12}]. This interest is particularly evident in the use of fitness assessment tools, designed to monitor and improve the patient’s physical condition.
Machine learning (ML) techniques offer promising avenues for predicting physical performance in cancer patients. However, research on ML techniques for accurately predicting maximal physical performance in cancer patients has been surprisingly limited. This article series explores the use of ML techniques for predicting physical performance in cancer patients, with a focus on the methods used and their corresponding results. To address the gaps in the current literature, the series assesses and compares the prediction performance of several prevalent supervised ML techniques in predicting fitness performance level (FPL) and peak Watt in a retrospective cohort of 1712 cancer patients.
Machine Learning Methods
For the purpose of the analysis, two popular ML frameworks were employed. The Scikit-Learn library was used for the regression and classification models (LR, SVr, RFr, LogR, SVc, RFc, and NB), while the Fastai library was used for the deep learning models, specifically the TL and CNN (Table 1).
MODELS
| Regression | Classification | Deep learning (Neural Network) |
| --- | --- | --- |
| Linear regression (LR) | Logistic regression (LogR) | Convolutional neural network (CNN) |
| Support vector regressor (SVr) | Support vector classifier (SVc) | Tabular learner (TL) |
| Random forest regressor (RFr) | Random forest classifier (RFc) | |
| | Naive Bayes (NB) | |
Table 1
This series consists of three articles, each covering one of three types of algorithms: regression algorithms, classification algorithms, and deep learning, in that order. Before delving into regression models, it is essential to explain the procedures involved in data processing and data mining. In data science, the 80/20 principle is a widely acknowledged heuristic: approximately 80% of the time spent on data analysis goes to data preparation, processing, and data mining, while the remaining 20% (or even less) is devoted to algorithmic work. The rationale is the well-established axiom that flawed input data lead to erroneous output (garbage in, garbage out), so the data must be clean and appropriately prepared before any analysis is conducted.
Subjects
The dataset utilized in this research comprised anonymized anthropometric measurements, as well as information regarding the sex and age of 1712 cancer patients (Table 2).
Table 2
The sample consisted of 1127 female individuals and 585 male individuals, with mean ages of 55.1 ± 10.3 years and 57.0 ± 11.9 years, respectively. Body height was 164.3 ± 6.5 cm and 176.2 ± 7.7 cm, and body weight was 71.7 ± 15.0 kg and 84.1 ± 16.5 kg, respectively. All subjects included in the dataset had undergone a cycling ergometer test at the rehabilitation clinic between July 2016 and June 2019. This cycling ergometer test is a standardized clinical assessment used to determine the appropriate therapeutic modality for patients based on their level of physical fitness. Prior to analysis, the dataset underwent preprocessing and quality control procedures to identify and exclude erroneous or atypical entries. Subsequently, the dataset was partitioned into training and testing subsets to enable the comparison of various machine learning algorithms. The detailed workflow adopted in this research is illustrated schematically in the figure below.
Figure 1
Preprocessing and feature engineering
A visual inspection of the dataset was conducted to identify potential inconsistencies and outliers that could indicate errors during data entry. For instance, cases in which an individual’s body weight exceeded 200 kg or body height fell below 100 cm were considered indicative of erroneous cross-entry between these attributes. Additionally, the density distribution of all attributes, stratified by sex, was examined (figure below).
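The plausibility check described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records and field names (`weight_kg`, `height_cm`) are hypothetical and not the study's column names. Only the two thresholds (200 kg, 100 cm) come from the text.

```python
# Hypothetical records for illustration; not taken from the study's dataset.
records = [
    {"id": 1, "weight_kg": 71.5, "height_cm": 164.0},
    {"id": 2, "weight_kg": 176.0, "height_cm": 84.0},   # values look swapped
    {"id": 3, "weight_kg": 210.0, "height_cm": 180.0},  # implausible weight
]

MAX_WEIGHT_KG = 200.0  # threshold named in the text
MIN_HEIGHT_CM = 100.0  # threshold named in the text


def is_implausible(record):
    """Return True if the entry violates either plausibility threshold."""
    return (record["weight_kg"] > MAX_WEIGHT_KG
            or record["height_cm"] < MIN_HEIGHT_CM)


flagged = [r["id"] for r in records if is_implausible(r)]
clean = [r for r in records if not is_implausible(r)]

print(flagged)  # [2, 3]
```

Flagged entries would still need manual review, since a simple threshold cannot tell whether two values were swapped or mistyped.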
Figure 2
The figure above clearly illustrates that there are discernible differences in the distribution shapes (peak and spread of data) of each attribute within the dataset when stratified by sex. Notably, certain attributes, such as body weight, body height, and peak watt, exhibit a more pronounced influence of sex. The normality of distribution for all variables in both the original and normalized datasets was assessed using the Shapiro-Wilk test, which yielded statistically significant results (p < .05). This indicates that none of the variables adhere to a normal distribution. In predictive modeling, however, normally distributed data are generally not a prerequisite for obtaining reliable prediction estimates. Rather, the focus lies on assessing whether the residuals demonstrate a Gaussian distribution after the model has been fitted ^{13}.
The residuals should scatter randomly around the zero line, indicating that the model’s predictions are, on average, accurate and that no discernible pattern remains in the residuals that would suggest a lack of fit. The residual plots for the three regression models are shown in the figure below.
Figure 3
Following data cleaning and preprocessing, four predictor variables (sex, age, body height, and body weight) were used to model two outcome variables separately: peak watt (a continuous variable) and fitness performance level (FPL) (a categorical variable). The FPL was derived from the peak watt value based on established criteria ^{11}, which allowed for the categorization of subjects according to their FPL. Although the two response variables are evidently correlated, each model in this analysis has a single response variable, so this correlation does not affect the analysis. Furthermore, the body mass index (BMI), calculated as body weight in kilograms divided by the square of body height in metres, was included as an additional predictor variable to enhance the predictive performance of the models.
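As a small worked example of the derived predictor, BMI is conventionally defined as body weight in kilograms divided by the square of body height in metres. The sketch below applies that definition to the cohort's mean height and weight purely for illustration. The function name is hypothetical, and the BMI of the mean values is not the same as the cohort's mean BMI.

```python
def body_mass_index(weight_kg, height_cm):
    """BMI = weight in kg divided by the squared height in metres."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2


# Cohort means reported in the text, used here only as example inputs.
bmi_female_example = body_mass_index(71.7, 164.3)
bmi_male_example = body_mass_index(84.1, 176.2)

print(round(bmi_female_example, 1))  # 26.6
print(round(bmi_male_example, 1))    # 27.1
```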
The validation set consisted of 20% of the total observations in the dataset, while the remaining 80% was allocated to the training dataset. To address differences in units and scales among the predictor variables, the data were rescaled. Rescaling was performed using the MinMaxScaler class from Scikit-Learn, which by default maps each predictor to the range [0, 1] and is represented by the following equation:
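$$x_{\text{scaled}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
Equation 1
where $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the predictor.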
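The 80/20 split and the min-max rescaling can be sketched in plain Python as follows. This is a simplified stand-in for the Scikit-Learn tooling used in the study, not a reproduction of it. The helper names, the seed, and the example heights are hypothetical, and fitting the scaler on the training portion only is an assumption, since the text does not state how the scaler was fitted.

```python
import random


def train_valid_split(rows, valid_fraction=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them for validation."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_valid = round(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]


def fit_min_max(values):
    """Learn the minimum and maximum from the training values only."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Map each value to (value - lo) / (hi - lo)."""
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical body heights in cm, for illustration only.
heights = [150.0, 155.0, 160.0, 164.0, 168.0,
           172.0, 176.0, 180.0, 185.0, 190.0]

train, valid = train_valid_split(heights)
lo, hi = fit_min_max(train)
train_scaled = apply_min_max(train, lo, hi)
valid_scaled = apply_min_max(valid, lo, hi)  # may fall outside [0, 1]

print(len(train), len(valid))                # 8 2
print(min(train_scaled), max(train_scaled))  # 0.0 1.0
```

Reusing the training minimum and maximum for the validation data keeps information from the held-out set out of the preprocessing step.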
The importance of each predictor variable in the random forest classifier (RFc) and random forest regressor (RFr) models was assessed using Scikit-Learn’s feature_importances_ attribute. This attribute quantifies the share of information contributed by each predictor variable to the model. Specifically, the importance of a feature is determined by how much the tree nodes that use that feature reduce impurity (e.g., the Gini value), on average ^{14}. For all other Scikit-Learn models, which lack a feature importance attribute, the permutation importance method was employed. This method randomly shuffles the observations of a single feature and measures the decrease in model performance relative to the unshuffled baseline. The larger the decrease in performance, the more important the feature is considered to be. Because the impurity-based feature importance is derived from the training dataset, it may not reflect how useful a feature is for predictions on unseen data. Permutation importance, in contrast, can be computed on the validation set, which mitigates this problem. Additionally, permutation importance is not biased towards features with high cardinality. In essence, it evaluates the decrease in the model score when the information carried by a feature is destroyed. In this study, Scikit-Learn’s permutation_importance function was used, with 10 random shuffles per feature.
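To make the shuffling idea concrete, here is a minimal plain-Python sketch of permutation importance. It is not Scikit-Learn's implementation. The function names, the toy model, and the data are hypothetical, and importance is expressed as the mean increase in RMSE rather than the decrease in a score. The 10 repeats per feature mirror the setting mentioned in the text.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def permutation_importance_sketch(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in RMSE after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = rmse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [row[:j] + [column[i]] + row[j + 1:]
                      for i, row in enumerate(X)]
            score = rmse(y, [predict(row) for row in X_perm])
            increases.append(score - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical fitted model that uses only the first feature."""
    return 2.0 * row[0]


# Hypothetical validation data: the target depends on feature 0 only.
X_valid = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0],
           [4.0, 1.0], [5.0, 7.0], [6.0, 2.0]]
y_valid = [2.0 * row[0] for row in X_valid]

importances = permutation_importance_sketch(toy_model, X_valid, y_valid)
print(importances)  # feature 0 gets a positive value, feature 1 exactly 0.0
```

Because the toy model ignores the second feature, shuffling it cannot change the error, so its importance is zero by construction.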
Furthermore, the hyperparameters of all Scikit-Learn models were tuned using GridSearchCV with 10-fold cross-validation. This procedure splits the training dataset into 10 equal parts, called folds. Each fold is used in turn to assess the performance of a model that has been trained on the remaining nine folds. By using 10-fold cross-validation, we ensure that the model is evaluated on multiple subsets of the data, reducing the risk that the estimated performance depends on one particular training-validation split. The optimal hyperparameter values obtained through this process were incorporated into the final model.
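The fold mechanics can be illustrated without any library. The sketch below generates the index sets for 10-fold cross-validation. The function name and the sample count are hypothetical, and the use of contiguous, unshuffled folds is a simplifying assumption. A grid search repeats this loop for every hyperparameter combination and keeps the combination with the best mean validation score.

```python
def k_fold_indices(n_samples, n_folds=10):
    """Yield (train_idx, valid_idx) pairs for contiguous, near-equal folds."""
    base, extra = divmod(n_samples, n_folds)
    # The first `extra` folds receive one additional sample.
    fold_sizes = [base + (1 if i < extra else 0) for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        valid_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, valid_idx
        start = stop


n_samples = 103  # hypothetical number of training observations
folds = list(k_fold_indices(n_samples, n_folds=10))

print(len(folds))                     # 10
print([len(v) for _, v in folds])     # [11, 11, 11, 10, 10, 10, 10, 10, 10, 10]
```

Each observation is used for validation exactly once and for training in the other nine rounds.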
In general, all models employed in this investigation rely on supervised learning methodologies, as the dataset utilized for modeling purposes incorporates labeled data, encompassing both continuous and categorical response variables.
As mentioned earlier in the text, this series of articles will be divided into three separate articles based on the ML method used: regression, classification, and deep learning. We will begin with regression models, specifically the utilization of three regression models: linear regression, support vector regression, and random forest regression. A brief introduction to these models will be provided in the following section.
Linear regression (LR)
The linear regression (LR) model is a useful tool for predicting a numerical outcome ^{15}. It predicts the desired value by calculating a weighted sum of the input predictor features, along with a constant term known as the bias (or intercept) term ^{14}. The following equation provides a clearer explanation of the model:
$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$
Equation 2
where $\hat{y}$ is the predicted value, $n$ is the number of predictors, $x_i$ is the value of the i^{th} predictor, $\theta_0$ is the intercept or bias term, and $\theta_1, \cdots, \theta_n$ are the weights assigned to each predictor ^{14}. The goal of the linear regression model is to find the best-fit line that minimizes the residual sum of squares between the observed values in the dataset and the values predicted by the linear function of the predictors. This is achieved through the ordinary least squares (OLS) method, which computes the sum of the squared differences between the observed and predicted values and selects the weights that minimize this sum.
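For a single predictor, the least-squares solution has a simple closed form, which the sketch below uses to show what "minimizing the residual sum of squares" means in practice. This is a deliberately reduced illustration: the study used several predictors and Scikit-Learn, whereas the function names and the height and Watt values here are hypothetical.

```python
def fit_simple_ols(x, y):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residual_sum_of_squares(x, y, intercept, slope):
    """Sum of squared differences between observed and predicted values."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical heights (cm) and peak power (Watt), for illustration only.
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
watts = [82.0, 98.0, 121.0, 139.0, 160.0]

intercept, slope = fit_simple_ols(heights, watts)
best_rss = residual_sum_of_squares(heights, watts, intercept, slope)

print(round(slope, 2), round(intercept, 1))  # 1.97 -214.9
```

Any other choice of intercept or slope on these data yields a larger residual sum of squares than `best_rss`, which is the sense in which the fitted line is "best".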
Support vector regression (SVr)
To aid understanding of the support vector regressor model, a visual representation of its underlying mechanism is presented in the figure below.
In contrast to the SV classification model, which aims to fit the widest possible margin between two classes while limiting margin violations (explained in detail in the next article), the SVr model reverses the objective. In the SVr model, the goal is to fit as many instances as possible within the margin while limiting margin violations, that is, instances that fall outside the margin ^{14}. Adding more training instances within the margin does not affect the model’s predictions ^{14}.
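The reason instances inside the margin do not influence the fit is usually expressed through the epsilon-insensitive loss, the standard formulation behind support vector regression. The sketch below assumes that formulation. The margin half-width `epsilon` and the numbers are hypothetical, since the text does not report the value used.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the margin, linear in the excess error outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


observed_watt = 100.0          # hypothetical observed value
predictions = (95.0, 110.0, 125.0)  # hypothetical model outputs

losses = [epsilon_insensitive_loss(observed_watt, p, epsilon=10.0)
          for p in predictions]

print(losses)  # [0.0, 0.0, 15.0]
```

Only the third prediction lies outside the 10-Watt margin, so it is the only one that contributes to the loss.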
Random forest regressor (RFr)
Although the random forest model is commonly used for classification tasks, it can also be used for regression tasks. The random forest model is built upon the decision tree algorithm and uses a large number of decision trees as an ensemble. For regression, the model aggregates the predicted values by averaging the predictions from all the trees ^{14}. Two important techniques incorporated in every random forest model are bagging and feature randomness. Bagging trains each tree on a random sample of the training data drawn with replacement, so that the trees differ from each other. Feature randomness further increases variation among the trees by randomly selecting a subset of predictor variables. Together, these techniques create a forest of largely uncorrelated trees, which protects the ensemble from the errors of individual trees. In general, this ensemble concept results in a similar bias but lower variance compared to a single tree.
The figure below illustrates the application of the random forest method to both categorical and continuous variables. The overall concept of the random forest remains the same regardless of the type of data being analyzed.
The decision trees within the random forest are trained using bootstrap samples, where they are split at nodes until reaching a leaf node. This allows the decision trees to predict the class or continuous value of the outcome variable. Each tree in the random forest represents a single decision tree, which predicts the color or value depending on the type of variable being modeled. The final prediction is made by aggregating the predictions from all decision trees. For categorical variable models, the prediction is the category that receives the most votes across all decision trees (in this case, green). For continuous variable models, the prediction is the average value computed over all trees (in this case, 6.7 m).
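The two aggregation rules, together with bootstrap sampling, can be sketched in plain Python. The individual tree outputs below are hypothetical and were chosen only so that the results match the example values mentioned above (green and 6.7). No actual trees are trained here.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(rows, rng):
    """Draw len(rows) items with replacement, as done for each tree."""
    return [rng.choice(rows) for _ in rows]


def aggregate_classification(tree_votes):
    """Majority vote across the trees."""
    return Counter(tree_votes).most_common(1)[0][0]


def aggregate_regression(tree_predictions):
    """Average of the per-tree predictions."""
    return mean(tree_predictions)


rng = random.Random(42)
training_rows = list(range(10))            # stand-in for row indices
sample = bootstrap_sample(training_rows, rng)

votes = ["green", "red", "green", "green", "red"]  # hypothetical tree outputs
values = [6.5, 7.0, 6.6]                           # hypothetical tree outputs

print(aggregate_classification(votes))        # green
print(round(aggregate_regression(values), 1))  # 6.7
```

Because sampling is done with replacement, a bootstrap sample typically contains some rows more than once and omits others, which is what makes the trees differ.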
Results of regression models
The initial analysis evaluated the predictive performance of the regression models for a continuous outcome variable, the peak Watt (Wpeak). The training dataset underwent 10-fold cross-validation, and the results are presented in Table 3. The linear regression (LR) model exhibited a prediction error (root mean square error, RMSE) of 32.50 ± 3.36 Watt, while the random forest regressor (RFr) achieved an RMSE of 32.02 ± 3.09 Watt. The support vector regressor (SVR) performed best, with an RMSE of 31.71 ± 3.09 Watt. The difference in RMSE between the SVR and LR models amounted to approximately 0.8 Watt.
Abbreviations: cross-validation (cv), linear regression (LR), random forest regressor (RFr), root mean squared error (RMSE), support vector regression (SVR), and standard deviation (SD).
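For readers unfamiliar with how a "mean ± SD" cross-validation result is obtained, the sketch below computes an RMSE and then summarizes per-fold RMSE values. All numbers are hypothetical and unrelated to Table 3, and the use of the sample standard deviation is an assumption, as the text does not say which variant was reported.

```python
import math
from statistics import mean, stdev


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


# One hypothetical fold: observed and predicted peak power in Watt.
example_rmse = rmse([100.0, 150.0], [110.0, 130.0])

# Hypothetical per-fold RMSE values (three folds shown for brevity).
fold_rmse = [30.0, 34.0, 32.0]
cv_mean = mean(fold_rmse)
cv_sd = stdev(fold_rmse)  # sample standard deviation (assumption)

print(round(example_rmse, 2))  # 15.81
print(cv_mean, cv_sd)          # 32.0 2.0
```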
In addition, the validation/test dataset was employed to validate the final models, as outlined in Table 4. The analysis revealed a decrease in the root mean square error (RMSE) metric, from 36.41 Watt in the linear regression (LR) model to 34.12 Watt in the random forest regressor (RFr) model, indicating a difference of 2.29 Watt.
Moreover, the accuracy of the models was assessed using the mean absolute percentage error (MAPE) metric.
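Mean absolute percentage error (MAPE):
$$\text{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{x_i - \hat{x}_i}{x_i} \right|$$
with $n$ being the number of observations, $x_i$ the actual value, and $\hat{x}_i$ the predicted value.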
Abbreviations: confidence interval (CI), R-squared (R²).
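As a worked example of the MAPE metric, the sketch below uses the common definition: the mean of the absolute errors relative to the actual values, expressed in percent. The function name and the example values are hypothetical and unrelated to Table 4.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    relative_errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(relative_errors) / len(actual)


# Hypothetical observed and predicted peak power in Watt.
example_mape = mape([100.0, 200.0], [90.0, 220.0])

print(round(example_mape, 1))  # 10.0
```

Both predictions are off by 10% of the actual value, so the MAPE is 10%. Note that the metric is undefined when an actual value is zero.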
The random forest regressor exhibited the lowest mean absolute percentage error (MAPE), indicating higher accuracy when compared to the linear regression (LR) and support vector regressor (SVR) models. The coefficients of determination (R²), which quantify the proportion of the variance in the response variable explained by the predictor variables, are also presented in Table 4. The RFr model achieved the highest R² score (33%), followed by SVR (30%) and LR model (23%).
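The coefficient of determination can be computed from the residual and total sums of squares. The sketch below uses the standard definition, R² = 1 − SS_res / SS_tot, with hypothetical values that are unrelated to Table 4.

```python
def r_squared(actual, predicted):
    """Proportion of variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted peak power in Watt.
observed = [100.0, 120.0, 140.0, 160.0]
predicted = [110.0, 120.0, 140.0, 150.0]

example_r2 = r_squared(observed, predicted)
print(round(example_r2, 2))  # 0.9
```

An R² of 0.9 would mean that 90% of the variance is explained. The values of 23% to 33% reported above correspond to 0.23 to 0.33 on this scale.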
The figure below shows the feature importance bar chart, representing the percentage of information attributed to each predictor variable in the RFr model. Among the predictors, body height ranked highest, accounting for 32% of the model’s feature importance, followed by age at 27%. These two features together accounted for 59% of the total feature importance, while the remaining features (body weight, sex, and body mass index) accounted for the remaining 41% (15%, 15%, and 11%, respectively).
In addition to the feature importance analysis, the permutation importance analysis was conducted to further explore the significance of variables in predicting maximal performance in cancer patients (as shown in Table 5). The results revealed that body height emerged as a crucial factor, providing valuable information for accurate predictions. Conversely, the permutation importance analysis did not identify age as a significant factor in the prediction of maximal performance. Interestingly, it assigned greater importance to body weight and age compared to the previous feature importance analysis.
A comprehensive examination of all three types of machine learning (ML) models will be undertaken in the forthcoming article, specifically the third article. The second article will focus on presenting the outcomes of classification ML models and addressing the challenges that were encountered in order to ensure the validity and reproducibility of the results.
References:
1. L. W. Jones, N. D. Eves, M. Haykowsky, A. A. Joy, and P. S. Douglas, “Cardiorespiratory exercise testing in clinical oncology research: systematic review and practice recommendations,” The Lancet Oncology, vol. 9, no. 8, pp. 757-765, 2008, doi: 10.1016/S1470-2045(08)70195-5.
2. D. L. . 691691, 2015, doi: 10.1038/bonekey.2015.59.
3. B. Fayyaz, H. J. Rehman, and H. Uqdah, “Cancer-associated myositis: an elusive entity,” J Community Hosp Intern Med Perspect, vol. 9, no. 3, pp. 45-49, 2019, doi: 10.1080/20009666.2019.1571880.
4. D. G. Campelj, C. A. Goodman, and E. Rybalka, “Chemotherapy-Induced Myopathy: The Dark Side of the Cachexia Sphere,” Cancers, vol. 13, no. 14, p. 3615, 2021. [Online]. Available: https://www.mdpi.com/2072-6694/13/14/3615
5. S.-A. Brown, “Preventive Cardio-Oncology: The Time Has Come,” Frontiers in Cardiovascular Medicine, vol. 6, no. 187, Jan. 2020, doi: 10.3389/fcvm.2019.00187.
6. H. Sung et al., “Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries,” CA Cancer J Clin, vol. 71, no. 3, pp. 209-249, May 2021, doi: 10.3322/caac.21660.
7. D. Y. T. Fong et al., “Physical activity for cancer survivors: meta-analysis of randomised controlled trials,” BMJ, vol. 344, p. e70, 2012, doi: 10.1136/bmj.e70.
8. American College of Sports Medicine and M. L. Irwin, ACSM’s Guide to Exercise and Cancer Survivorship. Human Kinetics, 2012.
9. D. Schmid and M. F. Leitzmann, “Association between physical activity and mortality among breast cancer and colorectal cancer survivors: a systematic review and meta-analysis,” Annals of Oncology, vol. 25, no. 7, pp. 1293-1311, 2014, doi: 10.1093/annonc/mdu012.
10. R. M. Speck, K. S. Courneya, L. C. Mâsse, S. Duval, and K. H. Schmitz, “An update of controlled physical activity trials in cancer survivors: a systematic review and meta-analysis,” Journal of Cancer Survivorship, vol. 4, no. 2, pp. 87-100, 2010, doi: 10.1007/s11764-009-0110-5.
11. P. A. Ospina, A. McComb, L. E. Pritchard-Wiart, D. D. Eisenstat, and M. L. McNeely, “Physical therapy interventions, other than general physical exercise interventions, in children and adolescents before, during and following treatment for cancer,” Cochrane Database of Systematic Reviews, no. 8, 2021, doi: 10.1002/14651858.CD012924.pub2.
12. V. Rustler, M. Hagerty, J. Daeggelmann, S. Marjerrison, W. Bloch, and F. T. Baumann, “Exercise interventions for patients with pediatric cancer during inpatient acute care: A systematic review of literature,” Pediatric Blood & Cancer, vol. 64, no. 11, p. e26567, 2017, doi: 10.1002/pbc.26567.
13. R. McElreath, Statistical Rethinking: A Bayesian Course with Examples in R and STAN. CRC Press, 2020.
14. A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O’Reilly Media, 2019.
15. G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning: with Applications in R. Springer US, 2021.
This topic contains 0 replies, has 1 voice, and was last updated by Mihailo Tomic 11 months, 1 week ago.