Overcoming the Limitations of Multiple Regression in Biological Age Calculation

Pysaruk AV

Published on: 2025-08-14

Abstract

This paper discusses the limitations of the multiple linear regression (MLR) method for calculating biological age (BA), including a systematic error that overestimates the age of young individuals and underestimates that of older individuals. Another significant limitation is the inflexibility of adding new biomarkers to the model without recollecting data, which is due to the interdependence (collinearity) of the indicators. To address these issues, two new approaches are presented. The first method involves correcting the MLR equation using residual regression, which significantly increases age prediction accuracy and eliminates the systematic error. The second, an alternative method, is not based on MLR and considers each biomarker independently. This approach allows for the use of incomplete datasets and the easy addition of new indicators, making it more flexible. However, using incomplete data does reduce the reliability of the assessment. Both proposed methods demonstrate high predictive power, making them suitable for clinical application in assessing age-related changes.

Keywords

Biological age; Multiple linear regression; Overcoming the limitations

Introduction

The calculation of biological age (BA) is a key task in gerontology, allowing for the assessment of an organism's aging rate and the effectiveness of anti-aging interventions [1-4]. One of the most common approaches to determining BA has historically been the multiple linear regression (MLR) method [5-10]. However, despite its popularity, MLR has a number of significant limitations that can lead to inaccuracies in age estimation and hinder further research.

One of the fundamental problems inherent in MLR is its tendency to overestimate the calculated age in young individuals and underestimate it in older individuals. As a result, the regression line on a graph of predicted age versus chronological age (CA) does not pass through the origin, and its slope is less than 45 degrees [10-11]. Such distortion necessitates the application of various correction methods, often utilizing information about CA. Currently, the Klemera and Doubal method (KDM) for calculating BA, which also uses information about CA, has gained popularity [12]. However, it is considerably more difficult to apply.

Furthermore, a significant drawback of the MLR method is its inflexibility when adding new biomarkers. If there is a need to include a new indicator in an already existing BA calculation formula, it requires costly and laborious re-examinations of large groups of people using a new battery of tests. This approach significantly limits research opportunities and the development of more accurate and comprehensive models of aging.

Collinearity is another disadvantage of building a BA model using the multiple regression method. It arises when the biomarkers used to construct the BA calculation formula correlate with each other. In this case, during the stepwise procedure for selecting informative indicators, two such indicators can both enter the formula but with opposite signs, even though their correlations with age have the same sign.

In this paper, we will delve into these shortcomings of the multiple regression method for calculating biological age, and also present new methods based on independent information about each aging marker. These innovative approaches aim to overcome the aforementioned limitations, offering a more accurate and flexible tool for BA assessment that does not require repeated large-scale examinations.

Materials and Methods

The clinical study was conducted in accordance with Ukrainian legislation and the principles of the Helsinki Declaration on human rights. The study protocol, patient information, and informed consent form were reviewed and approved at a meeting of the ethics committee of the clinical department of the State Institution "D.F. Chebotarev Institute of Gerontology of the National Academy of Medical Sciences of Ukraine." Patients confirmed their voluntary decision to participate in the study by signing the informed consent form.

According to the study protocol, 93 individuals (49 women and 44 men) aged 30 to 80 were examined. These individuals did not have pathology of the cardiovascular, respiratory, endocrine, or central nervous systems, nor did they have chronic liver or kidney diseases or pathology of the hematopoietic system.

Measurements of systolic blood pressure (SBP), diastolic blood pressure (DBP), and heart rate (HR) were taken between 10 a.m. and 12 p.m. The measurements were performed in two positions: in a supine position, after 5 minutes of rest and immediately upon standing up (Table 1).

Methods for eliminating the shortcomings of multiple regression when constructing the BA formula are shown using the example of simple biomarkers of age-related changes in the cardiovascular system.

Table 1: Statistical Characteristics of the Indicators of the Examined People.

Indicators                  | Mean   | Std. Dev.
Age, years                  | 52.97  | 11.36
SBP, mm Hg                  | 127.27 | 13.78
DBP, mm Hg                  | 79.35  | 8.64
SBP - DBP (ΔBP), mm Hg      | 48.05  | 11.02
HR (supine position), bpm   | 64.97  | 6.43
SBP*, mm Hg                 | 127.40 | 17.43
DBP*, mm Hg                 | 88.15  | 10.57
HR*, bpm                    | 78.74  | 8.31
HR* - HR (ΔHR), bpm         | 13.77  | 4.94

Note. * - in a standing position

The multiple regression equation of biomarkers with age was derived using the Statistica 7.0 for Windows program (StatSoft, USA), which was also used to generate all graphs. All studied parameters exhibited a normal distribution. Correlation coefficients were considered statistically significant at p<0.05.

Results and Discussion

The most common method for calculating BA is to construct a Multiple Linear Regression (MLR) model. The regression equation is a mathematical formula used to predict the dependent variable from independent variables. Each independent variable is assigned a regression coefficient, which describes the magnitude and direction of the relationship between the independent and dependent variables. These coefficients are determined through regression analysis. A strong relationship results in relatively large coefficient values, while a weak relationship is characterized by coefficients approaching zero. The term b0 represents the intercept, which is the predicted value of the dependent variable when all independent variables are zero. The regression equation can be written as follows:

Y = b0 + b1 X1 + b2 X2 + … + bn Xn

Where:

  • Y is the dependent variable;
  • b0 is the intercept;
  • b1−bn are the regression coefficients for the independent variables;
  • X1−Xn are the independent variables.

The difference between simple linear regression and MLR is that MLR uses a hyperplane instead of a regression line. A key advantage of MLR over simple regression is that by using multiple independent variables, the model can explain a greater proportion of the dependent variable's variance.

To evaluate the model's quality, we calculate the multiple correlation coefficient (R) and the coefficient of determination (R2). The multiple correlation coefficient ranges from 0 to 1, with a value closer to 1 indicating a stronger relationship between the dependent variable and the entire set of independent variables. The overall significance of the MLR equation is assessed using the Fisher's F-test.
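As a minimal sketch of the fit and the quality measures described above, the following Python example fits an MLR model by ordinary least squares and computes R, R2, and the F statistic. The data are synthetic and the function name `fit_mlr` is illustrative only; the study itself used Statistica 7.0, so this is not the author's implementation.

```python
import numpy as np

def fit_mlr(X, y):
    """Fit y = b0 + b1*x1 + ... + bn*xn by ordinary least squares.

    Returns (coefficients with the intercept first, R, R2, F).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])        # intercept column first
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    y_hat = design @ coef
    ss_res = np.sum((y - y_hat) ** 2)                # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)             # total variation
    r2 = 1.0 - ss_res / ss_tot
    f_stat = (r2 / k) / ((1.0 - r2) / (n - k - 1))   # F with (k, n-k-1) df
    return coef, np.sqrt(r2), r2, f_stat

# Synthetic illustration only (NOT the study data)
rng = np.random.default_rng(0)
age = rng.uniform(30, 80, size=93)
X = np.column_stack([
    20 + 0.5 * age + rng.normal(0, 9, 93),   # a marker that rises with age
    76 - 0.2 * age + rng.normal(0, 6, 93),   # a marker that falls with age
])
coef, r, r2, f_stat = fit_mlr(X, age)
```

The same F formula applied to the summary values reported in Table 4 (R2 = 0.418, 3 predictors, 93 participants) gives F of about 21.3, in line with the reported F(3, 89) = 21.347 up to rounding of R2.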

The first stage of building a BA model involves selecting the most informative aging biomarkers, specifically those that have a high correlation with age. In the second stage, their mutual correlation is examined. Of several indicators that are highly correlated with each other, only one should be used in the model: the one with the highest correlation with age and a weak correlation with other indicators. This approach helps to avoid unwanted collinearity. It is important to note that since all indicators used in the model are highly correlated with age, they will also be correlated with each other. Therefore, it is necessary to calculate the partial correlation of the indicators with one another. The correlation coefficients between the indicators and the age of the participants are presented in Table 2.

Table 2: Correlation Coefficients of the Indicators with Age.

Indicators | R      | R2    | p
SBP        | 0.489  | 0.24  | <0.001
DBP        | 0.078  | 0.006 | 0.459
ΔBP        | 0.549  | 0.302 | <0.001
HR         | -0.382 | 0.146 | 0.0002
SBP*       | 0.254  | 0.065 | 0.014
DBP*       | 0.0732 | 0.005 | 0.485
HR*        | -0.374 | 0.14  | <0.001
ΔHR        | -0.333 | 0.111 | 0.001

Note. Marked correlations are significant at p<0.05

The obtained data indicate a statistically significant correlation with age for SBP, HR, ΔBP and ΔHR. Table 3 shows the partial correlation coefficients of the indicators, adjusted for age, for the examined individuals (correlation matrix).

Table 3: Partial Correlations of Indicators.

 

     | SBP   | DBP   | ΔBP   | HR    | SBP*  | DBP*  | HR*   | ΔHR
SBP  | 1     | 0.67  | 0.72  | 0.03  | 0.65  | 0.41  | -0.03 | -0.08
DBP  | 0.67  | 1     | 0.01  | 0.08  | 0.66  | 0.72  | 0.06  | -0.01
ΔBP  | 0.72  | 0.01  | 1     | -0.03 | 0.33  | -0.06 | -0.05 | -0.04
HR   | 0.03  | 0.08  | -0.03 | 1     | 0.15  | 0.04  | 0.77  | -0.09
SBP* | 0.65  | 0.66  | 0.33  | 0.15  | 1     | 0.77  | 0.13  | 0.01
DBP* | 0.41  | 0.72  | -0.06 | 0.04  | 0.77  | 1     | 0.05  | 0.02
HR*  | -0.03 | 0.06  | -0.05 | 0.77  | 0.13  | 0.05  | 1     | 0.57
ΔHR  | -0.08 | -0.01 | -0.04 | -0.09 | 0.01  | 0.02  | 0.57  | 1

Note. Marked correlations are significant at p<0.05

The data presented in Table 3 indicate that SBP, DBP, and ΔBP are highly correlated with one another in both the supine and standing positions. Consequently, only one of these indicators should be included in the model: ΔBP, which has the highest correlation with age.
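One common way to obtain age-adjusted partial correlations of this kind is to regress each indicator on age and correlate the residuals. The sketch below illustrates this on synthetic data; the function name `partial_corr` and the simulated markers are assumptions for illustration, not the procedure used in Statistica.

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))

    def residuals(v):
        slope, intercept = np.polyfit(z, v, 1)   # simple regression of v on z
        return v - (slope * z + intercept)

    return np.corrcoef(residuals(x), residuals(y))[0, 1]

# Synthetic illustration: two markers that depend on age but not on each other
rng = np.random.default_rng(1)
age = rng.uniform(30, 80, 500)
a = 0.5 * age + rng.normal(0, 5, 500)
b = -0.3 * age + rng.normal(0, 3, 500)

raw = np.corrcoef(a, b)[0, 1]      # strong, driven only by the shared age trend
partial = partial_corr(a, b, age)  # close to zero once age is removed
```

In this example the raw correlation is large only because both markers change with age, while the partial correlation shows that they carry independent information, which is the property sought when selecting indicators for the model.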

Table 4: Regression Summary For Dependent Variable: Age.

R= 0.647; R2= 0.418; Adjusted R2= 0.399; F (3, 89) = 21.347; p<0.00001; Std. Err. of estimate: 8.811

Indicators | b*     | SE of b* | b      | SE of b | t(89)  | p-level
Intercept  |        |          | 69.14  | 11.806  | 5.856  | <0.001
ΔBP, mm Hg | 0.439  | 0.085    | 0.452  | 0.088   | 5.16   | <0.001
HR, bpm    | -0.267 | 0.083    | -0.473 | 0.147   | -3.219 | 0.00179
ΔHR, bpm   | -0.226 | 0.082    | -0.521 | 0.19    | -2.736 | 0.00752

Note: R, multiple correlation coefficient of the model; R2, coefficient of determination of the model; Adjusted R2, R2 adjusted for the number of predictors in the model; F, Fisher's F-test; t, Student's t-test; p, significance level; Std. Err. of estimate, standard error of the estimate; Intercept, constant term of the equation; b, regression coefficient; SE of b, standard error of the regression coefficient; b*, standardized regression coefficient; SE of b*, standard error of the standardized coefficient

Although pulse rates in the supine and standing positions are highly correlated with each other, they do not correlate with blood pressure. Therefore, for the BA model, we can select the heart rate in the supine position due to its high correlation with age. We also include the heart rate increase upon standing. This indicator is highly correlated with age and does not correlate with the other selected indicators. The multiple regression equation for these three selected indicators with age is then calculated (Table 4). The following formula (Model 1) is used to calculate the predicted age (PA) from hemodynamic indicators:

PA, years = 0.452×ΔBP, mm Hg - 0.473×HR, bpm - 0.521×ΔHR, bpm + 69.140

The mean absolute error (MAE ± SD) of the PA calculation is 6.71 ± 5.42 years.
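Model 1 and the error measure can be written as a short Python sketch. The function and argument names are hypothetical, and the use of the sample standard deviation (ddof=1) for the SD of the absolute errors is an assumption, since the text does not specify it.

```python
import numpy as np

def predicted_age_model1(delta_bp, hr, delta_hr):
    """Model 1: predicted age (years) from ΔBP (mm Hg), supine HR and ΔHR (bpm)."""
    return 0.452 * delta_bp - 0.473 * hr - 0.521 * delta_hr + 69.140

def mae_sd(chronological_age, estimated_age):
    """Mean and SD of absolute errors (ddof=1 is an assumption of this sketch)."""
    err = np.abs(np.asarray(chronological_age, dtype=float)
                 - np.asarray(estimated_age, dtype=float))
    return err.mean(), err.std(ddof=1)

# Sanity check at the sample means from Table 1 (ΔBP 48.05, HR 64.97, ΔHR 13.77)
pa_at_means = predicted_age_model1(48.05, 64.97, 13.77)
```

At the sample means the formula returns about 52.95 years, close to the mean chronological age of 52.97 years, as expected for a least-squares fit with an intercept.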

Figure 1 illustrates the relationship between the predicted and chronological ages of all participants. The correlation between the predicted and chronological age is 0.647. The coefficient of determination (R2), which represents the proportion of explained variation in chronological age, is 0.418. This accounts for less than half of the variation in the dependent variable and indicates the model's low predictive power. Figure 2 displays the differences between chronological and predicted ages. A notable pattern is the overestimation of predicted age in younger individuals and its underestimation in older individuals.

Figure 1: Correlation between the Predicted Age of the Examined Individuals and Their Chronological Age (Model 1).

Figure 2: Observed Values (CA) Vs. Residuals (CA-BA) (Model 1).

This is a typical feature of regression equations when the prediction line deviates from the ideal 45-degree angle (where predicted age equals chronological age). To correct for this systematic bias, the regression equation derived from the residuals (as shown in Figure 2) is used to refine the initial age prediction equation. This results in the following calculation equation for BA (Model 2):

BA, years = 0.452×ΔBP, mm Hg - 0.473×HR, bpm - 0.521×ΔHR, bpm + 69.140 + (-30.80 + 0.582×CA)

The mean absolute error (MAE ± SD) of the BA calculation is 4.39 ± 3.44 years.
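The correction coefficients can be cross-checked against values already reported. For a least-squares model with an intercept, regressing the residual (CA - PA) on CA gives a slope of 1 - R2 and an intercept of -(1 - R2) × mean CA. With R2 = 0.418 (Table 4) and a mean age of 52.97 years (Table 1), this yields 0.582 and about -30.83, which agrees with the reported 0.582 and -30.80 up to rounding. The sketch below applies Model 2 and verifies the identity numerically on synthetic data; all names are illustrative.

```python
import numpy as np

def biological_age_model2(delta_bp, hr, delta_hr, ca):
    """Model 2: Model 1 prediction plus the residual correction, which uses CA."""
    pa = 0.452 * delta_bp - 0.473 * hr - 0.521 * delta_hr + 69.140
    return pa + (-30.80 + 0.582 * ca)

# Cross-check of the correction coefficients from the reported summary values
r2, mean_ca = 0.418, 52.97
slope = 1.0 - r2                  # 0.582
intercept = -slope * mean_ca      # about -30.83

# Numerical check of the identity on synthetic data (NOT the study data)
rng = np.random.default_rng(2)
ca = rng.uniform(30, 80, 200)
marker = 20 + 0.5 * ca + rng.normal(0, 10, 200)
b1, b0 = np.polyfit(marker, ca, 1)            # least-squares fit of age on the marker
pa = b0 + b1 * marker
res_slope, res_intercept = np.polyfit(ca, ca - pa, 1)
r2_syn = np.corrcoef(pa, ca)[0, 1] ** 2       # R2 of the synthetic model

ba_at_means = biological_age_model2(48.05, 64.97, 13.77, 52.97)
```

Because the correction term takes CA as an input, Model 2 requires the chronological age of the person being assessed.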

Figure 3: Correlation between the Biological Age of the Examined Individuals and Their Chronological Age after Correction of the Regression Equation Error (Model 2).

As seen in Figure 3, this equation allows us to predict a person's age much more accurately based on their data. It explains 80% of the variation in the dependent variable, and the regression line now passes through the origin at a 45-degree angle. This eliminates the distortion of age calculations at both the beginning and end of the regression line. Figure 4 confirms this by showing no relationship between the differences in chronological and biological ages and CA. Overall, the resulting BA model has high predictive power and can be successfully used in practice to assess age-related changes in the cardiovascular system.

Figure 4: Observed Values (CA) Vs. Residuals (CA-BA) (Model 2).

The second serious drawback of the multiple regression method is the inability to incorporate new indicators into the established formula for calculating BA without collecting a new dataset. This limitation arises from the fact that the method considers all aging biomarkers in terms of their interdependence, not as isolated variables.

Alternative Method for BA Calculation

An alternative method for constructing a BA model is proposed to overcome this limitation. Unlike multiple regression, this approach considers each biomarker of aging separately [13].

Here is how the method works:

  • For each biomarker, a separate regression equation is calculated, which models the biomarker's relationship with age based on data from a healthy population.
  • For a given individual, the degree of age-related changes is calculated for each biomarker. This is done by taking the ratio of the individual’s measured biomarker value to its calculated expected value for their chronological age.
  • The different importance of each indicator in assessing aging is accounted for by its correlation coefficient with age.
  • Finally, a weighted average of all the ratios is calculated to get a single BA score. The ratios are weighted by their respective correlation coefficients, and the sum is then divided by the total sum of all correlation coefficients. This final value represents the overall age-related changes for all indicators combined.

The formula for calculating BA using the alternative method is as follows:

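BA = (R1 × X1/X1c + R2 × X2/X2c + … + Rn × Xn/Xnc) / (R1 + R2 + … + Rn)

Where:

  • R1−Rn are the correlation coefficients of the indicators with age (taken as absolute values);
  • X1−Xn are the individual's measured values of the aging biomarkers;
  • X1c−Xnc are the calculated biomarker values that are normal for a person of that age.

The weighted mean ratio is dimensionless; in Models 3 and 4 below it is multiplied by CA to express BA in years.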
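The steps listed above can be sketched in Python as follows. The reference sample is synthetic, and the names `fit_indicator`, `expected_value`, and `aging_ratio` are hypothetical. For indicators that decrease with age the ratio is inverted, following the note that accompanies Model 3; detecting this from the sign of the fitted slope is a choice made for this sketch.

```python
import numpy as np

def fit_indicator(age, values):
    """Steps 1 and 3: regression of one biomarker on age and its weight |R|."""
    age = np.asarray(age, dtype=float)
    values = np.asarray(values, dtype=float)
    slope, intercept = np.polyfit(age, values, 1)
    r = np.corrcoef(age, values)[0, 1]
    return {"intercept": float(intercept), "slope": float(slope),
            "weight": abs(float(r))}

def expected_value(model, ca):
    """Biomarker value expected at chronological age `ca`."""
    return model["intercept"] + model["slope"] * ca

def aging_ratio(ca, measured, models):
    """Steps 2 and 4: weighted mean of the measured/expected ratios."""
    num = den = 0.0
    for x, m in zip(measured, models):
        xc = expected_value(m, ca)
        # Invert the ratio for indicators that decrease with age
        ratio = xc / x if m["slope"] < 0 else x / xc
        num += m["weight"] * ratio
        den += m["weight"]
    return num / den

# Synthetic reference sample (NOT the study data)
rng = np.random.default_rng(3)
ref_age = rng.uniform(30, 80, 300)
marker_up = 20 + 0.5 * ref_age + rng.normal(0, 8, 300)
marker_down = 76 - 0.2 * ref_age + rng.normal(0, 6, 300)
models = [fit_indicator(ref_age, marker_up), fit_indicator(ref_age, marker_down)]

ca = 50.0
typical = [expected_value(m, ca) for m in models]
ratio_typical = aging_ratio(ca, typical, models)   # 1 by construction
older = [typical[0] * 1.10, typical[1]]            # first marker 10% above normal
ratio_older = aging_ratio(ca, older, models)       # greater than 1
```

A person whose values all equal the age norm gets a ratio of 1, and deviations in the direction of aging push the ratio above 1 in proportion to the weight of the affected indicator.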

The regression equations with age for the aging indicators discussed above are presented in Table 5.

Table 5: Regression Equations of Indicators with Age.

Indicator  | Regression of the indicator with age
ΔBP, mm Hg | ΔBPc = 19.837 + 0.533×Age; R1 = 0.549; p<0.00001
HR, bpm    | HRc = 76.411 - 0.216×Age; R2 = 0.382; p=0.0002
ΔHR, bpm   | ΔHRc = 21.449 - 0.145×Age; R3 = 0.333; p<0.0011

The formula for calculating BA based on hemodynamic data will be as follows (model 3):

BA, years = CA × (0.549×ΔBP/ΔBPc + 0.382×HRc/HR + 0.333×ΔHRc/ΔHR) / (R1 + R2 + R3)

Note: If an indicator decreases with age, the ratio is inverted: the calculated (normal) value is divided by the actual value for a given person.

As seen in Figure 5, this equation allows for a much more accurate prediction of a person's age, explaining 82% of the variation in the dependent variable. The regression line passes nearly through the origin at a 45-degree angle, indicating the absence of the systematic error typical of multiple regression.

Figure 5: Correlation between the Biological Age (BA) Of the Examined Individuals and Their Chronological Age (Model 3).

Figure 6: Observed Values (CA) Vs. Residuals (CA-BA) (Model 3).

Figure 6 further confirms this by showing no relationship between the differences in chronological and biological ages and CA. The mean absolute error (MAE ± SD) of the BA calculation is 5.97 ± 4.99 years. This level of accuracy is high enough for clinical applications. The error is smaller than that of the multiple regression equation before correction (6.71 years) and larger than after correction (4.39 years).

To show the flexibility of this method, we can add another indicator to Model 3. Let's use breath holding time on inhalation (BHIT) as an example.

BHITc, s = 79.679 - 0.618×Age, years; R4 = 0.479; p<0.0001.

The formula for calculating the BA will be as follows (model 4):

BA, years = CA × (0.549×ΔBP/ΔBPc + 0.382×HRc/HR + 0.333×ΔHRc/ΔHR + 0.479×BHITc/BHIT) / (R1 + R2 + R3 + R4)

The mean absolute error (MAE ± SD) of the BA calculation is 5.22 ± 4.40 years. After adding the new indicator, the error in the BA calculation becomes smaller (5.22 versus 5.97 years for Model 3).

The key advantage of this approach is that it allows for the use of incomplete databases or data from other studies on individual indicators to obtain average values for different age groups. This makes it possible to calculate a person’s BA and aging rate even if not all tests were performed, though this does reduce the reliability of the assessment.
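A minimal sketch of how Models 3 and 4 could handle a partial set of tests is given below, using the regression equations and weights reported above. Skipping a missing indicator and renormalizing the weights over the remaining ones is an assumption of this sketch; the text does not prescribe a specific rule. The dictionary keys, the function name, and the example person are hypothetical.

```python
# Per-indicator regression on age and weight |R|, as reported in the text.
# Each entry: (intercept, slope per year, weight)
INDICATORS = {
    "delta_bp": (19.837, 0.533, 0.549),
    "hr":       (76.411, -0.216, 0.382),
    "delta_hr": (21.449, -0.145, 0.333),
    "bhit":     (79.679, -0.618, 0.479),
}

def biological_age(ca, measurements):
    """BA in years from whichever indicators are present in `measurements`.

    Missing (absent or None) indicators are skipped and the weights are
    renormalized over the remaining ones (assumption of this sketch).
    """
    num = den = 0.0
    for name, (intercept, slope, weight) in INDICATORS.items():
        value = measurements.get(name)
        if value is None:
            continue
        expected = intercept + slope * ca
        # Indicators that decrease with age use the inverted ratio
        ratio = expected / value if slope < 0 else value / expected
        num += weight * ratio
        den += weight
    if den == 0.0:
        raise ValueError("at least one indicator is required")
    return ca * num / den

# Hypothetical 50-year-old; values chosen only for illustration
person = {"delta_bp": 55.0, "hr": 62.0, "delta_hr": 12.0}
ba_model3 = biological_age(50.0, person)                    # BHIT not measured
ba_model4 = biological_age(50.0, {**person, "bhit": 40.0})  # BHIT available
```

With BHIT missing the function reduces to Model 3, and with all four indicators it reproduces Model 4. For this hypothetical person the two estimates are about 57.3 and 58.3 years. The fewer indicators are available, the more a single measurement dominates the result, which is why the reliability of the assessment decreases.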

Conclusion

The MLR method, while a common approach for calculating BA, has significant inherent drawbacks. These include a systematic error that overestimates age in younger individuals and underestimates it in older individuals, as well as the inability to incorporate new biomarkers without a complete recalculation of the model due to their interdependence (collinearity).

To address these shortcomings, two new methods were proposed. The first method focuses on correcting the standard MLR equation. Following a correction using residuals, the model demonstrates a significantly higher predictive accuracy, explaining 80% of the variation in the dependent variable. The regression line now passes through the origin at a 45-degree angle, indicating the elimination of the systematic error. The average absolute error (MAE) decreased from 6.71 ± 5.42 years to 4.39 ± 3.44 years, showing substantial improvement.

The second, alternative method does not rely on multiple regression. Instead, it calculates a separate regression equation for each biomarker and then computes a weighted average of the resulting ratios. This approach demonstrated high predictive power, explaining 82% of the dependent variable's variation. Its key advantage lies in its flexibility: it allows for the addition of new indicators without the need to collect a new dataset. Although the use of incomplete data may reduce the reliability of the assessment, this method makes it possible to calculate BA even with a partial set of tests.

In conclusion, both proposed methods, the corrected MLR model and the alternative approach, outperform standard multiple regression in terms of accuracy and reliability. They offer high predictive power, making them valuable tools for assessing age-related changes in clinical practice.

References

  1. Anstey KJ, Lord SR, Smith GA. Measuring human functional age: a review of empirical findings. Exp Aging Res. 1996; 22: 245-266.
  2. Murabito JM, Zhao Q, Larson MG, Rong J, Lin H, Benjamin EJ, et al. Measures of biologic age in a community sample predict mortality and age-related disease: the Framingham Offspring Study. J Gerontol Ser A Biol Sci Med Sci. 2018; 73: 757-762.
  3. Jia L, Zhang W, Chen X. Common methods of biological age estimation. Clin Interv Aging. 2017; 12: 759-772.
  4. Sebastiani P, Thyagarajan B, Sun F, Schupf N, Newman AB, Montano M, et al. Biomarker signatures of aging. Aging Cell. 2017; 16: 329-338.
  5. Aykroyd RG, Lucy D, Pollard AM, Solheim T. Technical note: regression analysis in adult age estimation. Am J Phys Anthropol. 1997; 104: 259-265.
  6. Singh B, Krishan K, Kaur K, Kanchan T. Stature estimation from different combinations of foot measurements using linear and multiple regression analysis in a North Indian male population. J Forensic Leg Med. 2019; 62: 25-33.
  7. Kroll J, Saxtrup O. On the use of regression analysis for the estimation of human biological age. Biogerontology. 2000; 1: 363-368.
  8. Kumari K, Yadav S. Linear regression analysis study. J Pract Cardiovasc Sci. 2018; 4: 33-36.
  9. Dubina TL, Dyundikova VA, Zhuk EV. Biological age and its estimation. II. Assessment of biological age of albino rats by multiple regression analysis. Exp Gerontol. 1983; 18: 5-18.
  10. Dubina TL, Mints AY, Zhuk EV. Biological age and its estimation. III. Introduction of a correction to the multiple regression model of biological age in cross-sectional and longitudinal studies. Exp Gerontol. 1984; 19: 133-143.
  11. Krutko VN, Smirnova TM, Dontsov VI, Borisov SE. Diagnosing aging: I. Problem of reliability of linear regression models of biological age. Human Physiol. 2001; 27: 725-731.
  12. Klemera P, Doubal S. A new approach to the concept and computation of biological age. Mech Ageing Dev. 2006; 127: 240-248.
  13. Reshetyuk AL, Polyakov OA, Korobeinikov GV. Determination of functional age and rates of human aging. Methodical recommendations. 1996.