# A new approach to categorising continuous variables in

## Transcript Of A new approach to categorising continuous variables in

Article

A new approach to categorising continuous variables in prediction models: Proposal and validation

Statistical Methods in Medical Research 0(0) 1–21 ! The Author(s) 2015 Reprints and permissions: sagepub.co.uk/journalsPermissions.nav DOI: 10.1177/0962280215601873 smm.sagepub.com

Irantzu Barrio,1,2 Inmaculada Arostegui,1,2,3 Mar´ıa-Xose´ Rodr´ıguez-A´ lvarez4 and Jose´ -Mar´ıa Quintana2,5

Abstract When developing prediction models for application in clinical practice, health practitioners usually categorise clinical variables that are continuous in nature. Although categorisation is not regarded as advisable from a statistical point of view, due to loss of information and power, it is a common practice in medical research. Consequently, providing researchers with a useful and valid categorisation method could be a relevant issue when developing prediction models. Without recommending categorisation of continuous predictors, our aim is to propose a valid way to do it whenever it is considered necessary by clinical researchers. This paper focuses on categorising a continuous predictor within a logistic regression model, in such a way that the best discriminative ability is obtained in terms of the highest area under the receiver operating characteristic curve (AUC). The proposed methodology is validated when the optimal cut points’ location is known in theory or in practice. In addition, the proposed method is applied to a real data-set of patients with an exacerbation of chronic obstructive pulmonary disease, in the context of the IRYSS-COPD study where a clinical prediction rule for severe evolution was being developed. The clinical variable PCO2 was categorised in a univariable and a multivariable setting.

Keywords Categorisation, prediction models, cut point, validation

1 Introduction

Prediction models are currently relevant in a number of ﬁelds, including medicine. Decisions such as the most appropriate treatment for a disease, or whether or not a given patient should be discharged,

1Departamento de Matema´tica Aplicada, Estad´ıstica e Investigacio´ n Operativa, Universidad del Pa´ıs Vasco UPV/EHU, Leioa, Spain. 2Red de Investigacio´ n en Servicios de Salud en Enfermedades Cro´ nicas (REDISSEC), Galdakao, Spain. 3BCAM – Basque Center for Applied Mathematics, Bilbao, Spain. 4Departamento de Estad´ıstica e Investigacio´ n Operativa. Universidade de Vigo, Vigo, Spain. 5Unidad de Investigacio´ n, Hospital Galdakao-Usansolo, Galdakao, Spain. Corresponding author: Irantzu Barrio, Departamento de Matema´tica Aplicada, Estad´ıstica e Investigacio´ n Operativa, Universidad del Pa´ıs Vasco UPV/EHU Email: [email protected]

Downloaded from smm.sagepub.com by guest on September 30, 2015

2

Statistical Methods in Medical Research 0(0)

etc., are based on the individual patient’s risk of suﬀering some unfavourable event, and such a risk is often measured on the basis of clinical variables that are continuous in nature.

When developing prediction models, the selection of the predictors or covariates (clinical variables) to be used in the model is essential. From a statistical point of view, categorising continuous variables is not regarded as advisable, since it may entail a loss of information and power.1,2 Additionally, there are statistical modelling techniques such as generalised additive models (GAM),3,4 which do not require any assumption of linearity between predictors and response variables, and so allow for the relationship between the predictor and the outcome to be modelled more appropriately. Yet, in clinical research and, more speciﬁcally, in the development of prediction models for use in clinical practice, both clinicians and health managers call for the categorisation of continuous variables. Indeed, in a recent survey of the epidemiological literature, in the 86% of the papers included in the study, the primary continuous predictor was categorised, of which 78% used 3–5 categories.5 In our opinion, there are several reasons for this. First, in clinical practice, the application of results obtained from techniques such as GAM is not always viable. It requires speciﬁc software which is not always possible to use at consulting rooms or emergency departments. On the other hand, decisions in clinical practice are often taken on the basis of an individual patient’s risk level, which is strongly related to a categorisation of that patient’s clinical variables. Yet, despite the fact that categorisation is a common practice in clinical research, there are no uniﬁed criteria for categorising continuous variables. Indeed, categorisation is very often based on percentiles, even though this is known to have drawbacks.6 Moreover, even when categorisation is based on clinical criteria, it has been shown that it can vary enormously from one practitioner or hospital (or even country) to another. For instance, a meta-analysis conducted by Lim and Kelly7 showed that reported cut-oﬀ values for partial pressure of carbon dioxide in the blood (PCO2) for hypercapnia screening ranged from 30 to 46 mmHg.

Previous work has been done on the categorisation of continuous variables. A review of these methods shows that these have been based: ﬁrst, on the graphical relationship between the predictor and the outcome, and second, on the minimum p-value approach.8 Moreover, the aim in almost all cases has been to seek a single cut point, or, expressed in another way, to dichotomise the continuous predictor.9–11 However, the use of more than two categories may be preferable, since this serves to reduce the loss of information and enables the relationship between the covariate and the response variable to be retained. In the context where the outcome of interest takes only two possible values, the search for more than one cut point has been considered for instance by Tsuruta and Bax12 and Barrio et al.13 Tsuruta and Bax propose a parametric method for obtaining cut points based on the overall discrimination C-index,14 which is equivalent to the area under the receiver operating characteristic curve (AUC). The authors showed the optimal location of cut points in a case where the distribution of the predictor variable is known, and illustrated the proposal for application to a normal distribution. Yet, in routine clinical practice and, by extension, in medical research, variables of interest do not usually respond to either a normal or a known distribution. On the other hand, Barrio et al. proposed a method based on a graphical display using GAM with P-spline smoothers to determine the relationship between the continuous predictor and the outcome.

Despite the fact that both approaches have proven to be useful, they suﬀer from the limitation of only being applicable in a univariable setting. Accordingly, we propose in this paper a new approach for the selection of optimal cut points that allows for more than one cut point to be selected as well as the possibility of being used in a multivariable setting. Speciﬁcally, our study has two main aims: ﬁrst, to propose a new approach for the selection of optimal cut points for categorising continuous variables in logistic prediction models, and second, to validate the proposed approach. Two diﬀerent

Downloaded from smm.sagepub.com by guest on September 30, 2015

Barrio et al.

3

algorithms, called AddFor and Genetic, are proposed for the selection of the cut points which maximise the AUC, and the performance is evaluated and compared by means of simulations. Validation of the categorisation method is performed in two diﬀerent settings: (1) under deﬁned theoretical conditions, where the optimal cut points are known; and, (2) under empirical situations where the original variable is observed as categorical although an underlying continuous latent variable is supposed.

The rest of the paper is organised as follows. Section 2 provides a description of the IRYSSCOPD study of patients suﬀering from an exacerbation of chronic obstructive pulmonary disease which motivated the development of the methodology presented in this paper. Section 3 outlines the method proposed for categorising continuous variables in clinical prediction models where the response variable is dichotomous. In Section 4, the validation process is described, with a comparison of both cut point selection algorithms in diﬀerent settings, namely, under theoretical and empirical deﬁned conditions. Additionally, the results obtained from the validation study are reported. Section 5 describes the software implementation. Section 6 describes the application of the proposed methodology to the IRYSS-COPD study data-set. Finally, the paper closes with a discussion in Section 7 in which the ﬁndings are reviewed and conclusions are drawn.

2 The IRYSS-COPD study

Chronic obstructive pulmonary disease (COPD) is one of the most common chronic diseases, and its prevalence is expected to increase over the next few decades.15 COPD is a leading cause of death in developed countries, and patients with COPD generally have a substantial deterioration in their quality of life.16 Exacerbation of COPD (eCOPD) is deﬁned as an event in the natural course of a patient’s COPD characterised by a change in baseline dyspnea, cough, and/or sputum that is beyond normal day-to-day variations and that may have warranted a change in medication or treatment.17 Patients often experience eCOPD, and these often require assessment in an emergency department (ED) and hospitalisation. Exacerbations play a major role in the burden of COPD, its evolution and its cost.18 Some exacerbations are quite severe, leading to death or the need for invasive mechanical ventilation (IMV); others are more moderate, requiring little more than an adjustment of the patient’s current medical therapy. Currently, ED physicians must rely largely on experience and personal criteria for gauging how an eCOPD will evolve. Accordingly, the development of clinical prediction rules in this context would be of great importance to help ED physicians to make betterinformed decisions about treatment.

The IRYSS-COPD study (IRYSS: Red de investigacio´ n cooperativa para la Investigacio´ n en Resultados de Salud y Servicios Sanitarios – Co-operative Health Outcomes & Health Services Research Network) was created to address gaps for identifying eCOPD patients whose clinical situation is appropriate for admission to the hospital, and to develop and validate severity scores for eCOPD exacerbations.19 In this study, a sample of 2487 patients with eCOPD attending the EDs of 16 participating hospitals in Spain was collected. Information was recorded as follows: at the date on which patients were evaluated at the ED; at the date on which the decision was made to admit patients or discharge them home from the ED; and during follow-up after admission to the hospital or discharge home. Data collected upon arrival in the ED included socioeconomic data, information about the patient’s respiratory function (arterial blood gases, respiratory rate, dyspnea), and presence of other pathologies recorded in the Charlson Comorbidity Index. The consciousness level was measured by the Glasgow coma scale which was dichotomised as follows: altered consciousness deﬁned as a score of < 15 points,

Downloaded from smm.sagepub.com by guest on September 30, 2015

4

Statistical Methods in Medical Research 0(0)

unaltered consciousness as a score of 15 points.20 Additional data collected in the ED at the time a decision was made to admit or discharge the patient included the patient’s symptoms, signs and respiratory status at that moment.

One of the goals of the IRYSS-COPD study was to develop a clinical prediction rule for the short-term, very severe evolution of eCOPD deﬁned as any of the following: death, Intensive Care Unit (ICU) admission, need for IMV, and/or cardiac arrest. After a preliminary analysis, the predictors selected to be included in the prediction model were the Glasgow Coma scale, heart rate and the arterial blood gas PCO2. However, the covariate PCO2 had not a linear relationship with the outcome and hence it required to be introduced either modelled with a smooth function or in a categorised version. The clinical researchers involved in the study claimed for a categorised version of this predictor, but as mentioned earlier, there was not a previously ﬁxed cut point criteria in the literature.7 The authors previously proposed the categorisation of the covariate PCO2 which relayed on a graphical display.13 However, at this time we considered developing a more general methodology to obtain optimal cut points to categorise the continuous predictor variable PCO2, so that we could obtain the best categorised version to be introduced in the short-term, very severe evolution of eCOPD prediction model.

3 Methods

This section describes the methodology proposed to categorise continuous predictors in logistic regression models. Once the needed background and notation have been introduced, we describe the proposed methodology and the two algorithms for its implementation in Section 3.1. It should be noted that when developing the logistic regression model, the obtained AUC might be biased upward when the same data-set is used to ﬁt the model and compute the AUC. Accordingly, we considered correcting the overestimation of the AUC in the logistic model, which is explained in detail in Section 3.2. In addition to this, in clinical practice it might be needed to select which is the most desirable number of cut points. Therefore, in Section 3.3 we present two possible approaches to select the best number of cut points.

Suppose one has a dichotomous response variable Y, and a continuous predictor X. Furthermore, assume that the outcome variable has been coded as 0 or 1, representing the absence or the presence of the outcome characteristic, respectively. Then, the logistic regression model for Y is written as a linear function in the logistic transformation (logit) as it is shown in equation (1), where 0 is the intercept and 1 is the regression coeﬃcient for X.

PðY ¼ 1jXÞ logitðPðY ¼ 1jXÞÞ ¼ log 1 À PðY ¼ 1jXÞ ¼ 0 þ 1X ð1Þ

For a binary outcome, the AUC is the most commonly used performance measure to evaluate the

discriminative

ability

of

a

prediction

model.

More

speciﬁcally,

given

a

sample

fðxi

,

y

i

Þg

N i¼

1

,

the

coeﬃcients 0 and 1 are estimated by maximum likelihood and an iterative weighted least squares algorithm (denote ^0 and ^1). More detail about estimation methods can be seen in McCullagh and Nelder.21 Let p^i ¼ logitÀ1ð^0 þ ^1xiÞ be the estimated probability for subject i.

The AUC is frequently estimated by the Mann–Whitney statistic22 which is given as

Ad UC ¼ 1 X n0n1

X I ½p^j, p^m

j2DY¼0 m2DY¼1

Downloaded from smm.sagepub.com by guest on September 30, 2015

Barrio et al.

5

where DY¼1 and DY¼0 are the sets of subjects with Y ¼ 1 and Y ¼ 0, respectively, n1 and n0 are the

sizes of these sets and I ½ is the indicator function adjusted for ties 8 < 1 if p^j 5 p^m

I ½p^j, p^m ¼ : 0:5 if p^j ¼ p^m 0 otherwise:

3.1 Proposed methodology

Assuming that the continuous predictor X is what one wishes to categorise, our proposal consists of categorising X such that the best predictive logistic model is obtained for Y. Speciﬁcally, given k the number of cut points set for categorising X in k þ 1 intervals, let us denote vk ¼ ðx1, . . . , xkÞ the vector of k cut points ordered from smaller to larger, and Xcatk the corresponding categorised variable taking values from 0 to k. Then, what we propose is that the vector of k cut points vk ¼ ðx1, . . . , xkÞ, which maximises the AUC of the logistic regression model shown in equation (2), is thus the vector of the optimal k cut points.

Xk PðY ¼ 1jXcatk Þ ¼ logitÀ1ð0 þ q1fXcatk ¼qgÞ ð2Þ

q¼1

Estimation of the model in equation (2) as well as of the associated AUC can be done as presented before for the model in equation (1). However, the problem now lies in looking for the vector of the cut points which maximises the AUC. To achieve this, we propose two alternative algorithms, respectively, named AddFor and Genetic.

3.1.1 AddFor Using this algorithm, one cut point is searched for at a time. In other words, it ﬁrst seeks x1 (in a grid of size M of equally spaced values in the range of X), such that the AUC of the logistic regression model shown in equation (2) for k ¼ 1 will be maximised. Once x1 has been selected, it is ﬁxed and the algorithm proceeds to seek x2 (in the grid of size M) (x2 6¼ x1), so as to ensure that the AUC of the model in (2) for k ¼ 2 will be maximised. The process is then repeated until the vector of k cut points, vk ¼ ðx½1, . . . , x½kÞ, has been obtained, with x½o denoting the o-th ordered cut point.

3.1.2 Genetic Using Genetic Algorithms, the most widely known type of evolutionary algorithm,23 this method simultaneously ﬁnds the vector of k cut points, vk ¼ ðx1, . . . , xkÞ, which maximises the AUC of the logistic regression model in equation (2). Evolutionary algorithms are inspired by the concept of natural evolution. The underlying idea is that, given a population of individuals, environmental pressure leads to survival of the ﬁttest, leading in turn to a rise in the overall ﬁtness of the population. In a more mathematical context, given a function to be maximised (ﬁtness function), a collection of heuristic rules are used to modify a population of possible solutions in such a way that each generation of potential solutions, tends to be, on average, better than its predecessor. The measure whether one potential solution is better than another is the potential solution’s ﬁtness value. In our case, the AUC is the selected ﬁtness function to be maximised and the optimal cut points would be then the best possible solution.

Downloaded from smm.sagepub.com by guest on September 30, 2015

6

Statistical Methods in Medical Research 0(0)

For both algorithms, the methodology above has been presented (for ease of notation and

illustration) for the particular case of the categorisation of a continuous covariate X in a

univariate logistic regression model. Nevertheless, it can be easily extended to the categorisation

of a continuous covariate X in a multiple logistic regression model. Suppose that along with the

predictor variable X we want to categorise, a set of other p predictors, Z1, . . . , Zp, are of interest.

Then, the categorisation of X in a multivariable setting including the p predictors will be that for

which the AUC of the multiple logistic regression model in equation (3) is maximised.

Xp X pþk ! PðY ¼ 1jðZ1, . . . , Zp, Xcatk ÞÞ ¼ logitÀ1 0 þ rZr þ q1fXcatk ¼qÀpg : ð3Þ

r¼1

q¼pþ1

3.2 Optimism correction for the AUC

When implementing the algorithms presented in the previous section, the obtained AUC may be biased

upward when the same data-set is used to: (a) ﬁt the logistic regression model (involved in the cut point selection process) and (b) compute the AUC.24 In our setting, the aim was to look for the vector vk that maximises the AUC of the corresponding logistic model. Thus, the overestimation of the AUC may

have an impact in the maximisation process itself and therefore on the selection of the optimal cut

points. Several approaches for correcting the bias of the estimated discriminative ability of a predictive model have been proposed in the statistical literature.25,26 In this work, the proposal is based on the bootstrap bias correction method proposed by Steyerberg.26 Moreover, the bias correction procedure

was performed at two diﬀerent levels. In the ﬁrst approach, the bias correction was performed during

the selection of the optimal cut points. In the second approach, however, the bias correction procedure

was applied once the optimal cut points had been selected. Appendix A of the web supporting material

shows the results of a simulation study performed to evaluate the impact of the bias correction

approaches (at ﬁrst and second level) on the selection of the optimal cut points. As can be observed

on the results, both approaches provide similar results. Hence, and due to computational cost savings,

we propose to correct the AUC at the end of the cut point selection process. Speciﬁcally, in this

approach, the bootstrap bias correction method can be described as follows: Step 1. Categorise the predictor variable on the basis of the original sample fðxi, yiÞgNi¼1 and

compute the corresponding AUC. Let us denote this apparent AUC as Ad UCapp. Step 2. For b ¼ 1, . . . , B, generate the bootstrap resample fðxÃib, yÃibÞgNi¼1 by drawing a random

sample of size N with replacement from the original sample, and categorise the bootstrapped predictor fxÃibgNi¼1 on the basis of the optimal cut points obtained in Step 1.

Step 3. Fit the logistic regression model to the bootstrap resample with the categorised version of the predictor and compute the corresponding AUC, Ad UCbboot for b ¼ 1, . . . , B.

Step 4. Obtain the predicted probabilities for the original sample based on the ﬁtted logistic regression model obtained in Step 3 and compute the AUC. Let us denote this AUC as Ad UCbo for b ¼ 1, . . . , B.

Once the above process has been completed, the optimism O of the original AUC is calculated as

follows

O ¼ 1 XB jAd UCb À Ad UCbj

B

boot

o

b¼1

Downloaded from smm.sagepub.com by guest on September 30, 2015

Barrio et al.

7

and the bias corrected AUC is then computed as Ad UCapp À O. Finally, we would like to point out that in order to mimic the study design, it is advisable that the

resampling procedure described in Step 2 be done according to the design of the study. For instance, for a case-control study, data should be resampled separately within cases and controls. Moreover, if the data are clustered, the resampling units should be the clusters.

3.3 Selection of the number of cut points

To determine the optimal number of cut points, we studied two possible approaches. The ﬁrst approach is based on the diﬀerence between the bias-corrected AUCs obtained for k ¼ l and k ¼ l þ 1 cut points. To determine the need for an extra optimal cut point, we propose to compute the conﬁdence interval (CI) for this diﬀerence. Once the CI has been computed, an extra cut point is considered to be needed as long as the CI does not contain the zero. Speciﬁcally, in this paper bootstrap-based methods27 are proposed for constructing the CIs. The procedure can be summarised as follows:

(1) For v ¼ 1, . . . , V, generate the bootstrap resample fðxÃiv, yÃivÞgNi¼1 by drawing a random sample of size N with replacement from the original sample.

(2) Compute the bias-corrected AUC for the categorised variable for k ¼ l and k ¼ l þ 1 and denote it as Ad UCÃl,v and Ad UCÃlþ1,v, respectively. The bias-corrected AUC is computed as explained in Section 3.2, but using for Step 1 the optimal cut points obtained for k ¼ l and k ¼ l þ 1 on the basis of the original sample.

(3) Compute the diﬀerence between the bias-corrected AUCs obtained for k ¼ l þ 1 and k ¼ l.

Ad UCÃDiff,v ¼ Ad UCÃlþ1,v À Ad UCÃl,v:

Once the above process has been completed, the ð1 À Þ % limits for the CI for the diﬀerence are

given by

Ad UCD=i2ff, Ad UC1DÀiff =2

where Ad UCpDiff represents the p-percentile of the estimated Ad UCÃDiff,v ðv ¼ 1, . . . , VÞ. The second criterion used to evaluate the need for an extra optimal cut point was the integrated

discrimination improvement (IDI) index, proposed by Pencina et al.28 which in our setting can be deﬁned as shown in equation (1) in Pepe et al.29

IDI ¼ E½ plþ1 À pl jY ¼ 1 À E½ plþ1 À pl jY ¼ 0,

where pk ¼ PðY ¼ 1jXcatk Þ. The IDI is a useful measure to compare and assess the improvement in terms of risk prediction of

two predictive models. Accordingly, in our particular setting, the IDI can be a useful measure to evaluate the improvement oﬀered by adding an extra cut point. In particular, we propose the criterion that an extra cut point is needed as long as an statistically signiﬁcant IDI is obtained when comparing the ﬁtted logistic regression models obtained with k ¼ l and k ¼ l þ 1 cut points.

Downloaded from smm.sagepub.com by guest on September 30, 2015

8

Statistical Methods in Medical Research 0(0)

4 Validation study

This section reports the results of a simulation study conducted to analyse the empirical performance of the methods described in Section 3 above. Validation was provided at two diﬀerent levels, i.e. in a theoretical setting and in a backward process. Both settings are explained in detail below.

All computations were performed using the (64 bit) R 3.0.1 software package.30

4.1 Scenarios and set-up

4.1.1 Theoretical validation In the ﬁrst setting, the predictor variable X was simulated from a normal distribution separately in each of the populations deﬁned by the outcome (Y ¼ 0 and Y ¼ 1), i.e., XjðY ¼ 0Þ ’ Nð0, 0Þ and XjðY ¼ 1Þ ’ Nð1, 1Þ. It should be noted that, when 0 and 1 are equal, the linear relationship between X and the logistic function holds. Moreover, it can be shown that for k cut points, the theoretical location of the optimal cut points can be obtained,12 as well as the AUC associated with the corresponding categorical covariate. Accordingly, the aims of this simulation study were twofold: (a) to compare the cut points obtained with the proposed methodology and the theoretical optimal cut points; and (b) to compare the obtained bias corrected AUC and the theoretical one.

The most general results are presented in the main manuscript. Nevertheless, more speciﬁc results for diﬀerent scenarios and sample sizes are presented in the web Appendixes B and C. Speciﬁcally, in the simulations presented in this manuscript, we considered XjðY ¼ 0Þ ’ Nð0, 1Þ, XjðY ¼ 1Þ ’ Nð1:5, 1Þ. The simulations were done assuming the same number of individuals in Y ¼ 0 and Y ¼ 1 and a total sample sizes of N ¼ 500 and N ¼ 1000. As far as the number of cut points is concerned, k ¼ 1, 2 and 3 was considered. Finally, for the AddFor algorithm, grid sizes of M ¼ 100 and M ¼ 1000 were used. In all cases, B ¼ 50 was considered for the AUC bias corrected procedure. Sample sizes of N ¼ 500 and N ¼ 1000 were selected to ensure a requirement commonly used on the speciﬁc framework of prediction models.26 Nevertheless, the performance of the proposed methodology was also veriﬁed for smaller sample sizes. Detailed results can be seen in Appendix B of the supporting web material. R ¼ 500 and R ¼ 1000 replicates of simulated data were performed. Both number of replicates provided similar results (not shown). Accordingly, all results shown here are based on R ¼ 500 replicates.

As pointed out before, under this setting (1 ¼ 0) the relationship between the predictor X and the logit transformation of the response Y is linear. Nevertheless, the performance of the proposed methodology when the relationship is not linear was also assessed by comparing both algorithms in a controlled situation. Results are shown in Appendix C of the supporting web material.

4.1.2 Backward validation In the second setting, we envisaged simulating a continuous variable X starting from a categorical variable whose cut points had been scientiﬁcally pre-established and assuming that they represent an underlying continuum variable. The aim was to test whether the cut points obtained by applying the proposed methodology to the continuous variable were similar to the original cut-points. For the purpose, we considered the data-set available at the IRYSS-COPD study.19 In this data-set, we selected the variable forced expiratory volume in 1 second in percentage (FEV1 %) which is a clinical variable whose categorisation into four categories (mild ! 80, moderate ½50 À 80Þ, severe ½30 À 50Þ and very severe < 30) is well established, thanks to previous research in the ﬁeld.31 This variable was available in the data-set both in the continuous and the categorical versions for a total number of L ¼ 2069 patients.

Downloaded from smm.sagepub.com by guest on September 30, 2015

Barrio et al.

9

Table 1. Backward validation study: total distribution of individuals in the four categories of forced expiratory volume in 1 s in percentage (FEV1 %) and distribution of diseased individuals in each category, under both scenarios.

Scenario I

Scenario II

FEV1 % [0,100]

Total (%)

Diseased (%)

Total (%)

Diseased (%)

Mild [80,100]

35

5

Moderate [50,80)

30

20

Severe [30,50)

20

25

Very severe [0,30)

15

40

3

0

30

4.5

47

8.6

20

14.2

To simulate the continuous covariate FEV1 %, we propose a bootstrap method starting from the

original categorical and continuous versions of FEV1 %. Let us denote X the original continuous

FEV1 % variable and Xcat the categorised variable taking values from 0 to 3, which correspond to

mild, moderate, severe and very severe categories, respectively. For each l ¼ 0, . . . , 3, consider dls as

the s-th decile of X when Xcat ¼ l. For each u ¼ 1, . . . , U and l ¼ 0, . . . , 3, we generated the bootstrap

sample

fxÃiugLi¼l 1

by

drawing

a

sample

of

size

Ll

with

replacement

from the P

original

sample

fxigLi¼l 1,

where Ll denotes the number of individuals in the l-th category (L ¼

3

Pl¼0

Ll).

We

considered

dÃls

as

the

average

of

the

U

bootstrap

deciles

of

each

category,

i.e.

dÃls

¼

1 U

U u¼1

duls.

The

continuous

variable Xsim was simulated assuming a uniform distribution in the interval ðdÃl ðsÀ1Þ, dÃlsÞ, enclosed by

the lower and upper limits of each category.

Additionally, the dichotomous response variable Y was simulated according to the two scenarios shown in Table 1 trying to mimic two possible real situations. In Scenario I, patients are distributed as 35%, 30%, 20% and 15% in mild, moderate, severe and very severe categories, respectively. In contrast, in Scenario II, only 3% of patients belong to the mild category. Additionally, the percentage of individuals with Y ¼ 1 (denoted as diseased) changes from Scenario I to Scenario II.

For each of the scenarios, R ¼ 500 replicates were conducted for total sample sizes of N ¼ 500 and N ¼ 1000, and B ¼ 50 as in the previous setting and U ¼ 10.000 bootstrap resamples were used. Optimal cut points were sought using the Genetic and AddFor algorithms, the latter with grid sizes of M ¼ 100 and M ¼ 1000.

4.2 Results

4.2.1 Theoretical validation Figure 1 depicts the boxplot of the estimated optimal cut points over 500 simulated data-sets, for each of the proposed algorithms, diﬀerent sample sizes and number of cut points. As can be observed, the cut points obtained by the Genetic or AddFor algorithms were close to the theoretical optimal cut points, with both algorithms presenting a low bias. The corresponding detailed numerical results are shown in Table 2. Under this scenario, the theoretical optimal cut points are v1 ¼ ð0:77Þ, v2 ¼ ð0:23, 1:27Þ and v3 ¼ ðÀ0:07, 0:75, 1:57Þ for k ¼ 1, 2 and 3 number of cut points, respectively. Note that the average of the estimated cut points across simulated data-sets was very similar for both algorithms with these values being very close to the theoretical optimal cut points. As expected, the diﬀerences with respect to the theoretical optimal cut points were smaller when the sample size increased from 500 to 1000. For example, for k ¼ 3 cut points, the average of the cut points obtained with the Genetic method across all replicates were v^3 ¼ ðÀ0:11, 0:76, 1:63Þ and v^3 ¼ ðÀ0:09, 0:74, 1:61Þ for sample sizes of 500 and 1000, respectively, while with the Addfor

Downloaded from smm.sagepub.com by guest on September 30, 2015

10

(a)

(b)

(c)

Statistical Methods in Medical Research 0(0)

Figure 1. Boxplot of the estimated optimal cut points based on 500 simulated data obtained according to the theoretical optimal cut point validation study and comparison with the theoretically optimal cut point (v1 ¼ ð0:77Þ, v2 ¼ ð0:23,1:27Þ and v3 ¼ ðÀ0:07,0:75,1:57Þ). From top to bottom: (a) for k ¼ 1 number of cut points; (b) for k ¼ 2 number of cut points and (c) for k ¼ 3 number of cut points.

algorithm and a grid of size 1000 they were v^3 ¼ ðÀ0:11, 0:73, 1:60Þ and v^3 ¼ ðÀ0:08, 0:74, 1:58Þ. It should be noted that, when the desired number of cut points was 2, the AddFor did not perform as well as the Genetic algorithm. While the former located only one of the two optimal cut points, the latter managed to approximate both cut points. For instance, for a sample size of 500 and k ¼ 2, the bias obtained for the estimated cut points was ð0:00, 0:04Þ using the Genetic method, and ð0:11, À 0:01Þ using the AddFor method with a grid of size 1000.

In Table 3, the average, bias and standard deviation of the bias-corrected AUC values over 500 simulated data-sets are given, for each of the proposed algorithms, diﬀerent sample sizes and number of cut points. Note that the AUC values obtained were almost unbiased, being the bias obtained less or equal to 0.02 when k ¼ 1 was chosen. Additionally, the Genetic approach generally provided slightly higher AUC values than the AddFor algorithm. However, when the AddFor grid size was increased from 100 to 1000, the obtained results were almost the same as those obtained with the Genetic algorithm. For instance, for a sample size of 500 and k ¼ 3 number of cut points, the

Downloaded from smm.sagepub.com by guest on September 30, 2015

A new approach to categorising continuous variables in prediction models: Proposal and validation

Statistical Methods in Medical Research 0(0) 1–21 ! The Author(s) 2015 Reprints and permissions: sagepub.co.uk/journalsPermissions.nav DOI: 10.1177/0962280215601873 smm.sagepub.com

Irantzu Barrio,1,2 Inmaculada Arostegui,1,2,3 Mar´ıa-Xose´ Rodr´ıguez-A´ lvarez4 and Jose´ -Mar´ıa Quintana2,5

Abstract When developing prediction models for application in clinical practice, health practitioners usually categorise clinical variables that are continuous in nature. Although categorisation is not regarded as advisable from a statistical point of view, due to loss of information and power, it is a common practice in medical research. Consequently, providing researchers with a useful and valid categorisation method could be a relevant issue when developing prediction models. Without recommending categorisation of continuous predictors, our aim is to propose a valid way to do it whenever it is considered necessary by clinical researchers. This paper focuses on categorising a continuous predictor within a logistic regression model, in such a way that the best discriminative ability is obtained in terms of the highest area under the receiver operating characteristic curve (AUC). The proposed methodology is validated when the optimal cut points’ location is known in theory or in practice. In addition, the proposed method is applied to a real data-set of patients with an exacerbation of chronic obstructive pulmonary disease, in the context of the IRYSS-COPD study where a clinical prediction rule for severe evolution was being developed. The clinical variable PCO2 was categorised in a univariable and a multivariable setting.

Keywords Categorisation, prediction models, cut point, validation

1 Introduction

Prediction models are currently relevant in a number of ﬁelds, including medicine. Decisions such as the most appropriate treatment for a disease, or whether or not a given patient should be discharged,

1Departamento de Matema´tica Aplicada, Estad´ıstica e Investigacio´ n Operativa, Universidad del Pa´ıs Vasco UPV/EHU, Leioa, Spain. 2Red de Investigacio´ n en Servicios de Salud en Enfermedades Cro´ nicas (REDISSEC), Galdakao, Spain. 3BCAM – Basque Center for Applied Mathematics, Bilbao, Spain. 4Departamento de Estad´ıstica e Investigacio´ n Operativa. Universidade de Vigo, Vigo, Spain. 5Unidad de Investigacio´ n, Hospital Galdakao-Usansolo, Galdakao, Spain. Corresponding author: Irantzu Barrio, Departamento de Matema´tica Aplicada, Estad´ıstica e Investigacio´ n Operativa, Universidad del Pa´ıs Vasco UPV/EHU Email: [email protected]

Downloaded from smm.sagepub.com by guest on September 30, 2015

2

Statistical Methods in Medical Research 0(0)

etc., are based on the individual patient’s risk of suﬀering some unfavourable event, and such a risk is often measured on the basis of clinical variables that are continuous in nature.

When developing prediction models, the selection of the predictors or covariates (clinical variables) to be used in the model is essential. From a statistical point of view, categorising continuous variables is not regarded as advisable, since it may entail a loss of information and power.1,2 Additionally, there are statistical modelling techniques such as generalised additive models (GAM),3,4 which do not require any assumption of linearity between predictors and response variables, and so allow for the relationship between the predictor and the outcome to be modelled more appropriately. Yet, in clinical research and, more speciﬁcally, in the development of prediction models for use in clinical practice, both clinicians and health managers call for the categorisation of continuous variables. Indeed, in a recent survey of the epidemiological literature, in the 86% of the papers included in the study, the primary continuous predictor was categorised, of which 78% used 3–5 categories.5 In our opinion, there are several reasons for this. First, in clinical practice, the application of results obtained from techniques such as GAM is not always viable. It requires speciﬁc software which is not always possible to use at consulting rooms or emergency departments. On the other hand, decisions in clinical practice are often taken on the basis of an individual patient’s risk level, which is strongly related to a categorisation of that patient’s clinical variables. Yet, despite the fact that categorisation is a common practice in clinical research, there are no uniﬁed criteria for categorising continuous variables. Indeed, categorisation is very often based on percentiles, even though this is known to have drawbacks.6 Moreover, even when categorisation is based on clinical criteria, it has been shown that it can vary enormously from one practitioner or hospital (or even country) to another. For instance, a meta-analysis conducted by Lim and Kelly7 showed that reported cut-oﬀ values for partial pressure of carbon dioxide in the blood (PCO2) for hypercapnia screening ranged from 30 to 46 mmHg.

Previous work has been done on the categorisation of continuous variables. A review of these methods shows that these have been based: ﬁrst, on the graphical relationship between the predictor and the outcome, and second, on the minimum p-value approach.8 Moreover, the aim in almost all cases has been to seek a single cut point, or, expressed in another way, to dichotomise the continuous predictor.9–11 However, the use of more than two categories may be preferable, since this serves to reduce the loss of information and enables the relationship between the covariate and the response variable to be retained. In the context where the outcome of interest takes only two possible values, the search for more than one cut point has been considered for instance by Tsuruta and Bax12 and Barrio et al.13 Tsuruta and Bax propose a parametric method for obtaining cut points based on the overall discrimination C-index,14 which is equivalent to the area under the receiver operating characteristic curve (AUC). The authors showed the optimal location of cut points in a case where the distribution of the predictor variable is known, and illustrated the proposal for application to a normal distribution. Yet, in routine clinical practice and, by extension, in medical research, variables of interest do not usually respond to either a normal or a known distribution. On the other hand, Barrio et al. proposed a method based on a graphical display using GAM with P-spline smoothers to determine the relationship between the continuous predictor and the outcome.

Despite the fact that both approaches have proven to be useful, they suﬀer from the limitation of only being applicable in a univariable setting. Accordingly, we propose in this paper a new approach for the selection of optimal cut points that allows for more than one cut point to be selected as well as the possibility of being used in a multivariable setting. Speciﬁcally, our study has two main aims: ﬁrst, to propose a new approach for the selection of optimal cut points for categorising continuous variables in logistic prediction models, and second, to validate the proposed approach. Two diﬀerent

Downloaded from smm.sagepub.com by guest on September 30, 2015

Barrio et al.

3

algorithms, called AddFor and Genetic, are proposed for the selection of the cut points which maximise the AUC, and the performance is evaluated and compared by means of simulations. Validation of the categorisation method is performed in two diﬀerent settings: (1) under deﬁned theoretical conditions, where the optimal cut points are known; and, (2) under empirical situations where the original variable is observed as categorical although an underlying continuous latent variable is supposed.

The rest of the paper is organised as follows. Section 2 provides a description of the IRYSSCOPD study of patients suﬀering from an exacerbation of chronic obstructive pulmonary disease which motivated the development of the methodology presented in this paper. Section 3 outlines the method proposed for categorising continuous variables in clinical prediction models where the response variable is dichotomous. In Section 4, the validation process is described, with a comparison of both cut point selection algorithms in diﬀerent settings, namely, under theoretical and empirical deﬁned conditions. Additionally, the results obtained from the validation study are reported. Section 5 describes the software implementation. Section 6 describes the application of the proposed methodology to the IRYSS-COPD study data-set. Finally, the paper closes with a discussion in Section 7 in which the ﬁndings are reviewed and conclusions are drawn.

2 The IRYSS-COPD study

Chronic obstructive pulmonary disease (COPD) is one of the most common chronic diseases, and its prevalence is expected to increase over the next few decades.15 COPD is a leading cause of death in developed countries, and patients with COPD generally have a substantial deterioration in their quality of life.16 Exacerbation of COPD (eCOPD) is deﬁned as an event in the natural course of a patient’s COPD characterised by a change in baseline dyspnea, cough, and/or sputum that is beyond normal day-to-day variations and that may have warranted a change in medication or treatment.17 Patients often experience eCOPD, and these often require assessment in an emergency department (ED) and hospitalisation. Exacerbations play a major role in the burden of COPD, its evolution and its cost.18 Some exacerbations are quite severe, leading to death or the need for invasive mechanical ventilation (IMV); others are more moderate, requiring little more than an adjustment of the patient’s current medical therapy. Currently, ED physicians must rely largely on experience and personal criteria for gauging how an eCOPD will evolve. Accordingly, the development of clinical prediction rules in this context would be of great importance to help ED physicians to make betterinformed decisions about treatment.

The IRYSS-COPD study (IRYSS: Red de investigacio´ n cooperativa para la Investigacio´ n en Resultados de Salud y Servicios Sanitarios – Co-operative Health Outcomes & Health Services Research Network) was created to address gaps for identifying eCOPD patients whose clinical situation is appropriate for admission to the hospital, and to develop and validate severity scores for eCOPD exacerbations.19 In this study, a sample of 2487 patients with eCOPD attending the EDs of 16 participating hospitals in Spain was collected. Information was recorded as follows: at the date on which patients were evaluated at the ED; at the date on which the decision was made to admit patients or discharge them home from the ED; and during follow-up after admission to the hospital or discharge home. Data collected upon arrival in the ED included socioeconomic data, information about the patient’s respiratory function (arterial blood gases, respiratory rate, dyspnea), and presence of other pathologies recorded in the Charlson Comorbidity Index. The consciousness level was measured by the Glasgow coma scale which was dichotomised as follows: altered consciousness deﬁned as a score of < 15 points,

Downloaded from smm.sagepub.com by guest on September 30, 2015

4

Statistical Methods in Medical Research 0(0)

unaltered consciousness as a score of 15 points.20 Additional data collected in the ED at the time a decision was made to admit or discharge the patient included the patient’s symptoms, signs and respiratory status at that moment.

One of the goals of the IRYSS-COPD study was to develop a clinical prediction rule for the short-term, very severe evolution of eCOPD deﬁned as any of the following: death, Intensive Care Unit (ICU) admission, need for IMV, and/or cardiac arrest. After a preliminary analysis, the predictors selected to be included in the prediction model were the Glasgow Coma scale, heart rate and the arterial blood gas PCO2. However, the covariate PCO2 had not a linear relationship with the outcome and hence it required to be introduced either modelled with a smooth function or in a categorised version. The clinical researchers involved in the study claimed for a categorised version of this predictor, but as mentioned earlier, there was not a previously ﬁxed cut point criteria in the literature.7 The authors previously proposed the categorisation of the covariate PCO2 which relayed on a graphical display.13 However, at this time we considered developing a more general methodology to obtain optimal cut points to categorise the continuous predictor variable PCO2, so that we could obtain the best categorised version to be introduced in the short-term, very severe evolution of eCOPD prediction model.

3 Methods

This section describes the methodology proposed to categorise continuous predictors in logistic regression models. Once the needed background and notation have been introduced, we describe the proposed methodology and the two algorithms for its implementation in Section 3.1. It should be noted that when developing the logistic regression model, the obtained AUC might be biased upward when the same data-set is used to ﬁt the model and compute the AUC. Accordingly, we considered correcting the overestimation of the AUC in the logistic model, which is explained in detail in Section 3.2. In addition to this, in clinical practice it might be needed to select which is the most desirable number of cut points. Therefore, in Section 3.3 we present two possible approaches to select the best number of cut points.

Suppose one has a dichotomous response variable Y, and a continuous predictor X. Furthermore, assume that the outcome variable has been coded as 0 or 1, representing the absence or the presence of the outcome characteristic, respectively. Then, the logistic regression model for Y is written as a linear function in the logistic transformation (logit) as it is shown in equation (1), where 0 is the intercept and 1 is the regression coeﬃcient for X.

PðY ¼ 1jXÞ logitðPðY ¼ 1jXÞÞ ¼ log 1 À PðY ¼ 1jXÞ ¼ 0 þ 1X ð1Þ

For a binary outcome, the AUC is the most commonly used performance measure to evaluate the

discriminative

ability

of

a

prediction

model.

More

speciﬁcally,

given

a

sample

fðxi

,

y

i

Þg

N i¼

1

,

the

coeﬃcients 0 and 1 are estimated by maximum likelihood and an iterative weighted least squares algorithm (denote ^0 and ^1). More detail about estimation methods can be seen in McCullagh and Nelder.21 Let p^i ¼ logitÀ1ð^0 þ ^1xiÞ be the estimated probability for subject i.

The AUC is frequently estimated by the Mann–Whitney statistic22 which is given as

Ad UC ¼ 1 X n0n1

X I ½p^j, p^m

j2DY¼0 m2DY¼1

Downloaded from smm.sagepub.com by guest on September 30, 2015

Barrio et al.

5

where DY¼1 and DY¼0 are the sets of subjects with Y ¼ 1 and Y ¼ 0, respectively, n1 and n0 are the

sizes of these sets and I ½ is the indicator function adjusted for ties 8 < 1 if p^j 5 p^m

I ½p^j, p^m ¼ : 0:5 if p^j ¼ p^m 0 otherwise:

3.1 Proposed methodology

Assuming that the continuous predictor X is what one wishes to categorise, our proposal consists of categorising X such that the best predictive logistic model is obtained for Y. Speciﬁcally, given k the number of cut points set for categorising X in k þ 1 intervals, let us denote vk ¼ ðx1, . . . , xkÞ the vector of k cut points ordered from smaller to larger, and Xcatk the corresponding categorised variable taking values from 0 to k. Then, what we propose is that the vector of k cut points vk ¼ ðx1, . . . , xkÞ, which maximises the AUC of the logistic regression model shown in equation (2), is thus the vector of the optimal k cut points.

Xk PðY ¼ 1jXcatk Þ ¼ logitÀ1ð0 þ q1fXcatk ¼qgÞ ð2Þ

q¼1

Estimation of the model in equation (2) as well as of the associated AUC can be done as presented before for the model in equation (1). However, the problem now lies in looking for the vector of the cut points which maximises the AUC. To achieve this, we propose two alternative algorithms, respectively, named AddFor and Genetic.

3.1.1 AddFor Using this algorithm, one cut point is searched for at a time. In other words, it ﬁrst seeks x1 (in a grid of size M of equally spaced values in the range of X), such that the AUC of the logistic regression model shown in equation (2) for k ¼ 1 will be maximised. Once x1 has been selected, it is ﬁxed and the algorithm proceeds to seek x2 (in the grid of size M) (x2 6¼ x1), so as to ensure that the AUC of the model in (2) for k ¼ 2 will be maximised. The process is then repeated until the vector of k cut points, vk ¼ ðx½1, . . . , x½kÞ, has been obtained, with x½o denoting the o-th ordered cut point.

3.1.2 Genetic Using Genetic Algorithms, the most widely known type of evolutionary algorithm,23 this method simultaneously ﬁnds the vector of k cut points, vk ¼ ðx1, . . . , xkÞ, which maximises the AUC of the logistic regression model in equation (2). Evolutionary algorithms are inspired by the concept of natural evolution. The underlying idea is that, given a population of individuals, environmental pressure leads to survival of the ﬁttest, leading in turn to a rise in the overall ﬁtness of the population. In a more mathematical context, given a function to be maximised (ﬁtness function), a collection of heuristic rules are used to modify a population of possible solutions in such a way that each generation of potential solutions, tends to be, on average, better than its predecessor. The measure whether one potential solution is better than another is the potential solution’s ﬁtness value. In our case, the AUC is the selected ﬁtness function to be maximised and the optimal cut points would be then the best possible solution.

Downloaded from smm.sagepub.com by guest on September 30, 2015

6

Statistical Methods in Medical Research 0(0)

For both algorithms, the methodology above has been presented (for ease of notation and

illustration) for the particular case of the categorisation of a continuous covariate X in a

univariate logistic regression model. Nevertheless, it can be easily extended to the categorisation

of a continuous covariate X in a multiple logistic regression model. Suppose that along with the

predictor variable X we want to categorise, a set of other p predictors, Z1, . . . , Zp, are of interest.

Then, the categorisation of X in a multivariable setting including the p predictors will be that for

which the AUC of the multiple logistic regression model in equation (3) is maximised.

Xp X pþk ! PðY ¼ 1jðZ1, . . . , Zp, Xcatk ÞÞ ¼ logitÀ1 0 þ rZr þ q1fXcatk ¼qÀpg : ð3Þ

r¼1

q¼pþ1

3.2 Optimism correction for the AUC

When implementing the algorithms presented in the previous section, the obtained AUC may be biased

upward when the same data-set is used to: (a) ﬁt the logistic regression model (involved in the cut point selection process) and (b) compute the AUC.24 In our setting, the aim was to look for the vector vk that maximises the AUC of the corresponding logistic model. Thus, the overestimation of the AUC may

have an impact in the maximisation process itself and therefore on the selection of the optimal cut

points. Several approaches for correcting the bias of the estimated discriminative ability of a predictive model have been proposed in the statistical literature.25,26 In this work, the proposal is based on the bootstrap bias correction method proposed by Steyerberg.26 Moreover, the bias correction procedure

was performed at two diﬀerent levels. In the ﬁrst approach, the bias correction was performed during

the selection of the optimal cut points. In the second approach, however, the bias correction procedure

was applied once the optimal cut points had been selected. Appendix A of the web supporting material

shows the results of a simulation study performed to evaluate the impact of the bias correction

approaches (at ﬁrst and second level) on the selection of the optimal cut points. As can be observed

on the results, both approaches provide similar results. Hence, and due to computational cost savings,

we propose to correct the AUC at the end of the cut point selection process. Speciﬁcally, in this

approach, the bootstrap bias correction method can be described as follows: Step 1. Categorise the predictor variable on the basis of the original sample fðxi, yiÞgNi¼1 and

compute the corresponding AUC. Let us denote this apparent AUC as Ad UCapp. Step 2. For b ¼ 1, . . . , B, generate the bootstrap resample fðxÃib, yÃibÞgNi¼1 by drawing a random

sample of size N with replacement from the original sample, and categorise the bootstrapped predictor fxÃibgNi¼1 on the basis of the optimal cut points obtained in Step 1.

Step 3. Fit the logistic regression model to the bootstrap resample with the categorised version of the predictor and compute the corresponding AUC, Ad UCbboot for b ¼ 1, . . . , B.

Step 4. Obtain the predicted probabilities for the original sample based on the ﬁtted logistic regression model obtained in Step 3 and compute the AUC. Let us denote this AUC as Ad UCbo for b ¼ 1, . . . , B.

Once the above process has been completed, the optimism O of the original AUC is calculated as

follows

O ¼ 1 XB jAd UCb À Ad UCbj

B

boot

o

b¼1

Downloaded from smm.sagepub.com by guest on September 30, 2015

Barrio et al.

7

and the bias corrected AUC is then computed as Ad UCapp À O. Finally, we would like to point out that in order to mimic the study design, it is advisable that the

resampling procedure described in Step 2 be done according to the design of the study. For instance, for a case-control study, data should be resampled separately within cases and controls. Moreover, if the data are clustered, the resampling units should be the clusters.

3.3 Selection of the number of cut points

To determine the optimal number of cut points, we studied two possible approaches. The ﬁrst approach is based on the diﬀerence between the bias-corrected AUCs obtained for k ¼ l and k ¼ l þ 1 cut points. To determine the need for an extra optimal cut point, we propose to compute the conﬁdence interval (CI) for this diﬀerence. Once the CI has been computed, an extra cut point is considered to be needed as long as the CI does not contain the zero. Speciﬁcally, in this paper bootstrap-based methods27 are proposed for constructing the CIs. The procedure can be summarised as follows:

(1) For v ¼ 1, . . . , V, generate the bootstrap resample fðxÃiv, yÃivÞgNi¼1 by drawing a random sample of size N with replacement from the original sample.

(2) Compute the bias-corrected AUC for the categorised variable for k ¼ l and k ¼ l þ 1 and denote it as Ad UCÃl,v and Ad UCÃlþ1,v, respectively. The bias-corrected AUC is computed as explained in Section 3.2, but using for Step 1 the optimal cut points obtained for k ¼ l and k ¼ l þ 1 on the basis of the original sample.

(3) Compute the diﬀerence between the bias-corrected AUCs obtained for k ¼ l þ 1 and k ¼ l.

Ad UCÃDiff,v ¼ Ad UCÃlþ1,v À Ad UCÃl,v:

Once the above process has been completed, the ð1 À Þ % limits for the CI for the diﬀerence are

given by

Ad UCD=i2ff, Ad UC1DÀiff =2

where Ad UCpDiff represents the p-percentile of the estimated Ad UCÃDiff,v ðv ¼ 1, . . . , VÞ. The second criterion used to evaluate the need for an extra optimal cut point was the integrated

discrimination improvement (IDI) index, proposed by Pencina et al.28 which in our setting can be deﬁned as shown in equation (1) in Pepe et al.29

IDI ¼ E½ plþ1 À pl jY ¼ 1 À E½ plþ1 À pl jY ¼ 0,

where pk ¼ PðY ¼ 1jXcatk Þ. The IDI is a useful measure to compare and assess the improvement in terms of risk prediction of

two predictive models. Accordingly, in our particular setting, the IDI can be a useful measure to evaluate the improvement oﬀered by adding an extra cut point. In particular, we propose the criterion that an extra cut point is needed as long as an statistically signiﬁcant IDI is obtained when comparing the ﬁtted logistic regression models obtained with k ¼ l and k ¼ l þ 1 cut points.

Downloaded from smm.sagepub.com by guest on September 30, 2015

8

Statistical Methods in Medical Research 0(0)

4 Validation study

This section reports the results of a simulation study conducted to analyse the empirical performance of the methods described in Section 3 above. Validation was provided at two diﬀerent levels, i.e. in a theoretical setting and in a backward process. Both settings are explained in detail below.

All computations were performed using the (64 bit) R 3.0.1 software package.30

4.1 Scenarios and set-up

4.1.1 Theoretical validation In the ﬁrst setting, the predictor variable X was simulated from a normal distribution separately in each of the populations deﬁned by the outcome (Y ¼ 0 and Y ¼ 1), i.e., XjðY ¼ 0Þ ’ Nð0, 0Þ and XjðY ¼ 1Þ ’ Nð1, 1Þ. It should be noted that, when 0 and 1 are equal, the linear relationship between X and the logistic function holds. Moreover, it can be shown that for k cut points, the theoretical location of the optimal cut points can be obtained,12 as well as the AUC associated with the corresponding categorical covariate. Accordingly, the aims of this simulation study were twofold: (a) to compare the cut points obtained with the proposed methodology and the theoretical optimal cut points; and (b) to compare the obtained bias corrected AUC and the theoretical one.

The most general results are presented in the main manuscript. Nevertheless, more speciﬁc results for diﬀerent scenarios and sample sizes are presented in the web Appendixes B and C. Speciﬁcally, in the simulations presented in this manuscript, we considered XjðY ¼ 0Þ ’ Nð0, 1Þ, XjðY ¼ 1Þ ’ Nð1:5, 1Þ. The simulations were done assuming the same number of individuals in Y ¼ 0 and Y ¼ 1 and a total sample sizes of N ¼ 500 and N ¼ 1000. As far as the number of cut points is concerned, k ¼ 1, 2 and 3 was considered. Finally, for the AddFor algorithm, grid sizes of M ¼ 100 and M ¼ 1000 were used. In all cases, B ¼ 50 was considered for the AUC bias corrected procedure. Sample sizes of N ¼ 500 and N ¼ 1000 were selected to ensure a requirement commonly used on the speciﬁc framework of prediction models.26 Nevertheless, the performance of the proposed methodology was also veriﬁed for smaller sample sizes. Detailed results can be seen in Appendix B of the supporting web material. R ¼ 500 and R ¼ 1000 replicates of simulated data were performed. Both number of replicates provided similar results (not shown). Accordingly, all results shown here are based on R ¼ 500 replicates.

As pointed out before, under this setting (1 ¼ 0) the relationship between the predictor X and the logit transformation of the response Y is linear. Nevertheless, the performance of the proposed methodology when the relationship is not linear was also assessed by comparing both algorithms in a controlled situation. Results are shown in Appendix C of the supporting web material.

4.1.2 Backward validation In the second setting, we envisaged simulating a continuous variable X starting from a categorical variable whose cut points had been scientiﬁcally pre-established and assuming that they represent an underlying continuum variable. The aim was to test whether the cut points obtained by applying the proposed methodology to the continuous variable were similar to the original cut-points. For the purpose, we considered the data-set available at the IRYSS-COPD study.19 In this data-set, we selected the variable forced expiratory volume in 1 second in percentage (FEV1 %) which is a clinical variable whose categorisation into four categories (mild ! 80, moderate ½50 À 80Þ, severe ½30 À 50Þ and very severe < 30) is well established, thanks to previous research in the ﬁeld.31 This variable was available in the data-set both in the continuous and the categorical versions for a total number of L ¼ 2069 patients.

Downloaded from smm.sagepub.com by guest on September 30, 2015

Barrio et al.

9

Table 1. Backward validation study: total distribution of individuals in the four categories of forced expiratory volume in 1 s in percentage (FEV1 %) and distribution of diseased individuals in each category, under both scenarios.

Scenario I

Scenario II

FEV1 % [0,100]

Total (%)

Diseased (%)

Total (%)

Diseased (%)

Mild [80,100]

35

5

Moderate [50,80)

30

20

Severe [30,50)

20

25

Very severe [0,30)

15

40

3

0

30

4.5

47

8.6

20

14.2

To simulate the continuous covariate FEV1 %, we propose a bootstrap method starting from the

original categorical and continuous versions of FEV1 %. Let us denote X the original continuous

FEV1 % variable and Xcat the categorised variable taking values from 0 to 3, which correspond to

mild, moderate, severe and very severe categories, respectively. For each l ¼ 0, . . . , 3, consider dls as

the s-th decile of X when Xcat ¼ l. For each u ¼ 1, . . . , U and l ¼ 0, . . . , 3, we generated the bootstrap

sample

fxÃiugLi¼l 1

by

drawing

a

sample

of

size

Ll

with

replacement

from the P

original

sample

fxigLi¼l 1,

where Ll denotes the number of individuals in the l-th category (L ¼

3

Pl¼0

Ll).

We

considered

dÃls

as

the

average

of

the

U

bootstrap

deciles

of

each

category,

i.e.

dÃls

¼

1 U

U u¼1

duls.

The

continuous

variable Xsim was simulated assuming a uniform distribution in the interval ðdÃl ðsÀ1Þ, dÃlsÞ, enclosed by

the lower and upper limits of each category.

Additionally, the dichotomous response variable Y was simulated according to the two scenarios shown in Table 1 trying to mimic two possible real situations. In Scenario I, patients are distributed as 35%, 30%, 20% and 15% in mild, moderate, severe and very severe categories, respectively. In contrast, in Scenario II, only 3% of patients belong to the mild category. Additionally, the percentage of individuals with Y ¼ 1 (denoted as diseased) changes from Scenario I to Scenario II.

For each of the scenarios, R ¼ 500 replicates were conducted for total sample sizes of N ¼ 500 and N ¼ 1000, and B ¼ 50 as in the previous setting and U ¼ 10.000 bootstrap resamples were used. Optimal cut points were sought using the Genetic and AddFor algorithms, the latter with grid sizes of M ¼ 100 and M ¼ 1000.

4.2 Results

4.2.1 Theoretical validation Figure 1 depicts the boxplot of the estimated optimal cut points over 500 simulated data-sets, for each of the proposed algorithms, diﬀerent sample sizes and number of cut points. As can be observed, the cut points obtained by the Genetic or AddFor algorithms were close to the theoretical optimal cut points, with both algorithms presenting a low bias. The corresponding detailed numerical results are shown in Table 2. Under this scenario, the theoretical optimal cut points are v1 ¼ ð0:77Þ, v2 ¼ ð0:23, 1:27Þ and v3 ¼ ðÀ0:07, 0:75, 1:57Þ for k ¼ 1, 2 and 3 number of cut points, respectively. Note that the average of the estimated cut points across simulated data-sets was very similar for both algorithms with these values being very close to the theoretical optimal cut points. As expected, the diﬀerences with respect to the theoretical optimal cut points were smaller when the sample size increased from 500 to 1000. For example, for k ¼ 3 cut points, the average of the cut points obtained with the Genetic method across all replicates were v^3 ¼ ðÀ0:11, 0:76, 1:63Þ and v^3 ¼ ðÀ0:09, 0:74, 1:61Þ for sample sizes of 500 and 1000, respectively, while with the Addfor

Downloaded from smm.sagepub.com by guest on September 30, 2015

10

(a)

(b)

(c)

Statistical Methods in Medical Research 0(0)

Figure 1. Boxplot of the estimated optimal cut points based on 500 simulated data obtained according to the theoretical optimal cut point validation study and comparison with the theoretically optimal cut point (v1 ¼ ð0:77Þ, v2 ¼ ð0:23,1:27Þ and v3 ¼ ðÀ0:07,0:75,1:57Þ). From top to bottom: (a) for k ¼ 1 number of cut points; (b) for k ¼ 2 number of cut points and (c) for k ¼ 3 number of cut points.

algorithm and a grid of size 1000 they were v^3 ¼ ðÀ0:11, 0:73, 1:60Þ and v^3 ¼ ðÀ0:08, 0:74, 1:58Þ. It should be noted that, when the desired number of cut points was 2, the AddFor did not perform as well as the Genetic algorithm. While the former located only one of the two optimal cut points, the latter managed to approximate both cut points. For instance, for a sample size of 500 and k ¼ 2, the bias obtained for the estimated cut points was ð0:00, 0:04Þ using the Genetic method, and ð0:11, À 0:01Þ using the AddFor method with a grid of size 1000.

In Table 3, the average, bias and standard deviation of the bias-corrected AUC values over 500 simulated data-sets are given, for each of the proposed algorithms, diﬀerent sample sizes and number of cut points. Note that the AUC values obtained were almost unbiased, being the bias obtained less or equal to 0.02 when k ¼ 1 was chosen. Additionally, the Genetic approach generally provided slightly higher AUC values than the AddFor algorithm. However, when the AddFor grid size was increased from 100 to 1000, the obtained results were almost the same as those obtained with the Genetic algorithm. For instance, for a sample size of 500 and k ¼ 3 number of cut points, the

Downloaded from smm.sagepub.com by guest on September 30, 2015