Advanced colorectal neoplasia risk stratification by penalized logistic regression

Yunzhi Lin, Menggang Yu, Sijian Wang, Richard Chappell, Thomas Imperiale, Andrew B. Lawson, Duncan Lee, Ying MacNab

Research output: Contribution to journal, Article

1 Citation (Scopus)

Abstract

Colorectal cancer is the second leading cause of death from cancer in the United States. To improve the efficiency of colorectal cancer screening, there is a need to stratify risk for colorectal cancer among the 90% of US residents who are considered "average risk." In this article, we investigate such risk stratification rules for advanced colorectal neoplasia (colorectal cancer and advanced, precancerous polyps). We use a recently completed large cohort study of subjects who underwent a first screening colonoscopy. Logistic regression models have been used in the literature to estimate the risk of advanced colorectal neoplasia based on quantifiable risk factors. However, logistic regression may be prone to overfitting and instability in variable selection. Since most of the risk factors in our study have several categories, it is tempting to collapse these categories into fewer risk groups. We propose a penalized logistic regression method that automatically and simultaneously selects variables, groups categories, and estimates their coefficients by penalizing the L1-norm of both the coefficients and their differences. Hence, it encourages sparsity in the categories, i.e., grouping of the categories, and sparsity in the variables, i.e., variable selection. We apply the penalized logistic regression method to our data. The important variables are selected, with close categories simultaneously grouped, by penalized regression models with and without interaction terms. The models are validated with 10-fold cross-validation. The receiver operating characteristic curves of the penalized regression models dominate the receiver operating characteristic curve of naive logistic regression, indicating superior discriminative performance.
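The penalty described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `groups` structure, the two tuning parameters `lam1` and `lam2`, and the choice to penalize all pairwise differences among the dummy-coded categories of one variable are assumptions made for illustration only. The code only evaluates the penalized objective; it does not fit the model.

```python
import numpy as np
from itertools import combinations


def neg_log_likelihood(beta0, beta, X, y):
    """Average logistic negative log-likelihood for 0/1 outcomes y."""
    eta = beta0 + X @ beta
    # log(1 + exp(eta)) evaluated stably via logaddexp
    return np.mean(np.logaddexp(0.0, eta) - y * eta)


def grouping_penalty(beta, groups, lam1, lam2):
    """L1 penalty on coefficients plus L1 penalty on their differences.

    groups: list of index lists, one per categorical variable, giving the
    columns of its dummy-coded categories (hypothetical layout). Pairing
    every two categories within a variable is an assumption of this sketch.
    """
    sparsity = np.sum(np.abs(beta))            # pushes coefficients to zero
    fusion = sum(abs(beta[j] - beta[k])        # pushes categories to share a value
                 for idx in groups
                 for j, k in combinations(idx, 2))
    return lam1 * sparsity + lam2 * fusion


def objective(beta0, beta, X, y, groups, lam1, lam2):
    """Penalized objective: data fit plus sparsity and grouping penalties."""
    return (neg_log_likelihood(beta0, beta, X, y)
            + grouping_penalty(beta, groups, lam1, lam2))
```

In this sketch, two categories with equal coefficients contribute nothing to the difference term, which is how such a penalty can merge them into one risk group, while a variable whose coefficients are all zero contributes nothing to either term and is effectively dropped.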

Original language: English (US)
Pages (from-to): 1677-1691
Number of pages: 15
Journal: Statistical Methods in Medical Research
Volume: 25
Issue number: 4
DOIs: https://doi.org/10.1177/0962280213497432
State: Published - Aug 1 2016

Fingerprint

Penalized Regression
Logistic Regression
Stratification
Colorectal Cancer
Logistic Models
Colorectal Neoplasms
Neoplasms
Receiver Operating Characteristic Curve
Risk Factors
Variable Selection
Sparsity
ROC Curve
Screening
Regression Model
Coefficient Estimates
Cohort Study
Logistic Regression Model
Overfitting
Colonoscopy
Polyps

Keywords

  • colorectal cancer
  • interaction
  • lasso
  • penalized logistic regression
  • risk stratification

ASJC Scopus subject areas

  • Epidemiology
  • Statistics and Probability
  • Health Information Management

Cite this

Advanced colorectal neoplasia risk stratification by penalized logistic regression. / Lin, Yunzhi; Yu, Menggang; Wang, Sijian; Chappell, Richard; Imperiale, Thomas; Lawson, Andrew B.; Lee, Duncan; MacNab, Ying.

In: Statistical Methods in Medical Research, Vol. 25, No. 4, 01.08.2016, p. 1677-1691.

Lin, Yunzhi ; Yu, Menggang ; Wang, Sijian ; Chappell, Richard ; Imperiale, Thomas ; Lawson, Andrew B. ; Lee, Duncan ; MacNab, Ying. / Advanced colorectal neoplasia risk stratification by penalized logistic regression. In: Statistical Methods in Medical Research. 2016 ; Vol. 25, No. 4. pp. 1677-1691.
@article{44f40c8073984551b951b1ee45e35f25,
title = "Advanced colorectal neoplasia risk stratification by penalized logistic regression",
keywords = "colorectal cancer, interaction, lasso, penalized logistic regression, risk stratification",
author = "Yunzhi Lin and Menggang Yu and Sijian Wang and Richard Chappell and Thomas Imperiale and Lawson, {Andrew B.} and Duncan Lee and Ying MacNab",
year = "2016",
month = "8",
day = "1",
doi = "10.1177/0962280213497432",
language = "English (US)",
volume = "25",
pages = "1677--1691",
journal = "Statistical Methods in Medical Research",
issn = "0962-2802",
publisher = "SAGE Publications Ltd",
number = "4",

}

TY - JOUR

T1 - Advanced colorectal neoplasia risk stratification by penalized logistic regression

AU - Lin, Yunzhi

AU - Yu, Menggang

AU - Wang, Sijian

AU - Chappell, Richard

AU - Imperiale, Thomas

AU - Lawson, Andrew B.

AU - Lee, Duncan

AU - MacNab, Ying

PY - 2016/8/1

Y1 - 2016/8/1

KW - colorectal cancer

KW - interaction

KW - lasso

KW - penalized logistic regression

KW - risk stratification

UR - http://www.scopus.com/inward/record.url?scp=84983784150&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84983784150&partnerID=8YFLogxK

U2 - 10.1177/0962280213497432

DO - 10.1177/0962280213497432

M3 - Article

C2 - 23907780

AN - SCOPUS:84983784150

VL - 25

SP - 1677

EP - 1691

JO - Statistical Methods in Medical Research

JF - Statistical Methods in Medical Research

SN - 0962-2802

IS - 4

ER -