Toward better public health reporting using existing off the shelf approaches: The value of medical dictionaries in automated cancer detection using plaintext medical data

Suranga N. Kasthurirathne, Brian Dixon, Judy Gichoya, Huiping Xu, Yuni Xia, Burke Mamlin, Shaun Grannis

Research output: Contribution to journal › Article

6 Citations (Scopus)

Abstract

Objectives: Existing approaches to deriving decision models from plaintext clinical data frequently depend on medical dictionaries as the source of potential features. Prior research suggests that decision models developed using non-dictionary-based feature sourcing approaches and “off the shelf” tools could predict cancer with performance metrics between 80% and 90%. We sought to compare non-dictionary-based models to models built using features derived from medical dictionaries. Materials and methods: We evaluated the detection of cancer cases from free-text pathology reports using decision models built with combinations of dictionary- or non-dictionary-based feature sourcing approaches, 4 feature subset sizes, and 5 classification algorithms. Each decision model was evaluated using the following performance metrics: sensitivity, specificity, accuracy, positive predictive value, and area under the receiver operating characteristic (ROC) curve. Results: Decision models parameterized using dictionary- and non-dictionary-based feature sourcing approaches produced performance metrics between 70% and 90%. Neither the source of features nor the feature subset size affected the performance of a decision model. Conclusion: Our study suggests there is little value in leveraging medical dictionaries to extract features for decision model building. Decision models built using features extracted from the plaintext reports themselves achieve results comparable to those built using medical dictionaries. Overall, this suggests that existing “off the shelf” approaches can be leveraged to perform accurate cancer detection using less complex Named Entity Recognition (NER)-based feature extraction, automated feature selection, and modeling approaches.
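To make the Materials and methods summary concrete, the following is a minimal, hypothetical sketch of a non-dictionary-based pipeline: n-gram features are sourced from the report text itself, a fixed-size feature subset is selected automatically, and an “off the shelf” classifier is scored with the metrics named in the abstract. This is not the authors' pipeline; scikit-learn, the placeholder reports and labels, and every parameter choice (the n-gram range, k=20, logistic regression) are assumptions for illustration only.

# Hypothetical sketch (not the study's code): non-dictionary n-gram feature
# sourcing, automated feature selection, and an off-the-shelf classifier,
# scored with the abstract's metrics. Data below are placeholders.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_auc_score

# Placeholder corpus: free-text pathology reports with binary cancer labels.
reports = [
    "invasive ductal carcinoma identified in the left breast specimen",
    "benign fibroadipose tissue, no evidence of malignancy",
    "adenocarcinoma of the colon with clear surgical margins",
    "chronic inflammation, negative for tumor cells",
] * 50
labels = np.array([1, 0, 1, 0] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    reports, labels, test_size=0.3, stratify=labels, random_state=0
)

# Non-dictionary feature sourcing: unigrams/bigrams drawn from the reports
# themselves, followed by automated selection of a fixed feature subset size.
model = Pipeline([
    ("features", CountVectorizer(ngram_range=(1, 2), min_df=2)),
    ("select", SelectKBest(chi2, k=20)),          # one candidate subset size
    ("clf", LogisticRegression(max_iter=1000)),   # one candidate classifier
])
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

print("sensitivity:", tp / (tp + fn))
print("specificity:", tn / (tn + fp))
print("PPV:        ", tp / (tp + fp))
print("accuracy:   ", (tp + tn) / (tp + tn + fp + fn))
print("AUC:        ", roc_auc_score(y_test, y_score))

Swapping the vectorizer's vocabulary for terms drawn from a medical dictionary, varying the subset size k, or substituting other classifiers would reproduce the kind of dictionary-versus-non-dictionary comparison the study describes.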

Original language: English (US)
Pages (from-to): 160-176
Number of pages: 17
Journal: Journal of Biomedical Informatics
Volume: 69
DOIs: 10.1016/j.jbi.2017.04.008
State: Published - May 1 2017


Keywords

  • Cancer
  • Data preprocessing
  • Decision models
  • Feature selection
  • Medical dictionaries
  • Pathology
  • Public health reporting

ASJC Scopus subject areas

  • Computer Science Applications
  • Health Informatics

Cite this

Kasthurirathne, S. N., Dixon, B., Gichoya, J., Xu, H., Xia, Y., Mamlin, B., & Grannis, S. (2017). Toward better public health reporting using existing off the shelf approaches: The value of medical dictionaries in automated cancer detection using plaintext medical data. Journal of Biomedical Informatics, 69, 160-176. https://doi.org/10.1016/j.jbi.2017.04.008
