Toward better public health reporting using existing off the shelf approaches: A comparison of alternative cancer detection approaches using plaintext medical data and non-dictionary based feature selection

Suranga N. Kasthurirathne, Brian Dixon, Judy Gichoya, Huiping Xu, Yuni Xia, Burke Mamlin, Shaun Grannis

Research output: Contribution to journal (Article)

7 Citations (Scopus)

Abstract

Objectives: Increased adoption of electronic health records has resulted in increased availability of free text clinical data for secondary use. A variety of approaches to obtain actionable information from unstructured free text data exist. These approaches are resource intensive and inherently complex, and they rely on structured clinical data and dictionary-based approaches. We sought to evaluate the potential to obtain actionable information from free text pathology reports using routinely available tools and approaches that do not depend on dictionary-based approaches.

Materials and methods: We obtained pathology reports from a large health information exchange and evaluated the capacity to detect cancer cases from these reports using 3 non-dictionary feature selection approaches, 4 feature subset sizes, and 5 clinical decision models: simple logistic regression, naïve Bayes, k-nearest neighbor, random forest, and J48 decision tree. The performance of each decision model was evaluated using sensitivity, specificity, accuracy, positive predictive value, and area under the receiver operating characteristic (ROC) curve.

Results: Decision models parameterized using automated, informed, and manual feature selection approaches yielded similar results. Furthermore, non-dictionary classification approaches identified cancer cases present in free text reports with evaluation measures approaching and exceeding 80-90% for most metrics.

Conclusion: Our methods are feasible and practical approaches for extracting substantial information value from free text medical data, and the results suggest that these methods can perform on par with, if not better than, existing dictionary-based approaches. Given that public health agencies are often under-resourced and lack the technical capacity for more complex methodologies, these results represent potentially significant value to the public health field.
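
The evaluation measures named in the abstract (sensitivity, specificity, accuracy, positive predictive value, and area under the ROC curve) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the 1 = cancer / 0 = no cancer label encoding, and the toy data are assumptions made for illustration only, and the AUC is computed with the standard pairwise-ranking formulation rather than any tool the study may have used.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    sensitivity: float
    specificity: float
    accuracy: float
    ppv: float  # positive predictive value


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels; 1 = cancer, 0 = no cancer (assumed encoding)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def evaluate(y_true, y_pred):
    """Compute the four threshold-based measures.

    Assumes both classes occur in y_true and at least one positive
    prediction exists, so no denominator is zero.
    """
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return Metrics(
        sensitivity=tp / (tp + fn),   # true positive rate
        specificity=tn / (tn + fp),   # true negative rate
        accuracy=(tp + tn) / (tp + fp + tn + fn),
        ppv=tp / (tp + fp),
    )


def roc_auc(y_true, scores):
    """Area under the ROC curve as the probability that a random positive
    case scores higher than a random negative case; ties count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical toy labels, predictions, and classifier scores.
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
    scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]
    print(evaluate(y_true, y_pred))
    print(roc_auc(y_true, scores))
```

The pairwise AUC is quadratic in the number of reports, which is fine for a small illustration; a sort-based rank computation would be the usual choice for larger data sets.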

Original language: English (US)
Pages (from-to): 145-152
Number of pages: 8
Journal: Journal of Biomedical Informatics
Volume: 60
DOI: 10.1016/j.jbi.2016.01.008
State: Published - Apr 1 2016

Fingerprint

Public health
Glossaries
Feature extraction
Pathology
Health
Neoplasms
Decision Trees
Electronic Health Records
ROC Curve
Logistics
Logistic Models
Availability
Sensitivity and Specificity

Keywords

  • Cancer
  • Data preprocessing
  • Decision models
  • Feature selection
  • Pathology
  • Public health reporting

ASJC Scopus subject areas

  • Computer Science Applications
  • Health Informatics

Cite this

@article{83f2f1820c444d4bba2a786140784143,
title = "Toward better public health reporting using existing off the shelf approaches: A comparison of alternative cancer detection approaches using plaintext medical data and non-dictionary based feature selection",
abstract = "Objectives: Increased adoption of electronic health records has resulted in increased availability of free text clinical data for secondary use. A variety of approaches to obtain actionable information from unstructured free text data exist. These approaches are resource intensive and inherently complex, and they rely on structured clinical data and dictionary-based approaches. We sought to evaluate the potential to obtain actionable information from free text pathology reports using routinely available tools and approaches that do not depend on dictionary-based approaches. Materials and methods: We obtained pathology reports from a large health information exchange and evaluated the capacity to detect cancer cases from these reports using 3 non-dictionary feature selection approaches, 4 feature subset sizes, and 5 clinical decision models: simple logistic regression, na{\"i}ve Bayes, k-nearest neighbor, random forest, and J48 decision tree. The performance of each decision model was evaluated using sensitivity, specificity, accuracy, positive predictive value, and area under the receiver operating characteristic (ROC) curve. Results: Decision models parameterized using automated, informed, and manual feature selection approaches yielded similar results. Furthermore, non-dictionary classification approaches identified cancer cases present in free text reports with evaluation measures approaching and exceeding 80-90{\%} for most metrics. Conclusion: Our methods are feasible and practical approaches for extracting substantial information value from free text medical data, and the results suggest that these methods can perform on par with, if not better than, existing dictionary-based approaches. Given that public health agencies are often under-resourced and lack the technical capacity for more complex methodologies, these results represent potentially significant value to the public health field.",
keywords = "Cancer, Data preprocessing, Decision models, Feature selection, Pathology, Public health reporting",
author = "Kasthurirathne, {Suranga N.} and Brian Dixon and Judy Gichoya and Huiping Xu and Yuni Xia and Burke Mamlin and Shaun Grannis",
year = "2016",
month = "4",
day = "1",
doi = "10.1016/j.jbi.2016.01.008",
language = "English (US)",
volume = "60",
pages = "145--152",
journal = "Journal of Biomedical Informatics",
issn = "1532-0464",
publisher = "Academic Press Inc.",

}

TY - JOUR

T1 - Toward better public health reporting using existing off the shelf approaches

T2 - A comparison of alternative cancer detection approaches using plaintext medical data and non-dictionary based feature selection

AU - Kasthurirathne, Suranga N.

AU - Dixon, Brian

AU - Gichoya, Judy

AU - Xu, Huiping

AU - Xia, Yuni

AU - Mamlin, Burke

AU - Grannis, Shaun

PY - 2016/4/1

Y1 - 2016/4/1

AB - Objectives: Increased adoption of electronic health records has resulted in increased availability of free text clinical data for secondary use. A variety of approaches to obtain actionable information from unstructured free text data exist. These approaches are resource intensive and inherently complex, and they rely on structured clinical data and dictionary-based approaches. We sought to evaluate the potential to obtain actionable information from free text pathology reports using routinely available tools and approaches that do not depend on dictionary-based approaches. Materials and methods: We obtained pathology reports from a large health information exchange and evaluated the capacity to detect cancer cases from these reports using 3 non-dictionary feature selection approaches, 4 feature subset sizes, and 5 clinical decision models: simple logistic regression, naïve Bayes, k-nearest neighbor, random forest, and J48 decision tree. The performance of each decision model was evaluated using sensitivity, specificity, accuracy, positive predictive value, and area under the receiver operating characteristic (ROC) curve. Results: Decision models parameterized using automated, informed, and manual feature selection approaches yielded similar results. Furthermore, non-dictionary classification approaches identified cancer cases present in free text reports with evaluation measures approaching and exceeding 80-90% for most metrics. Conclusion: Our methods are feasible and practical approaches for extracting substantial information value from free text medical data, and the results suggest that these methods can perform on par with, if not better than, existing dictionary-based approaches. Given that public health agencies are often under-resourced and lack the technical capacity for more complex methodologies, these results represent potentially significant value to the public health field.

KW - Cancer

KW - Data preprocessing

KW - Decision models

KW - Feature selection

KW - Pathology

KW - Public health reporting

UR - http://www.scopus.com/inward/record.url?scp=84962821265&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84962821265&partnerID=8YFLogxK

U2 - 10.1016/j.jbi.2016.01.008

DO - 10.1016/j.jbi.2016.01.008

M3 - Article

C2 - 26826453

AN - SCOPUS:84962821265

VL - 60

SP - 145

EP - 152

JO - Journal of Biomedical Informatics

JF - Journal of Biomedical Informatics

SN - 1532-0464

ER -