Social tagging is no substitute for controlled indexing: A comparison of Medical Subject Headings and CiteULike tags assigned to 231,388 papers

Danielle H. Lee, Titus Schleyer

Research output: Contribution to journal › Article

25 Citations (Scopus)

Abstract

Social tagging and controlled indexing both facilitate access to information resources. Given the increasing popularity of social tagging and the limitations of controlled indexing (primarily cost and scalability), it is reasonable to investigate to what degree social tagging could substitute for controlled indexing. In this study, we compared CiteULike tags to Medical Subject Headings (MeSH) terms for 231,388 citations indexed in MEDLINE. In addition to descriptive analyses of the data sets, we present a paper-by-paper analysis of tags and MeSH terms: the number of common annotations, Jaccard similarity, and coverage ratio. In the analysis, we apply three increasingly progressive levels of text processing, ranging from normalization to stemming, to reduce the impact of lexical differences. Annotations of our corpus consisted of over 76,968 distinct tags and 21,129 distinct MeSH terms. The top 20 tags/MeSH terms showed little direct overlap. On a paper-by-paper basis, the number of common annotations ranged from 0.29 to 0.5 and the Jaccard similarity from 2.12% to 3.3% using increased levels of text processing. At most, 77,834 citations (33.6%) shared at least one annotation. Our results show that CiteULike tags and MeSH terms are quite distinct lexically, reflecting different viewpoints/processes between social tagging and controlled indexing.
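The three paper-level measures named in the abstract (number of common annotations, Jaccard similarity, and coverage ratio) can be illustrated with a minimal sketch. Several things here are assumptions rather than the authors' method: the `normalize` function (lowercasing and collapsing punctuation) stands in for only the lightest text-processing level and does no stemming, the coverage ratio is assumed to be the share of a paper's MeSH terms that also appear among its tags, and the sample annotations are made up for illustration, not taken from the study's data.

```python
import re


def normalize(term: str) -> str:
    """Lowercase and collapse punctuation/whitespace (assumed normalization, no stemming)."""
    return re.sub(r"[^a-z0-9]+", " ", term.lower()).strip()


def compare_annotations(tags, mesh_terms):
    """Compare one paper's tags with its MeSH terms after normalization."""
    tag_set = {normalize(t) for t in tags}
    mesh_set = {normalize(m) for m in mesh_terms}
    # Drop entries that became empty after normalization.
    tag_set.discard("")
    mesh_set.discard("")

    common = tag_set & mesh_set
    union = tag_set | mesh_set

    # Jaccard similarity: size of the intersection over size of the union.
    jaccard = len(common) / len(union) if union else 0.0
    # Assumed definition: fraction of MeSH terms matched by at least one tag.
    coverage = len(common) / len(mesh_set) if mesh_set else 0.0

    return {"common": len(common), "jaccard": jaccard, "coverage": coverage}


# Hypothetical annotations for a single paper.
result = compare_annotations(
    ["Social-Tagging", "indexing", "folksonomy"],
    ["Social Tagging", "Abstracting and Indexing", "MEDLINE", "Indexing"],
)
print(result)
```

In this made-up example two annotations match after normalization ("social tagging" and "indexing"), the union holds five distinct annotations, and the paper has four MeSH terms. Averaging such per-paper values over a corpus would give figures comparable in kind to those reported above, although the study's exact processing steps may differ.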

Original language: English (US)
Pages (from-to): 1747-1757
Number of pages: 11
Journal: Journal of the American Society for Information Science and Technology
Volume: 63
Issue number: 9
DOI: 10.1002/asi.22653
State: Published - Sep 2012
Externally published: Yes

Fingerprint

  • Text processing
  • Indexing
  • Scalability
  • Normalization
  • Social process
  • Popularity
  • Costs
  • Coverage
  • Substitute
  • Tag
  • Social tagging
  • Resources
  • Annotation

Keywords

  • automatic indexing
  • automatic taxonomy generation
  • knowledge representation

ASJC Scopus subject areas

  • Software
  • Artificial Intelligence
  • Information Systems
  • Human-Computer Interaction
  • Computer Networks and Communications

Cite this

@article{ec74b3420ae14ff08fa8864644e64502,
title = "Social tagging is no substitute for controlled indexing: A comparison of Medical Subject Headings and CiteULike tags assigned to 231,388 papers",
abstract = "Social tagging and controlled indexing both facilitate access to information resources. Given the increasing popularity of social tagging and the limitations of controlled indexing (primarily cost and scalability), it is reasonable to investigate to what degree social tagging could substitute for controlled indexing. In this study, we compared CiteULike tags to Medical Subject Headings (MeSH) terms for 231,388 citations indexed in MEDLINE. In addition to descriptive analyses of the data sets, we present a paper-by-paper analysis of tags and MeSH terms: the number of common annotations, Jaccard similarity, and coverage ratio. In the analysis, we apply three increasingly progressive levels of text processing, ranging from normalization to stemming, to reduce the impact of lexical differences. Annotations of our corpus consisted of over 76,968 distinct tags and 21,129 distinct MeSH terms. The top 20 tags/MeSH terms showed little direct overlap. On a paper-by-paper basis, the number of common annotations ranged from 0.29 to 0.5 and the Jaccard similarity from 2.12{\%} to 3.3{\%} using increased levels of text processing. At most, 77,834 citations (33.6{\%}) shared at least one annotation. Our results show that CiteULike tags and MeSH terms are quite distinct lexically, reflecting different viewpoints/processes between social tagging and controlled indexing.",
keywords = "automatic indexing, automatic taxonomy generation, knowledge representation",
author = "Lee, Danielle H. and Schleyer, Titus",
year = "2012",
month = "9",
doi = "10.1002/asi.22653",
language = "English (US)",
volume = "63",
pages = "1747--1757",
journal = "Journal of the Association for Information Science and Technology",
issn = "2330-1635",
publisher = "John Wiley and Sons Ltd",
number = "9",
}

TY - JOUR

T1 - Social tagging is no substitute for controlled indexing

T2 - A comparison of Medical Subject Headings and CiteULike tags assigned to 231,388 papers

AU - Lee, Danielle H.

AU - Schleyer, Titus

PY - 2012/9

Y1 - 2012/9

AB - Social tagging and controlled indexing both facilitate access to information resources. Given the increasing popularity of social tagging and the limitations of controlled indexing (primarily cost and scalability), it is reasonable to investigate to what degree social tagging could substitute for controlled indexing. In this study, we compared CiteULike tags to Medical Subject Headings (MeSH) terms for 231,388 citations indexed in MEDLINE. In addition to descriptive analyses of the data sets, we present a paper-by-paper analysis of tags and MeSH terms: the number of common annotations, Jaccard similarity, and coverage ratio. In the analysis, we apply three increasingly progressive levels of text processing, ranging from normalization to stemming, to reduce the impact of lexical differences. Annotations of our corpus consisted of over 76,968 distinct tags and 21,129 distinct MeSH terms. The top 20 tags/MeSH terms showed little direct overlap. On a paper-by-paper basis, the number of common annotations ranged from 0.29 to 0.5 and the Jaccard similarity from 2.12% to 3.3% using increased levels of text processing. At most, 77,834 citations (33.6%) shared at least one annotation. Our results show that CiteULike tags and MeSH terms are quite distinct lexically, reflecting different viewpoints/processes between social tagging and controlled indexing.

KW - automatic indexing

KW - automatic taxonomy generation

KW - knowledge representation

UR - http://www.scopus.com/inward/record.url?scp=84865243403&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84865243403&partnerID=8YFLogxK

U2 - 10.1002/asi.22653

DO - 10.1002/asi.22653

M3 - Article

AN - SCOPUS:84865243403

VL - 63

SP - 1747

EP - 1757

JO - Journal of the Association for Information Science and Technology

JF - Journal of the Association for Information Science and Technology

SN - 2330-1635

IS - 9

ER -