Distribution of information in biomedical abstracts and full-text publications
![Page 1: Distribution of information in biomedical abstracts and full-text publications](https://reader035.fdocuments.us/reader035/viewer/2022070402/56813874550346895da02223/html5/thumbnails/1.jpg)
Distribution of information in biomedical abstracts and full-text publications
M. J. Schuemie et al.
Dept. of Medical Informatics, Erasmus University Medical Center Rotterdam, Netherlands
![Page 2: Distribution of information in biomedical abstracts and full-text publications](https://reader035.fdocuments.us/reader035/viewer/2022070402/56813874550346895da02223/html5/thumbnails/2.jpg)
Abstract
Motivation:
– Full-text documents hold more information than their abstracts.
– Investigated the added value of full text over abstracts in terms of “information content” and “occurrences of gene symbol–gene name combinations” that can resolve gene-symbol ambiguity.
![Page 3: Distribution of information in biomedical abstracts and full-text publications](https://reader035.fdocuments.us/reader035/viewer/2022070402/56813874550346895da02223/html5/thumbnails/3.jpg)
Cont’d
Results:
– Analyzed 3902 biomedical full-text articles
– Information density is highest in abstracts
– Information coverage in full text is much greater than in abstracts
– The highest information coverage is located in the results section (out of 5 sections)
– 30-40% of the information mentioned in each section is unique to that section
![Page 4: Distribution of information in biomedical abstracts and full-text publications](https://reader035.fdocuments.us/reader035/viewer/2022070402/56813874550346895da02223/html5/thumbnails/4.jpg)
Cont’d
Results:
– Only 30% of the gene symbols in the ABSTRACT are accompanied by their corresponding names, and a further 8% of the gene names (whose symbols appear in the abstract) are found in the full text
– In the FULL TEXT, only 18% of the gene symbols are accompanied by their gene names
![Page 5: Distribution of information in biomedical abstracts and full-text publications](https://reader035.fdocuments.us/reader035/viewer/2022070402/56813874550346895da02223/html5/thumbnails/5.jpg)
Introduction
Limited evaluation of the added value of full-text documents
– Friedman et al. (2001) found that, in an article containing 19 unique molecular interactions, only 7 were found in the abstract
– Yu et al. (2002) found that more synonyms of genes and proteins can be retrieved, and with higher precision, from full-text documents than from abstracts
Shah et al. (2003) performed a more systematic comparison of abstracts and full-text articles.
![Page 6: Distribution of information in biomedical abstracts and full-text publications](https://reader035.fdocuments.us/reader035/viewer/2022070402/56813874550346895da02223/html5/thumbnails/6.jpg)
Cont’d
They analyzed 104 full-text articles that contained all five standard sections: Abstract, Introduction, Methods, Results and Discussion.
They showed that the highest frequency of keywords occurred in the abstract.
With a limited list of gene names, they also found that the abstract and introduction have the highest frequency of gene names.
![Page 7: Distribution of information in biomedical abstracts and full-text publications](https://reader035.fdocuments.us/reader035/viewer/2022070402/56813874550346895da02223/html5/thumbnails/7.jpg)
Cont’d
Shah et al. (2003) selected keywords by choosing single-word nouns that have a high K-value.
The K-value for a word $w_i$:
$$K_i = \frac{\sum_{j \neq i} W_i W_j}{W_i}$$
where $W_i W_j$ is the number of times that $w_i$ and $w_j$ appear together in a sentence and $W_i$ is the number of times that $w_i$ appears in the text.
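The K-value described above can be sketched in a few lines of Python. This is only an illustration, not the implementation used by Shah et al.: the function name `k_values` is hypothetical, and the sketch assumes that each word is counted once per sentence, so that both the occurrence count and the co-occurrence counts are numbers of sentences.

```python
from collections import Counter


def k_values(sentences):
    """Compute a K-value per word from a list of tokenized sentences.

    Assumption (not stated in the slides): a word is counted once per
    sentence, so its occurrence count is the number of sentences that
    contain it, and its co-occurrence count with another word is the
    number of sentences that contain both.
    """
    occurrences = Counter()    # sentences containing each word
    cooccurrences = Counter()  # co-occurrence counts summed over all other words
    for sentence in sentences:
        words = set(sentence)
        for word in words:
            occurrences[word] += 1
            # Every other distinct word in this sentence co-occurs once.
            cooccurrences[word] += len(words) - 1
    return {w: cooccurrences[w] / occurrences[w] for w in occurrences}


sentences = [
    ["gene", "binds", "protein"],
    ["gene", "mutates"],
]
k = k_values(sentences)
# "gene" appears in 2 sentences with 2 + 1 = 3 co-occurring words: K = 1.5
```

Under this reading, the K-value of a word is the average number of other distinct words in the sentences that contain it, so words that tend to sit in long sentences score higher.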
![Page 8: Distribution of information in biomedical abstracts and full-text publications](https://reader035.fdocuments.us/reader035/viewer/2022070402/56813874550346895da02223/html5/thumbnails/8.jpg)
Cont’d
However, it is unclear why words with a high K-value (i.e. words in relatively long sentences) should be preferentially considered keywords.
We seek to improve on the research of Shah et al. by
– Using more methodologically sound measures
– Including both single- and multiple-word terms and a more extensive list of gene names
– Using a larger test corpus
![Page 9: Distribution of information in biomedical abstracts and full-text publications](https://reader035.fdocuments.us/reader035/viewer/2022070402/56813874550346895da02223/html5/thumbnails/9.jpg)
Methods – Document set
3902 full-text documents
– 1275 publications from Nature Genetics
– All 2754 publications from BioMed Central, spanning 89 different journals
– 127 (3.2%) of these 4029 articles were not indexed in MEDLINE and were discarded because they were mostly letters and corrections with little relevance to the field, leaving 3902
![Page 10: Distribution of information in biomedical abstracts and full-text publications](https://reader035.fdocuments.us/reader035/viewer/2022070402/56813874550346895da02223/html5/thumbnails/10.jpg)
Methods – Keyword identification
Five strategies to identify keywords:
– (1) MeSH headings: The MeSH terms manually attached to a publication. Headings under the category Miscellaneous were removed.
– (2) Exploded MeSH headings: MeSH headings extended with their children as defined in the thesaurus. E.g., if ‘Parasitic Disease’ was assigned as a MeSH heading, then ‘Malaria’ would also be identified as a keyword.
– (3) TF*IDF: MeSH terms with a higher TF*IDF score are considered to be more relevant keywords
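As an illustration of the TF*IDF strategy, the sketch below scores the MeSH terms found in one document against a small corpus. It uses the common textbook variant (raw term count times the natural log of the inverse document frequency); the slides do not say which variant was used, so that choice, the function name `tf_idf`, and the toy corpus are all assumptions.

```python
import math
from collections import Counter


def tf_idf(doc_terms, corpus):
    """Score each term in doc_terms by TF*IDF.

    doc_terms: list of MeSH terms found in one document (repeats allowed).
    corpus:    list of such lists, one per document. It must include the
               document itself, so every term has a document frequency >= 1.
    Assumed variant: tf = raw count, idf = ln(N / document frequency).
    """
    n_docs = len(corpus)
    doc_freq = Counter()
    for terms in corpus:
        doc_freq.update(set(terms))  # count each term once per document
    tf = Counter(doc_terms)
    return {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}


# Toy corpus of three documents, each a list of MeSH terms found in its text.
corpus = [
    ["Malaria", "Malaria", "Humans"],
    ["Humans", "Neoplasms"],
    ["Humans"],
]
scores = tf_idf(corpus[0], corpus)
# "Humans" occurs in every document, so its idf (and score) is 0;
# "Malaria" occurs twice in one document only, so it scores 2 * ln(3).
```

A term that appears in almost every document gets a score near zero, which is why TF*IDF favours terms that are specific to a publication.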
![Page 11: Distribution of information in biomedical abstracts and full-text publications](https://reader035.fdocuments.us/reader035/viewer/2022070402/56813874550346895da02223/html5/thumbnails/11.jpg)
Cont’d
Five strategies to identify keywords:
– (4) Gene terms: used a self-constructed thesaurus of human gene names and symbols extracted from five genetic databases: GDB, Genew, LocusLink, OMIM, and Swiss-Prot.
– (5) MeSH terms per semantic type: The MeSH hierarchy classifies terms into different semantic classes. Three important categories within biomedical research were used: Organisms, Diseases, and Chemicals and Drugs. Additionally, Genes was included as the fourth type.
![Page 12: Distribution of information in biomedical abstracts and full-text publications](https://reader035.fdocuments.us/reader035/viewer/2022070402/56813874550346895da02223/html5/thumbnails/12.jpg)
Methods – Information measures
Two important concepts for describing the information content of a piece of text:
– Information density
– Information coverage
Information coverage measures were calculated in terms of the fraction of the total information in a paper that was described in a part of that paper.
![Page 13: Distribution of information in biomedical abstracts and full-text publications](https://reader035.fdocuments.us/reader035/viewer/2022070402/56813874550346895da02223/html5/thumbnails/13.jpg)
Information density measures
Heading Density (HD): The number of instances of MeSH headings in the text divided by the number of words
Exploded Heading Density (XHD)
Weighted MeSH Term (WMT) density: TF*IDF as weight for each term
Gene Density (GD)
Semantic Type Density (STD)
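The density measures above share one shape: count term instances in a piece of text and divide by its length in words. A minimal sketch follows, assuming whitespace tokenization and case-insensitive exact phrase matching; the function name `density` and the optional `weights` argument (used here to approximate a weighted density) are hypothetical and not taken from the paper.

```python
def density(tokens, terms, weights=None):
    """Instances of `terms` in `tokens`, divided by the number of tokens.

    tokens:  list of words in the text (or in one section of it).
    terms:   iterable of keyword strings; multi-word terms are allowed.
    weights: optional dict mapping a term to its weight (e.g. a TF*IDF
             score). Without weights, every instance counts as 1.
    Assumption: matching is exact and case-insensitive on whole tokens.
    """
    lowered = [t.lower() for t in tokens]
    if not lowered:
        return 0.0
    total = 0.0
    for term in terms:
        phrase = term.lower().split()
        if not phrase:
            continue  # ignore empty terms
        n = len(phrase)
        hits = sum(
            1
            for start in range(len(lowered) - n + 1)
            if lowered[start:start + n] == phrase
        )
        weight = 1.0 if weights is None else weights.get(term, 0.0)
        total += weight * hits
    return total / len(lowered)


tokens = "Malaria is a parasitic disease ; malaria kills".split()
headings = {"Malaria", "Parasitic Disease"}
hd = density(tokens, headings)  # 3 instances / 8 tokens = 0.375
wmt = density(tokens, headings, {"Malaria": 2.0, "Parasitic Disease": 0.5})
```

Passing the exploded headings, a gene thesaurus, or the terms of one semantic type as `terms` gives the corresponding density in the same way.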
![Page 14: Distribution of information in biomedical abstracts and full-text publications](https://reader035.fdocuments.us/reader035/viewer/2022070402/56813874550346895da02223/html5/thumbnails/14.jpg)
Information coverage measures
WMT fraction
Heading Fraction (HF)
Exploded Heading Fraction (XHF)
Gene Fraction (GF)
Exploded Heading Uniqueness (XHU): The fraction of the MeSH headings, including children, mentioned in a section that was not mentioned in any other section.
Gene Uniqueness (GU)
Semantic Type Fraction (STF)
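The coverage measures can be sketched with plain set operations. The code below is an illustration under stated assumptions: the helper names (`explode`, `fraction`, `uniqueness`), the toy hierarchy, and the toy section contents are hypothetical, and the "total information" of a paper is taken to be the set of keywords mentioned anywhere in it.

```python
def explode(headings, children):
    """Return the headings plus all their descendants.

    children: dict mapping a heading to a list of its direct children
              (a toy stand-in for the MeSH hierarchy).
    """
    result = set()
    stack = list(headings)
    while stack:
        heading = stack.pop()
        if heading not in result:
            result.add(heading)
            stack.extend(children.get(heading, ()))
    return result


def fraction(section_terms, all_terms):
    """Share of the paper's keywords that are mentioned in one section."""
    if not all_terms:
        return 0.0
    return len(section_terms & all_terms) / len(all_terms)


def uniqueness(section, mentions):
    """Share of a section's keywords that appear in no other section.

    mentions: dict mapping a section name to the set of keywords in it.
    """
    own = mentions[section]
    if not own:
        return 0.0
    others = set().union(
        *(terms for name, terms in mentions.items() if name != section)
    )
    return len(own - others) / len(own)


# Toy hierarchy and toy section contents, for illustration only.
children = {"Parasitic Disease": ["Malaria"], "Malaria": ["Malaria, Cerebral"]}
exploded = explode({"Parasitic Disease"}, children)

mentions = {
    "Abstract": {"Malaria"},
    "Methods": {"Malaria", "Chloroquine"},
    "Results": {"Chloroquine", "Humans", "Plasmodium"},
}
all_terms = set().union(*mentions.values())
results_fraction = fraction(mentions["Results"], all_terms)  # 3 of 4 keywords
results_uniqueness = uniqueness("Results", mentions)  # 2 of 3 are unique
```

Restricting the keyword sets to exploded headings or to gene terms gives the exploded-heading and gene variants of the fraction and uniqueness measures.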
![Page 15: Distribution of information in biomedical abstracts and full-text publications](https://reader035.fdocuments.us/reader035/viewer/2022070402/56813874550346895da02223/html5/thumbnails/15.jpg)
Results
The keyword density was highest in the Abstract and lowest in the Methods and Discussion sections.
The keyword fraction was highest in the Results section.
The highest gene fraction was found in the Methods and Results sections.
Neither Exploded Heading Uniqueness nor Gene Uniqueness differed significantly between sections.
![Page 16: Distribution of information in biomedical abstracts and full-text publications](https://reader035.fdocuments.us/reader035/viewer/2022070402/56813874550346895da02223/html5/thumbnails/16.jpg)
Abstract versus full text
![Page 17: Distribution of information in biomedical abstracts and full-text publications](https://reader035.fdocuments.us/reader035/viewer/2022070402/56813874550346895da02223/html5/thumbnails/17.jpg)
Abstract versus full text
![Page 18: Distribution of information in biomedical abstracts and full-text publications](https://reader035.fdocuments.us/reader035/viewer/2022070402/56813874550346895da02223/html5/thumbnails/18.jpg)
Density among sections (keywords)
![Page 19: Distribution of information in biomedical abstracts and full-text publications](https://reader035.fdocuments.us/reader035/viewer/2022070402/56813874550346895da02223/html5/thumbnails/19.jpg)
Fraction among sections (keywords)
![Page 20: Distribution of information in biomedical abstracts and full-text publications](https://reader035.fdocuments.us/reader035/viewer/2022070402/56813874550346895da02223/html5/thumbnails/20.jpg)
Gene Fraction and Density among sections
![Page 21: Distribution of information in biomedical abstracts and full-text publications](https://reader035.fdocuments.us/reader035/viewer/2022070402/56813874550346895da02223/html5/thumbnails/21.jpg)
Uniqueness
![Page 22: Distribution of information in biomedical abstracts and full-text publications](https://reader035.fdocuments.us/reader035/viewer/2022070402/56813874550346895da02223/html5/thumbnails/22.jpg)
Semantic Type analysis
The semantic types Disease and Genes were found in relatively low density in the Methods section.
The widest variety (coverage) of “Chemicals and Drugs” was discussed in the Methods section.
![Page 23: Distribution of information in biomedical abstracts and full-text publications](https://reader035.fdocuments.us/reader035/viewer/2022070402/56813874550346895da02223/html5/thumbnails/23.jpg)
Semantic Type Density distribution
![Page 24: Distribution of information in biomedical abstracts and full-text publications](https://reader035.fdocuments.us/reader035/viewer/2022070402/56813874550346895da02223/html5/thumbnails/24.jpg)
Semantic Type Coverage distribution
![Page 25: Distribution of information in biomedical abstracts and full-text publications](https://reader035.fdocuments.us/reader035/viewer/2022070402/56813874550346895da02223/html5/thumbnails/25.jpg)
Discussion
The Methods section was richest in information on Chemicals and Drugs, whilst Disease and Genes were mentioned less frequently in the Methods section than in other sections.
Since named-entity extraction algorithms are reported to have difficulties in distinguishing between gene names and chemical entities, not applying these algorithms to the Methods section might improve their performance.
![Page 26: Distribution of information in biomedical abstracts and full-text publications](https://reader035.fdocuments.us/reader035/viewer/2022070402/56813874550346895da02223/html5/thumbnails/26.jpg)
Cont’d
The results agree on several points with those obtained by Shah et al.
However, Shah et al. reported the highest coverage in the Introduction and Methods and the lowest in the Results section, whilst our results showed it to be highest in the Results section.
The difference is most likely due to differences between the keyword measure used by Shah et al. and our measures.