Discovering research articles containing evolutionary timetrees by machine learning

Marija Stanojevic, Jovan Andjelkovic, Adrienne Kasprowicz, Louise A. Huuki, Jennifer Chao, S. Blair Hedges, Sudhir Kumar, Zoran Obradovic

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

MOTIVATION: Timetrees depict evolutionary relationships between species and the geological times of their divergence. Hundreds of research articles containing timetrees are published in scientific journals every year. For almost two decades, the TimeTree (TT) project has been manually locating, curating and synthesizing timetrees from these articles into a TimeTree of Life, delivered through a unique, user-friendly web interface (timetree.org). The manual process of finding articles containing timetrees is becoming increasingly expensive and time-consuming. We have therefore explored the effectiveness of text-mining approaches and developed optimizations to find research articles containing timetrees automatically.

RESULTS: We have developed an optimized machine learning system to determine whether a research article contains an evolutionary timetree appropriate for inclusion in the TT resource. We found that BERT classification fine-tuned on whole-text articles achieved an F1 score of 0.67, which we increased to 0.88 by text-mining article excerpts surrounding mentions of figures. The new method is implemented in the TimeTreeFinder (TTF) tool, which automatically processes millions of articles to discover timetree-containing articles. We estimate that the TTF tool would produce twice as many timetree-containing articles as those discovered manually, whose inclusion in the TT database would potentially double the knowledge accessible to a wider community. Manual inspection showed that the precision on out-of-distribution, recently published articles is 87%. This automation will speed up the collection and curation of timetrees with much lower human and time costs.

AVAILABILITY AND IMPLEMENTATION: https://github.com/marija-stanojevic/time-tree-classification.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
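The abstract reports that classifying excerpts around figure mentions, rather than whole articles, raised the F1 score from 0.67 to 0.88. The sketch below illustrates one way such excerpts could be pulled from plain text. It is not the TimeTreeFinder implementation: the function name `figure_excerpts`, the regular expression, and the `window` size are all hypothetical choices made for illustration, and the actual tool may select excerpts quite differently.

```python
import re

# Assumed pattern for figure mentions: "Fig. 2", "Figs 3", "Figure 1", "Figures 4".
FIGURE_MENTION = re.compile(r"\bfig(?:ure)?s?\b\.?\s*\d+", re.IGNORECASE)


def figure_excerpts(text, window=30):
    """Return excerpts of up to `window` tokens on either side of the token
    where each figure mention starts. Overlapping or touching excerpts are merged.

    `window` is an illustrative default, not a value taken from the paper.
    """
    tokens = list(re.finditer(r"\S+", text))
    spans = []
    for mention in FIGURE_MENTION.finditer(text):
        # Index of the whitespace-delimited token in which the mention begins.
        idx = next(i for i, t in enumerate(tokens) if t.end() > mention.start())
        lo = max(0, idx - window)
        hi = min(len(tokens), idx + window + 1)
        if spans and lo <= spans[-1][1]:
            # Merge with the previous excerpt so text is not duplicated.
            spans[-1] = (spans[-1][0], max(spans[-1][1], hi))
        else:
            spans.append((lo, hi))
    return [" ".join(t.group() for t in tokens[lo:hi]) for lo, hi in spans]
```

Each returned excerpt could then be passed to a text classifier in place of the full article, which keeps the input short and focused on the passages most likely to describe a timetree figure.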
Original language: English
Article number: btad035
Journal: Bioinformatics
Volume: 39
Issue number: 1
State: Published - Jan 1 2023

Keywords

  • Biological Evolution
  • Humans
  • Data Mining
  • Phylogeny
  • Machine Learning
  • Databases, Factual
