Abstract Detail
Nº613/1800 - Access to the data imprisoned in 500 Million pages of biodiversity publications
Format: ORAL
Authors
Jonas Castro1
Donat Agosti2
Affiliations
1 Plazi, Porto Alegre, Brazil
2 Plazi, Bern, Switzerland
Abstract
Scientific knowledge about biodiversity is contained in billions of statements in a corpus estimated at 500 million published pages, growing daily and covering all known taxa. These statements are linked by taxonomic names, which therefore play a key role in organizing this knowledge and in accessing it. Traditionally, access has been through the citation of a publication. In the digital age, direct access to the cited statement, such as a taxonomic treatment, material citation, figure, or trait, as FAIR (Findable, Accessible, Interoperable, Reusable) data is both possible and desirable, especially to leverage the power of machines and to enable artificial intelligence applications that support research and other uses.
One way to achieve this is to annotate the texts with terms that describe the content, such as taxonomic name, collector, material citation, or taxonomic treatment, for which reference vocabularies such as Darwin Core or TaxPub/JATS exist. In a more advanced step, the annotated terms can be linked, using persistent identifiers, to external references: taxonomic names to the extended Catalogue of Life, World Flora Online (WFO), or the International Plant Names Index; cited specimens to their digital copies in the Global Biodiversity Information Facility (GBIF), DiSSCo, or individual collections; and treatments to the Biodiversity Literature Repository.
Detailed annotations allow the automatic extraction of statements to populate databases such as WFO, GBIF, or ChecklistBank. They can also serve as a training corpus for algorithms that annotate the overwhelming body of literature that is currently not digitally accessible.
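As a minimal sketch of how annotated text can be turned into database-ready records, the Python example below parses a small XML snippet and emits one record per material citation. The element names, the sample values, and the `sourceText` key are hypothetical and chosen only for illustration; they are not the actual TaxPub/JATS or Plazi markup. The keys `scientificName`, `recordedBy`, `country`, and `catalogNumber` follow Darwin Core term names.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated snippet: element names and values are illustrative only,
# not the real TaxPub/JATS or Plazi schema.
ANNOTATED = """
<treatment>
  <taxonomicName scientificName="Aus bus">Aus bus</taxonomicName>
  <materialCitation recordedBy="A. Collector" country="Brazil" catalogNumber="XYZ-0001">
    Brazil, Porto Alegre, leg. A. Collector (XYZ-0001)
  </materialCitation>
  <materialCitation recordedBy="B. Collector" country="Switzerland" catalogNumber="XYZ-0002">
    Switzerland, Bern, leg. B. Collector (XYZ-0002)
  </materialCitation>
</treatment>
"""

# Darwin Core terms assumed to be stored as attributes on each material citation.
DWC_TERMS = ("recordedBy", "country", "catalogNumber")


def extract_records(xml_text):
    """Return one dict per annotated material citation in a treatment."""
    root = ET.fromstring(xml_text.strip())
    # The taxonomic name links every statement in the treatment.
    name = root.find("taxonomicName").get("scientificName")
    records = []
    for citation in root.iter("materialCitation"):
        record = {"scientificName": name}
        for term in DWC_TERMS:
            record[term] = citation.get(term)
        # Keep the original wording, with whitespace normalized, for provenance.
        record["sourceText"] = " ".join(citation.text.split())
        records.append(record)
    return records


records = extract_records(ANNOTATED)
for record in records:
    print(record)
```

Each resulting dictionary carries the taxonomic name of its treatment, so the records can be grouped or looked up by name once loaded into a database.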
This presentation gives an introduction to annotations and describes the workflow and access services developed by Plazi to annotate and FAIR-ize publications. This work has resulted in over 800,000 taxonomic treatments and 450,000 figures in the Biodiversity Literature Repository, of which 45,000 datasets are reused by GBIF and ChecklistBank, and over 500,000 treatments in the Biodiversity PMC.