An automated workflow was developed to standardise the names of alien taxa for South African alien species lists. The workflow implements taxonomic backbones that are relevant to the South African context.
Three taxonomic backbones are implemented: that of the Global Biodiversity Information Facility (GBIF); that of the Botanical Database of Southern Africa (BODATSA); and that of the World Checklist of Vascular Plants (WCVP), accessed via Plants of the World Online (POWO) (https://powo.science.kew.org/).
The taxonomic backbone used depends on the taxonomic group:
The GBIF backbone is used for taxa that are not vascular plants (e.g., animals, fungi, and chromista)
The BODATSA backbone is used for alien vascular plants recorded outside of captivity and cultivation in South Africa (i.e., those taxa included in BODATSA)
The WCVP backbone is used for alien vascular plants not found outside of captivity and cultivation in South Africa (i.e., those taxa not included in BODATSA)
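The three rules above can be read as a simple decision. The sketch below is only an illustration of that decision, not the workflow's own code; the function name and its two boolean arguments are hypothetical.

```python
def choose_backbone(is_vascular_plant: bool, in_bodatsa: bool) -> str:
    """Return the backbone that the rules above would select for a taxon.

    Both arguments are hypothetical inputs used for illustration only.
    """
    if not is_vascular_plant:
        # Animals, fungi, chromista and other non-vascular-plant taxa
        return "GBIF"
    # Vascular plants: BODATSA if the taxon is included there, otherwise WCVP
    return "BODATSA" if in_bodatsa else "WCVP"
```

For example, a vascular plant that is not included in BODATSA would be standardised against WCVP.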
In addition to standardising a list of taxon names based on the three taxonomic backbones, the workflow obtains canonical names and higher taxonomic information for the taxa from GBIF, and detects, corrects, and flags potential issues and errors.
This document describes the outputs of the workflow.
Global biodiversity data standards as set by the Darwin Core are followed as far as is possible (descriptions of the Darwin Core fields are available at: https://dwc.tdwg.org/terms/).
The final output is saved in the processed/ folder as a csv named ‘Taxon_names_standardised.csv’. The final output contains 18 columns.
Description: The taxon name as provided by the user
Darwin Core: dwc:verbatimScientificName
Type: Character
Description: The scientific name of the taxon without authorship and date information, as per GBIF
Type: Character (value of NA is possible if the taxon is not available on GBIF)
Description: The full scientific name of the kingdom in which the taxon is classified, as per GBIF
Darwin Core: dwc:kingdom
Type: Factor (value of NA is possible if the taxon is not available on GBIF)
Description: The full scientific name of the phylum in which the taxon is classified, as per GBIF
Darwin Core: dwc:phylum
Type: Factor (value of NA is possible if the taxon is not available on GBIF)
Description: The full scientific name of the class in which the taxon is classified, as per GBIF
Darwin Core: dwc:class
Type: Factor (value of NA is possible if the value is not available on GBIF)
Description: The full scientific name of the order in which the taxon is classified, as per GBIF
Darwin Core: dwc:order
Type: Factor (value of NA is possible if the value is not available on GBIF)
Description: The full scientific name of the family in which the taxon is classified, as per GBIF
Darwin Core: dwc:family
Type: Factor (value of NA is possible if the taxon is not available on GBIF)
Description: The accepted scientific name of the taxon with authorship and date information, as per the relevant taxonomic backbone
Darwin Core: dwc:scientificName
Type: Character (value of NA is possible if the verbatimScientificName could not be matched to a name in the relevant taxonomic backbone. Can be several values separated by a pipe delimiter if verbatimScientificName was a taxonomic homonym and was matched to multiple names in the relevant taxonomic backbone)
Description: If the verbatimScientificName is a synonym of an accepted scientific name (i.e., TRUE). Includes instances where verbatimScientificName is an unused autonym, and the species that it has been elevated to is a synonym.
Type: Factor with two levels (TRUE and FALSE)
Description: If the verbatimScientificName has been matched to an accepted scientific name (i.e., scientificName) through a non-exact match (e.g., fuzzy matching) (i.e., TRUE)
Type: Factor with two levels (TRUE and FALSE)
Description: If the verbatimScientificName was either matched to multiple accepted scientific names, and this could not be resolved (i.e., taxonomic homonym); or has a status that is not accepted or synonym; or could only be matched at a higher taxonomic level (e.g., genus) (i.e., TRUE). Statuses other than accepted or synonym include doubtful as per GBIF; uncertain as per BODATSA; and invalid, illegitimate and misapplied as per WCVP.
Type: Factor with two levels (TRUE and FALSE)
Description: If the verbatimScientificName could not be matched to an accepted scientific name in the relevant taxonomic backbone; or was matched to an accepted scientific name at a higher taxonomic level (e.g., genus for a species, or species for an unused autonym) (i.e., TRUE).
Type: Factor with two levels (TRUE and FALSE)
Description: If any of unmatched.flag, uncertain.flag or doubtful.flag are TRUE (i.e., TRUE). The outputs for these taxon names should be examined manually.
Type: Factor with two levels (TRUE and FALSE)
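As an illustration of how this combined flag relates to the three flags it summarises, the following minimal sketch computes it for one row. The flag names are those given in the description above; representing a row as a dictionary of booleans, and the function name, are assumptions made for the example.

```python
# Flag names as given in the description above
FLAG_COLUMNS = ("unmatched.flag", "uncertain.flag", "doubtful.flag")


def needs_manual_check(row: dict) -> bool:
    """Return True if any of the three flags is True for this row.

    Assumes the flags are already booleans, not the strings "TRUE"/"FALSE".
    """
    return any(row[column] for column in FLAG_COLUMNS)


example_row = {
    "unmatched.flag": False,
    "uncertain.flag": False,
    "doubtful.flag": True,
}
print(needs_manual_check(example_row))
```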
Description: If the verbatimScientificName is a sub-specific entity (i.e., TRUE). These taxon names may need to be examined manually.
Type: Factor with two levels (TRUE and FALSE)
Description: If the scientificName appears multiple times in the list (i.e., TRUE). These taxon names will need to be removed manually.
Type: Factor with two levels (TRUE and FALSE)
Description: If the verbatimScientificName appears multiple times in the list (i.e., TRUE). These taxon names will need to be removed manually.
Type: Factor with two levels (TRUE and FALSE)
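The two duplicate flags can be pictured as a count of how often each name occurs in its column. The sketch below is one way to compute such a flag; the function name is hypothetical, and treating missing values (None) as never duplicated is an assumption, since the workflow's handling of NA is not described here.

```python
from collections import Counter


def flag_duplicates(names):
    """Return one boolean per name: True if the name occurs more than once.

    Missing values (None) are never flagged; this is an assumption.
    """
    counts = Counter(name for name in names if name is not None)
    return [name is not None and counts[name] > 1 for name in names]
```

Every occurrence of a repeated name is flagged, not only the second and later ones, so that all copies can be reviewed before any are removed.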
Description: If the verbatimScientificName contains open nomenclature abbreviations that indicate uncertainty or the provisional status of the taxon’s identification (“sp.”, “spp.”, “cf.”, “nr.”, “aff.”, “?”) (i.e., TRUE). These taxon names may need to be examined manually.
Type: Factor with two levels (TRUE and FALSE)
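A check of this kind can be sketched with a regular expression. The pattern below is an assumption about how the listed abbreviations might be detected, not the workflow's actual rule; it requires that the abbreviation is not preceded by a letter, so that “subsp.” is not mistaken for “sp.”. The example names are placeholders.

```python
import re

# Matches sp., spp., cf., nr. and aff. when not preceded by a letter,
# or a question mark anywhere in the name.
OPEN_NOMENCLATURE = re.compile(r"(?<![A-Za-z])(?:spp?|cf|nr|aff)\.|\?")


def has_open_nomenclature(name: str) -> bool:
    """Return True if the name contains one of the listed abbreviations."""
    return OPEN_NOMENCLATURE.search(name) is not None


for example in ["Aus sp.", "Aus cf. bus", "Aus bus subsp. cus", "Aus bus"]:
    print(example, has_open_nomenclature(example))
```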
Description: The taxonomic backbone used for the scientificName, with the date it was accessed and a web-link as appropriate
Type: Factor with three levels (BODATSA, POWO, GBIF) (value of NA is possible if the verbatimScientificName could not be matched to a name in the relevant taxonomic backbone. Can be several values separated by a pipe delimiter if verbatimScientificName was a taxonomic homonym and was matched to multiple names in the relevant taxonomic backbone)
Value and definition
BODATSA: Botanical Database of Southern Africa
POWO: Plants of the World Online
GBIF: Global Biodiversity Information Facility
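Several columns in this output can hold more than one value separated by a pipe delimiter. When reading the csv, such cells may need to be split before use. The helper below is a minimal sketch; its name is hypothetical, and it assumes that missing values are written as the text “NA”.

```python
def split_pipe(value):
    """Split a pipe-delimited cell into a list of trimmed values.

    Returns an empty list for missing values (None, "" or the text "NA").
    """
    if value is None or value.strip() in ("", "NA"):
        return []
    return [part.strip() for part in value.split("|")]
```

A cell holding a single value is returned as a one-element list, so downstream code can treat homonym and non-homonym rows in the same way.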
The output is saved in the interim/ folder as a csv named ‘GBIFInterim.csv’. The output contains 13 columns.
Description: The taxon name as provided by the user
Darwin Core: dwc:verbatimScientificName
Type: Character
Description: The scientific name of the taxon without authorship and date information, as per GBIF
Type: Character (value of NA is possible if the taxon is not available on GBIF)
Description: The full scientific name of the kingdom in which the taxon is classified, as per GBIF
Darwin Core: dwc:kingdom
Type: Factor (value of NA is possible if the taxon is not available on GBIF)
Description: The full scientific name of the phylum in which the taxon is classified, as per GBIF
Darwin Core: dwc:phylum
Type: Factor (value of NA is possible if the taxon is not available on GBIF)
Description: The full scientific name of the class in which the taxon is classified, as per GBIF
Darwin Core: dwc:class
Type: Factor (value of NA is possible if the value is not available on GBIF)
Description: The full scientific name of the order in which the taxon is classified, as per GBIF
Darwin Core: dwc:order
Type: Factor (value of NA is possible if the value is not available on GBIF)
Description: The full scientific name of the family in which the taxon is classified, as per GBIF
Darwin Core: dwc:family
Type: Factor (value of NA is possible if the taxon is not available on GBIF)
Description: The accepted scientific name of the taxon with authorship and date information, as per GBIF
Darwin Core: dwc:scientificName
Type: Character (value of NA is possible if the taxon name could not be matched to a name in the relevant taxonomic backbone)
Description: If the verbatimScientificName is a synonym of an accepted scientific name (i.e., TRUE)
Type: Factor with two levels (TRUE and FALSE)
Description: The precise taxonomic level of the organism as per verbatimScientificName
Type: Factor (levels include: SPECIES, FAMILY, GENUS, SUBSPECIES, VARIETY, FORM. Value of NA is possible)
Description: Indicates if the verbatimScientificName is an accepted scientific name
Type: Factor with three levels (ACCEPTED, SYNONYM, DOUBTFUL. Value of NA is possible)
Value and definition
ACCEPTED: The verbatimScientificName is an accepted scientificName
SYNONYM: The verbatimScientificName is a synonym
DOUBTFUL: A taxon name can be flagged as doubtful for several reasons. These include: the name comes from a nomenclator only; there is more than one accepted name with the same canonical name but a different author (in that case, the most trusted one is accepted and the others are flagged doubtful); the name has homonyms in the same source; the name is a basionym placeholder and contains “? epithet”; the name is a genus or higher taxon without any species; the name is a species or an infra-species with a parent genus or species considered to be a synonym.
Description: How the verbatimScientificName was matched to the scientificName
Type: Factor with four levels (EXACT, FUZZY, HIGHERRANK, NONE)
Value and definition
EXACT: verbatimScientificName was matched to the scientificName through an exact match
FUZZY: verbatimScientificName was matched to the scientificName through fuzzy matching
HIGHERRANK: verbatimScientificName was matched to a scientificName at a higher taxonomic level (e.g., genus)
NONE: verbatimScientificName was not matched to a scientificName
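To show how these match types could relate to the flags in the final output, the sketch below derives a fuzzy-match flag and an unmatched flag from a match type. This mapping is an assumption based on the flag descriptions in this document (non-exact matches are flagged as fuzzy; names that are unmatched or matched only at a higher level are flagged as unmatched) and may differ from the workflow's actual logic. The function name and dictionary keys are hypothetical.

```python
def flags_from_match_type(match_type: str) -> dict:
    """Derive two illustrative flags from a GBIF match type.

    The mapping is an assumption, not the workflow's confirmed behaviour.
    """
    return {
        "fuzzy": match_type == "FUZZY",
        "unmatched": match_type in {"NONE", "HIGHERRANK"},
    }
```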
Description: Equal to the taxonKey. A unique number used in GBIF to identify a taxon
Type: Character (value of NA is possible)
The output is saved in the interim/ folder as a csv named ‘BODATSAInterim.csv’. The output contains 25 columns.
Description: The taxon name as provided by the user
Darwin Core: dwc:verbatimScientificName
Type: Character
Description: The scientific name of the taxon without authorship and date information, as per GBIF
Type: Character (value of NA is possible if the taxon is not available on GBIF)
Description: The full scientific name of the kingdom in which the taxon is classified, as per GBIF
Darwin Core: dwc:kingdom
Type: Factor (value of NA is possible if the taxon is not available on GBIF)
Description: The full scientific name of the phylum in which the taxon is classified, as per GBIF
Darwin Core: dwc:phylum
Type: Factor (value of NA is possible if the taxon is not available on GBIF)
Description: The full scientific name of the class in which the taxon is classified, as per GBIF
Darwin Core: dwc:class
Type: Factor (value of NA is possible if the value is not available on GBIF)
Description: The full scientific name of the order in which the taxon is classified, as per GBIF
Darwin Core: dwc:order
Type: Factor (value of NA is possible if the value is not available on GBIF)
Description: The full scientific name of the family in which the taxon is classified, as per GBIF
Darwin Core: dwc:family
Type: Factor (value of NA is possible if the taxon is not available on GBIF)
Description: The full scientific name of the family in which the taxon is classified, as per BODATSA
Type: Factor (value of NA is possible if the verbatimScientificName could not be matched to a name in BODATSA)
Description: The accepted scientific name of the taxon with authorship and date information, as per BODATSA
Darwin Core: dwc:scientificName
Type: Character (value of NA is possible if the verbatimScientificName could not be matched to a name in BODATSA. Can be several values separated by a pipe delimiter if verbatimScientificName was a taxonomic homonym and was matched to multiple names in the relevant taxonomic backbone)
Description: The generic epithet of the verbatimScientificName, as per BODATSA
Type: Factor (value of NA is possible if the verbatimScientificName could not be matched to a name in BODATSA)
Description: The specific epithet of the verbatimScientificName, as per BODATSA
Type: Character (value of NA is possible if the verbatimScientificName could not be matched to a name in BODATSA)
Description: Indicates if verbatimScientificName is a sub-specific entity
Type: Factor with three levels (subsp., var., f.); value of NA is possible if the verbatimScientificName could not be matched to a name in BODATSA or if the verbatimScientificName is not a sub-specific entity
Value and definition
subsp.: subspecies
var.: variety
f.: form
Description: First sub-specific epithet of the verbatimScientificName, as per BODATSA
Type: Character (value of NA is possible if the verbatimScientificName could not be matched to a name in BODATSA or if the verbatimScientificName is not a sub-specific entity)
Description: Indicates if verbatimScientificName contains a second sub-specific epithet
Type: Factor with two levels (var., f.); value of NA is possible if the verbatimScientificName could not be matched to a name in BODATSA or if the verbatimScientificName does not contain a second sub-specific epithet
Value and definition
var.: variety
f.: form
Description: Second sub-specific epithet of the verbatimScientificName, as per BODATSA
Type: Character (value of NA is possible if the verbatimScientificName could not be matched to a name in BODATSA or if the verbatimScientificName does not contain a second sub-specific epithet)
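The genus, specific epithet, and the two rank and epithet pairs described above can be pictured as the parts of a name without authorship. The sketch below splits such a name into those parts. It is an illustration only: the function name, the dictionary keys, and the placeholder name used in the example are hypothetical, and the input is assumed to contain no author or date information.

```python
RANK_MARKERS = {"subsp.", "var.", "f."}


def split_name(name: str) -> dict:
    """Split a name without authorship into genus, species and up to two
    sub-specific rank/epithet pairs. Missing parts are returned as None."""
    tokens = name.split()
    parts = {"genus": None, "species": None,
             "rank1": None, "epithet1": None,
             "rank2": None, "epithet2": None}
    if tokens:
        parts["genus"] = tokens[0]
    if len(tokens) > 1 and tokens[1] not in RANK_MARKERS:
        parts["species"] = tokens[1]
    slot, i = 1, 2
    # Each rank marker must be followed by its epithet, hence len(tokens) - 1
    while i < len(tokens) - 1 and slot <= 2:
        if tokens[i] in RANK_MARKERS:
            parts[f"rank{slot}"] = tokens[i]
            parts[f"epithet{slot}"] = tokens[i + 1]
            slot += 1
            i += 2
        else:
            i += 1
    return parts


print(split_name("Aus bus subsp. cus var. dus"))
```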
Description: Indicates whether the verbatimScientificName is an accepted scientific name
Type: Factor (levels include: acc, inc, aut, syn, unc, genlv. Value of NA is possible. Can be several values separated by a pipe delimiter if verbatimScientificName was a taxonomic homonym and was matched to multiple names in the relevant taxonomic backbone, or if the taxon name is an unused autonym and the species name that it has been elevated to is a synonym)
Value and definition
acc: verbatimScientificName is currently the accepted name
syn: verbatimScientificName is currently a synonym
inc: the verbatimScientificName includes lower ranks
aut: the verbatimScientificName is an unused autonym, and the taxon has been elevated to the species level
unc: verbatimScientificName is of uncertain application
genlv: verbatimScientificName is at the genus level only
Description: The accepted scientific name if the verbatimScientificName is a synonym as per BODATSA, before formatting was corrected during Step 2a of the workflow
Type: Character (value of NA is possible if the verbatimScientificName could not be matched to a name in BODATSA or if the verbatimScientificName is not a synonym)
Description: Ecological details for the taxon
Type: Character (value of NA is possible)
Description: The accepted scientific name if the verbatimScientificName is a synonym as per BODATSA, after formatting was corrected during Step 2a of the workflow; or if the verbatimScientificName is an unused autonym, and the taxon has been elevated to the species level.
Type: Character (value of NA is possible if the verbatimScientificName could not be matched to a name in BODATSA or if the verbatimScientificName is not a synonym or unused autonym)
Description: If the accepted scientific name of a synonym was corrected during Step 2a of the workflow (i.e., TRUE)
Type: Factor with two levels (TRUE and FALSE. Value of NA is possible if the verbatimScientificName could not be matched to a name in BODATSA or if the verbatimScientificName is not a synonym)
Description: If the verbatimScientificName was matched to multiple names in BODATSA (i.e., TRUE)
Type: Factor with two levels (TRUE and FALSE)
Description: If the verbatimScientificName was matched to multiple names in BODATSA, and this issue was not resolved during the execution of the workflow (i.e., TRUE)
Type: Factor with two levels (TRUE and FALSE)
The output is saved in the interim/ folder as a csv named ‘WCVPInterim.csv’. The output contains 23 columns.
Description: The taxon name as provided by the user
Darwin Core: dwc:verbatimScientificName
Type: Character
Description: The scientific name of the taxon without authorship and date information, as per GBIF
Type: Character (value of NA is possible if the taxon is not available on GBIF)
Description: The full scientific name of the kingdom in which the taxon is classified, as per GBIF
Darwin Core: dwc:kingdom
Type: Factor (value of NA is possible if the taxon is not available on GBIF)
Description: The full scientific name of the phylum in which the taxon is classified, as per GBIF
Darwin Core: dwc:phylum
Type: Factor (value of NA is possible if the taxon is not available on GBIF)
Description: The full scientific name of the class in which the taxon is classified, as per GBIF
Darwin Core: dwc:class
Type: Factor (value of NA is possible if the value is not available on GBIF)
Description: The full scientific name of the order in which the taxon is classified, as per GBIF
Darwin Core: dwc:order
Type: Factor (value of NA is possible if the value is not available on GBIF)
Description: The full scientific name of the family in which the taxon is classified, as per GBIF
Darwin Core: dwc:family
Type: Factor (value of NA is possible if the taxon is not available on GBIF)
Description: How the verbatimScientificName was matched to the scientificName, such as through an exact match or a fuzzy match (e.g., there were spelling mistakes), and whether the author was considered
Type: Factor (levels include: Exact, Exact (without author), Fuzzy (edit distance), Fuzzy (phonetic). Value of NA is possible if no match was found)
Description: If the verbatimScientificName was matched to multiple names in WCVP (i.e., TRUE)
Type: Factor with two levels (TRUE and FALSE)
Description: The similarity between the verbatimScientificName and the name to which it was matched. The value will be 1 if exact matching is used, but will be < 1 if fuzzy matching is used
Type: Numeric (ranges between 0 and 1, with 1 being 100% similarity. Value of NA is possible if no match was found. Can be several values separated by a pipe delimiter if verbatimScientificName was a taxonomic homonym and was matched to multiple names in the relevant taxonomic backbone)
Description: The number of edits required (additions and subtractions) to transform the verbatimScientificName into the name to which it was matched. The value will be 0 if exact matching is used, but will be > 0 if fuzzy matching is used
Type: Numeric (value >= 0. Value of NA is possible if no match was found. Can be several values separated by a pipe delimiter if verbatimScientificName was a taxonomic homonym and was matched to multiple names in the relevant taxonomic backbone)
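To make the two fuzzy-matching measures above more concrete, the sketch below computes a standard Levenshtein edit distance and a normalised similarity. Both are assumptions for illustration: Levenshtein distance also counts substitutions, whereas the description above mentions additions and subtractions, and the similarity formula used by the workflow is not stated in this document.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Illustrative similarity: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - edit_distance(a, b) / longest
```

With these definitions an exact match has a distance of 0 and a similarity of 1, which agrees with the ranges given for the two columns.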
Description: A unique number used in WCVP to identify a taxon
Type: Character (can be several values separated by a pipe delimiter if verbatimScientificName was a taxonomic homonym and was matched to multiple names in the relevant taxonomic backbone)
Description: The name of the taxon as it appears in WCVP without authorship or date information. This is not necessarily the accepted scientific name
Type: Character (can be several values separated by a pipe delimiter if verbatimScientificName was a taxonomic homonym and was matched to multiple names in the relevant taxonomic backbone)
Description: The precise taxonomic level of the organism as per verbatimScientificName
Type: Factor (levels include: Species, Subspecies, Variety. Value of NA is possible)
Description: Indicates if the verbatimScientificName is an accepted scientific name
Type: Factor (levels include: Accepted, Synonym, Invalid, Illegitimate, Misapplied. Value of NA is possible. Can be several values separated by a pipe delimiter if verbatimScientificName was a taxonomic homonym and was matched to multiple names in the relevant taxonomic backbone)
Value and definition
Accepted: The verbatimScientificName is an accepted scientificName
Synonym: The verbatimScientificName is a synonym
Invalid: The verbatimScientificName is invalid
Illegitimate: The verbatimScientificName is illegitimate
Misapplied: The verbatimScientificName has mistakenly been given to a taxon
Description: If the verbatimScientificName is a homotypic synonym (i.e., TRUE)
Type: Factor with two levels (TRUE and FALSE. Value of NA is possible. Can be several values separated by a pipe delimiter if verbatimScientificName was a taxonomic homonym and was matched to multiple names in the relevant taxonomic backbone)
Description: A unique number used in IPNI to identify a taxon
Type: Character (value of NA is possible. Can be several values separated by a pipe delimiter if verbatimScientificName was a taxonomic homonym and was matched to multiple names in the relevant taxonomic backbone)
Description: A unique number used in WCVP to identify the accepted scientific name
Type: Character (can be several values separated by a pipe delimiter if verbatimScientificName was a taxonomic homonym and was matched to multiple names in the relevant taxonomic backbone)
Description: The accepted scientific name of the taxon with authorship and date information, as per WCVP
Darwin Core: dwc:scientificName
Type: Character (can be several values separated by a pipe delimiter if verbatimScientificName was a taxonomic homonym and was matched to multiple names in the relevant taxonomic backbone)
Description: A unique number used in POWO to identify the taxon
Type: Character (can be several values separated by a pipe delimiter if verbatimScientificName was a taxonomic homonym and was matched to multiple names in the relevant taxonomic backbone)
Description: If the verbatimScientificName was matched to multiple names in WCVP, and this issue was not resolved during the execution of the workflow (i.e., TRUE)
Type: Factor with two levels (TRUE and FALSE)
The output is saved in the interim/ folder as a csv named ‘BODATSAPrepared.csv’. The output contains 17 columns.
Description: The full scientific name of the family in which the taxon is classified, as per BODATSA
Type: Factor
Description: The scientific name of the taxon without authorship and date information, as per BODATSA
Type: Character
Description: The scientific name of the taxon with authorship and date information, as per BODATSA. This is not necessarily the accepted scientific name
Type: Character
Description: The generic epithet, as per BODATSA
Type: Factor (value of NA is possible)
Description: The specific epithet, as per BODATSA
Type: Character (value of NA is possible)
Description: Indicates if the taxon is a sub-specific entity
Type: Factor with three levels (subsp., var., f.); value of NA is possible
Value and definition
subsp.: subspecies
var.: variety
f.: form
Description: First sub-specific epithet, as per BODATSA
Type: Character (value of NA is possible)
Description: Indicates if there is a second sub-specific epithet
Type: Factor with two levels (var., f.); value of NA is possible
Value and definition
var.: variety
f.: form
Description: Second sub-specific epithet, as per BODATSA
Type: Character (value of NA is possible)
Description: Indicates if the taxon name (see canonicalName and scientificName) is an accepted scientific name
Type: Factor (levels include: acc, inc, syn, unc, genlv, aut, aut|syn. Value of NA is possible)
Value and definition
acc: taxon name is currently the accepted name
syn: taxon name is currently a synonym
inc: taxon name includes lower ranks
aut: taxon name is an unused autonym and has been elevated to the species level
unc: taxon name is of uncertain application
genlv: taxon name is at the genus level only
aut|syn: taxon name is an unused autonym and has been elevated to the species level, but that species name is a synonym
Description: The accepted scientific name if the taxon name is a synonym as per BODATSA, before formatting was corrected during Step 2a of the workflow
Type: Character (value of NA is possible)
Description: Ecological details for the taxon
Type: Character (value of NA is possible)
Description: The accepted scientific name if the taxon name is a synonym as per BODATSA, after formatting was corrected during Step 2a of the workflow; or the accepted scientific name if the taxon name is an unused autonym as per BODATSA
Type: Character (value of NA is possible)
Description: If the accepted scientific name of a synonym was corrected during Step 2a of the workflow (i.e., TRUE)
Type: Factor with two levels (TRUE and FALSE)