Introduction

An automated workflow was developed to standardise the names of alien taxa for South African alien species lists. The workflow implements taxonomic backbones that are relevant to the South African context.

Three taxonomic backbones are implemented: that of the Global Biodiversity Information Facility (GBIF); that of the Botanical Database of Southern Africa (BODATSA); and that of The World Checklist of Vascular Plants (WCVP), accessed via Plants of the World Online (POWO) (https://powo.science.kew.org/).

The taxonomic backbone used depends on the taxonomic group:

  1. The GBIF backbone is used for taxa that are not vascular plants (e.g., animals, fungi, and chromista)

  2. The BODATSA backbone is used for alien vascular plants recorded outside of captivity and cultivation in South Africa (i.e., those taxa included in BODATSA)

  3. The WCVP backbone is used for alien vascular plants not found outside of captivity and cultivation in South Africa (i.e., those taxa not included in BODATSA)

In addition to standardising a list of taxon names based on the three taxonomic backbones, the workflow obtains canonical names and higher taxonomic information for the taxa from GBIF, and detects, corrects, and flags potential issues and errors.

This document describes the outputs of the workflow.

Global biodiversity data standards as set by the Darwin Core are followed as far as is possible (descriptions of the Darwin Core fields are available at: https://dwc.tdwg.org/terms/).

Final output

The final output is saved in the processed/ folder as a csv named ‘Taxon_names_standardised.csv’. The final output contains 18 columns.

Description of variables

verbatimScientificName

Description: The taxon name as provided by the user

Darwin Core: dwc:verbatimScientificName

Type: Character

canonicalName

Description: The scientific name of the taxon without authorship and date information, as per GBIF

Type: Character (value of NA is possible if the taxon is not available on GBIF)

kingdom

Description: The full scientific name of the kingdom in which the taxon is classified, as per GBIF

Darwin Core: dwc:kingdom

Type: Factor (value of NA is possible if the taxon is not available on GBIF)

phylum

Description: The full scientific name of the phylum in which the taxon is classified, as per GBIF

Darwin Core: dwc:phylum

Type: Factor (value of NA is possible if the taxon is not available on GBIF)

class

Description: The full scientific name of the class in which the taxon is classified, as per GBIF

Darwin Core: dwc:class

Type: Factor (value of NA is possible if the value is not available on GBIF)

order

Description: The full scientific name of the order in which the taxon is classified, as per GBIF

Darwin Core: dwc:order

Type: Factor (value of NA is possible if the value is not available on GBIF)

family

Description: The full scientific name of the family in which the taxon is classified, as per GBIF

Darwin Core: dwc:family

Type: Factor (value of NA is possible if the taxon is not available on GBIF)

scientificName

Description: The accepted scientific name of the taxon with authorship and date information, as per the relevant taxonomic backbone

Darwin Core: dwc:scientificName

Type: Character (value of NA is possible if the verbatimScientificName could not be matched to a name in the relevant taxonomic backbone. Can be several values separated by a pipe delimiter if verbatimScientificName was a taxonomic homonym and was matched to multiple names in relevant taxonomic backbone)

synonym.flag

Description: If the verbatimScientificName is a synonym of an accepted scientific name (i.e., TRUE). Includes instances where verbatimScientificName is an unused autonym, and the species that it has been elevated to is a synonym.

Type: Factor with two levels (TRUE and FALSE)

uncertain.flag

Description: If the verbatimScientificName has been matched to an accepted scientific name (i.e., scientificName) through a non-exact match (e.g., fuzzy matching) (i.e., TRUE)

Type: Factor with two levels (TRUE and FALSE)

doubtful.flag

Description: If the verbatimScientificName was either matched to multiple accepted scientific names, and this could not be resolved (i.e., taxonomic homonym); or has a status that is not accepted or synonym; or could only be matched at a higher taxonomic level (e.g., genus) (i.e., TRUE). This is doubtful as per GBIF. Includes uncertain as per BODATSA; and invalid, illegitimate and missapplied as per WCVP.

Type: Factor with two levels (TRUE and FALSE)

unmatched.flag

Description: If the verbatimScientificName could not be matched to an accepted scientific name in the relevant taxonomic backbone; or was matched to an accepted scientific name at a higher taxonomic level (e.g., genus for a species, or species for an unused autonym) (i.e., TRUE).

Type: Factor with two levels (TRUE and FALSE)

possibleError.flag

Description: If any of unmatched.flag, uncertain.flag or doubtful.flag are TRUE (i.e, TRUE). The outputs for these taxon names should be examined manually.

Type: Factor with two levels (TRUE and FALSE)

subspecific.flag

Description: If the verbatimScientificName is a sub-specific entity (i.e., TRUE). These taxon names may need to be examined manually.

Type: Factor with two levels (TRUE and FALSE)

scientificNameDuplicate.flag

Description: If the scientificName appears multiple times in the list (i.e., TRUE). These taxon names will need to be removed manually

Type: Factor with two levels (TRUE and FALSE)

verbatimNameDuplicate.flag

Description: If the verbatimScientificName appears multiple times in the list (i.e., TRUE). These taxon names will need to be removed manually

Type: Factor with two levels (TRUE and FALSE)

openNomenclature.flag

Description: If the verbatimScientificName contains open - abbreviations that indicate uncertainty or the provisional status of the taxon’s identification (“sp.”, “spp.”, “cf.”, “nr.”, “aff.”, “?”) (i.e., TRUE). These taxon names may need to be examined manually.

Type: Factor with two levels (TRUE and FALSE)

scientificNameSource

Description: The taxonomic backbone used for the scientificName, with the date it was accessed and a web-link as appropriate

Type: Factor with three levels (BODATSA, POWO, GBIF) (value of NA is possible if the verbatimScientificName could not be matched to a name in the relevant taxonomic backbone. Can be several values separated by a pipe delimiter if verbatimScientificName was a taxonomic homonym and was matched to multiple names in the relevant taxonomic backbone)

Value and definition

BODATSA: Botanical Database of Southern Africa

POWO: Plants of the World Online

GBIF: Global Biodiversity Information Facility

Interim output from GBIF match

The output is saved in the interim/ folder as a csv named ‘GBIFInterim.csv’. The output contains 13 columns.

Description of variables

verbatimScientificName

Description: The taxon name as provided by the user

Darwin Core: dwc:verbatimScientificName

Type: Character

canonicalName

Description: The scientific name of the taxon without authorship and date information, as per GBIF

Type: Character (value of NA is possible if the taxon is not available on GBIF)

kingdom

Description: The full scientific name of the kingdom in which the taxon is classified, as per GBIF

Darwin Core: dwc:kingdom

Type: Factor (value of NA is possible if the taxon is not available on GBIF)

phylum

Description: The full scientific name of the phylum in which the taxon is classified, as per GBIF

Darwin Core: dwc:phylum

Type: Factor (value of NA is possible if the taxon is not available on GBIF)

class

Description: The full scientific name of the class in which the taxon is classified, as per GBIF

Darwin Core: dwc:class

Type: Factor (value of NA is possible if the value is not available on GBIF)

order

Description: The full scientific name of the order in which the taxon is classified, as per GBIF

Darwin Core: dwc:order

Type: Factor (value of NA is possible if the value is not available on GBIF)

family

Description: The full scientific name of the family in which the taxon is classified, as per GBIF

Darwin Core: dwc:family

Type: Factor (value of NA is possible if the taxon is not available on GBIF)

scientificName

Description: The accepted scientific name of the taxon with authorship and date information, as per GBIF

Darwin Core: dwc:scientificName

Type: Character (value of NA is possible if the taxon name could not be matched to a name in the relevant taxonomic backbone)

synonym

Description: If the verbatimScientificName is a synonym of an accepted scientific name (i.e., TRUE)

Type: Factor with two levels (TRUE and FALSE)

rank

Description: The precise taxonomic level of the organism as per verbatimScientificName

Type: Factor (levels include: SPECIES, FAMILY, GENUS, SUBSPECIES, VARIETY, FORM. Value of NA is possible)

status

Description: Indicates if the verbatimScientificName is an accepted scientific name

Type: Factor with three levels (ACCEPTED, SYNONYM, DOUBTFUL. Value of NA is possible)

Value and definition

ACCEPTED: The verbatimScientificName is an accepted scientificName

SYNONYM: The verbatimScientificName is a synonym

DOUBTFUL: A taxon name can be flagged as doubtful for several reasons. These include: The name comes from a nomenclator only. There is more than one accepted name with the same canonical name but different author. In that case, the most trusted one is accepted and others are flagged doubtful. The name has homonyms in the same source. The name is a basionym placeholder and contains “? epithet”. The name is a genus or higher taxon without any species. The name is a species or and infra-species with a parent genus or species considered to be a synonym.

matchtype

Description: How the verbatimScientificName was matched to the scientificName

Type: Factor with four levels (EXACT, FUZZY, HIGHERRANK, NONE)

Value and definition

EXACT: verbatimScientificName was matched to the scientificName through an exact match

FUZZY: verbatimScientificName was matched to the scientificName through fuzzy matching

HIGHERRANK: verbatimScientificName was matched to a scientificName at a higher taxonomic level (e.g., genus)

NONE: verbatimScientificName was not matched to a scientificName

usageKey

Description: Equal to the taxonKey. A unique number used in GBIF to identify a taxon

Type: Character (value of NA is possible)

Interim output from BODATSA match

The output is saved in the interim/ folder as a csv named ‘BODATSAInterim.csv’. The output contains 25 columns.

Description of variables

verbatimScientificName

Description: The taxon name as provided by the user

Darwin Core: dwc:verbatimScientificName

Type: Character

canonicalName

Description: The scientific name of the taxon without authorship and date information, as per GBIF

Type: Character (value of NA is possible if the taxon is not available on GBIF)

kingdom

Description: The full scientific name of the kingdom in which the taxon is classified, as per GBIF

Darwin Core: dwc:kingdom

Type: Factor (value of NA is possible if the taxon is not available on GBIF)

phylum

Description: The full scientific name of the phylum in which the taxon is classified, as per GBIF

Darwin Core: dwc:phylum

Type: Factor (value of NA is possible if the taxon is not available on GBIF)

class

Description: The full scientific name of the class in which the taxon is classified, as per GBIF

Darwin Core: dwc:class

Type: Factor (value of NA is possible if the value is not available on GBIF)

order

Description: The full scientific name of the order in which the taxon is classified, as per GBIF

Darwin Core: dwc:order

Type: Factor (value of NA is possible if the value is not available on GBIF)

family

Description: The full scientific name of the family in which the taxon is classified, as per GBIF

Darwin Core: dwc:family

Type: Factor (value of NA is possible if the taxon is not available on GBIF)

Family

Description: The full scientific name of the family in which the taxon is classified, as per BODATSA

Type: Factor (value of NA is possible if the verbatimScientificName could not be matched to a name in BODATSA)

scientificName

Description: The accepted scientific name of the taxon with authorship and date information, as per BODATSA

Darwin Core: dwc:scientificName

Type: Character (value of NA is possible if the verbatimScientificName could not be matched to a name in BODATSA. Can be several values separated by a pipe delimiter if verbatimScientificName was a taxonomic homonym and was matched to multiple names in relevant taxonomic backbone)

Genus

Description: The generic epithet of the verbatimScientificName, as per BODATSA

Type: Factor (value of NA is possible if the verbatimScientificName could not be matched to a name in BODATSA)

Sp1

Description: The specific epithet of the verbatimScientificName, as per BODATSA

Type: Character (value of NA is possible if the verbatimScientificName could not be matched to a name in BODATSA)

Author1

Description: Authorship of the binomial name of verbatimScientificName, as per BODATSA

Type: Character (value of NA is possible if the verbatimScientificName could not be matched to a name in BODATSA. Can be several values separated by a pipe delimiter if verbatimScientificName was a taxonomic homonym and was matched to multiple names in relevant taxonomic backbone)

Rank1

Description: Indicates if verbatimScientificName is a sub-specific entity

Type: Factor with three levels (subsp., var., f.. Value of NA is possible if the verbatimScientificName could not be matched to a name in BODATSA or if the verbatimScientificName is not a sub-specific entity)

Value and definition

subsp.: subspecies

var.: variety

f.: form

Sp2

Description: First sub-specific epithet of the verbatimScientificName, as per BODATSA

Type: Character (value of NA is possible if the verbatimScientificName could not be matched to a name in BODATSA or if the verbatimScientificName is not a sub-specific entity)

Author2

Description: Authorship of the first sub-specific epithet of verbatimScientificName, as per BODATSA

Type: Character (value of NA is possible if the verbatimScientificName could not be matched to a name in BODATSA or if the verbatimScientificName is not a sub-specific entity)

Rank2

Description: Indicates if verbatimScientificName contains a second sub-specific epithet

Type: Factor with two levels (var., f.. Value of NA is possible if the verbatimScientificName could not be matched to a name in BODATSA or if the verbatimScientificName does not contain a second sub-specific epithet)

Value and definition

var.: variety

f.: form

Sp3

Description: Second sub-specific epithet of the verbatimScientificName, as per BODATSA

Type: Character (value of NA is possible if the verbatimScientificName could not be matched to a name in BODATSA or if the verbatimScientificName does not contain a second sub-specific epithet)

Author3

Description: Authorship of the second sub-specific epithet of verbatimScientificName, as per BODATSA

Type: Character (value of NA is possible if the verbatimScientificName could not be matched to a name in BODATSA or if the verbatimScientificName does not contain a second sub-specific epithet)

TaxStat

Description: Indicates whether the verbatimScientificName is an accepted scientific name

Type: Factor (levels include: acc, inc, aut, syn, unc, genlv. Value of NA is possible. Can be several values separated by a pipe delimiter if verbatimScientificName was a taxonomic homonym and was matched to multiple names in the relevant taxonomic backbone); or if the taxon name is an unused autonym and the species name that it has been elevated to is a synonym.

Value and definition

acc: verbatimScientificName is currently the accepted name

syn: verbatimScientificName is currently a synonym

inc: the verbatimScientificName includes lower ranks

aut: the verbatimScientificName is an unused autonym, and the taxon has been elevated to the species level

unc: verbatimScientificName is of uncertain application

genlv: verbatimScientificName is at the genus level only

AcceptedIssues

Description: The accepted scientific name if the verbatimScientificName is a synonym as per BODATSA, before formatting was corrected during Step 2a of the workflow

Type: Character (value of NA is possible if the verbatimScientificName could not be matched to a name in BODATSA or if the verbatimScientificName is not a synonym)

Ecology

Description: Ecological details for the taxon

Type: Character (value of NA is possible)

Accepted

Description: The accepted scientific name if the verbatimScientificName is a synonym as per BODATSA, after formatting was corrected during Step 2a of the workflow; or if the verbatimScientificName is an unused autonym, and the taxon has been elevated to the species level.

Type: Character (value of NA is possible if the verbatimScientificName could not be matched to a name in BODATSA or if the verbatimScientificName is not a synonym or unused autonym)

synAccIssue

Description: If the accepted scientific name of a synonym was corrected during Step 2a of the workflow (i.e., TRUE)

Type: Factor with two levels (TRUE and FALSE. Value of NA is possible if the verbatimScientificName could not be matched to a name in BODATSA or if the verbatimScientificName is not a synonym)

multipleMatches

Description: If the verbatimScientificName was matched to multiple names in BODATSA (i.e., TRUE)

Type: Factor with two levels (TRUE and FALSE)

multipleMatchesUnresolved

Description: If the verbatimScientificName was matched to multiple names in BODATSA, and this issue was not resolved during the execution of the workflow (i.e., TRUE)

Type: Factor with two levels (TRUE and FALSE)

Interim output from WCVP match

The output is saved in the interim/ folder as a csv named ‘WCVPInterim.csv’. The output contains 23 columns.

Description of variables

verbatimScientificName

Description: The taxon name as provided by the user

Darwin Core: dwc:verbatimScientificName

Type: Character

canonicalName

Description: The scientific name of the taxon without authorship and date information, as per GBIF

Type: Character (value of NA is possible if the taxon is not available on GBIF)

kingdom

Description: The full scientific name of the kingdom in which the taxon is classified, as per GBIF

Darwin Core: dwc:kingdom

Type: Factor (value of NA is possible if the taxon is not available on GBIF)

phylum

Description: The full scientific name of the phylum in which the taxon is classified, as per GBIF

Darwin Core: dwc:phylum

Type: Factor (value of NA is possible if the taxon is not available on GBIF)

class

Description: The full scientific name of the class in which the taxon is classified, as per GBIF

Darwin Core: dwc:class

Type: Factor (value of NA is possible if the value is not available on GBIF)

order

Description: The full scientific name of the order in which the taxon is classified, as per GBIF

Darwin Core: dwc:order

Type: Factor (value of NA is possible if the value is not available on GBIF)

family

Description: The full scientific name of the family in which the taxon is classified, as per GBIF

Darwin Core: dwc:family

Type: Factor (value of NA is possible if the taxon is not available on GBIF)

match_type

Description: How the verbatimScientificName was matched to the scientificName. Such as through an exact match or fuzzy match (e.g., there were spelling mistakes), and whether the author was considered

Type: Factor (levels include: Exact, Exact (without author), Fuzzy (edit distance), Fuzzy (phonetic). Value of NA is possible if no match was found)

multiple_matches

Description: If the verbatimScientificName was matched to multiple names in WCVP (i.e., TRUE)

Type: Factor with two levels (TRUE and FALSE)

match_similarity

Description: The similarity between the verbatimScientificName and the name to which it was matched. The value will be 1 if exact matching is used, but will be < 1 if fuzzy matching is used

Type: Numeric (ranges between 0 and 1, with 1 being 100% similarity. Value of NA is possible if no match was found. Can be several values separated by a pipe delimiter if verbatimScientificName was a taxonomic homonym and was matched to multiple names in the relevant taxonomic backbone)

match_edit_distance

Description: The number of edits required (additions and subtractions) to transform the verbatimScientificName into the name to which it was matched. The value will be 0 if exact matching is used, but will be > 0 if fuzzy matching is used

Type: Numeric (value >= 0. Value of NA is possible if no match was found. Can be several values separated by a pipe delimiter if verbatimScientificName was a taxonomic homonym and was matched to multiple names in the relevant taxonomic backbone)

wcvp_id

Description: A unique number used in WCVP to identify a taxon

Type: Character (can be several values separated by a pipe delimiter if verbatimScientificName was a taxonomic homonym and was matched to multiple names in the relevant taxonomic backbone)

wcvp_name

Description: The name of the taxon as it appears in WCVP without authorship or date information. This is not necessarily the accepted scientific name

Type: Character (can be several values separated by a pipe delimiter if verbatimScientificName was a taxonomic homonym and was matched to multiple names in the relevant taxonomic backbone)

wcvp_authors

Description: Authorship information for wcvp_name

Type: Character (can be several values separated by a pipe delimiter if verbatimScientificName was a taxonomic homonym and was matched to multiple names in the relevant taxonomic backbone)

wcvp_rank

Description: The precise taxonomic level of the organism as per verbatimScientificName

Type: Factor (levels include: Species, Subspecies, Variety. Value of NA is possible)

wcvp_status

Description: Indicates if the verbatimScientificName is an accepted scientific name

Type: Factor (levels include: Accepted, Synonym, Invalid, Illegitimate, Missapplied. Value of NA is possible. Can be several values separated by a pipe delimiter if verbatimScientificName was a taxonomic homonym and was matched to multiple names in relevant taxonomic backbone)

Value and definition

Accepted: The verbatimScientificName is an accepted scientificName

Synonym: The verbatimScientificName is a synonym

Invalid: The verbatimScientificName is invalid

Illegitimate: The verbatimScientificName is illegitimate

Missapplied: The verbatimScientificName has mistakenly been given to a taxon

wcvp_homotypic

Description: If the verbatimScientificName is a homotypic synonym (i.e., TRUE)

Type: Factor with two levels (TRUE or FALSE. Value of NA is possible. Can be several values separated by a pipe delimiter if verbatimScientificName was a taxonomic homonym and was matched to multiple names in relevant taxonomic backbone)

wcvp_ipni_id

Description: A unique number used in IPNI to identify a taxon

Type: Character (value of NA is possible. Can be several values separated by a pipe delimiter if verbatimScientificName was a taxonomic homonym and was matched to multiple names in the relevant taxonomic backbone)

wcvp_accepted_id

Description: A unique number used in WCVP to identify the accepted scientific name

Type: Character (can be several values separated by a pipe delimiter if verbatimScientificName was a taxonomic homonym and was matched to multiple names in the relevant taxonomic backbone)

scientificName

Description: The accepted scientific name of the taxon with authorship and date information, as per WCVP

Darwin Core: dwc:scientificName

Type: Character (can be several values separated by a pipe delimiter if verbatimScientificName was a taxonomic homonym and was matched to multiple names in the relevant taxonomic backbone)

powo_id

Description: A unique number used in POWO to identify the taxon

Type: Character (can be several values separated by a pipe delimiter if verbatimScientificName was a taxonomic homonym and was matched to multiple names in the relevant taxonomic backbone)

multiple_matchesUnresolved

Description: If the verbatimScientificName was matched to multiple names in WCVP, and this issue was not resolved during the execution of the workflow (i.e., TRUE)

Type: Factor with two levels (TRUE and FALSE)

Interim output from preparation of BODATSA data

The output is saved in the interim/ folder and is a csv named ‘BODATSAPrepared.csv’. The output contains 17 columns.

Description of variables

Family

Description: The full scientific name of the family in which the taxon is classified, as per BODATSA

Type: Factor

canonicalName

Description: The scientific name of the taxon without authorship and date information, as per BODATSA

Type: Character

scientificName

Description: The scientific name of the taxon with authorship and date information, as per BODATSA. This is not necessarily the accepted scientific name

Type: Character

Genus

Description: The generic epithet, as per BODATSA

Type: Factor (value of NA is possible)

Sp1

Description: The specific epithet, as per BODATSA

Type: Character (value of NA is possible)

Author1

Description: Authorship of the binomial name, as per BODATSA

Type: Character (value of NA is possible)

Rank1

Description: Indicates if the taxon is a sub-specific entity

Type: Factor with three levels (subsp., var., f.. Value of NA is possible)

Value and definition

subsp.: subspecies

var.: variety

f.: form

Sp2

Description: First sub-specific epithet, as per BODATSA

Type: Character (value of NA is possible)

Author2

Description: Authorship of the first sub-specific epithet, as per BODATSA.

Type: Character (value of NA is possible)

Rank2

Description: Indicates if there is second sub-specific epithet

Type: Factor with two levels (var., f.. Value of NA is possible)

Value and definition

var.: variety

f.: form

Sp3

Description: Second sub-specific epithet, as per BODATSA

Type: Character (value of NA is possible)

Author3

Description: Authorship of the second sub-specific epithet, as per BODATSA

Type: Character (value of NA is possible)

TaxStat

Description: Indicates if the taxon name (see canonicalName and scientificName) is an accepted scientific name

Type: Factor (levels include: acc, inc, syn, unc, genlv., aut, aut|syn. Value of NA is possible)

Value and definition

acc: taxon name is currently the accepted name

syn: taxon name is currently a synonym

inc: taxon name includes lower ranks

aut: taxon name is an unused autonym and has been elevated to the species level

unc: taxon name is of uncertain application

genlv: taxon name is at the genus level only

aut|syn: taxon name is an unused autonym and has been elevated to the species level. But that species name is a synonym.

AcceptedIssues

Description: The accepted scientific name if the taxon name is a synonym as per BODATSA, before formatting was corrected during Step 2a of the workflow

Type: Character (value of NA is possible)

Ecology

Description: Ecological details for the taxon

Type: Character (value of NA is possible)

Accepted

Description: The accepted scientific name if the taxon name is a synonym as per BODATSA, after formatting was corrected during Step 2a of the workflow; or the accepted scientific name if the taxon name is a unused autonym as per BODATSA

Type: Character (value of NA is possible)

synAccIssue

Description: If the accepted scientific name of a synonym was corrected during Step 2a of the workflow (i.e., TRUE)

Type: Factor with two levels (TRUE and FALSE)