galah is an R interface to biodiversity data hosted by the Atlas of Living Australia (ALA). The ALA is a repository of biodiversity data, focussed primarily on observations of individual life forms. As with the Global Biodiversity Information Facility (GBIF), the basic unit of data at the ALA is an occurrence record, based on the ‘Darwin Core’ data standard.
galah enables users to locate and download species observations, taxonomic information, or associated media such as images or sounds, and to restrict their queries to particular taxa or locations. Users can specify which columns are returned by a query, or restrict their results to observations that meet particular quality-control criteria. All functions return a data.frame as their standard format.
Functions in galah are designed according to a nested architecture. Users who require data should begin by locating the relevant ala_ function (see downloading data section); the arguments within that function then call correspondingly-named select_ functions; and finally the specific values that can be interpreted by those select_ functions are given by find_ functions.
Install from CRAN:
install.packages("galah")
Install the development version from GitHub:
install.packages("remotes")
remotes::install_github("AtlasOfLivingAustralia/galah")
See the README for system requirements.
Load the package
library(galah)
galah_config(atlas = "Australia")
Each occurrence record contains taxonomic information, and also some information about the observation itself, such as its location and the date of the observation. Each piece of information associated with a given occurrence is stored in a field, which corresponds to a column when imported to a data.frame.
Data fields are important because they provide a means to filter occurrence records; i.e. to return only the information that you need, and no more. Consequently, much of the architecture of galah has been designed to make filtering as simple as possible, by using functions with the select_ prefix.
select_taxa() enables users to search for taxonomic names and check that the results are ‘correct’ before using them to download data. The function allows both free-text searches and searches where the rank(s) are specified. Specifying the rank can be useful when names are ambiguous.
# free text search
taxa_filter <- select_taxa("Eolophus")
# specifying ranks (Aves is a class, not a kingdom)
select_taxa(query = list(genus = "Eolophus", class = "Aves"))
## search_term scientific_name scientific_name_authorship
## 1 Eolophus_Aves Eolophus Bonaparte, 1854
## taxon_concept_id rank match_type kingdom phylum class
## 1 urn:lsid:biodiversity.org.au:afd.taxon:b2de5e40-df8f-4339-827d-25e63454a4a2 genus exactMatch Animalia Chordata Aves
## order family genus issues
## 1 Psittaciformes Cacatuidae Eolophus noIssue
For more detailed taxonomic information use search_taxonomy(), as outlined in vignette("taxonomic_information").
Users can provide an sf object or a Well-Known Text (WKT) string for location-based filtering.
locations <- select_locations(query = sf::st_read('act_rect.shp'))
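As a minimal sketch of the WKT route, the base R helper below builds a rectangular polygon string from a bounding box. The helper name bbox_to_wkt and the coordinates are illustrative assumptions, not part of galah; note that WKT lists each coordinate as longitude then latitude.

```r
# Hypothetical helper (not part of galah): build a WKT rectangle from a
# bounding box given in decimal degrees (x = longitude, y = latitude).
bbox_to_wkt <- function(xmin, ymin, xmax, ymax) {
  # List the corners counter-clockwise and repeat the first to close the ring
  x <- c(xmin, xmax, xmax, xmin, xmin)
  y <- c(ymin, ymin, ymax, ymax, ymin)
  paste0("POLYGON((", paste(x, y, collapse = ","), "))")
}

# Illustrative box only; these are not the official ACT boundaries
act_wkt <- bbox_to_wkt(148.7, -35.9, 149.4, -35.1)

# The string could then be supplied in place of the sf object, e.g.
# locations <- select_locations(query = act_wkt)
```

Building the string by hand avoids a dependency on sf for simple rectangles; for anything more complex, reading a shapefile as shown above is likely the safer choice.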
As mentioned above, all occurrence records in the ALA contain additional information about the record, stored in fields. Field-based filters are specified with select_filters(), which takes individual filters, in the form field = value, and/or a data quality profile.
To find available fields and corresponding valid values, field lookup functions are provided. To find field names, use search_fields(); to find valid field values, use find_field_values().
search_fields("basis")
## id
## 11 basisOfRecord
## 186 raw_basisOfRecord
## 661 BASIS_OF_RECORD_INVALID
## 726 OCCURRENCE_STATUS_INFERRED_FROM_BASIS_OF_RECORD
## description
## 11 What this is a record of e.g. specimen, human observation, fossil http://rs.tdwg.org/dwc/terms/basisOfRecord
## 186 The basis of record as supplied by the data publisher http://rs.tdwg.org/dwc/terms/verbatimBasisOfRecord
## 661 Basis of record badly formed
## 726 Occurrence status inferred from basis of record
## type link
## 11 fields <NA>
## 186 fields <NA>
## 661 assertions <NA>
## 726 assertions <NA>
field_values <- find_field_values("basisOfRecord")
Build a field filter
filters <- select_filters(basisOfRecord = "HumanObservation")
It is also possible to pass other kinds of logical statements to select_filters().
filters <- select_filters(basisOfRecord = "HumanObservation",
                          year >= 2010,
                          occurrenceStatus != "absent")
A notable extension of the filtering approach is to remove records with low ‘quality’. ALA performs quality control checks on all records that it stores. These checks are used to generate new fields that can then be used to filter out records that are unsuitable for particular applications. However, there are many possible data quality checks, and it is not always clear which are most appropriate in a given instance. Therefore, galah supports ALA data quality profiles, which can be passed to select_filters() to quickly remove undesirable records. A full list of data quality profiles is returned by find_profiles().
profiles <- find_profiles()
View filters included in a profile
find_profile_attributes("ALA")
## description
## 1: Exclude all records where spatial validity is "false"
## 2: Exclude all records with an assertion that the scientific name provided does not match any of the names lists used by the ALA. For a full explanation of the ALA name matching process see https://github.com/AtlasOfLivingAustralia/ala-name-matching
## 3: Exclude all records with an assertion that the scientific name provided is not structured as a valid scientific name. Also catches rank values or values such as "UNKNOWN"
## 4: Exclude all records with an assertion that the name and classification supplied can't be used to choose between 2 homonyms
## 5: Exclude all records with an assertion that kingdom provided doesn't match a known kingdom e.g. Animalia, Plantae
## 6: Exclude all records with an assertion that the scientific name provided in the record does not match the expected taxonomic scope of the resource e.g. Mammal records attributed to bird watch group
## 7: Exclude all records with an assertion of the occurence is cultivated or escaped from captivity
## 8: Exclude all records with an assertion of latitude value provided is zero
## 9: Exclude all records with an assertion of longitude value provided is zero
## 10: Exclude all records with an assertion of latitude and longitude have been transposed
## 11: Exclude all records with an assertion of coordinates are the exact centre of the state or territory
## 12: Exclude all records with an assertion of coordinates are the exact centre of the country
## 13: Exclude all records where duplicate status is "duplicate"
## 14: Exclude all records where coordinate uncertainty (in metres) is greater than 10km
## 15: Exclude all records with unresolved user assertions
## 16: Exclude all records with unconfirmed user assertions
## 17: Exclude all records where outlier layer count is 3 or more
## 18: Exclude all records where Record type is "Fossil specimen"
## 19: Exclude all records where Record type is "EnvironmentalDNA"
## 20: Exclude all records where Presence/Absence is "absent"
## 21: Exclude all records where year is prior to 1700
## description
## filter
## 1: -spatiallyValid:"false"
## 2: -assertions:TAXON_MATCH_NONE
## 3: -assertions:INVALID_SCIENTIFIC_NAME
## 4: -assertions:TAXON_HOMONYM
## 5: -assertions:UNKNOWN_KINGDOM
## 6: -assertions:TAXON_SCOPE_MISMATCH
## 7: -establishmentMeans:"MANAGED"
## 8: -decimalLatitude:0
## 9: -decimalLongitude:0
## 10: -assertions:"PRESUMED_SWAPPED_COORDINATE"
## 11: -assertions:"COORDINATES_CENTRE_OF_STATEPROVINCE"
## 12: -assertions:"COORDINATES_CENTRE_OF_COUNTRY"
## 13: -duplicateStatus:"ASSOCIATED"
## 14: -coordinateUncertaintyInMeters:[10001 TO *]
## 15: -userAssertions:50001
## 16: -userAssertions:50005
## 17: -outlierLayerCount:[3 TO *]
## 18: -basisOfRecord:"FOSSIL_SPECIMEN"
## 19: -(basisOfRecord:"MATERIAL_SAMPLE" AND contentTypes:"EnvironmentalDNA")
## 20: -occurrenceStatus:ABSENT
## 21: -year:[* TO 1700]
## filter
Include a profile in the filters
filters <- select_filters(basisOfRecord = "HumanObservation",
                          profile = "ALA")
Functions that return data from ALA are named with the prefix ala_, followed by a suffix describing the information that they provide.
By combining different filter functions, it is possible to build complex queries to return only the most valuable information for a given problem. Once you have retrieved taxon information, you can use this to search for occurrence records with ala_occurrences(). However, it is also possible to download data on species via ala_species(), or media content (largely images) via ala_media(). Alternatively, users can retrieve record counts using ala_counts().
In addition to the filter functions above, when downloading occurrence data users can specify which columns are returned using select_columns(). Individual column names and/or column groups can be specified. To view the fields for each group, see the documentation for select_columns(). To view the list of available fields, run search_fields().
cols <- select_columns("institutionID", group = "basic")
To download occurrence data you will need to specify your email in galah_config(). This email must be associated with an active ALA account. See the config section for more information.
galah_config(email = "your_email_here", atlas = "Australia")
Download occurrence records for Eolophus roseicapilla
occ <- ala_occurrences(taxa = select_taxa("Eolophus roseicapilla"),
                       filters = select_filters(stateProvince = "Australian Capital Territory",
                                                year >= 2010,
                                                profile = "ALA"),
                       columns = select_columns("institutionID", group = "basic"))
head(occ)
## decimalLatitude decimalLongitude eventDate scientificName
## 1 -35.88717 148.9713 Eolophus roseicapilla
## 2 -35.86784 149.0101 Eolophus roseicapilla
## 3 -35.86556 149.0106 2012-01-18T13:00:00Z Eolophus roseicapilla
## 4 -35.86429 149.0052 Eolophus roseicapilla
## 5 -35.77517 148.9591 Eolophus roseicapilla
## 6 -35.76652 148.9654 Eolophus roseicapilla
## taxonConceptID recordID
## 1 urn:lsid:biodiversity.org.au:afd.taxon:577ff059-a2a7-48b0-976c-fdd6a345f878 17f46d49-7db0-4929-89f4-b29323f3fcc5
## 2 urn:lsid:biodiversity.org.au:afd.taxon:577ff059-a2a7-48b0-976c-fdd6a345f878 ef2b9066-c078-4660-b9a4-c31192aa8bf7
## 3 urn:lsid:biodiversity.org.au:afd.taxon:577ff059-a2a7-48b0-976c-fdd6a345f878 4f7cd714-2997-45d6-adaf-f7dfc80adfe1
## 4 urn:lsid:biodiversity.org.au:afd.taxon:577ff059-a2a7-48b0-976c-fdd6a345f878 3236c470-a144-4300-ae9b-782d0e5e4dd1
## 5 urn:lsid:biodiversity.org.au:afd.taxon:577ff059-a2a7-48b0-976c-fdd6a345f878 e340c423-3a19-4293-aa72-e4045bb7f702
## 6 urn:lsid:biodiversity.org.au:afd.taxon:577ff059-a2a7-48b0-976c-fdd6a345f878 f25793f3-2704-43ff-80ee-c2a1787490e7
## dataResourceName institutionID
## 1 eBird Australia
## 2 eBird Australia
## 3 BirdLife Australia, Birdata
## 4 eBird Australia
## 5 eBird Australia
## 6 eBird Australia
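Because the result is an ordinary data.frame, it can be summarised with base R. The sketch below uses a small hand-built stand-in for occ (named occ_example here), copying the eventDate values and resource names printed above so that it runs without an ALA account. It assumes the resource names belong to the dataResourceName column; with a real download, the same lines should apply to occ directly.

```r
# Stand-in for the downloaded records, copied from the printed output above
occ_example <- data.frame(
  eventDate = c("", "", "2012-01-18T13:00:00Z", "", "", ""),
  dataResourceName = c("eBird Australia", "eBird Australia",
                       "BirdLife Australia, Birdata", "eBird Australia",
                       "eBird Australia", "eBird Australia"),
  stringsAsFactors = FALSE
)

# Count records contributed by each data resource
resource_counts <- table(occ_example$dataResourceName)

# Parse the ISO 8601 timestamps; empty strings become NA
occ_example$date <- as.POSIXct(occ_example$eventDate,
                               format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# Number of records that carry a usable event date
n_dated <- sum(!is.na(occ_example$date))
```

Checking how many records lack an event date before analysing trends over time is a cheap way to spot gaps that the year filter alone does not reveal.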
A common use case of the ALA is to identify which species occur in a specified region, time period, or taxonomic group. ala_species() enables the user to look up this information, using the common set of filter functions.
# List rodent species in the NT
species <- ala_species(taxa = select_taxa("Rodentia"),
                       filters = select_filters(stateProvince = "Northern Territory"))
head(species)
## kingdom phylum class order family genus species author
## 1 Animalia Chordata Mammalia Rodentia Muridae Mesembriomys Mesembriomys gouldii (J.E. Gray, 1843)
## 2 Animalia Chordata Mammalia Rodentia Muridae Zyzomys Zyzomys argurus (Thomas, 1889)
## 3 Animalia Chordata Mammalia Rodentia Muridae Pseudomys Pseudomys hermannsburgensis (Waite, 1896)
## 4 Animalia Chordata Mammalia Rodentia Muridae Notomys Notomys alexis Thomas, 1922
## 5 Animalia Chordata Mammalia Rodentia Muridae Melomys Melomys burtoni (Ramsay, 1887)
## 6 Animalia Chordata Mammalia Rodentia Muridae Mus Mus musculus Linnaeus, 1758
## species_guid vernacular_name
## 1 urn:lsid:biodiversity.org.au:afd.taxon:f38bcd7e-ae6a-4734-bd64-06995bc230eb Black-footed Tree-rat
## 2 urn:lsid:biodiversity.org.au:afd.taxon:46611113-a1e3-45b1-b58c-7aef088a9da7 Common Rock-rat
## 3 urn:lsid:biodiversity.org.au:afd.taxon:5d73fc2f-3caa-4b44-aa40-3711e8304f80 Sandy Inland Mouse
## 4 urn:lsid:biodiversity.org.au:afd.taxon:49001532-929e-4b78-97d3-c885e97d671b Spinifex Hopping-mouse
## 5 urn:lsid:biodiversity.org.au:afd.taxon:89dfa41e-2c5a-44d1-80bf-8d4cd3c73089 Grassland Melomys
## 6 urn:lsid:biodiversity.org.au:afd.taxon:107696b5-063c-4c09-a015-6edfdb6f4d52 House Mouse
ala_counts() provides summary counts on records in the ALA, without needing to download all the records. In addition to the filter arguments, it has an optional group_by argument, which provides counts binned by the requested field.
# Total number of records in the ALA
ala_counts()
## [1] 100871912
# Total number of records, broken down by kingdom
ala_counts(group_by = "kingdom")
## kingdom count
## 1 Animalia 75649166
## 2 Plantae 21472247
## 3 Fungi 1877219
## 4 Chromista 914334
## 5 Protista 67279
## 6 Bacteria 58081
## 7 Protozoa 22681
## 8 Archaea 1103
## 9 Eukaryota 735
## 10 Virus 421
In addition to text data describing individual occurrences and their attributes, ALA stores images, sounds and videos associated with a given record. These can be downloaded to R using ala_media() and the same set of filters as the other data download functions.
# Download media attached to Eolophus roseicapilla records from 2020
media_data <- ala_media(
  taxa = select_taxa("Eolophus roseicapilla"),
  filters = select_filters(year = 2020),
  download_dir = "media")
Various aspects of the galah package can be customized. To preserve configuration for future sessions, set profile_path to the location of a .Rprofile file.
To download occurrence records, you will need to provide an email address registered with the ALA. You can create an account on the ALA website. Once an email is registered with the ALA, it should be stored in the config:
galah_config(email="myemail@gmail.com")
galah can cache most results to local files. This means that if the same code is run multiple times, the second and subsequent iterations will be faster.
By default, this caching is session-based, meaning that the local files are stored in a temporary directory that is automatically deleted when the R session is ended. This behaviour can be altered so that caching is permanent, by setting the caching directory to a non-temporary location.
galah_config(cache_directory="example/dir")
By default, caching is turned off. To turn caching on, run
galah_config(caching=TRUE)
If things aren’t working as expected, more detail (particularly about web requests and caching behaviour) can be obtained by setting the verbose configuration option:
galah_config(verbose=TRUE)
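The options above can also be supplied together, since galah_config() accepts several settings in one call (as in the email and atlas example earlier). The following is a minimal sketch rather than a recommended setup: the email is a placeholder, the directory name is arbitrary, and creating the directory first is a precaution based on the assumption that galah expects it to exist.

```r
library(galah)

# Arbitrary, non-temporary location for cached results (assumption)
cache_dir <- file.path("example", "dir")
dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)

galah_config(
  email = "myemail@gmail.com",   # placeholder; use your ALA-registered email
  caching = TRUE,                # turn caching on
  cache_directory = cache_dir,   # keep cached files between sessions
  verbose = TRUE                 # report web requests and caching behaviour
)
```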
ALA requires that you provide a reason when downloading occurrence data (via the galah ala_occurrences() function). The reason is set as “scientific research” by default, but you can change this using galah_config(). See find_reasons() for valid download reasons.
galah_config(download_reason_id=your_reason_id)