The stanza package provides an R interface to the Stanford NLP Group’s Stanza Python library, a collection of tools for natural language processing in many human languages. With stanza, you can tokenize text, expand multi-word tokens, tag parts of speech, lemmatize, recognize named entities, and parse dependencies.
First, install the stanza R package from CRAN:
install.packages("stanza")
You can install the Python package using either virtualenv (recommended):
library("stanza")
virtualenv_install_stanza()
Or using conda if you prefer:
library("stanza")
conda_install_stanza()
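Either way, stanza builds on the reticulate package, so you can check which Python installation reticulate picked up (a quick sanity check, assuming reticulate is available):

reticulate::py_config()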
Make sure that pip is installed along with the Python version you choose. To set a specific Python for the virtualenv, use the environment variable RETICULATE_PYTHON. For example, when testing on Windows, I set RETICULATE_PYTHON to "C:/apps/Python/python.exe" during the installation:

python_path <- normalizePath("C:/apps/Python/python.exe")
Sys.setenv(RETICULATE_PYTHON = python_path)
library("stanza")
virtualenv_install_stanza()
However, after the installation,

library("stanza")
stanza_initialize(virtualenv = "stanza")
stanza_options()
stanza_download("en")

is sufficient, since "~\\.virtualenvs\\stanza" is then detected. If RETICULATE_PYTHON is still set to "C:/apps/Python/python.exe", stanza does not find the correct environment and therefore cannot be loaded.
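In that case, one remedy (a minimal sketch, assuming the variable is still set in the current session) is to clear it before initializing:

Sys.unsetenv("RETICULATE_PYTHON")

After that, loading and initializing the package works as usual: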
library("stanza")
stanza_initialize(virtualenv = "stanza")
Before processing text, you need to download language models. Stanza supports over 70 languages; the language codes and the performance of the models can be found on the Stanza homepage.
To download the English model:
stanza_download("en")
Similarly, for German:
stanza_download("de")
A natural language processing pipeline can be created by specifying the language and desired processors as a comma-separated string:
processors <- 'tokenize,ner,lemma,pos,mwt'
p <- stanza_pipeline(language = "en", processors = processors)
The Stanza documentation provides detailed information on all available processors:

tokenize: Split text into sentences and words
mwt: Expand multi-word tokens
pos: Part-of-speech tagging
lemma: Lemmatization
ner: Named entity recognition
depparse: Dependency parsing

To select specific models for each processor, use a named list:
processors_specific <- list(tokenize = 'gsd', pos = 'hdt', ner = 'conll03', lemma = 'default')
p_specific <- stanza_pipeline(language = "en", processors = processors_specific)
The stanza_pipeline() function returns a pipeline function that transforms text into annotated document objects:
doc <- p('R is a collaborative project with many contributors.')
doc
#> <stanza_document>
#> number of sentences: 1
#> number of tokens: 9
#> number of words: 9
# Using the pipeline with specific processor models
doc_specific <- p_specific('R is a collaborative project with many contributors.')
doc_specific
#> <stanza_document>
#> number of sentences: 1
#> number of tokens: 9
#> number of words: 9
Stanza provides several helper functions to extract different types of information from the processed documents:
sents(doc)
#> [[1]]
#> [1] "R" "is" "a" "collaborative"
#> [5] "project" "with" "many" "contributors"
#> [9] "."
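For input containing more than one sentence, sents() should return one list element per sentence. A quick check (output omitted, since it depends on the loaded model):

doc_multi <- p('R is free software. It runs on many platforms.')
sents(doc_multi)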
words(doc)
#> [1] "R" "is" "a" "collaborative"
#> [5] "project" "with" "many" "contributors"
#> [9] "."
tokens(doc)
#> [1] "R" "is" "a" "collaborative"
#> [5] "project" "with" "many" "contributors"
#> [9] "."
entities(doc)
#> list()
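The example sentence contains no named entities, which is why the result is an empty list. A sentence mentioning proper nouns should produce hits; a quick check (output omitted, as it depends on the NER model):

doc_ner <- p('The R Foundation is based in Vienna, Austria.')
entities(doc_ner)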
multi_word_token(doc)
#> tid wid token word
#> 1 1 1 R R
#> 2 2 2 is is
#> 3 3 3 a a
#> 4 4 4 collaborative collaborative
#> 5 5 5 project project
#> 6 6 6 with with
#> 7 7 7 many many
#> 8 8 8 contributors contributors
#> 9 9 9 . .
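For English the token and word columns coincide, since English has hardly any multi-word tokens. Languages with contractions show the expansion. A minimal sketch, assuming the German models were downloaded as above (the example sentence is made up):

p_de <- stanza_pipeline(language = "de", processors = 'tokenize,mwt')
doc_de <- p_de('Wir gehen zum Markt.')  # 'zum' should expand to 'zu' + 'dem'
multi_word_token(doc_de)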