The stanza package provides an R interface to the Stanford NLP Group’s Stanza Python library, a collection of tools for natural language processing in many human languages. With stanza, you can tokenize text, expand multi-word tokens, tag parts of speech, lemmatize, recognize named entities, and parse dependencies.
First, install the stanza R package from CRAN:
install.packages("stanza")
You can install the Python package using either virtualenv (recommended):
library("stanza")
virtualenv_install_stanza()
Or using conda if you prefer:
library("stanza")
conda_install_stanza()
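Either way, stanza builds on the reticulate package, so you can check which Python installation reticulate picked up (a quick sanity check, assuming reticulate is available):

reticulate::py_config()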
Make sure that pip is installed along with the Python version you choose. To set a specific Python for the virtualenv, use the environment variable RETICULATE_PYTHON. For example, when testing on Windows, I set RETICULATE_PYTHON to "C:/apps/Python/python.exe" during the installation:

python_path <- normalizePath("C:/apps/Python/python.exe")
Sys.setenv(RETICULATE_PYTHON = python_path)
library("stanza")
virtualenv_install_stanza()
However, after the installation,

library("stanza")
stanza_initialize(virtualenv = "stanza")
stanza_options()
stanza_download("en")

is sufficient, since "~\\.virtualenvs\\stanza" is then detected. If RETICULATE_PYTHON is still set to "C:/apps/Python/python.exe", stanza does not find the correct environment and therefore cannot be loaded.
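In that case, one remedy (a minimal sketch, assuming the variable is still set in the current session) is to clear it before initializing:

Sys.unsetenv("RETICULATE_PYTHON")

After that, loading and initializing the package works as usual: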
library("stanza")
stanza_initialize(virtualenv = "stanza")
Before processing text, you need to download language models. Stanza supports over 70 languages; the language codes and the performance of the models can be found on the Stanza homepage.
To download the English model:
stanza_download("en")
Similarly, for German:
stanza_download("de")
A natural language processing pipeline can be created by specifying the language and desired processors as a comma-separated string:
processors <- 'tokenize,ner,lemma,pos,mwt'
p <- stanza_pipeline(language = "en", processors = processors)
The Stanza documentation provides detailed information on all available processors:

tokenize: Split text into sentences and words
mwt: Expand multi-word tokens
pos: Part-of-speech tagging
lemma: Lemmatization
ner: Named entity recognition
depparse: Dependency parsing

To select specific models for each processor, use a named list:
processors_specific <- list(tokenize = 'gsd', pos = 'hdt', ner = 'conll03', lemma = 'default')
p_specific <- stanza_pipeline(language = "en", processors = processors_specific)
The stanza_pipeline() function returns a pipeline function that transforms text into annotated document objects:
doc <- p('R is a collaborative project with many contributors.')
doc
#> <stanza_document>
#> number of sentences: 1
#> number of tokens: 9
#> number of words: 9
# Using the pipeline with specific processor models
doc_specific <- p_specific('R is a collaborative project with many contributors.')
doc_specific
#> <stanza_document>
#> number of sentences: 1
#> number of tokens: 9
#> number of words: 9
Stanza provides several helper functions to extract different types of information from the processed documents:
sents(doc)
#> [[1]]
#> [1] "R" "is" "a" "collaborative"
#> [5] "project" "with" "many" "contributors"
#> [9] "."
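For input containing more than one sentence, sents() should return one list element per sentence. A quick check (output omitted, since it depends on the loaded model):

doc_multi <- p('R is free software. It runs on many platforms.')
sents(doc_multi)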
words(doc)
#> [1] "R" "is" "a" "collaborative"
#> [5] "project" "with" "many" "contributors"
#> [9] "."
tokens(doc)
#> [1] "R" "is" "a" "collaborative"
#> [5] "project" "with" "many" "contributors"
#> [9] "."
entities(doc)
#> list()
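The example sentence contains no named entities, which is why the result is an empty list. A sentence mentioning proper nouns should produce hits; a quick check (output omitted, as it depends on the NER model):

doc_ner <- p('The R Foundation is based in Vienna, Austria.')
entities(doc_ner)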
multi_word_token(doc)
#> tid wid token word
#> 1 1 1 R R
#> 2 2 2 is is
#> 3 3 3 a a
#> 4 4 4 collaborative collaborative
#> 5 5 5 project project
#> 6 6 6 with with
#> 7 7 7 many many
#> 8 8 8 contributors contributors
#> 9 9 9 . .
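For English the token and word columns coincide, since English has hardly any multi-word tokens. Languages with contractions show the expansion. A minimal sketch, assuming the German models were downloaded as above (the example sentence is made up):

p_de <- stanza_pipeline(language = "de", processors = 'tokenize,mwt')
doc_de <- p_de('Wir gehen zum Markt.')  # 'zum' should expand to 'zu' + 'dem'
multi_word_token(doc_de)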