stanza: An R Interface to the Stanford NLP Toolkit

2025-05-16

Overview

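The stanza package provides an R interface to the Stanford NLP Group’s Stanza Python library, a collection of tools for natural language processing in many human languages. With stanza, you can build pipelines that tokenize text, expand multi-word tokens, tag parts of speech, lemmatize words, and recognize named entities.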

Installation

Step 1: Install the R package

First, install the stanza R package from CRAN:

install.packages("stanza")

Step 2: Install the Python backend

You can install the Python package using either virtualenv (recommended):

library("stanza")
virtualenv_install_stanza()

Or using conda if you prefer:

library("stanza")
conda_install_stanza()

Environment variables

Make sure that pip is installed along with the Python version you choose. To use a specific Python for the virtualenv, set the environment variable RETICULATE_PYTHON. For example, when testing on Windows, I set RETICULATE_PYTHON to "C:/apps/Python/python.exe" during the installation:

python_path <- normalizePath("C:/apps/Python/python.exe")
Sys.setenv(RETICULATE_PYTHON = python_path)
library("stanza")

virtualenv_install_stanza()

After the installation, however, the following is sufficient:

library("stanza")
stanza_initialize(virtualenv = "stanza")
stanza_options()
stanza_download("en")

The virtualenv at "~\\.virtualenvs\\stanza" is then detected automatically. If RETICULATE_PYTHON is still set to "C:/apps/Python/python.exe", the correct environment is not found and stanza cannot be loaded.
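A minimal sketch of clearing the override in the current session, using only base R. It assumes that unsetting RETICULATE_PYTHON before Python is initialized is enough for the "stanza" virtualenv to be picked up. The path is the one from the example above, and the stanza calls are left as comments because they need the Python backend.

```r
# Point reticulate at a specific interpreter for the installation step only.
Sys.setenv(RETICULATE_PYTHON = "C:/apps/Python/python.exe")
# library("stanza"); virtualenv_install_stanza()

# Afterwards, remove the override so it no longer takes precedence.
Sys.unsetenv("RETICULATE_PYTHON")

# NA means the variable is no longer set in this session.
python_override <- Sys.getenv("RETICULATE_PYTHON", unset = NA)
is.na(python_override)

# library("stanza"); stanza_initialize(virtualenv = "stanza")
```

If the variable is defined in .Renviron or at the system level, it would have to be removed there as well, and starting a fresh R session after the installation is the safer route.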

Getting Started

Load the package and initialize

library("stanza")
stanza_initialize(virtualenv = "stanza")

Download language models

Before processing text, you need to download language models. Stanza supports over 70 languages; the language codes and the performance of the models are listed on the Stanza homepage.

To download the English model:

stanza_download("en")

Similarly, for German:

stanza_download("de")

Building a Pipeline

A natural language processing pipeline can be created by specifying the language and desired processors as a comma-separated string:

processors <- 'tokenize,ner,lemma,pos,mwt'
p <- stanza_pipeline(language = "en", processors = processors)
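When the set of processors is assembled programmatically, the comma-separated string can be built from a character vector. The sketch below uses only base R; the variable name wanted is illustrative, and the pipeline call is left as a comment because it needs the Python backend.

```r
# Processors to run, one element per processor.
wanted <- c("tokenize", "mwt", "pos", "lemma", "ner")

# Collapse into the comma-separated form used above.
processors <- paste(wanted, collapse = ",")
processors

# p <- stanza_pipeline(language = "en", processors = processors)
```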

The Stanza documentation provides detailed information on all available processors.

Using specific models for processors

To select specific models for each processor, use a named list:

processors_specific <- list(tokenize = 'gsd', pos = 'hdt', ner = 'conll03', lemma = 'default')
# Pass the named list; each model name must be available for the chosen language.
p_specific <- stanza_pipeline(language = "en", processors = processors_specific)
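Because the processor names carry the meaning in this form, a missing or duplicated name is easy to overlook. The following sketch checks the list before it is handed to the pipeline. check_processors() is a hypothetical helper written for this example, not a function of the stanza package, and it does not verify that the models exist for a given language.

```r
# Hypothetical helper (not part of stanza): stop early if the list is malformed.
check_processors <- function(x) {
  nms <- names(x)
  stopifnot(
    is.list(x),
    !is.null(nms),          # every entry needs a processor name
    all(nzchar(nms)),       # no empty names
    !anyDuplicated(nms),    # each processor listed once
    all(vapply(x, function(m) is.character(m) && length(m) == 1L, logical(1)))
  )
  invisible(x)
}

processors_specific <- list(tokenize = "gsd", pos = "hdt",
                            ner = "conll03", lemma = "default")
check_processors(processors_specific)
names(processors_specific)
```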

Processing Text

The stanza_pipeline() function returns a pipeline function that transforms text into annotated document objects:

doc <- p('R is a collaborative project with many contributors.')
doc
#> <stanza_document>
#>   number of sentences: 1
#>   number of tokens: 9
#>   number of words: 9

# Using the pipeline with specific processor models
doc_specific <- p_specific('R is a collaborative project with many contributors.')
doc_specific
#> <stanza_document>
#>   number of sentences: 1
#>   number of tokens: 9
#>   number of words: 9
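Since the pipeline is an ordinary R function, it can be applied to several texts with lapply(). The sketch below shows the pattern. annotate_all() and toy_pipeline() are illustrative names, and toy_pipeline() is only a whitespace splitter standing in for a real pipeline so that the block runs without the Python backend.

```r
# Apply a pipeline function to every element of a character vector.
annotate_all <- function(pipeline, texts) {
  lapply(texts, pipeline)
}

# Stand-in for illustration only: splits on whitespace instead of calling Stanza.
toy_pipeline <- function(text) strsplit(text, "\\s+")[[1]]

texts <- c("R is a collaborative project.", "It has many contributors.")
docs <- annotate_all(toy_pipeline, texts)
length(docs)

# With the real pipeline: docs <- annotate_all(p, texts)
```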

Extracting Results

Stanza provides several helper functions to extract different types of information from the processed documents:

Sentences

sents(doc)
#> [[1]]
#> [1] "R"             "is"            "a"             "collaborative"
#> [5] "project"       "with"          "many"          "contributors" 
#> [9] "."
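The printed result suggests that sents() returns a list with one character vector per sentence. Under that assumption, base R is enough to summarize it. The sketch below recreates the output shown above by hand so that it runs without the Python backend.

```r
# Copy of the sents(doc) output shown above: one character vector per sentence.
sentences <- list(
  c("R", "is", "a", "collaborative", "project",
    "with", "many", "contributors", ".")
)

# Number of tokens in each sentence.
tokens_per_sentence <- lengths(sentences)
tokens_per_sentence

# One space-separated string per sentence.
sentence_text <- vapply(sentences, paste, character(1), collapse = " ")
sentence_text
```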

Words with linguistic features

words(doc)
#> [1] "R"             "is"            "a"             "collaborative"
#> [5] "project"       "with"          "many"          "contributors" 
#> [9] "."

Tokens

tokens(doc)
#> [1] "R"             "is"            "a"             "collaborative"
#> [5] "project"       "with"          "many"          "contributors" 
#> [9] "."

Named entities

entities(doc)
#> list()

Multi-word tokens

multi_word_token(doc)
#>   tid wid         token          word
#> 1   1   1             R             R
#> 2   2   2            is            is
#> 3   3   3             a             a
#> 4   4   4 collaborative collaborative
#> 5   5   5       project       project
#> 6   6   6          with          with
#> 7   7   7          many          many
#> 8   8   8  contributors  contributors
#> 9   9   9             .             .
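In this table, a multi-word token would appear as a row whose token differs from its word. The sketch below filters for such rows. It assumes the result is a data frame with the columns tid, wid, token, and word as printed, and it rebuilds the table by hand so that it runs without the Python backend.

```r
# Copy of the multi_word_token(doc) output shown above.
mwt <- data.frame(
  tid = 1:9,
  wid = 1:9,
  token = c("R", "is", "a", "collaborative", "project",
            "with", "many", "contributors", "."),
  word = c("R", "is", "a", "collaborative", "project",
           "with", "many", "contributors", "."),
  stringsAsFactors = FALSE
)

# Rows where a token was expanded into a different word.
expanded <- mwt[mwt$token != mwt$word, ]
nrow(expanded)
```

For the example sentence the result is empty, since every token maps to exactly one identical word.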