The statlingua R package is designed to help bridge the gap between complex statistical outputs and clear, human-readable explanations. By leveraging the power of Large Language Models (LLMs), statlingua helps you effortlessly translate the dense jargon of statistical models—coefficients, p-values, model fit indices, and more—into straightforward, context-aware natural language.
Whether you’re a student grappling with new statistical concepts, a researcher needing to communicate findings to a broader audience, or a data scientist looking to quickly draft reports, statlingua makes your statistical journey smoother and more accessible.
Statistical models are powerful, but their outputs can be intimidating. By providing clear and contextualized explanations, statlingua helps you focus on the implications of your findings rather than getting bogged down in technical minutiae.
As of now, statlingua explicitly supports a variety of common statistical models in R, including:

* Objects of class `"htest"` (e.g., from `t.test()` and `prop.test()`).
* Linear models (`lm()`) and Generalized Linear Models (`glm()`).
* Mixed-effects models from packages nlme (`lme()`) and lme4 (`lmer()`, `glmer()`).
* Generalized additive models (`gam()` from package mgcv).
* Survival models (`survreg()`, `coxph()` from package survival).
* Proportional odds models (`polr()` from package MASS).
* Decision trees (`rpart()` from package rpart).

statlingua is not yet on CRAN, but you can install the development version from GitHub:
```r
if (!requireNamespace("remotes")) {
  install.packages("remotes")
}
remotes::install_github("bgreenwell/statlingua")
```
You’ll also need to install the ellmer package, which you can obtain from CRAN:
```r
install.packages("ellmer")  # >= 0.2.0
```
statlingua doesn’t directly handle API keys or LLM communication. It acts as a sophisticated prompt engineering toolkit that prepares inputs and then passes them to ellmer. The ellmer package is responsible for interfacing with various LLM providers (e.g., OpenAI, Google AI Studio, Anthropic).
Please refer to the ellmer package documentation for detailed instructions on setting up your preferred LLM provider and its associated API key environment variable (e.g., `OPENAI_API_KEY`, `GEMINI_API_KEY`, etc.).

Once ellmer is installed and has access to an LLM provider, statlingua will seamlessly leverage that connection.
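As a sketch of how the pieces fit together (constructor names come from ellmer; check its documentation for the providers supported by your installed version), you can register a key and create a client that statlingua will later use:

```r
# For interactive experimentation only; in practice, prefer storing
# keys in an .Renviron file rather than in scripts.
Sys.setenv(GEMINI_API_KEY = "<YOUR_API_KEY_HERE>")

# Any ellmer chat client can be passed to statlingua via the 'client'
# argument; pick whichever provider you have credentials for.
client <- ellmer::chat_google_gemini(echo = "none")  # uses GEMINI_API_KEY
# client <- ellmer::chat_openai(echo = "none")       # uses OPENAI_API_KEY
```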
```r
# Ensure you have an appropriate API key set up first!
# Sys.setenv(GEMINI_API_KEY = "<YOUR_API_KEY_HERE>")
library(statlingua)

# Fit a polynomial regression model
fm_cars <- lm(dist ~ poly(speed, degree = 2), data = cars)
summary(fm_cars)

# Define some context (highly recommended!)
cars_context <- "
This model analyzes the 'cars' dataset from the 1920s. Variables include:
* 'dist' - The distance (in feet) taken to stop.
* 'speed' - The speed of the car (in mph).
We want to understand how speed affects stopping distance in the model.
"

# Establish connection to an LLM provider (in this case, Google Gemini)
client <- ellmer::chat_google_gemini(echo = "none")  # defaults to gemini-2.0-flash

# Get an explanation
explain(
  fm_cars,                 # model for LLM to interpret/explain
  client = client,         # connection to LLM provider
  context = cars_context,  # additional context for LLM to consider
  audience = "student",    # target audience
  verbosity = "detailed",  # level of detail
  style = "markdown"       # output style
)

# Ask a follow-up question
client$chat(
  "How can I construct confidence intervals for each coefficient in the model?"
)
```
For more examples, including output, see the introductory vignette.
One of statlingua’s core strengths is its
extensibility. You can add or customize support for new statistical
model types by crafting specific prompt components. The system prompt
sent to the LLM is dynamically assembled from several markdown files
located in the inst/prompts/
directory of the package.
The main function `explain()` uses S3 dispatch. When `explain(my_model_object, ...)` is called, R looks for a method like `explain.class_of_my_model_object()`. If none is found, `explain.default()` is used.
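This is plain R S3 dispatch, which the following minimal standalone sketch (unrelated to statlingua's internals) illustrates:

```r
# A generic, a class-specific method, and a fallback
describe <- function(object, ...) UseMethod("describe")

describe.default <- function(object, ...) {
  "I don't know how to describe this class."
}

describe.lm <- function(object, ...) {
  paste("A linear model with", length(coef(object)), "coefficient(s).")
}

fm <- lm(dist ~ speed, data = cars)
describe(fm)       # dispatches to describe.lm()
describe(letters)  # falls back to describe.default()
```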
The prompts are organized as follows within `inst/prompts/`:

* `common/`: Contains base prompts applicable to all models.
  * `role_base.md`: Defines the fundamental role of the LLM.
  * `caution.md`: A general cautionary note appended to explanations.
* `audience/`: Markdown files for different target audiences (e.g., `novice.md`, `researcher.md`). The filename (e.g., "novice") matches the `audience` argument in `explain()`.
* `verbosity/`: Markdown files for different verbosity levels (e.g., `brief.md`, `detailed.md`). The filename matches the `verbosity` argument.
* `style/`: Markdown files defining the output format (e.g., `markdown.md`, `json.md`). The filename matches the `style` argument.
* `models/<model_class_name>/`: Directory for model-specific prompts. `<model_class_name>` should correspond to the R class of the statistical object (e.g., "lm", "glm", "htest").
  * `instructions.md`: The primary instructions for explaining this specific model type. This tells the LLM what to look for in the model output, how to interpret it, and what assumptions to discuss.
  * `role_specific.md` (Optional): Additional role details specific to this model type, augmenting `common/role_base.md`.
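The actual assembly helper is internal to statlingua, but as a rough sketch of the idea, the system prompt could be stitched together from these files along the following lines (the function name and exact concatenation order here are illustrative, not the package's API):

```r
# Illustrative only: read and concatenate the prompt pieces for one call.
assemble_sys_prompt_sketch <- function(base_dir, model = "lm",
                                       audience = "novice",
                                       verbosity = "moderate",
                                       style = "markdown") {
  read_md <- function(...) {
    path <- file.path(base_dir, ...)
    if (file.exists(path)) readLines(path, warn = FALSE) else character()
  }
  paste(c(
    read_md("common", "role_base.md"),
    read_md("models", model, "role_specific.md"),  # optional piece
    read_md("models", model, "instructions.md"),
    read_md("audience", paste0(audience, ".md")),
    read_md("verbosity", paste0(verbosity, ".md")),
    read_md("style", paste0(style, ".md")),
    read_md("common", "caution.md")
  ), collapse = "\n")
}
```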
**Example: `vglm` from the VGAM package**

Let's imagine you want to add dedicated support for `vglm` (Vector Generalized Linear Model) objects from the VGAM package.
1. **Create new prompt files.** Create a new directory `inst/prompts/models/vglm/`. Inside this directory, you'd add:

* `inst/prompts/models/vglm/instructions.md`: This file will contain the detailed instructions for the LLM on how to interpret `vglm` objects. You'd detail what aspects of `summary(vglm_object)` are important, how to discuss coefficients (potentially for multiple linear predictors), link functions, model fit statistics specific to `vglm`, and relevant assumptions.
```markdown
You are explaining a **Vector Generalized Linear Model (VGLM)** (from `VGAM::vglm()`).

**Core Concepts & Purpose:**
VGLMs are highly flexible, extending GLMs to handle multiple linear predictors and a
wider array of distributions and link functions, including multivariate responses.
Identify the **Family** (e.g., multinomial, cumulative) and **Link functions**.

**Interpretation:**
* **Coefficients:** Explain for each linear predictor. Pay attention to link functions
  (e.g., log odds, log relative risk). Clearly state reference categories.
* **Model Fit:** Discuss deviance, AIC, etc.
* **Assumptions:** Mention relevant assumptions.
```
* `inst/prompts/models/vglm/role_specific.md` (Optional): Add this if `vglm` models require the LLM to adopt a slightly more specialized persona.

```markdown
You have particular expertise in Vector Generalized Linear Models (VGLMs),
understanding their diverse applications for complex response types.
```
2. **Implement the S3 method.** Add an S3 method for `explain.vglm` in an R script (e.g., `R/explain_vglm.R`):

```r
#' Explain a vglm object
#'
#' @inheritParams explain
#' @param object A \code{vglm} object.
#' @export
explain.vglm <- function(
    object,
    client,
    context = NULL,
    audience = c("novice", "student", "researcher", "manager", "domain_expert"),
    verbosity = c("moderate", "brief", "detailed"),
    style = c("markdown", "html", "json", "text", "latex"),
    ...
) {
  audience <- match.arg(audience)
  verbosity <- match.arg(verbosity)
  style <- match.arg(style)

  # Use the internal .explain_core helper if it suits,
  # or implement custom logic if vglm needs special handling.
  # .explain_core handles system prompt assembly, user prompt building,
  # and calling the LLM via the client.
  # 'name' should match the directory name in inst/prompts/models/
  # 'model_description' is what's shown to the user in the prompt.
  .explain_core(
    object = object,
    client = client,
    context = context,
    audience = audience,
    verbosity = verbosity,
    style = style,
    name = "vglm",  # tells .assemble_sys_prompt() to look in inst/prompts/models/vglm/
    model_description = "Vector Generalized Linear Model (VGLM) from VGAM"
  )
}
```
The `summarize.vglm` method might also need to be implemented in `R/summarize.R` if `summary(object)` for `vglm` objects needs special capture or formatting for the LLM. If `utils::capture.output(summary(object))` is sufficient, `summarize.default` might work initially.
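If plain capture does turn out to be enough, a minimal `summarize.vglm` might be little more than the following sketch (the generic's exact signature in statlingua may differ):

```r
#' @export
summarize.vglm <- function(object, ...) {
  # Capture the printed summary as one string to embed in the user prompt.
  paste(utils::capture.output(summary(object)), collapse = "\n")
}
```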
3. **Add to `NAMESPACE` and document.** Ensure your `NAMESPACE` file (usually handled by roxygen2) contains:

```r
S3method(explain, vglm)
```

Also add roxygen2 documentation blocks for `explain.vglm`.
4. **Testing.** Thoroughly test with various `vglm` examples. You might need to iterate on your `instructions.md` and `role_specific.md` files to refine the LLM's explanations.
By following this pattern, statlingua can be systematically extended to cover a vast array of statistical models in R!
Contributions are welcome! Please see the GitHub issues for areas where you can help.
statlingua is available under the GNU General Public
License v3.0 (GNU GPLv3). See the LICENSE.md
file for more
details.