`kgrams` provides tools for training and evaluating *k*-gram language
models, including several probability smoothing methods, perplexity
computations, random text generation and more. It is based on a C++
back-end, which makes `kgrams` fast, coupled with an accessible R API
that aims to streamline the process of model building. It is suitable
for small- and medium-sized NLP experiments, baseline model building,
and pedagogical purposes.

If you have no idea what *k*-gram models are *and* didn’t get here by
accident, you can check out my hands-on tutorial post on *k*-gram
language models using R at DataScience+.

You can install the latest release of `kgrams` from CRAN with:

`install.packages("kgrams")`

You can install the development version from my R-universe with:

`install.packages("kgrams", repos = "https://vgherard.r-universe.dev/")`

This example shows how to train a modified Kneser-Ney 4-gram model on
Shakespeare’s play “Much Ado About Nothing” using `kgrams`.

```
library(kgrams)
# Get k-gram frequency counts from text, for k = 1:4
freqs <- kgram_freqs(kgrams::much_ado, N = 4)
# Build modified Kneser-Ney 4-gram model, with discount parameters D1, D2, D3.
mkn <- language_model(freqs, smoother = "mkn", D1 = 0.25, D2 = 0.5, D3 = 0.75)
```

We can now use this `language_model` to compute sentence and word
continuation probabilities:

```
# Compute sentence probabilities
probability(c("did he break out into tears ?",
              "we are predicting sentence probabilities ."
              ),
            model = mkn
            )
#> [1] 2.466856e-04 1.184963e-20
# Compute word continuation probabilities
probability(c("tears", "pieces") %|% "did he break out into", model = mkn)
#> [1] 9.389238e-01 3.834498e-07
```
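The same model can also be scored on whole texts through perplexity, one of the features mentioned in the introduction. The following is a minimal sketch, not part of the original example: it rebuilds the model so that it runs on its own, and it assumes that `perplexity()` accepts a character vector together with a `model` argument and that a second play ships with the package as `kgrams::midsummer`. Check both against the reference page before relying on them.

```r
library(kgrams)

# Rebuild the model from the example above so this block is self-contained
freqs <- kgram_freqs(kgrams::much_ado, N = 4)
mkn <- language_model(freqs, smoother = "mkn", D1 = 0.25, D2 = 0.5, D3 = 0.75)

# Perplexity on the training text (optimistic: the model has seen this text)
ppl_train <- perplexity(kgrams::much_ado, model = mkn)

# Perplexity on a different play (assumption: `midsummer` is bundled with kgrams)
ppl_test <- perplexity(kgrams::midsummer, model = mkn)

c(train = ppl_train, test = ppl_test)
```

Lower perplexity means the model finds the text less surprising, so the value on held-out text is the more honest of the two when comparing smoothers or discount parameters.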

Here are some sentences sampled from the language model’s
distribution at temperatures `t = c(1, 0.1, 10)`:

```
# Sample sentences from the language model at different temperatures
set.seed(840)
sample_sentences(model = mkn, n = 3, max_length = 10, t = 1)
#> [1] "i have studied eight or nine truly by your office [...] (truncated output)"
#> [2] "ere you go : <EOS>"
#> [3] "don pedro welcome signior : <EOS>"
sample_sentences(model = mkn, n = 3, max_length = 10, t = 0.1)
#> [1] "i will not be sworn but love may transform me [...] (truncated output)"
#> [2] "i will not fail . <EOS>"
#> [3] "i will go to benedick and counsel him to fight [...] (truncated output)"
sample_sentences(model = mkn, n = 3, max_length = 10, t = 10)
#> [1] "july cham's incite start ancientry effect torture tore pains endings [...] (truncated output)"
#> [2] "lastly gallants happiness publish margaret what by spots commodity wake [...] (truncated output)"
#> [3] "born all's 'fool' nest praise hurt messina build afar dancing [...] (truncated output)"
```
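The effect of the temperature on the samples above can be illustrated with a small base R sketch. It uses the common definition of temperature sampling, in which each probability is raised to the power `1 / t` and the result is renormalized; that `kgrams` applies exactly this transformation internally is an assumption here. The helper `temper()` and the toy distribution `p` are hypothetical and are not part of the package.

```r
# Hypothetical helper (not a kgrams function): rescale a probability
# vector at temperature t by raising to the power 1 / t and renormalizing.
temper <- function(p, t) {
  stopifnot(t > 0, all(p >= 0))
  w <- p^(1 / t)
  w / sum(w)
}

# Toy next-word distribution (made-up numbers, not kgrams output)
p <- c(tears = 0.7, pieces = 0.2, laughter = 0.1)

temper(p, t = 1)    # t = 1 leaves the distribution unchanged
temper(p, t = 0.1)  # low t concentrates mass on the most likely word
temper(p, t = 10)   # high t flattens the distribution towards uniform

# Draw a few words from the low-temperature distribution
set.seed(840)
sample(names(p), size = 5, replace = TRUE, prob = temper(p, t = 0.1))
```

This is consistent with the output above: at `t = 0.1` the sampled sentences stick to the most probable continuations, while at `t = 10` rare words become almost as likely as common ones.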

For further help, you can consult the reference page of the `kgrams`
website or open an issue on the GitHub repository of `kgrams`. A
vignette is available on the website, illustrating the process of
building language models in depth.