The parseLatex package

Introduction

Parsing LaTeX is tricky, because LaTeX macros (in LaTeX packages, or in user code) can change the parsing rules as they go.

parseLatex is not a LaTeX interpreter (mostly it isn’t, but see the detailed comparison below), so it can’t follow such changes: it applies the same parsing rules to all the code it looks at. If you’re using a LaTeX package that uses non-standard rules, you can use those rules, but they have to apply to the whole section of code passed to parseLatex().

Subject to the limitation that the code only uses one set of rules, parseLatex should be able to parse any LaTeX code. It extends the base tools::parseLatex() function in a few ways:

  1. It classifies every character in the source file according to TeX “catcodes”. The base function only handles some of them.
  2. The parseLatex::parseLatex() function marks its output with class "LaTeX2" instead of "LaTeX", and marks each item in the output with class "LaTeX2item". This allows it to print things in a more readable way.
  3. The parseLatex package includes a large selection of functions for extracting and modifying parts of the parsed LaTeX.

More differences are listed below.
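As a minimal sketch of the class difference described in item 2, the code below parses the same short string with both parsers. The input string is made up for illustration, and the parseLatex part only runs if that package is installed.

```r
# A made-up input, used only for illustration.
src <- "Hello \\textbf{world}"

# Base R parser: the result carries class "LaTeX".
base_parsed <- tools::parseLatex(src)
print(class(base_parsed))

# This package's parser: per the description above, the result should carry
# class "LaTeX2" and each item class "LaTeX2item".
if (requireNamespace("parseLatex", quietly = TRUE)) {
  new_parsed <- parseLatex::parseLatex(src)
  print(inherits(new_parsed, "LaTeX2"))
  print(inherits(new_parsed[[1]], "LaTeX2item"))
}
```

Using inherits() rather than comparing class() directly keeps the check valid even if the objects carry additional classes.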

Demo

A simple demonstration is in order.

First, we use knitr to create a LaTeX table.

library(knitr)
latex <- kable(mtcars[1:2, 1:2], format = "latex")
cat(latex)
#> 
#> \begin{tabular}{l|r|r}
#> \hline
#>   & mpg & cyl\\
#> \hline
#> Mazda RX4 & 21 & 6\\
#> \hline
#> Mazda RX4 Wag & 21 & 6\\
#> \hline
#> \end{tabular}

Next, we parse it with parseLatex().

library(parseLatex)
parsed <- parseLatex(latex)

Printing the result would appear to duplicate the input, but in fact it is quite different. parsed is a list of class "LaTeX2". Items in the list are of class "LaTeX2item". In this example, there are only two items: the blank that knitr puts at the beginning of each table, and a second entry which is the whole table environment:

parsed[[1]]
#> SPECIAL:
parsed[[2]]
#> ENVIRONMENT: \begin{tabular}{l|r|r}
#> \hline
#>   & mpg & cyl\\
#> \hline
#> Mazda RX4 & 21 & 6\\
#> \hline
#> Mazda RX4 Wag & 21 & 6\\
#> \hline
#> \end{tabular}

“SPECIAL” and “ENVIRONMENT” label the types of items. The table environment contains the environment name, and a "LaTeX2" list containing all the content.

If we hadn’t known where we put it, we could find the table location using find_env():

find_env(parsed, "tabular")
#> [1] 2

We can extract the table, and use other functions to work with it:

table <- parsed[[find_env(parsed, "tabular")]]
# Get the alignment options from the content
columnOptions(table)
#> {l|r|r}
tableCell(table, 2, 2) # The title counts as a row
#>  21
tableCell(table, 1, 1) <- "Model"
table
#> ENVIRONMENT: \begin{tabular}{l|r|r}
#> \hline
#>  Model & mpg & cyl\\
#> \hline
#> Mazda RX4 & 21 & 6\\
#> \hline
#> Mazda RX4 Wag & 21 & 6\\
#> \hline
#> \end{tabular}

Differences from tools::parseLatex

The parser in this package is based on the one used by the base R tools::parseLatex function (which I also wrote, based on other parsers in R). The output format is similar, but not compatible. These are the main differences.

Tag          Description                                        Type
BLOCK        A block enclosed in curly braces                   list
COMMENT      A LaTeX comment                                    character
DISPLAYMATH  A display math block                               list
ENVIRONMENT  A LaTeX environment                                list
MACRO        A LaTeX macro                                      character
MATH         An inline math block                               list
SPECIAL      A non-alphabetic character                         character
TEXT         Text (consisting of letters only)                  character
VERB         A verbatim environment                             character
DEFINITION   A command or environment definition                list
ERROR        A block of items referenced in an error message    list

As mentioned above, parseLatex() does a little bit more than parsing. Both versions recognize LaTeX environments and verbatim code.

The parser in this package also takes special action when it sees the document environment: it stops parsing at \end{document}. (You can use the get_leftovers() function to see what parts of the input were skipped.)

It also changes the rules a bit when it sees macros defining things: \newenvironment, \renewenvironment, \newcommand, \renewcommand and \providecommand. The arguments to these macros are parsed but not interpreted, allowing definitions to parse without triggering a syntax error. For example:

\newenvironment{newenv}{\begin{oldenv}}{\end{oldenv}}

The \begin{oldenv} part of the definition shouldn’t be interpreted here as the start of an oldenv environment, because \end{oldenv} isn’t in the same {} block.
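One way to check this behaviour is to run the definition through both parsers and see which ones return without an error. This is only a sketch: parses_ok() is a hypothetical helper written for this example, not part of either package, and no particular result is assumed for the base parser.

```r
defn <- "\\newenvironment{newenv}{\\begin{oldenv}}{\\end{oldenv}}"

# Hypothetical helper: TRUE if the parser returns without signalling an error.
parses_ok <- function(parser, text) {
  tryCatch({
    parser(text)
    TRUE
  }, error = function(e) FALSE)
}

# Base parser, shown for comparison only.
print(parses_ok(tools::parseLatex, defn))

# This package's parser, which should accept the definition per the text above.
if (requireNamespace("parseLatex", quietly = TRUE)) {
  print(parses_ok(parseLatex::parseLatex, defn))
}
```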

One plain TeX version of these macros is \def. It is recognized and an attempt is made to handle it, but there’s some really arcane syntax possible with \def. If you use that, it probably won’t be parsed properly. Stick with simple syntax like

\def\bea{\begin{eqnarray*}}

and you should be okay.
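To see how a simple \def is handled, one option is to parse it and print the result. The sketch below catches any error instead of assuming success, and only runs the parse if the parseLatex package is installed.

```r
simple_def <- "\\def\\bea{\\begin{eqnarray*}}"

if (requireNamespace("parseLatex", quietly = TRUE)) {
  # Keep the error object if parsing fails, so the outcome can be inspected.
  parsed_def <- tryCatch(parseLatex::parseLatex(simple_def),
                         error = function(e) e)
  print(parsed_def)
}
```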

Magic Comments

The parseLatex::parseLatex() parser can parse most LaTeX inputs, but not all. To allow it to be used on files that contain unsupported syntax, it allows “magic comments” to be inserted to control its actions.

Several LaTeX editors support magic comments of the form % !TEX ..., and those were the model for parseLatex magic comment support. There are 4 magic comments supported in this parser:

The parser is quite strict about the format of the magic comments. The whitespace between parts of it must be spaces, not tabs, and nothing else can appear in the comment after the magic text other than more spaces.
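Because tabs are easy to miss in an editor, a quick pre-check of the source can help. The sketch below uses only base R; comment_has_tab() is a hypothetical helper, and the sample lines use the editor-style % !TEX form purely as illustration. It does not know which magic comments the parser supports; it only flags comment lines that contain a tab.

```r
# Hypothetical helper: TRUE for lines that start with "%" (ignoring leading
# whitespace) and contain a tab character anywhere.
comment_has_tab <- function(lines) {
  is_comment <- startsWith(trimws(lines, which = "left"), "%")
  is_comment & grepl("\t", lines, fixed = TRUE)
}

lines <- c("% !TEX root = main.tex",   # spaces only
           "% !TEX\troot = main.tex",  # tab inside the comment
           "plain\ttext")              # tab, but not a comment

print(comment_has_tab(lines))
```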

Work in progress!

This is a work in progress, so if you have a use for something like this and need help, post an “issue” on the GitHub page: https://github.com/dmurdoch/parseLatex