Parsing LaTeX is tricky, because LaTeX macros (in LaTeX packages, or in user code) can change the parsing rules as they go. parseLatex is not a LaTeX interpreter (at least mostly it isn’t, but see the detailed comparison below), so it can’t do that: it uses the same parsing rules for all code that it looks at. If you’re using a LaTeX package that uses non-standard rules, you can use those, but they have to apply to the whole section of code passed to parseLatex().
Subject to the limitation that the code only uses one set of rules, parseLatex should be able to parse any LaTeX code. It extends the base tools::parseLatex() function in a few ways:

- The parseLatex::parseLatex() function marks its output with class "LaTeX2" instead of "LaTeX", and marks each item in the output with class "LaTeX2item". This allows it to print things in a more readable way.
- The parseLatex package includes a large selection of functions for extracting and modifying parts of the parsed LaTeX.

More differences are listed below.
A simple demonstration is in order.
First, we use knitr to create a LaTeX table.
library(knitr)
latex <- kable(mtcars[1:2, 1:2], format = "latex")
cat(latex)
#>
#> \begin{tabular}{l|r|r}
#> \hline
#> & mpg & cyl\\
#> \hline
#> Mazda RX4 & 21 & 6\\
#> \hline
#> Mazda RX4 Wag & 21 & 6\\
#> \hline
#> \end{tabular}
Next, we parse it with parseLatex.
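The parsing call itself does not appear in the text. A minimal sketch, assuming the parseLatex package is installed and that parseLatex() takes the LaTeX source as its first argument (as tools::parseLatex() does), could look like this. The kable() call is repeated only so that the chunk stands alone.

```r
library(parseLatex)

# Recreate the LaTeX table so this chunk is self-contained
latex <- knitr::kable(mtcars[1:2, 1:2], format = "latex")

# parseLatex::parseLatex(), not tools::parseLatex()
parsed <- parseLatex(latex)

class(parsed)   # class of the whole result
length(parsed)  # number of top-level items
```

The name parsed matches the object used in the rest of this demonstration.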
Printing the result would appear to duplicate the input, but in fact it is quite different. parsed is a list of class "LaTeX2". Items in the list are of class "LaTeX2item". In this example, there are only two items: the blank that knitr puts at the beginning of each table, and a second entry which is the whole table environment:
parsed[[1]]
#> SPECIAL:
parsed[[2]]
#> ENVIRONMENT: \begin{tabular}{l|r|r}
#> \hline
#> & mpg & cyl\\
#> \hline
#> Mazda RX4 & 21 & 6\\
#> \hline
#> Mazda RX4 Wag & 21 & 6\\
#> \hline
#> \end{tabular}
“SPECIAL” and “ENVIRONMENT” label the types of items. The table environment contains the environment name, and a "LaTeX2" list containing all the content.
If we hadn’t known where we put it, we could find the table location using find_env():
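The lookup itself is not shown here, so the following is a sketch rather than the original chunk. It uses find_env() with the same arguments as the extraction code in this demonstration, and assumes the parseLatex package is installed.

```r
library(parseLatex)

latex  <- knitr::kable(mtcars[1:2, 1:2], format = "latex")
parsed <- parseLatex(latex)

# Position of the tabular environment in the top-level list
idx <- find_env(parsed, "tabular")
idx

# Confirm what kind of item sits at that position
latexTag(parsed[[idx]])
```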
We can extract the table, and use other functions to work with it:
table <- parsed[[find_env(parsed, "tabular")]]
# Get the alignment options from the content
columnOptions(table)
#> {l|r|r}
tableCell(table, 2,2) # The title counts as a row
#> 21
tableCell(table, 1,1) <- "Model"
table
#> ENVIRONMENT: \begin{tabular}{l|r|r}
#> \hline
#> Model & mpg & cyl\\
#> \hline
#> Mazda RX4 & 21 & 6\\
#> \hline
#> Mazda RX4 Wag & 21 & 6\\
#> \hline
#> \end{tabular}
tools::parseLatex
The parser in this package is based on the one used by the base R tools::parseLatex function (which I also wrote, based on other parsers in R). The output format is similar, but not compatible. These are the main differences.

- In both this package and tools::parseLatex, the result of calling the parser is a list of items.
- The list has class "LaTeX2" in this package, and class "LaTeX" in tools::parseLatex.
- Each item has a tag, with the latexTag() function identifying the type of item. In this package the possible tags are

Tag | Description | Type |
---|---|---|
BLOCK | A block enclosed in curly braces | list |
COMMENT | A LaTeX comment | character |
DISPLAYMATH | A display math block | list |
ENVIRONMENT | A LaTeX environment | list |
MACRO | A LaTeX macro | character |
MATH | An inline math block | list |
SPECIAL | A non-alphabetic character | character |
TEXT | Text (consisting of letters only) | character |
VERB | A verbatim environment | character |
DEFINITION | A command or environment definition | list |
ERROR | A block of items referenced in an error message | list |

- The tools::parseLatex() parser does not have the SPECIAL tag; such characters are included in TEXT. It also doesn’t have the DEFINITION or ERROR tags. Definitions are treated as regular macros, which sometimes leads to parsing errors. Errors are always fatal.
- This parser stops parsing at \end{document}, just as LaTeX does. The tools::parseLatex() parser continues parsing beyond that, often leading to parsing errors as it tries to parse things that LaTeX would ignore.
- Some items (COMMENT, MACRO, SPECIAL, TEXT, and VERB) are stored as length 1 character vectors; the others are stored as lists of items corresponding to their content.
- The tools::parseLatex() function stores some lists in two levels (e.g. the content of an environment named item would be in item[[2]]), while in this package, all lists contain the content directly (e.g. the content of that environment would be in item itself).
- Items in this package have class "LaTeX2item". tools::parseLatex() does not assign a class to items.
- There is a print method for class "LaTeX2item" so that individual items print nicely.
- The two parsers differ in how they handle verb macros like \Sexpr. The tools::parseLatex() parser assumed there would be no braces within the macro (which is the case for legal Sweave() source). This parser assumes any braces within the macro are balanced, e.g. this would be legal:

  \Sexpr{1 + {x <- 2; x + 1}}
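
Several of the differences listed above can be seen by running both parsers on the same input. This is a rough sketch, assuming the parseLatex package is installed; the input string is made up for illustration. For the base parser, the tag is read from the "latex_tag" attribute that tools::parseLatex() attaches to each item.

```r
library(parseLatex)

src <- "Hello \\textbf{world}!"

old <- tools::parseLatex(src)       # base R parser
new <- parseLatex::parseLatex(src)  # parser from this package

# Class of the overall result in each case
class(old)
class(new)

# Tags from the base parser, stored in an attribute
sapply(old, function(item) attr(item, "latex_tag"))

# Tags from this package, reported by latexTag()
sapply(new, latexTag)
```

Comparing the two tag vectors shows how punctuation is split out from TEXT in this package.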
As mentioned above, parseLatex() does a little bit more than parsing. Both versions recognize LaTeX environments and verbatim code.
The parser in this package also takes special action when it sees the document environment: it stops parsing at \end{document}. (You can use the get_leftovers() function to see what parts of the input were skipped.)
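As an illustration of the \end{document} behaviour, here is a small sketch. It assumes the parseLatex package is installed and, more importantly, that get_leftovers() accepts the original source text as its first argument; check the function's help page before relying on that.

```r
library(parseLatex)

src <- paste(
  "\\begin{document}",
  "Body text.",
  "\\end{document}",
  "Notes that LaTeX would never read.",
  sep = "\n"
)

# Parsing stops at \end{document}
parsed <- parseLatex(src)
parsed

# Assumption: get_leftovers() is called on the source text
get_leftovers(src)
```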
It also changes the rules a bit when it sees macros defining things: \newenvironment, \renewenvironment, \newcommand, \renewcommand and \providecommand. The arguments to these macros are parsed but not interpreted, allowing definitions to parse without triggering a syntax error. For example:

  \newenvironment{newenv}{\begin{oldenv}}{\end{oldenv}}

The \begin{oldenv} part of the definition shouldn’t be interpreted here as the start of an oldenv environment, because \end{oldenv} isn’t in the same {} block.
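The difference can be tried out on that definition with both parsers. This is a sketch assuming the parseLatex package is installed. Whether the base parser actually fails on this input is not asserted here, so its call is wrapped in tryCatch() to keep the script running either way.

```r
library(parseLatex)

def <- "\\newenvironment{newenv}{\\begin{oldenv}}{\\end{oldenv}}"

# The base parser treats \newenvironment as an ordinary macro, so the
# \begin{oldenv} inside the braces may be reported as a syntax error.
base_result <- tryCatch(
  tools::parseLatex(def),
  error = function(e) conditionMessage(e)
)
base_result

# This package parses the arguments without interpreting them
parsed <- parseLatex::parseLatex(def)
sapply(parsed, latexTag)
```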
One plain TeX version of these macros is \def. It is recognized and an attempt is made to handle it, but there’s some really arcane syntax possible with \def. If you use that, it probably won’t be parsed properly. Stick with simple syntax like \def\bea{\begin{eqnarray*}} and you should be okay.
The parseLatex::parseLatex() parser can parse most LaTeX inputs, but not all. To allow it to be used on files that contain unsupported syntax, it allows “magic comments” to be inserted to control its actions.
Several LaTeX editors support magic comments of the form % !TEX ..., and those were the model for parseLatex magic comment support. There are six magic comments supported in this parser:

- % !parser off
  This tells the parser to absorb all following text as part of the comment, so anything that would be classed as a parsing error is never seen.
- % !parser on
  This tells it to resume normal parsing.
- % !parser verb [name]
  This tells the parser to add the name to the list of macros holding verbatim text, i.e. the list given by the verb argument when parseLatex() was called. The name should include the backslash, e.g.
  % !parser verb \Sexpr
- % !parser defcmd [name]
  does the same for commands like \newcommand.
- % !parser defenv [name]
  does it for commands like \newenvironment.
- % !parser verbatim [name]
  This tells the parser to add the name to the list of environments holding verbatim text, i.e. the list given by the verbatim argument. For example
  % !parser verbatim Sinput

The parser is quite strict about the format of the magic comments. The whitespace between parts of it must be spaces, not tabs, and nothing else can appear in the comment after the magic text other than more spaces.
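To show how the off/on pair might be used, here is a small sketch, assuming the parseLatex package is installed. The input is invented for illustration: the line of unbalanced braces stands in for syntax the parser cannot handle, and the magic comments are written with single spaces, as required above.

```r
library(parseLatex)

src <- paste(
  "Text before.",
  "% !parser off",
  "}}} unbalanced braces the parser should never see",
  "% !parser on",
  "Text after.",
  sep = "\n"
)

parsed <- parseLatex(src)

# Inspect the item types; the skipped region should be part of a comment
sapply(parsed, latexTag)
```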
This is a work in progress, so if you have a use for something like this and need help, post an “issue” on the GitHub page at https://github.com/dmurdoch/parseLatex.