Statistics::R::IO::Parser - Functions for parsing R data files
use Statistics::R::IO::ParserState;
use Statistics::R::IO::Parser;
my $state = Statistics::R::IO::ParserState->new(
data => 'file.rds'
);
say $state->at;
say $state->next->at;
You shouldn't create instances of this class; it exists mainly to
handle deserialization of R data files by the
"IO" classes.
This library is inspired by monadic parser frameworks from the
Haskell world, like Packrat <http://bford.info/packrat/> or Parsec
<http://hackage.haskell.org/package/parsec>. What this means is that
parsers are constructed by combining simpler parsers.
The library offers a selection of basic parsers and combinators.
Each of these is a function (think of it as a factory) that returns another
function (the actual parser) which receives the current parsing state
(Statistics::R::IO::ParserState) as the argument and returns a two-element
array reference (called for brevity "a pair" in the following
text) with the result of the parser in the first element and the new parser
state in the second element. If the parser fails, say if the current
state is "a" where a number is expected, it returns
"undef" to signal failure.
The descriptions of individual functions below use a shorthand
because the above mechanism is implied. Thus, when
"any_char" is described as "parses
any character", it really means that calling
"any_char" will return a function that
when called with the current state will return "a pair of the
character...", etc.
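The factory/parser/pair mechanism described above can be sketched with plain closures. This is a toy model, not the module's code: the hash-based state and the helper names "toy_state", "at_pos", "advance", "at_eof" and "toy_any_char" are all made up for illustration and only mimic the "at", "next" and "eof" operations mentioned in the text.

```perl
use strict;
use warnings;

# Hypothetical toy state: the input characters plus a position.
# Every operation returns a new state; nothing is modified in place.
sub toy_state { my ($chars, $pos) = @_; return { chars => $chars, pos => $pos } }
sub at_pos    { my $s = shift; return $s->{chars}[ $s->{pos} ] }
sub advance   { my $s = shift; return toy_state($s->{chars}, $s->{pos} + 1) }
sub at_eof    { my $s = shift; return $s->{pos} >= @{ $s->{chars} } }

# Factory: calling it returns the actual parser (a closure).
sub toy_any_char {
    return sub {
        my $state = shift;
        return undef if at_eof($state);              # failure
        return [ at_pos($state), advance($state) ];  # pair: result, new state
    };
}

my $parser = toy_any_char();
my $start  = toy_state([ split //, 'ab' ], 0);

my $first  = $parser->($start);        # [ 'a', state at position 1 ]
my $second = $parser->($first->[1]);   # [ 'b', state at position 2 ]
my $third  = $parser->($second->[1]);  # undef: the input is exhausted

printf "%s %s %s\n", $first->[0], $second->[0],
    defined $third ? 'ok' : 'failed';  # prints "a b failed"
```

Note that $start still points at position 0 after all three calls, which is what makes it safe to retry a different parser from the same state.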
CHARACTER PARSERS
- any_char
- Parses any character, returning a pair of the character at the current
State's position and the new state, advanced by one from the starting
state. If the state is at the end
("$state->eof" is true), returns
undef to signal failure.
- char $c
- Parses the given character $c, returning a pair of
the character at the current State's position if it is equal to
$c and the new state, advanced by one from the
starting state. If the state is at the end
("$state->eof" is true) or the
character at the current position is not $c,
returns undef to signal failure.
- string $s
- Parses the given string $s, returning a pair of
the sequence of characters starting at the current State's position if it
is equal to $s and the new state, advanced by
"length($s)" from the starting state. If
the state is at the end
("$state->eof" is true) or the string
starting at the current position is not $s,
returns undef to signal failure.
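The success and failure conditions of the character parsers can be illustrated with a small self-contained sketch. It assumes a toy state of the form [ $input, $position ] instead of a real ParserState object, and the names "toy_char" and "toy_string" are hypothetical stand-ins, not the module's functions.

```perl
use strict;
use warnings;

# Hypothetical toy state: [ $input_string, $position ].

# Accept exactly the character $c at the current position.
sub toy_char {
    my $c = shift;
    return sub {
        my ($input, $pos) = @{ shift() };
        return undef if $pos >= length($input);          # at end of input
        return undef if substr($input, $pos, 1) ne $c;   # wrong character
        return [ $c, [ $input, $pos + 1 ] ];
    };
}

# Accept exactly the string $s starting at the current position.
sub toy_string {
    my $s = shift;
    return sub {
        my ($input, $pos) = @{ shift() };
        return undef if $pos >= length($input);                  # at end
        return undef if substr($input, $pos, length($s)) ne $s;  # mismatch
        return [ $s, [ $input, $pos + length($s) ] ];
    };
}

my $state = [ 'foobar', 0 ];
my $ok    = toy_string('foo')->($state);   # [ 'foo', [ 'foobar', 3 ] ]
my $next  = toy_char('b')->($ok->[1]);     # [ 'b',   [ 'foobar', 4 ] ]
my $bad   = toy_char('x')->($state);       # undef: 'f' is not 'x'

print "matched '$ok->[0]' then '$next->[0]'\n";
print "mismatch signalled by undef\n" unless defined $bad;
```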
NUMBER PARSERS
- endianness
[$end]
- When the $end argument is given, this function
sets the byte order used by parsers in the module to be little- (when
$end is "<") or big-endian
($end is ">"). This function changes
the module's state and remains in effect until the next change.
When called with no arguments,
"endianness" returns the current byte
order in effect. The starting byte order is big-endian.
- any_uint8,
any_uint16, any_uint24, any_uint32
- Parses an 8-, 16-, 24-, or 32-bit unsigned integer, returning a
pair of the integer starting at the current State's position and the new
state, advanced by 1, 2, 3, or 4 bytes from the starting state, depending
on the parser. The integer value is determined by the current value of
"endianness". If there are not enough
elements left in the data from the current position, returns undef to
signal failure.
- uint8 $n, uint16 $n, uint24
$n, uint32 $n
- Parses the specified 8-, 16-, 24-, or 32-bit unsigned integer
$n, returning a pair of the integer at the current
State's position if it is equal to $n and the new
state. The new state is advanced by 1, 2, 3, or 4 bytes from the starting
state, depending on the parser. The integer value is determined by the
current value of "endianness". If there
are not enough elements left in the data from the current position or the
integer at the current position is not $n, returns undef to
signal failure.
- any_int8, any_int16,
any_int24, any_int32
- Parses an 8-, 16-, 24-, or 32-bit signed integer, returning a pair
of the integer starting at the current State's position and the new state,
advanced by 1, 2, 3, or 4 bytes from the starting state, depending on the
parser. The integer value is determined by the current value of
"endianness". If there are not enough
elements left in the data from the current position, returns undef to
signal failure.
- int8 $n, int16 $n, int24 $n,
int32 $n
- Parses the specified 8-, 16-, 24-, or 32-bit signed integer
$n, returning a pair of the integer at the current
State's position if it is equal to $n and the new
state. The new state is advanced by 1, 2, 3, or 4 bytes from the starting
state, depending on the parser. The integer value is determined by the
current value of "endianness". If there
are not enough elements left in the data from the current position or the
integer at the current position is not $n, returns undef to
signal failure.
- any_real32,
any_real64
- Parses a 32- or 64-bit real number, returning a pair of the number
starting at the current State's position and the new state, advanced by 4
or 8 bytes from the starting state, depending on the parser. The real
value is determined by the current value of
"endianness". If there are not enough
elements left in the data from the current position, returns undef to
signal failure.
- any_int32_na,
any_real64_na
- Parses a 32-bit signed integer or 64-bit real number, respectively,
but recognizing R-style missing values (NAs): INT_MIN for integers and a
special NaN bit pattern for reals. Returns a pair of the number value
("undef" if a NA) and the new state,
advanced by 4 or 8 bytes from the starting state, depending on the parser.
If there are not enough elements left in the data from the current
position, returns undef to signal failure.
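How byte order, signedness and the integer NA value interact can be shown with a short sketch. It is only an illustration under simplifying assumptions: the byte order is passed as an explicit argument rather than kept as module state, the "state" is reduced to a byte string plus a position, and "toy_uint", "toy_int" and "toy_int32_na" are hypothetical names.

```perl
use strict;
use warnings;

# Decode $width bytes at $pos as an unsigned integer.
# $order is '>' for big-endian or '<' for little-endian.
# Returns [ $value, $new_pos ], or undef if too few bytes remain.
sub toy_uint {
    my ($bytes, $pos, $width, $order) = @_;
    return undef if $pos + $width > length($bytes);
    my @octets = unpack('C*', substr($bytes, $pos, $width));
    @octets = reverse @octets if $order eq '<';  # most significant byte first
    my $value = 0;
    $value = $value * 256 + $_ for @octets;
    return [ $value, $pos + $width ];
}

# Signed variant: reinterpret the unsigned value as two's complement.
sub toy_int {
    my ($bytes, $pos, $width, $order) = @_;
    my $res = toy_uint($bytes, $pos, $width, $order) or return undef;
    my ($value, $new_pos) = @$res;
    $value -= 2 ** (8 * $width) if $value >= 2 ** (8 * $width - 1);
    return [ $value, $new_pos ];
}

# 32-bit signed integer where INT_MIN (-2**31) stands for NA and
# is reported as undef in the result slot of the pair.
sub toy_int32_na {
    my ($bytes, $pos, $order) = @_;
    my $res = toy_int($bytes, $pos, 4, $order) or return undef;
    $res->[0] = undef if $res->[0] == -2**31;
    return $res;
}

my $bytes = "\x00\x00\x01\x02";
my $big    = toy_uint($bytes, 0, 4, '>');                # value 258
my $little = toy_uint($bytes, 0, 4, '<');                # value 33619968
my $neg    = toy_int("\xff\xff\xfe", 0, 3, '>');         # value -2
my $na     = toy_int32_na("\x80\x00\x00\x00", 0, '>');   # value undef (NA)
my $short  = toy_uint("\x01", 0, 2, '>');                # undef: failure

print "big-endian: $big->[0], little-endian: $little->[0], int24: $neg->[0]\n";
```

The last two cases show why the distinction matters: an NA is a successful parse whose pair holds undef as the value, while a failed parse returns undef in place of the pair itself.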
SEQUENCING
- seq $p1, ...
- This combinator applies parsers $p1, ... in
sequence, using the returned parse state of $p1 as
the input parse state to $p2, etc. Returns a pair
of the concatenation of all the parsers' results and the parsing state
returned by the final parser. If any of the parsers returns undef,
"seq" will return it immediately without
attempting to apply any further parsers.
- many_till $p,
$end
- This combinator applies a parser $p until parser
$end succeeds. It does this by alternating
applications of $end and
$p; once $end succeeds,
the function returns the concatenation of results of preceding
applications of $p. (Thus, if
$end succeeds immediately, the 'result' is an
empty list.) Otherwise, $p is applied and must
succeed, and the procedure repeats. Returns a pair of the concatenation of
all the $p's results and the parsing state
returned by the final parser. If any application of
$p returns undef,
"many_till" will return it
immediately.
- count $n, $p
- This combinator applies the parser $p exactly
$n times in sequence, threading the parse state
through each call. Returns a pair of the concatenation of all the parsers'
results and the parsing state returned by the final application. If any
application of $p returns undef,
"count" will return it immediately
without attempting any more applications.
- with_count [$num_p
= any_uint32], $p
- This combinator first applies parser $num_p to get
the number of times that $p should be applied in
sequence. If only one argument is given,
"any_uint32" is used as the default
value of $num_p. (So
"with_count" works by getting a number
$n by applying
$num_p and then calling
"count $n, $p".) Returns a pair of the
concatenation of all the parsers' results and the parsing state returned
by the final application. If the initial application of
$num_p or any application of
$p returns undef,
"with_count" will return it immediately
without attempting any more applications.
- choose $p1, ...
- This combinator applies parsers $p1, ... in
sequence until one of them succeeds, at which point it immediately returns
that parser's result. If all of the parsers fail,
"choose" fails and returns undef.
COMBINATORS
- bind $p1, $f
- This combinator applies parser $p1 and, if it
succeeds, calls function $f using the first
element of $p1's result as the argument. The call
to $f needs to return a parser, which
"bind" applies to the parsing state
after $p1's application.
The "bind" combinator is an
essential building block for most combinators described so far. For
instance, "with_count" can be written
as:
bind($num_p,
     sub {
         my $n = shift;
         count $n, $p;
     })
- mreturn $value
- Returns a parser that when applied returns $value
without changing the parsing state.
- error $message
- Returns a parser that when applied croaks with the
$message and the current parsing state.
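To make the relationship between "bind", "mreturn" and the sequencing functions concrete, here is a self-contained toy model. It is a sketch, not the module's implementation: the state is assumed to be [ \@items, $position ], every "toy_" name is hypothetical, and expressing "seq" through "bind" is one possible formulation chosen for illustration. Collecting results in an array reference is an assumption about what "concatenation" means, and the sketch of "many_till" leaves the state at the point where $end matched, which the text does not specify.

```perl
use strict;
use warnings;

# Base parser: any single item.
sub toy_any {
    return sub {
        my ($items, $pos) = @{ shift() };
        return undef if $pos >= @$items;
        return [ $items->[$pos], [ $items, $pos + 1 ] ];
    };
}

# mreturn: succeed with $value, leaving the state untouched.
sub toy_mreturn {
    my $value = shift;
    return sub { my $state = shift; return [ $value, $state ] };
}

# bind: run $p, pass its value to $f, run the parser $f returns.
sub toy_bind {
    my ($p, $f) = @_;
    return sub {
        my $res = $p->(shift) or return undef;
        my ($value, $state) = @$res;
        return $f->($value)->($state);
    };
}

# A specific item: read any item, then accept or reject it.
sub toy_item {
    my $want = shift;
    return toy_bind(toy_any(), sub {
        my $got = shift;
        return $got eq $want ? toy_mreturn($got) : sub { return undef };
    });
}

# seq: thread the state through each parser, collecting the values.
sub toy_seq {
    my ($first, @rest) = @_;
    return toy_mreturn([]) unless $first;
    return toy_bind($first, sub {
        my $head = shift;
        return toy_bind(toy_seq(@rest), sub {
            my $tail = shift;
            return toy_mreturn([ $head, @$tail ]);
        });
    });
}

sub toy_count { my ($n, $p) = @_; return toy_seq(($p) x $n) }

sub toy_with_count {
    my ($num_p, $p) = @_;
    return toy_bind($num_p, sub { my $n = shift; toy_count($n, $p) });
}

# choose: the first parser that succeeds on the same starting state wins.
sub toy_choose {
    my @parsers = @_;
    return sub {
        my $state = shift;
        for my $p (@parsers) {
            my $res = $p->($state);
            return $res if $res;
        }
        return undef;
    };
}

# many_till: apply $p repeatedly until $end succeeds.
sub toy_many_till {
    my ($p, $end) = @_;
    return sub {
        my $state = shift;
        my @values;
        until ($end->($state)) {
            my $res = $p->($state) or return undef;
            push @values, $res->[0];
            $state = $res->[1];
        }
        return [ \@values, $state ];
    };
}

my $start   = [ [ 2, 10, 20, 0, 99 ], 0 ];
my $counted = toy_with_count(toy_any(), toy_any())->($start);    # [10, 20]
my $listed  = toy_many_till(toy_any(), toy_item(0))->($start);   # [2, 10, 20]
my $picked  = toy_choose(toy_item(7), toy_item(2))->($start);    # 2
my $short   = toy_count(9, toy_any())->($start);                 # undef

print "with_count: @{ $counted->[0] }; many_till: @{ $listed->[0] }\n";
```

In this model a failure anywhere inside "toy_bind" short-circuits the rest of the chain, which is why "toy_count(9, ...)" on five items yields undef rather than a partial result.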
SINGLETONS
These functions are an interface to ParserState's singleton-related
functions, "add_singleton" in ParserState and
"get_singleton" in ParserState. They exist because certain types of
objects in R data files, for instance environments, have to exist as unique
instances, and any subsequent objects that include them refer to them by a
"reference id".
- add_singleton
$singleton
- Adds the $singleton to the current parsing state.
Returns a pair of $singleton and the new parsing
state.
- get_singleton
$ref_id
- Retrieves from the current parse state the singleton identified by
$ref_id, returning a pair of the singleton and the
(unchanged) state.
- reserve_singleton
$p
- Preallocates a space for a singleton before running a given parser, and
then assigns the parser's value to the singleton. Returns a pair of the
singleton and the new parse state.
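The singleton functions can be pictured with a toy state that carries a list of singletons alongside the position. Everything in this sketch is an assumption made for illustration: the hash layout, the "toy_" names, and in particular the use of a zero-based list index as the reference id, which the text does not specify.

```perl
use strict;
use warnings;

# Hypothetical toy state: { pos => $n, singletons => [ ... ] }.
# A new state is built on every change; the old one is left alone.

sub toy_add_singleton {
    my $singleton = shift;
    return sub {
        my $state = shift;
        my $new = { %$state,
                    singletons => [ @{ $state->{singletons} }, $singleton ] };
        return [ $singleton, $new ];
    };
}

sub toy_get_singleton {
    my $ref_id = shift;
    return sub {
        my $state = shift;
        return [ $state->{singletons}[$ref_id], $state ];  # state unchanged
    };
}

# Reserve a slot first, run $p, then store $p's value in that slot.
sub toy_reserve_singleton {
    my $p = shift;
    return sub {
        my $state = shift;
        my $slot  = scalar @{ $state->{singletons} };
        my $reserved = { %$state,
                         singletons => [ @{ $state->{singletons} }, undef ] };
        my $res = $p->($reserved) or return undef;
        my ($value, $after) = @$res;
        my @singletons = @{ $after->{singletons} };
        $singletons[$slot] = $value;
        return [ $value, { %$after, singletons => \@singletons } ];
    };
}

# An "outer" object whose parser registers an "inner" singleton on the way.
my $inner = toy_add_singleton('inner env');
my $outer = sub {
    my $res = $inner->(shift);
    return [ 'outer env', $res->[1] ];
};

my $start  = { pos => 0, singletons => [] };
my $result = toy_reserve_singleton($outer)->($start);
my $lookup = toy_get_singleton(1)->($result->[1]);

print join(', ', @{ $result->[1]{singletons} }), "\n";  # outer env, inner env
```

In this toy model, reserving the slot up front gives the outer object the lower index even though its value only becomes available after the inner singleton has been added.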
Instances of this class are intended to be immutable. Please do
not try to change their value or attributes.
There are no known bugs in this module. Please see
Statistics::R::IO for bug reporting.
See Statistics::R::IO for support and contact information.
Davor Cubranic <cubranic@stat.ubc.ca>
This software is Copyright (c) 2017 by University of British
Columbia.
This is free software, licensed under:
The GNU General Public License, Version 3, June 2007