goto-analyzer - Data-flow analysis for C programs and GOTO binaries
goto-analyzer [-?|-h|--help]
goto-analyzer --version
goto-analyzer [options] file.c|file.gb
goto-analyzer is an abstract interpreter which uses the
same front-end and GOTO binary representation as cbmc(1).
The key difference is that cbmc(1) under-approximates the
behavior of the program (execution traces that are too long or require too
many loop unwindings are not considered) while goto-analyzer
over-approximates the behavior of the program. cbmc(1) can determine whether
a property is A. true for a bounded number of iterations or B. false, in
which case it gives an error trace. In contrast, goto-analyzer can determine
whether a property is A. true for all iterations or B. possibly false. In this sense,
each tool has its own strengths and weaknesses.
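As an illustration of the difference, consider the following hypothetical
example.c, which loops far more times than would be practical to unwind:

    #include <assert.h>

    int main(void)
    {
      unsigned int x = 0;
      while (x < 1000000u)    /* far too many iterations to unwind */
        ++x;
      assert(x >= 1000000u);  /* holds on every execution */
      return 0;
    }

A bounded model checker would need a million unwindings to check the
assertion, whereas an abstract interpreter with a sufficiently precise
domain (for example, intervals) may prove it for all iterations in a single
fixed-point computation.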
To use goto-analyzer you need to give options for:
- What task to perform after the abstract interpreter has run.
- How to format the output.
- Which abstract interpreter is used.
- Which domain is used to describe the state of the program at a
point during execution.
- How the history of the control flow of the program determines the
number of points of execution.
- The storage that links points of execution and domains.
goto-analyzer first runs the abstract interpreter until it
reaches a fixed-point, then it performs the task the user has chosen.
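For example, relying on the defaults where a category is not mentioned, a
minimal run of the verify task with the interval domain on a hypothetical
example.c would be:

    goto-analyzer --verify --intervals example.c

An invocation that chooses an option from every category explicitly might
look like:

    goto-analyzer --verify --recursive-interprocedural --ahistorical \
      --one-domain-per-location --vsd --text results.txt example.c

Each of these options is described in the sections below.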
- --show
- Displays a domain for every instruction in the GOTO binary. The format and
information will depend on the domain that has been selected. If
there are multiple domains corresponding to the same location (see
history below) these will be merged before they are displayed.
- --show-on-source
- The source code of the program is displayed line-by-line with the abstract
domains corresponding to each location displayed between them. As the
analysis is done at the level of instructions in the GOTO binary, some
domains may not be displayed. Also if parts of the GOTO binary have been
generated or manipulated by other tools, these may not be displayed as
there is no corresponding source. --show-on-source is the more
user-friendly output, but --show gives a better picture of exactly
what is computed.
- --verify
- Every property in the program is checked to determine whether it is true
(it always holds), unreachable, false if reachable (because the analysis is
over-approximate, it cannot always tell whether a location is genuinely
reachable, so this is the strongest verdict it can give), or unknown. If
there are multiple points of execution that reach the same location, each
will be checked and the answers combined, with unknown taking precedence.
- --simplify
file_name
- Writes a new version of the input program to file_name in which the
program has been simplified using information from the abstract
interpreter. The exact simplification will depend on the domain that is
used but typically this might be replacing any expression that has a
constant value. If this makes instructions unreachable (for example, if a
GOTO can be shown never to be taken) they will be removed. Removal
can be deactivated by passing --no-simplify-slicing. Ideally,
simplification would be idempotent (i.e. running it a second time would not
simplify anything further), but there are edge cases which are difficult or
prohibitively expensive to handle in the domain, and these can result in a
second (or subsequent) run simplifying further. Submitting bug reports for
these is helpful, but they may not be viable to fix.
- --no-simplify-slicing
- Do not remove instructions from which no property can be reached (use with
--simplify).
- --unreachable-instructions
- Lists which instructions have a domain which is bottom (i.e. unreachable).
If --function has been used to set the program entry point then
this can flag things like the main function as unreachable.
- --unreachable-functions
- Similar to --unreachable-instructions, but reports which functions
are definitely unreachable rather than just instructions.
- --reachable-functions
- The negation of --unreachable-functions, reports which functions
may be reachable. Note that because the analysis is over-approximate, it
is possible this will mark functions as reachable when a more precise
analysis (possibly using cbmc(1)) would show that there are no
execution traces that reach them.
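As a sketch of how the tasks are selected, the following hypothetical
invocations run three different tasks over the same input (file names are
placeholders):

    goto-analyzer --verify example.c
    goto-analyzer --simplify simplified.gb example.c
    goto-analyzer --unreachable-functions example.c

Each invocation above performs one task; the sections below control how the
underlying analysis is carried out.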
These options control which abstract interpreter is used and how
the analysis is performed. In principle this can significantly change the
accuracy and performance of goto-analyzer, but the current options are
reasonably similar.
If --verbosity is set above 8, the abstract
interpreter will log what it is doing. This is intended to aid developers in
understanding how the algorithms work, where time is being spent,
etc., but it can be quite instructive in general.
- --legacy-ait
- This is the default option. Abstract interpretation is performed eagerly
from the start of the program until fixed-point is reached. Functions are
analyzed as needed and in the order in which they are reached. This option
also fixes the history and storage options to their
defaults.
- --recursive-interprocedural
- This extends --legacy-ait by allowing the history and
storage to be configured. As the name implies, function calls are
handled by recursion within the interpreter. This is a good all-round
choice and will likely become the default at some point in the
future.
- --three-way-merge
- This extends --recursive-interprocedural by performing a
"modification aware" merge after function calls. At the time of
writing only --vsd supports the necessary differential reasoning.
If you are using --vsd this is recommended, as it is more accurate
with little extra cost (see the example after this list).
- --legacy-concurrent
- This extends --legacy-ait with very restricted and special-purpose
handling of threads. This needs the domain to have certain unusual
properties for it to give a correct answer. At the time of writing only
--dependence-graph is compatible with it.
- --location-sensitive
- Use the location-sensitive abstract interpreter.
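Following the recommendation above, a verification run that pairs the
three-way-merge interpreter with the VSD domain could look like this
(example.c is a placeholder):

    goto-analyzer --verify --three-way-merge --vsd example.c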
To over-approximate what a program does, it is necessary to
consider all of the paths of execution through the program. As there is a
potentially infinite set of traces (and they can be potentially infinitely
long), it is necessary to merge some of them. The common approach (the
"collecting abstraction") is to merge all paths that reach the
same instruction. The abstract interpretation is then done between
instructions without thinking about execution paths. This ensures
termination but means that it is not possible to distinguish different call
sites, loop iterations or paths through a program.
Note that --legacy-ait, the default abstract interpreter,
fixes the history to --ahistorical, so you will need to choose another
abstract interpreter to make use of these options.
- --ahistorical
- This is the default and the coarsest abstraction. No history information
is kept, so all traces that reach an instruction are merged. This is the
collecting abstraction that is used in most abstract interpreters.
- --call-stack
n
- This is an inter-procedural abstraction; it tracks the call stack and only
merges traces that reach the same location and have the same call stack.
The effect of this is equivalent to inlining all functions and then using
--ahistorical. In larger programs this can be very expensive in
terms of both time and memory but can give much more accurate results.
Recursive functions create a challenge as the call stack will be different
each time. To prevent non-termination, the parameter n limits how
many times a loop of recursive functions can be called. Once n is
reached, all subsequent calls are merged. Setting this to 0 will
disable the limit.
- --loop-unwind
n
- This tracks the backwards jumps that are taken in the current function.
Traces that reach the same location are merged if their history of
backwards jumps is the same. At most n traces are kept for each
location, after that they are merged regardless of whether their histories
match. This gives a similar effect to unrolling the loops n times
and then using --ahistorical. In the case of nested loops, the
behavior can be a little different from unrolling as the limit is the number
of times a location is reached, so a loop with x iterations
containing a loop with y iterations will require n =
x*y. The time and memory taken by this option will rise (at worst)
linearly in terms of n. If n is 0 then there is no
limit. Be warned that if there are loops that can execute an unbounded
number of iterations or if the domain is not sufficiently precise to
identify the termination conditions then the analysis will not
terminate.
- --branching
n
- This works in a similar way to --loop-unwind but tracking forwards
jumps (if, switch, goto, etc.) rather than backwards
ones. This gives per-path analysis while limiting the number of times each
location is visited. There is not a direct form of program transformation
that matches this but it is similar to the per-path analysis that symbolic
execution does. The scalability and the risk of non-termination if
n is 0 remain the same. Note that the goto-programs
generated by various language front-ends have a conditional forwards jump
at the start, which exits the loop if the condition fails, and an
unconditional backwards jump at the end. This means that --branching
can wind up distinguishing loop iterations as "has not exited for the last
3 iterations" rather than "has jumped back to the top 3 times".
- --loop-unwind-and-branching
n
- Again, this is similar to --loop-unwind but tracks both forwards
and backwards jumps. This is only a very small amount more expensive than
--branching and is probably the best option for detailed analysis
of each function.
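As an example of choosing a history, the following hypothetical run uses
the recursive-interprocedural interpreter (since --legacy-ait fixes the
history) and tracks both forwards and backwards jumps up to a limit of 3:

    goto-analyzer --show --recursive-interprocedural \
      --loop-unwind-and-branching 3 --vsd example.c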
These control how the possible states at a given execution point
are represented and manipulated.
- --dependence-graph
- Tracks data flow and information flow dependencies between instructions
and produces a graph. This includes doing points-to analysis and tracking
reaching definitions (i.e. use-def chains). This is one of the most
extensive, correct and feature-complete domains.
- --vsd,
--variable-sensitivity
- This is the Variable Sensitivity Domain (VSD). It is a non-relational
domain that stores an abstract object for each live variable. Which kind
of abstract objects are used depends on the type of the variable and the
run-time configuration. This means that the sensitivity of the domain can
be chosen: for example, whether to track every element of an array
independently, just a few of them, or to ignore arrays altogether. A set of
options to configure VSD is given below. This domain is extensive and does
not have any known architectural limits on correctness. As such it is a
good choice for many kinds of analysis.
- --dependence-graph-vs
- This is a variant of the dependence graph domain that uses VSD to do the
foundational pointer and reaching definitions analysis. This means it can
be configured using the VSD options and give more precise analysis (for
example, field-aware) of the dependencies.
- --constants
- The default option, this stores one constant value per variable. This
means it is fast but will only find things that can be resolved by
constant propagation. The domain has some handling of arrays but limited
support for pointers, which means that it can potentially give unsound
behavior. --vsd --vsd-values constants is probably a
better choice for this kind of analysis.
- --intervals
- A domain that stores an interval for each integer and float variable. At
the time of writing not all operations are supported so the results can be
quite over-approximate at points. It also has limitations in the handling
of pointers, so it can give unsound results. --vsd --vsd-values
intervals is probably a better choice for this kind of
analysis.
- --non-null
- This domain is intended to find which pointers are not null. Its
implementation is very limited and it is not recommended.
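To see what the value abstractions can and cannot represent, consider this
hypothetical fragment (the variable flag is a placeholder):

    int example(int flag)
    {
      int x = 5;            /* a constants abstraction knows x is exactly 5 */
      int y = x + 2;        /* constant propagation resolves y to 7 */
      int z = flag ? 1 : 4; /* no single constant: constants gives top,
                             * intervals gives [1, 4], and set-of-constants
                             * (under --vsd-values) can keep {1, 4} */
      return x + y + z;
    }

The constants, intervals and set-of-constants abstractions are described
under --vsd-values below.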
VSD has a wide range of options that allow you to choose what kind
of abstract objects (and thus abstractions) are used to represent variables
of each type.
- --vsd-values
[constants|intervals|set-of-constants]
- This controls the abstraction used for values, both int and
float. The default option is constants which tracks if the
variable has a constant value. This is fast but not very precise so it may
well be unable to prove very much. intervals uses an interval that
contains all of the possible values the variable can take. It is more
precise than constants in all cases but a bit slower. It is good
for numerical code. set-of-constants uses a set of up to 10
(currently fixed) constants. This is more general than using a single
constant but can make analysis up to 10 times (or in rare cases 100 times)
slower. It is good for control code with flags and modes.
- --vsd-structs
[top-bottom|every-field]
- This controls how structures are handled. The default is top-bottom
which uses an abstract object with just two states (top and bottom). In
effect writes to structures are ignored and reads from them will always
return top (any value). The other alternative is every-field which
stores an abstract object for each field. Depending on how many structures
are live at any one time and how many fields they have, this may noticeably
increase the amount of memory used by goto-analyzer. In exchange,
the analysis will be field-sensitive.
- --vsd-arrays
[top-bottom|smash|up-to-n-elements|every-element]
- This controls how arrays are handled. As with structures, the default is
top-bottom which effectively ignores writes to the array and
returns top on a read. More precise than this is smash which stores
one abstract element for all of the values. This is relatively cheap but a
lot more precise, particularly if used with intervals or
set-of-constants. up-to-n-elements generalizes smash
by storing abstract objects for the first n elements of each array
(n defaults to 10 and can be controlled by
--vsd-array-max-elements) and then condensing all other elements
down to a single abstract object. This allows reasonably fine-grained
control over the amount of memory used and can give much more precise
results for small arrays. every-element is the most precise, but
most expensive option where an abstract element is stored for every entry
in the array.
- --vsd-array-max-elements
- Configures the value of n in --vsd-arrays
up-to-n-elements and defaults to 10 if not given.
- --vsd-pointers
[top-bottom|constants|value-set]
- This controls the handling of pointers. The default, top-bottom,
effectively ignores pointers. This is fine if they are only read (all reads
return top), but if a pointer is written through then we know that some
variable has changed but not which one, so the whole domain has to be set
to top. constants is somewhat misleadingly named
as it uses an abstract object that tracks a pointer to a single variable.
This includes the offset within the variable: a stack of field names for
structs and abstract objects for offsets in arrays. Offsets are tracked
even if the abstract object for the variable itself does not distinguish
different fields or indexes. value-set is the most precise option;
it stores a set of pointers to single variables as described above.
- --vsd-unions
top-bottom
- At the time of writing there is only one option for unions which is
top-bottom, discarding writes and returning top for all reads from
a variable of union type.
- --vsd-data-dependencies
- Wraps each abstract object with a set of locations where the variable was
last modified. The set is reset when the variable is written and takes the
union of the two sides' sets on merge. This was originally intended for
--dependence-graph-vs but has proved useful for --vsd as
well. This is not strictly necessary for --three-way-merge as the
mechanism it uses to work out which variables have changed is independent
of this option.
- --vsd-liveness
- Wraps each abstract object with the location of the last assignment or
merge. This is more basic and limited than --vsd-data-dependencies
and is intended to track SSA-like regions of variable liveness.
- --vsd-flow-insensitive
- This does not alter the abstract objects used or their configuration. It
disables the reduction of the domain when a branch is taken or an
assumption is reached. This normally gives a small saving in time but at
the cost of a large amount of precision. This is why the default is to do
the reduction. It can be useful for debugging issues with the
reduction.
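Putting the VSD options together, a detailed (and correspondingly more
expensive) configuration for a hypothetical example.c might be:

    goto-analyzer --verify --three-way-merge --vsd \
      --vsd-values intervals --vsd-structs every-field \
      --vsd-arrays up-to-n-elements --vsd-array-max-elements 4 \
      --vsd-pointers value-set example.c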
The histories described above are used to keep track of which parts of
the computation still need to be explored. The most precise option is to keep one
domain for every history but in some cases, to save memory and time, it may
be desirable to share domains between histories. The storage options allow
this kind of sharing.
- --one-domain-per-location
- This is the default option. All histories that reach the same location
will use the same domain, so the results with other histories will be
similar to those with --ahistorical. One difference
is how and when widening occurs: --one-domain-per-location
--loop-unwind n will wait until n iterations of a
loop have been completed and then start to widen.
- --one-domain-per-history
- This is the best option to use if you are using a history other than
--ahistorical. It stores one domain per history which can result in
a significant increase in the amount of memory used.
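For instance, a per-path analysis that keeps a separate domain for each
history, at the cost of extra memory, could be invoked as follows (file
name hypothetical):

    goto-analyzer --verify --recursive-interprocedural \
      --call-stack 2 --one-domain-per-history --vsd example.c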
These options control how the result of the task is output. The
default is text to the standard output. In the case of tasks that produce
goto-programs (--simplify for example), the output options only
affect the logging and not the final form of the program.
- --text
file_name
- Output results in plain text to file_name.
- --json
file_name
- Writes the output as a JSON object to file_name.
- --xml
file_name
- Output results in XML format to file_name.
- --dot
file_name
- Writes the output in dot(1) format to file_name. This is
only supported by some domains and tasks (for example --show
--dependence-graph).
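For example, a dependence graph can be rendered to a viewable file by
combining --dot with the dot(1) tool from Graphviz (file names are
placeholders):

    goto-analyzer --show --dependence-graph --dot deps.dot example.c
    dot -Tpdf deps.dot -o deps.pdf

The following options configure the architecture and operating system
assumed by the analysis.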
- --arch
arch
- Set analysis architecture, which defaults to the host architecture. Use
one of: alpha, arm, arm64, armel,
armhf, hppa, i386, ia64, mips,
mips64, mips64el, mipsel, mipsn32,
mipsn32el, powerpc, ppc64, ppc64le,
riscv64, s390, s390x, sh4, sparc,
sparc64, v850, x32, x86_64, or
none.
- --os os
- Set analysis operating system, which defaults to the host operating
system. Use one of: freebsd, linux, macos,
solaris, or windows.
- --i386-linux,
--i386-win32, --i386-macos, --ppc-macos,
--win32, --winx64
- Set analysis architecture and operating system.
- --LP64, --ILP64,
--LLP64, --ILP32, --LP32
- Set width of int, long and pointers, but don't override default
architecture and operating system.
- --16, --32, --64
- Equivalent to --LP32, --ILP32, --LP64 (on Windows:
--LLP64).
- --little-endian
- Allow little-endian word-byte conversions.
- --big-endian
- Allow big-endian word-byte conversions.
- --gcc
- Use GCC as the preprocessor.
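For example, to analyze a hypothetical example.c as if it were built for
32-bit Linux using the GCC preprocessor:

    goto-analyzer --verify --i386-linux --gcc example.c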
All tools honor the TMPDIR environment variable when generating
temporary files and directories.
If you encounter a problem, please create an issue at
https://github.com/diffblue/cbmc/issues
2017-2018, Daniel Kroening, Diffblue