linkchecker - command line client to check HTML documents and
websites for broken links
linkchecker [options] [file-or-url]...
LinkChecker features
- recursive and multithreaded checking
- output in colored or normal text, HTML, SQL, CSV, XML or a sitemap graph
in different formats
- support for HTTP/1.1, HTTPS, FTP, mailto: and local file links
- restriction of link checking with URL filters
- proxy support
- username/password authorization for HTTP and FTP
- support for robots.txt exclusion protocol
- support for Cookies
- support for HTML5
- Antivirus check
- a command line and web interface
The most common use checks the given domain recursively:
$ linkchecker http://www.example.com/
Beware that this checks the whole site, which can have thousands of
URLs. Use the -r option to restrict the recursion depth.
Don't check URLs with /secret in their name. All other links
are checked as usual:
$ linkchecker --ignore-url=/secret mysite.example.com
Checking a local HTML file on Unix:
$ linkchecker ../bla.html
Checking a local HTML file on Windows:
C:\> linkchecker c:\temp\test.html
You can skip the http:// URL part if the domain starts with
www.:
$ linkchecker www.example.com
You can skip the ftp:// URL part if the domain starts with
ftp.:
$ linkchecker -r0 ftp.example.com
Generate a sitemap graph and convert it with the graphviz dot
utility:
$ linkchecker -odot -v www.example.com | dot -Tps > sitemap.ps
- -h, --help
- Help me! Print usage information for this program.
- -t NUMBER,
--threads=NUMBER
- Generate no more than the given number of threads. Default number of
threads is 10. To disable threading specify a non-positive number.
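For example, to check with at most five threads:
$ linkchecker -t5 http://www.example.com/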
- -F TYPE[/ENCODING][/FILENAME],
--file-output=TYPE[/ENCODING][/FILENAME]
- Output to a file linkchecker-out.TYPE, $XDG_DATA_HOME/linkchecker/failures
for the failures output type, or FILENAME if specified. The ENCODING
specifies the output encoding; the default is that of your locale. Valid
encodings are listed at
https://docs.python.org/library/codecs.html#standard-encodings. The
FILENAME and ENCODING parts of the none output type will be ignored;
otherwise, if the file already exists, it will be overwritten. You can
specify this
option more than once. Valid file output TYPEs are text, html, sql, csv,
gml, dot, xml, sitemap, none or failures. Default is no file output. The
various output types are documented below. Note that you can suppress all
console output with the option -o none.
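For example, to write CSV results to a chosen file while suppressing console
output (the filename is illustrative):
$ linkchecker --file-output=csv/utf-8/results.csv -o none http://www.example.com/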
- -o TYPE[/ENCODING],
--output=TYPE[/ENCODING]
- Specify the console output type as text, html, sql, csv, gml, dot, xml,
sitemap, none or failures. Default type is text. The various output types
are documented below. The ENCODING specifies the output encoding; the
default is that of your locale. Valid encodings are listed at
https://docs.python.org/library/codecs.html#standard-encodings.
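For example, to print CSV results on the console and redirect them into a
file (the filename is illustrative):
$ linkchecker -ocsv http://www.example.com/ > results.csv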
- -v, --verbose
- Log all checked URLs, overriding --no-warnings. Default is to log
only errors and warnings.
- -D STRING,
--debug=STRING
- Print debugging output for the given logger. Available debug loggers are
cmdline, checking, cache, plugin and all. all is an alias for all
available loggers. This option can be given multiple times to debug with
more than one logger.
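For example, to debug with two loggers at once:
$ linkchecker -Dcmdline -Dchecking http://www.example.com/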
- -q, --quiet
- Quiet operation, an alias for -o none that also hides
application information messages. This is only useful with -F, otherwise
no results will be output.
- --cookiefile=FILENAME
- Use initial cookie data read from a file. The cookie data format is
explained below.
- --ignore-url=REGEX
- URLs matching the given regular expression will only be syntax checked.
This option can be given multiple times. See section REGULAR
EXPRESSIONS for more info.
- --no-follow-url=REGEX
- Check but do not recurse into URLs matching the given regular expression.
This option can be given multiple times. See section REGULAR
EXPRESSIONS for more info.
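For example, to check but not recurse into a printer-friendly subtree (the
pattern is illustrative):
$ linkchecker --no-follow-url=/print/ http://www.example.com/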
- --no-robots
- Check URLs regardless of any robots.txt files.
- -p, --password
- Read a password from console and use it for HTTP and FTP authorization.
For FTP the default password is anonymous@. For HTTP there is no default
password. See also -u.
- --timeout=NUMBER
- Set the timeout for connection attempts in seconds. The default timeout is
60 seconds.
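For example, to abort connection attempts after 20 seconds:
$ linkchecker --timeout=20 http://www.example.com/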
- -u STRING,
--user=STRING
- Try the given username for HTTP and FTP authorization. For FTP the default
username is anonymous. For HTTP there is no default username. See also
-p.
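For example, to check a protected site with a given username and a password
read from the console (the username and URL are illustrative):
$ linkchecker -u myuser -p https://intranet.example.com/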
- --user-agent=STRING
- Specify the User-Agent string to send to the HTTP server, for example
"Mozilla/4.0". The default is "LinkChecker/X.Y" where
X.Y is the current version of LinkChecker.
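For example, to send a custom User-Agent string (the value is illustrative):
$ linkchecker --user-agent="Mozilla/5.0 (compatible; SiteCheck)" http://www.example.com/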
- --stdin
- Read from stdin a list of whitespace-separated URLs to check.
- FILE-OR-URL
- The location to start checking with. A file can be a simple list of URLs,
one per line, if the first line is "# LinkChecker URL
list".
Configuration files can specify all options above. They can also
specify some options that cannot be set on the command line. See
linkcheckerrc(5) for more info.
Note that by default only errors and warnings are logged. You
should use the option --verbose to get the complete URL list,
especially when outputting a sitemap graph format.
- text
- Standard text logger, logging URLs in keyword: argument fashion.
- html
- Log URLs in keyword: argument fashion, formatted as HTML. Additionally has
links to the referenced pages. Invalid URLs have HTML and CSS syntax check
links appended.
- csv
- Log check result in CSV format with one URL per line.
- gml
- Log parent-child relations between linked URLs as a GML sitemap
graph.
- dot
- Log parent-child relations between linked URLs as a DOT sitemap
graph.
- gxml
- Log check result as a GraphXML sitemap graph.
- xml
- Log check result as machine-readable XML.
- sitemap
- Log check result as an XML sitemap whose protocol is documented at
https://www.sitemaps.org/protocol.html.
- sql
- Log check result as SQL script with INSERT commands. An example script to
create the initial SQL table is included as create.sql.
- failures
- Suitable for cron jobs; see the example after this list. Logs the check
result into a file $XDG_DATA_HOME/linkchecker/failures which only contains
entries with invalid URLs and the number of times they have failed.
- none
- Logs nothing. Suitable for debugging or checking the exit code.
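For the failures output type, a quiet check suitable for a cron job could
look like this (the URL is illustrative):
$ linkchecker -q -F failures https://www.example.com/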
LinkChecker accepts Python regular expressions. See
https://docs.python.org/howto/regex.html for an introduction. An
addition is that a leading exclamation mark negates the regular
expression.
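For example, a negated pattern restricts full checking to your own site by
only syntax-checking all other URLs (the pattern is illustrative):
$ linkchecker --ignore-url='!^https?://www\.example\.com/' http://www.example.com/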
A cookie file contains standard HTTP header (RFC 2616) data with
the following possible names:
- Host (required)
- Sets the domain for the cookies.
- Path (optional)
- Gives the path the cookies are valid for; the default path is /.
- Set-cookie (required)
- Sets the cookie name/value. Can be given more than once.
Multiple entries are separated by a blank line. The example below
will send two cookies to all URLs starting with
http://example.com/hello/ and one to all URLs starting with
https://example.org/:
Host: example.com
Path: /hello
Set-cookie: ID="smee"
Set-cookie: spam="egg"
Host: example.org
Set-cookie: baggage="elitist"; comment="hologram"
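Assuming the two entries above are saved in a file named cookies.txt (the
name is illustrative), they can be used like this:
$ linkchecker --cookiefile=cookies.txt http://example.com/hello/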
To use a proxy on Unix or Windows set the http_proxy or
https_proxy environment variables to the proxy URL. The URL should be
of the form
http://[user:pass@]host[:port].
LinkChecker also detects manual proxy settings of Internet Explorer under
Windows systems. On a Mac use the Internet Config to select a proxy. You can
also set a comma-separated domain list in the no_proxy environment
variable to ignore any proxy settings for these domains. The
curl_ca_bundle environment variable can be used to identify an
alternative certificate bundle to be used with an HTTPS proxy.
Setting an HTTP proxy on Unix, for example, looks like this:
$ export http_proxy="http://proxy.example.com:8080"
Proxy authentication is also supported:
$ export http_proxy="http://user1:mypass@proxy.example.org:8081"
Setting a proxy on the Windows command prompt:
C:\> set http_proxy=http://proxy.example.com:8080
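To bypass the proxy for selected domains, or to point an HTTPS proxy at an
alternative certificate bundle, set the corresponding environment variables
(values are illustrative):
$ export no_proxy="internal.example.com,localhost"
$ export curl_ca_bundle="/etc/ssl/certs/internal-ca.pem"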
All URLs have to pass a preliminary syntax test. Minor quoting
mistakes will issue a warning, all other invalid syntax issues are errors.
After the syntax check passes, the URL is queued for connection checking.
All connection check types are described below.
- HTTP links (http:,
https:)
- After connecting to the given HTTP server the given path or query is
requested. All redirections are followed, and if user/password is given it
will be used as authorization when necessary. All final HTTP status codes
other than 2xx are errors.
HTML page contents are checked for recursion.
- Local files
(file:)
- A regular, readable file that can be opened is valid. A readable directory
is also valid. All other files, for example device files or unreadable and
non-existing files, are errors.
HTML or other parseable file contents are checked for
recursion.
- Mail links
(mailto:)
- A mailto: link eventually resolves to a list of email addresses. If one
address fails, the whole list will fail. For each mail address we check
the following things:
- 1.
- Check the address syntax, both the parts before and after the @ sign.
- 2.
- Look up the MX DNS records. If no MX record is found, print an error.
- 3.
- Check if one of the mail hosts accepts an SMTP connection. Check hosts with
higher priority first. If no host accepts SMTP, we print a warning.
- 4.
- Try to verify the address with the VRFY command. If we get an answer,
print the verified address as an info.
- FTP links (ftp:)
- For FTP links we do:
- 1.
- connect to the specified host
- 2.
- try to log in with the given user and password. The default user is
anonymous, the default password is anonymous@.
- 3.
- try to change to the given directory
- 4.
- list the file with the NLST command
- Unsupported
links (javascript:, etc.)
- An unsupported link will only print a warning. No further checking will be
made.
The complete list of recognized, but unsupported links can be
found in the linkcheck/checker/unknownurl.py source file. The
most prominent of them should be JavaScript links.
Sitemaps are parsed for links to check and can be detected either
from a sitemap entry in a robots.txt, or when passed as a FILE-OR-URL
argument in which case detection requires the urlset/sitemapindex tag to be
within the first 70 characters of the sitemap. Compressed sitemap files are
not supported.
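For example, a sitemap can be passed directly as the start argument (the
location is illustrative):
$ linkchecker https://www.example.com/sitemap.xml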
There are two plugin types: connection and content plugins.
Connection plugins are run after a successful connection to the URL host.
Content plugins are run if the URL type has content (mailto: URLs have no
content, for example) and if the check is not forbidden (e.g. by a
robots.txt). Use the option --list-plugins for a list of plugins and
their documentation. All plugins are enabled via the linkcheckerrc(5)
configuration file.
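For example, the available plugins and their documentation can be listed
with:
$ linkchecker --list-plugins
A plugin is then enabled by adding a section with its name to the
linkcheckerrc(5) configuration file; see linkcheckerrc(5) for the exact
syntax.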
Before descending recursively into a URL, it has to fulfill
several conditions. They are checked in this order:
- 1.
- A URL must be valid.
- 2.
- A URL must be parseable. This currently includes HTML files, Opera
bookmarks files, and directories. If a file type cannot be determined (for
example it does not have a common HTML file extension, and the content
does not look like HTML), it is assumed to be non-parseable.
- 3.
- The URL content must be retrievable. This is usually the case, except for
mailto: or unknown URL types, for example.
- 4.
- The maximum recursion level must not be exceeded. It is configured with
the --recursion-level option and is unlimited by default; see the example
after this list.
- 5.
- It must not match the ignored URL list. This is controlled with the
--ignore-url option.
- 6.
- The Robots Exclusion Protocol must allow links in the URL to be followed
recursively. This is checked by searching for a "nofollow"
directive in the HTML header data.
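For example, to follow links at most two levels deep from the start URL:
$ linkchecker --recursion-level=2 http://www.example.com/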
Note that the directory recursion reads all files in that
directory, not just a subset like index.htm.
URLs on the command line starting with ftp. are treated like
ftp://ftp., URLs starting with www. are treated like
http://www.. You can also give local files as arguments. If you have
your system configured to automatically establish a connection to the
internet (e.g. with diald), it will connect when checking links not pointing
to your local host. Use the --ignore-url option to prevent this.
Javascript links are not supported.
If your platform does not support threading, LinkChecker disables
it automatically.
You can supply multiple user/password pairs in a configuration
file.
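A hedged sketch of such an entry, assuming the [authentication] section
syntax of linkcheckerrc(5) with one regular expression, user and password per
line (all values are illustrative; see linkcheckerrc(5) for the authoritative
format):
[authentication]
entry=
  ^https://intranet\.example\.com/ myuser mypassword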
- curl_ca_bundle
- an alternative certificate bundle to be used with an HTTPS proxy
- no_proxy
- comma-separated list of domains to not contact over a proxy server
The return value is 2 when
- a program error occurred.
The return value is 1 when
- invalid links were found or
- link warnings were found and warnings are enabled
Else the return value is zero.
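For example, the exit status can be acted upon directly in a shell (the
message is illustrative):
$ linkchecker -o none http://www.example.com/ || echo "check failed"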
LinkChecker consumes memory for each queued URL to check. With
thousands of queued URLs the amount of consumed memory can become quite
large. This might slow down the program or even the whole system.
$XDG_CONFIG_HOME/linkchecker/linkcheckerrc - default
configuration file
$XDG_DATA_HOME/linkchecker/failures - default failures
logger output filename
linkchecker-out.TYPE - default logger file output
name
linkcheckerrc(5)
https://docs.python.org/library/codecs.html#standard-encodings
- valid output encodings
https://docs.python.org/howto/regex.html - regular
expression documentation
Bastian Kleineidam <bastian.kleineidam@web.de>
2000-2016 Bastian Kleineidam, 2010-2023 LinkChecker Authors