Locale::Po4a::Xml - convert XML documents and derivatives from/to PO
The goal of the po4a (PO for anything) project is to ease translations
(and, more interestingly, the maintenance of translations) by using gettext
tools in areas where they were not expected, such as documentation.
Locale::Po4a::Xml is a module to help the translation of XML
documents into other [human] languages. It can also be used as a base to
build modules for XML-based documents.
This module can be used directly to handle generic XML documents.
By default it extracts the content of all tags, but no attributes, since that
is where the text is written in most XML-based documents.
There are some options (described in the next section) that can
customize this behavior. If this does not fit your document format, you are
encouraged to write your own module derived from this one to describe your
format's details. See the section WRITING DERIVATE MODULES below for
a description of the process.
The global debug option causes this module to show the excluded
strings, so that you can check whether it skips something important.
These are this module's particular options:
- nostrip
- Prevents the module from stripping the spaces around the extracted strings.
- wrap
- Canonicalizes the string to translate, considering that whitespace is
not significant, and wraps the translated document. This option can be
overridden by custom tag options. See the translated option.
- unwrap_attributes
- Attributes are wrapped by default. This option disables wrapping.
- caseinsensitive
- Makes the matching of tags and attributes case-insensitive. If it is
defined, the module will treat <BooK>laNG and <BOOK>Lang
as <book>lang.
- escapequotes
- Escape quotes in output strings. Necessary, for example, for creating
string resources for use by Android build tools.
See also:
- includeexternal
- When defined, external entities are included in the generated (translated)
document, and for the extraction of strings. If it's not defined, you will
have to translate external entities separately as independent documents.
- ontagerror
- This option defines the behavior of the module when it encounters invalid
XML syntax (a closing tag which does not match the last opening tag). It
can take the following values:
- fail
- This is the default value. The module will exit with an error.
- warn
- The module will continue, and will issue a warning.
- silent
- The module will continue without any warnings.
Be careful when using this option. It is generally recommended to
fix the input file.
- tagsonly
- Note: This option is deprecated.
Extracts only the specified tags in the tags option.
Otherwise, it will extract all the tags except the ones specified.
- doctype
- String that is matched against the first line of the document's
doctype (if defined). If it does not match, a warning will indicate that the
document might be of the wrong type.
- addlang
- String indicating the path (e.g. <bbb><aaa>) of a tag where a
lang="..." attribute shall be added. The language will be
defined as the basename of the PO file without any .po extension.
- optionalclosingtag
- Boolean indicating whether closing tags are optional (as in HTML). By
default, missing closing tags raise an error handled according to the
ontagerror option.
- tags
- Note: This option is deprecated. You should use the translated and
untranslated options instead.
Space-separated list of tags you want to translate or skip. By
default, the specified tags will be excluded, but if you use the
"tagsonly" option, the specified tags will be the only ones
included. The tags must be in the form <aaa>, but you can join
some (<bbb><aaa>) to say that the content of the tag
<aaa> will only be translated when it is inside a <bbb> tag.
You can also specify some tag options by putting some
characters in front of the tag hierarchy. For example, you can put
w (wrap) or W (don't wrap) to override the default
behavior specified by the global wrap option.
Example: W<chapter><title>
- attributes
- Space-separated list of tag attributes you want to translate. You can
specify the attributes by their name (for example, "lang"), and
you can prefix the name with a tag hierarchy to specify that the attribute
will only be translated when it is in the specified tag. For example:
<bbb><aaa>lang specifies that the lang attribute will only be
translated if it is in an <aaa> tag that is inside a <bbb> tag.
- foldattributes
- Do not translate attributes in inline tags. Instead, replace all
attributes of a tag by po4a-id=<id>.
This is useful when attributes shall not be translated, as
this simplifies the strings for translators, and avoids typos.
- customtag
- Space-separated list of tags which should not be treated as tags. These
tags are treated as inline, and do not need to be closed.
- break
- Space-separated list of tags which should break the sequence. By default,
all tags break the sequence.
The tags must be in the form <aaa>, but you can join
some (<bbb><aaa>), if a tag (<aaa>) should only be
considered when it's within another tag (<bbb>).
Please note that a tag should be listed in only one of the
break, inline, placeholder, or customtag
setting strings.
- inline
- Space-separated list of tags which should be treated as inline. By
default, all tags break the sequence.
The tags must be in the form <aaa>, but you can join
some (<bbb><aaa>), if a tag (<aaa>) should only be
considered when it's within another tag (<bbb>).
- placeholder
- Space-separated list of tags which should be treated as placeholders.
Placeholders do not break the sequence, but the content of placeholders is
translated separately.
The location of the placeholder in its block will be marked
with a string similar to:
<placeholder type=\"footnote\" id=\"0\"/>
The tags must be in the form <aaa>, but you can join
some (<bbb><aaa>), if a tag (<aaa>) should only be
considered when it's within another tag (<bbb>).
- break-pi
- By default, Processing Instructions (i.e., "<?
... ?>" tags) are handled as inline tags. Pass this option
if you want PIs to be handled as breaking tags. Note that unprocessed
PHP tags are handled as Processing Instructions by the parser.
- nodefault
- Space-separated list of tags that the module should not try to set by
default in any category.
If a tag gets its default setting from a
subclass of this module but you want to give it an alternative setting, you
need to list that tag as part of the nodefault setting.
- cpp
- Support C preprocessor directives. When this option is set, po4a will
consider preprocessor directives as paragraph separators. This is
important if the XML file must be preprocessed, because otherwise the
directives may be inserted in the middle of lines if po4a considers them
to belong to the current paragraph, and they won't be recognized by the
preprocessor. Note: the preprocessor directives must only appear between
tags (they must not break a tag).
- translated
- Space-separated list of tags you want to translate.
The tags must be in the form <aaa>, but you can join
some (<bbb><aaa>), if a tag (<aaa>) should only be
considered when it's within another tag (<bbb>).
You can also specify some tag options by putting some
characters in front of the tag hierarchy. This overrides the default
behavior specified by the global wrap and
defaulttranslateoption options.
- w
- Tags should be translated and content can be re-wrapped.
- W
- Tags should be translated and content should not be re-wrapped.
- i
- Tags should be translated inline.
- p
- Tags should be translated as placeholders.
Internally, the XML parser only cares about these four options:
w W i p.
* Tags listed in break are set to w or W
depending on the wrap option.
* Tags listed in inline are set to i.
* Tags listed in placeholder are set to p.
* Tags listed in untranslated are without any of these
options set.
You can verify the actual internal parameter behavior by invoking
po4a with the --debug option.
Example: W<chapter><title>
Please note that a tag should be listed in either the translated or the
untranslated setting string.
- untranslated
- Space-separated list of tags you do not want to translate.
The tags must be in the form <aaa>, but you can join
some (<bbb><aaa>), if a tag (<aaa>) should only be
considered when it's within another tag (<bbb>).
Please note that a translatable inline tag in an untranslated tag
is treated as a translatable breaking tag: the i setting is dropped
and w or W is set depending on the wrap option.
- defaulttranslateoption
- The default categories for tags that are not listed in any of the
translated, untranslated, break, inline, or placeholder options.
This is a set of letters as defined in translated, and
this setting is only valid for translatable tags.
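The tag hierarchy notation shared by translated, untranslated, break, inline, and placeholder can be illustrated with a small standalone sketch. It is not the module's own code: parse_tag_list and match_path are hypothetical helpers, and the matching rule used here (try the full path, then drop the outermost ancestor and retry) is an assumption about how an entry such as W<chapter><title> is compared against a path such as <book><chapter><title>.

```perl
use strict;
use warnings;

# Hypothetical helper: split a setting string such as
# 'W<chapter><title> <para> i<emphasis>' into a hash that maps each
# tag hierarchy to its option characters ('' when none are given).
sub parse_tag_list {
    my ($setting) = @_;
    my %list;
    for my $entry (split ' ', $setting) {
        # Leading non-'<' characters are options; the rest is the hierarchy.
        my ($opts, $hierarchy) = $entry =~ /^([^<]*)((?:<[^<>]+>)+)$/
            or next;    # malformed entries are ignored in this sketch
        $list{$hierarchy} = $opts;
    }
    return \%list;
}

# Hypothetical helper: return the option characters of the entry that
# matches $path, 1 if that entry has no options, or 0 if nothing matches.
sub match_path {
    my ($path, $list) = @_;
    while (1) {
        if (exists $list->{$path}) {
            return length($list->{$path}) ? $list->{$path} : 1;
        }
        # Drop the outermost ancestor and retry with the shorter path.
        last unless $path =~ s/^<[^<>]+>(?=<)//;
    }
    return 0;
}

my $list = parse_tag_list('W<chapter><title> <para> i<emphasis>');
print match_path('<book><chapter><title>', $list), "\n";    # W
print match_path('<book><para>',           $list), "\n";    # 1
print match_path('<book><title>',          $list), "\n";    # 0
```

The last call returns 0 because <title> is only listed inside <chapter>, which is the restriction the joined form <bbb><aaa> is meant to express.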
The simplest customization is to define which tags and attributes
you want the parser to translate. This should be done in the initialize
function. First you should call the main initialize, to get the command-line
options, and then append your custom definitions to the options hash. If
you want to handle new command-line options, you should define them
before calling the main initialize:
$self->{options}{'_default_translated'}.=' <p> <head><title>';
$self->{options}{'attributes'}.=' <p>lang id';
$self->{options}{'_default_inline'}.=' <br>';
You should use the _default_inline, _default_break,
_default_placeholder, _default_translated,
_default_untranslated, and _default_attributes options in
derivative modules. This allows users to override the default behavior
defined in your module with command-line options.
If you don't like the default behavior of this XML module and its
derivative modules, you can provide command-line options to change their
behavior. See Locale::Po4a::Docbook(3pm).
Another simple step is to override the function
"found_string", which receives the extracted strings from the
parser, in order to translate them. There you can control which strings you
want to translate, and perform transformations on them before or after the
translation itself.
It receives the extracted text, the reference to where it was found, and
a hash that contains extra information to control which strings to translate,
how to translate them, and how to generate the comment.
The content of these options depends on the kind of string it is
(specified in an entry of this hash):
- type="tag"
- The found string is the content of a translatable tag. The entry
"tag_options" contains the option characters in front of the tag
hierarchy in the module "tags" option.
- type="attribute"
- Means that the found string is the value of a translatable attribute. The
entry "attribute" has the name of the attribute.
It must return the text that will replace the original in the
translated document. Here's a basic example of this function:
sub found_string {
    my ($self,$text,$ref,$options)=@_;
    # Pass the module's wrap setting along with the string type.
    $text = $self->translate($text,$ref,"type ".$options->{'type'},
        'wrap'=>$self->{options}{'wrap'});
    return $text;
}
There's another simple example in the new Dia module, which only
filters some strings.
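A filtering override of this kind might look like the following sketch. So that it runs without po4a installed, it inherits from StubParser, a stand-in defined in the example whose translate merely looks the text up in a hash; a real module would inherit from Locale::Po4a::Xml instead. The package names, the "contains a letter" test, and the sample catalog are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Stand-in for Locale::Po4a::Xml so the sketch is self-contained.
# Its translate() only looks the text up in a tiny in-memory catalog.
package StubParser;

sub new {
    my ($class, %catalog) = @_;
    return bless { catalog => \%catalog }, $class;
}

sub translate {
    my ($self, $text, $ref, $comment, %opts) = @_;
    return $self->{catalog}{$text} // $text;
}

# Hypothetical derived module: only strings that contain a letter are
# sent to translate(); anything else is returned unchanged.
package MyFilter;
our @ISA = ('StubParser');

sub found_string {
    my ($self, $text, $ref, $options) = @_;
    return $text unless $text =~ /\p{L}/;
    return $self->translate($text, $ref, "type " . $options->{type});
}

package main;

my $parser = MyFilter->new('Hello' => 'Bonjour');
print $parser->found_string('Hello', 'doc.xml:3', { type => 'tag' }), "\n";
print $parser->found_string('42',    'doc.xml:4', { type => 'tag' }), "\n";
```

The first call prints the catalog entry for Hello, while the second returns 42 untouched because it never reaches translate.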
This is a more complex approach, but it enables (almost) total
customization. It's based on a list of hashes, each one defining a tag
type's behavior. The list should be sorted so that the most general tags come
after the most specific ones (sorted first by the beginning and then by the
end keys). To define a tag type you have to make a hash with the
following keys:
- beginning
- Specifies the beginning of the tag, after the "<".
- end
- Specifies the end of the tag, before the ">".
- breaking
- It says if this is a breaking tag class. A non-breaking (inline) tag is
one that can be taken as part of the content of another tag. It can take
the values false (0), true (1) or undefined. If you leave this undefined,
you'll have to define the f_breaking function that will say whether a
concrete tag of this class is a breaking tag or not.
- f_breaking
- It's a function that will tell if the next tag is a breaking one or not.
It should be defined if the breaking option is not.
- If you leave this key undefined, the generic extraction function will have
to extract the tag itself. It's useful for tags that can have other tags
or special structures in them, so that the main parser doesn't get mad.
This function receives a boolean that says if the tag should be removed
from the input stream or not.
- f_translate
- This function receives the tag (in the get_string_until() format)
and returns the translated tag (translated attributes or all needed
transformations) as a single string.
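A minimal sketch of such a list, and of a lookup over it, follows. The three entries and the helpers find_tag_type and is_breaking are hypothetical and much simpler than the module's real table; in particular, the argument passed to f_breaking here (the tag name) is an assumption. The sketch only shows why specific types must be listed before the catch-all entry.

```perl
use strict;
use warnings;

# Hypothetical tag type list, most specific entries first.
my @tag_types = (
    { beginning => '!--', end => '--', breaking => 0 },    # <!-- comment -->
    { beginning => '?',   end => '?',  breaking => 0 },    # <?pi ... ?>
    {   # catch-all: any other tag; breaking is decided per tag
        beginning  => '',
        end        => '',
        breaking   => undef,
        f_breaking => sub { my ($name) = @_; return $name ne 'b' },
    },
);

# Return the index of the first type whose beginning and end match the
# text found between '<' and '>', or -1 if none matches.
sub find_tag_type {
    my ($inner) = @_;
    for my $i (0 .. $#tag_types) {
        my ($begin, $end) = @{ $tag_types[$i] }{qw(beginning end)};
        return $i if $inner =~ /^\Q$begin\E.*\Q$end\E$/s;
    }
    return -1;
}

# Use the fixed 'breaking' value when defined, else ask f_breaking.
sub is_breaking {
    my ($inner) = @_;
    my $type = $tag_types[ find_tag_type($inner) ];
    return $type->{breaking} if defined $type->{breaking};
    my ($name) = $inner =~ m{^/?([^\s/>]+)};
    return $type->{f_breaking}->($name);
}

for my $inner ('!-- note --', '?php echo 1; ?', 'para', 'b') {
    printf "%-16s type %d, %s\n", $inner, find_tag_type($inner),
        is_breaking($inner) ? 'breaking' : 'inline';
}
```

If the catch-all entry were listed first, every tag would match it and the comment and processing-instruction entries would never be reached.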
- get_path()
- This function returns the path to the current tag from the document's
root, in the form <html><body><p>.
An additional array of tags (without brackets) can be passed
as argument. These path elements are added to the end of the current path.
- This function returns the index from the tag_types list that matches the
next tag in the input stream, or -1 if it's at the end of the input file.
Here, a tag is a structure that starts with < and ends with >,
and it can span multiple lines.
This works on the array
"@{$self->{TT}{doc_in}}" holding
input document data and reference indirectly via
"$self->shiftline()" and
- This function returns the next tag from the input stream without the
beginning and end, in an array form, to maintain the references from the
input file. It has two parameters: the type of the tag (as returned by
tag_type) and a boolean, that indicates if it should be removed from the
input stream.
This works on the array
"@{$self->{TT}{doc_in}}" holding
input document data and reference indirectly via
"$self->shiftline()" and
- get_tag_name(@)
- This function returns the name of the tag passed as an argument, in the
array form returned by extract_tag.
- breaking_tag()
- This function returns a boolean that says whether the next tag in the input
stream is a breaking tag or not (inline tag). It leaves the input stream
intact.
- treat_tag()
- This function translates the next tag from the input stream, using each
tag type's custom translation functions.
This works on the array
"@{$self->{TT}{doc_in}}" holding
input document data and reference indirectly via
"$self->shiftline()" and
- tag_in_list($@)
- This function returns a string value that says whether the first argument (a
tag hierarchy) matches any of the tags from the second argument (a list of
tags or tag hierarchies). If it doesn't match, it returns 0. Otherwise, it
returns the matched tag's options (the characters in front of the tag) or
1 (if that tag doesn't have options).
- treat_attributes(@)
- This function handles the translation of the tags' attributes. It receives
the tag without the beginning / end marks, and then it finds the
attributes, and it translates the translatable ones (specified by the
module option attributes). This returns a plain string with the
translated tag.
- treat_content()
- This function gets the text until the next breaking tag (not inline) from
the input stream and translates it using each tag type's custom translation
functions.
This works on the array
"@{$self->{TT}{doc_in}}" holding
input document data and reference indirectly via
"$self->shiftline()" and
- treat_options()
- This function fills the internal structures that contain the tags,
attributes and inline data with the options of the module (specified in
the command-line or in the initialize function).
- get_string_until($%)
- This function returns an array with the lines (and references) from the
input document until it finds the first argument. The second argument is
an options hash. Value 0 means disabled (the default) and 1, enabled.
The valid options are:
- include
- This makes the returned array contain the searched text
- remove
- This removes the returned stream from the input
- unquoted
- This ensures that the searched text is outside any quotes
- regex
- This denotes that the first argument is a regular expression rather than
a plain string
- skip_spaces(\@)
- This function receives as argument the reference to a paragraph (in the
format returned by get_string_until), skips its leading spaces and returns
them as a simple string.
- join_lines(@)
- This function returns a simple string with the text from the argument
array (discarding the references).
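The paragraph format shared by these helpers can be illustrated with a standalone sketch. It assumes that the array alternates between a piece of text and the reference it came from. The functions below are simplified stand-ins written as plain subs rather than the module's methods, and the sample references are invented.

```perl
use strict;
use warnings;

# Assumed paragraph layout: (text, reference, text, reference, ...).
my @paragraph = (
    "  <title>Intro\n",  'doc.xml:1',
    "duction</title>\n", 'doc.xml:2',
);

# Concatenate the text elements and drop the references.
sub join_lines {
    my @lines = @_;
    my $text  = '';
    while (@lines) {
        my ($line, $ref) = splice @lines, 0, 2;
        $text .= $line;
    }
    return $text;
}

# Remove leading whitespace from the paragraph in place and return the
# removed whitespace as a plain string.
sub skip_spaces {
    my ($para) = @_;
    my $spaces = '';
    while (@$para and $para->[0] =~ s/^(\s+)//) {
        $spaces .= $1;
        last if length $para->[0];    # text remains: stop here
        splice @$para, 0, 2;          # element held only spaces: drop it
    }
    return $spaces;
}

my $indent = skip_spaces(\@paragraph);
print length($indent), "\n";          # 2
print join_lines(@paragraph);
```

Keeping the references next to each piece of text is what lets the extracted strings point back to their location in the input file, which is why join_lines is only applied once the references are no longer needed.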
This module can translate tags and attributes.
There is minimal support for the translation of entities. They
are translated as a whole, and tags are not taken into account. Multiline
entities are not supported, and entities are always rewrapped during the
translation.
structure inside the $self hash?)
Locale::Po4a::TransTractor(3pm), po4a(7)
Jordi Vilalta <jvprat@gmail.com>
Nicolas François <nicolas.francois@centraliens.net>
Copyright © 2004 Jordi Vilalta <jvprat@gmail.com>
Copyright © 2008-2009 Nicolas François <nicolas.francois@centraliens.net>
This program is free software; you may redistribute it and/or
modify it under the terms of GPL (see the COPYING file).