XML::Filter::Sort(3pm) | User Contributed Perl Documentation | XML::Filter::Sort(3pm) |
XML::Filter::Sort - SAX filter for sorting elements in XML
use XML::Filter::Sort;
use XML::SAX::Machines qw( :all );

my $sorter = XML::Filter::Sort->new(
  Record  => 'person',
  Keys    => [
               [ 'lastname',  'alpha', 'asc'  ],
               [ 'firstname', 'alpha', 'asc'  ],
               [ '@age',      'num',   'desc' ],
             ],
);

my $filter = Pipeline( $sorter => \*STDOUT );

$filter->parse_file(\*STDIN);
Or from the command line:
xmlsort
This module is a SAX filter for sorting 'records' in XML documents (including documents larger than available memory). The "xmlsort" utility which is included with this distribution can be used to sort an XML file from the command line without writing Perl code (see "perldoc xmlsort").
These examples assume that you will create an XML::Filter::Sort object and use it in an XML::SAX::Machines pipeline (as in the synopsis above). You could also use the object directly by hooking it up to a SAX generator and a SAX handler, but such details are omitted from the sample code.
When you create an XML::Filter::Sort object (with the "new()" method), you must use the 'Record' option to identify which elements you want sorted. The simplest way to do this is to use the element name, eg:
my $sorter = XML::Filter::Sort->new( Record => 'colour' );
Which could be used to transform this XML:
<options>
  <colour>red</colour>
  <colour>green</colour>
  <colour>blue</colour>
</options>
to this:
<options>
  <colour>blue</colour>
  <colour>green</colour>
  <colour>red</colour>
</options>
You can define a more specific path to the record by adding a prefix of element names separated by forward slashes, eg:
my $sorter = XML::Filter::Sort->new( Record => 'hair/colour' );
which would only sort <colour> elements contained directly within a <hair> element (and would therefore leave our sample document above unchanged). A path which starts with a slash is an 'absolute' path and must specify all intervening elements from the root element to the record elements.
A record element may contain other elements. The order of the record elements may be changed by the sorting process but the order of any child elements within them will not.
The default sort uses the full text of each 'record' element and uses an alphabetic comparison. You can use the 'Keys' option to specify a list of elements within each record whose text content should be used as sort keys. You can also use this option to specify whether the keys should be compared alphabetically or numerically and whether the resulting order should be ascending or descending, eg:
my $sorter = XML::Filter::Sort->new(
  Record  => 'person',
  Keys    => [
               [ 'lastname',  'alpha', 'asc'  ],
               [ 'firstname', 'alpha', 'asc'  ],
               [ '@age',      'num',   'desc' ],
             ]
);
Given this record ...
<person age='35'>
  <firstname>Aardvark</firstname>
  <lastname>Zebedee</lastname>
</person>
The above code would use 'Zebedee' as the first (primary) sort key, 'Aardvark' as the second sort key and the number 35 as the third sort key. In this case, records with the same first and last name would be sorted from oldest to youngest.
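The way several keys combine into one ordering can be sketched in plain Perl. This is only an illustration of the comparison rules described above, not the module's own code: the name compare_records and the representation of each record as a list of already-extracted key values are assumptions made for the example.

```perl
use strict;
use warnings;

# The same key definitions as in the example above.
my @keys_spec = (
    [ 'lastname',  'alpha', 'asc'  ],
    [ 'firstname', 'alpha', 'asc'  ],
    [ '@age',      'num',   'desc' ],
);

# Hypothetical helper: compare two records, each given as an array ref of
# key values in the same order as @keys_spec.
sub compare_records {
    my ($x, $y) = @_;
    for my $i (0 .. $#keys_spec) {
        my (undef, $cmp, $order) = @{ $keys_spec[$i] };
        my $r = $cmp eq 'num' ? $x->[$i] <=> $y->[$i]
                              : $x->[$i] cmp $y->[$i];
        $r = -$r if $order eq 'desc';   # reverse the result for descending keys
        return $r if $r;                # later keys only break ties
    }
    return 0;
}

my @records = (
    [ 'Zebedee', 'Aardvark', 35 ],
    [ 'Zebedee', 'Aardvark', 62 ],
    [ 'Adams',   'Zoe',      9  ],
);

my @sorted = sort { compare_records($a, $b) } @records;
print join(' ', @$_), "\n" for @sorted;
```

The two 'Zebedee, Aardvark' records tie on the first two keys, so the third key decides and the older person comes first.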
As with the 'record' path, it is possible to specify a path to the sort key elements (or attributes). To make a path relative to the record element itself, use './' at the start of the path.
When a record element is re-ordered, it takes its leading whitespace with it.
Only lists of contiguous record elements will be sorted. A list of records which has a 'foreign body' (a non-record element, non-whitespace text, a comment or a processing instruction) between two elements will be treated as two separate lists and each will be sorted in isolation of the other.
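The effect of a 'foreign body' can be modelled with a short plain Perl sketch. The data model here (each child represented as a kind and a text value) and the name sort_contiguous_runs are assumptions for illustration only; the module itself works on SAX events.

```perl
use strict;
use warnings;

# Hypothetical model: each child of the parent element is [ $kind, $text ].
# Runs of adjacent 'record' items are sorted; anything else stays where it
# is and ends the current run.
sub sort_contiguous_runs {
    my @out;
    my @run;
    for my $item (@_) {
        if ($item->[0] eq 'record') {
            push @run, $item;
            next;
        }
        push @out, sort { $a->[1] cmp $b->[1] } @run;   # flush the finished run
        @run = ();
        push @out, $item;                                # foreign body keeps its place
    }
    push @out, sort { $a->[1] cmp $b->[1] } @run;        # flush the final run
    return @out;
}

my @children = (
    [ record  => 'red'   ],
    [ record  => 'blue'  ],
    [ comment => 'a foreign body' ],
    [ record  => 'green' ],
    [ record  => 'amber' ],
);

my @result = sort_contiguous_runs(@children);
print "$_->[0]: $_->[1]\n" for @result;
```

In this model 'amber' never moves ahead of the comment, because the comment splits the records into two lists that are sorted separately.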
The comparator item of a key definition is optional and defaults to 'alpha'.
You may prefer to define the Keys using a delimited string rather than a list of lists. Keys in the string should be separated by either newlines or semicolons and the components of a key should be separated by whitespace or commas. It is not possible to define a subroutine reference comparator using the string syntax.
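To show how the string form relates to the list-of-lists form, here is a minimal sketch that splits a key string according to the rules just described. The function parse_key_string is hypothetical and written for this example; it is not the module's own parser, which may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a delimited key string into a list of lists.
# Keys are separated by newlines or semicolons; the components of a key
# are separated by whitespace or commas.
sub parse_key_string {
    my ($string) = @_;
    my @keys;
    for my $def (split /[\n;]/, $string) {
        my @parts = grep { length } split /[\s,]+/, $def;
        next unless @parts;          # skip empty definitions
        push @keys, \@parts;
    }
    return \@keys;
}

my $keys = parse_key_string('lastname alpha asc; firstname; @age,num,desc');
print join(' / ', map { join(' ', @$_) } @$keys), "\n";
```

Under these assumptions the string above corresponds to three key definitions, the second of which gives only a path and leaves the other components at their defaults.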
Note: The contents of the record are not affected by this setting - merely the copy of the data that is used in the sort comparisons.
Note: You can enable both the NormaliseKeySpace and the KeyFilterSub options - space normalisation will occur first.
Note: It is safe to specify the same temporary directory path for multiple instances since each will create a uniquely named subdirectory (and clean it up afterwards).
If you have not enabled disk buffering (using 'TempDir'), the MaxMem option has no effect. Attempting to sort a large document using only memory buffering may result in Perl dying with an 'out of memory' error.
A simple element path syntax is used in two places: the 'Record' option and the 'Keys' option.
In each case you can use just an element name, or a list of element names separated by forward slashes, eg:
Record => 'ul/li', Keys => 'name'
If a 'Record' path begins with a '/' then it will be anchored at the document root. If a 'Keys' path begins with './' then it is anchored at the current record element. Unanchored paths can match at any level.
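The difference between anchored and unanchored 'Record' paths can be sketched as a comparison against the list of element names from the document root down to the candidate element. The helper record_path_matches is hypothetical and ignores namespaces; it only illustrates the matching rule described above.

```perl
use strict;
use warnings;

# Hypothetical helper: does a 'Record' path match an element, given the
# element names from the document root down to that element?
sub record_path_matches {
    my ($path, @stack) = @_;
    my $anchored = ($path =~ s{^/}{});     # a leading slash means an absolute path
    my @want = split m{/}, $path;
    return 0 if @want > @stack;
    return 0 if $anchored && @want != @stack;
    # Compare the path against the innermost names of the ancestor stack.
    my @tail = @stack[ $#stack - $#want .. $#stack ];
    for my $i (0 .. $#want) {
        return 0 if $want[$i] ne $tail[$i];
    }
    return 1;
}

print record_path_matches('colour',           qw(options colour)), "\n";
print record_path_matches('hair/colour',      qw(options colour)), "\n";
print record_path_matches('/options/colour',  qw(options colour)), "\n";
```

An unanchored path only has to agree with the innermost element names, whereas an absolute path has to account for every level from the root.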
A 'Keys' path can include an attribute name prefixed with an '@' symbol, eg:
Keys => './@href'
Each element or attribute name can include a namespace URI prefix in curly braces, eg:
Record => '{http://www.w3.org/1999/xhtml}li'
If you do not include a namespace prefix, all elements with the specified name will be matched, regardless of any namespace URI association they might have.
If you include an empty namespace prefix (eg: '{}li') then only records which do not have a namespace association will be matched.
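The three namespace cases can be summarised in a small sketch. The function name_matches is hypothetical and takes the namespace URI and local name of an element as separate arguments; it is an illustration of the rules above rather than the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical helper: match one path step against an element's
# namespace URI (undef or '' when it has none) and local name.
sub name_matches {
    my ($pattern, $ns_uri, $local_name) = @_;
    $ns_uri = '' unless defined $ns_uri;
    if ($pattern =~ /^\{(.*?)\}(.+)$/) {
        # '{uri}name' must match both parts; '{}name' requires no namespace
        return ($1 eq $ns_uri && $2 eq $local_name) ? 1 : 0;
    }
    # A bare name matches regardless of namespace
    return $pattern eq $local_name ? 1 : 0;
}

my $xhtml = 'http://www.w3.org/1999/xhtml';
print name_matches("{$xhtml}li", $xhtml, 'li'), "\n";
print name_matches('li',         $xhtml, 'li'), "\n";
print name_matches('{}li',       $xhtml, 'li'), "\n";
print name_matches('{}li',       undef,  'li'), "\n";
```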
In order to arrange records into sorted order, this module uses buffering. It does not need to buffer the whole document, but for any sequence of records within a document, all records must be buffered. Unless you specify otherwise, the records will be buffered in memory. The memory requirements are similar to DOM implementations - 10 to 50 times the character count of the source XML. If your documents are so large that you would not process them with a DOM parser then you should enable disk buffering.
If you enable disk buffering, sequences of records will be assembled into 'chunks' of approximately 10 megabytes (this value is configurable). Each chunk will be sorted and saved to disk. At the end of the record sequence, all the sorted chunks will be merged and written out as SAX events.
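The chunk-then-merge approach is a form of external merge sort. The following sketch shows the idea on plain strings, with in-memory arrays standing in for the chunk files and an item count standing in for the chunk size in bytes. The name chunked_sort is hypothetical; the real disk buffering code is more involved.

```perl
use strict;
use warnings;

# Hypothetical illustration: sort items in chunks of at most $chunk_size,
# then merge the sorted chunks into one sorted list.
sub chunked_sort {
    my ($chunk_size, @items) = @_;

    # Phase 1: cut the input into chunks and sort each one on its own.
    my @chunks;
    while (my @chunk = splice(@items, 0, $chunk_size)) {
        push @chunks, [ sort @chunk ];
    }

    # Phase 2: repeatedly take the smallest head item among all chunks.
    my @merged;
    while (@chunks) {
        my $best = 0;
        for my $i (1 .. $#chunks) {
            $best = $i if $chunks[$i][0] lt $chunks[$best][0];
        }
        push @merged, shift @{ $chunks[$best] };
        splice(@chunks, $best, 1) unless @{ $chunks[$best] };   # drop exhausted chunk
    }
    return @merged;
}

my @sorted = chunked_sort(2, qw(red green blue amber cyan));
print "@sorted\n";
```

Only one chunk needs to be held for sorting at a time, and the merge phase needs only the head item of each chunk, which is why this approach suits documents larger than available memory.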
The memory buffering mode represents each record as an XML::Filter::Sort::Buffer object and uses XML::Filter::Sort::BufferMgr objects to manage the buffers. For details of the internals, see XML::Filter::Sort::BufferMgr.
The disk buffering mode represents each record as an XML::Filter::Sort::DiskBuffer object and uses XML::Filter::Sort::DiskBufferMgr objects to manage the buffers. For details of the internals, see XML::Filter::Sort::DiskBufferMgr.
ignorable_whitespace() events shouldn't be translated to normal characters() events - perhaps in a later release they won't be.
XML::Filter::Sort requires XML::SAX::Base and plays nicely with XML::SAX::Machines.
Copyright 2002-2005 Grant McLean <grantm@cpan.org>
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
2018-03-30 | perl v5.26.1 |