NAME

xtract - NCBI Entrez Direct XML conversion and transformation tool

SYNOPSIS

xtract [-help] [-strict] [-mixed] [-self] [-accent] [-ascii] [-compress] [-stops] [-input filename] [-transform filename] [-aliases filename] [-pattern expr] [-group expr] [-block expr] [-subset expr] [-path path] [-if expr [constraint]] [-unless expr [constraint]] [-and condition] [-or condition] [-else] [-position pos] [-equals str] [-contains str] [-includes str] [-is-within str] [-starts-with str] [-ends-with str] [-is-not str] [-is-before str] [-is-after str] [-matches str] [-resembles str] [-is-equal-to expr] [-differs-from expr] [-gt N] [-ge N] [-lt N] [-le N] [-eq N] [-ne N] [-ret str] [-tab str] [-sep str] [-pfx str] [-sfx str] [-rst] [-clr] [-pfc str] [-deq str] [-def str] [-lbl str] [-set tag] [-rec tag] [-wrp tag] [-enc tag] [-plg str] [-elg str] [-pkg tag] [-fwd str] [-awd str] [-tag tag] [-att key value] [-cls] [-slf] [-end tag] [-element element] [-first element] [-last element] [-backward element] [-NAME] [--STATS] [-num element] [-len element] [-sum element] [-acc element] [-min element] [-max element] [-inc element] [-dec element] [-sub element] [-avg element] [-dev element] [-med element] [-mul element] [-div element] [-mod element] [-bin element] [-oct element] [-hex element] [-bit element] [-pad element] [-encode element] [-upper element] [-lower element] [-chain element] [-title element] [-mirror element] [-alnum element] [-basic element] [-plain element] [-simple element] [-author element] [-prose element] [-terms element] [-words element] [-pairs element] [-order element] [-reverse element] [-letters element] [-clauses element] [-year element] [-month element] [-date element] [-page element] [-auth element] [-initials element] [-jour element] [-trim element] [-wct element] [-doi element] [-translate element] [-classify element] [-replace -reg target -exp replacement] [-revcomp] [-nucleic] [-fasta] [-ncbi2na] [-ncbi4na] [-molwt] [-0-based element] [-1-based element] [-ucsc-based element] [-insd arg ...] 
[-histogram] [-e2index [extras]] [-indices element] [-article element] [-abstract element] [-paragraph element] [-stemmed element] [-head str] [-tail str] [-hd str] [-tl str] [-select condition] [-in filename] [-sort[-fwd] element] [-sort-rev element] [-format fmt [-unicode style]] [-verify] [-outline] [-synopsis] [-contour [delimiter]] [-examples] [-unix] [-version]

DESCRIPTION

xtract converts an XML document into a table of data values according to user-specified rules.

OPTIONS

Processing Flags

-strict: Remove HTML and MathML tags.
-mixed: Allow mixed content XML.
-self: Allow detection of empty self-closing tags.
-accent: Delete Unicode accents and diacritical marks.
-ascii: Convert Unicode to numeric HTML character entities.
-compress: Compress runs of spaces.
-stops: Retain stop words in selected phrases.

Data Source

-input filename: Read XML from file instead of standard input.
-transform filename: File of substitutions for -translate.
-aliases filename: Mappings file for -classify operation.

Exploration Argument Hierarchy

-pattern expr: Name of record within set.
-group expr
-block expr
-subset expr

Use of different argument names (-pattern, -group, -block, -subset) allows command-line control of nested looping.
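
The nested looping controlled by these arguments can be pictured as nested for-loops. The Python sketch below is only an illustration, not how xtract is implemented: it uses the standard xml.etree.ElementTree module on a made-up two-record document, and the function name explore and the sample data are assumptions. It mimics the shape of a command such as -pattern PubmedArticle -element PMID -block Author -element LastName, with one tab-delimited row per pattern.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, not taken from any real database record.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <PMID>1</PMID>
    <AuthorList>
      <Author><LastName>Smith</LastName></Author>
      <Author><LastName>Jones</LastName></Author>
    </AuthorList>
  </PubmedArticle>
  <PubmedArticle>
    <PMID>2</PMID>
    <AuthorList>
      <Author><LastName>Lee</LastName></Author>
    </AuthorList>
  </PubmedArticle>
</PubmedArticleSet>"""


def explore(xml_text, pattern, block, first_field, block_field):
    """Return one tab-delimited row per `pattern` object (illustrative only)."""
    rows = []
    root = ET.fromstring(xml_text)
    for record in root.iter(pattern):        # outer loop: plays the role of -pattern
        fields = [record.findtext(first_field, default="")]
        for part in record.iter(block):      # inner loop: plays the role of -block
            fields.append(part.findtext(block_field, default=""))
        rows.append("\t".join(fields))
    return rows


if __name__ == "__main__":
    for row in explore(SAMPLE, "PubmedArticle", "Author", "PMID", "LastName"):
        print(row)
```

Each level of the hierarchy adds one more loop nested inside the previous one.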

Path Navigation

-path path: Explore by list of adjacent object names.

Exploration Constructs

Object: DateRevised
Parent/Child: Book/AuthorList
Path: MedlineCitation/Article/Journal/JournalIssue/PubDate
Heterogeneous: "PubmedArticleSet/*"
Exhaustive: "History/**"
Nested: "*/Taxon"

Conditional Execution

-if expr [constraint]: Element (or @attribute) must exist and satisfy any specified constraint.
-unless expr [constraint]: Skip if element matches.
-and condition: Preceding and following tests must both pass.
-or condition: Any passing test suffices.
-else: Execute if conditional test failed.
-position pos: first/last/outer/inner/even/odd/all.

String Constraints

-equals str: String must match exactly.
-contains str: Substring must be present.
-includes str: Substring must match at word boundaries.
-is-within str: String must be present.
-starts-with str: Substring must be at beginning.
-ends-with str: Substring must be at end.
-is-not str: String must not match.
-is-before str: First string < second string.
-is-after str: First string > second string.
-matches str: Matches without commas or semicolons.
-resembles str: Requires all words, but in any order.
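
A few of the simpler constraints above can be read as ordinary string predicates. The following Python sketch is an approximation under stated assumptions: the names CONSTRAINTS and satisfies are hypothetical, case folding stands in for the case-insensitive comparison mentioned in NOTES, and the word-boundary and word-order constraints (-includes, -resembles, -matches) are deliberately left out because their exact rules are not spelled out here.

```python
def _fold(text):
    # NOTES state that string constraints use case-insensitive comparisons;
    # casefold() is one way to approximate that (an assumption of this sketch).
    return text.casefold()


# Map a constraint flag to a predicate on (element value, constraint string).
CONSTRAINTS = {
    "-equals":      lambda value, s: _fold(value) == _fold(s),
    "-contains":    lambda value, s: _fold(s) in _fold(value),
    "-starts-with": lambda value, s: _fold(value).startswith(_fold(s)),
    "-ends-with":   lambda value, s: _fold(value).endswith(_fold(s)),
    "-is-not":      lambda value, s: _fold(value) != _fold(s),
}


def satisfies(value, flag, s):
    """Return True if `value` passes the named constraint against `s`."""
    return CONSTRAINTS[flag](value, s)


if __name__ == "__main__":
    print(satisfies("Homo sapiens", "-starts-with", "homo"))
```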

Object Constraints

-is-equal-to expr: Object values must match.
-differs-from expr: Object values must differ.

Numeric Constraints

-gt N: Greater than.
-ge N: Greater than or equal to.
-lt N: Less than.
-le N: Less than or equal to.
-eq N: Equal to.
-ne N: Not equal to.

Format Customization

-ret str: Override line break between patterns.
-tab str: Replace tab character between fields.
-sep str: Separator between group members.
-pfx str: Prefix to print before group.
-sfx str: Suffix to print after group.
-rst: Reset -sep through -elg.
-clr: Clear queued tab separator.
-pfc str: Preface combines -clr and -pfx.
-deq str: Delete and replace queued tab separator.
-def str: Default placeholder for missing fields.
-lbl str: Insert arbitrary text.
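
How the separators fit together can be sketched in Python. This is a simplified model, not xtract's code: format_row and format_table are hypothetical names, one setting is applied to every group rather than being changeable between -element arguments, the default for -sep is assumed to be a tab, and a missing field is assumed to print nothing unless a -def placeholder is given.

```python
def format_row(groups, tab="\t", sep="\t", pfx="", sfx="", default=None):
    """Join one pattern's output; each inner list is one group of values."""
    fields = []
    for members in groups:
        if not members:
            if default is None:
                continue            # assumed: a missing field prints nothing
            members = [default]     # -def placeholder for a missing field
        # -pfx before the group, -sep between its members, -sfx after it
        fields.append(pfx + sep.join(members) + sfx)
    return tab.join(fields)         # -tab between fields


def format_table(patterns, ret="\n", **options):
    """End each pattern's row with the -ret string."""
    return "".join(format_row(groups, **options) + ret for groups in patterns)


if __name__ == "__main__":
    print(format_row([["1"], ["Smith", "Jones"]], sep=", "))
```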

XML Generation

-set tag: XML tag for entire set.
-rec tag: XML tag for each record.
-wrp tag: Wrap elements in XML object.
-enc tag: Encase instance in XML object.
-plg str: Prologue to print before instance.
-elg str: Epilogue to print after instance.
-pkg tag: Package subset in XML object.
-fwd str: Foreword to print before subset.
-awd str: Afterword to print after subset.

Tag and Attribute Construction

-tag tag: Start with <tag.
-att key value: Attribute key and value.
-cls: Close with >.
-slf: Self-close with />.
-end tag: End contents with </tag>.
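
Read literally, these five arguments emit the pieces of an XML element one at a time. The Python sketch below concatenates such pieces to show the intended result; the helper names are hypothetical, and the double-quoted attribute style, the escaping, and the absence of a space before /> are assumptions of the sketch rather than documented xtract behavior.

```python
from xml.sax.saxutils import escape


def tag(name):
    return "<" + name                      # -tag: start with <tag


def att(key, value):
    # -att: attribute key and value (double quotes are an assumption)
    return ' {}="{}"'.format(key, escape(value, {'"': "&quot;"}))


def cls():
    return ">"                             # -cls: close with >


def slf():
    return "/>"                            # -slf: self-close with />


def end(name):
    return "</" + name + ">"               # -end: end contents with </tag>


def build(*pieces):
    """Concatenate the pieces in command-line order."""
    return "".join(pieces)


if __name__ == "__main__":
    print(build(tag("Item"), att("type", "journal"), cls(),
                escape("Cell & Tissue"), end("Item")))
```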

Element Selection

-element element: Print all items that match tag name.
-first element: Only print value of first item.
-last element: Only print value of last item.
-backward element: Print values in reverse order.
-NAME: Record value in named variable.
--STATS: Accumulate values into variable.

-element Constructs

Tag: Caption
Group: Initials,LastName
Parent/Child: MedlineCitation/PMID
Recursive: "**/Gene-commentary_accession"
Unrestricted: PubDate/*
Attribute: DescriptorName@MajorTopicYN
Range: MedlineDate[1:4]
Substring: "Title[phospholipase | rattlesnake]"
Object Count: "#Author"
Item Length: "%Title"
Element Depth: "^PMID"
Variable: "&NAME"

Special -element Operations

Parent Index: "+"
Object Name: "?"
Object Value: "~"
XML Subtree: "*"
Children: "$"
Attributes: "@"
ASN.1 Record: "."
JSON Record: "%"

Numeric Processing

-num element: Count.
-len element: Length.
-sum element: Sum.
-acc element: Accumulator.
-min element: Minimum.
-max element: Maximum.
-inc element: Increment.
-dec element: Decrement.
-sub element: Difference.
-avg element: Average.
-dev element: Deviation.
-med element: Median.
-mul element: Product.
-div element: Quotient.
-mod element: Remainder.
-bin element: Binary.
-oct element: Octal.
-hex element: Hexadecimal.
-bit element: Bit count.
-pad element: Zero-pad to eight digits.
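
The base-conversion and padding operations have direct counterparts in most languages. The Python sketch below assumes each one takes a non-negative decimal integer and prints it in the named form; the function name render, the direction of conversion, and the lowercase hexadecimal digits are assumptions, not confirmed xtract behavior.

```python
def render(n):
    """Show a non-negative integer in the forms named by the flags above."""
    return {
        "-bin": format(n, "b"),            # binary digits
        "-oct": format(n, "o"),            # octal digits
        "-hex": format(n, "x"),            # hexadecimal digits (case assumed)
        "-bit": str(bin(n).count("1")),    # number of set bits
        "-pad": format(n, "08d"),          # zero-padded to eight digits
    }


if __name__ == "__main__":
    for flag, text in render(255).items():
        print(flag, text, sep="\t")
```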

Character Processing

-encode element: XML-encode <, >, &, ", and ' characters.
-upper element: Convert text to uppercase.
-lower element: Convert text to lowercase.
-chain element: Change spaces to underscores.
-title element: Capitalize initial letters of words.
-mirror element: Reverse order of letters.
-alnum element: Non-alphanumeric characters to space.
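
Most of these operations are one-line string transformations. The Python equivalents below are a rough sketch: the function names simply echo the flags, -alnum is assumed to treat only ASCII letters and digits as alphanumeric and to replace each other character with one space, and -title and -encode are omitted because their exact rules are not stated here.

```python
import re


def upper(text):
    return text.upper()                    # -upper


def lower(text):
    return text.lower()                    # -lower


def chain(text):
    return text.replace(" ", "_")          # -chain: spaces to underscores


def mirror(text):
    return text[::-1]                      # -mirror: reverse order of letters


def alnum(text):
    # -alnum: each non-alphanumeric character becomes a space (ASCII assumed)
    return re.sub(r"[^0-9A-Za-z]", " ", text)


if __name__ == "__main__":
    print(chain("heat shock protein"))
```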

String Processing

-basic element: Convert superscripts and subscripts.
-plain element: Remove embedded mixed-content markup tags.
-simple element: Normalize accented letters; spell Greek letters.
-author element: Multi-step author cleanup.
-prose element: Text conversion to ASCII.

Text Processing

-terms element: Partition text at spaces.
-words element: Split at punctuation marks.
-pairs element: Adjacent informative words.
-order element: Rearrange words in sorted order.
-reverse element: Reverse words in string.
-letters element: Separate individual letters.
-clauses element: Break at phrase separators.

Citation Functions

-year element: Extract first 4-digit year from string.
-month element: Match first month name and return a corresponding integer.
-date element: YYYY/MM/DD from -unit "PubDate" -date "*"
-page element: Get digits (and letters) of first page number.
-auth element: Change GenBank authors to Medline form.
-initials element: Parse initials from forename or given name.
-jour element: Clean up journal name punctuation.
-trim element: Remove extra spaces and leading zeros.
-wct element: Count number of -words in a string.
-doi element: Add https://doi.org/ prefix, URL encode.
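
Two of these functions are easy to approximate. In the Python sketch below, year returns the first run of four digits and doi_url prepends the resolver prefix and percent-encodes the rest. Both names are hypothetical, and the precise matching and encoding rules xtract applies may differ (for example, around digit runs longer than four characters).

```python
import re
from urllib.parse import quote


def year(text):
    """Return the first four consecutive digits, or an empty string."""
    match = re.search(r"\d{4}", text)
    return match.group(0) if match else ""


def doi_url(doi):
    """Prefix a bare DOI with https://doi.org/ and percent-encode it."""
    return "https://doi.org/" + quote(doi, safe="/")


if __name__ == "__main__":
    print(year("2019 Mar-Apr"))
    print(doi_url("10.1002/(SICI)1097"))
```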

Value Transformation

-translate element: Substitute values with -transform table.
-classify element: Substring word or phrase matches to -aliases table.

Regular Expression

-replace: Substitute text using regular expressions.

-reg target: Target expression.
-exp replacement: Replacement pattern.

Sequence Processing

-revcomp: Reverse complement nucleotide sequence.
-nucleic: Subrange determines forward or revcomp.
-fasta: Split sequence into blocks of 70 uppercase letters.
-ncbi2na: Expand ncbi2na to IUPAC. (May need to truncate result to actual sequence length.)
-ncbi4na: Expand ncbi4na to IUPAC. (May need to truncate result to actual sequence length.)
-molwt: Calculate molecular weight of peptide.
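
For readers unfamiliar with the first and third operations, the Python sketch below shows a standard reverse complement over IUPAC nucleotide codes and a split into 70-letter uppercase lines. It is not xtract's code: revcomp and fasta_lines are illustrative names, and the handling of characters outside the IUPAC alphabet (left unchanged here) is an assumption.

```python
# IUPAC nucleotide codes and their complements, upper and lower case.
_CODES = "ACGTRYSWKMBDHVN"
_COMPS = "TGCAYRSWMKVHDBN"
_COMPLEMENT = str.maketrans(_CODES + _CODES.lower(), _COMPS + _COMPS.lower())


def revcomp(seq):
    """Complement each base, then reverse the whole sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def fasta_lines(seq, width=70):
    """Uppercase the sequence and split it into blocks of `width` letters."""
    seq = seq.upper()
    return [seq[i:i + width] for i in range(0, len(seq), width)]


if __name__ == "__main__":
    print(revcomp("ATGC"))
    print("\n".join(fasta_lines("acgt" * 40)))
```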

Sequence Coordinates

-0-based element: Zero-based.
-1-based element: One-based.
-ucsc-based element: Half-open.
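
The three conventions differ only in where counting starts and whether the end position is included. The Python sketch below shows how the same interval is written in each, taking a one-based closed interval as the reference. The function names are hypothetical, and the sketch makes no claim about which direction each xtract argument converts.

```python
def one_based_to_zero_based(start, end):
    """1-based closed [start, end] to 0-based closed: shift both ends down."""
    return start - 1, end - 1


def one_based_to_ucsc(start, end):
    """1-based closed to half-open: 0-based start, end left unchanged."""
    return start - 1, end


def ucsc_to_one_based(start, end):
    """Half-open (0-based start, exclusive end) back to 1-based closed."""
    return start + 1, end


def ucsc_length(start, end):
    """In the half-open form the length is simply end minus start."""
    return end - start


if __name__ == "__main__":
    print(one_based_to_ucsc(1, 10))
```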

Command Generator

-insd arg ...: Generate INSDSeq extraction commands. Print them if invoked standalone; run them if invoked as part of a pipeline. Requires one or more arguments, which may appear in the following order:

Descriptor(s): INSDSeq_sequence/INSDSeq_definition/INSDSeq_division/... [...]
Completeness: complete/partial
Feature(s): CDS/mRNA/...[,...]
Qualifier(s): INSDFeature_key/"#INSDInterval"/gene/product/feat_location/sub_sequence/... [...]

Frequency Table

-histogram: Collects data for sort-uniq-count(1) on entire set of records.

Entrez Indexing

-e2index [extras]: Create Entrez index XML. extras (true or false; false by default) indicates whether to index extra fields.
-indices element: Index normalized words.
-article element: Title positional index.
-abstract element: Abstract positional index.
-paragraph element: Index text paragraphs.
-stemmed element: Apply Porter2 algorithm.

Output Organization

-head str: Print before everything else.
-tail str: Print after everything else.
-hd str: Print before each record.
-tl str: Print after each record.

Record Selection

-select condition: Select record subset by conditions.
-in filename: File of identifiers to use for selection.

Record Rearrangement

-sort[-fwd] element: Element to use as sort key.
-sort-rev element: Sort records in reverse order.

Reformatting

-format fmt

copy: Fast block copy (still applies processing flags).
compact: Compress runs of spaces.
flush: Suppress line indentation.
indent: Indent according to nesting depth.
expand: Place each attribute on a separate line.

Validation

-verify: Report XML data integrity problems.

Summary

-outline: Display outline of XML structure.
-synopsis: Display individual XML paths.
-contour [delimiter]: Display XML paths to leaf nodes (delimited by / by default).

Documentation

-help: Print usage information and some example argument combinations.
-examples: Complete usage examples, involving additional Entrez Direct tools.
-unix: Illustrate common Unix command arguments.
-version: Print version number.

NOTES

String constraints use case-insensitive comparisons.

Numeric constraints and selection arguments use integer values.

-num and -len selections are synonyms for Object Count (#) and Item Length (%).

-words, -pairs, and -indices convert to lowercase.