Adding word-level markup

Rehdon, snail, and others have recently asked me about marking up words inside an element that may itself contain markup (sometimes wrapping more than one word), so I thought I’d write it up.

So for example we might have an XML file that looked like:

<root>
 <l>This is a test;</l>
 <l>Only a <seg type="foo">test</seg> ok?</l> 
 <l>And <seg>so; is</seg> this as well.</l>
</root>

Let’s say we want to mark up each of the whitespace-separated words, and for some reason the randomly added semi-colons, as words with a <w> element. What we can use is <xsl:analyze-string> and a regex. For example:

<xsl:template match="l//text()"> 
 <xsl:analyze-string regex="(\w+|;+)" select=".">
 <xsl:matching-substring><w><xsl:value-of select="."/></w></xsl:matching-substring>
 <xsl:non-matching-substring><xsl:value-of select="."/></xsl:non-matching-substring>
 </xsl:analyze-string>
</xsl:template>

In this example we’re matching any text() inside an <l> element, at any depth. Each substring that matches the \w+ regex (or is a run of semicolons) gets wrapped in a <w> element. Anything that doesn’t match, such as whitespace and other punctuation, is output unchanged. Because the match pattern is l//text() (as opposed to l/text()) it also applies to text nodes inside child elements, grandchild elements and further down.

So assuming we have a copy-all template something like:

<xsl:template match="@*|node()" priority="-1">
  <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
</xsl:template>

(where we basically copy any nodes and attributes unless something else matches them) then we should get the result:

<root>
  <l><w>This</w> <w>is</w> <w>a</w> <w>test</w><w>;</w></l>
  <l><w>Only</w> <w>a</w> <seg type="foo"><w>test</w></seg> <w>ok</w>?</l>
  <l><w>And</w> <seg><w>so</w><w>;</w> <w>is</w></seg> <w>this</w> <w>as</w> <w>well</w>.</l>
</root>

Of course that is only the beginning, as your documents will probably have weird special cases and punctuation that you want to handle differently. And also it would, of course, be useful to create an @xml:id attribute for each word element.
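As a minimal sketch of the @xml:id idea, assuming an XSLT 2.0 processor: inside xsl:analyze-string the context item is a string rather than a node, so the text node is saved in a variable first and its generate-id() is combined with position() to build an identifier. The variable name $textNode, the 'w-' prefix and the ID scheme itself are all made up for illustration, not a convention from any particular project.

```xml
<xsl:template match="l//text()">
  <!-- Keep hold of the text node: inside xsl:analyze-string the context item is a string -->
  <xsl:variable name="textNode" select="."/>
  <xsl:analyze-string regex="(\w+|;+)" select=".">
    <xsl:matching-substring>
      <!-- generate-id() differs per text node; position() differs per substring within it -->
      <w xml:id="{concat('w-', generate-id($textNode), '-', position())}"><xsl:value-of select="."/></w>
    </xsl:matching-substring>
    <xsl:non-matching-substring><xsl:value-of select="."/></xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>
```

The generated values depend on the processor and may change between runs. position() counts the non-matching substrings too, so the numbers are unique but not consecutive. If you need stable, readable IDs, numbering the <w> elements in a second pass may be a better fit.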

-James

Evaluate a string as an XPath

Looking at ways to process a suggested change in TEI P5, I wanted to test that there is a straightforward way to evaluate a string that exists in a document as if it was an XPath you had included in your document.

So say I have a made-up document in which I store, as bits of text, some XPaths relating to that very document.

Input

<?xml version="1.0" encoding="UTF-8"?>
<foo>
    <paths>
        <path>/foo/blort/wibble[1]</path>
        <path>/foo/blort/wibble[2]</path>
        <path>//*[@xml:id='wibNum2']/splat/@att</path>
    </paths>
    <blort>
        <wibble>test text 1</wibble>
        <wibble>Another wibble </wibble>
        <wibble xml:id="wibNum2">This is <splat att="value1">a
            test</splat></wibble>
    </blort>
</foo>

To grab these and evaluate them as XPaths you unfortunately need to use a Saxon extension function, saxon:evaluate(). For example, in this stylesheet:

XSLT

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0" xmlns:saxon="http://saxon.sf.net/"
    exclude-result-prefixes="#all">
    <xsl:output indent="yes"/>

    <xsl:template match="/foo">
        <foo>
            <xsl:for-each select="paths/path">
                <out>
                    <xsl:value-of select="saxon:evaluate(.)"/>
                </out>
            </xsl:for-each>
        </foo>
    </xsl:template>
</xsl:stylesheet>

This should produce the output:

Output

<?xml version="1.0" encoding="UTF-8"?>
<foo>
  <out>test text 1</out>
  <out>Another wibble </out>
  <out>value1</out>
</foo>

This relies on the saxon:evaluate() extension function. Similar extensions exist in a variety of other implementations, for XSLT 1.0 as well.
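If an XSLT 3.0 processor is available, the standard xsl:evaluate instruction may remove the need for an extension. What follows is only a sketch under that assumption: xsl:evaluate is an optional feature in XSLT 3.0, so whether it is supported depends on the processor and edition you use. The variable name $result is made up for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="3.0">
    <xsl:output indent="yes"/>

    <xsl:template match="/foo">
        <foo>
            <xsl:for-each select="paths/path">
                <out>
                    <!-- Evaluate the text of each <path> as an XPath expression.
                         context-item="/" supplies the document node, so that the
                         absolute paths stored in the input have something to start from. -->
                    <xsl:variable name="result" as="item()*">
                        <xsl:evaluate xpath="string(.)" context-item="/"/>
                    </xsl:variable>
                    <xsl:value-of select="$result"/>
                </out>
            </xsl:for-each>
        </foo>
    </xsl:template>
</xsl:stylesheet>
```

Unlike saxon:evaluate(), xsl:evaluate has no context item unless you supply one, which is why context-item is set explicitly here.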

-James