@rend and the war on text-bearing attributes

I was recently discussing the fact that the TEI attribute @rend from att.global, although it lets you type just about anything into it, doesn’t actually allow anything more than a set of single tokens. I explained to John, Paul, George, or Ringo (can’t remember which) that this doesn’t really mean that spaces are allowed, simply that whitespace is the delimiter in the attribute value.

The definition of @rend is “(rendition) indicates how the element in question was rendered or presented in the source text”, but encoders very often use it instead to tell their processing how they want the output to appear. The remarks on the values allowed for the attribute say:

may contain any number of tokens, each of which may contain letters, punctuation marks, or symbols, but not word-separating characters.

The point here is the ‘word-separating characters’ part. So although you can say <hi rend="It looks a bit like that other one">text</hi>, this actually has 8 tokens: “It”, “looks”, “a”, “bit”, “like”, “that”, “other”, “one”. Sometimes people stick CSS or CSS-like rendition information into @rend and so have values like “text-align: right”. I would say that is wrong, or at least that it says there are two classifications applicable to the element’s rendition in the source material: one that it is “text-align:” and another that it is “right”. Of course they could solve this just by deleting the space: “text-align:right” would be better, or even “text-align:right; font-size:large;” if you wanted to add another token. However, even better would be to use @rendition to point to the @xml:id of at least one <rendition> element in the header. This allows you to specify exactly what scheme you are using (e.g. CSS) and to give multiple statements for one classification.
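To make the token counting concrete, here is a minimal Python sketch. The function name rend_tokens is my own invention, and Python’s split() treats a slightly wider set of characters as whitespace than XML does, so take it as an illustration rather than a conformant implementation.

```python
def rend_tokens(value):
    """Split an attribute value into its whitespace-delimited tokens."""
    # split() with no arguments splits on runs of whitespace and
    # discards leading/trailing whitespace, so no empty tokens appear.
    return value.split()


print(len(rend_tokens("It looks a bit like that other one")))  # 8
print(rend_tokens("text-align: right"))   # ['text-align:', 'right']
print(rend_tokens("text-align:right"))    # ['text-align:right']
print(rend_tokens("text-align:right; font-size:large;"))
# ['text-align:right;', 'font-size:large;']
```

The second call shows the problem: the space after the colon turns one intended CSS declaration into two unrelated tokens.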

Why does this matter, you might ask? Well, of course, it doesn’t really: they are all magic tokens of one sort or another, to be interpreted (or not) by your processing for whatever reason you are undertaking the encoding. The <rendition> method is simply the most detailed way of documenting precisely how you are interpreting the rendition in the original document.
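As a rough sketch of what the <rendition> approach looks like to a processor, the following Python resolves @rendition pointers against the <rendition> elements in the header. The sample document is a stripped-down fragment that I made up for illustration (it is not a valid TEI file, and the id right-large and the function name resolve_renditions are hypothetical).

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical, heavily abbreviated fragment: a real header needs more.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <encodingDesc>
      <tagsDecl>
        <rendition xml:id="right-large" scheme="css">text-align: right; font-size: large;</rendition>
      </tagsDecl>
    </encodingDesc>
  </teiHeader>
  <text><body><p><hi rendition="#right-large">text</hi></p></body></text>
</TEI>"""


def resolve_renditions(root):
    """Return (element name, pointer, (scheme, statements)) for each pointer."""
    # Index every <rendition> by its xml:id.
    decls = {}
    for rend in root.iter("{%s}rendition" % TEI_NS):
        decls[rend.get(XML_ID)] = (rend.get("scheme"), (rend.text or "").strip())

    resolved = []
    for el in root.iter():
        pointers = el.get("rendition")
        if pointers is None:
            continue
        # @rendition is itself whitespace-delimited: one pointer per token.
        for pointer in pointers.split():
            target = pointer[1:] if pointer.startswith("#") else pointer
            local_name = el.tag.split("}")[1]
            resolved.append((local_name, pointer, decls.get(target)))
    return resolved


for item in resolve_renditions(ET.fromstring(SAMPLE)):
    print(item)
```

Note that the spaces inside the <rendition> element are harmless, because there the CSS is element content rather than an attribute value.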

However, the reason it matters to me is that there are NO attributes in the TEI which allow free-text.

By that I mean that all attributes are assigned to one datatype or another, and in none of them can you just type sentences of prose and have it be semantically meaningful. This is a result of the long War on Text-Bearing Attributes that was waged in the run-up to the first release of TEI P5. One of its many principles was that any bit of free text might need to use a non-Unicode character. Since the TEI’s method for documenting non-Unicode characters is its <g> element, and you can’t use an element inside an attribute value, you couldn’t have free-text attributes. This is the reason for the creation of many new child elements like <desc>, which are intended to contain free text concerning the nature of the element that contains them.

In the case of the @rend attribute, the value may be anything from one to an unlimited number of occurrences of the data.word datatype. This datatype, even in P5 1.0.0, “defines the range of attribute values expressed as a single word or token.” Thus when people put space-separated strings into it, they are really putting in multiple tokens. The War on Text-Bearing Attributes attempted to limit the places where people were able to do this, through the use of datatypes and the removal of free text in attribute values.
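A single token can be checked in the same spirit. This sketch assumes the character classes from the remark quoted earlier (letters, punctuation marks, symbols); allowing digits as well is my own assumption, and the function name is_word_token is hypothetical rather than anything defined by the TEI.

```python
import unicodedata


def is_word_token(token):
    """True if token is non-empty and has no word-separating characters."""
    # Unicode general categories: L* letters, P* punctuation, S* symbols.
    # N* (numbers) is included as an assumption, not taken from the remark.
    return bool(token) and all(
        unicodedata.category(ch)[0] in "LNPS" for ch in token
    )


print(is_word_token("text-align:right"))   # True
print(is_word_token("text-align: right"))  # False: the space is a separator
print(is_word_token(""))                   # False: at least one character
```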

This helps to highlight the difference between syntactic and semantic validity. Just because your document validates against a schema does not mean that it is semantically valid. You can put the text of a title inside an <author> element and vice versa, and there is no way your schema can know that you have done this.

So really, I’ve posted this post so I can point to it later when people ask me about spaces in @rend and similar datatype kerfuffles.