Metamark Wrapping: putting brackets around spanning markup

A project PI recently asked me the following XSLT question:

In my TEI I have the following markup:


A bunch of text and other various markup

and then a

<metamark rend="red" spanTo="#meta-111"/>more text
here<anchor xml:id="meta-111"/> and then

the text continues, etc.

How do I wrap a bracket around the <metamark> to <anchor> pairing, and moreover how do I make it red?

That is, I want: … [more text here] … in my HTML output. Help!

 

This is actually fairly easy to do in XSLT if the encoding is consistent and you are only using metamark in this way (otherwise you should use the other attributes on metamark to delineate its function as well).

In this case the secret is not to think about wrapping the metamark/anchor pairing in square brackets, but to provide a starting square bracket and an ending square bracket using two separate templates. We treat them as separate actions rather than trying to link them in any complicated way. (That is possible but much more difficult.)


<xsl:template match="metamark
    [contains(@rend, 'red')]
    [substring-after(@spanTo, '#') = following::anchor/@xml:id]">
  <span class="metamark makeRed">[</span>
</xsl:template>

<xsl:template match="anchor
    [@xml:id]
    [contains(preceding::metamark/@spanTo, @xml:id)]">
  <span class="metamark makeRed">]</span>
</xsl:template>

In the first XSLT template we only match those metamarks where:

  • the @rend attribute contains ‘red’ and
  • the @spanTo attribute, once we have removed the ‘#’ on the front, equals an @xml:id on an anchor element somewhere following it. (It doesn’t necessarily have to be the next one; we just need to know one exists.)

Then on the second template we match any anchor where:

  • there is an @xml:id and
  • the @xml:id attribute on this anchor is pointed at by a metamark/@spanTo attribute that precedes it somewhere

We don’t need to have any real correspondence or connection between the two templates, and if either of them accidentally fired on other metamarks or anchors, we could put in additional tests.

In both cases we put out an HTML span element with a bracket in it and give this span two classes, ‘metamark’ and ‘makeRed’, to enable the project to control the styling of the metamark display and to colour things red, e.g.:


.makeRed{color:red;}
.metamark{font-size:80%; font-weight:bold;}
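Given the sample encoding above, the relevant fragment of the HTML output then comes out something like:

and then a <span class="metamark makeRed">[</span>more text
here<span class="metamark makeRed">]</span> and then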

 

This is fairly straightforward; the only conceptual hurdle is that those used to XML structures often think about wrapping an element around something, rather than just giving its starting and ending points in the output.

[UPDATE: As Torsten notes below, the reason I don’t need to provide the namespace for these TEI elements is that I have, sensibly, used @xpath-default-namespace and an @xmlns on the <xsl:stylesheet> element, which you don’t see in these extracts.]


 

Teaching for DEMM: Digital Editing of Medieval Manuscripts

 

This is the second year that, as part of my commitment to DiXiT, I have also taught on the Erasmus+ Digital Editing of Medieval Manuscripts network. Digital Editing of Medieval Manuscripts (DEMM) is a joint training programme between Charles University in Prague, Queen Mary University of London, the Ecole des Hautes Etudes en Sciences Sociales, the University of Siena, and the library of the Klosterneuburg Monastery. It equips advanced MA and PhD students in medieval studies with the necessary skills to edit medieval texts and work in a digital environment. This is done through a year-long programme on editing medieval manuscripts and their online publication: a rigorous introduction to medieval manuscripts and their analysis is accompanied by formal training in ICT and project management. The end of each one-year programme sees the students initiated into practical work experience alongside developers, as they work on their own digital editions, leading to their online publication.

Funded by the Strategic Partnership strand of the European Union’s Erasmus+ Programme, DEMM will run for three consecutive years, always with a new group of students. It will lead to the publication, in print and online, of teaching materials, as well as a sandbox of editions.

My institution is not directly involved in it (but there is overlap with DiXiT) and last year I taught and assisted at both the workshop in Lyon and the Hackathon in London. This year the students had a week’s introduction to Palaeography, Codicology and Philology at Stift Klosterneuburg in the autumn and then in March had a week’s workshop on encoding, tagging and publishing in Lyon.

Needless to say I was providing tuition on the Text Encoding Initiative. A full schedule, with links to my presentations (some of the others are behind a password-protected site), is available at:

http://www.digitalmanuscripts.eu/training-programme/digital-editing-2016/

This follows a fairly predictable pattern of introducing people to the concept of markup, the formal syntax of XML, and the vocabulary of the TEI. It then goes on to expand this with an introduction to the core elements, named entities, and, the following morning, TEI metadata. Here of course we also single out the elements for both manuscript description and transcription since that is key for those undertaking to build digital editions of medieval manuscripts. The course continued on to talk about critical apparatus, genetic editing, and publication / interrogation of your results.

What is the TEI? And Why Should I Care? (A brief introduction for classicists)

Recently I gave a lecture to those interested in Digital Classics at the University of Oxford as part of the Digital Classics Seminar Series, alongside people much more qualified to talk about Classics (digital or otherwise) than me. I’m not, nor have I ever been, nor will I ever be, a classicist. Ok, I did learn Classical Latin at one point but quickly replaced this with the much more complicated (though not necessarily more sophisticated) Medieval Latin as I did an MA and PhD in Medieval Studies. So I was understandably nervous speaking to a room full of classicists. Luckily I was talking about something I know fairly well, and only making reference to its use in Digital Classics. In this case the title of my talk was “What is the TEI? And Why Should I Care? (A brief introduction for classicists)”. There are versions of the talk online:

I needn’t have worried, of course; the audience was wonderfully attentive as I went through, at a fairly basic level, a brief introduction to:

  • Markup: I looked at the differences between Procedural, Presentational, and Descriptive Markup, and why one might want to annotate information in this way
  • XML: I quickly covered the basic descriptions of how XML is formatted and what its rules are; the power of deeply nesting annotation; and compared the pros and cons of XML vs Databases
  • TEI: I surveyed what the TEI is, what it is not, how it is customisable, and how it is developed and used.
  • EpiDoc: Lastly I discussed a vibrant TEI community of epigraphers and the EpiDoc TEI P5 customisation they have made. As someone only on the very edge of this Digital Classicist community I probably didn’t do it justice, but it is a very good example of people customising the TEI (as a pure subset), creating even more targeted resources that conform to the needs of their community.

I encourage people to go to the other Digital Classics Seminar Series lectures or follow them as they are live streamed that evening (or catch up afterwards). The live streams are advertised shortly before the talk at: http://users.ox.ac.uk/~corp1223/DigitalClassics.htm

Text Creation Partnership: Made for everyone

Oxford Text Archive TCP Catalogue

On 1 January 2015 the first phase of EEBO-TCP (Early English Books Online – Text Creation Partnership) transcribed books entered the public domain. They join those created by ECCO-TCP (Eighteenth Century Collections Online – Text Creation Partnership) and Evans-TCP (Evans Early American Imprints – Text Creation Partnership). The goal of the Text Creation Partnership is to create accurate XML/SGML encoded electronic text editions of early printed books. They transcribe and encode the page images of books from ProQuest’s Early English Books Online, Gale Cengage’s Eighteenth Century Collections Online, and Readex’s Evans Early American Imprints. The work the TCP does, and hence the resulting transcriptions that they create, are jointly funded and owned by more than 150 libraries worldwide. Eventually all of the TCP’s work will be placed into the public domain for anyone to use, and the release of Phase 1 of EEBO-TCP is a milestone in this process.

The TCP began in 1999 as a partnership among the libraries of the University of Michigan and the University of Oxford, ProQuest, and the Council on Library and Information Resources (CLIR). As and when TCP texts have entered into the public domain we have made them available at the Oxford Text Archive. This was already distributing the public domain copies of ECCO-TCP, and now adds phase one of EEBO-TCP and Evans-TCP to this collection. The hard work of managing the creation, encoding, checking, and provision of the texts has been done by the Bodleian Library at the University of Oxford and the University of Michigan Library, while the Academic IT group of IT Services at the University of Oxford has undertaken the task of bringing the encoding into full conformance with the Text Encoding Initiative P5 Guidelines and making the results available in various forms.

The Academic IT group of IT Services at the University of Oxford has made use of these texts for a number of projects and so wanted to make sure that the texts were easily available now that they have entered the public domain. To do so we have placed them in a special collection at the OTA which displays the metadata (stored in a PostgreSQL database) as a jQuery dataTable, enabling sorting and filtering by any aspect of it. This table currently lists 61315 texts, but this includes 28462 texts which are ‘restricted’. These are not in the public domain yet, but are available to those at the University of Oxford to use in the meantime. The remaining 32853 texts are freely available to the public. You can see only the free ones by filtering by ‘Free’ in the availability column. Each entry in the table provides basic metadata of the TCP ID, links, the title, availability, date, other IDs associated with the text, keyword terms TCP provided for it, and a rough page count. The links provided are to:

A lot of the work to make these texts available via the Oxford Text Archive, after they were created by the TCP, has been done by Sebastian Rahtz, Magdalena Turska, and James Cummings. The research support team at IT Services can be reached at: researchsupport@it.ox.ac.uk.  You can read more about TCP and EEBO at http://www.textcreationpartnership.org/tcp-eebo/ and http://www.bodleian.ox.ac.uk/eebotcp/.

Self Study (Part 7) Customising the TEI

This post is the seventh in a series of posts providing a reading course of the TEI Guidelines. It starts with

  1. a basic one on Introducing XML and Markup then
  2. on Introduction to the Text Encoding Initiative Guidelines then
  3. one on the TEI Default Text Structure then
  4. one on the TEI Core Elements then
  5. one looking at The TEI Header.
  6. and a sixth one on transcribing primary sources.

None of these are really complete in themselves and barely scratch the surface but are offered up as a help should people think them useful. This seventh post is looking at customising the TEI for your own uses.

The TEI has many different modules and lots of elements that you may or may not need for your project. One of the strongest aspects of the TEI Guidelines compared to other standards is that any project is able to constrain, customise and extend the Guidelines. One reason for customising the Guidelines is that most projects do not need the vast array of elements provided in the TEI Guidelines, and in order to reduce human error and speed up encoding, providing less choice is a good thing. The generalised Guidelines need to provide as much choice and flexibility as possible — in order to cope with the different needs of projects and intellectual methods to be captured — and yet I’d not be surprised if the consistency of a project is proportionally related to the amount it constrains that same flexibility.

Roma

The TEI Consortium provides a (quite dated) web interface to customise the TEI, which allows you to do some sorts of customisation. It is available at: http://www.tei-c.org/Roma/. You should explore it; it is fairly straightforward. I recommend doing the following:

  1. Visit http://www.tei-c.org/Roma/ and notice that you have various options on how to start your customisation, including being able to upload a customisation you had saved earlier.
  2. Choose the ‘Build Up’ method. This takes you to a screen which allows you to change some basic metadata about the customisation. If you change anything click ‘Save’.
  3. Click on the ‘Modules’ tab to see a list of modules on the left, which you can ‘add’ to the customisation, and a list of modules on the right which have already been added to your customisation. Notice that the core, tei, header, and textstructure modules are already selected.
  4. Add a few more modules, maybe manuscript description, names and dates, critical apparatus, and transcription of primary sources.
  5. Clicking on any individual module name in the customisation you are making takes you to the list of elements in that customisation. For example, click on ‘Core’.
  6. Clicking on ‘Core’ takes you to a list of the elements in the ‘Core’ module. You can choose to ‘Include’ or ‘Exclude’ elements from your customisation (by clicking the radio buttons for each element or clicking the ‘Include’ or ‘Exclude’ at the top to include/exclude all of the elements).
  7. Choose to exclude certain elements from the Core module. For example, you may wish to remove ‘analytic’, ‘biblStruct’, ‘binaryObject’, ‘imprint’, ‘monogr’, ‘series’ and then click ‘Save’ at the bottom of the screen.
  8. Choose the ‘Schema’ tab and look at the options for generating a schema. I recommend a Relax NG schema (compact or XML syntax). Choose one of these and click ‘Generate’ to create and download your schema.
  9. In an XML editor like oXygen you can associate a document with this schema that you’ve just generated. (Or take an existing TEI document associated with a schema and change the association to point to your new schema.) Hopefully this uses an xml-model processing instruction at the top of the file; there is an example of such a processing instruction below this list of steps. Maybe try this out! You should find that you are unable to use any of the elements you excluded!
  10. Back in Roma (you didn’t shut the browser down did you? If so you’d have to go do the above again!) you should be able to return to the ‘Modules’ tab, click on the ‘textstructure’ module, and then notice that where the ‘div’ element is listed on the far right-hand there is a ‘Change Attributes’ link. Click on it!
  11. This lists all the attributes available on div (sometimes provided directly on the element, sometimes by an attribute class it is a member of).
  12. Scroll down to the ‘type’ attribute and click on it. This takes you to some settings you can change about this attribute. Some of the things you can change include:
    • You can say that it is not optional (i.e. it is required)
    • You can say whether it is a closed list (whether the values you provide are the only ones)
    • You can provide a list of comma-separated values

    I suggest that you say that it is not optional, a closed list, and give “chapter,section,other” as values. Remember to click ‘Save’.

  13. You could now go back to the ‘Schema’ tab, generate and download a schema, and re-associate it in your document (your operating system will most likely name it something different if there is already a file there…another option is to move the previous schema out of the way).
  14. Something else you should do is click on the ‘Save Customization’ tab. This should download an XML file (‘myTEI.xml’ if you didn’t change the filename on the ‘Customize’ tab).
  15. Open that ‘myTEI.xml’ customisation in your XML editor and have a look at it. This records all of the details of your customisation.

It should look something like:

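What follows is a sketch of the sort of thing you should see, reconstructed from the choices made above (the exact attributes and ordering Roma emits will vary):

<schemaSpec ident="myTEI" start="TEI">
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="core"
    except="analytic biblStruct binaryObject imprint monogr series"/>
  <moduleRef key="textstructure"/>
  <moduleRef key="msdescription"/>
  <moduleRef key="namesdates"/>
  <moduleRef key="textcrit"/>
  <moduleRef key="transcr"/>
  <elementSpec ident="div" module="textstructure" mode="change">
    <attList>
      <attDef ident="type" mode="change" usage="req">
        <valList type="closed" mode="replace">
          <valItem ident="chapter"/>
          <valItem ident="section"/>
          <valItem ident="other"/>
        </valList>
      </attDef>
    </attList>
  </elementSpec>
</schemaSpec>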

Here a <schemaSpec> element contains <moduleRef> elements for each of the modules you included. In this case Roma defaulted to an ‘exclusion’ method of referencing the elements (so “give me all elements from the ‘core’ module except this list of elements”). Using the ‘include’ attribute could instead have had us give a list of specific elements to include. The difference between these is that if you save this customisation and come back to Roma at some point in the future, with the exclusion method you will get any new elements added by the TEI, whereas the inclusion method would never get any new elements. Both approaches have their uses. Below that you have documented a change to the <div> element (using the <elementSpec>) where the <attDef> element records that use of the attribute is required, and has a closed <valList> replacing the existing one.
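As for the schema association in step 9 above, oXygen records it as a processing instruction at the very top of the TEI document. Assuming you generated the Relax NG XML syntax schema and saved it as ‘myTEI.rng’ (the filename here is just an example), it looks like:

<?xml-model href="myTEI.rng" type="application/xml"
            schematypens="http://relaxng.org/ns/structure/1.0"?>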

TEI ODD

Your customisation is written using the TEI ODD language, a part of the TEI Guidelines for describing markup. This is ‘One Document Does-it-all’, named because from this you can generate project-specific documentation. (The ‘Documentation’ tab in Roma.)  There are elements in the TEI for referring to phrase-level discussion of markup (with the <gi>, <att>, and <val> elements) as well as ways to document the customisation or extension of a schema (e.g. the <schemaSpec>).
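For example, a sentence of project documentation written in ODD prose might be encoded like this (an invented sentence, purely for illustration):

<p>Each chapter of the novel is encoded as a <gi>div</gi> element
with a <att>type</att> attribute whose value is <val>chapter</val>.</p>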

Read more about these documentation elements at http://www.tei-c.org/release/doc/tei-p5-doc/en/html/TD.html and when you have done so, you should have no problem answering these questions (to make sure for yourself that you have read it):

  1. What is the difference between a <gi> element and a <tag>?
  2. What does the ‘atts’ attribute on <specDesc> record?
  3. What is the difference between <gloss>, <desc>, and <remarks>?
  4. How does one use the <equiv/> element?
  5. What is the difference between an <eg> and an <egXML>?
  6. What is a <content> element used for?
  7. Why might you want to use a <constraintSpec>?
  8. How do you provide a <gloss> for an <attDef>?
  9. What is a <classSpec> for?

There are many other important parts of this chapter, but if you understand the above that is a good start.

An example ODD showing some of the basic techniques (with a lot of documentation) from the LEAP project is available at: https://github.com/jamescummings/LEAP-ODD/blob/master/leap.odd.xml

Auto Update Your TEI Framework in oXygen

One of the great things about the oXygen XML Editor that I use is that it allows frameworks as add-ons (from version 14+, though actually for the TEI one you need 15.2+) for various document types. These can consist of template documents, XSLT files for transformations, CSS, and all manner of customisations to oXygen.

The TEI Consortium jointly maintains an open source and openly-licensed oxygen-tei framework at http://github.com/TEIC/oxygen-tei.

I’ve been asked a number of times to explain to someone how they can keep the TEI framework in their oXygen installation up to date automatically with releases of the TEI P5 Guidelines (and thus the underlying schema) as well as releases to the TEI-XSL Stylesheets.

The process for this isn’t entirely intuitive, but is not too difficult if you follow the steps below.

Add the oXygen-TEI Add-on

Go to Options/Preferences -> Add-ons and click ‘Add’

The updateSite.oxygen File

Add the URL http://www.tei-c.org/release/oxygen/updateSite.oxygen (this is a file which is updated every time there is a TEI Guidelines or Stylesheets release) and click ‘Ok’.


Automatic Updates

Back in the preferences window check ‘Enable automatic updates checking’ and click ‘OK’.

Check for Updates

Go to the ‘Help’ menu and select ‘Check for add-ons updates’

Update Available!

If there has been a TEI Guidelines or Stylesheets update since you last updated (or installed oXygen) then you should get prompted to install an update. Click ‘Review updates’.

Review Updates

Click the checkbox next to ‘TEI P5’ and then click ‘Install’.

Downloading…

The oXygen-TEI framework package will be downloaded, speed depending on your connection to the internet.

Install the Update

Once oXygen has downloaded the package you must tick the box to agree to all the license terms. (All TEI Consortium materials are dual-licensed as BSD 2-Clause and/or Creative Commons Attribution.) Accept, and then click ‘Continue’.

Warning: Valid Signatures

When you install the package you’ll be warned that it doesn’t have valid signatures. If you trust the TEI Consortium then you should click ‘Continue anyway’.

To Complete the Update: Restart oXygen

In order to have oXygen start using the new framework, you must restart the application.

Next Time

Next time there is a TEI Guidelines or TEI XSL Stylesheets release you will get prompted to install the updates. That is all there is to it!

ODDly Pragmatic: Documenting encoding practices in Digital Humanities projects

[This is the rough draft text of a plenary lecture I will have given at JADH2013 http://www.dh-jac.net/jadh2013/abst21.html#plenary2 (click to expand abstract). This isn’t necessarily the precise text I delivered but the notes from a couple of days before.

It is written very much to go with the slides, really a prezi, at: http://tinyurl.com/jc-JADH2013 and doesn’t really make lots of sense without it.  (I’m not claiming it makes lots of sense with it either!) It re-uses much material I’ve discussed and written about in other locations so I’m not making claims of originality either. Credit is due to everyone involved in the TEI, DH Projects mentioned, and many ideas from the DH community at large. All errors and misrepresentations are mine and unintentional, I apologise in advance. The intention is to superficially expose a slightly larger audience at JADH2013 to some of the concepts and benefits of TEI ODD Customisation.]

 The TEI

Use of the TEI Guidelines for Electronic Text Encoding and Interchange is often held up as the gold standard for Digital Humanities textual projects. These Guidelines describe a wide variety of methods for encoding digital text and in some cases there are multiple options for marking up the same kinds of thing. The TEI takes a generalistic approach to describing textual phenomena consistently across texts of different times, places, languages, genres, cultures, and physical manifestations, but it simultaneously recognises that there are distinct use cases or divergent theoretical traditions which sometimes necessitate fundamentally different underlying data models.  Unlike most standards, however, the TEI Guidelines are not a fixed entity as they give projects the ability to customise their use of the TEI — to constrain it by limiting the options available or extending it into areas the TEI has not yet dealt with. It is this act of customisation and the benefits of it that I will speak of today.

But what is the TEI?

The Text Encoding Initiative Consortium (TEI) is an international membership consortium whose community and elected representatives collectively develop and maintain the de facto standard for the representation of digital texts for research purposes. The main output of the community is the TEI Guidelines, which provide recommendations on encoding methods for the creation of digital texts. Generally the TEI is used by academic research projects in the humanities, social sciences, and linguistics, but also by publishers, libraries, museums, and individual scholars, for the creation of digital texts for research, teaching, and long-term preservation.

It is also a community of volunteers: institutions like the University of Oxford donate a fraction of staff time (like part of mine) towards the TEI, as do other institutions with elected volunteers or contributors working on research projects.

The TEI is also the outputs that it creates, such as the Guidelines themselves, definitions and examples of over 530 markup distinctions, and various transformation software to convert to and from the TEI. It is also a consensus-based way of structuring textual resources – it isn’t determined by the weight of a single institution or commercial company but by the Technical Council members elected by the membership. The TEI is a way to produce customised, internationalised schemas for validating a project’s digital texts. It is a format that allows you to document your interpretation and understanding of a text, but it is also a well-understood format suitable for long-term preservation in digital archives. But most of all, it is a community-driven standard, so it is a product of all of those involved in it.

What the TEI is not:

It isn’t the only standard in this area. It is the most popular but there are others, and people re-invent the wheel unnecessarily all the time. It isn’t objective or non-interpretative: the application of markup is an interpretative act that shouldn’t just be left to junior research assistants – it is the intellectual and editorial content of a digital text. The TEI isn’t used consistently in different projects, and often not even in the same project. (Which is why TEI customisation for consistency is an important form of documentation.) The TEI isn’t fixed and unchanging: unlike most standards, which are static, the TEI evolves as the community finds new and important textual distinctions. But customisation gives you a way to document precisely what version of the TEI you are using. It isn’t your research end-point: the creation of a collection of digital texts isn’t an end in itself — it is what you can then do with those texts, the research questions they can enable you to answer, that is important.

Nor is it automatic publication of your materials in a useful way. All of the off-the-shelf TEI publication systems will need customising to deal with the specific and interesting reasons you were encoding these texts in the first place. In general, though, experience teaches us that the benefits of a shared vocabulary far outweigh any difficulties in adoption of the TEI.

Generalistic Approach:

The TEI takes a generalistic approach to describing textual phenomena consistently across texts of different times, places, languages, genres, cultures, and physical manifestations, but it simultaneously recognises that there are distinct use cases or divergent theoretical traditions which sometimes necessitate fundamentally different underlying data models. The ability to customise the TEI scheme is something which sets it apart from other international standards. At first glance this may seem contradictory: how can one have a standard that any project is allowed to change? This is because the TEI’s approach to creation of this community-based standard is not to create a fixed entity, but to provide recommendations within a framework in which projects are able to extend or constrain the scheme itself. They can constrain it by limiting the options available to their project or extend it into areas not yet covered by the TEI.

It is nonsensical for a project to dismiss use of the TEI because it does not yet have elements specific to its needs as that project is able to extend it in that direction.

This combination of the generalistic nature and ability to customise the TEI Guidelines is both one of its greatest strengths as well as one of its greatest weaknesses: it makes it extremely flexible, but this can be a barrier to the seamless interchange of digital text from sources with different encoding practices. Any difficulty can be lessened by documentation through proper customisation.

TEI ODD Customisation

Every project using the TEI is dependent upon some form of customisation (even if it is the ‘tei_all’ customisation with everything in it, that the TEI provides as an example). The TEI has many elements covering textual distinctions from linguistics and the marking up of speech to transcribing or describing medieval manuscripts and more. The TEI organises all these elements into a wide array of modules. A module is simply a convenient way of grouping together a number of associated element declarations. Sometimes, as with the TEI’s Core module (containing the most common elements) these may be grouped together for practical reasons. However, it is more usual, for example with the ‘Dictionaries’ module, to group the elements together because they are all semantically-related to one particular sort of text or encoding need. As one would expect, an element can only appear in one module lest there be a conflict when modules are combined.

Almost every chapter of the TEI Guidelines has a corresponding module of elements. In the underlying TEI ODD customisation language both the prose of that chapter of the Guidelines and the specifications for all the elements are stored in one file. It is from this file that both the TEI documentation and the element relationships that are used to generate a schema are created. So there is a chapter on dictionaries and it also creates the module for dictionaries.

The TEI method of customisation is written in a TEI format called ‘ODD’, or ‘One Document Does-it-all’, because from this one source we can generate multiple outputs such as schemas, localised encoding documentation, and internationalised reference pages in different languages. A TEI ODD file is a method of documenting a project’s variance from any particular release of the full TEI Guidelines. The TEI provides a number of methods for users to undertake customisation ranging from intuitive web-based interfaces to authoring TEI ODD files directly. These allow users to remove unwanted modules, classes, elements, and attributes from their schema and redefine how any of those work, or indeed add new ones. One of the benefits of doing this through a meta-schema language like TEI ODD is that these customisations are documented in a machine-processable format which indicates precisely which version of the TEI Guidelines the project was using and how it differed from the full Guidelines. This same format is what underlies the TEI’s own steps towards internationalisation of the TEI Guidelines into a variety of languages (including Japanese).

This concept of customisation originates from a fundamental difference between the TEI and other standards — it does not try to tell users that if they want to be good TEI citizens they must do something this one way and only that way, but instead while making recommendations it gives projects a framework by which they can do whatever it is that they need to do but document it in a (machine-processable) form that the TEI understands. This is standardisation by not saying “Do what I do” but instead by saying “Do what you need to do but tell me about it in a language I understand”.

The result of a customisation might be only to include certain modules, and by doing so lessen the amount of choice available when using a generated schema to encode a digital text. But of course, even inside these modules there will be elements that your project does not need.

ROMA

We do not necessarily need to learn the underlying TEI ODD format to create our customisation. The TEI community provides various tools to do this, such as ‘Roma’ which is a basic web interface for creating customisations. It gives you a way to build up from the most minimal schema, reduce down from the largest possible one, use one of the existing templates, use one of the common TEI example customisations, or upload a customisation that you had saved previously.

And of course the TEI strongly believes in internationalisation, so wherever we can get volunteers to translate the website and the descriptions of elements into their own languages, we can incorporate that into the interface. What’s more, this means that the schemas you generate can have glosses and tooltips in your XML editing software that come up in that particular language.

On the ‘Modules’ tab we see a list of all of the modules and it is an easy thing to click ‘Add’ on the lefthand side and the modules will then be included in our schema and appear on the list on the righthand side. Removing them is just as easy.

Clicking on any of the modules enables us to include or exclude those elements we want from the schema we are building.

But what is happening underneath? In this case we’re generating a TEI ODD XML file which stores the changes we have made. We document that we want to include these modules, but also that we want to delete these elements, or in the case of the last one an attribute on an element. Back in the web interface we could look at the attributes for the <div> element and choose to include or exclude those that we want.

And for each of those attributes, here the @type attribute, we could choose whether it was required or not, whether its list of values was closed or open, and what those values might be.

Again, underneath this is XML that documents how we are changing the TEI schema. Here making the @type attribute required, and giving it a closed value list of prose, verse, drama, and other.

But there are limitations to this web interface: for example it currently doesn’t allow you to provide a description to each of these values (the <desc> element here). There is no reason it shouldn’t, just that the creators haven’t had time or money to improve the software in the last few years. The TEI Council is actively looking at ways to encourage the community to create newer ODD editors. From our customisation we can generate a variety of documentation and this documentation will be localised, meaning that your changes will be reflected in it, as well as internationalised in that it will use your choice of language where it can. One of the great things about TEI ODD files is that you can also include as much prose as you want describing your project’s encoding practice. And, of course, you can also generate a variety of schema languages to validate your documents. The TEI tends to recommend Relax NG as its preferred format. And although you can generate DTDs from it, this is now a dated document validation format that I would not recommend.

One of the interesting recent developments is that a user can now ‘chain’ customisations together. Their TEI ODD file points at an existing one as a source and so on. This means that if there is an existing customisation that you like (for example like the EpiDoc customisation for classical epigraphy), then a project can point at that to use it as a starting point, and add to it, but regenerate their schemas with new additions any time the original source has changed.

Such documentation of variance of practice and encoding methods enables real, though necessarily mediated, interchange between complicated textual resources. Moreover, with time a collection of these meta-schema documentation files helps to record the changing assumptions and concerns of digital humanities projects.

OxGarage

OxGarage is the web front end to a set of conversion scripts the TEI provides to convert to and from TEI. They are really easy to use: you choose what type of input document you have, and if a conversion pipeline exists from that format to any other format, then you can choose that as an output format. Once you’ve chosen the output you can convert to it, or there are all sorts of advanced options for handling things like embedded images. One of the benefits of this freely available tool is that it is a web service, and so you can build it into other platforms. For example, when the Roma tool we saw converted to HTML documentation, or indeed to the Relax NG schema, behind the scenes it was sending it to this OxGarage web service to do the conversion.

The Stationers’ Register Online

The Stationers’ Register Online project is a good example of how TEI ODD customisation can save a project money and further its research aims. This project received minimal institutional funding from the University of Oxford’s Lyell Research Fund to transcribe and digitize the first four volumes of Arber’s edition of the Register of the Stationers’ Company. The Register is one of the most important sources for the study of British book history after the books themselves, being the method by which the ownership of texts was claimed, argued, and controlled between 1577 and 1924. This register survives intact in two series of volumes which are now at the National Archives and the Stationers’ Hall itself. The pilot SRO project has created full-text transcriptions of Edward Arber’s 1894 edition of the earliest volumes of the Register (1557–1640) and the Eyre, Rivington, and Plomer 1914 edition (1640–1708). It has also estimated the costs involved in the proofing and correction of the resulting transcription against the manuscript originals, as well as potential costs of transcription of the later series from both manuscript and printed sources.

A typical entry lists the members of the company registering the book, to ensure their right to print it, the name of the author, and title of the book. There is also an amount shown which is the cost of registering it. In this case the book is the Comedies, Histories, and Tragedies, of one Mr William Shakespeare. As Edward Arber’s nineteenth-century edition of the Stationers’ Register existed as a source, it was decided that this was a much better starting point for the pilot than the manuscript materials themselves. In the earlier volumes the register is also used as a general accounts book for the Stationers’ Company, but over time evolves into a more or less formulaic set of entries following a fairly predictable format.

Although hard to see in this low-res scan, even in the nineteenth century Arber recognized the potential usefulness of markup and thus marked particular features of the Register surprisingly consistently in the volumes he edited. The encoding tools at his disposal, however, were only page layout and choice of fonts. The ‘nineteenth-century XML’, as the presentational markup he chose was termed within the project, was used to indicate basic semantic data categories. For Members of the Stationers’ Company Arber uses a different font, Clarendon. Other names are in roman small capitals, but the names of authors are in italic capitals.

Arber’s extremely consistent use of this presentational markup, and the subsequent encoding of it by the data keying company, meant that the project could generate much of the descriptive markup itself. If this presentational markup had not existed then a pilot project (with very minimal funding) to produce a digital textual dataset would not have been possible. As with all TEI customisations, this was done with a TEI ODD file. This TEI ODD file used the technique of inclusion rather than exclusion (that is, it said which elements were allowed instead of taking all of them but deleting the ones it did not want). What this meant was that when the project regenerated its schemas or documentation using the TEI Consortium’s freely available services, only the originally requested elements were included, and new elements that had been added to the TEI since the project created the ODD would be excluded.

The Bodleian Library’s relationship with a number of keying companies meant that the SRO project was able to find one willing to encode the texts in XML to any documented schema. And indeed, very importantly, this particular keying company charged for their work by kilobyte of output. Owing to this, the project realised that it would save money if it could create a byte-reduced schema which resulted in files of smaller size. Our ODD customisation replaced the long, human-readable names of elements, attributes, and their values with highly abbreviated forms.

For example, the <div> element became <d>, the @type attribute became @t, and the allowed values for @t were tightly controlled. This meant that what might be expanded as <div type="entry"> (24 characters with its closing tag) was coded as <d t="e"> (13 characters). The creation of such a schema was intended solely to reduce the number of characters used in the resulting edited transcription, as an intermediate step in the project’s workflow — document instances matching this schema are not public, since it is the expanded version that is more useful. This sacrificed the extremely laudable aims of human-readable XML and replaced them with cost-efficient brevity. Because of this compression of elements we called our customisation tei_corset.
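In ODD terms a renaming like this is done with <altIdent>; here is a rough sketch of what one of the tei_corset element specifications must have looked like, reconstructed from the description above rather than copied from the project’s actual file:

<elementSpec ident="div" module="textstructure" mode="change">
  <!-- keep the TEI semantics of <div>, but call it <d> in the schema -->
  <altIdent>d</altIdent>
  <attList>
    <attDef ident="type" mode="change">
      <!-- likewise rename @type to @t and close its value list -->
      <altIdent>t</altIdent>
      <valList type="closed" mode="replace">
        <valItem ident="e"><gloss>entry</gloss></valItem>
        <!-- further abbreviated values would follow -->
      </valList>
    </attDef>
  </attList>
</elementSpec>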

This sort of literate programming becomes fairly straightforward once one is used to the concept. However, there is an important additional step here, which is the use of the <equiv> element. This informs any software processing this TEI ODD that a filter for this element exists in a file called ‘corset-acdc.xsl’ which would revert to, or further document or process, an equivalent notation. In this case a template in that XSLT file transforms any <ls> element back into a <list> element. In addition to renaming the @type attribute to @t, some of the other element customisations constrain the values that it is able to contain. For example, in the <n> element (which is a renamed TEI <name> element) the @t attribute has a closed value list enabling only the values of ‘per’ (personal name), ‘pla’ (place name), and ‘oth’ (other name). In most cases though the names are documented by Arber using his presentational markup, and this is captured with the @rend attribute (or its renamed version as @r).
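A reverting template of that sort is close to a one-liner; something like this (a sketch only, ignoring the namespace handling and attribute renaming that the real corset-acdc.xsl also has to deal with):

<xsl:template match="ls">
  <!-- restore the full TEI name, passing attributes and content through -->
  <list>
    <xsl:apply-templates select="@* | node()"/>
  </list>
</xsl:template>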

As with many TEI customisations designed solely for internal workflows, the tei_corset schema is not in fact TEI Conformant. The popular TEI mass digitisation schema tei_tite has the same non-conformancy issues. Both of these schemas make changes which fly in the face of the TEI Abstract Model as expressed in the TEI Guidelines. The tei_corset schema, in addition to temporarily renaming the <TEI> element as <file>, changes the content model of the <teiHeader> element beyond recognition.

This bit of the customisation documents the renaming of the <teiHeader> element to <header>, which compared to other abbreviations is quite long, but it was only used once per file so there was less pressure to abbreviate it heavily. The @type attribute is deleted and, more importantly, the entire content model is fully replaced. This uses embedded Relax NG schema language to say that a <title> element (which is later renamed to <t>) is all that is required, but can have zero or more members of the model.pLike class after it. This enabled the keying company to put a basic title for the file (to say what volume it was), but gave them nothing but some paragraphs as a place to note any problems or questions they had. Usually TEI documents have more metadata, but this is unproblematic because these headers were replaced with more detailed ones at a later stage in the project data workflow. Other changes meant that elements that were usually empty would be (temporarily) allowed text inside. In the process of up-converting the resulting XML, these were replaced with the correct TEI structures. In this customisation of the TEI <gap> element, in addition to allowing text, the locally-defined attributes @agent, @hand, and @reason are removed.
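Again as a rough sketch, assembled from the description above rather than taken from the project’s ODD, that header customisation would look something like:

<elementSpec ident="teiHeader" module="header" mode="change">
  <altIdent>header</altIdent>
  <content>
    <rng:group xmlns:rng="http://relaxng.org/ns/structure/1.0">
      <!-- a title, then zero or more paragraph-like elements -->
      <rng:ref name="title"/>
      <rng:zeroOrMore>
        <rng:ref name="model.pLike"/>
      </rng:zeroOrMore>
    </rng:group>
  </content>
  <attList>
    <attDef ident="type" mode="delete"/>
  </attList>
</elementSpec>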

In a full tei_all schema the <gap> element would have the possibility of many more attributes, but these are provided by its claiming membership in particular TEI attribute classes. For the tei_corset schema many TEI classes were simply deleted which meant that the elements that were claiming membership in these classes no longer received these attributes.

The result of the customisation is a highly abbreviated, and barely human-readable, form of TEI-inspired XML. For example here we have an <n> element marking ‘Master William Shakespeers’ with the forename and surname marked with ‘fn’ and ‘sn’. The conversion of this back to being a <persName> element with <forename> and <surname> is very trivial renaming in XSLT.

Passing a couple of centuries’ worth of records through the transformation results in much more verbose markup.

But it isn’t just simple renaming that we undertook in reverting this highly compressed markup to a fuller form; there is more detailed up-conversion as well. Such entries contain fees paid, and they are almost always aligned to the right margin by Arber and recorded in roman numerals. The keying company was asked to mark these fees (the <num> element having been renamed to <nm>) and to use the @r attribute to indicate their formatting of ‘ar rm’ (aligned to the right and roman numerals). The benefit to the project of them doing this is that it meant that the SRO project could up-convert this simple number into a more complex markup for the fee.

The up-conversion I wrote here isn’t simply to revert numbers back to the correct TEI markup, but to turn them into even better markup by deriving information from the textual string that is encoded. The tokenization of the provided amounts into pounds, shillings and pence, and consistent encoding of the unit indicator as superscript, are key parts of this. Arber’s edition provided all the markers of pounds/shillings/pence as superscript, so the keying company was not asked to provide it, as the project realised this could be done automatically after the fact and would save even more characters. I also converted the roman numerals to ‘arabic’ numbers so that easy calculations of total amount of pence (for comparative purposes) could be provided. To do this, the XSLT stylesheet converted the keyed text string back into pure TEI and simultaneously broke up the string based on whether it ended with a sign for pounds, shillings, pence, or half-pence. An additional XSLT function converted the roman numerals in-between these to arabic, and then to pence so that the individual and aggregate amounts could be stored. The markup that results provides significantly more detail than the original input.
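The roman-to-arabic part of such a conversion is the kind of small recursive function XSLT 2.0 handles well. This is not the project’s actual function, just a sketch of the idea; it assumes a ‘local’ prefix bound to some project namespace on the stylesheet, assumes well-formed numerals, and normalises the early modern final ‘j’ (as in ‘vj’) to ‘i’:

<xsl:function name="local:fromRoman" as="xs:integer">
  <xsl:param name="numeral" as="xs:string"/>
  <xsl:variable name="n" select="translate(upper-case($numeral), 'J', 'I')"/>
  <xsl:choose>
    <xsl:when test="$n = ''"><xsl:sequence select="0"/></xsl:when>
    <xsl:when test="starts-with($n, 'M')"><xsl:sequence select="1000 + local:fromRoman(substring($n, 2))"/></xsl:when>
    <xsl:when test="starts-with($n, 'CM')"><xsl:sequence select="900 + local:fromRoman(substring($n, 3))"/></xsl:when>
    <xsl:when test="starts-with($n, 'D')"><xsl:sequence select="500 + local:fromRoman(substring($n, 2))"/></xsl:when>
    <xsl:when test="starts-with($n, 'CD')"><xsl:sequence select="400 + local:fromRoman(substring($n, 3))"/></xsl:when>
    <xsl:when test="starts-with($n, 'C')"><xsl:sequence select="100 + local:fromRoman(substring($n, 2))"/></xsl:when>
    <xsl:when test="starts-with($n, 'XC')"><xsl:sequence select="90 + local:fromRoman(substring($n, 3))"/></xsl:when>
    <xsl:when test="starts-with($n, 'L')"><xsl:sequence select="50 + local:fromRoman(substring($n, 2))"/></xsl:when>
    <xsl:when test="starts-with($n, 'XL')"><xsl:sequence select="40 + local:fromRoman(substring($n, 3))"/></xsl:when>
    <xsl:when test="starts-with($n, 'X')"><xsl:sequence select="10 + local:fromRoman(substring($n, 2))"/></xsl:when>
    <xsl:when test="starts-with($n, 'IX')"><xsl:sequence select="9 + local:fromRoman(substring($n, 3))"/></xsl:when>
    <xsl:when test="starts-with($n, 'V')"><xsl:sequence select="5 + local:fromRoman(substring($n, 2))"/></xsl:when>
    <xsl:when test="starts-with($n, 'IV')"><xsl:sequence select="4 + local:fromRoman(substring($n, 3))"/></xsl:when>
    <!-- anything left is assumed to be 'I' -->
    <xsl:otherwise><xsl:sequence select="1 + local:fromRoman(substring($n, 2))"/></xsl:otherwise>
  </xsl:choose>
</xsl:function>

So local:fromRoman('xlvj') returns 46, and converting shillings and pounds to pence for the aggregate amounts is then simple arithmetic.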

The benefit of this customisation was based entirely on the keying company both using whatever XML schema we gave them, and charging per kilobyte of output. Originally we’d calculated that by having them use this schema rather than full TEI we were saving around 40%. In the end, if we include the up-converted information as well, this rises to a 60% saving. The extra money we had left meant that we were able to include the 1640-1708 material as well even though it had been out of scope for the original project.

The Godwin Diary project

The Godwin Diary project was funded by the Leverhulme Trust to digitise and do a full-text edition of the 48 years of William Godwin’s diary. William Godwin (1756–1836) was a philosopher, writer, and political activist. He is perhaps most commonly known as the husband of Mary Wollstonecraft and the father of Mary Wollstonecraft Shelley, the author of Frankenstein. Godwin faithfully kept a diary from 1788 until his death in 1836; the diary is now preserved in the Abinger collection in the Bodleian Library. It is an extremely detailed resource of great importance to researchers in fields such as history, politics, literature, and women’s studies. The concise diary entries consist of notes of who Godwin ate with or met with, his own reading and writing, and major events of the day. The diary gives us a glimpse into this turbulent period of radical intellectualism and politics, and many of the most important figures of this time feature in its pages, including Samuel Coleridge, Richard Sheridan, Mary Wollstonecraft, William Hazlitt, Charles Lamb, Mary Robinson, and Thomas Holcroft, among many others.

The project team was small, consisting mostly of Mark Philp and David O’Shaugnessy and a couple of their students in the politics department. It is important to note that it was the politics department, since it was less Godwin’s life as a literary figure than the social network of relationships which concerned the project.

The Bodleian has provided hi-res images of the diary, and done so under an open license that has already significantly benefited research in this area. In providing the technical support to the project, it is important to note that I gave them only 2 days of technical training. Partly this is a benefit of the TEI ODD customisation; they didn’t have to learn the entirety of the TEI, only the bits they were using. I provided this training, created the TEI ODD customisation, developed the website and was also a source of general technical support during the life of the project.

However, even with basic training they were able to mark up the 48 years of the diary, categorise every meal, meeting, event, text mentioned, and person named. In addition they identified more than 50,000 of the ~64,000 name instances recorded in the diary and linked these to additional prosopographical information.

Godwin’s diaries are simultaneously immensely detailed (recording the names of almost everyone he ever met with) and frustratingly concise (he only rarely gives details of what they talked about). Godwin’s diary is quite neatly written and easy to read. The dates, here in a much lighter ink, are usually given (and given correctly) and generally a day’s entry forms the basic structural unit of the diary. In only a very few instances do the notes from one day stray into the page area already pre-ruled for the following day. Occasionally there are marginal notes to provide more information, but in most cases the textual phenomena are quite predictable – mostly substitutions and interlinear additions. In many ways the hierarchical nature of a calendrical diary entry makes it ideal for encoding in XML.

There is some indication that Godwin may have returned to certain volumes at a later date to rewrite or correct them. And yet, it is certainly impressive that there are entries for most days, and that whatever minimal information is given, the names of those attending the frequent meetings Godwin had with those in his circle are recorded. The majority of his diary entries could be broken down into several categories and sub-types. These include his meals, who he shared them with, who he met, very rarely what they talked about, and what works he was reading or writing at that time. The political historians, it is easy to understand, were eager to use the resource to explore which individuals might be meeting with which other friends of Godwin’s at specific times. Meanwhile those exploring Godwin’s writings might be interested in knowing what works he was reading when he was writing specific parts of some of his works.

But that is enough about Godwin, back to the project itself. Of course having the hi-res images means that I included a typical pan/zoom interface, here built on top of Google Maps, to show each page of the diary. Two links are important to notice on this screenshot though: one is the link to the creative commons ‘full image’. There is no barrier to getting the full image, no one that researchers need to ask; they can just download it. The same is true for all the underlying XML. The other link is a direct link to the diary text for this page. This means that one can browse the diaries based on their physical manifestation, as a series of images, and jump to the text at any point. Or one can read the transcribed text and jump to the image for that page. The project specifically asked for there not to be a side-by-side facing image/text view because they wanted to preserve the distinction between these two experiences of reading the text.

The customised TEI ODD in the case of the Godwin project wasn’t made to create highly abbreviated element names for some keying company. Instead it was to create aliases for elements to give those encoding the diary a small and easy set of elements through which to categorise the parts of a diary entry in terms that made sense to them.

So there were element specifications created for divisions that renamed them to be diary year, month, and day. There were specialised elements to mark segments of text, really re-namings of the TEI seg element, for those portions of diary entries for meals, meetings, events, and more, all with specific names that made sense to the project.

For example, the element specification shown here creates a new element called ‘dMeal’, which is a diary-entry meal. There is an <equiv> element pointing back to an XSLT file which can revert this to pure TEI.

There is a description of the new element, and some information about what classes it is a member of and what is allowed inside it. There is a locally-defined @type attribute which has been made required, and has a list of values for each type of meal, but also indicates whether the person was dining at Godwin’s place or whether he was visiting them.
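Pulled together, such an element specification would look very roughly like this (every name, value, namespace, and the filter file below is illustrative rather than taken from the project’s actual ODD):

<elementSpec ident="dMeal" mode="add" ns="http://www.example.org/godwin/ns">
  <!-- hypothetical filter file, on the corset-acdc.xsl pattern above -->
  <equiv filter="godwin-acdc.xsl"/>
  <gloss>diary meal</gloss>
  <desc>marks the part of a day's entry recording a meal and its participants</desc>
  <classes>
    <memberOf key="model.segLike"/>
  </classes>
  <content>
    <rng:ref xmlns:rng="http://relaxng.org/ns/structure/1.0" name="macro.paraContent"/>
  </content>
  <attList>
    <attDef ident="type" usage="req">
      <valList type="closed">
        <valItem ident="supAt"><desc>Godwin sups at someone else's</desc></valItem>
        <valItem ident="supWith"><desc>someone sups at Godwin's</desc></valItem>
        <!-- and so on for the other meals -->
      </valList>
    </attDef>
  </attList>
</elementSpec>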

As with the Stationers’ Register project markup, this was easily converted back to pure TEI P5 XML. You can see some of the @type attribute values preserve the original name of the customised markup. Once restored, this dMeal element becomes a <seg type="dMeal">.

In this case it is a supper, where Godwin has sup’ed at his friends the Lambs’ with a variety of other people. While at the meal he has had a short little side meeting with H Robinson.

The structure of the diary is also quite straightforward. As you can see each month has an @xml:id attribute which gives its year and month, each day has precisely the same thing, but with the day. These were required by the ODD customisation, and moreover, the schema requires that each day entry have a date element with a @when attribute encoded in it. This means that in creating the processing for the diary entries I could be sure that each diary entry would have a day, and each month a clearly understandable ID and so creating transformations of this which produce the website by each year, month, or day becomes very straightforward.
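In pure-TEI terms the skeleton of a year therefore looks something like this (the ids and @type values here are invented for illustration; note too that an @xml:id must not begin with a digit, so some prefix is needed):

<div type="dYear" xml:id="y1788">
  <div type="dMonth" xml:id="m1788-04">
    <div type="dDay" xml:id="d1788-04-06">
      <date when="1788-04-06"/>
      <p>
        <seg type="dMeal">sup at <persName>Lamb</persName>'s</seg>
      </p>
    </div>
  </div>
</div>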

The changes to the TEI scheme, in renaming elements this time not for brevity but for simplicity, meant that the project’s ability to mark up the documents in XML increased dramatically. The other changes, such as requiring a date element with a @when attribute, meant that the processing of the documents was even easier. In short, the customisation made both my life and the encoders’ lives easier.

In the resulting webpages, one can toggle on or off a variety of formatting for indicating all the categories of information they recorded: people, places, meals, meetings, reading, writing, topics mentioned, and events. The general website is clear, cleanly minimalistic, and intuitive, with a calendar for each year one is looking at, and anything that can be a link has been turned into one. But one of the great strengths of the website is the amount of work they have put into the marking of all those people’s names. Because they have done that, we can pull out dataTables of information about the people: birth date, death date, gender, occupation, how many times they are mentioned in all of the diary volumes, and whether this was when they were acting as a venue (Godwin visits them) or were listed by Godwin as ‘not-at-home’.

For each person we produce a prosopographical page listing biographical details, editorial notes, a bibliography of works, and a generated graph showing when and how much they are mentioned in the diary. Of course, each of these references links back to that diary entry for a very circular navigation through the resource.

Extracting information from the diary was the reason the project team put so much effort into adding this encoding to the XML files. This means that we’re able to extract this information for any of the categories that they marked and each of the sub-types within that. In this case one of the subtypes of events was ‘theatre’, used to note when he went to the theatre and, if known, which theatre he was going to. With this data available in the eXist XML database that powers the resource, it is then easy to pull out all of the trips to the theatre, which theatre, and show the event usually containing the title of the play he went to. The website does this for every single category and sub-type of information they marked, so researchers can indeed compare how many times he ate supper with someone at his house compared to how many times he ate supper at their house. (If they really want!)

EEBO-TCP

Another benefit of documenting local encoding practice is for the legacy data migration of document instances in the future. Even the conversion of closely related documents, such as those from the Early English Books Online – Text Creation Partnership, into pure TEI P5 XML can be an onerous task. We recently converted the more than 40,000 texts of the EEBO-TCP corpus to TEI P5 XML. As the first phase of these will become public domain in 2015, we are testing and improving our conversions so that we can do fun things like create ePubs and read these early printed books on our iPads and phones.

The EEBO-TCP markup was based on TEI P3 but then evolved separately as it encountered problems the TEI had not yet dealt with; however, these changes were not documented in a TEI extension or customisation file. In converting the texts to TEI P5 we used the TEI ODD customisation language to understand and record the variations between EEBO-TCP markup and modern TEI P5. One proven approach to comparing texts is to define their formats in an objective meta-schema language such as TEI ODD: doing so exposes the precise variation between the categories of markup used and, more importantly, provides it in a machine-processable form. As part of the conversion process we looked at the markup before and after conversion, and thus at the frequency of particular elements. The resulting markup has almost 40 million instances of highlighting, but that is because rendering is one of the basic things the TCP project captured.
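Counting element frequencies is itself a one-template job in XSLT 2.0; something along these lines (a sketch, not our actual audit code):

<!-- print each element name with its frequency, most frequent first -->
<xsl:template match="/">
  <xsl:for-each-group select="//*" group-by="local-name()">
    <xsl:sort select="count(current-group())" order="descending"/>
    <xsl:value-of select="current-grouping-key(), count(current-group())"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:for-each-group>
  <!-- and the overall census: how many distinct elements are in use -->
  <xsl:value-of select="count(distinct-values(//*/local-name()))"/>
</xsl:template>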

Most of the highest-frequency elements are structural in nature. Remember how the Stationers’ Register project limited its schema to a tiny 34 elements? In all of EEBO-TCP only 78 distinct elements are used across the entire corpus. This reflects the TCP encoding guidelines’ focus on capturing basic structural and rendering markup. There are very few interoperability problems between EEBO-TCP texts, as their markup is fairly consistent and basic. What is interesting about the newly converted EEBO-TCP files is that, now that we are able to convert them, they are becoming the source for further research: projects can take our TEI P5 XML files and add more markup to document the aspects of the texts that they are interested in.

Three EEBO-TCP Projects

Very briefly I’d like to mention three projects which have benefited from these conversions of EEBO-TCP materials, each of which I could go into more detail about at another time.

The first, Verse Miscellanies Online, recently went online at the Bodleian. We took the converted EEBO-TCP texts, and researchers from another university edited them, providing information about genre, rhyme scheme, and editorial notes for each of the poems. They also glossed any unfamiliar words and provided pop-up regularisations for others. From these enhanced texts we built them a website for the teaching and reading of the eight verse miscellanies they encoded during the project.

Similarly, in the second project, Poetic Forms Online, researchers took the TEI P5 conversions of the EEBO-TCP texts that we supplied and provided highly detailed metrical analysis: they counted syllables and marked the type and location of all rhyme words, as well as a regularisation of their rhyme sounds. From these enhanced texts we built them a faceted searchable website covering all of these categories, which they plan to expand by adding more texts as time goes on.

The Holinshed Project was slightly different, being one of the earlier conversions of EEBO-TCP material that we did. In this case there are two editions of a very large text, Holinshed’s Chronicles of England, Scotland, and Ireland, one published in 1577 and the other in 1587. The academics in question were writing a secondary guide to this huge work and wanted a way of following where paragraphs in one edition had been fragmented and moved around in the creation of the second. Sometimes whole sections had been moved; sometimes parts of paragraphs had been moved around and mixed with others. In this case we converted the texts to TEI P5 and then designed a fuzzy string-comparison system (sketched below) to find the most probable matches and record their paragraph ID numbers. We then built a website where the researchers could confirm that these were indeed the correct matches, before using the resulting links to generate a website in which a reader could jump from any paragraph to the corresponding paragraph in the other edition and see how the social changes during Queen Elizabeth’s reign had affected the topics, especially religious topics, in the chronicle. All of these projects have benefited from our ongoing work to improve the transformations of EEBO-TCP to TEI P5, which is itself dependent on the TEI ODD customisation language.
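The fuzzy matching system itself is not reproduced here, but the core idea can be sketched as a Jaccard word-overlap score in XSLT 2.0. The real system added weighting and, as described above, manual confirmation; local: here is an arbitrary function namespace and $ed1587 an assumed global variable holding the 1587 edition document:

<!-- crude similarity: shared distinct words divided by total distinct words -->
<xsl:function name="local:similarity" as="xs:double">
  <xsl:param name="a" as="xs:string"/>
  <xsl:param name="b" as="xs:string"/>
  <xsl:variable name="ta" select="distinct-values(tokenize(lower-case($a), '\W+')[. ne ''])"/>
  <xsl:variable name="tb" select="distinct-values(tokenize(lower-case($b), '\W+')[. ne ''])"/>
  <xsl:variable name="union" select="count(distinct-values(($ta, $tb)))"/>
  <xsl:sequence select="if ($union eq 0) then 0 else count($ta[. = $tb]) div $union"/>
</xsl:function>

<!-- for each 1577 paragraph, record the ID of the best-scoring 1587 paragraph -->
<xsl:template match="p">
  <xsl:variable name="here" select="string(.)"/>
  <xsl:for-each select="$ed1587//p">
    <xsl:sort select="local:similarity($here, string(.))" data-type="number" order="descending"/>
    <xsl:if test="position() = 1">
      <link target="#{@xml:id}"/>
    </xsl:if>
  </xsl:for-each>
</xsl:template>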

The Unmediated Interoperability Fantasy

One of the misconceptions about the TEI, and indeed about any sufficiently complex data format, is that once one uses it, interoperability problems simply vanish. This is usually not the case. Following the recommendations of the TEI Guidelines does, without question, aid the process of interchange, especially when there is a fully documented TEI ODD customisation file. However, interchange is not, and should not be confused with, true interoperability.

I would argue that being able to seamlessly integrate highly complex and changing digital structures from a variety of heterogeneous sources through interoperable methods, without either significant conditions or intermediary agents, is a deluded fantasy. In particular, this is not and should not be the goal of the TEI. And yet, when this is not provided as an off-the-shelf solution, some blame the format rather than their own use of it. The TEI instead provides a framework for the documentation and simplification of the process of the interchange of texts. This is a good thing, and a much better goal for the TEI. If digital resources do seamlessly and unproblematically interoperate with no careful or considered effort, then one or more of the following holds:

  • the initial data structures are trivial, limited, or of only structural granularity;
  • the method of interoperation or combined processing is superficial;
  • there has been a loss of intellectual content; or
  • the results gained by the interoperation are not significant.

It should be emphasised that this is not a terrible thing, nor a failing of digital humanities or of any particular data format; instead, it is truly an opportunity. The necessary mediation, investigation, transformation, exploration, analysis, and systems design are the interesting and important heart of digital humanities.

Open Data

While proper customisation of the TEI and open standards generally are a good start, what still isn’t happening as much as it should is the open release of the underlying data. All projects, especially publicly funded ones, need to release their data openly, but they also need centralised institutional support to enable them to do so. If other people can’t see your data, they can’t re-use it or test it, and then there is little benefit to the world in having made it.

I don’t know the situation here in Japan, but in the UK and the USA it is certainly the case that funding bodies are increasingly requiring data to be open.

I leave you with the final thought that the “coolest thing to be done with your data will be thought of by someone else”.