Until December 2009 I worked on the ENRICH project, and as it has now finished, I thought that I should reflect on some of what the project has done and the aspects we’ve been involved with here in Oxford. For the most part the project has been attempting both to aggregate manuscript descriptions into the Manuscriptorium framework and to standardise these manuscript descriptions to a single, common, agreed format. For the background to the ENRICH project, see the website, and especially this article on the ENRICH Project and TEI P5. A list of deliverables is also available.
Standardisation of Specification
The workpackage we were most involved with, partly because we were leading it, was workpackage 3 whose object was:
To ensure interoperability of the metadata used to describe all the shared resources by analysing the various standards used by different partners and ensuring their mapping to a single common format, which will be expressed in a way conformant with current standards.
As one might expect, in practice this common format was a more tightly constrained subset of the TEI recommendations on Manuscript Description. The difficulty in any such endeavour is getting coherent agreement between a large number of representatives on a wide variety of customisations. As part of this process we undertook a comparison of the MASTER, TEI P5, and Manuscriptorium formats. A number of revisions were made to the ENRICH schema over the course of the project. Deliverable D3.1 was a “Revised TEI-Conformant specification” available in a number of schema languages. The ENRICH Schema is publicly and freely available as DTD, RELAX NG, and W3C Schema, but we recommend the RELAX NG format.
The next deliverable, D3.2, was “Documentation and training materials for use with the ENRICH Specification”. Because the TEI ODD had been written with documentation in it, the same TEI ODD which generated the schemas above could also be used to generate project-specific documentation. In addition to the documentation written specifically for the ENRICH project, we therefore had access to all the internationalised reference material available in the TEI Guidelines as a whole. This meant that we could produce versions of the documentation which, while still primarily in English, contained glosses of the elements in another language. So for example:
<msIdentifier> (manuscript identifier) contains the information required to identify the manuscript being described.
in the English documentation for the ENRICH Specification became, in the French:
<msIdentifier> (identifiant du manuscrit) Contient les informations requises pour identifier le manuscrit en cours de description.
While this is admittedly of limited benefit, since the bulk of the documentation remains in English, having the element descriptions in their own language can aid comprehension for those reading in a foreign language. The ENRICH Specification documentation is available in the following languages and formats:
- English, HTML
- English, PDF
- French glosses, HTML
- French glosses, PDF
- Spanish glosses, HTML
- Spanish glosses, PDF
- Italian glosses, HTML
- Italian glosses, PDF
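To picture how one ODD can yield documentation with glosses in different languages, here is a minimal Python sketch. It assumes a simplified, un-namespaced `<elementSpec>` fragment carrying `<gloss>` and `<desc>` children tagged with `xml:lang`; the fragment and the helper names `localized` and `summary` are made up for illustration and are not the actual TEI stylesheets that produced the ENRICH documentation. The English and French strings are the ones quoted above.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Simplified, hypothetical ODD fragment (no TEI namespace, two languages only).
SPEC = """
<elementSpec ident="msIdentifier">
  <gloss xml:lang="en">manuscript identifier</gloss>
  <gloss xml:lang="fr">identifiant du manuscrit</gloss>
  <desc xml:lang="en">contains the information required to identify the manuscript being described.</desc>
  <desc xml:lang="fr">Contient les informations requises pour identifier le manuscrit en cours de description.</desc>
</elementSpec>
"""


def localized(spec, child, lang, fallback="en"):
    """Return the text of <child> in `lang`, falling back to English."""
    texts = {el.get(XML_LANG): el.text for el in spec.findall(child)}
    return texts.get(lang, texts[fallback])


def summary(spec, lang):
    """Build a one-line reference entry for the element in the given language."""
    ident = spec.get("ident")
    return f"<{ident}> ({localized(spec, 'gloss', lang)}) {localized(spec, 'desc', lang)}"


spec = ET.fromstring(SPEC)
print(summary(spec, "en"))
print(summary(spec, "fr"))
```

Asking for a language that has no translation (Italian, in this toy fragment) simply falls back to the English text, which is roughly the behaviour one wants when only some of the reference material has been translated.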
Training materials were also created as part of D3.2 and took the form of slide sets as PDF, HTML, and TEI XML that project partners were free to take, modify, and use in teaching the ENRICH schema:
- What is XML markup for? (PDF; also HTML and XML source)
- Live long and prosper! Lessons from the TEI (PDF; also HTML and XML source)
- Using the basic TEI structural elements (PDF; also HTML and XML source)
- Names, People, and Places (PDF; also HTML and XML source)
- Handling primary sources in TEI XML (PDF; also HTML and XML source)
- booklet with all the above
While the primary migration tools from other formats to the ENRICH Specification were developed by the lead technical partner, we were tasked with undertaking a case-study-based analysis of the construction of migration tools and making recommendations to the project based on it. The migration case studies focussed on MASTER records that we had accumulated as a testbed and EAD records given to us by the Bodleian Library. The Case Studies on Migration to the ENRICH Specification and all their materials are freely available online. The case studies examined methods for transforming MASTER and EAD records to TEI P5, mainly using XSLT-based conversions. The report on the Development and Validation of Migration Tools is available online.
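The case studies themselves used XSLT, but the flavour of a MASTER to TEI P5 rule can be sketched in a few lines of Python. The sketch below is only an illustration, not one of the project's migration tools: it moves un-namespaced elements into the TEI P5 namespace, turns `id` attributes into `xml:id`, and applies a one-entry rename table (`msDescription` to `msDesc`). The sample record and the function name `to_tei_p5` are invented for the example, and a real migration needs far more rules than this.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Illustrative rename table only; a real conversion has many more entries.
RENAMES = {"msDescription": "msDesc"}


def to_tei_p5(elem):
    """Recursively copy an un-namespaced element into the TEI P5 namespace."""
    name = RENAMES.get(elem.tag, elem.tag)
    attrs = dict(elem.attrib)
    if "id" in attrs:
        # P5 uses xml:id where the older records used a plain id attribute.
        attrs[XML_ID] = attrs.pop("id")
    new = ET.Element(f"{{{TEI_NS}}}{name}", attrs)
    new.text, new.tail = elem.text, elem.tail
    for child in elem:
        new.append(to_tei_p5(child))
    return new


# Hypothetical legacy record, made up for this example.
legacy = ET.fromstring(
    '<msDescription id="ms1"><msIdentifier>'
    "<settlement>Oxford</settlement>"
    "</msIdentifier></msDescription>"
)
converted = to_tei_p5(legacy)

ET.register_namespace("", TEI_NS)  # serialise TEI as the default namespace
print(ET.tostring(converted, encoding="unicode"))
```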
ENRICH Garage Engine
Originally D3.4 of the ENRICH Project was a “Report on METS/TEI interoperability, best practice with respect to handling of Unicode and non-Unicode data in Manuscriptorium and P5 conversion techniques”. However, after much investigation it was determined that the use of METS was unnecessary for our extension to the Manuscriptorium platform. (This is not to say that it would not have been suitable for this or other uses.)
Part 1 of D3.4, and some of the work on it, was replaced by the development of the ENRICH Garage Engine (EGE) and a report on the Documentation and Use of the ENRICH Garage Engine. The EGE is a primarily web-service-based format conversion engine, developed by PSNC, which converts documents between a number of formats. The engine has a web service and a website frontend, and underneath these it consists of a recognizer, a validator, and a converter. As the EGE website explains:
- Recognizer – this plug-in is responsible for the recognition of the Internet Media Type (MIME type) of the given input data. For example, it will receive the input data and state that the input data has text/xml MIME type. The recognized data may then be further validated to check the format of the data.
- Validator – this plug-in is responsible for validation of the input data. For example it may be used to validate the ENRICH TEI P5 data stored in a MIME type (e.g. text/xml) either received from end user or created by one of the converters. The following notation is assumed: ENRICH TEI P5 (text/xml) – it means that validator is able to validate ENRICH TEI P5 format encoded in text/xml.
- Converter – this plug-in is responsible for converting the input data. It may be, for example, conversion from XML to Word, conversion from Word to PDF, conversion of the XML from one form to another (e.g. MASTER -> ENRICH TEI P5) or even cleaning the input data (e.g. removing redundant information).
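The division of labour between the three plug-in types can be sketched as a tiny pipeline. This is a toy model in Python and makes no claim about the EGE's real interfaces: the names `DataType`, `recognize`, `VALIDATORS`, `CONVERTERS`, and `run` are all invented here, the "validator" only checks the root element name, and the "converter" only renames it.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataType:
    """A format name paired with its MIME type, e.g. ENRICH TEI P5 (text/xml)."""
    fmt: str
    mime: str


MASTER = DataType("MASTER", "text/xml")
ENRICH = DataType("ENRICH TEI P5", "text/xml")


def recognize(data: bytes) -> Optional[str]:
    """Recognizer: report the MIME type of the input (only XML is known here)."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return None
    return "text/xml"


def is_enrich(data: bytes) -> bool:
    """Toy validator: a real one would validate against the ENRICH schema."""
    return ET.fromstring(data).tag == "msDesc"


def master_to_enrich(data: bytes) -> bytes:
    """Toy converter: renames the root element and nothing else."""
    root = ET.fromstring(data)
    if root.tag == "msDescription":
        root.tag = "msDesc"
    return ET.tostring(root)


VALIDATORS = {ENRICH: is_enrich}
CONVERTERS = {(MASTER, ENRICH): master_to_enrich}


def run(data: bytes, source: DataType, target: DataType) -> bytes:
    """Recognise the input, convert it, then validate the converter's output."""
    if recognize(data) != source.mime:
        raise ValueError(f"input is not {source.mime}")
    output = CONVERTERS[(source, target)](data)
    if not VALIDATORS[target](output):
        raise ValueError(f"output is not valid {target.fmt}")
    return output


result = run(b"<msDescription/>", MASTER, ENRICH)
print(result.decode("ascii"))
```

Keying the converters on a (source, target) pair is one simple way to let new conversions be registered without touching the pipeline itself.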
You can try the EGE at its website.
ENRICH gBank and Non-Unicode Characters
One problem encountered in the migration of legacy documents to the ENRICH Specification is that these records may use characters which are not currently present in Unicode. The Medieval Unicode Font Initiative (MUFI) campaigns for the inclusion of some of these specialized characters in the Unicode Standard. The second half of the D3.4 deliverable we produced was a report on Best practice in handling non-Unicode characters. This included the description of a software tool, the ENRICH gBank, produced to assist in the normalization and documentation of non-Unicode characters. The gBank contains a list of all of the MUFI non-Unicode characters in the Private Use Area (PUA), images of them, and a representation of each using a TEI <char> element. For the most part these were automatically generated from the MUFI specification. The conversion involved exporting the Adobe InDesign file as RTF, converting this to basic presentational TEI XML, and running a transformation script on the result to extract just the data we needed for our own tables. In addition, the PUA references were used, in conjunction with the Andron Scriptor Web font, to produce first SVG files (using Apache Batik) and then PNG files of specific sizes from these. This gave us an image for each of the characters in the PUA.
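As a rough illustration of what documenting such a character involves, the Python sketch below checks whether a character falls in a Private Use Area and builds a minimal TEI-style `<char>` description for it. The layout (a `<charName>` plus a `<mapping type="PUA">`) follows the general pattern in the TEI Guidelines as I understand it, but the function names, the identifier scheme, and the placeholder character name are assumptions made for the example, not gBank output or real MUFI data.

```python
import unicodedata
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def is_private_use(ch: str) -> bool:
    """True if the character has Unicode general category Co (private use)."""
    return unicodedata.category(ch) == "Co"


def char_element(codepoint: int, name: str) -> ET.Element:
    """Build a minimal <char> description for a private-use code point."""
    if not is_private_use(chr(codepoint)):
        raise ValueError(f"U+{codepoint:04X} is not a private-use code point")
    char = ET.Element("char", {XML_ID: f"U{codepoint:04X}"})  # hypothetical id scheme
    ET.SubElement(char, "charName").text = name
    mapping = ET.SubElement(char, "mapping", {"type": "PUA"})
    mapping.text = f"U+{codepoint:04X}"
    return char


# U+E000 is the first code point of the PUA; the name is a placeholder.
example = char_element(0xE000, "EXAMPLE PRIVATE USE CHARACTER")
print(ET.tostring(example, encoding="unicode"))
```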
You can see the ENRICH gBank on the ENRICH beta website.
As part of the ENRICH teaching materials we also created some ENRICH templates, to assist those who wanted a guide as to the kind of material that should be present in an ENRICH manuscript description.
- A CSS file for manuscript descriptions (e.g. for use in oXygen’s author mode)
- A basic ENRICH template file for manuscript descriptions (view source)
- A detailed ENRICH template file for manuscript descriptions (view source)
A number of projects have taken these templates as starting points to customise further in their own use of the ENRICH Specification or TEI P5 msDesc.
Working for any large and dispersed EU project always has its benefits and drawbacks. In the case of ENRICH we were able to draw on a wide range of experience, technologies and data because of the diverse nature of the project. One of the major drawbacks stems from being partnered with commercial organisations. While all the work they did in their development and support of the Manuscriptorium platform was top notch, they naturally have the commercial interests of their business model at the forefront of their activities. This meant, for example, that while the ENRICH Specification and all the software, documentation, training materials and tools that we (OUCS) produced were licensed under an open licence, the same was not true of the work of the main commercial company behind Manuscriptorium. The platform itself is not open source, and at no point were we able to see the workings of the platform or contribute patches or bug fixes to it. This meant that all of our development took place in isolation and at arm’s length.
Fair enough, the EU (via its eContent+ programme) funded this project with the understanding, presumably, that this would be the case. However, I feel that it is wrong for the EU to fund projects with commercial partners where those partners are not required to release the products of the funded work under an open licence of some sort. I’m not in any way against these commercial companies, but there are plenty of workable business models which enable them still to profit from materials they have developed and released under an open licence.
The ENRICH project has produced a lot that is good and interesting, and one of its major achievements is the network of individuals, projects, and institutions which are all approaching medieval manuscript description in the same manner. Although ENRICH (as a schema or project) is certainly not the last word in large-scale projects for the aggregation and standardization of medieval manuscript descriptions, it is a good development and milestone along that road.
List of Deliverable Reports
- D 3.1 — Revised ENRICH TEI P5 Specification
- D 3.2 — Documentation and training materials for use with ENRICH Specification
- D 3.3 — The development and validation of migration tools
- D 3.4 — 1: Documentation and use of the ENRICH Garage Engine and 2: Best practice in handling of Unicode and non-Unicode data