Author: David J. Birnbaum (djbpitt@gmail.com) Maintained by: David J. Birnbaum (djbpitt@gmail.com) Last modified: 2023-10-21T07:21:36+0000
Developed by David J. Birnbaum, Michel de Dobbeleer, Alexandre Popowycz, and Lara Sels
This document is a guide for encoding the Bdinski sbornik in XML, which is subsequently transformed to HTML for publication at http://bdinski.obdurodon.org. The starting point of the digital edition is the 1973 typeset edition, which we digitized through optical character recognition. Each textual unit (text) is encoded as a separate XML file; for example, the raw XML-encoded version of the first (Abraham) text is available at http://bdinski.obdurodon.org/abraham.xml. If you just click on this link, some browsers will download the file, some will display it with markup, and some will display just the textual content, without markup. If what you see is not what you want (browsers typically reformat the white space in XML files for rendering, which misrepresents the contents), you can download the file instead of just clicking on the link, whereupon you can open it in the application of your choice. Project development is performed with the help of the <oXygen/> XML editor and integrated development environment.
The structure of a file for the project (which elements are permitted to occur where) is described by a Relax NG schema (bdinski.rnc) and a Schematron schema (bdinski.sch), which can be linked to the file itself in <oXygen/>. We use both schemas because they validate different aspects of the structure. There are two reasons to link the schemas to the file: it lets <oXygen/> validate the document against both schemas, and it lets <oXygen/> provide command completion.
To attach a schema to a file, click on Document in the menu bar above the <oXygen/> editing window, then on Schema, and then on Associate schema. To the right of the text-entry box labeled URL is a small icon shaped like a folder. Click on that and navigate to the main Bdinski sbornik Dropbox folder, which is called bdinski-sbornik. Inside that folder, select the file called bdinski.rnc, ensure that Use relative paths is checked, and click OK. (You don’t have to specify the schema type; <oXygen/> will figure that out on its own.) Then do the same with bdinski.sch. For each schema you add, <oXygen/> will insert a line near the top of the file you are editing, and with both schema links in place the top of your file will look something like:
<?xml-model href="../bdinski.rnc" type="application/relax-ng-compact-syntax"?>
<?xml-model href="../bdinski.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
You have to do this only once for each file you edit. Once the lines have been inserted by <oXygen/> into the top of your document, <oXygen/> will know to use the two schemas to validate the document and to provide command completion.
The Bdinski sbornik markup is based primarily on the physical layout of the manuscript as a series of folios, each of which contains a series of lines. We always encode entire folios, which means that if a text begins in the middle of one folio and ends in the middle of a different folio, we encode both folios in their entirety, including the lines that occur before the beginning of the text we are editing and those that occur after the end of that text.
Every line in the manuscript is encoded as a <line> element, with a <line> start tag and a </line> end tag, e.g.:
<line>двєрца мала посрѣдѣ єю.</line>
If a word is divided across a line break, the editors should add a hyphen at the end of the first of the two lines, corresponding to modern hyphenation conventions, e.g.:
<line>же єи седмию лѣть. ѡ-</line>
<line>н же повелѣ еи быти вь</line>
The beginning of each folio is marked as an empty (self-closing) <folio> element, which has an @n attribute that consists of a sequence of digits followed by the letter r for recto or v for verso. The following example shows an abbreviated version of how the encoding of folio 22r and folio 22v might look:
<folio n="22r"/>
<line>first line of 22r</line>
<line>second line of 22r</line>
<!-- more lines -->
<line>last line of 22r</line>
<folio n="22v"/>
<line>first line of 22v</line>
<line>second line of 22v</line>
<!-- more lines -->
<line>last line of 22v</line>
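The shape of the @n attribute described above can be spot-checked outside <oXygen/>. The following is only a minimal sketch in Python, not part of the project's tooling: the function name bad_folio_numbers and the sample string are hypothetical, and the project's real validation is done by bdinski.rnc and bdinski.sch.

```python
import re
import xml.etree.ElementTree as ET

# Pattern taken from the prose above: digits followed by r (recto) or v (verso).
FOLIO_N = re.compile(r"[0-9]+[rv]")

def bad_folio_numbers(xml_text):
    """Return the @n values of <folio> elements that do not match the pattern.

    A missing @n attribute is reported as None.
    """
    root = ET.fromstring(xml_text)
    return [
        folio.get("n")
        for folio in root.iter("folio")
        if not FOLIO_N.fullmatch(folio.get("n") or "")
    ]

# Hypothetical sample: the second <folio> lacks the r/v suffix.
sample = '<root><folio n="22r"/><line>a</line><folio n="22"/><line>b</line></root>'
print(bad_folio_numbers(sample))
```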
Superscription is encoded with the <sup> element, e.g.:
<line>вьнѣшнѣи хизинѣ. сам<sup>ъ</sup></line>
Omega with superscript t is rendered not as the single Unicode U+047f (ѿ), but as omega followed by t wrapped in <sup> tags, e.g.:
<line>сль. и ѡ<sup>т</sup>врьзши двєрце</line>
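If OCR output or pasted text contains the precomposed character, the replacement can be done mechanically. This is a hedged sketch rather than a project script: the function name expand_ot is hypothetical, and it handles only the lowercase character U+047F.

```python
OT_LIGATURE = "\u047f"                   # ѿ, the single precomposed character
OT_ENCODED = "\u0461<sup>\u0442</sup>"   # ѡ followed by т wrapped in <sup> tags

def expand_ot(text):
    """Replace every lowercase ѿ with omega plus superscript t markup."""
    return text.replace(OT_LIGATURE, OT_ENCODED)

print(expand_ot("и \u047fврьзши двєрце"))
```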
Word division is entirely editorial, which is to say that the editors insert spaces between all words, regardless of spacing in the manuscript. The particles се (reflexive) and же are independent words, and are preceded by spaces. This is true even when they are superscripted, so ѡн ж (with a space before the superscript ж) rather than ѡнж (with the superscript ж immediately adjacent to the preceding н).
Text in red should be tagged as <red>. Where an entire line is in red, the <red> tags should go inside the <line> tags, e.g., 1r2 should be encoded as:
<line><red>женаго аврамиа. како п-</red></line>
Note that <red> tags, like <line> tags, may capture only the beginning or end of a word.
The 1973 edition expands abbreviations, wrapping the inserted letters in parentheses. We remove those, restoring a titlo or superscript letter plus pokrytie if there was one in the manuscript.
Some of the text that would have occurred on folios now missing from the manuscript has been restored in the 1973 edition on the basis of other manuscripts. For our purposes, we keep that text, without dividing it into lines, but surround it in <lacuna> tags.
Errors corrected in the manuscript by a scribe (original or subsequent) are encoded by creating a <subst> (= substitution) element. <subst> must have exactly two children, one instance of <del> (which contains the original reading that was subsequently corrected by the scribe) followed by one instance of <add> (which contains the text inserted by the scribe as a correction).
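The two-children rule can also be checked with a few lines of code. The sketch below is an illustration only, under the assumption that a file has already been read into a string; malformed_substs is a hypothetical name, and the check looks only at child elements, not at stray text inside <subst>.

```python
import xml.etree.ElementTree as ET

def malformed_substs(xml_text):
    """Return the child-tag lists of <subst> elements that are not [del, add]."""
    root = ET.fromstring(xml_text)
    bad = []
    for subst in root.iter("subst"):
        children = [child.tag for child in subst]
        if children != ["del", "add"]:
            bad.append(children)
    return bad

# Hypothetical sample: the second <subst> has no <del> child.
sample = (
    "<line>x<subst><del>a</del><add>b</add></subst>"
    "y<subst><add>b</add></subst></line>"
)
print(malformed_substs(sample))
```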
If the original reading is illegible because of the correction, it may be rendered as a <gap/> element (with an optional @extent attribute to indicate the number of characters, if the editor can discern that with reasonable confidence) inside the <del> element. For example, the replacement of one illegible character with two at the end of 39v16 should be encoded as:
ꙩбраз<subst><del><gap extent="1" reason="overwritten"/></del><add>ѡм҇</add></subst>
If the text is legible but unclear, it may be wrapped instead in <unclear> tags inside the <del> element.
In case of corrections that do not involve the complete deletion of an initial value, the <del> element should carry the attribute @status with the value partial. For example, in 57r3 the scribe began to write л and then corrected himself to и, and we write:
пом<subst><del status="partial">л</del><add>и</add></subst>шлꙗѥ
This markup is borrowed from the TEI P5 guidelines. The notion of partial deletion reflects the TEI interpretation that <del> does not have to mean complete deletion (in the case of the example above, the deletion is conceptual, rather than graphic). Note that the TEI <corr> element should not be used to encode corrections that can be read in the manuscript, that is, that were created by a scribe. <corr> is to be used only for corrections inserted by the modern editors, and at this stage in our development the editors of this project are not encoding any new corrections.
Text that has been erased and not replaced but that is still legible should be tagged as <del>. If the deleted text is not legible but it is clear that text was deleted, the <del> element should still be used, but it should contain only an empty <gap reason="erased"/> element, with an optional @extent attribute to indicate the number of erased letters, if the editor can determine that with reasonable confidence. For example, the five-character erasure at 40v8 should be transcribed as:
<line>вь доомь<del><gap reason="erased" extent="5"/></del> нисифоровь.</line>
Do not use <gap/> by itself for this purpose; wrapping it in <del> is what makes explicit that the gap results from scribal deletion, and not from damage to the manuscript or for other reasons. If the text is partially legible, it is possible to combine raw text and <gap/> elements inside a <del> element. <gap/> optionally may contain an @extent attribute that records the estimated number of characters deleted, so that, for example
<del>и<gap reason="illegible" extent="2"/></del>
records a three-letter erasure where the first letter can be read with reasonable confidence as и and the next two are illegible.
Text inserted into the manuscript by a later scribe should be tagged as <add>. If the editor is confident that the insertion is in a later hand, optional @hand="other" attribute markup may be included (insertions by the original scribe should omit the @hand attribute entirely, since the original scribal hand is assumed to be the default). For example, if the editor is confident that the superscript д at 46r12 was added in a later hand, that can be encoded as:
боу<add hand="other"><sup>д</sup></add>ть
Combinations of deletions and insertions that should be regarded as connected, that is, that should be considered a correction, should be encoded using <subst>, as described above, under Corrections.
Text that is legible but damaged should be tagged as <damage>, e.g., at 39v17: пав<damage>ьль</damage>.
If the damaged text can be read, but not with confidence, it (or any unclear portions) can be wrapped in <unclear> tags inside the <damage> element. If the text cannot be read at all, it should be tagged as an empty <gap/> element inside <damage>, where <gap/> has an optional @extent attribute that indicates the approximate number of illegible characters (where the editor is able to discern).
Generic problems should be tagged as <problem>. These will be resolved and reclassified later, after discussion, and this interim markup will help find them at that time.
The beginning of the text being edited is marked by inserting an empty <start/> tag before the line on which the text begins, and the end of that text is marked by inserting an empty <end/> tag after the last line. For example, if the text being edited begins on the third line of folio 22r, the markup would look as follows:
<folio n="22r"/>
<line>first line of 22r</line>
<line>second line of 22r</line>
<start/>
<line>third line of 22r</line>
<!-- more lines -->
There should be exactly one <start/> and one <end/> tag in each file, surrounding the text being edited at the moment. Do not mark the end of the preceding text or the beginning of the following one; that is, in the example above, do not include an <end/> before the <start/>.
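A quick way to confirm this rule on a finished file is sketched below. It is an illustration under stated assumptions, not the project's Schematron rule: the function name has_single_start_and_end is hypothetical, and the check treats "surrounding" as meaning that <start/> precedes <end/> in document order.

```python
import xml.etree.ElementTree as ET

def has_single_start_and_end(xml_text):
    """True if the file has exactly one <start/> followed by exactly one <end/>."""
    root = ET.fromstring(xml_text)
    # root.iter() walks the tree in document order.
    markers = [el.tag for el in root.iter() if el.tag in ("start", "end")]
    return markers == ["start", "end"]

# Hypothetical samples.
good = "<root><line>a</line><start/><line>b</line><end/><line>c</line></root>"
bad = "<root><end/><start/><line>b</line><end/></root>"
print(has_single_start_and_end(good), has_single_start_and_end(bad))
```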
The entire edited section is wrapped in a single <root> element. The first subelement inside the <root> must be a <metadata> element, which contains the <name> and <email> of the primary person who edited the section. (This assumes that every section will have exactly one editor to be credited officially on the site.) The following example shows the beginning of thais.xml, and demonstrates the <root> start tag at the beginning of the file, the <metadata> element with its <name> and <email> children, the <folio> tag for folio 106v, which is where this text begins, the <line> tags for the lines on that folio, and the <start/> tag before the first line of the text of the Vita of Thaïs. This example also includes the <sup> element for superscript characters and an <editionPageNo> element, about which see below:
<root>
<metadata>
<name>Alexandre Popowycz</name>
<email>alexandre.popowycz@ugent.be</email>
</metadata>
<folio n="106v"/>
<line>рцами. и исписах ѥи все стра-</line>
<line>сти юже имѣ кь бѣсоу борбꙋ.</line>
<line>и кь ѡбразномоу и змїю, и</line>
<line>все м҃лтвы ѥє. и поустихь</line>
<line>кь всѣ<sup>м҇</sup> хр<sup>с҇</sup>тїанѡ<sup>м҇</sup> сь всею исти-</line>
<line>ною. сконьча же се. с҃таа моу-</line>
<line>ченица марина, м<sup>с҇</sup>ца їоулѣ</line>
<line>вь, з҃і҃. и твореще ѥи паметь,</line>
<line>сп҃сениѥ оулоучимь. вь име</line>
<line>г҃а нашего іс҃ х҃а. ємꙋже сла<sup>в҇</sup></line>
<line>и дрьжава сь безначелнимь</line>
<line>єго ѡ҃цемь. и сь прѣс҃тымь</line>
<line>и бл҃гымь и животворещимь</line>
<line>д҃хомь твоимь, н҃нѣ и пр҃сно:—</line>
<start/>
<line><editionPageNo n="130"/>ЖИТИѤ И ЖИЗНЬ ПРѢ-</line>
<line>по<sup>д</sup>бниѥ ѳаисиѥ,</line>
<line>Братиꙗ моꙗ пр<sup>с҇</sup>наа, хощꙋ ва<sup>м҇</sup></line>
<!-- text continues -->
</root>
Some of the texts in the 1973 edition number paragraphs or larger sections. This information is retained as empty (self-closing) <editionParagraphNo> elements with an @n attribute to indicate the numerical value, with a space before and after, e.g.:
<line>лы.<editionParagraphNo n="28"/> ре<sup>ч</sup> же вь себѣ азь оужє</line>
Page numbers from the typeset edition are not present in the OCR output, and must be inserted manually into the XML by the editors. The markup for this purpose is an empty (self-closing) <editionPageNo> element with an @n attribute, the value of which corresponds to the beginning of a page in the 1973 edition, e.g.:
<line>неже<editionPageNo n="44"/> бо бѣ ѡ<sup>т</sup>ць єѥ, имѣ-</line>
Asterisks in the 1973 edition point to a wide variety of editorial footnotes. When correcting the OCR, the asterisks should be retained during editing, but in places where they indicate that the editors have replaced the actual manuscript text with their own emendation, the actual manuscript text (in the footnote in the typeset edition) should be restored to the transcription.
The 1973 edition capitalizes proper nouns, sentence-initial letters, etc. We correct those according to the manuscript, which means that we use capital letters only for letters that are large in the manuscript, typically initials and the title at the beginning.
Ligatures are wrapped in <lig> tags, e.g.:
р<lig>ау</lig><sup>д</sup>е (13v2)
e (е ~ є)
The manuscript distinguishes broad and narrow e, but these are not distinguished in the 1973 edition. We correct this programmatically after OCR to bring the distribution into agreement with the general scribal orthographic norm, writing narrow e (е) after consonant letters and broad e (є) elsewhere, that is, after vowel letters and in initial position. Because the scribe may occasionally violate his own norm, editors of the new digital edition need to proofread especially carefully for such deviations and correct the transcription, so that it comes to represent the actual spellings in the manuscript.
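The distribution rule can be pictured with a short function. This is a sketch under stated assumptions and not the project's actual post-OCR script: the name normalize_e and the consonant inventory are hypothetical, only lowercase letters are handled, and the input is assumed to be plain text without markup.

```python
import unicodedata

NARROW_E = "\u0435"  # е
BROAD_E = "\u0454"   # є
# Assumption: an illustrative lowercase consonant inventory, not the project's list.
CONSONANTS = set("бвгджзклмнпрстфхцчшщѕѯѱѳ")

def normalize_e(text):
    """Write narrow e after a consonant letter and broad e everywhere else."""
    out = []
    previous = ""  # most recent non-combining character; empty at the start
    for ch in text:
        if ch in (NARROW_E, BROAD_E):
            ch = NARROW_E if previous in CONSONANTS else BROAD_E
        out.append(ch)
        # Skip combining marks such as titlo, so they do not hide the base letter.
        if not unicodedata.combining(ch):
            previous = ch
    return "".join(out)

print(normalize_e("\u0435го"), normalize_e("оуж\u0454"))
```

Because a function like this enforces the norm everywhere, it erases any genuine deviations by the scribe, which is why the manual proofreading described above is still needed.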
o letters (о ~ ѡ ~ ꙩ ~ ꙫ ~ оо)
The manuscript includes omicron (о), omega (ѡ), ocular o (ꙩ, e.g., 6v16), binocular o (ꙫ, e.g., 5r12) and broad o (ѻ, e.g., 8v8). The double omicron (оо) is a distinctive feature of the orthography of this manuscript, and is transcribed as a sequence of two regular omicron letters. The two omicrons typically are touching, and therefore technically a ligature, but we’ll add the ligature markup automatically after editing, so the editors do not have to type the tags manually in these cases.
u letters
The sound [u] may be spelled as omicron plus u (this should be encoded as two characters, regular o followed by regular u, e.g., beginning of оумершꙋ 1v4), ꙋ (e.g., end of оумершꙋ 1v4), and with superscript у over omicron, i.e., оу (e.g., оуловити 3r3). Note that this means that we have a <sup> element within a <sup> element in the case of второомоу 6v15:
второом<sup>о<sup>у</sup></sup>
Jery in the manuscript is regularly written with two marks over the second component, which sometimes looks like two dots and sometimes like a kendema (double grave accent). In all cases we render the jery like regular jery, with front jer onset and without any superscript diacritic (cf. below concerning diacritics): ы.
All jer letters are written as front jer (ь) unless they are unambiguously back jers (ъ), which normally occurs only at the ends of lines.
t letters
There are three basic shapes of the t letter: three-legged (e.g., что 4r5), regular (but with strong serifs on the ends of the crossbar, e.g., что 4r4), and a tall t (e.g., что 4r12). These are all transcribed, identically, as regular т.
Titlo is transcribed over the last continuous letter of the word (counting from the beginning) before the first omission, without regard to where it appears graphically in the manuscript. In this way, the titlo records the fact that there is a titlo in the manuscript, but does not attempt to record its exact placement over a particular base character. This accommodates the fact that titlo may be placed not only over individual letters, but also between letters, and that it may span multiple letters, all of which are features that are impractical to represent in a character-based transcription. The imperative to transcribe with exact placement is relaxed because we provide also photographs, so that users who require this level of paleographic detail have access to it by way of the photographs. Thus:
Exception: if the last consecutive non-omitted letter from the beginning of the word is superscripted, we place the titlo over the following letter. Thus ѡт҇ц҃ь for ѡт(ь)ць.
We write pokrytie over superscript letters where it appears in the manuscript. This manuscript does not strongly distinguish titlo and pokrytie by ductus (that is, the squiggles look similar); we use titlo ( ҃ ) exclusively over in-line base characters and pokrytie ( ҇ ) exclusively over superscript characters.
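The division of labor between the two marks lends itself to a mechanical check. The following is a minimal sketch, assuming titlo is typed as U+0483 and pokrytie as U+0487; the function name misplaced_marks is hypothetical and this is not the project's Schematron rule.

```python
import xml.etree.ElementTree as ET

TITLO = "\u0483"     # combining titlo, expected only outside <sup>
POKRYTIE = "\u0487"  # combining pokrytie, expected only inside <sup>

def misplaced_marks(xml_text):
    """Report titlo found inside <sup> and pokrytie found outside <sup>."""
    problems = []

    def check(text, in_sup):
        if not text:
            return
        if in_sup and TITLO in text:
            problems.append(("titlo inside <sup>", text))
        if not in_sup and POKRYTIE in text:
            problems.append(("pokrytie outside <sup>", text))

    def walk(element, in_sup):
        inside = in_sup or element.tag == "sup"
        check(element.text, inside)
        for child in element:
            walk(child, inside)
            # Text after a child's end tag belongs to the current element.
            check(child.tail, inside)

    walk(ET.fromstring(xml_text), False)
    return problems

# Hypothetical samples built from the constants above.
good = f"<line>кь всѣ<sup>м{POKRYTIE}</sup> г{TITLO}а</line>"
bad = f"<line>сла<sup>в{TITLO}</sup> м{POKRYTIE}</line>"
print(misplaced_marks(good), len(misplaced_marks(bad)))
```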
We write paerok as Unicode U+2E2F “Vertical Tilde” ( ⸯ ) where it appears in the manuscript. Taggers may find it convenient to copy and paste the character from this page.
For information about the history of this character in Unicode see Kempgen, et al, Unicode U+2E2F, Cyrillic Yerik (Vertical Tilde), Scripta & e-scripta: the journal of interdisciplinary medieval studies 7 (2009), pp. 9–12. ISSN: 1312-238X.
Accent marks and other diacritics (other than titlo, pokrytie, and paerok) are omitted.
Numbers are rendered as they appear in the manuscript, that is, as Cyrillic letters, often with titlo and often preceded or followed by a comma or mid-dot. Numbers in this manuscript are always preceded by punctuation, even where it is not required syntactically, and the punctuation is written adjacent to the preceding letter, followed by a space, followed by the number, e.g., дни. і҃е (with the dot next to the preceding и and followed by a space), rather than дни .і҃е (with the space before the dot, which is then adjacent to the following і). The decisive example is б҃оу. | к҃. лѣт҇, 2v11–12, where a line break clearly associates the dot with the preceding word, and not with the following number.
The 1973 edition uses punctuation supplied by the editors. We remove that, encoding punctuation as it occurs in the actual manuscript. See the section about numbers, above, concerning punctuation adjacent to numbers.