August 18, 2017 Time: 05:11pm ijhac.2017.0196.tex

PRACTICAL LINGUISTIC ANNOTATION: THE HEBREW BIBLE1

DIRK ROORDA

introduction

1. Annotation

An annotation is a piece of information attached to another piece of information.2

Annotations generally do not have the same authorship, publishing workflow,
and audience as the information sources they are attached to. Annotations serve
to provide comments to sources, and these comments may involve analysis,
explanation, correction, linking, evaluation, tagging, counting, and much more.
In this article we focus on the logistics of information, rather than on the
meaning. While it is useful to distinguish annotations for their type of content,
our interest lies in the patterns of information distribution. How are annotations
created, how are they published, and how do they behave in the research data
cycle?

2. The Hebrew Bible

The Hebrew Bible is a family of ancient texts with a complex origin. It is
recognized by several world religions, and it has pervaded large swaths of
human culture. Academic research into the Bible occurs in several disciplines:
linguistics, history, and theology with their specialties such as linguistic
variation, historical linguistics, textual criticism, literary analysis, exegesis, and
hermeneutics.

Religious communities have added their own sets of interpretations and
observations. The practice of Bible translation into a great many languages of
the world3 has tuned people’s antennas for interpretation. There are editions of
the text of the Hebrew Bible in which the pages contain a small square of source
text, surrounded by layers and layers of annotation.4

International Journal of Humanities and Arts Computing 11.2 (2017): 276–287
DOI: 10.3366/ijhac.2017.0196
© Edinburgh University Press 2017
www.euppublishing.com/ijhac

276


August 18, 2017 Time: 05:11pm ijhac.2017.0196.tex

Practical linguistic annotation

Figure 1. : Text and annotations in SHEBANQ. Clicking on a verse number hides
and shows the annotations.

shebanq: a system for hebrew text

The ETCBC is the department of the Faculty of Theology at the Vrije
Universiteit Amsterdam that has created a linguistic text database of the Hebrew
Bible.5 In 2013–2014 the SHEBANQ project has reshaped that database into a
standard form: LAF6 and has built a demonstrator to show new ways of utilizing
that database in the age of internet connectedness. Indeed, the ETCBC database
has been modeled as a huge set of annotations. This demonstrator is now a
website in production, also called SHEBANQ.

We show how the Hebrew Bible has been captured in a system of annotations
and point to a number of non-trivial, innovative uses of the concept of annotation
which were not possible or practical before the digital handling of information.

1. Exhaustive linguistic annotation

Each of the more than 400,000 words carries annotations specifying its part of
speech, it morphological characteristics, its various representations and more.
The same holds for larger units, such as phrases and clauses. All in all, this gives
tens of millions of annotated features. Before the arrival of digital information
processing, this was not a feasible thing to do. But here we have it: a text with
millions of annotations, online, in a working system: SHEBANQ (see Fig 1).

277


August 18, 2017 Time: 05:11pm ijhac.2017.0196.tex

Dirk Roorda

Figure 2. : Text in phonetic representation, with all markings and annotations in
place.

2. Multiple textual representations as annotation

There is something else to note: the text itself exists as the content of annotations.
This has to do with the peculiar fact that the older variants of biblical material
were written down in a consonantal script, while the vowels were added as
diacritical marks (‘pointing’) several centuries later, near the final consolidation
of the text around 900 AD. So every word still has a consonantal representation,
but also a fully ‘pointed’ representation. It is a clear case where the text does
not have a single representation. Annotation provides a neat way to expose those
representations together.

Further down that road, we also provide a phonetic representation of the
text (see Fig. 2). That will help people not familiar with Hebrew to get access
to the linguistic annotations and use it for their own purposes.7 Nevertheless,
the authoritative text of the Biblia Hebraica Stuttgartensia is the default
representation.8

In SHEBANQ, the annotations are not tied to the representation of the text.
So if the user switches representation, all the highlights and other annotations
remain in place.

3. Queries as annotations

Now that text and linguistic annotations reside in a database, it becomes possible
to query both kinds of data. An important objective of the creators of the
ETCBC database has always been the ability to search for peculiar syntactic
patterns. When reading the Bible, every now and then a passage is particularly
problematic and requires explanation. But what kind of explanation? Has there
been a text transmission error? Is there a hidden borrowing from another text?
Is there a syntactic construction that belongs to another dialect or language? Is

278


August 18, 2017 Time: 05:11pm ijhac.2017.0196.tex

Practical linguistic annotation

Figure 3. : Queries as notes in the margin. The reader of the passage is drawn to
exegetical problems of others, and their solutions.

there deliberate use of language to achieve a literary effect? Or is there a truly
special meaning lurking behind the text? Research into these problems is greatly
helped by catalogues of occurrences of the same or partly the same phenomenon.
By using a text database, we are able to systematically query those patterns.

It is not easy to write such queries. The data is full of unexpected patterns, it
is easy to miss cases, so many checks and cross-checks are needed. A successful
query is a piece of scholarly crafts(wo)manship, and should be shared and
published as such.

Seen in an abstract way, a query is an annotation to all its results. One
annotation targeting multiple passages is already a little bit innovative, although
one might say that cross-references and indexes are examples of multi-target
annotations. But here there is a bit more going on. By presenting a query as an
annotation to its results, an unexpected flow of information is made possible:
from result to query. When a scholar reads a difficult passage, (s)he might be
interested in the exegetical queries that have results in that passage (see Fig. 3).
This is exactly what SHEBANQ makes possible. Next to every chapter in the
Bible a list of relevant queries is presented, and the results of those queries are
highlighted in the chapter at hand.9

279


August 18, 2017 Time: 05:11pm ijhac.2017.0196.tex

Dirk Roorda

4. Semi-automatic analysis as annotation

Linguistic research into the Hebrew Bible has not ended. The meaning of
Hebrew verb forms in poetry is a long-standing problem (and many occurrences
in prose are far from clear for that matter), and data-driven research has the
potential to produce new solutions.10 Verb meanings are also dependent on the
number and nature of constituents in the sentence (verbal valence), and it is
worthwhile to devise a flow chart system to generate verb senses on the basis of
signals near verb occurrences.11 This involves a lot of trial and error. Sometimes
it leads to a review of the linguistic encoding, to new syntactic and semantic
distinctions. One way to organize this, is to generate the results of a flow chart
as a set of annotations to be presented next to the text. The researcher can then
see the decisions in full context and comment on those outcomes by manual
annotations. These annotations can be harvested in turn and provide a basis for
an improved algorithm. This workflow is supported on SHEBANQ, although not
many people are fully utilizing it yet.

Experience, however, shows that it is cumbersome to execute this work
exclusively on a website. A website such as SHEBANQ only supports that many
use cases, while every research activity requires its own data preprocessing.
An efficient workflow for this kind of research is to collect data, store it in
spreadsheets, have the researcher work on them, and then feed the filled-in sheets
back into the system. We support this workflow by means of LAF-Fabric, which
is an off-line companion to SHEBANQ, based on exactly the same data. With the
help of LAF-Fabric, the programming scholar can grab all data that is needed for
a particular task, lay it out neatly in columns, and convert edited sheets into new
sets of annotations.12 The work of verbal valency is available on the SHEBANQ
tools page (see Fig. 4). These new annotations have been bulk-imported into
SHEBANQ and pubished, but they can also serve as basis for new algorithms in
LAF-Fabric.13

5. Everything else

Although versatile, SHEBANQ cannot do everything. For example, teaching
Hebrew to academic students could profit from SHEBANQ, but SHEBANQ is
not optimized for it. There is a system called Bible Online Learner14, based on
the same ETCBC database, that has facilities to generate drills and exercises for
students and score their answers. Rather than to try to pack all functionality
into one system, it is better to have several systems around, each geared to
their own task, but yet knowing of each other’s existence. Every chapter page
in SHEBANQ links to the corresponding chapter page in BibleOL and vice
versa. Moreover, in order to compose exercises, BibleOL uses queries that are
published in SHEBANQ (see Fig. 5).

280


August 18, 2017 Time: 05:11pm ijhac.2017.0196.tex

Practical linguistic annotation

Figure 4. : Verbal valence notes have been bulk-imported into SHEBANQ and are
visible in notes view. Users can mute note sets and focus on the topics of their
interest.

Figure 5. : Interlinking with Bible Online Learner. Clicking on the SHEBANQ
logo takes you to SHEBANQ, where there is a Bible OL logo to link you back.

6. Summing up

In the digital age, annotation has become a practical paradigm to carry out
scholarly work: we can use annotations in quantities unheard of, to achieve old
goals in new ways, and to pursue new goals with new workflows.

The reader is invited not only to look at the screenshots, because they tend
to show screens packed with information. One of the strong points of digitally
displaying information is that most of the material can be hidden most of the
time. SHEBANQ as an annotation tool helps the researcher to collect all data
relevant to the task at hand in one or two screens, for a great variety of tasks.
And where SHEBANQ falls short, the companion tool LAF-Fabric takes over,
but the price is that the user must program it. This is where the digital paradigm
affects (or should we say infects) the daily work of the scholar: programming
skills are becoming increasingly relevant.

An important characteristic mentioned in most of the cases above is the facility
to share and publish annotations. The Hebrew Text database is the result of a lot
of scholarly work, and that work should be published, not only for the academic

281


August 18, 2017 Time: 05:11pm ijhac.2017.0196.tex

Dirk Roorda

record, but also for the purposes of teaching and training.15 Moreover, published
annotations enable useful cooperation of different systems based on the same
data.

requirements for scholarly annotation

In the previous section we described annotations in action. When the action is
research, it is important to comply with a few essential requirements.

Archiving

We saw how annotations capture scholarly work, sometimes at a high level of
abstraction and expertise. So scholars must be able to save annotations and then
share and publish them. Researchers that work years from now must be able
to retrieve annotations when they see the sources, and to retrieve the sources
when they see the annotations. While the digital paradigm is very beneficial
to transform information flexibly and distribute it globally, it is much more
challenging to fix existing information rigidly and distribute it over decades to
come.

The digital age calls for digital archives that recognize these challenges and
do something about it. In the SHEBANQ case, the data has been archived
at DANS16, all the code sits on Github (see an overview of the sources) and
repository snaphsots have been archived at Zenodo at CERN. The live website
is run by DANS on a server of the Royal Netherlands Academy of Arts and
Sciences.

Coupling

The particular thing about annotations is that they need the coupling to another
resource in order to be ‘to-the-point’. In the age of analogue resources, this
coupling tended to be tight: in the margins, or as footnotes, usually within the
same material container. Where the coupling was less tight, such as in endnotes,
indexes, registers as separate books or volumes, it became quickly unwieldy to
handle all relevant annotations.

In the digital age these problems of information logistics can be solved much
more elegantly and effectively, provided certain agreements are being made by
the designers of information. It is a bit like geotagging photos by means of a
recorded GPS track: if the track points are coded with the same time codings
as the photos, the photos can be located on the track and then on the map. For
annotations we need anchors: points in sources to link to. These points should
be standardized so that different scholars, as producers of annotations, use the
same anchors. That will help to make their annotations interoperable.

282


August 18, 2017 Time: 05:11pm ijhac.2017.0196.tex

Practical linguistic annotation

For linguistic annotations, the LAF standard helps a lot to refer to primary data
in an objective way, although these anchors are still project dependent. There
are efforts to bring about a more global persistent linking system to canonical
resources (see Canonical Text Services and the CITE architecture), and it is a
matter of time before it will be applied to the Hebrew Bible as well.

The holy grail of this all is the Linked Open Data (http://linkeddata.org)
endeavour, which is an attempt to map all entities in human discourse unto
unique, persistent identifiers, and code all properties that can be expressed into
triples consisting of a subject, predicate and object, according to well-defined
vocabularies and ontologies. This is a huge modelling effort, and it is not always
clear how computing-intensive workflows may take advantage of it. But for
importing and exporting data across boundaries of project and discipline, this
is definitely the way to go.

An advantage of well-coupled annotations is that they can be sorted and
organized on the basis of where they point to. But we need other organizing
principles as well, such as the provenance of an annotation (researcher, project,
organization), time (creation, update), motivation (correction, evaluation), nature
(linguistic, hermeneutical). Of these, motivation and nature can be entered in free
text description fields, which in practice, sadly, quite often reveal the text ‘None’.

Innovation

A lot of digital development starts with mimicking analogue concepts. After
a certain period, those digital counterparts may exhibit new dynamics. This
only happens if the new concepts manage to exploit typical advantages of
the digital paradigm over the old ways. One of the key digital advantages is
the network effect: for certain tasks it has become possible to mobilize many
people with mostly limited contributions. Such loosely organized networks can
deliver impressive results, such as Wikipedia.17 If scholars grab the opportunity
to ‘socialize’ parts of their workflows, they may gain results not previously
possible.

SHEBANQ has socialized the art of making exegetical queries. It is being
used in the classroom, and scholars can quote queries to each other and cite them
in papers. Everybody may enter new queries. And everybody can comment on
specific query results by means of simple manual annotations. However, we are
not seeing (yet) that kind of spontaneous manual annotation.

Reflection and action

Before building SHEBANQ, we tried to design its layout and the details of how
queries should be displayed to the user. Query results are structured objects, and
queries may have many structured results; it was not at all clear how we could

283


August 18, 2017 Time: 05:11pm ijhac.2017.0196.tex

Dirk Roorda

provide the users with a good visual representation of query results, and how to
show them in context.

Most of this became clear after we started construction. Only fully engaging in
building this web app made us discover one unanticipated problem after another,
and solve them all. For example, we decided to provide on-the-fly heat maps
of query results, which give users an instant overview of how the results of a
particular query are distributed in the Bible (see Fig. 6). But we refrained from
presenting query results in their full complexity as structured objects. We also
modified our goals. Rather than make SHEBANQ into the ultimate research tool,
we developed LAF-Fabric as an off-line side tool, with more flexibility to tackle
the nitty-gritty of daily research. SHEBANQ got redefined from a laboratory
to a showroom of research results, where very diverse research output comes
together in one context. Now SHEBANQ and LAF-Fabric together provide the
facilities of a scholarly lab.

In our opinion, it makes no sense to reflect on the nature of annotations without
being involved in digital construction work. The ontology of a (digital) medium
is the reflection of its usage patterns. When migrating annotations from analog
to digital, we are potentially upsetting those very usage patterns, and hence the
ontology of annotations.

Programming skills

Just as analogue information systems presuppose the skills of reading and
writing, the potential of the digital media cannot be unleashed without new
skills. For researchers, this means definitely: programming. Especially where
experimentation is involved, it is impractical to outsource development of
new tools to ‘mere’ programmers. Instead, scholarly teams should insource
programming skills in their own skulls. They do not need to master professional
levels. Data oriented programming has become much easier by the evolution
of scripting languages such as Python and additional tools such as the Jupyter
notebook.18 And not every team member needs to learn to program, if only the
team as a whole is able to produce experimental or pilot solutions. Only after
many experiments by scholars, it will be the right time to bring the professional
coders in to turn the successful pilots into products and infrastructure.

Addendum

From the start of 2017 onwards, I have deprecated LAF-Fabric in favour of a
new format and tool: Text-Fabric.19 Thanks to the move from an XML based
format into a plain text based format all data fits in a Github repository.20

284


August 18, 2017 Time: 05:11pm ijhac.2017.0196.tex

Practical linguistic annotation

Figure 6. : Heat map of query results. Every square represents a block of 500 words
of Bible text. The color indicates how many result words the query has in that block.
Every square is clickable and takes you to the corresponding passage.

end notes
1

This works rests on the shoulders of the giants at the ETCBC, such as Eep Talstra and
Constantijn Sikkel who conceived the database and made it work through the decades
behind us. See E. Talstra and C. J. Sikkel, ‘Genese und Kategorienentwicklung der WIVU-
Datenbank’, in C. Hardmeier et al., ed., Ad Fontes! Quellen erfassen—lesen—deuten. Was ist
Computerphilologie? Ansatzpunkte und Methodologie—Instrument und Praxis (Amsterdam,
2000), 33–68; E. Talstra, ‘Computer-assisted linguistic analysis. The Hebrew Database used

285


August 18, 2017 Time: 05:11pm ijhac.2017.0196.tex

Dirk Roorda

in Quest.2’, in J. A. Cook, ed., Bible and Computer. The Stellenbosch AIBI-6 Conference.
2000–07–17/21, Stellenbosch: Proceedings of the Association Internationale Bible et
Informatique (Leiden, 2000), 3–22, https://shebanq.ancient-data.org/shebanq/static/docs/
methods/2000_Talstra_QuestDataTypes.pdf. The query engine of SHEBANQ is the one
made by Ulrik Petersen. See U. Peterson, ‘Emdros—a text database engine for analyzed or
annotated text’, Proceedings of COLING 2004, 1190–3, http://emdros.org/petersen-emdros-
COLING-2004.pdf; U. Peterson, ‘Principles, Implementation Strategies, and Evaluation
of a Corpus Query System’, Lecture Notes in Computer Science, 4002 (2006). 215–26,
http://link.springer.com/chapter/10.1007%2F11780885_21; U. Peterson, EMDROS. Text
database engine for analyzed or annotated text, 2002–2014, http://emdros.org. Peterson has
relied on the ideas of Christ-Jan Doedens: C.-J. Doedens, Text Databases. One Database
Model and Several Retrieval Languages (Amsterdam, 1994). Researchers, senior and junior
have put data and tools to many tests: Janet Dyk, Reinoud Oosting, Oliver Glanz, Gino
Kalkman, Martijn Naaijer, Christiaan Erwich, Cody Kingham plus 89 users of SHEBANQ
that shared 686 queries with us.

2
See M. Bauer and A. Zirker, ‘Whipping Boys Explained: Literary Annotation and Digital
Humanities’, in Ray Siemens and Kenneth M. Price, eds., Literary Studies in the Digital Age:
An Evolving Anthology (New York, 2015), https://dlsanthology.commons.mla.org/whipping-
boys-explained-literary-annotation-and-digital-humanities/; and M. Bauer and A. Zirker,
‘Explanatory Annotation of Literary Texts and the Reader: Seven Types of Problems’, this
volume.

3
See M. Cysouw, ‘Parallel Bible Corpus. 1169 unique Bible translations’, n.d.,
http://www.paralleltext.info/data/, and C. A. Christodoulopoulos, ‘A multilingual parallel
corpus created from translations of the Bible’, https://github.com/christos-c/bible-corpus, 22
June 2017.

4
See also R. Siemens et al., this volume.

5
D. Roorda, ‘The Hebrew Bible as Data: Laboratory—Sharing—Experiences’, in J. Odijk and
A. van Hessen, eds., CLARIN in the Low Countries, 2015, https://arxiv.org/abs/1501.01866;
D. Roorda, J. Krans, B.-J. Lietaert-Peerbolte, W. T. van Peursen, U. Sandborg-
Petersen and E. Talstra, ‘Scientific report of the workshop Biblical Scholarship and
Humanities Computing: Data Types, Text, Language and Interpretation, held at the
Lorentz Centre Leiden from 6 Feb 2012 through 10 Feb 2012’, Lorentz Center, Leiden,
2012, http://www.lorentzcenter.nl/lc/web/2012/480/report.php3?wsid=480&venue=Oort, 22
June 2017.

6
N. Ide and L. Romary, Linguistic Annotation Framework, 2012, http://www.iso.org/iso/
home/store/catalogue_tc/catalogue_detail.htm?csnumber=37326, 22 June 2017.

7
F. de Vree, ‘Using social co-occurrence networks to analyze biblical narrative’, 2016,
https://github.com/Fred-Erik/social-biblical-networks, 22 June 2017.

8
See K. Elliger and W. R. Rudolpfh, eds., Biblia Hebraica Stuttgartensia, 5th
corrected edition (Stuttgart, 1997), www.bibelwissenschaft.de/online-bibeln/biblia-hebraica-
stuttgartensia-bhs/lesen-im-bibeltext/, 22 June 2017.

9
See Roorda and van den Heuvel for an early formulation of the idea of queries-as-annotations;
D. Roorda and C. M. J. M. van den Heuvel, ‘Annotation as a New Paradigm in Research
Archiving’, Proceedings of ASIS&T 2012 Annual Meeting. Final Papers, Panels and Posters,
2012, http://arxiv.org/abs/1412.6069, 22 June 2017.

10
G. J. Kalkman, Verbal Forms in Biblical Hebrew Poetry: Poetical Freedom or
Linguistic System? PhD thesis, VU University (Amsterdam, 2015), https://shebanq.ancient-
data.org/tools?goto=verbsystem.

11
J. W. Dyk, O. Glanz and R. Oosting, ‘Analysing Valence Patterns in Biblical Hebrew:
Theoretical Questions and Analytic Frameworks’, Journal of Northwest Semitic Languages,
40 (2014), 43–62, https://shebanq.ancient-data.org/shebanq/static/docs/methods/2014_
Dyk_jnsl.pdf.

286


August 18, 2017 Time: 05:11pm ijhac.2017.0196.tex

Practical linguistic annotation

12
See Roorda, Naaijer, Kalkman, & van Cranenburgh for initial examples; D. Roorda, M.
Naaijer, G. J. Kalkman and A. van Cranenburgh, ‘LAF-Fabric: a data analysis tool for
Linguistic Annotation Framework with an application to the Hebrew Bible’, Computational
Linguistics in the Netherlands Journal, 4.4 (2015), preprint http://arxiv.org/abs/1410.0286.

13
Indeed, using LAF-Fabric requires programming skills. It is a Python package that gives
streamlined access to the Hebrew Text Database. A beginner’s course in Python is enough
to get started. Another, even more computationally intensive, example is the quest for parallel
passages in the Bible. This is part of the Syntactic Variation project, carried out by a team
of (PhD) researchers at the ETCBC. To see what is at stake here, see R. Rezetko and M.
Naaijer, ‘An Alternative Approach to the Lexicon of Late Biblical Hebrew’, Journal of Hebrew
Scriptures, 16.1 (2016), www.jhsonline.org/Articles/article_213.pdf.

14
N. Winther-Nielsen and C. Tøndering, Bible Online Learner, n.d., http://www.
bibleol.3bmoodle.dk/, 22 June 2017.

15
SHEBANQ is meant as a service to publish queries for the academic record. DANS, as a
national research data archive, is capable of archiving the database as a whole. It is also
possible to store the data on Github, and preserve a snapshot of the repository to Zenodo,
a service of CERN to preserve repositories for the academic record.

16
E. Talstra, C. J. Sikkel, O. Glanz, R. Oosting and J. W. Dyk, Text Database of the
Hebrew Bible, 2012, http://www.persistent-identifier.nl/?identifier=urn:nbn:nl:ui:13-ukhm-eb;
W. T. van Peursen and D. Roorda, Hebrew Text Database in Linguistic Annotation Framework,
2014, PID: urn:nbn:nl:ui:13–048i-71, http://www.persistent-identifier.nl/?identifier=urn:nbn:
nl:ui:13-048i-71; W. T. van Peursen and D. Roorda, Hebrew text database ETCBC4b.
Dataset available online at Data Archiving and Networked services, Den Haag, 2015,
dx.doi.org/10.17026/dans-z6y-skyh.

17
S. Clay, Here Comes Everybody: The Power of Organizing Without Organizations (London,
2012).

18
F. Pérez and B. E. Granger, ‘IPython: a System for Interactive Scientific Computing’,
Computing in Science and Engineering, 9.3 (2007), 21–29, http://ipython.org, ISSN: 1521-
9615, DOI: 10.1109/MCSE.2007.53.

19
Text-Fabric: Data model, file format and processing tool for annotated texts.
https://github.com/ETCBC/text-fabric/wiki.

20
Text-Fabric-Data: Text and Annotations of the Hebrew Bible and the Greek New Testament.
Includes documentation of the annotation features. https://etcbc.github.io/text-fabric-data/.

287

srnp191
AQ: Please provide missing abstract for this article.