SEMI-SUPERVISED TEXTUAL ANALYSIS AND HISTORICAL RESEARCH HELPING EACH OTHER: SOME THOUGHTS AND OBSERVATIONS

FEDERICO NANNI, HIRAM KÜMPER AND SIMONE PAOLO PONZETTO

Abstract
Future historians will describe the rise of the World Wide Web as the turning point of their academic profession. Thanks to an unprecedented number of digitization projects and to the preservation of born-digital sources, for the first time they have at their disposal a gigantic collection of traces of our past. However, to understand trends and obtain useful insights from these very large amounts of data, historians will need increasingly fine-grained techniques. This will be especially true if their objective turns to hypothesis-testing studies, in which arguments are built by employing deep in-domain expertise. For this reason, we focus our paper on a set of computational techniques, namely semi-supervised computational methods, which could potentially provide a methodological turning point for this change. Being both knowledge-driven and data-driven at the same time, these approaches could become a solid alternative to some of the most widely employed unsupervised techniques of today. However, historians who intend to employ them as evidence to support a claim have to use computational methods no longer as black boxes but as a series of well-understood methodological approaches. For this reason, we believe that while developing computational skills will be important for historians, a solid background knowledge of the most important data analysis and results-evaluation procedures will be even more crucial.

International Journal of Humanities and Arts Computing 10.1 (2016): 63–77
DOI: 10.3366/ijhac.2016.0160
© Edinburgh University Press 2016
www.euppublishing.com/journal/ijhac

Keywords: semi-supervised methods; historical studies; data analysis; born-digital archives

1. introduction

In December 2010 Google presented a service called ‘Google Ngram Viewer’.1 This tool allows us to look at the occurrence of single words or sentences in specific subsets of the immense corpus digitized by the Google Books project. A few weeks later, Erez Lieberman Aiden and Jean-Baptiste Michel, team leaders of the prototype Viewer, offered a demonstration of the tool at the annual meeting of the American Historical Association in Boston.2 In front of around 25 curious historians, they noted the enormous potential of conducting historical research by extracting information from large corpora. In particular, they revealed a way to deal with one of the biggest issues for historians who are exploring large datasets, namely rapidly detecting the distribution of specific words in the corpus.3

Interestingly, the development and the functionalities of this tool demonstrate some of the most relevant characteristics of the current interactions between the practice of historical research and the use of computational methods: firstly, no historian has been directly involved in any step of the development of this project.4 This is particularly significant, given that they would likely be the primary targets of a tool able to process information from a corpus spanning five hundred years.
As Aiden and Michel remarked, this is due to two well-known reasons: historians traditionally do not have solid computational skills and they are usually skeptical about the development of quantitative approaches for the analysis of sources.5 Secondly, others have noted that the Ngram Viewer offers an over-simplified research tool, which usually leads to general coarse-grained explorative analyses and to a few simple historical discoveries.6 Finally, the way in which the Ngram Viewer has been presented and identified outside academia as a representative tool of the digital humanities also reveals the growing enthusiasm for methodology studies and big-data-driven research in this community.7 However, as already remarked, researchers in digital humanities need to bear in mind their long-term purpose, that is, to use the computer in order to answer specific and relevant research questions, not simply to build tools.8

But while the Ngram Viewer symbolises a currently widespread way of employing computational methods for studying historical corpora, namely for data exploration and general hypothesis-confirmation analyses, we believe that a change is about to come. In our opinion, new generations of historians will need increasingly fine-grained techniques to conduct inspections of large datasets. This will be especially true if their objective turns from exploratory analyses to hypothesis-testing studies, in order to build arguments by employing their deep in-domain expertise. For this reason, we focus here on a set of computational techniques, namely semi-supervised computational methods, which could potentially provide us with a methodological turning point for this change.9 These approaches make it possible to actively include the human expert in the computational process. Because they are both knowledge- and data-driven at the same time, they could become a solid alternative to some of the most common unsupervised techniques currently used. However, historians who intend to employ computational methods as evidence for supporting a claim have to use them as a series of well-known methodological approaches rather than as ‘black boxes’ whose workings are unknown.10 For this reason, while developing computational skills will be important for historians, a solid background knowledge of the most important data analysis and results-evaluation procedures will be even more necessary.

Starting from all these assumptions, this paper is organized as follows: firstly, a few basic concepts of machine learning methods are introduced. Then, a diachronic description of the use of computational methods in historical research is presented. Following this, our focus on a specific technique, namely Latent Dirichlet Allocation topic modeling, is defined. Next, the advantages and the consequences of the use of semi-supervised topic modeling approaches on the historian’s craft are described. Finally, a future project on the use of these methodological frameworks for the analysis of the different semantic dimensions of specific concepts in a collection of around 1,000 French legal books from the 17th and 18th century is introduced.

Our essay focuses on one precise potential of the complex datasets of sources that historians now have at their disposal.
This is the possibility of exploiting the results of fine-grained analyses as historical evidence through the combination of specific in-domain research interests and the scientifically correct employment of computational methods. This will help researchers to deal with the abundance of digital materials by extracting precise information from them, and to move from exploratory studies to hypothesis-testing analyses. However, now that both large datasets and text mining methods are at our disposal, other challenges are emerging, such as multilingual corpora or the evolution of languages in diachronically extended datasets. In the near future this will raise further issues for the new generations of historians, increasing the need for advanced computational approaches (i.e. specific language models for machine translation) and demanding ever more advanced competencies of the humanities researcher.

2. supervised and unsupervised text analyses

Before going into the details of how these methods have previously been employed in historical research and how they could be used in the near future, it is important to clarify a few key concepts in data analysis and machine learning that have already been mentioned in the previous paragraphs.11 As described earlier, an initial requirement of many historical studies is to identify semantic similarities and recurrent lexical patterns in a collection of documents. In machine learning there are two main kinds of approaches that allow us to do this.

The first one consists of supervised learning methods, which focus on classification tasks. In classification tasks, humans identify a specific property of a subset of elements in the dataset (for example articles about foreign policy in a newspaper archive) and then guide the computer, by means of an algorithm, to learn how to find other elements with that characteristic. This is done by providing the machine with a dataset of labeled examples (‘this is an article about foreign policy’, ‘this is not’), called a ‘gold standard’, which are described by a set of other ‘features’ (for instance, the frequency of each word in each document). Moreover, the learning process is typically divided into two main phases, namely: i) a training phase, in which the predictive model is learnt from the labeled data; ii) a testing phase, in which the previously learnt model is applied to unseen, unlabeled data in order to quantify its predictive power, specifically its ability to generalize to data other than the labeled ones seen during training. Additionally, a validation phase can take place to fine-tune the model’s parameters for the specific task or domain at hand – e.g., classifying foreign policy articles from newspaper sources, as opposed to websites. The potential of a good classifier is immense, in that it offers a model that generalizes from labeled to (a potentially very large set of) unlabeled data. However, building such models can also be extremely time-consuming. In fact, researchers not only need a dataset with specific annotated examples to train the classifier but, perhaps even more fundamentally, they need to have an extremely clear sense of what they are looking for, since this is what leads them to define the annotation guidelines and the learning task itself.
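To make these concepts concrete, the following is a minimal, illustrative sketch in Python using scikit-learn (our choice of library, not one prescribed by the discussion above). The toy documents and labels are invented; the last lines preview the unsupervised clustering alternative discussed in the next paragraph.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.cluster import KMeans

# Toy gold standard: labeled examples (1 = about foreign policy, 0 = not).
docs = [
    "The embassy discussed the treaty with the foreign minister",
    "The city council approved the new local school budget",
    "Diplomats negotiated sanctions against the neighbouring state",
    "The mayor opened a new bridge across the river",
    "Bilateral talks on trade agreements resumed in Geneva",
    "Local elections saw a record turnout this spring",
]
labels = [1, 0, 1, 0, 1, 0]

# Features: a frequency-based representation of each document.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Training phase: learn a predictive model from the labeled data.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.33, random_state=0)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# Testing phase: quantify predictive power on held-out, unseen data.
print("accuracy:", accuracy_score(y_test, classifier.predict(X_test)))

# Unsupervised alternative (discussed next): group documents purely by
# similarity, with no labels; the researcher interprets the clusters.
clusters = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)
print("cluster assignments:", clusters)
```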
For these reasons, it is evident that classification methods are arguably not the most convenient approaches for conducting data exploration in those situations where a researcher sets out to investigate the dataset with no clear goal in mind other than searching for any phenomenon they deem interesting a posteriori.

The second class of methods is unsupervised, and addresses the problem of clustering. In a nutshell, clustering methods aim at grouping elements from a dataset on the basis of their similarity, as computed from their set of features (for example by looking at patterns in the frequency of words in different documents). This is achieved by computing likenesses across features without relying on labeled examples, unsupervised by humans. Crucially for digital humanities scholars, researchers can study the resulting clusters in order to understand what the (latent) semantic meaning of the similarities between the elements is. Clustering techniques are extremely useful for analyzing large corpora of unlabeled data (i.e., consisting of ‘just text’), since they rapidly offer researchers a tool to get a first idea of their content in a structured way (i.e., as clusters of similar elements, which can optionally be hierarchically arranged by using so-called hierarchical clustering methods). This is primarily because, as they do not require labeled data, they can be applied without having in mind a specific phenomenon or characteristic of the dataset to mine (i.e., learn). However, even though scholars have noted their potential, for example for creating serendipity, and different metrics have been proposed for evaluating the number and correctness of these clusters, this is still an extremely challenging task, typically due to the difficulty of interpreting the clusters output by the algorithms.12

3. studying the past, in the digital world

The potential of computational methods for the study of primary sources has been a recurrent topic in the humanities. As Thomas remarked, already in 1945 Vannevar Bush, in his famous essay ‘As We May Think’, pointed out that technology could be the solution enabling us to manage the abundance of scientific and humanistic data; in his vision the Memex could become an extremely useful instrument for historians.13 The use of the computer in historical research consolidated between the Sixties and the Seventies with its application to the analysis of economic and census data.
The advent of cliometrics gave birth to a long discussion on the use of the results of quantitative analysis as evidence in the study of the past.14 Due in part to this long debate on the application of quantitative methods in historical research and in part to the new potential of the Web as a platform for the collection, presentation, and dissemination of material, during the Nineties a different research focus emerged in what was already at that time identified as digital history.15 As Robertson recently pointed out, this specific attention to the more ‘communicative aspects’ of doing research in the humanities can be recognized as one of the main differences between the ways in which historians have interpreted the digital turn compared to their colleagues in literary studies over the last twenty years.16

However, regardless of whether historians of the 21st Century are interested in employing computational methods for analysing textual documents or not, it is evident that the never-ending increase of digitized and born-digital sources is no longer manageable with traditional close reading hermeneutic approaches alone.17 For this reason, two different activities have consolidated in the digital humanities community during the last decade. On one side, digital historians started creating tools in order to help other, traditionally trained colleagues employ computational methods.18 On the other side, more recently a small but strongly connected community of historians has decided to focus their efforts on teaching the basics of programming languages and the potential of different textual analysis techniques for conducting exploratory studies of their datasets. As Turkel remarked: ‘My priority is to help train a generation of programming historians. I acknowledge the wonderful work that my colleagues are doing by presenting history on the Web and by building digital tools for people who can’t build their own. I know that the investment of time and energy that programming requires will make sense only for one historian in a hundred’.19

a. Computational History

The work conducted by William J. Turkel at the University of Western Ontario, with particular attention to his blog ‘Digital History Hacks’ and his project ‘The Programming Historian’, can be identified as a starting point of these digital interactions.20 Following Turkel’s approaches and advice, a group of historians has begun experimenting with these different computational methods to explore large historical corpora.21 The use of Natural Language Processing and Information Retrieval methods, combined with network analysis techniques and a solid set of visualization tools, are the points around which this new wave of quantitative methods in historiography has consolidated. During recent years several interesting examples of these interactions between historical research and computational approaches have been presented.22 In addition, thanks to collaborations with other digital humanities colleagues (i.e. literary studies researchers and digital archivists), the words ‘text mining’ and ‘distant reading’ have become buzzwords of this new trend in digital history.
If we were to look more closely at how these techniques have been applied, we would notice that the first objective of digital humanities researchers has been to show the exploratory potential of these methods and to confirm their accuracy by re-evaluating already well-known historical facts.23 As we will remark in the next sections, this is due to the unsupervised nature of the specific textual analysis techniques most widely used in historical research (e.g., topic modeling), which do not need (but at the same time cannot benefit from) human supervision and in-domain knowledge during the computational process.

b. Topic modeling

Topic modeling is arguably the most popular text mining technique in digital humanities.24 Its success is due to its ability to address one of the deepest needs of a historian, namely to automatically identify, with as little human supervision as possible (none, ideally), a list of topics in a collection of documents, and how these are intertwined with specific document sources in the collection. At first sight this technique seems to be the methodological future of historical research. However, as researchers rapidly discovered, working with topic modeling toolboxes is neither easy nor does it always yield satisfactory results. First of all, Latent Dirichlet Allocation (LDA, the main topic modeling algorithm), like other unsupervised techniques, needs to be told in advance the number of topics (resp. clusters) that the researcher is interested in.25 However, knowing the number of topics is itself a non-trivial issue, which leads researchers to a chicken-and-egg problem in which they use LDA to find some interesting topics, while being required to explicitly state the exact number of such topics they are after. Moreover, as this technique looks at the distribution of topics by document, the results will differ greatly depending on the number of topics chosen. Thus, topic modeling highlights both the advantages and the limitations of unsupervised techniques. In fact, the obtained topics are, as others have noticed, usually difficult to decode; each of them is presented as a list of words, and being able to identify it with a specific concept generally depends on the intuition of the researcher.26

The first paper on LDA was published in 2003; however, before 2010 there were only a few publications on humanities topics where this technique was employed.27 We can identify a turning point in the digital humanities community between 2011 and 2012, when suddenly a remarkable number of blog posts, online discussions, workshops and then publications focused on how to understand and employ this technique.28 As we will describe later, in the same period Owens observed the risks for humanists of using topic modeling results as justification for a theory and in general suggested limiting its use to exploratory studies.29
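To show what this workflow (and the number-of-topics problem) looks like in code, here is a minimal, illustrative sketch using gensim; this is our choice of library for illustration (the digital humanities literature cited above mostly works with toolkits such as MALLET), and the toy corpus and parameter values are invented.

```python
from gensim import corpora, models

# Toy corpus: each document is a list of (already preprocessed) tokens.
documents = [
    ["treaty", "embassy", "negotiation", "minister", "sanctions"],
    ["harvest", "grain", "price", "market", "famine"],
    ["treaty", "alliance", "war", "negotiation", "peace"],
    ["market", "trade", "price", "merchant", "grain"],
]

dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# The chicken-and-egg problem in a single argument: num_topics must be
# chosen before we know what topics the corpus actually contains.
lda = models.LdaModel(corpus, id2word=dictionary,
                      num_topics=2, random_state=0, passes=10)

# Each topic is returned as a weighted word list that the researcher
# must interpret; nothing here labels topic 0 as, say, 'diplomacy'.
for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)

# Per-document topic distributions; these change substantially if a
# different num_topics is chosen.
print(lda.get_document_topics(corpus[0]))
```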
4. semi-supervised textual analysis

Today, if there is something more criticized than the use of quantitative methods in the humanities, it is data-driven research.30 More specifically, we agree that the practice of employing unsupervised computational approaches to analyse a dataset and then relying on their automatically generated results to build a scholarly argument could reduce the role of the humanist in the research process. This is due to two main reasons: firstly, even the more technically skilled historians do not have as solid a statistical background as the computational linguists, computer scientists and other researchers who are currently implementing these methods, which consequently limits their understanding of both the techniques and the results obtained.31 Secondly, by employing unsupervised techniques, historians will not draw on their background knowledge, and will not directly use these methods for answering specific research questions they have in mind. This is because, since unsupervised methods do not rely on human supervision and are mainly targeted at generating serendipity, they do not, and are not meant to, include human feedback to guide the process of model creation. However, on the other side of the spectrum, supervised classification approaches are particularly time-consuming to build, and their usefulness depends on specific research purposes (i.e., what is the scholar trying to discover by classifying documents in different categories?).

Therefore, it is evident that for historians interested in performing more fine-grained explorations, a different computational technique is needed, one able to stake out a middle ground between explicit human supervision and serendipitous searching and exploration; a method that could help researchers switch from general exploratory analyses to more specific ones, from getting a first idea of the contents of a corpus to evaluating theories by employing their domain expertise. For this purpose, we argue that a series of semi-supervised topic modeling algorithms, adopted in recent years in the fields of machine learning and natural language processing, could also become established research methods in digital history.

The first one is Supervised LDA, originally presented by McAuliffe and Blei.32 This method makes it possible to derive distributions of topics by considering a set of labels, one associated with each document. In their paper the authors note the potential of this method when the prediction of a specific value is the ultimate goal; to this end, they combine movie ratings and text reviews to predict the score of unrated reviews. However, as remarked by Travis Brown, historians could also experiment with this technique to, for example, identify the relation between topics and labels (i.e. to find the most relevant topics for ‘economics’ articles).33

A conceptual extension of this technique is Labeled LDA, developed by Ramage et al.34 This method makes it possible to highlight the distribution of labeled topics in a set of multi-labeled documents. If we imagine a corpus where every document is described by a set of meta-tags (for example a newspaper archive with articles associated with ‘economics’, ‘foreign policy’, and so on), Labeled LDA will identify the relation between topics, documents and tags, and its output will consist of a list of topics, one for each tag. This, in turn, could be used to identify which part of each document is associated with each tag.

Another relevant approach is Dirichlet-multinomial regression, proposed by Mimno and McCallum.35 As the authors describe, rather than generating metadata (as, for example, the ratings in Supervised LDA) or estimating topical densities for metadata elements (as with the topics related to metadata in Labeled LDA), this method learns topic assignments by considering a set of pre-assigned document features. In their paper the researchers show how authors, paper citations and dates of publication could be useful features of external knowledge to improve the topic model representation on a dataset of academic publications.

Finally, a last method is Seeded LDA.36 Instead of using a prior set of descriptive labels for each document or topic, as in the previous approaches, Seeded LDA offers the possibility of manually defining a list of seed words for the topics the researcher is interested in. Let us imagine, for instance, that we are after a specific topic within the corpus of interest (e.g., news related to the relations between the USA and Cuba in a newspaper archive): using Seeded LDA the researcher can guide the topic model in a specific direction, receiving as output the distribution of topics that she/he is interested in.
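Seeded LDA as described by Jagarlamudi et al. is not part of the mainstream topic modeling toolkits, but the core seeding idea can be approximated, for illustration, by biasing the topic–word prior of a standard LDA implementation towards the seed words. The sketch below does this with gensim's eta parameter; it is a rough approximation of the seeding idea under our own assumptions, not the authors' exact model, and the corpus and seed words are invented.

```python
import numpy as np
from gensim import corpora, models

documents = [
    ["embargo", "cuba", "washington", "havana", "negotiation"],
    ["harvest", "grain", "price", "market", "famine"],
    ["cuba", "usa", "diplomacy", "embassy", "embargo"],
    ["market", "trade", "price", "merchant", "grain"],
]
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

num_topics = 2
seed_words = ["cuba", "usa", "embargo"]  # expert-supplied seeds for topic 0

# Asymmetric prior over topic-word distributions: a small default value
# everywhere, with a strong boost for the seed words in the seeded topic.
eta = np.full((num_topics, len(dictionary)), 0.01)
for word in seed_words:
    if word in dictionary.token2id:
        eta[0, dictionary.token2id[word]] = 1.0

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=num_topics,
                      eta=eta, random_state=0, passes=10)

# Topic 0 should now gravitate towards the USA-Cuba theme; documents can
# be ranked by how much probability mass they assign to that topic.
for i, bow in enumerate(corpus):
    print(i, lda.get_document_topics(bow, minimum_probability=0.0))
```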
A thorough comparison of these different semi-supervised topic modeling techniques is beyond the scope of this paper. However, the fact that all these methods make it possible to include the human (i.e., the humanities scholar) in the loop (i.e., the learning process), by requiring the expert to provide either labeled metadata or a set of initial seed words to guide the topic acquisition process, is crucial for our argument. We argue that this last option, in particular, is very attractive for digital historians in that it forces them to explicitly state the lexical components of the specific topics they are after, while requiring a minimal amount of supervision. That is, the scholar has to input a small set of seed words she/he deems important on the basis of her/his expertise, as opposed to merely labeling documents with a pre-compiled set of class labels.

5. how data becomes evidence

In the previous section we gave a brief overview of different semi-supervised topic modeling techniques, and argued that they could help historians exploit different sources of information, like metadata and seed words, stemming from their human expertise as scholars, in order to perform fine-grained exploratory analyses. Topic modeling is a fascinating way of navigating through large corpora, and it can become even more interesting for the researcher by making the tool consider specific labels or seed words. Regarding this, Owens remarked: ‘If you shove a bunch of text through MALLET and see some strange clumps clumping that make you think differently about the sources and go back to work with them, great’.37 Then, he continues: ‘If you aren’t using the results of a digital tool as evidence then anything goes’.

In the second sentence Owens perfectly describes the current main problem of digital humanities scholars employing text mining methods. As others have already remarked, on the one hand the research community wants to see the humanistic relevance of these analyses, and not only the computational benefits.38 On the other hand, digital humanists are aware that they cannot present the results of their studies as evidence without a solid evaluation of the performance of the methods.
For instance, if the purpose is to detect articles related to a specific subject (i.e. the relations between the USA and Cuba), the documents obtained by looking at the distribution of specific (LDA-derived) topics are nothing more than an innovative way of searching through the dataset. Thus, it is important to keep in mind that these documents are not the only articles about the subject, and that maybe they are not even about that specific subject at all, due to errors in the automatic learning process. Therefore, if we want to transform our data into evidence for supporting a specific argument or for confirming a hypothesis, we always have to evaluate our approach first.

It is interesting to note that this specific process would sound perfectly ordinary if we were not talking about machine learning methods, computers and algorithms. When a researcher wants to be sure that a viewpoint is correct (‘I believe this article is focused on the relations between the USA and Cuba’), she/he will ask other colleagues.39 The process described here is the same: we need human annotations (for example articles marked as ‘being focused on the relations between the USA and Cuba’ or not) in order to confirm that our hypothesis (what the machine is showing me are articles related to the relations between the USA and Cuba) is correct. Moreover, since humanists are working on extremely specific in-domain research tasks, they cannot rely on Amazon Mechanical Turk annotations as others usually do.40 For solving this specific issue, they cannot even rely on computer scientists or data mining experts: they need the help of their peers.

Therefore, we believe that future advances in historical research on large corpora will essentially be achieved by exploiting deep human expertise, such as that provided by history scholars, as a key component within weakly-supervised computational methods, in two different ways. In our vision (Fig. 1), a first stage will still consist of exploratory studies, which are extremely useful to develop an initial idea of arbitrary datasets. During this process, both standard LDA and especially the semi-supervised methods presented earlier could be particularly useful, as they will help researchers manage the vastness of digital data at their disposal. Following the exploratory phase, when the interest in a specific phenomenon has been established, we envision researchers moving on and developing models to quantify such a phenomenon in text, and creating a gold standard for evaluation based on human ground-truth judgments – again, based on input from domain experts, i.e., scholars.

Figure 1. The methodological framework we suggest for analysing large historical corpora: both the in-domain knowledge of the researcher and a solid expertise in data analysis are key components.

During this second part of the study it might be that methods useful for exploratory studies (such as LDA) are not always as helpful when the task is to precisely identify specific phenomena. For this reason, the new generation of historians needs to learn how to employ text classification algorithms and to become more and more confident with data analysis evaluation procedures.41 These practices have the potential to sustain and improve our comprehension of the past when dealing with digital sources.
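As an illustration of what such an evaluation step might look like in practice, the following sketch computes standard performance and agreement measures with scikit-learn (again our assumed toolkit, not one prescribed by the text); all annotation data are invented.

```python
from sklearn.metrics import (precision_score, recall_score,
                             f1_score, cohen_kappa_score)

# Expert gold standard: 1 = 'about USA-Cuba relations', 0 = not.
gold = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
# What the automatic method (e.g., a seeded topic model plus a
# relevance threshold) returned for the same ten articles.
predicted = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Precision: of the articles the system flagged, how many are correct?
# Recall: of the truly relevant articles, how many did it find?
print("precision:", precision_score(gold, predicted))
print("recall:   ", recall_score(gold, predicted))
print("F1:       ", f1_score(gold, predicted))

# Before trusting the gold standard itself, the agreement between two
# expert annotators can be checked, e.g. with Cohen's kappa.
annotator_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
annotator_b = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0]
print("inter-annotator kappa:", cohen_kappa_score(annotator_a, annotator_b))
```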
6. case study: applying these procedures in a well-defined historical research project

In this final section we describe how we intend to employ the methodological framework presented above in an interdisciplinary research project that, in the near future, will bring together researchers from the Historical Institute and the Data and Web Science Group of the University of Mannheim. Our case study will be focused on circa 1,000 legal books from the 17th and 18th century, comprising over 310,000 pages of text. This is of course a large corpus for a historian, but only a small one for current research in computerized text analysis. Therefore, testing computational methods for specific analyses may prove insightful for both disciplines. These volumes form the ‘Juridica’ part of a book collection brought to Mannheim by the learned Jesuit François-Joseph Terrasse Desbillons (1711–1798) in the 1770s. They cover a broad variety of legal matters with a special, but not very surprising, interest in canon law, and another, slightly more surprising, interest in legal history, or more precisely: the old (French) law.

Based on this corpus, we want to know more about this old French law, the ‘ancien droit’. Yet we do not trace legal institutions, ideas, or regulations. Rather, we ask about the fundamental terms that old French law rested upon. These terms lay the conceptual groundwork upon which concrete institutions, rules, and distinctions of legal thinking were built. Hence, they are usually not technical in a stricter sense (i.e. not exclusively legal), or bear multiple semantic dimensions largely depending upon their uses in specific contexts, e.g. terms like volonté (‘will’), origine (‘origin’), or liberté (‘liberty’). We aim to find these terms and their specific contexts, cluster together similar contexts, and weigh them against each other, iteratively reaching a broad, yet precise spectrum of their meanings.

Traditionally, dictionaries like these are compiled by domain experts (i.e. historians) by reading large amounts of contemporary texts, and by analysing these texts in what we, broadly speaking, term a ‘hermeneutical’ fashion. The selection of texts rests upon the researcher and his or her scope of reach, its amount on what he or she can physically read, and its results rest largely on what he or she can find by physically reading either line by line or hastily flipping through the texts. This is not to say that this traditional method cannot or will not lead to fruitful conclusions.42 In the end, however, these projects are largely based on the presuppositions of the researcher about what she/he can (or will) actually find in the texts, and which texts will be more likely to give fruitful results. In other words, the researcher predefines both search terms and contexts. Our approach, in contrast, will also start with presuppositions, but will iteratively enlarge them by finding both new contexts and probably even new search terms. It could, for instance, well be that notions of ‘will’ (volonté) and its faculties are discussed in contexts of compulsion (contrainte, compulsion, coercition) without even using a word deriving from volonté. Term-based textual analysis will not find such instances, but concept-based analysis will – even in far less obvious examples than the one given here.

As described before, our work will proceed through different steps. In the beginning, coarse-grained exploratory analyses (i.e. using standard LDA) will offer us a general idea of the content of the volumes and their similarities.
Then, by combining different weakly-supervised techniques like Supervised LDA and Seeded LDA, we will exploit domain-expert knowledge to identify the semantic contexts in which these relevant concepts appear and to detect other similar patterns in the corpus. Finally, in order to use the results of these analyses as historical evidence, we will test, compare and improve our methods on a gold standard that will be built for this specific purpose.

7. conclusions

In this paper, we have discussed the applicability of a set of computational techniques for conducting fine-grained analyses on historical corpora. Furthermore, we have remarked on the importance of an evaluation step when the data are exploited as evidence to support specific hypotheses. We believe that these practices will allow us to deepen our understanding of the historical information embedded in digital data.

acknowledgements

The authors want to thank Laura Dietz (Data and Web Science Group) and Charlotte Colding Smith (Historical Institute) for their valuable methodological advice.

end notes

1 https://books.google.com/ngrams; all the URLs mentioned in this research were last checked on 13 November 2015.
2 J. B. Michel et al., ‘Quantitative analysis of culture using millions of digitized books’, Science, 331.6014 (2011), 176–182; A. Grafton, ‘Loneliness and Freedom’, Perspectives on History, online edition, March 2011, http://www.historians.org/publications-and-directories/perspectives-on-history/march-2011/loneliness-and-freedom.
3 G. Crane, ‘What do you do with a million books?’, D-Lib Magazine, 12.3 (2006).
4 Grafton, ‘Loneliness and Freedom’.
5 See: http://www.culturomics.org/Resources/faq/thoughts-clarifications-on-grafton-s-loneliness-and-freedom; F. Gibbs and T. Owens, ‘The hermeneutics of data and historical writing’, in J. Dougherty and K. Nawrotzki, ed., Writing History in the Digital Age (Ann Arbor, MI, 2013).
6 D. Cohen, ‘Initial Thoughts on the Google Books Ngram Viewer and Datasets’, Dan Cohen’s Digital Humanities Blog, 19/12/2010, http://www.dancohen.org/2010/12/19/initial-thoughts-on-the-google-books-ngram-viewer-and-datasets/.
7 See the answer to ‘How does this relate to “humanities computing” and “digital humanities”?’ in the Culturomics FAQ section: http://www.culturomics.org/Resources/faq; C. S. Fisher, ‘Digital Humanities, Big Data, and Ngrams’, Boston Review, 20/06/2013, http://www.bostonreview.net/blog/digital-humanities-big-data-and-ngrams; C. Blevins, ‘The Perpetual Sunrise of Methodology’, 05/01/2015, http://www.cameronblevins.org/posts/perpetual-sunrise-methodology/.
8 I. Gregory, ‘Challenges and opportunities for digital history’, Frontiers in Digital Humanities, 1 (2014); M. Thaller, ‘Controversies around the Digital Humanities: An Agenda’, Historical Social Research/Historische Sozialforschung (2012), 7–23.
9 O. Chapelle et al., ed., Semi-Supervised Learning (Cambridge, MA, 2006).
10 T. Owens, ‘Discovery and justification are different: Notes on science-ing the humanities’, 19/11/2012, http://www.trevorowens.org/2012/11/discovery-and-justification-are-different-notes-on-sciencing-the-humanities/; D. Sculley and B. M. Pasanek, ‘Meaning and mining: the impact of implicit assumptions in data mining for the humanities’, Literary and Linguistic Computing, 23.4 (2008), 409–424.
11 R. S. Michalski, J. G. Carbonell and T. M. Mitchell, Machine Learning: An Artificial Intelligence Approach (Heidelberg, 1983).
12 E. Alexander et al., ‘Serendip: Topic model-driven visual exploration of text corpora’, Proceedings of the IEEE Conference on Visual Analytics Science and Technology (Paris, 2014); M. Steinbach, G. Karypis and V. Kumar, ‘A comparison of document clustering techniques’, KDD Workshop on Text Mining, 400.1 (2000), 525–526.
13 W. G. Thomas III, ‘Computing and the historical imagination’, in S. Schreibman, R. Siemens and J. Unsworth, ed., A Companion to Digital Humanities (Oxford, 2004), 56–68.
14 D. N. McCloskey, ‘The achievements of the cliometric school’, The Journal of Economic History, 38.01 (1978), 13–28.
15 D. J. Cohen and R. Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (Philadelphia, 2006).
16 S. Robertson, ‘The differences between digital history and digital humanities’, 23/05/2014, http://drstephenrobertson.com/blogpost/the-differences-between-digital-history-and-digital-humanities/.
17 S. Graham, I. Milligan and S. Weingart, The Historian’s Macroscope (working title), open draft version, autumn 2013, http://themacroscope.org.
18 For example the TAPoR project: http://www.tapor.ca/.
19 In D. J. Cohen et al., ‘Interchange: The promise of digital history’, The Journal of American History (2008), 452–491.
20 William J. Turkel’s blog: http://digitalhistoryhacks.blogspot.com/; The Programming Historian: http://programminghistorian.org/.
21 For example, I. Milligan, ‘Mining the “Internet Graveyard”: Rethinking the Historians’ Toolkit’, Journal of the Canadian Historical Association/Revue de la Société historique du Canada, 23.2 (2012), 21–64.
22 For instance, C. Blevins, ‘Space, Nation, and the Triumph of Region: A View of the World from Houston’, Journal of American History, 101.1 (2014), 122–147, and M. Kaufman, ‘Everything on Paper Will Be Used Against Me: Quantifying Kissinger’, 2014, http://blog.quantifyingkissinger.com/.
23 For example, C. Au Yeung and A. Jatowt, ‘Studying how the past is remembered: towards computational history through large scale text mining’, Proceedings of the 20th ACM International Conference on Information and Knowledge Management (Glasgow, 2011).
24 E. Meeks and S. Weingart, ‘The digital humanities contribution to topic modeling’, Journal of Digital Humanities, 2.1 (2012), 1–6.
25 D. M. Blei, A. Y. Ng and M. I. Jordan, ‘Latent Dirichlet allocation’, The Journal of Machine Learning Research, 3 (2003), 993–1022.
26 J. Chang et al., ‘Reading tea leaves: How humans interpret topic models’, Advances in Neural Information Processing Systems, 2009.
27 R. Brauer, M. Dymitrow and M. Fridlund, ‘The digital shaping of humanities research: The emergence of Topic Modeling within historical studies’, Enacting Futures: DASTS 2014 (Roskilde, 2014).
28 T. Underwood, ‘Topic modeling made just simple enough’, The Stone and the Shell, 07/04/2012, http://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/; Storify of the DH Topic Modeling Workshop: https://storify.com/sekleinman/dh-topic-modeling-seminar; Meeks and Weingart, ‘The digital humanities contribution to topic modeling’.
29 Owens, ‘Discovery and justification are different: Notes on science-ing the humanities’.
30 S. Marche, ‘Literature is not data: Against digital humanities’, LA Review of Books (2012); L. Wieseltier, ‘Crimes against humanities’, New Republic, 244.15 (2013), 32–39.
31 D. Hall, D. Jurafsky and C. D. Manning, ‘Studying the history of ideas using topic models’, Proceedings of the Conference on Empirical Methods in Natural Language Processing (Honolulu, 2008); D. Mimno, ‘Computational historiography: Data mining in a century of classics journals’, Journal on Computing and Cultural Heritage, 5.1 (2012); M. Schich et al., ‘A network framework of cultural history’, Science, 345.6196 (2014), 558–562.
32 J. D. McAuliffe and D. M. Blei, ‘Supervised topic models’, Advances in Neural Information Processing Systems (2008).
33 T. Brown, ‘Telling New Stories about our Texts: Next Steps for Topic Modeling in the Humanities’, DH2012: Topic Modeling the Past, http://rlskoeser.github.io/2012/08/10/dh2012-topic-modeling-past/.
34 D. Ramage et al., ‘Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora’, Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (Singapore, 2009).
35 D. Mimno and A. McCallum, ‘Topic models conditioned on arbitrary features with Dirichlet-multinomial regression’, Uncertainty in Artificial Intelligence, 2008.
36 J. Jagarlamudi, H. Daumé III and R. Udupa, ‘Incorporating lexical priors into topic models’, Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (Avignon, 2012).
37 Owens, ‘Discovery and justification are different: Notes on science-ing the humanities’.
38 M. Thaller, ‘Controversies around the Digital Humanities: An Agenda’.
39 The examples presented here describe an over-simplified case study. However, the complexity of the evaluation process can easily be shown by turning to more complex, realistic tasks, for example identifying how the different meanings of ‘will’ evolve within a reasonably sized historical corpus.
40 In computational linguistics and natural language processing, the use of human non-expert annotators for the construction of labeled datasets has become an established practice during the last decade. To learn more about the online labor market Amazon Mechanical Turk, see: https://www.mturk.com/mturk/welcome.
41 F. Sebastiani, ‘Machine learning in automated text categorization’, ACM Computing Surveys, 34.1 (2002), 1–47.
42 For example R. Koselleck, W. Conze and O. Brunner, ed., Geschichtliche Grundbegriffe, 8 vols. (Stuttgart, 1972–1997) and R. Reichardt, E. Schmitt and H.-J. Lüsebrink, Handbuch politisch-sozialer Grundbegriffe in Frankreich, 1680–1820 (Berlin et al., 1985ff).