Evaluating Visual Representations for Topic Understanding and Their Effects on Manually Generated Topic Labels

Alison Smith∗ Tak Yeon Lee∗ Forough Poursabzi-Sangdeh† Jordan Boyd-Graber† Niklas Elmqvist∗ Leah Findlater∗
∗University of Maryland, College Park, MD †University of Colorado, Boulder, CO
{amsmit,tylee}@cs.umd.edu {forough.poursabzisangdeh, jordan.boyd.graber}@colorado.edu {elm,leahkf}@cs.umd.edu

Abstract

Probabilistic topic models are important tools for indexing, summarizing, and analyzing large document collections by their themes. However, promoting end-user understanding of topics remains an open research problem. We compare labels generated by users given four topic visualization techniques—word lists, word lists with bars, word clouds, and network graphs—against each other and against automatically generated labels. Our basis of comparison is participant ratings of how well labels describe documents from the topic. Our study has two phases: a labeling phase where participants label visualized topics and a validation phase where different participants select which labels best describe the topics’ documents. Although all visualizations produce similar quality labels, simple visualizations such as word lists allow participants to quickly understand topics, while complex visualizations take longer but expose multi-word expressions that simpler visualizations obscure. Automatic labels lag behind user-created labels, but our dataset of manually labeled topics highlights linguistic patterns (e.g., hypernyms, phrases) that can be used to improve automatic topic labeling algorithms.

1 Comprehensible Topic Models Needed

A central challenge of the “big data” era is to help users make sense of large text collections (Hotho et al., 2005). A common approach to summarizing the main themes in a corpus is to use topic models (Blei, 2012), which are data-driven statistical models that identify words that appear together in similar documents. These sets of words or “topics” evince internal coherence and can help guide users to relevant documents. For instance, an FBI investigator sifting through the released Hillary Clinton e-mails may see a topic with the words “Benghazi”, “Libya”, “Blumenthal”, and “success”, spurring the investigator to dig deeper to find further evidence of inappropriate communication with longtime friend Sidney Blumenthal regarding Benghazi.

A key challenge for topic modeling, however, is how to promote end-user understanding of individual topics and the overall model. Most existing topic presentations use simple word lists (Chaney and Blei, 2012; Eisenstein et al., 2012). Although a variety of alternative topic visualization techniques exist (Sievert and Shirley, 2014; Yi et al., 2005), there has been no systematic assessment to compare them. Beyond exploring different visualization techniques, another means of making topics easier for users to understand is to provide descriptive labels to complement a topic’s set of words (Aletras et al., 2014). Unfortunately, manual labeling is slow and, while automatic labeling approaches exist (Lau et al., 2010; Mei et al., 2007; Lau et al., 2011), their effectiveness is not guaranteed for all tasks.

To better understand these problems, we use labeling to evaluate topic model visualizations.
Our study compares the impact of four commonly used topic visualization techniques on the labels that users create when interpreting a topic (Figure 1): word lists, word lists with bars, word clouds, and network graphs. On Amazon Mechanical Turk, one set of users viewed a series of individual topic visualizations and provided a label to describe each topic, while a second set of users assessed the quality of those labels alongside automatically generated ones.1 Better labels imply that the topic visualization provides users a more accurate interpretation (labeling) of the topic.

1 Data available at https://github.com/alisonmsmith/Papers/tree/master/TopicRepresentations.

The four visualization techniques have inherent trade-offs. Perhaps unsurprisingly, there is no meaningful difference in the quality of the labels produced from the four visualization techniques. However, simple visualizations (word list and word cloud) support a quick, first-glance understanding of topics, while more complex visualizations (network graph) take longer but reveal relationships between words. Also, user-created labels are better received than algorithmically generated labels, but more detailed analysis uncovers features specific to high-quality labels (e.g., tendency towards abstraction, inclusion of phrases) and the types of topics for which automatic labeling works. These findings motivate future automatic labeling algorithms.

2 Background

Presenting the full text of a document corpus is often impractical. For truly large and complex text corpora, abstractions, such as topic models, are necessary. Here we review probabilistic topic modeling and topic model interfaces.

2.1 Probabilistic Topic Modeling

Topic modeling algorithms produce statistical models that discover key themes in documents (Blei, 2012). Many specific algorithms exist; in this work we use Latent Dirichlet Allocation (Blei et al., 2003, LDA) as it is commonly employed. LDA is an unsupervised statistical topic modeling algorithm that considers each document to be a “bag of words” and can scale to large corpora (Zhai et al., 2012; Hoffman et al., 2013; Smola and Narayanamurthy, 2010). Assuming that each document is an admixture of topics, inference discovers each topic’s distribution over words and each document’s distribution over topics that best explain the corpus. The set of topics provides a high-level overview of the corpus, and individual topics can link back to the original documents to support directed exploration. The topic distributions can also be used to present other documents related to a given document.

Clustering is hard because there are multiple reasonable objectives that are impossible to satisfy simultaneously (Kleinberg, 2003). Topic modeling evaluation has focused on perplexity, which measures how well a model can predict words in unseen documents (Wallach et al., 2009b; Jelinek et al., 1977). However, Chang et al. (2009) argue that evaluations optimizing for perplexity encourage complexity at the cost of human interpretability.
Newman et al. (2010a) build on this insight, noting that “one indicator of usefulness is the ease by which one could think of a short label to describe the topic.” Unlike previous interpretability studies, here we examine the connection between a topic’s visual representation (not just its content) and its interpretability.

Recent work has focused on automatic generation of labels for topics. Lau et al. (2011) use Wikipedia articles to automatically label topics. The assumption is that for each topic there will be a Wikipedia article title that offers a good representation of the topic. Aletras et al. (2014) use a graph-based approach to better rank candidate labels. They generate a graph from the words in candidate articles and use PageRank to find a representative label. In Section 3 we use an adapted version of the method presented by Lau et al. (2011) as a representative automatic labeling algorithm.

2.2 Topic Model Visualizations

The topic visualization techniques in our study—word list, word list with bars, word cloud, and network graph—commonly appear in topic modeling tools. Here, we provide an overview of tools that display an entire topic model or models to the user, while more detail on the individual topic visualization techniques can be found in Section 3.2.

Figure 1: Examples of the twelve experimental conditions, each a different visualization of the same topic about the George W. Bush presidential administration and the Iraq War. Rows represent cardinality, or number of topic words shown (five, ten, twenty). Columns represent visualization techniques. For word list and word list with bars, topic words are ordered by their probability for the topic. Word list with bars also includes horizontal bars to represent topic-term probabilities. In the word cloud, words are randomly placed but are sized according to topic-term probabilities. The network graph uses a force-directed layout algorithm to co-locate words that frequently appear together in the corpus.

Topical Guide (Gardner et al., 2010), TopicViz (Eisenstein et al., 2012), and the Topic Model Visualization Engine (Chaney and Blei, 2012) are tools that support corpus understanding and directed browsing through topic models. They display the model overview as an aggregate of underlying topic visualizations. For example, Topical Guide uses horizontal word lists when displaying an overview of an entire topic model but uses a word cloud of the top 100 words for a topic when displaying only a single topic. TopicViz and the Topic Model Visualization Engine both represent topics with vertical word lists; the latter also uses set notation.

Other tools provide additional information within topic model overviews, such as the relationship between topics or temporal changes in the model. However, they still require the user to understand individual topics. LDAvis (Sievert and Shirley, 2014) includes information about the relationship between topics in the model. Multi-dimensional scaling projects the model’s topics as circles onto a two-dimensional plane based on their inter-topic distances; the circles are sized by their overall prevalence. The individual topics, however, are then visualized on demand using a word list with bars.
Smith et al. (2014) visualize a topic model using a nested network graph layout called group-in-a-box (Rodrigues et al., 2011, GIB). The individual topics are displayed using a network graph visualization, and related topics are displayed within a treemap (Shneiderman, 1992) layout. The result is a visualization where related words cluster within topics and related topics cluster in the overall layout. TopicFlow (Smith et al., 2015) visualizes how a model changes over time using a Sankey diagram (Riehmann et al., 2005). The individual topics are represented both as word lists in the model overview and as word lists with bars when viewing a single topic or comparing between two topics. Argviz (Nguyen et al., 2013) captures temporal shifts in topics during a debate or a conversation. The individual topics are presented as word lists in the model overview and using word lists with bars for the selected topics. Klein et al. (2015) use a dust-and-magnet visualization (Yi et al., 2005) to visualize the force of topics on newspaper issues. The temporal trajectories of several newspapers are displayed as dust trails in the visualization. The individual topics are displayed as word clouds.

In contrast to these visualizations, which support viewing the underlying topics on demand, Termite (Chuang et al., 2012) uses a tabular layout of words and topics to provide an overview of the model to compare across topics. It organizes the model into clusters of related topics based on word overlap. This clustered representation is both space-efficient and speeds corpus understanding.

Despite the breadth of topic model visualizations, a small set of individual topic representations are ubiquitous: word list, word list with bars, word cloud, and network graph. In the following sections, we compare these topic visualization techniques.

3 Method: Comparing Visualizations

We conduct a controlled online study to compare the four commonly used visualization techniques identified in Section 2: word list, word list with bars, word cloud, and network graph. We also examine the effect of the number of topic words shown, that is, the cardinality of the visualization: five, ten, or twenty topic words.

3.1 Dataset

We select a corpus that does not assume domain expertise: 7,156 New York Times articles from January 2007 (Sandhaus, 2008). We model the corpus using an LDA (Blei et al., 2003) implementation in Mallet (Yao et al., 2009) with domain-specific stopwords and standard hyperparameter settings.2 Our simple setup is by design: our goal is to emulate the “off the shelf” behavior of conventional topic modeling tools used by novice users. Instead of improving the quality of the model using asymmetric priors (Wallach et al., 2009a) or bigrams (Boyd-Graber et al., 2014), our topic model has topics of variable quality, allowing us to explore the relationship between topic quality and our task measures.

2 n = 50, α = 0.1, β = 0.01

Automatic labels are generated from representative Wikipedia article titles using a technique similar to Lau et al. (2011). We first index Wikipedia using Apache Lucene.3 To label a topic, we query Wikipedia with the top twenty topic words to retrieve fifty articles. These articles’ titles comprise our candidate set of labels. We then represent each article using its TF-IDF vector and calculate the centroid (average TF-IDF) of the retrieved articles. To rank and choose the most representative of the set, we calculate the cosine similarity between the centroid TF-IDF vector and the TF-IDF vector of each of the articles. We choose the title of the article with the maximum cosine similarity to the centroid. Unlike Lau et al. (2011), we do not include the topic words or Wikipedia title n-grams derived from our label set, as these labels are typically not the best candidates. Although other automatic labeling techniques exist, we choose this one as it is representative of general techniques.

3 http://lucene.apache.org/
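To make the ranking step concrete, the sketch below scores candidate article titles by cosine similarity to the TF-IDF centroid. It assumes the candidate articles have already been retrieved (e.g., from a Lucene index queried with the top twenty topic words) and uses scikit-learn rather than the original implementation; the function and variable names are illustrative.

```python
# Minimal sketch of the centroid-based label ranking described above.
# Assumes candidate Wikipedia articles were already retrieved for a topic;
# scikit-learn stands in for the TF-IDF machinery used in the original setup.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def pick_label(candidate_titles, candidate_texts):
    """Return the candidate title whose article is closest to the TF-IDF centroid."""
    vectorizer = TfidfVectorizer(stop_words="english")
    vectors = vectorizer.fit_transform(candidate_texts)   # one row per article
    centroid = np.asarray(vectors.mean(axis=0))           # average TF-IDF vector
    similarities = cosine_similarity(vectors, centroid).ravel()
    return candidate_titles[int(np.argmax(similarities))]
```

A call such as pick_label(titles, texts) over fifty retrieved articles would return the single title used as the automatic label for a topic.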
3.2 Visualizations

As discussed in Section 2, our study compares four of the most common topic visualization techniques. To produce a meaningful comparison, the space given to each visualization is held constant: 400 × 250 pixels. Figure 1 shows each visualization for the three cardinalities (or number of words displayed) for the same topic.

Word List The most straightforward topic representation is a list of the top n words in the topic, ranked by their probability. In practice, topic word lists have many variations. They can be represented horizontally (Gardner et al., 2010; Smith et al., 2015) or vertically (Eisenstein et al., 2012; Chaney and Blei, 2012), with or without commas separating the individual words, or using set notation (Chaney and Blei, 2012). Nguyen et al. (2013) add the weights to the word list by sizing the words based on their probability for the topic, which blurs the boundary with word clouds; however, this approach is not common. We use a horizontal list of equally sized words ordered by the probability p(w|z) for the word w in the topic z. For space efficiency, we organize our word list in two columns and add item numbers to make the ordering explicit.

Word List with Bars Combining bar graphs with word lists yields a visual representation that not only conveys the ordering but also the absolute value of the weights associated with the words. We use a similar implementation to Smith et al. (2015) to add horizontal bars to the word list for a topic z where the length of each bar represents the probability p(w|z) for each word w.

Figure 2: The labeling task for the network graph and ten words. Users create a short label and full sentence describing the topic and rate their confidence that the label and sentence represent the topic well.

Word Cloud The word cloud (or tag cloud) is one of the most popular and well-known text visualization techniques and is a common visualization for topics. Many options exist for word cloud layout, color scheme, and font size (Mueller, 2012). Existing work on layouts is split between those that size words by their frequency or probability for the topic (Ramage et al., 2010) and those that size by the rank order of the word (Barth et al., 2014). We use a combination of these techniques where the word’s font size is initially set proportional to its probability in a topic p(w|z). However, when the word is too large to fit in the canvas, the size is gradually decreased (Barth et al., 2014). We use a gray scale to visually distinguish words and display all words horizontally to improve readability.

Network Graph Our most complex topic visualization is a network graph. We use a similar network graph implementation to Smith et al. (2014), which represents each topic as a node-link diagram, where words are circular nodes with edges drawn between commonly co-occurring words. Each word’s radius is scaled by the probability p(w|z) for the word w in a topic z. While Smith et al. (2014) draw edges based on document-level co-occurrence, we instead use edges to pull together phrases, so they are drawn between words w1 and w2 based on bigram count, specifically if log(count(w1, w2)) > k, with k = 0.1.4 Edge width and color are applied uniformly to further reduce complexity in the graph. The network graph is displayed using a force-directed graph layout algorithm (Fruchterman and Reingold, 1991) where all nodes repel each other but links attract connected nodes.

4 From k ∈ {0.01, 0.05, 0.1, 0.5}, we chose k = 0.1 as the best trade-off between complexity and provided information.
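The sketch below illustrates this edge rule, assuming precomputed corpus bigram counts; it uses networkx (not the original implementation), and the input names are illustrative. The spring layout call stands in for the force-directed placement described above.

```python
# Sketch of the topic network graph construction: nodes are the top topic words
# (sized by p(w|z)), and edges connect word pairs whose corpus bigram count
# passes the log-count threshold. `word_probs` and `bigram_counts` are
# hypothetical inputs (dicts keyed by word and word pair, respectively).
import math
import networkx as nx

def build_topic_graph(top_words, word_probs, bigram_counts, k=0.1):
    graph = nx.Graph()
    for word in top_words:
        graph.add_node(word, size=word_probs[word])        # radius ~ p(w|z)
    for i, w1 in enumerate(top_words):
        for w2 in top_words[i + 1:]:
            count = bigram_counts.get((w1, w2), 0) + bigram_counts.get((w2, w1), 0)
            if count > 0 and math.log(count) > k:           # edge rule from above
                graph.add_edge(w1, w2)
    return graph

# positions = nx.spring_layout(graph)  # Fruchterman-Reingold force-directed layout
```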
3.3 Cardinality

Although every word has some probability for every topic, p(w|z), visualizations typically display only the top n words. The cardinality may interact with the effectiveness of the different visualization techniques (e.g., more complicated visualizations may degrade with more words). We use n ∈ {5, 10, 20}.

3.4 Task and Procedure

The study includes two phases with different users. In Labeling (Phase I), users describe a topic given a specific visualization, and we measure speed and self-reported confidence in completing the task. In Validation (Phase II), users select the best and worst among a set of Phase I descriptions and an automatically generated description for how well they represent the original topics’ documents.

Phase I: Labeling For each labeling task, users see a topic visualization, provide a short label (up to three words), then give a longer sentence to describe the topic, and finally use a five-point Likert scale to rate their confidence that the label and sentence represent the topic well. We also track the time to perform the task. Figure 2 shows an example of a labeling task using the network graph visualization technique with ten words.

Labeling tasks are randomly grouped into human intelligence tasks (HIT) on Mechanical Turk5 such that each HIT includes five tasks from the same visualization technique.6

5 All users are in the US or Canada, have more than fifty previously approved HITs, and have an approval rating greater than 90%.
6 We did not restrict users from performing multiple HITs, which may have exposed them to multiple visualization techniques. Users completed on average 1.5 HITs.

Figure 3: The validation task shows the titles of the top ten documents and five potential labels for a topic. Users are asked to pick the best and worst labels. Four labels were created by Phase I users after viewing different visualizations of the topic, while the fifth was generated by the algorithm. The labels are shown in random order.

Phase II: Validation In the validation phase, a new set of users assesses the quality of the labels and sentences created in Phase I by evaluating them against documents associated with the given topic. It is important to evaluate the topic labels in context; a label that superficially looks good is useless if it is not representative of the underlying documents in the corpus. Algorithmically generated labels (not sentences) are also included. Figure 3 shows an example of the validation task.

The user-generated labels and sentences are evaluated separately. For each task, the user sees the titles of the top ten documents associated with a topic and a randomized set of labels or sentences, one elicited from each of the four visualization techniques within a given cardinality. The set of labels also includes an algorithmically generated label. We ask the user to select the “best” and “worst” of the labels or sentences based on how well they describe the documents. Documents are associated with topics based on the probability of the topic, z, given the document, d, p(z|d). Only the title of each document is initially shown to the user with an option to “show article” (or view the first 400 characters of the document).

All labels are lowercased to enforce uniformity. We merge identical labels so users do not see duplicates. If a merged label receives a “best” or “worst” vote, the vote is split equally across all of the original instances (i.e., across multiple visualization techniques with that label). Finally, we track task completion time.

Each user completes four randomly selected validation tasks as part of a HIT, with the constraint that each task must be from a different topic. We also use ground truth seeding for quality control: each HIT includes one additional test task that has a purposefully bad label generated by concatenating three random dictionary words. If the user does not pick the bad label as the “worst”, we discard all data in that HIT.
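As an illustration of the vote-splitting rule for merged labels, the sketch below distributes a single “best” or “worst” vote equally over the conditions whose lowercased labels are identical. The data layout and names are hypothetical, not the study’s actual pipeline.

```python
# Sketch of splitting one vote across merged duplicate labels, as described above.
# `labels` maps each condition (visualization or "algorithm") to the lowercased
# label it produced for a topic; names are illustrative.
from collections import defaultdict

def split_vote(labels, voted_label, weight=1.0):
    """Distribute one vote for `voted_label` over the conditions that produced it."""
    producers = [cond for cond, lab in labels.items() if lab == voted_label]
    votes = defaultdict(float)
    for cond in producers:
        votes[cond] += weight / len(producers)
    return dict(votes)

# Example: two conditions produced the winning label, so each gets half a vote.
# split_vote({"word list": "iraq war", "word cloud": "iraq war",
#             "network graph": "bush", "algorithm": "gulf war"}, "iraq war")
# -> {"word list": 0.5, "word cloud": 0.5}
```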
3.5 Study Design and Data Collection

For Phase I, we use a factorial design with factors of Visualization (levels: word list, word list with bars, word cloud, and network graph) and Cardinality (levels: 5, 10, and 20), yielding twelve conditions. For each of the fifty topics in the model and each of the twelve conditions, at least five users perform the labeling task, describing the topic with a label and sentence, resulting in a minimum of 3,000 label and sentence pairs. Each HIT includes five of these labeling tasks, for a minimum of 600 HITs. The users are paid $0.30 per HIT.

For Phase II, we compare descriptions across the four visualization techniques (and automatically generated labels), but only within a given cardinality level rather than across cardinalities. We collected 3,212 label and sentence pairs from 589 users during Phase I. For validation in Phase II, we use the first five labels and sentences collected for each condition for a total of 3,000 labels and sentences. These are shown in sets of four (labels or sentences) during Phase II, yielding a total of 1,500 (3,000/4 + 3,000/4) tasks. Each HIT contains four validation tasks and one ground truth seeding task, for a total of 375 HITs. To increase robustness, we validate twice for a total of 750 HITs, without allowing any two labels or sentences to be compared twice. The users get $0.50 per HIT.

4 Results

We analyze labeling time and self-reported confidence for the labeling task (Phase I) before reporting on the label quality assessments (Phase II). We then analyze linguistic qualities of the labels, which should motivate future work in automatic label generation.

We first provide an example of user-generated labels and sentences: the user labels for the topic shown in Figure 1 include government, iraq war, politics, bush administration, and war on terror. Examples of sentences include “President Bush’s military plan in Iraq” and “World news involving the US president and Iraq”.7

7 The complete set of labels and sentences are available at https://github.com/alisonmsmith/Papers/tree/master/TopicRepresentations.

To interpret the results, it is useful to also understand the quality of the generated topics, which varies throughout the model and may impact a user’s ability to generate good labels. We measure topic quality using topic coherence, an automatic measure that correlates with how much sense a topic makes to a user (Lau et al., 2014).8 The average topic coherence for the model is 0.09 (SD = 0.05). Figure 4 shows the three best (top) and three worst topics (bottom) according to their observed coherence: the coherence metric distinguishes obvious topics from inscrutable ones. Section 4.3 shows that users created lower quality labels for low coherence topics.

8 We use a reference corpus of 23 million Wikipedia articles for computing normalized pointwise mutual information needed for computing the observed coherence.

Figure 4: Word list with bar visualizations of the three best (top) and worst (bottom) topics according to their coherence score, which is shown to the right of the topic number: (a) Topic 25 (coherence 0.21), (b) Topic 26 (0.21), (c) Topic 3 (0.20), (d) Topic 9 (0.01), (e) Topic 16 (0.01), (f) Topic 23 (0.02). The average topic coherence is 0.09 (SD = 0.05).
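For reference, the sketch below shows one way to compute an NPMI-based observed coherence score of the kind used here (Lau et al., 2014), assuming document-frequency and co-document-frequency counts from a reference corpus such as Wikipedia have already been gathered. The input names and the handling of unseen word pairs are illustrative assumptions, not the exact implementation.

```python
# Minimal sketch of observed topic coherence via normalized PMI (NPMI),
# averaged over all pairs of a topic's top words. `doc_freq`, `joint_freq`,
# and `n_docs` are hypothetical counts from a reference corpus.
import itertools
import math

def npmi_coherence(top_words, doc_freq, joint_freq, n_docs, eps=1e-12):
    scores = []
    for w1, w2 in itertools.combinations(top_words, 2):
        p1 = doc_freq.get(w1, 0) / n_docs
        p2 = doc_freq.get(w2, 0) / n_docs
        p12 = joint_freq.get((w1, w2), 0) / n_docs
        if p1 == 0 or p2 == 0 or p12 == 0:
            scores.append(0.0)   # illustrative convention for unseen pairs
            continue
        pmi = math.log((p12 + eps) / (p1 * p2))
        scores.append(pmi / -math.log(p12 + eps))   # normalize to [-1, 1]
    return sum(scores) / len(scores) if scores else 0.0
```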
Technique           Cardinality   # tasks   Avg time (SD)   Avg confidence (SD)
Word List                5          264      53.0 (44.3)         3.7 (0.9)
Word List               10          268      53.2 (46.6)         3.7 (0.9)
Word List               20          268      52.1 (53.3)         3.6 (0.9)
Word List w/ Bars        5          264      58.4 (75.1)         3.6 (0.9)
Word List w/ Bars       10          280      58.7 (51.1)         3.6 (0.8)
Word List w/ Bars       20          260      60.7 (57.9)         3.7 (0.8)
Word Cloud               5          268      52.7 (47.4)         3.5 (1.0)
Word Cloud              10          268      49.4 (37.4)         3.6 (0.9)
Word Cloud              20          268      68.4 (85.4)         3.6 (0.9)
Network Graph            5          267      55.0 (50.7)         3.4 (1.1)
Network Graph           10          274      55.6 (56.0)         3.6 (0.8)
Network Graph           20          263      77.9 (71.9)         3.7 (0.8)

Table 1: Overview of the labeling phase: number of tasks completed, the average and standard deviation (in parentheses) for time spent per task in seconds, and the average and standard deviation for self-reported confidence on a 5-point Likert scale for each of the twelve conditions.

Figure 5: Average time for the labeling task, across visualizations and cardinalities, ordered from left to right by visual complexity. For 20 words, network graph was significantly slower and word list was significantly faster than the other visualization techniques. Error bars show standard error.

4.1 Labeling Time

More complex visualization techniques take longer to label (Table 1 and Figure 5). The labeling tasks took on average 57.9 seconds (SD = 58.5) to complete, and a two-way ANOVA (visualization technique × cardinality) reveals significant main effects for both the visualization technique9 and the cardinality,10 as well as a significant interaction effect.11

9 F(3, 3199) = 10.58, p < .001, η²p = .01
10 F(2, 3199) = 14.60, p < .001, η²p = .01
11 F(6, 3199) = 4.59, p < .001, η²p = .01

For lower cardinality, the labeling time across visualization techniques is similar, but there are notable differences for higher cardinality. Posthoc pairwise comparisons based on the interaction effect (with Bonferroni adjustment) found no significant differences between visualizations with five words and only one significant difference for ten words (word list with bars was slower than word cloud, p < .05). For twenty words, however, the network graph was significantly slower at an average of 77.9s (SD = 72.0) than the other three visualizations (p < .05). This effect is likely due to the network graph becoming increasingly dense with more nodes (Figure 1, bottom right). In contrast, the relatively simple word list visualization was significantly faster with twenty words than the three other visualizations (p < .05), taking only 52.1s on average (SD = 53.4). Word list with bars and word cloud were not significantly different from each other.

As a secondary analysis, we examine the relationship between elapsed time and the observed coherence for each topic. Topics with high coherence scores, for example, may be faster to label, because they are easier to interpret. However, the small negative correlation between time and coherence (Figure 6, top) was not significant (r(48) = −.13, p = .364).
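A minimal sketch of this kind of analysis is given below, assuming a long-format table with one row per labeling task; the column names are hypothetical, and statsmodels and scipy stand in for whatever statistics software was actually used.

```python
# Sketch of the 4 (visualization) x 3 (cardinality) analysis of labeling time
# and the topic-level correlation with coherence. Assumes a pandas DataFrame
# `tasks` with columns "time", "vis", "cardinality", and a DataFrame
# `topic_stats` with "coherence" and "mean_time" (hypothetical names).
import pandas as pd
from scipy import stats
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

def labeling_time_anova(tasks: pd.DataFrame) -> pd.DataFrame:
    """Two-way ANOVA: main effects of visualization and cardinality plus interaction."""
    model = ols("time ~ C(vis) * C(cardinality)", data=tasks).fit()
    return anova_lm(model, typ=2)   # F statistics and p-values

def coherence_time_correlation(topic_stats: pd.DataFrame):
    """Pearson correlation between topic coherence and mean labeling time."""
    return stats.pearsonr(topic_stats["coherence"], topic_stats["mean_time"])
```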
4.2 Self-Reported Labeling Confidence

For each labeling task, users rate their confidence that their labels and sentences describe the topic well on a scale from 1 (least confident) to 5 (most confident). The average confidence across all conditions was 3.6 (SD = 0.9). Kruskal-Wallis tests show a significant impact of visualization technique on confidence with five and ten words, but not twenty.12 While average confidence ratings across all conditions only range from 3.4 to 3.7, perceived confidence with network graph suffers when the visualization has too few words (Table 1).

12 Five words: χ²(3) = 12.62, p = .006. Ten words: χ²(3) = 7.94, p = .047. We used nonparametric tests because the data is ordinal and we cannot guarantee that all differences between points on the scale are equal.

As a secondary analysis, we compare the self-reported confidence with observed coherence for each topic (Figure 6, bottom). Increased user confidence with more coherent topics is supported by a moderate positive correlation between topic coherence and confidence (r(48) = .32, p = .026). This result provides further evidence that topic coherence is an effective measurement of topic interpretability.

Figure 6: Relationship between observed coherence and labeling time (top) and observed coherence and self-reported confidence (bottom) for each topic. The positive correlation (slope = 1.64 and R² = 0.10) for confidence is significant.

4.3 Other Users’ Rating of Label Quality

Other users’ perceived quality of topic labels is the best real-world measure of quality (as described in Section 3.4). Overall, the visualization techniques had similar quality labels, but automatically generated labels do not fare well. Automatic labels get far fewer “best” votes and far more “worst” votes than user-generated labels produced from any of the four visualization techniques (Figure 7). Chi-square tests on the distribution of “best” votes for labels for each cardinality show that the visualization matters.13 Posthoc analysis using pairwise Chi-square tests with Bonferroni correction shows that automatic labels were significantly worse than user-generated labels from each of the visualization techniques (all comparisons p < .05). No other pairwise comparisons were significant.

13 Five words: χ²(4, N = 500) = 16.47, p = .002. Ten words: χ²(4, N = 500) = 14.62, p = .006. Twenty words: χ²(4, N = 500) = 22.83, p < .001.

For sentences, no visualization technique emerged as better than the others. Additionally, there is no existing automatic approach to compare against. The distribution of “best” counts here was relatively uniform. Separate Kruskal-Wallis tests for each cardinality to examine the impact of the visualization techniques on “best” counts did not reveal any significant results.
Figure 7: The “best” and “worst” votes for labels and sentences for each condition. The automatically generated labels received more “worst” votes and fewer “best” votes compared to the user-created labels.

As a secondary qualitative analysis, we examine the relationship between topic coherence and the assessed quality of the labels. The automatic algorithm tended to produce better labels for the coherent topics than for the incoherent topics. For example, Topic 26 (Figure 4, b)—{music, band, songs}—and Topic 31 (Figure 4, c)—{food, restaurant, wine}—are two of the most coherent topics. The automatic algorithm labeled Topic 26 as music and Topic 31 as food. For both of these coherent topics, the labels generated by the automatic algorithm secured the most “best” votes and no “worst” votes. In contrast, Topic 16 (Figure 4, e)—{years, home, work}—and Topic 23 (Figure 4, f)—{death, family, board}—are two of the least coherent topics. The automatic labels refusal of work and death of michael jackson yielded the most “worst” votes and fewest “best” votes.

To further demonstrate this relationship, we extracted from the 50 topics the top and bottom quartiles of 13 topics each14 based on their observed coherence scores. Figure 8 shows a comparison of the “best” and “worst” votes for the topic labels for these quartiles, including user-generated and automatically generated labels. For the top quartile, the number of “best” votes per technique ranged from 61 for automatic labels to 96 for the network graph visualization. The range for the bottom quartile was larger, from only 45 “best” votes for automatic labels to 99 for word list with bars. The automatic labels, in particular, received a large relative increase in “best” votes when comparing the bottom quartile to the top quartile (increase of 37%).

14 We could not get exact quartiles, because we have 50 topics, so we rounded up to include 13 topics in each quartile.

Additionally, the word list, word cloud, and network graph visualizations all lead to labels with similar “best” and “worst” votes for both the top and bottom quartiles. However, the word list with bars representation shows both a large relative increase for the “best” votes (increase of 19%) and relative decrease for the “worst” votes (decrease of 23%) when comparing the top to the bottom quartile. These results suggest that adding numeric word probability information highlighted by the bars may help users understand poor quality topics.

4.4 Label Analysis

The results of Phase I provide a large manually generated label set. Exploratory analysis of these labels reveals linguistic features users tend to incorporate when labeling topics. We discuss implications for automatic labeling in Section 5. In particular, users prefer shorter labels, labels that include topic words and phrases, and abstraction in topic labeling.

Length The manually generated labels use 2.01 words (SD = 0.95), and the algorithmically generated labels use 3.16 words (SD = 2.05). Interestingly, the labels voted as “best” were shorter on average than those voted “worst”, regardless of whether algorithmically generated labels are included in the analysis.
With algorithmically generated labels included, the average lengths are 2.04 (SD = 1.16) words for “best” labels and 2.83 (SD = 1.79) words for “worst” labels,15 but even without the algorithmically generated labels, the “best” labels are shorter (M = 1.96, SD = .87) than the “worst” labels (M = 2.09, SD = 1.01).

15 The “best” label set includes all labels voted at least once as “best”, and similarly the “worst” label set includes all labels voted at least once as “worst”.

Figure 8: Comparison of the “best” and “worst” votes for labels generated using the different visualization techniques (and the automatically generated labels) for the top quartile of topics (top) and bottom quartile of topics (bottom) by topic coherence. The automatically generated labels receive far more “best” votes for the coherent topics.

Shared Topic Words Of the 3,212 labels, 2,278, or 71%, contain at least one word taken directly from the topic words—that is, the five, ten, or twenty words shown in the visualization; however, there are no notable differences between the visualization techniques. Additionally, the number of topic words included on average was similar across all three cardinalities, suggesting that users often use the same number of topic words regardless of how many were shown in the visualization.

We further examine the relationship between a topic word’s rank and whether the word was selected for inclusion in the labels. Figure 9 shows the average probability of a topic word being used in a label by the topic word’s rank. More highly ranked words were included more frequently in labels. As cardinality increased, the highest ranked words were also less likely to be employed, as users had more words available to them.

Figure 9: Relationship between rank of topic words and the average probability of occurrences in labels. The three lines—red, green, and blue—represent cardinality of five, ten, and twenty, respectively. The higher-ranked words were used more frequently.

Phrases Although LDA makes a “bag of words” assumption when generating topics, users can reconstruct relevant phrases from the unique words. For Topic 26, for example, all visualizations include the same topic terms. However, the network graph visualization highlights the phrases “jazz singer” and “rock band” by linking their words as commonly co-occurring terms in the corpus. These phrases are not as easily discernible in the word cloud visualization (Figure 10). We compute a set of common phrases by taking all bigrams and trigrams that occur more than fifty and twenty times, respectively, in the NYT corpus. Of the 3,212 labels, 575 contain one of these common phrases, but those generated by users with the network graph visualization contain the most phrases. Labels generated in the word list (22% of the labels), word list with bars (25%), and word cloud (24%) conditions contain fewer phrases than the labels generated in the network graph condition (29%). Although it is not surprising that the network graph visualization better communicates common phrases in the corpus, as edges are drawn between these phrases, this suggests other approaches to drawing edges. Edges drawn based on sentence- or document-level co-occurrence, for example, could instead uncover longer-distance dependencies between words, potentially identifying distinct sub-topics within a topic.

Figure 10: Word cloud and network graph visualizations of Topic 26. Phrases such as “jazz singer” and “rock band” are obscured in the word cloud but are shown in the network graph as connected nodes.
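A small sketch of this phrase check appears below, assuming tokenized corpus documents are available; the thresholds follow the counts above, while the tokenization and names are simplified, illustrative assumptions.

```python
# Sketch of the common-phrase analysis: collect corpus bigrams occurring more
# than fifty times and trigrams occurring more than twenty times, then test
# whether a label contains one. `tokenized_docs` is a hypothetical iterable of
# token lists.
from collections import Counter

def common_phrases(tokenized_docs, bigram_min=50, trigram_min=20):
    bigrams, trigrams = Counter(), Counter()
    for tokens in tokenized_docs:
        bigrams.update(zip(tokens, tokens[1:]))
        trigrams.update(zip(tokens, tokens[1:], tokens[2:]))
    phrases = {" ".join(gram) for gram, c in bigrams.items() if c > bigram_min}
    phrases |= {" ".join(gram) for gram, c in trigrams.items() if c > trigram_min}
    return phrases

def label_contains_phrase(label, phrases):
    return any(phrase in label.lower() for phrase in phrases)
```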
Hyponymy Users often prefer more general terms for labels than the words in the topic (Newman et al., 2010b). To measure this, we look for the set of unique hyponyms and hypernyms of the topic words—those that are not themselves topic words—that appear in the manually generated labels. We use the super-subordinate relation, which represents hypernymy and hyponymy, from WordNet (Miller, 1995). Of the 3,212 labels, 235 include a unique hypernym and 152 include a unique hyponym of the associated topic words found using WordNet, confirming that users are significantly more likely to produce a more generic description of the topic (χ²(1, N = 387) = 17.38, p < .001). For the 235 more generic labels, fewer of these came from word list (22%) and more from the network graph (30%) than the other visualization techniques—word list with bars (24%) and word cloud (24%). This may mean that the network graph helps users to better understand the topic words as a group and therefore label them using a hypernym. We also compared hypernym inclusion for “best” and “worst” labels: 63 (5%) of the “best” labels included a hypernym while only 44 (3%) of the “worst” labels included a hypernym. Each of the visualization techniques led to approximately the same percentage of the 152 total more specific labels.
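The sketch below illustrates this kind of WordNet lookup with NLTK, assuming simple lemma-level matching between label words and the hypernyms (or hyponyms) of the topic words; it is an approximation of the procedure described above, not the exact implementation.

```python
# Sketch of checking whether a label contains a unique hypernym (or hyponym)
# of a topic's words, using NLTK's WordNet interface. Requires the WordNet
# data (nltk.download("wordnet")); lemma-name matching is a simplification.
from nltk.corpus import wordnet as wn

def related_lemmas(word, relation="hypernym"):
    """Lemma names reachable from `word` via one hypernym or hyponym link."""
    lemmas = set()
    for synset in wn.synsets(word):
        neighbors = synset.hypernyms() if relation == "hypernym" else synset.hyponyms()
        for related in neighbors:
            lemmas.update(l.name().lower().replace("_", " ") for l in related.lemmas())
    return lemmas

def label_has_unique_hypernym(label_words, topic_words):
    hypernyms = set()
    for word in topic_words:
        hypernyms |= related_lemmas(word, "hypernym")
    hypernyms -= set(topic_words)          # "unique": not itself a topic word
    return any(word in hypernyms for word in label_words)
```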
5 Discussion

Although the four visualization techniques yield similar quality labels, our crowdsourced study highlights the strengths and weaknesses of the techniques. It also reveals some preferred linguistic features of user-generated labels and how these differ from automatically generated labels.

The trade-offs among the visualization techniques show that context matters. If efficiency is paramount, then word lists—both simple and fast—are likely best. For a cardinality of twenty words, for example, users presented with the simple word list are significantly faster at labeling than those shown the network graph visualization. At the same time, more complex visualizations expose users to multi-word expressions that the simpler visualization techniques may obscure (Section 4.4). Future work should investigate for what types of user tasks this information is most useful. There is also potential for misinterpretation of topic meaning when cardinality is low. Users can misunderstand the topic based on the small set of words, or adjacent words can inadvertently appear to form a meaningful phrase, which may be a particular issue for the word cloud.

Our crowdsourced study identified the “best” and “worst” labels for the topic’s documents. An additional qualitative coding phase could evaluate each “worst” label to determine why, whether due to misinterpretation, spelling or grammatical errors, length, or something else.

Surprisingly, we found no relationship between topic coherence and labeling time (Section 4.1). This is perhaps because not only are users quick to label topics they understand, but they also quickly give up when they have no idea what a topic is about. We do, however, find a relationship between coherence and confidence (Section 4.2). This positive correlation supports topic coherence as an effective measure for human interpretability.

Automatically generated labels are consistently chosen as the “worst” labels, although they are competitive with the user-generated labels for highly coherent topics (Section 4.3). Future automatic labeling algorithms should still be robust to poor topics. Algorithmically generated labels were longer and more specific than the user-generated labels. It is unsurprising that these automatic labels were consistently deemed the worst. Users prefer shorter labels with more general words (e.g., hypernyms, Section 4.4). We show specific examples of this phenomenon from Topic 14 and Topic 48. For Topic 14—{health, drug, medical, research, conditions}—the algorithm generated the label health care in the united states, but users preferred the less specific labels health and medical research. Similarly, for Topic 48—{league, team, baseball, players, contract}—the algorithm generated the label major league baseball on fox; users preferred simpler labels, such as baseball. Automatic labeling algorithms thus can be improved to focus on general, shorter labels. Interestingly, simple textual labels have been shown to be more efficient but less effective than topic keywords (i.e., word lists) for an automatic document retrieval task (Aletras and Stevenson, 2014), highlighting the extra information present in the word lists. Our findings show that users are also able to effectively interpret the word list information, as that visualization was both efficient and effective for the task of topic labeling compared to the other more complex visualizations.

Although we use WordNet to verify that users prefer more general labels, this is not a panacea, because WordNet does not capture all of the generalization users want in labels. In many cases, users use terms that synthesize relationships beyond trivial WordNet relationships, such as locations or entities. For example, Topic 18—{san, los, angeles, terms, francisco}—was consistently labeled as the location California, and Topic 38—{open, second, final, won, williams}—which almost all users labeled as tennis, required a knowledge of the entities Serena Williams and the U.S. Open. In addition to WordNet, an automatic labeling algorithm could use a gazetteer for determining locations from topic words and a knowledge base such as TAP (Guha and McCool, 2003), which provides a broad range of information about popular culture for matching topic words to entities.

6 Conclusion and Future Work

We present a crowdsourced user study to compare four topic visualization techniques—a simple ranked word list, a ranked word list with bars representing word probability, a word cloud, and a network graph—based on how they impact the user’s understanding of a topic. The four visualization techniques lead to similar quality labels as rated by end users. However, users label more quickly with the simple word list, yet tend to incorporate phrases and more generic terminology when using the more complex network graph. Additionally, users feel more confident labeling coherent topics, and manual labels far outperform the automatically generated labels against which they were evaluated.

Automatic labeling can benefit from this research in two ways: by suggesting when to apply automatic labeling and by providing training data for improving automatic labeling.
While automatic labels falter compared to human labels in general, they do quite well when the underlying topics are of high quality. Thus, one reasonable strategy would be to use automatic labels for a portion of topics, but to use human validation to either first improve the remainder of the topics (Hu et al., 2014) or to provide labels (as in this study) for lower quality topics. Moreover, our labels provide training data that may be useful for automatic labeling techniques using feature-based models (Charniak, 2000)—combining information from Wikipedia, WordNet, syntax, and the underlying topics—to reproduce the types of labels and sentences created (and favored) by users.

Finally, our study focuses on comparing individual topic visualization techniques. An open question that we do not address is whether this generalizes to understanding entire topic models. In other words, simple word list visualizations are useful for quick and high-quality topic summarization, but does this mean that a collection of word lists—one per topic—will also be optimal when displaying the entire model? Future work should look at comparing visualization techniques for full topic model understanding.

Acknowledgments

We would like to thank the anonymous reviewers as well as the TACL editors, Timothy Baldwin and Lillian Lee, for helpful comments on an earlier draft of this paper. This work was funded by NSF grant IIS-1409287. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

References

Nikolaos Aletras and Mark Stevenson. 2014. Labelling topics using unsupervised graph-based methods. In Proceedings of the Association for Computational Linguistics.

Nikolaos Aletras, Timothy Baldwin, Jey Han Lau, and Mark Stevenson. 2014. Representing topics labels for exploring digital libraries. In Proceedings of the IEEE/ACM Joint Conference on Digital Libraries.

Lukas Barth, Stephen G. Kobourov, and Sergey Pupyrev. 2014. Experimental comparison of semantic word clouds. In Experimental Algorithms. Springer.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

David M. Blei. 2012. Probabilistic topic models. Communications of the ACM, 55(4):77–84.

Jordan Boyd-Graber, David Mimno, and David Newman. 2014. Care and Feeding of Topic Models: Problems, Diagnostics, and Improvements. CRC Handbooks of Modern Statistical Methods. CRC Press, Boca Raton, Florida.

Allison June Barlow Chaney and David M. Blei. 2012. Visualizing topic models. In International Conference on Weblogs and Social Media.

Jonathan Chang, Jordan Boyd-Graber, Chong Wang, Sean Gerrish, and David M. Blei. 2009. Reading tea leaves: How humans interpret topic models. In Proceedings of Advances in Neural Information Processing Systems.

Eugene Charniak. 2000. A maximum-entropy-inspired parser. In Conference of the North American Chapter of the Association for Computational Linguistics.

Jason Chuang, Christopher D. Manning, and Jeffrey Heer. 2012. Termite: Visualization techniques for assessing textual topic models. In Proceedings of the ACM Conference on Advanced Visual Interfaces.

Jacob Eisenstein, Duen Horng Chau, Aniket Kittur, and Eric Xing. 2012. TopicViz: Interactive topic exploration in document collections. In International Conference on Human Factors in Computing Systems.
Thomas M.J. Fruchterman and Edward M. Reingold. 1991. Graph drawing by force-directed placement. Software: Practice and Experience, 21(11):1129–1164.

Matthew J. Gardner, Joshua Lutes, Jeff Lund, Josh Hansen, Dan Walker, Eric Ringger, and Kevin Seppi. 2010. The Topic Browser: An interactive tool for browsing topic models. In Proceedings of the NIPS Workshop on Challenges of Data Visualization.

Ramanathan Guha and Rob McCool. 2003. TAP: A semantic web platform. Computer Networks, 42(5):557–577.

Matthew Hoffman, David M. Blei, Chong Wang, and John Paisley. 2013. Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347.

Andreas Hotho, Andreas Nürnberger, and Gerhard Paass. 2005. A brief survey of text mining. Journal for Computational Linguistics and Language Technology, 20(1):19–62.

Yuening Hu, Jordan Boyd-Graber, Brianna Satinoff, and Alison Smith. 2014. Interactive topic modeling. Machine Learning, 95(3):423–469.

Fred Jelinek, Robert L. Mercer, Lalit R. Bahl, and James K. Baker. 1977. Perplexity–a measure of the difficulty of speech recognition tasks. The Journal of the Acoustical Society of America, 62(S1):S63–S63.

Lauren F. Klein, Jacob Eisenstein, and Iris Sun. 2015. Exploratory thematic analysis for digitized archival collections. Digital Scholarship in the Humanities.

Jon Kleinberg. 2003. An impossibility theorem for clustering. In Proceedings of Advances in Neural Information Processing Systems.

Jey Han Lau, David Newman, Sarvnaz Karimi, and Timothy Baldwin. 2010. Best topic word selection for topic labelling. In Proceedings of the Association for Computational Linguistics.

Jey Han Lau, Karl Grieser, David Newman, and Timothy Baldwin. 2011. Automatic labelling of topic models. In Proceedings of the Association for Computational Linguistics.

Jey Han Lau, David Newman, and Timothy Baldwin. 2014. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In Proceedings of the European Chapter of the Association for Computational Linguistics.

Qiaozhu Mei, Xuehua Shen, and ChengXiang Zhai. 2007. Automatic labeling of multinomial topic models. In Knowledge Discovery and Data Mining.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Andrew Mueller. 2012. Word cloud. https://github.com/amueller/word_cloud.

David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin. 2010a. Automatic evaluation of topic coherence. In Conference of the North American Chapter of the Association for Computational Linguistics.

David Newman, Youn Noh, Edmund Talley, Sarvnaz Karimi, and Timothy Baldwin. 2010b. Evaluating topic models for digital libraries. In Proceedings of the IEEE/ACM Joint Conference on Digital Libraries.

Viet-An Nguyen, Yuening Hu, Jordan Boyd-Graber, and Philip Resnik. 2013. Argviz: Interactive visualization of topic dynamics in multi-party conversations. In Conference of the North American Chapter of the Association for Computational Linguistics.

Daniel Ramage, Susan T. Dumais, and Daniel J. Liebling. 2010. Characterizing microblogs with topic models. In International Conference on Weblogs and Social Media.

Patrick Riehmann, Manfred Hanfler, and Bernd Froehlich. 2005. Interactive Sankey diagrams. In IEEE Symposium on Information Visualization.

Eduarda Mendes Rodrigues, Natasa Milic-Frayling, Marc Smith, Ben Shneiderman, and Derek Hansen. 2011. Group-in-a-box layout for multi-faceted analysis of communities. In Proceedings of the IEEE Conference on Social Computing.
Evan Sandhaus. 2008. The New York Times annotated corpus LDC2008T19. Linguistic Data Consortium, Philadelphia.

Ben Shneiderman. 1992. Tree visualization with treemaps: A 2-D space-filling approach. ACM Transactions on Graphics, 11(1):92–99.

Carson Sievert and Kenneth E. Shirley. 2014. LDAvis: A method for visualizing and interpreting topics. In Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces.

Alison Smith, Jason Chuang, Yuening Hu, Jordan Boyd-Graber, and Leah Findlater. 2014. Concurrent visualization of relationships between words and topics in topic models. In Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces.

Alison Smith, Sana Malik, and Ben Shneiderman. 2015. Visual analysis of topical evolution in unstructured text: Design and evaluation of TopicFlow. In Applications of Social Media and Social Network Analysis.

Alexander Smola and Shravan Narayanamurthy. 2010. An architecture for parallel topic models. In Proceedings of the VLDB Endowment.

Hanna Wallach, David Mimno, and Andrew McCallum. 2009a. Rethinking LDA: Why priors matter. In Proceedings of Advances in Neural Information Processing Systems.

Hanna M. Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno. 2009b. Evaluation methods for topic models. In Proceedings of the 26th Annual International Conference on Machine Learning.

Limin Yao, David Mimno, and Andrew McCallum. 2009. Efficient methods for topic model inference on streaming document collections. In Knowledge Discovery and Data Mining.

Ji Soo Yi, Rachel Melton, John Stasko, and Julie A. Jacko. 2005. Dust & magnet: Multivariate information visualization using a magnet metaphor. Information Visualization, 4(4):239–256.

Ke Zhai, Jordan Boyd-Graber, Nima Asadi, and Mohamad Alkhouja. 2012. Mr. LDA: A flexible large scale topic modeling package using variational inference in MapReduce. In Proceedings of the ACM Conference on World Wide Web.