What Makes Writing Great? First Experiments on Article Quality Prediction in the Science Journalism Domain

Annie Louis
University of Pennsylvania, Philadelphia, PA 19104
lannie@seas.upenn.edu

Ani Nenkova
University of Pennsylvania, Philadelphia, PA 19104
nenkova@seas.upenn.edu

Abstract

Great writing is rare and highly admired. Readers seek out articles that are beautifully written, informative and entertaining. Yet information-access technologies lack capabilities for predicting article quality at this level. In this paper we present first experiments on article quality prediction in the science journalism domain. We introduce a corpus of great pieces of science journalism, along with typical articles from the genre. We implement features to capture aspects of great writing, including surprising, visual and emotional content, as well as general features related to discourse organization and sentence structure. We show that the distinction between great and typical articles can be detected fairly accurately, and that the entire spectrum of our features contributes to the distinction.

1 Introduction

Measures of article quality would be hugely beneficial for information retrieval and recommendation systems. In this paper, we describe a dataset of New York Times science journalism articles which we have categorized for quality differences, and we present a system that can automatically make the distinction.

Science journalism conveys complex scientific ideas, entertaining and educating at the same time. Consider the following opening of a 2005 article by David Quammen from Harper's magazine:

One morning early last winter a small item appeared in my local newspaper announcing the birth of an extraordinary animal. A team of researchers at Texas A&M University had succeeded in cloning a whitetail deer. Never done before. The fawn, known as Dewey, was developing normally and seemed to be healthy. He had no mother, just a surrogate who had carried his fetus to term. He had no father, just a "donor" of all his chromosomes. He was the genetic duplicate of a certain trophy buck out of south Texas whose skin cells had been cultured in a laboratory. One of those cells furnished a nucleus that, transplanted and rejiggered, became the DNA core of an egg cell, which became an embryo, which in time became Dewey. So he was wildlife, in a sense, and in another sense elaborately synthetic. This is the sort of news, quirky but epochal, that can cause a person with a mouthful of toast to pause and marvel. What a dumb idea, I marveled.

The writing is clear and well-organized, but the text also contains creative use of language and a clever story-like explanation of the scientific contribution. Such properties make science journalism an attractive genre for studying writing quality. Science journalism is also a highly relevant domain for information retrieval in the context of educational as well as entertaining applications, and article quality measures could hugely benefit such systems.

Prior work indicates that three aspects of article quality can be successfully predicted: a) whether a text meets the acceptable standards for spelling (Brill and Moore, 2000), grammar (Tetreault and Chodorow, 2008; Rozovskaya and Roth, 2010) and discourse organization (Barzilay et al., 2002; Lapata, 2003); b) has a topic that is interesting to a particular user.
For example, content-based recommendation systems standardly represent user interest using frequent words from articles in a user's history and retrieve other articles on the same topics (Pazzani et al., 1996; Mooney and Roy, 2000); and c) is easy to read for a target readership. Shorter words (Flesch, 1948), less complex syntax (Schwarm and Ostendorf, 2005) and high cohesion between sentences (Graesser et al., 2004) typically indicate easier and more 'readable' articles.

Less understood is the question of what makes an article interesting and beautifully written. An early and influential work on readability (Flesch, 1948) also computed an interest measure, with the hypothesis that interesting articles would be easier to read. More recently, McIntyre and Lapata (2009) found that people's ratings of interest for fairy tales can be successfully predicted using token-level scores related to syntactic items and categories from a psycholinguistic database. But large-scale studies of interest measures for adult educated readers have not been carried out.

Further, there have been few attempts to measure article quality in a genre-specific setting. Yet it is reasonable to expect that properties related to the unique aspects of a genre should contribute to the prediction of quality, in the same way that domain-specific spelling and grammar correction techniques (Cucerzan and Brill, 2004; Bao et al., 2011; Dale and Kilgarriff, 2010) have been successful.

Here we address these two issues by developing measures of interesting and well-written nature specifically for science journalism. Central to our work is a corpus of science news articles with two categories: articles written by popular journalists and typical articles from science columns (Section 2). We introduce a set of genre-specific features related to beautiful writing, visual nature and affective content (Section 3) and show that they have high predictive accuracy, 20% above the baseline, for distinguishing our quality categories (Section 4). Our final system combines these genre-specific interest measures with features proposed for identifying readable, well-written and topically interesting articles, giving an accuracy of 84% (Section 5).

2 Article quality corpus

Our corpus, available from http://www.cis.upenn.edu/~nlp/corpora/scinewscorpus.html, contains articles selected from the larger New York Times (NYT) corpus (Sandhaus, 2008), which provides a wealth of metadata about each article, including author information and manually assigned topic tags.

2.1 General corpus

The articles in the VERY GOOD category include all contributions to the NYT by authors whose writing appeared in "The Best American Science Writing" anthology, published annually since 1999. Articles from the science columns of leading newspapers are nominated, and expert journalists choose a set they consider exceptional to appear in these anthologies. There are 63 NYT articles in the anthology (between years 1999 and 2007) that are also part of the digital NYT corpus; these articles form the seed set of the VERY GOOD category. We further include in the VERY GOOD category all other science articles contributed to the NYT by the authors of the seed examples.
Science articles by other authors not in our seed set form the TYPICAL category. We perform this expansion by first creating a relevant set of science articles. There is no single metadata tag in the NYT that refers to all science articles, so we use the topic tags from the seed articles as an initial set of research tags. We then compute a minimal set of research tags that covers all the best articles: we greedily add tags to the minimal set, at each iteration choosing the tag that is present in the majority of the articles that remain uncovered. This minimal set contains 14 tags such as 'Medicine and Health', 'Space', 'Research', 'Physics' and 'Evolution'. We collect all articles from the NYT which have at least one of the minimal-set tags.

However, even a cursory mention of a research topic results in a research-related tag being assigned to an article. So we also use a dictionary of research-related terms to determine whether an article passes a minimum threshold for research content. We created this dictionary manually, using our intuition about a few categories of research words; it contains the following words and their morphological variants (63 items in total), with the category shown in capital letters.

PEOPLE: researcher, scientist, physicist, biologist, economist, anthropologist, environmentalist, linguist, professor, dr, student
PROCESS: discover, found, experiment, work, finding, study, question, project, discuss
TOPIC: biology, physics, chemistry, anthropology, primatology
PUBLICATIONS: report, published, journal, paper, author, issue
OTHER: human, science, research, knowledge, university, laboratory, lab
ENDINGS: -ology, -gist, -list, -mist, -uist, -phy

The items in the ENDINGS category are used to match word suffixes. An article is considered science-related if at least 10 of its tokens match the dictionary and, in addition, at least 5 unique words from the dictionary are matched. Since the time span of the best articles is 1999 to 2007, we limit our collection to this timespan. In addition, we only consider articles that are at least 500 words long. This filtered set of 23,709 articles forms the relevant set of science journalism.

The 63 seed samples of great writing were contributed by about 40 authors. Some authors have multiple articles selected for the best-writing book series, supporting the idea that these authors produce high-quality pieces that can be considered distinct from typical articles. Separating out the other articles by these authors gives us 3,467 extra samples of VERY GOOD writing, so in total the VERY GOOD set has 3,530 articles. The remaining articles from the relevant set, 20,242, written by about 3,000 other authors, form the TYPICAL category.
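The tag-cover and dictionary-threshold steps above are straightforward to implement. The following is a minimal sketch, assuming lowercase tokenized text; the dictionary here lists only a few of the 63 items, and the function and variable names are illustrative rather than the code actually used to build the corpus.

```python
import re

# A few entries from each category of the manually built research dictionary;
# the full dictionary described above has 63 items plus morphological variants.
RESEARCH_WORDS = {
    "researcher", "scientist", "physicist", "biologist", "professor", "student",
    "discover", "experiment", "finding", "study", "project",
    "biology", "physics", "chemistry",
    "report", "published", "journal", "paper",
    "science", "research", "university", "laboratory", "lab",
}
SUFFIXES = ("ology", "gist", "list", "mist", "uist", "phy")  # ENDINGS entries


def greedy_tag_cover(seed_article_tags):
    """Minimal set of topic tags covering all seed articles: repeatedly pick
    the tag present in the most still-uncovered articles (Section 2.1)."""
    uncovered = list(seed_article_tags)       # one set of tags per seed article
    chosen = set()
    while uncovered:
        counts = {}
        for tags in uncovered:
            for tag in tags:
                counts[tag] = counts.get(tag, 0) + 1
        best = max(counts, key=counts.get)
        chosen.add(best)
        uncovered = [tags for tags in uncovered if best not in tags]
    return chosen


def is_science_related(text, min_tokens=10, min_unique=5):
    """Dictionary threshold: at least 10 matching tokens and at least 5
    distinct matching dictionary words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = [t for t in tokens
            if t in RESEARCH_WORDS or t.endswith(SUFFIXES)]
    return len(hits) >= min_tokens and len(set(hits)) >= min_unique
```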
2.2 Topic-paired corpus

The general corpus of science writing introduced so far contains articles on diverse topics, including biology, astronomy, religion and sports. The VERY GOOD and TYPICAL categories created above allow us to study writing quality without regard to topic. However, a typical information retrieval scenario would involve comparing articles on the same topic, i.e., articles relevant to the same query. To investigate how quality differentiation can be done within topics, we created another corpus in which we paired articles of VERY GOOD and TYPICAL quality.

For each article in the VERY GOOD category, we compute similarity with all articles in the TYPICAL set. This similarity is computed by comparing the topic words of the two articles, where topic words are identified using a log-likelihood ratio test (Lin and Hovy, 2000). We retain the 10 most similar TYPICAL articles for each VERY GOOD article. Enumerating all pairs of a VERY GOOD article with its 10 matched TYPICAL articles gives a total of 35,300 pairs.

There are two distinguishing aspects of our corpus. First, the average quality of the articles is high. They are unlikely to have spelling, grammar and basic organization problems, allowing us to investigate article quality rather than the detection of errors. Second, our corpus contains more realistic samples of quality differences for IR or article recommendation than prior work, where system-produced texts and permuted versions of an original article were used as proxies for lower-quality text.

2.3 Tasks

We perform two types of classification tasks and divide our corpus into development and test sets for these tasks as follows.

Any topic: Here the goal is to separate VERY GOOD from TYPICAL articles without regard to topic. The setting roughly corresponds to picking out an interesting article from an archive or a day's newspaper. The test set contains 3,430 VERY GOOD articles, and we randomly sample 3,430 articles from the TYPICAL category to comprise the negative set.

Same topic: Here we use the topic-paired VERY GOOD and TYPICAL articles. The goal is to predict which article in a pair is the VERY GOOD one. This task is closer to an information retrieval setting, where articles similar in topic (retrieved for a user query) need to be distinguished by quality. For the test set, we selected 34,300 pairs.

Development data: We randomly selected 100 VERY GOOD articles and their paired (10 each) TYPICAL articles from the topic-paired corpus. These constitute 1,000 pairs, which we use for developing the same-topic classifier. From these selected pairs we take the 100 VERY GOOD articles and sample 100 unique articles from the TYPICAL articles making up the pairs; these 200 articles are used to tune the any-topic classifier.

3 Facets of science writing

In this section, we discuss six prominent facets of science writing which we hypothesized would have an impact on text quality. These are the presence of passages of a highly visual nature, people-oriented content, the use of beautiful language, sub-genres, sentiment or affect, and the depth of research description. Several other properties of science writing could also be relevant to quality, such as the use of humor, metaphor, suspense and clarity of explanations, and we plan to explore these in future work.

We describe how we computed features related to each property and tested how these features are distributed in the VERY GOOD and TYPICAL categories. For this analysis, we randomly sampled 1,000 articles from each of the two categories as representative examples. We compute the value of each feature on these articles and use a two-sided t-test to check whether the mean value of the feature is higher in one class of articles. A p-value less than 0.05 is taken to indicate a significantly different trend for the feature in the VERY GOOD versus TYPICAL articles.

Note that our feature computation step is not tuned for the quality prediction task in any way. Rather, we aim to represent each facet as accurately as possible. Ideally we would require manual annotations for each facet (visual nature, sentiment, etc.) to achieve this goal. At this time, we simply check the values of some chosen features on a random collection of snippets from our corpus and verify that they behave as intended, without resorting to such annotations.
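The per-feature analysis just described is a standard two-sample comparison. A minimal sketch with SciPy, assuming the feature has already been computed on the 1,000 sampled articles of each class (the variable names are illustrative):

```python
import numpy as np
from scipy import stats


def compare_feature(values_very_good, values_typical, alpha=0.05):
    """Two-sided t-test for one feature, as in Section 3: a p-value below
    alpha is read as a significant difference between the classes."""
    good = np.asarray(values_very_good, dtype=float)
    typical = np.asarray(values_typical, dtype=float)
    t_stat, p_value = stats.ttest_ind(good, typical)  # two-sided by default
    return {
        "t": t_stat,
        "p": p_value,
        "significant": p_value < alpha,
        "higher_in": "VERY GOOD" if good.mean() > typical.mean() else "TYPICAL",
    }
```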
3.1 Visual nature of articles

Some texts create an image in the reader's mind. For example, the snippet below has a strong visual effect.

When the sea lions approached close, seemingly as curious about us as we were about them, their big brown eyes were encircled by light fur that looked like makeup. One sea lion played with a conch shell as if it were a ball.

Such vivid descriptions can engage and entertain a reader. Kosslyn (1980) found that people spontaneously form images of concrete words that they hear and use them to answer questions or perform other tasks. Books written for student science journalists (Blum et al., 2006; Stocking, 2010) also emphasize the importance of visual descriptions.

We measure the visual nature of a text by counting the number of visual words. Currently, the only resource of imagery ratings for words is the MRC psycholinguistic database (Wilson, 1988). It contains a list of 3,394 words rated for their ability to invoke an image, so the list contains both words that are highly visual and words that are not visual at all. With the cutoff values we adopted (4.5 for the Gilhooly-Logie ratings and 350 for the Bristol Norms, the two lists of imagery ratings in the MRC resource), we obtain 1,966 visual words. The coverage of that lexicon is therefore likely to be low for our corpus.

We collect a larger set of visual words from a corpus of tagged images from the ESP game (von Ahn and Dabbish, 2004). The corpus contains 83,904 images and 27,466 unique tags, with an average of 14.5 tags per picture. The tags were collected in a game setting where two users individually saw the same image and had to guess words related to it; the players increased their scores when the word guessed by one player matched that of the other. Due to this simple annotation method, there is considerable noise, and non-visual words are often assigned as tags. So we performed filtering to find high-precision image words and also grouped them into topics.

We use Latent Dirichlet Allocation (Blei et al., 2003) to cluster image tags into topics. We treat each picture as a document whose contents are the tags assigned to that picture. We use symmetric priors set to 0.01 for both the topic mixture and the word distribution within each topic. We filter out the 30 most common words in the corpus, words that appear in fewer than four pictures, and images with fewer than five tags. The remaining words are clustered into 100 topics with the Stanford Topic Modeling Toolbox (Ramage et al., 2009), available at http://nlp.stanford.edu/software/tmt/tmt-0.4/. We did not tune the number of topics and chose the value of 100 based on the intuition that the number of visual topics is likely to be small.

To select clean visual clusters, we make the assumption that visual words are likely to be clustered with other visual terms; topics that are not visual are discarded altogether. We use the manual annotations available with the MRC database to determine which clusters are visual. For each of the 100 topics from the topic model, we examine the 200 words with the highest probability in that topic. We compute the precision of each topic as the proportion of these 200 words that match the MRC list of visual words (the 1,966 words obtained using the cutoffs mentioned above). Only topics with a precision of at least 25% were retained, resulting in 68 visual topics.
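The topic-precision filter needs only the per-topic top words and the MRC visual list. The clustering itself was done with the Stanford Topic Modeling Toolbox; the sketch below uses scikit-learn's LatentDirichletAllocation as a stand-in, with the symmetric 0.01 priors stated above, and mrc_visual_words standing for the 1,966-word MRC list. It is an approximation of the pipeline, not a reproduction of it (the frequency-based pre-filtering is omitted).

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


def visual_topics(image_tag_docs, mrc_visual_words, n_topics=100,
                  top_n=200, min_precision=0.25):
    """Cluster image tags into topics and keep the 'visual' ones, i.e. topics
    whose top words overlap sufficiently with the MRC visual word list."""
    # Each image is a document; its tags are the document's words.
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform([" ".join(tags) for tags in image_tag_docs])
    vocab = np.array(vectorizer.get_feature_names_out())

    lda = LatentDirichletAllocation(
        n_components=n_topics,
        doc_topic_prior=0.01,    # symmetric prior on the topic mixture
        topic_word_prior=0.01,   # symmetric prior on the word distributions
        random_state=0,
    ).fit(counts)

    kept = []
    for topic_word in lda.components_:
        top_words = set(vocab[np.argsort(topic_word)[::-1][:top_n]])
        precision = len(top_words & mrc_visual_words) / float(top_n)
        if precision >= min_precision:   # keep topics that look visual
            kept.append(top_words)
    return kept   # the union of these sets gives the expanded visual word list
```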
Some example topics, with manually created headings, include:

landscape: grass, mountain, green, hill, blue, field, brown, sand, desert, dirt, landscape, sky
jewellery: silver, white, diamond, gold, necklace, chain, ring, jewel, wedding, diamonds, jewelry
shapes: round, ball, circles, logo, dots, square, dot, sphere, glass, hole, oval, circle

Combining these 68 topics yields 5,347 unique visual words, because topics can overlap in their lists of most probable words. 2,832 words from this set are not present in the MRC database; some examples of new words in our list are 'daffodil', 'sailor', 'helmet', 'postcard', 'sticker', 'carousel', 'kayak', and 'camouflage'. For later experiments we take the 5,347 words as the visual word set and also keep the information about the top 200 words in each of the 68 selected topics.

We compute two classes of features, one based on all visual words and the other on visual topics. We consider only the adjectives, adverbs, verbs and common nouns in an article as candidate words for computing visual quality.

Overall visual use: We compute the proportion of candidate words that match the visual word list as the TOTAL VISUAL feature. We also compute the proportions based only on the first 200 words of the article (BEG VISUAL), the last 200 words (END VISUAL) and the middle region (MID VISUAL). In addition, we divide the article into five equally sized bins of consecutive words; within each bin we compute the proportion of visual words, treat these values as a probability distribution, and compute its entropy (ENTROPY VISUAL). We expected these position features to indicate how the placement of visual words is related to quality.

Topic-based features: We also compute the proportion of an article's visual words that match the list under each topic. The maximum proportion from a single topic (MAX TOPIC VISUAL) is a feature. We also compute a greedy cover set of topics for the visual words in the article: the topic that matches the most visual words is added first, and the next topic is selected based on the remaining unmatched words. The number of topics needed to cover 50% of the article's visual words is the TOPIC COVER VISUAL feature. These features capture the mix of visual words from different topics. Disregarding topic information, we also compute the feature NUM PICTURES, the number of images in the ESP corpus for which at least 40% of the image's tags are matched in the article.

We found 8 features to vary significantly between the two types of articles. The features with significantly higher values in VERY GOOD articles are BEG VISUAL, END VISUAL and MAX TOPIC VISUAL. The features with significantly higher values in the TYPICAL articles are TOTAL VISUAL, MID VISUAL, ENTROPY VISUAL, TOPIC COVER VISUAL and NUM PICTURES.

It appears that the simple expectation that VERY GOOD articles contain more visual words overall does not hold here. However, the great writing samples have a higher degree of visual content at the beginnings and ends of articles. Good articles also have lower entropy for the distribution of visual words, indicating that visual words appear in localized positions rather than being spread throughout the article. The topic-based features further indicate that in VERY GOOD articles the visual words come from only a few topics (compared to TYPICAL articles) and so may evoke a coherent image or scene.
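To make the position- and topic-based features concrete, here is a minimal sketch of three of them (TOTAL VISUAL, ENTROPY VISUAL and TOPIC COVER VISUAL), assuming the article has already been reduced to its candidate content words and that visual_words and visual_topics come from the resource built above. The binning and tie-breaking details are our own simplifications.

```python
import math


def visual_features(candidate_words, visual_words, visual_topics, cover=0.5):
    """Sketch of a few Section 3.1 features over an article's candidate words."""
    n = len(candidate_words)
    matched = [w for w in candidate_words if w in visual_words]
    feats = {"TOTAL_VISUAL": len(matched) / n if n else 0.0}

    # ENTROPY_VISUAL: entropy of the visual-word proportions over five
    # equal-sized bins of consecutive words, normalized to a distribution.
    bins = [candidate_words[i * n // 5:(i + 1) * n // 5] for i in range(5)]
    props = [sum(w in visual_words for w in b) / len(b) for b in bins if b]
    total = sum(props)
    if total > 0:
        dist = [p / total for p in props]
        feats["ENTROPY_VISUAL"] = -sum(p * math.log(p, 2) for p in dist if p > 0)
    else:
        feats["ENTROPY_VISUAL"] = 0.0

    # TOPIC_COVER_VISUAL: greedy number of topics needed to cover half of the
    # article's distinct visual words.
    remaining = set(matched)
    target = cover * len(remaining)
    covered = n_topics = 0
    while remaining and covered < target:
        best = max(visual_topics, key=lambda topic: len(topic & remaining))
        gained = len(best & remaining)
        if gained == 0:
            break
        covered += gained
        remaining -= best
        n_topics += 1
    feats["TOPIC_COVER_VISUAL"] = n_topics
    return feats
```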
3.2 The use of people in the story

We hypothesized that articles containing research findings that directly affect people in some way, and that therefore involve explicit references to people in the story, will make a bigger impact on the reader. For example, the most frequent topic among our VERY GOOD samples is 'medicine and health', where articles are often written from the point of view of a patient, doctor or scientist. An example is below.

Dr. Remington was born in Reedville, Va., in 1922, to Maud and P. Sheldon Remington, a school headmaster. Charles spent his boyhood chasing butterflies alongside his father, also a collector. During his graduate studies at Harvard, he founded the Lepidopterists' Society with an equally butterfly-smitten undergraduate, Harry Clench.

We approximate this facet by counting the number of explicit references to people, relying on three sources of information about the animacy of words. The first is the named entity (NE) tags (PERSON, ORGANIZATION and LOCATION) returned by the Stanford NE recognition tool (Finkel et al., 2005). We also created a list of personal pronouns such as 'he', 'myself', etc., which standardly indicate animate entities (animate pronouns). Our third resource records the number of times different noun phrases (NPs) were followed by each of the relative pronouns 'who', 'where' and 'which'. These counts for 664,673 noun phrases were collected by Ji and Lin (2009) from the Google Ngram Corpus (Lin et al., 2010). We use a simple heuristic to obtain lists of animate (google animate) and inanimate nouns (google inanimate) from this resource. The head of each NP is taken as a candidate noun. If the noun does not occur with 'who' in any of the noun phrases where it is the head, then it is inanimate. In contrast, if it appears only with 'who' in all noun phrases, it is animate. Otherwise, for each NP where the noun is the head, we check whether the number of times the noun phrase appeared with 'who' is greater than each of its occurrences with 'which', 'where' and 'when' (taken individually). If this condition holds for at least one noun phrase, the noun is marked as animate.

When computing the features for an article, we consider all nouns and pronouns as candidate words. If the word is a pronoun, it is assigned an 'animate' label if it appears in our list of animate pronouns and 'inanimate' otherwise. If the word is a proper noun tagged with the PERSON NE tag, we mark it as 'animate'; if it carries an ORGANIZATION or LOCATION tag, the word is 'inanimate'. For common nouns, we check whether the word appears in the google animate or inanimate lists and label any match accordingly. Note that this procedure may leave some nouns without a label.

Our features are the counts of animate tokens (ANIM) and inanimate tokens (INANIM), and both of these counts normalized by the total words in the article (ANIM PROP, INANIM PROP). Three of these features had significantly higher mean values in the TYPICAL category of articles: ANIM, ANIM PROP and INANIM PROP. We observed that several articles about government policies involve many references to people but are often in the TYPICAL category. These findings suggest that the 'human' dimension might need to be computed not only from simple counts of references to people but also with finer distinctions between them.
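The animacy decision rule for the Google Ngram noun list can be written directly over the per-noun-phrase relative-pronoun counts. A sketch is below; the input format (a list of count dictionaries per head noun) is our assumption about how the Ji and Lin (2009) resource is stored, and the fall-through None label is our reading of "may leave some nouns without a label".

```python
def classify_head_noun(np_counts):
    """Animacy label for one head noun, following the Section 3.2 heuristic.

    np_counts: one dict per noun phrase headed by this noun, e.g.
               {"who": 12, "which": 3, "where": 0, "when": 1}.
    Returns "animate", "inanimate", or None (left unlabelled).
    """
    pronouns = ("which", "where", "when")
    who_total = sum(c.get("who", 0) for c in np_counts)
    other_total = sum(c.get(p, 0) for c in np_counts for p in pronouns)

    if who_total == 0:
        return "inanimate"        # never observed with 'who'
    if other_total == 0:
        return "animate"          # observed only with 'who'

    # Otherwise: animate if, for at least one noun phrase, the 'who' count
    # exceeds each of the 'which', 'where' and 'when' counts individually.
    for counts in np_counts:
        who = counts.get("who", 0)
        if all(who > counts.get(p, 0) for p in pronouns):
            return "animate"
    return None
```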
3.3 Beautiful language

Beautiful phrasing and word choice can entertain the reader and leave a positive impression. Multiple studies in the education genre (Diederich, 1974; Spandel, 2004) note that when teachers and expert adult readers graded student writing, word choice and phrasing always turned out to be significant factors influencing the raters' scores.

We implement a method for detecting creative language based on the simple idea that creative words and phrases are often those used in unusual contexts and combinations, or those that sound unusual. We compute measures of unusual language both at the level of individual words and for combinations of words in a syntactic relation.

Word level measures: Unusual words in an article are likely to be those with low frequencies in a background corpus. We use the full set of articles (not only science) from year 1996 of the NYT corpus as a background (these do not overlap with our article quality corpus). We also explore patterns of letter and phoneme sequences, with the idea that unusual combinations of characters and phonemes can create interesting words. We used the CMU pronunciation dictionary (Weide, 1998) to get the phoneme sequences for words and built a 4-gram model of phonemes on the background corpus, with Laplace smoothing used to compute probabilities from the model. However, the CMU dictionary does not contain phoneme information for several words in our corpus, so we also compute an approximate model over the letters in the words, giving another 4-gram model. (We found during development that higher-order n-grams provided better predictions of unusualness.) Only words that are longer than 4 characters are used in both models, and we filter out proper names, named entities and numbers.

During development, we analyzed the articles from an entire year of the NYT, 1997, with the three models to identify unusual words. Below are the words with the lowest frequency and those with the highest perplexity under the phoneme and letter models.

Low frequency: undersheriff, woggle, ahmok, hofman, volga, oceanaut, trachoma, baneful, truffler, acrimal, corvair, entomopter
High perplexity, phoneme model: showroom, yahoo, dossier, powwow, plowshare, oomph, chihuahua, ionosphere, boudoir, superb, zaire, oeuvre
High perplexity, letter model: kudzu, muumuu, qipao, yugoslav, kohlrabi, iraqi, yaqui, yakuza, jujitsu, oeuvre, yaohan, kaffiyeh

For computing the features, we consider only nouns, verbs, adjectives and adverbs. We also require that the words are at least 5 letters long and do not contain a hyphen. (We noticed that in this genre several new words are created by using a hyphen to concatenate common words.) Three types of scores are computed. FREQ NYT is the average of the word frequencies computed from the background corpus. The second set of features is based on the phoneme model: we compute the average perplexity of all words under the model (AVR PHONEME PERP ALL), and, ordering the words in an article by decreasing perplexity, we add the average perplexity of the top 10, 20 and 30 words as features (AVR PHONEME PERP 10, 20, 30). We obtain similar features from the letter n-gram model (AVR CHAR PERP ALL, AVR CHAR PERP 10, 20, 30). For the phoneme features, we ignore words that do not have an entry in the CMU dictionary.
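The letter model above is a standard n-gram model with Laplace smoothing; the phoneme model is the same construction over CMU-dictionary phoneme sequences. A minimal character-level sketch is below; the padding symbols, log base and smoothing details are our own choices rather than those of the original implementation.

```python
import math
from collections import Counter


class CharNgramModel:
    """Laplace-smoothed character 4-gram model for scoring word unusualness."""

    def __init__(self, n=4):
        self.n = n
        self.ngrams = Counter()     # (context, next character) counts
        self.contexts = Counter()   # context counts
        self.alphabet = set()

    def _events(self, word):
        chars = "^" * (self.n - 1) + word + "$"       # pad word boundaries
        for i in range(len(chars) - self.n + 1):
            yield chars[i:i + self.n - 1], chars[i + self.n - 1]

    def train(self, words):
        for word in words:
            for ctx, ch in self._events(word):
                self.ngrams[(ctx, ch)] += 1
                self.contexts[ctx] += 1
                self.alphabet.add(ch)

    def perplexity(self, word):
        """Per-character perplexity; higher means a rarer letter sequence."""
        v = len(self.alphabet)
        log_prob, count = 0.0, 0
        for ctx, ch in self._events(word):
            p = (self.ngrams[(ctx, ch)] + 1) / (self.contexts[ctx] + v)  # Laplace
            log_prob += math.log(p)
            count += 1
        return math.exp(-log_prob / count)

# Usage: train on the background-corpus words, rank an article's words by
# perplexity, and average the top 10/20/30 for the AVR_CHAR_PERP features.
```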
Word pair measures: Next we attempt to detect unusual combinations of words. We do this only for certain types of syntactic relations: a) nouns and their adjective modifiers, b) verbs with adverb modifiers, c) adjacent nouns in a noun phrase, and d) verb and subject pairs. Counts for co-occurrence again come from the NYT 1996 articles. The syntactic relations are obtained using the constituency and dependency parses from the Stanford parser (Klein and Manning, 2003; De Marneffe et al., 2006). To avoid the influence of proper names and named entities, we replace them with tags (NNP for proper names and PERSON, ORG, LOC for named entities).

We treat the words for which the dependency holds as an (auxiliary word, main word) pair. For adjective-noun and adverb-verb pairs, the auxiliary is the adjective or adverb; for noun-noun pairs, it is the first noun; and for verb-subject pairs, the auxiliary is the subject. Our idea is to compute usualness scores based on the frequency with which a particular pair of words appears in the background corpus.

Specifically, we compute the conditional probability of the auxiliary word given the main word as the score for the likelihood of observing the pair. We consider the main word to be related to the article topic, so we use the conditional probability of the auxiliary given the main word and not the other way around. However, the conditional probability carries no information about the frequency of the auxiliary word itself. So we apply ideas from interpolation smoothing (Chen and Goodman, 1996) and compute the score as an interpolation of the conditional probability and the unigram probability of the auxiliary word:

p̂(aux | main) = λ p(aux | main) + (1 − λ) p(aux)

The unigram and conditional probabilities are themselves smoothed using the Laplace method. We train the λ values to optimize data likelihood using the Baum-Welch algorithm, with the pairs from the NYT 1997 articles as a development set. The λ values for all types of pairs tended to be lower than 0.5, giving higher weight to the unigram probability of the auxiliary word.

Based on our observations on the development set, we picked a cutoff of 0.0001 on this probability (0.001 for adverb-verb pairs) and consider phrases with probability below this value as unusual. For each test article, we compute the number of unusual phrases (total over all categories) as a feature (SURP), and also this value normalized by the total number of word tokens in the article (SURP WD) and by the number of phrases (SURP PH). We also compute features for the individual pair types; in each case, the number of unusual phrases is normalized by the total words in the article (SURP ADJ NOUN, SURP ADV VERB, SURP NOUN NOUN, SURP SUBJ VERB).

The top unusual word pairs under the different pair types are shown in Table 1. These were computed on pairs from a random set of articles from our corpus. Several of the top pairs involve hyphenated words, which are unusual by themselves, so the table shows only the top pairs without hyphens.

Table 1: Unusual word pairs from different categories
ADJ-NOUN: hypoactive NNP, plasticky woman, psychogenic problems, yoplait television, subminimal level, ehatchery investment
ADV-VERB: suburbs said, integral was, collective do, physiologically do, amuck run, illegitimately put
NOUN-NOUN: specification today, auditory system, pal programs, steganography programs, wastewater system, autism conference
SUBJ-VERB: blog said, briefer said, hr said, knucklehead said, lymphedema have, permissions have
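The surprisal test on word pairs reduces to thresholding the interpolated probability above. A minimal sketch with Laplace-smoothed counts follows; it fixes λ at a value below 0.5 instead of learning it with Baum-Welch, which is a simplification of the procedure described in the text.

```python
from collections import Counter


class PairSurprisal:
    """Flag (auxiliary, main) pairs whose interpolated probability
    p_hat(aux|main) = lam * p(aux|main) + (1 - lam) * p(aux)
    falls below a cutoff."""

    def __init__(self, background_pairs, lam=0.4):
        # background_pairs: iterable of (aux, main) tuples from the background corpus
        self.lam = lam
        self.pair_counts = Counter(background_pairs)
        self.aux_counts, self.main_counts = Counter(), Counter()
        for (aux, main), count in self.pair_counts.items():
            self.aux_counts[aux] += count
            self.main_counts[main] += count
        self.total = sum(self.aux_counts.values())
        self.aux_vocab = len(self.aux_counts)

    def p_hat(self, aux, main):
        # Laplace-smoothed conditional and unigram probabilities.
        p_cond = (self.pair_counts[(aux, main)] + 1) / \
                 (self.main_counts[main] + self.aux_vocab)
        p_aux = (self.aux_counts[aux] + 1) / (self.total + self.aux_vocab)
        return self.lam * p_cond + (1 - self.lam) * p_aux

    def is_unusual(self, aux, main, cutoff=1e-4):
        return self.p_hat(aux, main) < cutoff
```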
Most of these features are significantly different between the two classes. Those with higher values in the VERY GOOD set include AVR PHONEME PERP ALL, AVR CHAR PERP (ALL, 10), SURP, SURP PH, SURP WD, SURP ADJ NOUN, SURP NOUN NOUN and SURP SUBJ VERB. The FREQ NYT feature has a higher value in the TYPICAL class. All these trends indicate that unusual phrases are associated with the VERY GOOD category of articles.

3.4 Sub-genres

There are several sub-genres within science writing (Stocking, 2010): short descriptions of discoveries, longer explanatory articles, narratives, stories about scientists, reports on meetings, review articles and blog posts. Naturally, some of these sub-genres will be more appealing to readers. To investigate this aspect, we compute scores for three sub-genres of interest: narrative, attribution and interview.

Narrative texts typically have characters and events (Nakhimovsky, 1988), so we look for entities and past tense in the articles. We count the number of sentences where the first verb in surface order is in the past tense. Then, among these sentences, we pick those which have either a personal pronoun or a proper noun before the target verb (again in surface order). The proportion of such sentences in the text is taken as the NARRATIVE score.

We also developed a measure of the degree to which the article's content is attributed to external sources, as opposed to being the author's own statements. Attribution to other sources is frequent in the news domain, since many comments and opinions are not the views of the journalist (Semetko and Valkenburg, 2000). For science news, attribution becomes even more important, since the research findings were obtained by scientists and are reported in a second-hand manner by the journalists. The ATTRIB score is the proportion of sentences in the article that contain a quote symbol or the words 'said' or 'says'.

We also compute a score to indicate whether the article is the account of an interview. There are easy clues in the NYT for this genre, with paragraphs in the interview portion of the article beginning with either 'Q.' (question) or 'A.' (answer). We count the total number of 'Q.' and 'A.' prefixes combined and divide this value by the total number of sentences (INTERVIEW). When either the number of 'Q.' tags or the number of 'A.' tags is zero, the score is set to zero.

All three scores are significantly higher for the TYPICAL class.

3.5 Affective content

Some articles, for example those detailing research on health, crime or ethics, can provoke emotional reactions in readers, as in the snippet below.

Medicine is a constant trade-off, a struggle to cure the disease without killing the patient first. Chemotherapy, for example, involves purposely poisoning someone – but with the expectation that the short-term injury will be outweighed by the eventual benefits.

We compute affect-related features using three lexicons. MPQA (Wilson et al., 2005) and the General Inquirer (Stone et al., 1966) give lists of positive and negative sentiment words; the third resource is the set of emotion-related words from FrameNet (Baker et al., 1998). The sizes of these lexicons are 8,221, 5,395 and 653 words respectively. We compute the counts of positive, negative, polar and emotion words, each normalized by the total number of content words in the article (POS PROP, NEG PROP, POLAR PROP, EMOT PROP). We also include the proportion of emotion and polar words taken together (POLAR EMOT PROP) and the ratio between the counts of positive and negative words (POS BY NEG).
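The sub-genre and affect scores in Sections 3.4 and 3.5 are simple normalized counts over sentences and tokens. A sketch of the ATTRIB, INTERVIEW and sentiment-proportion scores is given below; sentence and paragraph segmentation and the lexicons (standing in for MPQA, General Inquirer and FrameNet) are assumed to be supplied by the caller.

```python
import re


def attrib_score(sentences):
    """ATTRIB: proportion of sentences containing a quote mark or 'said'/'says'."""
    if not sentences:
        return 0.0
    hits = sum(1 for s in sentences
               if '"' in s or re.search(r"\b(said|says)\b", s))
    return hits / len(sentences)


def interview_score(paragraphs, n_sentences):
    """INTERVIEW: count of 'Q.'/'A.' paragraph prefixes over the sentence count;
    zero unless both question and answer prefixes occur."""
    q = sum(1 for p in paragraphs if p.lstrip().startswith("Q."))
    a = sum(1 for p in paragraphs if p.lstrip().startswith("A."))
    if q == 0 or a == 0 or n_sentences == 0:
        return 0.0
    return (q + a) / n_sentences


def affect_proportions(content_words, positive, negative, emotion):
    """Lexicon-based proportions from Section 3.5."""
    n = len(content_words) or 1
    pos = sum(w in positive for w in content_words)
    neg = sum(w in negative for w in content_words)
    emo = sum(w in emotion for w in content_words)
    return {
        "POS_PROP": pos / n,
        "NEG_PROP": neg / n,
        "POLAR_PROP": (pos + neg) / n,
        "EMOT_PROP": emo / n,
        "POLAR_EMOT_PROP": (pos + neg + emo) / n,
        "POS_BY_NEG": pos / neg if neg else 0.0,
    }
```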
The features with higher values in the VERY GOOD class are NEG PROP, POLAR PROP and POLAR EMOT PROP; in TYPICAL articles, POS BY NEG and EMOT PROP have higher values. VERY GOOD articles thus have more sentiment words, mostly skewed towards negative sentiment.

3.6 Amount of research content

For a lay audience, a science writer presents only the most relevant findings and methods from a research study and interleaves research information with details about the relevance of the finding, the people involved in the research and general information about the topic. As a result, the degree of explicit research description in articles varies considerably.

To test how this aspect is associated with quality, we count references to research methods and researchers in the article. We use the research dictionary introduced in Section 2 as the source of research-related words. We count the total number of words in the article that match the dictionary (RES TOTAL) and the number of unique matching words (RES UNIQ). We also normalize these counts by the total words in the article, creating the features RES TOTAL PROP and RES UNIQ PROP. All four features have significantly higher values in the VERY GOOD articles, indicating that great articles are also associated with a greater amount of direct research content and explanation.

4 Classification accuracy

We trained classifiers using all the above features for the two settings, 'any-topic' and 'same-topic', introduced in Section 2.3. The baseline random accuracy in both cases is 50%. We use an SVM classifier with a radial basis kernel (R Development Core Team, 2011), with parameters tuned using cross-validation on the development data. The best parameters were then used to classify the test set in a 10-fold cross-validation setting: we divide the test set into 10 parts, train on 9 parts and test on the held-out part. The average accuracies over the 10 folds are 75.3% for the 'any-topic' setup and 68% for the topic-paired 'same-topic' setup. Both are considerable improvements over the baseline.
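The experiments were run with an RBF-kernel SVM in R; an equivalent sketch with scikit-learn is shown below. The parameter grid and the scaling step are our choices for illustration, not the settings reported in the paper.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def evaluate(X_dev, y_dev, X_test, y_test):
    """Tune an RBF-kernel SVM on the development data, then report the mean
    accuracy over 10-fold cross-validation on the test set (Section 4)."""
    pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    search = GridSearchCV(
        pipeline,
        {"svc__C": [0.1, 1, 10, 100], "svc__gamma": ["scale", 0.01, 0.001]},
        cv=5,
        scoring="accuracy",
    ).fit(X_dev, y_dev)

    tuned = search.best_estimator_
    fold_accuracies = cross_val_score(tuned, X_test, y_test,
                                      cv=10, scoring="accuracy")
    return np.mean(fold_accuracies)
```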
The 'same-topic' data contains article pairs with varying similarity, so we investigate the relationship between topic similarity and prediction accuracy more closely for this setting. We divide the article pairs into bins based on their similarity values, compute the 10-fold cross-validation predictions using each of the feature classes above, collect the predicted values across all the folds, and then compute the accuracy of the examples within each bin. These results are plotted in Figure 1, where int-science refers to the full set of features and results for the six individual feature classes are also shown.

Figure 1: Accuracy on pairs with different similarity.

As the similarity increases, the prediction task becomes harder. The combination of all features gives 66% accuracy for pairs above 0.4 similarity and 74% when the similarity is less than 0.15. Most individual feature classes show a similar trend. This result is understandable because articles on similar topics can exhibit similar properties. For example, two articles about 'controversies surrounding vaccination' are likely to have similar levels of people-oriented content or to be written in a narrative style, in the same way that two space-related articles are both likely to contain high visual content. There are, however, two exceptions: affect and research. For these features, the accuracies improve with higher similarity: affect features give 51% for pairs with similarity 0.1 and 62% for pairs above 0.4 similarity, and the accuracy of the research features goes from 52% to 57% for the same similarity values. This finding illustrates that even articles on very similar topics can be written differently, with the articles by the excellent authors associated with a greater degree of sentiment and a deeper treatment of the research problem.

5 Combining aspects of article quality

We now compare and combine the genre-specific interest-science features (41 in total) with those discussed in work on readability, well-written nature, interest and topic classification.

Readability (16 features): We test the full set of readability features studied in Pitler and Nenkova (2008), involving token-type ratio, word and sentence length, language model features, cohesion scores and syntactic estimates of complexity.

Well-written nature (23 features): For well-written nature, we use two classes of features, both related to discourse. One is the probabilities of different types of entity transitions from the Entity Grid model (Barzilay and Lapata, 2008), which we compute with the Brown Coherence Toolkit (Elsner et al., 2007). The other class consists of the features defined in Pitler and Nenkova (2008) for the likelihoods and counts of explicit discourse relations. We identified the relations for texts in our corpus using the AddDiscourse tool (Pitler and Nenkova, 2009).
However, despite the high accuracy, word features are not easily interpretable in different classes re- lated to writing as we have done with other writing features. Further, the total set of writing features is 6For classifiers involving content features, we did not tune the SVM parameters because of the small size of development data compared to number of features. Default SVM settings were used instead. Feature set Any Topic Same Interesting-science 75.3 68.0 Readable 65.5 63.0 Well-written 59.1 59.9 Interesting-fiction 67.9 62.8 Readable + well-writ 64.7 64.3 Readable + well-writ + Int-fict 71.0 70.3 Readable + well-writ + Int-sci 79.5 73.2 All writing aspects 76.7 74.7 Content (500 words) 81.7 79.4 Content (1000 words) 81.2 82.1 Combination: Writing (all) + Content (1000w) In feature vector 82.6* 84.0* Sum of confidence scores 81.6 84.9 Oracle 87.6 93.8 Table 2: Accuracy of different article quality aspects only 102 in contrast to 1000 word features. In our interest-science feature set, we aimed to highlight how well some of the aspects considered important to good science writing can predict quality ratings. We also combined writing and word features to mix topic with writing related predictors. We do the combination in three ways a) word and writing fea- tures are included together in the feature vector; b) two separate classifiers are trained (one using word features and the other using writing ones) and the sum of confidence measures is used to decide on the final class; c) an oracle method: two classifiers are trained just as in option (b) but when they disagree on the class, we pick the correct label. The oracle method gives a simple upper bound on the accuracy obtainable by combination. These values are 87% for ‘any-topic’ and a higher 93.8% for ‘same-topic’. The automatic methods, both feature vector combi- nation and classifier combination also give better ac- curacies than only the word or writing features. The accuracies for the folds from 10 fold cross valida- tion in the feature vector combination method were also found to be significantly higher than those from word features only (using a paired Wilcoxon signed- rank test). Therefore both topic and writing features are clearly useful for identifying great articles. 6 Conclusion Our work is a step towards measuring overall arti- cle quality by showing the complementary benefits of general and domain-specific writing measures as well as indicators of article topic. In future we plan to focus on development of more features as well as better methods for combining different measures. 350 References C. F. Baker, C. J. Fillmore, and J. B. Lowe. 1998. The berkeley framenet project. In Proceedings of COLING-ACL, pages 86–90. Z. Bao, B. Kimelfeld, and Y. Li. 2011. A graph ap- proach to spelling correction in domain-centric search. In Proceedings of ACL-HLT, pages 905–914. R. Barzilay and M. Lapata. 2008. Modeling local coher- ence: An entity-based approach. Computational Lin- guistics, 34(1):1–34. R. Barzilay, N. Elhadad, and K. McKeown. 2002. Inferring strategies for sentence ordering in multi- document summar ization. Journal of Artificial Intel- ligence Research, 17:35–55. D.M. Blei, A.Y. Ng, and M.I. Jordan. 2003. Latent dirichlet allocation. the Journal of machine Learning research, 3:993–1022. D. Blum, M. Knudson, and R. M. Henig, editors. 2006. A field guide for science writers: the official guide of the national association of science writers. Oxford University Press, New York. E. Brill and R.C. Moore. 2000. 
E. Brill and R. C. Moore. 2000. An improved error model for noisy channel spelling correction. In Proceedings of ACL, pages 286–293.
S. F. Chen and J. Goodman. 1996. An empirical study of smoothing techniques for language modeling. In Proceedings of ACL, pages 310–318.
S. Cucerzan and E. Brill. 2004. Spelling correction as an iterative process that exploits the collective knowledge of web users. In Proceedings of EMNLP, pages 293–300.
R. Dale and A. Kilgarriff. 2010. Helping our own: Text massaging for computational linguistics as a new shared task. In Proceedings of INLG, pages 263–267.
M. C. De Marneffe, B. MacCartney, and C. D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC, volume 6, pages 449–454.
P. Diederich. 1974. Measuring Growth in English. National Council of Teachers of English.
M. Elsner, J. Austerweil, and E. Charniak. 2007. A unified local and global model for discourse coherence. In Proceedings of NAACL-HLT, pages 436–443.
J. R. Finkel, T. Grenager, and C. Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of ACL, pages 363–370.
R. Flesch. 1948. A new readability yardstick. Journal of Applied Psychology, 32:221–233.
A. C. Graesser, D. S. McNamara, M. M. Louwerse, and Z. Cai. 2004. Coh-Metrix: Analysis of text on cohesion and language. Behavior Research Methods, Instruments, and Computers, 36(2):193–202.
H. Ji and D. Lin. 2009. Gender and animacy knowledge discovery from web-scale n-grams for unsupervised person name detection. In Proceedings of PACLIC.
D. Klein and C. D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of ACL, pages 423–430.
S. M. Kosslyn. 1980. Image and Mind. Harvard University Press.
M. Lapata. 2003. Probabilistic text structuring: Experiments with sentence ordering. In Proceedings of ACL, pages 545–552.
C. Lin and E. Hovy. 2000. The automated acquisition of topic signatures for text summarization. In Proceedings of COLING, pages 495–501.
D. Lin, K. W. Church, H. Ji, S. Sekine, D. Yarowsky, S. Bergsma, K. Patil, E. Pitler, R. Lathbury, V. Rao, K. Dalwani, and S. Narsale. 2010. New tools for web-scale n-grams. In Proceedings of LREC.
N. McIntyre and M. Lapata. 2009. Learning to tell tales: A data-driven approach to story generation. In Proceedings of ACL-IJCNLP, pages 217–225.
R. J. Mooney and L. Roy. 2000. Content-based book recommending using learning for text categorization. In Proceedings of the Fifth ACM Conference on Digital Libraries, pages 195–204.
A. Nakhimovsky. 1988. Aspect, aspectual class, and the temporal structure of narrative. Computational Linguistics, 14(2):29–43.
M. Pazzani, J. Muramatsu, and D. Billsus. 1996. Syskill & Webert: Identifying interesting web sites. In Proceedings of AAAI, pages 54–61.
E. Pitler and A. Nenkova. 2008. Revisiting readability: A unified framework for predicting text quality. In Proceedings of EMNLP, pages 186–195.
E. Pitler and A. Nenkova. 2009. Using syntax to disambiguate explicit discourse connectives in text. In Proceedings of ACL-IJCNLP, pages 13–16.
R Development Core Team. 2011. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing.
D. Ramage, D. Hall, R. Nallapati, and C. D. Manning. 2009. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of EMNLP, pages 248–256.
A. Rozovskaya and D. Roth. 2010. Generating confusion sets for context-sensitive error correction. In Proceedings of EMNLP, pages 961–970.
E. Sandhaus. 2008. The New York Times Annotated Corpus. Corpus number LDC2008T19, Linguistic Data Consortium, Philadelphia.
S. Schwarm and M. Ostendorf. 2005. Reading level assessment using support vector machines and statistical language models. In Proceedings of ACL, pages 523–530.
H. A. Semetko and P. M. Valkenburg. 2000. Framing European politics: A content analysis of press and television news. Journal of Communication, 50(2):93–109.
V. Spandel. 2004. Creating Writers Through 6-Trait Writing: Assessment and Instruction. Allyn and Bacon, Inc.
S. H. Stocking. 2010. The New York Times Reader: Science and Technology. CQ Press, Washington DC.
P. J. Stone, J. Kirsh, and Cambridge Computer Associates. 1966. The General Inquirer: A Computer Approach to Content Analysis. MIT Press.
J. R. Tetreault and M. Chodorow. 2008. The ups and downs of preposition error detection in ESL writing. In Proceedings of COLING, pages 865–872.
L. von Ahn and L. Dabbish. 2004. Labeling images with a computer game. In Proceedings of CHI, pages 319–326.
R. L. Weide. 1998. The CMU pronunciation dictionary, release 0.6. http://www.speech.cs.cmu.edu/cgi-bin/cmudict.
T. Wilson, J. Wiebe, and P. Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of HLT-EMNLP, pages 347–354.
M. Wilson. 1988. MRC psycholinguistic database: Machine-usable dictionary, version 2.00. Behavior Research Methods, 20(1):6–10.