Mining Goodreads. A Digital Humanities Project for the Study of Reading Absorption

Simone Rebora (1, 2, *), Moniek Kuijpers (1), Piroska Lendvai (1)

(1) University of Basel, Switzerland
(2) University of Verona, Italy

* Corresponding author: Simone Rebora, simone.rebora@unibas.ch

Abstract

We present our method and interim results of the "Mining Goodreads" project, aimed at developing a computational approach to measuring reading absorption in user-generated book reviews in English. A team of eight people (three supervisors and five annotators) has combined skills from the fields of empirical literary studies, natural language processing, and digital humanities, with the goal of producing a gold-standard annotated dataset and strengthening the theoretical framework of reading absorption. The annotation of more than 800 texts revealed the difficulty of reaching agreement in the tagging of sentences. However, over more than a year of close collaboration, the team achieved substantial improvements: inter-annotator agreement increased across seven annotation rounds, while machine learning approaches applied to the annotated corpus produced promising results.

Keywords: reading absorption; social reading; empirical literary studies; machine learning; inter-annotator agreement

I THE PROJECT'S IDEA

The "Mining Goodreads" project is conceived as a computational expansion of empirical literary studies. Empirical studies frequently use methods such as interviews or questionnaires to test theories and verify hypotheses, but they can also involve technologies such as eye-tracking and fMRI scans (for a general introduction, see [Peer, Hakemulder, and Zyngier 2012]). In all cases, the direct involvement of readers in experiments is required to investigate reading experiences and their effects.

One of the most researched topics in empirical literary studies is narrative absorption, understood as the sensation of being absorbed into a story (see [Hakemulder, Kuijpers, and Tan 2017]). [Kuijpers et al. 2014] developed the Story World Absorption Scale (SWAS), a questionnaire aimed at measuring different dimensions of absorption in the fictional worlds of literature. The questionnaire is built on a theorization that distinguishes four main dimensions: attention (focused attention on the text, reducing awareness of the self and of the passing of time), transportation (the feeling of having traveled to the story world), emotional engagement, and mental imagery. A total of 18 statements express different facets of what it is like to feel absorbed in a story (e.g., "When I finished the story I was surprised to see that time had gone by so fast", or "I could imagine what it must be like to be in the shoes of the main character"). In experiments using the SWAS, participants read narratives and rate their agreement with each of the 18 statements on a 7-point Likert scale. Statistically significant trends among readers' answers can then be used to quantify the intrinsically absorbing properties of texts. The SWAS has been empirically validated and used in multiple studies (e.g., [Bálint et al. 2016; Hartung et al. 2017; Kuzmičová et al. 2017]).
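As a minimal illustration of how SWAS responses are typically aggregated, the sketch below computes per-dimension mean scores from 7-point Likert ratings. The item-to-dimension mapping and the data are invented for the example; the actual assignment of the 18 statements to the four dimensions is defined in [Kuijpers et al. 2014].

```python
import numpy as np

# Hypothetical mapping of the 18 SWAS items to the four dimensions
# (illustrative only; the real mapping follows Kuijpers et al. 2014).
DIMENSIONS = {
    "attention":            [0, 1, 2, 3],
    "transportation":       [4, 5, 6, 7, 8],
    "emotional_engagement": [9, 10, 11, 12, 13],
    "mental_imagery":       [14, 15, 16, 17],
}

def swas_dimension_scores(ratings):
    """ratings: (n_participants, 18) array of 7-point Likert responses."""
    ratings = np.asarray(ratings)
    return {dim: ratings[:, items].mean() for dim, items in DIMENSIONS.items()}

# Example: three participants rating one story (random toy data).
ratings = np.random.randint(1, 8, size=(3, 18))
print(swas_dimension_scores(ratings))
```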
[Rebora, Lendvai, and Kuijpers 2018] showed a possible alternative use of the SWAS, building on the fact that many sentences in reviews published on the Goodreads platform [http1] overlap semantically and conceptually with SWAS statements. For example, one reviewer writes: "I'm so absorbed in the world Martin produced out of his wits" (a sentence that matches the SWAS statement "I felt absorbed in the story"); another reviewer expresses her identification with the main character: "I went through all the emotional ups and downs right along with her" (matching "I felt how the main character was feeling"). This phenomenon offers the possibility of using the SWAS without directly involving readers in experiments: an estimate of the absorbing properties of a book (or of a literary genre) can be inferred directly from its reviews. The possible noisiness and unreliability of individual reviews is countered by the fact that Goodreads hosts about 90 million reviews [http2]: a big-data repository that can be studied from a "distant reading" perspective. Inevitably, a repository of this size requires computational methods for its analysis.
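To illustrate the kind of text-similarity matching that underlies this approach, here is a minimal sketch that ranks review sentences by cosine similarity to SWAS statements using sentence embeddings. The sentence-transformers model named here is a common general-purpose choice, not necessarily the method used in [Rebora, Lendvai, and Kuijpers 2018]; the review sentences are the examples quoted above plus one distractor.

```python
from sentence_transformers import SentenceTransformer, util

# Any general-purpose sentence encoder works for this sketch.
model = SentenceTransformer("all-MiniLM-L6-v2")

swas_statements = [
    "I felt absorbed in the story",
    "I felt how the main character was feeling",
]
review_sentences = [
    "I'm so absorbed in the world Martin produced out of his wits",
    "I went through all the emotional ups and downs right along with her",
    "The paperback edition has a terrible cover",
]

swas_emb = model.encode(swas_statements, convert_to_tensor=True)
review_emb = model.encode(review_sentences, convert_to_tensor=True)

# Cosine similarity between every review sentence and every SWAS statement.
scores = util.cos_sim(review_emb, swas_emb)
for sent, row in zip(review_sentences, scores):
    best = row.argmax().item()
    print(f"{row[best].item():.2f}  {sent!r}  ->  {swas_statements[best]!r}")
```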
II THE PROJECT'S STRUCTURE

2.1 People

The "Mining Goodreads" project, funded by the Swiss National Science Foundation under the "Digital Lives" funding scheme (grant number 10DL15_183194), involves a team of eight people. The three supervisors embody the three main disciplines involved: empirical literary studies, which provides the theoretical framework on reading absorption; natural language processing, which develops methods to automatically identify and retrieve absorption statements; and digital humanities, which mediates between the two by grounding the research in a "distant reading" perspective. At the core of the project is the work of five annotators, whose goal is to generate annotations that will be adjudicated and consolidated into a ground-truth dataset for training algorithms of different types. Once these algorithms can recognize absorption statements with an acceptable level of accuracy, they will scale the analysis up to millions of reviews.

2.2 Resources

A corpus of about 6 million reviews (amounting to more than 900 million tokens) was generated by scraping the Goodreads website between 2018 and 2019. Titles were selected by focusing on 9 genres (as categorized by the Goodreads tagging system, which allows multiple genre assignments); general statistics are provided in Figure 1. Due to intellectual property and privacy issues, we have provisionally decided against publicly sharing the corpus. However, new European directives suggest the introduction of significant exceptions for text and data mining for research purposes, e.g. the Directive on Copyright in the Digital Single Market. Some of these exceptions have already been included in national laws, such as the Urheberrechts-Wissensgesellschafts-Gesetz in Germany, the country where our project started and where the entire scraping activity took place. We are currently evaluating the possibility of making the annotated corpus accessible under a specific license, after having complied with all legal and ethical requirements.

Figure 1. Proportions of genres in the corpus.

2.3 Procedure

Between March 2019 and May 2020, the five annotators tagged a total of 830 reviews. After two months of training (getting acquainted with the absorption framework and with the annotation infrastructure), the work was split into seven annotation rounds: in each round, annotators were assigned a new batch of reviews to annotate in parallel; at the end of each round, they met with the supervisors to discuss discrepancies in their annotation strategies. This procedure was aimed at improving inter-annotator agreement without directly interfering with the annotation work. The inter-round meetings also proved fundamental for strengthening (and eventually redefining) the theoretical framework of reading absorption. As shown in Table 1, the number of annotated reviews gradually diminished at each round (making more frequent meetings possible), while the number of tags increased substantially (mirroring an increasingly precise distinction of the phenomena to be tagged).

Annotation round   Annotated reviews   Number of tags
1                  180                 6
2                  200                 12
3                  150                 80
4                  60                  145
5                  90                  145
6                  75                  145
7                  75                  145

Table 1. Number of annotated reviews and tags per annotation round.

Tagsets were expanded using a hierarchical structure, in which all new tags can always be collapsed into a few higher-level tags: SWAS_specific, for sentences that show direct similarity with the SWAS statements; SWAS_related, for sentences not covered by the SWAS but listed in a wider taxonomy of reading absorption [Bálint et al. 2016]; and mention_SWAS, for mentions of SWAS concepts without reference to the actual reading experience of the user who wrote the review (e.g., "usually when I read a book, I like to be able to fully imagine what the world of the story looks like"). A Present/Absent flag was added to these labels to distinguish sentences that explicitly confirm or negate absorption concepts.

Annotations were initially performed using the brat platform [Stenetorp et al. 2012]; from round 4 onwards, the INCEpTION platform [Klie et al. 2018] was adopted, as it offered more advanced functionalities.

III PRELIMINARY RESULTS

3.1 Inter-annotator agreement

Figure 2 shows the evolution of Krippendorff's Alpha for the main tags across the seven rounds. There is a slight but steady improvement throughout the annotation process, which can be verified via the evolution of the "mean" and "all" scores: "mean" indicates the mean of the alpha scores for all of the tags (a single alpha score could not be calculated, because different tags could be assigned to the same sentence); "all" indicates the alpha score for a single unified tag, obtained by checking whether a sentence was annotated or not, independently of the assigned tag. In both cases, values move from fair (~0.2-0.4) to substantial agreement (~0.6-0.7). Among the high-level tags, SWAS_related_PRESENT reaches the highest values, while mention_SWAS_PRESENT scores the lowest, confirming the difficulty of recognizing absorption when no first-person experiences are mentioned.

Figure 2. Inter-annotator agreement (Krippendorff's Alpha) for the seven rounds of annotation. Alpha scores were calculated on a sentence basis (sentences split using spaCy, https://spacy.io/).

Figure 3 shows the evolution of the mean Cohen's Kappa scores for each annotator. Mean kappa scores were obtained by calculating the scores for all pairs of annotators (considering just the "all" tag) and then taking the mean value for each annotator. The values thus indicate how much one annotator agrees with all the others. Two main trends are evident: first, there is a clear improvement across the seven rounds (moving from fair/moderate to substantial agreement); second, two annotators consistently reach the highest scores, showing a better ability to agree with the others.

Figure 3. Inter-annotator agreement (mean pairwise Cohen's Kappa) for the seven rounds of annotation. Kappa scores were calculated on a sentence basis (sentences split using spaCy, https://spacy.io/).
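Agreement scores of this kind can be computed with standard libraries. The sketch below assumes that each annotator's sentence-level decisions for the "all" tag are stored as binary vectors (1 = sentence annotated, 0 = not annotated) and derives mean pairwise Cohen's Kappa (via scikit-learn) and Krippendorff's Alpha (via NLTK); the variable names and toy data are illustrative, and in the project these scores were computed per round and per tag.

```python
from itertools import combinations
import numpy as np
from sklearn.metrics import cohen_kappa_score
from nltk.metrics.agreement import AnnotationTask

# Toy data: binary "all"-tag decisions of 5 annotators on 10 sentences.
rng = np.random.default_rng(0)
annotations = {f"ann{i}": rng.integers(0, 2, size=10) for i in range(1, 6)}

# Mean pairwise Cohen's Kappa per annotator.
names = list(annotations)
pair_kappa = {(a, b): cohen_kappa_score(annotations[a], annotations[b])
              for a, b in combinations(names, 2)}
for name in names:
    scores = [k for pair, k in pair_kappa.items() if name in pair]
    print(name, "mean kappa:", round(float(np.mean(scores)), 3))

# Krippendorff's Alpha over all annotators, from (coder, item, label) triples.
triples = [(name, f"sent{i}", str(label))
           for name, labels in annotations.items()
           for i, label in enumerate(labels)]
print("Krippendorff's alpha:", round(AnnotationTask(data=triples).alpha(), 3))
```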
Curation is currently in progress. However, the first results confirm the trends already observed: mean agreement with the curator (mean Cohen's Kappa for the "all" tag) was 0.55 for the first round and reached 0.68 for the fourth.

3.2 Machine learning

We used several state-of-the-art machine learning approaches to train a binary classifier on the annotated reviews, cf. [Lendvai et al. 2020]. When the current full dataset became available for training, a fine-tuned version of BERT [Devlin et al. 2018] reached an F-score of 0.63 on the target class (i.e., detecting absorption statements), and a linear regression model stacked on the BERT predictions reached a mean absolute error of 0.08 (test set size: 149 reviews), cf. [Lendvai, Reichel, et al. 2020]. These results allow us to automate the annotation task and scale up the analysis of narrative absorption.
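As a sketch of this classification setup, the code below fine-tunes a BERT model for binary sentence classification with the Hugging Face transformers library. The checkpoint, hyperparameters, and two-example dataset are placeholders, not the project's actual configuration, which is described in [Lendvai et al. 2020].

```python
import numpy as np
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder data: review sentences with binary absorption labels.
train = Dataset.from_dict({
    "text": ["I felt absorbed in the story", "The shipping was slow"],
    "label": [1, 0],
})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

train = train.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    # F-score on the target (absorption) class, the measure cited above;
    # used when evaluating on a held-out set via trainer.evaluate().
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1_target": f1_score(labels, preds, pos_label=1)}

args = TrainingArguments(output_dir="absorption-bert", num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args, train_dataset=train,
                  compute_metrics=compute_metrics)
trainer.train()
```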
CONCLUSION

The "Mining Goodreads" project confirms the importance of interdisciplinary collaboration in the study of new phenomena such as digital social reading [Rebora et al. 2019]. The integration of empirical and computational methods also stimulates the definition of new research workflows in the wider context of digital humanities, where all the disciplines involved have the opportunity to reach relevant goals: from the development of a tool able to automatically recognize a complex linguistic and social phenomenon, to the refinement of the theoretical framework that defines it, to the broadening of literary studies towards unexplored grounds.

REFERENCES

Bálint, Katalin, Frank Hakemulder, Moniek M. Kuijpers, Miruna M. Doicaru, and Ed S. Tan. 2016. "Reconceptualizing Foregrounding: Identifying Response Strategies to Deviation in Absorbing Narratives." Scientific Study of Literature 6 (2): 176-207. https://doi.org/10.1075/ssol.6.2.02bal.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. "BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding." arXiv preprint arXiv:1810.04805.

Hakemulder, Jèmeljan, Moniek M. Kuijpers, and Ed S. Tan, eds. 2017. Narrative Absorption. Linguistic Approaches to Literature, vol. 27. Amsterdam; Philadelphia: John Benjamins Publishing Company.

Hartung, Franziska, Peter Withers, Peter Hagoort, and Roel M. Willems. 2017. "When Fiction Is Just as Real as Fact: No Differences in Reading Behavior between Stories Believed to Be Based on True or Fictional Events." Frontiers in Psychology 8 (September). https://doi.org/10.3389/fpsyg.2017.01618.

http1. www.goodreads.com.

http2. www.goodreads.com/about/us.

Klie, Jan-Christoph, Michael Bugert, Beto Boullosa, Richard Eckart de Castilho, and Iryna Gurevych. 2018. "The INCEpTION Platform: Machine-Assisted and Knowledge-Oriented Interactive Annotation." In Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, 5-9. Association for Computational Linguistics. http://tubiblio.ulb.tu-darmstadt.de/106270/.

Kuijpers, Moniek M., Frank Hakemulder, Ed S. Tan, and Miruna M. Doicaru. 2014. "Exploring Absorbing Reading Experiences: Developing and Validating a Self-Report Scale to Measure Story World Absorption." Scientific Study of Literature 4 (1): 89-122.

Kuzmičová, Anežka, Anne Mangen, Hildegunn Støle, and Anne Charlotte Begnum. 2017. "Literature and Readers' Empathy: A Qualitative Text Manipulation Study." Language and Literature: International Journal of Stylistics 26 (2): 137-52. https://doi.org/10.1177/0963947017704729.

Lendvai, Piroska, Sándor Darányi, Christian Geng, Moniek Kuijpers, Oier Lopez de Lacalle, Jean-Christophe Mensonides, Simone Rebora, and Uwe Reichel. 2020. "Detection of Reading Absorption in User-Generated Book Reviews: Resources Creation and Evaluation." In Proceedings of the 12th Language Resources and Evaluation Conference, 4835-4841. Marseille, France: European Language Resources Association. https://www.aclweb.org/anthology/2020.lrec-1.595.

Lendvai, Piroska, Uwe Reichel, Moniek Kuijpers, and Simone Rebora. 2020. "Ranking of Social Reading Reviews Based on Richness in Narrative Absorption." In SwissText and KONVENS 2020: 5th SwissText & 16th KONVENS Joint Conference.

Peer, Willie van, Jèmeljan Hakemulder, and Sonia Zyngier. 2012. Scientific Methods for the Humanities. Linguistic Approaches to Literature, vol. 13. Amsterdam; Philadelphia: John Benjamins Publishing Company.

Rebora, Simone, Peter Boot, Federico Pianzola, Brigitte Gasser, J. Berenike Herrmann, Maria Kraxenberger, Moniek Kuijpers, et al. 2019. "Digital Humanities and Digital Social Reading." OSF Preprint, November. https://doi.org/10.31219/osf.io/mf4nj.

Rebora, Simone, Piroska Lendvai, and Moniek Kuijpers. 2018. "Reader Experience Labeling Automatized: Text Similarity Classification of User-Generated Book Reviews." In EADH2018. Galway: EADH. https://eadh2018.exordo.com/programme/presentation/90.

Stenetorp, Pontus, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou, and Jun'ichi Tsujii. 2012. "brat: A Web-Based Tool for NLP-Assisted Text Annotation." In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, 102-107. Association for Computational Linguistics.