Summary of your 'study carrel'
==============================

This is a summary of your Distant Reader 'study carrel'. The Distant Reader harvested & cached your content into a collection/corpus. It then applied a set of natural language processing and text mining routines to the collection. The results of this process were reduced to a database file -- a 'study carrel'. The study carrel can then be queried, thus bringing to light specific characteristics of your collection. These characteristics can help you summarize the collection as well as enumerate things you might want to investigate more closely.

Eric Lease Morgan
May 27, 2019

Number of items in the collection; 'How big is my corpus?'
----------------------------------------------------------
12

Average length of all items measured in words; "More or less, how big is each item?"
-------------------------------------------------------------------------------------
5206

Average readability score of all items (0 = difficult; 100 = easy)
-------------------------------------------------------------------
56

Top 50 statistically significant keywords; "What is my collection about?"
--------------------------------------------------------------------------
4 datum; 3 machine; 2 system; 2 library; 2 learning; 2 image; 2 Learning; 1 research; 1 process; 1 problem; 1 pmss; 1 place; 1 new; 1 moral; 1 material; 1 human; 1 example; 1 archive; 1 algorithm; 1 University; 1 Tönnies; 1 Networks; 1 Microsoft; 1 Markov; 1 Machine; 1 MARC; 1 Kentucky; 1 Information; 1 Generative; 1 GAN; 1 Eastern; 1 Disciplinary; 1 Chinese; 1 Chicago; 1 Balke; 1 Adversarial

Top 50 lemmatized nouns; "What is discussed?"
---------------------------------------------
351 datum; 282 machine; 279 learning; 236 library; 149 system; 142 example; 135 research; 134 image; 129 process; 128 model; 116 algorithm; 109 information; 108 problem; 104 project; 100 result; 99 time; 96 way; 93 dataset; 91 tool; 87 work; 87 data; 86 set; 83 collection; 82 use; 73 place; 72 text; 72 researcher; 71 people; 71 file; 71 decision; 70 computer; 70 article; 68 material; 67 training; 67 question; 66 name; 66 application; 63 network; 61 archive; 59 case; 57 service; 55 value; 55 level; 52 number; 51 topic; 51 knowledge; 50 technology; 49 method; 47 classification; 46 type
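
Frequency lists like the noun list above are the product of routine part-of-speech tagging and lemmatization. As a point of reference only, here is a minimal sketch of how such a tally could be reproduced with spaCy; it is not the Distant Reader's own code, it assumes the en_core_web_sm model is installed, and the exact lemmas and counts it returns will vary with the model used.

    # Tally the most frequent noun lemmas in a text (illustrative sketch only).
    from collections import Counter
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def top_lemmatized_nouns(text, n=50):
        """Return the n most frequent noun lemmas in the given text."""
        doc = nlp(text)
        lemmas = [token.lemma_.lower() for token in doc if token.pos_ == "NOUN"]
        return Counter(lemmas).most_common(n)

    # Usage (hypothetical file name): top_lemmatized_nouns(open("some-item.txt").read())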

Top 50 proper nouns; "What are the names of persons or places?"
---------------------------------------------------------------
163 AI; 101 Learning; 90 Machine; 70 al; 60 ML; 59 Chicago; 52 Library; 45 Intelligence; 45 Artificial; 44 et; 39 University; 38 New; 32 Digital; 30 Google; 28 Data; 28 Daniel; 27 Johnson; 25 Information; 25 IEEE; 24 Research; 23 York; 23 GAN; 23 Adversarial; 22 Science; 22 Microsoft; 22 Generative; 21 n.d; 21 Review; 21 Networks; 21 MARC; 20 Press; 20 Journal; 19 Technology; 18 May; 18 Markov; 18 International; 18 Conference; 17 Reading; 17 Kentucky; 17 .; 16 March; 16 January; 16 Computer; 15 December; 15 Congress; 14 Řehůřek; 14 Proceedings; 14 Mark; 14 MSC; 14 Libraries

Top 50 personal pronouns; "To whom are things referred?"
--------------------------------------------------------
386 we; 343 it; 275 you; 158 they; 123 i; 48 them; 39 us; 24 one; 18 itself; 12 themselves; 9 me; 6 he; 3 yourself; 3 she; 3 ourselves; 3 her; 2 ours; 1 ’s; 1 ml+history; 1 https://devblogs.nvidia.com/explaining-deep-learning-self-driving-car/.; 1 him; 1 alphago

Top 50 lemmatized verbs; "What do things do?"
---------------------------------------------
1850 be; 382 have; 259 use; 228 do; 148 learn; 138 make; 113 see; 91 generate; 83 create; 82 find; 80 give; 76 include; 74 work; 69 provide; 68 know; 63 build; 63 base; 59 need; 57 help; 56 develop; 55 become; 50 train; 50 identify; 49 take; 47 add; 40 produce; 40 go; 38 discuss; 38 come; 38 call; 37 exist; 36 try; 36 get; 36 automate; 35 understand; 35 allow; 33 want; 33 think; 33 describe; 32 require; 32 look; 30 save; 30 read; 29 apply; 28 start; 28 explain; 28 classify; 28 change; 26 support; 26 suggest

Top 50 lemmatized adjectives and adverbs; "How are things described?"
---------------------------------------------------------------------
323 not; 177 more; 141 new; 140 such; 129 other; 125 also; 123 well; 93 many; 82 only; 79 different; 77 digital; 76 then; 72 as; 67 good; 66 large; 60 deep; 57 moral; 56 very; 56 human; 56 even; 52 possible; 51 out; 48 most; 48 first; 47 so; 47 local; 47 important; 47 -; 43 ethical; 41 able; 40 together; 40 just; 40 high; 39 now; 39 however; 38 much; 37 social; 37 long; 36 specific; 36 same; 36 instead; 35 likely; 35 here; 34 up; 34 often; 33 available; 32 still; 32 historical; 31 real; 31 own

Top 50 lemmatized superlative adjectives; "How are things described to the extreme?"
-------------------------------------------------------------------------------------
18 good; 12 most; 11 least; 6 near; 5 great; 2 labels_t; 1 sparse; 1 simple; 1 silly; 1 safe; 1 rich; 1 raw; 1 quick; 1 new; 1 large; 1 high; 1 broad; 1 big; 1 bad

Top 50 lemmatized superlative adverbs; "How do things behave to the extreme?"
------------------------------------------------------------------------------
36 most; 6 well; 2 least; 1 train.py

Top 50 Internet domains; "What Webbed places are alluded to in this corpus?"
----------------------------------------------------------------------------
46 doi.org; 12 smcproxy1.saintmarys.edu:2048; 10 github.com; 9 arxiv.org; 6 www.wired.com; 4 www.nytimes.com; 4 towardsdatascience.com; 3 www.technologyreview.com; 2 zbmath.org; 2 www.yewno.com; 2 www.theverge.com; 2 www.openstreetmap.org; 2 www.forbes.com; 2 www.clevelandart.org; 2 www.chipublib.org; 2 www.bbc.com; 2 www.ala.org; 2 www.aclweb.org; 2 read.gov; 2 plato.stanford.edu; 2 passamaquoddypeople.com; 2 papers.nips.cc; 2 nlp.stanford.edu; 2 mukurtu.org; 2 mathscinet.ams.org; 2 mallet.cs.umass.edu; 2 linkedgeodata.org; 2 journals.ala.org; 2 journal.code4lib.org; 2 geodeepdive.org; 2 dh.depaul.press; 2 collectionsasdata.github.io; 2 academic.microsoft.com; 1 xpmethod.plaintext.in; 1 www.zotero.org; 1 www.youtube.com; 1 www.who.int; 1 www.weforum.org; 1 www.washingtonpost.com; 1 www.wandb.com; 1 www.theguardian.com; 1 www.theatlantic.com; 1 www.sowetanlive.co.za; 1 www.scientificamerican.com; 1 www.sciencedirect.com; 1 www.sas.com; 1 www.prepare-enrich.com; 1 www.nyu.edu; 1 www.numdam.org; 1 www.nltk.org
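
The domain counts above are simple roll-ups of the full URL list that follows. Here is a small illustrative sketch using only the Python standard library; it is an assumption about one way to do the tally, not a description of the Reader's internals.

    # Tally host names from a list of harvested URLs.
    from collections import Counter
    from urllib.parse import urlparse

    def top_domains(urls, n=50):
        """Return the n most common host names found in the URLs."""
        hosts = []
        for url in urls:
            host = urlparse(url).netloc
            if host:
                hosts.append(host)
        return Counter(hosts).most_common(n)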

Top 50 URLs; "What is hyperlinked from this corpus?"
----------------------------------------------------
3 http://github.com/ericleasemorgan/bringing-algorithms.]
2 http://mukurtu.org/
2 http://journals.ala.org/index.php/ltr/issue/viewIssue/709/471
2 http://journal.code4lib.org/articles/13671
2 http://github.com/googlecreativelab/quickdraw-dataset
2 http://doi.org/10.25333/xk7z-9g97
2 http://collectionsasdata.github.io/part2whole/
1 http://zbmath.org/?q=py%3A2018
1 http://zbmath.org/
1 http://xpmethod.plaintext.in/torn-apart/volume/2/
1 http://www.zotero.org
1 http://www.youtube.com/watch?v=Qi1Yry33TQE
1 http://www.yewno.com/education
1 http://www.yewno.com/
1 http://www.wired.com/story/facebooks-ai-says-field-hit-wall/
1 http://www.wired.com/story/ai-biased-how-scientists-trying-fix/
1 http://www.wired.com/2017/04/courts-using-ai-sentence-criminals-must-stop-now/
1 http://www.wired.com/2016/07/artificial-intelligence-setting-internet-huge-clash-europe/
1 http://www.wired.com/2014/10/future-of-artificial-intelligence/
1 http://www.wired.com/2012/06/google-x-neural-network/
1 http://www.who.int/emergencies/diseases/novel-coronavirus-2019/global-research-on-novel-coronavirus-2019-ncov]
1 http://www.weforum.org/whitepapers/how-to-prevent-discriminatory-outcomes-in-machine-learning
1 http://www.washingtonpost.com/technology/2020/01/06/facebook-ban-deepfakes-sources-say-new-policy-may-not-cover-controversial-pelosi-video/
1 http://www.wandb.com/articles/object-detection-with-retinanet
1 http://www.theverge.com/circuitbreaker/2018/10/15/17978298/pixel-buds-google-translate-google-assistant-headphones
1 http://www.theverge.com/2018/5/8/17332070/google-assistant-makes-phone-call-demo-duplex-io-2018
1 http://www.theguardian.com/science/alexs-adventures-in-numberland/2015/jan/08/banking-forecasts-maths-weather-prediction-stochastic-processes
1 http://www.theatlantic.com/technology/archive/2019/03/ai-created-art-invades-chelsea-gallery-scene/584134/?utm_source=share&utm_campaign=share
1 http://www.technologyreview.com/2020/04/01/974997/deepminds-ai-57-atari-games-but-its-still-not-versatile-enough/
1 http://www.technologyreview.com/2019/04/08/103223/two-rival-ai-approaches-combine-to-let-machines-learn-about-the-world-like-a-child/
1 http://www.technologyreview.com/2017/04/11/5113/the-dark-secret-at-the-heart-of-ai/
1 http://www.sowetanlive.co.za/news/south-africa/2019-06-04-meet-libby-the-new-robot-library-assistant-at-the-university-of-pretorias-hatfield-campus/
1 http://www.scientificamerican.com/article/how-the-computer-beat-the-go-master/
1 http://www.sciencedirect.com/science/article/pii/S2589750019301232
1 http://www.sas.com/en_us/insights/analytics/machine-learning.html
1 http://www.prepare-enrich.com/pe_main_site_content/pdf/research/national_survey.pdf
1 http://www.openstreetmap.org/.]
1 http://www.openstreetmap.org/
1 http://www.nyu.edu/tisch/preservation/program/student_work/2019spring/19s_thesis_Schweikert.pdf
1 http://www.nytimes.com/2019/05/12/us/mined-minds-west-virginia-coding.html
1 http://www.nytimes.com/2018/10/25/arts/design/ai-art-sold-christies.html
1 http://www.nytimes.com/2018/09/21/opinion/sunday/silicon-valley-tech.html
1 http://www.nytimes.com/2018/02/09/technology/facial-recognition-race-artificial-intelligence.html
1 http://www.numdam.org/
1 http://www.nltk.org/book/ch07.html.]
1 http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3317329
1 http://www.microsoft.com/en-us/research/project/academic/articles/microsoft-academic-increases-power-semantic-search-adding-fields-study/
1 http://www.mendeley.com.]
1 http://www.kaggle.com/c/deepfake-detection-challenge
1 http://www.justice.gov/sites/default/files/ovw/legacy/2008/10/21/sample-mou.pdf

Top 50 email addresses; "Who are you gonna call?"
-------------------------------------------------
1 mjiang2@nd.edu; 1 hansensm@umich.edu; 1 emorgan@nd.edu

Top 50 positive assertions; "What sentences are in the shape of noun-verb-noun?"
---------------------------------------------------------------------------------
5 machine learning algorithms; 4 machine learning solution; 3 machines do not; 2 data are good; 2 libraries do not; 2 libraries have not; 2 machine learning algorithm; 2 machine learning application; 2 machine learning model; 1 ai does not; 1 ai is not; 1 ai is only; 1 ai makes sense; 1 ai was not; 1 algorithm be easy; 1 algorithm called lda; 1 algorithms are able; 1 algorithms are already; 1 algorithms are complex; 1 algorithms are curious; 1 algorithms are not; 1 algorithms have not; 1 algorithms include linear; 1 algorithms include naive; 1 algorithms making unethical; 1 algorithms work well; 1 collections is clear; 1 data are all; 1 data are also; 1 data are sometimes; 1 data be clean; 1 data becomes increasingly; 1 data becomes information; 1 data is biased; 1 data is far; 1 data is messy; 1 data is time; 1 data is ultimately; 1 data is unlabeled; 1 data provide new; 1 data was exactly; 1 data were accurate; 1 dataset became available; 1 dataset is freely; 1 example is mark; 1 examples are more; 1 examples does only; 1 examples include records; 1 image using unique; 1 images include surprising

Top 50 negative assertions; "What sentences are in the shape of noun-verb-no|not-noun?"
-----------------------------------------------------------------------------------------
1 ai is not yet; 1 ai was not only; 1 algorithms are not smart; 1 libraries are not as; 1 model is not as; 1 process was not ideal; 1 processes are not as; 1 results were not just; 1 text is no longer
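
Assertions "in the shape of noun-verb-noun" can be pulled out of running text with a simple part-of-speech pattern matcher. The following is a hypothetical sketch using spaCy's Matcher; it assumes spaCy v3 and the en_core_web_sm model, and the Distant Reader's own extraction rules may be looser or stricter than these two patterns.

    # Extract rough noun-verb-noun and noun-verb-"no|not"-noun phrases.
    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)
    matcher.add("NVN", [[{"POS": "NOUN"}, {"POS": "VERB"}, {"POS": "NOUN"}]])
    matcher.add("NV_NEG_N", [[{"POS": "NOUN"}, {"POS": "VERB"},
                              {"LOWER": {"IN": ["no", "not"]}}, {"POS": "NOUN"}]])

    def assertions(text):
        """Return matched phrases, lower-cased, in document order."""
        doc = nlp(text)
        return [doc[start:end].text.lower() for _, start, end in matcher(doc)]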

Sizes of items; "Measured in words, how big is each item?"
----------------------------------------------------------
7632  cohen-nakazawa
6982  kim
6152  wiegand
6071  altman
5838  harper
5269  morgan
5083  hintze-schossau
4868  lesk
4321  hansen
3690  prudhomme
3583  jiang
2981  lucic-shanahan

Readability of items; "How difficult is each item to read?"
-----------------------------------------------------------
64.0  lesk
60.0  altman
59.0  hansen
59.0  harper
59.0  morgan
58.0  lucic-shanahan
56.0  hintze-schossau
55.0  jiang
55.0  kim
49.0  prudhomme
48.0  cohen-nakazawa
44.0  wiegand
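
The report does not name the readability formula behind these scores. On a 0 (difficult) to 100 (easy) scale, Flesch reading ease is the usual choice, and comparable per-item scores can be produced with the textstat package; treat this as an assumption about the method, not a statement about the Reader's internals.

    # Approximate a per-item readability score on a 0-100 reading-ease scale.
    import textstat

    def readability(path):
        """Return the Flesch reading-ease score for one plain-text item."""
        with open(path, encoding="utf-8") as handle:
            return textstat.flesch_reading_ease(handle.read())

    # Usage (hypothetical file names): readability("lesk.txt"), readability("wiegand.txt")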

Item summaries; "In a narrative form, how can each item be abstracted?"
-----------------------------------------------------------------------

altman: I did most of my data cleanup by hand using spreadsheet software, and was not careful about preserving the formulas for each step of the process; instead, I deleted and wrote over many important intermediate computations, saving only the final results. The pipeline for a machine learning project generally comprises five stages: data acquisition, data preparation, model training and testing, evaluation and analysis, and application of results. However you get your initial data, it is generally a good idea to save a copy in the rawest possible form and treat that copy as immutable, at least during the initial phase of testing different algorithms or configurations. This is often the part of the process that requires the most work, and you should expect to iterate over your data preparations many times, even after you've started training and testing models. As you begin ingesting and preparing data, you'll want to explore possible machine learning algorithms to apply to your dataset.
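
The five pipeline stages named in the altman summary map directly onto a few lines of scikit-learn. The sketch below is deliberately generic and is not code from the chapter; the file name, the "label" column, and the choice of classifier are all placeholders.

    # A generic walk through the five stages: acquisition, preparation,
    # training and testing, evaluation, and application of results.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    raw = pd.read_csv("raw_data.csv")            # 1. acquisition (keep this copy immutable)
    prepared = raw.dropna().copy()               # 2. preparation (expect to iterate here)

    X = prepared.drop(columns=["label"])         # assumes numeric feature columns
    y = prepared["label"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)    # 3. training and testing split
    model = RandomForestClassifier().fit(X_train, y_train)

    print(classification_report(y_test, model.predict(X_test)))  # 4. evaluation and analysis

    # 5. application of results: model.predict(new_unlabeled_records)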

cohen-nakazawa: Consequently, our chapter describes the process we used to (1) generate technical and descriptive metadata for historical photographs as we pulled material from an extant blog website into a digital archives platform; (2) identify recurring faces in individual pictures as well as in photographs of groups of sometimes unidentified people in order to generate social networks as metadata; and (3) help develop a controlled vocabulary for the institution's future needs for object management and description. Similarly, as the ownership of historical images suddenly extended to include present-day community members, and as these questions of access and serving a local public were inextricably bound up with interactions with members of that shared public whose family names and faces appear in the images we were making available, we began to consider the ways in which our archival work was tied to what Ryan Calo calls the "historical validation" of primary source materials (2017, 424-5).

hansen: [Footnoted links: https://dml.cz/, http://www.numdam.org/, https://zbmath.org/] Mathematical Subject Classification (MSC) values in MathSciNet and zbMath are a particularly interesting categorization set to work with, as they are assigned and reviewed by a subject area expert editor and an active researcher in the same, or closely related, subfield as the article's content before they are published. Now let us shift from mathematics-specific categorization to subject categorization in general and look at the work Microsoft has done assigning Fields of Study (FoS) in the Microsoft Academic Graph (MAG), which is used to create their Microsoft Academic article search product. While the MAG FoS project is also attempting to categorize articles for proper indexing and search, it represents the second path taken by automated categorization projects: using machine learning techniques both to create the taxonomy and to classify.

harper: Figure 2: Images generated with a simple statistical model appear as noise because the model is insufficient to capture the structure of the real data (Markov chains trained using wine bottles and circles from Google's QuickDraw dataset). Other types of generative statistical models, like Naive Bayes or a higher-order Markov chain, could perhaps capture a bit more information about the training data, but they would still be insufficient for real-world applications like this. Image, video, and audio are complicated; it is hard to reduce them to their essence with basic statistical rules in the way we were able to with the ordering of letters in English and Italian. Figure 4: A GAN being trained on wine bottle sketches from Google's QuickDraw dataset (https://github.com/googlecreativelab/quickdraw-dataset) shows the generator learning how to produce better sketches over time. GANs in Action: Deep Learning with Generative Adversarial Networks.
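
The "ordering of letters" idea in the harper summary can be made concrete with a first-order character Markov chain: learn which letter tends to follow which, then sample from those observed frequencies. The output is statistically plausible but structurally shallow, which is why richer generative models such as GANs are needed for images. This sketch is for illustration only and is not the chapter's code.

    # First-order character-level Markov chain: train on a text, then sample.
    import random
    from collections import defaultdict

    def train(text):
        """Map each character to the list of characters observed to follow it."""
        model = defaultdict(list)
        for current, following in zip(text, text[1:]):
            model[current].append(following)
        return model

    def generate(model, seed, length=80):
        """Sample a string of the given length, starting from a seed character."""
        output = [seed]
        for _ in range(length):
            followers = model.get(output[-1])
            if not followers:
                break
            output.append(random.choice(followers))
        return "".join(output)

    # Usage (hypothetical file name):
    # model = train(open("english_sample.txt").read().lower())
    # print(generate(model, "t"))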

hintze-schossau: Artificial Intelligence, with its ability to machine-learn coupled with an almost humanlike understanding, sounds like the ideal tool for the humanities. Machine learning allows us to learn from these data sets in ways that exceed human capabilities, while an artificial brain will eventually allow us to objectively describe a subjective experience (through quantifying neural activations or positively and negatively associated memories). The following paragraphs will explore current Machine Learning and Artificial Intelligence technologies, explain how quantitative or qualitative they really are, and explore what the possible implications for future Digital Humanities could be. Currently, machines do not learn but must be trained, typically with human-labeled data. At the same time, memory formation (Marstaller, Hintze, and Adami 2013), information integration in the brain (Tononi 2004), and how systems evolve the ability to learn (Sheneman, Schossau, and Hintze 2019) are also being researched, as they are building blocks of general purpose intelligence.

jiang: Among the top strengths of happy marriages, at least five can be reflected in cross-disciplinary ML research, including "discuss problems well," "handle differences creatively," and "maintain a good balance of time alone and together." I use two examples from my personal experience (as a computer scientist) of collaborating with researchers from multiple disciplines (e.g., historians, psychologists, IT technicians) to illustrate. Cross-disciplinary research matters because (1) it provides an understanding of complex problems that require a multifaceted approach to solve; (2) it combines disciplinary breadth with the ability to collaborate and synthesize varying expertise; (3) it enables researchers to reach a wider audience and communicate diverse viewpoints; (4) it encourages researchers to confront questions that traditional disciplines do not ask while opening up new areas of research; and (5) it promotes disciplinary self-awareness about methods and creative practices (Urquhart et al.).

kim: With their limited intelligence and fully deterministic nature, early rule-based symbolic AI systems raised few ethical concerns. AI systems that near or surpass human capability, on the other hand, are likely to be given the autonomy to make their own decisions without humans, even when their workings are not entirely transparent, and some of those decisions are distinctively moral in character. The Library of Congress has worked on detecting features, such as railroads in maps, using a convolutional neural network model, and in 2019 issued a solicitation for a machine learning and deep learning pilot program that will maximize the use of its digital collections (see Blewer, Kim, and Phetteplace 2018 and Price 2019). Indiana University Libraries, AVP, the University of Texas at Austin School of Information, and the New York Public Library are jointly developing the Audiovisual Metadata Platform (AMP), using many AI tools to automatically generate metadata for audiovisual materials, which collection managers can use to supplement their archival description and processing workflows.

lesk: Fragility errors here can arise from many sources; for example, the training data may not be representative of the real problem (if you train a machine translation program solely on engineering documents, do not expect it to do well on theater reviews). Similarly, the New York Times discussed the way groups of primarily young white men will build systems that focus on their data, and give wrong or discriminatory answers in more general situations (Tugend 2019). Instead of trying to learn more about the characteristics of a system that is being modeled, the effort is driven by the dictum, "more data beats better algorithms." In a review of the history of speech recognition, Xuedong Huang, James Baker, and Raj Reddy write, "The power of these systems arises mainly from their ability to collect, process, and learn from very large datasets."

lucic-shanahan: On its "Big Read" website, the Library of Congress includes information about One Book programs around the United States, and the American Library Association (ALA) also provides materials with which a library can build its own One Book program and, in this way, bring members of its community together in a conversation. While community reading programs are not a new phenomenon and exist in various formats and sizes, the One Book One Chicago program is notable because of its size (the Chicago Public Library has 81 local branches) as well as its history (the program has been in existence for nearly 20 years). As part of the ongoing work of the "Reading Chicago Reading" project, we used the secure data portal of the HathiTrust Research Consortium to access and pre-process the in-copyright novels in our set. The place names extracted from our three Chicago-setting OBOC books allowed us to focus on particular areas of the city, such as Hyde Park, which is mentioned in each of them.

morgan: Now, in a time of "big data," it is possible to go beyond mere automation and towards the more intelligent use of computers; the use of algorithms and machine learning is an integral part of future library collection building and service provision. Finally, this chapter outlines a number of possible machine learning applications for libraries as well as a few real-world use cases. As with the scale of computer input, the library profession has not really exploited computers' ability to save, organize, and retrieve data; on the whole, the profession does not understand the concept of a "data structure." Tab-delimited files, CSV (comma-separated value) files, relational database schemas, XML files, JSON files, and the content of email messages or HTTP server responses are all examples of different types of data structures.
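
The data-structure point in the morgan summary is easiest to see with one record expressed two ways. The record and its field names below are invented purely for illustration.

    # One invented bibliographic record expressed as two different data structures.
    import csv, io, json

    record = {"title": "Moby Dick", "author": "Melville, Herman", "year": 1851}

    # JSON: keys and values, easily nested, readable by most programming languages.
    print(json.dumps(record))

    # CSV: the same record flattened into a header row plus a data row.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=record.keys())
    writer.writeheader()
    writer.writerow(record)
    print(buffer.getvalue())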

prudhomme: However, "the viability of machine learning and artificial intelligence is predicated on the representativeness and quality of the data that they are trained on," as Thomas Padilla, Interim Head, Knowledge Production at the University of Nevada, Las Vegas, asserts (2019, 14). In this essay, I begin by placing artificial intelligence and machine learning in context, then proceed to discuss why AI matters for archives and libraries and to describe the techniques used in a pilot automation project from the perspective of digital curation at Oklahoma State University Archives. Artificial intelligence, and specifically machine learning as a subfield of AI, has direct applications through pattern recognition techniques that predict label values for unlabeled data. Along with greater computing capabilities, artificial intelligence could be an opportunity for libraries and archives to boost the discovery of their digital collections by pushing text and image recognition machine learning techniques to new limits.

wiegand: JSTOR, for example, will provide up to 25,000 documents (or more by special request) in a dataset for analysis. Clarivate's Content as a Service provides Web of Science data to accommodate multiple purposes. Besides the many freely available bibliodata sources, researchers can sign up for developer accounts in databases such as Scopus to work with datasets for text mining and computational analysis. Using library-licensed collections as data could allow researchers to save time in reading a large corpus, stay updated on a topic of interest, analyze the most important topics in a given time period, confirm gaps in the research literature for investigation, and increase the efficiency of sifting through massive amounts of research in, for instance, the race to develop a vaccine (Ong 2020; Vamathevan 2019). By building out new services and tools, and by instructing at all levels, libraries can reinvent themselves continuously by investing in creative and sustainable innovation, from digital and data literacy to assembling modules for a library-based Researchers' Workstation that uses Machine Learning to enhance the efficiency of the scholars' research cycle.