Summary of your 'study carrel'
==============================

This is a summary of your Distant Reader 'study carrel'. The Distant Reader harvested & cached your content into a collection/corpus. It then applied a set of natural language processing and text mining routines to the collection. The results of this process were reduced to a database file -- a 'study carrel'. The study carrel can then be queried, thus bringing to light specific characteristics of your collection. These characteristics can help you summarize the collection as well as enumerate things you might want to investigate more closely.

Eric Lease Morgan
May 27, 2019

Number of items in the collection; 'How big is my corpus?'
----------------------------------------------------------
12

Average length of all items measured in words; "More or less, how big is each item?"
-------------------------------------------------------------------------------------
5206

Average readability score of all items (0 = difficult; 100 = easy)
-------------------------------------------------------------------
56

Top 50 statistically significant keywords; "What is my collection about?"
--------------------------------------------------------------------------
4 datum; 3 machine; 2 system; 2 library; 2 learning; 2 image; 2 Learning; 1 research; 1 process; 1 problem; 1 pmss; 1 place; 1 new; 1 moral; 1 material; 1 human; 1 example; 1 archive; 1 algorithm; 1 University; 1 Tönnies; 1 Networks; 1 Microsoft; 1 Markov; 1 Machine; 1 MARC; 1 Kentucky; 1 Information; 1 Generative; 1 GAN; 1 Eastern; 1 Disciplinary; 1 Chinese; 1 Chicago; 1 Balke; 1 Adversarial

Top 50 lemmatized nouns; "What is discussed?"
---------------------------------------------
351 datum; 282 machine; 279 learning; 236 library; 149 system; 142 example; 135 research; 134 image; 129 process; 128 model; 116 algorithm; 109 information; 108 problem; 104 project; 100 result; 99 time; 96 way; 93 dataset; 91 tool; 87 work; 87 data; 86 set; 83 collection; 82 use; 73 place; 72 text; 72 researcher; 71 people; 71 file; 71 decision; 70 computer; 70 article; 68 material; 67 training; 67 question; 66 name; 66 application; 63 network; 61 archive; 59 case; 57 service; 55 value; 55 level; 52 number; 51 topic; 51 knowledge; 50 technology; 49 method; 47 classification; 46 type
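
Frequency lists like the noun list above are the product of routine part-of-speech tagging and lemmatization. As a point of reference only, here is a minimal sketch of how such a tally could be reproduced with spaCy; it is not the Distant Reader's own code, it assumes the en_core_web_sm model is installed, and the exact lemmas and counts it returns will vary with the model used.

    # Tally the most frequent noun lemmas in a text (illustrative sketch only).
    from collections import Counter
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def top_lemmatized_nouns(text, n=50):
        """Return the n most frequent noun lemmas in the given text."""
        doc = nlp(text)
        lemmas = [token.lemma_.lower() for token in doc if token.pos_ == "NOUN"]
        return Counter(lemmas).most_common(n)

    # Usage (hypothetical file name): top_lemmatized_nouns(open("some-item.txt").read())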

Top 50 proper nouns; "What are the names of persons or places?"
---------------------------------------------------------------
163 AI; 101 Learning; 90 Machine; 70 al; 60 ML; 59 Chicago; 52 Library; 45 Intelligence; 45 Artificial; 44 et; 39 University; 38 New; 32 Digital; 30 Google; 28 Data; 28 Daniel; 27 Johnson; 25 Information; 25 IEEE; 24 Research; 23 York; 23 GAN; 23 Adversarial; 22 Science; 22 Microsoft; 22 Generative; 21 n.d; 21 Review; 21 Networks; 21 MARC; 20 Press; 20 Journal; 19 Technology; 18 May; 18 Markov; 18 International; 18 Conference; 17 Reading; 17 Kentucky; 17 .; 16 March; 16 January; 16 Computer; 15 December; 15 Congress; 14 Řehůřek; 14 Proceedings; 14 Mark; 14 MSC; 14 Libraries

Top 50 personal pronouns; "To whom are things referred?"
--------------------------------------------------------
386 we; 343 it; 275 you; 158 they; 123 i; 48 them; 39 us; 24 one; 18 itself; 12 themselves; 9 me; 6 he; 3 yourself; 3 she; 3 ourselves; 3 her; 2 ours; 1 ’s; 1 ml+history; 1 https://devblogs.nvidia.com/explaining-deep-learning-self-driving-car/.; 1 him; 1 alphago

Top 50 lemmatized verbs; "What do things do?"
---------------------------------------------
1850 be; 382 have; 259 use; 228 do; 148 learn; 138 make; 113 see; 91 generate; 83 create; 82 find; 80 give; 76 include; 74 work; 69 provide; 68 know; 63 build; 63 base; 59 need; 57 help; 56 develop; 55 become; 50 train; 50 identify; 49 take; 47 add; 40 produce; 40 go; 38 discuss; 38 come; 38 call; 37 exist; 36 try; 36 get; 36 automate; 35 understand; 35 allow; 33 want; 33 think; 33 describe; 32 require; 32 look; 30 save; 30 read; 29 apply; 28 start; 28 explain; 28 classify; 28 change; 26 support; 26 suggest

Top 50 lemmatized adjectives and adverbs; "How are things described?"
---------------------------------------------------------------------
323 not; 177 more; 141 new; 140 such; 129 other; 125 also; 123 well; 93 many; 82 only; 79 different; 77 digital; 76 then; 72 as; 67 good; 66 large; 60 deep; 57 moral; 56 very; 56 human; 56 even; 52 possible; 51 out; 48 most; 48 first; 47 so; 47 local; 47 important; 47 -; 43 ethical; 41 able; 40 together; 40 just; 40 high; 39 now; 39 however; 38 much; 37 social; 37 long; 36 specific; 36 same; 36 instead; 35 likely; 35 here; 34 up; 34 often; 33 available; 32 still; 32 historical; 31 real; 31 own

Top 50 lemmatized superlative adjectives; "How are things described to the extreme?"
-------------------------------------------------------------------------------------
18 good; 12 most; 11 least; 6 near; 5 great; 2 labels_t; 1 sparse; 1 simple; 1 silly; 1 safe; 1 rich; 1 raw; 1 quick; 1 new; 1 large; 1 high; 1 broad; 1 big; 1 bad

Top 50 lemmatized superlative adverbs; "How do things behave to the extreme?"
------------------------------------------------------------------------------
36 most; 6 well; 2 least; 1 train.py

Top 50 Internet domains; "What Webbed places are alluded to in this corpus?"
----------------------------------------------------------------------------
46 doi.org; 12 smcproxy1.saintmarys.edu:2048; 10 github.com; 9 arxiv.org; 6 www.wired.com; 4 www.nytimes.com; 4 towardsdatascience.com; 3 www.technologyreview.com; 2 zbmath.org; 2 www.yewno.com; 2 www.theverge.com; 2 www.openstreetmap.org; 2 www.forbes.com; 2 www.clevelandart.org; 2 www.chipublib.org; 2 www.bbc.com; 2 www.ala.org; 2 www.aclweb.org; 2 read.gov; 2 plato.stanford.edu; 2 passamaquoddypeople.com; 2 papers.nips.cc; 2 nlp.stanford.edu; 2 mukurtu.org; 2 mathscinet.ams.org; 2 mallet.cs.umass.edu; 2 linkedgeodata.org; 2 journals.ala.org; 2 journal.code4lib.org; 2 geodeepdive.org; 2 dh.depaul.press; 2 collectionsasdata.github.io; 2 academic.microsoft.com; 1 xpmethod.plaintext.in; 1 www.zotero.org; 1 www.youtube.com; 1 www.who.int; 1 www.weforum.org; 1 www.washingtonpost.com; 1 www.wandb.com; 1 www.theguardian.com; 1 www.theatlantic.com; 1 www.sowetanlive.co.za; 1 www.scientificamerican.com; 1 www.sciencedirect.com; 1 www.sas.com; 1 www.prepare-enrich.com; 1 www.nyu.edu; 1 www.numdam.org; 1 www.nltk.org
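
The domain counts above are simple roll-ups of the full URL list that follows. Here is a small illustrative sketch using only the Python standard library; it is an assumption about one way to do the tally, not a description of the Reader's internals.

    # Tally host names from a list of harvested URLs.
    from collections import Counter
    from urllib.parse import urlparse

    def top_domains(urls, n=50):
        """Return the n most common host names found in the URLs."""
        hosts = []
        for url in urls:
            host = urlparse(url).netloc
            if host:
                hosts.append(host)
        return Counter(hosts).most_common(n)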

Top 50 URLs; "What is hyperlinked from this corpus?"
----------------------------------------------------
3 http://github.com/ericleasemorgan/bringing-algorithms.]
2 http://mukurtu.org/
2 http://journals.ala.org/index.php/ltr/issue/viewIssue/709/471
2 http://journal.code4lib.org/articles/13671
2 http://github.com/googlecreativelab/quickdraw-dataset
2 http://doi.org/10.25333/xk7z-9g97
2 http://collectionsasdata.github.io/part2whole/
1 http://zbmath.org/?q=py%3A2018
1 http://zbmath.org/
1 http://xpmethod.plaintext.in/torn-apart/volume/2/
1 http://www.zotero.org
1 http://www.youtube.com/watch?v=Qi1Yry33TQE
1 http://www.yewno.com/education
1 http://www.yewno.com/
1 http://www.wired.com/story/facebooks-ai-says-field-hit-wall/
1 http://www.wired.com/story/ai-biased-how-scientists-trying-fix/
1 http://www.wired.com/2017/04/courts-using-ai-sentence-criminals-must-stop-now/
1 http://www.wired.com/2016/07/artificial-intelligence-setting-internet-huge-clash-europe/
1 http://www.wired.com/2014/10/future-of-artificial-intelligence/
1 http://www.wired.com/2012/06/google-x-neural-network/
1 http://www.who.int/emergencies/diseases/novel-coronavirus-2019/global-research-on-novel-coronavirus-2019-ncov]
1 http://www.weforum.org/whitepapers/how-to-prevent-discriminatory-outcomes-in-machine-learning
1 http://www.washingtonpost.com/technology/2020/01/06/facebook-ban-deepfakes-sources-say-new-policy-may-not-cover-controversial-pelosi-video/
1 http://www.wandb.com/articles/object-detection-with-retinanet
1 http://www.theverge.com/circuitbreaker/2018/10/15/17978298/pixel-buds-google-translate-google-assistant-headphones
1 http://www.theverge.com/2018/5/8/17332070/google-assistant-makes-phone-call-demo-duplex-io-2018
1 http://www.theguardian.com/science/alexs-adventures-in-numberland/2015/jan/08/banking-forecasts-maths-weather-prediction-stochastic-processes
1 http://www.theatlantic.com/technology/archive/2019/03/ai-created-art-invades-chelsea-gallery-scene/584134/?utm_source=share&utm_campaign=share
1 http://www.technologyreview.com/2020/04/01/974997/deepminds-ai-57-atari-games-but-its-still-not-versatile-enough/
1 http://www.technologyreview.com/2019/04/08/103223/two-rival-ai-approaches-combine-to-let-machines-learn-about-the-world-like-a-child/
1 http://www.technologyreview.com/2017/04/11/5113/the-dark-secret-at-the-heart-of-ai/
1 http://www.sowetanlive.co.za/news/south-africa/2019-06-04-meet-libby-the-new-robot-library-assistant-at-the-university-of-pretorias-hatfield-campus/
1 http://www.scientificamerican.com/article/how-the-computer-beat-the-go-master/
1 http://www.sciencedirect.com/science/article/pii/S2589750019301232
1 http://www.sas.com/en_us/insights/analytics/machine-learning.html
1 http://www.prepare-enrich.com/pe_main_site_content/pdf/research/national_survey.pdf
1 http://www.openstreetmap.org/.]
1 http://www.openstreetmap.org/
1 http://www.nyu.edu/tisch/preservation/program/student_work/2019spring/19s_thesis_Schweikert.pdf
1 http://www.nytimes.com/2019/05/12/us/mined-minds-west-virginia-coding.html
1 http://www.nytimes.com/2018/10/25/arts/design/ai-art-sold-christies.html
1 http://www.nytimes.com/2018/09/21/opinion/sunday/silicon-valley-tech.html
1 http://www.nytimes.com/2018/02/09/technology/facial-recognition-race-artificial-intelligence.html
1 http://www.numdam.org/
1 http://www.nltk.org/book/ch07.html.]
1 http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3317329
1 http://www.microsoft.com/en-us/research/project/academic/articles/microsoft-academic-increases-power-semantic-search-adding-fields-study/
1 http://www.mendeley.com.]
1 http://www.kaggle.com/c/deepfake-detection-challenge
1 http://www.justice.gov/sites/default/files/ovw/legacy/2008/10/21/sample-mou.pdf

Top 50 email addresses; "Who are you gonna call?"
-------------------------------------------------
1 mjiang2@nd.edu; 1 hansensm@umich.edu; 1 emorgan@nd.edu

Top 50 positive assertions; "What sentences are in the shape of noun-verb-noun?"
---------------------------------------------------------------------------------
5 machine learning algorithms; 4 machine learning solution; 3 machines do not; 2 data are good; 2 libraries do not; 2 libraries have not; 2 machine learning algorithm; 2 machine learning application; 2 machine learning model; 1 ai does not; 1 ai is not; 1 ai is only; 1 ai makes sense; 1 ai was not; 1 algorithm be easy; 1 algorithm called lda; 1 algorithms are able; 1 algorithms are already; 1 algorithms are complex; 1 algorithms are curious; 1 algorithms are not; 1 algorithms have not; 1 algorithms include linear; 1 algorithms include naive; 1 algorithms making unethical; 1 algorithms work well; 1 collections is clear; 1 data are all; 1 data are also; 1 data are sometimes; 1 data be clean; 1 data becomes increasingly; 1 data becomes information; 1 data is biased; 1 data is far; 1 data is messy; 1 data is time; 1 data is ultimately; 1 data is unlabeled; 1 data provide new; 1 data was exactly; 1 data were accurate; 1 dataset became available; 1 dataset is freely; 1 example is mark; 1 examples are more; 1 examples does only; 1 examples include records; 1 image using unique; 1 images include surprising

Top 50 negative assertions; "What sentences are in the shape of noun-verb-no|not-noun?"
-----------------------------------------------------------------------------------------
1 ai is not yet; 1 ai was not only; 1 algorithms are not smart; 1 libraries are not as; 1 model is not as; 1 process was not ideal; 1 processes are not as; 1 results were not just; 1 text is no longer
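
Assertions "in the shape of noun-verb-noun" can be pulled out of running text with a simple part-of-speech pattern matcher. The following is a hypothetical sketch using spaCy's Matcher; it assumes spaCy v3 and the en_core_web_sm model, and the Distant Reader's own extraction rules may be looser or stricter than these two patterns.

    # Extract rough noun-verb-noun and noun-verb-"no|not"-noun phrases.
    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)
    matcher.add("NVN", [[{"POS": "NOUN"}, {"POS": "VERB"}, {"POS": "NOUN"}]])
    matcher.add("NV_NEG_N", [[{"POS": "NOUN"}, {"POS": "VERB"},
                              {"LOWER": {"IN": ["no", "not"]}}, {"POS": "NOUN"}]])

    def assertions(text):
        """Return matched phrases, lower-cased, in document order."""
        doc = nlp(text)
        return [doc[start:end].text.lower() for _, start, end in matcher(doc)]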

Sizes of items; "Measured in words, how big is each item?"
----------------------------------------------------------
7632  cohen-nakazawa
6982  kim
6152  wiegand
6071  altman
5838  harper
5269  morgan
5083  hintze-schossau
4868  lesk
4321  hansen
3690  prudhomme
3583  jiang
2981  lucic-shanahan

Readability of items; "How difficult is each item to read?"
-----------------------------------------------------------
64.0  lesk
60.0  altman
59.0  hansen
59.0  harper
59.0  morgan
58.0  lucic-shanahan
56.0  hintze-schossau
55.0  jiang
55.0  kim
49.0  prudhomme
48.0  cohen-nakazawa
44.0  wiegand
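
The report does not name the readability formula behind these scores. On a 0 (difficult) to 100 (easy) scale, Flesch reading ease is the usual choice, and comparable per-item scores can be produced with the textstat package; treat this as an assumption about the method, not a statement about the Reader's internals.

    # Approximate a per-item readability score on a 0-100 reading-ease scale.
    import textstat

    def readability(path):
        """Return the Flesch reading-ease score for one plain-text item."""
        with open(path, encoding="utf-8") as handle:
            return textstat.flesch_reading_ease(handle.read())

    # Usage (hypothetical file names): readability("lesk.txt"), readability("wiegand.txt")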

Item summaries; "In a narrative form, how can each item be abstracted?"
-----------------------------------------------------------------------

altman: I did most of my data cleanup by hand using spreadsheet software, and was not careful about preserving the formulas for each step of the process; instead, I deleted and wrote over many important intermediate computations, saving only the final results. The pipeline for a machine learning project generally comprises five stages: data acquisition, data preparation, model training and testing, evaluation and analysis, and application of results. However you get your initial data, it is generally a good idea to save a copy in the rawest possible form and treat that copy as immutable, at least during the initial phase of testing different algorithms or configurations. This is often the part of the process that requires the most work, and you should expect to iterate over your data preparations many times, even after you've started training and testing models. As you begin ingesting and preparing data, you'll want to explore possible machine learning algorithms to apply to your dataset.
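
The five pipeline stages named in the altman summary map directly onto a few lines of scikit-learn. The sketch below is deliberately generic and is not code from the chapter; the file name, the "label" column, and the choice of classifier are all placeholders.

    # A generic walk through the five stages: acquisition, preparation,
    # training and testing, evaluation, and application of results.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    raw = pd.read_csv("raw_data.csv")            # 1. acquisition (keep this copy immutable)
    prepared = raw.dropna().copy()               # 2. preparation (expect to iterate here)

    X = prepared.drop(columns=["label"])         # assumes numeric feature columns
    y = prepared["label"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)    # 3. training and testing split
    model = RandomForestClassifier().fit(X_train, y_train)

    print(classification_report(y_test, model.predict(X_test)))  # 4. evaluation and analysis

    # 5. application of results: model.predict(new_unlabeled_records)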

cohen-nakazawa: Consequently, our chapter describes the process we used to (1) generate technical and descriptive metadata for historical photographs as we pulled material from an extant blog website into a digital archives platform; (2) identify recurring faces in individual pictures as well as in photographs of groups of sometimes unidentified people in order to generate social networks as metadata; and (3) help develop a controlled vocabulary for the institution's future needs for object management and description. Similarly, as the ownership of historical images suddenly extended to include present-day community members, and as these questions of access and serving a local public were inextricably bound up with interactions with members of that shared public whose family names and faces appear in the images we were making available, we began to consider the ways in which our archival work was tied to what Ryan Calo calls the "historical validation" of primary source materials (2017, 424-5).

hansen: [Footnoted links: https://dml.cz/, http://www.numdam.org/, https://zbmath.org/] Mathematical Subject Classification (MSC) values in MathSciNet and zbMath are a particularly interesting categorization set to work with, as they are assigned and reviewed by a subject area expert editor and an active researcher in the same, or closely related, subfield as the article's content before they are published. Now let us shift from mathematics-specific categorization to subject categorization in general and look at the work Microsoft has done assigning Fields of Study (FoS) in the Microsoft Academic Graph (MAG), which is used to create their Microsoft Academic article search product. While the MAG FoS project is also attempting to categorize articles for proper indexing and search, it represents the second path taken by automated categorization projects: using machine learning techniques both to create the taxonomy and to classify.

harper: Figure 2: Images generated with a simple statistical model appear as noise because the model is insufficient to capture the structure of the real data (Markov chains trained using wine bottles and circles from Google's QuickDraw dataset). Other types of generative statistical models, like Naive Bayes or a higher-order Markov chain, could perhaps capture a bit more information about the training data, but they would still be insufficient for real-world applications like this. Image, video, and audio are complicated; it is hard to reduce them to their essence with basic statistical rules in the way we were able to with the ordering of letters in English and Italian. Figure 4: A GAN being trained on wine bottle sketches from Google's QuickDraw dataset (https://github.com/googlecreativelab/quickdraw-dataset) shows the generator learning how to produce better sketches over time. GANs in Action: Deep Learning with Generative Adversarial Networks.
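
The "ordering of letters" idea in the harper summary can be made concrete with a first-order character Markov chain: learn which letter tends to follow which, then sample from those observed frequencies. The output is statistically plausible but structurally shallow, which is why richer generative models such as GANs are needed for images. This sketch is for illustration only and is not the chapter's code.

    # First-order character-level Markov chain: train on a text, then sample.
    import random
    from collections import defaultdict

    def train(text):
        """Map each character to the list of characters observed to follow it."""
        model = defaultdict(list)
        for current, following in zip(text, text[1:]):
            model[current].append(following)
        return model

    def generate(model, seed, length=80):
        """Sample a string of the given length, starting from a seed character."""
        output = [seed]
        for _ in range(length):
            followers = model.get(output[-1])
            if not followers:
                break
            output.append(random.choice(followers))
        return "".join(output)

    # Usage (hypothetical file name):
    # model = train(open("english_sample.txt").read().lower())
    # print(generate(model, "t"))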

hintze-schossau: Artificial Intelligence, with its ability to machine-learn coupled with an almost humanlike understanding, sounds like the ideal tool for the humanities. Machine learning allows us to learn from these data sets in ways that exceed human capabilities, while an artificial brain will eventually allow us to objectively describe a subjective experience (through quantifying neural activations or positively and negatively associated memories). The following paragraphs will explore current Machine Learning and Artificial Intelligence technologies, explain how quantitative or qualitative they really are, and explore what the possible implications for future Digital Humanities could be. Currently, machines do not learn but must be trained, typically with human-labeled data. At the same time, memory formation (Marstaller, Hintze, and Adami 2013), information integration in the brain (Tononi 2004), and how systems evolve the ability to learn (Sheneman, Schossau, and Hintze 2019) are also being researched, as they are building blocks of general purpose intelligence.

jiang: Among the top strengths of happy marriages, at least five can be reflected in cross-disciplinary ML research, including "discuss problems well," "handle differences creatively," and "maintain a good balance of time alone and together." I use two examples from my personal experience (as a computer scientist) of collaborating with researchers from multiple disciplines (e.g., historians, psychologists, IT technicians) to illustrate. Cross-disciplinary research matters because (1) it provides an understanding of complex problems that require a multifaceted approach to solve; (2) it combines disciplinary breadth with the ability to collaborate and synthesize varying expertise; (3) it enables researchers to reach a wider audience and communicate diverse viewpoints; (4) it encourages researchers to confront questions that traditional disciplines do not ask while opening up new areas of research; and (5) it promotes disciplinary self-awareness about methods and creative practices (Urquhart et al.).

kim: With their limited intelligence and fully deterministic nature, early rule-based symbolic AI systems raised few ethical concerns. AI systems that near or surpass human capability, on the other hand, are likely to be given the autonomy to make their own decisions without humans, even when their workings are not entirely transparent, and some of those decisions are distinctively moral in character. The Library of Congress has worked on detecting features, such as railroads in maps, using a convolutional neural network model, and in 2019 issued a solicitation for a machine learning and deep learning pilot program that will maximize the use of its digital collections (see Blewer, Kim, and Phetteplace 2018 and Price 2019). Indiana University Libraries, AVP, the University of Texas at Austin School of Information, and the New York Public Library are jointly developing the Audiovisual Metadata Platform (AMP), using many AI tools to automatically generate metadata for audiovisual materials, which collection managers can use to supplement their archival description and processing workflows.

lesk: Fragility errors here can arise from many sources; for example, the training data may not be representative of the real problem (if you train a machine translation program solely on engineering documents, do not expect it to do well on theater reviews). Similarly, the New York Times discussed the way groups of primarily young white men will build systems that focus on their data, and give wrong or discriminatory answers in more general situations (Tugend 2019). Instead of trying to learn more about the characteristics of a system that is being modeled, the effort is driven by the dictum, "more data beats better algorithms." In a review of the history of speech recognition, Xuedong Huang, James Baker, and Raj Reddy write, "The power of these systems arises mainly from their ability to collect, process, and learn from very large datasets."

lucic-shanahan: On its "Big Read" website, the Library of Congress includes information about One Book programs around the United States, and the American Library Association (ALA) also provides materials with which a library can build its own One Book program and, in this way, bring members of its community together in a conversation. While community reading programs are not a new phenomenon and exist in various formats and sizes, the One Book One Chicago program is notable because of its size (the Chicago Public Library has 81 local branches) as well as its history (the program has been in existence for nearly 20 years). As part of the ongoing work of the "Reading Chicago Reading" project, we used the secure data portal of the HathiTrust Research Consortium to access and pre-process the in-copyright novels in our set. The place names extracted from our three Chicago-setting OBOC books allowed us to focus on particular areas of the city, such as Hyde Park, which is mentioned in each of them.

morgan: Now, in a time of "big data," it is possible to go beyond mere automation and towards the more intelligent use of computers; the use of algorithms and machine learning is an integral part of future library collection building and service provision. Finally, this chapter outlines a number of possible machine learning applications for libraries as well as a few real-world use cases. As with the scale of computer input, the library profession has not really exploited computers' ability to save, organize, and retrieve data; on the whole, the profession does not understand the concept of a "data structure." Tab-delimited files, CSV (comma-separated value) files, relational database schemas, XML files, JSON files, and the content of email messages or HTTP server responses are all examples of different types of data structures.
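
The data-structure point in the morgan summary is easiest to see with one record expressed two ways. The record and its field names below are invented purely for illustration.

    # One invented bibliographic record expressed as two different data structures.
    import csv, io, json

    record = {"title": "Moby Dick", "author": "Melville, Herman", "year": 1851}

    # JSON: keys and values, easily nested, readable by most programming languages.
    print(json.dumps(record))

    # CSV: the same record flattened into a header row plus a data row.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=record.keys())
    writer.writeheader()
    writer.writerow(record)
    print(buffer.getvalue())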

prudhomme: However, "the viability of machine learning and artificial intelligence is predicated on the representativeness and quality of the data that they are trained on," as Thomas Padilla, Interim Head, Knowledge Production at the University of Nevada, Las Vegas, asserts (2019, 14). In this essay, I begin by placing artificial intelligence and machine learning in context, then proceed to discuss why AI matters for archives and libraries and to describe the techniques used in a pilot automation project from the perspective of digital curation at Oklahoma State University Archives. Artificial intelligence, and specifically machine learning as a subfield of AI, has direct applications through pattern recognition techniques that predict label values for unlabeled data. Along with greater computing capabilities, artificial intelligence could be an opportunity for libraries and archives to boost the discovery of their digital collections by pushing text and image recognition machine learning techniques to new limits.

wiegand: JSTOR, for example, will provide up to 25,000 documents (or more by special request) in a dataset for analysis. Clarivate's Content as a Service provides Web of Science data to accommodate multiple purposes. Besides the many freely available bibliodata sources, researchers can sign up for developer accounts in databases such as Scopus to work with datasets for text mining and computational analysis. Using library-licensed collections as data could allow researchers to save time in reading a large corpus, stay updated on a topic of interest, analyze the most important topics in a given time period, confirm gaps in the research literature for investigation, and increase the efficiency of sifting through massive amounts of research in, for instance, the race to develop a vaccine (Ong 2020; Vamathevan 2019). By building out new services and tools, and by instructing at all levels, libraries can reinvent themselves continuously by investing in creative and sustainable innovation, from digital and data literacy to assembling modules for a library-based Researchers' Workstation that uses Machine Learning to enhance the efficiency of the scholars' research cycle.