mv: 'input-file.zip' and './input-file.zip' are the same file
Creating study carrel named subject-robinHood-freebo
Initializing database
Unzipping
Archive: input-file.zip
inflating: ./tmp/input/xml2htm.xsl
inflating: ./tmp/input/metadata.csv
inflating: ./tmp/input/A07897.xml
caution: excluded filename not matched: *MACOSX*
=== DIRECTORIES: ./tmp/input
=== DIRECTORY:
=== metadata file: ./tmp/input/metadata.csv
=== found metadata file
=== updating bibliographic database
Building study carrel named subject-robinHood-freebo
May 24, 2021 8:29:02 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
May 24, 2021 8:29:02 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: Tesseract OCR is installed and will be automatically applied to image files unless
you've excluded the TesseractOCRParser from the default parser.
Tesseract may dramatically slow down content extraction (TIKA-2359).
As of Tika 1.15 (and prior versions), Tesseract is automatically called.
In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
May 24, 2021 8:29:02 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
INFO Starting Apache Tika 1.24.1 server
INFO Setting the server's publish address to be http://localhost:9998/
INFO Logging initialized @3076ms to org.eclipse.jetty.util.log.Slf4jLog
INFO jetty-9.4.27.v20200227; built: 2020-02-27T18:37:21.340Z; git: a304fd9f351f337e7c0e2a7c28878dd536149c6c; jvm 1.8.0_281-b09
INFO Started ServerConnector@3e74829{HTTP/1.1, (http/1.1)}{localhost:9998}
INFO Started @3168ms
WARN Empty contextPath
INFO Started o.e.j.s.h.ContextHandler@62010f5c{/,null,AVAILABLE}
INFO Started Apache Tika server at http://localhost:9998/
INFO rmeta/text (autodetecting type)
FILE: cache/A07897.xml
OUTPUT: txt/A07897.txt
=== file2bib.sh ===
INFO Detecting media type for Filename: b'A07897.xml'
INFO rmeta/text (autodetecting type)
A07897 txt/../pos/A07897.pos
A07897 txt/../ent/A07897.ent
A07897 txt/../wrd/A07897.wrd
=== file2bib.sh ===
id: A07897
author: Henry, Chettle, d. 1607?.
title: The Death of Robert, Earl of Huntingdon
date: 1601
pages:
extension: .xml
txt: ./txt/A07897.txt
cache: ./cache/A07897.xml
Content-Encoding ISO-8859-1
Content-Type text/plain; charset=ISO-8859-1
X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler ToTextContentHandler
X-TIKA:embedded_depth 0
X-TIKA:parse_time_millis 19
resourceName b'A07897.xml'
Done mapping.
Reducing subject-robinHood-freebo
=== reduce.pl bib ===
id = A07897
author = Henry, Chettle, d. 1607?.
title = The Death of Robert, Earl of Huntingdon
date = 1601
pages =
extension = .xml
mime = text/plain
words = 286217
sentences = 91352
flesch = 101
summary = Textual changes and metadata enrichments aim at making the text more computationally tractable, easier to read, and suitable for network-based collaborative curation by amateur and professional end users from many walks of life. Otherwise called Robin Hood of merrie Sherwodde: with the lamentable tragedie of chaste Matilda, his faire maid Marian, poysoned at Dunmowe by King Iohn. Otherwise called Robin Hood of merrie Sherwodde: with the lamentable tragedie of chaste Matilda, his faire maid Marian, poysoned at Dunmowe by King Iohn. Acted by the Right Honourable, the Earle of Notingham, Lord high Admirall of England, his seruants. Acted by the Right Honourable, the Earle of Notingham, Lord high Admirall of England, his seruants. A notation like "6-b-2890" means "look for EEBO page image 6 of that text, word 289 on the right side of the double-page image." That reference is followed by the corrupt reading. head , we shall haue hornes good slore .
cache = ./cache/A07897.xml
txt = ./txt/A07897.txt
Building ./etc/reader.txt
A07897
A07897
number of items: 1
sum of words: 286,217
average size in words: 286,217
average readability score: 101
nouns: xml; pc; l; p; pos="n1; pos="vvi; >; pos="n2; cs; pos="n1-nn; w; pos="po; q; sic; speaker; cc; stage; surface; av; g; x; pos="pns; r; im; unit="sentence; type="unclear; pos="vvg; rendition="#hi">.sonnevponhauemeebruse&hubertilesteroxfordkeepedoncasterfrierfairehaueiohnmeefeare.doegoehauerobin,queenewarmanknowe
proper nouns: id="a07897; w; pos="acp; pos="d; pos="j; pos="vvb; pos="po; xml; pos="pns; sp; pos="cc; unit="sentence"/; pos="n; lemma="be; lemma="and; lemma="the; pos="vvz; lemma="i; speaker; pos="pn; pos="pno; pos="vmb; pos="vvn; lemma="you; pos="av; pos="crq; lemma="a; lemma="my; lemma="he; lemma="will; lemma="of; lemma="have; rendition="#hi">,?.irobinmatildahubert,yeevsvponsonnequeeneneuermeeloueleaueladiehauegiuefairedoebruse& Acted by the Right Honourable, the Earle of Notingham, Lord high Admirall of England, his seruants. A notation like "6-b-2890" means "look for EEBO page image 6 of that text, word 289 on the right side of the double-page image." That reference is followed by the corrupt reading. head , we shall haue hornes good slore .
==== make-pages.sh questions
==== make-pages.sh search
==== make-pages.sh topic modeling corpus
Zipping study carrel