mv: ‘./input-file.zip’ and ‘./input-file.zip’ are the same file
Creating study carrel named machine-learning
Initializing database
Unzipping
Archive: input-file.zip
   creating: ./tmp/input/machine-learning/
  inflating: ./tmp/input/machine-learning/altman.docx
  inflating: ./tmp/input/machine-learning/prudhomme.docx
  inflating: ./tmp/input/machine-learning/cohen-nakazawa.docx
  inflating: ./tmp/input/machine-learning/harper.docx
  inflating: ./tmp/input/machine-learning/hansen.docx
  inflating: ./tmp/input/machine-learning/morgan.docx
  inflating: ./tmp/input/machine-learning/hintze-schossau.docx
  inflating: ./tmp/input/machine-learning/wiegand.docx
  inflating: ./tmp/input/machine-learning/lesk.docx
  inflating: ./tmp/input/machine-learning/kim.docx
  inflating: ./tmp/input/machine-learning/lucic-shanahan.docx
  inflating: ./tmp/input/machine-learning/jiang.docx
=== updating bibliographic database
Building study carrel named machine-learning
FILE: cache/altman.docx OUTPUT: txt/altman.txt
FILE: cache/hansen.docx OUTPUT: txt/hansen.txt
FILE: cache/lucic-shanahan.docx OUTPUT: txt/lucic-shanahan.txt
FILE: cache/cohen-nakazawa.docx OUTPUT: txt/cohen-nakazawa.txt
FILE: cache/jiang.docx OUTPUT: txt/jiang.txt
FILE: cache/hintze-schossau.docx OUTPUT: txt/hintze-schossau.txt
FILE: cache/lesk.docx OUTPUT: txt/lesk.txt
FILE: cache/morgan.docx OUTPUT: txt/morgan.txt
FILE: cache/prudhomme.docx OUTPUT: txt/prudhomme.txt
FILE: cache/kim.docx OUTPUT: txt/kim.txt
FILE: cache/wiegand.docx OUTPUT: txt/wiegand.txt
FILE: cache/harper.docx OUTPUT: txt/harper.txt
prudhomme txt/../wrd/prudhomme.wrd
lucic-shanahan txt/../pos/lucic-shanahan.pos
lucic-shanahan txt/../wrd/lucic-shanahan.wrd
hansen txt/../wrd/hansen.wrd
hansen txt/../pos/hansen.pos
jiang txt/../pos/jiang.pos
lucic-shanahan txt/../ent/lucic-shanahan.ent
prudhomme txt/../ent/prudhomme.ent
jiang txt/../ent/jiang.ent
jiang txt/../wrd/jiang.wrd
altman txt/../wrd/altman.wrd
morgan txt/../pos/morgan.pos
lesk txt/../wrd/lesk.wrd
harper txt/../wrd/harper.wrd
lesk txt/../pos/lesk.pos
prudhomme txt/../pos/prudhomme.pos
hansen txt/../ent/hansen.ent
morgan txt/../wrd/morgan.wrd
hintze-schossau txt/../wrd/hintze-schossau.wrd
wiegand txt/../wrd/wiegand.wrd
hintze-schossau txt/../pos/hintze-schossau.pos
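Each FILE/OUTPUT pair above records the conversion of a cached .docx file into plain text, and the file2bib.sh records that follow show that Apache Tika does the parsing (see their X-Parsed-By fields). The log does not show the invocation itself; below is a minimal sketch of the same step, assuming the tika-python bindings and a Java runtime on the path, not file2bib.sh's actual code:

    # Sketch only: approximate the cache/*.docx -> txt/*.txt conversions logged above.
    # Assumes the tika-python package (pip install tika); file2bib.sh's real
    # invocation of Tika may differ.
    from pathlib import Path
    from tika import parser

    for docx in sorted(Path('./cache').glob('*.docx')):
        parsed = parser.from_file(str(docx))      # dict with 'metadata' and 'content'
        text = (parsed.get('content') or '').strip()
        out = Path('./txt') / (docx.stem + '.txt')
        out.write_text(text, encoding='utf-8')
        print(f'FILE: {docx} OUTPUT: {out}')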
=== file2bib.sh ===
id: prudhomme
author:
title: prudhomme
date:
pages:
extension: .docx
txt: ./txt/prudhomme.txt
cache: ./cache/prudhomme.docx
Component 1 Y component: Quantization table 0, Sampling factors 2 horiz/2 vert
Component 2 Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert
Component 3 Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert
Compression Type Baseline
Content-Type ['application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'image/jpeg']
Creation-Date 2020-04-14T20:53:08
Data Precision 8 bits
Exif IFD0:Artist pprudho
Exif IFD0:Padding [2060 values]
Exif IFD0:Windows XP Author pprudho
Exif SubIFD:Date/Time Digitized 2020:04:14 20:53:08
Exif SubIFD:Date/Time Original 2020:04:14 20:53:08
Exif SubIFD:Padding [2060 values]
Exif SubIFD:Sub-Sec Time Digitized 48
Exif SubIFD:Sub-Sec Time Original 48
File Modified Date Thu Dec 10 14:22:05 +00:00 2020
File Name apache-tika-4059332103772536558.tmp
File Size 111014 bytes
Image Height 357 pixels
Image Width 1263 pixels
Number of Components 3
Number of Tables 4 Huffman tables
Resolution Units inch
Thumbnail Height Pixels 0
Thumbnail Width Pixels 0
X Resolution 168 dots
X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.microsoft.ooxml.OOXMLParser', ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.jpeg.JpegParser']]
X-TIKA:content_handler ToTextContentHandler
X-TIKA:embedded_depth ['0', '1']
X-TIKA:embedded_resource_path /image1.jpg
X-TIKA:parse_time_millis ['41', '4']
XMP Value Count 3
Y Resolution 168 dots
dcterms:created 2020-04-14T20:53:08
embeddedRelationshipId rId8
exif:DateTimeOriginal 2020-04-14T20:53:08
meta:creation-date 2020-04-14T20:53:08
resourceName ["b'prudhomme.docx'", 'image1.jpg']
tiff:BitsPerSample 8
tiff:ImageLength 357
tiff:ImageWidth 1263
=== file2bib.sh ===
id: lucic-shanahan
author: Microsoft Office User
title: lucic-shanahan
date:
pages:
extension: .docx
txt: ./txt/lucic-shanahan.txt
cache: ./cache/lucic-shanahan.docx
Author Microsoft Office User
Chroma BlackIsZero ['true', 'true']
Chroma ColorSpaceType ['RGB', 'RGB']
Chroma NumChannels ['3', '3']
Compression CompressionTypeName ['deflate', 'deflate']
Compression Lossless ['true', 'true']
Compression NumProgressiveScans ['1', '1']
Content-Type ['application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'image/png', 'image/png']
Creation-Date 2020-06-24T14:38:00Z
Data BitsPerSample ['8 8 8', '8 8 8']
Data PlanarConfiguration ['PixelInterleaved', 'PixelInterleaved']
Data SampleFormat ['UnsignedIntegral', 'UnsignedIntegral']
Dimension ImageOrientation ['Normal', 'Normal']
Dimension PixelAspectRatio ['1.0', '1.0']
IHDR ['width=1432, height=1073, bitDepth=8, colorType=RGB, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=1429, height=1172, bitDepth=8, colorType=RGB, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none']
Transparency Alpha ['none', 'none']
X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.microsoft.ooxml.OOXMLParser', ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser']]
X-TIKA:content_handler ToTextContentHandler
X-TIKA:embedded_depth ['0', '1', '1']
X-TIKA:embedded_resource_path ['/image2.png', '/image1.png']
X-TIKA:parse_time_millis ['103', '2', '1']
creator Microsoft Office User
dc:creator Microsoft Office User
dcterms:created 2020-06-24T14:38:00Z
embeddedRelationshipId ['rId10', 'rId11']
height ['1073', '1172']
meta:author Microsoft Office User
meta:creation-date 2020-06-24T14:38:00Z
resourceName ["b'lucic-shanahan.docx'", 'image2.png', 'image1.png']
tiff:BitsPerSample ['8 8 8', '8 8 8']
tiff:ImageLength ['1073', '1172']
tiff:ImageWidth ['1432', '1429']
width ['1432', '1429']
altman txt/../pos/altman.pos
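The prudhomme and lucic-shanahan records above carry JPEG and PNG fields because Tika also descends into the images embedded in each .docx (the X-TIKA:embedded_resource_path values). A .docx file is an ordinary ZIP container, so the embedded media can be inspected with the standard library alone; a small illustrative sketch:

    # Sketch: list the embedded images behind the image/jpeg and image/png
    # metadata above. DOCX files keep their media under word/media/.
    import zipfile

    with zipfile.ZipFile('./cache/prudhomme.docx') as z:
        for name in z.namelist():
            if name.startswith('word/media/'):
                print(name, z.getinfo(name).file_size, 'bytes')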
=== file2bib.sh ===
id: jiang
author:
title: jiang
date:
pages:
extension: .docx
txt: ./txt/jiang.txt
cache: ./cache/jiang.docx
Chroma BlackIsZero ['true', 'true']
Chroma ColorSpaceType ['RGB', 'RGB']
Chroma NumChannels ['4', '4']
Compression CompressionTypeName ['deflate', 'deflate']
Compression Lossless ['true', 'true']
Compression NumProgressiveScans ['1', '1']
Content-Type ['application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'image/png', 'image/png']
Creation-Date 2020-01-04T03:04:00Z
Data BitsPerSample ['8 8 8 8', '8 8 8 8']
Data PlanarConfiguration ['PixelInterleaved', 'PixelInterleaved']
Data SampleFormat ['UnsignedIntegral', 'UnsignedIntegral']
Dimension ImageOrientation ['Normal', 'Normal']
Dimension PixelAspectRatio ['1.0', '1.0']
IHDR ['width=1410, height=1208, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=1970, height=1358, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none']
Transparency Alpha ['nonpremultipled', 'nonpremultipled']
X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.microsoft.ooxml.OOXMLParser', ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser']]
X-TIKA:content_handler ToTextContentHandler
X-TIKA:embedded_depth ['0', '1', '1']
X-TIKA:embedded_resource_path ['/image1.png', '/image2.png']
X-TIKA:parse_time_millis ['99', '2', '1']
dcterms:created 2020-01-04T03:04:00Z
embeddedRelationshipId ['rId10', 'rId11']
height ['1208', '1358']
meta:creation-date 2020-01-04T03:04:00Z
resourceName ["b'jiang.docx'", 'image1.png', 'image2.png']
tiff:BitsPerSample ['8 8 8 8', '8 8 8 8']
tiff:ImageLength ['1208', '1358']
tiff:ImageWidth ['1410', '1970']
width ['1410', '1970']
hintze-schossau txt/../ent/hintze-schossau.ent
harper txt/../pos/harper.pos
wiegand txt/../pos/wiegand.pos
=== file2bib.sh ===
id: hansen
author:
title: hansen
date:
pages:
extension: .docx
txt: ./txt/hansen.txt
cache: ./cache/hansen.docx
Content-Type application/vnd.openxmlformats-officedocument.wordprocessingml.document
X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.microsoft.ooxml.OOXMLParser']
X-TIKA:content_handler ToTextContentHandler
X-TIKA:embedded_depth 0
X-TIKA:parse_time_millis 51
resourceName b'hansen.docx'
kim txt/../wrd/kim.wrd
lesk txt/../ent/lesk.ent
kim txt/../pos/kim.pos
cohen-nakazawa txt/../wrd/cohen-nakazawa.wrd
altman txt/../ent/altman.ent
=== file2bib.sh ===
id: hintze-schossau
author:
title: hintze-schossau
date:
pages:
extension: .docx
txt: ./txt/hintze-schossau.txt
cache: ./cache/hintze-schossau.docx
Content-Type application/vnd.openxmlformats-officedocument.wordprocessingml.document
X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.microsoft.ooxml.OOXMLParser']
X-TIKA:content_handler ToTextContentHandler
X-TIKA:embedded_depth 0
X-TIKA:parse_time_millis 37
resourceName b'hintze-schossau.docx'
=== file2bib.sh ===
id: morgan
author:
title: morgan
date:
pages:
extension: .docx
txt: ./txt/morgan.txt
cache: ./cache/morgan.docx
Content-Type application/vnd.openxmlformats-officedocument.wordprocessingml.document
X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.microsoft.ooxml.OOXMLParser']
X-TIKA:content_handler ToTextContentHandler
X-TIKA:embedded_depth 0
X-TIKA:parse_time_millis 57
resourceName b'morgan.docx'
cohen-nakazawa txt/../pos/cohen-nakazawa.pos
morgan txt/../ent/morgan.ent
wiegand txt/../ent/wiegand.ent
=== file2bib.sh ===
id: lesk
author:
title: lesk
date:
pages:
extension: .docx
txt: ./txt/lesk.txt
cache: ./cache/lesk.docx
Chroma BlackIsZero ['true', 'true', 'true']
Chroma ColorSpaceType ['RGB', 'RGB', 'RGB']
Chroma NumChannels ['4', '4', '4']
Compression CompressionTypeName ['deflate', 'deflate', 'deflate']
Compression Lossless ['true', 'true', 'true']
Compression NumProgressiveScans ['1', '1', '1']
Content-Type ['application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'image/png', 'image/png', 'image/png']
Data BitsPerSample ['8 8 8 8', '8 8 8 8', '8 8 8 8']
Data PlanarConfiguration ['PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved']
Data SampleFormat ['UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral']
Dimension ImageOrientation ['Normal', 'Normal', 'Normal']
Dimension PixelAspectRatio ['1.0', '1.0', '1.0']
IHDR ['width=950, height=784, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=733, height=352, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=694, height=250, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none']
Transparency Alpha ['nonpremultipled', 'nonpremultipled', 'nonpremultipled']
X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.microsoft.ooxml.OOXMLParser', ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser']]
X-TIKA:content_handler ToTextContentHandler
X-TIKA:embedded_depth ['0', '1', '1', '1']
X-TIKA:embedded_resource_path ['/image3.png', '/image2.png', '/image1.png']
X-TIKA:parse_time_millis ['63', '2', '0', '0']
embeddedRelationshipId ['rId10', 'rId11', 'rId9']
height ['784', '352', '250']
resourceName ["b'lesk.docx'", 'image3.png', 'image2.png', 'image1.png']
tiff:BitsPerSample ['8 8 8 8', '8 8 8 8', '8 8 8 8']
tiff:ImageLength ['784', '352', '250']
tiff:ImageWidth ['950', '733', '694']
width ['950', '733', '694']
cohen-nakazawa txt/../ent/cohen-nakazawa.ent
=== file2bib.sh ===
id: altman
author:
title: altman
date:
pages:
extension: .docx
txt: ./txt/altman.txt
cache: ./cache/altman.docx
Content-Type application/vnd.openxmlformats-officedocument.wordprocessingml.document
X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.microsoft.ooxml.OOXMLParser']
X-TIKA:content_handler ToTextContentHandler
X-TIKA:embedded_depth 0
X-TIKA:parse_time_millis 57
resourceName b'altman.docx'
=== file2bib.sh ===
id: kim
author: Bohyun Kim
title: kim
date:
pages:
extension: .docx
txt: ./txt/kim.txt
cache: ./cache/kim.docx
Author Bohyun Kim
Content-Type application/vnd.openxmlformats-officedocument.wordprocessingml.document
Creation-Date 2020-06-02T05:47:00Z
X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.microsoft.ooxml.OOXMLParser']
X-TIKA:content_handler ToTextContentHandler
X-TIKA:embedded_depth 0
X-TIKA:parse_time_millis 72
creator Bohyun Kim
dc:creator Bohyun Kim
dcterms:created 2020-06-02T05:47:00Z
meta:author Bohyun Kim
meta:creation-date 2020-06-02T05:47:00Z
resourceName b'kim.docx'
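Interleaved lines such as "kim txt/../pos/kim.pos" record the feature files written next to each plain-text file: .wrd (keywords), .pos (parts of speech), and .ent (named entities). The log does not show how these are computed; here is a minimal sketch of the .pos and .ent steps, assuming spaCy and its small English model rather than the toolbox's actual code:

    # Sketch: derive .pos and .ent files like the ones logged above.
    # Assumes spaCy with en_core_web_sm installed; the toolbox's tokenizer
    # and column layout may differ.
    import spacy

    nlp = spacy.load('en_core_web_sm')
    doc = nlp(open('./txt/kim.txt', encoding='utf-8').read())

    with open('./pos/kim.pos', 'w', encoding='utf-8') as pos_file:
        for token in doc:
            if not token.is_space:
                pos_file.write(f'{token.text}\t{token.lemma_}\t{token.pos_}\n')

    with open('./ent/kim.ent', 'w', encoding='utf-8') as ent_file:
        for entity in doc.ents:
            ent_file.write(f'{entity.text}\t{entity.label_}\n')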
=== file2bib.sh ===
id: harper
author:
title: harper
date:
pages:
extension: .docx
txt: ./txt/harper.txt
cache: ./cache/harper.docx
Chroma BlackIsZero ['true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true']
Chroma ColorSpaceType ['RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB', 'RGB']
Chroma NumChannels ['4', '4', '4', '4', '4', '4', '4', '4', '4', '4', '4', '4', '4', '4', '4', '4', '4', '4', '4', '4', '3', '3', '4', '4', '4', '4', '4', '4', '4', '4', '4', '3', '4']
Compression CompressionTypeName ['deflate', 'deflate', 'deflate', 'deflate', 'deflate', 'deflate', 'deflate', 'deflate', 'deflate', 'deflate', 'deflate', 'deflate', 'deflate', 'deflate', 'deflate', 'deflate', 'deflate', 'deflate', 'deflate', 'deflate', 'deflate', 'deflate', 'deflate', 'deflate', 'deflate', 'deflate', 'deflate', 'deflate', 'deflate', 'deflate', 'deflate', 'deflate', 'deflate']
Compression Lossless ['true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'true']
Compression NumProgressiveScans ['1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1']
Content-Type ['application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'image/png', 'image/png', 'image/png', 'image/png', 'image/png', 'image/png', 'image/png', 'image/png', 'image/png', 'image/png', 'image/png', 'image/png', 'image/png', 'image/png', 'image/png', 'image/png', 'image/png', 'image/png', 'image/png', 'image/png', 'image/png', 'image/png', 'image/png', 'image/png', 'image/png', 'image/png', 'image/png', 'image/png', 'image/png', 'image/png', 'image/png', 'image/png', 'image/png']
Data BitsPerSample ['8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8', '8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8', '8 8 8 8']
Data PlanarConfiguration ['PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved', 'PixelInterleaved']
Data SampleFormat ['UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral', 'UnsignedIntegral']
Data SignificantBitsPerSample ['8 8 8 8', '8 8 8 8', '8 8 8 8']
Dimension ImageOrientation ['Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal', 'Normal']
Dimension PixelAspectRatio ['1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0', '1.0']
IHDR ['width=1, height=1, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=142, height=142, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=142, height=142, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=142, height=142, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=142, height=142, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=142, height=142, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=142, height=142, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=142, height=142, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=142, height=142, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=1, height=1, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=1, height=1, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=588, height=576, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=588, height=576, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=588, height=576, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=588, height=576, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=588, height=576, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=1, height=1, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=926, height=700, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=592, height=451, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=1, height=1, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=199, height=203, bitDepth=8, colorType=RGB, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=503, height=501, bitDepth=8, colorType=RGB, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=1, height=1, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=423, height=420, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=1480, height=533, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=1, height=1, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=1295, height=257, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=1, height=1, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=1150, height=128, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=1150, height=128, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=1, height=1, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=539, height=253, bitDepth=8, colorType=RGB, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none', 'width=1, height=1, bitDepth=8, colorType=RGBAlpha, compressionMethod=deflate, filterMethod=adaptive, interlaceMethod=none']
Transparency Alpha ['nonpremultipled', 'nonpremultipled', 'nonpremultipled', 'nonpremultipled', 'nonpremultipled', 'nonpremultipled', 'nonpremultipled', 'nonpremultipled', 'nonpremultipled', 'nonpremultipled', 'nonpremultipled', 'nonpremultipled', 'nonpremultipled', 'nonpremultipled', 'nonpremultipled', 'nonpremultipled', 'nonpremultipled', 'nonpremultipled', 'nonpremultipled', 'nonpremultipled', 'none', 'none', 'nonpremultipled', 'nonpremultipled', 'nonpremultipled', 'nonpremultipled', 'nonpremultipled', 'nonpremultipled', 'nonpremultipled', 'nonpremultipled', 'nonpremultipled', 'none', 'nonpremultipled']
X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.microsoft.ooxml.OOXMLParser', ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser'], ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.image.ImageParser']]
X-TIKA:content_handler ToTextContentHandler
X-TIKA:embedded_depth ['0', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1']
X-TIKA:embedded_resource_path ['/image6.png', '/image25.png', '/image23.png', '/image26.png', '/image27.png', '/image30.png', '/image20.png', '/image21.png', '/image19.png', '/image8.png', '/image3.png', '/image15.png', '/image13.png', '/image12.png', '/image16.png', '/image14.png', '/image2.png', '/image29.png', '/image32.png', '/image9.png', '/image24.png', '/image31.png', '/image10.png', '/image1.png', '/image28.png', '/image7.png', '/image17.png', '/image5.png', '/image22.png', '/image18.png', '/image4.png', '/image33.png', '/image11.png']
X-TIKA:parse_time_millis ['203', '2', '1', '3', '1', '1', '1', '0', '1', '1', '0', '1', '1', '0', '0', '1', '1', '0', '1', '1', '0', '1', '0', '1', '0', '1', '0', '0', '1', '0', '1', '1', '0', '1']
embeddedRelationshipId ['rId14', 'rId15', 'rId16', 'rId17', 'rId18', 'rId19', 'rId20', 'rId21', 'rId22', 'rId23', 'rId26', 'rId28', 'rId29', 'rId30', 'rId31', 'rId32', 'rId33', 'rId34', 'rId35', 'rId36', 'rId39', 'rId40', 'rId41', 'rId47', 'rId59', 'rId60', 'rId66', 'rId67', 'rId70', 'rId71', 'rId72', 'rId76', 'rId77']
height ['1', '142', '142', '142', '142', '142', '142', '142', '142', '1', '1', '576', '576', '576', '576', '576', '1', '700', '451', '1', '203', '501', '1', '420', '533', '1', '257', '1', '128', '128', '1', '253', '1']
resourceName ["b'harper.docx'", 'image6.png', 'image25.png', 'image23.png', 'image26.png', 'image27.png', 'image30.png', 'image20.png', 'image21.png', 'image19.png', 'image8.png', 'image3.png', 'image15.png', 'image13.png', 'image12.png', 'image16.png', 'image14.png', 'image2.png', 'image29.png', 'image32.png', 'image9.png', 'image24.png', 'image31.png', 'image10.png', 'image1.png', 'image28.png', 'image7.png', 'image17.png', 'image5.png', 'image22.png', 'image18.png', 'image4.png', 'image33.png', 'image11.png']
sBIT sBIT_RGBAlpha ['red=8, green=8, blue=8, alpha=8', 'red=8, green=8, blue=8, alpha=8', 'red=8, green=8, blue=8, alpha=8']
tiff:BitsPerSample ['8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8', '8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8 8', '8 8 8', '8 8 8 8']
tiff:ImageLength ['1', '142', '142', '142', '142', '142', '142', '142', '142', '1', '1', '576', '576', '576', '576', '576', '1', '700', '451', '1', '203', '501', '1', '420', '533', '1', '257', '1', '128', '128', '1', '253', '1']
tiff:ImageWidth ['1', '142', '142', '142', '142', '142', '142', '142', '142', '1', '1', '588', '588', '588', '588', '588', '1', '926', '592', '1', '199', '503', '1', '423', '1480', '1', '1295', '1', '1150', '1150', '1', '539', '1']
width ['1', '142', '142', '142', '142', '142', '142', '142', '142', '1', '1', '588', '588', '588', '588', '588', '1', '926', '592', '1', '199', '503', '1', '423', '1480', '1', '1295', '1', '1150', '1150', '1', '539', '1']
=== file2bib.sh ===
id: wiegand
author: Sue Wiegand
title: wiegand
date:
pages:
extension: .docx
txt: ./txt/wiegand.txt
cache: ./cache/wiegand.docx
Author Sue Wiegand
Content-Type application/vnd.openxmlformats-officedocument.wordprocessingml.document
Creation-Date 2020-01-14T23:22:00Z
X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.microsoft.ooxml.OOXMLParser']
X-TIKA:content_handler ToTextContentHandler
X-TIKA:embedded_depth 0
X-TIKA:parse_time_millis 73
creator Sue Wiegand
dc:creator Sue Wiegand
dcterms:created 2020-01-14T23:22:00Z
meta:author Sue Wiegand
meta:creation-date 2020-01-14T23:22:00Z
resourceName b'wiegand.docx'
harper txt/../ent/harper.ent
=== file2bib.sh ===
id: cohen-nakazawa
author: Jason E. Cohen
title: cohen-nakazawa
date:
pages:
extension: .docx
txt: ./txt/cohen-nakazawa.txt
cache: ./cache/cohen-nakazawa.docx
Author Jason E. Cohen
Content-Type application/vnd.openxmlformats-officedocument.wordprocessingml.document
Creation-Date 2020-02-18T19:17:00Z
X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.microsoft.ooxml.OOXMLParser']
X-TIKA:content_handler ToTextContentHandler
X-TIKA:embedded_depth 0
X-TIKA:parse_time_millis 70
creator Jason E. Cohen
dc:creator Jason E. Cohen
dcterms:created 2020-02-18T19:17:00Z
meta:author Jason E. Cohen
meta:creation-date 2020-02-18T19:17:00Z
resourceName b'cohen-nakazawa.docx'
kim txt/../ent/kim.ent
Done mapping.
Reducing machine-learning
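Each reduce.pl record below reports a words count, a sentences count, and a flesch readability score. Flesch reading ease is a standard formula, 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words), where higher scores indicate easier prose. The following sketch uses a crude vowel-group syllable counter; the toolbox's own counter is not shown in the log and likely differs:

    # Sketch: Flesch reading ease, as reported in the reduce.pl records below.
    import re

    def count_syllables(word):
        # Rough stand-in: count runs of vowels, minimum one per word.
        return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

    def flesch(text):
        sentences = max(1, len(re.findall(r'[.!?]+', text)))
        words = re.findall(r"[A-Za-z']+", text)
        syllables = sum(count_syllables(w) for w in words)
        n = max(1, len(words))
        return 206.835 - 1.015 * (n / sentences) - 84.6 * (syllables / n)

    text = open('./txt/prudhomme.txt', encoding='utf-8').read()
    print(round(flesch(text)))   # the log reports flesch = 49 for prudhomme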
=== reduce.pl bib ===
id = prudhomme
author =
title = prudhomme
date =
pages =
extension = .docx
mime = application/vnd.openxmlformats-officedocument.wordprocessingml.document
words = 3690
sentences = 245
flesch = 49
summary = However, "the viability of machine learning and artificial intelligence is predicated on the representativeness and quality of the data that they are trained on," as Thomas Padilla, Interim Head, Knowledge Production at the University of Nevada Las Vegas, asserts (2019, 14). In this essay, I begin by placing artificial intelligence and machine learning in context, then proceed by discussing why AI matters for archives and libraries, and describing the techniques used in a pilot automation project from the perspective of digital curation at Oklahoma State University Archives. Artificial intelligence, and specifically machine learning as a subfield of AI, has direct applications through pattern recognition techniques that predict the labeling values for unlabeled data. Along with greater computing capabilities, artificial intelligence could be an opportunity for libraries and archives to boost the discovery of their digital collections by pushing text and image recognition machine learning techniques to new limits.
cache = ./cache/prudhomme.docx
txt = ./txt/prudhomme.txt
=== reduce.pl bib ===
id = harper
author =
title = harper
date =
pages =
extension = .docx
mime = application/vnd.openxmlformats-officedocument.wordprocessingml.document
words = 5838
sentences = 489
flesch = 59
summary = Figure 2 Images generated with a simple statistical model appear as noise as the model is insufficient to capture the structure of the real data (Markov chains trained using wine bottles and circles from Google's QuickDraw dataset). Other types of generative statistical models, like Naive Bayes or a higher-order Markov chain,[footnoteRef:1] could perhaps capture a bit more information about the training data, but they would still be insufficient for real-world applications like this.[footnoteRef:2] Image, video, and audio are complicated; it is hard to reduce them to their essence with basic statistical rules in the way we were able to with the ordering of letters in English and Italian. Figure 4 A GAN being trained on wine bottle sketches from Google's quickdraw dataset (https://github.com/googlecreativelab/quickdraw-dataset) shows the generator learning how to produce better sketches over time. GANs in Action: Deep Learning with Generative Adversarial Networks.
cache = ./cache/harper.docx
txt = ./txt/harper.txt
=== reduce.pl bib ===
id = cohen-nakazawa
author = Jason E. Cohen
title = cohen-nakazawa
date =
pages =
extension = .docx
mime = application/vnd.openxmlformats-officedocument.wordprocessingml.document
words = 7632
sentences = 334
flesch = 48
summary = Consequently, our chapter describes the process we used to (1) generate technical and descriptive metadata for historical photographs as we pulled material from an extant blog website into a digital archives platform; (2) identify recurring faces in individual pictures as well as in photographs of groups of sometimes unidentified people in order to generate social networks as metadata; and (3) to help develop a controlled vocabulary for the institution's future needs for object management and description. Similarly, as the ownership of historical images suddenly extended to include present-day community members, and as these questions of access and serving a local public were inextricably bound up with interactions with members of that shared public whose family names and faces appear in the images we were making available, we began to consider the ways in which our archival work was tied to what Ryan Calo calls the "historical validation" of primary source materials (2017, 424-5).
cache = ./cache/cohen-nakazawa.docx
txt = ./txt/cohen-nakazawa.txt
=== reduce.pl bib ===
id = hansen
author =
title = hansen
date =
pages =
extension = .docx
mime = application/vnd.openxmlformats-officedocument.wordprocessingml.document
words = 4321
sentences = 235
flesch = 59
summary = [5: https://dml.cz/ ] [6: http://www.numdam.org/ ] [7: https://zbmath.org/ ] [8: Mathematical Subject Classification (MSC) values in MathSciNet and zbMath are a particularly interesting categorization set to work with as they are assigned and reviewed by a subject area expert editor and an active researcher in the same, or closely related, subfield as the article's content before they are published. Now let us shift from mathematics-specific categorization to subject categorization in general and look at the work Microsoft has done assigning Fields of Study (FoS) in the Microsoft Academic Graph (MAG) which is used to create their Microsoft Academic article search product.[footnoteRef:15] While the MAG FoS project is also attempting to categorize articles for proper indexing and search, it represents the second path which is taken by automated categorization projects: using machine learning techniques to both create the taxonomy and to classify.
cache = ./cache/hansen.docx
txt = ./txt/hansen.txt
=== reduce.pl bib ===
id = altman
author =
title = altman
date =
pages =
extension = .docx
mime = application/vnd.openxmlformats-officedocument.wordprocessingml.document
words = 6071
sentences = 311
flesch = 60
summary = I did most of my data cleanup by hand using spreadsheet software, and was not careful about preserving the formulas for each step of the process; instead, I deleted and wrote over many important intermediate computations, saving only the final results. The pipeline for a machine learning project generally comprises five stages: data acquisition, data preparation, model training and testing, evaluation and analysis, and application of results. However you get your initial data, it is generally a good idea to save a copy in the rawest possible form and treat that copy as immutable, at least during the initial phase of testing different algorithms or configurations. This is often the part of the process that requires the most work, and you should expect to iterate over your data preparations many times, even after you've started training and testing models. As you begin ingesting and preparing data, you'll want to explore possible machine learning algorithms to perform on your dataset.
cache = ./cache/altman.docx
txt = ./txt/altman.txt
=== reduce.pl bib ===
id = hintze-schossau
author =
title = hintze-schossau
date =
pages =
extension = .docx
mime = application/vnd.openxmlformats-officedocument.wordprocessingml.document
words = 5083
sentences = 336
flesch = 56
summary = Artificial Intelligence, with its ability to machine learn coupled to an almost humanlike understanding, sounds like the ideal tool to the humanities. Machine learning allows us to learn from these data sets in ways that exceed human capabilities, while an artificial brain will eventually allow us to objectively describe a subjective experience (through quantifying neural activations or positively and negatively associated memories). The following paragraphs will explore current Machine Learning and Artificial Intelligence technologies, explain how quantitative or qualitative they really are, and explore what the possible implications for future Digital Humanities could be. Currently, machines do not learn but must be trained, typically with human-labeled data. At the same time, memory formation (Marstaller, Hintze, and Adami 2013), information integration in the brain (Tononi 2004), and how systems evolve the ability to learn (Sheneman, Schossau, and Hintze 2019) are also being researched, as they are building blocks of general purpose intelligence.
cache = ./cache/hintze-schossau.docx
txt = ./txt/hintze-schossau.txt
=== reduce.pl bib ===
id = kim
author = Bohyun Kim
title = kim
date =
pages =
extension = .docx
mime = application/vnd.openxmlformats-officedocument.wordprocessingml.document
words = 6982
sentences = 516
flesch = 55
summary = With their limited intelligence and fully deterministic nature, early rule-based symbolic AI systems raised few ethical concerns.[footnoteRef:4] AI systems that near or surpass human capability, on the other hand, are likely to be given the autonomy to make their own decisions without humans, even when their workings are not entirely transparent, and some of those decisions are distinctively moral in character. The Library of Congress has worked on detecting features, such as railroads in maps, using the convolutional neural network model, and issued a solicitation for a machine learning and deep learning pilot program that will maximize the use of its digital collections in 2019.[footnoteRef:18] Indiana University Libraries, AVP, University of Texas Austin School of Information, and the New York Public Library are jointly developing the Audiovisual Metadata Platform (AMP), using many AI tools in order to automatically generate metadata for audiovisual materials, which collection managers can use to supplement their archival description and processing workflows.[footnoteRef:19] [18: See Blewer, Kim, and Phetteplace 2018 and Price 2019.
cache = ./cache/kim.docx
txt = ./txt/kim.txt
=== reduce.pl bib ===
id = morgan
author =
title = morgan
date =
pages =
extension = .docx
mime = application/vnd.openxmlformats-officedocument.wordprocessingml.document
words = 5269
sentences = 375
flesch = 59
summary = Now, in a time of "big data," it is possible to go beyond mere automation and towards the more intelligent use of computers; the use of algorithms and machine learning is an integral part of future library collection building and service provision. Finally, this chapter outlines both a number of possible machine learning applications for libraries as well as a few real world use cases. Like the scale of computer input, the library profession has not really exploited computers' ability to save, organize, and retrieve data; on the whole, the library profession does not understand the concept of a "data structure." For example, tab-delimited files, CSV (comma-separated value) files, relational database schema, XML files, JSON files, and the content of email messages or HTTP server responses are all examples of different types of data structures.
cache = ./cache/morgan.docx
txt = ./txt/morgan.txt
=== reduce.pl bib ===
id = wiegand
author = Sue Wiegand
title = wiegand
date =
pages =
extension = .docx
mime = application/vnd.openxmlformats-officedocument.wordprocessingml.document
words = 6152
sentences = 426
flesch = 44
summary = JSTOR, for example, will provide up to 25,000 documents (or more at special request) in a dataset for analysis.[footnoteRef:2] Clarivate's Content as a Service provides Web of Science data to accommodate multiple purposes.[footnoteRef:3] Besides the many freely available bibliodata sources, researchers can sign up for developer accounts in databases such as Scopus to work with datasets for text mining and computational analysis.[footnoteRef:4] Using library-licensed collections as data could allow researchers to save time in reading a large corpus, stay updated on a topic of interest, analyze the most important topics at a given time period, confirm gaps in the research literature for investigation, and increase the efficiency of sifting through massive amounts of research in, for instance, the race to develop a vaccine (Ong 2020; Vamathevan 2019). By building out new services and tools, and instructing at all levels, libraries can reinvent themselves continuously by investing in creative and sustainable innovation, from digital and data literacy to assembling modules for a library-based Researchers' Workstation that uses Machine Learning to enhance the efficiency of the scholars' research cycle.
cache = ./cache/wiegand.docx
txt = ./txt/wiegand.txt
=== reduce.pl bib ===
id = jiang
author =
title = jiang
date =
pages =
extension = .docx
mime = application/vnd.openxmlformats-officedocument.wordprocessingml.document
words = 3583
sentences = 323
flesch = 55
summary = Among the top strengths of happy marriages, at least five can be reflected in cross-disciplinary ML research, including "discuss problems well," "handle differences creatively," and "maintain a good balance of time alone and together." I use two examples of my personal experiences (as a computer scientist) of collaborating with researchers from multiple disciplines (e.g., historians, psychologists, IT technicians) to illustrate. Cross-disciplinary research matters, because (1) it provides an understanding of complex problems that require a multifaceted approach to solve; (2) it combines disciplinary breadth with the ability to collaborate and synthesize varying expertise; (3) it enables researchers to reach a wider audience and communicate diverse viewpoints; (4) it encourages researchers to confront questions that traditional disciplines do not ask while opening up new areas of research; and (5) it promotes disciplinary self-awareness about methods and creative practices (Urquhart et al.
cache = ./cache/jiang.docx
txt = ./txt/jiang.txt
=== reduce.pl bib ===
id = lucic-shanahan
author = Microsoft Office User
title = lucic-shanahan
date =
pages =
extension = .docx
mime = application/vnd.openxmlformats-officedocument.wordprocessingml.document
words = 2981
sentences = 180
flesch = 58
summary = On its "Big Read" website, the Library of Congress includes information about One Book programs around the United States,[footnoteRef:2] and the American Library Association (ALA) also provides materials with which a library can build its own One Book program and, in this way, bring members of their communities together in a conversation.[footnoteRef:3] While community reading programs are not a new phenomenon and exist in various formats and sizes, the One Book One Chicago program is notable because of its size (the Chicago Public Library has 81 local branches) as well as its history (the program has been in existence for nearly 20 years). As part of ongoing work of the "Reading Chicago Reading" project, we used the secure data portal of the HathiTrust Research Consortium to access and pre-process the in-copyright novels in our set. The place names extracted from our three Chicago-setting OBOC books allowed us to focus on particular areas of the city such as Hyde Park, which is mentioned in each of them.
cache = ./cache/lucic-shanahan.docx
txt = ./txt/lucic-shanahan.txt
=== reduce.pl bib ===
id = lesk
author =
title = lesk
date =
pages =
extension = .docx
mime = application/vnd.openxmlformats-officedocument.wordprocessingml.document
words = 4868
sentences = 364
flesch = 64
summary = Fragility errors here can arise from many sources for example, the training data may not be representative of the real problem (if you train a machine translation program solely on engineering documents, do not expect it to do well on theater reviews). Similarly, the New York Times discussed the way groups of primarily young white men will build systems that focus on their data, and give wrong or discriminatory answers in more general situations (Tugend 2019). Instead of trying to learn more about the characteristics of a system that is being modeled, the effort is driven by the dictum, "more data beats better algorithms." In a review of the history of speech recognition, Xuedong Huang, James Baker, and Raj Reddy write, "The power of these systems arises mainly from their ability to collect, process, and learn from very large datasets.
cache = ./cache/lesk.docx
txt = ./txt/lesk.txt
Building ./etc/reader.txt
altman wiegand morgan harper wiegand prudhomme
number of items: 12
sum of words: 62,470
average size in words: 5,205
average readability score: 55
nouns: data; learning; machine; research; libraries; library; information; process; model; example; time; images; use; systems; project; people; system; results; training; text; place; tools; work; way; algorithms; collections; researchers; problem; set; algorithm; dataset; materials; image; knowledge; problems; examples; number; applications; level; services; articles; recognition; network; gans; decision; classification; archives; techniques; input; file
verbs: is; be; are; have; was; were; do; has; using; see; learning; used; make; use; given; based; been; help; create; had; does; find; learn; generated; need; did; work; trained; provide; build; generate; being; making; working; know; including; identify; get; called; become; known; include; ’s; produce; found; add; want; understand; think; made
adjectives: such; new; other; many; different; more; digital; moral; human; deep; possible; large; good; local; important; -; ethical; able; social; specific; same; available; historical; real; own; neural; intelligent; high; full; better; common; library; final; traditional; public; first; computational; multiple; likely; cultural; artificial; unique; similar; particular; technical; simple; open; generative; disciplinary; second
adverbs: not; also; more; then; well; only; as; even; very; out; so; together; just; now; however; most; instead; here; up; often; still; n’t; already; first; rather; especially; perhaps; much; highly; really; far; back; always; too; morally; previously; sometimes; on; increasingly; down; fully; finally; automatically; yet; similarly; never; generally; enough; easily; better
pronouns: we; it; you; their; our; they; your; i; its; them; us; my; one; itself; themselves; her; me; his; he; yourself; she; ourselves; ours; ’s; ml+history; https://www.kaggle.com/c/deepfake-detection-challenge; https://devblogs.nvidia.com/explaining-deep-learning-self-driving-car/.; him; alphago
proper nouns: ai; learning; machine; al; ml; chicago; library; intelligence; artificial; et; university; new; digital; google; data; daniel; johnson; information; ieee; research; york; marc; gan; adversarial; science; n.d; microsoft; generative; review; networks; press; journal; technology; may; markov; international; conference; reading; kentucky; .; march; january; computer; december; congress; Řehůřek; proceedings; msc; mark; libraries
keywords: machine; learning; datum; system; library; image; university; tönnies; research; process; problem; pmss; place; new; networks; moral; microsoft; material; markov; marc; kentucky; information; human; generative; gan; example; eastern; disciplinary; chinese; chicago; balke; archive; algorithm; adversarial
one topic; one dimension: learning
file(s): ./cache/altman.docx
titles(s): altman
three topics; one dimension: learning; learning; data
file(s): ./cache/kim.docx, ./cache/harper.docx, ./cache/altman.docx
titles(s): kim | harper | altman
five topics; three dimensions: learning machine data; data learning machine; ai machine learning; library learning machine; chicago data place
file(s): ./cache/cohen-nakazawa.docx, ./cache/altman.docx, ./cache/kim.docx, ./cache/wiegand.docx, ./cache/lucic-shanahan.docx
titles(s): cohen-nakazawa | altman | kim | wiegand | lucic-shanahan
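The one-, three-, and five-topic summaries above come from topic modeling over the carrel's texts. The log does not name the modeling library; below is a minimal sketch of the same idea using scikit-learn's latent Dirichlet allocation, which may differ from the Reader's own implementation:

    # Sketch: topic modeling over txt/*.txt, analogous to the five-topic
    # summary above. Assumes scikit-learn is installed.
    from pathlib import Path
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    paths = sorted(Path('./txt').glob('*.txt'))
    texts = [p.read_text(encoding='utf-8') for p in paths]

    vectorizer = CountVectorizer(stop_words='english', max_df=0.9)
    matrix = vectorizer.fit_transform(texts)

    lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(matrix)
    terms = vectorizer.get_feature_names_out()
    for i, weights in enumerate(lda.components_):
        top = [terms[j] for j in weights.argsort()[-3:][::-1]]
        print(f'topic {i}:', ' '.join(top))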
==== make-pages.sh htm files
==== make-pages.sh complex files
==== make-pages.sh named entities
==== making bibliographics

id: altman
author:
title: altman
date:
words: 6071
sentences: 311
pages:
flesch: 60
cache: ./cache/altman.docx
txt: ./txt/altman.txt
summary: I did most of my data cleanup by hand using spreadsheet software, and was not careful about preserving the formulas for each step of the process; instead, I deleted and wrote over many important intermediate computations, saving only the final results. The pipeline for a machine learning project generally comprises five stages: data acquisition, data preparation, model training and testing, evaluation and analysis, and application of results. However you get your initial data, it is generally a good idea to save a copy in the rawest possible form and treat that copy as immutable, at least during the initial phase of testing different algorithms or configurations. This is often the part of the process that requires the most work, and you should expect to iterate over your data preparations many times, even after you've started training and testing models. As you begin ingesting and preparing data, you'll want to explore possible machine learning algorithms to perform on your dataset.

id: hansen
author:
title: hansen
date:
words: 4321
sentences: 235
pages:
flesch: 59
cache: ./cache/hansen.docx
txt: ./txt/hansen.txt
summary: [5: https://dml.cz/ ] [6: http://www.numdam.org/ ] [7: https://zbmath.org/ ] [8: Mathematical Subject Classification (MSC) values in MathSciNet and zbMath are a particularly interesting categorization set to work with as they are assigned and reviewed by a subject area expert editor and an active researcher in the same, or closely related, subfield as the article's content before they are published. Now let us shift from mathematics-specific categorization to subject categorization in general and look at the work Microsoft has done assigning Fields of Study (FoS) in the Microsoft Academic Graph (MAG) which is used to create their Microsoft Academic article search product.[footnoteRef:15] While the MAG FoS project is also attempting to categorize articles for proper indexing and search, it represents the second path which is taken by automated categorization projects: using machine learning techniques to both create the taxonomy and to classify.

id: harper
author:
title: harper
date:
words: 5838
sentences: 489
pages:
flesch: 59
cache: ./cache/harper.docx
txt: ./txt/harper.txt
summary: Figure 2 Images generated with a simple statistical model appear as noise as the model is insufficient to capture the structure of the real data (Markov chains trained using wine bottles and circles from Google's QuickDraw dataset). Other types of generative statistical models, like Naive Bayes or a higher-order Markov chain,[footnoteRef:1] could perhaps capture a bit more information about the training data, but they would still be insufficient for real-world applications like this.[footnoteRef:2] Image, video, and audio are complicated; it is hard to reduce them to their essence with basic statistical rules in the way we were able to with the ordering of letters in English and Italian. Figure 4 A GAN being trained on wine bottle sketches from Google's quickdraw dataset (https://github.com/googlecreativelab/quickdraw-dataset) shows the generator learning how to produce better sketches over time. GANs in Action: Deep Learning with Generative Adversarial Networks.

id: hintze-schossau
author:
title: hintze-schossau
date:
words: 5083
sentences: 336
pages:
flesch: 56
cache: ./cache/hintze-schossau.docx
txt: ./txt/hintze-schossau.txt
summary: Artificial Intelligence, with its ability to machine learn coupled to an almost humanlike understanding, sounds like the ideal tool to the humanities. Machine learning allows us to learn from these data sets in ways that exceed human capabilities, while an artificial brain will eventually allow us to objectively describe a subjective experience (through quantifying neural activations or positively and negatively associated memories). The following paragraphs will explore current Machine Learning and Artificial Intelligence technologies, explain how quantitative or qualitative they really are, and explore what the possible implications for future Digital Humanities could be. Currently, machines do not learn but must be trained, typically with human-labeled data. At the same time, memory formation (Marstaller, Hintze, and Adami 2013), information integration in the brain (Tononi 2004), and how systems evolve the ability to learn (Sheneman, Schossau, and Hintze 2019) are also being researched, as they are building blocks of general purpose intelligence.

id: jiang
author:
title: jiang
date:
words: 3583
sentences: 323
pages:
flesch: 55
cache: ./cache/jiang.docx
txt: ./txt/jiang.txt
summary: Among the top strengths of happy marriages, at least five can be reflected in cross-disciplinary ML research, including "discuss problems well," "handle differences creatively," and "maintain a good balance of time alone and together." I use two examples of my personal experiences (as a computer scientist) of collaborating with researchers from multiple disciplines (e.g., historians, psychologists, IT technicians) to illustrate. Cross-disciplinary research matters, because (1) it provides an understanding of complex problems that require a multifaceted approach to solve; (2) it combines disciplinary breadth with the ability to collaborate and synthesize varying expertise; (3) it enables researchers to reach a wider audience and communicate diverse viewpoints; (4) it encourages researchers to confront questions that traditional disciplines do not ask while opening up new areas of research; and (5) it promotes disciplinary self-awareness about methods and creative practices (Urquhart et al.

id: lesk
author:
title: lesk
date:
words: 4868
sentences: 364
pages:
flesch: 64
cache: ./cache/lesk.docx
txt: ./txt/lesk.txt
summary: Fragility errors here can arise from many sources for example, the training data may not be representative of the real problem (if you train a machine translation program solely on engineering documents, do not expect it to do well on theater reviews). Similarly, the New York Times discussed the way groups of primarily young white men will build systems that focus on their data, and give wrong or discriminatory answers in more general situations (Tugend 2019). Instead of trying to learn more about the characteristics of a system that is being modeled, the effort is driven by the dictum, "more data beats better algorithms." In a review of the history of speech recognition, Xuedong Huang, James Baker, and Raj Reddy write, "The power of these systems arises mainly from their ability to collect, process, and learn from very large datasets.

id: morgan
author:
title: morgan
date:
words: 5269
sentences: 375
pages:
flesch: 59
cache: ./cache/morgan.docx
txt: ./txt/morgan.txt
summary: Now, in a time of "big data," it is possible to go beyond mere automation and towards the more intelligent use of computers; the use of algorithms and machine learning is an integral part of future library collection building and service provision. Finally, this chapter outlines both a number of possible machine learning applications for libraries as well as a few real world use cases. Like the scale of computer input, the library profession has not really exploited computers' ability to save, organize, and retrieve data; on the whole, the library profession does not understand the concept of a "data structure." For example, tab-delimited files, CSV (comma-separated value) files, relational database schema, XML files, JSON files, and the content of email messages or HTTP server responses are all examples of different types of data structures.

id: prudhomme
author:
title: prudhomme
date:
words: 3690
sentences: 245
pages:
flesch: 49
cache: ./cache/prudhomme.docx
txt: ./txt/prudhomme.txt
summary: However, "the viability of machine learning and artificial intelligence is predicated on the representativeness and quality of the data that they are trained on," as Thomas Padilla, Interim Head, Knowledge Production at the University of Nevada Las Vegas, asserts (2019, 14). In this essay, I begin by placing artificial intelligence and machine learning in context, then proceed by discussing why AI matters for archives and libraries, and describing the techniques used in a pilot automation project from the perspective of digital curation at Oklahoma State University Archives. Artificial intelligence, and specifically machine learning as a subfield of AI, has direct applications through pattern recognition techniques that predict the labeling values for unlabeled data. Along with greater computing capabilities, artificial intelligence could be an opportunity for libraries and archives to boost the discovery of their digital collections by pushing text and image recognition machine learning techniques to new limits.

id: kim
author: Bohyun Kim
title: kim
date:
words: 6982
sentences: 516
pages:
flesch: 55
cache: ./cache/kim.docx
txt: ./txt/kim.txt
summary: With their limited intelligence and fully deterministic nature, early rule-based symbolic AI systems raised few ethical concerns.[footnoteRef:4] AI systems that near or surpass human capability, on the other hand, are likely to be given the autonomy to make their own decisions without humans, even when their workings are not entirely transparent, and some of those decisions are distinctively moral in character. The Library of Congress has worked on detecting features, such as railroads in maps, using the convolutional neural network model, and issued a solicitation for a machine learning and deep learning pilot program that will maximize the use of its digital collections in 2019.[footnoteRef:18] Indiana University Libraries, AVP, University of Texas Austin School of Information, and the New York Public Library are jointly developing the Audiovisual Metadata Platform (AMP), using many AI tools in order to automatically generate metadata for audiovisual materials, which collection managers can use to supplement their archival description and processing workflows.[footnoteRef:19] [18: See Blewer, Kim, and Phetteplace 2018 and Price 2019.

id: cohen-nakazawa
author: Jason E. Cohen
title: cohen-nakazawa
date:
words: 7632
sentences: 334
pages:
flesch: 48
cache: ./cache/cohen-nakazawa.docx
txt: ./txt/cohen-nakazawa.txt
summary: Consequently, our chapter describes the process we used to (1) generate technical and descriptive metadata for historical photographs as we pulled material from an extant blog website into a digital archives platform; (2) identify recurring faces in individual pictures as well as in photographs of groups of sometimes unidentified people in order to generate social networks as metadata; and (3) to help develop a controlled vocabulary for the institution's future needs for object management and description. Similarly, as the ownership of historical images suddenly extended to include present-day community members, and as these questions of access and serving a local public were inextricably bound up with interactions with members of that shared public whose family names and faces appear in the images we were making available, we began to consider the ways in which our archival work was tied to what Ryan Calo calls the "historical validation" of primary source materials (2017, 424-5).

id: lucic-shanahan
author: Microsoft Office User
title: lucic-shanahan
date:
words: 2981
sentences: 180
pages:
flesch: 58
cache: ./cache/lucic-shanahan.docx
txt: ./txt/lucic-shanahan.txt
summary: On its "Big Read" website, the Library of Congress includes information about One Book programs around the United States,[footnoteRef:2] and the American Library Association (ALA) also provides materials with which a library can build its own One Book program and, in this way, bring members of their communities together in a conversation.[footnoteRef:3] While community reading programs are not a new phenomenon and exist in various formats and sizes, the One Book One Chicago program is notable because of its size (the Chicago Public Library has 81 local branches) as well as its history (the program has been in existence for nearly 20 years). As part of ongoing work of the "Reading Chicago Reading" project, we used the secure data portal of the HathiTrust Research Consortium to access and pre-process the in-copyright novels in our set. The place names extracted from our three Chicago-setting OBOC books allowed us to focus on particular areas of the city such as Hyde Park, which is mentioned in each of them.

id: wiegand
author: Sue Wiegand
title: wiegand
date:
words: 6152
sentences: 426
pages:
flesch: 44
cache: ./cache/wiegand.docx
txt: ./txt/wiegand.txt
summary: JSTOR, for example, will provide up to 25,000 documents (or more at special request) in a dataset for analysis.[footnoteRef:2] Clarivate's Content as a Service provides Web of Science data to accommodate multiple purposes.[footnoteRef:3] Besides the many freely available bibliodata sources, researchers can sign up for developer accounts in databases such as Scopus to work with datasets for text mining and computational analysis.[footnoteRef:4] Using library-licensed collections as data could allow researchers to save time in reading a large corpus, stay updated on a topic of interest, analyze the most important topics at a given time period, confirm gaps in the research literature for investigation, and increase the efficiency of sifting through massive amounts of research in, for instance, the race to develop a vaccine (Ong 2020; Vamathevan 2019). By building out new services and tools, and instructing at all levels, libraries can reinvent themselves continuously by investing in creative and sustainable innovation, from digital and data literacy to assembling modules for a library-based Researchers' Workstation that uses Machine Learning to enhance the efficiency of the scholars' research cycle.

==== make-pages.sh questions
==== make-pages.sh search
==== make-pages.sh topic modeling corpus
Zipping study carrel
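The final step packages the whole carrel directory into a single downloadable archive. A one-call equivalent with the standard library (the toolbox's actual packaging code is not shown):

    # Sketch: zip the finished carrel, like the final "Zipping study carrel" step.
    import shutil

    shutil.make_archive('machine-learning', 'zip', root_dir='./machine-learning')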