# Teenie Week of Play

This repository contains the code from one week of intensive programming dedicated to testing, experimenting with, and documenting the limitations and capabilities of machine learning, text parsing, and crowdsourcing technologies in making a meaningful contribution to the archival metadata of the Teenie Harris Archive. This project builds on the significant advances made under the CMU NEH Grant for Advanced Computer Analysis of Teenie Harris Archives by Zarhia Howard, Golan Levin, and David Newbury at the STUDIO for Creative Inquiry (STFCI).

The Teenie Week of Play participants included:

- Dominique Luster, Neil Kulas, and Travis Snyder from Carnegie Museum of Art
- Golan Levin from CMU's Frank-Ratchye STUDIO for Creative Inquiry
- Jesse Hixson and Patrick Burke from Carney
- Sam Ticknor and Caroline Record from the Innovation Studio

During the week we explored four areas: automatically shortening titles, cleaning up existing subject headings, extracting names and locations from descriptions using Named Entity Recognition (NER), and verifying face recognition data using Amazon Mechanical Turk.

## Subject Headings

**Rationale:** Having properly assigned Library of Congress (LOC) approved subject headings is part of having a processed archive. Currently the Teenie collection has subject headings, but they don't follow best practice.

**Approach:** We created a tool in JavaScript that looks through the current subject headings, refines them, and extracts location data. It confirms that each subject heading is LOC approved and, if not, suggests an approved one. It removes unnecessary location qualifiers. It uses the Google Maps API to count how many times various locations are referenced in the subject headings and produces a list of locations, each with a confidence percentage.

**Results:** The tool correctly removed location qualifiers and suggested approved subject headings (for example, for Selma Burke).

**Future Work:** Combine this work with the NER location extraction to more accurately determine where each photo was taken. Feed in the descriptions to suggest new subject headings.
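The week's heading-checking tool was written in JavaScript; as a minimal Python sketch of the same idea, the snippet below asks the public suggest service at id.loc.gov whether a candidate heading matches an authorized label. The endpoint URL and the OpenSearch-style response shape are our assumptions about the current LOC service, not code from this repository.

```python
import requests

LOC_SUGGEST_URL = "https://id.loc.gov/authorities/subjects/suggest"

def check_heading(heading):
    """Ask id.loc.gov for authorized subject headings close to `heading`.

    Returns (is_approved, suggestions): is_approved is True when the
    heading exactly matches an authorized label; suggestions is the
    list of nearby authorized labels to offer instead.
    """
    resp = requests.get(LOC_SUGGEST_URL, params={"q": heading}, timeout=10)
    resp.raise_for_status()
    # The suggest service replies in OpenSearch Suggestions format:
    # [query, [labels], [descriptions], [uris]]
    labels = resp.json()[1]
    return heading in labels, labels

if __name__ == "__main__":
    approved, suggestions = check_heading("Basketball players")
    print("approved:", approved)
    print("suggestions:", suggestions[:5])
```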
## Auto Shortening Titles

**Rationale:** The titles are currently paragraph-length descriptions, too long to use on many labels and credits. Currently Dominique has to manually shorten each title on an as-needed basis. To reduce this workload and lay the groundwork for applying it across the collection, we investigated automatically suggesting shortened titles.

**Approach:** We used the Python modules spaCy and Textacy to break each description down into parts of speech. The script detects the root phrase and all its dependencies and removes all other unnecessary phrases.

**Results:**

| Before | After |
| --- | --- |
| Eight members of Ebenezer Little Rens basketball team posed in gymnasium | Eight members of Ebenezer Little Rens basketball team in gymnasium |
| Group portrait of fifteen South Hills High School basketball players, kneeling, left to right: John Bolla, Ray Sineway, John Carr, Henry Hemphill, Paul Rue and John Patterson; standing: Harry Guldatti, Tony Jeffries, Ken wade, Cal Jones, Paul Dorsett, Coach Bruce J. Weston, Carl Wade, Ulna Calloway, Jack Kress, and Andy Dick | fifteen South Hills High School basketball players |
| Children, including Cub Scouts and Brownies, posing on grandstand with television personalities Lolo and Marty Wolfson, with sign for TV Safety Rangers, at WTAE studio | Children at WTAE studio |
| Mal Goode, and woman posing beside Dodge car, at 1959 Pittsburgh Courier Home Service Fair, Syria Mosque | Mal Goode and woman at 1959 Pittsburgh Courier Home Service Fair, Syria Mosque |
| Nat King Cole, Harold Keith, George Pitts, and Gulf Oil executive Roy Kohler, posed in Carlton House for Capitol Records press party | Nat King Cole in Carlton House |
| Group portrait of clergy and church leaders, from left, seated: Rev. E. W. Gantt, Rev. E. F. Smith, Rev. Eugene E. Morgan, Bishop Stephen G. Spottswood, Rev. A. E. Harris, Rev. A.L. Pierce, and Rev. C. C. Nobles; standing: Rev. W. C. Ardrey, Dave Williamson, Rev. C. C. Ware, Willa Mae Rice, Mrs. B. F. Wright, P. L. Prattis, Rev. F. D. Porter, Rev. W. T. Kennedy, Rev. B. F. Wright, Rev. A. Nicholson, Rev. W. F. Moses, Rev. M. L. Banner, Rev. Allie Mae Johnson, Rev. L. Comer, Rev. A. L. Fuller, and Rev. Charles Foggie, posed in Wesley Center AME Zion Church for meeting on AMICO Corp. | clergy and church leaders |
| Men and women, including Prince "Big Blue" Bruce and Mrs. Bruce gathered around their son, John N. Bruce in U. S. Army uniform and holding little boy, in Bruce home, Rose Street | Men and women in U. S. Army uniform in Bruce home, Rose Street |
| Board of Education of Central Baptist Church members, seated, from left: Josephine Moore, Edith Venable, guest of honor; Rev. W. Augustus Jones, Mrs. Isaac Green, and Bell Lunn; standing: A.D. Taylor, Andrew Brookins, Catherine Graham, Robert Bailey, Evelyn Young, Isaac Willoughby, Christine Jones, Jack Fisher, Oplenell Rockamore, Clarence Payne, and Helen Thompson, gathered in Central Baptist Church for baby shower, another version | Board of Education in Central Baptist Church |

**Future Work:** Standardize to use approved verbs. Combine with NER to correctly identify locations and lists of names so they can be removed.
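The tuned rules live in the `title-trimmer` scripts; the snippet below is only a minimal sketch of the dependency-pruning idea using spaCy (one of the two modules named above). The model name and the set of dependency labels treated as peripheral are illustrative assumptions, so the output will be rougher than the tuned results in the table.

```python
import spacy

# Small English model; install first with:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Dependency labels we guess are peripheral (appositions, relative
# and adverbial clauses, etc.) - illustrative only.
DROP_DEPS = {"appos", "acl", "relcl", "advcl", "parataxis"}

def shorten(title):
    """Keep the root phrase and its core arguments, dropping every
    token that sits inside a subtree we consider peripheral."""
    doc = nlp(title)
    kept = []
    for token in doc:
        if token.dep_ in DROP_DEPS or any(a.dep_ in DROP_DEPS for a in token.ancestors):
            continue
        kept.append(token.text_with_ws)
    return "".join(kept).strip(" ,;")

print(shorten(
    "Children, including Cub Scouts and Brownies, posing on grandstand "
    "with television personalities Lolo and Marty Wolfson, with sign for "
    "TV Safety Rangers, at WTAE studio"))
```

Textacy adds convenience wrappers over the same parses; the core decision is just a walk over spaCy's dependency tree.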
## Named Entity Recognition

**Rationale:** Currently the majority of the Teenie Harris photographs have long descriptive titles that contain locations and names. Extracting the location and name data from the longer descriptions would lay the groundwork for geotagging the images and identifying all the people in them.

**Approach:** We used NER in Python to extract locations and names out of the titles. We first tried the NER built into the Natural Language Toolkit (NLTK) and then tried the Stanford NER.

**Results:** Stanford's NER (S-NER) worked significantly better than the one built into NLTK. Based on a sample size of 100, the S-NER extracted all the names successfully 73% of the time. When it did fail, it was most often because a location was confused for a name; see the NER README for more details.

Success examples:

| Title | Extracted Names |
| --- | --- |
| Pittsburgh Pirates baseball team manager Fred Haney posing with Birmingham Black Barons player Dwight Smallwood, at Forbes Field | Fred Haney, Dwight Smallwood |
| Group portrait of Grace Memorial Presbyterian Church basketball players, front row from left: David Cook, William Hicks, Alfred Hunt, Joe Baber and Crocker; back row: coach Joe Baber, Nolan, Ronald Moon, C. Paige, and Walter Boon, posed in Herron Hill gym | David Cook, William Hicks, Alfred Hunt, Joe Baber, Crocker, Joe Baber, Ronald Moon, C. Paige, Walter Boon |
| Mal Goode, and woman posing beside Dodge car, at 1959 Pittsburgh Courier Home Service Fair, Syria Mosque | Mal Goode |

Failure examples:

| Title | Extracted Names |
| --- | --- |
| Children, some in bathing suits, possibly including Rebecca Tab in center, Vernon Vaughn in light colored top, and Sara Mae Allen in back, standing on Webster Avenue under spray from fire hose, near Webster Avenue firehouse at Wandless Street, Hill District | Rebecca Tab, Vernon Vaughn, Sara Mae Allen, Hill District |
| Boxer "Jersey" Joe Walcott with baby on lap getting haircut from barber Clarence "Speedy" Williams in Crystal Barber Shop | Boxer, Joe Walcott, Clarence, Williams |
| Eddie Cooper and Wilhelmina handing out samples to women carrying basket style purses, behind Pepsi and Teem booth, at 1963 Pittsburgh Courier - WAMO, Home-A-Rama fair, Syria Mosque | Eddie Cooper, Wilhelmina, Home-A-Rama |

**Future Work:** This is a promising approach for applying a simple technique to clean archival metadata. It would improve the script significantly to give it a list of Pittsburgh locations to check against and to teach it to recognize nicknames. Separating names from the larger description is essential to bridging the gap between the STFCI's work visually identifying the figures in each image and linking those figures with actual names.
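For reference, here is a self-contained sketch of the NLTK baseline we compared against. The stronger Stanford tagger is available through `nltk.tag.stanford.StanfordNERTagger` but requires the Stanford NER jar on disk, so it is omitted here; this sketch shows only the built-in chunker's person/location extraction.

```python
import nltk

# One-time model downloads for NLTK's built-in pipeline.
for pkg in ("punkt", "averaged_perceptron_tagger",
            "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)

def extract_entities(title):
    """Return (names, locations) found in a photo title by NLTK's
    built-in named-entity chunker."""
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(title)))
    names, locations = [], []
    for subtree in tree:
        if hasattr(subtree, "label"):
            text = " ".join(word for word, tag in subtree.leaves())
            if subtree.label() == "PERSON":
                names.append(text)
            elif subtree.label() in ("GPE", "LOCATION"):
                locations.append(text)
    return names, locations

print(extract_entities(
    "Pittsburgh Pirates baseball team manager Fred Haney posing "
    "with Birmingham Black Barons player Dwight Smallwood, at Forbes Field"))
```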
## Mechanical Turk

**Rationale:** The STFCI applied face recognition across the entire collection and generated 3,452 potential matches. Each match comes with a score of how strong the match is. This process generated a range of results. Some matches are simply duplicates of images across the collections. Some matches are slightly different images from the same shoot. Some matches, "the holy grails," appear to identify the same people in totally different locations and contexts. Some matches are incorrect, matching people of different genders and ages. Since the project generated such a range of results, it is necessary to have all but the strongest matches verified by humans. During the Teenie Week of Play we experimented with using Amazon Mechanical Turk to verify potential matches.

**Approach:** We created two surveys, each with 5 potential matches. The surveys asked participants to identify whether the matches depicted were correct or not, with four levels of certainty. We put one obviously correct and one obviously incorrect match in each batch to verify that the participant was reliable. We had 15 people complete each survey. We paid 50 cents for one of the surveys and 25 cents for the other.

**Results:** We got results back within 1.5 hours for both surveys. On average the results successfully identified the blatantly correct and incorrect matches.

**Future Work:** In this experiment we manually coded the surveys. In the future we could use the MTurk API to automatically generate these surveys and feed the results directly back into JSON (see the sketch below). We could automatically get more confirmation on matches where there is more disagreement. To scale this across all 3,452 matches would cost roughly $520 ($36.00 + $480.32 = $516.32). We generated that cost based on the following assumptions:

- A reasonable Mechanical Turk rate of 4 cents per match (20 cents for answering 5 questions)
- Confirming the top 450 matches twice = $36.00
- Confirming the remaining 3,002 matches four times = $480.32
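A minimal sketch of the automation mentioned above, using boto3's MTurk client to publish a verification HIT against the sandbox. The question layout, reward, and assignment counts here are assumptions for illustration, not the surveys we actually ran.

```python
import boto3

# Sandbox endpoint so experiments don't spend real money; drop
# endpoint_url to post to the live marketplace.
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

HTML_QUESTION = """
<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
    <html><body>
      <!-- hypothetical layout: two face crops plus a 4-point certainty scale -->
      <form name="mturk_form" method="post" action="https://www.mturk.com/mturk/externalSubmit">
        <input type="hidden" name="assignmentId" value="">
        <p>Do these two photos show the same person?</p>
        <select name="certainty">
          <option>Definitely yes</option><option>Probably yes</option>
          <option>Probably no</option><option>Definitely no</option>
        </select>
        <input type="submit">
      </form>
    </body></html>
  ]]></HTMLContent>
  <FrameHeight>600</FrameHeight>
</HTMLQuestion>
"""

response = mturk.create_hit(
    Title="Verify a Teenie Harris face-recognition match",
    Description="Decide whether two photo crops show the same person.",
    Reward="0.04",                 # 4 cents per match, per the estimate above
    MaxAssignments=2,              # two confirmations for high-scoring matches
    LifetimeInSeconds=3600,
    AssignmentDurationInSeconds=300,
    Question=HTML_QUESTION,
)
print("HIT created:", response["HIT"]["HITId"])
```

Completed answers could then be pulled back with `list_assignments_for_hit` and merged into the match JSON, closing the loop described above.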