(Image: Tree Rings by Tracy O)

This post is really about exploring historical datasets with version control systems. Mark Graham posed a question to me earlier today asking what we know about the Twitter accounts of the members of Congress, specifically whether they have been removed after they left office. The hypothesis was that some members of the House and Senate may decide to delete their account on leaving DC.

I was immediately reminded of the excellent congress-legislators project, which collects all kinds of information about House and Senate members, including their social media accounts, into YAML files that are versioned in a GitHub repository. GitHub is a great place to curate a dataset like this because it allows anyone with a GitHub account to contribute to editing the data, and to share utilities to automate checks and modifications.

Unfortunately the file that tracks social media accounts is only for current members. Once they leave office they are removed from the file. The project does track other historical information for legislators, but the social media data isn't pulled in when this transition happens, or so it seems. Luckily Git doesn't forget. Since the project is using a version control system, all of the previously known social media links are in the history of the repository!

So I wrote a small program that uses gitpython to walk the legislators-social-media.yaml file backwards in time through each commit, parse the YAML at that previous state, and merge that information into a union of all the current and past legislator information. You can see the resulting program and output in us-legislators-social.

There's a little bit of a wrinkle in that not everything in the version history should be carried forward, because errors were corrected and bugs were fixed. Without digging into the diffs and analyzing them more it's hard to say whether a commit was a bug fix or whether it was simply adding new or deleting old information. If the YAML doesn't parse at a particular state that's easy to ignore. It also looks like the maintainers split out account ids from account usernames at one point. Derek Willis helpfully pointed out to me that Twitter doesn't care about the capitalization of usernames in URLs, so these needed to be normalized when merging the data. The same is true of Facebook, Instagram and YouTube. I guarded against these cases, but if you notice other problems let me know.

With the resulting merged historical data it's not too hard to write a program to read in the data, identify the politicians who left office after the 116th Congress, and examine their Twitter accounts to see whether they are still live. It turned out to be a little bit harder than I expected, because it's not as easy as you might think to check if a Twitter account is live or not. Twitter's web servers return an HTTP 200 OK even when responding to requests for URLs of non-existent accounts. To complicate things further, the error message indicating that the account doesn't exist only appears when the page is rendered in a browser, so a simple web scraping job that looks at the HTML is not sufficient. And finally, just because a Twitter username no longer seems to work, it's possible that the user has changed it to a new screen_name. Fortunately the unitedstates project also tracks the Twitter User ID (sometimes). If the user account is still there you can use the Twitter API to look up their current screen_name and see if it is different.
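Before getting to the Twitter checks, the git walking itself is worth a sketch. This isn't the code in us-legislators-social, just a minimal illustration of the approach with gitpython, assuming a local clone of the congress-legislators repository:

```python
import git
import yaml

# assumes a local clone of https://github.com/unitedstates/congress-legislators
repo = git.Repo("congress-legislators")
path = "legislators-social-media.yaml"

accounts = {}  # union of social media info, keyed by bioguide id

# walk the commits that touched the social media file, newest first,
# so the most recent (presumably most correct) data wins
for commit in repo.iter_commits(paths=path):
    try:
        blob = commit.tree / path
        data = yaml.safe_load(blob.data_stream.read())
    except (KeyError, yaml.YAMLError):
        # the file may be missing or unparseable at some states: skip them
        continue
    for entry in data:
        bioguide = entry.get("id", {}).get("bioguide")
        if bioguide and bioguide not in accounts:
            accounts[bioguide] = entry

print(f"found social media details for {len(accounts)} legislators")
```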
After putting all this together it's possible to generate a simple table of legislators who left office at the end of the 116th Congress, and their Twitter account information.

| name | url | url_ok | user_id | new_url |
| --- | --- | --- | --- | --- |
| Lamar Alexander | https://twitter.com/senalexander | True | 76649729 | |
| Michael B. Enzi | https://twitter.com/senatorenzi | True | 291756142 | |
| Pat Roberts | https://twitter.com/senpatroberts | True | 75364211 | |
| Tom Udall | https://twitter.com/senatortomudall | True | 60828944 | |
| Justin Amash | https://twitter.com/justinamash | True | 233842454 | |
| Rob Bishop | https://twitter.com/reprobbishop | True | 148006729 | |
| K. Michael Conaway | https://twitter.com/conawaytx11 | True | 295685416 | |
| Susan A. Davis | https://twitter.com/repsusandavis | False | 432771620 | |
| Eliot L. Engel | https://twitter.com/repeliotengel | True | 164007407 | |
| Bill Flores | https://twitter.com/repbillflores | False | 237312687 | |
| Cory Gardner | https://twitter.com/sencorygardner | True | 235217558 | |
| Peter T. King | https://twitter.com/reppeteking | True | 18277655 | |
| Steve King | https://twitter.com/stevekingia | True | 48117116 | |
| Daniel Lipinski | https://twitter.com/replipinski | True | 1009269193 | |
| David Loebsack | https://twitter.com/daveloebsack | True | 510516465 | |
| Nita M. Lowey | https://twitter.com/nitalowey | True | 221792092 | |
| Kenny Marchant | https://twitter.com/repkenmarchant | True | 23976316 | |
| Pete Olson | https://twitter.com/reppeteolson | True | 20053279 | |
| Martha Roby | https://twitter.com/repmartharoby | False | 224294785 | https://twitter.com/MarthaRobyAL |
| David P. Roe | https://twitter.com/drphilroe | True | 52503751 | |
| F. James Sensenbrenner, Jr. | https://twitter.com/jimpressoffice | False | 851621377 | |
| José E. Serrano | https://twitter.com/repjoseserrano | True | 33563161 | |
| John Shimkus | https://twitter.com/repshimkus | True | 15600527 | |
| Mac Thornberry | https://twitter.com/mactxpress | True | 377534571 | |
| Scott R. Tipton | https://twitter.com/reptipton | True | 242873057 | |
| Peter J. Visclosky | https://twitter.com/repvisclosky | True | 193872188 | |
| Greg Walden | https://twitter.com/repgregwalden | True | 32010840 | |
| Rob Woodall | https://twitter.com/reprobwoodall | True | 2382685057 | |
| Ted S. Yoho | https://twitter.com/reptedyoho | True | 1071900114 | |
| Doug Collins | https://twitter.com/repdougcollins | True | 1060487274 | |
| Tulsi Gabbard | https://twitter.com/tulsipress | True | 1064206014 | |
| Susan W. Brooks | https://twitter.com/susanwbrooks | True | 1074101017 | |
| Joseph P. Kennedy III | https://twitter.com/repjoekennedy | False | 1055907624 | https://twitter.com/joekennedy |
| George Holding | https://twitter.com/repholding | True | 1058460818 | |
| Denny Heck | https://twitter.com/repdennyheck | False | 1068499286 | https://twitter.com/LtGovDennyHeck |
| Bradley Byrne | https://twitter.com/repbyrne | True | 2253968388 | |
| Ralph Lee Abraham | https://twitter.com/repabraham | True | 2962891515 | |
| Will Hurd | https://twitter.com/hurdonthehill | True | 2963445730 | |
| David Perdue | https://twitter.com/sendavidperdue | True | 2863210809 | |
| Mark Walker | https://twitter.com/repmarkwalker | True | 2966205003 | |
| Francis Rooney | https://twitter.com/reprooney | True | 816111677917851649 | |
| Paul Mitchell | https://twitter.com/reppaulmitchell | True | 811632636598910976 | |
| Doug Jones | https://twitter.com/sendougjones | True | 941080085121175552 | |
| TJ Cox | https://twitter.com/reptjcox | True | 1080875913926139910 | |
| Gilbert Ray Cisneros, Jr. | https://twitter.com/repgilcisneros | True | 1080986167003230208 | |
| Harley Rouda | https://twitter.com/repharley | True | 1075080722241736704 | |
| Ross Spano | https://twitter.com/reprossspano | True | 1090328229548826627 | |
| Debbie Mucarsel-Powell | https://twitter.com/repdmp | True | 1080941062028447744 | |
| Donna E. Shalala | https://twitter.com/repshalala | False | 1060584809095925762 | |
| Abby Finkenauer | https://twitter.com/repfinkenauer | True | 1081256295469068288 | |
| Steve Watkins | https://twitter.com/rep_watkins | False | 1080307235350241280 | |
| Xochitl Torres Small | https://twitter.com/reptorressmall | True | 1080830346915209216 | |
| Max Rose | https://twitter.com/repmaxrose | True | 1078692057940742144 | |
| Anthony Brindisi | https://twitter.com/repbrindisi | True | 1080978331535896576 | |
| Kendra S. Horn | https://twitter.com/repkendrahorn | False | 1083019402046513152 | https://twitter.com/KendraSHorn |
| Joe Cunningham | https://twitter.com/repcunningham | True | 1080198683713507335 | |
| Ben McAdams | https://twitter.com/repbenmcadams | False | 196362083 | https://twitter.com/BenMcAdamsUT |
| Denver Riggleman | https://twitter.com/repriggleman | True | 1080504024695222273 | |

In most cases where the account has been updated the individual simply changed their Twitter username, sometimes removing "Rep" from it, like RepJoeKennedy becoming JoeKennedy. As an aside, I'm kind of surprised that Twitter username wasn't already taken, to be honest. Maybe that's a perk of having a verified account, or of being a politician? But if you look closely you can see there were a few that seemed to have deleted their account altogether:

| name | url | url_ok | user_id |
| --- | --- | --- | --- |
| Susan A. Davis | https://twitter.com/repsusandavis | False | 432771620 |
| Bill Flores | https://twitter.com/repbillflores | False | 237312687 |
| F. James Sensenbrenner, Jr. | https://twitter.com/jimpressoffice | False | 851621377 |
| Donna E. Shalala | https://twitter.com/repshalala | False | 1060584809095925762 |
| Steve Watkins | https://twitter.com/rep_watkins | False | 1080307235350241280 |

There are two notable exceptions to this. The first is Vice President Kamala Harris. My logic for determining if a person was leaving Congress was to see if they served in a term ending on 2021-01-03, and weren't serving in a term starting then. But Harris is different because her term as a Senator is listed as ending on 2021-01-18. Her old account (???) is no longer available, but her Twitter User ID is still active and is now attached to the account at (???). The other of course is Joe Biden, who stopped being a senator in order to become the President. His Twitter account remains the same at (???).

It's worth highlighting here how there seems to be no uniform approach to handling this process. In one case (???) is temporarily blessed as the VP, with a unified account history underneath. In the other there is a separation between (???) and (???). It seems like Twitter has some work to do on managing identities, or maybe the Congress needs to prescribe a set of procedures? Or maybe I'm missing part of the picture, and just as (???) somehow changed back to (???) there is some namespace management going on behind the scenes?

If you are interested in other social media platforms like Facebook, Instagram and YouTube, the unitedstates project tracks information for those platforms too. I merged that information into the legislators.yaml file I discussed here if you want to try to check them. I think that one thing this experiment shows is that if the platform allows for usernames to be changed it is critical to track the user id as well. I didn't do the work to check that those accounts exist. But that's a project for another day.

I'm not sure this list of five deleted accounts is terribly interesting at the end of all this. Possibly? But on the plus side I did learn how to interact with Git better from Python, which is something I can imagine returning to in the future.
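The url_ok and new_url columns above come from checking each account and, where a user_id was known, asking Twitter what screen_name it currently points at. Here is a rough sketch of that kind of check, not the code in us-legislators-social, and assuming a bearer token for the v1.1 users/show endpoint that was available at the time:

```python
import requests

TWITTER_API = "https://api.twitter.com/1.1/users/show.json"
BEARER_TOKEN = "..."  # hypothetical placeholder: supply your own API credentials


def current_screen_name(user_id):
    """Look up the current screen_name for a Twitter user id, or None if it's gone."""
    resp = requests.get(
        TWITTER_API,
        params={"user_id": user_id},
        headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
    )
    if resp.status_code == 200:
        return resp.json()["screen_name"]
    return None


# e.g. a user id from the table above: has the username changed?
print(current_screen_name("224294785"))
# a changed screen_name suggests the account moved rather than vanished
```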
It’s not every day that you have to think of the versions of a dataset as an important feature of the data, outside of serving as a backup that can be reverted to if necessary. But of course data changes in time, and if seeing that data over time is useful, then the revision history takes on a new significance. It’s nothing new to see version control systems as critical data provenance technologies, but it felt new to actually use one that way to answer a question. Thanks Mark! Trump's Tweets TLDR: Trump’s tweets are gone from twitter.com but still exist spectrally in various states all over the web. After profiting off of their distribution Twitter now have a responsibility to provide meaningful access to the Trump tweets as a read only archive. This post is also published on the Documenting the Now Medium where you can comment, if the mood takes you. So Trump’s Twitter account is gone. Finally. It’s strange to have had to wait until the waning days of his presidency to achieve this very small and simple act of holding him accountable to Twitter’s community guidelines…just like any other user of the platform. Better late than never, especially since his misinformation and lies can continue to spread after he has left office. But isn’t it painful to imagine what the last four years (or more) could have looked like if Twitter and the media at large had recognized their responsibility and acted sooner? When Twitter suspended Trump’s account they didn’t simply freeze it and prevent him from sending more hateful messages. They flipped a switch that made all the tweets he has ever sent disappear from the web. These are tweets that had real material consequences in the world. As despicable as Trump’s utterances have been, a complete and authentic record of them having existed is important for the history books, and for holding him to account. Twitter’s suspension of Donald Trump’s account has also removed all of his thousands of tweets sent over the years. I personally find it useful as a reporter to be able to search through his tweets. They are an important part of the historical record. Where do they live now? — Olivia Nuzzi (@Olivianuzzi) January 9, 2021 Where indeed? One hopes that they will end up in the National Archives (more on that in a moment). But depending on how you look at it, they are everywhere. Twitter removed Trump’s tweets from public view at twitter.com. But fortunately, as Shawn Jones notes, embedded tweets like the one above persist the tweet text into the HTML document itself. When a tweet is deleted from twitter.com the text stays behind elsewhere on the web like a residue, as evidence (that can be faked) of what was said and when. It’s difficult to say whether this graceful degradation was an intentional design decision to make their content more resilient, or it was simply a function of Twitter wanting their content to begin rendering before their JavaScript had loaded and had a chance to emboss the page. But design intent isn’t really what matters here. What does matter is the way this form of social media content degrades in the web commons. Kari Kraus calls this process “spectral wear”, where digital media “help mitigate privacy and surveillance concerns through figurative rather than quantitative displays, reflect and document patterns of use, and promote an ethics of care.” (Kraus, 2019). 
This spectral wear is a direct result of tweet embed practices that Twitter itself promulgates while simultaneously forbidding in its Developer Terms of Service:

If Twitter Content is deleted, gains protected status, or is otherwise suspended, withheld, modified, or removed from the Twitter Applications (including removal of location information), you will make all reasonable efforts to delete or modify such Twitter Content (as applicable) as soon as possible.

Fortunately for history there has probably never been more heavily copied social media content than Donald Trump's tweets. We aren't immediately dependent on twitter.com to make this content available because of the other places on the web where it exists. What does this copying activity look like? I intentionally used copied instead of archived above because the various representations of Trump's tweets vary in terms of their coverage, and how they are being cared for. Given their complicity in bringing Trump's messages of division and hatred to a worldwide audience, while profiting off of them, Twitter now have a responsibility to provide as best a representation of this record for the public, and for history.

We know that the Trump administration have been collecting the @realDonaldTrump Twitter account, and plan to make it available on the web as part of their responsibilities under the Presidential Records Act:

The National Archives will receive, preserve, and provide public access to all official Trump Administration social media content, including deleted posts from @realDonaldTrump and @POTUS. The White House has been using an archiving tool that captures and preserves all content, in accordance with the Presidential Records Act and in consultation with National Archives officials. These records will be turned over to the National Archives beginning on January 20, 2021, and the President's accounts will then be made available online at NARA's newly established trumplibrary.gov website.

NARA is the logical place for these records to go. But it is unclear what shape these archival records will take. Sure, the Library of Congress has (or had) its Twitter archive. It's not at all clear if they are still adding to it. But even if they are, LC probably hasn't felt obligated to collect the records of an official from the Executive Branch, since they are firmly lodged in the Legislative. Then again they collect GIFs so, maybe?

Reading between the lines it appears that a third party service is being used to collect the social media content: possibly one of the several e-discovery tools like ArchiveSocial or Hanzo. It also looks like the Trump Administration themselves have entered into this contract, and at the end of its term (i.e. now) will extract their data and deliver it to NARA. Given their past behavior it's not difficult to imagine the Trump administration not living up to this agreement in substantial ways.

This current process is a slight departure from the approach taken by the Obama administration. Obama initiated a process where platforms migrated official accounts to new accounts that were then managed going forward by NARA (Acker & Kriesberg, 2017). We can see that this practice was used again on January 20, 2021 when Biden became President. But what is different is that Barack Obama retained ownership of his personal account @barackobama, which he continues to use. NARA has announced that they will be archiving Trump's now deleted (or hidden) personal account:
A number of Trump administration officials, including President Trump, used personal accounts when conducting government business. The National Archives will make the social media content from those designated accounts publicly available as soon as possible.

The question remains, what representation should be used, and what is Twitter's role in providing it?

Meanwhile there are online collections like The Trump Archive, the New York Times' Complete List of Trump's Twitter Insults, ProPublica's Politwoops and countless GitHub repositories of data which have collected Trump's tweets. These tweets are used in a multitude of ways, including things as absurd as a source for conducting trades on the stock market. But seeing these tweets as they appeared in the browser, with associated metrics and comments, is important. Of course you can go view the account in the Wayback Machine and browse around. But what if we wanted a list of all the Trump tweets? How many times were these tweets actually archived? How complete is the list?

After some experiments with the Internet Archive's API it's possible to get a picture of how the tweets from the @realDonaldTrump account have been archived there. There are a few wrinkles, because a given tweet can have many different URL forms (e.g. tracking parameters in the URL query string). In addition, just because there was a request to archive a URL for something that looks like a realDonaldTrump tweet URL doesn't mean it resulted in a successful response. Success here means a 200 OK from twitter.com when resolving the URL. Factoring these issues into the analysis, it appears the Wayback Machine contains (at least) 16,043,553 snapshots of Trump's tweets, that is URLs of the form:

https://twitter.com/realDonaldTrump/status/{tweet-id}

Of these millions of snapshots there appear to be 57,292 unique tweets. This roughly correlates with the 59K total tweets suggested by the last profile snapshots of the account. The maximum number of times in one day that his tweets were archived was 71,837 times, on February 10, 2020. Here's what the archive snapshots of Trump's tweets look like over time (snapshots per week).

It is relatively easy to use the CSV export from the Trump Archive project to see what tweets they know about that the Internet Archive does not, and vice-versa (for the details see the Jupyter notebook and SQLite database here). It looks like there are 526 tweet IDs in the Trump Archive that are missing from the Internet Archive. But further examination shows that many of these are retweets, which in Twitter's web interface have sometimes redirected back to the original tweet. Removing these retweets to specifically look at Trump's own tweets, there are only 7 tweets in the Trump Archive that are missing from the Internet Archive. Of these, 4 are in fact retweets that have been miscategorized by the Trump Archive. One of the remaining three is this one, which is identified in the Trump Archive as deleted, and wasn't collected quickly enough by the Internet Archive before it was deleted:

Roger Stone was targeted by an illegal Witch Hunt tha never should have taken place. It is the other side that are criminals, including the fact that Biden and Obama illegally spied on my campaign - AND GOT CAUGHT!

Sure enough, over at the Politwoops project you can see that this tweet was deleted 47 seconds after it was sent.

Flipping the table, it's also possible to look at what tweets are in the Internet Archive but not in the Trump Archive.
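Both directions of that comparison start from the same raw material: listings of snapshots from the Wayback Machine's CDX API. The snippet below is not the notebook's actual code, just a minimal sketch of the kind of query involved, using parameters documented for the CDX server:

```python
import re
import requests

CDX_API = "http://web.archive.org/cdx/search/cdx"

# ask the Wayback CDX server for successful (HTTP 200) captures of
# anything that looks like a realDonaldTrump tweet URL
params = {
    "url": "twitter.com/realDonaldTrump/status",
    "matchType": "prefix",
    "filter": "statuscode:200",
    "fl": "original,timestamp",
    "output": "json",
    "limit": 1000,  # keep the sketch small; the real analysis pages through millions
}

rows = requests.get(CDX_API, params=params).json()
tweet_ids = set()
for original, timestamp in rows[1:]:  # the first row is the header
    m = re.search(r"/status(?:es)?/(\d+)", original)
    if m:
        tweet_ids.add(m.group(1))

print(f"{len(rows) - 1} snapshots, {len(tweet_ids)} unique tweet ids")
```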
It turns out that there are 3,592 tweet identifiers in the Wayback Machine for Trump's tweets which do not appear in the Trump Archive. Looking a bit closer we can see that some are clearly wrong, because the id itself is too small a number, or too large. And then looking at some of the snapshots it appears that they often don't resolve, and simply display a "Something went wrong" message. Yes, something definitely went wrong (in more ways than one). Just spot checking a few, there also appear to be some legit tweets in the Wayback Machine that are not in the Trump Archive, like this one. Notice how the media will not play there? It would take some heavy manual curation work to sort through these tweet IDs to see which ones are legit, and which ones aren't. But if you are interested here's an editable Google Sheet.

Finally, here is a list of the top ten archived (at the Internet Archive) tweets. The counts here reflect all the variations for a given tweet URL, so they will very likely not match the count you see in the Wayback Machine, which is for the specific URL (no query parameters).

| tweet | snapshots |
| --- | --- |
| Thank you Alabama! #Trump2016#SuperTuesday | 65,489 |
| MAKE AMERICA GREAT AGAIN! | 65,360 |
| Thank you Georgia!#SuperTuesday #Trump2016 | 65,358 |
| Such a beautiful and important evening! The forgotten man and woman will never be forgotten again. We will all come together as never before | 55,426 |
| Watching the returns at 9:45pm. #ElectionNight #MAGA🇺🇸 https://t.co/HfuJeRZbod | 54,889 |
| #ElectionDay https://t.co/MXrAxYnTjY https://t.co/FZhOncih21 | 54,291 |
| I will be watching the election results from Trump Tower in Manhattan with my family and friends. Very exciting! | 54,263 |
| I love you North Carolina- thank you for your amazing support! Get out and https://t.co/HfihPERFgZ tomorrow!Watch:… https://t.co/jZzfqUZNYh | 54,252 |
| Still time to #VoteTrump! #iVoted #ElectionNight https://t.co/UZtYAY1Ba6 | 54,224 |
| Watching my beautiful wife, Melania, speak about our love of country and family. We will make you all very proud…. https://t.co/DiKmSnTlC2 | 54,100 |

The point of this rambling data spelunking, if you've made it this far, is to highlight the degree to which Trump's tweets have been archived (or collected), and how the completeness and quality of those representations is very fluid and difficult to ascertain. Hopefully Twitter is working with NARA to provide as complete a picture as possible of what Trump said on Twitter. As much as we would like to forget, we must not.

References

Acker, A., & Kriesberg, A. (2017). Tweets may be archived: Civic engagement, digital preservation and Obama White House social media data. Proceedings of the Association for Information Science and Technology, 54(1), 1–9.

Kraus, K. (2019). The care of enchanted things. In M. K. Gold & L. F. Klein (Eds.), Debates in the digital humanities 2019. Retrieved from https://www.jstor.org/stable/10.5749/j.ctvg251hk.17

noarchive

I wrote a few weeks ago about the timeworn practice of using robots.txt to control whether web content is playable in web archives. This practice started back in 2002, with the Oakland Archive Policy. While it's imperfect, and perhaps contrary to common sense, technologies for consent in web archiving are important levers to have available, even while the web is designed to be an open information system.
While doing some research with Jess and Shawn into the use of web archives I recently ran across something I felt like I should have known, or perhaps did know once and then forgot about, which I thought I'd drop a note about here so I am less likely to forget it again. This concerns the noarchive meta tag and HTTP header (see this page if the previous link fails).

It appears that since at least 2007 Google and other major search engines have allowed web publishers to control whether the content that Google has crawled from their website will show up with a Google Cache link in search results. I had certainly known about the use of noindex and nofollow in meta tags and on links to control whether search engines index a page. But even after writing an entire dissertation about web archiving practice I'm somewhat abashed to admit that I didn't know noarchive existed. The basic idea is that you can put this in your web page:

<meta name="robots" content="noarchive" />

or have your web server respond using this HTTP header:

X-Robots-Tag: noarchive

and search engines like Google will not display a link to cached content of the page. Reading between the lines a bit, this doesn't mean that the content isn't being stored/cached (it needs to be indexed after all). It just means that Google won't display a link to the cached content.

Maybe it's just me, but the directive is a bit oddly named, because noarchive controls a cache, and the content in caches, at least in computing, is typically thought to be temporary. Caches allow processes to be sped up by localizing resources that are expensive to retrieve or compute. But caches themselves have resource limits and often need to be methodically purged, which is known as cache invalidation. The practice of using caches has long been an intrinsic feature of the web, and is arguably one of the primary reasons for its success as a global information architecture.

In addition to this computational definition, cache has some additional layers of meaning. The OED offers this one for the noun form, which dates back to the mid 1800s and the French verb cacher (to hide):

a. A hiding place, esp. of goods, treasure, etc. b. especially a hole or mound made by American pioneers and Arctic explorers to hide stores of provisions, ammunition, etc. The store of provisions so hidden.

It also has a verb form that originated in the United States during the early 1800s:

transitive. To put in a cache; to store (provision) under ground; said also of animals.

These definitions suggest a less volatile state for cached contents, where the cache could store content for some time. But it's interesting to note how these definitions, along with the computational one, underscore use. The resources that are being stored in the cache are valuable, and have an intended use in the future. And speaking of value, in English cache sounds identical to another word… cash, which seems to have an entirely different meaning. Of course cash is usually used to talk about money. But interestingly it derives from another French word casse, which is "a box, case, or chest, to carrie or kepe wares in". Which really echoes the previous definition of cache. The tangled connections between caches and money, as well as the semantic interchangeability of the container with the contained, seem significant here.

So what does all this have to do with noarchive? Well I'm not sure to be honest :) These are just some rambling notes after all.
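That said, actually checking a page for the directive is simple enough to script. Here is a rough sketch of my own (not from any archiving tool; a real crawler would use a proper HTML parser rather than a regular expression):

```python
import re
import requests


def has_noarchive(url):
    """Roughly check whether a page asks not to have a cached copy displayed."""
    resp = requests.get(url, timeout=10)

    # the HTTP header form: X-Robots-Tag: noarchive
    header = resp.headers.get("X-Robots-Tag", "")
    if "noarchive" in header.lower():
        return True

    # the meta tag form: <meta name="robots" content="noarchive">
    meta = re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']*)["\']',
        resp.text,
        re.IGNORECASE,
    )
    return bool(meta and "noarchive" in meta.group(1).lower())


print(has_noarchive("https://example.com/"))
```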
But I do think it’s interesting how the noun and verb forms of cache mirror the noun and verb forms of archive, and how this creative analogy has been cooked into web standards and crawling practices. But bringing us back to the question of the use of noarchive on the web, it seems to me that noarchive is a striking example of an explicit practice for expressing consent (or lack thereof) to having your content archived on the web. It is better than robots.txt because it more accurately expresses the intent that the content not be archived, and it allows for two modes of expression (HTML and HTTP). The caveat being that the archive most likely still exists, but only as a shadow. How long those shadows are cast would be an interesting research project in itself. How much of an archive is the Google Cache, really? Do any public (or private) web archiving projects actually look for and act on noarchive? But it also would be nice to know a bit more about the prehistory of noarchive. It was introduced not long after a lawsuit was brought by Belgian company Copiepresse against Google and Google News. The order in 2006 found that the activities of Google News and the use of cache by Google violates copyrights and neighbouring rights (Act of 1994) and rights on databases (Act of 1998). Perhaps there was a direct connection between Google losing this case and the introduction of a control for whether cache links are displayed? Maybe the proceedings of that case provide more insight into how Google’s cache operates? And how did noarchive.net come about, and who is keeping it online? Answering those questions are for another day… What's the diff? This is a really excellent lecture by Bernhard Rieder about what Digital Humanities and Media Studies have in common, how they are different, and why it matters. Since starting to work in a group that has traditionally seen itself as firmly planted in the digital humanities I’ve sometimes found myself a bit perplexed by the disciplinary walls that seem to come up in projects, when the object of study isn’t a traditional media form like a book, or a piece of art, a piece of music, a dance, etc. I’ve come up against this a bit in my own research into how archives function on the web, because I chose to see web archives as part of set of practices that extend back in time, to the pre-digital world of paper and print (which of course are information technologies as well). One thing I like about the media studies approach is it sees various forms of media as part of a continuum of technologies, where the digital isn’t necessarily exceptional or privileged. As Rieder points out, the debates around what is, and is not, DH are widely recognized to be a bit stale now. But understanding the similarities and differences between digital humanities and media studies is still highly relevant, especially for interdisciplinary, collaborative scholarship. I very much liked Rieder’s approach of laying out the differences and similarities in terms of his Mapping YouTube project, and the theoretical and methodological influences that guided the work he and his collaborators undertook. The key message I took away is that the primary difference between digital humanities and media studies has traditionally been the objects of study. For example, asking questions about a large collection of printed books is fundamentally different from asking questions about a social media platform. 
But the way one goes about answering these questions (the methods) can actually be quite similar in both areas. In fact digital methods, and the theories they draw on, are something that bring these two disciplines into alignment, and make them vibrant sites for collaborative work. I used the past tense "has" there because I agree with Rieder that the differences are breaking down significantly as digital humanities research increasingly analyzes the contemporary, and media studies itself has a historical component.

A few things I learned about that I want to follow up on:

The idea of platform vernacular developed by Gibbs et al to describe the ways that communicative practice evolves in platforms.

The practice of exploratory data analysis, which I feel like I've understood intuitively in much of my work, but am somewhat abashed to admit I didn't realize was written about by John Tukey back in 1977. The goal of EDA is for data analysis to help in the formulation of hypotheses, rather than the testing of a hypothesis. It is also useful for gaining insights into what more data needs to be collected. I'm kind of curious to see what relation, if any, this idea of exploratory data analysis might have with Peirce's idea of abductive reasoning.

I've known about Rieder's book Engines of Order for a few months now, and really need to move it up the to-read pile. In my own recently completed dissertation work I looked at how algorithms (fixity algorithms) participate in complex systems of control, and knowledge production. So I'm interested to see what argument Rieder makes, and hope to learn a bit more about Simondon, who I've had some difficulty understanding in the past. (Although, I did enjoy this offbeat and accessible documentary about his life & work.)

25 for 2020

An obvious follow on from my last post is to see what my top 25 albums of the year are. In the past I've tried to mentally travel over the releases of the past year to try to cook up a list. But this year I thought it would be fun to use the LastFM API to look at my music listening history for 2020, and let the data do the talking as it were.

The first problem is that while LastFM is a good source of my listening history, its metadata for albums seems quite sparse. The LastFM album.getInfo API call doesn't seem to return the year the album was published. The LastFM docs indicate that a releasedate property is available, but I couldn't seem to find it either in the XML or JSON responses. Maybe it was there once and now is gone? Maybe there's some trick I was overlooking with the API? Who knows.

So to get around this I used LastFM to get my listening history, but then the Discogs API to fetch metadata for a specific album using their search endpoint. LastFM includes MusicBrainz identifiers for tracks and most artists and albums. So I could have used those to look up the album using the MusicBrainz API. But I wasn't sure if I would find good release dates there either, as their focus seems to be on recognizing tracks, and linking them to albums and artists. Discogs is a superb human curated database, like a Wikipedia for music aficionados. Their API returns a good amount of information for each album, for example:

{
  "country": "US",
  "year": "1983",
  "format": ["Vinyl", "LP", "Album"],
  "label": ["I.R.S. Records", "I.R.S. Records", "I.R.S. Records", "A&M Records, Inc.", "A&M Records, Inc.", "I.R.S., Inc.", "I.R.S., Inc.", "Electrosound Group Midwest, Inc.", "Night Garden Music", "Unichappell Music, Inc.", "Reflection Sound Studios", "Sterling Sound"],
  "type": "master",
  "genre": ["Rock"],
  "style": ["Indie Rock"],
  "id": 14515,
  "barcode": ["SP-070604-A", "SP-070604-B", "SP0 70604 A ES1 EMW", "SP0 70604-B-ES1 EMW", "SP0 70604-B-ES2 EMW", "STERLING", "(B)", "BMI"],
  "user_data": {"in_wantlist": false, "in_collection": false},
  "master_id": 14515,
  "master_url": "https://api.discogs.com/masters/14515",
  "uri": "/REM-Murmur/master/14515",
  "catno": "SP 70604",
  "title": "R.E.M. - Murmur",
  "thumb": "https://discogs-images.imgix.net/R-414122-1459975774-1411.jpeg?auto=compress&blur=0&fit=max&fm=jpg&h=150&q=40&w=150&s=52b867c541b102b5c8bcf5accae025e0",
  "cover_image": "https://discogs-images.imgix.net/R-414122-1459975774-1411.jpeg?auto=compress&blur=0&fit=max&fm=jpg&h=600&q=90&w=600&s=0e227f30b3981fd2b0fb20fb4362df92",
  "resource_url": "https://api.discogs.com/masters/14515",
  "community": {"want": 17287, "have": 26133}
}

So I created a small function that looks up an artist/album combination using the Discogs search API. I applied the function to the Pandas DataFrame of my listening history, which was grouped by artist and album. When I ran this across the 1,312 distinct albums I listened to in 2020 I actually ran into a handful of albums (86) that didn't turn up at Discogs. I had actually listened to some of these albums quite often, and wanted to see if they were from 2020. I figured that these probably were obscure things I picked up on Bandcamp. Knowing the provenance of data is important.

Bandcamp is another wonderful site for music lovers. It has an API too, but you have to write to them to request a key, because it's mostly designed for publishers that need to integrate their music catalogs with Bandcamp. I figured this little experiment wouldn't qualify so I wrote a quick little scraping function that does a search, finds a match, and extracts the release date from the album's page on the Bandcamp website. This left just four things that I listened to just a handful of times, which have since disappeared from Bandcamp (I think).

What I thought would be an easy little exercise with the LastFM API actually turned out to require me to talk to the Discogs API, and then scrape the Bandcamp website. So it goes with data analysis I suppose. If you want to see the details they are in this Jupyter notebook.
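That Discogs lookup function was only a few lines. Here's a rough sketch of what it might have looked like (this is not the notebook's code; the endpoint and parameter names follow the Discogs database search documentation, and the token is a placeholder you'd get from your Discogs account):

```python
import requests

DISCOGS_API = "https://api.discogs.com/database/search"
DISCOGS_TOKEN = "..."  # hypothetical placeholder: a personal access token


def release_year(artist, album):
    """Return the year Discogs reports for an artist/album, or None if not found."""
    resp = requests.get(
        DISCOGS_API,
        params={
            "artist": artist,
            "release_title": album,
            "type": "master",
            "token": DISCOGS_TOKEN,
        },
        headers={"User-Agent": "year-in-music-example/0.1"},
    )
    resp.raise_for_status()
    results = resp.json().get("results", [])
    return results[0].get("year") if results else None


print(release_year("R.E.M.", "Murmur"))  # "1983" in the example response above
```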
And so, without further ado, here are my top 25 albums of 2020.

25 Perfume Genius / Set My Heart On Fire Immediately
24 Roger Eno / Mixing Colours
23 Blochemy / nebe
22 Idra / Lone Voyagers, Lovers and Lands
21 Rutger Zuydervelt and Bill Seaman / Movements of Dust
20 Purl / Renovatio
19 mute forest / Riderstorm
18 Michael Grigoni & Stephen Vitiello / Slow Machines
17 Seabuckthorn / Other Other
16 Windy & Carl / Unreleased Home Recordings 1992-1995
15 Mathieu Karsenti / Bygones
14 Rafael Anton Irisarri / Peripeteia
13 Mikael Lind / Give Shape to Space
12 Taylor Swift / folklore (deluxe version)
11 koji itoyama / I Know
10 Andrew Weathers / Dreams and Visions from the Llano Estacado
9 Jim Guthrie / Below OST - Volume III
8 Norken & Nyquist / Synchronized Minds
7 Jim Guthrie / Below OST - Volume II
6 Halftribe / Archipelago
5 Hazel English / Wake Up!
4 R Beny / natural fiction
3 Warmth / Life
2 David Newlyn / Apparitions I and II
1 Seabuckthorn / Through A Vulnerable Occur

Diss Music

I recently defended my dissertation, and am planning to write a short post here with a synopsis of what I studied. But before that I wanted to do a bit of navel gazing and examine the music of my dissertation. To be clear, my dissertation has no music. It's one part discourse analysis, two parts ethnographic field study, and is comprised entirely of text and images bundled into a PDF. But over the last 5 years, as I took classes, wrote papers, conducted research and did the final write up of my research results, I was almost always listening to music.

I spent a lot of time on weekends in the tranquil workspaces of the Silver Spring Public Library. After the Coronavirus hit earlier this year I spent more time surrounded by piles of books in my impromptu office in the basement of my house. But wherever I found myself working, music was almost always on. I leaned heavily on Bandcamp over this time period, listening to and then purchasing music I enjoyed. Bandcamp is a truly remarkable platform for learning about new music from people whose tastes align with yours. My listening habits definitely trended over this time towards instrumental, experimental, found sound and ambient, partly because lyrics can distract me if I'm writing or reading.

I'm also a long time LastFM user, so all the music I listened to over this period was logged (or "scrobbled"). LastFM have an API, so I thought it would be fun to create a little report of the top albums I listened to each month of my dissertation. So this is the music of my dissertation, or the hidden soundtrack of my research, between August 2015 and November 2020. You can see how I obtained the information from the API in this Jupyter notebook. But the results are here below.
2015-08 White Rainbow / Thru.u
2015-09 Deradoorian / The Expanding Flower Planet
2015-10 James Elkington and Nathan Salsburg / Ambsace
2015-11 Moderat / II
2015-12 Deerhunter / Fading Frontier
2016-01 David Bowie / Blackstar
2016-02 Library Tapes / Escapism
2016-03 Twincities / …plays the brown mountain lights
2016-04 Moderat / III
2016-05 Radiohead / A Moon Shaped Pool
2016-06 Tigue / Peaks
2016-07 A Winged Victory for the Sullen / A Winged Victory for the Sullen
2016-08 Oneohtrix Point Never / Garden of Delete
2016-09 Oneohtrix Point Never / Drawn and Quartered
2016-10 Chihei Hatakeyama / Saunter
2016-11 Biosphere / Departed Glories
2016-12 Sarah Davachi / The Untuning of the Sky
2017-01 OFFTHESKY / The Beautiful Nowhere
2017-02 Clark / The Last Panthers
2017-03 Tim Hecker / Harmony In Ultraviolet
2017-04 Goldmund / Sometimes
2017-05 Deerhunter / Halcyon Digest
2017-06 Radiohead / OK Computer OKNOTOK 1997 2017
2017-07 Arcade Fire / Everything Now
2017-08 oh sees / Orc
2017-09 Lusine / Sensorimotor
2017-10 Four Tet / New Energy
2017-11 James Murray / Eyes to the Height
2017-12 Jlin / Black Origami
2018-01 Colleen / Captain of None (Bonus Track Version)
2018-02 Gersey / What You Kill
2018-03 Rhucle / Yellow Beach
2018-04 Christina Vantzou / No. 3
2018-05 Hotel Neon / Context
2018-06 Brendon Anderegg / June
2018-07 A Winged Victory for the Sullen / Atomos
2018-08 Ezekiel Honig / A Passage of Concrete
2018-09 Paperbark / Last Night
2018-10 Flying Lotus / You're Dead! (Deluxe Edition)
2018-11 Porya Hatami / Kaziwa
2018-12 Sven Laux / You'll Be Fine.
2019-01 Max Richter / Mary Queen Of Scots (Original Motion Picture Soundtrack)
2019-02 Ian Nyquist / Cuan
2019-03 Jens Pauly / Vihne
2019-04 Ciro Berenguer / El Mar De Junio
2019-05 Rival Consoles / Persona
2019-06 Caught In The Wake Forever / Waypoints
2019-07 Spheruleus / Light Through Open Blinds
2019-08 Valotihkuu / By The River
2019-09 Moss Covered Technology / Slow Walking
2019-10 Tsone / pagan oceans I, II, III
2019-11 Big Thief / Two Hands
2019-12 A Winged Victory for the Sullen / The Undivided Five
2020-01 Hirotaka Shirotsubaki / fragment 2011-2017
2020-02 Luis Miehlich / Timecuts
2020-03 Federico Durand / Jardín de invierno
2020-04 R.E.M. / Document - 25th Anniversary Edition
2020-05 Chicano Batman / Invisible People
2020-06 Hazel English / Wake Up!
2020-07 Josh Alexander / Hiraeth
2020-08 The Beatles / The Beatles (Remastered)
2020-09 Radiohead / OK Computer OKNOTOK 1997 2017
2020-10 Mathieu Karsenti / Bygones
2020-11 R.E.M. / Murmur - Deluxe Edition

25 Years of robots.txt

After just over 25 years of use the Robots Exclusion Standard, otherwise known as robots.txt, is being standardized at the IETF. This isn't really news, as the group at Google that is working on it announced the work over a year ago. The effort continues apace, with the latest draft having been submitted back in the middle of pandemic summer. But it is notable, I think, because of the length of gestation time this particular standard took. It made me briefly think about what it would be like if standards always worked this way, by documenting established practices, desire lines if you will, rather than being quiet ways to shape markets (Russell, 2014). But then again maybe that hands off approach is fraught in other ways.
Standardization processes offer the opportunity for consensus, and a framework for gathering input from multiple parties. It seems like a good time to write down some tricks of the robots.txt trade (e.g. the stop reading after 500kb rule, which I didn't know about). What would Google look like today if it wasn't for some of the early conventions that developed around web crawling? Would early search engines have existed at all if a convention for telling them what to crawl and what not to crawl didn't come into existence? Even though it has been in use for 25 years it will be important to watch the diffs with the existing de-facto standard, to see what new functionality gets added and what (if anything) is removed.

I also wonder if this might be an opportunity for the digital preservation community to grapple with documenting some of its own practices around robots.txt. Much web archiving crawling software has options for observing robots.txt, or explicitly ignoring it. There are clearly legitimate reasons for a crawler to ignore robots.txt, as in cases where CSS files or images are accidentally blocked by a robots.txt, which prevents the rendering of an otherwise unblocked page. I think ethical arguments can also be made for ignoring an exclusion. But ethics are best decided by people not machines, even though some think the behavior of crawling bots can be measured and evaluated (Giles, Sun, & Councill, 2010; Thelwall & Stuart, 2006).

Web archives use robots.txt in another significant way too. Ever since the Oakland Archive Policy the web archiving community has used the robots.txt in playback of archived data. Software like the Wayback Machine has basically become the reading room of the archived web. The Oakland Archive Policy made it possible for website owners to tell web archives about content on their site that they would like the web archive not to "play back", even if they had the content. Here is what they said back then:

Archivists should provide a 'self-service' approach site owners can use to remove their materials based on the use of the robots.txt standard. Requesters may be asked to substantiate their claim of ownership by changing or adding a robots.txt file on their site. This allows archivists to ensure that material will no longer be gathered or made available. These requests will not be made public; however, archivists should retain copies of all removal requests.

This convention allows web publishers to use their robots.txt to tell the Internet Archive (and potentially other web archives) not to provide access to archived content from their website. It also is not really news at all. The Internet Archive's Mark Graham wrote in 2017 about how robots.txt haven't really been working out for them lately, and how they now ignore them for playback of .gov and .mil domains. There was a popular article about this use of robots.txt written by David Bixenspan at Gizmodo, When the Internet Archive Forgets, and a follow up from David Rosenthal, Selective Amnesia.

Perhaps the collective wisdom now is that the use of robots.txt to control playback in web archives is fundamentally flawed and shouldn't be written down in a standard. But lacking a better way to request that something be removed from the Internet Archive I'm not sure if that is feasible. Some, like Rosenthal, suggest that it's too easy for these take down notices to be issued. Consent on the web is difficult once you are operating at the scale that the Internet Archive does in its crawls. But if there were a time to write it down in a standard I guess that time would be now.
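For crawlers that do choose to observe the convention, the check itself ships with Python's standard library. A minimal sketch (the user agent strings here are just examples):

```python
from urllib import robotparser

# fetch and parse a site's robots.txt, then ask whether a given
# user agent is allowed to fetch a particular URL
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

print(rp.can_fetch("ia_archiver", "https://example.com/some/page.html"))
print(rp.can_fetch("*", "https://example.com/some/page.html"))
```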
References

Giles, C. L., Sun, Y., & Councill, I. G. (2010). Measuring the web crawler ethics. In WWW 2010. Retrieved from https://clgiles.ist.psu.edu/pubs/WWW2010-web-crawler-ethics.pdf

Russell, A. L. (2014). Open standards and the digital age. Cambridge University Press.

Thelwall, M., & Stuart, D. (2006). Web crawling ethics revisited: Cost, privacy, and denial of service. Journal of the American Society for Information Science and Technology. https://doi.org/10.1002/asi.20388

Curation Communities

As I indicated in the last post I've been teaching digital curation this semester at UMD. I ended up structuring the class around the idea of abstraction, where we started at a fairly low level looking at file systems and slowly zoomed out to file formats and standards, types of metadata, platforms and finally community. It was a zooming out process, like changing the magnification on a microscope, or maybe more like the zooming out that happens as you pop between levels in the pages of Istvan Banyai's beautiful little children's book Zoom (pun intended). I'm curious to hear how well this worked from my students' perspective, but it definitely helped me organize my own thoughts about a topic that can branch off in many directions. This is especially the case because I wanted the class to include discussion of digital curation concepts while also providing an opportunity to get some hands-on experience using digital curation techniques and tools in the context of Jupyter notebooks. In addition to zooming out, it was a dialectical approach, flipping between reading and writing prose and reading and writing code, with the goal of reaching a kind of synthesis of understanding that digital curation practice is about both concepts and computation. Hopefully it didn't just make everyone super dizzy :)

This final module concerned community. In our reading and discussion we looked at the FAIR Principles, talked about what types of practices they encourage, and evaluated some data sources in terms of findability, accessibility, interoperability and reusability. For the notebook exercise I decided to have students experiment with the Lumen Database (formerly Chilling Effects), which is a clearinghouse for cease-and-desist notices received by web platforms like Google, Twitter and Wikipedia. The database was created by Wendy Seltzer and a team of legal researchers who wanted to be able to study how copyright law and other legal instruments shaped what was, and was not, on the web.

Examining Lumen helped us explore digital curation communities for two reasons. The first is that it provides an unprecedented look at how web platforms curate their content in partnership with their users. There really is nothing else like it, unless you consider individual efforts like GitHub's DMCA Repository, which is an interesting approach too. The second reason is that Lumen itself is an example of community digital curation practice and principles like FAIR. FAIR began in the scientific community, and certainly has that air about it. But Lumen embodies principles around findability and accessibility: this is information that would be difficult if not impossible to access otherwise. Lumen also shows how some data cannot be readily available: there is redacted content, and some notices lack information like infringing URLs.
Working with Lumen helps students see that not all data can be open, and that the FAIR principles are a starting place for ethical conversations and designs, and not a rulebook to be followed. The Lumen API requires that you get a key for doing any meaningful work (the folks at Berkman-Klein were kind enough to supply me a temporary one for the semester). At any rate, if you are interested in taking a look, the notebook (without the Lumen key) is available on GitHub. I've noticed that sometimes the GitHub JavaScript viewer for notebooks can time out, so if you want you can also take a look at it over in Colab, which is the environment we've been using over the semester. The notebook explores the basics of interacting with the API using the Python requests library, while explaining the core data model that is behind the API, which relates together the principal, the sender, the recipient and the submitter of a claim. It provides just a taste of the highly expressive search options that allow searching, ordering and filtering of results along many dimensions.
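To give a flavor of those first interactions, here is a rough sketch of the kind of request the notebook starts with. This is not the course notebook itself, and the endpoint, parameter and header names are taken from Lumen's public API documentation, so treat them as assumptions; the key is a placeholder.

```python
import requests

LUMEN_API = "https://lumendatabase.org"
API_KEY = "..."  # hypothetical placeholder: the key issued by the Lumen team


def search_notices(term, per_page=5):
    """Search Lumen notices for a term and return the parsed JSON results."""
    resp = requests.get(
        f"{LUMEN_API}/notices/search",
        params={"term": term, "per_page": per_page},
        headers={
            "X-Authentication-Token": API_KEY,
            "User-Agent": "digital-curation-class-example/0.1",
        },
    )
    resp.raise_for_status()
    return resp.json().get("notices", [])


for notice in search_notices("wordpress"):
    print(notice.get("id"), notice.get("title"))
```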
The notebook also provides an opportunity to show students the value of building functional abstractions to help reduce copy and paste, and to develop reusable and testable curation functions.

The goal was to do a module about infrastructure after talking about community. But unfortunately we ran out of time due to the pace of classes during the pandemic. I felt that a lot was being asked of students in the all-online environment and I've really tried over the semester to keep things simple. This last module on community was actually completely optional, but I was surprised when half the class continued to do the work when it was not officially part of their final grade. The final goal of using Lumen this week was to introduce them to a resource that they could write about (essay) or use in a notebook or application that will be their final project. I've spent the semester stressing the need to be able to write both prose and code about digital curation practices, and the final project is an opportunity for them to choose to inflect one of those modes more than the other.

Mystery File!

We started the semester in my Digital Curation class by engaging in a little exercise I called Mystery File. The exercise was ungraded and was designed to simply get the students thinking about some of the issues we would be exploring over the semester, such as files, file formats, metadata, description, communities of practice and infrastructure. The exercise also gave me an opportunity to introduce them to some of the tools and skills we would be using, such as developing and documenting our work in Jupyter notebooks. The students had a lot of fun with it, and it was really helpful for me to see the variety of knowledge and skills they brought to the problem.

The mystery file turned out to be a bundle of genetic data and metadata from the public National Center for Biotechnology Information, a few minutes' drive from UMD at the National Institutes of Health in Bethesda. If the students were able to notice that this file was a tar file, they could expand it and explore the directories and subdirectories. They could notice that some files were compressed, and examine some of them to notice that they contained metadata and a genetic sequence. Once they had submitted their answers I shared a video with them (the class is asynchronous except for in person office hours) where I answered these questions myself in a Jupyter notebook running in Google Colab.

I shared the completed notebook with them to try on their own. It was a good opportunity to reacquaint students with notebooks, since they were introduced to them in an Introduction to Programming class that is a pre-requisite. But I wanted to show how notebooks were useful for documenting their work, and especially useful in digital curation activities which are often ad-hoc, but include some repeatable steps. The bundle of data includes a manifest with hashes for fixity checking to ensure a bit hasn't flipped, which anticipated our discussion of technical metadata later in the semester. I thought it was a good example of how a particular community is making data available, and how the NCBI and its services form a piece of critical infrastructure for the medical community. I also wanted to highlight how the data came from a Chinese team, despite the efforts of the Chinese government to suppress the information. This was science, the scientific community, and information infrastructures working despite (or in spite of) various types of social and political breakdowns.

But I actually didn't start this post wanting to write about all that, but rather to comment on a recent story I read about the origins of this data. It gave me so much hope and reason to celebrate data curation practices to read Zeynep Tufekci's The Pandemic Heroes Who Gave us the Gift of Time and Gift of Information this afternoon. She describes how brave Yong-Zhen Zhang and his team in China were in doing their science, and releasing the information in a timely way to the world. If you look closely you can see Zhang's name highlighted in the pictured metadata record above. It is simply astonishing to read how Zhang set the scientific machinery in motion which created a vaccine all the way back in January, just days after the virus was discovered and sequenced. Sending my students this piece from Zeynep here at the end of the semester gives me such pleasure, and is the perfect way to round out the semester as we talk about communities and infrastructure.

(P.S. I'm planning on bundling up the discussion and notebook exercises once the semester is finished in case it is useful for others to adapt.)

Kettle

kettle boiling
kitchen table
two windows
daylight reaching leaves
kettle boiling
again