Planet Code4Lib
http://planet.code4lib.org
David Rosenthal: Talk At Berkeley's Information Access Seminar
https://blog.dshr.org/2021/02/talk-at-berkeleys-information-access.html
Once again Cliff Lynch invited me to give a talk to the Information Access Seminar at UC Berkeley's iSchool. Preparation time was limited because these days I'm a full-time grandparent so the talk, entitled Securing The Digital Supply Chain summarizes and updates two long posts from two years ago:
The Internet is suffering an epidemic of supply chain attacks, in which a trusted supplier of content is compromised and delivers malware to some or all of their clients. The recent SolarWinds compromise is just one glaring example. This talk reviews efforts to defend digital supply chains.

Below the fold, the text of the talk with links to the sources.
A green padlock (with or without an organization name) indicates that:
- You are definitely connected to the website whose address is shown in the address bar; the connection has not been intercepted.
- The connection between Firefox and the website is encrypted to prevent eavesdropping.
NB - this is misleading!
Dan Goodin One-stop counterfeit certificate shops for all your malware-signing needs
In one case, a prominent Dutch CA (DigiNotar) was compromised and the hackers were able to use the CA’s system to issue fake SSL certificates. The certificates were used to impersonate numerous sites in Iran, such as Gmail and Facebook, which enabled the operators of the fake sites to spy on unsuspecting site users. ... More recently, a large U.S.-based CA (TrustWave) admitted that it issued subordinate root certificates to one of its customers so the customer could monitor traffic on their internal network. Subordinate root certificates can be used to create SSL certificates for nearly any domain on the Internet. Although Trustwave has revoked the certificate and stated that it will no longer issue subordinate root certificates to customers, it illustrates just how easy it is for CAs to make missteps and just how severe the consequences of those missteps might be.

In 2018 Sennheiser provided another example:
The issue with the two HeadSetup apps came to light earlier this year when German cyber-security firm Secorvo found that versions 7.3, 7.4, and 8.0 installed two root Certification Authority (CA) certificates into the Windows Trusted Root Certificate Store of users' computers but also included the private keys for all in the SennComCCKey.pem file.

Certificates depend on public-key cryptography, which splits keys into public/private key pairs. Private keys can decrypt text encrypted by the public key, and vice versa. The security of the system depends upon private keys being kept secret. This poses two problems:
In a report published today, Secorvo researchers published proof-of-concept code showing how trivial it would be for an attacker to analyze the installers for both apps and extract the private keys.

Cimpanu also reports on a more recent case:
Making matters worse, the certificates are also installed for Mac users, via HeadSetup macOS app versions, and they aren't removed from the operating system's Trusted Root Certificate Store during current HeadSetup updates or uninstall operations.
...
Sennheiser's snafu ... is not the first of its kind. In 2015, Lenovo shipped laptops with a certificate that exposed its private key in a scandal that became known as Superfish. Dell did the exact same thing in 2016 in a similarly bad security incident that became known as eDellRoot.
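The key property described above, that private keys can decrypt what the public key encrypts and vice versa, can be illustrated with textbook RSA at toy scale. This sketch uses the classic worked-example parameters (p=61, q=53); it is catastrophically insecure at this size and purely illustrative, but it shows why a leaked private key, as in the Sennheiser, Superfish and eDellRoot incidents, breaks everything:

```python
# Toy textbook RSA -- illustrative only, never use key sizes like this.
p, q = 61, 53
n = p * q                          # public modulus (3233)
e = 17                             # public exponent
d = pow(e, -1, (p - 1) * (q - 1))  # private exponent (2753)

msg = 65

# Encrypt with the public key, decrypt with the private key:
assert pow(pow(msg, e, n), d, n) == msg

# ... and vice versa: "sign" with the private key, verify with the public.
# Anyone holding the leaked private key d can forge such signatures.
sig = pow(msg, d, n)
assert pow(sig, e, n) == msg
```

Shipping `d` inside an installer, as in the SennComCCKey.pem case, hands every user of the software the ability to mint trusted certificates.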
Under the guise of a "cybersecurity exercise," the Kazakhstan government is forcing citizens in its capital of Nur-Sultan (formerly Astana) to install a digital certificate on their devices if they want to access foreign internet services.

This type of “mistake” allows attackers to impersonate any Web site to affected devices.
Once installed, the certificate would allow the government to intercept all HTTPS traffic made from users' devices via a technique called MitM (Man-in-the-Middle).
The reasons for this weakest-link behavior are:
- Information asymmetry prevents buyers from knowing what CAs are really doing. Buyers are paying for the perception of security, a liability shield, and trust signals to third parties. None of these correlates verifiably with actual security. Given that CA security is largely unobservable, buyers’ demands for security do not necessarily translate into strong security incentives for CAs.
- Negative externalities of the weakest-link security of the system exacerbate these incentive problems. The failure of a single CA impacts the whole ecosystem, not just that CA’s customers. All other things being equal, these interdependencies undermine the incentives of CAs to invest, as the security of their customers depends on the efforts of all other CAs.
A crucial technical property of the HTTPS authentication model is that any CA can sign certificates for any domain name. In other words, literally anyone can request a certificate for a Google domain at any CA anywhere in the world, even when Google itself has contracted one particular CA to sign its certificate.

This "technical property" is actually important: it is what enables a competitive market of CAs. Symantec in particular has exploited it wholesale:
Google's investigation revealed that over a span of years, Symantec CAs have improperly issued more than 30,000 certificates. ... They are a major violation of the so-called baseline requirements that major browser makers impose on CAs as a condition of being trusted by major browsers.

But Symantec has suffered no effective sanctions because they are too big to fail:
Symantec's repeated violations underscore one of the problems Google and others have in enforcing terms of the baseline requirements. When violations are carried out by issuers with a big enough market share they're considered too big to fail. If Google were to nullify all of the Symantec-issued certificates overnight, it might cause widespread outages.

My Firefox still trusts Symantec root certificates. Because Google, Mozilla and others prioritize keeping the Web working over keeping it secure, deleting misbehaving big CAs from trust lists won't happen. When Mozilla writes:
You are definitely connected to the website whose address is shown in the address bar; the connection has not been intercepted.

they are assuming a world of honest CAs that isn't this world. If you have the locked padlock icon in your URL bar, you are probably talking to the right Web site, but there is a chance you aren't.
Brian Krebs Half of all Phishing Sites Now Have the Padlock
Certificate Transparency
In order to improve the security of Extended Validation (EV) certificates, Google Chrome requires Certificate Transparency (CT) compliance for all EV certificates issued after 1 Jan 2015.

Clients now need two lists of trusted third parties, the CAs and the sources of CT attestations. The need for these trusted third parties is where the blockchain enthusiasts would jump in and claim (falsely) that using a blockchain would eliminate the need for trust. But CT has a much more sophisticated approach, Ronald Reagan's "Trust, but Verify". In the real world it isn't feasible to solve the problem of untrustworthy CAs by eliminating the need for trust. CT's approach instead is to provide a mechanism by which breaches of trust, both by the CAs and by the attestors, can be rapidly and unambiguously detected.
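The verifiability CT relies on comes from its append-only logs, which are built as Merkle hash trees (RFC 6962): a certificate's presence in the log can be proven against a published tree head with a logarithmic-size inclusion proof. The sketch below is a toy version of that check; it follows the RFC's leaf/node domain separation but simplifies the handling of odd-sized levels, which the real structure treats differently:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def leaf(data: bytes) -> bytes:
    # RFC 6962 prefixes leaves (0x00) and interior nodes (0x01)
    # differently to prevent cross-level second-preimage attacks.
    return h(b"\x00" + data)

def node(left: bytes, right: bytes) -> bytes:
    return h(b"\x01" + left + right)

def root(leaves):
    level = [leaf(x) for x in leaves]
    while len(level) > 1:
        if len(level) % 2:            # toy simplification: duplicate the
            level.append(level[-1])   # last node on odd-sized levels
        level = [node(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def verify_inclusion(data, proof, tree_root):
    """proof: list of (sibling_hash, sibling_is_left) pairs, leaf to root."""
    cur = leaf(data)
    for sibling, is_left in proof:
        cur = node(sibling, cur) if is_left else node(cur, sibling)
    return cur == tree_root

certs = [b"cert-a", b"cert-b", b"cert-c", b"cert-d"]
r = root(certs)
# Inclusion proof for cert-b: leaf(cert-a) on the left,
# then the parent of (cert-c, cert-d) on the right.
proof = [(leaf(b"cert-a"), True),
         (node(leaf(b"cert-c"), leaf(b"cert-d")), False)]
assert verify_inclusion(b"cert-b", proof, r)
```

A log that tried to quietly drop or alter an entry would produce a tree head inconsistent with proofs it had already handed out, which is how misbehavior becomes unambiguously detectable.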
the Treasury Department, the State Department, the Commerce Department, the Energy Department and parts of the Pentagon

It was not detected by any of the US government's network monitoring systems, but by FireEye, a computer security company that was also a victim. But for a mistake by the attackers at FireEye it would still be undetected. It was an extremely sophisticated attack, which has rightfully gained a lot of attention.
The malicious code was inserted in two stages into event-stream, a code library with 2 million downloads that's used by Fortune 500 companies and small startups alike. Stage one, version 3.3.6, published on September 8, included a benign module known as flatmap-stream. Stage two was implemented on October 5 when flatmap-stream was updated to include malicious code that attempted to steal bitcoin wallets and transfer their balances to a server located in Kuala Lumpur.

How were the attackers able to do this? Goodin explains:
According to the Github discussion that exposed the backdoor, the longtime event-stream developer no longer had time to provide updates. So several months ago, he accepted the help of an unknown developer. The new developer took care to keep the backdoor from being discovered. Besides being gradually implemented in stages, it also narrowly targeted only the Copay wallet app. The malicious code was also hard to spot because the flatmap-stream module was encrypted.

All that was needed to implement this type A attack was e-mail and GitHub accounts, and some social engineering.
In a nutshell, the attack works this way: on launch, Nox.exe sends a request to a programming interface to query update information. The BigNox API server responds with update information that includes a URL where the legitimate update is supposed to be available. Eset speculates that the legitimate update may have been replaced with malware or, alternatively, a new filename or URL was introduced.

[Slide 11]
Malware is then installed on the target’s machine. The malicious files aren’t digitally signed the way legitimate updates are. That suggests the BigNox software build system isn’t compromised; only the systems for delivering updates are. The malware performs limited reconnaissance on the targeted computer. The attackers further tailor the malicious updates to specific targets of interest.
Each Cobalt Strike DLL implant was prepared to be unique per machine and avoided at any cost overlap and reuse of folder name, file name, export function names, C2 domain/IP, HTTP requests, timestamp, file metadata, config, and child process launched. This extreme level of variance was also applied to non-executable entities, such as WMI persistence filter name, WMI filter query, passwords used for 7-zip archives, and names of output log files.
Way back in 1974, Paul Karger and Roger Schell discovered a devastating attack against computer systems. Ken Thompson described it in his classic 1984 speech, "Reflections on Trusting Trust." Basically, an attacker changes a compiler binary to produce malicious versions of some programs, INCLUDING ITSELF. Once this is done, the attack perpetuates, essentially undetectably. Thompson demonstrated the attack in a devastating way: he subverted a compiler of an experimental victim, allowing Thompson to log in as root without using a password. The victim never noticed the attack, even when they disassembled the binaries -- the compiler rigged the disassembler, too.
It is common for a program to depend, directly or indirectly, on thousands of packages and libraries. For example, Kubernetes now depends on about 1,000 packages. Open source likely makes more use of dependencies than closed source, and from a wider range of suppliers; the number of distinct entities that need to be trusted can be very high. This makes it extremely difficult to understand how open source is used in products and what vulnerabilities might be relevant. There is also no assurance that what is built matches the source code.

The bulk of their post addresses improvements to the quality of the development process, with three goals:
Taking a step back, although supply-chain attacks are a risk, the vast majority of vulnerabilities are mundane and unintentional—honest errors made by well-intentioned developers. Furthermore, bad actors are more likely to exploit known vulnerabilities than to find their own: it’s just easier. As such, we must focus on making fundamental changes to address the majority of vulnerabilities, as doing so will move the entire industry far along in addressing the complex cases as well, including supply-chain attacks.
This is a big task, and currently unrealistic for the majority of open source. Part of the beauty of open source is its lack of constraints on the process, which encourages a wide range of contributors. However, that flexibility can hinder security considerations. We want contributors, but we cannot expect everyone to be equally focused on security. Instead, we must identify critical packages and protect them. Such critical packages must be held to a range of higher development standards, even though that might add developer friction.

[Slide 16]
In 2019, Dominique Luster gave a super good Code4Lib talk about applying AI to metadata for the Charles “Teenie” Harris collection at the Carnegie Museum of Art — more than 70,000 photographs of Black life in Pittsburgh. They experimented with solutions to various metadata problems, but the one that’s stuck in my head since 2019 is the face recognition one. It sure would be cool if you could throw AI at your digitized archival photos to find all the instances of the same person, right? Or automatically label them, given that any of them are labeled correctly?
Sadly, because we cannot have nice things, the data sets used for pretrained face recognition embeddings are things like lots of modern photos of celebrities, a corpus which wildly underrepresents 1) archival photos and 2) Black people. So the results of the face recognition process are not all that great.
I have some extremely technical ideas for how to improve this — ideas which, weirdly, some computer science PhDs I’ve spoken with haven’t seen in the field. So I would like to experiment with them. But I must first invent the universe, er, set up a data processing pipeline.
Three steps here:
1. Fetch digitized archival photographs (via the DPLA API).
2. Detect the faces in each photograph.
3. Generate embeddings for the detected faces and compare them (face recognition).
For step 1, I’m using DPLA, which has a super straightforward and well-documented API and an easy-to-use Python wrapper (which, despite not having been updated in a while, works just fine with Python 3.6, the latest version compatible with some of my dependencies).
For step 2, I’m using mtcnn, because I’ve been following this tutorial.
For step 3, face recognition, I’m using the steps in the same tutorial, but purely for proof-of-concept — the results are garbage because archival photos from mid-century don’t actually look anything like modern-day celebrities. (Neural net: “I have 6% confidence this is Stevie Wonder!” How nice for you.) Clearly I’m going to need to build my own corpus of people, which I have a plan for (i.e. I spent some quality time thinking about numpy) but haven’t yet implemented.
So far the gotchas have been:
Gotcha 1: If you fetch a page from the API and assume you can treat its contents as an image, you will be sad. You have to treat them as a raw data stream and interpret that as an image, thusly:
import io

import requests
from PIL import Image

response = requests.get(url, stream=True)
response.raw.decode_content = True
image = Image.open(io.BytesIO(response.content))
This code is, of course, hilariously lacking in error handling, despite fetching content from a cesspool of untrustworthiness, aka the internet. It’s a first draft.
Gotcha 2: You see code snippets to convert images to pixel arrays (suitable for AI ingestion) that look kinda like this: np.array(image).astype('uint8'). Except they say astype('float32') instead of astype('uint8'). I got a creepy photonegative effect when I used floats.
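The dtype half of this gotcha can be reproduced without any image files. In the numpy-only sketch below (my own example data), the two arrays hold identical values; what differs is the dtype, and a plausible explanation for the weird rendering is that many image tools interpret float arrays as spanning 0.0-1.0 rather than 0-255:

```python
import numpy as np

# A tiny grayscale "image" with pixel values in 0..255.
pixels = [[0, 64], [128, 255]]

as_uint8 = np.array(pixels).astype('uint8')    # what display code expects
as_float = np.array(pixels).astype('float32')  # same numbers, wrong dtype

# Values are equal element-for-element; only the dtype differs. Tools
# that assume floats lie in [0.0, 1.0] will clip or misrender these.
assert as_uint8.dtype == np.uint8
assert as_float.dtype == np.float32
assert (as_uint8 == as_float).all()
```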
Gotcha 3: Although PIL was happy to manipulate the .pngs fetched from the API, it was not happy to write them to disk; I needed to convert formats first (image.convert('RGB')).
Gotcha 4: The suggested keras_vggface library doesn’t have a Pipfile or requirements.txt, so I had to manually install keras and tensorflow. Luckily the setup.py documented the correct versions. Sadly the tensorflow version is only compatible with python up to 3.6 (hence the comment about DPyLA compatibility above). I don’t love this, but it got me up and running, and it seems like an easy enough part of the pipeline to rip out and replace if it’s bugging me too much.
The plan from here, not entirely in order, subject to change as I don’t entirely know what I’m doing until after I’ve done it:
Last Friday I was interviewed for the podcast The Grasscast — a game-themed podcast named after the book, The Grasshopper: Games, Life, and Utopia. I ramble a little bit in the episode as I tried to be more open and conversational than concise and correct. But I also spoke that way because for some of the questions, no pat answer came immediately to mind.
There was one question that stumped me, but in trying to answer it, I think I found something I had not considered before. The question was: what is one bad thing about games? I tried to convey that, unlike video games, where you can play with strangers, most tabletop games are generally constrained by the preferences of your social circles. To convince others to spend time on a game that they might think is too complicated or not for them, you need to be a successful evangelist.
Also the episode drifts into chatter about libraries, copyright and ebooks.
This week, I reviewed and published another batch of works for our institutional repository from our department of History that was prepared by our library assistants at Leddy. At this point, we have reviewed and uploaded the works of half the faculty from this department. I’m hoping to finish the rest this month, but I think I have some outstanding H5P work that might push the end of this project until March.
This morning I assisted with an online workshop called Data Analysis and Visualization in R for Ecologists that was led by a colleague of mine.
R Version 4.0.3 (“Bunny-Wunnies Freak Out”) was released on 2020-10-10.
The release of R 4.0.4 (“Lost Library Book”) is scheduled for Monday 2021-02-15.
On Sunday, I published a short response to “Windsor Works – An Economic Development Strategy” which is going to City Council on Monday.
Why am I writing about this document here?
I mention this here because the proposed strategy (L.I.F.T.) lists the following as a potential metric for measuring the strategy’s success…
Take it from me, someone who knows quite a bit about citations — the city should use another metric — perhaps one pertaining to local unemployment levels instead.
A viral post from 2019 resurfaced on my FB feed this week and unlike most of the posts I read there, this one did spark joy:
And it struck me how much I loved that the anti-prom was being held at the library.
So I started doing some research!
It appears to me that some anti-proms are technically better described as alternative proms. These proms have been established as an explicitly safe place where LGBTQ young people can enjoy prom. Other anti-proms are true morps.
I now wonder what other anti-traditions should find a home at the public library.
2021-02-05T19:29:23+00:00 Mita Williams
David Rosenthal: Chromebook Linux Update
https://blog.dshr.org/2021/02/chromebook-linux-update.html
My three Acer C720 Chromebooks running Linux are still giving yeoman service, although for obvious reasons I'm not travelling these days. But it is time for an update to 2017's Travels with a Chromebook. Below the fold, an account of some adventures in sysadmin.

sudo apt-get install linux-headers-5.0.0-32 linux-headers-5.0.0-32-generic linux-image-5.0.0-32-generic linux-modules-5.0.0-32-generic linux-modules-extra-5.0.0-32-generic
GRUB_TIMEOUT_STYLE=menu
GRUB_TIMEOUT=15
GRUB_SAVEDEFAULT=true
GRUB_DEFAULT=saved
The move to systemd caused a regression in ecryptfs, which is responsible for mounting/unmounting encrypted home directories when you login and logout. Because of this issue, please be aware that in Mint 20 and newer releases, your encrypted home directory is no longer unmounted on logout: https://bugs.launchpad.net/ubuntu/+source/gnome-session/+bug/1734541.

Mint 19 with full-disk encryption had this problem but I haven't been able to reproduce it with Mint 20 and the 5.0.0-32 kernel. Home directory encryption works, but will leave its contents decrypted after you log out, rather spoiling the point.
Meg Foulkes discusses public impact algorithms and why they matter.
This is what Robert Williams said to police when he was presented with the evidence upon which he was arrested for stealing watches in June 2020. Williams had been identified by an algorithm when Detroit Police ran grainy security footage from the theft through a facial recognition system. He was arrested before being questioned or checked for any alibi. It was not until the matter came to trial that Detroit Police admitted that he had been falsely, and solely, charged on the output of an algorithm.
It’s correct to say that in many cases, when AI and algorithms go wrong, the impact is pretty innocuous – like when a music streaming service recommends music you don’t like. But often, AI and algorithms go wrong in ways that cause serious harm, as in the case of Robert Williams. Although he had done absolutely nothing wrong, he was deprived of a fundamental right on the basis of a computer output: his liberty.
It’s not just on an individual scale that these harms are felt. Algorithms are written by humans, so they can reflect human biases. What algorithms can do is amplify this prejudice at massive scale, by automatically entrenching the bias.
The bias isn’t exclusively racialised; last year, an algorithm used to determine exam grades disproportionately downgraded disadvantaged students. Throughout the pandemic, universities have been turning to remote proctoring software that falsely identifies students with disabilities as cheats. For example, those who practice self-stimulatory behaviour or ‘stimming’ may get algorithmically flagged again and again for suspicious behaviour, or have to disclose sensitive medical information to avoid this.
We identify these types of algorithms as ‘public impact algorithms’ to clearly name the intended target of our concern. There is a big difference between the harm caused by inaccurate music suggestions and algorithms that have the potential to deprive us of our fundamental rights. To call out these harms, we have to precisely define the problem. Only then can we hold the deployers of public impact algorithms to account, and ultimately to achieve our mission of ensuring public impact algorithms do no harm.
2021-02-04T11:08:37+00:00 Meg Foulkes
Ed Summers: Outgoing
https://inkdroid.org/2021/02/04/legislators-social/
Mark Graham posed a question to me earlier today asking what we know about the Twitter accounts of the members of Congress, specifically whether they have been removed after they left office. The hypothesis was that some members of the House and Senate may decide to delete their account on leaving DC.
I was immediately reminded of the excellent congress-legislators project which collects all kinds of information about House and Senate members including their social media accounts into YAML files that are versioned in a GitHub repository. GitHub is a great place to curate a dataset like this because it allows anyone with a GitHub account to contribute to editing the data, and to share utilities to automate checks and modifications.
Unfortunately the file that tracks social media accounts is only for current members. Once they leave office they are removed from the file. The project does track other historical information for legislators. But the social media data isn’t pulled in when this transition happens, or so it seems.
Luckily Git doesn’t forget. Since the project is using a version control system all of the previously known social media links are in the history of the repository! So I wrote a small program that uses gitpython to walk the legislators-social-media.yaml file backwards in time through each commit, parse the YAML at that previous state, and merge that information into a union of all the current and past legislator information. You can see the resulting program and output in us-legislators-social.
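The post's program uses gitpython; the same walk can be sketched with nothing but the stdlib and the git command line (function names here are mine). The idea is identical: enumerate every commit that touched the file, newest first, recover the file's content at each commit, and merge, letting the newest state win:

```python
import subprocess

def file_states(repo_dir: str, path: str):
    """Yield every historical state of `path`, newest commit first."""
    revs = subprocess.check_output(
        ["git", "-C", repo_dir, "log", "--format=%H", "--", path],
        text=True).split()
    for rev in revs:
        yield subprocess.check_output(
            ["git", "-C", repo_dir, "show", f"{rev}:{path}"], text=True)

def merge_history(repo_dir: str, path: str):
    """Union of all lines ever present in the file, newest state first."""
    merged = {}
    for state in file_states(repo_dir, path):
        for line in state.splitlines():
            merged.setdefault(line, None)  # first (newest) occurrence wins
    return list(merged)
```

In the real program each state would be parsed as YAML and merged per-legislator rather than per-line, but the walk-backwards-and-union shape is the same.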
There’s a little bit of a wrinkle in that not everything in the version history should be carried forward because errors were corrected and bugs were fixed. Without digging into the diffs and analyzing them more it’s hard to say whether a commit was a bug fix or if it was simply adding new or deleting old information. If the YAML doesn’t parse at a particular state that’s easy to ignore.
It also looks like the maintainers split out account ids from account usernames at one point. Derek Willis helpfully pointed out to me that Twitter don’t care about the capitalization of usernames in URLs, so these needed to be normalized when merging the data. The same is true of Facebook, Instagram and YouTube. I guarded against these cases but if you notice other problems let me know.
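The normalization step can be sketched minimally (the helper name is mine): case-fold the profile URL before using it as a merge key, since the platforms above treat username capitalization as cosmetic:

```python
def normalize(url: str) -> str:
    """Case-fold a profile URL so variants of the same account compare
    equal, e.g. .../RepJoeKennedy and .../repjoekennedy."""
    return url.rstrip("/").lower()

# Deduplicate a mix of capitalizations of the same account.
seen = {}
for url in ["https://twitter.com/RepJoeKennedy",
            "https://twitter.com/repjoekennedy/"]:
    seen.setdefault(normalize(url), url)

assert len(seen) == 1
```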
With the resulting merged historical data it’s not too hard to write a program to read in the data, identify the politicians who left office after the 116th Congress, and examine their Twitter accounts to see that they are live. It turned out to be a little bit harder than I expected because it’s not as easy as you might think to check if a Twitter account is live or not.
Twitter’s web servers return an HTTP 200 OK message even when responding to requests for URLs of non-existent accounts. To complicate things further, the error message indicating that the account doesn’t exist only appears when the page is rendered in a browser. So a simple web scraping job that looks at the HTML is not sufficient.
And finally just because a Twitter username no longer seems to work, it’s possible that the user has changed it to a new screen_name. Fortunately the unitedstates project also tracks the Twitter User ID (sometimes). If the user account is still there you can use the Twitter API to look up their current screen_name and see if it is different.
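Putting these checks together, the classification can be sketched as a pure function (the function name and status labels are my own; the real determination still needs the API lookup of the stored user ID described above to supply `current_name`):

```python
def account_status(url_ok: bool, old_name: str, current_name):
    """Classify a legislator's Twitter account.

    current_name is the screen_name the API returns for the stored
    user ID, or None if that ID no longer resolves."""
    if current_name is None:
        return "deleted"                 # the user ID itself is gone
    if current_name.lower() != old_name.lower():
        return "renamed"                 # e.g. repjoekennedy -> joekennedy
    # Same name and the ID resolves; treat a failed scrape as noise.
    return "live"

assert account_status(True, "senalexander", "SenAlexander") == "live"
assert account_status(False, "repdennyheck", "LtGovDennyHeck") == "renamed"
assert account_status(False, "repshalala", None) == "deleted"
```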
After putting all this together it’s possible to generate a simple table of legislators who left office at the end of the 116th Congress, and their Twitter account information.
name | url | url_ok | user_id | new_url |
---|---|---|---|---|
Lamar Alexander | https://twitter.com/senalexander | True | 76649729 | |
Michael B. Enzi | https://twitter.com/senatorenzi | True | 291756142 | |
Pat Roberts | https://twitter.com/senpatroberts | True | 75364211 | |
Tom Udall | https://twitter.com/senatortomudall | True | 60828944 | |
Justin Amash | https://twitter.com/justinamash | True | 233842454 | |
Rob Bishop | https://twitter.com/reprobbishop | True | 148006729 | |
K. Michael Conaway | https://twitter.com/conawaytx11 | True | 295685416 | |
Susan A. Davis | https://twitter.com/repsusandavis | False | 432771620 | |
Eliot L. Engel | https://twitter.com/repeliotengel | True | 164007407 | |
Bill Flores | https://twitter.com/repbillflores | False | 237312687 | |
Cory Gardner | https://twitter.com/sencorygardner | True | 235217558 | |
Peter T. King | https://twitter.com/reppeteking | True | 18277655 | |
Steve King | https://twitter.com/stevekingia | True | 48117116 | |
Daniel Lipinski | https://twitter.com/replipinski | True | 1009269193 | |
David Loebsack | https://twitter.com/daveloebsack | True | 510516465 | |
Nita M. Lowey | https://twitter.com/nitalowey | True | 221792092 | |
Kenny Marchant | https://twitter.com/repkenmarchant | True | 23976316 | |
Pete Olson | https://twitter.com/reppeteolson | True | 20053279 | |
Martha Roby | https://twitter.com/repmartharoby | False | 224294785 | https://twitter.com/MarthaRobyAL |
David P. Roe | https://twitter.com/drphilroe | True | 52503751 | |
F. James Sensenbrenner, Jr. | https://twitter.com/jimpressoffice | False | 851621377 | |
José E. Serrano | https://twitter.com/repjoseserrano | True | 33563161 | |
John Shimkus | https://twitter.com/repshimkus | True | 15600527 | |
Mac Thornberry | https://twitter.com/mactxpress | True | 377534571 | |
Scott R. Tipton | https://twitter.com/reptipton | True | 242873057 | |
Peter J. Visclosky | https://twitter.com/repvisclosky | True | 193872188 | |
Greg Walden | https://twitter.com/repgregwalden | True | 32010840 | |
Rob Woodall | https://twitter.com/reprobwoodall | True | 2382685057 | |
Ted S. Yoho | https://twitter.com/reptedyoho | True | 1071900114 | |
Doug Collins | https://twitter.com/repdougcollins | True | 1060487274 | |
Tulsi Gabbard | https://twitter.com/tulsipress | True | 1064206014 | |
Susan W. Brooks | https://twitter.com/susanwbrooks | True | 1074101017 | |
Joseph P. Kennedy III | https://twitter.com/repjoekennedy | False | 1055907624 | https://twitter.com/joekennedy |
George Holding | https://twitter.com/repholding | True | 1058460818 | |
Denny Heck | https://twitter.com/repdennyheck | False | 1068499286 | https://twitter.com/LtGovDennyHeck |
Bradley Byrne | https://twitter.com/repbyrne | True | 2253968388 | |
Ralph Lee Abraham | https://twitter.com/repabraham | True | 2962891515 | |
Will Hurd | https://twitter.com/hurdonthehill | True | 2963445730 | |
David Perdue | https://twitter.com/sendavidperdue | True | 2863210809 | |
Mark Walker | https://twitter.com/repmarkwalker | True | 2966205003 | |
Francis Rooney | https://twitter.com/reprooney | True | 816111677917851649 | |
Paul Mitchell | https://twitter.com/reppaulmitchell | True | 811632636598910976 | |
Doug Jones | https://twitter.com/sendougjones | True | 941080085121175552 | |
TJ Cox | https://twitter.com/reptjcox | True | 1080875913926139910 | |
Gilbert Ray Cisneros, Jr. | https://twitter.com/repgilcisneros | True | 1080986167003230208 | |
Harley Rouda | https://twitter.com/repharley | True | 1075080722241736704 | |
Ross Spano | https://twitter.com/reprossspano | True | 1090328229548826627 | |
Debbie Mucarsel-Powell | https://twitter.com/repdmp | True | 1080941062028447744 | |
Donna E. Shalala | https://twitter.com/repshalala | False | 1060584809095925762 | |
Abby Finkenauer | https://twitter.com/repfinkenauer | True | 1081256295469068288 | |
Steve Watkins | https://twitter.com/rep_watkins | False | 1080307235350241280 | |
Xochitl Torres Small | https://twitter.com/reptorressmall | True | 1080830346915209216 | |
Max Rose | https://twitter.com/repmaxrose | True | 1078692057940742144 | |
Anthony Brindisi | https://twitter.com/repbrindisi | True | 1080978331535896576 | |
Kendra S. Horn | https://twitter.com/repkendrahorn | False | 1083019402046513152 | https://twitter.com/KendraSHorn |
Joe Cunningham | https://twitter.com/repcunningham | True | 1080198683713507335 | |
Ben McAdams | https://twitter.com/repbenmcadams | False | 196362083 | https://twitter.com/BenMcAdamsUT |
Denver Riggleman | https://twitter.com/repriggleman | True | 1080504024695222273 |
In most cases where the account has been updated the individual simply changed their Twitter username, sometimes removing “Rep” from it, like RepJoeKennedy to JoeKennedy. As an aside, I’m kind of surprised that the Twitter username wasn’t already taken, to be honest. Maybe that’s a perk of having a verified account, or of being a politician? But if you look closely you can see there were a few that seemed to have deleted their account altogether:
name | url | url_ok | user_id |
---|---|---|---|
Susan A. Davis | https://twitter.com/repsusandavis | False | 432771620 |
Bill Flores | https://twitter.com/repbillflores | False | 237312687 |
F. James Sensenbrenner, Jr. | https://twitter.com/jimpressoffice | False | 851621377 |
Donna E. Shalala | https://twitter.com/repshalala | False | 1060584809095925762 |
Steve Watkins | https://twitter.com/rep_watkins | False | 1080307235350241280 |
There are two notable exceptions to this. The first is Vice President Kamala Harris. My logic for determining if a person was leaving Congress was to see if they served in a term ending on 2021-01-03, and weren’t serving in a term starting then. But Harris is different because her term as a Senator is listed as ending on 2021-01-18. Her old account (???) is no longer available, but her Twitter User ID is still active and is now attached to the account at (???). The other of course is Joe Biden, who stopped being a senator in order to become the President. His Twitter account remains the same at (???).
It’s worth highlighting here how there seems to be no uniform approach to handling this process. In one case (???) is temporarily blessed as the VP, with a unified account history underneath. In the other there is a separation between (???) and (???). It seems like Twitter has some work to do on managing identities, or maybe the Congress needs to prescribe a set of procedures? Or maybe I’m missing part of the picture, and that just as (???) somehow changed back to (???) there is some namespace management going on behind the scenes?
If you are interested in other social media platforms like Facebook, Instagram and YouTube, the unitedstates project tracks information for those platforms too. I merged that information into the legislators.yaml file I discussed here if you want to try to check them. I think one thing this experiment shows is that, if a platform allows usernames to be changed, it is critical to track the user ID as well. I didn’t do the work to check that those accounts still exist, but that’s a project for another day.
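The account-existence check the post leaves for another day could be sketched like this (a hypothetical approach using only the standard library; in practice Twitter often blocks or redirects unauthenticated requests, which is one more reason the stable user ID is worth tracking):

```python
import urllib.request
import urllib.error

def status_ok(status_code: int) -> bool:
    # Only a direct 200 counts as a live profile; deleted or suspended
    # accounts typically return 404 or redirect elsewhere.
    return status_code == 200

def check_profile(url: str) -> bool:
    """Fetch a profile URL and report whether it still resolves."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return status_ok(resp.status)
    except urllib.error.URLError:
        return False
```

A more robust version would use the platform's API to look accounts up by user ID rather than scraping profile URLs.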
I’m not sure this list of five deleted accounts is terribly interesting at the end of all this. Possibly? But on the plus side I did learn how to interact with Git better from Python, which is something I can imagine returning to in the future. It’s not every day that you have to think of the versions of a dataset as an important feature of the data, outside of serving as a backup that can be reverted to if necessary. But of course data changes in time, and if seeing that data over time is useful, then the revision history takes on a new significance. It’s nothing new to see version control systems as critical data provenance technologies, but it felt new to actually use one that way to answer a question. Thanks Mark!
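The Git-from-Python work described above, reading a dataset's revision history and diffing snapshots to find deletions, might look roughly like this. The use of the git CLI via subprocess and the snapshot-diffing helper are my assumptions; the post does not show its code.

```python
import subprocess

def file_at_commit(repo, commit, path):
    """Read a file's contents as of a given commit using `git show`."""
    result = subprocess.run(
        ["git", "-C", repo, "show", f"{commit}:{path}"],
        capture_output=True, text=True, check=True)
    return result.stdout

def deleted_accounts(old, new):
    """Accounts present in an earlier snapshot but missing from a later
    one, matched on the stable user_id rather than the mutable username."""
    new_ids = {a["user_id"] for a in new}
    return [a for a in old if a["user_id"] not in new_ids]
```

Matching on user_id rather than username is what makes the diff reliable, since usernames can change freely between revisions.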
2021-02-04T05:00:00+00:00 David Rosenthal: Stablecoins https://blog.dshr.org/2020/12/stablecoins.html I have long been skeptical of Bitcoin's "price" and, despite its recent massive surge, I'm still skeptical. But it turns out I was wrong two years ago when I wrote in Blockchain: What's Not To Like?:

Permissionless blockchains require an inflow of speculative funds at an average rate greater than the current rate of mining rewards if the "price" is not to collapse. To maintain Bitcoin's price at $4K requires an inflow of $300K/hour.

I found it hard to believe that this much actual money would flow in, but since then Bitcoin's "price" hasn't dropped below $4K, so I was wrong. Caution — I am only an amateur economist, and what follows below the fold is my attempt to make sense of what is going on.
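The $300K/hour figure in the quoted passage follows from Bitcoin's parameters at the time: a 12.5 BTC block subsidy and roughly six blocks per hour (figures from the 2018 context, not stated in the excerpt itself):

```python
# Bitcoin parameters circa the 2018 quote: 12.5 BTC block subsidy,
# one block roughly every ten minutes.
block_reward_btc = 12.5
blocks_per_hour = 6
price_usd = 4_000

# Hourly miner revenue that speculative inflows must match to hold the price.
inflow_per_hour = block_reward_btc * blocks_per_hour * price_usd
print(inflow_per_hour)  # 300000.0
```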
the $2.8 billion worth of tethers are only 74% backed:

If USDT isn't backed by USD, what is backing it, and is 1USDT really worth 1USD?

Tether has cash and cash equivalents (short term securities) on hand totaling approximately $2.1 billion, representing approximately 74 percent of the current outstanding tethers.
Please note that Coinbase does not support USDT — do not send it to your Bitcoin account on Coinbase.

Because of USDT's history and reputation, exchanges that do offer a "fiat off-ramp" are taking a significant risk, so they will impose a spread; the holder will get less than $1. Why would you send $1 to Tether to get less than $1 back?
Rather than demand from cash investors, these patterns are most consistent with the supply‐based hypothesis of unbacked digital money inflating cryptocurrency prices.

Their paper was originally published in 2018 and updated in 2019 and 2020.
A friend of mine, who works in finance, asked me to explain what Tether was.
Short version: Tether is the internal accounting system for the largest fraud since Madoff.
BaFin conducted multiple investigations against journalists and short sellers because of alleged market manipulation, in response to negative media reporting of Wirecard. ... Critics cite the German regulator, press and investor community's tendency to rally around Wirecard against what they perceive as unfair attack. ... After initially defending BaFin's actions, its president Felix Hufeld later admitted the Wirecard Scandal is a "complete disaster".

Similarly, the cryptocurrency world has a long history of both attacking and ignoring realistic critiques. An example of ignoring is the DAO:
The Decentralized Autonomous Organization (The DAO) was released on 30th April 2016, but on 27th May 2016 Dino Mark, Vlad Zamfir, and Emin Gün Sirer posted A Call for a Temporary Moratorium on The DAO, pointing out some of its vulnerabilities; it was ignored. Three weeks later, when The DAO contained about 10% of all the Ether in circulation, a combination of these vulnerabilities was used to steal its contents.
The purpose of the crypto industry, and all its little service sub-industries, is to generate a narrative that will maintain and enhance the flow of actual dollars from suckers, and keep the party going.

Gerard links to Bryce Weiner's Hopes, Expectations, Black Holes, and Revelations — or How I Learned To Stop Worrying and Love Tether, which starts from the incident in April of 2018 when Bitfinex, the cryptocurrency exchange behind Tether, encountered a serious problem:
Increasing quantities of tethers are required to make this happen. We just topped twenty billion alleged dollars’ worth of tethers, sixteen billion of those just since March 2020. If you think this is sustainable, you’re a fool.
the wildcat bank backing Tether was raided by Interpol for laundering of criminally obtained assets to the tune of about $850,000,000. The percentage of that sum which was actually Bitfinex is a matter of some debate but there’s no sufficient reason not to think it was all theirs.

At the time, USDT's "market cap" was around $2.3B, so assuming Tether was actually backed by USD at that point, it lost 37% of its backing. This was a significant problem, more than enough to motivate shenanigans.
...
the nature of the problem also presented a solution: instead of backing Tether in actual dollars, stuff a bunch of cryptocurrency in a basket to the valuation of the cash that got seized and viola! A black hole is successfully filled with a black hole, creating a stable asset.
We saw about 300 million Tethers being lined up on Binance and Huobi in the week previously. These were then deployed en masse.

See Cryptocurrency Pump-and-Dump Schemes by Tao Li, Donghwa Shin and Baolian Wang.
You can see the pump starting at 13:38 UTC on 16 December. BTC was $20,420.00 on Coinbase at 13:45 UTC. Notice the very long candles, as bots set to sell at $20,000 sell directly into the pump.
Lots of people deposited stablecoins to exchanges 7 mins before breaking $20k.

Note that "7 mins" is about one Bitcoin block time, and by "exchange users" he means "addresses"; it could have been a pre-programmed "smart contract".
Price is all about consensus. I guess the sentiment turned around to buy $BTC at that time.
...
ETH block interval is 10-20 seconds.
This chart means 127 exchange users worldwide were trying to deposit #stablecoins in a single block — 10 seconds.
USDC loudly touts claims that it’s well-regulated, and implies that it’s audited. But USDC is not audited — accountants Grant Thornton sign a monthly attestation that Centre have told them particular things, and that the paperwork shows the right numbers.
In my career working in the academic sector, I have realized that one thing that is often missing from in-house software development is “product management.”
But what does that mean exactly? You don’t know it’s missing if you don’t even realize it’s a thing and people can use different terms to mean different roles/responsibilities.
Basically, deciding what the software should do. This is not about colors on screen or margins (which our stakeholders often enjoy micro-managing) — I’d consider those still part of the how of doing it, rather than the what to do. The what is often at a much higher level, about what features or components to develop at all.
When done right, it is going to be based on knowledge of the end-user’s needs and preferences (user research), but also knowledge of internal stakeholders’ desires and preferences (overall organizational strategy, but also just practically what is going to make the right people happy to keep us resourced), as well as knowledge of local capacity: what pieces do we need to put in place to get these things developed. When done seriously, it will necessarily involve prioritization — there are many things we could possibly do, some subset of which we may well do eventually, but which ones should we do now?
My experience tells me it is a very big mistake to have a developer doing this kind of product management. Not because a developer can’t have the right skillset to do it, but because having the same person leading development and product management is a mistake. The developer is too close to the development lens, and there’s a clarity that comes when these roles are separate.
My experience also tells me that it’s a mistake to have a committee doing these things, much as that is popular in the academic sector. Because, well, just of course it is.
But okay this is all still pretty abstract. Things might become more clear if we get more specific about the actual tasks and work of this kind of product management role.
I found Damilola Ajiboye’s blog post “Product Manager vs Product Marketing Manager vs Product Owner” very clear and helpful here. While it is written to distinguish between three different product-management-related roles, Ajiboye also acknowledges that in a smaller organization “a product manager is often tasked with the duty of these 3 roles.”
Regardless of whether the responsibilities are done by one, two, or three people, Ajiboye’s post serves as a concise listing of the work to be done in managing a product — deciding the what of the product, in an ongoing, iterative and collaborative manner, so that developers and designers can get to the how and to implementation.
I recommend reading the whole article, and I’ll excerpt much of it here, slightly rearranged.
The Product Manager
These individuals are often referred to as mini CEOs of a product. They conduct customer surveys to figure out the customer’s pain and build solutions to address it. The PM also prioritizes what features are to be built next and prepares and manages a cohesive and digital product roadmap and strategy.
The Product Manager will interface with the users through user interviews/feedback surveys or other means to hear directly from the users. They will come up with hypotheses alongside the team and validate them through prototyping and user testing. They will then create a strategy on the feature and align the team and stakeholders around it. The PM who is also the chief custodian of the entire product roadmap will, therefore, be tasked with the duty of prioritization. Before going ahead to carry out research and strategy, they will have to convince the stakeholders if it is a good choice to build the feature in context at that particular time or wait a bit longer based on the content of the roadmap.
The Product Marketing Manager
The PMM communicates vital product value — the “why”, “what” and “when” of a product to intending buyers. He manages the go-to-market strategy/roadmap and also oversees the pricing model of the product. The primary goal of a PMM is to create demand for the products through effective messaging and marketing programs so that the product has a shorter sales cycle and higher revenue.

The product marketing manager is tasked with market feasibility and discovering if the features being built align with the company’s sales and revenue plan for the period. They also make research on how sought-after the feature is being anticipated and how it will impact the budget. They communicate the values of the feature; the why, what, and when to potential buyers — In this case users in countries with poor internet connection.
[While expressed in terms of a for-profit enterprise selling something, I think it’s not hard to translate this to a non-profit or academic environment. You still have an audience whose uptake you need to be successful, whether internal or external. — jrochkind ]
The Product Owner
A product owner (PO) maximizes the value of a product through the creation and management of the product backlog, creation of user stories for the development team. The product owner is the customer’s representative to the development team. He addresses customer’s pain points by managing and prioritizing a visible product backlog. The PO is the first point of call when the development team needs clarity about interpreting a product feature to be implemented.

The product owner will first have to prioritize the backlog to see if there are no important tasks to be executed and if this new feature is worth leaving whatever is being built currently. They will also consider the development effort required to build the feature, i.e. the time, tools, and skill set that will be required. They will be the one to tell if the expertise of the current developers is enough or if more engineers or designers are needed to be able to deliver at the scheduled time. The product owner is also armed with the task of interpreting the product/feature requirements for the development team. They serve as the interface between the stakeholders and the development team.
When you have someone(s) doing these roles well, it ensures that the development team is actually spending time on things that meet user and business needs. I have found that it makes things so much less stressful and more rewarding for everyone involved.
When you have nobody doing these roles, or someone doing it in a cursory or unintentional way not recognized as part of their core job responsibilities, or a lead developer trying to do it on top of development, I find it leads to feelings of spinning wheels, everything-is-an-emergency, lack of appreciation, miscommunication and lack of shared understanding between stakeholders and developers, and general burnout and dissatisfaction — and at the root, a product that is not meeting user or business needs well, leading to these inter-personal and personal problems.
2021-02-03T20:44:17+00:00 jrochkind Islandora: Islandora Open Meeting: February 23, 2021 https://islandora.ca/content/islandora-open-meeting-february-23-2021 We will be holding another open drop-in session on Tuesday, February 23 from 10:00 AM to 2:00 PM Eastern. Full details, and the Zoom link to join, are in this Google doc. The meeting is free form, with experienced Islandora 8 users on hand to answer questions or give demos on request. Please drop in at any time during the four-hour window.
Registration is not required. If you would like a calendar invite as a reminder, please let us know at community@islandora.ca.
Library support for bibliometrics and research impact (BRI) analysis is a growing area of library investment and service, not just in the provision of services to researchers but for the institutions themselves, which increasingly need to quantify research impact for a spectrum of internally and externally motivated purposes, such as strategic decision support, benchmarking, reputation analysis, support for funding requests, and a better understanding of research performance.
Research libraries are adopting new roles to support bibliometrics and research impact analysis, and the University of Waterloo Library’s efforts have caught my attention for some time, for two specific reasons:
Alison Hitchens, Associate University Librarian for Collections, Technology, and Scholarly Communication, and Laura Bredahl, Bibliometrics and Research Impact Librarian, recently shared about their activities in an OCLC Research Library Partnership (RLP) Works in Progress webinar presentation entitled Case study—Supporting bibliometric and research impact analysis at the University of Waterloo. Their efforts also will be described in a forthcoming ARL Library Practice Brief on Supporting Bibliometric Data Needs at Academic Institutions.
Like many institutions, the library at Waterloo has been supporting individual researchers with bibliometrics information for reputation management for over a decade. However, around 2012 the university recognized that it needed an institutional understanding of bibliometrics because important external stakeholders like funders, governments, and other accountability organizations were using them to evaluate their organization. Additionally, as the campus developed a new strategic plan emphasizing transformational research, it also needed indicators to help chart progress. As a result, the provost established a working group on bibliometrics that included cross-institutional representation from the office of research, office of institutional analysis, library, and faculties, with the goal to provide guidance to the university on the effective and appropriate use of bibliometrics.
This working group led to a few significant outcomes:
The white paper was followed by the development of a research metrics framework, intended to provide detailed bibliometric indicators related to the work of research institutes supporting key research areas identified in the 2013-2018 strategic plan. And this in turn was followed in 2019 by the development of an internal Bibliometric Assessment Tools Report, offering an extremely detailed review of existing bibliometrics assessment tools, known use cases, and an overview of other prominent bibliometrics tools. The Working Group on Bibliometrics continues its work today, supporting the current strategic plan, particularly by advising on the definitions of research areas and the responsible use of indicators at the institutional level.
Laura described several examples of her work at Waterloo:
Libraries seem to me to be a natural place for bibliometrics and research impact leadership. Librarians have expertise across the research and scholarly communications life cycle, understand disciplinary differences—and how these impact bibliometrics—and also have extensive knowledge with bibliographic data and tools.
In general, this type of engagement can also positively impact the library by “raising the profile of the library on campus.” For example, in the webinar Alison commented,
“It was clear to me that being connected to and known by high level administration in the Office of Research really had an impact on building partnerships in other areas such as research data management. It was a lot easier to send an email or pick up the phone and call an AVP of Research because they knew me through the working group on bibliometrics.”
Overall, this type of activity may result in greater stakeholder appreciation for the value proposition of the library, an improved understanding of the scope of library expertise, and more invitations for the library to participate on campus committees and enterprise-wide projects. At Waterloo, for example, this included opportunities for the AUL to join the Office of Research Systems Advisory Group and for the library to contribute to the institutional RIM project. As the new strategic planning effort has launched, and seven working groups were formed to develop background papers, the library was in a position to successfully advocate for a librarian on each committee.
Of course there’s much more to discuss, so we are offering affiliates with the Research Library Partnership an opportunity to continue the conversation through informal small group discussions with Alison and Laura. Please join us on the following dates:
We are interested in exploring some of these questions:
If you send me an email message, I will send you a calendar invitation (which can help to make sure all the time zones align).
The post Emerging Roles for Libraries in Bibliometric and Research Impact Analysis: Lessons Learned from the University of Waterloo appeared first on Hanging Together.
2021-02-03T16:19:10+00:00 Rebecca Bryant Jez Cope: GLAM Data Science Network fellow travellers https://erambler.co.uk/blog/glam-data-science-network-fellow-travellers/

Updates
See comments & webmentions for details.
“If you want to go fast, go alone. If you want to go far, go together.” — African proverb, probably popularised in English by Kenyan church leader Rev. Samuel Kobia (original)
This quote is a popular one in the Carpentries community, and I interpret it in this context to mean that a group of people working together is more sustainable than individuals pursuing the same goal independently. That’s something that speaks to me, and that I want to make sure is reflected in nurturing this new community for data science in galleries, archives, libraries & museums (GLAM). To succeed, this work needs to be complementary and collaborative, rather than competitive, so I want to acknowledge a range of other networks & organisations whose activities complement this.
The rest of this article is an unavoidably incomplete list of other relevant organisations whose efforts should be acknowledged and potentially built on. And it should go without saying, but just in case: if the work I’m planning fits right into an existing initiative, then I’m happy to direct my resources there rather than duplicate effort.
Groups with similar goals or undertaking similar activities, but focused on a different sector, geographic area or topic. I think we should make as much use of and contribution to these existing communities as possible since there will be significant overlap.
Probably the closest existing community to what I want to build, but primarily based in the US, so timezones (and physical distance for in-person events) make it difficult to participate fully. This is a well-established community though, with regular events including an annual conference so there’s a lot to learn here.
Similar to code4lib but with an Australian focus, so the timezone problem is even bigger!
Focused on supporting the people experimenting with and developing the infrastructure to enable scholars to access GLAM materials in new ways. In some ways, a GLAM data science network would be complementary to their work, by providing people not directly involved with building GLAM Labs with the skills to make best use of GLAM Labs infrastructure.
Another existing community with very similar intentions, but focused on UK Government sector. Clearly the British Library and a few national & regional museums & archives fall into this, but much of the rest of the GLAM sector does not.
A multinational collaboration between several large libraries, archives and museums with a specific focus on the Artificial Intelligence (AI) subset of data science.
A network of researchers, primarily in HEIs, with an interest in improving the transparency and reliability of academic research. Mostly science-focused but with some overlap of goals around ethical and robust use of data.
I’m less familiar with this than the others, but it seems to have a wider focus on technology generally, within the slightly narrower scope of museums specifically. Again, a lot of potential for collaboration.
Several organisations and looser groups exist specifically to develop and deliver training that will be relevant to members of this network. The network also presents an opportunity for those who have done a workshop with one of these and want to know what the “next steps” are to continue their data science journey.
These mission-driven organisations have goals that align well with what I imagine for the GLAM DSN, but operate at a more strategic level. They work by providing expert guidance and policy advice, lobbying, and supporting specific projects with funding and/or effort. In particular, the SSI runs a fellowship programme which is currently providing a small amount of funding to this project.
These organisations exist to promote the interests of professionals in particular fields, including supporting professional development. I hope they will provide communication channels to their various members at the least, and may be interested in supporting more directly, depending on their mission and goals.
As I mentioned at the top of the page, this list cannot possibly be complete. This is a growing area and I’m not the only or first person to have this idea. If you can think of anything glaring that I’ve missed and you think should be on this list, leave a comment or tweet/toot at me!
This post was written by members of the Metadata Working Group, a subgroup of DLF’s Assessment Interest Group.
Digital collections work has changed in a number of ways during the COVID-19 pandemic. For many libraries and archives, this has meant working remotely and shifting toward tasks that can be done online. Within the DLF AIG Metadata Working Group, members have discussed a number of ways that organizations have chosen to increase capacity for metadata, transcription, and other tasks related to digital collections as a way of providing work for employees who would normally work in public-serving positions. This post documents some of those projects and activities.
At the University of North Texas, our Digital Collections use a web-based metadata editing interface and we can add as many users as needed. When the stay-at-home order went into effect right after spring break, many of our library staff members (including full-time librarians/staff and part-time student workers) were no longer able to do their regular jobs and we offered metadata as an alternative. We added about 100 new editors to our system in March 2020. Additionally, we added some quickly-drafted documentation to steer people toward easy metadata projects and known issues that require clean-up (like fixing name formatting). To keep oversight manageable, new editors were still attached to their own departments (or assigned to one that needed help), with a central contact person for each department and a specific sub-set of projects. Our team of developers rushed an overhaul of the event tracking system that documents who is editing and what records they are changing so that managers could more easily verify if workers were editing when they said they were working. Tracking edits has also let us measure how significantly overall editing has increased. Multiple times since this started, we have had at least one editor working during every hour of the day. Having so many relatively-untrained editors has resulted in a large number of issues that will need to be reviewed, but we have tools built into our system to help identify those issues and have added them to our ongoing list of things to fix. Overall, this was still an extremely positive experience since the increase in editors allowed significant progress or completion of work that would not have been done otherwise. – Hannah Tarver
At the University of Utah, the COVID-19 pandemic pivot to remote work prompted the launch of transcription projects, both with handwritten materials from special collections and newspaper OCR correction. This includes the transcription of 34,000 employee records by our Digital Operations student employees, which resulted in the complete transcription of the Kennecott Miner Records collection. We are also using Omeka Classic with the Scripto plug-in as the platform for manuscript transcription projects, and were able to find valuable work for people to engage in when they couldn’t physically be at the library.
In addition, we developed a born-digital crowdsourced digital collection, the Utah COVID-19 Digital Collection designed to capture what is currently happening during this unusual time. We’ve gotten a great response from the University and larger Utah communities, with over 800 contributions so far available in the digital library. The COVID-19 Digital Collection has enabled us to build new partnerships and provided the library with outreach opportunities. An article detailing the project is forthcoming in a special issue of the journal Digital Library Perspectives. – Anna Neatrour
After starting with From the Page a few months earlier, moving staff and volunteers to transcription and indexing projects proved to be successful. Contributors finished a historical court case (and now working on a second one) and a year’s worth of birth certificates in only a few months using the web-based interface that integrates with CONTENTdm digital collections. With a built-in notes feature, questions can be asked and answered directly on a document’s page, which will then be exported along with the rest of the metadata. We are now preparing to open up the birth certificate indexing to the general public with additional training materials. In addition, new digital collections have been published, even with metadata developed remotely, using tools like Google Sheets for input and then converting to delimited text files for import. – Gina Strack
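The Sheets-to-delimited-text conversion described above can be sketched in a few lines of Python. The tab-delimited target format and the cell-cleanup step are assumptions about the workflow, not details from the post:

```python
import csv

def sheet_to_tab_delimited(csv_path, out_path):
    """Convert a CSV exported from Google Sheets into the tab-delimited
    text file format used for batch metadata import."""
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst, delimiter="\t")
        for row in reader:
            # Strip stray whitespace that creeps into spreadsheet cells.
            writer.writerow(cell.strip() for cell in row)
```

The same pattern extends to any delimiter or column re-mapping the target system requires.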
At the start of March, the University of Texas Libraries Collections Portal, the public-facing search and discovery interface for our Digital Asset Management System (DAMS), included approximately 2,500 items. Shortly after, the UT-Austin campus closed and many staff members from the Libraries’ first-line customer service, acquisitions and cataloging units found their roles pivoting to create metadata remotely for our DAMS system. Collection curators within UT Libraries created large-scale digital projects to help ensure continued remote work and to utilize this unusual time to turn their focus to projects that had been placed on the back burner due to more pressing obligations. Our Digital Asset Management System Coordinator and staff from our Preservation and Digital Stewardship unit created flexible pathways to support these projects and to ensure successful ingests into the DAMS. Staff at the Architecture & Planning Library and the Alexander Architectural Archives, the Nettie Lee Benson Latin American Collection, and the Perry-Castañeda Library Map Collection dedicated themselves to ingesting and describing large amounts of digital items, increasing our total number of items available online to over 20,000 by September. Digital objects newly available online as a result of this unprecedented, organization-wide collaborative effort include over 2,000 digitized architectural drawings and images, 14 historic books from the Benson Rare Book Collection and Primeros Libros de las Américas, and 14,138 scanned maps. The University of Texas Libraries documented the experience and provided a more detailed explanation of our DAMS in TexLibris. – Mandy Ryan
Colgate University’s Special Collections and University Archives (SCUA) is documenting the Colgate community’s experiences and stories of COVID-19. Digital contributions can be submitted at any time via a Google Form and may be added to Colgate’s Digital Collections portal. There have been 20 direct submissions as of October 31. Physical donations of COVID-19 related materials will be accepted once staff return to the library building. Colgate’s Metadata and Cataloging (M&C) staff have been working with SCUA’s digital collections at home for the first time, describing the work of the University’s longest-serving official photographer, Edward H. Stone. Stone documented life at Colgate from the 1890s to the 1950s, and also photographed the people, places, businesses, and industry of the village of Hamilton and Madison County, New York. M&C staff are creating and editing metadata for more than 1000 glass plate negatives scanned by SCUA staff and students. We anticipate this will be a successful collaboration between the two departments that will serve as a model for other metadata-based remote work projects on campus. M&C staff have also worked with a born-digital LGBTQ oral history project curated by students in the Explorations in LGBTQ Studies class. 22 oral history interviews with Colgate graduates active in the struggle for LGBTQ rights on campus from the 1970s to the 2010s are now available on the Digital Collections site. – Rachel White
Most of our staff were able to continue doing most of our work from home, though some imaging projects shifted from actively imaging work (which would have had to be done in the office with our cameras) to working on image editing and curation work. We also had to postpone a meeting for our digitization partners. Some metadata projects that were waiting on new imaging work were shifted to complete later; metadata staff worked on metadata remediation and metadata harvesting projects. One colleague who works on newspaper imaging was shifted over to a project describing moving image footage for the Parade of Quartets collection. We set up a student transcription project to keep students teleworking while they had to remain off-campus due to COVID-19. Their transcription work was incorporated into our full-text accessibility feature for some smaller collections. Students are now working in the office and from home on newspaper collation and metadata work, and our imaging staff have worked out a schedule to work while social distancing. Our full staff meetings shifted from weekly meetings (in person) to daily meetings (via Zoom). Unit and supervisor meetings continue with the same frequency as they were held pre-COVID. Our Quarter 1-2 newsletter and our Quarter 3 newsletter both provide more details of what we have worked on throughout the year. – Mandy Mastrovita
Since the middle of March 2020, the Digital Support Services (DSS) at the Libraries has shifted the focus of its imaging assistant crew. Collaborating with the metadata staff, this crew has carried out site-wide metadata cleanup projects for the University of Florida Digital Collection (UFDC) using UFDC’s online metadata edit form. These tasks can all be done at home using a computer connected to the Internet with minimum instructions. The projects include adding missing system ID numbers, unifying the spelling of language terms, correcting diacritic displays, updating rights statements, transcribing hand-written content, merging Genre terms of different spelling variations to selected ones. So far, DSS staff has modified over 134,000 rights statements and transcribed over 125,000 words. These projects improve the overall metadata quality dramatically. For instance, the Genre terms in use will then be cut down to about 800 from the original 2000 terms gathered from all data contributors over the years. To maintain this smaller selection of Genre terms, the DSS will also implement steps to assure all incoming content uses terms from the controlled Genre list. – Xiaoli Ma
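The genre-term cleanup described here amounts to mapping spelling variants onto a controlled vocabulary and flagging anything unrecognized. A minimal sketch in Python (the term lists below are invented examples for illustration, not UFDC's actual vocabulary):

```python
# Hypothetical controlled vocabulary and variant mappings; the real lists
# would be built from the ~2000 terms contributors have used over the years.
CONTROLLED_GENRES = {"maps", "photographs", "newspapers"}

VARIANTS = {
    "map": "maps",
    "photo": "photographs",
    "photos": "photographs",
    "photograph": "photographs",
    "news papers": "newspapers",
}

def normalize_genre(term):
    """Return the controlled form of a genre term, or None if unrecognized."""
    t = term.strip().lower()
    if t in CONTROLLED_GENRES:
        return t
    return VARIANTS.get(t)
```

A check like this could also run on incoming content, as the post describes, so new ingests only use terms from the controlled list.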
The onset of the COVID-19 pandemic necessitated a shift to telework for University Libraries’ employees. In collaboration with Metadata Initiatives and Preservation & Digitization, staff and student employees in other units who needed remote work were given the opportunity to do metadata telework. This entailed review and description of content for Digital Collections, a digital repository for digitized and born-digital special collections and archival materials. Catalogers worked on remediation of legacy metadata records, particularly audio and image resources. Staff and student employees with no prior metadata experience assisted with review and description of digitized audio and video content in the backlog. This group also contributed to metadata gathering and quality review for a large migration of digitized student newspapers. Virtual collaboration was conducted with Zoom, e-mail, and the university’s instance of BOX, a cloud-based content management system. This work has made a significant impact on the backlog for Digital Collections. However, Metadata Initiatives and applicable stakeholders are still reviewing the work that was done before making updates to records and ingesting the newly processed content. – Annamarie Klose
The post Metadata During COVID appeared first on DLF.
2021-02-03T14:00:33+00:00 Gayle Lucidworks: Accelerate Time to Value for Information Retrieval with AI https://lucidworks.com/post/information-retrieval-with-ai/ We’ve organized the virtuous cycle of our AI and machine learning discipline to make it clear how customers can make the most of the data science innovation at their disposal.
The post Accelerate Time to Value for Information Retrieval with AI appeared first on Lucidworks.
2021-02-02T19:53:43+00:00 Radu Miclaus OCLC Dev Network: Planned maintenance: Classify API http://www.oclc.org/content/developer/worldwide/en_us/news/2021/classify-feb-2021.html OCLC will be performing quarterly maintenance on the experimental Classify API on 4 February 2021 from 10:00am – 10:45am Eastern US (UTC -5).
2021-02-02T17:00:00+00:00 Karen Coombs Terry Reese: MarcEdit 7.3.x/7.5.x (beta) Updates https://blog.reeset.net/archives/2941 Versions are available at: https://marcedit.reeset.net/downloads
Information about the changes:
If you are using 7.x – this will prompt as normal for update. 7.5.x is the beta build; please be aware that I expect to release updates to this build weekly, and you may find some issues.
Questions, let me know.
–tr
2021-02-02T06:00:21+00:00 reeset Cynthia Ng: Choosing not to go into management (again) https://cynthiang.ca/2021/02/02/choosing-not-to-go-into-management-again/ Often, to move up and get a higher pay, you have to become a manager, but not everyone is suited to become a manager, and sometimes given the preference, it’s not what someone wants to do. Thankfully at GitLab, in every engineering team including Support, we have two tracks: technical (individual contributor), and management. Progression … Continue reading "Choosing not to go into management (again)" 2021-02-02T02:24:51+00:00 Cynthia Harvard Library Innovation Lab: New Updates to Search: Advanced Filters https://lil.law.harvard.edu/blog/2021/02/02/new-updates-to-search-advanced-filters/ The Caselaw Access Project offers free, public access to over 6.5 million decisions published by state and federal courts throughout American history. Because our mission is providing access to legal information, we make these decisions available in a variety of formats through a variety of different access methods.
One type of access we've been working hard on recently is our search interface, which you can get to at case.law/search. We've had basic search working for a while, and we're pleased to share our new advanced search filters.
Advanced filters work exactly as you'd expect. Start your search with keywords or phrases, and then use the filters to narrow down jurisdictions, courts, and dates. Say you're looking for Massachusetts cases from 1820 to 1840 that contain the word "whaling."
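The same filtered search can also be expressed against CAP's public REST API at api.case.law. A sketch that just builds the query URL (the endpoint and parameter names follow the CAP API documentation as published; verify against the current docs before relying on them):

```python
from urllib.parse import urlencode

# Massachusetts cases from 1820 to 1840 containing the word "whaling".
params = {
    "search": "whaling",
    "jurisdiction": "mass",
    "decision_date_min": "1820-01-01",
    "decision_date_max": "1840-12-31",
}
url = "https://api.case.law/v1/cases/?" + urlencode(params)
print(url)
```

Fetching that URL would return the same result set the advanced filters produce in the search interface, as JSON.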
You can also access the advanced filters from the search results screen, so that you can fine-tune your search if you're not happy with the initial results. Delete or modify any of the filters as you go, and sort the results chronologically or by relevance.
There is a lot more we hope to do with search, but we hope you enjoy this improvement. If you have ideas of your own, please share them with us at info@case.law.
CAP is a project of the Library Innovation Lab at Harvard Law School Library. We make open source software that helps people access legal information, preserve web sources with Perma.cc, and create open educational resources with H2O.
2021-02-02T00:00:00+00:00 Open Knowledge Foundation: Announcing a new partner for Open Data Day 2021 mini-grants https://blog.okfn.org/2021/02/01/announcing-a-new-partner-for-open-data-day-2021-mini-grants/
For Open Data Day 2021 on Saturday 6th March, the Open Knowledge Foundation is offering support and funding for in-person and online events anywhere in the world via our mini-grant scheme.
Today we are pleased to announce an additional partner for the Open Data Day 2021 mini-grant scheme: the Global Facility for Disaster Reduction and Recovery (GFDRR) through the GFDRR Labs and its Open Data for Resilience Initiative (OpenDRI).
GFDRR will be supporting mini-grants in the Environmental Data track, with a particular focus on ‘data for resilience’.
If you need inspiration for your event using data for resilience, some useful resources to check out include: GFDRR Labs, OpenDRI, Open Cities Project, ThinkHazard, Open Data for Resilience Index and the Risk Data Library.
We are extremely grateful to GFDRR and all our partners who have provided funding for this year’s mini-grant scheme. These include Microsoft, UK Foreign, Commonwealth and Development Office, Mapbox, Latin American Open Data Initiative (ILDA), Open Contracting Partnership and Datopian.
2021-02-01T08:05:38+00:00 James Hamilton Cynthia Ng: Reflection Part 1: My first year at GitLab and becoming Senior https://cynthiang.ca/2020/06/16/reflection-part-1-my-first-year-at-gitlab-and-on-becoming-senior/ About a year ago, I wrote a reflection on Summit and Contribute, our all staff events, and later that year, wrote a series of posts on the GitLab values and culture from my own perspective. There is a lot that I mention in the blog post series and I’ll try not to repeat myself (too … Continue reading "Reflection Part 1: My first year at GitLab and becoming Senior" 2021-01-31T22:13:01+00:00 Cynthia Cynthia Ng: Reflection Part 2: My second year at GitLab and on becoming Senior again https://cynthiang.ca/2020/06/17/reflection-part-2-my-second-year-at-gitlab-and-on-becoming-senior-again/ This reflection is a direct continuation of part 1 of my time at GitLab so far. If you haven’t, please read the first part before beginning this one. Becoming an Engineer (18 months) The more time I spent working in Support, the more I realized that the job was much more technical than I originally … Continue reading "Reflection Part 2: My second year at GitLab and on becoming Senior again" 2021-01-31T22:12:40+00:00 Cynthia Jonathan Rochkind: Rails auto-scaling on Heroku https://bibwild.wordpress.com/2021/01/27/rails-auto-scaling-on-heroku/ We are investigating moving our medium-small-ish Rails app to heroku.
We looked at both the Rails Autoscale add-on available on heroku marketplace, and the hirefire.io service which is not listed on heroku marketplace and I almost didn’t realize it existed.
I guess hirefire.io doesn’t have any kind of partnership with heroku, but it still uses the heroku API to provide an autoscale service. hirefire.io ended up looking more fully-featured and lower priced than Rails Autoscale; so the main purpose of this post is just to increase the visibility of hirefire.io, and therefore competition in the field, which benefits us consumers.
At first I didn’t realize there was such a thing as “auto-scaling” on heroku, but once I did, I realized it could indeed save us lots of money.
I am more interested in scaling Rails background workers than web workers, though — our background workers are busiest when we are doing “ingests” into our digital collections/digital asset management system, so the work is highly variable. Auto-scaling up when ingest work is piling up can give us really nice ingest throughput while keeping costs low.
On the other hand, our web traffic is fairly low and probably isn’t going to go up by an order of magnitude (non-profit cultural institution here). And after discovering that a “standard” dyno is just too slow, we will likely be running a performance-m or performance-l anyway — which likely can handle all anticipated traffic on its own. If we have an auto-scaling solution, we might configure it for web dynos, but we are especially interested in good features for background scaling.
There is a heroku built-in autoscale feature, but it only works for performance dynos, and won’t do anything for Rails background job dynos, so that was right out.
That left two third-party services that could work for Rails bg jobs: the Rails Autoscale add-on on the heroku marketplace; and then we found hirefire.io.
As of now January 2021, hirefire.io has pretty simple and affordable pricing. $15/month/heroku application. Auto-scaling as many dynos and process types as you like.
hirefire.io by default can only check your app’s metrics to decide if a scaling event should occur once per minute. If you want checks more frequent than that (up to once every 15 seconds), you have to pay an additional $10/month, for $25/month/heroku application.
Even though it is not a heroku add-on, hirefire does advertise that they bill pro-rated to the second, just like heroku and heroku add-ons.
Rails autoscale has a more tiered approach to pricing that is based on number and type of dynos you are scaling. Starting at $9/month for 1-3 standard dynos, the next tier up is $39 for up to 9 standard dynos, all the way up to $279 (!) for 1 to 99 dynos. If you have performance dynos involved, from $39/month for 1-3 performance dynos, up to $599/month for up to 99 performance dynos.
For our anticipated uses… if we only scale bg dynos, I might want to scale from (low) 1 or 2 to (high) 5 or 6 standard dynos, so we’d be at $39/month. Our web dynos are likely to be performance and I wouldn’t want/need to scale more than probably 2, but that puts us into performance dyno tier, so we’re looking at $99/month.
This is of course significantly more expensive than hirefire.io’s flat rate.
Since Hirefire has an additional charge for finer than 1-minute resolution on autoscaling checks, we’ll discuss resolution here in this section too. Rails Autoscale has the same resolution for all tiers, and I think it’s generally 10 seconds — so approximately the same as hirefire if you pay hirefire the extra $10 for increased resolution.
Let’s look at configuration screens to get a sense of feature-sets.
To configure web dynos, here’s what you get, with default values:
The metric Rails Autoscale uses for scaling web dynos is time in heroku routing queue, which seems right to me — when things are spending longer in heroku routing queue before getting to a dyno, it means scale up.
For scaling worker dynos, Rails Autoscale can scale dyno type named “worker” — it can understand ruby queuing libraries Sidekiq, Resque, Delayed Job, or Que. I’m not certain if there are options for writing custom adapter code for other backends.
Here’s what the configuration options are — sorry these aren’t the defaults, I’ve already customized them and lost track of what defaults are.
You can see that worker dynos are scaled based on the metric “number of jobs queued”, and you can tell it to only pay attention to certain queues if you want.
Hirefire has far more options for customization than Rails Autoscale, which can make it a bit overwhelming, but also potentially more powerful.
You can actually configure as many Heroku process types as you have for autoscale, not just ones named “web” and “worker”. And for each, you have your choice of several metrics to be used as scaling triggers.
For web, I think Queue Time (percentile, average) matches what Rails Autoscale does when configured to percentile, 95, and is probably the best to use unless you have a reason to use another. (“Rails Autoscale tracks the 95th percentile queue time, which for most applications will hover well below the default threshold of 100ms.”)
Here’s what configuration Hirefire makes available if you are scaling on “queue time” like Rails Autoscale; configuration may vary for other metrics.
I think if you fill in the right numbers, you can configure to work equivalently to Rails Autoscale.
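As a rough sketch of what a queue-time trigger like this does (illustrative only, not either vendor's actual algorithm — the 100ms upscale threshold echoes the Rails Autoscale default quoted above, and the downscale threshold is an invented example):

```python
import math

def p95(samples_ms):
    """Nearest-rank 95th percentile of queue-time samples (milliseconds)."""
    ordered = sorted(samples_ms)
    return ordered[max(0, math.ceil(0.95 * len(ordered)) - 1)]

def scale_decision(samples_ms, upscale_ms=100, downscale_ms=25):
    """Scale up when p95 queue time exceeds the threshold, scale down when
    it is well below it, otherwise hold steady."""
    p = p95(samples_ms)
    if p > upscale_ms:
        return "up"
    if p < downscale_ms:
        return "down"
    return "hold"
```

The intuition matches the post: requests sitting a long time in heroku's routing queue mean the current dynos can't keep up.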
If you have more than one heroku process type for workers — say, working on different queues — Hirefire can scale them independently, with entirely separate configuration. This is pretty handy, and I don’t think Rails Autoscale offers this. (Update: I may be wrong — Rails Autoscale says they do support this, so check on it yourself if it matters to you.)
For worker dynos, you could choose to scale based on actual “dyno load”, but I think this is probably mostly for types of processes where there isn’t the ability to look at “number of jobs”. A “number of jobs in queue” like Rails Autoscale does makes a lot more sense to me as an effective metric for scaling queue-based bg workers.
Hirefire’s metric is slightly different than Rails Autoscale’s “jobs in queue”. For recognized ruby queue systems (a larger list than Rails Autoscale’s; and you can write your own custom adapter for whatever you like), it actually measures jobs in queue plus workers currently busy. So queued+in-progress, rather than Rails Autoscale’s just queued. I actually have a bit of trouble wrapping my head around the implications of this, but basically, it means that Hirefire’s “jobs in queue” metric strategy is intended to try to scale all the way to emptying your queue, or reaching your max scale limit, whichever comes first. I think this may make sense and work out at least as well as, or perhaps better than, Rails Autoscale’s approach.
Here’s what configuration Hirefire makes available for worker dynos scaling on “job queue” metric.
Since the metric isn’t the same as Rails Autoscale’s, we can’t configure this to work identically. But there are a whole bunch of configuration options, some similar to Rails Autoscale’s.
The most important thing here is that “Ratio” configuration. It may not be obvious, but with the way the hirefire metric works, you are basically meant to configure this to equal the number of workers/threads you have on each dyno. I have it configured to 3 because my heroku worker processes use resque, with resque_pool, configured to run 3 resque workers on each dyno. If you use sidekiq, set ratio to your configured concurrency — or if you are running more than one sidekiq process, processes*concurrency. Basically, the number of jobs your dyno can work concurrently is what you should normally set for ‘ratio’.
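The arithmetic implied by this metric can be sketched as queued-plus-busy jobs divided by ratio, clamped to a configured dyno range (the 1–6 bounds below just reflect the example range of standard dynos mentioned earlier; this is my reading of the behavior, not Hirefire's published code):

```python
import math

def desired_dynos(jobs_queued, jobs_in_progress, ratio, min_dynos=1, max_dynos=6):
    """Queued-plus-busy jobs divided by per-dyno concurrency ('ratio'),
    rounded up and clamped to the configured dyno range."""
    jobs = jobs_queued + jobs_in_progress
    return max(min_dynos, min(max_dynos, math.ceil(jobs / ratio)))
```

With ratio 3 and 12 jobs outstanding (queued or in progress), this asks for 4 dynos — enough capacity to drain the queue rather than merely stop it growing.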
Hirefire isn’t actually a heroku plugin. In addition to that meaning separate invoicing, there can be some other inconveniences.
Since hirefire can only interact with the heroku API, for some metrics (including the “queue time” metric that is probably optimal for web dyno scaling) you have to configure your app to log regular statistics to heroku’s “Logplex” system. This can add a lot of noise to your log, and for heroku logging add-ons that are tiered based on number of log lines or bytes, can push you up to higher pricing tiers.
If you use papertrail, I think you should be able to use its log filtering feature to solve this, keeping that noise out of your logs and avoiding impact on data log transfer limits. However, if you ever have cause to look at heroku’s raw logs, that noise will still be there.
I asked a couple questions of both Hirefire and Rails Autoscale as part of my evaluation, and got back well-informed and easy-to-understand answers quickly from both. Support for both seems to be great.
I would say the documentation is decent-but-not-exhaustive for both products. Hirefire may have slightly more complete documentation.
There are other things you might want to compare, various kinds of observability (bar chart or graph of dynos or observed metrics) and notification. I don’t have time to get into the details (and didn’t actually spend much time exploring them to evaluate), but they seem to offer roughly similar features.
Rails Autoscale is quite a bit more expensive than hirefire.io’s flat rate, once you get past Rails Autoscale’s most basic tier (scaling no more than 3 standard dynos).
It’s true that autoscaling saves you money over not autoscaling, so even an expensive price could be considered a ‘cut’ of that, and possibly for many ecommerce sites even $99 a month might be a drop in the bucket (!)…. but this price difference is so significant with hirefire (which has a flat rate regardless of dynos) that it seems to me it would take a lot of additional features/value to justify.
And it’s not clear that Rails Autoscale has any feature advantage. In general, hirefire.io seems to have more features and flexibility.
Until 2021, hirefire.io could only analyze metrics at 1-minute resolution, so perhaps Rails Autoscale’s finer resolution was a “killer feature” then?
Honestly I wonder if this price difference is sustained by Rails Autoscale only because most customers aren’t aware of hirefire.io, it not being listed on the heroku marketplace? Single-invoice billing is handy, but probably not worth $80+ a month. I guess hirefire’s logplex noise is a bit inconvenient?
Or is there something else I’m missing? Pricing competition is good for the consumer.
And are there any other heroku autoscale solutions, that can handle Rails bg job dynos, that I still don’t know about?
update a day after writing djcp on a reddit thread writes:
I used to be a principal engineer for the heroku add-ons program.
One issue with hirefire is they request account level oauth tokens that essentially give them ability to do anything with your apps, where Rails Autoscaling worked with us to create a partnership and integrate with our “official” add-on APIs that limits security concerns and are scoped to the application that’s being scaled.
Part of the reason for hirefire working the way it does is historical, but we’ve supported the endpoints they need to scale for “official” partners for years now.
A lot of heroku customers use hirefire so please don’t think I’m spreading FUD, but you should be aware you’re giving a third party very broad rights to do things to your apps. They probably won’t, of course, but what if there’s a compromise?
“Official” add-on providers are given limited scoped tokens to (mostly) only the actions / endpoints they need, minimizing blast radius if they do get compromised.
You can read some more discussion at that thread.
2021-01-30T22:53:46+00:00 jrochkind Hugh Rundle: Automation workflows with GitHub Actions and Webhooks - Library Map part 3 https://www.hughrundle.net/automation-workflows-with-github-actions-and-webhooks/ This is the third in my series on the Library Map. Part One dealt with why I made the map. Part 2 explained how I made it. This post is about strategies I've used to automate some things to keep it up to date.
A GitHub Action is an automated script that runs on a virtual machine when triggered by some kind of event. Triggers for actions are defined in a "workflow" configuration file at .github/workflows
in your GitHub repository. The terminology can be a bit confusing, because "GitHub Actions" is what GitHub calls the whole system, but an "action" within that system is actually the smallest part in the series:
Workflow
- Job1
- Step1
- Action1
- Action2
- Step2
- Action3
- Job2
- Step1
- Action1
- Action2
GitHub Actions are really just GitHub's version of a Continuous Integration / Continuous Deployment (CI/CD) tool. I say "just", but it's extremely powerful. Unfortunately that does mean that even though GitHub Actions are quite extensively documented, the docs aren't necessarily all that clear if you're starting from scratch, and the process is quite confusing for the uninitiated. I spent a couple of days failing to make it work the way I wanted, so that you don't have to.
There are a zillion things you can use GitHub Actions for — auto-closing "stale" Issues, adding labels automatically, running code linters on pull requests, and so on. If you've read my previous posts, you might remember that I wrote a little Python script to merge the data from library_services_information.csv
into boundaries.topo.json
. But doing that manually every time the CSV file is updated is a tedious manual task. Wouldn't it be better if we could automate it? Well, we can automate it with GitHub Actions!
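For the curious, the shape of such a merge script is roughly the following (a sketch only — the join key, column names, and file layout here are assumptions for illustration, not the actual script in the repository):

```python
import csv
import json

def merge_csv_into_topojson(csv_path, topo_path, key="name"):
    """Copy each CSV row's columns into the matching TopoJSON geometry's
    properties, joined on `key` (an assumed shared column/property name)."""
    with open(csv_path, newline="") as f:
        rows = {row[key]: row for row in csv.DictReader(f)}
    with open(topo_path) as f:
        topo = json.load(f)
    # TopoJSON nests geometries under named objects rather than 'features'.
    for layer in topo["objects"].values():
        for geom in layer["geometries"]:
            props = geom.setdefault("properties", {})
            extra = rows.get(props.get(key))
            if extra:
                props.update(extra)
    with open(topo_path, "w") as f:
        json.dump(topo, f)
```

Running this after every CSV edit is exactly the tedious manual step the workflow below automates.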
What we want to do here is set up a trigger that runs the script whenever the CSV file is changed. I originally tried doing this on a push
event (every time code is pushed to the default branch), and it worked, but ultimately I decided it would be better to run it whenever someone (including me) makes a Pull Request. I'm in a reasonably consistent habit of always creating a new git branch rather than committing directly to the default branch, and there's less chance of something going wrong and the TopoJSON file being corrupted if the merge is done at the Pull Request stage and then manually pulled in — if there can't be a clean merge, GitHub will tell me before I break everything.
To set this up, we need to write a workflow configuration file, listing the jobs we want done, and the actions within each job. Jobs within each workflow are run concurrently unless the workflow configuration tells them to wait for the previous job, though in our case that doesn't matter, because there is only a single job. The structure is:
Workflow ('topo auto updater (PR)')
- Job1 ('auto-topo-updater')
- Step 1: git checkout code
- Step 2: add labels
- Step 3: merge files
- Step 4: git commit updated code
The first step uses an Action provided by GitHub itself. It runs a git checkout
on the repository before anything else happens. This means nothing will happen in the actual repository if anything in the workflow fails, because the virtual machine that checked out your code just gets destroyed without checking the code back in.
Step 2 will use an Action created by Christian Vuerings, and automatically adds labels to an Issue or Pull Request, based on whatever criteria triggered the workflow.
Step 3 runs the python script to merge the CSV data into the TopoJSON.
Step 4 (care of Stefan Zweifel) commits and pushes the updated changes into the pull request that triggered the workflow. This is where the real magic happens, because it simply adds a second commit to the pull request as soon as it is received and before the PR is merged. I initially had set this up to create a second pull request with just the merged TopoJSON changes and then tried to work out how to auto-merge that new pull request, but someone on Mastodon helpfully asked me why I would bother creating a pull request if I wanted to auto-merge it anyway. The thought of auto-committing terrified me initially because I had no idea what I was doing, but on reflection a second PR was indeed a bit silly.
To get all this to happen, we need to write a configuration file. This is written in YAML, and saved in a special directory at the top of the repository, called .github/workflows
. You can name this file whatever you want, but it has to end in .yml
.
First we provide some kind of trigger, and include any conditions we might want to apply. I want this workflow to happen whenever someone creates a pull request that includes changes to the website/data/library_services_information.csv
file:
name: topo auto updater (PR)
on:
pull_request:
paths:
- 'website/data/library_services_information.csv'
workflow_dispatch:
The on
directive lists the different 'events' that can trigger the workflow. The first one is clear enough, but what about workflow_dispatch
? This event simply means "when triggered manually by pressing a button". I don't know why it has such an obscure name.
Once we've told GitHub when we want the workflow to run, we can tell it what we want it to do. First we list our jobs
:
jobs:
auto-topo-updater:
runs-on: ubuntu-latest
steps:
# steps go here
The first line under 'jobs' is the name of our job (this can be anything, but without spaces). runs-on
tells GitHub which runner to use. A 'runner' is a special environment that runs automated continuous integration tools. In this case we're using GitHub Actions runners, but runners are also commonly used in other automated testing tools. Here we are using the "latest" Ubuntu Linux runner, which is currently using Ubuntu 18.04 even though Ubuntu 20.04 is actually the latest Ubuntu LTS release. Now we've outlined the trigger and where we want to run our steps, it's time to say what those steps are:
steps:
- uses: actions/checkout@v2
with:
ref: ${{ github.head_ref }}
- uses: christianvuerings/add-labels@v1
with:
labels: |
auto update
data
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
- name: Merge CSV to TopoJSON
run: |
python3 ./.github/scripts/merge_csv_to_topojson.py
- uses: stefanzweifel/git-auto-commit-action@v4
with:
commit_message: merge csv data to topo
Whoah, that's a lot! You can see there are two ways we describe how to perform an action: uses
, or name
+ run
. The uses
directive points to an Action that someone has publicly shared on GitHub. So uses: actions/checkout@v2
means "use version 2 of the Action at the repository address https://github.com/actions/checkout
". This is an official GitHub action. If we want to simply run some commands, we can just give our action a name
and use the run
directive:
- name: Merge CSV to TopoJSON
run: |
python3 ./.github/scripts/merge_csv_to_topojson.py
In this example, we use a pipe (|
) to indicate that the next lines should be read one after another in the default shell (basically, a tiny shell script). The first step checked out our code, so we can now use any script that is in the repository. I moved the python merging script into .github/scripts/
to make it clearer how this script is used, and now we're calling it with the python3
command.
To pass data to an action, we use with
. The step below passes a list of label names to add to the pull request ('auto update' and 'data'):
- uses: christianvuerings/add-labels@v1
with:
labels: |
auto update
data
Finally, for the labels step we need to provide an environment variable. For certain activities, GitHub requires Actions to use a GITHUB_TOKEN
so that you can't just run an action against any repository without permission. This is automatically stored in the "secret store", to which you can also add other secrets like API keys and so on. The env
directive passes this through to the Action:
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
Now when a pull request is sent, it gets tagged auto update
and data
, and a commit updating the topo.json file is automatically added to it:
You can see the full config file in the Library Map repository.
I've also worked out how to reduce the filesize of my GeoJSON file, so I was able to check it in to the repository. This allowed me to automate the transformation from GeoJSON to TopoJSON whenever the GeoJSON file is updated, with a workflow that runs some commands over the GeoJSON and creates a new pull request. One little gotcha with this is that the action I used to process the GeoJSON file into TopoJSON also cleans up the GeoJSON, which means triggering the action on any change to the GeoJSON file creates a recursive loop whereby every time the new pull request is merged, it creates a new one. To get around this, I probably should just make it auto-commit rather than create a pull request, but for now I added an if
statement:
jobs:
processJson:
if: "!contains(github.event.head_commit.message, 'from hughrun/geo-to-topo')"
...
- name: Create Pull Request
uses: peter-evans/create-pull-request@v3
with:
commit-message: Update TopoJSON boundaries
title: Update TopoJSON boundaries
body: 'Clean & minify GeoJSON'
branch: geo-to-topo
labels: auto update,data
The last action creates a pull request on a new geo-to-topo
branch, so if the commit message includes "from hughrun/geo-to-topo" the job won't run. Recursive pull request problem solved!
I really like cherries, but they're not always in season. Imagine me sending a text message to the local greengrocer every day in early summer, to ask whether they have any cherries yet. They text me back: usually the answer is "no", but eventually it's a "yes". Then I hit on an idea: I call them and ask them to just text me when cherries are in stock.
The first approach is how an API call works: you send a request, and the server sends a response. The second is how a webhook works — you get the response without having to even send the request, when a certain criterion is met. I've been playing around with APIs and webhooks at work, because we want to connect Eventbrite event information to a calendar on our own website. But GitHub also offers webhooks (which actually pre-date GitHub Actions), and this is the final piece of the Library Map automation pipeline.
The big difference of course is that sending an HTTP request and receiving an HTTP request are quite different things. You can send an HTTP request in many different ways: including by just typing a URL into a browser. But to receive a request you need some kind of server. Especially if you don't know when it will be sent. Conveniently I already have a VPS that I use for a few things, including hosting this blog. So we have something to receive the webhook (a server), and something to send the webhook (GitHub). Now we need to tell those two things how to talk to each other.
What we want to do here is automatically update the data on the Library Map whenever there is an update in the repository. I could make this easier by just publishing the map with GitHub pages, but I don't want to completely rely on GitHub for everything.
First of all we need to set up the webhook. In the repository we go to settings - webhooks and then click on Add webhook. Here we enter the Payload URL (the url we will set up on our server to receive the webhook: https://example.com/gh-library-map), the Content type (application/json), and a secret. The secret is just a password that can be any text string, but I recommend using something long and hard to guess. You could try one of my favourite URLs to create it. We want the trigger to be "Just the push event" because we don't want to trigger the webhook every time anything at all happens in the repository. Unfortunately there doesn't seem to be a way to trigger it only on a push to the primary branch, but in future we could probably put some logic in at the receiving end to filter for that. Make sure the webhook is set to "Active", and click "Add webhook".
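To sketch what that receiving-end filter might look like: GitHub's push payloads include a ref field naming the branch that was pushed, so a small check can ignore pushes to everything except the primary branch. The helper name and branch name here are my own illustration, not part of the original setup:

```javascript
// Hypothetical helper for the webhook handler: only act on pushes
// to the primary branch. GitHub push payloads include a "ref"
// field like "refs/heads/main".
const PRIMARY_BRANCH = 'main' // assumption: adjust to your repo's default branch

function isPrimaryBranchPush(payload) {
  return payload.ref === `refs/heads/${PRIMARY_BRANCH}`
}

// e.g. inside the Express route handler:
// if (!isPrimaryBranchPush(req.body)) { return res.status(200).send('Ignored') }
```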
So setting up the webhook to be sent is reasonably straightforward. Receiving it is a bit more complicated. We need to set up a little application to hang around waiting to receive HTTP requests.
First of all, we set up nginx to serve our domain — in this post I'll refer to that as 'example.com'. Then we secure it using certbot so GitHub can send the webhook to https://example.com.
Because we might want to use other webhooks on other systems for different tasks, we're going to go with a slightly over-powered option and use Express. This gives us a bit of control over routing different requests to different functions. Express is a nodejs framework for building web apps, so first we need to make sure we have a recent version of nodejs installed. Then we create a new package metadata file, and a JavaScript file:
npm init
npm install express --save
touch webhooks.js
In our empty webhooks.js file we set up some basic routing rules with Express:
// webhooks.js
const express = require('express')
const port = 4040
const app = express()

app.use(express.json())

app.post('/gh-library-map', (req, res, next) => {
  // do stuff
})

// everything else should 404
app.use(function (req, res) {
  res.status(404).send("There's nothing here")
})

app.listen(port, () => {
  console.log(`Webhooks app listening on port ${port}`)
})
This will do something when a POST request is received at https://example.com/gh-library-map
. All other requests will receive a 404 response. You can test that now.
Returning to the delicious cherries: what happens if someone else finds out about my arrangement with the greengrocer? Maybe a nefarious strawberry farmer wants to entice me to go to the greengrocer and, upon discovering there are no cherries, buy strawberries instead. They could just send a text message to me saying "Hey it's your friendly greengrocer, I totes have cherries in stock". This is the problem with our webhook endpoint as currently set up. Anyone could send a POST request to https://example.com/gh-library-map
and trigger an action. Luckily GitHub has thought of that, and has a solution.
Remember the "Secret" we set when we set up the webhook? This is where we use it. But not directly. GitHub creates an HMAC SHA-256 hash of the entire payload using your secret, and sends the resulting hash with the request in a header called X-Hub-Signature-256. We know what our secret is, and we can therefore check the hash by running the same process over the payload at the receiving end as GitHub did at the sending end. As long as we use a strong secret and the hashes match, we can be confident the request did indeed come from GitHub, and not a nefarious strawberry farmer. The crypto library is included in nodejs automatically, so we can use it to perform that check:
// webhooks.js
const crypto = require('crypto')

app.post('/gh-library-map', (req, res, next) => {
  const hmac = crypto.createHmac('sha256', process.env.LIBRARY_MAP_GH_SECRET)
  hmac.update(JSON.stringify(req.body))
  // check the signature header exists and our computed hash matches it
  if (req.get('X-Hub-Signature-256')) {
    if ( `sha256=${hmac.digest('hex').toString()}` === req.get('X-Hub-Signature-256') ){
      // do something
    } else {
      console.error("signature header received but hash did not match")
      res.status(403).send('Signature is missing or does not match')
    }
  } else {
    console.error('Signature missing')
    res.status(403).send('Signature is missing or does not match')
  }
})
Now we just need to "do something" when the hash matches 😆.
So what is the something we're going to do? The Library Map server simply contains a copy of the repository, sitting behind an nginx web proxy server. What we need to do to update it is run git pull inside that directory, and it will pull in the latest updates from the repository. Our webhook will end up calling this action more often than is strictly useful, because a "push" action happens every time someone creates a pull request, for example, but it's pretty harmless to git pull more often than necessary.
First we create a new function:
// webhooks.js
const util = require('util')
const exec = util.promisify(require('child_process').exec) // run child_process.exec as a Promise/async

async function gitPull(local_repo, res) {
  try {
    const { stdout, stderr } = await exec(`cd ${local_repo} && git pull`);
    let msg = stderr ? stderr : stdout // message is the error message if there is one, else the stdout
    // do something with message
    res.status(200).send('Ok')
  } catch (err) {
    console.error(err)
    res.status(500).send('server error sorry about that')
  }
}
This function is async because we need to await the git pull before we can do something with the output. To make it "awaitable" we use util.promisify(), which is another built-in function in nodejs. We call this function back in our express route, where we said we would "do something":
// webhooks.js
const local_repo = "/path/to/website/directory"

if (req.get('X-Hub-Signature-256')) {
  if ( `sha256=${hmac.digest('hex').toString()}` === req.get('X-Hub-Signature-256') ){
    gitPull(local_repo, res)
  } else { ...
  }
  ...
}
Sweet! Now every time someone does a git push we can do a git pull to add the change to the website! Maybe we want to be sure that happened though, so we can add a final piece to this, by sending ourselves an email using emailjs every time the webhook is successfully received:
npm install emailjs
// webhooks.js
const { SMTPClient } = require('emailjs')

function sendEmail(msg, trigger) {
  const client = new SMTPClient({
    user: process.env.EMAIL_USER,
    password: process.env.EMAIL_PASSWORD,
    host: process.env.SMTP_DOMAIN,
    ssl: true,
  });
  // send the message and get a callback with an error or details of the message that was sent
  client.send(
    {
      text: `GitHub webhook for ${trigger} has triggered a "git pull" event with the following result:\n\n${msg}`,
      from: `Webhook Alerts<${process.env.EMAIL_SEND_ADDRESS}>`,
      to: process.env.EMAIL_RECEIVE_ADDRESS,
      subject: `GitHub triggered a pull for ${trigger}`,
    },
    (err, message) => {
      console.log(err || message);
    }
  );
}
async function gitPull(local_repo, res) {
  try {
    const { stdout, stderr } = await exec(`cd ${local_repo} && git pull`);
    let msg = stderr ? stderr : stdout
    sendEmail(msg, 'mysite.com')
    res.status(200).send('Ok')
  } catch (err) {
    console.error(err)
    res.status(500).send('server error sorry about that')
  }
}
We can now test the webhook:
node webhooks.js
Express will start up. We can use curl to send some test payloads from a new console session on our local machine:
curl -d '{"key1":"value1", "key2":"value2"}' -H "Content-Type: application/json" -X POST https://example.com/gh-library-map
curl -H "X-Hub-Signature-256: blah" -d '{"key1":"value1", "key2":"value2"}' -H "Content-Type: application/json" -X POST https://example.com/gh-library-map
Both requests should return a 403 with Signature is missing or does not match, but in the server console the second one should log a message: signature header received but hash did not match.
The last thing we need to do is set up our little express app to run automatically as a background process on the server. We can do this using systemd. I personally find the official documentation rather impenetrable, but there are lots of helpful tutorials online. Systemd helps us with two tasks: starting the app automatically when the server boots, and restarting it if it falls over.
First we create a "unit file" called webhooks.service at /etc/systemd/system:
# /etc/systemd/system/webhooks.service
[Unit]
Description=Keeps the webhooks express server running
After=network.target

[Service]
Type=simple
ExecStart=/usr/bin/node webhooks.js
Restart=always
RestartSec=10
User=username
WorkingDirectory=/home/username/webhooks
EnvironmentFile=/etc/systemd/system/webhooks.env

[Install]
WantedBy=multi-user.target
The User is your username, and WorkingDirectory is wherever you installed your express app. Since we're responsible server administrators, we have unattended-upgrades running, so occasionally the server will reboot itself to finish installing security updates. We can ensure the webhooks service always comes back up by setting Restart to always.
Next we create the EnvironmentFile mentioned in the unit file:
# /etc/systemd/system/webhooks.env
LIBRARY_MAP_GH_SECRET="your GitHub secret here"
EMAIL_USER="user@mail.example.com"
EMAIL_PASSWORD="top secret password"
SMTP_DOMAIN="smtp.example.com"
EMAIL_SEND_ADDRESS="webhooks@mail.example.com"
EMAIL_RECEIVE_ADDRESS="you@mail.example.com"
This is where all those process.env values come from in the webhooks.js file. We could hardcode them, but you might want to share your file in a blog post one day, and you definitely don't want to accidentally leave your hardcoded GitHub secret in the example!
Make sure we've stopped the app, so we don't have two conflicting installations, then run:
sudo systemctl enable webhooks.service
sudo systemctl start webhooks.service
Our webhooks service should now be running. Go back to the GitHub webhooks page in your repository settings and you should see an option to send a "ping event". This simply checks that your webhook is working by sending a test payload. Send the ping, wait a few moments, and we should see an email appear in the EMAIL_RECEIVE_ADDRESS inbox:
That was a pretty long and technical post, sorry not sorry. Now that I've set up all that automation, it would be great for library people to help correct and complete the data. As for me, I'll be looking for other things I can do with automation. Maybe automatically tooting release notes for ephemetoot. We'll see.
I don’t have much that I can report in this week’s note. You are just going to have to take my word that this week, a large amount of my time was spent at meetings pertaining to my library department, my union, and anti-black racism work.
Last year, around this same time, some colleagues from the University and I organized a speaking event called Safer Communities in a ‘Smart Tech’ World:
We need to talk about Amazon Ring in Windsor.
Windsor’s Mayor proposes we be the first city in Canada to buy into the Ring Network. As residents of Windsor, we have concerns with this potential project. Seeing no venue for residents of Windsor to share their fears of surveillance and loss of privacy through this public-private partnership, we hosted an evening of talks on January 22nd, 2020 at The Performance Hall at the University of Windsor’s School of Creative Arts Windsor Armories Building. Our keynote speaker was Chris Gilliard, heard recently on CBC’s Spark.
Since that evening, we have been in the media raising our concerns, asking questions, and encouraging others to do the same.
The City of Windsor has yet to enter an agreement with Amazon Ring. This is good news.
This week, the City of Windsor announced that it has entered a one-year partnership with Ford Mobility Canada to share data and insights via Ford’s Safety Insights platform.
I don’t think this is good news for reasons outlined in this post called Safety Insights, Data Privacy, and Spatial Justice.
This week I learned a neat Tweetdeck hack. If you set up a search as a column, you can limit the results for that term using the number of ‘engagements’:
I haven’t read this but I have it bookmarked for potential future reference: The weaponization of web archives: Data craft and COVID-19 publics:
An unprecedented volume of harmful health misinformation linked to the coronavirus pandemic has led to the appearance of misinformation tactics that leverage web archives in order to evade content moderation on social media platforms. Here we present newly identified manipulation techniques designed to maximize the value, longevity, and spread of harmful and non-factual content across social media using provenance information from web archives and social media analytics. After identifying conspiracy content that has been archived by human actors with the Wayback Machine, we report on user patterns of “screensampling,” where images of archived misinformation are spread via social platforms. We argue that archived web resources from the Internet Archive’s Wayback Machine and subsequent screenshots contribute to the COVID-19 “misinfodemic” in platforms. Understanding these manipulation tactics that use sources from web archives reveals something vexing about information practices during pandemics—the desire to access reliable information even after it has been moderated and fact-checked, for some individuals, will give health misinformation and conspiracy theories more traction because it has been labeled as specious content by platforms.
I’m going to leave this tweet here because I might pick up this thread in the future:
This reminds me of a talk given in 2018 by Data & Society Founder and President, danah boyd called You Think You Want Media Literacy… Do You?
This essay still haunts me, largely because we still don’t have good answers for the questions that Dr. Boyd asks of us and the stakes have only gotten higher.
2021-01-29T17:24:49+00:00 Mita Williams David Rosenthal: Effort Balancing And Rate Limits https://blog.dshr.org/2021/01/effort-balancing-and-rate-limits.html Catalin Cimpanu reports on yet another crime wave using Bitcoin in As Bitcoin price surges, DDoS extortion gangs return in force: In a security alert sent to its customers and shared with ZDNet this week, Radware said that during the last week of 2020 and the first week of 2021, its customers received a new wave of DDoS extortion emails. And Dan Goodin reports on the latest technique the DDOS-ers are using in DDoSers are abusing Microsoft RDP to make attacks more powerful:
Extortionists threatened companies with crippling DDoS attacks unless they got paid between 5 and 10 bitcoins ($150,000 to $300,000)
...
The security firm believes that the rise in the Bitcoin-to-USD price has led to some groups returning to or re-prioritizing DDoS extortion schemes.
As is typical with many authenticated systems, RDP responds to login requests with a much longer sequence of bits that establish a connection between the two parties. So-called booter/stresser services, which for a fee will bombard Internet addresses with enough data to take them offline, have recently embraced RDP as a means to amplify their attacks, security firm Netscout said. I don't know why it took me so long to figure it out, but reading Goodin's post I suddenly realized that techniques we described in Impeding attrition attacks in p2p systems, a 2004 follow-up to our award-winning 2003 SOSP paper on the architecture of the LOCKSS system, can be applied to preventing systems from being abused by DDOS-ers. Below the fold, brief details.
The amplification allows attackers with only modest resources to strengthen the size of the data they direct at targets. The technique works by bouncing a relatively small amount of data at the amplifying service, which in turn reflects a much larger amount of data at the final target. With an amplification factor of 85.9 to 1, 10 gigabytes-per-second of requests directed at an RDP server will deliver roughly 860Gbps to the target.
Every reflection-friendly protocol mentioned in this article is going to have to learn rate limiting. This includes the initial TCP three-way handshake, ICMP, and every UDP-based protocol. In rare instances it's possible to limit one's participation in DDoS reflection and/or amplification with a firewall, but most firewalls are either stateless themselves, or their statefulness is so weak that it can be attacked separately. The more common case will be like DNS [Response Rate Limiting], where deep knowledge of the protocol is necessary for a correctly engineered rate-limiting solution applicable to the protocol. The RDP server being used to DDOS sees a flood of authentication requests whose source address has been spoofed to be the target of the DDOS. This isn't what they'd see from a real user, so the RDP server should rate-limit sending authentication responses to a client to a reasonable rate for a real client. This would be helpful, but it isn't enough. Because the DDOS-ers use a large number of systems to mount an attack, even a fairly low rate of responses can be harmful.
Effort Balancing. If the effort needed by a requester to procure a service from a supplier is less than the effort needed by the supplier to furnish the requested service, then the system can be vulnerable to an attrition attack that consists simply of large numbers of ostensibly valid service requests. We can use provable effort mechanisms such as Memory-Bound Functions to inflate the cost of relatively “cheap” protocol operations by an adjustable amount of provably performed but otherwise useless effort. By requiring that at each stage of a multi-step protocol exchange the requester has invested more effort in the exchange than the supplier, we raise the cost of an attrition strategy that defects part-way through the exchange. This effort balancing is applicable not only to consumed resources such as computations performed, memory bandwidth used or storage occupied, but also to resource commitments. For example, if an adversary peer issues a cheap request for service and then defects, he can cause the supplier to commit resources that are not actually used and are only released after a timeout (e.g., SYN floods). The size of the provable effort required in a resource reservation request should reflect the amount of effort that could be performed by the supplier with the resources reserved for the request. Vixie also noted the economic requirement:
Engineering economics requires that the cost in CPU, memory bandwidth, and memory storage of any new state added for rate limiting be insignificant compared with an attacker's effort. The reason RDP can be used to amplify a DDOS attack is that, as Goodin wrote:
RDP responds to login requests with a much longer sequence of bits that establish a connection between the two parties. The obvious application of effort balancing would be to require that RDP's login requests be padded with additional bytes to make them longer than the login response. Thus the RDP server would act to attenuate the attack, not amplify it. This would satisfy Vixie's goal:
Attenuation also has to be a first-order goal—we must make it more attractive for attackers to send their packets directly to their victims than to bounce them off a DDoS attenuator. The protocol could specify that the padding bytes not be random, but be computed from the login request parameters by some algorithm making them relatively expensive to generate but cheap to verify (cf. proof-of-work). This would not significantly impact legitimate clients, who issue login requests infrequently, but would increase the cost of using the RDP server to disguise the source of the attack.
Open Knowledge Foundation is excited to launch the Net Zero Challenge, a global pitch competition about using open data for climate action.
With a new administration in the USA and the COP26 meeting in the UK, 2021 will be a crucial year for the global climate response.
Let’s see how open data can play its part.
Tell us how your idea or project uses open data for climate action – and you could win $1,000 USD in the first round of the Net Zero Challenge.
Full details about the Net Zero Challenge are available at netzerochallenge.info.
This project is funded by our partners at Microsoft and the UK Foreign, Commonwealth & Development Office. We are extremely grateful for their support.
How are you advancing climate action using open data?
To be eligible for the Net Zero Challenge, your idea or project must do one or more of the following:
Some ways in which you might do this include:
We are very open minded about your approach and methodology. What we care about is the outcome, and whether you answer the question.
You might consider whether your idea or project is:
How do I apply?
Apply now by filling out this form. All applications must be received by 6pm Pacific Standard Time on Friday 12th March 2021. Late submissions will not be accepted.
Applications will be reviewed and a short list invited to pitch their idea to a panel of experts at a virtual pitch contest.
Pitches will take the form of a public three-minute presentation via video conference, followed by a question and answer session with our panel of climate data experts.
Pitches can be live, or prerecorded but the Q&A will be live.
Expert guidance for the Net Zero Challenge is provided by our advisory committee: the Open Data Charter, the Innovation and Open Data Team at Transport for New South Wales and the Open Data Day team at Open Knowledge Foundation.
Need more information?
If you have any questions about the Net Zero Challenge, please check out the FAQs on the netzerochallenge.info website. To contact the Net Zero Challenge team directly, email netzero@okfn.org.
2021-01-28T11:14:54+00:00 James Hamilton Peter Sefton: Research Data Management looking outward from IT http://ptsefton.com/2021/01/28/rdm4aero/index.html This is a presentation that I gave on Wednesday the 2nd of December 2020 at the AeRO (Australian eResearch Organizations) council meeting at the request of the chair, Dr Carina Kemp.
Carina asked:
It would be really interesting to find out what is happening in the research data management space. And I’m not sure if it is too early, but maybe touch on what is happening in the EOSC Science Mesh Project.
The Audience of the AeRO Council is AeRO member reps from AAF, AARNet, QCIF, CAUDIT, CSIRO, GA, TPAC, The University of Auckland, REANNZ, ADSEI, Curtin, UNSW, APO.
At this stage I was still the eResearch Support Manager at UTS - but I only had a couple of weeks left in that role.
In this presentation I’m going to start from a naive IT perspective about research data.
I would like to acknowledge and pay my respects to the Gundungurra and Darug people who are the traditional custodians of the land on which I live and work.
Research data is special - like snowflakes - and I don’t mean that in a mean way. Research data could be anything: any shape, any size. And researchers are also special: not always 100% aligned with institutional priorities, they align with their disciplines and departments and research teams.
It’s obvious that buying storage doesn’t mean you’re doing data management well but that doesn’t mean it’s not worth restating.
So "data storage is not data management". In fact, the opposite might be true - think about buying a laptop - do you just get one that fits all your stuff and rely on getting a bigger one every few years? Or do you get a smaller main drive and learn how to make sure that your data's actually archived somewhere? That would be managing data.
And remember that not all research data is the same “shape” as corporate data - it does not all come in database or tabular form - it can be images, video, text, with all kinds of structures.
There are several reasons we don’t want to just dole-out storage as needed.
So far we’ve just looked at things from an infrastructure perspective but that’s not actually why we’re here, us with jobs in eResearch. I think we’re here to help researchers do excellent research with integrity, AND we need to help our institutions and researchers manage risk.
The Australian Code for the Responsible Conduct of Research, which all research organizations need to adhere to if we get ARC or NHMRC grants, sets out some institutional responsibilities to provide infrastructure and training.
There are risks associated with research data: reputational, financial, and risks to individuals and communities about whom we hold data.
At UTS, we’ve embraced the Research Data Management Plan - as a way to assist in dealing with this risk. RDMPs have a mixed reputation here in Australia - some organizations have decided to keep them minimal and as streamlined as possible but at UTS the thinking is that they can be useful in addressing a lot of the issues raised so far.
Where’s the data for project X - when there’s an integrity investigation. Were procedures followed?
How much storage are we going to need?
Inspired by the (defunct?) Research Data Lifecycle project that was conceived by the former organizations that became the Australian Research Data Commons (ANDS, NeCTAR and RDSI) we came up with this architecture for a central research data management system (in our case we use the open source ReDBox system) loosely linked to a variety of research Workspaces, as we call them.
The plan is that over time, researchers can plan and budget for data management in the short, medium and long term, provision services and use the system to archive data as they go.
(Diagram by Gerard Barthelot at UTS)
UTS has been an early adopter of the OCFL (Oxford Common File Layout) specification - a way of storing files sustainably on a file system (coming soon: s3 cloud storage) so they do not need to be migrated. I presented on this at the Open Repositories conference.
And at the same conference, I introduced the RO-Crate standards effort, which is a marriage between the DataCrate data packaging work we’ve been doing at UTS for a few years, and the Research Object project.
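For a flavour of what RO-Crate packaging looks like, here is a minimal, purely illustrative ro-crate-metadata.json following the RO-Crate 1.1 conventions (the dataset name and description are placeholders, not from any UTS crate):

```json
{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {
      "@id": "ro-crate-metadata.json",
      "@type": "CreativeWork",
      "conformsTo": { "@id": "https://w3id.org/ro/crate/1.1" },
      "about": { "@id": "./" }
    },
    {
      "@id": "./",
      "@type": "Dataset",
      "name": "Example research dataset",
      "description": "Illustrative RO-Crate root dataset entry."
    }
  ]
}
```

The descriptor entity points at the root dataset via about, which is what lets generic tools discover and display a crate's contents.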
We created the Arkisto platform to bring together all the work we’ve been doing to standardise research data metadata, and to build a toolkit for sustainable data repositories at all scales from single-collection up to institutional, and potentially discipline and national collections.
This is an example of one of many Arkisto deployment patterns you can read more on the Arkisto Use Cases page
This is an example of an Arkisto-platform output: data exported from one content management system into an archive-ready RO-Crate package, which can then be made into a live site. This was created for Assoc Prof Tamson Pietsch at UTS. The website is ephemeral - the data will be Interoperable and Reusable (the I and R from FAIR) via the use of RO-Crate.
Now to higher-level concerns: I built this infrastructure for my chooks (chickens) - they have a nice dry box with a roosting loft. But most of the time they roost on the roof.
We know all too well that researchers don’t always use the infrastructure we build for them - you have to get a few other things right as well.
One of the big frustrations I have had as an eResearch manager is that the expectations and aspirations of funders and integrity managers and so on are well ahead of our capacity to deliver the services they want, and then when we DO get infrastructure sorted there are organizational challenges to getting people to use it. To go back to my metaphor, we can’t just pick up the researchers from the roof and put them in their loft, or spray water on them to get them to move.
Via Gavin Kennedy and Guido Aben from AARNet, Marco La Rosa and I are helping out with this charmingly named project, which is adding data management services to storage, synchronization and sharing services. Contracts are not yet in place so I won't say much about this yet.
https://www.cs3mesh4eosc.eu/about EOSC is the European Open Science Cloud
CS3MESH4EOSC - Interactive and agile sharing mesh of storage, data and applications for EOSC - aims to create an interoperable federation of data and higher-level services to enable friction-free collaboration between European researchers. CS3MESH4EOSC will connect locally and individually provided services, and scale them up at the European level and beyond, with the promise of reaching critical mass and brand recognition among European scientists that are not usually engaged with specialist eInfrastructures.
I told Carina I would look outwards as well. What are we keeping an eye on?
Watch out for the book factory. Sorry, the publishing industry.
The publishing industry is going to “help” the sector look after its research data.
Like, you, know, they did with the copyright in publications. Not only did that industry work out how to take over copyright in research works, they successfully moved from selling us hard-copy resources that we could keep in our own libraries to charging an annual rent on the literature - getting to the point where they can argue that they are essential to maintaining the scholarly record and MUST be involved in the publishing process even when the (sometimes dubious, patchy) quality checks are performed by us who created the literature.
It’s up to research institutions whether this story repeats with research data - remember who you’re dealing with when you sign those contracts!
In the 2010s the Australian National Data Service (ANDS) funded investment in metadata stores; one of these was the ReDBox research data management platform, which is alive and well and being sustained by QCIF with a subscription maintenance service. But ANDS didn’t fund development of research data repositories.
The work I’ve talked about here was all done with the UTS team.
This is a presentation that I gave on Wednesday the 2nd of December 2020 at the AeRO (Australian eResearch Organizations) council meeting at the request of the chair Dr Carina Kemp).
Carina asked:
It would be really interesting to find out what is happening in the research data management space. And I’m not sure if it is too early, but maybe touch on what is happening in the EOSC Science Mesh Project.
The Audience of the AeRO Council is AeRo member reps from AAF, AARNet, QCIF, CAUDIT, CSIRO, GA, TPAC, The uni of Auckland, REANNZ, ADSEI, Curtin, UNSW, APO.
At this stage I was still the eResearch Support Manager at UTS - but I only had a couple of weeks left in that role.
In this presentation I’m going to start from a naive IT perspective about research data.
I would like to acknowledge and pay my respects to the Gundungurra and Darug people who are the traditional custodians of the land on which I live and work.
Research data is special - like snowflakes - and I don’t mean that in a mean way, Research Data could be anything - any shape any size and researchers are also special, not always 100% aligned with institutional priorities, they align with their disciplines and departments and research teams.
It’s obvious that buying storage doesn’t mean you’re doing data management well but that doesn’t mean it’s not worth restating.
So "data storage is not data management". In fact, the opposite might be true - think about buying a laptop - do you just get one that fits all your stuff and rely on getting a bigger one every few years? Or do you get a smaller main drive and learn how to make sure that your data's actually archived somewhere? That would be managing data.
And remember that not all research data is the same “shape” as corporate data - it does not all come in database or tabular form - it can be images, video, text, with all kinds of structures.
There are several reasons we don’t want to just dole-out storage as needed.
So far we’ve just looked at things from an infrastructure perspective but that’s not actually why we’re here, us with jobs in eResearch. I think we’re here to help researchers do excellent research with integrity, AND we need to help our institutions and researchers manage risk.
The Australian Code for the Responsible Conduct of Research, which all research organizations need to adhere to if they receive ARC or NHMRC grants, sets out institutional responsibilities to provide infrastructure and training.
There are risks associated with research data: reputational risks, financial risks, and risks to individuals and communities about whom we hold data.
At UTS, we’ve embraced the Research Data Management Plan - as a way to assist in dealing with this risk. RDMPs have a mixed reputation here in Australia - some organizations have decided to keep them minimal and as streamlined as possible but at UTS the thinking is that they can be useful in addressing a lot of the issues raised so far.
Where’s the data for project X - when there’s an integrity investigation. Were procedures followed?
How much storage are we going to need?
Inspired by the (defunct?) Research Data Lifecycle project that was conceived by the former organizations that became the Australian Research Data Commons (ANDS, NeCTAR and RDSI) we came up with this architecture for a central research data management system (in our case we use the open source ReDBox system) loosely linked to a variety of research Workspaces, as we call them.
The plan is that over time, researchers can plan and budget for data management in the short, medium and long term, provision services and use the system to archive data as they go.
(Diagram by Gerard Barthelot at UTS)
UTS has been an early adopter of the OCFL (Oxford Common File Layout) specification - a way of storing files sustainably on a file system (coming soon: s3 cloud storage) so that the data does not need to be migrated. I presented on this at the Open Repositories conference.
And at the same conference, I introduced the RO-Crate standards effort, which is a marriage between the DataCrate data packaging work we’ve been doing at UTS for a few years, and the Research Object project.
We created the Arkisto platform to bring together all the work we’ve been doing to standardise research data metadata, and to build a toolkit for sustainable data repositories at all scales from single-collection up to institutional, and potentially discipline and national collections.
This is an example of one of many Arkisto deployment patterns; you can read more on the Arkisto Use Cases page.
This is an example of an Arkisto-platform output: data exported from one content management system into an archive-ready RO-Crate package, which can then be made into a live site. This was created for Assoc Prof Tamson Pietsch at UTS. The website is ephemeral - the data will be Interoperable and Reusable (the I and R in FAIR) via the use of RO-Crate.
Now to higher-level concerns: I built this infrastructure for my chooks (chickens) - they have a nice dry box with a roosting loft. But most of the time they roost on the roof.
We know all too well that researchers don’t always use the infrastructure we build for them - you have to get a few other things right as well.
One of the big frustrations I have had as an eResearch manager is that the expectations and aspirations of funders and integrity managers and so on are well ahead of our capacity to deliver the services they want, and then when we DO get infrastructure sorted there are organizational challenges to getting people to use it. To go back to my metaphor, we can’t just pick up the researchers from the roof and put them in their loft, or spray water on them to get them to move.
Via Gavin Kennedy and Guido Aben from AARNet, Marco La Rosa and I are helping out with this charmingly named project, which is adding data management services to storage, synchronization and sharing services. Contracts are not yet in place, so I won't say much about this yet.
https://www.cs3mesh4eosc.eu/about EOSC is the European Open Science Cloud
CS3MESH4EOSC - Interactive and agile sharing mesh of storage, data and applications for EOSC - aims to create an interoperable federation of data and higher-level services to enable friction-free collaboration between European researchers. CS3MESH4EOSC will connect locally and individually provided services, and scale them up at the European level and beyond, with the promise of reaching critical mass and brand recognition among European scientists that are not usually engaged with specialist eInfrastructures.
I told Carina I would look outwards as well. What are we keeping an eye on?
Watch out for the book factory. Sorry, the publishing industry.
The publishing industry is going to “help” the sector look after its research data.
Like, you know, they did with copyright in publications. Not only did that industry work out how to take over copyright in research works, it successfully moved from selling us hard-copy resources that we could keep in our own libraries to charging an annual rent on the literature - getting to the point where they can argue that they are essential to maintaining the scholarly record and MUST be involved in the publishing process, even when the (sometimes dubious, patchy) quality checks are performed by us, who created the literature.
It’s up to research institutions whether this story repeats with research data - remember who you’re dealing with when you sign those contracts!
In the 2010s the Australian National Data Service (ANDS) funded investment in metadata stores; one of these was the ReDBox research data management platform, which is alive and well and being sustained by QCIF with a subscription maintenance service. But ANDS didn’t fund the development of research data repositories.
The work I’ve talked about here was all done with the UTS team.
This is the second in a series of posts about my new Library Map. You probably should read the first post if you're interested in why I made the map and why it maps the particular things that it does. I expected this to be a two part series but it looks like I might make a third post about automation. The first post was about why I made the map. This one is about how.
The map is built with a stack of (roughly in order):
Since I primarily wanted to map things about library services rather than individual library buildings, the first thing I looked for was geodata boundary files. In Australia public libraries are usually run by local government, so the best place to start was with local government boundaries.
This is reasonably straightforward to get - either directly from data.gov.au or one of the state equivalents, or more typically by starting there and eventually getting to the website of the state department that deals with geodata. Usually the relevant file is provided as Shapefile, which is not exactly what we need, but is a vector format, which is a good start. I gradually added each state and data about it before moving on to the next one, but the process would basically have been the same even if I'd had all of the relevant files at the same time. There are two slight oddities at this point that may (or may not 😂) be of interest.
The first is that more or less alone of all jurisdictions, Queensland provides local government (LGA) boundaries for coastal municipalities with large blocks covering the coastal waters and any islands. Other states draw boundaries around outlying islands and include the island — as an island — with the LGA that it is part of (if it's not "unincorporated", which is often the case in Victoria for example). As a result, the national map looks a bit odd when you get to Queensland, because the overlay bulges out slightly away from the coast. I'm not sure whether this is something to do with the LGA jurisdictions in Queensland, perhaps due to the Great Barrier Reef, or whether their cartography team just couldn't be bothered drawing lines around every little island.
Secondly, when I got to Western Australia I discovered two things:
I hadn't really considered including overseas territories, but since they were right there in the file, I figured I may as well. Later this led to a question about why Norfolk Island was missing, so I hunted around and found a Shapefile for overseas territories, which also included Cocos and Christmas Islands.
Shapefiles are a pretty standard format, but I wanted to use leafletjs, and for that we need the data to be in JSON format. I also needed to both stitch together all the different state LGA files, and merge boundaries where local councils have formed regional library services. This seems to be more common in Victoria (which has Regional Library Corporations) than other states, but it was required in Victoria, New South Wales, and Western Australia. Lastly, it turns out there are significant parts of Australia that are not actually covered by any local government at all. Some of these areas are the confusingly named national parks that are actually governed directly by states. Others are simply 'unincorporated' - the two largest areas being the Unincorporated Far West Region of New South Wales (slightly larger than Hungary), and the Pastoral Unincorporated Area that covers almost 60% of the landmass of South Australia (slightly smaller than France).
I had no idea these two enormous areas of Australia had this special status. There's also a pretty large section of the south of the Northern Territory that contains no libraries at all, and hence has no library service. If you're wondering why there is a large section of inland Australia with no overlays on the Library Map, now you know.
So, anyway, I had to munge all these files — mostly Shape but also GeoJSON — and turn them into a single GeoJSON file. I've subsequently discovered mapshaper which I might have used for this, but I didn't know about it at the time, so I used QGIS. I find the number of possibilities presented by QGIS quite overwhelming, but there's no doubt it's a powerful tool for manipulating GIS data. I added each Shapefile as a layer, merged local government areas that needed to be merged, either deleted or dissolved (into the surrounding area) the unincorporated areas, and then merged the layers. Finally, I exported the new merged layer as GeoJSON, which is exactly what it sounds like: ordinary JSON, for geodata.
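The final "merge the layers" step that QGIS performed can be sketched in a few lines of plain Python, for anyone who prefers scripting it. This is a toy illustration with made-up feature names, not the actual LGA data:

```python
import json

def merge_feature_collections(*collections):
    """Combine several GeoJSON FeatureCollections into one.

    In practice each state file would be loaded with json.load()
    after converting the Shapefile to GeoJSON.
    """
    merged = {"type": "FeatureCollection", "features": []}
    for fc in collections:
        merged["features"].extend(fc["features"])
    return merged

# two tiny stand-in "state" files (geometry omitted for brevity)
nsw = {"type": "FeatureCollection", "features": [
    {"type": "Feature", "properties": {"name": "Far West"}, "geometry": None}]}
qld = {"type": "FeatureCollection", "features": [
    {"type": "Feature", "properties": {"name": "Bulloo Shire"}, "geometry": None}]}

merged = merge_feature_collections(nsw, qld)
print(len(merged["features"]))  # 2
print(json.dumps(merged)[:40])  # starts like any other FeatureCollection
```

Dissolving shared boundaries into a single polygon is the part that genuinely needs a GIS tool (or mapshaper); simple concatenation like this only covers the stitching step.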
At this point I had boundaries, but not other data. I mean, this is not entirely true, because I needed information about library services in order to know which LGAs collectively operate a single library service, but in terms of the files, all I had was a polygon and a name for each area. I also had a bunch of location data for the actual library branches - in a variety of formats originally, but ultimately in comma separated values (CSV) format - and a CSV file with information about each library service. The question at this point was how to associate the information I was mapping with each area. There was no way I was going to manually update 400+ rows in QGIS. Luckily, CSV and JSON are two of the most common open file formats, and they're basically just text.
I'd had a similar problem in a previous, abandoned mapping project, and had a pretty scrappy Python script lying around. With a bit more Python experience behind me, I was able to make it more flexible and simpler. If we match on the name of the library service, it's fairly straightforward to add properties to each GeoJSON feature (the features being each library service boundary area, and the properties being metadata about that feature). This works because the value of properties within each feature is itself simply a JSON object:
{"type": "FeatureCollection",
"name": "library_services",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:EPSG::3857" } },
"features":
[{ "type": "Feature", "properties" : {"name": "Bulloo Shire"},
"geometry": { "type": "MultiPolygon", "coordinates": [ [ [ [143.78691062,-28.99912088],[143.78483624,-28.99912073] ... ]]}
The Python script uses Python's inbuilt json and csv modules to read both the GeoJSON and the CSV file, then basically merge the data. I won't re-publish the whole thing, but the guts of it is:
# for each geojson feature, if a field in the json matches
# a field in the csv, add new properties to the json
for feature in json_data['features']:
    with open(csv_file, newline='') as f:
        # use DictReader so we can use the header names
        reader = csv.DictReader(f)
        for row in reader:
            # look for a match
            if row[csv_match] == feature['properties'][geojson_match]:
                # create new properties in the geojson
                for k in row:
                    feature['properties'][k] = row[k]
The whole thing is fewer than 40 lines long. This saved me heaps of time, but as you'll discover in my future post on automation, I later worked out how to automate the whole process every time the CSV file is updated!
GeoJSON is pretty cool — it's specifically designed for web applications to read and write GIS files in a native web format. Unfortunately, GeoJSON can also get very big, especially with a project like mine where there are lots of boundaries over a large area. The final file was about 130MB — far too big for anyone to reasonably wait for it to load in their browser (and Chrome just refused to load it altogether). Because of the way I originally wrote the Python script, it actually became nearly three times the size, because I put in a two-space indent out of habit. This created literally hundreds of megabytes of empty spaces. "Pretty printing" JSON is helpful if a human needs to read it, but rather unhelpful if you want to keep the file size down.
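The effect of pretty-printing is easy to demonstrate. This toy sketch (not the actual map data) compares an indented dump with a fully compact one:

```python
import json

# a feature-like structure with a long coordinate list,
# standing in for real boundary data
data = {"type": "Feature",
        "geometry": {"type": "LineString",
                     "coordinates": [[x / 100, -x / 100] for x in range(1000)]}}

pretty = json.dumps(data, indent=2)                # what my script originally produced
compact = json.dumps(data, separators=(',', ':'))  # no whitespace at all

print(len(pretty), len(compact))
# the indented version is several times larger, purely from whitespace
```

For machine-consumed files, `json.dumps(data, separators=(',', ':'))` is the safe default; save the `indent` argument for files a human needs to read.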
Enter TopoJSON. To be honest I don't really understand the mathematics behind it, but TopoJSON allows you to represent the same information as GeoJSON in a much, much smaller file. I reduced a 362MB GeoJSON file (admittedly, about 200MB being blank spaces) to 2.6MB simply by converting it to TopoJSON! By "quantising" it (essentially, making it less accurate), the file size can be reduced even further, bringing the current file to about 2.2MB - definitely small enough to load in a browser without too much of a wait, albeit not lightning fast.
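As I understand it, part of the trick is quantisation: snapping coordinates onto an integer grid and then delta-encoding them, so most points cost a couple of tiny integers instead of two long floats. This is my own rough illustration of the idea, not the actual TopoJSON algorithm:

```python
def quantize(coords, q=10000):
    """Snap (lon, lat) pairs to a q x q integer grid, then delta-encode.

    A sketch of the concept only; real TopoJSON computes the grid from
    the data's bounding box and shares arcs between polygons too.
    """
    # scale factors assuming lon in [-180, 180] and lat in [-90, 90]
    kx, ky = q / 360.0, q / 180.0
    ints = [(round((lon + 180) * kx), round((lat + 90) * ky))
            for lon, lat in coords]
    # delta-encode: store each point as its difference from the previous one
    deltas, px, py = [], 0, 0
    for x, y in ints:
        deltas.append((x - px, y - py))
        px, py = x, y
    return deltas

line = [(143.78691062, -28.99912088), (143.78483624, -28.99912073),
        (143.78201000, -28.99890000)]
print(quantize(line))
# after the first point, neighbouring vertices collapse to tiny deltas
```

Neighbouring boundary vertices are extremely close together, so after delta-encoding almost every number in the file is small, which is exactly what serialises (and gzips) well.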
At this point we're ready to start putting together the website to display the map. For this I used plain, vanilla HTML, CSS, and JavaScript. The web is awash with projects, frameworks and blog posts explaining how to use them to create your SPA (Single Page App)™️, but we really don't need any of that. The leaflet docs have a pretty good example of a minimal project, and my map is really not much more complex than that.
Something that did stump me for a while was how to bring the TopoJSON and CSV files into the JavaScript file as variables. I'm a self-taught JavaScript coder, and I learned it back to front: initially as a backend scripting language (i.e. nodejs) and then as the front-end browser scripting language it was originally made to be. So sometimes something a front-end developer would consider pretty basic ("How do I import a text file into my JavaScript and assign it to a variable?") takes me a while to work out. Initially I just opened the files in a text editor, copy-pasted the contents between two quote marks, made that the value of a JavaScript variable, and saved the whole thing as a .js file. But it was obvious even to me that couldn't possibly be the correct way to do it, even though it worked. In nodejs I would use fs.readFile(), but the only thing that looked vaguely similar for front-end JavaScript was FileReader - which is for reading files on a client, not a server. Finally I did a bit of research and found that the answer is to forget that the file is sitting right there in the same directory as all your JavaScript and HTML files, and just use AJAX as if it were a remote file. The modern way to do this is with fetch, so instead of doing this:
// index.html
<script src="./boundaries.js" type="text/javascript"></script>
<script src="./branchesCsv.js" type="text/javascript"></script>
<script src="./ikcCsv.js" type="text/javascript"></script>
<script src="./mechanics.js" type="text/javascript"></script>
<script src="./nslaBranches.js" type="text/javascript"></script>
<script src="./load-map.js" type="text/javascript"></script>
// boundaries.js
const boundaries = `{"contents": "gigantic JSON string"}`
// branchesCsv.js
const branchesCsv = `lat,lng,town,address,phone
-35.5574374,138.6107874,Victor Harbor Public Library Service, 1 Bay Road, 08 8551 0730
... etc`
// ikcCsv.js
const ikcCsv = `lat,lng,town,address,phone
-10.159918,142.166344,Badu Island Indigenous Knowledge Centre,Nona Street ,07 4083 2100
...etc`
// mechanics.js
const mechanics = `lat,lng,town,address,phone
-37.562362,143.858541,Ballaarat Mechanics Institute,117 Sturt Street,03 5331 3042
..etc`
// nslaBranches.js
const nslaBranches = `lat,lng,town,address,phone
-37.809815,144.96513,State Library of Victoria,"328 Swanston Street, Melbourne",03 8664 7000
... etc`
// load-map.js
// boundaries and the other constants are now globals
const loanPeriod = new L.TopoJSON(boundaries, options)
...we do this:
// index.html
<script src="./load-map.js" type="text/javascript"></script>
// load-map.js
const boundaries = fetch('data/boundaries.topo.json')
.then( response => response.json())
const branchesCsv = fetch('data/public_library_locations.csv')
.then( response => response.text());
const ikcCsv = fetch('data/indigenous_knowledge_centre_locations.csv')
.then( response => response.text());
const mechanics = fetch('data/mechanics_institute_locations.csv')
.then( response => response.text());
const nslaBranches = fetch('data/nsla_library_locations.csv')
.then( response => response.text());
// fetch returns a promise so we have to let them all 'settle' before we can use the returned value
Promise.all([boundaries, branchesCsv, ikcCsv, mechanics, nslaBranches])
.then( data => {
// data is an array with the settled values of the fetch() promises
const loanPeriod = new L.TopoJSON(data[0], options)
})
In the code this doesn't necessarily look much simpler, but in terms of workflow it's a huge improvement that cuts out manually copy-pasting every time a CSV or TopoJSON file is updated, and reduces duplication and the total number of files.
So now the site consists of an index.html file to display the map, the load-map.js script, and the CSV and TopoJSON data files.

Finally it's time to actually put all of this stuff into a map using Leaflet. This is a really great JavaScript library, with pretty good documentation. Leaflet allows us to plot shapes onto a map, and use JavaScript to make them interactive - including adding popups, zooming to features when they're clicked, and adding interactive overlays.
I won't try to replicate the Leaflet docs here and explain the exact steps to making my map, but I do want to highlight how two Leaflet plugins really helped with making the map work nicely. Leaflet has a fairly strong plugin collection, and they allow the base library to be fairly lightweight whilst the entire system is still quite flexible and fully featured.
I knew from the beginning it would require the whole library community to keep the map up to date over time. There are hundreds of library services across Australia, and they don't set their rules or their procurement decisions in stone forever. So it needed to be relatively simple to update the data as it changes. As we've discussed, GeoJSON also takes up a lot of space. Ideally, I could store as much data as possible in CSV files, and use them directly as the data feeding the map. Turns out there's a plugin for that - Leaflet.geoCSV. This allows us to load CSV files directly (for library building locations), and they're converted to GeoJSON on the fly. Since CSV files are much smaller than the equivalent data in JSON, this is not only easier to maintain, but also loads faster.
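Conceptually, the conversion Leaflet.geoCSV performs is straightforward: each CSV row becomes a GeoJSON point feature, with the non-coordinate columns carried along as properties. A rough sketch of the idea in Python (the plugin itself is JavaScript, and its real output includes more bookkeeping):

```python
import csv
import io

def csv_to_geojson(csv_text):
    """Turn rows of lat,lng,... into a GeoJSON FeatureCollection of points."""
    features = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lat, lng = float(row.pop('lat')), float(row.pop('lng'))
        features.append({
            "type": "Feature",
            # note GeoJSON coordinate order is [longitude, latitude]
            "geometry": {"type": "Point", "coordinates": [lng, lat]},
            "properties": row,  # remaining columns become feature properties
        })
    return {"type": "FeatureCollection", "features": features}

sample = """lat,lng,town
-37.809815,144.96513,State Library of Victoria"""
fc = csv_to_geojson(sample)
print(fc["features"][0]["geometry"]["coordinates"])  # [144.96513, -37.809815]
```

The [lng, lat] ordering trips a lot of people up, since Leaflet's own APIs (and most humans) say "lat, lng"; the plugin handles that flip for you.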
The second plugin that really helped was Leaflet.pattern. The problem this helped me solve was how to show both the fines layer and the loan period layer at the same time. Typically for a choropleth map, different colours or shades indicate certain values. But if you add a second overlay on top of the first one, the colours no longer necessarily make much sense, and combinations can be difficult or impossible to discern. Thinking about this, I figured that if I could make one layer semi-transparent colours, and the second layer patterns like differently angled stripes or dots, that might do the trick. Leaflet.pattern to the rescue! After some alpha-testing by my go-to volunteer Quality Assurance tester, I worked out how to make the layers always appear in the same order, regardless of which order they were added or removed, making the combination always look consistent:
Once all of that's complete, we can load the map. But there's a problem: all we have is a bunch of vector points and lines; there's no underlying geography. For this we need a map tile service. We could use one of several options provided by OpenStreetMap, but I ended up using the commercial Mapbox service on a free plan (or at least, it will be free as long as thousands of people don't suddenly start using the map all at the same time). Their dark and light map styles really suited what I was trying to do, with minimal detail in terms of the underlying geography, but with roads and towns marked at the appropriate zoom level.
So that's it! It took a while to work it all out, but most of the complexity is in getting the data together rather than displaying the map. Once I had that done (though there is still a fair bit of information missing), I was able to pay more attention to maintaining the map into the future. That led me to look into some options for automating the merging of data from the library services CSV file (when it's updated) into the TopoJSON file, and also automatically refreshing the data on the actual map when the GitHub repository is updated. In my next post I'll explain how that works. While you're waiting for that, you can help me find missing data and make the map more accurate 😀.
I was recently looking for managed Solr “software-as-a-service” (SaaS) options, and had trouble figuring out what was out there, so I figured I’d share what I learned. My knowledge here is far from exhaustive, and I only looked seriously at one of the options I found.
The only managed Solr options I found were: WebSolr; SearchStax; and OpenSolr.
Of these, I think WebSolr and SearchStax are the better known. I couldn’t find anyone with experience with OpenSolr, which is perhaps newer.
Of the three, SearchStax is the only one I actually took for a test drive, so I will have the most to say about it.
We run a fairly small-scale app, whose infrastructure is currently four self-managed AWS EC2 instances, running respectively: 1) a Rails web app; 2) background workers for the Rails web app; 3) Postgres; and 4) Solr.

Oh yeah, there’s also a Redis running on one of those servers (on #3 with Postgres or #4 with Solr, I forget).
Currently we manage this all ourselves, right on the EC2. But we’re looking to move as much as we can into “managed” servers. Perhaps we’ll move to Heroku. Perhaps we’ll use hatchbox. Or if we do stay on AWS resources we manage directly, we’d look at things like using an AWS RDS Postgres instead of installing it on an EC2 ourselves, an AWS ElastiCache for Redis, maybe look into Elastic Beanstalk, etc.
But no matter what we do, we need a Solr, and we’d like it to be managed. Hatchbox has no special Solr support, and AWS doesn’t have a Solr service. Heroku does have a Solr add-on, but you can also use any Solr with it; we’ll get to that later.
Our current Solr use is pretty small scale. We don’t run “SolrCloud mode”, just legacy ordinary Solr. We only have around 10,000 documents in there (tiny for Solr), and our index size is only 70MB. Our traffic is pretty low. When I tried to figure out how low, it turned out we don’t have sufficient logging turned on to answer that specifically, but using proxy metrics to guess, I’d say 20K-40K requests a day, query as well as add.
This is a pretty small Solr installation, although it is used centrally for the primary functions of the (fairly low-traffic) app. It currently runs on an EC2 t3a.small, which is a “burstable” EC2 type with only 2GB of RAM. It does have two vCPUs (that is, one core with ‘hyperthreading’). The t3a.small EC2 instance only costs $14/month at the on-demand price! We know we’ll be paying more for managed Solr, but we want to get out of the business of managing servers; we no longer really have the staff for it.
WebSolr is the only managed Solr currently listed as a Heroku add-on. It is also available as a managed Solr independent of heroku.
The pricing in the heroku plans vs the independent plans seems about the same. As a heroku add-on there is a $20 “staging” plan that doesn’t exist in the independent plans. (Unlike some other heroku add-ons, no time-limited free plan is available for WebSolr). But once we go up from there, the plans seem to line up.
Starting at: $59/month for:
Next level up is $189/month for:
As you can see, WebSolr has their plans metered by usage.
$59/month is around the price range we were hoping for (we’ll need two, one for staging one for production). Our small solr is well under 1 million documents and ~1GB storage, and we do only use one index at present. However, the 40K requests/day limit I’m not sure about, even if we fit under it, we might be pushing up against it.
And the “concurrent request” limit simply isn’t one I’m even used to thinking about; on a self-managed Solr it hasn’t really come up. What does “concurrent” mean exactly in this case, and how is it measured? With 10 puma web workers and sometimes a possibly multi-threaded batch index going on, could we exceed a limit of 4? Seems plausible. And what happens when the limit is exceeded? Your Solr request results in an HTTP 429 error!
Do I need to now write the app to rescue those gracefully, or use connection pooling to try to avoid them, or something? Having to rewrite the way our app functions for a particular managed solr is the last thing we want to do. (Although it’s not entirely clear if those connection limits exist on the non-heroku-plugin plans, I suspect they do?).
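For what it's worth, "rescuing gracefully" usually means retrying with exponential backoff whenever a 429 comes back. A minimal sketch of that pattern, with a fake client standing in for Solr (the helper name and fake client are mine, not part of any library or of WebSolr):

```python
import time
from types import SimpleNamespace

def with_retries(request_fn, max_tries=4, base_delay=0.5):
    """Call request_fn(); on an HTTP 429 response, wait and retry.

    request_fn is any zero-argument callable returning an object with a
    .status_code attribute (e.g. a requests call wrapped in a lambda).
    """
    for attempt in range(max_tries):
        response = request_fn()
        if response.status_code != 429:
            return response
        # exponential backoff: 0.5s, 1s, 2s, ...
        time.sleep(base_delay * 2 ** attempt)
    return response  # still rate-limited; let the caller decide what to do

# a fake "Solr" that rejects the first two calls, for illustration
class FakeSolr:
    def __init__(self):
        self.calls = 0
    def __call__(self):
        self.calls += 1
        return SimpleNamespace(status_code=429 if self.calls < 3 else 200)

solr = FakeSolr()
print(with_retries(solr, base_delay=0.01).status_code)  # 200
```

Whether this is acceptable depends on the request: retrying a batch index job is fine, but adding seconds of backoff to a user-facing search request is exactly the kind of app rewrite the paragraph above is worried about.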
And in general, I’m not thrilled with the way the pricing works here, and the price points. I am positive for a lot of (eg) heroku customers an additional $189*2=$378/month is peanuts not even worth accounting for, but for us, a small non-profit whose app’s traffic does not scale with revenue, that starts to be real money.
It is not clear to me if WebSolr installations (at “standard” plans) are set up in “SolrCloud mode” or not; I’m not sure what API’s exist for uploading your custom schema.xml (which we’d need to do), or if they expect you to do this only manually through a web UI (that would not be good); I’m not sure if you can upload custom solrconfig.xml settings (this may be running on a shared solr instance with standard solrconfig.xml?).
Basically, all of this made WebSolr not the first one we looked at.
I don’t think so.
In some cases, you can get a better price from a Heroku plug-in than you could get from the same vendor off heroku, or from competitors. But that doesn’t seem to be the case here, and other than that, does it matter?
Well, all heroku plug-ins are required to bill you by-the-minute, which is nice but not really crucial, other forms of billing could also be okay at the right price.
With a heroku add-on, your billing is combined into one heroku invoice, no need to give a credit card to anyone else, and it can be tracked using heroku tools. Which is certainly convenient and a plus, but not essential if the best tool for the job is not a heroku add-on.
And as a heroku add-on, WebSolr provides a WEBSOLR_URL heroku config/env variable automatically to code running on heroku. OK, that’s kind of nice, but it’s not a big deal to set a SOLR_URL heroku config manually, referencing the appropriate address. I suppose as a heroku add-on WebSolr also takes care of securing and authenticating connections between the heroku dynos and the solr, so we need to make sure we have a reasonable way to do this from any alternative.
SearchStax’s pricing tiers are not based on metering usage. There are no limits based on requests/day or concurrent connections. SearchStax runs on dedicated-to-you individual Solr instances (I would guess running on dedicated-to-you individual (eg) EC2, but I’m not sure). Instead the pricing is based on size of host running Solr.
You can choose to run on instances deployed to AWS, Google Cloud, or Azure. We’ll be sticking to AWS (the others, I think, have a slight price premium).
While SearchStax gives you a pricing page that looks like “new-way-of-doing-things” transparent pricing, in fact there isn’t really enough info on the public pages to see all the price points and understand what you’re getting; there is still a kind of “talk to a salesperson who has a price sheet” thing going on.
What I think I have figured out from talking to a salesperson and support is that the “Silver” plans (“Starting at $19 a month”, although we’ll say more about that in a bit) are basically: we give you a Solr, we don’t provide any technical support for Solr.
While the “Gold” plans “from $549/month” are actually about paying for Solr consultants to set up and tune your schema/index etc. That is not something we need, and $549+/month is way more than the price range we are looking for.
While the SearchStax pricing/plan pages kind of imply the “Silver” plan is not suitable for production, in fact I think there is no real reason not to use it for production, and the salesperson I talked to confirmed that; it just reaffirms that you are on your own managing the Solr configuration/setup. That’s fine; that’s what we want. We just don’t want to manage the OS, or set up the Solr, or upgrade it, etc. The Silver plans have no SLA, but as far as I can tell their uptime is just fine. The Silver plans only guarantee a 72-hour support response time, but for the couple of support tickets I filed asking questions while under a free 14-day trial (oh yeah, that’s available), I got prompt same-day responses that were knowledgeable and actually answered my questions.
So a “silver” plan is what we are interested in, but the pricing is not actually transparent.
$19/month is for the smallest instance available, and IF you prepay/contract for a year. They call that small instance an NDN1 and it has 1GB of RAM and 8GB of storage. If you pay-as-you-go instead of contracting for a year, that already jumps to $40/month. (That price is available on the trial page).
When you are paying-as-you-go, you are actually billed per-day, which might not be as nice as heroku’s per-minute, but it’s pretty okay, and useful if you need to bring up a temporary solr instance as part of a migration/upgrade or something like that.
The next step up is an “NDN2”, which has 2GB of RAM and 16GB of storage, at ~$80/month pay-as-you-go (you can find that price if you sign up for a free trial). The price for an annual contract is, similar to the NDN1’s 50% discount, $40/month. That price I got only from a salesperson, so I don’t know if it’s always stable.
It only occurs to me now that they don’t tell you how many CPUs are available.
I’m not sure if I can fit our Solr in the 1G NDN1, but I am sure I can fit it in the 2G NDN2 with some headroom, so I didn’t look at plans above that — but they are available, still under “silver”, with prices going up accordingly.
All SearchStax solr instances run in “SolrCloud” mode — these NDN1 and NDN2 ones we’re looking at just run one node with one zookeeper, but still in cloud mode. There are also “silver” plans available with more than one node in a “high availability” configuration, but the prices start going up steeply, and we weren’t really interested in that.
Because it’s SolrCloud mode though, you can use the standard Solr API for uploading your configuration. It’s just Solr! So no arbitrary usage limits, no features disabled.
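For example, a custom configset can be uploaded by POSTing a zip of the conf directory to Solr's standard Configset API endpoint, /solr/admin/configs?action=UPLOAD. A sketch using only the Python standard library; the host, configset name, and credentials here are placeholders, not real SearchStax values:

```python
import base64
import urllib.request

def build_configset_upload(base_url, name, zip_bytes, user, password):
    """Build the HTTP request to upload a configset zip via Solr's Configset API."""
    url = f"{base_url}/solr/admin/configs?action=UPLOAD&name={name}"
    # HTTP Basic Auth, which SearchStax's Solr deployments support
    credentials = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=zip_bytes,  # a zip of the conf/ directory (schema, solrconfig.xml, ...)
        method="POST",
        headers={
            "Content-Type": "application/octet-stream",
            "Authorization": f"Basic {credentials}",
        },
    )

req = build_configset_upload(
    "https://example.searchstax.com", "my-config", b"...zip bytes...",
    "solr-user", "hunter2")
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would actually send it
```

Because this is the standard SolrCloud API rather than a vendor one, the same code works against any SolrCloud instance, which is exactly the portability argument being made above.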
The SearchStax web console seems competently implemented; it lets you create and delete individual Solr “deployments”, manage accounts that can log in to the console (on the “silver” plan you only get two, or can pay $10/month/account for more; nah), and set up auth for a Solr deployment. They support IP-based authentication or HTTP Basic Auth to the Solr (with no limit on how many Solr Basic Auth accounts you can create). HTTP Basic Auth is great for us, because trying to do IP-based auth from somewhere like heroku isn’t going to work. All Solrs are available over HTTPS/SSL. Great!
SearchStax also has their own proprietary HTTP API that lets you do most anything, including creating/destroying deployments, managing Solr basic auth users, basically everything. There is some API that duplicates the Solr Cloud API for adding configsets, I don’t think there’s a good reason to use it instead of standard SolrCloud API, although their docs try to point you to it. There’s even some kind of webhooks for alerts! (which I haven’t really explored).
Basically, SearchStax just seems to be a sane and rational managed Solr option; it has all the features you’d expect/need/want from such a service. The prices seem reasonable-ish, and generally more affordable than WebSolr, especially if you stay in “silver” and “one node”.
At present, we plan to move forward with it.
I have the least to say about this one, having spent the least time with it after spending time with SearchStax and seeing that it met our needs. But I wanted to make sure to mention it, because it’s the only other managed Solr I am even aware of. Definitely curious to hear from any users.
The prices seem pretty decent, perhaps even cheaper than SearchStax, although it’s unclear to me what you get. Does “0 Solr Clusters” mean that it’s not SolrCloud mode? After seeing how useful SolrCloud APIs are for management (and having this confirmed by many of my peers in other libraries/museums/archives who choose to run SolrCloud), I wouldn’t want to do without it. So I guess that pushes us to “executive” tier? Which at $50/month (billed yearly!) is still just fine, around the same as SearchStax.
But they do limit you to one solr index; I prefer SearchStax’s model of just giving you certain host resources and letting you do what you want with them. It does say “shared infrastructure”.
Might be worth investigating, curious to hear more from anyone who did.
We’re using Solr mostly because that’s what various collaborative and open source projects in the library/museum/archive world have been doing for years, since before ElasticSearch even existed. So there are various open source libraries and toolsets available that we’re using.
But for whatever reason, there seem to be SO MANY MORE managed ElasticSearch SaaS available. At possibly much cheaper price points. Is this because the ElasticSearch market is just bigger? Or is ElasticSearch easier/cheaper to run in a SaaS environment? Or what? I don’t know.
But there’s the controversial AWS ElasticSearch Service; there’s the Elastic Cloud “from the creators of ElasticSearch”. On Heroku, which lists one Solr add-on, there are THREE ElasticSearch add-ons listed: ElasticCloud, Bonsai ElasticSearch, and SearchBox ElasticSearch.
If you just google “managed ElasticSearch” you immediately see 3 or 4 other names.
I don’t know enough about ElasticSearch to evaluate them. They seem, on first glance at pricing pages, to be more affordable, but I may not know what I’m comparing and may be looking at tiers that aren’t actually usable for anything or that have hidden fees.
But I know there are definitely many more managed ElasticSearch SaaS than Solr.
I think ElasticSearch probably does everything our app needs. If I were to start from scratch, I would definitely consider ElasticSearch over Solr just based on how many more SaaS options there are. While it would require some knowledge-building (I have developed a lot of knowledge of Solr and zero of ElasticSearch) and rewriting some parts of our stack, I might still consider switching to ES in the future; we don’t do anything too too complicated with Solr that would be too too hard to switch to ES, probably.
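To make that concrete, here’s a minimal sketch of the same keyword search expressed against each engine (the index/collection names are hypothetical, and both calls use only the engines’ core HTTP query interfaces), which is why a basic search layer ports fairly directly:

```python
import json
import urllib.parse

def solr_query_url(base, collection, q, rows=10):
    """Standard Solr select-handler URL for a simple keyword query."""
    params = urllib.parse.urlencode({"q": q, "rows": rows, "wt": "json"})
    return f"{base}/solr/{collection}/select?{params}"

def es_query(index, q, size=10):
    """Roughly equivalent ElasticSearch request: a path plus a JSON body
    to POST to the _search endpoint."""
    body = {"query": {"query_string": {"query": q}}, "size": size}
    return f"/{index}/_search", json.dumps(body)
```

The shape of the calls differs (query parameters vs. a JSON body), but the information in them is the same, which is most of what a rewrite would have to translate.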
2021-01-27T18:00:59+00:00 jrochkind

Digital Library Federation: Three New NDSA Members
https://www.diglib.org/three-new-ndsa-members/

In January 2021, the NDSA Coordinating Committee voted unanimously to welcome three new members. Each of these members brings a host of skills and experience to our group. Please help us to welcome:
Each organization has participants in one or more of the various NDSA interest and working groups, so keep an eye out for them on your calls and be sure to give them a shout out. Please join me in welcoming our new members. A complete list of NDSA members is on our website.
In the future, NDSA is moving to a quarterly process for reviewing membership applications. Announcements of new members will be scheduled accordingly.
~ Nathan Tallman, Vice Chair of the NDSA Coordinating Committee
The post Three New NDSA Members appeared first on DLF.
2021-01-27T17:18:58+00:00 Nathan Tallman

DuraSpace News: Fedora Migration Paths and Tools Project Update: January 2021
https://duraspace.org/fedora-migration-paths-and-tools-project-update-january-2021/

This is the fourth in a series of monthly updates on the Fedora Migration Paths and Tools project – please see last month’s post for a summary of the work completed up to that point. This project has been generously funded by the IMLS.
The grant team has been focused on completing an initial build of a validation utility, which will allow implementers to compare their migrated content with the original Fedora 3.x source material to verify that everything has been migrated successfully. A testable version of this tool is expected to be completed in the coming weeks, at which point the University of Virginia pilot team will test and provide feedback on the utility.
The University of Virginia team has completed a full migration of their legacy Fedora 3.2.1 repository. They also recently contributed improvements to the Fedora AWS Deployer which have been merged into the codebase. The team is now awaiting a testable version of the validation utility so they can validate their migrated content before moving on to testing this content in a newly installed Fedora 6.0 instance.
The Whitman College pilot team has completed their metadata remediation and mapping work. Their process and lessons learned will be shared in a presentation at the upcoming Code4Lib conference. Meanwhile, Islandora 8 is currently being tested with an Alpha build of Fedora 6.0, which will be used as the basis for migration testing for the Whitman College pilot. Work is currently being done in parallel to install Islandora 8 using ISLE and complete work on a new theme. Due to the impending end-of-life date of Drupal 8 the team decided to proceed directly to Drupal 9, and the theme needed to be updated accordingly. Fortunately, the transition from Drupal 8 to 9 is relatively minor.
Next month we plan to use the validation utility to validate the University of Virginia migration before moving on to testing the migrated data in Fedora 6.0 and updating the application as needed. For the Whitman College pilot, once the Islandora 8 with Fedora 6.0 installation is complete we will be able to run a series of test migrations and update the utilities and application as necessary in order to satisfy functional requirements.
Stay tuned for future updates!
The post Fedora Migration Paths and Tools Project Update: January 2021 appeared first on Duraspace.org.
2021-01-27T15:05:31+00:00 David Wilcox

Open Knowledge Foundation: Open Knowledge Justice Programme takes new step on its mission to ensure algorithms cause no harm
https://blog.okfn.org/2021/01/27/open-knowledge-justice-programme-takes-new-step-on-its-mission-to-ensure-algorithms-cause-no-harm/

Today we are proud to announce a new project for the Open Knowledge Justice Programme – strategic litigation. This might mean we will go to court to make sure public impact algorithms are used fairly, and cause no harm. But it will also include advocacy in the form of letters and negotiation.
The story so far
Last year, Open Knowledge Foundation made a commitment to apply our skills and network to the increasingly important topics of artificial intelligence (AI) and algorithms.
As a result, we launched the Open Knowledge Justice Programme in April 2020. Our mission is to ensure that public impact algorithms cause no harm.
Public impact algorithms have four key features:
We aim to make public impact algorithms more accountable by equipping legal professionals, including campaigners and activists, with the know-how and skills they need to challenge the effects of these technologies in their practice. We also work with those deploying public impact algorithms to raise awareness of the potential risks and build strategies for mitigating them. We’ve had some great feedback from our first trainees!
Why are we doing this?
Strategic litigation is more than just winning an individual case. Strategic litigation is ‘strategic’ because it plays a part in a larger movement for change. It does this by raising awareness of the issue, changing public debate, collaborating with others fighting for the same cause and, when we win (hopefully!) making the law fairer for everyone.
Our strategic litigation activities will be grounded in the principle of openness because public impact algorithms are overwhelmingly deployed opaquely. This means that the experts who would be able to unpick why and how AI and algorithms cause harm cannot do so, and the technology escapes scrutiny.
Vendors of the software say they can’t release the software code they use because it’s a trade secret. This proprietary knowledge, although used to justify decisions potentially significantly impacting people’s lives, remains out of our reach.
We’re not expecting all algorithms to be open. Nor do we think that would necessarily be useful.
But we do think it’s wrong that governments can purchase software and not be transparent around key points of accountability such as its objectives, an assessment of the risk it will cause harm and its accuracy.
Openness is one of our guiding principles in how we’ll work too. As far as we are able, we’ll share our cases for others to use, re-use and modify for their own legal actions, wherever they are in the world. We’ll share what works, and what doesn’t, and make learning resources to make achieving algorithmic justice through legal action more readily achievable.
We’re excited to announce our first case soon, so stay tuned! Sign up to our mailing list or follow the Open Knowledge Justice Programme on Twitter to receive updates.
2021-01-27T10:13:33+00:00 Meg Foulkes

David Rosenthal: ISP Monopolies
https://blog.dshr.org/2021/01/isp-monopolies.html

For at least the last three years (It Isn't About The Technology) I've been blogging about the malign effects of the way the FAANGs dominate the Web and the need for anti-trust action to mitigate them. Finally, with the recent lawsuits against Facebook and Google, some action may be in prospect. I'm planning a post on this topic. But when it comes to malign effects of monopoly I've been ignoring the other monopolists of the Internet, the telcos.

The USA never had "network neutrality" before it was "suspended". What the USA had was 3,000 ISPs. So if an ISP did something unfriendly to its customers, they could just stop paying the bad one, and sign up with a different ISP that wouldn't screw them. That effectively prevented bad behavior among ISPs. And if the customer couldn't find an ISP that wouldn't screw them, they could START ONE THEMSELVES. I know, because we did exactly that in the 1990s.

I was an early customer of The Little Garden. A SPARCstation, a SCSI disk and a modem sat on my window-ledge. The system dialed a local, and thus free, number and kept the call up 24/7, enabling me to register a domain and start running my own mail server. Years later I upgraded to DSL with Stanford as my ISP. As Gilmore points out, Stanford could do this under the same law:
Anyone could start an ISP because by law, everyone had tariffed access to the same telco infrastructure (dialup phone lines, and leased lines at 56 kbit/sec or 1.544 Mbit/sec or 45 Mbit/sec). You just called up the telco and ordered it, and they sent out techs and installed it. We did exactly that, plugged it into our modems and routers and bam, we were an ISP: "The Little Garden".
Later, DSL lines required installing equipment in telco central offices, at the far end of the wire that leads to your house. But the telcos were required by the FCC to allow competing companies to do that. Their central office buildings were 9/10th empty anyway, after they had replaced racks of mechanical relays with digital computers.

Gilmore explains how this competitive market was killed:
The telcos figured this out, and decided they'd rather be gatekeepers, instead of being the regulated monopoly that gets a fixed profit margin. Looking ahead, they formally asked the FCC to change its rule that telcos had to share their infrastructure with everybody -- but only for futuristic optical fibers. They whined that "FCC wants us to deploy fiber everywhere, but we won't, unless we get to own it and not share it with our competitors." As usual, the regulated monopoly was great at manipulating the public interest regulators. The FCC said, "Sure, keep your fibers unshared." This ruling never even mentioned the Internet, it is all about the physical infrastructure. If the physical stuff is wires, regulated telcos have to share it; if it's glass, they don't.

Leavitt makes a bigger point than Gilmore's:
The speed of dialup maxed out at 56 kbit/sec. DSL maxed out at a couple of megabits. Leased lines worked to 45 Mbit/sec but cost thousands of dollars per month. Anything over that speed required fiber, not wire, at typical distances. As demand for higher Internet speeds arose, any ISP who wanted to offer a faster connection couldn't just order one from the telco, because the telco fibers were now private and unshared. If you want a fiber-based Internet connection now, you can't buy it from anybody except the guys who own the fibers -- mostly the telcos. Most of the 3,000 ISPs could only offer slow Internet access, so everybody stopped paying them. The industry consolidated down to just one or a few businesses per region -- mostly the telcos themselves, plus the cable companies that had built their own local monopoly via city government contracts. Especially lucky regions had maybe one other competitor, like a Wireless ISP, or an electrical co-op that ran fibers on its infrastructure.
The ONLY reason the Internet exists as we know it (mass consumer access) was the regulatory loophole which permitted the ISP industry to flourish in the 1990s. The telcos realized their mistake, as John said, and made sure that there wasn't going to be a repeat of that, so with each generation (DSL, fiber), they made it more and more difficult to access their networks, with the result that John mentions: almost no choice, for consumers or businesses. Last office I rented, there was one choice of Internet provider: the local cable monopoly, which arbitrarily wanted to charge me much more ($85/mo) to connect my office than it did the apartments upstairs in the same building ($49). As is the case in most of that county, the only alternatives were a few buildings and complexes wired up by the two surviving local ISPs, and a relatively expensive WISP.

Gilmore concludes:
The telcos' elimination of fiber based competition, and nothing else, was the end of so-called "network neutrality". The rest was just activists, regulators and legislators blathering. There never was an /enforceable federal regulatory policy of network neutrality, so the FCC could hardly suspend it. If the FCC actually wanted US customers to have a choice of ISPs, they would rescind the FIBER RULE. And if advocates actually understood how only competition, not regulation, restrains predatory behavior, they would ask FCC for the fiber rule to be rescinded, so a small ISP company could rent the actual glass fiber that runs from the telco to (near or inside) your house, for the actual cost plus a regulated profit. Then customers could get high speed Internet from a variety of vendors at a variety of prices and terms. So far neither has happened.Leavitt shows the insane lengths we are resorting to in order to deliver a modicum of competition in the ISP market:
It's ridiculous that it is going to take sending tens of thousands of satellites into orbit to restore any semblance of competitiveness to the ISP market, when we've had a simple regulatory fix all along. It's not like the telco/cable monopolies suffered as a result of competition... in fact, it created the market they now monopolize. Imagine all the other opportunities for new markets that have been stifled by the lack of competition in the ISP market over the last two decades!

I have been, and still am, an exception to Gilmore's and Leavitt's experiences. Palo Alto owns its own utilities, a great reason to live there. In September 2001 Palo Alto's Fiber To The Home trial went live, and I was one of 67 citizens who got a 10Mbit/s bidirectional connection, with the city Utilities as our ISP. We all loved the price, the speed and the excellent customer service. The telcos got worried and threatened to sue the Utilities if it expanded the service. The City was on safe legal ground, but that is what they had thought previously when they lost a $21.5M lawsuit as part of the fallout from the Enron scandal. Enron's creditors claimed that the Utilities had violated their contract because they stopped paying Enron. The Utilities did so because Enron became unable to deliver them electricity.
Sonic has maintained a reputation as not only a company that delivers a reliable high-speed connection to its customers but also a company that stands by its ethics. Both Dane Jasper and Scott Doty have spoken up on numerous occasions to combat the ever-growing lack of privacy on the web. They have implemented policies that reflect this. In 2011, they reduced the amount of time that they store user data to just two weeks in the face of an ever-growing tide of legal requests for their users’ data. That same year, Sonic alongside Google fought a court order to hand over the email addresses of people who had contacted and corresponded with Tor developer and Wikileaks contributor Jacob Applebaum. When asked why, CEO Dane Jasper responded that it was “rather expensive, but the right thing to do.”

Leavitt explained in e-mail how Sonic's exception to Gilmore's argument came about:
Sonic has made a habit of doing the right thing, both for its customers and the larger world. It’s a conscientious company that delivers on what is promised and goes the extra mile for its subscribers.
Sonic is one of the few independent ISPs that's managed to survive the regulatory clampdown via stellar customer service and customers willing to go out of their way to support alternative providers, much like Cruzio in my home town of Santa Cruz. They cut some kind of reseller deal with AT&T back in 2015 that enabled them to offer fiber to a limited number of residents, and again, like Cruzio, are building out their own fiber network, but according to [this site], fiber through them is potentially available to only about 400,000 customers (in a state with about 13 million households and 1 million businesses); it also reports that they are the 8th largest ISP in the nation, despite being a highly regional provider with access available to only about 3 million households. This says everything about how monopolistic and consolidated the ISP market is, given the number of independent cable and telco companies that existed in previous decades, the remaining survivors of which are all undoubtedly offering ISP services.

Gilmore attempted to build a fiber ISP in his hometown, San Francisco:
I doubt Sonic's deal with AT&T was much more lucrative than the DSL deals Santa Cruz area ISPs were able to cut.
Our model was to run a fiber to about one person per block (what Cruzio calls a "champion") and teach them how to run and debug 1G Ethernet cables down the back fences to their neighbors, splitting the monthly costs. This would avoid most of the cost of city right-of-way crud at every house, which would let us and our champions fiber the city much more broadly and quickly. And would train a small army of citizens to own and manage their own infrastructure.

For unrelated reasons it didn't work out, but it left Gilmore with the conviction that, absent repeal of the FIBER rule, ISP-owned fiber is the way to go. Especially in rural areas this approach has been successful; a recent example was described by Jon Brodkin in Jared Mauch didn't have good broadband—so he built his own fiber ISP. Leavitt argues:
I'd like to see multiple infrastructure providers, both private for profit, and municipally sponsored non-profit public service agencies, all with open access networks; ideally, connecting would be as simple as it was back in the dial up days. I think we need multiple players to keep each other "honest". I do agree that a lot of the barriers to building out local fiber networks are regulatory and process, as John mentions. The big incumbent players have a tremendous advantage navigating this process, and the scale to absorb the overhead of dealing with them in conjunction with the capital outlays (which municipalities also have).

I think we all agree that "ideally, connecting would be as simple as it was back in the dial up days". How to make this happen? As Gilmore says, there are regulatory and process costs as well as the cost of pulling the fiber. So if switching away from a misbehaving ISP involves these costs there is going to be a significant barrier. It isn't going to be "as simple as it was back in the dial up days" when the customer could simply re-program their modem.
Learn how to power the product discovery experience with semantic vector search to eliminate false zero results and accelerate the path to purchase.
The post Cast a Smarter Net with Semantic Vector Search appeared first on Lucidworks.
2021-01-26T17:21:29+00:00 Garrett Schwegler

Digital Library Federation: Virtual 2020 NDSA Digital Preservation recordings available online!
https://www.diglib.org/virtual-2020-ndsa-digital-preservation-recordings-available-online/

Session recordings from the virtual 2020 NDSA Digital Preservation conference are now available on NDSA’s YouTube channel, as well as on Aviary. The full program from Digital Preservation 2020: Get Active with Digital Preservation, which took place online November 12, 2020, is free and open to the public.
NDSA is an affiliate of the Digital Library Federation (DLF) and the Council on Library and Information Resources (CLIR). Each year, NDSA’s annual Digital Preservation conference is held alongside the DLF Forum and acts as a crucial venue for intellectual exchange, community-building, development of good practices, and national agenda-setting for digital stewardship.
Enjoy,
Tricia Patterson; DigiPres 2020 Vice-Chair, 2021 Chair
The post Virtual 2020 NDSA Digital Preservation recordings available online! appeared first on DLF.
2021-01-26T16:55:25+00:00 kussmann

Terry Reese: MarcEdit 7.5.x/MacOS 3.5.x Timelines
https://blog.reeset.net/archives/2935

I sent this to the MarcEdit Listserv to provide info about my thoughts around timelines related to the beta and release. Here’s the info.
Dear All,
As we are getting close to Feb. 1 (when I’ll make the 7.5 beta build available for testing) – I wanted to provide information about the update process going forward.
Feb. 1:
March 2021
April 2021
May 2021
Let me know if you have questions.
2021-01-26T16:25:01+00:00 reeset

Jez Cope: A new font for the blog
https://erambler.co.uk/blog/new-font/

I’ve updated my blog theme to use the quasi-proportional fonts Iosevka Aile and Iosevka Etoile. I really like the aesthetic, as they look like fixed-width console fonts (I use the true fixed-width version of Iosevka in my terminal and text editor) but they’re actually proportional, which makes them easier to read.
https://typeof.net/Iosevka/
If I’m going to train an algorithm to read my weird & awful writing, I’m going to need a decent-sized training set to work with. And since one of the main things I want to do with it is to blog “by hand” it makes sense to focus on that type of material for training. In other words, I need to write out a bunch of blog posts on paper, scan them and transcribe them as ground truth. The added bonus of this plan is that after transcribing, I also end up with some digital text I can use as an actual post — multitasking!
So, by the time you read this, I will have already run it through a manual transcription process using Transkribus to add it to my training set, and copy-pasted it into emacs for posting. This is a fun little project because it means I can:
That’s it for now — I’ll keep you posted as the project unfolds.
Tee hee! I’m actually just enjoying the process of writing stuff by hand in long-form prose. It’ll be interesting to see how the accuracy turns out and if I need to be more careful about neatness. Will it be better or worse than the big but generic models used by Samsung Notes or OneNote? Maybe I should include some stylus-written text for comparison.
Einstein’s remarkable brain has an important lesson about balance for all of us in technology and machine learning.
The post Learning From Einstein’s Brain appeared first on Lucidworks.
2021-01-25T20:06:23+00:00 Ellen Leanse

Meredith Farkas: Making Customizable Interactive Tutorials with Google Forms
https://meredith.wolfwater.com/wordpress/2020/11/05/making-customizable-interactive-tutorials-with-google-forms/

In September, I gave a talk at Oregon State University’s Instruction Librarian Get-Together about the interactive tutorials I built at PCC last year that have been integral to our remote instructional strategy. I thought I’d share my slides and notes here in case others are inspired by what I did and to share the amazing assessment data I recently received about the impact of these tutorials that I included in this blog post. You can click on any of the slides to see them larger and you can also view the original slides here (or below). At the end of the post are a few tutorials that you can access or make copies of.
I’ve been working at PCC for over six years now, but I’ve been doing online instructional design work for 15 years and I will freely admit that it’s my favorite thing to do. I started working at a very small rural academic library where I had to find creative and usually free solutions to instructional problems. And I love that sort of creative work. It’s what keeps me going.
I’ve actually been using survey software as a teaching tool since I worked at Portland State University. There, my colleague Amy Hofer and I used Qualtrics to create really polished and beautiful interactive tutorials for students in our University Studies program.
I also used Qualtrics at PSU and PCC to create pre-assignments for students to complete prior to an instruction session that both taught students skills and gave me formative assessment data that informed my teaching. So for example, students would watch a video on how to search for sources via EBSCO and then would try searching for articles on their own topic.
A year and a half ago, the amazing Anne-Marie Dietering led my colleagues in a day-long goal-setting retreat for our instruction program. In the end, we selected this goal, "identify new ways information literacy instruction can reach courses other than direct instruction," which was broad enough to encompass a lot of activities people valued. For me, it allowed me to get back to my true love, online instructional design, which was awesome, because I was kind of in a place of burnout going into last Fall.
At PCC, we already had a lot of online instructional content to support our students. We even built a toolkit for faculty with information literacy learning materials they could incorporate into their classes without working with a librarian.
The toolkit contains lots of handouts, videos, in-class or online activities and more. But it was a lot of pieces and they really required faculty to do the work to incorporate them into their classes.
What I wanted to build was something that took advantage of our existing content, but tied it up with a bow for faculty. So they really could just take whatever it is, assign students to complete it, and know students are learning AND practicing what they learned. I really wanted it to mimic the sort of experience they might get from a library instruction session. And that’s when I came back to the sort interactive tutorials I built at PSU.
So I started to sketch out what the requirements of the project were. Even though we have Qualtrics at PCC, I wasn’t 100% sure Qualtrics would be a good fit for this. It definitely did meet those first four criteria given that we already have it, it provides the ability to embed video, for students to get a copy of the work they did, and most features of the software are ADA accessible. But I wanted both my colleagues in the library and disciplinary faculty members to be able to easily see the responses of their students and to make copies of the tutorial to personalize for their particular course. And while PCC does have Qualtrics, the majority of faculty have never used it on the back-end and many do not have accounts. So that’s when Google Forms seemed like the obvious choice and I had to give up on my fantasy of having pretty tutorials.
I started by creating a proof of concept based on an evaluating sources activity I often use in face-to-face reading and writing classes. You can view a copy of it here and can copy it if you want to use it in your own teaching.
In this case, students would watch a video we have on techniques for evaluating sources. Then I demonstrate the use of those techniques, which predate Caulfield’s four moves, but are not too dissimilar. So they can see how I would go about evaluating this article from the Atlantic on the subject of DACA.
The students then will evaluate two sources on their own and there are specific questions to guide them.
During Fall term, I showed my proof of concept to my colleagues in the library as well as at faculty department meetings in some of my liaison areas. And there was a good amount of enthusiasm from disciplinary faculty – enough that I felt encouraged to continue.
One anthropology instructor who I’ve worked closely with over the years asked if I could create a tutorial on finding sources to support research in her online Biological Anthropology classes – classes I was going to be embedded in over winter term. And I thought this was a perfect opportunity to really pilot the use of the Google Form tutorial concept and see how students do.
So I made an interactive tutorial where students go through and learn a thing, then practice a thing, learn another thing, then practice that thing. And fortunately, they seemed to complete the tutorial without difficulty and from what I heard from the instructor, they did a really good job of citing quality sources in their research paper in the course. Later in the presentation, you’ll see that I received clear data demonstrating the impact of this tutorial from the Anthropology department’s annual assessment project.
So my vision for having faculty make copies of tutorials to use themselves had one major drawback. Let’s imagine they were really successful and we let a thousand flowers bloom. Well, the problem with that is that you now have a thousand versions of your tutorials lying around and what do you do when a video is updated or a link changes or some other update is needed? I needed a way to track who is using the tutorials so that I could contact them when updates were made.
So here’s how I structured it. I created a Qualtrics form that is a gateway to accessing the tutorials. Faculty need to put in their name, email, and subject area. They then can view tutorials and check boxes for the ones they are interested in using.
Once they submit, they are taken to a page where they can actually copy the tutorials they want. So now, I have the contact information for the folks who are using the tutorials.
This is not just useful for updates, but possibly for future information literacy assessment we might want to do.
The individual tutorials are also findable via our Information Literacy Teaching materials toolkit.
So when the pandemic came just when I was ready to expand this, I felt a little like Nostradamus or something. The timing was very, very good during a very, very bad situation. So we work with Biology 101 every single term in Week 2 to teach students about the library and about what peer review means, why it matters, and how to find peer-reviewed articles.
As soon as it became clear that Spring term was going to start online, I scrambled to create this tutorial that replicates, as well as I could, what we do in the classroom. So they do the same activity we did in-class where they look at a scholarly article and a news article and list the differences they notice. And in place of discussions, I had them watch videos and share insights. I then shared this with the Biology 101 faculty on my campus and they assigned it to their students in Week 2. It was great! [You can view the Biology 101 tutorial here and make a copy of it here]. And during Spring term I made A LOT more tutorials.
The biggest upside of using Google Forms is its simplicity and familiarity. Nearly everyone has created a Google form and they are dead simple to build. I knew that my colleagues in the library could easily copy something I made and tailor it to the courses they’re working with or make something from scratch. And I knew faculty could easily copy an existing tutorial and be able to see student responses. For students, it’s a low-bandwidth and easy-to-complete online worksheet. The barriers are minimal. And on the back-end, just like with LibGuides, there’s a feature where you can easily copy content from another Google Form.
The downsides of using Google Forms are not terribly significant. I mean, I’m sad that I can’t create beautiful, modern, sharp-looking forms, but it’s not the end of the world. The formatting features in Google Forms are really minimal. To create a hyperlink, you actually need to display the whole url. Blech. Then in terms of accessibility, there’s also no alt tag feature for images, so I just make sure to describe the picture in the text preceding or following it. I haven’t heard any complaints from faculty about having to fill out the Qualtrics form in order to get access to the tutorials, but it’s still another hurdle, however small.
This Spring, we used Google Form tutorials to replace the teaching we normally do in classes like Biology 101, Writing 121, Reading 115, and many others. We’ve also used them in addition to synchronous instruction, sort of like I did with my pre-assignments. But word about the Google Form tutorials spread and we ended up working with classes we never had a connection to before. For example, the Biology 101 faculty told the anatomy and physiology instructors about the tutorial and they wanted me to make a similar one for A&P. And that’s a key class for nursing and biology majors that we never worked with before on my campus. Lots of my colleagues have made copies of my tutorials and tailored them to the classes they’re working with or created their own from scratch. And we’ve gotten a lot of positive feedback from faculty, which REALLY felt good during Spring term when I know I was working myself to the bone.
Since giving this presentation, I learned from my colleagues in Anthropology that they actually used my work as the basis of their annual assessment project (which every academic unit has to do). They used a normed rubric to assess student papers in anthropology 101 and compared the papers of students who were in sections in which I was embedded (where they had access to the tutorial) to students in sections where they did not have an embedded librarian or a tutorial. They found that students in the class sections in which I was involved had a mean score of 43/50 and students in other classes had a mean score of 29/50. That is SIGNIFICANT!!! I am so grateful that my liaison area did this project that so validates my own work.
Here’s an excerpt from one email I received from an anatomy and physiology instructor: “I just wanted to follow up and say that the Library Assignment was a huge success! I’ve never had so many students actually complete this correctly with peer-reviewed sources in correct citation format. This is a great tool.” At the end of a term where I felt beyond worked to the bone, that was just the sort of encouragement I needed.
I made copies of a few other tutorials I’ve created so others can access them:
This weekend past I ran the Generous & Open Galleries, Libraries, Archives & Museums (GO GLAM) Miniconf at LinuxConf.au, with Bonnie Wildie. Because the conference was completely online this year, we had an increased pool of people who could attend and also who could speak, and managed to put together what I think was a really great program. I certainly learned a lot from all our speakers, and I'll probably share some thoughts on the talks and projects later this year.
I also gave a short talk, about my new Library Map project and some thoughts on generosity in providing open data. Unfortunately, Alissa is completely right about my talk. The tone was wrong. I spoke about the wrong things and in the wrong way. It was an ungenerous talk on the virtues of generosity. I allowed my frustration at underfunded government bureaucracies and my anxiety about the prospect of giving a "technical" talk that "wasn't technical enough" for LCA to overwhelm the better angels of my nature. I won't be sharing the video of my own talk when it becomes available, but here is a short clip of me not long after I delivered it:
So I'm trying again. In this post I'll outline the basic concepts, and the why of Library Map - why I wanted to make it, and why I made the architecture and design choices I've made. In the next post, I'll outline how I built it - some nuts and bolts of which code is used where (and also, to some extent, why). You may be interested in one, or the other, or neither post 🙂.
The Library Map is a map of libraries in Australia and its external territories. There are three 'layers' to the map:
The libraries layer shows every public library in Australia, plus an indicative 800m radius around it. Also mapped on additional overlays are State and National libraries, Indigenous Knowledge Centres, and most still-operating Mechanics Institutes.
The Rules layer has two overlays.
The Fines overlay colour-codes each library service area according to whether they charge overdue fines for everyone, only for adults, or not at all.
The Loan Periods overlay uses patterns (mostly stripes) to indicate the standard loan period in weeks (2, 3, 4, or 6 as it turns out).
The Library Management Software layer works basically the same as the Rules layer, except it colour-codes library services according to which library management system (a.k.a. Integrated Library System) they use.
I've wanted something like this map at various times in the past. There is a fair amount of information around at the regional and state level about loan periods, or fine regimes, and even library management systems. But a lot of this is in people's heads, or in lists within PDF documents. I'm not sure I'd call myself a 'visual learner' but sometimes it is much clearer to see something mapped out visually than to read it in a table.
The intended audience for the map is actually a little bit "inside baseball". I'm not trying to build a real-time guide for library users to find things like current opening hours. Google Maps does a fine job of that, and I'm not sure a dedicated site that covers every public library, but only libraries, is a particularly useful tool. It would also be a nightmare to maintain. The site ultimately exists because I wanted to see if I could do it, but I had — broadly — two specific use cases in mind:
My talk at LCA2018 was called Who else is using it? — in reference to a question library managers often ask when confronting a suggestion to use a particular technology, especially something major like a library management system. This is understandable — it's reassuring to know that one's peers have made similar decisions ("Nobody gets fired for buying IBM"), but there are also genuine advantages to having a network of fellow users you can talk to about shared problems or desired features. I was interested in whether these sorts of networks and aggregate purchasing decisions might be visible if they were mapped out, in a different way to what might be clear from a list or table. Especially at a national level — I suspected there were strong trends within states and contrasts between them, but didn't have a really clear picture.
The State Library of Queensland was invaluable in this regard, because they have a list of every library service in the state and which library management system they use. When visiting library service websites it turned out that identifying the LMS was often the easiest piece of data to find — much easier than finding out whether they charge overdue fines! It turns out there are very strong trends within each state — stronger than I expected — but Western Australia is a much more fractured and diverse market than I had thought. I also discovered a bunch of library management systems I had never heard of, so that was fun. This layer is the most recent — I only added it today — so there may still be some improvements to be made in terms of how the data is displayed.
The second thing I wanted to map was whether and how libraries charge overdue fines, but my reason was different. I actually started the map with this layer, as part of a briefing I gave to some incoming Victorian local government Councillors about what they should know about public libraries.
Here, the goal is mapping as an advocacy tool, using the peer pressure of "who else is charging it?" to slowly flip libraries to go fine-free. Fines for overdue library books are regressive and counter-productive. I have found no compelling or systematic evidence that they have any effect whatsoever on the aggregate behaviour of library users in terms of returning books on time. They disproportionally hurt low income families. They need to go.
In Victoria there has been a growing movement in the last few years for public libraries to stop charging overdue fines. I wasn't really aware of the situation in other states, but it turns out the whole Northern Territory has been fine-free for over a decade, and most libraries in Queensland seem to also be fine-free. I'm still missing a fair bit of data for other states, especially South and Western Australia. What I'm hoping the map can be used for (once the data is more complete) is to identify specific libraries that charge fines but are near groups of libraries that don't, and work with the local library networks to encourage the relevant council to see that they are the odd ones out. I've worked in public libraries and know how difficult this argument can be to make from the inside, so this is a tool for activists but also to support library managers to make the case.
As is often a problem in libraries, I had to define a few terms and therefore "normalise" some data in order to have it make any sense systematically. So "no fines for children" is defined as any system that has a "younger than" exclusion for library fines or an exclusion for items designated as "children's books". Some libraries are fine-free for users under 14, others for those under 17, some only for children's book loans and so on. On my map they're all the same. The other thing to normalise was the definition of "overdue fine", which you might think is simple but turns out to be complex. In the end I somewhat arbitrarily decided that if there is no fee earlier than 28 days overdue, that is classified as "no overdue fines". Some libraries charge a "notice fee" after two weeks (which does count), whilst others send an invoice for the cost of the book after 28 days (which doesn't).
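The normalisation rules above can be sketched as a small decision function. This is purely my illustration of the logic described in the text — the field names and function are hypothetical, not the map's actual code:

```python
# Sketch of the overdue-fine normalisation rules described above.
# Names and structure are illustrative, not taken from the Library Map code.

def classify_fines(first_fee_day, child_exempt):
    """Classify a library service's overdue-fine regime.

    first_fee_day: days overdue at which the first fee is charged,
                   or None if no fee is ever charged.
    child_exempt:  True if children (however locally defined) are exempt.
    """
    # A fee charged only at 28+ days overdue (e.g. an invoice for the
    # cost of the book) is classified as "no overdue fines".
    if first_fee_day is None or first_fee_day >= 28:
        return "no fines"
    return "no fines for children" if child_exempt else "fines for everyone"

# A "notice fee" after two weeks counts as an overdue fine:
print(classify_fines(14, False))  # fines for everyone
# A replacement invoice at 28 days does not:
print(classify_fines(28, True))   # no fines
```

The 28-day threshold is the arbitrary cut-off named in the text; a different threshold would reclassify the edge cases.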
As the project has progressed, some things have changed, especially how I name things. When I first added the Libraries layer, I was only looking at Victoria, using the Directory of Public Library Services in Victoria. This includes Mechanics Institutes as a separate category, and that seemed like a good idea, so I had two overlays, in different colours. Then I figured I should add the National Library, and the State Libraries, as a separate layer, since they operate quite differently to local public libraries.
Once I got to Queensland, I discovered that the State Library of Queensland not only provides really good data on public libraries, but also had broadly classified them into three categories: "RLQ" for Rural Libraries Queensland, a reciprocal-borrowing arrangement; "IND" for Independent library services, and "IKC" for "Indigenous Knowledge Centre". The immediate question for me was whether I would also classify any of these libraries as something different to a "standard" public library.
The main thing that distinguishes the RLQ network from the "independents" is that it is a reciprocal lending network. In this regard, it's much the same as Libraries Victoria (formerly the Swift Consortium), or ShoreLink. There are other ways that rural libraries in Queensland operate differently to urban libraries in Queensland, but I don't think these differences make them qualitatively different in terms of their fundamental nature.
But what about Indigenous Knowledge Centres? I admit I knew very little about them, and I still only know what I've gleaned from looking at IKC websites. The Torres Strait Island Regional Council website seems to be fairly representative:
Our Indigenous Knowledge Centres endeavour to deliver new technology, literacy and learning programs to empower our communities through shared learning experiences. We work with communities to preserve local knowledge and culture and heritage, to keep our culture strong for generations.
The big difference between an IKC and a typical public library is that the focus is on preserving local Indigenous knowledge and culture, which does happen through books and other library material, but is just as likely to occur through classes and activities such as traditional art and dance.
But the more I looked at this difference, the less different it seemed to be. Public libraries across the world have begun focussing more on activities and programs in the last two decades, especially in WEIRD countries. Public libraries have always delivered new technology, literacy and learning programs. And the Directory of Public Library Services in Victoria amusingly reports that essentially every library service in Victoria claims to specialise in local history. What are public libraries for, if not to "keep our culture strong for generations"?
Yet it still felt to me that Indigenous Knowledge Centres are operating from a fundamentally different mental model. Finally it dawned on me that the word "our" is doing a lot of work in that description. Our Indigenous Knowledge Centres, keep our culture strong for generations. I was taken back to a conversation I've had a few times with my friend Baruk Jacob, who lives in Aotearoa but grew up in a minority-ethnicity community in India. Baruk maintains that public libraries should stop trying to be universally "inclusive" — that they are fundamentally Eurocentric institutions and need to reconcile themselves to staying within that sphere. In this line of thinking, public libraries simply can't serve Indigenous and other non-"Western" people appropriately as centres of knowledge and culture. I could see where Baruk was coming from, but I was troubled by his argument, and the implication that different cultural traditions could never be reconciled. As I struggled to decide whether Indigenous Knowledge Centres were public libraries, or something else, I think I started to understand what Baruk meant.
I'd been thinking about this back to front. Indigenous Knowledge Centre is a usefully descriptive term. These places are centres for Indigenous knowledge. The problem wasn't how to classify IKCs, but rather how to classify the other thing. The activities might be the same, but the our is different. I thought about what a non-Indigenous Knowledge Centre might be. What kind of knowledge does it want to "keep strong for generations"? I thought about all those local history collections full of books about "pioneers" and family histories of "first settlers". If it's not Indigenous knowledge, it must be Settler knowledge. When I first saw this term being used by Aboriginal activists in reference to non-Indigenous residents generally, and white Australians specifically, I bristled. I mean, sure, the modern culture is hopelessly dismissive of 60,000 years of human occupation, culture and knowledge, but how could I be a "settler" when I have five or six generations of Australian-born ancestors? But a bit of discomfort is ok, and I have rather hypocritical ideas about other settler-colonial communities. It's exactly the right term to describe the culture most Australians live in.
So I renamed "public libraries" as "Settler Knowledge Centres". I initially renamed the National & State Libraries to "Imperial Knowledge Centres", but later decided it was more accurate to call them "Colonial Knowledge Centres". I also briefly renamed Mechanics Institutes to Worker Indoctrination Centres, but that's not entirely accurate and I realised I was getting carried away. I wasn't completely oblivious to the fact that this nomenclature could be a bit confusing, so I cheekily created two views: the "General" view which would be the default, and a second view which would appear on clicking "View in White Fragility mode". This second mode would show the more familiar names "Public Libraries" and "National & State Libraries".
While I was doing some soul searching this morning about my GO GLAM talk, I continued to work on the map. My cheeky joke about "White fragility mode" had made me slightly uncomfortable from the moment I'd created it, but I initially brushed it off as me worrying too much about being controversial. But I realised today that the real problem was that calling it "White fragility mode" sabotages the entire point of the feature. The default language of "Settler Knowledge Centre" and "Colonial Knowledge Centre" sitting next to "Indigenous Knowledge Centre" is intended to invite map users to think about the work these institutions do to normalise certain types of knowledge, and to "other" alternative knowledge systems and lifeworlds. The point is to bring people in to sit with the discomfort that comes from seeing familiar things described in an unfamiliar way. Calling it "White fragility mode" isn't inviting, it's smug. It either pushes people away, or invites them to think no more about it because they're already woke enough to get it.
So today I changed it to something hopefully more useful. General mode is now called Standard Mode, and White fragility mode is now called Colonial mode. It's the mode of thinking that is colonial, not the reader. Flicking to Colonial Mode is ok if you need the more familiar terms to get your bearings: but hopefully by making it the non-standard view, users of the map are encouraged to think about libraries and about Australia in a slightly different way. They don't have to agree that the "standard mode" terminology is better.
So that's some background behind why I started building the map and why I made some of the decisions I have about how it works. You can check it out at librarymap.hugh.run and see (most of) the code and data I used to build it on GitHub. Next time join me for a walk through how I made it.
Nearly four years ago I decided to start collecting tweets to Donald Trump out of morbid curiosity. If I was a real archivist, I would have planned this out a little bit better, and started collecting on election night in 2016, or inauguration day 2017. I didn’t. Using twarc, I started collecting with the Filter (Streaming) API on May 7, 2017. That process failed, and I pivoted to using the Search API. I dropped that process into a simple bash script, and pointed cron at it to run every 5 days. Here’s what the bash script looked like:
#!/bin/bash
DATE=`date +"%Y_%m_%d"`
cd /mnt/vol1/data_sets/to_trump/raw
/usr/local/bin/twarc search 'to:realdonaldtrump' --log donald_search_$DATE.log > donald_search_$DATE.json
It’s not beautiful. It’s not perfect. But, it did the job for the most part for almost four years, save and except a couple of Twitter suspensions on accounts that I used for collection, and an absolutely embarrassing situation where I forgot to set up cron correctly on a machine I moved the collecting to for a couple weeks while I was on family leave this past summer.
In the end, the collection ran from May 7, 2017 - January 20, 2021, and collected 362,464,578 unique tweets; 1.5T of line-delimited JSON! The final created_at timestamp was Wed Jan 20 16:49:03 +0000 2021, and the text of that tweet very fittingly reads, “@realDonaldTrump YOU’RE FIRED!“
The “dehydrated” tweets can be found here. In that dataset I decided to include a number of derivatives created with twut which, I hope, rounds out the dataset. This update is the final update on the dataset.
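“Dehydrating” a line-delimited tweet dataset means reducing each tweet to its numeric ID, which is what allows the collection to be shared within Twitter’s terms of service. twut does this at scale on Spark; the core idea can be sketched in plain Python (the sample records here are fake and minimal, for illustration only):

```python
import json

def dehydrate(lines):
    """Yield the tweet ID from each line of line-delimited tweet JSON."""
    for line in lines:
        tweet = json.loads(line)
        # id_str is the canonical string form of a tweet's ID in the
        # Twitter API's tweet object.
        yield tweet["id_str"]

# Two fake, minimal tweet records standing in for real API output:
sample = [
    '{"id_str": "1", "full_text": "first"}',
    '{"id_str": "2", "full_text": "second"}',
]
print(list(dehydrate(sample)))  # ['1', '2']
```

Researchers can later “rehydrate” the IDs back into full tweets via the API — at which point deleted or suspended content simply doesn’t come back.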
I also started working on some notebooks here where I’ve been trying to explore the dataset a bit more in my limited spare time. I’m hoping to have the time and energy to really dig into this dataset sometime in the future. I’m especially curious about what the leadup to the 2021 storming of the United States Capitol looks like in the dataset, as well as the sockpuppet frequency. I’m also hopeful that others will explore the dataset and that it’ll be useful in their research. I have a suspicion folks can do a lot smarter, innovative, and creative things with the dataset than I did here, here, here, here, or here.
For those who are curious what the tweet volume for the last few months looked like (please note that the dates are UTC), check out these bar charts. January 2021 is especially fun.
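Tallying tweet volume by day from the created_at field (the basis of bar charts like these) only needs the standard library — note the +0000 offset in Twitter timestamps, which is why the dates are UTC. This is a sketch of the idea, not the code used for the actual charts:

```python
from collections import Counter
from datetime import datetime

def daily_volume(created_ats):
    """Count tweets per UTC date from Twitter-style created_at strings."""
    counts = Counter()
    for ts in created_ats:
        # Twitter's classic created_at format, e.g.
        # "Wed Jan 20 16:49:03 +0000 2021"
        dt = datetime.strptime(ts, "%a %b %d %H:%M:%S %z %Y")
        counts[dt.date().isoformat()] += 1
    return counts

volume = daily_volume([
    "Wed Jan 20 16:49:03 +0000 2021",
    "Wed Jan 20 01:00:00 +0000 2021",
    "Tue Jan 19 23:59:59 +0000 2021",
])
print(volume)  # Counter({'2021-01-20': 2, '2021-01-19': 1})
```

For 362 million tweets you would stream the JSON line by line rather than hold timestamps in memory, but the counting logic is the same.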
-30-
2021-01-23T05:00:00+00:00
Lucidworks: Consider a New Application for AI in Retail
https://lucidworks.com/post/consider-a-new-application-for-ai-in-retail/
How companies can plan for 2021 by weaving AI and machine learning into their digital experiences.
The post Consider a New Application for AI in Retail appeared first on Lucidworks.
2021-01-22T21:59:43+00:00 Garrett Schwegler
Mita Williams: Weeknote 3 (2021)
https://librarian.aedileworks.com/2021/01/22/weeknote-3-2021/
Hey. I missed last week’s weeknote. But we are here now.
This week I gave a class on searching scientific literature to a group of biology masters students. While I was making my slides comparing the Advanced Search capabilities of Web of Science and Scopus, I discovered this weird behaviour of Google Scholar: a phrase search generated more hits than the same search without quotation marks.
I understand that Google Scholar performs ‘stemming’ instead of truncation in generating search results but this still makes no sense to me.
New to me: if you belong to an organization that is already a member of CrossRef, you are eligible to use a Similarity Check of documents for an additional fee. Perhaps this is a service we could provide to our OJS editors.
I’m still working through the Canadian Journal of Academic Librarianship special issue on Academic Libraries and the Irrational.
Long time readers know that I have a fondness for the study of organizational culture and so it should not be too surprising that the first piece I wanted to read was The Digital Disease in Academic Libraries. It begins….
THOUGH several recent books and articles have been written about change and adaptation in contemporary academic libraries (Mossop 2013; Eden 2015; Lewis 2016), there are few critical examinations of change practices at the organizational level. One example, from which this paper draws its title, is Braden Cannon’s (2013) The Canadian Disease, where the term disease is used to explore the trend of amalgamating libraries, archives, and museums into monolithic organizations. Though it is centered on the impact of institutional convergence, Cannon’s analysis uses an ethical lens to critique the bureaucratic absurdity of combined library-archive-museum structures. This article follows in Cannon’s steps, using observations from organizational design and management literature to critique a current trend in the strategic planning processes and structures of contemporary academic libraries. My target is our field’s ongoing obsession with digital transformation beyond the shift from paper-based to electronic resources, examined in a North American context and framed here as The Digital Disease.
I don’t want to spoil the article but I do want to include this zinger of a symptom which is the first of several:
Ouch. That truth hurts almost as much as this tweet did:
2021-01-22T20:19:36+00:00 Mita Williams
Jez Cope: Blogging by hand
https://erambler.co.uk/blog/blogging-by-hand/
I wrote the following text on my tablet with a stylus, which was an interesting experience:
So, thinking about ways to make writing fun again, what if I were to write some of them by hand? I mean I have a tablet with a pretty nice stylus, so maybe handwriting recognition could work. One major problem, of course, is that my handwriting is AWFUL! I guess I’ll just have to see whether the OCR is good enough to cope…
It’s something I’ve been thinking about recently anyway: I enjoy writing with a proper fountain pen, so is there a way that I can have a smooth workflow to digitise handwritten text without just typing it back in by hand? That would probably be preferable to this, which actually seems to work quite well but does lead to my hand tensing up to properly control the stylus on the almost-frictionless glass screen.
I’m surprised how well it worked! Here’s a sample of the original text:
And here’s the result of converting that to text with the built-in handwriting recognition in Samsung Notes:
Writing blog posts by hand
So, thinking about ways to make writing fun again, what if I were to write some of chum by hand? I mean, I have a toldest winds a pretty nice stylus, so maybe handwriting recognition could work.
One major problems, ofcourse, is that my , is AWFUL! Iguess
I’ll just have to see whattime the Ocu is good enough to cope…
It’s something I’ve hun tthinking about recently anyway: I enjoy wilting with a proper fountain pion, soischeme a way that I can have a smooch workflow to digitise handwritten text without just typing it back in by hand?
That wouldprobally be preferableto this, which actually scams to work quito wall but doers load to my hand tensing up to properly couldthe stylus once almost-frictionlessg lass scream.
It’s pretty good! It did require a fair bit of editing though, and I reckon we can do better with a model that’s properly trained on a large enough sample of my own handwriting.
This is the last post in a weeklong series exploring DEI professional competencies. Again, I believe the five key competencies for DEI professionals are:
Yesterday’s post was about influencing others. This post will explore getting cross functional projects done. I’ll also share some other DEI career resources.
Great ideas without action are totally meaningless. As a DEI leader you’ll be working across departments and functions to get stuff done. Strong project management skills and collaboration are key in making change to existing processes and developing new ways of doing things. Here are two examples to illustrate this competency.
One of my first projects at Mozilla was working with People Ops and a Tableau expert in IT to build a dashboard to track our diversity metrics, which was more difficult and time consuming than I first thought. When I started, the project was off the rails, so I suggested we restart by introducing ourselves and what we thought we brought to the table, and then developed a RASCI for the project. With these foundations in place, we became a very effective team. We completed the project and became friends. Having a dashboard for diversity metrics was important as leaders owned accountability goals and needed to know how they were doing.
Engineers started Mozilla’s first mentorship program. I joined the team and was the only non-technical person and marvelled at some of the skills and ways of thinking that the others brought. It was one of those wonderful experiences where we were more than the sum of our parts. We were a small group of people with different backgrounds, doing different jobs, at various job levels and we were able to stand up and support a mentorship program for about 100 people. I credit the leadership of Melissa O’Connor, Senior Manager of Data Operations. She often said “tell me what I’m missing here” to invite different options and ran the most efficient meetings I’ve ever attended in my life.
Great ideas without action are totally meaningless. Turning thoughts into actions as a leader in DEI is a necessary art–to get things done you’ll need to effectively collaborate with people at different levels and in different functions.
I’m excited to be one of the panelists for Andrea Tatum’s DEI careers panel tomorrow, January 23. The event is sold out but she’ll be simulcasting live on YouTube on January 23 at 10am Pacific. Andrea also introduced me to Russell Reynolds’ list of competencies of a Chief Diversity Officer.
Aubrey Blanche’s post How can I get a job in D&I? starts by trying to talk the reader out of going into this line of work then gets into five key areas of expertise.
Dr. Janice Gassam’s Dirty Diversity Podcast has an episode where she interviews Lambert Odeh, Diversity and Inclusion Manager at Olo Inc. on How to Land a Career in DEI.
The post Diversity, equity and inclusion core competencies: Get cross functional projects done (Part 5 of 5) appeared first on Tara Robertson Consulting.
2021-01-22T14:01:09+00:00 Tara Robertson
Open Knowledge Foundation: How to run your Open Data Day event online in 2021
https://blog.okfn.org/2021/01/22/how-to-run-your-open-data-day-event-online-in-2021/
For Open Data Day 2021 on Saturday 6th March, the Open Knowledge Foundation is offering support and funding for in-person and online events anywhere in the world via our mini-grant scheme.
Open Data Day normally sees thousands of people getting together at hundreds of events all over the world to celebrate and use open data in their communities but this year has not been a normal year.
With many countries still under lockdown or restricted conditions due to the Covid-19 pandemic, we recognise that many people will need to celebrate Open Data Day by hosting online events rather than getting together for in-person gatherings.
To support the running of events, anyone can apply to our mini-grant scheme to receive $300 USD towards the running of your Open Data Day event whether it takes place in-person or online. Applications must be submitted before 12pm GMT on Friday 5th February 2021 by filling out this form.
If you’re applying for a mini-grant for an online event, we will accept applications where the funds are allocated to cover any of the following costs:
It might feel challenging to plan a great online event if you are used to running events in the real world. But many people and organisations have overcome these challenges this year, and there are many tools that can help you plan your event. Here are some tips and tools that we use for remote events that we think will help with your preparations.
Open Knowledge Foundation is a remote working organisation with our team spread around the world. We use Zoom, Google Meet or Slack to host our internal and external video meetings and rely on Google Docs, Github, Gitter and Discourse to allow us to share documents and talk in real-time. Many of these tools are free and easy to set up.
Two members of our team are also on the organisation team of csv,conf, an annual community conference for data makers which usually hosts several hundred people for a two-day event. For csv,conf,v5 in May 2020, the team decided to make their event online-only and it proved to be a great success thanks to lots of planning and the use of good online tools. Read this post – https://csvconf.com/2020/going-online – to learn more about how the team organised their first virtual conference including guidance about the pros and cons of using tools like Crowdcast, Zenodo, Zoom and Spatial Chat for public events.
Other organisations – including the Center for Scientific Collaboration and Community Engagement and the Mozilla Festival team – have also shared their guidebooks and processes for planning virtual events.
We hope some of these resources will help you in your planning. If you have any further questions relating to an Open Data Day 2021 mini-grant application, please email opendataday@okfn.org.
2021-01-22T11:06:44+00:00 Stephen Abbott Pugh
Ed Summers: Trump's Tweets
https://inkdroid.org/2021/01/21/trumps-tweets/
TLDR: Trump’s tweets are gone from twitter.com but still exist spectrally in various states all over the web. After profiting off of their distribution, Twitter now have a responsibility to provide meaningful access to the Trump tweets as a read-only archive.
This post is also published on the Documenting the Now Medium where you can comment, if the mood takes you.
So Trump’s Twitter account is gone. Finally. It’s strange to have had to wait until the waning days of his presidency to achieve this very small and simple act of holding him accountable to Twitter’s community guidelines…just like any other user of the platform.
Better late than never, especially since his misinformation and lies can continue to spread after he has left office.
But isn’t it painful to imagine what the last four years (or more) could have looked like if Twitter and the media at large had recognized their responsibility and acted sooner?
When Twitter suspended Trump’s account they didn’t simply freeze it and prevent him from sending more hateful messages. They flipped a switch that made all the tweets he has ever sent disappear from the web.
These are tweets that had real material consequences in the world. As despicable as Trump’s utterances have been, a complete and authentic record of them having existed is important for the history books, and for holding him to account.
Twitter’s suspension of Donald Trump’s account has also removed all of his thousands of tweets sent over the years. I personally find it useful as a reporter to be able to search through his tweets. They are an important part of the historical record. Where do they live now?
— Olivia Nuzzi (@Olivianuzzi) January 9, 2021
Where indeed? One hopes that they will end up in the National Archives (more on that in a moment). But depending on how you look at it, they are everywhere.
Twitter removed Trump’s tweets from public view at twitter.com. But fortunately, as Shawn Jones notes, embedded tweets like the one above persist the tweet text into the HTML document itself. When a tweet is deleted from twitter.com the text stays behind elsewhere on the web like a residue, as evidence (that can be faked) of what was said and when.
It’s difficult to say whether this graceful degradation was an intentional design decision to make their content more resilient, or it was simply a function of Twitter wanting their content to begin rendering before their JavaScript had loaded and had a chance to emboss the page. But design intent isn’t really what matters here.
What does matter is the way this form of social media content degrades in the web commons. Kari Kraus calls this process “spectral wear”, where digital media “help mitigate privacy and surveillance concerns through figurative rather than quantitative displays, reflect and document patterns of use, and promote an ethics of care.” (Kraus, 2019). This spectral wear is a direct result of tweet embed practices that Twitter itself promulgates while simultaneously forbidding it in its Developer Terms of Service:
If Twitter Content is deleted, gains protected status, or is otherwise suspended, withheld, modified, or removed from the Twitter Applications (including removal of location information), you will make all reasonable efforts to delete or modify such Twitter Content (as applicable) as soon as possible.
Fortunately for history, there has probably never been more heavily copied social media content than Donald Trump’s tweets. We aren’t immediately dependent on twitter.com to make this content available because of the other places on the web where it exists. What does this copying activity look like?
I intentionally used copied instead of archived above because the various representations of Trump’s tweets vary in terms of their coverage, and how they are being cared for.
Given their complicity in bringing Trump’s messages of division and hatred to a worldwide audience, while profiting off of them, Twitter now have a responsibility to provide as best a representation of this record for the public, and for history.
We know that the Trump administration have been collecting the @realDonaldTrump Twitter account, and plan to make it available on the web as part of their responsibilities under the Presidential Records Act:
The National Archives will receive, preserve, and provide public access to all official Trump Administration social media content, including deleted posts from @realDonaldTrump and @POTUS. The White House has been using an archiving tool that captures and preserves all content, in accordance with the Presidential Records Act and in consultation with National Archives officials. These records will be turned over to the National Archives beginning on January 20, 2021, and the President’s accounts will then be made available online at NARA’s newly established trumplibrary.gov website.
NARA is the logical place for these records to go. But it is unclear what shape these archival records will take. Sure, the Library of Congress has (or had) its Twitter archive. It’s not at all clear if they are still adding to it. But even if they are, LC probably hasn’t felt obligated to collect the records of an official from the Executive Branch, since the Library is firmly lodged in the Legislative. Then again they collect GIFs so, maybe?
Reading between the lines it appears that a third party service is being used to collect the social media content: possibly one of the several e-discovery tools like ArchiveSocial or Hanzo. It also looks like the Trump Administration themselves have entered into this contract, and at the end of its term (i.e. now) will extract their data and deliver it to NARA. Given their past behavior it’s not difficult to imagine the Trump administration not living up to this agreement in substantial ways.
This current process is a slight departure from the approach taken by the Obama administration. Obama initiated a process where platforms migrated official accounts to new accounts that were then managed going forward by NARA (Acker & Kriesberg, 2017). We can see that this practice was used again on January 20, 2021 when Biden became President. But what is different is that Barack Obama retained ownership of his personal account @barackobama, which he continues to use. NARA has announced that they will be archiving Trump’s now deleted (or hidden) personal account.
A number of Trump administration officials, including President Trump, used personal accounts when conducting government business. The National Archives will make the social media content from those designated accounts publicly available as soon as possible.
The question remains, what representation should be used, and what is Twitter’s role in providing it?
Meanwhile there are online collections like The Trump Archive, the New York Times’ Complete List of Trump’s Twitter Insults, Propublica’s Politwoops and countless GitHub repositories of data which have collected Trump’s tweets. These tweets are used in a multitude of ways including things as absurd as a source for conducting trades on the stock market.
But seeing these tweets as they appeared in the browser, with associated metrics and comments is important. Of course you can go view the account in the Wayback Machine and browse around. But what if we wanted a list of all the Trump tweets? How many times were these tweets actually archived? How complete is the list?
After some experiments with the Internet Archive’s API, it’s possible to get a picture of how the tweets from the @realDonaldTrump account have been archived there. There are a few wrinkles because a given tweet can have many different URL forms (e.g. tracking parameters in the URL query string). In addition, just because there was a request to archive a URL for something that looks like a realDonaldTrump tweet URL doesn’t mean it resulted in a successful response. Success here means a 200 OK from twitter.com when resolving the URL. Factoring these issues into the analysis, it appears the Wayback Machine contains (at least) 16,043,553 snapshots of Trump’s tweets, at URLs of the form:
https://twitter.com/realDonaldTrump/status/{tweet-id}
Of these millions of snapshots there appear to be 57,292 unique tweets. This roughly correlates with the 59K total tweets suggested by the last profile snapshots of the account. The busiest single day was February 10, 2020, when his tweets were archived 71,837 times. Here’s what the archive snapshots of Trump’s tweets look like over time (snapshots per week).
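Collapsing the many URL variants for a snapshot down to unique tweet IDs is mostly a matter of ignoring the query string. Here is a minimal sketch of that normalization step (not the notebook’s actual code; `tweet_id` is a hypothetical helper):

```python
import re
from urllib.parse import urlsplit

# Match both the /status/ and older /statuses/ URL forms, case-insensitively.
TWEET_RE = re.compile(r"/realdonaldtrump/status(?:es)?/(\d+)", re.IGNORECASE)

def tweet_id(url):
    """Return the numeric tweet ID for a realDonaldTrump status URL, or None."""
    path = urlsplit(url).path  # drops ?s=20, ?ref_src=... tracking parameters
    m = TWEET_RE.search(path)
    return m.group(1) if m else None
```

Counting unique values of `tweet_id` across all snapshot URLs, rather than unique URLs, is what gets from millions of snapshots down to tens of thousands of tweets.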
It is relatively easy to use the CSV export from the Trump Archive project to see which tweets they know about that the Internet Archive does not, and vice versa (for the details see the Jupyter notebook and SQLite database here).
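The comparison itself boils down to a set difference between the two exports. A sketch, assuming hypothetical file and column names for the two CSVs:

```python
import csv

def load_ids(lines, column):
    """Collect one column of tweet IDs from CSV lines into a set."""
    return {row[column] for row in csv.DictReader(lines)}

# Hypothetical file and column names for the two exports:
# with open("trump-archive.csv", newline="") as f:
#     ta = load_ids(f, "id")
# with open("wayback-tweets.csv", newline="") as f:
#     ia = load_ids(f, "tweet_id")
# only_in_ta = ta - ia  # tweet IDs the Internet Archive is missing
# only_in_ia = ia - ta  # tweet IDs the Trump Archive is missing
```

Each direction of the difference corresponds to one of the two questions the post explores next.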
It looks like there are 526 tweet IDs in the Trump Archive that are missing from the Internet Archive. But further examination shows that many of these are retweets, which in Twitter’s web interface have sometimes redirected back to the original tweet. Removing these retweets to look specifically at Trump’s own tweets, there are only 7 tweets in the Trump Archive that are missing from the Internet Archive. Of these, 4 are in fact retweets that have been miscategorized by the Trump Archive.
One of the remaining three is this one, which is identified in the Trump Archive as deleted, and wasn’t archived by the Internet Archive quickly enough:
Roger Stone was targeted by an illegal Witch Hunt tha never should have taken place. It is the other side that are criminals, including the fact that Biden and Obama illegally spied on my campaign - AND GOT CAUGHT!"
Sure enough, over at the Politwoops project you can see that this tweet was deleted 47 seconds after it was sent:
Flipping the table, it’s also possible to look at which tweets are in the Internet Archive but not in the Trump Archive. It turns out that there are 3,592 tweet identifiers in the Wayback Machine for Trump’s tweets which do not appear in the Trump Archive. Looking a bit closer, we can see that some are clearly wrong, because the ID itself is too small a number, or too large. And then, looking at some of the snapshots, it appears that they often don’t resolve, and simply display a “Something went wrong” message:
Yes, something definitely went wrong (in more ways than one). Just spot checking a few there also appear to be some legit tweets in the Wayback that are not in the Trump archive like this one:
Notice how the media will not play there? It would take some heavy manual curation work to sort through these tweet IDs to see which ones are legit, and which ones aren’t. But if you are interested here’s an editable Google Sheet.
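A first automated pass over the suspect IDs is possible because Twitter’s post-2010 “snowflake” IDs encode a millisecond timestamp in their high bits, so an ID that decodes to an impossible date can be flagged without loading the snapshot. A sketch (not necessarily how the notebook does it, and the cutover constant is approximate):

```python
from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657   # snowflake epoch, 2010-11-04
SNOWFLAKE_CUTOVER = 29_700_000_000  # approx. first snowflake ID (pre-Nov-2010 IDs are sequential)

def snowflake_time(tweet_id):
    """Decode the UTC timestamp embedded in a snowflake tweet ID."""
    ms = (int(tweet_id) >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

def plausible(tweet_id, lo=2010, hi=2021):
    """Rough sanity check: does the ID decode to a date in the account's active years?"""
    tid = int(tweet_id)
    if tid < SNOWFLAKE_CUTOVER:
        return False  # pre-snowflake ID: can't decode a date, needs manual review
    return lo <= snowflake_time(tid).year <= hi
```

This would only weed out the obviously impossible IDs; the legit-looking remainder would still need the manual curation described above.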
Finally, here is a list of the top ten most-archived (at the Internet Archive) tweets. The counts here reflect all the variations for a given tweet URL, so they will very likely not match the count you see in the Wayback Machine, which is for the specific URL (no query parameters).
The point of this rambling data spelunking, if you’ve made it this far, is to highlight the degree to which Trump’s tweets have been archived (or collected), and how the completeness and quality of those representations is very fluid and difficult to ascertain. Hopefully Twitter is working with NARA to provide as complete a picture as possible of what Trump said on Twitter. As much as we would like to forget, we must not.
Acker, A., & Kriesberg, A. (2017). Tweets may be archived: Civic engagement, digital preservation and Obama White House social media data. Proceedings of the Association for Information Science and Technology, 54(1), 1–9.
Kraus, K. (2019). The care of enchanted things. In M. K. Gold & L. F. Klein (Eds.), Debates in the digital humanities 2019. Retrieved from https://www.jstor.org/stable/10.5749/j.ctvg251hk.17
Updated Fusion integration with Drupal provides easier setup and additional monitoring.
The post Lucidworks Announces Updated Integration with Drupal appeared first on Lucidworks.
2021-01-21T19:59:50+00:00 Tom Allen