Tuesday, December 29, 2020

Infra-infrastructure, inter-infrastructure and para-infrastructure

No one is against "Investing in Infrastructure". No one wants bridges to collapse, investing is always more popular than spending, and it's even alliterative! What's more, since infrastructure is almost invisible by definition, it's politically safe to support investing in infrastructure because no one will see when you don't follow through on your commitment!

Ponte Morandi collapse - Michele Ferraris, CC BY-SA 4.0 via Wikimedia Commons

Geoffrey Bilder gives a talk where he asks us to think of Crossref and similar services as "information infrastructure" akin to "plumbing", where the implication is that since we, as a society, are accustomed to paying plumbers and bridge builders lots of money, we should also pony up for "information infrastructure", which is obvious once you say it.

What qualifies as infrastructure, anyway? If I invest in a new laptop, is that infrastructure for the Go-to-Hellman blog? Blogspot is Google-owned blogging infrastructure for sure. It's certainly not open infrastructure, but it works, and I haven't had to do much maintenance on it. 

There's a lot of infrastructure used to make Unglue.it, which supports distribution of open-access ebooks. It uses Django, which is open-source software originally developed to support newspaper websites. Unglue.it also uses modules that extend Django, made possible by Django's open license. It works really well, but I've had to put a fair amount of work into updating my code to keep up with new versions of Django. Ironically, most of this work has been in fixing the extensions that have not updated along with Django.
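
As a hypothetical illustration of that kind of breakage (this is not Unglue.it code): Django deprecated and then dropped its old ugettext_* translation aliases, and an extension that still imports them fails on newer releases until someone adds a small compatibility shim along these lines.

    # Hypothetical compatibility shim, not actual Unglue.it code: an extension
    # that still imports Django's old ugettext_lazy alias breaks on newer
    # Django releases, which keep only gettext_lazy.
    try:
        from django.utils.translation import ugettext_lazy as _
    except ImportError:  # newer Django releases dropped the ugettext_* aliases
        from django.utils.translation import gettext_lazy as _

    WELCOME_MESSAGE = _("Welcome to the free ebook catalog")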

I deploy Unglue.it on AWS, which is DEFINITELY infrastructure. I have a love/hate relationship with AWS because it works so well, but every time I need to change something, I have to spend 2 hours with documentation to find the one-line incantation that makes it work. But every few months, the cost of using AWS goes down, which I like, but the money goes to Amazon, which is ironic because they really don't care for the free ebooks we distribute.

Aside from AWS and Django, the infrastructure I use to deliver Ebook Foundation services includes Python, Docker, Travis-CI, GitHub, git, Ubuntu Linux, MySQL, Postgres, Ansible, Requests, Beautiful Soup, and many others. The Unglue.it database relies on infrastructure services from DOAB, OAPEN, LibraryThing, Project Gutenberg, OpenLibrary and Google Books. My development environment relies heavily on BBEdit and Jupyter. We depend on Crossref and Internet Archive to resolve some links; we use subject vocabulary from Library of Congress and BISAC.

You can imagine why I was interested in "JROST 2020", which turns out to stand for "Joint Roadmap for Open Science Tools 2020", a meeting organized by a relatively new non-profit, "Invest in Open Infrastructure" (IOI). The meeting was open and free, and despite the challenges associated with such a meeting in our difficult times, it managed to present a provocative program along with a compelling vision.

If you think a bit about how to address the infrastructure needs of open science and open scholarship in general, you come up with at least 3 questions:

  • How do you identify the "leaky pipes" that need fixing so as to avoid systemic collapse?
  • How do you bolster healthy infrastructure so that it won't need repair?
  • How do you build new infrastructure that will be valuable and thrive?

If it were up to me, my first steps would be to:

  1. Get people with a stake in open infrastructure to talk to each other. Break them out of their silos and figure out how their solutions can help solve problems in other communities.
  2. Create a "venture fund" for new needed infrastructure. Work on solving the problems that no one wants to tackle on their own.

Invest in Open Infrastructure is already doing this! Kaitlin Thaney, who's been Executive Director of IOI for less than a year, seems to be pressing all the right buttons. The JROST 2020 meeting was a great start on #1, and #2 is the initial direction of the "JROST Rapid Response Fund", whose first round of awards was announced at the meeting.

Among the first awardees of the JROST Rapid Response Fund announced at JROST 2020 was 2i2c, an organization that ties into the infrastructure I use. It's a great example of much-needed infrastructure for scientific computing, education, digital humanities and data science. 2i2c aims to create hosted interactive computing environments that run in the cloud and are powered by entirely open-source technology (Jupyter). As I'm a Jupyter user and enthusiast, this makes me happy.

But while 2i2c is the awardee, it's being built on top of Jupyter. Is Jupyter also infrastructure? It needs investment too, doesn't it? There's a lot of overlap between the Jupyter team and the 2i2c team, so investment in one could be investment in the other. In fact, Chris Holdgraf, Executive Director of 2i2c, told me that "we see 2i2c as a way to both increase the impact of Jupyter in the research/education community, and a way to more sustainably drive resources back into the Jupyter community."

Open Science Infrastructure Interdependency (from "Scoping the Open Science Infrastructure Landscape in Europe", https://doi.org/10.5281/zenodo.4153809)


Where does Jupyter fit in the infrastructure landscape? It's nowhere to be seen on the neat "interdependency map" presented by SPARC EU at JROST. If 2i2c is an example of investment-worthy infrastructure, maybe the best way to think of Jupyter is "infra-infrastructure" - the open information infrastructure needed to build open information infrastructure. "Trickle-down" investment in this sort of infrastructure may be the best way to support projects like Jupyter so they stay open and are widely used.

But wait... Jupyter is built on top of Python, right? Python needs people investing in it too. Is Python infra-infra-infrastructure? And Python is built on top of C (I won't even mention Jython or PyJS), right?? Turtles all the way down. Will 2i2c eventually get buried under other layers of infrastructure, be forgotten and underinvested in, only to be one day excavated and studied by technology archeologists?

Looking carefully at the interdependency map, I don't see a lot of layers. I see a network with lots of loops. And many of the nodes are connectors themselves. ORCID and Crossref resemble roads, bridges and plumbing not because they're hidden underneath, but because they're visible and in-between. They exist because the entities they connect cooperate to make the connection robust instead of incidental. They're not infra-infrastructure, they're inter-infrastructure. Trickle-down investment probably wouldn't work for inter-infrastructure. Instead, investments need to come from the communities that benefit, so that those communities can decide how to manage access to the inter-infrastructure to maximize the community benefit.

There's another type of infrastructure that needs investment. I work in ebooks, and a lot of overlapping communities have tackled their own special ebook problems. But the textbook people don't talk to the public domain people don't talk to the monograph people don't talk to the library people. (A slight exaggeration.) There are lots of "almost" solutions that work well for specific tasks. But with the total amount of effort being expended, we could do some really amazing things... if only we were better at collaborating.

For example, the Jupyter folks have gotten funding from Sloan for the "Executable Book Project". This is really cool. Similarly, there's Bookdown, which comes out of the R community. And there are other efforts to give ebooks the functionality that a website could have. Gitbook is a commercial open-source effort targeting a similar space; Rebus, a non-profit, is using Pressbooks to gain traction in the textbook space, while MIT Press's PubPub has similar goals.

I'll call these overlapping efforts "para-infrastructure." Should investors in open infrastructure target investment in "rolling up" or merging these efforts? When private equity investors have done this to library automation companies the results have not benefited the user communities, so I'd say "NO!" but what's the alternative?

I've observed that the folks who are doing the best job of just making stuff work rarely have the time or resources to go off to conferences or workshops. Typically, these folks have no incentive to do the work to make their tools work for slightly different problems. That can be time consuming! But it's still easier than taking someone else's work and modifying it to solve your own special problem. I think the best way to invest in open para-infrastructure is to get lots of these folks together and give them the time and incentive to talk and to share solutions (and maybe code). It's hard work, but making the web of open infrastructure stronger and more resilient is what investment in open infrastructure is all about.

Different types of open infrastructure benefit from different styles of investment; I'm hoping that IOI will build on the directions exhibited by its Rapid Response Fund and invest effectively in infra-infrastructure, inter-infrastructure, and para-infrastructure. 

 Notes

1. Geoff Bilder and Cameron Neylon have a nice discussion of many of the issues in this post: Bilder G, Lin J, Neylon C (2016) "Where are the pipes? Building Foundational Infrastructures for Future Services", retrieved [date], http://cameronneylon.net/blog/where-are-the-pipes-building-foundational-infrastructures-for-future-services/

2. "Trickle-down" has a negative connotation in economics, but that's how you feed a tree, right?

Monday, October 19, 2020

We should regulate virality

It turns out that virality on internet platforms is a social hazard! 

Living in the age of the Covid pandemic, we see around us what happens when we let things grow exponentially. The reason that the novel coronavirus has changed our lives is not that it's often lethal - it's that it found a way to jump from one infected person to several others on average, leading to exponential growth. How many of us get infected depends not on the lethality of the virus, but only on its reproduction rate.

For years, websites have been built to optimize virality of content. What we see on Facebook or Twitter is not shown to us for its relevance to our lives, its educational value, or even its entertainment value. It's shown to us because it maximizes our "engagement" - our tendency to interact and spread it. The more we interact with a website, the more money it makes, and so a generation of minds has been employed in the pursuit of more engagement. Sometimes it's cat videos that delight us, but more often these days it's content that enrages and divides us.

Our dissatisfaction with what the internet has become has led to calls to regulate the giants of the internet. A lot of the political discourse has focused on "Section 230" (https://en.wikipedia.org/wiki/Section_230), a part of US law that gives interactive platforms such as Facebook a set of rules that result in legal immunity for content posted by users. As might be expected, many of the proposals for reform have sounded attractive, but the details are typically unworkable in the real world, and often would have effects opposite of what is intended.

I'd like to argue that the only workable approaches to regulating internet platforms should target their virality. Our society has no problem with regulations that force restaurants, food preparation facilities, and even barbershops to prevent the spread of disease, and no one ever complains that the regulations affect "good" bacteria too. These regulations are a component of our society's immune system, and they are necessary for its healthy functioning.

never going to give you covid

You might think that platform virality is too technical to be amenable to regulation, but it's not. That's because of the statistical characteristics of exponential growth. My study of free ebook usage has made me aware of the pervasiveness of exponential statistics on the internet. Sometimes labeled the 80-20 rule, the Pareto principle, or log-normal statistics, it's the natural result of processes that grow at a rate proportional to their size. As a result, it's possible to regulate the virality of platforms because only a very small amount of content is viral enough to dominate the platform. Regulate that tiny amount of super-viral content, and you create an incentive to moderate the virality of platforms. The beauty of doing this is that a huge majority of content is untouched by regulation.
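
Here's a rough sketch of that statistical argument - a toy simulation, not real platform data, with made-up parameters (100,000 items, 20 rounds of sharing, an arbitrary spread factor, a million-view threshold) - showing how growth proportional to size concentrates views in a tiny sliver of content:

    # Toy simulation, not real platform data: each item's audience grows
    # multiplicatively, which yields a log-normal distribution of view counts.
    import random

    random.seed(42)

    views = []
    for _ in range(100_000):                      # 100,000 pieces of content
        v = 1.0
        for _ in range(20):                       # 20 rounds of sharing
            v *= random.lognormvariate(0.0, 0.8)  # growth proportional to size
        views.append(v)

    views.sort(reverse=True)
    total = sum(views)
    top_1_percent_share = sum(views[:1000]) / total
    over_a_million = sum(1 for v in views if v > 1_000_000)

    print(f"top 1% of items get {top_1_percent_share:.0%} of all views")
    print(f"items with over a million views: {over_a_million} out of 100,000")

Under these made-up numbers, the top 1% of items ends up with the overwhelming majority of views and only a handful of items cross the million-view line, which is why a rule aimed at that line leaves almost all content alone.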

How might this work? Imagine a law that removed a platform's immunity for content that it shows to a million people (or maybe 10 million - I'm not sure what the cutoff should be). This makes sense, too; if a platform promotes illegal content in such a way that a million people see it, the platform shouldn't get immunity just because "algorithms"! It also makes it practical for platforms to curate the content for harmlessness - it won't kill off the cat videos! The Facebooks and Twitters of the world will complain, but they'll be able to add antibodies and T-cells to their platforms, and the platforms will be healthier for it. Smaller sites will be free to innovate, without too much worry, but to get funding they'll need to have plans for virality limits.

So we really do have a choice: healthy platforms with diverse content, or cesspools of viral content. Doesn't seem like such a hard decision!

Sunday, September 6, 2020

Notes on work-from-home teams

I've been working from home full-time for over eleven years - at least partly work-from-home for 20 years. I've managed work-from-home teams, and worked with quite a few others on joint projects. So when some colleagues were sharing their work-from-home experiences, I piped up with some thoughts. When I was asked recently to repeat them, I realized it might be useful to make a list for the blog. Old-style.

SO...

  1. In-person time is super-valuable. It builds a foundation for the digital interactions we're all stuck with for a while.
  2. Engineers in particular are prone to under-communicate, so a manager has to pro-actively push people to communicate more than they would on their own ...
  3. ... and create a safe environment that promotes asking for help.
  4.  Most remote workers need an extra helping of encouragement and positive reinforcement...
  5. ... doubly so for people prone to self-doubt or imposter syndrome.
  6. Worker depression is the hardest thing for a work-from-home team to manage.
  7. Trust is the most important attribute for work-from-home teams, and it has to be mutual in any type of relationship.

I think most of these are self-explanatory. In the current environment, the first point is not so helpful for teams that haven't banked some in-person time; non-work activities, remote meal-sharing and happy hours are imperfect substitutes for the real thing.

The point about worker depression is worth emphasizing. It's a real hazard, often without easy mitigations. For me, daily exercise and intentional social interaction are the most effective medicine, but everyone is different. A work-from-home team needs time, space, and often support to figure out what works.



Tuesday, December 3, 2019

Your Identity, Your Library

Today, your identity on the Internet is essentially owned by the big email providers and social networks. Google, Yahoo, Facebook, Twitter - chances are you use one of these services to conveniently log into other services as YOU. You don't need to remember a new password for each service, and the service providers don't have to verify your "identity". What you gain in convenience, you lose in privacy, and that's turned out really well, hasn't it?

The "flow" you use to take advantage of this single sign-in is a "dance" that takes you from website to website and back to the site you're logging into. A similar dance occurs to secure access to resources licensed on you behalf by libraries, institutions, corporations, etc.. I wrote a bunch of articles about "RA21" (now rebranded as the vaguely NSFW "SeamlessAccess"), an effort spearheaded by STM publishers to improve the user experience of that dance. (It can be complicated and confusing because there are lots of potential dance partners!)

Henri Matisse, La danse (first version) 1909

These dance partners style themselves as "identity providers". That label makes me uncomfortable. Identity can't be something that can be stripped from you on the whim of a megacorporation. Instead, internet identity should be woven from a web of relationships. These can be formed digitally or face-to-face, global or local, business or personal.

You'd have thunk that the whole identity-on-the-internet thing would have improved in the 13 years since that login dance was first rolled out. And you'd be almost right, because a new architecture for internet identity is now on the horizon. Made possible by many of the same technologies that are securing the internet and inflating the blockchain bubble, massively distributed and even "self-sovereign identity" are becoming real-ish.

These technologies will inevitably be applied to the access authorization problem. Access via distributed identity replaces the website-to-website dance with the presentation of some sort of signed credential. A service provider verifies the signature against the signer's public key. It's like showing a passport that can't be forged. A tricky bit is that the credential also needs to be checked against a list of revoked credentials. This would have been cumbersome even ten years ago, but distributed databases are now a mature technology, versions of which underpin the internet itself.
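
To make that concrete, here's a minimal sketch of the verify-then-check-revocation step, using Ed25519 signatures from Python's cryptography package. The credential format and the revocation set are invented for illustration - real systems use standards like the W3C Verifiable Credentials model mentioned below - so treat this as a sketch of the idea, not an implementation of any particular scheme:

    # Minimal sketch: a service provider accepts a credential only if the
    # issuer's signature verifies and the credential isn't on a revocation
    # list. Credential format and revocation set are made up for illustration.
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    # The issuer (say, a library) signs a claim about the user.
    issuer_key = Ed25519PrivateKey.generate()
    credential = b'{"id": "cred-42", "holder": "alice", "claim": "library-member"}'
    signature = issuer_key.sign(credential)

    # The service provider needs only the issuer's public key and a revocation list.
    issuer_public_key = issuer_key.public_key()
    revoked_ids = {"cred-17", "cred-23"}  # hypothetical revoked credential ids

    def accept(credential: bytes, signature: bytes, credential_id: str) -> bool:
        """Accept only if the signature checks out and the credential isn't revoked."""
        try:
            issuer_public_key.verify(signature, credential)  # raises if forged
        except InvalidSignature:
            return False
        return credential_id not in revoked_ids

    print(accept(credential, signature, "cred-42"))                 # True
    print(accept(credential + b" tampered", signature, "cred-42"))  # False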

Interlinked with the concept of distributed identity is the notion that users of the web should be able to securely control their data, and that decisions about what a web site gets to know about you should not be delegated to advertising networks.

Unfortunately, we're not quite ready for distributed identity, in the sense that implementation for today's web would require users to install plugin software, which has its own set of usability, privacy and security issues. The ideal situation would be for some sort of standardized distributed identity and secure data management capability to be installed in browser software - Chrome, Firefox, Safari, etc.

There's a lot of work going on to make this happen.
  • ID2020 has put out an identity manifesto that starts with the declaration that "The ability to prove one’s identity is a fundamental and universal human right."
  • Tim Berners-Lee is leading the Solid Project, which lets you "move freely between services, reuse data across apps, connect with anyone, and select what you share precisely".
  • The W3C Verifiable Claims Working Group has published Technical Recommendations for "Verifiable Credential Use Cases" and a "Verifiable Credential Data Model". They observe that "from educational records to payment account access, the next generation of web applications will authorize entities to perform actions based on rich sets of credentials issued by trusted parties."
  • The Sovrin Network is a "new standard for digital identity – designed to bring the trust, personal control, and ease-of-use of analog IDs – like driver’s licenses and ID cards – to the Internet."
  • Kaliya Young, Doc Searls and Phil Windley have been convening the Internet Identity Workshop twice a year since 2005 to create a community centered around internet identity. A glance at prior year proceedings gives a flavor of how much is happening in the field.
The common thread here is that users, not unaccountable third parties, should be able to manage their identity on the internet, while at the same time creating a global chain of trust.

It seems to me that there's a last-mile problem with all these schemes. If identity is really a universal human right, how do we create a chain of trust that can include every human? That problem becomes a lot easier to solve if there were some sort of organization with a physical presence in communities all over, trusted by the community and by other organizations. A sort of institution experienced in managing information access and privacy, and devoted to the needs of all sorts of users.

In other words, what if "libraries" existed?

The federated authentication systems used by libraries today - Shibboleth, Athens, and related systems - use a dance similar to what you do with Google or Facebook. It's a big step that moves your internet identity away from "surveillance capitalists" towards community institutions. But you still don't have control over what data your institution gives away, as you will in the next-generation internet identity systems I describe here. (RA21 is no different from Shib or Athens in this respect.)

What might libraries do to prepare for the age of distributed identity? The first step is not about technology, it's about mission. I believe libraries should start to think of themselves as internet relationship providers for their communities. When I get access to a resource through my library, I won't be "logging in", I'll be asserting a relationship with a library community, and the library will be standing behind me. Joining an identity federation is a good next step for libraries. But the library community needs to advocate for user identity as a basic human right and prepare their systems to support a future where no dancing is required.

Update 12/5/2019: revised last two paragraphs to be less mystifying.


Friday, July 26, 2019

Four-Leaf Clovers

It seems a friend of mine collects four-leaf clovers.

When I was a kid, I loved looking for four-leaf clovers in the lawn. It was the same sort of relaxing concentration and observation you use to find a piece of a jigsaw puzzle. But one day, I found a clover plant in front of the garage that had multiple four-leaf clovers. Looking carefully, I found that not only were there four-leaf clovers, but there were FIVE-LEAF CLOVERS. I had hit the jackpot. And even a six-leaf clover!!!! I swear to all of God's integers that I even found a SEVEN-leaf clover. I saved that seven-leaf clover in my box of treasures for years, until I just had seven crumbling leaves of a clover.

I never looked for a four-leaf clover again.

Now, whenever I remember that clover plant (and that garage), I think of the toxins that must have caused the polyfoliate abomination.

Please don't let my story stop you looking for four-leaf clovers! Happy summer!