Tuesday, October 8, 2024 — 11:54

Follow-up on twtxt v2

I got some feedback on my post commenting on some of the “twtxt v2” discussion, and there’s also, of course, been a bunch of other discussion on the topic. So a follow-up (and a small new thing at the end).

Follow-up

UUID vs. URLs in twtxt files for hashes

There were a number of questions about using UUIDs instead of URLs in the twtxt file, specifically as a source for hashing. Using URLs here gains you nothing, and costs you confusion, since using URLs implies things which aren’t (reliably) true of their use here (things about validity, reachability, reachability at a particular time, &c). Their use, as defined in the v2 spec, is that of a non-intelligent identifier; something UUIDs are much more suitable for.

The argument against using UUIDs is mainly:

But the URL tags in the v2 proposal aren’t validated, either, and given that they represent past URLs, can’t be (except for the most recent one). And for what they’re defined as being used for (generating hashes), that’s fine — they don’t need to be validated.

But also: so what? If you copy my UUID and also have an identical line (timestamp and content) as mine, you can claim to have said the same thing at the same time. …okay? Once you get past the idea of the hashes as being useful for universal reference (and they can’t ever be useful for that purpose in this sort of distributed system), that’s just not a problem.
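
To make the "non-intelligent identifier" point concrete, here is a minimal sketch of hashing a feed line keyed by whatever identifier the feed declares. It assumes a scheme modeled on the Yarn twt hash extension (BLAKE2b-256, base32, lowercased, truncated); the v2 draft's exact algorithm and length may differ, and `twt_hash`, the example URL, and the example UUID are all hypothetical names chosen for illustration.

```python
import base64
import hashlib
import uuid


def twt_hash(feed_id: str, timestamp: str, content: str, length: int = 7) -> str:
    """Hash one feed line, keyed by an opaque feed identifier.

    Assumption: BLAKE2b-256 + base32 + truncation, loosely following the
    Yarn twt hash extension; not necessarily what the v2 draft specifies.
    """
    payload = f"{feed_id}\n{timestamp}\n{content}".encode("utf-8")
    digest = hashlib.blake2b(payload, digest_size=32).digest()
    encoded = base64.b32encode(digest).decode("ascii").rstrip("=").lower()
    return encoded[-length:]


# Hypothetical identifiers: the hash treats either one as plain bytes.
url_id = "https://example.com/twtxt.txt"
uuid_id = str(uuid.UUID("12345678-1234-5678-1234-567812345678"))

ts = "2024-10-08T11:54:00Z"
text = "Hello, twtxt!"

# Nothing here validates or fetches the identifier; it is only a hash key.
hash_from_url = twtxt_hash = twt_hash(url_id, ts, text)
hash_from_uuid = twt_hash(uuid_id, ts, text)
```

Nothing in the function cares whether the identifier is reachable, current, or even a URL. Anyone who copies the identifier and posts an identical line gets an identical hash, whichever kind of identifier is used.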

James gets the response exactly right here:

Again: yes, exactly. Using URLs as the hash key doesn’t gain you any sort of protection. If you want real “referential integrity” in a distributed system like this, you need cryptographic signatures, which means you also need to adopt some sort of solution to PKI or similar. And twtxt as a file format cannot simply piggyback on the https PKI, even if we were willing to mandate transmission over https (which we shouldn’t), because we’re talking about the contents of the file, which can be edited by hand by the author and should be considered untrustworthy (in the sense of cryptography or integrity).

Also:

There’s another issue with using the url metadata as defined in the draft spec. If, as is the most common case, you append entries to the file as you post them, your current url would be the last one. Which means if you want the current one in the header (as has sometimes been suggested as a way of providing a canonical address, for example if the same file is reachable via multiple transports; a much more sensible use for a url tag), you have to list that, then list the old one if it ever changes, then the current one again later. Ugh.

Replies — clarification

In the current spec (v1 plus the “subject” extension), finding an entire conversation, unless someone intentionally forks it, is essentially just “grep #abc123” with the right hash. Clients (and human readers) just assume a flat threading structure by default, read things in order, and use normal text/conversational cues to figure out the threading. Re-calculating the hash against a “child” in the conversation is an exception, explicitly used to intentionally fork the thread and start a new one. The v2 spec requires each reply to re-calculate the hash of the specific entry being replied to, inverting the logic from v1. Thus the “more work” comment: to reconstruct a thread, in this understanding of v2, I need to find the post matching the hash in my reply, then find the hash of anything that post references, and so on up the chain. This means storing the hash of every line (which might not match anyway), recalculating on each reference, or both. And there’s still the other problem I reference: if I’m not following everyone in a conversation, I simply won’t have those things to check, and the thread must break in that case.

Put another way: there’s a difference between a reply and a fork. Every reply should have the same hash in the reply-to: field, and that should be what a user gets when they tell their client to “reply”. Clients should not be expected to track conversations back across forking points (so, in the example in 5.2, if I ask my client to show me the thread in Dave’s reply, I would expect to see a thread starting with Bob’s post, not Alice’s).

James, in a comment elsewhere, seemed to imply this was not the intent, so maybe this is just an error in the language of the standard document. If so, “the twt being replied to” in 5.1 needs to be clarified so that it refers to the original post in a thread (not the one being directly replied to). As written, 5.1 implies the forking in 5.2 is the default, because of that ambiguity.

There’s a related problem to all this: reliably finding the “head” in a twtxt thread isn’t really possible. Your client would need to have already fetched the original and computed and stored a hash.
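
The flat, v1-style lookup described in this section can be sketched in a few lines. This is an illustration under assumptions, not any client's actual implementation: it assumes entries are "timestamp, tab, text", that the subject is written as "(#hash)" somewhere in the text, and `thread_lines` is a hypothetical helper name.

```python
import re

# Assumption: the subject is written as "(#hash)" somewhere in the line's text.
SUBJECT = re.compile(r"\(#([a-z0-9]+)\)")


def thread_lines(lines, subject_hash):
    """Return (timestamp, text) pairs whose subject matches the given hash."""
    matches = []
    for line in lines:
        if line.startswith("#") or "\t" not in line:
            continue  # skip comments, metadata, and malformed lines
        timestamp, text = line.rstrip("\n").split("\t", 1)
        found = SUBJECT.search(text)
        if found and found.group(1) == subject_hash:
            matches.append((timestamp, text))
    # Plain string sort: chronological only if every timestamp
    # uses the same UTC format.
    return sorted(matches)


sample = [
    "# nick = alice",
    "2024-10-08T11:00:00Z\tFirst post",
    "2024-10-08T11:05:00Z\t(#abc1234) a reply",
    "2024-10-08T11:02:00Z\t(#abc1234) an earlier reply",
    "2024-10-08T11:06:00Z\t(#zzz9999) a different thread",
]
replies = thread_lines(sample, "abc1234")
```

Note what the sketch does not return: the first post itself. Its line carries no subject, so the only way to recognize it as the head is to have already fetched it and computed its hash.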

UTF-8

I’m a little surprised this is even slightly contentious. I think this is implicitly true in practice today; we should just make the spec say it. I don’t want twtxt clients or servers to have to navigate any of the HTTP “Content-Type” header complexity. Mainly because simplicity is a primary goal of the design of twtxt and this breaks it, but also because the design here doesn’t (and shouldn’t) assume transmission over HTTP. Gemini or raw file access aren’t going to give you a Content-Type; even when serving HTTP over hosting you don’t directly manage, that can get questionable. Lacking that, either we define a common format or tell the clients “figure it out; good luck!” which seems like a bad plan.

Please nobody suggest sticking the content type in more metadata. 🙄
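
For a sense of how little a client needs if the spec simply says "UTF-8", here is a minimal sketch. `decode_feed` is a hypothetical helper name; the only assumption is that the client already has the raw bytes, from whatever transport.

```python
def decode_feed(raw: bytes) -> str:
    """Decode feed bytes as UTF-8, ignoring any transport-level charset hint."""
    # Strict decoding: invalid input raises UnicodeDecodeError instead of
    # being silently guessed at.
    return raw.decode("utf-8")


sample_bytes = "2024-10-08T11:54:00Z\tcafé 🙄".encode("utf-8")
decoded = decode_feed(sample_bytes)
```

The same function works whether the bytes came over HTTP, Gemini, or a local file, because it never looks at a header.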

Timezones vs. UTC

I don’t feel strongly about this one. UTC makes implementing clients (and anything else that wants to process the files) a bit easier, since you don’t have to do any sort of conversion on ingress, storage, comparison, &c, only (optionally) on display (and what you’re converting to on display might change over time), but none of that’s really all that big a deal. The biggest argument for UTC for me is that the current hashing definition requires the timestamps be converted to UTC anyway. I don’t find the argument that the timezone offset gives a sense of the author’s location terribly compelling, but it’s also not any sort of negative, so fine.

I’d like to see UTC mandated for simplicity, but it’s not critical.
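
The conversion that offsets force on every consumer is small but unavoidable. A minimal sketch, assuming RFC 3339 style timestamps; `to_utc` is a hypothetical helper name:

```python
from datetime import datetime, timezone


def to_utc(stamp: str) -> datetime:
    """Parse an RFC 3339 style timestamp and convert it to UTC."""
    # Older Pythons' fromisoformat() rejects a trailing "Z", so rewrite it.
    if stamp.endswith("Z"):
        stamp = stamp[:-1] + "+00:00"
    parsed = datetime.fromisoformat(stamp)
    if parsed.tzinfo is None:
        raise ValueError("timestamp has no UTC offset")
    return parsed.astimezone(timezone.utc)


# Two spellings of the same instant compare equal once normalized.
local = to_utc("2024-10-08T13:54:00+02:00")
zulu = to_utc("2024-10-08T11:54:00Z")
```

If UTC were mandated, the `astimezone` step (and the need to normalize before hashing or comparing) would disappear, leaving conversion as a display-only concern.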

New matter

Use square avatar images

Having used the twtxts above in writing this post, I’ve got one small additional thing I’d like to mandate in twtxt: can we please specify that avatar images should be square? We don’t have any way of doing content selection here, and while we could define some sort of syntax encoding the dimensions, that’s a lot of work for what we could solve with a simple mandate that would satisfy the overwhelming majority of use cases. Even within yarn, where I’m seeing the most non-square ones, they’re coerced to square on display. This should be a SHOULD, not a MUST, in the written standard.

Yes, there’s a bit of a “least common denominator” thing here; so be it.

Mentions

I’ve had this argument with IndieWeb folks elsewhere before: URLs are wonderful, powerful things that pack in tremendous utility. They’re also… ugly. That alone makes them terrible to use for identity; they also don’t really match the logical “identity” space well. This is why Mastodon and compatible services use webfinger to resolve @user@domain addresses to the URL as used in ActivityPub.

I’ve argued that we should keep the specification of the twtxt file format mostly separate from discussions about client and server behavior, and certainly separate from protocols those things might use to communicate. That’s still true, and my focus is absolutely on the file format. But if we’re going to start talking broader protocol/ecosystem things, doing what Mastodon did, adding webfinger for identity resolution, makes sense.

My suggestion would be to add a new webfinger link type that could be added to the response when requesting a result for resource=acct:user@domain, which would point to the twtxt file. Clients could then resolve a mention to that file. To allow for clients which can’t or don’t want to implement webfinger, any standard on this should specify that clients SHOULD implement this new webfinger-based addressing for mentions, but MUST continue to support the URL-based referencing.
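
A rough sketch of what the client side of that lookup might look like. The link relation below is purely hypothetical (no such "rel" value has been registered or proposed anywhere), as are the helper names; the query URL shape and the "links" array follow the WebFinger standard. Network fetching is left out on purpose.

```python
from urllib.parse import urlencode

# Hypothetical link relation, invented for this example only.
TWTXT_REL = "https://example.org/rel/twtxt-feed"


def webfinger_url(mention: str) -> str:
    """Build the WebFinger query URL for a mention like "@user@domain"."""
    user, _, domain = mention.lstrip("@").partition("@")
    if not user or not domain:
        raise ValueError(f"not a user@domain mention: {mention!r}")
    query = urlencode({"resource": f"acct:{user}@{domain}"})
    return f"https://{domain}/.well-known/webfinger?{query}"


def feed_url(jrd: dict):
    """Return the feed URL from a parsed JRD response, or None if absent."""
    for link in jrd.get("links", []):
        if link.get("rel") == TWTXT_REL:
            return link.get("href")
    return None


# A made-up response body, already parsed from JSON.
example_jrd = {
    "subject": "acct:alice@example.com",
    "links": [{"rel": TWTXT_REL, "href": "https://example.com/twtxt.txt"}],
}
lookup = webfinger_url("@alice@example.com")
resolved = feed_url(example_jrd)
```

A client that gets `None` back, or that doesn't implement webfinger at all, would fall back to the URL-based mention it is still required to support.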

The only proposal which seems to have gotten any traction here is the nick metadata tag, but that doesn’t really help. When I read Alice’s twtxt file and she mentions Bob, there’s no reason to expect I already follow Bob; in such a case, I won’t have ever seen his nick declaration and will be unable to dereference Alice’s mention.

I don’t love that Mastodon decided to go with addresses that look so much like email addresses but (typically) aren’t. In the context of ActivityPub that’s probably a lost cause at this point. We could, in theory, use a less confusable format for twtxt, but given the momentum behind the Mastodon-style format, and especially given the comparatively technical nature of twtxt users, the benefits of reusing it probably outweigh the risk of confusability.