Wednesday, September 25, 2024 — 00:32

Notes on twtxt v2

This is in response to the 2024-09-25 draft of the twtxt v2 specification.

tl;dr:

* The hashing changes are a mixed bag; I think I have a better option.
* The 'url' tag changes, needed for the hashing changes, are bad.
* reply-to is a nice improvement over #hash pseudo-subjects, but...
* ...the threading model is kinda backwards.
* This makes too many demands of clients/servers; separate that from the format.

Overall Impression

I appreciate the effort to formalize the specification and revisit some rough edges. It is notable that this makes significantly firmer demands of client and server implementations than the original twtxt spec did. While some of that is helpful, this version also seeks to treat the file format as a protocol for compliant implementations. While format-as-protocol isn’t without merit, I think this draft blurs those lines in places. Twtxt should remain a good human-readable plain text file format first.

Overall, I think we should have a specification for the twtxt file format that is separate from any client/server behavior requirements. Keep them as separate docs, partly just to keep things clear when writing the spec.

Specific Issues

3 Message Format (Twts)

3.2 Timestamps

It’s a change that we’re mandating UTC. I like it (timezones suck), but I wonder if there should be language about clients converting non-UTC timestamps to UTC when they encounter them. As this is one of the few breaking changes from twtxt v1, it seems worth enabling a migration path.
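To make the migration idea concrete, here is a minimal Python sketch of what a client could do when it meets a v1-style timestamp with an offset. The function name `to_utc` is mine, not the draft’s, and treating an offset-less timestamp as UTC is an assumption the spec would have to settle.

```python
from datetime import datetime, timezone

def to_utc(stamp: str) -> str:
    """Normalize an RFC 3339 timestamp with any offset to UTC with a 'Z' suffix."""
    # Older Pythons' fromisoformat() rejects a trailing "Z", so spell it out.
    if stamp.endswith(("Z", "z")):
        stamp = stamp[:-1] + "+00:00"
    parsed = datetime.fromisoformat(stamp)
    if parsed.tzinfo is None:
        # Assumption: a timestamp with no offset is treated as already UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    # Fractional seconds are dropped by this format string.
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

With this, `to_utc("2024-09-25T00:32:00-04:00")` gives `2024-09-25T04:32:00Z`, so a v1 line can be read without asking the author to rewrite history.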

3.3 Content

reply-to

I think the reply-to: is a revision of the “subject” overloading used for threading in some v1 clients. I like the fact that it’s not overloading a hashtag and can be stuck at the end now. Given that it’s a change, the language should clarify that it can appear anywhere in the twt (the example says this, but the spec should say so here).

I have concerns about the threading model ‘reply-to’ is attached to (see Section 5 discussion). Depending on how those changes are made, it might make more sense as ‘thread’ (but I don’t think that matters too much).

Should we have a way to escape this? Leading backslash? Or just say “don’t do that”?

No “Edits” or “Deletions”

Edits and Deletions should go; see also Section 6. This is probably the worst example of this document pushing a text document to do more protocol-like things.

3.4 Multi-Line Twts

The multiline extension isn’t so bad on its own, particularly since it’s easy enough to ignore (substitute with ‘ ’ in simple/line-oriented clients). But when it was added as an extension in v1, I was pretty sure it was a slippery slope to doing additional formatting that would make the format harder and harder to handle in simple clients. And with v2 we are careening down that well-lubricated hill. I still don’t mind this section on its own, but it enables bad things (see below).
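For reference, “easy enough to ignore” can be as small as the Python sketch below. It assumes the multi-line marker is the Unicode LINE SEPARATOR (U+2028), as in the v1 extension; the name `flatten` is hypothetical.

```python
# Assumption: multi-line twts mark line breaks with U+2028, as the v1 extension does.
LINE_SEPARATOR = "\u2028"

def flatten(text: str) -> str:
    """Collapse a multi-line twt to a single line for line-oriented display."""
    return text.replace(LINE_SEPARATOR, " ")
```

A line-oriented client can run every twt through this and otherwise not care that the extension exists.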

3.5 Special Formatting

I dislike all the parts of this that aren’t plausibly usable on single-line entries. The elements in Emphasis, Inline Code, Links, and Images are probably fine; Lists and Code Blocks should go.

4 Content Addressing

4.1 Hash Generation

Setting aside how the hash is generated, I think there’s something off about the structure of the doc here. It says clients MUST calculate this hash… but then doesn’t require them to do anything with it (other than store it, in 4.2, but then no action is required with what’s stored there). I think it should be rephrased to say something along the lines of “when this document refers to a ‘hash’, that hash must be generated as follows” (awkward phrasing there aside, you get the idea).

Moving on: what to do about the URL has been an open question with twtxt hashes for years. I’m still not sure we have a good answer. My inclination is to generate the hash from the URL you’re getting the file from, stripping the scheme (to address the most common cases of a single file being available in multiple locations).

It is an error in the spec that 4.1 needs “the latest ‘# url’ metadata field”, but the file is not required to have one. Either 4.1 needs to include language for a fallback (the correct option, IMHO) or 3.3 needs to require at least one ‘url’ (I really dislike that option when we already have a perfectly good URL we got the thing from).

I think it’s a mistake to key hashes on historical URLs. It makes individual lines in the file no longer free-standing and requires more re-reading of the file (e.g. I can’t just ‘tail’ the file to see what’s new and reply to that). That’s a loss.

My proposal: stop using ‘url’ metadata elements as significant for hashes. Instead, say authors SHOULD add a ‘uuid’ metadata element and use that, with a fallback to the URL the file is fetched from. You’ll get more resilient moves that’re harder to mess up and less work to do, and in the fallback case you’ll get threading that’s as fragile as twtxt v1 but not meaningfully worse.
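Here is a rough Python sketch of that proposal. Everything in it is illustrative: the function names are mine, the fallback strips the URL scheme and ignores any query string, and the hash recipe (blake2b-256, base32, last 7 characters) is modeled on the v1 hash extension rather than taken from the v2 draft.

```python
import base64
import hashlib
from typing import Optional
from urllib.parse import urlsplit

def feed_key(fetch_url: str, uuid: Optional[str] = None) -> str:
    """Pick the feed identity: the 'uuid' metadata value, else the scheme-less fetch URL."""
    if uuid:
        return uuid.strip().lower()
    parts = urlsplit(fetch_url)
    # Drop the scheme so http:// and https:// fetches of one file agree.
    # Query strings are ignored in this sketch.
    return parts.netloc.lower() + parts.path

def twt_hash(key: str, timestamp: str, text: str) -> str:
    """Assumed recipe modeled on the v1 extension: blake2b-256, base32, last 7 chars."""
    payload = "\n".join((key, timestamp, text)).encode("utf-8")
    digest = hashlib.blake2b(payload, digest_size=32).digest()
    return base64.b32encode(digest).decode("ascii").rstrip("=").lower()[-7:]
```

Note that the hash of a line depends only on that line plus one file-level value, so I can still ‘tail’ the file and reply to what’s new. Moving the file changes nothing as long as the ‘uuid’ comes along.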

4.2 Hash Usage

It’s an unnecessary burden on implementations to say they MUST store twts keyed this way. It should be a totally valid implementation choice to generate them as needed (which, for the vast majority of the ‘twts x users’ set, will be “never”).

5 Threading Model

I think there are three major problems with what’s described here. Overall, I think this is a regression from the v1+extensions version. I would remove this, give that version the formalization treatment the rest of this doc has done, and replace ‘reply-to’ with ‘thread’ or similar.

Reply to last

In a reverse-chronological social feed, the most common action in a conversation is a reply to the last post. In the system described, this yields a bad experience. A conversation that goes on in this most common way for any significant number of posts would, in the “Thread Visualization” section, quickly be indented off the right edge of the screen.

In the existing model, we take advantage of the reverse-chronological nature of these feeds to allow this “default” case to just fall out. And we can still generate a new hash to reply to a specific twt.

More work to backtrack

What’s described is significantly more work for implementors and client programs. Searching for everything with a given subject (in the v1+extensions terminology) and just displaying that reverse-chronologically is much less work, requires no recalculation on reading, and handles the common case well.
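To show how little work the subject approach is, here is a Python sketch. The in-memory shape (a list of timestamp/text pairs) and the `(#hash)` subject syntax are assumptions based on the v1 extensions, and the name `thread` is hypothetical.

```python
from typing import List, Tuple

# Hypothetical in-memory shape: (RFC 3339 UTC timestamp, twt text).
Twt = Tuple[str, str]

def thread(twts: List[Twt], subject: str) -> List[Twt]:
    """Return every twt carrying the given subject hash, newest first."""
    tag = f"(#{subject})"
    matching = [twt for twt in twts if tag in twt[1]]
    # Same-format UTC timestamps sort correctly as plain strings.
    return sorted(matching, key=lambda twt: twt[0], reverse=True)
```

There is no tree to build and nothing to recompute; a twt whose parent I never fetched still shows up in the right conversation.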

I don’t have everything

This model very much presumes a yarnd/twtxt.net style implementation. I may well not have every twtxt file involved in a conversation, and thus no way to know what the ‘reply-to’ hash is talking about, and therefore no way to put together the rest of the chain. In the “subject” model in the v1 extensions, this isn’t a problem (technically; obviously I can still miss actual human content).

6 Editing and Deleting Twts

This whole section should go. It’s a text file; users should be free to edit and delete twts by editing the file.

8 Feed Format (Twtxt)

Metadata should be collected up front

This is generally the convention today, and makes obvious sense when you think of the twtxt file as something you might want to read, or write manually (as was the initial intent of twtxt, and as I do regularly). For most of the metadata in v2, this still works fine, but the ‘url’ element breaks this: it can (and must, in certain circumstances) appear multiple times throughout the document, and the order matters. There is no indication of this as you read the doc until you hit an occurrence.

This is done to support the new hashing method. We shouldn’t.

I would prefer a statement that authors SHOULD collect all metadata up front. In fact, 8.1 already says that, conflicting with how ‘url’ is used.

9 Archives

Sequence-before-current is a bad choice

I think this is just an issue of a bad example, but as presented each archive file must be modified when a new one is added, as twtxt.1.txt becomes twtxt.2.txt and twtxt.2.txt becomes twtxt.3.txt — and every file older than “current” needs to be updated to reflect that. Use dates or hashes or something instead.

Hash verification, as presented, doesn’t do what’s claimed

The hash in the ‘prev’ metadata element is only the hash of the last twt in the previous archive file. It does not, therefore, “Ensure[…] that the feed has not been tampered with and that no twts are missing”.

If that’s a goal, a better approach is to avoid the issue with mandatory renaming above and remove the ‘url’ problem previously discussed, convert the hash on the ‘prev’ item to be a hash of the referenced file, and declare instead that archived feeds MUST NOT be modified (having now removed the need to).
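A sketch of the whole-file check in Python, to show the difference from hashing only the last twt. SHA-256 is my choice for illustration and the function names are hypothetical; the draft specifies neither.

```python
import hashlib

def archive_digest(data: bytes) -> str:
    """Hash an archive file's full contents, so any edit or dropped twt changes the result."""
    # Assumption: SHA-256 is used here purely for illustration.
    return hashlib.sha256(data).hexdigest()

def archive_intact(data: bytes, expected: str) -> bool:
    """Compare a fetched archive against the digest recorded in the newer file's 'prev'."""
    return archive_digest(data) == expected
```

Removing a twt from the middle of the archive changes the digest here, whereas a hash of only the last twt would still match.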

Personally, I don’t like this, though. I think these files should be editable freely. But if you want to get to the stated goal, that’s how you do it. And if you’re talking about verifying the current integrity (as opposed to historical), then an edit of the archive just requires recomputing the hashes to stay valid. That seems a sensible tradeoff for editing history, which should, one hopes, be uncommon.

Relative URLs are a mixed bag

I get what the relative URL buys here, but it comes at a cost of not being able to refer to history on another server. I’m not certain here, but I would probably re-work this to prefer relative URLs but allow for absolute, as well.
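Supporting both costs a client very little, since standard URL resolution already handles the two cases. A Python sketch (the name `resolve_prev` and the example URLs are hypothetical):

```python
from urllib.parse import urljoin

def resolve_prev(feed_url: str, prev_ref: str) -> str:
    """Resolve an archive reference against the feed it appeared in.

    A relative reference stays on the feed's server; an absolute one is
    returned unchanged, so history can live on another host.
    """
    return urljoin(feed_url, prev_ref)
```

For a feed at `https://example.com/~me/twtxt.txt`, a reference of `twtxt.1.txt` resolves to `https://example.com/~me/twtxt.1.txt`, while `https://old.example.org/archive/twtxt.1.txt` passes through untouched.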

10 Discovery

Keep Discovery optional

The discovery mechanism, currently an extension, becomes a mandatory part of the system in v2. This is a mistake.

10.1 User-Agent Header

Setting the User-Agent header is difficult in some environments. And would a compliant server be within spec to reject clients who didn’t do this? That feels just silly.

And while clients which fall into the “multiple followers” bucket in 10.1 are very likely to be based on web servers, that isn’t inherently true. For systems which aren’t, providing the “who follows” resource might not be possible or practical.

10.2 Who Follows Resource

Multiple output formats is a mistake

Twtxt is a text-based system, and should remain so. The “MUST” in 10.2 violates the spirit of that, while adding significant implementation complexity. This seems like a particularly bad place to do that, given how simple the text/plain format in question is and how easy it would be for a client which desires another format to do that conversion.

Underspecified

This section needs work regardless. Is that the nickname of the follower or the followee? Does the token do anything except gate access? Is it serving as an obscured query string for the twtxt file the request was made against?

11 Serving Feeds

In general, the requirement to use HTTP seems only in service of the discovery mechanism. Make that optional, remove 11.1 (or replace it with a general statement about how servers SHOULD offer encrypted access), and rename 11.3 (“Unsupported” is wrong here).