Research Archive

On the Choice Between UUIDs and Shorter Opaque Identifiers

A short note in response to a colleague's question about why this archive uses 24-character hexadecimal identifiers rather than the more conventional UUID. The choice is, in the end, a matter of fit — UUIDs are designed for problems that personal archives do not have.

The standard universally unique identifier, in its 8-4-4-4-12 hex form (550e8400-e29b-41d4-a716-446655440000), is 36 characters long, hyphens included. It is intended to let distributed systems generate identifiers without coordination, with a collision probability that is, for practical purposes, zero [1].
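For concreteness, the shape of such an identifier can be inspected with Python's standard uuid module. This is only a small sketch for illustration; the variable names are mine and nothing here is part of the archive's tooling.

```python
import uuid

# Generate a random (version 4) UUID using the standard library.
example = uuid.uuid4()
text = str(example)

# The canonical text form has five hyphen-separated groups: 8-4-4-4-12.
group_lengths = [len(group) for group in text.split("-")]

print(text)
print(len(text), group_lengths, example.version)
```

The 36 characters comprise 32 hex digits and four hyphens.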

What UUIDs are designed for

The use case behind RFC 4122 is unambiguous: distributed systems generating identifiers locally, possibly at high rates, without the opportunity to consult a central allocator. The probability of collision must be vanishingly small under those conditions, and the identifier must be self-contained — embedding the bits of randomness required for that guarantee.

A personal archive of, say, ten thousand entries faces none of these constraints. There is no distributed generation; identifiers could even be assigned sequentially by a single author. If they are drawn at random instead, the identifier space need only be large enough that a repeat is implausible at this scale; the cryptographic-grade randomness of UUIDs is, for this problem, simply unused.

A comparison

The table below summarises four common schemes for opaque identifiers, with the trade-offs that informed the choice for this archive:

Scheme              Length   Collision space   Practical use
UUID v4 (hex)       36       2^122             Distributed systems, large datasets
ULID                26       2^80              Time-sortable distributed IDs
Nano ID (default)   21       2^126             URL-safe random IDs
Hex (24 chars)      24       2^96              Personal archives, modest scale

The figures in the third column are illustrative — what they demonstrate is that all four schemes are dramatically over-provisioned for the scale of a personal archive. The practical question is not collision resistance, which is solved many times over, but length.
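The degree of over-provisioning can be made concrete with the standard birthday approximation, p ≈ 1 − exp(−n(n−1)/2N), for n identifiers drawn uniformly from a space of size N. The sketch below is a rough illustration under that uniformity assumption, using the ten-thousand-entry archive from earlier in this note; the function name collision_probability is hypothetical.

```python
import math

def collision_probability(n_items: int, bits: int) -> float:
    """Approximate probability of at least one collision (birthday bound)."""
    space = 2.0 ** bits
    # 1 - exp(-n(n-1) / (2N)); expm1 keeps precision when the result is tiny.
    return -math.expm1(-n_items * (n_items - 1) / (2.0 * space))

ARCHIVE_SIZE = 10_000  # the illustrative archive size used in this note

schemes = [("UUID v4", 122), ("ULID", 80), ("Nano ID", 126), ("Hex (24 chars)", 96)]
for name, bits in schemes:
    p = collision_probability(ARCHIVE_SIZE, bits)
    print(f"{name:15s} 2^{bits:<4d} p = {p:.3e}")
```

Even the smallest space in the table yields a probability far too small to matter at this scale, which is why length, not collision resistance, decides the matter.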

The argument for the shorter form

Length matters because the identifier is read aloud, embedded in prose, copied into footnotes, and occasionally typed by hand. At 36 characters, a UUID is awkward in citation form: it cannot be quoted in a sentence without unbalancing the line, and a misread digit is difficult to recover.

At 24 hex characters, the identifier is short enough to embed comfortably in correspondence yet long enough to remain unambiguously opaque [2]. The collision space of 2^96 is vastly larger than required for any plausible personal archive.

I should add: the argument is not against UUIDs in general. UUIDs are correct for the problem they were designed to solve. They are simply not the right tool for the problem of identifying entries in a small archive maintained by a single author.

A practical note

For the implementation: a hex string of 24 characters is generated by drawing 12 bytes from a cryptographically secure random source and encoding them as hexadecimal. In Python, secrets.token_hex(12) suffices. The resulting identifier is then the canonical handle for the entry, in line with the conventions discussed in an earlier note.
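A minimal sketch of that step, with a small shape check of the kind one might run before accepting an identifier typed by hand. The function names new_entry_id and is_entry_id are hypothetical, chosen for illustration; only secrets.token_hex is taken from the description above.

```python
import secrets

HEX_DIGITS = "0123456789abcdef"

def new_entry_id() -> str:
    """Return a 24-character lowercase hex identifier (96 random bits)."""
    return secrets.token_hex(12)  # 12 bytes -> 24 hex characters

def is_entry_id(candidate: str) -> bool:
    """Check the expected shape: exactly 24 lowercase hex characters."""
    return len(candidate) == 24 and all(c in HEX_DIGITS for c in candidate)

entry_id = new_entry_id()
print(entry_id, is_entry_id(entry_id))
```

The check only validates the form of an identifier; whether it refers to an existing entry is a separate lookup.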

Update (March 2024): a reader suggested that base32 encoding would shorten the identifier further at the same collision space. This is correct: the same 12 bytes encode to 20 base32 characters rather than 24 hex characters. I have nonetheless retained hex on the grounds that it is universally legible and that a saving of four characters is small relative to the cost of changing scheme.
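The reader's point is easy to check with the standard library. The sketch below encodes the same 12 random bytes both ways; stripping the trailing "=" padding and lowercasing the base32 output are my assumptions about how such an identifier would be presented, not something the reader specified.

```python
import base64
import secrets

raw = secrets.token_bytes(12)  # the same 96 bits of randomness

hex_form = raw.hex()
# Base32 pads its output with "=" to a multiple of 8 characters; strip that padding.
base32_form = base64.b32encode(raw).decode("ascii").rstrip("=").lower()

print(len(hex_form), hex_form)
print(len(base32_form), base32_form)
```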

References

  1. Leach, P., Mealling, M., & Salz, R. (2005). A Universally Unique IDentifier (UUID) URN Namespace. RFC 4122. Internet Engineering Task Force.
  2. Berners-Lee, T., Fielding, R., & Masinter, L. (2005). Uniform Resource Identifier (URI): Generic Syntax. RFC 3986. Internet Engineering Task Force.
  3. Eastlake, D., Schiller, J., & Crocker, S. (2005). Randomness Requirements for Security. RFC 4086. Internet Engineering Task Force.