Research Archive

On Stable Reference Identifiers in Personal Archives

A short note on the practical case for opaque, persistent reference identifiers in archived technical material — and the ways such identifiers, once chosen, tend to outlive the documents they were intended to anchor.

The question of how to label a document for stable reference is, on its surface, an unimportant one. A name will do. A folder structure will do. The author of a working paper, asked how their material is filed, will typically describe an organisational scheme that is intelligible only to themselves and which is rarely consulted twice.

For documents that pass into correspondence, the matter is different. A reference offered to a colleague — a citation in a footnote, a working paper attached to a discussion, an entry in an index — accrues small fixed costs of indirection. The reference is repeated. It is re-followed. It is, on occasion, re-quoted out of context. And in each of these acts the original label travels with the material, regardless of whether the underlying document has been moved, reorganised, or mislaid.

The opaque identifier as a practical default

Having observed small archives over a period of years, I find that the most durable reference scheme is the opaque identifier: a string of characters, drawn from no semantic source, applied to a document at the moment of its first archival listing, and never altered afterward. The opacity of the identifier is what gives it durability. A label derived from a title invites its own contradiction whenever the title is revised; a label derived from a date invites a re-numbering whenever the document is reissued; a label derived from a topical taxonomy invites a re-classification whenever the taxonomy is reconsidered.

An opaque identifier resists all of these temptations. Because the identifier carries no information, there is nothing in the document's later history that can render the identifier wrong.

Length and pronounceability

A secondary question follows: how long should the identifier be, and should it be pronounceable? My own preference, after some experimentation, is for an identifier of roughly twenty-four hexadecimal characters (96 bits), generated at random. This is sufficient to make collision vanishingly unlikely across a personal archive of plausible scale (say, ten thousand entries) without resort to centralised allocation. It is short enough to be embedded comfortably in a URL or a footnote. And it is long enough to read as a code rather than as a candidate phrase.
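
As a rough illustration of the length argument, the following Python sketch generates a twenty-four-character hexadecimal identifier and estimates the chance of any collision using the standard birthday approximation. The function names and the constant are hypothetical, chosen for this example only; the note itself does not prescribe a generator.

```python
import math
import secrets

# Assumption for illustration: 24 hex characters = 12 random bytes = 96 bits.
ID_HEX_CHARS = 24


def new_identifier() -> str:
    """Return a random lowercase hexadecimal identifier of ID_HEX_CHARS characters."""
    return secrets.token_hex(ID_HEX_CHARS // 2)


def collision_probability(entries: int, hex_chars: int = ID_HEX_CHARS) -> float:
    """Approximate the chance that at least two of `entries` random identifiers coincide.

    Uses the birthday approximation p = 1 - exp(-n(n-1) / (2N)),
    where N = 16**hex_chars is the number of possible identifiers.
    """
    space = 16 ** hex_chars
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-entries * (entries - 1) / (2 * space))


if __name__ == "__main__":
    print(new_identifier())
    for length in (8, 12, 24):
        print(length, collision_probability(10_000, length))
```

Under this approximation, ten thousand entries at twenty-four characters give a collision probability far below one in a billion billion, whereas at eight characters the same archive has roughly a one-in-a-hundred chance of at least one collision.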

I have on occasion experimented with shorter forms (eight or twelve characters) but found that the shorter the identifier, the greater the temptation to extract meaning from it. Once a reader believes that the prefix of an identifier carries information, the identifier has lost its opacity, and the durability that opacity confers is lost with it.

Cross-references

One small operational note. An archive that uses opaque identifiers in its public references can still maintain a private mapping from identifier to title, topic, or any other classification of interest. The opacity is for outward-facing reference; internal navigation can remain perfectly intelligible. The mapping should be straightforward to re-derive from the document's metadata if it is ever lost.
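
A minimal sketch of such a private mapping, assuming each document records its own identifier, title, and topic in its metadata. The Document class, the build_index function, and the sample identifiers and titles are all hypothetical, invented for this example. Because the index is computed from the documents themselves, losing it costs nothing more than a rebuild.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """Hypothetical metadata record; the field names are assumptions."""
    identifier: str  # opaque public identifier, never altered
    title: str       # free to change without touching the identifier
    topic: str       # free to be reclassified


def build_index(documents):
    """Re-derive the private identifier -> metadata mapping from document metadata."""
    index = {}
    for doc in documents:
        if doc.identifier in index:
            raise ValueError(f"duplicate identifier: {doc.identifier}")
        index[doc.identifier] = {"title": doc.title, "topic": doc.topic}
    return index


# Invented sample entries, for illustration only.
docs = [
    Document("4be1d07a93c25f6e8a0b17cd", "Working paper on filing", "archives"),
    Document("0c9f3e5a7b21d4486fa2e9b0", "Note on citation drift", "citations"),
]
index = build_index(docs)
print(index["4be1d07a93c25f6e8a0b17cd"]["title"])
```

Outward-facing references use only the identifier; the title and topic live in the index and can be revised freely.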

Some related observations are collected in "Notes on archival annotation practices" and "A brief observation on URL structure and citation stability".

References

  1. Berners-Lee, T. (1998). Cool URIs don't change. W3C Style. Available at www.w3.org/Provider/Style/URI.
  2. Jacobs, I., & Walsh, N. (eds). (2004). Architecture of the World Wide Web, Volume One. W3C Recommendation.
  3. Berners-Lee, T., Fielding, R., & Masinter, L. (2005). Uniform Resource Identifier (URI): Generic Syntax. RFC 3986.
  4. Kosinski, L. (2023). Internal note on archive design. Reference a3f2c9b1.