Addressing Editor Content

Reference Semantics and Value Semantics TypeScript's unknown type and type va...

Addressing Editor Content

Every text editor system, whether it works with plain or rich text,needs some way to refer to specific positions in the document. Thevery first reason one runs into this, when implementing such a system,is for representing the cursor and selection. But in a mature editor,you'll be doing a lot more things that refer to documentpositions��representing changes, displaying user interface elements inthe text, or tracking metadata about a piece of content.

Offset Positions

An obvious way to address such positions, especially in plain text, isjust by string offset. For rich text documents it also tends to bequite straightforward to define a scheme that gives every documentposition its own offset. Such offset increase monotonically, so evenin data structures that aren't flat arrays, it is not hard toefficiently look them up.

Because text editors often store their content split by line in somebigger data structure, it is tempting to use {line, character} pairsfor your positions. Though this is often a useful thing to present toa user, as an addressing system is is really quite awful. Manipulatingsuch positions tends to get convoluted and awkward. Whereas findingnearby positions in a flat system is just a matter of addition andsubtraction, with line-based addresses you have to always special-caseline boundaries, and know the length of lines to even know where thoseboundaries are. This can be made to work, since several editors,including old versions of CodeMirror, do it, but moving to flat systemin CodeMirror 6 was a huge relief.

In many cases, just having a position isn't enough. A cursor at a linewrapping point or a jump between right-to-left and left-to-right text,for example, may correspond to multiple different positions on thescreen. At least for cursors, you need both a position and adirection��which disambiguate whether the position is attached to theelement before or after the position.

But other than that, offset positions work well. They just have onebig drawback: when the document changes, they need to change with it.So any data that refers to a document position needs to be updatedalong with the document every time that document is modified. Thisrequires quite a lot of discipline to get right, and can get expensivewhen you're tracking a lot of positions.

Unique IDs

So, though both of my editor projects use offset positions, I keepasking myself whether there is a way around the need to map thosepositions on every change. If we could have some way to representdocument position in a ��stable�� way, where you can still use them whenyou come back to a document, even if that document has changed in abunch of ways since you generated it, that would be so convenient.

To be able to do such a thing, you'd need to assign a stable ID toevery single element in the document. There are ways to make this lessnauseatingly expensive than it initially sounds. Stretches of text thatare loaded or inserted together can be assigned a contiguous region ofIDs, allowing the data structure, in many circumstance, to not storeevery single ID separately, but instead just assign a range to astretch of text. If you have that, you can now describe the positionbefore element X or after element Y in a stable way. To find it, youjust need to look up the element.

Except, of course, if that element has been deleted. When your ID nolonger exists in the document, your position has lost its meaning.

One way to handle that is to keep ��tombstones�� for deleted elements,either directly in your document data structure, or in a separate datastructure that maps the IDs of deleted elements to IDs of adjacentelements that are still in the document. This does have the downsidethat, for some types of editing patterns, your tombstone data canbecome bigger than your actual document data. It is possible to defineschemes where you periodically garbage collect these, but then youreintroduce the issue that you can be invalidating position pointersthat may still exist in some other data structure, and you are back toneeding to carefully update such pointers.

Another issue of such IDs is that going from an ID to an actualposition in the document generally needs to be fast. This is notsomething you get for free. Doing a full document scan every time youneed to find a position tends to be too slow.

There are some tricks that you can do with mutable doubly-linked treesor lists, where you keep a map from IDs to objects in those datastructures, and then traverse from that object via parent or siblingpointers to figure out where it is. But I am very partial topersistent data structures, where such tricks don't work.

It's probably possible to do something with bloom filters in a treestructure or similar, rather heavyweight tricks. But in the end, ifyou're just moving the work that an offset system would do whenmapping positions over changes to lookup time, that may not be much ofan improvement.

Ordered IDs

One way to avoid the tombstone and lookup issues with regular IDs isto define your ID assignment scheme in such a way that there is anordering of the IDs that corresponds to their order in the document.If deleted IDs can still be compared to IDs still in the document,that gives you a way to locate their position even though they aren'tthere anymore. Similarly, if you can compare IDs you can run a binaryor tree search through your document to quickly locate a position.

The obvious downside of this approach is that it is tricky to defineyour IDs in such a way that you can keep making up new IDs that ��fit��between any two existing IDs, and this forces you to use a schemawhere IDs can grow in size when there's no room left in the sequencespace on their current level.

It also, and this may be a worse issue, makes the position of deletedIDs weirdly dependent on what is inserted in their old place after thedeletion. Unless some kind of tombstone data is kept, changes willhappily fill in the ID space left empty by a deletion with elementsthat, more or less randomly, may be above or below (or even equal to)an old but still referenced ID, making its position point at ameaningless position within those inserted elements.

(Readers familiar with sequenceCRDTsmay notice a lot of similarity between what I'm describing and howsuch systems work. That's because I stole a lot of these ideas fromCRDT literature.)

Addendum: Transaction Log

Another approach, suggested to me by JamieBrandon after I published this,would be to keep a record of updates with your document, containingenough information to map a position forward if you know the documentversion it refers to. Positions would then be {version, offset}pairs, and you could interpret positions in old versions by runningthrough the changes that were made since that version.

In its straightforward form, this would make use of old positionsincreasingly expensive, as they have to be mapped forward everfurther. But I guess you could get around that by keeping a cache withevery position.

In any case, this is a neat reframing of the problem. It stillrequires a data structure that grows proportional to editing historysize, rather than document size, but seems less convoluted thanID-based approaches.

Conclusion

This problem space appears to be a tricky one where every solution hassignificant drawbacks. I'm going to keep muddling along with offsetpositions in my own systems. Though mapping all your documentpositions is a chore, this approach is relatively easy to understandand reason about, and doesn't require a lot of complicated datastructures to maintain and use.

View more on Marijn Haverbeke's website »

Like • 0 comments • flag

Published on September 29, 2025 16:00

No comments have been added yet.

Marijn Haverbeke's Blog

Marijn Haverbeke's profile
46 followers

Marijn Haverbeke isn't a Goodreads Author (yet), but they do have a blog, so here are some recent posts imported from their feed.

Follow Marijn Haverbeke's blog with rss.

delete edit this post