Marijn Haverbeke's Blog
September 29, 2025
Addressing Editor Content
Every text editor system, whether it works with plain or rich text, needs some way to refer to specific positions in the document. The very first reason one runs into this, when implementing such a system, is for representing the cursor and selection. But in a mature editor, you'll be doing a lot more things that refer to document positions: representing changes, displaying user interface elements in the text, or tracking metadata about a piece of content.
Offset Positions
An obvious way to address such positions, especially in plain text, is just by string offset. For rich text documents it also tends to be quite straightforward to define a scheme that gives every document position its own offset. Such offsets increase monotonically, so even in data structures that aren't flat arrays, it is not hard to efficiently look them up.
Because text editors often store their content split by line in some bigger data structure, it is tempting to use {line, character} pairs for your positions. Though this is often a useful thing to present to a user, as an addressing system it is really quite awful. Manipulating such positions tends to get convoluted and awkward. Whereas finding nearby positions in a flat system is just a matter of addition and subtraction, with line-based addresses you always have to special-case line boundaries, and know the length of lines to even know where those boundaries are. This can be made to work, since several editors, including old versions of CodeMirror, do it, but moving to a flat system in CodeMirror 6 was a huge relief.
In many cases, just having a position isn't enough. A cursor at a line wrapping point or a jump between right-to-left and left-to-right text, for example, may correspond to multiple different positions on the screen. At least for cursors, you need both a position and a direction, which disambiguates whether the position is attached to the element before or after it.
But other than that, offset positions work well. They just have one big drawback: when the document changes, they need to change with it. So any data that refers to a document position needs to be updated along with the document every time that document is modified. This requires quite a lot of discipline to get right, and can get expensive when you're tracking a lot of positions.
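To make the mapping chore concrete, here is a minimal sketch of mapping an offset through a single replacement; the Replacement shape and mapPos helper are made up for illustration, not taken from any particular editor.
```typescript
// Map a document offset through one replacement {from, to, insertLength}.
interface Replacement { from: number; to: number; insertLength: number }

function mapPos(pos: number, change: Replacement): number {
  if (pos <= change.from) return pos              // before the change: unaffected
  if (pos >= change.to)                            // after the change: shift by the size delta
    return pos + change.insertLength - (change.to - change.from)
  return change.from + change.insertLength         // inside the deleted range: clamp to its end
}

// Deleting "bc" from "abcd" (from 1 to 3) moves a position at 3 back to 1.
console.log(mapPos(3, { from: 1, to: 3, insertLength: 0 })) // → 1
```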
Unique IDs
So, though both of my editor projects use offset positions, I keep asking myself whether there is a way around the need to map those positions on every change. If we could have some way to represent document positions in a "stable" way, where you can still use them when you come back to a document, even if that document has changed in a bunch of ways since you generated them, that would be so convenient.
To be able to do such a thing, you'd need to assign a stable ID to every single element in the document. There are ways to make this less nauseatingly expensive than it initially sounds. Stretches of text that are loaded or inserted together can be assigned a contiguous region of IDs, allowing the data structure, in many circumstances, to not store every single ID separately, but instead just assign a range to a stretch of text. If you have that, you can now describe the position before element X or after element Y in a stable way. To find it, you just need to look up the element.
Except, of course, if that element has been deleted. When your ID no longer exists in the document, your position has lost its meaning.
One way to handle that is to keep "tombstones" for deleted elements, either directly in your document data structure, or in a separate data structure that maps the IDs of deleted elements to IDs of adjacent elements that are still in the document. This does have the downside that, for some types of editing patterns, your tombstone data can become bigger than your actual document data. It is possible to define schemes where you periodically garbage collect these, but then you reintroduce the issue that you can be invalidating position pointers that may still exist in some other data structure, and you are back to needing to carefully update such pointers.
Another issue of such IDs is that going from an ID to an actual position in the document generally needs to be fast. This is not something you get for free. Doing a full document scan every time you need to find a position tends to be too slow.
There are some tricks that you can do with mutable doubly-linked trees or lists, where you keep a map from IDs to objects in those data structures, and then traverse from that object via parent or sibling pointers to figure out where it is. But I am very partial to persistent data structures, where such tricks don't work.
It's probably possible to do something with bloom filters in a tree structure or similar, rather heavyweight tricks. But in the end, if you're just moving the work that an offset system would do when mapping positions over changes to lookup time, that may not be much of an improvement.
Ordered IDs
One way to avoid the tombstone and lookup issues with regular IDs is to define your ID assignment scheme in such a way that there is an ordering of the IDs that corresponds to their order in the document. If deleted IDs can still be compared to IDs still in the document, that gives you a way to locate their position even though they aren't there anymore. Similarly, if you can compare IDs you can run a binary or tree search through your document to quickly locate a position.
The obvious downside of this approach is that it is tricky to define your IDs in such a way that you can keep making up new IDs that "fit" between any two existing IDs, and this forces you to use a scheme where IDs can grow in size when there's no room left in the sequence space on their current level.
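As a rough illustration of what such a scheme can look like (my own sketch, not taken from any particular system): IDs are arrays of integers compared lexicographically, and generating an ID between two neighbors descends a level when there is no room left.
```typescript
type OrderedID = number[]

// Compare IDs lexicographically; a shorter ID sorts before any extension of it.
function compareID(a: OrderedID, b: OrderedID): number {
  for (let i = 0; i < Math.min(a.length, b.length); i++)
    if (a[i] != b[i]) return a[i] - b[i]
  return a.length - b.length
}

// Produce an ID strictly between a and b (assumes a < b and that IDs never
// end in a 0 digit, which this generator itself never produces).
function idBetween(a: OrderedID, b: OrderedID): OrderedID {
  const result: OrderedID = []
  let boundedByB = true
  for (let i = 0; ; i++) {
    const lo = i < a.length ? a[i] : 0
    const hi = boundedByB && i < b.length ? b[i] : Number.MAX_SAFE_INTEGER
    if (hi - lo > 1) { result.push(lo + 1); return result }  // room at this level
    result.push(lo)                                          // no room: go one level deeper
    if (lo < hi) boundedByB = false                          // dropped below b's digit
  }
}

console.log(idBetween([1], [2]))    // → [1, 1]
console.log(compareID([1, 1], [2])) // → negative: [1, 1] sorts between [1] and [2]
```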
It also, and this may be a worse issue, makes the position of deleted IDs weirdly dependent on what is inserted in their old place after the deletion. Unless some kind of tombstone data is kept, changes will happily fill in the ID space left empty by a deletion with elements that, more or less randomly, may be above or below (or even equal to) an old but still referenced ID, making it point at a meaningless position within those inserted elements.
(Readers familiar with sequence CRDTs may notice a lot of similarity between what I'm describing and how such systems work. That's because I stole a lot of these ideas from CRDT literature.)
Addendum: Transaction Log
Another approach, suggested to me by Jamie Brandon after I published this, would be to keep a record of updates with your document, containing enough information to map a position forward if you know the document version it refers to. Positions would then be {version, offset} pairs, and you could interpret positions in old versions by running through the changes that were made since that version.
In its straightforward form, this would make use of old positions increasingly expensive, as they have to be mapped forward ever further. But I guess you could get around that by keeping a cache with every position.
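A minimal sketch of that idea, with made-up names (VersionedPos, ChangeLog), reusing the Replacement shape and mapPos helper from the earlier sketch:
```typescript
interface VersionedPos { version: number; offset: number }

class ChangeLog {
  // changes[v] holds the replacements that turned version v into version v + 1.
  constructor(private changes: Replacement[][]) {}

  get currentVersion() { return this.changes.length }

  // Map a position created against an older version up to the current document.
  resolve(pos: VersionedPos): number {
    let offset = pos.offset
    for (let v = pos.version; v < this.changes.length; v++)
      for (const change of this.changes[v]) offset = mapPos(offset, change)
    return offset
  }
}
```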
In any case, this is a neat reframing of the problem. It still requires a data structure that grows proportional to editing history size, rather than document size, but seems less convoluted than ID-based approaches.
Conclusion
This problem space appears to be a tricky one where every solution has significant drawbacks. I'm going to keep muddling along with offset positions in my own systems. Though mapping all your document positions is a chore, this approach is relatively easy to understand and reason about, and doesn't require a lot of complicated data structures to maintain and use.
October 24, 2024
Reference Semantics and Value Semantics
The concept of an object, record, structure, or product type occurs in almost every programming language. Being able to declare a type of value that groups a number of other values together is a fundamentally useful thing.
As similar as the various record-ish constructs are in their basic functionality, the way such values are actually looked at and handled can be rather sharply divided in two different styles. Some programming languages or systems strictly prescribe one or the other, but in many cases you can use both.
Reference Semantics
Records are implemented as a piece of memory where their content is laid out at adjacent memory positions. The way to identify such a record value is by its memory address.
Memory can be changed, so records can be changed. You can write new values into their fields, and thus the content of a record changes over its lifetime.
But it's still the same record, even though some of its fields may have changed value. So the way to compare two records is by address. You're interested in the identity of a record, not so much its current content. When implementing a hash table that uses records as keys, you hash the memory addresses.
Value Semantics
Mathematical product types allow us to treat a group of values as a single value. A two-dimensional coordinate is a pair of numbers.
In implementing computer programs that manipulate such types, we may or may not put these numbers in memory somewhere. But that is not terribly relevant to our programming model.
Changing the x position of an existing coordinate pair is not a thing. You can create a new record if you need a new pair of coordinates. It is possible to define algebraic rules on records, defining equivalences that hold when you perform certain operations on them.
Comparing two records is done by comparing their contents. Whether they happen to occupy the same memory is immaterial. A hash table with records as keys obviously hashes their contents.
With value semantics, a record is just the sum (product?) of its contents, and nothing beyond that. It doesn't have its own identity.
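As a small illustration of the two styles in TypeScript (my own example, not from the original post): object comparison is always by reference, and value-style equality has to be spelled out.
```typescript
// Reference semantics: a mutable record, identified and compared by identity.
class Counter { count = 0 }
const a = new Counter(), b = new Counter()
console.log(a === b)   // false: equal contents, but two distinct records
a.count++              // mutating a record doesn't change which record it is

// Value semantics: an immutable record, compared by contents.
interface Point { readonly x: number; readonly y: number }
const samePoint = (p: Point, q: Point) => p.x === q.x && p.y === q.y
const p1: Point = { x: 1, y: 2 }
const p2: Point = { ...p1, y: 3 }             // "updating" means making a new value
console.log(samePoint(p1, { x: 1, y: 2 }))    // true: only the contents matter
```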
Know Your Semantics
Both of these approaches provide a coherent way of dealing with record values, and kind of force you to use the whole package: if you're going to mutate your record, you probably also need to compare them by identity. If you're going to make them immutable, you must implement updates by creating new values, and will generally want to compare by value, because object identity has little meaning then.
In Haskell, you're going to be using value semantics whether you like it or not (you probably do like it, or you wouldn't be using Haskell).
In most imperative programming languages there are situations that call for reference semantics and situations that call for value semantics. A mutable container or a stateful subsystem is best implemented with a reference-semantics type. A small composite value or a data structure where referential transparency is valuable is much better treated as a value.
Some languages provide different constructs for these two types of records, making it clear what you are dealing with in any given situation. But many, such as JavaScript, don't. They may not even provide reasonable mechanisms for indicating what can be mutated and what is immutable. When working with such a crude language it is important to be aware of the way a given type is intended to be used.
All this is to say, when a JavaScript type provides no operations that change its content, when the TypeScript types list its fields as readonly, and the author goes on about it being an immutable type in the documentation, please stop mutating its properties and wondering why things break.
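For instance, a type along these lines (a generic sketch, not any specific library's class) signals value-style use through readonly fields and update methods that return new objects:
```typescript
class TextRange {
  constructor(readonly from: number, readonly to: number) {}
  // "Changing" a range means creating a new one.
  extend(to: number): TextRange { return new TextRange(this.from, Math.max(this.to, to)) }
}

const range = new TextRange(0, 5)
const bigger = range.extend(10)   // fine: range is untouched, bigger is a new value
// range.to = 10                  // TypeScript error: cannot assign to a readonly property
```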
June 7, 2022
CodeMirror 6.0
CodeMirror 6 is a new code editor library for the web, a from-scratch implementation based on the experience of building and maintaining versions 1 to 5 for the past 13 years. It aims to be more extensible and accessible than previous versions.
As of today, version 6.0 is stable. Going forward, probably for at least several years, all new releases will be under the 6 major version, and be backwards compatible.
The library has been usable and largely stable for over a year, with only minor breaking changes. I generally prefer to release late, to avoid having too many regrettable mistakes slip into the stable release, which would then have to be kept there indefinitely. Without a doubt there will be things that, a year from now, I wish I had released differently, but by having users work with the code in production for a significant amount of time, a lot of small issues and sources of friction have been found and resolved before being set down in stone.
Work on this system started four years ago, with Prototype Fund funding the initial work. It was announced publicly and crowd-funded a year after that, built out into a usable system in the two years after that, and refined and stabilized in the past year. During the first two years, I collaborated with Adrian Heine on the design and implementation of the initial system.
For some more background on the new library, see the blog posts on Lezer (the parser system), facets (the extension system), and collaborative editing. For an overview of the entire system, take a look at the system guide. For a rough summary of what changed in the interface since 5.x, see the migration guide.
June 6, 2022
Facets as Composable Extension Points
An extensible system, at its base, is a system that allows people to add additional functionality that was not anticipated by the core system.
A good extensible system also makes sure multiple extensions that don't know anything about each other can be combined, and compose in ways that don't cause problems.
The problem has several aspects.
Composition: Multiple extensions attaching to a given extension point must have their effects combined in a predictable way.
Precedence: In cases where combining effects is order-sensitive, it must be easy to reason about and control the order of the extensions.
Grouping: Many extensions will need to attach to a number of extension points, or even pull in other extensions that they depend on.
Change: The effect produced by extensions may depend on other aspects of the system state, or be explicitly reconfigured.
This post tries to explain CodeMirror's (a code editor library) approach to solving this problem.
Facets and the Editor State
A facet, in this system, defines an extension point. It takes any number of input values and produces an output value. Examples of facets are...
Event handlers, where individual extensions can define functions that handle a given event.
Editor configuration, like the tab size and whether content is read-only.
The set of markers to style the content with (for example syntax highlighting).
The set of gutters to show next to the content.
When defining an editor state, you pass in a collection of facet input values, which together define the behavior of the editor. In a given state, each facet has zero or more inputs. Their output value somehow combines these: it may simply be an array of input values, or some other function of them.
Facets are defined as values and (optionally) exported so that third-party code can provide inputs. The core system defines a number of facets, but facets defined outside it work exactly the same as those defined by the core.
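To make that concrete, here is roughly what defining and reading a facet looks like with the @codemirror/state API (a sketch based on my reading of it; the maxLineLength facet itself is made up):
```typescript
import { Facet, EditorState } from "@codemirror/state"

// A made-up facet: extensions provide candidate line-length limits, and the
// combine function reduces them to a single output value.
const maxLineLength = Facet.define<number, number>({
  combine: values => values.length ? Math.min(...values) : 120
})

const state = EditorState.create({
  doc: "hello",
  // Two inputs to the same facet; the combined output is their minimum.
  extensions: [maxLineLength.of(100), maxLineLength.of(80)]
})
console.log(state.facet(maxLineLength)) // → 80
```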
Precedence
Often input values need a well-defined order. For event handlers, this determines which handlers get to go first, for example. For gutters, it defines the order in which they are displayed, and so on.
The order in which the facet values are provided when configuring the editor state provides a predictable ordering for the facet inputs, and is used as a basis for precedences. So if you provide two handlers for a given event, the one that you provide first will take precedence.
But sometimes the code that defines a facet value knows that it should have a given precedence, and you don't want to be dependent on the programmer using this extension to get the relative order right. For cases like this, the system also supports explicit precedence tagging, which assigns one of five ("highest" to "lowest") precedence categories to a given extension. The actual precedence of inputs is determined first by category, then by order.
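With @codemirror/state's precedence tags this looks roughly like the following (sketch; the Escape binding is just an example):
```typescript
import { Prec } from "@codemirror/state"
import { keymap } from "@codemirror/view"

// Without tagging, the keymap listed first in the configuration wins.
// Wrapping an extension in a precedence category overrides that: this
// binding runs before ordinary ones no matter where it appears in the list.
const escapeFirst = Prec.highest(keymap.of([{
  key: "Escape",
  run: () => { console.log("handled early"); return true }
}]))
```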
Grouping
A given extension often needs to provide multiple facet values. For example, a code folding system needs to define a state field to hold information on what is currently folded, key bindings to control folding, a gutter to display fold markers, and a CSS module to style its UI elements.
To make this easy, extensions can be provided as arbitrarily deeply nested arrays. A function exported from an extension module can return an array of extensions, which can be included in a bigger configuration by just putting the result of calling that function alongside other extensions in the array used to define the editor state.
The actual ordering of the extensions is created by recursively flattening this array, resulting in a single array of input values, each tagged with a facet. These are then reordered based on explicitly assigned precedence categories and split by facet to provide the actual inputs for a given facet.
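A sketch of such a bundled extension, with a made-up myFolding function standing in for a real folding package:
```typescript
import { Extension, EditorState } from "@codemirror/state"
import { keymap, EditorView } from "@codemirror/view"

// A hypothetical extension that bundles several facet values. Callers just
// drop the returned array into their configuration; nesting is flattened.
function myFolding(): Extension {
  return [
    keymap.of([{ key: "Ctrl-Shift-[", run: () => true }]),
    EditorView.theme({ ".cm-my-fold-marker": { cursor: "pointer" } }),
    // ...a state field and a gutter would go here as well
  ]
}

const state = EditorState.create({
  extensions: [myFolding(), EditorView.editable.of(true)]
})
```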
Deduplication
Because different extensions may depend on each other, and thus include each other's extension trees in their own extension tree, it becomes likely that people will end up with duplicated extensions in their configuration. For example, both the line number extension and the fold gutter extension might use an extension that defines editor gutter infrastructure.
Because it can be wasteful or even break things to actually include such shared dependencies multiple times, CodeMirror's extension system deduplicates extensions by identity: if the same extension value occurs multiple times in a configuration, only the one in the highest-precedence position is used.
As long as extensions that run the risk of accidentally being used multiple times take care to statically define their extension objects, and always return the same object, this makes sure such shared dependencies don't cause problems. Things like extension configuration, which might be different across uses of the extension, can often be put in a separate facet, which combines the parameters given by multiple users in some reasonable way, or raises an error if they conflict.
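A sketch of that pattern, with made-up gutter extensions:
```typescript
import { Extension } from "@codemirror/state"
import { EditorView } from "@codemirror/view"

// Because deduplication works by identity, shared infrastructure should be a
// single, statically created value, not rebuilt on every call.
const baseGutter: Extension = EditorView.theme({ ".cm-gutter": { minWidth: "20px" } })

export function lineNumberGutter(): Extension {
  return [baseGutter /*, line-number-specific facet values */]
}
export function foldGutter(): Extension {
  return [baseGutter /*, fold-marker-specific facet values */]
}
// A configuration using both still contains baseGutter's values only once.
```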
Reconfiguration
Some of the inputs to facets might change over the lifetime of an editor. And just creating a fully new editor state with a new configuration may lose information (say, the undo history) contained in that state.
Thus, existing states can be reconfigured. The system supports two kinds of reconfiguration: full reconfiguration, where the root of the extension tree is replaced with a completely new set of extensions, or compartment reconfiguration, where you tag part of your initial extension tree as a compartment, and then later replace only that part of the tree.
In either case, the data-driven approach to configuration (the code can compare the old and the new inputs) allows the system to preserve parts of the state that didn't change, and update the values of facets whose inputs did change.
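With the @codemirror/state API, compartment reconfiguration looks roughly like this (a sketch based on my understanding of it):
```typescript
import { Compartment, EditorState } from "@codemirror/state"

// Tag the tab-size configuration as a compartment so it can be swapped later.
const tabSize = new Compartment()

let state = EditorState.create({
  extensions: [tabSize.of(EditorState.tabSize.of(4))]
})

// Later: replace only that part of the extension tree; unrelated state
// (undo history, etc.) is preserved.
state = state.update({
  effects: tabSize.reconfigure(EditorState.tabSize.of(8))
}).state
console.log(state.tabSize) // → 8
```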
Dynamic Inputs
Systems with a complicated user interface tend to, at some point, grow some form of incremental computation support. They need to keep the things they show to the user consistent with their state, but their state is large and complicated, and can change in all kinds of ways.
A code editor is definitely a complicated user interface, and because it must be as responsive as possible, has a strong need to avoid needless recomputations. Facets help with this. For a start, they avoid recomputing output values when the facet's inputs stay the same, so code that depends on the facet can do a quick identity-equality test on the facet's current output value to determine whether it changed.
But it is also possible to define dynamic inputs for facets, which provide an input value (or a set of values) that is computed from other facets or other aspects of the editor state. The state update system makes sure that, if any of the dependencies change, the input value is recomputed, and, if it is different from its old value, the facet value is also recomputed, as are any dynamic values that depended on that facet, and so on.
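A sketch of a dynamic input, using a made-up lineCount facet whose input is recomputed from the document:
```typescript
import { Facet, EditorState } from "@codemirror/state"

// A made-up facet holding the number of lines, for some status display.
const lineCount = Facet.define<number, number>({
  combine: values => values[0] ?? 0
})

// A dynamic input: recomputed only when the document changes.
const countLines = lineCount.compute(["doc"], state => state.doc.lines)

const state = EditorState.create({ doc: "one\ntwo", extensions: [countLines] })
console.log(state.facet(lineCount)) // → 2
```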
Representation
Because most facets, for a given configuration, have a static value, their representation can be optimized in a way that avoids doing any work on state updates. This is helpful, because the editor state tends to be updated multiple times per second, and we don't want to do any superfluous work during those updates.
When a given configuration is resolved, facets are categorized as either static or dynamic, depending on whether they have dynamic inputs. Each facet is assigned an address in either the static values array (which is reused as-is on any state update that doesn't change the configuration) or the dynamic values array. The latter is copied on state updates, and the dependency graph between facets (and other state fields) is used to determine which of the values need to be recomputed and which can be kept as they are.
Facets with no inputs at all aren't even stored in the state. When queried, the value that facet has with zero inputs can be looked up from the facet.
June 28, 2020
CodeMirror 6 Enters Beta
I'm happy to announce that with version 0.8.0 the CodeMirror
6 project is entering its beta phase,
and you're very much invited to start trying it out and poking at it.
Roughly, this "beta" status means:
I actually like the current programming interface. Apart from the
move from a mono-package to a group of separate packages (removing
the next/ in the package names), no significant breaking changes
are planned. There might still be breaking changes in peripheral
parts of the interface, when real-world experience shows the
current interface to be problematic, but these should be of the
type where a quick scan of the release notes and some grepping
through your code are enough to adapt to them.
The system is no longer a hopeless unfinished mess. In fact, it
seems to work pretty well. Which isn't to say it won't break in
your use case, of course, since it has seen little practical use
yet and you're likely to be the first to do some specific thing
with it.
I've put together enough
documentation for people who don't
want to read source code to be able to figure out how the system
works. Some of those docs are still rough (issue
reports welcome),
but they should be usable.
You can read the change
log
if you want to see, in some detail, what I've been up to in the past
months.
Project Goals
The
retrospective
for the Xi editor project
has inspired me to also spend some time thinking about this project's
goals, and how they developed.
In the initial announcement, we emphasized these goals:
Accessibility (especially screen reader usability)
Mobile support
Native Unicode/bidirectional text support
Modularity
Performance (especially avoiding performance cliffs for huge files
or long lines)
Well-designed, TypeScript-friendly programming interface
Most of these we followed through on pretty much as planned. The
exception to this is the bidirectional text and Unicode support: that
did get implemented, but not by leaning on native behavior.
The reasons for this are:
It is often extremely hard or awkward for scripts to interact with
the native behavior. Though browsers compute an ordering for
bidirectional text, scripts cannot access this ordering. Though
browsers (sort of) have advanced cursor motion logic, the only way
to get at this is to focus an element, put the selection in there,
use
Selection.modify
to move it, and then see where it went.
Often the native behavior just isn't very good. Chrome, after we
started this project, seems to have given
up
on proper visual cursor motion. When there are non-text elements in
the document, cursor motion is inconsistent between browsers and
often just broken.
Even when the native behavior isn't downright buggy, it might not
be what we want for a code editor. Selecting by word should take
criteria specific to the programming language into account, for
example.
Thus, the project has ended up with its own implementation of Unicode
segmentation, bidirectional text ordering, and, on top of that, cursor
motion. The original vision was that we could avoid this, but that
didn't turn out to be realistic. On the bright side, not depending on
native behavior makes the library a lot less susceptible to all the
bugs and inconsistencies in those native implementations.
Of the remaining points, modularity is probably the one that I
underestimated the most. Just organizing the library core as a number
of separate modules was already a somewhat subtle exercise, but the
real difficulty lies in making sure 3rd party code
composes
smoothly. The extension system went through four or five rewrites, and
drove me to desperation at times, but I'm happy with what we landed
on.
Avoiding performance cliffs has also required a lot of discipline to
make sure complexity is only linear to document or line length when it
absolutely needs to be, as well as careful API design to avoid making
the slow approach the easiest or most obvious one. But I feel the
vision has been realized well.
Having lived with a latent anxiety about whether I'd be able to
deliver this project's promises for over two years now, I think I'm
about to start actually believing I can. That's a relief.
What is Missing
The main thing you'll probably notice, when trying to integrate the
library, is that a lot of languages don't have any support yet. This
is the next thing on my roadmap. I'll implement proper support for
some more major languages, and port most of the old CodeMirror 5 modes
so that they can run in CodeMirror 6 (though, apart from some
languages that are absolutely unparseable in a context-free way, the
vision is to move to Lezer grammars
for all important languages).
Next up would be the mostly-drop-in compatibility wrapper to the
CodeMirror 5 interface that I've been talking about.
(But right now, we have summer holiday here, and the pace of work will
be slow to nonexistent during the coming weeks.)
May 13, 2020
Collaborative Editing in CodeMirror
This post describes the considerations that went into designing the
document-change data structure and built-in collaborative editing
feature in the upcoming version of
CodeMirror (a code editor system). It is something of a followup to
the Collaborative Editing in
ProseMirror post.
I won't introduce anything new or exciting here: the design I ended up
with is a very boring non-distributed operational
transformation.
In a way, this post is publishing a negative result: I looked into a
bunch of interesting alternative approaches, but found they didn't
meet the requirements for this system.
Since collaborative editing is a tricky field with a lot of different
solutions, all of which have their awkward trade-offs, I think the
path towards this boring end result might still provide a useful
resource for people working on similar systems.
Distributed versus coordinated collaborative editing
There's quite a disconnect between the scientific literature on
collaborative editing and what most collaborative editors are doing.
The literature is largely concerned with truly distributed
collaboration, where a number of peers, taking equivalent roles,
directly exchange updates among themselves and still somehow converge
on the same document. A typical web system, on the other hand, has
clients talking to a server, which orchestrates the exchange of
updates.
These problems are very different, and if you're aiming to implement
the latter, about 95% of collaborative editing literature is
discussing a problem you do not have. Working in a truly distributed
fashion is very attractive in principle, of course. It is strictly
more general, and has connotations of escaping the over-centralized
modern web. But it does drag in a lot of problems, even outside of the
document convergence: peers have to store the document along with,
depending on the technique used, its entire history. They have to
somehow discover each other, maintain connections, and so on.
So for the core of a JavaScript-based library, I decided that support
for a distributed model wasn't important enough to justify the
additional complexity. I'll get back to what that complexity looks
like later.
Operational Transformation
(I'm sorry, I'm going to explain operational transformation again,
just like a hundred other blog posts. Hang tight.)
Operational
transformation
involves, in its simplest form, a transformation function that takes
two changes A and B, which both apply to the same document, and
produces a new pair A′ (a version of A that applies to the document
produced by B) and B′ (a version of B that applies to the document
created by A), such that A + B′ and B + A′ (where + indicates change
composition) produce the same document.
This can be visualized as a diagram like this:
        Doc0
      A↙    ↘B
   DocA      DocB
     B′↘    ↙A′
       DocAB
You can roughly think of the transformation as moving a change
through another change. Sometimes this just involves adjusting the
positions it refers to (moving "insert A at 5" through "delete 0-1"
yields "insert A at 4"), sometimes it is more involved (two
changes can't delete the same content, so in that case some of the
deletions have to be dropped in the transformed change).
For a plain-text document model, such a transformation function is not
very hard to define.
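As an illustration, here is a toy version of that function for plain text, covering only the case of moving an insertion through another change (a sketch with made-up types, not CodeMirror's actual representation); transforming deletions, where overlapping deletions force content to be dropped, takes more care.
```typescript
interface Insertion { pos: number; text: string }
interface Change { from: number; to: number; insert: string }

// Produce a version of insertion `a` that applies to the document created by
// `over`, where both originally applied to the same document.
function transformInsertion(a: Insertion, over: Change): Insertion {
  const sizeDelta = over.insert.length - (over.to - over.from)
  if (a.pos <= over.from) return a                              // before the change
  if (a.pos >= over.to) return { ...a, pos: a.pos + sizeDelta } // after it: shift
  return { ...a, pos: over.from + over.insert.length }          // inside the deleted range
}

// Moving "insert A at 5" through "delete 0-1" yields "insert A at 4".
console.log(transformInsertion({ pos: 5, text: "A" }, { from: 0, to: 1, insert: "" }))
```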
The other part of an operational transformation system is its
control part, which determines how changes are shared and when they
are transformed. A simple centralized setup can do something like
this:
Keep a list of unsent local changes.
Periodically try to send those to the server, which will either
accept them or, if another client sent a change first, reject them
because your changes start in a base document that doesn't match
the current central document.
When the server sends you changes, transform your unsent changes
and the remote change over each other. Apply the transformed remote
change locally, and replace your stored local changes with the
transformed version, so that the next time you try to send changes
you send those.
Because this makes everything proceed in lockstep, it is easy to see
that this converges: the only difference in order in the way clients
apply changes is when remote changes come while there are local
changes pending. This is exactly the scenario that the transform
function's convergence handles.
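Schematically, the client side of that loop might look like this (a sketch with hypothetical transform, tryPush, and applyLocally helpers, not the @codemirror/collab API):
```typescript
interface Change { from: number; to: number; insert: string }

declare function transform(a: Change[], b: Change[]): [Change[], Change[]]
declare function tryPush(version: number, changes: Change[]): Promise<boolean>
declare function applyLocally(changes: Change[]): void

let unsent: Change[] = []      // local changes the server hasn't accepted yet
let serverVersion = 0

async function push() {
  if (!unsent.length) return
  // The server rejects the push if another client's changes landed first.
  if (await tryPush(serverVersion, unsent)) { serverVersion++; unsent = [] }
}

function onRemoteChanges(remote: Change[]) {
  // Transform pending local changes and the remote changes over each other,
  // apply the transformed remote changes, and keep the transformed local
  // ones for the next push attempt.
  const [remoteOverLocal, localOverRemote] = transform(remote, unsent)
  applyLocally(remoteOverLocal)
  unsent = localOverRemote
  serverVersion++
}
```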
When there is more than one remote or local change involved, the
situation is slightly more tricky. If the change representation that
your transform function operates on allows series of changes to be
composed into a single change, you can first do that and then perform
a single transform.
If not, you basically need to build up a bigger rectangle. If A and B
are remote changes, and X and Y local changes, you can do something
like this:
                 Doc0
              A↙      ↘X
          DocA          DocX
        B↙    ↘X′    A′↙    ↘Y
    DocAB       DocAX        DocXY
        X″↘    ↙B′    Y′↘    ↙A″
          DocABX        DocAXY
              Y″↘      ↙B″
                 DocABXY
In each individual diamond in this monstrosity, the lower arrows are
made up by transforming the upper arrows against each other. So the
whole thing can be built up with O(N×M) transformations, where N and M
are the number of changes on both sides. The guarantee that a single
set of changes can be transformed over each other and still converge
can thus be used to compute transformed versions of bigger groups of
changes: the bottom sides of the big diamond provide the final
transformed versions of the change sets.
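In code, building that rectangle out of a pairwise transform looks roughly like this (a sketch reusing the Change shape from above; transformOne is assumed to return [a′, b′] for two changes that start from the same document):
```typescript
interface Change { from: number; to: number; insert: string }
declare function transformOne(a: Change, b: Change): [Change, Change]

// Transform two sequences of changes over each other: returns
// [the as moved over all bs, the bs moved over all as].
function transformLists(as: Change[], bs: Change[]): [Change[], Change[]] {
  const asOverBs = as.slice()
  const bsOverAs: Change[] = []
  for (let b of bs) {
    for (let i = 0; i < asOverBs.length; i++) {
      // One diamond: move this a over b, and b over this a.
      const [aPrime, bPrime] = transformOne(asOverBs[i], b)
      asOverBs[i] = aPrime
      b = bPrime
    }
    bsOverAs.push(b)
  }
  return [asOverBs, bsOverAs]
}
```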
Then why is OT considered hard?
So that's pretty easy. Still, the general consensus seems to be that
OT is hard. There's an amusingly long list of papers in this space
that have later been proven to be incorrect.
If you've been programming for a while you've probably run into this
kind of thing before: there's a hack that works simply and
efficiently, but completely falls apart when you add more
requirements. OT is such a hack.
When the document structure is more complicated than plain text, and
change structure is more complicated than just insertions and deletes,
it becomes very hard to define a converging transformation function.
And when you don't have a centralized party to determine the order in
which changes get applied, you need, as far as I understand the
literature,
to keep data beyond the current document ("tombstones" that describe
where deletions happened), to guarantee convergence.
The main problem is that this technique is hard to reason about. It
has many practical advantages but, being a hack, doesn't provide the
kind of mental framework that would allow you to confidently apply it
in different situations.
Conflict-free Replicated Data Types
That is where
CRDTs
come in. They are another approach to convergence that, contrary to
operational transformation, does provide a way for us mortals to
reason about convergence.
Basically, CRDTs complicate the way they represent data and changes to
the point where you can apply changes in any order, and as long as you
applied the same changes, you get the same result.
For text-style data, CRDT approaches roughly work by assigning every
character a unique ID. There's two general approaches to representing
changes.
The first keeps deleted characters in the document forever, just
marking them as deleted. Character insertions provide reference IDs
for the character before and/or after them, and those references,
along with data (often a logical
clock) embedded in the
IDs allow each client to compute the same ordering for the
characters.
The second approach generates IDs in such a way that they are already
ordered. When inserting something at a given position, you generate a
new ID that sits between the existing IDs around that position. This
means character IDs must be able to grow in size, so that you can
always create a new ID between any given pair. You'll also need to
include some client-specific data to make sure IDs are unique. But on
the bright side, this approach doesn't require you to keep deleted
content around.
Given such a data type, collaborative editing becomes a matter of
making sure everybody ends up receiving everybody else's changes,
eventually.
But I guess you can see why this isn't a no-brainer either. Compared
to OT, where your document representation can just be a minimal
representation of the text inside it, these techniques require an
extra data structure to be stored for every character in the
document. And with some of them, you don't get to delete those
characters either.
There are approaches that exploit the way text is often inserted in a
linear way (or loaded in bulk at the start of the editing session) to
compress this data somewhat by generating sequential IDs for such
spans of text, and storing them as a single element. But these
introduce further complexity and don't really help provide an upper
bound on the data needed: in a heavily edited document the
representation may well degenerate to something close to the
ID-per-character structure.
For CodeMirror's core, data structures and complications like this
conflict with the goal of supporting huge documents and having a
pleasant programming interface. So I've decided not to introduce a
CRDT in its internal data structures.
It may be entirely reasonable to wire an instance of the library up to
an external CRDT implementation, though, as has been done for
ProseMirror with
Yjs.
In general, I agree with this
article
that the strict separation between OT and CRDT is not really
justified. OT algorithms that support decentralized editing rather
resemble CRDT algorithms. But the distinction between dumb-as-rocks
classical OT and decentralization-capable approaches is useful.
CodeMirror's approach
So the plan is to support centralized collaborative editing out of the
box, and punt on the complexities of decentralized collaboration.
I started the project with the idea that I'd just do what ProseMirror
does, since that worked well before.
To recap, ProseMirror uses something similar to OT, but without a
converging transformation function. It can transform changes
relative to each other, but not in a way that guarantees convergence.
ProseMirror's document structure is a tree, and it supports various
types of changes, including user-defined changes, that manipulate this
tree. I didn't have the courage to try and define a converging
transformation function there. When a client receives conflicting
remote changes, it first undoes all its local pending changes, applies
the remote changes, and then the transformed form of its local
changes. This means that everybody ends up applying the changes in the
same order, so convergence is trivial.
In CodeMirror, the document model is a lot simpler, so though I got
pretty far with the ProseMirror approach, it seemed needlessly
awkward. A sequence of individual changes, each of which applies to
the document created by the previous one, isn't really how the user
thinks about updating the document.
If you, for example, want to surround the range from position 10 to
20 with parentheses, you'd have to create a change "insert ( at 10",
and then, because the first change inserted a new character, the
second change would be "insert ) at 21". You could sometimes get
around this by creating the changes in inverse order, but that doesn't
always work either.
Also, when doing things like updating the document data
structure, its representation on the screen, or some auxiliary data
structure that tracks the document, you need the precise extent of a
set of changes. This can be derived from a sequence of individual
changes, but it seems like it'd be more pleasant if changes were
directly kept in that shape.
So I moved to a system where you don't deal with individual changes,
but rather with change sets, which represent any number of changes
in a flat format, as a sequence of untouched and replaced spans. The
parentheses example above would be represented as "keep 10, insert
(, keep 10, insert )".
(In fact, it'd get another kept span at the end covering the rest of
the document. This representation knows the length of the document it
applies to, and will refuse to perform operations where these lengths
do not match. This dynamically catches a whole class of programmer
errors that would be silently ignored before.)
To create a change, you specify your changes with offsets into the
current document, and the library will sort and combine them into a
single change set.
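The parentheses example, written against the @codemirror/state API as I understand it:
```typescript
import { EditorState } from "@codemirror/state"

// Both insertions are specified with offsets into the *current* document,
// and the library combines them into one change set of kept and replaced spans.
const state = EditorState.create({ doc: "one two three" })
const tr = state.update({
  changes: [{ from: 4, insert: "(" }, { from: 7, insert: ")" }]
})
console.log(tr.state.doc.toString()) // → "one (two) three"
```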
Given this representation, it seemed an obvious optimization to use
real OT, rather than ProseMirror's faux OT, since for this domain it's
not hard to define.
Undo history
Transforming does not just come up with collaborative editing. If you
want an undo history that can undo some changes but not others (which
is useful for collaborative editing, where you don't want to undo
other people's changes, but also comes up in other scenarios), you'll
need something similar.
In the simple case, the undo history stores the inverse of the changes
you made, and when you undo, it applies those inverted changes. But if
you have, say, change A stored in the history, and make a change B,
which should not be undoable, you can't just do nothing. Change A
applies to the current document, not the document created by B.
So you must create transformed versions of the changes again. The
transformed change A′ gets stored in the history instead of A, and, if
there are more levels of history below that, the next level must be
transformed with B′, and so on.
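A sketch of that bookkeeping using @codemirror/state change sets (simplified; the real undo history tracks considerably more):
```typescript
import { ChangeSet, Text } from "@codemirror/state"

let doc = Text.of(["hello world"])
let undoItem: ChangeSet | null = null

function recordUndoable(changes: ChangeSet) {
  undoItem = changes.invert(doc)     // store the inverse, to apply on undo
  doc = changes.apply(doc)
}

function applyNotUndoable(changes: ChangeSet) {
  // The stored inverse applied to the old document; move it over the new
  // change so it still fits the current document.
  if (undoItem) undoItem = undoItem.map(changes)
  doc = changes.apply(doc)
}
```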
Position mapping
Something that comes up quite a lot in an editor is the need to
transform a document position in an original document into a
corresponding position in the changed document. If text is inserted
somewhere before the selection, the selection should move forward with
the surrounding text. Some OT systems call this "cursor
transformation", but it also applies to things like ranges of
collapsed code, lint warnings, breakpoint markers, and so on.
It's not hard to write a function that takes a position and a change
and returns a reasonable new position in the new document. But there
are a few complications.
Firstly, when an insertion happens directly at your position, it is
ambiguous which side the mapped position should be on. So you'll
probably want to have two forms of position mapping: one that stays
before insertions, and one that stays after them.
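With @codemirror/state, the two forms correspond to the assoc argument of mapPos (a sketch based on my reading of the docs):
```typescript
import { ChangeSet } from "@codemirror/state"

// An insertion of "ab" at position 5 in a 10-character document.
const changes = ChangeSet.of({ from: 5, insert: "ab" }, 10)
console.log(changes.mapPos(5, -1)) // → 5: stay before the inserted text
console.log(changes.mapPos(5, 1))  // → 7: stay after the inserted text
```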
Secondly, though most OT systems just represent changes as a series of
insertions and deletions, mapping seems to require us to distinguish
replacements from the corresponding delete/insert pair. When a
replacement happens next to a given position, the position should
never move across the replacement, regardless of whether you're
mapping forward or backward. In order to be able to support this,
CodeMirror's change representation is encoded as a series of
replacements. (Insertions become replacements that don't delete
anything, and deletions replacements that don't insert anything.)
And thirdly, worst of all, though it wasn't that hard to use OT to
make documents converge, it does not seem to be possible, in an
OT-style system, to make mapped positions converge. This is usually
not much of a problem, since most of the things that need to be mapped
are local to a single client, but there are situations, like
collaboratively editing annotations that refer to a given range in the
document, where it would be extremely useful to know that everyone's
representation of these ranges will stay the same as the document
changes.
Let's look at an example. Two users make concurrent edits: A deletes
characters 2 to 4, and B inserts an X at position 4. After the
clients have synced up, they've made these changes:
A: delete 2 to 4, insert X at 2
B: insert X at 4, delete 2 to 4
If we are mapping position 2, with a forward bias, user A will, for
the first change, leave the position at 2, and then, because of the
forward bias, map it to the end of the insertion made by the second
change, ending with position 3.
User B, on the other hand, finds that the initial insertion didn't
affect position 2 at all, since it's two positions ahead of it. The
deletion also doesn't affect it, so they end with position 2.
This doesn't seem to be an incident of my implementation or anything
like that. Mapping through deletions loses information (in this case,
the fact that our original position isn't adjacent to the insertion),
and thus I'm confident that, in an OT-style system that doesn't track
tombstones for deletions, or any system that defines this kind of
mapping in terms of plain document offsets, doing this in a converging
way is impossible.
This is kind of weird. It took me a while to accept that we can make
documents converge, but not positions, which are so much simpler.
The explanation for that is, I guess, that for document maintenance,
what is inside deleted ranges becomes irrelevant as soon as those
ranges are deleted, whereas position mapping, to be robust, would
still need to be able to compare positions inside such ranges.
(Depressingly enough, a similar problem exists when we are just
composing changes: mapping through changes X and Y separately may
produce a different result from mapping through X+Y composed into a
single change.)
This is a large part of the reason why I spent weeks researching CRDTs
even though my document convergence needs are well covered by OT. If
you have a more fine-grained way to address positions (specifically,
one that can refer to deleted positions), and you define position
mapping in terms of that, you do not have this problem.
Since CRDTs tend to assign unique IDs to every character, and some
techniques keep those around even when deleting the character, you
could use such IDs instead of document offsets to track positions.
They never change, so you wouldn't even need to do anything to map
positions across changes.
(Though, since all CRDT approaches seem to separate insertions from
deletions, this would degrade mapping around replacements again.)
Still, that did sound promising. But, again, the cost of such a
representation is significant, and in the end, I judged the
requirement for converging positions to be too obscure to justify that
level of extra complexity and memory use.
November 28, 2019
CodeMirror MOSS project report
Development of CodeMirror 6 this year has been generously
supported by Mozilla through
their MOSS program.
The MOSS program asks for a closing retrospective blog post describing
the progress made during the grant period. I've written about the
progress in the first 9 months of the year in my status update
post from August. To summarize, in that
period we:
Implemented solid text composition support.
Designed a powerful extension system.
Documented the system with doc comments and set up a process to
generate HTML documentation from that.
Created a parser system and
integrated it with syntax highlighting.
Implemented and improved various less glamorous subsystems
(styling, gutters, in-text widgets).
The past few months have been focused more on concrete user-visible
features, instead of big architectural questions. We've added these
features:
Code folding support.
A generic extension for showing UI panels above and below the
editor.
A search and replace interface.
A generic tooltip extension.
Support for in-editor linting, highlighting issues in the text.
Support for autocompletion.
A theming system that can be used to customize the styling of the
elements produced by these various extensions.
A way to provide translations for the strings used in the UI.
You can see many of these features in action in the
demo on the website.
Working on concrete extensions was a good way to find out how well our
abstractions work in practice. Though we did end up adjusting some
things, on the whole the system was a pleasure to work with.
Specifically, the way extensions can be built up out of different
behaviors and other extensions was extremely useful. "Behaviors" (the
name will probably change to "aspect" sometime soon), the named fields
that extensions can add values to, are not just useful for
configuration, but also allow, for example, extensions that want to
display a panel to simply provide a behavior that indicates this,
which is read by the panel-displaying extension. This models something
that would be a side-effect (opening and closing panels) in most
designs in a much simpler, more robust way.
Work is still ongoing, and will continue into 2020. The list of
missing functionality is getting shorter and shorter, but there's a
bunch of stabilizing and ease-of-use improvements still ahead. In any
case, given how easy the above extensions were to implement, I'm
slowly starting to believe that our base architecture is solid.
September 2, 2019
Lezer
I keep coming across people who consider parser technology a
forbidding, scary field of programming. This is nonsense: a small
parser can be very, very
simple,
and provide a wholesome exercise in recursive thinking. At the same
time, it is true that you can make parsing extremely complicated.
Mostly, this tends to happen when you generalize parsing techniques to
work for different grammars. Since compiler textbooks usually describe
general approaches to parsing, they may be to blame for putting people
off.
This post describes a new parsing
system I wrote for
CodeMirror, a source code editor. It frames
the system with some history, and digresses into some neat
architectural details.
Editor features like syntax highlighting, bracket matching, code
folding, and autocompletion all involve some level of parsing.
Unfortunately, since editors have to handle many different languages,
they require a generalized approach to parsing.
CodeMirror is in the process of being rewritten, and I wanted to
improve the way it parses its content. Parsing inside of an editor
comes with its own unique set of constraints, which can be hard to
satisfy. Though I had been planning new approaches for years, all I
had to show for it so far were a pile of dead ends.
The constraints that make the parsing problem in a code editor hard
are roughly these:
The document is constantly changing.
You can't do anything expensive. If the parsing work takes too
long, it'll introduce latency that makes editing feel
sluggish and unresponsive.
The input is often not in a finished, syntactically correct form.
But you still have to make some sense of it: nobody wants an editor
where most features stop working when you have a syntax error in
your document.
You often want to be able to mix several languages/grammars in a
single document (think HTML with JavaScript and CSS embedded in
it).
Keeping those in mind, let's go over the approaches I've tried.
A Brief History of CodeMirror Parsing
The system as it exists in CodeMirror 5 (which is pretty much what
we've been using from the very beginning) is a simple one. For
each language, you write a tokenizer which splits the input into
pieces, and labels each piece with some syntactic category (such as
variable, keyword, or number). The tokenizers can be stateful,
which allows them to secretly be full parsers if they want to.
This state must be copyable, so that the editor can strategically
store tokenizer states from a previous run, and after a change, resume
one close to that change to avoid re-tokenizing the entire document.
Because we are usually only interested in the code in the visible
viewport, this means the complexity of re-tokenizing is bounded by the
distance between the change and the end of the viewport. Since most
changes happen inside of that viewport, this works well in practice.
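For reference, a CodeMirror 5 style mode is essentially a stateful token function plus copyable state; a heavily simplified sketch (not a real, registered mode) might look like this:
```typescript
// A real mode is registered with CodeMirror.defineMode and handles far more
// cases; this only distinguishes strings and numbers.
interface ToyState { inString: boolean }

const toyMode = {
  startState: (): ToyState => ({ inString: false }),
  copyState: (state: ToyState): ToyState => ({ ...state }),
  token(stream: any, state: ToyState): string | null {
    if (!state.inString && stream.peek() == '"') { stream.next(); state.inString = true }
    if (state.inString) {
      if (stream.skipTo('"')) { stream.next(); state.inString = false } // found the closing quote
      else stream.skipToEnd()                                           // string runs to end of line
      return "string"
    }
    if (stream.match(/^\d+/)) return "number"
    stream.next()   // advance past anything we don't recognize
    return null
  }
}
```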
Such tokenizers are awkward to write directly, so over the years
several attempts have been made to build abstractions over them. The
first was the Common JavaScript Syntax Highlighting
Specification,
an attempt by the authors of Mozilla Skywriter (formerly Bespin, later
merged into ACE) to define a declarative format
for describing tokenizers as state machines with regular expressions
(describing the tokens) as edges. The ACE project ended up with an
incompatible but similar format (too entangled with their internals to
use in CodeMirror, unfortunately). I did an implementation of the
original spec for CodeMirror, and then another incompatible
extension because the base spec was
too limiting. There are a few CodeMirror modes still based on that
code, but it was no real success.
I think the reason such state machines (and the somewhat related
TextMate
grammars which
are in wide use in desktop editors) never felt like a great solution
is that, once you get past trivial grammars (where their declarative
simplicity does look really nice), they don't really help that much
with abstraction. Manually designing complicated state machines is a
chore. Regular expressions, which are bad enough on their own, become
downright
terrifying
when you have to construct all your edges out of them, often stuffing
multiple tokens into a single expression to avoid creating
intermediate states. This "abstraction" has a tendency to produce
uglier, less maintainable code than what you'd get when writing the
tokenizer as plain code.
So in 2017, I started an ambitious project to create a better way to
abstractly define incremental tokenizers. I had concluded that
classical parser generators based on context-free grammars were never
going to work in this context (for reasons that I'll come back to
later on). But I kept coming across parsing expression
grammars,
which weren't based on context-free grammars and had some interesting
properties, such as being able to combine multiple grammars to create
a new grammar (which is great for mixed-language documents).
So I spent several months building a parsing system that took a
PEG-like grammar, compiled it down to a state machine, and made it
possible to run that state machine as a CodeMirror language mode.
This system is a marvel.
It uses a moderately sophisticated optimizing
compiler to generate the
state machines. The result works quite well, and is used in several
real-world systems today. But unfortunately, if I'm honest, it is a
tragically bad idea taken way too far.
Parsing expression grammars are parsed by backtracking. And as such,
they are very poorly suited for implementing a stateful tokenizer. In
a backtracking system, you never know when you've definitely parsed
a piece of content: later input might require you to backtrack again.
So what I ended up with was actually not PEG at all, but a system
where you had to explicitly annotate where the parser should look
ahead. Though grammars written this way were relatively readable, they
involved a lot of finicky, error-prone kludges to resolve local
ambiguity.
Also, parsing PEG is just really inefficient. Such grammars are
"scannerless", meaning they don't make a distinction between tokenizing
and parsing. When parsing in that way naively, you basically have to
run your whole parsing logic for every input character. Probably
multiple times, due to backtracking. A lot of the magic in the
compiler was intended to recover the tokens that were implicit in the
grammar, in order to recover some efficiency. But the system never
came close to hand-written language modes in terms of speed.
Tree-sitter
So, though I knew I needed a new approach, I went into the CodeMirror
6 rewrite without any specific idea on what that approach would look
like.
And then I saw
tree-sitter, and was
enlightened.
Tree-sitter is a parser system written with the code editor use case
in mind, and is in the process of being integrated into the Atom
editor. It takes a much more ambitious approach to
what a parser inside an editor should do: It builds up a full,
accurate syntax tree for the content.
You can do so much more with an actual syntax tree than with a
sequence of tokens. Whereas tokens, possibly augmented with some
information stored in the tokenizer state, allow you to sort of
approximate understanding some aspects of the code's structure, a tree
usually gives you precisely the information you need.
Most of the ideas that tree-sitter uses aren't new, in fact a
paper from 1997
describes a somewhat similar system. But as far as I know, tree-sitter
is the first system that puts them all together into a practical piece
of software.
Unfortunately, tree-sitter is written in C, which is still awkward to
run in the browser (and CodeMirror targets non-WASM browsers). It
also generates very hefty grammar files because it makes the
size/speed trade-off in a different way than a web system would.
But good ideas can be ported. Lezer is
a JavaScript-based system heavily inspired by tree-sitter.
LR Parsing and Context-Free Grammars
For a long time, I was firmly convinced that classical parser systems
based on context-free grammars and
LL or
LR parsing algorithms were
just not suitable for the editor use case. My arguments for this
were...
Context-free grammars are a limiting abstraction that breaks down as
soon as the language does anything funky. Needing the grammar to be LR
or LL to please the parser generator further pins you into a corner.
This is not wrong. Expressing operator precedence in a pure
context-free grammar requires writing a silly formulaic rule for each
level of precedence. And when you need to implement something like
automatic semicolon insertion or whitespace-sensitivity, which would
be a couple of lines of code in a hand-written grammar, you can't
express that directly, and have to somehow escape the context-free
abstraction.
Making such a grammar suitable for an LR parser generator can be even
more tricky, and often requires you to have a rather deep
understanding of how the parser generator works.
But like many things, once you get to know them, they aren't that bad.
Parser generators can support precedence declarations, which make
operator parsing a lot less terrible. They can even output decent
error messages.
Supporting dynamic resolution of ambiguities through something like
GLR parsing can provide a
practical way out of situations that parser generators are
traditionally bad at.
And contrary to some of the abstractions I mentioned before, this one
actually gets us something. Context-free grammars, when combined with
a proper parser generator, really do give us fast parsers from
readable, compact grammar declarations.
A strict separation between the tokenizer and parser is
problematic.
It is, in many languages (think of JavaScript's ambiguity between
regular expressions and the division operator). It also tends to make
mixed-language parsing harder.
But just because this type of parser is traditionally run with a
completely separate tokenizer doesn't mean it has to be. Having the
parse state drive the tokenizer is largely unproblematic. You can
even have the parser generator set this up
automatically, without user involvement.
Generated parsers are way too big.
A naively generated LR parser is huge, and many tools spit out
embarrassingly big files. But with careful parser state deduplication
and table compression such a parser can be made about as compact as a
hand-written one.
Making such a parser error-tolerant is extremely cumbersome.
If you search the scholarly literature for approaches to
error-tolerance in LR parser systems, you get a lot of results, with a
lot of different approaches, but none of them are very practical. Most
require the grammar writer to explicitly annotate the grammar with
error-recovery strategies, bloating the grammar and putting the
responsibility for getting it right on every grammar author.
Tree-sitter ingeniously abuses GLR
parsing, where the parser
can try multiple interpretations simultaneously, to integrate
automatic error-correction without a lot of extra complexity. Lezer
copies this approach.
Lezer
I called my tree-sitter copycat project
Lezer, which is the Dutch word for
reader (and pronounced a lot like laser). It is a bit less
advanced than tree-sitter in some areas, a bit more advanced in
others, and simply different on quite a lot of points, as determined
by a different set of priorities and tastes.
CodeMirror 6 will retain the ability to run a classical stateful
tokenizer, but its recommended way to define a language mode is to
write a Lezer grammar and wrap it in a CodeMirror-specific package
that adds some editor-related metadata.
Lezer is an LR (with opt-in
GLR) parser generator. It
has support for incremental parsing, where you can cheaply re-parse a
document after local changes have been made to it, by reusing pieces
of the old parse tree. It automatically tries to recover and continue
parsing when it runs into a syntax error, leaving markers in the
output tree that indicate where the recovery happened.
Lezer consists of an off-line parser generator tool, which takes a
grammar description and outputs a JavaScript module containing a
parser for that grammar, and a parser run-time system (which such
output files depend on) to do the actual parsing. Only the run-time
system and the generated parser need to be loaded by the editor.
The parser outputs non-abstract syntax trees, meaning that it just
creates a raw tree structure containing the constructs it parsed (with
information on where it found them), without organizing them into a
clean, easy-to-use data structure.
The system is optimized for compactness, both in parser table size and
syntax tree size. It needs to be practical to ship a bunch of parsers
to a user on the web without producing megabytes of network traffic,
and it needs to be realistic to keep syntax trees for large documents
around without running out of memory.
The Lezer guide provides a
more thorough introduction, as well as a description of its grammar
notation. In this blog post, I want to go into the neat implementation
details that aren't relevant in user documentation.
Error Recovery
The point where I became convinced that I definitely needed to use or
copy tree-sitter was when I understood its error recovery strategy.
Say you reach a point where you can no longer proceed normally because
there is a syntax error. The rest of the input, after the error, is
probably full of meaningful constructs that could still be parsed. We
want those constructs in our syntax tree. But our regular parsing
process is stuck: it doesn't know how to get from the error to a state
where the parse can continue.
I definitely did not want to require the grammar author to add error
recovery hints to their grammar. These tend to clutter up the grammar
and are error-prone to write. Writing a grammar is hard enough
without that distraction.
You can see error recovery as a search problem. There might be a parse
state and input position (past the error) where the parse can
meaningfully continue. We just have to find it.
The actions encoded in the parse tables, along with some
recovery-specific actions that the parser wouldn't normally take,
provide a kind of search tree. You start at the state(s) where the
error occurred, and keep exploring new states from there.
But what does the accept condition look like? When do you know that
you've found an acceptable solution? You could define that precisely,
for example as the state that can handle the next N tokens without
further errors. But we can also be vague.
The solution found by Max Brunsfeld
in tree-sitter is to use the same mechanism that's used to parse
ambiguous grammars. A GLR parser can split its parse stack and run
both sides alongside each other for a while until it becomes clear
which one works out.
That's pretty much exactly what a search algorithm does: it tracks a
number of branches that it still has to explore, and continues to
explore them, possibly pruning unpromising branches with some
heuristic, until it finds a solution.
To be able to get good results, or at least some result, in messy
situations like longer stretches of invalid input, each branch has a
badness score associated with it, which is increased (linearly) each
time a recovery action is taken, and decreased (asymptotically) every
time it can consume a token normally.
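To make that concrete, here is a minimal sketch of the bookkeeping,
with made-up field names and constants (they are not Lezer's actual
numbers):
// Each GLR branch carries a badness score alongside its parse stack.
interface Branch {
  badness: number
  // ... parse stack, input position, output buffer, and so on
}

const RECOVER_PENALTY = 100  // placeholder value

function noteRecovery(branch: Branch) {
  // Grows linearly: every recovery action adds a fixed penalty.
  branch.badness += RECOVER_PENALTY
}

function noteGoodToken(branch: Branch) {
  // Decays asymptotically toward zero as the branch consumes tokens normally.
  branch.badness -= branch.badness >> 3
}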
What we want to do is, after an error, try all kinds of possible
recovery tricks, which recursively branch off a large number of states.
But then, after a bit of that, we should consolidate to one or, at
most, a few parse states again, because parsing input in a whole bunch
of different ways is expensive.
To get this effect, Lezer forbids states with a badness higher than a
given multiple of the best state's badness (or some maximum threshold)
from applying further recovery actions, effectively dropping those
branches when they can't proceed normally. In the case where one
branch finds a good way to continue, that branch's badness will
converge to zero and eventually stop all worse branches. In cases
where the input continues to make no sense, all branches will
eventually get a badness score exceeding the maximum, and the parser
will only continue one of them.
The recovery strategies used are:
Skip the next token, and try again with the same state after that.
Invent a token: take any of the tokens that are valid in this state,
and continue to the state that consuming them would produce. This
is the main source of branching, since many states allow a lot of
tokens.
Force the end of the innermost production that's currently being
parsed.
There are situations where the result of this approach isn't entirely
optimal, but it usually does well. The important thing is that it
always keeps parsing, and does so in a way that remains tractable
(exponential searches are quickly dampened). The system is biased a
bit towards the token-skipping rule, so that if all else fails it'll,
in effect, just continue skipping tokens until it stumbles into a
situation where it can continue parsing.
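In rough TypeScript-ish code, building on the badness sketch above, a
single recovery step might look like the following. The RecoveryOps
methods stand in for the table-driven operations of the real parser;
all names and constants here are invented for illustration.
// The three recovery strategies, expressed against an abstract interface.
interface RecoveryOps {
  skipToken(b: Branch): Branch                      // skip the next token
  validTokens(b: Branch): number[]                  // tokens acceptable in this state
  shiftInvented(b: Branch, token: number): Branch   // pretend we saw this token
  forceReduce(b: Branch): Branch | null             // force-end the innermost rule
}

const MAX_BADNESS = 500  // placeholder absolute threshold

function recoverStep(branch: Branch, best: number, ops: RecoveryOps): Branch[] {
  // Branches much worse than the current best branch (or past the absolute
  // threshold) may not recover further, which effectively drops them.
  if (branch.badness > Math.min(best * 3, MAX_BADNESS)) return []
  const next: Branch[] = [ops.skipToken(branch)]
  for (const token of ops.validTokens(branch))
    next.push(ops.shiftInvented(branch, token))     // main source of branching
  const forced = ops.forceReduce(branch)
  if (forced) next.push(forced)
  for (const b of next) noteRecovery(b)             // every recovery raises badness
  return next
}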
Post-Order Parser Output
When you have a parser that may be splitting its state (a lot) and
building up parts of the tree multiple times, the duplicate tree
building and the bookkeeping involved in it can cause a lot of
unnecessary work.
The order in which an LR parser creates nodes is inner-to-outer. It
will, for example, first create the node for the operands, and then
later the node for the operator expression. This suggests an approach:
What if, instead of building a tree structure right away, the parser
just keeps a flat log of the nodes it creates? This can be an array in
which the nodes are arranged in post-order, with children
coming before parents.
The parser just appends to this array. When splitting the state, one
state keeps the existing array, and the other gets a new empty array
along with a pointer to the state that has the rest of the array, and
the length of that array at the time of the split.
Now splitting involves no node copying at all. You do need to copy the
state stack, which LR parsers use to track context, but that is
generally relatively shallow.
In addition, node allocation becomes as cheap as appending a few
numbers to an array. For actions that don't result in tree nodes
(Lezer allows you to mark rules as uninteresting, to keep the tree
small), you don't have to do anything at all. The control stacks
stores the output array position at the start of each rule, and can
use that to emit enough data to later reconstruct parent-child
relationships.
After a parse finishes successfully, the final state's parent-array
pointers can be used to find all the nodes that make up the tree, and
construct an actual tree structure out of them.
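A sketch of that data structure, with made-up names and a simplified
node record (the real format packs a bit more information per node):
// Flat post-order log of emitted nodes. Splitting shares the existing
// array instead of copying it.
class OutputLog {
  nodes: number[] = []               // appended to by the parser
  parent: OutputLog | null = null    // log holding the earlier part, if split off
  parentLength = 0                   // how much of the parent counted at split time

  emit(type: number, start: number, end: number) {
    this.nodes.push(type, start, end)
  }

  split(): OutputLog {
    const branch = new OutputLog()
    branch.parent = this
    branch.parentLength = this.nodes.length
    return branch
  }

  // After a successful parse, walk the parent pointers to gather all node
  // records that belong to this branch, oldest first, and build a tree from them.
  collect(): number[] {
    const parts: number[][] = []
    let log: OutputLog | null = this, length = this.nodes.length
    while (log) {
      parts.unshift(log.nodes.slice(0, length))
      length = log.parentLength
      log = log.parent
    }
    return parts.flat()
  }
}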
One tricky issue occurs when skipped content (whitespace and comments)
produces nodes. If you have code like this...
if (true) something()
// Comment
otherStatement()
... the comment should not be part of the if statement's node. Yet
the parser only knows for sure that it can finish that node after
seeing the next statement (there might be an else still coming).
In cases like this, where the output array contains skipped nodes
immediately in front of a reduction, the parser has to move them
forward and store the end of the node before them. Fortunately, this
occurs relatively rarely (unless you add nodes for whitespace, in
which case it'll happen at the end of every rule that has a possible
continuation).
Buffer Trees
A nice thing about the flat post-order tree representation is that it
is compact. Tree structures constructed the usual way, as separately
allocated nodes, incur a lot of extra overhead for pointers and
allocation headers. They can also have terrible locality, since who
knows how far from each other the memory allocator will put the nodes.
Unfortunately, we can't just use a flat representation for our syntax
trees. The incremental parser has to be able to reuse parts of it
without copying those parts into a different buffer.
But we can use it for parts of the tree. Storing the coarse
structure as a classical tree, but the content of smaller nodes (say
less than a few thousand characters long) as flat arrays, gives us the
best of both worlds. Since most nodes, by number, live in the fine
structure, this saves a large amount of overhead (and helps with
locality).
That does mean that we can't reuse small nodes. But since their size
is limited, the amount of work that is involved in re-parsing them is
also limited. And by removing them from consideration, the incremental
parser can avoid quite a bit of the work involved in preparing and
scanning the tree for reuse.
A small node stores its content in a typed array of 16-bit unsigned
integers. It uses 4 such numbers (64 bits) per node, storing a type, a
start position, an end position, and a child count for each node.
Contrary to the array created by the parser, these arrays are in
pre-order,
because that makes forward iteration (which tends to be more common
than backward iteration) cheaper. The child count is almost redundant
(the end position can sort of tell you which nodes are children), but
Lezer supports zero-length nodes, which might land on the end of their
parent node and make it ambiguous whether they belong to it or not.
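In code, the layout looks roughly like this (accessor names invented
for this post; the real implementation differs in its details):
// A small node's content: four 16-bit numbers per node, in pre-order.
class FlatBuffer {
  constructor(readonly buffer: Uint16Array) {}

  type(i: number) { return this.buffer[i * 4] }
  start(i: number) { return this.buffer[i * 4 + 1] }
  end(i: number) { return this.buffer[i * 4 + 2] }
  childCount(i: number) { return this.buffer[i * 4 + 3] }

  // Pre-order storage makes a forward walk a plain linear scan.
  forEach(f: (type: number, start: number, end: number) => void) {
    for (let i = 0; i < this.buffer.length / 4; i++)
      f(this.type(i), this.start(i), this.end(i))
  }
}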
Client code, of course, doesn't want to deal with this representation.
Lezer provides an abstract interface to searching in and walking
through trees that hides the buffer structure, allowing you to
conceptually work with a uniform tree of nodes.
Lezer, like tree-sitter, stores the result of repetitions in the
grammar (produced by the * and + operators) as balanced subtrees.
This means that, unless your input is pathological (say, a thousand
applications of a single binary operator in a row), you tend to get
shallow, well-balanced syntax trees, which are cheap to search and
allow effective reuse.
Contextual Tokens
Depending on the grammar's complexity, an LR parser generator creates
between a dozen and a few thousand parse states for your grammar.
These represent syntactic positions like "after the opening paren of
an argument list" or "after an expression, possibly expecting some
expression suffix".
The parser generator can figure out which tokens are valid in a given
state. It can also, for tokens specified as part of the grammar,
automatically determine which tokens conflict (match the same input,
or some prefix of each other).
A well-known example of conflicting tokens is the division operator
versus regular expression syntax in JavaScript. But others are
keywords that can also appear as property names, and the bitwise right
shift operator (>>) versus two closing angle brackets in C++.
Lezer will not complain about overlapping tokens if the tokens do not
appear in the same parse states. This implicitly resolves the regular
expression and property name issues, without any user interaction.
When conflicting tokens do appear in the same place, such as division
operators and C-style comments, you have to specify an explicit
precedence ordering (comments take precedence) to tell the tool that
you know what you're doing.
Contextual tokenization is implemented with a concept called token
groups. Tokens that have unresolved conflicts with other tokens are
assigned to one or more groups, where each group contains only
non-conflicting tokens. Each state is assigned a single group (if it
expects tokens that conflict with each other that's an error). This
group is passed to the tokenizer, which then takes care to only return
tokens that are either in that group, or don't conflict with any other
tokens. The check is optimized by storing group membership in a
bitset and testing whether the right bit is set with a bitwise AND.
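The essence of that check, with illustrative names, assuming tokens
without conflicts simply aren't assigned to any group:
// Conflict-free tokens get an empty group mask and are always allowed;
// otherwise the token must be a member of the state's group.
function tokenAllowed(tokenGroupMask: number, stateGroup: number): boolean {
  return tokenGroupMask == 0 || (tokenGroupMask & (1 << stateGroup)) != 0
}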
Tokens are compiled down to a single deterministic state machine,
which is run on the input character stream. In cases like the
regexp-versus-division issue, you don't want the machine to go running
through regexp-specific states in a situation where you only allow
division, since that would be wasteful. Therefore, each tokenizer
state is also tagged with a bitset that tells you which groups the
tokens reachable from that state belong to, and the tokenizer stops
running when it hits a state that has no overlap with the allowed
tokens for the parse state.
Skip Expressions
Almost all programming languages have special syntactic elements like
whitespace and comments that may occur between any tokens. Encoding
these directly in the grammar is extremely tedious for most languages.
Traditionally, tokenizers just skip such elements when reading the next
token. That works well in most contexts, but makes it awkward to
include the elements in the parse tree.
Lezer treats skipped things like they are part of the grammar (though
in an optimized way to avoid increasing the size of the parse tables).
It is possible to skip things that aren't single tokens (to implement
something like nestable comments, for example, or to make sure your
block comment nodes consist of smaller nodes so that you can
incrementally parse giant block comments).
Each rule or group of rules may have its own set of skipped
expressions, so that you can express different sublanguages in a
grammar, for example something like the content of interpolated
strings, without allowing spacing in places where the language doesn't
allow it.
Each parse state has a pointer to a (shared) set of skip actions,
which, for the skipped tokens or tokens that start a compound skipped
expression, contains the actions to take for those tokens. For
single-token skipped elements, that action just tells the parser to
skip the token and stay in the same state. For compound elements, it
causes the state that handles the rest of the element to be pushed
onto the control stack.
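Sketched with made-up shapes (the real data structures are more
compact than a map per state):
// Each parse state points at a shared skip table.
type SkipAction =
  | {kind: "skip"}                  // single token: consume it and stay in this state
  | {kind: "push", state: number}   // compound element: push a state that parses the rest

interface ParseState {
  skipTable: Map<number, SkipAction>  // keyed by token type, shared between states
}

function skipActionFor(state: ParseState, token: number): SkipAction | undefined {
  return state.skipTable.get(token)
}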
Tree Node Tagging
The languages that a tool like Lezer needs to handle are wildly
different, from JavaScript to Haskell to CSS to YAML. As such, it is
difficult to find a cross-language vocabulary to describe their
constructs. In fact, it seems like that would be a separate multi-year
project, and pull in a serious amount of complexity.
Yet it would be nice if the parser output comes with some information
that can be interpreted without knowing what language you are working
with.
After several iterations, what I decided on was a system where nodes
have names, which only have a meaning within the language, and
props, which are values associated with tags defined by external
code. Integrating a language grammar into CodeMirror involves
assigning values for some of these props to the node types used by the
language: things like syntax highlighting style information and how to
indent such nodes.
Since the number of node types in a language is limited, we can
allocate an object for each node type to hold this information, and
have all nodes of that type point to the same object.
To allow code outside the grammar to add props without mutating global
state, parser instances can be extended with additional props,
creating a copy that will output nodes with the props attached. This
is especially useful in the context of mixed-language trees.
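Roughly, the idea looks like this (all names here are made up for
this post, not the actual API):
// A prop is a typed tag; a node type carries prop values; all nodes of a
// given type share that one object.
class NodeProp<T> { /* acts only as a unique key */ }

interface NodeType {
  name: string                        // meaningful only within the language
  props: Map<NodeProp<any>, any>      // values attached by external code
}

// Hypothetical: an editor package derives an extended copy of a node type,
// attaching a highlighting style without touching global state.
const highlightStyle = new NodeProp<string>()

function withProp<T>(type: NodeType, prop: NodeProp<T>, value: T): NodeType {
  return {name: type.name, props: new Map(type.props).set(prop, value)}
}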
Lezer has support for a limited form of grammar nesting. If language A
can appear inside a document in language B, and the end of the region
covered by A can be unambiguously found by scanning for a specific
token, Lezer can temporarily switch to another set of parse tables
while parsing such a region.
The syntax tree will then contain nodes from both grammars. Having
props directly attached to the nodes makes it much easier to work with
such trees (as opposed to using a language-specific table that
associates node names with metadata).
August 29, 2019
CodeMirror 6 Status Update
It has been almost a year since we officially announced the CodeMirror
6 project, which aims to rewrite CodeMirror
(a code editor for the browser) to align its design with the
technological realities and fashions of the late 2010s.
This post is for you if, in the course of that year, you've
occasionally wondered what the hell Marijn is doing and if he's
getting anywhere at all. I have been absolutely terrible about
communicating progress, and even monitoring the
repository would
often leave you in the dark, as I was working on local branches or
entirely different repositories. In any case, the code that's in there
is almost entirely undocumented.
Where We Are
Last week, I landed a major set of changes, which had been in the
works for about four months. They integrate a new approach to code
parsing. This was the last piece of the system that was completely in
flux, where I didn't want to nail down anything related to it yet
because I wasn't sure how it would end up working.
Now I am sure, which means that the vision for the system as a whole
just became a lot more clear.
Apart from the parser work, a lot of design and cleanup has happened
in the past year. I'll go over the major elements at the end of this
post.
For observers, the more interesting thing is that we finished a doc
comment extractor for
TypeScript, and are starting to work on adding enough comments to
generate a passable reference manual for the new system. Hopefully,
that will finally allow outsiders to get a clear view of what we're
building.
I intend to start releasing npm packages around that time. They won't
be stable or feature-complete. They won't even reflect the package
organization of the final system. But they should provide an easy way
to start playing with the system.
Where We Are Not
I'm a bit behind where I had hoped I would be at this point. This has
three reasons:
My other projects and customers kept demanding attention. How rude.
Fortunately, this means that the CodeMirror 6 budget will stretch
longer, since the time spent on other things generated its own
income.
The parser system turned out to be Really Hard, and took a bit
longer than hoped.
I'm definitely suffering from second system
syndrome, where after
eight years with the old system, I am acutely aware of its
limitations and want to fix them all this time around. That
sometimes leads me into overly-ambitious design rabbit holes, and
then it takes me a week or two to dig myself out again.
So the bad news is that a stable release is still quite a ways off
(not going to make any more specific predictions at this point). Some
parts of the system are at a point where they may roughly stay the way
they are, but other significant parts haven't even been written yet.
What We Did
These are the major pieces of work that have happened since the
project was announced...
Composition Support
Handling composition/IME (input method editor, as used by people whose
script has too many characters to fit on a keyboard, but also pretty
much all Android virtual keyboard input) in the browser is its own
special kind of hell.
During composition, the user interface is in a special mode where a
piece of the editable text is being composed. What exactly that
means differs per input method, but it usually involves step-by-step
updates through pressing additional keys and/or interacting with menus
to reach the intended input string.
In normal operation, CodeMirror will constantly sync the content in
the DOM with its model of that content, applying highlighting and
normalizing DOM elements that don't have the expected shape. However,
during composition, if you mess with the DOM around the composition,
or with the selection, you will abort the composition, making your
editor pretty much unusable to IME users and, on some browsers,
causing bizarre side effects such as their composed text being
duplicated again and again on every keystroke.
Our old approach was to freeze interface updates during composition,
stepping well back and letting the user do their thing until the
composition ended, at which point the editor sprang to life again.
That led to a number of responsiveness issues, where editor plugins,
such as autocomplete, wouldn't be kept up to date on what is
happening, causing the interface to appear frozen or outdated.
In CodeMirror 6, I implemented a subtle middle road where the part of
the document that is being composed is left alone as long as its
content isn't changed by outside code, but the editor's update cycle
proceeds as normal, and the part of the document outside of the
composition can be changed (say, by collaborative editing, or by the
autocompletion plugin updating its list of completions) as normal.
Behaviors as Extension Primitive
A design where a generic core provides a platform for all kinds of
features to be implemented in plugins has to, in a way, provide an
extendable extension mechanism, where extensions can define new ways
in which other extensions can add or configure behavior. For example,
an autocompletion plugin must be able to provide a way for other
plugins to register completion sources.
Designing this is tricky, but I think we landed on a nice
architecture. I've written another blog
post outlining our
design. This is working well so far, and has allowed us to create all
kinds of features (from the undo history to syntax highlighting to
line number gutters) outside of the core library.
CSS Modules
Since the browser platform's CSS support is still more or less
completely unhelpful when it comes to modularized systems, we've
decided to use a CSS-in-JS approach where extensions can define their
own styles from within JavaScript code and make sure they are included
in the document or shadow DOM where the editor is placed.
View Fields and Updates
The place where pure code (around the immutable editor state
abstraction) stopped and imperative code (around the DOM and the
editor view) started hadn't really been properly defined in the first
iteration.
A pain point was viewport state: information about which part of the
document is currently visible. This isn't part of the core editor
state, but many UI extensions need to access it in order to operate,
usually because, as an optimization, they only act on the visible
code.
I've added a new abstraction, view fields, for pieces of state that
live in the view and affect things like decorations (styling and
widgets in the content) and the attributes of the editor's wrapper DOM
nodes.
These can be written in bog-standard imperative style, if you want,
but still make it easy to handle editor state changes in a disciplined
way: they are notified each time something changes, and provided with a
full description of the update, including the viewport information and
the transactions that were applied.
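To give a sense of the shape this takes (purely illustrative, not the
actual interface, which is still in flux):
// A view field is a value living in the view that is told about every update.
interface ViewUpdate {
  transactions: readonly unknown[]      // the transactions that were applied
  viewport: {from: number, to: number}  // the currently visible document range
}

interface ViewField<T> {
  create(): T
  update(value: T, update: ViewUpdate): T  // called on every change, in order
}

// A hypothetical field that just tracks the visible range for later use.
const visibleRange: ViewField<{from: number, to: number}> = {
  create: () => ({from: 0, to: 0}),
  update: (_value, u) => u.viewport
}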
Block Widgets
Block widgets are a way to insert a block element into the document,
either on a line boundary, in the middle of a line, or covering some
content.
These existed before but I completely reimplemented them (and the rest
of the decoration system) at the start of the year to fix some corner
case issues in the old implementation, cleaning up a number of issues
and limitations in the display code.
Interestingly, allowing block widgets in the middle of lines, which
wasn't initially part of the plan, turned out to be easier than
forbidding them, due to the way decorations interact. Another
decoration could hide the newline next to a block widget, something
the initial implementation could not deal with gracefully.
Generic Gutters
The first demo came with a line number gutter, but no way to create
gutters for other purposes. I generalized the gutter plugin so that it
is now possible to create your own custom gutters and dynamically add or
change the content that is displayed in them.
Doc Generation
We wrote a system that
uses the TypeScript library to extract both doc comments and types
from a TypeScript project, and output them as a JSON structure.
Feeding this into the
builddocs tool allows us to
build good-looking, interlinked reference documentation directly from
the source code.
The system is already being used to generate Lezer's reference
docs, and we're working on
applying it to CodeMirror.
Lezer
This is the biggest one, and kept me busy for most of the spring and
summer. It was clear that we'd want to move beyond CodeMirror 5's
primitive syntax analysis system, but it wasn't clear how.
That is, until I saw tree-sitter,
which is a practical implementation of an incremental LR parser,
giving you a real syntax tree for your code and updating it cheaply
when you make changes. It is used in the Atom editor.
Unfortunately, tree-sitter is written in C, which is awkward to run in
the browser (we're still targeting non-WASM browsers). It also
generates very hefty grammar files because it makes the size/speed
trade-off in a different way than a browser-based system would.
Thus, I set out to clone tree-sitter in JavaScript. And because I
always feel I know everything better, I didn't exactly clone it, but
rather built a different system inspired by it. That system is
Lezer, an incremental parser
generator and runtime system for JavaScript.
This was a relatively big project. And then, when I started
integrating it with CodeMirror 6, I was forced to go back to the
drawing board several times to work out issues.
But the system turned out well. If a grammar has been written for your
language (we have JS, CSS, HTML, and XML so far), CodeMirror can keep
a syntax tree of your document as you are editing it. This tree is
used for (more accurate) highlighting, indentation, and folding
(soon), and has a host of other potential uses.
August 28, 2019
Computing Indentation from Syntax Trees
Most code editors have some concept of automatic indentation. For each
language, they have some logic that can compute an appropriate
indentation for a given line. They may immediately indent when you press
Enter, or require some interaction to reindent a line or selection.
Even with the cleverest tools, there isn't always a single definite
indentation for a given line. In a language like Python, whether a
line should continue the current block or dedent to some parent block
isn't obvious from the context. Inside multi-line strings or comments,
an editor can only guess what the user is trying to do. But even
there, having a guessed indentation that you may need to correct is
usually more convenient than having to manually indent every line.
In the process of maintaining CodeMirror,
I've written a lot of indentation logic for a lot of languages.
Initially directly on top of the source string, by applying heuristic
regexps to the ends of previous lines (which was as fragile as it
sounds). Then from data left behind by the glorified tokenizers
CodeMirror used for highlighting, which worked better, but required
those tokenizers to do extra indentation-specific work.
CodeMirror (in version 6) is moving towards building real syntax trees
for its content. A big part of the reason that many indenters are ugly
hacks is that they usually don't have access to the information they
need, at least not in a convenient format.
Syntax trees are that convenient format.
A Theory of Indentation
Most forms of indentation fall into two categories. They either indent
relative to the indentation of some reference line, or they align to
some specific position.
The first is the most common. If you have this code, the function
arguments are indented relative to the start of the line where the
argument list starts.
call(
  one,
  two)
Because (outside of Lisp) you tend to have different indentation
styles for different constructs, it makes sense to associate them with
syntax node types. In the example above, we'd be applying the
indentation style for argument lists.
To go from a line in a parsed document to an indentation, you first
figure out the strategy that applies at this point, which you get from
the innermost node that wraps the start of the line and has an
indentation style associated with it.
if (x) {
  A
  B
}
That node may span multiple lines. For example, in a block like the
one above, when indenting the statements on lines 2 or 3, the
wrapping node is the block node. That node's indentation style might
say "indent one unit beyond my start line". Because the block starts
on line 1, with indentation 0, we indent the statements a single unit.
The start of line 3 is also inside the block. But because the line
closes the block, it should not add indentation. Many node types have
some constructs that they shouldn't indent: a closing token is one, but
the else token in an if statement, or a hanging { in a K&R-style
function definition, are others.
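In simplified code, the lookup described above could look like this,
with a made-up tree interface rather than CodeMirror's actual types:
// Walk outward from the innermost node covering the line start until we
// find a node type that has an indentation style attached.
interface SyntaxNode {
  parent: SyntaxNode | null
  from: number                                       // start offset of the node
  indentStyle: ((cx: IndentContext) => number) | null
}

interface IndentContext {
  node: SyntaxNode                 // the node whose style is being applied
  unit: number                     // size of one indentation unit
  lineIndent(pos: number): number  // indentation of the line containing pos
}

function indentForLine(innermost: SyntaxNode, cx: IndentContext): number {
  for (let node: SyntaxNode | null = innermost; node; node = node.parent)
    if (node.indentStyle) return node.indentStyle({...cx, node})
  return 0  // top level
}

// A typical "one unit past my start line" style for block-like nodes. A real
// style would also check for things like a closing token at the start of the
// line, and skip the extra unit in that case.
const blockStyle = (cx: IndentContext) => cx.lineIndent(cx.node.from) + cx.unit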
When the content of a bracketed node starts immediately after the
opening bracket, many styles align subsequent lines with the opening
bracket.
call(one,
     two)
Here the indentation isn't relative to the construct's start line, but
to a specific token.
Some constructs, such as multi-line strings or block comments, can
indicate that they shouldn't be indented.
When I claimed that nodes compute indentation relative to the line on
which they start, I was oversimplifying. If you have code like this:
function foo(a, b,
             c, d) {
  body()
}
The body block probably shouldn't be indented past the c. Yet it
does start on line 2, which is indented 13 spaces.
When figuring out the reference line for a given node, you have to
check whether that line starts inside another node (that is not a
parent node of the target node), and if so, continue looking at the
start of that node.
So in the example, we'd notice that line 2 starts inside a parameter
list, which isn't a parent of the body block, and thus use line 1 as
reference line.
Unfortunately, not all nodes should be skipped like this. For example,
in this code the member expression foo.bar would match the
description of a non-parent node covering the start of the line, if
the syntax tree is structured like call_expr(member_expr(...), arg_list(...)).
one.
two(
three)
But we do probably want to indent the argument list relative to line
2. So this requires us to either use heuristics (such as only skipping
bracketed nodes) or some extra language-specific information telling
us which nodes must be skipped.
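Here is roughly what that search could look like, reusing the
simplified SyntaxNode from the earlier sketch; the helper functions
passed in are assumed to exist, and only bracketed nodes are skipped,
as one possible heuristic:
function referenceLineStart(
  node: SyntaxNode,
  lineStartAt: (pos: number) => number,
  nodeCoveringLineStart: (lineStart: number) => SyntaxNode | null,
  isBracketed: (n: SyntaxNode) => boolean,
  isParentOf: (maybeParent: SyntaxNode, child: SyntaxNode) => boolean
): number {
  let line = lineStartAt(node.from)
  for (;;) {
    const covering = nodeCoveringLineStart(line)
    // Stop when the line doesn't start inside an unrelated bracketed node.
    if (!covering || isParentOf(covering, node) || !isBracketed(covering)) return line
    const next = lineStartAt(covering.from)
    if (next == line) return line  // that node starts on this very line
    line = next
  }
}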
Implementation
CodeMirror's new indenter requires language packages to attach
indentation information to syntax tree node types. That way, even if
the document contains multiple languages, it can figure out an
indentation directly from the tree.
This information takes the form of a function that computes an
indentation depth, given some contextual information like the syntax
tree, the target node, the document, and the size of an indentation
unit.
Being a function, it can be as complicated as it needs to be. There
are helpers for defining functions that support common styles, such as
bracketed nodes that align content to their opening bracket when that
isn't the last token on the line.
Because content is often in a syntactically invalid state when
indentation is requested, the indenter takes care to "enter"
unfinished nodes in front of the cursor when determining the context
for the indentation. For example, if you press Enter after this
unfinished if statement...
if (x < y &&
The syntax tree will look something like...
if_statement(
  paren_expr(
    binary_expr(binary_expr(...), incomplete),
    incomplete),
  incomplete)
Where incomplete means that the node was cut off by the parser's
error-correction feature. The start of the next line is outside of all
this, but because we know that the node before it was incomplete, we
can indent as if we are in that node, which puts us inside of
binary_expr, which may not have an indentation style. But
paren_expr probably does.
if (x < y &&
    // continue here, aligned to the (