SELFIES Derivation

This section is an informal tutorial on how molecules are derived from a SELFIES. The SELFIES grammar has the following non-terminal symbols, or states:

\[X_0, \ldots, X_7, Q\]

Derivation starts with state \(X_0\). The SELFIES is read symbol-by-symbol, with each symbol specifying a grammar rule. SELFIES derivation terminates when no non-terminal symbols remain. In each subsection, we describe a type of SELFIES symbol and the grammar rules associated with it.

Atomic Symbols

Atomic symbols are of the general form [<B><A>], where <B> in {'', '/', '\\', '=', '#'} is a prefix representing a bond, and <A> is a SMILES symbol representing an atom or ion. If the SMILES symbol is enclosed by square brackets (e.g. [13C]), then the square brackets are dropped and expl (for “explicit brackets”) is appended to obtain <A>. For example:

| <B> | SMILES symbol | <A>      | SELFIES symbol |
|-----|---------------|----------|----------------|
| '=' | N             | N        | [=N]           |
| ''  | [C@@H]        | C@@Hexpl | [C@@Hexpl]     |
| '/' | [O+]          | O+expl   | [/O+expl]      |
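The construction of an atomic symbol described above can be sketched in a few lines of Python. The function name `atomic_symbol` is hypothetical and is not part of the selfies package; the sketch only follows the bracket-dropping and `expl` rule stated in this section.

```python
BOND_PREFIXES = {"", "/", "\\", "=", "#"}


def atomic_symbol(bond_prefix, smiles_symbol):
    """Build [<B><A>] from a bond prefix <B> and a SMILES atom symbol.

    Hypothetical helper for illustration only.
    """
    if bond_prefix not in BOND_PREFIXES:
        raise ValueError("unknown bond prefix: %r" % bond_prefix)
    if smiles_symbol.startswith("[") and smiles_symbol.endswith("]"):
        # Bracketed SMILES atom: drop the brackets and append 'expl'.
        atom = smiles_symbol[1:-1] + "expl"
    else:
        atom = smiles_symbol
    return "[" + bond_prefix + atom + "]"


print(atomic_symbol("=", "N"))       # [=N]
print(atomic_symbol("", "[C@@H]"))   # [C@@Hexpl]
print(atomic_symbol("/", "[O+]"))    # [/O+expl]
```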

Let atomic symbol [<B><A>] be given, where <B> is a prefix representing a bond with multiplicity \(\beta\) and <A> is an atom that can make \(\alpha\) bonds maximally. The atomic symbol maps:

\[\begin{split}X_i \to \begin{cases} \texttt{<B'><A>} & \alpha - \mu = 0 \\ \texttt{<B'><A>} X_{\alpha - \mu} & \alpha - \mu \neq 0 \end{cases}\end{split}\]

where <B'> is a prefix representing a bond with multiplicity \(\mu = \min(\beta, \alpha, i)\), or the empty string if \(\mu = 0\). Note that non-terminal states \(X_i\) effectively restrict the subsequent bond to a multiplicity of at most \(i\). We provide an example of the derivation of the SELFIES [F][=C][=C][#N]:

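\[X_0 \to \texttt{F}X_1 \to \texttt{FC}X_3 \to \texttt{FC=C}X_2 \to \texttt{FC=C=N}X_1 \to \texttt{FC=C=N}\]

In the last step the SELFIES is exhausted, so the remaining state \(X_1\) is dropped.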

Discussion: Intuitively, the formal grammar has the following behaviour. An atomic symbol [<B><A>] connects atom <A> to the previously derived atom through bond type <B>. If creating this bond would violate the bond constraints of the previous or current atom, the bond multiplicity is reduced (minimally) such that all bond constraints are fulfilled.

Examples:

| Example | SELFIES                 | SMILES      |
|---------|-------------------------|-------------|
| 1       | [C][=C][C][#C][13Cexpl] | C=CC#C[13C] |
| 2       | [C][F][C][C][C][C]      | CF          |
| 3       | [C][O][=C][#O][C][F]    | COC=O       |
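The atomic rule can be checked against the examples above with a minimal sketch. It handles only atomic symbols with the prefixes '', '=' and '#'; the names `derive_chain` and `MAX_BONDS` are hypothetical, and the bond capacities \(\alpha\) in `MAX_BONDS` are assumed values for illustration, not the table used by the selfies package.

```python
import re

# Assumed maximum bond counts (alpha) for a few atoms; illustrative only.
MAX_BONDS = {"C": 4, "N": 3, "O": 2, "F": 1, "13Cexpl": 4}
BOND_ORDER = {"": 1, "=": 2, "#": 3}          # prefix <B> -> beta
BOND_CHAR = {0: "", 1: "", 2: "=", 3: "#"}    # mu -> prefix <B'>


def derive_chain(selfies):
    """Derive SMILES from a SELFIES that contains atomic symbols only."""
    state = 0  # derivation starts in X_0
    out = []
    for symbol in re.findall(r"\[[^\]]+\]", selfies):
        body = symbol[1:-1]
        prefix = body[0] if body[0] in "=#" else ""
        atom = body[len(prefix):]
        alpha = MAX_BONDS[atom]
        mu = min(BOND_ORDER[prefix], alpha, state)
        # Restore the square brackets of atoms marked with 'expl'.
        text = "[" + atom[:-4] + "]" if atom.endswith("expl") else atom
        out.append(BOND_CHAR[mu] + text)
        state = alpha - mu
        if state == 0:
            break  # no non-terminal remains; later symbols are ignored
    return "".join(out)


print(derive_chain("[F][=C][=C][#N]"))      # FC=C=N
print(derive_chain("[C][F][C][C][C][C]"))   # CF
```

The early `break` mirrors the \(\alpha - \mu = 0\) case of the rule, which is why Example 2 stops after the fluorine atom.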

Index Symbols

The state \(Q\) is used to derive the size of branches and the location of ring bonds. After a ring or branch symbol, the subsequent one or more SELFIES symbols are used to derive an integer from \(Q\). The branch or ring symbol itself specifies exactly how many symbols are used in the derivation (e.g. [Ring3] indicates that the subsequent three symbols are used).

First, each subsequent symbol \(s_i\) is converted to an index \(\text{idx}(s_i)\), according to the following assignment:

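| Index | Symbol      | Index | Symbol      |
|-------|-------------|-------|-------------|
| 0     | [C]         | 8     | [Branch2_3] |
| 1     | [Ring1]     | 9     | [O]         |
| 2     | [Ring2]     | 10    | [N]         |
| 3     | [Branch1_1] | 11    | [=N]        |
| 4     | [Branch1_2] | 12    | [=C]        |
| 5     | [Branch1_3] | 13    | [#C]        |
| 6     | [Branch2_1] | 14    | [S]         |
| 7     | [Branch2_2] | 15    | [P]         |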

All other symbols are assigned index 0.

Then \(Q\) is mapped to the hexadecimal (base 16) integer specified by the indices. For example, if three symbols \(s_1, s_2, s_3\) are used in the derivation, then \(Q\) is mapped to:

\[Q \to (\text{idx}(s_1) \times 16^2) + (\text{idx}(s_2) \times 16) + \text{idx}(s_3)\]

For example, [Ring3][C][Branch1_1][O] will derive the number \((039)_{16}=57\).
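The index derivation can be sketched as follows. The names `INDEX_SYMBOLS`, `INDEX` and `derive_q` are hypothetical; the assignment itself is copied from the table in this section, and symbols that are not listed fall back to index 0.

```python
# Position in this list is the index assigned to the symbol.
INDEX_SYMBOLS = [
    "[C]", "[Ring1]", "[Ring2]", "[Branch1_1]", "[Branch1_2]", "[Branch1_3]",
    "[Branch2_1]", "[Branch2_2]", "[Branch2_3]", "[O]", "[N]", "[=N]",
    "[=C]", "[#C]", "[S]", "[P]",
]
INDEX = {symbol: i for i, symbol in enumerate(INDEX_SYMBOLS)}


def derive_q(symbols):
    """Read a list of SELFIES symbols as the digits of a base-16 integer."""
    q = 0
    for symbol in symbols:
        q = 16 * q + INDEX.get(symbol, 0)  # unlisted symbols count as 0
    return q


# The three symbols that follow [Ring3] in the example above:
print(derive_q(["[C]", "[Branch1_1]", "[O]"]))  # 57
```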

Branch Symbols

Branch symbols are of the general form [Branch<L>_<M>], where <L>, <M> in {1, 2, 3}. A branch symbol specifies a branch from the main chain, analogous to the opening and closing parentheses in SMILES. In SELFIES, a branch is derived by a recursive call to the SELFIES derivation.

A Branch symbol [Branch<L>_<M>] maps:

\[\begin{split}X_i \to \begin{cases} X_i & i \leq 1 \\ B(Q, X_{n})X_j & i > 1 \end{cases}\end{split}\]

where \(n = \min(i - 1, \texttt{<M>})\) is the derivation state of a new branch, and \(j = i - n\) is the new derivation state of the main chain. In the \(i > 1\) case, the <L> subsequent symbols are used to derive an integer from the state \(Q\). Then \(B(Q, X_{n})\) takes the next \(Q + 1\) symbols, and recursively derives them with initial derivation state \(X_{n}\). The resulting fragment is taken to be the derived branch, and derivation proceeds with the next derivation state \(X_j\).

Discussion: Intuitively, branch symbols are skipped in states \(X_0\) and \(X_1\) because the previous atom can make at most one more bond (a branch requires at least two free bonds: one for the branch and one for the continuing main chain). It is possible that a branch is nested at the start of another branch; in SELFIES, both branches will be connected to the same main chain atom (see Example 5 below).

Examples:

| Example | SELFIES | \(Q + 1\) | SMILES |
|---------|---------|-----------|--------|
| 1 | [C][Branch1_1][C][F][Cl] | 1 | C(F)Cl |
| 2 | [C][Branch1_2][Ring2][=C][C][C][Cl] | 3 | C(=CCC)Cl |
| 3 | [S][Branch1_2][C][=O][Branch1_2][C][=O][Branch1_1][C][O-expl][O-expl] | 1, 1, 1 | S(=O)(=O)([O-])[O-] |
| 4 | [C][Branch2_1][Ring1][Branch1_2][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][F] | 21 | C(CC...CC)F |
| 5 | [C][Branch1_2][Branch1_1][Branch1_1][C][C][Cl][F] | 4, 1 | C(C)(Cl)F |

Example 4 has a single branch of 21 carbon atoms.
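The branch rule can be illustrated with a small recursive sketch that handles atomic and branch symbols only (no ring, stereo or special symbols). All names here (`derive`, `decode`, `MAX_BONDS`, and so on) are hypothetical, and the bond capacities in `MAX_BONDS` are assumed values; this is not the implementation used by the selfies package.

```python
import re

# Assumed bond capacities (alpha); illustrative values only.
MAX_BONDS = {"C": 4, "N": 3, "O": 2, "S": 6, "F": 1, "Cl": 1, "O-expl": 1}
BOND_ORDER = {"": 1, "=": 2, "#": 3}
BOND_CHAR = {0: "", 1: "", 2: "=", 3: "#"}
INDEX_SYMBOLS = [
    "[C]", "[Ring1]", "[Ring2]", "[Branch1_1]", "[Branch1_2]", "[Branch1_3]",
    "[Branch2_1]", "[Branch2_2]", "[Branch2_3]", "[O]", "[N]", "[=N]",
    "[=C]", "[#C]", "[S]", "[P]",
]
INDEX = {symbol: i for i, symbol in enumerate(INDEX_SYMBOLS)}
BRANCH = re.compile(r"\[Branch([123])_([123])\]")


def derive(symbols, state):
    """Derive `symbols` starting in state X_state.

    Returns (siblings, chain). `siblings` are branches met before the first
    atom of this instance (they attach to the parent's atom); `chain` is the
    SMILES fragment that starts at the first atom of this instance.
    """
    siblings, chain = [], []
    pos = 0
    while pos < len(symbols) and state is not None:
        symbol = symbols[pos]
        pos += 1
        match = BRANCH.fullmatch(symbol)
        if match:
            if state <= 1:
                continue  # X_i -> X_i: the branch symbol is skipped
            length, max_order = int(match.group(1)), int(match.group(2))
            n = min(state - 1, max_order)
            q = 0
            for index_symbol in symbols[pos:pos + length]:
                q = 16 * q + INDEX.get(index_symbol, 0)
            pos += length
            # B(Q, X_n): derive the next Q + 1 symbols recursively.
            sub_siblings, sub_chain = derive(symbols[pos:pos + q + 1], n)
            pos += q + 1
            fragments = sub_siblings + ([sub_chain] if sub_chain else [])
            if chain:
                chain.extend("(" + f + ")" for f in fragments)
            else:
                siblings.extend(fragments)  # nested at the start of a branch
            state -= n  # j = i - n
        else:
            body = symbol[1:-1]
            prefix = body[0] if body[0] in "=#" else ""
            atom = body[len(prefix):]
            alpha = MAX_BONDS[atom]
            mu = min(BOND_ORDER[prefix], alpha, state)
            text = "[" + atom[:-4] + "]" if atom.endswith("expl") else atom
            chain.append(BOND_CHAR[mu] + text)
            state = alpha - mu if alpha > mu else None  # None: terminated
    return siblings, "".join(chain)


def decode(selfies):
    """Derive SMILES for a SELFIES made of atomic and branch symbols."""
    _, chain = derive(re.findall(r"\[[^\]]+\]", selfies), 0)
    return chain


print(decode("[C][Branch1_1][C][F][Cl]"))                           # C(F)Cl
print(decode("[C][Branch1_2][Branch1_1][Branch1_1][C][C][Cl][F]"))  # C(C)(Cl)F
```

Returning `siblings` separately from `chain` is one possible way to reproduce the behaviour of Example 5, where a branch nested at the start of another branch ends up attached to the same main chain atom.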

Ring Symbols

Ring symbols are of the general form [Ring<L>] or [Expl<B>Ring<L>], where <L> in {1, 2, 3} and <B> in {'/', '\\', '=', '#'} is a prefix representing a bond. A ring symbol specifies a ring bond between two atoms, analogous to the ring numbering digits in SMILES.

A Ring symbol [Ring<L>] maps:

\[\begin{split}X_i \to \begin{cases} X_i & i = 0 \\ R(Q)X_i & i \neq 0 \end{cases}\end{split}\]

In the \(i \neq 0\) case, the <L> subsequent symbols are used to derive an integer from the state \(Q\). Then \(R(Q)\) connects the current atom to the \((Q + 1)\)-th preceding atom through a single bond. More specifically, the current atom is the most recently derived atom within the current derivation instance (see Example 5 below). If the current atom is the \(m\)-th derived atom, then a bond is made between the \(m\)-th derived atom and the \(n\)-th derived atom, where \(n = \max(1, m - (Q + 1))\).

The Ring symbol [Expl<B>Ring<L>] has an equivalent function to [Ring<L>], except that it connects the current and \((Q + 1)\)-th preceding atom through a bond of type <B>.

Discussion: In practice, ring bonds are created during a second pass, after all atoms and branches have been derived. The candidate ring bonds are temporarily stored in a queue, and then made in the order that they appear in the SELFIES. A ring bond will be made if its connected atoms can make the ring bond without violating any bond constraints. This is the only non-local rule in SELFIES, but it is efficiently implemented, since this number can be determined by looking at only one location.

It is also possible that the current atom is already bonded to the \((Q + 1)\)-th preceding atom, e.g. if \(Q = 0\). In this case, the multiplicity of the existing bond is increased by the multiplicity of the ring bond candidate. Then the multiplicity of the resulting bond is reduced (minimally) such that no bond constraints are violated, and the multiplicity is at most 3 (see Example 6 below).

Examples:

| Example | SELFIES | \(Q + 1\) | SMILES |
|---------|---------|-----------|--------|
| 1 | [C][=C][C][=C][C][=C][Ring1][Branch1_2] | 5 | C1=CC=CC=C1 |
| 2 | [C][C][=C][C][=C][C][Expl=Ring1][Branch1_2] | 5 | C=1C=CC=CC=1 |
| 3 | [C][C][Expl=Ring1][C] | 1 | C#C |
| 4 | [C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][Ring2][Ring1][Branch1_2] | 21 | C1CC...CC1 |
| 5 | [C][C][C][C][Branch1_1][C][C][Ring1][Ring2][C][C] | 3 | C1CCC1(C)CC |
| 6 | [C][C][C][C][Expl=Ring1][Ring2][Expl#Ring1][Ring2] | 3, 3 | C#1CCC#1 |

Example 4 is a single carbon ring of 22 carbon atoms.

In Example 5, note that the SMILES CC1CC(C1)CC is not output.
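Two pieces of the ring rule lend themselves to a short sketch: locating the ring bond partner, and merging a ring bond candidate into a bond that already exists. The function names and the `free_a` / `free_b` parameters (the number of additional bonds each endpoint could still make) are hypothetical and are introduced here only for illustration.

```python
def ring_target(m, q):
    """Position n of the atom that the m-th derived atom is bonded to.

    Positions are 1-based and counted within the current derivation instance.
    """
    return max(1, m - (q + 1))


def merged_order(existing, candidate, free_a, free_b):
    """Multiplicity after adding a ring bond candidate to an existing bond.

    The sum is reduced as little as possible so that neither endpoint exceeds
    its remaining capacity and the result is at most 3.
    """
    return min(existing + candidate, existing + free_a, existing + free_b, 3)


# Example 1: the 6th atom with Q = 4 closes the ring on the 1st atom.
print(ring_target(6, 4))          # 1
# Example 3: single bond plus an Expl= candidate, each carbon has 3 free bonds.
print(merged_order(1, 2, 3, 3))   # 3
# Example 6: existing double bond plus an Expl# candidate, 1 free bond each.
print(merged_order(2, 3, 1, 1))   # 3
```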

Special Symbols

The following are symbols that have a special meaning for SELFIES:

| Character | Description |
|-----------|-------------|
| [epsilon] | The [epsilon] symbol maps \(X_0 \to X_0\) and \(X_i \to \epsilon\) (the empty string) for all \(i \geq 1\). |
| [nop] | The nop (no operation) symbol is always ignored and skipped over by selfies.decoder(). Thus, it can be used as a padding symbol for SELFIES. |
| . | The dot symbol is used to indicate disconnected or ionic compounds, similar to how it is used in SMILES. |
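As an illustration of [nop] as a padding symbol, the sketch below pads a SELFIES to a fixed number of symbols and strips the padding again. It works on plain strings and does not call the selfies package; the helper names `split_symbols`, `pad` and `strip_nop` are hypothetical.

```python
import re


def split_symbols(selfies):
    """Split a SELFIES into bracketed symbols and dot symbols."""
    return re.findall(r"\[[^\]]+\]|\.", selfies)


def pad(selfies, length):
    """Right-pad a SELFIES with [nop] until it has `length` symbols."""
    symbols = split_symbols(selfies)
    return "".join(symbols + ["[nop]"] * max(0, length - len(symbols)))


def strip_nop(selfies):
    """Remove every [nop] symbol, as a decoder that ignores it would."""
    return "".join(s for s in split_symbols(selfies) if s != "[nop]")


padded = pad("[C][O]", 4)
print(padded)             # [C][O][nop][nop]
print(strip_nop(padded))  # [C][O]
print(split_symbols("[Na+expl].[Cl-expl]"))
```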