SELFIES Derivation
This section is an informal tutorial on how molecules are derived from a SELFIES. The SELFIES grammar has non-terminal symbols, or states, written \(X_i\) and \(Q\) in this section.
Derivation starts with state \(X_0\). The SELFIES is read symbol by symbol, with each symbol specifying a grammar rule. Derivation terminates when no non-terminal symbols remain. Each subsection below describes one type of SELFIES symbol and the grammar rules associated with it.
Atomic Symbols
Atomic symbols are of the general form `[<B><A>]`, where `<B> in {'', '/', '\\', '=', '#'}` is a prefix representing a bond, and `<A>` is a SMILES symbol representing an atom or ion. If the SMILES symbol is enclosed by square brackets (e.g. `[13C]`), then the square brackets are dropped and `expl` (for “explicit brackets”) is appended to obtain `<A>`. For example:
 | SMILES symbol | | SELFIES symbol |
---|---|---|---|
 | | | |
 | | | |
 | | | |
Let atomic symbol `[<B><A>]` be given, where `<B>` is a prefix representing a bond with multiplicity \(\beta\) and `<A>` is an atom that can make at most \(\alpha\) bonds. The atomic symbol maps:
where `<B'>` is a prefix representing a bond with multiplicity \(\mu = \min(\beta, \alpha, i)\), or the empty string if \(\mu = 0\). Note that non-terminal states \(X_i\) effectively restrict the subsequent bond to a multiplicity of at most \(i\). We provide an example of the derivation of the SELFIES `[F][=C][=C][#N]`:
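A step-by-step trace of the derivation of `[F][=C][=C][#N]` can be sketched as follows. The sketch assumes the usual valences F = 1, C = 4 and N = 3 for \(\alpha\), and it assumes that after an atomic symbol the state becomes \(X_{\alpha - \mu}\), the number of bonds the new atom can still make. Both are assumptions of this sketch rather than rules stated in the text.

\[
X_0 \to \texttt{F}\,X_1 \to \texttt{FC}\,X_3 \to \texttt{FC=C}\,X_2 \to \texttt{FC=C=N}\,X_1
\]

The first `[=C]` asks for a double bond, but \(\mu = \min(2, 4, 1) = 1\) because fluorine leaves only one bond available, so a single bond is made. The final `[#N]` is reduced to a double bond because \(\mu = \min(3, 3, 2) = 2\). Under these assumptions the derived SMILES is `FC=C=N`.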
Discussion: Intuitively, the formal grammar has the following behaviour. An atomic symbol `[<B><A>]` connects atom `<A>` to the previously derived atom through bond type `<B>`. If creating this bond would violate the bond constraints of the previous or current atom, the bond multiplicity is reduced (minimally) such that all bond constraints are fulfilled.
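The bond-reduction behaviour described above can be sketched in a few lines of Python. This is a minimal sketch, not the library's implementation: `MAX_BONDS` is a hypothetical valence table, only the prefixes `''`, `=` and `#` are handled, and the next state is assumed to be the number of bonds the new atom has left (\(\alpha - \mu\)), with derivation stopping once that reaches zero.

```python
import re

# Hypothetical valence table for this sketch: the maximum number of
# bonds (alpha) each atom can make.
MAX_BONDS = {"F": 1, "O": 2, "N": 3, "C": 4}
BOND_ORDER = {"": 1, "=": 2, "#": 3}        # prefix -> multiplicity beta
BOND_PREFIX = {0: "", 1: "", 2: "=", 3: "#"}  # multiplicity mu -> prefix


def derive_chain(selfies: str) -> str:
    """Derive a linear chain from atomic symbols only (no branches or rings)."""
    state = 0  # derivation starts in X_0
    pieces = []
    for symbol in re.findall(r"\[([^\]]*)\]", selfies):
        prefix = symbol[0] if symbol[0] in "=#" else ""
        atom = symbol[len(prefix):]
        beta = BOND_ORDER[prefix]
        alpha = MAX_BONDS[atom]
        mu = min(beta, alpha, state)  # reduced bond multiplicity
        pieces.append(BOND_PREFIX[mu] + atom)
        state = alpha - mu  # assumption: bonds left on the new atom
        if state == 0:
            break  # assumption: no free bonds left, derivation stops
    return "".join(pieces)
```

With these assumptions, `derive_chain("[C][#C][#C]")` returns `C#CC`: the second triple bond is reduced to a single bond because the middle carbon has only one bond left.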
Examples:
Example | SELFIES | SMILES |
---|---|---|
1 | | |
2 | | |
3 | | |
Index Symbols
The state \(Q\) is used to derive the size of branches and the location of ring bonds. After a ring or branch symbol, the subsequent one or more SELFIES symbols are used to derive an integer from \(Q\). Note that the branch or ring symbol itself specifies exactly how many symbols are used in the derivation (e.g. `[Ring3]` indicates that the subsequent three symbols are used).
First, each subsequent symbol \(s_i\) is converted to an index \(\text{idx}(s_i)\), according to the following assignment:
Index | Symbol | Index | Symbol |
---|---|---|---|
0 | `[C]` | 8 | |
1 | | 9 | `[O]` |
2 | | 10 | |
3 | `[Branch1_1]` | 11 | |
4 | | 12 | |
5 | | 13 | |
6 | | 14 | |
7 | | 15 | |
All other symbols are assigned index 0. |
Then \(Q\) is mapped to the hexadecimal (base 16) integer specified by the indices. For example, if three symbols \(s_1, s_2, s_3\) are used in the derivation, then \(Q\) is mapped to:
\[Q \to \text{idx}(s_1) \cdot 16^2 + \text{idx}(s_2) \cdot 16 + \text{idx}(s_3)\]
For instance, `[Ring3][C][Branch1_1][O]` derives the number \((039)_{16} = 0 \cdot 16^2 + 3 \cdot 16 + 9 = 57\).
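The base-16 reading of the indices can be sketched as below. The function name `derive_q` is hypothetical, and the sketch takes index values that have already been looked up, since the symbol-to-index table is not reproduced in the code.

```python
from typing import Iterable


def derive_q(indices: Iterable[int]) -> int:
    """Read a sequence of index values (each 0..15) as base-16 digits.

    The first index is the most significant digit.
    """
    q = 0
    for idx in indices:
        if not 0 <= idx <= 15:
            raise ValueError(f"index out of range: {idx}")
        q = 16 * q + idx
    return q
```

For the indices 0, 3, 9 from the example above, `derive_q([0, 3, 9])` returns 57.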
Branch Symbols
Branch symbols are of the general form `[Branch<L>_<M>]`, where `<L>, <M> in {1, 2, 3}`. A branch symbol specifies a branch from the main chain, analogous to the opening and closing parentheses in SMILES. In SELFIES, a branch is derived by a recursive call to the SELFIES derivation.
A branch symbol `[Branch<L>_<M>]` maps:
\[X_i \to \begin{cases} X_i & \text{if } i \le 1 \\ B(Q, X_n)\, X_j & \text{if } i > 1 \end{cases}\]
where \(n = \min(i - 1, \texttt{<M>})\) is the derivation state of the new branch, and \(j = i - n\) is the new derivation state of the main chain. In the \(i > 1\) case, the `<L>` subsequent symbols are used to derive an integer from the state \(Q\). Then \(B(Q, X_n)\) takes the next \(Q + 1\) symbols and recursively derives them with initial derivation state \(X_n\). The resulting fragment is taken to be the derived branch, and derivation proceeds with the next derivation state \(X_j\).
Discussion: Intuitively, branch symbols are skipped for states \(X_0\) and \(X_1\) because the previous atom can make at most one more bond, whereas a branch requires at least two bonds to be free. It is possible that a branch is nested at the start of another branch; in SELFIES, both branches will then be connected to the same main-chain atom (see Example 5 below).
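How the state is split between a branch and the main chain can be sketched as follows. The function name `branch_states` is hypothetical, and returning `None` for a skipped branch symbol is a choice made for this sketch only.

```python
from typing import Optional, Tuple


def branch_states(i: int, m: int) -> Optional[Tuple[int, int]]:
    """Return (n, j) for a branch symbol [Branch<L>_<M>] met in state X_i.

    n is the initial state of the branch, j the state in which the main
    chain continues. Returns None when the branch symbol is skipped.
    """
    if m not in (1, 2, 3):
        raise ValueError("<M> must be 1, 2 or 3")
    if i <= 1:
        return None  # fewer than two free bonds: the symbol is skipped
    n = min(i - 1, m)
    j = i - n
    return n, j
```

Because \(n \le i - 1\), the main chain always keeps at least one bond (\(j \ge 1\)). For example, `branch_states(4, 3)` returns `(3, 1)` and `branch_states(2, 3)` returns `(1, 1)`.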
Examples:
Example | SELFIES | \(Q + 1\) | SMILES |
---|---|---|---|
1 | | 1 | |
2 | | 3 | |
3 | | 1, 1, 1 | |
4 | | 21 | |
Example 4 has a single branch of 21 carbon atoms. |
5 | | 4, 1 | |
Ring Symbols
Ring symbols are of the general form `[Ring<L>]` or `[Expl<B>Ring<L>]`, where `<L> in {1, 2, 3}` and `<B> in {'/', '\\', '=', '#'}` is a prefix representing a bond. A ring symbol specifies a ring bond between two atoms, analogous to the ring numbering digits in SMILES.
A ring symbol `[Ring<L>]` maps:
In the \(i \neq 0\) case, the `<L>` subsequent symbols are used to derive an integer from the state \(Q\). Then \(R(Q)\) connects the current atom to the \((Q + 1)\)-th preceding atom through a single bond. More specifically, the current atom is the most recently derived atom within the current derivation instance (see Example 5 below). If the current atom is the \(m\)-th derived atom, then a bond is made between the \(m\)-th derived atom and the \(n\)-th derived atom, where \(n = \max(1, m - (Q + 1))\).
The ring symbol `[Expl<B>Ring<L>]` has an equivalent function to `[Ring<L>]`, except that it connects the current and \((Q + 1)\)-th preceding atom through a bond of type `<B>`.
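The two pieces of information carried by a ring symbol, and the choice of the atom it connects to, can be sketched as below. The helper names `parse_ring_symbol` and `ring_target` are hypothetical; the regular expression simply encodes the general forms `[Ring<L>]` and `[Expl<B>Ring<L>]` given above.

```python
import re
from typing import Tuple

# Matches [Ring<L>] and [Expl<B>Ring<L>] with <L> in 1..3 and
# <B> one of / \ = #
_RING = re.compile(r"\[(?:Expl([/\\=#]))?Ring([123])\]")


def parse_ring_symbol(symbol: str) -> Tuple[str, int]:
    """Return (bond prefix, <L>); the prefix is '' for a plain [Ring<L>]."""
    match = _RING.fullmatch(symbol)
    if match is None:
        raise ValueError(f"not a ring symbol: {symbol}")
    return match.group(1) or "", int(match.group(2))


def ring_target(m: int, q: int) -> int:
    """Position n (1-based) of the atom bonded to the m-th derived atom."""
    if m < 1 or q < 0:
        raise ValueError("m must be >= 1 and q must be >= 0")
    return max(1, m - (q + 1))
```

The `max` clamps the target to the first derived atom when \(Q + 1\) exceeds the number of preceding atoms, so `ring_target(3, 20)` returns 1, while `ring_target(10, 2)` returns 7.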
Discussion: In practice, ring bonds are created during a second pass, after all atoms and branches have been derived. The candidate ring bonds are temporarily stored in a queue and then made in the order in which they appear in the SELFIES. A ring bond is made only if its two atoms can accommodate it without violating any bond constraints. This is the only non-local rule in SELFIES, but it can be implemented efficiently, because the check only needs to look at the atoms the ring bond would connect.
It is also possible that the current atom is already bonded to the \((Q + 1)\)-th preceding atom, e.g. if \(Q = 0\). In this case, the multiplicity of the existing bond is increased by the multiplicity of the ring bond candidate. Then the multiplicity of the resulting bond is reduced (minimally) such that no bond constraints are violated, and the multiplicity is at most 3 (see Example 6 below).
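The second pass described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the library's code: atoms are 0-based indices, `free[k]` is the number of bonds atom `k` can still make, `bonds` maps an atom pair to its current multiplicity, and a candidate that would bond an atom to itself is dropped.

```python
from collections import deque


def apply_ring_bonds(bonds, free, candidates):
    """Make queued ring bonds in the order they were recorded.

    bonds:      dict mapping frozenset({a, b}) -> bond multiplicity
    free:       list, free[k] = bonds atom k can still make
    candidates: iterable of (a, b, order) ring-bond candidates
    """
    queue = deque(candidates)
    while queue:
        a, b, order = queue.popleft()
        if a == b:
            continue  # assumption: a self-bond candidate is dropped
        key = frozenset((a, b))
        existing = bonds.get(key, 0)
        if existing == 0:
            # New ring bond: made only if both atoms can take it in full.
            extra = order if free[a] >= order and free[b] >= order else 0
        else:
            # Existing bond: raise its multiplicity, capped by the free
            # bonds of both atoms and by a maximum multiplicity of 3.
            extra = min(order, free[a], free[b], 3 - existing)
        if extra > 0:
            bonds[key] = existing + extra
            free[a] -= extra
            free[b] -= extra
    return bonds
```

For two atoms already joined by a single bond, each with three free bonds, a triple ring-bond candidate raises the bond only to multiplicity 3, leaving one free bond on each atom.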
Examples:
Example | SELFIES | \(Q + 1\) | SMILES |
---|---|---|---|
1 | | 5 | |
2 | | 5 | |
3 | | 1 | |
4 | | 21 | |
Example 4 is a single carbon ring of 22 carbon atoms. |
5 | | 3 | |
Note that the SMILES |
6 | | 3, 3 | |
Special Symbols
The following are symbols that have a special meaning for SELFIES:
Character | Description |
---|---|
 | The |
`[nop]` | The nop (no operation) symbol is always ignored and skipped over by the decoder. Thus, it can be used as a padding symbol for SELFIES. |
`.` | The dot symbol is used to indicate disconnected or ionic compounds, similar to how it is used in SMILES. |
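Using `[nop]` as a padding symbol can be sketched with plain string handling. The helper names `pad_with_nop` and `strip_nop` are hypothetical and do not call the library; the sketch also assumes that the dot counts as one symbol when measuring length.

```python
import re

# A symbol is either a bracketed token or a dot (assumption of this sketch).
_SYMBOL = re.compile(r"\[[^\]]*\]|\.")


def pad_with_nop(selfies: str, length: int) -> str:
    """Right-pad a SELFIES string with [nop] up to `length` symbols."""
    n_symbols = len(_SYMBOL.findall(selfies))
    if n_symbols > length:
        raise ValueError("SELFIES is longer than the requested length")
    return selfies + "[nop]" * (length - n_symbols)


def strip_nop(selfies: str) -> str:
    """Remove every [nop] symbol, undoing the padding."""
    return selfies.replace("[nop]", "")
```

Since `[nop]` is ignored during derivation, a padded string such as `pad_with_nop("[C][O]", 4)`, which is `[C][O][nop][nop]`, describes the same molecule as the unpadded one. This is convenient when fixed-length inputs are needed.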