A quantum sampled molecular dataset ups the challenge of training ML models.
In brief:
- WS22 is a new molecular dataset providing quantum mechanical properties for 1.18 million geometries of 10 molecules of up to 22 atoms.
- WS22 is based on Wigner sampling and geodesic interpolation between conformations. Thus, it spans broader regions of the configurational space, increasing the challenge for machine-learning models.
Multidimensional surfaces of quantum chemical properties, such as potential energies and dipole moments, are common targets for machine learning. Learning them requires robust and diverse databases that extensively explore molecular configurational spaces. Nevertheless, some of the most widely used datasets are limited to classical phase spaces sampled from molecular dynamics.
In a project led by Max Pinheiro Jr, we composed the WS22 database [1], consisting of 1.18 million equilibrium and non-equilibrium geometries sampled from Wigner distributions centered at different equilibrium conformations (either at the ground or excited electronic states) and further augmented with interpolated structures.
WS22 covers ten flexible organic molecules of increasing complexity with up to 22 atoms.
For each of them, we provide several quantum mechanical (QM) properties, including:
- potential energies,
- forces,
- dipole moments,
- polarizabilities,
- HOMO and LUMO energies.
Our sampling delivers a broader, quantum mechanical distribution over the configurational space than the commonly used sampling through classical molecular dynamics, raising the challenge for machine learning models. The figure below, for instance, compares WS22 and MD17 (one of the most used datasets) for toluene. Note how WS22 (brick) spans a much broader region than MD17 (blue).
WS22’s broader distribution is a direct consequence of quantum mechanics. The Wigner distribution includes zero-point vibrational energy, usually much higher than the thermal energy available in molecular dynamics sampling.
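A back-of-the-envelope sketch puts rough numbers on this argument. For a single harmonic mode, the ground-state Wigner distribution has position variance ħ/(2mω), while classical Boltzmann sampling at temperature T gives k_BT/(mω²), so their ratio is ħω/(2k_BT). The mode (a 3000 cm⁻¹ stretch), the temperature (300 K), and the function names below are illustrative assumptions, not values or code taken from WS22; the sketch also ignores thermal population of excited vibrational levels.

```python
import numpy as np

KB_CM = 0.6950348  # Boltzmann constant in cm^-1 per kelvin


def variance_ratio(wavenumber_cm, temperature_k):
    """Ground-state Wigner position variance over classical thermal variance.

    Equals the zero-point energy (half the vibrational quantum) divided by
    k_B*T, for a single harmonic mode.
    """
    return 0.5 * wavenumber_cm / (KB_CM * temperature_k)


def sample_positions(wavenumber_cm, temperature_k, n_samples, rng):
    """Draw positions in the dimensionless coordinate q = x*sqrt(m*omega/hbar)."""
    # Ground-state Wigner distribution: Gaussian in q with variance 1/2.
    q_wigner = rng.normal(0.0, np.sqrt(0.5), n_samples)
    # Classical Boltzmann distribution: Gaussian with variance k_B*T/(hbar*omega).
    sigma_classical = np.sqrt(KB_CM * temperature_k / wavenumber_cm)
    q_classical = rng.normal(0.0, sigma_classical, n_samples)
    return q_wigner, q_classical


rng = np.random.default_rng(0)
q_wigner, q_classical = sample_positions(3000.0, 300.0, 200_000, rng)
print("analytic variance ratio:", variance_ratio(3000.0, 300.0))
print("sampled variance ratio: ", q_wigner.var() / q_classical.var())
```

For this assumed stiff mode, the Wigner position variance is roughly seven times the classical one. For soft modes with ħω well below k_BT, the ground-state formula is no longer adequate and a finite-temperature Wigner distribution would be needed.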
Moreover, Wigner sampling also visits different regions of the phase space. Look at this PCA projection of WS22 and MD17 for toluene.
MD17 is restricted to a ring, reflecting the larger probability of finding a classical oscillator near its turning points. WS22 peaks at the center, as we expect from quantum oscillators.
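The same contrast shows up in a one-dimensional toy model. The sketch below is an illustration under simplifying assumptions, not the analysis behind the figure: it compares a classical harmonic oscillator at fixed energy ħω/2 (sampled at uniformly random phase) with the ground-state Wigner distribution, both in dimensionless phase-space coordinates (q, p). All function names are hypothetical.

```python
import numpy as np


def classical_phase_space(n_samples, rng, amplitude=1.0):
    """Fixed-energy classical oscillator sampled at uniformly random phase.

    In dimensionless units with E = hbar*omega*(q**2 + p**2)/2, amplitude 1
    corresponds to an energy equal to the zero-point energy.
    """
    phase = rng.uniform(0.0, 2.0 * np.pi, n_samples)
    return amplitude * np.cos(phase), -amplitude * np.sin(phase)


def wigner_phase_space(n_samples, rng):
    """Ground-state Wigner distribution: independent Gaussians, variance 1/2."""
    sigma = np.sqrt(0.5)
    return rng.normal(0.0, sigma, n_samples), rng.normal(0.0, sigma, n_samples)


def fraction_between(values, low, high):
    """Fraction of samples with low <= value < high."""
    return np.mean((values >= low) & (values < high))


rng = np.random.default_rng(1)
q_cl, p_cl = classical_phase_space(100_000, rng)
q_wg, p_wg = wigner_phase_space(100_000, rng)

# Two position bins of equal width: one at the center, one at the turning point.
for label, q in (("classical", q_cl), ("Wigner", q_wg)):
    center = fraction_between(q, -0.1, 0.1)
    edge = fraction_between(q, 0.8, 1.0)
    print(f"{label:9s} center bin: {center:.3f}  turning-point bin: {edge:.3f}")

# Classical points all sit at radius 1 (a ring); Wigner points pile up
# around the origin with a broad spread of radii.
print("classical radius spread:", np.hypot(q_cl, p_cl).std())
print("Wigner radius spread:   ", np.hypot(q_wg, p_wg).std())
```

In this toy model, the classical samples are densest in the turning-point bin and trace a ring in phase space, while the Wigner samples are densest in the central bin.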
WS22 is fully described in Ref. [1].
You can download WS22 from this repository. You can also play with the dataset using this interactive interface.
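Once the geometries are loaded into an array, a two-dimensional PCA projection of the kind discussed above can be sketched as follows. The array layout (n_samples, n_atoms, 3), the use of interatomic distances as a rotation- and translation-invariant descriptor, and the function names are assumptions made for illustration; they are not taken from the WS22 files or from Ref. [1], and the random array merely stands in for real data.

```python
import numpy as np


def pairwise_distance_descriptor(geometries):
    """Map (n_samples, n_atoms, 3) coordinates to (n_samples, n_pairs) distances."""
    n_atoms = geometries.shape[1]
    i, j = np.triu_indices(n_atoms, k=1)  # all unique atom pairs
    diff = geometries[:, i, :] - geometries[:, j, :]
    return np.linalg.norm(diff, axis=-1)


def pca_project(features, n_components=2):
    """Project features onto their leading principal components via SVD."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Placeholder data: 500 random "geometries" of a hypothetical 5-atom molecule.
rng = np.random.default_rng(2)
geometries = rng.normal(size=(500, 5, 3))

features = pairwise_distance_descriptor(geometries)
projection = pca_project(features)
print("descriptor shape:", features.shape)
print("projection shape:", projection.shape)
```

Projecting two datasets for the same molecule onto a common set of principal axes (fitted on their union, for instance) would allow a side-by-side comparison of the regions they cover.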
MB
Reference
[1] M. Pinheiro Jr, S. Zhang, P. O. Dral, M. Barbatti, WS22 database, Wigner Sampling and geometry interpolation for configurationally diverse molecular datasets, Sci. Data 10, 95 (2023) DOI: 10.1038/s41597-023-01998-3