Introduction
In recent months I have attended various InChI Working Group meetings, as well as the NIH Virtual InChI Workshop back in March last year. These InChI meetings nowadays not only cover InChI-specific topics but have become a forum to discuss and advance chemical representation and chemical information management on a more general and fundamental level. This applies in particular to chemical entities that are even more complex than small molecules, such as mixtures, inorganics, and nanomaterials. These have so far resisted fitting into any generally accepted and commonly applied chemical data management practices. Just think of a question like “Are these two mixtures the same?”
I have been involved in several projects related to, for example, polymers and material science, and have therefore spent significant time thinking about the fundamentals of chemical representation, data management, and chemical registration of such entities. This has now matured to a state where I feel it can be shared with a wider community, to foster and advance discussion. This blog post is the first of what may become a small series of posts which discuss some fundamental aspects around ‘recalcitrant’ chemical compounds/entities.
Small Molecules
Before discussing more complex chemical entities, it is helpful to look at small (organic) molecules. There are of course various subtleties in dealing with chemical structures of small molecules that require true specialist knowledge (please see our recent whitepaper on the complexities of chemical data migration), but in general, small molecules are quite well-behaved species. For example, almost every biopharmaceutical or chemical company has a chemical registration system, and in general, these systems work pretty well.
What are the reasons for this success?
First, small molecules are discrete entities. For example, you either have benzene, or you have toluene. The difference is (neglecting hydrogens for the moment) exactly one carbon atom. Nothing is possible somewhere between benzene and toluene, as this would require something like half a carbon atom, or a tenth of it. As a consequence, the question of identity, that is, whether two compounds are ‘the same’, can easily (well, kind of easily) be answered.
In the real world, the only really tangible thing is not the chemical compound itself, but a batch (or lot) or a sample of it. The chemical compound itself is something abstract. Depending on context, you may want to call it an idea, a concept, a design, an abstraction, or even an asset. The chemical compound itself has very few inherently associated properties/data (for example, molecular formula, molecular weight, and descriptors such as rotatable bonds). Only the batch (and any sample of it) is real and can have associated measured properties/data, such as a melting point, an amount, a purity, a physical form (for example, powder or crystal form), an NMR spectrum, or an IC50 value in a given biological assay.* And there will be variation across batches. For example, one batch has a purity of 99.5% and is crystalline while another batch is only 95% pure and is a powder.
This brings us to the second underlying reason for this success. For small molecules, we aggregate batches to a compound (which is the abstract superordinate object) solely based on the chemical structure. And we have been doing this for years, without giving it much further thought. But why can we ignore properties such as physical form or purity? Actually, we cannot generally ignore them, but we can ignore them in a certain context, such as pharmaceutical (early) research. Here, the relevant data, for example biological activity, is typically obtained in solution, which makes any solid-state properties of the batch irrelevant. And biological test results, the key parameter in drug discovery, are relatively insensitive to inactive impurities in the test system (and have significant uncertainty ranges anyway).
The situation is completely different, for example, in later stage development of a drug. Different crystalline forms (polymorphism) can cause huge headaches at the development stage, and purity matters a lot if you are going to test a potential drug in humans. Or, silly analogy, for cooking pasta it does not really matter if the table salt you use to season the cooking water comes in fine grains or 1 cm cubes, as both will dissolve in the cooking water and the difference goes away. But have you ever tried seasoning your breakfast egg with 1 cm cubes of table salt?
So, in summary, the success of small molecule registration and data management is due to their discrete nature, resulting in a clear identity determination, and due to the fact that property differences between batches can be ignored for the typical use cases, resulting in a clear aggregation method and a clear identity definition (that can be solely based on chemical structure) when aggregating/abstracting from batches to compounds.
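To make the aggregation step concrete, here is a minimal Python sketch of how batches could be grouped into compounds using nothing but a structure key. It is an illustration under assumptions, not a description of any particular registration system: the Batch class, the field names, and the placeholder keys STRUCT-A and STRUCT-B are all hypothetical, and in practice the key would be some canonical structure identifier.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Batch:
    # All field names are hypothetical and chosen for illustration only.
    batch_id: str
    structure_key: str      # stands in for a canonical structure identifier
    purity_percent: float   # measured per batch, ignored for aggregation
    physical_form: str      # measured per batch, ignored for aggregation


def aggregate_by_structure(batches):
    """Group batch IDs into compounds using only the structure key."""
    compounds = defaultdict(list)
    for batch in batches:
        compounds[batch.structure_key].append(batch.batch_id)
    return dict(compounds)


batches = [
    Batch("B-001", "STRUCT-A", 99.5, "crystalline"),
    Batch("B-002", "STRUCT-A", 95.0, "powder"),
    Batch("B-003", "STRUCT-B", 98.0, "powder"),
]

compounds = aggregate_by_structure(batches)
print(compounds)
```

The two batches of STRUCT-A end up under the same compound although they differ in purity and physical form, which is exactly the simplification that works in early research and breaks down in later-stage development.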
Mixtures
Now let’s leave the world of ‘classic’ chemical compounds and look at the bigger world of (chemical) substances. We start with mixtures, for example, a mixture of 50% water and 50% methanol. If we compare this mixture with a mixture of 60% water and 40% methanol, most people will say these are two different mixtures. But what about a 51:49, or even a 50.001:49.999 mixture? In contrast to the chemical structure in the small molecule example discussed above, the component ratio, a key defining property of a mixture, is not discrete. One consequence is that I now have to distinguish between the theoretical ratio that is specified in the ‘design’ of a ‘theoretical’ mixture (e.g. 50:50), and the real-life actual ratio of a batch of the mixture (e.g. 51:49). In addition, when it comes to aggregation of batches (for example, for an analysis), there are no clear criteria regarding what is ‘the same mixture’. It will be up to the user to define what is to be considered the same and should be aggregated, and the user will base this decision and the associated aggregation criteria on the specific use case.
So depending on the use case, the user will or will not see a 51:49 batch as a representative of a theoretical 50:50 mixture design. And there are further differences from small molecules:
• The relationship of substance to batch is not necessarily one-to-many but can be many-to-many. For example, depending on your criteria, the 51:49 batch from above may be seen as ‘good enough’ for both a 50:50 and a 52:48 mixture (or rather, theoretical mixture design).
• In terms of the objects we are dealing with, it is no longer only batch and compound (or substance), but batch, substance design, and the ‘good enough’. The ‘good enough’ (or ‘fit-for-the-purpose’), depending on context, can be considered a (database) query if I set up my criteria on the fly, or a specification if the criteria are well-established. This is a separate object, and in the landscape of objects it sits somewhere between the substance and the batch, or can even replace the substance.
• Due to the flexibility in the good-enough criteria, the batch-to-substance association is dynamic. Depending on the use case, a batch may or may not be a representative of a given substance.
• There can be overlap between the good-enoughs/specifications. For example, think of a 50:50±2 and of a 52:48±2 mixture. The range 50 – 52 is included in both specifications.
• It is probably not well aligned with the thinking of a typical researcher but in some cases a specification (e.g. 50:50±2) may not even be tied to a formal substance design (e.g., 50:50).
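The points above can be sketched in a few lines of Python. This is only an illustration under assumptions: the RatioSpec class, its field names, and the matching_specs function are hypothetical, the ratio is reduced to the water percentage of a two-component mixture, and the tolerance is treated as a simple symmetric range around the target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RatioSpec:
    # Hypothetical 'good enough' object: a target ratio plus a tolerance.
    name: str
    target_water_percent: float
    tolerance: float

    def accepts(self, water_percent: float) -> bool:
        """Return True if a batch's measured ratio lies within the tolerance."""
        return abs(water_percent - self.target_water_percent) <= self.tolerance


def matching_specs(water_percent, specs):
    """Return the names of all specifications a batch satisfies."""
    return [spec.name for spec in specs if spec.accepts(water_percent)]


specs = [
    RatioSpec("50:50±2", 50.0, 2.0),
    RatioSpec("52:48±2", 52.0, 2.0),
]

print(matching_specs(51.0, specs))  # the 51:49 batch satisfies both specifications
print(matching_specs(49.0, specs))  # a 49:51 batch satisfies only the 50:50±2 one
print(matching_specs(55.0, specs))  # a 55:45 batch satisfies neither
```

Because the specification is an object of its own, the batch-to-substance association is computed rather than stored: changing the tolerance changes which batches count as representatives, and a single batch can belong to several overlapping specifications at once.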
Inorganics
Finally, we want to look at inorganics. For our purposes, we can divide inorganics into two groups: The first group is inorganics that have a clearly defined discrete molecular structure and are typically used in solutions, such as Pd(PPh3)4, a common homogeneous catalyst. Except for a sometimes more complicated chemical structure (for example, containing coordination bonds and multi-center bonds), there is no fundamental difference to the small (organic) molecules discussed earlier. When going from batches to compound, aggregation of batches purely by chemical structure typically fulfills the needs.
The other group consists of inorganics that do not have a clearly defined discrete molecular structure, or where solid phase properties are relevant. Staying in the world of catalysis, solid-state catalysts for heterogeneous catalysis, such as Raney nickel or certain metal oxides, are typical representatives. These share some characteristics with the solvent mixture example. In the case of a solvent mixture, we considered one single extra property beyond the chemical structure which was relevant for batch aggregation and specifications: the ratio. For our heterogeneous catalysts, we now have (at least potentially) a plethora of relevant properties. Prominent examples are pore size, surface area, and particle size, but there could be many others. Like the ratio, these properties are normally not discrete.
Another aspect is that we are now leaving the homogeneous space and entering a heterogeneous world. There is, for example, no single pore size or particle size but a distribution of pore sizes (in some cases statistical, but not always). Instead of one single value, we now have several properties to describe a batch, for example average pore size, standard deviation, and minimum and maximum pore size. Many of these properties are not part of the underlying substance design which was made in the first place.
Other than that, there is no fundamental difference to the solvent mixture example. Again, when it comes to aggregation of batches, it will be up to the user to define what should be aggregated, based on the specific use case, so it is again a matter of the user’s query/specification. In many cases, the query/specification will include several criteria, and it may take the user several iterations to fine-tune the query/specification to match use case and needs.
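As a rough illustration of such a multi-criteria query, the following Python sketch summarizes a pore size distribution per batch and then filters batches against a specification that is tightened in a second iteration. Everything here is an assumption made for the example: the batch IDs, the pore size values, the property names, and the representation of a specification as a dictionary of allowed ranges.

```python
from statistics import mean, stdev


def summarize(pore_sizes_nm):
    """Describe a pore size distribution by a few summary properties."""
    return {
        "mean": mean(pore_sizes_nm),
        "stdev": stdev(pore_sizes_nm),
        "min": min(pore_sizes_nm),
        "max": max(pore_sizes_nm),
    }


def meets(summary, spec):
    """Check that every property named in the spec lies within its allowed range."""
    return all(low <= summary[prop] <= high for prop, (low, high) in spec.items())


def select(batches, spec):
    """Return the IDs of all batches that satisfy the specification."""
    return [
        batch_id
        for batch_id, sizes in batches.items()
        if meets(summarize(sizes), spec)
    ]


# Made-up pore size measurements (in nm) for two hypothetical catalyst batches.
batches = {
    "CAT-001": [4.0, 5.0, 6.0],
    "CAT-002": [2.0, 5.0, 8.0],
}

# First iteration: only the average pore size is constrained.
loose_spec = {"mean": (4.5, 5.5)}
# Second iteration: the largest pore size is constrained as well.
tight_spec = {"mean": (4.5, 5.5), "max": (0.0, 7.0)}

print(select(batches, loose_spec))
print(select(batches, tight_spec))
```

Both batches have the same average pore size and pass the first specification, but only the batch with the narrower distribution passes the second one. This is the kind of iterative fine-tuning described above, with each added criterion changing which batches are aggregated.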
With respect to the question of identity and to chemical registration, this makes the situation more complex. One way forward could be to extend or complement chemical registration with some kind of a ‘specification management system’. Specifications are something which is very common in the world of manufacturing and production, but so far I am not aware of their usage in the context of early stage chemical and biopharmaceutical research or in the context of connecting batches with chemical ideas, designs, and assets.
Going forward, I may, depending on my other commitments and time available, share some thoughts on other chemical entities, such as polymers, nanomaterials, and biologics like antibody-drug conjugates, and on processes and recipes.
* This perspective disregards the manifold and sometimes spectacular advances in, for example, property calculation, spectra prediction, virtual screening, and other simulation. However, many complexities are not yet sufficiently incorporated into simulation algorithms, and virtual data has not replaced experimental data as the ‘gold standard’.
About the Author
Thomas Doerner
Thomas Doerner is an independent specialist for research informatics in life sciences and chemistry. Located at the interface of R&D and Informatics, Thomas helps his clients in pharma, biotech, and chemistry define, design, and implement solutions that enable scientists, foster more effective R&D, and lay the foundation to achieve better outcomes faster. For more information, visit tdoerner.eu.
About the Informatics Alliance
The Informatics Alliance is a small group of dedicated chem- and bioinformatics experts focusing on serving the life science, agro and chemical industries. Each of us brings many years of experience with research informatics projects and practical implementations. We operate independently but we know and help each other, sharing experiences and expertise, and for bigger projects we join forces, for the benefit of all our clients. With group members based in Europe, the US, Asia, and in four world-leading life science hubs (Boston/Cambridge Massachusetts, Basel Switzerland, Copenhagen Denmark, Shanghai China) Informatics Alliance members can support clients all around the globe.