Elucidating chemical reactivity by pattern recognition methods


Analytica Chimica Acta, 191 (1986) 111-123 Elsevier Science Publishers B.V., Amsterdam - Printed in The Netherlands

J. GASTEIGER...










and P. LÖW^a

Organisch-Chemisches Institut, Technische Universität München, D-8046 Garching (Federal Republic of Germany)

(Received 15th July 1986)

SUMMARY Methods have been developed that allow important chemical effects to be quantified. Parameters calculated with these procedures can be used to investigate both quantitative and qualitative information on chemical reactivity. A variety of statistical and pattern recognition methods is used for that purpose. These studies lead to reactivity functions that allow the prediction of the course of complex organic reactions.

^a Present address: Chemodata Republic of Germany.
0003-2670/86/$03.60 © 1986 Elsevier Science Publishers B.V.

Analytical chemists and organic chemists have quite different perspectives of their disciplines and different approaches to solving their problems. An analytical chemist relies heavily on instruments and bases his conclusions on the evaluation of many numbers. Modern spectrometers produce a profusion of data of high precision. Therefore, the analytical chemist must generally resort to powerful data-processing techniques to manage this large amount of information.

An organic chemist, in contrast, bases his conclusions to a large extent on non-numerical information. Observations on the course of reactions have led the organic chemist to form concepts that allow him to order and explain the experimental events. These notions can be names given to reactions that share certain characteristics (Cannizzaro reaction), can specify net observations on reactions (substitution, elimination) or stereochemistry (suprafacial, enantioselective), or can express some physicochemical factor (inductive effect). These semantic models have created a world of their own, largely inaccessible to the uninitiated. However, it must be pointed out that these concepts are mostly used, even by the experts, only to explain observations in hindsight. It is difficult to make predictions with these concepts because, quite often, they are only defined in a qualitative manner. To illustrate, Fig. 1 shows possible pathways for the reaction of an alkyl bromide with a nucleophile. Each of the branching points has been characterized by organic chemists with a specific name. However, it is not easy to decide which branch will be followed for a specific reaction with a given group R and nucleophile X in a particular solvent. At best, attempts can be made to assess the major pathway, which is far from a quantitative prediction of the yields of the various products.

Fig. 1. Potential pathways for the reaction of an alkyl bromide with a nucleophile.

However different the approaches of analytical and organic chemists to solving their problems are, the nature of their problems has much in common. Both deal with multivariate problems for which the exact mathematical relationship is either not known or too complex to be explicitly solvable. Many analytical chemists are now routinely using multivariate data-processing methods for studying their problems. The question is therefore whether these methods can also be of benefit in the problems that organic chemists face. Many multivariate data-processing methods can in fact help in predicting the course and rate of organic reactions.

CHEMICAL REACTIVITY:



Chemical reactions are influenced simultaneously by many effects, each to a different extent. This makes the quantitative treatment of chemical reactivity a difficult problem. The method most widely used for quantitative analysis of reactivity data is to establish a linear free enthalpy relationship (LFER). In the LFER treatment the reaction under investigation is compared with a reference system, either some physical data or another reaction. However, the basic assumption of the LFER treatment, the separation of a system into substituent, skeleton, and reaction site, is artificial and breaks down when the types of interactions between these subunits change within a series of compounds. The existence of several simultaneous interaction mechanisms has led to a profusion of different tables of substituent parameters, making the selection of the appropriate scale for a specific problem a matter of chance.

A study was therefore undertaken to overcome the deficiencies of LFER approaches. Rather than arbitrarily separating a molecule into skeleton, substituent and reaction site, it is considered necessary to treat each molecule as an integral entity, performing calculations on molecules as a whole. Even when effects can be located on specific atoms or bonds, accounting for the influence of more remote parts of the molecule was deemed necessary.

To describe the reacting species, organic molecules, better, the concepts that an organic chemist uses are a logical starting point. These concepts, such as partial atomic charges, inductive effects, resonance effects, polarizability effects, bond dissociation energies, steric effects, and solvent effects, comprise a lot of chemical knowledge in a rather concise manner. As already mentioned, their biggest drawback, however, is that there is only a qualitative or, at most, a semiquantitative feeling for the magnitude of these effects. The goal has therefore been to quantify the chemical effects.

From the outset, the treatment of large molecules and big data sets was envisaged. This requires rapid calculations to be able to handle so many large systems. Therefore, because of constraints on computer time, a quantum-mechanical treatment had to be rejected. Rather, it was necessary to develop novel methods. Because of the empirical nature of these methods, extensive comparisons with physical and chemical data were conducted to establish the significance of the values calculated for the various effects. As most of the methods have been described in detail elsewhere, brief outlines will suffice here. A broader overview of the methods and their application to reaction prediction is available [1].

Properties and effects utilized

Bond dissociation energies. The procedure for calculating bond dissociation energies (BDE) is basically an additivity scheme and works along similar lines to the method for obtaining reaction enthalpies [2]. The program contains tables of parameters for specific structural subunits. Each type of bond and radical center has a base value which is, however, modified depending on the substitution pattern. Thus, for example, the changes in the dissociation energy of a C-H bond in going from a primary to a secondary to a tertiary carbon atom are well reproduced. The values obtained for these bond dissociation energies are usually accurate within 1.0 kcal mol⁻¹ [1].

Partial atomic charges, residual electronegativities and charge flow. The method for calculating partial atomic charges, $q$, in molecules starts from the electronegativity concept. The mutual dependence of electronegativity on charge and of charge transfer on electronegativity difference is solved by an iterative procedure [3]. This partial equalization of orbital electronegativity (PEOE) leads to partial charges on the atoms of a molecule that reflect both the type of atom and its molecular environment. Associated with this charge, each atom receives a uniquely defined residual electronegativity value, $\chi$, that is a measure of the inductive effect. Furthermore, values for the amount of charge that is transferred over a certain bond, the charge flow, $q_f$, are obtained. It has been shown that with these charge values various physical data such as dipole moments, ESCA binding energies, and 1H- and 13C-n.m.r. shifts can be calculated [3-6].

Resonance effect. For conjugated π-systems an extension of the above method had to be developed [7, 8]. Again, it was shown that the resulting π-charges can reproduce dipole moments, ESCA energies, and 13C-n.m.r. shifts [8].
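The iterative idea described above for partial atomic charges (electronegativity depends on charge, and charge transfer depends on the electronegativity difference) can be illustrated with a small sketch. This is not the published PEOE parameterization: the quadratic form of the electronegativity function, the scaling by the donor's cation electronegativity, the damping factor of 0.5 per cycle, and all coefficient values are assumptions made here for illustration only.

```python
# Sketch of an iterative partial equalization of electronegativity.
# ASSUMPTIONS (none of these are given in the text above):
#   - chi(q) = a + b*q + c*q**2 for every atom,
#   - the transfer is scaled by the donor's cation electronegativity chi(+1),
#   - the transfer is damped by damping**k in cycle k,
#   - the coefficients in the usage example are made-up placeholders.

def electronegativity(coeffs, q):
    """Electronegativity of an atom carrying charge q."""
    a, b, c = coeffs
    return a + b * q + c * q * q


def partial_charges(coeffs, bonds, cycles=6, damping=0.5):
    """Return (charges, residual electronegativities, charge flows).

    coeffs: one (a, b, c) tuple per atom.
    bonds:  list of (i, j) atom-index pairs.
    flows[n] > 0 means electron density moved from bonds[n][0] to bonds[n][1].
    """
    n_atoms = len(coeffs)
    q = [0.0] * n_atoms
    flows = [0.0] * len(bonds)
    chi_plus = [electronegativity(c, 1.0) for c in coeffs]
    for k in range(1, cycles + 1):
        chi = [electronegativity(coeffs[i], q[i]) for i in range(n_atoms)]
        dq = [0.0] * n_atoms
        for n, (i, j) in enumerate(bonds):
            # Electrons move from the less to the more electronegative atom.
            donor, acceptor = (i, j) if chi[i] < chi[j] else (j, i)
            shift = (chi[acceptor] - chi[donor]) / chi_plus[donor] * damping ** k
            dq[donor] += shift      # donor loses electrons, becomes more positive
            dq[acceptor] -= shift   # acceptor gains electrons, becomes more negative
            flows[n] += shift if donor == i else -shift
        q = [q[i] + dq[i] for i in range(n_atoms)]
    chi = [electronegativity(coeffs[i], q[i]) for i in range(n_atoms)]
    return q, chi, flows


# Hypothetical diatomic A-B; atom B starts out more electronegative.
charges, residual_chi, flows = partial_charges(
    coeffs=[(7.0, 6.0, 0.5), (14.0, 13.0, 2.0)],
    bonds=[(0, 1)],
)
```

Because the transfer is damped in every cycle, the two electronegativities approach each other but do not become equal, which is the sense in which the equalization is only partial. The three return values correspond loosely to the charge, residual electronegativity, and charge flow named in the text.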


Polarizability effect. An effective polarizability, $\alpha_d$, has been defined as a measure of the stabilization of a charge introduced into a molecule at a specific site [9]. Effective polarizability can be calculated from increments typical for an atom in a given valence state. These values are added in a manner that takes account of the distance of an atom from the charge center. Correlations with physical and chemical data have shown the significance of the values thus calculated.

To summarize, procedures have been developed that allow rapid assignment of quantitative values to important physicochemical effects used by the organic chemist to explain his observations. With these methods, parameters are obtained that allow a detailed description of energetic and electronic effects in molecules. Each atom and bond of a molecule is characterized by several readily calculated numbers (cf. Fig. 5). In the following paragraphs, it will be shown how these values can be used by multivariate data-processing methods to shed more light on chemical reactivity. However, it should also be pointed out that the calculations and correlations with physical data, performed to establish the significance of the parameters and mentioned in the references, are also of interest to the analytical chemist, for they allow the prediction of various spectroscopic data directly from structural information on molecules.

CHEMICAL


Case 1: homogeneous set of quantitative data

Once the significance of the values calculated for the various chemical effects had been established with physical data, attention was turned to chemical reactivity data. Initially, reactivity data on gas-phase reactions were chosen for several reasons. First, in the gas phase, the inherent reactivity of individual molecules is observed. Thus, the complicating effects of a solvent are absent and it can be hoped that the reactivity data can be reproduced with only a few parameters. Secondly, in recent years, many data of high numerical accuracy have been determined through new experimental techniques. Large series of compounds have been measured where the structure of the molecules was systematically changed. Thus, homogeneous sets of accurate data are available.

In this context, application of multilinear regression analysis (MLR) is permissible. Standard precautions were taken. The number of parameters used was kept to a minimum. A particular parameter was only included if there was a definite indication of its relevance, on both physicochemical and statistical grounds. Furthermore, the sign of a coefficient had to correspond to the physicochemical interpretation of the particular parameter. Several series of individual gas-phase reactions were studied independently. Most of the studies have been discussed thoroughly in papers already published [9-12]. Therefore, only a brief summary is given here.

Proton affinity of amines. Proton affinity (PA) values give the energy released on protonation in the gas phase. As long as the data set is restricted to simple alkylamines (Fig. 2, reaction A), the PA values can be reproduced by a single parameter, the effective polarizability, $\alpha_d$: $\mathrm{PA} = 209.2 + 2.70\,\alpha_d$ (n = 49, r = 0.984, s = 1.0 kcal mol⁻¹) [9]. When amines that also bear heteroatoms are included in the data set, the inductive effect of these heteroatoms has to be accounted for. This can be achieved by using an average value of the residual electronegativities, $\chi_{12}$, of the atoms one and two bonds away from the nitrogen atom. This value is a measure of the electron-withdrawing power of the entire molecule exerted on the nitrogen atom. A two-parameter equation considering the values for the polarizability and the inductive effect can reproduce the experimental data of all 80 amines for which data were available at the time of investigation [10]: $\mathrm{PA} = 343.0 + 2.99\,\alpha_d - 27.8\,\chi_{12}$ (n = 80, r = 0.998, s = 1.33 kcal mol⁻¹). The signs of the coefficients express the fact that the protonated amine is stabilized by the polarizability effect whereas the inductive effect destabilizes the positive ion and therefore decreases the proton affinity.

Data on other gas-phase reactions. To explore the general validity of the parameters calculated for the various chemical effects, data for the other gas-phase reactions contained in Fig. 2 (B-E) were also studied by MLR. As above for the amines, two-parameter equations using the values of the polarizability and the inductive effect can reproduce both the proton affinities of alcohols and ethers (reaction B) and those of thiols and thioethers (C). The same two parameters, $\alpha_d$ and $\chi_{12}$, are also sufficient for calculating values for reaction (D), the gas-phase acidity of alcohols. In this case, however, the coefficients have the same sign for both independent variables, reflecting that both effects tend to stabilize the resulting negative alkoxide ion. For the PA data of carbonyl compounds (reaction E), in addition to parameters for the polarizability and inductive effect, a measure of the hyperconjugation effect had to be included to account for its stabilizing effect on an empty orbital in the alkoxycarbenium ion.

Fig. 2. Gas-phase reactions for which data have been analyzed by MLR.

The overall picture that emerges from these studies is that the parameters calculated for the various chemical effects are indeed able to reproduce quantitative data on chemical reactivity. Thus, the reactivities of some fundamental chemical reactions can be directly calculated from structural information on molecular species.

Case 2: qualitative information on reactive bonds

For the gas-phase reactions treated above, good quantitative data for rather extensive, homogeneous sets of molecules were available. This is a rather fortunate situation, not commonly met with organic reactions. For most organic reactions, knowledge of their reactivity is more qualitative in nature ("this compound is more reactive than the other"). In other cases, where quantitative values on reactivity are available, the amount of data is limited. The most serious problem is that the data are not statistically balanced and the data sets are inhomogeneous, i.e., the chemists ran their experiments under widely different conditions, changing too many parameters simultaneously. This precludes application of linear regression techniques. Other ways were therefore sought to quantify chemical reactivity and to develop reactivity functions, even for those reactions where not enough quantitative experimental data are available.
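For later contrast with the classification methods, it may help to see how little machinery the Case 1 regressions need once the descriptors are available. The sketch below simply writes the two proton-affinity equations quoted under Case 1 as functions. The function names and the input values in the usage lines are hypothetical; no descriptor values for real amines are given in this text.

```python
# The two proton-affinity regressions quoted under Case 1, written as functions.
# Results are in kcal/mol. Function names and example inputs are hypothetical.

def pa_alkylamine(alpha_d):
    """One-parameter equation for simple alkylamines (n = 49)."""
    return 209.2 + 2.70 * alpha_d


def pa_amine(alpha_d, chi_12):
    """Two-parameter equation for the full amine set (n = 80)."""
    return 343.0 + 2.99 * alpha_d - 27.8 * chi_12


# Made-up descriptor values, only to show the direction of each effect.
pa_small = pa_alkylamine(4.0)
pa_large = pa_alkylamine(5.0)        # higher polarizability
pa_less_withdrawn = pa_amine(5.0, 5.0)
pa_more_withdrawn = pa_amine(5.0, 5.5)  # stronger inductive withdrawal
```

With these functions a larger effective polarizability raises the computed proton affinity and a larger $\chi_{12}$ lowers it, which matches the interpretation of the coefficient signs given in the text.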
Developing such reactivity functions certainly requires statistical methods more powerful than MLR. Detrimental as the lack of statistically balanced quantitative information on chemical reactivity is, there is also a positive side to analyzing chemical reactivity. Knowledge of chemical reactions is vast; chemists have studied very many reactions, and their investigations have shown the course of a large series of reactions. For many reactions, one knows which pathway, of the many conceivable ones, is actually followed. Thus, although the information on chemical reactions is qualitative in nature, such information exists for a wide area of organic chemistry.

The reactivity space. In order to evaluate chemical reactivity even in those situations where only qualitative information (information about the groups or bonds that are reacting) is available, a variety of pattern recognition methods has been used. To do so, a space was first defined for applying these methods. As indicated above, methods for quantifying important chemical effects are available. The measures of these effects can be taken as coordinates of a space. Each bond of a molecule has uniquely defined values for these effects and is therefore represented by a clearly defined point in that space. As this space is used for studying chemical reactivity, it may be termed the reactivity space. Figure 3 gives a simple three-dimensional space spanned by bond polarizability, charge difference, and electronegativity difference as coordinates. The bonds of iodine bromide and of hydrogen fluoride are taken as examples; iodine bromide has high bond polarizability but low charge and electronegativity difference, whereas the bond of hydrogen fluoride is characterized by a large charge and electronegativity difference but low polarizability.

Fig. 3. Reactivity space defined by bond polarizability, charge and electronegativity difference.

Most organic reactions occur through heterolysis of bonds and formation of bonds between atoms of opposite polarity. However important these reactions are, quantitative understanding of these processes is not well developed. Investigations of heterolytic processes were therefore stressed. As a bond has two ways of breaking heterolytically (Fig. 4), each bond I-J will be represented by two points in a reactivity space, one corresponding to the charge pattern I(+) and J(-) and the other to I(-) and J(+).

Fig. 4. The alternatives for heterolysis of the OH-bond of 2-cyclopentanone carboxylic acid.

In Fig. 5, the heterolysis of several single bonds of 2-cyclopentanone carboxylic acid is represented by points in a space defined by the resonance effect, R, bond dissociation energy, BDE, and charge flow, $q_f$. As an example, the two possibilities for the charge pattern in breaking the OH-bond are given by point 1 and point 7, respectively (Fig. 4). The difference between these two points can, to a large extent, be attributed to the charge flow; in the heterolysis corresponding to point 7, the charges are assigned to the atoms in a manner that is already outlined in the ground state, whereas in point 1, the charge is shifted against the initial polarity of that bond. Another major distinction between points 1 and 7 comes from the resonance effect. A negative charge on the oxygen atom of the carboxyl group can be better stabilized by the resonance effect than a positive charge on that atom. However, both points have the same value for the BDE, as this value refers to a homolysis of a bond, a process which is the same both for point 1 and for point 7.

Fig. 5. Heterolysis of bonds of 2-cyclopentanone carboxylic acid in a reactivity space spanned by the resonance effect, R, bond dissociation energy, BDE, and charge flow, $q_f$. (■) Breakable bonds; (▲) unbreakable bonds (see text).

As an additional feature, the points of Fig. 5 are marked according to whether the corresponding breaking of a bond is considered likely or not. Reactive bonds are distinguished by cubes, non-reactive bonds by square pyramids. With this molecule, the dissociation of a proton from three sites was considered likely, leading to the carboxylate (point 7), or to a carbanion at atom 5 (point 8) or atom 8 (point 10), respectively. Reactive and unreactive bonds clearly separate in this space. Furthermore, the more reactive a bond, the further away the corresponding point is found from the plane separating breakable and unbreakable bonds. From the chemical point of view, in 2-cyclopentanone carboxylic acid, acidity will decrease in the order OH > C8-H > C5-H. This sequence is indeed reflected in a decrease of the distance of point 7, point 10, and point 8, in that order, from the unreactive bonds. Thus, the reactivity observed in 2-cyclopentanone carboxylic acid is represented quite well in the three-dimensional reactivity space of Fig. 5.

Fig. 6. Examples of molecules contained in the data set for representing aliphatic chemistry.


This promising result warranted investigation of a wider range of chemical reactivity. In addition, attempts were made to exploit the properties of reactivity spaces to arrive at quantitative functions expressing chemical reactivity. To investigate the reactivity of a broad range of aliphatic chemistry, 29 molecules were chosen as a data set to cover most functional groups and structural variations found in aliphatic compounds. Figure 6 gives some representative examples. Altogether, these molecules contained 385 bonds. As each bond has two choices for shifting the charges on heterolysis, 770 bond breakings had to be considered. For each such bond breakage, the values of BDE, bond polarizability, σ-charge difference, charge flow, σ-electronegativity difference, and the resonance effect were calculated (Table 1). The large size of this data set and the many parameters for each bond again underscore the need for rapid calculations of all those chemical effects. Of these bonds, 116 bonds were selected for further investigation. Thus, the space for studying the reactivity of aliphatic bonds consisted of 116 points in a six-dimensional space.

Pattern recognition methods of unsupervised and supervised learning were applied to investigate this reactivity space and to get more insight into chemical reactivity. For the application of supervised learning techniques, the bonds were characterized as either reactive or unreactive. This classification rested on rather unequivocal chemical knowledge. Of the 116 bonds, 42 bonds were considered breakable and 74 unbreakable.

Unsupervised learning techniques

A principal component analysis showed two factors to be of predominant importance, with an additional third factor necessary to describe 85% of the variance in the data set (Table 2). Thus, a sizable reduction in the dimensionality of the reactivity space from six to three dimensions is possible without much loss of information. The first factor largely comprises effects in the σ-electron distribution, the second factor is predominantly loaded with BDE, bond polarizability and the resonance effect, and the third factor represents some other effects in the σ-electron distribution. In a cluster analysis, bonds of similar type and those belonging to the same functional groups are found in common clusters. The grouping of these clusters indicates interesting chemical relationships.

Supervised learning techniques

In these two studies, the overall structure of the reactivity space was investigated without making assumptions on the reactivity of bonds. In the following investigations, bonds were classified as either breakable or unbreakable and this information was used by supervised pattern recognition methods. Table 3 gives the results of a K-nearest neighbor analysis. Both breakable and unbreakable bonds can, to a large extent, be correctly classified over a broad range of the number, K, of neighbors being considered. This shows that reactive and unreactive bonds clearly separate in the six-dimensional reactivity space.
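A K-nearest neighbor analysis of the kind mentioned above can be sketched in a few lines. The version below uses plain Euclidean distance, a majority vote, and a leave-one-out check of the classification rate. These are common choices but are assumptions here, since the text does not say how distances were measured or how the rate was estimated. The eight two-dimensional points are synthetic toy data, not values from the 116-bond data set.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query`.

    train: list of (point, label) pairs; point is a tuple of coordinates.
    """
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(data, k):
    """Fraction of points classified correctly when each is held out in turn."""
    hits = 0
    for idx, (point, label) in enumerate(data):
        rest = data[:idx] + data[idx + 1:]
        hits += knn_predict(rest, point, k) == label
    return hits / len(data)


# Synthetic, well separated toy points (NOT data from the paper).
toy_bonds = [
    ((1.0, 1.0), "breakable"), ((1.2, 0.9), "breakable"),
    ((0.9, 1.3), "breakable"), ((1.1, 1.2), "breakable"),
    ((-1.0, -1.0), "unbreakable"), ((-1.2, -0.8), "unbreakable"),
    ((-0.9, -1.1), "unbreakable"), ((-1.1, -1.3), "unbreakable"),
]
rate = leave_one_out_accuracy(toy_bonds, k=3)
```

In a real application the six descriptors have very different numerical ranges (compare BDE with the charge differences in Table 1), so some scaling of the coordinates before computing distances would probably be needed; that step is left out of this sketch.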



TABLE 1
The parameters for the chemical effects for two bonds of 2-chlorobutanoic acid: bond dissociation energy, BDE, bond polarizability, $\alpha$, charge difference, $\Delta q_\sigma$, charge flow, $q_f$, electronegativity difference, $\Delta\chi_\sigma$, and resonance effect, R

Parameter            Bond 7-12            Bond 1-8
                     C+H-      C-H+       O+H-      O-H+
BDE                  98.67     98.67      108.13    108.13
$\alpha$             4.41      4.41       3.60      3.60
$\Delta q_\sigma$    -0.086    0.086      -0.447    0.447
$q_f$                -0.046    0.046      -0.441    0.441
$\Delta\chi_\sigma$  0.10      -0.10      1.56      -1.56
R                    3.84      0.0        3.69      7.25
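The four columns of Table 1 can be read as four points in reactivity space. As a hedged illustration, the sketch below scores each of them with the two-variable function that the paper reports in the footnote to Table 5, f = -4.87 + 37.33 q_f + 0.256 R, and converts f to a probability with the standard logistic form 1/(1 + exp(-f)). Two things are assumptions here: the labels C-H+ and O-H+ for the second column of each bond are inferred from the signs of the charge flow, and 2-chlorobutanoic acid may or may not have been part of the data set used to fit f.

```python
from math import exp


def reactivity_f(q_f, r):
    """Two-variable function quoted in the footnote to Table 5."""
    return -4.87 + 37.33 * q_f + 0.256 * r


def probability(f):
    """Standard logistic transform of the linear score f."""
    return 1.0 / (1.0 + exp(-f))


# (charge flow q_f, resonance effect R) for each heterolysis mode in Table 1.
# The "C-H+" and "O-H+" labels are inferred, not printed in the extracted table.
table1 = {
    "7-12 C+H-": (-0.046, 3.84),
    "7-12 C-H+": (0.046, 0.0),
    "1-8 O+H-": (-0.441, 3.69),
    "1-8 O-H+": (0.441, 7.25),
}

scores = {name: reactivity_f(q_f, r) for name, (q_f, r) in table1.items()}
probs = {name: probability(f) for name, f in scores.items()}
```

Only the O-H+ mode, the loss of the carboxylic acid proton, receives a positive score (f is about 13.4); the other three modes score below zero. The surface f = 0 plays the role of the plane that separates breakable from unbreakable bonds in this two-dimensional subspace.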

TABLE 2
Results of a principal components analysis

Factor    % Variance    Cumulative %
1         38.9          38.9
2         33.5          72.4
3         12.7          85.1

Parameter            Factor 1    Factor 2    Factor 3
BDE                  0.129       -0.835      0.040
$\alpha$             -0.126      0.849       -0.039
$\Delta q_\sigma$    0.845       0.113       0.500
$q_f$                0.997       0.080       0.023
$\Delta\chi_\sigma$  -0.749      -0.037      0.610
R                    0.024       0.449       0.068

There can be a sizeable reduction in the dimensionality of the reactivity space without much loss of capability for correct classification. This can be seen from linear discriminant analysis and logistic regression analysis. A linear discriminant analysis shows that even with two variables (resonance effect, R, and charge flow, $q_f$), a high degree of correct classification is obtained (Table 4). One method that has been found particularly useful is logistic regression analysis. In this method, an initial binary classification, in our case the classification of whether a bond is breakable or unbreakable, is considered as an input probability, $P_0$, that is modelled by a calculated probability $P$. $P$ is taken as an exponential function, $P = 1/[1 + \exp(-f)]$, where the exponent, $f$, is expanded as a linear function in the parameters $x_i$ used: $f = c_0 + c_1 x_1 + c_2 x_2 + \ldots$. It is observed that, again, with the two variables resonance effect, R, and charge flow, $q_f$, a high degree of correct classification is obtained (Table 5). As an additional benefit, a function, $f$, based on the above two parameters is obtained that performs the classification. It was mentioned in relation to Fig. 5 that the more reactive a bond, the farther it is from the unreactive bonds. This suggested the use of the function $f$, obtained through logistic regression analysis for the classification problem, as a quantitative measure of reactivity. Indeed, it has been observed that more reactive bonds are distinguished by higher values of $f$.

TABLE 3
Results of a K-nearest neighbor analysis (% correct)

TABLE 4
Results of a linear discriminant analysis

Group          % correct    Classified into group
                            0        1
Unbreakable    95.9         71       3
Breakable      83.3         7        35

TABLE 5
Results of logistic regression analysis^a

Group    Classification
         0        1
0        72       2
1        6        36

^a $f = -4.87 + 37.33\,q_f + 0.256\,R$.

This function, as developed by the above analysis, and a similar function obtained from investigation of a data set of organic ions and zwitterions have been applied to predict the most reactive bonds in neutral and charged organic species. This allowed prediction of the most likely reaction mechanisms of complex organic reactions, and hence of their course and outcome. Figure 7 shows such a network of reaction steps obtained for the problem of predicting the most likely product for the Lewis acid-catalyzed rearrangement of the menthane-bis-epoxide (1). The intermediates of each reaction step are arranged from left to right with decreasing value of the function $f$. It is important to note that the charges on the structures of Fig. 7 only indicate the direction of charge shift in heterolysis of the corresponding bonds. They must not be taken at face value to correspond to the development of full unit charges. Overall, it is thus predicted that the bond-breaking pattern indicated with structure 2 is the most favored one. Making three bonds in the formal intermediate 2 leads to structure 3, which is predicted to be the most favored reaction product. This is a surprising result, as one of the oxirane rings is broken in a manner leading to the less stable carbenium ion, and above all, it is predicted that even a C-C single bond in an unstrained six-membered ring should be broken. Indeed, this prediction is correct, as 3 is the experimentally observed product [13]. Other examples of the correct prediction of the course and products of complex organic reactions have been reported [1].

Fig. 7. Prediction of the network of reaction steps for the rearrangement of 1,2:4,8-diepoxy-p-menthane (1).

CONCLUSION

It has been shown that the concepts that an organic chemist uses to order his observations and to discuss reaction mechanisms can be put on a quantitative basis. The procedures developed for that purpose provide parameters that can be used for studying information on chemical reactivity. When a homogeneous set of enough quantitative data on chemical reactivity is available, multilinear regression techniques might suffice. In other cases, more powerful statistical and pattern recognition methods have to be applied. With such methods it is even possible to transform qualitative information on whether bonds are reactive or not into reactivity functions that allow assignment of quantitative values for chemical reactivity. With such functions, correct predictions on the course and products of complex organic reactions can be made. Thus, the conceptual models of the organic chemists and the statistical models of chemometrics can be combined to further the understanding of chemical reactivity.

Support of this research by Deutsche Forschungsgemeinschaft and by Imperial Chemical Industries plc, England is gratefully acknowledged. We thank Dr M. G. Hutchings for many interesting discussions.

REFERENCES

1 J. Gasteiger, M. G. Hutchings, B. Christoph, L. Gann, C. Hiller, P. Löw, M. Marsili, H. Saller and K. Yuki, Topics Curr. Chem., 137 (1987) 19.
2 J. Gasteiger, Comput. Chem., 2 (1978) 85; Tetrahedron, 35 (1979) 1419.
3 J. Gasteiger and M. Marsili, Tetrahedron, 36 (1980) 3219.
4 J. Gasteiger and M. D. Guillen, J. Chem. Res. (S) (1983) 304; (M) (1983) 2611.
5 J. Gasteiger and M. Marsili, Org. Magn. Reson., 15 (1981) 353.
6 J. Gasteiger and I. Suryanarayana, Magn. Reson. Chem., 23 (1985) 156.
7 M. Marsili and J. Gasteiger, Croat. Chem. Acta, 53 (1980) 601.
8 J. Gasteiger and H. Saller, Angew. Chem., 97 (1985) 699; Angew. Chem. Int. Ed. Engl., 24 (1985) 687.
9 J. Gasteiger and M. G. Hutchings, J. Chem. Soc., Perkin Trans. 2, (1984) 559.
10 M. G. Hutchings and J. Gasteiger, Tetrahedron Lett., 24 (1983) 2541.
11 J. Gasteiger and M. G. Hutchings, J. Am. Chem. Soc., 106 (1984) 6489.
12 M. G. Hutchings and J. Gasteiger, J. Chem. Soc., Perkin Trans. 2, (1986) 455.
13 T. L. Ho and C. J. Stark, Justus Liebigs Ann. Chem., (1983) 1446.