Mining parasite data using genetic programming

Mining parasite data using genetic programming

Update TRENDS in Parasitology Vol.21 No.5 May 2005 207 Mining parasite data using genetic programming John Barrett1, Aneta Kostadinova2,3 and Juan...

315KB Sizes 0 Downloads 19 Views

Update

TRENDS in Parasitology

Vol.21 No.5 May 2005

207

Mining parasite data using genetic programming John Barrett1, Aneta Kostadinova2,3 and Juan Antonio Raga3 1

Institute of Biological Sciences, University of Wales, Aberystwyth, Ceredigion, UK, SY23 3DA Central Laboratory of General Ecology, Bulgarian Academy of Sciences, 2 Gagarin Street, 1113 Sofia, Bulgaria 3 Marine Zoology Unit, Cavanilles Institute of Biodiversity and Evolutionary Biology, University of Valencia, PO Box 22 085, 46071 Valencia, Spain 2

Genetic programming is a technique that can be used to tackle the hugely demanding data-processing problems encountered in the natural sciences. Application of genetic programming to a problem using parasites as biological tags demonstrates its potential for developing explanatory models using data that are both complex and noisy.

Mimicking natural selection In many areas of biology, the ability to collect data outstrips the ability to analyse it [1]. Techniques are needed to mine large datasets and extract biologically meaningful relationships. Genetic programming (GP) is a stochastic optimization approach that helps to discover comprehensible rules for data mining. It is one of a group of supervised, evolutionary programming techniques that uses darwinian concepts to generate and optimize predictive mathematical models. This is done by mimicking ‘natural selection’ using ‘populations’ of mathematical models [2–4]. Initially, a population of n models (short computer programmes) is generated, each model representing a different, random combination of variables, constants and mathematical functions. The fitness of each model is determined (in terms of how well it solves the problem). The ‘best’ models are then selected for ‘breeding’

to produce the next generation of ‘fitter’ models, and so on until a model is evolved that solves the problem with the required degree of accuracy or until a specified stopping criterion is reached (Figure 1). During breeding, different parts of the models are recombined, and the mathematical functions and variables can be changed: the equivalent of crossover and mutation. Because GP is a randomized algorithm, it is not deterministic, and each new run with a dataset evolves an independent model (Figure 2). Therefore, several alternative solutions to a problem can be evolved. For complex problems for which there is no single answer, each run can result in a different best model, and a validation process must then be devised to select the most appropriate one [5]. Advantages of GP GP has several advantages over the more conventional multivariate analysis or distance-based classifier methods used in data analysis, and it has the potential to discover simple rules in data that are both complex and noisy [6–9]. No assumptions of normality or independence are required about the data to be analysed. It is possible to combine numerical, ordinal and categorical data in the same analysis, and the evolved models can easily be applied to new data. The identification and selection of a small subset of the available variables that has high Yes

Generate initial random population of programmes

Evaluate fitness

Termination criterion satisfied?

End

No

Select fittest programmes

Insert offspring into new population of programmes

‘Breed’ variant forms: • replication • mutation • crossover TRENDS in Parasitology

Figure 1. Flow diagram of GP algorithm. The box with the broken red outline marks the iteratively performed substeps. Corresponding author: Barrett, J. ([email protected]). Available online 18 March 2005 www.sciencedirect.com

Update

208

TRENDS in Parasitology

(a)

(b)

V3 ×

V8 V4

+ +

V6



+

– ÷

V2

×

– –



V3



V1 +

V7 V9

V7 V3

5



33.8

V6

V4 V3

V1

+

+

V2 + V2 + V5 TRENDS in Parasitology

Figure 2. The two best-predicting model equation rules derived from GP for the Mullus surmuletus dataset. (a) ScoreZ[(V8CV4CV3KV1CV3CV6KV2)O5]K (V7KV9). (b) ScoreZ(V3!V4)C{[V6K(33.8!V7KV3)]K(V1CV2CV2CV5)}. The Gmax-bioe software was used with the following settings: population sizeZ 5000; maximum programme length was 28 nodes; fitness based on tournament selection; crossover operator used 80% of the time and mutation of terminals 80%; default numeric constants (0.1, 1, 3, 5 and random) and arithmetic operators (C, K, ! and O). Training terminated at generation 150. V1 denotes Poracanthium furcatum; V2 denotes Aponurus sp.; V3 denotes Ascarophis valentina; V4 denotes Prosorhynchus crucibulum; V5 denotes Hysterothylacium aduncum; V6 denotes Gnathia sp.; V7 denotes Euzetacanthus simplex; V8 denotes Hatschekia mulli; V9 denotes Tentacularia sp.

explanatory power are particularly useful. Preliminary studies have shown that GP performs at least as well as (and often considerably better than) conventional predictive methods [9], and it has an advantage over, for example, neural networks (which are extremely limited in their interpretative capabilities) in producing models that are simple and intelligible in biological terms [8]. Supervised methods are prone to overtraining – the phenomenon whereby improving predictive performance in terms of the training data is accompanied by increasingly poor performance on independent (test) datasets, resulting in poor generalization capacity. Several strategies are available to limit overtraining in GP. The complexity of the overall model can be limited by restricting the number of terms in the model equation, by restricting the range of mathematical operators used or by using a large initial population of models and restricting the number of generations (iterations). Class assignment using parasite data Parasites have been used extensively as biological tags to assess fish populations and have been suggested as sensitive probes to monitor a range of biological and environmental factors [10,11]. However, the methods applied to data analysis have changed and evolved only recently (e.g. multivariate statistical analyses of entire parasite communities [10]). GP seems well suited to addressing the problem of discriminating between parasite populations and to classification tasks using parasites as biological or environmental indicators. As an example, we illustrate a GP approach to a fishpopulation assignment problem using parasite www.sciencedirect.com

Vol.21 No.5 May 2005

assemblages in the demersal striped red mullet, Mullus surmuletus, originating from two close Spanish Mediterranean localities (Burriana and Santa Pola). The distributions of 13 parasite species in a sample of 202 fish (119 from Burriana and 83 from Santa Pola) were used as independent variables. The dataset was subdivided so that 75% of the fish were included in the training set, and the remaining 25% were included in the test set. A series of experiments (150 runs in total) was carried out using the Gmax-bioe software (Aber Genomic Computing: http://www.predictivesolutions.co.uk) to evaluate (i) the efficiency of the models and variable selection (100 runs), and (ii) the effect of using complex operators (20 runs) and limiting the number of variables (30 runs) on model robustness. Several well-performing models with a high accuracy (percentage correctly assigned O94% on the test dataset) were developed, representing w19% of all trials. Erroneously assigned fish were not randomly distributed in the test set. In general, the same fish were misclassified in all models (with two cases accounting for 66.7% of the error in the best solutions). The overall similarity observed with respect to erroneously assigned fish across the otherwise diverse and differing solutions confirmed the consistency of GP model development on the dataset studied. Overall, the use of all variables (i.e. abundance data for all 13 parasite species) seems to provide robustness against misclassification. Indeed, reducing the number of variables did not improve accuracy, even when the restriction was limited to the variables used in the best models developed in previous experiments. On average, approximately seven species were used in the final solutions of each run but a set of four (abundance of Poracanthium furcatum, Aponurus sp., Ascarophis valentina and larval Hysterothylacium aduncum) was used consistently with high frequencies in building individual models, and it seemed to contribute considerably to successful discrimination. Two of the best solutions evolved from the initial 100 runs – experiment (i) (Figure 2) – correctly classified 98.04% of the test data. These models use simple mathematical operators (C, K, ! and O) and, for this particular problem, no advantage was gained when morecomplex operators were used (ln, powers, roots and Boolean). Discrimination between fish populations using parasites as sentinels is a difficult classification task because of the complexity of the host–parasite relationship. Consequently, age- or size-related differences in behaviour and/or feeding habits of fish can affect both species composition and abundance of parasite assemblages, and the temporal variability in parasite transmission adds an extra level of heterogeneity, thus obscuring parasitedistribution patterns. Recently, selective sampling within a defined age range, coupled with statistical adjustment of abundance of selected parasites to a standard age [12], and non-parametric length-stratified bootstrap comparisons [13] have been attempted to counteract these issues. The main promise of the example presented, although rather simplistic, is that the models were developed on a host–parasite system with unrestricted noise; we sampled a wide size-range of fish, combined seasonal data and

Update

TRENDS in Parasitology

Vol.21 No.5 May 2005

analysed a large number of parasite species with variable distributions within the hosts sampled. Considering that only simple mathematical operators were used and that different configurations were not explored, one can expect there to be a huge potential for the application of this approach to solving assignment problems using parasites as tags, and it could be particularly useful in discrimination at finer geographical scales. Future perspectives GP might not be suitable for all types of problem, although it has an advantage when there is no ideal solution or when the variables are constantly changing [3]. A range of mathematical operators can be used in GP (arithmetic, trigonometric, Boolean and fuzzy logic). Fuzzy systems are particularly interesting for biological applications because they can be used to create rules using data that are incomplete, approximate or subjective [14,15]. Although GP can find good solutions to difficult problems, the evolved models might be capable of further optimization by local-search algorithms [16]. If this were done and the optimized models are retained and reincorporated into the population, one would have the computer equivalent of Lamarckianism (reverse encoding of learned improvements back into the genome). This approach can increase efficiency considerably [17]. Recently, GP applied to ecological modelling has also shown promise in predicting time-series changes [9,18] in which the focus is on replicating a pattern and, therefore, has potential for the detection and prediction of cyclic variations of parasite occurrence in studies of host–parasite dynamics. Acknowledgements J.A.R. and A.K. were supported by grants REN2003–01758 (MCYT of Spain) and HPMD-CT-2000–00037 (20001958).

209

2 Banzhaf, W. et al., eds (1999) Genetic Programming – An Introduction, Morgan Kaufmann 3 Koza, J.R., ed. (1992) Genetic Programming: On the Programming of Computers by Natural Selection, MIT Press 4 Ferreira, C. (2001) Gene expression programming: a new adaptive algorithm for solving problems. Complex Systems 13, 87–91 5 Rowland, J.J. (2003) Model selection methodology in supervised learning with evolutionary computation. Biosystems 72, 187–196 6 Johnson, H.E. et al. (2000) Explanatory analysis of the metabolome using genetic programming of simple interpretable rules. Genetic Programming and Evolvable Machines 1, 243–258 7 Fogel, G. and Corne, E., eds (2003) Evolutionary Computation in Bioinformatics, Morgan Kauffman 8 Kell, D.B. (2002) Metabolomics and machine learning: explanatory analysis of complex metabolome data using genetic programming to produce simple robust rules. Mol. Biol. Rep. 29, 237–241 9 Jeong, K-S. et al. (2003) Modelling Microcystis aerungosa bloom dynamics in the Nakdong River by means of evolutionary computation and statistical approach. Ecol. Modelling 161, 67–78 10 MacKenzie, K. (2002) Parasites as biological tags in population studies of marine organisms: an update. Parasitol. 124, S153–S163 11 Williams, H.H. and MacKenzie, K. (2003) Marine parasites as pollution indicators: an update. Parasitology 126, S27–S41 12 Lester, R.J.G. et al. (2001) Movement and stock structure of narrowbarred Spanish mackerel as indicated by parasites. J. Fish Biol. 59, 833–842 13 Marcogliese, D.J. et al. (2002) Use of parasites in stock identification of the deepwater redfish (Sebastes mentella) in the northwest Atlantic. Fish. Bull. 101, 183–188 14 Winder, L. et al. (1997) A key for freshwater invertebrates using fuzzy logic. Comput. Appl. Biosci. 13, 169–172 15 Salski, A. (1992) Fuzzy knowledge-based models in ecological research. Ecol Model. 63, 103–112 16 Hart, W.E. et al., eds (2004) Recent Advances in Mimetic algorithms (Studies in Fuziness and Soft Computing), (Vol. 166) Springer 17 Houck, C.R. et al. (1997) Empirical investigation on the benefits of Lamarckianism. Evol. Comput. 5, 31–60 18 Kaboudan, M.A. (2001) Genetically evolved models and normality of their fitted residuals. J. Economic Dynamics and Control 25, 1719–1749

References 1 Kell, D.B. (2002) Defence against the flood. Bioinformatics World 1, 16–18

1471-4922/$ - see front matter Q 2005 Elsevier Ltd. All rights reserved. doi:10.1016/j.pt.2005.03.007

Letters

What are malaria parasites? Javier Pe´rez-Tris1, Dennis Hasselquist1, Olof Hellgren1, Asta Krizanauskiene2, Jonas Waldenstro¨m1 and Staffan Bensch1 1 2

Department of Animal Ecology, Lund University, Ecology Building, SE-223 62 Lund, Sweden Institute of Ecology, Vilnius University, Akademijos 2, LT-08412 Vilnius, Lithuania

The study of haemosporidian parasites (apicomplexan protozoans that infect the blood of different vertebrates) was boosted recently by the advent of PCR-based methods for the amplification of parasite DNA from blood samples [1–3]. Suddenly, it was possible to detect and identify these parasites quickly and accurately. For non-specialists, this Corresponding author: Pe´rez-Tris, J. (javier.perez-tris[email protected]). Available online 17 March 2005 www.sciencedirect.com

method also enabled detailed analyses of parasite phylogeny and host–parasite co-evolution patterns. Birds and reptiles have rapidly become model groups for these studies, and terms such as ‘avian malaria’ and ‘lizard malaria’ have emerged as commonly used jargon in the literature when referring to diseases caused by haemosporidian parasites in vertebrates other than mammals [1–3]. Simultaneous studies of Plasmodium and other related genera have opened a debate about exactly what should be