Sloot, P. M. A., Boukhanovsky, A. V., Keulen, W., Tirado Ramos, A., & Boucher, C. A. B. (2005). A Grid-Based HIV Expert System. Journal of Clinical Monitoring and Computing, 19(4-5), 263-278. https://doi.org/10.1007/s10877-005-0673-2

A GRID-BASED HIV EXPERT SYSTEM

Peter M.A. Sloot (1), Alexander V. Boukhanovsky (2), Wilco Keulen (3), Alfredo Tirado-Ramos (1), and Charles A. Boucher (4)

From the (1) Section Computational Science, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands; (2) Institute for High Performance Computing and Information Systems, Bering St, 38, St. Petersburg, Russia; (3) Virology Education, 69042 Utrecht, The Netherlands; (4) University Medical Center, University of Utrecht, 3508 GA Utrecht, The Netherlands.

Based on "A Grid-based HIV Expert System", by P.M.A. Sloot, A.V. Boukhanovsky, W. Keulen, and C.A. Boucher, which appeared in the IEEE/ACM International Symposium on Cluster Computing and the Grid, Cardiff, UK, May 9-12, 2005. © 2005 IEEE.

Address correspondence to Peter M.A. Sloot, Section Computational Science, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands. E-mail: sloot@science.uva.nl

ABSTRACT. Objectives. This paper addresses Grid-based integration and access of distributed data from infectious disease patient databases, literature on in-vitro and in-vivo pharmaceutical data, mutation databases, clinical trials, simulations and medical expert knowledge. Methods. Multivariate analyses combined with rule-based fuzzy logic are applied to the integrated data to provide ranking of patient-specific drugs. In addition, cellular automata-based simulations are used to predict the drug behaviour over time. Access to and integration of data is done through existing Internet servers and emerging Grid-based frameworks like Globus. Data presentation is done by standalone PC based software, Web-access and PDA roaming WAP access. The experiments were carried out on the DAS, a Dutch Grid testbed. Results.
The output of the problem-solving environment (PSE) consists of a prediction of the drug sensitivity of the virus, generated by comparing the viral genotype to a relational database which contains a large number of phenotype-genotype pairs. Conclusions. Artificial Intelligence and Grid technology are effectively used to abstract knowledge from the data and provide physicians with adaptive interactive advice on treatment applied to drug resistant HIV. An important aspect of our research is to use a variety of statistical and numerical methods to identify relationships between HIV genetic sequences and antiviral resistance to investigate consistency of results.

KEY WORDS. grid, HIV, PSE, expert system, artificial intelligence, bio-statistics.

1. INTRODUCTION

1.1. Motivation

Forty-two million people worldwide have been infected with HIV and 12 million have died over the last 20 years. Figure 1 shows the pan-epidemic extent of HIV infections. Effective antiretroviral therapy has led to sustained HIV viral suppression and immunological recovery in patients who have been infected with the virus. The incidence of AIDS has declined in the Western world with the introduction of effective antiretroviral therapy, though questions such as "When to start treatment? What to start with? How to monitor patients?" remain heavily debated. Adherence to antiretroviral treatment remains the cornerstone of effective treatment, and failure to adhere is the strongest predictor of virological failure. Long-term therapy can lead to metabolic complications. Other treatment options are now available, with the recent introduction to clinical practice of fusion inhibitors, second-generation non-nucleoside reverse transcriptase inhibitors, and nucleotide reverse transcriptase inhibitors. The sheer complexity of the disease, the distribution of the data, the required automatic updates to the knowledge base, and the efficient use and integration of the advanced statistical and numerical techniques necessary to assist the physician motivated us to explore the novel possibilities supported by Grid technology.

Fig. 1. Worldwide spread of HIV infections, history and near future perspective.

In this position paper we describe ongoing research in our 3 laboratories (Utrecht, St. Petersburg and Amsterdam) addressing the development of a Grid based medical decision support system. The goal of the research is to investigate novel computational methods and techniques that support the development of a user friendly integrated support system for physicians. We use emerging Grid technology to combine data discovery, data mining, statistical analyses, numerical simulation and data presentation [1].

The paper is organized as follows. Chapter 2 describes the background of HIV research and a prototypical rule-based approach to data analyses. In chapter 3 we give an overview of the two computational techniques we study to understand the temporal variability of HIV populations through stochastic modeling and the evolution of HIV infection and the onset of AIDS through Cellular Automata (CA) modeling. Chapter 4 describes a first approach to advanced data presentation through roaming devices such as Personal Digital Assistants (PDAs).

1.2. Background

1.2.1. Clinical aspects of HIV

The clinical management of patients infected with Human Immunodeficiency Virus (HIV) is based on studies on the pathogenesis of the disease and the results of trials evaluating the effects of anti-HIV drugs.
Retrospective analysis of large cohorts has identified laboratory markers for disease progression, such as the amount of virus (HIV-RNA) and the number of T helper cells (CD4+ cells) in blood. In addition, the results of prospective drug trials have generated data on the effectiveness of individual drugs and drug combinations and on the effect of drug resistant viruses on therapy outcome. Currently clinicians are limited in the practical use of this information because in most cases they are only provided with statistical relationships between individual parameters and disease or therapy outcome. Large data sets have not been analyzed and made available in a way that allows a clinician to use the available data in more clinical settings. The availability of large databases and the development of innovative data mining approaches create the opportunity to develop systems which allow the practicing clinician to determine the risk profile for disease development, or the chance of success of a given regimen, for individual patients. Such a system will determine the rate of success for different drug regimens by taking into account the effect and interaction of all relevant laboratory and clinical parameters and by comparing the results for similar patients available in the database.

Currently there are fifteen drugs licensed for treatment of individuals infected with HIV. These drugs belong to two classes, one inhibiting the viral enzyme reverse transcriptase and another inhibiting the viral protease. These drugs are used in combination therapy to maximally inhibit viral replication and decrease HIV-RNA to below the level of detection (currently defined as below 50 copies per ml) in blood. Treatment with drug combinations is successful in inhibiting viral replication to undetectable levels in only 50% of the cases.
In the remaining 50% of cases viruses can be detected with a reduced sensitivity to one or more drugs from the patient's regimen. The molecular basis for resistance has been, and still is, the focus of extensive research. Over 80 amino acid positions in the viral enzyme reverse transcriptase (RT) and 40 positions in the protease enzyme can undergo changes when exposed to selective drug pressure in vitro or in vivo. For some drugs, at certain positions, a change towards a specific new amino acid is seen. At other positions several alternative amino acids may appear and cause (variable) levels of resistance to one or more drugs. In theory, therefore, an infinite number of combinations of amino acid changes could appear and cause resistance in vivo. Preliminary clinical observations, however, show that specific amino acid changes at a limited number of positions and in a limited number of combinations prevail. In addition to changing drug sensitivity, some amino acid changes may also influence the replication potential of HIV. Amino acids selected initially during a failing regimen cause resistance to the drugs the patient is taking, but at the same time may decrease the capacity of the virus to replicate. Changes appearing later do not further increase resistance but merely restore the capacity of the virus to replicate ("viral fitness"). Several clinical studies have been performed recently to evaluate the clinical benefit of resistance-guided therapy.
These studies show that a better virological response is obtained in patients who are failing their therapy when their new regimen is chosen on the basis of their resistance profile. In three out of the four studies from last year, the results showed that if new regimens were selected on the basis of the mutations (viral resistance genotype), the results were better than with standard care approaches. Currently, the clinical interpretation of the viral genotype is based on data sets relating mutations to changes in drug sensitivity, and/or data sets directly relating mutations present in the virus to clinical responses to specific regimens. Initially, experts compared the observed mutations to lists of published sequences taken from the literature, and based on this comparison would select a regimen.

1.2.2. Prototype support system

Recently, first generation bioinformatics software programs have been developed to support clinicians. Examples of such systems are the Virtual Phenotype developed by Virco NV, and a first generation decision support system (Retrogram™) developed by Virology Networks BV in collaboration with parts of our research team. The output of these programs consists of a prediction of the drug sensitivity of the virus generated by comparing the viral genotype to a relational database containing a large number of phenotype-genotype pairs. The Retrogram decision software interprets the genotype of a patient by using rules developed by experts on the basis of the literature, taking into account the relationship between genotype and phenotype. In addition, it is based on (limited) available data from clinical studies and on the direct relationship between the presence of a genotype and clinical outcome.
It is important to realise, however, that these systems focus on biological relationships and are limited to the role of resistance. The next step will be to use clinical databases and investigate the relationship between the viral resistance profile (mutational profile and/or phenotypic data) and therapy outcome measures such as amount of virus (HIV-RNA) and CD4+ cells. A summary of the flow of data is shown in Figure 2.

Fig. 2. From molecule to man: Hierarchical data flow model for infectious diseases.

1.2.3. Data collection

Large high quality clinical and patient databases are used to explore the relationships described above and to develop a first prototype matching system. The Athena cohort is a large Dutch observational clinical cohort study aiming at the surveillance of antiretroviral treatment supported by the government. The cohort consists of 3000 patients from whom data are centrally collected through a decentralized data entry system. Within the cohort 600 patients are studied intensively, whose phenotypic and genotypic data, drug levels and CD4+ and HIV-RNA patterns are collected. Phenotype, genotype, viral fitness and drug levels as well as CD4+ and HIV-RNA patterns will be collected from two large international trials (sponsored by Roche Pharmaceuticals), evaluating the effect of a new fusion inhibitor drug (T20), and representing 1000 patients. The third database will be from the international multi-center Great study, sponsored by Virology Networks BV. Within this study the value of the Retrogram decision support program is evaluated and similar parameters as described above will be collected. Within this study 360 patients will be enrolled.
The Viradapt study showed that the virological response was better in the patient group in which genotype and rule-based interpretation was used as compared to the standard of care arm [2]. On the basis of these results, a more elaborate decision support software system (Retrogram version 1.0) was built in collaboration with Virology Networks B.V. This system ranks the efficacy of the antiretroviral drugs within each class. The ranking is based on expert interpretation of two types of data. The software system estimates the drug sensitivity for the fifteen drugs by interpreting the genotype of a patient by using mutational algorithms. These mutational algorithms are developed by a group of experts on the basis of the scientific literature, taking into account the published data relating genotype to phenotype. In addition, the ranking is based on data from clinical studies on the relationship between the presence of particular mutations and clinical or virological outcome.

The Athena cohort is a large Dutch observational clinical cohort study aiming at the surveillance of antiretroviral treatment supported by the Dutch government. The cohort consists of 3000 patients from whom clinical, virological and immunological data and data on drug side effects are centrally collected through a decentralised data entry system. Within this cohort 600 patients are studied intensively: phenotypic and genotypic data, drug levels and CD4+ and HIV-RNA patterns are collected. Two large international trials (sponsored by Roche Pharmaceuticals), evaluating the effect of a new fusion inhibitor drug (T20) and representing 1000 patients, will also provide phenotype, genotype, viral fitness and drug levels as well as CD4+ and HIV-RNA patterns.
The third database will be from the international multi-center Great study sponsored by Virology Networks BV. Within this study the value of the Retrogram decision support program is evaluated and similar parameters as described above will be collected; 360 patients will be enrolled. Another dataset will come from the Italian Musa study, in which data will be collected from 450 patients followed over a year. The entry point to the trial is failing a first or second regimen; subsequently patients will be genotyped and a new regimen will be selected on the basis of Retrogram 1.4 or the Virtual Phenotype from Virco (Belgium).

Throughout the duration of the project we will collect additional datasets. These datasets may serve to further refine our models and first version software and may also be used to perform validation studies.

1.2.4. Data analysis

The primary goal of the data analysis is to identify patterns of mutations (or naturally occurring polymorphisms) associated with resistance to antiviral drugs and to predict the degree of in-vitro or in-vivo sensitivity to available drugs from an HIV genetic sequence. The statistical challenges in doing such analyses arise from the high dimensionality of these data. A variety of approaches have been developed to handle this type of data, including clustering, recursive partitioning, and neural informatics. Neural informatics is used to synthesize, in a uniform system, heuristic models obtained by methods of knowledge engineering with the results of formal multivariate statistical analysis. Clustering methods have been used to group sequences that are "near" each other according to some measure of genetic distance [3].
Once clusters have been identified, recursive partitioning can be used to determine the important predictors of drug resistance, as measured by in-vitro assays or by patient response to antiviral drugs. Principal component analysis can help to identify the most important sources of variability in the HIV genome. An important aspect of our research is to use a variety of methods to identify relationships between HIV genetic sequences and antiviral resistance to validate the consistency of results.

The molecular sequences of the viral enzymes reverse transcriptase and protease are the micro parameters in the model. In theory an infinite number of combinations of mutations could appear and cause (variable) changes in viral drug sensitivity and viral replication capacity (see also Table 1). Clinical datasets, however, show that specific amino acid changes at a limited number of positions in a limited number of combinations prevail. HIV-RNA and CD4 are the primary parameters determining disease outcome. HIV-RNA, the number of HIV-RNA genomic copies per ml plasma, has been validated as being highly predictive of clinical outcome. HIV-RNA and CD4+ cell numbers are now the standard endpoint in clinical trials for approval of new antiretroviral drugs. A patient's HIV-RNA may range from a few hundred to millions of RNA copies per ml plasma. The CD4+ cell numbers in peripheral blood range typically between zero and a thousand. Whereas the predictive clinical value of both parameters was determined initially in untreated individuals, they have also been shown to be of predictive value for patients under antiretroviral therapy.
Recently, observations have been published indicating that in some patients under highly active antiretroviral therapy (HAART) a disconnect may occur between the response in HIV-RNA and in CD4 counts. Typically, in these patients a rise in HIV-RNA as a consequence of incomplete inhibition of viral replication under therapy is not paralleled by a continuous decrease in CD4 counts. This disconnect has been explained by a decrease in the viral replicative capacity ("viral fitness"), which leads to a decrease in capacity to lower CD4 counts.

Table 1. Parameters for the data analyses. Here the hierarchical approach shown in Figure 2 is extended to detail the content of the parameters.

Micro Parameter: Protease Mutations; Reverse Transcriptase Mutations
Primary Parameter: HIV-RNA; CD4; Drug Resistance
Macro Parameter
Meta Parameter (Virological): Viral Fitness
Meta Parameter (Clinical): Weight; Opportunistic Infections and Tumors; Survival
Intervention Parameter: Drug Dosage; Bio-availability of Drug/Drug Level

The patient's weight and secondary opportunistic infections and/or malignancies are parameters that determine disease outcome and survival time. Currently there are fifteen drugs licensed for treatment of individuals infected with HIV. More than ten inhibitors have been developed which inhibit the reverse transcriptase process. These inhibitors can be classified in two sub-categories that differ in the way they inhibit the RT-enzyme: nucleoside (analogue) RT-inhibitors (NRTI) and non-nucleoside RT-inhibitors (NNRTI). The remaining compounds inhibit the protease enzyme, which acts much later in the HIV replication cycle than reverse transcriptase.

The protease is responsible for cleaving a long polyprotein into smaller functional proteins.
The overall exposure to antiretroviral drugs has been shown to be an important factor for the degree of success of a given therapy. The overall exposure can be captured by parameters such as dosage and bio-availability, which codetermine the drug level within an individual patient. Given the relationships between exposure and antiviral efficacy, variability in drug levels (which may be due to differences in patient adherence to their regimens) will contribute to virological and immunological outcome. Individuals with relatively low exposure are more likely to experience virological failure than those with a high exposure.

2. METHODS AND MATERIALS

2.1. Modeling the dynamics and temporal variability of HIV-1 populations

In addition to rule based and parameter based decision support we developed statistical models and cellular automata based models to study the dynamics of the HIV populations. These 2 numerical models run on Grid resources. The output is integrated with the medical support system and accessible to the end-user. In this section we briefly outline the two computational methods. Details are beyond the scope of this paper; we refer to the references provided.

2.1.1. A cellular automata model to study the evolution of HIV infection and the onset of AIDS

A cellular automata model to study the evolution of HIV infection and the onset of AIDS is developed. The model takes into account the global features of the immune response to any pathogen, the fast mutation rate of HIV, and a fair amount of spatial localization, which may occur in the lymph nodes. The dynamics of the cellular automata requires high throughput computing, which is provided by the resource management of the Grid.
In this section, we employ non-uniform Cellular Automata (CAs) to simulate drug treatment of HIV infection, in which each computational domain may contain different CA rules, in contrast to normal uniform CA models. Ordinary (or partial) differential equation models are insufficient to describe the two extreme time scales involved in HIV infection (days and decades), as well as the implicit spatial heterogeneity. Zorzenon dos Santos et al. [7] reported a cellular automata approach to simulate three-phase patterns of human immunodeficiency virus (HIV) infection consisting of primary response, clinical latency and onset of acquired immunodeficiency syndrome. We developed a non-uniform CA model to study the dynamics of drug therapy of HIV infection, which simulates four phases (acute, chronic, drug treatment response and onset of AIDS). Our results indicate that both simulations (with and without treatment) evolve to the same steady state. Three different drug therapies (mono-therapy, combined drug therapy and HAART) can also be simulated in our model. Our model for prediction of the temporal behaviour of the immune system under drug therapy qualitatively corresponds to clinical data.

Pseudo Code 1a: HI Model (adapted from Zorzenon dos Santos R. M., Phys. Rev. Lett. 2001). H = healthy cell; A1 and A2 are infected cells at different time steps.
Assume: states {H, A1($t$), A2($t + \tau$), D}; 1 time step = 1 week; simulation of a lymph node; Moore neighbourhood and square lattices used.

Rule 1 (for a healthy cell):
  (a) If it has at least one infected-A1 neighbor, it becomes infected-A1.
  (b) If it has no infected-A1 neighbor but does have at least $R$ ($2 < R < 8$) infected-A2 neighbors, it becomes infected-A1.
  (c) Otherwise it stays healthy.
Rule 2: An infected-A1 cell becomes infected-A2 after $\tau$ time steps.
Rule 3: Infected-A2 cells become dead cells.
Rule 4:
  (a) Dead cells can be replaced by healthy cells with probability $p_{\mathrm{repl}}$ in the next step.
  (b) Each new healthy cell introduced may be replaced by an infected-A1 cell with probability $p_{\mathrm{infec}}$.

This CA (Pseudo Code 1a) mimics in a simple way the dynamical properties of an HIV infection; next we introduce drug therapy into the model by modelling a response function $P_{\mathrm{resp}}$ and changing only Rule 1.

Pseudo Code 1b: Advanced HI Model, taking into account drug therapy effects.

Rule 1:
  (a) If there is one A1 neighbor after the start of drug therapy, $N$ ($0 \le N \le 7$) neighboring healthy cells become infected-A1 in the next time step with probability $P_{\mathrm{resp}}$. Otherwise, all eight neighbors become infected-A1. $N$ represents the effectiveness of the drugs: $N = 0$ means no replication; $N = 7$ means a less effective drug. $P_{\mathrm{resp}}(t - t_s)$ is a response function of the drug effects over the time steps $t$, where $t_s$ is the start of treatment.

The main success of the presented CA model is the adequate modeling of the four phases of HIV infection, with their different time scales, in one model. Moreover, we could also integrate all three therapy procedures. The simulations show a qualitative correspondence to clinical data.
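The rules of Pseudo Code 1a can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' Grid implementation: the function names `step` and `simulate`, the periodic lattice boundaries, the synchronous update, the per-cell age counter and the initial infected fraction `p_hiv` are all choices made for this sketch, and the drug therapy variant of Rule 1 (Pseudo Code 1b) is not included.

```python
import random

H, A1, A2, DEAD = 0, 1, 2, 3  # healthy, infected-A1, infected-A2, dead


def step(grid, age, R, tau, p_repl, p_infec, rng):
    """Apply Rules 1-4 once, synchronously, on an L x L square lattice.

    Periodic boundaries are an assumption of this sketch.
    """
    L = len(grid)
    new_grid = [row[:] for row in grid]
    new_age = [row[:] for row in age]
    for i in range(L):
        for j in range(L):
            state = grid[i][j]
            if state == H:
                # Count infected cells in the Moore neighbourhood (8 cells).
                n_a1 = n_a2 = 0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        if di == 0 and dj == 0:
                            continue
                        s = grid[(i + di) % L][(j + dj) % L]
                        n_a1 += s == A1
                        n_a2 += s == A2
                if n_a1 >= 1 or n_a2 >= R:      # Rule 1 (a) and (b)
                    new_grid[i][j] = A1
                    new_age[i][j] = 0
            elif state == A1:                   # Rule 2
                new_age[i][j] = age[i][j] + 1
                if new_age[i][j] >= tau:
                    new_grid[i][j] = A2
            elif state == A2:                   # Rule 3
                new_grid[i][j] = DEAD
            elif rng.random() < p_repl:         # Rule 4 (a)
                # Rule 4 (b): the replacement cell may itself be infected.
                new_grid[i][j] = A1 if rng.random() < p_infec else H
                new_age[i][j] = 0
    return new_grid, new_age


def simulate(L, weeks, p_hiv, R, tau, p_repl, p_infec, seed=0):
    """Return the number of healthy cells after each weekly time step."""
    rng = random.Random(seed)
    # Assumption: a fraction p_hiv of cells starts as infected-A1.
    grid = [[A1 if rng.random() < p_hiv else H for _ in range(L)]
            for _ in range(L)]
    age = [[0] * L for _ in range(L)]
    healthy = []
    for _ in range(weeks):
        grid, age = step(grid, age, R, tau, p_repl, p_infec, rng)
        healthy.append(sum(row.count(H) for row in grid))
    return healthy
```

A call such as `simulate(L=50, weeks=600, p_hiv=0.05, R=4, tau=4, p_repl=0.99, p_infec=1e-5)` returns one healthy-cell count per simulated week. The parameter values in this call are placeholders chosen for illustration only; they are not taken from this paper.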
During the phase of drug therapy response, temporal fluctuations for $N > 3$ were observed; this is due to the relatively simple form of the response distribution function ($P_{\mathrm{dis}}$) applied to the drug effectiveness parameter $N$ at each time step. The simulation results indicate that, in contrast to ODE/PDE models, our model supports a more flexible approach to mimic different therapies by mapping the parameter space of $P_{\mathrm{dis}}$ to clinical data. Therefore there is ample room to incorporate biologically more relevant response functions into the model. The data integration required for the CA, the parametric computation and the data presentation are supported by the Grid.

2.1.2. Multivariate stochastic modeling

The modeling of Human Immunodeficiency Virus (HIV-1) genotype datasets aims to identify patterns of mutations (or naturally occurring polymorphisms) associated with resistance to antiviral drugs and to predict the degree of in-vitro or in-vivo sensitivity to available drugs from an HIV-1 genetic sequence. The statistical challenges in doing such analyses arise from the high dimensionality of these data. Direct application of the well-known genetic approaches [5] to the analysis of HIV-1 genotypes raises a number of problems. The principal difference is that, in HIV DNA analysis, the main interest is in the so-called relevant mutations: a set of mutations associated with drug resistance. These mutations might exist in different positions over the amino-acid chains. Moreover, the sheer complexity of the disease and the data requires the development of a reliable statistical technique for their analysis and modeling. A multivariate stochastic model for describing the dynamics of complex non-numerical ensembles, such as observed in the (HIV) genome, has been developed in [6].
This model was based on principal component analysis for numerated variables. Generally speaking, the interpretation of numerated variables in terms of relevant mutations is not clear. Below we develop this model directly for the ensemble of relevant mutations in the RT and protease parts of the HIV-1 genome. Each element of the ensemble is presented as the tuple (cortege) $\Xi_k = \{\xi_j\}_{j=1}^{n_k}$, $k = 1, \ldots, M$, with the variable dimension $n_k$, the total number of mutations in the gene. Each value $\xi_j$ is a literal index and corresponds to the position and new value of the amino acid (e.g., 184V, 77I, etc.). This allows each mutation to be associated with a categorical random variable $i \in 1 \ldots K$, where $K$ is the total number of possible mutations. Each subsample of genomes with a fixed number of mutations $n = \mathrm{const}$ may be considered as realizations of a categorical random vector. The representation above is based on the proximity to the "wild-type" virus and takes into account only the relevant mutations in a genome. It allows for significant compression of the DNA representation and simplifies the interpretation of the results.

Principle of the modeling approach. The joint variability of different mutations in the HIV-1 genomes is a complicated phenomenon. The dimension of the probabilistic characteristics is high, and their analytical investigation and interpretation are hard. Hence, for studying HIV-1 populations we use a computational statistical approach that allows us to numerically generate an ensemble with the same probabilistic properties by means of a Monte-Carlo procedure. This is a well-known, powerful method to study complex system variability.

The idea of the stochastic modeling is shown in Figure 5.
It is based on the evolutionary hypothesis, considering the group with $n + 1$ mutations as a subgroup of the group with $n$ mutations in the previous step. For each gene the transition from the $n$ to the $n + 1$ mutation group is driven by a stochastic operator $D^{(n+1)}$, which defines the mutations at step $n + 1$ when the mutations at the previous $n$ steps are known. The initial step of the stochastic procedure begins from the whole ensemble of wild-type viruses. The number of genomes mutated at each step of the stochastic procedure is in accordance with $M_n = \rho_n M$, where $\rho_n$ are the probabilities of occurrence of genotypes with $n$ mutations in a total population of $M$ genes.

Fig. 3. Temporal behaviour of the CD4 count, with modeled Brownian movement for lymphocytes [8].

Fig. 4. As in Figure 3, with additionally modeled mono therapy in week 300 [8].

Fig. 5. Principle of the modeling.

The stochastic operator $D$ may be considered as a "black box". It is formalized in terms of the conditional probabilities of the occurrence of mutation $\xi_i$ if the mutation $\xi_j$ arose in the previous step of the generation. For genotypes with only 2 mutations the values $D_{ij}$ are the conditional probabilities of the pairs. In this case the matrix $\{D_{ij}\}$ is the Markov transition probability matrix, containing the conditional probabilities for simple Markov chains with the number of states corresponding to the number of relevant mutations. In more complicated cases, where $n > 2$, the probability matrix $\{D_{ij}\}$ consists of the conditional probabilities of meeting mutation $\xi_j$ in a certain gene when the mutation $\xi_i$ is present.
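The Monte-Carlo generation of an ensemble can be illustrated with a short Python sketch. It rests on simplifying assumptions that are not stated in the text: the number of mutations is drawn from $\rho_n$ for each genome (so that $M_n = \rho_n M$ holds only in expectation), the first mutation is drawn from a separate distribution over single mutations (called `p1` here), each further mutation is conditioned only on the most recently added one (a simple Markov chain over $D$), and mutations already present are excluded with the remaining probabilities renormalized. The function names, the mutation labels `m1` to `m3` and all numeric values are hypothetical.

```python
import random


def sample_genotype(rho, p1, D, rng):
    """Draw one genotype as a tuple of distinct mutation labels.

    rho: {n: probability of a genotype with n mutations}
    p1:  {mutation: probability of being the first mutation}
    D:   {previous mutation: {next mutation: conditional probability}}
    """
    n = rng.choices(list(rho), weights=list(rho.values()))[0]
    if n == 0:
        return ()  # wild-type genome, no relevant mutations
    mutations = [rng.choices(list(p1), weights=list(p1.values()))[0]]
    while len(mutations) < n:
        row = D[mutations[-1]]
        # Assumption: a mutation cannot be added twice; renormalize the rest.
        candidates = {m: p for m, p in row.items()
                      if m not in mutations and p > 0}
        if not candidates:
            break  # no admissible transition left
        nxt = rng.choices(list(candidates),
                          weights=list(candidates.values()))[0]
        mutations.append(nxt)
    return tuple(mutations)


def generate_ensemble(rho, p1, D, M, seed=0):
    """Generate M genotypes, starting each one from the wild type."""
    rng = random.Random(seed)
    return [sample_genotype(rho, p1, D, rng) for _ in range(M)]


# Toy parameters for illustration only; they are not estimates from any dataset.
toy_rho = {0: 0.2, 1: 0.4, 2: 0.3, 3: 0.1}
toy_p1 = {"m1": 0.5, "m2": 0.3, "m3": 0.2}
toy_D = {
    "m1": {"m2": 0.7, "m3": 0.3},
    "m2": {"m1": 0.6, "m3": 0.4},
    "m3": {"m1": 0.5, "m2": 0.5},
}
ensemble = generate_ensemble(toy_rho, toy_p1, toy_D, M=1000)
```

In a real application the three inputs would be estimated from a genotype database rather than typed in by hand, and the conditioning on the full set of earlier mutations described for $n > 2$ would replace the last-mutation shortcut used here.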
This approach allows us to reduce the complicated statistical description of the dataset to a rather simple model, using only three probabilistic distributions as the initial parameters of the model:
• distribution $\rho_n$ of the number $n$ of mutations;
• distribution $P^{(1)}_\xi$ of the relevant mutations in the group $n = 1$;
• transient probability matrix $D$.
All these parameters can be identified from sample datasets of the HIV-1 population.

Identification of the model. To identify the parameters of the model, a large database of HIV-infected patients, collected over several years in the USA, is used [4]. The database contains the genotypes of 43620 patients examined from August 9, 1998 to May 5, 2001. We observed 59 different mutations in the RT genome, including 17 mixed mutations, and 77 different mutations in the protease genome, including 34 mixed mutations.

Distribution $\rho_n$ of the number of mutations. The practice of HIV treatment has shown that the variability of the number of mutations $n$ is high, owing to the complexity of the drug combinations that have been applied. The sample estimate of the distribution $\rho_n$ of the number of mutations in protease is shown in Figure 6. The distribution has a clear first peak ($n = 1$) and a shelf (or second peak) at $n = 3$ to $5$. We therefore expect that there are two groups of genomes in the database, corresponding to low and high numbers of mutations. A possible interpretation of this bi-modal distribution is that there are two groups of patients. One group consists of “new” patients who have had one or two treatments, so that their genotypes contain relatively small numbers of mutations. The second group consists of “old” patients, who have a long treatment history, or of new patients infected by treated HIV-1 patients [15].
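The three parameters listed above can be estimated from a genotype sample by simple counting. The sketch below is a hedged illustration only: the function name `identify_model`, the representation of a genotype as a Python set of mutation labels, and the five-genotype toy sample are assumptions. The matrix entry `D[i][j]` is computed as the share of genotypes carrying mutation `i` that also carry mutation `j`.

```python
from collections import Counter
from itertools import permutations


def identify_model(genotypes):
    """Estimate (rho, p1, D) from a list of genotypes (sets of labels)."""
    total = len(genotypes)
    # rho[n]: share of genotypes that carry exactly n mutations.
    sizes = Counter(len(g) for g in genotypes)
    rho = {n: c / total for n, c in sorted(sizes.items())}
    # p1[m]: share of mutation m among genotypes with one mutation.
    singles = Counter(next(iter(g)) for g in genotypes if len(g) == 1)
    n_single = sum(singles.values())
    p1 = {m: c / n_single for m, c in singles.items()} if n_single else {}
    # D[i][j]: share of genotypes containing i that also contain j.
    carriers = Counter(m for g in genotypes for m in g)
    pairs = Counter(p for g in genotypes for p in permutations(sorted(g), 2))
    D = {}
    for (i, j), c in pairs.items():
        D.setdefault(i, {})[j] = c / carriers[i]
    return rho, p1, D


# Toy sample, invented for illustration only.
sample = [
    {"184V"},
    {"184V"},
    {"103N"},
    {"184V", "103N"},
    {"184V", "41L", "215Y"},
]
rho_hat, p1_hat, D_hat = identify_model(sample)
print(rho_hat)
print(p1_hat)
print(D_hat["184V"])
```

With tens of thousands of genotypes, as in the database described above, the same counting applies unchanged; only the toy sample would be replaced by the real data.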
Distributions of the relevant mutations $P_\xi$. The distribution $\rho_n$ describes only the variability of the groups of “new” and “old” patients. For a more detailed study of the virus mutations driven by certain drug combinations, the probabilities of occurrence of the relevant mutations $\xi$ should be considered. They are estimated by the sample frequencies:

$$P_\xi = \frac{\{\text{Number of genes with mutation } \xi\}}{M}. \qquad (1)$$

Fig. 6. Statistical description for distribution of mutations in Protease.

Here $M$ is the total number of genomes in the dataset. Equation (1) describes the marginal impact of each mutation in the total population, without any information about the number and occurrences of other mutations. The probabilities of the most significant relevant mutations $\xi_k$ (in decreasing order of probability) are shown in Figure 6. The marginal estimates of $P_\xi$ over the total dataset show only the general impact of the mutations. For a detailed analysis of their behavior we also consider the occurrences $P^{(n)}_\xi$ of mutations in the groups of genotypes with exactly $n$ mutations. These values were also computed by means of Equation (1), with $M$ set to $M_n = \rho_n M$, the number of genes with $n$ mutations in the database. The sample estimates of these occurrences are also shown in Figure 1. It is clearly seen that the contributions of some mutations differ considerably for different $n$, both for the protease and RT parts of the genome. For example, for RT with $n = 1$, the mutations 184V and 103N make the main contribution. The distribution $P^{(1)}_\xi$ is the limit distribution of the procedure shown in Figure 5. From Figure 1 we also observe that the total sum $\sum_k P_{\xi_k} > 100\%$, except for the case $n = 1$.
This demonstrates that the analysis of marginal mutations is not sufficient for a general statistical description of the variability of the whole DNA ensemble, because some positions of the DNA may be statistically dependent [15], especially in relation to viral fitness. Hence, the joint characteristics of its variability must be taken into account.

Transient probability matrix $D$. The conditional probability of the occurrence of mutation $\xi_i$, if the mutation $\xi_j$ arises from the previous steps of the generation, is estimated by:

$$D_{ij} = \frac{\{\text{Number of genotypes with mutations } \xi_i \text{ and } \xi_j \text{ simultaneously}\}}{\{\text{Number of genotypes with mutation } \xi_i\}}. \qquad (2)$$

The dimensionality of the related matrix, obtained from Equation (2), may be rather high. In order to decrease the dimensionality we consider the algebraic technique of orthogonal expansion, applied to transient probability matrices [16]:

$$D = \Phi \Lambda^{1/2} \Psi^{T}, \qquad (3)$$

where $\Phi$ are the eigenvectors of the matrix $DD^T$, and $\Psi$ those of the matrix $D^T D$. This allows the coefficients $a_k = \sqrt{\lambda_k}$ to be considered as the principal components (PC) [13], and the probability (2) to be represented as a series:

$$D_{ij} = \sum_k \sqrt{\lambda_k}\, \varphi_{ik} \psi_{jk}. \qquad (4)$$

The value $\lambda_k$ shows the part of the probability explained by the $k$-th PC. The sum of the first $k$ coefficients $\lambda_k$ may be interpreted as a measure of convergence of the series (4). Table 2 shows the values of the first 7 $\lambda_k$ for the RT and protease parts of the HIV-1 genome. These data were obtained for the total database. It can be seen that the series (4) converges rather fast in both cases: for the RT part, for example, the first term of the series alone explains more than 60% of the conditional probability (the first five terms explain 80%).

Let us consider the normalized bases $\tilde\varphi_{ik} = \lambda_k^{0.25} \varphi_{ik}$, $\tilde\psi_{jk} = \lambda_k^{0.25} \psi_{jk}$.
These normalized bases allow the terms in Equation (4) to be written as $p_{ijk} = \tilde\varphi_{ik} \tilde\psi_{jk}$ and interpreted as independent factor loadings that drive the changes of the conditional probability $D_{ij}$ over all the mutations $\xi_i$, $\xi_j$ in the database. For example, Figure 7 shows the estimates of the first basic functions for the RT and protease parts of the genotype (the contributions of the products of these functions are given in Table 2). It is clearly seen that the first term $p_{ij1} = \tilde\varphi_{i1} \tilde\psi_{j1}$ reflects the total occurrence of the mutations in a genotype (see Figure 6): for the mutations with the maximal occurrences, the contribution to the conditional probabilities of their pairs is also high.

Table 2. Normalized (%) values of the expansion coefficients $\lambda_k$ in Equation (4)

| Part of the genome | PC 1 | PC 2 | PC 3 | PC 4 | PC 5 | PC 6 | PC 7 |
|---|---|---|---|---|---|---|---|
| RT | 61.3 | 8.2 | 5.4 | 2.8 | 2.1 | 1.7 | 1.6 |
| Protease | 55.0 | 6.3 | 4.5 | 4.2 | 3.4 | 2.7 | 2.4 |

Fig. 7. Orthogonal basic functions of expansion (4) for transient probability matrix.

Model validation. The simulation model is based only on the distributions $\rho_n$, $P^{(1)}_\xi$ and $D$ of the mutations. No information on more complicated mechanisms (distributions of pairs, triples, etc.) has been used for this identification. The main goal of the verification is to check whether these features of the ensemble can be reproduced through the dependencies formalized by the matrix $D$. We compared the total occurrences of all mutations in genotypes, estimated on the initial and simulated samples; see also Figure 6 (solid line). The results of the simulation and of the sample are rather close.

The error of the simulation increases in proportion to the absolute value of the occurrences. Nevertheless, in some cases the error of the simulation is larger than the boundary of the confidence interval.
This systematic error may be explained by possible variations in the matrix $D$ between the groups of “old” and “new” patients.

Application to forecast of HIV-1 evolution in time. The evolution of the total world population of HIV-1 and the associated change in the related drug resistance levels should be taken into account. The stochastic model shown in Figure 5, which describes the HIV-1 genotype ensemble in terms of parameters, can be used to analyze its temporal variability during the observation period (August 1998 to May 2001). The temporal variability of the data may be considered in terms of seasonal samples (3-month periods). The seasonal samples contain from 1500 to 4500 genotypes, which is enough to obtain stable estimates. Only the hypothesis of linear trends is considered: $\xi(t) = a t + b + \delta(t)$, where $a$ is the most interesting parameter (the value of the trend), $b$ is the shift parameter, and $\delta$ is white noise. Table 3 shows the integral parameters of the trends of the various parameters of the HIV-1 population (mean value of the parameter, value of the trend, determination coefficient $R^2$ and the sample value of the F-criterion).

Trends of single mutations occurrence $P_\xi$. The database allowed us to investigate trends in codon frequency in the period from 1998 to 2001. Results for protease and RT are shown in Table 3. The majority of the mutations in the genotype have a negative trend; only 77I in protease has a significant positive trend.

Trends of bi-modal distribution for number of mutations in genotypes $\rho_n$.
To decrease the dimensionality of the data and to discriminate statistically between the two groups in the dataset, we consider a mixture of two Bernoulli (binomial) distributions:

$$\rho_n = p_g\, C_{m_1}^{n} q_1^{n} (1 - q_1)^{m_1 - n} + (1 - p_g)\, C_{m_2}^{n} q_2^{n} (1 - q_2)^{m_2 - n}, \qquad (5)$$

where $p_g$ is the weight of the first group of mutations (and $1 - p_g$ is the weight of the second group), $m_1$, $m_2$ are the maximal numbers of mutations in the groups, and $q_1$, $q_2$ are the probabilities of finding any one (arbitrary) mutation in the groups. The Bernoulli scheme, based on the repetition of independent events, is closer to the description of the mutation process than the Poisson distribution, which is generally applied to the description of rare events.

The temporal variability of the parameters $(p_g, q_1, q_2, m_1, m_2)_t$ of the approximation of $\rho_n$ by Equation (5) is shown in Table 3. In both cases only the parameter $p_g$ (the weight of the left part, for the group of $m_1$ mutations) has a clear significant positive trend. For protease, the value of $p_g$ increased from 39% in summer 1998 to 62% in summer 2001 (with an average increment $a = 0.74\%$ per month).

Table 3. Trend analysis of the parameters of the HIV-1 genotype population (F is compared with Fisher’s test F(1,31,95%) = 4.14). The first four columns give the occurrence of mutations (%), the fifth gives $p_g$ (%) from Equation (5), and the last three give the coefficients $\sqrt{\lambda_k}$ from Equation (4).

Protease part

| Parameter | 77I | 90M | 10I | 71V | $p_g$ | k = 1 | k = 2 | k = 3 |
|---|---|---|---|---|---|---|---|---|
| Mean | 37.78 | 32.69 | 27.97 | 23.64 | 48 | 5.78 | 1.67 | 0.83 |
| a (1/month) | 0.20 | −0.43 | −0.72 | 0.32 | 0.74 | 0.13 | 0.06 | 0.06 |
| R2 | 0.68 | 0.91 | 0.61 | 0.82 | 0.67 | 0.80 | 0.73 | 0.54 |
| F | 16.7 | 77.6 | 9.6 | 47.1 | 64.0 | 23.6 | 26.8 | 11.8 |

RT part

| Parameter | 41L | 215Y | 103N | 67N | $p_g$ | k = 1 | k = 2 | k = 3 |
|---|---|---|---|---|---|---|---|---|
| Mean | 32.86 | 31.37 | 30.66 | 27.21 | 47 | 6.65 | 2.20 | 2.08 |
| a (1/month) | −0.51 | −0.50 | −0.32 | −0.39 | 0.49 | 0.11 | 0.17 | 0.07 |
| R2 | 0.88 | 0.93 | 0.88 | 0.84 | 0.75 | 0.68 | 0.78 | 0.71 |
| F | 57.4 | 98.7 | 59.8 | 41.8 | 94.3 | 21.4 | 36.1 | 25.3 |
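The two-component mixture of Equation (5) is straightforward to evaluate numerically. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, and the values of `m1`, `q1`, `m2` and `q2` are invented to produce a bi-modal shape; only `p_g = 0.48` echoes the mean protease value in Table 3. It is not a fit to the database.

```python
from math import comb


def binom_pmf(n, m, q):
    """Probability of n successes in m independent trials (0 outside 0..m)."""
    return comb(m, n) * q**n * (1 - q) ** (m - n) if 0 <= n <= m else 0.0


def rho_mixture(n, p_g, m1, q1, m2, q2):
    """Equation (5): weighted sum of two binomial components."""
    return p_g * binom_pmf(n, m1, q1) + (1 - p_g) * binom_pmf(n, m2, q2)


# Illustrative parameters only (not fitted values from the paper).
params = dict(p_g=0.48, m1=4, q1=0.25, m2=12, q2=0.4)
rho = [rho_mixture(n, **params) for n in range(13)]

for n, value in enumerate(rho):
    print(n, round(value, 4))
```

Because each component is a proper distribution and the weights sum to one, the values of $\rho_n$ sum to one over $n = 0, \dots, \max(m_1, m_2)$, which is a convenient sanity check when fitting the five parameters to seasonal samples.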
Taking into account the trends for separate mutations, we observed a “degradation” of genotypes: the number of patients with simple genotypes (a small number of mutations) is growing, while the number of patients with a large number of mutations is decreasing.

Trends of transient probabilities $D$. The analysis of the trends of the parameters of distribution (1) shows that the contribution of the first group of mutations, with a low number $n$, is increasing. This may be a consequence of temporal variations in the interdependencies between different mutations, governed by the development of drug therapy. To examine this hypothesis, let us consider the trends of the matrix $D$, Equation (2). Taking into account the expansions (3) and (4), we may reduce the complicated problem of joint trend analysis for the components $D_{ij}$ to trend analysis for independent time series, namely the components of expansion (4). From Table 3 it can be seen that all the components have clear positive trends. Taking into account the shape of the first basis functions (see Figure 7), it is clear that in general the joint probabilities $D_{ij}$ of the mutations also increase; moreover, the rate of increase corresponds to the total occurrence of the mutation in the ensemble.

The discrimination of the groups of “old” and “new” patients in terms of the bi-modal distribution (5) allows us to forecast the growth of the total number of HIV-infected people in time:

$$N(t) = N_{\text{new patients}}(t) + N_{\text{old patients}}(\varepsilon t), \quad \varepsilon \ll 1. \qquad (6)$$

Here $\varepsilon$ is the slow-time parameter, which reflects the rapid growth of the group of new patients in comparison with the old patients. The fraction of “new” patients in the sample is $p_g$ (that of “old” patients is $1 - p_g$) from (5).
Hence, the growth curve is:

$$N(t) = N_{\text{old patients}}(0) \left[ 1 + \frac{p_g(t)}{1 - p_g(t)} \right], \qquad (7)$$

where $p_g(t) = p_0 + a_g t$ is the linear trend with the parameters from Table 3, and $N_{\text{old patients}}(0)$ is the initial number of “old” (treated) patients at the beginning of the forecast.

Figure 8 shows the “crucial” forecast of the HIV-1 population growth. It is based on the fact that altogether 42 million people worldwide had been infected with HIV at the beginning of the XXI century, and 12 million have died over the last 20 years. Moreover, the forecast does not take into account the emergence of new drugs or the various prophylactic and social preventive activities for restricting HIV-1 infection. This result is therefore qualitative only; quantitative conclusions require more sophisticated research.

Fig. 8. Qualitative forecast of HIV-1 population growth. 1: mean value (7); 2: 90% confidence interval.

3. RESULTS

3.1. Data presentation: Roaming PDA access

3.1.1. User Scenario

RetroGramTM (www.retrogram.com) is an expert-based HIV-genotype interpretation software program, which weighs the effect of specified genotype changes on clinical drug activity. It accepts a list of substitutions to the protease and reverse transcriptase genes with respect to the NL4-3 reference strain. The interpretation is accomplished by running a “simulation”, which applies some hundred rules relating substitutions on the HIV genome to knowledge of their effects on drug response. This knowledge comes from hundreds of references in the clinical literature. The rules are checked against the reported substitutions, and each drug is evaluated for its suitability. At a later stage we added Web access, in which a Web interface is used to submit the input and retrieve the output.
We want to make the simulations wireless-accessible. Developing a wireless Internet version from scratch would not be cost-efficient and would cause maintainability problems. For example, the rules mentioned above are often changed, and these changes would have to be reflected in both versions. Furthermore, for privacy and security reasons the developer is not granted access to the source code of the “simulation”. Thus, it is much more convenient to have wireless access to the Web-based interface. In this case the “simulation” takes place on a single server, and privacy and security are guaranteed. A typical user scenario is described below, and the associated graphical representation of the Retrogram Web access is given in Figure 9.

Fig. 9. Web-based Retrogram use case sequence.

After the user has successfully logged in, the Patient Detail page is displayed (Figure 10). The form on this page is used to enter the personal data of the patient. Two fields are required in the form, Patient ID and Date of Sample.

Based on the information obtained from the laboratory, the user enters the laboratory test results (i.e. Protease or RT substitutions) for the patient in the Laboratory Information page. Next, a script invoked on the server does the following:

Script 1: Server validation script

Validate inputs: Check that the Protease or RT substitutions conform to certain rules. A single substitution should be represented by an integer (for the position in the gene) and a letter (for the amino acid). The position in the gene is in the range from 1 to 99 for a Protease position and from 1 to 599 for an RT position. The amino acid code is one of the following codes: A C D E F G H I K L M N P Q R S T U V W Y.

Submit the inputs to the “simulation” program and take back the drug ranking result.
Show the drug ranking result in the ‘HIV Therapy decision support’ screen: After applying certain rules to the laboratory test result, return the final drug ranking, i.e. each drug’s level of suitability, indicated as follows:
A (green): This drug can be used
B (yellow): Consider use if no class A drug available
C (amber): Consider use if no class A or B drug available
D (red): Consider use if no class A, B or C drug available
U (grey): Unranked, insufficient data available

In the ‘HIV Therapy decision support’ screen, clicking on any drug name in the ranking lists displays a list of available references from the scientific literature supporting the particular ranking for that drug. Clicking on the ‘Interpret substitution’ button in the same screen shows the classification of the patient’s substitutions into relevant, natural or additional.

Fig. 10. Web Retrogram: user enters patient substitutions (left), drug ranking results (right).

3.1.2. Roaming, wireless access

In the design phase of the wireless versions of the application, the constraints of mobile devices have to be considered. At the same time we have tried to maintain the same level of usability and readability as in the original Web version. This is accomplished by keeping the same structure as in the Web version, with some modifications. For example, the Patient Detail form has many fields, and putting them all in one screen would harm the usability of the program (the mobile device is assumed to have a resolution comparable to that of a normal PDA, i.e., around 160 × 160 pixels). Thus we use three screens for the Patient Detail data. The Patient Detail Web page has 2 required fields. We put them in the first screen after the ‘login’ screen.
In this way, if the user is not interested in entering optional data, she can go directly to the Laboratory Information.

Proxy method Implementation. A Proxy method is implemented for accessing the Web-based software from mobile devices. The Proxy server sits between the remote server (the Retrogram server) and the mobile device. A mininavigator script developed in the Proxy is responsible for the following:
• Take the patient data from the mobile user (i.e. patient detail, laboratory information).
• Create an HTTP communication with the remote server.
• Submit data to the remote server. These data are basically the input for the Retrogram ‘simulation’.
• Take the result from the remote server (HTML code generated by the retrogram.asp script).
• Parse the HTML code and retrieve only the relevant information (i.e. drug ranking, error messages, drug references etc.). It uses this relevant information to build wireless pages (i.e. a WML page in the case of WAP, or a Web-clipping page).
• Send the wireless pages to the mobile device.

The Proxy is implemented using PHP: Hypertext Preprocessor as a server-side scripting language [9–11] running on the Apache Web server [12].

Two versions are developed using the Proxy method: a WAP version and a Web-clipping version. If a user wants to enter the ‘patient details’ fields, he has to move from one screen to the other and come back again. The fields already filled in on the previous screens should not be lost, so maintaining the client’s state is necessary. In the WAP case we simply use cookies, but in Web clipping cookies are supported only in PALM OS version 4.0 or higher. For this reason the “hidden field” method is used, which is another method for maintaining state on the Internet. The following figures are the user interfaces that have been captured.
They track the user’s path through a run of the application, as shown in Figures 11(a) and 11(b), where the user enters the patient’s details and accesses the ranking results.

Fig. 11. (a) User corrects the input and submits again (left), drug ranking results (right). (b) User clicks on the drug ‘indinavir’ (left), references supporting this ranking (right).

J2ME Implementation. The same user interface is applied in the J2ME implementation. There are two main differences between the J2ME implementation and the Proxy one:
1. J2ME enables the device to communicate directly with the Retrogram server without an intermediate Proxy.
2. In J2ME the client’s interface is contained within the device. In the Proxy method, every time the interface has to be changed, the Proxy is responsible for generating a new page.

The following illustrates the steps needed to fetch an HTML page generated by a script on the remote host. Specifically, this example illustrates how the user can log in to a script on the Retrogram server and extract the cookie from the response header:
1. Open an HTTP connection
2. Open an input stream
3. Make an HTTP POST request
4. Extract the cookie from the header response
5. Close the connection

Fig. 12. J2ME method; user enters patient’s substitutions (left), drug ranking results (right).

In the J2ME implementation of Retrogram the entire client interface resides on the device. A connection to the server is established only in the following cases. At user login, a connection with the server is necessary in order to validate the username and/or password: the user submits the username and password, and the application judges their correctness by scanning the HTML response from the Retrogram server.
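The login steps listed above can be sketched with a standard HTTP client library. The example below uses Python rather than J2ME and is a hedged illustration only: the function names, the host and path arguments, and the form field names `username` and `password` are hypothetical placeholders, not the actual Retrogram interface. Note that with this library the response stream is read after the POST request has been sent, so the order of steps 2 and 3 differs from the list.

```python
import http.client
from http.cookies import SimpleCookie
from urllib.parse import urlencode


def extract_cookie(set_cookie_header):
    """Return {name: value} for the cookies in one Set-Cookie header value."""
    jar = SimpleCookie()
    jar.load(set_cookie_header)
    return {name: morsel.value for name, morsel in jar.items()}


def login(host, path, username, password):
    """Post credentials and return (cookies, html). All names are placeholders."""
    conn = http.client.HTTPConnection(host, timeout=10)  # open the connection
    try:
        body = urlencode({"username": username, "password": password})
        headers = {"Content-Type": "application/x-www-form-urlencoded"}
        conn.request("POST", path, body, headers)  # make the HTTP POST request
        response = conn.getresponse()  # open the response (input) stream
        html = response.read().decode("utf-8", "replace")
        # Extract the session cookie from the response header.
        cookies = extract_cookie(response.getheader("Set-Cookie", ""))
        return cookies, html
    finally:
        conn.close()  # close the connection


# Parsing alone can be tried without any network access:
print(extract_cookie("SESSIONID=abc123; path=/"))
```

The returned HTML would then be scanned for a success or error marker, and the cookie would be sent back with later requests to keep the session.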
The user then submits the patient’s laboratory information data. The application connects to the server in order to submit the data, take the result (in HTML format) and extract the drug ranking. Next, the user looks for the references that support a certain drug ranking. The database with all the references resides on the Retrogram server, so a connection is necessary. The application submits the cookie and the name of the drug to a Retrogram script. The drug references are returned by the server in HTML format. The application strips the HTML tags and shows the references as plain text. Finally, the user looks for the classification of the patient’s substitutions. This classification is part of the Retrogram ‘simulation’, and thus a connection to the server is again necessary. In Figure 12 we illustrate the process of obtaining the drug ranking using the J2ME method.

Currently the J2ME version is in use by different users to study its usability and extendibility. More details on the implementation can be found in reference [13].

3.2. Virtual laboratory infrastructure

3.2.1. A virtual organization for retrogram-centered workflow

Fig. 13. A Retrogram-centered workflow.

Grid technology is a major cornerstone of today’s computational science and engineering, and its basic unit of organization is called the Virtual Organization (VO). A VO is a set of Grid entities, such as individuals, applications, services or resources, which are related to each other by some level of trust. In the most basic example, service providers would only allow access to the members of the same VO.
We are currently building a distributed Grid-based decision support infrastructure to support the Retrogram-centered workflow shown in Figure 13.

This VO will offer a Grid virtual laboratory that will assist users in the interpretation of the genotype of a patient by using rules developed by experts on the basis of the literature, taking into account the relationship between the genotype and the phenotype. The workflow is based on highly distributed available data from clinical studies and on the relationship between the presence of a genotype and the clinical outcome. In order to cover the vast temporal and spatial scales required to infer information from the molecular (genomic) level up to patient medical data, multi-scale methods are applied, in which simulation, statistical analysis and data mining are combined and used to enhance the rule-based decision. In this scenario, information sources are widely distributed, and the data processing requirements are highly variable, both in the type of resources required and in the processing demands. Experiment design, integration of information from various sources, and transparent scheduling and execution of experiments will be supported by this support system based on distributed Grid middleware. The DAS2 testbed (Netherlands) will initially provide the additional computational power for our compute-intensive jobs. We will reuse Grid middleware from successful European projects such as CrossGrid (www.crossGrid.org) and VL-e (www.vl-e.nl) to provide the basic Grid services of data management, resource management, and information services on top of Globus. For transparent use of this infrastructure we will build a presentation layer that will provide a user-friendly interface to both medical doctors and scientists.

4. DISCUSSION

4.1. Conclusions and future work

In this paper we discussed an integrative approach to bio-medicine at large and to infectious diseases in particular. We showed how Grid technology can play a crucial role in the understanding of processes ‘from molecule to man’. In order to cover the vast temporal and spatial scales required to infer information from the molecular (genomic) level up to patient medical data, we need to apply multi-scale methods in which simulation, statistical analysis and data mining are combined in an efficient way. Moreover, the required integrative approach asks for distributed data collection (e.g. HIV mutation databases, patient data, literature reports etc.) and a virtual organization (physicians, hospital administration, computational resources etc.). Access to and use of large-scale computation (both high-performance and distributed) is also essential, since many of the computations involved require near real-time response and are too complex to run on a personal computer or PDA. Finally, data presentation is crucial in order to lower the barrier to actual usage by physicians; here Grid technology (the server-client approach) can play an important role.

Although many of the aspects discussed in this paper have proven to work in concept, the complete integration of the systems and the evaluation of day-to-day use are still under development [17]. In addition, each of the underlying methods (rule-based, statistical and CA-based models) remains a topic of further study. We will set up a user base with the described system running on various European Grid testbeds.
The first testbed we will use is the so-called DAS2, and eventually the CrossGrid testbed, which supports specific features for interactive computation, an essential ingredient for a medical decision support system.

The authors gratefully acknowledge Fan Chen and Ferdinand Alimadhi for assistance in implementing the CA models and the roaming PDA access. The Dutch Virtual Laboratory on e-science project supported parts of the research presented here: http://www.VL-e.nl.

GLOSSARY

Grid: Distributed architecture for solving computational problems by making use of the resources of the members of a virtual organization, treating them as a virtual cluster.
CA: Cellular Automata, a discrete model studied in computational theory and mathematics, which consists of a regular grid of cells, each in one of a finite number of states.
Decision Support System: Computer-based system that helps in the process of decision-making.
Web Interface: User interface for information available via the Web.
Proxy: Computer service which allows clients to make indirect network connections to other services.
HTTP: Hyper Text Transfer Protocol, a request/response protocol for transferring information on the Web.
HTML: Hyper Text Markup Language, a markup language designed for the creation of Web pages.
WML: Wireless Markup Language, a markup language used in mobile phones.
J2ME: Java 2 Platform Micro Edition, a collection of Java interfaces for embedded consumer appliances such as cellular phones.
DAS2: Distributed ASCI Super Computer 2, a wide-area distributed computer connecting 5 Dutch universities.

REFERENCES

1. Zhao Z, Belleman RG, van Albada GD, Sloot PMA. AG-IVE: An Agent-Based Solution to Constructing Interactive Simulation Systems, in series Lecture Notes in Computer Science, April 2002; 2329: 693–703.
2. Durant J, Clevenbergh P, Halfon P, Delguidice P, Porsin S, Simonet P, Montagne N, Dohin E, Schapiro JM, Boucher C, Dellamonica P. Improving HIV therapy with drug resistance genotyping: The Viradapt Study. Lancet 1999; 353: 2195–2199.
3. Sevin AD, DeGruttola, Nijhuis M, Schapiro JM, Foulkes AS, Para MF, Boucher CAB. Methods for Investigation of the Relationship between Drug-Susceptibility Phenotype and Human Immunodeficiency Virus Type 1 Genotype with Applications to AIDS Clinical Trials Group 333. The Journal of Infectious Diseases 2000; 182: 59–67.
4. The Genotype database is obtained from a large service testing laboratory in the US. It contains the resistance profiles of the Protease and Reverse Transcriptase genes of the HIV-1 virus obtained from plasma samples of HIV-1 infected patients. No clinical background information on medication or drug history is available.
5. Waterman MS, ed. Mathematical Methods for DNA Sequences. CRC Press Inc., Boca Raton, Florida, 1999.
6. Kiryukhin I, Saskov K, Boukhanovsky AV, Keulen W, Boucher CA, Sloot PMA. Stochastic modeling of temporal variability of HIV-1 population. In: Sloot PMA, Abrahamson D, Bogdanov AV, Dongarra JJ, Zomaya AY, Gorbachev YE, eds. Computational Science – ICCS 2003, Melbourne, Australia and St. Petersburg, Russia, Proceedings Part I, in series Lecture Notes in Computer Science, vol. 2657, pp. 125–135. Springer Verlag, June 2003. ISBN 3-540-40194-6.
7. Zorzenon dos Santos RM, Coutinho S. Dynamics of HIV infection: A cellular automata approach. Phys Rev Lett 2001; 87(16): 168102–1–4.
8. Sloot PMA, Chen F, Boucher CA. Cellular automata model of drug therapy for HIV infection. In: Bandini S, Chopard B, Tomassini M, eds. 5th International Conference on Cellular Automata for Research and Industry, ACRI 2002, Geneva, Switzerland, October 9–11, 2002. Proceedings, in series Lecture Notes in Computer Science, vol. 2493, pp. 282–293. October 2002.
9. PHP: Hypertext Preprocessor: http://www.php.net.
10. The resource for PHP developers: http://www.phpbuilder.com
11. Zend Technologies – PHP tools for the development, protection and scalability of PHP applications – PHP for Linux, Unix and Apache, Encoder, Accelerator Studio, Debugger: http://www.zend.com.
12. The Apache Software Foundation: http://www.apache.org.
13. Alimadhi F. Mobile Internet: Wireless access to Web-based interfaces of legacy simulations, MSc thesis, University of Amsterdam, The Netherlands, September 2002: http://www.science.uva.nl/research/pscs/papers/master.html.
14. Cross-Grid: Grid technology of Interactive Distributed Computation: http://www.eu-crossGrid.org/.
15. Little SJ, Holte S, Routy JP, Daar ES, Markowitz M, Collier AC, Koup RA, Mellors JW, Connick E, Conway B, Kilby M, Wang L, Whitcomb JM, Hellmann NS, Richman DD. Antiretroviral-drug resistance among patients recently infected with HIV. N Engl J Med 2002; 347(6): 385–394.
16. Karlin S. A First Course in Stochastic Processes. Academic Press, NY-London, 1968.
17. Sloot PMA, Boucher CA, Kiryukhin I, Saskov K, Boukhanovsky AV. A grid-based problem-solving environment for biomedicine. In: Nørager S, ed. Proceedings of the First European HealthGrid Conference, January 16th–17th, 2003, pp. 300–323. Commission of the European Communities, Information Society Directorate-General, Brussels, Belgium, 2003.