key: cord-0035482-84dpnm5o
authors: Ito, Kimihito; Zeugmann, Thomas; Zhu, Yu
title: Recent Experiences in Parameter-Free Data Mining
date: 2010-06-30
journal: Computer and Information Sciences
DOI: 10.1007/978-90-481-9794-1_68
sha: 5cee8ca3af3eadb7436bc1c42e52533781b83b15
doc_id: 35482
cord_uid: 84dpnm5o

Recent results supporting the usefulness of the normalized compression distance for classifying genome sequences of virus data are reported. Specifically, we study the problem of clustering the hemagglutinin (HA) sequences of influenza virus data for the HA gene according to the host and subtype of the virus, and the classification of dengue virus genome data with respect to their four serotypes. Hierarchical clustering is compared with spectral clustering via the kLines algorithm by Fischer and Poland (2004), and the standard compressors bzlib, ppmd, and zlib are compared with one another. Our results are very promising and show that one can obtain an (almost) perfect clustering for all the problems studied.

In many data mining applications the similarity between objects is of fundamental importance. Quite frequently, domain knowledge is used to define a suitable domain-specific distance measure. As a consequence, many of the resulting algorithms tend to have many parameters which have to be tuned. This is not only difficult but also carries the risk of bias. Furthermore, it may make it hard to verify the results obtained.

Recently, as a radically different approach, the paradigm of parameter-free data mining has emerged (cf. Keogh et al. [12]). The main idea of parameter-free data mining is the design of algorithms that have no parameters and that are universally applicable in all areas. At first glance this may seem impossible. How can an algorithm perform well if it is not based on extracting the important features of the data and if we are not allowed to adjust its parameters? As pointed out by Vitányi et al.
[17], parameter-free data mining aims at scenarios where we are not interested in a certain similarity measure but in the similarity between the objects themselves. The most promising approach to this paradigm uses Kolmogorov complexity theory [14] as its basis. The key ingredient is the so-called normalized information distance (NID), which was developed by various researchers during the past decade in a series of steps (cf., e.g., [2, 13, 8]).

The intuitive idea behind it is as follows. If two objects are similar, then there should be a simple description of how to transform each one of them into the other one. Conversely, if all descriptions for transforming each one of them into the other one are complex, then the objects should be dissimilar. Then, the normalized information distance between two strings x and y is defined as

$$\mathrm{NID}(x,y) = \frac{\max\{K(x \mid y),\, K(y \mid x)\}}{\max\{K(x),\, K(y)\}},$$

where K(x|y) is the length of the shortest program that outputs x on input y, and K(x) is the length of the shortest program that outputs x on the empty input. For the technical details of the NID, we refer the reader to Vitányi et al. [17].

To apply this idea to data mining tasks, standard compression algorithms have to be invoked to approximate the Kolmogorov complexity K. This yields the normalized compression distance (NCD) as an approximation of the NID (cf. Definition 1). The NCD has been successfully applied to a variety of data mining problems (cf., e.g., [8, 12, 5, 6, 1]).

In this paper, we report on the usefulness of the NCD for three classification problems for virus data. The first task is to cluster the hemagglutinin (HA) sequences of influenza virus data for the HA gene according to the subtype, where all data originate from the same host. The second task is the same classification, but according to both the subtype and the host of the virus. The third problem deals with the classification of dengue virus genome data with respect to their four serotypes.

The definition of the NID depends on the function K, which is uncomputable.
Thus, the NID is uncomputable, too. Using a real-world compressor, one can approximate the NID by the NCD (cf. Definition 1). Again, we omit details and refer the reader to [17].

Definition 1. The normalized compression distance between two strings x and y is defined as

$$\mathrm{NCD}(x,y) = \frac{C(xy) - \min\{C(x),\, C(y)\}}{\max\{C(x),\, C(y)\}},$$

where C is any given data compressor and C(x) denotes the length of the compressed version of x.

Common data compressors are bzlib, ppmd, zlib, etc. Note that the compressor C has to be computable and normal in order to make the NCD a useful approximation. This can be stated as follows. A compressor C is said to be normal if it satisfies the following axioms for all strings x, y, z and the empty string λ.

(1) C(xx) = C(x) and C(λ) = 0; (identity)
(2) C(xy) ≥ C(x); (monotonicity)
(3) C(xy) = C(yx); (symmetry)
(4) C(xy) + C(z) ≤ C(xz) + C(yz); (distributivity)

up to an additive O(log n) term, with n the maximal binary length of a string involved in the (in)equality concerned.

Good real-world compressors like bzlib, ppmd, and zlib turned out to be normal for our data, and we used these compressors for our experiments. We used the ncd function from the CompLearn Toolkit (cf. [4]) to compute the distance matrix. To cluster the data we used hierarchical clustering and spectral clustering via kLines (cf. Fischer and Poland [9]). For a detailed description of the algorithms applied, we refer the reader to our paper [11].

The first paper using the NCD to analyze virus data was by Cilibrasi and Vitányi [7]. In that paper, the authors used the SARS TOR2 draft genome assembly 120403 from Canada's Michael Smith Genome Sciences Centre and compared it to other viruses by using the NCD and the bzlib compressor. After applying their quartet tree heuristic for hierarchical clustering, they obtained a ternary tree showing relations very similar to those shown in the definitive tree based on medical-macrobiological genomics analysis, which was obtained later (see [7] for details).

Our first group of experiments dealt with influenza viruses.
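As an illustration of Definition 1, the following minimal Python sketch computes the NCD with the standard-library modules zlib and bz2. These modules are stand-ins chosen for this example; they are not the CompLearn ncd function used in our experiments, and ppmd is left out because the Python standard library does not provide it. All function names (compressed_size, ncd, distance_matrix, random_dna, mutate) and the toy sequences are hypothetical and serve only to make the example self-contained.

```python
import bz2
import random
import zlib


def compressed_size(data: bytes, compressor: str = "zlib") -> int:
    """Return C(data): the length in bytes of the compressed input."""
    if compressor == "zlib":
        return len(zlib.compress(data, 9))
    if compressor == "bz2":
        return len(bz2.compress(data, 9))
    raise ValueError(f"unknown compressor: {compressor}")


def ncd(x: bytes, y: bytes, compressor: str = "zlib") -> float:
    """Normalized compression distance as in Definition 1."""
    cx = compressed_size(x, compressor)
    cy = compressed_size(y, compressor)
    cxy = compressed_size(x + y, compressor)  # C(xy): compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(seqs, compressor: str = "zlib"):
    """All pairwise NCD values; real compressors make it only roughly symmetric."""
    return [[ncd(a, b, compressor) for b in seqs] for a in seqs]


def random_dna(length: int, seed: int) -> bytes:
    """Toy data (assumption): a pseudo-random nucleotide string."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length)).encode("ascii")


def mutate(seq: bytes, step: int) -> bytes:
    """Toy data (assumption): replace every step-th base by a different base."""
    out = bytearray(seq)
    for pos in range(0, len(out), step):
        out[pos] = ord("C") if out[pos] == ord("A") else ord("A")
    return bytes(out)


if __name__ == "__main__":
    a = random_dna(1500, seed=1)
    b = mutate(a, step=100)        # close relative of a
    c = random_dna(1500, seed=2)   # unrelated sequence
    for name in ("zlib", "bz2"):
        print(name, round(ncd(a, b, name), 3), round(ncd(a, c, name), 3))
```

For a sequence and a lightly mutated copy of it, the NCD should come out clearly smaller than for two unrelated sequences. The same comparison can be used to check the identity axiom informally, since ncd(x, x) should be close to 0 for a compressor that behaves normally on the data.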
We have been interested in learning whether or not specific gene data for the hemagglutinin of influenza viruses are correctly classifiable by using the concept of the NCD. For any relevant background concerning the biological aspects of the influenza viruses we refer the reader to Palese and Shaw [16] and Wright et al. [18].

The family of Orthomyxoviridae is defined by viruses that have a negative-sense, single-stranded, and segmented RNA genome. There are five different genera in the family of Orthomyxoviridae: the influenza viruses A, B, and C; Thogotovirus; and Isavirus. Influenza A viruses have a complex structure and possess a lipid membrane derived from the host cell. We were only interested in their HA gene, since HA is the major target of antibodies that neutralize viral infectivity and is responsible for binding the virus to the cell it infects.

In [11] we considered all 16 subtypes of the HA and collected a data set from the National Center for Biotechnology Information (NCBI) [15] containing a total of 106 sequences (all taken from viruses hosted by their natural host), which could be (almost) successfully clustered into the relevant 16 subtypes of the HA. Thus, the HA subtype is what determines the similarity between the different sequences.

Next, we briefly describe experiments dealing with influenza viruses hosted by duck and human. Note that H1N1 is a subtype of influenza A and the most common cause of influenza in humans. In June 2009, the World Health Organization declared that a new strain of swine-origin H1N1 was responsible for the 2009 flu pandemic. Usually birds can pass avian influenza viruses to swine, where the viruses have to mutate so that they can circulate in the swine population. Then a new strain emerges which can be passed to humans or to other hosts. Of course, in order to become pandemic, the viruses may mutate again.
If one considers sequences for the HA gene originating from different hosts, it is only natural to ask which property dominates the similarity, the host or the subtype. To answer this question we chose 32 sequences having different HA subtypes that originated from both the duck and the human host (again from NCBI). For a complete description of the data we refer the reader to http://www-alg.ist.hokudai.ac.jp/nhuman vs duck.html . For ease of presentation, we use abbreviations for the data entries instead of giving the full description. The results obtained by using the zlib and bzlib compressors and then applying hierarchical clustering are shown in Figures 1 and 2. As these clustering results show, for this data set the similarity between subtypes is stronger than the similarity between the hosts. We could confirm this outcome by using spectral clustering, where we used two clusters.

Dengue virus is an RNA virus that causes dengue fever, one of the most important emerging diseases, infecting 100 million people annually in more than one hundred countries around the world [3]. The dengue virus genome is approximately 11 KB long and encodes 10 viral proteins. Dengue virus exhibits extensive genetic diversity, and there exist four antigenically distinct serologic types (1 through 4). It is known that severe cases, called dengue hemorrhagic fever / dengue shock syndrome, occur in patients who have secondary infections by a serotype different from that of previous infections [10]. Around 250,000 cases of dengue hemorrhagic fever / dengue shock syndrome are reported annually. Nucleotide sequences of all four dengue virus groups have been determined, and the rapid development of molecular biology over the last two decades is accelerating the accumulation of genomic data on the pathogen.
So, it is only natural to ask whether or not we can correctly cluster dengue virus genome data with respect to their four serotypes. To answer this question, we used 80 sequences (20 for each serotype) from NCBI [15]. For a complete description of the data used, please see http://www-alg.ist.hokudai.ac.jp/Dengue-Data.html . Then, we computed the distance matrix as described above by applying the standard compressors bzlib, ppmd, and zlib. It should be noted that the dengue virus genome data are much larger than the influenza virus data, i.e., 10.6 KB versus 1.7 KB.

Our hierarchical clustering was perfect for the compressors ppmd and zlib (see Figure 3 for an example), but not for bzlib. Hierarchically clustering the distance matrix computed via the bzlib compressor gave 11 errors. On the other hand, spectral clustering delivered correct results in all three cases.

Moreover, we repeated these experiments with a non-balanced data set, see http://www-alg.ist.hokudai.ac.jp/imbalanced-dengue.html , where we used 44 sequences of type 1 and 20 sequences of type 2, 3, and 4. The results were almost the same, i.e., hierarchical clustering and spectral clustering were correct for the compressors ppmd and zlib. Using the bzlib compressor and spectral clustering as described in [11] produced two errors. However, by using a different kernel width for transforming the distance matrix into a similarity matrix (i.e., 1.23), the clustering was again perfect. Moreover, in contrast to the experiments performed with the influenza virus data, the kernel width was much less influential.

To summarize, our results are very promising and show that one can obtain an (almost) perfect clustering for all the problems studied. Note that we have not reported the running time here, since it was in the range of several seconds.
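Two of the steps mentioned above, transforming a distance matrix into a similarity matrix by means of a kernel width and clustering hierarchically with average linkage, can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The Gaussian form exp(-d^2 / (2 w^2)) is assumed here, since the exact transformation is described in [11] and not reproduced in this paper. The function names gaussian_similarity and average_linkage and the toy distance matrix are hypothetical, and the kLines spectral clustering step itself is not implemented.

```python
import math


def gaussian_similarity(dist, width):
    """Map distances to similarities in (0, 1].

    Assumed form (not taken from the paper): s = exp(-d^2 / (2 * width^2)).
    """
    return [[math.exp(-(d * d) / (2.0 * width * width)) for d in row]
            for row in dist]


def average_linkage(dist, k):
    """Agglomerative clustering with average linkage, stopped at k clusters.

    dist is assumed to be a symmetric matrix given as a list of lists.
    Returns a list of clusters, each a sorted list of item indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def mean_dist(a, b):
        # Average of all pairwise distances between the two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = mean_dist(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]  # merge the closest pair
        del clusters[q]
    return [sorted(c) for c in clusters]


if __name__ == "__main__":
    # Toy distance matrix (assumption): items 0, 1 and items 2, 3 are close.
    toy = [
        [0.00, 0.10, 0.90, 0.80],
        [0.10, 0.00, 0.85, 0.90],
        [0.90, 0.85, 0.00, 0.20],
        [0.80, 0.90, 0.20, 0.00],
    ]
    print(average_linkage(toy, 2))
    print(gaussian_similarity(toy, width=1.23)[0])
```

A larger kernel width flattens the similarities towards 1, whereas a smaller one sharpens the contrast between near and far pairs, which is one way to see why the choice of width can change the outcome of a clustering based on the similarity matrix.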
The clustering algorithms used in our experiments will nicely scale up to the amount of data for which we can efficiently compute the distance matrix.

Fig. 3. Classification of dengue genome sequences; compressor: zlib. [Cluster dendrogram; hclust (*, "average") applied to dist(NDzlib); vertical axis: Height]

[1] Language trees and zipping
[2] Information distance
[3] Fields' Virology
[4] The CompLearn Toolkit
[5] Automatic meaning discovery using Google
[6] Similarity of objects and the meaning of words
[7] A new quartet tree heuristic for hierarchical clustering
[8] Clustering by compression
[9] New methods for spectral clustering
[10] Pathogenesis of dengue: Challenges to molecular biology
[11] Clustering the normalized compression distance for influenza virus data
[12] Towards parameter-free data mining
[13] The similarity metric
[14] An Introduction to Kolmogorov Complexity and its Applications
[16] Orthomyxoviridae: The viruses and their replication
[17] Normalized information distance
[18] Fields' Virology