key: cord-0035482-84dpnm5o
authors: Ito, Kimihito; Zeugmann, Thomas; Zhu, Yu
title: Recent Experiences in Parameter-Free Data Mining
date: 2010-06-30
journal: Computer and Information Sciences
DOI: 10.1007/978-90-481-9794-1_68
sha: 5cee8ca3af3eadb7436bc1c42e52533781b83b15
doc_id: 35482
cord_uid: 84dpnm5o

Recent results supporting the usefulness of the normalized compression distance for classifying genome sequences of virus data are reported. Specifically, we study the problem of clustering the hemagglutinin (HA) sequences of influenza virus data for the HA gene in dependence on the host and subtype of the virus, and the classification of dengue virus genome data with respect to their four serotypes. A comparison is made with respect to hierarchical clustering and spectral clustering via the kLines algorithm by Fischer and Poland (2004), and with respect to the standard compressors bzlib, ppmd, and zlib. Our results are very promising and show that one can obtain an (almost) perfect clustering for all the problems studied.

In many data mining applications the similarity between objects is of fundamental importance. Quite frequently, domain knowledge is used to define a suitable domain-specific distance measure. As a consequence, many of the resulting algorithms have many parameters that have to be tuned. This is not only difficult but also carries the risk of introducing bias. Furthermore, it may make the results obtained hard to verify.

Recently, as a radically different approach, the paradigm of parameter-free data mining has emerged (cf. Keogh et al. [12]). The main idea of parameter-free data mining is the design of algorithms that have no parameters and that are universally applicable in all areas. At first glance this may seem impossible. How can an algorithm perform well if it is not based on extracting the important features of the data and if we are not allowed to adjust any parameters? As pointed out by Vitányi et al. [17], parameter-free data mining aims at scenarios where we are not interested in a certain similarity measure but in the similarity between the objects themselves.

The most promising approach to this paradigm uses Kolmogorov complexity theory [14] as its basis. The key ingredient is the so-called normalized information distance (NID), which was developed by various researchers during the past decade in a series of steps (cf., e.g., [2, 13, 8]). The intuitive idea behind it is as follows. If two objects are similar, then there should be a simple description of how to transform each one of them into the other one. Conversely, if all descriptions for transforming each one of them into the other one are complex, then the objects should be dissimilar. The normalized information distance between two strings x and y is then defined as

$$\mathrm{NID}(x, y) = \frac{\max\{K(x|y),\, K(y|x)\}}{\max\{K(x),\, K(y)\}},$$

where K(x|y) is the length of the shortest program that outputs x on input y, and K(x) is the length of the shortest program that outputs x on the empty input.

For the technical details of the NID, we refer the reader to Vitányi et al. [17].

To apply this idea to data mining tasks, standard compression algorithms have to be invoked to approximate the Kolmogorov complexity K. This yields the normalized compression distance (NCD) as an approximation of the NID (cf. Definition 1). The NCD has been successfully applied to a variety of data mining problems (cf., e.g., [8, 12, 5, 6, 1]).

In this paper, we report the usefulness of the NCD for three classification problems for virus data. One task is to cluster the hemagglutinin (HA) sequences of influenza virus data for the HA gene in dependence on the subtype, where all data originate from the same host. The second task is the same classification but in dependence on the subtype and host of the virus. The third problem deals with the classification of dengue virus genome data with respect to their four serotypes.

The definition of the NID depends on the function K, which is uncomputable. Thus, the NID is uncomputable, too. Using a real-world compressor, one can approximate the NID by the NCD (cf. Definition 1). Again, we omit details and refer the reader to [17].

Definition 1. The normalized compression distance between two strings x and y is defined as

$$\mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x),\, C(y)\}}{\max\{C(x),\, C(y)\}},$$

where C is any given data compressor.
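As a minimal sketch of Definition 1, assuming the standard form of the NCD, (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}, the following Python code uses the standard-library zlib module in place of the compressor C. The helper names `c` and `ncd` and the random toy strings are hypothetical and serve only as an illustration; they are not the data or the implementation used in the experiments.

```python
import random
import zlib


def c(data: bytes) -> int:
    """Compressed length in bytes; zlib stands in for the compressor C."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """NCD(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical toy data: random "nucleotide" strings, not real virus sequences.
rng = random.Random(0)
x = bytes(rng.choice(b"ACGT") for _ in range(1700))

# y is a near copy of x: 20 randomly chosen positions are overwritten with 'A'.
y_buf = bytearray(x)
for pos in rng.sample(range(len(y_buf)), 20):
    y_buf[pos] = ord("A")
y = bytes(y_buf)

# z is generated independently of x.
z = bytes(rng.choice(b"ACGT") for _ in range(1700))

print(f"NCD(x, y) = {ncd(x, y):.3f}  (near copy)")
print(f"NCD(x, z) = {ncd(x, z):.3f}  (unrelated)")
```

The near copy should receive a clearly smaller distance than the unrelated string, which is the behavior the clustering experiments rely on.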

Common data compressors are bzlib, ppmd, zlib, etc. Note that the compressor C has to be computable and normal in order to make the NCD a useful approximation. This can be stated as follows.

A compressor C is said to be normal if it satisfies the following axioms for all strings x, y, z and the empty string λ.

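(1) C(xx) = C(x) and C(λ) = 0; (identity)

(2) C(xy) ≥ C(x); (monotonicity)

(3) C(xy) = C(yx); (symmetry)

(4) C(xy) + C(z) ≤ C(xz) + C(yz); (distributivity)

up to an additive O(log n) term, with n the maximal binary length of a string involved in the (in)equality concerned.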
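Whether a concrete compressor behaves normally on given data can be probed empirically. The sketch below is only an illustration under stated assumptions: it uses Python's zlib as the compressor, checks the four standard normality axioms (identity, monotonicity, symmetry, distributivity) on hypothetical random toy strings, and reports the slack of each axiom in bytes. The function name `normality_report` is made up for this example, and small non-zero slacks are expected because the axioms only hold up to an additive term.

```python
import random
import zlib


def c(data: bytes) -> int:
    """Compressed length in bytes, with zlib standing in for C."""
    return len(zlib.compress(data, 9))


def normality_report(x: bytes, y: bytes, z: bytes) -> dict:
    """Return the slack of each axiom in bytes for the strings x, y, z."""
    return {
        # identity: C(xx) = C(x), so this should be small
        "identity": c(x + x) - c(x),
        # identity, second part: C(empty string) = 0, so this should be small
        "empty": c(b""),
        # monotonicity: C(xy) >= C(x), so this should be >= 0
        "monotonicity": c(x + y) - c(x),
        # symmetry: C(xy) = C(yx), so this should be small
        "symmetry": abs(c(x + y) - c(y + x)),
        # distributivity: C(xy) + C(z) <= C(xz) + C(yz), so this should be >= 0
        "distributivity": c(x + z) + c(y + z) - c(x + y) - c(z),
    }


# Hypothetical toy data: three independent random strings over {A, C, G, T}.
rng = random.Random(1)
x, y, z = (bytes(rng.choice(b"ACGT") for _ in range(1700)) for _ in range(3))

report = normality_report(x, y, z)
for axiom, slack in report.items():
    print(f"{axiom:15s} slack = {slack} bytes")
```

Such a check does not prove normality; it only indicates whether the axioms are violated by more than a small additive term on the data at hand.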

Good real-world compressors like bzlib, ppmd, and zlib turned out to be normal for our data, and we used these compressors for our experiments. We used the ncd function from the CompLearn Toolkit (cf. [4]) to compute the distance matrix of the pairwise NCD values.
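The experiments relied on the ncd function of the CompLearn Toolkit. As a rough stand-in, and not a reproduction of that tool, the following sketch computes a matrix of pairwise NCD values in Python for an interchangeable compressor. It assumes the standard-library modules zlib and bz2 as compressors (ppmd is not part of the Python standard library), and the function name `ncd_matrix` and the toy sequences are hypothetical.

```python
import bz2
import random
import zlib
from typing import Callable, List, Sequence


def ncd_matrix(seqs: Sequence[bytes],
               compress: Callable[[bytes], bytes]) -> List[List[float]]:
    """Pairwise NCD values for all sequences under one compressor."""
    size = [len(compress(s)) for s in seqs]  # C(x), computed once per sequence
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            cxy = len(compress(seqs[i] + seqs[j]))  # C(xy) for the ordered pair
            matrix[i][j] = (cxy - min(size[i], size[j])) / max(size[i], size[j])
    return matrix


# Hypothetical toy data: a random string, a near copy of it, and an unrelated one.
rng = random.Random(2)
base = bytes(rng.choice(b"ACGT") for _ in range(1700))
near = bytearray(base)
for pos in rng.sample(range(len(near)), 20):
    near[pos] = ord("A")
other = bytes(rng.choice(b"ACGT") for _ in range(1700))
seqs = [base, bytes(near), other]

d_zlib = ncd_matrix(seqs, zlib.compress)
d_bz2 = ncd_matrix(seqs, bz2.compress)

for name, d in (("zlib", d_zlib), ("bz2", d_bz2)):
    print(name)
    for row in d:
        print("  " + "  ".join(f"{v:.3f}" for v in row))
```

Note that the diagonal entries are generally not exactly zero and that the matrix need not be exactly symmetric, since real compressors satisfy the normality axioms only approximately.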

To cluster the data we used hierarchical clustering and spectral clustering via kLines (cf. Fischer and Poland [9]). For a detailed description of the algorithms applied, we refer the reader to our paper [11].
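To illustrate the hierarchical clustering step, the following sketch implements agglomerative clustering with average linkage in plain Python (the label of Fig. 3 indicates that hclust with average linkage was used). It is a simplified illustration, not the implementation used for the experiments: it assumes a symmetric distance matrix, stops once a requested number of clusters remains instead of building the full dendrogram, and the function name `average_linkage` and the toy matrix are hypothetical.

```python
from typing import List


def average_linkage(dist: List[List[float]], k: int) -> List[List[int]]:
    """Merge the two closest clusters until only k clusters remain.

    The distance between two clusters is the mean of all pairwise
    distances between their members (average linkage).
    """
    clusters = [[i] for i in range(len(dist))]

    def mean_distance(a: List[int], b: List[int]) -> float:
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        best = None  # (distance, index of first cluster, index of second)
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = mean_distance(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


# Hypothetical toy distance matrix with two obvious groups: {0, 1} and {2, 3}.
toy = [
    [0.00, 0.10, 0.90, 0.80],
    [0.10, 0.00, 0.85, 0.90],
    [0.90, 0.85, 0.00, 0.20],
    [0.80, 0.90, 0.20, 0.00],
]
result = average_linkage(toy, 2)
print(result)
```

For data sets of the size considered here (around one hundred sequences), this simple quadratic search per merge step is sufficient; library implementations are preferable for larger inputs.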

The first paper using the NCD to analyze virus data was Cilibrasi and Vitányi [7]. In this paper the authors used the SARS TOR2 draft genome assembly 120403 from Canada's Michael Smith Genome Sciences Centre and compared it to other viruses by using the NCD and the bzlib compressor. After applying their quartet tree heuristic for hierarchical clustering, they obtained a ternary tree showing relations very similar to those shown in the definitive tree based on medical-macrobiological genomics analysis which was obtained later (see [7] for details).

Our first group of experiments dealt with influenza viruses. We have been interested in learning whether or not specific gene data for the hemagglutinin of influenza viruses are correctly classifiable by using the concept of the NCD. For any relevant background concerning the biological aspects of the influenza viruses we refer the reader to Palese and Shaw [16] and Wright et al. [18].

The family of Orthomyxoviridae is defined by viruses that have a negative-sense, single-stranded, and segmented RNA genome. There are five different genera in the family of Orthomyxoviridae: the influenza viruses A, B and C; Thogotovirus; and Isavirus. Influenza A viruses have a complex structure and possess a lipid membrane derived from the host cell.

We were only interested in their HA gene, since HA is the major target of antibodies that neutralize viral infectivity and is responsible for binding the virus to the cell it infects. In [11] we considered all 16 subtypes of the HA and collected a data set from the National Center for Biotechnology Information (NCBI) [15] containing a total of 106 sequences (all taken from viruses hosted by their natural host), which could be (almost) successfully clustered into the relevant 16 subtypes of the HA. So, in this data set the HA subtype is the property that determines the similarity between the different sequences.

Next, we briefly describe experiments dealing with influenza viruses hosted by duck and human. Note that H1N1 is a subtype of influenza A and the most common cause of influenza in humans. In June 2009, the World Health Organization declared that a new strain of swine-origin H1N1 was responsible for the 2009 flu pandemic. Usually birds can pass avian influenza viruses to swine, where the viruses have to mutate so that they can circulate in the swine population. Then a new strain emerges which can be passed to humans or to other hosts. Of course, in order to become pandemic, the viruses may mutate again.

If one considers sequences for the HA gene originating from different hosts, it is only natural to ask which property is more "similar," the host or the subtype. To answer this question, we chose 32 sequences having different HA subtypes that originated from both the duck and the human host (again from NCBI). For a complete description of the data we refer the reader to http://www-alg.ist.hokudai.ac.jp/nhuman vs duck.html . For ease of presentation, we use abbreviations for the data entries below instead of giving the full description.

The results obtained by using the zlib and bzlib compressors and then applying hierarchical clustering are shown in Figures 1 and 2. As these clustering results show, for this data set the similarity between subtypes is stronger than the similarity between the hosts. We could confirm this outcome by using spectral clustering, where we used two clusters.

Dengue virus is an RNA virus that causes dengue fever, one of the most important emerging diseases, infecting 100 million people annually in more than one hundred countries around the world [3]. The genome of dengue virus is approximately 11 KB long and encodes 10 viral proteins. Dengue virus exhibits extensive genetic diversity, and there exist four antigenically distinct serologic types (1 through 4). It is known that severe cases, called dengue hemorrhagic fever / dengue shock syndrome, occur in patients who have secondary infections by a serotype different from that of previous infections [10]. Around 250,000 cases of dengue hemorrhagic fever / dengue shock syndrome are reported annually. Nucleotide sequences of all four dengue virus groups have been determined, and the rapid development of molecular biology over the last two decades is accelerating the accumulation of genomic data on the pathogen.

So, it is only natural to ask whether or not we can correctly cluster dengue virus genome data with respect to their four serotypes. To answer this question, we used 80 sequences (20 for each serotype) from NCBI [15]. For a complete description of the data used, please see http://www-alg.ist.hokudai.ac.jp/Dengue-Data.html . Then, we computed the distance matrix as described above by applying the standard compressors bzlib, ppmd, and zlib. It should be noted that the dengue virus genome data are much larger than the influenza virus data, i.e., 10.6 KB versus 1.7 KB. Our hierarchical clustering was perfect for the compressors ppmd and zlib (see Figure 3 for an example), but not for bzlib. Hierarchically clustering the distance matrix computed via the bzlib compressor gave 11 errors. On the other hand, spectral clustering delivered correct results in all three cases.

Moreover, we repeated these experiments with a non-balanced data set (see http://www-alg.ist.hokudai.ac.jp/imbalanced-dengue.html ), where we used 44 sequences of type 1 and 20 sequences of types 2, 3, and 4.

The results were almost the same, i.e., hierarchical clustering and spectral clustering were correct for the compressors ppmd and zlib.

Using the bzlib compressor and spectral clustering as described in [11] produced two errors. However, by using a different kernel width for transforming the distance matrix into a similarity matrix (i.e., 1.23), the clustering was again perfect. Moreover, in contrast to the experiments performed with the influenza virus data, the kernel width was much less influential.
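The role of the kernel width can be illustrated with a small sketch. It assumes that the distance matrix is turned into a similarity matrix with a Gaussian kernel, exp(-d^2 / (2 w^2)) for kernel width w; this particular form is an assumption made for illustration, since the text above only states that a kernel width is used, and the details of the actual transformation are given in [11]. The function name `to_similarity` and the toy matrix are hypothetical.

```python
import math
from typing import List


def to_similarity(dist: List[List[float]], width: float) -> List[List[float]]:
    """Map each distance d to the similarity exp(-d^2 / (2 * width^2)).

    Assumption: a Gaussian kernel; the exact kernel is not stated here.
    """
    scale = 2.0 * width * width
    return [[math.exp(-(d * d) / scale) for d in row] for row in dist]


# Hypothetical toy distance matrix (not taken from the experiments).
toy = [
    [0.0, 0.3, 0.9],
    [0.3, 0.0, 0.8],
    [0.9, 0.8, 0.0],
]

narrow = to_similarity(toy, 0.5)   # small width: similarities decay quickly
wide = to_similarity(toy, 1.23)    # width value mentioned in the text

for name, sim in (("width 0.5", narrow), ("width 1.23", wide)):
    print(name)
    for row in sim:
        print("  " + "  ".join(f"{v:.3f}" for v in row))
```

A smaller width makes distant objects look far less similar than close ones, whereas a larger width flattens the differences; this is one way in which the choice of the width can change the outcome of spectral clustering.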

To summarize, our results are very promising and show that one can obtain an (almost) perfect clustering for all the problems studied. Note that we have not reported the running time here, since it was in the range of several seconds. The clustering algorithms used in our experiments will scale up nicely to the amount of data for which we can efficiently compute the distance matrix.

Fig. 3. Classification of dengue genome sequences; compr. zlib. [Cluster dendrogram: hclust (*, "average") on dist(NDzlib); vertical axis: Height.]

Language trees and zipping

Information distance

Fields' Virology

The CompLearn Toolkit

Automatic meaning discovery using Google

Similarity of objects and the meaning of words

A new quartet tree heuristic for hierarchical clustering

Clustering by compression

New methods for spectral clustering

Pathogenesis of dengue: Challenges to molecular biology

Clustering the normalized compression distance for influenza virus data

Towards parameter-free data mining

The similarity metric

An Introduction to Kolmogorov Complexity and its Applications

Orthomyxoviridae: The viruses and their replication

Normalized information distance

Fields' Virology