Recognizing Groups Among Dialects

Jelena Prokić and John Nerbonne

Abstract

In this paper we apply various clustering algorithms to dialect pronunciation data. At the same time we propose several evaluation techniques that should be used in order to deal with the instability of clustering techniques. The results show that three of the hierarchical clustering algorithms are not suitable for the data we are working with. The remaining algorithms successfully detect a two-way split of the data into Eastern and Western dialects. At the aggregate level used in this research, no further division of sites can be asserted with high confidence.

1 Introduction

Dialectometry is a multidisciplinary field that uses various quantitative methods in the analysis of dialect data. Very often those techniques include classification algorithms, such as the hierarchical clustering algorithms used to detect groups within a dialect area. Although known for their instability (Jain and Dubes, 1988), clustering algorithms are often applied without evaluation (Goebl, 2007; Nerbonne and Siedle, 2005) or with only partial evaluation (Moisl and Jones, 2005). Very small differences in the input data can produce substantially different groupings of dialects (Nerbonne et al., 2008). Without proper evaluation, it is very hard to determine whether the results of the applied clustering technique are an artifact of the algorithm or the detection of real groups in the data.

The aim of this paper is to evaluate algorithms used to detect groups among language dialect varieties measured at the aggregate level. The data used in this research is dialect pronunciation data consisting of various pronunciations of 156 words collected all over Bulgaria. The distances between words are calculated using the Levenshtein algorithm, which also results in the calculation of the distances between each two sites in the data set. We apply seven hierarchical clustering algorithms, as well as the k-means and neighbor-joining algorithms, to the calculated distances and examine these using various evaluation methods. We evaluate using several external and internal methods, since there is no direct way to evaluate the performance of the clustering algorithms.

The structure of this paper is as follows. Different classification algorithms are presented in the next section. In Section 3 we discuss our data set and how the data was processed. Various evaluation techniques are described in Section 4. The results are given in Section 5. In Section 6 we present discussion and conclusions.

2 Classification algorithms

In this section we briefly introduce seven hierarchical clustering algorithms, the k-means algorithm, and the neighbor-joining algorithm, which was originally developed for reconstructing phylogenetic trees.

2.1 Hierarchical clustering

Cluster analysis is the process of partitioning a set of objects into groups or clusters (Manning and Schütze, 1999). The goal of clustering is to find structure in the data by finding objects that are similar enough to be put in the same group and by identifying distinctions between the groups. Hierarchical clustering algorithms produce a set of nested partitions of the data by finding successive clusters using previously established clusters. This kind of hierarchy is represented with a dendrogram, a tree in which more similar elements are grouped together. In this study seven hierarchical clustering algorithms are investigated with regard to their performance on dialect pronunciation data.
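As a concrete illustration of how such a nested hierarchy is built from a distance matrix and drawn as a dendrogram, here is a minimal sketch. It assumes SciPy and matplotlib and a small hypothetical distance matrix; it is not part of the original study's tooling.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# Toy symmetric distance matrix standing in for aggregate site-to-site distances.
dist = np.array([[0.0, 0.2, 0.9, 0.8],
                 [0.2, 0.0, 0.8, 0.9],
                 [0.9, 0.8, 0.0, 0.3],
                 [0.8, 0.9, 0.3, 0.0]])

Z = linkage(squareform(dist), method="average")   # nested partitions (here UPGMA)
dendrogram(Z, labels=["site1", "site2", "site3", "site4"])
plt.show()                                        # the tree groups similar sites together
```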
All these agglomerative clustering algorithms proceed from a distance matrix, repeatedly choosing the two closest elements and fusing them. They differ in the way in which distances are recalculated from the newly fused elements to the others. We now review the various calculations.

Single link, also known as nearest neighbor, is one of the oldest methods in cluster analysis. The similarity between two clusters is computed as the distance between the two most similar objects in the two clusters:

d_{k[ij]} = \min(d_{ki}, d_{kj})

In this formula, as well as in the other formulae in this subsection, i and j are the two closest points that have just been fused into one cluster [i, j], and k represents all the remaining points (clusters). As noted in Jain and Dubes (1988), single link clusters easily chain together, producing the so-called chaining effect, and produce elongated clusters. The presence of only one intermediate object between two compact clusters is enough to turn them into a single cluster.

Complete link, also called furthest neighbor, uses the most distant pair of objects while fusing two clusters. It repeatedly merges clusters whose most distant elements are closest:

d_{k[ij]} = \max(d_{ki}, d_{kj})

Unweighted Pair Group Method using Arithmetic Averages (UPGMA) belongs to a group of average clustering methods, together with the three methods described below. In UPGMA, the distance between any two clusters is the average of the distances between the members of the two clusters being compared. The average is weighted naturally, according to size:

d_{k[ij]} = \frac{n_i}{n_i + n_j} d_{ki} + \frac{n_j}{n_i + n_j} d_{kj}

As a consequence, smaller clusters are weighted less and larger ones more.

Weighted Pair Group Method using Arithmetic Averages (WPGMA), just as UPGMA, calculates the distance between two clusters as the average of the distances between all members of the two clusters. But in WPGMA, the clusters that fuse receive equal weight regardless of the number of members in each cluster:

d_{k[ij]} = \frac{1}{2} d_{ki} + \frac{1}{2} d_{kj}

Because all clusters receive equal weight, objects in smaller clusters are more heavily weighted than those in larger clusters.

Unweighted Pair Group Method using Centroids (UPGMC). In this method, the members of a cluster are represented by their middle point, the so-called centroid. This centroid represents the cluster while calculating the distance between the clusters to be fused:

d_{k[ij]} = \frac{n_i}{n_i + n_j} d_{ki} + \frac{n_j}{n_i + n_j} d_{kj} - \frac{n_i n_j}{(n_i + n_j)^2} d_{ij}

In the unweighted version of centroid clustering the clusters are weighted based on the number of elements that belong to each cluster. This means that bigger clusters receive more weight, so that centroids can be biased towards bigger clusters. Centroid clustering methods can also occasionally produce reversals: partitions where the distance between two clusters being joined is smaller than the distance between some of their subclusters (Legendre and Legendre, 1998).

Weighted Pair Group Method using Centroids (WPGMC). Just as in WPGMA, in WPGMC all clusters are assigned the same weight regardless of the number of objects in each cluster. In that way the centroids are not biased towards larger clusters:

d_{k[ij]} = \frac{1}{2} d_{ki} + \frac{1}{2} d_{kj} - \frac{1}{4} d_{ij}

Ward's method is also known as the minimal variance method.
At each stage in the analysis, the clusters that merge are those that result in the smallest increase in the sum of the squared distances of each individual from the mean of its cluster:

d_{k[ij]} = \frac{n_k + n_i}{n_k + n_i + n_j} d_{ki} + \frac{n_k + n_j}{n_k + n_i + n_j} d_{kj} - \frac{n_k}{n_k + n_i + n_j} d_{ij}

This method uses an analysis of variance approach to calculate the distances between clusters. It tends to create clusters of the same size (Legendre and Legendre, 1998).

2.2 K-means

The k-means algorithm belongs to the non-hierarchical algorithms, which are often referred to as partitional clustering methods (Jain and Dubes, 1988). Unlike hierarchical clustering algorithms, partitional clustering methods generate a single partition of the data. A partition implies a division of the data in such a way that each instance can belong to only one cluster. The number of groups into which the data should be partitioned is usually determined by the user.

K-means is the most commonly used partitional algorithm and, despite its simplicity, works sufficiently well in many applications (Manning and Schütze, 1999). The main idea of k-means clustering is to find the partition of n objects into K clusters such that the total error sum of squares is minimized. In its simplest version, the algorithm consists of the following steps:

1. pick initial cluster centers at random
2. assign objects to the cluster whose mean is closest
3. recompute the means of the clusters
4. reassign every object to the cluster whose mean is closest
5. repeat steps 3 and 4 until there are no changes in the cluster membership of any object

The two main drawbacks of the k-means algorithm are the following:

- the user has to define the number of clusters in advance
- the final partitioning depends on the initial position of the centroids

Possible solutions to these problems, as well as detailed descriptions of the k-means algorithm, can be found in some of the classical references on k-means: Hartigan (1975), Everitt (1980) and Jain and Dubes (1988).

2.3 Neighbor-joining

Apart from the seven hierarchical clustering algorithms and k-means, we also investigate the performance of the neighbor-joining algorithm. We introduce this technique at more length as it is less familiar to linguists. Neighbor-joining is a method for reconstructing phylogenetic trees that was first introduced by Saitou and Nei (1987). The main principle of this method is to find pairs of taxonomic units that minimize the total branch length at each stage of clustering. The distances between each pair of instances (in our case data collection sites) are calculated and put into an n × n matrix, where n represents the number of instances. The matrices are symmetrical since distances are symmetrical, i.e. distance(a, b) is always the same as distance(b, a). Based on the input distances, the algorithm finds a tree that fits the observed distances as closely as possible. While choosing the two nodes to fuse, the algorithm always takes into account the distance from every node to all other nodes in order to find the smallest tree that would explain the data. Once found, the two optimal nodes are fused and replaced by a new node. The distance between the new node and all other nodes is recalculated, and the whole process is repeated until there are no more nodes left to be paired. The algorithm was modified by Studier and Keppler (1988), and its complexity was reduced to O(n^3).
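Before listing the steps of the neighbor-joining algorithm, the following sketch illustrates how the seven linkage rules of Section 2.1 and the k-means procedure above can be run on a small distance matrix. It is only an illustration under assumed tooling (SciPy and scikit-learn); the paper does not state which implementations were used, and the toy data below is hypothetical. Note that SciPy's centroid, median and Ward updates formally presuppose Euclidean input distances.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = rng.random((10, 2))                           # toy stand-in for real sites
dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
condensed = squareform(dist)                           # condensed form expected by linkage

# SciPy names for the seven agglomerative update rules discussed above.
methods = {"single": "single link", "complete": "complete link",
           "average": "UPGMA", "weighted": "WPGMA",
           "centroid": "UPGMC", "median": "WPGMC", "ward": "Ward"}
for scipy_name, paper_name in methods.items():
    Z = linkage(condensed, method=scipy_name)          # hierarchy of nested partitions
    two_way = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
    print(paper_name, two_way)

# k-means needs coordinates rather than a distance matrix; here the toy points are
# used directly (the paper does not say which representation it clustered).
print("k-means", KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points))
```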
The steps of the algorithm are as follows (taken from Felsenstein (2004)):

- For each node i compute u_i, the sum of the distances from that node to all other nodes, divided by n - 2:

  u_i = \frac{\sum_{j : j \neq i} D_{ij}}{n - 2}

- Choose the i and j for which D_{ij} - u_i - u_j is smallest.

- Join i and j. Compute the branch lengths from i and j to the newly formed node v using the equations below. Note that the distances from the new node to its children (leaves) need not be identical; this possibility does not exist in hierarchical clustering.

  v_i = \frac{1}{2} D_{ij} + \frac{1}{2}(u_i - u_j)
  v_j = \frac{1}{2} D_{ij} + \frac{1}{2}(u_j - u_i)

- Compute the distance between the new node and each of the remaining nodes:

  D_{(ij),k} = \frac{D_{ik} + D_{jk} - D_{ij}}{2}

- Delete nodes i and j and replace them by the new node.

This algorithm produces a unique unrooted tree under the principle of minimal evolution (Saitou and Nei, 1987). In biology, the neighbor-joining algorithm has become a very popular and widely used method for reconstructing trees from distance data. It is fast and can easily be applied to large amounts of data. Unlike most hierarchical clustering algorithms, it will recover the true tree even if there is not a constant rate of change among the taxa (Felsenstein, 2004).

3 Data preprocessing

The data set used in this research consists of transcriptions of the pronunciations of 156 words collected from 197 sites equally distributed all over Bulgaria. All measurements were based on the phonetic distances between the various pronunciations of these 156 words. No morphological, lexical or syntactic variation between the dialects was taken into account. Word transcriptions were preprocessed in the following way:

- First, all diacritics and suprasegmentals were removed from the word transcriptions. In order to process diacritics and suprasegmentals, they would have to be assigned weights appropriate for the specific language being analyzed. Since no study of this kind was available for Bulgarian, diacritics and suprasegmentals were removed, which resulted in a simplification of the data representation. For example, [u], [u:], [ˈu], and [ˈu:] counted as the same phone. Also, all words were represented as series of phones which are not further defined. The result of comparing two phones can be 1 or 0; they either match or they do not. For example, the pair [e, E] counts as different to the same degree as the pair [e, i]. Although it is linguistically counterintuitive to use such a coarse measure, Heeringa (2004, p. 186) has shown that in the aggregate analysis of dialect differences a more detailed feature representation of segments does not improve the results obtained by using a simple phone representation.

- All transcriptions were aligned based on the following principles: a) a vowel can match only with a vowel; b) a consonant can match only with a consonant, the semivowels [j] and [w], or a sonorant. The alignments were carried out using the Levenshtein algorithm, which also results in the calculation of a distance between each pair of words. A detailed explanation of the Levenshtein algorithm can be found in Heeringa (2004). The distance is the smallest number of insertions, deletions, and substitutions needed to transform one string into the other. In this work all three operations were assigned the same cost: 1. An example of an aligned pair of transcriptions (a gap is marked by "-"):

  -  e  d  e  m
  j  A  d  A  -

The distance between two sites is the mean of all word distances calculated for those two sites.
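To make the word-distance computation concrete, here is a minimal sketch of unit-cost Levenshtein distance over phone sequences. It is written for this presentation rather than taken from the authors' implementation, and the vowel/consonant alignment constraints described above are omitted for brevity.

```python
def levenshtein(a, b):
    """Unit-cost Levenshtein distance between two phone sequences.

    Every insertion, deletion and substitution costs 1, as in the paper;
    the vowel/consonant alignment constraint is not enforced in this sketch.
    """
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[m][n]

# The example pair shown above: here both the constrained alignment and
# plain Levenshtein give a distance of 4.
print(levenshtein(list("edem"), list("jAdA")))  # -> 4
```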
The final result is a distance matrix which contains the distances between each two sites in the data set. This distance matrix was further analyzed using the seven hierarchical algorithms, k-means and the neighbor-joining algorithm described in the previous section.

4 Evaluation

We analyzed the results obtained by the above-mentioned methods further using a variety of measures. Multidimensional scaling was performed in order to see if there were any separate groups in the data and to determine the optimal number of clusters in the data set. External validation of the clustering results included the modified Rand index, purity and entropy. External validation involves comparison of the structure obtained by different algorithms to a gold standard. In our study we used the manual classification of all the sites produced by a traditional dialectologist as a gold standard. Internal validation included examining the cophenetic correlation coefficient, noisy clustering and a consensus tree; these do not require comparison to any a priori structure, but rather try to determine whether the structure obtained by the algorithms is intrinsically appropriate for the data.

Multidimensional scaling (MDS) is a dimension-reducing method used in exploratory data analysis and data visualization, often used to look for separation of clusters (Legendre and Legendre, 1998). The goal of the analysis is to detect meaningful underlying dimensions that allow the researcher to explain observed similarities or dissimilarities between the investigated objects. In general, then, MDS attempts to arrange "objects" in a space with a certain small number of dimensions such that the arrangement accords with the observed distances. As a result, we can "explain" the distances in terms of underlying dimensions. It has been used frequently in linguistics and dialectology since Black (1973).

4.1 External validation

The modified Rand index (Hubert and Arabie, 1985) is used for comparing two different partitions of a finite set of objects. It is a modified form of the Rand index (Rand, 1971), one of the most popular measures for comparing partitions. Given a set of n elements S = {o_1, ..., o_n} and two partitions of S, U = {u_1, ..., u_R} and V = {v_1, ..., v_C}, we define:

a: the number of pairs of elements in S that are in the same set in U and in the same set in V
b: the number of pairs of elements in S that are in different sets in U and in different sets in V
c: the number of pairs of elements in S that are in the same set in U and in different sets in V
d: the number of pairs of elements in S that are in different sets in U and in the same set in V

The Rand index R is

R = \frac{a + b}{a + b + c + d}

In this formula a and b are the numbers of pairs of elements on which the two classifications agree, while c and d are the numbers of pairs of elements on which they disagree. The value of the Rand index is between 0 and 1, with 0 indicating that the two data clusterings do not agree on any pair of points and 1 indicating that the data clusterings are exactly the same. In dialectometry, this index was used by Heeringa et al. (2002) to validate dialect comparison methods.

A problem with the Rand index is that it does not return a constant value (zero) if the two partitions are picked at random. Hubert and Arabie (1985) suggested a modification of the Rand index that corrects for this. It can be expressed in the general form as:

\frac{RandIndex - ExpectedIndex}{MaximumIndex - ExpectedIndex}

The value of the modified Rand index is between -1 and 1.
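As a small illustration, the sketch below computes the plain Rand index directly from the pair counts a, b, c and d defined above, and uses scikit-learn's adjusted_rand_score for the Hubert and Arabie correction. The labels are hypothetical toy data, not the Bulgarian sites.

```python
from itertools import combinations
from sklearn.metrics import adjusted_rand_score

def rand_index(u, v):
    """Plain Rand index from the pair counts a, b, c, d defined above."""
    a = b = c = d = 0
    for i, j in combinations(range(len(u)), 2):
        same_u, same_v = u[i] == u[j], v[i] == v[j]
        if same_u and same_v:
            a += 1
        elif not same_u and not same_v:
            b += 1
        elif same_u and not same_v:
            c += 1
        else:
            d += 1
    return (a + b) / (a + b + c + d)

# Toy labels: a hypothetical expert 2-way division vs. a clustering result.
expert  = [0, 0, 0, 1, 1, 1]
cluster = [0, 0, 1, 1, 1, 0]
print(rand_index(expert, cluster))           # plain Rand index, in [0, 1]
print(adjusted_rand_score(expert, cluster))  # modified Rand index, in [-1, 1]
```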
Entropy and purity are two measures used to evaluate the quality of a clustering by looking at the reference class labels of the elements assigned to each cluster (Zhao and Karypis, 2001). Entropy measures how the different classes of elements are distributed within each cluster. The entropy of a single cluster is calculated using the following formula:

E(S_r) = -\frac{1}{\log q} \sum_{i=1}^{q} \frac{n_{ir}}{n_r} \log \frac{n_{ir}}{n_r}

where S_r is a particular cluster of size n_r, q is the number of classes in the reference data set, and n_{ir} is the number of elements of the i-th class that were assigned to the r-th cluster. The overall entropy is the sum of all cluster entropies weighted by the size of the cluster:

E = \sum_{r=1}^{k} \frac{n_r}{n} E(S_r)

The purity measure is used to determine to what extent a cluster contains objects from primarily one class. The purity of a cluster is calculated as:

P(S_r) = \frac{1}{n_r} \max_i (n_{ir})

while the overall purity is the weighted sum of the individual cluster purities:

P = \sum_{r=1}^{k} \frac{n_r}{n} P(S_r)

4.2 Internal validation

The cophenetic correlation coefficient (Sokal and Rohlf, 1962) is Pearson's correlation coefficient computed between the cophenetic distances produced by a clustering and the distances in the original distance matrix. The cophenetic distance between two objects is the similarity level at which those two objects become members of the same cluster during the course of the clustering (Jain and Dubes, 1988) and is represented as branch length in the dendrogram. The coefficient measures to what extent the clustering results correspond to the original distances. When the clustering preserves the original distances perfectly, the value of the cophenetic correlation coefficient is 1. In order to check the significance of this statistic we performed the simple Mantel test as implemented in the zt software (Bonet and de Peer, 2002). A simple Mantel test compares two matrices by testing the correlation between them using the standard Pearson correlation coefficient and testing its statistical significance (Mantel, 1967).

Noisy clustering, also called composite clustering, is a procedure in which small amounts of random noise are added to the distance matrix during repeated clustering. The main purpose of this procedure is to reduce the influence of outliers on the regular clusters and to identify stable clusters. As shown in Nerbonne et al. (2008), it gives results that correlate nearly perfectly with the results obtained by bootstrapping, a statistical method for measuring the support of a given edge in a tree (Felsenstein, 2004). The advantage of noisy clustering, compared to bootstrapping, is that it can be applied to a single distance matrix, the same one used as input for the classification algorithms.

A consensus dendrogram, or consensus tree, is a tree that summarizes the agreement between a set of trees (Felsenstein, 2004). A consensus tree that contains a large number of internal nodes shows high agreement between the input trees. On the other hand, if a consensus tree contains few internal nodes, it is a sign that the input trees classify the data in conflicting ways. The majority rule consensus tree, used in this study, is a tree that consists of the groups, i.e. clusters, that are present in the majority of the trees under study. In this research a consensus dendrogram was created from four dendrograms produced by four different hierarchical clustering methods. Clusters that appear in the consensus tree are those supported by the majority of the algorithms and can be taken with greater confidence to be true clusters.
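The two internal checks just described can be sketched in a few lines. The following illustration assumes SciPy and a hypothetical distance matrix; the noise level and the number of runs are arbitrary choices for the example, not the settings used in the study.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet, fcluster
from scipy.spatial.distance import squareform

# Hypothetical stand-in for the site-by-site distance matrix.
rng = np.random.default_rng(1)
pts = rng.random((12, 2))
dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
condensed = squareform(dist)

# Cophenetic correlation coefficient: correlation between the original
# distances and the distances implied by the dendrogram.
Z = linkage(condensed, method="average")          # UPGMA
ccc, _ = cophenet(Z, condensed)
print("CCC:", round(ccc, 4))

# Noisy clustering: repeat the clustering with small random noise added to
# the distances and count how often each pair of sites ends up together.
together = np.zeros_like(dist)
runs, sigma = 100, 0.05 * condensed.std()
for _ in range(runs):
    noisy = np.clip(condensed + rng.normal(0.0, sigma, condensed.shape), 0, None)
    labels = fcluster(linkage(noisy, "average"), t=2, criterion="maxclust")
    together += (labels[:, None] == labels[None, :])
print(together / runs)   # pairwise support for the 2-way split
```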
5 Results

Before describing the results of applying the various algorithms to our data set, we give a short description of the traditional division of the Bulgarian dialect area that we used for external validation in our research.

5.1 Traditional scholarship

Traditional scholarship (Stojkov, 2002) divides the Bulgarian language into two main groups: Western and Eastern. The border between these two areas is the so-called 'yat' border, which reflects different pronunciations of the old Slavic vowel 'yat'. It runs from Nikopol in the North, near Pleven and Teteven, down to Petrich in the South (bold dashed line in Figure 1).

Figure 1: Traditional map of Bulgarian dialects
Figure 2: The two-way and six-way classification of sites done by the expert
Figure 3: MDS plot
Figure 4: MDS map

Stojkov divides each of these two areas further into three smaller dialect zones, which can also be seen on the map in Figure 1. This 6-fold division is based on the variation of different phonetic features; no morphological or syntactic differences were taken into account. In order to evaluate the performance of the different clustering algorithms, all sites present in our data set were manually assigned by an expert to one of the two, and later one of the six, main dialect areas according to Stojkov's classification. This was done by Professor Vladimir Zhobov, phonetician and dialectologist at the Faculty of Slavic Philologies 'St. Kliment Ohridski', University of Sofia.

Due to various historical events, mostly migrations, some villages are dialectological islands surrounded by language varieties from groups different from the one they belong to. This lack of geographical coherence can be seen, for example, in the north-central part of the map in Figure 2.

5.2 MDS

Multidimensional scaling was performed in order to check whether there are any separate clusters in the data. The results can be seen in Figure 3, where the first two extracted dimensions are plotted against the x and y axes. In addition, all three extracted dimensions are represented by different shades of red, blue and green, so that the third MDS dimension is also visible.

The first three dimensions represented in Figure 3 explain 98 per cent of the variation in the data set: the first dimension extracted explains 80 per cent of the variation, and the second dimension 16 per cent. In Figure 3 we can see two distinct clusters along the x-axis, which, if put on the map, correspond to the Eastern and Western groups of dialects (Figure 4). Variation along the y-axis corresponds to the separation of the dialects in the South from the rest of the country.

Using MDS to screen the data, we observe that there are two distinct clusters in the data set, even though MDS is fully capable of representing continuous data. This finding fully agrees with the expert opinion (Stojkov, 2002), according to which the Bulgarian dialect area can be divided into Eastern and Western dialect areas along the 'yat' border. A third area that can be seen in Figure 4 is the area in the South of the country, the area of the Rodopi mountains. In the classification of dialects by Stojkov (2002), this area is identified as one of the six main dialect areas on the basis of phonetic features.

5.3 External validation

The results of the multidimensional scaling and the dialect divisions done by the expert can be used as a first step in the evaluation of the clustering algorithms.
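For readers who wish to reproduce this kind of MDS screening on their own distance data, a minimal sketch follows. It assumes scikit-learn's metric MDS and a hypothetical distance matrix; the paper does not state which MDS implementation or variant was actually used.

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical stand-in for the 197 x 197 matrix of aggregate site distances.
rng = np.random.default_rng(2)
pts = rng.random((20, 2))
dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)

# Extract three dimensions from the distances, as in Section 5.2.
mds = MDS(n_components=3, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dist)       # one row of 3 coordinates per site
print(coords[:5])
```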
Visual inspection shows that three algorithms fail to identify any structure in the data, including the East-West division of the dialects: single link and the two centroid algorithms, UPGMC and WPGMC. Dendrograms drawn using UPGMC and WPGMC reveal a large number of reversals, while closer inspection of the single link dendrogram clearly shows the presence of the chaining effect. The remaining algorithms reveal the East-West division of the country clearly (Figure 5). For that reason, in the rest of the paper the main focus will be on those four clustering algorithms, as well as on k-means and neighbor-joining.

Figure 5: Top left map: 2-way division produced by UPGMA, WPGMA and Ward's method. Top right map: 6-way division produced by UPGMA. Bottom maps: 6-way divisions produced by WPGMA and Ward's method respectively.

In order to compare the divisions produced by the clustering algorithms with the division of sites done by the expert, we calculated the modified Rand index, entropy and purity for the 2-fold, 3-fold and 6-fold divisions produced by the algorithms on the one hand, and those divisions according to the expert on the other. The results can be seen in Table 1. The neighbor-joining algorithm produced an unrooted tree (Figure 6) in which only 2-fold and 3-fold divisions of the sites can be identified. Hence, for neighbor-joining all indices were calculated only for the 2-fold and 3-fold divisions.

Table 1: Results of external validation: the modified Rand index (MRI), entropy (E) and purity (P). Results for the 2-, 3- and 6-fold divisions are reported.

Algorithm       MRI(2)  MRI(3)  MRI(6)  E(2)   E(3)   E(6)   P(2)   P(3)   P(6)
single link     -0.004   0.007  -0.001  0.958  0.967  0.881  0.614  0.396  0.360
complete link    0.495   0.520   0.350  0.510  0.542  0.467  0.848  0.766  0.645
UPGMA            0.700   0.627   0.273  0.368  0.445  0.583  0.914  0.853  0.568
WPGMA            0.700   0.626   0.381  0.368  0.445  0.448  0.914  0.853  0.665
UPGMC           -0.004   0.007  -0.006  0.959  0.967  0.926  0.614  0.396  0.310
WPGMC           -0.004   0.007  -0.005  0.958  0.967  0.925  0.614  0.396  0.305
Ward's method    0.700   0.627   0.398  0.368  0.445  0.441  0.914  0.853  0.675
k-means          0.700   0.625   0.471  0.354  0.451  0.355  0.919  0.756  0.772
NJ               0.567   0.461   -      0.442  0.550  -      0.873  0.777  -

Figure 6: NJ tree

In Table 1 we can see that the values of the modified Rand index for single link and the two centroid methods are very close to 0, which is the value we would get if the partitions were picked at random. UPGMA, WPGMA, Ward's method and k-means, which gave nearly the same 2-fold division of the sites, show the highest correspondence with the divisions done by the expert. For the 3-fold and 6-fold divisions the values of the modified Rand index go down for all algorithms, which is expected since the number of groups increases. The two algorithms with the highest values of the index are Ward's method and UPGMA for the 3-fold division, and k-means for the 6-fold division. Just as in the case of the 2-fold division, the single link, UPGMC and WPGMC algorithms have values of the modified Rand index close to 0. Neighbor-joining shows a relatively low correspondence with expert opinion for the 3-fold division: 0.461. Similar results for all algorithms and all divisions were obtained using the entropy and purity measures.

External validation of the clustering algorithms has thus revealed that the single link, UPGMC and WPGMC algorithms are not suitable for the analysis of the data we are working with, since they fail to recognize any structure in the data.
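The entropy and purity columns of Table 1 follow the formulas given in Section 4.1. As a small illustration with hypothetical toy labels (not the Bulgarian data), they can be computed as follows:

```python
import numpy as np

def entropy_purity(clusters, classes):
    """Overall entropy and purity of a clustering against reference class labels,
    following the formulas in Section 4.1 (sketch with hypothetical labels)."""
    clusters, classes = np.asarray(clusters), np.asarray(classes)
    n = len(clusters)
    q = len(np.unique(classes))                 # number of reference classes
    E = P = 0.0
    for c in np.unique(clusters):
        members = classes[clusters == c]        # reference labels inside this cluster
        n_r = len(members)
        _, counts = np.unique(members, return_counts=True)
        p = counts / n_r
        e_r = -(p * np.log(p)).sum() / np.log(q) if q > 1 else 0.0
        E += n_r / n * e_r                      # size-weighted cluster entropy
        P += n_r / n * counts.max() / n_r       # size-weighted cluster purity
    return E, P

# Toy example: 2 clusters vs. 2 expert classes.
print(entropy_purity([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))
```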
5.4 Internal validation

In the next step, internal validation methods were used to check the performance of the algorithms: the cophenetic correlation coefficient, noisy clustering and the consensus tree. Since k-means does not produce a dendrogram, it was not possible to calculate the cophenetic correlation coefficient for it. The values of the cophenetic correlation coefficient for the remaining eight algorithms can be seen in Table 2. The clustering results of UPGMA have the highest correspondence to the original distances of all algorithms: 90.26 per cent. They are followed by the results obtained using the complete link and neighbor-joining algorithms. All correlations are highly significant with p < 0.0001. Given the poor performance of the centroid and single link methods in detecting the dialect divisions scholars agree on, we note that the cophenetic correlation coefficient is not successful in distinguishing the better techniques from the weaker ones. We conjecture that the reason for this lies in the fact that the cophenetic correlation coefficient depends so heavily on the lengths of the branches in the dendrogram, while our primary interest is the classification.

Table 2: Cophenetic correlation coefficient (CCC)

Algorithm          CCC     p
single link        0.7804  0.0001
complete link      0.8661  0.0001
UPGMA              0.9026  0.0001
WPGMA              0.8563  0.0001
UPGMC              0.8034  0.0001
WPGMC              0.6306  0.0001
Ward's method      0.7811  0.0001
Neighbor-joining   0.8587  0.0001

Noisy clustering, which was applied with the seven hierarchical algorithms, confirmed that there are two relatively stable groups in the data: Eastern and Western. Dendrograms obtained by applying noisy clustering to the whole data set show low confidence for the two-way split of the data, between 52 and 60 per cent. After removing the Southern villages from the data set, we obtained dendrograms that confirm the two-way split of the data along the 'yat' border with much higher confidence, around 70 per cent. These values are still not very high. In order to understand the influence of the Southern varieties on the noisy clustering, we examined an MDS plot in two dimensions with the cluster groups marked by colours. Figure 7 shows an MDS plot of the 6 groups produced by the WPGMA algorithm. The plot reveals two homogeneous groups and a third, more diffuse, group that lies at a remove from them. This third group of sites represents the Southern group of varieties, colored light blue and yellow, and is much more heterogeneous than the rest of the data. Closer inspection of the MDS plot in Figure 3 also shows that this group of dialects has a particularly unclear border with the Eastern dialects, which could explain the results of the noisy clustering applied to the whole data set.

Since different algorithms gave different divisions of the sites, we used a consensus dendrogram in order to detect the clusters on which most algorithms agree. Since single link, UPGMC and WPGMC turned out to be inappropriate for the analysis of our data, they were not included in the consensus dendrogram. The consensus dendrogram drawn using complete link, UPGMA, WPGMA and Ward's method can be seen in Figure 8. The names of the sites are colored according to the expert's opinion, i.e. the same as in Figure 2. The dendrogram shows strong support for the East-West division of the sites, but no agreement on the division of sites within the Eastern and Western areas.

Figure 7: MDS plot of 6 clusters produced by WPGMA. Note that the good separation of the clusters is often spoiled by unclear margins.
At this level of the hierarchy, i.e. the 2-way division, there are several sites classified differently by the algorithms and by the expert. These sites lie along the 'yat' border and represent marginal cases. The only two exceptions are villages in the South-East, namely Voden and Zheljazkovo. However, according to many traditional dialectologists these villages should be classified as Western dialects due to the many features that they share with the dialects in the West (personal communication with Prof. Vladimir Zhobov). The four algorithms agree only at the very lowest level, where several sites are grouped together, and at the highest level. It is not possible to extract any hierarchical structure that is present in the majority of the four analyses.

6 Discussion and conclusions

The different clustering validation methods have shown that three algorithms are not suitable at all for the data we are working with, namely single link, UPGMC and WPGMC. The remaining four hierarchical clustering algorithms gave different results depending on the level of the hierarchy, but all four agreed closely on the detection of the two main dialect areas within the dialect space. At lower levels of the hierarchy, i.e. where there are more clusters, the performance of the algorithms is poorer, both with respect to the expert opinion and with respect to their mutual agreement. As shown by noisy clustering, the 2-fold division of the Bulgarian language area is the only partition of the sites that can be asserted with high confidence.

The results of the neighbor-joining algorithm were somewhat less satisfactory. The reason for this could be that our data is not tree-like, but rather contains many borrowings due to contact between different dialects. A recent study of Chinese dialects (Hamed and Wang, 2006) has shown that their development is not tree-like and that in such cases the use of tree-reconstruction methods can be misleading.

The division of sites produced by the k-means algorithm corresponded well with the expert divisions. The two- and three-way divisions also correspond well with the divisions of the four hierarchical clustering algorithms. What we find more important is the fact that in the divisions obtained by the k-means algorithm into 2, 3, 4, 5 and 6 groups, the two-way division into the Eastern and Western groups is the only stable division that appears in all partitions.

This research shows that clustering algorithms should be applied with caution as classifiers of language dialect varieties. Where possible, several internal and external validation methods should be used together with the clustering algorithms in order to validate their results and to make sure that the classifications obtained are not mere artifacts of the algorithms but natural groups present in the data set. Since the performance of clustering algorithms depends on the sort of data used, evaluation of the algorithms is a necessary step in order to obtain results that can be asserted with high confidence.

The fact that there are only two distinct groups in our data set that can be asserted with high confidence, as opposed to the six found in the traditional atlases, could possibly be due to the simplified representation of the data (see Section 3). It is also possible that some of the features responsible for the traditional 6-way division are not present in our data set. At the moment we are investigating these two issues.
Regardless of the quality of the input data set, we have shown that clustering algorithms will partition data into the desired number of groups even if there is no natural separation in the data. For this reason it is essential to use different evaluation techniques along with the clustering algorithms.

Classification algorithms are nowadays applied in different subfields of the humanities (Woods et al., 1986; Boonstra et al., 1990). Classification is a general technique that can be applied to any sort of data that needs to be put into different groups in order to discover various patterns. Document and text classification, authorship detection and language typology are just some of the areas where classification algorithms are nowadays successfully applied. The problem of choosing the right classification algorithm and obtaining stable results goes beyond dialectometry and is present wherever such algorithms are applied. For this reason the present paper is valuable not only for research done in dialectometry, but also for other branches of the humanities that use clustering techniques. It shows how unstable the results of clustering algorithms can be, but also how to approach this problem and overcome it.

References

P. Black (1973), 'Multidimensional scaling applied to linguistic relationships', in Cahiers de l'Institut de Linguistique Louvain, Volume 3 (Montreal). Expanded version of a paper presented at the Conference on Lexicostatistics, University of Montreal.

E. Bonet and Y. V. de Peer (2002), 'zt: a software tool for simple and partial Mantel tests', Journal of Statistical Software, 7(10), 1–12.

O. Boonstra, P. Doorn, and F. Hendrickx (1990), Voortgezette Statistiek voor Historici (Muiderberg).

B. S. Everitt (1980), Cluster Analysis (New York).

J. Felsenstein (2004), Inferring Phylogenies (Massachusetts).

H. Goebl (2007), 'On the geolinguistic change in Northern France between 1300 and 1900: a dialectometrical inquiry', in J. Nerbonne, T. M. Ellison, and G. Kondrak, eds, Computing and Historical Phonology. Proceedings of the Ninth Meeting of the ACL Special Interest Group in Computational Morphology and Phonology (Prague), 75–83.

M. B. Hamed and F. Wang (2006), 'Stuck in the forest: Trees, networks and Chinese dialects', Diachronica, 23(1), 29–60.

J. A. Hartigan (1975), Clustering Algorithms (New York).

W. Heeringa (2004), Measuring Dialect Pronunciation Differences using Levenshtein Distance (PhD thesis, University of Groningen).

W. Heeringa, J. Nerbonne, and P. Kleiweg (2002), 'Validating dialect comparison methods', in W. Gaul and G. Ritter, eds, Classification, Automation, and New Media. Proceedings of the 24th Annual Conference of the Gesellschaft für Klassifikation, University of Passau, March 15–17, 2000 (Heidelberg), 445–452.

L. Hubert and P. Arabie (1985), 'Comparing partitions', Journal of Classification, 2, 193–218.

A. K. Jain and R. C. Dubes (1988), Algorithms for Clustering Data (New Jersey).

P. Legendre and L. Legendre (1998), Numerical Ecology, second ed. (Amsterdam).

C. Manning and H. Schütze (1999), Foundations of Statistical Natural Language Processing (Cambridge, MA).

N. Mantel (1967), 'The detection of disease clustering and a generalized regression approach', Cancer Research, 27, 209–220.

H. Moisl and V. Jones (2005), 'Cluster analysis of the Newcastle Electronic Corpus of Tyneside English: a comparison of methods', Literary and Linguistic Computing, 20, 125–146.
J. Nerbonne, P. Kleiweg, W. Heeringa, and F. Manni (2008), 'Projecting Dialect Differences to Geography: Bootstrap Clustering vs. Noisy Clustering', in C. Preisach, H. Burkhardt, L. Schmidt-Thieme, and R. Decker, eds, Data Analysis, Machine Learning, and Applications. Proceedings of the 31st Annual Meeting of the German Classification Society (Berlin), 647–654.

J. Nerbonne and C. Siedle (2005), 'Dialektklassifikation auf der Grundlage aggregierter Ausspracheunterschiede', Zeitschrift für Dialektologie und Linguistik, 72(2), 129–147.

W. M. Rand (1971), 'Objective criteria for the evaluation of clustering methods', Journal of the American Statistical Association, 66(336), 846–850.

N. Saitou and M. Nei (1987), 'The neighbor-joining method: A new method for reconstructing phylogenetic trees', Molecular Biology and Evolution, 4, 406–425.

R. R. Sokal and F. J. Rohlf (1962), 'The comparison of dendrograms by objective methods', Taxon, 11, 33–40.

S. Stojkov (2002), Bulgarska dialektologiya (Sofia).

J. A. Studier and K. J. Keppler (1988), 'A note on the neighbor-joining algorithm of Saitou and Nei', Molecular Biology and Evolution, 5, 729–731.

A. Woods, P. Fletcher, and A. Hughes (1986), Statistics in Language Studies (Cambridge).

Y. Zhao and G. Karypis (2001), 'Criterion functions for document clustering: Experiments and analysis', Technical Report 01-40, Department of Computer Science, University of Minnesota, Minneapolis, MN.

Figure 8: Consensus dendrogram for the four algorithms. The four algorithms show agreement only on the 2-way division. It is not possible to extract any hierarchical structure that would be present in the majority of the four analyses. (For the explanation of colors see Figure 3.)