doi:10.1016/j.ijhcs.2006.08.009 ARTICLE IN PRESS 1071-5819/$ - se doi:10.1016/j.ijh � Correspond E-mail addr jxu@bentley.ed Int. J. Human-Computer Studies 65 (2007) 57–70 www.elsevier.com/locate/ijhcs Mining communities and their relationships in blogs: A study of online hate groups Michael Chau a,� , Jennifer Xu b a School of Business, The University of Hong Kong, Pokfulam, Hong Kong b Department of Computer Information Systems, Bentley College, Waltham, MA 02452, USA Available online 10 October 2006 Abstract Blogs, often treated as the equivalence of online personal diaries, have become one of the fastest growing types of Web-based media. Everyone is free to express their opinions and emotions very easily through blogs. In the blogosphere, many communities have emerged, which include hate groups and racists that are trying to share their ideology, express their views, or recruit new group members. It is important to analyze these virtual communities, defined based on membership and subscription linkages, in order to monitor for activities that are potentially harmful to society. While many Web mining and network analysis techniques have been used to analyze the content and structure of the Web sites of hate groups on the Internet, these techniques have not been applied to the study of hate groups in blogs. To address this issue, we have proposed a semi-automated approach in this research. The proposed approach consists of four modules, namely blog spider, information extraction, network analysis, and visualization. We applied this approach to identify and analyze a selected set of 28 anti-Blacks hate groups (820 bloggers) on Xanga, one of the most popular blog hosting sites. Our analysis results revealed some interesting demographical and topological characteristics in these groups, and identified at least two large communities on top of the smaller ones. The study also demonstrated the feasibility in applying the proposed approach in the study of hate groups and other related communities in blogs. r 2006 Elsevier Ltd. All rights reserved. Keywords: Blogs; Social network analysis; Hate groups; Web mining 1. Introduction Blogs, or Weblogs, have become increasingly popular in recent years. Blog is a Web-based publication that allows users to add content periodically, normally in reverse chronological order, in a relatively easy way. Blogs also combine personal Web pages with tools to make linking to other pages easier, as well as to post comments and afterthoughts (Blood, 2004). Instead of having a few people being in control of the threads on traditional Internet forums, blogs basically allow anyone to express their ideas and thoughts. Many free blog hosting sites are available. Sites like www.blogger.com, www.xanga.com, and www. livejournal.com, are a few examples. e front matter r 2006 Elsevier Ltd. All rights reserved. cs.2006.08.009 ing author. Tel.: +852 2859 1014; fax: +852 2858 5614. esses: mchau@business.hku.hk (M. Chau), u (J. Xu). Many communities have emerged in the blogosphere. These could be support communities such as those for technical support or educational support (Nardi et al., 2004), or groups of bloggers who already knew each other in other context, such as a group for a high school or a company. In addition, there are also communities formed by people who share common interests or opinions. Many free blog hosting sites have the function to allow bloggers to link to each other to form explicit groups. Similar to other Web-based media such as Web sites, discussion forums, or chat rooms where hate groups are present (Anti- Defamation League, 2001; CNN, 1999; Glaser et al., 2002), there are also hate groups in blogs that are formed by bloggers who are racists or extremists. The consequences of the formation of such groups on the Internet cannot be underestimated. Hate groups or White supremacist groups like the Ku Klux Klan have started to use the Internet to spread their beliefs, recruit new members, or even advocate hate crimes with considerable success (Anti-Defamation http://www.blogger.com http://www.xanga.com http://www.livejournal.com http://www.livejournal.com www.elsevier.com/locater/ijhcs dx.doi.org/10.1016/j.ijhcs.2006.08.009 mailto:jxu@bentley.edu mailto:jxu@bentley.edu ARTICLE IN PRESS M. Chau, J. Xu / Int. J. Human-Computer Studies 65 (2007) 57–7058 League, 2001). The Web has allowed these groups to reach much further into society than ever before. Young people, the major group of bloggers, are more likely to be affected and even ‘‘brainwashed’’ by ideas propagated through the Web as a global medium. Hatred and extremism ideas could easily be embedded into their minds to make them become members of these hate groups or even conduct hate crimes. Facing the new trend in the cyberspace, our study has two objectives. First, we propose a semi-automated approach that combines blog spidering and social network analysis techniques to facilitate the monitoring, study, and research on the networks of bloggers, especially those in hate groups. To study the cyber activities of hate groups in blogs, it is important to devise an efficient and effective way to identify these groups, extract the information of their members, and explore their relationships. In recent years, advanced techniques such as text mining, Web mining, and social network analysis have been widely used in studies on cyber crimes, online extremist groups, and terrorist organizations (e.g., Burris et al., 2000, Chen et al., 2004, Zhou et al., 2005). However, the application of these techniques to blog analysis on the Web is a new area and no prior research has been published. Our approach consists of a set of techniques for identifying and analyzing hate groups in the blogosphere on the Web. Moreover, we include in our approach topological analysis, a new network analysis methodology that has not been employed in prior hate group and blog-related research, to study the relationships between bloggers. Second, our study seeks insights into the organization and movement of online hate groups. We reveal the structural properties of social networks of bloggers through a case study and compare our findings with those from previous studies. Like many other extremist organi- zations, hate groups and their activities are a type of social movement, which has significant social and political implications (Burris et al., 2000; Douglas et al., 2005). Because of their tendency toward violence, association with crimes, and especially, potential influence on youths, hatred and hate groups have attracted much attention from academics recently. Moreover, since such a move- ment is dynamic and keeps changing over time, it is desirable to constantly monitor the changes, compare findings across studies, and continuously update our understanding of hate groups. The remainder of the paper is organized as follows. In Section 2 we review the research background of hate group analysis and related research in text mining, Web mining, and social network analysis. We pose our research questions in Section 3, and a semi-automated approach to hate group analysis in blogs is presented in Section 4. In Section 5 we present a case study that we have performed on a popular blog hosting site based on the proposed approach and discuss our analysis findings. Lastly, we conclude our research in Section 6 and suggest some directions for future work. 2. Related work In this section we will review the background of the increasingly popular blogging phenomena. We also review relevant techniques in Web mining and social network analysis that have been applied in analyzing Web contents. 2.1. Blogging Blogs have become increasingly popular in the past few years. In the early days, blogs, a short form of weblogs, were used mainly for pages where links to other useful resources were periodically ‘‘logged’’ and posted. At that time blogs were mostly maintained by hand (Blood, 2004). After easy-to-use blogging software became widely avail- able in the early 2000s, the nature of blogs has changed and many blogs are more like personal Web sites that contain various types of content (not limited to links) posted in reverse-chronological order. Bloggers often make a record of their lives and express their opinions, feelings, and emotions through writing blogs (Nardi et al., 2004). Many bloggers consider blogging as an outlet for their thoughts and emotions. Besides personal blogs, there are also blogs created by companies. For example, ice.com, an online jewelry seller, has launched three blogs and reported that thousands of people linked to their Web site from these blogs (Hof, 2005). One of the most important features in blogs is the ability for any reader to write a comment on a blog entry. On most blog hosting sites, it is very easy to write a comment, in a way quite similar to replying to a previous message in traditional discussion forums. The ability to comment on blogs has facilitated the interaction between bloggers and their readers. On some controversial issues, like those related to racism, it is not uncommon to find a blog entry with thousands of comments where people dispute back and forth on the matter. Cyber communities have also emerged in blogs. Com- munities in blogs can be categorized as explicit commu- nities or implicit communities, like some other cyber communities on the Web (Kumar et al., 1999). Explicit communities in blogs are the groups, or blogrings, that bloggers have explicitly formed and joined. Most blog hosting sites allow bloggers to form a new group or join any existing groups. On the other hand, implicit commu- nities can only be defined by the interactions among bloggers, such as subscription, linking, or commenting. For example, a blogger may subscribe to another blog, meaning that the subscriber can get updates when the subscribed blog has been updated. A blogger can also post a link or add a comment to another blog, which are perhaps the most traditional activities among bloggers. These interac- tions signify some kind of connection between two bloggers. Because of such interactions among bloggers, these communities are less similar to the cyber communities as discussed in Kumar et al. (1999) but more resembling to the virtual communities which involve the social interaction ARTICLE IN PRESS M. Chau, J. Xu / Int. J. Human-Computer Studies 65 (2007) 57–70 59 between members characterized by memberships, sense of belonging, relationships, shared values and practices, and self-regulation (Erickson, 1997; Roberts, 1998; Rheingold, 2000). Similar to the analysis of hyperlinks among Web pages to identify communities (Chau et al., 2005a,b), analysis of the connections between bloggers could also identify these virtual communities, their characteristics, and their relationships. 2.2. Web mining and social network analysis Techniques based on both Web mining and social network techniques have been used in intelligence- and security-related applications and achieved considerable success. Web mining techniques are important because the Web has provided a vast amount of publicly accessible information that could be useful in security applications. It has been reported that many terrorists and extremist groups have been using the Web for various purposes (Zhou et al., 2005; Qin et al., 2006). Analysis and mining of such content can provide useful insights that are important to national and international security. Similarly, social network analysis (SNA) techniques also have been increasingly used in security applications in recent years. As SNA analyzes the interactions between individuals, it can often identify communities within a group of individuals and reveal some other interesting findings. This is especially useful for analyzing criminal organizations and terrorist groups and promising results have been reported (Xu and Chen, 2005). In the rest of this subsection, we give a brief review of Web mining and SNA techniques that are relevant to this research. Web mining techniques have been widely applied to various Web applications in recent years. In these applications, Web spiders usually are an indispensable component. Spiders, also known as crawlers, wanderers, or Webbots, are defined as ‘‘software programs that traverse the World Wide Web information space by following hypertext links and retrieving Web documents by standard HTTP protocol’’ (Cheong, 1996). Since the early days of the Web, spiders have been widely used to build the underlying databases of search engines (e.g., Pinkerton, 1994), to perform personal search (e.g., Chau et al., (2001)), to archive particular Web sites or even the whole Web (e.g., Kahle, 1997), or to collect Web statistics (e.g., Broder et al., 2000). Web mining techniques can be categorized into three types: content mining, structure mining, and usage mining (Kosala and Blockeel, 2000). Web content mining refers to the discovery of useful information from Web contents, including text, images, audio, video, etc. Web content mining research includes resource discovery from the Web (e.g., Cho et al., 1998; Chakrabarti et al., 1999), document categorization and clustering (e.g., Zamir and Etzioni, 1999; Chen et al., 2003), and information extraction from Web pages. Web structure mining studies the model underlying the hyperlink structures of the Web. It usually involves the analysis of in-links and out-links information of a Web page, and has been used for search engine result ranking and other Web applications. Google’s PageRank (Brin and Page, 1998) and HITS (Kleinberg, 1998) are the two most widely used algorithms. Web usage mining employs data mining techniques to analyze search logs or other activity logs to find interesting patterns. One of the main applications of Web usage mining is its use to learn user profiles (e.g., Armstrong et al., 1995). Specifically, the identification of Web communities has to a large extent relied on Web structure mining (Gibson et al., 1998; Kumar et al., 1999). Many of existing community identification methods are rooted in the HITS algorithm (Kleinberg, 1998). Kumar et al. (1999) propose a trawling approach to find a set of core pages containing both authoritative and hub pages for a specific topic. The core is a directed bipartite subgraph whose node set is divided into two sets with all hub pages in one set and authoritative pages in the other. The core and the other related pages constitute a Web community (Gibson et al., 1998). Treating the Web as a large graph, the problem of community identification can also be formulated as a minimum-cut problem, which finds clusters of roughly equal sizes while minimizing the number of links between clusters (Flake et al., 2000, 2002). Realizing that the minimum-cut problem is equivalent to the maximum-flow problem (Ford and Fulkerson, 1956), Flake et al. (2000) formulate the Web community identification problem as an s�t maximum flow problem, which can be solved using efficient polynomial time methods. Recently, a number of hierarchical clustering methods have also been proposed for identifying community structure in networks. These methods are especially suitable for unweighted networks such as the Web, in which hyperlinks do not have associated weights. The G–N algorithm (Girvan and Newman, 2002), for example, is a divisive clustering algorithm that gradually removes links to break a connected graph into clusters. However, the algorithm is rather slow. A few alternative methods such as the modularity-based algorithm (Newman, 2004b) and the edge clustering coefficient based algorithm (Radicchi et al., 2004) have improved the efficiency of the G–N algorithm. Web structure mining, on the other hand, also has a big overlap with social network analysis. SNA is a sociological methodology for analyzing patterns of relationships and interactions between social actors in order to discover the underlying social structure (Wasserman and Faust, 1994). Not only the attributes of social actors, such as their age, gender, socioeconomic status, and education, but also the properties of relationships between social actors, such as the nature, intensity, and frequency of the relationships, are believed to have important implications to the social structure. SNA methods have been employed to study organizational behavior (Borgatti and Foster, 2003), inter- organizational relations (Stuart, 1998), citation patterns (Baldi, 1998), virutal communities (Garton et al., 1999), and many other domains. Recently, SNA has also been ARTICLE IN PRESS M. Chau, J. Xu / Int. J. Human-Computer Studies 65 (2007) 57–7060 used in the intelligence and security domain to analyze criminal and terrorist networks (Dombroski and Carley, 2002; Krebs, 2001; Xu and Chen, 2004, 2005). When used to mine a network, SNA can help reveal the structural patterns such as the central nodes which act as hubs, leaders, or gatekeepers, the densely-knit communities or groups, and the patterns of interactions between the communities and groups. These patterns often have important implications to the functioning of the network. For example, the central nodes often play a key role by issuing commands or bridging different communities. The removal of central nodes can effectively disrupt a network than peripheral nodes (Albert et al., 2000). Moreover, a recent movement in statistical analysis of network topology (Albert and Barabási, 2002) has brought new insights and research methodology to the study of network structure. Networks, regardless of their contents, are classified into three categories: random network (Bollobás, 1985), small-world network (Watts and Strogatz, 1998), and scale-free network (Barabási et al., 1999). In a random network the probability that two randomly selected nodes are connected is a constant p. As a result, each node has roughly the same number of links and nodes are rather homogenous. In addition, communities are not likely to exist in random networks. Small-world networks, in contrast, have a significantly high tendency to form groups and communities. Most empirical networks ranging from social networks (Newman, 2004a), biological net- works (Jeong et al., 2001), to the Web (Albert et al., 1999) have been found to be nonrandom networks. In addition, many of these networks are also scale-free networks (Barabási et al., 1999), in which a large percentage of nodes have just a few links, while a small percentage of the nodes have a large number of links. Thus, nodes in scale- free networks are not homogenous in terms of their links. Some nodes become hubs or leaders that play important roles in the operation of the network. The Web has been found to have both small-world and scale-free properties (Albert and Barabási, 2002). 2.3. Hate on the Internet Hate crimes have been one of the long-standing problems in the United States because of various historical, cultural, and political reasons. Race, gender, religion, and disability often become the reason of hate. Over time, hate groups have been formed to unite individuals with similar beliefs as well as to spread such ideology. For example, White supremacist groups such as Ku Klux Klan (KKK), Neo-Nazis, and Racist Skinheads have been active in the United States for a long time (Burris et al., 2000). Hate groups have been increasingly using the Internet to express their ideas, spread their beliefs, and recruit new members (Lee and Leets, 2002). It has been reported that 60% of hate criminals are youths (Levin and McDevitt, 1993), who are, perhaps unfortunately, also one of the largest groups of Internet users. Glaser et al. (2002) suggest that racists often express their views more freely on the Internet. The Hate Directory (Franklin, 2005) compiles a list of hundreds of Web sites, files archives, newsgroups, and other Internet resources related to hate and racism. Several studies have investigated Web sites that are related to racism or White supremacy. Douglas et al. (2005) studied 43 Web sites that were related to White supremacy. It was found that while these groups showed lower level of advocated violence due to legal constraints, they exhibited high levels of social conflict and social creativity. Lee and Leets (2002) found that storytelling-style, implicit messages often used by hate groups on the Internet were more persuasive to adolescents, who have become the target of new member recruitment of many hate groups. These adolescences might be easily influenced to conduct hate crimes. Gerstenfeld et al. (2003) conducted a manual analysis of 157 extremist Web sites. They found that some hate Web sites were associated with hate groups while others were maintained by individuals. Many of these sites had links to other extremist sites or hate group sites, showing that some of these groups are linked to each other. Burris et al. (2000) systematically analyzed the networks of Web sites maintained by white supremacist groups and found that this network had a decentralized structure with several centers of influence. In addition, communities were present in this network in which groups sharing similar interests and ideologies tended to be closely connected. Zhou et al. (2005) used software to automate the analysis of the content of hate group Web sites and the linkage among them. They found that one of the major objectives of these Web sites was to share ideology. Cyber commu- nities such as White Supremacists and Neo-Nazis were identified among these sites. Recent years have seen the emergence of hate groups in blogs, where high-narrative messages are the norm. This has made blogs an ideal medium for spreading hatred. Blogs have also made it possible for individuals to find others with similar belief and ideology much more easily. As a result, hate groups have emerged in blogs. To study these online hate groups in blogs, it is important to analyze the content of these blogs as well as the relationships among the bloggers. However, because of the large volume of data involved, it is often a mentally exhausting, if not infeasible, process to perform such kind of analysis manually. There is great potential value to apply Web mining and social network analysis techniques, which have been used successfully in other security-related applications, to analyze hate-related blogs automatically in order to identify patterns and facilitate further analysis. However, we have not been able to identify any prior research in this aspect in the literature. 3. Research questions As discussed earlier, it is an important and timely issue to identify the hate groups in blogs and analyze their relationships. Web mining techniques have been used to ARTICLE IN PRESS M. Chau, J. Xu / Int. J. Human-Computer Studies 65 (2007) 57–70 61 analyze Web content such as Web pages and hyperlinks; however, few of these techniques have been applied to blog analysis. Based on our review, we pose the following two sets of research questions: (1) Can we use semi-automatic techniques, such as automated text collection, text analysis, and network analysis, to identify hate groups in blogs? (2) What are the structural properties of the social networks of bloggers in the hate groups? Are there bloggers who stand out as leaders of influence in these groups? What is the community structure in these groups? What do the structural properties suggest about the organization of the hate groups? What are the social and political implications of these properties? 4. Proposed approach In this section, we propose a semi-automated approach for identifying groups and analyzing their relationships in blogs. The approach is diagrammed in Fig. 1. Our approach consists of four main modules, namely Blog Spider, Information Extraction, Network Analysis, and Visualization. The Blog Spider module downloads blog pages from the Web. These pages are then processed by the Information Extraction module. Data about these blogs and their relationships are extracted and passed to the Network Analysis module for further analysis. Finally the Visualization module presents the analysis results to users in a graphical display. In the following, we describe each module in more detail. 4.1. Blog spider A blog spider program is first needed to download the relevant pages from the blogs of interest. Similar to general Web fetching, the spider can connect to blog hosting sites Fig. 1. The proposed semi-automated using standard HTTP protocol. After a blogring descrip- tion page or a blog page is fetched, URLs are extracted and stored into a queue. However, instead of following all extracted links, the blog spider should only follow links that are of interest, e.g., links to a group’s members, other bloggers, comment links, and so on. Links to other external resources are often less useful in blog analysis. Multi- threading also can be used, as in standard spiders, such that multiple Web pages can be downloaded in parallel (Chau et al., 2005a,b). This can avoid bottleneck in the process if any particular Web server is sending malicious response or not responding at all. Alternatively, asynchro- nous I/O can be used for parallel fetching (Brin and Page, 1998). In either case, after a page is downloaded it can be stored into a relational database or as a flat file. In addition, the spider can use RSS (Really Simple Syndica- tion) and get notification when the blog is updated. However, this is only necessary when monitoring or incremental analysis is desired. 4.2. Information extraction After a blog page has been downloaded, it is necessary to extract useful information from the page. This includes information related to the blog or the blogger, such as user profiles and date of creation. This can also include linkage information between two bloggers, such as linkage, commenting, or subscription. Because different blogs may have different formats, it is not a trivial task to extract this information from blogs. Even blogs hosted on the same hosting site could have considerably different formats as they can be easily customized by each blogger. Pattern matching or entity extraction techniques can be applied. For example, rule-based algorithms that rely on hand- crafted rules can be used to extract useful information such approach for blog link analysis. ARTICLE IN PRESS M. Chau, J. Xu / Int. J. Human-Computer Studies 65 (2007) 57–7062 as named entities. The rules may be structural, contextual, or lexical (Krupka and Hausman, 1998). Although such human created rules are usually of high quality, this approach is not scalable to large number of entities and not robust to less structured documents such as blogs. Fortunately, some standard information such as name and location are oftentimes put into specific format (e.g., as a sidebar) in large blog hosting sites, and simple rules should suffice. Nonetheless, different rules may be needed for different blog hosts. Other techniques, such as statistical approach or machine learning techniques, also can be applied if more detailed information needs to be extracted. The information extracted can be stored into database for further analysis. Links extracted can be passed back to the blog spider for further fetching. 4.3. Network analysis Network analysis is a major component in our approach. In this module we propose three types of analysis: topological analysis centrality analysis and community analysis. The goal of topological analysis is to ensure that the network extracted based on links between bloggers is not random and it is meaningful to perform the centrality and community analysis. We use three statistics that are widely used in topological studies to categorize the extracted network (Albert and Barabási, 2002): average shortest path length, clustering coefficient and degree distribution. Aver- age path length is the mean of the all-pair shortest paths in a network. It measures the efficiency of communication between nodes in a network. Clustering coefficient in- dicates how likely nodes in a network form groups or communities. The degree distribution, P(k), is the prob- ability that a node has exactly k links. Another measure related to average path length is the network’s global efficiency, which is defined as the average of the inverses of shortest path lengths over all pairs of nodes in a network (Crucitti et al., 2003). It is shown that a random network usually has a small average path length and is more efficient because an arbitrary node can reach any other node in a few steps. Small-world networks usually have significantly high clustering coefficients than their random network counterparts. Scale-free networks are categorized by a power-law degree distribution, which is different from the Poisson degree distribution presented in random networks. Centrality analysis follows the topological analysis if the extracted network is shown to be a nonrandom network in which node degrees may vary greatly. The goal of centrality analysis is to identify the key nodes in a network. Three traditional centrality measures can be used: degree, betweenness, and closeness (Freeman, 1979). Degree measures how active a particular node is. It is defined as the number of direct links a node has. ‘‘Popular’’ nodes with high degree scores are the leaders, experts, or hubs in a network. It has been shown that these popular nodes can be a network’s ‘‘Archilles’ Heel,’’ whose failure or removal will cause the network to quickly fall apart (Albert et al., 2000; Holme et al., 2002). In the intelligence and security context, the removal of key offenders in a criminal or terrorist network is often an effective disruptive strategy (McAndrew, 1999; Sparrow, 1991). Betweenness measures the extent to which a particular node lies between other nodes in a network. The betweenness of a node is defined as the number of geodesics (shortest paths between two nodes) passing through it. Nodes with high betweenness scores often serve as gatekeepers and brokers between different communities. They are important communication channels through which information, goods, and other resources are transmitted or exchanged (Wasserman and Faust, 1994). The removal of nodes with high betweenness scores can be even more devastating than the removal of nodes with high degrees (Holme et al., 2002). Closeness is the sum of the length of geodesics between a particular node and all the other nodes in a network. A node with low closeness may find it very difficult to communicate with other nodes in the network. Such nodes are thus more ‘‘peripheral’’ and can become outliers in the network (Sparrow, 1991; Xu and Chen, 2005). Community analysis is to identify social groups in a network. In SNA a subset of nodes is considered a community or a social group if nodes in this group have stronger or denser links with nodes within the group than with nodes outside of the group (Wasserman and Faust, 1994). A weighted network in which each link has an associated weight can be partitioned into groups by maximizing the within-group link weights while minimizing between-group link weights. Because the link weight represents node similarity or link strength and intensity, nodes in the same group are more similar to each other or more strongly connected. An unweighted network can be partitioned into groups by maximizing within-group link density while minimizing between-group link density. In this case, groups are densely-knit subsets of the network. Note that community and groups here do not refer to the explicit groups (blogrings). They refer to a subset of nodes which form implicit clusters through various relationships. In these communities, members subscribe to or post comments to each other’s blogs frequently even though they may not belong to the same blogrings. After a network is partitioned into groups, the between- group relationships become composites of links between individual nodes. In SNA, a method called blockmodeling is often used to reveal the patterns of interactions between groups (White et al., 1976). Given groups in a network, blockmodel analysis determines the presence or absence of a relationship between two groups based on the link density (Wasserman and Faust, 1994). When the density of the links between the two groups is greater than a predefined threshold value, a between-group relationship is present, indicating that the two groups interact with each other constantly and thus have a strong relationship. By this means, blockmodeling summarizes individual relational ARTICLE IN PRESS M. Chau, J. Xu / Int. J. Human-Computer Studies 65 (2007) 57–70 63 details into relationships between groups so that the overall structure of the network becomes more prominent. 4.4. Visualization The extracted network and analysis results can be visualized using various types of network layout methods. Two examples are multidimensional scaling (MDS) and graph layout approaches. MDS is the most commonly used method for social network visualization (Freeman, 2000). It is a statistical method that projects higher-dimensional data onto a lower-dimensional display. It seeks to provide a visual representation of proximities (dissimilarities) among nodes so that nodes that are more similar to each other are closer on the display, while nodes that are less similar to each other are further apart (Kruskal and Wish, 1978). Graph layout algorithms have been developed particularly for drawing aesthetically pleasing network presentations (Fruchterman and Reingold, 1991). A type of graph layout algorithm called spring embedder, also known as force-directed method (Davidson and Harel, 1996; Eades, 1984; Fruchterman and Reingold, 1991; Kamada and Kawai, 1989) has been widely used to visualize networks. This algorithm treats a network as an energy system in which steel rings (nodes) are connected by springs (links). Nodes attract and repulse each other and finally settle down when the total energy carried by the springs is minimized. 5. Case study 5.1. Focus and Methods We applied our approach to conduct a case study of hate groups in blogs. As pointed out in our review, hate crimes have been a long-standing problem in the United States and it is important to study the phenomenon of online hate groups on the Internet, particularly in new media such as blogs. We chose to study the hate groups against Blacks. There are two reasons for the focus. First, the nature of hate groups and hate crimes is often dependent on the target ‘‘hated’’ group. By focusing on a type of hate groups it is possible to identify relationships that are more prominent. Second, among different hate crimes, anti- Black hate crimes have been one of the most widely studied (e.g., Burris et al., 2000; Glaser et al., 2002). This allows us to compare our results with previous research in the literature. To keep the study at a manageable size, we limit our study to the blogs on Xanga (www.xanga.com). According to statistics provided by Alexa (2005), Xanga is the second most popular Web site that is primarily devoted to blogs, only after Blogger (www.blogger.com). It is also ranked 17th in traffic (visit popularity) among all Web sites in English. We chose Xanga over Blogger because Xanga has more prominent features to support subscriptions and groups (as blogrings). These features are useful for the identification of hate groups in the blogs and the relation- ships between bloggers. After choosing our focus, we had to identify a set of hate groups on Xanga. We used the search feature in Xanga to semi-automate the task. First, a set of terms related to Black-hatred, such as ‘‘KKK’’, ‘‘niggers’’, ‘‘white pride’’, were identified. We used these terms to search for groups (blogrings) on Xanga that have any of these words in their group name or description. We then checked these groups and filtered out those not related to anti-Black. Groups with only one single member, which were likely to have been formed by one blogger with no one else joining afterwards, were also removed from our list. This resulted in a set of 40 groups. While most of these groups showed some beliefs of racism or White supremacies, we tried to further narrow these down to groups that demonstrated explicit hatred, so as to make sure that our analysis focused on ‘‘hate groups’’. Therefore, we manually checked these groups and only included those that explicitly mentioned hatred (e.g., ‘‘I hate black people’’, ‘‘hate the black race’’) or used offensive languages (e.g., ‘‘nigger beaters’’, four- letter words) towards the Blacks in their group name or description. Finally we had a list of 28 groups. These groups are listed in Table 1. In order to further justify the validity of the list of the 28 groups identified, we tried to compare this list with the ‘‘racist blogs’’ list in the Hate Directory (Franklin, 2005). However, only 14 bloggers (10 hosted on Xanga) were listed in the Hate Directory, and no blogrings were listed. To make the comparison possible, we compared our list with the blogrings that these bloggers belong to. As two out of the 10 bloggers on Xanga were no longer available, we had 8 bloggers for our comparison. We found that all these 8 bloggers are members of at least one of the groups identified in our list. Five of these bloggers are members of ‘‘! White Power !’’, the biggest blogring identified in our list. While the comparison was not direct, it showed that our list of groups had covered all of the racist bloggers listed in the Hate Directory, who were some of the more prominent bloggers. Spiders were used to automatically download the description page and member list of each of these groups. A total of 820 bloggers were identified from these 28 groups. The spiders further downloaded the blogs of each of these bloggers. The extraction program was then executed to extract the information of each blogger, including user id, real name, date of creation, date of birth, city, state, and country. One should note that these data were self-reported; they could be fraud or even missing. The extraction program also analyzed the relationship between these bloggers. In this study, two types of relationships were extracted: 1. Group co-membership: two bloggers belong to the same group (blogring). This is an undirected relationship with an integer weight (based on the number of groups shared by the two bloggers). http://www.xanga.com http://www.blogger.com ARTICLE IN PRESS Table 1 The 28 groups (blogrings) identified in our analysis Blogring name URL No. of members ! White Power ! http://www.xanga.com/groups/group.aspx?id=76863 371 KKK white is right http://www.xanga.com/groups/group.aspx?id=84971 135 The KKK (Ku Klux Klan) http://www.xanga.com/groups/group.aspx?id=164821 67 Are u racist? hate queers and niggers? Me too. http://www.xanga.com/groups/group.aspx?id=191887 67 Angry and White http://www.xanga.com/groups/group.aspx?id=258062 48 ! ! ! !Hatred 4 SoCiEtY! ! ! ! http://www.xanga.com/groups/group.aspx?id=709048 43 ALL NiGgErS sTiNk http://www.xanga.com/groups/group.aspx?id=107296 40 I HATE BLACK PEOPLE http://www.xanga.com/groups/group.aspx?id=525845 37 ::White::Power:: http://www.xanga.com/groups/group.aspx?id=261285 37 Nigger beaters http://www.xanga.com/groups/group.aspx?id=58711 29 WHITE f**kin PRIDE http://www.xanga.com/groups/group.aspx?id=240012 24 KKK WE GONNA KILL THE NIGGERS!!! http://www.xanga.com/groups/group.aspx?id=250810 21 I HATE THE FREAKIN PORCH MONKEYS http://www.xanga.com/groups/group.aspx?id=1066584 14 White-power http://www.xanga.com/groups/group.aspx?id=1244436 9 ** WHITE POWER NATION ** http://www.xanga.com/groups/group.aspx?id=1298240 8 N I G G E R S L A Y E R http://www.xanga.com/groups/group.aspx?id=1184100 7 K.K.K. MEMBERS http://www.xanga.com/groups/group.aspx?id=1447382 5 The ‘‘i like to beat negros and mexicans’’ blog rin http://www.xanga.com/groups/group.aspx?id=387326 5 Honor+The+Ku+Klux+Klan http://www.xanga.com/groups/group.aspx?id=323614 4 THE REAL KU KLUX KLAN http://www.xanga.com/groups/group.aspx?id=1315712 4 Dj aNDIE http://www.xanga.com/groups/group.aspx?id=794267 4 I hate G-Unit http://www.xanga.com/groups/group.aspx?id=916827 4 **�I Hate negros�** http://www.xanga.com/groups/group.aspx?id=1470409 4 Niggerzsmellfunny http://www.xanga.com/groups/group.aspx?id=317512 4 Ku Klux Klan_White Knights Of America http://www.xanga.com/groups/group.aspx?id=1014877 3 ‘*�NEGRO HATERS�*’ http://www.xanga.com/groups/group.aspx?id=325247 2 Black Haters and Negro Hangers http://www.xanga.com/groups/group.aspx?id=1401428 2 | | | | | I HATE N1GGERS | | | | | http://www.xanga.com/groups/group.aspx?id=1733297 2 M. Chau, J. Xu / Int. J. Human-Computer Studies 65 (2007) 57–7064 2. Subscription: blogger A subscribes to blogger B. This is a directed, binary relationship. The first type of relationship (group co-membership) assumes that two bloggers are related as long as they join the same blogring. Co-membership relationship is impor- tant as it often reflects that the two bloggers involved have similar belief or ideology. The resulting network based on such relationships consisted of several fully-connected cliques. Each clique corresponded to a blogring, in which each node was connected with every other node. This made the nodes indistinguishable from each other in terms of degrees. Thus, we included in our network only co- membership links whose weight was greater than one. That is, we considered a co-membership link between two bloggers a valid link only if they shared memberships of at least two common groups. For the second type of relationship, we included all subscription links. Subscrip- tion link is important because it means that the subscriber is interested in reading the subscribed blogs. It suggests that there is an information flow between the two individuals. 5.2. Analysis and results After collecting the blogs and extracting information from them, we performed demographical and network analysis on the data set in order to reveal the characteristics of these groups and ascertain whether any patterns exist. Visualization was then applied to present the results. We discuss the details of our analysis in the following sections. 5.2.1. Demographical analysis We provide a brief summary of the demographical information of the bloggers of interest and the growth patterns of the blog space of hate groups. As in many other Internet-based media such as forums and chat rooms, the real identities of bloggers are unknown. Thus, the self- reported demographical information of bloggers is subject to the problems of anonymity. However, since blogs are often personal online diaries many bloggers still choose to release partial information about their demographics such as gender and country. Many bloggers even post their personal photos on their blogs making the true gender information accessible, assuming the photos are really those of the bloggers themselves. Among the 820 bloggers in our data set, 659 explicitly indicate their gender. Sixty three percent of them are male and 37% are female. These bloggers are from various countries. Among the 529 bloggers who explicitly report the country information, 81.9% are from the United States. The two next countries are Germany and Afghani- stan, taking on 2.6% and 2.1%, respectively. The remain- ing 13.4% of bloggers are from 43 other countries. It can be seen that hate groups are dominated by male bloggers from the United States. However, we are aware that this http://www.xanga.com/groups/group.aspx?id=76863 http://www.xanga.com/groups/group.aspx?id=84971 http://www.xanga.com/groups/group.aspx?id=164821 http://www.xanga.com/groups/group.aspx?id=191887 http://www.xanga.com/groups/group.aspx?id=258062 http://www.xanga.com/groups/group.aspx?id=709048 http://www.xanga.com/groups/group.aspx?id=107296 http://www.xanga.com/groups/group.aspx?id=525845 http://www.xanga.com/groups/group.aspx?id=261285 http://www.xanga.com/groups/group.aspx?id=58711 http://www.xanga.com/groups/group.aspx?id=240012 http://www.xanga.com/groups/group.aspx?id=250810 http://www.xanga.com/groups/group.aspx?id=1066584 http://www.xanga.com/groups/group.aspx?id=1244436 http://www.xanga.com/groups/group.aspx?id=1298240 http://www.xanga.com/groups/group.aspx?id=1184100 http://www.xanga.com/groups/group.aspx?id=1447382 http://www.xanga.com/groups/group.aspx?id=387326 http://www.xanga.com/groups/group.aspx?id=323614 http://www.xanga.com/groups/group.aspx?id=1315712 http://www.xanga.com/groups/group.aspx?id=794267 http://www.xanga.com/groups/group.aspx?id=916827 http://www.xanga.com/groups/group.aspx?id=1470409 http://www.xanga.com/groups/group.aspx?id=317512 http://www.xanga.com/groups/group.aspx?id=1014877 http://www.xanga.com/groups/group.aspx?id=325247 http://www.xanga.com/groups/group.aspx?id=1401428 http://www.xanga.com/groups/group.aspx?id=1733297 ARTICLE IN PRESS M. Chau, J. Xu / Int. J. Human-Computer Studies 65 (2007) 57–70 65 finding is based on the problematic source of demographic information. The actual distribution of gender and country may be different. Nevertheless, this finding provides the evidence that the Internet and blogs have become convenient media for hate groups to penetrate into different countries and regions. The anti-Black ideology and Black-hatred groups are no longer unique to specific Western countries but have the potential to reach international audience. This finding is consistent with previous studies on online hate groups (Burris et al., 2000; Gerstenfeld et al., 2003). For example, Burris et al. (2000) found that the Internet had made it possible and rather easy for white supremacist groups to form a transnational cyber-community via hyperlinks between their Web sites. We also analyzed the growth of the hate group blogs over the years. Unlike demographical information, the exact time when a blogger registered on the blog hosting site (Xanga) is recorded by the server and thus is generally not subject to fraud. Fig. 2 presents the growth of the number of bloggers who registered on and joined the anti- Black blogrings since the first quarter of 2002. The number increased steadily between 2003 and the third quarter of 2004 and started to fluctuate since the forth quarter of 2004. This may be because some bloggers who have recently registered have not joined those popular blogrings or have not formed into large communities. As a result, some of them were not included in the data set after we filtered the raw data. Fig. 2 implies that hate groups have been gaining popularity in blog space over years as more and more individual bloggers joined these groups. Such a trend should not be underestimated because the ideas, beliefs, and opinions advocated by racists and extremists may pose potential threats to the society. More impor- tantly, if more youths join these anti-Black blogrings, they can be influenced by the ideology easily and become future 0 20 40 60 80 100 120 140 160 1/ 20 02 3/ 20 02 4/ 20 02 1/ 20 03 2/ 20 03 3/ 20 03 4/ 20 03 1/ 20 04 2/ 20 04 3/ 20 04 4/ 20 04 1/ 20 05 2/ 20 05 3/ 20 05 Time (Quarter) N u m b e r o f B lo g g e rs R e g is tr e d Fig. 2. The number of hate-group bloggers registered over the years. Table 2 The topological properties of the giant component (number in the parenthese Average shortest path length Giant component 3.62 Random counterpart 2.89 (0.03) members of extremist organizations. As extremist organi- zations gain more members, popularity, and powers, their advocated illegal activities and crimes would substantially hamper the stability of the society. 5.2.2. Topological analysis When analyzing the topology of the network, we ignored the weight of co-membership relationship and the direction of subscription. We connected two nodes (bloggers) if they belonged to at least two common groups or one subscribed to the other. As a result, there could only be at most one link between a pair of nodes. The resulting network was an unweighted, undirected network consisting of 1193 links. This network was not a connected graph in that it consisted of several disjoint components, between which no link existed. The largest connected component, often called a giant component in graph theory (Bollobás, 1985), contained 273 nodes connected by 1115 links. This giant component was a rather dense graph with an average node degree of 8.2. We performed topological analysis for the giant compo- nent. Table 2 provides the statistics of the average shortest path length, global efficiency, and clustering coefficient. To compare the giant component with its random graph counterpart, we generated 30 random networks consisting of the same number of nodes (273) and links (1115) with the giant component. The resulting statistics are also reported in Table 2. It shows that the giant component is slightly less efficient than its random graph counterpart. On average, a node in the giant component must take 0.75 more steps than in the random graph to reach another arbitrary node. Although it is less efficient than a random network, the giant component is indeed an efficient network. It takes a blogger no more than four steps (three intermediate bloggers) to reach another arbitrary one among the 273 bloggers. This short path length has an important implication to the diffusion and communication of information in the network. In other words, new ideas, opinions, beliefs, and propaganda can be quickly distrib- uted among the bloggers in the network, making it easier for bloggers to influence each other. Moreover, the giant component has a significantly higher clustering coefficient, which is 31 times more than its random graph counterpart. This implies that the giant component is a small world, in which densely-knit communities are very likely to exist. Communities have also been observed in other online media of hate groups. For example, Burris et al. (2000) found that there was a community of Holocaust Revisionist Web sites, consisting s are standard deviations) Global efficiency Clustering coefficient 0.33 0.37 0.37 (0.00) 0.03 (0.00) ARTICLE IN PRESS 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0 10 20 30 40 50 60 k p ( k ) Fig. 3. The degree distribution of the giant component. M. Chau, J. Xu / Int. J. Human-Computer Studies 65 (2007) 57–7066 of 11 sites connected by dense hyperlinks. A community plays an important role in reinforcing the interests and beliefs of its members, and helps create a ‘‘collective identity’’ (Gerstenfeld et al., 2003). By comparing with other members in the community, one may feel that his/her extremism ideas and beliefs are shared by many other people and thus are not extreme at all. The degree distribution of the giant component seems to follow a power-law distribution (see Fig. 3). The most distinctive feature of a power-law distribution curve is its long tail for large degree (k), which is significantly different from a bell-shaped Poisson distribution. The long tail indicates that a small number of nodes in the network have a large number of links and they are the ‘‘centers’’ of the network. Actually, the power-law degree distribution manifests the ‘‘rich-get-richer’’ phenomenon (Albert and Barabási, 2002), where nodes with many links are more capable of attracting links. If a blogger has many subscribers, he/she may attract more bloggers to visit, make comments on, and subscribe to his/her blog. Such a blogger become an ‘‘authoritative’’ (Kleinberg, 1998) and influential leader in the network. His/her beliefs and even advocated crimes could influence more people faster. On the other hand, a blogger subscribing to many other bloggers can appear as a ‘‘hub’’ that serves as a pointer and channel to many other bloggers including those central, influential leaders. 5.2.3. SNA and visualization We used a prototype system we developed (Xu and Chen, 2005) to find the central nodes and to identify the communities. The system has a graphical user interface to facilitate interaction between users and the system (see Fig. 4). The user interface visualizes the network and presents the results of social network analysis. In the visualization, each node represents a blogger. A straight line connecting two nodes indicates that the two corre- sponding bloggers either co-register in more than one blogring, or one blogger subscribes to the other. Note that the directions of subscription links are not shown. The layout of the network is determined using the MDS method. In order to position nodes which are likely to belong to the same community close to each other on the display, we assigned each link an ‘‘edge clustering coefficient’’, which measures the likelihood of two incident nodes of the link to form a cluster (Radicchi et al., 2004). The community analysis can be performed by adjusting the ‘‘level of abstraction’’ slider at the bottom of the panel. At the lowest level of abstraction, each individual node and link are presented. As the abstraction level increases, the system employs hierarchical clustering method to gradually merge nodes, which are connected by links of high edge clustering coefficients. When the highest abstraction level is reached, the whole giant component becomes a big community. At any level of abstraction, a circle represents a community. The size of the circle is proportional to the number of bloggers in the community. Straight lines connecting circles represent between-group relationships, which are extracted using blockmodel analysis. The thickness of a line is proportional to the density of the links between the two corresponding communities. Fig. 4(a) presents the giant component at its lowest abstraction level. The bloggers who have the highest degree and betweenness scores are highlighted and labeled with their usernames. These bloggers are those who may participate in multiple blogrings or have many subscription relationships with other bloggers. It is interesting to see that in addition to joining explicit groups (blogrings), bloggers have also formed implicit communities through co-membership and subscription. Three circles of nodes are apparently communities in which bloggers share many common memberships. The communities found by the system (see Fig. 4(c)) also confirm this pattern. The two circles indicate that at least two big communities exist among the bloggers. Moreover, the two communities also interact with each other very actively, as represented by the thick line between the two circles. Because the communities can be analyzed at different levels, the two big communities may consist of smaller communities. For example, the inner structure of the bigger community shows that two smaller circles of nodes, which may also be communities, exist in the community (see Fig. 4(d)). The Fig. 4(d) also displays at the right-hand side of the screen the rankings of the community members in terms of their centrality scores within the specific commu- nity. These three centrality measures are interpreted as the leader role, gatekeeper role, and outlier role. Therefore, a blogger which ranks the highest in degree in the community would likely be the centers of the community. Actually, these central bloggers may be either authoritative leaders or hubs. The leaders with many in-links from subscribers can spread their beliefs and opinions to many of their subscribers; and the hubs with many out-links may facilitate such a spread of influence by directing their visitors to the leaders via subscription links. To separate the co-membership links and subscription links we also present the network with only subscription links in Fig. 4(b). Comparing Fig. 4(a) and (b) one can see that among the five highlighted nodes, two bloggers (the upper right one and the lower left one) stand out due to ARTICLE IN PRESS Fig. 4. The prototype system for SNA and visualization. (a) The giant component with both co-membership and subscription links. The highlighted nodes are those who have large degree or betweenness scores. (b) The giant component with only subscription links. Note that the directions of subscription are omitted. (c) The two major communities found in the network. (d) The inner structure of a community which is represented by the large circle in (c). M. Chau, J. Xu / Int. J. Human-Computer Studies 65 (2007) 57–70 67 their large number of subscription links. Because the system interface does not show the directions of the subscription links it is difficult to visually determine whether they are leaders or hubs. By calculating the in- degree and the out-degree of these nodes, we found that they were actually hubs who subscribed to many other blogs. In general, it is worth a closer look at the bloggers with many subscribers and a high in-degree. A content analysis on their blogs may reveal whether they are actually popular leaders who often express extremism ideas, beliefs, and opinions. Because such ideologies may easily be spread among and influence their subscribers, these bloggers need to be closely monitored. On the other hand, the hubs with a high out-degree should also be paid attention since they may be intermediate channels leading to the extremism ideologies. In addition, although the network in Fig. 4(b) is not as dense as the one in Fig. 4(a), some part of the network is still rather dense indicating a tendency for forming communities. Interestingly, the upper right highlighted node seems to be a joining point that holds a community around it. This further implies that hubs, although they may not be leaders, can be important communication channels. 5.3. Discussion In this section we compare our findings with previous studies while providing answers to our research questions regarding online hate groups. Specifically, we choose to compare with the findings in Burris et al. (2000), primarily because among the limited number of studies on online hate groups, this article studies this social movement from a social network perspective. The authors focused on the structural properties of the Web site networks among white supremacist groups, making it appropriate for us to align and compare our findings with theirs. a. What are the structural properties of the social networks of bloggers in the hate groups? Similar to the network of white supremacist Web sites (Burris et al., 2000), the network of bloggers in hate groups is decentralized. The data set contains a number of isolated clusters and a giant component consisting of roughly one third of the bloggers. Centralized structures such as star and hierarchical ones are not observed in the network. This is not out of our expectation because the hate groups under study were mostly formed spontaneously. They have not evolved into organizations or associations with formal organizational structure. ARTICLE IN PRESS M. Chau, J. Xu / Int. J. Human-Computer Studies 65 (2007) 57–7068 b. Are there bloggers who stand out as leaders of influence in these groups? Burris et al. (2000) found that the decentralized white supremacist groups had different centers of influence. Our analysis shows that there are some bloggers who are connected with many others through common membership and subscription. These bloggers may be either leaders of opinions and ideologies or hubs of communication. Both types of center require closer analysis and examination. c. What is the community structure in these groups? Communities, represented by densely-knit clusters of Web sites in Burris et al. (2000), are also present in the blogger network. However, these communities are not composed of Web sites but individual bloggers. Com- munities provide an environment for its members to exchange their ideas and opinions and reinforce the shared ideology. A community becomes the collective social identity of its members and is potential of facilitating the communication and collaboration among its members to carry out illegal activities and crimes. In addition, we observed that the two commu- nities identified are closely connected. It suggests that members of these communities may share common ideologies and interests, and it is not unlikely that they will merge as a larger community in the future. d. What do the structural properties suggest about the organization of the hate groups? As mentioned in point (a), the structure of the network suggests that the hate groups in blogosphere have not formed into centralized organizations. However, it does not eliminate the possibility that these groups may help prepare future members for extremist organizations such as the Ku Klux Klan. Especially, their influence on youths should not be underestimated. e. What are the social and political implications of these properties? Burris et al. (2000) commented that extremist groups are a type of social movement which has profound social and political implications. At this stage of our study we cannot find any evidence that the bloggers in hate groups are affiliated with any political entities or associations. Indeed, there has not been any group that takes a form of organization. However, we do find that the hate groups are gaining popularity among bloggers as more and more individuals join the groups. In addition, these hate groups also show transnational characteristics, meaning that hatred and hate-related illegal activities are not a problem to a particular country or region but an international phenomenon. Such a movement should not be over- looked and ignored by authorities, researchers, and other watchdog organizations such as the Hate Direc- tory (Franklin, 2005). 6. Conclusion and future directions In this paper, we have discussed the problems of the emergence of hate groups and racism in blogs. Our contributions are twofold. First, we have proposed a semi-automated approach for blog analysis. Our approach consists of a set of Web mining and network analysis techniques that can be applied to the study of blogosphere. Such techniques as network topology analysis, which has not been employed in blog or hate group related research, are proposed and have been demonstrated useful and relevant in this context. While the approach has been proposed and studied in the context of hate group analysis, we have tried to keep the approach general such that it is not specific to such analysis. We believe that the approach can also be applied to other domains that involve virtual community analysis and mining, which we believe would be an increasingly important field for various applications. These applications include not only other security infor- matics research such as bioterrorism (Raghu and Vinze, In Press) but also other applications such as marketing analysis and business intelligence analysis (Chau et al., 2005a,b). Second, we applied this approach to investigate the characteristic and structural relationships among the hate groups in blogs in our case study. Hatred and hate groups are a type of social movement which should not be ignored in today’s information age. Extremist groups are using the Internet to disseminate their propaganda, spread their ideologies, and recruit new members. These groups have also been present in Blogs, a new online communication and publication tool, thereby posing threat to our society. However, there has not been much research on this movement. We believe that our research is timely and important to the security of society. By disseminating both explicit and implicit hatred messages through blogs, racists can easily target youths with ubiquitous coverage— basically anyone who has access to the Internet—that was never possible in the past. Youths are often easily influenced by these messages and could eventually become terrorists and pose a threat to our society (Blazak, 2001). Our study not only has provided an approach that could facilitate the analysis of law enforcement and social workers in studying and monitoring such activities, but also has brought insights into the structural properties of online hate groups and helped broaden and deepen our understanding of such a social movement. In the future, we will extend our study in three major directions to address some limitations of the current study. First, the current study only investigated the hate group activities on one single blog site, Xanga. Although Xanga has been reported to have the most number of blogs associated with hate groups (Franklin, 2005), a further study that includes other popular sites such as Blogger would be more comprehensive. Second, only two types of relationships, namely co-membership and subscription, were considered in the present study. It would be interesting to expand our study to include other types of relationships, such as commenting and hyperlinking, in the network analysis. Inclusion of these relationships could reveal other implicit linkages among the bloggers. Third, ARTICLE IN PRESS M. Chau, J. Xu / Int. J. Human-Computer Studies 65 (2007) 57–70 69 in-depth content analysis can be performed on blogs. Important nodes identified by the current approach can be further analyzed using text mining techniques such as co- occurrence analysis or entity extraction (Chau et al., 2002). Such analysis will help further unveil online hate groups. Acknowledgement This project has been supported in part by a grant from the University of Hong Kong Seed Funding for Basic Research, ‘‘Using Content and Link Analysis in Develop- ing Domain-specific Web Search Engines: A Machine Learning Approach,’’ 10205294 (PI: M. Chau), February 2004—January 2006. We would like to thank Porsche Lam and Boby Shiu of the University of Hong Kong for their participation and support in this project. References Albert, R., Barabási, A.-L., 2002. Statistical mechanics of complex networks. Reviews of Modern Physics 74 (1), 47–97. Albert, R., Jeong, H., Barabási, A.-L., 1999. Diameter of the world-wide web. Nature 401, 130–131. Albert, R., Jeong, H., Barabási, A.-L., 2000. Error and attack tolerance of complex networks. Nature 406, 378–382. Alexa, 2005. Top English language sites. [Online] Retrieved from http:// www.alexa.com/site/ds/top_sites?ts_mode=lang&lang=en on Octo- ber 7, 2005. Anti-Defamation League, 2001. Poisoning the Web: Hatred online. [Online] Retrieved from http://www.adl.org/poisoning_web/ poisoning_toc.asp on October 7, 2005. Armstrong, R., Freitag, D., Joachims, T., Mitchell, T., 1995. WebWatch- er: A learning apprentice for the World Wide Web. In: Proceedings of the AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments, March 1995. Baldi, S., 1998. Normative versus social constructivist processes in the allocation of citations: a network-analytic model. American Socio- logical Review 63 (6), 829–846. Barabási, A.-L., Albert, R., Jeong, H., 1999. Mean-field theory for scale- free random networks. Physica A 272, 173–187. Blazak, R., 2001. White boys to terrorist men: Target recruitment of Nazi skinheads. American Behavioral Scientist 44 (6), 982–1000. Blood, R., 2004. How blogging software reshapes the online community. Communications of the ACM 47 (12), 53–55. Bollobás, B., 1985. Random Graphs. London, Academic. Borgatti, S.P., Foster, P.C., 2003. The network paradigm in organiza- tional research: a review and typology. Journal of Management 29, 991–1013. Brin, S., Page, L., 1998. The anatomy of a large-scale hypertextual web search engine. In: Proceedings of the 7th WWW Conference, Brisbane, Australia, April 1998. Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., Wiener, J., 2000. Graph Structure in the Web. Proceedings of the 9th International World Wide Web Conference, Amsterdam, Netherlands, May 2000. Burris, V., Smith, E., Strahm, A., 2000. White supremacist networks on the internet. Sociological Focus 33 (2), 215–235. Chakrabarti, S., van den Berg, M., Dom, B., 1999. Focused crawling: a new approach to topic-specific web resource discovery. In: Proceedings of the 8th International World Wide Web Conference, Toronto, Canada, May 1999. Chau, M., Zeng, D., Chen, H., 2001. Personalized Spiders for Web Search and Analysis. Proceedings of the First ACM/IEEE-CS Joint Con- ference on Digital Libraries, Roanoke, Virginia, USA, June 24–28, 2001, pp. 79–87. Chau, M., Xu, J. J., Chen, H., 2002. Extracting Meaningful Entities from Police Narrative Reports. Proceedings of the National Conference for Digital Government Research (dg.o 2002), Los Angeles, California, USA, May, 2002, pp. 271–275. Chau, M., Qin, J., Zhou, Y., Tseng, C., Chen, H., 2005a. SpidersRUs: Automated development of vertical search engines in different domains and languages. In: Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries, Denver, Colorado, USA, June, 2005, pp. 110–111. Chau, M., Shiu, B., Chan, I., Chen, H., 2005b. Automated identification of web communities for business intelligence analysis. In: Proceedings of the Fourth Workshop on E-Business (WEB 2005), Las Vegas, USA, December, 2005. Chen, H., Fan, H., Chau, M., Zeng, D., 2003. Testing a cancer meta spider. International Journal of Human-Computer Studies 59 (5), 755–776. Chen, H., Chung, W., Xu, J., Wang, G., Qin, Y., Chau, M., 2004. Crime data mining: a general framework and some examples. IEEE Computer 37 (4), 50–56. Cheong, F.C., 1996. Internet Agents: Spiders, Wanderers, Brokers, and Bots. New Riders Publishing, Indianapolis, Indiana, USA. Cho, J., Garcia-Molina, H., Page, L., 1998. Efficient crawling through URL ordering. In: Proceedings of the 7th WWW Conference, Brisbane, Australia, April 1998. CNN (1999). Hate group web sites on the rise. CNN News [Online] Retrieved from http://edition.cnn.com/US/9902/23/hate.group.report/ index.html on October 7, 2005. Crucitti, P., Latora, V., Marchiori, M., Rapisarda, A., 2003. Efficiency of scale-free networks: error and attack tolerance. Physica A 320, 622–642. Davidson, R., Harel, D., 1996. Drawing graphs nicely using simulated annealing. ACM Transactions on Graphics 15 (4), 301–331. Dombroski, M.J., Carley, K.M., 2002. NETEST: estimating a terrorist network’s structure. Computational and Mathematical Organization Theory 8, 235–241. Douglas, K.M., McGarty, C., Bliuc, A., Lala, G., 2005. Understanding cyberhate: social competition and social creativity in online white supremacist groups. Social Science Computer Review 23 (1), 68–76. Eades, P., 1984. A heuristic for graph drawing. Congressus Numerantium 42, 149–160. Erickson, T., 1997. Social interaction on the net: Virtual community as participatory genre. Proceedings of the Thirtieth Hawaii International Conference on System Sciences, January 7–10, 1997, Wailea, Hawaii, USA. Flake, G.W., Lawrence, S., Giles, C.L., 2000. Efficient identification of web communities. In: Proceedings of the 6th International Conference on Knowledge Discovery and Data Mining (ACM SIGKDD 2000), Boston, MA. Flake, G.W., Lawrence, S., Giles, C.L., Coetzee, F.M., 2002. Self- organization and identification of web communities. IEEE Computer 35 (3), 66–71. Ford Jr., L.R., Fulkerson, D.R., 1956. Maximal flow through a network. Canadian Journal of Mathematics 8, 399–404. Franklin, R.A., 2005. The hate directory. [Online] Retrieved from http:// www.bcpl.net/�rfrankli/hatedir.htm on October 7, 2005. Freeman, L.C., 1979. Centrality in social networks: conceptual clarifica- tion. Social Networks 1, 215–240. Freeman, L.C., 2000. Visualizing social networks. Journal of Social Structure 1 (1). Fruchterman, T.M.J., Reingold, E.M., 1991. Graph drawing by force- directed placement. Software-Practice and Experience 21 (11), 1129–1164. Garton, L., Haythornthwaite, C., Wellman, B., 1999. Studying online social networks. Doing Internet Research. Sage Publications, S. Jones. Thousand Oaks, CA, 75–105. http://www.alexa.com/site/ds/top_sites?ts_mode=lang&lang=en http://www.alexa.com/site/ds/top_sites?ts_mode=lang&lang=en http://www.alexa.com/site/ds/top_sites?ts_mode=lang&lang=en http://www.adl.org/poisoning_web/poisoning_toc.asp http://www.adl.org/poisoning_web/poisoning_toc.asp http://edition.cnn.com/US/9902/23/hate.group.report/index.html http://edition.cnn.com/US/9902/23/hate.group.report/index.html http://www.bcpl.net/~rfrankli/hatedir.htm http://www.bcpl.net/~rfrankli/hatedir.htm http://www.bcpl.net/~rfrankli/hatedir.htm ARTICLE IN PRESS M. Chau, J. Xu / Int. J. Human-Computer Studies 65 (2007) 57–7070 Gerstenfeld, P.B., Grant, D.R., Chiang, C.P., 2003. Hate online: a content analysis of extremist internet sites. Analyses of Social Issues and Public Policy 3, 29–44. Gibson, D., Kleinberg J., Raghavan, P., 1998. Inferring web communities from link topology. In: Proceedings of the 9th ACM Conference on Hypertext and Hypermedia, Pittsburgh, PA. Girvan, M., Newman, M.E.J., 2002. Community structure in social and biological networks. In: Proceedings of the National Academy of Science of the United States of America, vol. 99, pp. 7821–7826. Glaser, J., Dixit, J., Green, D.P., 2002. Studying hate crime with the internet: what makes racists advocate racial violence? Journal of Social Issues 58 (1), 177–193. Hof, R., 2005. Blogs on ice: Signs of a business model? Business Week Online—The Tech Beat, June 2, 2005. [Online] Retrieved from http:// www.businessweek.com/the_thread/techbeat/archives/2005/06/blogs_ on_ice_si.html on October 7, 2005. Holme, P., Kim, B.J., Yoon, C.N., Han, S.K., 2002. Attack vulnerability of complex networks. Physical Review E 65, 056109. Jeong, H., Mason, S.P., Barabási, A.-L., Oltvai, Z.N., 2001. Lethality and centrality in protein networks. Nature 411 (6833), 41. Kahle, B., 1997. Preserving the Internet. Scientific America, March 1997. Kamada, T., Kawai, S., 1989. An algorithm for drawing general undirected graphs. Information Processing Letters 31 (1), 7–15. Kleinberg, J., 1998. Authoritative sources in a hyperlinked environment. In: Proceedings of the 9th ACM-SIAM Symposium on Discrete Algo- rithms, San Francisco, California, USA, January 1998, pp. 668–677. Kosala, R., Blockeel, H., 2000. Web mining research: a survey. ACM SIGKDD Explorations 2 (1), 1–15. Krebs, V.E., 2001. Mapping networks of terrorist cells. Connections 24 (3), 43–52. Krupka, G.R., Hausman, K., 1998. IsoQuest Inc.: Description of the NetOwlTM extractor system as used for MUC-7. In: Proceedings of the Seventh Message Understanding Conference, April 1998. Kruskal, J.B., Wish, M., 1978. Multidimensional Scaling. Sage Publica- tions, Beverly Hills, CA. Kumar, R., Raghavan, P., Rajagopalan, S., Tomkins, A., 1999. Trawling the web for emerging cyber-communities. Computer Networks 31 (11–16), 1481–1493. Lee, E., Leets, L., 2002. Persuasive storytelling by hate groups online: Examining its effects on adolescents. American Behavioral Scientist 45, 927–957. Levin, J., McDevitt, J., 1993. Hate crimes: The Rising Tide of Bigotry and Bloodshed. Plenum, New York. McAndrew, D., 1999. The structural analysis of criminal networks. In: The Social Psychology of Crime: Groups, Teams, and Networks, Offender Profiling Series, III. D. Canter and L. Alison. Dartmouth, Aldershot, pp. 53–94. Nardi, B.A., Schiano, D.J., Gumbrecht, M., Swartz, L., 2004. Why we blog. Communications of the ACM 47 (12), 41–46. Newman, M.E.J., 2004a. Coauthorship networks and patterns of scientific collaboration. Proceedings of the National Academy of Science of the United States of America, vol. 101, pp. 5200–5205. Newman, M.E.J., 2004b. Fast algorithm for detecting community structure in networks. Physical Review E 69 (6), 066133. Pinkerton, B., 1994. Finding What People Want: Experiences with the WebCrawler. Proceedings of the 2nd International World Wide Web Conference, Chicago, Illinois, USA, 1994. Qin, J., Zhou, Y., Reid, E., Lai, G., Chen, H., 2006. ‘‘Unraveling International Terrorist Groups’ Exploitation of the Web: Technical Sophistication, Media Richness, and Web Interactivity,’’ Proceedings of the Workshop on Intelligence and Security Informatics (WISI-06), Singapore, April 9, 2006. Radicchi, F., Castellano, C., Cecconi, F., Loreto, V., Parisi, D., 2004. Defining and identifying communities in networks. Proceedings of the National Academy of Science of the United States of America, vol. 101, pp. 2658–2663. Raghu, T.S., Vinze, A., In Press. A business process context for knowledge management. Decision Support Systems, forthcoming. Rheingold, H., 2000. The Virtual Community: Homesteading on the Electronic Frontier. MIT Press, Cambridge, Massachusetts, USA. Roberts, T.L., 1998. Are newsgroups virtual communities? Proceedings of the CHI’98 Conference, April 18–23, 1998, Los Angeles, California, USA. Sparrow, M.K., 1991. The application of network analysis to criminal intelligence: an assessment of the prospects. Social Networks 13, 251–274. Stuart, T.E., 1998. Network positions and propensities to investigation of strategic alliance formation in a high-technology industry. Adminis- trative Science Quarterly 43, 668–698. Wasserman, S., Faust, K., 1994. Social Network Analysis: Methods and Applications. Cambridge University Press, Cambridge. Watts, D.J., Strogatz, S.H., 1998. Collective dynamics of ‘‘small-world’’ networks. Nature 393, 440–442. White, H.C., Boorman, S.A., Breiger, R.L., 1976. Social structure from multiple networks: I. Blockmodels of roles and positions. American Journal of Sociology 81, 730–780. Xu, J.J., Chen, H., 2004. Fighting organized crime: using shortest-path algorithms to identify associations in criminal networks. Decision Support Systems 38 (3), 473–487. Xu, J.J., Chen, H., 2005. Crimenet explorer: a framework for criminal network knowledge discovery. ACM Transactions on Information Systems 23 (2), 201–226. Zamir, O., Etzioni, O., 1999. Grouper: A dynamic clustering interface to web search results. In: Proceedings of the 8th World Wide Web Conference, Toronto, May 1999. Zhou, Y., Reid, E., Qin, J., Chen, H., Lai, G., 2005. US domestic extremist groups on the web: Link and content analysis. IEEE Intelligent Systems 20 (5), 44–51. http://www.businessweek.com/the_thread/techbeat/archives/2005/06/blogs_on_ice_si.html http://www.businessweek.com/the_thread/techbeat/archives/2005/06/blogs_on_ice_si.html http://www.businessweek.com/the_thread/techbeat/archives/2005/06/blogs_on_ice_si.html Mining communities and their relationships in blogs: �A study of online hate groups Introduction Related work Blogging Web mining and social network analysis Hate on the Internet Research questions Proposed approach Blog spider Information extraction Network analysis Visualization Case study Focus and Methods Analysis and results Demographical analysis Topological analysis SNA and visualization Discussion Conclusion and future directions Acknowledgement References