title: Using virtual edges to extract keywords from texts modeled as complex networks
authors: Tohalino, Jorge A. V.; Silva, Thiago C.; Amancio, Diego R.
date: 2022-05-04

Detecting keywords in texts is important for many text mining applications. Graph-based methods have been commonly used to automatically find the key concepts in texts; however, relevant information provided by embeddings has not been widely used to enrich the graph structure. Here we modeled texts as co-occurrence networks, where nodes are words and edges are established either by contextual or semantic similarity. We compared two embedding approaches, Word2vec and BERT, to check whether edges created via word embeddings can improve the quality of the keyword extraction method. We found that, in fact, the use of virtual edges can improve the discriminability of co-occurrence networks. The best performance was obtained when we considered low percentages of addition of virtual (embedding) edges. A comparative analysis of structural and dynamical network metrics revealed that degree, PageRank, and accessibility are the metrics displaying the best performance in the model enriched with virtual edges.

In recent years, there has been a large increase in the textual information available on the Internet. Examples include newspapers, social network comments, books, encyclopedias and scientific articles. In order to make sense of and summarize such a large volume of data, several NLP applications have been proposed. One particular task is keyword extraction, which consists of selecting a set of words (or topics) that best represent the content of a document [45]. Finding keywords in multiple documents is important because manually finding the most central words can be an expensive and time-consuming task for human annotators. Since keyword extraction provides a compact representation of the document, many applications can benefit from this task: automatic indexing, automatic document summarization, automatic document classification, document clustering, and automatic filtering, among other applications [5, 7, 18].

Different approaches have been considered for the keyword extraction task [7]. The simplest models are the statistical models, which study the statistical information regarding the spatial use of words in each text as well as their frequency of use [20]. These methods include, for example, the well-known Term Frequency (TF) or Term Frequency-Inverse Document Frequency (TF-IDF). Approaches based on linguistic and syntactic analysis have also been used to address this task [44]. Additionally, several features extracted from the previous approaches can be used in machine learning algorithms. The main goal of these methods is to detect keywords via binary classification [22]. Graph-based approaches have also been used to detect keywords [42, 44]. The objective of these methods is to represent each document as a network of words and then apply a set of centrality measurements to assign a relevance value to each network node. In this way, the most central nodes represent the automatic keywords found for each document. Most of these approaches have used word co-occurrence networks, where an edge exists between two words if they are adjacent.
However, different strategies to connect words have not been extensively studied, with most works considering larger window contexts in the co-occurrence model [30, 44]. In this paper, we propose a graph-based method for keyword extraction, where texts are represented as co-occurrence networks and edges are established in a twofold manner. In addition to word adjacency relationships, we consider larger contexts to connect words. In order to better represent the relationship between words, we also link words that do not necessarily co-occur in the text but are semantically similar. Our motivation is to enrich the representation by including the so-called virtual edges. Consequently, hidden similarities are explicitly represented in the model. Our hypothesis is that the included virtual edges can be used to improve the traditional co-occurrence network representation based on word adjacency relationships alone. In the proposed model, the virtual edges were constructed from the word vectors generated by the Word2Vec and BERT embedding models [16, 31]. After the networks are constructed, we computed the centrality values for each node (word) of the network. We used several structural and dynamical network measurements to identify the key concepts in texts. We also probed the effect of using the weighted versions of these measurements. The efficiency of our methods was evaluated on different datasets comprising documents of various sizes.

We have found several interesting results from this analysis. First, we observed that including virtual edges can improve the performance in retrieving keywords. The fraction of included virtual edges required to yield optimized results turned out to be relatively low. A negative performance effect was observed, however, when too many virtual edges were included. Concerning the embedding method, both considered strategies (Word2vec and BERT) yielded similar performance. The network metrics with the best performance were the degree, PageRank, and accessibility. Surprisingly, the weighted versions of the traditional metrics had a poor performance. Our results reinforce the potential of enriching networks in multiple text network applications [12, 39, 40].

This manuscript is organized as follows: Section II presents a summary of the related works for keyword extraction. The description of the datasets, as well as the proposed methodology, is provided in Section III. In that section, we describe the network creation stage and the process of extracting keywords using network centrality metrics. Section IV describes the obtained results and the analysis of each network measurement. Finally, in Section V we present the conclusions and perspectives for future works.

Studies addressing the keyword extraction problem can be grouped into two main approaches: statistical and network-based methods [19, 29]. The objective of statistical methods is to rank words using their statistical distribution along the text [11, 20]. A very simple approach is the one relying on word frequency [27]. According to this approach, words are sorted according to their frequency, and the most and least frequent words are disregarded. Such words are disregarded because they are either common words (such as prepositions) or rare (not relevant) words. Note that frequency-based approaches do not consider the structure of the text, since a shuffled, meaningless version of the same text would provide the same set of keywords.
To overcome the weaknesses of frequency-based methods, word clustering and word entropy methods were then proposed [20, 32]. Some modifications of these techniques include term frequency-inverse document frequency approaches [33]. In [32], the authors found that relevant words, which are more closely related to the main topics of the text, are generally concentrated in certain regions of the text. Keywords are usually unevenly distributed along the text and tend to form clusters. Conversely, common words are more regularly distributed along texts. A combination of both spatial clustering and frequency was proposed in [10]. The authors used Shannon's entropy to define a method based on the information content of the sequence of occurrences of each word in the text. They used text partitions to calculate the entropy of all words. Because relevant words are unevenly distributed, the heterogeneity of word distribution in different partitions can be captured via entropy. In comparison to clustering-based methods, an improved performance was reported with entropy-based techniques. The main advantage of the statistical techniques is that they do not require any knowledge of the language and thus can be used to analyze even unknown documents [15].

Graph-based approaches include the representation of the relationships between words as networks [17, 25, 46]. In [17], the authors modeled documents as graphs of semantic relationships between words. The weight of the edge linking two words modeled their semantic relatedness, as measured via Wikipedia. The strategy considered that words related to central topics tend to be grouped into densely connected network communities, while common words are organized in weakly connected communities. This method was found to be particularly effective in removing noisy information. A similar study represented texts as weighted and directed word co-occurrence networks [25]. The authors used several network centrality measurements to find the relevant words. They concluded that network measurements can be successfully used for keyword extraction without the need for large external corpora. They also found that simpler centrality metrics like node degree or strength outperformed more complex and computationally expensive metrics in the proposed methodology. Related strategies have been used to find key concepts for the purpose of text summarization [42]. Finally, a word co-occurrence network considering larger co-occurrence contexts was proposed in [44]. Differently from other works, an edge is created if words co-occur within a window comprising three words. Using a combination of feature selection and clustering methods to find the best set of keywords, the authors obtained optimized results in comparison to other works that only employed traditional co-occurrence networks. The authors also found that several network metrics are strongly correlated, thus yielding equivalent performance.

Differently from the previous works, here we propose a graph-based method for keyword extraction using enriched word co-occurrence networks. In addition to the edges established via word adjacency, we considered edges created via embedding similarity. Several centrality measurements were then used to find the most relevant words. As we shall show, our approach outperforms both the traditional word adjacency model and its modified version considering larger contexts.
The framework proposed to detect keywords via word embeddings and graph modeling comprises the following four main steps: i) text pre-processing; ii) word vectorization; iii) network creation; and iv) word ranking and keyword extraction.

1. Pre-processing: This phase comprises the steps required to conveniently pre-process the datasets. It encompasses sentence segmentation, stopword removal and text stemming. In Section III A, we provide a brief description of the pre-processing steps we applied.

2. Word vectorization: We considered embedding models to represent the words. The embeddings are important for identifying similar words that do not co-occur in the text. Section III B provides a detailed explanation of the word embedding methods used in this work.

3. Network creation: We modeled the documents as word co-occurrence networks, where nodes represent words and edges are established based on the neighbors of each word. We also considered "virtual" edges, which were generated based on the similarity between two words. The similarity is computed based on the word vectorization. This is an essential step for capturing long-range, semantic relationships. In Section III C, we explain the adopted methodology for network creation.

4. Word ranking and keyword extraction: We used several network centrality measurements to rank the words of each document. Such measurements are used to assign an importance value or relevance weight to each node of the network. The top N ranked words were considered keywords. Section III D describes the keyword extraction step.

The workflow we considered for keyword extraction is shown in Figure 1.

FIG. 1: Architecture of our system for graph-based keyword extraction. The first step consists of pre-processing the input texts. Next, we obtain the vector representation of the words from the pre-processed datasets. Then co-occurrence networks are constructed considering two edge types (co-occurrence and embedding (virtual) edges). Several centrality measurements are calculated to rank words. Finally, the top-ranked words are considered as keywords for each document.

We applied some pre-processing steps before texts are represented as networks. We first performed sentence segmentation. We defined a sentence as any text portion that is separated by a period, exclamation or question mark. This step is needed because the BERT embedding model requires the input documents to be separated into sentences. Next, we removed stopwords and punctuation marks. We finally applied text stemming to the remaining words, so that inflected words (e.g. plurals and conjugated verbs) are reduced to a common root form. This is important to map related words into the same node. We did not consider text lemmatization because the reference keywords from the datasets were in their stemmed form.
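For concreteness, a minimal sketch of this pre-processing stage is shown below. It assumes NLTK's sentence tokenizer, English stopword list, and the Porter stemmer; the paper does not specify which tools were actually used, so these choices are illustrative only.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# nltk.download("punkt"); nltk.download("stopwords")  # first run only

def preprocess(text):
    """Sentence segmentation, stopword/punctuation removal and stemming."""
    stemmer = PorterStemmer()
    stop = set(stopwords.words("english"))
    processed = []
    for sentence in nltk.sent_tokenize(text):      # split on ., ! and ?
        tokens = [t.lower() for t in nltk.word_tokenize(sentence)]
        tokens = [t for t in tokens if t.isalpha() and t not in stop]
        processed.append([stemmer.stem(t) for t in tokens])
    return processed  # list of sentences, each a list of stemmed tokens
```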
Word embedding models are a set of methods to represent words as dense vectors. The idea behind these models is that words having similar meanings should have similar vector representations [31]. Word embeddings have been successfully used for several text applications such as information retrieval, question answering, document summarization and text classification [8, 23, 37]. Here we use embeddings to establish links between similar nodes that do not co-occur in the text. The adopted method for including embedding edges, as well as the construction process of the networks, is described in Section III C. There is a myriad of approaches to representing words as vectors. Methods to create word vectors include approaches based on neural networks, dimension reduction and probabilistic theory [2]. Here we employed the following methods:

• Word2Vec: This method is one of the first models to represent words as vectors [31]. Given a corpus, Word2Vec analyzes the words of each sentence and tries to predict neighbor words. For example, in the sentence "The early bird catches the X", Word2Vec can predict that the next word X is "worm", based on the previous context. This model uses a neural network with a single hidden layer. The neural network is trained with the documents of the corpus; then, for a given word α, the probability that each word of the vocabulary is a neighbor of α is calculated. Once the network is trained, the weights of the hidden layer are used as word vectors. Before the training stage, we defined different dimensions (d) for the word vectors. We generated vectors with 100 ≤ d ≤ 1,000. Despite its simplicity and efficiency for various applications, Word2Vec has a significant weakness: it generates a single vector for each word, regardless of word meaning and context, and this can generate noisy vectors, especially when representing ambiguous words. For example, the word "apple" will have the same associated vector regardless of whether it refers to the apple fruit or the Apple technology company. The BERT model addresses this problem, as it generates different vectors for each word by taking into account the context in which the word appears.

• Bidirectional Encoder Representations from Transformers (BERT): This model creates representations using the context appearing before and after the target word. Then, once previously trained, it can be fine-tuned for several specific tasks [16]. BERT uses a multilayer transformer model (self-attention modules), and these structures allow the model to learn attention weights relating each word to the words appearing before and after it. The model is pre-trained on two unsupervised tasks. In the first task, the model hides a percentage of the input tokens (words) and then learns how to predict them. In the second task, the model selects two sentences and then predicts whether they are consecutive or not. Once the model has been pre-trained, it can be adjusted to a different task via fine-tuning of parameters. We used this model to obtain the word vector representations of the keyword extraction datasets. We used the pre-trained BERT model, which was previously trained over millions of texts. For each sentence, we then obtained the representative vectors of the words composing that sentence. In this sense, each occurrence of the same word is represented by a different vector. The context of each occurrence is used to generate the vectors.

Recently, a large number of word embedding algorithms have been proposed to mathematically represent words and text segments. In this work, we used Word2Vec because it was one of the first word embedding models proposed as an improvement over traditional vector space models. Furthermore, Word2Vec is a simple model whose training stage is fast compared to other techniques. Word2Vec has also been used quite successfully for both small and large datasets. BERT is one of the first models to offer a significant gain in performance compared to several models based on Word2Vec. Such gain lies in the fact that BERT and related models are capable of producing various vector representations of a word according to its context. In this sense, BERT is able to capture the polysemy of a word, which typically results in more accurate feature representations [16].
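The two vectorization strategies can be illustrated with the following sketch. It assumes the gensim implementation of Word2Vec and the Hugging Face transformers implementation of BERT with the bert-base-uncased checkpoint; the libraries, the checkpoint, and the handling of words split into several subword tokens (omitted here) are assumptions made for illustration, not details given in the paper.

```python
import torch
from gensim.models import Word2Vec
from transformers import BertModel, BertTokenizer

# --- Word2Vec: one static vector per word type ----------------------------
# `sentences` is the pre-processed corpus: a list of token lists (toy example).
sentences = [["early", "bird", "catch", "worm"], ["apple", "fall", "tree"]]
w2v = Word2Vec(sentences, vector_size=300, window=5, min_count=1)
apple_static = w2v.wv["apple"]        # same vector for every occurrence

# --- BERT: one contextual vector per word occurrence ----------------------
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("the apple fell from the tree", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
# One vector per token of this particular sentence (occurrence-specific).
apple_contextual = outputs.last_hidden_state[0][tokens.index("apple")]
```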
After the pre-processing steps are applied, a graph representation is created. The motivation for representing texts as complex networks is the simple, yet competitive and interpretable, results obtained in related text analysis tasks [1, 6, 13, 14]. The adopted graph represents each word as a node. For the creation of the edges between two words, we defined two edge types: edges based on the neighborhood relationship of two words (co-occurrence edges), and edges based on the semantic similarity of the words (embedding edges). Differently from previous approaches, here we establish long-range edges that cannot be obtained from adjacency relationships alone. This approach is a way to link words that are semantically related but do not share the same stem.

For co-occurrence edges, the following procedure was applied: we first defined a window of size w. Edges linking two nodes are established for all words coexisting within the window. To build all edges, the window slides along the document. Figure 2 shows an example of edge formation. Here we considered w ∈ {1, 2, 3}. Larger values of w were not considered in order to avoid a large complexity in the computation of network measurements. We also did not observe, in preliminary experiments, a significant performance gain when considering larger contexts.

FIG. 2: Example of how to find neighbors of a word for creating co-occurrence edges. In the sentence extracted from Wikipedia, we defined the neighbors of "province" according to a predefined window. If w = 1, the immediate left and right side words ("the" and "and") are considered. The window length w = 3 includes the three words at the left and right side of the reference word "province": "located", "in", "the", "and", "the", and "eponymous".

After the construction of the networks considering the co-occurrence relationships, the next step is the addition of edges established via embedding similarity. This type of edge will also be referred to as virtual edges. Let E_t be the number of traditional co-occurrence edges. The number of virtual edges included is E_v = P E_t. Here we considered a small percentage P of additional edges, with 0 ≤ P ≤ 1. The included virtual edges are the E_v most similar ones, according to the cosine similarity index. This strategy of enriching complex networks has been useful to provide more information in related applications [34, 36]. This is particularly useful in short texts [36]. We did not include more edges to avoid the complexity of analyzing denser networks. In addition, we did not find a significant improvement in performance when the network is highly connected.
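The construction of the enriched network can be summarized by the sketch below, which uses networkx and assumes a single static vector per word (e.g. from Word2Vec) and a document given as a flat token sequence. The decision to add virtual edges only between pairs that are not already linked by co-occurrence is an assumption made for illustration.

```python
import itertools
import networkx as nx
import numpy as np

def build_network(tokens, vectors, w=3, p=0.2):
    """Co-occurrence network of one document enriched with virtual edges.

    tokens  : pre-processed (stemmed) word sequence of the document
    vectors : dict mapping each word to its embedding
    w       : co-occurrence window length
    p       : fraction P of virtual edges relative to co-occurrence edges
    """
    G = nx.Graph()
    G.add_nodes_from(set(tokens))

    # Co-occurrence edges: link words appearing within the same window.
    for i, word in enumerate(tokens):
        for neighbor in tokens[i + 1:i + 1 + w]:
            if word != neighbor:
                G.add_edge(word, neighbor, kind="co-occurrence")

    # Virtual edges: rank unconnected word pairs by cosine similarity
    # and add the E_v = P * E_t most similar ones.
    n_virtual = int(p * G.number_of_edges())
    candidates = []
    for a, b in itertools.combinations(G.nodes(), 2):
        if G.has_edge(a, b) or a not in vectors or b not in vectors:
            continue
        va, vb = vectors[a], vectors[b]
        sim = float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))
        candidates.append((sim, a, b))
    for sim, a, b in sorted(candidates, reverse=True)[:n_virtual]:
        G.add_edge(a, b, weight=sim, kind="virtual")

    return G
```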
The final step consists of using centrality measurements to rank the words according to the topological significance of the corresponding nodes in the network. The best-ranked words (nodes) are then chosen to be part of the resulting keyword list of the document. Centrality measurements are used to identify the most relevant nodes in a network. They are structural (or dynamical) attributes that indicate how central a node is according to a specific criterion. The identification of central nodes has been successfully used in various text applications. For example, [42] used several traditional network measurements to identify the most important sentences in a sentence network. [44] modeled documents as word co-occurrence networks and used several centrality measurements to rank the words for the keyword extraction task. Network measurements have also been used as features for classification and authorship attribution tasks. For example, [34] represented literary books as word co-occurrence networks and the centrality measurements of the most frequent words were considered as feature vectors. The selected vectors were then used in a machine learning algorithm for authorship identification.

Here, we evaluated traditional network measurements and their weighted versions. We also considered the accessibility metric owing to its relative success in text analysis [42]. Apart from the degree, whose weighted version is referred to as strength, we refer to the weighted version of a metric X as X^(w).

1. Degree (k) and strength (s): The degree of a node is the number of edges connected to that node. In the case of weighted networks, the strength represents the sum of the weights of all the edges connected to the reference node.

2. PageRank (π): This measurement considers a node i as relevant if it is connected to other relevant nodes. The PageRank can be computed in a recursive way:

π_i = γ Σ_j a_ij (π_j / k_j) + β,

where a_ij is an element of the adjacency matrix (a_ij = 1 if nodes i and j are connected, and a_ij = 0 otherwise), k_j is the degree of node j, and γ and β are used as damping factors, with 0 ≤ γ ≤ 1 and 0 ≤ β ≤ 1 [26]. We also used a variation of this measurement that is based on the eigenvector centrality (EV).

3. Betweenness (B): This metric is computed as the portion of shortest paths between two nodes that pass through a reference node. The betweenness centrality quantifies the relevance of a node to disseminate information [9]. It can also be used to identify words that are relevant even when they are not frequent [3].

4. Closeness (C): This measurement tries to detect the nodes that can efficiently spread information through a network. It is defined as C_i = Σ_j^N 1/d_ij, where d_ij is the distance between i and j, and N is the number of nodes in the network. Nodes having a high closeness value will have the shortest distances to all other nodes [35]. Distance-based measurements have also been used to analyze texts [3].

5. Accessibility (A^(h)): The accessibility metric quantifies the number of nodes that are effectively accessible from an initial node using self-avoiding random walks of length h [43]. Nodes having a high accessibility also have effective access to more neighbors. This metric considers both the number of nodes at a given distance and the transition probabilities between the source and neighbor nodes. The accessibility can be evaluated considering different hierarchy levels. The levels can be set by specifying the length h of the random walks [43]. To compute this metric for a reference node i, we first define p^(h)(i, j) as the likelihood of reaching a node j from an initial node i in a self-avoiding random walk of length h. Then, the accessibility of i is defined as the exponential of the true diversity of p^(h)(i, j):

A_i^(h) = exp(-Σ_j p^(h)(i, j) ln p^(h)(i, j)).

This measurement has been used in several contexts to analyze texts, including stylometric and semantic tasks [38].

We used each centrality measurement to assign a different importance value to each word. The centrality values were then used to rank the words. Therefore, the adopted methodology generated various word rankings according to the chosen network metrics. In Section IV, we report the performance obtained for each network metric. After the word ranking step is performed, we selected the N best-ranked words, where N is the number of reference keywords.
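The word ranking step can be sketched as follows, using networkx for the traditional measurements. The accessibility function is a naive enumeration of self-avoiding walks written for illustration (feasible only for small h); it is not the implementation used by the authors.

```python
import math
import networkx as nx

def accessibility(G, node, h=2):
    """Exponential of the entropy of p^(h)(node, j) over self-avoiding walks."""
    probs = {}

    def walk(current, visited, prob, steps):
        if steps == h:
            probs[current] = probs.get(current, 0.0) + prob
            return
        choices = [n for n in G.neighbors(current) if n not in visited]
        for n in choices:
            walk(n, visited | {n}, prob / len(choices), steps + 1)

    walk(node, {node}, 1.0, 0)
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    return math.exp(entropy)

def rank_keywords(G, metric="pagerank", top_n=10):
    """Rank words by a centrality measurement and return the top-N as keywords."""
    if metric == "degree":
        scores = dict(G.degree())
    elif metric == "strength":
        scores = dict(G.degree(weight="weight"))
    elif metric == "pagerank":
        scores = nx.pagerank(G)
    elif metric == "betweenness":
        scores = nx.betweenness_centrality(G)
    elif metric == "closeness":
        scores = nx.closeness_centrality(G)
    elif metric == "accessibility":
        scores = {v: accessibility(G, v, h=2) for v in G.nodes()}
    else:
        raise ValueError(f"unknown metric: {metric}")
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```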
We used publicly available datasets including the source texts and their gold-standard keywords defined by experts. The following datasets were chosen for their variability in size and sources. The Hulth-2003 dataset contains titles, keywords, and abstracts from scientific papers published between 1998 and 2002 [21]. The documents were extracted from the Inspec database of Physics and Engineering papers [21]. This dataset contains 500 abstracts as well as the set of keywords assigned by human annotators. The average size of the documents from this dataset is about 123 words. The Marujo-2012 dataset comprises 450 web news stories on subjects such as business, culture, sport, and technology [28]. The mean document size is 452 words. Finally, we also used the SemEval-2010 dataset [24]. This dataset comprises scientific papers that were extracted from the ACM Digital Library. We considered the full content of 100 papers and their corresponding keywords assigned by both authors and readers [24]. The average document length is 8,168 words. In Table I we provide a summary indicating the main attributes of each dataset.

In this section, we analyze our hypothesis that the inclusion of virtual edges can improve the performance of co-occurrence networks in detecting keywords. In Section IV A, an analysis of the effect of parameter variation on the performance is provided. In Sections IV B and IV C, we detail the results obtained with Word2Vec and BERT, respectively. Finally, we show in Section IV D a summary of the obtained results.

We first investigate whether the proposed extension of traditional word adjacency networks can lead to optimized results. In particular, we analyze whether the performance is improved when we vary the model parameters. We are particularly interested in the performance analysis when varying both the window length (w) and the number of virtual edges (P). We focus our analysis on the results obtained for the Word2vec model, since similar results have been found with BERT (see next sections). Figure 3 shows the obtained results when these parameters are varied.

All in all, the results show that the parameter behavior seems to depend on the considered dataset. In short texts (Hulth-2003), the importance of including virtual edges is clearly observed. This happens because, when short texts are modeled as co-occurrence networks, the generated network is almost a line graph. As a consequence, the topological information is not able to detect keywords, since all concepts will have virtually the same topological properties. In this case, the use of virtual edges is essential to identify the hidden information in short texts. Therefore, the results suggest that the proposed methodology can be useful to analyze short texts. Despite the above differences, the optimized results are almost always obtained when using a large window length (w = 3). The weighted metrics did not provide a significant gain in performance over their unweighted versions.
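To illustrate how the parameters w and P can be swept, the sketch below combines the hypothetical helpers introduced earlier (build_network and rank_keywords). The document structure (tokens, vectors, gold) and the accuracy definition (fraction of gold-standard keywords recovered when extracting as many words as there are reference keywords) are assumptions for illustration, not details specified in the paper.

```python
def accuracy(extracted, gold):
    """Fraction of gold-standard keywords recovered (assumed evaluation metric)."""
    extracted, gold = set(extracted), set(gold)
    return len(extracted & gold) / len(gold) if gold else 0.0

def parameter_sweep(corpus, metric="pagerank"):
    """Average accuracy over a dataset for each (w, P) combination."""
    results = {}
    for w in (1, 2, 3):                                   # window length
        for p in [i / 20 for i in range(21)]:             # fraction of virtual edges
            scores = []
            for doc in corpus:   # each doc has .tokens, .vectors and .gold (assumed)
                G = build_network(doc.tokens, doc.vectors, w=w, p=p)
                keywords = rank_keywords(G, metric=metric, top_n=len(doc.gold))
                scores.append(accuracy(keywords, doc.gold))
            results[(w, p)] = sum(scores) / len(scores)
    best = max(results, key=results.get)                  # best (w, P) pair
    return best, results
```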
B. Performance analysis using the Word2Vec model

Table II summarizes the best results obtained with the Word2Vec model considering word vectors of 100, 300, and 500 dimensions. We did not include the results obtained with larger dimensions because the observed performance decreases compared to smaller dimensions. We also show the performance of each vector size when the window parameter (w) is varied. For each measurement, we also show the percentage of embedding edge insertion that yielded the highest accuracy rates (P) and the highest accuracy observed with the proposed model (Acc.). We defined two additional quantities, Γ_1 and Γ_2, which compare the accuracy of the proposed model against two reference models: Acc^(tr) corresponds to the accuracy obtained with the traditional co-occurrence model [41] (i.e., our model with P = 0 and w = 1), and Acc^(co) corresponds to the accuracy obtained with the model considering only co-occurrence links [44] (i.e., our model with P = 0). Thus, Γ_1 and Γ_2 quantify the gain in performance when important features of the model are disregarded.

According to the results shown in Table II, for the Hulth-2003 dataset, the vectors having d = 300 dimensions yielded the highest accuracy rates in most cases. Considering d = 300, the most important results were reached when the percentage of embedding edge insertion was low (less than 10%). However, for dimensions greater than 300, values of P between 0% and 26% yielded high accuracy rates. There are also some exceptions where percentages of virtual edges larger than 50% yielded the best performance. Regarding the parameter w, in most cases the best results are reached when w = 3 is considered.

In the case of the Marujo-2012 dataset, Table II shows that, generally, d = 100 dimensions is the optimal size for the word vectors. We also observed that the typical optimal percentage of addition of embedding edges did not exceed 7%. However, there are some exceptions in which high values of P lead to a higher accuracy. For these cases (the B^(w) and C^(w) metrics), the addition of edges does not outperform the results obtained with the respective unweighted versions of these metrics. Once again, the largest context size typically achieved the best performance for the Marujo-2012 dataset. The weighted version of PageRank (π^(w)) obtained the highest accuracy rate (with d = 100, w = 3, and 0% of insertion of embedding edges).

Table II also revealed that d = 100 and d = 300 dimensions of the word vectors achieved the best accuracy rates for the SemEval-2010 dataset. The edge addition percentages that achieved the best results were higher compared to the previous datasets. Such percentages included values ranging between 0% and 45%. However, for the weighted closeness metric (C^(w)), the optimal value of P was even higher, reaching 64%. Concerning the context parameter for co-occurrence links, both w = 1 and w = 3 achieved the highest accuracy rates. For the SemEval dataset, the closeness measurement (C) reached the best performance (with d = 300, w = 3, and P = 35%).

In conclusion, we found that d ≤ 300 yields competitive performance for the considered datasets. When larger values of d were considered, the accuracy rates did not significantly improve. We also observed that the performance can be improved in several scenarios when larger window lengths and/or the inclusion of virtual edges are considered.

In this section, we discuss the results obtained considering the word vectors produced by the BERT model [16]. Because in this model each occurrence of the same word is represented by a different vector, we had to adapt our methodology concerning the insertion of virtual edges. We adopted two approaches to compute the similarity between two words. In the first approach (BERT Sim 1), each word is represented by the average of the vectors observed for its occurrences.
In the second approach (BERT Sim 2), the similarity sim(a, b) between nodes a and b is computed as

sim(a, b) = (1 / (f_a f_b)) Σ_k Σ_l cos(v_k^(a), v_l^(b)),

where v_k^(a) is the k-th vector representation of word a, cos is the cosine similarity, f_a is the frequency (i.e., the number of occurrences) of a, and the sums run over all occurrences of a and b, respectively.

The results obtained for both approaches are depicted in Table III. The results are shown in terms of w and P. Concerning the Hulth-2003 dataset, the BERT Sim 1 approach achieved the highest accuracy rates in most cases. However, in various situations, the best results are obtained without using virtual edges or when P is lower than 6%. Only for the weighted closeness metric was the percentage of edge addition quite high (84%). As for the window length, w = 1 and w = 3 generally yielded the highest performance. The accessibility metric considering one hierarchy level (A^(1)) achieved the best performance for the Hulth-2003 dataset (BERT Sim 1 approach and w = 3).

The results revealed that the BERT Sim 1 approach outperformed the BERT Sim 2 approach for the Marujo-2012 dataset. We observed that the optimal percentage of edge insertion is typically lower than 15%, but in particular cases it reached high values between 67% and 86% (for the weighted versions of the betweenness and closeness metrics). Regarding the window length, the best results were obtained for w ≥ 2. The node degree (k) centrality obtained the highest accuracy rate considering the following parameters: BERT Sim 1 approach, w = 2, and a fraction of virtual edges of P = 7%.

Unlike the other datasets, the BERT Sim 2 approach performed slightly better than the BERT Sim 1 method for the SemEval-2010 dataset. The optimal value of the edge addition percentage in most cases was less than 20%. Higher values of P, however, were used for the weighted versions of both betweenness and closeness. Once again, the best results were obtained with window length w = 3. For the SemEval-2010 dataset, the PageRank metric (π) performed better than the other centrality measurements considering the BERT Sim 2 approach and w = 3.

TABLE IV: Summary of the best results based on accuracy rate (Acc.) for both the Word2Vec and BERT models. The column % represents the optimal fraction of included virtual edges, d is the embedding dimension, and w is the context size adopted to construct the co-occurrence networks.

Concerning the network creation parameters, we found that the best results with Word2vec were obtained with vectors typically comprising fewer than 500 dimensions. In the BERT approach, the two strategies proposed to handle multiple word vectors for the same concept (namely BERT Sim 1 and BERT Sim 2) had a similar performance. However, the BERT Sim 2 approach requires a higher computational cost, especially when analyzing large documents. The experiments also showed that in most cases the optimal percentage of addition of virtual edges is not very high. The performance of each system considerably decreases when high percentages of addition of virtual edges are considered. In conclusion, we showed, as a proof of principle, that the combination of larger window lengths in the co-occurrence model and virtual edges can improve the quality of the keyword detection [44].

Identifying keywords is an important task in many text mining applications. In this paper, we addressed this problem by generating different representations for a text using co-occurrence networks.
We considered two variations of the word adjacency model: the number of words that can be connected within the same context, and the fraction of virtual edges used to connect similar words. For each generated network, we evaluated several centrality measurements, including a generalization of the node degree centrality based on network dynamics [43]. Our results revealed that the optimal window length in the co-occurrence network is w = 3, while the fraction of embedding (virtual) edges yielding the best results is typically not high. We also observed that the node degree, PageRank, and accessibility metrics reached the highest accuracy rates for the three datasets. The unweighted versions of the traditional measurements turned out to provide better performance than their weighted counterparts in almost all cases. Our results showed, as a proof of principle, that using virtual edges can improve the informativeness of co-occurrence networks for the keyword detection task. Given that the informativeness of the characterization can be improved in the adopted representation, we believe that the inclusion of virtual edges could be useful in other network classification scenarios, such as in name disambiguation [4].

The proposed methodology could be improved by including other model components. For example, edge weight modeling could be improved by combining co-occurrence frequency and semantic similarity, e.g. via linear combinations. Another source of improvement to the model could arise if synonyms are handled before the creation of the networks. In this way, words with similar meanings would be represented by a single node, so as to avoid redundancy in the co-occurrence networks. This could be done by taking advantage of the vectors generated by BERT, for example. Finally, our approach is limited to finding unigram keywords (keywords composed of a single word). A more general approach could consider keywords comprising two or more words.
[1] On the role of words in the network structure of texts: Application to authorship attribution
[2] Word embeddings: A survey
[3] Comparing intermittency and network measurements of words and their dependence on authorship
[4] Topological-collaborative approach for disambiguating authors' names in collaborative networks
[5] Keyword extraction for text categorization
[6] Representation of texts as complex networks: a mesoscopic approach
[7] Automatic keyword extraction for text summarization: A survey
[8] A review on word embedding techniques for text classification
[9] A faster algorithm for betweenness centrality
[10] Level statistics of words: Finding keywords in literary texts and symbolic sequences
[11] Improving statistical keyword detection in short texts: Entropic and clustering approaches
[12] The multiplex structure of the mental lexicon influences picture naming in people with aphasia
[13] Approaching human language with complex networks
[14] Disentangling the climate divide with emotional patterns: a network-based mindset reconstruction approach
[15] Paragraph-based representation of texts: A complex networks approach
[16] Pre-training of deep bidirectional transformers for language understanding
[17] Extracting key terms from noisy and multitheme documents
[18] Corephrase: Keyphrase extraction for document clustering
[19] Automatic keyphrase extraction: A survey of the state of the art
[20] Statistical keyword detection in literary corpora
[21] Improved automatic keyword extraction given more linguistic knowledge
[22] A ranking approach to keyphrase extraction
[23] Combining word embeddings and n-grams for unsupervised document summarization
[24] Semeval-2010 task 5: Automatic keyphrase extraction from scientific articles
[25] Keyword and keyphrase extraction using centrality measures on collocation networks
[26] Google's PageRank and beyond: The science of search engine rankings
[27] The automatic creation of literature abstracts
[28] Supervised topical key phrase extraction of news stories using crowdsourcing, light filtering and co-reference normalization
[29] Automatic keyphrase extraction: a survey and trends
[30] TextRank: Bringing order into text. Conference on Empirical Methods in Natural Language Processing
[31] Efficient estimation of word representations in vector space
[32] Keyword detection in natural languages and dna
[33] Text mining: use of tf-idf to examine the relevance of words to documents
[34] Using virtual edges to improve the discriminability of co-occurrence text networks
[35] The centrality index of a graph
[36] Enriching complex networks with word embeddings for detecting mild cognitive impairment from speech transcripts
[37] Word embedding based correlation model for question/answer matching
[38] Text characterization based on recurrence networks
[39] Cognitive network science reconstructs how experts, news outlets and social media perceived the covid-19 pandemic
[40] Multiplex lexical networks reveal patterns in early word acquisition in children
[41] Exploiting cooccurrence networks for classification of implicit inter-relationships in legal texts
[42] Extractive multi-document summarization using multilayer networks
[43] Accessibility in complex networks
[44] A multi-centrality index for graph-based keyword extraction. Information Processing and Management
[45] Graph-based keyword extraction for twitter data
[46] Keyword extraction of document based on weighted complex network

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES).