Abstract
Several crimes occur daily, and the initial investigation begins with a police report. In cities with high crime rates, it is impractical to expect the police to read and analyze every crime narrative. Some police reports may involve multiple victims or the same crime may be reported more than once. Additionally, police reports may exhibit similarities due to a shared modus operandi. This study addresses the challenge of providing a police report and searching for the most similar report in the database. A similar police report can be either another report with overlapping words or one that shares a similar modus operandi. One potential solution is to represent each police report as a feature vector and compare these vectors using a similarity function. Different methods can be employed to represent the narrative, including embedding vectors and count-based approaches such as TF-IDF. This research explores the use of pre-trained embedding representations at both the word and sentence levels, such as Universal Sentence Encoder, Word2Vec, RoBERTa, Doc2Vec, among others. We determine the most effective representation for capturing semantic and lexical similarities between police reports by comparing different embedding models. Furthermore, we compare the effectiveness of available pre-trained embedding models with a model trained specifically on a corpus of police reports. Another contribution of this work is the development of trained embedding models specifically tailored for the domain of police reports.
1 Introduction
Victims and witnesses usually report crime events through police reports, which initiate the police investigation. The police report also protects the police action, documenting where the series of investigative efforts by the police department began. In severe circumstances, a police report might even reach the Supreme Court. Consequently, a police report needs to be thorough and factual when describing a crime occurrence, i.e., accurate, concise, clear, objective, timely, and complete.
This paper addresses the challenge of conducting a search for the most similar police report in a database. In similarity search over text data, the goal is to find the most similar police report(s) based not only on string similarity but also taking into account semantic similarity. A similar police report can be a report with overlapping words or one that describes an equivalent modus operandi, even if it does not use the exact same words.
To illustrate both cases, let’s consider the following two sentences: “Augusta’s car was stolen yesterday at night” and “Yesterday night, Augusta had her car stolen.” These sentences share almost identical wording. Now, let’s examine two different sentences that exhibit the same modus operandi: “Two armed men on a motorbike robbed Patricia and took her mobile phone” and “Two men with a knife on a motorbike approached Mary and stole her personal belongings.” These police reports involve the same type of crime (robbery), a female victim, and a similar criminal approach (two individuals on a motorbike). One possible approach to address this issue is to represent each police report as a feature vector and compare these vectors using cosine similarity.
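The feature-vector comparison described above can be sketched with a plain cosine similarity function (a minimal pure-Python illustration; the function name is ours, and production code would rely on a numerical library):

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = dot(u, v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Two reports mapped to nearly parallel vectors score close to 1, while unrelated reports score near 0.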
Pre-trained word embeddings are one of the most commonly used representations for document vocabulary. There are various pre-trained word embeddings available, such as [13, 16, 17, 19]. Word embeddings can capture a word’s relationships with other words, its contextual meaning within a document, as well as its semantic and syntactic similarities. However, word embeddings alone may struggle to capture nuanced shifts in meaning when a sentence undergoes minor changes. For example, consider the sentences “Augusta’s car was stolen” and “Augusta’s car was not stolen.” Despite the difference of only one word, these sentences have completely different meanings. Nonetheless, when using word embeddings, the cosine similarity between the vectors obtained from these sentences might still be relatively high, despite being semantically opposite. Embedding techniques should ideally be able to address this issue and provide more nuanced representations.
One alternative is to encode the entire sentence into embedding vectors. This method is known as sentence embedding, and there are many pre-trained sentence embedding techniques available, including Doc2Vec [15], SBert [3] and Universal Sentence Encoder [21]. These models take the text as input and generate a fixed-dimensional embedding representation of the sentence. Sentence embedding techniques aim to capture the full sentences and the semantic information they contain in the form of vectors. This helps the machine understand the context, intention, and other nuances of the text as a whole. The quality of the training data has a significant impact on the resulting sentence embedding vectors. For optimal results, the sentences in the training set should be semantically related [14].
This study focuses on addressing the challenge of conducting similarity searches on police report documents. The primary objective is to find similar police reports - with overlapping words or reports that share the same modus operandi. Police reports pose unique challenges for similarity search due to specialized language and terminology specific to law enforcement, requiring tailored approaches. Furthermore, the wealth of highly contextual information in these reports, encompassing location, time, individuals involved, and preceding events, adds complexity, necessitating context-aware techniques for accurate assessment. Additionally, the sensitivity of confidential police reports, containing personal details and classified materials, demands privacy protection measures during similarity analysis, adhering to legal and ethical guidelines through anonymization or aggregation techniques. To guide this research, we have formulated the following key research questions:
1. (RQ1) The first research question aims to identify the pre-trained sentence-level representation that effectively captures both the syntactic and semantic features of the vocabulary found in police reports while being able to differentiate between similar reports. In this study, we consider various approaches for input sentence representation, including word-level and sentence-level embeddings, as well as count-based text representation techniques such as TF-IDF. Additionally, we explore lexical-based similarities, such as Jaccard similarity, to analyze the effectiveness of different representation methods. The goal is to determine the most suitable approach that can accurately capture the essential characteristics of police reports and enable the detection of similar reports.
2. (RQ2) Our second research question delves into the effectiveness of embedding models that are trained from scratch with police report documents. We aim to determine if these models are better at representing sentences for similarity searches. Furthermore, we compare the pre-trained models used in RQ1 with our own data-trained embeddings. Our primary goal is to assess the performance of these various versions when it comes to searching for similar police reports.
3. (RQ3) The third research question explores whether combining embeddings can enhance the quality of searching for similar police reports. Building on the findings from [8], which highlights the benefits of combining different embeddings due to their complementarity, we investigate whether combining embeddings can lead to improved performance on the task at hand. Based on the analysis conducted in the previous research questions, we select the two best-performing embeddings. We then compare the effectiveness of combining these two embeddings using different aggregation techniques, such as taking the maximum, minimum, or average vector. By investigating the potential of combining models, we aim to determine whether it is possible to achieve a better sentence representation that captures the semantic and syntactic features crucial for searching similar police reports.
The remaining sections of this article are organized as follows. Section 2 presents the background and related works. Section 3 explains the data and the solutions to cope with our problem. Section 4 explains the evaluation metrics used in the comparisons and shows the analysis performed compared to different models. Finally, Sect. 5 summarizes this work and proposes future developments.
2 Background and Related Works
The background information and studies that are relevant to our problem are covered in this section. We divide this section into four main topics: Traditional text representation, Word Embedding, Sentence Embedding and Similarity search in text data.
Traditional Text Representation. Vectorizing text means turning text into tensors of numbers. The most frequent and fundamental representation is one-hot encoding: every word is given a distinct integer index i, which is then converted into a binary vector of size N (the vocabulary size). The vector is all zeros except for the \(i\)-th entry, which is 1 [4]. Another typical representation of text data is the bag-of-words or, more generally, the bag-of-n-grams [9]. Under this paradigm, a sentence or text is depicted as a collection of words, ignoring grammar and word order but keeping multiplicity.
There are some alternatives for representing each word in a document under such a model: (i) a boolean indicating presence or absence; (ii) the frequency of the word in the text; (iii) TF-IDF (Term Frequency, Inverse Document Frequency), a statistical measure that assesses how pertinent a word is to a document within a collection of documents. Two metrics are multiplied: how many times a word appears in a document (TF) and the inverse document frequency of the word across the set of documents (IDF). However, these techniques have drawbacks, such as the lack of semantics in the representation. In this paper, we experimented with representing each word in a sentence by its TF-IDF value.
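The TF × IDF product described above can be sketched as follows (a minimal illustration using the raw count as TF and log(N/df) as IDF; practical toolkits such as scikit-learn apply smoothing and normalization on top of this):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # Each document is a list of tokens; returns one {term: tf * idf} map per document.
    n = len(docs)
    df = Counter()                       # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)                # raw term frequency
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors
```

Note that a term appearing in every document gets IDF = log(1) = 0, reflecting its lack of discriminative power.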
Word Embeddings. Embedding models transform text of various lengths (a document, paragraph, sentence, or even a word) into a fixed-length numeric vector that may be used to train machine learning algorithms such as classification and clustering models. Pre-trained word embeddings have been widely used [1, 16, 17, 19, 24] due to their capacity to retain a word’s context inside a document, as well as its semantic and syntactic similarity to other words. A framework for learning word vectors is proposed by [17], training a language model that predicts a word given the other words in its context. Word2Vec is one specific implementation of this structure. Its main disadvantage is that [17] under-utilizes corpus statistics, as the model is trained on separate local context windows rather than on global co-occurrence counts.
Addressing this issue, [19] provide a model that creates a vector space of words by training on global word co-occurrence counts, making effective use of statistical data. Another word embedding proposal is FLAIR [1], which abstracts away from the particular engineering difficulties that various word embeddings bring, creating a uniform interface for all word, sentence, and arbitrary embedding combinations. Brazilian Portuguese BERT models are provided by BERTimbau [24]. The models were assessed on three NLP tasks: sentence textual similarity, recognizing textual entailment, and named entity recognition. Compared to multilingual BERT and earlier monolingual techniques, BERTimbau advances the state of the art in these tasks, demonstrating the value of large pre-trained language models for Portuguese.
We can obtain document vectors from word embeddings, for example by averaging all the word vectors. However, this process weighs significant and insignificant words equally. Another drawback of using word embeddings to represent text is that each word is represented by the same vector regardless of context. Sentence embedding is an improvement over word embedding.
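The averaging scheme just described can be sketched as follows (the `embeddings` lookup is a hypothetical word-to-vector dictionary standing in for a trained model):

```python
def average_word_vectors(tokens, embeddings, dim):
    # Document vector = elementwise mean of the word vectors found in the lookup;
    # out-of-vocabulary tokens are skipped.
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
```

As noted above, every token found in the lookup contributes with equal weight, whether it is a key content word or filler.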
Sentence Embedding. Sentence embedding encodes sentences in an n-dimensional vector space in which a word may have several representations depending on context. The most widely used sentence embedding proposals are Doc2Vec [15], InferSent [5], SBert [3], Universal Sentence Encoder [21] and LaBSE [7], among others.
Similar to Word2Vec, Doc2Vec trains sentence embeddings, or paragraph vectors, to predict the next word given a variety of contexts drawn from the paragraph; the sentence and word vectors are combined to forecast the next word in a context. SBert [3] uses the pre-trained BERT network and adds a pooling operation to the output to create a fixed-length sentence embedding. To enhance BERT and produce semantically meaningful sentence embeddings that can be compared using cosine similarity, SBert introduces Siamese and triplet network structures. The authors of [3] experiment with various pooling strategies, such as using the output of the CLS token, averaging all output vectors (the default strategy), or computing a max-over-time of the output vectors (MAX strategy).
The trained LaBSE model is proposed by [7] for sentence embeddings in several languages. LaBSE is trained and optimized exclusively on pairs of bilingual sentences that are translations of one another, so as to generate comparable representations. The source and target sentences are independently encoded using a shared BERT-based encoder and then passed into a combination function; this architecture is known as a dual encoder. The sentence embedding for each input is obtained from the last-layer [CLS] representation. Cosine similarity over the sentence embeddings created by the BERT encoders is used to score how similar the source and target sentences are.
This paper assesses a variety of embeddings, including Word2Vec, Flair, and BERT as word embeddings, as well as Doc2Vec, the Universal Sentence Encoder, and SBert.
Similarity Search in Text Data. Several studies have been conducted to assess the effectiveness of different embeddings for text similarity. One such study, as suggested by the authors of [23], compares the semantic similarity between various techniques in the patent field. This study utilizes three different vector spaces, including a basic TF-IDF baseline, an LSI topic model, and a Doc2Vec neural model.
Another article, [18], compares different learned vector representations, such as Latent Semantic Analysis (LSA), Word2Vec, and GloVe, to identify the most efficient technique for topic segmentation. [22] trains a supervised meta-embedding neural network that combines multiple pre-trained sentence embedding models for sentence similarity recognition; the meta-model outputs the distance between two given input sentences. The approach for selecting the best embedding models in our study could be applied within the methodology proposed by [22] to choose the embedding models for effectively creating their meta-embedding model.
In addition, [12] tackles the issue of recognizing duplicate questions. The article proposes a method that vectorizes questions using FastText subword embeddings, Google News word vectors, and FastText embeddings, both individually and in combination. These question vectors are then used to measure semantic similarity. The proposed model is trained separately on the features derived from each embedding, and the input is fed to a MaLSTM neural network, which uses the Manhattan distance to assess the semantic similarity between questions.
Various studies have utilized embeddings to map input to low-dimensional vectors and perform similarity searches. For instance, [11] investigates embedding complex data objects like DNA sequences and images in a vector space to make their distances as close to their real distances as possible. This allows queries to be run on the embedded objects in similarity search applications. Similarly, [6] conducts similarity searches in Knowledge Graphs (KG) to help users identify the most significant entities to their query.
3 Data and Methods
This section discusses our datasets and methodology for investigating the research questions. We employ a two-step approach: generating similarity matrices for different sentence representations and validating the accuracy and Mean Reciprocal Rank (MRR) of police report representations. The first step involves data preprocessing, sentence representation, model training, and ranking of the top K most similar sentences using cosine and Jaccard similarity. In the second step, we evaluate the accuracy of each police report representation by determining if the most similar report is retrieved within the top K sentences. We also calculate the MRR to assess the retrieval performance. We provide more details about MRR as follows. Figure 1 presents the steps followed by us in this paper.
Data. Two Brazilian Department of Public Security organizations provided the datasets utilized in this study; to maintain anonymity during the review process, we have omitted their names. The datasets comprise two corpora: one with 1,089 police reports and another with 30,011 reports, both electronic and non-electronic. These corpora cover incidents that occurred between January 1, 2020, and March 29, 2020, encompassing various types of crimes such as theft, murder, robbery, attempted murder, and harassment. In the first corpus, 1,065 of the 1,089 reports are similar, each having at least one duplicate within the same set; the remaining 24 reports had no similar counterpart and were discarded from validation. Figure 2 presents some important information about the first dataset (with 1,065 police reports):
Each histogram bin spans 20 words; for instance, the first bin contains 38 reports (those with up to 20 words each). The number of documents decreases in subsequent bins, i.e., there are fewer documents with a high number of words. Figure 2 is thus right-skewed, indicating a greater concentration of documents with fewer words and a smaller number of documents with many words. Some statistical measures from our dataset: the maximum document length is 1,019 words; on average, each document contains 119 words and around 597 characters.
To provide an example of a police report and its duplication:
- Original report: “The reporting party, identified above, informs via the Virtual Police Station that on the specified date and time, they were the victim of a robbery. The incident occurred as follows: Two individuals on a black motorcycle, wearing helmets and armed with a knife, approached and demanded the cell phone. They took a Samsung phone, model A205GT, with the serial number: 357621102968577. When asked about identifying or describing the suspects, the reporting party stated that it was not possible due to their helmets.”
- Duplicated report: “The reporting party, identified as mentioned above, reports via the Virtual Police Station that on the mentioned date and time, they were the victim of a robbery. The incident occurred as follows: Two individuals on a black motorcycle, identified as POP, with a knife, threatened the reporting party and demanded the cell phone. When asked about identifying or describing the suspects, the reporting party stated that they were unable to do so as the suspects were wearing helmets.”
Duplications of police reports can happen for several reasons: (i) the same occurrence may have affected different victims, and each victim records the same event in different words; (ii) due to system or user failure, electronic police reports may be registered more than once; among other possibilities.
During the pre-processing phase, we conducted several steps. These included the removal of special characters, such as HTML tags and non-ASCII characters, as well as the elimination of terms containing digits. Additionally, we eliminated stopwords and converted all characters to lowercase. To represent the police reports using embedding models, we explored three potential solutions: utilizing pre-trained embeddings, training embeddings from scratch using our data, and combining different embeddings to represent the text.
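The pre-processing steps above can be sketched as follows (a minimal illustration; the stopword set shown is an illustrative subset, not the full Portuguese list used in the study):

```python
import re

# Illustrative subset only; the study used a full Portuguese stopword list.
STOPWORDS = {"a", "o", "e", "de", "do", "da", "que"}

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)            # remove HTML tags
    text = text.encode("ascii", "ignore").decode()  # drop non-ASCII characters
    text = re.sub(r"[^\w\s]", " ", text.lower())    # strip punctuation, lowercase
    tokens = [t for t in text.split()
              if not any(c.isdigit() for c in t)]   # drop terms containing digits
    return [t for t in tokens if t not in STOPWORDS]
```

The resulting token list feeds the embedding and TF-IDF representations described next.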
Methods. In this paper, we utilize pre-trained neural embedding models, including Flair, Universal Sentence Encoder (USE), Doc2Vec, Word2Vec, and SBert. First, we employ word embeddings by computing the average of the word representations to describe the narrative of each police report, using models such as Word2Vec. Additionally, we investigate the performance of Word2Vec, Doc2Vec, and RoBERTa when trained from scratch on our dataset of 30,011 reports. Each word is assigned an initial representation, and the models refine these representations through training iterations to improve their quality. To compare embedding models against lexical approaches, we also evaluate a count-based method, TF-IDF, where each word in a sentence is represented by its TF-IDF value.
After representing each police report as a vector using either an embedding model or TF-IDF, we compare the reports to identify the most similar ones. For each representation approach investigated, we generate an N x N matrix to rank the top K most similar police reports for each report. For example, in the case of the dataset containing 1,065 police reports represented with Word2Vec, we calculate the cosine similarity between all 1,065 sentences, resulting in a 1,065\(\,\times \,\)1,065 matrix that represents the similarity percentage. We disregard the diagonal of the matrix since the similarity between identical sentences would always be 100%. This similarity matrix is generated for each representation approach, including Word2Vec, Flair, USE, Doc2Vec, Sbert, RoBERTa, and TF-IDF.
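The ranking over the similarity matrix can be sketched as follows (a minimal illustration; `matrix` is any N × N similarity matrix, and the diagonal is skipped as described):

```python
def top_k_similar(matrix, k):
    # For each report i, rank the other reports by similarity,
    # skipping the diagonal (self-similarity is always 1.0).
    ranked = []
    for i, row in enumerate(matrix):
        candidates = [(sim, j) for j, sim in enumerate(row) if j != i]
        candidates.sort(reverse=True)
        ranked.append([j for _, j in candidates[:k]])
    return ranked
```

Each inner list holds the IDs of the K most similar reports for one query report, in decreasing order of similarity.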
Furthermore, we explore an alternative approach that retains the words in the sentences without representing them as vectors. In this case, we compute the Jaccard similarity matrix, which compares two sentences by the overlap of the words they contain. The Jaccard similarity ranges from 0% to 100%, with a higher percentage indicating greater similarity between the sentences.
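Jaccard similarity over word sets can be computed directly (a minimal sketch; the study reports it as a percentage, while here it is returned in [0, 1]):

```python
def jaccard_similarity(sent_a, sent_b):
    # |intersection| / |union| of the two word sets, in [0, 1].
    a, b = set(sent_a.lower().split()), set(sent_b.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Because it operates on sets, word order and repetition are ignored, which suits reports that reuse the same words in a different order.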
Evaluation Metrics. The ranking step in our evaluation involves analyzing the top K (top 1, top 5, and top 10) most similar sentences and calculating the Mean Reciprocal Rank (MRR). For each value of K, a list of the most similar police report IDs is generated. In the case of top 1, the list contains only the most similar sentence, while in the case of top 5, the list includes the five most similar sentences, and so on. If the list of most similar police report IDs includes the correct duplicate police report, it is considered a hit for the respective approach used for sentence representation. The approach’s accuracy is calculated as the percentage of hits achieved by the approach across the entire dataset. The formula of MRR is shown below:
\(\text {MRR} = \frac{1}{|D|} \sum _{i=1}^{|D|} \frac{1}{\text {rank}_{i}}\)

where D is the dataset of police reports and \(\text {rank}_{i}\) refers to the rank position of the first relevant document (the most similar) for the \(i\)-th police report. The MRR is computed for each method; it is the mean of the reciprocal ranks, i.e., the inverse of the harmonic mean of the ranks.
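The MRR computation can be sketched as follows (`ranks` is a hypothetical list holding the rank position of the first relevant report for each query):

```python
def mean_reciprocal_rank(ranks):
    # ranks[i] is the position of the first relevant (most similar) report
    # for the i-th query; MRR = (1 / |D|) * sum over i of 1 / rank_i.
    return sum(1.0 / r for r in ranks) / len(ranks)
```

An MRR of 1.0 means the correct duplicate was always ranked first.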
Comparative studies of word embedding vectors, such as those conducted by [2, 25], play a vital role in ensuring the quality of word representation before their utilization in machine learning tasks. Evaluation methods for word vectors can be categorized as intrinsic or extrinsic [20, 26].
Intrinsic evaluation focuses on assessing word relationships based on syntax or semantics and is independent of a particular NLP task. Human-curated benchmark datasets are used to compare words in terms of similarity and relatedness. On the other hand, extrinsic evaluation involves incorporating word vectors into an NLP task, such as sentiment analysis or natural language inference. The performance of the embedding models is then evaluated using task-specific metrics like accuracy, F1-score, or other relevant measures. This type of evaluation assesses the usefulness and effectiveness of the embeddings in real-world applications and can assist data scientists.
4 Experimental Results
In this section, we analyze and evaluate the results obtained using pre-trained embedding models as well as embedding models trained on a separate dataset of police reports. We investigate the impact of these models on identifying similar sentences, including the ones with the same modus operandi. Additionally, we examine the behavior of count-based methods, such as TF-IDF, in representing the text. Furthermore, we explore the potential benefits of combining different embeddings to improve the search for similar police reports.
The research questions defined in the Introduction guide our analysis in this section. By addressing these questions, we gain insights into the performance and effectiveness of various embedding models and representation techniques in the context of police reports. This evaluation allows us to understand the strengths and limitations of different approaches and provides valuable information for researchers and practitioners working in the field of natural language processing and law enforcement.
Study on the Results of RQ1. The first research question aims to determine the most effective representation method for capturing syntactic and semantic similarity between police reports. To investigate this question, we utilize a dataset containing 1,065 police reports. This dataset is accompanied by ground truth annotations, meaning that we have information about the most similar police report for each entry.
In our evaluation, we consider various pre-trained models, including Universal Sentence Encoder (USE), Word2Vec, Doc2Vec, BERT, SBert, and Flair. For each approach, we compare the sentences using cosine similarity. Additionally, we explore the option of keeping the words in the sentences without converting them into vector representations. In this case, we utilize Jaccard similarity to identify the most similar sentences.
Table 1 presents the accuracy and MRR results obtained by these approaches. These metrics allow us to assess the performance and effectiveness of each representation method in capturing the similarity between police reports.
The USE (Universal Sentence Encoder) model demonstrated the best performance among the embedding models evaluated. It is based on a Transformer encoder and achieved higher accuracy than the deep averaging network (DAN) variant. The accuracy achieved by USE is similar to the results reported by [3] for the STS Benchmark (Semantic Textual Similarity Benchmark). The USE model encodes sentences into a 512-dimensional vector, enabling a more robust semantic representation. Moreover, the Multilingual USE model allows for the representation of words from different languages.
BERT, another transformer-based model, obtained the second-best results in capturing syntactic and semantic information. The pre-trained BERT model from [24]Footnote 1 slightly outperformed the pre-trained Word2Vec model from [10]Footnote 2. The SBert and Flair models showed similar performance, likely because they are based on the same training data. However, the Doc2Vec model performed poorly compared to the other models and ranked last. This could be attributed to the use of Portuguese Wikipedia data for training, which may have introduced variations in phrase usage across different contexts.
TF-IDF was computed using the TfidfVectorizerFootnote 3 to represent the sentences. This involved creating a dictionary of words from the sentences and calculating the frequency of each term. Portuguese stopwords were removed to ensure consistency in similarity calculations. The TF-IDF representation yielded promising results. One key takeaway from this experiment: it is indeed expected that the embeddings used would be able to capture the context of the sentences and identify duplications as accurately as, if not better than, count-based and lexical methods. However, in the case of police reports, the text may not be exactly the same in terms of words, as demonstrated in the previous section. There may be repetitions of words, but they can occur in a different order within the text, which is a characteristic of this type of dataset.
Embeddings that are pre-trained on non-police data may not capture this particularity of the police reports as effectively as count-based and lexical methods. The pre-trained embeddings may have learned general language patterns, but they may not have specifically captured the nuances and variations present in police reports. This can result in lower accuracy when identifying similar sentences. In such cases, count-based and lexical methods, like TF-IDF and Jaccard similarity, tend to perform better because they directly compare the presence or absence of words in the sentences. They are not reliant on capturing contextual information and can handle cases where word order varies within the text.
Therefore, it is important to consider the specific characteristics of the dataset and the task at hand when choosing an appropriate representation method. While pre-trained embeddings offer many advantages, they may not always be the optimal choice for certain types of text, such as police reports with specific structural patterns and word repetitions.
Study on the Results of RQ2. The second research question investigates whether the embedding models are more effective in representing the sentences when trained with a police report dataset. Using an unsupervised approach, we trained the RoBERTa (an improvement of the BERT model), Doc2Vec, and Word2Vec embedding models. The training data comprises around 30,011 police reports provided by a Brazilian Department of Public Security. To investigate RQ2, we use the dataset with 1,065 police reports (the same used for RQ1). The trained RoBERTa model was the most accurate in identifying similar sentences. We trained the RoBERTa model for 13 epochs, stopping when the loss decreased by less than 0.01 between consecutive epochs; RoBERTa represents each word in 512 dimensions. For Word2Vec, the police reports were represented with vectors of 50 dimensions, trained for 100 epochs. For the Doc2Vec model, the sentences were represented with vectors of 150 dimensions, trained for 100 epochs with a training window of size 10.
The findings of the study suggest that training embedding models with a specific dataset from the same context, in this case, the police report dataset, improves their effectiveness in representing sentences and capturing syntactic, semantic, and morphological relationships. The Word2Vec, RoBERTa, and Doc2Vec models trained with the 30,011 police reports achieved slightly better results compared to their pre-trained counterparts, which were trained on a much larger and diverse corpus.
The advantage of training embedding models with domain-specific data is that they can better capture the specific language patterns, vocabulary, and contextual relationships present in that particular domain. The embeddings trained on the police report dataset were able to maintain proximity between vectors representing syntactic, semantic, and morphological relationships relevant to police reports. This proximity allows for better identification of similar sentences and captures the nuances of language specific to the domain.
The results obtained with the trained models were satisfactory, given that the pre-trained versions of Word2Vec, BERT, and Doc2Vec were trained on a much larger corpus of 1,395,926,282 tokens, whereas our corpus of 30,011 police reports contains only 2,579,815 tokens. Despite its smaller size, the police report dataset still proved effective in capturing the relevant patterns and relationships needed for the task at hand. This highlights the importance of training embedding models on data that is specifically relevant to the target domain, as it can lead to improved performance and accuracy in representing and analyzing text data (Table 2).
Study on the Results of RQ3. The third research question explores the potential of combining embeddings to enhance the search for similar police reports and improve the retrieval of similar reports. In this study, the embeddings from the USE model and Word2Vec-trained model, both trained with the 30,011 police reports, were combined to create a new embedding model called USEW2V.
To ensure compatibility in dimensionality, the Word2Vec model was adjusted to generate embeddings with 512 dimensions, matching the dimensionality of the embeddings generated by the USE model. By combining the embeddings of both models, the USEW2V model was created, and its performance was compared against the individual USE and Word2Vec models. The evaluation focused on the accuracy and effectiveness of the combined embedding model in identifying and retrieving similar police reports.
By combining the strengths and features of both models, it was expected that the USEW2V model could outperform the individual models in terms of accuracy and retrieval performance.
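The element-wise aggregation used to build the combined representation can be sketched as follows, assuming both models already produce 512-dimensional vectors (the random vectors below are stand-ins for actual USE and Word2Vec outputs):

```python
# Element-wise combination of two same-dimensional sentence embeddings,
# as in the USEW2V experiment (max, min, and mean aggregation).
import numpy as np

rng = np.random.default_rng(0)
use_vec = rng.standard_normal(512)  # placeholder for a USE embedding
w2v_vec = rng.standard_normal(512)  # placeholder for a Word2Vec embedding

stacked = np.stack([use_vec, w2v_vec])
combined = {
    "max": stacked.max(axis=0),
    "min": stacked.min(axis=0),
    "mean": stacked.mean(axis=0),
}
```

Each combined vector keeps the 512-dimensional shape, so it can be compared with cosine similarity exactly like the individual embeddings.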
Table 3 presents the results of the USEW2V models (max, min, mean) compared to the individual USE and Word2Vec models for the 1,065 police reports dataset. It can be observed that the combination of the two best embedding models, referred to as USEW2V models, yielded similar results to those obtained using only the USE model.
Since the 1,065 sentences in the dataset share a similar writing syntax, the improvement achieved by combining the two models to represent the sentences was not significant. However, it is worth noting that combining embeddings can sometimes yield results that are on par with or even better than the individual models.
This experiment demonstrates that combining different embedding models can be a viable approach, and there are various methods available for combining embeddings. Exploring these options and evaluating their effectiveness can be a valuable direction for future research in the field.
5 Conclusion and Future Work
This paper addresses the problem of finding the most similar police report in a database given a specific report. The goal is to consider not only lexical similarity but also syntactic and semantic similarity. The experiments conducted in this paper showed that the USE model, a pre-trained transformer-based sentence embedding model in its multilingual version, outperformed the other embedding models. Interestingly, even the raw comparison methods using TF-IDF and Jaccard similarity achieved similar results to the USE model.
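The two lexical baselines mentioned above can be illustrated with a short sketch (the example sentences are invented, not taken from the dataset):

```python
# Illustrative TF-IDF cosine and Jaccard baselines for comparing two
# report narratives.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

a = "victim reported theft of a vehicle near the market"
b = "a vehicle theft was reported near the central market"

# TF-IDF cosine similarity between the two narratives.
tfidf = TfidfVectorizer().fit_transform([a, b])
cos = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

# Jaccard similarity over the token sets.
sa, sb = set(a.split()), set(b.split())
jac = len(sa & sb) / len(sa | sb)
```

Both measures capture only word overlap; embedding models such as USE additionally capture semantic similarity between reports with little lexical overlap.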
Furthermore, the study found that training an embedding model with data from the same context as the target dataset is more effective than using pre-trained embeddings. The paper also explored the combination of different embeddings using various aggregation functions such as max, min, and average. However, it was observed that while combining embeddings can be competitive with using individual models, it might not surpass their performance as expected.
Future directions for research include evaluating the proposed approaches on different datasets, exploring embeddings from different architectures such as graph embeddings, and investigating metrics to compare embeddings and assess their suitability for specific datasets and machine learning problems. These directions can contribute to further advancements in the field of similarity search and text representation. Another potential direction is to use our dataset of 30,011 police reports to fine-tune the models studied in the first research question. Regarding combined embeddings, a further suggestion is to concatenate two embeddings generated by different models and assess whether this yields a better representation.
Notes
2. Available at https://nilc.icmc.usp.br/embeddings.
References
Akbik, A., Bergmann, T., Blythe, D., Rasul, K., Schweter, S., Vollgraf, R.: FLAIR: an easy-to-use framework for state-of-the-art NLP. In: Proceedings of NAACL (Demonstrations), pp. 54–59 (2019)
Boggust, A., Carter, B., Satyanarayan, A.: Embedding comparator: visualizing differences in global structure and local neighborhoods via small multiples. In: 27th International Conference on Intelligent User Interfaces, pp. 746–766 (2022)
Cer, D., et al.: Universal sentence encoder. arXiv preprint arXiv:1803.11175 (2018)
Chollet, F.: Deep Learning with Python. Simon and Schuster (2021)
Conneau, A., Kiela, D., Schwenk, H., Barrault, L., Bordes, A.: Supervised learning of universal sentence representations from natural language inference data. In: Proceedings of the 2017 EMNLP, pp. 670–680 (2017)
Do, P., Pham, P.: W-KG2Vec: a weighted text-enhanced meta-path-based knowledge graph embedding for similarity search. Neural Comput. Appl. 33(23), 16533–16555 (2021)
Feng, F., Yang, Y., Cer, D., Arivazhagan, N., Wang, W.: Language-agnostic BERT sentence embedding. arXiv preprint arXiv:2007.01852 (2020)
Ghannay, S., Favre, B., Esteve, Y., Camelin, N.: Word embedding evaluation and combination. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pp. 300–305 (2016)
Harris, Z.S.: Distributional structure. Word 10(2–3), 146–162 (1954)
Hartmann, N., Fonseca, E., Shulby, C., Treviso, M., Silva, J., Aluísio, S.: Portuguese word embeddings: evaluating on word analogies and natural language tasks. In: Proceedings of the 11th STIL, pp. 122–131 (2017)
Hjaltason, G.R., Samet, H.: Properties of embedding methods for similarity searching in metric spaces. IEEE Trans. Pattern Anal. Mach. Intell. 25(5), 530–549 (2003)
Imtiaz, Z., Umer, M., Ahmad, M., Ullah, S., Choi, G.S., Mehmood, A.: Duplicate questions pair detection using siamese MaLSTM. IEEE Access 8, 21932–21942 (2020)
Kenton, J.D.M.W.C., Toutanova, L.K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT, pp. 4171–4186 (2019)
Kiros, R., et al.: Skip-thought vectors. In: Advances in Neural Information Processing Systems (NIPS), pp. 3294–3302 (2015)
Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: Proceedings of the 31st International Conference on Machine Learning (ICML), pp. 1188–1196. PMLR (2014)
Liu, Y., et al.: RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems (NIPS), pp. 3111–3119 (2013)
Naili, M., Chaibi, A.H., Ghezala, H.H.B.: Comparative study of word embedding methods in topic segmentation. Procedia Comput. Sci. 112, 340–349 (2017)
Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Proceedings of the 2014 EMNLP, pp. 1532–1543 (2014)
Qiu, Y., Li, H., Li, S., Jiang, Y., Hu, R., Yang, L.: Revisiting correlations between intrinsic and extrinsic evaluations of word embeddings. In: Sun, M., Liu, T., Wang, X., Liu, Z., Liu, Y. (eds.) CCL/NLP-NABD -2018. LNCS (LNAI), vol. 11221, pp. 209–221. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01716-3_18
Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv preprint arXiv:1908.10084 (2019)
Rodrigues, A.C., Marcacini, R.M.: Sentence similarity recognition in Portuguese from multiple embedding models. In: 2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 154–159. IEEE (2022)
Shahmirzadi, O., Lugowski, A., Younge, K.: Text similarity in vector space models: a comparative study. In: 2019 18th IEEE ICMLA, pp. 659–666. IEEE (2019)
Souza, F., Nogueira, R., Lotufo, R.: BERTimbau: pretrained BERT models for Brazilian Portuguese. In: 9th BRACIS (2020)
Toshevska, M., Stojanovska, F., Kalajdjieski, J.: Comparative analysis of word embeddings for capturing word similarities. arXiv preprint arXiv:2005.03812 (2020)
Zhai, M., Tan, J., Choi, J.: Intrinsic and extrinsic evaluations of word embeddings. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30 (2016)
Acknowledgments
The research reported in this work received support from the FUNCAP project titled “Big Data Platform to Accelerate the Digital Transformation of Ceará State” (04772551/2020). Part of the results presented in this work were also obtained through the UFC-FASTEF 31/2019.
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Araújo, J.A.F., Coelho da Silva, T.L., da Rocha, A.R., de Lira, V.C.M. (2023). Police Report Similarity Search: A Case Study. In: Naldi, M.C., Bianchi, R.A.C. (eds) Intelligent Systems. BRACIS 2023. Lecture Notes in Computer Science(), vol 14197. Springer, Cham. https://doi.org/10.1007/978-3-031-45392-2_26
DOI: https://doi.org/10.1007/978-3-031-45392-2_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-45391-5
Online ISBN: 978-3-031-45392-2