Mining an English-Chinese Parallel Dataset of Financial News

RESEARCH PAPER

NICOLAS TURENNE, ZIWEI CHEN, GUITAO FAN, JIANLONG LI, YIWEN LI, SIYUAN WANG, JIAQI ZHOU

CORRESPONDING AUTHOR: Nicolas Turenne, BNU-HKBU United International College, UIC, Division of Science and Technology, Zhuhai, Guangdong, China; nicolas.turenne@univ-eiffel.fr

KEYWORDS: English-Chinese; text mining; clustering; classification; patterns

TO CITE THIS ARTICLE: Turenne, N., Chen, Z., Fan, G., Li, J., Li, Y., Wang, S., & Zhou, J. (2022). Mining an English-Chinese Parallel Dataset of Financial News. Journal of Open Humanities Data, 8: 9, pp. 1–12. DOI: https://doi.org/10.5334/johd.62

ABSTRACT
Parallel text datasets are valuable for educational purposes, machine translation, and cross-language information retrieval, but few of them are domain-oriented. We have created a Chinese–English parallel dataset in the domain of financial technology, using the Financial Times website, from which we grabbed 60,473 news items published between 2007 and 2021. The result is a bilingual Chinese–English parallel dataset of news in the domain of finance. It is open access in its original state, without transformation, and was built not for machine translation, the usual purpose of such corpora, but for intelligent mining: we conducted many experiments using up-to-date text mining techniques, namely clustering (topic modeling, community detection, k-means), topic prediction (naïve Bayes, SVM, LSTM, BERT), and pattern discovery (dictionary-based, time series). We present the usage of these techniques as a framework for other studies, not only as an application but also with an interpretation.

*Author affiliations can be found in the back matter of this article.

1 INTRODUCTION
The investigation of classical and new text mining methods on a bilingual dataset can make comparisons between these techniques more meaningful. The natural way to use a parallel text dataset is to benefit from its construction: the texts are supposed to be strictly equivalent, so we can expect exploratory text mining results to be similar as well. We decided to explore a parallel dataset from a technical domain (here, finance) in order to extract knowledge from that area. The choice of the Chinese–English pair has several motivations. Firstly, the data is more easily available. Secondly, there is a demand for English and Chinese tools and datasets: English is already the lingua franca in many areas (political, economic, cultural, and scientific), and there is also increasing interest in Chinese, which is now taught at schools in Western countries. One should keep in mind that 1.41 billion people speak Chinese as their first or second language, against 1.35 billion for English (with an overlap of no more than 20%). Thirdly, China and the USA, the home countries of most of these native speakers, are drivers of the world economy. The language of business and finance has always attracted interest, since the movement of stock indexes can be an indicator, a 'barometer,' of the general trend in the economy.
When we look at the availability of domain-specific parallel corpora, the majority are constructed around the following drivers: biomedicine (Neves, Yepes, & Névéol, 2016), digital humanities/culture (Christodoulopoulos & Steedman, 2014), city and transport (Lefever, Macken, & Hoste, 2009), food and the environment (Xiong, 2013), ICT (Labaka, Alegria, & Sarasola, 2016), and digital humanities/law and governance (Steinberger et al., 2006). Concerning Chinese–English, Chang (2004) from Peking University made one of the first large-scale Chinese–English parallel corpora from HTML files, with alignments at the paragraph and sentence levels, reaching a size of 10 million Chinese characters and covering different genres (news, technical articles, subtitles). Concerning the domain of finance, there are some small corpora for different pairs of languages, but not for Chinese–English (Arcan, Thomas, de Brandt, & Buitelaar, 2013; Bick & Barreiro, 2015; Smirnova & Rackevičienė, 2020; Tiedemann, 2012; Volk, Amrhein, Aepli, Müller, & Ströbel, 2016). The largest one is the SEDAR dataset,1 containing 8.6 million French–English sentence pairs in the finance domain, extracted from PDF files of the regulations of the province of Quebec (Ghaddar & Langlais, 2020). To our knowledge, the dataset discussed in our article represents new available material for the community.

The question we address is what state-of-the-art techniques and the main contemporary approaches to text mining can finally extract from a dataset of news in a specialized domain such as fintech. Knowing that each news item exists in equivalent Chinese and English versions, another question to explore is the following: are the efficiency and the extracted content exactly the same, or do some cultural aspects influence the translation and therefore the lexical and semantic content? In this way, the dataset we present in this article can be seen as a gold standard against which calibrated measures for all kinds of techniques can be compared. In general, studies use a text collection within the framework of a specific method, such as disinformation analysis (Turenne, 2018) or the development of medical drugs (Kolchinsky, Lourenco, Wu, & Rocha, 2015), or for a specific task, such as part-of-speech (POS) tagging (Akbik, Blythe, & Vollgraf, 2018) or named entity extraction (Chiu & Nichols, 2016). In this article, we also take a domain dataset (namely, fintech) and a specific genre of document (news), but we do not target a specific task to improve. We try simple tasks that are intuitively and directly applicable to such a dataset: clustering (named entities and words), classification (topic and sentiment), and pattern extraction (word life and citation).

We made the dataset using the Financial Times website, from which we grabbed 60,473 news items published between 2007 and 2021, each containing an English version and its Chinese translation. We focus on three families of techniques within the text-mining framework: (i) pre-processing techniques; (ii) supervised approaches involving deep learning techniques such as LSTM, BERT, and CNN, as well as SVM, naïve Bayes, and random forest; and (iii) unsupervised techniques involving k-means, community detection, biclustering, co-cord analysis, and topic modeling (Turenne et al., 2020). This paper is divided into the following sections: we describe the dataset and its sub-datasets, review the state-of-the-art research based on bilingual corpora, machine learning, and natural language processing, and then present the results of our experiments.
1 https://github.com/autorite/sedar-bitext (last accessed: 01.03.2022).

2 RELATED WORK

2.1 PARALLEL LANGUAGE DATASET BUILDING
Zhao and Vogel (2002) is probably one of the pioneering studies combining a parallel Chinese–English dataset with a mining approach. They used 10 years of the Xinhua bilingual news collection, but it is not available. Koehn (2005) is a large-scale multilingual parallel document dataset containing ∼60 million words on average per language for 21 European languages, but nothing in Chinese. In the same vein, we find a topic detection and tracking repository.2 It contains 30K texts in Chinese and English, but they are not parallel. Christodoulopoulos and Steedman (2014) and Sturgeon (2021) are open data repositories and digital humanities projects. They contain books with English–Chinese versions, but their content is closely related to philosophy, religion, and thinking that is difficult for contemporary readers to understand; as a consequence, manual annotation for classification, for example, is not easy. The UCI Machine Learning Repository (Dua & Graff, 2017) and Kaggle3 are repositories of datasets, many of which are used for the evaluation of algorithms, but they contain no English–Chinese parallel corpora. Zhai, Liu, Zhong, Illouz, and Vilnat (2020) made a parallel English–Chinese dataset of 2,200 sentences covering 11 genres (constructed based on existing work: art, literature, law, educational material, microblogs, news, official documents, spoken language, subtitles, science, and scientific articles) to test literal translations. Tian et al. (2014) present UM-Corpus,4 designed for statistical machine translation (SMT) research. It contains 15 million English–Chinese parallel sentences and covers eight genres: News, Spoken, Laws, Theses, Educational Materials, Science, Speech/Subtitles, and Microblog. Globally, the dataset contains 2.2 million sentences in both languages (450,000 for news alone). This dataset is freely available, but named entities are anonymized.

2 http://projects.ldc.upenn.edu/TDT3-TDT4 (last accessed: 01.03.2022).
3 https://www.kaggle.com/datasets (last accessed: 01.03.2022).
4 http://nlp2ct.cis.umac.mo/um-corpus/index.html (last accessed: 01.03.2022).

2.2 BUILDING DOMAIN-SPECIFIC PARALLEL DATASETS
In this section we present an extensive literature review of domain-specific datasets, their language pairs, and their topics. We observed an increased interest in domain-specific parallel datasets in recent years. The main use of such material is, from a computational point of view, to build a specialized training dataset to improve a statistical machine translation system and to perform cross-lingual information retrieval (McEnery & Xiao, 2007), and, from a linguistic point of view, to extract automatically or semi-automatically a specialized lexicon in different languages (Rosemeyer & Enrique-Arias, 2016). In the following review, we consider as domain-specific a dataset focused on all aspects of one topic. A text genre, such as news or technical publications, is considered as a domain.

2.2.1 Digital Humanities: culture
In this domain we have found 20 datasets, of which the large pair datasets are as follows. In the area of religious studies, Christodoulopoulos and Steedman (2014) is about the Bible in 100 languages. We also find Chinese–English (Sturgeon, 2021) and Arabic–English (Hamoud & Atwell, 2017) collections, a presentation of the same ancient religious texts in different Germanic dialects (Dipper & Schultz-Balluff, 2013), and a parallel dataset of English and Persian religious texts (Beikian & Borzoufard, 2016).
In literary studies, Fraisse, Tran, Jenn, Paroubek, and Fishkin (2018) created a massively parallel dataset of translated American literary texts covering 23 languages, and Altammami, Atwell, and Alsalka (2020) present a bilingual parallel English–Arabic dataset of narratives reporting different aspects of Muhammad's life. In the domain of tourism and traveling, Espla-Gomis et al. (2014) built a domain-specific English–Croatian parallel dataset from different websites, Ponay and Cheng (2015) made an English–Tagalog dataset, Bureros, Tabaranza, and Roxas (2015) created an English–Cebuano dataset, Woldeyohannis, Besacier, and Meshesha (2018) made an Amharic–English dataset, Srivastava and Sanyal (2015) made a small parallel English–Hindi dataset, and Boldrini and Ferrández (2009) collected 4,500 customer questions and answers about tourism in Spanish, translated into English. About literary texts, Rovenchak (2021) published a Bamana–French analysis of Bamana tales; Kenny (1999) describes GEPCOLT, an electronic collection of some fourteen works of contemporary German-language fiction alongside their translations into English; Giouli, Glaros, Simov, and Osenova (2009) made a Greek–Bulgarian dataset of cultural, literary and folk texts; Kashefi (2020) made a Persian–English dataset of masterpieces of literature; Frankenberg-Garcia (2009) built a parallel dataset of English and Portuguese literary texts; Miletic, Stosic, and Marjanović (2017) made ParCoLab, a dataset of English, French and Serbian literary books; and Guzman (2013) describes a dataset of literary texts with versions in Spanish, French, German, and Catalan.

2.2.2 Finance
D.-Y. Lee (2011) used an interesting approach, for Korean and English, to improve financial phrase translation, but the corpora are comparable without being really parallel. There are some parallel corpora about finance with a limited size, such as Smirnova and Rackevičienė (2020), who made a dataset of European documents in English translated into French and Lithuanian related to finance, but the size is relatively small, consisting of 154 documents from 2010 to 2014. Bick and Barreiro (2015) made a Portuguese–English parallel dataset of about 40,000 sentences in the legal-financial domain, coming from a company translation memory. We will next mention four notable parallel corpora about finance, for which we give the details below: the ECB dataset,5 the DBpedia-Linguee dataset, the CSB dataset,6 and the SEDAR dataset.1 All of them have been made for automatic translation and cross-lingual information retrieval purposes. In the OPUS project (Tiedemann, 2012), we can find the ECB dataset, covering 19 European languages and concerning financial and legal newsletters from the European Central Bank; as an example, it contains 113,000 English–German pairs of sentences. Arcan et al. (2013) used DBpedia datasets to extract the titles of relevant Wikipedia articles, and the Linguee database, obtaining 193,000 aligned sentences (English–German, English–French, and English–Spanish) to find translations of financial terms.
The Credit Suisse Bulletin dataset (CSB) is based on the world's oldest banking magazine, published by Credit Suisse since 1895 in both German and French (Volk et al., 2016). The SEDAR dataset (the System for Electronic Document Analysis and Retrieval) contains 8.6 million French–English sentence pairs in the finance domain, extracted from PDF files of regulations of the province of Quebec (Ghaddar & Langlais, 2020). However, all these datasets concern pairs of European languages. Guo (2016) describes how it can be feasible to make a domain-specific Chinese–English parallel dataset in the financial service domain, but that work is restricted to giving guidelines about which tools to use to obtain raw data and how to exploit a parallel dataset, together with the description and availability of the dataset. We have seen in this review that, firstly, domain-specific datasets address different topics of societal challenges and, secondly, that although the finance domain is not lacking in datasets, English–Chinese is not covered yet.

5 https://www.ecb.europa.eu/press/key/html/downloads.en.html (last accessed: 01.03.2022).
6 http://csb.access.ch (last accessed: 01.03.2022).

2.3 PARALLEL LANGUAGE DATASET EXPLORATION
Parallel corpora have been investigated to make alignments between sentences. Wu and Xia (1994) is a pioneering work using parallel sentences in the framework of automatic translation. They used literal translations of sentences from the parliamentary proceedings of the Hong Kong Legislative Council, with five million words, to predict the Chinese translation of each English entry. In Yang and Li (2003), an alignment method is presented at different levels (title, word, and character) based on dynamic programming (DP). Lu, Tsou, Jiang, Kwong, and Zhu (2010) used a non-open dataset of 157,000 files, each with both Chinese and English versions. More recently, Schwenk, Chaudhary, Sun, Gong, and Guzmán (2021) have run an alignment process over 85 languages and 135 million sentences from Wikipedia (available as open data), but they found only 790 sentences for English–Chinese, which is very few for a text mining workflow. Li, Wang, Huang, and Zhao (2011) used a linear combination and minimum sample risk (MSR) algorithm to match named entities (Person, Organization) and obtained an F-score of 84%. A pioneering work in text mining on English–Chinese texts is probably C.-H. Lee and Yang (2000), who used a neural network clustering method called Self-Organizing Maps to extract clusters from an English–Chinese parallel dataset (made from Sinorama magazine articles, with 50,000 sentences),7 but their conclusion only reveals the potential of the approach. Lan and Huang (2017) construct a bilingual English–Chinese latent semantic space and also select k-means initial cluster centers, but the interpretation of the clustering is not very clear.

7 https://www.taiwan-panorama.com/en/Home/About (last accessed: 01.03.2022).

3 THE DATASET

3.1 DATA COLLECTION
We extracted news from the Financial Times and FT Chinese, both freely available on the Financial Times websites.8,9,10 The news was collected for the period from 2007 to 2021. After collating the links, the pages were downloaded with 'wget' and stripped of HTML. The encoding of the files was normalized to UTF-8 (R package 'httr'). Cloud computing under the SLURM framework was used to parallelize the NLP preprocessing. In all, we obtained an uncleaned raw text dataset of 90,003 documents.

8 https://www.ft.com/ (last accessed: 01.03.2022).
9 https://www.ftchinese.com/ (last accessed: 01.03.2022).
10 This is an example of a parallel archived news link: http://www.ftchinese.com/story/001015037/ce?archive (last accessed: 01.03.2022).
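For readers who want to reproduce this kind of collection step, the snippet below is a minimal Python sketch, not the crawler actually used (which relied on wget and an R-based encoding step): it downloads one archived page, strips the HTML, and stores the visible text as UTF-8. The example URL is the parallel archive link given in footnote 10; the output file name is an arbitrary choice.

```python
# Minimal sketch of the collection step: download one news page, strip the
# HTML, and store the visible text as UTF-8. This only approximates, in
# Python, the wget + R normalization pipeline described above.
import requests
from bs4 import BeautifulSoup

def fetch_plain_text(url: str) -> str:
    """Download one news page and return its visible text."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    for tag in soup(["script", "style"]):  # drop non-content elements
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)

if __name__ == "__main__":
    # Example parallel archive link from footnote 10.
    url = "http://www.ftchinese.com/story/001015037/ce?archive"
    text = fetch_plain_text(url)
    with open("001015037.txt", "w", encoding="utf-8") as handle:
        handle.write(text)
```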
3.2 DATA PREPROCESSING
We carried out sentence segmentation, word splitting, and named entity extraction. For linguistic preprocessing, we used regular expressions for field extraction and for sentence and paragraph splitting, the Jieba and spaCy algorithms for tokenization and tagging, and the Stanford NER framework for named entity extraction. The HTML structure was helpful for automatically extracting from each news item its timestamp, title (in both languages), text body (in both languages), and topic tags. In some cases a translation was not available, and we kept the item as it was. We tried to carry out a paragraph alignment between the two equivalent documents in Chinese and English; splitting into paragraphs is quite easy using line-break markers, but in some cases the number of paragraphs does not match, and we did not complete this alignment because human validation would have been too expensive. We then cleaned the documents using two rules: (1) each document had to have both an English and a Chinese version; (2) only files with a text body containing more than two characters were kept. We obtained a cleaned raw text dataset of 60,473 documents. The dataset is available at https://doi.org/10.5281/zenodo.5591908.

3.3 DATA STATISTICS
The dataset contains various metadata, such as the title and text body in both English and Chinese, the time of publication, and some topic tags. Table 1 shows elementary linguistic features extracted from the collection.

Table 1 Linguistic features of the text collection ('Lang.' is language, 'NP' is noun phrases, 'MultiWD' is multiwords, 'Parag.' is paragraphs, 'Sent.' is sentences, 'NE' is named entities, 'Hanzi' is Chinese characters).

Lang.     Token       NP          MultiWD     Parag.    Sent.     NE          Hanzi
English   2,598,309   1,672,577   2,376,424   272,756   597,372   1,190,682   0
Chinese   7,480,139   1,491,790   3,466,453   258,213   572,185   1,268,674   21,679,815
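As an illustration of how per-document counts of this kind can be computed, the sketch below tokenizes one English–Chinese document pair with spaCy and Jieba, the tools named above; spaCy's built-in entity recognizer stands in here for the Stanford NER step, and the small English model name is an assumption. The figures in Table 1 aggregate such per-document counts over the whole collection.

```python
# Sketch of the per-document feature extraction behind Table 1:
# spaCy for the English side, Jieba for the Chinese side.
import jieba
import spacy

# Any spaCy English pipeline with a parser and NER will do; the small
# model is assumed here for convenience.
nlp_en = spacy.load("en_core_web_sm")

def english_features(text: str) -> dict:
    doc = nlp_en(text)
    return {
        "tokens": sum(1 for t in doc if not t.is_space),
        "sentences": sum(1 for _ in doc.sents),
        "named_entities": len(doc.ents),  # stand-in for the Stanford NER step
    }

def chinese_features(text: str) -> dict:
    tokens = [t for t in jieba.cut(text) if t.strip()]
    return {
        "tokens": len(tokens),
        "hanzi": sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff"),
    }

print(english_features("The Federal Reserve raised interest rates. Markets fell."))
print(chinese_features("美联储加息，市场下跌。"))
```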
3.4 CATEGORIES OF FINANCE DOMAIN
We made different samples for topic prediction using classification methods. The first list is the set of 10 topic-metadata tags contained in the documents, used by the Financial Times to annotate the area of each news item; a news item can carry several tags: book, business, culture, economy, lifestyle, management, markets, people, politics, or society. There were 57,584 documents containing topic metadata. The second list is the set of 10 tags from the Financial Times websites about economic sectors, which we used for manual annotation: technology, consumer services, health care, consumer goods, basic materials, industrials, financials, oil & gas, utilities, and telecommunications. There are 2,993 documents that were tagged manually. The top influential media in finance are: 1. The Wall Street Journal; 2. Bloomberg; 3. The New York Times; 4. The Financial Times; 5. CNBC; 6. Reuters; 7. The Economist.

Five items of the Financial Times website can be clearly identified as related to the "economy" (equities, currencies, commodities, bonds, funds & ETFs), and the item world market can be associated with "markets," company with "business," and director dealings with "management." The economy, management, markets, and business are among the tags contained in each document as metadata. However, we also find other tags, such as lifestyle, politics, and people; in fact, many influential people have an impact on the evolution of markets. Other items, such as sectors and industrials, can be further split into:

id01 – Technology (Software & Computer Services, Technology Hardware & Equipment)
id02 – Consumer Services (General Retailers, Travel & Leisure, Food & Drug Retailers, Media)
id03 – Health Care (Health Care Equipment & Services, Pharmaceuticals & Biotechnology)
id04 – Consumer Goods (Automobiles & Parts, Leisure Goods, Personal Goods, Food Producers, Household Goods, Tobacco, Beverages)
id05 – Basic Materials (Industrial Metals, Mining, Chemicals)
id06 – Industrials (Support Services, Electronic & Electrical Equipment, Industrial Transportation, Aerospace & Defense, Construction)
id07 – Financials (Real Estate Investment & Services, Financial Services, General Financial, Life Insurance, Banks, Nonlife Insurance)
id08 – Oil & Gas (Alternative Energy, Oil & Gas Producers, Oil Equipment, Services & Distribution)
id09 – Utilities (Gas, Water & Multi-utilities, Electricity)
id10 – Telecommunications (Fixed Line Telecommunications, Mobile Telecommunications)

Sectors, in finance, act both as a guide for making promising investments in the right places and as a representation of areas of activity. Topics id01 to id10 are used for manual annotation, so their coverage is smaller than that of the topics inserted into each document as metadata. From the manual annotation, the most frequent topics are financials, consumer goods, consumer services, and technology. From the metadata, the most frequent topics are business, the economy, markets, management, politics, lifestyle, and society.

3.5 MANUAL ANNOTATION
To carry out the manual annotation, we made a set of document batches, each containing 100 distinct documents. A population of 31 students (year-3 level in computer science, with B1 to C1 level of English) received one batch each. Multiple annotation was possible, and the format of the annotation was quite elementary: a document id followed by a class id, one annotation per line, e.g.:

1014550; id07
1014871; id11

An extra annotator assessed the annotations by randomly choosing 10 files from each batch. If the annotation done by the extra annotator showed more than four differences from that produced by the annotator (i.e., >40% disagreement), the batch had to be revised by the annotator. Nineteen batches were revised. Finally, after the second round, we compiled all the batches together.
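A minimal sketch of this second-round check is shown below: it parses the simple 'document id; class id' format and flags a batch for revision when the extra annotator disagrees on more than 40% of the sampled files. The function names and the exact comparison of label sets are illustrative assumptions, not the scripts actually used.

```python
# Sketch of the batch quality check: compare the extra annotator's labels on
# the sampled documents with the original annotator's labels and flag the
# batch when more than 40% of the sampled documents were labelled differently.
from typing import Dict, Set

def parse_annotations(lines) -> Dict[str, Set[str]]:
    """Parse 'doc_id; class_id' lines; several labels per document are allowed."""
    labels: Dict[str, Set[str]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc_id, class_id = (part.strip() for part in line.split(";"))
        labels.setdefault(doc_id, set()).add(class_id)
    return labels

def needs_revision(annotator: Dict[str, Set[str]], checker: Dict[str, Set[str]]) -> bool:
    """True if more than 40% of the checked documents carry different labels."""
    disagreements = sum(1 for doc in checker if annotator.get(doc, set()) != checker[doc])
    return disagreements / max(len(checker), 1) > 0.4

annotator = parse_annotations(["1014550; id07", "1014871; id11"])
checker = parse_annotations(["1014550; id07", "1014871; id04"])
print(needs_revision(annotator, checker))  # True: 1 of 2 sampled documents differs
```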
3.6 DATA USAGE
As mentioned in the literature review, there are several ways to use a parallel dataset, and the same is true for our Chinese–English parallel dataset in the domain of finance. Here are five main possible usages:

• The influence of the language on knowledge discovery. We present the results of different clusterings for topic discovery and of classification for topic detection. Here, the algorithm is not supposed to take into account the specificities of the language (i.e., it is meant to be language-independent). The dataset can be used to study how a language-dependent algorithm could be more efficient.
• Keyword in context. Concordances of a word in the domain of finance can be extracted. In such a usage case, the different contexts make it possible to study the meaning of a phrase and its variation.
• Automatic translation. A classical usage case is to exploit such a dataset to make automatic translations of documents in the domain of finance, using the dataset as a training set for a statistical machine translation (SMT) system.
• Neologism translation. Translation is always a challenge, especially for new words. One usage case of the dataset is the study of neologisms, for example to find the Chinese equivalent of a new named entity in English (a company name or a person's name).
• Time series of a domain-specific word. The last case is the study of the distribution of words or phrases over time, to follow their popularity.

4 DISCOVERY OF SOME FREQUENT INTERESTING TERMS
In this section, we search for some interesting words or phrases in the dataset and count their frequency of occurrence, which contributes to our further understanding of the dataset. The section is divided into three parts, exploring the frequency in the dataset of English proverbs and Chinese idioms, of important finance-related terms, and of globally famous companies. We also made some experiments about lexical variation over time and proverb analysis (see Appendix A for more details).

4.1 DISCOVERY OF FREQUENT TERMS OF THE FINANCE DOMAIN
The first step is deciding how to choose some commonly used financial terms. Our decision was to use Fundera, an online marketplace that connects small business owners with providers of capital for their businesses; it offers product marketplaces that cover everything from loans to legal services, free financial content, and one-on-one access to experienced lending experts. Based on "60 business and finance terms you should definitely know" by Meredith Wood, founding editor and vice president of the Fundera Ledger,11 we selected the 20 financial terms from this list that appear most frequently in the dataset. The results are shown in Table 2.

11 https://www.fundera.com/blog/business-finance-terms-and-definitions (last accessed: 01.03.2022).

Table 2 The 20 most frequently used financial terms.

Term                    Frequency
Capital                 9383
Asset                   3086
Liquidity               1704
Interest Rate           1036
Bankruptcy               616
Balance Sheet            522
Principle                382
Collateral               371
Depreciation             368
Cash Flow                209
Net Worth                195
Liability                141
Business Plan            126
Fixed Asset              101
Debt Financing            97
Working Capital           83
Financial Statements      72
Equity Financing          64
Line of Credit            46
Appraisal                 42

Next, we imitated the method used above to detect the most frequent idioms and proverbs, extracting the statements in the dataset and calculating their frequency of occurrence (see the appendix file).

4.2 DISCOVERY OF FREQUENT COMPANY NAMES
We used the same method to collect statistics on the frequency of occurrence of company names in the dataset. Among them, we find the Chinese company Huawei, which shows that, with the increase of China's international influence, Chinese technology enterprises are increasingly favored by global business people.
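The counts in Table 2 come from a simple dictionary-based search. The sketch below shows one way to reproduce such counts over the English text bodies, assuming the documents have already been loaded as plain-text strings; case-insensitive whole-word matching is an assumption here, since the exact matching rule is not spelled out above. Grouping the same counts by each document's timestamp would give the per-month time series mentioned under "Time series of a domain-specific word."

```python
# Sketch of dictionary-based term counting (Section 4.1): count how often each
# financial term from a fixed list occurs in a collection of English texts.
# Whole-word, case-insensitive matching is assumed.
import re
from collections import Counter

FINANCIAL_TERMS = ["capital", "asset", "liquidity", "interest rate",
                   "bankruptcy", "balance sheet", "cash flow", "net worth"]

def count_terms(documents, terms=FINANCIAL_TERMS) -> Counter:
    patterns = {term: re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
                for term in terms}
    counts = Counter()
    for text in documents:
        for term, pattern in patterns.items():
            counts[term] += len(pattern.findall(text))
    return counts

docs = ["Capital flows tightened as the interest rate rose.",
        "The balance sheet shows a healthy cash flow."]
print(count_terms(docs).most_common())
```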
5 TEXT-MINING APPROACHES AND THE DOMAIN OF FINANCE
The first point of interest of such a dataset, for people working in finance or in natural language processing, is that we provide a full analysis based on state-of-the-art text-mining technology. These experiments were of three kinds (see Appendix B and Appendix C for technical details):

(1) lexical extraction (words, noun phrases, names of people, names of companies)
(2) classification (supervised learning)
(3) clustering (unsupervised learning)

As we showed in the section on the discovery of lexical items, this dataset is useful for identifying the important concepts and actors of the domain. These concepts are not new for an expert working in finance every day, but the dataset can be used as an educational tool for students at school or college to understand what finance is through real-life events and practical information. A list of frequent noun phrases (such as 'asset' and 'interest rate'), a list of famous and influential people (such as Elon Musk and Xi Jinping), and a list of names of famous organizations (such as the IMF and the Fed) were extracted, and the one hundred most frequent items in these three categories can easily serve as a basic framework of concepts for educational purposes.

We also studied and compared properties of the English and Chinese languages through the use of proverbs, one of the high-level linguistic patterns of any language. We discovered that in the domain of finance, which is highly related to technology and also to society, proverbs are used quite freely in Chinese but not at all in English. We do not have an explanation for this, except that it may reflect an important cultural difference in how people use language to disseminate information (even in a technological area).

We have shown that, using classification techniques, potential readers could process new documents (unseen in the dataset) that may be of interest to them, according to the ontology of 20 topics described in Section 3.4. Clustering, by definition, relies mainly on organizing knowledge about a set of unstructured data. We carried out several experiments, and clustering revealed some classical topics of finance, such as business or markets, but also surprising topics for the finance domain, such as lifestyle, art and life, politics, and British education, which seem to play a big role. This shows that finance is not just an activity in society, like sports for example, but also seems to act as an ideological model. Secondly, the clusters show that even if finance is globalized, a polarity around the specific relationship between China and the US emerges as more important than all the others.
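As a concrete, non-authoritative illustration of the topic-prediction setting mentioned above, the sketch below trains a TF-IDF plus linear SVM baseline on a handful of toy English snippets labelled with Financial Times-style tags. The experiments reported in Appendix B (naïve Bayes, SVM, LSTM, BERT) use the real dataset and their own configurations, so this is only an outline of the task.

```python
# Baseline sketch for topic prediction: TF-IDF features with a linear SVM.
# The toy texts and labels below stand in for the dataset's English text
# bodies and their topic tags (Section 3.4).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "Shares rallied after the central bank cut interest rates",
    "The museum opened a new exhibition of modern art",
    "The company reported record quarterly profits",
    "A review of this season's theatre and opera programme",
]
labels = ["markets", "culture", "business", "culture"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(texts, labels)

# Predict the tag of an unseen snippet.
print(model.predict(["Bond yields fell as investors weighed the bank's rate decision"]))
```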
6 CONCLUSION
Chinese and English are an interesting combination of languages for testing algorithms and mining, and finance is a hot area of activity in our contemporary world. We made a text dataset using the Financial Times website, from which we grabbed 60,473 news items published between 2007 and 2021. This dataset is a bilingual Chinese–English parallel dataset of news in the domain of finance, and it is open access. We analyzed it within a text mining framework. As a future perspective, our dataset can be used to infer the translation of new terms from English to Chinese (e.g., company names), to extract the distribution of occurrences of new concepts for time series analysis (e.g., neologisms), or to apply a more innovative clustering approach to discover new concepts (e.g., ontology learning).

ADDITIONAL FILES
The additional files for this article can be found as follows:
• Appendix A. Discovery of some frequent interesting terms. DOI: https://doi.org/10.5334/johd.62.s1
• Appendix B. Classification. DOI: https://doi.org/10.5334/johd.62.s2
• Appendix C. Clustering. DOI: https://doi.org/10.5334/johd.62.s3

COMPETING INTERESTS
The authors have no competing interests to declare.

AUTHOR CONTRIBUTIONS
Nicolas Turenne: Conceptualisation and writing of the original draft
Ziwei Chen: Methodology, classification section
Jianlong Li: Methodology, classification section
Guitao Fan: Methodology, lexical analysis and pattern section
Jiaqi Zhou: Methodology, lexical analysis and pattern section
Siyuan Wang: Methodology, clustering section
Yiwen Li: Methodology, clustering section

AUTHOR AFFILIATIONS
Nicolas Turenne (orcid.org/0000-0003-1229-5590), Ziwei Chen, Guitao Fan, Jianlong Li, Yiwen Li, Siyuan Wang, Jiaqi Zhou: BNU-HKBU United International College, UIC, Division of Science and Technology, Zhuhai, Guangdong, China

REFERENCES
Akbik, A., Blythe, D., & Vollgraf, R. (2018). Contextual string embeddings for sequence labeling. In Proceedings of the 27th international conference on computational linguistics (COLING) (pp. 1638–1649). New Mexico: Paparazzi Press. Retrieved from https://aclanthology.org/C18-1139
Altammami, S., Atwell, E., & Alsalka, A. (2020). The Arabic-English parallel corpus of authentic hadith. International Journal on Islamic Applications in Computer Science And Technology, 8(2). Retrieved from http://www.sign-ific-ance.co.uk/index.php/IJASAT/article/view/2199
Arcan, M., Thomas, S. M., de Brandt, D., & Buitelaar, P. (2013). Translating the FINREP taxonomy using a domain-specific corpus. In Proceedings of Machine Translation Summit XIV. Nice, France. Retrieved from https://aclanthology.org/2013.mtsummit-posters.1.pdf
Beikian, A., & Borzoufard, M. (2016). Mizan: A large Persian-English parallel corpus. Retrieved from https://cdn.ketabchi.com/products/175402/pdfs/ketab-general-book-sample-wybml.pdf
Bick, E., & Barreiro, A. (2015).
Automatic anonymisation of a new Portuguese-English parallel corpus in the legal-financial domain. Oslo Studies in Language, 7(1), 101–124. Retrieved from https://journals.uio.no/index.php/osla/article/view/1460/1357. DOI: https://doi.org/10.5617/osla.1460
Boldrini, E., & Ferrández, S. (2009, March 1–7). A parallel corpus labeled using open and restricted domain ontologies. In Proceedings of the 10th international conference CICLing. Mexico City, Mexico. DOI: https://doi.org/10.1007/978-3-642-00382-0_28
Bureros, L. L., Tabaranza, Z. L. B., & Roxas, R. R. (2015). Building an English-Cebuano tourism parallel corpus and a named-entity list from the Web. In Proceedings of the workshop on computation: Theory and practice (pp. 158–169). DOI: https://doi.org/10.1142/9789813202818_0012
Chang, B. (2004). Chinese-English parallel corpus construction and its application. In Proceedings of PACLIC (pp. 201–204). Tokyo: Waseda University, Dec. 8–10. Retrieved from https://aclanthology.org/Y04-1030.pdf
Chiu, J. P. C., & Nichols, E. (2016). Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4, 357–370. DOI: https://doi.org/10.1162/tacl_a_00104
Christodoulopoulos, C., & Steedman, M. (2014). The Bible in 100 languages. Retrieved from https://github.com/christos-c/bible-corpus
Dipper, S., & Schultz-Balluff, S. (2013). The Anselm Corpus: Methods and perspectives of a parallel aligned corpus. In Proceedings of the workshop on computational historical linguistics at NODALIDA, NEALT (pp. 27–42). Retrieved from https://ep.liu.se/ecp/087/ecp13087.pdf#page=35
Dua, D., & Graff, C. (2017). UCI machine learning repository. Retrieved from http://archive.ics.uci.edu/ml
Espla-Gomis, M., Klubička, F., Ljubešić, N., Ortiz-Rojas, S., Papavassiliou, V., & Prokopidis, P. (2014). Comparing two acquisition systems for automatically building an English-Croatian parallel corpus from multilingual websites. In Proceedings of the ninth international conference on language resources and evaluation (pp. 1252–1258). European Language Resources Association (ELRA). Retrieved from http://www.lrec-conf.org/proceedings/lrec2014/pdf/529Paper.pdf
Fraisse, A., Tran, Q.-T., Jenn, R., Paroubek, P., & Fishkin, S. (2018, May). TransLiTex: A parallel corpus of translated literary texts. In Proceedings of the eleventh international conference on language resources and evaluation (pp. 201–204). Miyazaki, Japan: European Language Resources Association (ELRA). Retrieved from https://hal.archives-ouvertes.fr/hal-01827884
Frankenberg-Garcia, A. (2009). Compiling and using a parallel corpus for research in translation. Babel: International Journal of Translation, 21(1), 57–71. Retrieved from https://openresearch.surrey.ac.uk/esploro/outputs/journalArticle/Compiling-and-using-a-parallel-corpus-for-research-in-translation/99516816302346#file-0
Ghaddar, A., & Langlais, P. (2020). SEDAR: A large-scale French-English financial domain parallel corpus. In Proceedings of the language resources and evaluation conference (pp. 3595–3602). Marseille, France: European Language Resources Association. Retrieved from https://aclanthology.org/2020.lrec-1.442
Giouli, V., Glaros, N., Simov, K., & Osenova, P. (2009). A web-enabled and speech-enhanced parallel corpus of Greek-Bulgarian cultural texts. In Proceedings of the EACL workshop on language technology and resources for cultural heritage, social sciences, humanities, and education (pp. 35–42).
Athens, Greece: Association for Computational Linguistics. Retrieved from https://aclanthology.org/W09-0305.pdf. DOI: https://doi.org/10.3115/1642049.1642054
Guo, X. (2016, November 17–18). Drawing a route map of making a small domain-specific parallel corpus for translators and beyond. In Proceedings of translating and the computer (pp. 88–99). London, UK. Retrieved from https://aclanthology.org/2016.tc-1.9.pdf
Guzman, J. R. (2013). El corpus COVALT i l'eina d'alineament de frases Alfra-COVALT. In L. Bracho Lapiedra (Ed.), El corpus COVALT: un observatori de fraseologia traduïda (pp. 49–60). Aachen: Shaker.
Hamoud, B., & Atwell, E. (2017). Evaluation corpus for restricted-domain question-answering systems for the holy Quran. International Journal of Science and Research, 6(8), 1133–1138. Retrieved from https://eprints.whiterose.ac.uk/125920/
Kashefi, O. (2020). MIZAN: A large Persian-English parallel corpus. Retrieved from https://arxiv.org/pdf/1801.02107v3.pdf
Kenny, D. (1999). The German-English parallel corpus of literary texts (GEPCOLT): A resource for translation scholars. Teanga, 1, 25–42.
Koehn, P. (2005). Europarl. Retrieved from http://www.statmt.org/europarl/
Kolchinsky, A., Lourenco, A., Wu, H.-Y., & Rocha, L. M. (2015). Extraction of pharmacokinetic evidence of drug-drug interactions from the literature. PLOS ONE. DOI: https://doi.org/10.1371/journal.pone.0122199
Labaka, G., Alegria, I., & Sarasola, K. (2016). Domain adaptation in MT using Wikipedia as a parallel corpus: Resources and evaluation. In Proceedings of the tenth international conference on language resources and evaluation (pp. 2209–2213). Portoroz, Slovenia: European Language Resources Association (ELRA).
Lan, H., & Huang, J. (2017, February). Chinese-English cross-language text clustering algorithm based on latent semantic analysis.
In Proceedings of information science and cloud computing (pp. 1–7). Retrieved from https://pos.sissa.it/300/007/pdf
Lee, C.-H., & Yang, H.-C. (2000). Towards multilingual information discovery through a SOM based text mining approach. In PRICAI workshop on text and web mining (pp. 80–87). Melbourne, Australia. Retrieved from https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.8800&rep=rep1&type=pdf
Lee, D.-Y. (2011). A corpus-based translation of Korean financial reports into English. Journal of Universal Language, 12(1), 75–94. DOI: https://doi.org/10.22425/jul.2011.12.1.75
Lefever, E., Macken, L., & Hoste, V. (2009, 30 March – 3 April). Language-independent bilingual terminology extraction from a multilingual parallel corpus. In Proceedings of the 12th conference of the European Chapter of the ACL (pp. 1746–1751). Athens, Greece. Retrieved from https://aclanthology.org/E09-1057.pdf. DOI: https://doi.org/10.3115/1609067.1609122
Li, L., Wang, P., Huang, D., & Zhao, L. (2011). Mining English-Chinese named entity pairs from comparable corpora. ACM Transactions on Asian Language Information Processing, 10, 1–19. DOI: https://doi.org/10.1145/2025384.2025387
Lu, B., Tsou, B. K., Jiang, T., Kwong, O. Y., & Zhu, J. (2010). Mining large-scale parallel corpora from multilingual patents: An English-Chinese example and its application to SMT. In Proceedings of the 1st CIPS-SIGHAN joint conference on Chinese language processing (pp. 79–86). Beijing. Retrieved from https://aclanthology.org/W10-4110.pdf
McEnery, T., & Xiao, Z. (2007). Parallel and comparable corpora – the state of play. In Y. Kawaguchi, T. Takagaki, N. Tomimori, & Y. Tsuruga (Eds.), Proceedings of the international conference on Asian language processing (pp. 131–146). Amsterdam: Benjamins. DOI: https://doi.org/10.1075/ubli.6.11mce
Miletic, A., Stosic, D., & Marjanović, D. (2017). ParCoLab: A parallel corpus for Serbian, French and English. In K. Ekštein & V. Matoušek (Eds.), Text, Speech, and Dialogue. TSD 2017. Lecture Notes in Computer Science, 10415, 201–204. Berlin: Springer-Verlag. DOI: https://doi.org/10.1007/978-3-319-64206-2
Neves, M., Yepes, A. J., & Névéol, A. (2016). The Scielo Corpus: A parallel corpus of scientific publications for biomedicine. In Proceedings of the 15th international conference on language resources and evaluation. European Language Resources Association. Retrieved from https://aclanthology.org/L16-1470
Ponay, C. S., & Cheng, C. K. (2015). Building an English-Filipino tourism corpus and lexicon for an ASEAN language translation system. In Proceedings of the international conference ASIALEX (pp. 201–204). Hong Kong: Polytechnic University.
Rosemeyer, M., & Enrique-Arias, A. (2016). A match made in heaven: Using parallel corpora and multinomial logistic regression to analyze the expression of possession in Old Spanish. Language Variation and Change, 28(03), 307–334. DOI: https://doi.org/10.1017/S0954394516000120
Rovenchak, A. (2021). Bamana tales recorded by Umaru Nanankr Jara: A comparative study based on a Bamana-French parallel corpus. Mandenkan, 64, 81–104.
DOI: https://doi.org/10.4000/mandenkan.2471
Schwenk, H., Chaudhary, V., Sun, S., Gong, H., & Guzmán, F. (2021, April). WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia. In Proceedings of the 16th conference of the European Chapter of the Association for Computational Linguistics: Main volume (pp. 1351–1361). Online: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/2021.eacl-main.115. DOI: https://doi.org/10.18653/v1/2021.eacl-main.115
Smirnova, O., & Rackevičienė, S. (2020). English-French-Lithuanian parallel corpus of EU financial documents. Retrieved from http://hdl.handle.net/20.500.11821/35
Srivastava, J., & Sanyal, S. (2015). POS-based word alignment for small corpus. In Proceedings of the international conference on Asian language processing (pp. 37–40). DOI: https://doi.org/10.1109/IALP.2015.7451526
Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufis, D., & Varga, D. (2006, 24–26 May). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th international conference on language resources and evaluation (pp. 2142–2147). Genoa, Italy. Retrieved from https://arxiv.org/abs/cs/0609058
Sturgeon, D. (Ed.). (2021). Ancient Chinese books datasets (Chinese Text Project). Retrieved from https://ctext.org/daoism
Tian, L., Wong, D. F., Chao, L. S., Quaresma, P., Oliveira, F., & Yi, L. (2014). UM-Corpus: A large English-Chinese parallel corpus for statistical machine translation. In LREC. Reykjavik, Iceland: European Language Resources Association (ELRA). Retrieved from http://www.lrec-conf.org/proceedings/lrec2014/pdf/774Paper.pdf
Tiedemann, J. (2012, May). Parallel data, tools and interfaces in OPUS. In N. Calzolari et al. (Eds.), Proceedings of the eighth international conference on language resources and evaluation (pp. 2214–2218). Istanbul, Turkey: European Language Resources Association (ELRA). Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.673.2874&rep=rep1&type=pdf
Turenne, N. (2018, January). The rumour spectrum. PLOS ONE, 13(1), 1–27. DOI: https://doi.org/10.1371/journal.pone.0189080
Turenne, N., Xu, B., Li, X., Xu, X., Liu, H., & Zhu, X. (2020). Exploration of a balanced reference corpus with a wide variety of text mining tools. In Proceedings of ACAI 2020: 3rd international conference on algorithms, computing and artificial intelligence (pp. 1–9). New Mexico, USA: ACM Digital Library. DOI: https://doi.org/10.1145/3446132.3446192
Volk, M., Amrhein, C., Aepli, N., Müller, M., & Ströbel, P. (2016). Building a parallel corpus on the world's oldest banking magazine. In Proceedings of the 13th conference on natural language processing (KONVENS) (pp. 288–296). DOI: https://doi.org/10.5167/uzh-125746
Woldeyohannis, M. M., Besacier, L., & Meshesha, M. (2018). A corpus for Amharic-English speech translation: The case of tourism domain. In F. Mekuria, E. Nigussie, W. Dargie, M. Edward & T. Tegegne (Eds.), Proceedings of information and communication technology for development for Africa. ICT4DA 2017. Lecture notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering (Vol. 244). DOI: https://doi.org/10.1007/978-3-319-95153-9
Wu, E., & Xia, X. (1994). Learning an English-Chinese lexicon from a parallel corpus. In Proceedings of the first conference of the Association for Machine Translation in the Americas (pp. 206–213). Retrieved from https://aclanthology.org/1994.amta-1.26.pdf
Xiong, W. (2013). The development of the Malaysian Hansard corpus: A corpus of parliamentary debates 1959–2020. New Technology of Library and Information Service, (6), 36–41. DOI: https://doi.org/10.11925/infotech.1003-3513.2013.06.06
Yang, C. C., & Li, K. W. (2003). Automatic construction of English/Chinese parallel corpora.
Journal of the American Society for Information Science and Technology, 54, 730–742. Retrieved from https://aclanthology.org/A00-1004.pdf. DOI: https://doi.org/10.1002/asi.10261
Zhai, Y., Liu, L., Zhong, X., Illouz, G., & Vilnat, A. (2020, May). Building an English-Chinese parallel corpus annotated with sub-sentential translation techniques. In Proceedings of the 12th language resources and evaluation conference (pp. 4024–4033). Marseille, France: European Language Resources Association. Retrieved from https://www.aclweb.org/anthology/2020.lrec-1.496
Zhao, B., & Vogel, S. (2002). Adaptive parallel sentences mining from web bilingual news collection. In Proceedings of the IEEE international conference on data mining (pp. 745–748). Beijing. DOI: https://doi.org/10.1109/ICDM.2002.1184044

Published: 18 March 2022

COPYRIGHT: © 2022 The Author(s). This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. See http://creativecommons.org/licenses/by/4.0/.

Journal of Open Humanities Data is a peer-reviewed open access journal published by Ubiquity Press.