China Biographical Database (CBDB): A Relational Database for Prosopographical Research of Pre-Modern China


DATA PAPER

CORRESPONDING AUTHOR:

Song Chen

Department of East Asian 
Studies, Bucknell University, 
Lewisburg, US

song.chen@bucknell.edu

KEYWORDS:
Chinese history; relational 
database; prosopography; 
geographical information 
system; social network 
analysis

TO CITE THIS ARTICLE:
Chen, S., & Wang, H. (2022). 
China Biographical Database 
(CBDB): A Relational Database 
for Prosopographical Research 
of Pre-Modern China. Journal 
of Open Humanities Data, 8(1); 
4, pp. 1–6. DOI: https://doi.
org/10.5334/johd.68

China Biographical 
Database (CBDB): 
A Relational Database for 
Prosopographical Research 
of Pre-Modern China

SONG CHEN 

HONGSU WANG 

ABSTRACT
The China Biographical Database (CBDB) is the largest prosopographical database 
for the study of Chinese history. We use regular expressions and neural network 
models to systematically harvest data from primary and secondary sources and 
employ an entity-relationship model to organize our data. As a relational database 
with both online and offline versions, CBDB provides freely accessible, structured 
data for macroscopic, quantitative studies of premodern China. The data in CBDB is 
continuously disambiguated and readily formatted for statistical, social network, and 
spatial analyses, and also has value for tagging named entities in historical texts and 
contextualizing other data collections.

*Author affiliations can be found in the back matter of this article

mailto:song.chen@bucknell.edu
https://doi.org/10.5334/johd.68
https://doi.org/10.5334/johd.68
https://orcid.org/0000-0003-3922-4792
https://orcid.org/0000-0002-1840-2046


2Chen and Wang  
Journal of Open 
Humanities Data  
DOI: 10.5334/johd.68

(1) OVERVIEW 
REPOSITORY LOCATION

The database is available in both Microsoft Access and SQLite versions on Dataverse at https://
doi.org/10.7910/DVN/PAGGQS and on Github at https://github.com/cbdb-project/cbdb_sqlite. They are 
regularly updated with new contents and functions.

CONTEXT

The China Biographical Database (CBDB) amasses biographical information from disparate 
historical sources to facilitate quantitative, prosopographical research of premodern China. 
The project originated with the dataset that Robert M. Hartwell (1932–1996) created between 
the mid-1970s and 1995, as part of his research on the social and political history of middle-
period China (ca. 7th–13th century), and willed to the Harvard-Yenching Institute. In 2004–05, 
Michael A. Fuller restructured and converted the data from dBase first into FoxPro and then 
into Microsoft Access format. It has since been transferred to the Fairbank Center for Chinese 
Studies at Harvard University, which, together with the Center for Research on Ancient Chinese 
History at Peking University and the Institute of History and Philology at Academia Sinica, 
continued to add new contents under the direction of an international committee chaired by 
Peter K. Bol. Over the past sixteen years, CBDB has grown from a database of about 25,000 
individuals to include approximately 491,000 individuals (as of May 2021) whose lives spanned 
from the seventh through nineteenth centuries and is available for scholarly use in several 
online and offline (Microsoft Access, Microsoft SQL Server, MySQL, and SQLite) versions.1 The 
contents of CBDB benefit from, and are inevitably shaped by, China’s historiographical tradition 
which provides rich data on family relations, literary exchanges, intellectual interactions, and 
careers in government, among others, but is often reticent about issues like gender relations 
and economic transactions. Because of this, CBDB has 275,945 records on bureaucratic 
appointments, 482,953 records on kinship relations, 160,219 records of non-kin social 
connections, but hardly any on economic activities as of May 2021.

(2) METHOD
STEPS 

There are two core tasks in our data collection: data mining and disambiguation. CBDB is a 
relational database that uses the entity-relationship model to organize biographical information. 
Persons are a type of entity. So are places, texts, offices, and so forth. Each entity has its own 
set of attributes (e.g., each person has a birth year and a death year, and each place has a 
longitude and a latitude), and every life event is conceptualized as an instance of a relationship 
between multiple entities (e.g., a bureaucratic appointment is an instance of relationship, from 
the beginning to the end year of that appointment, between a person, the office he held, and 
the jurisdiction of that office). Data collection is, in substance, a matter of identifying named 
entities and their relationships in historical sources that are described in narrative forms. For 
this purpose, we have experimented with several data mining approaches and found value 
in algorithms based on regular expressions and neural network models, such as Bidirectional 
Encoder Representations from Transformers (BERT) and Bidirectional Long Short-Term Memory 
(Bi-LSTM). We use BERT, for example, to create a vector representation of each Chinese 
character (an approach known as “word embedding”), which allows us to capture semantic 
and syntactic relations between characters through mathematical operations. We also use  
Bi-LSTM to tag the characters and predict whether a character is part of a string that signifies 
a specific person, place, or bureaucratic office. Outputs from these automated data mining 
algorithms are reviewed by an editorial team before they are prepared for inclusion into our 
database.

1  The Microsoft Access and SQLite versions of CBDB are updated on a regular basis. To download the most 
recent version of CBDB in the Microsoft Access format, see https://projects.iq.harvard.edu/cbdb/download-cbdb-
standalone-database. The up-to-date SQLite version is downloadable from https://github.com/cbdb-project/cbdb_
sqlite. Our Microsoft SQL Server is currently undergoing alpha testing. The MySQL version of CBDB provides data 
dump for development teams and other experienced users upon request. The CBDB online querying and data 
visualization interface for general use is developed by our commercial collaborator and accessible via http://www.
inindex.cn/. With collaboration from Academia Sinica and the CBDB open-source community, we have also been 
developing various backend APIs (CC BY-NC-SA 4.0) that support the future design of alternative online interfaces 
(https://github.com/cbdb-project/cbdb-online-main-server/blob/develop/API.md).

https://doi.org/10.5334/johd.68
https://doi.org/10.7910/DVN/PAGGQS
https://doi.org/10.7910/DVN/PAGGQS
https://github.com/cbdb-project/cbdb_sqlite
https://projects.iq.harvard.edu/cbdb/download-cbdb-standalone-database
https://projects.iq.harvard.edu/cbdb/download-cbdb-standalone-database
https://github.com/cbdb-project/cbdb_sqlite
https://github.com/cbdb-project/cbdb_sqlite
http://www.inindex.cn/
http://www.inindex.cn/
https://github.com/cbdb-project/cbdb-online-main-server/blob/develop/API.md


3Chen and Wang  
Journal of Open 
Humanities Data  
DOI: 10.5334/johd.68

In merging newly harvested data into CBDB, the chief challenge comes from the complex 
relationship in natural language between a name and the entity it signifies. CBDB assigns a 
unique identifier (“id” or “code”) to each named entity regardless of how it is referenced in the 
sources, and our development team makes every effort to disambiguate all newly harvested 
data before incorporating them into the database. Take persons for example. While we are 
blessed by the fact that most people of all walks of life in Chinese society, unlike the Europeans, 
had possessed both a family name and a given name since the Han dynasty (202 BCE–220 
CE) and had the flexibility of composing given names from almost any Chinese character, it is 
not rare for two persons to have exactly the same name. On the other hand, members of the 
elite in imperial China were typically known by a wide variety of names and could be referred 
to by their office titles and other honorific appellations. Therefore, it is often necessary to 
disambiguate personal names and appellations in historical sources. In practice, we make use 
of a variety of biographical information such as alternative names, birth and death year, native 
place, examination degree, and data on kinship and social connections to distinguish a person 
from his namesake and consolidate data points about the same person whom the sources 
reference in various ways. 

We do not only disambiguate and code entities, but also disambiguate kinship relations. We 
have designed a set of symbols to describe kinship relations with greater precision than they 
are expressed in the natural language (e.g., we use FBS and MBS [father’s or mother’s brother’s 
son], among others, to distinguish different kinds of paternal and maternal cousins). We also 
normalize social relations by aggregating varied expressions found in historical sources into 
coded categories. Natural language has numerous ways of describing social relations. While 
the nuances in these descriptions (e.g., to censure someone vs. to criticize someone) merit 
attention and may, at least in some cases, reflect subtle differences in the nature of actual social 
relationships or the perceptions thereof, the strength of CBDB lies in facilitating the analysis of a 
large amount of historical data in the aggregate. To achieve this goal, we classify social relations 
into coded categories. As of May 2021, we have 470 pairs of coded relations that are further 
organized into larger classes and subclasses, which include literary exchanges, teacher-disciple 
ties, supportive or oppositional political relations, and so forth. After fully disambiguating and  
normalizing (“coding”) named entities and their relations, we partition the data into separate 
tables which are subsequently uploaded to the database. The primary key in each data table 
eliminates duplicate records, and the foreign key ensures proper linkage between tables.

Disambiguation and normalization are time-consuming tasks that require domain knowledge 
in specific historical periods and topics. To expedite the process, we launched a crowdsourcing 
platform in 2021 to encourage contributions from historians of premodern China.

SAMPLING STRATEGY

Our ultimate goal is to collect all biographical information in the extant historical record of 
premodern China. Resource constraints, however, require that we must set priorities. To produce 
a large collection of data for scholarly use within a reasonable timeframe, we have worked 
mainly with digitized, searchable texts, especially those that were written and formatted in a 
style particularly suitable for automated data extraction, and prioritized data sources that can 
systematically expand the coverage of our database. These include both modern scholarly 
works, such as biographical sketches and rosters of officeholders compiled by twentieth-
century historians, and primary historical documents, such as biographies in official histories 
and local gazetteers, tomb epitaphs, records of imperial examination graduates, and the lists 
of letters and other writings in literary collections. 

Several biographical dictionaries, compiled in the 1960s and 1970s, provide a large assemblage 
of material on the lives of approximately 70,000 persons between the tenth and seventeenth 
centuries (Chang & Wang, 1974; Wang, Li, & Pan, 1979; National Central Library, 1965). By 
systematically harvesting the data in these dictionaries, the CBDB team managed to create 
basic profiles for a large number of historical figures during an early phase of our project. 

Since then, we have expanded coverage by concentrating data collection in three areas: 
bureaucratic appointments, family relations, and literary exchanges. We have collected 
data from two multi-volume compendia which contributed more than 35,000 records on 
prefectural appointments from the seventh to thirteenth centuries (Yu, 2000; Li, 2001).  

https://doi.org/10.5334/johd.68


4Chen and Wang  
Journal of Open 
Humanities Data  
DOI: 10.5334/johd.68

These were recently supplemented by another 107,000 entries on local appointments taken from 
158 local gazetteers compiled in Ming-Qing times (1368–1912). Using fifty-two examination 
records from the Ming dynasty (1368–1644), we have added roughly 14,116 metropolitan 
examination graduates and their 130,000 relatives into the database. We are now expanding 
data coverage in this area with a new dataset containing 19,576 Song-dynasty (960–1279) 
examination graduates based on a recent publication (Fu, Gong, & Zu, 2009). With the help 
of Tang historians (Yao Ping and Nicolas Tackett), we have added some 100,000 instances of 
kinship relations from tomb epitaphs between the seventh and tenth centuries (Zhou, 1992; 
Zhou & Zhao, 2001), and we are currently preparing a massive collection of officeholding data 
from Song-dynasty administrative documents (Xu, 2014).

At present, the majority of our data on social relations are based on records of literary exchanges. 
We collected 18,124 instances of poetic exchange between the seventh and tenth centuries, 
based on the work of a modern scholar (Wu, 1993), and some 8,800 instances of epistolary 
exchange between the tenth and thirteenth centuries based on Complete Song-Dynasty Prose 
(Zeng & Liu, 2006). We will soon add another 40,000 instances of epistolary exchange from 
Ming-dynasty (1368–1644) literary collections. For a full list of our data sources, see https://
projects.iq.harvard.edu/cbdb/cbdb-sources.

In addition, we have also coded and incorporated data from existing databases that focus on 
specific social groups and historical periods. These include, for example, a massive collection 
of data on family relations and officeholding for more than 46,000 persons from the Database 
of Names and Biographies (Institute of History and Philology, Academia Sinica, n.d.) and some 
5,000 female writers from Ming-Qing Women’s Writings Project (McGill University, n.d.).

CBDB is a work in progress and has no end date planned. Its current contents reflect its history 
that began with Hartwell’s dataset of Song-dynasty officials and gradually extended back into 
the Tang dynasty and forward into the Yuan, Ming, and Qing dynasties. As more historical texts 
from premodern China become available in searchable digital formats and the technology of 
data mining improves, the contents of CBDB will continue to grow.

QUALITY CONTROL 

Our editorial group, composed of doctoral students in Chinese history who specialize in various 
topics and periods, review the output from data mining algorithms and, when necessary, 
manually input data into our database. Additionally, when new data are prepared for uploading 
to CBDB, the primary and foreign keys in data tables also function as a line of defense for data 
integrity.

(3) DATASET DESCRIPTION 
OBJECT NAME

SQLite version: CBDB_20210525.7z; 

Microsoft Access version: CBDB_bc_20210525.7z

FORMAT NAMES AND VERSIONS

CBDB is available for downloading in SQLite and Microsoft Access versions. Both its content and 
interface are constantly evolving. Data contents are dated by the most recent update in the 
format of yyyy-mm-dd, and the interface is versioned using two lowercase English letters (the 
latest release is the bc version).

Creation dates – 1970s to 2021–05–25

Dataset creators – Current executive committee members include Peter K. Bol (Harvard 
University, Chair), Xiaonan Deng (Center for Research on Ancient Chinese History, Peking 
University), Michael A. Fuller (University of California at Irvine), Song Chen (Bucknell University), 
Hsi-yuan Chen (Institute of History and Philology, Academia Sinica), Wenyi Chen (Institute 
of History and Philology, Academia Sinica), Xin Luo (Center for Research on Ancient Chinese 
History, Peking University). Current project managers are Hongsu Wang (Harvard University) 
and Yang Xu (Peking University). For a list of past and present committee members, editors, 

https://doi.org/10.5334/johd.68
https://projects.iq.harvard.edu/cbdb/cbdb-sources
https://projects.iq.harvard.edu/cbdb/cbdb-sources


5Chen and Wang  
Journal of Open 
Humanities Data  
DOI: 10.5334/johd.68

and other contributors, see https://projects.iq.harvard.edu/cbdb/core-institutions-and-editors. For a list 
of crowdsourcing contributors, see https://projects.iq.harvard.edu/cbdb/cbdb-crowdsourcing-projects.

Language – Variable names are in English. Data are bilingual (English and Chinese).

License – CC BY-NC-SA 4.0 

Repository name – Dataverse and Github

Publication date – 2021–05–25

(4) REUSE POTENTIAL
CBDB assembles biographical information from disparate sources and is particularly suited 
for data-driven, social scientific research that aims at discovering macroscopic patterns in 
Chinese history and complements the qualitative, humanistic approach of close reading. The 
current coverage of CBDB makes it particularly powerful for prosopographical studies of the 
Chinese elite from the seventh through nineteenth centuries. The data in CBDB is continuously 
disambiguated and readily formatted for statistical, social network, and spatial analyses. A 
growing number of articles are published every year that use CBDB data to explore topics 
ranging from career trajectory, regional composition, and family connections of civil officials 
to intellectual and social networks of Neo-Confucian moral philosophers, antiquities collectors, 
and members of political factions. For a full list of publications that use CBDB data, see https://
projects.iq.harvard.edu/cbdb/publications-use-cbdb-data. 

CBDB also has immense value for developing new digital projects. Online text markup platforms, 
like MARKUS (Ho & De Weerdt, n.d.), use CBDB code tables to tag persons, bureaucratic offices, 
places, and temporal references in user-uploaded historical texts. Specialized databases (e.g., 
Database of Names and Biographies) access CBDB, through our API, to provide more context to 
their data collections. The Chinese Text Project integrates data from CBDB and other sources to 
produce a knowledge graph in its Data Wiki (Sturgeon, n.d.), and the Shanghai Library uses our 
data for its Linked Open Data project (Shanghai Library, n.d.). Universities, such as Tsinghua, use 
CBDB to teach digital methods for Chinese studies and incorporate CBDB into their pedagogical 
platforms (Tsinghua University, n.d.) that train the next generation of digital humanists.

FUNDING INFORMATION 
COL Digital Publishing Group Co., Ltd. (2018–)

The Tang Research Foundation (2015–17)

The Henry Luce Foundation (2012–15)

Institute of History and Philology, Academia Sinica (2006–)

Center for Research on Ancient Chinese History, Peking University (2010–)

Harvard University and Harvard University Asia Center (2008, 2009–2011)

The National Endowment for the Humanities (2009–2012; PW-50438-09)

Chiang Ching-kuo Foundation for International Scholarly Exchange (2011–2018)

The Social Sciences and Humanities Research Council of Canada (2011–2015)

The American Council of Learned Societies (2008)

Bequest from the Estate of Robert Hartwell to Harvard-Yenching Institute (2005–2010)

COMPETING INTERESTS
The authors have no competing interests to declare.

AUTHOR CONTRIBUTIONS
Song Chen: Conceptualization, Methodology, Writing – original draft.

Hongsu Wang: Data Curation, Project Administration, Software, Writing – review & editing.

https://doi.org/10.5334/johd.68
https://projects.iq.harvard.edu/cbdb/core-institutions-and-editors
https://projects.iq.harvard.edu/cbdb/cbdb-crowdsourcing-projects
https://projects.iq.harvard.edu/cbdb/publications-use-cbdb-data
https://projects.iq.harvard.edu/cbdb/publications-use-cbdb-data


6Chen and Wang  
Journal of Open 
Humanities Data  
DOI: 10.5334/johd.68

TO CITE THIS ARTICLE:
Chen, S., & Wang, H. (2022). 
China Biographical Database 
(CBDB): A Relational Database 
for Prosopographical Research 
of Pre-Modern China. Journal 
of Open Humanities Data, 8(1): 
4, pp. 1–6. DOI: https://doi.
org/10.5334/johd.68

Published: 27 January 2022

COPYRIGHT:
© 2022 The Author(s). This is an 
open-access article distributed 
under the terms of the Creative 
Commons Attribution 4.0 
International License (CC-BY 
4.0), which permits unrestricted 
use, distribution, and 
reproduction in any medium, 
provided the original author 
and source are credited. See 
http://creativecommons.org/
licenses/by/4.0/.

Journal of Open Humanities 
Data is a peer-reviewed open 
access journal published by 
Ubiquity Press.

AUTHOR AFFILIATIONS 
Song Chen  orcid.org/0000-0003-3922-4792 
Department of East Asian Studies, Bucknell University, Lewisburg, US

Hongsu Wang  orcid.org/0000-0002-1840-2046 
Institute for Quantitative Social Science, Harvard University, Cambridge, US

REFERENCES
Chang, B., & Wang, D. (1974). Song ren zhuanji ziliao suoyin 宋人傳記資料索引 [Index to Biographical 

Materials of Song Figures]. Taipei: Dingwen shuju.

Fu, X., Gong, Y., & Zu, H. (2009). Song dengke ji kao 宋登科記考 [Research on Examination Graduates of 
the Song Dynasty]. Nanjing: Jiangsu jiaoyu chubanshe.

Harvard University, Academia Sinica, and Peking University. China Biographical Database. https://
projects.iq.harvard.edu/cbdb

Ho, H. L. B., & De Weerdt, H. MARKUS: Text Analysis and Reading Platform. https://dh.chinese-empires.eu/
markus/beta/

Institute of History and Philology, Academia Sinica. Database of Names and Biographies 人名權威人物
傳記資料庫. https://newarchive.ihp.sinica.edu.tw/sncaccgi/sncacFtp

Li, Z. (2001). Songdai junshou tongkao 宋代郡守通考 [Comprehensive Studies on Song-Dynasty Prefects]. 
Chengdu: Ba Shu shushe. 

McGill Library. Ming-Qing Women’s Writings Project. Directed by Grace S. Fong and Song Shi. https://
digital.library.mcgill.ca/mingqing/english/index.php

National Central Library. (1965). Ming ren zhuanji ziliao suoyin 明人傳記資料索引 [Index to Biographical 
Materials of Ming Figures]. Taipei: Guoli zhongyang tushuguan.

Shanghai Library. CBDB Linked Open Data. https://cbdb.library.sh.cn
Sturgeon, D. Chinese Text Project Data Wiki. https://ctext.org/tools/linked-open-data
Tsinghua University. Tsinghua Digital Humanities Teaching and Research Platform 清華大學數字人文教學

與研究平臺. http://qh.nqcx.net
Wang, D., Li, R., & Pan, B. (1979). Yuan ren zhuanji ziliao suoyin 元人傳記資料索引 [Index to Biographical 

Materials of Yuan Figures]. Taipei: Xinwenfeng chuban gongsi.

Wu, R. (1993). Tang Wudai ren jiaowangshi suoyin 唐五代人交往詩索引 [Indexes to the Exchange Poems 
of Tang and Five Dynasties]. Shanghai: Shanghai guji chubanshe.

Xu, S. (2014). Song huiyao jigao 宋會要輯稿 [Collected Administrative Documents from the Song Dynasty]. 
Shanghai: Shanghai guji chubanshe.

Yu, X. (2000). Tang cishi kao quanbian 唐刺史考全編 [Complete Collection of Studies on Tang-Dynasty 
Prefects]. Hefei: Anhui daxue chubanshe.

Zeng, Z., & Liu, L. (2006). Quan Song wen 全宋文 [Complete Song-Dynasty Prose]. Shanghai: Shanghai 
cishu chubanshe.

Zhou, S. (1992). Tangdai muzhi huibian 唐代墓誌彙編 [Collection of Tang-Dynasty Tomb Epitaphs]. 
Shanghai: Shanghai guji chubanshe.

Zhou, S., & Zhao, C. (2001). Tangdai muzhi huibian xuji 唐代墓誌彙編續集 [Sequel to the Collection of 
Tang-Dynasty Tomb Epitaphs]. Shanghai: Shanghai guji chubanshe.

https://doi.org/10.5334/johd.68
https://doi.org/10.5334/johd.68
https://doi.org/10.5334/johd.68
http://creativecommons.org/licenses/by/4.0/
http://creativecommons.org/licenses/by/4.0/
https://orcid.org/0000-0003-3922-4792
https://orcid.org/0000-0002-1840-2046
https://projects.iq.harvard.edu/cbdb
https://projects.iq.harvard.edu/cbdb
https://dh.chinese-empires.eu/markus/beta/
https://dh.chinese-empires.eu/markus/beta/
https://newarchive.ihp.sinica.edu.tw/sncaccgi/sncacFtp
https://digital.library.mcgill.ca/mingqing/english/index.php
https://digital.library.mcgill.ca/mingqing/english/index.php
https://cbdb.library.sh.cn
https://ctext.org/tools/linked-open-data
http://qh.nqcx.net