id: work_dgpqd6codbg2xgxz7q6bcgdncq
author: Tuula Pääkkönen
title: Exporting Finnish Digitized Historical Newspaper Contents for Offline Use
date: 2016
pages:
extension: .htm
mime: text/html
words: 5241
sentences: 317
flesch: 61
summary: Digital collections of the National Library of Finland (NLF) contain over 10 million pages of historical newspapers, journals and some technical ephemera. For this purpose we introduced a new format, which contains three different information sets: the full metadata of a publication page, the actual page content as ALTO XML, and the raw text content. In the recent one the data is provided in the original ALTO XML format, but the directory structure follows the original file system order, where one newspaper title can span different archive files. For the newspaper corpus we decided to create the export with original ALTO XML and raw text plus the metadata. The number of ALTO XML files in the newspaper part of the export is presented by language in Figure 4. Should it contain all of the content within one file, or should it be structurally divided so that each data item (metadata, ALTO, page text) is available separately?
cache: ./cache/work_dgpqd6codbg2xgxz7q6bcgdncq.htm
txt: ./txt/work_dgpqd6codbg2xgxz7q6bcgdncq.txt
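The summary describes an export that pairs each page's ALTO XML with its raw text. As a rough illustration of how the two relate, the sketch below derives raw text from an ALTO document by reading the `CONTENT` attribute of `String` elements inside each `TextLine`. It is not the NLF's export tooling: the function names `local_name` and `alto_to_text` and the sample page are assumptions made up for this example, and tags are matched by local name because the ALTO namespace differs between schema versions.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix so tags match across ALTO schema versions."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(alto_xml):
    """Return the page text, one output line per ALTO TextLine.

    Hypothetical helper: words come from the CONTENT attribute of String
    elements and are joined with single spaces. Hyphenation (HYP) and
    layout information are ignored in this sketch.
    """
    root = ET.fromstring(alto_xml)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        words = [
            child.attrib["CONTENT"]
            for child in elem
            if local_name(child.tag) == "String" and "CONTENT" in child.attrib
        ]
        lines.append(" ".join(words))
    return "\n".join(lines)


# Invented sample page, for illustration only (not taken from the export).
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="Esimerkki"/><SP/><String CONTENT="rivi"/></TextLine>
    <TextLine><String CONTENT="1904"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

page_text = alto_to_text(SAMPLE)
print(page_text)
```

Matching by local name keeps the sketch independent of the ALTO version in use; a production reader would more likely validate against the specific schema and also handle hyphenated words.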