Digitization of Text Documents Using PDF/A

Yan Han and Xueheng Wan

INFORMATION TECHNOLOGY AND LIBRARIES | MARCH 2018
https://doi.org/10.6017/ital.v37i1.9878

Yan Han (yhan@email.arizona.edu) is Full Librarian, the University of Arizona Libraries, and Xueheng Wan (wanxueheng@email.arizona.edu) is a student, Department of Computer Science, University of Arizona.

ABSTRACT

The purpose of this article is to demonstrate a practical use case of PDF/A for digitization of text documents, following FADGI's recommendation of PDF/A as a preferred digitization file format. The authors demonstrate how to convert and combine TIFFs with associated metadata into a single PDF/A-2b file for a document. Using real-life examples and open source software, the authors show readers how to convert TIFF images, extract associated metadata and International Color Consortium (ICC) profiles, and validate the results against the newly released PDF/A validator. The generated PDF/A file is a self-contained and self-described container that accommodates all the data from digitization of textual materials, including page-level metadata and ICC profiles. Providing theoretical analysis and empirical examples, the authors show that PDF/A has many advantages over the traditionally preferred file formats, TIFF and JPEG2000, for digitization of text documents.

BACKGROUND

PDF has been used primarily as a file-delivery format across many platforms and on almost every device since its initial release in 1993. PDF/A was designed to address concerns about long-term preservation of PDF files, but there has been little research on, and few implementations of, this file format. Since the first standard (ISO 19005 PDF/A-1) was published in 2005, some articles have discussed the PDF/A family of standards, relevant information, and how to implement PDF/A for born-digital documents.1 Interest in the PDF and PDF/A standards has grown since both the US Library of Congress and the National Archives and Records Administration (NARA) joined the PDF Association in 2017.
NARA joined the PDF Association because PDF files are used as electronic documents in every government and business agency. As explained in a blog post, the Library of Congress joined the PDF Association because of the benefits to libraries, including participating in developing PDF standards, promoting best-practice use of PDF, and access to global expertise in PDF technology.2 Few articles, if any, have been published about using this file format for preservation of digitized content. Yan Han published a related article in 2015 presenting theoretical research on using PDF/A for text documents.3 In that article, Han discussed the shortcomings of the widely used TIFF and JPEG2000 as master preservation file formats and proposed the then-emerging PDF/A as the preferred file format for digitization of text documents. Han further analyzed the requirements of digitization of text documents and discussed the advantages of PDF/A over TIFF and JPEG2000. These benefits include platform independence, smaller file size, better compression algorithms, and metadata encoding. In addition, the file format reduces workload and simplifies post-digitization processing such as quality control, adding and updating missing pages, and creating new metadata and OCR data for discovery and digital preservation. As a result, PDF/A can be used in every phase of a digital object's life in an Open Archival Information System (OAIS), for example as a Submission Information Package (SIP), an Archival Information Package (AIP), or a Dissemination Information Package (DIP). In summary, a PDF/A file can be a structured, self-contained, and self-described container, allowing a simpler one-to-one relationship between an original physical document and its digital surrogate.
In September 2016, the Federal Agencies Digital Guidelines Initiative (FADGI) released its latest guidelines for digitization of raster images, Technical Guidelines for Digitizing Cultural Heritage Materials.4 The de facto best practices for digitization, these guidelines provide guidance to federal agencies and have been adopted by many cultural heritage institutions. Both the PDF Association and the authors welcomed the recognition of PDF/A as the preferred master file format for digitization of text documents such as unbound documents, bound volumes, and newspapers.5

GOALS AND TASKS

Since Han has previously provided theoretical methods of coding raster images, metadata, and related information in PDF/A, the goals of this article are threefold:

1. present real-life experience of converting TIFFs/JPEG2000s to PDF/A and back, along with image metadata
2. test open source libraries to create and manipulate images, image metadata, and PDF/A
3. validate generated PDF/As with the first legitimate validator for PDF/A validation

The tasks included the following:

● Convert all the master files in TIFF/JPEG2000 from digitization of text documents into single PDF/A files losslessly: one document, one PDF/A file.
● Evaluate and extract metadata from each TIFF/JPEG2000 image and encode it along with its image when creating the corresponding PDF/A file.
● Measure the runtimes of the above tasks for feasibility evaluation.
● Validate the PDF/A files with the newly released open source PDF/A validator veraPDF.
● Extract each digital image from the PDF/A file back to its original master image format along with associated metadata.
● Verify the extracted image files from the round-trip conversion process against the original master image files.

Choices of PDF/A Standards and Conformance Level

This article demonstrates using PDF/A-2b as a self-contained, self-describing file format.
Currently, there are three related PDF/A standards (PDF/A-1, PDF/A-2, and PDF/A-3), each with three conformance levels (a, b, and u). The reasons for choosing PDF/A-2 (instead of PDF/A-1 or PDF/A-3) are the following:

● PDF/A-1 is based on PDF 1.4. In this standard, images coded in PDF/A-1 cannot use JPEG2000 compression (named JPXDecode in PDF). One can still convert TIFFs to PDF/A-1 using other lossless compression methods such as LZW; however, the space-saving benefits of JPEG2000 compression over other methods would not be realized.
● PDF/A-2 and PDF/A-3 are based on PDF 1.7. One significant feature of PDF 1.7 is support for JPEG2000 compression, which saves 40–60 percent of space for raster images compared to uncompressed TIFFs.
● PDF/A-3 has one major feature that PDF/A-2 does not: it allows arbitrary files to be embedded within the PDF file. In this case, there are no files to be embedded.

The authors chose conformance level b for simplicity:

● b is basic conformance, which requires only the components necessary to reproduce a document's visual appearance (e.g., all fonts embedded in the PDF).
● a is accessible conformance: level b plus additional accessibility requirements (structural and semantic features such as document structure). One can add tags later to convert a PDF/A-2b file to PDF/A-2a.
● u is a conformance level with the additional requirement that all text in the document have Unicode equivalents.

This article does not cover post-processing of the generated PDF/A files, whether manual or computational, such as adding OCR text. These features do not help faithfully capture the look and feel of the original pages in digitization, and they can be added or updated later without any loss of information. In addition, OCR results depend on the availability of OCR engines for the document's language, and results can vary between OCR engines over time.
OCR technology is getting better and will produce better results in the future. For example, current OCR technology for English gives very reliable (more than 90 percent) accuracy. In comparison, traditional Chinese manuscripts and Pashto/Persian documents give unacceptably low accuracy (less than 60 percent). Cutting-edge OCR engines have started to utilize artificial neural networks, and the authors believe that a breakthrough will happen soon.

Data Source

The University of Arizona Libraries (UAL) and the Afghanistan Center at Kabul University (ACKU) have been partnering to digitize and preserve ACKU's permanent collection held in Kabul. This collaborative project created the largest Afghan digital repository in the world. Currently the Afghan digital repository (http://www.afghandata.org) contains more than fifteen thousand titles and 1.6 million pages of documents. Digitization of these text documents followed the previous version of the FADGI guidelines, which recommended scanning each page of a text document into a separate TIFF file as the master file. These TIFFs were organized into directories in a file system, where each directory represents a document and contains all the scanned pages of that title. An example of the directory structure can be found in Han's article.

PDF/A and Image Manipulation Tools

There are a few open source and proprietary PDF software development kits (SDKs). Adobe PDF Library and Foxit SDK are the best-known commercial tools for manipulating PDFs. To show readers that they can manipulate and generate PDF/A documents themselves, open source software, rather than commercial tools, was used. Currently, only a very limited number of open source PDF SDKs are available, including iText and PDFBox.
iText was chosen because it has good documentation and provides a well-built set of APIs supporting almost all PDF and PDF/A features. iText was initially written in 1998 as an in-house project by Bruno Lowagie, a member of the ISO PDF standard working group; Lowagie later founded his own company, iText, and published iText in Action with many code examples.6 Moreover, iText offers Java and C# options with good code documentation. It is worth mentioning that iText has different versions: the authors used iText 5.5.10 and 5.4.4. Using an older version in our implementation generated a non-conforming PDF/A file because that version was not aligned with the PDF/A standard.7

For image processing, there were a few popular open source options, including ImageMagick and GIMP. ImageMagick was chosen because of its popularity, stability, and cross-platform implementation. Our implementation identified one issue with ImageMagick: the current version (7.0.4) could not retrieve all the metadata from TIFF files, as it did not extract certain information such as the Image File Directory and color profile. These metadata are critical because they are part of the original data from digitization. Unfortunately, the authors observed that some image editors were unable to preserve all the metadata from image files during the conversion process. Hart and de Vries used case studies to show the vulnerability of metadata, demonstrating that metadata elements in a digital object can be lost or corrupted by use or by conversion of a file to another format. They suggested that action is needed to ensure proper metadata creation and preservation, so that all types of metadata are captured and preserved to achieve the most authentic, consistent, and complete digital preservation for future use.8

Metadata Extraction Tools and Color Profiles

As we digitize physical documents and manipulate images, color management is important.
The goal of color management is to obtain a controlled conversion between the color representations of various devices such as image scanners, digital cameras, and monitors. A color profile is a set of data that characterizes the input or output of a color space. The International Color Consortium (ICC) standards and profiles were created to bring various manufacturers together, because embedding color profiles into images is one of the most important color management solutions. Image formats such as TIFF and JPEG2000 and document formats such as PDF may contain embedded color profiles.

The authors identified a few open source tools to extract TIFF metadata, including ExifTool, Exiv2, and tiffinfo. ExifTool is an open source tool for reading, writing, and manipulating metadata of media files. Exiv2 is another free metadata tool supporting different image formats. The tiffinfo program is widely used on the Linux platform, but it has not been updated for at least ten years. Our implementations showed that ExifTool most easily extracted the full ICC profiles and other metadata from TIFF and JPEG2000 files. ImageMagick and other image-processing software were examined in Van der Knijff's article discussing JPEG2000 for long-term preservation.9 He found that ICC profiles were lost in ImageMagick; our implementation showed that the current version of ImageMagick has fixed this issue. A metadata sample can be found in appendix A.

IMPLEMENTATION

Converting and Ordering TIFFs into a Single PDF/A-2 File

When ordering and combining all individual TIFFs of a document into a single PDF/A-2b file, the authors intended to preserve all information from the TIFFs, including raster image data streams and the metadata stored in each TIFF's header.
The raster image data streams are the main images reflecting the original look and feel of the pages, while the metadata (including technical and administrative metadata such as BitsPerSample, DateTime, and Make/Model/Software) records important digitization and provenance information. Both are critical for delivery and digital preservation.

The TIFF images were first converted to JPEG2000 with lossless compression using the open source ImageMagick software. Our tests of ImageMagick demonstrated that it can handle different color profiles and will convert images correctly if the original TIFF comes with a color profile. This gave us confidence that past concerns about JPEG2000 and ImageMagick had been resolved. These images were then sorted into their original order and combined into a single PDF/A-2 file. An alternative is to code the TIFF's image data stream directly into a PDF/A file, but this approach would miss one benefit of PDF/A-2: tremendous file-size reduction with JPEG2000. The following is the pseudocode for ordering and combining all the TIFFs of a text document into a single PDF/A-2 file.
CreatePDFA2(queue TiffList) {
    Create an empty queue XMLQ;
    Create an empty queue JP2Q;
    /* TiffList is a pre-sorted queue based on the original page order */

    /* Convert each TIFF to JPEG2000 losslessly, then add each JPEG2000
       and its metadata into a queue */
    while (TiffList is NOT empty) {
        String tiffFilePath = TiffList.dequeue();
        String xmlFilePath = TIFF metadata extracted using ExifTool;
        XMLQ.enqueue(xmlFilePath);
        String jp2FilePath = JPEG2000 file converted from the TIFF by ImageMagick;
        JP2Q.enqueue(jp2FilePath);
    }

    /* Convert each image's metadata to XMP, then add each JPEG2000 and its
       metadata into the PDF/A-2 file based on its original order */
    Document pdf2b = new Document();
    /* create PDF/A-2b conformance level */
    PdfAWriter writer = PdfAWriter.getInstance(pdf2b,
        new FileOutputStream(PdfAFilePath), PdfAConformanceLevel.PDF_A_2B);
    writer.createXmpMetadata(); // create root XMP
    pdf2b.open();
    while (JP2Q is NOT empty) {
        Image jp2 = Image.getInstance(JP2Q.dequeue());
        Rectangle size = new Rectangle(jp2.getWidth(), jp2.getHeight());
        pdf2b.setPageSize(size); // PDF page size setting
        pdf2b.newPage(); // create a new page for a new image
        // convert the original metadata based on the XMP standard
        byte[] bytearr = XmpManipulation(XMLQ.dequeue());
        writer.setPageXmpMetadata(bytearr);
        pdf2b.add(jp2);
    }
    pdf2b.close();
}

Converting PDF/A-2 Files Back to TIFFs and JPEG2000s

To ensure that raster images can be extracted from the newly created PDF/A-2 file, the authors also wrote code to convert a PDF/A-2 file back to the original TIFF or JPEG2000 format. This implementation was the reverse of the above operation. Once the reverse conversion process was completed, the authors verified that the image files extracted from the PDF/A-2 file were the same as before the conversion to PDF/A-2. Note that we generated MD5 checksums to verify the image data streams.
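The checksum comparison can be sketched with Java's standard MessageDigest API. This is an illustration under stated assumptions, not the authors' actual code: the byte arrays stand in for image data streams read from the original TIFF/JPEG2000 and from the file extracted out of the PDF/A-2.

```java
public class ChecksumVerifier {

    // Hex-encode the MD5 digest of a byte stream.
    static String md5Hex(byte[] data) throws Exception {
        byte[] digest = java.security.MessageDigest.getInstance("MD5").digest(data);
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    // The round trip succeeds when the original and extracted image
    // data streams hash to the same value.
    static boolean sameDataStream(byte[] original, byte[] extracted) throws Exception {
        return md5Hex(original).equals(md5Hex(extracted));
    }

    public static void main(String[] args) throws Exception {
        byte[] original = "image data stream".getBytes("UTF-8");   // placeholder bytes
        byte[] extracted = "image data stream".getBytes("UTF-8");  // placeholder bytes
        System.out.println(sameDataStream(original, extracted));   // prints "true"
    }
}
```

Comparing digests rather than whole files keeps the verification cheap even for large collections; only the fixed-length hashes need to be stored and compared.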
The image data streams are identical, but metadata location can vary because of the inconsistent TIFF tags used over the years; when converting one TIFF to another TIFF, ImageMagick applies its own implementation of metadata tags. The code can be found in appendix B.

PDF/A Validation

PDF/A is one of the most recognized digital preservation formats, specifically designed for long-term preservation and access. However, no commonly accepted PDF/A validator was available in the past, although several commercial and open source PDF preflight and validation engines (e.g., Acrobat) existed. Validating a PDF/A file against the PDF/A standards is a challenging task for several reasons, including the complexity of the PDF and PDF/A formats. The PDF Association and the Open Preservation Foundation recognized the need and started a project to develop an open source PDF/A validator and to build a maintenance community. Their result, veraPDF, is an open source validator designed for all PDF/A parts and conformance levels. Released in January 2017, veraPDF aims to become the commonly accepted PDF/A validator.10 Our generated PDF/A files have been validated with veraPDF 1.4 and Adobe Acrobat Pro DC Preflight. Both products validated the PDF/A-2b files as fully conforming. Our implementations showed that veraPDF 1.4 verified more cases than Acrobat DC Preflight. Figure 1 shows a PDF file structure and its metadata.

Figure 1. A PDF object tree with root-level metadata.

RUNTIME AND CONCLUSION

The time complexity of our code is O(n log n) because of the sorting algorithms used. TIFFs were first converted to JPEG2000. When the JPEG2000 images are added to a PDF/A-2 file, no further image manipulation is required because the generated PDF/A-2 uses JPEG2000 directly (in other words, it uses the JPXDecode filter).
Tables 1 and 2 show the performance measured on our hardware and software environment (Intel Core i7-2600 CPU @ 3.4 GHz, 8 GB DDR3 RAM, 3 TB 7,200-RPM hard disk with 64 MB cache, running Ubuntu 16.10).

Table 1. Runtimes of converting grayscale TIFFs to JPEG2000s and to PDF/A-2b

No. of Files | Total File Size (MB) | Image Conversion Runtime (TIFFs to JP2s, seconds) | Total Runtime (TIFFs to JP2s to a single PDF/A-2b, seconds)
1 | 9.1 | 3.61 | 3.98
10 | 91.1 | 35.63 | 36.71
20 | 182.2 | 71.83 | 73.98
50 | 455.5 | 179.06 | 184.63
100 | 910.9 | 358.3 | 370.91

Table 2. Runtimes of converting color TIFFs to JPEG2000s and to PDF/A-2b

No. of Files | Total File Size (MB) | Image Conversion Runtime (TIFFs to JP2s, seconds) | Total Runtime (TIFFs to JP2s to a single PDF/A-2b, seconds)
1 | 27.3 | 14.80 | 14.94
10 | 273 | 150.51 | 151.55
20 | 546 | 289.95 | 293.21
50 | 1,415 | 741.89 | 749.75
100 | 2,730 | 1490.49 | 1509.23

The results show that (a) the majority of the runtime (more than 95 percent) is spent converting TIFFs to JPEG2000s using ImageMagick (see figure 2); (b) the average runtime of converting a TIFF has a constant positive relationship with the file's size (see figure 2); (c) the runtime of converting a color TIFF is significantly higher than that of converting a greyscale TIFF (see figure 2); and (d) it is feasible in terms of time and resources to convert existing master images of digital document collections to PDF/A-2b. For example, converting 1 TB of color TIFFs would take 552,831 seconds (153.5 hours; 6.398 days) on the above hardware. The authors have already processed more than 600,000 TIFFs using this method. The authors conclude that using PDF/A gives institutions the advantages of the newly preferred master file format for digitization of text documents over TIFF/JPEG2000.
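The 1 TB projection above can be reproduced by linearly scaling Table 2's largest measurement (2,730 MB of color TIFFs in 1,509.23 seconds). The following back-of-the-envelope sketch is not part of the authors' code; it treats 1 TB as 10^6 MB, which matches the article's figure of roughly 552,831 seconds.

```java
public class RuntimeEstimate {

    // Scale a measured runtime linearly to a target collection size
    // (the article observes runtime grows linearly with total file size).
    static double estimateSeconds(double measuredMb, double measuredSeconds, double targetMb) {
        return targetMb / measuredMb * measuredSeconds;
    }

    public static void main(String[] args) {
        // Table 2, last row: 100 color TIFFs, 2,730 MB, 1,509.23 s end to end.
        double seconds = estimateSeconds(2730, 1509.23, 1_000_000); // 1 TB taken as 10^6 MB
        System.out.printf("%.0f s = %.1f h = %.3f days%n",
                seconds, seconds / 3600, seconds / 86400);
    }
}
```

The same scaling can be applied to the grayscale figures in Table 1 to budget a mixed collection.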
The above implementation demonstrates the ease, the reasonable runtime, and the availability of open source software to perform such conversions. From both theoretical analysis and empirical evidence, the authors show that PDF/A has advantages over the traditionally preferred file format, TIFF, for digitization of text documents. Following best practice, a PDF/A file can be a self-contained and self-described container that accommodates all the data from digitization of textual materials, including page-level metadata and ICC profiles.

SUMMARY

The goal of this article is to present empirical evidence of using PDF/A for digitization of text documents. The authors evaluated and used multiple open source software programs for processing raster images, extracting image metadata, and generating PDF/A files. These PDF/A files were validated using the up-to-date PDF/A validators veraPDF and Acrobat Preflight. The authors also calculated the time complexity of the program and measured the total runtime in multiple test cases. Most of the runtime was spent on image conversion from TIFF to JPEG2000; creation of the PDF/A-2b file with associated page-level metadata accounted for less than 5 percent of the total runtime. The runtime for converting a color TIFF was much higher than that for a greyscale one. Our theoretical analysis and empirical examples show that using PDF/A-2 presents many advantages over the traditionally preferred file formats (TIFF/JPEG2000) for digitization of text documents.

Figure 2. File size, greyscale and color TIFFs, and runtime ratio.
APPENDIX A: SAMPLE TIFF METADATA WITH ICC HEADER

[The original sample is an ExifTool dump for a 3400 × 4680 pixel, 8-bits-per-channel, uncompressed, chunky-planar RGB TIFF at 400 × 400 dpi, with binary strip data extractable via ExifTool's -b option. The embedded ICC header describes the "EPSON sRGB" display device profile (version 2.2.0, CMM type APPL, profile connection space XYZ, created 2006:02:02 02:20:00, file signature acsp, primary platform Apple Computer Inc., rendering intent Perceptual, device attributes Reflective, Glossy, Positive, Color), including the connection-space illuminant (0.9642 1 0.82491), the red (0.43607 0.22249 0.01392), green (0.38515 0.71687 0.09708), and blue (0.14307 0.06061 0.7141) matrix columns, the media white point (0.95045 1 1.08905) and black point (0 0 0), three 8,204-byte tone-reproduction curves, and the copyright notice "Copyright (c) SEIKO EPSON CORPORATION 2000 - 2006. All rights reserved."]

APPENDIX B: SAMPLE CODE TO CONVERT PDF/A-2 BACK TO JPEG2000S

/* Assumption: the PDF/A-2b file was specifically generated from image objects
   converted from TIFF images with JPXDecode, along with page-level metadata */
public static void parse(String src, String dest) throws IOException {
    PdfReader reader = new PdfReader(src);
    PdfObject obj;
    int counter = 0;
    for (int i = 1; i <= reader.getXrefSize(); i++) {
        obj = reader.getPdfObject(i);
        if (obj != null && obj.isStream()) {
            PRStream stream = (PRStream) obj;
            byte[] b;
            try {
                b = PdfReader.getStreamBytes(stream);
            } catch (UnsupportedPdfException e) {
                b = PdfReader.getStreamBytesRaw(stream);
            }
            PdfObject pdfsubtype = stream.get(PdfName.SUBTYPE);
            FileOutputStream fos = null;
            if (pdfsubtype != null && pdfsubtype.toString().equals(PdfName.XML.toString())) {
                fos = new FileOutputStream(dest + "_xml/" + counter + ".xml");
                System.out.println("Page Metadata Extracted!");
            }
            if (pdfsubtype != null && pdfsubtype.toString().equals(PdfName.IMAGE.toString())) {
                counter++;
                fos = new FileOutputStream(dest + "_jp2/" + counter + ".jp2");
            }
            if (fos != null) {
                fos.write(b);
                fos.flush();
                fos.close();
                System.out.println("JPEG2000 Conversion from PDF completed!");
            }
        }
    }
    reader.close();
}
/* Then use the ImageMagick library to convert the JPEG2000s to TIFFs */

REFERENCES

1 PDF-Tools.com and PDF Association, “PDF/A—The Standard for Long-Term Archiving,” version 2.4, white paper, May 20, 2009, http://www.pdf-tools.com/public/downloads/whitepapers/whitepaper-pdfa.pdf; Duff Johnson, “White Paper: How to Implement PDF/A,” Talking PDF, August 24, 2010, https://talkingpdf.org/white-paper-how-to-implement-pdfa/; Alexandra Oettler, “PDF/A in a Nutshell 2.0: PDF for Long-Term Archiving,” Association for Digital Standards, 2013, https://www.pdfa.org/wp-content/until2016_uploads/2013/05/PDFA_in_a_Nutshell_211.pdf; Library of Congress, “PDF/A, PDF for Long-Term Preservation,” last modified July 27, 2017, https://www.loc.gov/preservation/digital/formats/fdd/fdd000318.shtml.

2 Library of Congress, “The Time and Place for PDF: An Interview with Duff Johnson of the PDF Association,” The Signal (blog), December 12, 2017, https://blogs.loc.gov/thesignal/2017/12/the-time-and-place-for-pdf-an-interview-with-duff-johnson-of-the-pdf-association/.

3 Yan Han, “Beyond TIFF and JPEG2000: PDF/A as an OAIS Submission Information Package Container,” Library Hi Tech 33, no. 3 (2015): 409–23, https://doi.org/10.1108/LHT-06-2015-0068.

4 Federal Agencies Digital Guidelines Initiative, Technical Guidelines for Digitizing Cultural Heritage Materials (Washington, DC: Federal Agencies Digital Guidelines Initiative, 2016), http://www.digitizationguidelines.gov/guidelines/FADGI%20Federal%20%20Agencies%20Digital%20Guidelines%20Initiative-2016%20Final_rev1.pdf.

5 Duff Johnson, “US Federal Agencies Approve PDF/A,” PDF Association, September 2, 2016, https://www.pdfa.org/new/us-federal-agencies-approve-pdfa/.

6 Bruno Lowagie, iText in Action, 2nd ed. (Stamford, CT: Manning, 2010).
7 “iText 5.4.4,” iText, last modified September 16, 2013, http://itextpdf.com/changelog/544.

8 Timothy Robert Hart and Denise de Vries, “Metadata Provenance and Vulnerability,” Information Technology and Libraries 36, no. 4 (2017), https://doi.org/10.6017/ital.v36i4.10146.

9 Johan van der Knijff, “JPEG 2000 for Long-Term Preservation: JP2 as a Preservation Format,” D-Lib Magazine 17, no. 5/6 (2011), https://doi.org/10.1045/may2011-vanderknijff.

10 PDF Association, “How veraPDF Does PDF/A Validation,” 2016, http://www.pdfa.org/how-verapdf-does-pdfa-validation/.