Towards connecting scholarly editions to corpora in the LiLa (Linking Latin) Knowledge Base of linguistic resources

Greta Franzini | CIRCSE, Università Cattolica del Sacro Cuore
greta.franzini@unicatt.it
Conference | Wuppertal, Germany | 17 December 2019

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme - Grant Agreement No. 769994.

Table of Contents
- Introduction
- Computational Linguistics
- Linked Data and Linguistic Linked Open Data
- LiLa: Linking Latin
- Scholarly Editions
- Linked Data
- Connection to LiLa
- Conclusion

Computational Linguistics - Definition

Computational Linguistics is an interdisciplinary field concerned with the processing of language by computers. (Mitkov, 2004)

- Computational Linguistics: develops computational methods and formalisms to answer linguistics questions.
- Natural Language Processing: solves engineering problems arising from the analysis of natural language text.
(adapted from Eisner, 2016)

Computational Linguistics - Linguistic Resources and NLP Tools

Automatic language processing requires linguistic resources and NLP tools.

Computational Linguistics - Linguistic Resources

- Dictionary: collection of words and phrases with information about them
  - Lexicon: dictionary/list of words, typically for computational purposes
  - Thesaurus: words grouped together according to similarity of meaning
- Ontology: inventory of objects or processes in a domain, together with a specification of some or all of the relations that hold among them, generally arranged as a hierarchy
- Corpus: a body of linguistic data in machine-readable form, gathered according to some principled sampling method and criterion. A syntactically/semantically annotated corpus is known as a treebank.
- Grammar: systematic analysis of the structure of a language

Computational Linguistics - NLP Tools

- Tokeniser: performs tokenisation, i.e. determines the boundaries of the individual tokens in a text (words, numbers, punctuation)
- Tagger: assigns tags to words or expressions in a text (e.g. part of speech, named entity)
- Parser: analyses a sentence or other string of words into its constituents, producing a parse tree of the syntactic relations between them
- Lemmatiser: groups the inflected forms of a word together under a base form, i.e. recovers the base form from an inflected form. Lemmatisation can be morphological (no context, so ambiguity remains) or morpho-syntactic (context is used, so ambiguity is resolved).
- ... and more.
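
The difference between a tokeniser and a context-free (morphological) lemmatiser can be sketched in a few lines of Python. This is only an illustration: the lexicon below is a hypothetical toy table, not the output of LEMLAT or any other tool named in these slides, and the function names are made up for the example.

```python
import re

# Hypothetical toy lexicon: inflected form -> candidate (lemma, part of speech) pairs.
TOY_LEXICON = {
    "prosequi": [("prosequor", "VERB")],
    "canis": [("canis", "NOUN"), ("cano", "VERB")],  # ambiguous without context
    "medicina": [("medicina", "NOUN")],
}

def tokenise(text):
    """Split a text into word tokens and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def lemmatise(token):
    """Context-free lemmatisation: return every candidate analysis of a form."""
    return TOY_LEXICON.get(token.lower(), [])

tokens = tokenise("Canis, medicina prosequi.")
analyses = {token: lemmatise(token) for token in tokens}
```

Because no context is consulted, "canis" keeps both of its candidate lemmas; a morpho-syntactic lemmatiser would have to choose between them using the surrounding words.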

Computational Linguistics - Linguistic Resources and NLP Tools for Latin

- Corpora: Perseus Digital Library, Eurasian Latin Archive, Corpus Grammaticorum Latinorum, Croatiae auctores Latini, Archivio della Latinità Italiana del Medioevo, Musisque Deoque, Patrologia Latina, PHI Classical Latin Texts, Index Thomisticus Treebank, PROIEL Latin Treebank, etc.
- Lexica: Vallex, IT-VaLex, Latin WordNet, Oxford Latin Dictionary, Du Cange Glossarium Mediae et Infimae Latinitatis, Thesaurus Linguae Latinae, Thesaurus Formarum Totius Latinitatis, Lexicon musicum Latinum medii aevi, etc.
- NLP Tools: LEMLAT, Whitaker’s Words, LatMor, TreeTagger, Collatinus, UDPipe, Chiron, etc.

Latin is the most resourced historical language.

Computational Linguistics - Problems with Linguistic Resources and NLP Tools

These resources and tools, however:
- are scattered and isolated
- are developed for specific tasks
- follow different annotation schemas and conceptual models

No interoperability!

Interoperability:
- increases productivity
- improves efficiency
- enables more effective knowledge organisation

Linked Data - A solution

Linked Data: Semantic Web technology
Semantic Web: set of standards and best practices to share data across the web, and to help machines make inferences and understand the meaning of this data.

Advantages:
- Connects and defines relationships between heterogeneous datasets
- Aggregates distributed datasets to reduce dispersion and increase (serendipitous) knowledge discovery (i.e. discoverability of the resource)
- Allows us to build systems that can reason across the web and answer complex questions

Linked Data - How does it work?

Linked Data technology describes data as triples (subject-predicate-object statements):
- The OBJECT of one triple can be the SUBJECT of another triple
- Nodes and edges are assigned persistent Uniform Resource Identifiers (URIs) for unambiguous identification across the web
- Relationships are described by ontologies or vocabularies of knowledge representation
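
A minimal Python sketch of the triple model, assuming plain tuples in place of a real RDF library. Every URI is a hypothetical placeholder under example.org, not an actual LiLa identifier, and the predicate names are invented for illustration. It shows how the object of one triple serves as the subject of the next.

```python
# All URIs are hypothetical placeholders, not real LiLa identifiers.
EX = "http://example.org/"

# Each triple is a (subject, predicate, object) statement.
triples = [
    (EX + "token/prosequi", EX + "hasLemma", EX + "lemma/prosequor"),
    (EX + "lemma/prosequor", EX + "hasPOS", EX + "pos/verb"),
    (EX + "lexicon/entry1", EX + "describes", EX + "lemma/prosequor"),
]

def objects(subject, predicate):
    """Return the objects of all triples with the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def follow(subject, *predicates):
    """Walk a chain of predicates, using each object as the next subject."""
    frontier = [subject]
    for predicate in predicates:
        frontier = [o for s in frontier for o in objects(s, predicate)]
    return frontier

# From a corpus token to the part of speech of its lemma, in two hops.
pos_of_token = follow(EX + "token/prosequi", EX + "hasLemma", EX + "hasPOS")

# Which resources point at the same lemma node?
linked_to_lemma = [s for s, p, o in triples if o == EX + "lemma/prosequor"]
```

Since the token and the lexicon entry both point at the same lemma URI, a query starting from either one can reach the other.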

Linked Data - Many domains
https://lod-cloud.net/

Linguistic Linked Data - Linguistics
https://linguistic-lod.org/llod-cloud

LiLa: Linking Latin - At a glance

- Funding: ERC Consolidator Grant, 2M EUR
- Duration: 2018-2023
- Team: 9 staff + student assistants
- Website: https://lila-erc.eu
- Objective: Knowledge Base of Linguistic Resources & Natural Language Processing Tools
- Method: Linked Data paradigm (FAIR principles)
- Purpose: Foster resource/data interoperability

LiLa: Structure - Lemmas as connectors

LiLa: Structure - Lemma bank

Lemma bank of LEMLAT, our morphological analyser (http://www.lemlat3.eu/).
Over 150,000 lemmas, including:
- Classical: 43,432 lemmas from Georges & Georges (1913-1918), Glare (1982), Gradenwitz (1904)
- Medieval and Late: 82,556 lemmas from Du Cange (1883-1887)
- Onomasticon: 26,250 lemmas from Forcellini (1940)

LiLa: Structure - Lemma bank

[Diagram labels: Corpus X, LiLa]

Corpus lemmas (LC) can’t connect to LiLa lemmas (LL) when:
- LC doesn’t exist in LL
- LC is a different written representation of an LL, e.g. annuncio vs. adnuntio
- LC is a lemma variant of an LL, e.g. anthropomorphita vs. anthropomorphitae (pluralia tantum)
- LC is a pseudo-lemma, i.e. a non-Latin word
- LC is a lemmatisation error, e.g. pbiectum instead of obiectum

Manual fix
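
The linking cases above can be sketched as a simple lookup. This is not LiLa's actual procedure: the miniature lemma bank, the variants table, and the status labels are all hypothetical, and they are populated only with the examples from the slide.

```python
# Hypothetical miniature lemma bank (the real one holds over 150,000 lemmas).
LEMMA_BANK = {"adnuntio", "anthropomorphitae", "obiectum", "prosequor"}

# Hypothetical table mapping known written representations and lemma
# variants to the form recorded in the lemma bank.
KNOWN_VARIANTS = {
    "annuncio": "adnuntio",
    "anthropomorphita": "anthropomorphitae",
}

def link(corpus_lemma):
    """Return (status, lemma_bank_entry) for a corpus lemma."""
    lemma = corpus_lemma.lower()
    if lemma in LEMMA_BANK:
        return ("exact", lemma)
    if lemma in KNOWN_VARIANTS:
        return ("variant", KNOWN_VARIANTS[lemma])
    # Missing lemmas, pseudo-lemmas and lemmatisation errors end up here.
    return ("manual", None)

corpus_lemmas = ["prosequor", "annuncio", "anthropomorphita", "pbiectum"]
report = {lc: link(lc) for lc in corpus_lemmas}
needs_manual_fix = [lc for lc, (status, _) in report.items() if status == "manual"]
```

In this sketch only "pbiectum" is left for manual review; a table of variants can absorb spelling differences, but it cannot by itself tell a typo from a lemma that is genuinely missing.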

LiLa: Structure - Conceptual and structural interoperability

To build and define relationships between datasets (triples), LiLa reuses the following ontologies:
- OntoLex (Lemon): for lexical information
- OLiA (Ontologies of Linguistic Annotation) bundle: for part-of-speech tagging
- NIF (NLP Interchange Format) and POWLA (OWL + PAULA, Potsdamer Austauschformat Linguistischer Annotationen): for corpus annotation

LiLa: Structure - Triplestore

LiLa = database of triples = triplestore

LiLa: Overview - Resources connected and upcoming connections

- Corpora
  - Index Thomisticus Treebank (Summa contra Gentiles)
  - Dante (700th death anniversary coming up!)
- Lexica
  - Word Formation Latin (Classical Latin)
  - Brill Etymological Dictionary of Latin and the Other Italic Languages
  - Latin WordNet
- NLP tools
  - LEMLAT (lemma bank)
I Lexica � Word Formation Latin (Classical Latin) � BRILL Etymological dictionary of Latin and the other Italic Languages � Latin WordNet I NLP tools � LEMLAT (lemma bank) Greta Franzini | CIRCSE, Università Cattolica del Sacro Cuore 20 LiLa: Overview Resources connected and upcoming connections I Corpora � Index Thomisticus Treebank (Summa contra Gentiles) � Dante (700th death anniversary coming up!) I Lexica � Word Formation Latin (Classical Latin) � BRILL Etymological dictionary of Latin and the other Italic Languages � Latin WordNet I NLP tools � LEMLAT (lemma bank) Greta Franzini | CIRCSE, Università Cattolica del Sacro Cuore 21 LiLa: Structure Querying the lemma bank Greta Franzini | CIRCSE, Università Cattolica del Sacro Cuore 22 LiLa: Structure An example: LOD view of ITTB token lemma prosequor Greta Franzini | CIRCSE, Università Cattolica del Sacro Cuore 23 LiLa: Structure An example: LOD view of LEMLAT lemma prosequor Greta Franzini | CIRCSE, Università Cattolica del Sacro Cuore 24 LiLa: Structure An example: LOD view of LEMLAT lemma prosequor Greta Franzini | CIRCSE, Università Cattolica del Sacro Cuore 25 LiLa: Structure An example: LOD view of LEMLAT lemma prosequor Greta Franzini | CIRCSE, Università Cattolica del Sacro Cuore 26 LiLa: Structure An example: LOD view of LEMLAT lemma prosequor Greta Franzini | CIRCSE, Università Cattolica del Sacro Cuore 27 LiLa: Structure An example: LOD view of LEMLAT lemma prosequor Greta Franzini | CIRCSE, Università Cattolica del Sacro Cuore 28 LiLa: Structure Querying corpora Greta Franzini | CIRCSE, Università Cattolica del Sacro Cuore 29 LiLa: Structure An example: LOD view of ITTB token prosequi Greta Franzini | CIRCSE, Università Cattolica del Sacro Cuore 30 LiLa: Structure An example: LOD view of ITTB token prosequi Greta Franzini | CIRCSE, Università Cattolica del Sacro Cuore 31 LiLa: Structure An example: LOD view of ITTB token prosequi Greta Franzini | CIRCSE, Università Cattolica del Sacro Cuore 32 LiLa: 
Structure An example: LOD view of ITTB token prosequi Greta Franzini | CIRCSE, Università Cattolica del Sacro Cuore 33 LiLa: Structure An example: LOD view of ITTB token prosequi eiusdem autem est unum contrario- rum prosequi et aliud refutare sicut medicina , quae sanitatem operatur , aegritudinem excludit . (ITTB, 1.1.6) Now it belongs to the same thing to pursue one contrary and to remove the other: thus medicine, which ef- fects health, removes sickness. (Trans. Laurence Shapcote) Greta Franzini | CIRCSE, Università Cattolica del Sacro Cuore https://aquinas.cc/la/en/~SCG1 https://aquinas.cc/la/en/~SCG1 34 LiLa: Structure LodLive interface https://lila-erc.eu/lodlive/ Greta Franzini | CIRCSE, Università Cattolica del Sacro Cuore https://lila-erc.eu/lodlive/ 35 LiLa: Structure LiLa as mere reflection LiLa reflects the annotation granularity of the resources it connects No data enrichment or further analysis is performed Greta Franzini | CIRCSE, Università Cattolica del Sacro Cuore 36 LiLa: Requirements Connecting resources in the Knowledge Base To enter the LiLa Knowledge Base, a textual resource must be: I Lemmatised I Part-of-Speech tagged (ideally, using the Universal Dependencies tagset) I Online! Greta Franzini | CIRCSE, Università Cattolica del Sacro Cuore 36 LiLa: Requirements Connecting resources in the Knowledge Base To enter the LiLa Knowledge Base, a textual resource must be: I Lemmatised I Part-of-Speech tagged (ideally, using the Universal Dependencies tagset) I Online! Greta Franzini | CIRCSE, Università Cattolica del Sacro Cuore 36 LiLa: Requirements Connecting resources in the Knowledge Base To enter the LiLa Knowledge Base, a textual resource must be: I Lemmatised I Part-of-Speech tagged (ideally, using the Universal Dependencies tagset) I Online! 
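
The idea of a triplestore and the two annotation requirements listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ex:` identifiers, the predicate names (`ex:hasLemma`, `ex:hasUPOS`, `ex:writtenRep`) and the helper functions are hypothetical and do not reproduce LiLa's actual URIs, ontology properties or software.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical identifiers for illustration; not LiLa's actual URIs or properties.
TRIPLES = [
    Triple("ex:token/prosequi", "ex:hasLemma", "ex:lemma/prosequor"),
    Triple("ex:token/prosequi", "ex:hasUPOS", "VERB"),
    Triple("ex:lemma/prosequor", "ex:writtenRep", "prosequor"),
    # A token that has not been lemmatised or PoS-tagged yet.
    Triple("ex:token/medicina", "ex:writtenRep", "medicina"),
]


def objects(triples, subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [t.obj for t in triples if t.subject == subject and t.predicate == predicate]


def is_connectable(triples, token):
    """Check the two annotation requirements: a lemma and a PoS tag."""
    has_lemma = bool(objects(triples, token, "ex:hasLemma"))
    has_pos = bool(objects(triples, token, "ex:hasUPOS"))
    return has_lemma and has_pos


print(objects(TRIPLES, "ex:token/prosequi", "ex:hasLemma"))  # ['ex:lemma/prosequor']
print(is_connectable(TRIPLES, "ex:token/prosequi"))          # True
print(is_connectable(TRIPLES, "ex:token/medicina"))          # False
```

In this toy model the lemma acts as the shared node: any resource whose tokens point to `ex:lemma/prosequor` becomes reachable from any other resource that points to the same lemma.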

Scholarly Editions
Linked Data

“[...] computational philology seems to be somewhat decoupled from the recent progress in [Linguistic Linked Open Data]: even though LOD as a concept is gaining significant popularity in Digital Humanities, existing LLOD standards and vocabularies are not widely used in this community, and philological resources are underrepresented in the LLOD cloud diagram [...].” (Chiarcos et al., 2018)

“[...] As of yet only a relatively small number of born-digital editions of [...] Latin texts exists [...].” (Fischer, 2017)

Of these, only a handful provide (some) data in Linked Data format.

Scholarly Edition
Information layers

Why so few Linked Data-compatible editions of Latin texts?
Possible reasons:
- Projects lack the know-how and/or have other priorities
- Scholarly editions are complex objects. Many layers of information, including:
  1. Textual, i.e. the transcription
  2. Bibliographic, e.g. properties of the edition
  3. Source, e.g. date, material, scribe, binding, folio count, size, etc.
  4. Linguistic, e.g. lemma, etc.
  5. Palaeographic, e.g. abbreviations, ligatures, glyphs, allographs, etc.

Scholarly Edition
Information layers

Linked Data support:
- Bibliographic + Textual
  - FABiO (FRBR-aligned Bibliographic Ontology)
  - CiTO (Citation Typing Ontology)
  - DC (Dublin Core)
- Source
  - DM2E (Digitised Manuscripts to Europeana)
  - FRBRoo (FRBR-object oriented)
  - SAWS (Sharing Ancient Wisdoms): http://www.ancientwisdoms.ac.uk/media/ontology/sawsOntology.owl
- Linguistic
  - OntoLex, NIF, POWLA, OLiA
- Palaeographic
  - Peter Stokes: DigiPal project (http://www.digipal.eu/about/the-project/)
  - Paolo Monella: VeDPH seminar, 4th December 2019 (http://www1.unipa.it/paolo.monella/babel2019/)

Scholarly Editions
Linked Data

Example:
- Vespasiano da Bisticci, Letters (not lemmatised/PoS-tagged!): http://vespasianodabisticciletters.unibo.it/
Tools:
- TEI-to-RDF converters (e.g. RDF Textual Encoding Framework: http://rdftef.sourceforge.net/)
- Linked Data support for the Edition Visualisation Technology (upcoming talk by Roberto Rosselli del Turco and Paolo Monella at AIUCD 2020: https://aiucd2020.unicatt.it/aiucd-programma)
Initiatives:
- Workshop Scholarly Digital Editions, Graph Data-Models and Semantic Web Technologies (GraphSDE, 3-4.06.2019): http://wp.unil.ch/graphsde/

Scholarly editions
Hypothetical (and brutally simplistic) Corpus + Edition Linked Data scenario

Conclusion
Scholarly Editions and Corpora: Mutual benefits

Linguistic corpora:
- provide new forms of access to editions
- provide the bigger picture, i.e. large and diachronic linguistic context
Scholarly editions:
- provide new forms of access to corpora
- provide connections to cultural heritage objects
- provide a philological layer of annotation (textual criticism)

Thanks!
Get in touch

Greta Franzini
CIRCSE, Università Cattolica del Sacro Cuore
Largo Gemelli 1, 20123 Milan, Italy
greta.franzini@unicatt.it
@ERC_LiLa
https://github.com/CIRCSE
https://lila-erc.eu

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme - Grant Agreement No. 769994.

Works cited

- Chiarcos et al. (2018) ‘Towards a Linked Open Data Edition of Sumerian Corpora’, Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), May 7-12, Miyazaki, Japan. ISBN: 979-10-95546-00-9
- Eisner, J. (2016) How is computational linguistics different from natural language processing? https://www.quora.com/How-is-computational-linguistics-different-from-natural-language-processing
- Fischer, F. (2017) ‘Digital Corpora and Scholarly Editions of Latin Texts: Features and Requirements for Textual Criticism’, Speculum, 92/S1. DOI: 10.1086/693823
- Mitkov, R. (2004) The Oxford Handbook of Computational Linguistics. Oxford: Oxford University Press

LiLa: Structure
An example: LOD view of LEMLAT lemma prosequor

LiLa: Structure
Participles vs. adjectives

Ambiguity
Solution

LiLa: Structure
An example: LOD view of LEMLAT lemma prosequor

SPARQL endpoint with graphical interface to query against the LiLa triplestore.

EvaLatin
Participate!

- Evaluation campaign designed following a long tradition in NLP (MUC, ACE, SemEval, CoNLL...)
- Shared tasks, shared training and test data, shared evaluation metrics
- 2 tasks:
  1. PoS tagging
  2. Lemmatisation
- 3 sub-tasks for each task:
  1. Basic
  2. Cross-Genre
  3. Cross-Time
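
As a rough illustration of what a shared evaluation metric for lemmatisation or PoS tagging might look like, the sketch below computes token-level accuracy against a gold standard. The slides do not name the metrics used by the campaign, so token-level accuracy is an assumption here, and the function name and the toy data (lemmas for the tokens prosequi, et, aliud, refutare) are made up for the example.

```python
def accuracy(gold, predicted):
    """Share of tokens whose predicted label equals the gold label.

    Assumes both sequences are aligned token by token.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy data, for illustration only: lemmas for "prosequi et aliud refutare".
gold_lemmas = ["prosequor", "et", "alius", "refuto"]
system_lemmas = ["prosequor", "et", "aliud", "refuto"]  # one lemma is wrong

print(accuracy(gold_lemmas, system_lemmas))  # 0.75
```

The same function applies unchanged to PoS tags, since it only compares two aligned sequences of labels.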