Division of semantic labor in the Global WordNet Grid. Piek Vossen, VU University Amsterdam; German Rigau, University of the Basque Country. 5th Global Wordnet.

  • Published on
    27-Mar-2015

Transcript

Slide 1
Division of semantic labor in the Global WordNet Grid
Piek Vossen, VU University Amsterdam
German Rigau, University of the Basque Country
5th Global Wordnet Conference, Mumbai, India, Jan 30 to Feb 5, 2010

Slide 2
Overview
  • KYOTO as a domain implementation of the Global Wordnet Grid
  • Scope of knowledge integration
  • Division of linguistic labor
  • How to integrate resources?
  • How to make inferences?

Slide 3
KYOTO: some statistics
  • European-Asian project
  • March 2008 to March 2011
  • 7 countries (The Netherlands, Italy, Germany, Spain, Taiwan, Japan, Czech Republic)
  • 12 sites:
      • Universities & research institutes: VUA, CNR-ILC, CNR-IIT, BBAW, EHU, AS, NICT, Masaryk
      • Companies: Synthema, Irion
      • User organizations: ECNC, WWF
  • 7 languages (English, Italian, Japanese, Dutch, Spanish, Basque, Chinese)

Slide 4
KYOTO: Overall architecture
Overview of the KYOTO process

Slide 5
GWC2010, Mumbai
Applying ontology mappings

Slide 6
Global Wordnet Grid
[Diagram labels: Domain, Ontology, Base concepts, Wn, DOLCE/SUMO, OntoWordnet]

Slide 7
Available repositories in KYOTO
Environment domain
  • Term database: 500,000 terms per 1,000 documents per language
  • Open data project: DBPedia: 2.6 million things, including at least 213,000 persons, 328,000 places, 57,000 music albums, 36,000 films, 20,000 companies.
    The knowledge base consists of 274 million pieces of information (RDF triples).
  • GeoNames: 8 million geographical names; 6.5 million unique features, of which 2.2 million populated places and 1.8 million alternate names
  • Domain thesauri and taxonomies: Species 2000: 2.1 million species
  • Wordnets for 7 languages: about 50,000 to 120,000 synsets per language
  • Ontologies: SUMO, DOLCE, SIMPLE

Slide 8
Kyoto Knowledge Base
[Diagram labels: Domain, Species, Ontology, Base concepts, Wn, DBPedia, Terms 500K, Species 2,100K, DOLCE/SUMO, OntoWordnet]

Slide 9
Species in the ontology
  • Implies storing 2.1 million species twice in the ontology.

Slide 10
Should all knowledge be stored in the central ontology?
  • Vocabularies are too large for full inferencing with current reasoners
  • Vocabularies are linguistically too diverse to be represented in an ontology
  • Inferencing capabilities of formal ontologies are not needed for all levels of knowledge

Slide 11
Modeling knowledge in a domain
Knowledge needs to be divided over different lexical and ontological layers:
  • Precisely define the relations between lexical and ontological layers
  • Precisely define the inferencing based on the distributed knowledge layers

Slide 12
Division of linguistic labor principle
Putnam 1975:
  • No need to know all the necessary and sufficient properties to determine if something is "gold"
  • Assume that there is a way to determine these properties and that domain experts know how to recognize instances of these concepts.
  • Speakers can still use the word "gold" and communicate useful information

Slide 13
Division of semantic labor principle
Digital version of Putnam (1975):
  • The computer does not need to have all the necessary and sufficient properties to determine if something is a "European tree frog"
  • The computer assumes that there is a way to determine this and that domain experts (people) know how to recognize instances of these concepts.
  • Computers can still reason with semantics and do useful stuff with textual data

Slide 14
What does the computer need to know?
Distinction between rigid and non-rigid (Welty & Guarino 2002):
  • being a "cat" is essential to an individual's existence and therefore rigid
  • being a "pet" is a temporary role and therefore non-rigid; a cat can become a pet and stop being a pet without ceasing to exist
  • Felix is born as a cat and will always be a cat, but during some period Felix can become a pet and stop being a pet while he continues to exist as a cat
  • All 2.1 million species are rigid concepts

Slide 15
What does the computer need to know?
Roles and processes in documents have more information value than the defining properties of species:
  • Species are defined in terms of physical properties already known to experts;
  • Roles such as "invasive species", "migration species", "threatened species" express THE important properties of instances of species
  • Roles are typically the terms we learn from the text, not the species!
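The rigid/non-rigid distinction on Slide 14 can be illustrated with a minimal Python sketch. It is not taken from the slides: the `Individual` class, its fields, and the years are hypothetical and only mirror the Felix example.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Individual:
    """An individual with one rigid type and any number of time-bound roles."""
    name: str
    rigid_type: str  # holds for the individual's whole existence
    # Each role is (role name, first year, year the role ends, exclusive).
    roles: List[Tuple[str, int, int]] = field(default_factory=list)

    def add_role(self, role: str, start: int, end: int) -> None:
        self.roles.append((role, start, end))

    def types_at(self, year: int) -> Set[str]:
        """Return the rigid type plus every role active in the given year."""
        active = {role for role, start, end in self.roles if start <= year < end}
        return {self.rigid_type} | active


# Hypothetical years: Felix is a pet from 2005 up to (not including) 2008.
felix = Individual("Felix", rigid_type="cat")
felix.add_role("pet", 2005, 2008)
```

In this sketch the rigid type is always part of the answer, while a role such as "pet" appears only for the period in which it is played.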
Slide 16
Wordnet-ontology-relations
Rigid synset relations to ontology: Synset:Endurant(Object); Synset:Perdurant(Event); Synset:Quality:
  • sc_equivalenceOf (= relation in WN-SUMO) or
  • sc_subclassOf (+ relation in WN-SUMO)
Non-rigid synset relations to ontology: Synset:Role; Synset:Endurant(Object); Synset:Perdurant(Event)
  • sc_domainOf: range of ontology types that restricts a role
  • sc_playRole: role that is being played
  • sc_participantOf: the process in which the role is played
Rigidity can be detected automatically (Rudify, 80% precision, IAG 80%) and is stored in wordnets as attributes to synsets

Slide 17
Global Wordnet Grid Model
[Diagram]
KYOTO Ontology in OWL-DL (Extension of DOLCE LT):
  • perdurant, change-of-location, migration
  • endurant, object, organism, bird
  • role, done-by, has-source, has-destination, has-path
  • labels: some, has
English Wordnet in WN-LMF:
  • bird_1_N: sc_equivalentOf bird; rigid
  • migration_bird_1_N: sc_domainOf bird; sc_playRole done-by; sc_participantOf migration; non-rigid
  • migration_4_N: sc_equivalentOf migration
  • migrate_1_V: sc_equivalentOf migration
  • duck_1_N: rigid; hyponym; subclass

Slide 18
Global Wordnet Grid Model
[Diagram: the ontology and English Wordnet entries of Slide 17, extended with the following]
Dutch Wordnet:
  • migrerende dieren_1_N (migrating species): sc_domainOf organism; sc_playRole done-by; sc_participantOf migration; non-rigid; equivalent_hypernym eng-30-02356039-n (bird)
  • eend_1_N (duck): equivalent eng-30-01254614-n (duck)
Spanish Wn, Basque Wn, Italian Wn, Japanese Wn, Chinese Wn, ....
Cross-lingual equivalence mappings are expressed through wordnet mappings

Slide 19
Wordnet to ontology mappings
{create, produce, make} Verb, English
  -> sc_equivalenceOf construction
{artifact, artefact} Noun, English
  -> sc_domainOf physical_object
  -> sc_playRole result-existence
  -> sc_participantOf construction
{kunststof} Noun, Dutch // lit. artifact substance
  -> sc_domainOf amount_of_matter
  -> sc_playRole result-existence
  -> sc_participantOf construction
{meat} Noun, English
  -> sc_domainOf cow, sheep, pig
  -> sc_playRole patient
  -> sc_participantOf eat
{,, } Noun, Chinese
  -> sc_domainOf animal
  -> sc_playRole patient
  -> sc_participantOf eat
{ , , } Noun, Arabic
  -> sc_domainOf cow, sheep
  -> sc_playRole patient
  -> sc_participantOf eat

Slide 20
Wordnet to ontology mappings
{teacher} Noun, English
  -> sc_domainOf human
  -> sc_playRole done-by
  -> sc_participantOf teach
{leraar} Noun, Dutch // lit. male teacher
  -> sc_domainOf man
  -> sc_playRole done-by
  -> sc_participantOf teach
{lerares} Noun, Dutch // lit. female teacher
  -> sc_domainOf woman
  -> sc_playRole done-by
  -> sc_participantOf teach

Slide 21
Wordnet-LMF

Slide 22
WN-LMF Synset relations

Slide 23
WN-LMF Synset relations

Slide 24
Division of labor in knowledge sources
[Diagram]
  • SKOS database (2.1 million species): Eleutherodactylus augusti, Eleutherodactylus, Leptodactylidae, Anura, Amphibia, Chordata, Animalia; Eleutherodactylus atrabracus
  • Wordnet-LMF (100,000 synsets): barking frog; {frog:1, toad:1, toad frog:1, anuran:1, batrachian:1, salientian:1}; amphibian:3; {vertebrate:1, craniate:1}; chordate:1; animal:1; Base Concept
  • Ontology-OWL-DL (2,000 types): endurant, physical-object, organism; perdurant, endanger
  • Term database (500,000 terms): endemic frog, endangered frog, poisonous frog, alien frog

Slide 25
How to make inferences?
  • SPARQL queries to large Virtuoso databases: Aligned Species 2000, DBPedia
  • SQL queries to term database
  • Graph matching on wordnets stored in DebVisDic
  • Reasoning on a small ontology

Slide 26
KYOTO Project meeting, Jan 13-14th 2010, PolyU Hong Kong
Ontotagger applied to KAF
  • Apply WSD to every term in the KAF representation of a text
  • For each term in the KAF representation of a text:
      (a) If wordnet synset (WSD), then check for ontology mappings; if none, traverse the wordnet hierarchy to find the first mapping
      (b) Else check the SKOS database for a wordnet mapping; if necessary, traverse broader relations up to the first wordnet mapping and go to (a)
      (c) Else check the term database for wordnet mappings; if necessary, traverse parent relations up to the first wordnet mapping and go to (a)
  • Collect all mappings from the ontology and all (relevant) ontological implications and insert them into the KAF representation of the text.

Slide 27
Examples
  1. Migration birds in the Humber Estuary.
  2. The migration of birds to the Humber Estuary
  3. Bird migration in the Humber Estuary
  4. Birds that migrate to the Humber Estuary

Slide 28
Annotation of ontological implications in KAF

Slide 29
Annotation of ontological implications in KAF

Slide 30
Annotation of ontological implications in KAF

Slide 31
Kybot profiles
IF T1 + to + T2 & T1.impliedType="change_of_location" & T1.impliedRole="has-target" & T2.Type="location" THEN
IF T1 + from + T2 & T1.impliedType="change_of_location" & T1.impliedRole="has-source" & T2.Type="location" THEN

Slide 32
Kybot Knowledge Patterns

Slide 33
Conclusion: Should all knowledge be stored in the central ontology?
  • Vocabularies are too large for full inferencing
  • Vocabularies are linguistically too diverse to be represented in an ontology
  • Inferencing capabilities of formal ontologies are not needed for all levels of knowledge
  • A model of division of labor (along the lines of Putnam 1975) in which knowledge is stored in 3 layers:
      • SKOS vocabularies and term databases
      • wordnet (WN-LMF)
      • ontology (OWL-DL)
  • Each layer supports different types of inferencing, ranging from SPARQL queries and graph algorithms to reasoning.
  • Mapping relations that support the division of labour and different types of inferencing and that allow for the encoding of language-specific lexicalizations and restrictions.

Slide 34
Conclusions
  • Ontologies are abstract and minimal and lexicons are large and rich
  • Semantic relations in lexicons are complementary to ontological relations
  • Semantic relations expressed in a language system should be compatible with ontologies
  • Large vocabularies of types (rigid things in the world) can be mapped to the ontology through combinations of lexical relations and basic ontological mappings
  • Lexicalizations of contextual and subjective concepts need to be expressed through more complex relations
  • Equivalences across languages are expressed partially through ontological expressions and partially across lexicons

Slide 35
Applying WSD to terms

Slide 36
How to integrate the data?
Species 2000 vocabulary:
  • 2,171,281 concepts in a MySQL database with parent relations:
      Kingdom -> Class -> Order -> Family -> Genus -> Species -> Infra species
      Animalia -> Chordata -> Amphibia -> Anura -> Leptodactylidae -> Eleutherodactylus -> Eleutherodactylus augusti
  • Converted to SKOS format
  • Aligned with DBPedia for language labels
  • Aligned with Wordnet using vocabulary and relation mappings
  • Published in Virtuoso, accessed with SPARQL queries

Slide 37
How to integrate data?
Extending language labels using DBPedia

Language    Species 2000    DBPedia extension
English           69,045              834,821
Spanish            1,731              358,499
Italian           17,552              215,511
Dutch              5,397              185,437
Chinese           58,774               83,756
Japanese           4,625              139,754

Slide 38
How to integrate data? Alignment of Species 2000 with wordnet
  • Vocabulary match with Wordnet synsets
  • If polysemous, then SSI-Dijkstra weighting of senses based on the hyperonym chain
  • Results still to be evaluated:
      Animalia (animal:1) -> Chordata (chordate:1) -> Amphibia (amphibian:3) -> Anura -> Leptodactylidae -> Eleutherodactylus -> Eleutherodactylus augusti (barking frog:1)

Slide 39
How to integrate data? Alignment of terms with wordnet
  • Word-sense-disambiguation is applied to terms in KAF (Kyoto Annotation Format)
  • Term hierarchy is extracted from KAF:
      land:5
        grassland:1 -> biome:1
        woodland:1 -> biome:1
        cropland
        urban land
  • Results still to be evaluated: SemEval2010
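The lookup cascade of the Ontotagger (Slide 26, steps a to c) can be sketched in Python as follows. This is a minimal illustration and not the KYOTO implementation: the dictionaries stand in for the wordnet, the SKOS database, and the term database, the function names `climb` and `ontology_type` are hypothetical, and the mapping of `animal:1` to `organism` and the term "frog" are assumptions. The species names and synsets are taken from Slides 24 and 38.

```python
from typing import Dict, Optional

# Toy stand-ins for the three knowledge layers. All tables are illustrative.
ONTOLOGY_MAPPING: Dict[str, str] = {"animal:1": "organism"}  # assumed mapping
HYPERNYM: Dict[str, str] = {  # wordnet hypernym chain
    "barking frog:1": "frog:1",
    "frog:1": "amphibian:3",
    "amphibian:3": "vertebrate:1",
    "vertebrate:1": "chordate:1",
    "chordate:1": "animal:1",
}
SKOS_TO_SYNSET: Dict[str, str] = {
    "Eleutherodactylus augusti": "barking frog:1",
    "Amphibia": "amphibian:3",
}
SKOS_BROADER: Dict[str, str] = {
    "Eleutherodactylus atrabracus": "Eleutherodactylus",
    "Eleutherodactylus": "Leptodactylidae",
    "Leptodactylidae": "Anura",
    "Anura": "Amphibia",
}
TERM_TO_SYNSET: Dict[str, str] = {"frog": "frog:1"}  # assumed term entry
TERM_PARENT: Dict[str, str] = {"endangered frog": "frog"}


def climb(start: str, direct: Dict[str, str], parent: Dict[str, str]) -> Optional[str]:
    """Follow parent links from start until a node with a direct mapping is found."""
    node: Optional[str] = start
    seen = set()  # guards against cycles in the toy data
    while node is not None and node not in seen:
        if node in direct:
            return direct[node]
        seen.add(node)
        node = parent.get(node)
    return None


def ontology_type(term: str, synset: Optional[str] = None) -> Optional[str]:
    """Return the first ontology type reachable for a term, or None."""
    if synset is None:
        # (b) no synset from WSD: try the SKOS database, climbing broader relations
        synset = climb(term, SKOS_TO_SYNSET, SKOS_BROADER)
    if synset is None:
        # (c) still nothing: try the term database, climbing parent relations
        synset = climb(term, TERM_TO_SYNSET, TERM_PARENT)
    if synset is None:
        return None
    # (a) from the synset, climb the wordnet hierarchy to the first ontology mapping
    return climb(synset, ONTOLOGY_MAPPING, HYPERNYM)
```

With this toy data, a species without its own synset (Eleutherodactylus atrabracus) and a domain term (endangered frog) both end up at the same ontology type as the synset frog:1, which is the point of dividing the labor over the three layers.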
