9. Data Quality and Linking
9.1 How good is Linked Open Data in practice?
Linked Open Vocabularies (LOV) project
– analyzes the usage of vocabularies
9.2 Quality
Linked Data Conformance vs. Quality
Conformance: following standards and best practices; technical dimension; can be evaluated automatically
Quality: how complete/correct/… the data is; content dimension; hard to evaluate automatically
Example: Crowd Evaluation of DBpedia
The quality of Linked Open Data is far from perfect, in terms of both conformance and content
Improving the quality is an active field of research
– Survey 2017: >40 approaches
– since then: a lot of work in KG embeddings
9.3 Links
Previously on Knowledge Graphs
- Integrate data from different sources
- Make connections between entities in those sources
- Facilitate cross data source queries
- Overcome data silos
Why do we need Links?
How do we Create the Links?
There is a lot of data, and many providers interlink their own datasets with other datasets
9.3.1 Tool Support
A plethora of names
Mostly used for schema level:
- Ontology matching/alignment/mapping
- Schema matching/mapping
Mostly used for the instance level:
- Instance matching/alignment
- Interlinking
- Link discovery
9.3.2 Automating Interlinking
Basic Interlinking Techniques
Sources for Interlinking Signals
Simple String Based Metrics
- String equality
  e.g. foo:University_of_Mannheim, bar:University_of_Mannheim
- Common prefixes
  e.g. foo:United_States, bar:United_States_of_America
- Common postfixes
  e.g. foo:Barack_Obama, bar:Obama
- Typical usage of prefixes/postfixes: |common|/max(length)
  foo:United_States, bar:United_States_of_America → 13/24
  foo:Barack_Obama, bar:Obama → 5/12
  (lengths count the underscore as a character)
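The prefix/postfix measure |common|/max(length) can be sketched in Python as follows. This is a minimal sketch, not a reference implementation: the function names (local_name, common_prefix_len, common_postfix_len, affix_similarity) are hypothetical, and it assumes that only the local name after the namespace prefix is compared and that underscores count as ordinary characters.

```python
def local_name(identifier: str) -> str:
    """Drop a namespace prefix such as 'foo:' (assumed notation)."""
    return identifier.split(":", 1)[-1]


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def common_postfix_len(a: str, b: str) -> int:
    """Number of trailing characters that a and b share."""
    return common_prefix_len(a[::-1], b[::-1])


def affix_similarity(a: str, b: str) -> float:
    """|common| / max(length), using the longer of common prefix and postfix."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    common = max(common_prefix_len(a, b), common_postfix_len(a, b))
    return common / longest


if __name__ == "__main__":
    a, b = local_name("foo:Barack_Obama"), local_name("bar:Obama")
    print(affix_similarity(a, b))  # common postfix "Obama": 5/12
```

Taking the maximum of prefix and postfix is a design choice made here for brevity; the two could just as well be kept as separate signals.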
Edit Distance
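As an illustration, the following sketch computes the Levenshtein distance, one common form of edit distance (insertions, deletions, and substitutions each cost 1). Choosing Levenshtein, and normalizing by the longer string to obtain a similarity in [0, 1], are assumptions made for this example; the names levenshtein and edit_similarity are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Normalize the distance to a similarity between 0 and 1."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("Mannheim", "Manheim"))  # one deletion
```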
N-gram based Similarity
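A possible sketch of an n-gram based similarity is given below. It assumes character trigrams and a Jaccard overlap of the two n-gram sets; other n-gram sizes or overlap measures (e.g., Dice) are equally plausible. The names char_ngrams and ngram_similarity are hypothetical.

```python
def char_ngrams(s: str, n: int = 3) -> set:
    """Set of all character n-grams of s (the whole string if it is shorter than n)."""
    if len(s) < n:
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap of the n-gram sets: |A ∩ B| / |A ∪ B|."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


if __name__ == "__main__":
    print(ngram_similarity("United_States", "United_States_of_America"))
```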
Typical Preprocessing Techniques
Language-specific Preprocessing
Using External Knowledge
From Matching Literals to Matching Entities
Preprocessing and Matching Pipelines
9.4 Schema Matching
9.5 Instance-based Matching
Enforcing 1:1 Mappings
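One simple way to enforce 1:1 mappings is a greedy selection over scored candidate pairs, sketched below. This is an assumption for illustration, not necessarily the strategy intended here; optimal assignment methods such as the Hungarian algorithm are an alternative. The function name greedy_one_to_one, the threshold parameter, and the example identifiers are hypothetical.

```python
def greedy_one_to_one(candidates, threshold=0.0):
    """Keep the best-scoring pairs so that each source and target is used at most once.

    candidates: iterable of (source, target, score) tuples.
    """
    used_sources, used_targets = set(), set()
    mapping = []
    # Visit candidates from the highest to the lowest score.
    for source, target, score in sorted(candidates, key=lambda c: c[2], reverse=True):
        if score < threshold:
            break  # all remaining candidates score even lower
        if source in used_sources or target in used_targets:
            continue  # would violate the 1:1 constraint
        mapping.append((source, target, score))
        used_sources.add(source)
        used_targets.add(target)
    return mapping


if __name__ == "__main__":
    pairs = [("foo:a", "bar:x", 0.9), ("foo:a", "bar:y", 0.8),
             ("foo:b", "bar:x", 0.7), ("foo:b", "bar:y", 0.6)]
    for pair in greedy_one_to_one(pairs):
        print(pair)
```

Greedy selection is easy to implement but can miss the mapping with the best total score, which is the trade-off against an optimal assignment.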
9.6 Matcher Combination
Evaluating Matchers
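A matcher's output is commonly compared against a reference alignment (gold standard) using precision, recall, and F1. The sketch below assumes that both the matcher output and the reference are available as sets of (source, target) pairs; the function name evaluate_matcher and the example pairs are hypothetical.

```python
def evaluate_matcher(found, reference):
    """Return (precision, recall, f1) of found pairs against a reference alignment."""
    found, reference = set(found), set(reference)
    true_positives = len(found & reference)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    found = {("foo:a", "bar:x"), ("foo:b", "bar:y"), ("foo:c", "bar:z")}
    reference = {("foo:a", "bar:x"), ("foo:b", "bar:y"),
                 ("foo:d", "bar:w"), ("foo:e", "bar:v")}
    print(evaluate_matcher(found, reference))
```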
Challenges in Matching