Publication: Framework for quality assurance of semantic resources and repositories
Authors
Abad Navarro, Francisco
Secondary author
Escuela Internacional de Doctorado
Directors
Fernández Breis, Jesualdo Tomás ; Martínez Costa, Catalina
Publisher
Universidad de Murcia
Type
info:eu-repo/semantics/doctoralThesis
Description
Thesis by compendium of publications
Abstract
This thesis addresses the challenge of evaluating the quality of semantic artefacts, focusing on ontologies as a fundamental mechanism for representing knowledge in the Semantic Web. Ontologies encode structured knowledge that can be reused and extended by other users; ensuring their quality is therefore essential to guarantee the consistency of the content they hold and of the knowledge generated from them.
OWL, the language recommended by the W3C for describing ontologies, is a formal language based on description logics that represents knowledge through logical axioms. OWL also supports the inclusion of human-readable content, such as labels, names, or descriptions, which improves readability and facilitates the reuse of ontologies. However, this lexical component, key to human understanding, has traditionally received less attention in quality-evaluation processes.
The main objective of this thesis is to exploit the human-readable content present in ontologies to develop new methods for measuring and improving their quality. To that end, five specific objectives were defined:
OB1: Develop graph representation models that exploit the structural aspects of ontologies.
OB2: Design enrichment methods based on lexical information.
OB3: Define a process for evaluating quality from the perspective of human-readable content.
OB4: Propose quantitative metrics derived from that information.
OB5: Evaluate the developed methodologies in real scenarios.
The methodology followed in this thesis integrates structural analysis, lexical evaluation, and language-model-based techniques. First, the Graphlib library was developed, which represents ontologies as graphs and computes relevant structural properties, such as shortest paths and hierarchical relations between entities. This served as the foundation for subsequent analyses.
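The graph view described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the actual Graphlib API; the subclass hierarchy and concept names are invented for the example:

```python
from collections import deque

# Toy subclass hierarchy (child -> parents), standing in for ontology axioms.
# All names here are hypothetical, not taken from any real ontology.
SUBCLASS_OF = {
    "insulin": ["hormone"],
    "hormone": ["protein"],
    "protein": ["molecule"],
    "enzyme": ["protein"],
}

def shortest_path(graph, source, target):
    """Breadth-first search over subclass edges; returns the node path or None."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for parent in graph.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(path + [parent])
    return None

print(shortest_path(SUBCLASS_OF, "insulin", "molecule"))
# ['insulin', 'hormone', 'protein', 'molecule']
```

Once the ontology is a graph, standard algorithms such as breadth-first search give structural properties (path lengths, common ancestors) almost for free, which is the point of the graph representation.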
Second, the enrichment techniques are based on identifying the hidden semantics present in entity names, following the idea that lexical information reflects semantic patterns that are not always formalised through logical axioms in the ontology. These methods detect concepts that are lexically related but not semantically connected, which helps identify missing axioms or naming inconsistencies.
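The core of this idea can be sketched as a check over label tokens versus axiom links. The labels, identifiers, and token-overlap heuristic below are a simplified illustration under invented data, not the thesis implementation:

```python
# Hypothetical concept labels and axiom links (e.g. C2 subClassOf C1).
LABELS = {
    "C1": "heart disease",
    "C2": "heart valve disease",
    "C3": "valve disease",
}
AXIOM_LINKS = {("C2", "C1")}

def missing_link_candidates(labels, links):
    """Entity pairs that share label tokens but are not connected by any axiom."""
    candidates = []
    ids = sorted(labels)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            shared = set(labels[a].split()) & set(labels[b].split())
            linked = (a, b) in links or (b, a) in links
            if shared and not linked:
                candidates.append((a, b, shared))
    return candidates

# C1/C3 and C2/C3 share tokens but have no axiom between them,
# so they surface as candidates for missing axioms (or bad naming).
print(missing_link_candidates(LABELS, AXIOM_LINKS))
```

Each reported pair is only a candidate: a human (or a downstream heuristic) still decides whether the gap is a missing axiom or an acceptable naming coincidence.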
The main methodological contributions are the HURON and OCALM evaluation frameworks. HURON provides a systematic process for evaluating human readability, defining good practices for including human-readable content in ontologies together with quantitative metrics that measure the extent to which those practices are adopted; it thus addresses a critical but usually underestimated aspect of ontology design. OCALM evaluates domain coverage using language models such as FastText, BERT, and OWL2VEC*, measuring the alignment between the concepts of an ontology and the terminology of a domain-specific corpus. Both frameworks were applied to widely used ontologies, such as SNOMED CT and the Gene Ontology, demonstrating their robustness and applicability.
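The alignment idea behind coverage scoring can be illustrated without a trained language model. The sketch below replaces FastText/BERT/OWL2VEC* embeddings with simple token-count vectors and cosine similarity; the ontology labels and term list are invented, and the aggregation (mean of best matches) is an assumption for the example, not OCALM's actual metric:

```python
import math
from collections import Counter

def vec(text):
    """Bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def coverage(concept_labels, corpus_terms):
    """Mean best-match similarity of each corpus term against the ontology labels."""
    scores = [max(cosine(vec(term), vec(lbl)) for lbl in concept_labels)
              for term in corpus_terms]
    return sum(scores) / len(scores)

# Hypothetical mini-ontology scored against a small cardiology term list.
onto = ["heart disease", "blood vessel", "cardiac muscle"]
terms = ["heart disease", "cardiac muscle tissue"]
print(round(coverage(onto, terms), 2))
# 0.91
```

Swapping `vec` for dense embeddings is what lets the real framework match terms that share meaning but no surface tokens, which token counting cannot do.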
The results show that the proposed methods detect quality aspects of the ontologies studied. HURON revealed naming inconsistencies, ambiguous labels, and missing human-readable content in OBO Foundry ontologies. In turn, OCALM reliably selected the most suitable ontology for domains defined through natural-language text.
Overall, this thesis contributes to the goal of making the Semantic Web not only machine-interpretable but also more comprehensible and usable by people, improving the reliability and usability of ontologies through evaluations focused on readability, consistency, and domain coverage.
This thesis addresses the challenge of evaluating the quality of semantic artefacts, focusing on ontologies as a key mechanism for representing knowledge in the Semantic Web. Ontologies are building blocks that encode knowledge, and this knowledge can be reused by others to build new knowledge on top of it. It is therefore important to assure their quality, so that their content, and the content created from them, is consistent.
OWL is the language recommended by the W3C for describing ontologies. It is a machine-actionable, formal language based on description logics, enabling the creation of knowledge through logical axioms. OWL also permits the attachment of human-readable (lexical) content, such as descriptions or names, to the logical concepts, increasing human readability and facilitating the reuse of ontologies.
The main objective of the thesis is to use the human-readable content included in ontologies to develop new methods for improving and measuring their quality. The following objectives were stated accordingly:
OB1: Development of graph representation models of ontologies to exploit structural aspects.
OB2: Development of enrichment methods based on lexical information.
OB3: Specification of a process to assure the quality of ontologies from the human-readable content perspective.
OB4: Development of new quantitative metrics based on lexical information.
OB5: Evaluation of the developed methods in real scenarios.
This thesis adopts a methodology that integrates structural analysis, lexical evaluation, and language-model-based techniques. First, the Graphlib library was developed to support the representation of ontologies as graphs and to compute structural properties, such as shortest paths between ontology concepts. This served as a foundation for subsequent analyses.
Second, enrichment techniques exploit the hidden semantics appearing in entity names, following the idea that lexical information often reflects facts that are not formalised in the ontology. These methods identify lexically related ontology entities and check whether they are semantically linked through ontology axioms, which is useful for detecting potential missing axioms or naming inconsistencies.
The main methodological contributions are two evaluation frameworks: HURON and OCALM. HURON provides a systematic process for assessing human readability, defining a set of good practices for including human-readable content in ontologies, together with quantitative metrics for measuring their adoption. It evaluates how understandable an ontology is for human users, addressing a critical but often overlooked aspect of ontology design. OCALM, on the other hand, evaluates domain coverage by leveraging language models such as FastText, BERT, and OWL2VEC*. It measures the alignment between ontology concepts and the terminology present in domain-specific corpora, offering a quantitative indication of an ontology's representativeness. Both frameworks were tested on ontologies widely used by the community, such as SNOMED CT and the Gene Ontology.
The results demonstrate that the proposed methods successfully identify quality aspects and issues of the ontologies evaluated. The application of HURON metrics to OBO Foundry ontologies uncovered issues such as inconsistent naming conventions, ambiguous labels, and missing human-readable content, confirming that readability varies significantly across ontologies and domains. For its part, OCALM was used to find the most suitable ontology for a set of particular domains, defined by natural-language text, and successfully selected the ontology that best covered each domain. Experiments conducted on real ontologies validated the applicability, generality, and usefulness of the proposed frameworks.
In conclusion, this thesis contributes to the long-term goal of making the Semantic Web not only machine-interpretable but also human-comprehensible. By proposing quantitative frameworks that evaluate readability, consistency, and domain coverage, it improves the reliability and usability of ontologies.
This item is subject to a Creative Commons license. http://creativecommons.org/licenses/by-nc-nd/4.0/





