DigitalUM :: Browsing by Subject "Machine learning"

Open Access
A data Science approach to cost estimation decision making - Big Data and Machine Learning
(Universidad de Murcia, Servicio de Publicaciones, 2022) Fernández-Revuelta Pérez, Luis; Romero Blasco, Álvaro
Cost estimation may become increasingly difficult, slow, and resource-consuming when it cannot be performed analytically. If traditional cost estimation techniques are usable at all under those circumstances, they have important limitations. This article analyses the potential applications of data science to management accounting, through the case of a cost estimation task posted on Kaggle, a Google data science and machine learning website. When extensive data exist, machine learning techniques can overcome some of those limitations. Applying machine learning to the data reveals non-obvious patterns and relationships that can be used to predict costs of new assemblies with acceptable accuracy. This article discusses the advantages and limitations of this approach and its potential to transform cost estimation, and more widely management accounting. The multinational company Caterpillar posted a contest on Kaggle to estimate the price that a supplier would quote for manufacturing a number of industrial assemblies, given historical quotes for similar assemblies. Hitherto, this problem would have required reverse-engineering the supplier’s accounting structure to establish the cost structure of each assembly, identifying non-obvious relationships among variables. This complex and tedious task is usually performed by human experts, adding subjectivity to the process.
Embargo
A novel Machine Learning-based approach for the detection of SSH botnet infection
(2021-02) Martínez Garre, José Tomás; Gil Pérez, Manuel; Ruiz-Martínez, Antonio; Ingeniería de la Información y las Comunicaciones
Botnets are causing severe damage to users, companies, and governments through information theft, abuse of online services, DDoS attacks, etc. Although significant research is being done to detect them and mitigate their effect, they are exponentially increasing due to new zero-day attacks, variations in their behavior, and obfuscation techniques. High Interaction Honeypots (HIH) are the only honeypots able to capture attacks and log all the information generated by attackers when setting up a botnet. The data generated is being processed using Machine Learning (ML) techniques for detection since they can detect hidden patterns. However, so far, research has focused on intermediate phases of the botnet's life cycle during operation, underestimating the initial phase of infection. To the best of our knowledge, this is the first solution in the infection phase of SSH-based botnets. Therefore, we have designed an approach based on an SSH-based HIH to generate a dataset consisting of executed commands and network information. Herein, we have applied ML techniques for the development of a real-time detection model. This approach reached a very high level of prediction and zero false negatives. Indeed, our system detected all known and unknown SSH sessions intended to infect our honeypots. Thus, our research has demonstrated that new SSH infections can be detected through ML techniques.
Open Access
A programmable web platform for distributed access, analysis, and visualization of data
(Elsevier, 2023-10-26) Esquembre, Francisco; Chacón, Jesús; Saenz, Jacobo; Vega, Jesús; Dormido-Canto, Sebastián; Matemáticas
Daily work of Fusion Data Research (FDR) scientists faces three practical challenges: (i) getting access to vast amounts of validated, curated, and (ideally) annotated discharge data, (ii) applying a wide variety of standard, domain-specific, and home-made analysis and visualization software libraries and routines, and (iii) using fast, specialized, and not easy to obtain hardware and software installations. This paper introduces a novel web platform that addresses these three challenges in a federated way. Based on a client–server architecture, the new platform allows for easy use and exchange of curated data, validated analysis and visualization routines, and even networked hardware and software installations among the FDR community. This exchange goes beyond the mere use of a code repository, but facilitates the creation of an actual ready-to-use network of computers which can be used remotely to configure and perform data analysis. The network functions in a federated way, in which each member of the community contributes, using the same web platform, with its data, programming experience, and hardware and software availability. The platform is open source.
Open Access
An interpretable semi‐supervised system for detecting cyberattacks using anomaly detection in industrial scenarios
(Wiley Open Access, 2023-05-09) Perales Gómez, Ángel Luis; Fernández Maimó, Lorenzo; Huertas Celdrán, Alberto; García Clemente, Félix J.; Ingeniería y Tecnología de Computadores
When detecting cyberattacks in industrial settings, it is not sufficient to determine whether the system is suffering a cyberattack. It is also fundamental to explain why the system is under a cyberattack and which assets are affected. In this context, anomaly detection based on Machine Learning (ML) and Deep Learning (DL) techniques has shown great performance when detecting cyberattacks in industrial scenarios. However, two main limitations hinder its use in a real environment. Firstly, most solutions are trained using a supervised approach, which is impractical in the real industrial world. Secondly, the use of black-box ML and DL techniques makes it impossible to interpret the decision made by the model. This article proposes an interpretable and semi-supervised system to detect cyberattacks in industrial settings. Our proposal was validated using data collected from the Tennessee Eastman Process. To the best of our knowledge, this system is the only one that offers interpretability together with a semi-supervised approach in an industrial setting. Our system discriminates between causes and effects of anomalies and also achieved the best performance for 11 types of anomalies out of 20, with an overall recall of 0.9577, a precision of 0.9977, and an F1-score of 0.9711.
Open Access
Analysis of the hyperparameter optimisation of four machine learning satellite imagery classification methods
(Springer, 2024-04-05) Alonso Sarría, Francisco; Valdivieso Ros, Carmen; Gomariz Castillo, Francisco; Geografía
The classification of land use and land cover (LULC) from remotely sensed imagery in semi-arid Mediterranean areas is a challenging task due to the fragmentation of the landscape and the diversity of spatial patterns. Recently, the use of deep learning (DL) for image analysis has increased compared to commonly used machine learning (ML) methods. This paper compares the performance of four algorithms, Random Forest (RF), Support Vector Machine (SVM), Multilayer Perceptron (MLP) and Convolutional Network (CNN), using multi-source data and applying an exhaustive optimisation process of the hyperparameters. The usual approach in the optimisation process of a LULC classification model is to keep the best model in terms of accuracy without analysing the rest of the results. In this study, we have analysed such results, discovering noteworthy patterns in a space defined by the mean and standard deviation of the validation accuracy estimated in a 10-fold cross validation (CV). The point distributions in such a space do not appear to be completely random, but show clusters of points that facilitate the discovery of hyperparameter values that tend to increase the mean accuracy and decrease its standard deviation. RF is not the most accurate model, but it is the least sensitive to changes in hyperparameters. Neural networks tend to increase commission and omission errors of the less represented classes because their optimisation leads the model to learn the most frequent classes better. On the other hand, RF and MLP prediction layers are the most accurate from a general qualitative point of view.
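As a rough illustration of the analysis described in the abstract above, the sketch below summarises each hyperparameter setting by the mean and standard deviation of its cross-validation accuracies, then orders the settings so that a high mean and a low spread come first. The fold accuracies, the setting labels, and the function names are hypothetical; the study itself used a 10-fold CV and its own optimisation grid, neither of which is reproduced here (five folds are used only for brevity).

```python
from statistics import mean, stdev


def summarise_cv(fold_scores):
    """Map each hyperparameter setting to (mean, sample std) of its fold accuracies."""
    return {setting: (mean(scores), stdev(scores))
            for setting, scores in fold_scores.items()}


def rank_settings(summary):
    """Order settings by higher mean accuracy first, then by lower standard deviation."""
    return sorted(summary, key=lambda s: (-summary[s][0], summary[s][1]))


# Hypothetical per-fold validation accuracies for three settings
fold_scores = {
    ("rf", 100): [0.80, 0.82, 0.81, 0.79, 0.83],
    ("rf", 500): [0.82, 0.82, 0.81, 0.82, 0.83],
    ("mlp", 64): [0.90, 0.70, 0.85, 0.75, 0.88],
}

summary = summarise_cv(fold_scores)
for setting in rank_settings(summary):
    m, s = summary[setting]
    print(setting, f"mean={m:.3f}", f"std={s:.3f}")
```

Plotting the (mean, std) pairs rather than only sorting them would be closer to the "space" the abstract describes; sorting is used here to keep the sketch dependency-free.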
Open Access
Behavioral fingerprinting to detect ransomware in resource-constrained devices
(Elsevier, 2023-12) Huertas Celdrán, Alberto; Sánchez Sánchez, Pedro Miguel; Von der Assen, Jan; Shushack, Dennis; Perales Gómez, Ángel Luis; Bovet, Gérôme; Martínez Pérez, Gregorio; Stiller, Burkhard; Ingeniería y Tecnología de Computadores
The Internet of Things (IoT), a network of interconnected devices, has grown and gained traction over the last few years. This paradigm can impact our lives while also providing significant economic benefits. However, although resource-constrained IoT devices offer numerous advantages, they are also vulnerable to cyberattacks. As a result, ransomware severely threatens IoT devices managing sensitive and relevant information. Solutions based on Machine and Deep Learning (ML/DL) that consider behavioral data have been identified as promising. However, most detection solutions have been developed for Windows-based systems, which generally have more resources than IoT devices. As a result, these solutions are not suitable for resource-constrained components. In addition, no solution compares the pros and cons of different behavioral dimensions of resource-constrained devices. Thus, this work presents a framework that combines three different behavioral sources with supervised and unsupervised ML/DL algorithms to detect and classify heterogeneous ransomware impacting resource-constrained spectrum sensors. A pool of experiments has demonstrated the suitability of the proposed solution and compared its performance with a rule-based system. In conclusion, the usage of resources combined with local outlier factor and decision tree are the most promising combinations to detect anomalies and classify ransomware while consuming CPU, RAM, and time of devices in a reduced manner.
Open Access
Comparison of manual and automated digital image analysis systems for quantification of cellular protein expression
(Universidad de Murcia, Departamento de Biologia Celular e Histiologia, 2022) Jagomast, T.; Idel, C.; Klapper, L.; Kuppler, P.; Proppe, L.; Beume, S.; Falougy, M.; Steller, D.; Hakim, S.G.; Offermann, A.; Roesch, M.C.; Bruchhage, K.L.; Perner, S.; Ribbat Idel, J.
Objective. Quantifying protein expression in immunohistochemically stained histological slides is an important tool for oncologic research. The use of computer-aided evaluation of IHC-stained slides contributes significantly to making measurements objective. Manual digital image analysis (mDIA) requires user-dependent annotation of the region of interest (ROI). Other systems offer automated digital image analysis (aDIA), with built-in machine learning algorithms that detect the ROIs automatically. We aimed to investigate the agreement between the results obtained by aDIA and those derived from mDIA systems. Methods. We quantified chromogenic intensity (CI) and calculated the positive index (PI) in cohorts of tissue microarrays (TMA) using mDIA and aDIA. To consider the different distributions of staining within cellular subcompartments and different tumor architecture, our study encompassed nuclear and cytoplasmic stainings in adenocarcinomas and squamous cell carcinomas. Results. Within all cohorts, we were able to show a high correlation between mDIA and aDIA for the CI (p<0.001) along with high agreement for the PI. Moreover, we were able to show that the cell detections of the programs were comparable as well and both proved to be reliable when compared to manual counting. Conclusion. mDIA and aDIA show a high correlation in acquired IHC data. Both proved to be suitable to stratify patients for evaluation with clinical data. As both produce the same level of information, aDIA might be preferable as it is time-saving, can easily be reproduced, and enables regular and efficient output in large studies in a reasonable time period.
Open Access
Student contributions to Wikipedia as evidence of learning and competence development in distance education [Las contribuciones de los estudiantes a Wikipedia como evidencia de aprendizaje y de desarrollo de competencias en educación a distancia]
(Universidad de Murcia, Servicio de Publicaciones, 2025-07-30) Obregón Sierra, Ángel; Maina Patrás, Marcelo Fabián; No associated department
This study explores how students in an online master's programme developed disciplinary concepts through a collaborative writing strategy on Wikipedia. The strategy consisted of searching for information and producing content in line with the online encyclopedia's standards of rigour and quality over a total of eight semesters. With the help of a web service that uses machine learning to evaluate Wikipedia edits, the participation of 1779 students, with a total of 57560 edits, was analysed. The results provided evidence of student learning: students edited correctly in the encyclopedia's group sandbox (taller) as a step prior to final publication of the article. A steady improvement in contributions was observed, evidenced by an increase in the degree of "good faith" and a significant decrease in "damage". Implementing this collaborative writing strategy not only allowed students to develop course-specific competences and knowledge, but also fostered critical thinking, reflection, teamwork, and digital skills. Supervision by teaching staff was essential to ensure the quality and rigour of the contributions, showing that this can be an effective methodology for learning in higher education settings.
Open Access
Logistics performance of Cuban tourism entities in the supply chain: a comparative study using machine learning [Desempeño logístico en entidades turísticas cubanas de la cadena de suministro: Estudio comparativo mediante machine learning]
(Universidad de Murcia: Escuela Universitaria Turismo de Murcia, 2025) Guerra Castellón, Emilio Enrique; Vázquez Alfonso, Yasser; Núñez Torres, Edgar; Departments
This research evaluates the logistics performance of thirteen Cuban tourism entities in the supply chain, identifying patterns and areas for improvement using advanced machine learning techniques. The logistics-of-excellence reference model (modelo de referencia de la logística de excelencia) was used for the evaluation, and machine learning techniques such as multiple linear regression, K-means clustering, and Random Forest were applied. The results showed an average performance of 3.25 on a scale of 1 to 5, with ITH Ciego de Ávila leading (3.92) and ITH Base de Transporte in last place (2.46). The weakest modules were Information Technologies, Software System, and Barriers and Risks. The analysis identified four groups of entities with similar profiles and highlighted the importance of transport and technology for improving performance. It is concluded that specific interventions in technology and risk management are needed to optimise the Cuban tourism supply chain.
Open Access
Detecting flooded areas using Sentinel-1 SAR imagery
(MDPI, 2025-04-11) Alonso Sarria, Francisco; Valdivieso Ros, Carmen; Molina-Pérez, Gabriel; Geografía
Floods are a major threat to human life and economic assets. Monitoring these events is therefore essential to quantify and minimize such losses. Remote sensing has been used to extract flooded areas, with SAR imagery being particularly useful as it is independent of weather conditions. This approach is more difficult when detecting flooded areas in semi-arid environments, without a reference permanent water body, than when monitoring the water level rise of permanent rivers or lakes. In this study, Random Forest is used to estimate flooded cells after 19 events in Campo de Cartagena, an agricultural area in SE Spain. Sentinel-1 SAR metrics are used as predictors and irrigation ponds as training areas. To minimize false positives, the pre- and post-event results are compared and only those pixels with a probability of water increase are considered as flooded areas. The ability of the RF model to detect water surfaces is demonstrated (mean accuracy = 0.941, standard deviation = 0.048) across the 19 events. Validation using optical imagery (Sentinel-2 MSI) reduces accuracy to 0.642. This form of validation can only be applied to a single event using an S2 image taken 3 days before the S1 image. A large number of false negatives is then expected. A procedure developed to correct for this error gives an accuracy of 0.886 for this single event. Another form of indirect validation consists of relating the area flooded in each event to the amount of rainfall recorded. An RF regression model using both rainfall metrics and season of the year gives a correlation coefficient of 0.451 and RMSE = 979 ha using LOO-CV. This result shows a clear relationship between flooded areas and rainfall metrics.
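The pre- and post-event comparison mentioned in the abstract above could be sketched roughly as follows. This is only one possible reading: a cell is flagged as flooded when its post-event water probability is high and has risen relative to the pre-event probability, so that permanent water such as irrigation ponds is not counted. The function name, the grids, and both thresholds (`water_threshold`, `min_increase`) are assumptions for illustration and are not taken from the paper.

```python
def flooded_cells(p_pre, p_post, water_threshold=0.5, min_increase=0.1):
    """Return a boolean grid marking cells treated as newly flooded.

    A cell is flagged only if its post-event water probability reaches
    water_threshold AND exceeds the pre-event probability by at least
    min_increase. Both thresholds are illustrative assumptions.
    """
    return [
        [post >= water_threshold and (post - pre) >= min_increase
         for pre, post in zip(row_pre, row_post)]
        for row_pre, row_post in zip(p_pre, p_post)
    ]


# Hypothetical 2x2 grids of water probabilities produced by a classifier
p_pre = [[0.90, 0.10], [0.20, 0.60]]
p_post = [[0.95, 0.80], [0.30, 0.55]]

mask = flooded_cells(p_pre, p_post)
print(mask)
```

In this toy grid, the top-left cell is water both before and after the event, so it is not flagged; only the top-right cell, where the probability rises sharply, is treated as flooded.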
Open Access
Estimation of soil properties using machine learning techniques to improve hydrological modeling in a semiarid environment: Campo de Cartagena (Spain)
(Springer, 2025-03-11) Alonso Sarria, Francisco; Blanco Bernardeau, Arantzazu; Gomariz Castillo, Francisco; Romero Díaz, María Asunción; Geografía
Soils are a key element in the hydrological cycle through a number of soil properties that are complex to estimate and exhibit considerable spatial variability. Therefore, several techniques have been proposed for their estimation and mapping from point data along a given study area. In this work, four machine learning methods (Random Forest, Support Vector Machines, XGBoost and Multilayer Perceptrons) are used to predict and map the proportions of organic carbon, clay, silt and sand in the soils of the Campo de Cartagena (SE Spain). These models depend on a number of hyperparameters that need to be optimised to maximise accuracy, although this process can lead to overtraining, which affects the generalisability of the models. In this work it was found that neural networks gave the best results in validation, but on the test data the methods based on decision trees (Random Forest and XGBoost) were more accurate, although the differences were generally not significant. Accuracy values, as usual for soil variables, were not high. The RMSE values were 8.040 for SOC, 7.049 for clay, 10.227 for silt and 13.561 for loam. The layers obtained were then used to obtain annual curve number (CN) layers whose ability to reproduce runoff hydrographs was compared with the official CN layer. For high flow events, the CN layers obtained in this study gave better results (NSE=0.807, PBIAS=-4.7 and RMSE=0.4) than the official CN layers (NSE=-2.28, PBIAS=135.82 and RMSE=1.8).
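For readers unfamiliar with the hydrograph metrics quoted in the abstract above, a minimal sketch of their usual textbook definitions follows. NSE is the Nash-Sutcliffe efficiency, where 1 is a perfect fit and values below 0 are worse than predicting the observed mean. The sign convention for PBIAS varies between authors; the one used here (positive means the simulation overestimates) is an assumption and may not match the paper. The sample series are invented.

```python
from math import sqrt


def rmse(obs, sim):
    """Root mean square error between observed and simulated series."""
    return sqrt(sum((s - o) ** 2 for o, s in zip(obs, sim)) / len(obs))


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of observations about their mean."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - s) ** 2 for o, s in zip(obs, sim))
    spread = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / spread


def pbias(obs, sim):
    """Percent bias; positive = overestimation under the convention assumed here."""
    return 100.0 * sum(s - o for o, s in zip(obs, sim)) / sum(obs)


# Invented runoff values: the simulation doubles every observation
obs = [1.0, 2.0, 3.0, 4.0]
sim = [2.0, 4.0, 6.0, 8.0]
print(nse(obs, sim), pbias(obs, sim), rmse(obs, sim))
```

A strongly negative NSE together with a large PBIAS, as in this toy case, is the kind of pattern the abstract reports for the official CN layer.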
Open Access
FARMIT: Continuous Assessment of Crop Quality Using Machine Learning and Deep Learning Techniques for IoT-based Smart Farming
(Springer, 2022-03-31) Perales Gómez, Ángel Luis; López de Teruel Alcolea, Pedro Enrique; Ruiz García, Alberto; García Mateos, Ginés; García Clemente, Félix Jesús; Ingeniería y Tecnología de Computadores
The race for automation has reached farms and agricultural fields. Many of these facilities use Internet of Things (IoT) technologies to automate processes and increase productivity. Besides, Machine Learning and Deep Learning allow continuous decision making based on data analysis. In this work, we fill a gap in the literature and present a novel architecture based on IoT and Machine Learning / Deep Learning technologies for the continuous assessment of agricultural crop quality. This architecture is divided into three layers that work together to gather, process, and analyze data from different sources to evaluate crop quality. In the experiments, the proposed approach based on data aggregation from different sources reaches a lower percentage error than considering only one source. In particular, the percentage error achieved by our approach in the test dataset was 6.59, while the percentage error achieved exclusively using data from sensors was 6.71.
Open Access
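Artificial Intelligence and Machine Learning as an educational resource from the perspective of teachers at different non-university educational stages [Inteligencia Artificial y Machine Learning como recurso educativo desde la perspectiva de docentes en distintas etapas educativas no universitarias]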
(Universidad de Murcia, Servicio de Publicaciones, 2023) Dúo Terrón, Pablo; Moreno Guerrero, Antonio José; López Belmonte, Jesús; Marín Marín, José Antonio
Artificial Intelligence (AI) is considered the fourth industrial revolution and is bringing about profound changes in 21st-century society. It also has the potential to address the greatest challenges in the field of education, such as implementing innovative active methodologies through STEAM projects using AI. The main objective of this research is to identify the impact of machine learning, a subfield of AI, as an educational resource in the teaching-learning process at different non-university educational stages, analysing age, gender, and prior experience in AI projects and knowledge of AI from the teachers' point of view. A quantitative, descriptive, and comparative method was used. The instrument is based on a questionnaire with 25 items and 2 open-ended questions. The sample comprised teachers (n=92) who took part freely and voluntarily, from the autonomous communities of Asturias and Extremadura and the Autonomous City of Ceuta, and who had attended a training course on AI for carrying out classroom innovation projects. The main results show high mean values across all teachers for implementing AI-based projects as an educational resource. The results are broken down by sex, age, educational stage, and prior experience of the teachers. Higher scores and ratings are given by teachers aged 20-29, male teachers, teachers in Early Childhood Education, and those with prior experience and knowledge of AI. The discussion and conclusions of the research follow.
Open Access
Interpolation of Instantaneous Air Temperature Using Geographical and MODIS Derived Variables with Machine Learning Techniques
(MDPI, 2019-08-31) Ruiz Álvarez, Marcos; Alonso-Sarria, Francisco; Gomariz Castillo, Francisco; Geografía
Several methods have been tried to estimate air temperature using satellite imagery. In this paper, the results of two machine learning algorithms, Support Vector Machines and Random Forest, are compared with Multiple Linear Regression and Ordinary Kriging. Several geographic, remote sensing and time variables are used as predictors. The validation is carried out using two different approaches, a leave-one-out cross-validation in the spatial domain and a spatio-temporal k-block cross-validation, and four different statistics on a daily basis, allowing the use of ANOVA to compare the results. The main conclusion is that Random Forest produces the best results (R2 = 0.888 ± 0.026, root mean square error = 3.01 ± 0.325 using k-block cross-validation). Regression methods (Support Vector Machine, Random Forest and Multiple Linear Regression) are calibrated with MODIS data and several predictors easily calculated from a Digital Elevation Model. The most important variables in the Random Forest model were satellite temperature, potential irradiation and cdayt, a cosine transformation of the Julian day.
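The abstract above does not give the formula behind cdayt. One common way to encode the day of the year as a cyclic predictor is sketched below, so that late December and early January receive similar values. The function name, the period of 365.25 days, and the `peak_day` shift are assumptions for illustration, not the authors' definition.

```python
from math import cos, pi


def cos_day(day_of_year, period=365.25, peak_day=0.0):
    """Cosine encoding of the day of the year (assumed form, not the paper's formula).

    Returns 1.0 on peak_day, -1.0 half a period later, and repeats every period.
    """
    return cos(2.0 * pi * (day_of_year - peak_day) / period)


# Days 1 and 365 sit at opposite ends of the calendar but get nearly equal values
print(cos_day(1), cos_day(365), cos_day(182))
```

A single cosine cannot distinguish spring from autumn; pairing it with a sine term is a frequent remedy, although the abstract only mentions the cosine.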
Open Access
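Machine Learning for improving the MOOC experience: the case of the Universitat Politècnica de València [Machine Learning para la mejora de la experiencia con MOOC : el caso de la Universitat Politècnica de València]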
(Universidad de Murcia, Servicio de Publicaciones, 2021) Despujol Zabala, Ignacio; Martínez Navarro, Jorge Ángel
The aim of this work is to design a proposal for automated mechanisms based on machine learning to improve the experience of participants in the MOOC courses of the Universitat Politècnica de València (UPV) and to reduce dropout rates. Following a design-based research strategy (IBD in its Spanish initials), in which pedagogical decisions were always prioritised over the data analytics themselves, three iterations were carried out with different methodological patterns (systematic literature review, machine learning based on data from 260 courses and more than 700,000 students, and creation of automated mechanisms), each ending with the presentation of results and feedback from the university team. The main conclusions indicate that, of the twenty-five pedagogical dropout indicators reported by the literature reviews in iteration 1, only ten are validated with the UPV courses (no automatic or automatable data are available for the others), and of those only six are ultimately possible predictors of student dropout with the data used. Finally, a set of automated mechanisms is proposed, to be applied on the university's EdX platform, to improve the user experience and reduce the dropout rate in the courses.
Open Access
Mathematical Abilities in School-Aged Children: A Structural Magnetic Resonance Imaging Analysis With Radiomics.
(Frontiers Media, 2022-04-14) Pina, Violeta; Campello, Víctor M.; Lekadir, Karim; Seguí, Santi; Garcia Santos, Jose M.; Fuentes Melero, Luis José; Psicología Básica y Metodología
Structural magnetic resonance imaging (sMRI) studies have shown that children that differ in some mathematical abilities show differences in gray matter volume mainly in parietal and frontal regions that are involved in number processing, attentional control, and memory. In the present study, a structural neuroimaging analysis based on radiomics and machine learning models is presented with the aim of identifying the brain areas that better predict children’s performance in a variety of mathematical tests. A sample of 77 school-aged children from third to sixth grade were administered four mathematical tests: Math fluency, Calculation, Applied problems and Quantitative concepts as well as a structural brain imaging scan. By extracting radiomics related to the shape, intensity, and texture of specific brain areas, we observed that areas from the frontal, parietal, temporal, and occipital lobes, basal ganglia, and limbic system, were differentially related to children’s performance in the mathematical tests. sMRI-based analyses in the context of mathematical performance have been mainly focused on volumetric measures. However, the results for radiomics-based analysis showed that for these areas, texture features were the most important for the regression models, while volume accounted for less than 15% of the shape importance. These findings highlight the potential of radiomics for more in-depth analysis of medical images for the identification of brain areas related to mathematical abilities.
Open Access
Multi-objective evolutionary spatio-temporal forecasting of air pollution
(Elsevier, 2022-05-31) Espinosa Fernández, Raquel; Jiménez Barrionuevo, Fernando; Palma Méndez, José Tomás; Ingeniería de la Información y las Comunicaciones
Nowadays, air pollution forecasting modeling is vital to achieve an increase in air quality, allowing an improvement of ecosystems and human health. It is important to consider the spatial characteristics of the data, as they allow us to infer predictions in those areas for which no information is available. In the current literature, there are a large number of proposals for spatio-temporal air pollution forecasting. In this paper we propose a novel spatio-temporal approach based on multi-objective evolutionary algorithms for the identification of multiple non-dominated linear regression models and their combination in an ensemble learning model for air pollution forecasting. The ability of multi-objective evolutionary algorithms to find a Pareto front of solutions is used to build multiple forecast models geographically distributed in the area of interest. The proposed method has been applied for one-week NO prediction in southeastern Spain and has obtained promising results in statistical comparison with other approaches such as the union of datasets or the interpolation of the predictions for each monitoring station. The validity of the proposed spatio-temporal approach is thus demonstrated, opening up a new field in air pollution engineering.
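The Pareto front mentioned in the abstract above is the set of non-dominated solutions: those for which no other candidate is at least as good on every objective and strictly better on at least one. A minimal sketch of that filter follows, assuming every objective is minimised. The two objectives in the sample data (forecast error and model size) are hypothetical stand-ins, and this is not the evolutionary algorithm used in the paper.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one (minimisation)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (forecast error, number of model terms) pairs for candidate models
candidates = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0)]
print(pareto_front(candidates))
```

Here (3.0, 4.0) is dropped because (2.0, 3.0) is better on both objectives; the three remaining candidates represent different trade-offs and could each feed an ensemble.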
Open Access
Multivariate feature ranking with high-dimensional data for classification tasks
(2022-06-08) Jimenez Barrionuevo, F.; Sanchez Carpena, G.; Palma Méndez, José Tomás; Miralles Pechuan, L.; Botia Blaya, J. A.; Ingeniería de la Información y las Comunicaciones
In many machine learning classification problems, datasets are usually of high dimensionality and therefore require efficient and effective methods for identifying the relative importance of their attributes, eliminating the redundant and irrelevant ones. Due to the huge size of the search space of the possible solutions, the attribute subset evaluation feature selection methods are not very suitable, so in these scenarios feature ranking methods are used. Most of the feature ranking methods described in the literature are univariate methods, which do not detect interactions between factors. In this paper, we propose two new multivariate feature ranking methods based on pairwise correlation and pairwise consistency, which have been applied for cancer gene expression and genotype-tissue expression classification tasks using public datasets. We statistically proved that the proposed methods outperform the state-of-the-art feature ranking methods Clustering Variation, Chi Squared, Correlation, Information Gain, ReliefF and Significance, as well as other feature selection methods for attribute subset evaluation based on correlation and consistency with the multi-objective evolutionary search strategy, and with the embedded feature selection methods C4.5 and LASSO. The proposed methods have been implemented on the WEKA platform for public use, making all the results reported in this paper repeatable and replicable.
Open Access
ORIENTATE: automated machine learning classifiers for oral health prediction and research
(BMC, 2023-06-20) Gómez-Rios, Inmaculada; Egea-López, Esteban; Ortiz Ruiz, Antonio José; Dermatología, Estomatología, Radiología y Medicina Física
Background. The application of data-driven methods is expected to play an increasingly important role in healthcare. However, a lack of personnel with the necessary skills to develop these models and interpret their output is preventing a wider adoption of these methods. To address this gap, we introduce and describe ORIENTATE, software for the automated application of machine learning classification algorithms by clinical practitioners lacking specific technical skills. ORIENTATE allows the selection of features and the target variable, then automatically generates a number of classification models and cross-validates them, finding the best model and evaluating it. It also implements a custom feature selection algorithm for systematic searches of the best combination of predictors for a given target variable. Finally, it outputs a comprehensive report with graphs that facilitates the explanation of the classification model results, using global interpretation methods, and an interface for the prediction of new input samples. Feature relevance and interaction plots provided by ORIENTATE allow it to be used for statistical inference, which can replace and/or complement classical statistical studies. Results. Its application to a dataset with healthy and special health care needs (SHCN) children, treated under deep sedation, was discussed as a case study. On the example dataset, despite its small size, the feature selection algorithm found a set of features able to predict the need for a second sedation with an F1 score of 0.83 and a ROC (AUC) of 0.92. Eight predictive factors for both populations were found and ordered by the relevance assigned to them by the model. A discussion of how to derive inferences from the relevance and interaction plots and a comparison with a classical study is also provided. Conclusions. ORIENTATE automatically finds suitable features and generates accurate classifiers which can be used in preventive tasks. In addition, researchers without specific skills in data methods can use it for the application of machine learning classification and as a complement to classical studies for inferential analysis of features. In the case study, a high prediction accuracy for a second sedation in SHCN children was achieved. The analysis of the relevance of the features showed that the number of teeth with pulpar treatments at the first sedation is a predictive factor for a second sedation.
Open Access
Pinna nobilis in the Mar Menor coastal lagoon: a story of colonization and uncertainty
(Inter-Research, 2020-10-15) Gimenez-Casalduero, Francisca; Gomariz Castillo, Francisco; Alonso Sarria, Francisco; Cortés, Emilio; Izquierdo-Muñoz, Andrés; Ramos Esplá, Alfonso A.; Geografía
Populations of the Mediterranean fan mussel Pinna nobilis have progressively decreased over the last decades as a result of anthropogenic activities. The rate of decline has strongly increased since 2016, when a mass mortality event triggered by the parasite Haplosporidium pinnae occurred, and evidence exists that Mycobacterium species may also have played a major role in the event. Indeed, the epidemic has spread throughout the Mediterranean, although coastal lagoons seem to offer a degree of ‘resistance’ against the parasite. In the early 1980s, P. nobilis appeared in the Mar Menor lagoon and rapidly became an important component of the benthos. However, colonization of the lagoon by the fan mussel was cut short in 2016 when a massive mortality event occurred, possibly as a consequence of the environmental collapse that occurred in the lagoon, parallel to the mortality that the species suffered in the Mediterranean that same year. In this study, we estimated the spatial distribution of P. nobilis in the Mar Menor for 3 periods: 2003-2004, 2013 and 2016. The first 2 periods use published data, and the last period uses data collected in a new campaign. The probability of occurrence for the 3 periods was estimated using random forest and random forest regression-kriging models. The main environmental variables that determined the dispersion and colonization of the bivalve in the lagoon before 2016 are also identified.