Letter to the Editor on "Prediction of PFAS bioaccumulation in different plant tissues with machine learning models based on molecular fingerprints" by Song et al. (2024), Sci. Total Environ. 950, 175091

Sci Total Environ. 2025 May 23;984:179714. doi: 10.1016/j.scitotenv.2025.179714. Online ahead of print.

ABSTRACT

Song et al. (2024), “Prediction of PFAS bioaccumulation in different plant tissues with machine learning models based on molecular fingerprints,” employed machine learning methods, such as XGBoost and SHapley Additive exPlanations (SHAP), to predict PFAS bioaccumulation, reporting high predictive accuracy. However, this commentary critically examines their interpretation of feature importance, since high predictive accuracy does not guarantee reliable feature importance. Both XGBoost and SHAP are known to exhibit biases, such as overemphasizing features used in early splits and inheriting biases from the underlying model. Furthermore, the high dimensionality and potential collinearity of molecular fingerprints complicate SHAP interpretation, increasing overfitting risk and compromising SHAP value stability. To provide a general example, we conducted an independent simulation using a publicly available dataset of US industrial facilities and environmental compliance, demonstrating significant discrepancies between feature importance rankings from XGBoost and robust statistical tests. This commentary advocates for robust statistical methods coupled with p-values, including Spearman’s rho, Kendall’s tau, Goodman-Kruskal’s gamma, Somers’ delta, and Hoeffding’s dependence, for feature selection. These non-parametric methods, which are independent of specific model assumptions and rely on data ranks, are better suited to capture complex relationships in high-dimensional data, providing a more reliable foundation for future PFAS bioaccumulation research.
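As a rough illustration of the rank-based screening the commentary recommends, the sketch below scores each candidate feature by Spearman's rho against the target and attaches a p-value. It is not the authors' simulation: the data are made up, the names `feature_a`, `feature_b`, and `permutation_p_value` are hypothetical, and the p-value comes from a simple permutation test rather than the asymptotic test a statistics package would normally report.

```python
import random


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for Spearman's rho (hypothetical helper)."""
    rng = random.Random(seed)
    observed = abs(spearman_rho(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(spearman_rho(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Made-up data: the target depends monotonically on feature_a only.
rng = random.Random(42)
feature_a = [rng.uniform(0, 10) for _ in range(40)]
feature_b = [rng.uniform(0, 10) for _ in range(40)]
target = [a ** 2 + rng.gauss(0, 5) for a in feature_a]

for name, column in [("feature_a", feature_a), ("feature_b", feature_b)]:
    rho = spearman_rho(column, target)
    p = permutation_p_value(column, target)
    print(f"{name}: rho={rho:.3f}, p={p:.4f}")
```

Because the score depends only on ranks, it does not change under any monotone rescaling of a feature, and it involves no fitted model whose split order could bias the ranking. Screening many fingerprint bits this way would also call for a multiple-testing correction, which this sketch omits.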

PMID:40412074 | DOI:10.1016/j.scitotenv.2025.179714
