DOM Formula Assignment using K-Nearest Neighbors

Training and Testing Data for A Machine Learning and Benchmarking Approach for Molecular Formula Assignment of Ultra High-Resolution Mass Spectrometry Data from Complex Mixtures

Paper: Under review
Dataset: SaeedLab/dom-formula-assignment-data

Abstract

Machine learning approaches to molecular formula assignment are crucial for unlocking the full potential of ultra-high-resolution mass spectrometry (UHRMS) when analyzing complex mixtures. Combining data-driven models with rigorous benchmarking can improve the accuracy, consistency, and speed of identifying plausible molecular formulas from vast spectral datasets. Compared with traditional de novo methods, which rely heavily on rule-based heuristics and manual parameter tuning, machine learning approaches can capture complex patterns in data and adapt more readily to diverse sample types. In this paper, we describe the application of a machine learning method based on the k-nearest neighbors (KNN) algorithm, trained on curated chemical formula datasets from UHRMS analysis of dissolved organic matter (DOM) covering the saline river continuum and tropical wet/dry season variability. The influence of mass accuracy (training sets with 0.15-1 ppm) was evaluated on a blind test set of DOM samples of different geographical origins. A Decision Tree Regressor (DTR) and a Random Forest Regressor (RFR) based on mass accuracy (<1 ppm) were used. Our ML models annotated 43% more formulas than traditional methods (5,796 vs 4,047), and Model-Synthetic achieved a 99.9% assignment rate and assigned about 2x more formulas (8,268 vs 4,047). DTR and RFR achieved formula-level accuracies (FA) of 86.5% and 60.4%, respectively. Overall, the results show an increase in formula assignment compared with traditional methods. This ultimately enables more reliable characterization of complex natural and engineered systems, supporting advances in fields such as environmental science, metabolomics, and petroleomics. Furthermore, the novel dataset produced for this study is made publicly available, establishing an initial benchmark for molecular formula assignment in UHRMS using machine learning.
The dataset and code are publicly available at: https://github.com/pcdslab/dom-formula-assignment-using-ml
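
To make the role of mass accuracy concrete, the following is a minimal sketch of a 1-nearest-neighbor formula lookup with a ppm tolerance. It is not the authors' implementation: the reference formulas, the function names (`mono_mass`, `ppm_error`, `assign_1nn`), and the use of neutral monoisotopic mass as the only feature are assumptions made for illustration.

```python
# Illustrative sketch only: 1-nearest-neighbor formula lookup on neutral
# monoisotopic mass with a ppm tolerance. The feature choice, reference
# formulas, and function names are assumptions, not the released models.

# Monoisotopic masses (Da) of the most abundant isotopes.
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}


def mono_mass(counts):
    """Neutral monoisotopic mass of a formula given as {element: count}."""
    return sum(MONO[el] * n for el, n in counts.items())


def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def assign_1nn(observed, reference, tol_ppm=1.0):
    """Return the closest reference formula, or None if it is outside tol_ppm."""
    name, mass = min(reference.items(), key=lambda kv: abs(kv[1] - observed))
    return name if abs(ppm_error(observed, mass)) <= tol_ppm else None


# Hypothetical reference set standing in for a curated training list.
reference = {
    "C6H12O6": mono_mass({"C": 6, "H": 12, "O": 6}),
    "C7H6O3": mono_mass({"C": 7, "H": 6, "O": 3}),
    "C9H8O4": mono_mass({"C": 9, "H": 8, "O": 4}),
}

near = reference["C6H12O6"] * (1 + 0.5e-6)  # 0.5 ppm above the true mass
far = reference["C6H12O6"] * (1 + 5e-6)     # 5 ppm above the true mass

print(assign_1nn(near, reference))  # nearest neighbor is within tolerance
print(assign_1nn(far, reference))   # nearest neighbor is rejected
```

A tighter tolerance rejects more peaks but lowers the risk of assigning a formula whose mass only happens to be close, which is the trade-off the mass-accuracy evaluation in the abstract addresses.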

Available Models

This repository contains pre-trained KNN models used for molecular formula assignment. The models vary by:

Dataset: L1, L3, L1-L3, Synthetic
Neighbors (K): 1, 3
Distance Metric: Euclidean, Manhattan
Field Strength/Type: 7T, 21T, SYN (Synthetic)
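
The effect of the K and distance-metric settings listed above can be illustrated with a small pure-Python sketch. The feature vectors and the labels "A" and "B" are made up for illustration; the features actually used by the released models are not described in this card, and the helper names are hypothetical.

```python
from collections import Counter


def euclidean(a, b):
    """Straight-line (L2) distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(query, points, labels, k=1, metric=euclidean):
    """Return the majority label among the k training points nearest to query."""
    ranked = sorted(range(len(points)), key=lambda i: metric(query, points[i]))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training data (hypothetical): one "A" point and three "B" points.
points = [(2.0, 2.0), (3.0, 0.0), (0.0, 3.5), (3.6, 0.6)]
labels = ["A", "B", "B", "B"]
query = (0.0, 0.0)

# With K=1 the two metrics disagree on which point is nearest.
print(knn_predict(query, points, labels, k=1, metric=euclidean))
print(knn_predict(query, points, labels, k=1, metric=manhattan))
# With K=3 the majority vote smooths out the single closest point.
print(knn_predict(query, points, labels, k=3, metric=euclidean))
```

In this toy case the point (2.0, 2.0) is nearest under Euclidean distance while (3.0, 0.0) is nearest under Manhattan distance, so the K=1 predictions differ; this is why the card lists separate models per metric and per K.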

Model List

Model Filename	Description
`knn_model_Model-L1_K1_Euclidean.joblib`	L1 Dataset, K=1, Euclidean Distance
`knn_model_Model-L1_K1_Manhattan.joblib`	L1 Dataset, K=1, Manhattan Distance
`knn_model_Model-L1_K3_Euclidean.joblib`	L1 Dataset, K=3, Euclidean Distance
`knn_model_Model-L1_K3_Manhattan.joblib`	L1 Dataset, K=3, Manhattan Distance
`knn_model_Model-L3_K1_Euclidean.joblib`	L3 Dataset, K=1, Euclidean Distance
`knn_model_Model-L3_K1_Manhattan.joblib`	L3 Dataset, K=1, Manhattan Distance
`knn_model_Model-L3_K3_Euclidean.joblib`	L3 Dataset, K=3, Euclidean Distance
`knn_model_Model-L3_K3_Manhattan.joblib`	L3 Dataset, K=3, Manhattan Distance
`knn_model_Model-L1-L3_K1_Euclidean_21T.joblib`	L1-L3 Combined, K=1, Euclidean, 21T
`knn_model_Model-L1-L3_K1_Euclidean_7T.joblib`	L1-L3 Combined, K=1, Euclidean, 7T
`knn_model_Model-L1-L3_K1_Manhattan_21T.joblib`	L1-L3 Combined, K=1, Manhattan, 21T
`knn_model_Model-L1-L3_K1_Manhattan_7T.joblib`	L1-L3 Combined, K=1, Manhattan, 7T
`knn_model_Model-L1-L3_K3_Euclidean_21T.joblib`	L1-L3 Combined, K=3, Euclidean, 21T
`knn_model_Model-L1-L3_K3_Euclidean_7T.joblib`	L1-L3 Combined, K=3, Euclidean, 7T
`knn_model_Model-L1-L3_K3_Manhattan_21T.joblib`	L1-L3 Combined, K=3, Manhattan, 21T
`knn_model_Model-L1-L3_K3_Manhattan_7T.joblib`	L1-L3 Combined, K=3, Manhattan, 7T
`knn_model_Model-Synthetic_K1_Euclidean_21T.joblib`	Synthetic, K=1, Euclidean, 21T
`knn_model_Model-Synthetic_K1_Euclidean_7T.joblib`	Synthetic, K=1, Euclidean, 7T
`knn_model_Model-Synthetic_K1_Euclidean_SYN.joblib`	Synthetic, K=1, Euclidean, SYN
`knn_model_Model-Synthetic_K1_Manhattan_21T.joblib`	Synthetic, K=1, Manhattan, 21T
`knn_model_Model-Synthetic_K1_Manhattan_7T.joblib`	Synthetic, K=1, Manhattan, 7T
`knn_model_Model-Synthetic_K1_Manhattan_SYN.joblib`	Synthetic, K=1, Manhattan, SYN
`knn_model_Model-Synthetic_K3_Euclidean_21T.joblib`	Synthetic, K=3, Euclidean, 21T
`knn_model_Model-Synthetic_K3_Euclidean_7T.joblib`	Synthetic, K=3, Euclidean, 7T
`knn_model_Model-Synthetic_K3_Euclidean_SYN.joblib`	Synthetic, K=3, Euclidean, SYN
`knn_model_Model-Synthetic_K3_Manhattan_21T.joblib`	Synthetic, K=3, Manhattan, 21T
`knn_model_Model-Synthetic_K3_Manhattan_7T.joblib`	Synthetic, K=3, Manhattan, 7T
`knn_model_Model-Synthetic_K3_Manhattan_SYN.joblib`	Synthetic, K=3, Manhattan, SYN
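
The filenames in the table follow a regular pattern, so they can be built from the four settings instead of being typed by hand. The sketch below infers that pattern from the table; the helper names `model_filename` and `all_model_filenames` are hypothetical and not part of the repository.

```python
from itertools import product

# Field suffixes available per dataset, read off the model table.
# L1 and L3 filenames carry no field suffix, represented here by None.
FIELDS = {
    "L1": [None],
    "L3": [None],
    "L1-L3": ["21T", "7T"],
    "Synthetic": ["21T", "7T", "SYN"],
}
NEIGHBORS = (1, 3)
METRICS = ("Euclidean", "Manhattan")


def model_filename(dataset, k, metric, field=None):
    """Build a model filename following the pattern seen in the table."""
    if dataset not in FIELDS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    if k not in NEIGHBORS or metric not in METRICS:
        raise ValueError(f"unsupported K or metric: {k!r}, {metric!r}")
    if field not in FIELDS[dataset]:
        raise ValueError(f"field {field!r} is not listed for dataset {dataset!r}")
    name = f"knn_model_Model-{dataset}_K{k}_{metric}"
    if field is not None:
        name += f"_{field}"
    return name + ".joblib"


def all_model_filenames():
    """Enumerate every combination that appears in the table."""
    return [
        model_filename(dataset, k, metric, field)
        for dataset, fields in FIELDS.items()
        for k, metric, field in product(NEIGHBORS, METRICS, fields)
    ]


print(model_filename("Synthetic", 3, "Manhattan", "SYN"))
print(len(all_model_filenames()))
```

Once a file has been downloaded under the license terms below, `joblib.load(path)` should return the saved object, assuming the models were serialized with joblib as the `.joblib` extension suggests. The expected input features are not documented in this card, so check the linked code repository before calling `predict`.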

License

This model and associated code are released under the CC-BY-NC-ND 4.0 license and may only be used for non-commercial, academic research purposes with proper attribution. Any commercial use, sale, or other monetization of this model and its derivatives, which include models trained on outputs from the model or datasets created from the model, is prohibited and requires prior approval. Downloading the model requires prior registration on Hugging Face and agreeing to the terms of use. By downloading this model, you agree not to distribute, publish or reproduce a copy of the model. If another user within your organization wishes to use the model, they must register as an individual user and agree to comply with the terms of use. Users may not attempt to re-identify the deidentified data used to develop the underlying model. If you are a commercial entity, please contact the corresponding author.

Contact

For any additional questions or comments, contact Fahad Saeed ([email protected]).
