Series article 5. #metabolomics #lipidomics #massspectrometry #cancer #research #analyticalchemistry
Welcome back to Article 5 in a series dedicated to increasing awareness of factors that can affect metabolomic (and other omic) data quality. Today I’ll continue the 42 Factors Series with a discussion of pre-normalization and how it can increase rigor and reproducibility in omic data (with my usual emphasis on metabolomics).
Factor – Pre-Normalization
It wouldn’t be fair for my 8-year-old nephew’s recreational league basketball team to compete against the Houston Rockets. Sure, we’d cheer Henry on, but the output of the two teams wouldn’t even compare. The same is true with all types of scientific data—data normalization is needed to facilitate accurate comparisons and conclusions. Normalization can happen after the data are collected, but I’ll argue here that it’s even more important to create a level playing field during the experimental design stage.
In a future article I’ll come back to post-normalization, but today I’ll focus on pre-normalization. Whereas post-normalization is typically a statistical procedure applied after an experiment is completed, pre-normalization is typically an experimental procedure applied prior to data acquisition.
In mass spectrometry (MS)-based metabolomics, pre-normalization takes many forms. In cell-based time course experiments, if an investigator seeds the same number of cells in all dishes and then harvests those dishes over a 48-hour period, the number of cells will differ substantially between time points. One pre-normalization strategy would be to count the cells in each sample immediately prior to harvest and then use the cell count as a basis for loading the same amount of cell extract (in cell equivalents) into the mass spectrometer. Instead, we recommend “differential seeding” (i.e., seeding with variable cell density) to target ~70% confluence for all samples at the time of harvest. Don’t worry: you don’t need to achieve exactly the same confluence in each dish at the time of harvest (but it’s great if you do). It can be a bit of a guessing game to figure out how many cells to seed for each of the different time points in your experiment. Do your best, and post-normalization will correct for the remaining differences in confluence later.
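To take some of the guesswork out of differential seeding, here is a minimal Python sketch of a first-pass estimate. It assumes simple exponential growth with a known doubling time, which is my simplification and not a validated protocol: it ignores lag phase, contact inhibition, and any effect of treatment on growth. The function names, the 24-hour doubling time, and the 1,000,000-cell dish capacity are hypothetical placeholders to be replaced with values measured for your own cell line and dish format.

```python
def cells_to_seed(hours_to_harvest, doubling_time_h, cells_at_confluence,
                  target_fraction=0.70):
    """Estimate cells to seed so a dish reaches the target confluence at harvest.

    Assumes simple exponential growth (no lag phase, no contact inhibition,
    no treatment effect on growth), which is only a rough approximation.
    """
    target_cells = target_fraction * cells_at_confluence
    doublings = hours_to_harvest / doubling_time_h
    return target_cells / (2 ** doublings)


def seeding_plan(harvest_hours, doubling_time_h, cells_at_confluence,
                 target_fraction=0.70):
    """Map each seeding-to-harvest interval (hours) to a rounded seeding count."""
    return {
        h: round(cells_to_seed(h, doubling_time_h, cells_at_confluence,
                               target_fraction))
        for h in harvest_hours
    }


if __name__ == "__main__":
    # Hypothetical values: 24 h doubling time, 1,000,000 cells at full confluence.
    plan = seeding_plan([24, 48, 72], doubling_time_h=24,
                        cells_at_confluence=1_000_000)
    for hours, cells in plan.items():
        print(f"harvest {hours} h after seeding: seed ~{cells:,} cells")
```

Treat the output as a starting point for a pilot experiment rather than a final answer; real growth curves rarely follow the idealized model exactly.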
“Won’t post-normalization correct for differences in sample input?” you ask. Not always, especially for low-abundance analytes. In MS, unequal sample loading biases comparisons of low-abundance metabolites that are detected only above a certain signal threshold. For example, an analyte may go undetected when 100,000 cells are extracted but be detected when 1,000,000 cells are extracted. Post-normalization cannot recover a signal that was never recorded, but pre-normalization can prevent the loss in the first place.
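The threshold effect can be illustrated with a toy Python model. It assumes that signal scales linearly with the number of cells extracted and that anything below a fixed detection limit is reported as a non-detect; both are simplifications, and the response factor and detection limit below are made-up numbers chosen only to mirror the 100,000 versus 1,000,000 cell example.

```python
def measured_signal(cells, response_per_cell, detection_limit):
    """Return the instrument signal, or None if it falls below the detection limit.

    Toy model: signal is assumed to be strictly proportional to cell input.
    """
    signal = cells * response_per_cell
    return signal if signal >= detection_limit else None


def post_normalize(signal, cells):
    """Divide signal by cell count; a non-detect (None) stays a non-detect."""
    return None if signal is None else signal / cells


if __name__ == "__main__":
    # Hypothetical low-abundance analyte and detection limit (arbitrary units).
    response_per_cell = 0.002
    detection_limit = 1000.0

    for cells in (100_000, 1_000_000):
        raw = measured_signal(cells, response_per_cell, detection_limit)
        print(cells, "cells -> raw:", raw,
              "| per-cell:", post_normalize(raw, cells))
```

In this sketch both samples contain the same amount of analyte per cell, yet only the larger input yields a value. Dividing by cell count afterward has nothing to work with for the smaller sample.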
Have you seen data sets containing a large number of undetected metabolites? It’s a bit maddening, simply because something really interesting could be going on with those analytes, but we’ll never know. My biggest concern with omic data is false negatives—missing something interesting that was present but undetected. That problem can be mitigated by pre-normalization.
It’s worth noting that many data analysts address the non-detect problem by applying imputation, a statistical procedure that replaces zero or missing values with non-zero estimates to facilitate analysis of the data set. I won’t go into a full rant against imputation here, but, put simply, imputation treats a symptom but doesn’t cure the disease.
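For readers who have not seen this kind of zero replacement in practice, here is a minimal Python sketch of one common variant: filling each non-detect with half of the smallest detected value for that metabolite. The function name and intensities are hypothetical, and this is only one of several replacement schemes analysts use.

```python
def half_minimum_impute(values):
    """Replace non-detects (None or 0) with half of the smallest detected value.

    If nothing was detected at all, the list is returned unchanged because
    there is no measured value to base a replacement on.
    """
    detected = [v for v in values if v is not None and v > 0]
    if not detected:
        return list(values)
    fill = min(detected) / 2
    return [v if (v is not None and v > 0) else fill for v in values]


if __name__ == "__main__":
    # Hypothetical intensities for one metabolite across four samples.
    intensities = [0, 400.0, 1000.0, None]
    print(half_minimum_impute(intensities))
```

Note that the filled-in numbers are estimates, not measurements, which is exactly why the procedure treats the symptom rather than the cause.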
Another example where pre-normalization is effective is in preclinical and clinical studies with body fluid samples (e.g., whole blood, plasma, serum, or urine), with which it is customary to process the same volume of fluid across all samples. That is indeed a form of pre-normalization. It is worth noting that, in general, pre-normalization does not preclude the need for post-normalization, but in the case of those body fluids many analysts do not conduct post-normalization under the assumption that the measurement obtained from each fixed-volume sample yields accurate comparisons. Some, nevertheless, do apply post-normalization. This article provides a more detailed guide to selecting appropriate pre- and post-normalization procedures for urine metabolomics, for example.
In summary, if you keep in mind that the overall goal of data normalization is to put data on a level playing field to facilitate accurate comparisons, you’ll be better placed to avoid inaccurate results and conclusions.
Take-home messages: 1) cell-based time course data are improved by pre-normalization; 2) differential seeding is a preferred strategy for pre-normalization of cell-based time course data; and 3) in general, pre-normalization does not preclude the need for post-normalization (which will be discussed further in a future article).
Thanks to Chris Beecher, who first pointed out the value of pre-normalization in metabolomics to me, and to present and former members of my team, especially Preeti Purwaha, Di Du, Leona Martin, and Lucas Veillon for their efforts toward studying this factor.
Do you have any pressing metabolomics questions? Leave a comment.