For various reasons this dataset has been disregarded, but as I've come to learn more about mass spectrometry (MS) I have questioned why. This file tries to evaluate data quality in a structured way.
SBP = St Göran Bipolar Project. A longitudinal naturalistic cohort.
We have seen repeatedly that SBP-sthlm and SBP-gbg are not comparable, both clinically and across various biological measures. In the MS data, some parameters deviate as well:
To repeat - in total there are 160 samples over 16 sets (including reference samples). These are pre-fractionated. Included in the protocol are some Hydrocephalus samples and some pool channels for quality control.
BD: Bipolar Disorder
HC: Healthy Control
Pool: Pool containing a fraction of all samples
H: Hydrocephalus, several samples from the same individual with hydrocephalus
In the normalization protocol I look at ANOVA tests for each normalization method. The test produces p-value histograms, computed per assembled protein (yep, protein!), for each tested factor, including TMT channel (TMTch). Low p-values (indicating true bias) aggregate in TMTch, and no normalization method handles this.
Clear channel bias is already present in the nonorm data, which is just median-centered abundance ratios. This was not seen in the SBP-sthlm dataset.
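To make the ANOVA check concrete, here is a minimal sketch of a per-protein one-way ANOVA across a grouping factor, assuming a proteins x samples table of log ratios. The function name `anova_pvalues`, the `set:channel` column naming and the toy data are assumptions for illustration, not the actual normalization protocol.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway


def anova_pvalues(values: pd.DataFrame, factor: pd.Series) -> pd.Series:
    """One-way ANOVA per protein (row) across the levels of `factor`.

    values: proteins x samples table of log ratios (NaN allowed).
    factor: per-sample label (e.g. TMT channel), indexed by sample name.
    """
    labels = factor.reindex(values.columns).to_numpy()
    levels = pd.unique(labels)
    pvals = {}
    for protein, row in values.iterrows():
        x = row.to_numpy(dtype=float)
        observed = ~np.isnan(x)
        groups = [x[(labels == level) & observed] for level in levels]
        # Keep only levels with at least two quantified values.
        groups = [g for g in groups if len(g) >= 2]
        pvals[protein] = f_oneway(*groups).pvalue if len(groups) >= 2 else np.nan
    return pd.Series(pvals, name="p_value")


# Toy data (hypothetical): 200 proteins, 6 sets x 4 channels, with an
# artificial offset added to channel 126 to mimic a channel bias.
rng = np.random.default_rng(0)
channels = ["126", "127N", "127C", "128N"]
samples = [f"{s}:{ch}" for s in range(1, 7) for ch in channels]
tmt_channel = pd.Series([s.split(":")[1] for s in samples], index=samples)
tmt_set = pd.Series([s.split(":")[0] for s in samples], index=samples)

values = pd.DataFrame(rng.normal(0, 0.5, (200, len(samples))), columns=samples)
values.loc[:, tmt_channel[tmt_channel == "126"].index] += 2.0

p_channel = anova_pvalues(values, tmt_channel)
p_set = anova_pvalues(values, tmt_set)

# Histogram counts over [0, 1]; a spike in the first bin suggests real bias.
hist_channel, _ = np.histogram(p_channel.dropna(), bins=20, range=(0, 1))
hist_set, _ = np.histogram(p_set.dropna(), bins=20, range=(0, 1))
```

In the toy data the p-values for the channel factor pile up near zero, while those for the set factor do not. That is the pattern described above for TMTch.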
Also load some data straight into linux
Original file from PD2.2 has metadata, abundance ratios and abundances (and more) - define selections:
In the normalization protocol I filter out 1) peptides related to non-unique master proteins, and 2) peptides with <50% detection rate
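The two filters can be sketched as follows. The column name `n_protein_groups` (with 1 meaning the peptide maps to a single master protein) and the `set:channel` abundance column names are hypothetical. The real PD 2.2 export uses its own headers, so treat this as an illustration of the logic only.

```python
import numpy as np
import pandas as pd


def filter_peptides(peptides, abundance_cols, group_col="n_protein_groups",
                    min_detection=0.5):
    """Keep unique peptides that are quantified in enough samples."""
    # 1) Unique peptides: assumed to be flagged by exactly one protein group.
    is_unique = peptides[group_col] == 1
    # 2) Detection rate: fraction of samples with a non-missing abundance.
    detection_rate = peptides[abundance_cols].notna().mean(axis=1)
    # Drops peptides detected in fewer than `min_detection` of the samples.
    return peptides[is_unique & (detection_rate >= min_detection)]


# Toy table (hypothetical values).
peptides = pd.DataFrame({
    "peptide": ["A", "B", "C", "D"],
    "n_protein_groups": [1, 2, 1, 1],
    "1:126": [1.0, 1.0, 1.0, 1.0],
    "1:127N": [1.2, 0.9, np.nan, np.nan],
    "2:126": [0.8, 1.1, np.nan, 1.3],
    "2:127N": [1.1, 1.0, np.nan, np.nan],
})
abundance_cols = ["1:126", "1:127N", "2:126", "2:127N"]
kept = filter_peptides(peptides, abundance_cols)
```

Here peptide B is dropped as non-unique and peptide C for being detected in only 1 of 4 samples. Peptide D sits exactly at 50% and is kept.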
The output file from PD 2.2 contains some metadata. Let's start by exploring that:
All this metadata is by peptide and not by sample, so we can probably estimate some general trends about identification reliability but not about deviating samples/sets.
Not quite sure how to interpret this but it looks like good statistics? This is only for peptide identification(?)
Is there any other relevant information to extract from this data?
The total number of peptides is much higher than what can be expected to be found in any given set (due to oversampling). So we expect to see lots of NA's in each set, but this should not deviate across sets!
Here, I just look at frequencies of missing datapoints. Starting simple with global NA-rate!
Okay, so 51% NA's globally. To put that number into context I guess you'd have to compare different datasets...
These are NA's per channel, reported by set. The input is the filtered abundance data (unique peptides with >50% detection rate), a total of 7233 peptides.
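For reference, the global NA rate and the per-channel, per-set NA counts can be computed along these lines. This is a sketch that assumes a peptides x samples abundance table whose columns are named `set:channel` (e.g. `6:127N`). That naming, and the helper `na_summary`, are assumptions.

```python
import numpy as np
import pandas as pd


def na_summary(abund: pd.DataFrame):
    """Return (global NA rate, table of NA counts with channels x sets)."""
    missing = abund.isna()
    global_rate = missing.to_numpy().mean()
    per_sample = missing.sum(axis=0)
    # Split "set:channel" column names into a two-level index.
    sets = [c.split(":")[0] for c in abund.columns]
    channels = [c.split(":")[1] for c in abund.columns]
    per_sample.index = pd.MultiIndex.from_arrays(
        [sets, channels], names=["set", "channel"]
    )
    return global_rate, per_sample.unstack("set")


# Toy data (hypothetical values): 4 peptides, 2 sets x 2 channels.
abund = pd.DataFrame({
    "1:126": [1.0, np.nan, 3.0, 4.0],
    "1:127N": [1.0, 2.0, 3.0, 4.0],
    "2:126": [np.nan, np.nan, np.nan, 4.0],
    "2:127N": [1.0, np.nan, 3.0, 4.0],
})
global_rate, na_table = na_summary(abund)
per_channel_total = na_table.sum(axis=1)  # NA's per channel, summed over sets
```

Reading `na_table` column by column gives the per-set view, and the row sums give the per-channel totals across the whole protocol.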
Note the Y-axes - sets 5 & 7 have many more missing datapoints! Sets 4 & 9 too.
There is also a general trend of more missing data in 126 & 127N (e.g. set 6), which adds up when you look at it globally.
Do the ranges differ in any way?
The current hypothesis is that there is some pre-analytical issue with TMT 126/127 leading to incomplete peptide binding. To my understanding this would result in NA's but not necessarily affect quantification of labelled peptides, or skew distributions. Let's look at it!
All channels follow the same general distribution, but some samples (e.g. 6:127N) deviate from the others within the set.
SUMMARY: No overall trend seen for set 5 and 7, and not for ch 126/127 either.
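One way to put numbers on "same general distribution" is a per-sample summary of the quantified values. The sketch below assumes log2 abundance ratios in a peptides x samples table. The helper `channel_summary` and the toy data are hypothetical and only illustrate that missingness and a distribution shift are separate things.

```python
import numpy as np
import pandas as pd


def channel_summary(log_ratios: pd.DataFrame) -> pd.DataFrame:
    """Quartiles, IQR and number of quantified peptides per sample."""
    # NaNs are skipped by quantile(), so only quantified peptides count.
    q = log_ratios.quantile([0.25, 0.5, 0.75]).T
    q.columns = ["q25", "median", "q75"]
    q["iqr"] = q["q75"] - q["q25"]
    q["n_quantified"] = log_ratios.notna().sum()
    return q


# Toy data (hypothetical): one set with four channels.
rng = np.random.default_rng(1)
samples = ["6:126", "6:127N", "6:127C", "6:128N"]
log_ratios = pd.DataFrame(rng.normal(0, 1, (2000, 4)), columns=samples)
# 6:126 loses most values at random: many NA's, same distribution.
log_ratios.loc[rng.random(2000) < 0.7, "6:126"] = np.nan
# 6:127N keeps all values but is shifted: no NA's, deviating distribution.
log_ratios["6:127N"] += 1.0

summary = channel_summary(log_ratios)
```

In this toy example `6:126` has far fewer quantified peptides but a median near 0, while `6:127N` is fully quantified but clearly shifted.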
Here, I've used UMAP (an unsupervised dimensionality reduction algorithm) to try to visualize any clusters in the data. The input is abundances (not controlled for ch131 intensity) so we expect to see some grouping if the intensity varies between sets. This algorithm expects complete data without missing values, so I filtered out peptides that were not quantified in every sample.
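A minimal sketch of this step, assuming a peptides x samples abundance table and the `umap-learn` package. The log2 transform, the `n_neighbors` value and the helper names are my assumptions here and may differ from the settings actually used.

```python
import numpy as np
import pandas as pd


def complete_cases(abund: pd.DataFrame) -> pd.DataFrame:
    """Keep only peptides quantified in every sample."""
    return abund.dropna(axis=0, how="any")


def embed_samples(abund: pd.DataFrame, n_neighbors: int = 15,
                  random_state: int = 0) -> pd.DataFrame:
    """2-D UMAP embedding with one point per sample."""
    import umap  # from the umap-learn package, imported only when needed

    complete = complete_cases(abund)
    # UMAP expects observations in rows, so transpose to samples x peptides.
    # The log2 transform is an assumption, not necessarily the original input.
    x = np.log2(complete.to_numpy(dtype=float)).T
    reducer = umap.UMAP(n_neighbors=n_neighbors, n_components=2,
                        random_state=random_state)
    coords = reducer.fit_transform(x)
    return pd.DataFrame(coords, index=complete.columns,
                        columns=["UMAP1", "UMAP2"])


# Toy table (hypothetical values): the second peptide is dropped.
abund = pd.DataFrame({
    "1:126": [10.0, 12.0, 9.0],
    "1:127N": [11.0, np.nan, 8.5],
    "2:126": [20.0, 25.0, 18.0],
})
complete = complete_cases(abund)
```

With the full dataset, `embed_samples(abund)` would return coordinates that can be coloured by set to look for sets that cluster tightly.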
The plot looks weird when rendered but the key point is that some sets don't really cluster but e.g. in the top left there are 2-3 sets with samples really close together (set 4,5,7). Set 9 is also in there but hard to see in this quick render.
SUMMARY: Again, set 5 and 7 (and 4 & 9) deviate!
In the common protocol, reference samples are only in one channel (TMT131), but in this study 5 additional reference samples were spread across the protocol. There was also a group of identical hydrocephalus samples spread across the protocol.
Set 7 again with many more NA's!
Overall distribution seems good but possibly less tight in 4:127C, but since it's just one sample it is hard to tell if this is because of TMT 127 or just a random event.
SUMMARY: Quite a few outliers in some sets, but the general trend is a tight normal distribution around 0 (=good!)
Same thing with Hydrocephalus samples: So 17 identical samples referenced to TMT 131 per set.
NA's follow the general trend above (set 5 & 7...).
SUMMARY: No certain shift/skewness in the distribution in TMT 126/127.
So, there are some issues with missing data. But how can we use the data that was quantified?