SBP-ML: peptide data evaluation


For various reasons this dataset has been disregarded, but as I've learned more about mass spectrometry (MS) I have come to question why. This file tries to evaluate data quality in a structured way.
UPDATED: 191206

SBP = St Göran Bipolar Project. A longitudinal naturalistic cohort.

We have seen repeatedly that SBP-sthlm and SBP-gbg are not comparable, both clinically and across various biological measures. In the MS protocol, some parameters differ:

  • SBP-sthlm: n=342, trypsin-digested peptides. Most samples were not centrifuged before freezing.
  • SBP-gbg: n=144, pre-fractionated trypsin-digested peptides. All centrifuged.

Laboratory protocol


To repeat - in total there are 160 samples over 16 sets (including reference samples). These are pre-fractionated. Included in the protocol are some Hydrocephalus samples and some pool channels for quality control.

BD: Bipolar Disorder
HC: Healthy Control
Pool: Pool containing a fraction of all samples
H: Hydrocephalus, several samples from the same individual with hydrocephalus


Why so suspicious?


In the normalization protocol I run ANOVA tests for each normalization method. The p-values, one per assembled protein (yep, protein!), are plotted as histograms for each tested factor, including TMT channel (TMTch). Low p-values (indicating true bias) pile up for TMTch, and no normalization method handles this.
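As a rough illustration of the kind of test described above (not the code used in the normalization protocol), a one-way ANOVA per protein with TMT channel as the grouping factor could be sketched as follows. The long-format layout and the column names `protein`, `channel` and `value` are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def anova_f(groups):
    """One-way ANOVA: return (F, df_between, df_within) for a list of 1-D samples."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    groups = [g[~np.isnan(g)] for g in groups]      # ignore missing values
    groups = [g for g in groups if g.size > 0]      # drop empty groups
    k = len(groups)
    n = sum(g.size for g in groups)
    if k < 2 or n <= k:
        return np.nan, k - 1, n - k
    grand_mean = np.concatenate(groups).mean()
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    if ss_within == 0:
        return np.nan, k - 1, n - k
    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    return float(f_stat), k - 1, n - k


def channel_anova(long_df):
    """Run the ANOVA per protein, grouping values by TMT channel.

    long_df is assumed to have columns 'protein', 'channel' and 'value'.
    """
    rows = []
    for protein, sub in long_df.groupby("protein"):
        groups = [g["value"].to_numpy() for _, g in sub.groupby("channel")]
        f_stat, df1, df2 = anova_f(groups)
        rows.append({"protein": protein, "F": f_stat, "df1": df1, "df2": df2})
    return pd.DataFrame(rows).set_index("protein")


def p_values(result):
    """Upper-tail p-values from the F distribution (requires SciPy)."""
    from scipy.stats import f  # imported lazily; only needed for p-values
    return f.sf(result["F"], result["df1"], result["df2"])
```

If there were no channel effect, a histogram of `p_values(channel_anova(long_df))` would be roughly flat; a spike near zero is the pattern described for TMTch.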


There is clear channel bias already in the nonorm data, which is just median-centered abundance ratios. This was not seen in the SBP-sthlm dataset.

Table of contents

  • QC metadata
  • Looking at NA's
  • Distributions
  • Dim. red. with umap
  • Reference samples

Setup and load data


Also load some data straight into linux

Original file from PD2.2 has metadata, abundance ratios and abundances (and more) - define selections:

In the normalization protocol I filter out 1) peptides related to non-unique master proteins, and 2) peptides with <50% detection rate
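The two filters can be sketched in a few lines of pandas. This is only an illustration under assumptions: `abund` is a hypothetical peptides x samples table of abundances, and `is_unique` is a hypothetical boolean Series marking peptides that map to a unique master protein (the actual PD 2.2 column names are not shown here).

```python
import pandas as pd


def filter_peptides(abund, is_unique, min_detection=0.5):
    """Keep peptides that are unique and detected in at least min_detection of samples."""
    # Fraction of samples in which each peptide (row) was quantified
    detection_rate = abund.notna().mean(axis=1)
    keep = is_unique & (detection_rate >= min_detection)
    return abund.loc[keep]
```

With `min_detection=0.5`, peptides detected in fewer than 50% of samples are dropped, matching the rule stated above.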

1. QC metadata


The output file from PD 2.2 contains some metadata. Let's start by exploring that:


All this metadata is by peptide and not by sample, so we can probably estimate some general trends about identification reliability but not about deviating samples/sets.

Qvality and Percolator metrics


Not quite sure how to interpret this but it looks like good statistics? This is only for peptide identification(?)


Is there any other relevant information to extract from this data?

2. Looking at NA's


The total number of peptides is much higher than what can be expected to be found in any given set (due to oversampling). So we expect to see lots of NA's in each set, but this should not deviate across sets!

Global NA rate


Here, I just look at frequencies of missing datapoints. Starting simple with global NA-rate!


Okay, so 51% NA's globally. To put that number into context I guess you'd have to compare different datasets...
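For reference, the global NA rate is just the share of empty cells in the peptides x samples table. A minimal sketch, assuming a hypothetical pandas DataFrame `abund` with missing values stored as NaN:

```python
import pandas as pd


def global_na_rate(abund):
    """Fraction of all cells (peptide x sample) that are missing."""
    return float(abund.isna().to_numpy().mean())
```

The same function applied to another dataset's table would give the comparison number mentioned above.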

NA by sample


These are NA's per channel, reported by set. The input is the filtered abundance data (unique peptides with >50% detection rate), a total of 7233 peptides.


Note the Y-axes - sets 5 & 7 have many more missing datapoints! Sets 4 & 9 too.

There is also a general trend of more missing data in 126 & 127N (e.g. set 6), which adds up when you look at it globally.


  • Generally more missing data in TMT 126/127
  • Set 5 & 7 have much more missing data!
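The per-sample counts behind these observations can be tabulated as set x channel. The sketch below assumes, purely for illustration, that the columns of the hypothetical `abund` table form a pandas MultiIndex with levels named `set` and `channel`.

```python
import pandas as pd


def na_by_set_and_channel(abund):
    """Count missing peptides per sample and arrange them as a set x channel table."""
    counts = abund.isna().sum(axis=0)          # one count per (set, channel) column
    return counts.unstack(level="channel")     # rows: set, columns: channel
```

Row sums of the result (`table.sum(axis=1)`) point at sets with many NA's, and column means (`table.mean(axis=0)`) at channels with many NA's.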

3. Distributions


Are the ranges differing in any way?

The current hypothesis is that there is some pre-analytical issue with TMT 126/127 leading to incomplete peptide binding. From my understanding, this would result in NA's but would not necessarily affect quantification of labelled peptides or skew distributions. Let's look at it!


All channels follow the same general distribution, but some samples (e.g. 6:127N) deviate from the others within the set.

SUMMARY: No overall trend seen for set 5 and 7, and not for ch 126/127 either.
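A numeric companion to the distribution plots is a per-sample summary of the log2 abundances. This is a sketch only: the table name `abund` (peptides x samples) and the choice of quartiles as summary statistics are assumptions for the example, not what the plots above were built from.

```python
import numpy as np
import pandas as pd


def distribution_summary(abund):
    """Quartiles and IQR of log2 abundance per sample (missing values are skipped)."""
    log_ab = np.log2(abund)
    summary = log_ab.quantile([0.25, 0.5, 0.75]).T   # rows: samples
    summary.columns = ["q25", "median", "q75"]
    summary["iqr"] = summary["q75"] - summary["q25"]
    return summary
```

A sample whose median or IQR sits far from the other samples in its set would be a candidate for the kind of deviation noted for 6:127N.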

4. Dim. red. with umap


Here, I've used umap (an unsupervised dimensionality reduction algorithm) to try to visualize any clusters in the data. The input is abundances (not controlled for ch131 intensity), so we expect to see some grouping if the intensity varies between sets. The algorithm expects complete data without missing values, so I filtered out peptides that were not quantified in every sample.
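The two steps (keep only completely covered peptides, then embed the samples) could look roughly like this. It is a sketch, not the original cell: `abund` is a hypothetical peptides x samples table, the embedding uses the `umap-learn` package with default settings apart from a fixed seed, and the output column names are made up for the example.

```python
import pandas as pd


def complete_matrix(abund):
    """Drop peptides with any missing value; return a samples x peptides matrix."""
    complete = abund.dropna(axis=0, how="any")
    return complete.T


def embed(matrix, random_state=0):
    """2-D UMAP embedding of the samples (requires the umap-learn package)."""
    import umap  # imported lazily so the filtering step works without umap-learn

    reducer = umap.UMAP(n_components=2, random_state=random_state)
    coords = reducer.fit_transform(matrix.to_numpy())
    return pd.DataFrame(coords, index=matrix.index, columns=["umap1", "umap2"])
```

Plotting `umap1` against `umap2` and coloring the points by set would show whether samples from the same set sit close together.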


The plot looks weird when rendered, but the key point is that some sets don't really cluster, whereas e.g. in the top left there are 2-3 sets with samples really close together (sets 4, 5, 7). Set 9 is also in there but hard to see in this quick render.

SUMMARY: Again, set 5 and 7 (and 4 & 9) deviate!

5. Reference samples


In the common protocol, reference samples are only in one channel (TMT131), but in this study 5 additional reference samples were spread across the protocol. There was also a group of identical hydrocephalus samples spread across the protocol.


  • Left: Number of NA's - as above
  • Right: Abundance ratios (ch/TMT131). Y-axis label should be 'log2(rel.abundance)'. E.g. 4/127C is a pool sample tagged with TMT 127C referenced to a pool sample tagged with TMT 131, so there should be a bare minimum of variance around 0!

Set 7 again with many more NA's!

Overall distribution seems good but possibly less tight in 4:127C, but since it's just one sample it is hard to tell if this is because of TMT 127 or just a random event.

SUMMARY: Quite a few outliers in some sets, but the general trend is a tight normal distribution around 0 (=good!)
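The quantity in the right-hand panel, log2(ch/TMT131), and a simple measure of how tightly it sits around 0 can be sketched as below. The per-set table `set_abund` (peptides x channels, with the reference in a column named `"131"`) and the use of median and MAD as summary statistics are assumptions for this example.

```python
import numpy as np
import pandas as pd


def log2_ratio_to_reference(set_abund, reference="131"):
    """log2 of each channel's abundance divided by the reference channel, per peptide."""
    ref = set_abund[reference]
    others = set_abund.drop(columns=reference)
    return np.log2(others.div(ref, axis=0))


def spread_around_zero(log_ratios):
    """Median and median absolute deviation of the log2 ratios, per channel."""
    med = log_ratios.median()
    mad = (log_ratios - med).abs().median()
    return pd.DataFrame({"median": med, "mad": mad})
```

For a pool sample referenced to a pool sample, both the median and the MAD would be expected to stay close to 0.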

Same thing with Hydrocephalus samples: So 17 identical samples referenced to TMT 131 per set.


NA's follow the general trend above (set 5 & 7...).

SUMMARY: No certain shift/skewness in the distribution in TMT 126/127.


  • 4 sets have much more missing data than others: 5, 7, 9, 4
  • TMT 126 and 127N seem to have more NA's than the other channels
  • No verified shift in distributions in the peptides that were identified and quantified in these sets/channels

So, there are some issues with missing data. But how can we use the data that was quantified?