Pre-processing in flow cytometry data

supervisor: Gerjen Tinnevelt 

Multicolor Flow Cytometry (MFC) is a powerful analytical platform for measuring the expression of several surface markers on a single cell. A typical MFC sample may contain a very large number of cells (>10,000).1 The number of markers that can be measured on the same cell is constantly increasing. Chemometric multivariate analysis is needed to visualize the high-dimensional data based on all measured markers. These analyses can be used to enhance the study of hematopoiesis and immunology, including immune responses to drugs and tumor progression.2 MFC is also used to measure the autofluorescence of algae.3

MFC data have a unique structure: the cells measured in each sample can be arranged into a matrix $\mathbf{X}_i$ of size $N_i \times J$, where $N_i$ is the number of cells in sample $i = 1, \ldots, I$ and $J$ is the number of surface proteins, indexed by $j = 1, \ldots, J$. Because of this structure, multiple pre-processing strategies exist. Pre-processing is needed so that the visualization highlights the response-related structure of the data rather than technical, biological and noise-related shifts and scales. For example, centering and scaling can be applied to the whole matrix $\mathbf{X}$ (all samples together), to each individual matrix $\mathbf{X}_i$, or relative to a control group. These strategies can have different effects, as seen in the figure below. The goal of the internship is to visualize the effect of the different pre-processing strategies in different situations and to show when to use which strategy.

Figure 1: Simulation. Red circles are cells from a control sample and blue triangles are cells from a case sample. In the top three panels the data are not scaled and different centering options are applied. In the bottom three panels the data are individually centered and different scaling options are applied.
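The centering and scaling options described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: each sample is a cells-by-markers array, the function names (`center_overall`, `center_individual`, `center_on_control`, `scale_on_control`) and the simulated data are hypothetical, and the options shown are not necessarily the exact ones used for Figure 1.

```python
import numpy as np


def center_overall(samples):
    """Subtract the per-marker mean of all cells pooled over all samples."""
    pooled_mean = np.vstack(samples).mean(axis=0)
    return [X - pooled_mean for X in samples]


def center_individual(samples):
    """Subtract from each sample its own per-marker mean."""
    return [X - X.mean(axis=0) for X in samples]


def center_on_control(samples, is_control):
    """Subtract the per-marker mean of the pooled control cells from every sample."""
    controls = [X for X, ctrl in zip(samples, is_control) if ctrl]
    control_mean = np.vstack(controls).mean(axis=0)
    return [X - control_mean for X in samples]


def scale_on_control(samples, is_control):
    """Divide every sample by the per-marker standard deviation of the control cells."""
    controls = [X for X, ctrl in zip(samples, is_control) if ctrl]
    control_std = np.vstack(controls).std(axis=0, ddof=1)
    return [X / control_std for X in samples]


# Simulated data (illustrative only): one control and one case sample, 2 markers.
rng = np.random.default_rng(0)
control = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
case = rng.normal(loc=2.0, scale=3.0, size=(300, 2))
samples, is_control = [control, case], [True, False]

individually_centered = center_individual(samples)
control_centered = center_on_control(samples, is_control)
control_scaled = scale_on_control(control_centered, is_control)
```

Individual centering removes the shift between the control and case samples, whereas centering on the control group keeps that shift visible. Note that pooling cells with `np.vstack` weights each sample by its number of cells; averaging per-sample means instead would be an equally plausible choice.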

A typical dataset contains over 100 samples, each with 100,000 cells and more than 10 surface markers measured per cell, leading to over 100 million data points. Many multivariate chemometric methods cannot handle that many data points, so subsampling is needed. A second subgoal is therefore to take a representative subset of the data. One criterion is that the subsample should capture the most common cells while outlier cells are removed, because outliers carry more weight once the data are subsampled; the rare cells, however, should be retained.
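As a starting point, the simplest form of subsampling draws the same number of cells uniformly at random from every sample. The sketch below shows only that baseline; the function name `subsample_cells`, the sample sizes and the target of 1000 cells are hypothetical. It does not remove outliers or protect rare cells, so a rare population can be under-represented by chance, which is exactly the problem the subgoal is meant to address.

```python
import numpy as np


def subsample_cells(X, n_cells, rng):
    """Draw up to n_cells rows (cells) from X uniformly at random, without replacement."""
    n_keep = min(n_cells, X.shape[0])
    idx = rng.choice(X.shape[0], size=n_keep, replace=False)
    return X[np.sort(idx)]  # keep the original cell order


rng = np.random.default_rng(42)
# Illustrative sizes, much smaller than a real dataset: 3 samples, 10 markers.
samples = [rng.normal(size=(n, 10)) for n in (5000, 800, 12000)]
subsampled = [subsample_cells(X, n_cells=1000, rng=rng) for X in samples]
print([X.shape for X in subsampled])  # [(1000, 10), (800, 10), (1000, 10)]
```

Samples with fewer cells than the target are kept whole, so the subsampled set is not perfectly balanced across samples.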