Visualising Gene Expression Data

This post shows an example of using parallel coordinates for the visualisation of gene expression data. It demonstrates the use of two important interaction techniques: setting scales and coloring dimensions.

The dataset used in the original publication (Dietzsch, Heinrich, Nieselt, & Bartz, 2009) comprises gene expression measurements for approximately 800 genes involved in cell-cycle regulation (Spellman et al., 1998), measured every 7 minutes over 2 hours. By interpreting time points as ‘dimensions’ and each gene's expression levels as a point in a high-dimensional space, we obtain the following parallel-coordinates plot:

Here, values greater than zero signify that the corresponding gene is being expressed or ‘on’, while values below zero denote genes that are ‘off’ (for details on how to interpret gene expression levels, see the original publication and references therein). Note that we are looking at time-series data here, so the order of axes is fixed (from timepoint 0 on the left to timepoint 119 on the right). This type of plot is also referred to as a ‘profile plot’ in bioinformatics and doesn’t seem to make any use of the concept of parallel coordinates at this point (but we will get there). In fact, the axis scaling in the plot above is misleading: it is different for each axis, because by default d3.parcoords.js scales every axis to show the full range of its dimension, from minimum to maximum. Values at the same height on two axes are therefore not directly comparable. So the first thing we do is put a common scale in place, which makes this look even more like a classic time-series or profile plot:
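The difference between per-axis and common scaling can be sketched in a few lines of plain JavaScript. This is only an illustration, not the d3.parcoords.js implementation: the sample rows, the dimension names (`t0`, `t7`, `t14`) and the helper functions `axisExtent` and `commonExtent` are made up for this example.

```javascript
// Hypothetical rows: one object per gene, one numeric property per timepoint.
const genes = [
  { t0: -0.5, t7: 0.2, t14: 1.1 },
  { t0: 0.3, t7: -1.4, t14: 0.6 },
];
const timepoints = ["t0", "t7", "t14"];

// Per-axis extent: [min, max] of a single dimension (the default behaviour
// described above, where every axis shows only its own range).
function axisExtent(rows, dim) {
  const values = rows.map((row) => row[dim]);
  return [Math.min(...values), Math.max(...values)];
}

// Common extent: the smallest range that covers all listed dimensions,
// so that every axis can share the same scale.
function commonExtent(rows, dims) {
  const extents = dims.map((dim) => axisExtent(rows, dim));
  return [
    Math.min(...extents.map((e) => e[0])),
    Math.max(...extents.map((e) => e[1])),
  ];
}

const shared = commonExtent(genes, timepoints);
console.log(shared); // the shared [min, max] to use as the domain of every axis
```

With the shared extent applied to every time axis, equal heights on different axes correspond to equal expression levels.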

Apart from a couple of outliers, there is no pattern visible in this plot that might be worth exploring. So we do as in the paper and add another dimension to the data:

This is the critical step that turns the above into an actual parallel-coordinates plot: the new dimension does not denote a timepoint, but a statistic that was derived from the time series. In this case, Phi.sin is the phase shift of a harmonic regression curve that was fitted to every gene: genes with similar values for this dimension share a similar activation pattern over time. To visualise this, we apply a colormap to the new dimension:
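To make the idea of a phase-shift statistic concrete, here is a minimal sketch of one way such a value could be derived. It assumes a single-harmonic model y(t) ≈ a·sin(ωt) + b·cos(ωt) + c fitted by ordinary least squares, with the phase taken as atan2(b, a). The exact definition of Phi.sin in the paper may differ, and the function names, the 60-minute period and the synthetic gene below are all assumptions made for illustration.

```javascript
// Solve the small linear system A x = b by Gaussian elimination with partial pivoting.
function solveLinear(A, b) {
  const n = b.length;
  const M = A.map((row, i) => row.concat(b[i])); // augmented matrix [A | b]
  for (let col = 0; col < n; col++) {
    // Pick the row with the largest entry in this column as the pivot row.
    let pivot = col;
    for (let r = col + 1; r < n; r++) {
      if (Math.abs(M[r][col]) > Math.abs(M[pivot][col])) pivot = r;
    }
    const tmp = M[col];
    M[col] = M[pivot];
    M[pivot] = tmp;
    // Eliminate the column below the pivot.
    for (let r = col + 1; r < n; r++) {
      const factor = M[r][col] / M[col][col];
      for (let c = col; c <= n; c++) M[r][c] -= factor * M[col][c];
    }
  }
  // Back substitution.
  const x = new Array(n).fill(0);
  for (let r = n - 1; r >= 0; r--) {
    let sum = M[r][n];
    for (let c = r + 1; c < n; c++) sum -= M[r][c] * x[c];
    x[r] = sum / M[r][r];
  }
  return x;
}

// Least-squares fit of y(t) = a*sin(w*t) + b*cos(w*t) + c via the normal equations.
// Since a*sin(x) + b*cos(x) = R*sin(x + phase), the phase is atan2(b, a).
function fitHarmonic(times, values, period) {
  const w = (2 * Math.PI) / period;
  const X = times.map((t) => [Math.sin(w * t), Math.cos(w * t), 1]);
  const idx = [0, 1, 2];
  const XtX = idx.map((i) => idx.map((j) => X.reduce((s, row) => s + row[i] * row[j], 0)));
  const Xty = idx.map((i) => X.reduce((s, row, k) => s + row[i] * values[k], 0));
  const [a, b, c] = solveLinear(XtX, Xty);
  return { amplitude: Math.hypot(a, b), phase: Math.atan2(b, a), offset: c };
}

// Synthetic gene sampled every 7 minutes from 0 to 119, with a known phase of 1.0.
const times = Array.from({ length: 18 }, (_, k) => 7 * k);
const period = 60; // hypothetical cycle length in minutes, not taken from the paper
const values = times.map((t) => 0.8 * Math.sin((2 * Math.PI * t) / period + 1.0) + 0.1);
const fit = fitHarmonic(times, values, period);
console.log(fit.phase); // close to the phase of 1.0 used to generate the data
```

Running a fit like this once per gene yields one extra number per polyline, which can then be appended to the data as an additional axis.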

Now, we can see cyclic patterns emerge from the colored polylines. Be aware that the choice of colormap is important, as it defines the number of groups or clusters that we can distinguish. Although colors are interpolated in the above plot, I am using a colormap with five different hues.
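An interpolated colormap with five hues can be sketched as a piecewise-linear blend between five color stops. The RGB values, the [-π, π] domain for the phase and the `interpolateColor` helper are assumptions for this example, not the colormap used for the plot above.

```javascript
// Five hypothetical RGB stops, spread evenly across the value domain.
const stops = [
  [215, 25, 28],
  [253, 174, 97],
  [255, 255, 191],
  [171, 217, 233],
  [44, 123, 182],
];

// Map a value in [lo, hi] to a CSS color by linear interpolation
// between the two neighbouring stops. Values outside the domain are clamped.
function interpolateColor(value, domain, colors) {
  const [lo, hi] = domain;
  const t = Math.min(1, Math.max(0, (value - lo) / (hi - lo)));
  const scaled = t * (colors.length - 1);
  const i = Math.min(Math.floor(scaled), colors.length - 2); // index of the lower stop
  const f = scaled - i; // position between stop i and stop i + 1
  const rgb = colors[i].map((c, k) => Math.round(c + f * (colors[i + 1][k] - c)));
  return `rgb(${rgb[0]}, ${rgb[1]}, ${rgb[2]})`;
}

const phaseDomain = [-Math.PI, Math.PI]; // assumed range of the phase dimension
console.log(interpolateColor(0, phaseDomain, stops)); // the middle stop
```

Using fewer or more stops changes how many groups the eye can separate, which is the trade-off mentioned above.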

To conclude, this post was intended to demonstrate the use of scales, color, and exploratory data analysis to find patterns in parallel coordinates.

References

  1. Dietzsch, J., Heinrich, J., Nieselt, K., & Bartz, D. (2009). SpRay: A Visual Analytics Approach for Gene Expression Data. In Proceedings of the IEEE Symposium on Visual Analytics Science and Technology (pp. 179–186). doi:10.1109/VAST.2009.5333911
  2. Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., … Futcher, B. (1998). Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization. Molecular Biology of the Cell, 9(12), 3273–3297.