Hierarchical Latent Space Clustering For Simulated Data

Abstract number
174
Presentation Form
Poster
DOI
10.22443/rms.elmi2024.174
Corresponding Email
[email protected]
Session
Poster Session
Authors
Jack Bacon (1)
Affiliations
1. University of Warwick
Keywords

Latent Space Clustering, GAN, ClusterGAN

Abstract text


We propose an unsupervised approach to simulating labelled biomedical data, consisting of single cells and tissues acquired by state-of-the-art lattice lightsheet microscopy. We build on recent advances in Variational Autoencoders and Generative Adversarial Networks to perform clustering in the latent space and generate samples that present distinct features.


Previous work in latent space clustering has been reported to be highly sensitive to the hyperparameter k, which specifies the number of clusters the model is expected to form during training. These methods are said to suffer a significant reduction in performance when k does not match the true number of classes in the dataset. Through further experimentation, we have found this claim to be misleading. Rather than forming multiple new and nonsensical clusters, the model attempts to further separate existing clusters based on characteristics shared within classes. For a simple dataset such as MNIST, these characteristics can be simple features such as the angle or width of the digits.
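To make the role of k concrete, the following is a minimal sketch of how a latent code might be drawn in a ClusterGAN-style model. It assumes the latent vector is a small Gaussian noise part concatenated with a one-hot cluster indicator of length k, as in the published ClusterGAN design; the abstract does not state the exact latent structure used here, and the function name, `noise_dim`, and `sigma` values are illustrative assumptions.

```python
import random


def sample_latent(k, noise_dim, sigma=0.1, rng=None):
    """Draw one latent code z = (z_n, z_c) for a model with k clusters.

    Assumption: z_n is Gaussian noise with a small standard deviation and
    z_c is a one-hot vector selecting one of the k clusters.
    """
    rng = rng or random.Random()
    cluster = rng.randrange(k)  # cluster index in [0, k)
    z_n = [rng.gauss(0.0, sigma) for _ in range(noise_dim)]
    z_c = [1.0 if i == cluster else 0.0 for i in range(k)]
    return z_n + z_c, cluster


# Usage: k = 20 clusters for the 10 MNIST classes (hypothetical dimensions).
z, cluster = sample_latent(k=20, noise_dim=30, rng=random.Random(0))
print(len(z))  # 50
```

Changing k only changes the length of the one-hot part, which is why the same model can be retrained with k larger than the number of classes without other structural changes.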


We therefore aim to create a system of hierarchical clustering in which we deliberately choose a value of k > |classes|, forcing the model to further divide the samples in the latent space and allowing for greater specificity in generated samples. Through the chosen value of k, we can influence both the depth and the structure of the final hierarchy. Choosing k equal to the number of classes multiplied by a power of 2 is likely to result in a binary tree of hierarchical features, whereas other values may cause the model to target specific clusters for decomposition over others. It is, however, not possible to specify the exact nature of the final clusters, as the model remains unsupervised.
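The arithmetic behind the choice of k can be sketched as follows. This is only an illustration of the "number of classes multiplied by a power of 2" rule; the helper names are hypothetical, and whether the model actually produces a balanced binary tree for these values is an empirical question.

```python
def binary_hierarchy_ks(n_classes, max_depth):
    """Return k = n_classes * 2**d for d = 0..max_depth."""
    if n_classes < 1 or max_depth < 0:
        raise ValueError("n_classes must be >= 1 and max_depth >= 0")
    return [n_classes * 2 ** depth for depth in range(max_depth + 1)]


def implied_depth(k, n_classes):
    """Return d if k == n_classes * 2**d, otherwise None."""
    if k < n_classes or k % n_classes != 0:
        return None
    ratio = k // n_classes
    # A positive integer is a power of two iff it has exactly one bit set.
    if ratio & (ratio - 1) != 0:
        return None
    return ratio.bit_length() - 1


print(binary_hierarchy_ks(10, 2))  # [10, 20, 40]
print(implied_depth(20, 10))       # 1
print(implied_depth(30, 10))       # None
```

For ten MNIST classes, k = 20 corresponds to one binary split per class, while a value such as k = 30 does not fit the binary pattern and would leave the model to decide which clusters to decompose.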


Initial testing has shown that this approach is able to separate samples of the same class that express different features. Following previous work, we evaluated our model on the standard MNIST dataset and found that it was initially unable to differentiate between the four and nine classes. This was likely due to the model focusing on the angle of the digits rather than the comparatively small differences in their structure. Increasing the number of clusters in the latent space then allowed the model to break these clusters down further into multiple new groupings, some of which correctly identify the original classes.
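One way to inspect how the learned clusters relate to the original classes is to assign each cluster to its majority class and group clusters under that class. The sketch below is a hypothetical post-hoc analysis, not the method described in this abstract: the function name is an assumption, the toy cluster assignments are made up for illustration, and the ground-truth labels are used only for evaluation since the model itself remains unsupervised.

```python
from collections import Counter, defaultdict


def build_hierarchy(cluster_ids, class_labels):
    """Group cluster ids under the class that is most common within each."""
    per_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        per_cluster[cluster][label] += 1

    hierarchy = defaultdict(list)
    for cluster in sorted(per_cluster):
        majority_label, _ = per_cluster[cluster].most_common(1)[0]
        hierarchy[majority_label].append(cluster)
    return dict(hierarchy)


# Toy data (made up): four clusters over samples of the digits 4 and 9.
clusters = [0, 0, 0, 1, 1, 1, 2, 2, 3, 3]
labels = [4, 4, 9, 9, 9, 4, 4, 4, 9, 9]
print(build_hierarchy(clusters, labels))  # {4: [0, 2], 9: [1, 3]}
```

Clusters that share a parent class can then be compared visually to see which feature, such as digit angle, separates them.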

This provides us with multiple simulated clusters for a single labelled class within the training data, facilitating greater specificity in the simulated data. These simulated clusters are divided according to features already present in the training data. In our example, this can be seen in the ‘angled’ four and nine classes.


When applied to more complex data, this approach could allow for greater specificity in generated samples while simultaneously highlighting new structural patterns within the dataset which may otherwise remain unnoticed.


With minor alterations, this approach could be applied to complex 3D graph data, allowing us to simulate labelled biomedical data such as that obtained through lightsheet microscopy. This would help alleviate current demands for labelled training data, which pose a significant barrier for the supervised deep learning methods commonly used in the analysis of complex biomedical data.


Figure 1: MNIST digits clustered with k=10

Figure 2: MNIST digits clustered with k=20

Figure 3: Focusing on classes 4 and 9 with k=2

Figure 4: Focusing on classes 4 and 9 with k=4