Rethinking Instance Representation Learning in Multiple Instance Learning


Bag-level classifiers are good teachers for instance-level classifiers.

Published on May 05, 2023 by Hongyi aka dotman

Pathology CAMELYON MIL

Recently I have been working on a new Multiple Instance Learning (MIL) framework that couples patch representation learning with bag-level classification. Progress has been pretty hard, since I also have other business to deal with at the same time. Still, I am very excited about this idea, because the logic behind it seems sound.

Current Multiple Instance Learning Methods and Their Limitations

For a very long time, MIL on Whole Slide Images (WSIs) has been split into two independent procedures: patch feature embedding and bag-level classification. Ideally these two stages would be trained end-to-end, but that is prohibitively expensive on current GPUs. Therefore, people usually use an ImageNet-pretrained ResNet50 (or a ViT or whatever; ViT performs better in my experiments, btw) to embed the patches of a WSI into fixed 1024-dimensional vectors, and then train a bag-level classifier on these embeddings at low cost. Another reason people tend to generate embeddings directly from a pretrained model is that, in most cases, we do not have patch-level labels. We do not know whether a patch contains tumor tissue or not, so using a pretrained network to project each image to a 1024-dimensional vector seems to be the most convenient option.
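To make the two-stage pipeline above concrete, here is a minimal NumPy sketch. It is only an illustration under assumptions: `frozen_embedder` is a hypothetical stand-in (a fixed random projection) for the frozen pretrained backbone, the bag-level classifier is a generic attention-pooling head with random, untrained weights, and all shapes other than the 1024-dimensional embedding are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_embedder(patches, proj):
    """Hypothetical stand-in for a frozen pretrained backbone:
    flatten each patch and project it to a fixed-size vector."""
    flat = patches.reshape(patches.shape[0], -1)
    return np.maximum(flat @ proj, 0.0)  # ReLU, shape (n_patches, 1024)

def attention_mil_forward(embeddings, V, w, clf_w, clf_b):
    """Generic attention pooling over one bag, then a linear classifier."""
    scores = np.tanh(embeddings @ V) @ w        # one score per patch
    scores = scores - scores.max()              # numerical stability
    attn = np.exp(scores) / np.exp(scores).sum()
    bag_vec = attn @ embeddings                 # weighted average, shape (1024,)
    logit = bag_vec @ clf_w + clf_b
    prob = 1.0 / (1.0 + np.exp(-logit))
    return prob, attn

# Stage 1: embed every patch of one slide once, with the embedder frozen.
patches = rng.uniform(size=(50, 8, 8, 3))       # toy "slide" of 50 tiny patches
proj = rng.normal(scale=192 ** -0.5, size=(8 * 8 * 3, 1024))
embeddings = frozen_embedder(patches, proj)

# Stage 2: only these small weights would be trained on the cached embeddings.
V = rng.normal(scale=0.01, size=(1024, 128))
w = rng.normal(scale=0.01, size=128)
clf_w = rng.normal(scale=0.01, size=1024)
prob, attn = attention_mil_forward(embeddings, V, w, clf_w, 0.0)
```

The point of the split is that stage 1 runs once per slide and never receives gradients, which is exactly why it is cheap and also why the embeddings cannot adapt to pathology data.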

However, this compromise on computational cost inevitably leads to imperfect patch embeddings. More specifically, applying ImageNet-pretrained weights to pathology images inevitably introduces domain shift. And since we do not have instance-level labels, we can only retrain the patch embedder with pseudo labels. Existing methods that focus on embedder training usually use the attention score of each instance to generate its pseudo label, but in our work, we propose to use the bag-level classifier to directly generate new embeddings.
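As a rough illustration of the attention-based pseudo labeling mentioned above, the sketch below uses one common scheme: in a positive bag, the `k` highest-attention instances are labeled positive and the `k` lowest are labeled negative, while every instance in a negative bag is labeled negative. The function name, the top-k/bottom-k rule, and `k` are assumptions for illustration; individual papers differ in the details.

```python
import numpy as np

def pseudo_labels_from_attention(attn, bag_label, k):
    """Return (instance indices, pseudo labels) for one bag.

    Assumed scheme: negative bag -> all instances are negative;
    positive bag -> top-k attention = 1, bottom-k attention = 0,
    and the instances in between are left unlabeled.
    """
    attn = np.asarray(attn, dtype=float)
    if bag_label == 0:
        return np.arange(attn.size), np.zeros(attn.size, dtype=int)
    k = min(k, attn.size // 2)          # keep the two groups disjoint
    order = np.argsort(attn)            # ascending attention
    low, high = order[:k], order[attn.size - k:]
    idx = np.concatenate([high, low])
    labels = np.concatenate([np.ones(k, dtype=int), np.zeros(k, dtype=int)])
    return idx, labels

attn = [0.05, 0.40, 0.10, 0.30, 0.15]
idx, labels = pseudo_labels_from_attention(attn, bag_label=1, k=2)
neg_idx, neg_labels = pseudo_labels_from_attention(attn, bag_label=0, k=2)
```

In this toy bag, instances 1 and 3 receive the positive pseudo label and instances 0 and 2 the negative one; instance 4 stays unlabeled.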

Coupling the Two Processes in Multiple Instance Learning

Of course, we would want an end-to-end trained MIL model if we had enough GPU memory. Unfortunately, in real life GPU memory is usually limited to 24GB, which makes end-to-end training on these gigapixel WSIs simply impossible.

We are glad to see that more and more people have recently started to look into this problem. In a CVPR 2023 paper [1], the authors propose to apply an information bottleneck to the tiled patches of a WSI and select only 10% of the patches at a time for end-to-end training. Our concurrent work has a similar idea, but instead of achieving orthodox end-to-end training at the cost of losing input information, we aim for nearly end-to-end training with the full input information. A general explanation of our solution is that we view the entire MIL pipeline as an Expectation-Maximization process, with patch embedding as the E step and bag classification as the M step. We iteratively optimize each of these two processes while keeping the other fixed, and gradually couple them together to realize nearly end-to-end training.
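The alternating schedule can be sketched on toy data as follows. This is not our actual method (the update rules are not described in this post); it only shows the "fix one part, optimize the other" loop. The linear embedder `W`, the mean-pooling logistic classifier `v`, the synthetic bags, and all hyperparameters are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bag_loss(bags, labels, W, v):
    """Mean binary cross-entropy over bags (linear embedder + mean pooling)."""
    total = 0.0
    for X, y in zip(bags, labels):
        p = sigmoid((X @ W).mean(axis=0) @ v)
        total += -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return total / len(bags)

def grads(bags, labels, W, v):
    """Gradients of the bag loss w.r.t. embedder W and classifier v."""
    gW, gv = np.zeros_like(W), np.zeros_like(v)
    for X, y in zip(bags, labels):
        m = X.mean(axis=0)              # mean of X @ W equals mean(X) @ W
        h = m @ W
        err = sigmoid(h @ v) - y        # dLoss / dLogit
        gW += err * np.outer(m, v)
        gv += err * h
    return gW / len(bags), gv / len(bags)

def alternate(bags, labels, W, v, rounds=30, inner=5, lr=0.5):
    history = [bag_loss(bags, labels, W, v)]
    for _ in range(rounds):
        for _ in range(inner):          # "E step": update embedder, classifier fixed
            gW, _ = grads(bags, labels, W, v)
            W = W - lr * gW
        for _ in range(inner):          # "M step": update classifier, embedder fixed
            _, gv = grads(bags, labels, W, v)
            v = v - lr * gv
        history.append(bag_loss(bags, labels, W, v))
    return W, v, history

rng = np.random.default_rng(0)

def make_bag(label, n=30, d_in=8):
    X = rng.normal(size=(n, d_in))
    if label == 1:
        X[:5, 0] += 3.0                 # a few synthetic "tumor" instances
    return X

labels = np.array([0, 1] * 20)
bags = [make_bag(y) for y in labels]
W0 = rng.normal(scale=0.1, size=(8, 4))
v0 = rng.normal(scale=0.1, size=4)
W, v, history = alternate(bags, labels, W0, v0)
```

In this toy version every instance fits in memory, so the alternation buys nothing over joint training; the schedule matters only when the embedder pass over a whole slide is too expensive to backpropagate through at once.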

Further Experiments

Since the paper has just been submitted to MICCAI 2023, we cannot give a detailed description of our solution yet. However, we will keep working on this idea and look for a better training scheme for the M step.

Reference

[1] Li H, Zhu C, Zhang Y, et al. Task-specific Fine-tuning via Variational Information Bottleneck for Weakly-supervised Pathology Whole Slide Image Classification. arXiv preprint arXiv:2303.08446, 2023.