Abstract
Pooling heterogeneous datasets across domains is a common strategy in representation learning, but naive pooling can amplify distributional asymmetries and yield biased estimators, especially when zero-shot generalization is required. We propose a matching framework that selects samples relative to an adaptive centroid and iteratively refines the representation distribution. By combining double robustness with propensity score matching for the inclusion of data domains, matching is more robust than naive pooling and uniform subsampling, because it filters out the confounding domains (the main cause of heterogeneity). Theoretical and empirical analyses show that, unlike naive pooling or uniform subsampling, matching achieves better results under asymmetric meta-distributions, and these results also extend to non-Gaussian and multimodal real-world settings. Most importantly, we show that these improvements translate to zero-shot medical anomaly detection, one of the most extreme forms of data heterogeneity and asymmetry. The code is available at this https URL.
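The idea of selecting samples relative to an adaptive centroid and refining iteratively can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `centroid_matching`, the parameters `keep_fraction`, `n_iters`, and `tol`, and the use of Euclidean distance are all assumptions made for illustration, and the propensity score and doubly robust components mentioned in the abstract are omitted.

```python
import numpy as np


def centroid_matching(embeddings, keep_fraction=0.5, n_iters=10, tol=1e-6):
    """Keep the samples nearest to a centroid that is re-estimated each round.

    Hypothetical illustration only; parameter names and defaults are
    assumptions, not values taken from the paper.
    """
    x = np.asarray(embeddings, dtype=float)
    n_keep = max(1, int(round(keep_fraction * len(x))))

    def nearest(center):
        # Indices of the n_keep samples closest to the given center.
        dists = np.linalg.norm(x - center, axis=1)
        return np.argsort(dists)[:n_keep]

    centroid = x.mean(axis=0)  # start from the naive pooled mean
    for _ in range(n_iters):
        selected = nearest(centroid)
        new_centroid = x[selected].mean(axis=0)  # adapt to the matched set
        shift = np.linalg.norm(new_centroid - centroid)
        centroid = new_centroid
        if shift < tol:  # stop once the centroid no longer moves
            break
    return nearest(centroid), centroid


# Synthetic asymmetric pool: a large main domain near the origin plus a
# smaller shifted domain that pulls the pooled mean away from it.
rng = np.random.default_rng(0)
main_domain = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
shifted_domain = rng.normal(loc=8.0, scale=1.0, size=(50, 2))
pooled = np.vstack([main_domain, shifted_domain])

selected, matched_centroid = centroid_matching(pooled, keep_fraction=0.5)
print("pooled mean:     ", pooled.mean(axis=0))
print("matched centroid:", matched_centroid)
```

In this toy setting the pooled mean is dragged toward the smaller shifted domain, while the matched centroid settles inside the main domain, which mirrors the asymmetry argument in the abstract at a very small scale.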
URL
https://arxiv.org/abs/2602.07154