Normalizing Flow to Augmented Posterior: Conditional Density Estimation with Interpretable Dimension Reduction for High Dimensional Data
Authors: Cheng Zeng, George Michailidis, Hitoshi Iyatomi, Leo L Duan
Abstract:
The conditional density characterizes the distribution of a response variable y given a predictor x, and plays a key role in many statistical tasks, including classification and outlier detection. Although there has been abundant work on the problem of Conditional Density Estimation (CDE) for a low-dimensional response in the presence of a high-dimensional predictor, little work has been done for a high-dimensional response such as images. The promising performance of normalizing flow (NF) neural networks in unconditional density estimation serves as a motivating starting point. In this work, we extend NF neural networks to the setting where an external predictor x is present. Specifically, we use the NF to parameterize a one-to-one transform between a high-dimensional y and a latent z that comprises two components [zP, zN]. The zP component is a low-dimensional subvector obtained from the posterior distribution of an elementary predictive model for x, such as logistic/linear regression. The zN component is a high-dimensional independent Gaussian vector, which captures the variation in y that is unrelated or only weakly related to x. Unlike existing CDE methods, the proposed approach, coined Augmented Posterior CDE (AP-CDE), requires only a simple modification of the common normalizing flow framework, while significantly improving the interpretability of the latent representation, since zP provides a supervised dimension reduction. In image analysis applications, AP-CDE cleanly separates the x-related variation, due to factors such as lighting condition and subject identity, from the remaining random variation. Further, the experiments show that an unconditional NF neural network, based on an unsupervised model of z such as a Gaussian mixture, fails to generate interpretable results.
Keywords: Conditional density estimation, image generation, normalizing flow, supervised dimension reduction.
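To make the latent split described in the abstract concrete, the following is a minimal, hypothetical sketch in PyTorch. It is not the authors' implementation: it uses a single RealNVP-style affine coupling layer as the flow, a linear head standing in for the "elementary predictive model" of x, and fixed unit variances; names such as APCDE, AffineCoupling, head, and zp_dim are illustrative assumptions.

```python
# Sketch of the AP-CDE idea: an invertible flow maps y to z = [z_P, z_N],
# where z_P is tied to the predictor x and z_N is standard Gaussian noise.
# Assumptions: one coupling layer, a linear predictive head, unit variances.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One RealNVP-style coupling layer: an invertible map y -> z."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.d = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.d, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.d)),
        )

    def forward(self, y):
        y1, y2 = y[:, :self.d], y[:, self.d:]
        s, t = self.net(y1).chunk(2, dim=1)
        z2 = y2 * torch.exp(s) + t            # scale-and-shift the second half
        log_det = s.sum(dim=1)                # log |det dz/dy| of the coupling
        return torch.cat([y1, z2], dim=1), log_det

class APCDE(nn.Module):
    """Flow y -> z = [z_P, z_N]; z_P is linked to x via a linear predictor."""
    def __init__(self, y_dim, x_dim, zp_dim):
        super().__init__()
        self.flow = AffineCoupling(y_dim)      # a real model would stack layers
        self.zp_dim = zp_dim
        self.head = nn.Linear(x_dim, zp_dim)   # stand-in for the predictive model of x

    def log_prob(self, y, x):
        z, log_det = self.flow(y)
        z_p, z_n = z[:, :self.zp_dim], z[:, self.zp_dim:]
        # z_P | x ~ N(head(x), I): the supervised, low-dimensional component.
        lp_p = torch.distributions.Normal(self.head(x), 1.0).log_prob(z_p).sum(1)
        # z_N ~ N(0, I): variation in y unrelated (or weakly related) to x.
        lp_n = torch.distributions.Normal(0.0, 1.0).log_prob(z_n).sum(1)
        return lp_p + lp_n + log_det           # change-of-variables log p(y | x)
```

Training under this sketch would maximize model.log_prob(y_batch, x_batch).mean() over mini-batches with a stochastic optimizer such as Adam; a full-scale model would stack several coupling layers with permutations between them so that every coordinate of y is transformed.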