Index t-SNE: Tracking Dynamics of High-Dimensional Datasets with Coherent Embeddings
Authors: G. Candel, D. Naccache
Abstract:
t-SNE is an embedding method widely used by the data science community. It serves two main tasks: displaying results, by coloring items according to their class or a feature value; and forensics, by giving a first overview of a dataset's distribution. Two interesting characteristics of t-SNE are its structure-preservation property and its answer to the crowding problem, in which all neighbors in the high-dimensional space cannot be represented correctly in the low-dimensional space. t-SNE preserves the local neighborhood, and similar items are nicely spaced by adjusting to the local density. These two characteristics produce a meaningful representation, where a cluster's area is proportional to its size in number of items, and relationships between clusters are materialized by closeness in the embedding. The algorithm is non-parametric: the transformation from the high- to the low-dimensional space is computed but not learned, so two initializations of the algorithm lead to two different embeddings. In a forensic approach, analysts would like to compare two or more datasets through their embeddings. A naive approach would be to embed all datasets together; however, this process is costly, as the complexity of t-SNE is quadratic, and it becomes infeasible for too many datasets. Another approach would be to learn a parametric model over an embedding built from a subset of the data. While this approach is highly scalable, distinct points could be mapped to the same exact position, making them indistinguishable, and such a model would be unable to adapt to new outliers or to concept drift. This paper presents a methodology for reusing an embedding to create a new one in which cluster positions are preserved. The optimization process minimizes two costs: one relative to the embedding's shape, and the second relative to the match with the support embedding. Embedding with a support can be repeated more than once, using the newly obtained embedding as the next support.
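The two-cost idea above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it combines the standard t-SNE KL-divergence cost with a quadratic anchoring term that pulls "support" points toward their positions in a previous embedding. The affinity computation is simplified (a single Gaussian bandwidth instead of per-point perplexity calibration), and names such as `anchor_weight` are illustrative assumptions.

```python
import numpy as np

def pairwise_sq_dists(X):
    """Squared Euclidean distances between all rows of X."""
    s = (X * X).sum(axis=1)
    return s[:, None] + s[None, :] - 2.0 * X @ X.T

def p_matrix(X, sigma=1.0):
    """Gaussian affinities in the high-dimensional space (simplified:
    one global bandwidth instead of per-point perplexity tuning)."""
    P = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    return np.maximum(P / P.sum(), 1e-12)

def q_matrix(Y):
    """Student-t affinities in the low-dimensional embedding."""
    N = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(N, 0.0)
    return np.maximum(N / N.sum(), 1e-12), N

def loss(P, Y, support_idx, Y_anchor, anchor_weight):
    """KL(P || Q) plus the anchoring cost on support points."""
    Q, _ = q_matrix(Y)
    kl = np.sum(P * np.log(P / Q))
    return kl + anchor_weight * np.sum((Y[support_idx] - Y_anchor) ** 2)

def step(P, Y, support_idx, Y_anchor, anchor_weight, lr=1.0):
    """One gradient-descent step on the combined cost."""
    Q, N = q_matrix(Y)
    M = (P - Q) * N                      # standard t-SNE gradient factor
    grad = 4.0 * ((np.diag(M.sum(axis=1)) - M) @ Y)
    grad[support_idx] += 2.0 * anchor_weight * (Y[support_idx] - Y_anchor)
    return Y - lr * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))                        # toy high-dim data
Y_prev = rng.normal(scale=1e-2, size=(30, 2))        # "support" embedding
support_idx = np.arange(15)                          # items shared with it
Y_anchor = Y_prev[support_idx]
Y = Y_prev.copy()                                    # initialize from support
P = p_matrix(X)

l0 = loss(P, Y, support_idx, Y_anchor, anchor_weight=0.1)
for _ in range(100):
    Y = step(P, Y, support_idx, Y_anchor, anchor_weight=0.1, lr=1.0)
l1 = loss(P, Y, support_idx, Y_anchor, anchor_weight=0.1)
print(f"combined cost: {l0:.4f} -> {l1:.4f}")
```

Initializing the new embedding from the support and penalizing drift of the shared points is what keeps cluster positions comparable across successive embeddings.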
The successive embeddings can be used to study the impact of one variable on the dataset distribution, or to monitor changes over time. The method has the same complexity as t-SNE per embedding, and memory requirements are only doubled. For a dataset of n elements sorted and split into k subsets, the total embedding complexity is reduced from O(n²) to O(n²/k), and the memory requirement from n² to 2(n/k)², which enables computation on recent laptops. The method showed promising results on a real-world dataset, allowing the birth, evolution, and death of clusters to be observed. The proposed approach facilitates identifying significant trends and changes, which empowers the monitoring of high-dimensional datasets' dynamics.
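The complexity claim follows from simple arithmetic: embedding k subsets of n/k items each costs k·(n/k)² = n²/k pairwise operations, and at any moment only the current subset's affinities plus the support embedding need to be held, hence 2(n/k)². A back-of-the-envelope check (figures illustrative):

```python
# Illustrative check of the cost reduction from splitting n items
# into k chronological subsets, each embedded against the previous
# embedding as support.
n, k = 100_000, 10

full_time = n ** 2                   # one joint embedding: O(n^2)
split_time = k * (n // k) ** 2       # k embeddings of n/k items: n^2 / k
full_mem = n ** 2                    # one n x n affinity structure
split_mem = 2 * (n // k) ** 2        # current subset + support embedding

print(split_time / full_time)        # ratio is 1/k
print(split_mem / full_mem)          # ratio is 2/k^2
```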
Keywords: Concept drift, data visualization, dimension reduction, embedding, monitoring, reusability, t-SNE, unsupervised learning.