Dimension Reduction Related Abstracts
3 Multidimensional Item Response Theory Models for Practical Application in Large Tests Designed to Measure Multiple Constructs
Abstract:This work presents a statistical methodology for measuring and founding constructs in Latent Semantic Analysis. This approach uses the qualities of Factor Analysis in binary data with interpretations present on Item Response Theory. More precisely, we propose initially reducing dimensionality with specific use of Principal Component Analysis for the linguistic data and then, producing axes of groups made from a clustering analysis of the semantic data. This approach allows the user to give meaning to previous clusters and found the real latent structure presented by data. The methodology is applied in a set of real semantic data presenting impressive results for the coherence, speed and precision. Procedia PDF Downloads 260
2 Hybrid Algorithm for Non-Negative Matrix Factorization Based on Symmetric Kullback-Leibler Divergence for Signal Dependent Noise: A Case Study
Abstract:Non-negative matrix factorization approximates a high dimensional non-negative matrix V as the product of two non-negative matrices, W and H, and allows only additive linear combinations of data, enabling it to learn parts with representations in reality. It has been successfully applied in the analysis and interpretation of high dimensional data arising in neuroscience, computational biology, and natural language processing, to name a few. The objective of this paper is to assess a hybrid algorithm for non-negative matrix factorization with multiplicative updates. The method aims to minimize the symmetric version of Kullback-Leibler divergence known as intrinsic information and assumes that the noise is signal-dependent and that it originates from an arbitrary distribution from the exponential family. It is a generalization of currently available algorithms for Gaussian, Poisson, gamma and inverse Gaussian noise. We demonstrate the potential usefulness of the new generalized algorithm by comparing its performance to the baseline methods which also aim to minimize symmetric divergence measures.
Keywords: Clustering, Dimension Reduction, non-negative matrix factorization, intrinsic information, symmetric information divergence, signal-dependent noise, exponential family, generalized Kullback-Leibler divergence, dual divergenceProcedia PDF Downloads 129
1 Index t-SNE: Tracking Dynamics of High-Dimensional Datasets with Coherent Embeddings
Abstract:t-SNE is an embedding method that the data science community has widely used. It helps two main tasks: to display results by coloring items according to the item class or feature value; and for forensic, giving a first overview of the dataset distribution. Two interesting characteristics of t-SNE are the structure preservation property and the answer to the crowding problem, where all neighbors in high dimensional space cannot be represented correctly in low dimensional space. t-SNE preserves the local neighborhood, and similar items are nicely spaced by adjusting to the local density. These two characteristics produce a meaningful representation, where the cluster area is proportional to its size in number, and relationships between clusters are materialized by closeness on the embedding. This algorithm is non-parametric. The transformation from a high to low dimensional space is described but not learned. Two initializations of the algorithm would lead to two different embeddings. In a forensic approach, analysts would like to compare two or more datasets using their embedding. A naive approach would be to embed all datasets together. However, this process is costly as the complexity of t-SNE is quadratic and would be infeasible for too many datasets. Another approach would be to learn a parametric model over an embedding built with a subset of data. While this approach is highly scalable, points could be mapped at the same exact position, making them indistinguishable. This type of model would be unable to adapt to new outliers nor concept drift. This paper presents a methodology to reuse an embedding to create a new one, where cluster positions are preserved. The optimization process minimizes two costs, one relative to the embedding shape and the second relative to the support embedding’ match. The embedding with the support process can be repeated more than once, with the newly obtained embedding. The successive embedding can be used to study the impact of one variable over the dataset distribution or monitor changes over time. This method has the same complexity as t-SNE per embedding, and memory requirements are only doubled. For a dataset of n elements sorted and split into k subsets, the total embedding complexity would be reduced from O(n²) to O(n²=k), and the memory requirement from n² to 2(n=k)², which enables computation on recent laptops. The method showed promising results on a real-world dataset, allowing to observe the birth, evolution, and death of clusters. The proposed approach facilitates identifying significant trends and changes, which empowers the monitoring high dimensional datasets’ dynamics. Procedia PDF Downloads 1