A Monte Carlo Method to Data Stream Analysis

Kittisak Kerdprasop; Nittaya Kerdprasop; Pairote Sattayatham

Commenced in January 2007

Frequency: Monthly

Edition: International

Paper Count: 32799

A Monte Carlo Method to Data Stream Analysis

Authors: Kittisak Kerdprasop, Nittaya Kerdprasop, Pairote Sattayatham

Abstract:

Data stream analysis is the process of computing various summaries and derived values from large amounts of data which are continuously generated at a rapid rate. The nature of a stream does not allow a revisit on each data element. Furthermore, data processing must be fast to produce timely analysis results. These requirements impose constraints on the design of the algorithms to balance correctness against timely responses. Several techniques have been proposed over the past few years to address these challenges. These techniques can be categorized as either dataoriented or task-oriented. The data-oriented approach analyzes a subset of data or a smaller transformed representation, whereas taskoriented scheme solves the problem directly via approximation techniques. We propose a hybrid approach to tackle the data stream analysis problem. The data stream has been both statistically transformed to a smaller size and computationally approximated its characteristics. We adopt a Monte Carlo method in the approximation step. The data reduction has been performed horizontally and vertically through our EMR sampling method. The proposed method is analyzed by a series of experiments. We apply our algorithm on clustering and classification tasks to evaluate the utility of our approach.

Keywords: Data Stream, Monte Carlo, Sampling, DensityEstimation.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1059597

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1381

References:

[1] C. Aggarwal, J. Han, J. Wang, and P. Yu, A framework for clustering evolving data streams, in Pro. Very Large Data Bases, 2003.
[2] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, Model and issues in data stream systems, in Pro. ACM PODS, 2002.
[3] M. Berthold and D.J. Hand, Intelligent Data Analysis: An Introduction. Springer-Verlag, 2003.
[4] J. Bilmes, A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models, Dept. Electrical Engineering and Computer Science, University of California Berkeley, Technical Report TR-97-021, 1998.
[5] G. Coremode and S. Muthukrishnan, What s hot and what s not: Tracking most frequent items dynamically, in Pro. ACM PODS, 2003.
[6] A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society B, vol. 39, pp. 1-22, 1977.
[7] P. Domingos and G. Hulten, A general method to scaling up machine learning algorithms and its application to clustering, in Pro. ICML, 2001.
[8] M. A. T. Figueiredo and A. K. Jain, Unsupervised learning of finite mixture models, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, pp. 381 396, 2002.
[9] M. Gaber, A. Zaslavsky, and S. Krishnaswamy, Mining data stream: A review, SIGMOD Record, vol. 34, pp. 18 26, 2005.
[10] A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss, One-pass wavelet decompositions of data streams, IEEE Trans. Knowledge and Data Engineering, vol. 15, pp. 541 554, 2003.
[11] D. Mackay, Introduction to Monte Carlo, in Learning in Graphical Models, M. Jordan, Ed. MIT Press, 1996, pp. 175-204.
[12] J. M. Marin, K. Mengersen, and C. Robert, Bayesian modelling and inference on mixtures of distributions, in Handbook of Statistics, vol. 25, Elsevier-Science, 2005.
[13] S. Muthukrishnan, Data streams: Algorithms and applications, in Proc. ACM-SIAM Symposium on Discrete Algorithm, 2003.
[14] R. Neal, Probabilistic inference using Markov chain Monte Carlo methods, Dept. Computer Science, University of Toronto, Technical Report CRG-TR93-1, 1993.
[15] B. Resch, A tutorial for the course computational intelligence, Available: http://www.igi.tugraz.at/lehre/CI
[16] J. von Neumann, Various techniques used in connection with random digits, Applied Mathematics Series, vol. 12, National Bureau of Standards, Washington, D.C., 1951.
[17] I. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.