{"title":"A Monte Carlo Method to Data Stream Analysis","authors":"Kittisak Kerdprasop, Nittaya Kerdprasop, Pairote Sattayatham","volume":20,"journal":"International Journal of Computer and Information Engineering","pagesStart":2758,"pagesEnd":2764,"ISSN":"1307-6892","URL":"https:\/\/publications.waset.org\/pdf\/4653","abstract":"Data stream analysis is the process of computing\r\nvarious summaries and derived values from large amounts of data\r\nwhich are continuously generated at a rapid rate. The nature of a\r\nstream does not allow a revisit on each data element. Furthermore,\r\ndata processing must be fast to produce timely analysis results. These\r\nrequirements impose constraints on the design of the algorithms to\r\nbalance correctness against timely responses. Several techniques\r\nhave been proposed over the past few years to address these\r\nchallenges. These techniques can be categorized as either dataoriented\r\nor task-oriented. The data-oriented approach analyzes a\r\nsubset of data or a smaller transformed representation, whereas taskoriented\r\nscheme solves the problem directly via approximation\r\ntechniques. We propose a hybrid approach to tackle the data stream\r\nanalysis problem. The data stream has been both statistically\r\ntransformed to a smaller size and computationally approximated its\r\ncharacteristics. We adopt a Monte Carlo method in the approximation\r\nstep. The data reduction has been performed horizontally and\r\nvertically through our EMR sampling method. The proposed method\r\nis analyzed by a series of experiments. We apply our algorithm on\r\nclustering and classification tasks to evaluate the utility of our\r\napproach.","references":"[1] C. Aggarwal, J. Han, J. Wang, and P. Yu, A framework for clustering\r\nevolving data streams, in Pro. Very Large Data Bases, 2003.\r\n[2] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, Model and\r\nissues in data stream systems, in Pro. ACM PODS, 2002.\r\n[3] M. Berthold and D.J. Hand, Intelligent Data Analysis: An Introduction.\r\nSpringer-Verlag, 2003.\r\n[4] J. Bilmes, A gentle tutorial of the EM algorithm and its application to\r\nparameter estimation for Gaussian mixture and hidden Markov models,\r\nDept. Electrical Engineering and Computer Science, University of\r\nCalifornia Berkeley, Technical Report TR-97-021, 1998.\r\n[5] G. Coremode and S. Muthukrishnan, What s hot and what s not:\r\nTracking most frequent items dynamically, in Pro. ACM PODS, 2003.\r\n[6] A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum likelihood\r\nfrom incomplete data via the EM algorithm, Journal of the Royal\r\nStatistical Society B, vol. 39, pp. 1-22, 1977.\r\n[7] P. Domingos and G. Hulten, A general method to scaling up machine\r\nlearning algorithms and its application to clustering, in Pro. ICML,\r\n2001.\r\n[8] M. A. T. Figueiredo and A. K. Jain, Unsupervised learning of finite\r\nmixture models, IEEE Trans. Pattern Analysis and Machine\r\nIntelligence, vol. 24, pp. 381 396, 2002.\r\n[9] M. Gaber, A. Zaslavsky, and S. Krishnaswamy, Mining data stream: A\r\nreview, SIGMOD Record, vol. 34, pp. 18 26, 2005.\r\n[10] A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss, One-pass\r\nwavelet decompositions of data streams, IEEE Trans. Knowledge and\r\nData Engineering, vol. 15, pp. 541 554, 2003.\r\n[11] D. Mackay, Introduction to Monte Carlo,\r\nin Learning in Graphical\r\nModels, M. Jordan, Ed. MIT Press, 1996, pp. 175-204.\r\n[12] J. M. Marin, K. Mengersen, and C. Robert, Bayesian modelling and\r\ninference on mixtures of distributions, in Handbook of Statistics, vol.\r\n25, Elsevier-Science, 2005.\r\n[13] S. Muthukrishnan, Data streams: Algorithms and applications, in Proc.\r\nACM-SIAM Symposium on Discrete Algorithm, 2003.\r\n[14] R. Neal, Probabilistic inference using Markov chain Monte Carlo\r\nmethods, Dept. Computer Science, University of Toronto, Technical\r\nReport CRG-TR93-1, 1993.\r\n[15] B. Resch, A tutorial for the course computational intelligence,\r\nAvailable: http:\/\/www.igi.tugraz.at\/lehre\/CI\r\n[16] J. von Neumann, Various techniques used in connection with random\r\ndigits, Applied Mathematics Series, vol. 12, National Bureau of\r\nStandards, Washington, D.C., 1951.\r\n[17] I. Witten and E. Frank, Data Mining: Practical Machine Learning Tools\r\nand Techniques with Java Implementations. Morgan Kaufmann, 2000.","publisher":"World Academy of Science, Engineering and Technology","index":"Open Science Index 20, 2008"}