**Commenced**in January 2007

**Frequency:**Monthly

**Edition:**International

**Paper Count:**31105

##### Content Based Sampling over Transactional Data Streams

**Authors:**
Mansour Tarafdar,
Mohammad Saniee Abade

**Abstract:**

This paper investigates the problem of sampling from transactional data streams. We introduce CFISDS as a content based sampling algorithm that works on a landmark window model of data streams and preserve more informed sample in sample space. This algorithm that work based on closed frequent itemset mining tasks, first initiate a concept lattice using initial data, then update lattice structure using an incremental mechanism.Incremental mechanism insert, update and delete nodes in/from concept lattice in batch manner. Presented algorithm extracts the final samples on demand of user. Experimental results show the accuracy of CFISDS on synthetic and real datasets, despite on CFISDS algorithm is not faster than exist sampling algorithms such as Z and DSS.

**Keywords:**
Sampling,
data streams,
closed frequent item set mining

**Digital Object Identifier (DOI):**
doi.org/10.5281/zenodo.1076652

**References:**

[1] J. Vitter; "Random sampling with a reservoir", ACM Transactions on Mathematical Software, Vol 11(1), pp.37- 57, 1985 .

[2] P.B.Gibbons, Y. Matias. "New sampling-based summary statistics for improving approximate query answers". In Proceedings of ACM SIGMOD international conference on management of data, 1998, pp.331-342.

[3] C. Estan, G. Varghese. "New directions in traffic measurement and accounting" . In: Proceedings of 1st ACM SIGCOMM workshop on Internet measurement, 2001, pp. 75-80.

[4] G.Manku, R.Motwani. "Approximate frequency counts over data streams". In Proc. 2002 Int. Conf. Very Large Data Bases, 2002, pp. 346-357.

[5] E.D.Demaine, A.L-opez-Ortiz, J.I.Munro. "Frequency estimation of Internet packet streams with limited space". In: Proceedings of 10th European symposium on algorithms, 2002, pp. 348-360.

[6] M. Dash, W. Ng. "Efficient Reservoir Sampling for Transactional Data Streams" . Sixth IEEE International Conference on Data Mining - Workshops, 2006, pp. 662-666 .

[7] W.Ng, M.Dash. "Which Is Better for Frequent Pattern Mining: Approximate Counting or Sampling?". In Proceedings of the 11th international Conference on Data Warehousing and Knowledge Discovery , 2009, p.p. 151-162.

[8] J. Pei, J. Han, and R. Mao. Closet: An efficient algorithm for mining frequent closed itemsets. In ACM SIGMOD the Internatial Workshop on Data Mining and Knowledge Discovery, 2000.

[9] J.Wang, J.Han and J.Pei: CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets; In Proceeding of ACM SIGKDD the InternationalConference on Knowledge Discovery and Data Mining, 2003

[10] M.J.Zaki and C.Hsiao: Charm: An efficient algorithm for closed itemset mining; In Proceedings of SDM the SIAM International Conference on Data Mining, 2002

[11] C. Lucchesse, S. Orlando, and R. Perego. DCI-Closed: A fast and memory efficient algorithm to mine frequent closed itemsets. In B. Goethals, M. J. Zaki, and R. Bayardo, editors, Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI 2004), volume 126 of CEUR Workshop Proceedings, Brighton, UK, 1 November 2004.

[12] G. Grahne and J. Zhu. Efficiently using prefix-trees in mining frequent itemsets. In B. Goethals and M. J. Zaki, editors, Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI 2003), volume 90 of CEUR Workshop Proceedings, Melbourne, Florida, USA, 19 November 2003.

[13] T.Uno , T.Asai , Y.Uchida , H.Arimura . LCM: An efficient algorithm for enumerating frequent closed item sets , In Proceedings of Workshop on Frequent itemset Mining Implementations , 2003 .

[14] B.N. Ranganath, M.N. Murty. "Stream-Close: Fast Mining of Closed Frequent Itemsets in High Speed Data Streams". ICDM Workshops 2008, 2008, pp. 516-525.

[15] H.Li, H.Chen. "Moment+: Mining Closed Frequent Itemsets over Data Stream", ADMA 2008, 2008, pp. 612-619.

[16] Y. Chi, H. Wang, P.S. Yu, R.R. Muntz. "Moment: Maintaining Closed Frequent Itemsets over a Stream Sliding Window", In Proc. Fourth IEEE Int-l Conf. Data Mining, 2004, pp. 59-66.

[17] V. Choi. (2006, Jun) "Faster Algorithms for Constructing a Concept Galois Lattice." CoRR .vol abs/cs/0602069. Available: http://arxiv.org/abs/cs/0602069

[Jun. 1,2006].

[18] M.J. Zaki. C.-J. Hsiao. "Efficient Algorithms for Mining Closed Itemsets and Their Lattice Structure". IEEE Trans. on Knowl. and Data Eng. Vol.17, No.4. pp. 462-478. 2005.

[19] P. Valtchev, R. Missaoui, R. Godin. "A framework for incremental generation of closed itemsets". Discrete Applied Mathematics vol.156 (6). pp. 924-949. 2008.

[20] J.H. Chang, W.S. Lee. "Finding recently frequent itemsets adaptively over online transactional data streams". Inf. Syst. Vol.31, No.8. pp. 849-869. 2006.

[21] F. Alqadah , R. Bhatnagar. "Similarity Measures in Formal Concept Analysis". ISAIM 2010, Fort Lauderdale, Florida , 2010.

[22] IlliMine. Package for Data Mining in C++ ,

[Online] Available: http://illimine.cs.uiuc.edu.