Proxisch: An Optimization Approach of Large-Scale Unstable Proxy Servers Scheduling
Authors: Xiaoming Jiang, Jinqiao Shi, Qingfeng Tan, Wentao Zhang, Xuebin Wang, Muqian Chen
Abstract:
Large companies such as Google and Microsoft have ample proxy servers and can run their web crawlers against a single website in parallel. Researchers, who usually lack expensive proxy servers, still find it difficult to crawl large amounts of information from a single website in parallel. In this case, free public proxy servers collected from the Internet are a good choice. To improve the efficiency of a web crawler, two issues must be considered first: (1) tasks may fail owing to the instability of free proxy servers; (2) a proxy server will be blocked if it visits a single website too frequently. In this paper, we propose Proxisch, an optimization approach for scheduling large-scale unstable proxy servers, which allows anyone to run a web crawler efficiently at extremely low cost. Proxisch is designed to work efficiently by making maximum use of reliable proxy servers. To solve the second problem, it establishes a frequency control mechanism that keeps the visiting frequency of any chosen proxy server below the website's limit. The results show that our approach performs better than the other scheduling algorithms.
Keywords: Proxy server, priority queue, optimization approach, distributed web crawling.
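The two ideas named in the abstract, preferring reliable proxies and capping how often each proxy visits the site, can be illustrated with a small priority-queue scheduler. This is only a minimal sketch and not the Proxisch algorithm itself: the class name `ProxyScheduler`, the smoothed success-rate score, and the fixed `min_interval` cooldown are all assumptions introduced for illustration.

```python
import heapq


class ProxyScheduler:
    """Hypothetical scheduler: reliable proxies first, each with a cooldown."""

    def __init__(self, addresses, min_interval):
        # min_interval (seconds) is an assumed stand-in for the website's limit.
        self.min_interval = min_interval
        # address -> [successes, failures]
        self.stats = {a: [0, 0] for a in addresses}
        # Max-heap on score, emulated by negating it; unseen proxies score 0.5.
        self.ready = [(-self.score(a), a) for a in addresses]
        heapq.heapify(self.ready)
        # Min-heap on the earliest time a proxy may be used again.
        self.cooling = []

    def score(self, address):
        """Smoothed success rate (assumed reliability measure)."""
        ok, bad = self.stats[address]
        return (ok + 1) / (ok + bad + 2)

    def acquire(self, now):
        """Return the most reliable proxy that is off cooldown, or None."""
        # Move proxies whose cooldown has expired back to the ready heap.
        while self.cooling and self.cooling[0][0] <= now:
            _, address = heapq.heappop(self.cooling)
            heapq.heappush(self.ready, (-self.score(address), address))
        if not self.ready:
            return None
        _, address = heapq.heappop(self.ready)
        return address

    def release(self, address, now, success):
        """Record the task outcome and start the proxy's cooldown."""
        self.stats[address][0 if success else 1] += 1
        heapq.heappush(self.cooling, (now + self.min_interval, address))
```

In a real crawler, `now` would typically come from a clock such as `time.monotonic()`, and a failed task would be re-queued and retried through a different proxy. The two-heap layout keeps the frequency limit and the reliability ordering separate: a proxy cannot be picked again until its cooldown expires, however high its score.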
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1125009