Multi-Level Air Quality Classification in China Using Information Gain and Support Vector Machine
Machine learning and data mining are two important tools for extracting useful information and knowledge from large datasets. In machine learning, classification is a widely used technique for predicting qualitative variables and is generally preferred over regression from an operational point of view. Because air pollution has risen sharply in many countries, especially China, air quality classification has become one of the most important topics in air quality research and modelling. This study introduces a hybrid classification model based on information theory and Support Vector Machine (SVM), using air quality data from four cities in China (Beijing, Guangzhou, Shanghai and Tianjin) for the period from January 1, 2014 to April 30, 2016. China's Ministry of Environmental Protection classifies daily air quality into six levels, namely Serious Pollution, Severe Pollution, Moderate Pollution, Light Pollution, Good and Excellent, based on the corresponding Air Quality Index (AQI) values. Information gain (IG) is calculated for both categorical features and continuous numeric features and is used for feature selection. An SVM is then trained on the selected features with cross-validation. The final evaluation shows that the hybrid IG and SVM model performs better than SVM alone, Artificial Neural Network (ANN) and K-Nearest Neighbours (KNN) models in terms of both accuracy and complexity.
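The information gain step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy records, the function names, the midpoint-threshold treatment of continuous features (borrowed from C4.5-style decision trees) and the 0.5 cutoff are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """IG of a categorical feature: H(labels) - H(labels | feature)."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def information_gain_numeric(values, labels):
    """Best IG over binary splits 'value <= t', with t at midpoints
    between consecutive distinct values (an assumed, C4.5-style rule)."""
    distinct = sorted(set(values))
    best = 0.0
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        best = max(best, information_gain([v <= t for v in values], labels))
    return best


# Made-up toy records, not data from the study.
labels = ["Good", "Good", "Light Pollution", "Light Pollution"]
season = ["winter", "summer", "winter", "summer"]   # categorical feature
pm25 = [20.0, 30.0, 90.0, 110.0]                    # continuous feature

gains = {
    "season": information_gain(season, labels),
    "pm25": information_gain_numeric(pm25, labels),
}

# Keep features whose IG reaches an arbitrary, illustrative cutoff.
CUTOFF = 0.5
selected = [name for name, gain in gains.items() if gain >= CUTOFF]
print(gains, selected)
```

In this toy example the season column carries no information about the class, while a single threshold on the PM2.5 column separates the two classes completely, so only the latter is kept. In the workflow the abstract describes, the retained columns would then be passed to an SVM trained with cross-validation.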
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.2363286