Commenced in January 2007
Frequency: Monthly
Edition: International
Data Quality Enhancement with String Length Distribution

Authors: Qi Xiu, Hiromu Hota, Yohsuke Ishii, Takuya Oda


The volume of manufacturing data that can be collected is increasing rapidly. At the same time, large-scale ("mega") recalls have become a serious social problem. Under these circumstances, there is a growing need to prevent mega recalls through defect analysis of manufacturing data, such as root cause analysis and anomaly detection. However, traditional methods take too long to classify the strings in manufacturing data to meet the requirement for quick defect analysis. We therefore present the String Length Distribution Classification (SLDC) method, which classifies strings correctly in a short time. The method learns character features, in particular the string length distribution, from the Product IDs and Machine IDs found in the bill of materials (BOM) and the asset list. By applying the proposed method to strings in actual manufacturing data, we verified that string classification time can be reduced by 80%. From this result, we estimate that the requirement for quick defect analysis can be fulfilled.

Keywords: Data quality, feature selection, probability distribution, string classification, string length.
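
The abstract does not give the details of SLDC, so the following is only a minimal sketch of the general idea it describes: learn a string length distribution per class from reference lists, then assign an unlabeled string to the class under which its length is most probable. The function names, the additive smoothing, the length cap, and the sample IDs are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter

MAX_LEN = 64  # assumed cap; longer strings share the last bucket


def learn_length_distribution(samples, alpha=1.0):
    """Estimate P(length) for lengths 0..MAX_LEN with additive smoothing.

    `alpha` is an assumed smoothing constant so that lengths never seen
    in the reference list still get a small non-zero probability.
    """
    counts = Counter(min(len(s), MAX_LEN) for s in samples)
    total = len(samples) + alpha * (MAX_LEN + 1)
    return [(counts.get(n, 0) + alpha) / total for n in range(MAX_LEN + 1)]


def classify(value, distributions):
    """Return the label whose length distribution gives `value` the highest probability."""
    n = min(len(value), MAX_LEN)
    return max(distributions, key=lambda label: distributions[label][n])


# Hypothetical reference lists standing in for a BOM and an asset list.
product_ids = ["PRD-000123", "PRD-000456", "PRD-004210", "PRD-10077"]
machine_ids = ["M01", "M02", "M17", "MX3"]

distributions = {
    "product_id": learn_length_distribution(product_ids),
    "machine_id": learn_length_distribution(machine_ids),
}

label_a = classify("PRD-009999", distributions)  # length 10
label_b = classify("M42", distributions)         # length 3
print(label_a, label_b)
```

Length alone cannot separate classes whose IDs have similar lengths; the abstract mentions other character features as well, which a fuller implementation would presumably combine with the length distribution.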


