Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 206

Search results for: extracting numerals

206 OCR for Script Identification of Hindi (Devnagari) Numerals using Feature Sub Selection by Means of End-Point with Neuro-Memetic Model

Authors: Banashree N. P., R. Vasanta

Abstract:

Recognition of Indian languages scripts is challenging problems. In Optical Character Recognition [OCR], a character or symbol to be recognized can be machine printed or handwritten characters/numerals. There are several approaches that deal with problem of recognition of numerals/character depending on the type of feature extracted and different way of extracting them. This paper proposes a recognition scheme for handwritten Hindi (devnagiri) numerals; most admired one in Indian subcontinent. Our work focused on a technique in feature extraction i.e. global based approach using end-points information, which is extracted from images of isolated numerals. These feature vectors are fed to neuro-memetic model [18] that has been trained to recognize a Hindi numeral. The archetype of system has been tested on varieties of image of numerals. . In proposed scheme data sets are fed to neuro-memetic algorithm, which identifies the rule with highest fitness value of nearly 100 % & template associates with this rule is nothing but identified numerals. Experimentation result shows that recognition rate is 92-97 % compared to other models.

Keywords: OCR, Global Feature, End-Points, Neuro-Memetic model.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1552
205 OCR for Script Identification of Hindi (Devnagari) Numerals using Error Diffusion Halftoning Algorithm with Neural Classifier

Authors: Banashree N. P., Andhe Dharani, R. Vasanta, P. S. Satyanarayana

Abstract:

The applications on numbers are across-the-board that there is much scope for study. The chic of writing numbers is diverse and comes in a variety of form, size and fonts. Identification of Indian languages scripts is challenging problems. In Optical Character Recognition [OCR], machine printed or handwritten characters/numerals are recognized. There are plentiful approaches that deal with problem of detection of numerals/character depending on the sort of feature extracted and different way of extracting them. This paper proposes a recognition scheme for handwritten Hindi (devnagiri) numerals; most admired one in Indian subcontinent our work focused on a technique in feature extraction i.e. Local-based approach, a method using 16-segment display concept, which is extracted from halftoned images & Binary images of isolated numerals. These feature vectors are fed to neural classifier model that has been trained to recognize a Hindi numeral. The archetype of system has been tested on varieties of image of numerals. Experimentation result shows that recognition rate of halftoned images is 98 % compared to binary images (95%).

Keywords: OCR, Halftoning, Neural classifier, 16-segmentdisplay concept.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1508
204 An Analysis of the Five Most Used Numerals and a Proposal for the Adoption of a Universally Acceptable Numeral (UAN)

Authors: Mufutau Ayinla Abdul-Yakeen

Abstract:

An analysis of the five most used numerals and a proposal for the adoption of a Universally Acceptable Numerals (UAN), came up as a result of the researchers inquisitiveses of the need for a set of numerals that is universally accepted. The researcher sought for the meaning of the first letter, “Nun”, “ن”, of the first verse of Suratul-Kalam (Chapter of the Pen), the Sixty-Eighth Chapter of the Holy Qur'an. It was observed that there was no universally accepted, economical, explainable, linkable and consistent set of numerals used by all scientists up till the moment of making this enquiry. As a theoretical paper, explanatory method is used to review five of the most used numerals (Tally Marks, Roman Figure, Hindu-Arabic, Arabic, and Chinese) and the urgent need for a universally accepted, economical, explainable, linkable and consistent set of numerals arises. The study discovers: ., I, \, _, L, U, =, C, O, 9, and 1.; to be used as numeral 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10 respectively; as a set of universally acceptable, economical, explainable, linkable, sustainable, convertible and consistent set of numerals that originates from Islam. They can be called Islameconumerals or UAN. With UAN, everything dropped, written, drawn and/or scribbled has meaning(s) as postulated by the first verse of Qur'an 68 and everyone can easily document all figures within the shortest period. It is suggested that there should be a discipline called Numeralnomics (Study of optimum utilization of Numerals) and everybody should start using the UAN, now, in order in know their strengths and weaknesses so as to suggest a better and acceptable set of numerals for the interested readers. Similarly study can be conducted for the alphabets.

Keywords: Islameconumerals, economical, Universally Acceptable Numerals (UAN), numeralnomics.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 562
203 Segmentation of Arabic Handwritten Numeral Strings Based on Watershed Approach

Authors: Nidal F. Shilbayeh, Remah W. Al-Khatib, Sameer A. Nooh

Abstract:

Arabic offline handwriting recognition systems are considered as one of the most challenging topics. Arabic Handwritten Numeral Strings are used to automate systems that deal with numbers such as postal code, banking account numbers and numbers on car plates. Segmentation of connected numerals is the main bottleneck in the handwritten numeral recognition system.  This is in turn can increase the speed and efficiency of the recognition system. In this paper, we proposed algorithms for automatic segmentation and feature extraction of Arabic handwritten numeral strings based on Watershed approach. The algorithms have been designed and implemented to achieve the main goal of segmenting and extracting the string of numeral digits written by hand especially in a courtesy amount of bank checks. The segmentation algorithm partitions the string into multiple regions that can be associated with the properties of one or more criteria. The numeral extraction algorithm extracts the numeral string digits into separated individual digit. Both algorithms for segmentation and feature extraction have been tested successfully and efficiently for all types of numerals.

Keywords: Handwritten numerals, segmentation, courtesy amount, feature extraction, numeral recognition.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 380
202 Development of a Remote Testing System for Performance of Gas Leakage Detectors

Authors: Gyoutae Park, Woosuk Kim, Sangguk Ahn, Seungmo Kim, Minjun Kim, Jinhan Lee, Youngdo Jo, Jongsam Moon, Hiesik Kim

Abstract:

In this research, we designed a remote system to test parameters of gas detectors such as gas concentration and initial response time. This testing system is available to measure two gas instruments simultaneously. First of all, we assembled an experimental jig with a square structure. Those parts are included with a glass flask, two high-quality cameras, and two Ethernet modems for transmitting data. This remote gas detector testing system extracts numerals from videos with continually various gas concentrations while LCDs show photographs from cameras. Extracted numeral data are received to a laptop computer through Ethernet modem. And then, the numerical data with gas concentrations and the measured initial response speeds are recorded and graphed. Our remote testing system will be diversely applied on gas detector’s test and will be certificated in domestic and international countries.

Keywords: Gas leakage detector, inspection instrument, extracting numerals, concentration.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 675
201 Identification of Printed Punjabi Words and English Numerals Using Gabor Features

Authors: Rajneesh Rani, Renu Dhir, G. S. Lehal

Abstract:

Script identification is one of the challenging steps in the development of optical character recognition system for bilingual or multilingual documents. In this paper an attempt is made for identification of English numerals at word level from Punjabi documents by using Gabor features. The support vector machine (SVM) classifier with five fold cross validation is used to classify the word images. The results obtained are quite encouraging. Average accuracy with RBF kernel, Polynomial and Linear Kernel functions comes out to be greater than 99%.

Keywords: Script identification, gabor features, support vector machines.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1924
200 Alphanumeric Hand-Prints Classification: Similarity Analysis between Local Decisions

Authors: G. Dimauro, S. Impedovo, M.G. Lucchese, R. Modugno, G. Pirlo

Abstract:

This paper presents the analysis of similarity between local decisions, in the process of alphanumeric hand-prints classification. From the analysis of local characteristics of handprinted numerals and characters, extracted by a zoning method, the set of classification decisions is obtained and the similarity among them is investigated. For this purpose the Similarity Index is used, which is an estimator of similarity between classifiers, based on the analysis of agreements between their decisions. The experimental tests, carried out using numerals and characters from the CEDAR and ETL database, respectively, show to what extent different parts of the patterns provide similar classification decisions.

Keywords: Handwriting Recognition, Optical Character Recognition, Similarity Index, Zoning.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1092
199 Hand Written Digit Recognition by Multiple Classifier Fusion based on Decision Templates Approach

Authors: Reza Ebrahimpour, Samaneh Hamedi

Abstract:

Classifier fusion may generate more accurate classification than each of the basic classifiers. Fusion is often based on fixed combination rules like the product, average etc. This paper presents decision templates as classifier fusion method for the recognition of the handwritten English and Farsi numerals (1-9). The process involves extracting a feature vector on well-known image databases. The extracted feature vector is fed to multiple classifier fusion. A set of experiments were conducted to compare decision templates (DTs) with some combination rules. Results from decision templates conclude 97.99% and 97.28% for Farsi and English handwritten digits.

Keywords: Decision templates, multi-layer perceptron, characteristics Loci, principle component analysis (PCA).

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1702
198 Persian Printed Numerals Classification Using Extended Moment Invariants

Authors: Hamid Reza Boveiri

Abstract:

Classification of Persian printed numeral characters has been considered and a proposed system has been introduced. In representation stage, for the first time in Persian optical character recognition, extended moment invariants has been utilized as characters image descriptor. In classification stage, four different classifiers namely minimum mean distance, nearest neighbor rule, multi layer perceptron, and fuzzy min-max neural network has been used, which first and second are traditional nonparametric statistical classifier. Third is a well-known neural network and forth is a kind of fuzzy neural network that is based on utilizing hyperbox fuzzy sets. Set of different experiments has been done and variety of results has been presented. The results showed that extended moment invariants are qualified as features to classify Persian printed numeral characters.

Keywords: Extended moment invariants, optical characterrecognition, Persian numerals classification.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1670
197 Multilevel Classifiers in Recognition of Handwritten Kannada Numerals

Authors: Dinesh Acharya U., N. V. Subba Reddy, Krishnamoorthi Makkithaya

Abstract:

The recognition of handwritten numeral is an important area of research for its applications in post office, banks and other organizations. This paper presents automatic recognition of handwritten Kannada numerals based on structural features. Five different types of features, namely, profile based 10-segment string, water reservoir; vertical and horizontal strokes, end points and average boundary length from the minimal bounding box are used in the recognition of numeral. The effect of each feature and their combination in the numeral classification is analyzed using nearest neighbor classifiers. It is common to combine multiple categories of features into a single feature vector for the classification. Instead, separate classifiers can be used to classify based on each visual feature individually and the final classification can be obtained based on the combination of separate base classification results. One popular approach is to combine the classifier results into a feature vector and leaving the decision to next level classifier. This method is extended to extract a better information, possibility distribution, from the base classifiers in resolving the conflicts among the classification results. Here, we use fuzzy k Nearest Neighbor (fuzzy k-NN) as base classifier for individual feature sets, the results of which together forms the feature vector for the final k Nearest Neighbor (k-NN) classifier. Testing is done, using different features, individually and in combination, on a database containing 1600 samples of different numerals and the results are compared with the results of different existing methods.

Keywords: Fuzzy k Nearest Neighbor, Multiple Classifiers, Numeral Recognition, Structural features.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1518
196 Determination of Measurement Uncertainty in Extracting of Forming Limit Diagrams

Authors: M. Mahboubkhah, H. Fayazfar

Abstract:

In this research, Forming Limit Diagrams for supertension sheet metals which are using in automobile industry have been obtained. The exerted strains to sheet metals have been measured with four different methods and the errors of each method have also been represented. These methods have been compared with together and the most efficient and economic way of extracting of the exerted strains to sheet metals has been introduced. In this paper total error and uncertainty of FLD extraction procedures have been derived. Determination of the measurement uncertainty in extracting of FLD has a great importance in design and analysis of the sheet metal forming process.

Keywords: Forming Limit Diagram, Major and Minor Strain, Measurement Uncertainty.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1751
195 Assamese Numeral Speech Recognition using Multiple Features and Cooperative LVQ -Architectures

Authors: Manash Pratim Sarma, Kandarpa Kumar Sarma

Abstract:

A set of Artificial Neural Network (ANN) based methods for the design of an effective system of speech recognition of numerals of Assamese language captured under varied recording conditions and moods is presented here. The work is related to the formulation of several ANN models configured to use Linear Predictive Code (LPC), Principal Component Analysis (PCA) and other features to tackle mood and gender variations uttering numbers as part of an Automatic Speech Recognition (ASR) system in Assamese. The ANN models are designed using a combination of Self Organizing Map (SOM) and Multi Layer Perceptron (MLP) constituting a Learning Vector Quantization (LVQ) block trained in a cooperative environment to handle male and female speech samples of numerals of Assamese- a language spoken by a sizable population in the North-Eastern part of India. The work provides a comparative evaluation of several such combinations while subjected to handle speech samples with gender based differences captured by a microphone in four different conditions viz. noiseless, noise mixed, stressed and stress-free.

Keywords: Assamese, Recognition, LPC, Spectral, ANN.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1788
194 Extracting Multiword Expressions in Machine Translation from English to Urdu using Relational Data Approach

Authors: Kashif Bilal, Uzair Muhammad, Atif Khan, M. Nasir Khan

Abstract:

Machine Translation, (hereafter in this document referred to as the "MT") faces a lot of complex problems from its origination. Extracting multiword expressions is also one of the complex problems in MT. Finding multiword expressions during translating a sentence from English into Urdu, through existing solutions, takes a lot of time and occupies system resources. We have designed a simple relational data approach, in which we simply set a bit in dictionary (database) for multiword, to find and handle multiword expression. This approach handles multiword efficiently.

Keywords: Machine Translation, Multiword Expressions, Urdulanguage processing, POS (stands for Parts of Speech) Tagging forUrdu, Expert Systems.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2138
193 Extracting Human Body based on Background Estimation in Modified HLS Color Space

Authors: Jang-Hee Yoo, Doosung Hwang, Jong-Wook Han, Ki-Young Moon

Abstract:

The ability to recognize humans and their activities by computer vision is a very important task, with many potential application. Study of human motion analysis is related to several research areas of computer vision such as the motion capture, detection, tracking and segmentation of people. In this paper, we describe a segmentation method for extracting human body contour in modified HLS color space. To estimate a background, the modified HLS color space is proposed, and the background features are estimated by using the HLS color components. Here, the large amount of human dataset, which was collected from DV cameras, is pre-processed. The human body and its contour is successfully extracted from the image sequences.

Keywords: Background Subtraction, Human Silhouette Extraction, HLS Color Space, and Object Segmentation

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2224
192 A Fast Directionally Constrained Minimization of Power Algorithm for Extracting a Speech Signal Perpendicular to a Microphone Array

Authors: Yasuhiko Okuma, Yuichi Suzuki, Takahiro Murakami, Yoshihisa Ishida

Abstract:

In this paper, an extended method of the directionally constrained minimization of power (DCMP) algorithm for broadband signals is proposed. The DCMP algorithm is one of the useful techniques of extracting a target signal from observed signals of a microphone array system. In the DCMP algorithm, output power of the microphone array is minimized under a constraint of constant responses to directions of arrival (DOAs) of specific signals. In our algorithm, by limiting the directional constraint to the perpendicular direction to the sensor array system, the calculating time is reduced.

Keywords: Beamformer, directionally constrained minimizationof power, direction of arrival, microphone array.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1420
191 Biogas Control: Methane Production Monitoring Using Arduino

Authors: W. Ait Ahmed, M. Aggour, M. Naciri

Abstract:

Extracting energy from biomass is an important alternative to produce different types of energy (heat, electricity, or both) assuring low pollution and better efficiency. It is a new yet reliable approach to reduce green gas emission by extracting methane from industry effluents and use it to power machinery. We focused in our project on using paper and mill effluents, treated in a UASB reactor. The methane produced is used in the factory’s power supply. The aim of this work is to develop an electronic system using Arduino platform connected to a gas sensor, to measure and display the curve of daily methane production on processing. The sensor will send the gas values in ppm to the Arduino board so that the later sends the RS232 hardware protocol. The code developed with processing will transform the values into a curve and display it on the computer screen.

Keywords: Biogas, Arduino, processing, code, methane, gas sensor, program.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2406
190 Experimental Study of the Extraction of Copper(II) from Sulphuric Acid by Means of Sodium Diethyldithiocarbamate (SDDT)

Authors: S.Touati, A.H. Meniai

Abstract:

The present work presents the extraction of copper(II) from sulphuric acid solutions with Sodium diethyldithiocarbamate (SDDT), and six different organic diluents: Dichloromethane, Chloroform, Carbon tetrachloride, Toluene, xylene and Cyclohexane, were tested. The pair SDDT/Chloroform showed to be the most selective in removing the copper cations, and hence was considered throughout the experimental study. The effects of operating parameters such as the initial concentration of the extracting agent, the agitation time, the agitation speed and the acid concentration were considered. For an initial concentration of Cu (II) of 63 ppm in a 0.5 M sulphuric acid solution, both with a mass of the extracting agent of 20 mg, an extraction percentage of about 97.8 % and a distribution coefficient of 44.42 were obtained, respectively, confirming the performance of the SDDT-Chloroform pair.

Keywords: Copper (II), Distribution coefficient, Extraction, SDDT, Sulphuric acid.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1621
189 A Relationship Extraction Method from Literary Fiction Considering Korean Linguistic Features

Authors: Hee-Jeong Ahn, Kee-Won Kim, Seung-Hoon Kim

Abstract:

The knowledge of the relationship between characters can help readers to understand the overall story or plot of the literary fiction. In this paper, we present a method for extracting the specific relationship between characters from a Korean literary fiction. Generally, methods for extracting relationships between characters in text are statistical or computational methods based on the sentence distance between characters without considering Korean linguistic features. Furthermore, it is difficult to extract the relationship with direction from text, such as one-sided love, because they consider only the weight of relationship, without considering the direction of the relationship. Therefore, in order to identify specific relationships between characters, we propose a statistical method considering linguistic features, such as syntactic patterns and speech verbs in Korean. The result of our method is represented by a weighted directed graph of the relationship between the characters. Furthermore, we expect that proposed method could be applied to the relationship analysis between characters of other content like movie or TV drama.

Keywords: Data mining, Korean linguistic feature, literary fiction, relationship extraction.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1381
188 An Approach to Image Extraction and Accurate Skin Detection from Web Pages

Authors: Moheb R. Girgis, Tarek M. Mahmoud, Tarek Abd-El-Hafeez

Abstract:

This paper proposes a system to extract images from web pages and then detect the skin color regions of these images. As part of the proposed system, using BandObject control, we built a Tool bar named 'Filter Tool Bar (FTB)' by modifying the Pavel Zolnikov implementation. The Yahoo! Team provides us with the Yahoo! SDK API, which also supports image search and is really useful. In the proposed system, we introduced three new methods for extracting images from the web pages (after loading the web page by using the proposed FTB, before loading the web page physically from the localhost, and before loading the web page from any server). These methods overcome the drawback of the regular expressions method for extracting images suggested by Ilan Assayag. The second part of the proposed system is concerned with the detection of the skin color regions of the extracted images. So, we studied two famous skin color detection techniques. The first technique is based on the RGB color space and the second technique is based on YUV and YIQ color spaces. We modified the second technique to overcome the failure of detecting complex image's background by using the saturation parameter to obtain an accurate skin detection results. The performance evaluation of the efficiency of the proposed system in extracting images before and after loading the web page from localhost or any server in terms of the number of extracted images is presented. Finally, the results of comparing the two skin detection techniques in terms of the number of pixels detected are presented.

Keywords: Browser Helper Object, Color spaces, Image and URL extraction, Skin detection, Web Browser events.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1639
187 Deep iCrawl: An Intelligent Vision-Based Deep Web Crawler

Authors: R.Anita, V.Ganga Bharani, N.Nityanandam, Pradeep Kumar Sahoo

Abstract:

The explosive growth of World Wide Web has posed a challenging problem in extracting relevant data. Traditional web crawlers focus only on the surface web while the deep web keeps expanding behind the scene. Deep web pages are created dynamically as a result of queries posed to specific web databases. The structure of the deep web pages makes it impossible for traditional web crawlers to access deep web contents. This paper, Deep iCrawl, gives a novel and vision-based approach for extracting data from the deep web. Deep iCrawl splits the process into two phases. The first phase includes Query analysis and Query translation and the second covers vision-based extraction of data from the dynamically created deep web pages. There are several established approaches for the extraction of deep web pages but the proposed method aims at overcoming the inherent limitations of the former. This paper also aims at comparing the data items and presenting them in the required order.

Keywords: Crawler, Deep web, Web Database

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1915
186 Deep Web Content Mining

Authors: Shohreh Ajoudanian, Mohammad Davarpanah Jazi

Abstract:

The rapid expansion of the web is causing the constant growth of information, leading to several problems such as increased difficulty of extracting potentially useful knowledge. Web content mining confronts this problem gathering explicit information from different web sites for its access and knowledge discovery. Query interfaces of web databases share common building blocks. After extracting information with parsing approach, we use a new data mining algorithm to match a large number of schemas in databases at a time. Using this algorithm increases the speed of information matching. In addition, instead of simple 1:1 matching, they do complex (m:n) matching between query interfaces. In this paper we present a novel correlation mining algorithm that matches correlated attributes with smaller cost. This algorithm uses Jaccard measure to distinguish positive and negative correlated attributes. After that, system matches the user query with different query interfaces in special domain and finally chooses the nearest query interface with user query to answer to it.

Keywords: Content mining, complex matching, correlation mining, information extraction.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2077
185 Application of Vortex Tubes for Extracting Sediments Using SHARC Software - A Case Study of the Western Canal in the Dez Diversion Weir

Authors: A. H. Sajedi Pour, N. Hedayat, Z. Yazdi

Abstract:

Sediment loads transfer in hydraulic installations and their consequences for the O&M of modern canal systems is emerging as one of the most important considerations in hydraulic engineering projects apriticularly those which are inteded to feed the irrigation and draiange schemes of large command areas such as the Dez and Mogahn in Iran.. The aim of this paper is to investigate the applicability of the vortex tube as a viable means of extracting sediment loads entering the canal systems in general and the water inatke structures in particulars. The Western conveyance canal of the Dez Diversion weir which feeds the Karkheh Flood Plain in Sothwestern Dezful has been used as the case study using the data from the Dastmashan Hydrometric Station. The SHARC software has been used as an analytical framework to interprete the data. Results show that given the grain size D50 and the canal turbulence the adaption length from the beginning of the canal and after the diversion dam is estimated at 477 m, a point which is suitable for laying the vortex tube.

Keywords: Vortex tube, sediments, western canal, SHARCmodel

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1838
184 Powerful Tool to Expand Business Intelligence: Text Mining

Authors: Li Gao, Elizabeth Chang, Song Han

Abstract:

With the extensive inclusion of document, especially text, in the business systems, data mining does not cover the full scope of Business Intelligence. Data mining cannot deliver its impact on extracting useful details from the large collection of unstructured and semi-structured written materials based on natural languages. The most pressing issue is to draw the potential business intelligence from text. In order to gain competitive advantages for the business, it is necessary to develop the new powerful tool, text mining, to expand the scope of business intelligence. In this paper, we will work out the strong points of text mining in extracting business intelligence from huge amount of textual information sources within business systems. We will apply text mining to each stage of Business Intelligence systems to prove that text mining is the powerful tool to expand the scope of BI. After reviewing basic definitions and some related technologies, we will discuss the relationship and the benefits of these to text mining. Some examples and applications of text mining will also be given. The motivation behind is to develop new approach to effective and efficient textual information analysis. Thus we can expand the scope of Business Intelligence using the powerful tool, text mining.

Keywords: Business intelligence, document warehouse, text mining.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2418
183 Web Search Engine Based Naming Procedure for Independent Topic

Authors: Takahiro Nishigaki, Takashi Onoda

Abstract:

In recent years, the number of document data has been increasing since the spread of the Internet. Many methods have been studied for extracting topics from large document data. We proposed Independent Topic Analysis (ITA) to extract topics independent of each other from large document data such as newspaper data. ITA is a method for extracting the independent topics from the document data by using the Independent Component Analysis. The topic represented by ITA is represented by a set of words. However, the set of words is quite different from the topics the user imagines. For example, the top five words with high independence of a topic are as follows. Topic1 = {"scor", "game", "lead", "quarter", "rebound"}. This Topic 1 is considered to represent the topic of "SPORTS". This topic name "SPORTS" has to be attached by the user. ITA cannot name topics. Therefore, in this research, we propose a method to obtain topics easy for people to understand by using the web search engine, topics given by the set of words given by independent topic analysis. In particular, we search a set of topical words, and the title of the homepage of the search result is taken as the topic name. And we also use the proposed method for some data and verify its effectiveness.

Keywords: Independent topic analysis, topic extraction, topic naming, web search engine.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 234
182 Extracting Single Trial Visual Evoked Potentials using Selective Eigen-Rate Principal Components

Authors: Samraj Andrews, Ramaswamy Palaniappan, Nidal Kamel

Abstract:

In single trial analysis, when using Principal Component Analysis (PCA) to extract Visual Evoked Potential (VEP) signals, the selection of principal components (PCs) is an important issue. We propose a new method here that selects only the appropriate PCs. We denote the method as selective eigen-rate (SER). In the method, the VEP is reconstructed based on the rate of the eigen-values of the PCs. When this technique is applied on emulated VEP signals added with background electroencephalogram (EEG), with a focus on extracting the evoked P3 parameter, it is found to be feasible. The improvement in signal to noise ratio (SNR) is superior to two other existing methods of PC selection: Kaiser (KSR) and Residual Power (RP). Though another PC selection method, Spectral Power Ratio (SPR) gives a comparable SNR with high noise factors (i.e. EEGs), SER give more impressive results in such cases. Next, we applied SER method to real VEP signals to analyse the P3 responses for matched and non-matched stimuli. The P3 parameters extracted through our proposed SER method showed higher P3 response for matched stimulus, which confirms to the existing neuroscience knowledge. Single trial PCA using KSR and RP methods failed to indicate any difference for the stimuli.

Keywords: Electroencephalogram, P3, Single trial VEP.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1444
181 Assamese Numeral Corpus for Speech Recognition using Cooperative ANN Architecture

Authors: Mousmita Sarma, Krishna Dutta, Kandarpa Kumar Sarma

Abstract:

Speech corpus is one of the major components in a Speech Processing System where one of the primary requirements is to recognize an input sample. The quality and details captured in speech corpus directly affects the precision of recognition. The current work proposes a platform for speech corpus generation using an adaptive LMS filter and LPC cepstrum, as a part of an ANN based Speech Recognition System which is exclusively designed to recognize isolated numerals of Assamese language- a major language in the North Eastern part of India. The work focuses on designing an optimal feature extraction block and a few ANN based cooperative architectures so that the performance of the Speech Recognition System can be improved.

Keywords: Filter, Feature, LMS, LPC, Cepstrum, ANN.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1876
180 Handwritten Character Recognition Using Multiscale Neural Network Training Technique

Authors: Velappa Ganapathy, Kok Leong Liew

Abstract:

Advancement in Artificial Intelligence has lead to the developments of various “smart" devices. Character recognition device is one of such smart devices that acquire partial human intelligence with the ability to capture and recognize various characters in different languages. Firstly multiscale neural training with modifications in the input training vectors is adopted in this paper to acquire its advantage in training higher resolution character images. Secondly selective thresholding using minimum distance technique is proposed to be used to increase the level of accuracy of character recognition. A simulator program (a GUI) is designed in such a way that the characters can be located on any spot on the blank paper in which the characters are written. The results show that such methods with moderate level of training epochs can produce accuracies of at least 85% and more for handwritten upper case English characters and numerals.

Keywords: Character recognition, multiscale, backpropagation, neural network, minimum distance technique.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1676
179 A Text Mining Technique Using Association Rules Extraction

Authors: Hany Mahgoub, Dietmar Rösner, Nabil Ismail, Fawzy Torkey

Abstract:

This paper describes text mining technique for automatically extracting association rules from collections of textual documents. The technique called, Extracting Association Rules from Text (EART). It depends on keyword features for discover association rules amongst keywords labeling the documents. In this work, the EART system ignores the order in which the words occur, but instead focusing on the words and their statistical distributions in documents. The main contributions of the technique are that it integrates XML technology with Information Retrieval scheme (TFIDF) (for keyword/feature selection that automatically selects the most discriminative keywords for use in association rules generation) and use Data Mining technique for association rules discovery. It consists of three phases: Text Preprocessing phase (transformation, filtration, stemming and indexing of the documents), Association Rule Mining (ARM) phase (applying our designed algorithm for Generating Association Rules based on Weighting scheme GARW) and Visualization phase (visualization of results). Experiments applied on WebPages news documents related to the outbreak of the bird flu disease. The extracted association rules contain important features and describe the informative news included in the documents collection. The performance of the EART system compared with another system that uses the Apriori algorithm throughout the execution time and evaluating extracted association rules.

Keywords: Text mining, data mining, association rule mining

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 3893
178 Optimization of Some Process Parameters to Produce Raisin Concentrate in Khorasan Region of Iran

Authors: Peiman Ariaii, Hamid Tavakolipour, Mohsen Pirdashti, Rabehe Izadi Amoli

Abstract:

Raisin Concentrate (RC) are the most important products obtained in the raisin processing industries. These RC products are now used to make the syrups, drinks and confectionery productions and introduced as natural substitute for sugar in food applications. Iran is a one of the biggest raisin exporter in the world but unfortunately despite a good raw material, no serious effort to extract the RC has been taken in Iran. Therefore, in this paper, we determined and analyzed affected parameters on extracting RC process and then optimizing these parameters for design the extracting RC process in two types of raisin (round and long) produced in Khorasan region. Two levels of solvent (1:1 and 2:1), three levels of extraction temperature (60°C, 70°C and 80°C), and three levels of concentration temperature (50°C, 60°C and 70°C) were the treatments. Finally physicochemical characteristics of the obtained concentrate such as color, viscosity, percentage of reduction sugar, acidity and the microbial tests (mould and yeast) were counted. The analysis was performed on the basis of factorial in the form of completely randomized design (CRD) and Duncan's multiple range test (DMRT) was used for the comparison of the means. Statistical analysis of results showed that optimal conditions for production of concentrate is round raisins when the solvent ratio was 2:1 with extraction temperature of 60°C and then concentration temperature of 50°C. Round raisin is cheaper than the long one, and it is more economical to concentrate production. Furthermore, round raisin has more aromas and the less color degree with increasing the temperature of concentration and extraction. Finally, according to mentioned factors the concentrate of round raisin is recommended.

Keywords: Raisin concentrate, optimization, process parameters, round raisin, Iran.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1420
177 ORank: An Ontology Based System for Ranking Documents

Authors: Mehrnoush Shamsfard, Azadeh Nematzadeh, Sarah Motiee

Abstract:

Increasing growth of information volume in the internet causes an increasing need to develop new (semi)automatic methods for retrieval of documents and ranking them according to their relevance to the user query. In this paper, after a brief review on ranking models, a new ontology based approach for ranking HTML documents is proposed and evaluated in various circumstances. Our approach is a combination of conceptual, statistical and linguistic methods. This combination reserves the precision of ranking without loosing the speed. Our approach exploits natural language processing techniques for extracting phrases and stemming words. Then an ontology based conceptual method will be used to annotate documents and expand the query. To expand a query the spread activation algorithm is improved so that the expansion can be done in various aspects. The annotated documents and the expanded query will be processed to compute the relevance degree exploiting statistical methods. The outstanding features of our approach are (1) combining conceptual, statistical and linguistic features of documents, (2) expanding the query with its related concepts before comparing to documents, (3) extracting and using both words and phrases to compute relevance degree, (4) improving the spread activation algorithm to do the expansion based on weighted combination of different conceptual relationships and (5) allowing variable document vector dimensions. A ranking system called ORank is developed to implement and test the proposed model. The test results will be included at the end of the paper.

Keywords: Document ranking, Ontology, Spread activation algorithm, Annotation.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1670