Search results for: Categorical data sequences
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 7538

Search results for: Categorical data sequences

7538 A Simplified Higher-Order Markov Chain Model

Authors: Chao Wang, Ting-Zhu Huang, Chen Jia

Abstract:

In this paper, we present a simplified higher-order Markov chain model for multiple categorical data sequences also called as simplified higher-order multivariate Markov chain model.

Keywords: Higher-order multivariate Markov chain model, Categorical data sequences, Multivariate Markov chain.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 3215
7537 Incremental Algorithm to Cluster the Categorical Data with Frequency Based Similarity Measure

Authors: S.Aranganayagi, K.Thangavel

Abstract:

Clustering categorical data is more complicated than the numerical clustering because of its special properties. Scalability and memory constraint is the challenging problem in clustering large data set. This paper presents an incremental algorithm to cluster the categorical data. Frequencies of attribute values contribute much in clustering similar categorical objects. In this paper we propose new similarity measures based on the frequencies of attribute values and its cardinalities. The proposed measures and the algorithm are experimented with the data sets from UCI data repository. Results prove that the proposed method generates better clusters than the existing one.

Keywords: Clustering, Categorical, Incremental, Frequency, Domain

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1774
7536 Cluster Analysis for the Statistical Modeling of Aesthetic Judgment Data Related to Comics Artists

Authors: George E. Tsekouras, Evi Sampanikou

Abstract:

We compare three categorical data clustering algorithms with respect to the problem of classifying cultural data related to the aesthetic judgment of comics artists. Such a classification is very important in Comics Art theory since the determination of any classes of similarities in such kind of data will provide to art-historians very fruitful information of Comics Art-s evolution. To establish this, we use a categorical data set and we study it by employing three categorical data clustering algorithms. The performances of these algorithms are compared each other, while interpretations of the clustering results are also given.

Keywords: Aesthetic judgment, comics artists, cluster analysis, categorical data.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1590
7535 Categorical Clustering By Converting Associated Information

Authors: Dongmin Cai, Stephen S-T Yau

Abstract:

Lacking an inherent “natural" dissimilarity measure between objects in categorical dataset presents special difficulties in clustering analysis. However, each categorical attributes from a given dataset provides natural probability and information in the sense of Shannon. In this paper, we proposed a novel method which heuristically converts categorical attributes to numerical values by exploiting such associated information. We conduct an experimental study with real-life categorical dataset. The experiment demonstrates the effectiveness of our approach.

Keywords: Categorical, Clustering, Converting, Information

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1311
7534 Clustering Categorical Data Using the K-Means Algorithm and the Attribute’s Relative Frequency

Authors: Semeh Ben Salem, Sami Naouali, Moetez Sallami

Abstract:

Clustering is a well known data mining technique used in pattern recognition and information retrieval. The initial dataset to be clustered can either contain categorical or numeric data. Each type of data has its own specific clustering algorithm. In this context, two algorithms are proposed: the k-means for clustering numeric datasets and the k-modes for categorical datasets. The main encountered problem in data mining applications is clustering categorical dataset so relevant in the datasets. One main issue to achieve the clustering process on categorical values is to transform the categorical attributes into numeric measures and directly apply the k-means algorithm instead the k-modes. In this paper, it is proposed to experiment an approach based on the previous issue by transforming the categorical values into numeric ones using the relative frequency of each modality in the attributes. The proposed approach is compared with a previously method based on transforming the categorical datasets into binary values. The scalability and accuracy of the two methods are experimented. The obtained results show that our proposed method outperforms the binary method in all cases.

Keywords: Clustering, k-means, categorical datasets, pattern recognition, unsupervised learning, knowledge discovery.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 3474
7533 Categorical Missing Data Imputation Using Fuzzy Neural Networks with Numerical and Categorical Inputs

Authors: Pilar Rey-del-Castillo, Jesús Cardeñosa

Abstract:

There are many situations where input feature vectors are incomplete and methods to tackle the problem have been studied for a long time. A commonly used procedure is to replace each missing value with an imputation. This paper presents a method to perform categorical missing data imputation from numerical and categorical variables. The imputations are based on Simpson-s fuzzy min-max neural networks where the input variables for learning and classification are just numerical. The proposed method extends the input to categorical variables by introducing new fuzzy sets, a new operation and a new architecture. The procedure is tested and compared with others using opinion poll data.

Keywords: Classifier, imputation techniques, fuzzy systems, fuzzy min-max neural networks.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1713
7532 Landscape Data Transformation: Categorical Descriptions to Numerical Descriptors

Authors: Dennis A. Apuan

Abstract:

Categorical data based on description of the agricultural landscape imposed some mathematical and analytical limitations. This problem however can be overcome by data transformation through coding scheme and the use of non-parametric multivariate approach. The present study describes data transformation from qualitative to numerical descriptors. In a collection of 103 random soil samples over a 60 hectare field, categorical data were obtained from the following variables: levels of nitrogen, phosphorus, potassium, pH, hue, chroma, value and data on topography, vegetation type, and the presence of rocks. Categorical data were coded, and Spearman-s rho correlation was then calculated using PAST software ver. 1.78 in which Principal Component Analysis was based. Results revealed successful data transformation, generating 1030 quantitative descriptors. Visualization based on the new set of descriptors showed clear differences among sites, and amount of variation was successfully measured. Possible applications of data transformation are discussed.

Keywords: data transformation, numerical descriptors, principalcomponent analysis

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1456
7531 Improved K-Modes for Categorical Clustering Using Weighted Dissimilarity Measure

Authors: S.Aranganayagi, K.Thangavel

Abstract:

K-Modes is an extension of K-Means clustering algorithm, developed to cluster the categorical data, where the mean is replaced by the mode. The similarity measure proposed by Huang is the simple matching or mismatching measure. Weight of attribute values contribute much in clustering; thus in this paper we propose a new weighted dissimilarity measure for K-Modes, based on the ratio of frequency of attribute values in the cluster and in the data set. The new weighted measure is experimented with the data sets obtained from the UCI data repository. The results are compared with K-Modes and K-representative, which show that the new measure generates clusters with high purity.

Keywords: Clustering, categorical data, K-Modes, weighted dissimilarity measure

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 3627
7530 Clustering Protein Sequences with Tailored General Regression Model Technique

Authors: G. Lavanya Devi, Allam Appa Rao, A. Damodaram, GR Sridhar, G. Jaya Suma

Abstract:

Cluster analysis divides data into groups that are meaningful, useful, or both. Analysis of biological data is creating a new generation of epidemiologic, prognostic, diagnostic and treatment modalities. Clustering of protein sequences is one of the current research topics in the field of computer science. Linear relation is valuable in rule discovery for a given data, such as if value X goes up 1, value Y will go down 3", etc. The classical linear regression models the linear relation of two sequences perfectly. However, if we need to cluster a large repository of protein sequences into groups where sequences have strong linear relationship with each other, it is prohibitively expensive to compare sequences one by one. In this paper, we propose a new technique named General Regression Model Technique Clustering Algorithm (GRMTCA) to benignly handle the problem of linear sequences clustering. GRMT gives a measure, GR*, to tell the degree of linearity of multiple sequences without having to compare each pair of them.

Keywords: Clustering, General Regression Model, Protein Sequences, Similarity Measure.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1509
7529 Fuzzy Clustering of Categorical Attributes and its Use in Analyzing Cultural Data

Authors: George E. Tsekouras, Dimitris Papageorgiou, Sotiris Kotsiantis, Christos Kalloniatis, Panagiotis Pintelas

Abstract:

We develop a three-step fuzzy logic-based algorithm for clustering categorical attributes, and we apply it to analyze cultural data. In the first step the algorithm employs an entropy-based clustering scheme, which initializes the cluster centers. In the second step we apply the fuzzy c-modes algorithm to obtain a fuzzy partition of the data set, and the third step introduces a novel cluster validity index, which decides the final number of clusters.

Keywords: Categorical data, cultural data, fuzzy logic clustering, fuzzy c-modes, cluster validity index.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1655
7528 On The Elliptic Divisibility Sequences over Finite Fields

Authors: Osman Bizim

Abstract:

In this work we study elliptic divisibility sequences over finite fields. MorganWard in [11, 12] gave arithmetic theory of elliptic divisibility sequences. We study elliptic divisibility sequences, equivalence of these sequences and singular elliptic divisibility sequences over finite fields Fp, p > 3 is a prime.

Keywords: Elliptic divisibility sequences, equivalent sequences, singular sequences.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1425
7527 Categorical Data Modeling: Logistic Regression Software

Authors: Abdellatif Tchantchane

Abstract:

A Matlab based software for logistic regression is developed to enhance the process of teaching quantitative topics and assist researchers with analyzing wide area of applications where categorical data is involved. The software offers an option of performing stepwise logistic regression to select the most significant predictors. The software includes a feature to detect influential observations in data, and investigates the effect of dropping or misclassifying an observation on a predictor variable. The input data may consist either as a set of individual responses (yes/no) with the predictor variables or as grouped records summarizing various categories for each unique set of predictor variables' values. Graphical displays are used to output various statistical results and to assess the goodness of fit of the logistic regression model. The software recognizes possible convergence constraints when present in data, and the user is notified accordingly.

Keywords: Logistic regression, Matlab, Categorical data, Influential observation.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1834
7526 On the Properties of Pseudo Noise Sequences with a Simple Proposal of Randomness Test

Authors: Abhijit Mitra

Abstract:

Maximal length sequences (m-sequences) are also known as pseudo random sequences or pseudo noise sequences for closely following Golomb-s popular randomness properties: (P1) balance, (P2) run, and (P3) ideal autocorrelation. Apart from these, there also exist certain other less known properties of such sequences all of which are discussed in this tutorial paper. Comprehensive proofs to each of these properties are provided towards better understanding of such sequences. A simple test is also proposed at the end of the paper in order to distinguish pseudo noise sequences from truly random sequences such as Bernoulli sequences.

Keywords: Maximal length sequence, pseudo noise sequence, punctured de Bruijn sequence, auto-correlation, Bernoulli sequence, randomness tests.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 6635
7525 Elliptic Divisibility Sequences over Finite Fields

Authors: Betül Gezer, Ahmet Tekcan, Osman Bizim

Abstract:

In this work, we study elliptic divisibility sequences over finite fields. Morgan Ward in [14], [15] gave arithmetic theory of elliptic divisibility sequences and formulas for elliptic divisibility sequences with rank two over finite field Fp. We study elliptic divisibility sequences with rank three, four and five over a finite field Fp, where p > 3 is a prime and give general terms of these sequences and then we determine elliptic and singular curves associated with these sequences.

Keywords: Elliptic divisibility sequences, singular elliptic divisibilitysequences, elliptic curves, singular curves.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1649
7524 A Prediction of Attractive Evaluation Objects Based On Complex Sequential Data

Authors: Shigeaki Sakurai, Makino Kyoko, Shigeru Matsumoto

Abstract:

This paper proposes a method that predicts attractive evaluation objects. In the learning phase, the method inductively acquires trend rules from complex sequential data. The data is composed of two types of data. One is numerical sequential data. Each evaluation object has respective numerical sequential data. The other is text sequential data. Each evaluation object is described in texts. The trend rules represent changes of numerical values related to evaluation objects. In the prediction phase, the method applies new text sequential data to the trend rules and evaluates which evaluation objects are attractive. This paper verifies the effect of the proposed method by using stock price sequences and news headline sequences. In these sequences, each stock brand corresponds to an evaluation object. This paper discusses validity of predicted attractive evaluation objects, the process time of each phase, and the possibility of application tasks.

Keywords: Trend rule, frequent pattern, numerical sequential data, text sequential data, evaluation object.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1182
7523 MIM: A Species Independent Approach for Classifying Coding and Non-Coding DNA Sequences in Bacterial and Archaeal Genomes

Authors: Achraf El Allali, John R. Rose

Abstract:

A number of competing methodologies have been developed to identify genes and classify DNA sequences into coding and non-coding sequences. This classification process is fundamental in gene finding and gene annotation tools and is one of the most challenging tasks in bioinformatics and computational biology. An information theory measure based on mutual information has shown good accuracy in classifying DNA sequences into coding and noncoding. In this paper we describe a species independent iterative approach that distinguishes coding from non-coding sequences using the mutual information measure (MIM). A set of sixty prokaryotes is used to extract universal training data. To facilitate comparisons with the published results of other researchers, a test set of 51 bacterial and archaeal genomes was used to evaluate MIM. These results demonstrate that MIM produces superior results while remaining species independent.

Keywords: Coding Non-coding Classification, Entropy, GeneRecognition, Mutual Information.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1669
7522 On the Effectivity of Different Pseudo-Noise and Orthogonal Sequences for Speech Encryption from Correlation Properties

Authors: V. Anil Kumar, Abhijit Mitra, S. R. Mahadeva Prasanna

Abstract:

We analyze the effectivity of different pseudo noise (PN) and orthogonal sequences for encrypting speech signals in terms of perceptual intelligence. Speech signal can be viewed as sequence of correlated samples and each sample as sequence of bits. The residual intelligibility of the speech signal can be reduced by removing the correlation among the speech samples. PN sequences have random like properties that help in reducing the correlation among speech samples. The mean square aperiodic auto-correlation (MSAAC) and the mean square aperiodic cross-correlation (MSACC) measures are used to test the randomness of the PN sequences. Results of the investigation show the effectivity of large Kasami sequences for this purpose among many PN sequences.

Keywords: Speech encryption, pseudo-noise codes, maximallength, Gold, Barker, Kasami, Walsh-Hadamard, autocorrelation, crosscorrelation, figure of merit.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1984
7521 Advances on the Understanding of Sequence Convergence Seen from the Perspective of Mathematical Working Spaces

Authors: Paula Verdugo-Hernández, Patricio Cumsille

Abstract:

We analyze a first-class on the convergence of real number sequences, named hereafter sequences, to foster exploration and discovery of concepts through graphical representations before engaging students in proving. The main goal was to differentiate between sequences and continuous functions-of-a-real-variable and better understand concepts at an initial stage. We applied the analytic frame of Mathematical Working Spaces, which we expect to contribute to extending to sequences since, as far as we know, it has only developed for other objects, and which is relevant to analyze how mathematical work is built systematically by connecting the epistemological and cognitive perspectives, and involving the semiotic, instrumental, and discursive dimensions.

Keywords: Convergence, graphical representations, Mathematical Working Spaces, paradigms of real analysis, real number sequences.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 446
7520 SIMGraph: Simplifying Contig Graph to Improve de Novo Genome Assembly Using Next-generation Sequencing Data

Authors: Chien-Ju Li, Chun-Hui Yu, Chi-Chuan Hwang, Tsunglin Liu , Darby Tien-Hao Chang

Abstract:

De novo genome assembly is always fragmented. Assembly fragmentation is more serious using the popular next generation sequencing (NGS) data because NGS sequences are shorter than the traditional Sanger sequences. As the data throughput of NGS is high, the fragmentations in assemblies are usually not the result of missing data. On the contrary, the assembled sequences, called contigs, are often connected to more than one other contigs in a complicated manner, leading to the fragmentations. False connections in such complicated connections between contigs, named a contig graph, are inevitable because of repeats and sequencing/assembly errors. Simplifying a contig graph by removing false connections directly improves genome assembly. In this work, we have developed a tool, SIMGraph, to resolve ambiguous connections between contigs using NGS data. Applying SIMGraph to the assembly of a fungus and a fish genome, we resolved 27.6% and 60.3% ambiguous contig connections, respectively. These results can reduce the experimental efforts in resolving contig connections.

Keywords: Contig graph, NGS, de novo assembly, scaffold.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1678
7519 On the Construction of m-Sequences via Primitive Polynomials with a Fast Identification Method

Authors: Abhijit Mitra

Abstract:

The paper provides an in-depth tutorial of mathematical construction of maximal length sequences (m-sequences) via primitive polynomials and how to map the same when implemented in shift registers. It is equally important to check whether a polynomial is primitive or not so as to get proper m-sequences. A fast method to identify primitive polynomials over binary fields is proposed where the complexity is considerably less in comparison with the standard procedures for the same purpose.

Keywords: Finite field, irreducible polynomial, primitive polynomial, maximal length sequence, additive shift register, multiplicative shift register.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 3878
7518 The Convergence Theorems for Mixing Random Variable Sequences

Authors: Yan-zhao Yang

Abstract:

In this paper, some limit properties for mixing random variables sequences were studied and some results on weak law of large number for mixing random variables sequences were presented. Some complete convergence theorems were also obtained. The results extended and improved the corresponding theorems in i.i.d random variables sequences.

Keywords: Complete convergence, mixing random variables, weak law of large numbers.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1567
7517 Power Efficient OFDM Signals with Reduced Symbol's Aperiodic Autocorrelation

Authors: Ibrahim M. Hussain

Abstract:

Three new algorithms based on minimization of autocorrelation of transmitted symbols and the SLM approach which are computationally less demanding have been proposed. In the first algorithm, autocorrelation of complex data sequence is minimized to a value of 1 that results in reduction of PAPR. Second algorithm generates multiple random sequences from the sequence generated in the first algorithm with same value of autocorrelation i.e. 1. Out of these, the sequence with minimum PAPR is transmitted. Third algorithm is an extension of the second algorithm and requires minimum side information to be transmitted. Multiple sequences are generated by modifying a fixed number of complex numbers in an OFDM data sequence using only one factor. The multiple sequences represent the same data sequence and the one giving minimum PAPR is transmitted. Simulation results for a 256 subcarrier OFDM system show that significant reduction in PAPR is achieved using the proposed algorithms.

Keywords: Aperiodic autocorrelation, OFDM, PAPR, SLM, wireless communication.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1672
7516 A Class of Formal Operators for Combinatorial Identities and its Application

Authors: Ruigang Zhang, Wuyungaowa, Xingchen Ma

Abstract:

In this paper, we present some formulas of symbolic operator summation, which involving Generalization well-know number sequences or polynomial sequences, and mean while we obtain some identities about the sequences by employing M-R‘s substitution rule.

Keywords: Generating functions, operators sequence group, Riordan arrays, R. G operator group, combinatorial identities.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1758
7515 Digital Image Encryption Scheme using Chaotic Sequences with a Nonlinear Function

Authors: H. Ogras, M. Turk

Abstract:

In this study, a system of encryption based on chaotic sequences is described. The system is used for encrypting digital image data for the purpose of secure image transmission. An image secure communication scheme based on Logistic map chaotic sequences with a nonlinear function is proposed in this paper. Encryption and decryption keys are obtained by one-dimensional Logistic map that generates secret key for the input of the nonlinear function. Receiver can recover the information using the received signal and identical key sequences through the inverse system technique. The results of computer simulations indicate that the transmitted source image can be correctly and reliably recovered by using proposed scheme even under the noisy channel. The performance of the system will be discussed through evaluating the quality of recovered image with and without channel noise.

Keywords: Digital image, Image encryption, Secure communication

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2175
7514 Differences in Goal Scoring and Passing Sequences between Winning and Losing Team in UEFA-EURO Championship 2012

Authors: Muhamad S., Norasrudin S, Rahmat A.

Abstract:

The objective of current study is to investigate the differences of winning and losing teams in terms of goal scoring and passing sequences. Total of 31 matches from UEFA-EURO 2012 were analyzed and 5 matches were excluded from analysis due to matches end up drawn. There are two groups of variable used in the study which is; i. the goal scoring variable and: ii. passing sequences variable. Data were analyzed using Wilcoxon matched pair rank test with significant value set at p < 0.05. Current study found the timing of goal scored was significantly higher for winning team at 1st half (Z=-3.416, p=.001) and 2nd half (Z=-3.252, p=.001). The scoring frequency was also found to be increase as time progressed and the last 15 minutes of the game was the time interval the most goals scored. The indicators that were significantly differences between winning and losing team were the goal scored (Z=-4.578, p=.000), the head (Z=-2.500, p=.012), the right foot (Z=-3.788,p=.000), corner (Z=-.2.126,p=.033), open play (Z=-3.744,p=.000), inside the penalty box (Z=-4.174, p=.000) , attackers (Z=-2.976, p=.003) and also the midfielders (Z=-3.400, p=.001). Regarding the passing sequences, there are significance difference between both teams in short passing sequences (Z=-.4.141, p=.000). While for the long passing, there were no significance difference (Z=-.1.795, p=.073). The data gathered in present study can be used by the coaches to construct detailed training program based on their objectives.

Keywords: Football, goals scored, passing, timing.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2781
7513 Qualitative Data Analysis for Health Care Services

Authors: Taner Ersoz, Filiz Ersoz

Abstract:

This study was designed enable application of multivariate technique in the interpretation of categorical data for measuring health care services satisfaction in Turkey. The data was collected from a total of 17726 respondents. The establishment of the sample group and collection of the data were carried out by a joint team from The Ministry of Health and Turkish Statistical Institute (Turk Stat) of Turkey. The multiple correspondence analysis (MCA) was used on the data of 2882 respondents who answered the questionnaire in full. The multiple correspondence analysis indicated that, in the evaluation of health services females, public employees, younger and more highly educated individuals were more concerned and complainant than males, private sector employees, older and less educated individuals. Overall 53 % of the respondents were pleased with the improvements in health care services in the past three years. This study demonstrates the public consciousness in health services and health care satisfaction in Turkey. It was found that most the respondents were pleased with the improvements in health care services over the past three years. Awareness of health service quality increases with education levels. Older individuals and males would appear to have lower expectancies in health services.

Keywords: Multiple correspondence analysis, optimal scaling, multivariate categorical data, health care services, health satisfaction survey, statistical visualizing, Turkey.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 816
7512 Computing Entropy for Ortholog Detection

Authors: Hsing-Kuo Pao, John Case

Abstract:

Biological sequences from different species are called or-thologs if they evolved from a sequence of a common ancestor species and they have the same biological function. Approximations of Kolmogorov complexity or entropy of biological sequences are already well known to be useful in extracting similarity information between such sequences -in the interest, for example, of ortholog detection. As is well known, the exact Kolmogorov complexity is not algorithmically computable. In prac-tice one can approximate it by computable compression methods. How-ever, such compression methods do not provide a good approximation to Kolmogorov complexity for short sequences. Herein is suggested a new ap-proach to overcome the problem that compression approximations may notwork well on short sequences. This approach is inspired by new, conditional computations of Kolmogorov entropy. A main contribution of the empir-ical work described shows the new set of entropy-based machine learning attributes provides good separation between positive (ortholog) and nega-tive (non-ortholog) data - better than with good, previously known alter-natives (which do not employ some means to handle short sequences well).Also empirically compared are the new entropy based attribute set and a number of other, more standard similarity attributes sets commonly used in genomic analysis. The various similarity attributes are evaluated by cross validation, through boosted decision tree induction C5.0, and by Receiver Operating Characteristic (ROC) analysis. The results point to the conclu-sion: the new, entropy based attribute set by itself is not the one giving the best prediction; however, it is the best attribute set for use in improving the other, standard attribute sets when conjoined with them.

Keywords: compression, decision tree, entropy, ortholog, ROC.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1783
7511 Strong Limit Theorems for Dependent Random Variables

Authors: Libin Wu, Bainian Li

Abstract:

In This Article We establish moment inequality of dependent random variables,furthermore some theorems of strong law of large numbers and complete convergence for sequences of dependent random variables. In particular, independent and identically distributed Marcinkiewicz Law of large numbers are generalized to the case of m0-dependent sequences.

Keywords: Lacunary System, Generalized Gaussian, NA sequences, strong law of large numbers.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1429
7510 Eukaryotic Gene Prediction by an Investigation of Nonlinear Dynamical Modeling Techniques on EIIP Coded Sequences

Authors: Mai S. Mabrouk, Nahed H. Solouma, Abou-Bakr M. Youssef, Yasser M. Kadah

Abstract:

Many digital signal processing, techniques have been used to automatically distinguish protein coding regions (exons) from non-coding regions (introns) in DNA sequences. In this work, we have characterized these sequences according to their nonlinear dynamical features such as moment invariants, correlation dimension, and largest Lyapunov exponent estimates. We have applied our model to a number of real sequences encoded into a time series using EIIP sequence indicators. In order to discriminate between coding and non coding DNA regions, the phase space trajectory was first reconstructed for coding and non-coding regions. Nonlinear dynamical features are extracted from those regions and used to investigate a difference between them. Our results indicate that the nonlinear dynamical characteristics have yielded significant differences between coding (CR) and non-coding regions (NCR) in DNA sequences. Finally, the classifier is tested on real genes where coding and non-coding regions are well known.

Keywords: Gene prediction, nonlinear dynamics, correlation dimension, Lyapunov exponent.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1752
7509 Parallezation Protein Sequence Similarity Algorithms using Remote Method Interface

Authors: Mubarak Saif Mohsen, Zurinahni Zainol, Rosalina Abdul Salam, Wahidah Husain

Abstract:

One of the major problems in genomic field is to perform sequence comparison on DNA and protein sequences. Executing sequence comparison on the DNA and protein data is a computationally intensive task. Sequence comparison is the basic step for all algorithms in protein sequences similarity. Parallel computing is an attractive solution to provide the computational power needed to speedup the lengthy process of the sequence comparison. Our main research is to enhance the protein sequence algorithm using dynamic programming method. In our approach, we parallelize the dynamic programming algorithm using multithreaded program to perform the sequence comparison and also developed a distributed protein database among many PCs using Remote Method Interface (RMI). As a result, we showed how different sizes of protein sequences data and computation of scoring matrix of these protein sequence on different number of processors affected the processing time and speed, as oppose to sequential processing.

Keywords: Protein sequence algorithm, dynamic programming algorithm, multithread

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1852