Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 2

Search results for: Rebone L. Meraba

2 Genomic Sequence Representation Learning: An Analysis of K-Mer Vector Embedding Dimensionality

Authors: James Jr. Mashiyane, Risuna Nkolele, Stephanie J. Müller, Gciniwe S. Dlamini, Rebone L. Meraba, Darlington S. Mapiye

Abstract:

When performing language tasks in natural language processing (NLP), the dimensionality of word embeddings is either chosen ad hoc or calculated by optimizing the Pairwise Inner Product (PIP) loss. The PIP loss is a metric that measures the dissimilarity between word embeddings, and it is obtained through matrix perturbation theory by utilizing the unitary invariance of word embeddings. In genomics, especially in genome sequence processing, unlike in natural language processing, there is no notion of a “word”; instead, there are sequence substrings of length k called k-mers. K-mer sizes matter, and they vary depending on the goal of the task at hand. The dimensionality of word embeddings in NLP has been studied using matrix perturbation theory and the PIP loss. In this paper, the sufficiency and reliability of applying word-embedding algorithms to various genomic sequence datasets are investigated to understand the relationship between the k-mer size and the embedding dimension. This is done by studying the scaling capability of three embedding algorithms, namely Latent Semantic Analysis (LSA), Word2Vec, and Global Vectors (GloVe), with respect to the k-mer size. Utilising the PIP loss as a metric to train embeddings on different datasets, we also show that Word2Vec outperforms LSA and GloVe in accurately computing embeddings as both the k-mer size and vocabulary increase. Finally, the shortcomings of natural language processing embedding algorithms in performing genomic tasks are discussed.

Keywords: word embeddings, k-mer embedding, dimensionality reduction
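To make the two central notions in the abstract above concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the usual definition of the PIP loss as the Frobenius norm of the difference between the two embeddings' pairwise inner product matrices; the function names `kmers` and `pip_loss`, the example sequence, and the random embedding are illustrative assumptions.

```python
import numpy as np


def kmers(sequence, k):
    """Return all overlapping substrings of length k in a sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def pip_loss(e1, e2):
    """Frobenius norm of E1 E1^T - E2 E2^T (assumed PIP loss definition).

    Both inputs have one row per vocabulary item, in the same order;
    the number of columns (the embedding dimension) may differ.
    """
    return np.linalg.norm(e1 @ e1.T - e2 @ e2.T, ord="fro")


# A toy sequence split into 3-mers (hypothetical example data).
tokens = kmers("ACGTAC", 3)
print(tokens)

# A random 2-dimensional embedding, one row per 3-mer (hypothetical).
rng = np.random.default_rng(0)
emb = rng.normal(size=(len(tokens), 2))

# Rotating the embedding leaves all pairwise inner products unchanged,
# so the PIP loss between the two is zero up to rounding error.
theta = 0.7
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta), np.cos(theta)]])
print(pip_loss(emb, emb @ rotation))
```

The rotation check illustrates the unitary invariance mentioned in the abstract: two embeddings that differ only by a rotation are treated as equivalent under this metric.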

1 Woody Plant Encroachment Effects on the Physical Properties of Vertic Soils in Bela-Bela, Limpopo Province

Authors: Rebone E. Mashapa, Phesheya E. Dlamini, Sandile S. Mthimkhulu

Abstract:

Woody plant encroachment, a land cover transformation that reduces grassland productivity, may influence soil physical properties. The objective of the study was to determine the effect of woody plant encroachment on the physical properties of vertic soils in a savanna grassland. In this study, we quantified and compared soil bulk density, aggregate stability and porosity in the topsoil and subsoil of an open and a woody encroached savanna grassland. The results revealed that soil bulk density increases, while porosity and mean weight diameter decrease, with depth in both open and woody encroached grassland soil. Compared to open grassland, soil bulk density was 11% and 10% greater in the topsoil and subsoil, respectively, while porosity was 6% and 9% lower in the topsoil and subsoil of woody encroached grassland. Mean weight diameter, an indicator of soil aggregation, increased by 38% only in the subsoil of encroached grasslands due to increasing clay content with depth. These results suggest that woody plant encroachment leads to compaction of vertic soils, which in turn reduces pore size distribution.

Keywords: soil depth, soil physical properties, vertic soils, woody plant encroachment
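The link between higher bulk density and lower porosity reported in the abstract above can be sketched in Python under standard textbook relations, which the abstract itself does not state. Total porosity is assumed here to be computed as 1 minus the ratio of bulk density to particle density, with a conventional particle density of 2.65 g/cm³, and mean weight diameter as the mass-weighted mean of the aggregate size classes. All function names and numeric inputs are hypothetical and are not the study's data or method.

```python
def porosity(bulk_density, particle_density=2.65):
    """Total porosity as a fraction, from densities in g/cm^3.

    The default particle density of 2.65 g/cm^3 is a common convention
    for mineral soils and is an assumption, not a value from the study.
    """
    return 1.0 - bulk_density / particle_density


def mean_weight_diameter(mean_diameters_mm, mass_fractions):
    """Sum of each size class's mean diameter times its mass fraction."""
    return sum(d * w for d, w in zip(mean_diameters_mm, mass_fractions))


# Hypothetical bulk densities: an open site and one that is 11% denser.
open_bd = 1.20
encroached_bd = open_bd * 1.11

# Denser soil has less pore space under this relation.
print(porosity(open_bd), porosity(encroached_bd))

# Hypothetical aggregate size classes (mm) and their mass fractions.
print(mean_weight_diameter([3.0, 1.0, 0.25], [0.5, 0.3, 0.2]))
```

Under this relation, any increase in bulk density at a fixed particle density lowers the computed porosity, which is the direction of change the abstract reports.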
