Commenced in January 2007

Frequency: Monthly

Edition: International

Paper Count: 16

Search results for: speed-up

16 Parallel Version of Reinhard’s Color Transfer Algorithm

Authors: Abhishek Bhardwaj, Manish Kumar Bajpai

Abstract:

An image with its content and schema of colors presents an effective mode of information sharing and processing. By changing its color schema different visions and prospect are discovered by the users. This phenomenon of color transfer is being used by Social media and other channel of entertainment. Reinhard et al’s algorithm was the first one to solve this problem of color transfer. In this paper, we make this algorithm efficient by introducing domain parallelism among different processors. We also comment on the factors that affect the speedup of this problem. In the end by analyzing the experimental data we claim to propose a novel and efficient parallel Reinhard’s algorithm.

Keywords: Reinhard et al’s algorithm, color transferring, parallelism, speedup

Procedia PDF Downloads 590

15 Performance Evaluation of Parallel Surface Modeling and Generation on Actual and Virtual Multicore Systems

Authors: Nyeng P. Gyang

Abstract:

Even though past, current and future trends suggest that multicore and cloud computing systems are increasingly prevalent/ubiquitous, this class of parallel systems is nonetheless underutilized, in general, and barely used for research on employing parallel Delaunay triangulation for parallel surface modeling and generation, in particular. The performances, of actual/physical and virtual/cloud multicore systems/machines, at executing various algorithms, which implement various parallelization strategies of the incremental insertion technique of the Delaunay triangulation algorithm, were evaluated. T-tests were run on the data collected, in order to determine whether various performance metrics differences (including execution time, speedup and efficiency) were statistically significant. Results show that the actual machine is approximately twice faster than the virtual machine at executing the same programs for the various parallelization strategies. Results, which furnish the scalability behaviors of the various parallelization strategies, also show that some of the differences between the performances of these systems, during different runs of the algorithms on the systems, were statistically significant. A few pseudo superlinear speedup results, which were computed from the raw data collected, are not true superlinear speedup values. These pseudo superlinear speedup values, which arise as a result of one way of computing speedups, disappear and give way to asymmetric speedups, which are the accurate kind of speedups that occur in the experiments performed.

Keywords: cloud computing systems, multicore systems, parallel Delaunay triangulation, parallel surface modeling and generation

Procedia PDF Downloads 186

14 SISSLE in Consensus-Based Ripple: Some Improvements in Speed, Security, Last Mile Connectivity and Ease of Use

Authors: Mayank Mundhra, Chester Rebeiro

Abstract:

Cryptocurrencies are rapidly finding wide application in areas such as Real Time Gross Settlements and Payments Systems. Ripple is a cryptocurrency that has gained prominence with banks and payment providers. It solves the Byzantine General’s Problem with its Ripple Protocol Consensus Algorithm (RPCA), where each server maintains a list of servers, called Unique Node List (UNL) that represents the network for the server, and will not collectively defraud it. The server believes that the network has come to a consensus when members of the UNL come to a consensus on a transaction. In this paper we improve Ripple to achieve better speed, security, last mile connectivity and ease of use. We implement guidelines and automated systems for building and maintaining UNLs for resilience, robustness, improved security, and efficient information propagation. We enhance the system so as to ensure that each server receives information from across the whole network rather than just from the UNL members. We also introduce the paradigm of UNL overlap as a function of information propagation and the trust a server assigns to its own UNL. Our design not only reduces vulnerabilities such as eclipse attacks, but also makes it easier to identify malicious behaviour and entities attempting to fraudulently Double Spend or stall the system. We provide experimental evidence of the benefits of our approach over the current Ripple scheme. We observe ≥ 4.97x and 98.22x in speedup and success rate for information propagation respectively, and ≥ 3.16x and 51.70x in speedup and success rate in consensus.

Keywords: Ripple, Kelips, unique node list, consensus, information propagation

Procedia PDF Downloads 124

13 A Sub-Scalar Approach to the MIPS Architecture

Authors: Kumar Sambhav Pandey, Anamika Singh

Abstract:

The continuous researches in the field of computer architecture basically aims at accelerating the computational speed and to gain enhanced performance. In this era, the superscalar, sub-scalar concept has not gained enough attention for improving the computation performance. In this paper, we have presented a sub-scalar approach to utilize the parallelism present with in the data while processing. The main idea is to split the data into individual smaller entities and these entities are processed with a defined known set of instructions. This sub-scalar approach to the MIPS architecture can bring out significant improvement in the computational speedup. MIPS-I is the basic design taken in consideration for the development of sub-scalar MIPS64 for increasing the instruction level parallelism (ILP) and resource utilization.

Keywords: dataword, MIPS, processor, sub-scalar

Procedia PDF Downloads 520

12 CMOS Solid-State Nanopore DNA System-Level Sequencing Techniques Enhancement

Authors: Syed Islam, Yiyun Huang, Sebastian Magierowski, Ebrahim Ghafar-Zadeh

Abstract:

This paper presents system level CMOS solid-state nanopore techniques enhancement for speedup next generation molecular recording and high throughput channels. This discussion also considers optimum number of base-pair (bp) measurements through channel as an important role to enhance potential read accuracy. Effective power consumption estimation offered suitable rangeof multi-channel configuration. Nanopore bp extraction model in statistical method could contribute higher read accuracy with longer read-length (200 < read-length). Nanopore ionic current switching with Time Multiplexing (TM) based multichannel readout system contributed hardware savings.

Keywords: DNA, nanopore, amplifier, ADC, multichannel

Procedia PDF Downloads 436

11 Parallel Random Number Generation for the Modern Supercomputer Architectures

Authors: Roman Snytsar

Abstract:

Pseudo-random numbers are often used in scientific computing such as the Monte Carlo Simulations or the Quantum Inspired Optimization. Requirements for a parallel random number generator running in the modern multi-core vector environment are more stringent than those for sequential random number generators. As well as passing the usual quality tests, the output of the parallel random number generator must be verifiable and reproducible throughout the concurrent execution. We propose a family of vectorized Permuted Congruential Generators. Implementations are available for multiple modern vector modern computer architectures. Besides demonstrating good single core performance, the generators scale easily across many processor cores and multiple distributed nodes. We provide performance and parallel speedup analysis and comparisons between the implementations.

Keywords: pseudo-random numbers, quantum optimization, SIMD, parallel computing

Procedia PDF Downloads 98

10 Numerical Solution Speedup of the Laplace Equation Using FPGA Hardware

Authors: Abbas Ebrahimi, Mohammad Zandsalimy

Abstract:

The main purpose of this study is to investigate the feasibility of using FPGA (Field Programmable Gate Arrays) chips as alternatives for the conventional CPUs to accelerate the numerical solution of the Laplace equation. FPGA is an integrated circuit that contains an array of logic blocks, and its architecture can be reprogrammed and reconfigured after manufacturing. Complex circuits for various applications can be designed and implemented using FPGA hardware. The reconfigurable hardware used in this paper is an SoC (System on a Chip) FPGA type that integrates both microprocessor and FPGA architectures into a single device. In the present study the Laplace equation is implemented and solved numerically on both reconfigurable hardware and CPU. The precision of results and speedups of the calculations are compared together. The computational process on FPGA, is up to 20 times faster than a conventional CPU, with the same data precision. An analytical solution is used to validate the results.

Keywords: accelerating numerical solutions, CFD, FPGA, hardware definition language, numerical solutions, reconfigurable hardware

Procedia PDF Downloads 365

9 Scheduling Algorithm Based on Load-Aware Queue Partitioning in Heterogeneous Multi-Core Systems

Authors: Hong Kai, Zhong Jun Jie, Chen Lin Qi, Wang Chen Guang

Abstract:

There are inefficient global scheduling parallelism and local scheduling parallelism prone to processor starvation in current scheduling algorithms. Regarding this issue, this paper proposed a load-aware queue partitioning scheduling strategy by first allocating the queues according to the number of processor cores, calculating the load factor to specify the load queue capacity, and it assigned the awaiting nodes to the appropriate perceptual queues through the precursor nodes and the communication computation overhead. At the same time, real-time computation of the load factor could effectively prevent the processor from being starved for a long time. Experimental comparison with two classical algorithms shows that there is a certain improvement in both performance metrics of scheduling length and task speedup ratio.

Keywords: load-aware, scheduling algorithm, perceptual queue, heterogeneous multi-core

Procedia PDF Downloads 119

8 Speedup Breadth-First Search by Graph Ordering

Authors: Qiuyi Lyu, Bin Gong

Abstract:

Breadth-First Search(BFS) is a core graph algorithm that is widely used for graph analysis. As it is frequently used in many graph applications, improve the BFS performance is essential. In this paper, we present a graph ordering method that could reorder the graph nodes to achieve better data locality, thus, improving the BFS performance. Our method is based on an observation that the sibling relationships will dominate the cache access pattern during the BFS traversal. Therefore, we propose a frequency-based model to construct the graph order. First, we optimize the graph order according to the nodes’ visit frequency. Nodes with high visit frequency will be processed in priority. Second, we try to maximize the child nodes overlap layer by layer. As it is proved to be NP-hard, we propose a heuristic method that could greatly reduce the preprocessing overheads. We conduct extensive experiments on 16 real-world datasets. The result shows that our method could achieve comparable performance with the state-of-the-art methods while the graph ordering overheads are only about 1/15.

Keywords: breadth-first search, BFS, graph ordering, graph algorithm

Procedia PDF Downloads 114

7 GPU Based High Speed Error Protection for Watermarked Medical Image Transmission

Authors: Md Shohidul Islam, Jongmyon Kim, Ui-pil Chong

Abstract:

Medical image is an integral part of e-health care and e-diagnosis system. Medical image watermarking is widely used to protect patients’ information from malicious alteration and manipulation. The watermarked medical images are transmitted over the internet among patients, primary and referred physicians. The images are highly prone to corruption in the wireless transmission medium due to various noises, deflection, and refractions. Distortion in the received images leads to faulty watermark detection and inappropriate disease diagnosis. To address the issue, this paper utilizes error correction code (ECC) with (8, 4) Hamming code in an existing watermarking system. In addition, we implement the high complex ECC on a graphics processing units (GPU) to accelerate and support real-time requirement. Experimental results show that GPU achieves considerable speedup over the sequential CPU implementation, while maintaining 100% ECC efficiency.

Keywords: medical image watermarking, e-health system, error correction, Hamming code, GPU

Procedia PDF Downloads 272

6 Performance Evaluation of Task Scheduling Algorithm on LCQ Network

Authors: Zaki Ahmad Khan, Jamshed Siddiqui, Abdus Samad

Abstract:

The Scheduling and mapping of tasks on a set of processors is considered as a critical problem in parallel and distributed computing system. This paper deals with the problem of dynamic scheduling on a special type of multiprocessor architecture known as Linear Crossed Cube (LCQ) network. This proposed multiprocessor is a hybrid network which combines the features of both linear type of architectures as well as cube based architectures. Two standard dynamic scheduling schemes namely Minimum Distance Scheduling (MDS) and Two Round Scheduling (TRS) schemes are implemented on the LCQ network. Parallel tasks are mapped and the imbalance of load is evaluated on different set of processors in LCQ network. The simulations results are evaluated and effort is made by means of through analysis of the results to obtain the best solution for the given network in term of load imbalance left and execution time. The other performance matrices like speedup and efficiency are also evaluated with the given dynamic algorithms.

Keywords: dynamic algorithm, load imbalance, mapping, task scheduling

Procedia PDF Downloads 435

5 Trimma: Trimming Metadata Storage and Latency for Hybrid Memory Systems

Authors: Yiwei Li, Boyu Tian, Mingyu Gao

Abstract:

Hybrid main memory systems combine both performance and capacity advantages from heterogeneous memory technologies. With larger capacities, higher associativities, and finer granularities, hybrid memory systems currently exhibit significant metadata storage and lookup overheads for flexibly remapping data blocks between the two memory tiers. To alleviate the inefficiencies of existing designs, we propose Trimma, the combination of a multi-level metadata structure and an efficient metadata cache design. Trimma uses a multilevel metadata table to only track truly necessary address remap entries. The saved memory space is effectively utilized as extra DRAM cache capacity to improve performance. Trimma also uses separate formats to store the entries with non-identity and identity mappings. This improves the overall remap cache hit rate, further boosting the performance. Trimma is transparent to software and compatible with various types of hybrid memory systems. When evaluated on a representative DDR4 + NVM hybrid memory system, Trimma achieves up to 2.4× and on average 58.1% speedup benefits, compared with a state-of-the-art design that only leverages the unallocated fast memory space for caching. Trimma addresses metadata management overheads and targets future scalable large-scale hybrid memory architectures.

Keywords: memory system, data cache, hybrid memory, non-volatile memory

Procedia PDF Downloads 50

4 The Intersection/Union Region Computation for Drosophila Brain Images Using Encoding Schemes Based on Multi-Core CPUs

Authors: Ming-Yang Guo, Cheng-Xian Wu, Wei-Xiang Chen, Chun-Yuan Lin, Yen-Jen Lin, Ann-Shyn Chiang

Abstract:

With more and more Drosophila Driver and Neuron images, it is an important work to find the similarity relationships among them as the functional inference. There is a general problem that how to find a Drosophila Driver image, which can cover a set of Drosophila Driver/Neuron images. In order to solve this problem, the intersection/union region for a set of images should be computed at first, then a comparison work is used to calculate the similarities between the region and other images. In this paper, three encoding schemes, namely Integer, Boolean, Decimal, are proposed to encode each image as a one-dimensional structure. Then, the intersection/union region from these images can be computed by using the compare operations, Boolean operators and lookup table method. Finally, the comparison work is done as the union region computation, and the similarity score can be calculated by the definition of Tanimoto coefficient. The above methods for the region computation are also implemented in the multi-core CPUs environment with the OpenMP. From the experimental results, in the encoding phase, the performance by the Boolean scheme is the best than that by others; in the region computation phase, the performance by Decimal is the best when the number of images is large. The speedup ratio can achieve 12 based on 16 CPUs. This work was supported by the Ministry of Science and Technology under the grant MOST 106-2221-E-182-070.

Keywords: Drosophila driver image, Drosophila neuron images, intersection/union computation, parallel processing, OpenMP

Procedia PDF Downloads 212

3 Parallel Pipelined Conjugate Gradient Algorithm on Heterogeneous Platforms

Authors: Sergey Kopysov, Nikita Nedozhogin, Leonid Tonkov

Abstract:

The article presents a parallel iterative solver for large sparse linear systems which can be used on a heterogeneous platform. Traditionally, the problem of solving linear systems does not scale well on multi-CPU/multi-GPUs clusters. For example, most of the attempts to implement the classical conjugate gradient method were at best counted in the same amount of time as the problem was enlarged. The paper proposes the pipelined variant of the conjugate gradient method (PCG), a formulation that is potentially better suited for hybrid CPU/GPU computing since it requires only one synchronization point per one iteration instead of two for standard CG. The standard and pipelined CG methods need the vector entries generated by the current GPU and other GPUs for matrix-vector products. So the communication between GPUs becomes a major performance bottleneck on multi GPU cluster. The article presents an approach to minimize the communications between parallel parts of algorithms. Additionally, computation and communication can be overlapped to reduce the impact of data exchange. Using the pipelined version of the CG method with one synchronization point, the possibility of asynchronous calculations and communications, load balancing between the CPU and GPU for solving the large linear systems allows for scalability. The algorithm is implemented with the combined use of technologies: MPI, OpenMP, and CUDA. We show that almost optimum speed up on 8-CPU/2GPU may be reached (relatively to a one GPU execution). The parallelized solver achieves a speedup of up to 5.49 times on 16 NVIDIA Tesla GPUs, as compared to one GPU.

Keywords: conjugate gradient, GPU, parallel programming, pipelined algorithm

Procedia PDF Downloads 137

2 Flood Modeling in Urban Area Using a Well-Balanced Discontinuous Galerkin Scheme on Unstructured Triangular Grids

Authors: Rabih Ghostine, Craig Kapfer, Viswanathan Kannan, Ibrahim Hoteit

Abstract:

Urban flooding resulting from a sudden release of water due to dam-break or excessive rainfall is a serious threatening environment hazard, which causes loss of human life and large economic losses. Anticipating floods before they occur could minimize human and economic losses through the implementation of appropriate protection, provision, and rescue plans. This work reports on the numerical modelling of flash flood propagation in urban areas after an excessive rainfall event or dam-break. A two-dimensional (2D) depth-averaged shallow water model is used with a refined unstructured grid of triangles for representing the urban area topography. The 2D shallow water equations are solved using a second-order well-balanced discontinuous Galerkin scheme. Theoretical test case and three flood events are described to demonstrate the potential benefits of the scheme: (i) wetting and drying in a parabolic basin (ii) flash flood over a physical model of the urbanized Toce River valley in Italy; (iii) wave propagation on the Reyran river valley in consequence of the Malpasset dam-break in 1959 (France); and (iv) dam-break flood in October 1982 at the town of Sumacarcel (Spain). The capability of the scheme is also verified against alternative models. Computational results compare well with recorded data and show that the scheme is at least as efficient as comparable second-order finite volume schemes, with notable efficiency speedup due to parallelization.

Keywords: dam-break, discontinuous Galerkin scheme, flood modeling, shallow water equations

Procedia PDF Downloads 158

1 Solid Particles Transport and Deposition Prediction in a Turbulent Impinging Jet Using the Lattice Boltzmann Method and a Probabilistic Model on GPU

Authors: Ali Abdul Kadhim, Fue Lien

Abstract:

Solid particle distribution on an impingement surface has been simulated utilizing a graphical processing unit (GPU). In-house computational fluid dynamics (CFD) code has been developed to investigate a 3D turbulent impinging jet using the lattice Boltzmann method (LBM) in conjunction with large eddy simulation (LES) and the multiple relaxation time (MRT) models. This paper proposed an improvement in the LBM-cellular automata (LBM-CA) probabilistic method. In the current model, the fluid flow utilizes the D3Q19 lattice, while the particle model employs the D3Q27 lattice. The particle numbers are defined at the same regular LBM nodes, and transport of particles from one node to its neighboring nodes are determined in accordance with the particle bulk density and velocity by considering all the external forces. The previous models distribute particles at each time step without considering the local velocity and the number of particles at each node. The present model overcomes the deficiencies of the previous LBM-CA models and, therefore, can better capture the dynamic interaction between particles and the surrounding turbulent flow field. Despite the increasing popularity of LBM-MRT-CA model in simulating complex multiphase fluid flows, this approach is still expensive in term of memory size and computational time required to perform 3D simulations. To improve the throughput of each simulation, a single GeForce GTX TITAN X GPU is used in the present work. The CUDA parallel programming platform and the CuRAND library are utilized to form an efficient LBM-CA algorithm. The methodology was first validated against a benchmark test case involving particle deposition on a square cylinder confined in a duct. The flow was unsteady and laminar at Re=200 (Re is the Reynolds number), and simulations were conducted for different Stokes numbers. The present LBM solutions agree well with other results available in the open literature. The GPU code was then used to simulate the particle transport and deposition in a turbulent impinging jet at Re=10,000. The simulations were conducted for L/D=2,4 and 6, where L is the nozzle-to-surface distance and D is the jet diameter. The effect of changing the Stokes number on the particle deposition profile was studied at different L/D ratios. For comparative studies, another in-house serial CPU code was also developed, coupling LBM with the classical Lagrangian particle dispersion model. Agreement between results obtained with LBM-CA and LBM-Lagrangian models and the experimental data is generally good. The present GPU approach achieves a speedup ratio of about 350 against the serial code running on a single CPU.

Keywords: CUDA, GPU parallel programming, LES, lattice Boltzmann method, MRT, multi-phase flow, probabilistic model

Procedia PDF Downloads 182