Search results for: CUDA
13 Efficient Heuristic Algorithm to Speed Up Graphcut in Gpu for Image Stitching
Authors: Tai Nguyen, Minh Bui, Huong Ninh, Tu Nguyen, Hai Tran
Abstract:
GraphCut algorithm has been widely utilized to solve various types of computer vision problems. Its expensive computational cost encouraged many researchers to improve the speed of the algorithm. Recent works proposed schemes that work on parallel computing platforms such as CUDA. However, the problem of low convergence speed prevents the usage of GraphCut for real time applications. In this paper, we propose global suppression heuristic to boost the conver-gence process of the algorithm. A parallel implementation of GraphCut algorithm on CUDA designed for the image stitching problem is introduced. Our method achieves up to 3× time boost on the graph of size 80 × 480 compared to the best sequential GraphCut algorithm while achieving satisfactory stitched images, suitable for panorama applications. Our source code will be soon available for further research.Keywords: CUDA, graph cut, image stitching, texture synthesis, maxflow/mincut algorithm
Procedia PDF Downloads 13212 Comparison of Parallel CUDA and OpenMP Implementations of Memetic Algorithms for Solving Optimization Problems
Authors: Jason Digalakis, John Cotronis
Abstract:
Memetic algorithms (MAs) are useful for solving optimization problems. It is quite difficult to search the search space of the optimization problem with large dimensions. There is a challenge to use all the cores of the system. In this study, a sequential implementation of the memetic algorithm is converted into a concurrent version, which is executed on the cores of both CPU and GPU. For this reason, CUDA and OpenMP libraries are operated on the parallel algorithm to make a concurrent execution on CPU and GPU, respectively. The aim of this study is to compare CPU and GPU implementation of the memetic algorithm. For this purpose, fourteen benchmark functions are selected as test problems. The obtained results indicate that our approach leads to speedups up to five thousand times higher compared to one CPU thread while maintaining a reasonable results quality. This clearly shows that GPUs have the potential to acceleration of MAs and allow them to solve much more complex tasks.Keywords: memetic algorithm, CUDA, GPU-based memetic algorithm, open multi processing, multimodal functions, unimodal functions, non-linear optimization problems
Procedia PDF Downloads 10211 Fast Prediction Unit Partition Decision and Accelerating the Algorithm Using Cudafor Intra and Inter Prediction of HEVC
Authors: Qiang Zhang, Chun Yuan
Abstract:
Since the PU (Prediction Unit) decision process is the most time consuming part of the emerging HEVC (High Efficient Video Coding) standardin intra and inter frame coding, this paper proposes the fast PU decision algorithm and speed up the algorithm using CUDA (Compute Unified Device Architecture). In intra frame coding, the fast PU decision algorithm uses the texture features to skip intra-frame prediction or terminal the intra-frame prediction for smaller PU size. In inter frame coding of HEVC, the fast PU decision algorithm takes use of the similarity of its own two Nx2N size PU's motion vectors and the hierarchical structure of CU (Coding Unit) partition to skip some modes of PU partition, so as to reduce the motion estimation times. The accelerate algorithm using CUDA is based on the fast PU decision algorithm which uses the GPU to make the motion search and the gradient computation could be parallel computed. The proposed algorithm achieves up to 57% time saving compared to the HM 10.0 with little rate-distortion losses (0.043dB drop and 1.82% bitrate increase on average).Keywords: HEVC, PU decision, inter prediction, intra prediction, CUDA, parallel
Procedia PDF Downloads 39910 A Parallel Implementation of Artificial Bee Colony Algorithm within CUDA Architecture
Authors: Selcuk Aslan, Dervis Karaboga, Celal Ozturk
Abstract:
Artificial Bee Colony (ABC) algorithm is one of the most successful swarm intelligence based metaheuristics. It has been applied to a number of constrained or unconstrained numerical and combinatorial optimization problems. In this paper, we presented a parallelized version of ABC algorithm by adapting employed and onlooker bee phases to the Compute Unified Device Architecture (CUDA) platform which is a graphical processing unit (GPU) programming environment by NVIDIA. The execution speed and obtained results of the proposed approach and sequential version of ABC algorithm are compared on functions that are typically used as benchmarks for optimization algorithms. Tests on standard benchmark functions with different colony size and number of parameters showed that proposed parallelization approach for ABC algorithm decreases the execution time consumed by the employed and onlooker bee phases in total and achieved similar or better quality of the results compared to the standard sequential implementation of the ABC algorithm.Keywords: Artificial Bee Colony algorithm, GPU computing, swarm intelligence, parallelization
Procedia PDF Downloads 3789 Speeding Up Lenia: A Comparative Study Between Existing Implementations and CUDA C++ with OpenGL Interop
Authors: L. Diogo, A. Legrand, J. Nguyen-Cao, J. Rogeau, S. Bornhofen
Abstract:
Lenia is a system of cellular automata with continuous states, space and time, which surprises not only with the emergence of interesting life-like structures but also with its beauty. This paper reports ongoing research on a GPU implementation of Lenia using CUDA C++ and OpenGL Interoperability. We demonstrate how CUDA as a low-level GPU programming paradigm allows optimizing performance and memory usage of the Lenia algorithm. A comparative analysis through experimental runs with existing implementations shows that the CUDA implementation outperforms the others by one order of magnitude or more. Cellular automata hold significant interest due to their ability to model complex phenomena in systems with simple rules and structures. They allow exploring emergent behavior such as self-organization and adaptation, and find applications in various fields, including computer science, physics, biology, and sociology. Unlike classic cellular automata which rely on discrete cells and values, Lenia generalizes the concept of cellular automata to continuous space, time and states, thus providing additional fluidity and richness in emerging phenomena. In the current literature, there are many implementations of Lenia utilizing various programming languages and visualization libraries. However, each implementation also presents certain drawbacks, which serve as motivation for further research and development. In particular, speed is a critical factor when studying Lenia, for several reasons. Rapid simulation allows researchers to observe the emergence of patterns and behaviors in more configurations, on bigger grids and over longer periods without annoying waiting times. Thereby, they enable the exploration and discovery of new species within the Lenia ecosystem more efficiently. Moreover, faster simulations are beneficial when we include additional time-consuming algorithms such as computer vision or machine learning to evolve and optimize specific Lenia configurations. We developed a Lenia implementation for GPU using the C++ and CUDA programming languages, and CUDA/OpenGL Interoperability for immediate rendering. The goal of our experiment is to benchmark this implementation compared to the existing ones in terms of speed, memory usage, configurability and scalability. In our comparison we focus on the most important Lenia implementations, selected for their prominence, accessibility and widespread use in the scientific community. The implementations include MATLAB, JavaScript, ShaderToy GLSL, Jupyter, Rust and R. The list is not exhaustive but provides a broad view of the principal current approaches and their respective strengths and weaknesses. Our comparison primarily considers computational performance and memory efficiency, as these factors are critical for large-scale simulations, but we also investigate the ease of use and configurability. The experimental runs conducted so far demonstrate that the CUDA C++ implementation outperforms the other implementations by one order of magnitude or more. The benefits of using the GPU become apparent especially with larger grids and convolution kernels. However, our research is still ongoing. We are currently exploring the impact of several software design choices and optimization techniques, such as convolution with Fast Fourier Transforms (FFT), various GPU memory management scenarios, and the trade-off between speed and accuracy using single versus double precision floating point arithmetic. The results will give valuable insights into the practice of parallel programming of the Lenia algorithm, and all conclusions will be thoroughly presented in the conference paper. The final version of our CUDA C++ implementation will be published on github and made freely accessible to the Alife community for further development.Keywords: artificial life, cellular automaton, GPU optimization, Lenia, comparative analysis.
Procedia PDF Downloads 428 GPU Accelerated Fractal Image Compression for Medical Imaging in Parallel Computing Platform
Authors: Md. Enamul Haque, Abdullah Al Kaisan, Mahmudur R. Saniat, Aminur Rahman
Abstract:
In this paper, we have implemented both sequential and parallel version of fractal image compression algorithms using CUDA (Compute Unified Device Architecture) programming model for parallelizing the program in Graphics Processing Unit for medical images, as they are highly similar within the image itself. There is several improvements in the implementation of the algorithm as well. Fractal image compression is based on the self similarity of an image, meaning an image having similarity in majority of the regions. We take this opportunity to implement the compression algorithm and monitor the effect of it using both parallel and sequential implementation. Fractal compression has the property of high compression rate and the dimensionless scheme. Compression scheme for fractal image is of two kinds, one is encoding and another is decoding. Encoding is very much computational expensive. On the other hand decoding is less computational. The application of fractal compression to medical images would allow obtaining much higher compression ratios. While the fractal magnification an inseparable feature of the fractal compression would be very useful in presenting the reconstructed image in a highly readable form. However, like all irreversible methods, the fractal compression is connected with the problem of information loss, which is especially troublesome in the medical imaging. A very time consuming encoding process, which can last even several hours, is another bothersome drawback of the fractal compression.Keywords: accelerated GPU, CUDA, parallel computing, fractal image compression
Procedia PDF Downloads 3367 An Evaluation of Neural Network Efficacies for Image Recognition on Edge-AI Computer Vision Platform
Abstract:
Image recognition, as one of the most critical technologies in computer vision, works to help machine-like robotics understand a scene, that is, if deployed appropriately, will trigger the revolution in remote sensing and industry automation. With the developments of AI technologies, there are many prevailing and sophisticated neural networks as technologies developed for image recognition. However, computer vision platforms as hardware, supporting neural networks for image recognition, as crucial as the neural network technologies, need to be more congruently addressed as the research subjects. In contrast, different computer vision platforms are deterministic to leverage the performance of different neural networks for recognition. In this paper, three different computer vision platforms – Jetson Nano(with 4GB), a standalone laptop(with RTX 3000s, using CUDA), and Google Colab (web-based, using GPU) are explored and four prominent neural network architectures (including AlexNet, VGG(16/19), GoogleNet, and ResNet(18/34/50)), are investigated. In the context of pairwise usage between different computer vision platforms and distinctive neural networks, with the merits of recognition accuracy and time efficiency, the performances are evaluated. In the case study using public imageNets, our findings provide a nuanced perspective on optimizing image recognition tasks across Edge-AI platforms, offering guidance on selecting appropriate neural network structures to maximize performance under hardware constraints.Keywords: alexNet, VGG, googleNet, resNet, Jetson nano, CUDA, COCO-NET, cifar10, imageNet large scale visual recognition challenge (ILSVRC), google colab
Procedia PDF Downloads 906 Parallel Computing: Offloading Matrix Multiplication to GPU
Authors: Bharath R., Tharun Sai N., Bhuvan G.
Abstract:
This project focuses on developing a Parallel Computing method aimed at optimizing matrix multiplication through GPU acceleration. Addressing algorithmic challenges, GPU programming intricacies, and integration issues, the project aims to enhance efficiency and scalability. The methodology involves algorithm design, GPU programming, and optimization techniques. Future plans include advanced optimizations, extended functionality, and integration with high-level frameworks. User engagement is emphasized through user-friendly interfaces, open- source collaboration, and continuous refinement based on feedback. The project's impact extends to significantly improving matrix multiplication performance in scientific computing and machine learning applications.Keywords: matrix multiplication, parallel processing, cuda, performance boost, neural networks
Procedia PDF Downloads 595 Solid Particles Transport and Deposition Prediction in a Turbulent Impinging Jet Using the Lattice Boltzmann Method and a Probabilistic Model on GPU
Authors: Ali Abdul Kadhim, Fue Lien
Abstract:
Solid particle distribution on an impingement surface has been simulated utilizing a graphical processing unit (GPU). In-house computational fluid dynamics (CFD) code has been developed to investigate a 3D turbulent impinging jet using the lattice Boltzmann method (LBM) in conjunction with large eddy simulation (LES) and the multiple relaxation time (MRT) models. This paper proposed an improvement in the LBM-cellular automata (LBM-CA) probabilistic method. In the current model, the fluid flow utilizes the D3Q19 lattice, while the particle model employs the D3Q27 lattice. The particle numbers are defined at the same regular LBM nodes, and transport of particles from one node to its neighboring nodes are determined in accordance with the particle bulk density and velocity by considering all the external forces. The previous models distribute particles at each time step without considering the local velocity and the number of particles at each node. The present model overcomes the deficiencies of the previous LBM-CA models and, therefore, can better capture the dynamic interaction between particles and the surrounding turbulent flow field. Despite the increasing popularity of LBM-MRT-CA model in simulating complex multiphase fluid flows, this approach is still expensive in term of memory size and computational time required to perform 3D simulations. To improve the throughput of each simulation, a single GeForce GTX TITAN X GPU is used in the present work. The CUDA parallel programming platform and the CuRAND library are utilized to form an efficient LBM-CA algorithm. The methodology was first validated against a benchmark test case involving particle deposition on a square cylinder confined in a duct. The flow was unsteady and laminar at Re=200 (Re is the Reynolds number), and simulations were conducted for different Stokes numbers. The present LBM solutions agree well with other results available in the open literature. The GPU code was then used to simulate the particle transport and deposition in a turbulent impinging jet at Re=10,000. The simulations were conducted for L/D=2,4 and 6, where L is the nozzle-to-surface distance and D is the jet diameter. The effect of changing the Stokes number on the particle deposition profile was studied at different L/D ratios. For comparative studies, another in-house serial CPU code was also developed, coupling LBM with the classical Lagrangian particle dispersion model. Agreement between results obtained with LBM-CA and LBM-Lagrangian models and the experimental data is generally good. The present GPU approach achieves a speedup ratio of about 350 against the serial code running on a single CPU.Keywords: CUDA, GPU parallel programming, LES, lattice Boltzmann method, MRT, multi-phase flow, probabilistic model
Procedia PDF Downloads 2074 A Parallel Approach for 3D-Variational Data Assimilation on GPUs in Ocean Circulation Models
Authors: Rossella Arcucci, Luisa D'Amore, Simone Celestino, Giuseppe Scotti, Giuliano Laccetti
Abstract:
This work is the first dowel in a rather wide research activity in collaboration with Euro Mediterranean Center for Climate Changes, aimed at introducing scalable approaches in Ocean Circulation Models. We discuss designing and implementation of a parallel algorithm for solving the Variational Data Assimilation (DA) problem on Graphics Processing Units (GPUs). The algorithm is based on the fully scalable 3DVar DA model, previously proposed by the authors, which uses a Domain Decomposition approach (we refer to this model as the DD-DA model). We proceed with an incremental porting process consisting of 3 distinct stages: requirements and source code analysis, incremental development of CUDA kernels, testing and optimization. Experiments confirm the theoretic performance analysis based on the so-called scale up factor demonstrating that the DD-DA model can be suitably mapped on GPU architectures.Keywords: data assimilation, GPU architectures, ocean models, parallel algorithm
Procedia PDF Downloads 4123 Acceleration of Lagrangian and Eulerian Flow Solvers via Graphics Processing Units
Authors: Pooya Niksiar, Ali Ashrafizadeh, Mehrzad Shams, Amir Hossein Madani
Abstract:
There are many computationally demanding applications in science and engineering which need efficient algorithms implemented on high performance computers. Recently, Graphics Processing Units (GPUs) have drawn much attention as compared to the traditional CPU-based hardware and have opened up new improvement venues in scientific computing. One particular application area is Computational Fluid Dynamics (CFD), in which mature CPU-based codes need to be converted to GPU-based algorithms to take advantage of this new technology. In this paper, numerical solutions of two classes of discrete fluid flow models via both CPU and GPU are discussed and compared. Test problems include an Eulerian model of a two-dimensional incompressible laminar flow case and a Lagrangian model of a two phase flow field. The CUDA programming standard is used to employ an NVIDIA GPU with 480 cores and a C++ serial code is run on a single core Intel quad-core CPU. Up to two orders of magnitude speed up is observed on GPU for a certain range of grid resolution or particle numbers. As expected, Lagrangian formulation is better suited for parallel computations on GPU although Eulerian formulation represents significant speed up too.Keywords: CFD, Eulerian formulation, graphics processing units, Lagrangian formulation
Procedia PDF Downloads 4172 Parallel Pipelined Conjugate Gradient Algorithm on Heterogeneous Platforms
Authors: Sergey Kopysov, Nikita Nedozhogin, Leonid Tonkov
Abstract:
The article presents a parallel iterative solver for large sparse linear systems which can be used on a heterogeneous platform. Traditionally, the problem of solving linear systems does not scale well on multi-CPU/multi-GPUs clusters. For example, most of the attempts to implement the classical conjugate gradient method were at best counted in the same amount of time as the problem was enlarged. The paper proposes the pipelined variant of the conjugate gradient method (PCG), a formulation that is potentially better suited for hybrid CPU/GPU computing since it requires only one synchronization point per one iteration instead of two for standard CG. The standard and pipelined CG methods need the vector entries generated by the current GPU and other GPUs for matrix-vector products. So the communication between GPUs becomes a major performance bottleneck on multi GPU cluster. The article presents an approach to minimize the communications between parallel parts of algorithms. Additionally, computation and communication can be overlapped to reduce the impact of data exchange. Using the pipelined version of the CG method with one synchronization point, the possibility of asynchronous calculations and communications, load balancing between the CPU and GPU for solving the large linear systems allows for scalability. The algorithm is implemented with the combined use of technologies: MPI, OpenMP, and CUDA. We show that almost optimum speed up on 8-CPU/2GPU may be reached (relatively to a one GPU execution). The parallelized solver achieves a speedup of up to 5.49 times on 16 NVIDIA Tesla GPUs, as compared to one GPU.Keywords: conjugate gradient, GPU, parallel programming, pipelined algorithm
Procedia PDF Downloads 1651 The Effect of a Saturated Kink on the Dynamics of Tungsten Impurities in the Plasma Core
Authors: H. E. Ferrari, R. Farengo, C. F. Clauser
Abstract:
Tungsten (W) will be used in ITER as one of the plasma facing components (PFCs). The W could migrate to the plasma center. This could have a potentially deleterious effect on plasma confinement. Electron cyclotron resonance heating (ECRH) can be used to prevent W accumulation. We simulated a series of H mode discharges in ASDEX U with PFC containing W, where central ECRH was used to prevent W accumulation in the plasma center. The experiments showed that the W density profiles were flat after a sawtooth crash, and become hollow in between sawtooth crashes when ECRH has been applied. It was also observed that a saturated kink mode was active in these conditions. We studied the effect of saturated kink like instabilities on the redistribution of W impurities. The kink was modeled as the sum of a simple analytical equilibrium (large aspect ratio, circular cross section) plus the perturbation produced by the kink. A numerical code that follows the exact trajectories of the impurity ions in the total fields and includes collisions was employed. The code is written in Cuda C and runs in Graphical Processing Units (GPUs), allowing simulations with a large number of particles with modest resources. Our simulations show that when the W ions have a thermal velocity distribution, the kink has no effect on the W density. When we consider the plasma rotation, the kink can affect the W density. When the average passing frequency of the W particles is similar to the frequency of the kink mode, the expulsion of W ions from the plasma core is maximum, and the W density shows a hollow structure. This could have implications for the mitigation of W accumulation.Keywords: impurity transport, kink instability, tungsten accumulation, tungsten dynamics
Procedia PDF Downloads 171