Search results for: Lizhi Sun
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 3

Search results for: Lizhi Sun

3 SAP-Reduce: Staleness-Aware P-Reduce with Weight Generator

Authors: Lizhi Ma, Chengcheng Hu, Fuxian Wong

Abstract:

Partial reduce (P-Reduce) has set a state-of-the-art performance on distributed machine learning in the heterogeneous environment over the All-Reduce architecture. The dynamic P-Reduce based on the exponential moving average (EMA) approach predicts all the intermediate model parameters, which raises unreliability. It is noticed that the approximation trick leads the wrong way to obtaining model parameters in all the nodes. In this paper, SAP-Reduce is proposed, which is a variant of the All-Reduce distributed training model with staleness-aware dynamic P-Reduce. SAP-Reduce directly utilizes the EMA-like algorithm to generate the normalized weights. To demonstrate the effectiveness of the algorithm, the experiments are set based on a number of deep learning models, comparing the single-step training acceleration ratio and convergence time. It is found that SAP-Reduce simplifying dynamic P-Reduce outperforms the intermediate approximation one. The empirical results show SAP-Reduce is 1.3× −2.1× faster than existing baselines.

Keywords: collective communication, decentralized distributed training, machine learning, P-Reduce

Procedia PDF Downloads 22
2 Charging-Vacuum Helium Mass Spectrometer Leak Detection Technology in the Application of Space Products Leak Testing and Error Control

Authors: Jijun Shi, Lichen Sun, Jianchao Zhao, Lizhi Sun, Enjun Liu, Chongwu Guo

Abstract:

Because of the consistency of pressure direction, more short cycle, and high sensitivity, Charging-Vacuum helium mass spectrometer leak testing technology is the most popular leak testing technology for the seal testing of the spacecraft parts, especially the small and medium size ones. Usually, auxiliary pump was used, and the minimum detectable leak rate could reach 5E-9Pa•m3/s, even better on certain occasions. Relative error is more important when evaluating the results. How to choose the reference leak, the background level of helium, and record formats would affect the leak rate tested. In the linearity range of leak testing system, it would reduce 10% relative error if the reference leak with larger leak rate was used, and the relative error would reduce obviously if the background of helium was low efficiently, the record format of decimal was used, and the more stable data were recorded.

Keywords: leak testing, spacecraft parts, relative error, error control

Procedia PDF Downloads 450
1 Efficient DNN Training on Heterogeneous Clusters with Pipeline Parallelism

Authors: Lizhi Ma, Dan Liu

Abstract:

Pipeline parallelism has been widely used to accelerate distributed deep learning to alleviate GPU memory bottlenecks and to ensure that models can be trained and deployed smoothly under limited graphics memory conditions. However, in highly heterogeneous distributed clusters, traditional model partitioning methods are not able to achieve load balancing. The overlap of communication and computation is also a big challenge. In this paper, HePipe is proposed, an efficient pipeline parallel training method for highly heterogeneous clusters. According to the characteristics of the neural network model pipeline training task, oriented to the 2-level heterogeneous cluster computing topology, a training method based on the 2-level stage division of neural network modeling and partitioning is designed to improve the parallelism. Additionally, a multi-forward 1F1B scheduling strategy is designed to accelerate the training time of each stage by executing the computation units in advance to maximize the overlap between the forward propagation communication and backward propagation computation. Finally, a dynamic recomputation strategy based on task memory requirement prediction is proposed to improve the fitness ratio of task and memory, which improves the throughput of the cluster and solves the memory shortfall problem caused by memory differences in heterogeneous clusters. The empirical results show that HePipe improves the training speed by 1.6×−2.2× over the existing asynchronous pipeline baselines.

Keywords: pipeline parallelism, heterogeneous cluster, model training, 2-level stage partitioning

Procedia PDF Downloads 5