Optimization of SAD Algorithm on VLIW DSP
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 32799
Optimization of SAD Algorithm on VLIW DSP

Authors: Hui-Jae You, Sun-Tae Chung, Souhwan Jung

Abstract:

SAD (Sum of Absolute Difference) algorithm is heavily used in motion estimation which is computationally highly demanding process in motion picture encoding. To enhance the performance of motion picture encoding on a VLIW processor, an efficient implementation of SAD algorithm on the VLIW processor is essential. SAD algorithm is programmed as a nested loop with a conditional branch. In VLIW processors, loop is usually optimized by software pipelining, but researches on optimal scheduling of software pipelining for nested loops, especially nested loops with conditional branches are rare. In this paper, we propose an optimal scheduling and implementation of SAD algorithm with conditional branch on a VLIW DSP processor. The proposed optimal scheduling first transforms the nested loop with conditional branch into a single loop with conditional branch with consideration of full utilization of ILP capability of the VLIW processor and realization of earlier escape from the loop. Next, the proposed optimal scheduling applies a modulo scheduling technique developed for single loop. Based on this optimal scheduling strategy, optimal implementation of SAD algorithm on TMS320C67x, a VLIW DSP is presented. Through experiments on TMS320C6713 DSK, it is shown that H.263 encoder with the proposed SAD implementation performs better than other H.263 encoder with other SAD implementations, and that the code size of the optimal SAD implementation is small enough to be appropriate for embedded environments.

Keywords: Optimal implementation, SAD algorithm, VLIW, TMS320C6713.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1082273

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2292

References:


[1] P. G. Paulin, et al., "Embedded Software in Real-Time Signal Processing Systems: Application and Architecture Trends," Proc. of IEEE, Vol. 85, No.3, pp. 419-435, Mar. 1997.
[2] M. Budagavi et. al., "Wireless MPEG-4 Video Communication on DSP chips," IEEE Signal Processing Magazine, pp. 36-53, Jan. 2000.
[3] J. Hennessy and D. Patterson, Computer Architecture A Quantitative Approach, 3rd ed., MK Pub. 2003.
[4] J. A. Fisher, P. Farabosch, and C. Young, Embedded Computing A VLIW Approach to Architecture, Compilers, and Tools, Morgan Kaufmann, 2004.
[5] K. Muthukumar and G. Doshi, "Software pipelining of nested loops," In R. Wilhelm, editor, CC 2001, LNCS 2027, pages 165-181. Springer-Verlag, Berlin Heidelberg, 2001.
[6] R. Scales, Nested loop optimization on the TMS320C6x. Application Report SPRA519, Texas Instruments, Feb. 1999.
[7] Q. Zhuge, Z. Shao, and E. H.-M. Sha, "Optimization of Nest-Loop Software Pipelining," Submitted paper, http://www.utdallas.edu/~edsha/
[8] G. Gupta, and C. Chakrabarti, "Architectures for hierarchical and other block matching algorithms," Circuits and Systems for Video Technology, IEEE Transactions on Volume 5, Issue 6, pp. 477-489, Dec. 1995.
[9] W. Hwang, Y. Lu, Y. Zeng, "Fast block-matching algorithm for video coding," Electronics Letters Volume 33, Issue 10, pp. 833 - 835, May 1997
[10] D. Talla, L.K. John, V. Lapinskii, and B.L. Evans, "Evaluating signal processing and multimedia applications on SIMD, VLIW and Superscalar architectures," Proceedings of 2000 International Conference on Computer Design, pp. 163-172, Sept. 2000.
[11] S. M. Akramullah, I. Ahmad, I. and M.L. Liou, "Optimization of H.263 video encoding using a single processor computer: performance tradeoffs and benchmarking," Circuits and Systems for Video Technology, IEEE Transactions on Volume 11, Issue 8, pp. 901-915, Aug. 2001.
[12] H. Miyazawa, H.263 Encoder: TMS320C6000 Implementation, Application Report SPRA721, Texas Instruments, December, 2000
[13] O. Lehtoranta, T. Hamalainen, J. Saarinen, "Real-time H.263 encoding of QCIF-images on TMS320C6201 fixed point DSP," Circuits and Systems, Proceedings of IEEE International Symposium on Circuits and Systems, Volume 1, 28-31 pp. 583 - 586, 2000.
[14] S. Bangerjee, et.al., "VLIW DSP vs. Superscalar Implementation of a Baseline H.263 Video Encoder," 34th IEEE Alsilomar Conf. on Signals, Systems and Computers, vol. 2, pp. 1665-1669, 2000.
[15] TMS320C62x Image Library, http://focus.ti.com/docs/toolsw/ folders/ print/sprc093.html
[16] B. Erol, F. Kossentini, and H. Alnuweiri, "Implementation of a fast H.263+ encoder/decoder," Proc. IEEE Asilomar Conf. Alsilomar Conf. on Signals, Systems and Computers, vol.1, pp.462-466, Nov. 1998.
[17] Iain E G Richardson, Video CODEC Design Developing image and video compression systems, John Wiley & Sons, 2002.
[18] M. S. Lam, "Software pipelining: An effective scheduling technique for VLIW machines," in Proc. ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation, pp. 318-328, Jun. 1988.
[19] B. R. Rau, "Iterative Modulo Scheduling," International Journal of Parallel Processing, Vol. 24, No. 1, Feb. 1996.
[20] TMS320C6713 Datasheet, No. SPRU186D, Texas Instruments, May, 2003.
[21] TMS320C6000 CPU and Instruction Set Reference Guide, No. SPRU189F, Texas Instruments, Oct., 2000.