In shared-memory bus-based multiprocessors, when the number of processors grows, the processors spend an increasing amount of time waiting for access to the bus (and shared memory). This contention reduces the performance of processors and imposes a limit on the number of processors that can be used efficiently in bus-based systems. Since the multiprocessor's performance depends on many parameters that affect it in different ways, timed Petri nets are used to model shared-memory bus-based multiprocessors at the instruction execution level, and the developed models are used to study how the performance of processors changes with the number of processors in the system. The results illustrate very well the restriction on the number of processors imposed by the shared bus. All performance characteristics presented in this paper are obtained by discrete-event simulation of the Petri net models.
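As a rough illustration of the contention effect described above, the following Python sketch runs a toy discrete-event simulation in which each processor alternates a random local-compute phase with a fixed-length transaction on a single shared bus served in FIFO order. It is not the paper's timed Petri net model; the parameter values (mean compute time, bus transaction time) are arbitrary assumptions chosen only to show utilization falling as processors are added.

```python
import heapq
import random

def simulate_bus(n_proc, t_end=100_000.0, mean_compute=10.0, bus_time=2.0, seed=1):
    """Toy model: each processor alternates an exponential local-compute phase with
    one fixed-length transaction on a single shared bus served in FIFO order."""
    rng = random.Random(seed)
    bus_free_at = 0.0
    busy = [0.0] * n_proc                     # accumulated useful compute per processor
    events = []                               # (time the processor requests the bus, id)
    for p in range(n_proc):
        burst = rng.expovariate(1.0 / mean_compute)
        busy[p] += burst
        heapq.heappush(events, (burst, p))
    while events and events[0][0] < t_end:
        t, p = heapq.heappop(events)
        start = max(t, bus_free_at)           # wait if the bus is already in use
        bus_free_at = start + bus_time
        burst = rng.expovariate(1.0 / mean_compute)
        busy[p] += burst                      # next local-compute phase
        heapq.heappush(events, (bus_free_at + burst, p))
    return sum(busy) / (n_proc * t_end)       # average processor utilization

for n in (2, 4, 8, 16, 32):
    print(n, round(simulate_bus(n), 3))
```

With these (assumed) parameters the bus can serve at most one transaction every 2 time units, so utilization saturates quickly as the processor count grows, which is the qualitative effect the abstract describes.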
The Fork-Join program consisting of K parallel tasks is a useful model for a large number of computing applications. When the parallel processor has multiple channels, tasks that start later may finish execution earlier than tasks started before them and may join with tasks from other programs. This phenomenon is called exchangeable join (EJ), which introduces correlation into the tasks' service times. In this work, we investigate the response time of multiprocessor systems with EJ using a new approach. We analyze two aspects of this kind of system: exchangeable join (EJ) and the capacity constraint (CC). We prove that the system response time can be effectively reduced by EJ, while the amount of reduction is constrained by the capacity of the multiprocessor. An upper-bound model is constructed based on this analysis and a quick estimation algorithm is proposed. The approximation formula is verified by extensive simulation results, which show that the relative error of the approximation is less than 5%.
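The sketch below is a deliberately simplified Monte Carlo estimate of the mean response time of a single Fork-Join job with K tasks on a fixed number of channels. It models neither exchangeable join nor queueing between jobs, and the exponential service times and parameter values are assumptions for illustration only, not the paper's analytical model.

```python
import random
import statistics

def forkjoin_response(K=8, channels=4, mean_service=1.0, n_jobs=20_000, seed=2):
    """Monte Carlo estimate of the mean response time of one Fork-Join job: K tasks
    with i.i.d. exponential service demands are placed on the earliest-free of
    `channels` identical servers; the job ends when its slowest task finishes."""
    rng = random.Random(seed)
    responses = []
    for _ in range(n_jobs):
        free_at = [0.0] * channels                          # next-free time per channel
        finish = 0.0
        for _ in range(K):
            s = rng.expovariate(1.0 / mean_service)
            c = min(range(channels), key=lambda i: free_at[i])
            free_at[c] += s
            finish = max(finish, free_at[c])
        responses.append(finish)                            # join on the last task
    return statistics.mean(responses)

print(round(forkjoin_response(), 3))
```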
The design of parallel algorithms is studied in this paper. These algorithms are applicable to shared-memory MIMD machines. The emphasis is put on methods for the design of efficient parallel algorithms, which should be based on the following considerations: algorithm parallelism versus hardware parallelism; the granularity of the parallel algorithm; and optimization of the algorithm for the underlying parallel machine. These principles are applied to solve a model PDE problem. The speedup of the new method is high. The results were tested and evaluated on a shared-memory MIMD machine, and the practical results agreed with the predicted performance.
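To make the granularity consideration concrete, here is a back-of-the-envelope speedup calculation combining an Amdahl-style serial fraction with a fixed synchronization overhead charged per grain of work; the formula and all numbers are illustrative assumptions, not the paper's model or measurements.

```python
def speedup(p, serial_frac=0.05, grain_work=1000.0, per_grain_overhead=50.0):
    """Amdahl-style speedup with a fixed synchronization overhead charged per grain:
    T(p) = serial + parallel/p + overhead/grain_work, with T(1) normalized to 1."""
    t_p = serial_frac + (1.0 - serial_frac) / p + per_grain_overhead / grain_work
    return 1.0 / t_p

for p in (1, 2, 4, 8, 16):
    print(p, round(speedup(p), 2))
```

The per-grain overhead term is what makes grain size matter: shrinking grain_work (finer granularity) inflates the overhead share and caps the achievable speedup regardless of p.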
Significant advances in field-programmable gate arrays (FPGAs) have made it viable to explore innovative multiprocessor solutions on a single FPGA chip. For multiprocessors, an efficient communication network that matches the needs of the target application is always critical to the overall performance. Wormhole packet-switching network-on-chip (NoC) solutions are replacing conventional shared buses to deal with the scalability and complexity challenges that come with the increasing number of processing elements (PEs). However, the quest for high-performance networks has led to very complex and resource-expensive NoC designs, leaving little room for the real computing force, i.e., the PEs. Moreover, many techniques offer very small performance gains or none at all when network traffic is light, while increasing the resource usage of routers. We argue that computation is still the primary task of multiprocessors and that sufficient resources should be reserved for PEs. This paper presents our novel design and implementation of a resource-efficient communication network for multiprocessors on FPGAs. We reduce not only the number of routers required for a given number of PEs, by introducing a new PE-router topology, but also the resource requirement of each router. Our communication network relies on the NEWS channels to transfer packets in a pipelined fashion following the path determined by the routing network. The implementation results on various Xilinx FPGAs show good performance in the typical range of network load for multiprocessor applications.
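For readers unfamiliar with NoC routing, the following sketch shows plain dimension-order (XY) routing on a 2D mesh, a common baseline for packet-switched networks-on-chip; it is only a generic illustration and does not reproduce the paper's PE-router topology or NEWS-channel pipeline.

```python
def xy_route(src, dst):
    """Dimension-order (XY) routing in a 2D mesh: a packet first travels along X,
    then along Y. Returns the list of (x, y) hops from source to destination."""
    x, y = src
    dx, dy = dst
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 3)))   # [(0,0), (1,0), (2,0), (2,1), (2,2), (2,3)]
```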
Synchronization in parallel programs is a major performance bottleneck in multiprocessor systems. Shared data is protected by locks, and a lot of time is spent on the contention that arises at lock hand-off. In order to be serialized, requests to the same cache line can either be bounced (NACKed) or buffered in the coherence controller. In this paper, we focus mainly on systems whose coherence controllers buffer requests. In a lock hand-off, a burst of requests to the same line arrives at the coherence controller. During the lock hand-off, only the requests from the winning processor contribute to the progress of the computation, since the winning processor is the only one that will advance the work. This key observation leads us to propose a hardware mechanism we call request bypassing, which allows requests from the winning processor to bypass the requests buffered in the coherence controller holding the lock line. We present an inexpensive implementation of request bypassing that reduces the time spent on all the execution phases of a critical section (acquiring the lock, accessing shared data, and releasing the lock) and which, as a consequence, speeds up the whole parallel computation. This mechanism requires neither compiler nor programmer support, nor ISA or coherence protocol changes. By simulating a 32-processor system, we show that request bypassing does not degrade but rather improves performance in three applications with low synchronization rates, while in those having a large amount of synchronization activity (the remaining four), we see reductions in execution time and in lock stall time ranging from 14% to 39% and from 52% to 71%, respectively. We compare request bypassing with a previously proposed technique called read combining and with a system that bounces requests, observing a significantly lower execution time with the bypassing scheme. Finally, we analyze the sensitivity of our results to some key hardware and software parameters.
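A minimal sketch of the bypassing idea, under the assumption that the coherence controller keeps a per-line FIFO of buffered requests and can identify the lock winner: the winner's request is pulled to the front instead of waiting behind the losers. The class and method names are hypothetical, and no timing, protocol states or interconnect are modelled.

```python
from collections import deque

class CoherenceBuffer:
    """Toy request buffer for one cache line. Requests are normally served FIFO;
    when the lock winner is known, its request is moved to the front, mimicking
    request bypassing at the coherence controller."""
    def __init__(self):
        self.pending = deque()

    def enqueue(self, proc_id):
        self.pending.append(proc_id)

    def next_request(self, winner=None):
        if winner is not None and winner in self.pending:
            self.pending.remove(winner)     # bypass the other buffered requests
            return winner
        return self.pending.popleft() if self.pending else None

buf = CoherenceBuffer()
for p in (3, 7, 1, 5):
    buf.enqueue(p)
print(buf.next_request(winner=1))   # 1 bypasses 3 and 7
print(buf.next_request())           # back to FIFO order: 3
```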
Declarative Programming Languages (DPLs) apply a process model of Horn clauses, such as PARLOG[8], or a reduction model of λ-calculus, such as SML[7], and are, in principle, well suited to multiprocessor implementation. However, the performance of a parallel declarative program can be impaired by a mismatch between the parallelism available in an application and the parallelism available in the architecture. A particularly attractive solution is to automatically match the parallelism of the program to the parallelism of the target hardware as a compilation step. In this paper, we present an optimizing compilation technique called granularity analysis which identifies and removes excess parallelism that would degrade performance. The main steps are: an analysis of the flow of data to form an attributed call graph between function (or predicate) arguments; and an asymptotic estimation of the granularity of a function (or predicate) to generate an approximate grain size. Compiled procedure calls can be annotated with grain sizes, and a task scheduler can make scheduling decisions using the classification scheme of grains to control parallelism at runtime. The resulting granularity analysis scheme is suitable for exploiting adaptive parallelism of declarative programming languages on multiprocessors.
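The following toy fragment illustrates the kind of decision a granularity-aware scheduler can make from compile-time grain-size annotations: spawn a task only when its estimated work dominates an assumed constant spawning overhead. The cost model, threshold and names are invented for illustration and are not the analysis described in the paper.

```python
import math

def classify_grain(cost_estimate, spawn_overhead=500):
    """Crude granularity test: spawn a parallel task only when its estimated work
    clearly dominates the (assumed constant) cost of creating and scheduling it."""
    return "spawn" if cost_estimate >= 10 * spawn_overhead else "inline"

# Hypothetical compile-time cost estimates for calls f(n) with known argument sizes,
# assuming an O(n log n) predicate body.
for n in (8, 64, 512, 4096):
    est = n * math.ceil(math.log2(n))
    print(n, est, classify_grain(est))
```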
The high-level Compiler Intermediate Language CIL is a general-purpose description language for a parallel graph rewriting computational model, intended for the parallel implementation of declarative languages on multiprocessor systems. In this paper, we first outline a new Hybrid Execution Model (HEM) and the corresponding parallel abstract machine PAM/TGR, based on the Extended parallel Graph Rewriting Computational Model (EGRCM), for implementing the CIL language on distributed-memory multiprocessor systems. Then we focus on compiling the CIL language with various optimizing techniques such as pattern matching, rule indexing, node ordering and compile-time partial scheduling. The experimental results on a 16-node Transputer array demonstrate the effectiveness of our model and strategies.
In this paper, we focus on the compiling implementation of the parallel logic language PARLOG and the functional language ML on distributed-memory multiprocessors. Under the graph rewriting framework, a Heterogeneous Parallel Graph Rewriting Execution Model (HPGREM) is presented first. Then, based on HPGREM, a parallel abstract machine PAM/TGR is described. Furthermore, several optimizing compilation schemes for executing declarative programs on a transputer array are proposed. The performance statistics on a transputer array demonstrate the effectiveness of our model, parallel abstract machine, optimizing compilation strategies and compiler.
Earlier performance studies of multiple-bus multiprocessor systems assume a random selection of competing requests for bus assignment and ignore the effects of realistic bus arbitration schemes on the performance of such systems. In this paper, we present a performance analysis of multiple-bus systems with different arbitration protocols. The priority protocols considered are random selection, fixed priority, rotating priority, round-robin and FIFO. Analytical models are developed for each of these five priority protocols. Each of our analyses models exactly the behavior of the corresponding priority protocol with little computation cost. The analytical models are validated through extensive simulations and are then used to carry out performance analysis and comparison of the different priority protocols. Numerical results obtained from our models show that the round-robin protocol performs the best among the five protocols in systems with a few buses.
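As a concrete example of one of the protocols compared, the sketch below grants up to B buses per cycle by scanning processor requests in round-robin order, starting just after the last processor granted in the previous cycle. It is a behavioural illustration only, not the paper's analytical model.

```python
def round_robin_arbitrate(requests, n_buses, start):
    """Grant up to n_buses of the pending requests, scanning processors in circular
    order beginning just after the processor granted last in the previous cycle.
    Returns the granted processor ids and the scan start for the next cycle."""
    n = len(requests)
    grants, ptr = [], start
    for _ in range(n):
        if len(grants) == n_buses:
            break
        if requests[ptr]:
            grants.append(ptr)
        ptr = (ptr + 1) % n
    next_start = (grants[-1] + 1) % n if grants else start
    return grants, next_start

reqs = [True, False, True, True, False, True]             # 6 processors, 4 requesting
print(round_robin_arbitrate(reqs, n_buses=2, start=3))    # ([3, 5], 0)
```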
A method of providing redundancy to a class of bus-based multiprocessor arrays is discussed. The reconfiguration is hierarchical, providing global spare replacement at the array level and local reconfiguration within the spare block. Results of a yield analysis performed on a 32-processor array are presented.
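Assuming independent processor faults and perfect reconfiguration, the effect of global spares on yield can be illustrated with a simple binomial calculation like the one below; the probabilities and spare counts are made-up numbers, not the paper's yield analysis.

```python
from math import comb

def array_yield(needed=32, spares=4, p_good=0.95):
    """Yield of a processor array with global spares: the array works if at least
    `needed` of the `needed + spares` processors are fault-free, assuming
    independent faults and perfect reconfiguration."""
    total = needed + spares
    return sum(comb(total, k) * p_good**k * (1 - p_good)**(total - k)
               for k in range(needed, total + 1))

print(round(array_yield(spares=0), 4))   # no redundancy
print(round(array_yield(spares=4), 4))   # four global spares
```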
It is relatively clear how to map regular, repetitive or grid-oriented computations onto SIMD architectures. It is not so clear, however, how to do this for irregular computations, even though there may be significant amounts of intrinsic parallelism in branch-free code. We study compilation techniques for this type of code when targeted to SIMD computers and illustrate their use on a simple model architecture. In this paper, we present one of the compilation techniques we have developed for SIMD computers, global register allocation, and demonstrate that it can effectively allocate registers for parallelizing irregular computations in branch-free code. This technique is an extension and a modification of the register allocation via graph coloring approach used by sequential compilers. Our performance results validate our method.
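A compact sketch of register allocation via graph colouring in the Chaitin style, which the technique above extends: nodes with fewer than k neighbours are removed onto a stack, then coloured in reverse order. Spilling and the SIMD-specific extensions of the paper are omitted, and the interference graph below is invented.

```python
def color_registers(interference, k):
    """Chaitin-style simplification: repeatedly remove a node with fewer than k
    neighbours, then pop the stack and give each node the lowest colour not used
    by its already-coloured neighbours. Returns None if a spill would be needed."""
    graph = {v: set(ns) for v, ns in interference.items()}
    stack = []
    while graph:
        v = next((v for v in graph if len(graph[v]) < k), None)
        if v is None:
            return None                        # spill needed; not handled here
        stack.append((v, graph.pop(v)))        # remember remaining neighbours
        for ns in graph.values():
            ns.discard(v)
    colours = {}
    for v, neighbours in reversed(stack):
        used = {colours[n] for n in neighbours if n in colours}
        colours[v] = min(c for c in range(k) if c not in used)
    return colours

ig = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(color_registers(ig, k=3))
```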
As the number of cores in chip multiprocessors (CMPs) increases, the cache coherence protocol has become a key issue in the integration of chip multiprocessors. Supporting a cache coherence protocol in large chip multiprocessors still faces three hurdles: design complexity, performance and scalability. This paper proposes Cache Coherent Network on Chip (CCNoC), a scheme that decouples cache coherency maintenance from processors and shared L2 caches and implements it completely in the network on chip, freeing processors and shared L2 caches from the chore of maintaining coherency and thereby reducing the design complexity of CMPs. In this way, CCNoC also improves the performance of the cache coherence protocol by reducing directory access latency, and enhances scalability by avoiding massive directory overhead in shared L2 caches. In CCNoC, coherence state caches and active directory caches are implemented in the network interface components of the network on chip to maintain cache coherence states for blocks in L1 caches and to manage directory information for recently accessed blocks in L2 caches, respectively. CCNoC provides a scalable CMP framework to tackle cache coherency, which is the foundation of CMPs. This paper evaluates the performance of CCNoC. Experimental results show that for a 16-core system, CCNoC improves performance by 3% on average over the conventional chip multiprocessor and by 10% at best, while it reduces storage overhead by 1.8% and saves directory storage by 88%, showing good scalability.
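To fix ideas about the directory information that CCNoC moves into the network interfaces, here is a generic full-map directory entry with MSI-style bookkeeping. It is a textbook-style sketch under assumed semantics, not CCNoC's actual coherence state cache or active directory cache design.

```python
class DirectoryEntry:
    """Minimal full-map directory entry for one L2 block (MSI-style): tracks which
    L1 caches share the block and whether one of them owns it in Modified state.
    Only directory bookkeeping is modelled; no network or timing."""
    def __init__(self):
        self.sharers = set()
        self.owner = None            # id of the cache holding the block Modified

    def read(self, requester):
        msgs = []
        if self.owner is not None:   # owner writes back and downgrades to Shared
            msgs.append(("writeback", self.owner))
            self.sharers.add(self.owner)
            self.owner = None
        self.sharers.add(requester)
        return msgs

    def write(self, requester):
        msgs = [("invalidate", s) for s in self.sharers if s != requester]
        if self.owner not in (None, requester):
            msgs.append(("invalidate", self.owner))
        self.sharers = set()
        self.owner = requester
        return msgs

d = DirectoryEntry()
print(d.read(0), d.read(1), d.write(2))
```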
This paper proposes an object-oriented model of scheduling for parallel computing in media MultiProcessor System on Chip (MPSoC). Firstly, the Coarse Grain Data Flow Graph (CGDFG) parallel programming model is used in this approach. Secondly, the approach provides a unified abstraction for software objects implemented on processors and hardware objects implemented in ASICs, making it easy to map CGDFG programs onto the MPSoC. The approach cuts down the kernel overhead and reduces the code size effectively. The principle of the object-oriented model, the method of scheduling, and how to map a parallel program through a CGDFG onto the MPSoC are analyzed. The approach also compares code size and execution cycles with conventional control-flow scheduling, and presents the respective management overhead for one application on a media SoC.
In the context of real-time fault-tolerant scheduling in multiprocessor systems, the primary-backup scheme plays an important role. A backup copy is always preferred to be executed as a passive backup copy whenever possible, because it can take advantage of the backup-copy de-allocation and overloading techniques to improve schedulability. In this paper, we propose a novel efficient fault-tolerant rate-monotonic best-fit (ERMBF) algorithm for multiprocessor systems to enhance schedulability. Unlike existing scheduling algorithms that start scheduling tasks with only one processor, ERMBF pre-allocates a certain number of processors before it starts scheduling tasks, which enlarges the search space for tasks. Besides, when a new processor is allocated, we reassign the task copies that have already been assigned to the existing processors in order to find a superior task-assignment configuration. These two strategies both aim at making as many backup copies as possible execute in passive status. As a result, ERMBF can use fewer processors to schedule a set of tasks without losing the real-time and fault-tolerant capabilities of the system. Simulation results reveal that ERMBF significantly improves schedulability over existing, comparable algorithms in the literature.
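A simplified sketch of the rate-monotonic best-fit idea that ERMBF builds on: tasks (given here as utilizations) are packed onto a pre-allocated set of processors using the Liu and Layland sufficient bound, choosing the most heavily loaded processor that stays feasible. Primary/backup copies, de-allocation and overloading are not modelled, and the task set is invented.

```python
def rm_feasible(utilizations):
    """Liu & Layland sufficient test: a task set is RM-schedulable on one processor
    if its total utilization is at most n*(2^(1/n) - 1)."""
    n = len(utilizations)
    return n == 0 or sum(utilizations) <= n * (2 ** (1.0 / n) - 1)

def best_fit_assign(tasks, n_processors):
    """Best-fit packing of task utilizations onto a fixed set of processors,
    keeping every processor RM-feasible; returns None if the set does not fit."""
    bins = [[] for _ in range(n_processors)]
    for u in sorted(tasks, reverse=True):
        candidates = [b for b in bins if rm_feasible(b + [u])]
        if not candidates:
            return None                                  # would need another processor
        best = max(candidates, key=lambda b: sum(b))     # tightest feasible fit
        best.append(u)
    return bins

print(best_fit_assign([0.3, 0.25, 0.2, 0.4, 0.1, 0.35], n_processors=3))
```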
Compared with accurate diagnosis, a system's self-diagnosing capability can be greatly increased through the t/k-diagnosis strategy, which allows at most k vertices to be mistakenly identified as faulty under the comparison model, where k is typically a small number. Based on the Preparata, Metze, and Chien (PMC) model, the n-dimensional hypercube network has been proved to be t/k-diagnosable. In this paper, based on the Maeng and Malek (MM)* model, a novel t/k-fault diagnosis (1≤k≤4) algorithm for the n-dimensional hypercube, called t/k-MM*-DIAG, is proposed to isolate all faulty processors within a set of nodes, among which the number of fault-free nodes identified wrongly as faulty is at most k. The time complexity of our algorithm is only O(2^n·n^2).
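The sketch below shows two ingredients such an algorithm works with: the neighbour set of a hypercube node and a toy comparison syndrome generated under MM-style rules (a fault-free comparator reports 0 exactly when both compared neighbours are fault-free; a faulty comparator reports an arbitrary value). It is illustrative only and is not the t/k-MM*-DIAG algorithm.

```python
import random

def neighbors(v, n):
    """In an n-dimensional hypercube, the neighbours of node v differ from v in
    exactly one bit position."""
    return [v ^ (1 << i) for i in range(n)]

def mm_syndrome(n, faulty, seed=0):
    """Toy comparison syndrome: every node w compares each pair of its distinct
    neighbours (u, v). A fault-free comparator reports 0 iff both u and v are
    fault-free; a faulty comparator reports an arbitrary 0/1 value."""
    rng = random.Random(seed)
    syndrome = {}
    for w in range(2 ** n):
        nb = neighbors(w, n)
        for i in range(len(nb)):
            for j in range(i + 1, len(nb)):
                u, v = nb[i], nb[j]
                if w in faulty:
                    result = rng.randint(0, 1)            # unreliable comparator
                else:
                    result = 0 if (u not in faulty and v not in faulty) else 1
                syndrome[(w, u, v)] = result
    return syndrome

print(neighbors(0b0101, 4))                               # [4, 7, 1, 13]
print(sum(mm_syndrome(3, faulty={2}).values()))           # number of 1-results
```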
High-speed computational performance is gained at the cost of huge hardware resources, which restricts the application of high-accuracy algorithms when the hardware budget is limited in practical use. To solve this problem, a novel method for designing a field programmable gate array (FPGA)-based non-uniform rational B-spline (NURBS) interpolator and motion controller, which adopts an embedded multiprocessor technique, is proposed in this study. The hardware and software design for the multiprocessor, one processor of which is for NURBS interpolation and the other for position servo control, is presented. Performance analysis and experiments on an X-Y table are carried out, and the hardware cost as well as the time consumed for interpolation and motion control is compared with existing methods. The experimental and comparison results indicate that, compared with existing methods, the proposed method can reduce the hardware cost by 97.5% while using a higher-accuracy interpolation algorithm within a period of 0.5 ms. A method that ensures real-time performance and interpolation accuracy, and significantly reduces hardware cost, is proposed, and it is practical for industrial applications.
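For reference, evaluating one NURBS curve point is the core operation such an interpolator performs every interpolation period. The pure-software sketch below uses the standard Cox-de Boor recursion and a quarter-circle arc as a sanity check; it says nothing about the paper's FPGA architecture, number formats or servo control.

```python
def bspline_basis(i, p, u, knots):
    """Cox-de Boor recursion for the B-spline basis function N_{i,p}(u)."""
    if p == 0:
        return 1.0 if knots[i] <= u < knots[i + 1] else 0.0
    left = right = 0.0
    if knots[i + p] != knots[i]:
        left = (u - knots[i]) / (knots[i + p] - knots[i]) * bspline_basis(i, p - 1, u, knots)
    if knots[i + p + 1] != knots[i + 1]:
        right = (knots[i + p + 1] - u) / (knots[i + p + 1] - knots[i + 1]) * bspline_basis(i + 1, p - 1, u, knots)
    return left + right

def nurbs_point(u, degree, ctrl, weights, knots):
    """Evaluate C(u) = sum_i(N_{i,p}(u) w_i P_i) / sum_i(N_{i,p}(u) w_i) in 2D."""
    num_x = num_y = den = 0.0
    for i, ((px, py), w) in enumerate(zip(ctrl, weights)):
        b = bspline_basis(i, degree, u, knots) * w
        num_x += b * px
        num_y += b * py
        den += b
    return (num_x / den, num_y / den)

# Quadratic NURBS quarter-circle arc as a sanity check: C(0.5) ~ (0.707, 0.707).
ctrl = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
weights = [1.0, 2 ** 0.5 / 2, 1.0]
knots = [0, 0, 0, 1, 1, 1]
print(nurbs_point(0.5, 2, ctrl, weights, knots))
```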
The application-specific multiprocessor system-on-chip (MPSoC) architecture is becoming an attractive solution for dealing with increasingly complex embedded applications, which require both high performance and flexible programmability. As an effective method for MPSoC development, we present a gradual refinement flow starting from a high-level Simulink model and ending in a synthesizable and executable hardware and software specification. The proposed methodology consists of five different abstraction levels: the Simulink combined algorithm and architecture model (CAAM), virtual architecture (VA), transaction accurate architecture (TA), virtual prototype (VP) and field-programmable gate array (FPGA) emulation. Experimental results for Motion-JPEG and H.264 show that the proposed gradual refinement flow can generate various MPSoC architectures from an original Simulink model, allowing processor, communication and task design space exploration.
To decrease the cost of exchanging load information among processors, a dynamic load-balancing (DLB) algorithm that adopts multicast tree technology is proposed. Multicast tree construction rules are also proposed to avoid wrongly transferred or redundant DLB messages due to the overlapping of multicast trees. The proposed DLB algorithm is distributed, sender-initiated, and helps heavily loaded processors distribute their excess load completely with a minimum number of executions. Experiments were carried out to compare the proposed DLB algorithm with three other algorithms, and the results demonstrate the effectiveness and practicality of the proposed algorithm in dealing with large-scale compute-intensive tasks.
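A minimal sender-initiated rebalancing sketch in the spirit of the algorithm described above, under assumed threshold and transfer-cap parameters; the multicast-tree information exchange that is the paper's actual contribution is not modelled.

```python
def balance_excess(loads, threshold=10, transfer_cap=8):
    """Sender-initiated sketch: every processor whose load exceeds `threshold`
    offloads work to the least-loaded processors until it drops to the threshold
    or no receiver can absorb more. Returns the transfers and the final loads."""
    loads = list(loads)
    transfers = []
    for s in range(len(loads)):
        while loads[s] > threshold:
            r = min(range(len(loads)), key=lambda i: loads[i])   # lightest processor
            if loads[r] >= threshold:
                break                                            # nobody can take more
            amount = min(loads[s] - threshold, threshold - loads[r], transfer_cap)
            loads[s] -= amount
            loads[r] += amount
            transfers.append((s, r, amount))
    return transfers, loads

print(balance_excess([25, 3, 7, 2]))   # ([(0, 3, 8), (0, 1, 7)], [10, 10, 7, 10])
```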