Optimisation des nids de boucles dans le modèle SDF

Télécharger le fichier original (Mémoire de fin d’études)

Contributions of this Thesis

Rapid Prototyping Framework

The recent evolution of digital communication systems (voice, data and video)
has been become more and more complex. Over the last two decades, low datarate
systems (such as dial-up modems, rst and second generation cellular systems,
802.11 Wireless local area networks) have been replaced or augmented by systems
capable of data rates of several Mbps, supporting multimedia applications (such as
DSL, cable modems, 802.11b/a/g/n wireless local area networks, 3G, WiMax and
ultra-wideband personal area networks).
As communication systems have evolved, the resulting increase in data rates
has necessitated a higher system algorithmic complexity. A more complex system
requires greater exibility in order to deal with dierent protocols in dierent environments.
Additionally, there is an increased need for the system to support multiple
interfaces and multi-component devices. Consequently, this requires the optimization
of device parameters over varying constraints such as performance, area and
power. Achieving this device optimization requires a good understanding of the ap30
plication complexity and the choice of an appropriate architecture to support this
application.
An embedded system commonly contains several processor cores in addition to
hardware co-processors. The embedded system designer needs to distribute a set of
signal processing functions onto a given hardware with predened features. The functions
are then executed as software code on target architectures ; this action will be
called a deployment in this paper. A common approach to implement a parallel algorithm
is the creation of a program containing several synchronized threads in which
execution is driven by the scheduler of an operating system. Such implementation
does not meet the hard timing constraints required by real-time applications and the
memory consumption constraints required by embedded systems [Lee06]. One-time
manual scheduling developed for single-processor applications is also not suitable for
multiprocessor architectures : manual data transfers and synchronizations quickly
become very complex, leading to a waste of time and potential deadlocks. Furthermore,
the task of nding an optimal deployment of an algorithm mapped onto a
multi-component architecture is not straightforward. When performed manually,
the result is generally a sub-optimal solution. These issues raise the need for new
methodologies, which allow the exploration of several solutions, to achieve a more
optimal result.
Several features must be provided by a fast prototyping process : description of
the system (hardware and software), automatic mapping/scheduling, simulation of
the execution and automatic code generation. Based on [PAN08][PBRP09][PMAN09]
a more complete rapid prototyping framework was created. This complete framework
is composed of three complementary tools based on Eclipse [ecl] that provide
a full environment for the rapid prototyping of real-time embedded systems : Parallel
and Real-time Embedded Executives Scheduling Method (PREESM), Graphiti
and Synchronous Data Flow for Java (SDF4J). This framework implements the
methodology Algorithm-Architecture Matching (AAM) [GS03]. The focus of this
rapid prototyping activity is currently static code mapping/scheduling but dynamic
extensions are planned for future versions of the tool. Such tools rely on a design
ow which takes an algorithm representation and an architecture representation to
then go through algorithm transformations, architecture analysis, allocation and
scheduling, to nally generate code adapted to the target as depicted in 5.1.
From the graph descriptions of an algorithm and of an architecture, PREESM
can nd the right deployment, can provide simulation information and can generate
a framework code for the processor cores [PAN08]. These rapid prototyping tasks
can be combined and parameterized in a workow. In PREESM, a workow is
dened as an oriented graph representing the list of rapid prototyping tasks to
Contributions of this Thesis 31
Algorithm specification
actor A (bool COMPUTE_Y) uint(size=24) PIX ==>
uint(size=8) R, uint(size=8) G, uint(size=8) B,
uint(size=8) Y:
int RSHIFT = 16; int RMASK = 255;
int GSHIFT = 8; int GMASK = 255;
int BSHIFT = 0; int BMASK = 255;
int COUNT = 8;
action: PIX:[pix] repeat COUNT ==>
R:[r] repeat COUNT,
G:[g] repeat COUNT,
B:[b] repeat COUNT,
Y:[y] repeat COUNT
var
int i := 0
do
// imperative version to compute R, G, B
while i < COUNT do
r[i] := bitand(rshift(pix[i], RSHIFT), RMASK);
g[i] := bitand(rshift(pix[i], GSHIFT), GMASK);
b[i] := bitand(rshift(pix[i], BSHIFT), BMASK);
i := i + 1;
done
Architecture specification
<?xml version= »1.0″ encoding= »UTF-8″?>
<spirit:design
xmlns:spirit= »http://www.spiritconsortium.org/XMLSchema/
SPIRIT/1.4″>
<spirit:name>4C64</spirit:name>
<spirit:componentInstances>
<spirit:componentInstance>
<spirit:instanceName>C64_1</spirit:instanceName>
<spirit:componentRef spirit:library= » »
spirit:name= »C64x »
spirit:vendor= » » spirit:version= » »/>
<spirit:configurableElementValues>
<spirit:configurableElementValue
spirit:referenceId= »componentType »>operator</spirit:config
urableElementValue>
<spirit:configurableElementValue
spirit:referenceId= »refinement »>VPU0</spirit:configurableEle
mentValue>
</spirit:configurableElementValues>
</spirit:componentInstance>
<spirit:componentInstance>
<spirit:instanceName>C64_2</spirit:instanceName>
<spirit:componentRef spirit:library= » »
spirit:name= »C64x »
spirit:vendor= » » spirit:version= » »/>
ACC
PE PE
Algorithm analysis
execute on the input algorithm and architecture graphs in order to determine and
simulate a given deployment. A rapid prototyping process in PREESM consists of a
succession of transformations. These transformations are associated in a data-ow
graph representing a workow that can be edited in a Graphiti generic graph editor.
The PREESM input graphs may also be edited using Graphiti. The PREESM
algorithm models are handled by the SDF4J library. The framework can be extended
by modifying the workows or by connecting new plug-ins (for compilation, graph
analyses, and so on).
There exist numerous solutions to partition algorithms onto multi-core architectures.
If the target architecture is homogeneous (all cores of the architectures share
the same properties), several solutions exist which generate multi-core code from
C with additional information (OpenMP [opeb], CILK [BJK+95a]). In the case of
heterogeneous architectures, languages such as OpenCL [opea] and the Multicore
Association Application Programming Interface (MCAPI [mca]) dene ways to express
parallel properties of a code. However, they are not currently linked to ecient
compilers and runtime environments. Moreover, compilers for such languages would
have diculty in extracting and solving the bottlenecks of the implementation that
appear inherently in graph descriptions of the architecture and the algorithm.
The Poly-Mapper tool from PolyCore Software [pol] oers similar functionalities
to PREESM but, in contrast to PREESM, its mapping/scheduling is manual.
Ptolemy II [Lee01] is a simulation tool that supports many models of computation.
However, it also has no automatic mapping and currently its code generation for
embedded systems focuses on single-core targets. Another family of frameworks
existing for data ow based programming is based on CAL [EJ03b] language like
OpenDF [BBE+08]. OpenDF employs a more dynamic model than PREESM but
its related code generation does not currently support ecient automatic mapping
on multi-core embedded systems.
Closer to PREESM are the Model Integrated Computing (MIC [KSLB03]), the
Open Tool Integration Environment (OTIE [Bel06]), the Synchronous Distributed
Executives (SynDEx [GLS99]), the Dataow Interchange Format (DIF [HKK+04]),
and SDF for Free (SDF3 [SGB06]). Both MIC and OTIE can not be accessed online.
According to the literature, MIC focuses on the transformation between algorithm
domain-specic models and metamodels, while OTIE denes a single system description
that can be used during the whole signal processing design cycle.
DIF is designed as an extensible repository of representation, analysis, transformation
and scheduling of data-ow language. DIF is a Java library which allows the
user to go from graph specication using the DIF language to C code generation.
However, the hierarchical Synchronous Data Flow (SDF) model used in the SDF4J
library and PREESM is not available in DIF.
SDF3 is an open source tool implementing some data-ow models and providing
analysis, transformation, visualization, and manual scheduling as a C++ library.
SDF3 implements the Scenario Aware Data Flow (SADF [TGS+08]), and provides
Multiprocessor System-on-Chip (MP-SoC) binding/scheduling algorithm to output
MP-SoC conguration les.

Table des matières

I French Summary
1 Introduction
1.1 Contributions du travail de thèse
1.1.1 Prototypage rapide
1.1.2 Recongurable Video Coding
1.1.3 Bit-Stream Description Language
1.1.4 Organisation de la thèse
2 Etat de l’art
2.1 Prototypage rapide
2.1.1 Introduction
2.1.2 Systèmes de calculs distribués
2.1.3 Modèle de calcul ux de données
2.1.4 Ordonnancement multiprocesseurs
2.2 Transformation des nids de boucles
2.2.1 Boucles imbriquées et ordre séquentiel
2.2.2 Dépendance de données dans les nids de boucles
2.2.3 Transformation des nids de boucles
2.2.4 Partitionnement des répétitions
3 Contributions
3.1 Représentation hiérarchique dans le modèle SDF
3.2 Génération de code pour le prototypage rapide
3.3 Optimisation des nids de boucles dans le modèle SDF
4 Conclusion et propositions de travail
II Background
5 Introduction
5.1 Overview
5.2 Contributions of this Thesis
5.2.1 Rapid Prototyping Framework
5.2.2 Recongurable Video Coding
5.2.3 Bit-Stream Description Language
5.3 Outline of this Thesis
6 Background and related work
6.1 Introduction
6.2 Parallel computing systems
6.3 Data Flow model of computation
6.3.1 Introduction
6.3.2 Data Flow paradigm introduction
6.3.3 Synchronous Data Flow (SDF)
6.3.4 Homogeneous Synchronous Data Flow (HSDF)
6.3.5 Boolean-controlled Data Flow (BDF)
6.3.6 Cyclo-Static Synchronous Data Flow (CSDF)
6.3.7 Parameterized Synchronous Data Flow (PSDF)
6.3.8 Conclusion
6.4 Application modeling tools
6.4.1 Introduction
6.4.2 Dataow Interchange Format
6.4.3 SDF3
6.4.4 Ptolemy II
6.4.5 StreamIt
6.4.6 PREESM
6.4.7 SynDEx data-ow model
6.4.8 Canals
6.4.9 CAL Actor Language
6.4.10 MCSE specication model
6.4.11 Conclusion
6.5 Multi-processor scheduling
6.6 Conclusion
7 Nested loops Partitioning
7.1 Introduction
7.2 Nested Loops representation
7.3 Nested Loops execution optimization
7.4 Nested Loops partitioning by iteration domain projection
7.5 Nested Loops partitioning by iteration domain tiling
7.6 Conclusion
III Research Work
8 Hierarchy Representation in Synchronous Data Flow Graphs
8.1 Introduction
8.2 Existing hierarchy representation in Synchronous Data Flow Graphs
8.2.1 Repetition based hierarchy
8.2.2 Parameter based hierarchy
8.3 Interface based hierarchy
8.3.1 Special nodes
8.3.2 Hierarchy deadlock-freeness
8.3.3 Hierarchy scheduling
8.3.4 Hierarchy behavior
8.3.5 Hierarchy improvements
8.4 Application case study
8.4.1 IDCT2D description
8.4.2 Structural analysis
8.5 Conclusion
9 Multi-core code generation of Interface based Synchronous Data-ow
9.1 Introduction
9.2 Multi-threaded execution model
9.3 C code generation procedure
9.4 Graph optimization for code generation
9.5 Buer optimization and allocation
9.6 Hierarchical actor port management
9.7 Actor prototypes instantiation
9.8 In loop actors instantiation
9.9 Special vertices instantiation
9.10 Inter Processor Communication instantiation
9.11 xml to C transformation
9.12 Code generation example
9.13 Conclusion
10 Loop partitioning techniques for Interface based Synchronous Data ow
10.1 Introduction
10.2 Iteration domain projection technique
10.2.1 Distance vector extraction from interface-based SDF
10.2.2 SDF network synthesis using analysis results
10.2.3 The matrix vector product example
10.3 Iteration domain tiling
10.3.1 Limitations
10.3.2 SDF network synthesis from partitioned iteration domain
10.3.3 The matrix vector product example
10.4 Conclusion
11 Conclusion, Current Status and Future Work
11.1 Conclusion
11.2 Current Status
11.3 Future Work
Glossary