Mapping Complex, Large-Scale Spiking Networks on Neural VLSI

Christian Mayr¹, Matthias Ehrlich¹, Stephan Henker¹, Karsten Wendt, and René Schüffny

Abstract—Traditionally, VLSI implementations of spiking neural nets have featured either large neuron counts for fixed computations or small exploratory, configurable nets. This paper presents the system architecture of a large configurable neural net system employing a dedicated mapping algorithm for projecting the targeted neural nets onto the hardware with its attendant constraints.

Keywords—Large scale VLSI neural net, topology mapping, complex pulse communication.

I. INTRODUCTION

VLSI implementations of pulse-coupled neural nets aimed at exploring various computational aspects of biological neural nets have so far mainly explored two avenues. On the one hand, networks have been created with simple dynamics and fixed, repetitive network structures with very little flexibility, but relatively high neural element count [1], [2]. The processing functions of these nets have been determined a priori, emulating cut-outs from biological structures and operations, either to analyze and understand these functions or to use them in technical applications. The limited flexibility of these designs relegates them to large-scale proof-of-concept of a certain functionality, while further exploration of said functionality necessitates an IC redesign. The other avenue of exploration consists of ICs employing complex topologies and neuron/synapse dynamics with attendant large configuration memories. These have been limited to small nets (100-1000 neurons, <10k synapses), even if the hardware has been designed to permit the linking of several chips [3], [4]. These nets are primarily used for exploring network and element behavior, and as a supplement or complement to software-based neuro-simulators. The main objective in this case is the scientific target of confirming or analyzing data or behavioral models delivered by the neuro-theoretic or neurobiology community. Because of the full individual reconfigurability of neuro-elements on the IC (i.e. synapses, neurons) and the low element count (permitting full connectivity among all elements via flexible electronic axons and dendrites), the transfer of net topologies and element configuration is trivial from an organizational/structural point of view. The only challenge for these ICs is to project biology-centric neuro-variables such as membrane potential, conduction delays, leakage terms, adaptation constants, etc. onto their electronic representations on the IC.

The work presented in [5] is a step towards a system that synthesizes both approaches, with large element count and flexible, yet hardware-constrained configuration. However, the element count could still be improved, and more complex synapse and neuron dynamics realized. In this paper, we present a system architecture currently under development that will allow very large (>10⁶ neurons, >10⁹ synapses) reconfigurable networks to be built in the form of interlinked dies on a single wafer. Hardware constraints are identified, and a description is given of the mapping software needed to faithfully reproduce biological network structures with their attendant plasticity in VLSI. The efficacy of the mapping algorithm is documented via a few samples of the topology mapping.

The complete system will be used as a research tool for exploring various computational paradigms as postulated from neurobiological evidence.

II. SYSTEM DESCRIPTION

If we want to keep the flexibility and configurable network dynamics of complex nets needed to explore new computational properties on VLSI chips, but also extend the processing to very large nets, some compromise has to be achieved between reconfigurability and VLSI hardware constraints. Hardware design is generally constrained by the available resources, especially chip area. Even a small hardware system implementing 10³ neurons with 10³ synapses each requires a crossbar with 10³×10⁶ switches and a configuration memory of 10 MBit to allow full flexibility. Already this small example is not feasible to implement, and the proposed hardware implements orders of magnitude more neural elements.
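The figures in this example can be checked with a few lines of arithmetic. The sketch below assumes that every synapse input can be switched to any neuron output and that the selected source is stored as a binary-encoded index; both points, as well as the function name, are assumptions made for illustration only.

```python
import math

def full_crossbar_cost(n_neurons: int, synapses_per_neuron: int):
    """Switch count and configuration bits for a fully flexible crossbar.

    Assumption: each synapse can listen to any neuron output, and the
    chosen source is stored as a binary-encoded index per synapse.
    """
    n_synapses = n_neurons * synapses_per_neuron
    switches = n_neurons * n_synapses                 # one switch per (source, synapse) pair
    bits_per_synapse = math.ceil(math.log2(n_neurons))
    config_bits = n_synapses * bits_per_synapse
    return switches, config_bits

switches, config_bits = full_crossbar_cost(10**3, 10**3)
print(f"{switches:.1e} switches, {config_bits / 1e6:.0f} Mbit configuration memory")
```

Under these assumptions the sketch yields 10⁹ switches and 10 Mbit, in line with the numbers quoted above.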

Hardware constraints in general encompass the following:

1) not enough configuration memory for neuro-elements, so this memory has to be shared, with synapses/neurons having similar parameters grouped around this shared memory.

2) not enough configuration memory and IC space for electronic axons and dendrites, so a full network connectivity cannot be achieved.

3) the dynamics of neurons and synapses have to be achieved using less IC space, which inherently speeds up operation of these elements (e.g. smaller integration capacitors), so any analysis circuits and the supply/biasing backbone of such an IC would have to be that much faster.

4) the increased speed of these elements also leads to increased communication bandwidth (both intra-chip and off-chip for analysis and linking of ICs).

The FACETS hardware platform is proposed for 1M neurons implemented on several interconnected wafers. The wafers are not diced but used as a whole, with the individual reticles on the wafer connected by wafer-scale interconnect. The reticles consist of configurable analogue network core ASICs, which ultimately contain the neurons, synapses and connection structures. Each neuron in this system connects on average to 1k other neurons via plastic synaptic links. The research undertaken at TU Dresden is to design the system's communication backplane, the hardware system simulation and the configuration of the system, together with the necessary benchmarks.

A. Communication Architecture

The Analogue Network Core (ANC), a schematic of which can be seen in Fig. 1, forms the basic element of the FACETS architecture.

The logical architecture of the FACETS system communication is a resource-constrained, three-layered structure.

The smallest configuration units of the system are Synapse-Neuron Groups, directly linked together by so-called Layer 0 connections. Around the analog core elements, communication Layer 1, consisting of multiplexed continuous-time connections, is implemented as a ring bus structure.

Layer 1 connections can be directed over die boundaries to adjacent ANCs and to the interface of communication Layer 2. Layer 2 communication, configuration and the general backplane are provided by a PCB backplane situated above the wafer. This backplane contains dedicated ASICs designed for interfacing with the ANCs, so-called Digital Network Chips (DNCs). Fig. 3 provides an overview of the concept.

Fig. 3 Waferscale system overview (labels: Wafer-Scale Communication (Layer 1); Digital Network Chip (DNC); DNC-facilitated Communication (Layer 2))

Layer 2 connections are used for long-range connections to distant ANCs using a time-discretized, events-within-a-packet based protocol. The implementation of Layer 2 uses high-speed serial LVDS links, also designed by the TU Dresden research group.

The three different layers can be distinguished by their communication paradigms and different constraints. The number of routings via a specific layer will be constrained by a limited number of connections per layer and the load dependency of those connections.

**TABLE I**

<table>
<thead>
<tr>
<th>Layer</th>
<th>Type</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>continuous-time, analog direct connection</td>
<td>Dendritic compartments and their respective synapse groups are linked via configurable analog current connections to form compartmental neurons.</td>
</tr>
<tr>
<td>1</td>
<td>continuous-time, multiplexed</td>
<td>Main backbone of pulse communication, multiplexed asynchronous digital pulse communication. Load independent due to static routed connections, with a slight chance of pulse loss for colliding events. Connections to adjacent ANCs within a certain range of die ‘hops’ can be routed over this layer.</td>
</tr>
<tr>
<td>2</td>
<td>discrete, packet-based</td>
<td>The number of connections is limited by the load capacity of the channels. In this case the activity of the routed connections determines the number of routable channels. Also used for external communication for network analysis.</td>
</tr>
</tbody>
</table>

B. Neural Elements

The synapses will be similar to [5] i.e. they will realize an STDP learning rule which can be modified via digital look-up-tables (LUT) to realize additive, multiplicative and power-law weight updates [6]. The parametrizable form of the weight update also makes it possible to let the synapses behave in a BCM-like fashion. Fast synaptic adaptations also form part of the plasticity available on the synapses [7]. Supervisor input or steered learning can be achieved via externally governed weight changes or forcing the neuron to fire at selected points in time [8].
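To illustrate how a single look-up table can cover additive, multiplicative and power-law updates, the following sketch tabulates a weight-dependent STDP rule with an exponent `mu`. The particular parametrization (a commonly used weight-dependent form with `mu = 0` additive and `mu = 1` multiplicative), the constants, the quantisation and all function names are assumptions for illustration; the source does not specify the rule actually stored in the hardware LUT.

```python
import math

def stdp_update(w, dt, mu, lam=0.01, alpha=1.0, tau=20.0):
    """Weight change for one pre/post spike pair; w is normalised to [0, 1].

    dt = t_post - t_pre in ms. mu = 0 gives an additive rule, mu = 1 a
    multiplicative rule, 0 < mu < 1 a power-law interpolation between them.
    All constants are illustrative assumptions, not hardware values.
    """
    if dt > 0:    # pre before post: potentiation
        dw = lam * (1.0 - w) ** mu * math.exp(-dt / tau)
    else:         # post before (or with) pre: depression
        dw = -lam * alpha * w ** mu * math.exp(dt / tau)
    # Clip the new weight to [0, 1] and return the effective change.
    return min(1.0, max(0.0, w + dw)) - w

def build_lut(mu, dts=range(-40, 41, 10), n_levels=16):
    """Tabulate the update over quantised weight levels and spike-time differences."""
    return {(level, dt): stdp_update(level / (n_levels - 1), dt, mu)
            for level in range(n_levels) for dt in dts}

additive_lut = build_lut(mu=0.0)
multiplicative_lut = build_lut(mu=1.0)
power_law_lut = build_lut(mu=0.4)
```

Because only the table contents change between the three variants, a synapse group sharing one LUT behaves uniformly, which matches the grouping of synapses with similar targeted behavior.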

The Hodgkin-Huxley-derived conductance-based IF-neurons [5] are implemented as sections of a dendrite, which can be connected in series from 4 to 64 dendritic compartments, with the spike traveling along the compartments via an analog bus.

As seen in Fig. 4, a sub-block of the ANC is composed of an array of 256 synapse groups, each consisting of 256 individual synapses. In turn, four synapse groups are connected to a dendrite section, so the minimum number of synapses per neuron is 1024. The maximum number of synapses per neuron is 16384 if 16 dendritic compartments are connected in series. This range of neuronal fan-in is comparable to average cortical neurons [9]. The synapse groups share the LUT and the parameter storage, so only synapses with similar targeted behavior should be mapped on a single group. The pre- and postsynaptic time measurement and the synapse weight, however, are independent, so their dynamic time course is distinct for each synapse. Every synapse is configured for a single presynaptic neuron, whose output pulses are transmitted via the communication architecture as described above.
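The fan-in figures follow directly from the group sizes. The sketch below recomputes them and also shows the resulting trade-off between fan-in and neuron count per sub-block, under the assumption (not stated in the text) that all neurons of a sub-block use the same number of compartments and that every synapse group is used.

```python
SYNAPSES_PER_GROUP = 256     # synapses in one synapse group (from the text)
GROUPS_PER_SECTION = 4       # synapse groups attached to one dendrite section
GROUPS_PER_SUBBLOCK = 256    # synapse groups in one ANC sub-block

def synapses_per_neuron(n_compartments: int) -> int:
    """Fan-in of a neuron built from n_compartments dendrite sections in series."""
    return n_compartments * GROUPS_PER_SECTION * SYNAPSES_PER_GROUP

def neurons_per_subblock(n_compartments: int) -> int:
    """Neurons that fit in one sub-block (assumes uniform compartment count)."""
    sections = GROUPS_PER_SUBBLOCK // GROUPS_PER_SECTION
    return sections // n_compartments

for k in (1, 4, 16):
    print(k, synapses_per_neuron(k), neurons_per_subblock(k))
```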

C. Mapping

The major problem that arises from the hardware-constraint-driven flexibility reduction is the mapping of experimental neural networks on the hardware resources. The task is to place and route given neurons and synapses to a configurable neuronal STDP array and to optimize the synaptic connections. Different connection routings influence the faithfulness of biology-reproduction, as well as the overall number of routable connections. In other words, the routing should be done with a minimum of channels, as close to the target parameters of a neural net (i.e. synapse and neuron adaptation behaviour, delays, etc.) as possible and with a minimum of routing costs.

III. MAPPING SOFTWARE

A. Mapping & Optimization Procedure

The mapping and optimization procedure is done in three steps with increasingly fine granularity. Note that the steps do not correspond to the different layers. The configuration of the system itself can be defined as a mapping task, whereas the optimization is a multi-objective search for the least-cost mapping. The different optimization objectives are as follows:

- To minimize the routing costs.
- To come as close as possible to the given network parameters. For this, maximum deviations from these target parameters have to be established.
- To use as few ANCs as possible, i.e. to concentrate as many neurons as possible together.
- To route connections with higher probability first (for probabilistic network descriptions).
- To route connections with higher load over load-independent layers.

Basically, the algorithm maps the logical neurons of any given net to the physical neurons on the system, connects them according to the network topology and creates a matrix which reflects the connections over the complete system as shown in the simplified example in Fig. 5. Different target functions apply for the granularity steps, as outlined below.

1) Step 1 – Hyper Global

For the uppermost mapping step, centering on the wafer-level system, the target is to concentrate as many synapses as possible on single wafers (i.e. in the squares on the main diagonal of the connection matrix), so that inter-wafer communication is reduced. All synapses realized between neurons on different wafers have the same penalty, reflecting the packet-based, all-to-all communication architecture between the wafers.

As seen in the above example, a reordering in the assignment of neurons to a wafer can result in reduced connection density between the wafers.
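The effect of such a reordering can be sketched on a toy net. The code below counts the synapses that cross wafer boundaries for two neuron orderings, applying the same penalty to every inter-wafer synapse as described for this step. The toy topology, the wafer size and the function names are hypothetical and serve only as illustration.

```python
def inter_wafer_synapses(synapses, wafer_of):
    """Count synapses whose pre- and postsynaptic neurons sit on different wafers."""
    return sum(1 for pre, post in synapses if wafer_of[pre] != wafer_of[post])

def assign_in_order(order, neurons_per_wafer):
    """Fill wafers one after another, following the given neuron order."""
    return {neuron: i // neurons_per_wafer for i, neuron in enumerate(order)}

# Toy net: two clusters {0, 2, 4, 6} and {1, 3, 5, 7}, linked by a single synapse.
synapses = [(0, 2), (2, 4), (4, 6), (6, 0),
            (1, 3), (3, 5), (5, 7), (7, 1),
            (0, 1)]

naive = assign_in_order(range(8), neurons_per_wafer=4)
reordered = assign_in_order([0, 2, 4, 6, 1, 3, 5, 7], neurons_per_wafer=4)

print(inter_wafer_synapses(synapses, naive))      # prints 4
print(inter_wafer_synapses(synapses, reordered))  # prints 1
```

In the connection matrix picture, the reordered assignment moves all but one entry into the two diagonal blocks.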

The number of s-permutations, or variations \( \#VAR \), i.e. the number of possible configurations without recurrences when assigning the \( \#MAX_{NRN} \) neurons of a neural network to the \( \#PHY_{NRN} \) physically available neurons, can be calculated with:

\[
\#VAR = \frac{(\#MAX_{NRN})!}{(\#MAX_{NRN}-\#PHY_{NRN})!}
\]  
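Since the factorials overflow quickly, the count is easier to handle on a logarithmic scale. The sketch below evaluates the formula exactly as printed, using the log-gamma function. The function names and the convention of 10·log10 for the dB value are assumptions; the source does not state which dB convention is used.

```python
import math

def log10_variations(max_nrn: int, phy_nrn: int) -> float:
    """log10 of #VAR = max_nrn! / (max_nrn - phy_nrn)!, computed via log-gamma."""
    if not 0 <= phy_nrn <= max_nrn:
        raise ValueError("formula requires 0 <= phy_nrn <= max_nrn")
    ln_var = math.lgamma(max_nrn + 1) - math.lgamma(max_nrn - phy_nrn + 1)
    return ln_var / math.log(10)

def variations_db(max_nrn: int, phy_nrn: int) -> float:
    """#VAR in dB, assuming the 10*log10 convention."""
    return 10.0 * log10_variations(max_nrn, phy_nrn)

# Small case that can be checked by hand: 4!/(4-2)! = 12 variations.
print(round(10 ** log10_variations(4, 2)))

for net_size in (64, 128, 256):
    print(net_size, round(variations_db(net_size, 64), 1))
```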

Fig. 5 Simple example of mapping a network to a connection matrix representation and optimization by reordering

Fig. 6 Variations in [dB] for a 64 NRN array

The complexity of the problem thus makes it impossible to probe all possible ordering variations of the connection matrix. This can be seen in Fig. 6, which plots the number of variations without recurrences on a 64 NRN array for increasing net size; on the logarithmic (dB) scale the exponential growth appears linearized.

Different heuristic algorithms and sparse matrix reordering algorithms such as Reverse Cuthill McKee [10] were tested. They yield up to 10% improvement in relative connection density for global routing on a network generated with a uniform distribution of synaptic connections at 5 to 15% connection density. On nets with a more regular structure, however, the same effort results in significantly higher optimization (Fig. 8).
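For readers unfamiliar with Reverse Cuthill McKee, the following is a minimal pure-Python sketch of the reordering on an undirected view of the net. It starts each breadth-first search from the lowest-degree unvisited node, which is a common simplification of the start-node choice and not necessarily the variant used by the authors. Library implementations exist as well, for example `scipy.sparse.csgraph.reverse_cuthill_mckee`. The toy graph is hypothetical.

```python
from collections import deque

def reverse_cuthill_mckee(adj):
    """Return a node ordering that tends to reduce the bandwidth of a symmetric graph.

    adj maps every node to the set of its neighbours (undirected view of the net).
    """
    degree = {n: len(nb) for n, nb in adj.items()}
    visited, order = set(), []
    # Restart from the lowest-degree unvisited node to cover disconnected graphs.
    for start in sorted(adj, key=lambda n: (degree[n], n)):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            order.append(node)
            # Visit unvisited neighbours in order of increasing degree.
            for nb in sorted(adj[node] - visited, key=lambda n: (degree[n], n)):
                visited.add(nb)
                queue.append(nb)
    return order[::-1]

def bandwidth(adj, order):
    """Largest distance of any nonzero matrix entry from the main diagonal."""
    pos = {n: i for i, n in enumerate(order)}
    return max((abs(pos[a] - pos[b]) for a in adj for b in adj[a]), default=0)

# Toy example: a chain of six neurons stored under a scrambled labelling.
chain = [3, 0, 5, 1, 4, 2]
adj = {n: set() for n in chain}
for a, b in zip(chain, chain[1:]):
    adj[a].add(b)
    adj[b].add(a)

print(bandwidth(adj, sorted(adj)))                 # prints 5 (natural labelling)
print(bandwidth(adj, reverse_cuthill_mckee(adj)))  # prints 1 (after reordering)
```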

2) Step 2 – Global

The matrix can then be split up into sub-blocks representing the single wafers to proceed to the next mapping step. In this step, the neurons are assigned to individual ANCs on the wafer, which are interconnected via post-processing waferconnects executed between single dies on the wafer. Here, the ordering of the ANCs also becomes an issue, i.e. the communication paths between dies cannot be treated as equal, as was the case for inter-wafer paths. Fig. 7 gives an example of this.

If we take the system of Layer 1 buses connecting the ANCs on the wafer, it is evident that a direct connect between two adjacent ANCs uses fewer bus resources than an extended connect which crosses one ANC. This has to be taken into account for the mapping target in the global step. As an example for the global mapping step, Fig. 8 presents a simplified, layered, feed-forward V1 according to [9] of 2500 neurons with 300k synapses.

Fig. 8 starts out with the initial V1 net as derived from [6]. A Reverse Cuthill McKee [10] reordering (b) is shown to be insufficient for the task due to the large number of outliers, caused by the feedforward nature of the network (lower triangular structure) and the feedback pyramidal cells. A Travelling-Salesman GA is applied to the problem in (c), managing to center a substantial part of the feedforward connections on the center diagonal. Finally, in (d) the connection structure of the wafer is superimposed, with the different gray levels corresponding to the Layer 1 connections between the ANCs as denoted in Fig. 7. Projecting the synapse connections on the different Layer 1 connections gives feedback for the hardware development (i.e. where do bottlenecks develop?) and results in an approximate, high-level configuration of the hardware system for a given net. This approximate configuration is then passed on to the next step to reach a fine-granular, detailed configuration for each single ANC.

The next benchmark uses a less structured net generated from the low-granularity, stochastic V1 model description given in [11]. On the hardware side, this benchmark is based on a generic wafer-scale architecture with 2 wafers, 4 ANCs/wafer and 125 neurons/ANC, realizing a biological network with 1000 neurons and 135651 synapses. The effectiveness of the mapping algorithm can be seen from the following table, which gives percentages of total synapses realized via the different communication layers before and after application of the mapping.

<table>
<thead>
<tr>
<th></th>
<th>Layer 1</th>
<th>Waferscale</th>
<th>Layer 2</th>
</tr>
</thead>
<tbody>
<tr>
<td>Initial Network</td>
<td>12.5</td>
<td>37.5</td>
<td>50.0</td>
</tr>
<tr>
<td>After Mapping</td>
<td>21.2</td>
<td>48.2</td>
<td>30.6</td>
</tr>
</tbody>
</table>
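The "Initial Network" row is consistent with what a uniformly random placement would give for this architecture. The sketch below computes the expected shares, assuming that the three columns correspond to synapses within one ANC, between different ANCs on the same wafer, and between different wafers. This reading of the column headers, the uniformity assumption and the function name are assumptions, not statements from the source.

```python
from fractions import Fraction

def uniform_baseline(n_wafers: int, ancs_per_wafer: int):
    """Expected share of synapses per locality class under uniform random wiring.

    Assumes equal-sized ANCs and independent, uniform placement of the pre- and
    postsynaptic neuron (self-connections are not excluded).
    """
    total_ancs = n_wafers * ancs_per_wafer
    same_anc = Fraction(1, total_ancs)
    same_wafer_other_anc = Fraction(ancs_per_wafer - 1, total_ancs)
    other_wafer = Fraction(n_wafers - 1, n_wafers)
    return same_anc, same_wafer_other_anc, other_wafer

for share in uniform_baseline(n_wafers=2, ancs_per_wafer=4):
    print(f"{float(share) * 100:.1f} %")   # prints 12.5 %, 37.5 %, 50.0 %
```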

In general, the mapping algorithm achieves the most improvement for highly structured nets, such as cortical structures. It can still improve evenly connected networks somewhat, but is of course limited by the underlying entropy of the network structure. The above table gives results for an interim benchmark, which uses a layered V1 structure, but does not take into account any distance information between neurons, such as a decrease in connection probability with increasing distance between neurons. The best mapping results can be obtained for a network with detailed structure such as the one in Fig. 8, which employs macro- as well as micro-scale cortical geometry information.

3) Step 3 – Local

The local optimization concentrates on single ANCs. After finishing the basic partitioning in Step 2, the resulting connection matrix can again be split up into sub-blocks representing single ANCs, with the central part holding the connections inside the ANC itself and the vertical and horizontal parts holding the outgoing and incoming connections of the selected ANC. The available routing resources consist of Layer 1 and Layer 0 connections. Prior to optimisation, a first iteration is done by selecting connections with minimum costs until the available resources are exhausted. The following iterations use the matrix swap algorithm as described above to reduce the local costs and, after convergence, return to global mapping with either a successful mapping or with connections that could not be mapped. Global optimisation then relocates those connections for the next local optimisation run.
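The first iteration of the local step can be read as a greedy selection. The following sketch picks the cheapest (connection, resource) options first until the channel budget of each resource is used up, and reports the connections that would be handed back to the global step. The data layout, cost values, resource names and function name are hypothetical; the actual cost model is not given in the source.

```python
def greedy_first_pass(candidates, capacity):
    """Map connections to routing resources in order of increasing cost.

    candidates: {connection: {resource: cost}} lists the feasible routes.
    capacity:   {resource: number of free channels}.
    Returns (mapped, unmapped), where mapped is {connection: resource}.
    """
    free = dict(capacity)
    options = sorted((cost, conn, res)
                     for conn, routes in candidates.items()
                     for res, cost in routes.items())
    mapped = {}
    for cost, conn, res in options:
        if conn not in mapped and free.get(res, 0) > 0:
            mapped[conn] = res
            free[res] -= 1
    unmapped = [conn for conn in candidates if conn not in mapped]
    return mapped, unmapped

candidates = {
    "a->b": {"layer0": 1, "layer1": 3},
    "a->c": {"layer0": 1, "layer1": 2},
    "b->c": {"layer1": 2},
    "c->a": {"layer1": 4},
}
mapped, unmapped = greedy_first_pass(candidates, {"layer0": 1, "layer1": 2})
print(mapped)
print(unmapped)   # connections returned to the global step
```

A greedy pass like this gives only a starting point; the subsequent swap iterations described above would then try to lower the total cost.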

Successful local mapping also generates the hardware configuration data which can be loaded onto the final hardware to create the hardware representation of the original network.

IV. DATA HANDLING

To give an overview of the amount of data and the complexity of the problem to be solved, a short insight into the implementation task is given here. A rough estimate of the amount of data to be handled points to the problems to be solved in future work.

As an example, consider a simple matrix that contains only the information on existing synaptic connections and uses a single bit to represent a connection: for 100k neurons, the complete matrix occupies ~1.2 GByte. Alternatively, a list representation stores only the nonzero elements (100k neurons × 1k synapses per neuron) together with each element's x-y matrix index (which needs 17 bits per index coordinate, assuming a 100k matrix); this list needs ~0.4 GByte.
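Both estimates can be reproduced with a short calculation. The sketch assumes that each list element stores a row and a column index of 17 bits each and nothing else, and that GByte is read as 2³⁰ bytes; the function names are hypothetical.

```python
import math

def dense_matrix_bytes(n_neurons: int) -> float:
    """Full connection matrix with one bit per possible synapse."""
    return n_neurons * n_neurons / 8

def index_list_bytes(n_neurons: int, synapses_per_neuron: int) -> float:
    """List of nonzero entries, each stored as an (x, y) index pair."""
    bits_per_index = math.ceil(math.log2(n_neurons))   # 17 bits for 100k neurons
    n_entries = n_neurons * synapses_per_neuron
    return n_entries * 2 * bits_per_index / 8

n = 100_000
dense = dense_matrix_bytes(n)
sparse = index_list_bytes(n, 1_000)
print(f"matrix: {dense / 2**30:.2f} GByte")   # prints matrix: 1.16 GByte
print(f"list:   {sparse / 2**30:.2f} GByte")  # prints list:   0.40 GByte
print(f"ratio:  {sparse / dense:.2f}")        # prints ratio:  0.34
```

The ratio of about one third corresponds to the memory reduction by roughly two thirds mentioned in the text.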

Although the latter reduces the memory effort by about 2/3, the data handling becomes more difficult. A comparison of the effort of accessing one element of the matrix versus its list representation, based on an a priori classification of the types of commands and their costs in clock cycles, leads to the result that the cost of a list access is ~20 times the cost of a matrix access.

Due to memory limitations, we have for the moment decided on the list representation, accepting the longer processing time.

V. CONCLUSION/ SUMMARY

We have presented a design effort targeted at implementing configurable large-scale neural networks in VLSI. This hardware will be used for confirming and extending simulation efforts on V1 and other mammalian cortex areas. Because of the faster network execution time for similar network sizes compared to software simulations (where state-of-the-art for 8*10^6 neurons and 5*10^9 synapses is execution in biological real time), developmental plasticity processes can be studied in detail.

A mapping software has been described which closes the gap between the hardware-constrained waferscale-system and biology-derived neural networks, matching up constraints such as (biological) axonal delays and (hardware) pulse routing delays, ensuring faithful reproduction of network behaviour.

During the next project steps we will gain precise information on the technology constraints and limits through research carried out in parallel on the hardware side. A first communication prototype has been designed and is currently under fabrication.

A further step is to parallelize the optimization algorithms using POSIX Threads. Research is to be carried out on a multiprocessor system with two dual-core Opteron processors running Linux, combining a local threads model with a global MPI model.

More effort will be expended on the optimization algorithms, especially the application of multi-objective genetic algorithms for the Hyper Global search, which can be parallelized following the island model in [12].

The mapping will be extended to concentrate on further optimization parameters besides the communication, e.g. delays, pulse loss probability, etc.

REFERENCES