MJ Logic Design
Support Processor Sub-System

This block performed chip-level configuration/supervision as well as “slow-
path” packet processing in a next-generation storage processor ASIC.  The
block consisted of a BVCI-based infrastructure necessary to connect an ARC
processor core with various on-chip blocks (e.g., PCIe interface, buffer
manager, DMA) and internal/external memories (DDR SRAM, DDR
SDRAM, SRAM). The more extensive BVCI-based interface and
peripheral blocks designed were as follows:
  • BVCI Initiator & Target Blocks:  The various BVCI initiator and target
    blocks were simplified via the use of parameterized BVCI bus initiator
    and bus slave sub-blocks. These sub-blocks were used throughout
    the sub-system to insulate the “user logic” from the BVCI protocol. For
    instance, the standard bus initiator sub-block allowed multiple
    outstanding burst reads/writes, and then reordered the subsequent
    responses as needed, so that the “user logic” only received in-order
    responses.
  • BVCI Arbiter:  The 8x8 BVCI arbiter allowed 8 initiators to share
    access to 8 targets.  Specific block features included:  parameterized
    datapath width, programmable per-target address windows
    implemented with strict priority to allow window overlap, and per-target
    round-robin arbitration that occurred on burst boundaries and
    provided full bandwidth simultaneously to all 8 targets.  This arbiter
    was instantiated twice within the sub-system.
  • Frame Loader:  This BVCI initiator block accepted variable-length
    frames arriving on 4 different channels, and wrote them into memory
    according to buffer descriptors arranged in a ring. As this block
    consumed a series of buffers to store a frame, it updated descriptor
    fields to mark the start-of-frame (SOF) and end-of-frame (EOF) and,
    in the EOF descriptor, to record how much of the buffer was
    consumed. Software was interrupted on a per-channel
    basis each time a complete frame was available in memory.
  • Frame Unloader:  This block provided basic multi-channel, descriptor-
    based DMA capability between various memories and the cell-based
    portions of the client’s ASIC.  Byte-level packing and knowledge of
    frame header/payload boundaries allowed full frame construction
    across multiple descriptors.  Frames were then parsed into cells and
    passed through per-channel output FIFOs.
  • DMA Engine:  This 4-channel, descriptor-based, scatter/gather,
    general-purpose DMA engine utilized two BVCI initiators, one for
    source reads, and another for destination writes, which enabled full
    BVCI bus speed transfers from one memory to another. A per-channel
    transfer buffer made efficient use of BVCI bus bandwidth by
    accumulating one burst while another was forwarded.
  • PCIe Initiator/Target:  These blocks provided a bridge between BVCI
    and PCIe.  The Initiator block provided multiple programmable
    aperture windows that were used to convert internal BVCI burst
    commands into posted/non-posted PCIe request TLPs, and also
    converted subsequent non-posted completion TLPs back into a
    corresponding BVCI response.  The Target block converted PCIe
    request TLPs into BVCI bursts and converted the subsequent BVCI
    response for non-posted commands into a corresponding PCIe
    completion TLP.
  • Main Memory Controller:  This BVCI target block translated each
    variable-sized BVCI access into one fixed-sized access to the ASIC’s
    central DDR Memory Controller, applying the appropriate byte masks
    for writes and discarding the unneeded read data for reads. To overcome
    the DDR Controller latency, this block maintained enough context and
    response completion memory to allow up to 8 outstanding reads and
    writes. One interesting challenge in this block was handling
    wrap-around addressing for BVCI bursts, since the DDR Controller
    supported only standard linear addressing.
  • DDR SRAM Interface:  This block utilized inter-clock FIFOs in both the
    command and response directions to interface the 8-byte BVCI bus
    with an effective 4-byte DDR SRAM that operated in a different clock
    domain.
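
The BVCI Arbiter's two decisions (strict-priority window decode and per-target round-robin grant) can be illustrated with a small behavioral model. This is a sketch only, not the RTL: the function names, the (base, size, target) window tuple, and the list-of-booleans request vector are all assumptions made for illustration, and burst-boundary timing is not modeled.

```python
def decode_target(addr, windows):
    """Return the target of the first window that contains addr, or None.

    windows is a hypothetical priority-ordered list of (base, size, target)
    tuples. Because the first match wins, a small high-priority window may
    overlap a larger low-priority one.
    """
    for base, size, target in windows:
        if base <= addr < base + size:
            return target
    return None


def round_robin_grant(requests, last_grant):
    """Return the index of the next requesting initiator, or None.

    The search starts just after the initiator granted last time, so the
    previous winner has the lowest priority on the next arbitration.
    """
    n = len(requests)
    for offset in range(1, n + 1):
        idx = (last_grant + offset) % n
        if requests[idx]:
            return idx
    return None
```

In this model each target would hold its own last_grant value, which is what lets all targets be granted independently in the same cycle.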
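
The Frame Loader's descriptor updates can be sketched in the same way. The field names (sof, eof, bytes_used), the fixed buffer size, and the absence of any ownership or ring-full check are assumptions for illustration; the actual descriptor layout is not given here.

```python
from dataclasses import dataclass


@dataclass
class Descriptor:
    """Hypothetical buffer descriptor; the real field layout is not specified."""
    sof: bool = False
    eof: bool = False
    bytes_used: int = 0  # only meaningful in the EOF descriptor


def store_frame(ring, head, frame_len, buf_size):
    """Mark the descriptors consumed by one frame and return the new head.

    ring is a list of Descriptor objects used as a circular ring, head is
    the index of the first free descriptor, and every buffer is assumed to
    hold buf_size bytes.
    """
    if frame_len <= 0 or buf_size <= 0:
        raise ValueError("frame_len and buf_size must be positive")
    remaining = frame_len
    first = True
    while remaining > 0:
        desc = ring[head]
        chunk = min(remaining, buf_size)
        remaining -= chunk
        desc.sof = first
        desc.eof = remaining == 0
        # Non-EOF buffers are completely full, so only EOF records a count.
        desc.bytes_used = chunk if desc.eof else 0
        first = False
        head = (head + 1) % len(ring)  # advance around the ring
    return head
```

For example, a 150-byte frame stored in 64-byte buffers starting at the last entry of a 4-entry ring consumes entries 3, 0, and 1, and the EOF descriptor records 22 bytes.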
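
One way to handle the wrap-around addressing mentioned for the Main Memory Controller is to split a wrapping burst into at most two linear accesses. The sketch below assumes a byte-addressed burst that wraps inside a naturally aligned, power-of-two window; the function name and parameters are hypothetical, and it is not necessarily how the original block solved the problem.

```python
def split_wrap_burst(start, length, wrap_size):
    """Split a wrapping burst into linear (address, length) segments.

    Assumes the burst wraps inside the aligned window of wrap_size bytes
    that contains start, and that it never exceeds that window.
    """
    if wrap_size <= 0 or wrap_size & (wrap_size - 1):
        raise ValueError("wrap_size must be a power of two")
    if not 0 < length <= wrap_size:
        raise ValueError("length must be in 1..wrap_size")
    base = start & ~(wrap_size - 1)          # aligned start of the window
    to_boundary = base + wrap_size - start   # bytes left before the wrap
    if length <= to_boundary:
        return [(start, length)]             # no wrap: already linear
    # First segment runs to the end of the window, second restarts at base.
    return [(start, to_boundary), (base, length - to_boundary)]
```

A 32-byte burst starting at 0x1018 in a 32-byte window becomes an 8-byte access at 0x1018 followed by a 24-byte access at 0x1000.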