
What's Your Favorite Processor on an FPGA?

Started by rickman April 20, 2013
On 22 Apr 2013 14:57:24 GMT, Allan Herriman <allanherriman@hotmail.com> wrote:

>On Mon, 22 Apr 2013 07:09:40 -0700, John Larkin wrote:
>
>> On 22 Apr 2013 12:59:27 GMT, Allan Herriman <allanherriman@hotmail.com>
>> wrote:
>>
>>>On Sun, 21 Apr 2013 09:05:49 -0700, John Larkin wrote:
>>>
>>>> The annoying thing is the CPU-to-FPGA interface. It takes a lot of
>>>> FPGA pins and it tends to be async and slow. It would be great to
>>>> have an industry-standard LVDS-type fast serial interface, with
>>>> hooks like shared memory, but transparent and easy to use.
>>>
>>>You've just described PCI Express.
>>
>> No. PCIe is insanely complex and has horrible latency. It takes
>> something like 2 microseconds to do an 8-bit read over gen1 4-lane
>> PCIe. It was designed for throughput, not latency.
>
>I agree about it being designed for throughput, not latency. However,
>with a fairly simple design, we can do 32 bit non-bursting reads or
>writes in about 350ns over a single lane of gen 1 through 1 layer of
>switching. I suspect there's some problem with your implementation
>(unless your 2 microsecond figure was just hyperbole).
Writes are relatively fast, ballpark 350 ns gen1/4-lane. Reads are slow,
around 2 us. That's from an x86 CPU into the PCIe hard core of an Altera
FPGA, cabled PCIe. A read requires two serial packets, so it takes over
twice the time of a write.

A random read or write from an embedded CPU to, say, a DPM in an FPGA
really should take tens of nanoseconds. We do parallel ARM-FPGA
transfers with a klunky async parallel interface in 100 ns or so, but it
takes a lot of pins.

From an x86 (not that we'd ever use an Intel chip in an embedded app) we
haven't found any way to move more than 32 bits in a non-DMA PCIe
read/write, even on a 64-bit CPU that has a few 128-bit MOVE opcodes.
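For concreteness, here is a minimal user-space sketch of the kind of
single-word access being discussed. The bus address is hypothetical; on
Linux, a memory BAR shows up as an mmap-able sysfs "resource0" file. The
asymmetry falls straight out of the protocol: the write is posted and
forgotten, while the read stalls the CPU until a completion packet
returns.

/* Minimal sketch of user-space MMIO over PCIe on Linux.  The device
 * path is hypothetical; substitute the FPGA's actual bus address.
 * Assumes BAR0 is a memory BAR exposed by sysfs as resource0. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    volatile uint32_t *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) { perror("mmap"); return 1; }

    bar[0] = 0xdeadbeef;   /* posted write: one TLP goes out, no reply */
    uint32_t v = bar[0];   /* non-posted read: a request TLP goes out  */
                           /* and the CPU stalls until the completion  */
                           /* TLP returns -- that round trip is the    */
                           /* 350 ns .. 2 us being argued about here   */
    printf("readback: 0x%08x\n", v);

    munmap((void *)bar, 4096);
    close(fd);
    return 0;
}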
>> We've done three PCIe projects so far, and it's the opposite of
>> "transparent and easy to use." The PCIe spec reads like the tax code
>> and Obamacare combined.
>
>I found the spec clear. It's rather large though, and a textbook serves
>as a friendlier introduction to the subject than the spec itself.
>
>One of my co-workers was confused by the way addresses come most
>significant octet first, whilst the data come least significant octet
>first. It makes sense on a little endian machine, once you get over the
>WTF.
Little-endian is evil, another legacy of Intel's clumsiness.
>
>Hot plug is the only thing that gives us headaches. PCIe hot plug is
>needed when reconfiguring the FPGA while the system is running.
>OS support for hot plug is patchy.
We are still trying to get hot plug to work, both Linux and Windows.
HELP!

--
John Larkin                  Highland Technology Inc
www.highlandtechnology.com   jlarkin at highlandtechnology dot com

Precision electronic instrumentation
Picosecond-resolution Digital Delay and Pulse generators
Custom timing and laser controllers
Photonics and fiberoptic TTL data links
VME analog, thermocouple, LVDT, synchro, tachometer
Multichannel arbitrary waveform generators
On Mon, 22 Apr 2013 08:16:04 -0700, John Larkin wrote:

> On 22 Apr 2013 14:57:24 GMT, Allan Herriman <allanherriman@hotmail.com>
> wrote:
>
[ snip quoted latency discussion ]
>
> Writes are relatively fast, ballpark 350 ns gen1/4-lane. Reads are
> slow, around 2 us. That's from an x86 CPU into the PCIe hard core of an
> Altera FPGA, cabled PCIe. A read requires two serial packets, so it
> takes over twice the time of a write.
I thought it was faster than that. If I remember, I'll measure some in the lab tomorrow. BTW, the write requires two packets as well.
>>Hot plug is the only thing that gives us headaches. PCIe hot plug is
>>needed when reconfiguring the FPGA while the system is running.
>>OS support for hot plug is patchy.
>
> We are still trying to get hot plug to work, both Linux and Windows.
> HELP!
I don't know anything about hot plug support on Windows. On Linux,
however, there are two ways to do it:

- True hot plug. You need to use a switch (or root complex) that has
hardware support for the hot plug signals (particularly "Presence
Detect", which indicates a card is plugged in). The switch turns these
into special messages that get sent back to the RC, and the OS should
honour these and do the right thing. This should work on Windows too,
as it's part of the standard.

- Fake hot plug. With the Linux "fakephp" driver you can fake the hot
plug messages if you don't have hardware support for them. This isn't
supported in all kernel versions though. Read more here:
http://scaryreasoner.wordpress.com/2012/01/26/messing-around-with-linux-pci-hotplug/

In both cases there can be address space fragmentation that can stop the
system from working. By that I mean that the OS can't predict what will
be plugged in, so it can't know to reserve a contiguous chunk of address
space for your FPGA. The OS may do something stupid like put your
soundcard right in the middle of the space you wanted. Grrr.

Recent versions of the Linux kernel allow you to specify rules regarding
address allocation to avoid the fragmentation problem, but I've never
used them and I'm not a kernel hacker, so I don't know anything about
that.

Regards,
Allan
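On kernels recent enough to have the sysfs PCI interface, the
remove/rescan dance can also be driven by hand, without fakephp. A
rough sketch (the bus address is a placeholder, and this needs root):

/* Sketch: make Linux forget a PCIe device across an FPGA
 * reconfiguration, using the standard sysfs knobs.  "0000:01:00.0"
 * is a hypothetical bus address. */
#include <stdio.h>

static int poke(const char *path)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fputs("1\n", f);              /* writing "1" triggers the action */
    fclose(f);
    return 0;
}

int main(void)
{
    /* 1. Detach the device: unbinds its driver, releases its BARs. */
    poke("/sys/bus/pci/devices/0000:01:00.0/remove");

    /* 2. ...reconfigure the FPGA here... */

    /* 3. Walk the bus again; the device is rediscovered and its BARs
     *    reassigned -- possibly at different addresses, which is the
     *    fragmentation problem described above. */
    poke("/sys/bus/pci/rescan");
    return 0;
}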
On 22 Apr 2013 16:02:14 GMT, Allan Herriman <allanherriman@hotmail.com> wrote:

>On Mon, 22 Apr 2013 08:16:04 -0700, John Larkin wrote:
>
[ snip quoted latency discussion ]
>>
>> Writes are relatively fast, ballpark 350 ns gen1/4-lane. Reads are
>> slow, around 2 us. That's from an x86 CPU into the PCIe hard core of
>> an Altera FPGA, cabled PCIe. A read requires two serial packets, so it
>> takes over twice the time of a write.
>
>I thought it was faster than that. If I remember, I'll measure some in
>the lab tomorrow.
>
>BTW, the write requires two packets as well.
Does it? Writes are buffered, and there is a credit-based flow-control
mechanism that lets writes blast away, with a "back off, Sam!" reply
packet now and then if the target can't keep up. If the target is fast,
like a RAM or something, that won't happen, and writes are
packet-limited in one direction. Probably.
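As a sanity check on the numbers in this thread: the wire time of a
single 32-bit write TLP on a gen1 x1 link works out to roughly 100 ns,
so most of the 350 ns quoted above is switch and root-complex forwarding
delay, not serialization. A back-of-envelope version, with framing
sizes taken from the PCIe 1.x spec:

/* Back-of-envelope wire time for one 32-bit memory-write TLP on a
 * gen1 x1 link.  Switch and root-complex forwarding delays (the
 * dominant cost) are not modelled. */
#include <stdio.h>

int main(void)
{
    double gbps    = 2.5 * 8.0 / 10.0;  /* 2.5 GT/s less 8b/10b overhead */
    int tlp_bytes  = 1 + 2 + 12 + 4 + 4 + 1;
                     /* STP + sequence + 3DW header + data + LCRC + END */
    double wire_ns = tlp_bytes * 8 / gbps;

    printf("TLP = %d bytes -> %.0f ns on the wire at x1\n",
           tlp_bytes, wire_ns);         /* ~96 ns */
    return 0;
}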
>>>Hot plug is the only thing that gives us headaches. PCIe hot plug is
>>>needed when reconfiguring the FPGA while the system is running.
>>>OS support for hot plug is patchy.
>>
>> We are still trying to get hot plug to work, both Linux and Windows.
>> HELP!
>
>I don't know anything about hot plug support on Windows. On Linux,
>however, there are two ways to do it:
>
>- True hot plug. You need to use a switch (or root complex) that has
>hardware support for the hot plug signals (particularly "Presence
>Detect", which indicates a card is plugged in). The switch turns these
>into special messages that get sent back to the RC, and the OS should
>honour these and do the right thing. This should work on Windows too,
>as it's part of the standard.
Yeah, Microsoft lives to honor standards.
>- Fake hot plug. With the Linux "fakephp" driver you can fake the hot
>plug messages if you don't have hardware support for them. This isn't
>supported in all kernel versions though. Read more here:
>http://scaryreasoner.wordpress.com/2012/01/26/messing-around-with-linux-pci-hotplug/
>
>In both cases there can be address space fragmentation that can stop the
>system from working. By that I mean that the OS can't predict what will
>be plugged in, so it can't know to reserve a contiguous chunk of address
>space for your FPGA. The OS may do something stupid like put your
>soundcard right in the middle of the space you wanted. Grrr.
We're assuming that an application will crash if its memory-mapped
target region (in our case, the remapped VME bus) vanishes. What we
can't do so far under Linux is re-enumerate the PCI space and start
things back up without rebooting. We have implemented all the
optocoupled sideband signals for hot plug, and training packets resume
after we reconnect. We're still working on it.

--
John Larkin                  Highland Technology Inc
www.highlandtechnology.com   jlarkin at highlandtechnology dot com
On Sun, 21 Apr 2013 14:12:05 -0700, John Larkin
<jjlarkin@highNOTlandTHIStechnologyPART.com> wrote:

>On Sun, 21 Apr 2013 16:40:22 -0400, rickman <gnuarm@gmail.com> wrote:
>
>>On 4/21/2013 4:22 PM, John Larkin wrote:
>>> On Sun, 21 Apr 2013 17:34:12 GMT, Ralph Barone
>>> <address_is@invalid.invalid> wrote:
>>>>
>>>> and end up making new and innovative mistakes (just channeling
>>>> Murphy here).
>>>
>>> DEC wrote operating systems (TOPS10, VMS, RSTS) that ran for months
>>> between power failures, time-sharing multiple, sometimes hostile,
>>> users. We are now in the dark ages of computing, overwhelmed by bloat
>>> and slop and complexity. No wonder people are buying tablets. DEC
>>> understood things that Intel and Microsoft never really got, like:
>>> don't execute data.
>>
>>You really should stick to things you understand. Every Intel
>>processor since the 8086 has included protection mechanisms to prevent
>>the execution of data. But they have to be used properly... Blame
>>Microsoft and all the other software vendors, but don't blame Intel.
>
>The Intel memory protection is primitive.
The 8086 had execute privileges at the segment register level, and was
thus comparable to the PDP-11 with its eight segments of up to 8 KiB
each, with different protection attributes per segment.

With the 80386 and some sort of virtual memory support, Intel
unfortunately forgot to include an exe/noexe bit in each page table
entry (as in VAX/VMS), and still relied on the segment register
protection bits.
On 22.4.13 11:12 , upsidedown@downunder.com wrote:
> On Sun, 21 Apr 2013 14:12:05 -0700, John Larkin
> <jjlarkin@highNOTlandTHIStechnologyPART.com> wrote:
>
[ snip quoted DEC vs. Intel protection discussion ]
>>
>> The Intel memory protection is primitive.
>
> The 8086 had execute privileges at the segment register level, and was
> thus comparable to the PDP-11 with its eight segments of up to 8 KiB
> each, with different protection attributes per segment.
>
> With the 80386 and some sort of virtual memory support, Intel
> unfortunately forgot to include an exe/noexe bit in each page table
> entry (as in VAX/VMS), and still relied on the segment register
> protection bits.
The first Intel family member to have segment-based protection was the
80286, neither the 8086 nor the 80186.

There is a certain sense in Intel's policy: segmentation is for
protection and paging for virtual memory under it.

--
Tauno Voipio
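The per-page execute bit did eventually arrive: AMD64 added a no-execute
bit at the top of the 64-bit page table entry, which Intel adopted as
XD. A sketch of the relevant PTE fields (bit positions per the x86-64
architecture; the frame address is made up):

/* Sketch of the x86-64 page table entry bits relevant here.  NX
 * (bit 63) is the per-page exe/noexe bit the 80386 lacked; it needs
 * 64-bit page table entries and EFER.NXE set. */
#include <stdint.h>
#include <stdio.h>

#define PTE_PRESENT  (1ULL << 0)   /* page is mapped                   */
#define PTE_WRITABLE (1ULL << 1)   /* writes allowed                   */
#define PTE_USER     (1ULL << 2)   /* user-mode access allowed         */
#define PTE_NX       (1ULL << 63)  /* no-execute: fetch faults         */

int main(void)
{
    /* A data page the way a DEC-minded OS would want it: readable,
     * writable, never executable. */
    uint64_t pte = 0x00000000cafe0000ULL          /* frame address     */
                 | PTE_PRESENT | PTE_WRITABLE | PTE_USER | PTE_NX;

    printf("executable: %s\n", (pte & PTE_NX) ? "no" : "yes");
    return 0;
}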
On Apr 22, 5:16 pm, John Larkin
<jjlar...@highNOTlandTHIStechnologyPART.com> wrote:
> On 22 Apr 2013 14:57:24 GMT, Allan Herriman <allanherri...@hotmail.com>
> wrote:
>
[ snip quoted PCIe latency discussion ]
>
> >One of my co-workers was confused by the way addresses come most
> >significant octet first, whilst the data come least significant octet
> >first. It makes sense on a little endian machine, once you get over
> >the WTF.
>
> Little-endian is evil, another legacy of Intel's clumsiness.
>
why is it any more or less evil than big endian?

-Lasse
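For what it's worth, the whole dispute is over nothing more than the
order in which a multi-byte value's octets land in memory. A quick
demonstration:

/* How the 32-bit value 0x0A0B0C0D lands in memory.  On a little-endian
 * machine (x86) the least significant octet comes first; big-endian is
 * the reverse. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t word = 0x0A0B0C0D;
    const uint8_t *bytes = (const uint8_t *)&word;

    for (int i = 0; i < 4; i++)
        printf("byte %d: 0x%02X\n", i, bytes[i]);
    /* x86 prints 0D 0C 0B 0A -- which is also why PCIe's
     * most-significant-octet-first addresses look backwards there. */
    return 0;
}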
On Mon, 22 Apr 2013 23:19:50 +0300, Tauno Voipio
<tauno.voipio@notused.fi.invalid> wrote:


>The first Intel family member to have segment-based protection was the
>80286, neither the 8086 nor the 80186.
I have actively tried to forget that I did some satellite image and
planetary probe image analysis using an i286 machine with a 10 MHz
clock :-)
On Mon, 22 Apr 2013 13:37:21 -0700 (PDT), "langwadt@fonz.dk"
<langwadt@fonz.dk> wrote:

>> Little-endian is evil, another legacy of Intel's clumsiness.
>
>why is it any more or less evil than big endian?
>
>-Lasse
!sdrawkcab s'ti esuaceB

--
John Larkin         Highland Technology, Inc
jlarkin at highlandtechnology dot com
http://www.highlandtechnology.com

Precision electronic instrumentation
Picosecond-resolution Digital Delay and Pulse generators
Custom laser drivers and controllers
Photonics and fiberoptic TTL data links
VME thermocouple, LVDT, synchro acquisition and simulation
John Larkin <jjlarkin@highNOTlandTHIStechnologyPART.com> wrote:

>On Sun, 21 Apr 2013 08:23:37 -0500, Vladimir Vassilevsky
><nospam@nowhere.com> wrote:
>
>>On 4/20/2013 5:42 PM, rickman wrote:
>>> I have been working on designs of processors for FPGAs for quite a
>>> while. I have looked at the uBlaze, the picoBlaze, the NIOS, two from
>>> Lattice and any number of open source processors. Many of the open
>>> source designs were stack processors since they tend to be small and
>>> efficient in an FPGA. J1 is one I had pretty much missed until lately.
>>> It is fast and small and looks like it wasn't too hard to design
>>> (although looks may be deceptive), I'm impressed. There is also the
>>> b16 from Bernd Paysan, the uCore, the ZPU and many others.
>>>
>>> Lately I have been looking at a hybrid approach which combines
>>> features of addressing registers in order to access parameters of a
>>> stack CPU. It looks interesting.
>>>
>>> Anyone else here doing processor designs on FPGAs?
>>
>>Soft core is a fun thing to do, but otherwise I see no use.
>>Except for very few special applications, a standalone processor is
>>better than an FPGA soft core in every point, especially the price.
Most entry-level scopes consist of an FPGA running a soft processor.
>The annoying thing is the CPU-to-FPGA interface. It takes a lot of FPGA
>pins and it tends to be async and slow. It would be great to have an
>industry-standard LVDS-type fast serial interface, with hooks like
>shared memory, but transparent and easy to use.
You mean PCI express? :-)

--
Failure does not prove something is impossible, failure simply
indicates you are not using the right tools...
nico@nctdevpuntnl (punt=.)
--------------------------------------------------------------
On Mon, 22 Apr 2013 09:27:04 -0700, John Larkin wrote:

[ snip pcie hot plug discussion ]

> We're assuming that an application will crash if its memory-mapped
> target region (in our case, the remapped VME bus) vanishes. What we
> can't do so far under Linux is re-enumerate the PCI space and start
> things back up without rebooting.
With fakephp, you should just need to rescan that slot. With proper hot
swap hardware support, it should just happen automatically. (As if
anything would go wrong with that!)

When the hot plug removal event happens, the OS is meant to unload the
drivers. The drivers get reloaded after the hot plug insertion event.
Possibly not the same drivers as before, if the FPGA contains something
else.

Your higher level application needs to be aware that the driver can come
and go with the hot plug events. You'll need some sort of mechanism to
inform the application (e.g. a signal). Presumably the application is
the actual cause of the FPGA reconfiguration, in which case it knows
when the FPGA is there or not and doesn't need to be told.
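As a rough sketch of that application-side awareness: unmap the BAR
before the FPGA vanishes, remap it after the rescan. The signal choice
and device path here are arbitrary placeholders; a real design might
hook udev events instead.

/* Sketch of an application tolerating its memory-mapped FPGA region
 * coming and going across hot plug events.  SIGUSR1 and the device
 * path are hypothetical choices. */
#include <fcntl.h>
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static volatile sig_atomic_t fpga_gone;

static void on_unplug(int sig) { (void)sig; fpga_gone = 1; }

static volatile uint32_t *map_bar(void)
{
    int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDWR);
    if (fd < 0) return NULL;
    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                         /* the mapping outlives the fd  */
    return p == MAP_FAILED ? NULL : p;
}

int main(void)
{
    signal(SIGUSR1, on_unplug);        /* "FPGA about to vanish" note  */
    volatile uint32_t *bar = map_bar();

    while (bar) {
        if (fpga_gone) {
            munmap((void *)bar, 4096); /* unmap BEFORE the device goes */
            fpga_gone = 0;
            sleep(1);                  /* wait out reconfig + rescan   */
            bar = map_bar();           /* remap, maybe at a new BAR    */
            continue;
        }
        /* ...normal register traffic... */
        sleep(1);
    }
    return 0;
}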
> We have implemented all the optocoupled sideband signals for hot plug,
> and training packets resume after we reconnect. We're still working on
> it.
I found that just the presence detect was needed for reliable hot plug.
All the others are optional.

Regards,
Allan