On Mon, 5 Oct 2020 13:20:26 -0700 (PDT), Lasse Langwadt Christensen
<langwadt@fonz.dk> wrote:

>On Monday, 5 October 2020 at 22:06:28 UTC+2, John Larkin wrote:
>> On Fri, 25 Sep 2020 12:16:07 -0700, John Larkin
>> <jlarkin@highland_atwork_technology.com> wrote:
>> 
>> >
>> >
>> >I have a time-critical thing where the signal passes through an XC7A15
>> >FPGA and does a fair lot of stuff inside. I measured delay vs some
>> >voltages:
>> >
>> >1.8 aux   no measurable DC effect
>> > 
>> >3.3 vccio no measurable DC effect
>> >
>> >2.5 vccio ditto (key io's are LVDS in this bank)
>> >
>> >+1 core   -10 ps per millivolt!
>> >
>> >If I vary the trigger frequency, I can see the delay heterodyning
>> >against the 1.8V switcher frequency, a few ps p-p maybe. Gotta track
>> >that down.
>> >
>> >A spritz of freeze spray on the chip had practically no effect on
>> >delay through the chip, on a scope at 100 ps/div.
>> >
>> >I expected sensitivity to core voltage, so we'll make sure we have a
>> >serious, analog-quality voltage regulator next rev.
>> >
>> >The temperature thing surprised me. I was used to CMOS having a
>> >serious positive delay TC. Maybe modern FPGAs have some sort of
>> >temperature compensation designed in?
>> >
>> >We also have a ZYNQ on this board that crashes the ARM core
>> >erratically, especially when the chip is hot. It might crash in maybe
>> >a half hour MTBF if the chip reports 55C internally; the FPGA part
>> >keeps going. At powerup boot from an SD card, it will always configure
>> >the PL FPGA side, but will then fail to run our application if the
>> >chip is hot. We're playing with DRAM and CPU clock rates to see if
>> >that has much effect.
>> >
>> >
>> 
>> Fixed both problems.
>> 
>> Jitter: replaced the 1.8V Vccaux switcher with a linear regulator.
>> 
>
>I believe the mixed-mode clock manager and PLL in the PL are powered from Vccaux

I did a static sensitivity test on the critical-path FPGA. It showed
essentially zero through-chip delay vs Vccaux. It was super sensitive
to core voltage. But the +1 core supply was LDO'ed from the noisy 1.8,
so maybe some noise sneaked through there.
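
As a rough illustration of why a little noise on the core rail matters, the measured -10 ps/mV figure can be turned into a delay-jitter estimate. This is only a sketch: the ripple amplitudes are made-up values, not measurements from this board, and it assumes the static sensitivity still applies at the ripple frequency.

```python
# Measured static sensitivity of through-chip delay to the +1 V core rail.
SENSITIVITY_PS_PER_MV = -10.0

def delay_swing_ps(ripple_mv_pp: float) -> float:
    """Peak-to-peak delay variation for a given p-p core ripple.

    Assumes the DC sensitivity applies unchanged at the ripple frequency
    (an assumption; on-die decoupling may attenuate fast noise).
    """
    return abs(SENSITIVITY_PS_PER_MV) * ripple_mv_pp

# Hypothetical ripple amplitudes, for illustration only.
for ripple in (0.1, 0.5, 1.0):
    print(f"{ripple:.1f} mV p-p -> {delay_swing_ps(ripple):.0f} ps p-p")
```

On these assumptions, a few hundred microvolts of switcher ripple leaking through the core LDO would be enough to explain a few ps p-p of delay modulation.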

I'm going to rip out some switchers and use a chain of LDOs to make
the various supplies for the critical XC7A15 FPGA. The Zynq is not in
the picoseconds-time-critical path.

+5  ldo to 3.3 for i/o banks
3.3 ldo to 2.5 for the bank that does LVDS
2.5 ldo to 1.8 for aux
1.8 ldo to 1.0 for core

in one long string.

We're using ST1L08 regs, super low dropout, good filtering, small and
cheap.
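
One thing worth checking with a series string like this is that each regulator carries its own rail's load plus everything downstream of it. A minimal sketch of that bookkeeping follows. The load currents are hypothetical placeholders, not figures for this board, and LDO ground current is ignored.

```python
# Rail voltages in the proposed series chain, from the +5 V input down.
CHAIN = [("3.3 V I/O", 5.0, 3.3),
         ("2.5 V LVDS bank", 3.3, 2.5),
         ("1.8 V aux", 2.5, 1.8),
         ("1.0 V core", 1.8, 1.0)]

# Hypothetical load currents in amps drawn directly from each rail.
LOADS_A = {"3.3 V I/O": 0.10, "2.5 V LVDS bank": 0.05,
           "1.8 V aux": 0.05, "1.0 V core": 0.20}

def chain_dissipation(chain, loads):
    """Return [(rail, current_through_regulator_A, dissipation_W)].

    Each regulator carries its own rail's load plus everything downstream,
    because the next regulator's input hangs on its output. LDO ground
    (quiescent) current is ignored.
    """
    results = []
    downstream = 0.0
    for name, vin, vout in reversed(chain):
        current = loads[name] + downstream
        results.append((name, current, (vin - vout) * current))
        downstream = current
    return list(reversed(results))

for name, amps, watts in chain_dissipation(CHAIN, LOADS_A):
    print(f"{name:16s} {amps:.2f} A  {watts:.3f} W")
```

With these example loads the first regulator (5 V to 3.3 V) sees the sum of all four rail currents and the largest voltage drop, so it dissipates the most. Real currents should be plugged in before trusting any of the numbers.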

On Monday, 5 October 2020 at 22:06:28 UTC+2, John Larkin wrote:
> On Fri, 25 Sep 2020 12:16:07 -0700, John Larkin
> <jlarkin@highland_atwork_technology.com> wrote:
> 
> >
> >
> >I have a time-critical thing where the signal passes through an XC7A15
> >FPGA and does a fair lot of stuff inside. I measured delay vs some
> >voltages:
> >
> >1.8 aux   no measurable DC effect
> > 
> >3.3 vccio no measurable DC effect
> >
> >2.5 vccio ditto (key io's are LVDS in this bank)
> >
> >+1 core   -10 ps per millivolt!
> >
> >If I vary the trigger frequency, I can see the delay heterodyning
> >against the 1.8V switcher frequency, a few ps p-p maybe. Gotta track
> >that down.
> >
> >A spritz of freeze spray on the chip had practically no effect on
> >delay through the chip, on a scope at 100 ps/div.
> >
> >I expected sensitivity to core voltage, so we'll make sure we have a
> >serious, analog-quality voltage regulator next rev.
> >
> >The temperature thing surprised me. I was used to CMOS having a
> >serious positive delay TC. Maybe modern FPGAs have some sort of
> >temperature compensation designed in?
> >
> >We also have a ZYNQ on this board that crashes the ARM core
> >erratically, especially when the chip is hot. It might crash in maybe
> >a half hour MTBF if the chip reports 55C internally; the FPGA part
> >keeps going. At powerup boot from an SD card, it will always configure
> >the PL FPGA side, but will then fail to run our application if the
> >chip is hot. We're playing with DRAM and CPU clock rates to see if
> >that has much effect.
> >
> >
> 
> Fixed both problems.
> 
> Jitter: replaced the 1.8V Vccaux switcher with a linear regulator.
> 

I believe the mixed-mode clock manager and PLL in the PL are powered from Vccaux

On Fri, 25 Sep 2020 12:16:07 -0700, John Larkin
<jlarkin@highland_atwork_technology.com> wrote:

>
>
>I have a time-critical thing where the signal passes through an XC7A15
>FPGA and does a fair lot of stuff inside. I measured delay vs some
>voltages:
>
>1.8 aux   no measurable DC effect
> 
>3.3 vccio no measurable DC effect
>
>2.5 vccio ditto (key io's are LVDS in this bank)
>
>+1 core   -10 ps per millivolt!
>
>If I vary the trigger frequency, I can see the delay heterodyning
>against the 1.8V switcher frequency, a few ps p-p maybe. Gotta track
>that down.
>
>A spritz of freeze spray on the chip had practically no effect on
>delay through the chip, on a scope at 100 ps/div.
>
>I expected sensitivity to core voltage, so we'll make sure we have a
>serious, analog-quality voltage regulator next rev.
>
>The temperature thing surprised me. I was used to CMOS having a
>serious positive delay TC. Maybe modern FPGAs have some sort of
>temperature compensation designed in?
>
>We also have a ZYNQ on this board that crashes the ARM core
>erratically, especially when the chip is hot. It might crash in maybe
>a half hour MTBF if the chip reports 55C internally; the FPGA part
>keeps going. At powerup boot from an SD card, it will always configure
>the PL FPGA side, but will then fail to run our application if the
>chip is hot. We're playing with DRAM and CPU clock rates to see if
>that has much effect.
>
>

Fixed both problems.

Jitter: replaced the 1.8V Vccaux switcher with a linear regulator.

Temperature-dependent crashing: I found an oscillation on the Zynq 1V
core power supply, about 100 mV p-p and 80 kHz. Putting a lot more
capacitance at the switcher output kills that and makes the crash go
away. The regulator design followed a chart in the LTM8078 data sheet.
A Spice sim with the original values looks stable, no oscillation and
a clean load-step recovery.

There are other indications that ADI's Spice model of the LTM8078 is
less than perfect. I think ADI is struggling to add a lot of new parts
to the LT Spice libraries. Mike E in an interview suggested that
rushing them out was compromising quality. Then he quit.

Glad I fixed this this way. Guys were snooping the AXIbus and Linux at
great expense and no progress.

On Friday, 2 October 2020 at 16:15:52 UTC+2, jla...@highlandsniptechnology.com wrote:
> On Fri, 2 Oct 2020 10:56:43 +0100 (BST), mjb@signal11.invalid (Mike)
> wrote:
> 
> >In article <rl4to3$crq$1@dont-email.me>,
> >Tauno Voipio  <tauno.voipio@notused.fi.invalid> wrote:
> >
> >>After searching for the cause, it proved that the refresh
> >>circuitry was totally broken (a bad chip), so the DRAMs
> >>did not forget in milliseconds, but seconds.
> >
> >The official spec for 4164 DRAM chips says "refresh at 
> >least every 4ms". 
> >
> >In an Oric (6502A based) computer, a ULA is used to
> >provide memory refresh as a side effect of building the
> >TV picture. Suppressing the memory refresh by holding 
> >this ULA in a "reset" state for a second or so seems
> >to have no effect on memory contents, even though this 
> >also stops the system 1MHz clock.
> >
> >Everything comes back working when the reset is released.
> >
> >It takes at least a couple of seconds of refresh/clock 
> >loss for corruption of screen memory contents or the 
> >system to crash (bad data/bad code in RAM, loss of
> >dynamic registers in the 6502A).
> >
> >Didn't expect that, so DRAM *is* more resilient than you'd
> >think.
> 
> We're using a Micron 64G DDR BGA part, which is "self refreshing"
> whatever that means. 

Is it not the same part as on the MicroZed? Try loading the standard
Linux image and see if that also crashes.

On Fri, 2 Oct 2020 10:56:43 +0100 (BST), mjb@signal11.invalid (Mike)
wrote:

>In article <rl4to3$crq$1@dont-email.me>,
>Tauno Voipio  <tauno.voipio@notused.fi.invalid> wrote:
>
>>After searching for the cause, it proved that the refresh
>>circuitry was totally broken (a bad chip), so the DRAMs
>>did not forget in milliseconds, but seconds.
>
>The official spec for 4164 DRAM chips says "refresh at 
>least every 4ms". 
>
>In an Oric (6502A based) computer, a ULA is used to
>provide memory refresh as a side effect of building the
>TV picture. Suppressing the memory refresh by holding 
>this ULA in a "reset" state for a second or so seems
>to have no effect on memory contents, even though this 
>also stops the system 1MHz clock.
>
>Everything comes back working when the reset is released.
>
>It takes at least a couple of seconds of refresh/clock 
>loss for corruption of screen memory contents or the 
>system to crash (bad data/bad code in RAM, loss of
>dynamic registers in the 6502A).
>
>Didn't expect that, so DRAM *is* more resilient than you'd
>think.

We're using a Micron 64G DDR BGA part, which is "self refreshing"
whatever that means. The data sheet is 132 pages. But there are a
jillion parameters that the Vivado software uses to build the DRAM
interface, so maybe we have one of those wrong. My guys like to tune
for performance, and I like to tune for reliable and good enough.

An older version of this product used a 68332 CPU running at 16 MHz.
Now we have dual ARM cores running at 600 MHz, with cache. We don't
need to push anything.

-- 

John Larkin         Highland Technology, Inc

Science teaches us to doubt.

  Claude Bernard

In article <rl4to3$crq$1@dont-email.me>,
Tauno Voipio  <tauno.voipio@notused.fi.invalid> wrote:

>After searching for the cause, it proved that the refresh
>circuitry was totally broken (a bad chip), so the DRAMs
>did not forget in milliseconds, but seconds.

The official spec for 4164 DRAM chips says "refresh at 
least every 4ms". 

In an Oric (6502A based) computer, a ULA is used to
provide memory refresh as a side effect of building the
TV picture. Suppressing the memory refresh by holding 
this ULA in a "reset" state for a second or so seems
to have no effect on memory contents, even though this 
also stops the system 1MHz clock.

Everything comes back working when the reset is released.

It takes at least a couple of seconds of refresh/clock 
loss for corruption of screen memory contents or the 
system to crash (bad data/bad code in RAM, loss of
dynamic registers in the 6502A).

Didn't expect that, so DRAM *is* more resilient than you'd
think.
-- 
--------------------------------------+------------------------------------
Mike Brown: mjb[-at-]signal11.org.uk  |    http://www.signal11.org.uk

On Thursday, 1 October 2020 at 07:50:45 UTC-7, jla...@highlandsniptechnology.com wrote:
..
> We're still seeing our problem on some boxes. It looks like the 
> boot-time stuff, which runs in cpu SRAM, works, but then Linux crashes 
> when the chip is warm. 
> 
> Vcc_core = 1.1 volts fixes it. 0.92 breaks it hard. People are still 
> hunting. 
>...

I had a tricky problem with somewhat similar symptoms (I don't remember whether it was temperature-sensitive) but it also was cured by increasing the core voltage. 

We worked with Xilinx on that and it seems that there can be package resonances in the 30-50MHz range (this was a Virtex 5 in a large package).
Our system was running with a 168MHz clock and 5 time-slots, but one of the time-slots had no significant processing. The result was that we had 30A pulses in the supply current at ~33MHz.
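
The ~33MHz figure follows directly from the clock and slot count. A quick sketch of the arithmetic, assuming each time-slot lasts one clock cycle (an assumption, though it matches the numbers quoted):

```python
CLOCK_HZ = 168e6          # system clock from the post
SLOTS_PER_FRAME = 5       # time-slots, one of which was nearly idle

# One idle slot per frame makes the supply current dip once per frame,
# so the load step repeats at the frame rate.
frame_rate_hz = CLOCK_HZ / SLOTS_PER_FRAME

RESONANCE_BAND_HZ = (30e6, 50e6)  # package resonance range quoted above
in_band = RESONANCE_BAND_HZ[0] <= frame_rate_hz <= RESONANCE_BAND_HZ[1]
print(f"{frame_rate_hz / 1e6:.1f} MHz, in resonance band: {in_band}")
```

That puts the current pulses at 33.6 MHz, inside the 30-50MHz range mentioned for the package resonance.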

I did a board spin to increase external decoupling without any improvement.

The fix we took into production that avoided the problem was to process random data during the fifth time slot to reduce the supply current perturbations.

kw

On Thu, 1 Oct 2020 18:48:19 +0300, Tauno Voipio
<tauno.voipio@notused.fi.invalid> wrote:

>On 1.10.20 18.24, Gerhard Hoffmann wrote:
>> On 01.10.20 at 16:50, jlarkin@highlandsniptechnology.com wrote:
>> 
>>>
>>> The tools for tracking down things like this are few.
>>>
>>> Might be a DRAM problem, but it runs the DRAM test OK.
>> 
>> Back in Z80 days I knew someone who could run DRAM tests
>> all day long without a single error.
>> And that was the only thing he could run on this Z80.
>> 
>> Turned out the Z80 supplies 7 bits for refresh and he had
>> bought 64K RAMs with 8-bit refresh. And LOTs of them.
>> 
>> The DRAM test program did its own refresh by addressing
>> all possible row addresses.
>> 
>> 
>> Cheers, Gerhard
>
>
>This reminds me of a CP/M computer we built using a Z80
>and DRAMs (with proper 7 bit refresh). The computer booted
>fine and ran as long as it was not left idle for longer
>than a few seconds. The idle period killed it totally.
>
>After searching for the cause, it proved that the refresh
>circuitry was totally broken (a bad chip), so the DRAMs
>did not forget in milliseconds, but seconds.

Sometimes a DRAM can remember for many seconds without refresh. 

We will look into possible refresh issues. We hadn't considered that.

Worst case, we could maybe run a little program that did refresh.




-- 

John Larkin         Highland Technology, Inc

Science teaches us to doubt.

  Claude Bernard

On 1.10.20 18.24, Gerhard Hoffmann wrote:
> On 01.10.20 at 16:50, jlarkin@highlandsniptechnology.com wrote:
> 
>>
>> The tools for tracking down things like this are few.
>>
>> Might be a DRAM problem, but it runs the DRAM test OK.
> 
> Back in Z80 days I knew someone who could run DRAM tests
> all day long without a single error.
> And that was the only thing he could run on this Z80.
> 
> Turned out the Z80 supplies 7 bits for refresh and he had
> bought 64K RAMs with 8-bit refresh. And LOTs of them.
> 
> The DRAM test program did its own refresh by addressing
> all possible row addresses.
> 
> 
> Cheers, Gerhard

This reminds me of a CP/M computer we built using a Z80
and DRAMs (with proper 7 bit refresh). The computer booted
fine and ran as long as it was not left idle for longer
than a few seconds. The idle period killed it totally.

After searching for the cause, it proved that the refresh
circuitry was totally broken (a bad chip), so the DRAMs
did not forget in milliseconds, but seconds.

-- 

-TV

On 01.10.20 at 16:50, jlarkin@highlandsniptechnology.com wrote:

> 
> The tools for tracking down things like this are few.
> 
> Might be a DRAM problem, but it runs the DRAM test OK.

Back in Z80 days I knew someone who could run DRAM tests
all day long without a single error.
And that was the only thing he could run on this Z80.

Turned out the Z80 supplies 7 bits for refresh and he had
bought 64K RAMs with 8-bit refresh. And LOTs of them.

The DRAM test program did its own refresh by addressing
all possible row addresses.


Cheers, Gerhard
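
The 7-bit versus 8-bit refresh mismatch described above can be sketched as a simple row-coverage count. The wiring details here are assumptions for illustration only: the missing top row bit is taken to sit at a constant 0, and the low address bits are taken to drive the row lines. Real boards may differ.

```python
ROW_BITS_DRAM = 8      # rows the 64K DRAM expects to be refreshed (2**8)
REFRESH_BITS_Z80 = 7   # width of the Z80 refresh counter output

all_rows = set(range(2 ** ROW_BITS_DRAM))

# Rows reached by the hardware refresh: the missing top row bit is
# assumed to sit at a constant 0 (hypothetical wiring).
hw_refreshed = {r for r in all_rows if r < 2 ** REFRESH_BITS_Z80}

def rows_touched_by_sweep(start, length, row_bits=ROW_BITS_DRAM):
    """Rows accessed by a sequential memory sweep.

    Assumes the low address bits are multiplexed onto the row lines
    (board-dependent; an assumption for this sketch).
    """
    mask = (1 << row_bits) - 1
    return {addr & mask for addr in range(start, start + length)}

never_refreshed = all_rows - hw_refreshed
test_rows = rows_touched_by_sweep(0x0000, 0x10000)

print(len(never_refreshed), "rows never refreshed by hardware")
print("sweep covers every row:", test_rows == all_rows)
```

Under these assumptions half the rows never see a hardware refresh, while a memory test that walks the whole address space keeps hitting every row. That is why the test could pass all day while nothing else ran.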