
highest frequency periodic interrupt?

Started by John Larkin January 13, 2023
On 1/15/2023 7:10 AM, Dimiter_Popoff wrote:
> How many registers does it stack automatically? I knew the HLL nonsense
> would catch up with CPU design eventually.
IIRC, it stacks PC, PSR, LR (i.e., R14) & R[0-3]. PC and PSR must be preserved, of course (if you have a special shadow register for each, that's just an optimization -- one that only works if ISRs can't be interrupted. Remember, you can ALWAYS throw an exception, even in an ISR! E.g., the "push" can signal a page fault). The link register (think "BAL" -- what's old is now new! :> ) determines *how* the ISR terminates.

So, the "overhead" that the processor assumes for an ISR is really just 4 (of the 12) general purpose registers. When you consider how much "state" the processor holds, this isn't really excessive. A routine coded in a HLL would likely be using ALL of the registers (though use of an "interrupt" keyword could give it a hint to only use those that it knows are already preserved; this would vary based on the targeted processor). And, the compiler could dynamically decide whether adding code to protect some additional register(s) will offset the performance gains possible by using those extra registers *in* the ISR code... that *it* is generating!

The bigger concern (for me) is worrying about which buses I'm calling on during the execution of the ISR and what other cores might be doing ON those buses at the same time (e.g., if I'm accessing a particular I/O to query/set a GPIO, is another core accessing that same I/O -- for a different GPIO?). You never had to worry about this in single-core architectures (excepting the presence of another "bus master").
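As a concrete illustration: because the hardware-stacked frame on Cortex-M covers the registers a C function is allowed to clobber (R0-R3, R12, LR, PC, xPSR), a plain C function can sit directly in the vector table. A minimal sketch -- the device registers and handler name here are invented placeholders, not a real part:

#include <stdint.h>

#define TIMER_STATUS  (*(volatile uint32_t *)0x40001000u)   /* hypothetical */
#define TIMER_CLEAR   (*(volatile uint32_t *)0x40001004u)   /* hypothetical */

volatile uint32_t tick_count;

void TIMER_IRQHandler(void)     /* no prologue/epilogue asm, no "interrupt"
                                   keyword: the NVIC already saved what a C
                                   function may clobber on exception entry */
{
    TIMER_CLEAR = 1u;           /* acknowledge the interrupt source */
    tick_count++;               /* do the minimum and return */
}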
> Good CPU design still means
> load/store machines, stacking *nothing* at IRQ, just saving PC and CCR
> to special purpose regs which can be stacked as needed by the IRQ
What do you do if you throw an exception BEFORE (or while!) doing that stacking? Does the CPU panic? :> (e.g., a double fault on a 68k!) [Remember, the exception handling/trap interface uses the same mechanisms as that of an IRQ -- they are just instantiated by different sources!]
> routine, along with registers to be used in it. Memory accesses are
> the bottleneck, and with HLL code being bloated as it is chances
> are some cache will have to be flushed to make room for stacking.
Of course! But, invoking the ISR will likely also displace some contents of the cache -- unless your entire ISR fits in a single cache line *and* is already in the cache. (that includes the DATA that your ISR may need, as well) Remember, the whole need for cache is because processors are SO much faster than memory!
> Some *really* well designed for control applications processors allow
> you to lock a part of the cache but I doubt ARM have that, they seem to
> have gone the way "make programming a two click job" to target a
> wider audience.
The "application processors" most definitely let you exert control over the cache -- as well as processor affinity. But, you *really* need to be wary about doing this as it sorely impacts the utility of those mechanisms on the rest of your code! I.e., if you wire-down part of the cache to expedite an ISR, then you have forever taken that resource away from the rest of your code to use. Are you smart enough to know how to make that decision, "in general" (specific cases are a different story)? The Z80 (et al.) had an "alternate register set". So, one could EX AF,AF' EXX at the top of an ISR -- and again, just before exit -- to preserve (and restore) the current contents of the (main!) register set. But, this means only one ISR can be active at a time (no nesting). Or, requires only a specific ISR to be active (and never interrupting itself) as the alternate register set is indistinguishable from the "regular" register set. Q: Are you willing to live without the use of the alternate registers *in* your code, just for the sake of *an* ISR? [I've had a really hard time NOT assigning specific cores to specific portions of the design -- e.g., letting one core just handle the RMI mechanism. I'm not sure that I can predict how effective such an assignment would be vs. letting the processor *dynamically* adjust to the load, AT THAT TIME.] Other processors have register *banks* that you can switch to/from to expedite context switches. Same sort of restrictions apply. The 99k allowed you to switch "workspaces" efficiently. But, as workspaces resided in RAM (screwed up THAT one, eh, TI?)... Processors with tiny states (680x with A/B and index) don't really have much to preserve. OTOH, they are forever loading and storing just to get anything done -- no place to "hold onto" results INSIDE the CPU. So, has their lack of internal state made them BETTER workhorses? Or, just lessened the work required in an ISR (because they aren't very *capable*, otherwise)?
On 15/01/2023 10:11, Don Y wrote:
> On 1/15/2023 2:48 AM, Martin Brown wrote:
>> I prefer to use RDTSC for my Intel timings anyway.
>>
>> On many of the modern CPUs there is a freerunning 64 bit counter
>> clocked at once per cycle. Intel deprecates using it for such purposes
>> but I have never found it a problem provided that you bracket it
>> before and after with CPUID to force all the pipelines into an empty
>> state.
>>
>> The equivalent DWT_CYCCNT on the Arm CPUs that support it is described
>> here:
>>
>> https://stackoverflow.com/questions/32610019/arm-m4-instructions-per-cycle-ipc-counters
>>
>> I prefer hard numbers to a vague scope trace.
>
> Two downsides:
> - you have to instrument your code (but, if you're concerned with
>   performance, you've already done this as a matter of course)
You have to make a test framework to exercise the code in as realistic a manner as you can - that isn't quite the same as instrumenting the code (although it can be).

I have never found profile directed compilers to be the least bit useful on my fast maths codes because their automatic code instrumentation breaks the very code that it is supposed to be testing (in the sense of wrecking cache lines and locality etc.).

The only profiling method I have found to work reasonably well is probably by chance the highest frequency periodic ISR I have ever used in anger which was to profile code by accumulating a snapshot of PC addresses allowing just a few machine instructions to execute at a time. It used to work well back in the old days when 640k was the limit and code would reliably load into exactly the same locations every run.

It is a great way to find the hotspots where most time is spent.
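A rough sketch of that PC-sampling profiler in C -- the timer setup and the plumbing that delivers the interrupted PC are omitted, and the names and bucket size are invented for illustration:

#include <stdint.h>

#define NBUCKETS  4096u          /* histogram resolution (assumed) */
static uint32_t hits[NBUCKETS];

extern uintptr_t code_base;      /* load address of the code under test */

/* Called from a hypothetical high-rate periodic tick with the PC taken
   from the interrupted context's stacked frame.  The result is a
   statistical profile: the hot buckets are where the time goes. */
void profile_tick(uintptr_t interrupted_pc)
{
    uintptr_t off = interrupted_pc - code_base;
    hits[(off >> 4) % NBUCKETS]++;    /* 16-byte granularity */
}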
> - it doesn't tell you about anything that happens *before* the code runs
>   (e.g., latency between event and recognition thereof)
True enough. Sometimes you need a logic analyser for weird behaviour - we once caught a CPU chip where RTI didn't always do what it said on the tin and the instruction following the RTI instruction got executed with a frequency of about 1:10^8. They replaced all the faulty CPUs FOC but we had to sign a non-disclosure agreement.
>> If I'm really serious about finding out why something is unusually
>> slow I run a dangerous system level driver that allows me full access
>> to the model specific registers to monitor cache misses and pipeline
>> stalls.
>
> But, those results can change from instance to instance (as can latency,
> execution time, etc.).  So, you need to look at the *distribution* of
> values and then think about whether that truly represents "typical"
> and/or *worst* case.
It just means that you have to collect an array of data and take a look at it later and offline. Much like you would when testing that a library function does exactly what it is supposed to.
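A rough sketch of that collect-now, analyse-offline approach on x86, using GCC/Clang's __rdtsc() with a CPUID bracket as described above (the sample count and the measured function are placeholders):

#include <stdint.h>
#include <x86intrin.h>                 /* __rdtsc() on GCC/Clang */

static inline void serialize(void)
{
    /* CPUID is a serializing instruction: it drains the pipeline so the
       RDTSC reads aren't reordered around the code being measured. */
    uint32_t a = 0, b, c = 0, d;
    __asm__ __volatile__("cpuid" : "+a"(a), "=b"(b), "+c"(c), "=d"(d));
}

#define NSAMPLES 1000
uint64_t samples[NSAMPLES];            /* look at the distribution later */

void time_it(void (*fn)(void))
{
    for (int i = 0; i < NSAMPLES; i++) {
        serialize();
        uint64_t t0 = __rdtsc();
        fn();
        serialize();
        samples[i] = __rdtsc() - t0;
    }
}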
> Relying on exact timings is sort of naive; it ignores how much
> things can vary with the running system (is the software in a
> critical region when the ISR is invoked?) and the running
> *hardware* (multilevel caches, etc.)
It is quite unusual to see bad behaviour from the multilevel caches but it can add to the variance. You always get a few outliers here and there in user code if a higher level disk or network interrupt steals cycles.
> Do you have a way of KNOWING when your expectations (which you
> have now decided are REQUIREMENTS!) are NOT being met?  And, if so,
> what do you do (at runtime) with that information?  ("I'm sorry,
> one of my basic assumptions is proving to be false and I am not
> equipped to deal with that...")
Instrumenting for timing tests is very much development rather than production code, i.e., is it fast enough or do we have to work harder?

Like you I prefer HLL code but I will use ASM if I have to or there is no other way (like wanting 80 bit reals in the MS compiler). Actually I am working on a class library to allow somewhat clunky access to it.

They annoyingly zapped access to 80 bit reals in v6, I think it was, for "compatibility" reasons since SSE2 and later can only do 64bit reals.
> Esp given that your implementation will likely evolve and
> folks doing that work may not be as focused as you were on
> this specific issue...
That will be their problem not mine ;-)

--
Regards, Martin Brown
On 15/01/2023 14:10, Dimiter_Popoff wrote:
> On 1/15/2023 12:48, Lasse Langwadt Christensen wrote:
>> On Sunday, 15 January 2023 at 06:10:24 UTC+1, upsid...@downunder.com wrote:
>>> If the processor has separate FP registers and/or separate FP status
>>> words, avoid using FP registers in ISRs.
Generally good advice unless the purpose of the interrupt is to time share the available CPU and FPU between various competing numerical tasks. Cooperative multitasking has lower overheads if you can do it. For my money ISRs should do as little as possible at such a high privilege level although checking if their interrupt flag is already set again before returning is worthwhile for maximum burst transfer speed.
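A sketch of that "check the flag again before returning" idea -- the device registers, bit masks and FIFO helper are invented placeholders, and the burst limit guards against a stuck source trapping you in the handler:

#include <stdint.h>

#define UART_STATUS  (*(volatile uint32_t *)0x40002000u)   /* hypothetical */
#define UART_DATA    (*(volatile uint32_t *)0x40002004u)   /* hypothetical */
#define RX_READY     (1u << 0)
#define BURST_MAX    16u

extern void fifo_put(uint8_t byte);    /* drained by a background task */

void uart_rx_isr(void)
{
    uint32_t n = 0;

    /* Keep draining while the request is still asserted: during a burst
       this saves the exit/re-entry overhead of one interrupt per byte. */
    do {
        fifo_put((uint8_t)UART_DATA);           /* reading DATA clears the flag */
    } while ((UART_STATUS & RX_READY) && ++n < BURST_MAX);
}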
>>> Some compilers may have "interrupt" keywords or similar extensions and
>>> the compiler knows which registers need to be saved in the ISR. To
>>> help the compiler, include all functions that are called by the ISR in
>>> the same module (preferably in-lined) prior to the ISR, so that the
>>> compiler knows what needs to be saved. Do not call external library
>>> routines from ISR, since the compiler doesn't know which registers
>>> need to be saved and saves all.
>>
>> cortex-m automatically stack the registers needed to call a regular C
>> function and if it has an FPU it supports "lazy stacking" which means
>> it keeps track of whether the FPU is used and only stack/un-stack them
>> when they are used
>>
>> it also knows that if another interrupt is pending at ISR exit it
>> doesn't need to un-stack/stack before calling the other interrupt
>
> How many registers does it stack automatically? I knew the HLL nonsense
> would catch up with CPU design eventually. Good CPU design still means
> load/store machines, stacking *nothing* at IRQ, just saving PC and CCR
> to special purpose regs which can be stacked as needed by the IRQ
> routine, along with registers to be used in it. Memory accesses are
> the bottleneck, and with HLL code being bloated as it is chances
> are some cache will have to be flushed to make room for stacking.
> Some *really* well designed for control applications processors allow
> you to lock a part of the cache but I doubt ARM have that, they seem to
> have gone the way "make programming a two click job" to target a
> wider audience.
Actually there were processors which took the exact opposite position quite early on and they were incredibly good for realtime performance, but their registers were no different to ram - they were *in* ram, as was the program counter return address. There was a master register workspace pointer and 16 registers; the TI TMS9900 series, for instance.

https://en.wikipedia.org/wiki/Texas_Instruments_TMS9900#Architecture

I didn't properly appreciate at the time quite how good this trick was for realtime work until we tried to implement the same algorithms on the much later and on paper faster 68000 series of CPUs.

--
Regards, Martin Brown
On 1/16/2023 3:23 AM, Martin Brown wrote:
> For my money ISRs should do as little as possible at such a high privilege
> level although checking if their interrupt flag is already set again before
> returning is worthwhile for maximum burst transfer speed.
+42 on both counts.

OTOH, you have to be wary of misbehaving hardware (or, unforeseen circumstances) causing the ISR to loop continuously. Many processors will give one (a couple?) of instructions in the background a chance to execute (after RTI), even if there is a "new" IRQ pending. So, you can gradually make *some* progress. If, instead, you let the ISR loop, then you're stuck there...

I like to *quickly* reenable interrupts and take whatever measures needed to ensure the work that *I* need to do will get done, properly, even if postponed by a newer IRQ. This can be treacherous if a series of different IRQ sources conspire to interrupt each other and leave you interrupted by *yourself*, later!
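One way to sketch that "acknowledge, note the work, get out" pattern -- ack_device(), enable_interrupts() and the use of GCC's __atomic builtins are platform assumptions, not any particular vendor's API:

#include <stdint.h>

extern void ack_device(void);          /* silence the interrupt source */
extern void enable_interrupts(void);   /* CPU-level, platform specific */

static volatile uint32_t work_pending; /* bumped by ISR, drained by task */

void device_isr(void)
{
    ack_device();                      /* stop it re-asserting immediately */
    __atomic_fetch_add(&work_pending, 1, __ATOMIC_RELAXED);
    enable_interrupts();               /* let other sources back in while any
                                          remaining bookkeeping runs */
}

void background_task(void)
{
    while (__atomic_load_n(&work_pending, __ATOMIC_RELAXED) != 0) {
        /* ...the real (possibly slow) processing goes here... */
        __atomic_fetch_sub(&work_pending, 1, __ATOMIC_RELAXED);
    }
}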
>> How many registers does it stack automatically? I knew the HLL nonsense
>> would catch up with CPU design eventually. Good CPU design still means
>> load/store machines, stacking *nothing* at IRQ, just saving PC and CCR
>> to special purpose regs which can be stacked as needed by the IRQ
>> routine, along with registers to be used in it. Memory accesses are
>> the bottleneck, and with HLL code being bloated as it is chances
>> are some cache will have to be flushed to make room for stacking.
>> Some *really* well designed for control applications processors allow
>> you to lock a part of the cache but I doubt ARM have that, they seem to
>> have gone the way "make programming a two click job" to target a
>> wider audience.
>
> Actually there were processors which took the exact opposite position quite
> early on and they were incredibly good for realtime performance but their
> registers were no different to ram - they were *in* ram so was the program
> counter return address. There was a master register workspace pointer and 16
> registers TI TMS9900 series for instance.
>
> https://en.wikipedia.org/wiki/Texas_Instruments_TMS9900#Architecture
Guttag came out to pitch the 99K, in person (we used a lot of MPUs). But, at that time (early 80's?), memory access times were already starting to lag behind the cycle times of the processors of that day. This was, IMHO, a bad technological prediction on TI's part.

[IIRC, they also predicted sea-of-gates would be the most economical semi-custom approach (they actually proposed a "sea of inverters" wired in a mask layer much like DTL)]
> I didn't properly appreciate at the time quite how good this trick was for
> realtime work until we tried to implement the same algorithms on the much later
> and on paper faster 68000 series of CPUs.
On 1/16/2023 3:21 AM, Martin Brown wrote:
> On 15/01/2023 10:11, Don Y wrote:
>> On 1/15/2023 2:48 AM, Martin Brown wrote:
>>> I prefer to use RDTSC for my Intel timings anyway.
>>>
>>> On many of the modern CPUs there is a freerunning 64 bit counter clocked at
>>> once per cycle. Intel deprecates using it for such purposes but I have never
>>> found it a problem provided that you bracket it before and after with CPUID
>>> to force all the pipelines into an empty state.
>>>
>>> The equivalent DWT_CYCCNT on the Arm CPUs that support it is described here:
>>>
>>> https://stackoverflow.com/questions/32610019/arm-m4-instructions-per-cycle-ipc-counters
>>>
>>> I prefer hard numbers to a vague scope trace.
>>
>> Two downsides:
>> - you have to instrument your code (but, if you're concerned with performance,
>>   you've already done this as a matter of course)
>
> You have to make a test framework to exercise the code in as realistic a manner
> as you can - that isn't quite the same as instrumenting the code (although it
> can be).
It depends on how visible the information of interest is to outside observers. If you have to "do something" to make it so, then you may as well put in the instrumentation and get things as you want them.
> I have never found profile directed compilers to be the least bit useful on my
> fast maths codes because their automatic code instrumentation breaks the very
> code that it is supposed to be testing (in the sense of wrecking cache lines
> and locality etc.).
Exactly. The same holds true of adding invariants to code; removing them (#ifndef DEBUG) changes the code -- subtly but nonetheless.

So, you have to put in place two levels of final test:
- check to see if you THINK it will pass REAL final test
- actually DO the final test

When installing copy protection/anti-tamper mechanisms in products, there's a time when you've just enabled them and, thus, changed how the product runs. If it *stops* running (properly), you have to wonder if your "measures" are at fault or if some latent bug has crept in, aggravated by the slight differences in execution patterns.
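To make the invariant point concrete: the checked build and the build with the checks compiled out are genuinely different programs (code size, timing, register pressure), so each needs its own test pass. A sketch; the FIFO function is just an example:

#include <assert.h>

#ifdef NDEBUG
#define INVARIANT(cond)  ((void)0)     /* release: the check vanishes entirely */
#else
#define INVARIANT(cond)  assert(cond)  /* debug: the check (and its effect on
                                          layout/timing) is present            */
#endif

int fifo_pop(int *buf, unsigned *head, unsigned tail)
{
    INVARIANT(*head != tail);          /* caller promised "not empty" */
    return buf[(*head)++ & 0xFFu];
}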
> The only profiling method I have found to work reasonably well is probably by
> chance the highest frequency periodic ISR I have ever used in anger which was
> to profile code by accumulating a snapshot of PC addresses allowing just a few
> machine instructions to execute at a time. It used to work well back in the old
> days when 640k was the limit and code would reliably load into exactly the same
> locations every run.
>
> It is a great way to find the hotspots where most time is spent.
IMO, this is where logic analyzers shine. I don't agree with using them to "trace code" (during debug) as there are better ways to get that information. But, *watching* to see how code runs (passively) can be a real win. Especially when you are trying to watch for RARE aberrant behavior.
>> - it doesn't tell you about anything that happens *before* the code runs
>>   (e.g., latency between event and recognition thereof)
>
> True enough. Sometimes you need a logic analyser for weird behaviour - we once
> caught a CPU chip where RTI didn't always do what it said on the tin and the
> instruction following the RTI instruction got executed with a frequency of
> about 1:10^8. They replaced all the faulty CPUs FOC but we had to sign a
> non-disclosure agreement.
>
>>> If I'm really serious about finding out why something is unusually slow I
>>> run a dangerous system level driver that allows me full access to the model
>>> specific registers to monitor cache misses and pipeline stalls.
>>
>> But, those results can change from instance to instance (as can latency,
>> execution time, etc.).  So, you need to look at the *distribution* of
>> values and then think about whether that truly represents "typical"
>> and/or *worst* case.
>
> It just means that you have to collect an array of data and take a look at it
> later and offline. Much like you would when testing that a library function
> does exactly what it is supposed to.
Yes. So, you either have the code do the collection (using a black box) *or* have to have an external device (logic analyzer) that can collect it for you. The former is nice because the code can actually make decisions (at run time) that a passive observer often can't (because the observer can't see all of the pertinent data). But, that starts to have a pronounced impact on the *intended* code...
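A minimal "black box" sketch of that in-code collection -- the buffer size and the timestamp source are assumptions, cheap enough to leave in the running system and read out later from a debugger or dump:

#include <stdint.h>

#define BB_SIZE 256u                        /* power of two */

struct bb_entry { uint32_t when; uint16_t event; uint16_t arg; };

static struct bb_entry blackbox[BB_SIZE];
static volatile uint32_t bb_head;

extern uint32_t cycle_counter(void);        /* e.g. a free-running hw timer */

void bb_log(uint16_t event, uint16_t arg)
{
    uint32_t i = bb_head++ & (BB_SIZE - 1u);
    blackbox[i].when  = cycle_counter();
    blackbox[i].event = event;
    blackbox[i].arg   = arg;
}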
>> Relying on exact timings is sort of naive; it ignores how much
>> things can vary with the running system (is the software in a
>> critical region when the ISR is invoked?) and the running
>> *hardware* (multilevel caches, etc.)
>
> It is quite unusual to see bad behaviour from the multilevel caches but it can
> add to the variance. You always get a few outliers here and there in user code
> if a higher level disk or network interrupt steals cycles.
Being an embedded system developer, I find the issues that muck up execution are often periodic -- but with periods that are varied enough that they only beat against the observed phenomenon occasionally.

I am always amused by folks WHO OBSERVE A F*CKUP. Then, when they can't reproduce it or identify a likely cause, ACT AS IF IT NEVER HAPPENED! Sheesh, you're not relying on some third-hand report of an anomaly... YOU SAW IT! How can you pretend it didn't happen?
>> Do you have a way of KNOWING when your expectations (which you
>> have now decided are REQUIREMENTS!) are NOT being met?  And, if so,
>> what do you do (at runtime) with that information?  ("I'm sorry,
>> one of my basic assumptions is proving to be false and I am not
>> equipped to deal with that...")
>
> Instrumenting for timing tests is very much development rather than production
> code, i.e., is it fast enough or do we have to work harder?
Again, depends on the code and application. Few (interesting!) systems have any sort of "steady state". Rather, they have to react to a variety of circumstances occurring "whenever" they choose. The specifications rarely say "in this, that or the-other situation, these timing constraints do not apply". And, testing for every possible pile-up of events is just not conceivable.

The alternative is to rethink your deadlines (avoid HARD deadlines because so few things truly *are* hard!) and how you would recover from a missed (or delayed) deadline. I develop with deadline support in the code so the code can sort out how to react to situations that I can't foresee nor test for.

A deadline handler may not be invoked in a timely fashion (if it *was*, then why not just code the actual *task* as the deadline handler and get THAT guarantee?! :> ). But, at least it lets the code/system realize that something unplanned/unintended *has* happened: "Whaddya gonna do about it?"
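A rough sketch of what "deadline support in the code" can look like -- the API names are invented, and the handler is only a notification that an assumption broke, not a guarantee of timeliness:

#include <stdint.h>
#include <stdbool.h>

typedef void (*deadline_fn)(void *ctx);

struct deadline {
    uint32_t    expires_at;     /* in timer ticks */
    deadline_fn on_miss;        /* "whaddya gonna do about it?" */
    void       *ctx;
    bool        armed;
};

extern uint32_t now_ticks(void);

/* Called from a periodic tick or scheduler hook. */
void deadline_check(struct deadline *d)
{
    if (d->armed && (int32_t)(now_ticks() - d->expires_at) >= 0) {
        d->armed = false;
        d->on_miss(d->ctx);     /* degrade, log, shed load, etc. */
    }
}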
> Like you I prefer HLL code but I will use ASM if I have to or there is no other
> way (like wanting 80 bit reals in the MS compiler). Actually I am working on a
> class library to allow somewhat clunky access to it.
"Measure and THEN optimize". Let the compiler take a stab at it. If you discover (measurement) that it's not meeting your expectations (requirements), figure out why and the possible approaches you can take to remedy that. I designed a barcode reader that had, as a single input, the "video" from the photodetector routed to an IRQ pin. Program the IRQ to sense the white-to-black edge and wait. In ISR, capture the system time from a high resolution timer; program the IRQ to sense the black-to-white edge and wait. Lather, rinse, repeat (you never know *when* a user may try to read a barcode!). A background task would watch the FIFO maintained by the IRQ and pull data out of it (atomically) to keep the FIFO from overflowing as well as get a head start on decoding the "transition times" into "width intervals". Competing IRQs would introduce lots of latency into the captured system times. So, I modified the ISRs to also tell me how long ago the actual transition occured. A bit more work for the ISRs and similarly for the task monitoring the FIFO. But, it allowed me to capture barcodes with features as small as 0.007" at 100IPS... on a 2MHz 8b CPU. A modern compiler could probably generate "as effective" code; the real performance gain was obtained by changing the algorithm instead of the implementation language.
> They annoyingly zapped access to 80 bit reals in v6 I think it was for
> "compatibility" reasons since SSE2 and later can only do 64bit reals.
I had an early product use BCD data formats (supported by an ancient compiler). When *that* support went away, it was a real nightmare to go through and rework everything to use bare ints (and have to bin-to-bcd all the time)
>> Esp given that your implementation will likely evolve and
>> folks doing that work may not be as focused as you were on
>> this specific issue...
>
> That will be their problem not mine ;-)
Ah, most of my projects are prototypes or proof of concept. So, my code *will* be reworked. If that proves to be hard, folks won't recommend me to other clients! :> [So, I make it REALLY easy for folks to Do The Right Thing (in my opinion of "right") to maximize their chance of getting good results. If you want to reinvent the wheel, then don't fret if you *break* it - cuz you can SEE that it worked!!]
On 16.01.23 11:23, Martin Brown wrote:

> Actually there were processors which took the exact opposite position
> quite early on and they were incredibly good for realtime performance
> but their registers were no different to ram - they were *in* ram so was
> the program counter return address. There was a master register
> workspace pointer and 16 registers TI TMS9900 series for instance.
>
> https://en.wikipedia.org/wiki/Texas_Instruments_TMS9900#Architecture
>
> I didn't properly appreciate at the time quite how good this trick was
> for realtime work until we tried to implement the same algorithms on the
> much later and on paper faster 68000 series of CPUs.
At TU Berlin we had a place called the Zoo where there was at least one sample of each CPU family. We used the Zoo to port Andrew Tanenbaum's Experimental Machine to all of them under equal conditions. That was a p-code engine from the Amsterdam Free University Compiler Kit.

The 9900 was slowest, by a large margin, Z80-league. Having no cache AND no registers was a braindead idea.

Some friends built a hardware machine around the Fairchild Clipper. They found out that moving the hard disk driver just a few bytes made a difference between speedy and slow as molasses. When the data was through under the head you had to wait for another disc revolution.

It turned out that Fairchild simply lasered away some faulty cache lines and sold it. No warning given. It was entertaining to see, not being in that project.

Gerhard
On 1/16/2023 4:25 AM, Don Y wrote:
> On 1/16/2023 3:23 AM, Martin Brown wrote:
>> For my money ISRs should do as little as possible at such a high privilege
>> level although checking if their interrupt flag is already set again before
>> returning is worthwhile for maximum burst transfer speed.
>
> +42 on both counts.
>
> OTOH, you have to be wary of misbehaving hardware (or, unforeseen
> circumstances) causing the ISR to loop continuously.
E.g., I particularly object to folks trying to detect counter wrap by:

   do {
      high = read(HIGH)
      low = read(LOW)
   } while ( high != read(HIGH) )

and similar. What *guarantees* do you have that this will ever complete? (yeah, unlikely for it to hang here, but not *impossible*!)

Better:

   high = read(HIGH)
   low = read(LOW)
   if ( high != read(HIGH) ) {
      high = high + 1
      low = 0
   }

or similar (e.g., high = high; low = LOWMAX -- depending on how you want to bias the approximation).
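In C, the bounded version might look like this (COUNTER_HIGH/COUNTER_LOW are placeholders for a free-running 64-bit counter exposed as two 32-bit halves):

#include <stdint.h>

extern volatile uint32_t COUNTER_HIGH, COUNTER_LOW;   /* hypothetical device */

uint64_t read_counter64(void)
{
    uint32_t high = COUNTER_HIGH;
    uint32_t low  = COUNTER_LOW;

    if (high != COUNTER_HIGH) {
        /* The counter carried between the two reads.  Rather than spin
           until the reads agree, accept a bounded approximation. */
        high = high + 1;
        low  = 0;
    }
    return ((uint64_t)high << 32) | low;
}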
On 1/16/2023 5:04 AM, Gerhard Hoffmann wrote:
> On 16.01.23 11:23, Martin Brown wrote:
>
>> Actually there were processors which took the exact opposite position quite
>> early on and they were incredibly good for realtime performance but their
>> registers were no different to ram - they were *in* ram so was the program
>> counter return address. There was a master register workspace pointer and 16
>> registers TI TMS9900 series for instance.
>>
>> https://en.wikipedia.org/wiki/Texas_Instruments_TMS9900#Architecture
>>
>> I didn't properly appreciate at the time quite how good this trick was for
>> realtime work until we tried to implement the same algorithms on the much
>> later and on paper faster 68000 series of CPUs.
>
> At TU Berlin we had a place called the Zoo where there was
> at least one sample of each CPU family. We used the Zoo to
> port Andrew Tanenbaum's Experimental Machine to all of them
> under equal conditions. That was a p-code engine from the
> Amsterdam Free University Compiler Kit.
>
> The 9900 was slowest, by a large margin, Z80-league.
But, only for THAT particular benchmark. I learned, early on, that I could create a benchmark for damn near any two processors to make *either* (my choice) look better, just by choosing the conditions of the test. [And, as I was often designing the hardware, my "input" carried a lot of weight]

Are we holding clock frequency constant? Memory access time? Code size? Memory dollars? Board space? "Throughput"? Algorithm? etc.

I tend to like processors with lots of internal registers from my days writing ASM; it was an acquired skill to be able to think about how to design an algorithm so you could keep everything *in* the processor -- instead of having to constantly load/store/reload.

But, moving away from ASM, I'm less concerned as to what the programmer's model looks like. I'm more interested in what the architecture supports and how easy it is for me to make use of those mechanisms (in hardware and software).
> Having no cache AND no registers was a braindead idea.
They could argue that adding cache was a logical way to design "systems" with the device. Remember, the 9900/99K were from the "home computer" era. They lost out to a dog slow 8086!
> Some friends built a hardware machine around the Fairchild
> Clipper. They found out that moving the hard disk driver
> just a few bytes made a difference between speedy and slow
> as molasses. When the data was through under the head you had
> to wait for another disc revolution.
Unless, of course, you were already planning on being busy doing something else, at that time. :> "Benchmarks lie"
> It turned out that Fairchild simply lasered away some faulty
> cache lines and sold it. No warning given.
> It was entertaining to see, not being in that project.
On 16.01.23 13:25, Don Y wrote:
> On 1/16/2023 5:04 AM, Gerhard Hoffmann wrote:
>> On 16.01.23 11:23, Martin Brown wrote:
>>> I didn't properly appreciate at the time quite how good this trick
>>> was for realtime work until we tried to implement the same algorithms
>>> on the much later and on paper faster 68000 series of CPUs.
>>
>> At TU Berlin we had a place called the Zoo where there was
>> at least one sample of each CPU family. We used the Zoo to
>> port Andrew Tanenbaum's Experimental Machine to all of them
>> under equal conditions. That was a p-code engine from the
>> Amsterdam Free University Compiler Kit.
>>
>> The 9900 was slowest, by a large margin, Z80-league.
>
> But, only for THAT particular benchmark.  I learned, early on,
> that I could create a benchmark for damn near any two processors
> to make *either* (my choice) look better, just by choosing
> the conditions of the test.
That was not a benchmark; that was a given large p-code machine with the intent to use the same compilers everywhere. Not unlike UCSD-Pascal.
>> Having no cache AND no registers was a braindead idea.
>
> They could argue that adding cache was a logical way to
> design "systems" with the device.
with a non-existing cache controller and cache rams that cost as much as the cpu. I got a feeling for the price of cache when I designed this: < https://www.flickr.com/photos/137684711@N07/52631074700/in/dateposted-public/ >
> Remember, the 9900/99K were from the "home computer" era.
> They lost out to a dog slow 8086!
8086 was NOT slow. Have you ever used an Olivetti M20 with a competently engineered memory system? That even challenged early ATs when protected mode was not needed.
>> Some friends built a hardware machine around the Fairchild
>> Clipper. They found out that moving the hard disk driver
>> just a few bytes made a difference between speedy and slow
>> as molasses. When the data was through under the head you had
>> to wait for another disc revolution.
>
> Unless, of course, you were already planning on being busy
> doing something else, at that time.  :>  "Benchmarks lie"
That benchmark was Unix System V, as licensed from Bell. Find something better to do when you need to swap.

Gerhard
On Mon, 16 Jan 2023 10:21:43 +0000, Martin Brown <'''newspam'''@nonad.co.uk> wrote:

[...]

>Like you I prefer HLL code but I will use ASM if I have to or there is
>no other way (like wanting 80 bit reals in the MS compiler). Actually I
>am working on a class library to allow somewhat clunky access to it.
>
>They annoyingly zapped access to 80 bit reals in v6 I think it was for
>"compatibility" reasons since SSE2 and later can only do 64bit reals.
PowerBasic has 80-bit reals as a native variable type.

As far as timing analysis goes, we always bring out a few port pins to test points, from uPs and FPGAs, so we can scope things. Raise a pin at ISR entry, drop it before the RTI, scope it.

We wrote one Linux program that just toggled a test point as fast as it could. That was interesting on a scope, namely the parts that didn't toggle.
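The port-pin trick, roughly, in C -- the port register and bit are placeholders for whatever test point happens to be wired out:

#include <stdint.h>

#define TEST_PORT  (*(volatile uint32_t *)0x40003000u)   /* hypothetical GPIO */
#define TP0        (1u << 0)

void some_isr(void)
{
    TEST_PORT |=  TP0;      /* scope channel goes high: ISR entered */

    /* ...the actual interrupt work... */

    TEST_PORT &= ~TP0;      /* goes low just before return: the pulse width on
                               the scope is the ISR execution time, and the
                               spacing/jitter between pulses shows latency */
}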