
highest frequency periodic interrupt?

Started by John Larkin January 13, 2023
On Mon, 16 Jan 2023 04:54:38 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 1/16/2023 3:21 AM, Martin Brown wrote:
>> On 15/01/2023 10:11, Don Y wrote:
>>> On 1/15/2023 2:48 AM, Martin Brown wrote:
>>>> I prefer to use RDTSC for my Intel timings anyway.
>>>>
>>>> On many of the modern CPUs there is a freerunning 64 bit counter clocked at once per cycle. Intel deprecates using it for such purposes but I have never found it a problem provided that you bracket it before and after with CPUID to force all the pipelines into an empty state.
>>>>
>>>> The equivalent DWT_CYCCNT on the Arm CPUs that support it is described here:
>>>>
>>>> https://stackoverflow.com/questions/32610019/arm-m4-instructions-per-cycle-ipc-counters
>>>>
>>>> I prefer hard numbers to a vague scope trace.
>>>
>>> Two downsides:
>>> - you have to instrument your code (but, if you're concerned with performance,
>>>   you've already done this as a matter of course)
>>
>> You have to make a test framework to exercise the code in as realistic a manner as you can - that isn't quite the same as instrumenting the code (although it can be).
>
>It depends on how visible the information of interest is to outside observers. If you have to "do something" to make it so, then you may as well put in the instrumentation and get things as you want them.
>
>> I have never found profile directed compilers to be the least bit useful on my fast maths codes because their automatic code instrumentation breaks the very code that it is supposed to be testing (in the sense of wrecking cache lines and locality etc.).
>
>Exactly. The same holds true of adding invariants to code; removing them (#ifndef DEBUG) changes the code -- subtly but nonetheless. So, you have to put in place two levels of final test:
>- check to see if you THINK it will pass REAL final test
>- actually DO the final test
>
>When installing copy protection/anti-tamper mechanisms in products, there's a time when you've just enabled them and, thus, changed how the product runs. If it *stops* running (properly), you have to wonder if your "measures" are at fault or if some latent bug has crept in, aggravated by the slight differences in execution patterns.
>
>> The only profiling method I have found to work reasonably well is probably by chance the highest frequency periodic ISR I have ever used in anger which was to profile code by accumulating a snapshot of PC addresses allowing just a few machine instructions to execute at a time. It used to work well back in the old days when 640k was the limit and code would reliably load into exactly the same locations every run.
>>
>> It is a great way to find the hotspots where most time is spent.
>
>IMO, this is where logic analyzers shine. I don't agree with using them to "trace code" (during debug) as there are better ways to get that information. But, *watching* to see how code runs (passively) can be a real win. Especially when you are trying to watch for RARE aberrant behavior.
>
>>> - it doesn't tell you about anything that happens *before* the code runs
>>>   (e.g., latency between event and recognition thereof)
>>
>> True enough. Sometimes you need a logic analyser for weird behaviour - we once caught a CPU chip where RTI didn't always do what it said on the tin and the instruction following the RTI instruction got executed with a frequency of about 1:10^8. They replaced all the faulty CPUs FOC but we had to sign a non-disclosure agreement.
>>
>>>> If I'm really serious about finding out why something is unusually slow I run a dangerous system level driver that allows me full access to the model specific registers to monitor cache misses and pipeline stalls.
>>>
>>> But, those results can change from instance to instance (as can latency, execution time, etc.). So, you need to look at the *distribution* of values and then think about whether that truly represents "typical" and/or *worst* case.
>>
>> It just means that you have to collect an array of data and take a look at it later and offline. Much like you would when testing that a library function does exactly what it is supposed to.
>
>Yes. So, you either have the code do the collection (using a black box) *or* have to have an external device (logic analyzer) that can collect it for you.
>
>The former is nice because the code can actually make decisions (at run time) that a passive observer often can't (because the observer can't see all of the pertinent data). But, that starts to have a pronounced impact on the *intended* code...
>
>>> Relying on exact timings is sort of naive; it ignores how much things can vary with the running system (is the software in a critical region when the ISR is invoked?) and the running *hardware* (multilevel caches, etc.)
>>
>> It is quite unusual to see bad behaviour from the multilevel caches but it can add to the variance. You always get a few outliers here and there in user code if a higher level disk or network interrupt steals cycles.
>
>Being an embedded system developer, the issues that often muck up execution are often periodic -- but, with periods that are varied enough that they only beat against the observed phenomenon occasionally.
>
>I am always amused by folks WHO OBSERVE A F*CKUP. Then, when they can't reproduce it or identify a likely cause, ACT AS IF IT NEVER HAPPENED! Sheesh, you're not relying on some third-hand report of an anomaly... YOU SAW IT! How can you pretend it didn't happen?
Apply all of your engineering creativity.
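A minimal sketch of the CPUID/RDTSC bracketing described above, assuming GCC or Clang on x86-64; the function name is illustrative, and the __rdtsc()/__rdtscp() intrinsics are an alternative on recent compilers:

#include <stdint.h>
#include <stdio.h>

/* Serialize the pipelines with CPUID, then read the time-stamp counter. */
static inline uint64_t tsc_serialized(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("cpuid\n\t"            /* drain the pipelines           */
                      "rdtsc"                /* EDX:EAX <- time-stamp counter */
                      : "=a"(lo), "=d"(hi)
                      : "0"(0U)              /* CPUID leaf 0                  */
                      : "rbx", "rcx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t t0 = tsc_serialized();
    /* ... code under test ... */
    uint64_t t1 = tsc_serialized();
    printf("%llu cycles\n", (unsigned long long)(t1 - t0));
    return 0;
}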
On 1/16/2023 7:27 AM, Gerhard Hoffmann wrote:
> On 16.01.23 at 13:25, Don Y wrote:
>> On 1/16/2023 5:04 AM, Gerhard Hoffmann wrote:
>>> On 16.01.23 at 11:23, Martin Brown wrote:
>
>>>> I didn't properly appreciate at the time quite how good this trick was for realtime work until we tried to implement the same algorithms on the much later and on paper faster 68000 series of CPUs.
>>>
>>> At TU Berlin we had a place called the Zoo where there was at least one sample of each CPU family. We used the Zoo to port Andrew Tanenbaum's Experimental Machine to all of them under equal conditions. That was a p-code engine from the Amsterdam Free University Compiler Kit.
>>>
>>> The 9900 was slowest, by a large margin, Z80-league.
>>
>> But, only for THAT particular benchmark. I learned, early on, that I could create a benchmark for damn near any two processors to make *either* (my choice) look better, just by choosing the conditions of the test.
>
> That was not a benchmark; that was a given large p-code machine with the intent to use the same compilers everywhere. Not unlike UCSD-Pascal.
Everything that you use as an example of performance is a benchmark. If you are running a computer for a teaching course, do you really care how fast the code executes? Can you name any products that embodied that code -- where the performance of the hosting processor would have a cost-performance tradeoff? Do you care how fast your LISP machine runs?
>>> Having no cache AND no registers was a braindead idea.
>>
>> They could argue that adding cache was a logical way to design "systems" with the device.
>
> with a non-existing cache controller and cache rams that cost as much as the cpu. I got a feeling for the price of cache when I designed this:
> < https://www.flickr.com/photos/137684711@N07/52631074700/in/dateposted-public/ >
TI is a chip vendor. They look at things differently than folks who *use* the chips. NatSemi used to make DRAM controllers -- that cost more than the DRAM! "What does this do that a few multiplexers *won't*, for us?" When you buy things in volume, you're "paying for the plastic"; the die accounts for a relatively insignificant part of the cost.
>> Remember, the 9900/99K were from the "home computer" era. >> They lost out to a dog slow 8086! > > 8086 was NOT slow. Have you ever used an Olivetti M20 with > a competently engineered memory system? That even challenged > early ATs when protected mode was not needed.
The 8086 (4.77MHz) was slower than a Z80, at the time, because applications that could use its extra abilities were relatively few and far between. By contrast, the number of embedded designs outpaced "PCs" by likely two orders of magnitude; far better "bang for buck".
>>> Some friends built a hardware machine around the Fairchild Clipper. They found out that moving the hard disk driver just a few bytes made a difference between speedy and slow as molasses. When the data was through under the head you had to wait for another disc revolution.
>>
>> Unless, of course, you were already planning on being busy doing something else, at that time. :> "Benchmarks lie"
>
> That benchmark was Unix System V, as licensed from Bell. Find something better to do when you need to swap.
How many video games was it used in? Pinball machines? Medical devices? Process control systems? Navigation systems? etc. None of those run a UNIX kernel nor any of the sorts of algorithms that you'd *find* in a UNIX kernel. So, why would I look at the performance of a processor running UNIX... if my product is NOT running UNIX? Instead, I'd wonder how well it ran *my* code and what the *cost* of that performance was -- as that will determine if I can *price* my product so that it sells in a given market.
On a sunny day (Mon, 16 Jan 2023 06:29:25 -0800) it happened John Larkin
<jlarkin@highlandSNIPMEtechnology.com> wrote in
<cenashpl5f7cdk37nq4dab7hcdlqknptqh@4ax.com>:

>On Mon, 16 Jan 2023 10:21:43 +0000, Martin Brown
><'''newspam'''@nonad.co.uk> wrote:
>
>>On 15/01/2023 10:11, Don Y wrote:
>>> On 1/15/2023 2:48 AM, Martin Brown wrote:
>>>> I prefer to use RDTSC for my Intel timings anyway.
>>>>
>>>> On many of the modern CPUs there is a freerunning 64 bit counter clocked at once per cycle. Intel deprecates using it for such purposes but I have never found it a problem provided that you bracket it before and after with CPUID to force all the pipelines into an empty state.
>>>>
>>>> The equivalent DWT_CYCCNT on the Arm CPUs that support it is described here:
>>>>
>>>> https://stackoverflow.com/questions/32610019/arm-m4-instructions-per-cycle-ipc-counters
>>>>
>>>> I prefer hard numbers to a vague scope trace.
>>>
>>> Two downsides:
>>> - you have to instrument your code (but, if you're concerned with performance,
>>>   you've already done this as a matter of course)
>>
>>You have to make a test framework to exercise the code in as realistic a manner as you can - that isn't quite the same as instrumenting the code (although it can be).
>>
>>I have never found profile directed compilers to be the least bit useful on my fast maths codes because their automatic code instrumentation breaks the very code that it is supposed to be testing (in the sense of wrecking cache lines and locality etc.).
>>
>>The only profiling method I have found to work reasonably well is probably by chance the highest frequency periodic ISR I have ever used in anger which was to profile code by accumulating a snapshot of PC addresses allowing just a few machine instructions to execute at a time. It used to work well back in the old days when 640k was the limit and code would reliably load into exactly the same locations every run.
>>
>>It is a great way to find the hotspots where most time is spent.
>>
>>> - it doesn't tell you about anything that happens *before* the code runs
>>>   (e.g., latency between event and recognition thereof)
>>
>>True enough. Sometimes you need a logic analyser for weird behaviour - we once caught a CPU chip where RTI didn't always do what it said on the tin and the instruction following the RTI instruction got executed with a frequency of about 1:10^8. They replaced all the faulty CPUs FOC but we had to sign a non-disclosure agreement.
>>
>>>> If I'm really serious about finding out why something is unusually slow I run a dangerous system level driver that allows me full access to the model specific registers to monitor cache misses and pipeline stalls.
>>>
>>> But, those results can change from instance to instance (as can latency, execution time, etc.). So, you need to look at the *distribution* of values and then think about whether that truly represents "typical" and/or *worst* case.
>>
>>It just means that you have to collect an array of data and take a look at it later and offline. Much like you would when testing that a library function does exactly what it is supposed to.
>>
>>> Relying on exact timings is sort of naive; it ignores how much things can vary with the running system (is the software in a critical region when the ISR is invoked?) and the running *hardware* (multilevel caches, etc.)
>>
>>It is quite unusual to see bad behaviour from the multilevel caches but it can add to the variance. You always get a few outliers here and there in user code if a higher level disk or network interrupt steals cycles.
>>
>>> Do you have a way of KNOWING when your expectations (which you have now decided are REQUIREMENTS!) are NOT being met? And, if so, what do you do (at runtime) with that information? ("I'm sorry, one of my basic assumptions is proving to be false and I am not equipped to deal with that...")
>>
>>Instrumenting for timing tests is very much development rather than production code. ie is it fast enough or do we have to work harder.
>>
>>Like you I prefer HLL code but I will use ASM if I have to or there is no other way (like wanting 80 bit reals in the MS compiler). Actually I am working on a class library to allow somewhat clunky access to it.
>>
>>They annoyingly zapped access to 80 bit reals in v6 I think it was for "compatibility" reasons since SSE2 and later can only do 64bit reals.
>
>PowerBasic has 80-bit reals as a native variable type.
>
>As far as timing analysis goes, we always bring out a few port pins to test points, from uPs and FPGAs, so we can scope things. Raise a pin at ISR entry, drop it before the RTI, scope it.
>
>We wrote one Linux program that just toggled a test point as fast as it could. That was interesting on a scope, namely the parts that didn't toggle.
It all depends. Using a Raspberry Pi as an FM transmitter (80 to 100 MHz or so):
https://linuxhint.com/turn-raspberry-pi-fm-transmitter/
That code gave me the following idea, freq_pi:
http://panteltje.com/panteltje/newsflex/download.html#freq_pi
and that was for a very old Pi model; somebody then ported it to a later model. No idea how fast you can go on a Pi4.
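The port-pin trick John describes above (raise a pin at ISR entry, drop it before the RTI) looks roughly like this on a memory-mapped GPIO; the register addresses, pin number and vector name below are placeholders for whatever your vendor header actually defines:

#include <stdint.h>

/* Hypothetical GPIO set/clear registers -- substitute the ones from
   your part's header. */
#define GPIO_SET  (*(volatile uint32_t *)0x48000018u)
#define GPIO_CLR  (*(volatile uint32_t *)0x48000028u)
#define TESTPOINT (1u << 5)

void TIM2_IRQHandler(void)         /* vector name is illustrative          */
{
    GPIO_SET = TESTPOINT;          /* pin high at ISR entry                */

    /* ... the real interrupt work ... */

    GPIO_CLR = TESTPOINT;          /* pin low just before returning        */
}

/* On the scope: pulse width = time spent in the ISR; pulse spacing and
   jitter show the interrupt latency behaviour. */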
On Sat, 14 Jan 2023 22:21:08 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 1/14/2023 10:10 PM, upsidedown@downunder.com wrote: >> In the past coding ISRs in assembly was the way to go, but the >> complexity of current processors (cache, pipelining) makes it hard to >> beat a _good_ compiler. > >Exactly. And, it's usually easier to see what you are trying >to do in a HLL vs. ASM (and heaven forbid you want to port >the application to a different processor!) > >The problem with using an HLL is making sure you actually >understand some "line of code" translates into when it comes >to actual opcode/memory accesses (not just which instructions >but, rather, the *cost* of those instructions) > >And, this can change, based on *how* the compiler is invoked >(how aggressive the code generator) > >> The main principle still is to minimize the number of registers saved >> at >> interrupt entry (and restored at exit).On a primitive processor only >> the processor status word and program counter needs to be saved (and >> restored). Additional registers may need to be saved(restored if the >> ISR uses them. > >Some "advanced" processors still support a "Fast IRQ" that saves >just an accumulator and PSW. A tacit acknowledgement that you >don't want to have to save the *entire* processor state (as >you likely don't know what portions of it the compiler *might* >call on). > >> If the processor has separate FP registers and/or separate FP status >> words, avoid using FP registers in ISRs. > >As with everything, *how* you use them can make a difference. >E.g., if your ISR reenables interrupts (prior to completion), it >can make sense to use "expensive" instruction sequences (assuming >the ISR doesn't interrupt itself). > >[Degenerate example: the scheduler being invoked!]
In a system with multiple ISRs, spending too long in a single ISR is a bad idea. Better to just read inputs in the ISR and postpone the time-consuming processing to a lower priority "pseudo ISR" (SW ISR). Many processors have software interrupts (SWI), traps or whatever each manufacturer calls it. In such an environment the HW ISR, close to its end, issues the TRAP instruction (SWI request), which is not activated as long as the HW ISR is still executing. When the HW ISR exits, interrupts are enabled. If there are other HW interrupts pending, those are executed first. When no more HW interrupts are pending, the SW ISR can start executing. This SW ISR can be quite time consuming. A new hardware interrupt request may interrupt the SW ISR. When the SW ISR finally exits, the originally interrupted program is resumed.
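A minimal sketch of that pattern on a Cortex-M part, using the PendSV exception (configured at the lowest priority) as the "SW ISR"; the device header name, peripheral address and handler names are placeholders, and on other architectures the same idea maps onto a real SWI/TRAP instruction:

#include <stdint.h>
#include "device.h"                 /* vendor CMSIS header (placeholder)    */

#define UART_DATA (*(volatile uint32_t *)0x40004400u)   /* hypothetical     */

static volatile uint8_t rx_byte;
extern void process_byte(uint8_t b);        /* the slow work, elsewhere     */

/* High-priority hardware ISR: grab the input, then pend the "SW ISR".      */
void UART_IRQHandler(void)
{
    rx_byte = (uint8_t)UART_DATA;           /* read data, clear the request */
    SCB->ICSR = SCB_ICSR_PENDSVSET_Msk;     /* request PendSV               */
}

/* PendSV sits at the lowest priority, so it only starts once no hardware
   interrupt is pending, and it can itself be interrupted by new HW IRQs.   */
void PendSV_Handler(void)
{
    process_byte(rx_byte);                  /* the time-consuming part      */
}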
On 1/16/2023 10:39 AM, upsidedown@downunder.com wrote:
> On Sat, 14 Jan 2023 22:21:08 -0700, Don Y > <blockedofcourse@foo.invalid> wrote: > >> On 1/14/2023 10:10 PM, upsidedown@downunder.com wrote: >>> In the past coding ISRs in assembly was the way to go, but the >>> complexity of current processors (cache, pipelining) makes it hard to >>> beat a _good_ compiler. >> >> Exactly. And, it's usually easier to see what you are trying >> to do in a HLL vs. ASM (and heaven forbid you want to port >> the application to a different processor!) >> >> The problem with using an HLL is making sure you actually >> understand some "line of code" translates into when it comes >> to actual opcode/memory accesses (not just which instructions >> but, rather, the *cost* of those instructions) >> >> And, this can change, based on *how* the compiler is invoked >> (how aggressive the code generator) >> >>> The main principle still is to minimize the number of registers saved >>> at >>> interrupt entry (and restored at exit).On a primitive processor only >>> the processor status word and program counter needs to be saved (and >>> restored). Additional registers may need to be saved(restored if the >>> ISR uses them. >> >> Some "advanced" processors still support a "Fast IRQ" that saves >> just an accumulator and PSW. A tacit acknowledgement that you >> don't want to have to save the *entire* processor state (as >> you likely don't know what portions of it the compiler *might* >> call on). >> >>> If the processor has separate FP registers and/or separate FP status >>> words, avoid using FP registers in ISRs. >> >> As with everything, *how* you use them can make a difference. >> E.g., if your ISR reenables interrupts (prior to completion), it >> can make sense to use "expensive" instruction sequences (assuming >> the ISR doesn't interrupt itself). >> >> [Degenerate example: the scheduler being invoked!] > > In a system with multiple ISRs, spending too long in a single ISR is a > bad idea.
Of course! Paraphrasing the man with the wild hair: "Do as little as possible -- but no less!"
> Better just read inputs in the ISR and postpone the time > consuming processing to a lower priority "pseudo ISR" (SW ISR).
Or "outputs". I always build a "driver" (that runs in the ISR) and a "handler" for "devices". Only the handler need be concerned with the driver. And, system code should only need to deal with the handler. If writing in ASM, one can dick with the stack frame and "arrange" to RTI to an intermediate level of code that runs below the background but above the ISR. This lets you allow more ISRs to be serviced while you are, effectively, still servicing the previous one. You can also structure the ISR as a state machine (if appropriate to the task at hand) to remove conditionals from executing in the ISR. So, subsequent IRQs are dispatched to different routines, instead of having a "big" ISR that tries to juggle the various needs. [I have a clever little bit of code that makes this relatively inexpensive but requires the code to reside in writeable RAM (as it is self-modifying)]
> Many processors have software interrupts (SWI), traps or whatever each > manufacture is calling it. > > In such environment the HW ISR at close to the end issues the TRAP > instruction (SWI request), which is not activated as long as the HW > ISR is still executing. When the HW ISR exits, interrupts are enabled.
A trap is often expensive. You can emulate this functionality (as above).
> If there is an other HW interrupt(s) pending, those are first > executed. When no more HW interrupts are pending the SW ISR can start > executing. This SW ISR can be quite time consuming. A new hardware > interrupt request may interrupt the SW ISR. > > When the SW ISR finally exits, the originally interrupted program is > resumed.
Nowadays, processors have more than just (legacy) interrupts using the same "exception" mechanism. Ideally, you structure all of your "exception handlers" with a similar set of guidelines, even though some don't have the timeliness constraints of traditional ISRs.
On 16.01.23 at 17:19, Don Y wrote:
> On 1/16/2023 7:27 AM, Gerhard Hoffmann wrote:
>> That was not a benchmark; that was a given large p-code machine >> with the intent to use the same compilers everywhere. Not unlike >> UCSD-Pascal. > > Everything that you use as an example of performance is a benchmark.
That was not created as a benchmark. The goal was to have the same compiler and operating system on most of the upcoming microsystems available. Not too unexpected for an operating systems department at a university. And when there were underperformers, that would not go unnoticed.
> Do you care how fast your LISP machine runs?
We had some ICL Perqs; not my cup of meat. I had a small Prolog system on my Z80, funny but nothing for real work. And yes, I was interested how fast my machines ran. In the VLSI course, I talked a group of other students into doing a stack machine much like Tanenbaum's, only simpler in HP's dynamic NMOS process. Unluckily, we caught a metal flake from a neighbor project that the DRC did not get.
>> with a non-existing cache controller and cache rams that cost as much as the cpu. I got a feeling for the price of cache when I designed this:
>> < https://www.flickr.com/photos/137684711@N07/52631074700/in/dateposted-public/ >
>
> TI is a chip vendor. They look at things differently than folks who *use* the chips.
So is Intel.
> NatSemi used to make DRAM controllers -- that cost more than the DRAM! "What does this do that a few multiplexers *won't*, for us?"
>
> When you buy things in volume, you're "paying for the plastic"; the die have a relatively insignificant part in the cost.
>
>>> Remember, the 9900/99K were from the "home computer" era. They lost out to a dog slow 8086!
>>
>> 8086 was NOT slow. Have you ever used an Olivetti M20 with a competently engineered memory system? That even challenged early ATs when protected mode was not needed.
>
> The 8086 (4.77MHz) was slower than a Z80, at the time. Because applications that could use its extra abilities were relatively few and far between.
Ah, I had both of them, in the same 19" box.
>>>> Unless, of course, you were already planning on being busy doing something else, at that time. :> "Benchmarks lie"
>>>
>>> That benchmark was Unix System V, as licensed from Bell. Find something better to do when you need to swap.
>>
>> How many video games was it used in? Pinball machines? Medical devices? Process control systems? Navigation systems? etc. None of those run a UNIX kernel nor any of the sorts of algorithms that you'd *find* in a UNIX kernel.
Pinball machines with a Fairchild Clipper? Do you have an idea what a Clipper module cost? The machine was intended as a multi-user Unix machine. I later got a paid project to build a VME bus terminal concentrator based on the 80186 for it. Why should I care about medical devices, video games, pinball or navigation systems? GPS was an experiment at that time, and the 50 baud navigation strings were no problem for sure.
> So, why would I look at the performance of a processor running > UNIX... if my product is NOT running UNIX?
The product WAS running UNIX. I wrote that they had bought a commercial source license from Bell.

Cheers, Gerhard
Monday, January 16, 2023 at 18:39:13 UTC+1, upsid...@downunder.com wrote:
> On Sat, 14 Jan 2023 22:21:08 -0700, Don Y > <blocked...@foo.invalid> wrote: > >On 1/14/2023 10:10 PM, upsid...@downunder.com wrote: > >> In the past coding ISRs in assembly was the way to go, but the > >> complexity of current processors (cache, pipelining) makes it hard to > >> beat a _good_ compiler. > > > >Exactly. And, it's usually easier to see what you are trying > >to do in a HLL vs. ASM (and heaven forbid you want to port > >the application to a different processor!) > > > >The problem with using an HLL is making sure you actually > >understand some "line of code" translates into when it comes > >to actual opcode/memory accesses (not just which instructions > >but, rather, the *cost* of those instructions) > > > >And, this can change, based on *how* the compiler is invoked > >(how aggressive the code generator) > > > >> The main principle still is to minimize the number of registers saved > >> at > >> interrupt entry (and restored at exit).On a primitive processor only > >> the processor status word and program counter needs to be saved (and > >> restored). Additional registers may need to be saved(restored if the > >> ISR uses them. > > > >Some "advanced" processors still support a "Fast IRQ" that saves > >just an accumulator and PSW. A tacit acknowledgement that you > >don't want to have to save the *entire* processor state (as > >you likely don't know what portions of it the compiler *might* > >call on). > > > >> If the processor has separate FP registers and/or separate FP status > >> words, avoid using FP registers in ISRs. > > > >As with everything, *how* you use them can make a difference. > >E.g., if your ISR reenables interrupts (prior to completion), it > >can make sense to use "expensive" instruction sequences (assuming > >the ISR doesn't interrupt itself). > > > >[Degenerate example: the scheduler being invoked!] > > In a system with multiple ISRs, spending too long in a single ISR is a > bad idea. Better just read inputs in the ISR and postpone the time > consuming processing to a lower priority "pseudo ISR" (SW ISR). > > Many processors have software interrupts (SWI), traps or whatever each > manufacture is calling it. > > In such environment the HW ISR at close to the end issues the TRAP > instruction (SWI request), which is not activated as long as the HW > ISR is still executing. When the HW ISR exits, interrupts are enabled. > > If there is an other HW interrupt(s) pending, those are first > executed. When no more HW interrupts are pending the SW ISR can start > executing. This SW ISR can be quite time consuming. A new hardware > interrupt request may interrupt the SW ISR. > > When the SW ISR finally exits, the originally interrupted program is > resumed.
I've sometimes done that by setting the pending bit on an otherwise unused interrupt set at a low priority; Cortex-M does tail chaining, so RTI from an interrupt while another is pending is effectively just a jump and a change in priority.

Another (and tricky to get right) way is to add a stack frame with the new code's address and do an RTI, à la a task switch in an OS.
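With the CMSIS NVIC calls that looks roughly like the following; the spare vector number and handler names are assumptions (any interrupt the hardware never asserts will do), and it is the same deferral idea as the PendSV sketch earlier in the thread:

#include "device.h"                   /* vendor CMSIS header (placeholder)  */

#define SPARE_IRQn ((IRQn_Type)42)    /* a vector the hardware never asserts */

void deferred_init(void)
{
    NVIC_SetPriority(SPARE_IRQn, 0xFF);   /* lowest priority                */
    NVIC_EnableIRQ(SPARE_IRQn);
}

void FAST_IRQHandler(void)                /* the real, high-priority ISR    */
{
    /* ... urgent work ... */
    NVIC_SetPendingIRQ(SPARE_IRQn);       /* tail-chains into the slow part */
}

void SPARE_IRQHandler(void)               /* runs once nothing else pends   */
{
    /* ... time-consuming processing, interruptible by real IRQs ... */
}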
On 1/16/2023 12:30 PM, Gerhard Hoffmann wrote:
> Am 16.01.23 um 17:19 schrieb Don Y: >> On 1/16/2023 7:27 AM, Gerhard Hoffmann wrote: > >>> That was not a benchmark; that was a given large p-code machine >>> with the intent to use the same compilers everywhere. Not unlike >>> UCSD-Pascal. >> >> Everything that you use as an example of performance is a benchmark. > > That was not created as a benchmark. The goal was to have the > same Compiler and operating system on most of the upcoming > microssystems available. Not too unexpected for an operation > system department at a univ. > And when there were underperformers, that would not go > unnoticed.
You used <whatever> as a way of comparing the performance of different processors. It matters not why the application was *created*; you, HERE, used it as a metric for comparison, which makes it a benchmark. If I report how well my app runs on N different processors, I am doing so to demonstrate the "value" of those different processors when running my app. If I can assign a cost for each platform, then I can tell you where my app runs "most economically".
>> Do you care how fast your LISP machine runs? > > We had some ICL Perqs; not my cup of meat. > I had a small Prolog system on my Z80, funny but > nothing for real work. > > And yes, I was interested how fast my machines ran. > In the VLSI course, I talked a group of other students > into doing a stack machine much like Tanenbaum's, only > simpler in HP's dynamic NMOS process. Unluckily, we caught > a metal flake from a neighbor project that the DRC did not get.
So, you migrated EVERY instance over to the "best" processor? Or, was it just an academic exercise? I build things. I make design decisions based on how much performance I can get for a given outlay of money. When I was doing video games, the software folks clamored for a 2MHz design -- claiming the processor was only a dollar or two more expensive than the 1MHz part. But, the ramifications for the rest of the system were considerably more; what value a 2MHz part if it spends extra cycles waiting for each opcode fetch? So, why add a dollar to the BoM if you aren't going to get any performance increase for it? And, no one is going to justify adding all the other costs to make the 2MHz part actually run twice as fast as the 1MHz part.
>>> with a non-existing cache controller and cache rams that cost as much as the cpu. I got a feeling for the price of cache when I designed this:
>>> < https://www.flickr.com/photos/137684711@N07/52631074700/in/dateposted-public/ >
>>
>> TI is a chip vendor. They look at things differently than folks who *use* the chips.
>
> So is Intel.
Where's Fairchild, today? TI still struggles to produce mainstream (non-DSP) processors -- without ARM's IP.
>> NatSemi used to make DRAM controllers -- that cost more than the DRAM! "What does this do that a few multiplexers *won't*, for us?"
>>
>> When you buy things in volume, you're "paying for the plastic"; the die have a relatively insignificant part in the cost.
>>
>>>> Remember, the 9900/99K were from the "home computer" era. They lost out to a dog slow 8086!
>>>
>>> 8086 was NOT slow. Have you ever used an Olivetti M20 with a competently engineered memory system? That even challenged early ATs when protected mode was not needed.
>>
>> The 8086 (4.77MHz) was slower than a Z80, at the time. Because applications that could use its extra abilities were relatively few and far between.
>
> Ah, I had both of them, in the same 19" box.
How many did you sell?
>>>> Unless, of course, you were already planning on being busy doing something else, at that time. :> "Benchmarks lie"
>>>
>>> That benchmark was Unix System V, as licensed from Bell. Find something better to do when you need to swap.
>>
>> How many video games was it used in? Pinball machines? Medical devices? Process control systems? Navigation systems? etc. None of those run a UNIX kernel nor any of the sorts of algorithms that you'd *find* in a UNIX kernel.
>
> Pinball machines with a Fairchild Clipper? Do you have an idea what a Clipper module did cost? The machine was intended as a multi user Unix machine. I later got a paid project to build a VME bus terminal concentrator based on 80186 for it.
We looked at putting T11's and F11's in games.
> Why should I care about medical devices, video games, pinball > or navigation systems? GPS was an experiment at that time and > the 50 Baud navigation strings no problem for sure.
Because people make decisions based on the products that they sell. I'm not going to put a costly processor in a *mouse*. And, I'm not going to put a 5c processor in a process control system.
>> So, why would I look at the performance of a processor running >> UNIX... if my product is NOT running UNIX? > > The product WAS running UNIX. I wrote they had bought a > commercial source license from Bell.
How many did you sell? You can put a PLC in a product -- if the product can bear the cost/size of a PLC. *If* it "benchmarks appropriately" for your product (price, performance, development time, etc.). Yet, you might be able to provide the same functionality with a little SoC. How do you make that decision?
On 1/16/2023 11:10, Don Y wrote:
> On 1/15/2023 7:10 AM, Dimiter_Popoff wrote: >> How many registers does it stack automatically? I knew the HLL nonsense >> would catch up with CPU design eventually. > > IIRC, it stacks PC, PSR, LR (i.e., R14) & R[0-3].
Lasse also mentioned it, I see, and it makes sense; I did not realize this was a "small" flavour of ARM, as I am not familiar with any ARM.
>
> PC and PSR must be preserved, of course (if you have a special shadow register for each, that's just an optimization -- that only works if ISRs can't be interrupted. Remember, you can ALWAYS throw an exception, even in an ISR! E.g. the "push" can signal a page fault). The link register (think "BAL" -- what's old is now new! :> ) determines *how* the ISR terminates.
On power, you get the PC and the MSR saved in special purpose registers assigned to do exactly that. You have to stack them yourself, even that. You don't even have a stack pointer, it is up to you to assign one of the GPRs to that. The 603e core maximizes delegation to a huge extent and this is very convenient when you know what you are doing.

Even when you get a TLB miss, all you get is an exception plus r0-r3 being switched to a shadow bank, Z80 style, meant for you to locate the entry in the page translation table and put it in the TLB; if not in the table you do a page fault, go into allocation, swapping if needed, fixing etc., you know how it goes. You don't need to stack anything, the 4 regs you have are enough to do the TLB fix.

Getting a page fault in an ISR is hugely problematic; if this is possible it compromises the entire design (so much for interrupt latency). In dps for power there is a "block translated area" (no page translation for it, it is always there) where interrupt handling code is to be placed.

And there are 3 stack pointers in dps for 32 bit power: user, supervisor and interrupt. The interrupt stack pointer is always translated (also in BAT memory) and any exception first stacks a few registers in that interrupt stack; then it can switch to say the supervisor stack pointer and go on to preempt a task, just do a system call etc.
> .....
>
>> Good CPU design still means load/store machines, stacking *nothing* at IRQ, just saving PC and CCR to special purpose regs which can be stacked as needed by the IRQ
>
> What do you do if you throw an exception BEFORE (or while!) doing that stacking? Does the CPU panic? :> (e.g., a double fault on a 68k!)
I talked about this above, must have anticipated the question :). > ....
>
>> Some *really* well designed for control applications processors allow you to lock a part of the cache but I doubt ARM have that, they seem to have gone the way "make programming a two click job" to target a wider audience.
>
> The "application processors" most definitely let you exert control over the cache -- as well as processor affinity.
>
> But, you *really* need to be wary about doing this as it sorely impacts the utility of those mechanisms on the rest of your code! I.e., if you wire-down part of the cache to expedite an ISR, then you have forever taken that resource away from the rest of your code to use. Are you smart enough to know how to make that decision, "in general" (specific cases are a different story)?
The power core I mostly use can lock parts of the cache, IIRC in 1/4 (i.e. 4k) increments. I have never used that though. I was somewhat surprised that ARM has the ability to truly prioritize interrupts, 68k style. Both you and Lasse said that; this is an important thing to have.
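On an application processor running Linux, the affinity half of what Don mentions above is a single call; cache lockdown, where it exists at all, is core- and vendor-specific. A sketch, assuming glibc's sched_setaffinity():

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to one core so its working set stays in that
   core's private caches and other threads' noise doesn't evict it. */
static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof set, &set);   /* 0 = calling thread */
}

int main(void)
{
    if (pin_to_core(1) != 0)
        perror("sched_setaffinity");
    /* ... latency-sensitive work ... */
    return 0;
}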
Monday, January 16, 2023 at 21:53:01 UTC+1, Dimiter Popoff wrote:
> On 1/16/2023 11:10, Don Y wrote: > > On 1/15/2023 7:10 AM, Dimiter_Popoff wrote: > >> How many registers does it stack automatically? I knew the HLL nonsense > >> would catch up with CPU design eventually. > > > > IIRC, it stacks PC, PSR, LR (i.e., R14) & R[0-3]. > Lasse also mentioned it, I see and it makes sense; I did not realize > this was a "small" flavour of ARM, I am not familiar with any ARM.
Now there are basically two types: Cortex-Mx, a 32-bit microcontroller where, with increasing x, features are added like DSP instructions and single- and double-precision FPU; and Cortex-A, a 32/64-bit CPU used in cell phones etc.
> > > > PC and PSR must be preserved, of course (if you have a special shadow > > register for each, that's just an optimization -- that only works if > > ISRs can't be interrupted. Remember, you can ALWAYS throw an exception, > > even in an ISR! E.g. the "push" can signal a page fault). The link > > register (think "BAL" -- what's old is now new! :> ) determines *how* > > the ISR terminates. > On power, you get the PC and the MSR saved in special purpose registers > assigned to do exactly that. You have to stack them yourself, even that. > You don't even have a stack pointer, it is up to you to assign one > of the GPR-s to that. The 603e core maximizes delegation to a huge > extent and this is very convenient when you know what you are doing. > Even when you get a TLB miss, all you get is an exception plus > r0-r3 being switched to a shadow bank, Z80 style, meant for you > to locate the entry in the page translation table and put it in the > TLB; if not in the table you do a page fault, go into allocation, > swapping if needed, fixing etc., you know how it goes. You don't need > to stack anything, the 4 regs you have are enough to do the TLB fix. >
Looks similar to the older generation ARM, the ARM7-TDMI: the stack pointer was also just a GP register defined as the stack pointer.

It had one IRQ that only shadowed the return address and status register, and an FIQ (fast) that shadowed the return address, status register, and (AFAIR) seven general purpose registers.

Quite a bit of code was needed to find the interrupt source, do the stacking, etc., and even worse if preemption was needed.