
highest frequency periodic interrupt?

Started by John Larkin January 13, 2023
On 1/15/2023 7:10 AM, Dimiter_Popoff wrote:
> How many registers does it stack automatically? I knew the HLL nonsense
> would catch up with CPU design eventually.
IIRC, it stacks PC, PSR, LR (i.e., R14) & R[0-3]. PC and PSR must be preserved, of course (if you have a special shadow register for each, that's just an optimization -- one that only works if ISRs can't be interrupted. Remember, you can ALWAYS throw an exception, even in an ISR! E.g., the "push" can signal a page fault). The link register (think "BAL" -- what's old is now new! :> ) determines *how* the ISR terminates.

So, the "overhead" that the processor assumes for an ISR is really just 4 (of the 12) general purpose registers. When you consider how much "state" the processor holds, this isn't really excessive. A routine coded in a HLL would likely be using ALL of the registers (though use of an "interrupt" keyword could give it a hint to only use those that it knows are already preserved; this would vary based on the targeted processor). And, the compiler could dynamically decide whether adding code to protect some additional register(s) will offset the performance gains possible by using those extra registers *in* the ISR code... that *it* is generating!

The bigger concern (for me) is worrying about which buses I'm calling on during the execution of the ISR and what other cores might be doing ON those buses at the same time (e.g., if I'm accessing a particular I/O to query/set a GPIO, is another core accessing that same I/O -- for a different GPIO?). You never had to worry about this in single-core architectures (excepting the presence of another "bus master").
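As a concrete illustration: because the hardware-stacked frame on Cortex-M covers the registers a C function is allowed to clobber (R0-R3, R12, LR, PC, xPSR), a plain C function can sit directly in the vector table. A minimal sketch -- the device registers and handler name here are invented placeholders, not a real part:

#include <stdint.h>

#define TIMER_STATUS  (*(volatile uint32_t *)0x40001000u)   /* hypothetical */
#define TIMER_CLEAR   (*(volatile uint32_t *)0x40001004u)   /* hypothetical */

volatile uint32_t tick_count;

void TIMER_IRQHandler(void)     /* no prologue/epilogue asm, no "interrupt"
                                   keyword: the NVIC already saved what a C
                                   function may clobber on exception entry */
{
    TIMER_CLEAR = 1u;           /* acknowledge the interrupt source */
    tick_count++;               /* do the minimum and return */
}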
> Good CPU design still means
> load/store machines, stacking *nothing* at IRQ, just saving PC and CCR
> to special purpose regs which can be stacked as needed by the IRQ
What do you do if you throw an exception BEFORE (or while!) doing that stacking? Does the CPU panic? :> (e.g., a double fault on a 68k!) [Remember, the exception handling/trap interface uses the same mechanisms as that of an IRQ -- they are just instantiated by different sources!]
> routine, along with registers to be used in it. Memory accesses are
> the bottleneck, and with HLL code being bloated as it is chances
> are some cache will have to be flushed to make room for stacking.
Of course! But, invoking the ISR will likely also displace some contents of the cache -- unless your entire ISR fits in a single cache line *and* is already in the cache. (that includes the DATA that your ISR may need, as well) Remember, the whole need for cache is because processors are SO much faster than memory!
> Some *really* well designed for control applications processors allow
> you to lock a part of the cache but I doubt ARM have that, they seem to
> have gone the way "make programming a two click job" to target a
> wider audience.
The "application processors" most definitely let you exert control over the cache -- as well as processor affinity. But, you *really* need to be wary about doing this as it sorely impacts the utility of those mechanisms on the rest of your code! I.e., if you wire-down part of the cache to expedite an ISR, then you have forever taken that resource away from the rest of your code to use. Are you smart enough to know how to make that decision, "in general" (specific cases are a different story)? The Z80 (et al.) had an "alternate register set". So, one could EX AF,AF' EXX at the top of an ISR -- and again, just before exit -- to preserve (and restore) the current contents of the (main!) register set. But, this means only one ISR can be active at a time (no nesting). Or, requires only a specific ISR to be active (and never interrupting itself) as the alternate register set is indistinguishable from the "regular" register set. Q: Are you willing to live without the use of the alternate registers *in* your code, just for the sake of *an* ISR? [I've had a really hard time NOT assigning specific cores to specific portions of the design -- e.g., letting one core just handle the RMI mechanism. I'm not sure that I can predict how effective such an assignment would be vs. letting the processor *dynamically* adjust to the load, AT THAT TIME.] Other processors have register *banks* that you can switch to/from to expedite context switches. Same sort of restrictions apply. The 99k allowed you to switch "workspaces" efficiently. But, as workspaces resided in RAM (screwed up THAT one, eh, TI?)... Processors with tiny states (680x with A/B and index) don't really have much to preserve. OTOH, they are forever loading and storing just to get anything done -- no place to "hold onto" results INSIDE the CPU. So, has their lack of internal state made them BETTER workhorses? Or, just lessened the work required in an ISR (because they aren't very *capable*, otherwise)?
On 15/01/2023 10:11, Don Y wrote:
> On 1/15/2023 2:48 AM, Martin Brown wrote:
>> I prefer to use RDTSC for my Intel timings anyway.
>>
>> On many of the modern CPUs there is a freerunning 64 bit counter
>> clocked at once per cycle. Intel deprecates using it for such purposes
>> but I have never found it a problem provided that you bracket it
>> before and after with CPUID to force all the pipelines into an empty
>> state.
>>
>> The equivalent DWT_CYCCNT on the Arm CPUs that support it is described
>> here:
>>
>> https://stackoverflow.com/questions/32610019/arm-m4-instructions-per-cycle-ipc-counters
>>
>> I prefer hard numbers to a vague scope trace.
>
> Two downsides:
> - you have to instrument your code (but, if you're concerned with
>   performance, you've already done this as a matter of course)
You have to make a test framework to exercise the code in as realistic a manner as you can - that isn't quite the same as instrumenting the code (although it can be).

I have never found profile directed compilers to be the least bit useful on my fast maths codes because their automatic code instrumentation breaks the very code that it is supposed to be testing (in the sense of wrecking cache lines and locality etc.).

The only profiling method I have found to work reasonably well is probably by chance the highest frequency periodic ISR I have ever used in anger which was to profile code by accumulating a snapshot of PC addresses allowing just a few machine instructions to execute at a time. It used to work well back in the old days when 640k was the limit and code would reliably load into exactly the same locations every run.

It is a great way to find the hotspots where most time is spent.
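A rough sketch of that PC-sampling profiler in C -- the timer setup and the plumbing that delivers the interrupted PC are omitted, and the names and bucket size are invented for illustration:

#include <stdint.h>

#define NBUCKETS  4096u          /* histogram resolution (assumed) */
static uint32_t hits[NBUCKETS];

extern uintptr_t code_base;      /* load address of the code under test */

/* Called from a hypothetical high-rate periodic tick with the PC taken
   from the interrupted context's stacked frame.  The result is a
   statistical profile: the hot buckets are where the time goes. */
void profile_tick(uintptr_t interrupted_pc)
{
    uintptr_t off = interrupted_pc - code_base;
    hits[(off >> 4) % NBUCKETS]++;    /* 16-byte granularity */
}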
> - it doesn't tell you about anything that happens *before* the code runs
>   (e.g., latency between event and recognition thereof)
True enough. Sometimes you need a logic analyser for weird behaviour - we once caught a CPU chip where RTI didn't always do what it said on the tin and the instruction following the RTI instruction got executed with a frequency of about 1:10^8. They replaced all the faulty CPUs FOC but we had to sign a non-disclosure agreement.
>> If I'm really serious about finding out why something is unusually
>> slow I run a dangerous system level driver that allows me full access
>> to the model specific registers to monitor cache misses and pipeline
>> stalls.
>
> But, those results can change from instance to instance (as can latency,
> execution time, etc.).  So, you need to look at the *distribution* of
> values and then think about whether that truly represents "typical"
> and/or *worst* case.
It just means that you have to collect an array of data and take a look at it later and offline. Much like you would when testing that a library function does exactly what it is supposed to.
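A rough sketch of that collect-now, analyse-offline approach on x86, using GCC/Clang's __rdtsc() with a CPUID bracket as described above (the sample count and the measured function are placeholders):

#include <stdint.h>
#include <x86intrin.h>                 /* __rdtsc() on GCC/Clang */

static inline void serialize(void)
{
    /* CPUID is a serializing instruction: it drains the pipeline so the
       RDTSC reads aren't reordered around the code being measured. */
    uint32_t a = 0, b, c = 0, d;
    __asm__ __volatile__("cpuid" : "+a"(a), "=b"(b), "+c"(c), "=d"(d));
}

#define NSAMPLES 1000
uint64_t samples[NSAMPLES];            /* look at the distribution later */

void time_it(void (*fn)(void))
{
    for (int i = 0; i < NSAMPLES; i++) {
        serialize();
        uint64_t t0 = __rdtsc();
        fn();
        serialize();
        samples[i] = __rdtsc() - t0;
    }
}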
> Relying on exact timings is sort of naive; it ignores how much
> things can vary with the running system (is the software in a
> critical region when the ISR is invoked?) and the running
> *hardware* (multilevel caches, etc.)
It is quite unusual to see bad behaviour from the multilevel caches but it can add to the variance. You always get a few outliers here and there in user code if a higher level disk or network interrupt steals cycles.
> Do you have a way of KNOWING when your expectations (which you
> have now decided are REQUIREMENTS!) are NOT being met?  And, if so,
> what do you do (at runtime) with that information?  ("I'm sorry,
> one of my basic assumptions is proving to be false and I am not
> equipped to deal with that...")
Instrumenting for timing tests is very much development rather than production code, i.e., is it fast enough or do we have to work harder?

Like you I prefer HLL code but I will use ASM if I have to or there is no other way (like wanting 80 bit reals in the MS compiler). Actually I am working on a class library to allow somewhat clunky access to it.

They annoyingly zapped access to 80 bit reals in v6, I think it was, for "compatibility" reasons since SSE2 and later can only do 64bit reals.
> Esp given that your implementation will likely evolve and
> folks doing that work may not be as focused as you were on
> this specific issue...
That will be their problem not mine ;-)

--
Regards, Martin Brown
On 15/01/2023 14:10, Dimiter_Popoff wrote:
> On 1/15/2023 12:48, Lasse Langwadt Christensen wrote:
>> On Sunday, 15 January 2023 at 06:10:24 UTC+1, upsid...@downunder.com wrote:
>>> If the processor has separate FP registers and/or separate FP status
>>> words, avoid using FP registers in ISRs.
Generally good advice unless the purpose of the interrupt is to time share the available CPU and FPU between various competing numerical tasks. Cooperative multitasking has lower overheads if you can do it. For my money ISRs should do as little as possible at such a high privilege level although checking if their interrupt flag is already set again before returning is worthwhile for maximum burst transfer speed.
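A sketch of that "check the flag again before returning" idea -- the device registers, bit masks and FIFO helper are invented placeholders, and the burst limit guards against a stuck source trapping you in the handler:

#include <stdint.h>

#define UART_STATUS  (*(volatile uint32_t *)0x40002000u)   /* hypothetical */
#define UART_DATA    (*(volatile uint32_t *)0x40002004u)   /* hypothetical */
#define RX_READY     (1u << 0)
#define BURST_MAX    16u

extern void fifo_put(uint8_t byte);    /* drained by a background task */

void uart_rx_isr(void)
{
    uint32_t n = 0;

    /* Keep draining while the request is still asserted: during a burst
       this saves the exit/re-entry overhead of one interrupt per byte. */
    do {
        fifo_put((uint8_t)UART_DATA);           /* reading DATA clears the flag */
    } while ((UART_STATUS & RX_READY) && ++n < BURST_MAX);
}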
>>> Some compilers may have "interrupt" keywords or similar extensions and
>>> the compiler knows which registers need to be saved in the ISR. To
>>> help the compiler, include all functions that are called by the ISR in
>>> the same module (preferably in-lined) prior to the ISR, so that the
>>> compiler knows what needs to be saved. Do not call external library
>>> routines from ISR, since the compiler doesn't know which registers
>>> need to be saved and saves all.
>>
>> cortex-m automatically stack the registers needed to call a regular C
>> function and if it has an FPU it supports "lazy stacking" which means
>> it keeps track of whether the FPU is used and only stack/un-stack them
>> when they are used
>>
>> it also knows that if another interrupt is pending at ISR exit it
>> doesn't need to un-stack/stack before calling the other interrupt
>
> How many registers does it stack automatically? I knew the HLL nonsense
> would catch up with CPU design eventually. Good CPU design still means
> load/store machines, stacking *nothing* at IRQ, just saving PC and CCR
> to special purpose regs which can be stacked as needed by the IRQ
> routine, along with registers to be used in it. Memory accesses are
> the bottleneck, and with HLL code being bloated as it is chances
> are some cache will have to be flushed to make room for stacking.
> Some *really* well designed for control applications processors allow
> you to lock a part of the cache but I doubt ARM have that, they seem to
> have gone the way "make programming a two click job" to target a
> wider audience.
Actually there were processors which took the exact opposite position quite early on and they were incredibly good for realtime performance, but their registers were no different to ram - they were *in* ram, as was the program counter return address. There was a master register workspace pointer and 16 registers; the TI TMS9900 series, for instance.

https://en.wikipedia.org/wiki/Texas_Instruments_TMS9900#Architecture

I didn't properly appreciate at the time quite how good this trick was for realtime work until we tried to implement the same algorithms on the much later and on paper faster 68000 series of CPUs.

--
Regards, Martin Brown
On 1/16/2023 3:23 AM, Martin Brown wrote:
> For my money ISRs should do as little as possible at such a high privilege
> level although checking if their interrupt flag is already set again before
> returning is worthwhile for maximum burst transfer speed.
+42 on both counts.

OTOH, you have to be wary of misbehaving hardware (or, unforeseen circumstances) causing the ISR to loop continuously. Many processors will give one (a couple?) of instructions in the background a chance to execute (after RTI), even if there is a "new" IRQ pending. So, you can gradually make *some* progress. If, instead, you let the ISR loop, then you're stuck there...

I like to *quickly* reenable interrupts and take whatever measures needed to ensure the work that *I* need to do will get done, properly, even if postponed by a newer IRQ. This can be treacherous if a series of different IRQ sources conspire to interrupt each other and leave you interrupted by *yourself*, later!
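One way to sketch that "acknowledge, note the work, get out" pattern -- ack_device(), enable_interrupts() and the use of GCC's __atomic builtins are platform assumptions, not any particular vendor's API:

#include <stdint.h>

extern void ack_device(void);          /* silence the interrupt source */
extern void enable_interrupts(void);   /* CPU-level, platform specific */

static volatile uint32_t work_pending; /* bumped by ISR, drained by task */

void device_isr(void)
{
    ack_device();                      /* stop it re-asserting immediately */
    __atomic_fetch_add(&work_pending, 1, __ATOMIC_RELAXED);
    enable_interrupts();               /* let other sources back in while any
                                          remaining bookkeeping runs */
}

void background_task(void)
{
    while (__atomic_load_n(&work_pending, __ATOMIC_RELAXED) != 0) {
        /* ...the real (possibly slow) processing goes here... */
        __atomic_fetch_sub(&work_pending, 1, __ATOMIC_RELAXED);
    }
}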
>> How many registers does it stack automatically? I knew the HLL nonsense
>> would catch up with CPU design eventually. Good CPU design still means
>> load/store machines, stacking *nothing* at IRQ, just saving PC and CCR
>> to special purpose regs which can be stacked as needed by the IRQ
>> routine, along with registers to be used in it. Memory accesses are
>> the bottleneck, and with HLL code being bloated as it is chances
>> are some cache will have to be flushed to make room for stacking.
>> Some *really* well designed for control applications processors allow
>> you to lock a part of the cache but I doubt ARM have that, they seem to
>> have gone the way "make programming a two click job" to target a
>> wider audience.
>
> Actually there were processors which took the exact opposite position quite
> early on and they were incredibly good for realtime performance but their
> registers were no different to ram - they were *in* ram so was the program
> counter return address. There was a master register workspace pointer and 16
> registers TI TMS9900 series for instance.
>
> https://en.wikipedia.org/wiki/Texas_Instruments_TMS9900#Architecture
Guttag came out to pitch the 99K, in person (we used a lot of MPUs). But, at that time (early 80's?), memory access times were already starting to lag behind the cycle times of the processors of that day. This was, IMHO, a bad technological prediction on TI's part.

[IIRC, they also predicted sea-of-gates would be the most economical semi-custom approach (they actually proposed a "sea of inverters" wired in a mask layer much like DTL)]
> I didn't properly appreciate at the time quite how good this trick was for
> realtime work until we tried to implement the same algorithms on the much later
> and on paper faster 68000 series of CPUs.
On 1/16/2023 3:21 AM, Martin Brown wrote:
> On 15/01/2023 10:11, Don Y wrote:
>> On 1/15/2023 2:48 AM, Martin Brown wrote:
>>> I prefer to use RDTSC for my Intel timings anyway.
>>>
>>> On many of the modern CPUs there is a freerunning 64 bit counter clocked at
>>> once per cycle. Intel deprecates using it for such purposes but I have never
>>> found it a problem provided that you bracket it before and after with CPUID
>>> to force all the pipelines into an empty state.
>>>
>>> The equivalent DWT_CYCCNT on the Arm CPUs that support it is described here:
>>>
>>> https://stackoverflow.com/questions/32610019/arm-m4-instructions-per-cycle-ipc-counters
>>>
>>> I prefer hard numbers to a vague scope trace.
>>
>> Two downsides:
>> - you have to instrument your code (but, if you're concerned with performance,
>>   you've already done this as a matter of course)
>
> You have to make a test framework to exercise the code in as realistic a manner
> as you can - that isn't quite the same as instrumenting the code (although it
> can be).
It depends on how visible the information of interest is to outside observers. If you have to "do something" to make it so, then you may as well put in the instrumentation and get things as you want them.
> I have never found profile directed compilers to be the least bit useful on my
> fast maths codes because their automatic code instrumentation breaks the very
> code that it is supposed to be testing (in the sense of wrecking cache lines
> and locality etc.).
Exactly. The same holds true of adding invariants to code; removing them (#ifndef DEBUG) changes the code -- subtly but nonetheless.

So, you have to put in place two levels of final test:
- check to see if you THINK it will pass REAL final test
- actually DO the final test

When installing copy protection/anti-tamper mechanisms in products, there's a time when you've just enabled them and, thus, changed how the product runs. If it *stops* running (properly), you have to wonder if your "measures" are at fault or if some latent bug has crept in, aggravated by the slight differences in execution patterns.
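To make the invariant point concrete: the checked build and the build with the checks compiled out are genuinely different programs (code size, timing, register pressure), so each needs its own test pass. A sketch; the FIFO function is just an example:

#include <assert.h>

#ifdef NDEBUG
#define INVARIANT(cond)  ((void)0)     /* release: the check vanishes entirely */
#else
#define INVARIANT(cond)  assert(cond)  /* debug: the check (and its effect on
                                          layout/timing) is present            */
#endif

int fifo_pop(int *buf, unsigned *head, unsigned tail)
{
    INVARIANT(*head != tail);          /* caller promised "not empty" */
    return buf[(*head)++ & 0xFFu];
}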
> The only profiling method I have found to work reasonably well is probably by
> chance the highest frequency periodic ISR I have ever used in anger which was
> to profile code by accumulating a snapshot of PC addresses allowing just a few
> machine instructions to execute at a time. It used to work well back in the old
> days when 640k was the limit and code would reliably load into exactly the same
> locations every run.
>
> It is a great way to find the hotspots where most time is spent.
IMO, this is where logic analyzers shine. I don't agree with using them to "trace code" (during debug) as there are better ways to get that information. But, *watching* to see how code runs (passively) can be a real win. Especially when you are trying to watch for RARE aberrant behavior.
>> - it doesn't tell you about anything that happens *before* the code runs
>>   (e.g., latency between event and recognition thereof)
>
> True enough. Sometimes you need a logic analyser for weird behaviour - we once
> caught a CPU chip where RTI didn't always do what it said on the tin and the
> instruction following the RTI instruction got executed with a frequency of
> about 1:10^8. They replaced all the faulty CPUs FOC but we had to sign a
> non-disclosure agreement.
>
>>> If I'm really serious about finding out why something is unusually slow I
>>> run a dangerous system level driver that allows me full access to the model
>>> specific registers to monitor cache misses and pipeline stalls.
>>
>> But, those results can change from instance to instance (as can latency,
>> execution time, etc.).  So, you need to look at the *distribution* of
>> values and then think about whether that truly represents "typical"
>> and/or *worst* case.
>
> It just means that you have to collect an array of data and take a look at it
> later and offline. Much like you would when testing that a library function
> does exactly what it is supposed to.
Yes. So, you either have the code do the collection (using a black box) *or* have to have an external device (logic analyzer) that can collect it for you. The former is nice because the code can actually make decisions (at run time) that a passive observer often can't (because the observer can't see all of the pertinent data). But, that starts to have a pronounced impact on the *intended* code...
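A minimal "black box" sketch of that in-code collection -- the buffer size and the timestamp source are assumptions, cheap enough to leave in the running system and read out later from a debugger or dump:

#include <stdint.h>

#define BB_SIZE 256u                        /* power of two */

struct bb_entry { uint32_t when; uint16_t event; uint16_t arg; };

static struct bb_entry blackbox[BB_SIZE];
static volatile uint32_t bb_head;

extern uint32_t cycle_counter(void);        /* e.g. a free-running hw timer */

void bb_log(uint16_t event, uint16_t arg)
{
    uint32_t i = bb_head++ & (BB_SIZE - 1u);
    blackbox[i].when  = cycle_counter();
    blackbox[i].event = event;
    blackbox[i].arg   = arg;
}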
>> Relying on exact timings is sort of naive; it ignores how much
>> things can vary with the running system (is the software in a
>> critical region when the ISR is invoked?) and the running
>> *hardware* (multilevel caches, etc.)
>
> It is quite unusual to see bad behaviour from the multilevel caches but it can
> add to the variance. You always get a few outliers here and there in user code
> if a higher level disk or network interrupt steals cycles.
Being an embedded system developer, I find the issues that muck up execution are often periodic -- but with periods that are varied enough that they only beat against the observed phenomenon occasionally.

I am always amused by folks WHO OBSERVE A F*CKUP. Then, when they can't reproduce it or identify a likely cause, ACT AS IF IT NEVER HAPPENED! Sheesh, you're not relying on some third-hand report of an anomaly... YOU SAW IT! How can you pretend it didn't happen?
>> Do you have a way of KNOWING when your expectations (which you
>> have now decided are REQUIREMENTS!) are NOT being met?  And, if so,
>> what do you do (at runtime) with that information?  ("I'm sorry,
>> one of my basic assumptions is proving to be false and I am not
>> equipped to deal with that...")
>
> Instrumenting for timing tests is very much development rather than production
> code, i.e., is it fast enough or do we have to work harder?
Again, depends on the code and application. Few (interesting!) systems have any sort of "steady state". Rather, they have to react to a variety of circumstances occurring "whenever" they choose. The specifications rarely say "in this, that or the-other situation, these timing constraints do not apply". And, testing for every possible pile-up of events is just not conceivable.

The alternative is to rethink your deadlines (avoid HARD deadlines because so few things truly *are* hard!) and how you would recover from a missed (or delayed) deadline. I develop with deadline support in the code so the code can sort out how to react to situations that I can't foresee nor test for.

A deadline handler may not be invoked in a timely fashion (if it *was*, then why not just code the actual *task* as the deadline handler and get THAT guarantee?! :> ). But, at least it lets the code/system realize that something unplanned/unintended *has* happened: "Whaddya gonna do about it?"
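A rough sketch of what "deadline support in the code" can look like -- the API names are invented, and the handler is only a notification that an assumption broke, not a guarantee of timeliness:

#include <stdint.h>
#include <stdbool.h>

typedef void (*deadline_fn)(void *ctx);

struct deadline {
    uint32_t    expires_at;     /* in timer ticks */
    deadline_fn on_miss;        /* "whaddya gonna do about it?" */
    void       *ctx;
    bool        armed;
};

extern uint32_t now_ticks(void);

/* Called from a periodic tick or scheduler hook. */
void deadline_check(struct deadline *d)
{
    if (d->armed && (int32_t)(now_ticks() - d->expires_at) >= 0) {
        d->armed = false;
        d->on_miss(d->ctx);     /* degrade, log, shed load, etc. */
    }
}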
> Like you I prefer HLL code but I will use ASM if I have to or there is no other
> way (like wanting 80 bit reals in the MS compiler). Actually I am working on a
> class library to allow somewhat clunky access to it.
"Measure and THEN optimize". Let the compiler take a stab at it. If you discover (measurement) that it's not meeting your expectations (requirements), figure out why and the possible approaches you can take to remedy that. I designed a barcode reader that had, as a single input, the "video" from the photodetector routed to an IRQ pin. Program the IRQ to sense the white-to-black edge and wait. In ISR, capture the system time from a high resolution timer; program the IRQ to sense the black-to-white edge and wait. Lather, rinse, repeat (you never know *when* a user may try to read a barcode!). A background task would watch the FIFO maintained by the IRQ and pull data out of it (atomically) to keep the FIFO from overflowing as well as get a head start on decoding the "transition times" into "width intervals". Competing IRQs would introduce lots of latency into the captured system times. So, I modified the ISRs to also tell me how long ago the actual transition occured. A bit more work for the ISRs and similarly for the task monitoring the FIFO. But, it allowed me to capture barcodes with features as small as 0.007" at 100IPS... on a 2MHz 8b CPU. A modern compiler could probably generate "as effective" code; the real performance gain was obtained by changing the algorithm instead of the implementation language.
> They annoyingly zapped access to 80 bit reals in v6 I think it was for
> "compatibility" reasons since SSE2 and later can only do 64bit reals.
I had an early product use BCD data formats (supported by an ancient compiler). When *that* support went away, it was a real nightmare to go through and rework everything to use bare ints (and have to bin-to-bcd all the time)
>> Esp given that your implementation will likely evolve and
>> folks doing that work may not be as focused as you were on
>> this specific issue...
>
> That will be their problem not mine ;-)
Ah, most of my projects are prototypes or proof of concept. So, my code *will* be reworked. If that proves to be hard, folks won't recommend me to other clients! :> [So, I make it REALLY easy for folks to Do The Right Thing (in my opinion of "right") to maximize their chance of getting good results. If you want to reinvent the wheel, then don't fret if you *break* it - cuz you can SEE that it worked!!]
On 16.01.23 11:23, Martin Brown wrote:

> Actually there were processors which took the exact opposite position
> quite early on and they were incredibly good for realtime performance
> but their registers were no different to ram - they were *in* ram so was
> the program counter return address. There was a master register
> workspace pointer and 16 registers TI TMS9900 series for instance.
>
> https://en.wikipedia.org/wiki/Texas_Instruments_TMS9900#Architecture
>
> I didn't properly appreciate at the time quite how good this trick was
> for realtime work until we tried to implement the same algorithms on the
> much later and on paper faster 68000 series of CPUs.
At TU Berlin we had a place called the Zoo where there was at least one sample of each CPU family. We used the Zoo to port Andrew Tanenbaum's Experimental Machine to all of them under equal conditions. That was a p-code engine from the Amsterdam Free University Compiler Kit.

The 9900 was slowest, by a large margin, Z80-league. Having no cache AND no registers was a braindead idea.

Some friends built a hardware machine around the Fairchild Clipper. They found out that moving the hard disk driver just a few bytes made a difference between speedy and slow as molasses. When the data was through under the head you had to wait for another disc revolution.

It turned out that Fairchild simply lasered away some faulty cache lines and sold it. No warning given. It was entertaining to see, not being in that project.

Gerhard
On 1/16/2023 4:25 AM, Don Y wrote:
> On 1/16/2023 3:23 AM, Martin Brown wrote:
>> For my money ISRs should do as little as possible at such a high privilege
>> level although checking if their interrupt flag is already set again before
>> returning is worthwhile for maximum burst transfer speed.
>
> +42 on both counts.
>
> OTOH, you have to be wary of misbehaving hardware (or, unforeseen
> circumstances) causing the ISR to loop continuously.
E.g., I particularly object to folks trying to detect counter wrap by:

   do {
      high = read(HIGH)
      low = read(LOW)
   } while ( high != read(HIGH) )

and similar. What *guarantees* do you have that this will ever complete? (yeah, unlikely for it to hang here, but not *impossible*!)

Better:

   high = read(HIGH)
   low = read(LOW)
   if ( high != read(HIGH) ) {
      high = high + 1
      low = 0
   }

or similar (e.g., high = high; low = LOWMAX -- depending on how you want to bias the approximation).
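In C, the bounded version might look like this (COUNTER_HIGH/COUNTER_LOW are placeholders for a free-running 64-bit counter exposed as two 32-bit halves):

#include <stdint.h>

extern volatile uint32_t COUNTER_HIGH, COUNTER_LOW;   /* hypothetical device */

uint64_t read_counter64(void)
{
    uint32_t high = COUNTER_HIGH;
    uint32_t low  = COUNTER_LOW;

    if (high != COUNTER_HIGH) {
        /* The counter carried between the two reads.  Rather than spin
           until the reads agree, accept a bounded approximation. */
        high = high + 1;
        low  = 0;
    }
    return ((uint64_t)high << 32) | low;
}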
On 1/16/2023 5:04 AM, Gerhard Hoffmann wrote:
> On 16.01.23 11:23, Martin Brown wrote:
>
>> Actually there were processors which took the exact opposite position quite
>> early on and they were incredibly good for realtime performance but their
>> registers were no different to ram - they were *in* ram so was the program
>> counter return address. There was a master register workspace pointer and 16
>> registers TI TMS9900 series for instance.
>>
>> https://en.wikipedia.org/wiki/Texas_Instruments_TMS9900#Architecture
>>
>> I didn't properly appreciate at the time quite how good this trick was for
>> realtime work until we tried to implement the same algorithms on the much
>> later and on paper faster 68000 series of CPUs.
>
> At TU Berlin we had a place called the Zoo where there was
> at least one sample of each CPU family. We used the Zoo to
> port Andrew Tanenbaum's Experimental Machine to all of them
> under equal conditions. That was a p-code engine from the
> Amsterdam Free University Compiler Kit.
>
> The 9900 was slowest, by a large margin, Z80-league.
But, only for THAT particular benchmark. I learned, early on, that I could create a benchmark for damn near any two processors to make *either* (my choice) look better, just by choosing the conditions of the test. [And, as I was often designing the hardware, my "input" carried a lot of weight]

Are we holding clock frequency constant? Memory access time? Code size? Memory dollars? Board space? "Throughput"? Algorithm? etc.

I tend to like processors with lots of internal registers from my days writing ASM; it was an acquired skill to be able to think about how to design an algorithm so you could keep everything *in* the processor -- instead of having to constantly load/store/reload.

But, moving away from ASM, I'm less concerned as to what the programmer's model looks like. I'm more interested in what the architecture supports and how easy it is for me to make use of those mechanisms (in hardware and software).
> Having no cache AND no registers was a braindead idea.
They could argue that adding cache was a logical way to design "systems" with the device. Remember, the 9900/99K were from the "home computer" era. They lost out to a dog slow 8086!
> Some friends built a hardware machine around the Fairchild
> Clipper. They found out that moving the hard disk driver
> just a few bytes made a difference between speedy and slow
> as molasses. When the data was through under the head you had
> to wait for another disc revolution.
Unless, of course, you were already planning on being busy doing something else, at that time. :> "Benchmarks lie"
> It turned out that Fairchild simply lasered away some faulty
> cache lines and sold it. No warning given.
> It was entertaining to see, not being in that project.
On 16.01.23 13:25, Don Y wrote:
> On 1/16/2023 5:04 AM, Gerhard Hoffmann wrote:
>> On 16.01.23 11:23, Martin Brown wrote:
>>> I didn't properly appreciate at the time quite how good this trick
>>> was for realtime work until we tried to implement the same algorithms
>>> on the much later and on paper faster 68000 series of CPUs.
>>
>> At TU Berlin we had a place called the Zoo where there was
>> at least one sample of each CPU family. We used the Zoo to
>> port Andrew Tanenbaum's Experimental Machine to all of them
>> under equal conditions. That was a p-code engine from the
>> Amsterdam Free University Compiler Kit.
>>
>> The 9900 was slowest, by a large margin, Z80-league.
>
> But, only for THAT particular benchmark.  I learned, early on,
> that I could create a benchmark for damn near any two processors
> to make *either* (my choice) look better, just by choosing
> the conditions of the test.
That was not a benchmark; that was a given large p-code machine with the intent to use the same compilers everywhere. Not unlike UCSD-Pascal.
>> Having no cache AND no registers was a braindead idea.
>
> They could argue that adding cache was a logical way to
> design "systems" with the device.
with a non-existing cache controller and cache rams that cost as much as the cpu. I got a feeling for the price of cache when I designed this: < https://www.flickr.com/photos/137684711@N07/52631074700/in/dateposted-public/ >
> Remember, the 9900/99K were from the "home computer" era.
> They lost out to a dog slow 8086!
8086 was NOT slow. Have you ever used an Olivetti M20 with a competently engineered memory system? That even challenged early ATs when protected mode was not needed.
>> Some friends built a hardware machine around the Fairchild
>> Clipper. They found out that moving the hard disk driver
>> just a few bytes made a difference between speedy and slow
>> as molasses. When the data was through under the head you had
>> to wait for another disc revolution.
>
> Unless, of course, you were already planning on being busy
> doing something else, at that time.  :>  "Benchmarks lie"
That benchmark was Unix System V, as licensed from Bell. Find something better to do when you need to swap.

Gerhard
On Mon, 16 Jan 2023 10:21:43 +0000, Martin Brown <'''newspam'''@nonad.co.uk> wrote:

[...]

>Like you I prefer HLL code but I will use ASM if I have to or there is
>no other way (like wanting 80 bit reals in the MS compiler). Actually I
>am working on a class library to allow somewhat clunky access to it.
>
>They annoyingly zapped access to 80 bit reals in v6 I think it was for
>"compatibility" reasons since SSE2 and later can only do 64bit reals.
PowerBasic has 80-bit reals as a native variable type.

As far as timing analysis goes, we always bring out a few port pins to test points, from uPs and FPGAs, so we can scope things. Raise a pin at ISR entry, drop it before the RTI, scope it.

We wrote one Linux program that just toggled a test point as fast as it could. That was interesting on a scope, namely the parts that didn't toggle.
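The port-pin trick, roughly, in C -- the port register and bit are placeholders for whatever test point happens to be wired out:

#include <stdint.h>

#define TEST_PORT  (*(volatile uint32_t *)0x40003000u)   /* hypothetical GPIO */
#define TP0        (1u << 0)

void some_isr(void)
{
    TEST_PORT |=  TP0;      /* scope channel goes high: ISR entered */

    /* ...the actual interrupt work... */

    TEST_PORT &= ~TP0;      /* goes low just before return: the pulse width on
                               the scope is the ISR execution time, and the
                               spacing/jitter between pulses shows latency */
}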