Reply by DecadentLinuxUserNumeroUno November 15, 2014
On Fri, 14 Nov 2014 20:28:48 -0800, josephkk
<joseph_barrett@sbcglobal.net> Gave us:

snip
>I am not buying it. It requires memory chips to bring out I/O to indicate "valid/no error", "error corrected", and "error, not corrected". Now if accesses are more than 1 chip wide you have to combine the status bits somehow, whether or not they are shipped to the CPU/DMA/Video. Also you may want a different ECC protection profile than what the memory chip maker provides.
>
>?-)
The motherboard is involved. As much as can be placed on the RAM 'device' (stick) itself is, and the chipset manages its part. The checksum code required by whatever monitors and manages all of it (the chipset) ultimately gets generated and used for comparison. Any errors are handled by the chipset, the RAM, and the little management logic hard-wired into it all. Then it is back to square one: the next refresh-and-compare sequence. It should proceed without missing a beat compared to similarly timed non-ECC RAM. Seeing a bunch of errors would likely slow things down, but that would also indicate a bigger problem.
Reply by josephkk November 14, 2014
On Fri, 14 Nov 2014 09:25:47 +0200, upsidedown@downunder.com wrote:

>On Thu, 13 Nov 2014 19:05:20 -0800, josephkk <joseph_barrett@sbcglobal.net> wrote:
>
>>>One should also remember that magnetic core as well as dynamic RAM perform a destructive readout, so you have to perform a writeback after each read cycle. For core, you only have to do that for the actual read location (at the X and Y wire crossing), for dynamic RAM, you have to write back the whole column (hundreds or thousands of bits). For this reason, real access time (not using CAS multiplexing) is much shorter than the full cycle time.
>>
>>The similarities between core and DRAM are real. Early DRAM could not provide the next sequential address read as it had no registers to store it.
>>
>>Newer DRAM does (since EDO at least). That said, newer DRAM speeds up sequential reads over early DRAM by having the registers for it and not needing another complete cycle, just another data clock for the next sequential read data. See the DDR series specifications. The restore part of the cycle continues unabated, making a non-sequential read after two or more sequential added reads occur much sooner.
>
>Apart from the first DRAMs that used all address lines at once, all the rest have multiplexed addresses with RAS/CAS selection.
>
>This does not slow the access. For instance, the first RAS/CAS addressed DRAM was 4096x1 bit with 64 rows and 64 columns. The high 6 bits with the RAS signal were decoded and selected one of the 64 rows. After a while, all the bits from that row were transferred to 64 column sense amplifiers and latches.
>
>After the low address bits were decoded with the CAS signal, it just selected one of the 64 column sense amplifier/latch bits and presented it to the data out pin. Since the DRAM cell access time was much longer than the output column select multiplexor delay, multiplexing did not slow things much even for a single access.
>
>Now that the 64 column bits are already in the 64 internal registers, performing several CAS cycles with different low address bits allowed fast random access _within_ a 64 bit row, just multiplexing out the selected bit, instead of doing a dynamic RAM cell access each time.
>
>Later models had internal column address counters, allowing sequential column bit access without doing a RAM cell access after the initial row activation.
>
>Video-RAMs were very similar. All the TV-line bits were taken from 1024 column bits and then parallel loaded into a shift register clocked by the bit clock. This required a slow line select every 64 us, so no problem with propagation delays.
>
>Since all the column bits are available simultaneously within the chip, my point was that it would make sense to put the ECC processing within the memory chip itself.
I am not buying it. It requires memory chips to bring out I/O to indicate "valid/no error", "error corrected", and "error, not corrected". Now if accesses are more than 1 chip wide you have to combine the status bits somehow, whether or not they are shipped to the CPU/DMA/Video. Also you may want a different ECC protection profile than what the memory chip maker provides.

?-)
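To make the status-combining step concrete, here is a minimal C sketch of a rank-level status reduction. The three-level status and the per-chip pins follow the description above; the enum values, names, and the simple worst-case reduction are assumptions for illustration, not anything from a real chip's datasheet.

#include <stdint.h>

/* Hypothetical per-chip ECC status, as the post describes:
   each chip reports valid / corrected / uncorrectable. */
enum ecc_status { ECC_OK = 0, ECC_CORRECTED = 1, ECC_UNCORRECTABLE = 2 };

/* Combine the status pins of all chips in one rank: the rank-level
   status is simply the worst case reported by any chip. */
enum ecc_status combine_rank_status(const enum ecc_status *chips, int nchips)
{
    enum ecc_status worst = ECC_OK;
    for (int i = 0; i < nchips; i++)
        if (chips[i] > worst)
            worst = chips[i];
    return worst;
}

Taking the worst case across the rank works because "uncorrectable" must dominate "corrected", which must dominate "clean"; a real controller would also want to record which chip complained.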
Reply by upsidedown@downunder.com November 14, 2014
On Thu, 13 Nov 2014 19:05:20 -0800, josephkk
<joseph_barrett@sbcglobal.net> wrote:

>>One should also remember that magnetic core as well as dynamic RAM perform a destructive readout, so you have to perform a writeback after each read cycle. For core, you only have to do that for the actual read location (at the X and Y wire crossing), for dynamic RAM, you have to write back the whole column (hundreds or thousands of bits). For this reason, real access time (not using CAS multiplexing) is much shorter than the full cycle time.
>
>The similarities between core and DRAM are real. Early DRAM could not provide the next sequential address read as it had no registers to store it.
>
>Newer DRAM does (since EDO at least). That said, newer DRAM speeds up sequential reads over early DRAM by having the registers for it and not needing another complete cycle, just another data clock for the next sequential read data. See the DDR series specifications. The restore part of the cycle continues unabated, making a non-sequential read after two or more sequential added reads occur much sooner.
Apart from the first DRAMs that used all address lines at once, all the rest have multiplexed addresses with RAS/CAS selection.

This does not slow the access. For instance, the first RAS/CAS addressed DRAM was 4096x1 bit with 64 rows and 64 columns. The high 6 bits with the RAS signal were decoded and selected one of the 64 rows. After a while, all the bits from that row were transferred to 64 column sense amplifiers and latches.

After the low address bits were decoded with the CAS signal, it just selected one of the 64 column sense amplifier/latch bits and presented it to the data out pin. Since the DRAM cell access time was much longer than the output column select multiplexor delay, multiplexing did not slow things much even for a single access.

Now that the 64 column bits are already in the 64 internal registers, performing several CAS cycles with different low address bits allowed fast random access _within_ a 64 bit row, just multiplexing out the selected bit, instead of doing a dynamic RAM cell access each time.

Later models had internal column address counters, allowing sequential column bit access without doing a RAM cell access after the initial row activation.

Video-RAMs were very similar. All the TV-line bits were taken from 1024 column bits and then parallel loaded into a shift register clocked by the bit clock. This required a slow line select every 64 us, so no problem with propagation delays.

Since all the column bits are available simultaneously within the chip, my point was that it would make sense to put the ECC processing within the memory chip itself.
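A toy software model of that 4096x1 scheme may make the two-phase addressing easier to see. This is only an illustration of the description above; the function names are invented, and the destructive-read/restore detail is reduced to a comment.

#include <stdint.h>

/* Toy model of the 4096x1 RAS/CAS scheme described above: a 12-bit
   address is split into a 6-bit row (RAS phase) and a 6-bit column
   (CAS phase). RAS moves a whole 64-bit row into the sense-amplifier
   latches; CAS is then just a 64-to-1 multiplexer over those latches. */

static uint8_t cells[64][64];   /* one DRAM bit per entry */
static uint8_t row_latch[64];   /* column sense amplifiers / latches */

void ras(unsigned addr)                    /* row activate */
{
    unsigned row = (addr >> 6) & 0x3F;     /* high 6 address bits */
    for (int c = 0; c < 64; c++)
        row_latch[c] = cells[row][c];      /* real DRAM reads destructively,
                                              then restores from the latch */
}

uint8_t cas(unsigned addr)                 /* column select */
{
    return row_latch[addr & 0x3F];         /* low 6 bits: just a mux */
}

Repeated cas() calls against the same open row are exactly the fast accesses described: no cell-array access, just the output multiplexer.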
Reply by rickman November 14, 2014
On 11/13/2014 2:19 AM, upsidedown@downunder.com wrote:
> On Wed, 12 Nov 2014 20:10:07 -0800, josephkk <joseph_barrett@sbcglobal.net> wrote:
>
>> On Wed, 12 Nov 2014 08:30:43 +0200, upsidedown@downunder.com wrote:
>>
>>> The actual data bits can be stored as they arrive, calculating the check bits takes some time, but they can be written into memory cells which will occur slightly later. With multiple write cycles in succession, storing the check bits from the previous write can overlap with writing the actual data bits of the next write.
>>>
>>> Doing a partial memory word update, e.g. writing only a single byte into a 64 bit (8 byte) memory word is costly, since first you have to read the 7 unmodified bytes, combine the new byte, calculate the ECC for 8 bytes and write 8 bytes+ECC or at least the new byte+full ECC. With cache between the processor and main memory, memory writes should use the full memory words, so this should not be an issue today.
>>>
>>>> Then on the read all the bits are calculated to see if there is an error and to correct it.
>>>
>>> These days the read returns correct results in a huge majority of cases, so why not just send out the speculative data and after the ECC check declare it valid by a separate signal from memory to CPU. However, if the ECC check fails, the memory needs to calculate the correct data and indicate that the data word is now valid. Then the memory must calculate the new correct data+ECC and store it into that memory cell to deal with soft errors (i.e. flush the memory cell). Of course, if there is a hard error, this does not help, since the correction must be repeated on each read access to that cell, slowing it up considerably.
>>
>> You need to study up on Amdahl's law. It relates the frequency of any event in the instruction and data sequences to the amount of speed impact it has.
>>
>> ?-)
>
> One should also remember that magnetic core as well as dynamic RAM perform a destructive readout, so you have to perform a writeback after each read cycle. For core, you only have to do that for the actual read location (at the X and Y wire crossing), for dynamic RAM, you have to write back the whole column (hundreds or thousands of bits). For this reason, real access time (not using CAS multiplexing) is much shorter than the full cycle time.
>
> Putting the ECC logic into the writeback loop doesn't slow down the _cycle_ time, as long as the ECC writeback is phase shifted from the main data write back.
Not sure what you are talking about. The writeback either doesn't use the ECC, just copying what is in memory... not very useful... or it has to verify the ECC and apply the correction before writing it back... or the third option is to flag an error in parallel, but then the entire write operation has to be repeated to accommodate the error correction. That extension messes up the timing and is hard to incorporate into most applications of ECC.
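For reference, the "verify the ECC and apply the correction before writing back" option looks roughly like this, using a tiny Hamming(7,4) code so the sketch stays self-contained. A real memory path would use a wider code (e.g. 64+8 bits) and do this in logic, not software; everything here is illustrative.

#include <stdint.h>
#include <stdio.h>

/* Hamming(7,4): 4 data bits, 3 check bits, corrects 1 flipped bit. */
static unsigned encode(unsigned d)        /* d: 4 data bits -> 7-bit word */
{
    unsigned d0 = d & 1, d1 = (d >> 1) & 1, d2 = (d >> 2) & 1, d3 = (d >> 3) & 1;
    unsigned p1 = d0 ^ d1 ^ d3;           /* covers positions 1,3,5,7 */
    unsigned p2 = d0 ^ d2 ^ d3;           /* covers positions 2,3,6,7 */
    unsigned p4 = d1 ^ d2 ^ d3;           /* covers positions 4,5,6,7 */
    /* codeword layout by position: p1 p2 d0 p4 d1 d2 d3 */
    return p1 | (p2 << 1) | (d0 << 2) | (p4 << 3) | (d1 << 4) | (d2 << 5) | (d3 << 6);
}

static unsigned read_and_writeback(unsigned *cell)   /* returns 4 data bits */
{
    unsigned w = *cell;                              /* (destructive) read  */
    unsigned s = 0;                                  /* syndrome            */
    for (unsigned p = 1; p <= 4; p <<= 1) {
        unsigned parity = 0;
        for (unsigned pos = 1; pos <= 7; pos++)
            if (pos & p) parity ^= (w >> (pos - 1)) & 1;
        if (parity) s |= p;
    }
    if (s)                                           /* s names the bad bit */
        w ^= 1u << (s - 1);                          /* correct it          */
    *cell = w;                                       /* writeback restores
                                                        the corrected word  */
    return ((w >> 2) & 1) | (((w >> 4) & 7) << 1);   /* extract d0..d3      */
}

int main(void)
{
    unsigned cell = encode(0xB);    /* store 1011 */
    cell ^= 1u << 4;                /* a soft error flips one data bit */
    printf("read %X\n", read_and_writeback(&cell));  /* prints B */
    return 0;
}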
> Of course, this does require that the ECC logic is on the same memory chip, using ECC memory bits and logic on separate chips doesn't work.
>
> For a high radiation environment (if it makes sense to use DRAMs at all), I would put the ECC into the writeback loop so that the memory is flushed (ECC corrected) at every refresh as well as every read access to a column. This will quickly catch single bit errors, which are correctable, before they grow into a multibit non-correctable error.
That's likely not a significant speed issue since refresh is only a small portion of the memory bandwidth. But you still leave a gap of possible corruption between the last refresh and the next read.

--
Rick
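A quick back-of-envelope run supports the "small portion" claim. The row count, retention time, and per-refresh busy time below are assumed, typical-looking values, not figures from the thread:

#include <stdio.h>

/* Back-of-envelope check that refresh consumes little bandwidth,
   with assumed (illustrative) DDR3-era numbers. */
int main(void)
{
    double rows        = 8192.0;   /* refresh commands per retention period */
    double retention_s = 64e-3;    /* 64 ms retention (typical spec)        */
    double trfc_s      = 160e-9;   /* time one refresh command occupies     */

    double refresh_time = rows * trfc_s;          /* busy time per 64 ms */
    printf("refresh overhead: %.2f%%\n",
           100.0 * refresh_time / retention_s);   /* ~2% for these numbers */
    return 0;
}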
Reply by rickman November 14, 2014
On 11/13/2014 5:48 AM, Bill Sloman wrote:
> On Thursday, 13 November 2014 16:43:49 UTC+11, rickman wrote:
>> On 11/12/2014 11:17 PM, Bill Sloman wrote:
>>> On Thursday, 13 November 2014 14:46:00 UTC+11, rickman wrote:
>>>> On 11/11/2014 10:46 PM, Bill Sloman wrote:
>>>>> On Wednesday, 12 November 2014 12:22:00 UTC+11, rickman wrote:
>>>>>> On 11/8/2014 2:14 AM, miso wrote:
>>>
>>> <snip>
>>>
>>>>>> ECC *has* to be slower. It involves calculating check bits from the word being stored and saving them. Then on the read all the bits are calculated to see if there is an error and to correct it. That takes some time on both the write and the read. It may not be a lot, but it takes time.
>>>>>
>>>>> It doesn't have to be significantly slower. The processes of creating the check bits, and of using them to calculate a corrected output can in principle be handled by look-up tables - which get a bit big - and in practice are handled by logic networks which are almost as fast.
>>>>
>>>> Lol, I find it amusing that you think a lookup table is faster than logic. A lookup table is a bunch of logic for doing the operation with a fixed pattern of bits. Doing the same operation in discrete logic is almost certainly faster and almost certainly much smaller.
>>>
>>> Discrete logic is almost as old-fashioned as hydraulic logic.
>>>
>>> In practice, either solution is going to be realised in programmable logic, and the look-up table is the version that uses most gates to get the lowest propagation delay, and "logic" is the approach that trades off fewer gates against longer propagation paths that make more choices.
>>
>> You don't seem to understand, in logic more does not equal faster delays. I can assure you that more logic is slower than less.
>
> If I build my extra logic with ECLinPS and you build yours with 74LS, this won't be true.
Mine will still be faster because I'm using it in a Ferrari. What are you talking about... no, don't tell me. This conversation has gone south.
>> I'm not sure why you are bringing programmable logic into this. That is a red herring.
>
> If you can buy purpose-built ECC chips for your particular choice of word length it's certainly going to be a red herring. If you have real world requirements that don't correspond to an application that buys more than 100,000 chips per year, you are going to realise most of your system in a programmable logic device.
>
>>>> I worked on an array processor with ECC memory controller in a separate chip. The ECC happened in its own clock cycle and so did not affect the clock frequency, but that added latency to the memory access. However this was a micro programmed machine so the algorithm anticipated all the various delays to get the right data in the right place at the right time. The ECC delay was incorporated into the operation of the machine.
>>>
>>> That could be longer ago, but I doubt it. There were micro-programmed machines around in the mid-1980's but the people who used them tended to be very specialised number crunchers.
>>
>> It was in the late 80's. Star Technologies was a spin off from Floating Point Systems. FPS decided the market wanted 64 bit floating point and Star Tech was about speed at 32 bits. They provided a machine (two rack cabinets) that did 100 MFLOPS... the second fastest floating point in the world next to the Cray. This was before DSPs were terribly useful. But it didn't last long. They pumped out a design that did 50 MFLOPS in a single 9U rack which was incorporated into GE CAT scanners. They nursed that design for continuing support for a long time. Ultimately they folded without ever producing another viable design. The day of the array processor was over.
>
> When I applied for my job at EMI central research in 1975, one of the job interviews was with the guys who were building the number-crunching logic for the EMI body-scanner. I knew enough to ask them whether they were going to use AMD's TTL bit-slice components or Motorola's ECL bit-slices.
>
> At the time they hadn't made up their minds, but by the time I'd got the job (and got the security clearance that let me actually start work) they'd gone for the AMD parts. They weren't as fast, but they integrated bigger chunks of functionality. By the 1980's integrated circuits could integrate a lot more transistors, and bit-slices weren't all that interesting.
Thanks for sharing.

--
Rick
Reply by josephkk November 13, 2014
On Thu, 13 Nov 2014 09:19:27 +0200, upsidedown@downunder.com wrote:

>On Wed, 12 Nov 2014 20:10:07 -0800, josephkk <joseph_barrett@sbcglobal.net> wrote:
>
>>On Wed, 12 Nov 2014 08:30:43 +0200, upsidedown@downunder.com wrote:
>>
>>>The actual data bits can be stored as they arrive, calculating the check bits takes some time, but they can be written into memory cells which will occur slightly later. With multiple write cycles in succession, storing the check bits from the previous write can overlap with writing the actual data bits of the next write.
>>>
>>>Doing a partial memory word update, e.g. writing only a single byte into a 64 bit (8 byte) memory word is costly, since first you have to read the 7 unmodified bytes, combine the new byte, calculate the ECC for 8 bytes and write 8 bytes+ECC or at least the new byte+full ECC. With cache between the processor and main memory, memory writes should use the full memory words, so this should not be an issue today.
>>>
>>>>Then on the read all the bits are calculated to see if there is an error and to correct it.
>>>
>>>These days the read returns correct results in a huge majority of cases, so why not just send out the speculative data and after the ECC check declare it valid by a separate signal from memory to CPU. However, if the ECC check fails, the memory needs to calculate the correct data and indicate that the data word is now valid. Then the memory must calculate the new correct data+ECC and store it into that memory cell to deal with soft errors (i.e. flush the memory cell). Of course, if there is a hard error, this does not help, since the correction must be repeated on each read access to that cell, slowing it up considerably.
>>
>>You need to study up on Amdahl's law. It relates the frequency of any event in the instruction and data sequences to the amount of speed impact it has.
>>
>>?-)
>
>One should also remember that magnetic core as well as dynamic RAM perform a destructive readout, so you have to perform a writeback after each read cycle. For core, you only have to do that for the actual read location (at the X and Y wire crossing), for dynamic RAM, you have to write back the whole column (hundreds or thousands of bits). For this reason, real access time (not using CAS multiplexing) is much shorter than the full cycle time.
The similarities between core and DRAM are real. Early DRAM could not provide the next sequential address read as it had no registers to store it.

Newer DRAM does (since EDO at least). That said, newer DRAM speeds up sequential reads over early DRAM by having the registers for it and not needing another complete cycle, just another data clock for the next sequential read data. See the DDR series specifications. The restore part of the cycle continues unabated, making a non-sequential read after two or more sequential added reads occur much sooner.
>Putting the ECC logic into the writeback loop doesn't slow down the _cycle_ time, as long as the ECC writeback is phase shifted from the main data write back.
>
>Of course, this does require that the ECC logic is on the same memory chip, using ECC memory bits and logic on separate chips doesn't work.
??? ECC is just stored in additional bits of memory word width. The ECC calculations are all done on the CPU chip (in both directions).
>For a high radiation environment (if it makes sense to use DRAMs at all), I would put the ECC into the writeback loop so that the memory is flushed (ECC corrected) at every refresh as well as every read access to a column. This will quickly catch single bit errors, which are correctable, before they grow into a multibit non-correctable error.
ECC can be designed to correct as many bits of the word as you want. Want 4 bit correction and detection of almost all many-bit errors? It can be done. It is a problem in optimization: how much do you pay in ECC, and in secondary ECC, to detect everything?

?-)
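To put rough numbers on that trade-off, this sketch counts the check bits a single-error-correcting Hamming code needs, via the standard bound 2^r >= m + r + 1, plus one extra bit for double-error detection. Correcting t > 1 bits (e.g. with a BCH code) costs roughly t times as many check bits, which is exactly the cost being optimized:

#include <stdio.h>

/* Smallest r such that 2^r >= m + r + 1: the check bits needed for
   single-error correction over m data bits. One more overall parity
   bit upgrades SEC to SECDED. */
static int sec_check_bits(int m)
{
    int r = 1;
    while ((1 << r) < m + r + 1)
        r++;
    return r;
}

int main(void)
{
    int widths[] = { 8, 16, 32, 64, 128 };
    for (int i = 0; i < 5; i++) {
        int m = widths[i];
        printf("%3d data bits: SEC needs %d check bits, SECDED %d\n",
               m, sec_check_bits(m), sec_check_bits(m) + 1);
    }
    return 0;
}

For 64 data bits it comes out at 7 + 1 = 8 check bits, the familiar 72-bit-wide ECC word; note how the relative overhead shrinks as the word gets longer.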
Reply by Bill Sloman November 13, 2014
On Thursday, 13 November 2014 18:26:18 UTC+11, DecadentLinuxUserNumeroUno  wrote:
> On Thu, 13 Nov 2014 09:19:27 +0200, upsidedown@downunder.com Gave us:
>
> >On Wed, 12 Nov 2014 20:10:07 -0800, josephkk <joseph_barrett@sbcglobal.net> wrote:
> >
> >>On Wed, 12 Nov 2014 08:30:43 +0200, upsidedown@downunder.com wrote:
> >>
> >>>The actual data bits can be stored as they arrive, calculating the check bits takes some time, but they can be written into memory cells which will occur slightly later. With multiple write cycles in succession, storing the check bits from the previous write can overlap with writing the actual data bits of the next write.
> >>>
> >>>Doing a partial memory word update, e.g. writing only a single byte into a 64 bit (8 byte) memory word is costly, since first you have to read the 7 unmodified bytes, combine the new byte, calculate the ECC for 8 bytes and write 8 bytes+ECC or at least the new byte+full ECC. With cache between the processor and main memory, memory writes should use the full memory words, so this should not be an issue today.
> >>>
> >>>>Then on the read all the bits are calculated to see if there is an error and to correct it.
> >>>
> >>>These days the read returns correct results in a huge majority of cases, so why not just send out the speculative data and after the ECC check declare it valid by a separate signal from memory to CPU. However, if the ECC check fails, the memory needs to calculate the correct data and indicate that the data word is now valid. Then the memory must calculate the new correct data+ECC and store it into that memory cell to deal with soft errors (i.e. flush the memory cell). Of course, if there is a hard error, this does not help, since the correction must be repeated on each read access to that cell, slowing it up considerably.
> >>
> >>You need to study up on Amdahl's law. It relates the frequency of any event in the instruction and data sequences to the amount of speed impact it has.
> >>
> >>?-)
> >
> >One should also remember that magnetic core as well as dynamic RAM perform a destructive readout, so you have to perform a writeback after each read cycle. For core, you only have to do that for the actual read location (at the X and Y wire crossing), for dynamic RAM, you have to write back the whole column (hundreds or thousands of bits). For this reason, real access time (not using CAS multiplexing) is much shorter than the full cycle time.
> >
> >Putting the ECC logic into the writeback loop doesn't slow down the _cycle_ time, as long as the ECC writeback is phase shifted from the main data write back.
> >
> >Of course, this does require that the ECC logic is on the same memory chip, using ECC memory bits and logic on separate chips doesn't work.
> >
> >For a high radiation environment (if it makes sense to use DRAMs at all), I would put the ECC into the writeback loop so that the memory is flushed (ECC corrected) at every refresh as well as every read access to a column. This will quickly catch single bit errors, which are correctable, before they grow into a multibit non-correctable error.
>
> It sounds like a running ECC on each column string might be better than byte, word, or actual string correction would, and achieve what you said about getting single bit errors before they become monsters in the datagrams.
ECC correction makes more sense for longer words. 64-bit words were a sweet spot, because they could be error detected and corrected with an eight-bit check word. Packet-switched networks detected and corrected on whole packets, with even longer check words.

--
Bill Sloman, Sydney
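As a concrete example of a longer check word computed over a whole packet, here is a bitwise CRC-32 in C (the reflected 0xEDB88320 polynomial used by Ethernet and zip). One caveat to the post: a packet CRC of this kind only detects errors; correction in most packet protocols happens by retransmission, or by a separate FEC layer:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Bitwise CRC-32: one 32-bit check word covering an arbitrary-length
   packet. Table-driven versions trade memory for speed, the same
   LUT-versus-logic trade-off argued about earlier in this thread. */
uint32_t crc32(const uint8_t *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++)            /* one bit at a time */
            crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0);
    }
    return ~crc;
}

int main(void)
{
    const uint8_t msg[] = "123456789";
    printf("%08X\n", crc32(msg, 9));   /* standard check value: CBF43926 */
    return 0;
}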
Reply by Bill Sloman November 13, 2014
On Thursday, 13 November 2014 16:43:49 UTC+11, rickman  wrote:
> On 11/12/2014 11:17 PM, Bill Sloman wrote:
> > On Thursday, 13 November 2014 14:46:00 UTC+11, rickman wrote:
> >> On 11/11/2014 10:46 PM, Bill Sloman wrote:
> >>> On Wednesday, 12 November 2014 12:22:00 UTC+11, rickman wrote:
> >>>> On 11/8/2014 2:14 AM, miso wrote:
> >
> > <snip>
> >
> >>>> ECC *has* to be slower. It involves calculating check bits from the word being stored and saving them. Then on the read all the bits are calculated to see if there is an error and to correct it. That takes some time on both the write and the read. It may not be a lot, but it takes time.
> >>>
> >>> It doesn't have to be significantly slower. The processes of creating the check bits, and of using them to calculate a corrected output can in principle be handled by look-up tables - which get a bit big - and in practice are handled by logic networks which are almost as fast.
> >>
> >> Lol, I find it amusing that you think a lookup table is faster than logic. A lookup table is a bunch of logic for doing the operation with a fixed pattern of bits. Doing the same operation in discrete logic is almost certainly faster and almost certainly much smaller.
> >
> > Discrete logic is almost as old-fashioned as hydraulic logic.
> >
> > In practice, either solution is going to be realised in programmable logic, and the look-up table is the version that uses most gates to get the lowest propagation delay, and "logic" is the approach that trades off fewer gates against longer propagation paths that make more choices.
>
> You don't seem to understand, in logic more does not equal faster delays. I can assure you that more logic is slower than less.

If I build my extra logic with ECLinPS and you build yours with 74LS, this won't be true.

> I'm not sure why you are bringing programmable logic into this. That is a red herring.

If you can buy purpose-built ECC chips for your particular choice of word length it's certainly going to be a red herring. If you have real world requirements that don't correspond to an application that buys more than 100,000 chips per year, you are going to realise most of your system in a programmable logic device.

> >> I worked on an array processor with ECC memory controller in a separate chip. The ECC happened in its own clock cycle and so did not affect the clock frequency, but that added latency to the memory access. However this was a micro programmed machine so the algorithm anticipated all the various delays to get the right data in the right place at the right time. The ECC delay was incorporated into the operation of the machine.
> >
> > That could be longer ago, but I doubt it. There were micro-programmed machines around in the mid-1980's but the people who used them tended to be very specialised number crunchers.
>
> It was in the late 80's. Star Technologies was a spin off from Floating Point Systems. FPS decided the market wanted 64 bit floating point and Star Tech was about speed at 32 bits. They provided a machine (two rack cabinets) that did 100 MFLOPS... the second fastest floating point in the world next to the Cray. This was before DSPs were terribly useful. But it didn't last long. They pumped out a design that did 50 MFLOPS in a single 9U rack which was incorporated into GE CAT scanners. They nursed that design for continuing support for a long time. Ultimately they folded without ever producing another viable design. The day of the array processor was over.

When I applied for my job at EMI central research in 1975, one of the job interviews was with the guys who were building the number-crunching logic for the EMI body-scanner. I knew enough to ask them whether they were going to use AMD's TTL bit-slice components or Motorola's ECL bit-slices.

At the time they hadn't made up their minds, but by the time I'd got the job (and got the security clearance that let me actually start work) they'd gone for the AMD parts. They weren't as fast, but they integrated bigger chunks of functionality. By the 1980's integrated circuits could integrate a lot more transistors, and bit-slices weren't all that interesting.

--
Bill Sloman, Sydney
Reply by Martin Brown November 13, 2014
On 13/11/2014 04:05, josephkk wrote:
> On Wed, 12 Nov 2014 19:29:41 -0500, "Maynard A. Philbrook Jr." <jamie_ka1lpa@charter.net> wrote:
>
>> It does not matter if they were failed or intentional. The fact remains that a large amount of software did not force the use of an FPU, some of it didn't even attempt to detour in software if there was one present.
>>
>> Years ago I wrote a sat tracking program that optionally would switch to the FPU if one was present; there was a speed up but it wasn't what I called worth a fist full of money to get a CPU or add-on FPU for it.
>>
>> Jamie
>
> The first program that I used that had a noticeable improvement with the FPU was SPICE. There it made a huge difference. Similar applications had the same kind of results.
>
> ?-)
Anything that used a compiler that could generate inline FP code would benefit enormously, but if you had a noddy compiler that just had a bunch of library routines that were either calls to the emulator or calls to FP code in a subroutine, then the benefits were much less.

This old page shows the variation in different sqrt coding tricks from way back:

http://www.codeproject.com/Articles/69941/Best-Square-Root-Method-Algorithm-Function-Precisi

The inline code is approximately 5x faster than the ordinary sqrt call. How much benefit you got from the FPU depended critically on the quality of your compiler. You often got a bit of extra precision thrown in too, since the FP stack holds intermediate results to 80 bits.

The original Intel FPU had a few quirks in the trig functions, which were found when Cyrix did a full analysis for their own numeric FPU (which was faster, more accurate and cheaper than the Intel part).

--
Regards, Martin Brown
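For flavor, here is one trick of the kind that page benchmarks: seed from integer arithmetic on the float's bit pattern, then refine with two Newton-Raphson steps. This is an illustrative reconstruction (assuming x > 0), not code taken from the linked article:

#include <stdint.h>
#include <string.h>
#include <stdio.h>
#include <math.h>

/* Approximate sqrtf: halve the float's exponent by integer arithmetic
   on its bit pattern, then polish with Newton-Raphson. Assumes x > 0. */
float fast_sqrtf(float x)
{
    uint32_t i;
    memcpy(&i, &x, sizeof i);        /* reinterpret bits without UB   */
    i = (i >> 1) + 0x1fbd1df5;       /* crude "halve the exponent"    */
    float y;
    memcpy(&y, &i, sizeof y);

    y = 0.5f * (y + x / y);          /* Newton step: y -> (y + x/y)/2 */
    y = 0.5f * (y + x / y);          /* second step tightens the error */
    return y;
}

int main(void)
{
    printf("fast: %f  libm: %f\n", fast_sqrtf(2.0f), sqrtf(2.0f));
    return 0;
}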
Reply by DecadentLinuxUserNumeroUno November 13, 2014
On Thu, 13 Nov 2014 09:19:27 +0200, upsidedown@downunder.com Gave us:

>On Wed, 12 Nov 2014 20:10:07 -0800, josephkk <joseph_barrett@sbcglobal.net> wrote:
>
>>On Wed, 12 Nov 2014 08:30:43 +0200, upsidedown@downunder.com wrote:
>>
>>>The actual data bits can be stored as they arrive, calculating the check bits takes some time, but they can be written into memory cells which will occur slightly later. With multiple write cycles in succession, storing the check bits from the previous write can overlap with writing the actual data bits of the next write.
>>>
>>>Doing a partial memory word update, e.g. writing only a single byte into a 64 bit (8 byte) memory word is costly, since first you have to read the 7 unmodified bytes, combine the new byte, calculate the ECC for 8 bytes and write 8 bytes+ECC or at least the new byte+full ECC. With cache between the processor and main memory, memory writes should use the full memory words, so this should not be an issue today.
>>>
>>>>Then on the read all the bits are calculated to see if there is an error and to correct it.
>>>
>>>These days the read returns correct results in a huge majority of cases, so why not just send out the speculative data and after the ECC check declare it valid by a separate signal from memory to CPU. However, if the ECC check fails, the memory needs to calculate the correct data and indicate that the data word is now valid. Then the memory must calculate the new correct data+ECC and store it into that memory cell to deal with soft errors (i.e. flush the memory cell). Of course, if there is a hard error, this does not help, since the correction must be repeated on each read access to that cell, slowing it up considerably.
>>
>>You need to study up on Amdahl's law. It relates the frequency of any event in the instruction and data sequences to the amount of speed impact it has.
>>
>>?-)
>
>One should also remember that magnetic core as well as dynamic RAM perform a destructive readout, so you have to perform a writeback after each read cycle. For core, you only have to do that for the actual read location (at the X and Y wire crossing), for dynamic RAM, you have to write back the whole column (hundreds or thousands of bits). For this reason, real access time (not using CAS multiplexing) is much shorter than the full cycle time.
>
>Putting the ECC logic into the writeback loop doesn't slow down the _cycle_ time, as long as the ECC writeback is phase shifted from the main data write back.
>
>Of course, this does require that the ECC logic is on the same memory chip, using ECC memory bits and logic on separate chips doesn't work.
>
>For a high radiation environment (if it makes sense to use DRAMs at all), I would put the ECC into the writeback loop so that the memory is flushed (ECC corrected) at every refresh as well as every read access to a column. This will quickly catch single bit errors, which are correctable, before they grow into a multibit non-correctable error.
It sounds like a running ECC on each column string might be better than byte, word, or actual string correction would be, and achieve what you said about getting single bit errors before they become monsters in the datagrams.