
new spice

Started by John Larkin September 28, 2021
On Fri, 1 Oct 2021 12:13:40 -0400, Phil Hobbs
<pcdhSpamMeSenseless@electrooptical.net> wrote:

> jlarkin@highlandsniptechnology.com wrote:
>> On Fri, 1 Oct 2021 11:07:48 -0400, Phil Hobbs
>> <pcdhSpamMeSenseless@electrooptical.net> wrote:
>>
>>> jlarkin@highlandsniptechnology.com wrote:
>>>> On Fri, 1 Oct 2021 08:55:01 -0400, Phil Hobbs
>>>> <pcdhSpamMeSenseless@electrooptical.net> wrote:
>>>>
>>>>> Gerhard Hoffmann wrote:
>>>>>> On 30.09.21 at 21:24, John Larkin wrote:
>>>>>>> On Thu, 30 Sep 2021 19:58:50 +0100, "Kevin Aylward"
>>>>>>> <kevinRemoveandReplaceATkevinaylward.co.uk> wrote:
>>>>>>>
>>>>>>>>> "John Larkin" wrote in message
>>>>>>>>> news:bdc7lgdap8o66j8m92ullph1nojbg9c5ni@4ax.com...
>>>>>>>>
>>>>>>>>> https://www.linkedin.com/in/mike-engelhardt-a788a822
>>>>>>>>
>>>>>>>> ...but why ......?
>>>>>>>>
>>>>>>> Maybe he enjoys it. I'm sure he's enormously wealthy and could
>>>>>>> do anything he wants.
>>>>>>>
>>>>>>> I want a Spice that uses an nvidia board to speed it up 1000:1.
>>>>>>>
>>>>>> Hopeless. That has already been tried in IBM AT times with those
>>>>>> Weitek coprocessors and NS 32032 processor boards; it never
>>>>>> lived up to expectations.
>>>>>
>>>>> Parallelizing sparse matrix computation is an area of active
>>>>> research. The best approach ATM seems to be runtime profiling
>>>>> that generates a JIT-compiled FPGA image for the actual
>>>>> crunching, sort of like the way FFTW generates optimum FFT
>>>>> routines.
>>>>>
>>>>>> That's no wonder. When in time-domain integration the next
>>>>>> result depends on the current value and maybe a few in the
>>>>>> past, you cannot compute more future timesteps in parallel.
>>>>>> Maybe some speculative versions in parallel and then selecting
>>>>>> the best. But that is no work for 1000 processors.
>>>>>
>>>>> SPICE isn't a bad application for parallelism, if you can figure
>>>>> out how to do it--you wouldn't bother for trivial things, where
>>>>> the run times are less than 30 s or so, but for longer
>>>>> calculations the profiling would be a small part of the work.
>>>>> The inner loop is time-stepping the same matrix topology all the
>>>>> time (though the coefficients change with the time step).
>>>>>
>>>>> Since all that horsepower would be spending most of its time
>>>>> waiting for us to dork the circuit and dink the fonts, it could
>>>>> be running the profiling in the background during editing. You
>>>>> might get 100x speedup that way, ISTM.
>>>>>
>>>>>> The inversion of the nodal matrix might use some improvement
>>>>>> since it is NP-complete, like almost everything that is
>>>>>> interesting.
>>>>>
>>>>> Matrix inversion is NP-complete? Since when? It's actually not
>>>>> even cubic, asymptotically--the lowest known complexity bound is
>>>>> less than O(N**2.4).
>>>>>
>>>>>> Its size grows with the number of nodes, and the matrix is
>>>>>> sparse since most nodes have no interaction. Dividing the
>>>>>> circuit into subcircuits, solving these separately, and
>>>>>> combining the results could provide a speedup for problems with
>>>>>> many nodes. That would be a MAJOR change.
>>>>>>
>>>>>> Spice has not made much progress since Berkeley is no longer
>>>>>> involved. Some people make local improvements, and when they
>>>>>> lose interest after 15 years their improvements die. There is
>>>>>> no one to integrate that stuff into one open official version.
>>>>>> Maybe NGspice comes closest.
>>>>>>
>>>>>> Keysight ADS has an option to run it on a bunch of
>>>>>> workstations, but that helps probably most for
>>>>>> electromagnetics, which has not much in common with spice.
>>>>>>
>>>>> It has more than you might think. EM simulators basically have
>>>>> to loop over all of main memory twice per time step, and all the
>>>>> computational boundaries have to be kept time-coherent. With
>>>>> low-latency interconnects and an OS with a thread scheduler that
>>>>> isn't completely brain-dead (i.e. anything except Linux AFAICT),
>>>>> my EM code scales within 20% or so of linearly up to 15 compute
>>>>> nodes, which is as far as I've tried.
>>>>>
>>>>> So I'm more optimistic than you, if still rather less than JL. ;)
>>>>>
>>>>> Cheers
>>>>>
>>>>> Phil Hobbs
>>>>
>>>> Since a schematic has a finite number of nodes, why not have one
>>>> CPU per node?
>>>
>>> Doing what, exactly?
>>
>> Computing the node voltage for the next time step.
>
> Right, but exactly how?
>
>>> Given that the circuit topology forms an irregular sparse matrix,
>>> there would be a gigantic communication bottleneck in general.
>>
>> Shared ram. Most nodes only need to see a few neighbors, plainly
>> visible on the schematic.
>
> "Shared ram" is all virtual, though--you don't have N-port memory
> really.
FPGAs do.
> It has to be connected somehow, and all the caches kept
> coherent. That causes communications traffic that grows very rapidly
> with the number of cores--about N**4 if it's done in symmetrical (SMP)
> fashion.
Then don't cache the node voltages; put them in sram. Mux the ram accesses cleverly.
>> The ram memory map could be clever that way.
>
> Sure, that's what the JIT FPGA approach does, but the memory layout
> doesn't solve the communications bottleneck with a normal CPU or GPU.
>
>>> Somebody has to decide on the size of the next time step, for
>>> instance, which is a global property that has to be properly
>>> disseminated after computation.
>>
>> Step when the slowest CPU is done processing its node.
>
> But then it has to decide what to do next. The coefficients of the
> next iteration depend on the global time step, so there's no purely
> node-local method for doing adaptive step size.
Proceed when all the nodes have finished their computation. Then each
reads the global node ram to get its inputs for the next step.

This would all work in an FPGA that had a lot of CPUs on chip. Let
the FPGA hardware do the node ram and access paths.

I've advocated for such a chip as a general OS host. One CPU per
process, with absolute hardware protections.

You could about do that now, with klunky soft-core CPUs.

LT Spice sometimes runs a billion to one behind real time. We can do
better.

--
Father Brown's figure remained quite dark and still; but in that
instant he had lost his head. His head was always most valuable when
he had lost it.
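In software terms, the lock-step scheme proposed above is a barrier:
no node advances until every node has published its result to the
shared node ram. A minimal sketch in C, assuming POSIX thread
barriers, with a shared array standing in for the global node ram
(all names and the per-node update are illustrative, not from any
real simulator):

/* "One CPU per node; proceed when all are done," as a pthread
 * barrier.  Two barrier waits per step give a race-free
 * compute/publish double-buffer cycle. */
#include <pthread.h>
#include <stdio.h>

#define NNODES 4
#define NSTEPS 1000

static double v[NNODES];            /* shared "node ram" */
static double v_next[NNODES];
static pthread_barrier_t tick;

static double update_node(int n)
{
    /* Placeholder for the real per-node equation: here, the average
     * of the two neighbors, read from the shared array only. */
    double left  = v[(n + NNODES - 1) % NNODES];
    double right = v[(n + 1) % NNODES];
    return 0.5 * (left + right);
}

static void *node_worker(void *arg)
{
    int n = (int)(long)arg;
    for (int step = 0; step < NSTEPS; step++) {
        v_next[n] = update_node(n);
        pthread_barrier_wait(&tick);   /* all nodes done computing  */
        v[n] = v_next[n];              /* publish to the node ram   */
        pthread_barrier_wait(&tick);   /* all nodes done publishing */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NNODES];
    pthread_barrier_init(&tick, NULL, NNODES);
    v[0] = 1.0;                        /* some initial condition */
    for (long n = 0; n < NNODES; n++)
        pthread_create(&t[n], NULL, node_worker, (void *)n);
    for (int n = 0; n < NNODES; n++)
        pthread_join(t[n], NULL);
    printf("v[0] = %g\n", v[0]);
    return 0;
}

In an FPGA the barrier would be the global DONE line and the publish
phase would be wiring, but the synchronization structure is the same.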
jlarkin@highlandsniptechnology.com wrote:
> On Fri, 1 Oct 2021 12:13:40 -0400, Phil Hobbs
> <pcdhSpamMeSenseless@electrooptical.net> wrote:
>
[...]
>
>> But then it has to decide what to do next. The coefficients of the
>> next iteration depend on the global time step, so there's no purely
>> node-local method for doing adaptive step size.
>
> Proceed when all the nodes have finished their computation. Then each
> reads the global node ram to get its inputs for the next step.
>
> This would all work in an FPGA that had a lot of CPUs on chip. Let
> the FPGA hardware do the node ram and access paths.
Yeah, the key is that the circuit topology gets handled by the FPGA,
which is more or less my point. (Not that I'm the one doing it.)

Large sparse matrices don't map well onto purely general-purpose
hardware.
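For concreteness: in the usual compressed-sparse-row (CSR) storage,
every row's work involves an indirect gather through a column-index
array, and that data-dependent access pattern is what defeats caches
and prefetchers on general-purpose hardware. A minimal sketch
(illustrative names, not from any actual SPICE kernel):

/* Sparse matrix-vector multiply, CSR storage.  The indirection
 * through col_idx[] is the irregular, data-dependent memory access
 * that general-purpose hardware handles poorly. */
#include <stddef.h>

void csr_matvec(size_t nrows,
                const size_t *row_ptr,   /* nrows+1 entries */
                const size_t *col_idx,   /* one per nonzero */
                const double *val,       /* one per nonzero */
                const double *x,         /* input vector    */
                double *y)               /* output vector   */
{
    for (size_t i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];  /* gather: irregular read */
        y[i] = sum;
    }
}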
> I've advocated for such a chip as a general OS host. One CPU per
> process, with absolute hardware protections.
Still has the SMP problem if the processes need to know about each other at all.
> You could about do that now, with klunky soft-core CPUs.
>
> LT Spice sometimes runs a billion to one behind real time. We can do
> better.
Cheers

Phil Hobbs

--
Dr Philip C D Hobbs
Principal Consultant
ElectroOptical Innovations LLC / Hobbs ElectroOptics
Optics, Electro-optics, Photonics, Analog Electronics
Briarcliff Manor NY 10510
http://electrooptical.net
http://hobbs-eo.com
On Fri, 1 Oct 2021 12:34:25 -0400, Phil Hobbs
<pcdhSpamMeSenseless@electrooptical.net> wrote:

> jlarkin@highlandsniptechnology.com wrote:
>
[...]
>
>> This would all work in an FPGA that had a lot of CPUs on chip. Let
>> the FPGA hardware do the node ram and access paths.
>
> Yeah, the key is that the circuit topology gets handled by the FPGA,
> which is more or less my point. (Not that I'm the one doing it.)
>
> Large sparse matrices don't map well onto purely general-purpose
> hardware.
Then stop thinking of circuit simulation in terms of matrix math.
>> I've advocated for such a chip as a general OS host. One CPU per
>> process, with absolute hardware protections.
>
> Still has the SMP problem if the processes need to know about each
> other at all.
There needs to be the clever multiport common SRAM, and one global
DONE line.

One shared register could be readable by all CPUs. It could have some
management bits.

I sure hope we're not still running Windows on x86 and Linux on ARM a
hundred years from now.

--
Father Brown's figure remained quite dark and still; but in that
instant he had lost his head. His head was always most valuable when
he had lost it.
jlarkin@highlandsniptechnology.com wrote:
> On Fri, 1 Oct 2021 12:34:25 -0400, Phil Hobbs
> <pcdhSpamMeSenseless@electrooptical.net> wrote:
>
[...]
>
>> Large sparse matrices don't map well onto purely general-purpose
>> hardware.
>
> Then stop thinking of circuit simulation in terms of matrix math.
Oh, come _on_. The problem is a large sparse system of nonlinear ODEs
with some bags hung onto the side for Tlines and such. How you write
it out doesn't change what has to be done--the main issue is the
irregularity and unpredictability of the circuit topology.
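Whatever the notation, each time step ends up solving the same kind
of nonlinear system F(v) = 0. A toy single-node illustration, assuming
a capacitor and a diode to ground driven by a current source, stepped
with backward Euler and solved by Newton iteration (all component
values are made up for the sketch):

/* One-node transient: C dv/dt + Is*(exp(v/Vt) - 1) = i_in.
 * Backward Euler gives
 *   F(v) = C*(v - v_prev)/h + Is*(exp(v/Vt) - 1) - i_in = 0,
 * solved at each step by Newton's method. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double C = 1e-9, Is = 1e-14, Vt = 0.02585, i_in = 1e-3;
    const double h = 1e-9;             /* time step */
    double v_prev = 0.0, v = 0.0;

    for (int step = 0; step < 5; step++) {
        v = v_prev;                    /* initial Newton guess */
        for (int it = 0; it < 50; it++) {
            double e  = exp(v / Vt);
            double F  = C * (v - v_prev) / h + Is * (e - 1.0) - i_in;
            double dF = C / h + (Is / Vt) * e;    /* dF/dv */
            double dv = F / dF;
            v -= dv;
            if (fabs(dv) < 1e-12) break;          /* converged */
        }
        v_prev = v;
        printf("t = %g s  v = %g V\n", (step + 1) * h, v);
    }
    return 0;
}

A whole circuit is this same loop with v a vector and dF/dv the sparse
Jacobian--matrix math or not, that linear solve is the inner kernel.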
>>> I've advocated for such a chip as a general OS host. One CPU per
>>> process, with absolute hardware protections.
>>
>> Still has the SMP problem if the processes need to know about each
>> other at all.
>
> There needs to be the clever multiport common SRAM, and one global
> DONE line.
Yes, "clever" in the sense of "magic happens here."
> One shared register could be readable by all CPUs. It could have some
> management bits.
For some reasonable number of "all CPUs", sure. Not an unlimited number.
> I sure hope we're not still running Windows on x86 and Linux on ARM a
> hundred years from now.
I certainly won't be. ;)

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs
Principal Consultant
ElectroOptical Innovations LLC / Hobbs ElectroOptics
Optics, Electro-optics, Photonics, Analog Electronics
Briarcliff Manor NY 10510
http://electrooptical.net
http://hobbs-eo.com
On a sunny day (Fri, 1 Oct 2021 16:15:23 +0100) it happened Tom Gardner
<spamjunk@blueyonder.co.uk> wrote in <sj78mb$h21$2@dont-email.me>:

> If being really snarky I'd attribute that to the C/C++
> community having to allocate their brainpower to problems
> that are a consequence of their tools' characteristics and
> historic choices.
Mmm, if I read this right...

C is a simple language. There are methods to program in C that are
very fast. I really do not know of a single example that cannot be
done in C.

Lots of blah blah about object-oriented languages... When I use linked
lists in C I basically already have all of that.

Not a day goes by without some new back door or hack being found in
all those applications written in high-level languages--languages that
DID choose to protect the programmer from exactly that. If you cannot
drive a car, then driving a bus does not make it safer...

C has libraries available for just about anything; some are good and
have been debugged to a large extent, so a C programmer can use those
if needed. No need to reference other people's silly ideas of how to
do X in bloat-55.9.

And--after all--a computah program is just like a step-by-step
explanation of how to find the way to some address in a big city. One
mistake and you end up nowhere. Writing a novel that shows your hero's
adventures recovering from ever again losing his/her/its way, and
calling it a new language, does not help.

I really do not see where the problem is, other than that those who
get lost have no clue about the basics.

Maybe that is why I like asm; most things can be done in asm with a
32-bit math library, in a fraction of the code size, a fraction of
the power consumption, and a fraction of the hardware. And open
source: read it, use it, learn from it. And publish yours.
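A minimal sketch of the "linked lists in C already give you that"
claim: a list node carrying a function pointer provides dynamic
dispatch--the core OOP mechanism--with no language support. Names are
illustrative only:

/* Poor man's polymorphism: each node carries its own "method". */
#include <stdio.h>
#include <stdlib.h>

struct node {
    struct node *next;
    void (*print)(const struct node *self);   /* "virtual method" */
    int value;
};

static void print_dec(const struct node *n) { printf("%d\n", n->value); }
static void print_hex(const struct node *n) { printf("0x%x\n", n->value); }

static struct node *push(struct node *head,
                         void (*print)(const struct node *), int value)
{
    struct node *n = malloc(sizeof *n);
    n->next = head;
    n->print = print;
    n->value = value;
    return n;
}

int main(void)
{
    struct node *list = NULL;
    list = push(list, print_dec, 42);
    list = push(list, print_hex, 255);
    for (struct node *n = list; n; n = n->next)
        n->print(n);                           /* dynamic dispatch */
    while (list) {                             /* manual cleanup */
        struct node *next = list->next;
        free(list);
        list = next;
    }
    return 0;
}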
On 10/1/2021 13:56, Don Y wrote:
> On 10/1/2021 3:04 AM, Jeroen Belleman wrote:
>> There's also Wirth's observation: "Software gets slower faster
>> than hardware gets faster."
>
> You've got to have a bit of sympathy for folks who write desktop
> applications; they have virtually no control over the environment
> in which their code executes.
Given what today's popular OS-s look like - and the amount of space
they *need* to work - I have more than sympathy for them. So much
human resource wasted on coping with legacies built on half-baked
ideas etc. But this is how life works, I suppose.
> They can "recommend" a particular hardware/OS configuration.
> But, few actually *enforce* those minimum requirements.
And what have they to choose from? Windows or Linux, x86 and ARM...
Soon RISC-V will come into the picture; same thing, of course. Power -
by far the most advanced and "fully baked" architecture humans have
produced - is well hidden from the wider public. Fortunately I can
still get some processors to do what I want, but what if it dies? The
only processors left will be this or that version of some
little-endian, half-baked "shite".

And don't get me started on programming languages, which went totally
C - which most if not all kids nowadays deem "low level"... How this
came to be is known, so what.

I have been going my own path for nearly 30 years now. I have VPA (to
be renamed MIA, for Machine Independent Assembly, once I have its
64-bit version) and the DPS environment with its object maintenance
system etc. etc., a world of its own, incomparably more efficient than
the popular bloats - but I am unlikely to live long enough to be able
to win with it against the rest of the world... Not that I care much
about it lately.

Dimiter

======================================================
Dimiter Popoff, TGI             http://www.tgi-sci.com
======================================================
http://www.flickr.com/photos/didi_tgi/
On 01/10/21 16:35, Phil Hobbs wrote:
> Tom Gardner wrote:
>> On 01/10/21 15:46, Phil Hobbs wrote:
>>> Tom Gardner wrote:
>>>> On 01/10/21 14:05, Phil Hobbs wrote:
>>>>> Jeroen Belleman wrote:
>>>>>> On 2021-10-01 14:11, Gerhard Hoffmann wrote:
>>>>>>> On 01.10.21 at 12:04, Jeroen Belleman wrote:
>>>>>>>
>>>>>>>> There's also Wirth's observation: "Software gets slower
>>>>>>>> faster than hardware gets faster."
>>>>>>>
>>>>>>> When asked how his name should be pronounced he said:
>>>>>>>
>>>>>>> "You can call me by name: that's Wirth, or you can call me by
>>>>>>> value: that's worth."
>>>>>>>
>>>>>>> Cheers, Gerhard
>>>>>>
>>>>>> I still don't get why C++ had to add call by reference. Big
>>>>>> mistake, in my view.
>>>>>>
>>>>>> Jeroen Belleman
>>>>>
>>>>> Why? Smart pointers didn't exist in 1998 iirc, and reducing the
>>>>> number of bare pointers getting passed around has to be a good
>>>>> thing, surely?
>>>>
>>>> "Smart pointer" is a much older concept given a new name.
>>>
>>> Hmm, interesting. Got a link? You sort of need destructors to make
>>> smart pointers work properly.
>>
>> No, I don't have a specific link.
>>
>> I'm not thinking of anything that had specific built-in
>> compiler support, but which was implemented by library
>> or environment support. That implied it was application
>> specific, based on a programming convention.
>
> Okay, it _could_ have been implemented way back, sure. Cfront had
> destructors--it's all a matter of bookkeeping,
That's the key point. I was thinking of the way people tended to
(naively) implement reference-counting GC, usually in C.
> and could have been done in Algol, I expect.
My experience is limited to Elliott Algol-60, my first HLL, when I was 16. I later understood the compiler was written by a hero of mine, CAR (Tony) Hoare, of quicksort fame and "billion dollar mistake" (his description) infamy - the null reference :)
> I'm not aware of anybody having actually done it, and
> apparently you aren't either.
My view is more like "it is an obvious technique that nobody bothered to crow about, I remember seeing it done, can't be bothered to find a link".
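That kind of convention-based "smart pointer"--the naive reference
counting mentioned above--might have looked like this minimal C
sketch, where the retain/release bookkeeping that destructors later
automated is done by hand (illustrative, not from any historical
codebase):

/* Naive reference counting in C, maintained purely by convention:
 * every new reference must be balanced by a release().  Forgetting
 * one is the classic failure mode destructors automated away. */
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

struct refstr {
    int refs;
    char *text;
};

static struct refstr *refstr_new(const char *s)
{
    struct refstr *r = malloc(sizeof *r);
    r->refs = 1;
    r->text = strdup(s);
    return r;
}

static struct refstr *retain(struct refstr *r)
{
    r->refs++;
    return r;
}

static void release(struct refstr *r)
{
    if (--r->refs == 0) {       /* last reference gone: free it */
        free(r->text);
        free(r);
    }
}

int main(void)
{
    struct refstr *a = refstr_new("hello");
    struct refstr *b = retain(a);   /* second reference, by hand */
    release(a);                     /* still alive: b holds it   */
    printf("%s\n", b->text);
    release(b);                     /* now actually freed        */
    return 0;
}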
>> In a similar vein, I found I had triumphantly invented
>> a primitive form of OOP for a realtime project. When I
>> later used OOP terminology to explain the concepts, people
>> readily understood it (and its limitations).
>
> Sure, OOP is a tool, like a Crescent wrench (adjustable spanner
> OYSOTP). It's a particularly well suited tool for most of the
> programming I do--simulators, embedded stuff, and instrument
> control--so I like it a lot.
Yup. Simulators was my first use of OOP, although I was attracted to
OOP concepts that addressed clients' statements:
 - "I'd like another of those" and
 - "I'd like one just like that except..."
> (Other stuff is mostly scripts, for which I generally use Rexx. Rexx
> isn't fussy about tabs v. spaces and doesn't change the meaning of a
> program when it's reformatted. Could use more library support,
> though.)
ISTR briefly looking at Rexx, and thinking it didn't offer /me/ enough benefit for the learning curve.
>> I'll add a standard dig, which is overstated but contains
>> more than a little validity....
>>
>> If you look at academic papers on C++ (and to a lesser
>> extent C), they refer to other C++/C papers. If you look
>> at academic papers on other languages, they refer to many
>> different languages. In other words, the C/C++ academic
>> community has less awareness of other communities and their
>> progress.
>>
>> If being really snarky I'd attribute that to the C/C++
>> community having to allocate their brainpower to problems
>> that are a consequence of their tools' characteristics and
>> historic choices.
>
> But you wouldn't do such a thing, of course. ;)
Perish the thought :)
On a sunny day (Fri, 1 Oct 2021 11:51:51 -0400) it happened Phil Hobbs
<pcdhSpamMeSenseless@electrooptical.net> wrote in
<bb70982a-9bfe-8fbd-9874-444f16eca48f@electrooptical.net>:

> I sort of doubt that you or anyone on this group actually knows "what
> really happens" when a 2021-vintage compiler maps your source code
> onto 2010s or 2020s-vintage hardware. See Chisnall's classic 2018
> paper, "C is not a low-level language. Your computer is not a fast
> PDP-11."
> <https://dl.acm.org/doi/abs/10.1145/3212477.3212479>
Of course C is not a low-level language; neither is asm. Asm, for me
at least (sigh), is also a high-level language. The problem people
have, it seems, is that an asm may have hundreds of instructions, so
the learning curve is a bit higher than for C or BASIC or whatever
snake language. I programmed in binary to make my first EPROM
programmer; even then there is no difference.
> but what we were talking about was references vs. pointers. ;)
Really. I am neural net, I not use spice, beep. :-)
But everything always works. :-)
Motivation counts.
You earthlings confused.
On 10/1/2021 10:05 AM, Jan Panteltje wrote:
> On a sunny day (Fri, 1 Oct 2021 16:15:23 +0100) it happened Tom Gardner
> <spamjunk@blueyonder.co.uk> wrote in <sj78mb$h21$2@dont-email.me>:
>
>> If being really snarky I'd attribute that to the C/C++
>> community having to allocate their brainpower to problems
>> that are a consequence of their tools' characteristics and
>> historic choices.
>
> Mmm, if I read this right...
>
> C is a simple language.
ASM is a simple language.
> There are methods to program in C that are very fast,
There are methods to program in ASM that are very fast.
> I really do not know of a singe example that cannot be done in C.
I really don't know of a single example that cannot be done in ASM.

The appeal of HLLs is that of abstraction -- freeing the developer
from dealing with the minutiae of "running the CPU". And, of providing
more information to the toolchain to enable it to ensure you are doing
what you *should* be doing ("No, you probably don't want to multiply
"Fred" by 9.7302, even if that is what you wrote!")

The problem with HLLs is that they don't always conveniently map onto
our thought processes. Just like developing parallel algorithms
requires a "special effort" of synchronizing multiple actions, etc.

[When asked to explain how to do something, we tend to think in terms
of sequential operations -- and thinking about what *could* be done
simultaneously requires a special effort. Even deciding what *order*
is "required" can be a challenge (think Petri net).]
On Friday, October 1, 2021 at 9:05:42 AM UTC-4, Phil Hobbs wrote:
> Jeroen Belleman wrote:
>> I still don't get why C++ had to add call by reference. Big
>> mistake, in my view.
> Why? Smart pointers didn't exist in 1998 iirc, and reducing the number
> of bare pointers getting passed around has to be a good thing, surely?
Apple's MacOS used 'handles' even back in the eighties. <https://en.wikipedia.org/wiki/Classic_Mac_OS_memory_management>
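A handle is just a pointer to a master pointer: the memory manager can
move the underlying block during heap compaction and patch a single
location, and every handle stays valid. A toy sketch of the idea in C
(the real Toolbox API was NewHandle()/HLock()/DisposeHandle(); this is
illustrative only):

/* Toy version of the classic Mac "handle": double indirection lets
 * the block relocate while only the master pointer changes. */
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

typedef void **Handle;

static Handle toy_new_handle(size_t size)
{
    Handle h = malloc(sizeof *h);   /* the master pointer slot */
    *h = calloc(1, size);           /* the relocatable block   */
    return h;
}

/* Simulate the memory manager compacting the heap: the block moves,
 * but every Handle still dereferences to the right place. */
static void toy_compact(Handle h, size_t size)
{
    void *moved = malloc(size);
    memcpy(moved, *h, size);
    free(*h);
    *h = moved;                     /* only the master pointer changes */
}

int main(void)
{
    Handle h = toy_new_handle(32);
    strcpy((char *)*h, "still reachable");
    toy_compact(h, 32);             /* block relocates */
    printf("%s\n", (char *)*h);     /* handle still valid */
    free(*h);
    free(h);
    return 0;
}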