
Burn-in strategy

Started by Don Y December 2, 2022
On 12/6/2022 10:05 AM, legg wrote:
> On Mon, 5 Dec 2022 05:09:48 -0700, Don Y <blockedofcourse@foo.invalid> wrote:
>
>> On 12/4/2022 10:46 AM, Joe Gwinn wrote:
>>> I was doing much the same in the late 1970s. We had a number of new SEL 32/55 midi computers, with this brand new semiconductor RAM memory (replacing magnetic core memory), and were having lots of early failures.
>>>
>>> So, I decided to give them some hot summer days: The computers were looping on a memory test, as before, but now with their air intakes partially blocked by cardboard, with a thermocouple in the core so we could adjust the cardboard to achieve the max allowed temperature.
>>>
>>> Initially, delivered units would fail within a day. We would remove the cardboard et al. and call the vendor, who would then find and replace the failed memory. Rinse and repeat.
>>>
>>> Pretty soon, the vendor instituted a hot screening program before delivery, it being far cheaper to fix in the factory than in the field, and in a year or two semiconductor memory field reliability had improved greatly.
>>
>> But the vendor likely didn't just "block the vents" and *hope* ALL the early faults would manifest in the first 24 hours.
>>
>> Instead, he likely stressed a sample population over a longer period of time and recorded the failure rates over time -- looking for the "knee" at which the failure rate leveled off. Longer burn-in times would just needlessly shorten the useful life of the device; shorter would risk some number of infant-mortality failures slipping through to manifest at the customer.
>>
>> It seems that most folks have a naive understanding of how burn-in is supposed to work: that "simply" plugging the unit in before sale is enough to catch the early failures. Unless you know where (in time) those failures are probabilistically going to manifest, how can you know that 24, 48, 72, or 168 hours is "enough"? Or that 60C is the best temperature to accelerate failures? (My residential devices have to *operate* at 60C. And at -40C.)
>>
>> [If you're not going to approach it with a scientific basis, you're likely just looking to capitalize on your customers' ignorance: "We burn in our products for ## hours to ensure quality." Yeah. Right. "Then why did OUR unit shit the bed after two weeks?"]
>
> The best temperature to accelerate failures is the operating limit which the design is intended to address, under functioning conditions that produce the highest intended self-generated rise.
Yes. My point was that naively assuming N hours at T degrees is just silly. You need to characterize your failure pattern before you can figure out how long and at which conditions you should stress the design.
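As a concrete (if simplified) sketch of that characterization step, here is one way to locate the "knee": fit a Weibull distribution to hours-to-failure from a stressed sample population, then burn in until the hazard rate has flattened. Everything here -- the failure times, the flatness threshold, the uncensored fit -- is invented purely for illustration; real burn-in data would need censoring handled.

import numpy as np
from scipy.stats import weibull_min

# Hours-to-failure observed on a stressed sample population
# (made-up numbers; all units assumed run to failure -- no censoring).
hours_to_failure = np.array([2.0, 5.0, 9.0, 14.0, 22.0, 35.0,
                             51.0, 80.0, 130.0, 210.0])

# Fit a two-parameter Weibull (location pinned at zero).
beta, _, eta = weibull_min.fit(hours_to_failure, floc=0)

def hazard(t):
    # Weibull hazard: failures per hour among units still alive at t.
    return (beta / eta) * (t / eta) ** (beta - 1.0)

# beta < 1 means a falling hazard -- the infant-mortality regime.
print(f"shape beta = {beta:.2f}, scale eta = {eta:.0f} h")

# Burn in until the hazard is within 10% of its long-run (flat) value,
# instead of guessing "24/48/72 hours".
t = np.linspace(1.0, 1000.0, 10000)
h_floor = hazard(t[-1])
knee = t[np.argmax(hazard(t) <= 1.1 * h_floor)]
print(f"hazard flattens around t = {knee:.0f} h -> candidate burn-in time")

The same fit also tells you whether burn-in helps at all: if beta comes out at or above 1, there is no infant-mortality regime to screen out, and every stress hour only consumes useful life.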
> If you have access to early testing, you'll have some idea of the margins for functional operation that this limit condition provides, and the accompanying MTBF calculation for this previously-measured condition.
>
> It is only when margins to the limits are actually exceeded that predicted life is possibly compromised.
Any time "operating" comes at the expense of "remaining useful life". If you can assume that the time spent operating is << the expected useful life, then you can ignore it.

OTOH, if your "usage in test/burn-in" represents a significant portion of the useful life of the device, approaching it willy-nilly can be costly.

E.g., there are SIMM/DIMM connectors that are rated for a *handful* of insertion cycles. You'd not want to design a test plan that called for them to be exercised *dozens* of times -- and then wonder why their reliability suffered post-test!
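The arithmetic is trivial but worth making explicit in the test plan. A hypothetical sketch (the rating, reserve, and insertion counts are all invented):

# Rated mating cycles for the socket (from its datasheet) and the
# cycles we must leave for field service -- both invented here.
RATED_CYCLES = 25
FIELD_RESERVE = 10

planned_insertions = {
    "incoming inspection": 2,
    "board-level test": 3,
    "burn-in fixture": 4,
    "final assembly": 1,
}

used = sum(planned_insertions.values())
print(f"test plan consumes {used} of {RATED_CYCLES} rated cycles")
assert RATED_CYCLES - used >= FIELD_RESERVE, \
    "test plan eats into the connector's field life"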
> Complete thermal cycling is impractical for simple burn-in. It is usually restricted in application to design verification or later sample process quality assurance.
Most devices operate in a narrow temperature range -- especially if deployed in human-occupied environments. Something intended for use in a lab will likely see constant temperatures.

OTOH, there are classes of devices that are not constrained by "human habitation". E.g., an outdoor weather station will typically see 100+C variations over its lifetime -- though likely only ~30C in any given (short) interval. Here, I expect to encounter 0F to ~140F as a normal yearly range. In North Dakota, it might be -40F to +100F, etc. If you don't want to design separate AZ and ND models, you have to test the design at the union of those conditions.

Then there are devices that are intended to operate in environments where the operating conditions are varied by necessity (e.g., many manufacturing processes).
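Deriving the test envelope is then just the union of the deployment ranges -- a one-liner, shown here with illustrative numbers (deg F):

# Deployment climates (illustrative yearly extremes, deg F).
climates = {
    "Arizona": (0.0, 140.0),
    "North Dakota": (-40.0, 100.0),
}

# One model for all markets -> test at the union of the ranges.
lo = min(low for low, _ in climates.values())
hi = max(high for _, high in climates.values())
print(f"test envelope: {lo:+.0f} F to {hi:+.0f} F")  # -40 F to +140 F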
> Cold cycling tolerance is relevant to consumer products mainly to demonstrate air-shipment worthiness.
Or, it's -26F outside and will be that way for a few days!
> For burn-in, simple on-off cycling to allow stress over self-generated temperature swings is considered adequate.
On Tuesday, 6 December 2022 at 19:38:38 UTC, Don Y wrote:
> [...]
> > For burn-in, simple on-off cycling to allow stress over self-generated temperature swings is considered adequate.
There are some assumptions about the thermodynamics of the failure mechanisms built into many accelerated testing scenarios which are sometimes not justified.

John
On 12/6/2022 3:10 PM, John Walliker wrote:
> There are some assumptions about the thermodynamics of the failure mechanisms built into many accelerated testing scenarios which are sometimes not justified.
Again, my point is that you need to understand your design and the environment in which it will operate before you can create a burn-in strategy. And you need to monitor the EXPECTED failures in your burn-in process to determine whether the characterization of the product needs to be updated (new model).

Places that take this seriously usually have staff on hand to keep track of the process for all such products. After release to manufacturing, engineering only gets re-involved when "they" discover something has changed (component suppliers, process, etc.).

Early in my career, I used an MNOS WAROM in a design. The *suggested* burn-in regimen would have caused 100% failures in the first few *hours*, as the device was only guaranteed for ~1,000 (that's "one thousand") write cycles. So we had to come up with a different way of exercising the design that didn't involve repeated accesses to that device. Similar issues exist today with MLC flash, etc.
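The back-of-envelope behind that 100%-failure claim, with an assumed loop rate (the pass period is invented for illustration; the ~1,000-cycle endurance is the part's guarantee):

# Guaranteed write endurance of the MNOS part, and an assumed
# burn-in loop that writes it once per pass, one pass every 10 s.
ENDURANCE_CYCLES = 1_000
WRITES_PER_PASS = 1
PASSES_PER_HOUR = 3600 / 10  # assumed loop rate

hours_to_wearout = ENDURANCE_CYCLES / (WRITES_PER_PASS * PASSES_PER_HOUR)
print(f"endurance exhausted after ~{hours_to_wearout:.1f} h of burn-in")
# ~2.8 h: any 24/48/72-hour regimen kills the part, so the
# exerciser must skip writes to that device entirely.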
John Walliker wrote:
> [...]
>
> There are some assumptions about the thermodynamics of the failure mechanisms built into many accelerated testing scenarios which are sometimes not justified.
>
> John
The idea that all failure modes follow an Arrhenius temperature dependence over many orders of magnitude is completely up a pole. MIL-HDBK-217 really ought to be relegated to a museum.

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs
Principal Consultant
ElectroOptical Innovations LLC / Hobbs ElectroOptics
Optics, Electro-optics, Photonics, Analog Electronics
Briarcliff Manor NY 10510
http://electrooptical.net
http://hobbs-eo.com
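For readers who haven't met it, the Arrhenius arithmetic being criticized looks like the sketch below. It holds only if a single thermally activated failure mechanism with one activation energy dominates -- exactly the assumption in question -- and the 0.7 eV default used here is a commonly quoted catch-all, not a law of nature.

import math

K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant, eV/K

def arrhenius_af(t_use_c, t_stress_c, ea_ev=0.7):
    # Acceleration factor between use and stress temperature,
    # assuming one dominant mechanism with activation energy ea_ev.
    t_use = t_use_c + 273.15
    t_stress = t_stress_c + 273.15
    return math.exp((ea_ev / K_BOLTZMANN_EV) * (1.0 / t_use - 1.0 / t_stress))

# 48 h at 85 C "equals" roughly 26x that at 40 C -- if, and only if,
# every relevant failure mode actually follows the model.
af = arrhenius_af(40.0, 85.0)
print(f"AF = {af:.0f}; 48 h of stress ~ {48 * af:.0f} h of use")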