RE: [suse-oracle] [OT] Proliant DL740 hangs with sles8 (k_smp 2.4.21-190)

McAllister, Andrew


> -----Original Message-----
> From: Michael Hasenstein [mailto:mha@(protected)]
> Miquel Colom wrote:
> > 1-Hangs reported due to FC cards. SOLVED with firmware upgrade.
We don't have FC cards and our Perc4/DCs are at 3.28/1.05 firmware etc.

> > 2-Hangs suspected to be due to using framebuffer. DISABLING fmb not
> > tested.
We never had a problem when using the console.

> > 3-Hangs due to asynch io. SOLVED by applying a RHEL 3.0 (on SLES8, is
> > that correct)?.
> Not RH specific, it is an Oracle bug with the stub libraries.
We haven't applied this patch but are researching now.

> > 4-Hangs due to bad reiserfs filesystem. SOLVED with reiserfs fsck.
> > Correct?
Someone reported this. We'll be fscking next reboot.

> > 5-Finally, Andrew is experiencing hangs due to high load. There is here
> > no FC card, but there is async io and reiserfs. Also there is a note
> > that taking out the broadcom cards contributes to a better uptime. This
> > can be a driver problem or an IO-APIC issue. Hangs not reproducible on
> > test system, only in production (sigh).
We can't use ext2, as our database files are all 2 GB each and some of our
nightly data loads come in as files larger than 2 GB (around 6 GB).

> >
> > Do I miss something?
> I'd be interested to see this with ext2, I'd like to know if reiserfs
> is involved, directly or indirectly (triggering a bug somewhere else)
> doesn't matter.

Our open TAR number with oracle was sent to Michael off-list.

Our production system is in its last stages before meltdown as of right
now. We're limping along until after business hours. One of the two
listeners is dying every couple of hours (this is also a symptom that we
thought was gone with the pro100 card install). And now a user is
reporting corrupt database block errors. We'll run the fsck on next
reboot.

Any other ideas as to what to look for? If this thing hangs between 8-5
US/Central we'll have to bring it back up immediately, if after 5 we may
have an hour or so to tweak or check settings. I'd be happy to run any
non-destructive test or check of settings while the machine is on its
last legs. Obviously if it is a true hang, we'll have to power cycle.

Another interesting development...
We set up a test 2650 with SLES 8 SP3 and 2.4.21-190-smp. We've also
built a stress test database and set of data files with load scripts
that will continuously load data and check for errors.

This test database on our standby 6650 will produce "ORA-12599:
TNS:cryptographic checksum mismatch" and "ORA-03113: end-of-file on
communication channel" after about 15 minutes. These Oracle errors are
the first sign of impending doom. We normally get them after 24 hours of
running on our production box. On the production side, eventually this
ORA-... error rate goes up and the box will exhibit other symptoms like
hung pipes, then file systems, then complete hangs.
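
To watch that error rate during a stress run, something like the following could tally the ORA- codes in the load script output. This is only a rough sketch, not our actual load scripts: the function names are made up for illustration, and it assumes the errors show up in the captured text in the usual "ORA-nnnnn" form.

```python
import re
from collections import Counter

# Matches Oracle error codes such as ORA-12599 or ORA-03113.
ORA_PATTERN = re.compile(r"\bORA-(\d{5})\b")

# The two errors described above as the first warning signs.
EARLY_WARNING = ("ORA-12599", "ORA-03113")


def count_ora_errors(text):
    """Return a Counter mapping each ORA-nnnnn code to its number of hits."""
    return Counter("ORA-" + code for code in ORA_PATTERN.findall(text))


def early_warning_count(text):
    """Return how many times the two early-warning errors appear in text."""
    counts = count_ora_errors(text)
    return sum(counts[code] for code in EARLY_WARNING)


if __name__ == "__main__":
    # Hypothetical captured output from one load, for illustration only.
    sample = (
        "load 1 ok\n"
        "ORA-12599: TNS:cryptographic checksum mismatch\n"
        "load 2 ok\n"
        "ORA-03113: end-of-file on communication channel\n"
    )
    print(count_ora_errors(sample))
    print("early warnings:", early_warning_count(sample))
```

Running this over the output of each load and logging the count with a timestamp would show whether the rate is climbing before the pipes and file systems start to hang.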

On the test 2650 we went through 2700 loads last night without any
problem. The test 2650 has ASYNC IO turned OFF!

Differences between the 6650 and 2650: number and speed of CPUs,
chipset?, the 6650s have PERC4/DC with megaraid2 drivers, the 2650 has the
internal Adaptec raid (aacraid driver). Async IO was off on the 2650.
Both use reiserfs and LVM.

We've just disabled async IO on the standby 6650 and are restarting that
stress test. Will report results back.

We just turned async IO on for the 2650, relinked everything, and are
starting the stress test over again to see if we get errors. If so, we
know it is async IO. If not, then is it something in the kernel or the
megaraid drivers?

Tom reported problems with max open files:
5676   1495   131072
Should be no problem for us here.
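
For anyone who wants to check the same thing on their own box, here is a small sketch of the arithmetic. It assumes the three numbers are in the format of /proc/sys/fs/file-nr, with the first value being allocated file handles and the third the system-wide maximum (the meaning of the middle value differs between kernel versions, so it is ignored here). The function name is made up for illustration.

```python
def file_handle_usage(line):
    """Return allocated file handles as a fraction of the maximum.

    line is expected to hold three whitespace-separated integers:
    allocated handles, a second counter (ignored), and the maximum.
    """
    allocated, _second, maximum = (int(field) for field in line.split())
    return allocated / maximum


if __name__ == "__main__":
    usage = file_handle_usage("5676   1495   131072")
    print("file handles in use: {:.1%} of max".format(usage))
```

With the numbers above, allocated handles come to well under a tenth of the limit.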

Thus I think we are narrowing down the problems to megaraid,
chipset/hardware, or something in the kernel that is aggravated by our


