I manage a computer farm which uses GFS to share
storage. At HP's suggestion we updated the firmware
of our McDATA Fibre Channel switch. It was not
supposed to cause any problems or complications.
However, the cluster machines stopped being able
to fence via the switch a week after the update. We didn't
think the update was the cause, given the time that
had passed before the problem occurred. Reviewing the
systems, I noticed that it had to do with fencing.
I reviewed the fence_mcdata program and discovered a
very slight but annoying modification introduced by
the firmware update. When you ask the switch about
the status of a port it used to answer:
Port Information
Port Number: 0
Name: Name of the port
Blocked: true << or false, depending on the status
Extended Distance: false
Type: gPort
Now it answers:
Port Information
Port Number: 0
Name: Name of the port
Blocked: Blocked << or Unblocked, depending on the status
Extended Distance: false
Type: gPort
This simple modification brought down my cluster, and I had to modify fence_mcdata
from:
    foreach my $line (@lines)
    {
        my $field = "";
        my $b_state = "";
        if ( $line =~ /^(.*):\s*(\S*)/ )
        {
            $field = $1;
            $b_state = $2;
        }
        next unless ( $field eq "Blocked" );
        if ( ($block && $b_state eq "true") ||
             (!$block && $b_state eq "false") )
        {
            $fail = 0;
        }
        last;
    }
to:
    foreach my $line (@lines)
    {
        my $field = "";
        my $b_state = "";
        if ( $line =~ /^(.*):\s*(\S*)/ )
        {
            $field = $1;
            $b_state = $2;
        }
        next unless ( $field eq "Blocked" );
        if ( ($block && $b_state eq "Blocked") ||
             (!$block && $b_state eq "Unblocked") )
        {
            $fail = 0;
        }
        last;
    }
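For anyone running switches on both firmware versions, a more tolerant
check could accept either vocabulary. What follows is only a sketch, not
the shipped fence_mcdata code: the helper name port_state_matches and the
sample data are made up for illustration, and it assumes the switch output
looks exactly like the two listings above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of fence_mcdata): returns 1 if the
# "Blocked" line in @lines matches the state we asked for, 0 otherwise.
sub port_state_matches
{
    my ($block, @lines) = @_;

    foreach my $line (@lines)
    {
        # Only the "Blocked: <value>" line is of interest.
        next unless ( $line =~ /^\s*Blocked:\s*(\S+)/ );
        my $b_state = lc($1);

        # Old firmware answers true/false, new firmware Blocked/Unblocked.
        my $is_blocked   = ( $b_state eq "true"  || $b_state eq "blocked" );
        my $is_unblocked = ( $b_state eq "false" || $b_state eq "unblocked" );

        return 1 if ( $block && $is_blocked );
        return 1 if ( !$block && $is_unblocked );
        return 0;
    }
    return 0;   # no "Blocked" line found: treat as a failure
}

# Sample output in the new firmware format (assumed, see listings above).
my @new_fw = ("Port Information", "Port Number: 0", "Blocked: Blocked");
my $fail = port_state_matches(1, @new_fw) ? 0 : 1;
print "fail=$fail\n";
```

Matching the value case-insensitively and treating a missing "Blocked" line
as a failure keeps the check on the safe side: an answer the script does not
understand is never mistaken for a successful fence.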
It is a real shame that our distributors didn't tell
us that the interface had been modified like this!
It took a lot of time to track down the problem,
and our systems had to run on just one machine
until we found it. They shouldn't modify their interfaces
without clearly notifying all of their clients.
Bad form, folks at HP: you didn't honor
our priority contract.