From: Herbert Poetzl (herbert_at_13thfloor.at)
Date: Mon 10 Feb 2003 - 13:29:18 GMT
On Tue, Feb 11, 2003 at 08:07:14AM +1300, Sam Vilain wrote:
> On Thu, 06 Feb 2003 23:03, you wrote:
> > On Wed, Feb 05, 2003 at 07:55:19PM +0100, Herbert Poetzl wrote:
> > > On Wed, Feb 05, 2003 at 04:46:21PM +0000, Paul Sladen wrote:
> > > > On Mon, 3 Feb 2003, John Goerzen wrote:
> > >
> > > Justin M Kuntz reported a kernel oops in
> > > sched.c 570 on a 2.4.20 ctx16 with reiserfs
> > > on January 1, 2003, so this seems to be
> > > the same race ...
> >
> > Hmm, I'm also using reiserfs on the server which crashed, so it might be
> > related.
> >
> > John, are you using reiserfs?
> 
> Hmm, funny you should suspect reiserfs so quickly.  You have good reason.
> 
> As I've recently become painfully aware, reiserfs can easily break under
> not so unusual circumstances.  Though I used to swear by it, in two years
> or so of using it I have had five unexplained data corruption incidents
> while running so-called `stable' versions since the early 2.4 days, which
> is five more than with all other UNIX filesystems I've used combined.
> Three of these followed a system crash, when reiserfs's journalling
> failed.  One of them resulted in a complete loss of the filesystem
> structure, due to the inadequacy of the `reiserfsck' tool.
> 
> In addition to data corruption, it's not all that hard to create a 
> directory structure that even root cannot read; I've just managed to 
> create one, and all I was doing was duplicating ~25% of the directory 
> structure using an analogue of `cp -al'.  Reiserfs really cracks under 
> pressure, and that's the last thing you want a filesystem to do!
> 
> With these problems under high load, it's hard to think of a truly useful 
> application for reiserfs.  It really is still experimental as hell; the 
> version in 2.4.20 seems particularly bad.  Best to stick with ext3/ext2 

Hans Reiser will hate you for that *G* ...

> (with the directory hashing patch if you need it).  Or try your luck with 

Hmm, should I mention that the change to the
ext3 htree extension (which is part of the latest
ext3 versions) easily wiped several partitions,
because the e2fsck tools weren't up to date ...

> xfs/jfs if you really need the speed.

It seems you have detailed information on how
xfs, jfs, reiserfs and ext3 compare in terms of
speed and behaviour under load? Please share it with us!

> Check out this e-mail seen on the reiserfs list:
> ----
> [... talking about a crash ...]
> And now I can reliably reproduce it.  It has nothing to do with MD,
> linear, raid, SMP, or unclean shutdowns.
> 
> I can reproduce this bug on a plain IDE disk partition in about three
> hours on Linux 2.4.20 (compiled for SMP but running on UP, full .config
> and system details available on request).  My test system has about 4 gigs
> under /etc, /usr, and /var, /dev/hdc2 is 25GB, and there is 1G of swap.
> 
> 
> 
> 
> BEGIN cut-and-paste-into-a-root-shell
> 
> # Create an empty filesystem:
> 
> mkreiserfs -f -f /dev/hdc2
> mount /dev/hdc2 /test
> cd /test
> 
> # Script used to control the load average.  Note that as written the loops
> # below will keep spawning new processes, so we need some way to throttle
> # them.  Change the '-lt 10' to another number to change the number
> # of processes.
> 
> cat <<'LC' > loadcheck && chmod 755 loadcheck
> #!/bin/sh
> read av1 av5 av15 rest < /proc/loadavg
> echo -n "Load Average: $av1 ... "
> av1=${av1%.*}
> if [ $av1 -lt 10 ]; then
>         echo OK
>         exit 0
> else
>         echo "Whoa, Nellie!"
>         exit 1
> fi
> LC
> 
> # Create directories used by test
> mkdir foo bar
> 
> # Start up some rsyncs.  I use /etc, /usr, and /var because there's a 
> # good mixture of files with some hardlinks between them, and on a normal
> # Linux system some of them change from time to time.
> 
> while sleep 1m; do 
>         ./loadcheck || continue; 
>         for x in usr etc var; do 
>                 rsync -avxHS --delete /$x/. foo/$x/. & 
>         done; 
> done &
> 
> # Start up some cp -al's and rm -rf's.  Note there are two concurrent
> # sets of 'cp's and two concurrent sets of 'rm's, and each of those
> # has different instances of 'cp' and 'rm' running at different times.
> for x in  1 2; do
>         while sleep 1m; do 
>                 ./loadcheck || continue; 
>                 cp -al foo bar/`date +%s` & 
>         done &
>         while sleep 1m; do 
>                 ./loadcheck || continue; 
>                 for x in bar/*; do 
>                         rm -rf $x; 
>                         sleep 1m; 
>                 done & 
>         done &
> done &
> 
> END cut-and-paste-into-a-root-shell
> 
> 
> 
> 
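The throttle in the quoted loadcheck script compares only the integer
part of the 1-minute load average, so a load of 9.87 still counts as
"below 10". A minimal sketch of that comparison as a reusable function
follows; the name load_below and its two arguments are hypothetical and
not part of the original report, and the load string is passed in
explicitly so the logic can be tried without touching /proc/loadavg.

```shell
#!/bin/sh
# Hypothetical helper, not from the original report.
# $1 = a loadavg-style string such as "9.87 5.00 3.00 1/100 1234"
# $2 = integer limit
# Exit status 0 if the integer part of the 1-minute average is below $2.
load_below() {
    av1=${1%% *}     # keep only the first field (1-minute average)
    av1=${av1%.*}    # drop the fractional part, e.g. "9.87" becomes "9"
    [ "$av1" -lt "$2" ]
}

# Possible use on a live Linux system (left commented out on purpose):
# read line < /proc/loadavg && load_below "$line" 10 && echo OK
```

Because the fraction is truncated rather than rounded, the loops keep
spawning work right up to the limit; raising or lowering the second
argument has the same effect as editing the '-lt 10' in the script.
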
> rm, and occasionally cp, will frequently complain about "No such file
> or directory".  This is normal.  After about 3 hours, the following
> abnormal messages appear:
> 
> readlink lib/R/library/base/help/contrasts: Permission denied
> readlink lib/R/library/base/html/hsv.html: Permission denied
> rm: cannot remove `bar/1042550428/usr/src/kernel-source-2.4.20-zb-586-smp/drivers/net/appletalk/ltpc.o': Permission denied
> rm: cannot remove `bar/1042550428/usr/src/kernel-source-2.4.20-zb-586-smp/drivers/net/aironet4500_proc.c': Permission denied
> cp: cannot stat `foo/usr/src/kernel-source-2.4.20-zb-586-smp/drivers/net/e1000/.e1000_ethtool.o.flags': Permission denied
> cp: cannot stat `foo/usr/src/kernel-source-2.4.20-zb-586-smp/drivers/net/.eepro.o.flags': Permission denied
> 
> This needs a 'reiserfsck --fix-fixable' to fix.  
> 
> It looks to me like there may be some sort of locking bug triggered by
> concurrent link/unlink/rename calls, but I'm not even a filesystem expert,
> much less a reiserfs expert.  ;-)
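
Since the quoted report treats "No such file or directory" as expected
noise and "Permission denied" (seen as root) as the failure signature,
one way to watch a long run would be to capture the output of the loops
and count only the latter. A rough sketch, assuming the output has been
redirected to a log; the function name count_denied and the log file
name in the usage comment are hypothetical and not from the original
report.

```shell
#!/bin/sh
# Hypothetical helper, not from the original report.
# Reads captured rm/cp/rsync output on stdin and prints the number of
# lines containing "Permission denied". Lines with the expected
# "No such file or directory" noise are simply not counted.
count_denied() {
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so mask the exit status to keep callers using 'set -e' happy.
    grep -c 'Permission denied' || true
}

# Possible use (left commented out; test.log is an assumed file name):
# count_denied < test.log
```

A non-zero count on a run started as root would be the cue to stop the
test and inspect the filesystem.
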
> 
> -- 
> Sam Vilain, sam_at_vilain.net
> 
>   To be sure of hitting the target, shoot first, and call whatever you
> hit the target.
> ASHLEIGH BRILLIANT