Re: [vserver] hybrid zfs pools as iSCSI targets for vserver

From: John A. Sullivan III <jsullivan_at_opensourcedevel.com>
Date: Sat 06 Aug 2011 - 23:09:54 BST
Message-ID: <1312668594.8151.27.camel@denise.theartistscloset.com>

Thank you, Gordan. I'm very interested in pursuing this and will answer
in-line - John

On Sat, 2011-08-06 at 21:59 +0100, Gordan Bobic wrote:
> On 08/06/2011 09:51 PM, John A. Sullivan III wrote:
> > On Sat, 2011-08-06 at 21:37 +0100, Gordan Bobic wrote:
> >> On 08/06/2011 09:30 PM, John A. Sullivan III wrote:
> >>> On Sat, 2011-08-06 at 21:40 +0200, Eugen Leitl wrote:
> >>>> I've recently figured out how to make low-end hardware (e.g. HP N36L)
> >>>> work well as zfs hybrid pools. The system (Nexenta Core + napp-it)
> >>>> exports the zfs pools as CIFS, NFS or iSCSI (Comstar).
> >>>>
> >>>> 1) is this a good idea?
> >>>>
> >>>> 2) any of you are running vserver guests on iSCSI targets? Happy with it?
> >>>>
> >>> Yes, we have been using iSCSI to hold vserver guests for a couple of
> >>> years now and are generally unhappy with it. Besides our general
> >>> distress at Nexenta, there is the constraint of the Linux file system.
> >>>
> >>> Someone please correct me if I'm wrong because this is a big problem for
> >>> us. As far as I know, Linux file system block size cannot exceed the
> >>> maximum memory page size and is limited to no more than 4KB.
> >>
> >> I'm pretty sure it is _only_ limited by memory page size, since I'm
> >> pretty sure I remember that 8KB blocks were available on SPARC.
> > Yes, or for example, Oracle can write directly bypassing the file system
> > and thus works very well with iSCSI by setting very large block sizes.
>
> That sounds very much like your problem is in the FS layer rather than
> iSCSI. Unless you are saying that it insists on ACK-ing per-operation.
> But the FS flush will typically commit in large batches rather than
> block-by-block, unless your every write is fsync()-ed.
I'm somewhat out of my depth here, but I think it is an iSCSI problem
and that iSCSI needs to ACK each operation. Let's consider a READ
request. The OS requests some number of blocks (or bytes - I'm not
sure which), and I assume it cares not whether the device is local
disk or iSCSI. iSCSI takes the request and passes it on to the target
but, I believe, each block read then has to be individually ACKed.
WRITE requests, I suspect, are similar - I am only guessing. The OS
may batch the writes, but it ultimately hands the batch to iSCSI,
which writes the blocks, acknowledging each one.
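
To make this concrete, here is a toy sketch of the behaviour I mean (a
simulation only, not real iSCSI; the 0.2 ms round trip and 4 KiB block
size are assumptions for illustration, not our measurements):

import time

def serialized_read(n_blocks, round_trip=0.0002, block_size=4096):
    """Read n_blocks one at a time, waiting out a round trip per block."""
    start = time.time()
    received = 0
    for _ in range(n_blocks):
        time.sleep(round_trip)   # wait for the target to answer this block
        received += block_size   # only then can we ask for the next block
    elapsed = time.time() - start
    return received / elapsed    # effective bytes per second

print(serialized_read(1000) / 1e6, "MB/s")  # roughly block_size / round_trip

However much bandwidth the link has, that loop is paced by the
round-trip time.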
>
> >>> iSCSI
> >>> appears to acknowledge every individual block that is sent. That means
> >>> the most data one can stream without an ACK is 4KB. That means the
> >>> throughput is limited by the latency of the network rather than the
> >>> bandwidth.
> >>
> >> Hmm, buffering in the FS shouldn't be dependent on the block layer
> >> immediately acknowledging unless you are issuing fsync()/barriers. What
> >> FS are you using on top of the iSCSI block device and is your
> >> application fsync() heavy?
> > The application is for standard file service and we are not using
> > barriers. We have tried using the device as disk, as part of LVM, as
> > part of a RAID device, as part of dm-multipath multi-bus. Pretty much
> > the same results across the board. We could produce higher aggregate
> > throughput with RAID and multibus by multiplexing several individual
> > streams but even then only after changing from the default CFQ scheduler
> > to noop (which I suppose makes sense when writing to a SAN). Individual
> > streams are still limited by latency.
>
> I find that the deadline scheduler works best for most things.
Does a scheduler even do anything on a SAN where there is no direct
access to the physical disk? Again, I am only guessing. I would assume
the scheduler is optimizing head movement on the disk - no, probably
not that low level, but rather organizing read and write requests in
the way it thinks the physical media will handle best. Once the data
is sent to the SAN, doesn't the SAN do exactly that with its own
optimization routines? Thus, I would think any optimization of
SAN-bound data is wasted overhead, but I am only guessing. As you can
tell, this is not an area of expertise for me!
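
If it helps, here is a minimal sketch of switching the elevator per
device through sysfs (the device name "sdb" is only an example, and
writing the file needs root):

from pathlib import Path

def set_scheduler(device: str, scheduler: str = "noop") -> str:
    """Select the I/O scheduler for a block device and report the result."""
    path = Path(f"/sys/block/{device}/queue/scheduler")
    path.write_text(scheduler)         # e.g. "noop", "deadline", "cfq"
    return path.read_text().strip()    # the active one is shown in brackets

print(set_scheduler("sdb", "noop"))    # e.g. "[noop] deadline cfq"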
>
> > When we are drawing from cache, our systems fly but anything that
> > touches disk is painfully slow - John
>
> I presume you have disabled things like atime. What kernel are you
> running and what fs? I have found that moving from 2.6.29 to 2.6.38 with
> ext2 made a very noticeable difference to disk I/O performance on slow
> media (SD cards in my case). ext4 (without the journal) is even better.
> Could it be that you are suffering from a kernel "feature" that has
> since been fixed?
Yes, we have disabled atime. We are using 2.6.28 and 2.6.29 and are
eagerly anticipating an upgrade to CentOS 6.1 and moving to something
newer (closely following the list discussion on a stable release). We
are using ext4 primarily.

I don't think it is a kernel feature. I think the mathematics stand on
their own and closely match our measurements, which is what both the
Nexenta engineers and the very helpful folks on the dm mailing list
had to say. Take our measured latency, work it back to IOPS, multiply
by the block size, and it matches our measured throughput almost
exactly.
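
For illustration, the back-of-the-envelope version with assumed round
numbers (0.2 ms per request, 4 KiB blocks) rather than our exact
figures looks like this:

round_trip_s = 0.0002              # assumed per-request latency (0.2 ms)
block_size_b = 4096                # largest transfer a 4 KiB FS block allows

iops = 1.0 / round_trip_s          # one acknowledged request per round trip
mb_per_s = iops * block_size_b / 1e6

print(f"{iops:.0f} IOPS x {block_size_b} B = {mb_per_s:.1f} MB/s per stream")
# ~5000 IOPS x 4096 B = ~20.5 MB/s per stream, however fast the link is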
>
> Gordan
Received on Sat Aug 6 23:10:10 2011
