[vserver] Re: [OT] [vserver] hybrid zfs pools as iSCSI targets for vserver

From: John A. Sullivan III <jsullivan_at_opensourcedevel.com>
Date: Sun 07 Aug 2011 - 02:07:00 BST
Message-ID: <1312679220.8151.47.camel@denise.theartistscloset.com>

I hope you don't mind that I labeled this Off Topic, as it is not specifically
about VServer at this point, but . . . as long as no one minds . . .
I'll respond in-line - John

On Sun, 2011-08-07 at 01:30 +0100, Gordan Bobic wrote:
> On 08/06/2011 11:09 PM, John A. Sullivan III wrote:
> > Thank you, Gordan. I'm very interested to pursue this and will answer
> > in-line - John
> >
> > On Sat, 2011-08-06 at 21:59 +0100, Gordan Bobic wrote:
> >> On 08/06/2011 09:51 PM, John A. Sullivan III wrote:
> >>> On Sat, 2011-08-06 at 21:37 +0100, Gordan Bobic wrote:
> >>>> On 08/06/2011 09:30 PM, John A. Sullivan III wrote:
> >>>>> On Sat, 2011-08-06 at 21:40 +0200, Eugen Leitl wrote:
> >>>>>> I've recently figured out how to make low-end hardware (e.g. HP N36L)
> >>>>>> work well as zfs hybrid pools. The system (Nexenta Core + napp-it)
> >>>>>> exports the zfs pools as CIFS, NFS or iSCSI (Comstar).
> >>>>>>
> >>>>>> 1) is this a good idea?
> >>>>>>
> >>>>>> 2) any of you are running vserver guests on iSCSI targets? Happy with it?
> >>>>>>
> >>>>> Yes, we have been using iSCSI to hold vserver guests for a couple of
> >>>>> years now and are generally unhappy with it. Besides our general
> >>>>> distress at Nexenta, there is the constraint of the Linux file system.
> >>>>>
> >>>>> Someone please correct me if I'm wrong because this is a big problem for
> >>>>> us. As far as I know, Linux file system block size cannot exceed the
> >>>>> maximum memory page size and is limited to no more than 4KB.
> >>>>
> >>>> I'm pretty sure it is _only_ limited by memory page size, since I'm
> >>>> pretty sure I remember that 8KB blocks were available on SPARC.
> >>> Yes, or for example, Oracle can write directly bypassing the file system
> >>> and thus works very well with iSCSI by setting very large block sizes.
> >>
> >> That sounds very much like your problem is in the FS layer rather than
> >> iSCSI. Unless you are saying that it insists on ACK-ing per-operation.
> >> But the FS flush will typically commit in large batches rather than
> >> block-by-block, unless your every write is fsync()-ed.
> > I'm somewhat out of my depth here but I think it is an iSCSI problem and
> > that iSCSI needs to ACK each operation. Let's consider a READ request.
> > The OS will request some number of blocks (or bytes; I'm not sure which).
> > It doesn't care whether it is local disk or iSCSI, I assume. I think iSCSI
> > gets the request and passes it on to the target but, I believe, it needs
> > to ACK each block read.
> > WRITE requests I suspect are similar - I am only guessing. The OS may
> > batch the writes but ultimately hands the batch to iSCSI, which writes
> > the blocks, acknowledging each one.
>
> That just seems wrong because TCP will guarantee the ordering and take
> care of the fragmentation and ACKs. It makes no sense for a protocol
> running over TCP to do its own reliability and ordering work. I think
> the problem is elsewhere.
It's not ACKing every packet; it's ACKing every block. At least that is
what we have been told. I don't recall if I put a protocol analyzer on
the line to confirm that. So it's not a transport-layer ACK; it's a
data-layer ACK.
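If I wanted to confirm it, a capture on one of the iSCSI interfaces would
settle the question; something along these lines (eth1 and the standard
iSCSI port 3260 assumed):

# capture the initiator/target conversation for offline analysis
tcpdump -i eth1 -s 0 -w iscsi.pcap port 3260

Opening the capture in Wireshark and filtering on "iscsi" should show
whether each SCSI Read/Write command PDU gets its own response before the
next command goes out, which is what a per-block, data-layer ACK would
look like on the wire.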
>
> >>>>> iSCSI
> >>>>> appears to acknowledge every individual block that is sent. That means
> >>>>> the most data one can stream without an ACK is 4KB. That means the
> >>>>> throughput is limited by the latency of the network rather than the
> >>>>> bandwidth.
> >>>>
> >>>> Hmm, buffering in the FS shouldn't be dependent on the block layer
> >>>> immediately acknowledging unless you are issuing fsync()/barriers. What
> >>>> FS are you using on top of the iSCSI block device and is your
> >>>> application fsync() heavy?
> >>> The application is for standard file service and we are not using
> >>> barriers. We have tried using the device as disk, as part of LVM, as
> >>> part of a RAID device, as part of dm-multipath multi-bus. Pretty much
> >>> the same results across the board. We could produce higher aggregate
> >>> throughput with RAID and multibus by multiplexing several individual
> >>> streams but even then only after changing from the default CFQ scheduler
> >>> to noop (which I suppose makes sense when writing to a SAN). Individual
> >>> streams are still limited by latency.
> >>
> >> I find that the deadline scheduler works best for most things.
> > Does a scheduler even do anything on a SAN where there is no direct
> > access to the physical disk?
>
> Yes. If your data medium access is slow then you want to make sure that
> you maximize the amount of work you can do regardless. Writes can be
> buffered. Reads, however, have to happen immediately, and if you can
> provide the data ASAP, the process can continue. With the deadline
> scheduler you can make reads take priority over writes which gives best
> latency in the general case.
Interesting, although we found no measurable difference between noop and
deadline.
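For the record, switching schedulers is just a sysfs write against the sd
devices that back the iSCSI sessions (sdb below is only an example):

# show the available elevators; the active one is in brackets
cat /sys/block/sdb/queue/scheduler

# switch to noop (or deadline) at runtime
echo noop > /sys/block/sdb/queue/scheduler

The same thing can be made permanent with elevator=noop (or
elevator=deadline) on the kernel command line.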
>
> > Again, I am only guessing. I would assume
> > the scheduler is optimizing head movement on the disk - no, probably not
> > that low level, perhaps organizing read and write requests in a way it
> > thinks will be best handled by the physical media.
>
> Yes, CFQ does that, but deadline does not. On anything except spinning
> disks, deadline should yield better performance. The more
> cached/abstracted your spinning disks are, the less advantage CFQ will
> have. I run deadline even on mechanical disks because lower latency is
> generally more useful than higher throughput, especially as with TCQ
> enabled the disk will internally do re-ordering, CFQ style.
>
> > Once the data is
> > sent to the SAN, doesn't the SAN do exactly that with its own
> > optimization routines? Thus, I would think any optimization of SAN-bound
> > data is wasted overhead, but I am only guessing. As you can tell, it
> > is not an area of expertise for me!
>
> It's not a waste of time because you are claiming that your problem is
> the ACKs between the server and the SAN. It is the read latency there that
> you want to reduce. Once it's hit the SAN, the SAN will internally do
> whatever it is designed to do and you get no say in the matter anyway.
>
> >>> When we are drawing from cache, our systems fly but anything that
> >>> touches disk is painfully slow - John
> >>
> >> I presume you have disabled things like atime. What kernel are you
> >> running and what fs? I have found that moving from 2.6.29 to 2.6.38 with
> >> ext2 made a very noticeable difference to disk I/O performance on slow
> >> media (SD cards in my case). ext4 (without the journal) is even better.
> >> Could it be that you are suffering from a kernel "feature" that has
> >> since been fixed?
> > Yes, we have disabled atime. We are using 2.6.28 and 2.6.29 and are
> > eagerly anticipating an upgrade to CentOS 6.1 and moving to something
> > newer (closely following the list discussion on a stable release). We
> > are using ext4 primarily.
> >
> > I don't think it is a kernel feature. I think the mathematics stand on
> > their own and closely match our measurements. That is what both the
> > Nexenta engineers and very helpful folks on the dm mail list had to say.
> > Take our measured latency and work it back to iops times the block size
> > and it matches our measured throughput almost exactly.
>
> You still haven't said what FS you use and with what, if any,
> creation/mount time options/parameters.
>
> Just out of interest, are your servers on the same physical network
> segment as the SAN? Can your SAN do ATAoE?
Sorry, I thought I had mentioned we are using ext4. It is on a GbE LAN.
The iSCSI parameters have been carefully matched to Nexenta.
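With the open-iscsi initiator, that matching lives in /etc/iscsi/iscsid.conf;
as a sketch, these are the sort of values involved (illustrative, not
necessarily our exact settings):

node.session.cmds_max = 128
node.session.queue_depth = 32
node.session.iscsi.FirstBurstLength = 262144
node.session.iscsi.MaxBurstLength = 16776192
node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144
node.conn[0].iscsi.HeaderDigest = None
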
TCP is highly tuned:
# Controls tcp maximum receive window size
net.core.rmem_max = 8738000
net.ipv4.tcp_rmem = 8192 873800 8738000

# Controls tcp maximum send window size
net.core.wmem_max = 6553600
net.ipv4.tcp_wmem = 4096 655360 6553600

# Prefer low latency over throughput (Nagle and delayed ACKs are
# per-socket options, not controlled by this sysctl)
net.ipv4.tcp_low_latency=1

net.core.netdev_max_backlog = 2000

ip link set eth1 txqlen 2000
ip link set eth2 txqlen 2000
ip link set eth3 txqlen 2000
ip link set eth4 txqlen 2000

We even played with the ring buffers on the NICs. These are not onboard
NICs but rather quad-port Intel cards.

Disks are configured in a software RAID0 array to drive as many iSCSI
channels in parallel as possible. We found this gave us better throughput
under a variety of loads than dm-multipath multi-bus, but we have concluded
we will migrate to multi-bus, as using dm-RAID across iSCSI prevents us
from taking a transactionally consistent snapshot.
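The multi-bus setup we plan to migrate to would look something like this in
/etc/multipath.conf (a minimal sketch only; blacklist and per-device/WWID
sections omitted):

defaults {
    path_grouping_policy    multibus
    path_selector           "round-robin 0"
    rr_min_io               100
    no_path_retry           queue
}

With all four iSCSI paths in a single priority group, I/O is spread
round-robin across the sessions instead of being striped over them with
dm-RAID, which is what should let the target take a transactionally
consistent snapshot of a single LUN.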

blockdev --setra 1024 /dev/mapper/isda
blockdev --setra 1024 /dev/mapper/isdb
blockdev --setra 1024 /dev/mapper/isdc
blockdev --setra 1024 /dev/mapper/isdd

mdadm -C /dev/md_d2 -l 0 -n 4 -c 8 -ap \
    /dev/mapper/isda /dev/mapper/isdb /dev/mapper/isdc /dev/mapper/isdd

Typical formatting:
mkfs.ext4 -b 4096 -E stride=2,stripe-width=8,lazy_itable_init -L DATA -m 1 \
    -O dir_index,flex_bg,extent,uninit_bg,sparse_super /dev/md_d2p3

mount -o defaults,noatime,user_xattr /dev/md_d2p3 /data

I'll include Herbert's always well-informed comments here to keep it all
together.

"assuming a TCP receive window of 64k (maximum) we end up
with 750 kilobit per second at an optimal roundtrip time
of 0.1 milliseconds, which means 93MB/s theoretical maximum
for iSCSI over TCP/IP on gigabit ethernet."

That assumes we can fill the window. That's the problem. Because iSCSI
appears to be acknowledging every block, we can only send about three
packets' worth of data (one 4KB block) before we must wait for an ACK. We
cannot fill the pipe; we are latency/IOPS bound.
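To put numbers on it, a back-of-the-envelope calculation (assuming roughly
0.1 ms of effective round trip per acknowledged 4KB block, which is in line
with what we measured):

throughput ~= block size / round-trip latency
            = 4 KB / 0.1 ms
           ~= 40 MB/s per stream

which is why working the measured latency back through IOPS times block
size matches the measured throughput so closely.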

"so I assume if you speak of pretty lousy disk throughput
of 40MB/s you are using isolated 10G ethernet on the
client (initiator) side and separate 10G connections on
the target side together with TCP offloading and a really
smart switch all utilizing jumbo frames :)"

The connections are isolated 1G Ethernet. We could improve performance
with 10G but only marginally, as the greatest contributor to latency is
not the network traversal but traversing the IP stack, particularly in
OpenSolaris. Ironically, our practical measurements showed about a 20%
improvement when we disabled jumbo frames. We had originally tuned the
iSCSI network to hold two blocks in a jumbo frame, but it now looks like
it will only send one anyway. Even then, it slows down performance. I
can only guess that it adds some latency somewhere and, since we are
latency bound, it negatively impacts performance.
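For anyone who wants to repeat the jumbo-frame comparison, it is just the
MTU on each iSCSI-facing port (the target and the switch have to agree, of
course):

ip link set dev eth1 mtu 9000    # jumbo frames on
ip link set dev eth1 mtu 1500    # back to standard frames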

A nasty, unexpected problem! Thanks, all - John
>
> Gordan
Received on Sun Aug 7 02:07:17 2011
