Re: [vserver] Re: [OT] [vserver] hybrid zfs pools as iSCSI targets for vserver

From: John A. Sullivan III <jsullivan_at_opensourcedevel.com>
Date: Wed 10 Aug 2011 - 06:33:19 BST
Message-ID: <1312954399.3410.11.camel@denise.theartistscloset.com>

On Mon, 2011-08-08 at 06:04 -0400, John A. Sullivan III wrote:
> On Sun, 2011-08-07 at 11:10 +0200, Adrian Reyer wrote:
> > On Sat, Aug 06, 2011 at 09:07:00PM -0400, John A. Sullivan III wrote:
> > > It's not acking every packet; it's acking every block. At least that is
> > > what we have been told. I don't recall if I put a protocol analyzer on
> > > the line to confirm it. So it's not a transport-layer ACK; it's a
> > > data-layer ACK.
> >
> > iSCSI is SCSI; you should be able to use tagged command queuing with
> > somewhat deep queues there.
> > I dislike SANs and think iSCSI is a slow mess, but I currently export
> > disks via iSCSI from one Linux host to another for use as the backup
> > disk for a Bacula (backup) server.
> > I played with the schedulers heavily and tuned them specifically for my
> > load (1GB files, mostly sequentially accessed) using sysfs via sysfsutils:
> >
> > block/sdb/queue/scheduler = deadline
> > # documentation states the scheduler keeps heads where they are after
> > # read requests for a short time to see if further requests to that
> > # location come in. As it is a) a SAN and b) a RAID-system, head
> > # locations are not in any way transparent anyway
> > block/sdb/queue/rq_affinity = 0
> > block/sdb/queue/nr_requests = 1024
> > # quite high read expiry because of my specific workload; you will want
> > # a smaller one
> > block/sdb/queue/iosched/read_expire = 5000
> > # write expiry should be fine; with ext4 my mount options are:
> > # rw,noatime,data=writeback,journal_async_commit,delalloc
> > block/sdb/queue/iosched/write_expire = 2000
> > block/sdb/device/queue_depth = 1024
> > # Again: I use big files on a backup server and jobs are migrated from
> > # storage to tape, so 1MB readahead (the default is 128kB).
> > block/sdb/queue/read_ahead_kb = 1024
> >
> > Regards,
> > _are_
> Very interesting. So are you saying that if we configure tagged command
> queueing properly, iSCSI will send several requests at once without
> asking for an ACK for each? Thanks - John
>
Well, I decided not to just take Nexenta's word for it (nor that of the
very helpful fellow who assisted us on the dm-multipath list) and took
packet traces of our iSCSI connections. I was rather surprised by the
results.
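
For anyone wanting to reproduce this, a plain tcpdump capture along
these lines is enough; eth1 and iscsi.pcap are just placeholders for
your iSCSI interface and output file:

    # Capture full frames on the iSCSI interface (3260 is the default
    # iSCSI port) and write them out for later analysis:
    tcpdump -i eth1 -s 0 -w iscsi.pcap port 3260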

As several have surmised, we are not ACKing every block action; we are
coalescing them. In fact, CFQ coalesces them into huge blocks, whereas
deadline and noop produce more evenly sized requests. We do know that
CFQ prevents scaling across multiple threads, whereas deadline and noop
scale linearly, so CFQ is not the way to go.
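
If anyone wants to repeat the scheduler comparison, it can be done on
the fly without a reboot (sdb is just a placeholder for whatever your
iSCSI disk is named):

    # Show available schedulers; the active one appears in brackets:
    cat /sys/block/sdb/queue/scheduler
    # Switch at runtime - takes effect immediately; rerun the test
    # with each of cfq, deadline, and noop:
    echo deadline > /sys/block/sdb/queue/scheduler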

Regardless of scheduler, we saw "micro-bursts" of, say, 50KB to 250KB
during which throughput literally saturated the GbE connections -
outstanding numbers: 87 MBps, 93 MBps, 125 MBps. However, when we
measured sustained throughput over, say, 10MB to 15MB, it consistently
plummeted to far worse than we had thought - between 4 MBps and
5 MBps.
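
One way to pull burst versus sustained numbers out of a capture is to
slice it into windows of different sizes - for example with tshark,
assuming the trace was saved as iscsi.pcap (name made up):

    # 100ms windows expose the micro-bursts:
    tshark -r iscsi.pcap -q -z "io,stat,0.1,tcp.port==3260"
    # 10s windows show the sustained average collapsing:
    tshark -r iscsi.pcap -q -z "io,stat,10,tcp.port==3260"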

I suspect we are "stuttering" - bursting and pausing. I'll need to
examine the traces a little more closely to see what is happening, but,
in short, we are combining block reads and writes in the iSCSI requests
and saturating the Ethernet connections only for short durations.
Thanks, all - John
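
P.S. For anyone who wants to try Adrian's settings on their own
initiator, this is roughly the runtime equivalent (sdb is again a
placeholder; the iscsiadm line assumes open-iscsi and merely shows what
was actually negotiated with the target):

    echo deadline > /sys/block/sdb/queue/scheduler
    echo 0 > /sys/block/sdb/queue/rq_affinity
    echo 1024 > /sys/block/sdb/queue/nr_requests
    echo 5000 > /sys/block/sdb/queue/iosched/read_expire
    echo 2000 > /sys/block/sdb/queue/iosched/write_expire
    echo 1024 > /sys/block/sdb/device/queue_depth
    echo 1024 > /sys/block/sdb/queue/read_ahead_kb
    # Inspect the negotiated iSCSI parameters (burst sizes, R2T limits):
    iscsiadm -m session -P 3 | grep -i -e burst -e r2t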
Received on Wed Aug 10 06:33:38 2011
