RE: Help me select hardware....Some real world data that might help

From: Paul Tice <ptice@aldridge.com>
Date: Tue, 27 Jan 2009 12:41:20 -0600
Excuse my rambling; perhaps something in this mess will be useful.

I'm currently using 8 cores (2x Xeon E5405), 16G FB-DIMM, and 8 x 750GB
drives on a backup system (I plan to add the other drives in the chassis
one by one, testing speed along the way):
8-CURRENT amd64, ZFS, a Marvell 88SX6081 PCI-X card (8-port SATA) plus an
LSI 1068E (8-port SAS/SATA) for the main array, and the Intel onboard SATA
for the boot drive(s).
Data is sucked down through three gigabit ports, with another available
but not yet activated.
The array drives all live on the LSI right now. Drives are
<ATA ST3750640AS K>.

ZFS is stable _IF_ you disable prefetch and the ZIL; otherwise the classic
ZFS wedge rears its ugly head. I haven't had a chance to test disabling
just one of them yet, but I'd guess prefetch is the quick killer. Even
with prefetch and the ZIL disabled, my current bottleneck is the GigE. I'm
waiting on new switches that support jumbo frames; quick-and-dirty testing
shows almost a 2x increase in throughput and a ~40% drop in interrupt
rates from the NICs compared to standard (1500 MTU) frames.
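
For anyone wanting to flip the same switches, both are loader tunables in
the current ZFS import (a sketch of my understanding, not gospel; the NIC
name in the jumbo-frame line is just an example, and running without the
ZIL can lose the last few seconds of writes on a crash):

    # /boot/loader.conf
    vfs.zfs.prefetch_disable=1   # turn off file-level prefetch
    vfs.zfs.zil_disable=1        # turn off the intent log (crash-unsafe)

    # once jumbo-capable switches are in (em0 is only an example NIC):
    # ifconfig em0 mtu 9000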

Pool was created with 'zpool create backup raidz da0 da1 da2 da3 da4 da5 da6 da7'

I've seen references to 8-CURRENT having a kernel memory limit of 8G
(compared to 2G before 8, from what I understand so far), and the ZFS ARC
(cache) lives in kernel memory space. (Please feel free to correct me if
I'm wrong on any of this!)
With default ZFS (nothing disabled), a 1536M kernel memory limit, and a
512M ARC limit, I saw 2085 ARC memory throttles before the box wedged.
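
For reference, those limits are set with loader tunables along these lines
(the values mirror the failing config above, so not a recommendation):

    # /boot/loader.conf
    vm.kmem_size="1536M"       # kernel memory ceiling
    vm.kmem_size_max="1536M"
    vfs.zfs.arc_max="512M"     # cap the ARC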

Using rsync from several machines with this setup, I'm getting a little
over 1GB/min to the disks.
'zpool iostat 60' is a wonderful tool.
One thing I've noticed that doesn't seem to be documented: the first
reading from 'zpool iostat' (whether a single run or with an interval) is
a running average, although I haven't pinned down the period being
averaged (from pool mount time, maybe?).
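
An easy way to see the behavior (the first line of interval mode matches
the one-shot average; only the later lines cover just the last interval):

    zpool iostat backup      # one-shot: averaged since import, I believe
    zpool iostat backup 60   # line 1 = same average, then per-60s samples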

The jumbo frame interrupt reduction may be important. I run
'netstat -i -w60' right beside 'zpool iostat 60', and the two throughputs
are closely inversely related: I can predict a disk write (ZFS writes seem
bursty) by the throughput dropping on the NIC side. The drop is up to 75%,
averaging around 50%. Using a 5-second interval instead of 60, I see disk
write throughput spike up to 90MB/s, although a pattern like
55, 0, 0, 0, 55 is more common.
Binding interrupts to particular CPUs might help a bit too. I haven't
found userspace tools to do this, and I don't feel competent to write
them.
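
That said, cpuset(1) in 8-CURRENT looks like it may do the job, though I
haven't tried it; as I read the man page, something like this would pin an
interrupt to a CPU (the IRQ number is made up, pull real ones from
'vmstat -i'):

    vmstat -i            # list interrupt sources and their rates
    cpuset -l 2 -x 256   # bind irq 256 (e.g. a NIC) to CPU 2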

CPU usage during all this is surprisingly low. rsync is running with -z,
the files are compressed with pbzip2 as they go onto the drives, and the
whole thing runs on (ducking) BackupPC, which is all Perl script.
With all that, 16 machines backing up and 1+GB/min going to the platters,
the CPU still averages 40% idle in top. I'm considering remaking the array
as raidz2; I seem to have enough CPU to handle it.
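
If I do, it will be a copy-off-and-recreate job, since (as noted below) a
raidz vdev can't be converted in place; roughly:

    # after everything has been copied off the pool
    zpool destroy backup
    zpool create backup raidz2 da0 da1 da2 da3 da4 da5 da6 da7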

Random ZFS thoughts:
You cannot shrink or grow a raidz or raidz2 vdev. You can grow a striped
pool by adding disks; I don't know whether you can shrink one successfully.
You cannot promote a striped pool to raidz/raidz2, nor demote in the other
direction.
You can have hot spares; I haven't seen a provision for warm/cold spares.
/etc/defaults/periodic.conf already has daily ZFS status/scrub checks, but
they're not enabled (a snippet to turn them on follows).
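
Turning those on is just a couple of lines in /etc/periodic.conf (which
overrides the defaults file; knob names as I understand them, and the
scrub one may only exist on newer builds):

    # /etc/periodic.conf
    daily_status_zfs_enable="YES"   # 'zpool status' in the daily mail
    daily_scrub_zfs_enable="YES"    # scrub pools on a schedule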

Anyway, enough rambling; I just thought I'd use something not too far from
your suggested system to toss some data out.

Thanks
Paul

-----Original Message-----
From: owner-freebsd-current@freebsd.org on behalf of Terry Kennedy
Sent: Fri 1/23/2009 8:30 PM
To: freebsd-current@freebsd.org
Subject: Help me select hardware and software options for very large server
 
  [I posted the following message to freebsd-questions, as I thought it
would be the most appropriate list. As it has received no replies in two
weeks, I'm trying freebsd-current.]

--------

  [I decided to ask this question here as it overlaps -hardware, -current,
and a couple other lists. I'd be glad to redirect the conversation to a
list that's a better fit, if anyone would care to suggest one.]

  I'm in the process of planning the hardware and software for the second
generation of my RAIDzilla file servers (see http://www.tmk.com/raidzilla
for the current generation, in production for 4+ years).

  I expect that what I'm planning is probably "off the scale" in terms of
processing and storage capacity, and I'd like to find out and address any
issues before spending lots of money. Here's what I'm thinking of:

o Chassis - CI Design SR316 (same model as current chassis, except for an
  i2c link between the RAID controller and the front panel)
o Motherboard - Intel S5000PSLSATAR
o CPU - 2x Intel Xeon E5450 BX80574E5450P
o Remote management - Intel Remote Management Module 2 - AXXRM2
o Memory - 16GB - 8x Kingston KVR667D2D4F5/2GI
o RAID controller - 3Ware 9650SE-16ML w/ BBU-MODULE-04
o Drives - 16x 2TB drives [not mentioning manufacturer yet]
o Cables - 4x multi-lane SATA cables
o DVD-ROM drive
o Auxiliary slot fan next to BBU card
o Adaptec AHA-39160 (for Quantum Superloader 3 tape drive)

  So much for the hardware. On the software front:

o FreeBSD 8.x?
o amd64 architecture
o MBR+UFS2 for operating system partitions (hard partition in controller)
o GPT+ZFS for data partitions
o Multiple 8TB data partitions (separate 8TB controller partitions, or one
  big partition divided with GPT? A rough sketch of the GPT route follows
  the list.)
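
  For the GPT route, I'm picturing something along these lines for each
8TB unit the controller exports (device name and label are placeholders,
and older gpart may also want explicit -b/-s values for the partition):

    gpart create -s gpt da1                # put a GPT scheme on the unit
    gpart add -t freebsd-zfs -l data0 da1  # one big freebsd-zfs partition
    zpool create data gpt/data0            # pool on the labeled partition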

  I looked at "Large data storage in FreeBSD", but that seems to be a stale
page from 2005 or so: http://www.freebsd.org/projects/bigdisk/index.html

  I'm pretty sure I need ZFS, since even with the 2TB partitions I have now,
taking snapshots for dump or doing a fsck takes approximately forever 8-)
I'll be using the hardware RAID 6 on the 3Ware controller, so I'd only be
using ZFS to get filesystems larger than 2TB.

  I've been following the ZFS discussions on -current and -stable, and I
think that while it isn't quite ready yet, it probably will be within a
few months, becoming available around the same time I get this hardware
assembled. I recall reading that there will be an import of newer ZFS
code in the near future.

  Similarly, the ports collection seems to be moving along nicely with
amd64 support.

  I think this system may have the most storage ever configured on a
FreeBSD system, and it is probably up near the top in terms of CPU and
memory. Once I have it assembled, I'd be glad to let any FreeBSD
developers test and stress it if that would help improve FreeBSD on that
type of configuration.

In the meantime, any suggestions regarding the hardware or software
configuration would be welcome.

        Terry Kennedy             http://www.tmk.com
        terry@tmk.com             New York, NY USA