Re: you are in an fs with millions of small files

From: Bakul Shah <bakul_at_BitBlocks.com>
Date: Thu, 09 Jun 2005 12:11:15 -0700
> Giorgos Keramidas <keramida_at_freebsd.org> writes:
> > Is there a better way to sort a linked list (not necessarily a
> > singly-linked list, like the one fts_link is used for).
> 
> Don't build a linked list to begin with.  The comparison function is
> known at the time the directory entries are read, so it should be a
> simple matter to read them into a red-black tree instead of a singly-
> linked list.  I'm working on a patch.

Unless the tree needs to grow dynamically, a red-black tree
won't buy you anything.  It is faster to use mergesort on an
array than a red black tree implementation.  They have the
same O n log n complexity but per item overhead is higher in
r-b perhaps due to more mallocs.  Even a skewheap will be
faster than r-b since you don't need to search, but just
return items in sort order.

Start with a small array.  When it gets full double it (or
grow by half if you want to waste less memory).

In my tests, with input file constructed from multiple copies
of /usr/share/dict/words rather than a directory, the
mergesort+array was 3 times faster than red-black for upto at
least 250K entries.  At about 10M entries and 100MB of data,
mergesort+array was still about twice as fast and used less
memory.  Using mmap the performance doubled and memory use
went down!

Benchmark program (but not with the mmap code) attached:-)

% wc z
 10000000 10000000 105733020 z
% time a.out redblack < z
15.445u 1.314s 0:16.76 99.9%    10+320297k 0+0io 0pf+0w
% time a.out array < z
9.524u 0.691s 0:10.21 100.0%    10+240289k 0+0io 0pf+0w
% time a.out -m array < z
4.632u 0.653s 0:05.28 100.0%    10+151192k 0+0io 0pf+0w

-- bakul
[source compacted to fit on one page]

#include <stdio.h>
#include <string.h>
#include <sys/tree.h>

int debug;

#define strndup(s,n) strncpy(malloc(n+1), s, n)

int compare(const void*a, const void* b) {
	return strcmp(*(char**)a, *(char**)b);
}

do_ar() {
	char* line;
	size_t len;
	int c = 0, i;
	int s = 16;
	char** a = (char**)malloc(s*sizeof(char*));
	if (debug) fprintf(stderr, "Using array+mergesort\n");
	while (line = fgetln(stdin, &len)) {
		if (c >= s) a = (char**)realloc(a, (s *= 2)*sizeof(char*));
		a[c++] = strndup(line, len-1);
	}
	mergesort(a, c, sizeof(char*), compare);
	if (debug) for (i = 0; i < c; i++) printf("%s\n", a[i]);
}

typedef struct node { char *name; RB_ENTRY(node) links; } Node;

int cmp(Node* a, Node* b) { return strcmp(a->name, b->name); }

RB_HEAD(rb_tree, node);
RB_GENERATE(rb_tree, node, links, cmp);

do_rb() {
	char* line;
	size_t len;
	struct rb_tree rb;
	Node* n;
	if (debug) fprintf(stderr, "Using red-back tree\n");
	RB_INIT(&rb);
	while (line = fgetln(stdin, &len)) {
		struct node* node = malloc(sizeof *node);
		node->name = strndup(line, len-1);
		RB_INSERT(rb_tree, &rb, node);
	}
	if (debug) RB_FOREACH(n, rb_tree, &rb) printf("%s\n", n->name);
}

int
main(int c, char** v) {
	if (c > 1 && strcmp(v[1], "-d") == 0) { debug = 1; c--; v++; }
	if (c > 1 && v[1][0] == 'r') do_rb(); else do_ar();
}
Received on Thu Jun 09 2005 - 17:11:19 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:38:36 UTC