pty4/radixsort.c at master · drudru/pty4 · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
/* radixsort.c, radixsort.h: linear-per-byte in-memory string sort
Daniel J. Bernstein, brnstnd@nyu.edu; Keith Bostic, bostic@ucbvax.berkeley.edu.
No dependencies.
Requires malloc, free, bcopy, bzero. Can use -DUCHAR_MAX.
10/5/91 DJB: Made c static, added ctr.
7/23/91 DJB: Baseline. radixsort/DJB 3.0. See bottom for BSD copyright.
No known patent problems.

Documentation in radixsort.3, portions based on BSD documentation.

History: I discovered adaptive distribution sort in 1987. (I haven't
published a paper on it---it's too trivial for that---though I did
mention it in a letter to Knuth.) It grew out of a suggestion quoted in
Knuth's section on quicksort et al., volume 3, page 128: ``John McCarthy
has suggested setting K \approx (u + v)/2, if all keys are known to lie
between u and v.'' What's the biggest problem with MSD radix sort? You
can waste lots of time considering ranges of keys that don't even exist!
In distribution sort, however, you usually start by finding the minimum
and maximum keys. Here McCarthy's suggestion pops in. To sort a pile of
floating-point numbers, find their minimum and maximum, divide the
interval evenly into n subintervals, and split the numbers into piles by
interval just as in MSD radix sort. Just as in quicksort, stop splitting
when a pile can be sorted efficiently by a small-scale method. That's
adaptive distribution sort, and it turns out to be hellishly fast for
sorting floating-point data. The credit really belongs with McCarthy---I
only generalized from 2 to n.

Adaptive distribution sort can be applied to strings, too: find the
first character where two strings in the pile differ, and distribute on
that character. There's no fine line between adaptive distribution sort
and MSD radix sort, but in any case you get a big speed boost from
sorting small piles by a small-scale method. See especially Knuth,
exercise 5.2.5-10.

Computer scientists should note that this method is linear in the number
of bytes being sorted. Sometime in 1989, as I recall, I saw a notice
that someone had discovered an o(n log n) method of sorting n integers.
The method depended on all sorts of weid bit manipulations and was
utterly impractical. As the integers had to be short anyway, MSD radix
sort worked in O(n) time. My guess is that most computer scientists
don't learn about MSD radix sort (and hence don't know that sorting is
linear-per-byte) because it's widely seen as having too big a constant
factor to be practical. This radixsort() is a constructive proof that
the opposite is true.

I ended up sending this code to Berkeley. Keith Bostic cleaned it up,
fixed a few bugs, added a shell sort for the small case, and did several
helpful optimizations. radixsort() will be part of BSD 4.4. I took back
his version and modified it to what you see below. Among other things, I
ported it from ANSI C back to the rest of the world, cleaned up some of
the comments, added a proof that part of the method actually works,
added the radixsort5(), radixsort7(), and radixsort3() variants,
restored the original indentation, fixed an overly conservative estimate
of the necessary stack size, and plugged a memory leak.
*/

#define blob unsigned char /* technically, shouldn't be typedefed */

/* KB says:
 * __rspartition is the cutoff point for a further partitioning instead
 * of a shellsort.  If it changes check __rsshell_increments.  Both of
 * these are exported, as the best values are data dependent.
 */
#ifndef NPARTITION
#define	NPARTITION 40
#endif
int __rspartition = NPARTITION;
int __rsshell_increments[] = { 4, 1, 0, 0, 0, 0, 0, 0 };

/* KB says:
 * Shellsort (diminishing increment sort) from Data Structures and
 * Algorithms, Aho, Hopcraft and Ullman, 1983 Edition, page 290;
 * see also Knuth Vol. 3, page 84. The increments are selected from
 * formula (8), page 95. Roughly O(N^3/2).
 */
static void shellsort(p,index,n,tr)
register blob **p;
register blob *tr;
register int index;
register int n;
{
 register blob ch;
 register blob *s1;
 register blob *s2;
 register int incr;
 register int *incrp;
 register int t1;
 register int t2;

 incrp = __rsshell_increments;
 while (incr = *incrp++)
   for (t1 = incr;t1 < n;++t1)
     for (t2 = t1 - incr;t2 >= 0;)
      {
       s1 = p[t2] + index;
       s2 = p[t2 + incr] + index;
       while ((ch = tr[*s1++]) == tr[*s2] && ch)
         ++s2;
       if (ch > tr[*s2])
	{
         s1 = p[t2];
         p[t2] = p[t2 + incr];
         p[t2 + incr] = s1;
         t2 -= incr;
	}
       else
         break;
      }
}

/* KB says:
 * Stackp points to context structures, where each structure schedules a
 * partitioning.  Radixsort exits when the stack is empty.
 *
 * If the buckets are placed on the stack randomly, the worst case is when
 * all the buckets but one contain (npartitions + 1) elements and the bucket
 * pushed on the stack last contains the rest of the elements.  In this case,
 * stack growth is bounded by:
 *
 *	limit = (nelements / (npartitions + 1)) - 1;
 *
 * This is a very large number, 52,377,648 for the maximum 32-bit signed int.
 *
 * By forcing the largest bucket to be pushed on the stack first, the worst
 * case is when all but two buckets each contain (npartitions + 1) elements,
 * with the remaining elements split equally between the first and last
 * buckets pushed on the stack.  In this case, stack growth is bounded when:
 *
 *	for (partition_cnt = 0; nelements > npartitions; ++partition_cnt)
 *		nelements =
 *		    (nelements - (npartitions + 1) * (nbuckets - 2)) / 2;
 *
 * The bound is:
 *
 *	limit = partition_cnt * (nbuckets - 1);
 *
 * This is a much smaller number, 4590 for the maximum 32-bit signed int.

Note inserted by DJB: About time I proved this... Fix the number of
buckets, b. Any given pile of n elements is split into m stack piles and
b - m small-scale piles which immediately disappear. We ignore the case
where the pile is split into only one pile of its original size---any
pile will be split into smaller piles eventually. Say the stack is left
with piles of sizes p_1 ... p_m, each at least P + 1, none equal to n,
while x elements, for some x from 0 to m'P, are in small-scale piles.
(Here P is the cutoff.) We must have p_1 + ... + p_m + x = n. Depending
on the other (p,x) constraints chosen, we define s(n) as the maximum
stack size for n elements. As the subpile distributions are independent,
clearly

  s(n) = max_m max_(p,x) max {s(p_1),s(p_2) + 1,...,s(p_m) + m - 1}

over all valid m and (p,x). In particular, if m > 0 then we must have
n > p_1 >= P + 1, so if n <= P + 1 then the maximum is (vacuously) 0. So
s(n) is monotone increasing for n <= P + 1. An easy induction shows that
s(n) is in fact 0 for all n < 2P + 2.

Clearly s(n) is monotone: for n >= P + 2 we choose m = 1, p_1 = n - 1,
and x = 0, and we see s(n) >= s(p_1) = s(n - 1). For m = 0, the maximum
always equals 0, and for m = 1, the maximum always equals a previous
s(p_1), so we have

  s(n) = max { max_{k<n} s(k) , max_{m>=2} max_(p,x) max s(p_j) + j - 1 }.

For convenience we define t(n) = max_{m>=2} max_(p,x) max s(p_j) + j - 1.

Fix n. Fix m >= 2. Consider a (p,x) achieving the maximum value of
max s(p_j) + j - 1. Since p_1 >= P + 1, we have p_2 <= n - P - 1. If x
isn't 0, we can move an element from one of the small-scale piles to
stack pile p_2, under either of the constraints in question. This
increases p_2---hence does not decrease s(p_2)---without affecting the
other s(p_j). Hence there is a (p,x) with smaller x also achieving the
maximum.

So consider a (p,0) achieving the maximum, and say max s(p_j) + j - 1 is
achieved at j = i. If we exchange p_i and p_j while meeting the
constraints, we must not be at a higher maximum; in particular,
s(p_i) + j - 1 <= s(p_i) + i - 1, so j <= i.

Restrict attention the the case without constraints, NC. The choice of j
is unrestricted, so in particular m <= i and hence i = m. Thus t(n)
equals max_{m>=2} s(p_m) + m - 1. Since all p_j are at least P + 1, we
have

  p_m + (P + 1)(m - 1) <= p_m + ... + p_1 = n
  p_m <= n - (P + 1)(m - 1)
  s(p_m) <= s(n - (P + 1)(m - 1))
  t(n) <= max_{m>=2} s(n - (P + 1)(m - 1)) + m - 1    (NC),

and it is easy to see that for n >= (P + 1)m, we can choose p_m as
n - (P + 1)(m - 1) >= P + 1 and all other p_j = P + 1, so that the bound
is achieved:

  t(n) = max_{m>=2} s(n - (P + 1)(m - 1)) + m - 1  for  n/(P + 1) >= m.  (NC)

For 2 <= n/(P + 1) < b, we can choose m anywhere between 2 and
f = floor(n/(P + 1)) inclusive. Now

  s(n) = max { max s(k), max_{2<=m<f} s(n - (P + 1)(m - 1)) + m - 1 } (NC)

We claim inductively that s(n) = f - 1 for any n with floor(n/(P + 1)) =
f. This is true for f = 1. For larger f, s(n - (P + 1)(m - 1)) =
floor((n - (P + 1)(m - 1))/(P + 1)) - 1 = f - (m - 1) - 1 = f - m, so
that

  s(n) = max { max s(k), f - 1 }   (NC)

If n = (P + 1)f, then by inductive hypothesis s(k) <= f - 2, so that
s(n) = f - 1 as desired. Otherwise, by (sub)induction on n, s(k) is
still bounded by f - 1, so that s(n) = f - 1 always. Therefore:

  s(n) = floor(n/(P + 1)) - 1,  n >= P + 1.     (NC)

Now consider the push-largest-pile-first constraint, FC. This requires
that p_1 >= p_j for all j. Hence we cannot swap p_1 with p_i. However,
if i is not 1 then we can swap p_j with p_i for all j > 1, hence j <= i
for all j between 2 and m, hence i is m. Thus the maximum is achieved
either at i = 1 or at i = m, and we have

  t(n) = max_{m>=2} max_(p,0) max { s(p_m) + m - 1, s(p_1) }.   (FC)

Take a p vector achieving the maximum of max { s(p_m) + m - 1, s(p_1) },
with m fixed. Suppose some p_j for j between 2 and m - 1 inclusive is
larger than P + 1. (This is vacuous for m = 2.) Then we can subtract one
from p_j and add one to p_1, preserving the constraint and not
decreasing the maximum. Hence the maximum is achieved somewhere with all
p_j = P + 1 for 2 <= j < m, i.e.,

  p_1 + p_m + (P + 1)(m - 2) = n.   (FC)

Furthermore, p_1 >= p_m. Hence 2p_1 >= n - (P + 1)(m - 2), and p_1 can
range from ceiling((n - (P + 1)(m - 2))/2) up to n - (P + 1). Similarly,
2p_m <= n - (P + 1)(m - 2), so p_m can range from P + 1 up to
floor((n - (P + 1)(m - 2))/2). The global maximum of these quantities
simultaneously equals the global maximum of them individually, so if all
bounds can be achieved then

  t(n) = max_{m>=2} max { s(n - (P + 1)),
			  s(floor((n - (P + 1)(m - 2))/2)) + m - 1 }.  (FC)

This can be achieved if n - (P + 1) >= ceiling((n - (P + 1)(m - 2))/2)
and, equivalently, P + 1 <= floor((n - (P + 1)(m - 2))/2), since in that
case we can choose both p_1 and p_m as stated. These reduce after some
simple manipulation to n >= (P + 1)m, i.e., m <= floor(n/(P + 1)). For
other m it is not possible to choose any valid p_i (exercise).

We'd like to show inductively that for all n >= 2P + 2 we have

  s(n) = u(n) with
  u(n) = max_{m>=2} s(floor((n - (P + 1)(m - 2))/2)) + m - 1.   (FC,*)

To do this we need only show that the other terms of the maximum do not
``get in the way,'' i.e., that u(n) >= s(n - (P + 1)) and u(n) >= s(k)
for k < n. The second half is easy: s(k) = u(k), which is at most u(n)
by monotonicity of s. The first half also follows from the induction:

  s(n) = max_m s(floor((n - (P + 1)(m - 2))/2)) + m - 1
  s(n - (P + 1)) = u(n - (P + 1))
   = max_{m>=2} s(floor((n - (P + 1) - (P + 1)(m - 2))/2)) + m - 1
   <= max_{m>=2} s(floor((n - (P + 1)(m - 2))/2)) + m - 1
   = u(n)                    (FC)

again by the monotonicity of s. This proves (FC,*). For small n, i.e.,
floor(n/(P + 1)) = f with 2 <= f <= b, we can choose m = f, so s(n) is
at least s(floor((n - (P + 1)(f - 2))/2)) + f - 1. The floor term is at
most P + 1, so s(n) >= f - 1. Furthermore, s in the constrained case is
at most s in the unconstrained case, so s(n) = f - 1. For larger n, by
similar logic, the maximum is attained at m = b, so we finally have

  s(n) = floor(n/(P + 1)) - 1  for n < (b + 1)(P + 1)
  s(n) = s(floor((n - (P + 1)(b - 2))/2)) + b - 1  otherwise    (FC)

As in the first case s(n) is always bounded by b - 1, we can calculate
a bound on s(n) by repeatedly setting n to floor(n - (P + 1)(b - 2))/2)
until it is under 2P + 2 (so that s(n) = 0), counting the number of
iterations necessary, and multiplying by b - 1. And that's what we
wanted to prove.
*/

#ifndef UCHAR_MAX /* XXX: we aren't even giving a chance for a definition! */
#define UCHAR_MAX 256
#endif
#define	NBUCKETS (UCHAR_MAX + 1)

typedef struct
 {
  blob **bot;
  int index;
  int n;
 }
context;

#define	STACKPUSH { \
  stackp->bot = p; \
  stackp->n = n; \
  stackp->index = index; \
  ++stackp; \
}

#define	STACKPOP { \
  if (stackp == stack) \
    break; \
  --stackp; \
  bot = stackp->bot; \
  n = stackp->n; \
  index = stackp->index; \
}

/* KB says:
 * This uses a simple sort as soon as a bucket crosses a cutoff point,
 * rather than sorting the entire list after partitioning is finished.
 * This should be an advantage.

Note from DJB: The original comment read ``There is no strong evidence
that this is an advantage.'' Depressing. Here's what I wrote in response:

   Of course it's an advantage: it has to be, I coded it that way. :-)
   Seriously, I just coded the sort that way since I was following Knuth's
   description of MSD to the hilt. As you can imagine, though, doing the
   sort this way saves just a bit of paging of the index array. It also
   means that the simple sort doesn't have to worry about crossing past
   already-determined boundaries---for an average 2x gain. Trust me, it's
   an advantage.
*/

int radixsort7(l1,n,endchar,tab,indexstart,ralloc,rfree)
blob **l1;
register int n;
unsigned int endchar; /* could use blob, but chars are unsafe with prototypes */
blob *tab;
int indexstart;
char *(*ralloc)();
void (*rfree)();
{
 register int i;
 register int index;
 register int t1;
 register int t2;
 register blob **l2;
 register blob **p;
 register blob **bot;
 register blob *tr;
 context *stack;
 context *stackp;
 static int c[NBUCKETS + 1];
 static int *ctr[NBUCKETS + 1];
 int max;
 blob ltab[NBUCKETS]; /* local (default) table */

 if (n <= 1)
   return 0;

 /* Allocate the stack. */
 t1 = (__rspartition + 1) * (NBUCKETS - 2);
 for (i = 0,t2 = n;t2 > __rspartition;i += NBUCKETS - 1)
   t2 = (t2 - t1)/2; /* could go negative! but that's okay */
 if (!i)
   stack = stackp = 0;
 else
   if (!(stack = stackp = (context *)ralloc(i * sizeof(context))))
     return -1;

 /* KB says:
  * There are two arrays, one provided by the user (l1), and the
  * temporary one (l2). The data is sorted to the temporary stack,
  * and then copied back. The speedup of using index to determine
  * which stack the data is on and simply swapping stacks back and
  * forth, thus avoiding the copy every iteration, turns out to not
  * be any faster than the current implementation.
  */
 if (!(l2 = (blob **)ralloc(n * sizeof(blob *))))
  {
   rfree((char *)stackp);
   return -1;
  }

 /* KB says:
  * tr references a table of sort weights; multiple entries may
  * map to the same weight; EOS char must have the lowest weight.
  */
 if (tab)
   tr = tab;
 else
  {
   t2 = endchar;
   for (t1 = 0;t1 < t2;++t1)
     ltab[t1] = t1 + 1;
   ltab[t2] = 0;
   for (t1 = t2 + 1;t1 < NBUCKETS;++t1)
     ltab[t1] = t1;
   tr = ltab;
  }

 for (t1 = 0;t1 < NBUCKETS;++t1)
   ctr[t1] = c + tr[t1];

 /* First sort is entire pile. */
 bot = l1;
 index = indexstart;

 for (;;)
  {
   /* Clear the bucket count array. XXX: This isn't portable to */
   /* machines where the byte representation of int 0 isn't all */
   /* zeros. :-) */
   bzero((char *)c,sizeof(c));

   /* Compute number of items that sort to the same bucket for this index. */
   p = bot;
   i = n;
   while (--i >= 0)
     ++*ctr[(*p++)[index]];

   /* KB says:
    * Sum the number of characters into c, dividing the temp stack
    * into the right number of buckets for this bucket, this index.
    * c contains the cumulative total of keys before and included in
    * this bucket, and will later be used as an index to the bucket.
    * c[NBUCKETS] contains the total number of elements, for determining
    * how many elements the last bucket contains. At the same time, we
    * find the largest bucket so it gets pushed first.
    */
   t2 = __rspartition;
   max = t1 = 0;
   for (i = 0;i <= NBUCKETS;++i)
    {
     if (c[i] > t2)
      {
       t2 = c[i];
       max = i;
      }
     t1 = c[i] += t1;
    }

   /* Partition the elements into buckets; c decrements through the */
   /* bucket, and ends up pointing to the first element of the bucket. */
   i = n;
   while (--i >= 0)
    {
     --p;
     l2[--*ctr[(*p)[index]]] = *p;
    }

   /* Copy the partitioned elements back to the user stack. */
   bcopy(l2,bot,n * sizeof(blob *));

   ++index;
   /* KB says:
    * Sort buckets as necessary; don't sort c[0], it's the
    * EOS character bucket, and nothing can follow EOS.
    */
   for (i = max;i;--i)
    {
     if ((n = c[i + 1] - (t1 = c[i])) < 2)
       continue;
     p = bot + t1;
     if (n > __rspartition)
       STACKPUSH
     else
       shellsort(p,index,n,tr);
    }
   for (i = max + 1;i < NBUCKETS;++i)
    {
     if ((n = c[i + 1] - (t1 = c[i])) < 2)
       continue;
     p = bot + t1;
     if (n > __rspartition)
       STACKPUSH
     else
       shellsort(p,index,n,tr);
    }
   /* Break out when stack is empty */
   STACKPOP
  }

 rfree((char *)l2);
 rfree((char *)stack);
 return 0;
}

extern char *malloc();
extern void free();

int radixsort5(l1,n,endchar,tab,indexstart)
blob **l1; register int n; unsigned int endchar; blob *tab; int indexstart;
{
 return radixsort7(l1,n,endchar,tab,indexstart,malloc,free);
}

int radixsort4(l1,n,endchar,tab)
blob **l1; register int n; unsigned int endchar; blob *tab;
{
 return radixsort7(l1,n,endchar,tab,0,malloc,free);
}

int radixsort(l1,n,tab,endchar)
blob **l1; register int n; blob *tab; unsigned int endchar;
{
 return radixsort7(l1,n,endchar,tab,0,malloc,free);
}

int radixsort3(l1,n,endchar)
blob **l1; register int n; unsigned int endchar;
{
 return radixsort7(l1,n,endchar,(blob *) 0,0,malloc,free);
}

/*- BSD copyright says:
 * Copyright (c) 1990 The Regents of the University of California.
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 3. All advertising materials mentioning features or use of this software
 *    must display the following acknowledgement:
 *    This product includes software developed by the University of
 *    California, Berkeley and its contributors.
 * 4. Neither the name of the University nor the names of its contributors
 *    may be used to endorse or promote products derived from this software
 *    without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 */

/* BSD sccs says:
static char sccsid[] = "@(#)radixsort.c  5.7 (Berkeley) 2/23/91";

but this is (heavily) modified.
*/