Discussion:
[PATCH] unify itoa
(too old to reply)
Denis Vlasenko
2006-07-08 09:44:39 UTC
Permalink
Hi folks, Rob,

Attached patch removes all instances of local itoa()-like
functions, introduces common one into libbb and uses
it where appropriate.

The itoa which lives in ash is not converted:

/* Our own itoa(). */
static int
cvtnum(arith_t num)
{
int len;
expdest = makestrspace(32, expdest);
#ifdef CONFIG_ASH_MATH_SUPPORT_64
len = fmtstr(expdest, 32, "%lld", (long long) num);
#else
len = fmtstr(expdest, 32, "%ld", num);
#endif
STADJUST(len, expdest);
return len;
}

Where fmtstr is:

int
fmtstr(char *outbuf, size_t length, const char *fmt, ...)
{
va_list ap;
int ret;
va_start(ap, fmt);
INTOFF;
ret = vsnprintf(outbuf, length, fmt, ap);
va_end(ap);
INTON;
return ret;
}

If you want it converted too, I'll do it. itoa.c
already has code which can handle unsigned long long conv,
it is #ifdef'ed out for now. See the patch.
--
vda
-------------- next part --------------
A non-text attachment was scrubbed...
Name: busybox_itoa.patch
Type: text/x-diff
Size: 8832 bytes
Desc: not available
Url : http://busybox.net/lists/busybox/attachments/20060708/b7e27d7b/busybox_itoa-0001.bin
Rob Landley
2006-07-08 10:45:46 UTC
Permalink
Post by Denis Vlasenko
Hi folks, Rob,
Attached patch removes all instances of local itoa()-like
functions, introduces common one into libbb and uses
it where appropriate.
Is "sprintf" really that bad? (Or xasprintf() out of our library when you
need it to malloc?) Especially if the darn compiler actually starts doing
intelligent string merging, which it's inching towards...

You have a large quantity of code under #ifdef NOT_NEEDED_YET that by itself
would make me reject this patch. (What's max_unsigned_power10() doing,
anyway, and is that really a sane way to do whatever it is?)
Post by Denis Vlasenko
If you want it converted too, I'll do it. itoa.c
already has code which can handle unsigned long long conv,
it is #ifdef'ed out for now. See the patch.
It can handle int, and long long, but not long. On 64 bit platforms (we
already support x86-64, and this sort of thing will only get more common to
the point it may be a noticeable chunk of the embedded space in 3-5 years),
these are three different sizes.

Rob
--
Never bet against the cheap plastic solution.
Denis Vlasenko
2006-07-08 11:42:16 UTC
Permalink
Post by Rob Landley
Post by Denis Vlasenko
Hi folks, Rob,
Attached patch removes all instances of local itoa()-like
functions, introduces common one into libbb and uses
it where appropriate.
Is "sprintf" really that bad? (Or xasprintf() out of our library when you
need it to malloc?) Especially if the darn compiler actually starts doing
intelligent string merging, which it's inching towards...
I think that some programs may print so much numerical output
that it may be important for them to do itoa more-or-less
efficiently.

sprintf is a performance disaster. In any libc I've looked into,
it _creates a custom struct FILE_, and then calls v_something_printf()
on that, which _parses format string_ with gazzilion of possible
obscure format specifiers, and _then_ it does itoa thing.
Post by Rob Landley
You have a large quantity of code under #ifdef NOT_NEEDED_YET that by itself
would make me reject this patch.
I wrote that just in case you'll want me to convert ash too.
See below.
Post by Rob Landley
(What's max_unsigned_power10() doing,
anyway, and is that really a sane way to do whatever it is?)
It calculates maximum power-of-10 which fits in unsigned int.
It does that at compile time, portably. That's why it looks funny.

gcc optimizes it out to simple 1000000000 constant
(if sizeof(unsigned)==4, that is).
Post by Rob Landley
Post by Denis Vlasenko
If you want it converted too, I'll do it. itoa.c
already has code which can handle unsigned long long conv,
it is #ifdef'ed out for now. See the patch.
It can handle int, and long long, but not long. On 64 bit platforms (we
already support x86-64, and this sort of thing will only get more common to
the point it may be a noticeable chunk of the embedded space in 3-5 years),
these are three different sizes.
long long currently is needed for ash only, everybody else wants
itoa([unsigned] int). There are no ltoa users today.
ltoa is of course a piece of cake, can do when someone will need it.

So, do you want me to

a) remove #ifdef NOT_NEEDED_YET part
or
b) enable it and use it in ash?
--
vda
Michael S. Zick
2006-07-08 12:40:23 UTC
Permalink
Post by Denis Vlasenko
Post by Rob Landley
(What's max_unsigned_power10() doing,
anyway, and is that really a sane way to do whatever it is?)
It calculates maximum power-of-10 which fits in unsigned int.
It does that at compile time, portably. That's why it looks funny.
gcc optimizes it out to simple 1000000000 constant
(if sizeof(unsigned)==4, that is).
Unless you are supporting multiple brands of compilers, it
optimizes out to a simple constant before you write the code.

Mike
Denis Vlasenko
2006-07-08 12:59:24 UTC
Permalink
Post by Michael S. Zick
Post by Denis Vlasenko
It calculates maximum power-of-10 which fits in unsigned int.
It does that at compile time, portably. That's why it looks funny.
gcc optimizes it out to simple 1000000000 constant
(if sizeof(unsigned)==4, that is).
Unless you are supporting multiple brands of compilers, it
optimizes out to a simple constant before you write the code.
The whole world is not a x86[-64].

Please show me the simple constant which is a biggest
power of 10 which fits into variable of type "int" on
platforms with 16-bit, 32-bit, 64-bit and 128-bit ints?
--
vda
Michael S. Zick
2006-07-08 13:45:43 UTC
Permalink
Post by Denis Vlasenko
Post by Michael S. Zick
Post by Denis Vlasenko
It calculates maximum power-of-10 which fits in unsigned int.
It does that at compile time, portably. That's why it looks funny.
gcc optimizes it out to simple 1000000000 constant
(if sizeof(unsigned)==4, that is).
Unless you are supporting multiple brands of compilers, it
optimizes out to a simple constant before you write the code.
The whole world is not a x86[-64].
Please show me the simple constant which is a biggest
power of 10 which fits into variable of type "int" on
platforms with 16-bit, 32-bit, 64-bit and 128-bit ints?
I believe you have in mind 'long int' which is usually the
native register size.

The compiler chosen sets the size of 'int'

What compiler are you using? Turbo C? MixC?
Post by Denis Vlasenko
--
vda
Mike
Rich Felker
2006-07-08 21:59:00 UTC
Permalink
Post by Denis Vlasenko
Post by Michael S. Zick
Post by Denis Vlasenko
It calculates maximum power-of-10 which fits in unsigned int.
It does that at compile time, portably. That's why it looks funny.
gcc optimizes it out to simple 1000000000 constant
(if sizeof(unsigned)==4, that is).
Unless you are supporting multiple brands of compilers, it
optimizes out to a simple constant before you write the code.
The whole world is not a x86[-64].
Please show me the simple constant which is a biggest
power of 10 which fits into variable of type "int" on
platforms with 16-bit, 32-bit, 64-bit and 128-bit ints?
There are no such things at present, only 32-bit ints.

Rich
Michael S. Zick
2006-07-08 20:45:35 UTC
Permalink
Post by Denis Vlasenko
Post by Michael S. Zick
Post by Denis Vlasenko
It calculates maximum power-of-10 which fits in unsigned int.
It does that at compile time, portably. That's why it looks funny.
gcc optimizes it out to simple 1000000000 constant
(if sizeof(unsigned)==4, that is).
Unless you are supporting multiple brands of compilers, it
optimizes out to a simple constant before you write the code.
The whole world is not a x86[-64].
Please show me the simple constant which is a biggest
power of 10 which fits into variable of type "int" on
platforms with 16-bit, 32-bit, 64-bit and 128-bit ints?
I believe you have in mind 'long int' which is usually the
native register size.

The compiler chosen sets the size of 'int'

What compiler are you using? Turbo C? MixC?
Post by Denis Vlasenko
--
vda
Mike
Rich Felker
2006-07-09 05:12:03 UTC
Permalink
Post by Denis Vlasenko
Post by Michael S. Zick
Post by Denis Vlasenko
It calculates maximum power-of-10 which fits in unsigned int.
It does that at compile time, portably. That's why it looks funny.
gcc optimizes it out to simple 1000000000 constant
(if sizeof(unsigned)==4, that is).
Unless you are supporting multiple brands of compilers, it
optimizes out to a simple constant before you write the code.
The whole world is not a x86[-64].
Please show me the simple constant which is a biggest
power of 10 which fits into variable of type "int" on
platforms with 16-bit, 32-bit, 64-bit and 128-bit ints?
There are no such things at present, only 32-bit ints.

Rich
Denis Vlasenko
2006-07-08 19:59:03 UTC
Permalink
Post by Michael S. Zick
Post by Denis Vlasenko
It calculates maximum power-of-10 which fits in unsigned int.
It does that at compile time, portably. That's why it looks funny.
gcc optimizes it out to simple 1000000000 constant
(if sizeof(unsigned)==4, that is).
Unless you are supporting multiple brands of compilers, it
optimizes out to a simple constant before you write the code.
The whole world is not a x86[-64].

Please show me the simple constant which is a biggest
power of 10 which fits into variable of type "int" on
platforms with 16-bit, 32-bit, 64-bit and 128-bit ints?
--
vda
Rich Felker
2006-07-08 21:58:01 UTC
Permalink
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
Hi folks, Rob,
Attached patch removes all instances of local itoa()-like
functions, introduces common one into libbb and uses
it where appropriate.
Is "sprintf" really that bad? (Or xasprintf() out of our library when you
need it to malloc?) Especially if the darn compiler actually starts doing
intelligent string merging, which it's inching towards...
I think that some programs may print so much numerical output
that it may be important for them to do itoa more-or-less
efficiently.
Perhaps, but BB is always about optimizing for size, not speed, unless
a very large speed gain can be achieved with a relatively small amount
of code..
Post by Denis Vlasenko
sprintf is a performance disaster. In any libc I've looked into,
it _creates a custom struct FILE_, and then calls v_something_printf()
on that, which _parses format string_ with gazzilion of possible
obscure format specifiers, and _then_ it does itoa thing.
Then your libc sucks. Get a good one. BB does not write huge bloated
replacements for the libc functions; it uses them as-is, and thus will
perform well on a well-implemented system and poorly on a
poorly-implemented system.
Post by Denis Vlasenko
Post by Rob Landley
(What's max_unsigned_power10() doing,
anyway, and is that really a sane way to do whatever it is?)
It calculates maximum power-of-10 which fits in unsigned int.
It does that at compile time, portably. That's why it looks funny.
LOL.. *crying* why does itoa need to know such a thing?!

Rich
Rob Landley
2006-07-09 08:47:03 UTC
Permalink
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
Hi folks, Rob,
Attached patch removes all instances of local itoa()-like
functions, introduces common one into libbb and uses
it where appropriate.
Is "sprintf" really that bad? (Or xasprintf() out of our library when
you need it to malloc?) Especially if the darn compiler actually starts
doing intelligent string merging, which it's inching towards...
I think that some programs may print so much numerical output
that it may be important for them to do itoa more-or-less
efficiently.
Could you give me an example? (Your patch didn't actually convert many users,
that I saw.)
Post by Denis Vlasenko
sprintf is a performance disaster. In any libc I've looked into,
it _creates a custom struct FILE_, and then calls v_something_printf()
on that, which _parses format string_ with gazzilion of possible
obscure format specifiers, and _then_ it does itoa thing.
That would be a yes, then. :)
Post by Denis Vlasenko
Post by Rob Landley
You have a large quantity of code under #ifdef NOT_NEEDED_YET that by
itself would make me reject this patch.
I wrote that just in case you'll want me to convert ash too.
See below.
I'm far more interested in replacing ash than fixing it, and am working
towards that end here (albeit by clearing a couple other low hanging fruit
off my todo list first so I can focus on bbsh with fewer interruptions).
Post by Denis Vlasenko
Post by Rob Landley
(What's max_unsigned_power10() doing,
anyway, and is that really a sane way to do whatever it is?)
It calculates maximum power-of-10 which fits in unsigned int.
It does that at compile time, portably. That's why it looks funny.
gcc optimizes it out to simple 1000000000 constant
(if sizeof(unsigned)==4, that is).
When is sizeof(unsigned) not going to be 4 on any Linux platform? We use the
lp64 model (as do all modern Unix variants, including MacOS X), where only
long and pointer are 64 bit.

http://www.unix.org/whitepapers/64bit.html

This even works for Windows, which uses llp64 because they decided to make 64
bit programming really inconvenient in the name of backwards compatability,
and still managed to to screw it up:

http://blogs.msdn.com/oldnewthing/archive/2005/01/31/363790.aspx

Not that I care about that part. :)
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
If you want it converted too, I'll do it. itoa.c
already has code which can handle unsigned long long conv,
it is #ifdef'ed out for now. See the patch.
It can handle int, and long long, but not long. On 64 bit platforms (we
already support x86-64, and this sort of thing will only get more common
to the point it may be a noticeable chunk of the embedded space in 3-5
years), these are three different sizes.
long long currently is needed for ash only, everybody else wants
itoa([unsigned] int). There are no ltoa users today.
ltoa is of course a piece of cake, can do when someone will need it.
So, do you want me to
a) remove #ifdef NOT_NEEDED_YET part
or
b) enable it and use it in ash?
The first, please.

A general rule of thumb: we can always add code to the tree later. (And if we
remove it, we can always add it _back_ later.)

Thanks,

Rob
--
Never bet against the cheap plastic solution.
Denis Vlasenko
2006-07-09 09:53:42 UTC
Permalink
Post by Rob Landley
Post by Denis Vlasenko
I think that some programs may print so much numerical output
that it may be important for them to do itoa more-or-less
efficiently.
Could you give me an example? (Your patch didn't actually convert many users,
that I saw.)
Examples of things which should care about it
(I do not claim that current code cares):
od, seq, uniq -c, tcpdump, hexdump...
Post by Rob Landley
Post by Denis Vlasenko
gcc optimizes it out to simple 1000000000 constant
(if sizeof(unsigned)==4, that is).
When is sizeof(unsigned) not going to be 4 on any Linux platform? We use the
lp64 model (as do all modern Unix variants, including MacOS X), where only
long and pointer are 64 bit.
http://www.unix.org/whitepapers/64bit.html
What about Alpha? I heard that it has (had?) sizeof(int)==8.

BTW, I could do this instead:

static inline unsigned max_unsigned_power10(void)
{
if (sizeof(unsigned) == 4) return 1000000000;
if (sizeof(unsigned) == 8) // gcc warns a lot here, shut it up
return (unsigned)(10ULL*1000000000*1000000000);
return BUG_unsupported_architecture();
}

but since you don't want to handle long long, it's irrelevant for now.
Post by Rob Landley
Post by Denis Vlasenko
So, do you want me to
a) remove #ifdef NOT_NEEDED_YET part
or
b) enable it and use it in ash?
The first, please.
A general rule of thumb: we can always add code to the tree later. (And if we
remove it, we can always add it _back_ later.)
See attached patch.
--
vda
-------------- next part --------------
A non-text attachment was scrubbed...
Name: busybox_itoa_v3.patch
Type: text/x-diff
Size: 5808 bytes
Desc: not available
Url : http://busybox.net/lists/busybox/attachments/20060709/00d19121/busybox_itoa_v3.bin
Rich Felker
2006-07-09 16:59:58 UTC
Permalink
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
I think that some programs may print so much numerical output
that it may be important for them to do itoa more-or-less
efficiently.
Could you give me an example? (Your patch didn't actually convert many users,
that I saw.)
Examples of things which should care about it
od, seq, uniq -c, tcpdump, hexdump...
Of these od/hexdump are the only ones that sound very interesting to
optimize for performance, and the 'itoa' code in question is
decimal-only anyway making them irrelevant.

Anyway if snprintf is too slow then what needs to be fixed is
snprintf, not the applications that use it. The C library exists for a
reason. As soon as you decide you can't rely on it to do what it's
supposed to do and do it at an acceptable level of efficiency, you end
up with the bloat that is Mozilla (NSPR), GNOME (glib), or coreutils
(gnulib, i.e. half of glibc copied into the coreutils source tree...).
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
gcc optimizes it out to simple 1000000000 constant
(if sizeof(unsigned)==4, that is).
When is sizeof(unsigned) not going to be 4 on any Linux platform? We use the
lp64 model (as do all modern Unix variants, including MacOS X), where only
long and pointer are 64 bit.
http://www.unix.org/whitepapers/64bit.html
What about Alpha? I heard that it has (had?) sizeof(int)==8.
gcc does not support sizeof(int)!=4, so no, it does not have 8. You're
thinking of long. In fact any C compiler with sizeof(int)!=4 is
virtually unusable for multimedia apps because it means you lack at
least one of the sizes 8, 16, or 32, all of which are needed in order
to have any hope of efficient audio and video processing.

Rich
Rob Landley
2006-07-10 00:44:02 UTC
Permalink
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
I think that some programs may print so much numerical output
that it may be important for them to do itoa more-or-less
efficiently.
Could you give me an example? (Your patch didn't actually convert many
users, that I saw.)
Examples of things which should care about it
od, seq, uniq -c, tcpdump, hexdump...
How many of these users actually need a malloc, and how many is it essentially
a better sprintf?
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
gcc optimizes it out to simple 1000000000 constant
(if sizeof(unsigned)==4, that is).
When is sizeof(unsigned) not going to be 4 on any Linux platform? We use
the lp64 model (as do all modern Unix variants, including MacOS X), where
only long and pointer are 64 bit.
http://www.unix.org/whitepapers/64bit.html
What about Alpha? I heard that it has (had?) sizeof(int)==8.
A) Not on Linux. Linux standardized on LP64.

B) If it's that broken, I don't care.
Post by Denis Vlasenko
static inline unsigned max_unsigned_power10(void)
{
if (sizeof(unsigned) == 4) return 1000000000;
if (sizeof(unsigned) == 8) // gcc warns a lot here, shut it up
return (unsigned)(10ULL*1000000000*1000000000);
return BUG_unsupported_architecture();
}
but since you don't want to handle long long, it's irrelevant for now.
I'd rather wait until users of long long show up before adding support.
Also, "long long" isn't a specific type: on a 32 bit platform long is 32 bits
and long long is 64 bits. On a 64 bit platform, long is 64 bits and long
long is 128 bits. That's why they invented uint64_t and friends. :)
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
So, do you want me to
a) remove #ifdef NOT_NEEDED_YET part
or
b) enable it and use it in ash?
The first, please.
A general rule of thumb: we can always add code to the tree later. (And
if we remove it, we can always add it _back_ later.)
See attached patch.
Well, it's definitely an improvement. However, it would be nice if you could
call itoa on a buffer and have it actually fill in that buffer from the
start, so you could use it like you could sprintf or strcpy, and not
potentially have to track two pointers to your own buffer.

I did a quick stab at that and checked it into xfuncs.c, and now I'm about to
pass out. Could you tell me if that's useful, or if not what did I screw up?
I have to go fall over and pass out now, it's 4 am and I have work in the
morning...

Rob
--
Never bet against the cheap plastic solution.
Denis Vlasenko
2006-07-10 02:47:10 UTC
Permalink
Post by Rob Landley
Post by Denis Vlasenko
Examples of things which should care about it
od, seq, uniq -c, tcpdump, hexdump...
How many of these users actually need a malloc, and how many is it essentially
a better sprintf?
The attached test program shows that snprintf is 2.5 times slower
than utoa. Operations done per second:
sprintf='12345678' count=27346
utoa_to_buf='12345678' count=69880
utoa='12345678' count=65934

A few days ago I ran "hexdump | grep" over entire hard drive.
Took a long time. I wish hexdump was optimized a bit better...
Post by Rob Landley
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
gcc optimizes it out to simple 1000000000 constant
(if sizeof(unsigned)==4, that is).
When is sizeof(unsigned) not going to be 4 on any Linux platform? We use
the lp64 model (as do all modern Unix variants, including MacOS X), where
only long and pointer are 64 bit.
http://www.unix.org/whitepapers/64bit.html
What about Alpha? I heard that it has (had?) sizeof(int)==8.
A) Not on Linux. Linux standardized on LP64.
I am not sure. It looks like you are not sure too.
Post by Rob Landley
B) If it's that broken, I don't care.
What's broken about it? If the CPU is _designed_ to handle
mostly 64-bit operands, then it's only natural to make
"int" type 64-bit.

For AMD64, instructions are 32-bit (with auto sign extend
to 64 bit) unless you add a "I want 64bit op" prefix.
IOW: AMD64 CPU most of the time works with 32bit operands.
That's why for AMD64, natural size of "int" is 32 bits.
Cool, AMD got best things from both worlds.

IIRC Alpha took a simpler "we are a fully 64 animal"
route. Sane C ABI on such CPU should use 64-bit ints,
or else you will drown in zillions of "n &= 0x00000000fffffffff"
type instructions which will enforce 32-bitness
on 64-bit CPU.

I think I'm starting to waste your (and my) time.
Alpha is almost history now...
Post by Rob Landley
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
So, do you want me to
a) remove #ifdef NOT_NEEDED_YET part
or
b) enable it and use it in ash?
The first, please.
A general rule of thumb: we can always add code to the tree later. (And
if we remove it, we can always add it _back_ later.)
See attached patch.
Well, it's definitely an improvement. However, it would be nice if you could
call itoa on a buffer and have it actually fill in that buffer from the
start, so you could use it like you could sprintf or strcpy, and not
potentially have to track two pointers to your own buffer.
That will need either a clever method to determine how much digits are there
_before_ we do the conversion, or copying the result.
I hope that in most cases you can just

do_something(utoa_to_buf(n, buf, sizeof(buf));

(Actually, im most cases people will just use do_something(utoa(n)); ...)
Post by Rob Landley
I did a quick stab at that and checked it into xfuncs.c, and now I'm about to
pass out. Could you tell me if that's useful, or if not what did I screw up?
I have to go fall over and pass out now, it's 4 am and I have work in the
morning...
Aha! So you DO have a clever method! :) Thanks!

Hmmm.... (/me takes a look) I think I can improve on that.

Please see (and run) attached program. utoa_to_buf_rob is your version
copied from xfuncs.c. Two problems: it's almost as slow as sprintf
and it doesn't check for buflen. utoa_to_buf_rob2 is a speed improved
version and utoa_to_buf_rob2_with_check adds compile-time check
that unsigned is really 32-bit and also correctly handles buflen.

There is a correctness tests for utoa_to_buf_rob2_with_check.
Seems to work right.

Please put utoa_to_buf_rob2_with_check code into xfuncs
(unless you find more way to make it better).
--
vda
-------------- next part --------------
A non-text attachment was scrubbed...
Name: itoa_bench.c
Type: text/x-csrc
Size: 3950 bytes
Desc: not available
Url : http://busybox.net/lists/busybox/attachments/20060710/9aebbade/itoa_bench.c
Michael S. Zick
2006-07-10 07:14:09 UTC
Permalink
Post by Denis Vlasenko
I think I'm starting to waste your (and my) time.
Alpha is almost history now...
Almost for some,

(HP) Vax and MicroVax - no new machines, remarket only
(HP) Alpha Servers - new orders only until Oct. 2006
(HP) PA-RISC, 64-bit - alive and well - no 64-bit userland
under Linux - 64-bit Linux kernel alive and well.

Two on-going efforts to keep in mind...

Giving parisc-Linux a 64-bit userland is on-going, but you
will not see these in an embedded system.

If an adaption of BB gets shoved into the kernel as some
propose - then handling 64-bit & 128-bit types may come up again.

- - - -

But still, sizeof(unsigned int) is set by choosing the compiler
to use for the code. There are only two widely used compilers
for pa-risc - HP and GCC - those are 32-bit unsigned int for
both compilers, both 32-bit and 64-bit machines.
(The hardware handles the masking and sign extension.)

Now if you wanted to know about the sizeof(long double)...
Thank goodness, you don't.

Mike
Rob Landley
2006-07-11 08:36:54 UTC
Permalink
Post by Michael S. Zick
Post by Denis Vlasenko
I think I'm starting to waste your (and my) time.
Alpha is almost history now...
Almost for some,
(HP) Vax and MicroVax - no new machines, remarket only
(HP) Alpha Servers - new orders only until Oct. 2006
(HP) PA-RISC, 64-bit - alive and well - no 64-bit userland
under Linux - 64-bit Linux kernel alive and well.
I always assumed that PA-RISC used HP-PA processors. (Which I vaguely recall
Linux being ported to by The Puffin Group, years ago...)
Post by Michael S. Zick
Two on-going efforts to keep in mind...
Giving parisc-Linux a 64-bit userland is on-going, but you
will not see these in an embedded system.
I still think PA-RISC is an architecture from HP. And the parisc-linux
website would appear to agree with me: http://www.parisc-linux.org/
Post by Michael S. Zick
If an adaption of BB gets shoved into the kernel as some
propose - then handling 64-bit & 128-bit types may come up again.
*blink* *blink*

No.

And I'm not against handling 64 bit and 128 bit types, I'm just saying
that "int" is 32 bit (not 64 or 128 bit) on any platform we care to support.
Long is 64 bits on some platforms of interest to us, and long long is 128 bit
on some, 64 bit on others.
Post by Michael S. Zick
But still, sizeof(unsigned int) is set by choosing the compiler
to use for the code. There are only two widely used compilers
for pa-risc - HP and GCC - those are 32-bit unsigned int for
both compilers, both 32-bit and 64-bit machines.
(The hardware handles the masking and sign extension.)
Now if you wanted to know about the sizeof(long double)...
Thank goodness, you don't.
We largely avoid floating point. Where we use it, we usually have a CONFIG
option, ala SORT.

Rob
--
Never bet against the cheap plastic solution.
Rob Landley
2006-07-11 15:36:44 UTC
Permalink
Post by Michael S. Zick
Post by Denis Vlasenko
I think I'm starting to waste your (and my) time.
Alpha is almost history now...
Almost for some,
(HP) Vax and MicroVax - no new machines, remarket only
(HP) Alpha Servers - new orders only until Oct. 2006
(HP) PA-RISC, 64-bit - alive and well - no 64-bit userland
under Linux - 64-bit Linux kernel alive and well.
I always assumed that PA-RISC used HP-PA processors. (Which I vaguely recall
Linux being ported to by The Puffin Group, years ago...)
Post by Michael S. Zick
Two on-going efforts to keep in mind...
Giving parisc-Linux a 64-bit userland is on-going, but you
will not see these in an embedded system.
I still think PA-RISC is an architecture from HP. And the parisc-linux
website would appear to agree with me: http://www.parisc-linux.org/
Post by Michael S. Zick
If an adaption of BB gets shoved into the kernel as some
propose - then handling 64-bit & 128-bit types may come up again.
*blink* *blink*

No.

And I'm not against handling 64 bit and 128 bit types, I'm just saying
that "int" is 32 bit (not 64 or 128 bit) on any platform we care to support.
Long is 64 bits on some platforms of interest to us, and long long is 128 bit
on some, 64 bit on others.
Post by Michael S. Zick
But still, sizeof(unsigned int) is set by choosing the compiler
to use for the code. There are only two widely used compilers
for pa-risc - HP and GCC - those are 32-bit unsigned int for
both compilers, both 32-bit and 64-bit machines.
(The hardware handles the masking and sign extension.)
Now if you wanted to know about the sizeof(long double)...
Thank goodness, you don't.
We largely avoid floating point. Where we use it, we usually have a CONFIG
option, ala SORT.

Rob
--
Never bet against the cheap plastic solution.
Rich Felker
2006-07-10 07:40:48 UTC
Permalink
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
Examples of things which should care about it
od, seq, uniq -c, tcpdump, hexdump...
How many of these users actually need a malloc, and how many is it essentially
a better sprintf?
The attached test program shows that snprintf is 2.5 times slower
Which snprintf? glibc? uClibc?
Anyway I ran the rest (after sanitizing it; your benchmarking method
is not valid due to multitasking) and obtained much more extreme
results.

Still your code is huge massive bloat to do something very simple...
It's like these 10k memcpy implementations...
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
Post by Rob Landley
http://www.unix.org/whitepapers/64bit.html
What about Alpha? I heard that it has (had?) sizeof(int)==8.
A) Not on Linux. Linux standardized on LP64.
I am not sure. It looks like you are not sure too.
Post by Rob Landley
B) If it's that broken, I don't care.
What's broken about it? If the CPU is _designed_ to handle
mostly 64-bit operands, then it's only natural to make
"int" type 64-bit.
And this cpu won't be able to handle lots of real-world tasks in C
code. Lovely. The conclusion: the cpu sucks.
Post by Denis Vlasenko
For AMD64, instructions are 32-bit (with auto sign extend
to 64 bit) unless you add a "I want 64bit op" prefix.
IOW: AMD64 CPU most of the time works with 32bit operands.
That's why for AMD64, natural size of "int" is 32 bits.
Cool, AMD got best things from both worlds.
Actually I think the prefix is needed for 32bit when running 64bit
code, but it makes little difference to performance.
Post by Denis Vlasenko
IIRC Alpha took a simpler "we are a fully 64 animal"
route.
No, it took the "we are an idiotic risc arch" approach of not having
opcodes to manipulate all integer sizes up to and including the system
word size.
Post by Denis Vlasenko
Sane C ABI on such CPU should use 64-bit ints,
gcc is not sane then. :)
Post by Denis Vlasenko
or else you will drown in zillions of "n &= 0x00000000fffffffff"
type instructions which will enforce 32-bitness
on 64-bit CPU.
I suspect it's easier than this. Still ugly, and still alpha's fault.
Post by Denis Vlasenko
Please put utoa_to_buf_rob2_with_check code into xfuncs
(unless you find more way to make it better).
Calling sprintf makes it better by adding less size to BB...
At the very least this should be a build option.

Rich
Rob Landley
2006-07-10 17:47:31 UTC
Permalink
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
Examples of things which should care about it
od, seq, uniq -c, tcpdump, hexdump...
How many of these users actually need a malloc, and how many is it
essentially a better sprintf?
The attached test program shows that snprintf is 2.5 times slower
sprintf='12345678' count=27346
utoa_to_buf='12345678' count=69880
utoa='12345678' count=65934
A few days ago I ran "hexdump | grep" over entire hard drive.
Took a long time. I wish hexdump was optimized a bit better...
I haven't really looked closely at hexdump, but what does a base 10 utoa have
to do with a base 16 hexdump?
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
gcc optimizes it out to simple 1000000000 constant
(if sizeof(unsigned)==4, that is).
When is sizeof(unsigned) not going to be 4 on any Linux platform? We
use the lp64 model (as do all modern Unix variants, including MacOS
X), where only long and pointer are 64 bit.
http://www.unix.org/whitepapers/64bit.html
What about Alpha? I heard that it has (had?) sizeof(int)==8.
A) Not on Linux. Linux standardized on LP64.
I am not sure. It looks like you are not sure too.
Apparently I didn't make myself clear.

Linux standardized on LP64, following the official Open Group recommendations
for all Unix platforms. I am sure of this, and there were good reasons for
it. The research behind the decision is stated here:
http://www.unix.org/version2/whatsnew/lp64_wp.html

I believe all 64-bit Linux platforms are x86-64. x86-64 is LP64. PPC 64 is
LP64. Even Itanic is LP64. Because there's a _spec_ that says LP64 and the
Linux developers have announced support for that spec. It seems highly
unlikely for the Alpha port to not be LP64, but it's hard looking up
information on the alpha because the platform's so obsolete. (The Alpha
Linux FAQ was last updated six years ago.)

However, google for "linux alpha sizeof int" and the first hit is the Wine
developers talking about it (three years ago), where they point out that
sizeof(int) != sizeof(void *), because Alpha Linux is LP64.

Here's explicitly running a program on Alpha to print out sizeof(int) and
sizeof(long):
http://www.alphalinux.org/archives/axp-list/January2001/0072.shtml
Post by Denis Vlasenko
Post by Rob Landley
B) If it's that broken, I don't care.
What's broken about it? If the CPU is _designed_ to handle
mostly 64-bit operands, then it's only natural to make
"int" type 64-bit.
No, it's natural to make "long" 64 bit, "int" 32 bit, "short" 16 bit,
and "char" 8 bit.

If Alpha was broken that way, I would just document that BusyBox does not
support it. I see no reason to add support for something that's _both_
stupid and obsolete.
Post by Denis Vlasenko
IIRC Alpha took a simpler "we are a fully 64 animal"
route.
I.E. we don't handle 32, 16, or 8 bit data types well.
Post by Denis Vlasenko
Sane C ABI on such CPU should use 64-bit ints,
or else you will drown in zillions of "n &= 0x00000000fffffffff"
type instructions which will enforce 32-bitness
on 64-bit CPU.
This is sad, but the deficiencies of Alpha's instruction set really are not
BusyBox's problem. Should we have a 64 bit char type for similar reasons?
Post by Denis Vlasenko
I think I'm starting to waste your (and my) time.
Alpha is almost history now...
Bernhard's still adding Tru64 support. It's his pet platform...
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
So, do you want me to
a) remove #ifdef NOT_NEEDED_YET part
or
b) enable it and use it in ash?
The first, please.
A general rule of thumb: we can always add code to the tree later.
(And if we remove it, we can always add it _back_ later.)
See attached patch.
Well, it's definitely an improvement. However, it would be nice if you
could call itoa on a buffer and have it actually fill in that buffer from
the start, so you could use it like you could sprintf or strcpy, and not
potentially have to track two pointers to your own buffer.
That will need either a clever method to determine how much digits are
there _before_ we do the conversion, or copying the result.
Nope. See the checkin I did for libbb/xfuncs.c. Yes, it potentially does a
few extra divides, but it still does 1.4 million utoa_to_buf() per second on
my laptop.
Post by Denis Vlasenko
I hope that in most cases you can just
do_something(utoa_to_buf(n, buf, sizeof(buf));
(Actually, im most cases people will just use do_something(utoa(n)); ...)
Yup. It's not returning the buf from utoa_to_buf(). It can if there's a
pressing need, but utoa() and itoa() both return it and that's probably the
common case for inline use.
Post by Denis Vlasenko
Post by Rob Landley
I did a quick stab at that and checked it into xfuncs.c, and now I'm
about to pass out. Could you tell me if that's useful, or if not what
did I screw up? I have to go fall over and pass out now, it's 4 am and I
have work in the morning...
Aha! So you DO have a clever method! :) Thanks!
Hmmm.... (/me takes a look) I think I can improve on that.
Glances at attachment... You spotted my lack of bounds checking, which is
definitely a big oversight on my part. :)
Post by Denis Vlasenko
Please see (and run) attached program. utoa_to_buf_rob is your version
copied from xfuncs.c. Two problems: it's almost as slow as sprintf
and it doesn't check for buflen. utoa_to_buf_rob2 is a speed improved
version and utoa_to_buf_rob2_with_check adds compile-time check
that unsigned is really 32-bit and also correctly handles buflen.
On even the 64-bit platforms BusyBox targets, unsigned is 32 bit. I'm pretty
sure of this one.

The lookup table is a classic size/speed tradeoff, something BusyBox has
traditionally come down on the "size" side of. Doing a write() on the result
is going to be way slower than the actual conversion anyway. What actual
bottleneck are you seeing?

I definitely agree the bounds checking needs to be added, though. I'll do
that now...

Rob
--
Never bet against the cheap plastic solution.
Denis Vlasenko
2006-07-11 07:45:00 UTC
Permalink
Post by Rob Landley
Post by Denis Vlasenko
The attached test program shows that snprintf is 2.5 times slower
sprintf='12345678' count=27346
utoa_to_buf='12345678' count=69880
utoa='12345678' count=65934
A few days ago I ran "hexdump | grep" over entire hard drive.
Took a long time. I wish hexdump was optimized a bit better...
I haven't really looked closely at hexdump, but what does a base 10 utoa have
to do with a base 16 hexdump?
This is a real-life example of a program which does a lot of
int -> string conversions, so that it actually mattered for me recently.

hexdump is of course using base 16 conversion. tcpdump could be
an example of a proc which does base 10 conversions, but I didn't
use it lately.
Post by Rob Landley
The lookup table is a classic size/speed tradeoff, something BusyBox has
traditionally come down on the "size" side of. Doing a write() on the result
is going to be way slower than the actual conversion anyway. What actual
bottleneck are you seeing?
I definitely agree the bounds checking needs to be added, though. I'll do
that now...
The version in current svn produces _most significant_ digits
if buffer is too small. IOW: utoa_to_buf_rob(100223,buf,3)=="10".
The usual practuce is to produce "23".

How about this modified version?

// 0x63 bytes
void utoa_to_buf_rob(unsigned n, char *buf, int buflen)
{
unsigned i, out;
if (buflen) {
buflen -= 11;
out = 0;
for (i=1000000000; i; i/=10) {
unsigned res = n / i;
n %= i; // gcc reuses remainder produced in division
if (res || out || i == 1) {
out++;
if (buflen >= 0)
*buf++ = '0' + res;
}
buflen++;
}
*buf = 0;
}
}

It's even a bit smaller. gcc is smart enough to know div instruction
produces the remainder also, and we get n %= i for free.

I still dislike the speed, i/=10 eats too much CPU...
--
vda
Denis Vlasenko
2006-07-11 09:14:21 UTC
Permalink
Post by Denis Vlasenko
It's even a bit smaller. gcc is smart enough to know div instruction
produces the remainder also, and we get n %= i for free.
I still dislike the speed, i/=10 eats too much CPU...
This patch converts division into mul + shift.
It's smaller and faster.

i = (i*429496730ULL)>>32 /* divide by 10, fast */

Speed is ok, correctness check ok:

sprintf='12345678' count=27599
utoa_to_buf='12345678' count=62667
utoa='12345678' count=59227
utoa_to_buf_test='12345678' count=41303
utoa_to_buf_test(0)='0'
utoa_to_buf_test(123)='123'
utoa_to_buf_test(100223)='100223'
utoa_to_buf_test(100223,3)='23'
utoa_to_buf_test(100223,2)='3'
utoa_to_buf_test(100223,1)=''

Please apply.
--
vda
-------------- next part --------------
A non-text attachment was scrubbed...
Name: itoa.diff
Type: text/x-diff
Size: 930 bytes
Desc: not available
Url : http://busybox.net/lists/busybox/attachments/20060711/863e5e7d/itoa.bin
Rob Landley
2006-07-11 12:41:26 UTC
Permalink
Post by Denis Vlasenko
Post by Denis Vlasenko
It's even a bit smaller. gcc is smart enough to know div instruction
produces the remainder also, and we get n %= i for free.
I still dislike the speed, i/=10 eats too much CPU...
This patch converts division into mul + shift.
It's smaller and faster.
i = (i*429496730ULL)>>32 /* divide by 10, fast */
A) That's too ugly for words.

B) On what platform are you having a speed problem, and on what platform is
this an improvement? On 32 bit platforms (which are what most of busybox is
deployed on these days, and where most of it will _continue_ to be deployed
on for years to come if not the foreseeable future) gcc produces absolutely
_horrible_ code for long long multiplies, and it triggers the gcc_s shared
library problem with some toolchains.

C) On a lot of 64 bit platforms (I don't know if it's all), long long is a 128
bit number, so this is a big pessimization.
Post by Denis Vlasenko
sprintf='12345678' count=27599
utoa_to_buf='12345678' count=62667
utoa='12345678' count=59227
utoa_to_buf_test='12345678' count=41303
What are you using to produce these numbers? What hardware, what test
program, and what _other_ platforms have you tried it on to see what the
impact is there?

Rob
--
Never bet against the cheap plastic solution.
Rich Felker
2006-07-11 13:36:10 UTC
Permalink
Post by Rob Landley
Post by Denis Vlasenko
Post by Denis Vlasenko
It's even a bit smaller. gcc is smart enough to know div instruction
produces the remainder also, and we get n %= i for free.
I still dislike the speed, i/=10 eats too much CPU...
This patch converts division into mul + shift.
It's smaller and faster.
i = (i*429496730ULL)>>32 /* divide by 10, fast */
A) That's too ugly for words.
:)
Post by Rob Landley
B) On what platform are you having a speed problem, and on what platform is
this an improvement? On 32 bit platforms (which are what most of busybox is
deployed on these days, and where most of it will _continue_ to be deployed
on for years to come if not the foreseeable future) gcc produces absolutely
_horrible_ code for long long multiplies, and it triggers the gcc_s shared
library problem with some toolchains.
This code should not generate a 64bit multiply since the operands both
fit in 32 bits and a 32x32 multiply gives a 64bit result. Still, gcc
is broken so maybe it does 64x64 multiply anyway...

BTW inttypes.h has a macro to declare uint64_t constants, or you can
just cast the integer constant to (uint64_t) before multiplying..

Anyway I doubt this is faster than just dividing, especially since the
divide also gives you the modulo result for free without extra ops.
Certainly the division should result in smaller code size.

Rich
Michael S. Zick
2006-07-11 13:51:01 UTC
Permalink
Post by Rich Felker
Anyway I doubt this is faster than just dividing, especially since the
divide also gives you the modulo result for free without extra ops.
Certainly the division should result in smaller code size.
You want small and fast?

q1) What target processors do you wish to handle?

Any processor with special instructions for BCD, it
is smallest and fastest to:
int -> BCD
BCD -> ASCII

q2) How gcc-centric is BusyBox?

Any interest in a __builtin_uitoa( ... )?

I just happen to have gcc's builtins.c torn apart
in front of me for a different reason.
Post by Rich Felker
Rich
Mike
Denis Vlasenko
2006-07-12 07:33:33 UTC
Permalink
Post by Rich Felker
Post by Rob Landley
Post by Denis Vlasenko
Post by Denis Vlasenko
I still dislike the speed, i/=10 eats too much CPU...
This patch converts division into mul + shift.
It's smaller and faster.
i = (i*429496730ULL)>>32 /* divide by 10, fast */
A) That's too ugly for words.
:)
Post by Rob Landley
B) On what platform are you having a speed problem, and on what platform is
this an improvement? On 32 bit platforms (which are what most of busybox is
deployed on these days, and where most of it will _continue_ to be deployed
on for years to come if not the foreseeable future) gcc produces absolutely
_horrible_ code for long long multiplies, and it triggers the gcc_s shared
library problem with some toolchains.
This code should not generate a 64bit multiply since the operands both
fit in 32 bits and a 32x32 multiply gives a 64bit result. Still, gcc
is broken so maybe it does 64x64 multiply anyway...
It does 32x32 multiply and optimizes out the shift altogether.
Post by Rich Felker
Anyway I doubt this is faster than just dividing,
On i386, it is.
Post by Rich Felker
especially since the
divide also gives you the modulo result for free without extra ops.
We do not need the remainder of i/10 here:

? ? ? ? for (i=1000000000; i; i/=10) {
? ? ? ? ? ? int res = n/i;
? ? ? ? ? ? ? ? n %= i; // gcc reuses remainder produced in division
? ? ? ? ? ? ? ? if (res || out || i == 1) {
? ? ? ? ? ? ? ? ? ? out++;
? ? ? ? ? ? ? ? ? ? *buf++ = '0' + res;
? ? ? ? ? ? ? ? }
? ? ? ? ? ? buflen++;
? ? ? ? }
Post by Rich Felker
Certainly the division should result in smaller code size.
This depends on the smartness of the compiler.
My gcc does this:

# cat tt.c
int f(unsigned n) {
unsigned k,l;
asm("#1");
k = n/10;
asm("#1");
l = (n*429496730ULL)>>32;
asm("#1");
return k + l;
}
# gcc -S -Os tt.c
# cat tt.s
.file "tt.c"
.text
.globl f
.type f, @function
f:
pushl %ebp
movl %esp, %ebp
pushl %edi
pushl %esi
pushl %ebx
#APP
#1
#NO_APP
movl $10, %edx
movl %edx, %ecx
movl 8(%ebp), %eax
xorl %edx, %edx
divl %ecx
movl %eax, %ebx
#APP
#1
#NO_APP
movl $429496730, %ecx
movl 8(%ebp), %eax
mull %ecx
#APP
#1
#NO_APP
leal (%ebx,%edx), %eax
popl %ebx
popl %esi
popl %edi
leave
ret
.size f, .-f
.section .note.GNU-stack,"",@progbits
.ident "GCC: (GNU) 3.4.3"

--
vda
Michael S. Zick
2006-07-11 20:50:53 UTC
Permalink
Post by Rich Felker
Anyway I doubt this is faster than just dividing, especially since the
divide also gives you the modulo result for free without extra ops.
Certainly the division should result in smaller code size.
You want small and fast?

q1) What target processors do you wish to handle?

Any processor with special instructions for BCD, it
is smallest and fastest to:
int -> BCD
BCD -> ASCII

q2) How gcc-centric is BusyBox?

Any interest in a __builtin_uitoa( ... )?

I just happen to have gcc's builtins.c torn apart
in front of me for a different reason.
Post by Rich Felker
Rich
Mike
Denis Vlasenko
2006-07-12 14:27:29 UTC
Permalink
Post by Rich Felker
Post by Rob Landley
Post by Denis Vlasenko
Post by Denis Vlasenko
I still dislike the speed, i/=10 eats too much CPU...
This patch converts division into mul + shift.
It's smaller and faster.
i = (i*429496730ULL)>>32 /* divide by 10, fast */
A) That's too ugly for words.
:)
Post by Rob Landley
B) On what platform are you having a speed problem, and on what platform is
this an improvement? On 32 bit platforms (which are what most of busybox is
deployed on these days, and where most of it will _continue_ to be deployed
on for years to come if not the foreseeable future) gcc produces absolutely
_horrible_ code for long long multiplies, and it triggers the gcc_s shared
library problem with some toolchains.
This code should not generate a 64bit multiply since the operands both
fit in 32 bits and a 32x32 multiply gives a 64bit result. Still, gcc
is broken so maybe it does 64x64 multiply anyway...
It does 32x32 multiply and optimizes out the shift altogether.
Post by Rich Felker
Anyway I doubt this is faster than just dividing,
On i386, it is.
Post by Rich Felker
especially since the
divide also gives you the modulo result for free without extra ops.
We do not need the remainder of i/10 here:

? ? ? ? for (i=1000000000; i; i/=10) {
? ? ? ? ? ? int res = n/i;
? ? ? ? ? ? ? ? n %= i; // gcc reuses remainder produced in division
? ? ? ? ? ? ? ? if (res || out || i == 1) {
? ? ? ? ? ? ? ? ? ? out++;
? ? ? ? ? ? ? ? ? ? *buf++ = '0' + res;
? ? ? ? ? ? ? ? }
? ? ? ? ? ? buflen++;
? ? ? ? }
Post by Rich Felker
Certainly the division should result in smaller code size.
This depends on the smartness of the compiler.
My gcc does this:

# cat tt.c
int f(unsigned n) {
unsigned k,l;
asm("#1");
k = n/10;
asm("#1");
l = (n*429496730ULL)>>32;
asm("#1");
return k + l;
}
# gcc -S -Os tt.c
# cat tt.s
.file "tt.c"
.text
.globl f
.type f, @function
f:
pushl %ebp
movl %esp, %ebp
pushl %edi
pushl %esi
pushl %ebx
#APP
#1
#NO_APP
movl $10, %edx
movl %edx, %ecx
movl 8(%ebp), %eax
xorl %edx, %edx
divl %ecx
movl %eax, %ebx
#APP
#1
#NO_APP
movl $429496730, %ecx
movl 8(%ebp), %eax
mull %ecx
#APP
#1
#NO_APP
leal (%ebx,%edx), %eax
popl %ebx
popl %esi
popl %edi
leave
ret
.size f, .-f
.section .note.GNU-stack,"", at progbits
.ident "GCC: (GNU) 3.4.3"

--
vda
Denis Vlasenko
2006-07-12 07:11:18 UTC
Permalink
Post by Rob Landley
Post by Denis Vlasenko
Post by Denis Vlasenko
It's even a bit smaller. gcc is smart enough to know div instruction
produces the remainder also, and we get n %= i for free.
I still dislike the speed, i/=10 eats too much CPU...
This patch converts division into mul + shift.
It's smaller and faster.
i = (i*429496730ULL)>>32 /* divide by 10, fast */
A) That's too ugly for words.
8) hopefully comment helps to clear the confusion
Post by Rob Landley
B) On what platform are you having a speed problem, and on what platform is
this an improvement? On 32 bit platforms (which are what most of busybox is
deployed on these days, and where most of it will _continue_ to be deployed
on for years to come if not the foreseeable future) gcc produces absolutely
_horrible_ code for long long multiplies, and it triggers the gcc_s shared
library problem with some toolchains.
This piece of C code:

asm("#1");
i = ((n * 429496730ULL) >> 32);
asm("#1");

is compiled into:

# gcc -Os -S fast_div10.c
# grep -C2 -F '#1' fast_div10.s
testl %eax, %eax
movl %eax, -32(%ebp)
jne .L15
#APP
#1
#NO_APP
movl $429496730, %eax
mull 8(%ebp)
movl %edx, %eax
#APP
#1
#NO_APP
leal -12(%ebp), %esp
popl %ebx
popl %esi

As you see here, gcc 3.4.3 produces just three instructions for that.
Post by Rob Landley
C) On a lot of 64 bit platforms (I don't know if it's all), long long is a 128
bit number, so this is a big pessimization.
Post by Denis Vlasenko
sprintf='12345678' count=27599
utoa_to_buf='12345678' count=62667
utoa='12345678' count=59227
utoa_to_buf_test='12345678' count=41303
What are you using to produce these numbers? What hardware, what test
program, and what _other_ platforms have you tried it on to see what the
impact is there?
# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 8
model name : Pentium III (Coppermine)
stepping : 10
cpu MHz : 930.476
cache size : 256 KB
...

The test program is the same as yours:

#include <stdio.h>
// time: user 0m0.572s
void utoa_to_buf_test(unsigned n, char *buf, int buflen)
{
unsigned i, out = 0;
if (buflen > 0) {
buflen -= 11;
for (i=1000000000; i; i = (i*429496730ULL)>>32 /* divide by 10, fast */) {
int res = n/i;
n %= i; // gcc reuses remainder produced in division
if (res || out || i == 1) {
out++;
if (buflen >= 0)
*buf++ = '0' + res;
}
buflen++;
}
*buf = 0;
}
}


int main(int argc, char *argv[])
{
int i;
char blah[16];

for(i=0;i<1000000;i++) {
utoa_to_buf_test(i, blah, 16);
}
}

--
vda
Rich Felker
2006-07-11 20:35:08 UTC
Permalink
Post by Rob Landley
Post by Denis Vlasenko
Post by Denis Vlasenko
It's even a bit smaller. gcc is smart enough to know div instruction
produces the remainder also, and we get n %= i for free.
I still dislike the speed, i/=10 eats too much CPU...
This patch converts division into mul + shift.
It's smaller and faster.
i = (i*429496730ULL)>>32 /* divide by 10, fast */
A) That's too ugly for words.
:)
Post by Rob Landley
B) On what platform are you having a speed problem, and on what platform is
this an improvement? On 32 bit platforms (which are what most of busybox is
deployed on these days, and where most of it will _continue_ to be deployed
on for years to come if not the foreseeable future) gcc produces absolutely
_horrible_ code for long long multiplies, and it triggers the gcc_s shared
library problem with some toolchains.
This code should not generate a 64bit multiply since the operands both
fit in 32 bits and a 32x32 multiply gives a 64bit result. Still, gcc
is broken so maybe it does 64x64 multiply anyway...

BTW inttypes.h has a macro to declare uint64_t constants, or you can
just cast the integer constant to (uint64_t) before multiplying..

Anyway I doubt this is faster than just dividing, especially since the
divide also gives you the modulo result for free without extra ops.
Certainly the division should result in smaller code size.

Rich
Denis Vlasenko
2006-07-12 14:06:58 UTC
Permalink
Post by Rob Landley
Post by Denis Vlasenko
Post by Denis Vlasenko
It's even a bit smaller. gcc is smart enough to know div instruction
produces the remainder also, and we get n %= i for free.
I still dislike the speed, i/=10 eats too much CPU...
This patch converts division into mul + shift.
It's smaller and faster.
i = (i*429496730ULL)>>32 /* divide by 10, fast */
A) That's too ugly for words.
8) hopefully comment helps to clear the confusion
Post by Rob Landley
B) On what platform are you having a speed problem, and on what platform is
this an improvement? On 32 bit platforms (which are what most of busybox is
deployed on these days, and where most of it will _continue_ to be deployed
on for years to come if not the foreseeable future) gcc produces absolutely
_horrible_ code for long long multiplies, and it triggers the gcc_s shared
library problem with some toolchains.
This piece of C code:

asm("#1");
i = ((n * 429496730ULL) >> 32);
asm("#1");

is compiled into:

# gcc -Os -S fast_div10.c
# grep -C2 -F '#1' fast_div10.s
testl %eax, %eax
movl %eax, -32(%ebp)
jne .L15
#APP
#1
#NO_APP
movl $429496730, %eax
mull 8(%ebp)
movl %edx, %eax
#APP
#1
#NO_APP
leal -12(%ebp), %esp
popl %ebx
popl %esi

As you see here, gcc 3.4.3 produces just three instructions for that.
Post by Rob Landley
C) On a lot of 64 bit platforms (I don't know if it's all), long long is a 128
bit number, so this is a big pessimization.
Post by Denis Vlasenko
sprintf='12345678' count=27599
utoa_to_buf='12345678' count=62667
utoa='12345678' count=59227
utoa_to_buf_test='12345678' count=41303
What are you using to produce these numbers? What hardware, what test
program, and what _other_ platforms have you tried it on to see what the
impact is there?
# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 8
model name : Pentium III (Coppermine)
stepping : 10
cpu MHz : 930.476
cache size : 256 KB
...

The test program is the same as yours:

#include <stdio.h>
// time: user 0m0.572s
void utoa_to_buf_test(unsigned n, char *buf, int buflen)
{
unsigned i, out = 0;
if (buflen > 0) {
buflen -= 11;
for (i=1000000000; i; i = (i*429496730ULL)>>32 /* divide by 10, fast */) {
int res = n/i;
n %= i; // gcc reuses remainder produced in division
if (res || out || i == 1) {
out++;
if (buflen >= 0)
*buf++ = '0' + res;
}
buflen++;
}
*buf = 0;
}
}


int main(int argc, char *argv[])
{
int i;
char blah[16];

for(i=0;i<1000000;i++) {
utoa_to_buf_test(i, blah, 16);
}
}

--
vda
Rob Landley
2006-07-11 19:41:17 UTC
Permalink
Post by Denis Vlasenko
Post by Denis Vlasenko
It's even a bit smaller. gcc is smart enough to know div instruction
produces the remainder also, and we get n %= i for free.
I still dislike the speed, i/=10 eats too much CPU...
This patch converts division into mul + shift.
It's smaller and faster.
i = (i*429496730ULL)>>32 /* divide by 10, fast */
A) That's too ugly for words.

B) On what platform are you having a speed problem, and on what platform is
this an improvement? On 32 bit platforms (which are what most of busybox is
deployed on these days, and where most of it will _continue_ to be deployed
on for years to come if not the foreseeable future) gcc produces absolutely
_horrible_ code for long long multiplies, and it triggers the gcc_s shared
library problem with some toolchains.

C) On a lot of 64 bit platforms (I don't know if it's all), long long is a 128
bit number, so this is a big pessimization.
Post by Denis Vlasenko
sprintf='12345678' count=27599
utoa_to_buf='12345678' count=62667
utoa='12345678' count=59227
utoa_to_buf_test='12345678' count=41303
What are you using to produce these numbers? What hardware, what test
program, and what _other_ platforms have you tried it on to see what the
impact is there?

Rob
--
Never bet against the cheap plastic solution.
Rob Landley
2006-07-11 12:42:45 UTC
Permalink
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
The attached test program shows that snprintf is 2.5 times slower
sprintf='12345678' count=27346
utoa_to_buf='12345678' count=69880
utoa='12345678' count=65934
A few days ago I ran "hexdump | grep" over entire hard drive.
Took a long time. I wish hexdump was optimized a bit better...
I haven't really looked closely at hexdump, but what does a base 10 utoa
have to do with a base 16 hexdump?
This is a real-life example of a program which does a lot of
int -> string conversions, so that it actually mattered for me recently.
hexdump is of course using base 16 conversion. tcpdump could be
an example of a proc which does base 10 conversions, but I didn't
use it lately.
Understood, but if I wanted to speed up hexdump | grep I'd profile both
hexdump and grep, and probably find out that our grep is dog slow (there's
the pending patch to speed up get_chomped_line_from_file() in my todo heap,
which is waiting for me to get back around to the sed NUL bug and make sure
that they play nice together).
Post by Denis Vlasenko
Post by Rob Landley
The lookup table is a classic size/speed tradeoff, something BusyBox has
traditionally come down on the "size" side of. Doing a write() on the
result is going to be way slower than the actual conversion anyway. What
actual bottleneck are you seeing?
I definitely agree the bounds checking needs to be added, though. I'll
do that now...
The version in current svn produces _most significant_ digits
if buffer is too small. IOW: utoa_to_buf_rob(100223,buf,3)=="10".
The usual practuce is to produce "23".
If the buffer is too small it essentially produces garbage, we just don't want
it to overflow and stomp memory it doesn't own.
Post by Denis Vlasenko
How about this modified version?
...
Post by Denis Vlasenko
It's even a bit smaller. gcc is smart enough to know div instruction
produces the remainder also, and we get n %= i for free.
Very nice. My only concern is on which platforms does division automatically
produce a remainder? I expect division to be supported everywhere, and to
perform fairly well. I have a vague recollection that in some cases
environments modulus is overlooked and (as a result) kind of evil, but I
don't remember specifics.
Post by Denis Vlasenko
I still dislike the speed, i/=10 eats too much CPU...
It's a tight L1 cache-hot loop. In a pipeline like hexdump | grep it's
unlikely to be even 1% of the total CPU overhead. What platform have you
profiled this on? I don't believe integer division has boiled down to a
subtraction loop on any interesting processor for a quarter-century (I
remember that multiplication could be done in something like 6 clock cycles
without being _that_ clever, although this is a vague recollection from my
undergraduate days circa 1993), but then I'm not entirely certain how some of
the really low-transistor-budget embedded processors (like ARM) are
implemented. I doubt they're _stupid_, though. Many of them have floating
point units, there's no reason for them to stint on integer arithmetic
operations...

Could I have some benchmark numbers on the system you're worried about,
please? I already mentioned I tried the following here:

#include <stdio.h>

// 0x4f bytes
void utoa_to_buf(unsigned n, char *buf, int buflen)
{
int i, out = 0;
for (i=1000000000; i; i/=10) {
int res = n/i;

if (res || out || i == 1) {
out++;
n -= res*i;
*buf++ = '0' + res;
}
}
*buf = 0;
}

int main(int argc, char *argv[])
{
int i;
char blah[16];

for(i=0;i<1000000;i++) {
utoa_to_buf(i, blah, 16);
}
}

Ran that under "time", and got 0.707 seconds. 1000000/.707 is 1.4 million
according to my handy dandy python interpreter (the best calculator Linux
ever shipped with). Adding bounds checking didn't slow it down _that_ much.

What do you get on the system you're worried about the performance of?

Rob
--
Never bet against the cheap plastic solution.
Rich Felker
2006-07-11 13:52:01 UTC
Permalink
Post by Rob Landley
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
The attached test program shows that snprintf is 2.5 times slower
sprintf='12345678' count=27346
utoa_to_buf='12345678' count=69880
utoa='12345678' count=65934
A few days ago I ran "hexdump | grep" over entire hard drive.
Took a long time. I wish hexdump was optimized a bit better...
I haven't really looked closely at hexdump, but what does a base 10 utoa
have to do with a base 16 hexdump?
This is a real-life example of a program which does a lot of
int -> string conversions, so that it actually mattered for me recently.
hexdump is of course using base 16 conversion. tcpdump could be
an example of a proc which does base 10 conversions, but I didn't
use it lately.
Understood, but if I wanted to speed up hexdump | grep I'd profile both
hexdump and grep, and probably find out that our grep is dog slow (there's
the pending patch to speed up get_chomped_line_from_file() in my todo heap,
which is waiting for me to get back around to the sed NUL bug and make sure
that they play nice together).
Currently get_chomped_line_from_file does not support embedded NUL
bytes. My patch could add support at the cost of a little performance
(but still much better performance than the current version).
Post by Rob Landley
Post by Denis Vlasenko
How about this modified version?
...
Post by Denis Vlasenko
It's even a bit smaller. gcc is smart enough to know div instruction
produces the remainder also, and we get n %= i for free.
Very nice. My only concern is on which platforms does division automatically
produce a remainder? I expect division to be supported everywhere, and to
perform fairly well. I have a vague recollection that in some cases
environments modulus is overlooked and (as a result) kind of evil, but I
don't remember specifics.
All I know is that division yields both on x86. Sorry.
Post by Rob Landley
Post by Denis Vlasenko
I still dislike the speed, i/=10 eats too much CPU...
It's a tight L1 cache-hot loop. In a pipeline like hexdump | grep it's
unlikely to be even 1% of the total CPU overhead. What platform have you
profiled this on? I don't believe integer division has boiled down to a
subtraction loop on any interesting processor for a quarter-century (I
remember that multiplication could be done in something like 6 clock cycles
without being _that_ clever, although this is a vague recollection from my
undergraduate days circa 1993), but then I'm not entirely certain how some of
the really low-transistor-budget embedded processors (like ARM) are
Division is something like 40 cycles on old x86 (386-586). IIRC on
modern x86 it's around 10-15 cycles.
Post by Rob Landley
implemented. I doubt they're _stupid_, though. Many of them have floating
point units, there's no reason for them to stint on integer arithmetic
operations...
Sadly some modern cpus perform better at float than integer, simply
because of stupidity and nonsensical goals. Obviously integer ops
should perform significantly better if both are given the same
attention by the engineers, but I guess winning floating point
benchmarks buys you more than improving practical performance.
Post by Rob Landley
Ran that under "time", and got 0.707 seconds. 1000000/.707 is 1.4 million
according to my handy dandy python interpreter (the best calculator Linux
ever shipped with). Adding bounds checking didn't slow it down _that_ much.
Yes, time(1) is a much more accurate benchmarking system, as it won't
include the time spent by other processes interrupting your benchmark.
The time(3) based benchmark is a rather bad idea..

Rich
Denis Vlasenko
2006-07-12 07:08:50 UTC
Permalink
Post by Rob Landley
Post by Denis Vlasenko
I still dislike the speed, i/=10 eats too much CPU...
It's a tight L1 cache-hot loop. In a pipeline like hexdump | grep it's
unlikely to be even 1% of the total CPU overhead.
I'm sure that there are tons of things to improve in hexdump and/or
tcpdump. Last I looked, tcpdump was using printf _far too much_,
not even coalescing adjacent calls into single one with bigger
format string... oh well...
Post by Rob Landley
What platform have you
profiled this on?
i386
Post by Rob Landley
I don't believe integer division has boiled down to a
subtraction loop on any interesting processor for a quarter-century (I
remember that multiplication could be done in something like 6 clock cycles
without being _that_ clever, although this is a vague recollection from my
undergraduate days circa 1993), but then I'm not entirely certain how some of
the really low-transistor-budget embedded processors (like ARM) are
implemented. I doubt they're _stupid_, though. Many of them have floating
point units, there's no reason for them to stint on integer arithmetic
operations...
Could I have some benchmark numbers on the system you're worried about,
#include <stdio.h>
// 0x4f bytes
void utoa_to_buf(unsigned n, char *buf, int buflen)
{
int i, out = 0;
for (i=1000000000; i; i/=10) {
int res = n/i;
if (res || out || i == 1) {
out++;
n -= res*i;
*buf++ = '0' + res;
}
}
*buf = 0;
}
int main(int argc, char *argv[])
{
int i;
char blah[16];
for(i=0;i<1000000;i++) {
utoa_to_buf(i, blah, 16);
}
}
Ran that under "time", and got 0.707 seconds. 1000000/.707 is 1.4 million
Got 0.944 seconds here.

Let's use the below one which is 1.5 faster on i386, has bound checking
and truncates most significant digits if buffer is too small, not least
significant ones:

// 0x52 bytes
// time: user 0m0.572s
void utoa_to_buf_test(unsigned n, char *buf, int buflen)
{
unsigned i, out = 0;
if (buflen > 0) {
buflen -= 11;
for (i=1000000000; i; i = (i*429496730ULL)>>32 /* divide by 10, fast */) {
int res = n/i;
n %= i; // gcc reuses remainder produced in division
if (res || out || i == 1) {
out++;
if (buflen >= 0)
*buf++ = '0' + res;
}
buflen++;
}
*buf = 0;
}
}

Correctness check for it:

sprintf='12345678' count=27599
utoa_to_buf='12345678' count=62667
utoa='12345678' count=59227
utoa_to_buf_test='12345678' count=41303
utoa_to_buf_test(0)='0'
utoa_to_buf_test(123)='123'
utoa_to_buf_test(100223)='100223'
utoa_to_buf_test(100223,3)='23'
utoa_to_buf_test(100223,2)='3'
utoa_to_buf_test(100223,1)=''

Please apply the attached patch.
--
vda
-------------- next part --------------
A non-text attachment was scrubbed...
Name: itoa.diff
Type: text/x-diff
Size: 930 bytes
Desc: not available
Url : http://busybox.net/lists/busybox/attachments/20060712/15c57894/itoa.bin
Rob Landley
2006-07-12 08:58:50 UTC
Permalink
Post by Denis Vlasenko
Post by Rob Landley
Ran that under "time", and got 0.707 seconds. 1000000/.707 is 1.4 million
Got 0.944 seconds here.
So you're in the neighborhood of a million conversions a second even with the
bounds checking added. You were previously talking about 1/20th that many
conversions per second. Congratulations, we just sped it up by a factor of
twenty without actually doing anything.
Post by Denis Vlasenko
Let's use the below one which is 1.5 faster on i386,
And yet that's still not fast enough. Fast enough for what? I am aware that
we could hand code it in assembly (which isn't much uglier than bit-shifting
a long long magic constant), but I'm really not interested without a good
reason.

What, exactly, are you doing that's too slow, and what makes you think that
_this_ code is this bottleneck? This is not a "we can do better" when the
end result is bigger and harder to understand.
Post by Denis Vlasenko
has bound checking
and truncates most significant digits if buffer is too small, not least
I added bounds checking back on the 10th (svn 15683).

If you truncate it doesn't get the correct result no matter what you do, but
the current behavior is the _expected_ behavior, it's what you get with
snprintf() and safe_strcpy(), and everything else I'm aware of that does
bounds checking. If you want just the lower digits, you can always do a
modulus before calling itoa().
Post by Denis Vlasenko
Please apply the attached patch.
No.

Rob
--
Never bet against the cheap plastic solution.
Denis Vlasenko
2006-07-13 09:57:01 UTC
Permalink
Post by Rob Landley
Post by Denis Vlasenko
Post by Rob Landley
Ran that under "time", and got 0.707 seconds. 1000000/.707 is 1.4 million
Got 0.944 seconds here.
So you're in the neighborhood of a million conversions a second even with the
bounds checking added. You were previously talking about 1/20th that many
conversions per second. Congratulations, we just sped it up by a factor of
twenty without actually doing anything.
I would be really happy if you will actually read messages
Post by Rob Landley
Post by Denis Vlasenko
Post by Rob Landley
Examples of things which should care about it
od, seq, uniq -c, tcpdump, hexdump...
How many of these users actually need a malloc, and how many is it essentially
a better sprintf?
The attached test program shows that snprintf is 2.5 times slower
sprintf='12345678' count=27346
utoa_to_buf='12345678' count=69880
utoa='12345678' count=65934
and program does this:

#define REP(a) \
????????a a a a ?a a a a ?a a a a ?a a a a \
????????a a a a ?a a a a ?a a a a ?a a a a
...
????????while(t == time(NULL)) {
????????????????REP( p = utoa_to_buf(v, local_buf, sizeof(local_buf)); )
????????????????count++;
????????}
????????printf("utoa_to_buf='%s' count=%d\n", p, count);

I think it is easy to see why reported number is not near one million.

I never said that it is "too slow" for me, where did you get it from?
I am just saying that if we do itoa at all, we can make it
at least faster than sprintf.
Post by Rob Landley
Post by Denis Vlasenko
Let's use the below one which is 1.5 faster on i386,
And yet that's still not fast enough. Fast enough for what? I am aware that
we could hand code it in assembly (which isn't much uglier than bit-shifting
a long long magic constant), but I'm really not interested without a good
reason.
And because of that, we will use slower _and_ bigger (in machine code size)
method. This is, uh, interesting...

i = n/10:
? ? ? ? movl ? ?$10, %edx
? ? ? ? movl ? ?%edx, %ecx
? ? ? ? movl ? ?8(%ebp), %eax
? ? ? ? xorl ? ?%edx, %edx
? ? ? ? divl ? ?%ecx
? ? ? ? movl ? ?%eax, %ebx

i = (n*429496730ULL)>>32;
? ? ? ? movl ? ?$429496730, %ecx
? ? ? ? movl ? ?8(%ebp), %eax
? ? ? ? mull ? ?%ecx

--
vda
Rob Landley
2006-07-14 09:50:36 UTC
Permalink
Post by Denis Vlasenko
Post by Rob Landley
So you're in the neighborhood of a million conversions a second even with
the bounds checking added. You were previously talking about 1/20th that
many conversions per second. Congratulations, we just sped it up by a
factor of twenty without actually doing anything.
I would be really happy if you will actually read messages
I suspected there was a printf in there, but you'd never explicitly said there
was. And the only reason the overhead of the printf, pipe to xterm, and
blocking on said pipe while xterm consumed the output isn't higher than it is
is that Linux does intelligent batching.
Post by Denis Vlasenko
I think it is easy to see why reported number is not near one million.
Very much so.
Post by Denis Vlasenko
I never said that it is "too slow" for me, where did you get it from?
Your persistent attempts to complicate it in the name of speed?
Post by Denis Vlasenko
I am just saying that if we do itoa at all, we can make it
at least faster than sprintf.
The objection to sprintf isn't speed. It's _size_.
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
Let's use the below one which is 1.5 faster on i386,
And yet that's still not fast enough. Fast enough for what? I am aware
that we could hand code it in assembly (which isn't much uglier than
bit-shifting a long long magic constant), but I'm really not interested
without a good reason.
And because of that, we will use slower _and_ bigger (in machine code size)
method. This is, uh, interesting...
It's simpler source code, easy to understand why it's doing what it's doing.
And is your optimization a size win on arm, power PC, x86-64, and Mips?

Rob
--
Never bet against the cheap plastic solution.
Rob Landley
2006-07-13 23:29:09 UTC
Permalink
Post by Denis Vlasenko
Post by Rob Landley
So you're in the neighborhood of a million conversions a second even with
the bounds checking added. You were previously talking about 1/20th that
many conversions per second. Congratulations, we just sped it up by a
factor of twenty without actually doing anything.
I would be really happy if you will actually read messages
I suspected there was a printf in there, but you'd never explicitly said there
was. And the only reason the overhead of the printf, pipe to xterm, and
blocking on said pipe while xterm consumed the output isn't higher than it is
is that Linux does intelligent batching.
Post by Denis Vlasenko
I think it is easy to see why reported number is not near one million.
Very much so.
Post by Denis Vlasenko
I never said that it is "too slow" for me, where did you get it from?
Your persistent attempts to complicate it in the name of speed?
Post by Denis Vlasenko
I am just saying that if we do itoa at all, we can make it
at least faster than sprintf.
The objection to sprintf isn't speed. It's _size_.
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
Let's use the below one which is 1.5 faster on i386,
And yet that's still not fast enough. Fast enough for what? I am aware
that we could hand code it in assembly (which isn't much uglier than
bit-shifting a long long magic constant), but I'm really not interested
without a good reason.
And because of that, we will use slower _and_ bigger (in machine code size)
method. This is, uh, interesting...
It's simpler source code, easy to understand why it's doing what it's doing.
And is your optimization a size win on arm, power PC, x86-64, and Mips?

Rob
--
Never bet against the cheap plastic solution.
Denis Vlasenko
2006-07-13 16:56:50 UTC
Permalink
Post by Rob Landley
Post by Denis Vlasenko
Post by Rob Landley
Ran that under "time", and got 0.707 seconds. 1000000/.707 is 1.4 million
Got 0.944 seconds here.
So you're in the neighborhood of a million conversions a second even with the
bounds checking added. You were previously talking about 1/20th that many
conversions per second. Congratulations, we just sped it up by a factor of
twenty without actually doing anything.
I would be really happy if you will actually read messages
Post by Rob Landley
Post by Denis Vlasenko
Post by Rob Landley
Examples of things which should care about it
od, seq, uniq -c, tcpdump, hexdump...
How many of these users actually need a malloc, and how many is it essentially
a better sprintf?
The attached test program shows that snprintf is 2.5 times slower
sprintf='12345678' count=27346
utoa_to_buf='12345678' count=69880
utoa='12345678' count=65934
and program does this:

#define REP(a) \
????????a a a a ?a a a a ?a a a a ?a a a a \
????????a a a a ?a a a a ?a a a a ?a a a a
...
????????while(t == time(NULL)) {
????????????????REP( p = utoa_to_buf(v, local_buf, sizeof(local_buf)); )
????????????????count++;
????????}
????????printf("utoa_to_buf='%s' count=%d\n", p, count);

I think it is easy to see why reported number is not near one million.

I never said that it is "too slow" for me, where did you get it from?
I am just saying that if we do itoa at all, we can make it
at least faster than sprintf.
Post by Rob Landley
Post by Denis Vlasenko
Let's use the below one which is 1.5 faster on i386,
And yet that's still not fast enough. Fast enough for what? I am aware that
we could hand code it in assembly (which isn't much uglier than bit-shifting
a long long magic constant), but I'm really not interested without a good
reason.
And because of that, we will use slower _and_ bigger (in machine code size)
method. This is, uh, interesting...

i = n/10:
? ? ? ? movl ? ?$10, %edx
? ? ? ? movl ? ?%edx, %ecx
? ? ? ? movl ? ?8(%ebp), %eax
? ? ? ? xorl ? ?%edx, %edx
? ? ? ? divl ? ?%ecx
? ? ? ? movl ? ?%eax, %ebx

i = (n*429496730ULL)>>32;
? ? ? ? movl ? ?$429496730, %ecx
? ? ? ? movl ? ?8(%ebp), %eax
? ? ? ? mull ? ?%ecx

--
vda
Rob Landley
2006-07-12 15:58:37 UTC
Permalink
Post by Denis Vlasenko
Post by Rob Landley
Ran that under "time", and got 0.707 seconds. 1000000/.707 is 1.4 million
Got 0.944 seconds here.
So you're in the neighborhood of a million conversions a second even with the
bounds checking added. You were previously talking about 1/20th that many
conversions per second. Congratulations, we just sped it up by a factor of
twenty without actually doing anything.
Post by Denis Vlasenko
Let's use the below one which is 1.5 faster on i386,
And yet that's still not fast enough. Fast enough for what? I am aware that
we could hand code it in assembly (which isn't much uglier than bit-shifting
a long long magic constant), but I'm really not interested without a good
reason.

What, exactly, are you doing that's too slow, and what makes you think that
_this_ code is this bottleneck? This is not a "we can do better" when the
end result is bigger and harder to understand.
Post by Denis Vlasenko
has bound checking
and truncates most significant digits if buffer is too small, not least
I added bounds checking back on the 10th (svn 15683).

If you truncate it doesn't get the correct result no matter what you do, but
the current behavior is the _expected_ behavior, it's what you get with
snprintf() and safe_strcpy(), and everything else I'm aware of that does
bounds checking. If you want just the lower digits, you can always do a
modulus before calling itoa().
Post by Denis Vlasenko
Please apply the attached patch.
No.

Rob
--
Never bet against the cheap plastic solution.
Denis Vlasenko
2006-07-12 07:33:43 UTC
Permalink
Post by Rob Landley
Post by Denis Vlasenko
The version in current svn produces _most significant_ digits
if buffer is too small. IOW: utoa_to_buf_rob(100223,buf,3)=="10".
The usual practuce is to produce "23".
If the buffer is too small it essentially produces garbage, we just don't want
it to overflow and stomp memory it doesn't own.
Real-world example why we should do it:

printf("%02d:%02d\n", seconds/60, seconds);

Here user of sprintf deliberately uses the fact
that printf shows least significant digits,
not most significant ones.
--
vda
Rich Felker
2006-07-12 09:23:22 UTC
Permalink
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
The version in current svn produces _most significant_ digits
if buffer is too small. IOW: utoa_to_buf_rob(100223,buf,3)=="10".
The usual practuce is to produce "23".
If the buffer is too small it essentially produces garbage, we just don't want
it to overflow and stomp memory it doesn't own.
printf("%02d:%02d\n", seconds/60, seconds);
Here user of sprintf deliberately uses the fact
that printf shows least significant digits,
not most significant ones.
Apparently you don't know C -- field width never causes truncation.
The only format specifier that can cause truncation is precision (not
width) when used with strings.

Rich
Rob Landley
2006-07-13 09:58:44 UTC
Permalink
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
The version in current svn produces _most significant_ digits
if buffer is too small. IOW: utoa_to_buf_rob(100223,buf,3)=="10".
The usual practuce is to produce "23".
If the buffer is too small it essentially produces garbage, we just don't
want it to overflow and stomp memory it doesn't own.
printf("%02d:%02d\n", seconds/60, seconds);
Here user of sprintf deliberately uses the fact
that printf shows least significant digits,
not most significant ones.
***@driftwood:~/busybox/busybox$ cat hello.c
#include <stdio.h>

int main(int argc, char *argv[])
{
printf("%02d:%02d\n", 123,456);
return 0;
}
***@driftwood:~/busybox/busybox$ gcc hello.c
***@driftwood:~/busybox/busybox$ ./a.out
123:456

Rob
--
Never bet against the cheap plastic solution.
Denis Vlasenko
2006-07-13 10:01:33 UTC
Permalink
Post by Rob Landley
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
The version in current svn produces _most significant_ digits
if buffer is too small. IOW: utoa_to_buf_rob(100223,buf,3)=="10".
The usual practuce is to produce "23".
If the buffer is too small it essentially produces garbage, we just don't
want it to overflow and stomp memory it doesn't own.
printf("%02d:%02d\n", seconds/60, seconds);
Here user of sprintf deliberately uses the fact
that printf shows least significant digits,
not most significant ones.
#include <stdio.h>
int main(int argc, char *argv[])
{
printf("%02d:%02d\n", 123,456);
return 0;
}
123:456
Yes, I was stupid and wrong on this one, as well as on Alpha sizeof(int).

:)

--
vda
Denis Vlasenko
2006-07-13 17:01:17 UTC
Permalink
Post by Rob Landley
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
The version in current svn produces _most significant_ digits
if buffer is too small. IOW: utoa_to_buf_rob(100223,buf,3)=="10".
The usual practuce is to produce "23".
If the buffer is too small it essentially produces garbage, we just don't
want it to overflow and stomp memory it doesn't own.
printf("%02d:%02d\n", seconds/60, seconds);
Here user of sprintf deliberately uses the fact
that printf shows least significant digits,
not most significant ones.
landley at driftwood:~/busybox/busybox$ cat hello.c
#include <stdio.h>
int main(int argc, char *argv[])
{
printf("%02d:%02d\n", 123,456);
return 0;
}
landley at driftwood:~/busybox/busybox$ gcc hello.c
landley at driftwood:~/busybox/busybox$ ./a.out
123:456
Yes, I was stupid and wrong on this one, as well as on Alpha sizeof(int).

:)

--
vda
Rich Felker
2006-07-12 16:22:33 UTC
Permalink
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
The version in current svn produces _most significant_ digits
if buffer is too small. IOW: utoa_to_buf_rob(100223,buf,3)=="10".
The usual practuce is to produce "23".
If the buffer is too small it essentially produces garbage, we just don't want
it to overflow and stomp memory it doesn't own.
printf("%02d:%02d\n", seconds/60, seconds);
Here user of sprintf deliberately uses the fact
that printf shows least significant digits,
not most significant ones.
Apparently you don't know C -- field width never causes truncation.
The only format specifier that can cause truncation is precision (not
width) when used with strings.

Rich
Rob Landley
2006-07-13 13:43:27 UTC
Permalink
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
The version in current svn produces _most significant_ digits
if buffer is too small. IOW: utoa_to_buf_rob(100223,buf,3)=="10".
The usual practuce is to produce "23".
If the buffer is too small it essentially produces garbage, we just don't
want it to overflow and stomp memory it doesn't own.
printf("%02d:%02d\n", seconds/60, seconds);
Here user of sprintf deliberately uses the fact
that printf shows least significant digits,
not most significant ones.
landley at driftwood:~/busybox/busybox$ cat hello.c
#include <stdio.h>

int main(int argc, char *argv[])
{
printf("%02d:%02d\n", 123,456);
return 0;
}
landley at driftwood:~/busybox/busybox$ gcc hello.c
landley at driftwood:~/busybox/busybox$ ./a.out
123:456

Rob
--
Never bet against the cheap plastic solution.
Denis Vlasenko
2006-07-12 09:08:13 UTC
Permalink
Post by Rob Landley
Post by Denis Vlasenko
I still dislike the speed, i/=10 eats too much CPU...
It's a tight L1 cache-hot loop. In a pipeline like hexdump | grep it's
unlikely to be even 1% of the total CPU overhead.
I'm sure that there are tons of things to improve in hexdump and/or
tcpdump. Last I looked, tcpdump was using printf _far too much_,
not even coalescing adjacent calls into single one with bigger
format string... oh well...
Post by Rob Landley
What platform have you
profiled this on?
i386
Post by Rob Landley
I don't believe integer division has boiled down to a
subtraction loop on any interesting processor for a quarter-century (I
remember that multiplication could be done in something like 6 clock cycles
without being _that_ clever, although this is a vague recollection from my
undergraduate days circa 1993), but then I'm not entirely certain how some of
the really low-transistor-budget embedded processors (like ARM) are
implemented. I doubt they're _stupid_, though. Many of them have floating
point units, there's no reason for them to stint on integer arithmetic
operations...
Could I have some benchmark numbers on the system you're worried about,
#include <stdio.h>
// 0x4f bytes
void utoa_to_buf(unsigned n, char *buf, int buflen)
{
int i, out = 0;
for (i=1000000000; i; i/=10) {
int res = n/i;
if (res || out || i == 1) {
out++;
n -= res*i;
*buf++ = '0' + res;
}
}
*buf = 0;
}
int main(int argc, char *argv[])
{
int i;
char blah[16];
for(i=0;i<1000000;i++) {
utoa_to_buf(i, blah, 16);
}
}
Ran that under "time", and got 0.707 seconds. 1000000/.707 is 1.4 million
Got 0.944 seconds here.

Let's use the below one which is 1.5 faster on i386, has bound checking
and truncates most significant digits if buffer is too small, not least
significant ones:

// 0x52 bytes
// time: user 0m0.572s
void utoa_to_buf_test(unsigned n, char *buf, int buflen)
{
unsigned i, out = 0;
if (buflen > 0) {
buflen -= 11;
for (i=1000000000; i; i = (i*429496730ULL)>>32 /* divide by 10, fast */) {
int res = n/i;
n %= i; // gcc reuses remainder produced in division
if (res || out || i == 1) {
out++;
if (buflen >= 0)
*buf++ = '0' + res;
}
buflen++;
}
*buf = 0;
}
}

Correctness check for it:

sprintf='12345678' count=27599
utoa_to_buf='12345678' count=62667
utoa='12345678' count=59227
utoa_to_buf_test='12345678' count=41303
utoa_to_buf_test(0)='0'
utoa_to_buf_test(123)='123'
utoa_to_buf_test(100223)='100223'
utoa_to_buf_test(100223,3)='23'
utoa_to_buf_test(100223,2)='3'
utoa_to_buf_test(100223,1)=''

Please apply the attached patch.
--
vda
-------------- next part --------------
A non-text attachment was scrubbed...
Name: itoa.diff
Type: text/x-diff
Size: 930 bytes
Desc: not available
Url : http://busybox.net/lists/busybox/attachments/20060712/778d04cf/itoa.bin
Rich Felker
2006-07-11 20:51:02 UTC
Permalink
Post by Rob Landley
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
The attached test program shows that snprintf is 2.5 times slower
sprintf='12345678' count=27346
utoa_to_buf='12345678' count=69880
utoa='12345678' count=65934
A few days ago I ran "hexdump | grep" over entire hard drive.
Took a long time. I wish hexdump was optimized a bit better...
I haven't really looked closely at hexdump, but what does a base 10 utoa
have to do with a base 16 hexdump?
This is a real-life example of a program which does a lot of
int -> string conversions, so that it actually mattered for me recently.
hexdump is of course using base 16 conversion. tcpdump could be
an example of a proc which does base 10 conversions, but I didn't
use it lately.
Understood, but if I wanted to speed up hexdump | grep I'd profile both
hexdump and grep, and probably find out that our grep is dog slow (there's
the pending patch to speed up get_chomped_line_from_file() in my todo heap,
which is waiting for me to get back around to the sed NUL bug and make sure
that they play nice together).
Currently get_chomped_line_from_file does not support embedded NUL
bytes. My patch could add support at the cost of a little performance
(but still much better performance than the current version).
Post by Rob Landley
Post by Denis Vlasenko
How about this modified version?
...
Post by Denis Vlasenko
It's even a bit smaller. gcc is smart enough to know div instruction
produces the remainder also, and we get n %= i for free.
Very nice. My only concern is on which platforms does division automatically
produce a remainder? I expect division to be supported everywhere, and to
perform fairly well. I have a vague recollection that in some cases
environments modulus is overlooked and (as a result) kind of evil, but I
don't remember specifics.
All I know is that division yields both on x86. Sorry.
Post by Rob Landley
Post by Denis Vlasenko
I still dislike the speed, i/=10 eats too much CPU...
It's a tight L1 cache-hot loop. In a pipeline like hexdump | grep it's
unlikely to be even 1% of the total CPU overhead. What platform have you
profiled this on? I don't believe integer division has boiled down to a
subtraction loop on any interesting processor for a quarter-century (I
remember that multiplication could be done in something like 6 clock cycles
without being _that_ clever, although this is a vague recollection from my
undergraduate days circa 1993), but then I'm not entirely certain how some of
the really low-transistor-budget embedded processors (like ARM) are
Division is something like 40 cycles on old x86 (386-586). IIRC on
modern x86 it's around 10-15 cycles.
Post by Rob Landley
implemented. I doubt they're _stupid_, though. Many of them have floating
point units, there's no reason for them to stint on integer arithmetic
operations...
Sadly some modern cpus perform better at float than integer, simply
because of stupidity and nonsensical goals. Obviously integer ops
should perform significantly better if both are given the same
attention by the engineers, but I guess winning floating point
benchmarks buys you more than improving practical performance.
Post by Rob Landley
Ran that under "time", and got 0.707 seconds. 1000000/.707 is 1.4 million
according to my handy dandy python interpreter (the best calculator Linux
ever shipped with). Adding bounds checking didn't slow it down _that_ much.
Yes, time(1) is a much more accurate benchmarking system, as it won't
include the time spent by other processes interrupting your benchmark.
The time(3) based benchmark is a rather bad idea..

Rich
Denis Vlasenko
2006-07-12 14:08:15 UTC
Permalink
Post by Rob Landley
Post by Denis Vlasenko
I still dislike the speed, i/=10 eats too much CPU...
It's a tight L1 cache-hot loop. In a pipeline like hexdump | grep it's
unlikely to be even 1% of the total CPU overhead.
I'm sure that there are tons of things to improve in hexdump and/or
tcpdump. Last I looked, tcpdump was using printf _far too much_,
not even coalescing adjacent calls into single one with bigger
format string... oh well...
Post by Rob Landley
What platform have you
profiled this on?
i386
Post by Rob Landley
I don't believe integer division has boiled down to a
subtraction loop on any interesting processor for a quarter-century (I
remember that multiplication could be done in something like 6 clock cycles
without being _that_ clever, although this is a vague recollection from my
undergraduate days circa 1993), but then I'm not entirely certain how some of
the really low-transistor-budget embedded processors (like ARM) are
implemented. I doubt they're _stupid_, though. Many of them have floating
point units, there's no reason for them to stint on integer arithmetic
operations...
Could I have some benchmark numbers on the system you're worried about,
#include <stdio.h>
// 0x4f bytes
void utoa_to_buf(unsigned n, char *buf, int buflen)
{
int i, out = 0;
for (i=1000000000; i; i/=10) {
int res = n/i;
if (res || out || i == 1) {
out++;
n -= res*i;
*buf++ = '0' + res;
}
}
*buf = 0;
}
int main(int argc, char *argv[])
{
int i;
char blah[16];
for(i=0;i<1000000;i++) {
utoa_to_buf(i, blah, 16);
}
}
Ran that under "time", and got 0.707 seconds. 1000000/.707 is 1.4 million
Got 0.944 seconds here.

Let's use the below one which is 1.5 faster on i386, has bound checking
and truncates most significant digits if buffer is too small, not least
significant ones:

// 0x52 bytes
// time: user 0m0.572s
void utoa_to_buf_test(unsigned n, char *buf, int buflen)
{
unsigned i, out = 0;
if (buflen > 0) {
buflen -= 11;
for (i=1000000000; i; i = (i*429496730ULL)>>32 /* divide by 10, fast */) {
int res = n/i;
n %= i; // gcc reuses remainder produced in division
if (res || out || i == 1) {
out++;
if (buflen >= 0)
*buf++ = '0' + res;
}
buflen++;
}
*buf = 0;
}
}

Correctness check for it:

sprintf='12345678' count=27599
utoa_to_buf='12345678' count=62667
utoa='12345678' count=59227
utoa_to_buf_test='12345678' count=41303
utoa_to_buf_test(0)='0'
utoa_to_buf_test(123)='123'
utoa_to_buf_test(100223)='100223'
utoa_to_buf_test(100223,3)='23'
utoa_to_buf_test(100223,2)='3'
utoa_to_buf_test(100223,1)=''

Please apply the attached patch.
--
vda
-------------- next part --------------
A non-text attachment was scrubbed...
Name: itoa.diff
Type: text/x-diff
Size: 930 bytes
Desc: not available
Url : http://lists.busybox.net/pipermail/busybox/attachments/20060712/15c57894/attachment.bin
Denis Vlasenko
2006-07-12 14:32:12 UTC
Permalink
Post by Rob Landley
Post by Denis Vlasenko
The version in current svn produces _most significant_ digits
if buffer is too small. IOW: utoa_to_buf_rob(100223,buf,3)=="10".
The usual practuce is to produce "23".
If the buffer is too small it essentially produces garbage, we just don't want
it to overflow and stomp memory it doesn't own.
Real-world example why we should do it:

printf("%02d:%02d\n", seconds/60, seconds);

Here user of sprintf deliberately uses the fact
that printf shows least significant digits,
not most significant ones.
--
vda
Denis Vlasenko
2006-07-12 13:59:26 UTC
Permalink
Post by Rob Landley
Post by Denis Vlasenko
I still dislike the speed, i/=10 eats too much CPU...
It's a tight L1 cache-hot loop. In a pipeline like hexdump | grep it's
unlikely to be even 1% of the total CPU overhead.
I'm sure that there are tons of things to improve in hexdump and/or
tcpdump. Last I looked, tcpdump was using printf _far too much_,
not even coalescing adjacent calls into single one with bigger
format string... oh well...
Post by Rob Landley
What platform have you
profiled this on?
i386
Post by Rob Landley
I don't believe integer division has boiled down to a
subtraction loop on any interesting processor for a quarter-century (I
remember that multiplication could be done in something like 6 clock cycles
without being _that_ clever, although this is a vague recollection from my
undergraduate days circa 1993), but then I'm not entirely certain how some of
the really low-transistor-budget embedded processors (like ARM) are
implemented. I doubt they're _stupid_, though. Many of them have floating
point units, there's no reason for them to stint on integer arithmetic
operations...
Could I have some benchmark numbers on the system you're worried about,
#include <stdio.h>
// 0x4f bytes
void utoa_to_buf(unsigned n, char *buf, int buflen)
{
int i, out = 0;
for (i=1000000000; i; i/=10) {
int res = n/i;
if (res || out || i == 1) {
out++;
n -= res*i;
*buf++ = '0' + res;
}
}
*buf = 0;
}
int main(int argc, char *argv[])
{
int i;
char blah[16];
for(i=0;i<1000000;i++) {
utoa_to_buf(i, blah, 16);
}
}
Ran that under "time", and got 0.707 seconds. 1000000/.707 is 1.4 million
Got 0.944 seconds here.

Let's use the below one which is 1.5 faster on i386, has bound checking
and truncates most significant digits if buffer is too small, not least
significant ones:

// 0x52 bytes
// time: user 0m0.572s
void utoa_to_buf_test(unsigned n, char *buf, int buflen)
{
unsigned i, out = 0;
if (buflen > 0) {
buflen -= 11;
for (i=1000000000; i; i = (i*429496730ULL)>>32 /* divide by 10, fast */) {
int res = n/i;
n %= i; // gcc reuses remainder produced in division
if (res || out || i == 1) {
out++;
if (buflen >= 0)
*buf++ = '0' + res;
}
buflen++;
}
*buf = 0;
}
}

Correctness check for it:

sprintf='12345678' count=27599
utoa_to_buf='12345678' count=62667
utoa='12345678' count=59227
utoa_to_buf_test='12345678' count=41303
utoa_to_buf_test(0)='0'
utoa_to_buf_test(123)='123'
utoa_to_buf_test(100223)='100223'
utoa_to_buf_test(100223,3)='23'
utoa_to_buf_test(100223,2)='3'
utoa_to_buf_test(100223,1)=''

Please apply the attached patch.
--
vda
-------------- next part --------------
A non-text attachment was scrubbed...
Name: itoa.diff
Type: text/x-diff
Size: 930 bytes
Desc: not available
Url : http://lists.busybox.net/pipermail/busybox/attachments/20060712/778d04cf/attachment.bin
Denis Vlasenko
2006-07-11 16:14:09 UTC
Permalink
Post by Denis Vlasenko
It's even a bit smaller. gcc is smart enough to know div instruction
produces the remainder also, and we get n %= i for free.
I still dislike the speed, i/=10 eats too much CPU...
This patch converts division into mul + shift.
It's smaller and faster.

i = (i*429496730ULL)>>32 /* divide by 10, fast */

Speed is ok, correctness check ok:

sprintf='12345678' count=27599
utoa_to_buf='12345678' count=62667
utoa='12345678' count=59227
utoa_to_buf_test='12345678' count=41303
utoa_to_buf_test(0)='0'
utoa_to_buf_test(123)='123'
utoa_to_buf_test(100223)='100223'
utoa_to_buf_test(100223,3)='23'
utoa_to_buf_test(100223,2)='3'
utoa_to_buf_test(100223,1)=''

Please apply.
--
vda
-------------- next part --------------
A non-text attachment was scrubbed...
Name: itoa.diff
Type: text/x-diff
Size: 930 bytes
Desc: not available
Url : http://lists.busybox.net/pipermail/busybox/attachments/20060711/863e5e7d/attachment.bin
Rob Landley
2006-07-11 19:42:38 UTC
Permalink
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
The attached test program shows that snprintf is 2.5 times slower
sprintf='12345678' count=27346
utoa_to_buf='12345678' count=69880
utoa='12345678' count=65934
A few days ago I ran "hexdump | grep" over entire hard drive.
Took a long time. I wish hexdump was optimized a bit better...
I haven't really looked closely at hexdump, but what does a base 10 utoa
have to do with a base 16 hexdump?
This is a real-life example of a program which does a lot of
int -> string conversions, so that it actually mattered for me recently.
hexdump is of course using base 16 conversion. tcpdump could be
an example of a proc which does base 10 conversions, but I didn't
use it lately.
Understood, but if I wanted to speed up hexdump | grep I'd profile both
hexdump and grep, and probably find out that our grep is dog slow (there's
the pending patch to speed up get_chomped_line_from_file() in my todo heap,
which is waiting for me to get back around to the sed NUL bug and make sure
that they play nice together).
Post by Denis Vlasenko
Post by Rob Landley
The lookup table is a classic size/speed tradeoff, something BusyBox has
traditionally come down on the "size" side of. Doing a write() on the
result is going to be way slower than the actual conversion anyway. What
actual bottleneck are you seeing?
I definitely agree the bounds checking needs to be added, though. I'll
do that now...
The version in current svn produces _most significant_ digits
if buffer is too small. IOW: utoa_to_buf_rob(100223,buf,3)=="10".
The usual practuce is to produce "23".
If the buffer is too small it essentially produces garbage, we just don't want
it to overflow and stomp memory it doesn't own.
Post by Denis Vlasenko
How about this modified version?
...
Post by Denis Vlasenko
It's even a bit smaller. gcc is smart enough to know div instruction
produces the remainder also, and we get n %= i for free.
Very nice. My only concern is on which platforms does division automatically
produce a remainder? I expect division to be supported everywhere, and to
perform fairly well. I have a vague recollection that in some cases
environments modulus is overlooked and (as a result) kind of evil, but I
don't remember specifics.
Post by Denis Vlasenko
I still dislike the speed, i/=10 eats too much CPU...
It's a tight L1 cache-hot loop. In a pipeline like hexdump | grep it's
unlikely to be even 1% of the total CPU overhead. What platform have you
profiled this on? I don't believe integer division has boiled down to a
subtraction loop on any interesting processor for a quarter-century (I
remember that multiplication could be done in something like 6 clock cycles
without being _that_ clever, although this is a vague recollection from my
undergraduate days circa 1993), but then I'm not entirely certain how some of
the really low-transistor-budget embedded processors (like ARM) are
implemented. I doubt they're _stupid_, though. Many of them have floating
point units, there's no reason for them to stint on integer arithmetic
operations...

Could I have some benchmark numbers on the system you're worried about,
please? I already mentioned I tried the following here:

#include <stdio.h>

// 0x4f bytes
void utoa_to_buf(unsigned n, char *buf, int buflen)
{
int i, out = 0;
for (i=1000000000; i; i/=10) {
int res = n/i;

if (res || out || i == 1) {
out++;
n -= res*i;
*buf++ = '0' + res;
}
}
*buf = 0;
}

int main(int argc, char *argv[])
{
int i;
char blah[16];

for(i=0;i<1000000;i++) {
utoa_to_buf(i, blah, 16);
}
}

Ran that under "time", and got 0.707 seconds. 1000000/.707 is 1.4 million
according to my handy dandy python interpreter (the best calculator Linux
ever shipped with). Adding bounds checking didn't slow it down _that_ much.

What do you get on the system you're worried about the performance of?

Rob
--
Never bet against the cheap plastic solution.
Denis Vlasenko
2006-07-11 14:44:48 UTC
Permalink
Post by Rob Landley
Post by Denis Vlasenko
The attached test program shows that snprintf is 2.5 times slower
sprintf='12345678' count=27346
utoa_to_buf='12345678' count=69880
utoa='12345678' count=65934
A few days ago I ran "hexdump | grep" over entire hard drive.
Took a long time. I wish hexdump was optimized a bit better...
I haven't really looked closely at hexdump, but what does a base 10 utoa have
to do with a base 16 hexdump?
This is a real-life example of a program which does a lot of
int -> string conversions, so that it actually mattered for me recently.

hexdump is of course using base 16 conversion. tcpdump could be
an example of a proc which does base 10 conversions, but I didn't
use it lately.
Post by Rob Landley
The lookup table is a classic size/speed tradeoff, something BusyBox has
traditionally come down on the "size" side of. Doing a write() on the result
is going to be way slower than the actual conversion anyway. What actual
bottleneck are you seeing?
I definitely agree the bounds checking needs to be added, though. I'll do
that now...
The version in current svn produces _most significant_ digits
if buffer is too small. IOW: utoa_to_buf_rob(100223,buf,3)=="10".
The usual practuce is to produce "23".

How about this modified version?

// 0x63 bytes
void utoa_to_buf_rob(unsigned n, char *buf, int buflen)
{
unsigned i, out;
if (buflen) {
buflen -= 11;
out = 0;
for (i=1000000000; i; i/=10) {
unsigned res = n / i;
n %= i; // gcc reuses remainder produced in division
if (res || out || i == 1) {
out++;
if (buflen >= 0)
*buf++ = '0' + res;
}
buflen++;
}
*buf = 0;
}
}

It's even a bit smaller. gcc is smart enough to know div instruction
produces the remainder also, and we get n %= i for free.

I still dislike the speed, i/=10 eats too much CPU...
--
vda
Michael S. Zick
2006-07-10 14:13:56 UTC
Permalink
Post by Denis Vlasenko
I think I'm starting to waste your (and my) time.
Alpha is almost history now...
Almost for some,

(HP) Vax and MicroVax - no new machines, remarket only
(HP) Alpha Servers - new orders only until Oct. 2006
(HP) PA-RISC, 64-bit - alive and well - no 64-bit userland
under Linux - 64-bit Linux kernel alive and well.

Two on-going efforts to keep in mind...

Giving parisc-Linux a 64-bit userland is on-going, but you
will not see these in an embedded system.

If an adaption of BB gets shoved into the kernel as some
propose - then handling 64-bit & 128-bit types may come up again.

- - - -

But still, sizeof(unsigned int) is set by choosing the compiler
to use for the code. There are only two widely used compilers
for pa-risc - HP and GCC - those are 32-bit unsigned int for
both compilers, both 32-bit and 64-bit machines.
(The hardware handles the masking and sign extension.)

Now if you wanted to know about the sizeof(long double)...
Thank goodness, you don't.

Mike
Rich Felker
2006-07-10 14:54:07 UTC
Permalink
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
Examples of things which should care about it
od, seq, uniq -c, tcpdump, hexdump...
How many of these users actually need a malloc, and how many is it essentially
a better sprintf?
The attached test program shows that snprintf is 2.5 times slower
Which snprintf? glibc? uClibc?
Anyway I ran the rest (after sanitizing it; your benchmarking method
is not valid due to multitasking) and obtained much more extreme
results.

Still your code is huge massive bloat to do something very simple...
It's like these 10k memcpy implementations...
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
Post by Rob Landley
http://www.unix.org/whitepapers/64bit.html
What about Alpha? I heard that it has (had?) sizeof(int)==8.
A) Not on Linux. Linux standardized on LP64.
I am not sure. It looks like you are not sure too.
Post by Rob Landley
B) If it's that broken, I don't care.
What's broken about it? If the CPU is _designed_ to handle
mostly 64-bit operands, then it's only natural to make
"int" type 64-bit.
And this cpu won't be able to handle lots of real-world tasks in C
code. Lovely. The conclusion: the cpu sucks.
Post by Denis Vlasenko
For AMD64, instructions are 32-bit (with auto sign extend
to 64 bit) unless you add a "I want 64bit op" prefix.
IOW: AMD64 CPU most of the time works with 32bit operands.
That's why for AMD64, natural size of "int" is 32 bits.
Cool, AMD got best things from both worlds.
Actually I think the prefix is needed for 32bit when running 64bit
code, but it makes little difference to performance.
Post by Denis Vlasenko
IIRC Alpha took a simpler "we are a fully 64 animal"
route.
No, it took the "we are an idiotic risc arch" approach of not having
opcodes to manipulate all integer sizes up to and including the system
word size.
Post by Denis Vlasenko
Sane C ABI on such CPU should use 64-bit ints,
gcc is not sane then. :)
Post by Denis Vlasenko
or else you will drown in zillions of "n &= 0x00000000fffffffff"
type instructions which will enforce 32-bitness
on 64-bit CPU.
I suspect it's easier than this. Still ugly, and still alpha's fault.
Post by Denis Vlasenko
Please put utoa_to_buf_rob2_with_check code into xfuncs
(unless you find more way to make it better).
Calling sprintf makes it better by adding less size to BB...
At the very least this should be a build option.

Rich
Rob Landley
2006-07-11 00:47:14 UTC
Permalink
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
Examples of things which should care about it
od, seq, uniq -c, tcpdump, hexdump...
How many of these users actually need a malloc, and how many is it
essentially a better sprintf?
The attached test program shows that snprintf is 2.5 times slower
sprintf='12345678' count=27346
utoa_to_buf='12345678' count=69880
utoa='12345678' count=65934
A few days ago I ran "hexdump | grep" over entire hard drive.
Took a long time. I wish hexdump was optimized a bit better...
I haven't really looked closely at hexdump, but what does a base 10 utoa have
to do with a base 16 hexdump?
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
gcc optimizes it out to simple 1000000000 constant
(if sizeof(unsigned)==4, that is).
When is sizeof(unsigned) not going to be 4 on any Linux platform? We
use the lp64 model (as do all modern Unix variants, including MacOS
X), where only long and pointer are 64 bit.
http://www.unix.org/whitepapers/64bit.html
What about Alpha? I heard that it has (had?) sizeof(int)==8.
A) Not on Linux. Linux standardized on LP64.
I am not sure. It looks like you are not sure too.
Apparently I didn't make myself clear.

Linux standardized on LP64, following the official Open Group recommendations
for all Unix platforms. I am sure of this, and there were good reasons for
it. The research behind the decision is stated here:
http://www.unix.org/version2/whatsnew/lp64_wp.html

I believe all 64-bit Linux platforms are x86-64. x86-64 is LP64. PPC 64 is
LP64. Even Itanic is LP64. Because there's a _spec_ that says LP64 and the
Linux developers have announced support for that spec. It seems highly
unlikely for the Alpha port to not be LP64, but it's hard looking up
information on the alpha because the platform's so obsolete. (The Alpha
Linux FAQ was last updated six years ago.)

However, google for "linux alpha sizeof int" and the first hit is the Wine
developers talking about it (three years ago), where they point out that
sizeof(int) != sizeof(void *), because Alpha Linux is LP64.

Here's explicitly running a program on Alpha to print out sizeof(int) and
sizeof(long):
http://www.alphalinux.org/archives/axp-list/January2001/0072.shtml
Post by Denis Vlasenko
Post by Rob Landley
B) If it's that broken, I don't care.
What's broken about it? If the CPU is _designed_ to handle
mostly 64-bit operands, then it's only natural to make
"int" type 64-bit.
No, it's natural to make "long" 64 bit, "int" 32 bit, "short" 16 bit,
and "char" 8 bit.

If Alpha was broken that way, I would just document that BusyBox does not
support it. I see no reason to add support for something that's _both_
stupid and obsolete.
Post by Denis Vlasenko
IIRC Alpha took a simpler "we are a fully 64 animal"
route.
I.E. we don't handle 32, 16, or 8 bit data types well.
Post by Denis Vlasenko
Sane C ABI on such CPU should use 64-bit ints,
or else you will drown in zillions of "n &= 0x00000000fffffffff"
type instructions which will enforce 32-bitness
on 64-bit CPU.
This is sad, but the deficiencies of Alpha's instruction set really are not
BusyBox's problem. Should we have a 64 bit char type for similar reasons?
Post by Denis Vlasenko
I think I'm starting to waste your (and my) time.
Alpha is almost history now...
Bernhard's still adding Tru64 support. It's his pet platform...
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
So, do you want me to
a) remove #ifdef NOT_NEEDED_YET part
or
b) enable it and use it in ash?
The first, please.
A general rule of thumb: we can always add code to the tree later.
(And if we remove it, we can always add it _back_ later.)
See attached patch.
Well, it's definitely an improvement. However, it would be nice if you
could call itoa on a buffer and have it actually fill in that buffer from
the start, so you could use it like you could sprintf or strcpy, and not
potentially have to track two pointers to your own buffer.
That will need either a clever method to determine how much digits are
there _before_ we do the conversion, or copying the result.
Nope. See the checkin I did for libbb/xfuncs.c. Yes, it potentially does a
few extra divides, but it still does 1.4 million utoa_to_buf() per second on
my laptop.
Post by Denis Vlasenko
I hope that in most cases you can just
do_something(utoa_to_buf(n, buf, sizeof(buf));
(Actually, im most cases people will just use do_something(utoa(n)); ...)
Yup. It's not returning the buf from utoa_to_buf(). It can if there's a
pressing need, but utoa() and itoa() both return it and that's probably the
common case for inline use.
Post by Denis Vlasenko
Post by Rob Landley
I did a quick stab at that and checked it into xfuncs.c, and now I'm
about to pass out. Could you tell me if that's useful, or if not what
did I screw up? I have to go fall over and pass out now, it's 4 am and I
have work in the morning...
Aha! So you DO have a clever method! :) Thanks!
Hmmm.... (/me takes a look) I think I can improve on that.
Glances at attachment... You spotted my lack of bounds checking, which is
definitely a big oversight on my part. :)
Post by Denis Vlasenko
Please see (and run) attached program. utoa_to_buf_rob is your version
copied from xfuncs.c. Two problems: it's almost as slow as sprintf
and it doesn't check for buflen. utoa_to_buf_rob2 is a speed improved
version and utoa_to_buf_rob2_with_check adds compile-time check
that unsigned is really 32-bit and also correctly handles buflen.
On even the 64-bit platforms BusyBox targets, unsigned is 32 bit. I'm pretty
sure of this one.

The lookup table is a classic size/speed tradeoff, something BusyBox has
traditionally come down on the "size" side of. Doing a write() on the result
is going to be way slower than the actual conversion anyway. What actual
bottleneck are you seeing?

I definitely agree the bounds checking needs to be added, though. I'll do
that now...

Rob
--
Never bet against the cheap plastic solution.
Denis Vlasenko
2006-07-10 09:46:46 UTC
Permalink
Post by Rob Landley
Post by Denis Vlasenko
Examples of things which should care about it
od, seq, uniq -c, tcpdump, hexdump...
How many of these users actually need a malloc, and how many is it essentially
a better sprintf?
The attached test program shows that snprintf is 2.5 times slower
than utoa. Operations done per second:
sprintf='12345678' count=27346
utoa_to_buf='12345678' count=69880
utoa='12345678' count=65934

A few days ago I ran "hexdump | grep" over entire hard drive.
Took a long time. I wish hexdump was optimized a bit better...
Post by Rob Landley
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
gcc optimizes it out to simple 1000000000 constant
(if sizeof(unsigned)==4, that is).
When is sizeof(unsigned) not going to be 4 on any Linux platform? We use
the lp64 model (as do all modern Unix variants, including MacOS X), where
only long and pointer are 64 bit.
http://www.unix.org/whitepapers/64bit.html
What about Alpha? I heard that it has (had?) sizeof(int)==8.
A) Not on Linux. Linux standardized on LP64.
I am not sure. It looks like you are not sure too.
Post by Rob Landley
B) If it's that broken, I don't care.
What's broken about it? If the CPU is _designed_ to handle
mostly 64-bit operands, then it's only natural to make
"int" type 64-bit.

For AMD64, instructions are 32-bit (with auto sign extend
to 64 bit) unless you add a "I want 64bit op" prefix.
IOW: AMD64 CPU most of the time works with 32bit operands.
That's why for AMD64, natural size of "int" is 32 bits.
Cool, AMD got best things from both worlds.

IIRC Alpha took a simpler "we are a fully 64 animal"
route. Sane C ABI on such CPU should use 64-bit ints,
or else you will drown in zillions of "n &= 0x00000000fffffffff"
type instructions which will enforce 32-bitness
on 64-bit CPU.

I think I'm starting to waste your (and my) time.
Alpha is almost history now...
Post by Rob Landley
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
So, do you want me to
a) remove #ifdef NOT_NEEDED_YET part
or
b) enable it and use it in ash?
The first, please.
A general rule of thumb: we can always add code to the tree later. (And
if we remove it, we can always add it _back_ later.)
See attached patch.
Well, it's definitely an improvement. However, it would be nice if you could
call itoa on a buffer and have it actually fill in that buffer from the
start, so you could use it like you could sprintf or strcpy, and not
potentially have to track two pointers to your own buffer.
That will need either a clever method to determine how much digits are there
_before_ we do the conversion, or copying the result.
I hope that in most cases you can just

do_something(utoa_to_buf(n, buf, sizeof(buf));

(Actually, im most cases people will just use do_something(utoa(n)); ...)
Post by Rob Landley
I did a quick stab at that and checked it into xfuncs.c, and now I'm about to
pass out. Could you tell me if that's useful, or if not what did I screw up?
I have to go fall over and pass out now, it's 4 am and I have work in the
morning...
Aha! So you DO have a clever method! :) Thanks!

Hmmm.... (/me takes a look) I think I can improve on that.

Please see (and run) attached program. utoa_to_buf_rob is your version
copied from xfuncs.c. Two problems: it's almost as slow as sprintf
and it doesn't check for buflen. utoa_to_buf_rob2 is a speed improved
version and utoa_to_buf_rob2_with_check adds compile-time check
that unsigned is really 32-bit and also correctly handles buflen.

There is a correctness tests for utoa_to_buf_rob2_with_check.
Seems to work right.

Please put utoa_to_buf_rob2_with_check code into xfuncs
(unless you find more way to make it better).
--
vda
-------------- next part --------------
A non-text attachment was scrubbed...
Name: itoa_bench.c
Type: text/x-csrc
Size: 3950 bytes
Desc: not available
Url : http://lists.busybox.net/pipermail/busybox/attachments/20060710/9aebbade/attachment.c
Rich Felker
2006-07-10 00:13:05 UTC
Permalink
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
I think that some programs may print so much numerical output
that it may be important for them to do itoa more-or-less
efficiently.
Could you give me an example? (Your patch didn't actually convert many users,
that I saw.)
Examples of things which should care about it
od, seq, uniq -c, tcpdump, hexdump...
Of these od/hexdump are the only ones that sound very interesting to
optimize for performance, and the 'itoa' code in question is
decimal-only anyway making them irrelevant.

Anyway if snprintf is too slow then what needs to be fixed is
snprintf, not the applications that use it. The C library exists for a
reason. As soon as you decide you can't rely on it to do what it's
supposed to do and do it at an acceptable level of efficiency, you end
up with the bloat that is Mozilla (NSPR), GNOME (glib), or coreutils
(gnulib, i.e. half of glibc copied into the coreutils source tree...).
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
gcc optimizes it out to simple 1000000000 constant
(if sizeof(unsigned)==4, that is).
When is sizeof(unsigned) not going to be 4 on any Linux platform? We use the
lp64 model (as do all modern Unix variants, including MacOS X), where only
long and pointer are 64 bit.
http://www.unix.org/whitepapers/64bit.html
What about Alpha? I heard that it has (had?) sizeof(int)==8.
gcc does not support sizeof(int)!=4, so no, it does not have 8. You're
thinking of long. In fact any C compiler with sizeof(int)!=4 is
virtually unusable for multimedia apps because it means you lack at
least one of the sizes 8, 16, or 32, all of which are needed in order
to have any hope of efficient audio and video processing.

Rich
Rob Landley
2006-07-10 07:43:56 UTC
Permalink
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
I think that some programs may print so much numerical output
that it may be important for them to do itoa more-or-less
efficiently.
Could you give me an example? (Your patch didn't actually convert many
users, that I saw.)
Examples of things which should care about it
od, seq, uniq -c, tcpdump, hexdump...
How many of these users actually need a malloc, and how many is it essentially
a better sprintf?
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
gcc optimizes it out to simple 1000000000 constant
(if sizeof(unsigned)==4, that is).
When is sizeof(unsigned) not going to be 4 on any Linux platform? We use
the lp64 model (as do all modern Unix variants, including MacOS X), where
only long and pointer are 64 bit.
http://www.unix.org/whitepapers/64bit.html
What about Alpha? I heard that it has (had?) sizeof(int)==8.
A) Not on Linux. Linux standardized on LP64.

B) If it's that broken, I don't care.
Post by Denis Vlasenko
static inline unsigned max_unsigned_power10(void)
{
if (sizeof(unsigned) == 4) return 1000000000;
if (sizeof(unsigned) == 8) // gcc warns a lot here, shut it up
return (unsigned)(10ULL*1000000000*1000000000);
return BUG_unsupported_architecture();
}
but since you don't want to handle long long, it's irrelevant for now.
I'd rather wait until users of long long show up before adding support.
Also, "long long" isn't a specific type: on a 32 bit platform long is 32 bits
and long long is 64 bits. On a 64 bit platform, long is 64 bits and long
long is 128 bits. That's why they invented uint64_t and friends. :)
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
So, do you want me to
a) remove #ifdef NOT_NEEDED_YET part
or
b) enable it and use it in ash?
The first, please.
A general rule of thumb: we can always add code to the tree later. (And
if we remove it, we can always add it _back_ later.)
See attached patch.
Well, it's definitely an improvement. However, it would be nice if you could
call itoa on a buffer and have it actually fill in that buffer from the
start, so you could use it like you could sprintf or strcpy, and not
potentially have to track two pointers to your own buffer.

I did a quick stab at that and checked it into xfuncs.c, and now I'm about to
pass out. Could you tell me if that's useful, or if not what did I screw up?
I have to go fall over and pass out now, it's 4 am and I have work in the
morning...

Rob
--
Never bet against the cheap plastic solution.
Denis Vlasenko
2006-07-09 16:53:25 UTC
Permalink
Post by Rob Landley
Post by Denis Vlasenko
I think that some programs may print so much numerical output
that it may be important for them to do itoa more-or-less
efficiently.
Could you give me an example? (Your patch didn't actually convert many users,
that I saw.)
Examples of things which should care about it
(I do not claim that current code cares):
od, seq, uniq -c, tcpdump, hexdump...
Post by Rob Landley
Post by Denis Vlasenko
gcc optimizes it out to simple 1000000000 constant
(if sizeof(unsigned)==4, that is).
When is sizeof(unsigned) not going to be 4 on any Linux platform? We use the
lp64 model (as do all modern Unix variants, including MacOS X), where only
long and pointer are 64 bit.
http://www.unix.org/whitepapers/64bit.html
What about Alpha? I heard that it has (had?) sizeof(int)==8.

BTW, I could do this instead:

static inline unsigned max_unsigned_power10(void)
{
if (sizeof(unsigned) == 4) return 1000000000;
if (sizeof(unsigned) == 8) // gcc warns a lot here, shut it up
return (unsigned)(10ULL*1000000000*1000000000);
return BUG_unsupported_architecture();
}

but since you don't want to handle long long, it's irrelevant for now.
Post by Rob Landley
Post by Denis Vlasenko
So, do you want me to
a) remove #ifdef NOT_NEEDED_YET part
or
b) enable it and use it in ash?
The first, please.
A general rule of thumb: we can always add code to the tree later. (And if we
remove it, we can always add it _back_ later.)
See attached patch.
--
vda
-------------- next part --------------
A non-text attachment was scrubbed...
Name: busybox_itoa_v3.patch
Type: text/x-diff
Size: 5808 bytes
Desc: not available
Url : http://lists.busybox.net/pipermail/busybox/attachments/20060709/00d19121/attachment.bin
Michael S. Zick
2006-07-08 19:40:14 UTC
Permalink
Post by Denis Vlasenko
Post by Rob Landley
(What's max_unsigned_power10() doing,
anyway, and is that really a sane way to do whatever it is?)
It calculates maximum power-of-10 which fits in unsigned int.
It does that at compile time, portably. That's why it looks funny.
gcc optimizes it out to simple 1000000000 constant
(if sizeof(unsigned)==4, that is).
Unless you are supporting multiple brands of compilers, it
optimizes out to a simple constant before you write the code.

Mike
Rich Felker
2006-07-09 05:11:02 UTC
Permalink
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
Hi folks, Rob,
Attached patch removes all instances of local itoa()-like
functions, introduces common one into libbb and uses
it where appropriate.
Is "sprintf" really that bad? (Or xasprintf() out of our library when you
need it to malloc?) Especially if the darn compiler actually starts doing
intelligent string merging, which it's inching towards...
I think that some programs may print so much numerical output
that it may be important for them to do itoa more-or-less
efficiently.
Perhaps, but BB is always about optimizing for size, not speed, unless
a very large speed gain can be achieved with a relatively small amount
of code..
Post by Denis Vlasenko
sprintf is a performance disaster. In any libc I've looked into,
it _creates a custom struct FILE_, and then calls v_something_printf()
on that, which _parses format string_ with gazzilion of possible
obscure format specifiers, and _then_ it does itoa thing.
Then your libc sucks. Get a good one. BB does not write huge bloated
replacements for the libc functions; it uses them as-is, and thus will
perform well on a well-implemented system and poorly on a
poorly-implemented system.
Post by Denis Vlasenko
Post by Rob Landley
(What's max_unsigned_power10() doing,
anyway, and is that really a sane way to do whatever it is?)
It calculates maximum power-of-10 which fits in unsigned int.
It does that at compile time, portably. That's why it looks funny.
LOL.. *crying* why does itoa need to know such a thing?!

Rich
Rob Landley
2006-07-09 15:47:10 UTC
Permalink
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
Hi folks, Rob,
Attached patch removes all instances of local itoa()-like
functions, introduces common one into libbb and uses
it where appropriate.
Is "sprintf" really that bad? (Or xasprintf() out of our library when
you need it to malloc?) Especially if the darn compiler actually starts
doing intelligent string merging, which it's inching towards...
I think that some programs may print so much numerical output
that it may be important for them to do itoa more-or-less
efficiently.
Could you give me an example? (Your patch didn't actually convert many users,
that I saw.)
Post by Denis Vlasenko
sprintf is a performance disaster. In any libc I've looked into,
it _creates a custom struct FILE_, and then calls v_something_printf()
on that, which _parses format string_ with gazzilion of possible
obscure format specifiers, and _then_ it does itoa thing.
That would be a yes, then. :)
Post by Denis Vlasenko
Post by Rob Landley
You have a large quantity of code under #ifdef NOT_NEEDED_YET that by
itself would make me reject this patch.
I wrote that just in case you'll want me to convert ash too.
See below.
I'm far more interested in replacing ash than fixing it, and am working
towards that end here (albeit by clearing a couple other low hanging fruit
off my todo list first so I can focus on bbsh with fewer interruptions).
Post by Denis Vlasenko
Post by Rob Landley
(What's max_unsigned_power10() doing,
anyway, and is that really a sane way to do whatever it is?)
It calculates maximum power-of-10 which fits in unsigned int.
It does that at compile time, portably. That's why it looks funny.
gcc optimizes it out to simple 1000000000 constant
(if sizeof(unsigned)==4, that is).
When is sizeof(unsigned) not going to be 4 on any Linux platform? We use the
lp64 model (as do all modern Unix variants, including MacOS X), where only
long and pointer are 64 bit.

http://www.unix.org/whitepapers/64bit.html

This even works for Windows, which uses llp64 because they decided to make 64
bit programming really inconvenient in the name of backwards compatability,
and still managed to to screw it up:

http://blogs.msdn.com/oldnewthing/archive/2005/01/31/363790.aspx

Not that I care about that part. :)
Post by Denis Vlasenko
Post by Rob Landley
Post by Denis Vlasenko
If you want it converted too, I'll do it. itoa.c
already has code which can handle unsigned long long conv,
it is #ifdef'ed out for now. See the patch.
It can handle int, and long long, but not long. On 64 bit platforms (we
already support x86-64, and this sort of thing will only get more common
to the point it may be a noticeable chunk of the embedded space in 3-5
years), these are three different sizes.
long long currently is needed for ash only, everybody else wants
itoa([unsigned] int). There are no ltoa users today.
ltoa is of course a piece of cake, can do when someone will need it.
So, do you want me to
a) remove #ifdef NOT_NEEDED_YET part
or
b) enable it and use it in ash?
The first, please.

A general rule of thumb: we can always add code to the tree later. (And if we
remove it, we can always add it _back_ later.)

Thanks,

Rob
--
Never bet against the cheap plastic solution.
Rich Felker
2006-07-08 21:54:35 UTC
Permalink
Post by Rob Landley
Post by Denis Vlasenko
Hi folks, Rob,
Attached patch removes all instances of local itoa()-like
functions, introduces common one into libbb and uses
it where appropriate.
Is "sprintf" really that bad? (Or xasprintf() out of our library when you
need it to malloc?) Especially if the darn compiler actually starts doing
intelligent string merging, which it's inching towards...
You have a large quantity of code under #ifdef NOT_NEEDED_YET that by itself
would make me reject this patch. (What's max_unsigned_power10() doing,
anyway, and is that really a sane way to do whatever it is?)
Agree, this code is INSANELY bloated. Use snprintf!

Rich
Denis Vlasenko
2006-07-08 18:42:05 UTC
Permalink
Post by Rob Landley
Post by Denis Vlasenko
Hi folks, Rob,
Attached patch removes all instances of local itoa()-like
functions, introduces common one into libbb and uses
it where appropriate.
Is "sprintf" really that bad? (Or xasprintf() out of our library when you
need it to malloc?) Especially if the darn compiler actually starts doing
intelligent string merging, which it's inching towards...
I think that some programs may print so much numerical output
that it may be important for them to do itoa more-or-less
efficiently.

sprintf is a performance disaster. In any libc I've looked into,
it _creates a custom struct FILE_, and then calls v_something_printf()
on that, which _parses format string_ with gazzilion of possible
obscure format specifiers, and _then_ it does itoa thing.
Post by Rob Landley
You have a large quantity of code under #ifdef NOT_NEEDED_YET that by itself
would make me reject this patch.
I wrote that just in case you'll want me to convert ash too.
See below.
Post by Rob Landley
(What's max_unsigned_power10() doing,
anyway, and is that really a sane way to do whatever it is?)
It calculates maximum power-of-10 which fits in unsigned int.
It does that at compile time, portably. That's why it looks funny.

gcc optimizes it out to simple 1000000000 constant
(if sizeof(unsigned)==4, that is).
Post by Rob Landley
Post by Denis Vlasenko
If you want it converted too, I'll do it. itoa.c
already has code which can handle unsigned long long conv,
it is #ifdef'ed out for now. See the patch.
It can handle int, and long long, but not long. On 64 bit platforms (we
already support x86-64, and this sort of thing will only get more common to
the point it may be a noticeable chunk of the embedded space in 3-5 years),
these are three different sizes.
long long currently is needed for ash only, everybody else wants
itoa([unsigned] int). There are no ltoa users today.
ltoa is of course a piece of cake, can do when someone will need it.

So, do you want me to

a) remove #ifdef NOT_NEEDED_YET part
or
b) enable it and use it in ash?
--
vda
Rich Felker
2006-07-09 05:07:38 UTC
Permalink
Post by Rob Landley
Post by Denis Vlasenko
Hi folks, Rob,
Attached patch removes all instances of local itoa()-like
functions, introduces common one into libbb and uses
it where appropriate.
Is "sprintf" really that bad? (Or xasprintf() out of our library when you
need it to malloc?) Especially if the darn compiler actually starts doing
intelligent string merging, which it's inching towards...
You have a large quantity of code under #ifdef NOT_NEEDED_YET that by itself
would make me reject this patch. (What's max_unsigned_power10() doing,
anyway, and is that really a sane way to do whatever it is?)
Agree, this code is INSANELY bloated. Use snprintf!

Rich
Denis Vlasenko
2006-07-08 16:44:18 UTC
Permalink
Hi folks, Rob,

Attached patch removes all instances of local itoa()-like
functions, introduces common one into libbb and uses
it where appropriate.

The itoa which lives in ash is not converted:

/* Our own itoa(). */
static int
cvtnum(arith_t num)
{
int len;
expdest = makestrspace(32, expdest);
#ifdef CONFIG_ASH_MATH_SUPPORT_64
len = fmtstr(expdest, 32, "%lld", (long long) num);
#else
len = fmtstr(expdest, 32, "%ld", num);
#endif
STADJUST(len, expdest);
return len;
}

Where fmtstr is:

int
fmtstr(char *outbuf, size_t length, const char *fmt, ...)
{
va_list ap;
int ret;
va_start(ap, fmt);
INTOFF;
ret = vsnprintf(outbuf, length, fmt, ap);
va_end(ap);
INTON;
return ret;
}

If you want it converted too, I'll do it. itoa.c
already has code which can handle unsigned long long conv,
it is #ifdef'ed out for now. See the patch.
--
vda
-------------- next part --------------
A non-text attachment was scrubbed...
Name: busybox_itoa.patch
Type: text/x-diff
Size: 8832 bytes
Desc: not available
Url : http://lists.busybox.net/pipermail/busybox/attachments/20060708/b7e27d7b/attachment.bin
Rob Landley
2006-07-08 17:45:50 UTC
Permalink
Post by Denis Vlasenko
Hi folks, Rob,
Attached patch removes all instances of local itoa()-like
functions, introduces common one into libbb and uses
it where appropriate.
Is "sprintf" really that bad? (Or xasprintf() out of our library when you
need it to malloc?) Especially if the darn compiler actually starts doing
intelligent string merging, which it's inching towards...

You have a large quantity of code under #ifdef NOT_NEEDED_YET that by itself
would make me reject this patch. (What's max_unsigned_power10() doing,
anyway, and is that really a sane way to do whatever it is?)
Post by Denis Vlasenko
If you want it converted too, I'll do it. itoa.c
already has code which can handle unsigned long long conv,
it is #ifdef'ed out for now. See the patch.
It can handle int, and long long, but not long. On 64 bit platforms (we
already support x86-64, and this sort of thing will only get more common to
the point it may be a noticeable chunk of the embedded space in 3-5 years),
these are three different sizes.

Rob
--
Never bet against the cheap plastic solution.
Continue reading on narkive:
Loading...