Sunday, May 1, 2011

Convert string to integer (not atoi!)

** READ BEFORE POSTING ** I am aware of the multiply and add solution but since these are arbitrary length numbers, the multiply and add operations are not free so I'd like to avoid them, if at all possible.

My challenge is this: I want to be able to take, as input, a character pointer to a number in base 2 through 16 and as a second parameter, what base the number is in and then convert that to it's representation in base 2. The integer can be of arbitrary length. My solution now does what the atoi() function does, but I was curious purely out of academic interest if a lookup table solution is possible.

I have found that this is simple for binary, octal, and hexadecimal. I can simply use a lookup table for each digit to get a series of bits. For instance:

0xF1E ---> (F = 1111) (1 = 0001) (E = 1110) ---> 111100011110

0766 ---> (7 = 111) (6 = 110) (6 = 110) ---> 111110110

1000 ---> ??? ---> 1111101000

However, my problem is that I want to do this look up table method for odd bases, like base 10. I know that I could write the algorithm like atoi does and do a bunch of multiplies and adds, but for this specific problem I'm trying to see if I can do it with a look up table. It's definitely not so obvious with base 10, though. I was curious if anyone had any clever way to figure out how to generate a generic look up table for Base X -> Base 2. I know that for base 10, you can't just give it one digit at a time, so the solution would likely have to lookup a group of digits at a time.

Any ideas are greatly appreciated!

From stackoverflow
  • The algorithm is quite simple. Language agnostic would be:

    total = 0
    base <- input_base
    for each character in input:
       total <- total*base + number(char)
    

    In C++:

    // Helper to convert a digit to a number
    unsigned int number( char ch )
    {
       if ( ch >= '0' && ch <= '9' ) return ch-'0';
       ch = toupper(ch);
       if ( ch >= 'A' && ch <= 'F' ) return ch-'A';
    }
    unsigned int parse( std::string const & input, unsigned int base )
    {
       unsigned int total = 0;
       for ( int i = 0; i < input.size(); ++i )
       {
          total = total*base + number(input[i]);
       }
       return total;
    }
    

    Of course, you should take care of possible errors (incoherent input: base 2 and input string 'af12') or any other exceptional condition.

    David Rodríguez - dribeas : One digit at a time, not a group of them.
  • You will have to use a look up table with an input width of m base b symbols returning n bits so that

    n = log2(b) * m
    

    for positive integers b, n and m. So if b is not a power of two, there will be no (simple) look up table solution.

    I do not think that there is a solution. The following example with base 10 illustrates why.

    65536 = 1 0000 0000 0000 0000
    

    Changing the last digit from 6 to 5 will flip all bits.

    65535 = 0 1111 1111 1111 1111
    

    And almost the same will hold if you process the input starting from the end. Changing the first digit from 6 to 5 flips a significant number of bits.

    55535 = 0 1101 1000 1111 0000
    
    • Start with a running count of 0.
    • For each character in the string (reading left to right)
      • Multiply count by base.
      • Convert character to int value (0 through base)
      • Add character value to running count.
  • How big are the strings? You can potentially convert the multiply-and-add to a lookup-and-add by doing something like this:

    • Store the numbers 0-9, 10, 20, 30, 40, ... 90, 100, 200, ... 900, 1000, 2000, ... , 9000, 10000, ... in the target base in a table.
    • For each character starting with the rightmost, index appropriately into the table and add it to a running result.

    Of course I'm not sure how well this will actually perform, but it's a thought.

    David Thornley : It's the only thing I've seen that has a prayer of being better than multiply-and-add.
    Daniel Brückner : The idea sounds promising but will not work because if the base is not a power of two any symbol in the input might affect almost any bit in the output.
    Yuliy : Right, the addition will see to that. Binary addition is still a fairly fast operation. The issues with this approach are basically cache sizes and the performance of addition with wide numbers.
  • This is not possible in bases that aren't powers of two to convert to base-2. The reason that it is possible for base 8 (and 16) is that the way the conversion works is following:

    octal ABC = 8^2*A + 8^1*B + 8^0*C (decimal)
              = 0b10000000*A + 0b1000*B + C (binary)
    

    so if you have the lookup table of A = (0b000 to 0b111), then the multiplication is always by 1 and some trailing zeros, so the multiplication is simple (just shifting left).

    However, consider the 'odd' base of 10. When you look at the powers of 10:

    10^1 = 0b1010
    10^2 = 0b1100100
    10^3 = 0b1111101000
    10^4 = 0b10011100010000
    ..etc
    

    You'll notice that the multiplication never gets simple, so you can't have any lookup tables and do bitshifts and ors, no matter how big you group them. It will always overlap. The best you can do is have a lookup table of the form: (a,b) where a is the digit position, and b is the digit (0..9). Then, you are only reduced to adding n numbers, rather than multiplying and adding n numbers (plus the cost of the memory of the lookup table)

  • How accurate do you need to be?

    If you're looking for perfection, then multiply-and-add is really your only recourse. And I'd be very surprised if it's the slowest part of your application.

    If order-of-magnitude is good enough, use a lookup table to find the closest power of 2.

    Example 1: 1234, closest power of 2 is 1024. Example 2: 98765, closest is 65536

    You could also drive this by counting the number of digits, and multiplying the appropriate power of 2 by the leftmost digit. This can be implemented as a left-shift:

    Example 3: 98765 has 5 digits, closest power of 2 to 10000 is 8192 (2^13), so result is 9 << 13

  • I wrote this before your clarifying comment so it probably isn't quite is applicable. I'm not sure if a lookup table approach is possible or not. If you really don't need arbitrary precision, then take advantage of the runtime.

    If a C/C++ solution is acceptable, I believe that the following is what you are looking for is something like the following. It probably contains bugs in edge cases, but it does compile and work as expected at least for positive numbers. Making it really work is an exercise for the reader.

    /*
     * NAME
     *    convert_num - convert a numerical string (str) of base (b) to
     *                  a printable binary representation
     * SYNOPSIS
     *    int convert_num(char const* s, int b, char** o)
     * DESCRIPTION
     *    Generates a printable binary representation of an input number
     *    from an arbitrary base.  The input number is passed as the ASCII
     *    character string `s'.  The input string consists of characters
     *    from the ASCII character set {'0'..'9','A'..('A'+b-10)} where
     *    letter characters may be in either upper or lower case.
     * RETURNS
     *    The number of characters from the input string `s' which were
     *    consumed by this operation.  The output string is placed into
     *    newly allocated storage which is pointed to by `*o' upon successful
     *    completion.  An error is signalled by returning `-1'.
     */
    int
    convert_num(char const *str, int b, char **out)
    {
     int rc = -1;
     char *endp = NULL;
     char *outp = NULL;
     unsigned long num = strtoul(str, &endp, b);
     if (endp != str) { /* then we have some numbers */
      int numdig = -1;
      rc = (endp - str); /* we have this many base `b' digits! */
      frexp((double)num, &numdig); /* we need this many base 2 digits */
      if ((outp=malloc(numdig+1)) == NULL) {
       return -1;
      }
      *out = outp; /* return the buffer */
      outp += numdig; /* make sure it is NUL terminated */
      *outp-- = '\0';
      while (numdig-- != 0) { /* fill it in from LSb to MSb */
       *outp-- = ((num & 1) ? '1' : '0');
       num >>= 1;
      }
     }
     return rc;
    }
    

0 comments:

Post a Comment