Topics in Computer Mathematics - Number Systems

Although computer designers can arbitrarily choose the size of the register to be used as
well as the size of the fields, most modern systems follow the standard established by
the IEEE (Institute of Electrical and Electronics Engineers). This standard provides two
formats, a single precision format and a double precision format. They are summarized
in Table TN4. When all of the bits in the format are 0's, the number is assumed to be
zero.

Feature	Single Precision Format	Double Precision Format
Register Length	32 bits	64 bits
Mantissa	23 + 1 implied	52 + 1 implied
Exponent, Bias	8 bits, 127	11 bits, 1023

Table TN4. IEEE Floating Point Standard
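As a rough illustration of Table TN4, the Python sketch below splits a number into the three single precision fields. The function name ieee_single_fields is made up for this example, and the layout it assumes (1 sign bit, then 8 exponent bits, then 23 stored mantissa bits) is the usual single precision arrangement; the table itself does not list the sign bit.

```python
import struct

def ieee_single_fields(value):
    """Split a number, rounded to IEEE single precision, into its fields."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                 # 1 sign bit
    exponent = (bits >> 23) & 0xFF    # 8 exponent bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF        # 23 stored mantissa bits (leading 1 is implied)
    return sign, exponent, mantissa

# -6.5 is -1.101 x 2^2 in binary.
sign, exponent, mantissa = ieee_single_fields(-6.5)
print(sign, exponent - 127, format(mantissa, "023b"))
# prints: 1 2 10100000000000000000000
```

The double precision case would follow the same pattern with a 64-bit pack format, an 11-bit exponent field and a bias of 1023.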

Conversion of Decimal numbers to Binary Floating Point

The following steps are required in order to convert a decimal number to its binary
floating point representation.

a. Convert the decimal number to a binary number (this includes either
expanding 10^x into decimal form, if x is small, or converting it to 2^y)
b. Put the binary number into floating-point form
c. Normalize the binary number
d. Convert the exponent to binary, and add the bias
e. Specify the sign as a binary digit
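Step a can be sketched in Python as follows: the integer part is converted directly, and the fraction is converted by repeated multiplication by two. The helper name to_binary is hypothetical, and the use of exact fractions (so that no extra rounding sneaks in) is a choice made for this sketch; it handles non-negative inputs only and truncates rather than rounds.

```python
from fractions import Fraction

def to_binary(decimal_text, fraction_bits):
    """Return a non-negative decimal number as a truncated binary string."""
    value = Fraction(decimal_text)            # exact, e.g. 54.23 becomes 5423/100
    integer_part = value.numerator // value.denominator
    fraction = value - integer_part
    digits = []
    for _ in range(fraction_bits):
        fraction *= 2                         # shift the next bit left of the point
        bit = int(fraction)                   # that bit is the new integer part
        digits.append(str(bit))
        fraction -= bit
    return format(integer_part, "b") + "." + "".join(digits)

print(to_binary("54.23", 10))   # prints: 110110.0011101011
```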

In the following examples, we’ll assume a floating point format with a 16 bit register to
hold the FP number, having

1 sign bit
4 exponent bits, with a bias of 7
11 mantissa bits, plus 1 implied bit

Note that the exponent range in this format is -7 through +8. This means that the
smallest (absolute value) number representable is 1.00000000000 x 2^-7 = .0078125

and the largest is 1.11111111111 x 2^8 = 511.875
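The limits of this 16 bit format can be checked with a few lines of Python. This is only a sketch: the constant names are invented here, and it ignores the convention that an all-zero register stands for zero.

```python
EXPONENT_BITS = 4
BIAS = 7
MANTISSA_BITS = 11

min_exponent = 0 - BIAS                           # stored exponent field 0000
max_exponent = (2 ** EXPONENT_BITS - 1) - BIAS    # stored exponent field 1111

# Mantissa field all zeros gives 1.000..., all ones gives 1.111... = 2 - 2^-11.
smallest = 1.0 * 2.0 ** min_exponent
largest = (2 - 2.0 ** -MANTISSA_BITS) * 2.0 ** max_exponent

print(min_exponent, max_exponent)   # prints: -7 8
print(smallest, largest)            # prints: 0.0078125 511.875
```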

Example 27. Convert 54.23 to the above binary format.

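a. 54.23 = 110110.0011101011... in binary
(the fraction does not terminate; after the leading 00, a 20 bit pattern beginning 11101011 repeats)

b. and c. 110110.0011101011 = 1.101100011101011 x 2^5
(note that 15 fraction bits are shown;
only 11 of them will be retained in the specified format.)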

d. 5 = 0101; add the bias of 7: 0101 + 0111 = 1100.

e. Sign = 0

In floating point hardware format, the register holds 0 1100 10110001110
(sign, exponent, mantissa truncated to 11 bits). Notice that
the 1 to the left of the binary point is not included.
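Steps a through e can be combined into one small Python sketch for the 16 bit format assumed in these examples. The function name encode and its behavior on edge cases are assumptions of this sketch: it truncates the mantissa instead of rounding, and it simply raises an error when the exponent does not fit.

```python
from fractions import Fraction

EXPONENT_BITS, BIAS, MANTISSA_BITS = 4, 7, 11

def encode(decimal_text):
    """Return the sign, exponent and mantissa fields as bit strings."""
    value = Fraction(decimal_text)
    sign = "1" if value < 0 else "0"                  # step e
    magnitude = abs(value)
    if magnitude == 0:
        return "0", "0" * EXPONENT_BITS, "0" * MANTISSA_BITS
    # Steps b and c: scale by powers of two until 1 <= magnitude < 2.
    exponent = 0
    while magnitude >= 2:
        magnitude /= 2
        exponent += 1
    while magnitude < 1:
        magnitude *= 2
        exponent -= 1
    if not -BIAS <= exponent <= 2 ** EXPONENT_BITS - 1 - BIAS:
        raise OverflowError("exponent does not fit in the exponent field")
    # Drop the implied leading 1 and truncate the fraction to 11 bits.
    mantissa = int((magnitude - 1) * 2 ** MANTISSA_BITS)
    return (sign,
            format(exponent + BIAS, f"0{EXPONENT_BITS}b"),   # step d
            format(mantissa, f"0{MANTISSA_BITS}b"))

print(" ".join(encode("54.23")))   # prints: 0 1100 10110001110
```

Because the low-order bits are dropped, the stored value is 54.21875 rather than 54.23.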

Example 28. Convert -3.75 x 10^2 to binary FP format

a. The easy way to do this is to write the number in integer format (375) and
convert it to binary. In practice, however, the exponents are likely to be quite
large and it would not be feasible to take this approach. Let’s develop a general
rule for converting 10^x to 2^y.

We want 10^x = 2^y; that is, we want to solve this equation for y in terms of x. Take
the log to the base 2 of both sides:

log2(10^x) = log2(2^y), so x log2(10) = y

Thus, any exponent of 10 can be converted to an exponent of 2 by multiplying the
decimal exponent by log2(10), which is roughly equal to 3.322... So, for this
example, we could say

10^2 = 2^(2 x 3.322) = 2^6.644

Since we don’t allow for fractional exponents, we will have to make the following
adjustment:

2^6.644 = 2^.644 x 2^6 = 1.563 x 2^6 (using a calculator)

and

3.75 x 10^2 = 3.75 x 1.563 x 2^6 = 5.86 x 2^6

Converting 5.86 to binary gives us 101.11011

[Note that when converting a fraction from decimal to binary we can stop
multiplying by two when we have generated enough digits (counting both the
integer and fraction portions of the number) to fill the mantissa field of the FLP
format.]

b. and c. 101.11011 x 2^6 = 1.0111011 x 2^8

d. Exponent 8 biased by 7 is 15 = 1111.

e. Sign (-) is 1.

The final result is 1 1111 01110110000

Note that by converting 10² to 2^y, we introduced several rounding and truncation
errors. Compare the result above with what we get by simply converting 375:

a. 375 = 101110111

b. and c. 101110111 = 1.01110111 x 2^8

d. Exponent is still 8 --> 1111 in biased form

e. Sign = 1

The final result is 1 1111 01110111000, which differs from our previous
result only in the eighth fractional position, so the mantissa is off by 2^-8 or 1/256 = .0039.
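The exponent conversion used in Example 28, and the size of the resulting error, can be sketched in Python. The function name ten_to_two is invented for this illustration. The first mantissa string is an assumption: it is the pattern implied by the remark that the two results differ only in the eighth fractional position.

```python
import math

def ten_to_two(x):
    """Return (multiplier, n) such that 10**x is about multiplier * 2**n."""
    y = x * math.log2(10)        # solves 10**x == 2**y
    n = math.floor(y)            # whole part: the usable binary exponent
    return 2 ** (y - n), n       # fractional part folded into a multiplier

multiplier, n = ten_to_two(2)
print(round(multiplier, 4), n)        # multiplier near 1.5625, exponent 6
print(round(3.75 * multiplier, 2))    # near 5.86

# Mantissa fields of the two results (the first is assumed, see above).
via_powers = "01110110000"
direct = "01110111000"
differing = [i + 1 for i, (a, b) in enumerate(zip(via_powers, direct)) if a != b]
mantissa_error = abs(int(via_powers, 2) - int(direct, 2)) / 2 ** 11
print(differing, mantissa_error, mantissa_error * 2 ** 8)
```

With full calculator precision the multiplier is 1.5625; the slightly larger 1.563 in the worked example comes from rounding log2(10) to 3.322. A mantissa error of 2^-8 at an exponent of 8 amounts to an error of 1 in the value itself.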

Convert binary floating point numbers to Decimal numbers

The procedure is

a. Convert the exponent to decimal and subtract the bias
b. Evaluate 2^x if possible (x is small) or convert to 10^y
c. Convert the mantissa to decimal and add 1 (to restore the implied digit)
d. Combine the sign and the results of steps a through c into the decimal
form of the number

Example 29. Convert the register contents with exponent field 0001 and mantissa field 01010000000 to a decimal number.

a. The exponent is 0001 = 1; subtracting the bias (7) gives -6.

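b. 2^-6 = 1/64 = .015625. Alternatively, let 2^-6 = 10^y; taking the log (base 10) of both
sides gives

y = -6 x log10(2) = -6 x .301 = -1.806

[That is, just as an exponent of 10 can be converted to an exponent of 2 by
multiplying by log2(10) = 3.322, we can convert an exponent of 2 back to an
exponent of 10 by multiplying by log10(2) = .301.]

This gives 2^-6 = 10^-1.806 = 10^.194 x 10^-2 = 1.563 x 10^-2 = .01563.
Compare this with the exact value .015625; we
will continue with the latter, since it contains no rounding or truncation errors.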

c. .0101 = 5/16 = .3125; restoring the implied one = 1.3125

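d. The magnitude is 1.3125 x 2^-6 = 1.3125 x .015625 = .0205078125, or about 2.05 x 10^-2;
the sign is given by the sign bit. (Either form is acceptable)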
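The decoding procedure can be sketched in Python for the 16 bit format used in these examples. The names decode and two_to_ten are hypothetical, and the sign bit of 0 in the sample call is an assumption, since only the exponent and mantissa fields can be read off from the worked steps.

```python
import math

def decode(register, exponent_bits=4, bias=7):
    """Decode a register given as a string of bits (spaces are ignored)."""
    bits = register.replace(" ", "")
    if set(bits) == {"0"}:
        return 0.0                                   # all zeros means zero
    sign_bit = bits[0]
    exponent_field = bits[1:1 + exponent_bits]
    mantissa_field = bits[1 + exponent_bits:]
    exponent = int(exponent_field, 2) - bias         # step a
    scale = 2.0 ** exponent                          # step b
    mantissa = 1 + int(mantissa_field, 2) / 2 ** len(mantissa_field)   # step c
    value = mantissa * scale                         # step d
    return -value if sign_bit == "1" else value

def two_to_ten(x):
    """Return (multiplier, n) such that 2**x is about multiplier * 10**n."""
    y = x * math.log10(2)
    n = math.floor(y)
    return 10 ** (y - n), n

print(decode("0 0001 01010000000"))   # prints: 0.0205078125
print(two_to_ten(-6))                 # multiplier near 1.5625, exponent -2
```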

The concepts involved in the last two examples are important, whereas the actual
numerical manipulations are merely tedious. The following practice problems represent
the kinds of questions one might actually be asked to answer on an exam or in real life.

Practice Problems - Binary Floating Point Numbers

1. Using the floating point register format given above for the examples, show the
register contents for the following binary floating point numbers:

2. What is the minimum number of exponent bits required to accommodate the
following binary numbers (review the section on biased binary numbers, if
necessary.)

3. What is the binary floating point number (in the form b.bbb... x 2^y) represented
by each of the following FLP register contents?

a. (five bit exponent, bias = 15)
b. (eight bit exponent, bias = 127)
c. (six bit exponent, bias = 31)

4. Express the following binary floating point number in IEEE floating point format

5. The following hex number represents the contents of an IEEE floating point
register. Express this in normalized binary floating point form (b.bbb... x 2^y)

9F380000
