The repository is about computing large occurrences of the Fibonacci numbers (A000045). The sequence is often a good pretext for exciting journeys to the heart of computer science and a means of illustrating programming patterns. Recall the sequence
Negafibonacci
Using
Machine code
Hexadecimal representation of an x86-64 machine code function that calculates the $n$th Fibonacci number:
89 f8
85 ff
74 26
83 ff 02
76 1c
89 f9
ba 01 00 00 00
be 01 00 00 00
8d 04 16
83 f9 02
74 0d
89 d6
ff c9
89 c2
eb f0
b8 01 00 00 00
c3

Assembly code
Same Fibonacci calculator, but in x86-64 assembly language using AT&T syntax:
fib:
movl %edi, %eax ; put the argument into %eax
testl %edi, %edi ; is it zero?
je .return_from_fib ; yes - return 0, which is already in %eax
cmpl $2, %edi ; is it less than or equal to 2?
jbe .return_1_from_fib ; yes (i.e., it's 1 or 2) - return 1
movl %edi, %ecx ; no - put it in %ecx, for use as a counter
movl $1, %edx ; the previous number in the sequence, which starts out as 1
movl $1, %esi ; the number before that, which also starts out as 1
.fib_loop:
leal (%rsi,%rdx), %eax ; put the sum of the previous two numbers into %eax
cmpl $2, %ecx ; is the counter 2?
je .return_from_fib ; yes - %eax contains the result
movl %edx, %esi ; make the previous number the number before the previous one
decl %ecx ; decrement the counter
movl %eax, %edx ; make the current number the previous number
jmp .fib_loop ; keep going
.return_1_from_fib:
movl $1, %eax ; set the return value to 1
.return_from_fib:
ret ; return
Fibonacci numbers are strongly related to the golden ratio, a number whose story is a captivating journey through art and architecture, botany and biology, physics and, of course, mathematics. Euclid called it the extreme and mean ratio, and Luca Pacioli the divine proportion.
Euclid's Elements (c. 300 BC) provides several propositions, along with their proofs, involving the golden ratio, and contains its first known definition, which reads as follows:
Elements / Liber VI / Definition 3
Ακρον καὶ μέσον λόγον εὐθεῖα τετμῆσθαι λέγεται, ὅταν ᾖ ὡς ἡ ὅλη πρὸς τὸ μεῖζον τμῆμα, οὕτως τὸ μεῖζον πρὸς τὸ ἔλαττὸν.
A straight line is said to have been cut in extreme and mean ratio when, as the whole line is to the greater segment, so is the greater to the lesser.
The Elements, written in thirteen books (i.e. chapters), is the most famous and scientifically most significant work by the Greek mathematician Euclid. After the Bible, it is the most printed and studied book in the history of the western world. It presents geometry as a logically self-contained system built on a handful of definitions, postulates and axioms. Besides the foundations of geometry, it contains everything known at the time about number theory; here too appeared the first important findings on prime numbers.
As a famous problem from his book Liber abaci shows, Fibonacci was familiar – in Euclid's tradition – with the concept of proportion, in accordance with what was first termed the golden ratio only in the
Calculation
Algebraically, two quantities
One method for finding a closed form for
Solving the quadratic equation yields two real solutions. Since
Illustrations
There are several methods for computing the value of a given
Closed-form formula
Here the expression of one
One can quickly notice that the second term's absolute value is always less than
Though the golden ratio
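As an illustration, the closed-form evaluation can be sketched in JavaScript (a minimal sketch, with an illustrative name; rounding absorbs the vanishing second term):

```javascript
// Binet-style closed form: F(n) = round(φ^n / √5), since the second
// term ψ^n / √5 always has an absolute value below 1/2.
const closedForm = (n) => {
  const phi = (1 + Math.sqrt(5)) / 2; // golden ratio
  return Math.round(phi ** n / Math.sqrt(5));
};
console.log(closedForm(10)); // 55
```

The sketch is only valid while $\varphi^{n}/\sqrt{5}$ stays within double precision, roughly $n \le 70$.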
Matrix form
The 2-dimensional system of linear difference equations that describes the Fibonacci sequence is:
The second innocent-looking matrix exponentiation identity can be proven by weak induction as follows:
- Base case for $n=1$, clearly:
- Induction step, assume for $n\gt 1$:
Then one can deduce that the equation remains verified for
Double identities
The sequence has remarkable properties whose study is the subject of regular publications. The three following equations are derived by applying
By equating each respective cell of the first and last terms, some identities are deduced directly or via substitution:
Therefore,
Other notable properties:
- Cassini's identity: $F_{n-1}F_{n+1} - F_{n}^{2} = (-1)^{n}$
- Addition rule: $F_{n+p} = F_{p}F_{n+1} + F_{p-1}F_{n} = F_{n}F_{p+1} + F_{n-1}F_{p}$
- Greatest common divisor (gcd) identity: $gcd(F_{n}, F_{p}) = F_{gcd(n,p)}$
Matrix multiplication
This binary operation is a central tool of linear algebra and has numerous applications in applied mathematics, statistics, physics, economics and engineering.
Keep in mind that a computer performs operations at the same speed regardless of the source code or instructions given; the rate depends only on hardware specifications. The performance of one method stems from the efficiency of its set of instructions: how closely it taps the available computing power and how directly it runs through the process to achieve the desired result. In other words, the processor does not care how clever or redundant its instructions are; it executes them at the same rate. The outperformance of one algorithm over another arises solely from human reasoning, considered as intelligence, and with regard to the context at hand.
Textbook recursive
Naively, one can execute the recurrence formula directly, as the Fibonacci sequence is inherently recursive. Unfortunately this turns out to be hopelessly slow: one will immediately see that the subproblem redundancy grows exponentially in
Paired with a lookup table (e.g. cache, memoization) that stores the results of previously solved subproblems, the programming pattern ensures each instance is computed only once, bringing the time complexity back to somewhat linear
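A minimal sketch of both flavors (illustrative names), contrasting the textbook recursion with its memoized counterpart:

```javascript
// Textbook recursion: subproblems are recomputed over and over,
// so the running time grows exponentially with n.
const naive = (n) => n < 2 ? n : naive(n - 1) + naive(n - 2);

// Same recursion paired with a lookup table (memoization): each
// subproblem is solved exactly once, bringing the cost back to linear.
const cache = new Map([[0, 0n], [1, 1n]]);
const memoized = (n) => {
  if (!cache.has(n)) cache.set(n, memoized(n - 1) + memoized(n - 2));
  return cache.get(n);
};
console.log(naive(10));    // 55
console.log(memoized(50)); // 12586269025n
```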
Iteration
Both the starting points and the number of iterations to climb the ladder of the sequence are well known ahead of time. The sequence could also be returned via a generator function*.
With regard to Fibonacci, both approaches actually perform almost equally, in
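The generator variant mentioned above can be sketched like so (a hypothetical helper, not code from the repository):

```javascript
// Lazily yields the Fibonacci sequence term by term.
function* fibonacci() {
  let [a, b] = [0n, 1n];
  while (true) { yield a; [a, b] = [b, a + b]; }
}
const it = fibonacci();
const first = Array.from({ length: 10 }, () => it.next().value);
console.log(first.join(',')); // 0,1,1,2,3,5,8,13,21,34
```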
Dynamic programming
Recursion and iteration are equally expressive. The former can be replaced by the latter with, if needed, an explicit call stack, while iteration can be turned into tail recursion (tail-call elimination).
Generally, two properties of a problem should be observed while considering dynamic programming:
- Subproblem redundancy, meaning valid results to smaller instances are used numerous times to solve one larger instance of the problem. The Fibonacci sequence ticks the box big time!
- Subproblem optimality, meaning an optimal solution of the larger instance is obtained from the optimal results of each subproblem, instead of trying every possible valid way (e.g. shortest path search meets the property whereas longest path search does not). Here again, the problem at hand complies with this principle.
A primary difference is that recursion can be employed as a solution without prior knowledge of how many times the process will have to repeat, or of how the problem will exactly decompose into smaller instances, while a successful iteration requires that foreknowledge. Implementing an algorithm using iteration may not be easily achievable. Many problems are inherently recursive: e.g. multiple recursion like dfs, generative recursion such as gcd, binary search, mergesort, etc. They may be implemented iteratively with the help of an explicit stack, but the programmer effort involved in managing the stack, and the complexity of the resulting program, arguably outweigh any advantages of the iterative solution.
Iterative code
let dynamic = (n) => {let [a, b] = [1, 0]; for (let i = 2; i <= n; i++) {[a, b] = [a + b, a];}; return n ? a : 0;}; // n ? ... : 0 guards the F(0) = 0 edge case

Matrix exponentiation
The algorithm is calculating the simple-looking matrix form
Running the equation step by step,
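The matrix form can be exponentiated in JavaScript as a sketch (helper names are illustrative; BigInt keeps the large values exact), raising the Fibonacci matrix $Q$ by repeated squaring:

```javascript
// 2x2 matrix product over BigInt entries.
const mul = (A, B) => [
  [A[0][0] * B[0][0] + A[0][1] * B[1][0], A[0][0] * B[0][1] + A[0][1] * B[1][1]],
  [A[1][0] * B[0][0] + A[1][1] * B[1][0], A[1][0] * B[0][1] + A[1][1] * B[1][1]],
];

// F(n) is the top-right entry of Q^n, Q = [[1,1],[1,0]],
// computed by binary exponentiation of the matrix.
const matrixFib = (n) => {
  let Q = [[1n, 1n], [1n, 0n]];
  let R = [[1n, 0n], [0n, 1n]]; // identity
  for (; n > 0; n = Math.floor(n / 2)) {
    if (n % 2) R = mul(R, Q); // odd bit: multiply the accumulator
    Q = mul(Q, Q);            // square at every step
  }
  return R[0][1];
};
console.log(matrixFib(100)); // 354224848179261915075n
```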
Binary exponentiation is a general method for faster computation of large powers of a number or of a square matrix (also called double-and-add). It is a corollary of the powerful divide-and-conquer algorithm paradigm. The process consists of repeatedly computing the square of
Combined with the binary expression of the exponent
- The total number of squaring operations, also of iterations, is roughly the number of bits of the exponent, i.e. $\lfloor log_{2}\text{ }n\rfloor$. Each of these steps doubles the exponent.
- A complementary simple multiplication by $x$ is performed when the iterated bit is $1$; this increments the exponent by $1$ only.
Logarithmic time complexity
Calculation example
Exponentiation by squaring for
| i | bit | operation | result | ||||
|---|---|---|---|---|---|---|---|
| 0 | MSb most significant bit | ||||||
| 1 | |||||||
| 2 | |||||||
| 3 | |||||||
| 4 | |||||||
| 5 | LSb least significant bit |
In the end, the framework has performed
Alternatively, one may also process according to
Programming examples
Illustration of computation table above, running binary representation from left to right (MSb to LSb)
const formula_i = (x, n) => {let exp = 1, binary = n.toString(2);
for (let bit of binary) {exp *= exp; if (bit == 1) {exp *= x;};}; return exp;};

Alternative formula (though not preferred) running the binary array from right to left (LSb to MSb).
const formula_ii = (x, n) => {let exp = 1, binary = n.toString(2), i = binary.length - 1;
do {if (binary[i] == 1) {exp *= x;}; x *= x;} while (i--); return exp;};

Iterative version with constant auxiliary memory y.
function exponentiationBySquaring (x, n) {
if (n < 0) {x = 1 / x; n = -n;};
if (n == 0) {return 1;};
let y = 1; // stores the complementary simple multiplications
while (n > 1) {
if (n % 2 != 0) {y *= x; n -= 1;}
else {x *= x; n /= 2;};
};
return x * y;};

Version performing a fixed quantity of operations (multiplications and squarings) regardless of each specific bit's value, for cryptographic concerns (e.g. resistance to timing attacks).
function MontgomeryLadder (x, n) {
let [x1, x2] = [x, x * x]; let base = n.toString(2);
for (let i = 1; i < base.length; i++) {
if (base[i] == 0) {[x1, x2] = [x1 * x1, x1 * x2];}
else {[x1, x2] = [x1 * x2, x2 * x2]};
};
return x1;};

Generalized exponentiation
Exponentiation by squaring can be viewed as a suboptimal addition-chain exponentiation algorithm. The shortest addition chain gives the minimal number of multiplications required to compute the
Fast double
The matrix exponentiation method makes it fast to work the sequence up to a very large Fibonacci number, where the textbook approach was endlessly slow. Nevertheless, computing the whole matrix ends up performing redundant calculations, as various cells contain identical values. Using the doubling identities instead addresses this concern.
It is worth pointing out the strength of pair induction versus a simple relation which can lead to a dead-end or to partial sequencing. Indeed, a single induction like
Iterative IIFE
const iterativeWrappedIIFE = (n) =>
(func => ((n >= 0) || (n % 2)) ? func(Math.abs(n)) : -func(Math.abs(n))) // negafibonacci: F(-n) = (-1)^(n+1) F(n)
(function fibonacci (n) {
n = n.toString(2); // exponent binary notation
let [f_2n1, f_2n] = [1n, 0n]; // initialization
for (let i = 0; i < n.length; i++) {
[f_2n1, f_2n] = [(f_2n1 * f_2n1) + (f_2n * f_2n), f_2n * (f_2n1 * 2n - f_2n)];
if (n[i] == 1) {[f_2n1, f_2n] = [f_2n1 + f_2n, f_2n1];};};
return f_2n;}
);

Binet's turnaround
Since the closed-form expression requires dealing with the irrational Math.sqrt(5), floating-point rounding begins pushing the result away from
Windows calculator
Computing directly
// true
A symbolic algebra approach can remove the floating-point difficulty, provided
- $a$ and $b$ being respectively the real and algebraic parts of $\alpha$.
- Obviously, $(1,1)\iff \varphi$ and $(1,-1)\iff \varphi'$. Besides, $(2,0)\iff 1$.
Using this notation it can be simply developed
Similarly, one can obtain the representation
Thanks to the two operations, it can also be undertaken one pairing
One deduces the straightforward and surprisingly costless result:
Obviously this approach is meant to be run using binary exponentiation.
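A sketch of this symbolic exponentiation (illustrative names): pairs $(a,b)$ stand for $(a+b\sqrt{5})/2$, and since $\varphi^{n} = (L_{n} + F_{n}\sqrt{5})/2$ with $L_{n}$ the Lucas numbers, the second component of the result is the Fibonacci number itself.

```javascript
// Product rule for pairs (a,b) ⟺ (a + b√5)/2:
// (a,b)·(c,d) = ((ac + 5bd)/2, (ad + bc)/2); both divisions are exact.
const mulPair = ([a, b], [c, d]) => [(a * c + 5n * b * d) / 2n, (a * d + b * c) / 2n];

const algebraicFib = (n) => {
  let r = [2n, 0n]; // (2,0) ⟺ 1
  for (const bit of n.toString(2)) {
    r = mulPair(r, r);                         // squaring at every bit
    if (bit === '1') r = mulPair(r, [1n, 1n]); // multiply by φ = (1,1)
  }
  return r[1]; // φ^n = (L(n) + F(n)√5)/2, so the b component is F(n)
};
console.log(algebraicFib(12)); // 144n
```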
Caveat
Before running this algorithm, one needs to align the output initialization with the loop index.
- `[a,b] = [2,0]` means $(2,0)\iff 1$. The loop traverses the binary array from the leftmost `bit[0] = 1`.
- `[a,b] = [1,1]` means $(1,1)\iff \varphi$. The loop traverses the binary expansion from the second `bit[1]`.
Programming patterns of the Fibonacci sequence exhibit a large range of algorithm classes and time complexities:
| Method | Time | Comment |
|---|---|---|
| Textbook | $O(2^{n})$ | exponential blow-up of redundant subproblems |
| Cached recursion | $O(n)$ | node.js default stack size is exceeded beyond |
| Tabulated iteration | $O(n)$ | runtime for |
| Matrix exponentiated | $O(log\text{ }n)$ | |
| Fast double | $O(log\text{ }n)$ | |
| Binet algebraic | $O(log\text{ }n)$ | |
Here is a good illustration that the bisection paradigm is a paramount optimization tool, one which often leads to an improvement of the asymptotic cost of the solution. If
Some notable examples of divide-and-conquer frameworks:
- The mergesort algorithm, invented by John von Neumann in 1945 and specifically developed for computers,
- The ancient Euclidean algorithm to determine the greatest common divisor of two numbers,
- The Karatsuba algorithm that achieves multiplication of two $n$-digit integers in $O(n^{log_{2}3})$.
In computer science,
It is very useful to classify algorithms by efficiency. What matters is the growth rate, and
- When $f(x)$ is a sum of several terms and one has a larger growth rate than the others, only it can be kept while all others are omitted.
- When $f(x)$ is a product of multiple terms, any constants (not dependent on $x$) can be omitted.
Therefore both
Moreover, the precise number of steps depends on the details of the machine model on which the algorithm executes, but different computers typically vary by a constant factor so
Clock rate
Aside from algorithm efficiency, hardware specifications matter!
The performance of a computer is very dependent on the Central Processing Unit, the brain of the PC. The CPU processes instructions from all different programs every second. Some of these instructions involve simple arithmetic while others are more complicated.
The clock speed counts the number of cycles the CPU executes per second, expressed in GHz (gigahertz). A cycle is the basic unit measuring the speed of the CPU. During each cycle, billions of transistors within the processor open and close. This is how the CPU executes the calculations contained in the instructions it receives. Sometimes, multiple instructions are completed in a single clock cycle; in other cases, one instruction might be handled over multiple clock cycles.
Different processor designs handle instructions differently. Moreover, an older chip with a higher clock speed may very well be outperformed by a slower but newer processor whose architecture deals with instructions more efficiently. Of course, there are many other factors to consider when measuring the performance of a computer, such as the data bus, memory latency, architecture, microarchitectures, cache, etc.
The speed of floating-point operations, commonly measured in FLOPS, is another important characteristic of a computer system. A floating-point unit (FPU, colloquially a math coprocessor) is a part of a computer system specially designed to carry out operations on floating-point numbers.
In a nutshell, one algorithm time complexity could be intuitively spotted like so:
| With this rule of thumb, chances are... | notation | class | example |
|---|---|---|---|
| Whenever the number of steps is constant whatever the size of the input | $O(1)$ | constant | lookup table |
| Anytime a problem can be divided over again into | $O(log\text{ }n)$ | logarithmic | binary search |
| Whenever algorithm is linearly processing | $O(n)$ | linear | read the book |
| When it involves walking through the input about input times, or keeps swooping through with nested loops | $O(n^{2})$ | quadratic | school multiplication, bubblesort |
| Anytime the growth rate in time doubles or | $O(2^{n})$ | exponential | plain Fibonacci, password bruteforce |
| When the running time grows in a factorial way like for generating all unrestricted permutations | $O(n!)$ | factorial | travelling salesman problem via bruteforce |
Generally speaking, sublinear time is considered fast whereas complexities higher than linearithmic are rather said slow. Typically there is a tradeoff time vs. space (more relevant decades ago than nowadays since space has become cheap) or time over intelligence (e.g. effort in writing a more sophisticated model).
To end with, together with
- $\Omega$ expresses the lower bound, i.e. the best case, of the algorithm,
- $\Theta$ is used when $O = \Omega$.
Illustrations
Sleep sort!
Sleep sort is undoubtedly one of the most ingenious sorting algorithms. Many methods use strategies based on the divide-and-conquer principle to get an array sorted more efficiently. Here the idea, though unconventional, is incredibly simple: for each element, open up a new thread that sleeps for an amount of time proportional to the element's value and, once the time is out, simply emits the item. Elements are then collected sequentially in time. The sorting is brilliantly delegated to the CPU scheduler (e.g. the JS event loop).
Needless to say, generally speaking, an algorithm's effectiveness depends on the given population to be sorted. For example, whenever a few elements have a much higher value (in magnitude) than the rest of the cluster, the sorting process will be penalized by the extra time taken for those specific values.
Array.prototype.sleepSort = function (f) {this.forEach((n) => setTimeout(() => f(n), 2 * n));};
[1,9,6,7,3,4,0,5,8,2].sleepSort(function (i) {document.write(i + '<br>');});
// emits the element after 2 times its value in milliseconds

Array.prototype.fisherYates = function () {let i = this.length - 1;
do {let j = ~~(Math.random() * (i+1)); [this[i],this[j]] = [this[j],this[i]];} while (--i);};
// Modern version of the Fisher–Yates shuffle algorithm

Array.prototype.sleepSort = function (callback) {let sorted = [];
for (let n of this) {setTimeout(() => {
sorted.push(n); if (this.length === sorted.length) {callback(sorted);};}, n);};
return sorted;};
// the callback eventually receives the sorted array (positive values), each item pushed after its value n (ms)
[18,12,21,8,7,10,1,16,11,2,19,17,22,9,20,4,23,25,14,5,15,13,6,3,24].sleepSort(console.log);
// [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25]

Miscellaneous 🎉
Occurrences of the Fibonacci sequence are countless in life.
It is fair to say
Here is a selection of easily accessible use cases involving the Fibonacci sequence.
Miles to kilometers
Because
- $8$ miles $\to 13$ kilometers, similarly as $F_{6} = 8$ and $F_{6} \times \varphi \sim F_{6+1} = 13$.
- $55$ km/h $\gets 34$ mph follows the sequence like so: $F_{10} = 55$ and $55 / \varphi \sim F_{9} = 34$.
Climb the stairs
Fibonacci numbers give the solution to certain enumerative problems, including the rabbit problem initiated by Fibonacci himself! One of the most common is counting the number of ways of writing a given number
There are actually
For example with
The example illustrates that the number of compositions
- $C_{n-1}$, the number of compositions that express $(n-1)$, to which one would add $1$, being also the number of compositions for $C_{n}$ ending with $1$, and
- $C_{n-2}$, the number of ways to write $(n-2)$, to which one would logically add $2$, also being the number of compositions for $C_{n}$ that end with $2$.
Given base cases
Different stories told for an identical solution are as follows:
- Number of ways of emptying a drum of $n$ liters with the help of containers of either one- or two-liter capacity,
- Number of different domino tilings of the $2\times n$ plane.
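A minimal sketch counting those compositions (illustrative name), confirming the shifted Fibonacci pattern:

```javascript
// C(n) = C(n-1) + C(n-2): ways to climb n stairs taking steps of 1 or 2.
const climb = (n) => {
  let [a, b] = [1, 1]; // C(1) = 1 and C(0) = 1 (the empty composition)
  for (let i = 2; i <= n; i++) [a, b] = [a + b, a];
  return a;
};
console.log([1, 2, 3, 4, 5].map(climb).join(',')); // 1,2,3,5,8
```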
Heads or tails
In order to build a game of length
Given base cases
The following problem shares this solution: the number of subsets
Representing and manipulating real numbers efficiently is required in many fields of science such as engineering, finance and more. Since the early years of electronic computing, many different ways of approximating real numbers on computers have been introduced. Floating-point arithmetic is by far the most widely used way of representing real numbers in modern computers.
It is a truism:
- Humans count in decimal basis because they own $10$ fingers!
- Computers are made up of transistors which switch between only two states: either on or off (electric current flows or does not pass). They perform calculations (and actually everything) using the binary system because of its straightforward implementation in logic gates.
Expressing one same quantity:
${\color{blue}325.1875}_{10} = {\color{blue}3}\times 10^{2} + {\color{blue}2}\times 10^{1} + {\color{blue}5}\times 10^{0} + {\color{blue}1}\times 10^{-1} + {\color{blue}8}\times 10^{-2} + {\color{blue}7}\times 10^{-3} + {\color{blue}5}\times 10^{-4}$ ${\color{green}101000101.0011}_{2} = {\color{green}1} \times 2^{8} + {\color{green}1} \times 2^{6} + {\color{green}1} \times 2^{2} + {\color{green}1} \times 2^{0} + {\color{green}1} \times 2^{-3} + {\color{green}1} \times 2^{-4}$
The base-
Computers store bits in words, whose width grew from early 4-bit and 8-bit words up to today's 64-bit words. For example, floating-point arithmetic was historically often not available on 8-bit microprocessors and had to be carried out in software. Integration of the floating-point unit, first as a separate integrated circuit and then as part of the microprocessor chip itself, sped up floating-point calculations.
Integers
| bits | unsigned | signed (two's complement) | use cases |
|---|---|---|---|
| 8 | $[0\ ;\ 255]$ | $[-128\ ;\ 127]$ | latin character set |
| 16 | $[0\ ;\ 65535]$ | $[-32768\ ;\ 32767]$ | graphic coordinates |
| 32 | $[0\ ;\ 2^{32}-1]$ | $[-2^{31}\ ;\ 2^{31}-1]$ | general purpose |
| 64 | $[0\ ;\ 2^{64}-1]$ | $[-2^{63}\ ;\ 2^{63}-1]$ | general purpose |
Two's complement is the almost exclusive method of representing signed integers on computers, and more generally, fixed-point binary values. Two's complement uses the binary digit with the greatest place value as the sign bit to indicate whether the binary number is positive (
Two's complement only applies to numbers all having the same $n$-bit length.
Given the set of all possible $n$-bit encodings, the lower half (in binary!) is assigned to positive integers
| Set | binary range | MSb | encoding | interval |
|---|---|---|---|---|
| lower half | $000\ldots0$ to $011\ldots1$ | MSb $=0$ | from $0$ upward | $[0\ ;\ 2^{n-1}-1]$ |
| upper half | $100\ldots0$ to $111\ldots1$ | MSb $=1$ | from $-2^{n-1}$ upward | $[-2^{n-1}\ ;\ -1]$ |
Fundamentally, the system represents negative integers by counting backward and wrapping around (modular arithmetic). The boundary between the positive and negative subsets could be drawn differently, but the convention is that all negative numbers have a leftmost bit of $1$.
-bit integer,
More generally, the value of an $n$-bit encoded integer
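As a sketch of that rule in JavaScript (an illustrative helper): subtract $2^{n}$ whenever the MSb is set.

```javascript
// Interpret an n-bit two's complement pattern as a signed value.
const decode = (raw, bits) => raw >= 2 ** (bits - 1) ? raw - 2 ** bits : raw;
console.log(decode(0b10110100, 8)); // -76
console.log(decode(0b01001100, 8)); // 76
```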
Overcoming the issues raised by a naïve representation (wrong arithmetic, double zero encoding), the method is carried out like so.
- First, $\mathbb{N^{+}}$ are encoded as usual, whereas $\mathbb{Z^{-}}$ are encoded by performing two consecutive steps:
  - Flipping all bits (bitwise operator NOT in JavaScript, `~`), i.e. ones' complement,
  - Adding $+1$ to the entire inverted number, ignoring any overflow (accounting for the overflow would produce the wrong result).
With half-byte length, encoding
Conversely, to verify that
Signed vs. unsigned
Given a one-byte length, two's complement will encode
1111 1111 255
- 0101 0010 - 82
============ ======
1010 1101 (one's complement) 173
+ 1 + 1
============ ======
1010 1110 (two's complement) 174

Manual shortcut
To quickly determine one integer's additive inverse using the two's complement expression, one can proceed by hand:
- Work from LSb $\to$ MSb and keep each bit identical up to and including the very first $1$ bit encountered,
- Flip all bits thereafter, up to the leftmost bit.
Example:
- $76_{10} = 01001100_{2} \to$ flip all: $10110011 \to$ add $+1 \to$ makes $10110100_{2} = -76_{10}$
- $76_{10} = 01001100_{2} \to$ keep the right bits up to the first $1$: $100 \to$ flip thereafter: $10110100 \to$ makes $10110100_{2} = -76_{10}$
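The two steps can be checked in JavaScript (a sketch; the mask keeps the result on 8 bits):

```javascript
// Two's complement negation on 8 bits: flip all bits (~), add 1, mask.
const negate8 = (x) => (~x + 1) & 0xFF;
console.log(negate8(76).toString(2)); // 10110100
console.log(negate8(76) - 256);       // -76 when read back as a signed byte
```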
Weird number
There is one exception where the additive inverse of the lowest integer is not representable. One will point out that its two's complement being the same number is detected as an overflow condition, since there is a carry into, but not out of, the most significant bit. As for
-bit integer,
Origin of the name Expanded fully in an $n$-bit system, it actually means the complement to
Compared to other systems for representing signed numbers (e.g., ones' complement), two's complement has the advantage that the fundamental arithmetic operations of addition, subtraction, and multiplication are identical to those for unsigned binary numbers (as long as the inputs are represented in the same number of bits as the output, and any overflow beyond those bits is discarded from the result).
Subtraction Using two's complement notation, subtraction is executed with the following formula, eliminating the need for dedicated circuitry.
Sign extension When turning a two's complement number of a certain bit length into a larger storage (e.g. copying from a byte to a two-byte word), the most significant bit must be repeated in all the extra bits. Some processors do this in a single instruction; on other processors, a conditional must be used, followed by code to set the relevant bits or bytes.
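In JavaScript, the same effect can be sketched with shifts (illustrative helper): shifting left then arithmetic-shifting right replicates the MSb into the upper bits.

```javascript
// Sign-extend an 8-bit two's complement value into a 32-bit integer:
// << pushes the byte to the top, >> (arithmetic) drags the sign bit down.
const signExtend8 = (byte) => (byte << 24) >> 24;
console.log(signExtend8(0b10110100)); // -76
console.log(signExtend8(0b01001100)); // 76
```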
Floating-point arithmetic
Simulating an infinite and continuous set
Floating point arithmetic is not real! The primary use of floating-point arithmetic is to perform computations over real numbers. Due to the limited precision and range of floating-point numbers, various numerical issues arise: rounding errors, overflows, underflows, loss of algebraic properties, catastrophic cancellation, and so on. Yet, these issues did not stop computers from using floating-point arithmetic for serious applications. In fact, it remains one extremely efficient way of performing approximate calculations given sufficient care in the implementation and powerful floating-point units available in modern processors.
The IEEE 754 standard establishes the framework for using a finite pack of bits (e.g. 32, 64) to store numbers of a large range, including the subnormal floating numbers (extremely close to zero).
Floating-point arithmetic represents subsets of the real numbers using both scientific notation and a numeral system, mostly binary. With the same number of bits, this representation handles a much wider dynamic range, at the price of precision. Numbers are not uniformly spaced: the difference between two consecutive representable numbers varies with their exponent. A floating-point number consists of two fixed-point components whose ranges depend exclusively on the number of bits or digits in their representation. Whereas each component depends linearly on its range, the floating-point range depends linearly on the significand range and exponentially on the range of the exponent component, which attaches an outstandingly wider range to the number.
Analogous to scientific notation, floating point processes numbers as follows:
- The sign, $+$ or $-$, stored in the last bit (little-endian storage),
- A string of digits as the significand (or mantissa), whose length determines the precision of the number,
- A base, either $2$ or $10$ in IEEE 754,
- The exponent, which indicates the magnitude of the value, within a given range $[1-e_{max}, e_{max}]$ (or scale), in biased form with a negative spread. Extreme exponents, all $0$s or all $1$s, are reserved for special numbers, respectively $0$ and $\infty$.
The radix point is assumed to always sit right after the most significant digit. The most significant digit of the mantissa of a non-zero number can be required to be non-zero; this is called normalization. The first digit ranges
The standard specifies some special values and their representation:
| code | bytes | sign | exponent, bias [range] | precision | ~decimals | underflow |
|---|---|---|---|---|---|---|
| binary16 | 2 | 1 bit | 5 bits, 15 $[-14\ ;\ 15]$ | 11 bits | $\sim 3.3$ | $\sim 6\times 10^{-8}$ |
| binary32 | 4 | 1 bit | 8 bits, 127 $[-126\ ;\ 127]$ | 24 bits | $\sim 7.2$ | $\sim 1.4\times 10^{-45}$ |
| binary64 | 8 | 1 bit | 11 bits, 1023 $[-1022\ ;\ 1023]$ | 53 bits | $\sim 15.9$ | $\sim 4.9\times 10^{-324}$ |
Note that for each larger format, the mantissa is expanded much more than the exponent. Indeed, according to the IEEE standard, precision matters more than amplitude.
- Safe integers range between $\pm \text{ } 2^{p-1}$
- Equivalent decimal length is computed like so: $\sim p \times log_{10}\text{ }(2)$
- Fraction component $\in [1\ ;\ 2 - 2^{1-p}] \subset [1\text{ };\text{ }2[$ (for normalized numbers)
- Subnormal numbers range $]-b^{1-e_{max}}\text{ };\text{ }b^{1-e_{max}}[$ excluded, or $[-b^{1-e_{max}}(1-b^{1-p}) \text{ };\text{ }b^{1-e_{max}}(1-b^{1-p})]$ included
- Overflow levels for infinity: $-\infty \lt (-1)(1-b^{-p})(b^{e_{max}+1})$ and $(1-b^{-p})(b^{e_{max}+1}) \lt +\infty$
Exponent encoding involves an offset-binary, called exponent bias in the IEEE 754 standard.
The stored exponents all
| Exponent | Mantissa $=0$ | Mantissa $\neq 0$ | Equation |
|---|---|---|---|
| all $0$s | signed zero | subnormal number | $(-1)^{s}\times 0.f \times 2^{1-bias}$ |
| intermediate | normal | normal | $(-1)^{s}\times 1.f \times 2^{E-bias}$ |
| all $1$s | $\pm\infty$ | NaN (quiet, signaling) | |
double With regard to the 8-byte word,
single With regard to the 32-bit format,
Encoding example
Consider binary32:
- Sign is negative so bit[31]=1
- Encoding $123.456_{10}$ is carried out like so:
| Integer part | Fractional part |
|---|---|
| $123_{10} = 1111011_{2}$ | $0.456_{10} \approx 0.01110100101111000\ldots_{2}$ |
- As precision $p=24$ in binary32, there is no need to keep encoding further $(7+17)$,
- The significand is not finite, so a rounding rule must apply (round to nearest, ties to even). Observing the next guard and sticky bits, the least significant bit rounds up to $1$ instead of the $0$ expected,
- Normalizing the binary number by shifting the significand (right, as the number is $\gt 1$) by $e = +6$ digits,
- By corollary, setting the stored exponent to $E = e + bias = 6 + 127 = 133_{10} = (10000101)_{2}$,
One then forms the resulting 32-bit IEEE 754 binary32 format representation of
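The worked example can be double-checked with a DataView (a sketch; the expected bit pattern follows from the steps above):

```javascript
// Encode -123.456 as IEEE 754 binary32 and dump the raw bits in hex.
const view = new DataView(new ArrayBuffer(4));
view.setFloat32(0, -123.456);
console.log(view.getUint32(0).toString(16)); // c2f6e979
// i.e. sign 1 | exponent 10000101 (133) | mantissa 11101101110100101111001
```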
binary64
Hexadecimal is widely used as a shorthand for conveniently representing binary numbers, because each hexadecimal digit stands for four bits (binary digits), also called a nybble.
Notable cases
Some double-precision binary64 encoding
| hexadecimal | IEEE 754 Converter | |
|---|---|---|
| $8000\ 0000\ 0000\ 0000_{16}$ | $-0$ | signed zero |
| $3FF0\ 0000\ 0000\ 0001_{16}$ | $\approx 1.0000000000000002$ | next greater than 1 |
| $3FF0\ 0000\ 0000\ 0010_{16}$ | $\approx 1.0000000000000036$ | next number thereafter |
| $BFD5\ 5555\ 5555\ 5555_{16}$ | $\approx -1/3$ | infinite binary expansion |
| $4009\ 21FB\ 5444\ 2D18_{16}$ | $\approx \pi$ | irrational |
| $800F\ FFFF\ FFFF\ FFFF_{16}$ | $\approx -2.225\times 10^{-308}$ | min subnormal (negative) |
| $7FEF\ FFFF\ FFFF\ FFFF_{16}$ | $\approx 1.798\times 10^{308}$ | maximum double |
| $412E\ 8480\ 0000\ 0000_{16}$ | $1000000$ | |
One main issue lies in that a real number might not be exactly representable via the floating-point method.
console.log(0.1 + 0.2); // 0.30000000000000004
let binary64 = 2 / 10 == 0.3 - 0.1; // false

Those results look off! (conversions decimal
However, should they be interpreted with the precision of the inputs (
- Accuracy indicates how close a value (whether computed or targeted) is to the true number,
- Precision focuses on resolution, that is how far two values need to be from each other before you can tell the difference.
For example,
- Circumstances under which exact computations are required,
- Precise rounding behavior expected whenever computations cannot be exact.
One might be surprised how little precision actually suffices to get consistent results. For example, the Sun is
Mathematically speaking, the normalized floating-point numbers of a given sign are roughly logarithmically spaced, and as such any finite-sized normal float cannot include zero. Subnormal floats are a linearly spaced set of values, which span the gap between the negative and positive normal floats. Subnormal numbers provide the guarantee that addition and subtraction of floating-point numbers never underflow: two nearby floating-point numbers always have a representable non-zero difference. Without gradual underflow, such a difference could flush to zero and, in turn, lead to division-by-zero errors that cannot occur when gradual underflow is used.
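A quick JavaScript check of gradual underflow (a sketch using binary64 limits):

```javascript
// The difference of two distinct nearby doubles is representable
// as a subnormal instead of flushing to zero.
const x = 2 ** -1022;       // smallest normal double
const y = 1.5 * 2 ** -1022; // a nearby normal double
console.log(y - x === 0);          // false
console.log(y - x === 2 ** -1023); // true: a subnormal value
```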
Precision magnitude and limitations
For radix-
| Set | interval range | spread with next |
|---|---|---|
| $[1\ ;\ 2]$ | normal doubles around unity | $2^{-52}$ (machine epsilon) |
| $[-2^{53}\ ;\ 2^{53}]$ | range for exactly representable integers | $1$ |
| $[2^{53}\ ;\ 2^{54}]$ | rounds to integer multiple of $2$ | $2$ |
| $[2^{54}\ ;\ 2^{55}]$ | rounds to integer multiple of $4$ | $4$ |
| $[2^{55}\ ;\ 2^{56}]$ | rounds to integer multiple of $8$ | $8$ |
Generalization for base $b$ when
Visualization approach Instead of the exponent, think of a window between two consecutive powers of the base (single precision here):
- The window tells within which two consecutive powers of $2$ the target number will sit: $[0.5,1]$, $[1,2]$, $[16,32]$, and so on up to $[2^{127}, 2^{128}[$.
- The offset divides the window into a fixed set of $2^{p-1}=2^{23}=8\text{ }388\text{ }608$ buckets. Each bucket is a regularly spaced number within the current window. When a boundary of the window is reached, one can float up (shift right) or float down (shift left) the window; this changes the precision by one power of $2$.
The three blocks of a floating point number thus map naturally: the sign, the exponent (the window) and the mantissa (the offset).
Rounding rules The standard sets five methods. Round to nearest, ties to even is the default for binary floating point and the recommended default for decimal. Round to nearest, ties away from zero is only required for decimal implementations.
Guard digits address the fact that computing the exact difference or sum of two floating-point numbers can be very expensive when their exponents are substantially different. Computing with a single guard digit will not always give the same answer as computing the exact result and then rounding. But by introducing a second guard digit and a sticky bit, differences can be computed at only a little more cost than with a single guard digit, and the result is the same as if the difference were computed exactly and then rounded. Thus the standard can be implemented efficiently.
Another grey area concerns the interpretation of parentheses. Due to roundoff errors, the associative laws of algebra do not necessarily hold for floating-point numbers, so the importance of preserving parentheses cannot be overemphasized. For example:
let a = 10e30, b = -10e30, c = 1;
a + (b + c); // 0 - c is absorbed by b first
(a + b) + c; // 1
(a + c + b); // 0 - left-to-right, c is absorbed by a first
IEEE 754 also defines operations and provides guidance on exceptional conditions and exception handling. It requires that the result of addition, subtraction, multiplication and division be exactly rounded. That is, the result must be computed exactly and then rounded to the nearest floating-point number (using round to even). The 2008 revision added a fused multiply-add (FMA) operation to compute $a\times b + c$ with a single rounding; the latest revision of the standard was released in 2019.
Decimal formats
IEEE 754-2008 introduces base-$10$ floating-point encodings: decimal32, decimal64 and decimal128. For example, comparing decimal64 with its binary counterpart:
| format | subnormals | precision | max value |
|---|---|---|---|
| decimal64 | yes | 16 decimal digits | $\approx 9.99 \times 10^{384}$ |
| binary64 | yes | 53 bits ($\approx 15.95$ decimal digits) | $\approx 1.80 \times 10^{308}$ |
Unlike in a binary format, in a decimal floating-point format a given number might have multiple representations, called a cohort (absence of normalization). Decimal interchange format encoding is more sophisticated than the binary standard. Detailed layouts and canonical (i.e. preferred) encoding are fully described in the IEEE documentation. A few points are worth mentioning to shed more light on decimal representation:
- The combination field includes classifying bits plus the exponent MSBs and the most significant digit (not bit!) of the mantissa,
- The exponent is biased with a decimal number though it is encoded in binary; the number of possible exponent values takes the form $3\times 2^{k}$, with the two MSBs being $00$, $01$ or $10$.
- The combination field is arranged differently depending on whether the trailing significand field uses decimal encoding (densely-packed decimal, DPD) or binary integer decimal (BID).
- The first one or first three MSbs of the significand are always implied, i.e. not encoded, as in ${\color{purple}0}$abc or ${\color{purple}100}$c.
Some examples of encoding shapes:
| sign | combination | exponent | significand | case |
|---|---|---|---|---|
| s | $11\,ee\,c$ | remaining exponent bits | DPD-encoded declets | DPD for a larger first digit (${\color{purple}100}$c, i.e. $8$ or $9$) |
| s | $ee\,abc$ | remaining exponent bits | DPD-encoded declets | DPD for a smaller first digit (${\color{purple}0}$abc, i.e. $0$ to $7$) |
| s | $11\,ee$ | remaining exponent bits | binary integer, leading ${\color{purple}100}$ implied | BID for a larger first digit ($8$ or $9$) |
| s | $ee$ | remaining exponent bits | binary integer, leading ${\color{purple}0}$ implied | BID for a smaller first digit ($0$ to $7$) |
With the true significand being the sequence of decimal digits encoded in the trailing field, prefixed with the implied leading digit.
As regards the binary representation method (BID), the decimal64 format uses a significand ranging from $0$ to $10^{16}-1$, stored as a plain binary integer.
Decimal floating point formats are relatively new, released in 2008. Outside the banking industry, and even then, there is not much call for them. Quants trying to make a buck on arbitrage opportunities do not require them. Working with integers, like using cents instead of one monetary unit, allows computing exactly billions of bank balances or amortization tables and still replicating results that a human would have produced by hand. Besides, dedicated decimal floating-point hardware has so far not seemed worth the cost.
Arbitrary precision
Whenever more precision is required, one can use arbitrary-precision arithmetic (bignum). Rather than storing values as a fixed number of bits related to the size of the processor register, these implementations typically use variable-length strings or arrays of digits. A common application is public-key cryptography, whose algorithms commonly employ arithmetic with integers having hundreds of digits. It is also used to overcome inherent limitations of fixed-precision floating-point arithmetic such as overflow, lack of associativity and cancellation. Numerous algorithms have been developed to efficiently perform arithmetic operations on numbers stored with arbitrary precision.
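For instance, JavaScript's built-in BigInt provides exact arbitrary-precision integers. A quick sketch, back on the article's theme (the helper name is mine):

```javascript
// BigInt never overflows, so large Fibonacci numbers come out exact
const fib = (n) => {
  let [prev, curr] = [0n, 1n];
  for (let i = 0; i < n; i++) [prev, curr] = [curr, prev + curr];
  return prev; // F(n)
};
console.log(fib(100).toString()); // 354224848179261915075
```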
JavaScript As specified by the ECMAScript standard, all arithmetic shall be done using double-precision floating-point arithmetic.
const to64bitFloat = (number) => {
  let float = '';
  const dv = new DataView(new ArrayBuffer(8));
  dv.setFloat64(0, number, false); // big-endian
  for (let i = 0; i < 8; i++) {
    float += dv.getUint8(i).toString(2).padStart(8, '0');
  }
  return float; // emits a string of 64 bits
};
const to32bitFloat = (number) => { // same idea, single precision
  let float = '';
  const dv = new DataView(new ArrayBuffer(4));
  dv.setFloat32(0, number, false);
  for (let i = 0; i < 4; i++) {
    float += dv.getUint8(i).toString(2).padStart(8, '0');
  }
  return float;
};
const toHex = (str) => str ? (+('0b' + str.slice(0, 4))).toString(16) + toHex(str.slice(4)) : '';
toHex(to64bitFloat(0.3));       // 3fd3333333333333
toHex(to64bitFloat(0.1 + 0.2)); // 3fd3333333333334
toHex(to64bitFloat(123.456));   // 405edd2f1a9fbe77
toHex(to32bitFloat(-97.531));   // c2c30fdf
To end with, integers and floating-point numbers all have their place.
On the one hand, integers are still ideal whenever iterating, counting or indexing over discrete sets, where arithmetic stays exact. On the other hand, floating point arithmetic delivers an incredibly wider dynamic range and high precision within the format's range. And all modern processors power very smart FPUs.
Arithmetic is the paramount job of computers.
Chips are carefully crafted pieces of silicon made up of billions of transistors, which behave as electronic switches reacting to on/off signals.
Just a few basic logic gates (and a ton of human intelligence!) suffice to create sophisticated digital circuits that solve highly complex computational tasks.
The arithmetic logic unit (ALU) is the combinational digital circuit which performs arithmetic and bitwise operations on integer binary numbers. This is in contrast to a floating-point unit (FPU) that operates on floating point numbers. It is a fundamental building block of many types of computing circuits, including the central processing unit (CPU) of computers, FPUs and graphics processing units (GPUs).
Instruction cycle
In order to run applications, the CPU interacts with the computer memory (e.g. RAM, cache, registers) in a series of instruction cycles, classically fetch, decode and execute.
The typical repertoire of an ALU includes operations for arithmetic, bitwise logic and bit shifts.
Nowadays arithmetic-logic units, math coprocessors and micro-code algorithms are all burned into the chip. Most currently available microprocessors have optimized circuitry performing state-of-the-art algorithms for fast arithmetic, at the price of a more complex hardware realization.
As stated earlier, effectiveness and complexity of one method depend on the resources required to run (e.g. time, memory) in regards with the input at hand, the context, and of course hardware.
Because of its straightforward implementation in digital electronic circuitry using logic gates, the binary system is used by modern computers. Arithmetic in binary is much like arithmetic in other positional notation numeral systems.
There are numerous algorithms and digital circuits for common mathematical operations including:
- Adders: from vanilla half adder and full adder, to highly complex carry-lookahead adders, carry-save adders, etc.
- Multipliers: binary, Karatsuba, Booth's multiplication, Wallace trees, Dadda, etc.
Adders
Half adder Sums two single bits, producing a sum bit and a carry bit.
Full adder performs the operation on two binary digits plus a carry-in brought by a previous stage; chained together, full adders add multi-bit binary numbers.
It can be implemented in different ways but most common is like so
| half adder | full adder |
|---|---|
| 1 XOR gate and 1 AND gate | 2 XOR gates, 2 AND gates, 1 OR gate |
| 5 NAND gates | 9 NAND gates or 9 NOR gates |
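The gate counts above can be sketched in JavaScript, with bits as plain 0/1 numbers (helper names are mine):

```javascript
// Half adder: 1 XOR (sum), 1 AND (carry)
const halfAdder = (a, b) => ({ sum: a ^ b, carry: a & b });

// Full adder from two half adders plus one OR: 2 XOR, 2 AND, 1 OR
const fullAdder = (a, b, cin) => {
  const h1 = halfAdder(a, b);
  const h2 = halfAdder(h1.sum, cin);
  return { sum: h2.sum, carry: h1.carry | h2.carry };
};

console.log(fullAdder(1, 1, 1)); // { sum: 1, carry: 1 } i.e. 1+1+1 = 11b
```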
Ripple-carry adder (RCA) If one aligns a bunch of full adders in a row, this computes a multi-bit addition. Though easily designed, it is relatively slow since each full adder must wait for the carry signal to propagate from the preceding adder; the total gate delay therefore grows linearly with the operand width.
Slow motion..?
Carry-lookahead adder (CLA) To overcome this carry ripple problem, more sophisticated and much faster adders were devised such as Kogge-Stone, Brent-Kung or hybrid adders, bringing the computational time for an $n$-bit addition down from $O(n)$ to $O(\log n)$ gate delays. The key idea is to determine, for each bit position, whether a carry will be killed, generated or propagated:
- If both inputs are $0$, there will definitely be no carry out: the carry is killed,
- If both inputs are $1$, the carry out will definitely be $1$: a carry is generated,
- If only one input is $1$, a carry out occurs only if a carry comes in: the carry is propagated.
Hence, defining the generate and propagate signals $G_{i} = A_{i} \cdot B_{i}$ and $P_{i} = A_{i} \oplus B_{i}$, the carry out of stage $i$ is $C_{i+1} = G_{i} + P_{i}C_{i}$.
Rewriting the latter relation to express the recursive pattern in a multi-bit adder:
$C_{2} = G_{1} + P_{1}G_{0} + P_{1}P_{0}C_{0}$
$C_{3} = G_{2} + P_{2}G_{1} + P_{2}P_{1}G_{0} + P_{2}P_{1}P_{0}C_{0}$
$C_{4} = G_{3} + P_{3}G_{2} + P_{3}P_{2}G_{1} + P_{3}P_{2}P_{1}G_{0} + P_{3}P_{2}P_{1}P_{0}C_{0}$
By combining the $G$ and $P$ signals this way, every carry can be computed in a couple of gate levels directly from the inputs, without waiting for lower-order carries to ripple through.
Usually more than one level of lookahead-carry logic is done. Deciding groups sizes and number of levels requires a detailed analysis of gate and propagation delays for the particular technology being used.
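As a sketch (helper name mine), the carry recurrence can be simulated bit by bit; note a real CLA evaluates the expanded product terms in parallel, whereas this loop merely checks the recurrence:

```javascript
// Compute all carries from generate/propagate signals, LSB first
const claCarries = (A, B, c0 = 0) => {          // A, B: arrays of 0/1 bits
  const G = A.map((a, i) => a & B[i]);          // generate:  G_i = A_i AND B_i
  const P = A.map((a, i) => a ^ B[i]);          // propagate: P_i = A_i XOR B_i
  const C = [c0];
  for (let i = 0; i < A.length; i++) {
    C.push(G[i] | (P[i] & C[i]));               // C_{i+1} = G_i + P_i C_i
  }
  return C; // carry into each stage, plus the final carry out
};
console.log(claCarries([1, 1, 0, 1], [1, 0, 1, 1])); // [ 0, 1, 1, 1, 1 ]
```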
For wide operands, as found in cryptography, carry-save adders may be a preferable option since they spend no time at all on carry signal propagation.
Carry-save adder (CSA) Efficiently computes the sum of three or more multi-bit numbers.
- The sum of two digits can never carry more than a $1$, nor can the sum of two digits plus $1$.
- Numbers are added in groups of three using $3$-input adders which generate two results, one sum and one carry. This works because a full adder is a 3:2 compressor: it sums three $1$-bit inputs and returns the result as a single $2$-bit number, e.g. $b_{1} = C_{out}$ and $b_{0} = S$ (it maps $8$ input cases onto $4$ output values).
- The sum and carry outputs may thereafter feed a subsequent $3$-input adder (along with the next operand) without having to wait for the propagation of a carry signal.
- After all stages of addition, a conventional adder (e.g. ripple-carry, carry-lookahead) must be used to combine the final sum and carry results.
Given three $n$-bit numbers $a$, $b$ and $c$, the partial sum is $ps = a \oplus b \oplus c$ and the shifted carry is $sc = ((a \wedge b) \vee (a \wedge c) \vee (b \wedge c)) \ll 1$.
Example:
0011 1010 1001 1010 1010 1111 0100 1001 (e.g. 983 215 945 base 10)
+ 0000 1110 1110 0100 1101 0111 1100 0000 (e.g. 249 878 464 base 10)
+ 0001 0100 1101 1001 1101 1010 1111 1011 (e.g. 349 821 691 base 10)
-------------------------------------------
0 0010 0000 1010 0111 1010 0010 0111 0010 (sum without any carry propagation)
+ 0 0011 0101 1011 0001 1011 1111 1001 0010 (carry saved as a second number)
-------------------------------------------
= 0101 1110 0101 1001 0110 0010 0000 0100 (e.g. 1 582 916 100 base 10)
CSAs are typically very fast.
Multipliers
A variety of computer arithmetic techniques can be used to implement a binary multiplier. Most techniques involve computing a set of partial products. Multiplication in binary is similar to the decimal method taught in school. Here is a grade-school example, for the record, in which bits after the fractional point are multiplied too:
111.011 // 7.375 in decimal
x 11.101 // 3.625 in decimal
------------
+ 0.111011 // 3 bits RIGHT shift of the first operand
+ 00.0000 // all 0s since the second operand second bit is 0
+ 11.1011 // 1 bit RIGHT shift of the first operand
+ 111.011 // each partial product is just a left/right shift of first operand
+ 1110.11 // 1 LEFT shift of the first operand
------------
11010.101111 // 26.734375 in decimal
Just shifts and adds! This is much simpler than in the decimal system as there is no multiplication table to remember. However, the method is rather slow as it involves many time-consuming intermediate additions. Again, modern computers embed much faster multipliers.
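The shift-and-add procedure above can be sketched over integers with BigInt (helper name mine):

```javascript
// One partial product per set bit of the multiplier,
// each partial product being a left shift of the multiplicand.
const shiftAddMul = (a, b) => {
  let product = 0n;
  for (let shift = 0n; b > 0n; b >>= 1n, shift++) {
    if (b & 1n) product += a << shift; // add shifted partial product
  }
  return product;
};
console.log(shiftAddMul(7n, 13n)); // 91n
```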
Multiplier circuit
The schematic of a 2-bit by 2-bit binary multiplier is implemented with 2 XORgates and 6 ANDgates.
Chips implement multiplication in hardware or in microcode, for various integer and floating point word sizes. In arbitrary-precision arithmetic, it is common to use long multiplication with the base set to 2b, where b is the number of bits in a word.
Bitwise shifts are usually (but not always) faster than a multiply instruction and can be used to multiply by powers of two (shift left) or divide by them (shift right). Multiplication by a constant can then be decomposed into shifts plus additions or subtractions, for example times 60:
(x << 6) - (x << 2) // here times 60 is computed as (x*2^6) - (x*2^2) = 64x - 4x
(((x << 4) - x) << 2) // here times 60 is computed as (x*2^4 - x) * 2^2 = (16x - x) * 4
Karatsuba algorithm is a multiplication algorithm asymptotically faster than the standard version.
Given the representations $x = x_{1}B^{m} + x_{0}$ and $y = y_{1}B^{m} + y_{0}$ of two numbers in base $B$, their product expands to $xy = x_{1}y_{1}B^{2m} + (x_{1}y_{0} + x_{0}y_{1})B^{m} + x_{0}y_{0}$.
Instead of computing the four multiplications above, Karatsuba noticed that the middle term can be derived with a single extra product: $x_{1}y_{0} + x_{0}y_{1} = (x_{1} + x_{0})(y_{1} + y_{0}) - x_{1}y_{1} - x_{0}y_{0}$, for a total of three multiplications.
Whenever operands are sufficiently large, the approach is recursively undertaken as per the divide-and-conquer framework. It works ideally when $m$ is half the number of digits, which brings the complexity down to $O(n^{\log_{2}3}) \approx O(n^{1.585})$.
Because of the overhead of recursion, Karatsuba's multiplication is slower than long grade-school multiplication for numbers of small length; implementations therefore fall back to the schoolbook method below some threshold.
Nevertheless, the Karatsuba's algorithm was discovered in 1960 and is the first known method that is asymptotically faster than long multiplication and can thus be viewed as the starting point for the theory of fast multiplications.
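A minimal recursive sketch over BigInt, splitting on powers of two so that shifts replace base-$B$ arithmetic (the threshold and helper name are arbitrary choices of mine):

```javascript
// Karatsuba: three recursive multiplies instead of four.
// x = x1*2^m + x0, y = y1*2^m + y0
// x*y = z2*2^(2m) + (z1 - z2 - z0)*2^m + z0
const karatsuba = (x, y) => {
  if (x < 16n || y < 16n) return x * y; // small base case: schoolbook
  const m = BigInt(Math.floor(x.toString(2).length / 2)); // split point
  const [x1, x0] = [x >> m, x & ((1n << m) - 1n)];
  const [y1, y0] = [y >> m, y & ((1n << m) - 1n)];
  const z2 = karatsuba(x1, y1);
  const z0 = karatsuba(x0, y0);
  const z1 = karatsuba(x1 + x0, y1 + y0); // the single extra product
  return (z2 << (2n * m)) + ((z1 - z2 - z0) << m) + z0;
};

const p = karatsuba(12345678901234567890n, 98765432109876543210n);
console.log(p === 12345678901234567890n * 98765432109876543210n); // true
```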
Integer vs. floating-point
Multiplying floating point numbers happens differently, to account for the IEEE standard (encoding, operation rules, rounding). As an overview, the signs are XORed, the exponents added and the significands multiplied, before the result is normalized and rounded.
At first for historical reasons (separate or expensive coprocessors), as much as because of strong differences in arithmetic representation, circuits, registers, etc., floating point units and arithmetic logic units are almost always separated in computer hardware. Nonetheless, each architecture makes slightly different tradeoffs.
JavaScript provides the bitwise operators below (which convert their operands to $32$-bit integers).
| bitwise | name | logic description |
|---|---|---|
| & | AND | output $1$ only if both input bits are $1$ |
| \| | OR | output $1$ if at least one input bit is $1$ |
| ^ | XOR | output $1$ only if the input bits differ |
| ~ | NOT | inverts each bit |
| << | zero fill left shift | pushes zeros in from right, leftmost bits fall off |
| >> | signed right shift | pushes copies of leftmost bit in from left, rightmost bits fall off (sign-propagating) |
| >>> | zero fill right shift | pushes zeros in from left, rightmost bits fall off |
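A few checks of the table, all operands being coerced to 32-bit integers first:

```javascript
console.log(5 & 3);     // 1  (0101 AND 0011)
console.log(5 | 3);     // 7  (0101 OR  0011)
console.log(5 ^ 3);     // 6  (0101 XOR 0011)
console.log(~5);        // -6 (two's complement: ~x === -x - 1)
console.log(5 << 2);    // 20 (multiply by 4)
console.log(-20 >> 2);  // -5 (sign-propagating divide by 4)
console.log(-1 >>> 28); // 15 (zero fill: top 4 bits of 0xFFFFFFFF)
```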
Further readings about electronics | hardware | JavaScript.













