Overview of Python

Python is interpreted

  • Python is an interpreted language, in contrast to Java and C which are compiled languages.

  • This means we can type statements into the interpreter and they are executed immediately.

In [1]:
5 + 5
Out[1]:
10
  • Groups of statements are all executed one after the other:
In [2]:
x = 5
y = 'Hello There'
z = 10.5
In [3]:
x + 5
Out[3]:
10

Assignments versus equations

  • In Python when we write x = 5 this means something different from an equation $x=5$.

  • Unlike variables in mathematical models, variables in Python can refer to different things as more statements are interpreted.

In [4]:
x = 1
print('The value of x is', x)

x = 2.5
print('Now the value of x is', x)

x = 'hello there'
print('Now it is ', x)
The value of x is 1
Now the value of x is 2.5
Now it is  hello there

Calling Functions

We can call functions in a conventional way using round brackets

In [5]:
round(3.14)
Out[5]:
3

Types

  • Values in Python have an associated type.

  • If we combine types incorrectly we get an error.

In [6]:
print(y)
Hello There
In [13]:
y + 5
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-13-b85a2dbb3f6a> in <module>
----> 1 y + 5

TypeError: can only concatenate str (not "int") to str

The type function

  • We can query the type of a value using the type function.
In [4]:
type(1)
Out[4]:
int
In [14]:
type('hello')
Out[14]:
str
In [15]:
type(2.5)
Out[15]:
float
In [16]:
type(True)
Out[16]:
bool

Null values

  • Sometimes we need to represent "no data" or "not applicable".

  • In Python we use the special value None.

  • This corresponds to null in Java, or NULL in SQL.

In [17]:
result = None
  • When we fetch the value None in the interactive interpreter, no result is printed out.
In [18]:
result

Testing for Null values

  • We can check whether there is a result or not using the is operator:
In [19]:
result is None
Out[19]:
True
In [20]:
x = 5
x is None
Out[20]:
False

Converting values between types

  • We can convert values between different types.

Converting to floating-point

  • To convert an integer to a floating-point number use the float() function.
In [21]:
x = 1
x
Out[21]:
1
In [22]:
type(x)
Out[22]:
int
In [23]:
y = float(x)
y
Out[23]:
1.0

Converting to integers

  • To convert a floating-point to an integer use the int() function.
In [24]:
type(y)
Out[24]:
float
In [25]:
int(y)
Out[25]:
1

Variables are not typed

  • Variables themselves, on the other hand, do not have a fixed type.
  • It is only the values that they refer to that have a type.
  • This means that the type referred to by a variable can change as more statements are interpreted.
In [26]:
y = 'hello'
print('The type of the value referred to by y is ', type(y))
y = 5.0
print('And now the type of the value is ', type(y))
The type of the value referred to by y is  <class 'str'>
And now the type of the value is  <class 'float'>

Polymorphism

  • The meaning of an operator depends on the types we are applying it to.
In [27]:
1 + 1
Out[27]:
2
In [28]:
'a' + 'b'
Out[28]:
'ab'
In [29]:
'1' + '1'
Out[29]:
'11'

Conditional Statements and Indentation

  • The syntax for control structures in Python uses colons and indentation.

  • Beware that white-space affects the semantics of Python code.

  • Statements that are indented to the same level (for example, using the Tab key) are grouped together.

if statements

In [30]:
x = 5
if x > 0:
    print('x is strictly positive.')
    print(x)
    
print('finished.')
x is strictly positive.
5
finished.

Changing indentation

In [3]:
x = 0
if x > 0:
    print('x is strictly positive.')
print(x)
    
print('finished.')
0
finished.

if and else

In [12]:
x = 0
print('Starting.')
if x > 0:
    print('x is strictly positive.')
else:
    if x < 0:
        print('x is strictly negative.')
    else:
        print('x is zero.')
print('finished.')
Starting.
x is zero.
finished.

elif

In [11]:
print('Starting.')
if x > 0:
    print('x is strictly positive')
elif x < 0:
    print('x is strictly negative')
else:
    print('x is zero')
print('finished.')
Starting.
x is zero
finished.

Lists

We can use lists to hold an ordered sequence of values.

In [32]:
l = ['first', 'second', 'third']
l
Out[32]:
['first', 'second', 'third']

Lists can contain different types of value, even in the same list.

In [33]:
another_list = ['first', 'second', 'third', 1, 2, 3]
another_list
Out[33]:
['first', 'second', 'third', 1, 2, 3]

Mutable Data Structures

Lists are mutable; their contents can change as more statements are interpreted.

In [34]:
l.append('fourth')
l
Out[34]:
['first', 'second', 'third', 'fourth']

References

  • Whenever we bind a variable to a value in Python we create a reference.

  • A reference is distinct from the value that it refers to.

  • Variables are names for references.

In [35]:
X = [1, 2, 3]
Y = X

Side effects

  • The above code creates two different references (named X and Y) to the same value [1, 2, 3]

  • Because lists are mutable, changing them can have side-effects on other variables.

  • If we append something to X what will happen to Y?

In [36]:
X.append(4)
X
Out[36]:
[1, 2, 3, 4]
In [37]:
Y
Out[37]:
[1, 2, 3, 4]

State and identity

  • The state referred to by a variable is different from its identity.

  • To compare state use the == operator.

  • To compare identity use the is operator.

  • When we compare identity we check equality of references.

  • When we compare state we check equality of values.

Example

  • We will create two different lists, with two associated variables.
In [14]:
X = [1, 2]
Y = [1]
Y.append(2)

Comparing state

In [15]:
X
Out[15]:
[1, 2]
In [16]:
Y
Out[16]:
[1, 2]
In [17]:
X == Y
Out[17]:
True

Comparing identity

In [18]:
X is Y
Out[18]:
False

Copying data prevents side effects

  • In this example, because we have two different lists we avoid side effects
In [19]:
Y.append(3)
X
Out[19]:
[1, 2]
In [20]:
X == Y
Out[20]:
False
In [21]:
X is Y
Out[21]:
False

Iteration

  • We can iterate over each element of a list in turn using a for loop:
In [24]:
my_list = ['first', 'second', 'third', 'fourth']
for i in my_list:
    print(i)
first
second
third
fourth

Including more than one statement inside the loop

In [26]:
my_list = ['first', 'second', 'third', 'fourth']
for i in my_list:
    print("The next item is:")
    print(i)
    print()
The next item is:
first

The next item is:
second

The next item is:
third

The next item is:
fourth

Looping a specified number of times

  • To perform a statement a certain number of times, we can iterate over a list of the required size.
In [27]:
for i in [0, 1, 2, 3]:
    print("Hello!")
Hello!
Hello!
Hello!
Hello!

The range function

  • To save us from having to write the numbers out manually, we can use the function range() to count for us.

  • We count starting at 0 (as in Java and C++).

In [28]:
list(range(4))
Out[28]:
[0, 1, 2, 3]

for loops with the range function

In [49]:
for i in range(4):
    print("Hello!")
Hello!
Hello!
Hello!
Hello!

List Indexing

  • Lists can be indexed using square brackets to retrieve the element stored in a particular position.
In [29]:
my_list
Out[29]:
['first', 'second', 'third', 'fourth']
In [30]:
my_list[0]
Out[30]:
'first'
In [31]:
my_list[1]
Out[31]:
'second'

List Slicing

  • We can also specify a range of positions.

  • This is called slicing.

  • The example below indexes from position 0 (inclusive) to 2 (exclusive).

In [32]:
my_list[0:2]
Out[32]:
['first', 'second']

Indexing from the start or end

  • If we leave out the starting index it implies the beginning of the list:
In [33]:
my_list[:2]
Out[33]:
['first', 'second']
  • If we leave out the final index it implies the end of the list:
In [34]:
my_list[2:]
Out[34]:
['third', 'fourth']

Copying a list

  • We can conveniently copy a list by indexing from start to end:
In [35]:
new_list = my_list[:]
In [36]:
new_list
Out[36]:
['first', 'second', 'third', 'fourth']
In [37]:
new_list is my_list
Out[37]:
False

Negative Indexing

  • Negative indices count from the end of the list:
In [38]:
my_list[-1]
Out[38]:
'fourth'
In [39]:
my_list[:-1]
Out[39]:
['first', 'second', 'third']

Collections

  • Lists are an example of a collection.

  • A collection is a type of value that can contain other values.

  • There are other collection types in Python:

    • tuple
    • set
    • dict

Tuples

  • Tuples are another way to combine different values.

  • The combined values can be of different types.

  • Like lists, they have a well-defined ordering and can be indexed.

  • To create a tuple in Python, use round brackets instead of square brackets

In [57]:
tuple1 = (50, 'hello')
tuple1
Out[57]:
(50, 'hello')
In [58]:
tuple1[0]
Out[58]:
50
In [59]:
type(tuple1)
Out[59]:
tuple

Tuples are immutable

  • Unlike lists, tuples are immutable. Once we have created a tuple we cannot add values to it.
In [60]:
tuple1.append(2)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-60-46e3866e32ee> in <module>
----> 1 tuple1.append(2)

AttributeError: 'tuple' object has no attribute 'append'

Sets

  • Lists can contain duplicate values.

  • A set, in contrast, contains no duplicates.

  • Sets can be created from lists using the set() function.

In [1]:
X = set([1, 2, 3, 3, 4])
X
Out[1]:
{1, 2, 3, 4}
In [2]:
type(X)
Out[2]:
set
  • Alternatively we can write a set literal using the { and } brackets.
In [3]:
X = {1, 2, 3, 4}
type(X)
Out[3]:
set

Sets are mutable

  • Sets are mutable like lists:
In [4]:
X.add(5)
X
Out[4]:
{1, 2, 3, 4, 5}
  • Duplicates are automatically removed
In [5]:
X.add(5)
X
Out[5]:
{1, 2, 3, 4, 5}

Sets are unordered

  • Sets do not have an ordering.

  • Therefore we cannot index or slice them:

In [6]:
X[0]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-19c40ecbd036> in <module>
----> 1 X[0]

TypeError: 'set' object is not subscriptable

Operations on sets

  • Union: $X \cup Y$
In [9]:
X = {1, 2, 3}
Y = {4, 5, 6}
X | Y
Out[9]:
{1, 2, 3, 4, 5, 6}
  • Intersection: $X \cap Y$:
In [10]:
X = {1, 2, 3, 4}
Y = {3, 4, 5}
X & Y
Out[10]:
{3, 4}
  • Difference $X - Y$:
In [11]:
X - Y
Out[11]:
{1, 2}

Dictionaries

  • A dictionary contains a mapping between keys and corresponding values.

    • Mathematically, it is a function with a finite domain: each key maps to exactly one value.
  • Given a key, we can very quickly look up the corresponding value.

  • The values can be any type (and need not all be of the same type).

  • Keys can be any immutable (hashable) type.

  • They are abbreviated by the keyword dict.

  • In other programming languages they are sometimes called associative arrays.

Creating a dictionary

  • A dictionary contains a set of key-value pairs.

  • To create a dictionary:

In [6]:
students = { 107564: 'Xu', 108745: 'Ian', 102567: 'Steve' }
  • The above initialises the dictionary students so that it contains three key-value pairs.

  • The keys are the student id numbers (integers).

  • The values are the names of the students (strings).

  • Although we use the same brackets as for sets, this is a different type of collection:

In [7]:
type(students)
Out[7]:
dict

Accessing the values in a dictionary

  • We can access the value corresponding to a given key using the same syntax as for accessing particular elements of a list:
In [8]:
students[108745]
Out[8]:
'Ian'
  • Accessing a non-existent key will generate a KeyError:
In [9]:
students[123]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-9-26e887eb0296> in <module>
----> 1 students[123]

KeyError: 123

Updating dictionary entries

  • Dictionaries are mutable, so we can update the mapping:
In [12]:
students[108745] = 'Fred'
print(students[108745])
Fred
  • We can also grow the dictionary by adding new keys:
In [13]:
students[104587] = 'John'
print(students[104587])
John

Dictionary keys can be any immutable type

  • We can use any immutable type for the keys of a dictionary

  • For example, we can map names onto integers:

In [16]:
age = { 'John':21, 'Steve':47, 'Xu': 22 }
In [17]:
age['Steve']
Out[17]:
47

Creating an empty dictionary

  • We often want to initialise a dictionary with no keys or values.

  • To do this call the function dict():

In [18]:
result = dict()
  • We can then progressively add entries to the dictionary, e.g. using iteration:
In [19]:
for i in range(5):
    result[i] = i**2
print(result)
{0: 0, 1: 1, 2: 4, 3: 9, 4: 16}

Iterating over a dictionary

  • We can use a for loop with dictionaries, just as we can with other collections such as sets.
  • When we iterate over a dictionary, we iterate over the keys.
  • We can then perform some computation on each key inside the loop.
  • Typically we will also access the corresponding value.
In [20]:
for id in students:
    print(students[id])
Xu
Ian
Steve

Arrays

  • Python also has arrays which contain a single type of value.

  • i.e. we cannot have different types of value within the same array.

  • Arrays are mutable like lists; we can modify the existing elements of an array.

  • However, we typically do not change the size of the array; i.e. it has a fixed length.

The numpy module

  • Arrays are provided by a separate module called numpy. Modules correspond to packages in e.g. Java.

  • We can import the module and then give it a shorter alias.

In [3]:
import numpy as np
  • We can now use the functions defined in this package by prefixing them with np.

  • The function array() creates an array given a list.

Creating an array

  • We can create an array from a list by using the array() function defined in the numpy module:
In [4]:
x = np.array([0, 1, 2, 3, 4])
x
Out[4]:
array([0, 1, 2, 3, 4])
In [5]:
type(x)
Out[5]:
numpy.ndarray

Functions over arrays

  • When we use arithmetic operators on arrays, we create a new array with the result of applying the operator to each element.
In [6]:
y = x * 2
y
Out[6]:
array([0, 2, 4, 6, 8])
  • The same goes for functions:
In [7]:
x = np.array([-1, 2, 3, -4])
y = abs(x)
y
Out[7]:
array([1, 2, 3, 4])

Populating Arrays

  • To populate an array with a range of values we use the np.arange() function:
In [8]:
x = np.arange(0, 10)
x
Out[8]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
  • We can also use floating point increments.
In [9]:
x = np.arange(0, 1, 0.1)
x
Out[9]:
array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

Basic Plotting

  • We will use a module called matplotlib to plot some simple graphs.

  • This module provides functions which are very similar to MATLAB plotting commands.

In [10]:
import matplotlib.pyplot as plt

y = x*2 + 5
plt.plot(x, y)
plt.show()

Plotting a sine curve

In [29]:
from numpy import pi, sin

x = np.arange(0, 2*pi, 0.01)
y = sin(x)
plt.plot(x, y)
plt.show()

Plotting a histogram

  • We can use the hist() function in matplotlib to plot a histogram
In [30]:
# Generate some random data
data = np.random.randn(1000)

ax = plt.hist(data)
plt.show()

Computing histograms as matrices

  • The function histogram() in the numpy module will count frequencies into bins and return the result as a 2-dimensional array.
In [31]:
np.histogram(data)
Out[31]:
(array([  2,  24,  49, 166, 247, 258, 157,  72,  18,   7]),
 array([-3.34086253, -2.67834483, -2.01582712, -1.35330942, -0.69079171,
        -0.02827401,  0.6342437 ,  1.2967614 ,  1.95927911,  2.62179681,
         3.28431452]))

Defining new functions

In [32]:
def squared(x):
    return x ** 2

squared(5)
Out[32]:
25

Local Variables

  • Variables created inside functions are local to that function.

  • They are not accessible to code outside of that function.

In [33]:
def squared(x):
    result = x ** 2
    return result

squared(5)
Out[33]:
25
In [34]:
result
Out[34]:
{0: 0, 1: 1, 2: 4, 3: 9, 4: 16}

Functional Programming

  • Functions are first-class citizens in Python.

  • They can be passed around just like any other value.

In [35]:
squared
Out[35]:
<function __main__.squared(x)>
In [36]:
y = squared
y
Out[36]:
<function __main__.squared(x)>
In [37]:
y(5)
Out[37]:
25

Mapping the elements of a collection

  • We can apply a function to each element of a collection using the built-in function map().

  • This will work with any collection: list, set, tuple or string.

  • It takes as arguments another function, and the collection we want to apply it to.

  • In Python 3 it returns an iterator over the results, which we can convert to a list using list().

In [38]:
list(map(squared, [1, 2, 3, 4]))
Out[38]:
[1, 4, 9, 16]

List Comprehensions

  • Because this is such a common operation, Python has a special syntax to do the same thing, called a list comprehension.
In [39]:
[squared(i) for i in [1, 2, 3, 4]]
Out[39]:
[1, 4, 9, 16]
  • If we want a set instead of a list we can use a set comprehension
In [40]:
{squared(i) for i in [1, 2, 3, 4]}
Out[40]:
{1, 4, 9, 16}

Cartesian product using list comprehensions

[figure: Cartesian product of two sets; image courtesy of Quartl, https://commons.wikimedia.org/wiki/User:Quartl]

The Cartesian product of two collections $X = A \times B$ can be expressed by using multiple for statements in a comprehension.

Example

In [41]:
A = {'x', 'y', 'z'}
B = {1, 2, 3}
{(a,b) for a in A for b in B}
Out[41]:
{('x', 1),
 ('x', 2),
 ('x', 3),
 ('y', 1),
 ('y', 2),
 ('y', 3),
 ('z', 1),
 ('z', 2),
 ('z', 3)}

Cartesian products with other collections

  • The syntax for Cartesian products can be used with any collection type.
In [42]:
first_names = ('Steve', 'John', 'Peter')
surnames = ('Smith', 'Doe')

[(first_name, surname) for first_name in first_names for surname in surnames]
Out[42]:
[('Steve', 'Smith'),
 ('Steve', 'Doe'),
 ('John', 'Smith'),
 ('John', 'Doe'),
 ('Peter', 'Smith'),
 ('Peter', 'Doe')]

Anonymous Function Literals

  • We can also write anonymous functions.
  • These are function literals, and do not necessarily have a name.
  • They are called lambda expressions (after the $\lambda$-calculus).
In [43]:
list(map(lambda x: x ** 2, [1, 2, 3, 4]))
Out[43]:
[1, 4, 9, 16]

Filtering data

  • We can filter a list by applying a predicate to each element of the list.

  • A predicate is a function which takes a single argument, and returns a boolean value.

  • filter(p, X) is equivalent to $\{ x \in X : p(x) \}$ in set-builder notation.

In [44]:
list(filter(lambda x: x > 0, [-5, 2, 3, -10, 0, 1]))
Out[44]:
[2, 3, 1]

We can use both filter() and map() on other collections such as strings or sets.

In [45]:
list(filter(lambda x: x > 0, {-5, 2, 3, -10, 0, 1}))
Out[45]:
[1, 2, 3]

Filtering using a list comprehension

  • Again, because this is such a common operation, we can use simpler syntax to say the same thing.

  • We can express a filter using a list-comprehension by using the keyword if:

In [46]:
data = [-5, 2, 3, -10, 0, 1]
[x for x in data if x > 0]
Out[46]:
[2, 3, 1]
  • We can also filter and then map in the same expression:
In [47]:
from numpy import sqrt
[sqrt(x) for x in data if x > 0]
Out[47]:
[1.4142135623730951, 1.7320508075688772, 1.0]

The reduce function

  • The reduce() function repeatedly applies another function to pairs of values over the entire list, reducing it to a single return value.
In [48]:
from functools import reduce
reduce(lambda x, y: x + y, [0, 1, 2, 3, 4, 5])
Out[48]:
15

Big Data

  • The map() and reduce() functions form the basis of the map-reduce programming model (see the toy example after this list).

  • Map-reduce is the basis of modern highly-distributed large-scale computing frameworks.

  • It is used in BigTable, Hadoop and Apache Spark.

  • See these examples in Python for Apache Spark.
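
As a toy illustration of the same pattern in ordinary Python (no Spark required), we can chain map() and reduce() to compute a sum of squares:

from functools import reduce

squares = map(lambda x: x ** 2, [1, 2, 3, 4])  # map step: square each element
total = reduce(lambda x, y: x + y, squares)    # reduce step: combine the results pairwise
print(total)                                   # 30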

Numerical Computing in Python

Steve Phelps

Overview

  • Floating-point representation
  • Arrays and Matrices with numpy
  • Basic plotting with matplotlib
  • Pseudo-random variates with numpy.random

Representing continuous values

  • Digital computers are inherently discrete.

  • Real numbers $x \in R$ cannot always be represented exactly in a digital computer.

  • They are stored in a format called floating-point.

  • IEEE Standard 754 specifies a universal format across different implementations.

    • As always there are deviations from the standard.
  • There are two standard sizes of floating-point numbers: 32-bit and 64-bit.

  • 64-bit numbers are called double precision, and are sometimes called double values.

  • IEEE floating-point calculations are performed in hardware on modern computers.

  • How can we represent arbitrary real values using only 32 bits?

Fixed-point versus floating-point

  • One way we could discretise continuous values is to represent them as two integers $x$ and $y$.

  • The final value is obtained by e.g. $r = x + y \times 10^{-5}$.

  • So the number $500.4421$ would be represented as the tuple $x = 500$, $y = 44210$ (sketched in code below).

  • The exponent $5$ is fixed for all computations.

  • This number represents the precision with which we can represent real values.

  • It corresponds to where we place the decimal point.

  • This scheme is called fixed precision.

  • It is useful in certain circumstances, but suffers from many problems, in particular it can only represent a very limited range of values.

  • In practice, we use variable precision, also known as floating point.
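
A minimal sketch of this scheme in ordinary Python (the names x, y and scale are illustrative, not a standard API):

x, y = 500, 44210    # the two integers of the fixed-point representation
scale = 10 ** -5     # the fixed exponent shared by all values
r = x + y * scale    # reconstruct the represented value
print(r)             # 500.4421 (up to floating-point rounding in the reconstruction)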

Scientific Notation

  • Humans also use a form of floating-point representation.

  • In Scientific notation, all numbers are written in the form $m \times 10^n$.

  • When represented in ASCII, we abbreviate this <m>e<n>, for example 6.72e+11 = $6.72 \times 10^{11}$.

  • The number $m$ is called the significand or mantissa.

  • The integer $n$ is called the exponent.

  • The integer $10$ is the base.

Scientific Notation in Python

  • Python uses Scientific notation when it displays floating-point numbers:
In [1]:
print(672000000000000000.0)
6.72e+17
  • Note that internally, the value is not represented exactly like this.

  • Scientific notation is a convention for writing or rendering numbers, not representing them digitally.

Floating-point representation

  • Floating point numbers use a base of $2$ instead of $10$.

  • Additionally, the mantissa and exponent are stored in binary.

  • Therefore we represent floating-point numbers as $m \times 2 ^ e$.

  • The integers $m$ (mantissa) and $e$ (exponent) are stored in binary.

  • The mantissa uses a sign-and-magnitude representation for positive and negative numbers.

    • One bit is reserved as the sign-bit: 1 for negative, 0 for positive values.
  • The mantissa is normalised, so we assume that it starts with the digit $1$ (which is not stored).
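
We can peek at this representation from within Python: math.frexp() splits a float into a normalised mantissa and an exponent (using the convention $0.5 \le m < 1$ rather than the IEEE $1 \le m < 2$), and float.hex() shows the stored hexadecimal form. A small illustration:

import math

m, e = math.frexp(6.5)   # 6.5 = 0.8125 * 2**3
print(m, e)              # 0.8125 3
print((6.5).hex())       # 0x1.a000000000000p+2, i.e. 1.625 * 2**2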

Bias

  • We also need to represent signed exponents.

  • The exponent does not use two's complement.

  • Instead a bias value is subtracted from the stored exponent ($s$) to obtain the final value ($e$).

  • Double-precision values use a bias of $b = 1023$, and single-precision uses a bias value of $b = 127$.

  • The actual exponent is given by $e = s - b$ where $s$ is the stored exponent (decoded in the sketch below).

  • The stored exponent values $s=0$ and the all-ones pattern ($s=2047$ in double precision) are reserved for special values, discussed later.

  • The stored exponent $s$ is represented in binary without using a sign bit.
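
For example, we can expose the stored exponent of a double using the struct module (a sketch; the value 6.5 is illustrative):

import struct

bits, = struct.unpack('>Q', struct.pack('>d', 6.5))  # reinterpret the 64 bits as an integer
s = (bits >> 52) & 0x7FF                             # the 11-bit stored exponent
e = s - 1023                                         # subtract the bias
print(s, e)                                          # 1025 2, since 6.5 = 1.625 * 2**2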

Double and single precision formats

The number of bits allocated to represent each integer component of a float is given below:

Format   Sign   Exponent   Mantissa   Total
single   1      8          23         32
double   1      11         52         64

Loss of precision

  • We cannot represent every value in floating-point.

  • Consider single-precision (32-bit).

  • Let's try to represent $4,039,944,879$.

Loss of precision

  • As a binary integer we write $4,039,944,879$ as:

11110000 11001100 10101010 10101111

  • This already takes up 32 bits.

  • The mantissa only allows us to store 24-bit integers.

  • So we have to round. We store it as:

+1.1110000 11001100 10101011e+31

  • Which gives us

+11110000 11001100 10101011 00000000

$= 4,039,944,960$
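
We can check this rounding from Python using numpy's single-precision type (assuming numpy is installed):

import numpy as np

x = np.float32(4039944879)   # rounds to the nearest representable single-precision value
print(int(x))                # 4039944960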

Ranges of floating-point values

In single precision arithmetic, we cannot represent the following values:

  • Negative numbers less than $-(2-2^{-23}) \times 2^{127}$

  • Negative numbers greater than $-2^{-149}$

  • Positive numbers less than $2^{-149}$

  • Positive numbers greater than $(2-2^{-23}) \times 2^{127}$

Attempting to represent these numbers results in overflow or underflow.
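
We can query these single-precision limits with numpy's finfo() (a quick check; numpy assumed available):

import numpy as np

info = np.finfo(np.float32)
print(info.max)    # largest finite value, approximately 3.4028235e+38
print(info.tiny)   # smallest positive *normal* value; subnormals extend down to 2**-149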

Effective floating-point range

Format   Binary                              Decimal
single   $\pm (2-2^{-23}) \times 2^{127}$    $\approx \pm 10^{38.53}$
double   $\pm (2-2^{-52}) \times 2^{1023}$   $\approx \pm 10^{308.25}$
In [2]:
import sys
sys.float_info.max
Out[2]:
1.7976931348623157e+308
In [3]:
sys.float_info.min
Out[3]:
2.2250738585072014e-308
In [4]:
sys.float_info
Out[4]:
sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.220446049250313e-16, radix=2, rounds=1)

Range versus precision

  • With a fixed number of bits, we have to choose between:
    • maximising the range of values (minimum to maximum) we can represent,
    • maximising the precision with which we can represent each individual value.
  • These are conflicting objectives:

    • we can increase range, but only by losing precision,
    • we can increase precision, but only by decreasing range.
  • Floating-point addresses this dilemma by allowing the precision to vary ("float") according to the magnitude of the number we are trying to represent.

Floating-point density

  • Floating-point numbers are unevenly-spaced over the line of real-numbers.

  • The precision decreases as we increase the magnitude.

[figure: density of floating-point values along the real line]

Representing Zero

  • Zero cannot be represented straightforwardly because we assume that all mantissa values start with the digit 1.

  • Zero is stored as a special-case, by setting mantissa and exponent both to zero.

  • The sign-bit can either be set or unset, so there are distinct positive and negative representations of zero.

Zero in Python

In [2]:
x = +0.0
x
Out[2]:
0.0
In [3]:
y = -0.0
y
Out[3]:
-0.0
  • However, these are considered equal:
In [4]:
x == y
Out[4]:
True

Infinity

  • Positive overflow results in a special value of infinity (in Python inf).

  • This is stored with an exponent consisting of all 1s, and a mantissa of all 0s.

  • The sign-bit allows us to differentiate between negative and positive overflow: $-\infty$ and $+\infty$.

  • This allows us to carry on calculating past an overflow event.

Infinity in Python

In [5]:
x = 1e300 * 1e100
x
Out[5]:
inf
In [6]:
x = x + 1
x
Out[6]:
inf

Negative infinity in Python

In [7]:
x > 0
Out[7]:
True
In [8]:
y = -x
y
Out[8]:
-inf
In [9]:
y < x
Out[9]:
True

Not A Number (NaN)

  • Some mathematical operations on real numbers do not map onto real numbers.

  • These results are represented using the special value NaN, which represents "not a (real) number".

  • NaN is represented by an exponent of all 1s, and a non-zero mantissa.

NaN in Python

In [10]:
from numpy import sqrt, inf, isnan, nan
x = sqrt(-1)
x
/home/sphelps/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:2: RuntimeWarning: invalid value encountered in sqrt
  
Out[10]:
nan
In [16]:
y = inf - inf
y
Out[16]:
nan

Comparing nan values in Python

  • Beware of comparing nan values
In [17]:
x == y
Out[17]:
False
  • To test whether a value is nan use the isnan function:
In [18]:
isnan(x)
Out[18]:
True

NaN is not the same as None

  • None represents a missing value.

  • NaN represents an invalid floating-point value.

  • These are fundamentally different entities:

In [19]:
nan is None
Out[19]:
False
In [20]:
isnan(None)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-20-4a0142ec4134> in <module>()
----> 1 isnan(None)

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Beware finite precision

In [21]:
x = 0.1 + 0.2
In [22]:
x == 0.3
Out[22]:
False
In [23]:
x
Out[23]:
0.30000000000000004

Relative and absolute error

  • Consider a floating point number $x_{fp}$ which represents a real number $x \in \mathbf{R}$.

  • In general, we cannot precisely represent the real number; that is $x_{fp} \neq x$.

  • The absolute error $r$ is $r = x - x_{fp}$.

  • The relative error $R$ is:

\begin{equation} R = \frac{x - x_{fp}}{x} \end{equation}
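
We can compute both errors exactly for a concrete case using exact rational arithmetic from the fractions module (a sketch, taking $x = 1/3$):

from fractions import Fraction

x = Fraction(1, 3)         # the exact real value
x_fp = 1 / 3               # its 64-bit floating-point approximation
r = x - Fraction(x_fp)     # absolute error; Fraction(float) recovers the stored value exactly
R = r / x                  # relative error
print(float(r), float(R))  # both of order 1e-17, i.e. near machine epsilon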

Numerical Methods

  • In e.g. simulation models or quantitative analysis we typically repeatedly update numerical values inside long loops.

  • Programs such as these implement numerical algorithms.

  • It is very easy to introduce bugs into code like this.

Numerical stability

  • The round-off error associated with a result can be compounded in a loop.

  • If the error increases as we go round the loop, we say the algorithm is numerically unstable.

  • Mathematicians design numerically stable algorithms using numerical analysis.

Catastrophic Cancellation

  • Suppose we have two real values $x$, and $y = x + \epsilon$.

  • $\epsilon$ is very small and $x$ is very large.

  • $x$ has an exact floating point representation

  • However, because of lack of precision $x$ and $y$ have the same floating point representation.

    • i.e. they are represented as the same sequence of 64-bits
  • Consider what happens when we compute $y - x$ in floating-point.

Catastrophic Cancellation and Relative Error

  • Catastrophic cancellation results in very large relative error.

  • If we calculate $y - x$ in floating-point we will obtain the result 0.

  • The correct value is $(x + \epsilon) - x = \epsilon$.

  • The relative error is

\begin{equation} \frac {\epsilon - 0} {\epsilon} = 1 \end{equation}
  • That is, the relative error is $100\%$.

  • This can result in catastrophe.

Catastrophic Cancellation in Python

In [3]:
x = 3.141592653589793
x
Out[3]:
3.141592653589793
In [4]:
y = 6.022e23
x = (x + y) - y
In [5]:
x
Out[5]:
0.0

Cancellation versus addition

  • Addition, on the other hand, is not catastrophic.
In [6]:
z = x + y
z
Out[6]:
6.022e+23
  • The above result is still inaccurate with an absolute error $r \approx \pi$.

  • However, let's examine the relative error:

\begin{equation} R = \frac {(6.022 \times 10^{23} + \pi) - 6.022 \times 10^{23}} {6.022 \times 10^{23} + \pi} \approx 5 \times 10^{-24} \end{equation}
  • Here we see that the relative error from the addition is minuscule compared with the cancellation.

Floating-point arithmetic is nearly always inaccurate.

  • You can hardly ever eliminate absolute rounding error when using floating-point.

  • The best we can do is to take steps to minimise error, and to prevent it from increasing as the calculation progresses.

  • Cancellation can be catastrophic, because it can greatly increase the relative error in your calculation.

Use a well-tested library for numerical algorithms.

  • Avoid subtracting two nearly-equal numbers.

  • Especially in a loop!

  • Better yet, use a well-validated existing implementation in the form of a numerical library, as in the sketch below.
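
For example, computing $e^x - 1$ for small $x$ naively subtracts two nearly-equal numbers; numpy's expm1() is a well-tested routine designed to avoid exactly this cancellation:

import numpy as np

x = 1e-12
naive = np.exp(x) - 1    # exp(x) is very close to 1, so the subtraction cancels
better = np.expm1(x)     # library implementation that avoids the cancellation
print(naive, better)     # the naive result has lost most of its significant digits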

Importing numpy

  • Functions for numerical computing are provided by a separate module called numpy.

  • Before we use the numpy module we must import it.

  • By convention, we import numpy using the alias np.

  • Once we have done this we can prefix the functions in the numpy library using the prefix np.

In [27]:
import numpy as np
  • We can now use the functions defined in this package by prefixing them with np.

Arrays

  • Arrays represent a collection of values.

  • In contrast to lists:

    • arrays typically have a fixed length
      • they can be resized, but this involves an expensive copying process.
    • and all values in the array are of the same type.
      • typically we store floating-point values.
  • Like lists:

    • arrays are mutable;
    • we can change the elements of an existing array.

Arrays in numpy

  • Arrays are provided by the numpy module.

  • The function array() creates an array given a list.

In [28]:
import numpy as np
x = np.array([0, 1, 2, 3, 4])
x
Out[28]:
array([0, 1, 2, 3, 4])

Array indexing

  • We can index an array just like a list
In [29]:
x[4]
Out[29]:
4
In [30]:
x[4] = 2
x
Out[30]:
array([0, 1, 2, 3, 2])

Arrays are not lists

  • Although this looks a bit like a list of numbers, it is a fundamentally different type of value:
In [31]:
type(x)
Out[31]:
numpy.ndarray
  • For example, we cannot append to the array:
In [32]:
x.append(5)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-32-7e52d4acf950> in <module>()
----> 1 x.append(5)

AttributeError: 'numpy.ndarray' object has no attribute 'append'

Populating Arrays

  • To populate an array with a range of values we use the np.arange() function:
In [34]:
x = np.arange(0, 10)
print(x)
[0 1 2 3 4 5 6 7 8 9]
  • We can also use floating point increments.
In [66]:
x = np.arange(0, 1, 0.1)
print(x)
[0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9]

Functions over arrays

  • When we use arithmetic operators on arrays, we create a new array with the result of applying the operator to each element.
In [67]:
y = x * 2
y
Out[67]:
array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8])
  • The same goes for numerical functions:
In [68]:
x = np.array([-1, 2, 3, -4])
y = abs(x)
y
Out[68]:
array([1, 2, 3, 4])

Vectorized functions

  • Note that not every function automatically works with arrays.

  • Functions that have been written to work with arrays of numbers are called vectorized functions.

  • Most of the functions in numpy are already vectorized.

  • You can create a vectorized version of any other function using the higher-order function numpy.vectorize().

vectorize example

In [69]:
def myfunc(x):
    if x >= 0.5:
        return x
    else:
        return 0.0
    
fv = np.vectorize(myfunc)
In [70]:
x = np.arange(0, 1, 0.1)
x
Out[70]:
array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
In [71]:
fv(x)
Out[71]:
array([0. , 0. , 0. , 0. , 0. , 0.5, 0.6, 0.7, 0.8, 0.9])

Testing for equality

  • Because of finite precision we need to take great care when comparing floating-point values.

  • The numpy function allclose() can be used to test equality of floating-point numbers within a relative tolerance.

  • It is a vectorized function so it will work with arrays as well as single floating-point values.

In [72]:
x = 0.1 + 0.2
y = 0.3
x == y
Out[72]:
False
In [73]:
np.allclose(x, y)
Out[73]:
True

Plotting with matplotlib

  • We will use a module called matplotlib to plot some simple graphs.

  • This module has a nested module called pyplot.

  • By convention we import this with the alias plt.

  • This module provides functions which are very similar to MATLAB plotting commands.

In [ ]:
import matplotlib.pyplot as plt

A simple linear plot

In [158]:
x = np.arange(0, 1, 0.1)
y = x*2 + 5
plt.plot(x, y)
plt.xlabel('$x$')
plt.ylabel('$y = 2x + 5$')
plt.title('Linear plot')
plt.show()

Plotting a sine curve

In [154]:
from numpy import pi, sin

x = np.arange(0, 2*pi, 0.01)
y = sin(x)
plt.plot(x, y)
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.show()

Multi-dimensional data

  • Numpy arrays can hold multi-dimensional data.

  • To create a multi-dimensional array, we can pass a list of lists to the array() function:

In [100]:
import numpy as np

x = np.array([[1,2], [3,4]])
x
Out[100]:
array([[1, 2],
       [3, 4]])

Arrays containing arrays

  • A multi-dimensional array is an array of arrays.

  • The outer array holds the rows.

  • Each row is itself an array:

In [101]:
x[0]
Out[101]:
array([1, 2])
In [102]:
x[1]
Out[102]:
array([3, 4])
  • So the element in the second row, and first column is:
In [103]:
x[1][0]
Out[103]:
3

Matrices

  • We can create a matrix from a multi-dimensional array.
In [104]:
M = np.matrix(x)
M
Out[104]:
matrix([[1, 2],
        [3, 4]])

Plotting multi-dimensional data with matrices

  • If we supply a matrix to plot() then it will plot the y-values taken from the columns of the matrix (notice the transpose in the example below).
In [105]:
from numpy import pi, sin, cos

x = np.arange(0, 2*pi, 0.01)
y = sin(x)
ax = plt.plot(x, np.matrix([sin(x), cos(x)]).T)
plt.show()

Performance

  • When we use numpy matrices in Python the corresponding functions are linked with libraries written in C and FORTRAN.

  • For example, see the BLAS (Basic Linear Algebra Subprograms) library.

  • These libraries are very fast.

  • Vectorised code can be more easily ported to frameworks like TensorFlow so that operations are performed in parallel using GPU hardware.

Matrix Operators

  • Once we have a matrix, we can perform matrix computations.

  • To compute the transpose and inverse use the T and I attributes:

To compute the transpose $M^{T}$

In [106]:
M.T
Out[106]:
matrix([[1, 3],
        [2, 4]])

To compute the inverse $M^{-1}$

In [107]:
M.I
Out[107]:
matrix([[-2. ,  1. ],
        [ 1.5, -0.5]])

Matrix Dimensions

  • The total number of elements, and the dimensions of the array:
In [108]:
M.size
Out[108]:
4
In [109]:
M.shape
Out[109]:
(2, 2)
In [110]:
len(M.shape)
Out[110]:
2

Creating Matrices from strings

  • We can also create matrices directly from strings, which saves some typing:
In [111]:
I2 = np.matrix('2 0; 0 2')
I2
Out[111]:
matrix([[2, 0],
        [0, 2]])
  • The semicolon starts a new row.

Matrix Multiplication

Now that we have two matrices, we can perform matrix multiplication:

In [112]:
M * I2
Out[112]:
matrix([[2, 4],
        [6, 8]])

Matrix Indexing

In [113]:
M[:,1]
Out[113]:
matrix([[2],
        [4]])

Slices are references

  • If we use this in an assignment, we create a reference to the sliced elements, not a copy.
In [114]:
V = M[:,1]  # This does not make a copy of the elements!
V
Out[114]:
matrix([[2],
        [4]])
In [115]:
M[0,1] = -2
V
Out[115]:
matrix([[-2],
        [ 4]])

Copying matrices and vectors

  • To copy a matrix, or a slice of its elements, use the function np.copy():
In [116]:
M = np.matrix('1 2; 3 4')
V = np.copy(M[:,1])  # This does copy the elements.
V
Out[116]:
array([[2],
       [4]])
In [117]:
M[0,1] = -2
V
Out[117]:
array([[2],
       [4]])

Sums

One way we could sum a vector or matrix is to use a for loop.

In [118]:
vector = np.arange(0.0, 100.0, 10.0)
vector
Out[118]:
array([ 0., 10., 20., 30., 40., 50., 60., 70., 80., 90.])
In [119]:
result = 0.0
for x in vector:
    result = result + x
result
Out[119]:
450.0
  • This is not the most efficient way to compute a sum.

Efficient sums

  • Instead of using a for loop, we can use the numpy function sum().

  • This function is written in the C language, and is very fast.

In [121]:
vector = np.array([0, 1, 2, 3, 4])
print(np.sum(vector))
10
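
A rough comparison using the timeit module (a sketch; absolute timings will vary by machine):

import numpy as np
import timeit

v = np.arange(100_000, dtype=float)
print(timeit.timeit(lambda: sum(v), number=10))     # Python-level loop over the array
print(timeit.timeit(lambda: np.sum(v), number=10))  # vectorised C implementation, much faster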

Summing rows and columns

  • When dealing with multi-dimensional data, the sum() function has a named argument axis which allows us to specify whether to sum over rows or columns.
In [123]:
matrix = np.matrix('1 2 3; 4 5 6; 7 8 9')
print(matrix)
[[1 2 3]
 [4 5 6]
 [7 8 9]]

To sum over the rows (totalling each column):

In [124]:
np.sum(matrix, axis=0)
Out[124]:
matrix([[12, 15, 18]])

To sum over the columns (totalling each row):

In [125]:
np.sum(matrix, axis=1)
Out[125]:
matrix([[ 6],
        [15],
        [24]])

Cumulative sums

  • Suppose we want to compute $y_n = \sum_{i=1}^{n} x_i$ where $\vec{x}$ is a vector.
In [127]:
import numpy as np
x = np.array([0, 1, 2, 3, 4])
y = np.cumsum(x)
print(y)
[ 0  1  3  6 10]

Cumulative sums along rows and columns

In [129]:
x = np.matrix('1 2 3; 4 5 6; 7 8 9')
print(x)
[[1 2 3]
 [4 5 6]
 [7 8 9]]
In [130]:
np.cumsum(x, axis=0)
Out[130]:
matrix([[ 1,  2,  3],
        [ 5,  7,  9],
        [12, 15, 18]])
In [131]:
np.cumsum(x, axis=1)
Out[131]:
matrix([[ 1,  3,  6],
        [ 4,  9, 15],
        [ 7, 15, 24]])

Cumulative products

  • Similarly we can compute $y_n = \prod_{i=1}^{n} x_i$ using cumprod():
In [132]:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
np.cumprod(x)
Out[132]:
array([  1,   2,   6,  24, 120])
  • We can compute cumulative products along rows and columns using the axis parameter, just as with the cumsum() example.

Generating (pseudo) random numbers

  • The nested module numpy.random contains functions for generating random numbers from different probability distributions.
In [133]:
from numpy.random import normal, uniform, exponential, randint
  • Suppose that we have a random variable $\epsilon \sim N(0, 1)$.

  • In Python we can draw from this distribution like so:

In [135]:
epsilon = normal()
epsilon
Out[135]:
0.1465312427787133
  • If we execute another call to the function, we will make a new draw from the distribution:
In [136]:
epsilon = normal()
epsilon
Out[136]:
-2.3135050810408773

Pseudo-random numbers

  • Strictly speaking, these are not random numbers.

  • They rely on an initial state value called the seed.

  • If we know the seed, then we can predict the rest of the sequence with total accuracy, given any "random" number.

  • Nevertheless, statistically they behave like independently and identically-distributed values.

    • Statistical tests for correlation and auto-correlation give insignificant results.
  • For this reason they are called pseudo-random numbers.

  • The algorithms for generating them are called Pseudo-Random Number Generators (PRNGs).

  • Some applications, such as cryptography, require genuinely unpredictable sequences.

    • never use a standard PRNG for these applications!

Managing seed values

  • In some applications we need to reliably reproduce the same sequence of pseudo-random numbers that were used.

  • We can specify the seed value at the beginning of execution to achieve this.

  • Use the function seed() in the numpy.random module.

Setting the seed

In [137]:
from numpy.random import seed

seed(5)
In [138]:
normal()
Out[138]:
0.44122748688504143
In [139]:
normal()
Out[139]:
-0.33087015189408764
In [140]:
seed(5)
In [141]:
normal()
Out[141]:
0.44122748688504143
In [142]:
normal()
Out[142]:
-0.33087015189408764

Drawing multiple variates

  • To generate more than one number, we can specify the size parameter:
In [143]:
normal(size=10)
Out[143]:
array([ 2.43077119, -0.25209213,  0.10960984,  1.58248112, -0.9092324 ,
       -0.59163666,  0.18760323, -0.32986996, -1.19276461, -0.20487651])
  • If you are generating very many variates, this will be much faster than using a for loop.
  • We can also specify more than one dimension:
In [144]:
normal(size=(5,5))
Out[144]:
array([[-0.35882895,  0.6034716 , -1.66478853, -0.70017904,  1.15139101],
       [ 1.85733101, -1.51117956,  0.64484751, -0.98060789, -0.85685315],
       [-0.87187918, -0.42250793,  0.99643983,  0.71242127,  0.05914424],
       [-0.36331088,  0.00328884, -0.10593044,  0.79305332, -0.63157163],
       [-0.00619491, -0.10106761, -0.05230815,  0.24921766,  0.19766009]])

Histograms

  • We can plot a histogram of randomly-distributed data using the hist() function from matplotlib:
In [153]:
import matplotlib.pyplot as plt

data = normal(size=10000)
ax = plt.hist(data)
plt.title('Histogram of normally distributed data ($n=10^4$)')
plt.show()

Computing histograms as matrices

  • The function histogram() in the numpy module will count frequencies into bins and return the result as a 2-dimensional array.
In [146]:
import numpy as np
np.histogram(data)
Out[146]:
(array([  23,  136,  618, 1597, 2626, 2635, 1620,  599,  130,   16]),
 array([-3.59780883, -2.87679609, -2.15578336, -1.43477063, -0.71375789,
         0.00725484,  0.72826758,  1.44928031,  2.17029304,  2.89130578,
         3.61231851]))

Descriptive statistics

  • We can compute the descriptive statistics of a sample of values using the numpy functions mean() and var(), which compute the sample mean $\bar{X}$ and sample variance $\sigma_{X}^2$.
In [147]:
np.mean(data)
Out[147]:
-0.00045461080333497925
In [148]:
np.var(data)
Out[148]:
1.0016048722546331
  • These functions also have an axis parameter to compute the means and variances of columns or rows of a multi-dimensional data-set.
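
For example, with a two-dimensional data-set (a sketch; the data here are freshly generated, so the exact numbers will differ):

import numpy as np
from numpy.random import normal

data2d = normal(size=(1000, 3))  # 1000 observations of 3 variables
print(np.mean(data2d, axis=0))   # sample mean of each column
print(np.var(data2d, axis=0))    # sample variance of each column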

Descriptive statistics with nan values

  • If the data contains nan values, then the descriptive statistics will also be nan.
In [149]:
from numpy import nan
import numpy as np
data = np.array([1, 2, 3, 4, nan])
np.mean(data)
Out[149]:
nan
  • To omit nan values from the calculation, use the functions nanmean() and nanvar():
In [150]:
np.nanmean(data)
Out[150]:
2.5

Discrete random numbers

  • The randint() function in numpy.random can be used to draw from a uniform discrete probability distribution.

  • It takes two parameters: the low value (inclusive), and the high value (exclusive).

  • So to simulate one roll of a die, we would use the following Python code.

In [151]:
die_roll = randint(0, 6) + 1
die_roll
Out[151]:
4
  • Just as with the normal() function, we can generate an entire sequence of values.

  • To simulate a Bernoulli process with $n=20$ trials:

In [152]:
bernoulli_trials = randint(0, 2, size = 20)
bernoulli_trials
Out[152]:
array([1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0])

Financial data with data frames

Steve Phelps

Data frames

  • The pandas module provides a powerful data-structure called a data frame.

  • It is similar, but not identical to:

    • a table in a relational database,
    • an Excel spreadsheet,
    • a dataframe in R.

Types of data

Data frames can be used to represent:

Loading data

  • Data frames can be read and written to/from:

    • financial web sites
    • database queries
    • database tables
    • CSV files
    • json files
  • Beware that data frames are memory resident;

    • If you read a large amount of data your PC might crash
    • With big data, typically you would read a subset or summary of the data via e.g. a select statement.

Importing pandas

  • The pandas module is usually imported with the alias pd.
In [1]:
import pandas as pd

Series

  • A Series contains a one-dimensional array of data, and an associated sequence of labels called the index.

  • The index can contain numeric, string, or date/time values.

  • When the index is a time value, the series is a time series.

  • The index must be the same length as the data.

  • If no index is supplied it is automatically generated as range(len(data)).

Creating a series from an array

In [2]:
import numpy as np
data = np.random.randn(5)
data
Out[2]:
array([ 1.67809544,  0.29118986, -0.86985529, -0.91905744, -0.10415342])
In [3]:
my_series = pd.Series(data, index=['a', 'b', 'c', 'd', 'e'])
my_series
Out[3]:
a    1.678095
b    0.291190
c   -0.869855
d   -0.919057
e   -0.104153
dtype: float64

Plotting a series

  • We can plot a series by invoking the plot() method on an instance of a Series object.

  • The x-axis will automatically be labelled with the series index.

In [4]:
import matplotlib.pyplot as plt
my_series.plot()
plt.show()

Creating a series with automatic index

  • In the following example the index is created automatically:
In [5]:
pd.Series(data)
Out[5]:
0    1.678095
1    0.291190
2   -0.869855
3   -0.919057
4   -0.104153
dtype: float64

Creating a Series from a dict

In [6]:
d = {'a' : 0., 'b' : 1., 'c' : 2.}
my_series = pd.Series(d)
my_series
Out[6]:
a    0.0
b    1.0
c    2.0
dtype: float64

Indexing a series with []

  • Series can be accessed using the same syntax as arrays and dicts.

  • We use the labels in the index to access each element.

In [7]:
my_series['b']
Out[7]:
1.0
  • We can also use the label like an attribute:
In [8]:
my_series.b
Out[8]:
1.0

Slicing a series

  • We can specify a list of labels to obtain a slice:
In [9]:
my_series[['b', 'c']]
Out[9]:
b    1.0
c    2.0
dtype: float64

Arithmetic and vectorised functions

  • numpy vectorization works for series objects too.
In [10]:
d = {'a' : 0., 'b' : 1., 'c' : 2.}
squared_values = pd.Series(d) ** 2
squared_values
Out[10]:
a    0.0
b    1.0
c    4.0
dtype: float64
In [11]:
x = pd.Series({'a' : 0., 'b' : 1., 'c' : 2.})
y = pd.Series({'a' : 3., 'b' : 4., 'c' : 5.})
x + y
Out[11]:
a    3.0
b    5.0
c    7.0
dtype: float64

Time series

In [12]:
dates = pd.date_range('1/1/2000', periods=5)
dates
Out[12]:
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05'],
              dtype='datetime64[ns]', freq='D')
In [13]:
time_series = pd.Series(data, index=dates)
time_series
Out[13]:
2000-01-01    1.678095
2000-01-02    0.291190
2000-01-03   -0.869855
2000-01-04   -0.919057
2000-01-05   -0.104153
Freq: D, dtype: float64

Plotting a time-series

In [14]:
ax = time_series.plot()

Missing values

  • Pandas uses nan to represent missing data.

  • So nan is used to represent missing, invalid or unknown data values.

  • It is important to note that this convention only applies within pandas.

    • Other frameworks have very different semantics for these values.
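
Within pandas, we can detect and handle these missing values with the isna(), dropna() and fillna() methods (a small sketch; the series here is illustrative):

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
print(s.isna())       # flags the missing entry
print(s.dropna())     # drops it
print(s.fillna(0.0))  # replaces it with a default value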

DataFrame

  • A data frame has multiple columns, each of which can hold a different type of value.

  • Like a series, it has an index which provides a label for each and every row.

  • Data frames can be constructed from:

    • dict of arrays,
    • dict of lists,
    • dict of dict
    • dict of Series
    • 2-dimensional array
    • a single Series
    • another DataFrame

Creating a dict of series

In [15]:
series_dict = {
        'x' : 
            pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
        'y' : 
            pd.Series([4., 5., 6., 7.], index=['a', 'b', 'c', 'd']),
        'z' :
            pd.Series([0.1, 0.2, 0.3, 0.4], index=['a', 'b', 'c', 'd'])
}

series_dict
Out[15]:
{'x': a    1.0
 b    2.0
 c    3.0
 dtype: float64,
 'y': a    4.0
 b    5.0
 c    6.0
 d    7.0
 dtype: float64,
 'z': a    0.1
 b    0.2
 c    0.3
 d    0.4
 dtype: float64}

Converting the dict to a data frame

In [16]:
df = pd.DataFrame(series_dict)
df
Out[16]:
   x    y    z
a  1.0  4.0  0.1
b  2.0  5.0  0.2
c  3.0  6.0  0.3
d  NaN  7.0  0.4

Plotting data frames

  • When plotting a data frame, each column is plotted as its own series on the same graph.

  • The column names are used to label each series.

  • The row names (index) are used to label the x-axis.

In [17]:
ax = df.plot()

Indexing

  • The outer dimension is the column index.

  • When we retrieve a single column, the result is a Series

In [18]:
df['x']
Out[18]:
a    1.0
b    2.0
c    3.0
d    NaN
Name: x, dtype: float64
In [19]:
df['x']['b']
Out[19]:
2.0
In [20]:
df.x.b
Out[20]:
2.0

Projections

  • Data frames can be sliced just like series.
  • When we slice columns we call this a projection, because it is analogous to specifying a subset of attributes in a relational query, e.g. SELECT x FROM table.
  • If we project a single column the result is a series:
In [21]:
slice = df['x'][['b', 'c']]
slice
Out[21]:
b    2.0
c    3.0
Name: x, dtype: float64
In [22]:
type(slice)
Out[22]:
pandas.core.series.Series

Projecting multiple columns

  • When we include multiple columns in the projection the result is a DataFrame.
In [23]:
slice = df[['x', 'y']]
slice
Out[23]:
   x    y
a  1.0  4.0
b  2.0  5.0
c  3.0  6.0
d  NaN  7.0
In [24]:
type(slice)
Out[24]:
pandas.core.frame.DataFrame

Vectorization

  • Vectorized functions and operators work just as with series objects:
In [25]:
df['x'] + df['y']
Out[25]:
a    5.0
b    7.0
c    9.0
d    NaN
dtype: float64
In [26]:
df ** 2
Out[26]:
   x    y     z
a  1.0  16.0  0.01
b  4.0  25.0  0.04
c  9.0  36.0  0.09
d  NaN  49.0  0.16

Logical indexing

  • We can use logical indexing to retrieve a subset of the data.
In [27]:
df['x'] >= 2
Out[27]:
a    False
b     True
c     True
d    False
Name: x, dtype: bool
In [28]:
df[df['x'] >= 2]
Out[28]:
   x    y    z
b  2.0  5.0  0.2
c  3.0  6.0  0.3

Combining predicates

operator   meaning
&          and
|          or
~          not
In [29]:
df[(df['x'] >= 2) & (df['y'] > 5)]
Out[29]:
   x    y    z
c  3.0  6.0  0.3
In [53]:
df[~((df['x'] >= 2) & (df['y'] > 5))]
Out[53]:
   x    y    z
a  1.0  4.0  0.1
b  2.0  5.0  0.2
d  NaN  7.0  0.4

Descriptive statistics

  • To quickly obtain descriptive statistics on numerical values use the describe method.
In [30]:
df.describe()
Out[30]:
         x         y         z
count  3.0  4.000000  4.000000
mean   2.0  5.500000  0.250000
std    1.0  1.290994  0.129099
min    1.0  4.000000  0.100000
25%    1.5  4.750000  0.175000
50%    2.0  5.500000  0.250000
75%    2.5  6.250000  0.325000
max    3.0  7.000000  0.400000

Accessing a single statistic

  • The result is itself a DataFrame, so we can index a particular statistic like so:
In [31]:
df.describe()['x']['mean']
Out[31]:
2.0

Accessing the row and column labels

  • The row labels (index) and column labels can be accessed:
In [32]:
df.index
Out[32]:
Index(['a', 'b', 'c', 'd'], dtype='object')
In [33]:
df.columns
Out[33]:
Index(['x', 'y', 'z'], dtype='object')

Head and tail

  • Data frames have head() and tail() methods which behave analogously to the Unix commands of the same name.

Financial data

  • Pandas was originally developed to analyse financial data.

  • We can download tabulated data in a portable format called Comma Separated Values (CSV).

In [34]:
import pandas as pd
googl = pd.read_csv('data/GOOGL.csv')

Examining the first few rows

  • When working with large data sets it is useful to view just the first/last few rows in the dataset.

  • We can use the head() method to retrieve the first rows:

In [35]:
googl.head()
Out[35]:
   Date        Open        High        Low         Close       Adj Close   Volume
0  2013-11-13  503.878876  516.941956  503.753754  516.751770  516.751770  3155600
1  2013-11-14  517.477478  520.395386  515.690674  518.133118  518.133118  2331000
2  2013-11-15  517.952942  519.519531  515.670654  517.297302  517.297302  2550000
3  2013-11-18  518.393372  524.894897  515.135132  516.291321  516.291321  3515800
4  2013-11-19  516.376404  517.892883  512.037048  513.113098  513.113098  2260900

Examining the last few rows

In [36]:
googl.tail()
Out[36]:
      Date        Open         High         Low          Close        Adj Close    Volume
1505  2019-11-06  1290.089966  1292.989990  1282.270020  1291.010010  1291.010010  1231300
1506  2019-11-07  1294.280029  1322.650024  1293.750000  1306.939941  1306.939941  2257000
1507  2019-11-08  1301.520020  1317.109985  1301.520020  1309.000000  1309.000000  1519600
1508  2019-11-11  1304.000000  1304.900024  1295.869995  1298.280029  1298.280029  861700
1509  2019-11-12  1298.569946  1309.349976  1294.239990  1297.209961  1297.209961  1442600

Parsing time stamps

  • So far, the Date attribute is of type string.
In [37]:
googl.Date[0]
Out[37]:
'2013-11-13'
In [38]:
type(googl.Date[0])
Out[38]:
str

Converting to datetime values

  • In order to work with time-series data, we need to construct an index containing time values.

  • Time values are of type datetime or Timestamp.

  • We can use the function to_datetime() to convert strings to time values.

In [39]:
pd.to_datetime(googl['Date']).head()
Out[39]:
0   2013-11-13
1   2013-11-14
2   2013-11-15
3   2013-11-18
4   2013-11-19
Name: Date, dtype: datetime64[ns]

Setting the index

  • Now we need to set the index of the data-frame so that it contains the sequence of dates.
In [40]:
googl.set_index(pd.to_datetime(googl['Date']), inplace=True)
googl.index[0]
Out[40]:
Timestamp('2013-11-13 00:00:00')
In [41]:
type(googl.index[0])
Out[41]:
pandas._libs.tslibs.timestamps.Timestamp

Plotting series

  • We can plot a series in a dataframe by invoking its plot() method.

  • Here we plot a time-series of the daily traded volume:

In [42]:
ax = googl['Volume'].plot()
plt.show()

Adjusted closing prices as a time series

In [43]:
googl['Adj Close'].plot()
plt.show()

Slicing series using date/time stamps

  • We can slice a time series by specifying a range of dates or times.

  • Date and time stamps are specified as strings representing dates in the required format.

In [44]:
googl['Adj Close']['1-1-2016':'1-1-2017'].plot()
plt.show()

Resampling

  • We can resample to obtain e.g. weekly or monthly prices.

  • In the example below the 'W' denotes weekly.

  • See the documentation for other frequencies.

  • We group data into weeks, and then take the last value in each week.

  • For details of other ways to resample the data, see the documentation.

Resampled time-series plot

In [45]:
weekly_prices = googl['Adj Close'].resample('W').last()
weekly_prices.head()
Out[45]:
Date
2013-11-17    517.297302
2013-11-24    516.461487
2013-12-01    530.325317
2013-12-08    535.470459
2013-12-15    530.925903
Freq: W-SUN, Name: Adj Close, dtype: float64
In [46]:
weekly_prices.plot()
plt.title('Prices for GOOGL sampled at weekly frequency')
plt.show()

Converting prices to log returns

Log prices

In [47]:
log_prices = np.log(weekly_prices)
log_prices.head()
Out[47]:
Date
2013-11-17    6.248618
2013-11-24    6.247001
2013-12-01    6.273491
2013-12-08    6.283146
2013-12-15    6.274622
Freq: W-SUN, Name: Adj Close, dtype: float64

Differences

In [48]:
diffs = log_prices.diff()
diffs.head()
Out[48]:
Date
2013-11-17         NaN
2013-11-24   -0.001617
2013-12-01    0.026490
2013-12-08    0.009655
2013-12-15   -0.008523
Freq: W-SUN, Name: Adj Close, dtype: float64

Remove missing values

In [49]:
weekly_rets = diffs.dropna()
weekly_rets.head()
Out[49]:
Date
2013-11-24   -0.001617
2013-12-01    0.026490
2013-12-08    0.009655
2013-12-15   -0.008523
2013-12-22    0.036860
Freq: W-SUN, Name: Adj Close, dtype: float64

Log return time series plot

In [50]:
plt.plot(weekly_rets)
plt.xlabel('t'); plt.ylabel('$r_t$')
plt.title('Weekly log-returns for GOOGL')
plt.show()

Plotting a return histogram

In [51]:
weekly_rets.hist()
plt.show()

Descriptive statistics of returns

In [52]:
weekly_rets.describe()
Out[52]:
count    313.000000
mean       0.002937
std        0.032039
min       -0.099918
25%       -0.013341
50%        0.004653
75%        0.021327
max        0.229571
Name: Adj Close, dtype: float64

Statistics and optimization with SciPy

The SciPy library

  • SciPy is a library that provides several modules for scientific computing.

  • You can read more about it by reading the reference guide.

  • It provides modules for:

    • Solving optimization problems.
    • Linear algebra.
    • Interpolation.
    • Statistical inference.
    • Fourier transform.
    • Numerical differentiation and integration.

Overview

  1. loading data with pandas,
  2. computing returns,
  3. Quantile-Quantile (Q-Q) plots,
  4. the Jarque-Bera test for normally-distributed data,
  5. ordinary least-squares (OLS) regression,
  6. portfolio optimization.

Loading data

  • We will first obtain some data from Yahoo finance using the pandas library.

  • First we will import the functions and modules we need.

In [1]:
import matplotlib.pyplot as plt
import datetime
import pandas as pd
import numpy as np

Importing a CSV file

In [2]:
def prices_from_csv(fname):
    df = pd.read_csv(fname)
    df.set_index(pd.to_datetime(df['Date']), inplace=True)
    return df
In [3]:
msft = prices_from_csv('data/MSFT.csv')
msft.head()
Out[3]:
Date Open High Low Close Adj Close Volume
Date
2002-07-01 2002-07-01 27.059999 27.195000 26.290001 26.330000 16.997240 66473800
2002-07-02 2002-07-02 26.190001 26.459999 25.665001 25.719999 16.603462 82814200
2002-07-03 2002-07-03 25.620001 26.260000 25.225000 25.920000 16.732574 80936600
2002-07-05 2002-07-05 26.545000 27.450001 26.525000 27.424999 17.704121 35673600
2002-07-08 2002-07-08 27.205000 27.465000 26.290001 26.459999 17.081167 63199400

Data preparation

Daily prices

In [4]:
msft['Adj Close'].plot()
plt.ylabel('MSFT price')
plt.show()

Converting to monthly data

  • We will resample the data at a frequency of one calendar month.

  • The code below takes the last price in every month.

In [5]:
daily_prices = msft['Adj Close']
In [6]:
monthly_prices = daily_prices.resample('M').last()
monthly_prices.plot()
plt.ylabel('MSFT Price')
plt.show()

Calculating log returns

In [7]:
stock_returns = pd.DataFrame({'MSFT monthly returns': np.log(monthly_prices).diff().dropna()})
stock_returns.plot()
plt.xlabel('t'); plt.ylabel('$r_t$')
plt.show()

Exploratory data analysis

Return histogram

In [8]:
stock_returns.hist()
plt.show()

Descriptive statistics of the return distribution

In [9]:
stock_returns.describe()
Out[9]:
MSFT monthly returns
count 208.000000
mean 0.010822
std 0.064673
min -0.178358
25% -0.031284
50% 0.018196
75% 0.051398
max 0.222736

Summarising the distribution using a boxplot

In [10]:
stock_returns.boxplot()
plt.show()

Q-Q plots

  • Quantile-Quantile (Q-Q) plots are a useful way to compare distributions.

  • We plot empirical quantiles against the quantiles computed from the inverse c.d.f. of a specified theoretical distribution.

In [11]:
import numpy as np 
import matplotlib.pyplot as plt
import scipy.stats as stats

stats.probplot(stock_returns.values[:,0], dist="norm", plot=plt)
plt.show()

Statistical Hypothesis Testing

The Jarque-Bera Test

  • The Jarque-Bera (JB) test is a statistical test that can be used to test whether a given sample was drawn from a normal distribution.

  • The null hypothesis is that the data have the same skewness (0) and kurtosis (3) as a normal distribution.

  • The test statistic is:

$$JB = \frac{n}{6}(S^2 + \frac{1}{4}(K - 3)^2)$$

where $S$ is the sample skewness, $K$ is the sample kurtosis, and $n$ is the number of observations.

  • It is implemented in scipy.stats.jarque_bera().

References

Jarque, C. M. and Bera, A. K. (1980) “Efficient tests for normality, homoscedasticity and serial independence of regression residuals”, Economics Letters, 6(3), 255–259.

The Jarque-Bera test using a bootstrap

  • We can test against the null hypothesis of S=0 and K=3.

  • A finite sample can exhibit non-zero skewness and excess kurtosis simply due to sampling noise, even if the underlying distribution is Gaussian.

  • What is the distribution of the sum of the squared sample skewness and kurtosis under repeated sampling?

  • We can answer this question using a Monte-Carlo method called bootstrapping.

    • Note that this is very expensive, and we would not always do this in practice (see the subsequent slides).

Bootstrap code

In [12]:
from scipy.stats import skew, kurtosis

def jb(n, s, k):
    return  n / 6. * (s**2 + (((k - 3.)**2) / 4.))

def jb_from_samples(n, bootstrap_samples):
    s = skew(bootstrap_samples)
    k = kurtosis(bootstrap_samples, fisher=False)
    return jb(n, s, k)

The distribution of the test-statistic under the null hypothesis

In [13]:
bootstrap_replications = 10000
n = 10  # Sample size
test_statistic_null = jb_from_samples(n, np.random.normal(size=(n, bootstrap_replications)))
plt.hist(test_statistic_null, bins=100)
plt.title('JB test-statistic under null hypothesis from bootstrap ($n=10$)'); plt.xlabel('$JB$')
plt.show()

The critical value

  • The 95-percentile can be computed from the bootstrap data.

  • This is called the critical value for $p= 0.05$.

In [14]:
critical_value = np.percentile(test_statistic_null, 95)
critical_value
Out[14]:
2.4743729769954297
  • This is the value of $JB_{crit}$ such that the area underneath the p.d.f. over the interval $[0, JB_{crit}]$ integrates to $0.95$ (95% of the area under the curve).

  • The corresponding p-value is $1 - 0.95 = 0.05$.

Rejecting the null hypothesis

  • When we test an empirical sample, we compute its sample skewness and kurtosis, and the corresponding value of the test statistic $JB_{data}$.

  • We reject the null hypothesis if and only if $JB_{data} > JB_{crit}$:

In [15]:
def jb_critical_value(n, bootstrap_samples, p):
    return np.percentile(jb_from_samples(n, bootstrap_samples), (1. - p) * 100.)

def jb_test(data_sample, bootstrap_replications=100000, p=0.05):
    sample_size = len(data_sample)
    bootstrap_samples = np.random.normal(size=(sample_size, bootstrap_replications))
    critical_value = jb_critical_value(sample_size, bootstrap_samples, p)
    empirical_jb = jb(sample_size, skew(data_sample), kurtosis(data_sample, fisher=False))
    return (empirical_jb > critical_value, empirical_jb, critical_value)

Test data from a normal distribution

In [16]:
x = np.random.normal(size=2000)
jb_test(x)
Out[16]:
(False, 0.7414659339472695, 5.912503427821925)

Test data from a log-normal distribution

In [17]:
jb_test(np.exp(x))
Out[17]:
(True, 45398.441639228855, 5.95816195181965)

Critical-values from a Chi-Squared table

  • The code on the previous slide is not very efficient, since we have to perform a lengthy bootstrap operation each time we test a data sample.

  • For $n > 2000$, the distribution of the test statistic follows a Chi-squared distribution with two degrees of freedom ($k=2$), so we can look up the critical values for any given confidence level ($p$-value) using a Chi-Squared table.

  • For smaller $n$ we must resort to a bootstrap.

Producing a table of critical values from a bootstrap

In [18]:
n = 10
bootstrap_samples = np.random.normal(size=(n, 300000))
confidence_levels = np.array([0.025, 0.05, 0.10, 0.20])
critical_values = np.vectorize(lambda p: jb_critical_value(n, bootstrap_samples, p))(confidence_levels)
critical_values_df = pd.DataFrame({'critical value (n=10)': critical_values}, index=confidence_levels)
critical_values_df.index.name = 'p-value'
critical_values_df
Out[18]:
critical value (n=10)
p-value
0.025 3.755172
0.050 2.516959
0.100 1.624813
0.200 1.124019
  • If we save this data-frame permanently, then we do not need to re-compute the critical value for the given sample size.

  • We can simply calculate the test-statistic from the data sample, and see whether the value thus obtained exceeds the critical value for the chosen level of confidence (p-value).
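  • A minimal sketch of this workflow, assuming the objects defined above are still in scope (the file name jb_critical_values_n10.csv is illustrative):

critical_values_df.to_csv('jb_critical_values_n10.csv')  # save the table once

# In a later session: load the table and test a new sample of size n=10
table = pd.read_csv('jb_critical_values_n10.csv', index_col='p-value')
data_sample = np.random.normal(size=10)
empirical_jb = jb(10, skew(data_sample), kurtosis(data_sample, fisher=False))
reject = empirical_jb > table.loc[0.05, 'critical value (n=10)']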

Using the jarque_bera function in scipy

  • The function scipy.stats.jarque_bera() contains code already written to implement the Jarque-Bera (JB) test.

  • It computes the p-value from the c.d.f. of the Chi-Squared distribution and the empirical test-statistic.

  • This assumes a large sample $n \geq 2000$.

  • The variable test_statistic returned below is the value of $JB$ calculated from the empirical data sample.

  • If the p-value in the result is $\leq 0.05$ then we reject the null hypothesis at 95% confidence.

  • The null hypothesis is that the data are drawn from a distribution with skew 0 and kurtosis 3.

In [19]:
import scipy.stats
x = np.random.normal(size=2000)
(test_statistic, p_value) = scipy.stats.jarque_bera(x)
print("JB test statistic = %f" % test_statistic)
print("p-value = %f" % p_value)
JB test statistic = 0.136840
p-value = 0.933868

Testing the empirical data

In [20]:
len(stock_returns)
Out[20]:
208
In [21]:
stats.jarque_bera(stock_returns)
Out[21]:
(6.19438331072041, 0.04517589389305776)

Estimating the single-index model

$$r_{i,t} - r_f = \alpha_i + \beta_i ( r_{m,t} - r_f) + \epsilon_{i,t}, \qquad \epsilon_{i,t} \sim N(0, \sigma_i)$$
  • $r_{i,t}$ is return to stock $i$ in period $t$.
  • $r_f$ is the risk-free rate.
  • $r_{m,t}$ is the return to the market portfolio.

Elton, E. J., & Gruber, M. J. (1997). Modern portfolio theory, 1950 to date. Journal of Banking and Finance, 21(11–12), 1743–1759. https://doi.org/10.1016/S0378-4266(97)00048-4

Index price data

  • We will first obtain data on the market index: in this case the NASDAQ:
In [22]:
nasdaq_index = prices_from_csv('data/^NDX.csv')
nasdaq_index.head()
Out[22]:
Date Open High Low Close Adj Close Volume
Date
2002-07-01 2002-07-01 1044.479980 1049.880005 997.969971 998.169983 998.169983 2320650000
2002-07-02 2002-07-02 989.250000 993.989990 961.760010 963.659973 963.659973 2722550000
2002-07-03 2002-07-03 957.260010 995.950012 950.330017 995.679993 995.679993 2661060000
2002-07-05 2002-07-05 1018.630005 1061.050049 1018.630005 1060.890015 1060.890015 1120960000
2002-07-08 2002-07-08 1051.270020 1066.280029 1008.780029 1014.330017 1014.330017 1708150000

Converting to monthly data

  • As before, we can resample to obtain monthly data.
In [23]:
nasdaq_monthly_prices = nasdaq_index['Adj Close'].resample('M').last()
nasdaq_monthly_prices.head()
Out[23]:
Date
2002-07-31     962.099976
2002-08-31     942.380005
2002-09-30     832.520020
2002-10-31     989.539978
2002-11-30    1116.099976
Freq: M, Name: Adj Close, dtype: float64

Plotting monthly returns

In [24]:
index_log_returns = pd.DataFrame({'NASDAQ monthly returns': np.log(nasdaq_monthly_prices).diff().dropna()})
index_log_returns.plot()
plt.show()

Converting to simple returns

In [25]:
index_simple_returns = np.exp(index_log_returns) - 1.
index_simple_returns.plot()
plt.show()
In [26]:
stock_simple_returns = np.exp(stock_returns) - 1.

Concatenating data into a single data frame

  • We will now concatenate the data into a single data frame.

  • We can use pd.concat(), specifying an axis of 1 to merge data along columns.

  • This is analogous to performing a zip() operation.

In [27]:
comparison_df = pd.concat([index_simple_returns, stock_simple_returns], axis=1)
comparison_df.head()
Out[27]:
NASDAQ monthly returns MSFT monthly returns
Date
2002-08-31 -0.020497 0.022926
2002-09-30 -0.116577 -0.108802
2002-10-31 0.188608 0.222450
2002-11-30 0.127898 0.078736
2002-12-31 -0.118027 -0.103676

Scatter plots

  • We can produce a scatter plot to see whether there is any relationship between the stock returns, and the index returns.

  • There are two ways to do this:

    1. Use the function scatter() in matplotlib.pyplot
    2. Invoke the plot() method on a data frame, passing kind='scatter'

Scatter plots using the plot() method of a data frame

  • In the example below, the x and y named arguments refer to column numbers of the data frame.

  • Notice that the plot() method is able to infer the labels automatically.

In [28]:
comparison_df.plot(x=0, y=1, kind='scatter')
plt.show()
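  • Equivalently, a sketch using method 1, calling scatter() directly and setting the labels ourselves:

plt.scatter(comparison_df.values[:, 0], comparison_df.values[:, 1])
plt.xlabel('NASDAQ monthly returns')
plt.ylabel('MSFT monthly returns')
plt.show()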

Computing the correlation matrix

  • For random variables $X$ and $Y$, the Pearson correlation coefficient is:
\begin{eqnarray} \rho_{X,Y} & = & \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} \\ & = & \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y} \end{eqnarray}

Covariance and correlation of a data frame

  • We can invoke the cov() and corr() methods on a data frame.
In [29]:
comparison_df.cov()
Out[29]:
NASDAQ monthly returns MSFT monthly returns
NASDAQ monthly returns 0.002675 0.002243
MSFT monthly returns 0.002243 0.004283
In [30]:
comparison_df.corr()
Out[30]:
NASDAQ monthly returns MSFT monthly returns
NASDAQ monthly returns 1.000000 0.662686
MSFT monthly returns 0.662686 1.000000

Comparing multiple attributes in a data frame

  • It is often useful to work with more than two variables.

  • We can add columns (attributes) to our data frame.

  • Many of the methods we are using will automatically incorporate the additional variables into the analysis.

Using a function to compute returns

  • The code below defines a function which returns a data frame containing a single series of returns for the specified symbol, sampled at the specified frequency.
In [31]:
def returns_df(symbol, frequency='M'):
    df = prices_from_csv('data/%s.csv' % symbol)
    prices = df['Adj Close'].resample(frequency).last()
    column_name = symbol + ' returns (' + frequency + ')'
    return pd.DataFrame({column_name: np.exp(np.log(prices).diff().dropna()) - 1.})                            
In [32]:
apple_returns = returns_df('AAPL')   
apple_returns.head()
Out[32]:
AAPL returns (M)
Date
2002-08-31 -0.033421
2002-09-30 -0.016949
2002-10-31 0.108276
2002-11-30 -0.035470
2002-12-31 -0.075484

Adding another stock to the portfolio

In [33]:
comparison_df = pd.concat([comparison_df, apple_returns], axis=1)
comparison_df.head()
Out[33]:
NASDAQ monthly returns MSFT monthly returns AAPL returns (M)
Date
2002-08-31 -0.020497 0.022926 -0.033421
2002-09-30 -0.116577 -0.108802 -0.016949
2002-10-31 0.188608 0.222450 0.108276
2002-11-30 0.127898 0.078736 -0.035470
2002-12-31 -0.118027 -0.103676 -0.075484
In [34]:
comparison_df.plot()
plt.show()
In [35]:
comparison_df.corr()
Out[35]:
NASDAQ monthly returns MSFT monthly returns AAPL returns (M)
NASDAQ monthly returns 1.000000 0.662686 0.629116
MSFT monthly returns 0.662686 1.000000 0.375185
AAPL returns (M) 0.629116 0.375185 1.000000
In [36]:
plt.figure(figsize=(8, 6))
comparison_df.boxplot()
plt.show()

Boxplots without outliers

In [37]:
plt.figure(figsize=(8, 6))
comparison_df.boxplot(showfliers=False)
plt.show()

Scatter matrices

In [38]:
pd.plotting.scatter_matrix(comparison_df, figsize=(8, 6))
plt.show()

Scatter matrices with Kernel-density plots

In [39]:
pd.plotting.scatter_matrix(comparison_df, diagonal='kde', figsize=(8, 6))
plt.show()

Ordinary-least squares

  • For $n$ observations $(x_{i,1}, x_{i,2}, \ldots, x_{i,p}, y_i)$, $i \in \{ 1, 2, \ldots, n \}$, over $p$ regressors:
$$ y_i = \alpha + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \ldots + \beta_p x_{i,p} + \epsilon_i $$

Ordinary-least squares estimation in Python

  • First we import the stats module:
In [40]:
import scipy.stats as stats
  • Now we prepare the data set:
In [41]:
rr = 0.01 # risk-free rate
ydata = stock_simple_returns.values[:,0] - rr
xdata = index_simple_returns.values[:,0] - rr
In [42]:
regression_result = (beta, alpha, rvalue, pvalue, stderr) = \
    stats.linregress(x=xdata, y=ydata)
print(regression_result)
LinregressResult(slope=0.8385169376111149, intercept=0.0015326890415947839, rvalue=0.6626859398568364, pvalue=1.1260177711868404e-27, stderr=0.06602262800537023)

Plotting the fitted model

In [43]:
plt.scatter(x=xdata, y=ydata)
plt.plot(xdata, alpha + beta * xdata)
plt.xlabel('index return')
plt.ylabel('stock return')
plt.title('Single-index model fit')
plt.show()

Regressing attributes of a data frame

  • First we will create a new data frame containing the excess returns.
In [44]:
excess_returns_df = comparison_df - rr
excess_returns_df.head()
Out[44]:
NASDAQ monthly returns MSFT monthly returns AAPL returns (M)
Date
2002-08-31 -0.030497 0.012926 -0.043421
2002-09-30 -0.126577 -0.118802 -0.026949
2002-10-31 0.178608 0.212450 0.098276
2002-11-30 0.117898 0.068736 -0.045470
2002-12-31 -0.128027 -0.113676 -0.085484

Renaming the columns of a data frame

  • We will now rename the columns to make the variable names easier to work with.
In [45]:
excess_returns_df.rename(columns={'NASDAQ monthly returns': 'index', 
                                   'MSFT monthly returns': 'msft',
                                    'AAPL returns (M)': 'aapl'},
                         inplace=True)
excess_returns_df.head()
Out[45]:
index msft aapl
Date
2002-08-31 -0.030497 0.012926 -0.043421
2002-09-30 -0.126577 -0.118802 -0.026949
2002-10-31 0.178608 0.212450 0.098276
2002-11-30 0.117898 0.068736 -0.045470
2002-12-31 -0.128027 -0.113676 -0.085484

Fitting the model

In [46]:
import statsmodels.formula.api as sm
result = sm.ols(formula = 'msft ~ index', data=excess_returns_df).fit()

The full regression results

In [47]:
print(result.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   msft   R-squared:                       0.439
Model:                            OLS   Adj. R-squared:                  0.436
Method:                 Least Squares   F-statistic:                     161.3
Date:                Wed, 04 Nov 2020   Prob (F-statistic):           1.13e-27
Time:                        09:12:15   Log-Likelihood:                 332.64
No. Observations:                 208   AIC:                            -661.3
Df Residuals:                     206   BIC:                            -654.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0015      0.003      0.450      0.653      -0.005       0.008
index          0.8385      0.066     12.700      0.000       0.708       0.969
==============================================================================
Omnibus:                       11.619   Durbin-Watson:                   2.369
Prob(Omnibus):                  0.003   Jarque-Bera (JB):               21.780
Skew:                           0.241   Prob(JB):                     1.86e-05
Kurtosis:                       4.510   Cond. No.                         19.4
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The intercept and coefficient

In [48]:
print(result.params)
Intercept    0.001533
index        0.838517
dtype: float64
In [49]:
coefficient = result.params['index']
coefficient
Out[49]:
0.8385169376111143

Portfolio optimization

  • For a column vector $\mathbf{w}$ of portfolio weights, the portfolio return $r_p$ is given by:
\begin{equation} r_p = \sum_{i=1}^n w_i r_i \end{equation}
  • The portfolio variance $\sigma^2_p$ is given by:
\begin{equation} \sigma^2_p = \sum_{i=1}^n \sum_{j=1}^n w_i w_j k_{i,j} \sigma_i \sigma_j = \mathbf{w}^T \cdot \mathbf{K} \cdot \mathbf{w} \end{equation}

where $\mathbf{K}$ is the covariance matrix.

Portfolio mean and variance in Python

  • We can write the equation from the previous slide as a Python function:
In [50]:
def portfolio_mean_var(w, R, K):
    portfolio_mean = np.mean(R, axis=0) * w
    portfolio_var = w.T * K * w
    return portfolio_mean.item(), portfolio_var.item()
  • The item() method is required to convert a $1 \times 1$ matrix into a scalar.

Obtaining portfolio data in Pandas

In [51]:
portfolio = pd.concat([returns_df(s) for s in ['AAPL', 'ATVI', 'MSFT', 'VRSN', 'WDC']], axis=1)
portfolio.head()
Out[51]:
AAPL returns (M) ATVI returns (M) MSFT returns (M) VRSN returns (M) WDC returns (M)
Date
2002-08-31 -0.033421 -0.029596 0.022926 0.121875 -0.087838
2002-09-30 -0.016949 -0.141371 -0.108802 -0.296657 0.160493
2002-10-31 0.108276 -0.143335 0.222450 0.594059 0.317022
2002-11-30 -0.035470 0.053658 0.078736 0.305590 0.365105
2002-12-31 -0.075484 -0.324537 -0.103676 -0.236917 -0.243787

Computing the covariance matrix

In [52]:
portfolio.cov()
Out[52]:
AAPL returns (M) ATVI returns (M) MSFT returns (M) VRSN returns (M) WDC returns (M)
AAPL returns (M) 0.009047 0.003175 0.002335 0.002891 0.004194
ATVI returns (M) 0.003175 0.009093 0.001402 0.001898 0.002607
MSFT returns (M) 0.002335 0.001402 0.004283 0.002008 0.002413
VRSN returns (M) 0.002891 0.001898 0.002008 0.011007 0.005368
WDC returns (M) 0.004194 0.002607 0.002413 0.005368 0.016106

Converting to matrices

In [53]:
R = np.matrix(portfolio)
K = np.matrix(portfolio.cov())
In [54]:
K
Out[54]:
matrix([[0.00904665, 0.00317498, 0.00233534, 0.00289089, 0.00419387],
        [0.00317498, 0.00909339, 0.00140182, 0.00189845, 0.00260658],
        [0.00233534, 0.00140182, 0.00428272, 0.00200831, 0.00241259],
        [0.00289089, 0.00189845, 0.00200831, 0.0110073 , 0.00536781],
        [0.00419387, 0.00260658, 0.00241259, 0.00536781, 0.01610639]])

An example portfolio

  • Let's construct a single portfolio by specifying a weight vector:
In [55]:
w = np.matrix('0.4; 0.2; 0.2; 0.1; 0.1')
w
Out[55]:
matrix([[0.4],
        [0.2],
        [0.2],
        [0.1],
        [0.1]])
In [56]:
np.sum(w)
Out[56]:
1.0
In [57]:
portfolio_mean_var(w, R, K)
Out[57]:
(0.023266799902568663, 0.004278614805731206)

Optimizing portfolios

  • We can use the scipy.optimize module to solve the portfolio optimization problem.

  • First we import the module:

In [58]:
import scipy.optimize as sco

Defining an objective function

  • Next we define an objective function.

  • This function will be minimized.

In [59]:
def portfolio_performance(w_list, R, K, risk_aversion):
    w = np.matrix(w_list).T
    mean, var = portfolio_mean_var(w, R, K)
    return risk_aversion * var - (1 - risk_aversion) * mean

Computing the performance of a given portfolio

In [60]:
def uniform_weights(n):
    return [1. / float(n)] * n
In [61]:
uniform_weights(5)
Out[61]:
[0.2, 0.2, 0.2, 0.2, 0.2]
In [62]:
portfolio_performance(uniform_weights(5), R, K, risk_aversion=0.5)
Out[62]:
-0.008482461529679837

Finding optimal portfolio weights

In [63]:
def optimal_portfolio(R, K, risk_aversion):
    n = R.shape[1]
    constraints = ({'type': 'eq', 'fun': lambda x: np.sum(x) - 1})
    bounds = tuple( (0, 1) for asset in range(n) )
    result = sco.minimize(portfolio_performance, uniform_weights(n), args=(R, K, risk_aversion),
                        method='SLSQP', bounds=bounds, constraints=constraints)
    return np.matrix(result.x).T
In [64]:
optimal_portfolio(R, K, risk_aversion=0.5)
Out[64]:
matrix([[9.04156453e-01],
        [7.28583860e-17],
        [0.00000000e+00],
        [9.58435472e-02],
        [0.00000000e+00]])

Computing the Pareto frontier

  • First we define our risk aversion coefficients
In [65]:
risk_aversion_coefficients = np.arange(0.0, 1.0, 0.1)
risk_aversion_coefficients
Out[65]:
array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

The optimal portfolios

In [66]:
optimal_portfolios = [optimal_portfolio(R, K, ra) for ra in risk_aversion_coefficients]
In [67]:
optimal_portfolios[0]
Out[67]:
matrix([[1.00000000e+00],
        [2.95770353e-16],
        [0.00000000e+00],
        [9.29811783e-16],
        [3.77302356e-16]])
In [68]:
optimal_portfolios[1]
Out[68]:
matrix([[1.00000000e+00],
        [2.63677968e-16],
        [2.60208521e-17],
        [1.11022302e-16],
        [0.00000000e+00]])

The efficient frontier

In [69]:
pareto_frontier = np.matrix([portfolio_mean_var(w, R, K) for w in optimal_portfolios])
In [70]:
pareto_frontier
Out[70]:
matrix([[0.0318904 , 0.00904665],
        [0.0318904 , 0.00904665],
        [0.0318904 , 0.00904665],
        [0.0318904 , 0.00904665],
        [0.0318904 , 0.00904665],
        [0.03097262, 0.00799777],
        [0.02890978, 0.00648401],
        [0.02604018, 0.00509803],
        [0.0219562 , 0.00381382],
        [0.01966883, 0.00337872]])

Plotting the Pareto frontier

In [71]:
plt.plot(pareto_frontier[:, 1], pareto_frontier[:, 0])
plt.xlabel('$\sigma_p$');  plt.ylabel('$r_p$')
plt.show()

Monte-Carlo Methods

Quantitative Models

  • A mathematical model uses variables to represent quantities in the real-world.

    • e.g. security prices
  • Variables are related to each other through mathematical equations.

  • Variables can be divided into:

    • Input variables: parameters (independent variables)
      • Initial conditions: parameters which specify the initial values in a time-varying (dynamic) model.
    • Output variables: dependent-variables (e.g. payoff of an option).

Monte-Carlo Methods

  • Financial models are typically stochastic.

  • Stochastic models make use of random variables.

  • If the dependent variables are stochastic, we typically want to compute their expectation.

    • Note, however, that in some models, dependent variables are deterministic, even when parameters are random.
  • If the parameters are stochastic, we can use Monte-Carlo methods to estimate the expected values of dependent variables.


Pseudo-code

  • We will illustrate a simple Monte-Carlo method for analysing a stochastic model.

  • We will make use of pseudo-code.

  • Pseudo-code is written for people.

  • It is not executable by machines.

  • It is written to illustrate exactly how something is done.

  • Exact specifications of the steps required to compute mathematical values are called algorithms.

  • Pseudo-code can be used to write down algorithms.

A simple Monte-Carlo method

  • Here we consider a simple model with one input variable $X$, and one output variable $Y$, related by a function $Y = f(X)$.

  • $X$ and $Y$ are random variables.

  • $X$ is i.i.d. with some known distribution.

  • We want to compute the expected value of the dependent variable $E(Y)$.

  • We do so by drawing a random sample of $n$ random variates $( x_1, x_2, \ldots, x_n )$ from the specified distribution.

  • We map these values onto a sample $\mathbf{y}$ of the dependent variable $Y$: $\mathbf{y} = ( f(x_1), f(x_2), \ldots, f(x_n) )$.

  • We can use the sample mean $\bar{\mathbf{y}} = \sum_i f(x_i) / n$ to estimate $E(Y)$.

  • Provided that $n$ is sufficiently large, our estimate will be accurate by the law of large numbers.

  • $\bar{\mathbf{y}}$ is called the Monte-Carlo estimator.

In Pseudo-code

  • The pseudo-code below illustrates the method specified on the previous slide using iteration:
sample = []
for i in range(n):
    x = draw_random_value(distribution)
    y = f(x)
    sample.append(y)
result = mean(sample)
  • We can write this more concisely using a comprehension:
inputs = draw_random_value(distribution, size=n)
result = mean([f(x) for x in inputs])
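  • As a concrete, runnable instance of this pseudo-code, suppose $X \sim N(0, 1)$ and $f(x) = x^2$, so the true value is $E(Y) = E(X^2) = 1$ (a minimal sketch):

import numpy as np

def f(x):
    return x**2                               # the model: Y = f(X)

inputs = np.random.normal(size=100000)        # draw n variates from N(0, 1)
result = np.mean([f(x) for x in inputs])      # Monte-Carlo estimator of E(Y)
print(result)                                 # approximately 1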

A Monte-Carlo algorithm for computing $\pi$

  1. Inscribe a circle in a square.
  2. Randomly generate points $(X, Y)$ in the square.
  3. Determine the number of points in the square that are also in the circle.
  4. Let $R$ be the number of points in the circle divided by the number of points in the square, then $\pi = 4 \times E(R)$.


In [17]:
import numpy as np

def f(x, y):
    if x*x + y*y < 1:
        return 1.
    else:
        return 0.

n = 1000000
X = np.random.random(size=n)
Y = np.random.random(size=n)
pi_approx = 4 * np.mean([f(x, y) for (x, y) in zip(X,Y)])
print("Pi is approximately %f" % pi_approx)
Pi is approximately 3.139656

Monte-Carlo Integration

The expectation of a random variable $X \in \mathbb{R}$ with p.d.f. $f(x)$ can be written:

$$ E[X] = \int_{-\infty}^{+\infty} x f(x) \; dx $$

For a continuous uniform distribution over $U(0, 1)$, the p.d.f. is $f(x) = 1$, and:

$$ E[X] = \int_{0}^{1} x \; dx $$
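As a quick numerical check (a minimal sketch, assuming numpy is imported as np), the sample mean of draws from $U(0, 1)$ should approximate $\int_0^1 x \; dx = \frac{1}{2}$:

x = np.random.uniform(size=100000)  # random variates from U(0, 1)
print(np.mean(x))                   # approximately 0.5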

Estimating $\pi$ using Monte-Carlo integration

Consider:

$$E[\sqrt{1 - X^2}] = \int_0^1 \sqrt{1 - x^2} \; dx$$

If we draw a finite random sample $x_1, x_2, \ldots, x_n$ from $U(0, 1)$ and set $y_i = \sqrt{1 - x_i^2}$, then

$$\bar{\mathbf{y}} \approx E[\sqrt{1 - X^2}] = \int_0^1 \sqrt{1 - x^2} \; dx$$

and, using the antiderivative,

$$\int \sqrt{1 - x^2} \; dx = \frac{1}{2} \left( x \sqrt{1-x^2} + \arcsin(x) \right),$$

we have:

$$\bar{\mathbf{y}} \approx E[\sqrt{1 - X^2}] = \frac{\pi}{4}$$
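Multiplying the sample mean of $\sqrt{1 - x_i^2}$ by four therefore gives another Monte-Carlo estimate of $\pi$; a minimal sketch (assuming numpy is imported as np):

u = np.random.uniform(size=1000000)           # draws from U(0, 1)
pi_estimate = 4 * np.mean(np.sqrt(1 - u**2))  # Monte-Carlo estimate of pi
print(pi_estimate)                            # approximately 3.14159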

Estimation error

  • By the law of large numbers $\lim_{n \rightarrow \infty} \bar{\mathbf{x}} = E(X)$.

  • However, for finite values of $n$ we will have an estimation error.

  • Can we quantify the estimation error as a function of $n$?

Computing the error numerically

  • If we draw from a standard normal distribution, we know that $E(X) = 0$.

  • Therefore, we can easily compute the estimation error in any given sample.

The error for a small random sample.

  • Here $X \sim N(0, 1)$, and we draw a random sample $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ of size $n=5$.

  • We will compute $\epsilon_\mathbf{x} = | \bar{\mathbf{x}} - E(X) | = | \bar{\mathbf{x}} |$

In [18]:
x = np.random.normal(size=5)
x
Out[18]:
array([-1.08775513, -0.75418851,  0.56435569, -0.01047129, -1.22179773])
In [19]:
np.mean(x)
Out[19]:
-0.5019713931238332
In [20]:
estimation_error = np.abs(np.mean(x))
estimation_error
Out[20]:
0.5019713931238332

Computing the error for different samples

  • If we draw a different sample, will the error be different or the same?

Sample 1

In [21]:
x = np.random.normal(size=5)
estimation_error = np.abs(np.mean(x))
estimation_error
Out[21]:
0.6439078712065711

Sample 2

In [22]:
x = np.random.normal(size=5)
estimation_error = np.abs(np.mean(x))
estimation_error
Out[22]:
0.2065274715526703

Sample 3

In [23]:
x = np.random.normal(size=5)
estimation_error = np.abs(np.mean(x))
estimation_error
Out[23]:
0.24725495232312905

Errors are random variables

  • We can see that the error $\epsilon_{\mathbf{x}}$ is itself a random variable.

  • How can we compute $E(\epsilon_{\mathbf{x}})$?

Monte-Carlo estimation of the sampling error

In [24]:
def sampling_error(n):
    errors = [np.abs(np.mean(np.random.normal(size=n))) \
              for i in range(100000)]
    return np.mean(errors)    

sampling_error(5)
Out[24]:
0.35691413131802474
  • Notice that this estimate is relatively stable:
In [25]:
sampling_error(5)
Out[25]:
0.35615647290478425
In [26]:
sampling_error(5) 
Out[26]:
0.3577865227312204

Monte-Carlo estimation of the standard error

  • We can now examine the relationship between sample size $n$ and the expected error using a Monte-Carlo method.
In [27]:
import matplotlib.pyplot as plt
n = np.arange(5, 200, 10)
plt.plot(n, np.vectorize(sampling_error)(n))
plt.xlabel('$n$');  plt.ylabel('$\epsilon_\mathbf{x}$')
plt.show()

The sampling distribution of the mean

  • The variance in the error occurs because the sample mean is a random variable.

  • What is the distribution of the sample mean?

The sampling distribution of the mean

  • Let's fix the sample size at $n=30$, and look at the empirical distribution of the sample means.
In [28]:
# Sample size
n = 30
# Number of repeated samples
N = 20000

means_30 = [np.mean(np.random.normal(size=n)) for i in range(N)]
ax = plt.hist(means_30, bins=50)
plt.show()

The sampling distribution of the mean for $n=30$

  • Now let's do this again for a variable sampled from a different distribution: $X \sim U(0, 1)$.
In [29]:
# Sample size
n = 30
# Number of repeated samples
N = 20000
means_30_uniform = [np.mean(np.random.uniform(size=n)) for i in range(N)]
ax = plt.hist(means_30_uniform, bins=50)
plt.show()

Increasing the sample size $n=200$

In [30]:
# Sample size
n = 200

means_200 = [np.mean(np.random.normal(size=n)) for i in range(N)]
ax1 = plt.hist(means_30, bins=50)    
ax2 = plt.hist(means_200, bins=50)
plt.show()

Increasing the sample size $n=1000$

In [31]:
# Sample size
n = 1000
means_1000 = [np.mean(np.random.normal(size=n)) for i in range(N)]
ax1 = plt.hist(means_30, bins=50)    
ax2 = plt.hist(means_200, bins=50)
ax3 = plt.hist(means_1000, bins=50)
plt.show()

The sampling distribution of the mean

  • In general the sampling distribution of the mean approximates a normal distribution.

  • If $X \sim N(\mu, \sigma^2)$ then $\bar{\mathbf{x}}_n \sim N(\mu, \frac{\sigma^2}{n})$.

  • The standard error of the mean is $\sigma_{\bar{\mathbf{x}}} = \frac{\sigma}{\sqrt{n}}$.

  • Therefore sample size must be quadrupled to achieve half the measurement error.
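  • We can check the $\frac{\sigma}{\sqrt{n}}$ relationship against the sample means generated earlier (a sketch assuming the means_30 array is still in scope; here $\sigma = 1$ and $n = 30$):

print(np.std(means_30))  # empirical standard error, approximately 0.1826
print(1. / np.sqrt(30))  # theoretical standard error sigma/sqrt(n)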

Summary

  • Monte-Carlo methods can be used to analyse quantitative models.

  • Any problem in which the solution can be written as an expectation of random variable(s) can be solved using a Monte-Carlo approach.

  • We write down an estimator for the problem; a variable whose expectation represents the solution.

  • We then repeatedly sample input variables, and calculate the estimator numerically (in a computer program).

  • The sample mean of this variable can be used as an approximation of the solution; that is, it is an estimate.

  • The larger the sample size, the more accurate the estimate.

  • The estimation error shrinks in proportion to $1 / \sqrt{n}$; equivalently, the sample size must be quadrupled to halve the error.

Random walks in Python

A Simple Random Walk

  • Imagine a board-game in which we move a counter either up or down on an infinite grid based on the flip of a coin.

  • We start in the center of the grid at position $y_1 = 0$.

  • Each turn we flip a coin. If it is heads we move up one square, otherwise we move down.

  • How will the counter behave over time? Let's simulate this in Python.

  • First we create a variable $y$ to hold the current position

In [2]:
y = 0

Movements as Bernoulli trials

  • Now we will generate a Bernoulli sequence representing the moves
  • Each movement is an i.i.d. discrete random variable $\epsilon_t$ distributed with $p(\epsilon_t = 0) = \frac{1}{2}$ and $p(\epsilon_t = 1) = \frac{1}{2}$.
  • We will generate a sequence $( \epsilon_1, \epsilon_2, \ldots, \epsilon_{t_{max}} )$ of such movements, with $t_{max} = 100$.
  • The time variable is also discrete, hence this is a discrete-time model.
  • This means that time values can be represented as integers.

Simulating a Bernoulli process in Python

In [3]:
import numpy as np
from numpy.random import randint

max_t = 100
movements = randint(0, 2, size=max_t)
print(movements)
[0 0 1 0 1 0 0 1 1 0 0 0 1 0 0 1 0 1 1 0 1 1 0 1 1 0 0 1 0 1 0 1 0 1 1 0 1
 1 0 0 0 0 1 0 0 1 1 1 1 0 0 1 1 1 0 0 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1
 1 0 0 0 1 1 0 0 1 1 0 1 1 1 0 0 0 0 1 0 0 1 0 0 1 0]

An integer random-walk in Python

  • Each time we move the counter, we move it in the upwards direction if we flip a 1, and downwards for a 0.
  • So we add 1 to $y_t$ for a 1, and subtract 1 for a $0$.

an integer random-walk using a loop

In [4]:
import numpy as np
import matplotlib.pyplot as plt
from numpy.random import randint, normal, uniform
max_t = 100
movements = randint(0, 2, size=max_t)
y = 0
values = [y]
for movement in movements:
    if movement == 1:
        y = y + 1
    else:
        y = y - 1
    values.append(y)
plt.xlabel('t')
plt.ylabel('y')
ax = plt.plot(values)
plt.show()

A random-walk as a cumulative sum

  • Notice that the value of $y_t$ is simply the cumulative sum of movements randomly chosen from $-1$ or $+1$.
  • So if $p(\epsilon_t = -1) = \frac{1}{2}$ and $p(\epsilon_t = +1) = \frac{1}{2}$, then we can define our game as a simple stochastic process: $y_t = \sum_{i=1}^{t} \epsilon_i$.
  • We can use numpy's where() function to replace all zeros with $-1$.

an integer random-walk using an accumulator

In [5]:
t_max = 100
random_numbers = randint(0, 2, size=t_max)
steps = np.where(random_numbers == 0, -1, +1)
y = 0
values = [0]
for step in steps:
    y = y + step
    values.append(y)
plt.xlabel('t')
plt.ylabel('y')
ax = plt.plot(values)

A random-walk using arrays

  • We can make our code more efficient by using the cumsum() function instead of a loop.
  • This way we can work entirely with arrays.
  • Remember that vectorized code can be much faster than iterative code.

an integer random-walk using vectorization

In [6]:
# Vectorized random-walk with arrays to improve efficiency
t_max = 100
random_numbers = randint(0, 2, size=t_max)
steps = np.where(random_numbers == 0, -1, +1)
values = np.cumsum(steps)
plt.xlabel('t')
plt.ylabel('y')
ax = plt.plot(values)

Using concatenate to prepend the initial value

  • If we want to include the initial position $y_0 = 0$, we can concatenate this value to the computed values from the previous slide.

  • The numpy.concatenate() function takes a single argument containing a sequence of arrays, and returns a new array which contains all values in a single array.

In [7]:
plt.plot(np.concatenate(([0], values)))
plt.xlabel('$t$')
plt.ylabel('$y_t$')
plt.show()

Multiple realisations of a stochastic process

  • Because we are making use of random numbers, each time we execute this code we will obtain a different result.

  • In the case of a random-walk, the result of the simulation is called a path.

  • Each path is called a realisation of the model.

  • We can generate multiple paths by using a 2-dimensional array (a matrix).

  • Suppose we want $n= 10$ paths.

  • In Python we can pass two values for the size argument in the randint() function to specify the dimensions (rows and columns):

In [8]:
t_max = 100
n = 10
random_numbers = randint(0, 2, size=(t_max, n))
steps = np.where(random_numbers == 0, -1, +1)

Using cumsum()

We can then tell cumsum() to sum the rows using the axis argument:

In [9]:
values = np.cumsum(steps, axis=0)
values = np.concatenate((np.matrix(np.zeros(n)), values), axis=0)
plt.xlabel('$t$')
plt.ylabel('$y_t$')
ax = plt.plot(values)

Multiplicative Random Walks

  • The series of values we have looked at do not closely resemble how security prices change over time.

  • In order to obtain a more realistic model of how prices change over time, we need to multiply instead of add.

  • Let $r_t$ denote an i.i.d. random variable distributed $r_t \sim N(0, \sigma^2)$

  • Define a strictly positive initial value $y_0 \in \mathbb{R}$; e.g. $y_0 = 10$.

  • Subsequent values are given by $y_t = y_{t-1} \times (1 + r_t)$

  • We can write this as a cumulative product:

$y_t = y_0 \times \prod_{i=1}^{t} (1 + r_i)$

Using cumprod()

  • This can be computed efficiently using numpy's cumprod() function.
In [10]:
initial_value = 100.0
random_numbers = normal(size=t_max) * 0.005
multipliers = 1 + random_numbers
values = initial_value * np.cumprod(multipliers)
plt.xlabel('$t$')
plt.ylabel('$y_t$')
ax = plt.plot(np.concatenate(([initial_value], values)))

Random walk variates as a time-series

  • Now let's plot the random perturbations over time
In [11]:
plt.xlabel('$t$')
plt.ylabel('$r_t$')
ax = plt.plot(random_numbers)

Gross returns

  • If we take $100 \times r_t$, then these values represent the percentage changes in the value at discrete time intervals.

  • If the values represent prices that have been adjusted to incorporate dividends, then these percentage changes are called simple returns.

  • The gross return is obtained by adding 1.

In [12]:
plt.xlabel('$t$')
plt.ylabel('$r_t$')
ax = plt.plot(random_numbers + 1)

Continuously compounded, or log returns

  • A simple return $R_t$ is defined as

$R_t = (y_t - y_{t-1}) / y_{t-1} = y_{t} / y_{t-1} - 1$

where $y_t$ is the adjusted price at time $t$.

  • The gross return is $R_t + 1$

  • A continuously compounded return $r_t$, or log-return, is defined as:

$r_t = \log(y_t / y_{t-1}) = \log(y_t) - \log(y_{t-1})$

  • In Python:
In [13]:
from numpy import diff, log

prices = values
log_returns = diff(log(prices))

Aggregating returns

  • Simple returns aggregate across assets: the return on a portfolio of assets is the weighted average of the simple returns of its individual securities.

  • Log returns aggregate across time.

  • If the return in year one is $r_1 = \log(p_1 / p_0) = \log(p_1) - \log(p_0)$

  • and the return in year two is $r_2 = \log(p_2 / p_1) = \log(p_2) - \log(p_1)$,
  • then the return over two years is $r_1 + r_2 = \log(p_2) - \log(p_0)$, as the short check below illustrates.
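  • A quick numerical check of this time-aggregation property, using illustrative prices $p_0 = 100$, $p_1 = 110$, $p_2 = 121$:

p0, p1, p2 = 100., 110., 121.
r1 = np.log(p1 / p0)                         # year-one log return
r2 = np.log(p2 / p1)                         # year-two log return
print(np.isclose(r1 + r2, np.log(p2 / p0)))  # True: log returns add over time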

Converting between simple and log returns

  • A simple return $R_t$ can be converted into a log-return $r_t$:

$r_t = \log(R_t + 1)$
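  • Conversely, $R_t = \exp(r_t) - 1$. A minimal sketch converting in both directions:

R = 0.10              # a 10% simple return
r = np.log(R + 1)     # the corresponding log return, approximately 0.0953
print(np.exp(r) - 1)  # recovers the simple return, 0.10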

Comparing simple and log returns

  • For small values of $r_t$, $R_t \approx r_t$.

  • We can examine the error for larger values:

In [14]:
simple_returns = np.arange(-0.75, +0.75, 0.01)
log_returns = np.log(simple_returns + 1)
plt.xlim([-1.5, +1.5])
plt.ylim([-1.5, +1.5])
plt.plot(simple_returns, log_returns)
x = np.arange(-1.5, 1.6, 0.1)
plt.xlabel('R')
plt.ylabel('r')
ax = plt.plot(x, x)

A discrete multiplicative random walk with log returns

  • Let $r_t$ denote a random i.i.d. variable distributed $r_t \sim N(0, \sigma^2)$

  • Then $y_t = y_0 \times \operatorname{exp}(\sum_{i=1}^{t} r_i)$

Plotting a single realization

In [15]:
from numpy import log, exp, cumsum

t_max = 100
volatility = 1e-2
initial_value = 100.
r = normal(size=t_max) * np.sqrt(volatility)
y = initial_value * exp(cumsum(r))
plt.xlabel('$t$')
plt.ylabel('$y_t$')
ax = plt.plot(np.concatenate(([initial_value], y)))

Multiple realisations of a multiplicative random-walk

  • Let's generate $n=10$ realisations of this process:
In [16]:
def random_walk(initial_value = 100, n = 10, 
                t_max = 100, volatility = 0.005):
    r = normal(size=(t_max+1, n)) * np.sqrt(volatility)
    return np.concatenate((np.matrix([initial_value] * n), 
                           initial_value * exp(np.cumsum(r, axis=0))))

plt.xlabel('$t$')
plt.ylabel('$y_t$')
ax = plt.plot(random_walk(n=10))

Geometric Brownian Motion

  • For a continuous-time process, we use Geometric Brownian Motion (GBM):
$$ S_{t_k} = S_0 \prod_{i=1}^{k} Y_i $$

\begin{eqnarray} Y_i & = & \exp(\sigma\sqrt{\Delta_t} z_i + \mu \Delta_t) \\ & = & \exp(\sigma\sqrt{\Delta_t} z_i + (\bar{r} - \frac{\sigma^2}{2}) \Delta_t) \end{eqnarray}
  • As a cumulative sum:
$$ S_{t_k} = S_0 \times \exp(\sum_{i=1}^{k} \sigma\sqrt{\Delta_t} z_i + (\bar{r} - \frac{\sigma^2}{2}) \Delta_t)$$

GBM with multiple paths in Python

In [18]:
def gbm(sigma, r, k, t_max, S0, I=1):
    z = np.random.normal(size=(k-1, I))  
    dt = t_max/k
    y = sigma * np.sqrt(dt)*z + (r - sigma**2 / 2.) * dt
    return S0 * np.exp(np.cumsum(y, axis=0))  

Plotting multiple realizations of GBM

In [19]:
sigma = 0.05; r = 0.01; k = 100; t_max = 10.; S0 = 100.
T = np.arange(0, t_max - t_max/k, t_max/k)
ax = plt.plot(T, gbm(sigma, r, k, t_max, S0, I=50))
plt.xlabel('$t$'); plt.ylabel('$S_t$'); plt.show()

Monte-Carlo simulation for option pricing

Options

  • An option gives the right to buy (call) or sell (put) an underlying stock at a prespecified price, called the strike price, at a specified date, or period.

    • A European option specifies a single date.
    • An American option specifies a period.
  • Options can also specify an index, in which case they are settled in cash.

  • People selling options are called option writers.

  • People buying options are the option holders.

Payoff to an option holder

  • The payoff to an option holder depends on the index price $S_t$ when the option is exercised.

  • For a European option with strike price $K$, and maturity date $T$:

    • If the index is below the strike, the option is worthless.
    • Otherwise the holder receives the difference between the index price and the strike price: $S_T - K$.
  • Therefore the payoff to the option is:

$$\max(S_T - K, 0).$$

Outcomes

  • In-the-money (ITM):
    • a call (put) is in-the-money if $S > K$ ($S < K$).
  • At-the-money (ATM):
    • an option is at-the-money if $S \approx K$.
  • Out-of-the-money (OTM):
    • a call (put) is out-of-the-money if $S < K$ ($S > K$).
In [1]:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt


K = 8000 # Strike price
S = np.arange(7000, 9000, 100)  # index level values
h = np.maximum(S - K, 0)  # inner values of call option

Plotting the payoff function

In [2]:
plt.plot(S, h, lw=2.5)  # plot inner values at maturity
plt.xlabel('index level $S_t$ at maturity'); plt.ylabel('inner value of European call option')
Out[2]:
Text(0, 0.5, 'inner value of European call option')

Parameters which affect the inner-value

  • Initial price level of the index $S_0$.

  • Volatility of the index $\sigma$.

  • The return(s) of the index.

  • Time-to-maturity $T$.

Risk-neutral assumptions

When all of the following assumptions hold:

  • no arbitrage,
  • complete markets (no transaction costs and perfect information),
  • law of one price (assets with identical risk and return have a unique price)

Risk-neutral parameters

we can price options without having to take account of investors' risk-preferences using only the following parameters:

  • Initial price level of the index $S_0$.
  • Volatility of the index $\sigma$.
  • Time-to-maturity $T$.
  • The risk-free rate $r$.

Monte-Carlo option valuation

  • For a European option, the inner-value is not path-dependent.

  • Under risk-neutral pricing, we use Geometric Brownian Motion (GBM) with:

$$S_T = S_0 \times \exp((r - \frac{\sigma^2}{2})T + \sigma \sqrt{T}z)$$

  • where $r$ is the risk-free rate.

Non-path dependent algorithm to estimate the inner-value:

  1. Draw $I$ random numbers $z_1, z_2, \ldots, z_I$ from the standard normal distribution.
  2. For $i \in \{1, 2, \ldots, I\}$:

    i). Calculate index level at maturity by simulating geometric Brownian motion using the above equation.

    ii). Compute the inner-value of the option $h_T = \max(S_T - K, 0)$.

    iii). Discount back to the present at the risk-free rate $r$, giving the present value: $$C_i = e^{-rT} h_T.$$

  3. Output the final estimate by computing the Monte-Carlo estimator $\bar{\mathbf{C}} = \frac{\sum_{i=1}^I C_i}{I}$

Monte-Carlo valuation of European call option in Python

We use the following parameterization of the model: initial index price $S_0 = 100$, strike price $K=105$, time-to-maturity $T=1$, risk-free rate $r=0.02$, index volatility $\sigma=0.1$, number of independent realizations $I=10^5$.

In [2]:
from numpy import sqrt, exp, cumsum, sum, maximum, mean
from numpy.random import standard_normal

# Parameters
S0 = 100.; K = 105.; T = 1.0
r = 0.02; sigma = 0.1; I = 100000

# Simulate I outcomes 
S = S0 * exp((r - 0.5 * sigma ** 2) * T + sigma * sqrt(T) * 
                          standard_normal(I))

# Calculate the Monte Carlo estimator
C0 = exp(-r * T) * mean(maximum(S - K, 0)) 
print("Estimated present value is %f" % C0)
Estimated present value is 2.788481

Asian (average-value) option

  • The payoff of an Asian option is determined by the average price of the underlying over a pre-defined period of time.

  • Payoff for a fixed-strike Asian call option:

$$ C_T = \max(A(0, T) - K, 0) $$

where:

$$ A(0, T) = \frac{1}{T} \int_0^T S_t dt $$

If we let $t_i = i \times \frac{T}{n}, \; i \in \{0, 1, 2, \ldots, n\}$:

$$ A(0, T) \approx \frac{1}{n} \sum_{i=0}^{n-1} S_{t_i} $$

  • The payoff is path dependent, and therefore we need to simulate intermediate values of $S_t$.

Path-dependent Monte-Carlo option pricing

  • To simulate GBM at $M$ evenly spaced time intervals $t_i$ with $\Delta_t = T/M$:
$$ S_{t_k} = S_0 \times \exp(\sum_{i=1}^{k} \sigma\sqrt{\Delta_t} z_i + (r - \frac{\sigma^2}{2}) \Delta_t)$$

Algorithm for path-dependent option pricing

  1. Draw $I \times M$ random numbers from the standard normal distribution.
  2. For $i \in \{1, 2, \ldots, I\}$:

    i). Calculate the index level at times $t_i \in \{\Delta_t, 2\Delta_t, \ldots, T\}$ by simulating geometric Brownian motion with drift $\mu = r$ and volatility $\sigma$ using the equation for $S_{t_k}$.

    ii). Compute the inner-value of the option from the average price: $\hat{h}_T = \max\left(\frac{1}{M} \sum_{i=1}^M S_{t_i} - K, 0\right)$ (the code example below takes $K = 0$, so the payoff reduces to the average price).

    iii). Discount back to the present at the risk-free rate $r$, giving the present value: $$C_i = e^{-rT} \hat{h}_T.$$

  3. Output the final estimate by computing the Monte-Carlo estimator $\bar{\mathbf{C}} = \frac{\sum_{i=1}^I C_i}{I}$

Monte-Carlo valuation of Asian fixed-strike call option in Python

We use the following parameterization of the model: initial index price $S_0 = 100$, time-to-maturity $T=1$, number of time-steps $M=200$, risk-free rate $r=0.02$, index volatility $\sigma=0.1$, number of independent realizations $I=10^5$.

In [3]:
from numpy import sqrt, exp, cumsum, sum, maximum, mean
from numpy.random import standard_normal

# Parameters
S0 = 100.; T = 1.0; r = 0.02; sigma = 0.1
M = 200; dt = T / M; I = 100000

def inner_value(S):
    """ Average index level over each path; the inner value of a
        fixed-strike Asian call reduces to this when K = 0 """
    return mean(S, axis=0)

# Simulate I paths with M time steps
S = S0 * exp(cumsum((r - 0.5 * sigma ** 2) * dt + sigma * sqrt(dt) * 
                          standard_normal((M + 1, I)), axis=0))

# Calculate the Monte Carlo estimator
C0 = exp(-r * T) * mean(inner_value(S))
print("Estimated present value is %f" % C0)
Estimated present value is 99.019152

Estimating Value-At-Risk (VaR) in Python

Value-at-Risk (VaR)

  • Value at risk (VaR) is a methodology for computing a risk measurement on a portfolio of investments.

  • It is defined over:

    • a duration of time, e.g. one day.
    • a confidence level (or equivalent percentage) $\alpha$.
  • The $VaR_{\alpha}$ over duration $T$ is the maximum possible loss during $T$, excluding outcomes whose probability is less than $\alpha$, according to our model.


Mathematical definition

  • Value at Risk with confidence $\alpha$ can be defined
$$ VaR_{\alpha}(X) = \min\{x \in \mathbb{R} : 1 - F_X(-x) \geq \alpha\} $$

where $X$ is a random variable representing the value of the portfolio, with cumulative distribution function $F_X$.

  • The $VaR_{\alpha}(X)$ is simply the negative of the $\alpha$-quantile.

  • We typically assume mark-to-market accounting, and so the value of the portfolio is determined from fair market prices.

Quantiles, Quartiles and Percentiles

0th quartile = 0 quantile = 0th percentile

1st quartile = 0.25 quantile = 25th percentile

2nd quartile = 0.5 quantile = 50th percentile (median)

3rd quartile = 0.75 quantile = 75th percentile

4th quartile = 1 quantile = 100th percentile

Computing quantiles in Python

  • First we will generate some random data.
In [16]:
import numpy as np

# Draw values sampled iid from standard normal distribution
data = np.random.normal(size=100000)
import matplotlib.pyplot as plt
ax = plt.hist(data, bins=100)
plt.show()

Computing quantiles in Python

In [17]:
# Compute the 5th-percentile
np.percentile(data, q=5)
Out[17]:
-1.6503146704878209

Computing several percentiles

In [18]:
for p in range(1, 6):    
    print("The %d-percentile is %f" % (p, np.percentile(data, q=p)))
The 1-percentile is -2.352340
The 2-percentile is -2.071996
The 3-percentile is -1.883997
The 4-percentile is -1.755926
The 5-percentile is -1.650315

Estimating VaR

  • The VaR depends on the distribution of a random variable, e.g. the price of an index, over a specified period of time.

  • How can we estimate the quantiles of this distribution?

Estimating VaR

Common methods:

  • Variance/Covariance method.

  • Historical simulation: bootstrap from historical data.

  • Monte-Carlo simulation.

Historical simulation

To calculate $VaR_{\alpha}(X)$ with sampling interval $\Delta_t$ over $T = n \times \Delta_t$ using a total of $N$ bootstrap samples:

  1. Assuming that the returns are stationary over the entire period, obtain a large sample of historical prices for the components of the portfolio or index.

  2. Convert the prices into returns with frequency $1/\Delta_t$.

  3. For every $i \in \{1, \ldots, N\}$:

    • Randomly choose $n$ returns $r_1, r_2, \ldots, r_n$ with replacement.

    • Compute $P_i$, the profit and loss of the investment given the returns $(r_1, r_2, \ldots, r_n)$.

  4. Compute $Q(\alpha)$ from the sample $\mathbf{P}$, where $Q$ is the quantile function.
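Putting these steps together, a minimal sketch of the whole procedure (the function name historical_var and its default arguments are illustrative; returns should be an array of historical log returns such as the one constructed below):

import numpy as np

def historical_var(returns, alpha=0.05, n=5, N=100000, position_value=1000000.):
    """ Bootstrap estimate of VaR_alpha over n periods from historical log returns """
    def pnl():
        # Step 3: resample n returns with replacement and compute profit and loss
        r = np.random.choice(returns, size=n, replace=True)
        return position_value * (np.exp(np.sum(r)) - 1.)
    sample = [pnl() for i in range(N)]
    # Step 4: the VaR is the negative of the alpha-quantile of the P&L sample
    return -np.percentile(sample, alpha * 100.)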

Random choices in Python

  • We can use the function choice() from the numpy.random module to choose randomly from a set of values.

  • To choose a single random value:

In [19]:
import numpy as np
data = np.random.randint(1, 6+1, size=20)
data
Out[19]:
array([1, 2, 3, 2, 6, 4, 3, 2, 1, 5, 6, 2, 1, 3, 1, 4, 2, 3, 4, 2])
In [20]:
np.random.choice(data, replace=True)
Out[20]:
1
In [21]:
np.random.choice(data, replace=True)
Out[21]:
6

Generating a sequence of choices

  • We can think of this as a simple bootstrap model of a dice:
In [22]:
np.random.choice(data, size=5, replace=True)
Out[22]:
array([2, 3, 5, 1, 3])

Bootstrapping from empirical data

  • Typically we will collect real-world (empirical) data from a random process whose true distribution is unknown.

  • In Finance, we can bootstrap from historical returns.

Obtaining returns for the Nikkei 225 index

In [23]:
import datetime
from pandas_datareader import data

start = datetime.datetime(2015, 1, 1)
end = datetime.datetime(2016, 1, 1)
n225 = data.DataReader("^N225", "yahoo", start, end)
returns = np.diff(np.log(n225["Adj Close"]))
plt.plot(returns)
plt.xlabel('t')
plt.ylabel('r')
plt.show()

Simulating returns

  • We will now simulate the returns over the next five days:
In [24]:
num_days = 5
simulated_returns = \
    np.random.choice(returns, size=num_days, replace=True)
simulated_returns
Out[24]:
array([ 0.01161962,  0.00071675, -0.00093621, -0.01293   ,  0.01161962])

Simulating prices

In [25]:
initial_price = 100.
prices = initial_price * np.exp(np.cumsum(simulated_returns))
plt.plot(prices)
plt.xlabel('t')
plt.ylabel('price')
plt.show()

Multiple realisations

  • If we perform this simulation again, will we obtain the same result?

First simulation

In [26]:
num_days = 5
simulated_returns = \
    np.random.choice(returns, size=num_days, replace=True)
prices = initial_price * np.exp(np.cumsum(simulated_returns))
plt.plot(prices)
plt.xlabel('t')
plt.ylabel('price')
plt.show()

Second simulation

In [27]:
num_days = 5
simulated_returns = \
    np.random.choice(returns, size=num_days, replace=True)
prices = initial_price * np.exp(np.cumsum(simulated_returns))
plt.plot(prices)
plt.xlabel('t')
plt.ylabel('price')
plt.show()

Third simulation

In [28]:
num_days = 5
simulated_returns = \
    np.random.choice(returns, size=num_days, replace=True)
prices = initial_price * np.exp(np.cumsum(simulated_returns))
plt.plot(prices)
plt.xlabel('t')
plt.ylabel('price')
plt.show()

Many realisations - the distribution of the final price

In [29]:
def final_price():
    num_days = 20
    simulated_returns = \
        np.random.choice(returns, size=num_days, replace=True)
    prices = initial_price * np.exp(np.cumsum(simulated_returns))
    return prices[-1]

num_samples = 100000
prices = [final_price() for i in range(num_samples)]
plt.hist(prices, bins=100)
plt.show()

The distribution of the profit and loss

In [32]:
def profit_and_loss(final_price):
    return 1200000. * (final_price - initial_price)
    
p_and_l = np.vectorize(profit_and_loss)(prices)
plt.hist(p_and_l, bins=100)
plt.show()

The quantiles of the profit and loss

In [33]:
for p in range(1, 6):    
    print("The %.2f-quantile is %.4f" % (p/100., np.percentile(p_and_l, q=p)))
The 0.01-quantile is -14902743.5058
The 0.02-quantile is -13131038.2840
The 0.03-quantile is -11951449.1238
The 0.04-quantile is -11107289.6563
The 0.05-quantile is -10361058.5222

What is the 5% Value-At-Risk?

In [34]:
var = -1 * np.percentile(p_and_l, q=5)
print("5%%-VaR is %.4f" % var)
5%-VaR is 10361058.5222