Advanced NumPy
Array container
Learning Objectives
After the lesson, the learner:
- Can list some of the object types which can be contained in an array.
- Understands the concept of dtype and can select the dtype best suited to the data at hand.
- Knows what an object array is and when it is created.
- Can explain what the ndim, shape and strides properties of an array are.
- Understands the layout of an array in memory and knows how to use it for best array performance.
- Can explain the difference between Fortran- and C-based order. Knows the default.
Data type
In contrast to built-in Python containers (like lists), NumPy arrays can only store elements of a single, predetermined type. To see the type of an array's contents, use the dtype attribute. Let’s look at two examples:
>>> a = np.array([1, 2, 3])
>>> a.dtype
dtype('int64')
>>> b = np.array([1., 2., 3.])
>>> b.dtype
dtype('float64')
In the first case the numbers are 64-bit (8-byte) integers, the default integer type on most platforms, and in the second 64-bit floating point (real) numbers. Note that NumPy auto-detects the data type from the input. Specialised data types allow us to store data more compactly in memory, but most of the time we simply work with floating point numbers.
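To make the point about compact storage concrete, here is a minimal sketch comparing the memory used by the same number of elements under three dtypes. The element count `n` and the choice of dtypes are arbitrary and only serve as an illustration.

```python
import numpy as np

n = 1_000_000  # number of elements, chosen only for illustration

# Memory used by the same number of elements under three dtypes.
sizes = {}
for dtype in (np.float64, np.float32, np.uint8):
    arr = np.zeros(n, dtype=dtype)
    # itemsize: bytes per element; nbytes: bytes for the whole data block
    sizes[arr.dtype.name] = (arr.itemsize, arr.nbytes)

for name, (itemsize, nbytes) in sizes.items():
    print(f"{name}: {itemsize} byte(s) per element, {nbytes} bytes in total")
```

A smaller dtype saves memory, but it also narrows the range or precision of the values that can be stored.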
Note that all elements of an array must be of the same type. If we construct an array from elements of different types, they are cast to the “most general” type that can represent all of them. For example, an array constructed from real numbers and integers will have a floating point data type:
>>> a = np.array([1., 2])
>>> a.dtype
dtype('float64')
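The “most general” type can also be queried without building an array. The sketch below uses np.result_type for that; the particular dtype combinations are only examples.

```python
import numpy as np

# Mixing a float and an integer gives a floating point array.
mixed = np.array([1., 2])
print(mixed.dtype)

# np.result_type reports the common dtype for a combination of types.
common = np.result_type(np.float64, np.int64)
# Neither uint8 nor int8 can hold all values of the other, so a wider
# signed integer is chosen.
small = np.result_type(np.uint8, np.int8)
# A single complex element makes the whole array complex.
with_complex = np.array([1, 2.5, 3j]).dtype

print(common, small, with_complex)
```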
If no such common type exists, NumPy will use the object type (also represented by a capital 'O'), which can represent any Python object, even a function:
>>> def f(): pass
>>> a = np.array([f, f])
>>> a.dtype
dtype('O')
Some NumPy features (element-wise functions such as np.abs and np.sqrt, or reductions such as np.sum and np.max) generally fail on object arrays unless the stored objects themselves support the operation, but all types of indexing still work.
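As a small illustration of this behaviour, the sketch below builds an object array from arbitrary Python objects, indexes it in several ways, and then shows an element-wise function failing. The array contents are made up for the example, and the exact exception type may differ between NumPy versions, so both likely candidates are caught.

```python
import numpy as np

def f():
    pass

a = np.array([f, len, "text"], dtype=object)

# Indexing, slicing and boolean masks behave as for any other array.
first = a[0]
tail = a[1:]
picked = a[np.array([True, False, True])]

# An element-wise function fails because functions and strings
# do not support a square root.
try:
    np.sqrt(a)
    sqrt_failed = False
except (TypeError, AttributeError):
    sqrt_failed = True

print(tail.shape, picked.shape, sqrt_failed)
```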
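The object type is most commonly encountered when constructing an array from multiple lists of different lengths. Older NumPy versions fell back to object automatically in this case; recent versions raise a ValueError unless dtype=object is passed explicitly:
>>> np.array([[1], [2, 3]], dtype=object)
array([list([1]), list([2, 3])], dtype=object)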
Integer or real number?
Construct the array x = np.array([0, 1, 2, 255], dtype=np.uint8) (here, uint8 represents a single byte in memory, an unsigned integer between 0 and 255). Can you explain the results obtained by x + 1 and x / 2? Also try x.astype(float) + 1 and x.astype(float) / 2.
Data types
Try to guess the data type of the following arrays. Then test your prediction by constructing the arrays and checking their dtype attribute.
a = np.array([[1, 2],
              [2, 3]])
b = np.array(['a', 'b', 'c'])
c = np.array([1, 2, 'a'])
d = np.array([np.dot, np.array])
e = np.random.randn(5) > 0
f = np.arange(5)
Complex data types
Imagine you have a list of names and scores that you want to store in a NumPy array. Construct a dtype such that the following works. Look at the documentation of np.dtype.
dtype = ?
np.array([('Bartosz', 5), ('Nelle', 4)], dtype=dtype)
Memory layout
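A NumPy array is just a memory block with extra information on how to interpret its contents. Since memory has only a linear address space, NumPy arrays need extra information on how to lay out this block into multiple dimensions. This is done by means of the shape and strides attributes, where shape gives the number of elements along each dimension and strides gives the number of bytes to step in memory to reach the next element along each dimension: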
Let’s try to reproduce this example. We first generate a 1D NumPy array of 8 elements:
>>> a = np.arange(8, dtype=np.uint8)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7], dtype=uint8)
>>> a.strides
(1,)
>>> a.shape
(8,)
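Strides are counted in bytes, not in elements, so they depend on the dtype. The following sketch repeats the example with 8-byte integers; the variable names `b` and `b2` are only illustrative.

```python
import numpy as np

# Same 8 values, but each element now occupies 8 bytes instead of 1.
b = np.arange(8, dtype=np.int64)
print(b.itemsize)   # 8
print(b.strides)    # (8,)

# Reshaped to 2 x 4: moving one row down skips 4 elements (32 bytes),
# moving one column to the right skips 1 element (8 bytes).
b2 = b.reshape(2, 4)
print(b2.strides)   # (32, 8)
```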
Assigning to the shape and strides attributes directly is discouraged. Instead, we may use the as_strided function from the np.lib.stride_tricks module:
>>> a1 = np.lib.stride_tricks.as_strided(a, strides=(4, 1), shape=(2,4))
>>> a1
array([[0, 1, 2, 3],
       [4, 5, 6, 7]], dtype=uint8)
Similarly, we can obtain the second example:
>>> a2 = np.lib.stride_tricks.as_strided(a, strides=(2, 1), shape=(3,4))
>>> a2
array([[0, 1, 2, 3],
       [2, 3, 4, 5],
       [4, 5, 6, 7]], dtype=uint8)
Note that in the second case the same data appears twice. However, it does not consume extra memory – all three arrays share the same memory block:
>>> a[2] = 100
>>> a1
array([[  0,   1, 100,   3],
       [  4,   5,   6,   7]], dtype=uint8)
>>> a2
array([[  0,   1, 100,   3],
       [100,   3,   4,   5],
       [  4,   5,   6,   7]], dtype=uint8)
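The way strides map a multi-dimensional index onto the linear memory block can be written down explicitly. The sketch below is a simplified model; `byte_offset` is a hypothetical helper written for this illustration, not a NumPy function. Because the elements here are single bytes, the byte offset coincides with the position in the original 1D array.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.arange(8, dtype=np.uint8)
a2 = as_strided(a, strides=(2, 1), shape=(3, 4))

def byte_offset(index, strides):
    """Byte offset of an element: sum of index * stride over all dimensions."""
    return sum(i * s for i, s in zip(index, strides))

# Element (1, 3) of a2 sits 1*2 + 3*1 = 5 bytes into the block, i.e. at a[5].
offset = byte_offset((1, 3), a2.strides)
print(offset, a2[1, 3], a[offset])

# Both arrays are views on the same memory block.
shared = np.shares_memory(a, a2)
print(shared)
```

Note that as_strided does not check that the requested shape and strides stay inside the original memory block, so a wrong value can expose unrelated memory.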
Transpose
Create a 3x4 random array. Have a look at its different properties: x.shape, x.ndim, x.dtype, x.strides. What does each property tell you about the array?
Compare the strides property of x.T to the above. What is x.T and can you explain the new strides?
Fastest changing index
Compare the time of summing over rows and columns of an array A = np.random.rand(10, 100000). How would you explain the differences? (Hint: To measure evaluation time you can use the %timeit magic of IPython.)
Sliding window
Use as_strided to produce a sliding-window view of a 1D array.
def sliding_window(arr, size=2):
    """Produce an array of sliding window views of `arr`

    Parameters
    ----------
    arr : 1D array, shape (N,)
        The input array.
    size : int, optional
        The size of the sliding window.

    Returns
    -------
    arr_slide : 2D array, shape (N - size + 1, size)
        The sliding windows of size `size` of `arr`.

    Examples
    --------
    >>> a = np.array([0, 1, 2, 3])
    >>> sliding_window(a, 2)
    array([[0, 1],
           [1, 2],
           [2, 3]])
    """
    return arr  # fix this
Fortran or C-ordering?
The order keyword of some NumPy functions determines how two- or higher-dimensional arrays are laid out in memory. If order is ‘C’, the array will be in C-contiguous order (the last index varies the fastest). If order is ‘F’, the array will be in Fortran-contiguous order (the first index varies the fastest). In which order is a 2D array stored in memory by default? (Hint: You can use np.ravel(x, order='A') to test it.)
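The effect of the order keyword shows up directly in the strides. Here is a minimal sketch with the order requested explicitly in both cases; the array values and the int64 dtype are chosen only to make the numbers easy to follow.

```python
import numpy as np

c = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64, order='C')
f = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64, order='F')

# Same shape and values, different strides (in bytes).
print(c.strides)  # (24, 8): the last index varies the fastest
print(f.strides)  # (8, 16): the first index varies the fastest

# order='K' reads the elements in the order they occur in memory.
c_mem = np.ravel(c, order='K')
f_mem = np.ravel(f, order='K')
print(c_mem)  # [1 2 3 4 5 6]
print(f_mem)  # [1 4 2 5 3 6]
```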
Broadcasting revisited
Explain how broadcasting works internally using the example below. What will be the shapes and strides of x and y after broadcasting? Test it using np.broadcast_arrays in the following example and look at the strides and shape properties of both arrays.
x = np.random.rand(5, 10)
y = np.random.rand(10)
z = x + y
xb, yb = np.broadcast_arrays(x, y)