# NumPy (Part II)

What you will learn in this lesson:


- Indexing and slicing NumPy arrays
- Performing calculations on NumPy arrays

In [1]:
# Remember, we always have to import the package before starting to use it!
import numpy as np

## Indexing and Slicing

**1D** arrays in NumPy are indexed, sliced, and iterated over in the same way as lists and other Python data structures.

In [2]:
arr1d = np.arange(10)
arr1d

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [3]:
arr1d[5]

5

In [4]:
arr1d[5:8]

array([5, 6, 7])

Note that if we assign a scalar value to a slice of an array, all elements within that slice will be updated to the same value. This behavior is known as **broadcasting**.

In [5]:
arr1d[5:8] = 12
arr1d

array([ 0,  1,  2,  3,  4, 12, 12, 12,  8,  9])

Also, note that changes made to slices directly affect the original array, as slices are **views**, not copies.

In [7]:
arr_slice = arr1d[5:8]
arr_slice

array([12, 12, 12])

In [9]:
arr_slice[1] = 12345
arr1d

array([    0,     1,     2,     3,     4,    12, 12345,    12,     8,
           9])

You can set all elements to a given value by slicing the whole array:

In [10]:
arr_slice[:] = 64
arr_slice

array([64, 64, 64])

In [12]:
# See again how the original array also changed...
arr1d

array([ 0,  1,  2,  3,  4, 64, 64, 64,  8,  9])

Since NumPy is designed for handling large datasets, copying data unnecessarily could lead to performance and memory issues. To avoid this, NumPy uses views by default instead of copies.

```{note}
If you want a copy of a slice of an ndarray instead of a view, you will need to explicitly copy the array; for example `arr[5:8].copy()`.
```

In [14]:
arr1d = np.arange(10)

arr_slice = arr1d[5:8].copy() # create a copy instead of a view

print(arr_slice)

arr_slice[:] = 64

# The slice has changed...
print(arr_slice)

# But the original array has not
print(arr1d)

[5 6 7]
[64 64 64]
[0 1 2 3 4 5 6 7 8 9]
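
If you are unsure whether an array is a view or a copy, one way to check is `np.shares_memory`, which reports whether two arrays overlap in memory. The following is a minimal sketch; the names `base`, `view`, and `independent` are illustrative and do not appear elsewhere in this lesson.

```python
import numpy as np

base = np.arange(10)

view = base[5:8]                 # basic slicing returns a view
independent = base[5:8].copy()   # an explicit copy owns its own data

view_shares = np.shares_memory(base, view)          # True: same underlying buffer
copy_shares = np.shares_memory(base, independent)   # False: separate buffer
print(view_shares, copy_shares)

# Writing through the view changes the original; writing to the copy does not
view[0] = -1
independent[1] = -2
print(base)
```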


**Higher Dimensional Arrays**

In higher dimensional arrays, you index/slice arrays along each axis independently.

Here is a **2D** example:

In [15]:
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr2d

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [16]:
arr2d[2]

array([7, 8, 9])

In [17]:
arr2d[0][2]

3

In [18]:
arr2d[0][1:3]

array([2, 3])



**Slicing: Simplified notation**

In [19]:
arr2d[0, 2]

3

In [20]:
arr2d[0, 1:3]

array([2, 3])
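
The comma notation also accepts a slice on every axis at once. The sketch below uses a hypothetical array named `grid` (same values as `arr2d`) to show how integer indices drop an axis while slices keep it.

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

top_right = grid[:2, 1:]    # first two rows, last two columns -> shape (2, 2)
last_row = grid[2]          # an integer index drops that axis -> shape (3,)
last_row_2d = grid[2:, :]   # a slice keeps the axis          -> shape (1, 3)
first_col = grid[:, 0]      # all rows, first column          -> shape (3,)

print(top_right)
print(last_row.shape, last_row_2d.shape, first_col.shape)
```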

Here is a visual representation of a 2D array:

<img src="https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781449323592/files/httpatomoreillycomsourceoreillyimages2172112.png" height="50%" width="50%"/>

**Two-Dimensional Array Slicing**

<img src="https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781449323592/files/httpatomoreillycomsourceoreillyimages2172114.png" height="50%" width="50%"/>

The following is an example with **3D** arrays:

In [21]:
arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

print(arr3d)
print(arr3d.shape)

[[[ 1  2  3]
  [ 4  5  6]]

 [[ 7  8  9]
  [10 11 12]]]
(2, 2, 3)


NumPy's printed output can be a bit difficult to parse visually.

💡 **Here is a way to visualize data with three or more dimensions:**

```python
[ # AXIS 0                     CONTAINS 2 ELEMENTS (arrays)
    [ # AXIS 1                 CONTAINS 2 ELEMENTS (arrays)
        [1, 2, 3], # AXIS 2    CONTAINS 3 ELEMENTS (integers)
        [4, 5, 6]  # AXIS 2
    ],  
    [ # AXIS 1
        [7, 8, 9],
        [10, 11, 12]
    ]
]
```
Each axis is one level of the nested hierarchy, which you can picture as a tree:

* Each axis is a container.
* There is only one top container.
* Only the bottom containers have data.


<div class="alert alert-block alert-info">
    <b>Important:</b> 

In multidimensional arrays, if you omit indices for the later dimensions, the returned object will be a **lower-dimensional ndarray** that contains all the data from the higher-indexed dimensions.
</div>

So in the 2 × 2 × 3 array `arr3d`:

In [23]:
print(arr3d[0])
print(arr3d[0].shape)

[[1 2 3]
 [4 5 6]]
(2, 3)


In [24]:
x = arr3d[1]
x

array([[ 7,  8,  9],
       [10, 11, 12]])

In [26]:
print(x[0])
print(x[0].shape)

[7 8 9]
(3,)


We can save the data before modifying an array:

In [27]:
old_values = arr3d[0].copy()
arr3d[0] = 42
arr3d

array([[[42, 42, 42],
        [42, 42, 42]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

We can then put the saved data back:

In [28]:
arr3d[0] = old_values
arr3d

array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

Similarly, `arr3d[1, 0]` gives you all of the values whose indices start with (1, 0), forming a 1-dimensional array:

In [29]:
arr3d[1, 0]

array([7, 8, 9])

### Boolean slicing

In NumPy, we can mask arrays using boolean arrays. This is a crucial concept because the same idea applies in other tools, such as the pandas library and the R language.

You can pass a boolean array to the array indexer (i.e., `[]`), and it will return only the elements where the corresponding boolean value is `True`.

For example, let's continue with the 2D array we used earlier:

In [30]:
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr2d

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [31]:
# This will flag the elements that satisfy the condition of being greater than 5.
arr2d > 5

array([[False, False, False],
       [False, False,  True],
       [ True,  True,  True]])

In [32]:
# We can use this to mask our array
arr2d[arr2d > 5]

array([6, 7, 8, 9])

Let's look at a more detailed example, typical of a data science scenario. Assume we have two related arrays:

* `names`, which represents the rows (observations) of a table.
* `data`, which holds the data associated with each feature.

In the example below, we will have 7 observations, and 4 features:

In [33]:
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
print(names)

data = np.random.standard_normal((7, 4))
print(data)

['Bob' 'Joe' 'Will' 'Bob' 'Will' 'Joe' 'Joe']
[[-0.37089796  1.79692896  0.31746519 -0.16161859]
 [ 0.06106916  0.51503547  2.96334102  0.1863705 ]
 [ 0.36083801  0.44561537  0.24537085  0.00764191]
 [ 1.13704217  0.14447022 -0.16208699  0.22259159]
 [ 1.31751621  0.64211213  0.86928899 -1.15932185]
 [ 0.62638858 -1.12000534 -0.59301056 -0.43993317]
 [ 0.19025998  0.32629521  0.64349858  0.98523597]]


In [35]:
print(names.shape, data.shape)

(7,) (7, 4)


A comparison operation on an array returns an array of boolean values.

In [36]:
# This returns an array of booleans indicating whether each element is equal to 'Bob'
names == 'Bob'

array([ True, False, False,  True, False, False, False])

In [37]:
# We can now use this as an array indexer to mask our data
data[names == 'Bob']

array([[-0.37089796,  1.79692896,  0.31746519, -0.16161859],
       [ 1.13704217,  0.14447022, -0.16208699,  0.22259159]])

In [38]:
# We can also, at the same time, slice on the second axis to select data
data[names == 'Bob', 2:]

array([[ 0.31746519, -0.16161859],
       [-0.16208699,  0.22259159]])

Here are some examples of boolean operations being applied:

In [39]:
# Both `names != 'Bob'` and `~(names == 'Bob')` select all rows whose name is not 'Bob'
data[~(names == 'Bob')]

array([[ 0.06106916,  0.51503547,  2.96334102,  0.1863705 ],
       [ 0.36083801,  0.44561537,  0.24537085,  0.00764191],
       [ 1.31751621,  0.64211213,  0.86928899, -1.15932185],
       [ 0.62638858, -1.12000534, -0.59301056, -0.43993317],
       [ 0.19025998,  0.32629521,  0.64349858,  0.98523597]])

In [40]:
cond = (names == 'Bob')
data[~cond]

array([[ 0.06106916,  0.51503547,  2.96334102,  0.1863705 ],
       [ 0.36083801,  0.44561537,  0.24537085,  0.00764191],
       [ 1.31751621,  0.64211213,  0.86928899, -1.15932185],
       [ 0.62638858, -1.12000534, -0.59301056, -0.43993317],
       [ 0.19025998,  0.32629521,  0.64349858,  0.98523597]])

In [41]:
# This selects all rows whose names are 'Bob' or 'Will'
mask = (names == 'Bob') | (names == 'Will')
print(mask)
print(data[mask])

[ True False  True  True  True False False]
[[-0.37089796  1.79692896  0.31746519 -0.16161859]
 [ 0.36083801  0.44561537  0.24537085  0.00764191]
 [ 1.13704217  0.14447022 -0.16208699  0.22259159]
 [ 1.31751621  0.64211213  0.86928899 -1.15932185]]


In [42]:
# We can always update the masked parts of the original array using broadcasting
data[names != 'Joe'] = 7
data

array([[ 7.        ,  7.        ,  7.        ,  7.        ],
       [ 0.06106916,  0.51503547,  2.96334102,  0.1863705 ],
       [ 7.        ,  7.        ,  7.        ,  7.        ],
       [ 7.        ,  7.        ,  7.        ,  7.        ],
       [ 7.        ,  7.        ,  7.        ,  7.        ],
       [ 0.62638858, -1.12000534, -0.59301056, -0.43993317],
       [ 0.19025998,  0.32629521,  0.64349858,  0.98523597]])

<div class="alert alert-block alert-info">
    <b>Important:</b> We use the tilde <code>~</code> instead of <code>not</code> to negate (flip) a value. Similarly, we use <code>&</code> and <code>|</code> instead of <code>and</code> and <code>or</code>.
</div>
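
The sketch below illustrates why the element-wise operators are needed. The array `values` is a made-up example. Note the parentheses around each comparison: `&` and `|` bind more tightly than `>` and `<`, so leaving them out changes the meaning of the expression.

```python
import numpy as np

values = np.array([1, 4, 6, 9])

# Element-wise AND: each comparison is wrapped in parentheses
in_range = (values > 2) & (values < 8)
print(in_range)
print(values[in_range])

# Element-wise NOT
outside = values[~in_range]
print(outside)

# Python's `and` asks for the truth value of a whole array, which is ambiguous
try:
    (values > 2) and (values < 8)
    raised = False
except ValueError:
    raised = True
print(raised)
```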

### Fancy Indexing

In what is known as fancy indexing, we use arrays of index numbers to access specific data.

Instead of passing a single integer or a range using `:`, we pass a `list` of index numbers to the indexer.

In [43]:
# Let's create the following array
arr = np.empty((8, 4))
for i in range(8):
    arr[i] = i
arr

array([[0., 0., 0., 0.],
       [1., 1., 1., 1.],
       [2., 2., 2., 2.],
       [3., 3., 3., 3.],
       [4., 4., 4., 4.],
       [5., 5., 5., 5.],
       [6., 6., 6., 6.],
       [7., 7., 7., 7.]])

In [44]:
# This selects rows 4, 3, 0, and 6, in that order
arr[[4, 3, 0, 6]]

array([[4., 4., 4., 4.],
       [3., 3., 3., 3.],
       [0., 0., 0., 0.],
       [6., 6., 6., 6.]])

In [45]:
# We can also index using negative index rules
arr[[-3, 0, -1]]

array([[5., 5., 5., 5.],
       [0., 0., 0., 0.],
       [7., 7., 7., 7.]])

Note that in these examples we are indexing along the first axis, which is equivalent to:

In [229]:
arr[[4, 3, 0, 6], :]

array([[4., 4., 4., 4.],
       [3., 3., 3., 3.],
       [0., 0., 0., 0.],
       [6., 6., 6., 6.]])

We can also perform indexing along other axes:

In [230]:
# This selects the first and third columns
arr[:, [0, 2]]

array([[0., 0.],
       [1., 1.],
       [2., 2.],
       [3., 3.],
       [4., 4.],
       [5., 5.],
       [6., 6.],
       [7., 7.]])

What is happening in the previous examples is that we are combining fancy indexing with standard slicing.

In [46]:
# We could also do this
arr[:3, [0, 2]]

array([[0., 0.],
       [1., 1.],
       [2., 2.]])

We can also combine fancy indexing and simple indices:

In [47]:
arr[0, [0, 2]]

array([0., 0.])

And even with masking:

In [48]:
mask = np.array([1, 0, 1, 0, 0, 0, 1, 0], dtype=bool)
arr[mask, :]

array([[0., 0., 0., 0.],
       [2., 2., 2., 2.],
       [6., 6., 6., 6.]])
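
One difference from basic slicing is worth keeping in mind: selecting with a list of indices or a boolean mask returns a copy, not a view. The following sketch uses a hypothetical array `table` to show this.

```python
import numpy as np

table = np.arange(12).reshape(4, 3)

rows_slice = table[1:3]       # basic slice    -> view
rows_fancy = table[[1, 2]]    # fancy indexing -> copy
rows_mask = table[np.array([False, True, True, False])]  # boolean mask -> copy

rows_fancy[:] = -1            # modifies only the copy
rows_mask[:] = -2             # modifies only the copy
after_copies = table.copy()   # snapshot: the original is still untouched

rows_slice[:] = 0             # modifies the original through the view
print(after_copies)
print(table)
```

Assigning directly through the indexing expression (for example `table[[1, 2]] = 0`) still updates the original array; the copy is only made when the selection is read into a new variable.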

Now look at this example:

In [49]:
arr[[1, 5, 7, 2], [0, 3, 1, 2]]

array([1., 5., 7., 2.])

This selects the individual elements at positions (1, 0), (5, 3), (7, 1), and (2, 2): the two index lists are paired up element by element.

If we want to index both axes to create a new array with specific rows and columns, we need to apply fancy indexing separately to each axis.

In [50]:
# We first index axis 0 (rows) and then axis 1 (columns)
arr[[1, 5, 7, 2], :][:,[0, 3, 1, 2]]

array([[1., 1., 1., 1.],
       [5., 5., 5., 5.],
       [7., 7., 7., 7.],
       [2., 2., 2., 2.]])

In [51]:
# Here we first index axis 1 (columns) and then axis 0 (rows)
arr[:,[0, 3, 1, 2]][[1, 5, 7, 2], :]

array([[1., 1., 1., 1.],
       [5., 5., 5., 5.],
       [7., 7., 7., 7.],
       [2., 2., 2., 2.]])
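
As an alternative to chaining two indexing steps, NumPy provides `np.ix_`, which builds index arrays that select every combination of the given rows and columns in a single step. The sketch below uses a hypothetical array `grid` with distinct values so the selection is easier to follow.

```python
import numpy as np

grid = np.arange(32).reshape(8, 4)

rows = [1, 5, 7, 2]
cols = [0, 3, 1, 2]

chained = grid[rows, :][:, cols]      # two-step approach
with_ix = grid[np.ix_(rows, cols)]    # single step with np.ix_

print(with_ix)
print(np.array_equal(chained, with_ix))
```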

## Inserting + Dropping Array Values

Sometimes, it's useful to exclude a specific index or drop the start or end of an array of values.

In [52]:
myarr = np.array([10,15,20,25,30,35,40,45,50])

- `np.insert`: It inserts an element at a given index position.

In [53]:
# This inserts the value 200 at the third position (index 2)
np.insert(myarr, 2, 200)

array([ 10,  15, 200,  20,  25,  30,  35,  40,  45,  50])

- `np.delete`: It drops the element at a specific index.

In [54]:
# This drops the third element
np.delete(myarr, 2)

array([10, 15, 25, 30, 35, 40, 45, 50])

In both cases, **a new array is being created**. That is, these functions do not modify the original array.

In [55]:
print(np.insert(myarr, 2, 200))
print(np.delete(myarr, 2))
print(myarr)

[ 10  15 200  20  25  30  35  40  45  50]
[10 15 25 30 35 40 45 50]
[10 15 20 25 30 35 40 45 50]
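
For dropping the start or end of an array, plain slicing is often enough, and `np.delete` also accepts a list of indices. A short sketch, using an illustrative array named `values`:

```python
import numpy as np

values = np.array([10, 15, 20, 25, 30, 35, 40, 45, 50])

without_first = values[1:]                 # drop the first element (a view)
without_last = values[:-1]                 # drop the last element (a view)
without_some = np.delete(values, [0, 2])   # drop indices 0 and 2 (a new array)

print(without_first)
print(without_last)
print(without_some)
print(values)   # the original is unchanged in every case
```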


## Basic calculations

- Addition and subtraction

In [56]:
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 2.0, 2.0, 2.0])

print(a + b)

print(a + 2)

print(a - b)

print(a - 2)


[3. 4. 5. 6.]
[3. 4. 5. 6.]
[-1.  0.  1.  2.]
[-1.  0.  1.  2.]


- Multiplication and division

In [57]:
print(a * b)

print(a * 2)

print(a / b)

print(a / 2)


[2. 4. 6. 8.]
[2. 4. 6. 8.]
[0.5 1.  1.5 2. ]
[0.5 1.  1.5 2. ]
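
The same element-wise rules apply when the operands have different shapes, as long as NumPy can align them through broadcasting. The sketch below uses made-up arrays `matrix` and `row` to show a compatible case and an incompatible one.

```python
import numpy as np

matrix = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
row = np.array([10.0, 100.0, 1000.0])

scaled = matrix * row     # row, shape (3,), is applied to every row of matrix
shifted = matrix + 0.5    # a scalar is applied to every element
print(scaled)
print(shifted)

# Shapes that cannot be aligned raise an error
try:
    matrix + np.array([1.0, 2.0])   # shape (2,) does not match the last axis (3)
    mismatch_error = False
except ValueError:
    mismatch_error = True
print(mismatch_error)
```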


## More useful calculations

NumPy includes hundreds of built-in functions for performing operations, most of which can be applied directly to array data. Here are some common and straightforward examples:

In [58]:
# Start with a basic two-dimensional array to manipulate:
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)

- `np.flip`: It reverses the order of the elements in an array. With no `axis` argument, it reverses along every axis.

In [60]:
arr2_flipped = np.flip(arr2)
print(arr2_flipped)

[[8 7 6 5]
 [4 3 2 1]]
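
To reverse along only one axis, `np.flip` takes an `axis` argument. A small sketch with an illustrative array `grid` holding the same values as `arr2`:

```python
import numpy as np

grid = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

flipped_all = np.flip(grid)            # reverse along every axis
flipped_rows = np.flip(grid, axis=0)   # reverse the order of the rows
flipped_cols = np.flip(grid, axis=1)   # reverse the elements within each row

print(flipped_rows)
print(flipped_cols)
```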


- `np.copy`: It copies an array into an entirely separate array. The two different `id` values below show that they are distinct objects.

In [61]:
arr2_copy = np.copy(arr2)
print(hex(id(arr2_copy)))
print(hex(id(arr2)))

0x7f4fd2076730
0x7f4fd2076370


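- `np.concatenate`: It joins a sequence of arrays along an existing axis (axis 0 by default). Passing a single 2D array treats its rows as the sequence, so they are joined into one 1D array.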

In [62]:
arr2_concat = np.concatenate(arr2)
print(arr2_concat)

[1 2 3 4 5 6 7 8]
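
More typically, `np.concatenate` is given several arrays and an `axis`. The sketch below uses made-up arrays (`top`, `bottom`, `side`) and also shows `ravel`, which is a more direct way to flatten a single array.

```python
import numpy as np

top = np.array([[1, 2], [3, 4]])
bottom = np.array([[5, 6]])
side = np.array([[7], [8]])

stacked_rows = np.concatenate([top, bottom], axis=0)   # (2, 2) + (1, 2) -> (3, 2)
stacked_cols = np.concatenate([top, side], axis=1)     # (2, 2) + (2, 1) -> (2, 3)
flat = top.ravel()                                     # flatten one array to 1D

print(stacked_rows)
print(stacked_cols)
print(flat)
```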


- `np.min`: It calculates the minimum element in an array.

In [63]:
arr2_min = np.min(arr2)
print(arr2_min)

1


- `np.max`: It calculates the maximum element in an array.

In [64]:
arr2_max = np.max(arr2)
print(arr2_max)

8


- `np.mean`: It calculates the mean.

In [65]:
# calculate the mean
arr2_mean = np.mean(arr2)
print(arr2_mean)

4.5


Let's pause on this function for a moment. You can also calculate the mean along one particular axis:

In [66]:
help(np.mean)

Help on function mean in module numpy:

mean(a, axis=None, dtype=None, out=None, keepdims=<no value>, *, where=<no value>)
    Compute the arithmetic mean along the specified axis.
    
    Returns the average of the array elements.  The average is taken over
    the flattened array by default, otherwise over the specified axis.
    `float64` intermediate and return values are used for integer inputs.
    
    Parameters
    ----------
    a : array_like
        Array containing numbers whose mean is desired. If `a` is not an
        array, a conversion is attempted.
    axis : None or int or tuple of ints, optional
        Axis or axes along which the means are computed. The default is to
        compute the mean of the flattened array.
    
        .. versionadded:: 1.7.0
    
        If this is a tuple of ints, a mean is performed over multiple axes,
        instead of a single axis or all the axes as before.
    dtype : data-type, optional
        Type to use in computing the mean.

In [67]:
arr2.shape

(2, 4)

In [68]:
# This takes the mean along axis 0 (down the rows), giving one mean per column
np.mean(arr2, 0)

array([3., 4., 5., 6.])

In [69]:
# This takes the mean along axis 1 (across the columns), giving one mean per row
np.mean(arr2, 1)

array([2.5, 6.5])
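
A useful way to remember the `axis` argument is that the axis you name is the one that disappears from the result's shape. The sketch below, using an illustrative array `grid`, also shows `keepdims=True`, which keeps the reduced axis with length 1 so the result can be combined with the original array.

```python
import numpy as np

grid = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])        # shape (2, 4)

col_means = grid.mean(axis=0)                        # axis 0 collapses -> shape (4,)
row_means = grid.mean(axis=1)                        # axis 1 collapses -> shape (2,)
row_means_2d = grid.mean(axis=1, keepdims=True)      # shape (2, 1)

# Subtract each row's mean from the elements of that row
centered = grid - row_means_2d

print(col_means.shape, row_means.shape, row_means_2d.shape)
print(centered)
```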

- `np.std`: It calculates the standard deviation.

In [70]:
arr2_std = np.std(arr2)
print(arr2_std)

2.29128784747792
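
By default, `np.std` computes the population standard deviation (it divides by N). If you need the sample standard deviation (dividing by N - 1), you can pass `ddof=1`. A short sketch with an illustrative array `grid`:

```python
import numpy as np

grid = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

population_std = np.std(grid)        # default ddof=0: divide by N
sample_std = np.std(grid, ddof=1)    # ddof=1: divide by N - 1

print(population_std)
print(sample_std)
```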


- NumPy also provides many universal functions, or `ufuncs`, which perform elementwise operations on data in ndarrays. You can think of them as **fast, vectorized wrappers for simple functions** that take one or more scalar values and return one or more scalar results. Many ufuncs perform simple elementwise transformations. For example:

In [71]:
# This takes the sin
print(np.sin(arr2))

# This takes the cos
print(np.cos(arr2))

# This computes the sqrt
print(np.sqrt(arr2))

# This computes the exponential (e raised to each element)
print(np.exp(arr2))

[[ 0.84147098  0.90929743  0.14112001 -0.7568025 ]
 [-0.95892427 -0.2794155   0.6569866   0.98935825]]
[[ 0.54030231 -0.41614684 -0.9899925  -0.65364362]
 [ 0.28366219  0.96017029  0.75390225 -0.14550003]]
[[1.         1.41421356 1.73205081 2.        ]
 [2.23606798 2.44948974 2.64575131 2.82842712]]
[[2.71828183e+00 7.38905610e+00 2.00855369e+01 5.45981500e+01]
 [1.48413159e+02 4.03428793e+02 1.09663316e+03 2.98095799e+03]]


Note what happens if you apply `np.sqrt` to negative values:

In [72]:
np.sqrt(np.array([4, -3, 16, 9, -5]))


array([ 2., nan,  4.,  3., nan])

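`nan` ("not a number") is a special floating-point value that NumPy uses to mark undefined results, such as the square root of a negative real number. NumPy also emits a `RuntimeWarning` when this happens.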
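
Because `nan` propagates through most calculations, it helps to know how to detect and skip it. The following sketch uses `np.isnan` and `np.nanmean`; the `np.errstate` context manager is used here only to silence the warning locally.

```python
import numpy as np

# Suppress the RuntimeWarning while taking the square root
with np.errstate(invalid="ignore"):
    roots = np.sqrt(np.array([4, -3, 16, 9, -5]))

nan_mask = np.isnan(roots)           # nan is not equal to itself, so use np.isnan
plain_mean = np.mean(roots)          # a single nan makes the whole mean nan
valid_mean = np.nanmean(roots)       # ignores nan entries: mean of 2, 4, 3
valid_roots = roots[~nan_mask]       # keep only the valid values

print(nan_mask)
print(plain_mean, valid_mean)
print(valid_roots)
```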

## Practice exercises

```{exercise}
:label: numpy6

Given the array below:

1- Extract the subarray containing the first two rows and the last three columns.

2- Reverse the third row.

3- Extract every second element from the entire array (flatten it into a 1D array).

```

In [73]:
arr_exec1 = np.array([[10, 20, 30, 40, 50],
                      [60, 70, 80, 90, 100],
                      [110, 120, 130, 140, 150],
                      [160, 170, 180, 190, 200]])

In [1]:
# Your answers from here

```{exercise}
:label: numpy7

Given the array below:

1- Select the elements at positions (0,1), (2,3), and (3,0) using fancy indexing.

2- Extract the second and fourth rows using fancy indexing.

3- Extract the first, third, and fourth elements from the first column using fancy indexing.

```

In [75]:
arr_exec2 =  np.array([[5, 10, 15, 20],
                       [25, 30, 35, 40],
                       [45, 50, 55, 60],
                       [65, 70, 75, 80]])

In [76]:
# Your answers from here

```{exercise}
:label: numpy8

Considering the variable `scores` below as an array of student test scores, and `classes` as the class each student belongs to, do the following:

1- Compute the mean score for each student (row-wise).

2- Compute the mean score for each test (column-wise).

3- Compute the overall standard deviation of all scores.

4- Subtract the mean of each test (column-wise) from the scores.

5- Use Boolean indexing to extract the scores of students from "Class A" and compute the mean score for "Class A".

6- Use Boolean indexing to find the students in "Class B" who scored above 85 in their first test.  

```

In [77]:
scores = np.array([[75, 80, 85, 90, 95],
                   [88, 92, 78, 85, 91],
                   [60, 75, 70, 65, 80],
                   [90, 85, 88, 92, 94],
                   [55, 60, 65, 70, 75],
                   [95, 100, 90, 85, 92],
                   [85, 89, 90, 87, 86],
                   [78, 80, 85, 82, 81],
                   [65, 70, 68, 72, 74],
                   [92, 94, 90, 88, 95]])

classes = np.array(['Class A', 'Class B', 'Class A', 'Class C', 
                    'Class A', 'Class C', 'Class B', 'Class B', 
                    'Class A', 'Class C'])


In [78]:
# Your answers from here