Introduction to the NumPy Library for Data Science 📊

NumPy

NumPy stands for Numerical Python. It is Python's core library for numerical computing and linear algebra, and it gives us multidimensional (multi-dim) arrays.

Why do we need NumPy?

Memory: NumPy arrays require less memory than Python lists.
Speed: operations on NumPy arrays are faster than the same operations on Python lists.
Convenience: NumPy is more convenient and offers a wider range of numerical functionality.
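
A rough way to check the memory claim is sketched below. It assumes CPython and a 64-bit integer dtype; exact byte counts for the list vary by platform and Python version, so only the array size is stated.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)

# A list stores pointers to separate int objects, so count both parts.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(v) for v in py_list)

# An ndarray stores the raw values in one contiguous buffer.
array_bytes = np_arr.nbytes  # 1000 elements * 8 bytes each = 8000

print(list_bytes, array_bytes)
```

The list total is several times larger because every element is a full Python object plus a pointer to it.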

Why are NumPy arrays fast?

In NumPy, array operations run in compiled code over contiguous chunks of memory rather than one Python-level element at a time. For example, when adding the corresponding elements of two arrays, the addition is applied to chunks of elements and not to one element at a time.

What is vectorization? "Vectorization" (simplified) is the process of rewriting a loop so that instead of processing a single element of an array N times, it processes (say) 4 elements of the array simultaneously N/4 times.
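
As a small illustration, the same addition can be written as an explicit Python loop and as a single array expression. Whether the compiled loop actually uses SIMD instructions depends on the NumPy build and the CPU, so this sketch only shows that the two forms give the same result.

```python
import numpy as np

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

# Element-wise loop in pure Python: one addition per iteration.
loop_sum = [x + y for x, y in zip(a, b)]

# Vectorized form: one expression, with the loop running in compiled code.
vec_sum = np.array(a) + np.array(b)

print(loop_sum)  # [11, 22, 33, 44]
print(vec_sum)   # [11 22 33 44]
```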

How to create NumPy arrays?

import numpy as np # np becomes alias for numpy

a = [1, 2, 3]
b = np.array(a)

print(b)
print(type(b))

[1 2 3]
<class 'numpy.ndarray'>

2D Array

b = np.ones((2, 4), dtype = int)
b

array([[1, 1, 1, 1],
       [1, 1, 1, 1]])

numpy.arange

numpy.arange([start, ]stop, [step, ]dtype=None)

Return evenly spaced values within a given interval. Values are generated within the half-open interval [start, stop).

For integer arguments the function is equivalent to the Python built-in range function, but returns an ndarray rather than a list. When using a non-integer step, such as 0.1, the results will often not be consistent. It is better to use numpy.linspace for these cases.

Parameters

start : number, optional

Start of interval. The interval includes this value. The default start value is 0.

stop : number

End of interval. The interval does not include this value, except in some cases where step is not an integer and floating point round-off affects the length of out.

step : number, optional

Spacing between values. For any output out, this is the distance between two adjacent values, out[i+1] - out[i]. The default step size is 1. If step is specified as a positional argument, start must also be given.

dtype : dtype, optional

The type of the output array. If dtype is not given, infer the data type from the other input arguments.

Returns

arange : ndarray

Array of evenly spaced values.

For floating point arguments, the length of the result is ceil((stop - start)/step). Because of floating point round-off, this rule may result in the last element of out being greater than stop.

b = np.arange(2, 20, 2)
b

array([ 2,  4,  6,  8, 10, 12, 14, 16, 18])
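
The warning about non-integer steps can be tried out with a short sketch. The float case prints its length instead of assuming it, since the count depends on round-off; the values 1.0, 1.3 and 0.1 are arbitrary choices for illustration.

```python
import numpy as np

# Integer arguments behave like range(): stop is excluded.
ints = np.arange(0, 10, 3)
print(ints)  # [0 3 6 9]

# With a float step the element count depends on round-off,
# so the length is printed here rather than assumed.
floats = np.arange(1.0, 1.3, 0.1)
print(len(floats), floats)

# linspace fixes the number of samples instead of the step.
samples = np.linspace(1.0, 1.3, 4)
print(samples)
```

Fixing the sample count with numpy.linspace avoids having to reason about how the step accumulates.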

numpy.linspace

numpy.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None, axis=0)

Return evenly spaced numbers over a specified interval.
Returns num evenly spaced samples, calculated over the interval [start, stop].

The endpoint of the interval can optionally be excluded.

Parameters

start : array_like

The starting value of the sequence.

stop : array_like

The end value of the sequence, unless endpoint is set to False. In that case, the sequence consists of all but the last of num + 1 evenly spaced samples, so that stop is excluded. Note that the step size changes when endpoint is False.

num : int, optional

Number of samples to generate. Default is 50. Must be non-negative.

endpoint : bool, optional

If True, stop is the last sample. Otherwise, it is not included. Default is True.

retstep : bool, optional

If True, return (samples, step), where step is the spacing between samples.

dtype : dtype, optional

The type of the output array. If dtype is not given, infer the data type from the other input arguments.

axis : int, optional

The axis in the result to store the samples. Relevant only if start or stop are array-like. By default (0), the samples will be along a new axis inserted at the beginning. Use -1 to get an axis at the end.

Returns

samples : ndarray

There are num equally spaced samples in the closed interval [start, stop] or the half-open interval [start, stop) (depending on whether endpoint is True or False).

step : float, optional

Only returned if retstep is True. Size of spacing between samples.

b = np.linspace(2, 10, 5, dtype = int, endpoint = False)
b

array([2, 3, 5, 6, 8])
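
The effect of endpoint on the step size is easier to see with retstep=True. A minimal sketch, using the interval 0 to 1 as an arbitrary example:

```python
import numpy as np

# endpoint=True (default): 5 samples from 0 to 1 inclusive.
closed, closed_step = np.linspace(0, 1, 5, retstep=True)
print(closed, closed_step)  # [0.   0.25 0.5  0.75 1.  ] 0.25

# endpoint=False: still 5 samples, but stop is excluded,
# so the interval is divided into 5 parts and the step shrinks.
half_open, open_step = np.linspace(0, 1, 5, endpoint=False, retstep=True)
print(half_open, open_step)
```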

Indexing and Slicing in NumPy Arrays

A NumPy array (ndarray) is described by four attributes:

data => pointer to the first byte of the array's memory buffer
shape => size of the array along each dimension
dtype => data type of the elements in the array
strides => number of bytes to skip to reach the next element along each dimension
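
These attributes can be inspected directly. The sketch below assumes an explicit int64 dtype so that each element takes 8 bytes; with another dtype the stride values would differ.

```python
import numpy as np

arr = np.arange(12, dtype=np.int64).reshape(3, 4)

print(arr.shape)    # (3, 4)
print(arr.dtype)    # int64
print(arr.strides)  # (32, 8): 32 bytes to the next row, 8 to the next column

# Transposing swaps shape and strides without copying the data buffer.
t = arr.T
print(t.shape, t.strides)        # (4, 3) (8, 32)
print(np.shares_memory(arr, t))  # True
```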

li = [1, 2, 3, 4, 5]
arr = np.array(li)
print(li[3])
print(arr[3])

print(li[1:4])
print(arr[1:4])

4
4
[2, 3, 4]
[2 3 4]
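
Although the printed results match, the two slices behave differently when modified. A short sketch of that difference, plus indexing on a 2D array (the name grid is only for illustration):

```python
import numpy as np

li = [1, 2, 3, 4, 5]
arr = np.array(li)

# Slicing a list copies the selected elements.
li_slice = li[1:4]
li_slice[0] = 99
print(li)   # [1, 2, 3, 4, 5]

# Basic slicing of an ndarray returns a view onto the same buffer.
arr_slice = arr[1:4]
arr_slice[0] = 99
print(arr)  # [ 1 99  3  4  5]

# 2D arrays take one index or slice per axis.
grid = np.arange(6).reshape(2, 3)
print(grid[1, 2])  # 5
print(grid[:, 1])  # [1 4]
```

Calling arr[1:4].copy() gives an independent array when a view is not wanted.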

Broadcasting

The term broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations.

Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes.

Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations.

There are, however, cases where broadcasting is a bad idea because it leads to inefficient use of memory that slows computation.

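Shapes are compared dimension by dimension, starting from the trailing (rightmost) one. Two dimensions are compatible when one of two rules holds:

They are equal (e.g. A.shape -> (3, 2) and B.shape -> (3, 2))
One of them is 1 (e.g. A.shape -> (3, 3) and B.shape -> (3, 1)); a missing leading dimension, as in B.shape -> (3,), is treated as 1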
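
The rules can be checked on shapes alone, without building any arrays. This sketch assumes NumPy 1.20 or later for np.broadcast_shapes; the helper compatible is a hypothetical name introduced here for illustration.

```python
import numpy as np

def compatible(shape_a, shape_b):
    """Return the broadcast shape, or None if the shapes do not broadcast."""
    try:
        return np.broadcast_shapes(shape_a, shape_b)
    except ValueError:
        return None

print(compatible((3, 2), (3, 2)))  # (3, 2): all dimensions equal
print(compatible((3, 3), (3, 1)))  # (3, 3): the size-1 dimension is stretched
print(compatible((3, 3), (3,)))    # (3, 3): missing leading dimension treated as 1
print(compatible((3, 2), (2, 3)))  # None: 2 vs 3 and 3 vs 2 both mismatch
```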

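x = np.random.randint(1, 10, (3, 2))  # random integers in [1, 10), shape (3, 2)
y = np.random.randint(1, 10, (2, 3))  # shape (2, 3), not compatible with x
y = np.transpose(y)                   # now shape (3, 2), matching x
print(x)
print(y)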

[[2 4]
 [5 4]
 [8 5]]
[[2 1]
 [8 4]
 [7 9]]
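
To see broadcasting in an actual arithmetic operation, the sketch below reuses the x values printed above as a fixed array and combines it with a row and a column. The names row and col and their values are assumptions chosen for illustration.

```python
import numpy as np

x = np.array([[2, 4], [5, 4], [8, 5]])  # shape (3, 2)
row = np.array([10, 20])                # shape (2,), treated as (1, 2)
col = np.array([[1], [2], [3]])         # shape (3, 1)

# row is repeated for each of the 3 rows of x.
added = x + row
print(added)

# col is repeated across the 2 columns of x.
scaled = x * col
print(scaled)
```

In both cases the result has shape (3, 2) and neither row nor col is physically copied to that shape first.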

NumPy Practice Problems