Introduction to Numpy Library for Data Science 📊

Numpy

  • NumPy stands for Numerical Python. It is Python's core library for numerical computing and linear algebra, and it gives us multidimensional (n-dimensional) arrays.

Why do we need Numpy?

  • Memory requirement: NumPy arrays require less memory than Python lists holding the same values.

  • Operations on NumPy arrays are faster than the same operations on normal lists.

  • Numpy is more convenient and has wider functionality.
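
As a rough illustration of the memory point, the sketch below compares a Python list with a NumPy array holding the same 1000 integers. The names `py_list`, `np_arr`, `list_bytes` and `array_bytes` are made up for this example, and the exact list size depends on the Python build, so treat the list figure as approximate.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)

# A list stores pointers to separate int objects, so count both the
# list container and the integer objects it refers to.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(v) for v in py_list)

# An ndarray stores the raw values back to back: n * itemsize bytes.
array_bytes = np_arr.nbytes

print("list :", list_bytes, "bytes")
print("array:", array_bytes, "bytes")  # 8000 bytes for 1000 int64 values
```

The array figure leaves out the small fixed header of the ndarray object itself, which does not grow with the number of elements.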

Why are Numpy arrays fast?

NumPy array operations run in compiled code that works through the data in chunks instead of handling one element at a time in a Python loop. For example, when adding the corresponding elements of two arrays, the addition is applied to chunks of elements rather than to a single element at a time.

What is vectorization? "Vectorization" (simplified) is the process of rewriting a loop so that instead of processing a single element of an array N times, it processes (say) 4 elements of the array simultaneously N/4 times.
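
In everyday NumPy code, vectorization usually means replacing an explicit Python loop with a single array expression. The sketch below adds two small arrays both ways and checks that the results match. The array names are illustrative, and whether the underlying loop really handles several elements per step depends on the NumPy build and the hardware.

```python
import numpy as np

a = np.arange(8)        # [0 1 2 3 4 5 6 7]
b = np.arange(8) * 10   # [0 10 20 30 40 50 60 70]

# Explicit Python loop: one element per iteration.
looped = np.empty_like(a)
for i in range(len(a)):
    looped[i] = a[i] + b[i]

# Vectorized form: one expression, the loop runs inside NumPy's compiled code.
vectorized = a + b

print(vectorized)                          # [ 0 11 22 33 44 55 66 77]
print(np.array_equal(looped, vectorized))  # True
```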

How to create Numpy arrays?

import numpy as np # np becomes alias for numpy

a = [1, 2, 3]
b = np.array(a)

print(b)
print(type(b))
[1 2 3]
<class 'numpy.ndarray'>

2D Array

b = np.ones((2, 4), dtype = int)
b
array([[1, 1, 1, 1],
       [1, 1, 1, 1]])
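
A few other common ways to build a 2D array, shown as a short sketch. The names `nested`, `m`, `z` and `f` are placeholders chosen for this example.

```python
import numpy as np

# From a nested list: each inner list becomes one row.
nested = [[1, 2, 3], [4, 5, 6]]
m = np.array(nested)
print(m.shape, m.ndim)  # (2, 3) 2

# Filled with zeros (the default dtype is float64).
z = np.zeros((2, 3))
print(z.dtype)  # float64

# Filled with a constant value.
f = np.full((2, 3), 7)
print(f)
```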

numpy.arange

numpy.arange([start, ]stop, [step, ]dtype=None)

Return evenly spaced values within a given interval. Values are generated within the half-open interval [start, stop).

For integer arguments the function is equivalent to the Python built-in range function, but returns an ndarray rather than a range object. When using a non-integer step, such as 0.1, the results will often not be consistent because of floating-point round-off. It is better to use numpy.linspace for these cases.

Parameters

start : number, optional

Start of interval. The interval includes this value. The default start value is 0.

stop : number

End of interval. The interval does not include this value, except in some cases where step is not an integer and floating point round-off affects the length of out.

step : number, optional

Spacing between values. For any output out, this is the distance between two adjacent values, out[i+1] - out[i]. The default step size is 1. If step is specified as a positional argument, start must also be given.

dtype : dtype, optional

The type of the output array. If dtype is not given, infer the data type from the other input arguments.

Returns

arange : ndarray

Array of evenly spaced values.

For floating point arguments, the length of the result is ceil((stop - start)/step). Because of floating point round-off, this rule may result in the last element of out being greater than or equal to stop.

b = np.arange(2, 20, 2)
b
array([ 2,  4,  6,  8, 10, 12, 14, 16, 18])
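
To see why a non-integer step can be surprising, the sketch below applies the length rule ceil((stop - start)/step) to a small float range and compares it with numpy.linspace. The values 1.0, 1.3 and 0.1 are an arbitrary choice for illustration, and the printed length may differ from what you would expect by counting on paper.

```python
import math
import numpy as np

start, stop, step = 1.0, 1.3, 0.1

# arange derives the length from ceil((stop - start) / step), so
# round-off in that division can add an unexpected extra element.
stepped = np.arange(start, stop, step)
rule_length = math.ceil((stop - start) / step)
print(len(stepped), rule_length)
print(stepped)

# linspace takes the number of samples directly and, with the default
# endpoint=True, ends exactly at stop.
sampled = np.linspace(start, stop, 4)
print(sampled)
```

With linspace the number of samples is stated up front, so there is no length to be thrown off by round-off.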

numpy.linspace

numpy.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None, axis=0)

  • Return evenly spaced numbers over a specified interval.

  • Returns num evenly spaced samples, calculated over the interval [start, stop].

  • The endpoint of the interval can optionally be excluded.

Parameters

start : array_like

The starting value of the sequence.

stop : array_like

The end value of the sequence, unless endpoint is set to False. In that case, the sequence consists of all but the last of num + 1 evenly spaced samples, so that stop is excluded. Note that the step size changes when endpoint is False.

num : int, optional

Number of samples to generate. Default is 50. Must be non-negative.

endpoint : bool, optional

If True, stop is the last sample. Otherwise, it is not included. Default is True.

retstep : bool, optional

If True, return (samples, step), where step is the spacing between samples.

dtype : dtype, optional

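The type of the output array. If dtype is not given, the data type is inferred from start and stop. The inferred dtype will never be an integer; float is chosen even if the arguments would produce an array of integers.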

axis : int, optional

The axis in the result to store the samples. Relevant only if start or stop are array-like. By default (0), the samples will be along a new axis inserted at the beginning. Use -1 to get an axis at the end.

Returns

samples : ndarray

There are num equally spaced samples in the closed interval [start, stop] or the half-open interval [start, stop) (depending on whether endpoint is True or False).

step : float, optional

Only returned if retstep is True. Size of spacing between samples.

b = np.linspace(2, 10, 5, dtype = int, endpoint = False)
b
array([2, 3, 5, 6, 8])
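
The sketch below shows how endpoint and retstep interact. It also reproduces the samples from the example above without dtype = int, which makes it clear that the integer result comes from dropping the fractional part of 2, 3.6, 5.2, 6.8 and 8.4. Variable names are illustrative.

```python
import numpy as np

# Closed interval [0, 1]: 5 samples, step 0.25.
closed, closed_step = np.linspace(0, 1, 5, retstep=True)
print(closed, closed_step)        # [0.   0.25 0.5  0.75 1.  ] 0.25

# Half-open interval [0, 1): same num, but the step shrinks to 0.2.
half_open, half_open_step = np.linspace(0, 1, 5, endpoint=False, retstep=True)
print(half_open, half_open_step)

# The float samples behind the integer example: step is (10 - 2) / 5 = 1.6.
raw = np.linspace(2, 10, 5, endpoint=False)
print(raw)
```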

Indexing and Slicing in Numpy Array

A NumPy array is a block of memory described by 4 attributes.

  1. data => reference to the first byte of the array's memory buffer

  2. shape => the size of the array along each dimension

  3. dtype => the data type of the elements in the array

  4. strides => the number of bytes to skip in each dimension to reach the next element
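
These attributes can be inspected directly. The sketch below uses a 3 x 4 array of 8-byte integers (dtype fixed to int64 so the stride values are the same on every platform) and shows that transposing only swaps shape and strides while the data buffer is shared. The names `arr` and `t` are placeholders.

```python
import numpy as np

arr = np.arange(12, dtype=np.int64).reshape(3, 4)

print(arr.shape)    # (3, 4)
print(arr.dtype)    # int64
print(arr.strides)  # (32, 8): 32 bytes to the next row, 8 to the next column

# The transpose reuses the same buffer with shape and strides swapped.
t = arr.T
print(t.shape)      # (4, 3)
print(t.strides)    # (8, 32)
print(np.shares_memory(arr, t))  # True
```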

li = [1, 2, 3, 4, 5]
arr = np.array(li)
print(li[3])
print(arr[3])

print(li[1:4])
print(arr[1:4])
4
4
[2, 3, 4]
[2 3 4]
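
Two things go beyond what a list can do: indexing along several dimensions at once, and the fact that a basic slice of an array is a view on the same data rather than a copy. A small sketch with made-up names:

```python
import numpy as np

grid = np.arange(1, 13).reshape(3, 4)
# [[ 1  2  3  4]
#  [ 5  6  7  8]
#  [ 9 10 11 12]]

element = grid[1, 2]    # row 1, column 2 -> 7
column = grid[:, 1]     # every row, column 1 -> [ 2  6 10]
block = grid[0:2, 1:3]  # rows 0-1, columns 1-2 -> [[2 3] [6 7]]

# Slicing a list copies the elements, so the original list is untouched.
li = [1, 2, 3, 4, 5]
li_slice = li[1:4]
li_slice[0] = 99
print(li)   # [1, 2, 3, 4, 5]

# Slicing an array returns a view, so the write shows up in the original.
arr = np.array(li)
arr_slice = arr[1:4]
arr_slice[0] = 99
print(arr)  # [ 1 99  3  4  5]
```

Call `.copy()` on the slice when an independent array is needed.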

Broadcasting

  • The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations.
  • Subject to certain constraints, the smaller array is "broadcast" across the larger array so that they have compatible shapes.
  • Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations.
  • There are, however, cases where broadcasting is a bad idea because it leads to inefficient use of memory that slows computation.

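NumPy compares the two shapes dimension by dimension, starting from the last (rightmost) one. Two dimensions are compatible when one of these two rules holds:

  • They are equal (e.g. A.shape -> (3, 2) and B.shape -> (3, 2))

  • One of them is 1 (e.g. A.shape -> (3, 3) and B.shape -> (3,), where B is treated as shape (1, 3))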
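
The compatibility check can be tried out on a few shapes. In this sketch NumPy lines the shapes up from the rightmost dimension, so a shape of (3,) behaves like (1, 3). The arrays `A`, `B`, `C`, `D` and the flag `compatible` are illustrative names.

```python
import numpy as np

A = np.ones((3, 3))
B = np.arange(3)                # shape (3,), treated as (1, 3)
C = np.arange(3).reshape(3, 1)  # shape (3, 1)

print((A + B).shape)  # (3, 3): B is repeated for every row of A
print((B + C).shape)  # (3, 3): both operands are stretched

# Shapes (3, 2) and (3,) clash in the last dimension (2 vs 3).
D = np.ones((3, 2))
try:
    D + B
    compatible = True
except ValueError:
    compatible = False
print(compatible)  # False
```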

x = np.random.randint(1, 10, (3, 2))
y = np.random.randint(1, 10, (2, 3))
y = np.transpose(y)
print(x)
print(y)
[[2 4]
 [5 4]
 [8 5]]
[[2 1]
 [8 4]
 [7 9]]
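
Since x and y now share the shape (3, 2), they can be combined element by element. The sketch below hard-codes the values printed above, because the random draw will differ from run to run, and also multiplies x by a length-2 row, which is broadcast across all three rows. The names `total`, `row` and `scaled` are made up for this example.

```python
import numpy as np

# Values copied from the printed output above.
x = np.array([[2, 4], [5, 4], [8, 5]])
y = np.array([[2, 1], [8, 4], [7, 9]])

# Equal shapes: plain element-wise addition.
total = x + y
print(total)
# [[ 4  5]
#  [13  8]
#  [15 14]]

# Shape (2,) against (3, 2): the row is applied to every row of x.
row = np.array([10, 1])
scaled = x * row
print(scaled)
# [[20  4]
#  [50  4]
#  [80  5]]
```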

Numpy Practice Problems
