Introduction to the numpy package in Python

by K. Yue

1. Overview of the numpy package

  1. NumPy (Numerical Python) is a popular, feature-rich Python library for numerical computation, especially operations on arrays.
  2. Resources:
    1. Main site: https://numpy.org/
    2. numpy at w3schools: https://www.w3schools.com/python/numpy/default.asp
    3. Source code on GitHub: https://github.com/numpy/numpy
  3. Installation: at the command-line prompt, run "pip install numpy".
  4. The main data structure in numpy is "ndarray": the class of n-dimensional arrays.
    1. One-dimensional arrays: vectors.
    2. Two-dimensional arrays: matrices.
    3. n-dimensional arrays (n >= 3): tensors.
    4. The array() function in numpy creates objects of the class ndarray.
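
As a minimal sketch of the dimensions listed above, array() can build a vector, a matrix, and a three-dimensional array from nested Python lists. The variable names are illustrative only and do not come from these notes. The ndim and shape attributes report the number of dimensions and the size along each dimension.

```python
import numpy as np

# Nested Python lists determine the number of dimensions.
vector = np.array([1, 2, 3])                               # one dimension
matrix = np.array([[1, 2, 3], [4, 5, 6]])                  # two dimensions
tensor = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])    # three dimensions

# Every result is an object of the same class, ndarray.
for name, arr in [("vector", vector), ("matrix", matrix), ("tensor", tensor)]:
    print(f"{name}: type={type(arr).__name__}, ndim={arr.ndim}, shape={arr.shape}")
```

The shapes here are (3,), (2, 3) and (2, 2, 2) respectively.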

1.1 Python's lists vs numpy's arrays

Python's lists (review):

  1. Python's lists are mutable.
  2. Python's lists are of variable size (elements can be appended or removed).
  3. Python's lists are heterogeneous (an element of a list can be any object).
  4. Python's lists are very flexible but slow.

Numpy's ndarray:

  1. ndarray objects are mutable: their elements can be changed in place.
  2. ndarray objects are of fixed size. (When the size is changed, for example by np.append(), a new ndarray object is created.)
  3. ndarray objects are homogeneous (all elements have the same type).
  4. ndarray operations are much faster than the equivalent operations on Python's lists, often by 5x to 100x, depending on the kind of operation.
  5. Numpy provides rich functionality, including vectorization (operations involving whole arrays).
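
The contrast between the two lists above can be sketched as follows. This is an illustration under simple assumptions (small hand-made data, illustrative variable names), not a complete account of either type. Note that assigning to an element modifies an array in place, while a size-changing operation such as np.append() returns a new array.

```python
import numpy as np

# A Python list can mix types and grow in place.
items = [1, "two", 3.0]
items.append(4)

# An ndarray has a single dtype; mixed int/float input is upcast to float.
arr = np.array([1, 2, 3.5])
print(arr.dtype)               # float64

# Elements can be changed in place.
arr[0] = 10.0

# The size is fixed: np.append returns a new array and leaves arr unchanged.
longer = np.append(arr, 4.0)
print(arr.size, longer.size)   # 3 4

# Vectorization: one expression applies to every element, with no Python loop.
doubled = arr * 2
print(doubled)
```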

2. Examples

numpy_stat_1.py.txt: prints basic statistics of the data in a file using numpy. Run and annotate the program.

import numpy as np
import os # For checking if the file exists

def calculate_statistics(filename):
    """
    Reads a text file of integers using NumPy and prints basic statistics.

    Args:
        filename (str): The name of the input text file.
    """
    if not os.path.exists(filename):
        print(f"Error: The file '{filename}' was not found.")
        print(f"Please ensure '{filename}' is in the same directory as this script.")
        return

    try:
        # Use np.loadtxt to read data from the file.
        # By default it splits values on whitespace; pass delimiter="," for comma-separated data.
        data = np.loadtxt(filename, dtype=int)
       
        # Flatten the array in case the data is read as a 2D array, so that
        # all statistics are calculated over a single 1D sequence of numbers.
        # Uncomment the two print statements to see the effect.
        # print(data)
        flat_data = data.flatten()
        # print(flat_data)
   
        print("-" * (len(filename) + 25))
        print(f"--- Statistics for '{filename}' ---")
        print(f"---    Data points found: {flat_data.size}")
        print(f"---    Data type: {flat_data.dtype}")
        print("-" * (len(filename) + 25))

        # Calculate and print basic statistics
        print(f"Mean (Average):   {np.mean(flat_data):.2f}")
        print(f"Median (Middle):  {np.median(flat_data):.2f}")
        print(f"Minimum Value:    {np.min(flat_data)}")
        print(f"Maximum Value:    {np.max(flat_data)}")
        print(f"Standard Deviation: {np.std(flat_data):.2f}")
        print(f"Sum of all values: {np.sum(flat_data)}")

    except ValueError as e:
        print(f"Error reading file: {e}")
        print("Ensure the file contains only integers separated by whitespace.")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

if __name__ == "__main__":
    # Specify the name of your data file
    input_file = input("Please input the data file name: ")
    calculate_statistics(input_file)

For the input file numpy_stat_1_data_1.txt:

1 2 3 2 5
2 1 4 5 9
7 8 10 4 3
9 2 3 4 6


Running the program:

C:\...>python numpy_stat_1.py
Please input the data file name: numpy_stat_1_data_1.txt
------------------------------------------------
--- Statistics for 'numpy_stat_1_data_1.txt' ---
---    Data points found: 20
---    Data type: int64
------------------------------------------------
Mean (Average):   4.50
Median (Middle):  4.00
Minimum Value:    1
Maximum Value:    10
Standard Deviation: 2.73
Sum of all values: 90
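
As a hedged variation on the program above, the sketch below reads the same values in comma-separated form and computes statistics per row and per column instead of over the flattened data. The comma-separated text and the use of io.StringIO in place of a file on disk are assumptions made to keep the example self-contained; they are not part of the original program.

```python
import io
import numpy as np

# Same values as the sample file, but comma-separated (hypothetical variant).
csv_text = "1,2,3,2,5\n2,1,4,5,9\n7,8,10,4,3\n9,2,3,4,6\n"

# delimiter="," is needed here; by default loadtxt splits on whitespace only.
data = np.loadtxt(io.StringIO(csv_text), dtype=int, delimiter=",")
print(data.shape)              # (4, 5): 4 rows, 5 columns

# axis=1 aggregates across each row, axis=0 down each column.
print("Row sums:    ", data.sum(axis=1))
print("Column means:", data.mean(axis=0))

# With no axis argument the whole array is used, as with the flattened data.
print("Total:       ", data.sum())   # 90
```

Leaving out the axis argument gives the same result as flattening first, which is why the total matches the sum printed by the original program.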

 

3. Performance Matters (a lot)

Example:

numpy_array_vs_lists.py: compares the running times of three methods for creating a list/array by adding two other lists/arrays element by element. Run and annotate the program. Methods:

  1. Using Python's lists and an explicit loop.
  2. Using Python's lists and a list comprehension with the zip function.
  3. Using numpy's arrays and vectorized operations.

import numpy as np
import time

NUM_ELEMENTS = 10000000

a = list(range(NUM_ELEMENTS))
b = list(range(NUM_ELEMENTS))
c = []

start_time = time.process_time()
for i in range(len(a)):
    c.append(a[i] + b[i])
end_time = time.process_time()

print(f"Time taken for adding two lists/arrays of {NUM_ELEMENTS} elements to create a third.")
print()

print(f"Time taken for Python's lists with an explicit loop: {end_time - start_time} seconds")

start_time = time.process_time()
c = [x + y for x, y in zip(a, b)]
end_time = time.process_time()

print(f"Time taken for Python's lists with no explicit loop but using zip(): {end_time - start_time} seconds")

a_np = np.arange(NUM_ELEMENTS)
b_np = np.arange(NUM_ELEMENTS)

start_time = time.process_time()
c_np = a_np + b_np # Single vectorized operation
end_time = time.process_time()

print(f"Time taken for ndarray with vectorization: {end_time - start_time} seconds")

A run of the program (times may differ from run to run):

C:\...>python numpy_array_vs_lists.py
Time taken for adding two lists/arrays of 10000000 elements to create a third.

Time taken for Python's lists with an explicit loop: 2.03125 seconds
Time taken for Python's lists with no explicit loop but using zip(): 0.953125 seconds
Time taken for ndarray with vectorization: 0.03125 seconds
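
The times reported above are all multiples of 1/64 second, which suggests that time.process_time() has a coarse resolution on that machine. As an alternative sketch, the standard timeit module can repeat each measurement and keep the best run. The smaller NUM_ELEMENTS, the REPEATS value, and the helper function names below are assumptions chosen to keep the example quick; they are not part of the original program.

```python
import timeit
import numpy as np

NUM_ELEMENTS = 1_000_000   # smaller than the original run to keep the sketch quick
REPEATS = 5                # assumed repeat count; adjust as needed

a = list(range(NUM_ELEMENTS))
b = list(range(NUM_ELEMENTS))
a_np = np.arange(NUM_ELEMENTS)
b_np = np.arange(NUM_ELEMENTS)

def add_lists():
    """Element-wise addition with a list comprehension and zip()."""
    return [x + y for x, y in zip(a, b)]

def add_arrays():
    """Element-wise addition with a single vectorized operation."""
    return a_np + b_np

# Both methods should produce the same values.
assert add_lists() == add_arrays().tolist()

# Take the best of several runs to reduce noise from other processes.
list_time = min(timeit.repeat(add_lists, number=1, repeat=REPEATS))
array_time = min(timeit.repeat(add_arrays, number=1, repeat=REPEATS))

print(f"List comprehension with zip(): {list_time:.6f} seconds")
print(f"ndarray vectorized addition:   {array_time:.6f} seconds")
print(f"Speedup: about {list_time / array_time:.0f}x")
```

The measured speedup depends on the machine, the Python and NumPy versions, and the array size, so the printed ratio should be read as indicative only.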