Introduction to the numpy package in Python
by K. Yue
1. Overview of the numpy package
1.1 Python's lists vs numpy's arrays
Python's lists (review):
Numpy's ndarray:
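To make the contrast concrete, here is a minimal sketch of how the same operators behave on a list and on an ndarray. The variable names are illustrative only and do not come from the course programs.

```python
import numpy as np

py_list = [1, 2, 3]
np_array = np.array([1, 2, 3])

# For lists, + concatenates and * repeats.
list_sum = py_list + py_list      # [1, 2, 3, 1, 2, 3]
list_rep = py_list * 2            # [1, 2, 3, 1, 2, 3]

# For ndarrays, + and * work element by element.
array_sum = np_array + np_array   # array([2, 4, 6])
array_dbl = np_array * 2          # array([2, 4, 6])

# A list can mix element types; an ndarray has one dtype for all elements.
mixed_list = [1, "two", 3.0]
mixed_array = np.array([1, 2.5, 3])   # the integers are converted to float64

print(list_sum, list_rep)
print(array_sum, array_dbl)
print(mixed_array.dtype)
```

The single dtype is part of what lets numpy store the elements compactly and operate on them without a Python-level loop.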
2. Examples
numpy_stat_1.py.txt: print basic statistics of a file of data using numpy. Run and annotate the program.
import numpy as np
import os  # For checking if the file exists

def calculate_statistics(filename):
    """
    Reads a text file of integers using NumPy and prints basic statistics.

    Args:
        filename (str): The name of the input text file.
    """
    if not os.path.exists(filename):
        print(f"Error: The file '{filename}' was not found.")
        print(f"Please ensure '{filename}' is in the same directory as this script.")
        return
    try:
        # Use np.loadtxt to read data from the file.
        # By default it splits values on whitespace; pass delimiter="," for comma-separated files.
        data = np.loadtxt(filename, dtype=int)
        # Flatten the array in case the data is read as a 2D array.
        # This ensures all statistics are calculated on a single list of numbers.
        # Uncomment the next two print statements to see the effect.
        # print(data)
        flat_data = data.flatten()
        # print(flat_data)
        print("-" * (len(filename) + 25))
        print(f"--- Statistics for '{filename}' ---")
        print(f"--- Data points found: {flat_data.size}")
        print(f"--- Data type: {flat_data.dtype}")
        print("-" * (len(filename) + 25))
        # Calculate and print basic statistics
        print(f"Mean (Average): {np.mean(flat_data):.2f}")
        print(f"Median (Middle): {np.median(flat_data):.2f}")
        print(f"Minimum Value: {np.min(flat_data)}")
        print(f"Maximum Value: {np.max(flat_data)}")
        print(f"Standard Deviation: {np.std(flat_data):.2f}")
        print(f"Sum of all values: {np.sum(flat_data)}")
    except ValueError as e:
        print(f"Error reading file: {e}")
        print("Ensure the file contains only integers separated by whitespace.")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

if __name__ == "__main__":
    # Specify the name of your data file
    input_file = input("Please input the data file name: ")
    calculate_statistics(input_file)
For the input file numpy_stat_1_data_1.txt:
1 2 3 2 5
2 1 4 5 9
7 8 10 4 3
9 2 3 4 6
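Because this file has four rows of five values, np.loadtxt returns a two-dimensional array, which is why the program flattens it. The sketch below shows the shapes before and after flattening. It reads the same numbers from an in-memory string through io.StringIO instead of the file so that it is self-contained; that substitution is an assumption made for illustration, not part of the original program.

```python
import io
import numpy as np

# Same content as numpy_stat_1_data_1.txt, held in memory for this sketch.
sample_text = """1 2 3 2 5
2 1 4 5 9
7 8 10 4 3
9 2 3 4 6
"""

data = np.loadtxt(io.StringIO(sample_text), dtype=int)
flat_data = data.flatten()

print(data.shape)        # (4, 5): 4 rows, 5 columns
print(flat_data.shape)   # (20,): one dimension, 20 values

# Reductions such as np.mean work over all elements by default (axis=None),
# so they give the same result on the 2D array and on the flattened copy.
print(np.mean(data) == np.mean(flat_data))   # True
```

Flattening is therefore not strictly required for the statistics the program prints, but it makes it explicit that they are computed over all values rather than per row or per column.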
Running the program:
C:\...>python numpy_stat_1.py
Please input the data file name: numpy_stat_1_data_1.txt
------------------------------------------------
--- Statistics for 'numpy_stat_1_data_1.txt' ---
--- Data points found: 20
--- Data type: int64
------------------------------------------------
Mean (Average): 4.50
Median (Middle): 4.00
Minimum Value: 1
Maximum Value: 10
Standard Deviation: 2.73
Sum of all values: 90
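One point worth annotating: np.std uses ddof=0 by default, so the 2.73 above is the population standard deviation (the sum of squared deviations is divided by N). If the data are treated as a sample, ddof=1 divides by N - 1 instead. The sketch below repeats the 20 values from the data file inline so that it runs on its own.

```python
import numpy as np

# The 20 values from numpy_stat_1_data_1.txt.
values = np.array([1, 2, 3, 2, 5, 2, 1, 4, 5, 9,
                   7, 8, 10, 4, 3, 9, 2, 3, 4, 6])

population_std = np.std(values)        # ddof=0 (default): divide by N
sample_std = np.std(values, ddof=1)    # ddof=1: divide by N - 1

print(f"Population standard deviation: {population_std:.2f}")   # 2.73
print(f"Sample standard deviation:     {sample_std:.2f}")       # 2.80
```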
3. Performance Matters (a lot)
Example:
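numpy_array_vs_lists.py: compare the time taken by three methods of creating a list/array by adding two other lists/arrays element by element. Run and annotate the program. Methods: (a) Python lists with an explicit for loop and append(); (b) Python lists with a list comprehension over zip(); (c) numpy arrays with a single vectorized addition.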
import numpy as np
import time

NUM_ELEMENTS = 10000000

a = list(range(NUM_ELEMENTS))
b = list(range(NUM_ELEMENTS))

# Method 1: Python lists with an explicit loop and append()
c = []
start_time = time.process_time()
for i in range(len(a)):
    c.append(a[i] + b[i])
end_time = time.process_time()
print(f"Time taken for adding two lists/arrays of {NUM_ELEMENTS} elements to create a third.")
print()
print(f"Time taken for Python's lists with an explicit loop: {end_time - start_time} seconds")

# Method 2: Python lists with a list comprehension over zip()
start_time = time.process_time()
c = [x + y for x, y in zip(a, b)]
end_time = time.process_time()
print(f"Time taken for Python's lists with no explicit loop but using zip(): {end_time - start_time} seconds")

# Method 3: numpy arrays; the arrays are created before the timer starts
a_np = np.arange(NUM_ELEMENTS)
b_np = np.arange(NUM_ELEMENTS)
start_time = time.process_time()
c_np = a_np + b_np  # Single vectorized operation
end_time = time.process_time()
print(f"Time taken for nparray with vectorization: {end_time - start_time} seconds")
A run of the program (the times will vary from run to run and from machine to machine):
C:\...>python numpy_array_vs_lists.py
Time taken for adding two lists/arrays of 10000000 elements to create a third.
Time taken for Python's lists with an explicit loop: 2.03125 seconds
Time taken for Python's lists with no explicit loop but using zip(): 0.953125 seconds
Time taken for nparray with vectorization: 0.03125 seconds
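The times in this run are all multiples of 1/64 second, which suggests that the clock behind time.process_time() on that machine is fairly coarse, so the smallest measurement in particular is only approximate. As a possible alternative, the sketch below times two of the methods with the standard timeit module, which repeats each measurement and uses a high-resolution timer. The function names, the smaller NUM_ELEMENTS, and REPEATS are choices made for this sketch, not part of the original program.

```python
import timeit
import numpy as np

NUM_ELEMENTS = 1_000_000   # smaller than the original so the sketch runs quickly
REPEATS = 5                # number of independent measurements per method

a = list(range(NUM_ELEMENTS))
b = list(range(NUM_ELEMENTS))
a_np = np.arange(NUM_ELEMENTS)
b_np = np.arange(NUM_ELEMENTS)

def add_with_zip():
    """Add two Python lists element by element."""
    return [x + y for x, y in zip(a, b)]

def add_with_numpy():
    """Add two numpy arrays with one vectorized operation."""
    return a_np + b_np

# Both methods should produce the same values.
assert add_with_zip() == add_with_numpy().tolist()

# timeit.repeat returns one elapsed time per repetition; the minimum is the
# measurement least disturbed by other activity on the machine.
zip_time = min(timeit.repeat(add_with_zip, number=1, repeat=REPEATS))
numpy_time = min(timeit.repeat(add_with_numpy, number=1, repeat=REPEATS))

print(f"List comprehension with zip(): {zip_time:.6f} seconds")
print(f"numpy vectorized addition:     {numpy_time:.6f} seconds")
```

The absolute numbers will differ from the run shown above, but the gap between the list-based method and the vectorized addition should remain visible.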