6  Pandas

Pandas is a powerful and popular open-source data analysis and manipulation library for Python. It provides data structures and functions needed to work with structured data seamlessly. Pandas is especially useful for working with data in tabular form, similar to how data is organized in spreadsheets or SQL tables.

Key features of Pandas:

  1. Data Structures:
    • Series: A one-dimensional labeled array capable of holding any data type (integers, strings, floats, etc.). It’s similar to a column in a spreadsheet or a single column in a database.
    • DataFrame: A two-dimensional labeled data structure with columns of potentially different types. It’s similar to a table in a relational database or a data table in Excel.
  2. Data Handling:
    • Data Cleaning: Pandas makes it easy to handle missing data, remove duplicates, and perform data transformations.
    • Data Aggregation: You can group data and apply aggregate functions like sum, mean, count, etc. Data
    • Merging and Joining: Combine multiple DataFrames using various methods like merge, join, and concatenate.
  3. File I/O:
    • Pandas can read from and write to various file formats, including CSV, Excel, JSON, SQL databases, and more.
  4. Indexing and Slicing:
    • Pandas offers powerful ways to access and modify data using labels, conditions, and positions.
  5. Time Series Data:
    • Pandas has specialized tools for handling time series data, including date range generation, frequency conversion, and resampling.

6.1 Import Pandas

import pandas as pd

6.2 Create a dataframe

# Create a Pandas DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}

df = pd.DataFrame(data)
print(df)
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3    David   40      Houston

6.3 Working with dataframes

Selecting columns:

print(df['Name'])  # Select the 'Name' column
0      Alice
1        Bob
2    Charlie
3      David
Name: Name, dtype: object

Filtering rows:

print(df[df['Age'] > 30])  # Filter rows where 'Age' is greater than 30
      Name  Age     City
2  Charlie   35  Chicago
3    David   40  Houston

Adding a new column:

df['Salary'] = [50000, 60000, 70000, 80000]
print(df)
      Name  Age         City  Salary
0    Alice   25     New York   50000
1      Bob   30  Los Angeles   60000
2  Charlie   35      Chicago   70000
3    David   40      Houston   80000

Handling missing data:

df.dropna()  # Drop rows with missing values
df.fillna(0)  # Replace missing values with 0
Name Age City Salary
0 Alice 25 New York 50000
1 Bob 30 Los Angeles 60000
2 Charlie 35 Chicago 70000
3 David 40 Houston 80000

Group by and aggregate

import pandas as pd

# Example DataFrame
data = {
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles'],
    'Population': [8000000, 4000000, 2700000, 8500000, 4100000],
    'Area': [468.9, 503, 234, 470, 503],
    'Temperature': [70, 75, 60, 72, 76]
}

df2 = pd.DataFrame(data)

# Group by 'City' and calculate the mean of numeric columns
grouped = df2.groupby('City').mean()

print(grouped)
             Population    Area  Temperature
City                                        
Chicago       2700000.0  234.00         60.0
Los Angeles   4050000.0  503.00         75.5
New York      8250000.0  469.45         71.0

6.4 Example using Pandas

Code
import pandas as pd

# Create a DataFrame
data3 = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'Age': [23, 45, 35, 29, 42],
    'Salary': [70000, 80000, 65000, 56000, 72000]
}

df3 = pd.DataFrame(data3)

# Add a new column
df3['Age Category'] = ['Young' if age < 30 else 'Middle-aged' for age in df3['Age']]

# Filter the DataFrame
high_salary_df3 = df3[df3['Salary'] > 60000]

# Group by 'Age Category' and calculate the average for numeric columns
grouped = high_salary_df3.groupby('Age Category').mean(numeric_only=True)

print("Original DataFrame:")
print(df3)

print("\nFiltered DataFrame (Salary > 60000):")
print(high_salary_df3)

print("\nGrouped DataFrame (Average Salary by Age Category):")
print(grouped)
Original DataFrame:
      Name  Age  Salary Age Category
0    Alice   23   70000        Young
1      Bob   45   80000  Middle-aged
2  Charlie   35   65000  Middle-aged
3    David   29   56000        Young
4   Edward   42   72000  Middle-aged

Filtered DataFrame (Salary > 60000):
      Name  Age  Salary Age Category
0    Alice   23   70000        Young
1      Bob   45   80000  Middle-aged
2  Charlie   35   65000  Middle-aged
4   Edward   42   72000  Middle-aged

Grouped DataFrame (Average Salary by Age Category):
                    Age        Salary
Age Category                         
Middle-aged   40.666667  72333.333333
Young         23.000000  70000.000000

6.5 Useful functions

Datatypes:

df3.dtypes
Name            object
Age              int64
Salary           int64
Age Category    object
dtype: object

Information:

df3.info
<bound method DataFrame.info of       Name  Age  Salary Age Category
0    Alice   23   70000        Young
1      Bob   45   80000  Middle-aged
2  Charlie   35   65000  Middle-aged
3    David   29   56000        Young
4   Edward   42   72000  Middle-aged>

Accessing top of dataframe:

df3.head()
Name Age Salary Age Category
0 Alice 23 70000 Young
1 Bob 45 80000 Middle-aged
2 Charlie 35 65000 Middle-aged
3 David 29 56000 Young
4 Edward 42 72000 Middle-aged

Describe:

df3.describe()
Age Salary
count 5.000000 5.000000
mean 34.800000 68600.000000
std 9.066422 8876.936408
min 23.000000 56000.000000
25% 29.000000 65000.000000
50% 35.000000 70000.000000
75% 42.000000 72000.000000
max 45.000000 80000.000000

6.6 Save your dataframe for future sessions!

import pandas as pd

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'Age': [23, 45, 35, 29, 42],
    'Salary': [70000, 80000, 65000, 56000, 72000]
}

df = pd.DataFrame(data)

# Check your working directory is set in the top right hand corner
# Save the DataFrame to a CSV file 
df.to_csv('survey.csv', index=False)

6.7 How to import data

Check and verify that your csv file is in your working directory. Next, we will read the csv file into a new Python session to continue working with the dataframe.

survey_df = pd.read_csv('survey.csv')
survey_df.head()
Name Age Salary
0 Alice 23 70000
1 Bob 45 80000
2 Charlie 35 65000
3 David 29 56000
4 Edward 42 72000

6.8 Summarize the data

Summarize the data by caclulating the summary statistics for the quantitative variables. (What are the quantitative variables in the survey data set?)

6.9 Exercises

  1. What is Pandas used for in Python?

  2. In an interview, you are asked: “What’s the difference between a Pandas Series and a DataFrame?” How would you explain this?

  3. What function can you use to view the summary statistics for each column in a dataset? What is one major limitation of this function?

  4. What does df.info() do, and when might it be useful? What is one limitation of this function?

  5. Amelia just downloaded a new dataset, but is not sure if it is the one she is expecting. Give two functions she could use to quickly see the contents without opening the whole dataset.

  6. Sam’s friend has never used a programming software, but is wondering what a working directory is. How could Sam explain it in a way his friend would understand?

  7. Roger is tasked with cleaning a dataset. There are missing values he isn’t sure how to handle. Give two functions he should consider and how they’re applied.

  8. What’s wrong with this code? How can it be corrected?

df.dropna(inplace=False)
  1. Using the following data frame:
df = pd.DataFrame({
    'Product': ['Phone', 'Tablet', 'Laptop', 'Laptop'],
    'Price': [500, 300, 1200, 1300],
    'Rating': [4.5, 4.2, None, 4.7]
})
  1. Replace missing ratings with the average rating.
  2. Add a column showing whether the product is over $1,000.
  3. Group by Product and return the average price and rating.
  1. .dropna() deletes columns with missing values by default. True or False?

  2. What does .dtypes() show you? Give an example of when it can add value.

  3. Alina wants to check if salary data is numeric but gets an error with:

print(df.dtypes())
  1. Write a line of code that shows the first 3 rows and explain why doing this is better than printing the whole DataFrame.

  2. The output of the following is NaN. Why? How can it be fixed?

df['Score'] = [90, None, 85]
df.fillna(0)
print(df)

For the following questions, create the following DataFrame and store it as df: . A column Name with the values: Alice, Bob, Cara, Dan, Eva . A column Department with the values: Sales, IT, Sales, HR, HR . A column Salary with the values: 50000, 60000, 52000, 48000, missing value . A column Rating with the values: 4, 5, missing value, 2, missing value

  1. Provide your code to create the DataFrame.

  2. How do you select just the “Name” column from a DataFrame df?

  3. What’s wrong with this code?

df['Department]
  1. Josh tried to filter employees who earn less than $50,000, but the output is blank. His code:
df[df.Salary > '50000']
  1. Add a new column “City” with values ‘Toronto’ and ‘London’.

  2. Some of the values in the ‘Rating’ column are missing. Replace the missing values with the number 3.

  3. Add a column ‘Bonus’ that gives everyone 10% of their salary.

  4. Jennifer wants to see which department gets paid the most compared to the salary expense for the company. How can she find the total salary paid to each department?

  5. Jonathan is working with the same dataset and wants to dig deeper. Instead of just the total salary, he wants to know which department pays the highest average salary. How can he do this, and how does the output help him understand compensation differences between departments?