import pandas as pd6 Pandas
Pandas is a powerful and popular open-source data analysis and manipulation library for Python. It provides data structures and functions needed to work with structured data seamlessly. Pandas is especially useful for working with data in tabular form, similar to how data is organized in spreadsheets or SQL tables.
Key features of Pandas:
- Data Structures:
- Series: A one-dimensional labeled array capable of holding any data type (integers, strings, floats, etc.). It’s similar to a column in a spreadsheet or a single column in a database.
- DataFrame: A two-dimensional labeled data structure with columns of potentially different types. It’s similar to a table in a relational database or a data table in Excel.
- Data Handling:
- Data Cleaning: Pandas makes it easy to handle missing data, remove duplicates, and perform data transformations.
- Data Aggregation: You can group data and apply aggregate functions like sum, mean, count, etc. Data
- Merging and Joining: Combine multiple DataFrames using various methods like merge, join, and concatenate.
- Data Cleaning: Pandas makes it easy to handle missing data, remove duplicates, and perform data transformations.
- File I/O:
- Pandas can read from and write to various file formats, including CSV, Excel, JSON, SQL databases, and more.
- Indexing and Slicing:
- Pandas offers powerful ways to access and modify data using labels, conditions, and positions.
- Time Series Data:
- Pandas has specialized tools for handling time series data, including date range generation, frequency conversion, and resampling.
6.1 Import Pandas
6.2 Create a dataframe
# Create a Pandas DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df = pd.DataFrame(data)
print(df) Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
3 David 40 Houston
6.3 Working with dataframes
Selecting columns:
print(df['Name']) # Select the 'Name' column0 Alice
1 Bob
2 Charlie
3 David
Name: Name, dtype: object
Filtering rows:
print(df[df['Age'] > 30]) # Filter rows where 'Age' is greater than 30 Name Age City
2 Charlie 35 Chicago
3 David 40 Houston
Adding a new column:
df['Salary'] = [50000, 60000, 70000, 80000]
print(df) Name Age City Salary
0 Alice 25 New York 50000
1 Bob 30 Los Angeles 60000
2 Charlie 35 Chicago 70000
3 David 40 Houston 80000
Handling missing data:
df.dropna() # Drop rows with missing values
df.fillna(0) # Replace missing values with 0| Name | Age | City | Salary | |
|---|---|---|---|---|
| 0 | Alice | 25 | New York | 50000 |
| 1 | Bob | 30 | Los Angeles | 60000 |
| 2 | Charlie | 35 | Chicago | 70000 |
| 3 | David | 40 | Houston | 80000 |
Group by and aggregate
import pandas as pd
# Example DataFrame
data = {
'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles'],
'Population': [8000000, 4000000, 2700000, 8500000, 4100000],
'Area': [468.9, 503, 234, 470, 503],
'Temperature': [70, 75, 60, 72, 76]
}
df2 = pd.DataFrame(data)
# Group by 'City' and calculate the mean of numeric columns
grouped = df2.groupby('City').mean()
print(grouped) Population Area Temperature
City
Chicago 2700000.0 234.00 60.0
Los Angeles 4050000.0 503.00 75.5
New York 8250000.0 469.45 71.0
6.4 Example using Pandas
Code
import pandas as pd
# Create a DataFrame
data3 = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
'Age': [23, 45, 35, 29, 42],
'Salary': [70000, 80000, 65000, 56000, 72000]
}
df3 = pd.DataFrame(data3)
# Add a new column
df3['Age Category'] = ['Young' if age < 30 else 'Middle-aged' for age in df3['Age']]
# Filter the DataFrame
high_salary_df3 = df3[df3['Salary'] > 60000]
# Group by 'Age Category' and calculate the average for numeric columns
grouped = high_salary_df3.groupby('Age Category').mean(numeric_only=True)
print("Original DataFrame:")
print(df3)
print("\nFiltered DataFrame (Salary > 60000):")
print(high_salary_df3)
print("\nGrouped DataFrame (Average Salary by Age Category):")
print(grouped)Original DataFrame:
Name Age Salary Age Category
0 Alice 23 70000 Young
1 Bob 45 80000 Middle-aged
2 Charlie 35 65000 Middle-aged
3 David 29 56000 Young
4 Edward 42 72000 Middle-aged
Filtered DataFrame (Salary > 60000):
Name Age Salary Age Category
0 Alice 23 70000 Young
1 Bob 45 80000 Middle-aged
2 Charlie 35 65000 Middle-aged
4 Edward 42 72000 Middle-aged
Grouped DataFrame (Average Salary by Age Category):
Age Salary
Age Category
Middle-aged 40.666667 72333.333333
Young 23.000000 70000.000000
6.5 Useful functions
Datatypes:
df3.dtypesName object
Age int64
Salary int64
Age Category object
dtype: object
Information:
df3.info<bound method DataFrame.info of Name Age Salary Age Category
0 Alice 23 70000 Young
1 Bob 45 80000 Middle-aged
2 Charlie 35 65000 Middle-aged
3 David 29 56000 Young
4 Edward 42 72000 Middle-aged>
Accessing top of dataframe:
df3.head()| Name | Age | Salary | Age Category | |
|---|---|---|---|---|
| 0 | Alice | 23 | 70000 | Young |
| 1 | Bob | 45 | 80000 | Middle-aged |
| 2 | Charlie | 35 | 65000 | Middle-aged |
| 3 | David | 29 | 56000 | Young |
| 4 | Edward | 42 | 72000 | Middle-aged |
Describe:
df3.describe()| Age | Salary | |
|---|---|---|
| count | 5.000000 | 5.000000 |
| mean | 34.800000 | 68600.000000 |
| std | 9.066422 | 8876.936408 |
| min | 23.000000 | 56000.000000 |
| 25% | 29.000000 | 65000.000000 |
| 50% | 35.000000 | 70000.000000 |
| 75% | 42.000000 | 72000.000000 |
| max | 45.000000 | 80000.000000 |
6.6 Save your dataframe for future sessions!
import pandas as pd
# Sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
'Age': [23, 45, 35, 29, 42],
'Salary': [70000, 80000, 65000, 56000, 72000]
}
df = pd.DataFrame(data)
# Check your working directory is set in the top right hand corner
# Save the DataFrame to a CSV file
df.to_csv('survey.csv', index=False)6.7 How to import data
Check and verify that your csv file is in your working directory. Next, we will read the csv file into a new Python session to continue working with the dataframe.
survey_df = pd.read_csv('survey.csv')
survey_df.head()| Name | Age | Salary | |
|---|---|---|---|
| 0 | Alice | 23 | 70000 |
| 1 | Bob | 45 | 80000 |
| 2 | Charlie | 35 | 65000 |
| 3 | David | 29 | 56000 |
| 4 | Edward | 42 | 72000 |
6.8 Summarize the data
Summarize the data by caclulating the summary statistics for the quantitative variables. (What are the quantitative variables in the survey data set?)
6.9 Exercises
What is Pandas used for in Python?
In an interview, you are asked: “What’s the difference between a Pandas Series and a DataFrame?” How would you explain this?
What function can you use to view the summary statistics for each column in a dataset? What is one major limitation of this function?
What does df.info() do, and when might it be useful? What is one limitation of this function?
Amelia just downloaded a new dataset, but is not sure if it is the one she is expecting. Give two functions she could use to quickly see the contents without opening the whole dataset.
Sam’s friend has never used a programming software, but is wondering what a working directory is. How could Sam explain it in a way his friend would understand?
Roger is tasked with cleaning a dataset. There are missing values he isn’t sure how to handle. Give two functions he should consider and how they’re applied.
What’s wrong with this code? How can it be corrected?
df.dropna(inplace=False)- Using the following data frame:
df = pd.DataFrame({
'Product': ['Phone', 'Tablet', 'Laptop', 'Laptop'],
'Price': [500, 300, 1200, 1300],
'Rating': [4.5, 4.2, None, 4.7]
})- Replace missing ratings with the average rating.
- Add a column showing whether the product is over $1,000.
- Group by Product and return the average price and rating.
.dropna() deletes columns with missing values by default. True or False?
What does .dtypes() show you? Give an example of when it can add value.
Alina wants to check if salary data is numeric but gets an error with:
print(df.dtypes())Write a line of code that shows the first 3 rows and explain why doing this is better than printing the whole DataFrame.
The output of the following is NaN. Why? How can it be fixed?
df['Score'] = [90, None, 85]
df.fillna(0)
print(df)For the following questions, create the following DataFrame and store it as df: . A column Name with the values: Alice, Bob, Cara, Dan, Eva . A column Department with the values: Sales, IT, Sales, HR, HR . A column Salary with the values: 50000, 60000, 52000, 48000, missing value . A column Rating with the values: 4, 5, missing value, 2, missing value
Provide your code to create the DataFrame.
How do you select just the “Name” column from a DataFrame df?
What’s wrong with this code?
df['Department]- Josh tried to filter employees who earn less than $50,000, but the output is blank. His code:
df[df.Salary > '50000']Add a new column “City” with values ‘Toronto’ and ‘London’.
Some of the values in the ‘Rating’ column are missing. Replace the missing values with the number 3.
Add a column ‘Bonus’ that gives everyone 10% of their salary.
Jennifer wants to see which department gets paid the most compared to the salary expense for the company. How can she find the total salary paid to each department?
Jonathan is working with the same dataset and wants to dig deeper. Instead of just the total salary, he wants to know which department pays the highest average salary. How can he do this, and how does the output help him understand compensation differences between departments?