13  Importing, Exploring, and Creating Data

13.1 Importing csv files

If you have an Excel file, it is easier to convert to a csv (comma deliminated) file.

Make sure the spreadsheet has been prepared. The top row is for the name of each variable, and every column MUST be the same length. Otherwise, the file will be read in as a list or some other structure type.

13.2 Import the dataset to your R session

What can you discover about the dataset?

# Import the dataset
customer_sales <- read.csv("BusinessMetrics.csv")

13.3 Preliminary data exploration

dim(customer_sales)
[1] 50  5
summary(customer_sales)
 Product.Category   Purchase.Amount Customer.Satisfaction.Rating
 Length:50          Min.   : 28.6   Min.   :1.0                 
 Class :character   1st Qu.:375.0   1st Qu.:2.0                 
 Mode  :character   Median :572.6   Median :3.0                 
                    Mean   :542.6   Mean   :3.2                 
                    3rd Qu.:761.9   3rd Qu.:4.0                 
                    Max.   :978.8   Max.   :5.0                 
 Repeat.Customer    Previous.Purchase.Amount
 Length:50          Min.   :  0.0           
 Class :character   1st Qu.:  0.0           
 Mode  :character   Median :136.8           
                    Mean   :278.3           
                    3rd Qu.:478.9           
                    Max.   :982.8           

13.4 Installing R packages

Go to Help for CRAN packages search options.

Install the package.

You must run library(“name of package”) for each new R session.

13.5 Introducing the ‘dplyr’ package

Let’s try installing the R package commonly used for data manipulation:

# Remove the # symbol and install the package
#install.packages('dplyr')
⏳ Installation Tip

Wait for it!

Some packages take a while to install!

You only need to install a package once, but you will need to attach the library for each new R session.

library('dplyr')

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

13.6 Core functions in dplyr

select():

Choose specific columns from a data frame.

customer_sales_selected <- customer_sales %>%
  select(Purchase.Amount, Repeat.Customer)


   Purchase.Amount Repeat.Customer
1           553.33             Yes
2           718.04              No
3           606.74             Yes
4           549.43             Yes
5           429.42             Yes
6           649.44              No
7           443.21             Yes
8           892.86              No
9           964.03             Yes
10          389.61              No
11          793.81             Yes
12          533.61             Yes
13          572.36             Yes
14          926.34             Yes
15           80.33             Yes
16           96.26              No
17           30.02             Yes
18          834.29             Yes
19          780.38              No
20          871.31             Yes
21          978.83             Yes
22          801.17             Yes
23          466.86              No
24          782.72             Yes
25          127.09             Yes
26          643.52              No
27          151.92             Yes
28          945.22              No
29          526.63             Yes
30          420.52             Yes
31          271.91              No
32          776.49              No
33          461.59             Yes
34          572.75             Yes
35           28.60             Yes
36          621.46              No
37          615.97              No
38          620.76             Yes
39          944.31             Yes
40          685.00             Yes
41          365.91             Yes
42          442.66             Yes
43          700.65              No
44           69.62             Yes
45          670.10              No
46          673.93             Yes
47          218.28             Yes
48          137.64             Yes
49          322.27              No
50          370.07              No

filter():

Subset rows based on conditions.

customer_sales_filtered <- customer_sales %>%
  filter(Purchase.Amount > 550)


customer_sales_filtered 
   Product.Category Purchase.Amount Customer.Satisfaction.Rating
1          Clothing          553.33                            2
2          Clothing          718.04                            3
3       Electronics          606.74                            5
4            Sports          649.44                            5
5          Clothing          892.86                            4
6       Electronics          964.03                            5
7          Clothing          793.81                            5
8        Automotive          572.36                            5
9            Sports          926.34                            4
10         Clothing          834.29                            2
11      Electronics          780.38                            4
12       Automotive          871.31                            1
13       Automotive          978.83                            1
14       Automotive          801.17                            2
15           Sports          782.72                            5
16      Electronics          643.52                            1
17           Sports          945.22                            3
18       Automotive          776.49                            1
19           Sports          572.75                            3
20       Home Goods          621.46                            3
21      Electronics          615.97                            4
22       Home Goods          620.76                            3
23         Clothing          944.31                            2
24           Sports          685.00                            3
25       Home Goods          700.65                            4
26       Automotive          670.10                            4
27       Home Goods          673.93                            5
   Repeat.Customer Previous.Purchase.Amount
1              Yes                   125.04
2               No                     0.00
3              Yes                   417.70
4               No                     0.00
5               No                     0.00
6              Yes                   663.07
7              Yes                   238.23
8              Yes                   622.62
9              Yes                   480.12
10             Yes                   389.63
11              No                     0.00
12             Yes                   879.67
13             Yes                   111.83
14             Yes                   101.45
15             Yes                   556.30
16              No                     0.00
17              No                     0.00
18              No                     0.00
19             Yes                   453.45
20              No                     0.00
21              No                     0.00
22             Yes                   842.73
23             Yes                   905.60
24             Yes                    47.18
25              No                     0.00
26              No                     0.00
27             Yes                   148.57

mutate():

Create new columns or modify existing ones.

For example, add a new column called Purchase.Amount.Tax:

customer_sales_mutated <- customer_sales %>%
  mutate(Purchase.Amount.Tax = Purchase.Amount * 0.15)


colnames(customer_sales_mutated)
[1] "Product.Category"             "Purchase.Amount"             
[3] "Customer.Satisfaction.Rating" "Repeat.Customer"             
[5] "Previous.Purchase.Amount"     "Purchase.Amount.Tax"         

arrange():

Sort rows based on one or more columns.

For example, sort the dataset by Purchase.Amount in ascending order:

customer_sales_arranged <- customer_sales %>%
  arrange(Purchase.Amount)


customer_sales_arranged
   Product.Category Purchase.Amount Customer.Satisfaction.Rating
1        Automotive           28.60                            4
2        Automotive           30.02                            2
3       Electronics           69.62                            3
4          Clothing           80.33                            3
5          Clothing           96.26                            1
6        Automotive          127.09                            3
7            Sports          137.64                            3
8          Clothing          151.92                            4
9            Sports          218.28                            2
10       Automotive          271.91                            2
11           Sports          322.27                            4
12           Sports          365.91                            4
13       Automotive          370.07                            2
14           Sports          389.61                            5
15      Electronics          420.52                            1
16       Home Goods          429.42                            5
17      Electronics          442.66                            4
18       Home Goods          443.21                            5
19         Clothing          461.59                            3
20         Clothing          466.86                            3
21      Electronics          526.63                            3
22      Electronics          533.61                            1
23      Electronics          549.43                            4
24         Clothing          553.33                            2
25       Automotive          572.36                            5
26           Sports          572.75                            3
27      Electronics          606.74                            5
28      Electronics          615.97                            4
29       Home Goods          620.76                            3
30       Home Goods          621.46                            3
31      Electronics          643.52                            1
32           Sports          649.44                            5
33       Automotive          670.10                            4
34       Home Goods          673.93                            5
35           Sports          685.00                            3
36       Home Goods          700.65                            4
37         Clothing          718.04                            3
38       Automotive          776.49                            1
39      Electronics          780.38                            4
40           Sports          782.72                            5
41         Clothing          793.81                            5
42       Automotive          801.17                            2
43         Clothing          834.29                            2
44       Automotive          871.31                            1
45         Clothing          892.86                            4
46           Sports          926.34                            4
47         Clothing          944.31                            2
48           Sports          945.22                            3
49      Electronics          964.03                            5
50       Automotive          978.83                            1
   Repeat.Customer Previous.Purchase.Amount
1              Yes                   108.57
2              Yes                   295.11
3              Yes                   866.28
4              Yes                   475.43
5               No                     0.00
6              Yes                    43.29
7              Yes                   982.75
8              Yes                   327.79
9              Yes                    42.89
10              No                     0.00
11              No                     0.00
12             Yes                   513.23
13              No                     0.00
14              No                     0.00
15             Yes                   106.29
16             Yes                   257.30
17             Yes                   175.18
18             Yes                   965.76
19             Yes                   541.65
20              No                     0.00
21             Yes                   149.85
22             Yes                   400.92
23             Yes                   678.68
24             Yes                   125.04
25             Yes                   622.62
26             Yes                   453.45
27             Yes                   417.70
28              No                     0.00
29             Yes                   842.73
30              No                     0.00
31              No                     0.00
32              No                     0.00
33              No                     0.00
34             Yes                   148.57
35             Yes                    47.18
36              No                     0.00
37              No                     0.00
38              No                     0.00
39              No                     0.00
40             Yes                   556.30
41             Yes                   238.23
42             Yes                   101.45
43             Yes                   389.63
44             Yes                   879.67
45              No                     0.00
46             Yes                   480.12
47             Yes                   905.60
48              No                     0.00
49             Yes                   663.07
50             Yes                   111.83

For decreasing order use:

customer_sales_arranged_desc <- customer_sales %>%
  arrange(desc(Purchase.Amount))


customer_sales_arranged_desc
   Product.Category Purchase.Amount Customer.Satisfaction.Rating
1        Automotive          978.83                            1
2       Electronics          964.03                            5
3            Sports          945.22                            3
4          Clothing          944.31                            2
5            Sports          926.34                            4
6          Clothing          892.86                            4
7        Automotive          871.31                            1
8          Clothing          834.29                            2
9        Automotive          801.17                            2
10         Clothing          793.81                            5
11           Sports          782.72                            5
12      Electronics          780.38                            4
13       Automotive          776.49                            1
14         Clothing          718.04                            3
15       Home Goods          700.65                            4
16           Sports          685.00                            3
17       Home Goods          673.93                            5
18       Automotive          670.10                            4
19           Sports          649.44                            5
20      Electronics          643.52                            1
21       Home Goods          621.46                            3
22       Home Goods          620.76                            3
23      Electronics          615.97                            4
24      Electronics          606.74                            5
25           Sports          572.75                            3
26       Automotive          572.36                            5
27         Clothing          553.33                            2
28      Electronics          549.43                            4
29      Electronics          533.61                            1
30      Electronics          526.63                            3
31         Clothing          466.86                            3
32         Clothing          461.59                            3
33       Home Goods          443.21                            5
34      Electronics          442.66                            4
35       Home Goods          429.42                            5
36      Electronics          420.52                            1
37           Sports          389.61                            5
38       Automotive          370.07                            2
39           Sports          365.91                            4
40           Sports          322.27                            4
41       Automotive          271.91                            2
42           Sports          218.28                            2
43         Clothing          151.92                            4
44           Sports          137.64                            3
45       Automotive          127.09                            3
46         Clothing           96.26                            1
47         Clothing           80.33                            3
48      Electronics           69.62                            3
49       Automotive           30.02                            2
50       Automotive           28.60                            4
   Repeat.Customer Previous.Purchase.Amount
1              Yes                   111.83
2              Yes                   663.07
3               No                     0.00
4              Yes                   905.60
5              Yes                   480.12
6               No                     0.00
7              Yes                   879.67
8              Yes                   389.63
9              Yes                   101.45
10             Yes                   238.23
11             Yes                   556.30
12              No                     0.00
13              No                     0.00
14              No                     0.00
15              No                     0.00
16             Yes                    47.18
17             Yes                   148.57
18              No                     0.00
19              No                     0.00
20              No                     0.00
21              No                     0.00
22             Yes                   842.73
23              No                     0.00
24             Yes                   417.70
25             Yes                   453.45
26             Yes                   622.62
27             Yes                   125.04
28             Yes                   678.68
29             Yes                   400.92
30             Yes                   149.85
31              No                     0.00
32             Yes                   541.65
33             Yes                   965.76
34             Yes                   175.18
35             Yes                   257.30
36             Yes                   106.29
37              No                     0.00
38              No                     0.00
39             Yes                   513.23
40              No                     0.00
41              No                     0.00
42             Yes                    42.89
43             Yes                   327.79
44             Yes                   982.75
45             Yes                    43.29
46              No                     0.00
47             Yes                   475.43
48             Yes                   866.28
49             Yes                   295.11
50             Yes                   108.57

summarize():

Summarize data by calculating summary statistics.

For example, calculate the average Purchase.Amount:

customer_sales_summary <- customer_sales %>%
  summarize(avg_purchase = mean(Purchase.Amount, na.rm=TRUE))


customer_sales_summary
  avg_purchase
1     542.5854

groupby():

Group data by one or more columns, often used with summarize().

For example, group by repeat customer and then calculate the average Purchase.Amount for each group (Yes/No).

customer_sales_grouped_summary <- customer_sales %>%
  group_by(Repeat.Customer) %>% 
  summarize(avg_purchase = mean(Purchase.Amount, na.rm=TRUE))


customer_sales_grouped_summary
# A tibble: 2 × 2
  Repeat.Customer avg_purchase
  <chr>                  <dbl>
1 No                      584.
2 Yes                     521.

13.7 Tibbles

A tibble is a modern, improved version of a data frame in R. It keeps data in a tabular format but offers a few enhancements:

  • Better Printing: Tibbles print only the first few rows and columns, which is helpful for large datasets.

  • No Automatic Conversion: Tibbles don’t convert character strings to factors automatically, unlike data frames.

  • Enhanced Readability: Tibbles handle lists and nested data structures more gracefully.

They’re especially popular for data analysis because they work well with tidyverse functions.

13.8 The pipe operator %>%

As we have seen, the pipe operator allows you to chain commands together in a readable way. For example:

customer_sales %>%
  filter(Purchase.Amount < 300) %>%
  select(Repeat.Customer, Purchase.Amount) %>%
  arrange(desc(Purchase.Amount))
   Repeat.Customer Purchase.Amount
1               No          271.91
2              Yes          218.28
3              Yes          151.92
4              Yes          137.64
5              Yes          127.09
6               No           96.26
7              Yes           80.33
8              Yes           69.62
9              Yes           30.02
10             Yes           28.60

13.9 Scatterplot matrices

plot(customer_sales, col=3)

13.10 Correlations?

plot(Purchase.Amount ~ Previous.Purchase.Amount,
     data=customer_sales, col="blue")

13.11 Graphical parameters

Search for more features to customize your plots by typing:

> ?plot

> ?points

> ?par

For colour palettes in base R plots use:

> palette()

Example:

plot(Purchase.Amount ~ Previous.Purchase.Amount,
     data=customer_sales, col="blue", pch=19)

13.12 Reflection questions

  1. What could go wrong if the top row of your spreadsheet doesn’t contain variable names?

  2. What does the dim() function tell you about your dataset? Why is it useful?

  3. How do packages like dplyr enhance the data analysis process compared to using base R?

  4. Why do you need to run library(“dplyr”) each time you start a new R session, even if the package is already installed?

  5. How does chaining functions with the pipe %>% make your code easier to read?

  6. What insights can you gain by creating a scatterplot matrix of your dataset?

  7. What is one new graphical customization you discovered using the ?plot, ?par, or ?points help pages?

13.13 Exercises

  1. What is the goal of the dplyr package, and why do many analysts prefer it over base R for data manipulation?

  2. Why might you choose to use filter() over using logical indexing directly in base R?

  3. What makes mutate() particularly useful when working with grouped data?

  4. What is the main difference between mutate() and summarize()?

  5. Why is arrange() combined with desc() often used in practice?

  6. Why might you choose to use select() instead of subsetting columns with [ , ] in base R?

  7. Explain the relationship between group_by() and summarize(). Why does group_by() not change the dataset on its own?

  8. What problem do tibbles solve compared to base data frames?

  9. How does the pipe operator %>% change the way you write and read code?

For questions 10 to 15, use the following dataframe:

orders <- data.frame(
  order_id = 1:5,
  customer = c("A","B","C","D","E"),
  year     = c(2025,2025,2024,2025,2024),
  total    = c(200, 1200, 450, 800, 300),
  discount = c(20, 100, 0, 50, 10)
)
  1. Keep only the columns order_id, customer, and total using select() with the pipe operator.

  2. Filter to show only rows where total > 500 and year == 2025.

  3. Add a new column called net to orders that equals total - discount. Use mutate().

  4. Sort the orders dataset by total in descending order. Use arrange() with desc().

  5. Group the orders dataset by year and compute the average order total in each year.

  6. Karl wrote:

orders <- select(order_id, customer, total) %>% orders

Correct the code.

  1. Explain what a tibble is. Convert the orders dataset into a tibble and observe how it prints differently compared to a data frame.

  2. Explain what the pipe operator %>% does. Rewrite this nested call using the pipe:

summarize(group_by(orders, year), avg_total = mean(total))
  1. What is the advantage of a scatterplot matrix over looking at pairs of plots individually?

  2. Why is it important to learn how to add parameters like points(), par(), and color palettes to plots?

  3. How could adding the pch (point character) parameter help when creating scatterplot matrices with groups of data?

For questions 21 to 28, use the built-in dataset mtcars.

  1. Create scatterplots of mpg vs hp and mpg vs wt in a single window using par(mfrow=) and the plot() function.

  2. Filter rows where mpg > 25 OR hp < 100. Write the code using filter().

  3. Plot mpg against wt using formula notation (y ~ x). Add labels for the axes and a title.

  4. Highlight 6-cylinder cars in the above scatterplot by adding points with a different symbol and size.

  5. Color points in the mpg ~ wt plot by cylinder group (4, 6, 8) using a custom palette.

  6. Add two new variables to mtcars:

  • power_to_weight = hp / wt
  • kpl = mpg / 2.35 (to convert miles per gallon to kilometers per liter)
  1. Group by cyl and calculate both the average mpg and the maximum hp for each group.

  2. Rewrite this nested call using the pipe operator:

head(arrange(filter(mtcars, mpg > 20), desc(hp)), 5)
  1. Why does the formula interface (y ~ x) in base R plotting make code more robust?