3 👆 Selecting and Filtering Data
Let’s practice importing data, from 2. Download the periodic table data now, and import it into R as periodic_table
.
Remember to move this into your working directory:
getwd()
Since this is a CSV file, use the From Text (readr) function.
Change the name of the data set to periodic_table before you import it. Remember that’s on the bottom left.
The first 10 rows of periodic_table
are reproduced below.
atomic_number | symbol | name | name_origin | group | period | block | state_at_stp | occurrence | description | atomic_weight | aw_uncertainty | any_stable_nuclides | density | density_predicted | melting_point | mp_predicted | boiling_point | bp_predicted | heat_capacity | electronegativity | abundance |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | H | Hydrogen | composed of the Greek elements hydro- and -gen meaning ‘water-forming’ | 1 | 1 | s | Gas | Primordial | Non-metal | 1.008 | NA | Yes | 0.0001 | FALSE | 14.010 | FALSE | 20.28 | FALSE | 14.304 | 2.20 | 1.40e+03 |
2 | He | Helium | the Greek helios, ‘sun’ | 18 | 1 | s | Gas | Primordial | Noble gas | 4.003 | 2 | Yes | 0.0002 | FALSE | 0.956 | FALSE | 4.22 | FALSE | 5.193 | NA | 8.00e-03 |
3 | Li | Lithium | the Greek lithos, ‘stone’ | 1 | 2 | s | Solid | Primordial | Alkali metal | 6.940 | NA | Yes | 0.5340 | FALSE | 453.690 | FALSE | 1560.00 | FALSE | 3.582 | 0.98 | 2.00e+01 |
4 | Be | Beryllium | beryl, a mineral | 2 | 2 | s | Solid | Primordial | Alkaline earth metal | 9.012 | 5 | Yes | 1.8500 | FALSE | 1560.000 | FALSE | 2742.00 | FALSE | 1.825 | 1.57 | 2.80e+00 |
5 | B | Boron | borax, a mineral | 13 | 2 | p | Solid | Primordial | Metalloid | 10.810 | NA | Yes | 2.3400 | FALSE | 2349.000 | FALSE | 4200.00 | FALSE | 1.026 | 2.04 | 1.00e+01 |
6 | C | Carbon | the Latin carbo, ‘coal’ | 14 | 2 | p | Solid | Primordial | Non-metal | 12.011 | NA | Yes | 2.2670 | FALSE | 3800.000 | FALSE | 4300.00 | FALSE | 0.709 | 2.55 | 2.00e+02 |
7 | N | Nitrogen | the Greek nitron and ‘-gen’ meaning ‘niter-forming’ | 15 | 2 | p | Gas | Primordial | Non-metal | 14.007 | NA | Yes | 0.0013 | FALSE | 63.150 | FALSE | 77.36 | FALSE | 1.040 | 3.04 | 1.90e+01 |
8 | O | Oxygen | from the Greek oxy-, both ‘sharp’ and ‘acid’, and -gen, meaning ‘acid-forming’ | 16 | 2 | p | Gas | Primordial | Non-metal | 15.999 | NA | Yes | 0.0014 | FALSE | 54.360 | FALSE | 90.20 | FALSE | 0.918 | 3.44 | 4.61e+05 |
9 | F | Fluorine | the Latin fluere, ‘to flow’ | 17 | 2 | p | Gas | Primordial | Halogen | 18.998 | 6 | Yes | 0.0017 | FALSE | 53.530 | FALSE | 85.03 | FALSE | 0.824 | 3.98 | 5.85e+02 |
10 | Ne | Neon | the Greek neos, meaning ‘new’ | 18 | 2 | p | Gas | Primordial | Noble gas | 20.180 | 6 | Yes | 0.0009 | FALSE | 24.560 | FALSE | 27.07 | FALSE | 1.030 | NA | 5.00e-03 |
So we have a big data frame of information about the periodic table, but what if we wanted to extract smaller pieces of the data frame?
3.1 Selecting a column using $
By default, R will organize character strings in alphabetical order. To see this, we’ll introduce two new features: the table
function and the $
operator.
Let’s say I’m interested in elements in a gas, liquid, and solid state at room temperature. Looking at the periodic_table
columns, I see that the column that gives me that information is state_at_stp
.
periodic_table$state_at_stp
## [1] "Gas" "Gas" "Solid" "Solid" "Solid" "Solid"
## [7] "Gas" "Gas" "Gas" "Gas" "Solid" "Solid"
## [13] "Solid" "Solid" "Solid" "Solid" "Gas" "Gas"
## [19] "Solid" "Solid" "Solid" "Solid" "Solid" "Solid"
## [25] "Solid" "Solid" "Solid" "Solid" "Solid" "Solid"
## [31] "Solid" "Solid" "Solid" "Solid" "Liquid" "Gas"
## [37] "Solid" "Solid" "Solid" "Solid" "Solid" "Solid"
## [43] "Solid" "Solid" "Solid" "Solid" "Solid" "Solid"
## [49] "Solid" "Solid" "Solid" "Solid" "Solid" "Gas"
## [55] "Solid" "Solid" "Solid" "Solid" "Solid" "Solid"
## [61] "Solid" "Solid" "Solid" "Solid" "Solid" "Solid"
## [67] "Solid" "Solid" "Solid" "Solid" "Solid" "Solid"
## [73] "Solid" "Solid" "Solid" "Solid" "Solid" "Solid"
## [79] "Solid" "Liquid" "Solid" "Solid" "Solid" "Solid"
## [85] "Solid" "Gas" "Solid" "Solid" "Solid" "Solid"
## [91] "Solid" "Solid" "Solid" "Solid" "Solid" "Solid"
## [97] "Solid" "Solid" "Solid" "Solid" "Solid" "Solid"
## [103] "Solid" NA NA NA NA NA
## [109] NA NA NA NA NA NA
## [115] NA NA NA NA
As you can see, to access specific variables inside a data frame, we can do so by entering the name of the data frame followed by $
and the name of the variable. (Note that spaces in variable names will not work. You’ll likely learn that the hard way, as I have.)
3.2 Selecting an element using [ , ]
Recall in 1.3 section that we could use the [ ]
operator, to select elements from a single vector. We can extend these operations into a dataframe.
A single vector can be thought of as a dataframe with one row and \(n\) columns, which is why you just needed a single number to select an element. Now, since there are multiple rows and columns, you need a comma to differentiate the row and column index.
The [ ]
operator allows you to select elements within a vector or a dataframe based on its row and column position. It starts by locating the row position, and then the column position.
3.2.1 Selecting a single element using [ , ]
Let’s say I wanted to select the name
for Sodium
, that is in row 11, column 3.
periodic_table[11, 3]
## # A tibble: 1 × 1
## name
## <chr>
## 1 Sodium
periodic_table$name[11]
## [1] "Sodium"
Notice that both commands above give me the desired value– Sodium– but the formats are different. + In the first command, R gives me back a table with 1 row and 1 column. This isn’t a coincidence, because you pulled directly from the table itself. It’s the difference between copying a “cell” in Excel versus double-clicking on the cell, and then copying the text inside. Here, you directly pulled from the table, so R returns back a value within a table. + The second gives me a single character vector. This makes sense– the $
operator selects a single vector name
from the table, and then selects the 11th element. Generally, this is the way you want to select a single table.
Recall the use of [ ]
when a vector was printed, to help us better understand where we were in printing a large vector. We can use this same tool to select the tenth to the twentieth elements of the periodic_table$name
variable:
periodic_table$name[10:20]
## [1] "Neon" "Sodium" "Magnesium" "Aluminium"
## [5] "Silicon" "Phosphorus" "Sulfur" "Chlorine"
## [9] "Argon" "Potassium" "Calcium"
3.3 Selecting rows and columns using [ , ]
You have now seen how to select specific elements of a vector or a variable, but what if we want a subset of the values in the full data frame across both rows (observations) and columns (variables).?
We can use [ , ]
where the spot before the comma corresponds to rows and the spot after the comma corresponds to columns. Let’s select rows 40 to 50 and columns 1, 2, and 4 from periodic_table
:
periodic_table[41:50, c(1, 2, 4)]
## # A tibble: 10 × 3
## atomic_number symbol name_origin
## <dbl> <chr> <chr>
## 1 41 Nb Niobe, daughter of king Tantalus f…
## 2 42 Mo the Greek molybdos meaning 'lead'
## 3 43 Tc the Greek tekhn??tos meaning 'arti…
## 4 44 Ru Ruthenia, the New Latin name for R…
## 5 45 Rh the Greek rhodos, meaning 'rose co…
## 6 46 Pd the then recently discovered aster…
## 7 47 Ag English word (argentum in Latin)
## 8 48 Cd the New Latin cadmia, from King Ka…
## 9 49 In indigo
## 10 50 Sn English word (stannum in Latin)
3.3.1 “Let’s eat Grandma!” vs. “Let’s eat, Grandma!”
The comma is very important. Let’s look at some examples:
#Returns all of column 2:
periodic_table[2]
## # A tibble: 118 × 1
## symbol
## <chr>
## 1 H
## 2 He
## 3 Li
## 4 Be
## 5 B
## 6 C
## 7 N
## 8 O
## 9 F
## 10 Ne
## # … with 108 more rows
#Returns all of row 2:
periodic_table[2, ]
## # A tibble: 1 × 22
## atomic_number symbol name name_origin group period block
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 2 He Heli… the Greek … 18 1 s
## # … with 15 more variables: state_at_stp <chr>,
## # occurrence <chr>, description <chr>,
## # atomic_weight <dbl>, aw_uncertainty <dbl>,
## # any_stable_nuclides <chr>, density <dbl>,
## # density_predicted <lgl>, melting_point <dbl>,
## # mp_predicted <lgl>, boiling_point <dbl>,
## # bp_predicted <lgl>, heat_capacity <dbl>, …
#Returns all of column 2:
periodic_table[ ,2]
## # A tibble: 118 × 1
## symbol
## <chr>
## 1 H
## 2 He
## 3 Li
## 4 Be
## 5 B
## 6 C
## 7 N
## 8 O
## 9 F
## 10 Ne
## # … with 108 more rows
#Returns a single element in Row 2, Column 2:
periodic_table[2,2]
## # A tibble: 1 × 1
## symbol
## <chr>
## 1 He
3.4 Replacing and Removing Data
Destructive operations are any operation where you override the original data. This is not necessairly a bad thing– sometimes, we don’t have a need for the original data, but want to manipulate it in some way. However, it means that if you made a mistake, the only way to revert the mistake is to completely reload the data from its original source, and repeat all your operations again. As in 1.2, you can alter your dataframe by using the storage operator <-
to overwrite data.
3.4.1 Replacing one element
Let’s say that I accidentally typed ‘Abram’ as my friend’s name, and I want to replace it with ‘Abe’
friends$names[1]<- "Abe"
friends
## names ages DC_Resident fav_number
## 1 Abe 34 TRUE 1.00
## 2 Bryant 35 FALSE 2.17
## 3 Colleen 32 FALSE 26.00
## 4 David 29 TRUE 7.00
## 5 Esther 30 FALSE 10.00
## 6 Jeremiah 30 TRUE 9.00
3.4.2 Replacing columns with a mathematical formula
I want to see what everybody’s age is in 50 years.
friends$ages<- friends$ages+50
friends
## names ages DC_Resident fav_number
## 1 Abe 84 TRUE 1.00
## 2 Bryant 85 FALSE 2.17
## 3 Colleen 82 FALSE 26.00
## 4 David 79 TRUE 7.00
## 5 Esther 80 FALSE 10.00
## 6 Jeremiah 80 TRUE 9.00
3.4.3 Replacing entire columns with another vector
I want to add their last names.
friends$names <- c("Abram A", "Bryant B", "Colleen C", "David D", "Esther E", "Jeremiah J")
friends
## names ages DC_Resident fav_number
## 1 Abram A 84 TRUE 1.00
## 2 Bryant B 85 FALSE 2.17
## 3 Colleen C 82 FALSE 26.00
## 4 David D 79 TRUE 7.00
## 5 Esther E 80 FALSE 10.00
## 6 Jeremiah J 80 TRUE 9.00
Note in the last example, I had to type the entire vector I planned to override. That’s because these are characters, and if you use the +
operator, R will return an error saying that you can’t add “a” with “b”. We will discuss a better way to do this, below.
3.4.4 Dropping Rows
Or I added the first person by mistake to my Friends list, so I have to drop the first row that represents “Abram A”:
friends<- friends[-1, ]
friends
## names ages DC_Resident fav_number
## 2 Bryant B 85 FALSE 2.17
## 3 Colleen C 82 FALSE 26.00
## 4 David D 79 TRUE 7.00
## 5 Esther E 80 FALSE 10.00
## 6 Jeremiah J 80 TRUE 9.00
3.5 🤔 Selecting Data Using Logical Conditions
3.5.1 Using %in%
Recall in 1.3.1 that we could use a logical vector, and the %in%
operator, to select elements from a single vector. We can extend these operations into a dataframe.
At its most basic level, you can use the ==
operator to test whether one element exactly matches another. For example, "gas"=="gas"
returns TRUE, but "gas"=="Gas"
will return false.
Let’s say that I want to find any elements that are naturally gasses.
periodic_table[periodic_table$state_at_stp == "Liquid", ]
## # A tibble: 17 × 22
## atomic_number symbol name name_origin group period
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 35 Br Bromine the Greek bro… 17 4
## 2 80 Hg Mercury the New Latin… 12 6
## 3 NA <NA> <NA> <NA> NA NA
## 4 NA <NA> <NA> <NA> NA NA
## 5 NA <NA> <NA> <NA> NA NA
## 6 NA <NA> <NA> <NA> NA NA
## 7 NA <NA> <NA> <NA> NA NA
## 8 NA <NA> <NA> <NA> NA NA
## 9 NA <NA> <NA> <NA> NA NA
## 10 NA <NA> <NA> <NA> NA NA
## 11 NA <NA> <NA> <NA> NA NA
## 12 NA <NA> <NA> <NA> NA NA
## 13 NA <NA> <NA> <NA> NA NA
## 14 NA <NA> <NA> <NA> NA NA
## 15 NA <NA> <NA> <NA> NA NA
## 16 NA <NA> <NA> <NA> NA NA
## 17 NA <NA> <NA> <NA> NA NA
## # … with 16 more variables: block <chr>,
## # state_at_stp <chr>, occurrence <chr>,
## # description <chr>, atomic_weight <dbl>,
## # aw_uncertainty <dbl>, any_stable_nuclides <chr>,
## # density <dbl>, density_predicted <lgl>,
## # melting_point <dbl>, mp_predicted <lgl>,
## # boiling_point <dbl>, bp_predicted <lgl>, …
You’ll notice that this returns back a massive table of NAs
, rather than just the two elements that exist in liquid state! Why?
Let’s examine the command periodic_table$state_at_stp=="Liquid"
more carefully.
periodic_table$state_at_stp=="Liquid"
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [10] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [19] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [28] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [46] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [55] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [64] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [82] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [91] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE FALSE FALSE FALSE NA NA NA NA NA
## [109] NA NA NA NA NA NA NA NA NA
## [118] NA
- Every time the element’s state is not
Liquid
, R returns False, instead of ignoring the row, which means you get a logical vector of all 118 elements. - This gets passed onto the subset function, which, rather than ignore any row that says FALSE, returns each row as entirely NAs.
Although this technically does the job you want, it is not a very elegant solution, which is why the %in%
operator is better.
periodic_table[periodic_table$state_at_stp %in% "Liquid", ]
## # A tibble: 2 × 22
## atomic_number symbol name name_origin group period block
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 35 Br Brom… the Greek … 17 4 p
## 2 80 Hg Merc… the New La… 12 6 d
## # … with 15 more variables: state_at_stp <chr>,
## # occurrence <chr>, description <chr>,
## # atomic_weight <dbl>, aw_uncertainty <dbl>,
## # any_stable_nuclides <chr>, density <dbl>,
## # density_predicted <lgl>, melting_point <dbl>,
## # mp_predicted <lgl>, boiling_point <dbl>,
## # bp_predicted <lgl>, heat_capacity <dbl>, …
Much better. Notice that, with this call, we didn’t specify any columns for R to filter out, so it left those blank.
Now we can think about how to subset an entire data frame using the same sort of creation of two logical vectors (one for rows and one for columns):
periodic_table[ (periodic_table$name %in% c("Hydrogen", "Oxygen") ),
c("atomic_weight", "state_at_stp")]
## # A tibble: 2 × 2
## atomic_weight state_at_stp
## <dbl> <chr>
## 1 1.01 Gas
## 2 16.0 Gas
The extra parentheses around periodic_table$name %in% c("Hydrogen", "Oxygen")
are a good habit to get into as they ensure everything before the comma is used to select specific rows matching that condition. For the columns, we can specify a vector of the column names to focus on only those variables. The resulting table here gives the atomic_weight
and state_at_stp
for "Hydrogen"
and then for "Oxygen"
.
There are many more complicated ways to subset a data frame and one can use the subset
function built into R, but in my experience, whenever you want to do anything more complicated than what we have done here, it is easier to use the filter
and select
functions in the dplyr
package.
3.5.2 Using Comparison Operators <
and >
These operators are self explanatory. The <
and >
can be used to select rows based on a numeric condition.
!=
is the not equals operator. This operator allows for both numeric and character types.
# Prints any row whose atomic number is less than 10.
periodic_table[periodic_table$atomic_number<10,]
## # A tibble: 9 × 22
## atomic_number symbol name name_origin group period block
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 1 H Hydr… composed o… 1 1 s
## 2 2 He Heli… the Greek … 18 1 s
## 3 3 Li Lith… the Greek … 1 2 s
## 4 4 Be Bery… beryl, a m… 2 2 s
## 5 5 B Boron borax, a m… 13 2 p
## 6 6 C Carb… the Latin … 14 2 p
## 7 7 N Nitr… the Greek … 15 2 p
## 8 8 O Oxyg… from the G… 16 2 p
## 9 9 F Fluo… the Latin … 17 2 p
## # … with 15 more variables: state_at_stp <chr>,
## # occurrence <chr>, description <chr>,
## # atomic_weight <dbl>, aw_uncertainty <dbl>,
## # any_stable_nuclides <chr>, density <dbl>,
## # density_predicted <lgl>, melting_point <dbl>,
## # mp_predicted <lgl>, boiling_point <dbl>,
## # bp_predicted <lgl>, heat_capacity <dbl>, …
#Prints any row whose state at standard room temperature is not Solid.
periodic_table[periodic_table$state_at_stp != "Solid",]
## # A tibble: 28 × 22
## atomic_number symbol name name_origin group period
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 1 H Hydrogen composed of … 1 1
## 2 2 He Helium the Greek he… 18 1
## 3 7 N Nitrogen the Greek ni… 15 2
## 4 8 O Oxygen from the Gre… 16 2
## 5 9 F Fluorine the Latin fl… 17 2
## 6 10 Ne Neon the Greek ne… 18 2
## 7 17 Cl Chlorine the Greek ch… 17 3
## 8 18 Ar Argon the Greek ar… 18 3
## 9 35 Br Bromine the Greek br… 17 4
## 10 36 Kr Krypton the Greek kr… 18 4
## # … with 18 more rows, and 16 more variables: block <chr>,
## # state_at_stp <chr>, occurrence <chr>,
## # description <chr>, atomic_weight <dbl>,
## # aw_uncertainty <dbl>, any_stable_nuclides <chr>,
## # density <dbl>, density_predicted <lgl>,
## # melting_point <dbl>, mp_predicted <lgl>,
## # boiling_point <dbl>, bp_predicted <lgl>, …
3.5.3 Combining logical operators with “or”, “and” conditions
Finally, the &
operator means and, and the |
operator means or. You can fund |
by pressing the shirt key and kitting the “reverse slash” key above the enter
key on your keyboard.
It allows you to combine multiple logical statements together, which may be useful if you have multiple criteria to work on.
#Prints any element which is Non-metal AND whose atomic number is less than 15
periodic_table[periodic_table$description != "Non-metal" & periodic_table$atomic_number <15,]
## # A tibble: 10 × 22
## atomic_number symbol name name_origin group period
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 2 He Helium the Greek h… 18 1
## 2 3 Li Lithium the Greek l… 1 2
## 3 4 Be Beryllium beryl, a mi… 2 2
## 4 5 B Boron borax, a mi… 13 2
## 5 9 F Fluorine the Latin f… 17 2
## 6 10 Ne Neon the Greek n… 18 2
## 7 11 Na Sodium the English… 1 3
## 8 12 Mg Magnesium Magnesia, a… 2 3
## 9 13 Al Aluminium from alumin… 13 3
## 10 14 Si Silicon from the La… 14 3
## # … with 16 more variables: block <chr>,
## # state_at_stp <chr>, occurrence <chr>,
## # description <chr>, atomic_weight <dbl>,
## # aw_uncertainty <dbl>, any_stable_nuclides <chr>,
## # density <dbl>, density_predicted <lgl>,
## # melting_point <dbl>, mp_predicted <lgl>,
## # boiling_point <dbl>, bp_predicted <lgl>, …
#Prints any element which is Non-metal OR whose atomic number is less than 15
periodic_table[periodic_table$description != "Non-metal" | periodic_table$atomic_number <15,]
## # A tibble: 115 × 22
## atomic_number symbol name name_origin group period
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 1 H Hydrogen composed of… 1 1
## 2 2 He Helium the Greek h… 18 1
## 3 3 Li Lithium the Greek l… 1 2
## 4 4 Be Beryllium beryl, a mi… 2 2
## 5 5 B Boron borax, a mi… 13 2
## 6 6 C Carbon the Latin c… 14 2
## 7 7 N Nitrogen the Greek n… 15 2
## 8 8 O Oxygen from the Gr… 16 2
## 9 9 F Fluorine the Latin f… 17 2
## 10 10 Ne Neon the Greek n… 18 2
## # … with 105 more rows, and 16 more variables:
## # block <chr>, state_at_stp <chr>, occurrence <chr>,
## # description <chr>, atomic_weight <dbl>,
## # aw_uncertainty <dbl>, any_stable_nuclides <chr>,
## # density <dbl>, density_predicted <lgl>,
## # melting_point <dbl>, mp_predicted <lgl>,
## # boiling_point <dbl>, bp_predicted <lgl>, …