R is Not So Hard! A Tutorial, Part 20: Useful Commands for Exploring Data

When you’re learning a new statistical software package, the most frustrating part is often not knowing how to do very basic things. This is especially frustrating if you already know how to do them in some other software.

Let’s look at some basic but very useful commands that are available in R.

We will use the following data set of tourists from different nations, their gender, and their number of children. Copy and paste the following data frame into R.

```
A <- structure(list(NATION = structure(c(3L, 3L, 3L, 1L, 3L, 2L, 3L,
1L, 3L, 3L, 1L, 2L, 2L, 3L, 3L, 3L, 2L), .Label = c("CHINA",
"GERMANY", "FRANCE"), class = "factor"), GENDER = structure(c(1L,
2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L
), .Label = c("F", "M"), class = "factor"), CHILDREN = c(1L,
3L, 2L, 2L, 3L, 1L, 0L, 1L, 0L, 1L, 2L, 2L, 1L, 1L, 1L, 0L, 2L
)), .Names = c("NATION", "GENDER", "CHILDREN"), row.names = 2:18, class = "data.frame")
```

Want to check that R read the variables correctly? We can look at the first 3 rows using the `head()` command, as follows:

```
head(A, 3)
  NATION GENDER CHILDREN
2 FRANCE      F        1
3 FRANCE      M        3
4 FRANCE      M        2
```

Now we look at the last 4 rows using the `tail()` command:

```
tail(A, 4)
    NATION GENDER CHILDREN
15  FRANCE      F        1
16  FRANCE      M        1
17  FRANCE      F        0
18 GERMANY      F        2
```

Now we find the number of rows and number of columns using `nrow()` and `ncol()`.

```
nrow(A)
[1] 17

ncol(A)
[1] 3
```

So we have 17 rows (cases) and three columns (variables). These functions look very basic, but they turn out to be very useful if you want to write R-based software to analyse data sets of different dimensions.
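As a rough illustration of that last point, here is a minimal sketch of a function that adapts to whatever dimensions it is given. The helper name `describe_shape()` is hypothetical (it is not part of R), and the data frame is rebuilt compactly so that the block runs on its own.

```r
# Rebuild the tutorial's data frame A (same values, written compactly).
lv <- c("CHINA", "GERMANY", "FRANCE")
A <- data.frame(
  NATION   = factor(lv[c(3, 3, 3, 1, 3, 2, 3, 1, 3, 3, 1, 2, 2, 3, 3, 3, 2)], levels = lv),
  GENDER   = factor(c("F", "M")[c(1, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1)]),
  CHILDREN = c(1L, 3L, 2L, 2L, 3L, 1L, 0L, 1L, 0L, 1L, 2L, 2L, 1L, 1L, 1L, 0L, 2L),
  row.names = 2:18
)

# Hypothetical helper: report the shape and column classes of any data frame.
describe_shape <- function(df) {
  col_classes <- character(ncol(df))
  for (j in seq_len(ncol(df))) {        # seq_len() is safe even if ncol(df) is 0
    col_classes[j] <- class(df[[j]])[1]
  }
  names(col_classes) <- names(df)
  list(rows = nrow(df), cols = ncol(df), classes = col_classes)
}

shape <- describe_shape(A)
shape$rows      # 17
shape$cols      # 3
shape$classes   # class of each column, by name
```

Because the loop runs over `seq_len(ncol(df))` rather than a fixed range, the same function should work unchanged on a data frame with any number of columns.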

Now let’s attach A and check for the existence of particular data.

`attach(A)`

As you may know, attaching a data object makes it possible to refer to any variable by name, without having to specify the data object which contains that variable.
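If you would rather not attach a data object (for example, because your workspace already contains a variable with the same name), the same variables can be reached with `$` or with `with()`. The following is only a sketch of that alternative; the data frame is rebuilt compactly so that the block runs on its own.

```r
# Rebuild the tutorial's data frame A (same values, written compactly).
lv <- c("CHINA", "GERMANY", "FRANCE")
A <- data.frame(
  NATION   = factor(lv[c(3, 3, 3, 1, 3, 2, 3, 1, 3, 3, 1, 2, 2, 3, 3, 3, 2)], levels = lv),
  GENDER   = factor(c("F", "M")[c(1, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1)]),
  CHILDREN = c(1L, 3L, 2L, 2L, 3L, 1L, 0L, 1L, 0L, 1L, 2L, 2L, 1L, 1L, 1L, 0L, 2L),
  row.names = 2:18
)

# Refer to a variable through its data frame with `$` ...
nation_direct <- A$NATION

# ... or evaluate an expression "inside" the data frame with with().
usa_present <- with(A, any(NATION == "USA"))
usa_present   # FALSE
```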

Does the USA appear in the NATION variable? We use the `any()` command and put USA inside quotation marks.

```
any(NATION == "USA")
[1] FALSE
```

Clearly, we do not have any data pertaining to the USA.

What are the values of the variable NATION?

```
levels(NATION)
[1] "CHINA"   "GERMANY" "FRANCE"
```

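How many observations do we have in the variable NATION? Note that `length()` counts every element, including any missing values. Since this data set has no missing values, the result is also the number of non-missing observations.

```
length(NATION)
[1] 17
```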
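To see the difference between a total count and a non-missing count, consider a small made-up vector (the values below are hypothetical and not part of the tutorial data):

```r
# Hypothetical vector with two missing values.
x <- c("FRANCE", NA, "CHINA", "FRANCE", NA)

n_total   <- length(x)        # counts every element, NA included
n_present <- sum(!is.na(x))   # counts only the non-missing elements

n_total     # 5
n_present   # 3
```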

OK, but how many different values of NATION do we have?

```
length(levels(NATION))
[1] 3
```
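One caveat worth knowing: `levels()` lists the categories a factor is allowed to take, whether or not each one actually occurs. The sketch below, which rebuilds the data frame so that it runs on its own, drops the German tourists (an arbitrary choice made purely for illustration) and compares three ways of counting.

```r
# Rebuild the tutorial's data frame A (same values, written compactly).
lv <- c("CHINA", "GERMANY", "FRANCE")
A <- data.frame(
  NATION   = factor(lv[c(3, 3, 3, 1, 3, 2, 3, 1, 3, 3, 1, 2, 2, 3, 3, 3, 2)], levels = lv),
  GENDER   = factor(c("F", "M")[c(1, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1)]),
  CHILDREN = c(1L, 3L, 2L, 2L, 3L, 1L, 0L, 1L, 0L, 1L, 2L, 2L, 1L, 1L, 1L, 0L, 2L),
  row.names = 2:18
)

n_levels_full <- nlevels(A$NATION)          # shorthand for length(levels(A$NATION))

# Keep only the rows that are not GERMANY.
no_germany <- A[A$NATION != "GERMANY", ]

n_levels_kept    <- nlevels(no_germany$NATION)              # 3: unused level is retained
n_values_seen    <- length(unique(no_germany$NATION))       # 2: values actually present
n_levels_dropped <- nlevels(droplevels(no_germany$NATION))  # 2: after removing unused levels
```

For the full data set all three approaches agree, so `length(levels(NATION))` is fine here; `length(unique())` is the safer choice after subsetting.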

We have three different values.

Do we have tourists with more than three children? We use the `any()` command to find out.

```
any(CHILDREN > 3)
[1] FALSE
```

None of the tourists in this data set have more than three children.
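The same question can be approached from a few other angles. The following sketch uses `all()`, `max()` and `sum()` alongside `any()`; the `CHILDREN` values are copied from the data set above so that the block runs on its own.

```r
# CHILDREN values copied from the tutorial's data set.
CHILDREN <- c(1L, 3L, 2L, 2L, 3L, 1L, 0L, 1L, 0L, 1L, 2L, 2L, 1L, 1L, 1L, 0L, 2L)

any_large    <- any(CHILDREN > 3)    # FALSE: nobody has more than three children
all_small    <- all(CHILDREN <= 3)   # TRUE: the complementary check
max_children <- max(CHILDREN)        # 3: the largest value observed
n_with_three <- sum(CHILDREN == 3)   # 2: TRUE counts as 1 when summed
```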

Do we have any missing data in this data set?

In R, missing data is indicated in the data set with NA.

```
any(is.na(A))
[1] FALSE
```

We have no missing data here.
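When a data set does contain missing values, it is often helpful to know where they are. As a sketch, the block below makes a copy of the data (called `B`, a name chosen here for illustration), deliberately sets one value to NA, and then locates it with `colSums()` and `complete.cases()`.

```r
# Rebuild the tutorial's data frame A (same values, written compactly).
lv <- c("CHINA", "GERMANY", "FRANCE")
A <- data.frame(
  NATION   = factor(lv[c(3, 3, 3, 1, 3, 2, 3, 1, 3, 3, 1, 2, 2, 3, 3, 3, 2)], levels = lv),
  GENDER   = factor(c("F", "M")[c(1, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1)]),
  CHILDREN = c(1L, 3L, 2L, 2L, 3L, 1L, 0L, 1L, 0L, 1L, 2L, 2L, 1L, 1L, 1L, 0L, 2L),
  row.names = 2:18
)

# Hypothetical copy with one missing value introduced for illustration.
B <- A
B$CHILDREN[2] <- NA

has_missing    <- any(is.na(B))              # TRUE
missing_by_col <- colSums(is.na(B))          # number of NAs in each column
n_complete     <- sum(complete.cases(B))     # rows with no NA at all
incomplete_pos <- which(!complete.cases(B))  # positions of rows containing an NA
```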

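Which observations involve FRANCE? We use the `which()` command to identify the relevant indices. Applied to a whole data frame, the comparison produces a logical matrix, and `which()` numbers its cells column by column. Because NATION is the first column, the indices returned here coincide with row positions.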

```
which(A == "FRANCE")
[1]  1  2  3  5  7  9 10 14 15 16
```
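Two details are easy to miss here. First, the indices are positions (1 to 17), not the row labels of A (which run from 2 to 18). Second, column-wise counting only lines up with row positions because NATION is the first column. The sketch below shows two ways to make the result less dependent on column order: restricting the comparison to one column, and asking `which()` for row and column positions with `arr.ind = TRUE`.

```r
# Rebuild the tutorial's data frame A (same values, written compactly).
lv <- c("CHINA", "GERMANY", "FRANCE")
A <- data.frame(
  NATION   = factor(lv[c(3, 3, 3, 1, 3, 2, 3, 1, 3, 3, 1, 2, 2, 3, 3, 3, 2)], levels = lv),
  GENDER   = factor(c("F", "M")[c(1, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1)]),
  CHILDREN = c(1L, 3L, 2L, 2L, 3L, 1L, 0L, 1L, 0L, 1L, 2L, 2L, 1L, 1L, 1L, 0L, 2L),
  row.names = 2:18
)

# Compare a single column: the result is independent of column order.
idx_col <- which(A$NATION == "FRANCE")

# Ask for (row, col) pairs instead of column-wise cell numbers.
pos <- which(A == "FRANCE", arr.ind = TRUE)

# Translate positions into the data frame's own row labels.
france_labels <- rownames(A)[idx_col]
```

Here `pos` has one line per match, with a `row` and a `col` entry, and every match falls in column 1 (NATION).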

How many observations involve FRANCE? We wrap the above syntax inside the `length()` command to perform this calculation.

```
length(which(A == "FRANCE"))
[1] 10
```

We have a total of ten such observations.
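As a possible shortcut, logical values can be summed directly (TRUE counts as 1), and `table()` gives the count for every nation at once. A brief sketch, with the NATION variable rebuilt from the data above so that the block runs on its own:

```r
# NATION values copied from the tutorial's data set.
lv <- c("CHINA", "GERMANY", "FRANCE")
NATION <- factor(lv[c(3, 3, 3, 1, 3, 2, 3, 1, 3, 3, 1, 2, 2, 3, 3, 3, 2)], levels = lv)

n_france <- sum(NATION == "FRANCE")   # 10, same as length(which(NATION == "FRANCE"))
counts   <- table(NATION)             # one count per level, in level order
counts
```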

That wasn’t so hard! In our next post we will look at further analytic techniques in R.

About the Author: David Lillis has taught R to many researchers and statisticians. His company, Sigma Statistics and Research Limited, provides both on-line instruction and face-to-face workshops on R, and coding services in R. David holds a doctorate in applied statistics.
