Arnab Chakraborty's R tutorial: page 1 of 3

Site hosted by Angelfire.com: Build your free website today!

www.angelfire.com/dragon/letstry
arnab_et@hotmail.com

Tutorials:

R tutorial:

Introduction to R

R is a statistical software package. Just like Octave it can perform basic mathematical operations (e.g., matrix manipulation). Apart from these it can also do statistics to analyse/present large data sets.

Starting R in Unix

First open a terminal window. Then type the path of the R executible:

/usr/local/R-2.1.1/bin/R

and hit

. Depending on your machine and version you may need to change the exact path. Please consult your system administrator about this. You should see something like



R : Copyright 2005, The R Foundation for Statistical Computing
Version 2.1.1  (2005-06-20), ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for a HTML browser interface to help.
Type 'q()' to quit R.

>

The > is the R prompt. You have to type commands in front of this prompt and press the

key on your keyboard.

Starting R in Windows

If you have R installed in your Windows machine, you will possibly find an icon like the following on your desktop.

R icon

Double-click on it to get the following R console window:

R console window

The red > is the R prompt. You have to type commands in front of this prompt and press the

key on your keyboard.

Simple mathematics


x = 23.4 + 2.1

This stores the number 25.5 in a variable called x. Notice that R, unlike Octave, does not show the value of x automatically though you have not typed a semicolon. To see the value of a variable in R you have to type its name at the prompt and hit

Make a vector called v:


v = c(1,4,-5,5.6)

The c command concatenates (i.e., strings together) the numbers into a vector. Unlike Octave, R does not distinguish between row and column vectors. Vectors whose entries are in arithmetic progression can be easily created much as in Octave:


x = 1:5
y = 15:2
z = seq(2.3,5.7,0.3)

Can you explain the last line? [In these pages, words written in bold are special names that are commands in the R language.] R has more ways to create arithmetic progressions than Octave. Try the following to understand these:


x = seq(2,9,length.out=10)
y = seq(x)

To find the length of a vector x type


length(x)

R has no direct way to create an arbitrary matrix. You have to first list all the entries of the matrix as a single vector (an m by n matrix will need a vector of length mn) and then fold the vector into a matrix. To create

1 2

3 4

we first list the entries column by column to get

1, 3, 2, 4.

To create the matrix in R:


A = matrix(c(1,3,2,4),nrow=2)

The nrow=2 command tells R that the matrix has 2 rows (then R can compute the number of columns by dividing the length of the vector by nrow.) You could have also typed:


A = matrix(c(1,3,2,4),ncol=2)

to get the same effect. Notice that R folds a vector into a matrix column by column. Sometimes, however, we may need to fold row by row :


A = matrix(c(1,3,2,4),nrow=2,byrow=T)

The T stands for "true". Matrix operations in R are more or less as in Octave:


A = matrix(c(1,3,2,4),ncol=2)
B = matrix(2:7,nrow=2)
C = matrix(5:2,ncol=2)
A+C
A-C
A%*%C
A*C
A%*%B
t(B)

As we have said already, R does not distinguish between row vectors and column vectors. To see this type:


x = c(4,7)

Now both of the following commands are valid:


x%*%A
A%*%x

To see the (1,1)-th entry of a matrix A use


A[1,1]

The 2nd row can be referred to as


A[2,]

Exercise: How to refer to the second column of A?

R has one subtle difference from Octave. In R you can add vectors of unequal lengths, provided the length of the longer vector is a multiple of that of the shorter vector. In this case the shorter vector is "recycled" as shown in the example below:

1

1

1

1

+

1

2

=

1

1

1

1

+

1

2

1

2

Try this out:


c(1,1,1,1)+c(1,2)

Exercise: Try to add a short vector to a long one where the longer length is not a multiple of the shorter length.

Reading data from files

A statistician has to work with data sets. A data set is basically a table of numbers, e.g., an examination marks dataset may consist of the marks of 10000 students for 5 different subjects. Data sets are stored in files outside R. To access the data set from inside R we need to load it in R.

Example: Suppose that we have an examination dataset in a file called exam.dat. The file is already provided to you. It contains the marks of 5 students in 4 subjects. Read the data in a 5 by 4 matrix called marks:
marks = as.matrix(read.table("exam.dat"))
This is the common way to read tabular data from a file. First read the file as a table using the read.table command, then convert the table into a matrix by the as.matrix command. Type
x
to see the contents of x.

Graphical representation of data

Often data sets are large and it is very difficult to see any structure of the data just by staring at the numbers in the dataset. So one presents a large dataset in terms of pictures to bring out certain aspects of the data.

Histograms and pie charts

First we shall read an marks data set from the file marks.datthat contains the marks (out of 100) of 1000 students.


marks = as.matrix(read.table("marks.dat"))

There are 4 marks categories: fail (1 to 34), C grade (35 to 50), B grade (51 to 80), A grade (80 to 100.) We want to find out how many students belong to each category. The number of students in a category is called the frequency of that category.


breaks = c(0.5,34.5,50.5,80.5,100.5)
pf = hist(marks,breaks)

This creates a histogram. We use numbers ending with .5 as the breakpoints between categories to avoid confusion with marks exactly equal to a breakpoint. In a histogram a rectangle is drawn over each category. The larger the area of a rectangle the more students are in that category. The vertical axis of a histogram shows relative density, which is the total number of students in the category divided by the product of the total number of students and the width of the category. The variable pf stores the same information as the histogram. Sometimes laymen find the vertical axis hard to interpret. For them you may like to present a bar chart showing the frequencies in the vertical axis:


barplot(pf$counts)

To make it more attractive we can label the bars:


lab = c("Fail","C","B","A")
barplot(pf$counts,names=lab)

To present the information as a pie chart:


pie(pf$counts)

The pie chart has labels 1,2,3.... We can provide our own labels:


pie(pf$counts,lab)

Exercise: What information is missing from the pie chart that was present in the bar plot?

Scatter plots

Sometimes we want to see the relation of one variable (Y) with another variable (X) based on a dataset of the form

(X₁,Y₁),...,(X_n,Y_n).

Example: Suppose that we want to see if there is any relation between mathematics marks and physics marks. The marks of 50 students in these subjects are given in the file mathphys.dat. The file contains 50 rows of numbers, one row for each student. Each row has two numbers, the first number is the mathematics mark (X) and the second number is the physics mark (Y). First read the data:
data = as.matrix(read.table("mathphys.dat"))
Now plot the second column against the first:
plot(data[,2],data[,1])
The resulting plot is called a scatterplot. It consists of 50 points, one point for each student. Note how the points are spread approximately from a south-east to north-west direction. This tells us that a student getting high mark in mathematics is also likely to get a relatively high mark in physics.

Time series plots

A statistician sometimes needs to observe how some variable changes over time. For instance, the number of cars passing through a given junction in a city for each hour in a day will give important information for traffic control.

Example: The file profit.dat has data about the monthly profit of a shop over the period from Jan, 1999 to Dec, 2004. Let us read the dataset:
profit = as.matrix(read.table("profit.dat"))
Now plot it against time:
plot(profit,type="l")
The type="l" caused R to join the points with lines. However, the horizontal axis (the time axis) displays the times as 1,2,3... rather than as years. For this we need to use the ts command:
pr = ts(profit,start=c(1999,1),frequency=12) plot(pr)
The start=c(1999,1) command means the first observation is for Jan, 1999. The frequency=12 says that we are dealing with monthly data.
Notice that the graph is initially rising upwards and then going down. This means that the business was flourishing at first, but then has started to decline. Also there is a pattern that is repeating over the year. Can you explain the pattern? Such patterns are easier to detect from the plot than from the original dataset.

Summary statistics

In statistics we want to extract information from data. If a dataset is large then we usually start by looking at some summary features of the data, e.g., the average value. For instance, if we want to quickly compare the performance of students of Vidyamandir with those of Presidency College we would first look at the average results of the two colleges. There are different ways of computing averages. The most popular method is the (arithmetic) mean.


x = as.matrix(read.table("somedata.dat"))
mean(x)

Another method is median. It is defined as the middlemost value in the dataset:


median(x)

Exercise: Use the mean command suitably to compute the geometric mean and harmonic mean of x. Geometric mean of n numbers is defined as the n-th positive root of the product of the numbers. Harmonic mean is defined as the reciprocal of the arithmetic mean of the reciprocals of the numbers.

Example: In a dairy farm milk is bottled automatically by a machine. The machine is supposed to pour 500 ml milk in each bottle and seal it. However, no machine is perfect, and so the actual amount of milk is sometimes slightly more or less than 500 ml. The quality control officer needs to keep track of whether the machine is misbehaving too much. For this he has to ensure two things:

the average is 500 ml
the bottle-to-bottle variation is low.
To check the average the quality control officer can use mean. To check for the variation he can use variance of the data. A typical data set is in the file bottle.dat
milk = as.matrix(read.data("bottle.dat")) var(milk)
The higher the variation the larger is this number. The bottling machine has a maximum permissible value for this variance. If the sample variance exceeds this, then the quality control officer stops the machine and mechanics are sent for to repair it.

In some examinations students are given percentiles rather than percentages. If you get 90 out of 100 then your percentage is 90. However, this does not tell us how good a student you are compared to the other examinees. Some boards (e.g., central boards) are notorious for giving very marks to all the students, so that even a relatively bad performance can get 95 out of 100. On the other hand, Calcutta university is very stringent about marks, and a 90 percent there is really a great news. Thus percentages are not reliable when comparing students of different boards. For this one uses percentiles which gives the percent of students getting less than or equal to your mark. Thus, if there are 1000 students in total, among whom exactly 700 students have got marks less than or equal to your marks, then your percentile is 70 (or equivalently, your quantile is 0.70.)

The quantile command lets you compute the original marks given the percentile. For p between 0 and 1, the p-th quantile of a dataset is defined as a value such that p fraction of the numbers in the dataset are below that value.


x = 1:10
quantile(x,0.4)

Here p=0.4. Note that any number between 4 and 5 can be considered as the 0.4-th quantile for our x. However, R uses some special formula to arrive at 4.6. The formula is somewhat complicated and we shall not discuss it here.

Exercise: The file marks.dat stores the marks of 1000 students. Approximately find the marks obtained by the student whose percentile is 95. Why is the answer approximate?

Random sampling

Suppose that the principal of a college wants to make sure that the students of a certain class are studying properly. Since the class has 100 students and the principal does not have time to interview all of them, he decides to draw a small sample from the students, say some 10 students. How should he choose the 10 students? If just goes into the class one day and picks the first 10 he sees, then he is most likely to meet the regular students than the students who bunk classes. Similarly picking 10 students from among the last benchers will also not give a representative sample. To avoid any such bias the principal should draw a random sample, where each student is equally likely to be picked. One method to do this is to write the roll numbers of the students on 100 pieces of papers, then mix the papers well, and finally draw 10 papers one by one at random. R makes this a lot easier:


sample(1:100,10)

Note that a student that is picked once cannot be picked again. Such a random sample is called Simple Random Sample WithOut Replacement (SRSWOR).

Nowadays many casinos use computers to play games of chance. The sample command is a great tool for this. Tossing a coin, for instance, is basically drawing a simple random sample from "Head" and "Tail":


sample(c("Head","Tail"),1)

Rolling a (fair) die is equally easy:


sample(1:6,1)

Now let us roll the die 10 times. This is like picking 10 students from a class of 100 students, except that here the numbers can repeat. This is called Simple Random Sample With Replacement (SRSWR):


sample(1:6,10,replace=T)

What does the following lines do?


x = sample(1:6,100,replace=T)
y = hist(x,0.5+0:6)

Now make a pie chart:


pie(y$counts)

Exercise: Use the sample command to shuffle a deck of 52 cards. Treat the deck as a vector of length 52 consisting of the numbers 1 to 52. After shuffling you should get another vector of length 52 that contains the same numbers but randomly rearranged.

Estimation

A statistician has to often guess the the value of some unknown quantity based on data. This is called estimation, e.g., a statistician may try to estimate the number of tigers in Sunderban based on data collected about the tiger pug marks found in the area.

Here we shall consider a simple example of estimation: we want to estimate the number of fish in a lake. A popular method employed for this is called the capture-recapture method. Here we first capture 100 fish, say, from the lake. We resist the temptation to fry them. Rather, we keep them alive, tag them with some water-resistant colour and return them to the lake. Within a week or so these 100 tagged fish get mixed among the other fish of the lake. Then we recapture 200 fish, say, from the lake. Some of these will be tagged, the others will be untagged. If there are k tagged fish in the recapture and a total of N fish in the lake, then we estimate N using unitary method as follows:

k tagged fish are there among 200 fish.
1 tagged fish is there among every 200/k fish.
100 tagged fish are there among 20000/k fish.

N = 20000/k (approx.)

In general if our capture size is m and recapture size in n, then the estimated N is

mn/k.

Now we shall carry out the process using R. We shall first create our lake with 1000 fish in it. The fish are numbered from 1 to 1000. So all the fish in the lake is a vector:


fish = 1:1000

Next we capture 100 fish, i.e., draw a SRSWOR of size 100:


capture = sample(fish,100)

We tag the captured fish by making them negative:


fish[capture] = -fish[capture]

Now we draw the recapture sample:


recapture = sample(fish,200)

and count the number of tagged (i.e., negative) fish in the recapture:


k = sum(fish[recapture] < 0)

Then the estimated number of total fish in the lake is:


100*200/k