|
www.angelfire.com/dragon/letstry
arnab_et@hotmail.com | |
Tutorials:
R tutorial:
Introduction to R
R is a statistical software package. Just like Octave it can perform basic
mathematical operations (e.g., matrix manipulation). Apart from these it
can also do statistics to analyse/present large data sets.
Starting R in Unix
First open a terminal window. Then type the path of the R executible:
/usr/local/R-2.1.1/bin/R
and hit . Depending on your machine and version you may need to
change the exact path. Please consult your system administrator about
this. You should see something like
R : Copyright 2005, The R Foundation for Statistical Computing
Version 2.1.1 (2005-06-20), ISBN 3-900051-07-0
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for a HTML browser interface to help.
Type 'q()' to quit R.
>
The >
is the R prompt.
You have to type commands in
front of this prompt and press the key on your
keyboard.
Starting R in Windows
If you have R installed in your Windows machine, you will possibly find an
icon like the following on your desktop.
|
R icon |
Double-click on it to get the following R console window:
|
R console window |
The red >
is the R prompt.
You have to type commands in
front of this prompt and press the key on your
keyboard.
Simple mathematics
x = 23.4 + 2.1
This stores the number 25.5 in a variable called x . Notice
that R, unlike Octave, does not show the value of x automatically
though you have not typed a semicolon. To see the value of a variable in R
you have to type its name at the prompt and hit :
x
Make a vector called v :
v = c(1,4,-5,5.6)
The c command concatenates (i.e., strings together)
the numbers into a vector. Unlike Octave, R does not distinguish between
row and column vectors. Vectors whose entries are in arithmetic
progression can be easily created much as in Octave:
x = 1:5
y = 15:2
z = seq(2.3,5.7,0.3)
Can you explain the last line? [In these pages, words written in
bold are special names that are commands in the R language.]
R has more ways to create arithmetic
progressions than Octave. Try the following to understand these:
x = seq(2,9,length.out=10)
y = seq(x)
To find the length of a vector x type
length(x)
R has no direct way to create an arbitrary matrix. You have to first list all the
entries of the matrix as a single vector (an m by n matrix
will need a vector of length mn) and then fold the vector into a
matrix. To create
we first list the entries column by column to get
1, 3, 2, 4.
To create the matrix in R:
A = matrix(c(1,3,2,4),nrow=2)
The nrow=2 command tells R that the matrix has 2 rows (then R
can compute the number of columns by dividing the length of the vector by
nrow .) You could have also typed:
A = matrix(c(1,3,2,4),ncol=2)
to get the same effect. Notice that R folds a vector into a matrix
column by column. Sometimes, however, we may need to fold row by row
:
A = matrix(c(1,3,2,4),nrow=2,byrow=T)
The T stands for "true".
Matrix operations in R are more or less as in Octave:
A = matrix(c(1,3,2,4),ncol=2)
B = matrix(2:7,nrow=2)
C = matrix(5:2,ncol=2)
A+C
A-C
A%*%C
A*C
A%*%B
t(B)
As we have said already, R does not distinguish between row vectors and
column vectors. To see this type:
x = c(4,7)
Now both of the following commands are valid:
x%*%A
A%*%x
To see the (1,1)-th entry of a matrix A use
A[1,1]
The 2nd row can be referred to as
A[2,]
| Exercise:
How to refer to the second column of A ?
|
R has one subtle difference from Octave. In R you can add vectors of
unequal lengths, provided the length of the longer vector is a multiple of
that of the shorter vector. In this case the shorter vector is "recycled"
as shown in the example below:
Try this out:
c(1,1,1,1)+c(1,2)
| Exercise:
Try to add a short vector to a long one where the longer length is not a
multiple of the shorter length.
|
Reading data from files
A statistician has to work with data sets. A data set is basically a table of
numbers, e.g., an examination marks dataset may consist of the marks of
10000 students for 5 different subjects. Data sets are
stored in files outside R. To access the data set from inside R we need to
load it in R.
| Example:
Suppose that we have an examination dataset in a file called
exam.dat. The file is already provided to you. It contains
the marks of 5 students in 4 subjects. Read the data in a 5 by 4 matrix
called marks :
marks = as.matrix(read.table("exam.dat"))
This is the common way to read tabular data from a file. First read the
file as a table using the read.table command, then convert
the table into a matrix by the as.matrix command. Type
x
to see the contents of x .
|
Graphical representation of data
Often data sets are large and it is very difficult to see any structure of
the data just by staring at the numbers in the dataset. So one presents
a large dataset in terms of pictures to bring out certain aspects of the
data.
Histograms and pie charts
First we shall read an marks data set from the file marks.datthat contains the marks (out of 100) of
1000 students.
marks = as.matrix(read.table("marks.dat"))
There are 4 marks categories: fail (1 to 34), C grade (35 to 50), B grade
(51 to 80), A grade (80 to 100.)
We want to find out how many students belong to each category. The number
of students in a category is called the frequency of that category.
breaks = c(0.5,34.5,50.5,80.5,100.5)
pf = hist(marks,breaks)
This creates a histogram. We use numbers ending with .5 as the
breakpoints between categories to avoid confusion with marks exactly equal
to a breakpoint. In a histogram a rectangle is drawn over
each category. The larger the area of a rectangle the more students are in
that category. The vertical axis of a histogram shows relative
density, which is the total number of students in the category divided
by the product of the total number of students and the width of the
category. The variable pf stores the same information as the
histogram.
Sometimes laymen find the vertical axis hard to interpret. For
them you may like to present a bar chart showing the frequencies in the
vertical axis:
barplot(pf$counts)
To make it more attractive we can label the bars:
lab = c("Fail","C","B","A")
barplot(pf$counts,names=lab)
To present the information as a pie chart:
pie(pf$counts)
The pie chart has labels 1,2,3.... We can provide our own labels:
pie(pf$counts,lab)
| Exercise:
What information is missing from the pie chart that was present in the
bar plot?
|
Scatter plots
Sometimes we want to see the relation of one variable (Y) with
another
variable (X) based on a dataset of the form
(X1,Y1),...,(Xn,Yn).
| Example:
Suppose that we want to see if there is any relation between mathematics
marks and physics marks. The marks of 50 students in these subjects are
given in the file mathphys.dat. The file
contains 50 rows of
numbers, one row for each student. Each row has two numbers, the first
number is the mathematics mark (X) and the second number is the
physics mark (Y). First read the data:
data = as.matrix(read.table("mathphys.dat"))
Now plot the second column against the first:
plot(data[,2],data[,1])
The resulting plot is called a scatterplot. It consists of 50
points, one point for each student. Note how the points are spread
approximately from a south-east to north-west direction. This tells us
that a student getting high mark in mathematics is also likely to get a
relatively high mark in physics.
|
Time series plots
A statistician sometimes needs to observe how some variable changes over
time. For instance, the number of cars passing through a given junction in
a city for each hour in a day will give important information for traffic
control.
| Example:
The file profit.dat has data about the monthly profit of a
shop over the period from Jan, 1999 to Dec, 2004. Let us read the dataset:
profit = as.matrix(read.table("profit.dat"))
Now plot it against time:
plot(profit,type="l")
The type="l" caused R to join the points with lines.
However, the horizontal axis (the time axis) displays the times as
1,2,3...
rather than as years. For this we need to use the
ts command:
pr = ts(profit,start=c(1999,1),frequency=12)
plot(pr)
The start=c(1999,1) command means the first observation is
for Jan, 1999. The frequency=12 says that we are dealing with
monthly data.
Notice that the graph is initially rising upwards and then going down. This
means that the business was flourishing at first, but then has started to
decline.
Also
there is a pattern that is repeating over the year. Can you explain the
pattern? Such patterns are easier to detect from the plot than from the
original dataset.
|
Summary statistics
In statistics we want to extract information from data. If a dataset is
large then we usually start by looking at some summary features of the
data, e.g., the average value. For instance, if we want to quickly compare
the performance of students of Vidyamandir with those of Presidency
College we would first look at the average results of the two colleges.
There are different ways of computing averages. The most popular method is
the (arithmetic) mean.
x = as.matrix(read.table("somedata.dat"))
mean(x)
Another method is median. It is defined as the middlemost value in the
dataset:
median(x)
| Exercise:
Use the mean command suitably to compute the geometric mean
and harmonic mean of x . Geometric mean of n numbers is
defined as the
n-th positive root of the product of the numbers. Harmonic mean is
defined as the reciprocal of the arithmetic mean of the reciprocals of the
numbers.
|
| Example:
In a dairy farm milk is bottled automatically by a
machine. The machine is supposed to pour 500 ml milk in each bottle and
seal it. However, no machine is perfect, and so the actual amount of milk
is sometimes slightly more or less than 500 ml. The quality control
officer needs to keep track of whether the machine is misbehaving too
much. For this he has to ensure two things:
- the average is 500 ml
- the bottle-to-bottle variation is low.
To check the average the quality control officer can use mean. To check
for the variation he can use variance of the data. A typical data
set is in the file bottle.dat
milk = as.matrix(read.data("bottle.dat"))
var(milk)
The higher the variation the larger is this number. The bottling machine
has
a maximum permissible value for this variance. If the sample variance
exceeds this, then the quality control officer stops the machine and
mechanics are sent for to repair it.
|
In some examinations students are given percentiles
rather than percentages. If you get 90 out of 100
then your percentage is 90. However, this does not tell us how good a
student you are compared to the other examinees. Some boards (e.g.,
central boards) are notorious for giving very marks to all the students,
so that even a relatively bad performance can get 95 out of 100. On the
other hand, Calcutta university is very stringent about marks, and a 90
percent there
is really a great news. Thus percentages are not reliable when comparing
students of different boards. For this one uses percentiles which gives
the percent of students getting less than or equal to your mark. Thus, if
there are 1000 students in total, among whom exactly 700 students have got
marks less than or equal to your marks, then your percentile is 70 (or
equivalently, your quantile is 0.70.)
The quantile command lets you compute the original marks given
the percentile. For p between 0 and 1, the p-th quantile of
a dataset is defined as a value such that p fraction of the
numbers in the dataset are below that value.
x = 1:10
quantile(x,0.4)
Here p=0.4.
Note that any number between 4 and 5 can be considered as the
0.4-th quantile for our x. However, R uses some special
formula to arrive at 4.6. The formula is somewhat complicated and we shall
not discuss it here.
| Exercise:
The file marks.dat stores the marks of 1000
students. Approximately find the marks obtained by the student whose
percentile is 95. Why is the answer approximate?
|
Random sampling
Suppose that the principal of a college wants to make sure that the
students of a certain class are studying properly. Since the class has 100
students and the principal does not have time to interview all of them, he
decides to draw a small sample from the students, say some 10 students.
How should he choose the 10 students? If just goes into the class one day
and picks the first 10 he sees, then he is most likely to meet the regular
students than the students who bunk classes. Similarly picking 10 students
from among the last benchers will also not give a representative
sample. To avoid any such bias the principal should draw a random sample, where
each student is equally likely to be picked. One method to do this is to
write the roll numbers of the students on 100 pieces of papers, then mix the
papers well, and finally draw 10 papers one by one at random. R makes this
a lot easier:
sample(1:100,10)
Note that a student that is picked once cannot be picked again. Such a
random sample is called Simple Random Sample WithOut Replacement
(SRSWOR).
Nowadays many casinos use computers to play games of chance. The
sample command is a great tool for this. Tossing a coin, for
instance, is basically drawing a simple random sample from "Head" and
"Tail":
sample(c("Head","Tail"),1)
Rolling a (fair) die is equally easy:
sample(1:6,1)
Now let us roll the die 10 times. This is like picking 10 students from a
class of 100 students, except that here the numbers can repeat. This is
called Simple Random Sample With Replacement
(SRSWR):
sample(1:6,10,replace=T)
What does the following lines do?
x = sample(1:6,100,replace=T)
y = hist(x,0.5+0:6)
Now make a pie chart:
pie(y$counts)
| Exercise:
Use the sample command to shuffle a deck of 52 cards. Treat
the deck as a vector of length 52 consisting of the numbers 1 to 52. After
shuffling you should get another vector of length 52 that contains the
same numbers but randomly rearranged.
|
Estimation
A statistician has to often guess the the value of some unknown quantity
based on data. This is called estimation, e.g., a statistician may
try to estimate the number of tigers in Sunderban based on data collected
about the tiger pug marks found in the area.
Here we shall consider a simple example of estimation: we want to
estimate the number of fish in a lake. A popular method employed for
this is called the capture-recapture method. Here we first capture
100 fish, say, from the lake. We resist the temptation to fry
them. Rather, we keep them alive, tag them with some water-resistant
colour and return them to the lake. Within a week or so these 100 tagged
fish get mixed among the other fish of the lake. Then we recapture 200
fish, say, from the lake. Some of these will be tagged, the others will
be untagged. If there are k tagged fish in the recapture and
a total of N fish in the lake, then we estimate N using
unitary method as follows:
k tagged fish are there among 200 fish.
1 tagged fish is there among every 200/k fish.
100 tagged fish are there among 20000/k fish.
So
N = 20000/k (approx.)
In general if our
capture size is m and recapture size in n, then the
estimated N is
mn/k.
Now we shall carry out the process using R. We shall first create our lake
with 1000 fish in it. The fish are numbered from 1 to 1000. So all the
fish in the lake is a vector:
fish = 1:1000
Next we capture 100 fish, i.e., draw a SRSWOR of size 100:
capture = sample(fish,100)
We tag the captured fish by making them negative:
fish[capture] = -fish[capture]
Now we draw the recapture sample:
recapture = sample(fish,200)
and count the number of tagged (i.e., negative) fish in the recapture:
k = sum(fish[recapture] < 0)
Then the estimated number of total fish in the lake is:
100*200/k
|