Published May 17, 2023, 1:20 a.m. by Naomi Charles
Are you looking for a comprehensive R programming course? Then you've come to the right place! Simplilearn's R programming tutorial will help you learn the basics of programming in R and start using it for statistical analysis, data visualization, and predictive modeling.
R is a powerful programming language that is widely used in statistical analysis, data visualization, and predictive modeling. It is a popular language among data scientists and statisticians.
If you're just getting started with R, our R programming tutorial for beginners is the perfect place to start. This course will teach you the basics of programming in R, including how to install R and RStudio, import data, perform basic statistical analysis, and create data visualizations.
Once you've completed the R programming tutorial for beginners, you can move on to our other R courses to learn more advanced topics. We offer a wide range of R courses, from beginner to expert level.
No matter what your level of experience is, we have an R course that's right for you. So what are you waiting for? Start learning R today!
r has become the language for
statistical computing and graphics it is
one of the most popular analytic tools
r was written by robert
gentleman and ross ihaka at the
university of auckland new zealand
r is a free and open source software
that is commonly used to solve
statistics time series classification
clustering and other data science tasks
it is also widely preferred for data
visualization because it has a
collection of great packages the
availability of r packages makes it
stand apart from other
programming languages
by learning r you can become a data
scientist statistician data analyst r
programmer or a business analyst in
sectors such as health care e-commerce
retail banking and finance
with more and more companies focusing on
generating insights from data a
significant growth has been noticed in
r programming over the years some of
the top companies using r include google
amazon twitter ibm oracle and firefox
so r is constantly evolving and keeping
itself ahead of the curve r's vast
community ensures that r does not get
outdated or old school as they keep
adding new functionalities and updates
with that let's have a look at the
agenda for our r programming course for
2022
first we will look into variables and
data types
then we will move on to logical
operators
following which we will look into vector
matrix list and data frames then we will
look into functions and flow control
statements
followed by dplyr and tidyr for data
manipulation next we will look into
the ggplot2 library for data visualization
and finally have a look at time series
in r
so let's get started
let's see what r programming is
and how it helps
so r
is
well known as a language of data science
now if you really look at the ranking
from survey of data mining experts
based on the software they have often
used in their work
r is used more than python when it comes
to data science python is also used
however r is
predominantly more used for data science
kind of activities it's an open source
programming language used for
statistical computing it is one of the
most popular programming languages today
it was inspired by s plus
and it is similar to s programming
language so when it comes to data
science
what we can say is r is
a popularly used programming language
across the globe
it is free and open source as i
mentioned it is optimized for vector
operations which we will learn about
later
it has an amazing
community
and it has
in fact 9,000 plus
contributed or community packages
allowing us to do
almost anything or everything using r
now when we talk about features of r
as i said it's open source programming
language so you can install r for free
and you can straight away start working
you wouldn't have to really go for a
licensed version or pay for the software
non-coders can also understand and
perform programming in r as it is easy
to understand
and it has various data structures and
operators it can be integrated with
other programming languages like c c++
java and python
it consists of various inbuilt packages
a lot of sample data sets which can be
used
and that makes
reporting the results of an analysis
easier by using r
now before we start learning about
variables loops how you work with r and
so on it would be good to know how you
can set up r and work on r so for that
what you can do is you can
just go to r-project.org
and
once we get to the home page
of the r project for statistical computing
using this link
we can click on download r here
now that brings you to a page to
download it now there are various links
here so it shows you the comprehensive r
archive network that is cran mirrors
and it is available at different urls
however i would choose the first one
which is 0-cloud you can just click
on this one and then based on your
operating system whether you are working
on a linux machine on a macbook or
windows you can install it so you can
just click on this one as of now i'm
using a windows machine so i can click
on download r for windows and that takes
me to this link which says binaries for
base distribution now this is what we
can use to work with r straight away
however there is one more tool that
is rstudio we will see how we can set up
that now this one takes us to the best
mirror possible
for our location from where we can
download r
so you can click on this base and then
you can download by clicking on this
link i have already downloaded this so
once you click on this one you can just
save it so i have it here already in my
downloads and that's more than enough
then you can just double click and you
can
go through the instructions
to set up r that would also allow you to
basically
set up a desktop shortcut which i have
already done here on my machine and if i
go in here i see r base
you can click on this one and that
brings you to the page which you can use
to straight away start working with r
now
yes there is one more tool called
rstudio
which is set up on top of base r which
makes working with r easier now here
also you can start working so it shows
you the r console and you can click on
file and if you have some scripts or
files already written in the format of r
you can use those so i can click on open
script and that takes me to a page where
i have some files which are already
existing i can just select this one and
click on open
and that shows me some options here so i
have an editor which shows me say if i
want to get a library to use built-in
data sets i could summarize the data i
could do a clean up and we'll see all of
this but i would suggest using rstudio
rather than just using base r however
installing base r would be required and
depending on your machine configuration
like mine is a 64 bit i have chosen 64
bit while i was setting up base r
now when it comes to rstudio
it is basically an
ide which makes working with r
easier
so to install rstudio what you can do
is you can go to the rstudio home page
or you can just go to google and say
type r studio download
and then it takes you to this page you
can click on this which says download
rstudio you can choose your version you
can go for the free version that is r
studio desktop and you can click on this
download and then you can download
rstudio for windows which i have already
done and then you have to run through
the steps so just click on this one and
i already have r studio here
right now i can just basically use that
so for example if i go to downloads and
if i look for r studio if i do a double
click i can say yes
and then it takes me to the r studio
setup just click on next
and here
you can choose the location if you would
want to place it in a specific location
click on next and then it says select
the start menu folder so let rstudio
be chosen here click on install
and then it will basically start
installing this in a particular location
now in my case it is already existing
right so
we can even click on show details and
see what it is doing what packages or
what executables it is extracting now
once this is done then you will be able
to use our studio you can also add a
shortcut to your taskbar
and you can continue using it so i've
already done this this might take couple
of seconds just wait for this to
complete
and you would have r studio
which is an easier way of working with r
so a lot of developers across the globe
would be using rstudio when they are
working with r to work on their data
science or programming requirements
now let's just wait it is almost done
and now i can click on finish
so so that part is done you can add it
as a shortcut so rstudio has consistent
commands
it has a unified interface it makes it easy
to navigate and manage your work in r
and it is set up on top of
base r now if i click and open this
so
that's my r studio which is coming up
now
here you see console which
will show you the result where you can
give your commands
so where we can get text output now
again i can choose a file so i can just
say open file and then i can go into a
particular location where i have
downloaded some data
and then basically i can choose say for
example rstudio
and that brings me here so now you have
your script which has some commands
right on the left
bottom you have console where you can
see the output
on the right side you also have
environment
now that is to use or provide variables
and then we can also have plots which
we can see here now we can look at this
as an example so here i am
loading the built-in data sets so what i
can just do is i can place my cursor
here and i can just do a control enter
and that basically loads the built-in
data sets which we can see here that has
been done now there is an inbuilt iris
data set
and we can just use head option to look
at the first six lines of iris data set
so
just place your cursor and do a control
enter and that shows you a summary
basically the first six lines of this
data set what it contains we will look
into this data set later this is a
default data set
which you can easily find when you are
working with r you can also have your
cursor place on summary and then just do
a control enter so that basically shows
you summary statistics for iris data
you can do a plot
and that basically shows you the plot
which you can also maximize and look at
it in full screen you can just do a zoom
if you are interested in looking into
this and we will discuss how
or what kind of information we can infer
from the plots now when it comes to
cleaning up you can just do detach and
then we can say package data sets
and here we had loaded those data sets
so we are just doing a detach and we can
say unload equals true so i'll just do a
ctrl enter i can also clear off the
plots by doing this
for whatever plots we had and we can
either go to edit and then we can do a
clear console from here or the shortcut
is ctrl l and you can clear the
console
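The steps just described can be sketched roughly as follows. This is a minimal sketch rather than the exact script from the video: it loads the built-in data sets, looks at the first rows of iris, and prints summary statistics. The variable names first_rows and iris_summary are made up for illustration, and the plot call is left commented out so the sketch also runs where no plot window is available.

```r
# Load the built-in data sets that ship with R.
library(datasets)

# head() returns the first six rows of a data frame by default.
first_rows <- head(iris)
print(first_rows)

# summary() gives summary statistics for every column.
iris_summary <- summary(iris)
print(iris_summary)

# plot(iris) would draw a scatterplot matrix in the Plots pane of RStudio.
# plot(iris)
```

In RStudio you would place the cursor on each line and press Ctrl+Enter to run it.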
so that's a simple way of starting
your working with r by installing r
studio
so let's continue learning
about working with r
and basically the first thing which we
should learn here is about variables in
r
so a variable
as in any programming language is a way
to store
a data value
a vector or list of values or a data set or
object in r
it allows us to conveniently reference
the variable name
basically saving us from rewriting
the data value or object many times in
our program so when we talk about
variables in r
they are mainly used to store
data with named locations that your
programs can manipulate
a variable can be a combination of
letters digits period and underscore
so you can have some valid variables as
total sum
you can also have dot notation so there
are different naming or style
conventions in r and we can use dot to
separate names in description of a
variable we can also start a variable
with dot
we can include numbers in a variable and
remember r is case sensitive so we have
to whenever we declare a variable we
need to remember
what case was used
as in
in the name of the variable and there
can be other conventions also such as
using an underscore or even using a case
in between the variables so variable names
can only consist of letters numbers
periods
and underscores
and a leading dot must be followed by a letter not a
number
and we can declare our variables we can
also look at the type of the variables
and the class to which it belongs so
there are some invalid variables which
we are seeing here so that also needs to
be remembered so this is an example
where you can use an assignment operator
which you see here between x and 10 to
assign a value to a variable you could
also do that by doing a dot y
and then assign a value you could be
doing that by using a z
and then having a
computation done between x and y and
finally you could do a print so let's
see some example here before we move
further and for that i can bring up my r
studio here so as i said we can
basically have different
kind of variables or naming conventions
for example i could do something like
model one
and then i can basically assign this so
this is just
a
variable and i could be assigning
anything to it i could be assigning
different data types which are available
here for example i could do something
like this and i could do a control enter
so
that's my variable i can always do a
typeof
and then basically
i can check what's the type of my
variable so it tells me it's a character
i can also
do a class
and then i can basically say show me the
class and that shows me it belongs to
the character class we'll learn about
data types later but we are using
assignment operator now if i say what is
model 1 it shows me the value but if i
would do something like this
then it says object model not found and
why because it is case sensitive the
variable which we had created was all in
lower case
and the one which we tried to call was
starting with an upper case
so you could have variables created in
such way i could also do something like
hello
underscore string
and this could be my variable where we
are using an underscore
and then we can just give it something
here
and that becomes my variable which you
can always call
and check what is the value of that you
could also be doing something like this
so you could
be using
different cases and then i could say
something like this and that's also my
variable
and then i can basically look at the
value of this variable now
if we try to create a variable where we
start the variable name with the number
what would happen so if i say something
like this
and then if i try to assign a value to
it
for example let's say 100 now this one
will throw an error message because you
cannot have
your variable starting with a number but
if i used period and then basically
give
something like this
and let's try doing this by giving it a
number
so
if you see here
since we gave a period
the rule is that it should be followed
always by
a
letter and not a number so i could just
remove this and that works perfectly
fine
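The naming rules walked through above can be sketched as below. The values assigned are hypothetical, and the helper is_valid_name is an illustrative assumption (it is not from the video); it relies on base R's make.names(), which leaves a name unchanged only when it is already syntactically valid. The invalid assignments are commented out because they are syntax errors.

```r
# Valid names: letters, digits, periods and underscores.
model1 <- "linear"        # hypothetical value
hello_string <- "hello"
.pairs <- 5               # a leading dot followed by a letter is fine

# R is case sensitive: model1 and Model1 are different names.
exists("model1")   # TRUE
exists("Model1")   # FALSE, no such object was created

# Invalid names, left commented out because they will not parse:
# 2x <- 100         # cannot start with a digit
# .2x <- 100        # a leading dot cannot be followed by a digit
# first num <- 1    # spaces are not allowed

# Hypothetical helper: a name is valid if make.names() leaves it unchanged.
is_valid_name <- function(name) make.names(name) == name

is_valid_name("first_num")   # TRUE
is_valid_name("2x")          # FALSE
```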
so these are some naming conventions
which when you practice you will learn
about so now i can assign a variable by
just doing a dot pairs and then assign
any value to it but always remember if
you are using a period if you are using
a notation then in that case that should
always be followed by a letter one more
thing which is always practiced in a
real time environment is that
we cannot have spaces
when we are creating variables so for
example if i say first
num and then i try to assign this a
value
it basically fails but obviously i could
have done this by doing it underscore
and that perfectly works fine and you
can basically then call the value for
this one always remember one more
standard practice which is followed in
real time environment
is you will try to have variable names
with
a little meaning to them so for example
if i would create a variable
and i would say for example
let's say bird
that's my variable name
and then if i assign this a value tiger
it works fine but then it really does
not make sense
and that would basically create a lot of
ambiguity in our coding so it is always
good to say for example animal and then
i would say okay so tiger is an animal
and that basically not only
allows me to assign a value to the
variable but it is also a little bit
more
meaningful now when we talk about
variables it is also good to know the
different data types which are available
in r now like any other programming
language
r also supports different data types
so you have your logical data type such
as true and false you have numeric
values which is say these numbers you
could also be creating an integer
such as 3L and 40L and so on you
can have a complex number you can have
characters which can be just letters or
a set of letters or anything which is
within the quotes or you can even have
raw data so these are different data
types we can again see quick examples
here on data types let me come out of
this one and as we saw already when we
created model 1 this was character now i
can just say x and let's say
100 and obviously this is going to be
not my integer
okay so let's see this what is this one
this one by default is double
it is by default double so if i would
want an integer then i would say for
example something like
like this
and this one you can check
by using typeof and you can see the
value for this one so this is an integer
so similarly you can have character you
can have
complex you can have raw data you can
have numeric values so all these are
different data types you could also be
saying for example i would want to check
the boolean so i could check this
and select this one
and now when i check the value for a
it is true
and we will learn about logical
operators where we can basically be
using these values assigned to the
variables to compare to compute between
different variables so this is a simple
small example of using variables
so we have seen here using variables and
also
using the assignment operator
and then assigning values to the
variables and different naming
conventions we can also be
using different data types which r
supports
and work with the variables
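As a rough illustration of the data types mentioned above, the sketch below creates one value of each type and checks it with typeof() and class(). The variable names and values are chosen for illustration only.

```r
x <- 100               # numeric, stored as a double by default
y <- 100L              # the L suffix makes it an integer
flag <- TRUE           # logical
animal <- "tiger"      # character
z <- 3 + 2i            # complex
bytes <- charToRaw("A")  # raw

typeof(x)      # "double"
class(x)       # "numeric"
typeof(y)      # "integer"
typeof(flag)   # "logical"
typeof(animal) # "character"
typeof(z)      # "complex"
typeof(bytes)  # "raw"
```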
now once we have learnt about variables
or data types let's also just
first learn about your operators
and how they can be used in your r
programming language
now
we might be
intending to do some calculations on
numeric values
find out differences between values
or say for example compare values so in
that we can be using different kind of
operators so we have
various operators we have arithmetic
operators we have relational
operators we also have logical operators
so before we straightaway look into
logical operators let's also understand
the basics such as the arithmetic
operators which r supports for example let
me pull up a notepad file here
and when we talk about arithmetic
operators
here we are talking about
your
addition
you have subtraction
you have multiplication
you have division
and you have remainder or modulus
and you have exponent
and what makes it also important is that
when you're using arithmetic operators
you also need to know about the order of
operations
so when you say order of operations
the priority always goes to parentheses
so that takes the priority you have then
exponent
or your computation if that would
involve
exponent
so let's say
exponent here
which is then followed by your
multiplication
and division
and that one also follows an order of
left to right
whichever comes first when we talk about
multiplication and division and
similarly when we talk about addition
and subtraction
it is left to right
whichever comes first so these are some
of the arithmetic operators now we can
see some examples here quickly
although these are some simple examples
so for example i can say 100 plus
100 and that gives me the value right
you can always do a 100 minus
fifty
you can do a hundred multiplication
you could do a hundred division two
or you could also use modulus
with a single percentage sign
which basically gives you an error here
so i will
oh
just give me a minute
so let's give here one more percentage
sign since the modulus operator in r is %%
and that basically says what would be
the remainder
so if we would want to look at the
ordering when we are using this
arithmetic operators
we can see an example so for example if
i say 34 plus 46
divided by 2 gives me
57 however if i use 34 plus 46 in
parentheses which gets the priority and
then i divide my result is different so
understanding what arithmetic operators
you can use and also the ordering in
which
that leads to the computation is very
important
so we can use all of these arithmetic
operators and to control the ordering we
can be using parentheses
or we can have our computations ordered
with what kind of operation we would
want whether that would be
multiplication or division addition or
subtraction now at any point of time i
can always do a control l
and that allows me to clear my console
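The arithmetic operators and the ordering example above can be sketched like this. The operands for multiplication, modulus and exponent are assumptions chosen for illustration; the 34 and 46 example follows the numbers mentioned in the walkthrough. The names without_parens and with_parens are made up.

```r
100 + 100   # addition: 200
100 - 50    # subtraction: 50
100 * 2     # multiplication: 200
100 / 2     # division: 50
100 %% 3    # modulus (remainder), note the double percent sign: 1
2 ^ 3       # exponent: 8

# Division happens before addition, so this is 34 + 23.
without_parens <- 34 + 46 / 2
without_parens   # 57

# Parentheses take priority, so this is 80 / 2.
with_parens <- (34 + 46) / 2
with_parens      # 40
```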
let's continue our learning and let's
learn about operators
so when we speak about arithmetic
operators we see that allows us to do
computations but we have also relational
and logical operators which help us in
doing our computations or comparing
values or sometimes finding
difference between different values
whether those are group of values or
whether those are individual values so
with your relational and logical operators
you can compare data values
so
if we would want to see if the values
match or not match or if the values are
above or below equal to something and so
on
so when we talk about your relational
or logical operators
so we obviously have greater than
you have
less than
you have greater than or
equal
you have less than
or equal
you have equal to
and you have not equal
these are some of your
relational operators we can say
and when you talk about your logical
operators then you have and you have or
and you have not
so and
is
when it compares two values so it
returns true if both the conditions are
true else it will return a false
so for example if i have 10 greater than
20
and 10 is less than 20. now that's not
possible and we are comparing
the result of both of these so we are
checking
if both the conditions are true and
that's not
really true here so we see the value as
false now if i would have replaced this
and with or
it would check
even if one of the conditions is true it
would basically show me a result as true
you can also use a not operator which
takes each element of the vector
and gives the opposite value
so we can be using any one of these
operators
and then basically do our computations
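A minimal sketch of the and, or and not operators described above, using the 10 and 20 comparison from the explanation. The variable names are made up for illustration.

```r
# and: TRUE only if both conditions are TRUE
both_true <- (10 > 20) & (10 < 20)
both_true      # FALSE

# or: TRUE if at least one condition is TRUE
either_true <- (10 > 20) | (10 < 20)
either_true    # TRUE

# not: flips each element of a logical vector
flipped <- !c(TRUE, FALSE, TRUE)
flipped        # FALSE TRUE FALSE
```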
so let's see some examples about these
logical operators now either you could
just be assigning values to your
variables and check or you could also be
picking up a data set
from your machine and then try to use
these logical operators so for example
if i say x
has been assigned 100
y
has been assigned 200
and if i try to say x
equals
y
so that already
checks the value and compares and tells
me that's not true it is false and if i
would have used a not operator
for example if i would have said
something like
this one
so it tells me true so i can just check
simple conditions like this
i can say
is my y greater than x
and that tells me yes it is true
if i say y is greater than or equal to x
well
it would still say true
because when you are saying greater than
or equal to x so when you're saying this
one it works fine right now we can also
be picking up some data set and for that
what i can do is i can pick up one of
the data set from my machine so i can go
in here and i have some data sets let's
look into that and i would be interested
in taking this auction data set and
loading the values here so i'll get this
path
and i will come here i can use auction
as my variable name you could have given
a dot separated name for example i could
have said auction dot data if this is
what you want to do
and then you can assign variable
a value so here i'll say read.csv
and i intend to pick up a file so i give
this path
and when we are working on windows
machine we need to give a double backslash
so i'll say auction.csv now i could give
other things like header being true
what is the separator
if you would want to fill values to take
care of missing values we can look at
all of those so here i'll just add a
backslash
i will add a backslash
and i will basically just do a control
enter now i can look at the values of
this by just doing auction.data
and i can see what values it has so it
has a lot of data here you could
have used some other functions which we
can see later
where
i can choose
head
and i can see the first six rows
so we can basically
assign
data to the variable and continue
working on this
now we can keep it simple so let me
repeat this step
and here i will say auction
as my variable name and i'll assign this
so i can basically also do a view
on auction
so auction
and then basically that shows me a
tabular format of the data which allows
me to look into the data and basically
understand it and then i can
you know
use this to work on variables so what i
can do here is i can say x
and let's say
assign some value to this for which i
would want to work on my data set which
is auction
now what do i want to do here so let's
use auction
and then i can use a dollar symbol and i
can choose which column i'm interested
in so for example let's choose bidder
and i can just give a value to this one
and let's pick up a name
so let's say tweet
and that's the name
and i can be assigning all the values to
this
or i could say i would want to use
another condition so i'll say auction
dollar
and then let's take this value of bid
and let's say it is equal to
100
and then i ended up with a comma and i can
try doing this now here it gives me a
problem because what we did was
we
did not use the right operator so we
will say for example and
so i will say
x
is being assigned the value of
auction bidder
being
tweak
and auction bid value being hundred
so now once we do this i can look at the
value of x and that shows me the value
so this is just a simple example of
using a logical operator now i could
have
just said
instead of and i could have used or
which is basically a pipe
which you have to use
and that gives you or condition and now
hit on enter and if i now look at the
values of x it will show me a lot of
values because we have given an or
condition which basically matches when any one of
the conditions is true so in this way we can use
logical operators and continue working
and
continue doing our computations
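The filtering shown above can be sketched without the original file. The auction.csv data is not available here, so this sketch builds a small hypothetical data frame with the same two column names, bidder and bid; all the values and bidder names are invented. In the video the data is read with read.csv, where a Windows path needs doubled backslashes, for example a hypothetical "C:\\data\\auction.csv".

```r
# Hypothetical stand-in for the auction data set (values are invented).
auction <- data.frame(
  bidder = c("alice", "bob", "alice", "carol"),
  bid    = c(100, 100, 250, 75),
  stringsAsFactors = FALSE
)

# and (&): keep rows where both conditions hold
both <- auction[auction$bidder == "alice" & auction$bid == 100, ]
nrow(both)     # 1

# or (|): keep rows where at least one condition holds
either <- auction[auction$bidder == "alice" | auction$bid == 100, ]
nrow(either)   # 3
```

The or version returns more rows because a row only needs to satisfy one of the two conditions.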
let's learn about print formatting and
how print can be used to
view your data
when you talk about r r uses print
function to display the variables
so for example if i have assigned number
10 to x
i can do a print x and that will show me
the value of
x
what we see here with 1 in square
brackets that also has a meaning which
basically means it is a vector
and we'll learn about vectors later so r
uses the paste and paste0
functions to format strings and
variables together for printing in a few
different ways for example if i would do
this which i say is print paste
and then
pass in
two strings here or two words here such
as hello and world
that would be
printed as follows now i could also do a
print paste
and then use a separator
so my print would look something like
this if i use paste0 then that avoids
any space between these two words or for
example these three words
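The printing options just described might look like this in practice. The separator "-" is an assumption, since the one used on the slide is not stated, and the variable names are made up for illustration.

```r
x <- 10
print(x)                      # [1] 10

# paste() joins its arguments with a space by default.
spaced <- paste("hello", "world")
print(spaced)                 # [1] "hello world"

# sep controls what goes between the pieces.
dashed <- paste("hello", "world", sep = "-")
print(dashed)                 # [1] "hello-world"

# paste0() joins with no separator at all.
joined <- paste0("hello", "world")
print(joined)                 # [1] "helloworld"
```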
so let's see some basic examples here
when we talk about print
so for example if i bring up my r studio
here is an example
so x as we say now this is your
assignment operator which we already
discussed
now i can be assigning a value to this
so i can just place my cursor here
and i can just hit on control enter so
value has been assigned now let's look
at the value of x
now i could also be doing a print x
explicitly by
using print function for example if i do
similarly for message as hello
and then i can print the message
by using print
now if for example i do something like
this
this is not going to print anything
until i call the variable or i use a
print function so for example if i type
y
auto printing
shows us the value or i could do it
explicitly by using the print function
that is explicit printing
look at this number one as i mentioned
it means y is a vector and five
is its first element now you can also
use the colon operator to create integer sequences
and we'll learn about sequences or list
later but this is just a simple example
so i am creating an integer sequence
i can place my cursor here
which would start with 10 and end at 30 that is 21 values
so let's look at this values
for our sequence
of integers
now at any point of time you can always
use a class
to look at
the
class
of say x
and that shows me the class
is integer
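A short sketch of the sequence example, assuming the sequence is written with the colon operator as 10:30. Because both ends are included, the result holds 21 values.

```r
x <- 10:30     # integer sequence from 10 to 30, both ends included
x              # auto printing; the [1] marks the index of the first element shown

length(x)      # 21
class(x)       # "integer"
x[1]           # 10
```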
now looking further when we talk about
different data types as we learned
few minutes before
so r has basically
five basic or atomic classes of objects
so you have character numeric values
that is real numbers you have integers
you have complex and you have logical
values
let's
spend some time in understanding some
basic arithmetic operations and how you
can do it using your r programming
language now here i have opened up
rstudio and these are some basic
examples such as performing arithmetic
operations
now for example we can add two numbers
and i can just place my cursor here and
press ctrl enter
that shows me the addition i can do a
subtraction
i can do multiplication division
also going for exponential power
or use modulo which returns the
remainder
now
when we are performing operations what
we can also do is we can change the
order of operations
and in this case we are using
parentheses so i am putting in 500 into
2 in parentheses
plus
80 divided by 2 so first it operates
what is given in parenthesis
and that's why i get a result 1040
similarly i can change the order of
operations so here i can give 500 into
and then something in the parenthesis so
that gets operated first and hence you
get result of thousand five hundred now
we have already discussed about the
assignment operator and what we can do
here is we can assign variables
some value so for example i create a
variable called selling and then i would
assign it a value similarly for cost and
then we can do some calculation so we
can say profit
is selling minus cost
we can do that and here i can look at
the value of profit which shows me 250.
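The two ideas above can be sketched together. The first expression follows the numbers mentioned in the walkthrough. The values given to selling and cost are assumptions, since the video does not state them; they are picked only so that profit comes out as 250.

```r
# Parentheses are evaluated first: 1000 + 40
grouped <- (500 * 2) + 80 / 2
grouped      # 1040

# Store values in variables and compute with them.
selling <- 1000   # hypothetical selling price
cost <- 750       # hypothetical cost
profit <- selling - cost
profit       # 250
```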
now let's also spend some time in
understanding data types in r so we
can have different types
of data so
this one shows me an example of
assigning a decimal value which is part
of a numeric class so i can do this
and then if i would be interested in
seeing the value of num so i can just
look at the value of num
if i would be interested in looking at
the type of num so i can do that here by
just typing in typeof
and
then select this one and pass in your
num and it shows me the type is double
i can also look at what class it belongs
to and that shows me
it is numeric
so in this way you can not only assign
values to a variable but you can look at
the class and type of it now here we can
assign whole numbers which are also
known as integers now if i look at the
type of this it shows me double so if i
would want to explicitly assign an
integer i could have done for example
let's say j
and i could have used the assignment
operator and i could have done this
and then if i look at the value of j it
shows me the value but what we would be
interested in looking at the class of j
so we can do this and it shows me it is
an integer so explicitly either i can
assign this by using a capital L suffix
or i could use a function called
as.integer so we'll see that later
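Both ways of getting an integer can be sketched as follows. The variable names i and k and the value 5 are chosen for illustration.

```r
i <- 5               # a whole number typed without L is still a double
j <- 5L              # the L suffix creates an integer
k <- as.integer(5)   # as.integer() converts a number to an integer

typeof(i)            # "double"
class(j)             # "integer"
identical(j, k)      # TRUE

# as.integer() drops the fractional part rather than rounding.
as.integer(5.9)      # 5
```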
now we can also assign boolean values or
basically your logicals so here we
assign true and then we do a false
and we can look at the type of t and
that tells me it is a logical class
now similarly you might be interested in
working on
text or string values and here we can do
this by saying
ch and then passing in a value look at
the class of this it tells me it is
the data type is character and if you
look at the type of it it shows me
character
similarly r also supports complex data
types so we can do that too by just
doing this and look at the class of it
it tells me it is complex and you can
also pull out the length of this by
using the length function now here we are doing a length on
the character so let's look at this one
and it shows me what is the length of
this
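The data types walked through above can be sketched like this. The literal values and the names `flag` and `cx` are assumptions for illustration; the video only names `num`, `j` and `ch`.

```r
num <- 10.5          # hypothetical decimal value
typeof(num)          # "double"
class(num)           # "numeric"

j <- 5L              # the L suffix requests an integer explicitly
class(j)             # "integer"

flag <- TRUE         # logical (boolean) value
class(flag)          # "logical"

ch <- "hello"        # hypothetical text value
class(ch)            # "character"
typeof(ch)           # "character"

cx <- 3 + 2i         # complex number
class(cx)            # "complex"

length(ch)           # 1, a single string is a character vector of length one
nchar(ch)            # 5, the number of characters inside that string
```

Note that `length` counts elements of the vector, not characters, which is why `nchar` is shown next to it.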
now
one of the useful functions which we
usually use in r is print now i can
simply do a print hey and that prints
whatever value is passed to print i can
assign a value to a variable and then
print it so that is also fine you could
have also without using function just
type y and that also shows the value
however sometimes using print as an
explicit function can be useful it makes
your code more readable now here we
would use an inbuilt data set that is
mtcars
and then if you would want to print the
data set that shows me the values which
shows me
the car models and different other
features such as mileage cylinder
horsepower and so on now one of the use
cases of print
with a paste function can also be seen
here so i'm doing a print paste and that
basically prints whatever was passed in
a concatenated way i could also do a
print paste with a separator
if i would want to format my data in a
particular way so here i've used
separator as comma
there is one more function paste0 which
can be used so i'm just doing here
paste0 and that tells r to just concatenate
these values without any space so
paste0 shows no space between these two
elements which were passed now we can
explicitly do some formatted printing and for that
i'm using the sprintf
function
i am going to pass in %s which
is for string and %f for float
and we can print the values of this so
these are some basic operations or usage
of your functions to
basically do some computations or look
at your results
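A short sketch of the printing helpers just described. The strings and numbers are assumptions; `mtcars` is the built-in data set mentioned in the video, and `head` is used here only to keep the output short.

```r
print("hey")              # explicit printing
y <- 10                   # hypothetical value
print(y)
y                         # auto-printing gives the same result

print(head(mtcars))       # first six rows; the video prints the whole data set

# paste joins its arguments, separated by a space unless sep says otherwise
with_space <- paste("hello", "world")
with_comma <- paste("hello", "world", sep = ",")

# paste0 joins with no separator at all
no_space <- paste0("hello", "world")

# sprintf fills placeholders: %s for a string, %f for a floating point number
formatted <- sprintf("%s scored %f", "sam", 9.5)

print(with_space)
print(with_comma)
print(no_space)
print(formatted)
```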
so when you talk about basic type of any
r object it is a vector
and when we talk about vectors empty
vectors can be created with vector
function
a vector can contain objects of same
type or a class now when we talk about
list
list is a vector which contains objects
of different classes
so these are some basic examples so
apart from your print formatting we can
be looking at what we call as our
objects such as vectors or lists and so
on so when we talk about vectors it is a
sequence of data elements of same basic
type
we use
the c function to declare a vector so we
can always call c to declare a
vector
for example here we are creating a
variable v 1 and we are assigning it a
vector by using c and then giving some
basic type so numbers 1 to 5 or for
example words you can always do a print
or you can also use a class
to find out what is the class
of the elements or
the values which have been passed to the
particular object so we can look at some
examples like this for example
we can see here so list
is a vector which contains objects of
different classes
so you can have numeric objects so that
is your numbers such as 1 2 etc
which are your numeric values for
example here what we are doing is we are
assigning a value 1
to a
and that
can then be used i can either do a print
or i can just use auto printing i can
also do
here a value for a i
or i could be doing something like this
which shows me 0
which can be for missing value so if i
would want to
use auto printing i can just call a and
it shows me the value what has been
assigned to it you can always use typeof
to look at the type of a which is
double by default and if i look at typeof
a i
that is basically an integer because we
used l here
so in this way we can continue working
with say
our different classes
of objects so for example let's create a
vector here so i can say v1 and then
basically assign it by using a c
function
and then pass in the values to this one
and that basically
gives me a variable and you can look at
what are the values assigned to it now
if i look at the class of v1
that shows me it is numeric if you use
typeof
and pass in
v1 that shows me the values
are double now as we were seeing here we
can be looking at the class so for
example if i create one more
variable and then assign values to it
using c
so
passing in some words here
for example let's go and say hello world
and then i can basically
do this and look at the values of this
one i could also explicitly print as we
discussed earlier
by doing a print v2 we could also be
having a paste function
if we would want to use that so for
example if i would do a paste function
i could be using
and this is missing a bracket so let's
complete this
and that shows me the value i could have
also used for example
paste 0 function
and that also works fine
so it depends on what we are looking at
here so if i look at class
of v1 which we had it is numeric and v2
is basically
having elements which are of the class
character
so this is just a simple example of
having
your print functions having vectors
created printing out the values of those
printing out class and type of these
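The steps above, sketched under the assumption that `v1` holds the numbers 1 to 5 and `v2` holds the words hello and world; `ai` is an assumed spelling of the second variable name.

```r
a <- 1                     # numeric literal, stored as double by default
ai <- 1L                   # the L suffix makes it an integer
typeof(a)                  # "double"
typeof(ai)                 # "integer"

v1 <- c(1, 2, 3, 4, 5)     # numeric vector built with c()
class(v1)                  # "numeric"
typeof(v1)                 # "double"

v2 <- c("hello", "world")  # character vector
print(v2)
class(v2)                  # "character"

# paste and paste0 also work on individual elements of a vector
joined <- paste(v2[1], v2[2])
joined0 <- paste0(v2[1], v2[2])
print(joined)
print(joined0)
```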
to continue our learning on vectors
as i mentioned earlier we can use the c
function
which can be used to create vectors of
objects by concatenating things together
so for example if we look at this
one which says x and then i use c
function and i say
0.5 and 0.6 so we can have a vector of
numeric types
so let's
do this
and then we can look at the value of x
so it shows me my vector which has 0.5
and 0.6 i can also have
my vector of logical values
and now let's look at the value of x so
it has true and false
or we could have done it in this way
where we can then look at the values so
we can use the short form by using
capital t and f
i can create a vector
with character types and then look at
the values of those
i can also be
creating a sequence of integers as we
saw in previous example and then look at
the values which start at 9 and end at
29.
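These vector types can be sketched as follows. The character values are an assumption, and separate variable names are used here instead of reusing `x` so each result stays visible.

```r
x_num <- c(0.5, 0.6)          # numeric vector
x_lgl <- c(TRUE, FALSE)       # logical vector
x_short <- c(T, F)            # same thing using the short forms T and F
x_chr <- c("a", "b", "c")     # character vector (hypothetical values)
x_int <- 9:29                 # integer sequence from 9 to 29

print(x_num)
print(x_lgl)
print(x_chr)
print(x_int)
```

`T` and `F` are ordinary variables that can be reassigned, so spelling out `TRUE` and `FALSE` is the safer habit.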
now you can also create with complex
types and look at the values so these
are some simple examples of creating
vectors now we can also use vector
function to initialize vectors
so for example if i would do this
where i am saying my vector will be of
type numeric length is 10 and then look
at the values so it just shows me
a vector which has
all zeros and the length is 10. now you
can create a vector of numbers
by doing this as we saw in previous
example and use explicit printing to
look at the values or might be letters
and then use the print
function to basically look at the values
of the vector now we can also try
concatenating the above two
so that creates a mixed vector which has
two different kind of types here so i
can do a mixed vector by using the c
function and then passing in my numbers
which has numeric types and letters
which has character types
and then we can basically do a printing
of this which shows me the value but
here what we see is coercion that is
basically
casting if you would know as the word in
different programming languages so it
basically coerces the numbers to
character as characters cannot be
coerced into numbers
and then you can print the values of
this mixed vector where everything is of
character types so for example
at this point of time if i would have
done something like class
of mixed
vector
and if i would want to look into the
values of this one it shows me
everything is of character types here
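A sketch of initializing a vector and of the coercion that happens when numbers and letters are combined. The element values are assumptions, and `chars` is used as a name to avoid overwriting R's built-in `letters`.

```r
# vector() creates an empty vector of a given type and length, filled with zeros
x <- vector("numeric", length = 10)
print(x)

numbers <- c(1, 2, 3)          # numeric
chars <- c("a", "b", "c")      # character

# combining both types forces everything to character
mixed <- c(numbers, chars)
print(mixed)
class(mixed)                   # "character"
```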
now
data type of different vectors can be
returned by the function class as we saw
just now so it is common to use the
class function
to interrogate an object
asking what is the class
now we can create one dimensional object
such as an integer vector which we have
done earlier and then look at the class
of it which tells me it is an integer i
can also create a numeric vector
by giving in some values here
so when we do this
so i have given the vector function c
and then giving in the value and look at
the class it shows me it would have
numeric values
now you can create a character vector
and then basically look at the values of
it now at any point of time in all of
these for example if i would do num
i can see what are the values assigned
to it
i can do letters
and i can see the values of this so let
me just create some space here now i can
create a factor vector
and then look at the values of it or
also you can see
what is the value in this factor vector
so here
we said
as.factor
so the as.factor function is being used
here and we are creating a vector of
letters
and then we look at the class
we also look at the values what are
assigned to this
or what are in this particular vector so
if you look into all of these vector
examples
initially we were
using an assignment operator where we
were using the c function and when we
started creating vectors by say
concatenating or vectors of particular
types we are using equals here and that
also is fine now looking further
when we look at concatenating two
different kind of vectors so for example
here we have
say
numbers and letters
as we discussed earlier
it will do coercion that is change
one type
into other
now when we talk about one dimensional
objects we can have integer vectors
or say float which we saw just now
ending at
10.5 so when we say c 1 is to 10
it basically starts with 1 but then
there is also you can say a coercion
happening here and then you have the
values ending at 10.5 that is float
and i can look at the class of it
and when we did a class of
did we do a class here so let's come
here and let's do a class of this one it
shows me it is numeric you can look at
the values of it
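The one dimensional examples above might look like this. The factor's letters are an assumption, and the variable names are illustrative.

```r
int_vec <- c(1:10)               # integer vector
class(int_vec)                   # "integer"

# adding 10.5 coerces the whole vector from integer to numeric (double)
num_vec <- c(1:10, 10.5)
class(num_vec)                   # "numeric"
print(num_vec)

# as.factor turns a character vector into a factor with one level per distinct value
fct_vec <- as.factor(c("a", "b", "c"))
class(fct_vec)                   # "factor"
print(fct_vec)
```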
similarly you can create a character
vector which is
1 to 10 and then basically look at the
class of it or basically the value of
this vector
or as we did the factor vector now for
two dimensionals we will explore that
when we are learning about matrix so as
of now let's forget that now when you
talk about mixing objects there are
occasions when
classes of our objects get mixed
together so that could be accidentally
or that could be intentional so if you
look at this example here we have y
which has been given values which is 1.7
and a
and at this stage if i would look at the
value of y
that's my vector
if you look at the class of y
that shows me
it is
as character now when you look at some
other examples so let's pass in
logical and numeric values
what would happen in this case so we can
again
use
class
of y
and that basically has numeric
and if you would want to look at the
value of y that shows me
1 and 2 here
let's go further
so
let's look at the value of this one so y
and then basically see what is the value
of y so it is a
true
and you can also look at the class of it
now we are mixing objects of two
different classes in a vector remember
when we talk about
vector we always talk about vector
having elements of same type
but when we talk about lists which we
will learn later
that would have
basically or that can have
your each element of different type
so for vectors it is not allowed so when
different objects are mixed in a vector
coercion occurs so that every element
in the vector is
of the same class
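A sketch of the three mixing cases. The first pair (1.7 and a) is from the video; the other two pairs are assumed values that reproduce the results described (1 and 2, and a with true).

```r
y1 <- c(1.7, "a")     # number + character -> character
y2 <- c(TRUE, 2)      # logical + number   -> numeric, TRUE becomes 1
y3 <- c("a", TRUE)    # character + logical -> character

print(y1); class(y1)
print(y2); class(y2)
print(y3); class(y3)
```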
now
we have seen earlier the implicit
coercion where r tries to find
a way to represent all the objects or
elements as i say so all the objects in
the vector in a reasonable fashion so we
can also be doing explicit coercion
so that is
from one class to another by using a as
dot and then using a relevant function
so if i have x here now if i look at the
class of x it tells me it is an integer
but i can convert that to numeric by
doing a as dot numeric or as dot logical
or as dot character to basically do a
coercion and change the class of the
objects now if r cannot figure out how
to coerce an object this will result in
nas being produced which we can also
relate to missing values or not
applicable values so for example if we
create x and look at the class of x it
tells me it is character let's try
changing character to numeric which will
not work and it says nas are
introduced if you do it even in logical
that would not work and it shows me na
values or if you do a complex it says
nas have been introduced so at this
point of time if i look at the value of
x it tells me
it was assigned a b c and we try to
convert that into a different class
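Explicit coercion can be sketched as below. The starting value `0:6` is an assumption; the a b c vector is from the video. `suppressWarnings` is used only to keep the output quiet, since R normally warns that NAs were introduced by coercion.

```r
x <- 0:6
class(x)                       # "integer"
x_num <- as.numeric(x)         # 0 1 2 3 4 5 6 as doubles
x_lgl <- as.logical(x)         # 0 becomes FALSE, everything else TRUE
x_chr <- as.character(x)       # "0" "1" ... "6"

x <- c("a", "b", "c")
class(x)                       # "character"

# none of these conversions is meaningful, so every element becomes NA
bad_num <- suppressWarnings(as.numeric(x))
bad_lgl <- as.logical(x)
bad_cpx <- suppressWarnings(as.complex(x))

print(bad_num)
print(bad_lgl)
print(bad_cpx)
print(x)                       # x itself is unchanged
```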
now when we talk about vectors it is
also good to know about attributes in
brief
so
all your r objects have attributes that
is metadata for object so when you talk
about our object attributes you could
have names you can have dimension names
you can have the dimensions that is
matrices and arrays you can look at the
classes such as integer numeric and so
on and you can also look at length and
other user defined attributes so if i say x
we are assigning a value to x now at
this point of time if i see the value of
x is 1 but then all objects need not
necessarily have attributes so in that
case whenever you try to use an
attributes function
that would return null so
at this point of time if i look at the
attributes of x
it shows me null value so these are some
of the basics which
help us in working with r
and using your vector function
or
looking at the coercion which is
implicitly happening or explicitly can
be done by us by using a suitable as dot
function
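A small sketch of the attributes check. The name given to the element is a hypothetical addition to show what a non-null result looks like.

```r
x <- 1
before <- attributes(x)    # a plain value carries no attributes
print(before)              # NULL

# giving the element a name adds a names attribute (hypothetical label)
names(x) <- "first"
after <- attributes(x)
print(after)
```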
now let's learn about
lists
and how we can work
using r on list
when we talk about vector which we saw
in previous examples
vector is a one-dimensional array right
and it can hold elements only of same
type so we would say vector is more of
one dimensional but when you talk about
list list is a generic vector that can
contain objects of different types
so when you talk about say for example
matrices matrices can also hold elements
of same type but
in matrices it is a two-dimensional
array we will also learn about matrices
later so when you talk about
lists they can contain all kind of r
objects so you can have dates you can
have data frames you can have vectors
and many more so in list
there is no coercion which is required
that is changing of data type there is
no loss of functionality and lists do
not follow any predefined structure now
we can create lists using this list
function as it is shown here so you can
create a variable and then assign a list
to it where you can be
using either passing in a vector or what
you can do is you can simply create a
list by using this
list function so let's see some example
here now for that what we can do is i
can bring up my r studio
where we can see an example on list and
how it works so when you talk about list
what you can do here is
let me close this one
and this one yeah
so what we can do is we can basically
say for example test
and i can basically
give
something here so for example i can say
music tracks
and then i can say how many hundred of
them
and i can say
let's give 100 as number
and then we can say how many of them got
five stars
and
i can do this so i can
check this and this shows me all the
objects or elements of this
list right
now
when we do this what we are doing is we
are creating a vector right and vector
basically
can have coercion depending on what are
the elements which are passed because
whenever you use the c
and you create a vector it will only
accept
elements of the same type so for example
if i do a class on test
it shows me
here all the objects are of type
character right and you can also use
typeof
to check
for
our test variable and
it is basically having all the objects
as character now how would you create a
list so what we can do is we can use a
list function so for example let's again
do a test here but this time i'm
interested in creating a list
and list can have objects of different
types so let's say music
tracks and then i can just give hundred
and i can say with rating five
and now if i look at my test it shows me
all the elements of your particular list
here
we see each element or each object with
a double bracket and we can see each
element
now what we can also do is we can use the
is.list
function and then we can pass in test
here to check what is it and it is a
list right
so here we have created a list but if
for example we take the previous example
where we were creating a vector
and if i would do an is.list it would
show me false right so we just created a
simple list
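The difference between the vector and the list can be sketched like this. The name `test_vec` is an assumption so both objects can be compared side by side; the video reuses `test` for both.

```r
# c() builds a vector, so the numbers are coerced to character
test_vec <- c("music tracks", 100, 5)
print(test_vec)
class(test_vec)          # "character"
is.list(test_vec)        # FALSE

# list() keeps each element's own type
test <- list("music tracks", 100, 5)
print(test)              # each element is shown under [[1]], [[2]], [[3]]
is.list(test)            # TRUE
```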
and we can also assign labels
or
we can
use the names function to basically give
names so what i can do here is let me
create a list first
so i can do that like this
and now what i can do is i can do a name
and i can use a name function to this
test
and then basically what i can do is i
can pass labels
so here i can just
given some names here
so for example i can say
let's give it a name
product
so say we are talking about product of a
company
and then we can say
here i can give
count
and here i can give rating
and this is basically to give names so
that just gives
some error here let me just check this
so instead of this
name function here what i'll do is
i will basically
use names
and now let's
do a test
so that shows me the names what we have
assigned
to
our
list objects
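Naming the elements of an existing list, sketched with the labels mentioned above. Note the function is `names` (plural); base R has no `name` function for this purpose.

```r
test <- list("music tracks", 100, 5)

# assign one label per element
names(test) <- c("product", "count", "rating")

print(test)          # elements now appear under $product, $count, $rating
print(names(test))
```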
now
we can always access the elements of our
objects from a list
using indices or even using double
square brackets so for example i have test here
and basically i can give something like
this
which gives me based on the indices
the position
where you are
accessing the elements of the list
so we can do this
what i can also do is
we can specify names when creating a
particular list so for example what i
can also do is
i can say
product
dot
category
and now i can just give
list function
so i would want to assign names while
creating a list
so i can say for example
product
and this would be say
music
tracks
then i can give say for example count
and count would be hundred
and then ratings
and i can say five
and
now we can basically access this
list which we have created
so what we have done here unlike
earlier when we created a list and then
basically
used names function to assign a name to
it or each object here while creating a
list itself
we passed in the names so we can also do
that
now
if you would want to basically
display the list
or
compactly display the structure of a list
we can always use the str function
and here i can pass in the name
so let's choose this one
and this is
in a more compact way listing down the
elements of your list so list can be
containing other lists also and we can
also do that so for example i create one
more list for example i can say similar
product
and here i can give
a list again
and what i would want to do is
i would want to say
product
equals
and i can say film
and then i can basically give a count
and then i can give ratings
say 4
and here what i've done is i have just
created one more list
but my intention is not just to create a
list but i would want to add this
to
our existing list so what we can do here
is
we can take our previous list that is
product dot category
like what we did earlier
and now i intend to
say list
and here
i would want to
say for example
let's copy this
or we can just
so this is what we were doing when we
were creating a list using product
giving the names while creating a list
and what i also want to do is here i
will just say
similar
and then pass in
similar dot prod
so now if you look at
our list we have just added new elements
so this is one more way where we can
create a list and we can
basically add or our list can have other
list
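A sketch of naming elements at creation time, inspecting with `str`, and nesting one list in another. The count of the second list is an assumption since it is not stated; the variable names follow the spoken "product dot category" and "similar dot prod".

```r
# names are supplied directly inside list()
product.category <- list(product = "music tracks", count = 100, ratings = 5)
str(product.category)          # compact one-line-per-element summary

# a second list; the count value here is hypothetical
similar.prod <- list(product = "film", count = 50, ratings = 4)

# rebuild the first list with the second one nested under the name similar
product.category <- list(product = "music tracks",
                         count = 100,
                         ratings = 5,
                         similar = similar.prod)
str(product.category)
```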
so when we talk about subsetting or
extending list
so one of the main ways as i said to
access a specific element or a subset we
use double brackets and we can always do
that so for example we take
our prod dot category and then i would
want to access a particular element so i
can always do this by giving the index
positions
and i can access the elements of my list
so this is one single way now here
if we use a single bracket instead of
double bracket
then in that case the output
would be a list
so if i look at this one then this would
be a list but if you use double brackets
then you are accessing
a particular object
if we were creating a vector we could
just be using a subset by using the c
function
now what we can also do is we can subset
by names or even logical
so what we can do here is we can take
this product category and if we have
defined names then in that case what i
can do is i can say i would be
interested in music tracks
and this is the name we had given
so we can close this one and we can try
accessing the elements here so what's
the name we had given
so
no it's not
music tracks that's the value the name
is product
so we do this and then we can access the
elements what we can also do is
we can be
subsetting based on logicals so what we
can do is we can basically just give
something like this and here we can pass
in values
something like this
and we missed a bracket
so
that's also a way of pulling out the
values so you can be doing a subsetting
using the names which you have assigned
to objects within your list
or you can say names which you have
assigned to the elements or by using
logicals now what we can also do is we
can use the dollar operator now if you
see here we are looking at the name
and that is preceded by dollar
so we can always pull out the values
from our list
by giving the list name and then give a
dollar symbol and then choose
the name for example if i choose product
i can list the values here
i can be
looking at say
dollar and then choose a count and this
is also one way of accessing your
elements from the list using your dollar
symbol now to add elements to a list as
i said you can add a vector of names
and that can be passed to your list
so these are different ways in which you
can work with list and then you can
access the elements either using indices
or using names or even using dollar
symbol and pointing the right names so
this is one simple example of working
with your list now
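The access patterns above, sketched on a three element version of the list. The logical selector values are an assumption; the video does not state which ones were used.

```r
product.category <- list(product = "music tracks", count = 100, ratings = 5)

by_index <- product.category[[1]]          # double brackets return the element itself
as_sublist <- product.category[1]          # single brackets return a list of length one
by_name <- product.category[["product"]]   # the name is product, the value is music tracks

# a logical vector keeps the elements where it is TRUE (hypothetical selector)
by_logical <- product.category[c(TRUE, FALSE, TRUE)]

by_dollar <- product.category$count        # dollar followed by the element name

print(by_index)
print(as_sublist)
print(by_logical)
print(by_dollar)
```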
one more and now i can just do a ctrl l
and i can clear that off so
your list always remember is a generic
vector that can contain objects of
different types
now when we talk about matrices
now matrix is a collection of data
elements
arranged in two dimensional rectangular
layout so we can use matrix function to
create a matrix as shown here
so matrix is
two dimensional now we already know that
vector is one dimensional array of data
elements or a sequence of data elements
but when we talk about matrix it's a
collection of data elements that is
two-dimensional arranged in fixed number
of rows and columns so here you see that
we are creating a matrix and we have
specified the number of rows is 3 number
of columns is 3 and we want it to be
arranged by row where we have given the
value as true
so always remember matrix is 2
dimensional and matrix can have only one
atomic vector type unlike your list it's
a natural extension of vector going from
one dimension to two dimensions so
matrix actually needs a vector
which contains values that you place in
a matrix and at least one matrix
dimension so we can choose to specify
the number of rows or number of columns
when we are creating matrix so let's see
a quick example of working with matrix
so for example i could just say
matrix which will have values 1 to 6
and then i can basically give
nrow
and you can give a value to this one and
that's my matrix similarly you could
also be giving
the number of columns
so i can just say ncol
and i can choose this one and then pass
in the value so that's a matrix where r
fills values column by column now if you
intend to fill up matrix in a row wise
fashion so that your values 1 2 and 3
are in first row then we have to just
modify this in a little bit different
way so we have to say matrix
1 colon 6
nrow is 2
and then i can give
byrow
so you always have these helper
functions which allow you to
put out the values
so for example i do this
and then i can do a control enter so now
if you see
you have the values 1 2 and 3
in your first row
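Creating matrices by rows or columns, as a sketch. The dimension values of 2 for the first two calls are assumptions; the values 1 to 6 and `nrow = 2` with `byrow` are from the video.

```r
m_rows <- matrix(1:6, nrow = 2)                 # 2 rows, filled column by column
m_cols <- matrix(1:6, ncol = 2)                 # 2 columns, filled column by column
m_byrow <- matrix(1:6, nrow = 2, byrow = TRUE)  # filled row by row instead

print(m_rows)
print(m_cols)
print(m_byrow)   # first row is 1 2 3
```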
so when we pass the matrix function a
vector
that is too short to fill up an entire
matrix
then something different happens we can
have a look at this
so say you pass a vector containing
value 1 to 3 to the matrix function
and say explicitly you want a matrix
with two rows and three columns how do
we do that so for example i can say
matrix
and here i can say one is to three
now i can give nrow
and then i can give the number of rows
which we want is 2 and then i say
ncol
and this one i'll say 3
so i can do this and here what i have
done is i have given the values 1 to 3
i have said number of rows is 2
and your number of columns is 3. so here
r fills the matrix column by column and
simply repeats the vector
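The recycling behaviour just described, as a short sketch using the same values.

```r
# 3 values for 6 cells: r fills column by column and repeats 1 2 3
m <- matrix(1:3, nrow = 2, ncol = 3)
print(m)
# columns come out as (1, 2), (3, 1), (2, 3)
```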
now if you want to fill using a four
element vector in a six element matrix
in that case
obviously r will generate a warning
message now apart from the simple matrix
function which we are seeing
you also have
some functions such as rbind
and cbind
which r offers when you are working
with matrices so we can use those so for
example i could say cbind i could say 1
colon 3 and then i can say 1 colon 3
and that's
my cbind that is column bind where i'm
passing the values 1 to 3 and which are
stacked in columns i can also do
rbind
and similarly we can be passing in the
values so i can say
rbind and that basically arranges the
values row wise
so let's create a variable for example
let's say n
and let me create
a matrix here
so i'll say matrix now that will contain
1 to 6
and i can say by row
and then you can give
value which is true
and then i can basically say the number
of rows
is going to be 2
and this is also fine so let's look at
the value of n here so you basically
created a matrix with one to six
you arrange them row wise and the number
of rows what you have chosen is two
so what we can also do is we can use
rbind and we can add values to it so for
example if i want to add values 7 to 9
what i can simply do is i can do an
rbind i can say i would want to add to my n
and then pass in the values so i can
just do this and this has basically
appended or added values to existing
matrix so similarly you could have done
a column bind and you could have added
values
to your existing matrix so for example
if i
take this one
and look at my n and what i could do is
i could do a cbind
and then i can basically take my
n
and then pass in values to this one so
let's say 10 and 11
and basically i've added 10 11 as a
column to my existing matrix so this is
one simple way where you work with a
matrix and you are
appending the values either at a row
level or at a column level
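A sketch of binding by column and by row with the values mentioned. The result names are assumptions; as in the video, `n` itself is not reassigned, so it stays 2 by 3.

```r
by_col <- cbind(1:3, 1:3)      # two columns, 3 x 2
by_row <- rbind(1:3, 1:3)      # two rows, 2 x 3

n <- matrix(1:6, byrow = TRUE, nrow = 2)
print(n)

with_row <- rbind(n, 7:9)        # appends 7 8 9 as a third row
with_col <- cbind(n, c(10, 11))  # appends 10 11 as a fourth column

print(with_row)
print(with_col)
```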
so let's also look at some other
examples so basically if you would want
to work with matrix one of the useful
things would be naming the matrix that
is in case of matrices we can assign
names to either the columns or the rows
if you don't do it we see the default
values here which follows a numbering
but what we can also do is we can use
two functions here one is
rownames
or you can use
colnames so these are the two
functions which can be used so for
example let's do a control l
let's try to get our n
and this is what we are doing here but
what we would want to do is we would
want to give them some names so for
example i'll say row names
and then
i will basically
pass in a vector which has
row names or vector which has column
names so what i can do here is i can say
i would want to give row names to n
and then
i basically give some value so for
example let's say
row one
and then let's say row two
and
now i can look at my n which has the row
names assigned to my rows similarly i
could have also given column names
so all i need to do here is i need to
say
column one
and then i will say column two
and then i can be using column names
and
let's look at this one so what went
wrong here so we have three columns here
we forgot that so we have to add one
more column name and then it should be
fine so now if you look at this one we
have just given row names and column
names so
naming the columns or rows in your
matrices can be very useful now as the
previous error says there is also a
function called dimnames
and that's also an argument
of the matrix function which can be used so
we could also do something like this so
for example i have
dimnames
so let's have our n
and then what you can do is you can do a
dimnames
which you can then just create a list
and in this one you can pass in a vector
for row one
and then vector for row 2
and what we can do here is once we have
given this
let's give a comma here
and then give c
and then give your column names which is
column one
column two
and then basically
column three
and now if you just look at dim names so
you can just see that you have given
some row names and column names and this
can be used basically to
assign to your list
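Naming rows and columns, sketched two ways. The exact label spellings (`row1`, `col1` and so on) are assumptions based on the spoken names.

```r
n <- matrix(1:6, byrow = TRUE, nrow = 2)

# option 1: assign names after the matrix exists
rownames(n) <- c("row1", "row2")
colnames(n) <- c("col1", "col2", "col3")   # three columns need three names
print(n)

# option 2: pass a list of row names and column names through dimnames
m <- matrix(1:6, byrow = TRUE, nrow = 2,
            dimnames = list(c("row1", "row2"),
                            c("col1", "col2", "col3")))
print(m)
print(dimnames(m))
```

Supplying only two column names for a three column matrix raises an error, which is the mistake corrected in the walkthrough.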
so if you try to store different objects
in a matrix what would happen coercion
would take place right so for example if
i have x
and let's basically
try to create a matrix which will have 1
to 8
and let's say the number of columns
is going to be 2
so let's look at our x and this has the
values now what if i create
say l
and then basically
i will create a matrix which will be
a matrix of letters
so let's say letters
and then here with letters i'll
say 1 colon 6.
now i would want to give the number of
rows
and let's give it say four
and let's say number of columns
and let's give it three
and now let's look at the value of l so
it has letters
and x is having numbers and what if we
bind them together using
cbind which is for column wise binding
so for example if i do a cbind and then
pass in
my x comma l
so if you see here there is a coercion
which has happened where everything is
converted into character so you can
always do a class and you can check so
this is a simple example of working with
matrices there is much more you can do
subsetting like what we saw in list but
that we can learn later
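The matrix coercion example can be sketched as follows, using the dimensions given above. The name `both` is an assumption.

```r
x <- matrix(1:8, ncol = 2)                      # 4 x 2 matrix of integers
l <- matrix(letters[1:6], nrow = 4, ncol = 3)   # 6 letters recycled into 12 cells

both <- cbind(x, l)    # binding numbers with letters coerces everything to character
print(both)
typeof(both)           # "character"
```

`typeof` is used for the check because `class` on a matrix reports the matrix class rather than the type of its elements.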
now let's learn about data frames
and what is the data frame and how do
you use r to work with data frame now
data frame is used to store the data in
the form of a table and for this
we have a function data dot frame to
create a data frame
so what we know already is that data
sets are comprised of observations
or what we call instances and
variables and we always have
observations
to which
some variables are associated for
example we can talk about
data sets of
say five people now let's look at the
information here
here we look at the body mass index bmi
where we are using a data dot frame
function and then we are passing in say
gender so we use the c function to pass
in the values and then you have height
and then you have weight
and age
and these things then become the columns
of your data frame so for example if we
would want to work
on creating a data frame for people
where
let's say each person is an instance
and properties about each person such as
name
age child
or if the person has a child would
become the variables so
if we have such kind of information we
cannot easily store that in matrix or
list
now data frames can be used for such
cases
now it's a fundamental data structure
to store data sets pretty similar to
matrix as it has rows and columns and
here
rows correspond to observations now here
we can talk about every individual or
every person
columns correspond to variables that is
properties for each person
now difference
between your data frame and matrix is
that data frames can contain elements of
different data types
so for example we can have one column
being character
other being numeric and yet another
being logical or numeric
so restriction is that elements in one
column should be of the same data type
now how do we work with data frames
let's see some examples so what we can
do is we can bring up our r
so when we talk about data frames
usually we don't create data frames
ourselves we import data from data sources
such as csv file
or rdbms
or even your excel or spss and then we
create data frames now of course r has
ways to manually create data frames
using data dot frame function
so we can create three vectors first and
then we can pass in those vectors to
create our data frame so let's do that
so let's say name
and here i will use the assignment
operator which we have learnt earlier
and then i'll use c and then i can give
some names here so let's say john
and let's say peter
let's say patrick
and let's say
julie
and let's also give one more name
so let's say
bob
so this is the vector which we are
creating and we can check
this is the vector which we have created
now obviously you can do a class
and you can check what is this
and that says it is a vector of
character
now similarly we can create one more
vector which is age
and let's give some numbers here so for
example let's say 28
and 30 31
38
35
and these are the values for the age so
age is also created similarly we can say
if each person has children so we can
say
children and then i'll create one more
vector and here i'll give values which
are logicals
i'm not going to give any numerics or
character but i'm using logicals here so
if a particular person has children or
no
so let's have this
vector created
and now we have three vectors that is
name age and children and we can use
this to create our data frame so we can
just call our data frame as df
and what we can do is we can use data
dot frame function
and then what we can do is we can pass
our vectors within this such as name
age
children
and that should create my data frame
let's have a look at this and this shows
me
that the data frame is created now
column names are inferred from variables
which are passed to data dot frame
function so the variables which we have
passed to our data dot frame function is
name age and children and those become
the column headings for my data frame
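Here is a minimal sketch of the steps just described. The names and ages are the ones typed in the walkthrough; the values in `children` are not spelled out there, so the ones below are assumed for illustration.

```r
# vectors from the walkthrough; the children values are assumed
name <- c("John", "Peter", "Patrick", "Julie", "Bob")
age <- c(28, 30, 31, 38, 35)
children <- c(FALSE, TRUE, TRUE, FALSE, TRUE)

class(name)      # "character"
class(age)       # "numeric"
class(children)  # "logical"

# column names are inferred from the variable names passed in
df <- data.frame(name, age, children)
print(df)
names(df)        # "name" "age" "children"
```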
now what we could have also done is we
could have created it in a different way
so i could have said df
and then i could have used my data dot
frame function
and in data dot frame function i could
have said name is going to be name
age
would be age
and then i could say
children
could be
children
and i could do this and this is also one
more way where i'm creating a data frame
and in this way
we can now have
rows of data frames
like in matrix so this is also one way
of creating a data frame to look into
the data frame structure we can always
use str and then we can pass our data
frame and this basically prints out
similar to that of list
so
we also need to know that under the hood
data frame is a list and in this case
this is a list with three elements
so each list element is a vector of
length five corresponding to the number
of observations
if we create data frame with vectors not
of same length we would get an error
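The points about structure can be checked with a short sketch like the one below. It uses the explicit `name = name` form of `data.frame`, assumed values for `children`, and a deliberately short `age` vector to trigger the length error.

```r
# children values are assumed for illustration
name <- c("John", "Peter", "Patrick", "Julie", "Bob")
age <- c(28, 30, 31, 38, 35)
children <- c(FALSE, TRUE, TRUE, FALSE, TRUE)

# explicit column names, same result as data.frame(name, age, children)
df <- data.frame(name = name, age = age, children = children)

str(df)       # compact structure, printed much like a list
is.list(df)   # TRUE: a data frame is a list under the hood
length(df)    # 3 list elements, one per column
lengths(df)   # each element has length 5

# vectors whose lengths do not fit together raise an error
result <- tryCatch(
  data.frame(name = name, age = c(28, 30)),
  error = function(e) conditionMessage(e)
)
print(result)
```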
now here when we look at our data frame
we know that name is a column so name
column which is character
would be stored as a factor instead of
character in r versions before 4.0
to suppress this behavior we can always
use an argument that is strings as
factors equals false so what i can do is
i can do a data frame
like this
use my data dot frame function
and then basically we can pass in our
vectors that is name
age
and
children
and then what i can do is i can say
strings as factors and set this value to
false
so if i do this and now if i look at my
data frame structure
sorry
here
now let's look at this one and this one
shows me that
unlike your earlier one now we are
creating a data frame where our name
would be containing characters
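A small sketch of the `stringsAsFactors` argument, setting it explicitly both ways so the result does not depend on the R version in use. The data frame names `df_chr` and `df_fct` are made up for this example.

```r
name <- c("John", "Peter", "Patrick", "Julie", "Bob")
age <- c(28, 30, 31, 38, 35)

# keep strings as plain character vectors
df_chr <- data.frame(name, age, stringsAsFactors = FALSE)
class(df_chr$name)   # "character"

# convert strings to factors
df_fct <- data.frame(name, age, stringsAsFactors = TRUE)
class(df_fct$name)   # "factor"

str(df_chr)
str(df_fct)
```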
there also by default it was showing as
character because in recent versions of r
this value by default is set to false
otherwise it would have converted characters to
factors as we say now how do we do a
subset and extend and sort data frames
in r
so as we have learned so far in brief
about your data frames
so
data frame is somewhere like an
intersection between matrices and lists
so if you would want to subset a data
frame we can always use
the square brackets and in that we can
use the single square brackets which are
from matrices or we can use
double square brackets
from list or we can also use the dollar
symbol
so that all these things can be used to
subset the data frame so let's use our
data frame which contains information
about people
so we can select single element from our
data frame so here
what we can do is
we can just say df and then i can use a
single bracket and i can just do a three
comma two
so it would be good if we can first
print the data frame and that's my value
and now let's do a single bracket and
let's look at this one so this tells me
that
we are
using the row index first which is
number three
which shows me that we would be going to
the row number three and then we point
or pass in our column index that is
number two
so we could have done it in a different
way also so we could have done df
and then give it row index and then give
the column name which you are interested
in looking at and that also gives me the
value so just like matrices we can
choose to omit one of two indices to end
up with entire row or entire column
and for example if we would be
interested in looking for information
for patrick what i could have done is i
could have just said
df 3 comma and this is showing me the
entire row
now always remember whatever results we
see here that is
giving me a data frame with a single
observation because there has to be a
way to store different data types and
that's why the result is also a data
frame
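The single-bracket selections above can be sketched like this. The data frame is rebuilt here so the example runs on its own, and the `children` values are assumed.

```r
df <- data.frame(
  name = c("John", "Peter", "Patrick", "Julie", "Bob"),
  age = c(28, 30, 31, 38, 35),
  children = c(FALSE, TRUE, TRUE, FALSE, TRUE),  # assumed values
  stringsAsFactors = FALSE
)

df[3, 2]        # row 3, column 2: 31
df[3, "age"]    # same element, column chosen by name

# omit the column index to get the whole row for patrick
patrick <- df[3, ]
print(patrick)
is.data.frame(patrick)   # TRUE: a row mixes types, so it stays a data frame
```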
what we can also do is to get entire age
column we can just use our data frame
and then
we can pass in the column name here
like this and that gives me
just the column now here
the point to notice is result is a
vector because columns contain elements
of the same type
in previous example we were seeing a row
and in that row was
not a vector it was a data frame because
values were of different data types now
subsetting a data frame that results in
a data frame
and contains multiple observations can
also be done by doing something like
this for example i will do df
and then i will say
let me get
rows 3 and 5 that is c 3 comma 5
and
then i can just say
age and children for example
so let's say age
and
children
and i can be pulling out the values in
this way
so i could also be
just getting
the results in the age column if i'm
interested in by just saying df
and here i can just pass in the column
number and that also gives me the age
column
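As a rough sketch of the column selections just shown, again with assumed `children` values and a hypothetical variable name `sub`:

```r
df <- data.frame(
  name = c("John", "Peter", "Patrick", "Julie", "Bob"),
  age = c(28, 30, 31, 38, 35),
  children = c(FALSE, TRUE, TRUE, FALSE, TRUE),  # assumed values
  stringsAsFactors = FALSE
)

df[, "age"]   # whole age column, returned as a vector
df[, 2]       # same column, chosen by position

# several rows and several columns: the result is a data frame
sub <- df[c(3, 5), c("age", "children")]
print(sub)
```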
now we know data frame is a list
containing vectors of same length
this means we can use list syntax to
select elements also and
what we can do is we can use our dollar
symbol and then choose the column name
and this is also one way wherein you can
pull out the values or you can use
double brackets as i mentioned earlier
and pass in the column name
so that's also fine
or you can give a column number
and that also would work and in all
these cases result is a vector
now with single brackets you can still
do it but always remember if you use single
brackets then that will result in a data
frame
so the result
is a data frame here
and
what we are seeing here is
a list which contains only age column
having the data elements
so these are different ways in which you
can do a subsetting of a data frame now
using single brackets or double brackets
can have serious consequences so we need
to always think about what we are
dealing with and how are we handling it
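The difference between the list-style selectors is easy to see side by side. This is only a sketch with assumed `children` values.

```r
df <- data.frame(
  name = c("John", "Peter", "Patrick", "Julie", "Bob"),
  age = c(28, 30, 31, 38, 35),
  children = c(FALSE, TRUE, TRUE, FALSE, TRUE),  # assumed values
  stringsAsFactors = FALSE
)

# dollar symbol and double brackets return the column as a vector
df$age
df[["age"]]
df[[2]]

# single brackets keep the data frame wrapper, with one column inside
age_df <- df["age"]
class(df[["age"]])   # "numeric"
class(age_df)        # "data.frame"
```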
now what we can also do is we can extend
our data frames that is we can add
variables we can add
columns that is adding variables or we
can add rows which are nothing but
observations
so adding columns is like adding new
elements to the list and for which we
can obviously use dollar or double
brackets say for example now this is my
data frame and if we would want to add
height
whose information is in a vector so
let's say height
let's create a vector here and this one
is what i would want to add for each
person so let me do this
and let me pass in some values here
and
the last one
something like this so this is a vector
created now what i can do is
so our data frame is called df so we
can say df dollar
height
and then i will pass in this vector here
and now if i look at my data frame you
see
the fourth column has been added and
that's my height column
now what i can do is
i could have done it in a different way
basically
if i had my data frame i could have just
done df double brackets and then give it
a name
and then
i could have passed my vector in this
way however so this is also one way of
doing it we have already added the
column so we don't need to repeat the
step now what we can also do is we can
use
a
c bind function and if you remember c
bind that is for column binding so for
example let's create a weight vector now
and let's pass in some values here so
for example let's say 75
65 54 34
78
and these are my values of weight now
what i can do is i can just do a c bind
and then pass in my data frame and then
pass in this vector
and in this way i'm just adding columns
or i'm extending my data frames by
adding more columns to it now obviously
if we can use c bind then we can also
use r bind to add new rows
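Before moving on to rows, here is a sketch of the column steps above. The weight values are the ones read out in the walkthrough; the height and `children` values are assumed, and `df_wide` is a made-up name for the `cbind` result.

```r
df <- data.frame(
  name = c("John", "Peter", "Patrick", "Julie", "Bob"),
  age = c(28, 30, 31, 38, 35),
  children = c(FALSE, TRUE, TRUE, FALSE, TRUE),  # assumed values
  stringsAsFactors = FALSE
)

height <- c(163, 177, 163, 162, 157)  # assumed values
df$height <- height          # add a column with the dollar symbol
# df[["height"]] <- height   # equivalent, using double brackets

weight <- c(75, 65, 54, 34, 78)
df_wide <- cbind(df, weight) # returns a new data frame, df itself is unchanged

names(df)       # "name" "age" "children" "height"
names(df_wide)  # "name" "age" "children" "height" "weight"
```

Note that `cbind` does not modify `df` unless the result is assigned back to it, which matters for the row-binding step that follows.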
so
for r bind creating a new vector won't
work
because we need to create a new data
frame with one single observation
remember row
will have values of different data types
so we cannot create a vector we have to
create a new data frame and then we can
add it using r bind
so let me create a data frame here for
example let's call data frame as tom
and let's pass in some values here so i
will say data dot frame function
and then let's give name
what we can do is we can give age
then we can give the logical value
then we can give say height
and since we have added weight let's
also add weight
and this is my data frame now we can use
r bind function so i can say r bind
and then i can pass in my data frame and
this new data frame which we have
created
and
this tells me
that
the number of columns of arguments do
not match so we will have to check this
one
so we
have
our data frame which has just height so
it does not have the weight
that was only as the result of c bind so
let's create tom again without
weight
and now let's do a r bind
and let's again check what is the reason
here
so this is height
and let me just check this
so to look at this
this is the error we were getting
because i was creating a data frame with
four columns and then i was trying to
add that to a data frame which had three
columns now yes we had done a c bind and
c bind was showing us the fourth or
fifth column but the original data frame
only had three columns
so what i did here was i did tom
and then basically
i
created a data frame with three columns
which matches with my original data
frame
which had three columns and then i could
use r bind to basically add one more row
so what we did was we used r bind and r
bind was used to add a new row to our
data frame
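A minimal sketch of the row-binding step and of the column mismatch error, assuming a three-column data frame as described above. Tom's values and the `children` values are made up for illustration.

```r
df <- data.frame(
  name = c("John", "Peter", "Patrick", "Julie", "Bob"),
  age = c(28, 30, 31, 38, 35),
  children = c(FALSE, TRUE, TRUE, FALSE, TRUE),  # assumed values
  stringsAsFactors = FALSE
)

# a one-row data frame with one column too many fails to bind
tom_extra <- data.frame(name = "Tom", age = 37, children = FALSE,
                        height = 183, stringsAsFactors = FALSE)
msg <- tryCatch(rbind(df, tom_extra),
                error = function(e) conditionMessage(e))
print(msg)

# matching columns: the new observation is appended as row 6
tom <- data.frame(name = "Tom", age = 37, children = FALSE,
                  stringsAsFactors = FALSE)
df2 <- rbind(df, tom)
print(df2)
```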
now when it comes to sorting or ordering
your data frame say for example we want
to sort data frame by age
now how do we do that so we could easily
do sort
df
and then select
our column and we could just do a
sorting now if we do this it is good but
not really what we need because it only returns the sorted ages and not the reordered data frame
now other clear way of doing that would
be using ranks so for example if i do a
ranks and instead of doing a sort i
would use order
and then basically pass in my column so
i would say df
and then i would use age
now in this case
if i look at ranks it shows me
a vector giving the position of each element
in sorted order now if i do a df dollar age
it shows me
the values and if you look at the ranks
it will tell
21 or here the lowest value
is 28
and that's the lowest value and that's
why we see as rank as one and so on we
can look at the ranks so what we can
also do is we can just do a df
and then basically use ranks and we can
just look at the result so this shows
data frame which is
a ordered data frame now based on ranks
now if we would want to do it in a
descending order what we could also do
is we could do a df
and then use order
and within order i will basically pass
my data frame i will choose my column
and then i could say decreasing
equals true
and i could do this
and here this could show me the value so
it says undefined column names so what i
would have to check is what is my data
frame here so we have age
and then what we would have to do is we
would have to select a particular column
so let's do that
and here i have just selected the column
and then there is a comma missing that
was showing an error so now we can have
the data
ordered in a descending way so there are
dozens of packages such as dplyr and data.table
which can help you
manipulate filter merge and sort your
data frames so this is in brief about
the data frames
working with data frames subsetting them
and also sorting the data in your data
frames
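To recap the sorting part, here is a sketch using the five original people (so without the extra row), with assumed `children` values. The names `df_asc` and `df_desc` are made up for this example.

```r
df <- data.frame(
  name = c("John", "Peter", "Patrick", "Julie", "Bob"),
  age = c(28, 30, 31, 38, 35),
  children = c(FALSE, TRUE, TRUE, FALSE, TRUE),  # assumed values
  stringsAsFactors = FALSE
)

sort(df$age)             # only the sorted ages, not the data frame

ranks <- order(df$age)   # row positions in ascending order of age
print(ranks)             # 1 2 3 5 4

df_asc <- df[ranks, ]    # note the comma: rows are reordered, all columns kept
df_desc <- df[order(df$age, decreasing = TRUE), ]
print(df_asc)
print(df_desc)
```

Leaving out the comma, as in `df[order(df$age, decreasing = TRUE)]`, asks for columns instead of rows and fails with an undefined columns error.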
now one more important
type of object in your r is vector and
that really helps us in
various ways so let's see how we work on
vectors here so to create a vector we
can use the c function and pass in the
values those will be the objects or
elements within the vector
and then you can look at the value of
the vector or also at the class of it
which tells me the values are numeric
now in case of vector all the values
have to be of the same type
or belong to the same class we can say
so here we are creating a vector looking
at the value of it and then looking at
the class which says the values passed
in here are character
similarly we can do it for logicals
that is true false and then look at the
value of this and this class is logical
now what we can also do is we can print
all the three vectors at once and here
we will use semicolon to separate two or
more variables
and we can pull out the values of all
the vectors which we see here
now what happens if
we pass in the values which belong to
different classes or you can say
different data types so within a vector
if you do that there is something called
as coercion which takes place which will
convert all the values into one type and
in this case it has converted everything
into character
similarly
we can pass in values wherein we can
pass logical and numeric and in this
case it's not going to go for character
it is going to convert everything into
numeric
now if i had done this where i passed a
character and numeric
and if you look at this then it has
converted everything into character so
character always takes a precedence if
it is one of the values of vector
and you have other values which are not
characters then in that case coercion
will happen
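The vector basics and the coercion rule can be sketched as follows. All values and variable names here are made up for illustration.

```r
v_num <- c(1, 2, 3)
v_chr <- c("a", "b", "c")
v_lgl <- c(TRUE, FALSE, TRUE)

class(v_num)   # "numeric"
class(v_chr)   # "character"
class(v_lgl)   # "logical"

# a semicolon separates several statements on one line
v_num; v_chr; v_lgl

# mixing types triggers coercion to a single type
mixed_chr <- c(1, "a", TRUE)   # character wins: "1" "a" "TRUE"
mixed_num <- c(1, TRUE, FALSE) # logical becomes numeric: 1 1 0
class(mixed_chr)
class(mixed_num)
```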
there is one more way of creating a
vector and that is by providing a range
to your c function so we can do that
here
wherein i said c
1 colon 20 and then basically look at
the value of vector 7 so it shows me all
the values starting from 1 till 20
however there is one more way you can
use the sequence function to do the same
thing
now
i could avoid the bracket i could avoid
the c function and i can straight away
pass a range and that is also fine to
create our vector starting from 1 ending
at 25
so what if i want to create a vector
with odd values between 1 to 20. now in
this case i am going to say how many
values to skip or to jump so i'm
creating a variable called odd value i'm
using sequence function
and then to that i'm passing the
beginning number the ending number and
then the skip or the jump
and now if you look at the values it
shows me only the odd values
well you could have done the same thing
to get even values and that's not very
complicated so you can start from 2
and then you can do skip wherein
after 2 it basically gives you
every second value so we are looking at
the even values and this is how you can
create a vector which is having odd or
even values
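A short sketch of ranges and the `seq` function as used above. The variable names are assumptions; the walkthrough only names `odd value`.

```r
v7 <- c(1:20)   # a range wrapped in c()
v8 <- 1:25      # the range alone works just as well

# from 1 to 20, jumping by 2: odd values
odd_value <- seq(1, 20, by = 2)
print(odd_value)

# start from 2 instead to get even values
even_value <- seq(2, 20, by = 2)
print(even_value)
```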
now what if you want to create a vector
with 10 odd values starting from 20 so
you are basically giving a length so
here you can say from where you would
want to start
what is your skip and then the length of
the vector which tells me it gives me 10
odd values
beginning from 20
or from 20 onwards that is we take it
from 21. now one of the
requirements is always to name the
values so that we can access the values
either by indexing or by their name
which have been assigned to the values so
let's see that so let's create a vector
which is called temperature so variable
is temperature pass in the values to
this
look at the values of temperature now
what we would want to do is we would
want to assign these names to each value
which makes it more readable more
accessible
so i can use the names function
pass in my temperature as a vector to
names function and then assign the names
to
each value of temperature now if you
look at temperature it shows me the
names which have been assigned well we
could have done it in a different way we
could have created a vector of names
something like this
and then
what i could have done is i could have
created one more vector such as
temperature and instead of assigning
values we could have assigned the vector
to our
existing vector so if you do this so you
are assigning the names vector to the
temperature 1 and now look at the values
it still does the same thing
so this is where you are assigning names
to every value
of your existing vector
now there is one more way and that is
using your sequence so here i am
creating a sequence which starts with
100 and ends at 220 with a skip of
20 values
or every jump would be 20 values so
let's do that
use your names function on price
and then what i'm going to do is i'm
going to use my paste 0 function
which takes p
and then
1 to 7 as the values so we know paste 0
basically skips the space
and we are going to assign those values
to
as names to price
and now let's look at our price so that
basically gives me the names as we
desire so these are some smarter ways of
assigning names to
every element or every object within
your vector
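The three ways of naming elements can be sketched like this. The temperature readings and day names are assumed, and the price sequence assumes an end value of 220 so that it has seven elements to match the seven names.

```r
# assumed readings, one per day
temperature <- c(72, 71, 68, 73, 69, 75, 76)
names(temperature) <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
print(temperature)

# same idea, with the names kept in their own vector
days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
temperature1 <- c(72, 71, 68, 73, 69, 75, 76)
names(temperature1) <- days
print(temperature1)

# generate the names: paste0 joins "P" and 1..7 without a space
price <- seq(100, 220, by = 20)
names(price) <- paste0("P", 1:7)
print(price)
```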
now how do we perform some basic
operations let's have a look
so let's create a vector passing in the
values
and then you can simply do
an addition on two vectors where each
element is getting added to the corresponding
element of the other vector
you can
subtract two vectors that is element to
element subtraction
element to element multiplication or
division
and you can basically perform operations
on the vectors now how do we use some
inbuilt basic math functions and that's
pretty easy
this is my vector now let's do a sum
which sums up all the elements let's
find out a standard deviation for all
the values let's find out the variance
for all the values here let's do a
product of vector values find the
maximum or find the minimum value so
these are some basic inbuilt math
functions which sometimes are useful in
our data science or data analysis kind
of activities
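A sketch of the element-wise operations and the built-in math functions, using two small made-up vectors:

```r
v1 <- c(4, 6, 8)   # assumed values
v2 <- c(1, 2, 3)   # assumed values

v1 + v2   # 5 8 11
v1 - v2   # 3 4 5
v1 * v2   # 4 12 24
v1 / v2   # element to element division

sum(v1)   # 18
sd(v1)    # 2
var(v1)   # 4
prod(v1)  # 192
max(v1)   # 8
min(v1)   # 4
```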
now one more requirement might be
comparing the vectors
using comparison operators
and this is where i create a vector 1
create a vector 2 and let's find out the
values in v1 which are smaller than v2
values and that gives me the logicals as
the response that is false true and
false
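The comparison can be sketched with two made-up vectors chosen so that the first result matches the false, true, false pattern mentioned above. The other comparison operators work the same way.

```r
v1 <- c(3, 4, 5)   # assumed values
v2 <- c(3, 5, 2)   # assumed values

v1 < v2    # FALSE  TRUE FALSE
v1 > v2    # FALSE FALSE  TRUE
v1 != v2   # FALSE  TRUE  TRUE
v1 == v2   #  TRUE FALSE FALSE
```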
similarly you can do v1 greater than v2
or you can say where v1 values are not
equal to v2
or equal to v2 so these are some simple
comparison examples now i can create a
different vector and then i can find out
individually if the elements in the
vector are lesser than 3 by just doing a
v
lesser than 3 so it compares each
element with this so you are actually
using one scalar value to compare it
with all the elements and you can do
that it gives you the logicals so you
can
also
be doing slicing and indexing on vectors
and this is very much important when you
are storing your data in vectors how do
you access them so let's create a vector
using sequence
let's give it some names as we have seen
in past
and let's look at our price one so that
tells me the name and the values now
you can access the elements using
indexing so let's get the third element
and it shows me 590. remember the
indexing here starts with one unlike
other programming languages like python
where indexing starts with zero now i
can also get the third and fourth value
by doing a three colon four i can also
specify the vector
and say one comma four and that shows me
the first and the fourth position or
second or sixth position so this is one
way where you are using indexing to
access the elements
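Here is a sketch of positional indexing. The exact sequence behind `price1` is not stated in the walkthrough, so the one below is an assumption picked so that the third element is 590.

```r
# assumed sequence: 550 570 590 610 630 650 670
price1 <- seq(550, 670, by = 20)
names(price1) <- paste0("P", 1:7)
print(price1)

price1[3]         # third element, indexing starts at 1
price1[3:4]       # third and fourth
price1[c(1, 4)]   # first and fourth
price1[c(2, 6)]   # second and sixth
```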
similarly i can give the names now
that's where we see the benefit of
giving names to every element so i can
use c
function pass in the name and look at
the value
for that particular name or selectively
select different columns or different
names
or we can also use this square bracket
wherein we pass the names
so sometimes it can also be useful to
use logical positioning that is we would
want to
find out the logical position if the
value exists and we can do that
or using true and false and then look at
the values
so
there is one
useful
way where you can exclude a particular
position might be that is an n a value
might be a value which you are not
interested in and that's where you will
say minus 2
which will skip the p 2 value or minus 2
and minus 5
where we are skipping p 2 and p 5
and we can exclude particular values
from our vector
now how do we do a comparison operator
on the values of vector so you can just
say price 1 and i would want all the
values which are greater than 600
or you can assign this to a filter and
then basically
pass in the filter for your
vector
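The remaining ways of picking elements (by name, by logical position, by exclusion, and by a comparison) can be sketched as follows. The `price1` values are assumed as before, and `keep` and `filter_600` are made-up names.

```r
# assumed sequence: 550 570 590 610 630 650 670
price1 <- seq(550, 670, by = 20)
names(price1) <- paste0("P", 1:7)

price1["P3"]             # by name
price1[c("P3", "P5")]    # several names

# logical positions: TRUE keeps the element, FALSE drops it
keep <- c(TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE)
price1[keep]

price1[-2]               # everything except P2
price1[c(-2, -5)]        # everything except P2 and P5

price1[price1 > 600]     # values greater than 600
filter_600 <- price1 > 600
price1[filter_600]       # same result through a stored filter
```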
so these are some simple basic
operations which you can run
using your r programming where you would
want to manipulate where you would want
to store some data and extract that data
use your different logical operators or
other operators and perform your basic
easy computations
now that we have seen some basic
operations using r let's look at some
more
operations when you're working with
vectors such as one of the common issues
is handling the missing values now here
we are
assigning a vector
to a
variable order detail
and this one has a missing value now
let's see how this is handled and you
see all the values in the vector are
assigned what you can also do is you can
assign names
as we have seen earlier by using the
names function
and then look at the value of order
detail so you see the names and these
are your missing values which are also
taken care now what we can also do is we
can perform an operation on a particular
vector which will be applied to all
values of the vector so for example here
i will just add a scalar value plus 5 to
the elements in the vector
and that shows me number five has been
added to each element or each object in
the vector
now if you would want to work on two
vectors for example to add two vectors
let's create a vector called new order
and then
let's add it to order detail now in this
case
what we are doing is we have a vector
which is 5 and 10
and what we are doing is we are adding
values to order detail now our order
detail earlier was 10 20 30 n a 50 and
60
and what i have done is i have passed in
a vector which is 5 and 10 and you are
adding it to the elements so 5 gets
added to 10
and then your value 10 gets added to 20
and then you have again 5 which is added
to 30 now you cannot add in anything to
a missing value so that remains as it is
then you add again 5 to 50 and then 10
is added to 60. so in this way you are
adding two vectors which are not of same
length and the shorter one is recycled
now what i can also do is
i can update the order
by doing this
so i'm creating an update order
and now let's look at the value of
update order what does it show
so you are basically doing the same
thing
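A sketch of the missing value example, using the order detail values mentioned above. The element names are assumed, since the walkthrough does not say what they are.

```r
order_detail <- c(10, 20, 30, NA, 50, 60)
names(order_detail) <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat")  # assumed names
print(order_detail)

order_detail + 5     # 5 is added to every element, NA stays NA

# the shorter vector is recycled: 5 10 5 10 5 10
new_order <- c(5, 10)
update_order <- order_detail + new_order
print(update_order)  # 15 30 35 NA 55 70
```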
so
if you would want to work on a subset of
vector how do you do that so here you
are using some indexes so i'm saying
order detail
and this is my order detail
so let's take one colon two and assign
it to first two so if we look at the
value which is assigned to first two we
have just sliced and added a subset of
vector to this one and if i would want
to take the length
of order detail it shows me
the length here which is six elements
here
including the missing value also
what we can also do is
we can do some more operations so for
example from order detail what i'm doing
is i'm saying length minus 1
and then
up to the length so let's do this and
let's see the result of this so what we
have done is
we had our order detail which had
these values
and what we have done is we have said
length minus 1
colon length so you have taken these two
elements and you have assigned that to
your v1
similarly we can do length minus 1 and 2
elements so i can do this and now let's
look at the value of v2 so this shows me
the value where you are taking length
minus 1 and then you are taking it till
the second position of the index element
which is 20 so you are getting in the
values here so you get
your
50
n a 30 and 20 because you started with
length minus 1 and up till the second
index position
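The slicing steps so far can be sketched like this. The variable names `first_two`, `n`, `v1` and `v2` are partly assumed, and the parentheses around `n - 1` are needed because the colon operator binds tighter than minus.

```r
order_detail <- c(10, 20, 30, NA, 50, 60)

first_two <- order_detail[1:2]   # 10 20
n <- length(order_detail)        # 6, the missing value is counted too

v1 <- order_detail[(n - 1):n]    # last two elements: 50 60
v2 <- order_detail[(n - 1):2]    # counting down: 50 NA 30 20

print(first_two)
print(v1)
print(v2)
```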
similarly we can use the length and we
can take it from
this element and let's look at the value
of v3 so that shows me that i'm i'm
doing some slicing or i'm getting subset
of my vector so similarly you can also
do this one so v4
and let's do this and then let's look at
the value of v4
so it gives me the values based on our
subsetting or slicing now you can
extract all the values below 30 and this
is where you are doing a comparison so
you will take your vector and then
you would want to compare each value if
it is less than 30 and you would want to
take all the values here so it gives me
the logicals or the response for all the
values which are lesser than 30
what we can do is
we can also
use the square brackets and do this this
will show me the actual values here we
were just getting the logicals but here
we are getting the values
now to omit any value from the vector we
can use n a dot omit
and this one will help me
in getting rid of the n a values plus i
am also checking the values if they are
less than 30 and then
i am basically doing
using this n a dot omit
so you can do something like this you
can look at the values what you can also
do is you can find the order details
that are multiples of three and here we
would want to use modulus and we would
want to find out if the remainder is
zero then i am getting the numbers which
are
divisible or multiples of 3. so let's do
this
and it gives me again the logical values
of all the values which are divisible by
3 giving us a remainder of 0
or if you would want to look at the
values then you can say order detail
open up a square bracket and then pass
in
your condition
now we can then omit
any n a from this one and then we can look
at the values
so this is simple way where you are
subsetting a vector or extracting the
values which you are interested in which
might be one of the requirements of
your data wrangling or data manipulation
or just data extraction
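Here is a sketch of the extraction steps with the same order detail values. The result names `below_30` and `mult_3` are made up for the example.

```r
order_detail <- c(10, 20, 30, NA, 50, 60)

order_detail < 30                  # logicals, NA where the value is missing
order_detail[order_detail < 30]    # actual values: 10 20 NA

# na.omit drops the missing entries from the result
below_30 <- na.omit(order_detail[order_detail < 30])
print(below_30)

# multiples of three: remainder of division by 3 is zero
order_detail %% 3 == 0
mult_3 <- na.omit(order_detail[order_detail %% 3 == 0])
print(mult_3)
```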
now i can also use a sum function
now
if we do this it returns n a because
there is already a missing value and you
cannot do a sum on the values
now what i can do is i can do a n a dot
r m
to remove the n a values
so i can do a sum on order detail where
i intend to add up all the values but
what i also want to do is i want to
remove the n a value so i'm giving it a
value as true and then if i do it it
gives me the sum of all the values
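A quick sketch of the `na.rm` argument on the same vector:

```r
order_detail <- c(10, 20, 30, NA, 50, 60)

sum(order_detail)                  # NA, because of the missing value
sum(order_detail, na.rm = TRUE)    # 170
mean(order_detail, na.rm = TRUE)   # 34
```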
similarly you can do a mean you can do a
maximum you can find out the minimum
value standard deviation or even square
root now these are some simple
operations what we are doing on vector
where we are interested in extracting
some specific values now let's look at
matrix which we have also discussed and
matrix is also one way where you can use
the matrix function to create a matrix
which is multi-dimensional so for
example if i do this and if i look at
the value of v
i get a vector which starts with a value
of 20 ends with 30 and at any point of
time you can convert this to matrix so
first we created a vector and now i'll
create a matrix out of it wherein i am
seeing the row numbers i am seeing the
column number and i am seeing the values
in that particular column
so
you have already done that now let's
take it to the next level so let's
create a matrix wherein we are using the
matrix function we will say 0 comma 3
comma 3 and now let's look what it has
done so you have created a matrix which
is of
three columns and three rows and by
default the row number and column
numbers have been assigned to them we
can also create a matrix by passing in
values so we can say 1 colon 9 and then
give the dimensions that is number of
columns is 3 number of rows is 3
and if i look at the matrix now i have
passed in the values to my matrix
sometimes you may want to arrange the
data in a matrix for particular kind of
calculations
you can also use n row and by row
so
you can say how many number of rows you
would want
and you would want to assign the data
row wise so when we are doing this now
if you notice the difference between the
previous one where we just gave the
values and we said three rows and three
columns so it was doing it column wise
so one two three four five six seven
eight nine but here we said by row is
true so it has arrange the values in a
row wise fashion so it goes one two
three four five six and seven eight nine
similarly i could have just done this by
giving the dimension and selecting by
row and if i do this it is still doing
the same thing
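The matrix creation steps can be sketched as follows. The variable names other than `v` are assumptions.

```r
v <- 20:30
m <- matrix(v)   # vector turned into a one-column matrix
dim(m)           # 11 1

zeros <- matrix(0, 3, 3)   # 3 rows, 3 columns, filled with 0

# filled column by column (the default)
by_col <- matrix(1:9, nrow = 3, ncol = 3)
print(by_col)

# filled row by row
by_row <- matrix(1:9, nrow = 3, byrow = TRUE)
print(by_row)
```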
now what we can also do is we can create
matrix using vectors
so here let's create
a vector stock one and then stock2
now
we would want to merge both the vectors
so you can always do a c function and
then create a new vector that is stocks
which is a merged result of stock one and
stock two and let's look at the results
so that's my stocks that's a vector and
now what i would want to do is i would
want to create a matrix
using the stocks so i'm giving it a name
that is stock dot matrix i'm using the
matrix function wherein i will pass my
vector
i will say by row so i want the values
to be arranged row wise and i'm also
selecting the number of rows
so if you look at this one so the values
which we had in our stock
which was all the values have now been
arranged row wise and in two rows so it
starts with 450 51 52 45 and 68 that's
my first row and the rest five values
are arranged in the second row so one of
the main requirements is instead of
going for default
column names and default row names we
can give specific names to our columns
and rows to make more sense to the data
how do we do that so we can basically
say days
so this is a vector which we are
creating
and then what we want to do is we want
to create a new variable which is stock
1 and stock 2.
now
this is for my columns and this will be
for my rows now how do we assign that so
we can say column names and this is
where i will say on my stock dot matrix
i will assign days which has five values
and that will become my column names and
similarly using row names function i can
basically assign row names to my
matrix so if i look at my matrix now it
shows me the column names and row names
which we have assigned or which we have
passed to our matrix
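A sketch of building and naming the stock matrix. The stock prices, the day labels and the name `st_names` are assumed, since the transcript does not give them clearly.

```r
stock1 <- c(450, 451, 452, 445, 468)   # assumed values
stock2 <- c(230, 231, 232, 236, 228)   # assumed values

stocks <- c(stock1, stock2)            # merge both vectors into one
stock.matrix <- matrix(stocks, byrow = TRUE, nrow = 2)
print(stock.matrix)

days <- c("Mon", "Tue", "Wed", "Thu", "Fri")
st_names <- c("Stock1", "Stock2")

colnames(stock.matrix) <- days
rownames(stock.matrix) <- st_names
print(stock.matrix)
```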
now there are different functions which
are associated with the matrix and let's
look at some examples so these are some
simple basic examples now if i say
let me find out the number of rows and
that gives me the number of rows or
number of columns or get a dimension
that is the number of rows and columns
of your matrix
now
we might be just interested in getting
the row names or column names or even
the dimension names which basically will
return
both the row and column names
so in this way you can use these simple
functions which are associated with
matrix to extract information about your
matrix or data which has been
transformed into matrix to pull out some
information about that
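These helper functions can be tried on a small named matrix. The values and labels below are assumed for illustration.

```r
stock.matrix <- matrix(
  c(450, 451, 452, 445, 468,
    230, 231, 232, 236, 228),          # assumed values
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Stock1", "Stock2"),
                  c("Mon", "Tue", "Wed", "Thu", "Fri"))
)

nrow(stock.matrix)       # 2
ncol(stock.matrix)       # 5
dim(stock.matrix)        # 2 5

rownames(stock.matrix)
colnames(stock.matrix)
dimnames(stock.matrix)   # list with row names first, column names second
```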
one of the requirements
which data scientist or data analyst
might face is
carrying out
arithmetic operations on your matrix now
what we can do is we can create a matrix
which takes values 1 to 50. we want to
arrange it by rows and we will say
number of rows is 5 so that's my values
starting from 1
now i can do a addition
here by just doing a 5 plus mat 1 and if
you notice number 5 as a scalar value
has been added to
every element of the matrix
similarly you can do a multiplication
you can do a division
you can basically return the quotient if
you would want to do that or go for
exponential values so you can perform
simple arithmetic operations
for every element of the matrix
and what if you want to have
arithmetic operations done on multiple
matrices so let's compute mat one plus mat
one
and we get a total where every element
is added to the corresponding element
you can do a subtraction
you can do a multiplication and you can
get the value so this might be also very
useful when you are working on
multi-dimensional data
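A short sketch of the element wise arithmetic described above, assuming the matrix is called `mat1` as in the narration. Note that `*` between two matrices multiplies element by element; it is not matrix multiplication.

```r
# values 1 to 50 arranged by rows in 5 rows
mat1 <- matrix(1:50, byrow = TRUE, nrow = 5)

# a scalar is applied to every element
plus5    <- 5 + mat1
times2   <- mat1 * 2
half     <- mat1 / 2
quotient <- mat1 %/% 2   # integer quotient
squared  <- mat1 ^ 2

# matrix with matrix works element by element
total <- mat1 + mat1
diff  <- mat1 - mat1
prod  <- mat1 * mat1
```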
you can also
do some more operations on a matrix such
as
returning the sum for each column
or doing a summation at a
row level or taking a mean for
every row you can do that by using these
simple functions
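The column and row summaries mentioned above are available as base R functions. A minimal sketch on the same 1 to 50 matrix:

```r
mat1 <- matrix(1:50, byrow = TRUE, nrow = 5)

col.totals <- colSums(mat1)    # sum for each column
row.totals <- rowSums(mat1)    # sum for each row
row.avg    <- rowMeans(mat1)   # mean for each row
```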
now
you can add rows and columns to a matrix
using r bind and c bind functions
so r bind is for row bind and c bind is
for column bind but for that we have to
first create a vector so
let me create a vector of same length
which will then be
added as a row to my existing
matrix now my matrix has five
columns so let's create a vector with
five elements
and then i can basically add this as a
row to my existing matrix
by doing this and now if i look at my
values i will see the new values
as the third row
and if you also see
the variable name becomes the row name
and we have added a row to our matrix
now similarly
i can find out row means which we have
seen earlier that is calculating the mean or
average of every row so i can do that
and i can find out the value of average
now what i can do is i have got the
average for every row and
what we can do is we can basically
do a column bind
by using a c bind function
and i will say
i'm going to take the total stock which
has three rows and then get the average
and now let's look at the total stock
which shows me the average value which
is the new column which has been added
to the matrix so these are some simple
very simple operations which you can do
but that gives you good insight in what
can be done at a matrix level where your
data is arranged in multi dimensions
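A sketch of adding a row with `rbind` and a column with `cbind`. All numbers and the names `stock3`, `total.stock` and `avg` are assumptions for illustration.

```r
# hypothetical starting matrix with five columns
stock.matrix <- matrix(
  c(450, 451, 452, 445, 468, 230, 231, 233, 236, 235),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Stock1", "Stock2"),
                  c("Mon", "Tue", "Wed", "Thu", "Fri")))

# a vector with five elements becomes the third row;
# the variable name is used as the row name
stock3 <- c(150, 151, 149, 120, 114)
total.stock <- rbind(stock.matrix, stock3)

# row averages, then bind them on as a new column
avg <- rowMeans(total.stock)
total.stock <- cbind(total.stock, avg)

print(total.stock)
```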
now how do we do a selection and
indexing in matrix so in vectors we were
using either names or we were using
positions or we were using indexing now
here let's create a matrix called
student
and we are using the matrix function but
within the matrix function we are using
the c function to create a vector
which will pass in all the values which
also has n a values if you closely
notice
we will split these values into number
of rows is four
so that means the values the number of
values in this vector should be a
multiple of four i am saying columns is
4 and i would want to arrange this data
row wise so i've done that and if you
would want to set the dimension names on
this you can use dim names so what i'm
doing here is
on my student
i am assigning a list
which will basically have these names
which are basically assigned and now if
you look at your student it basically
shows me
the values which were first
applied to the row names that is john
matthew sam and alice
and then you have one more vector which
goes as the column names for the values
so you have not only created a matrix
by using a vector by defining your
dimensions that is number of rows and
columns you have arranged the data in a
row order and what you have also done is
using a list function you have passed in
the values which will be applied as row
names and column names to your matrix
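A sketch of building the `student` matrix as described. Only some scores are stated in the narration (John's row, Sam's and Alice's chemistry and bio, and the maths scores 70 and 75); the remaining numbers are assumed filler.

```r
# 16 values, including NA, arranged row wise into 4 rows and 4 columns
student <- matrix(
  c(20, 30, NA, 70,
    56, 28, 39, 46,   # assumed values
    18, 26, 32, 75,   # first value assumed
    15, 24, NA, 45),  # first and last values assumed
  nrow = 4, ncol = 4, byrow = TRUE)

# first vector in the list gives row names, second gives column names
dimnames(student) <- list(
  c("John", "Matthew", "Sam", "Alice"),
  c("Physics", "Chemistry", "Bio", "Maths"))

print(student)
```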
now how do we extract particular columns
here so we can take our matrix and we
can just say comma 1 and that basically
gives me the values for
john matthew sam and alice and what we
are looking at is the first column
now i can also say from first column
onwards i would want to look at how many
columns so i can do this and now here
i'm selecting first and second column
i can also be
using a vector function here and
that also does the same thing where i'm
saying 1 comma 3 and i'm getting the
values from first and second column so
third is not included here
now if you would want to do row wise
then you have to give the row position
first so if i do a student 1 that gives
me the row values
and this is giving me values for
my student which we are seeing here so
for john we have 20 30 na and 70
and that's what we get here when you do
a row wise operation you can also do a
row wise and how many rows do you want
you can use the vector function to do
that
you can also select or slice out a value
where you are getting an intersection of
row 2 and column 2 and then you can also
start from a particular position and
then onwards get your rows
so these are different ways in which you
are slicing the values from your matrix
by
columns or by rows
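The slicing patterns above can be sketched as follows. The matrix repeats the assumed sample scores so the block runs on its own, and the exact index values used in the video may differ.

```r
student <- matrix(
  c(20, 30, NA, 70, 56, 28, 39, 46, 18, 26, 32, 75, 15, 24, NA, 45),
  nrow = 4, ncol = 4, byrow = TRUE,
  dimnames = list(c("John", "Matthew", "Sam", "Alice"),
                  c("Physics", "Chemistry", "Bio", "Maths")))

first.col  <- student[, 1]         # first column, all students
two.cols   <- student[, 1:2]       # first and second column
same.cols  <- student[, c(1, 2)]   # same selection using a vector
first.row  <- student[1, ]         # all scores for the first row
two.rows   <- student[c(1, 2), ]   # first and second row
one.value  <- student[2, 2]        # intersection of row 2 and column 2
from.row.2 <- student[2:4, ]       # from row 2 onwards
```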
so
at this point of time let me just type
in student here and let's look at the
value of student
and then here we are interested in 3
colon 4
and then 2 column 3 so what does that
give me so you are looking at
third to fourth row so you're looking at
sam and alice
and then you are looking at columns two
and three so that basically gives you
your 26 32 24 and na
so first is you are giving your row
positions or how many rows you want and
then you are giving your column so
similarly you can do this you can say
from row number 2 to 4
and then column wise you can say 1 to 3
so if we do this so this tells me two
columns which is first and second and it
shows me rows which is
from second to fourth
so in this way we can extract
data based on rows and columns now if we
would be interested in finding out a
specific value so for example if i again
bring up student
this is my student and what i would be
interested in is getting the value of
john
and for specific subjects
so maybe we are looking for
2 colon 3 now if i do this
it shows me for john
and what we are interested in
is 2 colon 3 so that gives me the value
for chemistry and biology so you are
giving the columns so row wise you have
already specified the name and that
basically selects the particular row i
could have given a number and chosen
which row or which rows we would want to
pull out the values now if i would want
to find out the value for john and sam
now in that case
i could use indexing or positioning but
that has to be continuous but here you
are talking about john and sam which has
matthew in between so we will basically
create a vector of the two names
so we will get the values for john and sam
and then we will look at
column 4
now
that is basically giving me the values
in the fourth column which is 70 and 75.
similarly
if you go further you can look at the maths
and bio scores of sam and alice
so you will give your
row names that is sam and alice and then
you would want the values for maths and
bio so that is basically your
third and fourth column and we can do
that by looking at the values
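A sketch of selecting by row and column together, by position and by name. The matrix again uses the assumed sample scores.

```r
student <- matrix(
  c(20, 30, NA, 70, 56, 28, 39, 46, 18, 26, 32, 75, 15, 24, NA, 45),
  nrow = 4, ncol = 4, byrow = TRUE,
  dimnames = list(c("John", "Matthew", "Sam", "Alice"),
                  c("Physics", "Chemistry", "Bio", "Maths")))

# rows 3 to 4 (Sam, Alice) and columns 2 to 3 (Chemistry, Bio)
block <- student[3:4, 2:3]

# one student by name, columns by position
john.chem.bio <- student["John", 2:3]

# rows that are not next to each other need a vector of names
john.sam.maths <- student[c("John", "Sam"), 4]

# two students, third and fourth column
sam.alice <- student[c("Sam", "Alice"), 3:4]
```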
how do you find out an average
well
that's pretty simple you can use the
mean function
on student
you
will select your row name that is john
you also want to get rid of na values
otherwise the mean itself comes back as na so
you get rid of them by saying na dot rm
equals true and then you
get the average score of john now
how do i do further computation that is
if i want to find out the average and
total score of all students so in this
case
i can apply or i can use an apply
function
here i'm saying i'm working on student
and
we would want to give the margin
that is 1 for rows
and we want to also give the function so i
want to find out mean
i want to remove or get rid of the na
values and now if i
look at help apply it tells me how
the apply function works over array
margins so i will do an apply function
on student
where i would want to work row
wise with margin 1
i would say i want the sum
and i want to get rid of the n a values
so this gives me the sum for each
student
and here we are getting a mean value
which was for
each student
so what we are doing here is for example
let's look at student again just so that
we avoid confusion so we have
student
and then we have physics chemistry bio
maths and i have said
row one so basically what we want is
for john we want
the total
and what we can do here is
we can say 20
plus 30
avoiding n a and then 70 that gives me
120
then you look at matthew so this is
again doing a totaling there is no n a
value and you look at the value
right so when we have chosen apply
function we have worked on student
now here we are interested in the values
that is sum of all the values for this
particular row
i'm saying take care of na and then
give me a sum similarly you did a mean
and that was giving you a mean for each
student
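A sketch of the average for one student and the row wise `apply` calls. In `apply`, the second argument is the margin (1 for rows, 2 for columns), and `na.rm = TRUE` is passed through to `mean` or `sum`. Scores other than John's are assumed.

```r
student <- matrix(
  c(20, 30, NA, 70, 56, 28, 39, 46, 18, 26, 32, 75, 15, 24, NA, 45),
  nrow = 4, ncol = 4, byrow = TRUE,
  dimnames = list(c("John", "Matthew", "Sam", "Alice"),
                  c("Physics", "Chemistry", "Bio", "Maths")))

# average score of John, ignoring NA
john.avg <- mean(student["John", ], na.rm = TRUE)

# mean and total for every student (margin 1 = rows)
avg.score   <- apply(student, 1, mean, na.rm = TRUE)
total.score <- apply(student, 1, sum, na.rm = TRUE)
```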
so these are some simple operations now
what we can also do is
we can basically create a vector called
passing score
and what we would want to do is we want
to get the values or find in how many
subjects alice has passed how do we do
that we will have to compare
alice score
which should be greater than
or equal to the passing score so what we
can do is we can create a variable here
pass now i am saying student i would be
interested in the values for alice so
i've mentioned that row name here i'm
then comparing it with passing score
which we have created here and that will
give me the values wherever
alice has passed in a particular subject
now i can obviously get rid of the na
values and then look at this which
basically tells me
there was
one subject in which alice passed and
rest were either false or na
now same thing we can do for sam
so sam is here
and what we want to do is we want to
look at the values here so we will say
let's do the same thing for sam and find
out the comparison with passing score
and get rid of n a values so you are
basically extracting value so these are
some
easier operations and usage of functions
on your matrices
which are filled in with values at row
level and column level and then you can
apply one of these functions
or
multiple functions to basically extract
value which makes more meaning
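A sketch of the pass check. The passing scores and most student scores are assumptions, chosen so that Alice passes exactly one subject as in the narration.

```r
student <- matrix(
  c(20, 30, NA, 70, 56, 28, 39, 46, 18, 26, 32, 75, 15, 24, NA, 45),
  nrow = 4, ncol = 4, byrow = TRUE,
  dimnames = list(c("John", "Matthew", "Sam", "Alice"),
                  c("Physics", "Chemistry", "Bio", "Maths")))

# assumed passing score per subject, in column order
passing_score <- c(25, 25, 25, 40)

# TRUE where passed, FALSE where failed, NA where the score is missing
pass <- student["Alice", ] >= passing_score

# count subjects passed, ignoring NA
alice.passed <- sum(pass, na.rm = TRUE)
sam.passed   <- sum(student["Sam", ] >= passing_score, na.rm = TRUE)
```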
so that's with your matrix now let's
also look at data frames now data frames
as we know
is basically data which has been ordered
in rows and columns
wherein we can assign row names we can
assign column names we can do some
operations on data frames so let's look
at example so if i do a data
here so that gets me
some sample data sets or functions what
we have here
so
once we have run data here
it says use data with the package argument and then you
can
list all the data sets in available
packages and you can basically look at
all the r data sets which we are seeing
here it has opened up so i would be
interested in getting the air passengers
data so i'm going to pass that in the
data function
and then if i do a head to see the
initial data from air passengers it
shows me the values what we have
similarly we can do that on iris data
set and look at the head values
i can
do a view to look at specific values in
a tabular format if that makes more
meaning and that makes it easy for
analysis
now i can
do a view on state
x77 and that basically shows me
the population income and all this for
different u.s states so these are some
different data sets what we have
you can do a view on them
to basically understand the data or look
in a more readable format you can just
do a tail to get the end of the data so head
and tail functions by default give you the first
six entries
or the last six entries from that
particular data set now the question is
how do we
work on this data so i can get a
statistical summary so i have the iris
data set which we had here
so if i do a head it shows me iris data
set this is a popular data set which
shows the petal length and sepal length of
particular flowers and the species so
what is the length what is the width and what
species that flower belongs to okay
now here we can get a summary that is
statistical summary of a data set which
gives me minimum first quartile median mean
third quartile and maximum values
it basically shows you the count of
the entries for each species what we
have under the species column now what i
can do is i can check the structure of
this data set using str
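The exploration steps above can be sketched with the data sets that ship with R. `View` opens a viewer window, so it is guarded here to run only in an interactive session.

```r
# load and peek at built-in data sets
data("AirPassengers")
air.head <- head(AirPassengers)

data("iris")
iris.head <- head(iris)   # first six rows by default
iris.tail <- tail(iris)   # last six rows by default

# tabular viewer (only meaningful in an interactive session)
if (interactive()) View(state.x77)

# statistical summary and structure
iris.summary <- summary(iris)
print(iris.summary)
str(iris)
```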
we can also create a data frame of our own
using the data.frame function
so for example if we would want to
create a data frame let's see how do we
do that so first we create a vector of
days
we can create a vector of temperatures
and rain
and then we want to create a data frame
out of this so i use the data dot frame
function
i pass in my days temp and rain as the
vectors and now if you look at the data
frame you basically see
that i have my days my temp and rain so
those were the variables those were the
vector names and those have become the
column names row names are auto assigned
and basically we are seeing the values
which have been passed in my data frame
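A sketch of the data frame built from three vectors. The day labels, the fifth temperature and the rain values after the first are assumptions; `stringsAsFactors = FALSE` is set explicitly so `days` stays character on older R versions too.

```r
days <- c("Mon", "Tue", "Wed", "Thu", "Fri")
temp <- c(25.6, 30.1, 40.0, 37.3, 20.4)     # last value assumed
rain <- c(TRUE, FALSE, TRUE, FALSE, TRUE)   # values after the first assumed

# vector names become the column names; row names are assigned automatically
df <- data.frame(days, temp, rain, stringsAsFactors = FALSE)

print(df)
```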
now i can do a summary on this to
basically look at what is the length or
how many values we have in data frame
what is the class of elements so that is
character
you are looking at
your
values or summary which gives you minimum
first quartile median mean and so on and
then it also shows you the complete data
on rain
what is the mode here
how many false or how many true
values we have you can also look at the
structure of this data frame by doing a
str
which gives me
how many observations we have how many
variables we have
what are the different variables so that
is days temperature and rain and the
values for those
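A sketch of `summary` and `str` on the small weather data frame, using the same assumed sample values.

```r
df <- data.frame(
  days = c("Mon", "Tue", "Wed", "Thu", "Fri"),
  temp = c(25.6, 30.1, 40.0, 37.3, 20.4),
  rain = c(TRUE, FALSE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE)

# per column summary: length and class for character,
# quartiles for numeric, TRUE and FALSE counts for logical
print(summary(df))

# observations, variables and the type of each column
str(df)

col.classes <- sapply(df, class)
```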
for days if you notice it is of the type
character temperature is numeric rain is
logical now how do we do data frame
indexing so
like your matrix which basically has
rows and columns and in multi-dimensions
similarly in data frames also you have
indexing so you can do a data frame so i
could just extract the first row by
doing this and that basically gives me
the value so you can always compare it
by just typing df so that's my data
frame
and now let's look at the values extract
the first row and that shows me monday
25.6
rain value is true
now i can also do it column wise so
for example i could do it in this way so
here what i'm doing is
extracting the second column from this one
so it tells me
25.6
30.1
40.0 37.3 so you have extracted the
values for the second column
now
selecting using column names so that's
the easiest way to extract the values
for a particular column so i can just do
this instead of giving the position of
the column or the column number i'll
give the column name
and that gives me all the values of
temperature
and
if i do this where i'm saying 2 colon 4
and then i'm giving the columns so it
gets me the second
third and fourth rows for day and
temperature
and we are looking at the value so you
have given your row positions and then you
have selected your columns you can also
do a dollar sign
if you would want all the values of a
particular column so i can just do a df
dollar days or df dollar rain and it
shows me
the values from my data frame now one
more way of doing that is using your
bracket notation to return a data frame
format of same information so if you
want the resultant data in a data frame
format
you can just do a df rain or df
temperature and that is basically giving
a data frame so if i had assigned this
to a variable and looked at the class
of this that would be data frame
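The data frame indexing forms described above, sketched on the assumed weather data. The difference between `df$rain` and `df["rain"]` is that the first returns a plain vector and the second returns a one column data frame.

```r
df <- data.frame(
  days = c("Mon", "Tue", "Wed", "Thu", "Fri"),
  temp = c(25.6, 30.1, 40.0, 37.3, 20.4),
  rain = c(TRUE, FALSE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE)

first.row  <- df[1, ]                      # first row
second.col <- df[, 2]                      # second column by position
temp.col   <- df[, "temp"]                 # same column by name
some.rows  <- df[2:4, c("days", "temp")]   # rows 2 to 4, two columns

rain.vec <- df$rain      # dollar sign returns a vector
rain.df  <- df["rain"]   # single bracket returns a data frame
```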
now
one of the things which we also require
is filtering data frames using a subset
function
so that is subsetting the information
from a data frame so we know we have our
data frame let's look at our data frame
again
so that just reminds of what data values
we have
and here let's get a subset out of it
using the subset function so i'm passing
in my data frame i am saying i would be
interested in the rain column so i am
giving subset rain column and
wherever the values are true so it returns
all the rows where it has
rained similarly i can
do a subsetting by giving a value for
temperature wherever the value is
greater than 25 and that shows me the
value so this is where you are filtering
the data in data frames using a
subset function to which you have to
provide a column name
and then give a condition
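A sketch of filtering with `subset`, using the assumed weather data. The condition refers to column names directly.

```r
df <- data.frame(
  days = c("Mon", "Tue", "Wed", "Thu", "Fri"),
  temp = c(25.6, 30.1, 40.0, 37.3, 20.4),
  rain = c(TRUE, FALSE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE)

# all rows where it rained
rainy <- subset(df, subset = rain == TRUE)

# all rows where temperature is above 25
warm <- subset(df, subset = temp > 25)
```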
one more important thing which might be
required is sorting your data frame
using order function so i can create a
variable by name sorted dot temp
i want to do a ordering of data frame
and here i am doing ordering based on
temp
and now if i look at the value
or i can create this
in an ascending order
so let's look at the values and now if i
look at sorted dot temp it just gives me
the
order that is the row positions of the
sorted values
so we have discussed this in other
section also so what i can do is
i can return all the columns with
temperature sorted in a descending order
so right now what we were seeing was we
were seeing in ascending order but what
we can do is we can do that in a
descending order so here i'm creating a
variable descending.temp
i'm doing an ordering but when i'm doing
a ordering i'm using the minus symbol
and this one
if you would look at in the form of a
data frame it shows me the values which
are ordered in a descending order based
on the temperature column
now another way of sorting is by using a
particular column
so what i can do is i can sort i can do
a order and then i can choose the column
based on which i would want to order it
and then
if you would want to get the values of
this so it tells me
the values have been ordered based on
temp
so this can be very useful when you
would want to sort the data or order it
in a particular way to basically
understand your data or to make more
meaning out of it right
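A sketch of sorting with `order`. `order` returns row positions, so the result is used as a row index on the data frame. Data values are the assumed sample values.

```r
df <- data.frame(
  days = c("Mon", "Tue", "Wed", "Thu", "Fri"),
  temp = c(25.6, 30.1, 40.0, 37.3, 20.4),
  rain = c(TRUE, FALSE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE)

# positions of rows in ascending order of temp
sorted.temp <- order(df$temp)
df.asc <- df[sorted.temp, ]

# minus sign gives descending order
descending.temp <- order(-df$temp)
df.desc <- df[descending.temp, ]

# same ascending sort, choosing the column with bracket notation
df.by.col <- df[order(df[, "temp"]), ]
```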
similarly one more requirement might be
merging your data frames
so here i'm creating a data frame so i'm
saying authors
and i'm using data.frame function and
what we are doing is
instead of creating three vectors i am
basically doing that within my data
frame function so let's do that
and now what we can do is at this point
of time i can check what my authors look
like so this is my authors
now here if you see we have
the vector tukey venables tierney
ripley and mcneil so that becomes my
first column
which is surname
then you have your nationality and then
you have deceased
where you have also
repeated the values four times right so
that's something new which you might be
seeing so you are creating a vector
where you are passing in a value and for
other set of values you are basically
using the rep function to repeat them
now similarly we can create a data frame
called books
and this one is
where i am
having name column title
and then i have other dot author and you
are passing in the values so at this
point of time if you would want to look
at your books
it would look something like this so you
have given a name now just closely look
at the data frame function so here you
are using
the names
you have the titles whatever values you
have passed in
always remember when you have multiple
vectors they are separated by a comma
right so do not forget that and then you
have other dot author so that's the name
of the column and you are passing in the
values where you have also passed some n
a values
and at this point of time you can look
at authors
this is your books
and our intention will be to merge
these data frames so that's what we
would want to do
might be we are interested in getting
the data together so what i'm doing here
is i'm saying m1 now i want to use the
merge function i pass in my data frames
that is authors and books
so if we closely look at authors it has
three columns and five rows and here you
have three columns and we have seven
rows
so we would want to do a merge so we
will say authors and books and we will say
by dot x so this is where i am choosing
which is the column based on which
i would want to merge so i have by dot
x which is surname
and by dot y which is name
so we would want to merge the data where
we are giving a condition based on
values in surname and name so you see
there is tukey here there is tukey
here we have venables we have venables
we have tierney we have this one we have
ripley which we have here multiple
entries and then you have mcneil
now we don't have r core which is
there in books in
your authors so let's see
what happens when we do a merge here
okay and now we see the result of this
merge where it has taken all the values
from
both the data frames so you have surname
nationality deceased you get the title
you get the other dot author which you
are getting in from your books
and the name column is avoided right
because we are
doing
the merging based on surname and by dot y
is name so we don't see the
name column but what we are seeing here
is the values which have been merged and
then you can compare so for example
let's do a random check so if i look at
mcneil
that's the surname
or here it was name so you have mcneil
you have a nationality which comes from
the first data frame deceased from the
first data frame
then you have your interactive data
analysis
and then you look at other dot author
what you don't see
in the merge is this r core because this
does not have any value in your authors
data frame so you can do a merging of
your data frames using the merge
function so please try it out and you
can create different data frames and try
to use this
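A sketch of the merge described above. The data appears to follow the authors and books example from R's own `merge` help page; the exact strings typed in the video are an assumption.

```r
authors <- data.frame(
  surname     = c("Tukey", "Venables", "Tierney", "Ripley", "McNeil"),
  nationality = c("US", "Australia", "US", "UK", "Australia"),
  deceased    = c("yes", rep("no", 4)),   # rep repeats "no" four times
  stringsAsFactors = FALSE)

books <- data.frame(
  name  = c("Tukey", "Venables", "Tierney", "Ripley",
            "Ripley", "McNeil", "R Core"),
  title = c("Exploratory Data Analysis",
            "Modern Applied Statistics ...",
            "LISP-STAT", "Spatial Statistics", "Stochastic Simulation",
            "Interactive Data Analysis", "An Introduction to R"),
  other.author = c(NA, "Ripley", NA, NA, NA, NA, "Venables & Smith"),
  stringsAsFactors = FALSE)

# match authors$surname against books$name;
# rows without a match on both sides (R Core) are dropped by default
m1 <- merge(authors, books, by.x = "surname", by.y = "name")
print(m1)
```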
similarly you can manipulate a data
frame so for example here we are
creating one more data frame called
sales report
which is data dot frame you are giving
an id product has some values unit price
is where you are
getting the values as integer and
quantity as integer so now if i look at
my sales report this is the values which
i have let's spend a couple of seconds
to look at these values so id value is 101
to 110
product
is a b so that is automatically assigned
unit price is starting
where you say 101
140
184 right so we are using a as dot
integer we are converting it into
integer and basically we are
assigning these values here
for your
unit price and similarly for quantity we
are assigning the values by doing an as
dot integer and then just doing a runif
now once we have done that we have
created a data frame now how do you
transpose what do you mean by transpose
so transpose is when you are changing
your axes so if i do a transpose on
sales report and if i want to do a view
so you will see
the positions which have changed so you
have all these values so my
row names or row whatever values become
the column headings
and basically your column headings
becomes your row names so that is what
you're achieving by doing a transpose
you can do a head to look at some
initial values
you can do a sorting of this data frame
by using the order function and you can
choose the column
and also the order if you would want to
have it in ascending or descending or
basically increasing or decreasing
values
you can also choose a particular column
like we are choosing product as a column
and i would want to
take the values of sales report in a
descending order that is unit price
and we can just do ordering of data
frames or sorting the values and data
frame so this is pretty easy please
spend some time in practicing these
things taking these examples
and you will learn more about these
functions you can always try creating
an example at your end and you can try
to look into these
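A sketch of the sales report, its transpose and two orderings. The column names `unitprice` and `qty`, the `runif` ranges and the seed are assumptions; random values will differ from the ones shown in the video.

```r
set.seed(1)   # assumed, only to make the random values repeatable

sales_report <- data.frame(
  id        = 101:110,
  product   = c("A", "B"),                        # recycled to ten rows
  unitprice = as.integer(runif(10, 100, 200)),
  qty       = as.integer(runif(10, 10, 20)),
  stringsAsFactors = FALSE)

# transpose: columns become rows and rows become columns
transposed <- t(sales_report)
head(transposed)

# ascending by unit price
by_price <- sales_report[order(sales_report$unitprice), ]

# by product, then unit price in descending order within each product
by_product_desc <- sales_report[order(sales_report$product,
                                      -sales_report$unitprice), ]
```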
now
what about subsetting the data frame so
when you are saying subsetting the data
frame
let's do a subset function like what we
used earlier
i will say subset dot product a i'm
using the subset function and here i
will get the subset based on the product
value being a
let's look at this and this shows me
only the values where product value
matches a
now extract the rows
for which product is a and your unit price is
greater than 150 so you are still doing a subsetting
you are still passing your data frame
here you will give the product as a
which will basically filter the values for
product and unit price greater than 150
so you're giving some conditions and
look at the values
now if you're only interested in
particular columns so if i say
only the first and the fourth column
product is a
and unit price is greater than 150 so you have to
still use your subset function
pass in your data frame
product will be given as a and unit
price should be greater than 150 but
what i am interested in is the values
from the first and the fourth column and
now if you see it shows me the values
for my first and fourth column
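A sketch of the three subsetting calls, on an assumed sales report. The variable names are hypothetical; with random prices the filtered results can have any number of rows.

```r
set.seed(1)   # assumed
sales_report <- data.frame(
  id        = 101:110,
  product   = c("A", "B"),
  unitprice = as.integer(runif(10, 100, 200)),
  qty       = as.integer(runif(10, 10, 20)),
  stringsAsFactors = FALSE)

# rows where product is A
subset.ProductA <- subset(sales_report, product == "A")

# rows where product is A and unit price is above 150
a_pricey <- subset(sales_report, product == "A" & unitprice > 150)

# same filter, keeping only the first and fourth column
a_cols <- subset(sales_report, product == "A" & unitprice > 150, c(1, 4))
```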
what we can also do is we can create two
subsets so set a from data frame where
we take the product is being a other one
is being b
and then we can look at the values so
this is just a this is just b and what
we can do is we can combine them or we
can merge them using column bind so when
i say column bind and i'm saying set a
set b so it is basically going to stack
the data frames column wise and if you
do r bind it is going to stack the data
frames row wise
so we can either use
column or we can do a row wise
so this is in one way where you can
merge the previous example where we saw
merging was based on a particular
condition which is met
based on some columns which might have
similar values right and this is where
you are straight away combining the data
frames using column bind and row bind so if
you compare this
with the other merge operation what we
saw here this was where you are
comparing the values
of first data frame
and second data frame and then merging
but here we have just used column bind
and row bind so we are not merging on a
particular condition we are just
stacking them either column wise or row
wise
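A sketch of stacking two subsets with `cbind` and `rbind`. The names `SetA` and `SetB` and the data are assumptions. `cbind` needs the same number of rows in both pieces, and `rbind` needs the same columns.

```r
set.seed(1)   # assumed
sales_report <- data.frame(
  id        = 101:110,
  product   = c("A", "B"),
  unitprice = as.integer(runif(10, 100, 200)),
  qty       = as.integer(runif(10, 10, 20)),
  stringsAsFactors = FALSE)

SetA <- subset(sales_report, product == "A")
SetB <- subset(sales_report, product == "B")

side_by_side <- cbind(SetA, SetB)   # columns of B placed next to columns of A
stacked      <- rbind(SetA, SetB)   # rows of B placed under rows of A
```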
now
what we can also look at is doing some
aggregate operations this is going
deeper into data frames so
when you use aggregate function you are
passing in your data frame you are
choosing the quantity column
and then
you are basically
using the list function so list function
is going to work on your data frame on
the product column so product column for
your sales report so at this point of
time let's look at sales
report
and let's look at the value here so this
is my sales report
and what we want to do is we want to
aggregate the values on quantity column
but for that i will say i will just take
the product columns
and i will get a sum
wherein i am ignoring the na values
let's look at this
and that gives me an aggregation value
so remember here the aggregate function
is doing a summing up because we passed sum
now here we are doing a summing up on
your
product
that is sales report product column is
what we have so you are kind of grouping
by based on product so we have two
products here a and b
now what we also want to do is we want
to take the quantity column so that's
why we have given that first and what we
are doing is we are doing a summing up
so we are summing up all the values for
a and all the values for b
and we are seeing that
here if there are any n a values we are
ignoring it so these are some basic
operations on data frames or matrices
subsetting them extracting useful
information
using some inbuilt functions to do
transformation or computation and
extracting some values
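A sketch of the grouped sum with `aggregate`. The first argument is the column to summarise, the list holds the grouping column, and `na.rm = TRUE` is passed on to `sum`. The data and column names are assumptions.

```r
set.seed(1)   # assumed
sales_report <- data.frame(
  id        = 101:110,
  product   = c("A", "B"),
  unitprice = as.integer(runif(10, 100, 200)),
  qty       = as.integer(runif(10, 10, 20)),
  stringsAsFactors = FALSE)

# total quantity per product
totals <- aggregate(sales_report$qty,
                    list(sales_report$product),
                    sum, na.rm = TRUE)
print(totals)   # columns are named Group.1 and x by default
```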
now similarly we can also work on lists
now that we have looked at data frames
matrices vectors let's also look at one
more structure and how we work
in r when we have to work on lists
so list
is basically a structure here and what
we are doing is we are creating a list
by using the list function
and here
i am
passing in three vectors you see here
now c
function is being used now in vector we
know that all the elements are of the
same type now let's create a list
wherein we see three vectors which are
of three different
types or objects of three different
types so let's create this list
and now let's look at our list so it
basically has elements
wherein you have values of different
types
we can create a different
list which can also have
sequence elements that is 1 to 10 a
matrix which is three by three and
then also passing a list so this is also
one way of creating a list
let's look at list two and if we look at
the values here list two basically has a
vector which has values one to ten it
has a matrix of three into three
it has a list which has values a
having 10 and b having 20.
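A sketch of the two lists. The element values in `list1` and the names `vec`, `mat` and `lis` are assumptions based on the narration.

```r
# three vectors of three different types
list1 <- list(c(1, 2, 3), c("a", "b", "c"), c(TRUE, FALSE, TRUE))

# a sequence, a 3 by 3 matrix and a nested list, each with a name
list2 <- list(vec = 1:10,
              mat = matrix(1:9, 3, 3),
              lis = list(a = 10, b = 20))

print(list2)

# a list can hold other lists, so it is a recursive structure
is.recursive(list2)
```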
so this is how you can create a list
which can have objects of different
types so we can also
use
recursive variable a variable that can
store value of its own type so for that
you can use the is dot recursive function
something like this so i'm saying is dot
recursive and then do it on your list
and
we can check if the list basically has a
variable that can store values of its
own type now
one of the main requirements when you're
working with list
is
indexing so i have created a list and
here i can access this elements by using
an index so if i do this this shows me
the matrix what i could have also done
is using the dollar symbol and then
choosing
particular element of the list by doing
a mat which is the name given to our
matrix
or by choosing a name that is vector
so
you can access the elements using
indexing or dollar notation or giving
the name of a particular element now i
can also work on list and i can get the
third element's second value so we can do
that and that shows me 20 or you could
have done by giving the value 3
that is the third element and within
that you are looking for second element
so i can get the length of the list i
can get the class of the list which
shows me the type is list
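A sketch of list indexing. Single brackets return a sub list, double brackets or the dollar sign return the element itself. Element names are assumed as before.

```r
list2 <- list(vec = 1:10,
              mat = matrix(1:9, 3, 3),
              lis = list(a = 10, b = 20))

second.as.list <- list2[2]     # a list that contains the matrix
second.element <- list2[[2]]   # the matrix itself
by.name.mat    <- list2$mat    # same matrix using the dollar sign
by.name.vec    <- list2$vec

# third element, then its second value
third.second <- list2[[3]][[2]]
same.value   <- list2$lis$b

n.elements <- length(list2)
list.class <- class(list2)
```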
and
what i can also do is i can convert
vectors into list
so here we are creating a variable price
which is being assigned a vector which
has 10 20 and 30
and now what i want to do is i would
want to convert this vector into list
and for that i'm using the as dot list function
so i am creating a variable called price
list
and then i am saying as dot list so
that's going to convert my vector into
list and now let's look at price list
which shows me
a list
or you can look at price which is a
vector
so that's when you are converting your
vector into list now how do you convert
your list into vector
and that also can be done by doing a
unlist function
so i can basically work on price list
wherein we converted vector to list and
i can just do a unlist on that which
will convert my list into a vector
looking at the values of the vector
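A sketch of converting a vector to a list and back. The name `price.list` is an assumed spelling of the variable mentioned above.

```r
price <- c(10, 20, 30)

# vector to list: every value becomes its own list element
price.list <- as.list(price)

# list back to vector
price.vec <- unlist(price.list)
```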
now sometimes we may want to set the
dimensions so we can use the dim
function to convert a vector to a
matrix so that it can have multiple
dimensions
so here we create a vector which has
four values and then i am going to give
a dimension to this so that it is
converted from vector into matrix by
giving dimensions 2 comma 2 and now if
you look at price 1 it has basically
changed into rows and columns of two
into two dimensions so these are some
simple examples of working with list
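A sketch of turning a vector into a matrix by assigning dimensions. The four values are assumed; note that R fills the matrix column by column.

```r
price1 <- c(10, 20, 30, 40)

# assigning dim reshapes the vector into 2 rows and 2 columns
dim(price1) <- c(2, 2)

print(price1)
```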
now when you talk about basic data type
functions
we have seen how you use the assignment
operator
how you get the data type of a
particular variable or the class to
which it belongs
i can assign different values
such as 10.5 so the previous one was
showing me the value numeric
and
now what we would want to do is we want
to assign a value 10.5 look at the class
of it it says numeric type of it shows
double so by default
it is stored as double type now i can
check if
the values in n1 are numeric and that
shows me true and similarly for n2 and
that shows me true so you are using the
is dot numeric function which returns true if
the given value is numeric
similarly we can
have
integer
assigned to a particular variable and
for that either i can do as dot integer
or i can assign a value with capital l
so i can do this and look at the value
of i1
similarly
i2 and look at the values and if i would
want to check if that is an integer
let's look at the values of
i2
which was an integer i1 which was an
integer and i3
which is an integer
so here we have assigned integer values
to a particular variable now all
integers are numeric but all numerics
are not integers so let's check that so
if i do an is dot numeric on i1 which was
assigned as dot integer 10 that shows me
true
if i say is dot integer on i1 so was
that an integer
and if i look at the value it shows me
true
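The numeric and integer checks above can be sketched as follows. The variable names n1, i1 and i2 follow the walkthrough; the value assigned to i2 is an assumption.

```r
n1 <- 10.5
class(n1)        # "numeric"
typeof(n1)       # "double"
is.numeric(n1)   # TRUE

# Two ways to get an integer: as.integer() or the L suffix
i1 <- as.integer(10)
i2 <- 10L        # assumed value
class(i1)        # "integer"

# All integers are numeric, but not all numerics are integers
is.numeric(i1)   # TRUE
is.integer(i1)   # TRUE
is.integer(n1)   # FALSE
```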
now let's look at the character values
so if we say c1
c2 and look at the class of this it
shows me this is of character type
similarly on c2 and you can always
validate that
by using the is dot character function
you can also use some inbuilt functions
such as converting to upper case or
getting a substring from the starting
position till the end position you
want
i can use the paste function
which basically will give me
the data
combined or you can say concatenated you
can also use paste0 which we know
will get rid of the space
and it just
concatenates them without a space i can
also use a specific separator which we
have seen examples and we can do that
and what we can also do is we can
replace set of characters
so here
i am saying
substitute
and then if i look at the values it has
basically replaced rob with cena
and let's look at the length of it or
number of characters in this so these
are some basic operations what you're
doing on matrices on your data frames on
your list
and also on your variables where either
you are assigning them values of a
particular type or you are changing the
data types you can also go for coercion
in case of vectors we have seen that
where if you are passing in values of
different types that's coerced into same
types
so later we can learn more on functions
and flow control and how that is handled
in r
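A short sketch of the character functions mentioned above. The strings are assumptions, and sub() is assumed to be the function behind the spoken word substitute (gsub() would replace every match instead of only the first).

```r
# Assumed example strings
c1 <- "rob"
c2 <- "hello world"

class(c1)          # "character"
is.character(c2)   # TRUE

toupper(c1)                 # "ROB"
substr(c2, 1, 5)            # "hello"
paste(c1, "smith")          # "rob smith" (space added by default)
paste0(c1, "smith")         # "robsmith"  (no separator)
paste(c1, "smith", sep = "-")   # "rob-smith"

# Replace the first match of a pattern
replaced <- sub("rob", "cena", "rob smith")
replaced           # "cena smith"
nchar(c2)          # 11
```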
let's learn how r can be used to take
care of flow control that is if i would
want to have a if else condition
and if what i would want to compute or
if i would want to check some values how
r can be used
so here
if
statement consists of a boolean
expression which is followed by one or
more statements so we can just say if we
can pass in a boolean expression where
we would want to compare particular
value or we would want to check a
particular value and then whatever is
passed in the statement will get
executed so what we can do is here we
can use assignment operator i can pass a
value to x now we can always do a type
of
and that can tell me that x
is basically an integer and now i can
use my if where i can say
is
dot and then i can choose integer
and i would want to check the value of x
if that is an integer
then i will just use
brackets and i'll pass a statement here
so let me say print
and let's say x is an integer
and we can execute this and this tells
me that the boolean value is true now if
for example we would have done something
else or
say for example
instead of integer if i had used let's
say character for that matter
and we can check the value and we can do
this
so
here we will check the values and it
says
there is an error with the bracket and
let's check this one so if x because we
missed a bracket here
so let's do that one
and then try this and it doesn't show me
any result so how could we handle
something like this if
the boolean expression does not match to
true and in that case we can always go
for else statement so we can check for a
value so if the boolean expression is
true statement will be executed and if
it is false then next statement will be
executed so we could have done the same
thing here where i said print x is an
integer which we know is not true and
what i could do is i can here after this
one
say else
and then i can open up one more bracket
and then i can say print
and i will say x
is not
a character
and now we know that x is not a
character so this is a simple way where
you can use if else and you can control
the flow
by passing in the conditions now that's
when you are using if else statements
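The if and if else checks described above might look like this. The value 30L and the exact message strings are assumptions, since the code on screen is not part of the transcript.

```r
x <- 30L          # assumed value; the L suffix makes it an integer
typeof(x)         # "integer"

# Condition is TRUE, so the statement in the braces runs
if (is.integer(x)) {
  print("x is an integer")
}

# Condition is FALSE, so control passes to the else branch
if (is.character(x)) {
  print("x is a character")
} else {
  print("x is not a character")
}
```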
now what about while loop so that also
can be useful when you are programming
in r
so
an else statement is executed when the
condition in the if statement results to
false so that basically means what we
can do here is
let's pass in a word or a set of words
like this for example let's say v
and then we use c function to create a
vector for example and then i can just
say hello
world
and if you look at v
you can look at the class of v
it's of characters and if you look at
type
of v
it is
having the objects or elements as
character now what we can do is we can
basically then say
count
and let's assign this a value of 2
now what we would want to check is is
the count of elements in
our v equals to two so what i can do is
while
my count is less than
say five
now i'm saying
i would want to
do something while the count is less
than 5.
so we have already given a value to
count as 2
and now what i can do is here i can open
up a bracket i can say print and then
pass the value of v and then what we do
is not only this we will also increment
the value of count and we will say count
plus one
and
here it gives me error probably because
we have missed a bracket so let's see
what we are missing out here so
let's just check this one again
so here it is
we have created v
which has two elements
of the type character
and then what we do is we
assign count a value of two and we would
want to check while the count value is
less than 5 we would want to print the
value of v so what we are doing here is
we are saying while then you pass in an
expression which will check the value of
count we do a print and then we
increment the value of count now this is
a simple example where you are using
while
to basically test an expression and
while that expression is true
you would be doing something whatever is
passed within your
brackets
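A minimal sketch of the while loop just described, using the names v and count from the walkthrough.

```r
v <- c("hello", "world")
class(v)     # "character"

count <- 2
# The body runs while count is 2, 3 and 4, so v is printed three times
while (count < 5) {
  print(v)
  count <- count + 1   # without this step the loop would never end
}
```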
now we could also be going for for loop
now for loop is basically used to
iterate over a list of elements or a
range of numbers
so for example if i have a vector like
fruit which has some values i could just
say for i in fruit i would want to print
something so let's try this also as an
example to test our for loop now we can
just say names
and we can basically
then assign values to this so let's say
vj
aj
dj
and let's say sj
and let's create this let's look at the
value of names now what i can do is i
can use a for loop and i can say for i
in my names so i will say for i in
names
now what do you want to do so open up
your brackets here
and then we would want to say print i
and then basically close the bracket so
you see for every element in this vector
it is basically going to print the name
one by one so you are iterating through
a set of objects
by using a for loop now this is how we
can work on
for loop
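The for loop over the names can be sketched as below. The vector is called names_vec here (a hypothetical name) so that it does not mask the built-in names() function; the walkthrough simply calls it names.

```r
names_vec <- c("vj", "aj", "dj", "sj")

# i takes the value of each element in turn
for (i in names_vec) {
  print(i)
}
```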
so if else while and for loop can be
very useful when you would want to
iterate or when you would want to check
the value of an expression or
when you would want to loop and do a
particular task
it's always good to
understand how you manage flow control
in r that is either when you're working
with your for loops your while loops
also understanding how you can use your
logical operators for working with your
data in r
so let's look at some examples and
understand logical operations
so either you could be using and where
all the conditions have to be true or you
could be using or where at least one
condition has to be true or you are
using not so these are your logical
operations now here i can assign a value
to x
and then i can check if my x value is
less than 10 and it shows me false
so
i have been
checking the value of x so let's see is
it greater than 10 and that's true
now i can use logical operations here so
i can say
and so i'm saying is my x value less
than 20
and is my x value greater than 10 now
the first condition is not true so in
this case we get the result as false
but if i say x is greater than 20 which
is true and
i am saying x is greater than 5 that's
also true and
x is equal to 25 now whenever we are
talking about and we have to look at all
the conditions have to
be right so let's look at this and we
get the value as true but if i say x is
greater than 10 or x is greater than 5
then only one of the conditions has to be
true which is true in our case so we get
the result as true
we can take a different example we can
say is x less than 20 which is not true
but is x equals to 30 and that's also
not true so in this case we get result
as false
now we can straight away compare some
numbers and we can say is 12 equals 3
and that's false
and if i say not then that basically
will give me the result as true
so these are some simple logical
operations which help you when you're
working with your data in r
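The comparisons above can be reproduced with x set to 25, the value mentioned in the walkthrough.

```r
x <- 25

x < 10                          # FALSE
x > 10                          # TRUE

# and: every condition must be TRUE
(x < 20) & (x > 10)             # FALSE
(x > 20) & (x > 5) & (x == 25)  # TRUE

# or: at least one condition must be TRUE
(x > 10) | (x > 5)              # TRUE
(x < 20) | (x == 30)            # FALSE

# not: flips the result
12 == 3                         # FALSE
!(12 == 3)                      # TRUE
```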
now we can create a data frame by using
an inbuilt data set mtcars
and let's look at our data frame so that
shows us the values with all the
different car models
and the different column names so car
models are the row names and then you
have other things like mileage and
cylinder and so on which are the
specification for the data now what i
can do is i can filter out values here
using indexing so i can say data frame
now in that data frame
i would want to compare the value of
mileage that is the mpg column which is
greater than or equal to 30 and
then i can end it with a comma so that
gives me the rows wherever the mileage
is greater than or equal to 30
i can also do a subset on data frame
where i can select a particular value
so
we can be doing this
or we can be using
square brackets we can also do a dollar
and compare the values now we will use
our logical operations knowledge here so
we will work on data frame where i am
interested in the mileage which is
greater than 20 and
i am looking at the column hp horsepower
and that should be greater than 100
remember when we are doing a and both
the conditions have to be
met as true and that shows me the result
where you are looking at the mileage and
you are looking at the horsepower column
both of these are
met and that's why we get the result
so these are some simple examples of
using your logical operations either
when you're working on a data frame so
same thing can be done on a matrix same
thing can be done on a list or a vector
or individual values
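A sketch of the filtering steps on mtcars. The data frame name df and the result names are assumptions; mpg and hp are real mtcars columns.

```r
df <- mtcars   # built-in data set; car models are the row names

# Rows where mileage is at least 30 (note the trailing comma: all columns)
high_mpg <- df[df$mpg >= 30, ]

# The same filter written with subset()
high_mpg_subset <- subset(df, mpg >= 30)

# Both conditions must hold: mileage above 20 and horsepower above 100
efficient_and_strong <- df[(df$mpg > 20) & (df$hp > 100), ]
efficient_and_strong[, c("mpg", "hp")]
```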
now let's also learn about flow control
that is how if else or else if is
handled in r so you can do a single
condition check so for example i assign
a value to hot which is false
and i'm saying temperature is 50. now
what i would want to check is if
the temperature value is greater than 60
which in our case will not be true because
temperature has been assigned 50 so is
it greater than 60 no
so
if i do this
if condition
and i am saying if the condition is true
then i would want to assign the value of
hot to true
and now if you look at the value of hot
it is still false why because the
condition which we passed for our if
is not true
it has not been met
so whatever was passed within the
statement has not been done
now let's change the value of
temperature as 100 and now if we do the
same thing we say
is my temperature greater than 60 which
is right so then whatever has passed in
the bracket will be applied so hot will
be assigned new value and now if you see
the hot value is
set to true so this is a simple single
condition check what you are doing now
certain times there can be multiple
conditions to check and that's where we
use else
so in this case we go for assigning a
value to score which is 63 so let's do
that
and now
let's say is my score value greater than
80 which is not true so whatever is
passed in here
which is print it's a good score will
not be done
but it will jump to else and then
whatever we have passed in else will be
done so it will say it's not a good
score so let's do this if
and it says it's not a good score so
this is a simple way of using if else
where you are checking a condition and
if the condition is not met
then your control is passed to the else
statement
now i can also do an else if so i can
say score is 63 and i can say is my
score greater than 80 that's my first
condition so it would print good score
but maybe i would want to check
something else so i'll say else if
and i'll say is my score greater than 60
yeah
and is it less than 80 remember the and
which has to
evaluate
to true for both the conditions so i'll
say print decent score
i can still keep on giving conditions
here in else if score is less than 60 and
score is greater than 33 that would not
be met so that will be ignored
and then you have else which says print
poor so
first it checks or evaluates for the
condition which you have passed for if
if that doesn't work then it goes to
else if and if anything in else if is
met then it's going to take that
into consideration and it will not go
for else and if the if and else if
conditions are not met then it goes to else
and we see decent score already printed
here now that's a simple example of if
else
and if else if
wherein we are evaluating a condition
but probably we have multiple other
things to check
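The single condition check and the if, else if, else chain above might be written like this. The names hot, temp and score follow the walkthrough; the message for the third branch is an assumption because the transcript does not state it.

```r
hot <- FALSE
temp <- 50
if (temp > 60) {
  hot <- TRUE
}
hot   # still FALSE, the condition was not met

temp <- 100
if (temp > 60) {
  hot <- TRUE
}
hot   # TRUE

score <- 63
if (score > 80) {
  print("Good score")
} else if (score > 60 && score < 80) {
  print("Decent score")        # this branch runs for 63
} else if (score < 60 && score > 33) {
  print("Average score")       # assumed message
} else {
  print("Poor")
}
```

Inside if conditions, && is the usual choice because it compares single values; & works element by element on vectors.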
now how do you work with while loops in
r that's very simple so what we can do
is we can assign a value to x
and now i will say while
my x is less than 10. so i'm going to
create a loop so i have said my x has
been assigned a value of 0 and that's
fine so this is going to be less than 10
but
if we are going to just do this then it
will keep running and it will get into
an infinite loop so we'll see how we avoid
that so we'll say while x is less than
10
i would want to
basically have the value of x i would
want to print x is still less than 10
adding 1 to x and what we are doing is
we are incrementing the value of x now
if you do not do this step
then it will get into an infinite loop
because x will be always less than 10
so we are incrementing the value of x by
one
and then we are giving a condition so if
at any point of time
x is equals to 10 i would want to say x
is equal to 10 terminating the loop
and then basically my while loop ends so
we can do this
so let's say x is 0 and then do this
while loop and now you see
it is at every step it is basically
printing out the value of x it is still
less than 10 adding 1 to x
and it also gives you
the value of x
so when we do a x is currently
and i print out the value of x so it
shows me 0
next time you increment it it becomes 1
and 2 and so on so this is where you are
using a while loop where you are looping
where based on a particular condition
and then you basically have
once the condition is met you are able
to
complete the loop
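A minimal sketch of this while loop. The message wording follows the walkthrough as closely as the transcript allows.

```r
x <- 0
while (x < 10) {
  cat("x is currently:", x, "\n")
  print("x is still less than 10, adding 1 to x")
  x <- x + 1   # without this increment the loop would be infinite
  if (x == 10) {
    print("x is equal to 10, terminating the loop")
  }
}
```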
now let's look at
let me take this one here we'll look
into functions in a later stage
so let me take this function
and let's get rid of this one
i would also want to talk on break
statements and while loop and once we
are done with the flow control on while
loops then we can look at the functions
aspect
either we can look at how we create our
own functions or how we use built-in
functions so let's look at this one
and let's continue with our while loop
so
we just saw a simple while loop here
and what we also want to see is when you
are working with your while loop
how do you break if a particular
condition is met
so we saw a simple example of
while loop
and that's fine
wherein we were printing out something
we were auto incrementing the value of x
we were also checking at one point of
time within our while loop
if the value of x was met
we would say we are terminating the loop
and it comes out of that
now if that does not happen then we
continue doing it
how about a break statement so break
statement is when you would want to end
the while loop
if a particular condition is met so for
example here i assign a value to x which
is 0 now i want to evaluate this lesser
than 5 so that means i will be auto
incrementing the value of x so i'll
create my while loop will give in a
condition that x is less than 5 now what
i want to do is i want to use the cat
function which will print the value so i
am saying x is currently
and i am printing out the value of x
then i say print x is less than 5
because we have not yet incremented the
value of x we are adding 1 to x like
what we saw in previous example
i am saying x is
then
incremented by 1
and here i'm saying if x reaches 5 so
while we keep incrementing the values
within the while loop we'll see if x's
value is 5 we will print it is equal to
5 and we can just do a break
now
if you do not use a break
you can still end the while loop but
break is basically to end this loop here
based on condition which is met and we
can do this and then run this while loop
so you see here x reached 5 and we
just broke out of the loop
so that's your simple while loops what
we are seeing
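The break example can be sketched as follows; break leaves the loop immediately once x reaches 5.

```r
x <- 0
while (x < 5) {
  cat("x is currently:", x, "\n")
  print("x is still less than 5, adding 1 to x")
  x <- x + 1
  if (x == 5) {
    print("x is equal to 5")
    break   # exit the while loop right here
  }
}
```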
similarly we can work on for loops
so for loops can also be useful
so your conditionals what we saw as if
else or else if your while loop is while
a condition is
not yet met you keep looping
and keep doing some actions now what you
can do is you can also work on for loops
so here i'm creating a vector
and then
i am going to loop
that is i'm going to iterate through
every element so i'll say
for and when you're using for loops
you'll say for
and then you can give in anything you can
give in any name i can say i i can say x
so i'm just giving a temporary variable in
vector and then i'm printing it out so
this basically prints all the values one
by one so there is one more way to do it
you can say for
and you can say
i in
and i would want to take
length of the vector so 1 to the length
of vector that is till the last element
is reached i would want to print
the vector elements using the value of i
so what is i here it's the index
position and i can do it in this way
so if you are looping over a list
so i'm creating a list and it's very
simple so you can just do a for loop
where you can say for i in list i want
to print the i and that gives me the
list elements or you say for i in and
you give from starting position that is
1
till the length of list and you would
want to print every element so here we
can also use double brackets
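Both ways of looping, by element and by index, are sketched here for a vector and a list. The names vec, my_list and temp_var and the values are assumptions.

```r
vec <- c(1, 2, 3, 4, 5)

# Loop over the elements directly
for (temp_var in vec) {
  print(temp_var)
}

# Loop over index positions from 1 to the length of the vector
for (i in 1:length(vec)) {
  print(vec[i])
}

my_list <- list(1, "two", TRUE)

for (item in my_list) {
  print(item)
}

# Double brackets return the element itself rather than a one item list
for (i in 1:length(my_list)) {
  print(my_list[[i]])
}
```

When a vector or list might be empty, seq_along(vec) is a safer index range than 1:length(vec).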
so
if you would want to loop through a
matrix so sometimes that might be
required so let's create a matrix which
has 1 to 25 values arranged by row and you
look at your matrix and now what you
want to do is you want to iterate
through a matrix
so you want to do a looping so i'll say
for i in matrix i would want
to print out the values and that prints
out
all the values in matrix
now
what if i want to print the square and
square roots of numbers between 1 to 25
so
i can say for i
wherein the value starts with 1 ends
with 25
and then within my for loop
i can basically
give this computation where i am saying
get me the square that is i into i
and get me the square root of i and
just
print it out so i am saying message i is
this one
my square is this and square root is
this so if i look at these values
here
now i am looking at all the values from
1 is to 25 i am looking at the square of
the values and i am looking at the
square root so what we did was we used a
for loop
we passed in the elements by saying i in
1 to 25 and within the bracket i have
said what do i want to do for every
element so
either i have calculated a square i have
calculated a square root and then i am
printing out when i am using the message
function which takes the value which you
are passing in
comma the value of i
similarly square and similarly square
root so these are some simple examples
of understanding flow control in r that
is using your for loops your while loops
and also your if else
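A sketch of the matrix loop and the square and square root loop. The names mat, sq and sq_root and the message wording are assumptions.

```r
# 5 x 5 matrix holding 1 to 25, filled row by row
mat <- matrix(1:25, nrow = 5, byrow = TRUE)
mat

# A for loop visits matrix elements column by column
for (num in mat) {
  print(num)
}

# Square and square root of every number from 1 to 25
for (i in 1:25) {
  sq <- i * i
  sq_root <- sqrt(i)
  message("i = ", i, ", square = ", sq, ", square root = ", sq_root)
}
```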
later we will spend time in learning
about functions
which could be either created by the
user or built-in functions and also
factors in r
welcome to this section of r
programming where we will learn about
functions whether that is about inbuilt
function or creating your own functions
and working on
your different data structures
so what are functions
function
is basically a set of statements to
perform a specific task
now
r
has a large number of inbuilt functions
or you can say packages which you can
import and start using
or users can create their own functions
so when it comes to functions the syntax
is very simple
you give a function name
you can assign
your function to a variable and a
function can take no arguments one
argument
or any number of arguments so let's see
some example on functions so for example
here we are creating a variable called
squares and we are assigning a function
to it now this function would take one
argument which is a
and then we use a for loop so we say for
i in
from 1 to the number a
we would basically be doing a
exponential computation
so what we would do is we would
square the value in this particular
range and assign that to b and print it
now when we do this
we can
call in this function and pass in a
value to look at the
square
of that particular value now this is a
simple example of function so this is
how it would look depending on what
value you have passed to the function so
for example we say squares and we pass
in a value of 4 so that becomes for i in
1 to 4
so you would start with 1
the value of
1 square would be 1
and then you have
your value for
2 so 2 square would be 4
then we have
3 3 square would be 9 and then we have 4
square which is 16. this is a simple
example of function and this is how you
can create your own function to
calculate or carry out some computations
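The squares function described above can be sketched as:

```r
# Print the square of every number from 1 to a
squares <- function(a) {
  for (i in 1:a) {
    b <- i^2
    print(b)
  }
}

squares(4)   # prints 1, 4, 9, 16
```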
now let's look at some other examples
before we get into built-in functions
which basically allows you to work with
different data structures
so there are different mathematical
functions which can be used
for your data science or
computations
you have your regular expressions which
can be used for pattern matching
or you can also use functions for data
manipulation now before we get into data
manipulation
let's look at how you work with
functions taking some examples
so let me bring up my r studio
wherein we will try out some examples
and see how functions work
now here are some examples and we can
see how this work let me just clean up
the console and we can start here
now here we are creating a simple
function which does not take any
argument
we call it as hello world and this will
start with the word function
and parenthesis now that could have
arguments passed in however this
function we are not passing in any
argument
and what we are doing here is we are
printing out whatever value is passed
within the bracket
so let me just do a ctrl enter my
function is created and you can straight
away call this
by
just doing this
now however if you would have tried this
function without the bracket
for example something like this then it
would have printed out the complete
function
and whatever you passed in to hello
world but if you would want to call the
function then basically you would just
do hello world and then use the brackets
so that's how you call the function and
that's how it shows the result
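A function with no arguments, along the lines described. The name hello_world and the printed text are assumptions about what was on screen.

```r
hello_world <- function() {
  print("Hello, world!")
}

# With parentheses the function is called
hello_world()

# Without parentheses R prints the function definition instead
hello_world
```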
now your function can be with a single
argument so for example here we are
passing in an argument called name
and we can then use this to pass a value
to this so here i'm saying hello name
i have my function but this one takes a
single argument and we are going to use
paste which basically can concatenate or
just adds up whatever you are passing in
to paste so we will say paste hello and
then the name
notice that i have given a space here
after hello so that i can have it in the
right format and i can just do this so
the function is created and now let's
pass a name here and just try to call
the function so name is one argument or
a single argument which is passed to
this function
so let's look at the result and that
shows me the name whichever was passed
to this one
now what we can also have is function
created which takes two arguments and
this is a simple example so here we are
creating a function add num
i'm saying function it takes two arguments
i'm not providing any value or default
values for this we'll see some other
examples for those
now here this particular function takes
two arguments
and whatever you pass in here
the
addition of those will be shown
so let's create this function and let's
call it and test it and that shows me
the result as 70.
now what we can also do is we can add a
vector to a number so vector is list of
elements or list of objects you can say
and here we would want to
perform add num
or we would have to call add num
function by passing in vector which
becomes the first argument and the
second value is the next argument so
let's run this one and that shows me the
result wherein 5
as a value has been added to every
element of this particular vector
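Functions with one and two arguments might look like this. The names hello_name and add_num follow the walkthrough; the parameter names num1 and num2 and the argument values are assumptions (30 and 40 are chosen only because they add up to the 70 mentioned above).

```r
# One argument
hello_name <- function(name) {
  print(paste("Hello", name))
}
hello_name("Sam")    # "Hello Sam"

# Two arguments, no default values
add_num <- function(num1, num2) {
  num1 + num2
}
add_num(30, 40)            # 70

# A vector as the first argument: 5 is added to every element
add_num(c(10, 20, 30), 5)  # 15 25 35
```

paste() already puts a space between its arguments, so a trailing space after "Hello" would give two spaces in the output.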
now when it comes to function you can
also have default argument values which
can be passed so here let's look at as
an example so we have hello name again
but this time instead of passing in just
an argument we will also provide it a
value or you could say that could be
considered as a default value now
when we create this function we are
doing the same thing as previous
examples but we are passing in an
argument and that argument has a value
now once i do this
i can surely call this function without
passing in a value and that shows me the
name which we had assigned to the
argument
or we can even pass in a new name which
will be assigned to name so if we do
this it works in both the ways fine so
this is in one way you are passing in a
default value
and then basically
you can
either call the argument
or
you can assign it a new value
so if we would do something like this
hello name
and then for example i would say name
equals
say
jerry
and if i would do this so that also
works fine however since we are passing
in an argument we are assigning a value
so either we can let it go for the
default or we can just pass in the value
or we can be very specific in mentioning
the argument and then the value for it
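Default argument values can be sketched as below. The default name "Sammy" is an assumption; "Jerry" is the name used in the walkthrough.

```r
hello_name <- function(name = "Sammy") {
  print(paste("Hello", name))
}

hello_name()                 # uses the default: "Hello Sammy"
hello_name("Jerry")          # positional value: "Hello Jerry"
hello_name(name = "Jerry")   # named argument:   "Hello Jerry"
```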
now how do we return value from a
function let's look at this so here
we are creating a function we are
calling it full name
and this one takes name
wherein we are giving sachin and title
is say tendulkar
and what we would do is we would use a
return statement here so return would
basically
use the paste function it will take the
values of name and title and then glue
them together however we are using also
a space
so that there is a space between these
two
values to the arguments which are passed
now if i run this argument sorry this
function
my function is created now we have
already passed named arguments or we
have already passed value to those so we
can straight away say just call the
function and that does
whatever you have mentioned in the
function body
i could have also said that i could
create a new one
wherein i will pass new set of values
which we saw in a previous example
and then if we call this it takes up the
new values so
either you can let it go for default
like what we did here
we can also pass in new values
or if you would want to keep it specific
you could basically say
full
underscore name
that's my function
i could say name equals and i can say
john
and then i will say title
smith
and that's also fine
so we can do this and that works in the
same way as it would have worked with
just passing in the names
so this is fine and if you would want to
test it out say for example if i would
just take off name here and just do this
that also works perfectly fine wherein
we are
still using these arguments in the
particular order now if i would have
changed this one to name
wherein i am already passing in a value
for name and if i tried this
so in this case
what happens is name
is smith
and
basically your title becomes john
right so we have to remember
what arguments we are passing and if we
are basically
assigning values to the arguments or
letting it pick up the default ones
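A sketch of returning a value and of how named and positional arguments are matched. The default for title is assumed to be "Tendulkar", since the transcript garbles it.

```r
full_name <- function(name = "Sachin", title = "Tendulkar") {
  return(paste(name, title))   # paste() adds the space between the two
}

full_name()                                 # "Sachin Tendulkar"
full_name("John", "Smith")                  # "John Smith"
full_name(name = "John", title = "Smith")   # "John Smith"
full_name("John", title = "Smith")          # "John Smith"

# Named arguments are matched first; the remaining positional value
# then fills the next free parameter, which here is title
full_name("John", name = "Smith")           # "Smith John"
```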
so let's do this and that looks okay now
when you talk about scope of a variable
okay now before we understand scope of a
variable let me show you some more
examples on function now say for example
if you were using built-in functions we
have lots and lots of
built-in functions which are available
for programmers
which
they can use in their data science
activities or data processing or
computation now here we are using a
function called rnorm
to generate a thousand random values from
a normal distribution of mean zero and
standard deviation one
so i would use rnorm
that's an inbuilt function
and i will call this say normal
distribution
so that is already done now we can find
out the mean
on these random values which would have
been generated using the inbuilt mean
function
and that works perfectly fine you can
also create a histogram out of this and
if i do this it shows me the histogram
so let's see
the histogram here
let's bring it out and that shows me the
histogram of normal distribution if you
would be interested in knowing about a
particular inbuilt function you can just
do a question mark and use the function
and that basically shows you the
documentation of the function so this is
a generic function
which computes a histogram of a given
data value
and here it takes arguments so this is
basically your data this could be the
number of arguments which you are
passing in
for
your histogram to be created
now we can look at some more examples
here so i can do a histogram with
large number of interval breaks and this
is where i am also specifying breaks and
passing in a value so this
allows me to provide arguments to
functions by position
now the same example which we have given
here we can do it without breaks
argument but as a good practice we
should
actually give name to the arguments
which we are defining so if i would do
this
when i'm passing in my data that is
normal distribution and then for breaks
i'm just giving the value 50 and that is
also fine it works perfectly fine here
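The built-in function calls above can be sketched as follows. The variable name normal_distribution is a guess at what was typed, and set.seed() is an addition so the random values are repeatable.

```r
set.seed(42)   # assumed; not mentioned in the walkthrough

# 1000 random values from a normal distribution, mean 0 and sd 1
normal_distribution <- rnorm(1000, mean = 0, sd = 1)
mean(normal_distribution)   # close to 0

hist(normal_distribution)

# In an interactive session, ?hist opens the documentation for hist()

# breaks passed by name (the clearer style)
hist(normal_distribution, breaks = 50)

# breaks passed by position; works, but is harder to read
hist(normal_distribution, 50)
```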
now
we can create our own function
which as we saw in some basic examples
functions which can be without arguments
say this is a simple example or with
arguments
so this one we have already seen how you
can create a function without giving any
argument or by giving an argument and
then basically calling in the function
now when it comes to optional arguments
so we can look at this function
wherein i would want to say find out the
exponential value of a particular number
so i call it expo value i use my
function i say this will take the value
x now that's an argument which we are
passing in we could have given it a
value or we will just let the user
provide the value when this function is
being called i will also give a default
argument which is power equals 2
and here we would want to
get a histogram of the values
with a particular power so if i create
this particular function
that's done and now i will just pass in
my value i don't need to mention power
that has been given a default value yeah
if we would want to change it then we
can pass in that so let's run this one
and that gives me
exponential value a histogram based on
the normal distribution data and by
default
it is using power as 2.
now what we could have also done is we
could have specifically mentioned a
different value for power
and that works perfectly fine i could
have just passed in the value as power
and that also works fine
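A minimal sketch of a function with an optional argument. The name expo_value follows the walkthrough; the body is an assumption that it plots a histogram of x raised to power, and the alternative power value 4 is also assumed.

```r
set.seed(42)
normal_distribution <- rnorm(1000)

# power is optional because it has a default value of 2
expo_value <- function(x, power = 2) {
  hist(x^power)
}

expo_value(normal_distribution)              # uses power = 2
expo_value(normal_distribution, power = 4)   # named, overrides the default
expo_value(normal_distribution, 4)           # same thing by position
```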
so here
you are using named arguments
now what we can also do is we can use
these named arguments and then we can
also pass in
any other arguments to the function
explanation of this
hist function histogram function
if you look at this
it shows me these three dots and this is
what we can use to pass in
any other arguments so let's look at an
example for this one
so say for example
i would want
to
create a function where i am passing
named arguments
i am passing in the data but then i
would also want to pass any other
arguments which can be passed
dynamically now for that we can create a
function here
wherein i am calling it expo value again
i am passing in my x which will be the
data which we will pass in
you are mentioning power which is two
which is a named argument which can also
be considered as a default or you can
change the value
or you can provide a new value and then
i am also giving these three dots which
are also passed in within this
particular function
so let me create this function here
now once that is done
then i can call this function by passing
in my data which is normal distribution
power is 2 and then i'm also using these
breaks for getting my histogram with
intervals
of 50 so let's call this function
and that gives me the histogram now what
we can also do is sometimes it might be
useful to pass logical arguments so for
logical arguments what we can do is we
can create a function which will take
the data
here i am using a named argument exp
that is for the exponent and a logical
argument hist
which is false by default
and then i am also giving any other
arguments so what we will do here is in
this function we will say if
the value of hist is true then this
block of code will get executed where
you will get a histogram
based on the exponential
which has been assigned in the function
passed as an argument
and if that doesn't hold true which is
by default false as we have given in our
function then this piece of code will
get executed so let's create this
function
and that's done and now we can straight
away just pass in our data
exponential value is given as 2
histogram has been given as false that
means the else part of the code will get
executed
and we can look at the values here i can
also say
histogram
is true
and that's where we will be calling in
the hist function and i can do this that
shows me the histogram so in this way we
can pass in named arguments we can pass
logicals
and then we can also pass any other
arguments for our use case
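A minimal sketch of the two ideas above: forwarding extra arguments with the three dots, and switching behaviour with a logical argument. The name `expo_or_hist` and the sample data are assumptions; the argument names `exp` and `hist` follow the narration.

```r
# Anything passed after power is forwarded unchanged to hist() via `...`.
expo_value <- function(x, power = 2, ...) {
  hist(x^power, ...)
}

set.seed(1)               # assumed: fixed seed so the sketch is repeatable
norm_data <- rnorm(100)   # assumed sample data

expo_value(norm_data, power = 2, breaks = 50)  # breaks goes to hist()

# A logical argument decides which branch runs.
expo_or_hist <- function(x, exp = 2, hist = FALSE, ...) {
  if (hist) {
    graphics::hist(x^exp, ...)  # hist is TRUE: draw the histogram
  } else {
    x^exp                       # hist is FALSE (default): return the values
  }
}

expo_or_hist(norm_data)                      # else branch, prints the values
expo_or_hist(norm_data, exp = 2, hist = TRUE)  # if branch, draws the plot
```

Calling `graphics::hist` explicitly avoids any confusion between the logical argument `hist` and the plotting function of the same name.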
now looking further in functions let's
also understand the scope of a variable
in a function
so here i am saying v
and then i'm saying
i am global variable let's create this
and then i am saying stuff so i am
global stuff so this is basically we
have assigned some values to variables v
and stuff now let's create a function
where i'll say fun
i'll use the function and i will say
this will take my variable stuff
i'm saying print v
and then for stuff i'm assigning in a
new value and then i'll print stuff
so let me create this function and let's
see how it works so if for example i
would just say print v
that shows me
the global variable which we had created
earlier and since i'm using that within
my function
it basically has the value now i also
have a global stuff so i'm saying print
stuff
and that shows me whatever was assigned
to the variable and now we will
basically call the function
by giving in
the argument as stuff
the variable which we had created
now
if we do this then it says
reassigning stuff inside the function
and that's because within the function
we are basically assigning a new value
to this stuff
now i can also just do a print stuff
and if you see it still goes back and
prints the
global variable so only within the
function
the reassignment happened and that's what we
understand when we talk about a global
variable or a local variable
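The scope behaviour just described can be sketched like this. The strings are approximations of what is read out in the narration, and returning the local value invisibly is an addition made here so the result can be inspected.

```r
v <- "I am global variable"
stuff <- "I am global stuff"

fun <- function(stuff) {
  print(v)   # v is not defined locally, so R finds it in the global environment
  stuff <- "Reassigning stuff inside the function"  # changes only the local copy
  print(stuff)
  invisible(stuff)  # addition for this sketch: hand back the local value
}

fun(stuff)
print(stuff)  # the global stuff is unchanged
```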
to create a function to find the final
output amount to be paid by a customer
after adding 20 percent
tax to the purchased amount how do we do
that
so
i'm here creating a function
which will take x as hundred
and what does that function do
we would want to basically
find out
the amount which is paid by customer
after adding 20 percent of tax now how
do we calculate that so we take x
plus
20 percent of x and that would be the
final amount which will be paid so we do
a return t
and this is my function so let's create
this function and then let's pass in a
value to see what is the
amount which customer would pay
with an addition of 20 percent tax so
this is a simple function where we are
passing in one argument we are giving it
a value and then we are doing
computation within the function body
what we can also do is we can create a
function where i am passing in an
argument
and i can then
check the value of that so if the
argument
passed was greater than zero then we
would find out the final amount which is
amount plus 20 percent of the amount
if the amount is less than or
equal to zero then our final amount is
equal to amount
and we return f amount so here we will
be evaluating these conditions and based
on that my function will return the
value so let's create this function and
pass in a value
and that shows me hundred so you can
just test this by saying amount one
and say for example i would have passed
in zero now in this case my final amount
is zero because there is no amount which
needs to be paid by the customer
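One possible way to write the tax calculation with the condition check is shown below. The names `final_amount` and `f_amount` are assumptions based on what is audible in the narration.

```r
# Adds 20 percent tax when the purchase amount is positive;
# otherwise the amount is returned unchanged.
final_amount <- function(amount) {
  if (amount > 0) {
    f_amount <- amount + 0.20 * amount
  } else {
    f_amount <- amount
  }
  return(f_amount)
}

final_amount(100)  # purchase of 100 plus 20 percent tax
final_amount(0)    # nothing to pay
```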
now
checking the argument and the body of a
function so
we can always use the inbuilt function
args which will tell me
for this particular function what are
the arguments and the inbuilt function
body which basically tells me
whatever we have coded within the
function body
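As a small sketch of inspecting a function, assuming a hypothetical `add_tax` function defined only for this illustration:

```r
# Hypothetical function used only to have something to inspect.
add_tax <- function(x = 100) {
  t <- x + 0.20 * x
  return(t)
}

args(add_tax)     # shows the argument list, including defaults
formals(add_tax)  # the same arguments as a named list
body(add_tax)     # shows the code inside the function
```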
now to understand the scope we can
create a function here which is taking
an argument x and what does this do so
we assign a value to y
then
we basically say g one and here
i am using function of x now what does
that function of x do
so this one will take the value of y
plus
multiply x by itself
so this is
a function which we are creating
and then i am saying g1 of x
so what you are doing is
whatever value was passed in as x
for that function x will be applied so
let's create this function and then pass
in a value 10
and that gives me the result as 110.
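A sketch of the nested function described above. The value 10 for `y` is an assumption; it is the value that makes the call with 10 return the stated 110.

```r
f <- function(x) {
  y <- 10                # assumed value; consistent with the result 110
  g1 <- function(x) {
    y + x * x            # y is found in the enclosing function f
  }
  g1(x)
}

f(10)  # 10 + 10 * 10 = 110
```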
similarly we can create another function
where we want to do some computation
and then i am creating one more variable
which has basically the function
pass in a value for y and then basically
what you do is
you are calling in your g2 function
and then
let's call in this function
so let's do this
and
let's also create f2
and then finally we will call in f2
which is internally calling g2
so these are some simple examples where
you are doing some computations
and creating some simple functions let's
also create a function
which is taking two arguments so here i
have g2
function takes two arguments x and y
what does that function do here we are
saying y plus x into x that's my g2
and similarly i'll create f2 which is
going to have a value assigned to y
and this one
is going to call in my g2 function
which will take
x and y
x
which we are passing in here
and y which we have assigned
so let's create this
and then let's call our f2
and what does that f2 do it basically
has the value of y assigned and then it
does
whatever is mentioned in g2
with our x and y values so i am passing
in 10 here
so it is
basically
y which is 10 plus
you have
the x value which has been passed here
so let's look at the calculation which
is 10
so that gives me 110 so 10 into
10 plus the value of y
so this is how we can create functions
which have been assigned some values and
then pass in some other values to those
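The two-argument version can be sketched in the same way, again assuming that `y` is assigned 10 inside `f2`:

```r
g2 <- function(x, y) {
  y + x * x
}

f2 <- function(x) {
  y <- 10      # assumed value assigned inside f2
  g2(x, y)     # y is now passed explicitly instead of being looked up
}

f2(10)  # 10 + 10 * 10 = 110
```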
let's look at some more examples here
when we work with functions
and see how we can use functions to
carry out our basic operations or
calculations so for example here
i am creating a function
and this will take an argument wherein
we are saying it would be marks
now let's do this and the function body
would say result is not defined
now if the marks are greater than 50
then result will be
pass
and you will have the message which is
your result is
and then you are passing the value of
result
so let's look at this one so
let's create this function pretty simple
function
and then
let's pass in a value here so i'll say
status as 60
which will be checked for the value
greater than 50 or lesser than 50
and that tells me your result is pass
and if we give this one then it says
your result is not defined however we
can
have additional statements here which
can say if the result was
lesser than 50 then what should have
been printed this is a simple example
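A rough sketch of this function follows. The name `status` is taken from the narration; the exact strings and the invisible return value are assumptions added so the result can be checked.

```r
status <- function(marks) {
  result <- "Not Defined"   # stays this way unless the condition holds
  if (marks > 50) {
    result <- "PASS"
  }
  message("Your result is ", result)
  invisible(result)         # addition for this sketch
}

status(60)  # Your result is PASS
status(30)  # Your result is Not Defined
```

Adding an `else` branch that assigns a different result would cover the case where marks are 50 or lower.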
let's look at one more example and here
my argument
is age now just notice that we are not
passing any default values or we are not
passing any values to the arguments we
are just passing in an argument
which will be assigned a value when you
call the function now here we say age
group is not defined we say vote is not
defined and then we start using some
condition checks
so i say if the age passed is greater
than 18 then
the age group would be adult and the
person can vote
and
message your age group is and voting
status is will be printed out
so
we can use this or from our previous
learning we can do a if else and modify
the function
so let's create this function and then
pass in a simple value to this and that
tells me what is your age group and what
is your status to vote
so
now if we would want to create a
function to convert a name into
uppercase
let's see how we can do that so we are
creating a function here which takes the
value name
now then we also find out the length of
this particular argument and for that we
are using an inbuilt function called
nchar
which will be
for your name
and you would want to find out the
length of this particular name and we
would say if the length is greater than
5
then we are again using an inbuilt
function called toupper which will
convert the argument or the name passed
to uppercase
we will say message
user given name is and then you print
out your name
so let's
call in this function
so let me first create this
and then i can call in this function and
we clearly see that the number of
characters in this word is more than 5
and that's why it is converted to upper
case however if you would call the
function with a name which has less than
5 characters
it prints the name
as it is
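The uppercase conversion can be sketched with `nchar` and `toupper`. The function name `name_case` and the two sample names are assumptions.

```r
name_case <- function(name) {
  if (nchar(name) > 5) {      # length of the name in characters
    name <- toupper(name)     # only longer names are converted
  }
  message("User given name is ", name)
  invisible(name)             # addition for this sketch
}

name_case("Jonathan")  # more than 5 characters, converted
name_case("Amy")       # 5 or fewer characters, left as it is
```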
now this is again a simple function
which we created let's see how you can
create a function to calculate bonus
now here we are passing in two arguments
so this function takes two arguments one
is salary
and one is experience
and then we say if the experience is
greater than 5
then bonus percentage will be 10
and else bonus percentage will be 5
and here we will calculate the bonus so
first it will find out
how many years of experience a
particular employee has and based on
that a value of bonus will be assigned
or bonus percentage will be assigned
and then you say what will be the bonus
that is salary into the bonus percentage
and return the bonus amount so this is a
simple function let's basically
select this
and let's create this function and then
let's calculate the function if the
salary is 25 000
and experience is 6 years
and that basically will tell me the
value so let's look at the value it
tells me
2500 which is 10 percent of the salary
similarly if we go for this one which
will
basically go for the execution of else
part of the code we can do this and that
gives me a bonus which is half of it
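A minimal sketch of the bonus calculation. The name `calc_bonus` is an assumption, and the second call uses 3 years of experience as a hypothetical value to reach the else branch.

```r
calc_bonus <- function(salary, experience) {
  if (experience > 5) {
    bonus_percentage <- 10
  } else {
    bonus_percentage <- 5
  }
  bonus <- salary * bonus_percentage / 100
  return(bonus)
}

calc_bonus(25000, 6)  # 2500, which is 10 percent of the salary
calc_bonus(25000, 3)  # 1250, which is 5 percent of the salary
```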
now
how do we handle multiple conditions and
multiple actions so let's look at that
so let's create a function which takes
one argument which is age
we would check if the age is greater
than zero then
we would want a nested if within this
condition
so if age is greater than 0 then
whatever we have given here will get
executed
and
this will be
this part of your code
and here
i am again checking if age is less than
18 then
age group would be kids
else
if
now else if is to check the second
condition so if the age was passed if it
is greater than 0 then we get into this
block of the code now it was greater
than 0 but then
is it less than 18 then i would
categorize the person as kids if age is
less than 60
then we will say
age group adult
else we will say age group senior
now we can
basically say that we could have given
more conditions to this because here we
are saying if age group is less than 18
then
the individual would be within the age
group of kids
if that is not true that is is not less
than 18 so probably it is 18 or
greater than 18 then we are checking the
second condition if the age is less than
60 age group is adult and if
these two conditions are not met then it
jumps to else where age group is senior
and if this whole block
was ignored because age was not greater than 0
then
we would have just printed out age group
is not defined with the message wrong age
and your age is such and such so this is
our whole function so let's go ahead and
run this
now let's check the age group when the
age is 10
when the age is 40 when the age is 65 or
when the age is minus 10 which is not
defined
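The nested conditions can be sketched as below. The function name `age_group_of` and the message wording are assumptions; the thresholds 0, 18 and 60 come from the narration.

```r
age_group_of <- function(age) {
  age_group <- "Not Defined"
  if (age > 0) {
    if (age < 18) {
      age_group <- "Kids"
    } else if (age < 60) {    # reached only when age is 18 or more
      age_group <- "Adult"
    } else {
      age_group <- "Senior"
    }
  } else {
    message("Wrong age")
  }
  message("Your age is ", age, " and age group is ", age_group)
  invisible(age_group)        # addition for this sketch
}

age_group_of(10)
age_group_of(40)
age_group_of(65)
age_group_of(-10)
```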
now there are some inbuilt functions
which can be used in r
such as your switch function so looking
at this
function that is switch function
we can see
or we can use this for our different
kind of operations so here your switch
function returns values
match with the first argument and first
argument should be a character let's
have a look at the example
so say for example you want to return
the
house rent allowance or hra amount based
on cities
so we create a function called hra now
that takes an argument which is city
name and here we will say what does this
function do
so here i am saying hra amount and i am
going to use the switch function
now switch function i am saying i would
want to convert the city name to
uppercase so that we can maintain some
consistency and here i am saying if
the city is bangalore it would be 7500
if it is mumbai
thousand if it is delhi eight thousand
chennai seven thousand five hundred
and you have five thousand as the default value and you
are returning the hra amount now what do
we do with that
so let's
create this function
it's done and now we will pass in the
value
so we will see
whatever value has been passed to this
and that gives me the value here right
so switch is basically taking me
directly to this value now however if i
try to provide a city name which is not
given in the list
so when i'm saying say for example
pune
now what is happening is it is just
taking a value which has not been
assigned to any of these conditions
if i go for
again something else which is
in a lower case
now this is where your toupper
function will come into use and if we do
this it basically converts this into
uppercase bangalore and then basically
it gives you the value
so this is the usage of a switch
function
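A sketch of the `switch` usage described above. The amount for Mumbai is not recoverable from the narration, so that city is left out here; the other amounts are as spoken, and the variable name `hra_amt` is an assumption.

```r
hra <- function(city) {
  hra_amt <- switch(toupper(city),   # normalise the name before matching
    BANGALORE = 7500,
    DELHI     = 8000,
    CHENNAI   = 7500,
    5000                             # unnamed last value acts as the default
  )
  return(hra_amt)
}

hra("Delhi")      # matched after conversion to uppercase
hra("bangalore")  # lowercase input still matches
hra("Pune")       # not in the list, so the default is returned
```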
let's look at one more example so for
example here we are creating a salary
range which will take an argument which
will be band and i will say these are
my bands or you can say these are my
options so i can say l1 is basically ten
thousand to fifteen thousand
l two
is so and so
and l three is so and so and you return
the range
now
let's create this function
sometimes you have to
do it this way
so our function is created and now we
can just do a salary range
given a value and that gives me the
range of the values however if you pass
something
which is not mentioned then it basically
prints out null
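The same idea without a default can be sketched as follows. Only the L1 range is stated in the narration; the L2 and L3 ranges below are hypothetical placeholders.

```r
salary_range <- function(band) {
  rng <- switch(band,
    L1 = "10000 - 15000",
    L2 = "15000 - 25000",   # hypothetical range
    L3 = "25000 - 40000"    # hypothetical range
  )
  return(rng)
}

salary_range("L1")
salary_range("L4")  # no match and no default, so the result is NULL
```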
so in r you can also use repeat which
can be useful and
what does repeat do so here i am
assigning a value to a variable called
time
let's do that
and then i'm giving a piece of code with
repeat now what does repeat do so you
are passing in a message which is
hello welcome to our tutorial
and then you are saying if time is
greater than or equal to 20 you would
want to break out from this loop
and
then you also increment the times value
and this will keep repeating till
this if condition is met
wherein we have said
time value starts from 15 so let's do
this
and this basically will print out
the message wherein first my time was 15
which was
less than 20
so you increment it it becomes 16 you
print it again 17 print it again 18
print it again 19 and 20 and as soon as
you reach the times value which is 20 it
breaks out of this and it stops printing
this particular message
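A minimal sketch of the `repeat` loop just walked through. The message text is approximated from the narration, and the `printed` counter is an addition used only to make the number of iterations visible.

```r
time <- 15
printed <- 0   # addition for this sketch: counts how often the message is shown

repeat {
  message("Hello, welcome to the R tutorial")
  printed <- printed + 1
  if (time >= 20) {
    break              # leave the loop once time has reached 20
  }
  time <- time + 1
}

printed  # the message is shown for time = 15, 16, 17, 18, 19 and 20
```

Unlike `for` and `while`, `repeat` has no condition of its own, so the `break` is what stops it.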
okay now let's look at
some more examples so in r
we will use say a function to find the
square of any given user number okay if
the square value is less than 100 then
increment user value by 1 and find
square again and repeat this till square
exceeds
100
pretty simple so you create a function
which takes n as an argument and you
would want to repeat it
so you would want to repeat this by
squaring the numbers until the square
exceeds 100
and once it reaches 100
you will break out
so this is what we are doing and we are
auto incrementing or incrementing the
value of n by 1 every time
we calculate a square
and then you return the value of n
so let's create this function
and now let's calculate it for square 6
and that tells me what is the square now
as soon as your square value touches 100
it basically breaks out of the loop
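One way to sketch this function, assuming the loop stops once the square is strictly greater than 100. The name `square_until` is hypothetical.

```r
square_until <- function(n) {
  repeat {
    square <- n * n
    message("n = ", n, ", square = ", square)
    if (square > 100) {   # assumed stopping rule: square exceeds 100
      break
    }
    n <- n + 1            # otherwise try the next number
  }
  return(n)
}

square_until(6)  # steps through 6, 7, 8, 9, 10 and stops at 11
```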
now if you would want to find balance in
a bank account after n years
if
a person has deposited x amount in the
beginning and bank gives a interest of
eight percent per annum right this is a
simple calculation so it needs the
amount which was deposited you need the
year and you need the
rate
now year which is n years can be given by
the user
rate we have already given 8 percent
however functions main functionality is
that you can even assign new values to
it say
later one month down the line the bank
rate changes might be it increases might
be decreases then
function should not be
modified it can just take up the new
values and start calculating from there
on
now here
we will say
get the final balance
function takes amount
the amount which would be deposited year
and that could be say four years or five
years or ten years for which you would
want to calculate the rate of interest
and add it to the amount
so i will say for i in one to year so
that depends on how many times you would
want to
run this loop
i would say interest would be
using the round function i am saying
amount into rate
whatever is the rate of interest
and then you are rounding to two digits
now
final amount
will be calculated
so you are basically saying amount plus
interest and you will pass in
a message where we'll say year
is the value of i that's first year or
second year
amount what is the amount what is the
interest you are calculating based on
the round function
and final amount will be amount plus the
interest
and then
you basically say
amount will be given
our final amount will be assigned to
amount now if this is a function you
would want to return the final amount so
let's select this
and then basically create a function
and let's say
i would want
the final balance if the amount
deposited was five thousand
it was kept in the bank for five years
and rate of interest was eight
now
that should basically give me
my final amount
and if we
double that
so we say amount is 10 000 number of
years is 10 but the rate of interest is
less so let's calculate this and that
gives me the interest however if you
notice based on my message it is
basically telling me what was the first
amount what was the interest what was
the final amount and it does that
for
all these number of years
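The yearly interest loop can be sketched as below. The function name `get_final_balance`, the default rate of 8, and rounding the interest to two digits are assumptions drawn from the description above.

```r
get_final_balance <- function(amount, year, rate = 8) {
  for (i in 1:year) {
    interest <- round(amount * rate / 100, 2)  # interest for this year
    final_amount <- amount + interest
    message("Year ", i, ": amount = ", amount,
            ", interest = ", interest,
            ", final amount = ", final_amount)
    amount <- final_amount   # next year starts from the new balance
  }
  return(final_amount)
}

get_final_balance(5000, 5, 8)
```

Because `rate` is an argument, a changed bank rate only needs a different value in the call, not a change to the function.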
so these are some simple examples for
your functions right now we can also
look at on the similar lines we can
create some interesting functions so you
can find the total number of years
required to raise
eight thousand dollars if the user deposits
750 per month
so here you're not actually calculating
the final amount but
you would want
to find out how many years
are required to basically have the
amount as 8000 so
your function we are saying the amount
is say
550 or say 750
per month
now i would say let the final amount be
zero as of now month is zero and i will
say
while my final amount is less than or
equal to eight thousand i would want to
do something and that is you are
incrementing the value of month by 1
because that's your first time
your amount is less than 8 000 whatever
deposit was made say 750 per month and
then you have final amount which is
your
initial amount which has been assigned
to f amount that is zero plus the amount
you print out the message
and then you basically say year is
whatever value was passed for month
so you may want to have it for number of
years or years with particular amount of
month so we will calculate the year
value now here what we are doing is
we are calling in this required years
function
without an argument which takes the
default argument
or you can pass it with 750.
we can run this so let's create this
function pretty simple
done
and if we do not pass an argument then
the amount is 750
and it tells me what would be or how
much time it would take
for us to reach from say 750 or 550 to
final amount
similarly if i would have done this
it tells me
again a new value so we are finding out
the total number of years required to
raise
the amount to 8000 dollars
so these are some simple examples of
functions which you can use for your
operations your calculations and also
creating functions which can be
repeatedly used with
no argument, one argument or multiple
arguments
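A sketch of the `while` loop version. The name `required_years` comes from the narration; the `target` argument and converting months to years by dividing by 12 are assumptions made for this illustration.

```r
required_years <- function(amount = 750, target = 8000) {
  f_amount <- 0
  month <- 0
  while (f_amount <= target) {
    month <- month + 1              # one more monthly deposit
    f_amount <- f_amount + amount
    message("Month ", month, ": total = ", f_amount)
  }
  year <- month / 12                # assumed conversion to years
  return(year)
}

required_years()     # default deposit of 750 per month
required_years(550)  # smaller deposit, so more months are needed
```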
now
so far we were learning on creating our
own functions
and we also looked at using some inbuilt
functions
either
creating a plot or basically doing some
basic operations
or
passing in multiple arguments
so let's look at some more examples and
when we talk about
built-in functions there are lots and
lots of built-in functions which are
available in r
which can be used so let's look at these
so for example here are some built-in
functions
which can allow you to work with
different data structures for example
you have the seq function
which allows you to create sequences so
for example i could just say test nums
and i can just say seq
and here i can say where does it start
from so might be i can say 0
goes all the way to 50 and then i can
also say
if i would want a jump or how many
numbers should be used so for example
let's do this
and now if i look at test nums
so that shows me the value however not
to confuse we could have also done this
using assignment operator like this
and then look at your test nums so it
tells me
it has created a list of numbers from 0
to 50 which are even numbers now you can
always do a class of test nums
and let's look at this
and that tells me
the objects here are numeric
and say for example
i would use typeof
to see what this test nums is which we
just created there was a typing mistake
let's check this and it has the type
double
right so we have created a sequence here
where we are creating a list of numbers
which have
a space of 2 or you are saying about
even numbers now you can also use a sort
function so i can do a sorting here and
i can give it an increasing or a
decreasing order so if for example i
have created this sequence and i could
just create a simple variable like this
pass in a vector into this
which could be say for example i'll try
your test nums
and then
look at your v
so those are my numbers and you can
straight away do a sort on your test
nums so i could just do a sort on v
and that basically shows me the number
however i could also do a sort v
and then i could say
here let's check this v comma
and then you can say decreasing
equals true
and let's do this it just reverse or
puts the data in a reverse order or it
sorts based on decreasing value and
having the greater value in the
beginning and the lowest value at the
end
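The sequence and sorting steps above can be sketched as follows, assuming the step size of 2 that produces the even numbers mentioned:

```r
test_nums <- seq(0, 50, by = 2)   # 0, 2, 4, ..., 50
test_nums

class(test_nums)    # "numeric"
typeof(test_nums)   # "double"

v <- c(test_nums)
sort(v)                     # increasing order (the default)
sort(v, decreasing = TRUE)  # largest value first, smallest last
```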
so you can use the inbuilt sort function
similarly you can use rev for reverse now
rev does not actually sort the
values it will just reverse the elements
in your sequence for example let's say
v2 and i will again use
this one
as c and then just passed your test nums
that's an easier way or i could have
created a new vector so i'll say test
nums
that's my v2 and you can do a reverse
on
v2
and that basically
shows me the values but here we see
let's see so we are looking at
okay so this was wrong i should have
given a capitals
and do it yeah this is fine
and we get the values however if i had
created something like this v3
and let's say
c
and then let's say 99
and two and three and four and five and
seventy eight hundred
so that's
a vector i'm creating
and now what i can do is i can use the
reverse
on
v3
and you see
it has just reversed the elements in the
list
now we could have done this without
giving these brackets here
and it shows me the result
so this is good to understand what your
sorting does so sorting is basically
going to look at the objects and it's
going to sort them in ascending or
descending order
reverse is just going to
reverse the elements in your list now
similarly you can also use append which
is basically to combine objects so let's
say v4
and that will basically have
append
and let's say let's take v2
and let's take v3 and this is what we
would want to append
and now look at your value of v4
which basically has everything added
into one so this is your append
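A short sketch contrasting `rev`, `sort` and `append`. The vector `v3` uses the values read out above; `v2` is shortened here to a hypothetical smaller sequence to keep the output readable.

```r
v2 <- seq(0, 10, by = 2)            # hypothetical shorter sequence
v3 <- c(99, 2, 3, 4, 5, 78, 100)

rev(v3)    # 100 78 5 4 3 2 99: order flipped, nothing sorted
sort(v3)   # 2 3 4 5 78 99 100: values put in ascending order

v4 <- append(v2, v3)   # v3 is added after the elements of v2
v4
```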
similarly you have other functions like
finding out the absolute value
of a number you would want to find out
the square root you would want to find
the sum of all the elements in a vector
you would want to find out the floor
value exponential value
of something
and you basically finding out the mean
value so these are some built-in
mathematical functions so you have
built-in symbol functions you have
mathematical functions you have regular
expressions in r which can also be used
for pattern matching
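The built-in mathematical functions listed above can be tried out like this. The input values are hypothetical and chosen only for illustration.

```r
x <- c(4, 9, 16)   # hypothetical input vector

abs(-7.5)    # absolute value
sqrt(x)      # square root of each element
sum(x)       # sum of all elements
floor(3.7)   # largest whole number not above 3.7
exp(1)       # e raised to the power 1
mean(x)      # arithmetic mean
```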
now what we can simply do is we can
create a variable let's say text
sorry for caps let's say text and here i
will pass in something r is a
programming
language
for
data science
let's do this and now i would want to
use grep function so i can say grep
and this one needs what i am searching
for so let's say language
and where am i searching for so i'm
searching it in text
and let's do that and that tells me
where is this found so when i do a grep
i am trying to find out if this was
found in my element so here i am saying
text and grep language similarly i can
also use one more function which is
finding out
index positions so i can also find out
index positions
by basically giving the vector and here
i can do a grep pass in my vector
abcd you are searching for b and in your
vector
and that tells you your b is at the
index position 2 d is at the index
position 4. so here we are using some
regular expressions now there are also
other
ways in which r can be used for data
manipulation so let's learn about
factors in r and
how do you work with factors and what
are they for
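Before moving on, the `grep` calls from the pattern matching part can be sketched as below. The pattern `"b|d"` is an assumption about how both b and d were searched for; `letters_vec` is a hypothetical name.

```r
text <- "R is a programming language for data science"

grep("language", text)    # 1: index of the element that contains the pattern
grepl("language", text)   # TRUE: logical version of the same check

letters_vec <- c("a", "b", "c", "d")
grep("b|d", letters_vec)  # 2 4: index positions of the matching elements
```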
so when you talk about
factors
so here let's clean this up
and let's see what is this so when you
say
factors here we are talking about
categorical variables
so categorical variables can take only
limited number of different values now
don't be confused with this
histogram example here might be we can
just look at packages so that that
doesn't get confused
so when we talk about categorical
variables
we are talking about
variables which can belong to only
categories for example in r there is a
data structure to work with these kind
of variables and that is called your
factor
so with factors we can be sure that all
statistical modeling techniques will
handle such data correctly
so for example you can talk about a
person's blood group and you can say
the blood group could be a or b or a b
or o
so say we collected information about
eight people
and we
recorded this information as a vector
and we can call it blood group so let's
do that so let me
try that here
so if i say
blood
group
and then
i would like to create a vector here
so that
we can look at information about eight
people and their blood group and this
can be
in the form of a vector which can then
be created or converted into factor
by using the factor function
so how do we do that let's say i have
blood group and here i will basically
given some values
so i will use c function and here let's
give some values so for example
let's say
b
let's say a b
and let's say
o
and let's say a again
let's again say o
might be one more o
let's say a
and let's say b
so here we have eight entries and let's
consider we have recorded the blood
group of eight people
and this is in the form of a vector so
for example let me create this now this
is a vector which we have created and
you can always look at the value of this
one
so let's say
blood group
and that basically
okay there was a spelling mistake let's
do blood group
and that basically shows me the values
and here you see all the values that are
in double quotes now we have basically
created a vector
now to convert this vector into factor
we can use the factor function
and how we can do that is basically we
can say for example
let's go here and let's say blood
group
underscore factor
and to convert this vector into
factor i will use the factor function
and then basically pass my
blood group here
and
now i have created a factor and we can
look at this factor by just doing in
blood
group factor
and now if you see
it basically shows us a factor
it does not have any double quotes and
you can also see the factor levels for
categorical variables which get printed
out here
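Creating the factor from the character vector can be sketched as follows, using the eight values read out above (written in uppercase here as an assumption):

```r
blood_group <- c("B", "AB", "O", "A", "O", "O", "A", "B")
blood_group                    # character vector, values shown in quotes

blood_group_factor <- factor(blood_group)
blood_group_factor             # no quotes, and the levels are listed below

levels(blood_group_factor)     # "A" "AB" "B" "O"
```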
now what r is actually doing here
is first r scans through the vector to see
the different categories in there
then r sorts the levels alphabetically
and then it converts
the character
vector to a vector of integer values
so these integers correspond to set of
character values
to use when factor is displayed now we
can always do a structure
to find out more details of this
and here i will
pass in blood group factor and let's
look at this one and this one shows me
this factor is with four levels
so inspecting the structure will reveal
that it has four levels
it shows me what are the categorical
variables and it shows me some integers
so here we are dealing with a factor of
four levels
now a is
recorded as one so that would be your
first level
you have ab
which is recorded here
and that is basically your second level
b is the third level and o is the fourth
level so
when we are looking at this factor we
may think why this conversion so
categories could be
very long character strings and each
time repeating a string or an
observation can take up lot of memory so
using factors and having these levels
can reduce the memory space
now factors are actually
integer vectors
and each integer corresponds to a
category or a level
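The integer encoding behind the factor can be inspected with a sketch like this, which rebuilds the same vector so that it runs on its own:

```r
blood_group <- c("B", "AB", "O", "A", "O", "O", "A", "B")
blood_group_factor <- factor(blood_group)

str(blood_group_factor)          # Factor w/ 4 levels, followed by integer codes
typeof(blood_group_factor)       # "integer"

codes <- as.integer(blood_group_factor)
codes                            # 3 2 4 1 4 4 1 3

# Looking the codes up in the levels gives back the original values.
levels(blood_group_factor)[codes]
```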
now to specify a different order of
levels
we can specify levels inside the factor
function how do we do that so let's say
i will say blood underscore factor 2
and here
i will basically
give the same so i will say factor
and this factor will have blood group
which we had created earlier but this
time also i'm going to specify
levels
and then i can basically pass in a
vector here
so within this levels i will specify the
values what are the levels
so here i will say o
then i will say a
then let's say
b
and then let's say
a b
so this is what we are doing here
to specify different order of levels and
we are specifying the levels here now if
i do this that would have got created
and let's look at
blood
factor 2 and that shows me the value
here
below where you have specifically
assigned the levels now if you look at
the previous one where we had blood
group factor where levels were
automatically
understood by r
so we were looking at the categorical
variables we were seeing what are the
levels here
and here what we have done is
we have just created again a factor but
then we have specified levels in a
different order
and you can obviously do a structure on
this to compare so for example i'll say
structure
and then i will say blood factor 2
and that basically shows me the
structure with four levels so this was
the initial one where we said a ab b
and o
and then there were some integer values
which were corresponding to these
categorical variables here we have given
a different level and we have
a different set of numbers which we see
here
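Specifying the level order explicitly can be sketched as below; comparing the integer codes shows how the encoding changes. The name `blood_factor2` follows the narration.

```r
blood_group <- c("B", "AB", "O", "A", "O", "O", "A", "B")

blood_group_factor <- factor(blood_group)   # levels sorted alphabetically
blood_factor2 <- factor(blood_group,
                        levels = c("O", "A", "B", "AB"))  # custom order

str(blood_group_factor)
str(blood_factor2)

as.integer(blood_group_factor)  # 3 2 4 1 4 4 1 3
as.integer(blood_factor2)       # 3 4 1 2 1 1 2 3
```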
so if we compare structure of blood
factor and blood factor 2 we will see
encoding is different right now that is
done we can also specify the level names
so what we can do is
as we use names function for name of
vectors we can pass vectors to
levels here
and there is
basically
a function what you can use
so let's say i will
say levels
and then within this i'll pass blood
underscore factor
or blood group underscore factor here
so
once this is done
let's say
blood group
so in this one we created blood group
underscore factor
and this one was blood factor two so
that's okay i mean it's just a naming
convention
and here let's pass in levels
to my blood factor
and then what i can do is i can pass in
the values here
so this is
when you would want to give specific
names
and let's create a vector
and let's call it say bt underscore a
and might be you would want to give bt
underscore a b
and then you will give
bt underscore b
and
the final one is bt underscore o
so what i'm doing is i'm doing the
naming for these particular categorical
variables
by using levels
now let's do this
and
it says blood factor not found
so
we have to look which one did we have so
we have
blood group factor so this is what we
should have given so let's say blood
group
factor
blood group factor
and now we have given some names here so
let's look at
the blood group factor now
blood group
underscore factor and now if you see we
have some levels or we have given the
name to the categorical variable so if
you compare this one so here we were
creating a blood group where we had
these variables
and these variables were the categorical
variables
which was just creating a vector
then we created a factor out of it
and then we looked at our factor we
looked at the structure of it
and similarly what we did was we created
a different factor so let me also change
the name here
and let me call it blood group factor 2
but here we were specifying levels in a
different order
let's look at this one so which is blood
group factor 2
and then you can look at the structure
of this one
blood group factor two and here
in this example what we did was the
initial blood group factor what we had
created we have just given
some names to that like what we would do
in case of vector by using the names
function here you are using the levels
function
so we basically
created some levels and let's look at
the blood group factor now
which basically has some
different names
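A minimal sketch of renaming levels with the levels function, assuming the same hypothetical blood_group values as a stand-in for the vector from the video. The new names must be given in the order of the existing levels.

```r
# Hypothetical sample data standing in for the vector from the video
blood_group <- c("B", "AB", "O", "A", "O", "O", "A", "B")
blood_group_factor <- factor(blood_group)

levels(blood_group_factor)   # existing levels: A, AB, B, O

# Assign new names, position by position, to the existing levels
levels(blood_group_factor) <- c("BT_A", "BT_AB", "BT_B", "BT_O")

blood_group_factor
```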
so what we are doing here is we are just
using naming now we can also specify the
categorical variable names
or levels by specifying label arguments
so inside the factor function so that is
basically to give some names or levels
so let's look at this one and how do we
do that so we can specify by using
factor
which basically creates your factor
and then here i'll say blood group
i'm going to specify labels for my
naming
so
in
previous examples we saw how we were
using the levels right and this was
by specifying levels for a different
ordering
and then
we could have also done this by saying
levels and given some different names
or we can just do labels
and then
within this i will say labels equals
and then i can say
c and then let's give these values which
we have bt underscore a bt underscore a
b bt underscore b and bt underscore o
so let me just copy this one again
and let's put it here and then we can
basically do a ctrl enter
so i would have created a factor here
and then
we should remember one thing here that
it is important to follow in
the same order as the order of factor
levels that is a
a b
b or o now these are the
levels what we are seeing so if you look
at any one of these in the beginning
which we had created
it was showing me what levels it has a
ab b and o
and a ab b and o so we are following
the same order but we are using the
labels
within
my factor creation
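The labels argument can be sketched like this. The data values and the name blood_group_factor3 are assumptions for illustration; the labels are listed in the same order as the default levels.

```r
# Hypothetical sample data standing in for the vector from the video
blood_group <- c("B", "AB", "O", "A", "O", "O", "A", "B")

# Default levels are A, AB, B, O, so the labels follow that same order
blood_group_factor3 <- factor(
  blood_group,
  labels = c("BT_A", "BT_AB", "BT_B", "BT_O")
)

blood_group_factor3
```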
now sometimes there might be issues
because of wrong ordering so we can
actually use a combination of manually
specifying the levels and label
arguments when creating a factor
now what we can do there is we can say
factor
and in this case let's say
blood group
which i'm creating
then i will basically say
levels
to give
the right ordering
and here in levels
let's say
o let's say
a
let's say b
and then let's say a b
so this is for my
levels which i'm creating
and then what i can also do is i can go
for
labels
so levels will take care of my ordering
and labels will take care of naming
the categories so let's say labels
and then we can create a vector
and we can give some names so we can say
bt underscore o
what else we have we have bt underscore
a
we can then say bt underscore
b
and finally we can say bt
underscore
a b
and
then let's create this one
so now what we have done is we have
created a blood group which has levels
which is following your ordering which
is following the naming
as we have passed so if you look at the
levels it tells you the names what you
have created
it also tells you all the categorical
variables
which were used
for my blood group and basically these
will have some labels
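Combining levels and labels might look like the sketch below. The data values and the name blood_group_factor4 are assumptions; the point is that each label is paired with the level at the same position.

```r
# Hypothetical sample data standing in for the vector from the video
blood_group <- c("B", "AB", "O", "A", "O", "O", "A", "B")

blood_group_factor4 <- factor(
  blood_group,
  levels = c("O", "A", "B", "AB"),               # controls the ordering
  labels = c("BT_O", "BT_A", "BT_B", "BT_AB")    # controls the naming
)

blood_group_factor4
levels(blood_group_factor4)
```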
so
we can anytime
look at our
blood
group
which we had created in the beginning
and let's look at the values of those
so when we talk about categorical
variables
there are
two kinds in categorical variables so
you have nominal or you have ordinal now
in nominal you don't have any implied
order for example
blood group o
is not necessarily greater than a that
is o is not worth more than a
that we can think of
now trying such comparisons with factors
will generate a warning so
say for example we would want to
look into our
blood
factor and let's look at
what blood group factor contains now
that's the new blood group factor let's
say blood group
factor
and here i will try to pull out a value
here and let's compare this with
blood
factor
and let's look at some other value now
in this case
we see a warning not meaningful for factors so it
cannot really compare the categorical
variables and see if one
variable is greater than other or has
more worth
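A small sketch of what happens when two elements of a nominal factor are compared. The data values are hypothetical; R warns that the comparison is not meaningful for factors and returns NA.

```r
# Hypothetical nominal (unordered) factor
blood_group_factor <- factor(c("B", "AB", "O", "A"))

# Emits a warning that '>' is not meaningful for factors; the result is NA
cmp <- blood_group_factor[1] > blood_group_factor[2]
cmp
```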
now
there can be many examples where such
ordering does exist and in r we can
impose such ordering in factors thus
making it ordered factor so inside
factor
we can set the argument ordered is true
and we can do that now for example you
would look at
the size of a dress so let's say
dress size
and here i will say for example let's
create a vector
and let's say medium
let's say large
let's say
small
let's say again small
and then let's say large
let's say medium
again an entry of large
and then let's say
medium
so here i'm creating a vector
and let's see if we missed out any
quotes or comma
so it says unexpected symbol
and where is that so let's look at this
one so we have dress size
and we are looking at c so i am saying m
l
s
s
l and here is a quote missing
and that was the reason so
and this one also has a quote missing
and now it should resolve yeah so let's
look at this one and now we have created
a vector called dress size now obviously
you can create a
factor of this so i'll say dress size
underscore factor
where i would want to look at the
ordering of this so let's create a
factor
and in factor we will pass our vector
which we want to convert or we
want to create a factor
we will say ordered
equals true
so i am specifying a particular ordering
and then i can also specify levels as we
saw earlier
so in levels we will give the category
so what categories we have
so we have small
we have medium
and we have large so these are the three
levels which we have
and let us create this as a factor now
that's done
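The ordered factor built here can be sketched as follows, using the values spoken in the video and spelled out as full words. With ordered set to TRUE, comparisons between elements return TRUE or FALSE instead of a warning.

```r
dress_size <- c("medium", "large", "small", "small",
                "large", "medium", "large", "medium")

# ordered = TRUE plus explicit levels gives small < medium < large
dress_size_factor <- factor(
  dress_size,
  ordered = TRUE,
  levels = c("small", "medium", "large")
)

dress_size_factor

# Comparing the first element (medium) with the second (large)
dress_size_factor[1] < dress_size_factor[2]
```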
and what you can do is you can look at
the
factor and we can also do a comparison
so let's for example look at our factor
what does it contain it has some levels
and if you closely notice
there are these levels which also have a
comparison of which one is worth or
more worth than other variable so you
can look at
dress size factor which has some
ordering which we have implemented and
now let's do a comparison between dress
size
and
compare it with some other variable and
see what is the result so now it says if
it is true or if it is false earlier we
were not able to do that because we did
not have any ordering and if we were
looking at the variables we were not
really clear
if one variable has more worth than
others so these are some simple examples
what we have seen now we can also look
at some more examples so say for example
i do a type here
now that basically is creating a vector
if i would want to compare the element
that is type 3
is it greater than type 4 it shows me
false
right now here what we are seeing is
that
if you are looking at a particular value
okay
we can basically see
that there is some comparison happening
here if i compare this with 1
and 2 which tells me true or false
and if i look at this it also does some
comparison so i can always
convert this into factor by using the
factor function so i can do this
if i'm checking
if for example i would want to
create a nominal factor i can do a type
dot factor and it tells me it is true
you can also do a type dot factor 2
and then use the factor function pass in
your type which is a vector and here you
are saying ordered as true which we just
now saw
and now you look at type dot factor 2
which is creating an ordinal
type of variables
now here we can again create type dot
factor 3 so what we are doing here
in this case
we had a vector
we basically
said
type dot factor
we checked is dot factor and it is true
and then we looked at the nominal factor
we also did a factor 2
and then we created factor but we
specify ordered as true
so we get ordinal
and now if you look at type dot factor 3
here you are saying ordered and you are
also specifying levels like what we did
in the previous example and now
you would look at ordered factor with
user given order which also has the
levels which clearly show us a
comparison between those now we can take
a different example we can say type dot
factor 4
we are using the factor function i am
specifying type which is a vector i am
saying ordered is t
i am using
level which is giving me
some levels
and then we also have labels which are
basically going to have the naming
convention so let's look at this one and
look at type dot factor 4 so it tells me
what are the categorical variables
which are small medium large small large
medium
these are for my
type values which we created a vector
here these are my type values
for which we created a vector
we said ordered is true the levels is
small medium large and we gave some
names so we are looking at the values of
this so this basically helps you to work
on your categorical variables when you
can then compare the values
and you can see what does it show
now here what we are doing is we are
creating a different vector we say small
tall tallest medium small and so on
let's look at this one which is
basically
type and it has the value
so what we would want to do is we would
want to compare height type of first
value with the fourth value
so for that
let's create a factor
on this type ordered is true level is we
are saying
small medium tall and tallest
these are the levels
and now when you look at your type dot
factor 5
it basically shows me
what are the levels which you have
specified so small is the smallest then
you have medium which is bigger than
small tall is bigger than medium tallest
is bigger than tall we have assigned
some levels and based on these levels
now you can compare
your values in this factor type dot
factor 5
take the first value which is small and
compare it with the fourth value which
is medium and you will know if small is
greater than medium so the result would
be false
now i can also convert this into integer
and
i can continue working on this
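A sketch of the height example, assuming a short vector that starts the way the video describes (the full vector is not shown) and a hypothetical name height_type. It compares the first and fourth values and then converts the ordered factor to integer codes.

```r
# Hypothetical vector; only its first few values are spoken in the video
height_type <- c("small", "tall", "tallest", "medium", "small")

type_factor5 <- factor(
  height_type,
  ordered = TRUE,
  levels = c("small", "medium", "tall", "tallest")
)

type_factor5

# First value (small) against fourth value (medium)
type_factor5[1] > type_factor5[4]

# Integer codes follow the level order: small = 1, ..., tallest = 4
as.integer(type_factor5)
```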
now here you have
basically
a sequence
so let's use the sequence function where
i'm starting from 0 ending at 20
and there is a jump of 2 so that
basically creates a vector
let's look at the vector value here
and if you would want to sort the vector
so we are using a inbuilt function
wherein let's create this
vector with these numbers i can do a
sorting i can also do a sorting with
decreasing is true
you can do a reversing of vector so
these are some examples of inbuilt
functions which we have already
discussed so here you are doing a
reverse you are finding out the
structure you want to append two vectors
you want to check the class of an object
you want to convert a vector into a list
using as dot list
converting the vector into a matrix
you are having a sample
with two random values between 10
and 20 so these are some inbuilt
functions which we have already
discussed such as taking a
vector and getting an absolute value or
getting a sum of it or a mean of it
round
or basically rounding it to two decimal
places getting the ceiling value getting
the floor value truncating it
returning the log getting the
exponential value and so on
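The built-in functions listed above can be tried out with a sketch like this one. The vector x is a hypothetical example, and the matrix call uses only the first ten values so that they fill a 2 by 5 matrix evenly.

```r
v <- seq(0, 20, by = 2)        # 0, 2, 4, ..., 20

sort(v, decreasing = TRUE)     # sort in descending order
rev(v)                         # reverse the vector
str(v)                         # structure
append(v, c(22, 24))           # append two vectors
class(v)                       # class of the object
as.list(v)                     # vector to list
matrix(v[1:10], nrow = 2)      # first ten values as a 2 x 5 matrix
sample(10:20, 2)               # two random values between 10 and 20

x <- c(-2.567, 3.14159, 7.5)   # hypothetical numeric vector
abs(x)
sum(x)
mean(x)
round(x, 2)
ceiling(x)
floor(x)
trunc(x)
log(abs(x))                    # log is taken of the absolute values here
exp(x)
```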
now we have also looked at regular
expressions earlier so regular
expressions let's just revisit that
so here you are basically creating a
variable called text and then you can
just do a grepl
you can say what you would want to
search and where you would want to
search it and that would give you the
logical value indicating if the pattern
was found
you can try to search something else
which might not be found you can also
search for independent values like this
and that basically can give you the
position of that particular object
within the vector
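A minimal sketch of the two search functions, with a hypothetical text vector. It assumes the logical result described above comes from grepl and the positions come from grep.

```r
# Hypothetical text; the strings used in the video are not shown
text <- c("R is fun", "data science", "learning R programming")

# grepl returns a logical value for each element searched
grepl("fun", text[1])
grepl("python", text[1])

# grep returns the positions of the elements that match
grep("R", text)
```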
and here is one more example of working
with timestamps so for example
if i would just do sys dot date it
returns the current system date
if i would want to
set that as a variable and then call
that variable it shows me
our current time
i can also use as dot date
and then let's look at this one so as dot
date and this
would be converted into date and then
you can obviously use formatting
techniques like
getting the month getting the day
getting the year
so here we are passing in the date and
then we are saying what format we would
be interested in
and that basically gives us
the data in a particular format so
that's also useful when you have your
time series data or when you would want
to convert the data types and so on now
there are different ways in which you
can do formatting so for example in this
one we were saying
month day and year
i can also say
for getting the full month name or
getting the full year name i can do this
caps
so i can look at this one and that
basically shows me
my
date in a particular format
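The date handling above can be sketched as follows. The date string is a hypothetical example, and the full month name produced by the capital B code depends on the system locale.

```r
today <- Sys.Date()   # current system date
now <- Sys.time()     # current date and time

# Hypothetical date string converted to a Date object
d <- as.Date("2021-03-15")

format(d, "%m/%d/%y")    # two-digit month, day and year
format(d, "%B %d %Y")    # full month name and four-digit year
```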
so these are some inbuilt functions
which we are seeing and before this we
were seeing factors which is mainly to
work with categorical variables
either they have levels auto assigned
and they might not have labels
so you can give labels you can give
levels you can control the ordering you
can give levels in a different way so
that you can have a different ordering
so this is how you use factors and work
on categorical variables maybe that is
nominal or ordinal and easily you can
do your statistical computations on such
data
let's learn about data manipulation in r
and here we will learn about
dplyr package
and when we talk about this dplyr
package it is much faster and much
easier to read than base r so dplyr
package is used to transform and
summarize tabular data with rows and
columns you might be working on a data
frame or you might be getting in a
inbuilt r data set which can then be
converted into a data frame so we can
get this package dplyr by just
calling in library function
and this can be used for grouping by
data summarizing the data adding new
variables selecting different set of
columns filtering our data sets sorting
it selecting it arranging it or even
mutating that is basically creating new
columns using functions
on existing variables so let's see how
we work with dplyr now here
i can basically get the package here so
i can just say
install dot packages dplyr now we
already see the package here which
is showing up so i will just select this
one i can do a control enter and that
will basically set up the package
package dplyr successfully
unpacked
so that is done now you can start using
this package by just doing a library
dplyr
and this was built it shows me my
version of r so let's also use a inbuilt
data set that is nycflights13 so
we can do install dot packages and that
will search
and get that relevant data set i can
again call it by using library function
now once that is done we can look at
some sample data here by just doing view
flights and that shows me the data in a
neat and a tabular format which shows me
year month day
departure time schedule departure time
and so on
now we can also do a head to look at
some initial data
which can help us in understanding the
data better so what is this data about
how many columns we have what are the
data types or object types here
it shows me how many variables we have
so this is fine now we can start using
dplyr and
in that we can use say filter function
if we would want
to look in for specific value now here
we have the column as month so i will do
a filter now i'm creating a variable f1
i'm using the filter function
on flights which we already have
and then what we can do is we can
basically
look at the month where the month value
is 0 7
so let's look at that
and this one
you can do a view on f1 which shows me
the data wherein you have filtered out
all the data based on month being 7.
so this is a simple usage of filter we
can take some other example we may want
to include multiple columns so we can
say f2 filter
flights and here we will say month
is equal to 7 day is 3
and then look at the value of f2 if you
are interested in seeing this
and that tells you the month is 7 and
day is 3 you could also look at it in a more
readable format by using view on f2 and
that gives me my selected result so we
are just extracting in some specific
value we can keep extending this so here
we can say flights
is what we would want to work on i'm
using the filter function so i can
straight away
instead of creating a variable then
doing a view i can also do a view in
this way i can just pass in my filter
within the view and within this i am
saying filter i would want to look at
the flights month being 0 9 day being 2
and origin being lga
and then that shows me the value here
and obviously you can scroll and look at
all the columns and if you see the
origin column it shows the selected
value so now we have filtered out our
data based on values
in three different columns
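The three filters above can be sketched like this, assuming the dplyr and nycflights13 packages are already installed. The variable name f3 is an addition for illustration; the video passes the third filter straight into View.

```r
library(dplyr)
library(nycflights13)

# Rows where month is 7
f1 <- filter(flights, month == 7)

# Conditions separated by commas must all hold
f2 <- filter(flights, month == 7, day == 3)

# Three columns at once
f3 <- filter(flights, month == 9, day == 2, origin == "LGA")

head(f3)
# In RStudio, View(f3) opens the same result in a tabular viewer
```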
now
what we can also do is we can use and or
we can use or operators so
i could have done this
in a little different way so i could
have said head which shows me
initial result
i will do a flights so within my head
function i am passing in this
and what does that contain so you are
saying flights and in this flights data
set
you would want to pick up the month
being the column so we use the dollar
symbol here we give a value and i'll
say and and i'll again say flights
wherein i will select the day being two
and and and remember when you talk about
and it is going to check if all the
values are met true so then you say
flights origin
lga and you look at the value so in
this way i can
filter out specifically multiple values
by specifying columns now we could have
done it in this way we could have
created a view or we could have assigned
this to a variable and then done a view
on that where we could have selected
month day and origin
or you can be more
specific
in specifying all the columns it makes
the code more readable so let's look at
the values and here you are looking at
head which shows me based on month
day
and then you can look for further
columns for other variables that is
origin being lga
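The same selection written with the dollar symbol and the and operator might look like the sketch below, assuming dplyr and nycflights13 are installed. The or example and the names base_result and or_result are additions for illustration.

```r
library(dplyr)
library(nycflights13)

# Base R style: index rows with a logical condition built from $ and &
base_result <- flights[flights$month == 9 &
                         flights$day == 2 &
                         flights$origin == "LGA", ]
head(base_result)

# The or operator keeps rows where either condition holds
or_result <- filter(flights, month == 7 | month == 9)
head(or_result)
```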
now what we can also do is we can do
some slicing here to select rows by
particular position so i can say slice
and i would want to look at
rows one to five and i can do this
so you can always assign or look at the
view of this
i can just do
here so when i did a slice one to
five it shows me
my entries
for one to five
now similarly we can do is slice five to
10
and now you are looking at
5 to 10 values
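Selecting rows by position can be sketched as follows, assuming dplyr and nycflights13 are installed. The names first_five and five_to_ten are hypothetical.

```r
library(dplyr)
library(nycflights13)

first_five <- slice(flights, 1:5)     # rows 1 to 5
five_to_ten <- slice(flights, 5:10)   # rows 5 to 10, six rows in total

first_five
five_to_ten
```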
so you can always look at the complete
data and then you can slice out
particular data now mutate is usually a
function which is used when you would
want to apply some computation on a
particular data set
and then you would want to
add it to
your
existing data frame or you would want to
add a new column so this is where you
use
mutate which is mainly used to add new
variables so let's see how you work on
mutate
so
it's pretty simple so you create a
variable overall delay now i would want to
do a mutate so that it adds a new column
so i'm selecting my data which is flights
i will call the new column as overall
delay
and then basically
i can look at
overall delay being arrival delay minus
departure delay so let's create this and
let's look at view of this which shows
me
or which should show me my new column
which is overall delay which was not in
my original data set so you can anytime
do a head on this one to compare the
value so this one shows me arrival delay
and then there are many other variables
what you can also do is you can do a
view
and you could have just looked at flights
if you would want to compare
so you can look at the flights and this
one would not have any
overall delay column so it basically
shows me 19 columns only
what we see here
and if you
do a view
on overall delay then that basically
shows me 20 columns so we know that the
new column has been added
to
this overall delay so if you would want
to work with 20 columns you will use
overall delay if you would want to work
with your original data set you will use
flights now you can also use a transmute
function which is used to show only the
new column so we can do an overall delay
and at this time we will say transmute
we will say flights overall delay
the computation remains same but at this
time if i look at view on overall delay
it only shows me the new column so
sometimes we may want to compute result
based on two variables or two columns
and just look at the new value
and then we can decide if we would want
to add it to our existing structure
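The difference between mutate and transmute can be sketched like this, assuming dplyr and nycflights13 are installed. The names over_delay and only_delay are hypothetical; arr_delay and dep_delay are columns of the flights data set.

```r
library(dplyr)
library(nycflights13)

# mutate keeps every existing column and adds the new one
over_delay <- mutate(flights, overall_delay = arr_delay - dep_delay)
ncol(flights)
ncol(over_delay)

# transmute keeps only the newly computed column
only_delay <- transmute(flights, overall_delay = arr_delay - dep_delay)
head(only_delay)
```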
now you can also use summarize
and summarize basically helps us in
getting a summary
based on certain criteria so we can
always do a
summarize
and
what we can do is we can look at our
data
and we can say on what basis we would
want to summarize this particular data
so we can do a summarize function now
summarize on flights i will say average
air time
and i would want to calculate an average
so for that i am using inbuilt function
called mean
i will do that on airtime column
so
let's look at flights once again and
here we can see there is
arrival time not a time sorry arrival
time and we would want to do some
average on this particular data we would
want to summarize this so what i'll do
is i will use the summarize function
i will say average airtime and this one
i will look at mean of air time so let's
see if there is an air time column i might
be
let's look at this one arrival delay
and yes we have an airtime so we were
actually looking at summarizing based on
airtime not the arrival time
so air time is how much time it takes in air
for this particular flight and we will
want to use the summarize function
not the transmute so summarize flights
average air time and this one we will
calculate the mean of air time
and
i will also do an na dot rm which is
i'm saying true so let's do this and
that basically shows me the average air
time is 151
i can also do a total air time where i am
doing a summation of values or i can get
the standard deviation or i can
basically get multiple values such as
mean
i can say
total airtime where i am doing a
summation
and then i can look at other values
which is if you would want to put in
standard deviation here you could do
that so let's look at the result of this
summarize and this basically allows me
to get some useful information which is
summarized based on
a particular function such as mean sum
standard deviation
or
all three of them
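A sketch of summarize with all three functions in one call, assuming dplyr and nycflights13 are installed. The result column names are hypothetical; na.rm = TRUE drops missing air_time values before each computation.

```r
library(dplyr)
library(nycflights13)

air_summary <- summarize(
  flights,
  avg_air_time   = mean(air_time, na.rm = TRUE),
  total_air_time = sum(air_time, na.rm = TRUE),
  sd_air_time    = sd(air_time, na.rm = TRUE)
)

air_summary
```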
now
let's look at grouping by so sometimes
we may be interested in summarizing the
data by groups and that's where we use
the group by function so we can always
use
the group by clause
now
here we are taking a different data set
so we will say for example let's look at
head of mtcars
and that is basically my data set
mtcars now that shows me the model
of the car
it shows me mileage cylinders displacement
and horsepower and various other
characteristics or variables in this
particular data set
so here
we can say let's do a grouping by gear
so there is a column called gear so i
will call it by gear i will look at my
data set and then what i am using here
which you see with these percentage and
greater symbol is called
piping so that basically
feeds your previous data frame into next
one so this is sometimes useful and you
can get this by just saying control
shift and m and you can then use this so
we are going to have
piping so i am saying mtcars now
this is my original data set where i did
a head
or i could have done a view on this one
if you would want to see it in a more
readable format and that basically shows
me the data so we are using a different
data set so i want to group it by the
gear column so i'm going to call it by
gear
and
this one takes my data that is mtcars
i'm using the piping and then i'm
saying group the data based on gear
column that's done now let's look at the
value of by gear
or
you can always do a view so remember
whenever you're doing a group by it is
giving you a
internal object where your data is
grouped based on a particular column
so we can look at the values here you
can do a view that shows you
your data grouped based on a particular
column
now i can again use the summarize
function
where i would want to now work on the
new one where it was grouped based on
gear so i am doing a summarize and here
i am going to say gear 1 which will be
having the value of summation on the
gear column
and then i am saying gear 2 which is
mean well you could give some meaningful
names to this
and let's look at the value of this one
where we are basically now looking at
the values which is sum
and mean values based on the gear
similarly we can look at a different
example so we can say by gear
and i am again using piping
but earlier we had taken gear
we had grouped the data
and we called it by gear so we took our
original data set mtcars but now
within this particular data which was
grouped by gear
i will take this data set i will use the
piping and i will summarize it where i
am saying within this particular data
set i would want to get the sum or i
would want to get the mean and then you
can look at the values so
what you are doing is
you are
either looking at your original data set
or you're looking at the data which was
already grouped and then you can look at
the values
now here what we can do is we can group
by cylinder say might be you are
interested in looking at data which is
summarized based on the cylinder column
you can do that and then for this by
cylinder i am doing a piping where i am
using the summarize function and
summarizing will then be done based on
the mean values of the gear column or
the horsepower
so let's do this
and then you can basically look at the
value at any point you may want to look
at the data set again so just go ahead
and you can look at what does the value
contain
and
by cylinder or by gear and do a head and
it gives you the value
so
you can always do some summarizing or
grouping in these ways
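Grouping and then summarizing with the pipe can be sketched as follows, assuming dplyr is installed. The summary column names and the result names gear_summary and cyl_summary are hypothetical; mtcars ships with R.

```r
library(dplyr)

# Group the built-in mtcars data by the gear column
by_gear <- mtcars %>% group_by(gear)

# One summary row per gear value
gear_summary <- by_gear %>%
  summarize(gear_sum = sum(gear), gear_mean = mean(gear))
gear_summary

# Group by cylinders and average two other columns
by_cyl <- mtcars %>% group_by(cyl)
cyl_summary <- by_cyl %>%
  summarize(mean_gear = mean(gear), mean_hp = mean(hp))
cyl_summary
```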
now here we are going to use sample
underscore n function and sample
underscore
frac for creating samples
so for this
let's take the flights data set again
and we would want to
get 15 random values now that is done
and it shows me 15 rows with some random
values from the data what you can also
do is you can do a portion of data by
using sample underscore
frac and here i'll say flights i'll
say 0.4 which will return 40 percent of
the total data so this can be useful
when you are building your machine
learning where you would want to split
your data into training and test might
be you are interested in some portion of
the data so you can do this
which is very useful function and then
you can look at the value of that now
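Random sampling can be sketched like this, assuming dplyr and nycflights13 are installed. The set.seed call and the names random_rows and random_part are additions for illustration. Recent dplyr versions document slice_sample as the successor to both functions, though sample_n and sample_frac still work.

```r
library(dplyr)
library(nycflights13)

set.seed(42)   # assumption: fixed seed so the sample can be reproduced

# 15 randomly chosen rows
random_rows <- sample_n(flights, 15)
random_rows

# Roughly 40 percent of all rows
random_part <- sample_frac(flights, 0.4)
nrow(random_part)
```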
what we can also do is we can use
arrange function so like we were doing a
grouping by or we were trying to pull
out a particular column so in the same
way we can use arrange which is a more
convenient way of sorting than your base
r sorting so for arrange function
let's do a view
based on arrange so we will work on the
flights data set which we have
and here what we would want to do is we
would want to arrange the flights data
set
which is based on year and departure
time and we are doing a view out of it
so that basically
gives me the data which is arranged
based on
your year and departure time now i can
do a head to give me some highlighting
of that data
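Sorting with arrange can be sketched as follows, assuming dplyr and nycflights13 are installed. The name sorted_flights is hypothetical; rows with a missing dep_time are placed at the end.

```r
library(dplyr)
library(nycflights13)

# Sort by year first, then by departure time within each year
sorted_flights <- arrange(flights, year, dep_time)

head(sorted_flights)
# In RStudio, View(sorted_flights) shows the full sorted table
```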
now
the piping operator what we are using
can be used in these ways also so here i
will say df i will just assign the data
set mtcars to it let's look at the
df which has basically your different
models you can obviously
look at the head or view of it to look
at useful information we can also go for
nesting options which can be useful
so we are
creating a variable called result here
now that has the arrange function
so what does this arrange function do so
when we would want to use arrange to
sort the data so i would want to sort
the data but what data would i sort so i
will use sample n
which will give me some portion of the
data or some sample data now what is
that sample data so here we are using
nesting that is
earlier when we did a sample we just
said data and how many random samples we
want but instead of giving that what we
are going to do is we are going to use
filter here
now this filter will work on df
so filtering will happen based on the
mileage which is greater than 20
i will say size is 5 and i would want to
basically arrange this in a descending
order so i'm using the desc
on this particular mileage column by
default it is always ascending
so let's get the result out of this
which will basically show me the mileage
details in a descending order so this is
my data frame and now
we can look at the result what we have
created
so just do a view or do a head
and look at the view so here you see
mileage
where the highest value is on the top
and we were only interested in five
values in a random sample so that's why
when you did a view it shows your five
values
and it shows in a descending order based
on mileage so we have
not only used an inbuilt function
we have not only arranged the data that
is we have sorted the data but we have
sorted the data based on a descending
order on a particular column we have
said the value should be greater than 20
and we have also said we just need five
random samples
now let's look at some other examples so
you can always do a multi assignment
so i can say filter wherein i am going
to use
df which was assigned mtcars i am
going to say mileage should be greater
than 20
then i say b which is going to get a
sample out of a
and i just want 5 random values so let's
look at that so we have b which is
going to get a
set of 5 values from a
now i will create a result variable
which will arrange b which is sample
data in a descending order now let's
look at the result of this and that
basically shows me what we were seeing
earlier so you can do a multi assignment
where you can create a variable get a
sample out of it and then basically
whatever is that result you can arrange
that or sort that in a descending or by
default ascending order
so same thing we can do it using pipe
operator
so piping so here i will say result
i'm passing in my df that's the data set
i'm using piping and which basically
tells what you need to do on this
particular data set so i'm going to
filter out the data based on mileage 50
sorry mileage 20 then i'm going to push
that
or forward it to
get the random sample and whatever is
this random sample is going to be pushed
so you are arranging this in a
descending order so this is one more way
of doing it and then basically you can
look at the result so these are some
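The three styles described above, nesting, multiple assignment and piping, can be put side by side in a sketch like this one, assuming dplyr is installed. The names result_nested, result_multi and result_pipe are hypothetical, and each sample is drawn independently, so the three results will usually contain different rows.

```r
library(dplyr)

df <- mtcars

# 1. Nesting: read from the inside out (filter, then sample, then arrange)
result_nested <- arrange(sample_n(filter(df, mpg > 20), size = 5), desc(mpg))

# 2. Multiple assignment: one intermediate variable per step
a <- filter(df, mpg > 20)
b <- sample_n(a, size = 5)
result_multi <- arrange(b, desc(mpg))

# 3. Piping: each step feeds its result into the next
result_pipe <- df %>%
  filter(mpg > 20) %>%
  sample_n(size = 5) %>%
  arrange(desc(mpg))

result_pipe
```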
simple examples where you can use
dplyr with multiple assignments or using
your nesting to filter out the data
you can also do a
arrange which is to sort the data you
can get some random samples out of it
you can summarize the data
you can also
summarize the data based on one or two
or multiple columns and you can use some
inbuilt functions to summarize the data
based on some
functions which are applied on the
variables or on the columns
you can transmute it
where you would be interested in only
looking at one column
you can mutate it where you want to add
a new column
you can slice it
and you can give conditions where
you can say and or or to filter out the
data
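The verbs listed above can be sketched on mtcars as follows. This is only an illustration: the derived column hp_per_wt and the thresholds are hypothetical choices, not values from the session.

```r
library(dplyr)

df <- mtcars

# summarise by a grouping column using built-in functions
by_cyl <- df %>% group_by(cyl) %>% summarise(mean_mpg = mean(mpg), cars = n())

# transmute keeps only the newly computed column
only_ratio <- df %>% transmute(hp_per_wt = hp / wt)

# mutate adds the new column to the existing ones
with_ratio <- df %>% mutate(hp_per_wt = hp / wt)

# slice picks rows by position
first_rows <- df %>% slice(1:5)

# filter with "and" (&) and with "or" (|)
both   <- df %>% filter(mpg > 20 & cyl == 4)
either <- df %>% filter(mpg > 30 | hp > 200)
```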
so what we can also do is on this
particular data set which we have say
for example df
where i have my data let's look at this
one and if i just do a df at this point
it shows me my data set and if you would
be interested only in particular column
then dplyr also allows you to
either we can do a filter or we can
simply do a select
now for selecting we can choose
our data so for example i'll say df
underscore i'm interested in mileage i'm
interested in horsepower
might be i am interested
in
your cylinders in this
and for this one what i can do is when i
would want to do a select
i can basically say
selected
df let's call it some name
i can say
control shift m
which is for piping
and then basically what you can do is
you can do a select
and you can choose your columns so i was
interested in mileage i was interested
in
horsepower
i was interested in cylinder and here
what i'm doing is i'm using a select
where i can look at the new data frame
so let's do this
and
i'm sorry here we will have to give it
df
this is where
you are passing in your data
yeah now this one is done and we can
look at the value of this one by just
doing a df
or
head
on df
underscore
mileage horsepower cylinder and look at
the selected result so you can be
looking at selective columns i could
have done this filter but filter will
always look for
a condition
say your mileage is greater than 20 or
might be your cylinders are more than 4
or something else but when you do a
select you are selecting specific
columns
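The difference between select and filter can be sketched like this, assuming df is the built-in mtcars data and that mileage, horsepower and cylinders are its mpg, hp and cyl columns. The variable name df_mpg_hp_cyl is a guess at the name spoken in the session.

```r
library(dplyr)

df <- mtcars

# select picks columns by name
df_mpg_hp_cyl <- df %>% select(mpg, hp, cyl)
head(df_mpg_hp_cyl)

# filter picks rows that satisfy a condition and keeps every column
high_mileage <- df %>% filter(mpg > 20)
```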
so view always gives you all the columns
head gives you the first few rows but then select
can be useful when we are interested in
looking at only specific data so this is
how you can use dplyr for
manipulation
for your data transformation for
basically filtering out the data by
selecting particular data and then
working on it so similarly there is one
more package called tidyr and we'll see
how data manipulation can be
done using the tidyr package
let's
learn about the tidyr package it makes it
easy to tidy your data
and this basically helps you create
cleaner data
which is easy to visualize and model now
this comes with mainly four functions so
you have gather which makes
wide data longer
so that is basically used to
stack up multiple columns you have the
spread function which makes long data
wider that is unstacking the data
so if you would want to unstack
data which has the
same attributes then spread can
spread the data across multiple columns
you have separate which is function
which splits single column into multiple
columns
and to complement that you have one more
function which is unite and that
combines multiple columns into single
columns so these are four main functions
which are used in the tidyr package so
let's look how we work with this
so let me bring up my r studio here now
for this
first is let me just clean up my screen
here doing a control l so i will install
the package it is already installed but
we can just do a control enter
and then it asks do you want to
restart r prior to
install i'll say okay
and it is basically going to get the
package
now it says package
tidyr has been
successfully unpacked let's use that
package
using our library function
and that was built under r version 3.6
now i can basically
start using these functions so for
example here we are creating a data
frame so let's say n is 10
and then we basically would say
we will call it wide
now that's the variable name i'm using
the data.frame function
i'm saying id which will be
1 to n so that will take the values from
1 to 10 and then these are the values
which have
10 entries so these are the vectors face dot one
face dot two face dot three let's create a
data frame out of it now that's done we
can have a look at our data frame by
just doing a view wide and that shows me
the id column and it has face dot one
face dot two and face dot three now we
can use our function so for example we
can work with gather that is reshaping
the data from wide format to long format
and basically you can say stacking up
multiple columns
so let's see how we do that here i'll
call it long i'm working on wide i'm
using the piping
functionality and then i'm using gather
so this one i will say what will be the
data which i will use
so we are using wide as a data frame
then i am saying response time so that
will be basically one more column and
then you have your columns which you
would want to
basically stack so i'm saying from face
dot one to face dot three so let's do this
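A minimal sketch of this step is shown below. The numbers in wide are made up for illustration, and the lowercase column names id, face.1 to face.3 and response_time are assumptions about the spelling used in the session. gather still works in current tidyr, although pivot_longer is the newer interface.

```r
library(dplyr)   # loaded for the pipe operator
library(tidyr)

n <- 10
wide <- data.frame(
  id     = 1:n,
  face.1 = seq(410, 500, by = 10),   # illustrative values only
  face.2 = seq(520, 610, by = 10),
  face.3 = seq(630, 720, by = 10)
)

# stack face.1 to face.3 into a key column (face) and a value column (response_time)
long <- wide %>%
  gather(key = "face", value = "response_time", face.1:face.3)
head(long)
```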
and once this is done let's have a look
at our variable long so this one shows
me that i have an id column
i have the response time column and i
have the face column which we mentioned
and that basically has all the values
stacked in so you have face dot one face
dot two and face dot three so all the
columns are being stacked here
so now i have 30 entries in total in
this one so this is basically using your
gather function now sometimes we may
want to
use
a separate function now separate
function is basically splitting a single
column
into multiple columns so which we
would want to use when multiple
variables are captured in a single
variable column okay so let's look at an
example of this one so let's say long
separate that's what we will call we
will work on this long which has all the
data stacked in
as the columns we selected then i'm
saying separate i want the face column
and then i would say
when i separate the columns what are my
column names now i could also give a
separator by giving a comma and then
mentioning the separator if that is
required so let's do this
now once this is done let's have a look
at our long separate so what we see here
is the
column which we used so we were doing a
face column and that was to be split and
we wanted to split it into target and
number so that's what we see here so you
have face being split into target and
number and then you have the response
time so this is how you use the separate
function now there is also something
called the unite function which is
basically the complement of the separate
function so it takes multiple columns
and combines the elements to a single
column so for example here
we will
call it long unite
and we will take long separate which was
separating the data we want to unite so
we will take face target
number
and we want to have a separator between
them so let's basically do this
and now let's look at the result of this
unite
so you see you have the target and number
merged together so you have face dot one
the separator is dot as we have
mentioned and we have united multiple
columns
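Separate and unite can be sketched together on a small long-format data frame. The values and the lowercase column names here are illustrative assumptions, not the session's data.

```r
library(dplyr)   # loaded for the pipe operator
library(tidyr)

long <- data.frame(
  id            = rep(1:2, times = 3),
  face          = rep(c("face.1", "face.2", "face.3"), each = 2),
  response_time = c(410, 420, 520, 530, 630, 640),   # illustrative values
  stringsAsFactors = FALSE
)

# split "face.1" into target = "face" and number = "1"
# (the default separator is any non-alphanumeric character)
long_separate <- long %>% separate(face, into = c("target", "number"))

# glue target and number back into a single face column with "." between them
long_unite <- long_separate %>% unite(face, target, number, sep = ".")
```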
so this is one more function of
tidyr which helps you
basically
tidy up your data or put it in a
particular way
now then you have your spread function
and this is basically for unstacking so
if you would want to
unstack data which has the
same attributes spread can be used so
that you can spread the data across
multiple columns
so it will take two columns say key and
value and spread it into multiple
columns so it makes long data wider so
we can look at this one we will say long
unite
i'm using the piping i will use the
spread function i'll work on the face
column and response time and let's do
this and then let's do a view on this
so it tells me our data is back in the
shape as it was in the beginning so
these are four functions
which are very helpful when we work with
the tidyr package
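A short round trip shows how spread undoes gather. The data values are invented for illustration, and the column names are assumptions. In newer tidyr code pivot_wider plays the role of spread.

```r
library(tidyr)

wide <- data.frame(
  id     = 1:3,
  face.1 = c(410, 420, 430),   # illustrative values only
  face.2 = c(520, 530, 540),
  face.3 = c(630, 640, 650)
)

# wide to long: one row per id and face
long <- gather(wide, key = "face", value = "response_time", face.1:face.3)

# long back to wide: one column per distinct value of face
back_to_wide <- spread(long, key = face, value = response_time)
back_to_wide
```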
so let's learn about
visualization and here we will learn
about
r which can be used for your
visualization now
one thing which we need to understand is
because of our ability to see patterns
which is highly developed we
can understand the data better if we can
visualize it
so an efficient and effective way to
understand what is in our data or to
show what we have understood in our data
is to use graphical displays that is
your data visualization so there are
actually two types of data
visualizations so you have exploratory
data visualization which helps us to
understand the data and then you have
explanatory visualization which helps us
to share our understanding with others
so when you talk about r
r provides
various tools and packages to create
data visualizations
and which can be used for both kind of
data analysis or both kind of
visualizations
so when you talk about exploratory data
visualization the key is to keep all
the potentially relevant details
together
now the objective when we talk about
exploratory data analysis is to
help you see what is in your data
and the main question is how much
details can
we interpret
now when you talk about different
functions which we see here such as plot
which is more for a generic
plotting you have bar plot which is used
to plot data using rectangular bars or
you can say creating bar charts you have
histogram or hist function to create
histograms where you look at the
frequency
of
the data and which are basically used to look at
the central tendency of the data you
have box plot which is used to represent
data in the form of quartiles you have
ggplot2 which is a package which enables
the user to create sophisticated
visualizations with little code
using the grammar of graphics
and then you have plotly or plot ly it
creates interactive
web-based graphs via the open source
javascript graphing library now before
we see some examples here let's also
talk about
when you talk about plotting let's also
try to understand what kind of
plots you can have and what kind of
techniques you have so let me open up my
r studio here
now for example i can pull out a
particular data set
and let's look at this one
so here i can look at
all the panes and that shows me the
information now what i can do is
i can load
the inbuilt data sets and then i
can simply do a plot
wherein i am doing a plot on the ChickWeight data
set so let's see what does that show it
summarizes the relationship between four
variables in the ChickWeight data frame
which is
in r's built-in datasets package now
from these plots we can see for example
weight varies systematically over time
you can also see that chicks were
assigned to four different diets
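The exploratory plot described here needs very little code. A minimal sketch, assuming only base R and the built-in datasets package:

```r
library(datasets)   # attached by default in a normal R session

str(ChickWeight)    # four variables: weight, Time, Chick, Diet

# plotting a whole data frame draws a scatter plot for every pair of variables
plot(ChickWeight)
```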
now when we talk about explanatory data
analysis
or visualization that shows others what
we found in the data this means we need
to make some editorial decisions what
features we would want to highlight for
emphasis
what features are distracting or
confusing and you want them to be
eliminated
right so there are different ways of
doing it now when you talk about your
graphics or visualizations you have
i would say
three different types or you can say
four so you have the base graphics which
is easiest to learn now here we are
having an example of base graphics where
i can use the base graphics
i can get a
data set using library
then i can simply use the plot
function to
generate a simple scatter plot of
calories against sugars
from the UScereal data frame in the MASS
package
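A sketch of that base graphics example follows. The axis labels and the title text are placeholders, since the transcript does not give the exact wording used on screen.

```r
library(MASS)   # provides the UScereal data frame

# scatter plot of calories against sugars
plot(UScereal$sugars, UScereal$calories,
     xlab = "sugars", ylab = "calories")

# add a title to the existing plot (hypothetical wording)
title("calories vs sugars")
```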
and then i can give it a title so this
is basically a simple example of base
graphics now you also have what we call
as grid graphics which is powerful set
of modules for building other tools
now you also have lattice graphics which
is a general purpose system based on grid
graphics and then you have your ggplot2
which implements the grammar of graphics
and is based on grid graphics so you
have different ways now here since i
already have used library and i have the
data set i can just do a x so i can
assign the
sugar related values to x and calories
related value to y
then i can use one more which is library
function and calling in grid now i can
basically use functions such as
pushViewport if i would want to create a
plot using your grid graphics to create
the similar kind of plot which we
created using base graphics but this
will give you much more power than base
graphics
it will have a steep learning curve but
it is usually useful so i can do this
where i'm saying pushViewport
then i can basically say i would want to
have a data viewport
i would say different functions of your
grid package so i'm saying rectangle you
have x axis y axis given some points
here
and then basically you can add details
to the graph by giving the names to the
columns
and you can basically create a simple
grid graphics based plot here
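The grid version of the same scatter plot might look like the sketch below. It follows the steps named in the narration (viewports, rectangle, axes, points, labels). The label positions are assumptions chosen to sit in the plot margins.

```r
library(MASS)   # UScereal data
library(grid)

x <- UScereal$sugars
y <- UScereal$calories

grid.newpage()
pushViewport(plotViewport())        # viewport with margins for axes and labels
pushViewport(dataViewport(x, y))    # viewport scaled to the range of the data

grid.rect()                         # frame around the plotting region
grid.xaxis()
grid.yaxis()
grid.points(x, y)

# axis labels placed in the margins (positions are assumptions)
grid.text("UScereal$calories", x = unit(-3, "lines"), rot = 90)
grid.text("UScereal$sugars", y = unit(-3, "lines"))

popViewport(2)                      # leave both viewports again
```

Every element is drawn by its own call, which is where the extra control over base graphics comes from.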
now there are different other options
which we can use to create plots now
before we go into understanding how you
create plots let me just give you a
brief on
what are the different kind of plots and
how they can be used so here we will
look at these different plots now for
example
we have a bar chart which is a graph
which shows comparisons across
discrete categories
so you have x axis which will show the
categories being compared and y axis
which represents a measured value
and the heights of the bars are proportional
to measured values
now
to create different kind of charts you
can use ggplot which is a package for
creating graphs in r
it is basically a method of thinking
about and decomposing complex graphs
into logical subunits and that is a part
of the tidyverse ecosystem so it takes each
component of a graph axes you can give
scales you can give colors you can give
the objects and you can build graphs on
particular data you can modify each of
those components in a way that's more
flexible and user friendly you can if
you are not providing details for the
components then ggplot will use sensible
defaults
and this basically makes it a powerful
and flexible tool now here
are
different options when you use your
ggplot such as you can use geoms or what
we call geometry objects
to form the basis of different types of
graphs you have geom_bar for bar charts
geom_line for line graphs geom_point for
scatter plots geom_boxplot for box
plots geom_quantile for
continuous x geom_violin for richer display
of distributions and geom_jitter for small
data so here is a simple example
where i would not go into too many
details here but you can just have a
look at this one where we are
using library function to get the
ggplot2 package
then basically we would want to look
into the mpg data we would want to
look at the structure of it
and then we can basically get the tidyverse
package finally we can create a
bar chart
using geom underscore bar
and we can basically also mention what
would be in x-axis now you can also give
different colors to basically add more
meaning to your data
you could also go for stacked bar charts
so here we are actually telling ggplot
to map the data in the drive column to
the fill aesthetic so here i am giving
the x aesthetic as class
and i am saying what is the data we need
to have and then we are using geom
underscore bar
so you can also have dodged bars
in your ggplot that is bar charts
which are not stacked but next to each other
and you can create that by using
your position as position underscore
dodge okay now you can obviously use
your different packages which are
inbuilt and you can create your bar
charts
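The three bar chart variants can be sketched with the mpg data set that ships with ggplot2. Here class goes on the x axis and the drv column (spoken as "drive") is mapped to fill. The variable names p_bar, p_stacked and p_dodged are placeholders.

```r
library(ggplot2)

str(mpg)   # look at the structure of the data first

# simple bar chart: number of cars in each class
p_bar <- ggplot(mpg, aes(x = class)) + geom_bar()

# stacked bar chart: each bar is split by drive type
p_stacked <- ggplot(mpg, aes(x = class, fill = drv)) + geom_bar()

# dodged bar chart: drive types side by side instead of stacked
p_dodged <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = position_dodge())

p_bar
p_stacked
p_dodged
```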
and you have other kind of graphs such
as line graph which is basically a type
of graph that displays information
as a series of data points connected by
straight line segments such as this one
and for this one we are using if you see
geom underscore line
now you can also create a scatter plot
which is a two dimensional
data visualization that uses points
to graph the values of two different
variables one on the x axis one on the y axis
like what we saw in base graphics
example
and they are mainly used if you would
want to assess the relationship or lack
of relationship between two variables
and you also have histogram which i
mentioned is mainly to look at the
distribution of a data to look at the
central tendency of the data
basically looking at
your
large amount of data for a single
variable you would be interested in
saying where is
more data found in terms of frequency
where is less data found in the graph
how close the data is towards its
mid point or what we call as mean median
mode
so you can use histogram where you can
categorize the data in what we call as
bins so these are some basics on
different kind of graphs now we can look
at some examples and see how that works
so what we were seeing is some quick
examples of base graphics or grid
graphics now here
let's do
an example of pie chart for different
products and units sold so you want to
create a graph for this first let's
create a vector and pass in the value
here
now i can also create labels which i
would want to assign to these values
and then basically i can plot the chart
by saying pie so that's the kind of chart
which i would want to create
and i would say the data would be x
and labels
so let's do this and that shows me a
simple pie chart now i can also give
main details here so instead of just
doing a pie x comma labels i can say what
is the main
and then what kind of coloring it should
follow so this is the way you can create
a simple
plot now i can also
find out what is the percentage
and
then basically
i would be interested in plotting the
pie chart which takes x
which takes the labels which will be the
percentage which we are calculating here
by doing a round function
and then you can basically give details
to your
graph you can say what color it follows
you can basically look at the legend
where it needs to be
in your chart
what are the values
and then basically fill up the colors so
let's run this one
and that shows me the percentage which
was calculated and it gives me the
details
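The pie chart steps can be sketched as below. The product names, unit counts, title and legend position are all hypothetical, since the transcript does not state the values that were typed in.

```r
# hypothetical units sold for four hypothetical products
x <- c(33, 45, 70, 110)
labels <- c("soap", "detergent", "oil", "shampoo")

# basic pie chart, then one with a title and colors
pie(x, labels)
pie(x, labels, main = "units sold by product", col = rainbow(length(x)))

# use percentages as the slice labels and move the names into a legend
pct <- round(100 * x / sum(x), 1)
pie(x, labels = pct, main = "units sold by product", col = rainbow(length(x)))
legend("topright", legend = labels, cex = 0.8, fill = rainbow(length(x)))
```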
and we can always have a look at our
plot now if you would want to go for a
3d pie chart then you can get the
package which is plotrix
let's use that by calling in the library
function let's pass in some data to x
and let's give some values or labels
which will make more meaning to the data
and then let's plot the 3d graph so i'm
saying pie3D here where i'm using x and
labels
then i'm basically doing an explode
which will basically control how your
graph looks like and basically give the
values so it also takes the title when
you say main and pie chart of countries
now let's create
data for graph so again we are having a
variable here we are using the c
function creating a vector
and then let's create a histogram for
this one
where i would say x lab what would be
your data around x axis what is the
color what is the border and here i am
creating a simple histogram
which as i discussed earlier will always
show
your values on the x axis and y axis is
more of frequency and then you can look
at the set of values and what is their
frequency
and we can basically use this histogram
for exploratory data analysis look at
the data try to understand what is the
central tendency of your data values
now we can also give some limits by
using xlim and ylim and then i can
also specify what is the limit so we
have given some values here wherein we
have said your x limit is 0 to 40
and y limit is 0 to 5. now if you
compare this with the previous one which
we had created
this one
based on the frequency had taken the
limits but we can assign limits
explicitly by giving this and then
create a histogram which makes more
meaning
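A sketch of the two histograms is shown below. The vector v and the axis label are made up for illustration. Only the limits 0 to 40 and 0 to 5 come from the narration.

```r
# hypothetical data for the histogram
v <- c(9, 13, 21, 8, 36, 22, 12, 38, 31, 33, 19)

# default histogram: r picks the axis limits from the data
hist(v, xlab = "weight", col = "green", border = "red")

# same histogram with explicit axis limits
hist(v, xlab = "weight", col = "green", border = "red",
     xlim = c(0, 40), ylim = c(0, 5))
```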
now let's take
another data set that is air quality
let's view this to see what does that
data contain so you have
ozone solar wind temperature month and
the day so this is the kind of
information we have in the air quality
now let's use the plot function to draw
a scatter plot where as i mentioned you
would be interested in analyzing
variables and see
what is the relationship between them so
to plot a graph between ozone and wind
values
so we will say plot we will say the data
which is air quality from that i would
be interested in the ozone column or
ozone field and the wind field i can
create a plot based on this
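A minimal sketch of this scatter plot, using the built-in airquality data set. The color is an arbitrary choice.

```r
# columns available in the data set
names(airquality)

# scatter plot of ozone against wind; rows with missing ozone are skipped
plot(airquality$Ozone, airquality$Wind, col = "blue")
```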
now i can also be saying what should be
the color what is the type of the data
which you would want to create and you
can look at the information so you
can create a histogram you can create a
scatter plot to basically understand the
data better and then infer some
information from that data so let's take
the air quality data set itself without
specifying any particular column and you
can create a plot which shows me all the
different values which you have in the
data and it basically shows you the
difference this is more of an example
like what we did for chickweight where
we did a base graphics now you can
assign labels to the plot so that is
when you are creating a plot you can say
air quality you will say ozone
and then x lab which is your ozone concentration
you have your y lab which is the number
of instances
you have what is the title ozone levels
in new york city what is the color so
these are the details what we have given
with our plot function and let's look at
the data so it just tells me that this
is the ozone concentration
uh the number of instances what you have
and you are looking at the data now we
could also create a histogram by picking
up a particular column such as
solar
from your air quality and that basically
shows me the frequency
of solar values and we can then try to
find out what is the mid
what is the mean what is the standard
deviation and so on you can also look at
your histogram and try to understand if
it is left skewed or right skewed so we
can do that
now here let's get the temperature out
from this particular data set
let's create a histogram on temperature
and that basically shows me the
frequency of the temperature values
and
what values have the most frequency or
most occurrence
now you can create a histogram
with
labels
so let's do that with the limit and then
let's also use text to basically add
the values as labels it takes the values
and for each set of frequency or each
set of values it gives me the labels
now you can have a histogram with
non-uniform width so you could do that
by doing a hist function
and then
passing in your temperature you can say
what will be the main what is the title
what will be your x lab it will tell you
a limit around x axis what is the color
what is the border
what are the breaks you would want to
have
for your bars and you can
simply create a histogram using this so
this basically takes the breaks which we
have given
such as 55 to 60
60 to 70 70 to 75 and so on so this is
basically creating a histogram with
non-uniform width
and it purely depends on the kind of
values what you have
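A histogram with non-uniform bin widths can be sketched as follows. The first breaks (55, 60, 70, 75) follow the narration. The remaining breaks, the title, the label and the colors are assumptions.

```r
temp <- airquality$Temp

# unequal bin widths are set through the breaks argument;
# with unequal widths hist() plots density rather than counts by default
h <- hist(temp,
          main   = "temperature histogram",   # hypothetical title
          xlab   = "temperature",
          xlim   = c(50, 100),
          col    = "lightblue",
          border = "black",
          breaks = c(55, 60, 70, 75, 80, 100))

h$counts   # number of observations falling into each bin
```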
now you can also create a box plot which
sometimes helps us in understanding the
data quartiles also understanding
our outliers so you can create multiple
box plots based on the data from air
quality so we'll select all the data and
then we'll do some slicing on the data
so let's create a box plot which tells
me the values and if you look at these
points here
like single dots these are basically
your outliers
we can learn about that more in later
sections
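Multiple box plots from airquality can be sketched like this. Slicing to the first four columns is an assumption about which columns were kept in the session, and the title is a placeholder.

```r
# one box per measurement column; single dots beyond the whiskers are outliers
b <- boxplot(airquality[, 1:4], main = "multiple box plots")

b$out   # the values that were drawn as outliers
```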
so you can use
your ggplot2 library to analyze
a particular data set so for that we
will first
use the install dot packages and get
ggplot2
so it says do you want to restart r and
i can say yes so let it get the package
i think the package was already there
and now
let's look at
using ggplot2 so for that i have the
library function
and let's do a attach where i'm getting
the data set which is mtcars
now then i will create a variable p1 i
will use ggplot i will pass in my data
i'll give the aesthetics
what is the columns which you would be
interested in
and then you are using geom underscore
box plot to basically create a plot
which gives me the box plot for the
values here and this is based on
the cylinders which is there in your
data
so we can always look at what does our
data contain
and what kind of values or features are
available in the data now let's create a
box plot we will also use the coord flip
function and that basically gives me
based on the data so i have changed the
coordinates now if you
look at the previous one where we
created a plot we had mileage on the y
axis and cylinders
on the x-axis
now i did a coordinate flip and that's
like your transpose function so you have
created the box plot but you have just
flipped the coordinates you can create a
box plot and then say fill
which is the factor of cylinder so that
can be used to fill up the values in
your box plot
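The three box plot variants can be sketched with mtcars, with mileage (mpg) on the y axis and cylinders (cyl) on the x axis. The name p1 follows the narration. p2 and p3 are placeholder names.

```r
library(ggplot2)

# box plot of mileage for each number of cylinders
p1 <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + geom_boxplot()

# same plot with the axes swapped
p2 <- p1 + coord_flip()

# fill each box according to the cylinder factor
p3 <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_boxplot()

p1
p2
p3
```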
now what we can also do is
we can create factors so we have learnt
about factors earlier which is usually
used to work on categorical variables
so here let's create a factor
which is mtcars gear you have am you
have cylinder
and if you look at the factors which we
have created we have passed our data
what is the field or the column we are
interested in
what is the level of values there and
what are the labels for those values
right so we have learnt about factors
you can always look into the previous
section and learn more about factors
now let's create a scatter plot
by using the ggplot function again we
will use the data as mtcars i will
go for mapping option and then i will
give my aesthetics that is what would be
x what would be your y
and you also would want to use what kind
of
function you are using so let's go for
geom point and that basically helps
me in creating a scatter plot now you
can create a scatter plot by factors
so here we will say gg plot
so notice in all of these cases
depending on the kind of data you have
depending on the kind of plot you are
interested in you will use the ggplot
and then basically a function with that
or the inbuilt package so here i'm
saying data is mtcars i am going for
mapping which basically will take the
values for your x and y
what is the color
and the coloring will be done based on
the factor values now if you remember
factors will obviously have some levels
and
those levels will basically help you in
differentiating between your categorical
variables so i'm saying as dot factor on
cylinder and then i'm using geom point
to basically create this scatter plot so
let's do this
and
i can
look at the values of this one so it
says
there is an error which says
must request at least one colour from a hue
palette so let's look at that one so the
error which we were facing when we gave
color as the factor values was because
when you look at these factors which
were created with some labels if we look
at the values of these it tells me there
are n a values in that particular column
similarly your gear
or similarly you can completely look at
the complete data set it tells me
cylinder you have am you have gear now
these have some
we have created some labels but these
have n a values
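One common way a labelled factor ends up full of NA values is applying factor() with the original numeric levels to a column that has already been converted. The sketch below shows that mechanism. It is an assumption about the cause and not necessarily what happened in the session, and the label strings are hypothetical.

```r
library(ggplot2)

# first conversion: numeric cylinder counts become labelled levels
cyl_ok <- factor(mtcars$cyl, levels = c(4, 6, 8),
                 labels = c("4cyl", "6cyl", "8cyl"))
anyNA(cyl_ok)

# converting again with the same numeric levels no longer matches anything,
# because the values are now "4cyl", "6cyl", "8cyl"
cyl_again <- factor(cyl_ok, levels = c(4, 6, 8),
                    labels = c("4cyl", "6cyl", "8cyl"))
all(is.na(cyl_again))

# coloring by a factor built from the untouched column avoids the problem
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = as.factor(cyl))) +
  geom_point()
p
```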
so what we can do is we can create a
scatter plot as we did earlier by giving
the aesthetics and that's a simple
scatter plot
wherein i'm also using geom point so
that i can have these points by defaults
or with defaults
you can also
give a specific color if you
would want to have different kinds of
data in the same plot or i can
create scatter plots by different sizes
by giving a size or
i can give a color and size and that's
again one way in which you can create
your scatter plots now let's also see
how you can visualize
one more data set which is mpg
so i can also do it in this way where i
say ggplot2 colon colon mpg
and then look at the data set
what we have here
you can just do a view on this to see
what my data contains if the fields have
any n a values if that's going to affect
your plotting so now what we can do is
we can create a bar plot or a bar chart
so i am saying gg plot the data would be
as we have given in previous lines that
is ggplot2 colon colon mpg then i will say what
should be in my aesthetics and what kind
of
chart are you going to create so i'm
saying geom underscore bar so that's my
bar chart and that has basically your
class and count now you can create a
stacked bar chart where your information
is stacked in the same bars
and we are still using the same data
we are going for aesthetics which is
class and then when you say geom bar
which creates your stacked bar we will use
fill
which is drive and we can always go back
and look at our data for example
you can always look into this so you
have the drive column here
and you are also working on this
complete data set so let's go ahead and
create a stacked bar chart and that
basically gives me the information where
you have the drive information which is
stacked
here now you can do a dodge
by giving the position as dodge
so we are still going to fill by drive
but this time the bars will be
next to each other and that can also be
done which is very useful
you can also use geom point
where you are mapping and you are
specifying what are your aesthetics so
we were creating a scatter plot
now you can also use
or give more details where you can say
color can be based on the class
and we have different classes and based
on that my points have been colored
now you can also use a plot
ly or plotly library so let's install
this one
i will say yes for example let it
basically restart so that all my
packages are updated
then i can access that package using
library function
and then
create a variable
to which you are assigning your plot
underscore ly plot so data is mtcars
what will be your x-axis what will be
your y-axis and details on your marker
which we have given
wherein i will give a list which is size
color which is a combination
and then you have your line
what kind of color it will have and what
will be the width so this is where i'm
going to use plot ly
and let's look at this plot
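A plot_ly call of the kind described might look like the sketch below. The choice of mpg and wt for the axes, the marker size, the colors and the line width are all assumptions, since the transcript only names the arguments.

```r
library(plotly)

p <- plot_ly(
  data = mtcars,
  x = ~mpg, y = ~wt,                 # assumed axis columns
  type = "scatter", mode = "markers",
  marker = list(
    size  = 10,
    color = "rgba(255, 182, 193, 0.9)",
    line  = list(color = "rgba(152, 0, 0, 0.8)", width = 2)
  )
)
p   # renders an interactive plot in the viewer pane or browser
```

Setting type and mode explicitly avoids the informational messages plotly prints when it has to guess them.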
so it basically gives me some
information now we see some warnings
which are getting generated but
you don't need to worry about that so
you can look at the packages what you
have
and what options you are using so
similarly we can create one more plot
using plot ly and look at the values of
those so that's a plot with a trend
which tells me about my data
so this is a simple small tutorial on
understanding or
how you can have your graphics or
visualization
used to understand your data obviously
there are many more examples and many more
options which you can pass into your
plot functions
or your gg plot
and the inbuilt
packages which are available in r for
your visualization now that could be for
exploratory data analysis or explanatory
data analysis so try these graphs and
see if
you can change these options and try or
create new visualizations
now
let's do a hands-on project to perform a
time series analysis using r programming
in this project
we'll be using time series energy data
to explore the variations in electricity
demand and renewable energy supply over
time
over to ajay now welcome to this session
where we will learn on time series
analysis using the r programming language
so this is basically a mini project
where we will look at time series data
and how we can analyze it visualize it
to basically find some
important information or gather insights
from the data now when you talk about
time series analysis time series is
basically any data set where your values
are measured
at different points in time
so when you talk about time series data
data is usually
uniformly spaced at a specific frequency
for example hourly weather measurements
you have daily counts of website visits
monthly sales total and so on so when
you talk about time series that can also
be irregularly spaced and sporadic for
example time stamped data in computer
systems event log or history of 911
emergency calls
now when we work with time series data
for example here i am taking a energy
data set we can see how techniques such
as time based indexing resampling
rolling windows can help us explore
variations in electricity demand and
renewable energy supply over time now
here we will look at some aspects of
the data set which i am considering
this is the open power system data
set and here is the data set i have we
can look at it now this is in
a simple format it has the date
it has values for consumption
and then you have data for wind and
solar and wind plus solar so in certain
rows you have only the date and the
consumption but then if we scroll down
we will also find
data for wind solar wind plus solar and
so on so this is a time series data set
which we would want to work on
sometimes the data which is
collected does not have just the
date but it may also have a
time stamp that is it would have say
hours minutes and seconds and that can
also be worked upon so let's consider
this data set and let's work on this
project where we will analyze this time
series data set
now here we can work on this time series
data we can basically create some data
structures out of it such as data frames
we can do some time based indexing we
can visualize the data we can look at
the seasonality in the data look at some
frequencies and also do some trend
detection
now when you talk about this data set it
has electricity production and
consumption which is reported as daily
totals in gigawatt hours
and here are the columns of the data
which i was just showing you so you have
date you have consumption you have wind
you have solar and wind plus solar so
this is the data we have and we will
explore how electricity
consumption and production in germany
have varied over time so some of
the questions which we can answer here
are when is electricity consumption
typically highest and lowest how do wind
and solar power production vary with
seasons of the year
what are the long-term trends in
electricity consumption solar power and
wind power how do wind and solar power
production compare with electricity
consumption and how has this ratio
changed over time
we can also do wrangling or cleaning of
this data or pre-processing of data and
create a data frame and then we can
visualize this
now let's see how do we do that so i
will open up my rstudio and let's look
at the data set so here is the data set
now i'm picking it up from my machine
you can also pick it up from github so
all the data sets or similar data sets
can be found in my github repository and
here
i can look in the data sets you will
find
a lot of different data sets here there
are some time series data sets such as
power
i can search for power or you have
basically
coal
or you have this
opsd
germany daily data set and there are
many other data sets which you can work
on
now to
get the documentation on this project
you can also look in my github
repository and you can search for
repositories
and then basically you can look in data
science and r
and here there is a project folder where
i have given the documentation sample
data set and also
your time series analysis related
document there is also the code which you
can directly import in your rstudio and
you can practice or work on this project
so let's see how that works
so first thing is we will create a data
frame
from this data set now here if you see i
am using header as true so that it
understands the heading of each column
i'm also giving row.names and i'm
specifying date so there is this date
column in the data set as i showed you
earlier let's look at it again so you
have date consumption wind solar wind
plus solar so you can specify that date
should become the index column which can
be useful so you can do this now let's
just
create this
and let's look at what this data frame
contains
and here if you see it shows me the
data which
is
now a part of this data frame
structure
it starts with consumption wind solar
wind plus solar and if you see this one
is becoming my index column so i can
always do a head and look at part of the
data frame using head or tail so look at
the first records so let's see this now
that shows me the head data i can also
do a tail and look at the
ending values so if you look closely at
the head we have wind
solar
and
wind.solar and those basically have na
values so there are missing values but
let's look at the tail and that tells me
that there is some data available for
wind and solar and wind.solar
now we can always look in a tabular
format using view
and we can look at the data so this
shows me that at the top of these
columns we see na values but if i
really scroll down
i can see
some values which are available for
wind and solar and wind.solar so i can
just use view now i can look at the
dimensions of this particular object
and that tells me there are
4384
rows and four columns you can always
look at the structure that is check the
data type of each column which can be
very useful so if i see here i don't see
the date column because the date column
was used as the index which can be
useful and if i look at my other
columns they are all of the num type so
that's the data type for each
attribute or each column here
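As a rough sketch of the loading step just described: the narration only names `header`, `row.names`, `head`, `tail`, `dim` and `str`, so the file path and the sample values below are assumptions. A tiny made-up CSV is written to a temp file so the code runs without the real data set.

```r
# Write a tiny illustrative CSV (values are placeholders, not the real data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Date,Consumption,Wind,Solar,Wind+Solar",
  "2006-01-01,1069.184,,,",
  "2006-01-02,1380.521,,,",
  "2006-01-03,1442.533,,,",
  "2017-12-30,1215.449,721.247,7.756,729.003",
  "2017-12-31,1107.115,721.176,19.980,741.156"
), csv_path)

# header = TRUE reads the first line as column names;
# row.names = "Date" uses the Date column as the row index
mydata <- read.csv(csv_path, header = TRUE, row.names = "Date")

head(mydata, 3)  # first rows: Wind, Solar and Wind.Solar are NA
tail(mydata, 2)  # last rows: values are present
dim(mydata)      # rows and columns (Date is no longer counted as a column)
str(mydata)      # data type of each column
```

Note that `read.csv` rewrites the header `Wind+Solar` as `Wind.Solar`, which is why the column shows up with a dot.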
now we would be interested in looking at
this date column so let's look at the
data type of this date column
now if i try to do this this will show
me that this is null because date as a
column does not exist because we used
it as the index so if i call row.names
on my data it shows me the index
column and it tells me these are the
values that's the date column
which we are seeing here now we can
access a specific row by just taking
mydata
and giving the index value or row name
value so let's look at that and that
shows me the row for this index
value
you can obviously search for a different
date
something like this you can also pass in
a vector and you can give
several values so that is the 1st of
january 2006 and the 4th of january and
we can look at this one so it shows me
these values so here actually
i'm not giving a range but i'm just
selecting multiple values from row.names
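A minimal sketch of that row-name lookup, using a small hand-made data frame in place of the real one (the values are placeholders):

```r
# Stand-in for the data frame whose dates were used as row names
mydata <- data.frame(
  Consumption = c(1069.184, 1380.521, 1442.533, 1457.217),
  Wind = c(NA_real_, NA_real_, NA_real_, NA_real_),
  row.names = c("2006-01-01", "2006-01-02", "2006-01-03", "2006-01-04")
)

mydata$Date        # NULL: the dates are the index, not a column
row.names(mydata)  # the dates, as character strings

# One row, looked up by its row name
mydata["2006-01-02", ]

# Two specific rows (a selection of values, not a range)
picked <- mydata[c("2006-01-01", "2006-01-04"), ]
picked
```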
now we already know that in r you have a
summary function so you can always do a
summary and that gives you
for each column it gives you minimum
first quartile median mean third
quartile and maximum values so we are
looking at consumption we are looking at
wind solar and wind dot solar
now this is good but then if i
want to really visualize the data access
the data and do some analysis then it
would be good to
keep all the columns and then we can
later decide to change the data type of
say the date column if we want to use it
so earlier i was using date as row.names
that is the name of the rows or the index
as you would call it
in other programming languages so
here i will just use my data set and
i'll say header is true i'm calling it
mydata2 let's look at the data and this
one shows me
five columns where in my first column is
the date
consumption wind solar and so on now
looking at the structure
so let's look at the data type
so now if
i'm interested in looking at the date
column from my mydata2 data frame it
tells me it is a factor with 4384
levels and these are the values
so
it is not in a date time format it's a
factor
now what we can do is we can convert
this into a date format how do we do
that so let's have a variable x and i'm
going to use the as.Date function and
i'm going to pass in my date column so
that's
assigned to x now let's look at the head
of x and it shows me the values we will
also see what kind of class it is
and we will look at the structure of x
so class already says it is date type
and look at the structure so it shows me
the format
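The conversion can be sketched like this, with a three-row stand-in for `mydata2`. The `Date` column is built as a factor on purpose to match what the narration shows; in R 4.0 and later `read.csv` returns character columns by default, and `as.Date` handles both.

```r
# Stand-in for mydata2 with Date stored as a factor (placeholder values)
mydata2 <- data.frame(
  Date = factor(c("2006-01-01", "2006-01-02", "2006-01-03")),
  Consumption = c(1069.184, 1380.521, 1442.533)
)
str(mydata2$Date)  # a factor, not a date

# ISO "YYYY-MM-DD" strings need no explicit format argument
x <- as.Date(mydata2$Date)

head(x)
class(x)  # "Date"
str(x)
```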
now we have converted this column into
date values in x now how do i
basically
extract values out of it or make it a
part of the data frame so
once it has been converted to
date format i will go for as.numeric
and here i will create a variable called
year and i will just do a format on x
which is of date type and then
i am saying
percent capital y so that will get me the
year component out of this let's look at
the values
that shows me the year component
now similarly we can get the month out
of this and then basically look at the
month values we can get the day out of
it and we can get the day component now
if i look at mydata2 which we had
created earlier this basically had date
consumption wind solar wind.solar so
what i can do is i can add these
extracted columns such as year month day
to my data frame using cbind that is
column bind and i will assign it to
mydata2 again so let's do this and now
if you look at head it shows me
date which is still not in date format
then you have
consumption wind solar and we have
the extracted year month and day which
can help us to group by we can do some
aggregations we can do plotting and we
can do various things with these
additional columns now let's look at the
first three rows here so i'll say 1:3
for mydata2 and that shows
me some data here you can always do
a head and look at a sample of the data
so that basically shows me the year
month
day
columns and then you have your date
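Putting the extraction and the column bind together, a small sketch could look like this. The three rows are placeholders; the variable names `year`, `month` and `day` follow the narration.

```r
# Stand-in for mydata2 (placeholder values)
mydata2 <- data.frame(
  Date = c("2006-01-01", "2006-02-15", "2017-12-31"),
  Consumption = c(1069.184, 1400.000, 1107.115)
)
x <- as.Date(mydata2$Date)

# format() returns text such as "2006"; as.numeric() turns it into a number
year  <- as.numeric(format(x, "%Y"))
month <- as.numeric(format(x, "%m"))
day   <- as.numeric(format(x, "%d"))

# Column-bind the new vectors onto the data frame
mydata2 <- cbind(mydata2, year, month, day)

head(mydata2)
mydata2[1:3, ]  # first three rows
```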
now what we can do is we would want to
visualize this data we would want to
basically understand the consumption now
as i said
if we want to visualize the data say for
example i want this which is consumption
of electricity over the years and this
one is in terms of gigawatt hours as we
were mentioning earlier so if i
want to create this visual to
understand the pattern of the
data
how do we do it so we can create a
line plot of the full time series of
germany's
electricity consumption using the plot
method now how do we do that so here
one of the option is i can straight away
use the plot method
i can then say what would be in my
x-axis what would be on my y-axis
what would be the type of
graph i would want to plot what is my
name on x-axis y-axis and this is the
simplest way so i'm saying my data 2 i'm
extracting the year column
and here i'm taking the consumption so
let's create a plot
and here if you see we are looking at a
plot we do see some tick marks and we
see that the axis has been divided into
steps of two years from 2006 onwards to
2016 but then really this plot does not
give me
a very useful way of looking
at the data or understanding it so
what i can do is i can use the same way
but i can give apart from x-axis and
y-axis i can say
the
limits that is x limit is 2006 to 2018
and y limit is from 800 to 1700 so we
can do this and let's look at this again
this is a plot but it really does not
help me in visualizing and understanding
the data so what are the better options
i can go for multiple plots in a window
as of now we are just sticking to one
plot per window so if you want to
have multiple plots you can always
change the values here and make them two
or three that will say how many rows and
how many columns so as of now we will
just keep it as it is with par and
mfrow now
if i would want to plot i can straight
away give the column name so i am
interested in getting the consumption
now i can just do a plot i'll say
mydata2 and i will choose the second
column which is consumption which we saw
here
from our data so consumption was the
second column so i can just do a plot
directly without mentioning
your x-axis y-axis limits and so on and
if you look at this this one is giving
me
a pattern now here i am looking at
an
x-axis and y-axis which are not really
named we do not have a title for this
graph
and we are looking at the data it does
show me some kind of pattern but maybe
we can make it more meaningful so i
can do it this way where i say mydata2
second column let's give the x axis label
as year and the y axis label as
consumption
now that has changed
the x-axis and y-axis now i can also
give some more details i can say type
should be line
i have the line width i'm saying color
is blue
and let's do this so this looks more
meaningful it shows a wavering
pattern of consumption over the years
i can also give a
limit of x that is 0 to 2018 and that
basically shows me the range now we can
change that and we can be more specific
and saying x limit should be 2006 to
2018
and let's look at this now this one once
you have given a proper limit it shows
the line graph and it shows what was the
consumption in 2006 and over a period
till 2018.
any of these options is fine it
depends on what you are presenting and
to whom or what kind of
analysis you are doing so i can do a
plot i can choose the second column xlab
which is the x axis label
ylab type as line line width x limit
y limit and then i'm giving a title to
this which is consumption graph
and then basically you are looking at
the line graph
now those are the options which you have
either you could
just
give
the column which you want to plot or
obviously make it more meaningful by
giving all the details
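The two ends of that spectrum can be sketched as follows. The data frame is synthetic (a made-up seasonal curve plus noise), and pairing the `year` column with the limits 2006 to 2018 and 800 to 1700 is one reading of the narration, not necessarily the exact call used in the session.

```r
# Synthetic stand-in for the daily consumption series (values are made up)
set.seed(1)
dates <- seq(as.Date("2006-01-01"), as.Date("2017-12-31"), by = "day")
doy <- as.numeric(format(dates, "%j"))  # day of year, 1 to 366
mydata2 <- data.frame(
  Date = dates,
  Consumption = 1300 + 150 * cos(2 * pi * doy / 365.25) +
    rnorm(length(dates), sd = 40)
)
mydata2$year <- as.numeric(format(dates, "%Y"))

par(mfrow = c(1, 1))  # one plot per window

# Simplest form: hand over one column; x is just the row position
plot(mydata2[, 2])

# Fully specified: axes, line type, width, colour, limits and title
plot(mydata2$year, mydata2$Consumption,
     type = "l", lwd = 2, col = "blue",
     xlab = "Year", ylab = "Consumption (GWh)",
     xlim = c(2006, 2018), ylim = c(800, 1700),
     main = "Consumption Graph")
```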
now what we can do is if we would want
to look at
this data and understand it better
rather than just looking at a simple
line i can take the log values so here
i'm saying log of
mydata2 second column so i'm taking
log values of consumption and i'm taking
the difference of logs so i can say
diff and then you can
scale this up or down by
multiplying it by some number so the rest
remains the same i'm changing the color
and let's look at this plot and you see
this is giving me a clearer
pattern here we see
the differences of the log values so this
is using the simple plot function
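A short sketch of the log-difference plot, on made-up data. The multiplier 10 and the colour are arbitrary choices for illustration; the narration only says "some number".

```r
# Made-up daily consumption values
set.seed(1)
mydata2 <- data.frame(
  Date = seq(as.Date("2006-01-01"), by = "day", length.out = 200),
  Consumption = 1300 + rnorm(200, sd = 50)
)

# diff(log(x)) is the day-to-day change in log consumption;
# the result has one element fewer than the input
log_diff <- 10 * diff(log(mydata2[, 2]))

plot(log_diff, type = "l", lwd = 2, col = "red",
     xlab = "Day", ylab = "10 x diff(log(Consumption))")
```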
in r you can also use ggplot now for
that we can install the ggplot2 package
it's already there on my machine so i'll
say no to reinstalling it and i will
load it using library ggplot2
and now i can use ggplot to plot so
the way you specify it here you can say
mydata2 that's the data frame
i'm saying type as o and when i'm saying
line
i am basically going to
use the x axis which is year y is
consumption and let's look at this plot
so again we are back to the one which we
were doing earlier which really does not
make much sense
it gives us some data but then really
does not give me enough information
in my aesthetics i can say x is year y
is consumption i can do a grouping and
then i can give line and plot
so again we have some information but
it really does not help me right
now let's look at another example so i'm
just doing the same thing here and i'm
setting the line type to dashed i'm
using ggplot's other methods such as
geom_line and geom_point to give me more
information and if i look at the
plot it does give me data it tells me
what the different values are it gives
me some kind of pattern but i would
still prefer the way we were doing it
with plot
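For reference, a ggplot2 version of the dashed line with points might look roughly like this. It assumes the ggplot2 package is installed (the block does nothing otherwise), and the data frame is a small made-up stand-in with a `year` column.

```r
# Made-up stand-in: a few consumption values per year
set.seed(1)
mydata2 <- data.frame(
  year = rep(2006:2017, each = 5),
  Consumption = 1300 + rnorm(60, sd = 100)
)

# Only run the plotting code if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(mydata2, aes(x = year, y = Consumption, group = 1)) +
    geom_line(linetype = "dashed") +  # dashed line through the values
    geom_point()                      # one point per observation
  print(p)
}
```

With several values per year the line zigzags within each year, which is the crowded look the narration is complaining about.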
now
we can change the color and obviously
add details to it so what we see is when
you use the plot method which i did
earlier it was choosing pretty good tick
locations that is every two years and
labels the years for the x-axis which
was helpful
right but with these data points which
we were seeing here
or say for example this one
or say this one
or say this one we are looking at some
data but then that
really is quite crowded
and it is hard to read you can look at
the values but then it really does not
give you enough information so we can go
for plot method but then we will see how
we can consider different data now if i
would want to plot the solar and wind
time series so let's see how do we do
that
so the wind column is what i'm interested
in so first thing is it is always good to
find out the minimum and the maximum
values in every column so i'm saying
minimum and let's put in here
mydata2
and then let's look at the values so we
are looking at the columns
we know consumption is the second column
wind is the third column
and
you have solar as the fourth and wind
plus solar is the fifth so let's find
out the minimum of each of these columns
which we would want to plot so let's say
minimum of mydata2 third column and here
i'm also saying remove the na values
because we do not want to consider the
na values so let's look at the
minimum that shows me 5.7757
what is the maximum value it is 826 so
that also helps me in giving a limit if i
want to plot wind on the y axis i can
give a y limit from 5 to 850
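A minimal sketch of finding those limits, on a four-row stand-in whose numbers are placeholders. The `sapply` call at the end is an addition that gets the ranges of all numeric columns in one go.

```r
# Stand-in with missing values in the early rows (placeholder numbers)
mydata2 <- data.frame(
  Date = c("2006-01-01", "2006-01-02", "2017-12-30", "2017-12-31"),
  Consumption = c(1069.184, 1380.521, 1215.449, 1107.115),
  Wind = c(NA, NA, 5.757, 826.278),
  Solar = c(NA, NA, 7.756, 19.980),
  Wind.Solar = c(NA, NA, 13.513, 846.258)
)

min(mydata2[, 3])  # NA, because the column contains missing values

# na.rm = TRUE drops the NA values before computing
wind_min <- min(mydata2[, 3], na.rm = TRUE)
wind_max <- max(mydata2[, 3], na.rm = TRUE)

# Minimum and maximum of every numeric column at once
limits <- sapply(mydata2[, 2:5], range, na.rm = TRUE)
limits
```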
consumption wise let's find out the
minimum from second column and maximum
and similarly for solar find the minimum
and maximum and wind plus solar minimum
and maximum so this will be helpful when
you would want to plot multiple graphs
or
give some limits so that's fine now for
multiple plots as i said
instead of having one plot let's plot
consumption and wind and solar and try
to see a pattern so i can say par
function and i will say three rows and
one column
so now when i start plotting you will
see you will have multiple plots in one
single window so let's see how we do it
so here
let's look at plot one so this one is
consumption as we did earlier
and let's look at the data so that gives
me some data you can always do a zoom
and you can look at the data you can
basically expand this graph or you can
reduce this graph to see
what kind of pattern we have in
consumption similarly we can basically
choose
date being
x axis
my consumption being y axis right so
this is being more specific because here
we have a range but it really does not
give me enough information so i will
basically give
x-axis y-axis i will give the name that
is daily totals and then i will
basically give consumption color and y
limit based on my minimum and maximum
limits so let's do this
and now we can
look at the data here so let's see this
data
makes a little more meaning because we
are looking at the dates
and let me do a zoom so it shows me all
the dates it shows me the data points it
shows me
how the data
pattern is changing for consumption
now
this is for consumption so what we can
do is we can also extract specific data
so if you see here i have done some
testing where i am saying okay i
want to
get
a date specifically
and extract some value so we
are looking at the date column but if
you remember we did not change the data
type of the date column in the data frame
we just extracted year month and day out
of it
it would be good if we can
convert that column into date format
and put that in our data frame now
let's look at plot 2
we want plots for
consumption and wind and
solar so here i have the solar data and
i can plot this one
to see how it looks
and that tells me from 2006
onwards we have some pattern
i can
be more specific where i
give date and then
the
column for solar the x-axis and y-axis
labels the type the y limit and
the color
it is always good to specify your x and
y-axis and give them names rather than
let them be picked up automatically now
this is more meaningful because it shows
me some dates
similarly we can do for wind
so either you do it just by giving the
column
or you give your x and y axis so let's
look at this one
and this shows me the data so we can
choose plot three
this one we can choose plot two
we can choose plot one and we can put
all that data in one graph
so that's when you are putting in multi
plots in one particular graph you can
always do a zoom
you can always look at the data right
and this is usually useful to look at
the pattern what kind of pattern we see
what data we have and so on now moving
forward so we have seen how you are
creating these plots all in one window
let me reset this back to one plot per
window
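The three stacked plots and the reset can be sketched like this. The series are synthetic (a made-up seasonal curve plus noise), and the colours and labels are illustrative choices.

```r
# Synthetic daily series: consumption and wind peak in winter, solar in summer
set.seed(1)
dates <- seq(as.Date("2015-01-01"), as.Date("2017-12-31"), by = "day")
doy <- as.numeric(format(dates, "%j"))
winter <- cos(2 * pi * doy / 365.25)  # about +1 in January, -1 in July
mydata2 <- data.frame(
  Date = dates,
  Consumption = 1300 + 150 * winter + rnorm(length(dates), sd = 40),
  Wind = pmax(5, 250 + 120 * winter + rnorm(length(dates), sd = 80)),
  Solar = pmax(0, 100 - 90 * winter + rnorm(length(dates), sd = 20))
)

# Three rows, one column: the next three plots share one window
old_par <- par(mfrow = c(3, 1))
plot(mydata2$Date, mydata2$Consumption, type = "l", col = "blue",
     xlab = "Date", ylab = "Consumption (GWh)", main = "Daily totals")
plot(mydata2$Date, mydata2$Solar, type = "l", col = "orange",
     xlab = "Date", ylab = "Solar (GWh)")
plot(mydata2$Date, mydata2$Wind, type = "l", col = "darkgreen",
     xlab = "Date", ylab = "Wind (GWh)")

# Back to one plot per window
par(old_par)
```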
and let's basically plot time series in
a single year so what we have seen is
that when you look at the plot method it
was quite crowded then we looked at
solar and wind and if you compare that
you will see your consumption pattern
your solar pattern your wind pattern and
basically we can see from this
particular data some kind of pattern
so is electricity consumption highest in
the winter
or is it highest in
summer we can see that by breaking a
year
further into months but already
we see a pattern which goes for every
year or every two years being highest at
a particular point of time and then it
drops down
so electricity consumption is highest in
winter and that might be due to
electrical heating
and increased lighting usage and lowest
in summer now when you look at it
electricity consumption appears to split
into two clusters
one with oscillations centered roughly
around 1400 gigawatt hours so you can
look at 1400 and you see all
the values here which are around that
level of consumption and another with
fewer and more scattered data points
centered roughly around 1150 so if you
really expand this you can see you will
have a lot of data points at this level
now
we might guess that these clusters
correspond to weekdays and weekends
which we can check if you break the data
down into yearly monthly weekly and so
on now
if you look at solar production
that is highest in summer when sunlight
is most abundant and lowest in winter so
obviously when you are
gathering insights when you're
looking at the data you are also using
your domain knowledge or your business
knowledge
to understand how this goes
if you look at wind power production
that's highest in winter
presumably due to stronger winds and more
frequent storms and lowest in summer
and there is some kind of increasing
trend in wind power production
which we can see here
over the years
and
all the time series data what we are
looking at
is
referring or showing us some kind of
seasonality that is we are looking at
seasonality in which a pattern is
repeating again and again at regular
times
at regular intervals so if you look at
consumption solar and wind time series
that oscillates between high and low
values on a yearly time scale which we
can break down and see i'll show you
that
it corresponds with the seasonal changes
in weather over the year
but seasonality
does not have to correspond with
meteorological seasons for example if
you look at retail sales data
that will show you yearly seasonality
with increased sales in particular
months
and seasonality can also occur on
other time scales so the plots which we
are seeing here
are fine but if you look at those
plots they might
show some kind of weekly seasonality
also
in your consumption corresponding to
weekdays and weekends so let's plot for
one single year now how do i do that
so first is i will look at mydata2
that shows me the structure it shows me
date which is factor other columns which
are all numerics
now like we did earlier i'll repeat this
step where i'm going to convert the date
column into date type
look at head of it look at class of it
look at the structure of it right and
then what i want to do is i want to add
this
to my data frame so i will create a
variable called mod data
and this one will use as.Date and i'm
formatting
the value of x which is of date type as
month day and year so let's do that
and now you look at the mod data which i
created that is modified data so this is
the format i have it is of date type if
you look carefully here
and then i can look at the head of it
so it shows me more data
now
what we did here is when i said
mydata3
we did a
cbind with mod data which is
going to add this column to the
other columns of mydata2 so my new
data frame is mydata3 let's look at
the structure of it and you see there is
still the original date column i can
remove it or i can let it be right so that
depends on our choice maybe
once our analysis is done we want to
remove the mod data right so we can keep
both of them
now let's
extract data for a particular
year now how do you do that so this is
some wrangling so i will say mydata4
let's call it mydata4 and i will use
the subset function so subset will work on
mydata3 that's the data and
how is
the subset found so i'll say take the
mod data
column the value should be
greater than or equal to the start of
2017 and should be less than
2017 december 31st so i'm getting data
for one year and i'm storing it as
mydata4
let's get the head of it and you see we
are specifically looking at 2017 related
data
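The wrangling step can be sketched as below. The spelling `moddata` for the "mod data" column is a guess from the audio, the rows are made up, and the upper bound uses `<=` so that 31 December is included, which is a choice made here rather than something the narration confirms.

```r
# Made-up rows spanning a little more than 2017
dates <- seq(as.Date("2016-12-25"), as.Date("2018-01-05"), by = "day")
mydata2 <- data.frame(
  Date = as.character(dates),
  Consumption = 1000 + seq_along(dates)
)

# Date-typed copy of the Date column, bound on as the first column
moddata <- as.Date(mydata2$Date)
mydata3 <- cbind(moddata = moddata, mydata2)
str(mydata3)

# Keep only the rows that fall inside 2017
mydata4 <- subset(
  mydata3,
  moddata >= as.Date("2017-01-01") & moddata <= as.Date("2017-12-31")
)
head(mydata4)
```

With this layout the date is column 1 and consumption is column 3, which matches the column positions mentioned in the narration.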
now let's do a plotting of this where i
will only
create a plot for one year so i am
saying my data 4 that's my new
data what we got
so
here i am going to take the first column
which is mod data
i am going to take the third column
which is consumption so i am looking at
the date format for one year consumption
values for it and then rest of the
things as we have done earlier let's
look at the plot and this makes more
meaning right so when you look at this
plot it tells me jan to jan it shows me
some kind of pattern where i have
divided the year into months
right and it is broken down into say two
months so jan and march and may and july
and so on but we still see a pattern and
that gives me good understanding of
pattern where i've broken it down into
months
so this is where you have taken time
series in a single year to investigate
further and this is what we see
right now we can clearly see there are
some weekly
oscillations
one more interesting feature
that shows up at this level of granularity
that is when you are looking at one year
of data is a drastic decrease in
electricity consumption in early january
and late december
probably we can assume that this is
due to the holidays now i can zoom in
further and look at just jan and feb data
let's see how we do that
so to zoom into the data further
here we have this
mydata4 which is basically
a subset right so let's work on this one
earlier i
was taking mydata3 i was doing a subset
and i was giving the dates but this time
i will make it
narrower so i'll say mydata4 i will
say subset from mydata3
and i will choose the mod data column
which we have converted to date format
i will choose the starting date as
2017 01
that is jan and then let's go till feb
and let's create this
now let's look at the head of this so it
shows me we have the data starting in jan
and you can look at
more of it now again as i said earlier
let's find out the minimum of the
first
column that is your mod
data so let's look into this one
and that will give me the minimum
and maximum let's look at the values so
this one tells me the minimum is january
1st 2017
and the maximum is
feb 28th
so we are actually looking at two months
of data here
let's look at the y minimum so i
will look at
column three now what is column three
consumption so let's look at the minimum
value for consumption and the maximum
value of consumption these are the values
which can be given as our limits
now this is the minimum and maximum now
let's do a plot of this data which
has been narrowed down
for consumption so i'm
saying
my first column which is mod data and
then the third column which is consumption
i'm giving some
names for
your x-axis and y-axis
what is my title here what is the
color and then you see i'm using x limit
to give the minimum and maximum limit
and y limit so let's look at this data
and if you
look at this data
it is specifically for two months and
again i can look at the pattern here
what i can also do is i can add a
grid here
so i can look at this data and
make more meaning out of it so it is
bi-weekly data you can see now i can add
lines here using abline and then i can
choose what lines i want
to add horizontally
so that allows me to dissect
the data and look at it in a more
meaningful way i can also
add vertical lines so for vertical lines
i'm saying the sequence will go from
minimum to maximum with an interval of
seven
so let's do this
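A sketch of the January and February zoom with weekly reference lines follows. The data are synthetic (weekdays set higher than weekends on purpose), `moddata` is a guessed spelling of the spoken column name, and the horizontal line positions are arbitrary.

```r
# Synthetic 2017 data: weekdays high, weekends low (made-up values)
set.seed(1)
dates <- seq(as.Date("2017-01-01"), as.Date("2017-12-31"), by = "day")
wday <- as.POSIXlt(dates)$wday  # 0 = Sunday, 6 = Saturday
mydata3 <- data.frame(
  moddata = dates,
  Date = as.character(dates),
  Consumption = ifelse(wday %in% c(0, 6), 1150, 1450) +
    rnorm(length(dates), sd = 30)
)

# Narrow the subset down to January and February
mydata4 <- subset(
  mydata3,
  moddata >= as.Date("2017-01-01") & moddata <= as.Date("2017-02-28")
)

# Limits taken from the data itself
xmin <- min(mydata4[, 1]); xmax <- max(mydata4[, 1])
ymin <- min(mydata4[, 3]); ymax <- max(mydata4[, 3])

plot(mydata4[, 1], mydata4[, 3], type = "l", col = "blue",
     xlab = "Date", ylab = "Consumption (GWh)",
     main = "Consumption, Jan-Feb 2017",
     xlim = c(xmin, xmax), ylim = c(ymin, ymax))

# Horizontal reference lines at chosen consumption levels
abline(h = seq(1100, 1500, by = 100), lty = 3, col = "grey")

# Vertical lines every 7 days from the first to the last date
weekly <- seq(xmin, xmax, by = 7)
abline(v = as.numeric(weekly), lty = 2, col = "grey40")
```

Since 1 January 2017 was a Sunday, each vertical line falls on a Sunday, so the dips line up with the weekly markers.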
and
this basically has added some lines
every week and you can see at the end of
week it is dropping and then it is
starting again it peaks somewhere in the
mid of the week and again it
drops down so this is what you see in
your consumption data right now what we
can also do is we can create some box
plots but first when we zoomed into the
data for jan and feb and added some
reference lines like this we saw that
consumption is
highest on the weekdays as i showed you
here and lowest on the weekends so this
is what we are seeing when we are
breaking the data down or zooming in
further to a couple of months so we have
vertical grid lines and we have nicely
formatted tick labels that is jan first
and 15th feb first and so on so we can
easily tell which days are weekdays and
weekends with use of these grid lines
and basically breaking it down so there
are many other ways to actually
visualize your time series data
depending on what patterns you're trying
to explore you can use scatter plots you
can use heat maps you can just use
histograms and so on
now moving further we would want to
explore the seasonality right so when
you further explore the seasonality of
our data
we can use box plots basically to group
the data by different time periods and
display the distribution for each group
now how do we do that
let's come here and let's see how box
plot works so i can just do a simple box
plot and i can choose my consumption
column and that gives me just the
consumption data but this really does
not give me any meaning i can look at
solar data i can look at the wind data
and we can also see some outliers here
so we can create box plots but
what is a box plot it is basically a
visual display of your five number
summary that is the minimum the 25th
percentile the median or 50th
percentile
the 75th percentile and the maximum so we
can use the quantile function on the
consumption column and then you give
a vector of probabilities which shows you
the five number
summary so that's your quantile and then
let's do a box plot
so if you are looking at quantile it
tells me what is the minimum what is the
25th percentile 50 75 100 that's from my
consumption column so let's create a box
plot for consumption
let's give it a title as consumption
let's give the y axis label as
consumption and a limit
for
the y-axis
now that's my consumption graph
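A minimal sketch of the five-number summary and the single box plot, on made-up consumption values. The y limits 600 to 1800 echo the range mentioned in the narration.

```r
# Made-up consumption values
set.seed(1)
mydata2 <- data.frame(Consumption = 1300 + rnorm(500, sd = 150))

# Five-number summary: minimum, quartiles and maximum
q <- quantile(mydata2$Consumption, probs = c(0, 0.25, 0.5, 0.75, 1))
q

# Box plot of the whole column with a title, axis label and y limits
boxplot(mydata2$Consumption,
        main = "Consumption",
        ylab = "Consumption (GWh)",
        ylim = c(600, 1800))
```

The box shows the quartiles and the median; points beyond the whiskers are drawn individually as outliers.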
so i can look at yearly data now that
will be more meaningful than just
looking at the complete consumption data
so how do we do it yearly so we will say
consumption
and then i will say the year column so
it is consumption but grouped based on
year
so here i can give x axis y axis and i
can give y limit so let's create this
and this is more meaningful we can give
some coloring scheme here but now i'm
looking at 2006 2007 8 9 and so on and
we can look at the data what is the
range right it gives me the
five number summary of the data
per year and it allows me to
look at the seasonality of this
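Grouping by year can be written with the formula interface of `boxplot`, sketched here on synthetic data. Whether the session used the formula form or another grouping call is not stated, so treat this as one possible way.

```r
# Synthetic daily consumption for 2006 to 2017 (values are made up)
set.seed(1)
dates <- seq(as.Date("2006-01-01"), as.Date("2017-12-31"), by = "day")
doy <- as.numeric(format(dates, "%j"))
mydata2 <- data.frame(
  Consumption = 1300 + 150 * cos(2 * pi * doy / 365.25) +
    rnorm(length(dates), sd = 40),
  year = as.numeric(format(dates, "%Y"))
)

# "Consumption ~ year" means: one box of Consumption per value of year
b <- boxplot(Consumption ~ year, data = mydata2,
             main = "Consumption by year",
             xlab = "Year", ylab = "Consumption (GWh)",
             ylim = c(600, 1800))
b$names  # the group labels, one per year
```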
similarly we can create a box plot
by just giving consumption grouped yearly
and here i am giving the title as
consumption y axis
x axis and y limit
wherein i can also use las so this is
one more option which you can set and
that controls the orientation of the tick
labels if you compare this one to the
previous graph
so when i created this previous graph i
had 2006 2008 and i had from 600 to 1800
and if i go for the next one
i am basically seeing more useful
information now let's look at monthly
data
so
i would want to group it based on months
and let's create that so this gives me
the monthly data where i'm looking at
months
and i could select
a particular year or i can just do a
grouping based on months
so
i can have multiple plots to see a
difference here so let's do this
now let's create a box plot for
consumption which is monthly data and
let's give it a color
let's look at the wind data which is
again grouped monthly and let's look at
the solar data which is grouped monthly
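The three monthly box plots can be sketched as follows, again on synthetic data with a made-up seasonal shape. The colours are arbitrary, and `las = 1` (horizontal tick labels) is included as the option mentioned in the narration.

```r
# Synthetic daily data: consumption and wind peak in winter, solar in summer
set.seed(1)
dates <- seq(as.Date("2015-01-01"), as.Date("2017-12-31"), by = "day")
doy <- as.numeric(format(dates, "%j"))
winter <- cos(2 * pi * doy / 365.25)
mydata2 <- data.frame(
  Consumption = 1300 + 150 * winter + rnorm(length(dates), sd = 40),
  Wind = pmax(5, 250 + 120 * winter + rnorm(length(dates), sd = 80)),
  Solar = pmax(0, 100 - 90 * winter + rnorm(length(dates), sd = 20)),
  month = as.numeric(format(dates, "%m"))
)

# Three stacked box plots, each grouped by month
old_par <- par(mfrow = c(3, 1))
b_cons <- boxplot(Consumption ~ month, data = mydata2, col = "lightblue",
                  main = "Consumption", xlab = "Month", ylab = "GWh", las = 1)
b_wind <- boxplot(Wind ~ month, data = mydata2, col = "lightgreen",
                  main = "Wind", xlab = "Month", ylab = "GWh", las = 1)
b_solar <- boxplot(Solar ~ month, data = mydata2, col = "orange",
                   main = "Solar", xlab = "Month", ylab = "GWh", las = 1)
par(old_par)  # back to one plot per window
```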
now if i zoom in it basically gives me
the seasonality of the data
for your wind for your consumption for
your solar so what we are doing is we
are creating these box plots which are
giving us
values now what i can also do is i could
look at the day wise also but before we
look into this how do i
infer some information from these box
plots which are being created so this is
what we have done where we are looking
at the data for month and these box
plots give me yearly seasonality
which we were seeing in earlier plots
but give some additional insights so if
i look at the data here it tells me the
electricity consumption is generally
higher in winter
now this is based on months so we can
see consumption is higher in winters
and lower in summer so we can obviously
look at our plot we can see where it is
lower where it is higher
and then we can
look at the median and lower two
quartiles are lower in december and
january compared to november and
february so that is you look at the
quartiles and you will see
that
the median and lower two quartiles are
lower in december and january
here jan and december so you can look at
from my plot
now
this is giving you some idea on
seasonality
now
that might be due to business being
closed over holidays now this one we
were also seeing when we looked at time
series for 2017 only and box plot
basically confirms that there is this
consistent pattern throughout the years
now when you look at
your
solar and wind power production both
will give you a yearly seasonality what we
are seeing here
and
if basically i look at the data so
it depends on what parameters you are
choosing but if you look at wind it
will reflect the effect of occasional
extreme wind speeds associated with
storms and other transient conditions
and since we
are grouping it based on months we can
see this pattern is quite evident every
year
now what we can do is we can group the
data day wise so here let me again reset
this to
one plot per graph
now i'll say box plot i'll say
consumption which is group based on day
now we know that there is a day column
and let's give a y limit and let's
look at the data so this is where i'm
grouping the data day wise
so you look at 31 days and you look at
the box plot so this is where you are
plotting it on a daily basis so you can
look at the data you can break it down
to
a particular week so here i have given
a day and i have chosen all the 31 days
but i can break it down to a week and i
can look at the data so
if we look at the data per week or per
day we can basically infer that
electricity consumption
where i'm doing a consumption group by
day
is higher on weekdays than on weekends
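Grouping by day of the month does not by itself separate weekdays from weekends, so one way to check the claim is to derive the day of the week from the date. This is only a sketch: the data is synthetic and deliberately built with lower weekend values, and the names `day_type` and `is_weekend` are hypothetical.

```r
# Synthetic year of data with lower weekend consumption (by construction).
set.seed(3)
dates <- seq(as.Date("2017-01-01"), as.Date("2017-12-31"), by = "day")
wday <- as.POSIXlt(dates)$wday            # 0 = Sunday, ..., 6 = Saturday
is_weekend <- wday %in% c(0, 6)
consumption <- ifelse(is_weekend, 1100, 1400) +
  rnorm(length(dates), sd = 40)

# Group by weekday/weekend and compare the means and the box plots.
day_type <- factor(ifelse(is_weekend, "weekend", "weekday"))
means <- tapply(consumption, day_type, mean)
print(means)
boxplot(consumption ~ day_type, xlab = "Day type", ylab = "Consumption")
```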
so time series with strong seasonality
can
often be represented with models that
can decompose signal into seasonality
and long term trend now this is
an easy way now how do we look at the
frequency of the data that could be
interesting to see
so let me
look at
say the yearly data
which we were seeing here
now let's go further and here
we have looked at data so what we will
do is we look at the frequency now when
you look at the frequency when you talk
about frequency in your data so we have
the modified date column which gives me
a frequency and if we really look into
the data that will tell me
that the data is on a daily basis so for
that let's look at my data three again
which gives me data and you can just see
all the dates are in
sequence so you have 22 23 24 25 26 and
so on i can also access the dplyr
package
that is basically
allowing me to work in a better way now
i can look at the summary of this and
for all my columns
i am seeing what is the minimum five
number summary date and consumption so
date does not show me anything because
this is not in a date format it is just
a factor but other things have the five
number summary so we are looking at wind
plus solar we are looking at year and
month and day and all these columns
now what we will do is we will want to
find out the sum of each
column how many entries does it have
and we will say the n a
values should not be considered so let's
look at this one so it tells me for my
particular columns
so let me run this again
and that shows me
for each column how many values you have
and
these
counts
do not include the n a values
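The per-column counts can be sketched in base R with `is.na()` and `colSums()`. The small data frame below is made up; the video's own code may use a different function to get the same counts.

```r
# Tiny made-up data frame with some missing values.
mydata <- data.frame(
  consumption = c(1069, 1380, 1442, 1457, 1477),
  wind        = c(NA, NA, 48.7, NA, 120.1),
  solar       = c(NA, 6.6, NA, NA, 12.3)
)

# Number of non-missing entries per column.
non_missing <- colSums(!is.na(mydata))
print(non_missing)

# Number of missing entries per column.
missing <- colSums(is.na(mydata))
print(missing)

# Missing values in one specific column.
sum(is.na(mydata$consumption))
```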
now similarly i can find out
specifically for consumption i can find
out is there any n a value so i'm saying
is dot n a and let's find out if there
is any n a value or missing value in
consumption it says zero
okay that's good if you look in wind
it tells me there are 1463
entries which are n a
similarly solar
similarly
wind dot solar or wind plus solar so it
gives me a count of n a values that is
missing values
and also values which are not missing so
to understand frequency what we can do
is we can find out the minimum
on the date that is the first column and
i'm saying
rm
n a dot rm is true that is get rid of n
a values and find out the minimum
and let's look at the minimum value
this is the minimum from my modified
date
now if i would want to get the frequency
i can basically use sequence function so
i can say
from x minimum that is the minimum value
i want to look at the frequency that is
day wise and let's just look at five
entries and see if there is a
day by day frequency
so let's look at the value of this and
obviously it tells me there is day wise
frequency so that allows me to look at
the frequency look at the type of it it
is an integer class is a date
so similarly we can say from x minimum
we can basically look at the frequency
month-wise
and i can again look at five records so
that shows me monthly data
right so i can
extract the data for frequency similarly
yearly data and that's also very useful
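A short sketch of the minimum date and the `seq()` calls described above. The `dates` vector and the name `xmin` are illustrative assumptions; the real column in the video holds the full date range.

```r
# Made-up date column with one missing entry.
dates <- as.Date(c("2006-01-03", NA, "2006-01-01", "2006-01-02"))

# Earliest date, ignoring missing values.
xmin <- min(dates, na.rm = TRUE)

# Five steps from the minimum at day, month and year frequency.
daily   <- seq(from = xmin, by = "day",   length.out = 5)
monthly <- seq(from = xmin, by = "month", length.out = 5)
yearly  <- seq(from = xmin, by = "year",  length.out = 5)

print(daily)
print(monthly)
print(yearly)
class(daily)   # the result is still of class "Date"
```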
now
we can select data which has n a values
for wind
so how do i do it i would want to find
out
the wind column and i want to find out
where the values are n a so
i will create a variable
and here i will say my data 3
and then i give a conditional where i
say is n a
in the column so let's do this
now once i have done this
once i have done this i have said that
my
selected wind data from my data 3 where
we said n a values
and i will give the names to this so
name should be in my data 3 i'm
interested in mod date consumption wind
and solar so these are the four columns
i'm interested in let's look at first 10
records here or first 10 rows so that
tells me these are the values where wind
has n a
or missing values
i can always do a view and that gives me
the complete data so it basically shows
me 1463
entries and here it shows me all n a
values so you can look at all the way to
the end and it shows me wind has n a
solar does have some value here
in the last row but then also if you see
the numbers have
a difference so you have 1 4 6 1 and
then you have 2 1 7 4 so there is a
difference so there is some data in
between where wind has some values so we
have found out n a values now
what we will do is we will select data
which does not have n a values
so i will call it selected wind2
i'll again use mydata3 i will say which
but now i am saying not n a
from this column and i will select the
data for the columns so i'm interested
in looking at
10 records and this shows me not n a
value so no more missing values so if i
really look at this data as i saw
earlier which has n a and if i look at
these values which are not n a for the
wind column so looking at these two
result we will know that in year 2011
wind column
has some missing values
so let's focus on year 2011. so how do i
do that let's call it a different
variable i'll say mydata3 i will say
here when i say which where we were
saying n a here i will say the year
should have a value of 2011 and i want
all these columns
let's look at the data here and this is
showing me 2011 but
we
are not seeing all the values so there
are some values but then there are some
missing values also for 2011 based on
whatever analysis we have done so let's
look at the class of this it is
basically a data frame do a view
and this one will help me in finding out
where are the n a values so if you just
scroll down
looking at all the data let's search if
wind column has a n a or a missing value
and i will see
if there is any missing value in which
column or which row it is for the wind
column so we have all the values which
are existing
i could select and search for one
specific value and i'll show you how we
can do that so here let's scroll all the
way down so it's like you're exploring
your data and seeing is
wind column having n a or missing value
for a particular row
and let's scroll here and here you see
there is a missing value for one
particular row so
13th
december 2011 has wind value
15 december has wind value but
your 14th december does not have right
similarly we can search so there was
only one entry which was missing now
that could be for some reason might be
it was not calculated might be it was
not tabulated so we have a missing value
and that
can affect my plotting that can affect
my analysis so let's look at the number
of rows
in this which will tell me how many rows
we have
for 2011. so it tells me 365. so that is
basically the number of days in a year
now we will find out if
there were any n a values so we earlier
checked total number of na values per
column
that is
in your row number 265 to 269
we can see here 265 to 269
so this is where we were seeing
are there any n a values right so let's
go back here
and
we want to find out the number of n a
values for a particular year how do i do
it so i can just do a sum i will say is
n a
now i am interested in my data 3
wind column and i am saying my year has
to be 2011 but i am finding out the n a
values
so let's do this and it tells me one and
that's right that's what we saw when we
did a view let's see
how many non-na values you have and that
is 364 so that basically
satisfies my logic so it's 364 plus 1
missing so there are 365 let's look at
the structure of this it tells me you
have modified date and date format you
have consumption wind and solar now
let's create a variable
selected wind4
i will say wind3 that is which was
having all my n a and
non n a values for 2011. i will say
let's find out the n a value
because i'm interested in finding out
that particular row so i'm saying find
out where the value is n a and i want
all the columns
let's look at this one and this is my
specific
row which has a n a value
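The counting and the row lookup for one year can be sketched as follows. The data frame, its column names (`moddate`, `year`, `wind`) and the values are hypothetical; only the single gap on 14 December 2011 mirrors what the walkthrough found.

```r
# Synthetic data for 2011 with one missing wind value (14 December).
dates <- seq(as.Date("2011-01-01"), as.Date("2011-12-31"), by = "day")
mydata <- data.frame(
  moddate = dates,
  year = as.integer(format(dates, "%Y")),
  wind = seq_along(dates) * 0.5
)
mydata$wind[mydata$moddate == as.Date("2011-12-14")] <- NA

# Count missing and non-missing wind values for 2011.
in_2011 <- mydata$year == 2011
n_missing <- sum(is.na(mydata$wind[in_2011]))
n_present <- sum(!is.na(mydata$wind[in_2011]))
cat(n_missing, "missing +", n_present, "present =", sum(in_2011), "rows\n")

# Pull out the row(s) where wind is missing, keeping all columns.
missing_rows <- mydata[which(in_2011 & is.na(mydata$wind)), ]
print(missing_rows)
```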
now
we know that data follows a day wise
frequency which we have clearly seen now
let's select data which has n a and non
n a values
so
let's say let's call it test one i will
use wind3 which has
n a and non n a values but now i will say
i want the modified date which should be
greater than 12 12 2011 now remember we
had when we were doing a view we saw
that one particular day or what we see
here 14th of december there is no data
so i will select a subset of data which
includes this n a and non n a that is
might be i can take 13th of december and
15th of december so let's start from 12
12
so the date should be greater than 12 12
that means 13 and it should be less than
16 so that is 15th
and the columns
right so
now we have some data let's look at this
so i have
a i've selected a subset of data i could
have done this using subset also so i
have n a and non n a values
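Selecting the few days around the gap might look like the sketch below, shown both with logical indexing and with `subset()`. The data frame `wind3`, the column `moddate` and the wind values are assumptions made for illustration.

```r
# Made-up slice of December 2011 with a missing wind value on the 14th.
dates <- seq(as.Date("2011-12-10"), as.Date("2011-12-18"), by = "day")
wind3 <- data.frame(
  moddate = dates,
  wind = c(50, 61, 72, 83, NA, 105, 116, 127, 138)
)

# Dates strictly after 12 December and strictly before 16 December.
test1 <- wind3[wind3$moddate > as.Date("2011-12-12") &
               wind3$moddate < as.Date("2011-12-16"), ]
print(test1)

# The same selection written with subset().
test1b <- subset(wind3, moddate > as.Date("2011-12-12") &
                        moddate < as.Date("2011-12-16"))
```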
why are we doing this so sometimes you
might have some data for a particular
column and you may want to find out if
there are any missing values might be
you want to fill them up or replace them
with something so that is usually useful
when you are doing a trend detection
so say for example you have data for
every month and might be in one one of
the months you have missed or might be
you have data for every year collected
monthly and then in one of the years for
couple of months you don't have the data
like i can say 2016 i have data for all
12 months 2017 all 12 months 2018 might
be i don't have data from march and june
2019 i don't have data for same months
so i can forward fill or backward fill
them using the previous year's same
month data so we can do that so here i
have test data where i've extracted a
subset of data
i can look at
the
class of this it is a data frame
structure of this it has the columns now
let's use the library
function and use the tidyr package
and what we will do is we will fill it
up so i will use test one i will fill
the wind column which has a missing
value now once you do this if you notice
it has done a forward fill so it has
taken the previous value and it has just
filled up that so you can
fill up the data using different
directions such as down and up
or down then up and so on so we can take
care of missing values
in our frequency data which allows us to
basically
analyze the data in a better way now
here we will want to also look at some
more data so this is to deal with
frequencies and filling a column
wherein you can take care of missing
values with a forward fill so filling values
can be done in different directions as i
said and you may want to first convert
your time series to specified frequency
if
your data does not have a frequency but
we had now if you do not have a
frequency might be you can convert it
into a frequency such as weekly daily
monthly as i showed you and then
basically you can
do a forward fill
for the value so for example if i have
my data i can break it down into weekly
and then look at the values and if there
are any values missing for weekly data i
can use a forward fill so that can take
care of my frequency data
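To show what a forward fill does without needing any extra package, here is a base R stand-in. The helpers `fill_down()` and `fill_up()` are hypothetical names written for this sketch, not the tidyr function used in the video, and the wind values are made up.

```r
# Forward fill: replace each NA with the last non-missing value before it.
fill_down <- function(x) {
  idx <- cummax(seq_along(x) * !is.na(x))  # position of the last non-NA so far
  idx[idx == 0] <- NA                      # leading NAs have nothing to copy
  x[idx]
}

# Backward fill: the same idea applied to the reversed vector.
fill_up <- function(x) rev(fill_down(rev(x)))

# Made-up three-day slice with a gap on 14 December.
test1 <- data.frame(
  moddate = as.Date(c("2011-12-13", "2011-12-14", "2011-12-15")),
  wind = c(83, NA, 105)
)
test1$wind_down <- fill_down(test1$wind)  # gap takes the previous value
test1$wind_up   <- fill_up(test1$wind)    # gap takes the next value
print(test1)

# With tidyr installed, tidyr::fill(test1, wind, .direction = "down")
# performs the forward fill directly on the data frame column.
```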
then
let's look at the trends of the data
which is the last part of this project
so basically let's look at the trend so
when you say trend what does that mean
so in time series data
you always have some kind of trend
so that will exhibit some slow gradual
variability in addition to
higher frequency variability such as
seasonality and noise
now
to visualize these trends what we do is
we use what we call as rolling means so
we know how our data is
spread over year or month or day
but how about looking at a rolling
average and see what is the difference
so a rolling mean
will tend to smooth a time series by
averaging out the variations at
frequencies
that are much higher than the window
size so there is something called as
windowing where you can choose a
time frame and you also average out any
seasonality on a time scale equal to
the window size
so this will allow you to look at lower
frequency variation in the data
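The effect of the window size can be seen on a toy series. In this sketch `roll_mean()` is a hypothetical base R helper built on `stats::filter()` and the weekly pattern is invented: a window of 7 removes the weekly cycle completely, while a window of 3 does not.

```r
# Centered rolling mean with window k (k odd); the ends are NA.
roll_mean <- function(x, k) {
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 2))
}

# Made-up series that repeats the same weekly pattern for 8 weeks
# (five higher weekday values, two lower weekend values).
weekly <- rep(c(1400, 1420, 1430, 1410, 1390, 1150, 1100), times = 8)

smooth3 <- roll_mean(weekly, 3)   # window shorter than the cycle
smooth7 <- roll_mean(weekly, 7)   # window equal to the cycle length

# The 3-day mean still swings with the week, the 7-day mean is flat.
print(range(smooth3, na.rm = TRUE))
print(range(smooth7, na.rm = TRUE))
```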
so when we are looking at electricity
consumption time series we already saw
there is a weekly pattern there is a
yearly seasonality which we saw using
box plots so we can also look at the
rolling means of the time scales how do
we do that so for this you can use some
package like zoo and then you can
basically use a rolling mean
using this zoo package
and you can say what
is the
frequency with which you want to
calculate the rolling mean
now how do we do this
let's look at this data so here i'm
going to my look at my data 3 which we
have been using so far
now let's call it a 3 day test you can
give it any name i am going to use my
data 3 i am using the piping function
now i will use dplyr and i will
arrange the data descending in here now
you can always break it down step by
step and you can see the result of this
so i'm going to arrange this data in
descending order of year
so obviously my last one 2017 or 2018
will be on the top you want to group the
data by year so it depends on how many
years we have we'll see so you can group
the data by year now this data is then
used to basically mutate so mutate
function is going to allow me to use
this rolling mean so i'll call it as
test 0 3
day so i'm going to calculate a rolling
mean every three days
for my consumption column
and
basically let's ungroup this so let's
see how this
works sorry yeah let's look at this and
here when i'm doing a three-day test
let's look at the result of this and
then i'll explain this so if you see
here we have the test three-day column
now this has the rolling average now
what does that mean so the first value here
what we see is 1367
it is the average consumption in 2017
for the first date with the data point
on either side of it that is you can
look at
this
date so 1130
then you look at
you are looking at the value
1367 here so you look at 1130
1441 and 1530 and if i take a
mean of these so for example if i would
just do this part
and that
is giving me
the mean okay because i have a comment so
let's basically add anything as comment
and then let's do this so it shows me
1367 that's what we are
seeing here right so you are getting
a rolling average every three days
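The three-day check can be reproduced on a handful of numbers. The values below are illustrative, chosen so that the first full window averages to the 1367 mentioned above, and `roll_mean()` is a hypothetical base R helper rather than the zoo function used in the video.

```r
# Centered rolling mean with window k (k odd); the ends are NA.
roll_mean <- function(x, k) {
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 2))
}

# Illustrative consumption values for five consecutive days.
consumption <- c(1130, 1441, 1530, 1500, 1210)

# Each value is the mean of a day and its two neighbours.
test_03d <- roll_mean(consumption, 3)
print(test_03d)

# Checking the first full window by hand.
mean(consumption[1:3])

# With zoo installed, zoo::rollmean(consumption, k = 3, fill = NA)
# gives the same centered result.
```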
similarly if you want every five days it
takes the five values and it gets the
mean value right so you can always find
out the mean
rolling mean
for a particular frequency now let's do
that for seven days that is weekly data
and yearly data that is 365 days so how
do i do it same logic my data test
now i am using my data 3 i am arranging
it in a descending order i am grouping
by year
so when you do a group by year so
earlier when we did a grouping by and
when we looked at the data it was
telling me how many rows we had
right so let's do a grouping by year and
let's say test zero seven so that's a
rolling average every seven days and i'm
also saying take care of the n a values
similarly i'm getting rolling average
every 365 days might be you can do
quarterly might be you can do half
yearly and let's do this so let's
create this my data test and let's look
at the result of this so i will use my
data test i will say arrange
based on modified date now we know there
is a column called modified date i want
to just look at 2017 data so i'm doing a
filter
right and then i will choose what are
the columns i'm interested in so i will
look at the 7 and 365 day and let's look
at say first seven records so let's do
this
and that basically gives me the
consumption value modified date year and
my rolling seven day average or the
seven day mean
which is for first seven days and then
365 you will not see the data here but
if i do a view on this i can basically
see the values
so you can always select a particular
column to see the values
these are the values for every 7 day
rolling average
this is for 365 days every 365 days so
you see all the values are missing but
every 365 entry you will have basically
some data
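Why the 365-day column is almost entirely missing can be seen in a small sketch: when the rolling mean is computed within each year, a 365-row group contains only one complete centered window of 365 days. The column names, the helper `roll_mean()` and the synthetic values are assumptions, and base R `ave()` stands in for the dplyr grouping.

```r
# Centered rolling mean with window k (k odd); the ends are NA.
roll_mean <- function(x, k) {
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 2))
}

# One synthetic year of daily consumption.
dates <- seq(as.Date("2017-01-01"), as.Date("2017-12-31"), by = "day")
mydata <- data.frame(
  moddate = dates,
  year = 2017L,
  consumption = 1300 + 100 * cos(2 * pi * seq_along(dates) / 365)
)

# Rolling means computed separately within each year.
mydata$test_07d <- ave(mydata$consumption, mydata$year,
                       FUN = function(x) roll_mean(x, 7))
mydata$test_365d <- ave(mydata$consumption, mydata$year,
                        FUN = function(x) roll_mean(x, 365))

# The 7-day column only misses three values at each end of the year,
# the 365-day column has a single value in the middle of the year.
sum(is.na(mydata$test_07d))
which(!is.na(mydata$test_365d))
```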
now let's do a plotting of this and
basically visualize this data which we
are seeing rolling average so let me
first do a plotting one plot per graph
and let's do a plotting i will take
consumption data
x-axis y-axis
color and give a title to this so let's
create this and that's my consumption
data which is
spread over a period of time and that's
fair enough but now let's add some more
plot to this so i will add the seven day
rolling average to this
so for second plot to be added in the
same one in r you can use points
so i will say points i will choose the seven
day column
type is line line width
x limit y limit and color so let's do
this
and that's my
pattern seven day rolling average which
basically gives me some kind of trend
similarly i can add one more here and
this time i will choose the 365 day
and look at the pattern
lines so now you see some dots here well
you could do it in a different way so i
can just add legend to this and i can
say legend will be
where in x axis and y axis so i am
saying it will be 2500
and y is 1800 so my legend will come in
somewhere in here i am saying my legend
will have consumption
test
and this one i can give some names i can
give what is the color
i can say what kind of
legend it explains what is
for each color and then basically a
vector so let's add a legend to this and
i've added a legend now you can do a
zoom and look at the data
and
here i see that
my x axis is fine but y axis is going a
little
out of my plotting area so i can
actually change that so here i have 1800
how about making it 1600 and let's look
at this one
so
we can basically
go for this one and start again here
plot and points and line and then add a
legend right and you can basically place
your legend anywhere in the plot so this
basically is giving
me the trend what i'm looking at my
rolling average
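A compact sketch of the overlay plot with a legend. It uses `lines()` and a keyword legend position instead of the `points()` call and x/y legend coordinates from the video, which is a deliberate simplification; the data, colors and the helper `roll_mean()` are all assumptions.

```r
# Centered rolling mean with window k (k odd); the ends are NA.
roll_mean <- function(x, k) {
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 2))
}

# Synthetic two-year series with yearly and weekly patterns plus noise.
set.seed(4)
dates <- seq(as.Date("2016-01-01"), as.Date("2017-12-31"), by = "day")
weekend <- as.POSIXlt(dates)$wday %in% c(0, 6)
doy <- as.integer(format(dates, "%j"))
consumption <- 1350 + 150 * cos(2 * pi * doy / 365) -
  250 * weekend + rnorm(length(dates), sd = 30)
test_07d <- roll_mean(consumption, 7)

# Raw series first, then the smoothed series drawn on top of it.
plot(dates, consumption, type = "l", col = "grey60",
     xlab = "Date", ylab = "Consumption",
     main = "Consumption with 7-day rolling mean")
lines(dates, test_07d, col = "red", lwd = 2)

# "topright" keeps the legend inside the plotting area.
lg <- legend("topright", legend = c("consumption", "7-day mean"),
             col = c("grey60", "red"), lty = 1, lwd = c(1, 2))
```

A keyword position such as "topright" avoids the trial and error of picking coordinates that fall outside the plotting area.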
so similarly you can look at the trend
for wind and solar data so what we are
seeing here is when you look at trend
this is one more way of looking at it
you can always create plots in different
ways
so
seven day rolling mean has smoothed out
all weekly seasonality which we were
seeing here in my graph where you look
at every seventh day preserving the
yearly seasonality so
seven day will tell
that electricity consumption is
typically higher in winter and lower in
summer so better is you break it down
yearly so here if you look at every year
you can see when is winter when is
summer what is the seasonality what your
trend what you are seeing here and if
there is a decrease or increase
for few weeks
every winter
so similarly if you look at 365 now as
you said as i said rolling average
basically
reduces the variation so if i look at
365 rolling mean we can see long term
trend
in electricity consumption is pretty
flat now that's what we are seeing it's
kind of pretty flat there is not much
variation over years if you really join
these dots
so
we can basically see some highs and lows
and that gives me a trend now this is
how you can do a trend detection and
similarly we can do plotting for wind
and solar so this is a
small project which i demonstrated using
r
now all this code which you have here in
the form of a project dot r file you can
find here in my github page this is a
document which explains some things feel
free to download this and you can add
details to it this is the sample data
set which you can also find in my
repository in the data sets folder so
continue learning and continue
practicing r with that we have come to
the end of this full course on r
programming think we missed anything
important do let us know in the comment
section below thank you so much for
being here and do watch out for more
videos from us until then keep learning
and stay tuned to simplilearn