Data Processing and data management

Data structure:

  1. Vector
  2. Matrix 
  3. Data frame 
  4. List

R is a case sensitive language.

Addition in R:

A+b

No need to declare the variable in R and Python.

In python type to check the Datatype of variable.

How to create vector in R?

Vector is a 1D array. 

Height <- c(165,167,168,160,158)

Function c helps create array from a list. 

  1. Sum(): sum of all elements 
  2. Mean(): mean 
  3. Class(): datatype 
  4. Min(): minimum value
  5. Max(): max value 
  6. Sd(): standard deviation
  7. Var(): variance 

Names<-c(‘a’) data type = character 

Graduate <- c(TRUE, FALSE, TRUE)

Class(Graduate)

The data type= logical 

my data<-c(2,3,’a’,’b’) System converts the Datatype to numeric to character.

System can convert numeric into string but not vice versa.

Mydata2<-c(“1”,”2”,”3”)

Class(mydata2)

For explicit conversion from character to numeric.

Mydata3<-as.numeric(mydata2)

Class(mydata3)

Integer and float are both numeric data types.

V1<-(23,45,67.78,65.56,34.56)

Missing value in python: NaN

Missing value in R: Na

Name can’t be a categorical variable.

Address can be character or string and it has no pre-defines value.

Name can be character or string. 

List can store elements of any type:

  1. Matrix 
  2. Vector
  3. Data frame
  4. Numeric

Vectors are mutable. The value of the vectors can be changed. It is used majority of the times. 

Matrix: n dimensional matrix. The elements of same type. Length of each column is same. It stores value in tabular form. 

Data frame: how is it different from matrix. It stores data in tabular. It can have different data types. 

List: When we want to store any kind of data. It is rarely used. But it is very powerful. It’s multi-dimensional. There can be list within a list. 

Hierarchical data type format. 

log2(x), log10(x)

Data transformation:

  1. Min and maximum 
  2. Z score standard scalar 
  3. Decimal scaling= x/10^3 number of digits of maximum values 
  4. Log transformation
  5. Sqrt transformation
  6. Discretisation

height<-c(170,167,168,160,172)

zvalue=(height-mean(height))/sd(height)

zvalue

height<-c(170,167,168,160,172)

dscaling=height/10^3

height=(height-min(height))/(max(height)-min(height))

It is left skewed normal distribution. It’s negatively skewed.

curve(dnorm(x,mean=mean(skmarks),sd=sd(skmarks)),add=T) adds the curve to the graph. 

add=T adds the curve

To make the distribution normal

Negatively skewed is not applicable for log transformation, square root transformation according to a research.

Left skewed transformation/Negatively skewed is applicable for square transformation and cube transformation according to a research.

Log and sqrt transformation for negatively skewed distribution and we want to transform it to normal distribution.

gender<-c(“male”,”female”,”male”,”female”,”male”, “male”)

table(gender)

We want to convert categorical data to numeric form. 

gender[gender==’male’]=1

gender[gender==’female’]=2

gender=as.factor(gender)

Read.csv(name of file, sep=“;”)

Read.tavle(name of file, sep=“\t”, header= T)

SomeData<-read.csv(name of file,Sep=“,”, header=F, col.names=c(“Name”,”City”),na.strings=c(“”,”,”,” “)

SomeData<- read.csv(name of file,Sep=“,”, header=F, col.names=c(“Name”,”City”),na.strings=c(“”,”,”,” “), slip=3, comment.char=“$”)

  1. Unstructured data : voice, audio 
  2. Structured data : Data in a tabular form. Regional database, SQL, etc.
  3. Semi structured : XML, Jason, No structure

XML package:

Mean is always sensitive to the outliers, it changes due to the outliers. 

mean <- ifelse(is.na(airquality$Ozone)

mean(airquality$Ozone, na.rm=TRUE)

 airquality$Ozone)

airdata<-airquality 

for(i in 1: ncol(airdata))

 {

airdata(is.na(airdata[,i}),i]<-mean(airdata[,i],na.rm = TRUE) 

}

 airdata

https://www.tandfonline.com/doi/abs/10.1080/08839514.2019.1637138