Recklessly Impulsive: 2013

Saturday, 30 March 2013

IT Lab Session 10

Assignment 1: Create 3 vectors x, y and z, choose any random values for them, ensuring they are of equal length, and bind them into a matrix:

T<- cbind(x,y,z)

Create 3-dimensional plots of the same (all 3 types as taught).

Commands (plot3d comes from the rgl package, so run library(rgl) first):

3D plots:

Normal Plot: plot3d(T[, 1:3])

Colour Plot: plot3d(T[, 1:3], col = rainbow(1000))

Colour Plot of spheres: plot3d(T[, 1:3], col = rainbow(1000), type = 's')
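
A minimal end-to-end sketch of Assignment 1, assuming the rgl package is installed. The seed, the sample size and the three distributions are made up for illustration, and the matrix is called pts here (not T) so that R's shorthand for TRUE is not masked. The plotting calls are skipped when rgl is missing or the session is not interactive.

```r
set.seed(42)                 # hypothetical seed, only for reproducibility
n <- 100                     # hypothetical sample size
x <- rnorm(n)
y <- runif(n)
z <- rexp(n)

# The assignment requires vectors of equal length
stopifnot(length(x) == length(y), length(y) == length(z))

pts <- cbind(x, y, z)        # 100 x 3 matrix with columns x, y, z

if (interactive() && requireNamespace("rgl", quietly = TRUE)) {
  rgl::plot3d(pts[, 1:3])                                  # normal plot
  rgl::plot3d(pts[, 1:3], col = rainbow(n))                # colour plot
  rgl::plot3d(pts[, 1:3], col = rainbow(n), type = "s")    # spheres
}
```

Using rainbow(n) gives exactly one colour per point; rainbow(1000) as in the commands above also works, since only the first n colours are used.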

Assignment 2:

Choose 2 random variables

Create the following plots:

1. X-Y

2. X-Y|Z (introduce a variable z with 5 different categories and cbind it to x and y)

3. Color code and draw the graph

4. Smooth and best fit line for the curve

> library(ggplot2)

> qplot(x,y)

> qplot(x,z)

Semi-transparent plot

> qplot(x,z, alpha=I(2/10))

Colour plot

> qplot(x,y, color=z)

Logarithmic colour plot

> qplot(log(x),log(y), color=z)

Best fit and smooth curve using the "geom" argument

> qplot(x,y,geom=c("path","smooth"))

> qplot(x,y,geom=c("point","smooth"))

> qplot(x,y,geom=c("boxplot","jitter"))
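
The commands above assume x, y and z already exist. A sketch of one way to set them up, with z as a 5-level factor, followed by a ggplot() version of the colour-coded plot with a smooth curve and a straight best-fit line. The data-generating formula, the seed and the data frame name d are assumptions for illustration only; the ggplot() form is shown because recent ggplot2 releases mark qplot() as deprecated.

```r
set.seed(7)                                   # hypothetical seed
n <- 200
x <- runif(n, 1, 10)
y <- 3 + 2 * log(x) + rnorm(n, sd = 0.3)      # made-up relationship, keeps y positive
z <- factor(sample(rep(1:5, length.out = n))) # 5 categories, shuffled
d <- data.frame(x, y, z)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(d, aes(x, y)) +
    geom_point(aes(colour = z), alpha = 0.5) +                    # colour-coded, semi-transparent
    geom_smooth(method = "loess", se = TRUE) +                    # smooth curve
    geom_smooth(method = "lm", se = FALSE, linetype = "dashed")   # straight best-fit line
  if (interactive()) print(p)
}
```

Because z is a factor, the colour scale is discrete with one colour per category; a numeric z would produce a continuous gradient instead.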

Sunday, 24 March 2013

What is Big data

Big Data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.

Tableau as an Infographic Tool for Big Data

Every e-commerce company has to handle a large amount of data once it is past the initial start-up phase, and it needs the right tool to extract the data required for well-informed decisions.

On the same lines employees can access 50+ petabytes of data on everything from user behavior to online transactions to customer shipments and much more -- with access controls in place to ensure users see only what they're authorized to see.

Coming to Tableau

It provides visualization software to turn large, complex data sets into intuitive, interactive pictures. Advanced analytics tools require specialized skills. Interactive data visualization tools like Tableau's, on the other hand, enable almost any business user to become an analyst and identify trends on the fly.

We can use the Tableau 8 direct connectors to Google Analytics and Salesforce to enhance website-traffic analytics, blend data, create custom dashboards and support the sales process.

Visualized Data

When you input a company's details, it gives you an intuitive infographic from which the necessary details can be deduced.

Conclusion

Tableau can be used as an infographic tool for Big Data. Apart from this, it can be used as a Business Intelligence tool, a forecasting tool and a storytelling tool on the web.

Friday, 15 March 2013

IT Lab session 8

Do Panel Data Analysis of the "Produc" data using three types of model:
Pooled effect model
Fixed effect model
Random effect model

Determine which model is best by using the functions:
pFtest: for choosing between fixed and pooled
plmtest: for choosing between pooled and random
phtest: for choosing between random and fixed

> library(plm)
> data(Produc, package = "plm")
> head(Produc)
    state year     pcap     hwy   water    util       pc   gsp    emp unemp
1 ALABAMA 1970 15032.67 7325.80 1655.68 6051.20 35793.80 28418 1010.5   4.7
2 ALABAMA 1971 15501.94 7525.94 1721.02 6254.98 37299.91 29375 1021.9   5.2
3 ALABAMA 1972 15972.41 7765.42 1764.75 6442.23 38670.30 31303 1072.3   4.7
4 ALABAMA 1973 16406.26 7907.66 1742.41 6756.19 40084.01 33430 1135.5   3.9
5 ALABAMA 1974 16762.67 8025.52 1734.85 7002.29 42057.31 33749 1169.8   5.5
6 ALABAMA 1975 17316.26 8158.23 1752.27 7405.76 43971.71 33604 1155.4   7.7

Pooled Model
> pool <- plm(log(pcap)~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp) , data =Produc, model=("pooling"), index = c("state","year"))
> summary(pool)

Oneway (individual) effect Pooling Model

Call:
plm(formula = log(pcap) ~ log(hwy) + log(water) + log(util) +
log(pc) + log(gsp) + log(emp) + log(unemp), data = Produc,
model = ("pooling"), index = c("state", "year"))

Balanced Panel: n=48, T=17, N=816

Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-0.04950 -0.01940 -0.00412 0.01150 0.08690

Coefficients :
Estimate Std. Error t-value Pr(>|t|)
(Intercept) 0.7496721 0.0271054 27.6577 < 2.2e-16 ***
log(hwy) 0.5248704 0.0048326 108.6099 < 2.2e-16 ***
log(water) 0.1077579 0.0040454 26.6370 < 2.2e-16 ***
log(util) 0.4127255 0.0038337 107.6574 < 2.2e-16 ***
log(pc) -0.0330829 0.0048219 -6.8610 1.361e-11 ***
log(gsp) 0.0758341 0.0108650 6.9797 6.170e-12 ***
log(emp) -0.0891772 0.0076891 -11.5978 < 2.2e-16 ***
log(unemp) 0.0043878 0.0029465 1.4891 0.1368
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares: 724.14
Residual Sum of Squares: 0.56734
R-Squared : 0.99922
Adj. R-Squared : 0.98942
F-statistic: 147217 on 7 and 808 DF, p-value: < 2.22e-16

Fixed Model

> fixed <- plm(log(pcap)~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp) , data =Produc, model=("within"), index = c("state","year"))
> summary(fixed)

Oneway (individual) effect Within Model

Call:
plm(formula = log(pcap) ~ log(hwy) + log(water) + log(util) +
log(pc) + log(gsp) + log(emp) + log(unemp), data = Produc,
model = ("within"), index = c("state", "year"))

Balanced Panel: n=48, T=17, N=816

Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-0.069800 -0.005280 -0.000327 0.005360 0.061200

Coefficients :
Estimate Std. Error t-value Pr(>|t|)
log(hwy) 0.5418395 0.0109565 49.4536 < 2.2e-16 ***
log(water) 0.1215676 0.0053719 22.6304 < 2.2e-16 ***
log(util) 0.3909247 0.0065771 59.4368 < 2.2e-16 ***
log(pc) 0.0177190 0.0096372 1.8386 0.0663624 .
log(gsp) 0.0568433 0.0126569 4.4911 8.184e-06 ***
log(emp) -0.0851515 0.0146508 -5.8121 9.073e-09 ***
log(unemp) -0.0092135 0.0024988 -3.6872 0.0002429 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares: 9.4468
Residual Sum of Squares: 0.12613
R-Squared : 0.98665
Adj. R-Squared : 0.92015
F-statistic: 8033.41 on 7 and 761 DF, p-value: < 2.22e-16
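
What the "within" model does can be illustrated in base R without plm: subtracting each state's own mean from every variable and running ordinary least squares gives the same slope as a regression with one dummy per state. The sketch below uses a small simulated panel, not the Produc data; all names, sizes and the true slope of 2 are assumptions.

```r
set.seed(1)                                   # hypothetical seed
n_states <- 6
n_years  <- 10
state <- factor(rep(seq_len(n_states), each = n_years))

# State-specific intercepts that are correlated with the regressor
alpha <- rnorm(n_states)[as.integer(state)]
x <- rnorm(n_states * n_years) + alpha
y <- 2 * x + alpha + rnorm(n_states * n_years, sd = 0.1)

# Pooled OLS ignores the state effects
b_pooled <- coef(lm(y ~ x))["x"]

# Least-squares dummy variables: one intercept per state
b_lsdv <- coef(lm(y ~ x + state))["x"]

# Within transformation: demean by state, then OLS without intercept
x_dm <- x - ave(x, state)
y_dm <- y - ave(y, state)
b_within <- coef(lm(y_dm ~ x_dm - 1))["x_dm"]

print(c(pooled = unname(b_pooled), lsdv = unname(b_lsdv), within = unname(b_within)))
```

The dummy-variable and demeaned slopes agree up to rounding, while the pooled slope is generally pulled away from the true value whenever the state effects are correlated with x.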

Random Model
> random <- plm(log(pcap)~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp) , data =Produc, model=("random"), index = c("state","year"))
> summary(random)

Oneway (individual) effect Random Effect Model
(Swamy-Arora's transformation)

Call:
plm(formula = log(pcap) ~ log(hwy) + log(water) + log(util) +
log(pc) + log(gsp) + log(emp) + log(unemp), data = Produc,
model = ("random"), index = c("state", "year"))

Balanced Panel: n=48, T=17, N=816

Effects:
var std.dev share
idiosyncratic 0.0001657 0.0128743 0.221
individual 0.0005848 0.0241825 0.779
theta: 0.8719

Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-0.06500 -0.00624 -0.00195 0.00454 0.06450

Coefficients :
Estimate Std. Error t-value Pr(>|t|)
(Intercept) 0.6625006 0.0530786 12.4815 < 2.2e-16 ***
log(hwy) 0.5021294 0.0074551 67.3537 < 2.2e-16 ***
log(water) 0.1191683 0.0049801 23.9289 < 2.2e-16 ***
log(util) 0.3944635 0.0060802 64.8768 < 2.2e-16 ***
log(pc) 0.0101901 0.0075870 1.3431 0.1796
log(gsp) 0.0599363 0.0122997 4.8730 1.323e-06 ***
log(emp) -0.0767378 0.0125556 -6.1119 1.531e-09 ***
log(unemp) -0.0034020 0.0022591 -1.5059 0.1325
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares: 21.167
Residual Sum of Squares: 0.13965

Pooled vs Fixed
Null Hypothesis: Pooled Model
Alternate Hypothesis : Fixed Model

> pFtest(fixed,pool)

F test for individual effects

data: log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp)
F = 56.6361, df1 = 47, df2 = 761, p-value < 2.2e-16
alternative hypothesis: significant effects

Since the p-value is negligible, we reject the null hypothesis in favour of the alternative: the Fixed Model is better than the Pooled Model.

Pooled vs Random
Null Hypothesis: Pooled Model
Alternate Hypothesis: Random Model

> plmtest(pool)

Lagrange Multiplier Test - (Honda)

data: log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp)
normal = 57.1686, p-value < 2.2e-16
alternative hypothesis: significant effects

Since the p-value is negligible, we reject the null hypothesis in favour of the alternative: the Random Model is better than the Pooled Model.

Random vs Fixed
Null Hypothesis: No correlation between the individual effects and the regressors (Random Model)
Alternate Hypothesis: Fixed Model

> phtest(fixed,random)

Hausman Test

data: log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp)
chisq = 93.546, df = 7, p-value < 2.2e-16
alternative hypothesis: one model is inconsistent

Since the p-value is negligible, we reject the null hypothesis in favour of the alternative and choose the Fixed Model.

Conclusion:
So after making all the comparisons we come to the conclusion that Fixed Model is best suited to do the panel data analysis for "Produc" data set.

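Hence, we conclude that the states differ through state-specific effects that are correlated with the regressors, so the coefficients are best estimated from the variation within each "state" over time.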
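
The three comparisons can be folded into one small decision rule. The helper below is a hypothetical sketch (choose_panel_model is not part of plm) that takes the three p-values and a significance level and returns the name of the preferred plm model. The Hausman test decides between "within" and "random" first, and the winner is then checked against "pooling".

```r
# Hypothetical helper, not a plm function.
# p_ftest:   p-value of pFtest(fixed, pool)    (pooled vs fixed)
# p_lm:      p-value of plmtest(pool)          (pooled vs random)
# p_hausman: p-value of phtest(fixed, random)  (random vs fixed)
choose_panel_model <- function(p_ftest, p_lm, p_hausman, alpha = 0.05) {
  if (p_hausman < alpha) {
    # Random effects rejected: fixed effects, if they beat pooling
    if (p_ftest < alpha) "within" else "pooling"
  } else {
    # Random effects not rejected: keep them if they beat pooling
    if (p_lm < alpha) "random" else "pooling"
  }
}

# All three tests above report p < 2.2e-16, so use that as an upper bound
choice <- choose_panel_model(2.2e-16, 2.2e-16, 2.2e-16)
print(choice)
```

With plm loaded, the p-values can be read from the test objects themselves, for example pFtest(fixed, pool)$p.value, since these tests return standard "htest" objects.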

Thursday, 14 February 2013

IT Lab - 6

Stationary Time Series
#Assignment-1 Create log of the return data (way 1: log of (s_t - s_(t-1))/s_(t-1), where s_t is the closing price on day t)
> #Calculate historical volatility.
> #Create the acf plot for the log(returns) data, run the adf test and interpret. NSE Nifty index (from Jan 2012 to 31 Jan 2013)
Program:
> z<-read.csv(file.choose(),header=T)
> closingprice<-z$Close
> closingprice.ts<-ts(closingprice,frequency=252)
> laggingtable<-cbind(closingprice.ts,lag(closingprice.ts,k=-1),closingprice.ts-lag(closingprice.ts,k=-1))
> Return<-(closingprice.ts-lag(closingprice.ts,k=-1))/lag(closingprice.ts,k=-1)
> Manipulate<-scale(Return)+10
> logreturn<-log(Manipulate)

> acf(logreturn)

The figure shows that all the autocorrelations lie within the 95% confidence bounds, and hence we can
say that the time series is stationary.
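
The dashed bounds in the acf plot are approximately plus or minus 1.96/sqrt(n), where n is the number of observations. A sketch of how the same check can be done numerically, using simulated white noise as a stand-in for the return series (the seed, the series r and the choice of 20 lags are assumptions):

```r
set.seed(3)                          # hypothetical seed
r <- rnorm(250)                      # stand-in for a return series

a <- acf(r, lag.max = 20, plot = FALSE)
rho <- a$acf[-1]                     # drop lag 0, which is always 1

# 95% bounds used for the dashed lines in the default acf plot
bound <- qnorm(0.975) / sqrt(length(r))

outside <- sum(abs(rho) > bound)     # number of lags outside the bounds
print(c(bound = bound, lags_outside = outside))
```

Even for pure white noise, roughly 1 lag in 20 can be expected to fall outside the bounds by chance, so an occasional small excursion is not by itself evidence against stationarity.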
>annualize<-252^.5   # square root of 252 trading days; avoids overwriting T (TRUE)
>Historicalvolatility<-sd(Return)*annualize
> Historicalvolatility
[1] 0.1475815
> library(tseries)
> adf.test(logreturn)

Augmented Dickey-Fuller Test

data: logreturn
Dickey-Fuller = -5.656, Lag order = 6, p-value = 0.01
alternative hypothesis: stationary

Warning message:
In adf.test(logreturn) : p-value smaller than printed p-value

Since the p-value is less than 0.05 (1 - 0.95), the null hypothesis of a unit root is rejected; hence the time series is stationary and data analysis can be done.
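
For reference, a base-R sketch of daily returns and annualized historical volatility on a simulated price series. The prices, seed and drift are made up. It uses the conventional log return, diff(log(price)), which is not the same as the scale()+10 transform applied in the program above.

```r
set.seed(11)                                           # hypothetical seed
# Simulated closing prices, roughly one trading year
price <- 5000 * cumprod(1 + rnorm(260, mean = 0.0005, sd = 0.01))

simple_ret <- diff(price) / head(price, -1)            # (s_t - s_(t-1)) / s_(t-1)
log_ret    <- diff(log(price))                         # log(s_t / s_(t-1))

# Annualize the daily standard deviation with 252 trading days
hist_vol <- sd(log_ret) * sqrt(252)
print(hist_vol)
```

The two return definitions are linked by log_ret = log(1 + simple_ret), so for small daily moves they are nearly identical.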

Thursday, 7 February 2013

IT lab Session 5

Assignment 1

Find the returns of NSE data covering more than 6 months, taking the 10th data point as the start and the 95th data point as the end.

Commands:

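z<-read.csv(file.choose(),header=T)
Close<-z$Close
Close
# full series, 252 trading days per year
Close.ts<-ts(Close,deltat=1/252)
# keep the 10th to 95th observations; give deltat only (frequency=1 contradicted deltat=1/252)
z1.ts<-ts(Close[10:95],deltat=1/252)
z1.ts
# day-to-day change (the original called diff() on the undefined object z1)
z1.diff<-diff(z1.ts)
# previous day's close on the same interval; the argument is lower-case k
z2<-lag(z1.ts,k=-1)
Returns<-z1.diff/z2
plot(Returns,main=" Returns from 10th to 95th day of NSE Mid-cap Index ")
z3<-cbind(z1.ts,z1.diff,Returns)
plot(z3,main=" Data from 10th-95th day ; Difference ; Returns")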

Assignment 2

Data for observations 1-700 is available. Predict the outcome for observations 701-850, using GLM estimation with LOGIT analysis.

z<-read.csv(file.choose(),header=T)
data1<-z[1:700,1:9]
head(data1)
data1$ed<-factor(data1$ed)
data.est<-glm(default ~ age + ed + employ + address + income, data=data1, family ="binomial")
summary(data.est)
# named "holdout" rather than "return" so that R's return() is not masked
holdout<-z[701:850,1:8]
holdout$ed<-factor(holdout$ed)
holdout$probability<-predict(data.est,newdata=holdout,type="response")
head(holdout)
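
The same fit-on-700, predict-on-150 workflow can be tried without the CSV file. The sketch below simulates a data set with made-up columns and coefficients (it only mimics the shape of the assignment data), fits the logit model on the first 700 rows and turns the predicted probabilities for rows 701-850 into 0/1 predictions with an assumed 0.5 cut-off.

```r
set.seed(5)                                      # hypothetical seed
n <- 850
d <- data.frame(
  age    = round(runif(n, 20, 60)),
  income = round(rlnorm(n, meanlog = 3.5, sdlog = 0.5))
)
# Made-up true model for the default indicator
eta <- -1 + 0.04 * (40 - d$age) - 0.01 * d$income
d$default <- rbinom(n, 1, plogis(eta))

train   <- d[1:700, ]                            # rows with known outcome
holdout <- d[701:850, c("age", "income")]        # rows to predict

fit <- glm(default ~ age + income, data = train, family = binomial)

# type = "response" returns probabilities instead of log-odds
holdout$probability <- predict(fit, newdata = holdout, type = "response")
holdout$predicted   <- as.integer(holdout$probability > 0.5)   # assumed cut-off
head(holdout)
```

The 0.5 cut-off is only a convention; when defaults are rare, a lower threshold is often chosen so that the model flags more than a handful of cases.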