Saturday, 30 March 2013

IT Lab Session 10


Assignment 1: Create 3 vectors x, y and z, fill them with random values of equal length, and bind them into a matrix:
T <- cbind(x, y, z)
Create a 3-dimensional plot of this matrix, in all 3 types taught in the lab.

Commands (plot3d is provided by the rgl package, so load it first with library(rgl)):

3D plots:

Normal Plot: plot3d(T[, 1:3])



Colour Plot: plot3d(T[, 1:3], col = rainbow(1000))

Colour Plot of spheres: plot3d(T[, 1:3], col = rainbow(1000), type = 's')
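
A minimal sketch of the whole assignment, assuming the rgl package is installed. The sample size of 1000, the seed and the three distributions are arbitrary choices for illustration. The matrix is called pts here instead of T, because T is also R's shorthand for TRUE and reusing it can cause confusing bugs.

```r
# Three random vectors of equal length (values chosen arbitrarily)
set.seed(42)
n <- 1000
x <- rnorm(n)
y <- runif(n)
z <- rexp(n)

# Bind them column-wise into an n x 3 matrix
pts <- cbind(x, y, z)

# Draw the 3D plots only in an interactive session with rgl available
if (interactive() && requireNamespace("rgl", quietly = TRUE)) {
  rgl::plot3d(pts[, 1:3])                                 # plain points
  rgl::plot3d(pts[, 1:3], col = rainbow(n))               # coloured points
  rgl::plot3d(pts[, 1:3], col = rainbow(n), type = "s")   # coloured spheres
}
```

Using rainbow(n) instead of a fixed rainbow(1000) keeps one colour per point even if the vector length changes.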


Assignment 2:

Choose 2 random variables, x and y.
Create the following plots:
1. X-Y
2. X-Y|Z (introduce a third variable z with 5 different categories and cbind it to x and y)
3. Colour-code the points and draw the graph
4. Add a smooth curve and a best-fit line
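
One way to set up the data for these plots, sketched with made-up values: x and y are numeric, z is a grouping variable with 5 categories, and the three are combined with cbind. The sample size, the seed and the relationship between x and y are assumptions for illustration only.

```r
set.seed(1)
n <- 200
x <- runif(n, min = 1, max = 100)     # strictly positive, so log(x) is defined
y <- 2 * x + rnorm(n, sd = 10) + 50   # roughly linear in x
z <- sample(1:5, n, replace = TRUE)   # 5 categories coded 1 to 5

d <- cbind(x, y, z)                   # numeric matrix with columns x, y, z
head(d)
table(z)                              # number of points in each category
```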



 

> library(ggplot2)
> qplot(x,y)


>qplot(x,z)


Semi-transparent plot

> qplot(x,z, alpha=I(2/10))

Colour plot

> qplot(x,y, color=z)

Logarithmic colour plot

> qplot(log(x),log(y), color=z)





Best fit and smooth curve using the "geom" argument of qplot

> qplot(x,y,geom=c("path","smooth"))


> qplot(x,y,geom=c("point","smooth"))


> qplot(x,y,geom=c("boxplot","jitter"))
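
The qplot calls above can also be written with ggplot() and explicit layers, which is the style recommended in recent ggplot2 releases (qplot is marked as deprecated there). The sketch below uses made-up data, and it treats z as a factor so that the box plot gets one box per category; both choices are assumptions, not part of the lab handout.

```r
library(ggplot2)

# Illustrative data: numeric x and y, plus a 5-level grouping factor z
set.seed(1)
d <- data.frame(x = runif(200, 1, 100))
d$y <- 2 * d$x + rnorm(200, sd = 10) + 50
d$z <- factor(sample(1:5, 200, replace = TRUE))

# Scatter plot coloured by category, with a smoothed trend
p_smooth <- ggplot(d, aes(x, y)) +
  geom_point(aes(colour = z), alpha = 0.6) +
  geom_smooth()

# Same points with a straight best-fit line
p_line <- ggplot(d, aes(x, y)) +
  geom_point(aes(colour = z)) +
  geom_smooth(method = "lm", se = FALSE)

# Box plot of y per category with jittered points on top
p_box <- ggplot(d, aes(z, y)) +
  geom_boxplot() +
  geom_jitter(width = 0.2, alpha = 0.4)

print(p_smooth)
print(p_line)
print(p_box)
```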

 

Sunday, 24 March 2013

IT Lab Session 9

What is Big Data?

Big Data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.

Tableau as an Infographic Tool for Big Data


Every e-commerce company has to handle large amounts of data once it moves past the initial start-up phase, and it needs the right tool to extract the data required for well-informed decisions.

Along the same lines, employees can access 50+ petabytes of data on everything from user behavior to online transactions to customer shipments and much more, with access controls in place to ensure users see only what they are authorized to see.

Coming to Tableau

It provides visualization software that turns large, complex data sets into intuitive, interactive pictures. Advanced analytics tools require specialized skills; interactive data visualization tools like Tableau's, on the other hand, enable almost any business user to become an analyst and identify trends on the fly.

We can use the Tableau 8 direct connectors to Google Analytics and Salesforce to enhance website-traffic analytics, blend data, create custom dashboards and support the sales process.


Visualized Data

When you input a company's details, Tableau gives you an intuitive infographic from which the necessary details can be deduced.



Conclusion

Tableau can be used as an infographic tool for Big Data. Apart from this, it can be used as a Business Intelligence tool, a forecasting tool and a storytelling tool on the web.

Friday, 15 March 2013

IT Lab Session 8


Do a panel data analysis of the "Produc" data set using three types of model:
      Pooled effect model
      Fixed effect model
      Random effect model
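
For reference, the three models can be written in their usual textbook form. The notation is an assumption of this sketch and is not taken from the lab handout: i indexes the state, t the year, y_{it} is the dependent variable (log(pcap) in this session) and x_{it} is the vector of regressors.

```latex
\begin{align*}
\text{Pooled:} \quad & y_{it} = \alpha + x_{it}'\beta + \varepsilon_{it} \\
\text{Fixed:}  \quad & y_{it} = \alpha_i + x_{it}'\beta + \varepsilon_{it} \\
\text{Random:} \quad & y_{it} = \alpha + x_{it}'\beta + u_i + \varepsilon_{it},
  \qquad u_i \sim (0, \sigma_u^2) \text{ uncorrelated with } x_{it}
\end{align*}
```

The pooled model uses one intercept for all states, the fixed model estimates a separate intercept for each state, and the random model treats the state-specific term as part of the error.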

Determine which model is best using the following functions:
       pFtest: to choose between fixed and pooled
       plmtest: to choose between pooled and random
       phtest: to choose between random and fixed

> library(plm)
> data(Produc , package ="plm")
>  head(Produc)
    state year     pcap     hwy   water    util       pc   gsp    emp unemp
1 ALABAMA 1970 15032.67 7325.80 1655.68 6051.20 35793.80 28418 1010.5   4.7
2 ALABAMA 1971 15501.94 7525.94 1721.02 6254.98 37299.91 29375 1021.9   5.2
3 ALABAMA 1972 15972.41 7765.42 1764.75 6442.23 38670.30 31303 1072.3   4.7
4 ALABAMA 1973 16406.26 7907.66 1742.41 6756.19 40084.01 33430 1135.5   3.9
5 ALABAMA 1974 16762.67 8025.52 1734.85 7002.29 42057.31 33749 1169.8   5.5
6 ALABAMA 1975 17316.26 8158.23 1752.27 7405.76 43971.71 33604 1155.4   7.7

Pooled Model
> pool <- plm(log(pcap)~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp) , data =Produc, model=("pooling"), index = c("state","year"))
> summary(pool)

Oneway (individual) effect Pooling Model

Call:
plm(formula = log(pcap) ~ log(hwy) + log(water) + log(util) +
    log(pc) + log(gsp) + log(emp) + log(unemp), data = Produc,
    model = ("pooling"), index = c("state", "year"))

Balanced Panel: n=48, T=17, N=816

Residuals :
    Min.  1st Qu.   Median  3rd Qu.     Max.
-0.04950 -0.01940 -0.00412  0.01150  0.08690

Coefficients :
              Estimate Std. Error  t-value  Pr(>|t|)  
(Intercept)  0.7496721  0.0271054  27.6577 < 2.2e-16 ***
log(hwy)     0.5248704  0.0048326 108.6099 < 2.2e-16 ***
log(water)   0.1077579  0.0040454  26.6370 < 2.2e-16 ***
log(util)    0.4127255  0.0038337 107.6574 < 2.2e-16 ***
log(pc)     -0.0330829  0.0048219  -6.8610 1.361e-11 ***
log(gsp)     0.0758341  0.0108650   6.9797 6.170e-12 ***
log(emp)    -0.0891772  0.0076891 -11.5978 < 2.2e-16 ***
log(unemp)   0.0043878  0.0029465   1.4891    0.1368  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares:    724.14
Residual Sum of Squares: 0.56734
R-Squared      :  0.99922
      Adj. R-Squared :  0.98942
F-statistic: 147217 on 7 and 808 DF, p-value: < 2.22e-16

Fixed Model

> fixed <- plm(log(pcap)~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp) , data =Produc, model=("within"), index = c("state","year"))
> summary(fixed)

Oneway (individual) effect Within Model

Call:
plm(formula = log(pcap) ~ log(hwy) + log(water) + log(util) +
    log(pc) + log(gsp) + log(emp) + log(unemp), data = Produc,
    model = ("within"), index = c("state", "year"))

Balanced Panel: n=48, T=17, N=816

Residuals :
     Min.   1st Qu.    Median   3rd Qu.      Max.
-0.069800 -0.005280 -0.000327  0.005360  0.061200

Coefficients :
             Estimate Std. Error t-value  Pr(>|t|)  
log(hwy)    0.5418395  0.0109565 49.4536 < 2.2e-16 ***
log(water)  0.1215676  0.0053719 22.6304 < 2.2e-16 ***
log(util)   0.3909247  0.0065771 59.4368 < 2.2e-16 ***
log(pc)     0.0177190  0.0096372  1.8386 0.0663624 .
log(gsp)    0.0568433  0.0126569  4.4911 8.184e-06 ***
log(emp)   -0.0851515  0.0146508 -5.8121 9.073e-09 ***
log(unemp) -0.0092135  0.0024988 -3.6872 0.0002429 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares:    9.4468
Residual Sum of Squares: 0.12613
R-Squared      :  0.98665
      Adj. R-Squared :  0.92015
F-statistic: 8033.41 on 7 and 761 DF, p-value: < 2.22e-16

Random Model
> random <- plm(log(pcap)~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp) , data =Produc, model=("random"), index = c("state","year"))
> summary(random)

Oneway (individual) effect Random Effect Model
   (Swamy-Arora's transformation)

Call:
plm(formula = log(pcap) ~ log(hwy) + log(water) + log(util) +
    log(pc) + log(gsp) + log(emp) + log(unemp), data = Produc,
    model = ("random"), index = c("state", "year"))

Balanced Panel: n=48, T=17, N=816

Effects:
                    var   std.dev share
idiosyncratic 0.0001657 0.0128743 0.221
individual    0.0005848 0.0241825 0.779
theta:  0.8719

Residuals :
    Min.  1st Qu.   Median  3rd Qu.     Max.
-0.06500 -0.00624 -0.00195  0.00454  0.06450

Coefficients :
              Estimate Std. Error t-value  Pr(>|t|)  
(Intercept)  0.6625006  0.0530786 12.4815 < 2.2e-16 ***
log(hwy)     0.5021294  0.0074551 67.3537 < 2.2e-16 ***
log(water)   0.1191683  0.0049801 23.9289 < 2.2e-16 ***
log(util)    0.3944635  0.0060802 64.8768 < 2.2e-16 ***
log(pc)      0.0101901  0.0075870  1.3431    0.1796  
log(gsp)     0.0599363  0.0122997  4.8730 1.323e-06 ***
log(emp)    -0.0767378  0.0125556 -6.1119 1.531e-09 ***
log(unemp)  -0.0034020  0.0022591 -1.5059    0.1325  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares:    21.167
Residual Sum of Squares: 0.13965
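
The "share" and "theta" values in the Effects block can be checked by hand from the two standard deviations. The sketch below assumes the usual balanced-panel formula for the random effects weight, theta = 1 - sqrt(s_e^2 / (s_e^2 + T * s_u^2)), where s_e and s_u are the idiosyncratic and individual standard deviations; the variable names are chosen here for illustration.

```r
# Values copied from the "Effects" block of summary(random)
sd_idios <- 0.0128743   # idiosyncratic std.dev
sd_indiv <- 0.0241825   # individual std.dev
n_years  <- 17          # T in "Balanced Panel: n=48, T=17, N=816"

# Share of the total variance that is due to the individual (state) effect
share_indiv <- sd_indiv^2 / (sd_indiv^2 + sd_idios^2)

# Quasi-demeaning weight for a balanced panel
theta <- 1 - sqrt(sd_idios^2 / (sd_idios^2 + n_years * sd_indiv^2))

round(c(share_indiv = share_indiv, theta = theta), 4)
```

Both numbers should come out close to the 0.779 and 0.8719 reported above. A theta this close to 1 means the random effects estimator behaves much more like the fixed (within) estimator than like the pooled one.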

Pooled vs Fixed
Null Hypothesis: Pooled Model
Alternative Hypothesis: Fixed Model

>  pFtest(fixed,pool)

        F test for individual effects

data:  log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) +      log(emp) + log(unemp)
F = 56.6361, df1 = 47, df2 = 761, p-value < 2.2e-16
alternative hypothesis: significant effects

Since the p-value is negligible, we reject the Null Hypothesis and accept the Alternative Hypothesis: the Fixed Model is better than the Pooled Model.
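
The F statistic reported by pFtest can be reproduced approximately from the two residual sums of squares printed in the summaries. This is a sketch of the standard F test for nested models, using the rounded values from the output, so the result will differ slightly in the later decimals.

```r
rss_pool  <- 0.56734   # Residual Sum of Squares, pooled model
rss_fixed <- 0.12613   # Residual Sum of Squares, fixed model

df1 <- 48 - 1          # 48 state intercepts replace 1 common intercept
df2 <- 816 - 48 - 7    # observations - state intercepts - slope coefficients

# Drop in RSS per restriction, relative to the fixed model's residual variance
f_stat <- ((rss_pool - rss_fixed) / df1) / (rss_fixed / df2)
p_val  <- pf(f_stat, df1, df2, lower.tail = FALSE)

round(f_stat, 2)
p_val
```

The degrees of freedom match the df1 = 47 and df2 = 761 shown in the test output, and the statistic lands close to the reported F = 56.6361.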

Pooled vs Random
Null Hypothesis: Pooled Model
Alternative Hypothesis: Random Model

>  plmtest(pool)

        Lagrange Multiplier Test - (Honda)

data:  log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) +      log(emp) + log(unemp)
normal = 57.1686, p-value < 2.2e-16
alternative hypothesis: significant effects

Since the p-value is negligible, we reject the Null Hypothesis and accept the Alternative Hypothesis: the Random Model is better than the Pooled Model.

Random vs Fixed
Null Hypothesis: No correlation between the individual effects and the regressors (Random Model)
Alternative Hypothesis: Fixed Model

> phtest(fixed,random)

        Hausman Test

data:  log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) +      log(emp) + log(unemp)
chisq = 93.546, df = 7, p-value < 2.2e-16
alternative hypothesis: one model is inconsistent

Since the p-value is negligible, we reject the Null Hypothesis and accept the Alternative Hypothesis: the Fixed Model is preferred over the Random Model.
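
For reference, the Hausman statistic compares the two sets of slope estimates. In its usual textbook form (the notation is an assumption of this sketch), with FE for the fixed model and RE for the random model:

```latex
H = (\hat\beta_{FE} - \hat\beta_{RE})'
    \left[ \widehat{\mathrm{Var}}(\hat\beta_{FE}) - \widehat{\mathrm{Var}}(\hat\beta_{RE}) \right]^{-1}
    (\hat\beta_{FE} - \hat\beta_{RE})
    \;\sim\; \chi^2_k \quad \text{under } H_0
```

Here k is the number of slope coefficients common to both models, which is 7 in this session and matches the df = 7 in the output above. A large H means the two estimators disagree, which is evidence against the Random Model.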

Conclusion:
 After making all the comparisons, we conclude that the Fixed Model is best suited for the panel data analysis of the "Produc" data set.

Hence we conclude that there are significant state-specific effects that are correlated with the regressors, so the coefficients are best estimated from the variation within each "state" over time.
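
The whole comparison can be condensed into a short script. This is a sketch, assuming the plm package is installed; the helper fit() and the vector p_values are names introduced here for illustration, and the 0.05 cut-off is a conventional choice rather than something fixed by the lab.

```r
library(plm)
data("Produc", package = "plm")

# Shared formula, so the three models differ only in the estimator
f <- log(pcap) ~ log(hwy) + log(water) + log(util) +
  log(pc) + log(gsp) + log(emp) + log(unemp)

# Hypothetical helper: fit one plm model of the given type
fit <- function(model) {
  plm(f, data = Produc, model = model, index = c("state", "year"))
}
pool   <- fit("pooling")
fixed  <- fit("within")
random <- fit("random")

# p-values of the three specification tests
p_values <- c(
  pooled_vs_fixed  = unname(pFtest(fixed, pool)$p.value),
  pooled_vs_random = unname(plmtest(pool)$p.value),
  random_vs_fixed  = unname(phtest(fixed, random)$p.value)
)

# TRUE means the null hypothesis of that test is rejected at the 5% level
print(p_values < 0.05)
```

If all three entries are TRUE, the tests point the same way as the step-by-step analysis above: pooled is rejected against both alternatives, and random is rejected against fixed.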