Recklessly Impulsive: March 2013

What is Big Data?

Big Data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.

Tableau as an Infographic Tool for Big Data

Every e-commerce company has to handle large amounts of data once it grows past the initial start-up phase. It needs the right tool to extract the data required for well-informed decisions.

Along the same lines, employees can access 50+ petabytes of data on everything from user behavior to online transactions to customer shipments and much more -- with access controls in place to ensure users see only what they're authorized to see.

Coming to Tableau

Tableau provides visualization software that turns large, complex data sets into intuitive, interactive pictures. Advanced analytics tools require specialized skills. Interactive data visualization tools like Tableau's, on the other hand, enable almost any business user to become an analyst and identify trends on the fly.

We can use the Tableau 8 direct connectors to Google Analytics and Salesforce to enhance website-traffic analytics, blend data, create custom dashboards, and support the sales process.

Visualized Data

When you input a company's details, Tableau produces an intuitive infographic from which the necessary details can be read off.

Conclusion

Tableau can be used as an infographic tool for Big Data. Apart from this, it can serve as a business intelligence, forecasting, and storytelling tool on the web.

Perform a panel data analysis of the "Produc" data set using three types of model:
Pooled model
Fixed effects model
Random effects model

Determine which model is best using the following functions:
pFtest: chooses between fixed and pooled
plmtest: chooses between pooled and random
phtest: chooses between random and fixed
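
The way the three tests combine into a single choice can be sketched as a small decision rule. This is only an illustration: `choose_panel_model` is a hypothetical helper (it is not part of plm), the 5% significance level is an assumption, and the p-values in the usage line are made-up placeholders.

```r
# Hypothetical helper, not part of plm: maps three p-values to a model choice.
choose_panel_model <- function(p_f, p_lm, p_hausman, alpha = 0.05) {
  fixed_beats_pooled  <- p_f  < alpha   # pFtest rejects the pooled model
  random_beats_pooled <- p_lm < alpha   # plmtest rejects the pooled model

  # Neither test finds individual effects: stay with the pooled model
  if (!fixed_beats_pooled && !random_beats_pooled) {
    return("pooled")
  }
  # Both find individual effects: let the Hausman test (phtest) decide
  if (fixed_beats_pooled && random_beats_pooled) {
    return(if (p_hausman < alpha) "fixed" else "random")
  }
  # Only one of the two tests rejects pooling
  if (fixed_beats_pooled) "fixed" else "random"
}

# Placeholder p-values: all three tests reject their null hypothesis
choose_panel_model(p_f = 1e-16, p_lm = 1e-16, p_hausman = 1e-16)
```

When all three p-values are tiny, the rule returns "fixed"; when the Hausman p-value is large, it falls back to "random".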

> library(plm)
> data(Produc, package = "plm")
> head(Produc)
    state year     pcap     hwy   water    util       pc   gsp    emp unemp
1 ALABAMA 1970 15032.67 7325.80 1655.68 6051.20 35793.80 28418 1010.5   4.7
2 ALABAMA 1971 15501.94 7525.94 1721.02 6254.98 37299.91 29375 1021.9   5.2
3 ALABAMA 1972 15972.41 7765.42 1764.75 6442.23 38670.30 31303 1072.3   4.7
4 ALABAMA 1973 16406.26 7907.66 1742.41 6756.19 40084.01 33430 1135.5   3.9
5 ALABAMA 1974 16762.67 8025.52 1734.85 7002.29 42057.31 33749 1169.8   5.5
6 ALABAMA 1975 17316.26 8158.23 1752.27 7405.76 43971.71 33604 1155.4   7.7

Pooled Model
> pool <- plm(log(pcap)~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp) , data =Produc, model=("pooling"), index = c("state","year"))
> summary(pool)

Oneway (individual) effect Pooling Model

Call:
plm(formula = log(pcap) ~ log(hwy) + log(water) + log(util) +
log(pc) + log(gsp) + log(emp) + log(unemp), data = Produc,
model = ("pooling"), index = c("state", "year"))

Balanced Panel: n=48, T=17, N=816

Residuals :
    Min.  1st Qu.   Median  3rd Qu.     Max.
-0.04950 -0.01940 -0.00412  0.01150  0.08690

Coefficients :
              Estimate Std. Error  t-value  Pr(>|t|)
(Intercept)  0.7496721  0.0271054  27.6577 < 2.2e-16 ***
log(hwy)     0.5248704  0.0048326 108.6099 < 2.2e-16 ***
log(water)   0.1077579  0.0040454  26.6370 < 2.2e-16 ***
log(util)    0.4127255  0.0038337 107.6574 < 2.2e-16 ***
log(pc)     -0.0330829  0.0048219  -6.8610 1.361e-11 ***
log(gsp)     0.0758341  0.0108650   6.9797 6.170e-12 ***
log(emp)    -0.0891772  0.0076891 -11.5978 < 2.2e-16 ***
log(unemp)   0.0043878  0.0029465   1.4891    0.1368
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares: 724.14
Residual Sum of Squares: 0.56734
R-Squared : 0.99922
Adj. R-Squared : 0.98942
F-statistic: 147217 on 7 and 808 DF, p-value: < 2.22e-16

Fixed Model

> fixed <- plm(log(pcap)~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp) , data =Produc, model=("within"), index = c("state","year"))
> summary(fixed)

Oneway (individual) effect Within Model

Call:
plm(formula = log(pcap) ~ log(hwy) + log(water) + log(util) +
log(pc) + log(gsp) + log(emp) + log(unemp), data = Produc,
model = ("within"), index = c("state", "year"))

Balanced Panel: n=48, T=17, N=816

Residuals :
     Min.   1st Qu.    Median   3rd Qu.      Max.
-0.069800 -0.005280 -0.000327  0.005360  0.061200

Coefficients :
              Estimate Std. Error t-value  Pr(>|t|)
log(hwy)     0.5418395  0.0109565 49.4536 < 2.2e-16 ***
log(water)   0.1215676  0.0053719 22.6304 < 2.2e-16 ***
log(util)    0.3909247  0.0065771 59.4368 < 2.2e-16 ***
log(pc)      0.0177190  0.0096372  1.8386 0.0663624 .
log(gsp)     0.0568433  0.0126569  4.4911 8.184e-06 ***
log(emp)    -0.0851515  0.0146508 -5.8121 9.073e-09 ***
log(unemp)  -0.0092135  0.0024988 -3.6872 0.0002429 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares: 9.4468
Residual Sum of Squares: 0.12613
R-Squared : 0.98665
Adj. R-Squared : 0.92015
F-statistic: 8033.41 on 7 and 761 DF, p-value: < 2.22e-16
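
As a quick sanity check, the R-squared values reported for the pooled and within models can be reproduced from the sums of squares printed in the two summaries. The sketch below only copies those printed numbers; the variable names are our own. Note that the within model's total sum of squares (9.4468) measures variation around each state's own mean, which is why its R-squared is not directly comparable with the pooled one.

```r
# Sums of squares copied from the summary() outputs of the two models
tss_pool  <- 724.14;  rss_pool  <- 0.56734   # pooled model
tss_fixed <- 9.4468;  rss_fixed <- 0.12613   # within (fixed effects) model

# R-squared = 1 - residual sum of squares / total sum of squares
r2_pool  <- 1 - rss_pool / tss_pool
r2_fixed <- 1 - rss_fixed / tss_fixed

round(c(pooled = r2_pool, within = r2_fixed), 5)
```

Both values agree with the R-Squared lines in the outputs (0.99922 and 0.98665).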

Random Model
> random <- plm(log(pcap)~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp) , data =Produc, model=("random"), index = c("state","year"))
> summary(random)

Oneway (individual) effect Random Effect Model
(Swamy-Arora's transformation)

Call:
plm(formula = log(pcap) ~ log(hwy) + log(water) + log(util) +
log(pc) + log(gsp) + log(emp) + log(unemp), data = Produc,
model = ("random"), index = c("state", "year"))

Balanced Panel: n=48, T=17, N=816

Effects:
                    var   std.dev share
idiosyncratic 0.0001657 0.0128743 0.221
individual    0.0005848 0.0241825 0.779
theta: 0.8719

Residuals :
    Min.  1st Qu.   Median  3rd Qu.     Max.
-0.06500 -0.00624 -0.00195  0.00454  0.06450

Coefficients :
              Estimate Std. Error t-value  Pr(>|t|)
(Intercept)  0.6625006  0.0530786 12.4815 < 2.2e-16 ***
log(hwy)     0.5021294  0.0074551 67.3537 < 2.2e-16 ***
log(water)   0.1191683  0.0049801 23.9289 < 2.2e-16 ***
log(util)    0.3944635  0.0060802 64.8768 < 2.2e-16 ***
log(pc)      0.0101901  0.0075870  1.3431    0.1796
log(gsp)     0.0599363  0.0122997  4.8730 1.323e-06 ***
log(emp)    -0.0767378  0.0125556 -6.1119 1.531e-09 ***
log(unemp)  -0.0034020  0.0022591 -1.5059    0.1325
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares: 21.167
Residual Sum of Squares: 0.13965
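
The theta and share values in the Effects block can be recovered from the two variance components. The sketch below assumes the standard balanced-panel formula theta = 1 - sqrt(var_e / (var_e + T * var_u)); the inputs are the rounded values printed above, so the result matches the reported 0.8719 only up to rounding. Variable names are our own.

```r
# Variance components copied from the Effects block of the random model
sigma2_e  <- 0.0001657   # idiosyncratic variance
sigma2_u  <- 0.0005848   # individual (state) variance
T_periods <- 17          # years per state (balanced panel, T = 17)

# Share of the total variance due to the individual effect
share_u <- sigma2_u / (sigma2_u + sigma2_e)

# Quasi-demeaning parameter: 0 would mean pooled OLS, 1 the within estimator
theta <- 1 - sqrt(sigma2_e / (sigma2_e + T_periods * sigma2_u))

c(share_individual = share_u, theta = theta)
```

A theta this close to 1 means the random effects estimates sit much nearer to the fixed effects estimates than to the pooled ones, which is what the coefficient tables show.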
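
Pooled vs Fixed
Null hypothesis: there are no significant individual (state) effects, so the pooled model is adequate
Alternative hypothesis: individual effects are significant, so the fixed effects model is preferred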

> pFtest(fixed,pool)

F test for individual effects

data: log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp)
F = 56.6361, df1 = 47, df2 = 761, p-value < 2.2e-16
alternative hypothesis: significant effects

Since the p-value is negligible, we reject the null hypothesis and accept the alternative: the fixed effects model is better than the pooled model.
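
To see where the F statistic comes from, it can be rebuilt by hand from the residual sums of squares of the pooled and fixed models. This is a sketch using the rounded values printed in the two summaries, so it reproduces F = 56.6361 only approximately; the variable names are our own.

```r
# Residual sums of squares copied from the two summary() outputs
rss_pool  <- 0.56734   # restricted model (one common intercept)
rss_fixed <- 0.12613   # unrestricted model (one intercept per state)

df1 <- 47    # extra parameters: 48 states - 1
df2 <- 761   # residual df of the fixed model: 816 - 48 - 7

# F = (drop in RSS per extra parameter) / (residual variance of fixed model)
f_stat <- ((rss_pool - rss_fixed) / df1) / (rss_fixed / df2)
p_val  <- pf(f_stat, df1, df2, lower.tail = FALSE)

c(F = f_stat, p.value = p_val)
```

The hand-computed F is about 56.6, in line with the pFtest output, and the p-value is far below any conventional significance level.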
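
Pooled vs Random
Null hypothesis: the variance of the individual effects is zero, so the pooled model is adequate
Alternative hypothesis: individual effects are present, so the random effects model is preferred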

> plmtest(pool)

Lagrange Multiplier Test - (Honda)

data: log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp)
normal = 57.1686, p-value < 2.2e-16
alternative hypothesis: significant effects

Since the p-value is negligible, we reject the null hypothesis and accept the alternative: the random effects model is better than the pooled model.
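
Random vs Fixed
Null hypothesis: the individual effects are uncorrelated with the regressors, so the random effects model is consistent and preferred
Alternative hypothesis: the individual effects are correlated with the regressors, so only the fixed effects model is consistent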

> phtest(fixed,random)

Hausman Test

data: log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp)
chisq = 93.546, df = 7, p-value < 2.2e-16
alternative hypothesis: one model is inconsistent

Since the p-value is negligible, we reject the null hypothesis and accept the alternative: the fixed effects model is preferred over the random effects model.

Conclusion:
After making all three comparisons, we conclude that the fixed effects model is best suited for the panel data analysis of the "Produc" data set.

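Hence, we conclude that each id, i.e. each "state", has its own effect that does not vary over time: the differences between states are absorbed by state-specific intercepts, and the coefficients are estimated from the variation within each state across years.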

Saturday, 30 March 2013

IT Lab Session 10

Sunday, 24 March 2013

IT Lab Session 9

Friday, 15 March 2013

IT Lab session 8