BIIIIGG DATA!!!!!

The Vs behind Big Data

Big Data has become the buzzword of the IT sector and beyond over the last few years. Supposedly, the amount of data in the world is going to double within the next two years; in other words, everything the human race has ever created will multiply by two. That is a crazy forecast right there, folks. It is an absolute phenomenon and is going to change the business world, and possibly the whole world as we know it.
Companies can collect what seems like an infinite amount of it, yet the challenge is that much of that information remains underutilized. Data can be structured or unstructured, and Big Data analytics can extract important information from unstructured datasets and streams that were previously almost impossible to work with because of their complexity and volume.
There are a lot of articles and visualisations on the 4 Vs of Big Data.

We are going to talk about these and even add another V for good measure.

Volume
An estimated 90% of the world's data has been created in the last two years, and the amount a single company accumulates daily presents immediate challenges. Big Data technology lets us deal with these data sets through distributed systems, which spread the load across different data warehouses that can then be accessed via analytical software.

Velocity
This refers to the speed at which new data is generated and how quickly it moves around. Credit card transactions being checked for fraudulent activity, traffic flow sensors, satellite imagery, broadcast audio streams and financial market data are all examples of data generated at a huge pace that needs to be extracted and analysed immediately by Big Data analytical software.

Variety
This refers to the different types of data being created which we can collect and use. Data comes in different formats, both structured and unstructured. Structured data fits neatly into tables or ERDs, e.g. financial data. Unstructured data is a different matter altogether, especially with the rise of social media. Photos, videos, text and click streams have all contributed to the estimate that 80%-85% of the world's data is unstructured. Big Data technology is starting to find ways to harness this data and give it more structure.

Veracity
There is uncertainty around data due to its often inconsistent nature, and this is a huge challenge for Big Data technology. Have you ever seen those Facebook posts where people type exactly how they speak? … hmm. Hashtags and abbreviations are other examples. Volume often makes up for the lack of quality, though, and Big Data analytics is starting to solve this problem.
Let’s add another V in here to spice things up a bit …
Value
We can have all the volume, velocity and variety of data you can shake a stick at, but unless we can make good use of it, it is quite useless. Businesses can jump on the 'Big Data' bandwagon and have senior management running around the office shouting 'Data Warehouse' and 'Hadoop' all day, but they need a clear understanding of big data. They need to put it to good use, applying data mining analytics to extract the valuable business information embedded in structured, unstructured and streaming data and in data warehouses.

So there it is. The potential of Big Data is… Big … VERY BIG!
Harnessing it effectively is a recipe for success for any innovative business.

Appendix
www.sciencedaily.com/releases/2013/05/130522085217.htm
tezo.dbsdataprojects.com/2016/04/25/is-big-data-big-problems/

Explaining the BI in Business Intelligence


Kenneth Laudon defines Business Intelligence (BI) as collective information – about your customers, your competitors, your business partners, your competitive environment and your own internal operations – that gives you the ability to make effective, important and often strategic business decisions.
A recent report estimated that the Business Intelligence and Analytics Software Market will be worth $26.78 billion by 2020. Business intelligence is a powerful tool that can help organizations survive and thrive in a challenging, competitive business environment. BI lets you transform the abundant data held by systems within a company. It is a collection of decision support technologies that helps decision makers make better-informed choices and see results which a few years ago would have been very hard to extract.
In this day and age, most successful companies use BI technology or software in some form for their business.


Some of the basic elements of a BI stack:
– Data Warehouse: The cost of data storage and acquisition has fallen hugely, so businesses can afford to acquire large amounts of data. Data is collected here from a variety of data stores and applications throughout the business.
– BI analytical toolset: Analytics tools can be used to create analytical databases that make it faster and easier to run custom queries or perform data mining.
– Predictive analytics services: Predictive analytics uses techniques such as statistical, regression, correlation and cluster analysis – a few of these are sketched in R below.
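
To make that last point concrete, here is a minimal sketch in R of the kinds of techniques a predictive analytics service wraps up, using R's built-in cars and iris datasets purely for illustration:

# Regression: model stopping distance as a function of speed
model <- lm(dist ~ speed, data = cars)
summary(model)

# Correlation: strength of the linear relationship between the two
cor(cars$speed, cars$dist)

# Cluster analysis: group the iris measurements into 3 clusters
set.seed(42)
clusters <- kmeans(scale(iris[, 1:4]), centers = 3)
table(clusters$cluster, iris$Species)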


A good example of clever integration of BI technology is Cincinnati Zoo, which we studied and discussed in class. Zoo management wanted useful information on:

– Who was coming? How often did they come? What did they do? What did they buy?

Firstly, they decided to replace all four legacy points of sale – admissions, membership, retail and food. They brought in a single point-of-sale platform and enlisted IBM to build a data warehouse and implement IBM Cognos BI to provide real-time analytics and reporting. They also integrated it with a weather feed from the US National Oceanic and Atmospheric Administration.

The results
By comparing weather forecasts with historic attendance, the Zoo was able to schedule labour and plan inventory much better. It could also identify visitors who spent only their admission price, and devised a marketing campaign targeting them with discounts in restaurants and gift shops. The Zoo also discovered a big spike in soft-serve ice cream sales during the last hour before closing, so it knew to keep those stands open while shutting down other shops early where no business was being done, saving labour costs.
Comparing the six-month period directly after the deployment of the IBM Cognos system with the same period of the previous year, the Zoo achieved a 30.7 percent increase in food sales, and a 5.9 percent increase in retail sales. (Kenneth Laudon)

Conclusion:
The landscape of BI in industry and research has become very active, as can be seen with Cincinnati Zoo. It is being fuelled by massive changes in data storage technology and cloud services. BI is now hugely important for any successful, innovative business and will give a huge competitive advantage if used to its full potential.

Appendix:

http://www.thestreet.com/story/13116309/1/business-intelligence-and-analytics-software-market-worth-2678-billion-by-2020.html
http://www.biztechmagazine.com/article/2014/03/critical-elements-effective-business-intelligence-system
Laudon, K. and Laudon, J. (2013) Management Information Systems. 13th ed. Harlow: Pearson Education Limited.

A little bit of Association Analysis

A few questions as part of our class assignments

Q1: Lift Analysis
Please calculate the following lift values for the burgers and chips table below (values reconstructed from the calculations that follow):

              Chips   ^Chips   Total
Burgers        600      400     1000
^Burgers       200      200      400
Total          800      600     1400

◦ Lift(Burgers, Chips)
◦ Lift(Burgers, ^Chips)
◦ Lift(^Burgers, Chips)
◦ Lift(^Burgers, ^Chips)

Please also indicate whether each of your answers suggests independence, positive correlation, or negative correlation.

Lift(Burgers, Chips) = s(B ∪ C)/(s(B) × s(C))
s(B ∪ C) = 600/1400 = 0.43
s(B) = 1000/1400 = 0.71
s(C) = 800/1400 = 0.57
Lift(B, C) = (600/1400)/((1000/1400) × (800/1400)) = 1.05
Positive correlation (> 1)

Lift(Burgers, ^Chips) = s(B ∪ ^C)/(s(B) × s(^C))
s(B ∪ ^C) = 400/1400 = 0.29
s(B) = 1000/1400 = 0.71
s(^C) = 600/1400 = 0.43
Lift(B, ^C) = (400/1400)/((1000/1400) × (600/1400)) = 0.93
Negative correlation (< 1)

Lift(^Burgers, Chips) = s(^B ∪ C)/(s(^B) × s(C))
s(^B ∪ C) = 200/1400 = 0.14
s(^B) = 400/1400 = 0.29
s(C) = 800/1400 = 0.57
Lift(^B, C) = (200/1400)/((400/1400) × (800/1400)) = 0.88
Negative correlation (< 1)

Lift(^Burgers, ^Chips) = s(^B ∪ ^C)/(s(^B) × s(^C))
s(^B ∪ ^C) = 200/1400 = 0.14
s(^B) = 400/1400 = 0.29
s(^C) = 600/1400 = 0.43
Lift(^B, ^C) = (200/1400)/((400/1400) × (600/1400)) = 1.17
Positive correlation (> 1)
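
As a sanity check, the four lift values can be computed in a few lines of R from the counts in the table above (the same helper works for Q2):

n <- 1400                                 # total transactions
lift <- function(joint, a, b, n) (joint / n) / ((a / n) * (b / n))

burgers <- 1000; chips <- 800             # row and column totals
lift(600, burgers,     chips,     n)      # 1.05  -> positive
lift(400, burgers,     n - chips, n)      # 0.93  -> negative
lift(200, n - burgers, chips,     n)      # 0.875 -> negative
lift(200, n - burgers, n - chips, n)      # 1.17  -> positive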

Q2:
Please calculate the following lift values for the ketchup and shampoo table below (values reconstructed from the calculations that follow):

              Shampoo   ^Shampoo   Total
Ketchup          100        200      300
^Ketchup         200        400      600
Total            300        600      900

◦ Lift(Ketchup, Shampoo)
◦ Lift(Ketchup, ^Shampoo)
◦ Lift(^Ketchup, Shampoo)
◦ Lift(^Ketchup, ^Shampoo)

Please also indicate whether each of your answers suggests independence, positive correlation, or negative correlation.

Lift(Ketchup, Shampoo) = s(K ∪ S)/(s(K) × s(S))
s(K ∪ S) = 100/900 = 0.11
s(K) = 300/900 = 0.33
s(S) = 300/900 = 0.33
Lift(K, S) = (100/900)/((300/900) × (300/900)) = 1
Independent

Lift(Ketchup, ^Shampoo) = s(K ∪ ^S)/(s(K) × s(^S))
s(K ∪ ^S) = 200/900 = 0.22
s(K) = 300/900 = 0.33
s(^S) = 600/900 = 0.67
Lift(K, ^S) = (200/900)/((300/900) × (600/900)) = 1
Independent

Lift(^Ketchup, Shampoo) = s(^K ∪ S)/(s(^K) × s(S))
s(^K ∪ S) = 200/900 = 0.22
s(^K) = 600/900 = 0.67
s(S) = 300/900 = 0.33
Lift(^K, S) = (200/900)/((600/900) × (300/900)) = 1
Independent

Lift(^Ketchup, ^Shampoo) = s(^K ∪ ^S)/(s(^K) × s(^S))
s(^K ∪ ^S) = 400/900 = 0.44
s(^K) = 600/900 = 0.67
s(^S) = 600/900 = 0.67
Lift(^K, ^S) = (400/900)/((600/900) × (600/900)) = 1
Independent

Q3: Chi Squared Analysis
Please calculate the following chi-squared values for the burgers and chips table below, with expected values in brackets (values reconstructed from the calculations that follow):

              Chips        ^Chips      Total
Burgers      900 (800)    100 (200)    1000
^Burgers     300 (400)    200 (100)     500
Total           1200         300       1500

◦ Burgers & Chips
◦ Burgers & Not Chips
◦ Chips & Not Burgers
◦ Not Burgers and Not Chips

For the above options, please also indicate whether each of your answers suggests independence, positive correlation, or negative correlation.

χ² = Σ (Actual − Expected)² / Expected

χ² Burgers & Chips
χ² = (900−800)²/800 + (100−200)²/200 + (300−400)²/400 + (200−100)²/100
= 12.5 + 50 + 25 + 100 = 187.5
Positive correlation (actual is greater than expected)

χ² Burgers & Not Chips
χ² = (100−200)²/200 + (300−400)²/400 + (200−100)²/100
= 50 + 25 + 100 = 175
Negative correlation (expected is greater than actual)

χ² Chips & Not Burgers
= (300−400)²/400 + (200−100)²/100
= 25 + 100 = 125
Negative correlation (expected is greater than actual)

χ² Not Chips & Not Burgers
= (200−100)²/100
= 100
Positive correlation (actual is greater than expected)
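
The same numbers drop out of R's built-in chisq.test once the continuity correction is turned off to match the hand calculation:

# Observed counts from the burgers/chips table above
observed <- matrix(c(900, 100,
                     300, 200),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Burgers = c('yes', 'no'),
                                   Chips = c('yes', 'no')))

# Expected counts from the row and column totals
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

(observed - expected)^2 / expected               # per cell: 12.5, 50, 25, 100
sum((observed - expected)^2 / expected)          # overall chi-squared: 187.5
chisq.test(observed, correct = FALSE)$statistic  # same: 187.5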

Q4: Chi Squared Analysis
Please calculate the following chi-squared values for the burgers and sausages table below, with expected values in brackets (values reconstructed from the calculations that follow):

              Sausages     ^Sausages   Total
Burgers      800 (800)    200 (200)    1000
^Burgers     400 (400)    100 (100)     500
Total           1200         300       1500

◦ Burgers & Sausages
◦ Burgers & Not Sausages
◦ Sausages & Not Burgers
◦ Not Burgers and Not Sausages

For the above options, please also indicate whether each of your answers suggests independence, positive correlation, or negative correlation.

χ² Burgers & Sausages
(800−800)²/800 + (200−200)²/200 + (400−400)²/400 + (100−100)²/100
= 0 + 0 + 0 + 0 = 0
Independent

χ² Burgers & Not Sausages
(200−200)²/200 + (400−400)²/400 + (100−100)²/100
= 0 + 0 + 0 = 0
Independent

χ² Sausages & Not Burgers
(400−400)²/400 + (100−100)²/100
= 0 + 0 = 0
Independent

χ² Not Burgers and Not Sausages
(100−100)²/100
= 0
Independent

Q5:

Under what conditions would lift and chi-squared analysis prove to be poor measures for evaluating the correlation/dependency between two events?
Please suggest another measure that could be used to rectify this flaw in lift and chi-squared.

Lift and chi-squared (χ²) are not useful measures when there are too many null transactions, i.e. transactions that contain neither X nor Y.
There are several null-invariant measures that can help with datasets that have this problem, though:
• AllConf(A, B)
• Jaccard(A, B)
• Cosine(A, B)
• Kulczynski(A, B)
• MaxConf(A, B)

The Kulczynski measure (often paired with the imbalance ratio) is widely used in practice and is generally regarded as one of the most reliable of these.
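
These measures are easy to write as plain functions of the support counts, which also shows why they are null-invariant: the total transaction count never appears in any formula, so pouring in extra null transactions changes nothing. A quick sketch in R:

# ab = count(A and B), a = count(A), b = count(B)
all_conf <- function(ab, a, b) ab / max(a, b)
max_conf <- function(ab, a, b) ab / min(a, b)
jaccard  <- function(ab, a, b) ab / (a + b - ab)
cosine_m <- function(ab, a, b) ab / sqrt(a * b)
kulc     <- function(ab, a, b) (ab / a + ab / b) / 2

# Ketchup/shampoo counts from Q2: the result is unaffected by how many
# null transactions the dataset happens to contain
kulc(100, 300, 300)   # 0.33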

R you kidding me? The Power of R…

The Power of R – An Introduction

R is an open-source statistical programming package that incorporates graphical tools for presenting your data. It was first written as a research project by Ross Ihaka and Robert Gentleman, and is now under active development by a group of statisticians called 'the R core team'; see www.r-project.org. R is available free of charge.
It has become a very important tool for data scientists, and a whole community has built up around it. The fact that you can import almost any data format, from .CSV to .SAV, without having to be a programming genius has added to its huge popularity.


To see how it works, check out the Try R course from Code School – http://tryr.codeschool.com/
The seven sections in the course were:
1. Using R
2. Vectors
3. Matrices
4. Summary Statistics
5. Factors
6. Data Frames
7. Real-World Data
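
To give a flavour of what those sections cover, here are a few one-liners of the kind the course walks you through:

v <- c(4, 8, 15, 16, 23, 42)                      # vectors
m <- matrix(1:6, nrow = 2)                        # matrices
mean(v); median(v); sd(v)                         # summary statistics
f <- factor(c('low', 'high', 'high', 'low'))      # factors
df <- data.frame(id = 1:3, score = c(70, 85, 90)) # data frames
summary(df)                                       # a quick look at the data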

Below is the final screen from when I completed it:

[Screenshot: Try R course completion screen]

Using R and does Rob survive?
Having finished Code School and learned my first bit of R, I decided to test my new-found skills on something a bit more complex. There are a lot of datasets out there and it can be hard to choose, but I found a dataset on Kaggle.com containing information on Titanic passengers, which we had also looked at in class. It contains information such as age, sex, departure location, ticket price and whether each passenger survived the journey.
Referencing a very popular article by Megan Risdal, here are the steps I took to see if I would have survived.

Firstly, I downloaded the datasets from Kaggle: train.csv and test.csv.
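
Loading and combining the two files takes only a couple of lines. Here is a sketch of that step (bind_rows from dplyr fills the Survived column of the test set with NA, since Kaggle withholds those labels):

library(dplyr)

train <- read.csv('train.csv', stringsAsFactors = FALSE)
test  <- read.csv('test.csv', stringsAsFactors = FALSE)

# Rows 1:891 of 'full' are the training set used in the plots below
full <- bind_rows(train, test)
str(full)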

The dataset had the following variables:

Variable – Description
Survived – Survived (1) or died (0)
Pclass – Passenger’s class
Name – Passenger’s name
Sex – Passenger’s sex
Age – Passenger’s age
SibSp – Number of siblings/spouses aboard
Parch – Number of parents/children aboard
Ticket – Ticket number
Fare – Fare
Cabin – Cabin
Embarked – Port of embarkation

Here we have a look at the family variable:

# Load the packages used below (ggthemes provides theme_few)
library(ggplot2)
library(ggthemes)

# Create a family size variable including the passenger themselves
full$Fsize <- full$SibSp + full$Parch + 1

# Split the surname out of the Name column and combine it with the
# family size to make a family variable (as in Risdal's article)
full$Surname <- sapply(full$Name,
                       function(x) strsplit(x, split = '[,.]')[[1]][1])
full$Family <- paste(full$Surname, full$Fsize, sep = '_')

# Use ggplot2 to visualize the relationship between family size & survival
ggplot(full[1:891, ], aes(x = Fsize, fill = factor(Survived))) +
  geom_bar(stat = 'count', position = 'dodge') +
  scale_x_continuous(breaks = c(1:11)) +
  labs(x = 'Family Size') +
  theme_few()

[Figure: survival counts by family size]

We can already see that passengers travelling alone or with a family of more than four were on thin ice. 🙂

[Figure: survival by age and sex]

This graph breaks the data down by age and survival rate for males and females and gives more insight: females, young children and the very old had a much better chance of surviving.
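
A plot along these lines can be drawn with ggplot2. Here is a minimal sketch following the approach in Risdal's article, where full is the combined dataset from earlier:

library(ggplot2)

# Histogram of age, coloured by survival, one panel per sex
ggplot(full[1:891, ], aes(x = Age, fill = factor(Survived))) +
  geom_histogram() +
  facet_grid(. ~ Sex) +
  labs(fill = 'Survived')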

But does Rob make it??
So, being a 36-year-old male who would probably have been in 2nd or 3rd class, my chances of survival would have been slim. A legend would be lost and the world would never be the same again…
While the Titanic only had enough lifeboats to hold about a third of her passengers, she was actually carrying more lifeboats than were legally required. That is because lifeboats were intended to ferry survivors from a sinking ship to a rescuing ship, not to keep the whole population afloat. There you go…

References:
www.kaggle.com
https://www.kaggle.com/mrisdal/titanic/exploring-survival-on-the-titanic
http://crazyfacts.com/tag/titanic/

Fusion Tables a la Rob..

Big Data and data management have become instrumental in today's society. The amount of data being used in all walks of life has been growing exponentially for years, and how businesses store, organise and manage their data has a huge impact on organisational effectiveness. Google's answer to data management is a web application called Google Fusion Tables, and in this blog I will show you how to use it.

This is a heat map of the Republic of Ireland based on the population of each county: counties are shaded into buckets according to their population. A heat map is used to present relative data, i.e. the relationship between two occurrences, and gives an immediate visualization of the data.

To start off, you must set up two different tables.

The first was on the population of each county of Ireland in 2011, extracted from the 2011 census downloaded from the Central Statistics Office website http://www.cso.ie/en/statistics/population/populationofeachprovincecountyandcity2011/ . This information was put into an Excel sheet and had to be cleaned up before making a table: Dublin, Limerick, Waterford, Galway, Tipperary and Cork had been broken up into different areas, and Laois was spelt wrong, which would have made the tables incompatible. Once amended, the Excel sheet was imported and a table was formed – a sketch of this clean-up step in R follows below.
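
Fusion Tables itself is point-and-click, but the clean-up could just as easily be scripted. Here is a hypothetical sketch in R – the file name population_2011.csv and the County and Population column names are assumptions for illustration, not the CSO file's actual headers:

# Hypothetical column names; adjust to match the downloaded census file
pop <- read.csv('population_2011.csv', stringsAsFactors = FALSE)

# Roll split rows such as 'Cork City' / 'Cork County' up into one county
pop$County <- sub(' (City|County)$', '', pop$County)
pop <- aggregate(Population ~ County, data = pop, FUN = sum)

# Fix the misspelt county so it matches the names in the KML file
# (placeholder below - substitute whatever spelling the census file uses)
pop$County[pop$County == '<misspelt Laois>'] <- 'Laois'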

The second table contained the geometry information of Ireland and the names of the counties. This comes in a KML file and was taken from independent.ie: http://www.independent.ie/editorial/test/map_lead.kml

These tables were then merged to create a new table called 'Ireland Population Heat Map'. Once the tables were merged, I went into the 'map of geometry' tab to edit it. To shade counties by population, I opened Change feature style, went into the Buckets tab and changed the colours in the fill colour option under Polygons. We can now contrast the highly populated counties with the sparsely populated ones.

Different kinds of information can be gleaned from these types of maps. When you look at the main motorways, you can see that the counties furthest from them are the ones with the lowest population; this is the case with Roscommon and Longford. Population is densest in Leinster, and in Dublin in particular, while the map also shows high regional population densities around Galway and Cork. There seems to be a relationship between the location of motorways and population density.
There are many other ideas and concepts that can be shown in a heat map: house prices compared to location, unemployment rates and demographics are all examples.

In the United States, many people are familiar with heat maps from viewing television news programs. During a presidential election, for instance, a geographic heat map with the colors red and blue will quickly inform the viewer which states each candidate has won. Here are the results of the 2012 Presidential Election in America shown on a heat map.

[Figure: 2012 US presidential election results heat map]

Google Fusion Tables is free, simple to use and can create powerful visual results, but it does of course rely on the quality of the data used.