Just two simple words - Big Data - have created a furore that is spreading across the world, unleashing a dizzying array of opinions on all subjects, from healthcare to marketing, from politics to economics. Every day, articles and blog posts are written about how to gather, store and use these big datasets.
Sequencing the first human genome took about 13 years. Less than a decade later, the same task can be completed in just a few days, albeit for a few thousand dollars. Since the beginning of the Human Genome Project - formally started in 1990 - there have been many advances in the management of large amounts of data, with radical redesigns of IT infrastructures. New research paths have been explored in data analysis, expanding general knowledge in ways that influence other disciplines.
Integrating data from social networks has been strategic for election campaigns: knowing the voters, tailoring communications, identifying hot topics and keywords. Representative examples of this innovative use are the 2008 and 2012 U.S. Presidential Election campaigns, the case studies preferred by experts. Politicians - perhaps pleased with the results - have used Big Data proactively, promoting it as a strategic asset for public administrations and taking up the Open Data issue, an area of cross-party consensus without ideological distinctions.
Last but not least, the world of business. Given the potential and benefits stated above, companies immediately took up the challenge of working with Big Data. The desire to deepen the knowledge of their customers - existing or potential - has always been important, because more than anything else it helps define the decisions companies need to make. With Big Data, that potential has grown into a more detailed picture of the surrounding reality. And since “data without a model is just noise” [1], it makes sense to explain who deals with these data.
A positive factor in this so-called “Big Data Revolution” is general awareness. People are realizing that their digital lives leave traces - little clues that tell us something about them, whether it be what they bought this week at the supermarket or how they use the Internet connection in their own homes.
As reported on Statistics Views towards the end of last year, Prof. Harvey Goldstein - speaking at the Big Data Debate hosted by the British Academy - described Big Data as an innovation to be “embraced, not ignored”. There needs to be large-scale education, and a shared view about data interpretation within research communities. Social scientists should change their thinking, updating existing methodologies for the challenges of the coming years [2].
Statisticians have been analyzing data for centuries, but the public is largely unaware of this. People often do not even know what a statistician's role is, confusing it with mathematics as a discipline or with the professionals engaged in polling.
Statisticians are data scientists by definition, but data scientists do not necessarily have academic training in statistics. Statistical knowledge is the backbone of any analyst: otherwise, why waste time and resources collecting data that no one will turn into insight? Even implementing a data warehouse would make no sense if nobody can benefit from it.
I myself am a (business) statistician by training and a data scientist by practice. I grew up with computer scientists and data engineers as friends and colleagues. I was involved for years in academic research projects, from market research to data mining. Outside academia, I spent years working on business needs. And that training has held up, even in small and medium-sized enterprises - where everything falls on you and professional labels disappear.
The “data scientist” definition is wide enough for a heterogeneous set of skills - better than the colourful labels the market has given us in recent years. Statisticians, with additional, updated skills.
Besides statistical skills - essential and dominant - data scientists should have computer skills, because they will frequently develop scripting code and handle large databases. Apart from a basic knowledge of traditional programming languages and system administration, query languages are fundamental (e.g. SQL, PL/SQL, HiveQL), as is deep training in a statistical environment (e.g. R).
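As a flavour of this daily mix of scripting and querying, here is a minimal sketch in Python using the standard-library sqlite3 module; the database file “sales.db” and the “orders” table with its columns are hypothetical, and in practice the same query style carries over to engines such as PostgreSQL or Hive (via HiveQL).

```python
# Minimal sketch: run an aggregation query from a script.
# Assumption: a local SQLite database "sales.db" containing a
# hypothetical table orders(customer_id, amount).
import sqlite3

conn = sqlite3.connect("sales.db")
cur = conn.cursor()

# Aggregate spend per customer, largest first - a typical first look at the data.
cur.execute("""
    SELECT customer_id, SUM(amount) AS total_spend
    FROM orders
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
""")

for customer_id, total_spend in cur.fetchall():
    print(f"{customer_id}: {total_spend:.2f}")

conn.close()
```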
Good communication skills are also required, because results need to be explained. Academics can handle numbers and tables with ease, but workgroups are often heterogeneous in analytical skills, roles and business goals. Speaking slowly, with an increased font size, doesn't help.
Walker and Fung recently wrote an article in Significance [3] asking the main question: should statisticians join in Big Data and business? Yes, they should, without a doubt. The worlds of business and statistics have enjoyed a symbiosis for a long time; beginning with the contributions Ronald Fisher developed at the Rothamsted Research Centre, there are many representative examples of this relationship. With the increasing awareness and responsibility that come with handling data, it is important to always distinguish the forest from the trees. Big Data can also bring you into contact with multiple disciplines, from which you can learn and to which you can then contribute.
So, data scientists face three main issues while working on Big Data for their models: selecting a valid research method, evaluating data quality, and identifying goals (questions). All sciences depend on data, but selecting a research method is not simple. There have also been divergent schools of thought, which disagreed - often vehemently - on issues of fundamental importance (e.g. the definition of “probability” and the never-ending story of frequentists versus Bayesians). Despite these heated debates, the statistical community has gathered tools, methods and forms of experience that are useful in all quantitative analysis. Is that enough? It's a good starting point.
“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise” [J.W. Tukey]
A well-chosen research method will help us ask valid questions and avoid false interpretations: an exploration of the whole dataset, walking a path of speculation and criticism over time [4]. From now on, our worst enemy will be human exuberance.
Over the past two decades, data mining applications have tested a new path: “let the data speak for themselves” [J.W. Tukey, 5, 6]. Many analysts - usually computer scientists and data engineers - adopted these words with a “naive optimism” [7], not leaving the interpretations - a conscious process of data exploration and analysis - to domain experts. Why? Because some are not aware of the importance of data quality, a concern that should start long before the analysis. How do we evaluate opinions in a survey? What are the best features? How should we handle null values or sparse matrices? What should we do with outliers? These are just some of the questions to ask yourself.
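To make these data-quality questions concrete, here is a minimal sketch in Python with pandas; the tiny dataset and the column names (“age”, “income”) are invented for illustration, and the choices it makes (median imputation, flagging rather than deleting outliers) are examples of judgment calls, not prescriptions.

```python
# A minimal data-quality sketch: null values and outliers.
# The data and column names are hypothetical.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [34, 45, np.nan, 29, 51, 38],
    "income": [32_000, 41_000, 38_000, np.nan, 1_200_000, 36_500],
})

# Null values: impute with the median, or drop - the right choice is a
# judgment call that belongs with the domain expert, not a default.
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna(subset=["income"])

# Outliers: flag points beyond 1.5 * IQR rather than silently removing them.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]
print(outliers)
```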
The statistical model - one of the most important goals - describes a given phenomenon. It must be able to explain without overfitting (chasing noise) or underfitting (neglecting information), either of which degrades its value. It should reject unnecessary complexity, selecting the most important features. And it requires data quality; otherwise, every result is meaningless.
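The overfitting/underfitting trade-off can be shown in a few lines. The sketch below, in Python with numpy and synthetic data invented for the purpose, fits polynomials of increasing degree and compares their error on held-out points: the lowest degree underfits, the highest overfits, and the middle ground usually wins.

```python
# Over- vs underfitting on synthetic data: compare polynomial degrees
# by their error on held-out points.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

# Hold out every fourth point for validation.
train = np.ones(x.size, dtype=bool)
train[::4] = False

for degree in (1, 3, 15):
    coeffs = np.polyfit(x[train], y[train], degree)
    pred = np.polyval(coeffs, x[~train])
    rmse = np.sqrt(np.mean((pred - y[~train]) ** 2))
    print(f"degree {degree:2d}: validation RMSE = {rmse:.3f}")
# Degree 1 underfits, degree 15 overfits the training noise.
```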
In business, customer profiles translate into marketing strategies, and profitability is the main criterion for setting priorities. With customer profiles, a company can prepare targeted pre- and post-sales actions. It can differentiate its communication, breaking down tricky and trivial aspects alike, by offering the products and services each profile appreciates. In data-driven marketing, the picture of each customer can be closer to reality than ever before. This in-depth knowledge of customers - probably one of the most common and hardest-to-reach business goals - falls within the scope of the sciences that try to measure human complexity: fields of research where statistical methods and tools provide valuable opportunities, along with criticism and heated debates (e.g. the “abuse of reason” [8]). Following methodological individualism, understanding comes from the conceptions that lead people to action, not from results theorized about them; whoever observes a social phenomenon should treat the psychological motivations behind individual beliefs and attitudes as lying outside their research context. Valuable advice, which the business world ignores at its own risk. Besides the inconsistency of online identities, human behavior can be individually disturbed or distorted by external factors not (yet) measured. Valuable opportunities or unnecessary risks?
Moving away from concepts that shaped past centuries (e.g. society, class, nation), the definition of “profile” has caught on: a group definition unrelated to the res publica, one that fits well with data mining applications and is good enough to link individuals with similar characteristics across time and space. A profile could contain only one customer (e.g. a multi-billionaire who could buy your whole company if promptly informed), be quite close to the “outdated” concepts cited above, or sit somewhere in between. The membership of a group may also change over time.
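As a rough illustration of profiles in the data mining sense, here is a hedged sketch in Python using scikit-learn's k-means: the two features (yearly spend, number of orders), the synthetic data and the choice of three clusters are all assumptions made for the example, not a recipe.

```python
# Sketch: group customers into "profiles" by similar characteristics
# with k-means. Features, data and k are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Hypothetical features: yearly spend and number of orders per customer.
X = np.column_stack([
    rng.gamma(shape=2.0, scale=500.0, size=200),   # spend
    rng.poisson(lam=8, size=200).astype(float),    # orders
])

X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

for k in range(3):
    members = X[labels == k]
    print(f"profile {k}: {len(members)} customers, "
          f"mean spend {members[:, 0].mean():.0f}, "
          f"mean orders {members[:, 1].mean():.1f}")
# Note how a profile can be broad or tiny - even a single customer.
```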
Business managers consider their options while the clock is ticking. Whether facing the opportunities of a new market or an ongoing marketing campaign, the need for insights grows exponentially with the push to establish a competitive advantage. Big Data is already a winning topic in marketing circles, but - more than the amount of data - it is the ability to analyze vast and complex datasets that puts a company a step ahead.
Given the amount of data to be processed, analysts can only rely on the most modern technologies available for the four “V”s of Big Data: volume, velocity, variety, value. Technology vendors often apply the “Big Data” label to traditional data warehousing scenarios involving volumes in the single or multi-terabyte range, but that misses what actually defines the field: scalability, distributed computing on cluster environments, new open standards and powerful analytical tools. It's not just the data; it's what you need for daily analytical work.
If the company's needs expand, data management can add new hardware - a new server or data node - in a linear way: more power for the Hadoop-based cluster and, by adding a further replica, a lower risk of data loss. This is the advantage of scalability, without the constraints of monolithic solutions.
If a company wants to protect its investments, it can benefit from open technological standards. MapReduce is a programming model for processing big datasets with a parallel, distributed algorithm on Hadoop-based clusters. Whatever choices and changes are made over time, the code developed will always be reusable and staff training can remain flexible.

Someone may argue that only big companies are able to benefit from these features - and they are right - but the line between less-new and new solutions is a shadow line that may fade over time. When the investment is considerable, IT solutions that can stay viable the longest are preferable. Nor is company size the only driver of new strategic decisions: Brian Hopkins (Forrester Research), commenting cautiously on their “Foresights Strategy Spotlight: Business Intelligence and Big Data, Q4 2012” survey [9], revealed a useful insight - high-growth companies are more interested in what's possible tomorrow than what's painful today, investing more than their peers, with the right openness to next-generation technologies.
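To make the MapReduce model mentioned above concrete, here is a minimal word-count sketch in Python, in the style of Hadoop Streaming, where the mapper and reducer are plain scripts reading standard input and writing standard output; on a real cluster the framework performs the shuffle and sort between the two phases, which this sketch simulates locally for illustration only.

```python
# Minimal word-count sketch of the MapReduce model (Hadoop Streaming style).
# The shuffle/sort step is simulated locally with sorted(); on a cluster
# the framework does it between the map and reduce phases.
import sys
from itertools import groupby

def mapper(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    # Reduce: pairs arrive sorted by key; sum the counts per word.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    pairs = sorted(mapper(sys.stdin))   # stands in for the cluster's shuffle/sort
    for word, count in reducer(pairs):
        print(f"{word}\t{count}")
```

Run locally, `echo "big data big" | python wordcount.py` prints the per-word counts; on a Hadoop cluster the same mapper and reducer would run in parallel over the distributed data.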
Even businesses previously reluctant to undertake new Big Data projects - due to fuzzy return on investment, a lack of specific business use cases and concerns over the maturity of products and services - have begun exploring this emerging context in their organizations with small pilot projects, encouraged by the potential growth of the market [10].
While the Big Data opportunities are here, the market is still within the confines of the early-adopter phase, poised for significant growth, with two main target groups: big companies and fast-growing companies - in other words, those who can afford investments in IT infrastructure and a data analysis team, and those looking for the flexibility to grow in the near future. These business opportunities need a set of precious skills - mainly owned by statisticians and data scientists - for a conscious step into predictive analysis, for new insights that drive competitive advantage and new business models.
To help this transition to the future, academic institutions need to organize data science programs and help define a new role - the data scientist - or raise the profile of an old one - the statistician - in response to market needs, while continuing to support public awareness and preferring technological solutions based on open standards in their educational programs and research projects.
Don't be scared of the so-called “trough of disillusionment”: it is like a set of brainstorming sessions in which people take stock of the situation and look forward. While statisticians are already able to work on large datasets with their precious analyses, in this ongoing revolution the whole community of industry experts is redesigning the future of computing and data science for the years ahead, lowering the entry barriers for the widest audience of users.