I recently attended the Strata + Hadoop World Conference in San Jose and came away impressed with the accelerating pace of innovation in the world of Big Data. Companies and startups are innovating in every area of the Big Data value chain: from automating how data is collected, cleaned, and organized, to data governance and management, to data storage using a plethora of NoSQL database technologies, to the numerous emerging tools for data science.
Of particular interest are the innovations in streaming data analytics at the edge of the network. This will be critical in the emerging world of the Internet of Everything (IoE), where "things" are connected to the Internet in the context of "people", "process" and "data". Data analytics will provide the intelligence in IoE, transforming data generated by millions of edge devices and applications into useful business insights. Examples of IoE in action abound, from applications in connected healthcare, supply chain management, and the smart grid to Google's self-driving car and Uber's industry-transforming business model that connects riders to drivers.
Big Data and analytics are widely seen as game-changing technologies. Data science is the foundational capability behind the enormous value potential that everyone is expecting from Big Data and IoE. It is at once a new discipline and a mature one: the field itself is young, but it rests on well-established techniques from inferential statistics and computer science.
So, what exactly is data science? D.J. Patil has called it a "team sport." It is a multi-disciplinary approach that combines business domain knowledge, IT, communication skills, and change management skills with core expertise in statistical analysis and computer science to identify and capture business value from data. A colleague recently told me, "Data science is mostly about cleaning and preparing large datasets so that programmers can work on them." This is partially true. Data scientists do spend a lot of time understanding raw data and preparing "clean" datasets for subsequent analysis.
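To make the idea of preparing a "clean" dataset concrete, here is a minimal Python sketch. The records, the field names (`device_id`, `temp_c`), and the cleaning rules are all invented for illustration; real projects will have their own rules, and this is not meant as a definitive approach.

```python
from typing import Optional

# Hypothetical raw sensor readings, invented for illustration only.
RAW_ROWS = [
    {"device_id": " A1 ", "temp_c": "21.5"},
    {"device_id": "a1", "temp_c": "21.5"},   # duplicate once normalized
    {"device_id": "B2", "temp_c": ""},       # missing value
    {"device_id": "C3", "temp_c": "n/a"},    # value that is not a number
    {"device_id": "D4", "temp_c": " 19 "},
]


def clean_row(row: dict) -> Optional[dict]:
    """Return a normalized copy of the row, or None if it cannot be used."""
    device_id = row.get("device_id", "").strip().upper()
    raw_temp = row.get("temp_c", "").strip()
    if not device_id or not raw_temp:
        return None  # drop rows with missing fields
    try:
        temp_c = float(raw_temp)
    except ValueError:
        return None  # drop rows whose reading is not numeric
    return {"device_id": device_id, "temp_c": temp_c}


def clean_dataset(rows: list) -> list:
    """Normalize rows, drop unusable ones, and remove exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        result = clean_row(row)
        if result is None:
            continue
        key = (result["device_id"], result["temp_c"])
        if key in seen:
            continue  # already kept an identical record
        seen.add(key)
        cleaned.append(result)
    return cleaned


if __name__ == "__main__":
    for record in clean_dataset(RAW_ROWS):
        print(record)
```

Even in this toy case, three of the five records are discarded or merged, which hints at why data preparation takes up so much of a data scientist's time.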
The objective of data science is to identify and build an analytical model that can be scaled and operationalized (i.e., implemented in a "production" environment) to provide useful business insights, predictions and recommendations.
The typical data science process is shown in the figure below:
Source: Cisco Consulting Services, 2015
Four data science developments from the last decade stand out:
A word of caution: Above all, data scientists need to be data skeptics. Not all data is useful, and not every business problem can be or should be solved using data science and analytics. George E.P. Box, a noted British statistician, once said, "Essentially, all models are wrong, but some are useful." This is why any data science project should include a cross-functional team and use a healthy dose of business acumen and pragmatism to develop approaches that ultimately drive useful business outcomes in a cost-effective manner.