18 June 2018
New Era for Big Data as Emerging Technologies Consign Hadoop to History
Big data has transformed the economy, powering systems from Uber rides to Waze’s traffic avoidance software and the user experience on Facebook. Banks, insurers and security companies have used big data processing to achieve faster and more effective predictive modeling, credit scoring, regulatory compliance and security checks.
In the drive to store ever larger data sets, financial services groups and other giants have long turned to Hadoop, the infrastructure software for storing and processing big data. The open-source Apache Hadoop project, which reached its 1.0 release in 2011, facilitates data processing at a vast scale and at a much lower cost than an enterprise data warehouse.
But Hadoop is facing competition on a number of fronts and falling out of favor with IT managers and data experts.
Transformational Technology
Hadoop’s transformational technology was its distributed file system, HDFS, which can store a great number of giant data files – bigger than those that could be stored on any single server – by distributing them across multiple nodes, or computers.
It also offers the MapReduce framework for processing data. Rather than moving large data sets over a network to the processing software, which is slow and cumbersome, Hadoop moves the software to the data, mapping the processing across all the nodes where the data is stored – hence Map. Each node’s intermediate results are then combined, so only the relevant, aggregated data needed to answer a query travels back – hence Reduce.
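To make the pattern concrete, here is a minimal word-count sketch using Hadoop Streaming, which lets the map and reduce steps be written in languages other than Java – Python in this case. The script and path names are illustrative rather than taken from any particular deployment.

    #!/usr/bin/env python3
    # mapper.py - the "Map" step: runs on each node, next to its slice of the data.
    # Reads raw text from stdin and emits "word<TAB>1" for every word it sees.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py - the "Reduce" step: receives mapper output sorted by key and
    # sums the counts for each word, so only the aggregated totals are returned.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)

    if current_word is not None:
        print(f"{current_word}\t{current_count}")

Submitted through Hadoop’s streaming jar against an input directory in HDFS, the mapper runs wherever the file blocks are stored, and only the small, aggregated word counts travel back over the network.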
However, Hadoop is hard work. It is based on the Java programming language, and Java expertise is helpful, though not essential, for using it. Getting the most out of the system requires skilled staff who can use sophisticated tools to interrogate the data.
Market Disappointment
Hadoop scores highly on analyzing large batches of data very quickly, but is not strong on ad hoc queries, leaving it to specialists and in-house developer teams to build the appropriate software. One user describes Hadoop as becoming “a dumping ground for data”.
In April, two of the biggest Hadoop providers suffered steep falls in their share prices. Cloudera saw a drop of nearly 40% after forecasting lower revenues for 2019, news that shocked analysts. The company had conducted an IPO just a year earlier, backed by optimistic forecasts, but it faces strong competition for its Hadoop services, which are being commoditized as well as superseded by superior technologies.
The drawback with Cloudera’s offering is that customers need highly trained staff to run the software. Rival Hadoop provider Hortonworks has also seen a sharp fall in its share price this year as investors took to their heels.
Increased Data Volumes
Some analysts interpret these developments as a signal of the demise of Hadoop as it is overtaken by more advanced technologies that are ready for the next wave of big data. The arrival of the Internet of Things will greatly increase data volumes and require more sophisticated technology to store and manage the data.
Cloudera has already moved away from its pure Hadoop offering and makes much of machine learning and other software for processing big data. However, Hadoop is still widely seen as its main line of business, and both Cloudera and Hortonworks face challenges in transitioning their clients to the new big data methods.
Hadoop itself is becoming commoditized and facing competition from newcomers such as Apache Spark, an analytics engine for big data processing. Innovators are further opening big data to non-specialists with service offerings, taking the technology out of the hands of software providers.
Trawling the Data Lakes
Many companies that have invested in using Hadoop to create data lakes are struggling to get value out of them and transform themselves into data-driven businesses. Data lakes are where many types of data – from customer transactions to voice, video and social media content – are stored in their native format until they are processed.
Data lakes – usually, though not always, associated with Hadoop – are a cheaper and more flexible means of storage than the enterprise data warehouse, which holds structured data in predefined schemas. But the challenge for businesses is to extract valuable data from data lakes.
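As a rough illustration of “native format until processed”, the following PySpark sketch reads raw JSON events straight from a data lake and only imposes structure at query time; the paths and field names are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lake-demo").getOrCreate()

    # The events sit in the lake in their native JSON form; no schema was
    # enforced when they were written ("schema on read").
    events = spark.read.json("hdfs:///lake/raw/clickstream/")  # hypothetical path

    # Structure is imposed only now, when the business question is asked.
    events.createOrReplaceTempView("clicks")
    spark.sql("""
        SELECT page, COUNT(*) AS views
        FROM clicks
        GROUP BY page
        ORDER BY views DESC
    """).show()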
The next big challenge for data lakes and for Hadoop vendors will be to use machine learning and artificial intelligence to make sense of the data and extract insights that will help boost business.
According to a report by data visualization software provider Tableau, Hadoop is comparatively weak on both machine learning and SQL queries over structured data, which are the emerging methods for enterprises to extract insights from big data. In a survey of IT managers, business intelligence staff and data architects, around 70% of respondents favored Apache Spark over Hadoop.
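For a sense of what such a workload looks like, here is a hedged sketch of a machine learning job in Spark’s MLlib – the kind of task the survey respondents favored Spark for. The table, columns and churn-prediction framing are assumptions for illustration, not details from the report.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("churn-demo").getOrCreate()

    # Hypothetical customer table already curated in the data lake as Parquet.
    customers = spark.read.parquet("hdfs:///lake/curated/customers/")

    # MLlib expects the predictor columns packed into a single vector column.
    assembler = VectorAssembler(
        inputCols=["age", "balance", "num_products"],
        outputCol="features",
    )
    train = assembler.transform(customers).select("features", "churned")

    # Fit a simple churn model; "churned" is assumed to be a 0/1 label column.
    model = LogisticRegression(labelCol="churned", featuresCol="features").fit(train)
    print(model.coefficients)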
Ease of Use
Tableau says businesses are turning to faster databases such as Exasol and MemSQL, or using SQL applications for Hadoop such as Hive, Presto and Drill. It predicts that easier-to-use, self-service big data offerings will overtake Hadoop, citing tools such as Alteryx, Trifacta and Paxata.
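To show what an SQL-on-Hadoop query can look like in practice, the snippet below runs a Hive query through the PyHive client; the host, table and column names are placeholders rather than details from Tableau’s report.

    # A minimal sketch of an ad hoc SQL query against data in Hadoop via Hive,
    # using the PyHive client. Host, table and column names are placeholders.
    from pyhive import hive

    conn = hive.Connection(host="hive-server.example.com", port=10000)
    cursor = conn.cursor()
    cursor.execute("""
        SELECT region, SUM(premium) AS total_premium
        FROM policies
        GROUP BY region
    """)
    for region, total in cursor.fetchall():
        print(region, total)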
The next wave of big data technology will help banks and insurers become even more efficient at predicting customer behavior and analyzing the direction of financial markets. They will owe much to Hadoop for helping them enter the big data field and create data lakes, but the search is on for new, flexible and agile tools to make sense of the huge data sets collected by businesses.
This decade, big data has been about capturing and storing vast amounts of data. The future is expected to focus on how non-technical staff can extract insights from the data – fast – to help companies improve efficiency and become more profitable.