This is a big week in the analytics world as both Gartner’s Data & Analytics Summit in Grapevine, TX and the Strata Data Conference in San Jose are taking place. Many vendors are attending and exhibiting at both; some are at only one, but just about everyone in the analytics world is exhibiting at one or the other.
Strata, which kicks off today, is more of an announcement vehicle for vendors though, and today three big names in the Big Data world (Cloudera, MapR and AtScale) have new releases to announce. I’ll cover these three vendors’ announcements in some depth. I’ll also close with a short summary of announcements, mostly made yesterday, from an array of other vendors.
As Cloudera is one of the two companies behind Strata Data (O’Reilly being the other), perhaps it’s fitting that we start with its announcements. To cut to the chase, Cloudera is releasing a new version of its Altus cloud service, with one key new feature available now and another slated for release in the near future.
Altus is Cloudera’s Big Data pipeline-as-a-service cloud offering. When announced, Altus included the ability to run scheduled jobs on a clusterless/serverless Cloudera instance. Essentially, Altus launched as a Hadoop job service that didn’t require the user to worry about the details of clusters, storage and so forth. Then, in a subsequent release, the Altus Analytic DB, based on Impala, was added.
Today, Cloudera is adding support on Altus for Cloudera Shared Data Experience (SDX). This facility allows for the unified management of multiple clusters, including mixes and matches of on-prem and cloud-based clusters. This addition to Altus is being released today, Cloudera tells me.
In addition, Cloudera is announcing, though not yet releasing, a new component: Altus Data Science, which will be based on Cloudera’s Data Science Workbench. Like other hosted Hadoop and Spark machine learning services, including those from Qubole and Databricks, Altus Data Science will allow data scientists and data science-savvy data engineers to build and schedule data/machine learning pipelines without needing a dedicated Hadoop or Spark cluster up and running. And just like other vendors, Cloudera now sees Data Science and other value-added workloads as the real gems amongst its deliverables. The days of selling vanilla Hadoop clusters are ending, it would seem; something more turn-key is now required.
MapR further contains itself
MapR is also technically a Hadoop distribution vendor, but the company has always really seen itself as an Enterprise data platform provider. Along with that, the company is today announcing support for containerized applications that use the MapR Converged Data Platform to run under Kubernetes, the leading container orchestration platform.
Using the Kubernetes Volume Driver, containerized applications can read from and write to the MapR-XD data store simply by addressing the container’s local storage. The MapR Kubernetes Volume Driver then takes care of conveying those reads and writes to whatever persistent storage media is being managed under MapR-XD, bypassing the ephemeral storage cache normally allocated to a Kubernetes container.
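To make the pattern concrete, here is a minimal Python sketch of what application code looks like when persistence is delegated to a mounted volume: the code uses nothing but ordinary local file operations. The `APP_DATA_DIR` environment variable, the `events.log` file name and both function names are hypothetical, chosen only for illustration; nothing here is specific to MapR's driver, and the sketch falls back to a temp directory when no volume is mounted.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical: the directory where a persistent volume would be mounted
# inside the container. Falls back to the system temp dir for local runs.
DATA_DIR = Path(os.environ.get("APP_DATA_DIR", tempfile.gettempdir()))


def append_event(line, data_dir=DATA_DIR):
    """Append one line to a log file using plain local file I/O."""
    log = Path(data_dir) / "events.log"
    with log.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
    return log


def read_events(data_dir=DATA_DIR):
    """Read back every line written so far (empty list if none)."""
    log = Path(data_dir) / "events.log"
    if not log.exists():
        return []
    return log.read_text(encoding="utf-8").splitlines()
```

The application never references the storage backend. Whether the bytes end up on ephemeral container storage or on a persistent store is decided entirely by what is mounted at that path.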
MapR tells me that the Kubernetes support works on-premises, as well as in managed Kubernetes cloud services, like those from Amazon, Microsoft and Google. The company also explained that unlike its previously announced Persistent Application Client Container (PACC) technology, which required the use of a specific Docker image, this new functionality will be compatible with any such image, as the connectivity to MapR-XD is provided by the Kubernetes Volume Driver.
This is pretty neat stuff, and it proves that the data and analytics container revolution, though quiet, is being fought, and slowly won.
AtScale gets automated, optimized and refined
AtScale, the Microsoft SQL Server Analysis Services (SSAS)-like BI platform that runs on top of Hadoop and Spark, is announcing its 6.5 release, with three major new features.
To start with, AtScale is adding automated modeling features to the product. What this means is that instead of requiring users to create their entire Universal Semantic Layer model from scratch, the product can now intuit some of that structure by looking at existing analytics assets already built on top of the underlying data.
Specifically, the product is able to inspect the assets contained in a Tableau workbook and from them determine which columns in which tables likely contain measures and dimensions. It can also determine table relationships by examining the Tableau workbooks and the data models within them. This means AtScale is now doing some modeling on the fly, as certain of its BI-on-Hadoop competitors have been doing for some time.
Other new features include execution of n-tile calculations on the server side and a Perspectives feature, which provides simplified or audience-specific filtered views of your models. The n-tile calculations join the estimated distinct count calculations as important row-access-level work that is done on the server side.
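For readers unfamiliar with n-tile calculations, the following Python sketch shows the general idea: rows are ranked by a value and split into n buckets of near-equal size (quartiles when n is 4, deciles when n is 10). It assumes the semantics of the SQL `NTILE` window function, where leftover rows go to the lowest-numbered buckets; it is not AtScale's implementation, and the `ntile` function name is just illustrative.

```python
def ntile(values, n):
    """Assign each value a bucket number from 1 to n, by ascending rank.

    Buckets are as equal in size as possible; when the row count is not
    divisible by n, the first (count % n) buckets get one extra row.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # Indices of the input, ordered by the value they point to.
    order = sorted(range(len(values)), key=lambda i: values[i])
    base, extra = divmod(len(values), n)
    buckets = [0] * len(values)
    pos = 0
    for bucket in range(1, n + 1):
        size = base + (1 if bucket <= extra else 0)
        for i in order[pos:pos + size]:
            buckets[i] = bucket
        pos += size
    return buckets
```

Because the calculation needs every row ranked against every other row, doing it on the server side avoids shipping row-level data to the BI client.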
The AtScale Perspectives feature rather closely resembles the SSAS feature of the same name. However, SSAS Perspectives are merely a convenience feature, and don’t prevent access to data in the model not present in the Perspective. AtScale Perspectives, meanwhile, do in fact act as a security mechanism, the AtScale team told me. So any notion users might have that they can ignore the Perspective and connect directly to the model is apparently incorrect.
In other news
There was some other new release news yesterday too:
- Mapbox released its Mapbox Visual for Microsoft Power BI, giving Microsoft’s self-service BI platform access to mapping functionality beyond its built-in map visualizations and those provided by ESRI’s ArcGIS Maps for Power BI.
- Salesforce announced the addition of “conversational queries” to its Einstein Analytics platform.
- Host Analytics announced the Beta release of its “Project Orion” technology that will make Enterprise Performance Management (EPM) accessible to business users, and others without significant financial/accounting backgrounds.
- And Datawatch announced a new release of its Monarch Swarm product for team-oriented data prep and analytics — this time with “Personalized Machine Learning” that drives detailed ranking and data recommendations.
One other announcement, being made today, is the launch of the StreamSets Data Protector product, which identifies personally identifiable information (PII) in data as it is ingested, and then layers on corresponding data security and broader data governance.
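As a rough illustration of identifying PII at ingest time, the sketch below scans each incoming record with regular expressions and redacts what it finds. This is a deliberately simple stand-in and not how StreamSets Data Protector works; the two patterns (email addresses and US Social Security number formats) and the names `PATTERNS`, `find_pii` and `redact` are assumptions made for the example.

```python
import re

# Illustrative patterns only; real PII detection covers far more cases.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_pii(record):
    """Return {field: [kinds of PII found]} for the string fields of a record."""
    findings = {}
    for field, value in record.items():
        if not isinstance(value, str):
            continue
        kinds = [kind for kind, pattern in PATTERNS.items() if pattern.search(value)]
        if kinds:
            findings[field] = kinds
    return findings


def redact(record):
    """Return a copy of the record with every detected PII match masked."""
    clean = {}
    for field, value in record.items():
        if isinstance(value, str):
            for kind, pattern in PATTERNS.items():
                value = pattern.sub(f"[REDACTED:{kind}]", value)
        clean[field] = value
    return clean
```

Running detection as records arrive means downstream security and governance policies can key off the findings before the data lands anywhere.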
Trends and news converge
Many of the new product and release announcements correspond with overall trends in the industry: the growing importance of container technology; serverless, cloud-based implementations of data technology; increased support for machine learning and AI; the importance of data protection and GDPR (the EU’s General Data Protection Regulation); and the growth in BI of sophisticated features on the one hand, and simplified use and operation on the other.
These trends move past the gee-whiz factor and address genuine customer pain points. That’s good to see, and it’s likely we’ll see even more of it, in future announcements, at future events.