As we recounted a few weeks back, we once shared a flight with an aspiring data scientist who was shopping for colleges and who preferred SAS for the completeness of its analytic programming functions. Then there are those who prefer the vitality of the open source community, like the R programmers who embrace the CRAN libraries. And then you have the granddaddy of them all: Python developers who have flocked to this nearly 30-year-old programming language because of its simplicity and flexibility. Python was recently ranked by IEEE Spectrum as the top programming language, and it ranks number 5 in the latest TIOBE index.
Python, like Java (which was first conceived for set-top boxes), was originally conceived for something else. As a general purpose language, it is used by systems operators and web developers alike, but who would have guessed that it would become arguably the most popular language for data scientists?
It shouldn’t be surprising that some data science tools are likewise exploiting their home court advantages. Anaconda, the commercial venture that has supported delivery of the eponymous open source packages for Python, bills itself as “The Most Popular Python Data Science Platform.” It provides the management infrastructure that is typically lacking from the free open source distribution.
Likewise, there are bloodlines that often drive the preferences for data science notebooks. Not surprisingly, there is a plethora of notebooks to choose from.
Capitalizing on its IPython heritage, the Jupyter notebook’s popularity with Python developers has served as a springboard for appealing to a broader community. An outgrowth of IPython (first created in 2001, and taking its current form and name in 2014), the Jupyter notebook is hailed for its ease of coding: there are plenty of programming shortcuts and, thanks to that lineage, native support for Python package outputs.
While Jupyter had its origins with developers working with data on laptops, Zeppelin was conceived for a multi-polar world of distributed big data platforms (Jupyter has since adapted). So it has adapters (the exact term is “interpreters”) to Hadoop components, Cassandra, HBase, JDBC, Spark, Flink and others (including Python, by the way). And compared to Jupyter, you can more readily mix and match code from different programming languages in a single notebook or project. If you’re interested, here’s a good discussion on the relative merits of Jupyter vs. Zeppelin.
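To make the interpreter idea concrete: in Zeppelin, a paragraph can begin with a directive such as `%python` or `%md` that names the interpreter that should run it, which is what lets one note mix languages. The Python sketch below is a toy illustration of that routing idea, not Zeppelin's own implementation. The function name `split_directive` and the choice of `spark` as the fallback interpreter are assumptions made for the example.

```python
def split_directive(paragraph_text, default="spark"):
    """Split a paragraph into (interpreter_name, body).

    Hypothetical helper: the fallback interpreter name is an assumption;
    in a real deployment the default is a matter of configuration.
    """
    stripped = paragraph_text.lstrip()
    if not stripped.startswith("%"):
        # No directive: hand the whole paragraph to the default interpreter.
        return default, paragraph_text

    # Drop the leading '%', then split once on the first run of whitespace.
    parts = stripped[1:].split(None, 1)
    if not parts:
        # A bare '%' with nothing after it.
        return default, ""
    name = parts[0]
    body = parts[1] if len(parts) > 1 else ""
    return name, body


# A note that mixes three languages, one paragraph each.
note = [
    "%md Quarterly sales overview",
    "%python\nprint(sum([1, 2, 3]))",
    "val total = 6",
]
for paragraph in note:
    print(split_directive(paragraph))
```

Each paragraph is routed independently, so prose, Python, and Scala can sit side by side in the same note.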
Given the demand for data scientists to become more productive and connected to the enterprise, it shouldn’t be surprising that there is a rapidly rising ecosystem of data science collaboration, lifecycle management, and development tools that embed or support notebooks. But for those who prefer a bottom-up approach to scaling their work, there are also projects (and commercial efforts) developing tools shaped around the notebooks that are, for many data scientists, the primary tools of choice.
ZEPL is the commercial entity formed by the creators of the Apache Zeppelin project. This week, they are releasing a collaborative platform that expands on Zeppelin, but won’t leave Jupyter out in the cold either. In essence, it is a collaborative management offering that bolts onto the notebook, providing the version and access control that a basic notebook lacks.
It builds on Zeppelin’s existing support of real-time visualizations, plus integration with Active Directory and LDAP. On top of that, it adds a new capability for creating and sharing private workspaces that may consist of one or more notebooks. And it manages permissions to compute resources for read/write, execution, and sharing of notebooks within specific workspaces.
On the roadmap, it plans to add support for managing workspaces on external clusters and for virtual private clouds (VPCs).
As noted, Jupyter notebooks are not left out. But, not surprisingly, the integration is not as direct; the contents of Jupyter notebooks must first be translated into the Zeppelin format before they can render. In an ecosystem where numerous players, from IBM to Cloudera, Dataiku, DataRobot, Domino Data Lab and others, are also targeting the challenge of data science lifecycle management, ZEPL’s sweet spot will likely be those Zeppelin users who want the management function to operate as an extension of their notebooks.
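For a sense of what translating a Jupyter notebook into the Zeppelin format involves, here is a minimal Python sketch. It is not ZEPL's converter. It assumes the Jupyter side follows the nbformat 4 layout (a `cells` list whose entries carry `cell_type` and `source`) and that a Zeppelin note is, at its simplest, a JSON object with a `name` and a list of `paragraphs` that each hold a `text` field. Real Zeppelin notes carry more fields (IDs, configuration, results) that are ignored here, and the function name `ipynb_to_zeppelin` is hypothetical.

```python
import json


def ipynb_to_zeppelin(ipynb, name="Converted note"):
    """Map Jupyter cells to Zeppelin-style paragraphs (simplified sketch)."""
    paragraphs = []
    for cell in ipynb.get("cells", []):
        source = cell.get("source", "")
        # nbformat allows source to be a string or a list of line strings.
        if isinstance(source, list):
            source = "".join(source)

        kind = cell.get("cell_type")
        if kind == "code":
            # Assumption: the notebook's code cells are Python.
            text = "%python\n" + source
        elif kind == "markdown":
            text = "%md\n" + source
        else:
            # Skip raw or unknown cell types in this sketch.
            continue
        paragraphs.append({"text": text})

    return {"name": name, "paragraphs": paragraphs}


# A tiny in-memory notebook standing in for a real .ipynb file.
notebook = {
    "nbformat": 4,
    "cells": [
        {"cell_type": "markdown", "source": ["# Sales\n", "Quarterly totals"]},
        {"cell_type": "code", "source": "print(1 + 2)"},
        {"cell_type": "raw", "source": "ignored"},
    ],
}

note = ipynb_to_zeppelin(notebook, name="Sales")
print(json.dumps(note, indent=2))
```

Cell outputs are dropped in this sketch, which hints at why the integration is less direct than native Zeppelin support: outputs, metadata, and kernel-specific features all need their own mapping.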