The inside Spark channel is a resource for professionals looking to learn about the benefits of Apache Spark.

MLOps | Is the Enterprise Repeating the Same DIY Mistakes?

In this contributed article, Aaron Friedman, VP of Operations at Wallaroo.ai, discusses why hiring data scientists isn’t the answer to unlocking ML value (especially at a time when finding qualified candidates is harder than ever).

Databricks Announces Major Contributions to Flagship Open Source Projects

Databricks announced that the company will contribute all features and enhancements it has made to Delta Lake to the Linux Foundation and open source all Delta Lake APIs as part of the Delta Lake 2.0 release. In addition, the company announced MLflow 2.0, which includes MLflow Pipelines, a new feature to accelerate and simplify ML model deployments. Finally, the company introduced Spark Connect, which enables the use of Spark on virtually any device, and Project Lightspeed, a next-generation Spark Structured Streaming engine for data streaming on the lakehouse.

Don’t Call It A “Data Product” Unless It Meets These 5 Requirements

In this special guest feature, Barr Moses, Co-founder and CEO of Monte Carlo, argues that data products can transform an organization’s ability to be data-driven, as long as they meet 5 key requirements and are implemented correctly and in good faith.

Databricks Launches SQL Analytics to Enable Cloud Data Warehousing on Data Lakes

Databricks, the data and AI company, announced the launch of SQL Analytics, which for the first time enables data analysts to perform workloads previously meant only for a data warehouse on a data lake. This expands the traditional scope of the data lake from data science and machine learning to include all data workloads including Business Intelligence (BI) and SQL.

Understanding Intention: Using Content, Context, and the Crowd to Build Better Search Applications

This white paper by enterprise search specialists Lucidworks points out that unlike consumer search, which has become a seamless part of our everyday lives, the enterprise side might as well still be running Windows 95. Imagine if Amazon, Google, or Facebook treated every user the same, regardless of who they are, where they are, what they’re searching for, and what they’ve clicked. Your users expect the sophistication of consumer search in their enterprise apps.

StreamSets Launches StreamSets Transformer

StreamSets, Inc., provider of the DataOps platform for modern data integration, released StreamSets® Transformer, a simple-to-use, drag-and-drop UI tool to create native Apache Spark applications. Designed for a wide range of users — even those without specialized skills — StreamSets Transformer enables the creation of pipelines for performing ETL, stream processing and machine-learning operations. Now, data engineers, scientists, architects and operators gain deep visibility into the execution of Apache Spark while broadening usage across the business.

Addressing Governmental Challenges when Engaging AI, ML and Data Analytics

Gartner recently stated that all industries and levels of government agree the top three game-changing technologies today are AI/machine learning, data analytics/predictive analytics and cloud technologies. However, there are some primary sticking points when it comes to innovation in these areas. Government organizations continue to encounter challenges when trying to pursue these initiatives due to complex security and compliance requirements, poor scalability of legacy IT infrastructure, and perceived risks associated with cloud and IT modernization efforts. How can these challenges be addressed?

The Power of Crunching Big Data Effectively

In this contributed article, Lex Boost, CEO of Leaseweb USA, points out that according to an Accenture study, 79% of enterprise executives agree that companies not embracing big data will lose their competitive edge. Considering that data creation is on track to grow 10-fold by 2025, it’s crucial for companies to be able to process it more quickly and meaningfully.

Databricks and RStudio Introduce New Version of MLflow with R Integration

Databricks, a leader in unified analytics founded by the original creators of Apache Spark™, and RStudio today announced a new release of MLflow, an open source multi-cloud framework for the machine learning lifecycle, now with R integration. RStudio has partnered with Databricks to develop an R API for MLflow v0.7.0.

State of the Art Natural Language Processing at Scale

The two-part presentation below from the Spark+AI Summit 2018 is a deep dive into key design choices made in the NLP library for Apache Spark. The library natively extends the Spark ML pipeline APIs, which enables zero-copy, distributed, combined NLP, ML & DL pipelines, leveraging all of Spark’s built-in optimizations.