Research Highlights: Dremio Demonstrates Data Lakehouse Value with Math-Style Proof and Technical Clarity

Dremio, the easy and open data lakehouse, has published “The Data Lakehouse: Data Warehousing and More,” a new research paper now available on arXiv. The paper explores the data lakehouse model, offering modern insights for businesses looking to optimize their data utilization. By releasing the work as a preprint, Dremio aims to gather feedback from the open source research and scientific communities and to make the findings available to the wider community of practitioners.

Research Highlights: Unveiling the First Fully Integrated and Complete Quantum Monte Carlo Integration Engine

Quantinuum, a leading integrated quantum computing company, has published full details of its complete Quantum Monte Carlo Integration (QMCI) engine. QMCI applies to problems that have no analytic solution, such as pricing financial derivatives or simulating the results of high-energy particle physics experiments, and promises computational advances across business, energy, supply chain logistics, and other sectors.
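For readers unfamiliar with the classical technique that the quantum engine accelerates, here is a minimal sketch of ordinary (non-quantum) Monte Carlo integration in plain Python. The function name and parameters are illustrative only and are not part of Quantinuum's engine:

```python
import random

def mc_integrate(f, a, b, n=100_000, seed=0):
    """Estimate the integral of f over [a, b] by averaging f at
    uniformly sampled points (classical Monte Carlo integration)."""
    rng = random.Random(seed)
    total = sum(f(a + (b - a) * rng.random()) for _ in range(n))
    return (b - a) * total / n

# Example: the integral of x^2 over [0, 1] is 1/3.
estimate = mc_integrate(lambda x: x * x, 0.0, 1.0)
print(estimate)  # close to 1/3
```

The statistical error of this classical estimator shrinks only as 1/sqrt(n); quantum Monte Carlo integration promises a quadratic speedup over that rate, which is what makes an end-to-end QMCI engine attractive for pricing and simulation workloads.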

Video Highlights: Ultimate Guide To Scaling ML Models – Megatron-LM | ZeRO | DeepSpeed | Mixed Precision

In this video presentation, Aleksa Gordić explains what it takes to scale ML models up to trillions of parameters! He covers the fundamental ideas behind all of the recent big ML models like Meta’s OPT-175B, BigScience BLOOM 176B, EleutherAI’s GPT-NeoX-20B, GPT-J, OpenAI’s GPT-3, Google’s PaLM, DeepMind’s Chinchilla/Gopher models, etc.
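To give a flavor of one topic from the video, the toy snippet below (not taken from the presentation) shows why mixed-precision training keeps a higher-precision "master copy" of the weights: a small gradient update can vanish entirely when applied directly in fp16:

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to IEEE 754 half precision and back,
    mimicking fp16 weight storage in mixed-precision training."""
    return struct.unpack('e', struct.pack('e', x))[0]

w = 1.0
grad_update = 1e-4                     # a small gradient step
# In fp16, the spacing between values near 1.0 is about 2**-10 ~ 0.001,
# so adding 1e-4 rounds straight back to 1.0 and the update is lost.
print(to_fp16(w + grad_update) == w)   # True: update lost in fp16
# A float32/float64 master copy of the weight preserves the update.
print(w + grad_update > w)             # True
```

This is the core motivation for keeping fp32 master weights (and loss scaling) alongside fp16 compute, as discussed in the video.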

Research Highlights: Scaling MLPs: A Tale of Inductive Bias

Multi-layer perceptrons (MLPs) are the most fundamental type of neural network: they play an important role in many machine learning systems and are the most theoretically studied architecture. A new paper from researchers at ETH Zurich pushes the limits of pure MLPs and shows that scaling them up yields much better performance than past results suggested. These findings may have important implications for the study of inductive biases, the theory of deep learning, and neural scaling laws.
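As a reminder of just how simple the architecture under study is, here is an illustrative forward pass for a pure MLP in plain Python. This is a toy sketch, not the paper's code; the weights are arbitrary:

```python
def mlp_forward(x, layers):
    """Forward pass of a plain MLP. Each layer is (weights, biases);
    hidden layers use ReLU. No convolutions, no attention -- the
    'minimal inductive bias' setting of pure MLPs."""
    for i, (W, b) in enumerate(layers):
        x = [sum(w * xi for w, xi in zip(row, x)) + bi
             for row, bi in zip(W, b)]
        if i < len(layers) - 1:          # ReLU on hidden layers only
            x = [max(0.0, v) for v in x]
    return x

# Tiny 2 -> 3 -> 1 network with fixed, arbitrary weights.
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),  # 2 -> 3
    ([[1.0, -1.0, 0.5]], [0.2]),                                  # 3 -> 1
]
print(mlp_forward([1.0, 2.0], layers))   # approximately [-0.1]
```

Every layer treats its input as a flat vector, which is exactly why MLPs lack the spatial inductive biases of CNNs; the paper asks how far scale alone can compensate for that.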

Research Highlights: LLMs Can Process a Lot More Text Than We Thought

A team of researchers at AI21 Labs, the company behind the generative text AI platforms Human or Not, Wordtune, and Jurassic 2, has identified a new method to overcome a challenge that most Large Language Models (LLMs) grapple with: a limit on how much text they can process before it becomes too expensive and impractical.

Research Highlights: Real or Fake Text? We Can Learn to Spot the Difference

A team of researchers at the University of Pennsylvania School of Engineering and Applied Science is seeking to empower tech users to mitigate risks of AI generated misinformation. In a peer-reviewed paper presented at the February 2023 meeting of the Association for the Advancement of Artificial Intelligence, the authors demonstrate that people can learn to spot the difference between machine-generated and human-written text.

Research Highlights: SparseGPT: Prune LLMs Accurately in One-Shot

A new research paper shows that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, designed specifically to work efficiently and accurately on massive GPT-family models.
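SparseGPT itself uses an efficient layer-wise reconstruction procedure; the toy snippet below instead shows the much simpler magnitude-pruning baseline, only to illustrate what "50% unstructured sparsity" means in practice:

```python
def prune_magnitude(weights, sparsity=0.5):
    """Zero out the smallest-magnitude weights until roughly the
    requested fraction is zero. Plain magnitude pruning -- a far
    simpler baseline than SparseGPT's reconstruction-based method.
    (Ties at the threshold may prune slightly more than requested.)"""
    k = int(len(weights) * sparsity)
    threshold = sorted(abs(w) for w in weights)[k - 1] if k else float("-inf")
    return [0.0 if abs(w) <= threshold else w for w in weights]

pruned = prune_magnitude([0.9, -0.05, 0.4, 0.01, -0.7, 0.2])
print(pruned)  # half of the entries are now exactly zero
```

The paper's contribution is making this kind of one-shot sparsification accurate at GPT scale, where naive magnitude pruning degrades accuracy badly without retraining.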

Research Highlights: A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT

Pretrained Foundation Models (PFMs) are regarded as the foundation for various downstream tasks across different data modalities. A pretrained foundation model, such as BERT, GPT-3, MAE, DALL-E, or ChatGPT, is trained on large-scale data and provides a reasonable parameter initialization for a wide range of downstream applications.

Research Highlights: MIT Develops First Generative Model for Anomaly Detection that Combines both Reconstruction-based and Prediction-based Models

Kalyan Veeramachaneni and his team at the MIT Data-to-AI (DAI) Lab have developed the AutoEncoder with Regression (AER), the first generative model for time series anomaly detection that combines both reconstruction-based and prediction-based approaches. Built over three years, AER has been learning to extract intelligence from signals and has matured to the point where it significantly outperforms the market's leading models.
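The sketch below is not the AER model; it only illustrates, on a toy series, the general idea of combining a reconstruction-style error with a prediction-style error into a single anomaly score (the window size and scoring rule here are arbitrary choices):

```python
def anomaly_scores(series, window=3):
    """Toy combination of two error signals per time step:
    a 'reconstruction' error (deviation from the local window mean)
    and a 'prediction' error (deviation from the previous value).
    Not the AER model -- just a sketch of the combined-score idea."""
    scores = []
    for t in range(window, len(series)):
        local = series[t - window:t]
        recon_err = abs(series[t] - sum(local) / window)
        pred_err = abs(series[t] - series[t - 1])
        scores.append(recon_err + pred_err)   # simple sum of both signals
    return scores

series = [1.0, 1.1, 0.9, 1.0, 5.0, 1.0, 1.1]
scores = anomaly_scores(series)
print(scores.index(max(scores)))  # index 1 corresponds to the 5.0 spike
```

The intuition, which AER develops far more rigorously, is that reconstruction-based and prediction-based models make different kinds of mistakes, so combining their error signals flags anomalies more reliably than either alone.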

Research Highlights: R&R: Metric-guided Adversarial Sentence Generation

Large language models are a hot topic in AI research right now. But a more pressing problem looms: we might run out of data to train them on, possibly as early as 2026. Kalyan Veeramachaneni and the team at the MIT Data-to-AI Lab may have found a solution. In their new paper on Rewrite and Rollback (“R&R: Metric-Guided Adversarial Sentence Generation”), they show that the R&R framework can turn low-quality text (from sources like Twitter and 4chan) into high-quality data (text from sources like Wikipedia and industry websites) by rewriting meaningful sentences, thereby increasing the amount of the right type of data available to train and test language models.