In most companies, Data Engineers support Data Scientists in various ways. Often this means translating or productionizing the notebooks and scripts that a Data Scientist has written. A large portion of the Data Engineer’s role could be replaced with better tooling for Data Scientists, freeing Data Engineers to do more impactful (and scalable) work.
There’s a sentiment making its way around the internet (again): We don’t need Data Scientists, we need Data Engineers.
While data science tools are being optimized to perform well on microbenchmarks, they are becoming more and more difficult to use. Is the benchmark performance worth the human time it takes to get there? (Spoiler: it would take up to 200 years to recoup the upfront cost of learning a new tool, even if the new tool performs 10x faster.)
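To make the break-even reasoning concrete, here is a minimal sketch of the arithmetic. The function name and every input value are hypothetical and chosen only for illustration; they are not the figures behind the 200-year estimate.

```python
def break_even_years(learning_hours, hours_per_run_old, speedup, runs_per_year):
    """Years of use before the time saved equals the time spent learning.

    All parameters are illustrative assumptions, not measured values.
    """
    if speedup <= 1:
        raise ValueError("speedup must be greater than 1 to ever break even")
    # Time one run takes with the faster tool.
    hours_per_run_new = hours_per_run_old / speedup
    # Human time saved on each run by switching tools.
    saved_per_run = hours_per_run_old - hours_per_run_new
    # Learning cost divided by the time saved per year.
    return learning_hours / (saved_per_run * runs_per_year)


# Hypothetical inputs: 40 hours to learn the tool, a 10-second query,
# a 10x speedup, and 500 runs of that query per year.
years = break_even_years(
    learning_hours=40,
    hours_per_run_old=10 / 3600,
    speedup=10,
    runs_per_year=500,
)
print(f"Break-even after about {years:.0f} years")
```

With these made-up inputs the break-even point is roughly 32 years, because a 10x speedup on a 10-second query saves only 9 seconds per run. The shorter the original runtime, the longer it takes for a faster tool to pay back the time spent learning it.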
A recent blog post compared a variety of tools in a set of head-to-head benchmarks. I wanted to take the opportunity to talk about our vision for Modin and where we’d like to take the field of data science.
Modin (https://github.com/modin-project/modin) takes a different view of how systems should be built. Modin is designed around enabling the data scientist to be more productive. When I created Modin as a PhD student in the RISELab at UC Berkeley, I noticed a dangerous trend: data science tools are being built with hardware performance in mind, but becoming more…
Dataframes emerged from a specific need, but because so many diverse systems now call themselves dataframes, the term is on the verge of meaning nothing. In an effort to preserve the dataframe, we formalized its definition based on the original data model in our recent preprint [2].
Before we get into the specifics, I’d like to outline a few of the questions I will answer below:
The earliest “dataframe”…
Serverless computing is rapidly gaining popularity due to its ease of programmability and management. Many see it as the next general-purpose computing platform for the cloud [4]. However, while existing serverless platforms have been successful in supporting several popular applications such as event processing and simple ETL, they fall short of supporting latency- and throughput-sensitive applications such as streaming and machine learning (ML). The main challenge stems from the gap between the performance required by these applications, which are typically deployed on virtual machines (VMs), and the performance of existing serverless platforms.
In this post, we argue…
PhD Student in the RISELab at UC Berkeley.