The role of Data Engineer exists as we know it because of a lack of adequate tooling for Data Scientists

In most companies, Data Engineers support the Data Scientists in various ways. Often this means translating or productionizing the notebooks and scripts that a Data Scientist has written. A large portion of the Data Engineer’s role could be replaced with better tooling for Data Scientists, freeing Data Engineers to do more impactful (and scalable) work.

Why does this matter?

There’s a sentiment making its way around the internet (again): We don’t need Data Scientists, we need Data Engineers.

Photo by Andrea Piacquadio (pexels.com)

We need to start placing a higher value on data scientists’ time than we do on machine time

While data science tools are being optimized to perform well on microbenchmarks, they are becoming more and more difficult to use. Is the benchmark performance worth the human time it costs to get there? (Spoiler: it would take up to 200 years to recoup the upfront cost of learning a new tool, even if the new tool runs 10x faster.)

Comparing Modin with Dask, Ray, Vaex, and RAPIDS

A recent blog post compared a variety of tools in a set of head-to-head comparisons. I wanted to take the opportunity to talk about our vision for Modin and where we’d like to take the field of data science.

Modin (https://github.com/modin-project/modin) takes a different view of how systems should be built. Modin is designed around enabling the data scientist to be more productive. When I created Modin as a PhD student in the RISELab at UC Berkeley, I noticed a dangerous trend: data science tools are being built with hardware performance in mind, but becoming more…

The Dataframe Series

Dataframes are losing their statistical computing and machine learning roots

Dataframes emerged from a specific need, but because so many diverse systems now call themselves dataframes, the term is on the verge of meaning nothing. In an effort to preserve the dataframe, we formalized the definition based on the original data model in our recent preprint[2].

Before we get into the specifics, I’d like to outline a few of the questions I will answer below:

  1. What is a dataframe and where does it come from?
  2. How are dataframes different from tables? From matrices?
  3. How is the explosion of dataframe systems killing the dataframe?
  4. Why should the user care?

Neither a table nor a matrix

The earliest “dataframe”…


Serverless computing is rapidly gaining popularity due to its ease of programming and management. Many see it as the next general-purpose computing platform for the cloud [4]. However, while existing serverless platforms have successfully supported several popular applications such as event processing and simple ETL, they fall short of supporting latency- and throughput-sensitive applications such as streaming and machine learning (ML). The main challenge stems from the gap between the performance required by these applications, which are typically deployed on virtual machines (VMs), and the performance of existing serverless platforms.

In this post, we argue…

Devin Petersohn

PhD Student in the RISELab at UC Berkeley.
