So you want to build an open source tool/library as a grad student

This is a collection of experiences and recommendations for building an open source community as a grad student

Devin Petersohn
riselab

--

Many grad students and professors have asked me for suggestions on how to build a functioning and thriving open source community while in grad school. This blog post appears as a chapter in my thesis, but ultimately I decided to extract those contents and put them here for easier retrieval and consumption.

This blog post is part history, part lessons, part advice. I don’t know everything, and the moderate success of my work does not mean that my advice is automatically good. I think there is value in hearing other people’s opinions, but not taking them as truth. I recommend that method of consumption here. Everyone has opinions, every situation is different, judge for yourself.

My History Building a Successful Open Source Project

During my grad school career, I built Modin (https://github.com/modin-project/modin), a full dataframe implementation that, as of writing, has over 6,000 GitHub stars. This effort has been supported by many people over the last few years, and definitely would not be as far as it is without that support. Berkeley is well known for creating some of the most used and most impactful software on the planet, so I did have an unfair advantage in terms of brand.

My approach toward promoting the work has been fairly successful, however I attribute that largely to luck. The first blog post I published (2018) got a lot of input and feedback from others working on the project at the time. It ended up getting shared on Twitter and HackerNews by many people (I had accounts with neither at the time) and generated a lot of interest. At the time, pandas on Ray (which would become Modin) was a 1 month hack I put together with help from several undergraduate students at Berkeley. It honestly wasn’t ready for the overwhelming interest it received, and yet it has continued to be developed and grown into something that I couldn’t have imagined at the time.

My role has transitioned away from being the main generator of code to more of a project management role, coordinate many disparate institutions and their contributions and making it easy to contribute. I spend a lot of time reviewing code and telling other people what to do rather than writing all of the code myself.

Lessons and Advice

This section is likely to be long and difficult to parse, so I’m going to make my advice section headers so it’s easier to skim to find the points you’d like to better understand. The points are not in any specific order.

[1] Make your system understandable to your target user, and don’t worry about anyone else

This point is something I think we’ve gotten right from the beginning with Modin. From the start, we have abstracted away complex details from the user, including in how we present the system. This has, of course, led many of highly technical people to discount the complexity of abstracting away these details, but that has never bothered me. I don’t care if someone thinks Modin is or isn’t technically interesting, I care that it solves a problem. Because of how I talk about Modin, most people have a simple understanding of the system. That is by design. While working on Modin, we have formalized a new data model, created a new dataframe algebra, and created a truly unique data layout and metadata management system. The people who could understand enough about the underlying system to appreciate it likely wouldn’t use Modin in the first place, because Modin is targeting a less-technical group of users. I think this is really important because when people talk about Modin, they generally focus on the problems it solves rather than the technically interesting parts of the implementation. I am okay with that, but make sure that you are. Do you want people to use your work or do you want them to think you’re really smart? Sometimes you can have both, but often not.

[2] Be prepared to defer your graduation and publications

This point is less applicable if you have a large team managing the open source, but in my case I was working mostly alone from the open source side. On the research side, we were able to bring together some of the best in databases and machine learning, but many deadlines were missed because of things that came up in the open source. I prioritized the open source community and development over my own graduation and publications. This is a decision you’ll have to make for your own situation. I’m not completely convinced now that you need to do this, but at the time I felt like it was necessary to keep the open source community alive and growing. There’s little overlap between open source community development and grad school requirements. You are going to have to respond to questions, issues, and promote your work.

[3] The fun parts of open source are front-loaded

At the beginning of any open source project you’re going to be able to move fast. There’s no technical debt, no new issues, and a lot of energy and excitement. As time goes on, your time will go from developing new features and building things to answering issues and emails. If you’re fortunate enough to have a lot of external contributors like Modin does, you’ll end up spending a lot of time reviewing code. These days, I spend maybe 20% of my time writing code and debugging. If you want to get into open source, be prepared to spend a large chunk of time on user issues and support after your project hits a critical mass. If you are mostly doing it alone, the project momentum can easily screech to a halt and feel like it’s not moving anywhere for weeks. My recommendations here are to (1) avoid romanticizing the idea of managing a highly visible open source project and (2) learn to bin your time. At first, it’s easy to answer issues as they come in, eventually you will not get anything done if you always answer issues immediately.

[4] Promotion is important

If you want people to use your project, you have to tell them about it. Part of the difficulty in deciding how to promote is around deciding when. Promotion takes away from development in a small team, and so there needs to be some good reason to promote. Early on, I had planned on curating a series of blog posts that could archive the journey. That was quickly thrown out after the reception from the first blog post. My main advice is to be careful not to overpromote: each update should be substantial. This is mostly personal taste, but I don’t like reading a lot of fluffy blogs that have hardly any new content. Honestly, most people aren’t going to read the blog anyway, they will skim the headers or scroll to the bottom to read the conclusion. In terms of promotion, getting multiple friendly people to tweet about your blog is probably the best way to promote. HackerNews is not what I would consider a good distribution channel, rather a good place for discussion about topics surrounding a blog’s title and content. Podcasts are another way to spread the word, and they are a common way that people hear about new projects.

[5] Make your work easy to install and use

The biggest hurdle to using something is getting started, and the lower the barrier the more people will try it. This might seem obvious but easy means different things to different people so I’ll try to be concrete here. Do you require users to pull your Docker container? Do you require users to build from source? Does installation change the user’s environment? Does installation take more than a couple of steps? If you answered yes to any of these questions you’re going to have a hard time getting people to do more than look at your README. Making something easy to use is important to getting a community off the ground, if people don’t use it they won’t tell their friends about it (probably).

The second component to easy use is examples: you need really good useful examples, not toys. Users want to see what they can do with your tool, and showing them how to do a trivial map over a list of integers is not going to give users a good idea of what they can do. Your examples should show off a variety of use cases and capabilities on actual workloads. Examples overlap a bit with the next point on documentation.

[6] Write documentation

Everyone says this, nobody does it.

[7] Primarily use communication channels that are Googleable

Slack is not internet searchable, and Gitter search results are terrible. When people have a problem they will often go to Google first to see if others have solved the problem. In Modin, we use GitHub issues and Discourse boards for discussions to make sure that people looking for solutions can find them. This also has the nice side effect of being able to point to those pages when someone asks the same question.

[8] Give talks at small venues and meetups, not just the big ones

Promoting work via talks at big venues does give you more visibility, but ultimately the smaller more intimate venues are where you’re more likely to make good connections with people who will actually use your project. It’s tempting at the beginning to try to go straight for giving talks at the big international conferences, but I’ve found that small meetups are both more focused (people are more likely to have the problem you’re solving) and more willing to talk. I gave talks to meetup groups as small as 6 people, and in those meetups I had more engaging conversations that in the larger venues. You need these relationships with your users early on to build momentum. Otherwise, once you do talk at these large venues and people will ask “Who is using your project?”. In the large conferences people claim to be looking for the next big technology to adopt, but really they just want to use what everyone else is using. Having users will get you more users, but you need the early adopters, and often you will meet them in small venues or meetups.

[9] Make it easy to reach you

This point will contradict with the point about communication channels being Googleable, but in general you need to be able to be reached by people who run into problems. Don’t make the communication overhead of reporting a bug a barrier to discovering the bug exists. In Modin, for example, we set up a couple of emails and if something went wrong internally we asked them to email us a bug report as a part of the error message. It has been pretty successful and there are several bugs we found that weren’t discoverable otherwise. We rarely get these today, which is a good indication that things are getting more stable.

Generally, to solve the issue of search indexability, I will ask people who email to open an issue if I triage it to be serious and new. I’ve found people are easier to go back/forth with over email, so you can get the simple stuff out of the way quickly as well (e.g. user environment issues or user errors).

[10] Scale your efforts later, build a community first

A lot of what I propose here doesn’t scale, and it will get worse as the number of users grows. This is by design. Even after you have a critical mass, it’s not likely you’ll have a large group of contributors outside of your organization (it took roughly 2 years for Modin to get serious outside contributor groups). Building a community is largely a social effort, and you need broadcasting on Twitter is not going to be enough to get the ball rolling. Everyone is doing that. If you want to actually build a community, you need to do things like talk to individuals and answer individual emails. The personal connections are more important than trying to make noise in a very noisy world, and they will get you farther than clicks or views on your tweets and blog posts.

[11] Make it easy to contribute and ask for contributions

Making your project easy to contribute to is a good way to help build a community. There are always people who are interested in working on projects on the side, and getting these people involved is important. Often, projects are too difficult to jump into years later. It is difficult to build a community when you’re the only person who knows how to do anything. You’re going to need people who can help with issues so you can take some time off every now and then to recharge. This is obvious, but it takes good design and a lot of DevOps work, which doesn’t necessarily equate to more code being output. In fact, often helping others will often reduce your own productivity and the overall code velocity of the project. This code velocity cost (on an individual basis) may actually never be recouped, but I argue that there are intangible benefits to working with other people:

  1. You need to justify your designs
  2. Producing code is not a good metric of productivity — there are more important things than new features
  3. Excited contributors often also become evangelists

Adding contributors will not always yield more code or more bugfixes, but it helps build a community.

Concluding thoughts

I hope this has been helpful. It’s a lot of work to keep a community going, and the work is mostly social. There’s a lot of engineering in building something, but actually getting the word out and keeping in contact with users takes significant effort.

This list is by no means complete, but I hope it’s enough to help you get off the ground. Grad school is a great time to explore a bunch of different things, including open source community building and project management. I hope you can be as successful as I was (or more!) and that this list can help you plan how to execute on your great ideas. Please don’t hesitate to reach out to me directly if you have any questions (https://www.linkedin.com/in/devinpetersohn/). I don’t consider myself an expert on open source community building (having only done it once!), but I will do my best to answer any questions you might have. Good luck!

--

--