What is Pairing?
Pairing is a situation where two programmers work at a single workstation (two keyboards, two mice, and one screen) at the same time. If you’re interested in pairing in general, here’s a great overview. Usually one programmer drives, writing the code, while the other navigates the process. The benefits of pairing are threefold:
- Continuous code review rather than review on a weekly or monthly basis, which keeps code quality high at all times.
- Better diffusion of knowledge among the team, meaning there’s no single point of failure.
- Reduction of coordination effort, since each pair operates as one coordinated unit instead of as two individuals.
When using an agile approach in data science, pairing is a key ingredient as it’s designed for a short-cycle process.
The concept of pair programming has been discussed extensively on the internet, so I won’t bore you by introducing it again. However, I do want to mention that pairing has been shown to be more efficient than working individually, even though it requires two people to collaborate at the same time.
This paper details how pairing is preferable to working individually for multiple reasons: it reduces the time spent on projects and improves both code quality and workers’ moods. I highly recommend this paper to managers who are looking to increase overall team efficiency and improve their work environment.
Assuming you’re already familiar with the concept of pair programming, you probably have some doubts about how it works in a data science workflow, since software development and data science differ quite a bit. If you haven’t read our previous posts describing why we’re doing this, check out Agile Data Science Process for more context.
When it comes to data science, it can be hard to plan out everything beforehand as there are many unknowns along the way. Unlike software development tasks, which are relatively easy to timebox, the ‘research’ component of data science tasks makes it difficult to just ‘navigate’ and ‘drive’.
How Does Pairing Work for Data Scientists?
For the reasons mentioned above, we can expect some tweaks in methodology and mindset for pairing to work for data scientists. The secret sauce here is to find the balance between pairing and going parallel.
In an end-to-end data science product development process, there’s certainly a fair amount of ‘engineering’ to do, in the form of ‘development-oriented tasks’. Such tasks include writing code for data processing, sampling, train/test splitting, feature construction, and post-model ETL. These tasks are great for pairing as they’re inherently developer tasks.
On the other hand, when things like data exploration, feature analysis, and modeling come up, it’s best to go parallel and share results/insights later, as the tasks are aimed towards insights rather than implementation.
Therefore, breaking down the tasks beforehand can help data scientists find the optimal way to pair. Here’s a short summary that ties to the CRISP-DM process mentioned in Agile Data Science Process.
- Pairing tasks: Data Preparation and Deployment. This should include all reusable function/script tasks like data processing, feature engineering, sampling, train/test splitting, and ETL.
- Parallel tasks: Data Understanding, Modeling, and Evaluation. This includes exploratory/trial and error/experiment tasks like data exploration, feature analysis, modeling/experiments.
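To make the split above concrete, here is a minimal Python sketch of how a team might tag backlog items by CRISP-DM phase and sort them into pairing and parallel work. The names `Mode`, `PHASE_MODE`, and `plan_tasks`, as well as the sample tasks, are hypothetical and only illustrate the idea; Business Understanding is left out because the summary above does not classify it.

```python
from enum import Enum


class Mode(Enum):
    PAIR = "pair"
    PARALLEL = "parallel"


# Working mode per CRISP-DM phase, following the summary above.
# Business Understanding is omitted on purpose: the post does not classify it.
PHASE_MODE = {
    "data_understanding": Mode.PARALLEL,
    "data_preparation": Mode.PAIR,
    "modeling": Mode.PARALLEL,
    "evaluation": Mode.PARALLEL,
    "deployment": Mode.PAIR,
}


def plan_tasks(tasks):
    """Group (task_name, phase) tuples by the working mode of their phase."""
    plan = {Mode.PAIR: [], Mode.PARALLEL: []}
    for name, phase in tasks:
        plan[PHASE_MODE[phase]].append(name)
    return plan


# Hypothetical backlog used only for illustration.
backlog = [
    ("write sampling script", "data_preparation"),
    ("explore feature distributions", "data_understanding"),
    ("try baseline models", "modeling"),
    ("post-model ETL", "deployment"),
]

plan = plan_tasks(backlog)
print("Pair on:", plan[Mode.PAIR])
print("Work in parallel on:", plan[Mode.PARALLEL])
```

A lookup table like this is deliberately rigid. In practice a team would likely override it task by task, for example pairing on a modeling task once it turns into reusable training code.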
Essentially, data science is half engineering and half research, especially in a product-oriented environment. With a little tweaking of the methodology, a data science team can enjoy the benefits of pairing and improve both the efficiency and the quality of the development process.
Some Nuances from Our Experience
Even though pairing can be incredibly useful in data science, things don’t always go as well as we’d like. Here are some tips that we found useful in our experience working with our clients. They apply especially to an ‘expert-novice’ pairing situation, where part of the goal is to mentor.
Map out what to do before the pairing session
Pairing while mentoring can be very inefficient, especially when the skill gap is large. It’s very easy to lose track of what you’re doing and miss your goals. We found that mapping out the game plan beforehand can be very helpful. It’s crucial for the ‘expert’ to be aware of, and have control over, the process at all times. It helps greatly to walk into a pairing session with a clear game plan (even better if some skeleton or pseudocode is already written) so you can navigate the session smoothly while staying on track.
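As an illustration of what such a skeleton could look like, here is a minimal Python sketch that the ‘expert’ might prepare ahead of a session. All function names, signatures, and steps are hypothetical. The point is only that the order of steps and the input/output contracts are agreed beforehand, while the bodies are left to be written together.

```python
def clean_rows(rows):
    """Step 1: drop rows with missing values.

    Input and output: a list of dicts, one dict per record.
    """
    raise NotImplementedError("to be written during the pairing session")


def add_features(rows):
    """Step 2: add derived columns to each row and return the new rows."""
    raise NotImplementedError("to be written during the pairing session")


def split_train_test(rows, test_fraction=0.2, seed=0):
    """Step 3: return (train_rows, test_rows) using a reproducible shuffle."""
    raise NotImplementedError("to be written during the pairing session")


def run_pipeline(rows):
    """The agreed order of steps: the 'map' the navigator keeps the session on."""
    return split_train_test(add_features(clean_rows(rows)))
```

Because every stub raises `NotImplementedError`, calling `run_pipeline` at any point shows the pair which step is next, which makes it harder to drift away from the plan.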
Switching partners is encouraged but not necessary
Switching partners is beneficial when pairs need new perspectives. However, chemistry does matter. Allowing the pairs to form and switch naturally, without too much interference, is crucial to a smooth workflow. Moreover, switching pairs in the middle of a project means another round of knowledge transfer, which should be a timeline consideration. Therefore, only rotate pairs when the situation or timeline allows.
To sum up, pairing is a key methodology in agile processes. To make it work in a data science project, we need to be fully aware of how software development and data science projects differ in nature. If applied properly, pairing can significantly benefit communication, code quality, and development efficiency. In my next post, I will talk about how we conduct in-team and cross-team iterations in our agile data science process.