Machine Learning – How to Predict Software Developer Productivity

This article presents my thoughts on how machine learning techniques could be used to address one of the most common questions in the software industry: whether a software developer is productive or not. After all the effort I have made to solve this problem using traditional, rules-based programming techniques, I can say that I found no definitive way of arriving at a concrete solution. In fact, I created a tool, AgileSQM, to capture software quality metrics (SQM) such as code coverage, duplication, and complexity, and to infer from the trending data whether a software developer is productive. However, I soon hit a roadblock in getting the tool accepted by a wider audience, because many more features need to be captured and analyzed before one can infer anything about a developer’s productivity. Now that I am deep into machine learning, I have started to believe that machine learning algorithms could be used to predict software developers’ productivity. Please feel free to comment or suggest if I have missed one or more important points.

As I go deeper into data science in general, I am starting to believe that measuring software developer productivity is a machine learning problem, specifically a binary classification problem, that could be solved using an algorithm such as logistic regression. The following steps could be used to create a model that predicts whether a software developer is productive or not:

1. Identify the features that could be used to predict software developer productivity. Some candidate features:
   - Number of story points/function points delivered (this puts the problem complexity in context).
   - Problem-solving skills (great, decent, bad). Another approach could be to rate developers on a 100-point scale.
   - Level of participation in code reviews. The answer could again fall within a discrete range of values (active, inactive).
   - How communicative the developer is, both orally and in writing. The value could be high, medium (communicative at times), or low, as this also looks like a discrete feature.
   - Contribution towards new initiatives (yes or no).
   - Code-related data for the individual developer, such as code coverage and code complexity, which could be gathered from tools such as Sonar. Here the focus may be on the delta (change) between measurements: for example, the change in coverage (positive or negative) and the change in code complexity (positive or negative).
2. Gather the above data, for every feature and every developer, from key technology stakeholders such as the tech leads of different teams. For uniformity, one may need to establish a baseline for how stakeholders should answer each question consistently.
3. Along with the feature data, have the stakeholders state whether the developer is productive or not. This is the label the model learns to predict. To avoid bias, there needs to be a set of baseline criteria that stakeholders use to judge productivity.
4. Gather this data every quarter and continue the process for 6-8 quarters.
5. Create a machine learning model using the above data. The model could be refined further by retraining it with the new data collected each quarter.
6. Use the model to predict the productivity of a developer by gathering data for the features mentioned above.
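
The steps above can be sketched in Python as follows. This is only a minimal illustration under assumptions of mine: the numeric encodings, the scaling factors, the survey rows, and the function names (encode, train, predict_proba) are all hypothetical and not taken from any real data set. Logistic regression is fitted here with plain batch gradient descent using only the standard library; in practice a library implementation such as scikit-learn's LogisticRegression would likely be the more convenient choice.

```python
import math

# Hypothetical numeric encodings for the discrete survey answers.
PROBLEM_SOLVING = {"bad": 0.0, "decent": 0.5, "great": 1.0}
CODE_REVIEWS = {"inactive": 0.0, "active": 1.0}
COMMUNICATION = {"low": 0.0, "medium": 0.5, "high": 1.0}


def encode(story_points, problem_solving, code_reviews, communication,
           new_initiatives, coverage_delta, complexity_delta):
    """Turn one stakeholder response into a numeric feature vector."""
    return [
        story_points / 100.0,            # assumed scaling to a small range
        PROBLEM_SOLVING[problem_solving],
        CODE_REVIEWS[code_reviews],
        COMMUNICATION[communication],
        1.0 if new_initiatives else 0.0,
        coverage_delta / 10.0,           # change in code coverage
        complexity_delta / 10.0,         # change in code complexity
    ]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, bias, features):
    """Likelihood (0 to 1) that a developer is productive."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def train(rows, labels, learning_rate=0.5, epochs=2000):
    """Fit logistic regression with batch gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = predict_proba(weights, bias, x) - y
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Made-up survey rows for illustration only: (features..., label),
# where label 1 means the stakeholder judged the developer productive.
SURVEY = [
    ((34, "great", "active", "high", True, 4.0, -1.5), 1),
    ((28, "decent", "active", "medium", True, 2.0, 0.0), 1),
    ((40, "great", "active", "medium", False, 3.0, -0.5), 1),
    ((25, "great", "inactive", "high", True, 1.0, -1.0), 1),
    ((12, "bad", "inactive", "low", False, -3.0, 2.0), 0),
    ((18, "decent", "inactive", "low", False, -1.0, 1.5), 0),
    ((10, "bad", "inactive", "medium", False, -2.0, 3.0), 0),
    ((20, "decent", "active", "low", False, 0.0, 1.0), 0),
]

rows = [encode(*answers) for answers, _ in SURVEY]
labels = [label for _, label in SURVEY]
weights, bias = train(rows, labels)

# Score two new, equally hypothetical developers.
p_strong = predict_proba(
    weights, bias, encode(36, "great", "active", "high", True, 3.0, -1.0))
p_weak = predict_proba(
    weights, bias, encode(11, "bad", "inactive", "low", False, -2.0, 2.5))
for p in (p_strong, p_weak):
    print(f"There is a {p:.0%} likelihood that the developer is productive.")
```

With only eight made-up rows the fitted weights mean nothing in themselves; the point is the shape of the pipeline: encode the answers, fit on the labelled quarters, then score a new developer.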

The following is how the response would look when a new data set is fed to the model:

There is a 90% likelihood that the developer is productive.
There is a 55% likelihood that the developer is productive.
There is a 20% likelihood that the developer is productive.

Based on the organization’s baseline, one could then choose a threshold above which a developer is called productive. For example, suppose your organization, ABC, sets the threshold at 60%: only those developers for whom the likelihood of being productive is 60% or more would be termed productive.
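
Applying such a threshold is a one-line comparison. The sketch below uses the three example likelihoods from above together with the 60% threshold; the developer names and the function name classify are placeholders of mine.

```python
def classify(likelihoods, threshold=0.60):
    """Label each developer by comparing the predicted likelihood to the threshold."""
    return {
        developer: "productive" if p >= threshold else "not productive"
        for developer, p in likelihoods.items()
    }


# The three example likelihoods, keyed by hypothetical developer names.
predicted = {"developer_a": 0.90, "developer_b": 0.55, "developer_c": 0.20}
result = classify(predicted)
for developer, label in result.items():
    print(f"{developer}: {predicted[developer]:.0%} -> {label}")
```

Lowering the threshold to 50% would move the 55% developer into the productive group, which is why the threshold should come from the organization’s own baseline rather than from the model.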

This is just a thought. I am preparing test data to see whether the above approach could work in real-world scenarios and help solve the problem of predicting software developers’ productivity. In the meantime, please share your opinion on whether I am on the right track.

Author

Ajitesh Kumar
