Client: one of the top 30 microfinance institutions in Russia

We’ve built predictive scoring technology that generated over $1,000,000 of additional revenue within 3 months for 5 different MFIs


The Intro

It’s hard to picture a world without loans. While loans are highly valuable to small businesses and ordinary citizens, microfinance institutions face a heightened risk of not getting their funds back because of late payments, bankruptcy and fraud.

Now imagine a world in which we could stop bad debt before it even happens. Well, you can start pouring that champagne and serving a round of glasses, because we’ve just developed a technology that does exactly that.

Read on to find out what it takes to develop a predictive scoring module for a top microfinance institution (MFI) and then roll the same system out to 5 other MFIs, which collectively generated over US$1,000,000 of additional revenue within the first 3 months.


The Challenge

To develop our model, we started by collecting data from the following categories:

  • application data from the user profile
  • credit bureau data on previous loans over the last 5 years
  • mobile phone data
  • social network data (social media accounts, publicly disclosed data from the profile)
  • behavioural data from the website’s loan application (browser type, device type, IP address, time and speed of questionnaire completion, whether the agreements were viewed, etc.)

These data sources yielded over 400 different variables. Even this number can grow, as we are able to plug in any additional data sources available in a given country or region. So, any guesses on how we turned all of that into a model?

While you’re sketching out some ideas on a nearby napkin, we’ll let you in on what really happened: we began by defining the target variable, namely how soon the applicant would repay a 30-day loan. Would the applicant repay within 30 days of the maturity date? Would he be more than 30 days overdue and become our “debtor”? Or would he be more than 90 days overdue and be labelled a “fraud”? To paint a more technical picture, let’s look at some numbers.
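
For the technically curious, the three outcomes can be written down as a simple labelling rule. This is an illustration only, not the production code: the function name, the `days_overdue` input (days past the maturity date) and the "good" label are assumptions made for the example.

```python
def label_outcome(days_overdue: int) -> str:
    """Map days past the maturity date to one of three outcome classes.

    Thresholds follow the text: more than 90 days -> "fraud",
    more than 30 days -> "debtor", otherwise "good" (hypothetical label).
    """
    if days_overdue > 90:
        return "fraud"
    if days_overdue > 30:
        return "debtor"
    return "good"


if __name__ == "__main__":
    for days in (0, 30, 31, 90, 91):
        print(days, label_outcome(days))
```

Checking the strictest threshold first keeps the classes mutually exclusive, so every loan lands in exactly one bucket.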


The Numbers

First, we ran a correlation analysis against our “debtor” target variable. This identified the variables with significant predictive power: the lowest predictive power among them was 0.05 and the average was 0.23. That narrowed the more than 400 candidates down to 64 deliberately handpicked variables.
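
A rough sketch of this kind of filter is shown below. It assumes that "predictive power" means the absolute Pearson correlation between a numeric variable and a 0/1 "debtor" flag, which is our simplification for the example; the function names and the toy data are hypothetical. Only the 0.05 floor comes from the case itself.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no signal
    return cov / sqrt(var_x * var_y)


def select_by_predictive_power(features, target, min_power=0.05):
    """Keep variables whose |correlation| with the target is at least min_power.

    `features` maps a variable name to its list of values;
    `target` is a list of 0/1 flags (1 = "debtor").
    """
    powers = {name: abs(pearson(vals, target)) for name, vals in features.items()}
    return {name: p for name, p in powers.items() if p >= min_power}


if __name__ == "__main__":
    target = [0, 0, 1, 1]
    features = {"loans_total": [1, 2, 3, 4], "constant": [5, 5, 5, 5]}
    print(select_by_predictive_power(features, target))
```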

Next, to make the model more stable across future retraining, we ran a multi-correlation analysis and excluded input variables that were correlated with each other. This took us from 64 variables down to 24 reliable ones, including the total number of microloans, the date of the last microloan, the average microloan duration, the applicant’s age, job title, number of social media accounts, whether the IP address matched the applicant’s region of residence, and some others.
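
One common way to do this kind of pruning is a greedy pass: walk through the variables from strongest to weakest and keep a variable only if it is not too correlated with anything already kept. The sketch below illustrates that idea under our own assumptions; the 0.7 threshold, the function names and the sample data are hypothetical and are not taken from the project.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def drop_correlated(features, powers, max_corr=0.7):
    """Greedily keep variables that are not strongly correlated with each other.

    `features` maps name -> values, `powers` maps name -> predictive power.
    Stronger variables are considered first, so when two variables overlap
    the weaker one is the one that gets dropped.
    """
    kept = []
    for name in sorted(features, key=lambda n: powers[n], reverse=True):
        if all(abs(pearson(features[name], features[k])) < max_corr for k in kept):
            kept.append(name)
    return kept


if __name__ == "__main__":
    features = {
        "loans_total": [1, 2, 3, 4, 5],
        "loans_total_copy": [2, 4, 6, 8, 10],  # perfectly correlated duplicate
        "age": [30, 22, 41, 25, 33],
    }
    powers = {"loans_total": 0.30, "loans_total_copy": 0.20, "age": 0.10}
    print(drop_correlated(features, powers))
```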

The sample for our model consisted of 13,805 accounts for training and 2,300 accounts for testing. We split it using stratified sampling, which keeps both datasets representative of the whole. For those wondering which tools we adopted, here are the modeling methods we used: logistic regression, a random forest classifier, and a gradient boosting classifier.
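
To show what stratified sampling means in practice, here is a small standard-library sketch. It splits each outcome class separately, so the test set has the same class mix as the training set. The function name, the seed and the toy labels are assumptions for the example, and the three classifiers themselves are not reproduced here.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx), splitting every class in the same proportion."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


if __name__ == "__main__":
    labels = ["good"] * 70 + ["debtor"] * 20 + ["fraud"] * 10
    train, test = stratified_split(labels, test_fraction=0.2)
    print(len(train), len(test))
```

With 13,805 training and 2,300 testing accounts, the test fraction would be roughly 2,300 / 16,105, or about 14%.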


The Solution

During the course of this project, we developed models that proved to be highly robust. We use cross-validation to check the quality of classification, which must not differ from the quality on the training set by more than 3%.
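
The check can be pictured roughly as follows. The sketch assumes the 3% limit means 3 percentage points of difference between the training score and the validation score in each fold, which is our reading of the rule; the helper names are hypothetical and the scores are supplied by whatever model is being evaluated.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs for contiguous k-fold cross-validation."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)  # spread the remainder over the first folds
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def is_stable(train_scores, val_scores, max_gap=0.03):
    """True if every fold's validation score is within max_gap of its training score."""
    return all(abs(t - v) <= max_gap for t, v in zip(train_scores, val_scores))


if __name__ == "__main__":
    for train_idx, val_idx in kfold_indices(10, 3):
        print(len(train_idx), len(val_idx))
    print(is_stable([0.80, 0.79], [0.78, 0.77]))
```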

Our variables are also stable: we plot each variable’s predictive power over time and use regression analysis to predict whether it will remain stable over the next 2-3 months.
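
As an illustration of the idea, the sketch below fits a straight line to a variable’s monthly predictive power and extrapolates it a few months ahead. The linear trend, the monthly granularity and the reuse of 0.05 as the floor are assumptions made for the example, not a description of the actual regression used.

```python
def fit_line(ys):
    """Least-squares line through the points (0, ys[0]), (1, ys[1]), ...

    Returns (slope, intercept). Needs at least two points.
    """
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def stays_predictive(monthly_power, months_ahead=3, floor=0.05):
    """True if the extrapolated predictive power stays at or above `floor`."""
    slope, intercept = fit_line(monthly_power)
    last = len(monthly_power) - 1
    forecast = [intercept + slope * (last + k) for k in range(1, months_ahead + 1)]
    return all(p >= floor for p in forecast)


if __name__ == "__main__":
    print(stays_predictive([0.25, 0.24, 0.23, 0.22]))  # slow decline
    print(stays_predictive([0.20, 0.15, 0.10, 0.05]))  # fast decline
```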

Thanks to this modeling approach, our models predict “debtors” with an accuracy of 74% and “frauds” with an accuracy of 79%. The model’s predictive power allows all applications to be thoroughly vetted and brings the approval rate down to only 60%.
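
To make the link between scores and the approval rate concrete, here is a hedged sketch of how a cut-off could be derived from a target approval rate. It assumes each application has a risk score where higher means riskier; the function names and sample scores are hypothetical.

```python
def cutoff_for_approval_rate(risk_scores, approval_rate=0.60):
    """Return the highest risk score that still gets approved.

    Applications are ranked from safest to riskiest and the safest
    `approval_rate` share is approved. Tied scores at the cut-off are all
    approved, so the realised rate can be slightly higher.
    """
    ranked = sorted(risk_scores)
    n_approved = round(len(ranked) * approval_rate)
    if n_approved == 0:
        raise ValueError("approval_rate is too low to approve any application")
    return ranked[n_approved - 1]


def approve(risk_score, cutoff):
    """Approve an application whose risk score does not exceed the cut-off."""
    return risk_score <= cutoff


if __name__ == "__main__":
    scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.50, 0.60, 0.70, 0.90]
    cutoff = cutoff_for_approval_rate(scores)
    print(cutoff, sum(approve(s, cutoff) for s in scores))
```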

And that’s not even the best part. Our tech stack lets us continuously retrain the models on new data in a convenient online mode, while keeping the correlation strength of the variables under constant control.


The Results

What a great case, huh? Had financial institutions implemented these models earlier, imagine how much money (an inconceivably great ton of money) would have been saved.

All in all, here is a summary of the practical results we’ve achieved while creating this one-of-a-kind prediction model:

  1. A management report that lets you monitor the model’s predictive power using the Gini coefficient, the KS statistic, a decision-making matrix, etc.
  2. Web-interface reports that can simulate the state of the portfolio under different scoring-model settings in order to choose the optimal stop factor.
  3. The ability to predict other default statuses, for payments overdue by 1, 10, 20 or any other number of days.
  4. A custom-built portrait of a “good” client to optimize traffic for more targeted and high-quality leads. 
  5. A list of social media, ad and third-party channels which generate a stream of “bad” customers who end up being “debtors” or “frauds” along the way.
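
For readers who want to see what the metrics in item 1 measure, here is a minimal sketch of the Gini coefficient and the KS statistic for a scoring model. It uses the standard definitions (Gini = 2 × AUC − 1; KS = the largest gap between the cumulative score distributions of good and bad clients). The function names and data are made up for the example, and the pairwise AUC loop is written for clarity, not speed.

```python
def auc(scores, is_bad):
    """Chance that a random bad client has a higher risk score than a random good one."""
    bad = [s for s, b in zip(scores, is_bad) if b]
    good = [s for s, b in zip(scores, is_bad) if not b]
    # A tie between a bad and a good client counts as half a win.
    wins = sum((b > g) + 0.5 * (b == g) for b in bad for g in good)
    return wins / (len(bad) * len(good))


def gini(scores, is_bad):
    """Gini coefficient of the scoring model: 2 * AUC - 1."""
    return 2 * auc(scores, is_bad) - 1


def ks_statistic(scores, is_bad):
    """Largest gap between the cumulative score distributions of good and bad clients."""
    bad = [s for s, b in zip(scores, is_bad) if b]
    good = [s for s, b in zip(scores, is_bad) if not b]
    best = 0.0
    for threshold in sorted(set(scores)):
        cdf_bad = sum(s <= threshold for s in bad) / len(bad)
        cdf_good = sum(s <= threshold for s in good) / len(good)
        best = max(best, abs(cdf_good - cdf_bad))
    return best


if __name__ == "__main__":
    scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
    is_bad = [0, 0, 0, 1, 0, 1, 1, 1]
    print(gini(scores, is_bad), ks_statistic(scores, is_bad))
```

Both metrics need at least one good and one bad client in the sample; a value near 0 means the scores barely separate the two groups, and a value near 1 means they separate them almost perfectly.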

Our strong team includes analysts who build reports and monitor the work of the scorecards and the quality of the loan portfolio, programmers who develop software algorithms and automate reports, and a dedicated scoring specialist who is directly involved in modelling credit scores. With them on board, your business could not be in better hands.