Algoritmica: Data in Action!

Did you miss the chance to hear Co-Founder Luca Borella’s Deep Dive on Deeploans? Jump to the video recording or read the transcript below.

The webinar, organised in collaboration with European DataWarehouse (EDW), gave attendees an in-depth look at the impact of COVID-19 on loan delinquencies, followed by a deep dive into the value of big data and deep learning in credit risk management. Download "Monitoring the Impact of COVID-19" from our resources page, or scroll down to start watching the video.

What you will learn:

  • how EDW can create large sample sizes of various credit markets for analysis
  • who the members of the loan-level data value chain are
  • what stands between financial institutions and easier credit risk management
  • how Deeploans collects and enriches data from EDW
  • the details of the Algoritmica model lifecycle

Luca Borella is a financial economist by vocation with experience in credit scoring, loan-level data modelling, quality and warehousing. He promotes driving financial inclusion through data-sharing and adopting new technologies. 

The projects he is involved with aim to disrupt the fundamental laws that influence contemporary societies, with the ultimate goal of rebalancing the distribution of wealth, both locally and globally.

Data in Action

Key takeaways

  • Leverage loan-level data. There is abundant, useful data in the European DataWarehouse just waiting to be leveraged by lending institutions.

  • Repayment history and credit scores are not an effective way to monitor risk. Macroeconomic factors, among others, influence the borrower’s ability to repay.

  • Look beyond the closed walls. Open Data sources can also be used to complement existing data and improve decision-making.

  • Computer architecture needs to follow the structure of the data that it is processing. If your aim is to process loan-level data, make sure you have an architecture that's consistent, cloud-based, scalable and flexible. Otherwise, if you're trying to do this in an Excel sheet, you're probably on the wrong path.

  • Leveraging Big Data using AI takes commitment. Artificial Intelligence is definitely not an install-and-forget system, so AI needs to be looked after by an expert. This expert can be internal, so you can hire a team of data scientists, or you can outsource the expertise.

Transcript

00:00 

I’m very very excited!

00:03 

This is going to be the agenda for today. We will briefly introduce the company, we’ll show you what we do, and we’ll focus on 3 main steps: Discovery, Use-Case Identification and Product Factory. And, at the end, we will go through some challenges and opportunities when it comes to loan-level data.

00:32 

So, the vision and mission of Algoritmica: Algoritmica is a boutique AI firm focused on banking, and our vision is to create a more liquid and transparent credit market able to support sustainable economic growth. 

So, how do we get there? Our mission is to enable financial institutions to leverage data and technology in order to achieve optimal credit lifecycle management. So, as you may have understood, our clients are financial institutions, alternative investors and loan marketplaces. That's our target market.

01:16 

And these are the founders: Calogero and Giuseppe, who are AI data scientists with a research background; myself, I'm on the bottom left, I'm a financial economist; and then we have the pleasure of having Claudio Erba, who brings his entrepreneurial mind and experience when it comes to software-as-a-service (SaaS).

01:45 

Yeah, the founding team is Italian. By the way, we have two locations: one in Milan, the other one in Berlin, and Claudio brings a great network from North America, having listed his previous company in Canada just a year ago.

02:05 

So, what do we do? Well, we are Data Junkies! We’re constantly looking to consume more data. So really, a big chunk of the work we do at Algoritmica is actually data discovery. So trying to find the best data sources both in terms of scope and quality. And once we find them, we try to understand what use cases this data enables. So we try to understand whether there are some use cases from the business or compliance perspective that can be implemented. 

02:43 

Once the use cases are identified, we productionise the solution because we know, thanks to Claudio who has a great SaaS business, that productionisation is key to scaling a solution. And that's exactly what we want to do: we want as many banks as possible using our software to improve the way they manage credit and loan books.

03:14 

So, now I'm going to go through these one by one: data discovery, use case, product.

Data discovery: I wanted first of all to highlight the loan-level data value chain, because I think it's really important for you to visualise the different archetypes that are relevant in this value chain. Starting from the left, there are more than 250 European banks supplying loan-level data across different asset classes, jurisdictions and credit markets. They provide raw loan-level data to the European DataWarehouse which, as you might already know, is a central database that collects data from all these banks.

EDW does a great job of aggregating the data and enabling other data users, such as Algoritmica, who sit at the end of the value chain and transform the raw data into knowledge. Algoritmica can then enrich this data with different data sources. For instance, we enrich loan-level data with macro- and microeconomic variables, and we also translate the enriched data into products and services. So we develop added-value solutions and, specifically when it comes to loan-level data, we have a product called Deeploans.

05:00 

Now, the second step: once we've found the data, once we like the data, we do use-case detection. So, what are the business and compliance use cases that this data enables?

I would like to start with a statement. The statement is the following: "The beauty of this initiative, namely EDW and its loan-level initiative, is that loan-level data from different financial institutions, who each hold merely snippets of their own, have been combined to create a significant sample of various credit markets such as Italian SME loans, French residential mortgages, German auto loans, etc." So there are different combinations you can imagine. There are six asset classes and more than 10 jurisdictions, so you can really make combinations and you end up with great sample sizes of these credit markets.

05:56 

So, starting from this assumption, which is a strong but very well-founded one, we understood that the market we can target is not only the European ABS market, which is nearly a trillion euros in outstanding balance; the potential market for solutions built on top of this data is much larger. We estimate maybe seven times, maybe ten times as large, we're not certain how much, because it also includes the European private and alternative credit markets. I also wanted to highlight that the blue circle represents the business applications and the pinkish-red one the compliance applications that are out there, waiting for solutions that run on top of EDW's loan-level data. So we know that there are many, many applications in the securitisation market, but we wanted to focus on the European private and alternative markets because there are more use cases. For instance, it can be simple data analytics or marketing organisation, but once you add some intelligence, once you add machine learning on top of this, as Usman was mentioning before, then there are even more use cases that you can find.

07:25 

For instance, credit origination, credit monitoring, credit collection, valuation and pricing of loans, active balance sheet management (ABSM) and so on. So the number of use cases is enormous. We wanted to focus on one of them and we chose credit monitoring specifically. Credit monitoring, aka behavioural scoring, is quite critical because it enables financial institutions to detect NPLs early and also identify new sales opportunities.

You can always offer new products to a client if you're a bank or P2P lender. So there is an unprecedented environment for making credit decisions post COVID-19; that's fairly well understood. Moreover, for the past 10 years there has been a regulatory push for improving the tools that banks use for credit and loan book management.

08:38 

But what's the problem? The problem is that right now these systems are neither efficient nor effective, because they are labour-intensive and the credit models do not take enough variables into account. So if you feed an old model with 150 variables, maybe the model won't be powerful enough to extract any insights.

09:02 

What's the solution? Well, the solution, in our opinion, is phasing out obsolete statistical models, leveraging the available loan-level data of various European credit markets, namely EDW, relying on the power and flexibility of a cloud infrastructure, and leveraging deep learning techniques.

So that’s the business story of credit monitoring. 

09:25 

Now, as we said, once we find the data, we identify the use case, and then we want to build the product out of this, because the product is what will enable the market to really step up rather than just find a custom-made solution.

So, just to remind you, the specific use case is using EDW loan-level data to predict an unseen borrower's future behaviour: in particular, if, when and how he or she will default or pay, with the aim of acting in advance to avoid permanent losses. This is the specific use case we want to tackle. In order to do that, obviously, at the very base of this product there is a model... actually, a set of models. And the model lifecycle in Algoritmica works as follows:

10:29 

There are five steps: first, model design; second, model training; third, model testing; fourth, model explainability; and fifth, model serving.

So model design is very important because it's the beginning of the model lifecycle. You need to find the right model that enables you, as a financial institution, to predict your borrowers' behaviour as well as to explain the results that you get from the model. It's not enough to have a super-sophisticated model with great precision but then have a black-box issue where you're unable to explain the results to supervisors or regulators. So, after almost two years of research, we found the Temporal Fusion Transformer architecture, which comes directly from researchers at the University of Oxford and Google AI.

11:46 

This type of model, which is a neural network attention-based architecture, is optimised for multi-horizon forecasting, which is exactly what you do when you try to forecast the behaviour of a borrower. The beauty of this model is that it treats different categories of input in different ways. This is important for model explainability, which we will see in step four.
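
For readers who want to see what such a multi-horizon setup looks like in code, below is a minimal, illustrative PyTorch sketch. It is not the Temporal Fusion Transformer used in Deeploans, just a toy sequence model with one classification head per forecast horizon; the feature sizes, horizons and names are assumptions chosen for the example.

    # Minimal sketch (not the production Deeploans model): a sequence model that
    # emits a status prediction for several future horizons at once, in the spirit
    # of the multi-horizon forecasting described above.
    import torch
    import torch.nn as nn

    class MultiHorizonLoanClassifier(nn.Module):
        def __init__(self, n_features=32, hidden=64, horizons=(3, 6, 9, 12), n_classes=3):
            super().__init__()
            self.encoder = nn.LSTM(n_features, hidden, batch_first=True)
            self.attention = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
            # One classification head per forecast horizon (performing / arrears / default).
            self.heads = nn.ModuleDict({f"m{h}": nn.Linear(hidden, n_classes) for h in horizons})

        def forward(self, x):
            # x: (batch, months_of_history, n_features) of loan-level observations
            encoded, _ = self.encoder(x)
            attended, _ = self.attention(encoded, encoded, encoded)
            summary = attended[:, -1, :]  # representation as of the latest observed month
            return {name: head(summary) for name, head in self.heads.items()}

    model = MultiHorizonLoanClassifier()
    history = torch.randn(8, 24, 32)   # 8 loans, 24 months of history, 32 features each
    logits = model(history)            # e.g. logits["m3"] has shape (8, 3)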

12:13 

Then, specifically in our domain, we define the inputs and outputs based on the ECB loan-level data templates, so that we are sure that these types of models can work in many jurisdictions in Europe.

12:33 

Step number two is model training. When you want to train a model, you choose a specific segment of the market, for instance Italian SME loans. It's really easy to do that with the EDW APIs because you can group and filter the different ABS deals you want to use as a training set, and you can very easily exclude some if you want to keep them for the test set in step three.

Then you train your model. Training is a very complex routine: it requires a lot of cloud power and, yeah, this is something we've been working on for many, many months.
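
As a rough illustration of the deal-level split described above, here is a minimal pandas sketch: whole ABS deals are held out for testing rather than splitting individual loans, so the test set stays genuinely unseen. The column names ("deal_id", "asset_class", "country") are assumptions for the example, not ECB template field names or the EDW API schema.

    import pandas as pd

    def split_by_deal(loans, test_deals):
        """Filter one market segment and split it into train/test by whole ABS deal."""
        segment = loans[(loans["asset_class"] == "SME") & (loans["country"] == "IT")]
        train = segment[~segment["deal_id"].isin(test_deals)]
        test = segment[segment["deal_id"].isin(test_deals)]
        return train, test

    # Toy example: deal "B" is excluded from training and kept for the test set in step three.
    loans = pd.DataFrame({
        "deal_id":     ["A", "A", "B", "C"],
        "asset_class": ["SME", "SME", "SME", "RMBS"],
        "country":     ["IT", "IT", "IT", "FR"],
        "status":      ["performing", "arrears", "performing", "performing"],
    })
    train_set, test_set = split_by_deal(loans, test_deals={"B"})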

13:28 

After you have your model trained, you want to test it. So you use a portion of the database for testing and you use a series of KPIs to understand how good your model is. For instance, on the right-hand side I have reported the confusion matrix for the three-month forecast (I didn't mention that we do 12-month forecasts too). In this matrix, the horizontal axis shows the prediction and the vertical axis shows the actual status of a loan.

So we divided the status into three possible categories – performing, arrears, and default – and obviously categorised the actual outcome of each loan that we wanted to forecast into the same three categories. You can see, for instance, that 85,227 loans that were actually performing in 3 months' time were predicted as performing by the model. Almost 600 loans that were actually in arrears were predicted correctly. And for default, for instance, 1,210 loans that were predicted to default in three months actually defaulted in three months.
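
For anyone who wants to reproduce this kind of evaluation on their own data, a confusion matrix over the three status classes can be computed with scikit-learn. The arrays below are toy placeholders, not the figures quoted in the talk.

    from sklearn.metrics import confusion_matrix, classification_report

    labels = ["performing", "arrears", "default"]

    # Toy placeholders: actual statuses three months later vs. model predictions.
    y_true = ["performing", "arrears", "default", "performing", "arrears"]
    y_pred = ["performing", "performing", "default", "performing", "arrears"]

    # Rows are the actual status and columns the predicted status, matching the slide's layout.
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    print(cm)
    print(classification_report(y_true, y_pred, labels=labels))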

15:02 

I also want to draw your attention to the loans that the model was not able to forecast correctly. There are 1,123 loans that were forecast to be performing but actually turned out to be in arrears. This is important because from this you can understand where you need to improve your model and what the limits of the model you have built are.

And when it comes to EDW and the ECB loan-level template, I want to remind you that it's very important to keep pushing for standardised reporting practices across the market, because when data users like Algoritmica are using the data, we might encounter problems if different loans have different definitions of arrears or default. Default is probably a bit easier because there is a base definition, but even that isn't entirely clear sometimes.

16:16 

Then we have number 4, model explainability. This is very important to overcome the black-box issue, and Deeploans enables three explainability use cases:

  • Globally important variables for the prediction problems, 
  • Persistent temporal patterns and 
  • Significant events.
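
As a rough, generic illustration of the first use case (globally important variables), here is a permutation-importance sketch with scikit-learn. This is a stand-in technique used only for illustration: the talk does not say Deeploans derives its importances this way (the Temporal Fusion Transformer has its own attention-based interpretability), and the feature names below are assumptions.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))                    # toy loan-level features
    y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)    # toy arrears flag

    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)

    # Illustrative (assumed) feature names, ranked by mean importance drop.
    feature_names = ["current_balance", "interest_rate", "days_past_due",
                     "loan_age_months", "regional_unemployment"]
    for name, score in sorted(zip(feature_names, result.importances_mean), key=lambda t: -t[1]):
        print(f"{name}: {score:.3f}")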

Finally, when we incorporate those models into a product, we need to make sure that the models are served correctly. So we need a technical infrastructure that enables continuous delivery of predictions: we need to keep pulling data from EDW's APIs, making sure that the model is constantly retrained and updated. And that's very important, especially when situations like COVID-19 happen, because obviously data that was used before COVID might not be relevant after COVID, and vice versa.

17:23 

So what was the result? The result, after these five steps, is listed in this slide, which shows the default accuracy improvement for each asset class. You can see, for instance, on the right that for SMEs we have a 21% improvement in accuracy compared to traditional statistical models like logistic regression. A 21% improvement is quite a lot from our perspective. We cover different asset classes: credit card, consumer, auto and residential mortgages.

18:09 

Now this slide just gives you a high-level overview of where Deeploans sits. Imagine that you are leading the credit risk office of a bank and you want to use Deeploans for monitoring your current clients. What would happen is that your legacy core banking system, where your customers' data is stored, would push the portfolio into Deeploans. Deeploans, as we said before, is already trained, so it will process the information and send you back via API some outputs which can be customised, but which typically are the status of each loan in a certain timeframe.

So it can be performing, arrears or default in 3, 6, 9 or 12 months. And this output is going to be used by the bank to make a decision, so it's going to be incorporated into the client's workflow. That's important for easy integration and for keeping costs down.
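
To make the integration concrete, here is a minimal sketch of what such an API round trip could look like from the bank's side. The endpoint, field names and authentication are hypothetical placeholders, not the documented Deeploans API; the point is only the shape of the exchange: push a portfolio, get back a predicted status per loan per horizon.

    import requests

    portfolio = [
        {"loan_id": "L-001", "current_balance": 120000, "days_past_due": 0},
        {"loan_id": "L-002", "current_balance": 45000, "days_past_due": 35},
    ]

    response = requests.post(
        "https://api.example.com/deeploans/v1/monitor",   # hypothetical URL
        json={"portfolio": portfolio, "horizons_months": [3, 6, 9, 12]},
        headers={"Authorization": "Bearer <token>"},       # hypothetical auth
        timeout=30,
    )
    response.raise_for_status()

    for prediction in response.json()["predictions"]:
        # Assumed shape of each item, e.g. {"loan_id": "L-002", "horizon": 3, "status": "arrears"}
        print(prediction)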

19:27

So, finally, just to conclude on Deeploans: Deeploans is a SaaS designed for loan-level data management, enrichment and analysis. There are different modules in Deeploans: detection, portfolio management, due diligence, etc. I just showed you the early detection module, which is exactly the one you need to forecast the future behaviour of your borrowers, and there are other modules for other specific needs.

20:03

Now that I'm reaching the end of my presentation, I want to just mention the challenges and opportunities that we've been facing for the last two to three years while building such a product. The first one is related to deep data: big data is quite a hot topic, but how to handle such a volume of data is not an easy question. There are structural, engineering, synchronisation and deployment issues – it's a really complex business. Our takeaway is that computer architecture needs to follow the structure of the data that it is processing.

20:51 

So if you're aiming to process loan-level data within your organisation, make sure you have a consistent architecture that is cloud-based, scalable and flexible. Otherwise, if you're trying to do this with an Excel sheet, you're probably on the wrong path.

21:14 

In the middle, intelligence. How to translate data into intelligence is not trivial: there is analysis, enrichment, insight finding, prediction, forecasting, explainability... So our takeaway is that Artificial Intelligence is definitely not an install-and-forget system; AI needs to be looked after by an expert. This expert can be internal, so you can hire a bunch of data scientists, or you can buy from outside, so you can have a SaaS.

21:56 

And then, last but not least, user experience. UX is very important because the question is how you move from a working model to a useful system. There are user interface problems, API and integrability issues, and functionality questions, such as which applications need to use loan-level data and so on. The takeaway is that there are always benefits to keeping humans in the loop, because humans obviously have an expert eye on the topic. And it is not artificial intelligence versus human intelligence; we definitely think it's a combination of both.

22:49 

[Infographic: Time taken for each ML task (%)]

Now, my last slide: I saw this little infographic in The Economist three or four weeks ago. There was a nice special report on AI and they were presenting the average time allocated to machine-learning tasks. So, for instance, you can see 25% for cleansing, 25% for labelling, and only 3% for algorithm development. What we want to say with this slide is that machine learning is fancy, but it is also hard work. There's a lot of data quality work involved, and we are really glad to work with EDW, which is doing an amazing job on data quality and making sure that the data is as good as possible.

And with this, I'm done. Thank you very much for attending this presentation. If you have any questions, or if you want to receive more information or material, please feel free to reach out.
