J.P. Morgan AI Research generates synthetic datasets to accelerate research and model development in financial services.

For AI models to reproduce human behavior in business scenarios effectively, they need to be trained on large quantities of data that are representative of reality. The financial services industry generates large amounts of data that could be very valuable for this purpose, but such data is often not available for use. This poses a fundamental challenge for researchers and developers.

Real data can be difficult to access for many reasons, including privacy, legal permissions, and technical issues related to volume, representation, and meaning.

The question, then, is how to enable innovation and the building of new products and services that depend on data. One answer is synthetic data, which can share the format, distributions, and standardized content of the real data without incurring the risks of using real data.

Synthetic data potentially has the added benefit of representing exploratory scenarios beyond historical data to prepare AI algorithms and support decision making in novel situations. As such, synthetic data enables us to be more robust in our response to challenging situations.

Further, synthetic data can multiply examples that may be rare in the real data, in order to train machine learning algorithms more effectively. 

Ultimately, if a new idea shows promise on the synthetic data, we can consider advancing it for real deployment and use on the real data.

Through its research, the AI Research team at J.P. Morgan has identified several methods to create synthetic data and has learned that different methods may apply to different types of data. For example, we can create realistic synthetic data by understanding the process that generates the real data and then modeling that process to produce the synthetic data. The model can be declarative or captured in simulations. In addition, we can use the real data directly to train generative neural networks (GNNs), which have been used successfully to generate many other kinds of synthetic data.
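The first approach above, modeling the process that generates the data, can be illustrated with a minimal sketch. Everything in it is a hypothetical assumption made for illustration and is not taken from the AI Research team's models: the function name `simulate_payments`, the per-customer payment rates, exponentially distributed gaps between payments, and log-normally distributed amounts.

```python
import random

def simulate_payments(n_customers, days, seed=0):
    """Simulate a hypothetical payment process (all parameters are illustrative)."""
    rng = random.Random(seed)
    records = []
    for customer_id in range(n_customers):
        # Assumption: each customer has their own average number of payments per day.
        daily_rate = rng.uniform(0.2, 2.0)
        # Assumption: gaps between payments are exponentially distributed.
        t = rng.expovariate(daily_rate)
        while t < days:
            # Assumption: payment amounts follow a log-normal distribution.
            amount = round(rng.lognormvariate(3.5, 1.0), 2)
            records.append(
                {"customer_id": customer_id, "day": round(t, 3), "amount": amount}
            )
            t += rng.expovariate(daily_rate)
    return records

synthetic = simulate_payments(n_customers=5, days=30)
print(len(synthetic), synthetic[:2])
```

Because the records come from the simulated process rather than from any customer file, the parameters can be changed to explore scenarios that do not appear in historical data.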

The synthesized new samples have properties of real data but cannot be mapped back to it. They can also offer insights into the data that might otherwise remain undiscovered.

One critical area is fraud detection model training, where AI models are given examples of normal and fraudulent transactions in order to learn suspicious transaction patterns. Since the number of fraudulent cases is extremely small compared to non-fraudulent cases, modeling approaches struggle to learn fraudulent behaviors from the available data. Synthetic data, however, can be used to train a model on anomalous behavior. The generation process renders a greater percentage of transactions that do not fall in line with expected behavior, producing more synthetic examples of fraud cases for improved model training.
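The idea of rendering a greater percentage of anomalous transactions can be sketched with a toy generator whose fraud rate is an adjustable parameter. This is an illustrative assumption, not the team's actual method: the function names, the two features (`amount`, `hour`), and the distributions chosen for normal and fraudulent behavior are all hypothetical.

```python
import random

def generate_transactions(n, fraud_rate, seed=0):
    """Generate toy transactions; fraud_rate sets the share of anomalous rows."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        is_fraud = rng.random() < fraud_rate
        if is_fraud:
            # Assumed anomalous pattern: large amounts in the early morning.
            amount = rng.lognormvariate(6.0, 0.8)
            hour = rng.choice([0, 1, 2, 3, 4])
        else:
            # Assumed normal pattern: smaller amounts during the day.
            amount = rng.lognormvariate(3.5, 1.0)
            hour = rng.randint(8, 22)
        rows.append({"amount": round(amount, 2), "hour": hour, "is_fraud": is_fraud})
    return rows

def fraud_share(rows):
    """Fraction of rows labeled as fraud."""
    return sum(r["is_fraud"] for r in rows) / len(rows)

# A fraud rate close to what real data might show, versus an enriched training set.
realistic = generate_transactions(10_000, fraud_rate=0.002)
enriched = generate_transactions(10_000, fraud_rate=0.2)
print(fraud_share(realistic), fraud_share(enriched))
```

A training set generated with the higher rate gives a model many more fraud examples to learn from. A model trained this way would still need to be evaluated against realistic class proportions.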

Leveraging these techniques and others, the synthetic datasets that AI Research has developed include:

  • Anti-money laundering (AML) behaviors
  • Customer journey events
  • Markets execution data
  • Payments data for fraud detection

Manuela Veloso, Head of AI Research at the firm, reflected on synthetic data capabilities the team has enabled in retail banking. “Synthetic data generation allows us to think, for example, about the full lifecycle of a customer’s journey that opens an account and asks for a loan. We’re not simply examining the data to see what people do, but we’re also able to analyze their interaction with the firm and essentially simulate the entire process.”

The team’s synthetic data work has evolved. Since making its synthetic datasets available in February, the team has fielded many requests for these capabilities. In addition, the firm’s Faculty Research Awardees at Stanford, Cornell, CMU, University of Buffalo, NY and other universities are leveraging the datasets to develop algorithms that address fraud and money laundering, customer journeys, markets execution, and other areas in finance.

Rob Tillman, Executive Director of AI Research, summarizes the challenge that the team’s synthetic data addresses and the benefits it brings. “In highly regulated industries which deal with sensitive data, such as finance, there are often significant barriers that impede or delay the ability of researchers and developers to use data to develop AI solutions that improve experiences or address important problems like fraud detection and anti-money laundering. The team’s synthetic data work aims to address this issue and accelerate the development of AI solutions at J.P. Morgan as well as enable collaboration with the academic community.”

AI Research has published its synthetic data research and presented it at leading AI conferences and workshops, including the ACM International Conference on AI in Finance (ICAIF 2020), the AAAI Conference on Artificial Intelligence (AAAI 2020), and the ICAPS 2020 Planning for Financial Services workshop.

The team’s work is aligned with its aspirational Research Goal to liberate data safely, providing researchers and developers with a synthetic data framework as an alternative to real data where appropriate.

To learn more about the AI Research synthetic data initiatives at the firm, see
https://www.jpmorgan.com/technology/artificial-intelligence/initiatives/synthetic-data.