
Technology Blog

Our outlook on technology, written by distinguished engineers, machine learning experts and AI researchers.

Lawrence H., AI Research Intern | SEP 2020

Searching for Patterns in Daily Stock Data: First Steps Towards Data-Driven Technical Analysis

Chart patterns are a commonly-used tool in the analysis of financial data. Analysts use chart patterns as indicators to predict future price movements. The patterns and their interpretations, however, are subjective and may lead to inconsistent inference and biased interpretation.



In this study, I used a data-driven approach based on objective machine-learning methods to identify distinct patterns that best characterize the data and enable examination of the patterns’ predictive power. Specifically, I used various unsupervised machine-learning methods to cluster the time-series data into separable classes. To my surprise, all methods unanimously agreed that simple harmonic functions best characterize the data. I also found that further filtering the data by time, sector, or profitability did not add predictive power to the clusters.

Further exploration of the data is still needed. In the future, I would like to examine the multi-scale nature of the problem as well as expand the study to include more clusters in the unsupervised analysis.


Figure 1: Time series embedded into 2 dimensions--results visualized as clusters

My name is Lawrence Huang, and I’m a rising senior at Carnegie Mellon studying Physics. I am also a Prep for Prep alum, which is a leadership development program that offers promising students of color access to private school education.

This summer I worked as an intern for J.P. Morgan’s AI Research team. Due to coronavirus, this internship was shortened to five weeks and was fully remote. Given the short duration, many related avenues of inquiry remain to be explored.

Background & Related Work

Technical analysis in trading aims to evaluate investments and identify opportunities using only price and volume data. It isn’t immediately obvious why this type of analysis may have predictive value, especially when compared with fundamental analysis. Fundamental analysis looks at a company’s financial reports, the state of the economy, and industry trends to determine whether the “true” or assessed value of a stock reflects its current traded value. If the current price is below the assessed value, we predict the stock price will increase, and vice versa if the assessed value is below the current price. In the technical analysis view, on the other hand, all known fundamentals of a company’s business are instantaneously factored into the stock price. Thus, there is no need to explore a company’s economic conditions – they are already reflected in the price. The price is all that matters!

Predictions, for the technical analyst, are made by identifying patterns that are “known” to lead to a predetermined outcome. A few well-known chart patterns are the head and shoulders, the triangle, and the double top. These are general shapes that stock prices can take, and technical analysts have found these shapes useful in making trading predictions. For example, the head and shoulders is a reversal sign, indicating a bull-to-bear trend change. Complexity arises, however, because a given time series typically shows multiple signs at once. For example, the pattern may be 70% head and shoulders and 20% channel up.

Relying on known patterns, then, can introduce subjectivity into technical analysis. However, a body of research supports the idea that leveraging chart patterns in technical analysis has useful predictive value. For example, for the bull flag chart pattern, two separate papers studied methods of pattern matching and of quantifying how well a given stock chart matches the bull flag pattern. The first of these papers, published in 2002, identified trading rules based on how well the stock chart matched a bull flag and found these trading rules to be effective on out-of-sample examples (Leigh, William, et al. "Stock market trading rule discovery using technical charting heuristics." Expert Systems with Applications 23.2 (2002): 155-159.).


Figure 2: Example of a bull flag pattern

The second, published in 2007, focuses on two stock market indices - Nasdaq Composite Index (NASDAQ) and Taiwan Weighted Index (TWI). This second study found that technical trading rules correctly predict the direction of market changes. It also found that matches with the bull flag pattern were correlated with higher returns (Wang, Jar-Long, and Shu-Hui Chan. "Stock market trading rule discovery using pattern recognition and technical analysis." Expert Systems with Applications 33.2 (2007): 304-315.).

Considering the bull flag chart pattern alone, then, these two separate research papers find that one can, on average, make a profit by leveraging chart patterns.

For a more thorough overview of technical analysis chart patterns, I highly recommend a paper by Andrew W. Lo, Harry Mamaysky, and Jiang Wang, Foundations of Technical Analysis: Computational Algorithms, Statistical Inference, and Empirical Implementation. In this paper, the researchers examine ten different chart patterns in-depth, and how to recognize the patterns quantitatively.

Still, most of the research on technical analysis chart patterns addresses either the evaluation of their usefulness in trading or the application of new technologies to recognize chart patterns.

Research Question and Approach

This project reflects my research during my 5-week internship at JP Morgan with the AI Research team. Rather than starting from existing chart patterns, I focused on whether meaning could be extracted from whatever patterns emerged from the data. I was interested in researching the following questions:

  • Can we identify patterns in stock time series using unsupervised learning?
  • Will we find patterns similar to chart patterns?
  • Will we find patterns that are good indicators of potential profit/loss?

I used a data-driven approach to observe whether such patterns actually exist in the data. Specifically, I used daily equity data from Yahoo Finance for companies that contribute to the S&P 500 index (e.g., JPM, Google, Amazon), limited to the last 30 years (1990-2020).

The data spanned approximately 30 years for each company. I wanted to identify patterns in short, sequential time segments. After some exploration, I chose 50 days as the length of the time-series segments. My main concern regarding the time-series length was the potential difficulty in finding meaningful patterns if segments were too short. While 50 trading days is not very long, my expectation was that this duration would be conducive to obtaining both predictive value and distinguishable patterns. Still, looking forward it would be interesting to do a deeper exploration and identify patterns that may emerge on a longer or variable time scale.

To mine for consistent patterns in the time-series segments, I needed to collect many such segments. I used the bootstrap method and randomly sampled (with replacement) 50,000 chunks of 50 days from the data. After sampling, I used one of several clustering methods - K-Means, DBSCAN, Hierarchical Clustering, or K-Means on an autoencoder-encoded time series - to separate the segments into distinguishable patterns.
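As a rough illustration of this step, here is a minimal sketch of the bootstrap sampling and K-Means clustering, assuming daily close prices have already been loaded into a dictionary of NumPy arrays keyed by ticker (the variable names, normalization, and the `close_prices` input are illustrative assumptions, not the exact code used in the study):

```
import numpy as np
from sklearn.cluster import KMeans

SEGMENT_LEN = 50
N_SAMPLES = 50_000

def sample_segments(close_by_ticker, n_samples=N_SAMPLES, seg_len=SEGMENT_LEN, seed=0):
    """Bootstrap-sample fixed-length windows of close prices (with replacement)."""
    rng = np.random.default_rng(seed)
    tickers = list(close_by_ticker)
    segments = []
    for _ in range(n_samples):
        series = close_by_ticker[tickers[rng.integers(len(tickers))]]
        start = rng.integers(len(series) - seg_len)
        window = series[start:start + seg_len]
        # Normalize each window so clustering picks up shape rather than price level.
        segments.append((window - window.mean()) / (window.std() + 1e-9))
    return np.stack(segments)

# Hypothetical usage: close_prices maps ticker -> np.ndarray of daily close prices.
# segments = sample_segments(close_prices)
# labels = KMeans(n_clusters=4, random_state=0).fit_predict(segments)
```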

The goal of the project was to examine the characteristics that distinguish the clusters. In particular, I was looking for cluster centers that appeared similar to existing chart patterns, as well as clusters that had a higher future potential profit/loss on average than other clusters.

Results

In this section, I show the results of my clustering exercise using various methods and segment lengths. In addition, I explore the characteristics of the resulting segments.

In Figure 3, I depict the result of separating the segments using the K-Means algorithm. In that figure, each point represents a 50-day segment that has been embedded into a two-dimensional space using t-SNE. The goal here is to check that the clusters are well separated and to explore, preliminarily, whether any of the clusters are distinguishable as representing profitability.

I examined the learning curve of the within-cluster sum of squares as a function of the number of clusters for a range of numbers (up to 20). Then I chose to use four clusters using the elbow method. It is essential to explore a greater number of clusters as this increases the opportunity to learn more complex patterns – particularly patterns that are used in practice.
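The elbow inspection itself is only a few lines; this sketch assumes `segments` is the array of normalized windows produced by the sampling sketch above:

```
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def plot_elbow(segments, max_k=20):
    """Plot within-cluster sum of squares (inertia) against the number of clusters."""
    ks = range(1, max_k + 1)
    inertias = [KMeans(n_clusters=k, random_state=0).fit(segments).inertia_ for k in ks]
    plt.plot(ks, inertias, marker="o")
    plt.xlabel("number of clusters k")
    plt.ylabel("within-cluster sum of squares")
    plt.show()
```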

In Figure 3, I also represent each cluster with a different colormap. The scale corresponds to the ratio of potential profit to potential loss. That is, the darker the color, the more profitable the time series. We define potential profit as the amount of profit made by investing $100 at the end of the time series and selling at the maximum price in 20 days. The counterpart, potential loss, is defined as the amount lost when investing $100 and selling at the lowest point in 20 days.
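One way these definitions could be coded is sketched below, assuming `future` holds the 20 daily close prices that follow a segment; the exact convention used in the study may differ:

```
import numpy as np

def potential_profit_and_loss(entry_price, future, investment=100.0):
    """Potential profit/loss from investing `investment` at `entry_price` and selling
    at the highest / lowest close within the following window (e.g. 20 trading days)."""
    future = np.asarray(future, dtype=float)
    shares = investment / entry_price
    potential_profit = shares * future.max() - investment
    potential_loss = shares * future.min() - investment  # negative when the low is below entry
    return potential_profit, potential_loss
```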

Unfortunately, we don’t see any clusters that emerge as particularly profitable or unprofitable. So, we continue investigating what exactly each cluster means.


Figure 3: Cluster results for K-Means, where hue is determined by the ratio of potential profit to potential loss


Figure 4: Bar charts of cluster makeup by week, month, and year

After checking whether or not clusters correlate with profit, we check whether clusters correlate with the date. Specifically, the year, month, and the week of the month. However, as Figure 4 reveals, we see similar profiles throughout. Our clusters don’t seem to be grouping by time or by profit.


Figure 5: Bar charts of cluster makeup by sector

Figure 5 explores the correlation of the various groups with the various sectors. As in Figures 3-4, qualitatively, we can’t find any distinguishable difference between the various sectors in Figure 5. Since clustering reveals no obvious correlation with any of our metadata, we hypothesize the clustering correlates with patterns in the data.

What exactly is a pattern, and what does it look like? Examples of patterns used in technical analysis are seen in Figure 6. Note that these patterns are not unique and small variations still count as the same pattern. More importantly, these patterns exhibit oscillatory behavior as can be seen in Figure 6b.


Figure 6a: Example of Head and Shoulders pattern


Figure 6b: Example of Triangle pattern

Now we calculate the cluster centers using the average close price of all time series within a cluster.


Figure 7: Average time series of each cluster. In the left panel, the line width is weighted by the cluster population. In the right panel, the line width is weighted by the cluster’s average potential profit.

As depicted in Figure 7, the cluster centers resemble monochromatic cosine waves of different frequencies. Note that in the left plot of Figure 7, the line widths are weighted by the cluster population, while in the right plot, line widths are weighted by the cluster’s average potential profit. From these plots, we can see that all four clusters have a similar population and that there is very little difference in each cluster’s average potential profit.


Figure 8: Average time series of clusters using four different clustering methods.

To avoid overfitting to one clustering method, we try several alternative methods. Because these methods operate differently, they are unlikely to make the same mistakes, so we avoid overfitting to one particular technique. If the different methods return similar results, it lends support to the argument that these patterns indeed characterize the data.

As can be seen in Figure 8, K-Means, DBSCAN, and Hierarchical Clustering return very similar patterns. The autoencoder method is slightly different, however, its patterns still appear harmonic and retain symmetry across the x-axis.


Figure 9: Patterns that arise when clustering time series of different lengths. The patterns for the 100-day time series have a noticeable upward slope.

In Figure 9, I examine the results when changing the segment length from 50 days to 20 and 100 days. Note that 20 trading days is one month, 50 trading days is approximately one quarter, and 100 trading days is approximately half a year. Figure 9 shows similar patterns in the 20- and 50-day windows, but we see a significant upward slope for the 100-day segments. This upward slope indicates that the stocks we examined, on average, increased in value over the period we considered (1990-2020) on scales of half a year.

It’s possible that we obtain harmonic patterns because these clustering methods rely on Euclidean distance to separate clusters. Since sin and -sin are opposites and therefore far apart in that metric, separating clusters along such harmonics is a natural outcome. We do not see exactly the same patterns when we use the autoencoder--because the time series has been embedded into a lower dimension, the distance is calculated in a different space. However, the fact that the autoencoder patterns also show oscillatory and symmetric behavior indicates that this is not just an artifact of our clustering algorithms but rather a robust phenomenon.

Next, we tried clustering on multi-channel data. We included open, high, low, and close prices here. Using K-Means and autoencoder methods to cluster the data, I plotted the medoid time-series below in Figure 10. I found that even using multi-channel data, we still get similar results -- classes separated, largely, by opposite-sign harmonics.
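For reference, a medoid can be computed with a few lines of NumPy; this sketch assumes `segments` has shape (n_segments, 50, 4) for the four price channels and `labels` holds the cluster assignments (an illustration only, not the study's code):

```
import numpy as np

def cluster_medoid(segments, labels, cluster_id):
    """Return the cluster member minimizing total Euclidean distance to all other members."""
    members = segments[labels == cluster_id]
    flat = members.reshape(len(members), -1)  # flatten the channels for distance computation
    # Pairwise distances; for very large clusters this should be chunked or subsampled.
    dists = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=-1)
    return members[dists.sum(axis=1).argmin()]
```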


Figure 10a: Medoid time series for each cluster using K-Means clustering on multichannel data (open, high, low, and close)


Figure 10b: Medoid time series for each cluster using the autoencoder method on multichannel data (open, high, low, and close)

The final experiment was to cluster and extract the dominant patterns from the data after filtering it according to potential profit and loss. The results are shown in Figure 11.

  • The solid-triangle line in Figure 11 shows the standard, randomly-sampled dataset I used before (labeled as “unfiltered”).
  • The solid line shows clustering results after filtering the data to include only the profitable segments - random time series with a maximum potential profit between $10 and $100, with a potential loss of no more than $10.
  • The dashed line complements the solid by showing the averaged segments after filtering for unprofitable segments -- segments with a potential loss between -$100 and -$10, with a potential profit of no more than $10.


Figure 11: Average time series of each cluster for profitable, unprofitable, and unfiltered time series. Each dataset is clustered separately.

Interestingly, we see almost no difference in cluster centers between the three time-series – unfiltered, profitable, and unprofitable – despite the fact that there is such a large difference in the potential rate of return between the datasets.

Conclusion

The goal of this project was to identify patterns in stock data using unsupervised learning. We did this by clustering time series of various lengths using different methods and analyzing these clusters.

We observed oscillatory and symmetric patterns across four different clustering methods. These patterns were not good indicators of potential profit or loss.

There are many things I wanted to explore but was unable to, due to time constraints, e.g., using different clustering methods on the autoencoder embedding, or including volume as a channel for multi-channel experiments. An interesting alternative method of finding patterns might be using supervised learning with a convolutional neural network to predict future price/profitability and visualize the convolutional filters. One would essentially be programming an artificial technical analyst and (potentially) identifying emerging chart patterns.

As I mentioned earlier, this project was part of my internship at J.P. Morgan in their AI Research team this summer. In particular, I would like to thank Naftali Cohen, Srijan Sood, and Zhen Zeng for their help and guidance throughout this work. Without their help, this project would not have been so successful. I had an excellent time during this internship and learned a great deal about unsupervised learning and data science from my coworkers.

I hope that you enjoyed reading about this project and that it gave you some interesting insight into machine learning with technical analysis.

Thank you.
Lawrence Huang

Disclaimer

This post was prepared for informational purposes by the Artificial Intelligence Research group of JPMorgan Chase & Co and its affiliates (“J.P. Morgan”), and is not a product of the Research Department of J.P. Morgan. J.P. Morgan makes no representation and warranty whatsoever and disclaims all liability, for the completeness, accuracy, or reliability of the information contained herein. This document is not intended as investment research or investment advice, or a recommendation, offer or solicitation for the purchase or sale of any security, financial instrument, financial product or service, or to be used in any way for evaluating the merits of participating in any transaction, and shall not constitute a solicitation under any jurisdiction or to any person, if such solicitation under such jurisdiction or to such person would be unlawful.
© 2020 J.P. Morgan Chase & Co. All rights reserved

Amy V., Fenglin Y., Ming C., John B., Prashant D. | SEP 2020
Artificial Intelligence & Machine Learning Technology, Chief Technology Office

Snorkeling: Label Data With Less Labor

Check out how Snorkeling can complement active learning and help partially automate the process of data label creation.




A key barrier for businesses’ adoption of machine learning is not lack of data but lack of labeled data. In Learn more with less data, we shared how active learning enables collaboration between the annotator and data scientist to intelligently select data points to label. Using this approach, we can identify important data points that the annotator should label to rapidly improve model performance.

Snorkeling complements active learning by helping to partially automate data label creation. It focuses on identifying easy data points that can be labeled programmatically, instead of by an annotator.

Background: What is Snorkeling?

Snorkel is a library developed at Stanford for programmatically building and managing training datasets.

In Snorkel, a Subject Matter Expert (SME) encodes a business rule for labeling data into a Labeling Function (LF). The LF can then be applied to the unlabeled data to produce automated candidate labels. Typically, multiple LFs are used to produce differing labels, and policies are defined for selecting the best final label choice. These policies include majority vote and a model-based weighted combination.

The labeling functions can be evaluated for coverage of the unlabeled training data. The SME can determine whether gaps exist and add additional LFs for those cases. The labeled training data can then be used to train a classifier model. The purpose of this model is to evaluate the quality of the labeled dataset produced by Snorkeling against a reference or gold labeled dataset. The model is evaluated on manually labeled test data, and the results can be fed back to the SME to further tune the LFs.

The overall process is shown in the following figure.

LFs can use different types of heuristics. For example, patterns in the content can be identified, such as keywords or phrases. Attributes of the content, such as its length or source, can also be used. The SME determines the best LFs based on knowledge of the domain and data, and by iteratively improving the LFs to increase coverage and reduce noise.
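For readers unfamiliar with the library, here is a minimal sketch of the workflow described above; the keyword and length heuristics, label names, and toy DataFrame are made up for illustration and are not the LFs used internally:

```
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier, LFAnalysis
from snorkel.labeling.model import LabelModel

ABSTAIN, NOT_COMPLAINT, COMPLAINT = -1, 0, 1

@labeling_function()
def lf_keyword_refund(x):
    # Pattern heuristic: a complaint-like keyword appears in the text.
    return COMPLAINT if "refund" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_keyword_thanks(x):
    # Pattern heuristic: thank-you messages are unlikely to be complaints.
    return NOT_COMPLAINT if "thank" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_short_text(x):
    # Attribute heuristic: very short messages are unlikely to be complaints.
    return NOT_COMPLAINT if len(x.text.split()) < 3 else ABSTAIN

lfs = [lf_keyword_refund, lf_keyword_thanks, lf_short_text]
df_train = pd.DataFrame({"text": ["please refund my fee", "thanks a lot", "where is my refund?"]})

L_train = PandasLFApplier(lfs=lfs).apply(df=df_train)   # candidate labels, one column per LF
print(LFAnalysis(L=L_train, lfs=lfs).lf_summary())      # coverage, overlaps, conflicts

label_model = LabelModel(cardinality=2, verbose=False)  # model-based weighted combination
label_model.fit(L_train=L_train, n_epochs=200, seed=123)
probs = label_model.predict_proba(L=L_train)            # probabilistic training labels
```

When only a handful of LFs exist, a simple majority vote over their outputs can stand in for the model-based combination.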

Because producing manual labels is costly and time-consuming, a variety of programmatic and machine learning techniques are used. Data scientists use a combination of techniques such as Snorkeling, active learning, and manual labeling, depending on the stage of ML development, the types of data, and the requirements of the training environment.

Why is Snorkeling valuable?

Snorkeling has two primary sources of value: 1) labor savings and 2) faster time to market.

Labor savings

Applying Snorkeling can substantially reduce the amount of labor required. In one project, two annotators label 3K to 5K examples per week, and in another project, five annotators label approximately 2,000 customer interactions per week. A data engineer can create a set of Snorkel labeling functions in about one month for each project. The labeling functions can be run each week, and the results can either be used to directly retrain the model or be reviewed by the annotators in less than half the time it would take to label the data from scratch.

Combining this approach with Active Learning allows the data scientist to create a high performing model with significantly reduced cost compared to using traditional data labelling approaches.

Faster time to market

Using Snorkeling, we built a model on unseen data using heuristics and a small set of labels, and augmented the data using fine-tuned transformation functions. The team then built and deployed a model within 10 days – far faster than the traditional development cycle of 30 days or more. Separately, we found that training a model on a dataset labeled using Snorkeling could improve model accuracy significantly.

Applying Snorkel

Industry solutions

In a study at Google [Bach et al 2019], data scientists used an extension of Snorkel to process 684,000 unlabeled data points. Each data sample was selected from a larger dataset by a coarse-grained initial keyword-filtering step. A developer wrote ten labeling functions. These LFs included:

  • The presence of URLs in the content, and specific features of the URLs
  • The presence of specific entity types in the content, like ‘person’, ‘organization’, or ‘date’, using a Natural Language Processing tool for Named-Entity Recognition
  • The matching of the topic of the content with specific topic categories, using a topic model classifier

The model trained on the labeled dataset from Snorkeling matched the performance of a model trained on 80K hand-labeled examples, and was within 5% of the performance metric (F1 score) of a model trained on 175K hand-labeled training data points.

JPMC solution

Due to the COVID-19 pandemic and lockdown, customer feedback and complaint patterns changed profoundly and at unprecedented speed. To understand customer issues, the team used Snorkeling - combining a small set of labels, heuristic labeling functions, and augmentation techniques - to create a dataset and a COVID-specific model.

A data scientist with the project team wrote 20 LFs for the Voice of the Customer (VoC) project, to label data for training the VoC model on COVID-19 and lockdown themed customer feedback. Below is an example using mock-up/synthesized data.
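A hedged sketch of what one such labeling function might look like on synthesized feedback (the theme label and keyword list are invented for illustration, not the actual LFs used):

```
from snorkel.labeling import labeling_function

ABSTAIN, COVID_THEME = -1, 1

@labeling_function()
def lf_lockdown_keywords(x):
    # Mock heuristic: flag feedback mentioning pandemic- or lockdown-related terms.
    keywords = ("covid", "coronavirus", "lockdown", "quarantine", "stimulus")
    return COVID_THEME if any(k in x.text.lower() for k in keywords) else ABSTAIN
```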

Using the model, the team identified complaint themes and the business took immediate action to solve customer problems.


Guangyu W., Machine Learning Center of Excellence | Aug 2020

How to Build a FAQ Bot With Pre-Trained BERT and Elasticsearch

In this tutorial, we demonstrate a simple way to create a FAQ bot by matching user questions to pre-defined FAQs using Sentence-BERT and dense vector search in ElasticSearch, with concrete code examples. The solution is fast, accurate, and scalable in a production-level environment.

Introduction

Chatbots have emerged as one of the most popular interfaces, thanks to improvements in NLP techniques. Within this big family, a FAQ bot is usually designed to handle domain-specific question answering given a list of pre-defined question-answer pairs. From the machine learning standpoint, the problem can be framed as “find the most similar question in the database matching the user’s question”, also known as semantic question matching. In this post, we give a step-by-step tutorial for building a FAQ bot using Sentence-BERT and ElasticSearch.

Matching Logic

To solve semantic question matching, there are generally two directions to go. One is the information retrieval approach, which treats the pre-defined FAQs as documents and the question from the user as the query. The advantage of this search-based approach is that it is in general more efficient and scalable: running the system on 100 FAQs versus 1 million FAQs usually won’t make a significant difference in inference time (this still depends on the algorithm you use; for example, BM25 will be much faster than a ranking model). The other approach is to use a classification model that takes the user question and a candidate FAQ as a question pair, then classifies whether they are the same question or not. This method requires the classification model to run through all the potential FAQs with the user question to find the most similar FAQ. Compared with the first approach, it can be more accurate but takes more computation resources during inference. In the following parts, we focus on the search-based method, which computes semantic embeddings for both the user question and the FAQs and selects the best match based on their cosine similarity.
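Before bringing in ElasticSearch, the core of the search-based idea fits in a few lines of NumPy: embed the FAQs once, embed the incoming question, and rank by cosine similarity. The sketch below assumes the embeddings already exist (SBERT, introduced next, provides them):

```
import numpy as np

def best_faq_match(question_embedding, faq_embeddings):
    """Return the index of the FAQ whose embedding has the highest cosine
    similarity with the user question embedding."""
    q = question_embedding / np.linalg.norm(question_embedding)
    f = faq_embeddings / np.linalg.norm(faq_embeddings, axis=1, keepdims=True)
    return int(np.argmax(f @ q))
```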

Sentence-BERT for Question Embedding

BERT-style models have been able to achieve SOTA performance on various NLP tasks [1]. However, BERT token-level embeddings cannot be transformed directly into a sentence embedding. A simple average of token embeddings, or just using the [CLS] vector, turns out to perform poorly on textual similarity tasks.

An approach to improving BERT sentence embeddings is Sentence-BERT (SBERT) [2], which fine-tunes the BERT model with the Siamese network structure shown in Figure 1. The model takes a pair of sentences as one training data point. Each sentence goes through the same BERT encoder to generate token-level embeddings. A pooling layer is then added on top to create a sentence-level embedding. The final loss function is based on the cosine similarity between the embeddings of the two sentences.

Figure 1: Sentence-BERT (SBERT) with Siamese architecture

 

According to the SBERT paper, fine-tuned SBERT significantly outperforms baselines such as averaged GloVe [3] or BERT token embeddings in terms of Spearman rank correlation of sentence embeddings on textual similarity data sets.

The author team has also released a Python package called “sentence-transformers”, which allows users to easily embed sentences with SBERT and fine-tune the model through a PyTorch interface. Following the GitHub link (https://github.com/UKPLab/sentence-transformers), we can install the package with:

```
pip install -U sentence-transformers
```

Then it’s straightforward to use a pre-trained SBERT model to generate question embedding:

```
from sentence_transformers import SentenceTransformer

sentence_transformer = SentenceTransformer("bert-base-nli-mean-tokens")

questions = [
    "How do I improve my English speaking? ",
    "How does the ban on 500 and 1000 rupee notes helps to identify black money? ",
    "What should I do to earn money online? ",
    "How can changing 500 and 1000 rupee notes end the black money in India? ",
    "How do I improve my English language? "
]

question_embeddings = sentence_transformer.encode(questions)
```

The output of the “encode” method is a list with the same length as the input “questions”. Each element in the list is the sentence embedding for that question, a NumPy array with a dimension of 768. Note that 768 is the same size as the standard BERT token-level embedding.

Dense Vector Search in ElasticSearch (ES)

Now we have a way to generate question embeddings that capture the underlying semantic meaning. The next step is to set up a pipeline for matching user questions using these embeddings. To achieve this, we introduce ElasticSearch, one of the most well-known search engines based on Lucene, which is scalable and easy to use. As of version 7.3, ES provides a dense vector data type, which allows us to store our question embeddings and perform similarity matching.

If you don’t have ES in your environment, you can download and install it from the official website (https://www.elastic.co/downloads/elasticsearch). Once installed and started, a local ES instance is accessible at “localhost:9200”. As we use Python for this tutorial, a Python ES client is needed before we start:

```
pip install elasticsearch
```

There are generally two steps in using ES: indexing and querying. In the indexing stage, we first create an “index”, which is a concept similar to a “table” in a relational database, using the following code. All the pre-defined FAQs will be stored in this index.

```
from elasticsearch import Elasticsearch

es_client = Elasticsearch("localhost:9200")

INDEX_NAME = "faq_bot_index"
EMBEDDING_DIMS = 768

def create_index() -> None:
    es_client.indices.delete(index=INDEX_NAME, ignore=404)
    es_client.indices.create(
        index=INDEX_NAME,
        ignore=400,
        body={
            "mappings": {
                "properties": {
                    "embedding": {
                        "type": "dense_vector",
                        "dims": EMBEDDING_DIMS,
                    },
                    "question": {
                        "type": "text",
                    },
                    "answer": {
                        "type": "text",
                    }
                }
            }
        }
    )

create_index()
```

In ES, there is a concept called “mapping” used during index creation. It is similar to defining a table schema when creating a table. In our tutorial, we create an index called “faq_bot_index” with three fields:

  • embedding: a dense_vector field to store question embedding with a dimension of 768.
  • question: a text field to store the FAQ.
  • answer: a text field to store the answer. This field is just a placeholder and is not directly relevant to this tutorial.

Once we have the index created in ES, it’s time to insert our pre-defined FAQs. This is what we call “Indexing Stage” and each FAQ will be stored with its question embedding in ES.

```
from typing import Dict, List

def index_qa_pairs(qa_pairs: List[Dict[str, str]]) -> None:
    for qa_pair in qa_pairs:
        question = qa_pair["question"]
        answer = qa_pair["answer"]
        # Wrap the question in a list so "encode" returns a list with one embedding.
        embedding = sentence_transformer.encode([question])[0].tolist()
        data = {
            "question": question,
            "embedding": embedding,
            "answer": answer,
        }
        es_client.index(
            index=INDEX_NAME,
            body=data,
        )

qa_pairs = [{
    "question": "How do I improve my English speaking? ",
    "answer": "Speak more",
}, {
    "question": "What should I do to earn money online? ",
    "answer": "Learn machine learning",
}, {
    "question": "How can I improve my skills? ",
    "answer": "More practice",
}]

index_qa_pairs(qa_pairs)
```

By running the code above, we can index three different FAQs with their answers into ES, and then we are ready for the querying stage.

In the querying stage, we need to encode the user question into embedding and construct a customized ES query that computes cosine similarities based on embeddings to rank the pre-defined FAQs.

```
ENCODER_BOOST = 10

def query_question(question: str, top_n: int = 10) -> List[dict]:
    # Wrap the question in a list so "encode" returns a list with one embedding.
    embedding = sentence_transformer.encode([question])[0].tolist()
    es_result = es_client.search(
        index=INDEX_NAME,
        body={
            "from": 0,
            "size": top_n,
            "_source": ["question", "answer"],
            "query": {
                "script_score": {
                    "query": {
                        "match": {
                            "question": question
                        }
                    },
                    "script": {
                        "source": """
                            (cosineSimilarity(params.query_vector, "embedding") + 1)
                            * params.encoder_boost + _score
                        """,
                        "params": {
                            "query_vector": embedding,
                            "encoder_boost": ENCODER_BOOST,
                        },
                    },
                }
            }
        }
    )
    hits = es_result["hits"]["hits"]
    clean_result = []
    for hit in hits:
        clean_result.append({
            "question": hit["_source"]["question"],
            "answer": hit["_source"]["answer"],
            "score": hit["_score"],
        })
    return clean_result

query_question("How to make my English better?")
```

The customized ES query contains a “script” field that allows us to define a scoring function which computes the cosine similarity score on the embeddings and combines it with the general ES BM25 matching score. “ENCODER_BOOST” is a hyper-parameter we can change to weight the embedding cosine similarity. Further detail on using the ES script function can be found here: https://www.elastic.co/blog/text-similarity-search-with-vectors-in-elasticsearch.

By running the “query_question” function with a specific user question as the argument, we can find the most similar FAQs in ES based not only on the keyword match (BM25) but also on the question’s semantic meaning (SBERT embedding). This will be the key entry point for a FAQ bot.

Looking Forward

In this tutorial, we have demonstrated a simple way to create an FAQ bot by matching user questions to pre-defined FAQs using SBERT and ES. The solution is very scalable and could achieve decent performance. However, we haven’t gotten the chance to talk about two important pieces of work that could further improve your bot accuracy. First, we could fine-tune SBERT on the domain-specific dataset instead of using the pre-trained one. This in practice could significantly boost the model performance if the text-domain is quite different. Second, this search-based matching approach is usually the first step for a mature system given its low computation cost. Once ES returns a list of candidate FAQs, an additional ranking step could be added to select the best match. This could be done by more complex learn-to-rank models or classification models that take a pair of sentences at a time. We will further illustrate this in future posts.

References

[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding
[2] Reimers, N., and Gurevych, I. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks.
[3] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global Vectors for Word Representation.

Ming C., Ericamarie K, Fenglin Yin, Huang Z., John B., Prashant D. | July 2020
Artificial Intelligence & Machine Learning Technology, Chief Technology Office

Use ML to Improve Customer Experience Without Data and Privacy Compromise

Read on to find out how you can use machine learning to redact personally identifiable information to generate customer insights while ensuring data and privacy protection.

Can we remove the personally identifiable information of a customer from emails, chat messages, and other customer interactions while retaining the value of the content for building models to improve the customer experience of the firm's services? In this blog, we describe how we redact PII and tokenize data to help discover, classify, and protect sensitive entities at JPMorgan Chase.

What is Personally Identifiable Information (PII)?

PII is any information that can be used to distinguish or trace an individual's identity such as name, social security number, date and place of birth, mother's maiden name, biometric records, and any other information that is linked or linkable to an individual, including medical, educational, financial, and employment information.

Protecting customers’ PII is a fundamental legal, regulatory, and business requirement for the firm. While much PII data sits in structured columns and hence can easily be removed, sources like customer call transcripts, emails, and messages are unstructured data sources in which a customer may disclose PII such as addresses, names, and social security numbers at any point within a conversation. This creates a challenge for data use, as this information must be safeguarded before the data can be used.

How do you safeguard PII?

PII redaction methods help discover, classify, and protect sensitive entities. There are multiple techniques for safeguarding PII within text data. In each case, PII content is identified and either removed or replaced; it can be replaced with a generic string or masked. Here are the types of entities the PII tool can detect and redact.

Numeric Entities

Numeric identifiers (e.g., Social Security numbers, phone numbers, codes) can appear as digits, as words, or as a combination of the two; all of these forms are removed. Sequences of digits are identified using pattern matching with regular expressions, and numeric words ('one', 'two', …) are also matched. Numeric content is either removed or replaced with a generic string.
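As a rough illustration of the pattern-matching idea (not the exact rules in the internal package), a sketch:

```
import re

NUM_WORD = r"(?:zero|one|two|three|four|five|six|seven|eight|nine)"
# Runs of digits (e.g. SSNs, phone numbers) or runs of spelled-out digits.
NUMERIC_PATTERN = re.compile(
    r"\b(?:\d[\d\s-]{2,}\d|" + NUM_WORD + r"(?:[\s-]" + NUM_WORD + r"){2,})\b",
    re.IGNORECASE,
)

def redact_numeric(text, placeholder="[NUMBER]"):
    return NUMERIC_PATTERN.sub(placeholder, text)

print(redact_numeric("Call me at 212-555-0187 or read back five five one two."))
```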

Physical Addresses

The US Postal Service dataset of 20,000 city and state names across US states and territories is used to identify city and state names. Street names are matched in the context of words preceding street-type symbols such as road or avenue. Tested against real physical addresses from https://openaddresses.io, physical address redaction was more than 99% effective.

Email Addresses

The format of Internet email addresses is defined by a standard [IETF RFC 5322] published by the Internet Engineering Task Force (IETF). Unfortunately, the specification doesn't lend itself to simple pattern matching. The PII redaction package uses a domain database of about 8,000 top-level domains plus a regular expression pattern matcher.
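A simplified sketch of this two-stage idea - a candidate regular-expression match followed by a check against known top-level domains - is shown below; the real package's domain list and pattern are far more extensive:

```
import re

KNOWN_TLDS = {"com", "org", "net", "edu", "gov", "io", "co", "uk"}  # tiny illustrative subset
EMAIL_CANDIDATE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")

def redact_emails(text, placeholder="[EMAIL]"):
    def _replace(match):
        tld = match.group(0).rsplit(".", 1)[-1].lower()
        return placeholder if tld in KNOWN_TLDS else match.group(0)
    return EMAIL_CANDIDATE.sub(_replace, text)

print(redact_emails("Contact jane.doe@example.com for details."))
```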

Proper Names

Proper name recognition uses a Natural Language Processing (NLP) technique called Named-Entity Recognition (NER). NER identifies named-entities in sentences and classifies the entities by type. For example, if the text is:


Mary disputed a transaction with Online Shoppers that was reported in May 2020.

then the NER would label it as follows:


[Mary]Person disputed a transaction with [Online Shoppers]Organization that was reported in [May 2020]Time.

Existing NER tools use a combination of grammar-based and statistical models. Grammatical features use sentence structure to inform proper-noun categorization. The PII redaction package uses the Stanford Named Entity Recognizer, which uses a statistical modeling technique called Conditional Random Field (CRF). CRF uses a graph model to take context into account when making predictions about a word in the text.

In testing, the NER based redaction of proper nouns is 92% accurate. Improving NER models continues to be an active area of research.
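The internal package uses the Stanford NER (CRF) model. Purely to illustrate how NER output drives redaction, here is a sketch using spaCy's small English model as a stand-in (the entity types and placeholders are illustrative):

```
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm
REDACT_TYPES = {"PERSON": "[NAME]", "ORG": "[ORG]", "GPE": "[PLACE]", "DATE": "[DATE]"}

def redact_entities(text):
    doc = nlp(text)
    redacted = text
    # Replace entities right-to-left so earlier character offsets stay valid.
    for ent in reversed(doc.ents):
        if ent.label_ in REDACT_TYPES:
            redacted = redacted[:ent.start_char] + REDACT_TYPES[ent.label_] + redacted[ent.end_char:]
    return redacted

print(redact_entities("Mary disputed a transaction with Online Shoppers that was reported in May 2020."))
```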

Tokenization

Many machine learning models that deal with text data perform an additional step called tokenization. Here we propose a custom tokenization process to convert text into integer sequences in a more secure way:

1. Text tokenization: turning the original text into a sequence of word tokens, for example, ['hello', 'world'].

2. Hashing: applying a one-way hashing algorithm such as SHA-256 to each word (token) and replacing the token with its hashed value. The algorithm is called 'one-way' because it is mathematically difficult to invert, which is why such algorithms are also widely used in data protection. Each hash is unique across the dictionary of words, and each time the same word is seen in the stream, the same hash value is produced.

3. Sequencing: mapping each hashed code to a sequential integer based on its position in the original text stream. This step helps save storage and transmission space by replacing long strings with integers. Sequential integers are also required when constructing a large word embedding matrix for NLP models.


A common technique to make it more difficult to reverse engineer the original text from the tokens is to randomize the hashing function. A random seed value, or salt, can be added at the hashing step. The result is a stream of numbers that is close to impossible to reverse engineer without the original dictionary.
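A minimal sketch of the three steps with a salt added at the hashing stage (SHA-256 from Python's standard library; the whitespace tokenizer and salt value are deliberately naive placeholders):

```
import hashlib

def tokenize_and_hash(text, salt="a-random-secret-salt"):
    """Text tokenization -> salted one-way hashing -> sequencing to integers."""
    tokens = text.lower().split()                                               # 1. tokenization
    hashes = [hashlib.sha256((salt + t).encode()).hexdigest() for t in tokens]  # 2. hashing
    vocab = {}                                                                  # 3. sequencing
    for h in hashes:
        vocab.setdefault(h, len(vocab) + 1)
    return [vocab[h] for h in hashes], vocab

sequence, vocab = tokenize_and_hash("hello world hello")
print(sequence)  # [1, 2, 1] -- repeated words map to the same integer
```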

The combination of PII redaction and tokenization provides a high level of resilience against attacks such as rainbow table attacks and word frequency analysis. In a rainbow table attack, a table of hashes is precomputed for all possible input words in order to reverse the cryptographic hash function. This technique is used, for example, for cracking password hashes; such tables are typically used to recover a password (or credit card number, etc.) up to a certain length consisting of a limited set of characters. However, without access to the same dictionary, the mapping is not the same after indexing, and the salt provides an additional safeguard.

In word frequency analysis, suppose that the tokenized stream is intercepted. Could the original text or meaning be recovered using knowledge of word frequencies in common text? This scenario is likewise unlikely, since PII is either already redacted or of very low frequency in the token stream.

Applying PII removal

Industry Solutions

Recently, one public cloud provider announced an extension of its speech transcription service in which the transcript is PII-redacted after transcription is performed, and another offers a data loss prevention service running in the cloud. A disadvantage of these approaches is that on-premises data needs to be moved to the cloud before it can be cleaned.

JPMC Solution

A typical use case is an on-premises dataset that includes PII values in an unstructured text stream. Each block of text is redacted using the redaction package. Optionally, the redacted text can be tokenized.

Conclusion

Using PII removal, PII data can be redacted from a text corpus. Using tokenization, each string of text is replaced with non-sensitive tokens. Together, PII removal and tokenization enable teams to make small, secure datasets available to ML teams.

Applied AI & ML team | June 2020

Enter EVA (Email Virtualization Automation)

How the firm’s in-house machine learning solution for emails helped teams process the email surge in Q1.

 

As the coronavirus sent shockwaves across global markets, clients turned to J.P. Morgan via a traditional yet enduring channel: email. Heightened volatility meant the corresponding email volume surged by 50% for some client servicing teams.

While email enables critical interactions between the firm and its clients, enormous volumes make it challenging for client service teams to prioritize and process incoming inquiries. Enter 'EVA', Email Virtualization Automaton. EVA is already having a significant impact in the Corporate and Investment Bank (CIB), as it helps to automatically route and now, crucially, resolve email inquiries at scale.

Many teams across the bank have already successfully adopted EVA. One example of this success has been EVA's partnership with the CIB's Collateral & CPG (CCPG) operations team. The CCPG Ops team supports the firm’s Collateral and Credit portfolio group, which plays a large role in mitigating the firm’s credit risk exposure. In this example, the integration of EVA automated the resolution of J.P. Morgan-initiated margin calls across CCPG Ops’ team mailboxes. EVA supports the team at each step of a margin call booking, from extracting information from unstructured emails to construct affirmations of margin call agreements (up to 75% straight-through processing) to handling booking and movement via the Acadia third-party platform. In March and April, this capability helped CCPG Ops seamlessly handle a 50% increase in volumes, providing scalability during turbulent market conditions by processing 12k calls and saving over 700 hours of processing time in March alone. The CIB is looking to extend this rollout, which could help save 1,200 hours of processing time per month by the end of 2020.

“This technology is a real game-changer for us.”
Phil Glackin, Managing Director, Markets Operations, CIB

This example is just one application for a firmwide tool like EVA that uses reusable components - email parser, classifier and information extractor - like a set of Lego bricks to provide a standardized approach to different solutions. It works alongside existing teams and processes to optimize how they handle process-driven email tasks that otherwise consume time and resources, empowering them to better serve clients through uncertain times.

“EVA provided a much-needed cushion to our team, seamlessly supporting a 50% volume hike.”
Shweta Shetty, Vice President, Markets Ops, CIB

In Wholesale Payments, EVA helped triple the resolution of transaction status queries in one quarter since its introduction late last year. “The team is now helping us advance our strategic routing of transactional inquiries to free up our client services account managers to deliver even more value to our clients,” said Adam Hyde, Global Head of Client Service, Treasury Services and Commercial Card—Wholesale Payments.

Today, EVA is classifying over 490,000 emails each month, and with each new use case, we add to its capabilities: full task resolution is currently being prototyped with other teams facing large email volumes.

“This is a great example of the firm using machine learning to solve the institutional challenge of many clients still preferring email communication,” said Samik Chandarana, Head of CIB Data & Analytics and Applied AI & ML. “The team has produced something that is easily scalable and already having a material impact.”

For more information, visit the J.P. Morgan Applied AI and ML page here.

Austin G., Digital Advanced Computing | May 2019

Quantum Computing From a Computer Science Perspective

When approaching quantum computing from a computer science perspective, it may seem intuitive to begin by comparing quantum computers directly with their classical counterpart. However, many who attempt to learn this way (myself included) end up more confused than informed, especially after encountering the complex mathematical and physics notation present in popular literature.

Instead of comparing quantum computers directly with their classical counterparts, we will assume a basic understanding of how classical computers function and discuss the unique qualities of a quantum computation separately. Although this distinction is subtle, a quantum-focused approach is arguably more enlightening than a direct comparison.

It is worth noting that this is a difficult subject, and thus we encourage beginners to surround themselves with a healthy mix of both literature and examples (see the Activity Corner for additional resources). Understanding quantum state takes time and effort, and it's normal to be confused when starting out. As Richard Feynman said:

"Nature isn't classical . . . and if you want to make a simulation of Nature, you'd better make it quantum mechanical, and by golly it's a wonderful problem, because it doesn't look so easy."

Quantum state


Let's start by looking at the process of tossing a fair coin. We can regard this process as random, with two possible outcomes: heads or tails. When one outcome of the random process has a greater probability of being measured, we call the coin biased. The bias is an unknown factor which can skew the probability away from what we would expect from a fair coin. The bias cannot be determined with a single coin flip, so the coin needs to be flipped multiple times to get a more precise estimate of the bias. The frequency of the outcomes from the coin flips provides information as a set of probabilities, which must add up to 1.

We can think of a biased coin as a probabilistic bit — i.e. there is some randomness to the result, as opposed to deterministically measuring 0 and 1. The probabilities associated with each outcome can be thought of as the parameters of a random process before sampling it, similar to a spinning coin before landing on one of its sides. When the number of possible outcomes is greater than one, we refer to the state as being in superposition of those outcomes. While this term is often used in quantum computing, this idea is not unique and exists for any probabilistic system.

A quantum bit (abbreviated qubit or qbit) is a generalized version of the probabilistic bit. Instead of associating probabilities with each outcome (as we do in the probabilistic bit), we associate 2-dimensional vectors or arrows, which are called amplitudes. The probabilities of the outcomes are correlated to the magnitudes of their corresponding amplitudes. More precisely, the probability of an outcome is the square of the length of its corresponding arrow. Therefore, we can use arrows to represent each outcome of a qubit.


A qubit starts in the default state, where the probability of measuring 0 is 1 (e.g. the coin always turns up heads). The default state is one of the basis states, which correlate directly with a possible outcome of the quantum state. There are two basis states for a single qubit: 0 and 1 (in quantum mechanics these are denoted |0> and |1>, but there is no benefit to making this distinction from a computing perspective). We define a quantum state by its basis states and corresponding amplitudes, much like a dictionary of key-value pairs. The arrow notation discussed above can therefore be spoken of interchangeably with the quantum state itself.

With a classical system we can observe its current state at any time without affecting it. But the state of a quantum system randomly collapses to one of its possible outcomes when measured. We can never observe the quantum state directly — measurement returns a binary string, and the state after observation is the basis state that corresponds to that binary string (the state before measuring is destroyed). In order to retrieve any meaning from the state of a quantum system, its state has to be recreated and measured multiple times, and the random results have to be interpreted as a pattern in the context of the given problem (e.g. flipping a coin over and over again to determine its bias).
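To make the "recreate and measure many times" idea concrete, here is a small NumPy sketch that prepares a single-qubit state from two amplitudes, samples repeated measurements, and estimates the outcome probabilities from the observed frequencies (the amplitude values are arbitrary examples):

```
import numpy as np

rng = np.random.default_rng(0)

# Amplitudes for outcomes 0 and 1 (2-dimensional arrows, written here as complex numbers).
amplitudes = np.array([np.sqrt(0.7), np.sqrt(0.3) * 1j])
probabilities = np.abs(amplitudes) ** 2         # probability = squared length of each arrow

# "Recreate and measure" the state many times; each measurement collapses to 0 or 1.
shots = rng.choice([0, 1], size=10_000, p=probabilities)
print("estimated P(0):", np.mean(shots == 0))   # close to 0.7
```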

Quantum systems


A quantum computing system comprises multiple qubits and has an associated state. The state consists of one amplitude for each possible measurement outcome, and for a quantum system with n qubits, there are 2^n outcomes. The simplest system contains only a single qubit, and it is the building block from which larger systems are composed. It is crucial not to think of a quantum system in terms of the states of its individual qubits, but instead to consider the state as a whole. If we continue the coin example, we can think of the composition of the quantum system as quantum coins combining into a die.

We don't discuss the outcomes of each individual coin (or qubit) separately. Each face corresponds to a possible combination of heads or tails and has its own probability of being measured. When all outcomes are equally likely to be measured (like in this example), we define the state as being in equal superposition. While equal superposition has its uses (e.g. sampling), typically we want to eliminate some of the possible outcomes, the details of which are problem-specific.

Using the example of a quantum die, let's assume we only want the die to return one of two results: 00 or 11. In a quantum system this can be done with entanglement. Specifically, we force one qubit's measurement to match the other, eliminating the outcomes 01 and 10. The result is effectively a fair coin — there is a 50/50 chance of measuring 00 or 11, which can be classically post-processed into heads (0) or tails (1).
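Continuing the sketch above, the two-qubit "quantum die" restricted to 00 and 11 corresponds to equal amplitudes on those two outcomes and zero on the others; its measurement statistics can be simulated as follows (a simulation of the statistics only, not of real hardware):

```
import numpy as np

rng = np.random.default_rng(1)

# State over the four outcomes 00, 01, 10, 11: equal amplitude on 00 and 11 only.
outcomes = ["00", "01", "10", "11"]
amplitudes = np.array([1 / np.sqrt(2), 0.0, 0.0, 1 / np.sqrt(2)])
probabilities = np.abs(amplitudes) ** 2         # [0.5, 0.0, 0.0, 0.5]

shots = rng.choice(outcomes, size=8, p=probabilities)
print(shots)                                    # only "00" and "11" ever appear
```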

Working with a quantum system is similar to playing a slot machine. We have a desired outcome (or event), a series of operations we are allowed to perform on the qubits and a limited number of available qubits to work with. The challenge is to construct a quantum state that gives the desired answer in the least amount of measurements, which manifests as a balance between accuracy and computation time.

Superposition

A state in superposition is like a spinning coin or a cast die. Because the coin has yet to stop spinning (or the die has yet to land on a face), more than one outcome is possible. This is sometimes called a general superposition. States can also be in equal superposition (all outcomes are equally likely - the coin is fair, the die is unbiased, etc.).

If we have a spinning, biased coin, we have a general superposition (there are two possible outcomes, but one is more likely than another). If we have a spinning, fair coin, we have an equal superposition (all outcomes are possible, and they are equally likely to be measured). Note that an equal superposition is just a general superposition with an additional equality constraint.

Complex numbers


A complex number is comprised of a real part a and an imaginary part b, and can be written in the form a + b*i. It can also be written in polar form r*(cos(θ) + i*sin(θ)), where r is the length of the arrow and θ is the angle between the arrow and the x-axis, also called the phase of the complex number.

The amplitudes of a quantum state are complex numbers, thus the association with 2-dimensional arrows.
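As a quick check of the two forms, Python's built-in complex type and the cmath module give the rectangular and polar representations directly:

```
import cmath

z = 3 + 4j                          # a + b*i with a = 3, b = 4
r, theta = abs(z), cmath.phase(z)   # polar form: r*(cos(theta) + i*sin(theta))
print(r, theta)                     # 5.0, ~0.927 radians
print(cmath.rect(r, theta))         # back to (approximately) 3+4j
```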

 

Further Reading

The first part of Foundational Patterns for Efficient Quantum Computing details a visual approach to quantum computing, from which this article was derived.

Prashant D., Chief Technology Office | April 2020

Learning More From Less Data With Active Learning

How JPMC is combining the power of machine learning and human intelligence to create high-performance models in less time and at less cost.



A key barrier for companies to adopt machine learning is not lack of data but lack of labeled data. Labeling data gets expensive, and the difficulties of sharing and managing large datasets for model development make it a struggle to get machine learning projects off the ground.

That’s where our “learn more from less data” approach comes into action. At JPMorgan Chase, we are focused on reducing the amount of data needed to build models by constructing gold training datasets, helping reduce the labeling cost and increasing the agility of model development.

Labeled data is a group of samples that have been tagged with one or more labels. After obtaining a labeled dataset, machine learning models can be applied to the data so that new, unlabeled data can be presented to the model and a likely label can be guessed or predicted for that piece of unlabeled data. A gold training dataset is a small, labeled dataset with high predictive power.

So Where Does Active Learning Come In?

Active learning is a form of semi-supervised learning that works well when you have a lot of data but face the expense of getting that data labeled. By identifying the samples that are most informative, teams can label only the data points that most improve the quality of the model.

Using machine learning (ML) models, active learning can help identify difficult data points and ask a human annotator to focus on labeling them.

To explain passive learning and active learning, let’s use the analogy of teacher and student. In the passive learning approach, a student learns by listening to the teacher's lecture. In active learning, the teacher describes concepts, students ask questions, and the teacher spends more time explaining the concepts that are difficult for a student to understand. Student and teacher interact and collaborate in the learning process.

In ML model development using active learning, the annotator and modeler interact and collaborate. An annotator provides a small labeled dataset. The modeling team builds a model and generates input on what to label next. Within a few iterations, teams can build refined requirements, a labeled gold training set, an active learner, and a working machine learning model.

How We Identify Difficult Data Points

To identify difficult data points, we use a combination of methods, including:

  • Classification uncertainty sampling: When querying for labels, the strategy selects the sample with the highest uncertainty — data points the model knows least about. Labeling these data points makes the ML model more knowledgeable.

  • Margin uncertainty: When querying for labels, the strategy selects the sample with the smallest margin. These are data points the model knows about but isn’t confident enough about to make good classifications. Labeling these examples increases model accuracy.

  • Entropy sampling: Entropy is a measure of uncertainty. It is proportional to the average number of guesses one has to make to find the true class. In this approach, we pick the samples with the highest entropy. (A minimal sketch of these uncertainty-based strategies appears after this list.)

  • Disagreement-based sampling: With this method, we pick the samples on which different algorithms disagree. For example, if the model is classifying into 5 classes (A, B, C, D & E) and we are using 5 different classifiers, e.g.

    • 1. Bag of words

    • 2. LSTM

    • 3. CNN

    • 4. BERT

    • 5. HAN (Hierarchical Attention Networks)

    The annotator can then label the examples on which the classifiers disagree.

  • Information density: In this approach, we focus on denser regions of the data and select a few points in each dense region. Labeling these data points helps the model classify the large number of data points around them.

  • Business value: In this method, we focus on labeling the data points that have higher business value than the others.
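As a concrete illustration of the first three strategies, here is a minimal scikit-learn sketch that scores an unlabeled pool by least confidence, margin, or entropy and returns the indices an annotator would be asked to label first (the model and data are toy placeholders):

```
import numpy as np
from sklearn.linear_model import LogisticRegression

def rank_for_labeling(model, X_pool, n_queries=10, strategy="entropy"):
    """Rank unlabeled samples by an uncertainty score; highest-scoring samples get labeled first."""
    proba = model.predict_proba(X_pool)
    if strategy == "least_confidence":
        scores = 1.0 - proba.max(axis=1)                       # classification uncertainty
    elif strategy == "margin":
        part = np.sort(proba, axis=1)
        scores = -(part[:, -1] - part[:, -2])                  # small margin -> high score
    else:
        scores = -(proba * np.log(proba + 1e-12)).sum(axis=1)  # entropy
    return np.argsort(scores)[::-1][:n_queries]

# Toy usage with random data; in practice the labeled set comes from the annotator.
rng = np.random.default_rng(0)
X_labeled, y = rng.normal(size=(40, 5)), rng.integers(0, 2, size=40)
X_pool = rng.normal(size=(200, 5))
clf = LogisticRegression().fit(X_labeled, y)
print(rank_for_labeling(clf, X_pool, strategy="entropy"))
```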

Alignment Between Humans and Machines

Traditionally, data scientists work with annotators to label a portion of their data and hope for the best when training their model. If the model wasn’t sufficiently predictive, more data would be labeled, and they would try again until its performance reached an acceptable level. While this approach still makes sense for some problems, for those that have vast amounts of data or unstructured data, we find that active learning is a better solution.

Active learning combines the power of machine learning with human annotators to select the next best data points to label. This intelligent selection leads to the creation of high-performance models in less time and at lower cost.



The Artificial Intelligence & Machine Learning group is focused on increasing the volume and velocity of AI applications across the firm by helping develop common platforms, reusable services and solutions.