Zen Yui

Troop Co-Founder & CTO

10.30.23

Unlocking LLMs for Asset Stewardship

How Troop’s engineering team makes mountains of SEC filings more accessible.

The amount of work that goes into proxy voting on behalf of investment fund clients can’t be overstated. Weighing in on governance decisions, from director elections to shareholder proposals, across entire portfolios entails due diligence and analysis at massive scale. Between definitive and amended proxy statements, solicitations, and reports alone (leaving aside the flood of accompanying proxy advisory material and external research), asset stewardship teams crunch tens of thousands of pages’ worth of documents covering thousands of companies during the few months when most annual shareholder meetings take place.

Over and over again, we hear a desire for a smarter way. At Troop, we’re building tools that comb the SEC’s dense archives and make this vast unstructured dataset more accessible and wieldy. Large language models and generative AI, assembled with care, can parse proxy filings and extract the essential information needed to organize and evaluate governance questions more efficiently. Troop’s platform streamlines research so that stewardship and engagement teams can focus on advocating for clients’ values and goals. The system leverages advanced data science so that investment managers don’t have to.

Harvesting the data set

The SEC manages a publicly available dataset called EDGAR that houses the filings and disclosures for all publicly traded companies in the US. While there are general guidelines for styling these docs, there is no standardization requirement for proxy statement filings; each filing is essentially a long, rich-text memo to investors, making bulk analysis and inference tricky. If you’re plugged into the LLM world today, you’ve probably encountered a few neat projects like AITickerChat and FinGPT, which allow analysts to “chat” with these filings to extract information.
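As a rough illustration of the harvesting step, here is a minimal sketch (not Troop’s pipeline; the CIK, contact details, and bot name are placeholders) that lists a company’s recent definitive proxy statements via the SEC’s public submissions API:

```python
import requests

# The SEC asks automated clients to identify themselves via User-Agent.
HEADERS = {"User-Agent": "example-research-bot admin@example.com"}  # placeholder contact


def recent_proxy_filings(cik: str) -> list[dict]:
    """List recent DEF 14A (definitive proxy statement) filings for one company.

    `cik` is the zero-padded 10-digit Central Index Key, e.g. "0000320193".
    """
    url = f"https://data.sec.gov/submissions/CIK{cik}.json"
    data = requests.get(url, headers=HEADERS, timeout=30).json()

    recent = data["filings"]["recent"]
    filings = []
    for form, accession, doc in zip(
        recent["form"], recent["accessionNumber"], recent["primaryDocument"]
    ):
        if form == "DEF 14A":
            # Filing documents live under the EDGAR archives path.
            acc = accession.replace("-", "")
            filings.append({
                "form": form,
                "url": f"https://www.sec.gov/Archives/edgar/data/{int(cik)}/{acc}/{doc}",
            })
    return filings
```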

At Troop, we build on these trends, leveraging the latest LLM tooling to bring structure to the EDGAR database. Google’s language model offering, combined with their larger suite of machine learning and data processing products, allows us to build rich data products for our users and train special purpose models that help them make quick decisions across the thousands of companies they are invested in.

A quick primer on in-context learning

In this era of commodified access to high-dimensional language models, in-context learning is a popular alternative to training a custom model. Put simply, it means sending relevant information to the LLM alongside the prompt. This is a convenient way to integrate domain-specific data that was unavailable to the foundation model during its training, and it can significantly reduce hallucination in outputs.

Most models have context limits that restrict a single prompt to a few paragraphs of text, and high-context models are expensive to run, especially across millions of prompts. Because a single proxy statement can span hundreds of pages, we turn to a popular in-context learning pattern called Retrieval Augmented Generation (RAG), in which only the most relevant text from the document is included in the prompt. To make this possible, we split each document into chunks and embed them, so the document can be searched in natural language for relevant text.
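A stripped-down sketch of that pattern, with illustrative chunk sizes and a plain cosine-similarity search (the embedding and generation calls themselves appear in later sections):

```python
import numpy as np


def chunk(text: str, size: int = 1500, overlap: int = 200) -> list[str]:
    """Split a long filing into overlapping character windows."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def top_k_chunks(query_vec, chunk_vecs, chunks, k: int = 5) -> list[str]:
    """Return the k chunks whose embeddings are most similar to the query."""
    a = np.asarray(chunk_vecs, dtype=float)
    q = np.asarray(query_vec, dtype=float)
    sims = (a @ q) / (np.linalg.norm(a, axis=1) * np.linalg.norm(q) + 1e-9)
    best = np.argsort(-sims)[:k]
    return [chunks[i] for i in best]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble a prompt that includes only the retrieved text as context."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the excerpt below from a proxy statement.\n\n"
        f"Excerpt:\n{context}\n\nQuestion: {question}"
    )
```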

Modular architecture and Google Cloud

The LLM tooling space is evolving weekly; using the “best” tool for a job means maintaining a modular stack that you can upgrade and experiment with piecemeal. At a high level, the current in-context learning LLM stack comprises:

  • cheap storage for raw documents

  • OCR / document processing models for breaking apart large documents

  • a high-dimension embedding service

  • a vector database for storing embeddings

  • and, of course, one or more LLMs

One reason Troop prefers Google’s language model tooling is its focus on scale and production readiness. EDGAR data arrives in large bursts, and Google makes it easy to scale out resources as filings land. Usage quotas and pricing are transparent, QPS is high, and it just works. Furthermore, integration with your existing services is simple, and Google encourages using its managed services in a modular way that enables blending with external tooling.

Document storage

We leverage Google Cloud Storage for saving raw documents as blobs in a bucket because it’s cheap, reliable and easy to secure with fine-grained access control via Google Groups.
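Persisting a raw filing takes only a few lines with the google-cloud-storage client (the bucket and object names here are placeholders):

```python
from google.cloud import storage


def save_filing(pdf_bytes: bytes, object_name: str, bucket_name: str = "example-filings") -> str:
    """Write a raw filing into a GCS bucket and return its gs:// URI."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_name)
    blob.upload_from_string(pdf_bytes, content_type="application/pdf")
    return f"gs://{bucket_name}/{object_name}"
```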

Document chunking

Text embedding services require simple, clean text payloads as input, and many PDFs (especially those found in EDGAR) are formatted in ways that simple PDF-to-text extraction struggles to parse. Together, Google’s OCR and DocAI Workbench can convert complex, rich-text formatting into machine-readable text. Specifically, the DocAI Form Parser can produce readable plain text from complex source material like data tables and images, and DocAI provides a GenAI extractor to reformat the output and pull out concepts and entities automatically.
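A sketch of running a filing through a Document AI processor with the google-cloud-documentai client (the project, location, and processor ID are placeholders for a Form Parser or custom extractor configured in the Workbench):

```python
from google.cloud import documentai


def parse_filing(pdf_bytes: bytes, project: str, location: str, processor_id: str) -> str:
    """Send a PDF to a Document AI processor and return its plain-text output."""
    client = documentai.DocumentProcessorServiceClient()
    name = client.processor_path(project, location, processor_id)
    request = documentai.ProcessRequest(
        name=name,
        raw_document=documentai.RawDocument(
            content=pdf_bytes, mime_type="application/pdf"
        ),
    )
    result = client.process_document(request=request)
    return result.document.text  # machine-readable text, ready for chunking
```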

Embedding services

We have tried most of the prominent large embedding services out there (including self-hosted) and have landed on Google’s textembedding-gecko as our preferred option. The 768-dimensional vector output enables performant semantic querying while still being small enough for other use cases like semantic document clustering.
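Calling the model through the Vertex AI SDK is a short exercise (a sketch; the project and location values are placeholders):

```python
import vertexai
from vertexai.language_models import TextEmbeddingModel

vertexai.init(project="example-project", location="us-central1")  # placeholder project

model = TextEmbeddingModel.from_pretrained("textembedding-gecko")


def embed(texts: list[str]) -> list[list[float]]:
    """Embed a batch of chunk texts into 768-dimensional vectors."""
    # The service caps how many texts can be embedded per request,
    # so large corpora are sent in small batches.
    return [e.values for e in model.get_embeddings(texts)]
```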

Vector storage

After much deliberation, Troop landed on Milvus as our self-hosted vector store. We maintain a Google Kubernetes Engine cluster for the application, which has ample headroom and latent compute, especially during off-hours. Milvus separates compute from storage and supports persistence in GCS buckets, and we can accept some read latency in exchange for that storage scalability. For a few specific use cases, we load embeddings into an in-memory FAISS store for rapid back-to-back querying of smaller datasets. We’re keeping a close eye on Google’s Vertex AI Vector Search 👀…it looks promising.
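For the in-memory FAISS case, the pattern is just a handful of lines (a sketch; 768 matches the gecko embedding dimensionality, and normalizing vectors makes inner product equivalent to cosine similarity):

```python
import faiss
import numpy as np

DIM = 768  # textembedding-gecko output dimensionality


def build_index(chunk_vectors: np.ndarray) -> faiss.IndexFlatIP:
    """Build a small in-memory index over chunk embeddings (rows of a float32 matrix)."""
    vectors = np.ascontiguousarray(chunk_vectors, dtype=np.float32)
    faiss.normalize_L2(vectors)          # normalize so inner product == cosine similarity
    index = faiss.IndexFlatIP(DIM)
    index.add(vectors)
    return index


def search(index: faiss.IndexFlatIP, query_vector: np.ndarray, k: int = 5):
    """Return the row ids and similarity scores of the k nearest chunks."""
    q = np.ascontiguousarray(query_vector.reshape(1, -1), dtype=np.float32)
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k)
    return ids[0], scores[0]
```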

LLM

Ultimately, we have settled on Google’s text-bison model for our LLM queries because, content-wise, it performs on par with other mainstream models, but at impressively high QPS. Text-bison is also updated regularly, which reduces the context we need to provide in each prompt.
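Sending a retrieval-augmented prompt to text-bison through the Vertex AI SDK looks roughly like this (a sketch; project and parameter values are illustrative):

```python
import vertexai
from vertexai.language_models import TextGenerationModel

vertexai.init(project="example-project", location="us-central1")  # placeholder project

model = TextGenerationModel.from_pretrained("text-bison")


def answer(prompt: str) -> str:
    """Run a RAG prompt (question plus retrieved chunks) through text-bison."""
    response = model.predict(
        prompt,
        temperature=0.0,          # deterministic, extraction-style answers
        max_output_tokens=512,
    )
    return response.text
```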

Integrations and production readiness

Integration is where the GCP ecosystem really shines. We publish the outputs from our LLM extractions into Pub/Sub topics, from which we federate the results into a few places (a sketch of the publishing side follows this list):

  • Directly into the user-facing application layer, where we surface activism data to users

  • Into BigQuery for our internal research team to have on-the-fly access to data via Looker

  • Embedded and written back into the vector store! This lets the team conduct ad hoc queries across our results and enables rapid exploration/ideation for research and product features.
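Here is the sketch of the publishing side mentioned above (the project, topic, and record shape are placeholders); each downstream consumer subscribes to the same topic:

```python
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("example-project", "llm-extractions")  # placeholder names


def publish_extraction(record: dict) -> None:
    """Publish one structured LLM extraction for downstream consumers."""
    future = publisher.publish(topic_path, data=json.dumps(record).encode("utf-8"))
    future.result()  # block until Pub/Sub acknowledges the message
```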

Looking forward…

These models are capable of far more than just building structured databases. As demand grows for more personalized and thoughtful alternatives to the voting recommendations of dominant, entrenched third-party proxy advisors, we see opportunities for new advisory approaches. Troop is building tools that turn client and fund values into custom directives that can then be applied to the governance questions posed annually in this sea of proxy filings. Investors themselves can dictate their governance preferences at scale. We look forward to sharing this work as it develops.


