Andreessen Horowitz’s investment in Protege is sharpening focus on one of AI’s most urgent problems: access to reliable real-world data. As large language models and multimodal systems race ahead, developers are running into a hard limit. Public datasets are largely tapped out. Meanwhile, the private data that reflects how the world actually works remains locked behind legal, technical, and ethical barriers. This growing gap is slowing progress just as expectations for AI accuracy and trust are rising.
Protege is positioning itself as the infrastructure layer that turns fragmented private data into usable fuel for modern AI. The US-based startup has built a trusted data exchange that licenses datasets directly from hundreds of providers across healthcare, media, audio, and imaging. Instead of scraping or relying on questionable sources, Protege works with data owners to make access legal, structured, and repeatable. This approach gives AI companies a cleaner path to training and evaluation data while creating new revenue streams for data holders.
That model has now attracted more capital from Andreessen Horowitz, commonly known as a16z. The firm led a $30 million Series A extension for Protege, doubling down on a bet it first made last year. The extension follows a $25 million Series A raised in August 2025 and brings Protege’s total funding since its 2024 founding to $65 million. Returning investors include Footwork, CRV, and Bloomberg Beta, along with several strategic backers who see data access as the next competitive moat in AI.
The funding comes at a moment when many AI teams are discovering that better models alone are not enough. Performance gains increasingly depend on the quality, diversity, and freshness of training data. Yet acquiring that data is slow, expensive, and risky. Many organizations hold valuable datasets but lack the tools or incentives to share them responsibly. Others worry about compliance, anonymization, or misuse. Protege is designed to sit in the middle of this tension and resolve it.
The company was founded in 2024 by Bobby Samuels and Travis May after both saw firsthand how data bottlenecks can stall promising AI projects. May previously led Datavant and LiveRamp, where he worked extensively on health data exchange and privacy-first collaboration. Samuels brings operational experience in building marketplaces that balance supply, demand, and trust at scale. Together, they set out to build infrastructure that treats data providers as partners rather than raw material.
Protege’s platform aggregates licensed datasets and then applies a layer of curation that makes them usable for AI. This includes cleaning, standardization, anonymization, and formatting for both training and evaluation. The system supports cross-vertical data, ranging from de-identified health records to audio, imaging, and media archives. AI teams can access this data through streamlined workflows instead of negotiating dozens of one-off agreements.
At the same time, data providers retain control and visibility. They earn revenue shares when their data is used, and they can define how datasets are accessed and applied. This balance is central to Protege’s pitch. As Samuels has noted, demand for real-world data is growing faster than the market’s ability to supply it responsibly. Fragmentation makes it hard for both sides to operate at scale. Protege aims to act as a trusted source that lowers friction without compromising ethics or compliance.
The company’s focus also sets it apart from other players in the AI data ecosystem. While platforms like Scale AI, Snorkel AI, and Labelbox concentrate on labeling, evaluation, or workflow tooling, Protege centers on the data itself and the provider network behind it. Its value lies less in annotation and more in unlocking access to datasets that were previously unreachable or unusable for AI training.
For Andreessen Horowitz, the investment reflects a broader thesis about the next phase of AI development. As models become more capable, differentiation will shift toward who can responsibly access the world’s most valuable data. Daisy Wolf, a partner at the firm, has emphasized that real-world data across industries is complex and difficult to operationalize. Protege’s momentum signals a market shift toward platforms that can handle that complexity while meeting modern AI needs.
With the new capital, Protege plans to expand into additional domains, grow its partner network, and accelerate product development. The company is also investing in team expansion to support increasing demand from both AI builders and data providers. As regulation tightens and scrutiny around training data grows, platforms that offer clarity and compliance are likely to become even more critical.
The bet from a16z suggests that the data crunch facing AI is no longer a theoretical concern. It is a practical constraint shaping where money, talent, and infrastructure are flowing. If Protege succeeds, it could help redefine how real-world data is licensed, shared, and monetized in the AI economy, turning a long-standing bottleneck into a scalable marketplace.