California Generative Artificial Intelligence System Disclosure
For systems and services in effect as of the date of disclosure: January 1, 2026
1. Dataset Overview
(1) Sources or Owners of the Datasets
Internally collected data of Cleo AI
(2) Purpose Alignment
Improving chat functionality
(3) Dataset Size
Approximately 500,000 bank transactions
Approximately 10,000 chatbot messages
(4) Types of Data Points
Unlabelled datasets: natural language text, structured data in JSON format
Labelled datasets: transaction category, merchant name
(5) Intellectual Property Status
The datasets consist of internally collected information and may include copyrighted, trademarked, or patented material.
(6) Data Acquisition
Proprietary/internal; the datasets we use are freely obtained, not purchased or licensed.
(7) Personal Information
Categories of personal information:
Name or nickname
Purchase history
Employment data
Email address
Location data
Safeguards:
Before using data to train generative models that produce text for chatbot elements, we apply automated processes designed to remove personal information.
For models that infer bank transaction metadata, the system is designed to output the name of a single category or entity rather than free-form text. This greatly reduces the chance that unrelated personal information appears in user-facing results.
The chatbot implements user-level data isolation at inference time, so that the AI system can only retrieve data about the authenticated user for input to generative models.
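For illustration only, the second and third safeguards above could be sketched as follows. This is a minimal sketch under stated assumptions, not a description of Cleo's actual implementation: the category labels, the function names, and the in-memory `store` dictionary are all hypothetical.

```python
# Hypothetical closed set of category labels (not taken from the disclosure).
ALLOWED_CATEGORIES = frozenset({"groceries", "transport", "rent", "other"})


def constrain_to_category(raw_output: str, fallback: str = "other") -> str:
    """Accept a model output only if it is one of the allowed labels.

    Anything else, including free-form text that might contain personal
    information, is replaced by the fallback label.
    """
    label = raw_output.strip().lower()
    return label if label in ALLOWED_CATEGORIES else fallback


def records_for_user(store: dict, authenticated_user_id: str) -> list:
    """Return only the records keyed by the authenticated user's ID.

    `store` is a hypothetical mapping of user ID to that user's records;
    no other user's records are ever read.
    """
    return list(store.get(authenticated_user_id, []))
```

In this sketch, an unexpected output such as a name or address never reaches the user because it is not in the allowed set, and retrieval is keyed solely on the authenticated user's ID.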
(8) Aggregate Consumer Information
Datasets include aggregate statistics on behavior across groups of Cleo users. We apply minimum group sizes (500 users per group) to reduce the likelihood that individual users can be re-identified.
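As an illustration of how a minimum group size can be enforced, the sketch below computes a per-group average and suppresses any group with fewer than 500 distinct users. Only the 500-user threshold comes from this disclosure; the record layout, the function name, and the choice of a mean as the statistic are assumptions.

```python
from collections import defaultdict

MIN_GROUP_SIZE = 500  # threshold stated in this disclosure


def aggregate_with_min_group_size(records, min_group_size=MIN_GROUP_SIZE):
    """Compute the mean value per group, dropping small groups.

    `records` is a hypothetical iterable of (user_id, group_key, value)
    tuples. A group is reported only if it contains at least
    `min_group_size` distinct users.
    """
    users = defaultdict(set)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for user_id, group, value in records:
        users[group].add(user_id)
        totals[group] += value
        counts[group] += 1
    return {
        group: totals[group] / counts[group]
        for group in users
        if len(users[group]) >= min_group_size
    }
```

Counting distinct users rather than rows matters here: a single user with many transactions should not make a small group appear large enough to report.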
(9) Data Processing and Modifications
Depending on the model, we may filter data to prioritize high-quality examples, including based on user feedback, in-house annotation, or automated classification.
For classification tasks, we label data using large language models and/or human annotation.
For open-ended generation tasks where the generated text is shown directly to users, we apply processes designed to remove personal information.
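One simple form of automated personal-information removal is pattern-based redaction. The sketch below replaces email addresses with a placeholder before text is used for training. It is an assumption-laden illustration, not the process Cleo uses: the pattern, the placeholder, and the function name are hypothetical, and a real pipeline would need to cover more categories than email addresses.

```python
import re

# Deliberately simple email pattern; hypothetical and not exhaustive.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str, placeholder: str = "[EMAIL]") -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_PATTERN.sub(placeholder, text)
```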
(10) Data Collection Period
May 2023 - present
(11) Dataset Use Dates
First used: Oct 2025
(12) Synthetic Data
We do not currently use synthetic data for training generative AI models.