JEREMY BROWN

Fashion / E-commerce App — Case Study

— Written 2025 —

Explored a method to create network effects and deepen user engagement within an AI-enabled luxury shopping experience. My approach begins with a core assumption, then uses three guiding questions to frame product thinking.

Building an AI Assistant into Dispute Workflows

— Written 2025 —

Built and deployed an AI assistant for a Dispute Operations department. We wanted to test whether AI could speed up a legacy workflow to lower operating costs. Results showed small gains, revealed key challenges, and pointed to clear next steps.

My Role

As Product Manager for a dispute representment service, I observed rising operational costs dragging down product profitability, so I hypothesized that LLMs could create operational efficiencies if deployed in dispute operations.

I secured stakeholder alignment around testing the idea: working with Account Managers to recruit a pilot merchant, partnering with Operations to design the experiment, and ultimately driving Engineering to build the RAG pipeline with ChatGPT-o3. In addition, I oversaw data collection and analysis to quantify the impact on operator effort and writing quality, then presented findings to leadership that highlighted both the limitations and the long-term potential of AI.

Background

Verifi helps merchants fight payment disputes through the representment process — charging a flat fee for each response submitted. But the cost of delivering this service depended on how long operators spent writing responses (i.e., dispute rebuttals). Preparing one rebuttal could take an operator 20 to 45 minutes, depending on the complexity of the evidence requirements. In other words, revenue was fixed per case while labor costs could rise sharply, hurting long-term profitability.

I researched our rebuttal creation process to understand the inefficiencies. The obvious bottleneck was the drafting process, which included collecting evidence, checking SOPs, and authoring the rebuttal content itself. So, if operators could spend less time on drafting, the unit costs would fall, profitability would improve, and the managed service model would be more sustainable.

At the same time, LLMs were improving rapidly and beginning to be seen as viable in enterprise systems. So I formed a simple hypothesis: let an LLM draft a complete rebuttal and let operators focus on review and submission, reducing the average time to submit a rebuttal by at least 25%. To test this, my team built a RAG pipeline that pulled in merchant evidence and internal SOP documents, then used ChatGPT-o3 to generate a working draft from that context.

Ultimately, an alpha version of this AI-enabled workflow was deployed to Production and tested to see whether LLM use could reduce operator time per rebuttal without sacrificing the quality of the finished rebuttal.

Product

My team shipped an AI-powered feature integrated into the existing rebuttal authoring workflow. Operators could flip on the Assisted Rebuttal switch, which controlled the display of an LLM draft inside the text editor. If left off (the default), they could continue writing responses from scratch.

Visualizations are illustrative mock-ups which approximate the product UX/UI, with anonymized branding and dummy data included for confidentiality.

The Disputes table is a centralized repository for disputes imported from payment processor API feeds (Stripe, Adyen). From here, operators can begin writing rebuttals for disputes they've opened.

Behind the scenes, the system used a RAG pipeline to gather merchant evidence, pull policy guidance, and prompt ChatGPT-o3 for a working draft. This process ran automatically when a dispute was opened.
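
For illustration, here is a minimal Python sketch of how a draft-generation step like this could be wired together. The retrieval helpers, prompt wording, and model identifier are assumptions made for the example (it assumes the OpenAI Python SDK), not the production pipeline.

    # Minimal sketch of the draft-generation step behind the Assisted Rebuttal toggle.
    # Helper names, prompt wording, and the model identifier are illustrative
    # assumptions, not the production code. Assumes the OpenAI Python SDK.
    from openai import OpenAI

    client = OpenAI()

    def retrieve_evidence(dispute_id: str) -> list[str]:
        # Stand-in for the evidence retriever: order, shipping, and customer
        # communication records pulled in for the dispute.
        return [
            "Order #1042: delivered 2025-03-02, carrier confirmation attached.",
            "Customer email 2025-03-05: acknowledged receipt of the item.",
        ]

    def retrieve_sop_guidance(reason_code: str) -> list[str]:
        # Stand-in for the SOP retriever: internal policy excerpts matched to
        # the dispute's reason code (evidence requirements, template rules).
        return ["Reason code 13.1: include proof of delivery and the customer contact log."]

    def generate_rebuttal_draft(dispute_id: str, reason_code: str) -> str:
        # Assemble the retrieved context into one prompt and ask the model for a draft.
        evidence = "\n".join(retrieve_evidence(dispute_id))
        guidance = "\n".join(retrieve_sop_guidance(reason_code))
        response = client.chat.completions.create(
            model="o3",  # placeholder model identifier
            messages=[
                {"role": "system",
                 "content": "You draft payment dispute rebuttals. Follow the SOP "
                            "guidance exactly and cite only the evidence provided."},
                {"role": "user",
                 "content": f"Reason code: {reason_code}\n\n"
                            f"SOP guidance:\n{guidance}\n\n"
                            f"Merchant evidence:\n{evidence}\n\n"
                            "Write a complete rebuttal draft."},
            ],
        )
        return response.choices[0].message.content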

The text editor is used to write rebuttal content (e.g., "The cardholder's claim is invalid because..."). The alpha included an Assisted Rebuttal switch which controlled the display of the LLM draft in the text editor.

Operators toggled Assisted Rebuttal ON to work from the LLM draft. A time-tracking system captured operator effort, letting us measure the impact of AI assistance against recorded benchmarks.

Metrics

  • 100% of cases tracked with timestamps — we implemented a lightweight app to log two key events in the operator workflow: when a case was opened in the UI and when it was completed (i.e., submitted). Subtracting the two timestamps gave the total time spent per case (a simplified sketch of this calculation follows the list).
  • 22.2 and 34.7 minutes as control baselines — before running the experiment, we measured average time-to-submit for the reason codes used in the experimental group. These figures became the benchmarks for evaluation.
  • 17/20 average score for quality benchmark — we developed a 20-point rubric which graded rebuttal quality on:
    • Completeness — does the rebuttal address all required elements?
    • Alignment — does the argument reflect merchant business context and reason code?
    • Compliance — does it meet network standards?
    • Presentation — is the writing clear and professional?

    Past rebuttal submissions were first scored to set a baseline, then experimental responses were graded in the same way.
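
To make the time-tracking concrete, the simplified Python sketch below computes time per case from the two logged events. The event schema and field names are illustrative assumptions rather than the actual tracking app.

    # Simplified sketch of the time-per-case calculation from the two logged events.
    # The event schema and field names are assumptions, not the actual tracking app.
    from datetime import datetime
    from statistics import mean

    # Hypothetical event log: one "opened" and one "submitted" timestamp per case.
    events = [
        {"case_id": "D-101", "opened": "2025-04-01T09:00:00", "submitted": "2025-04-01T09:21:30"},
        {"case_id": "D-102", "opened": "2025-04-01T10:05:00", "submitted": "2025-04-01T10:41:10"},
    ]

    def minutes_per_case(event: dict) -> float:
        # Total handling time = submitted timestamp minus opened timestamp.
        opened = datetime.fromisoformat(event["opened"])
        submitted = datetime.fromisoformat(event["submitted"])
        return (submitted - opened).total_seconds() / 60

    durations = [minutes_per_case(e) for e in events]
    print(f"Average time-to-submit: {mean(durations):.1f} minutes")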

Results & Limitations

The results showed promise:

  • ~5% faster completion on average — operators leveraging AI assistance during the experiment were only slightly quicker than baseline, far below the 25% reduction target.
  • 0 measurable quality difference — blind reviews confirmed rebuttals from both groups averaged the same 17/20 score.

However, they also surfaced UX challenges and design limitations:

  • 10–15 minutes slower in some cases — outliers occurred when operators distrusted drafts or rewrote them completely.
  • 70% of AI drafts required major edits — most responses needed reformatting to match SOP templates before submission.
  • 3 recurring friction points identified — low operator trust in LLM output, irrelevant evidence in the model output (e.g., non-transactional emails), and formatting misalignment with quality standards.

Conclusion

The alpha did not produce significant efficiency gains, but it demonstrated that LLM capabilities can be embedded safely into legacy workflows. Even the modest 5% improvement points to the long-term potential of AI applications in dispute management.

Learnings also revealed specific barriers to LLM adoption among operators: low trust in LLM output, inclusion of irrelevant evidence, and formatting misalignment with SOPs. These findings pointed to areas for future improvements and validated the need for iterative testing.

While the gains were small, the experiment established a framework for further AI development and outlined a practical path toward making the service more cost-efficient and scalable.

Key Takeaways

  • "Trust" is as much a requirement as output accuracy — adoption will probably lag if users are not convinced of the tool's efficacy.
  • Discovered the importance of designing constraints, not just features — the RAG pipeline needed tighter retrieval and prompting to reduce evidence overload and formatting errors.
  • Understood that a short test window can confound results — efficiency data was muddied by operator learning curves, showing the need for longer-run trials.
  • Recognized the power of small wins — even a 5% time savings showed a pathway to future cost efficiencies and kept leadership engaged.
  • Reinforced that the path forward is iterative — improve retrieval, constrain output to SOP templates, add guardrails, and continue building toward an AI-native workflow that could eventually be sold as a standalone merchant product.

Jeremy Brown | Product Manager