Phoenix by Arize is an open-source ML observability and evaluation toolkit built for developers and data scientists who work directly in notebooks. Designed for modern AI stacks, Phoenix helps you monitor, debug, and continuously improve LLMs, computer vision models, and traditional tabular ML. With seamless integration into Jupyter, VS Code, and other Python environments, you can instrument your pipelines, explore model behavior, and inspect data quality without leaving your development workflow.

Phoenix unifies traces, predictions, embeddings, and metadata into an interactive workspace, making it easier to detect drift, surface failure patterns, and understand model performance across slices. For LLMs, Phoenix provides evaluation workflows for prompts, responses, and tools, including quality, safety, and hallucination analysis. For CV and tabular models, it offers visualizations and metrics to diagnose issues like distribution shift, label leakage, or underperforming cohorts.

Because it is open source, Phoenix fits naturally into existing MLOps stacks and can be embedded in CI/CD pipelines, experiment tracking, and production monitoring. Teams can collaborate on shared dashboards, compare model versions, and run what-if analyses before and after deployment. Whether you are launching a new LLM application, maintaining a mature recommendation system, or consolidating observability across heterogeneous models, Phoenix gives you a consistent, developer-friendly way to understand how your models behave in the real world and how to make them better.
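A minimal quickstart might look like the following, assuming the open-source `arize-phoenix` package is installed (`pip install arize-phoenix`):

```python
import phoenix as px

# Launch the Phoenix app locally; it serves the interactive UI
# (by default at http://localhost:6006).
session = px.launch_app()
print(session.url)  # open this URL to explore traces, datasets, and evals
```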
Monitor LLM-powered applications to track response quality, latency, and failure patterns directly from your development notebooks.
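One way to wire this up is Phoenix's OpenTelemetry helper plus an OpenInference auto-instrumentor. The sketch below assumes an OpenAI-backed app with the `openinference-instrumentation-openai` package installed and a Phoenix instance already running:

```python
import openai
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register

# Export spans to the locally running Phoenix instance.
tracer_provider = register(project_name="my-llm-app")

# Auto-instrument the OpenAI client: every call now emits spans carrying
# latency, token counts, and inputs/outputs, with no changes to app code.
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

client = openai.OpenAI()
client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello, Phoenix!"}],
)
# The request now appears in the Phoenix UI, where latency, errors, and
# response quality can be inspected per trace and per span.
```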
Analyze production drift for computer vision or tabular models and identify which data slices are degrading over time.
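A sketch of that workflow using Phoenix's inferences API (`px.Schema` and `px.Inferences`; older releases call the latter `px.Dataset`), with hypothetical toy dataframes standing in for the training baseline and a production window:

```python
import pandas as pd
import phoenix as px

# Hypothetical stand-ins for a training baseline and a production window;
# in practice these would be your real prediction logs.
train_df = pd.DataFrame({
    "prediction_ts": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "age": [34, 51],
    "income": [52_000, 88_000],
    "tenure_months": [12, 40],
    "predicted_label": ["retain", "churn"],
    "actual_label": ["retain", "churn"],
})
prod_df = train_df.assign(prediction_ts=pd.to_datetime(["2024-06-01", "2024-06-02"]))

# One schema describes both dataframes: which columns are features,
# predictions, actuals, and timestamps.
schema = px.Schema(
    timestamp_column_name="prediction_ts",
    prediction_label_column_name="predicted_label",
    actual_label_column_name="actual_label",
    feature_column_names=["age", "income", "tenure_months"],
)

primary = px.Inferences(dataframe=prod_df, schema=schema, name="production")
reference = px.Inferences(dataframe=train_df, schema=schema, name="training")

# Launching with a primary and a reference dataset overlays their
# distributions, so drifting features and degrading slices stand out.
session = px.launch_app(primary=primary, reference=reference)
```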
Run side-by-side model comparisons to evaluate new model or prompt versions before rolling them out to production.
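One way to structure such a comparison with `phoenix.experiments` is sketched below; the dataset name, the `my_app` entry point, and the `exact_match` evaluator are hypothetical stand-ins for your own application and metric:

```python
import phoenix as px
from phoenix.experiments import run_experiment

# A labeled dataset previously uploaded to Phoenix (name is hypothetical).
dataset = px.Client().get_dataset(name="qa-golden-set")

def task_v1(example):
    # Hypothetical: call the current prompt/model version.
    return my_app(example.input["question"], prompt_version="v1")

def task_v2(example):
    # Hypothetical: call the candidate prompt/model version.
    return my_app(example.input["question"], prompt_version="v2")

def exact_match(output, expected):
    # Simple evaluator: 1.0 when the output matches the labeled answer.
    return float(output == expected["answer"])

# Each run becomes an experiment; the Phoenix UI then shows the two
# experiments side by side over the same examples.
run_experiment(dataset, task_v1, evaluators=[exact_match], experiment_name="prompt-v1")
run_experiment(dataset, task_v2, evaluators=[exact_match], experiment_name="prompt-v2")
```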
Debug hallucinations, unsafe outputs, or broken tool calls in complex LLM chains using trace and span-level insights.
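A sketch of that debugging loop with `phoenix.evals`: pull LLM spans into a dataframe and score them with an LLM-as-judge hallucination evaluator. The attribute column names are assumptions that depend on how the app was instrumented, and the judge model choice is likewise an assumption:

```python
import pandas as pd
import phoenix as px
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Export the LLM spans that Phoenix has collected as a dataframe.
spans_df = px.Client().get_spans_dataframe("span_kind == 'LLM'")

# The hallucination template expects "input", "reference", and "output"
# columns; the span attribute columns referenced here are assumptions.
eval_df = pd.DataFrame({
    "input": spans_df["attributes.input.value"],
    "output": spans_df["attributes.output.value"],
    "reference": spans_df.get("attributes.retrieval.documents", ""),
})

results = llm_classify(
    dataframe=eval_df,
    model=OpenAIModel(model="gpt-4o"),  # judge model; requires OPENAI_API_KEY
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,  # keep the judge's reasoning for debugging
)
print(results["label"].value_counts())  # factual vs. hallucinated
```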
Build reproducible evaluation workflows that plug into CI/CD so every new model release is automatically checked.
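As a sketch, such a gate can be a plain pytest check that scores a small golden set with `phoenix.evals` and fails the build below a threshold; the `tests/golden.csv` path, the `my_app` call, and the 0.90 bar are all hypothetical:

```python
import pandas as pd
from phoenix.evals import (
    QA_PROMPT_RAILS_MAP,
    QA_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

def test_release_quality():
    # Hypothetical golden set with "question" and "answer" columns.
    golden = pd.read_csv("tests/golden.csv")

    # The QA correctness template expects "input", "output", "reference".
    eval_df = pd.DataFrame({
        "input": golden["question"],
        "output": [my_app(q) for q in golden["question"]],  # hypothetical app call
        "reference": golden["answer"],
    })

    results = llm_classify(
        dataframe=eval_df,
        model=OpenAIModel(model="gpt-4o"),
        template=QA_PROMPT_TEMPLATE,
        rails=list(QA_PROMPT_RAILS_MAP.values()),
    )

    accuracy = (results["label"] == "correct").mean()
    assert accuracy >= 0.90, f"QA accuracy {accuracy:.2f} is below the 0.90 release bar"
```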