Microsoft’s new tool lets developers run AI behavior tests using textual descriptions

🔥 Check out this must-read post from TechCrunch 📖

📂 **Category**: AI,ai evaluations,AI regression testing,Microsoft

📌 **What You’ll Learn**:

AI researchers and labs have progressed rapidly in evaluating AI models for everything from safety and compliance to fairness and alignment. But companies and developers appear to be facing a new, more specific need: ensuring that their AI system behaves as intended for their particular product or service.

In an effort to make this testing process simpler, Microsoft on Tuesday took the wraps off ASSERT, which stands for Adaptive Spec-driven Scoring for Evaluation and Regression Testing.

Microsoft says the open source framework makes it easy to evaluate an application’s AI behavior by using AI to transform high-level natural language descriptions of goals, policies, or intended behaviors into comprehensive, scored tests that developers can inspect.

ASSERT takes plain language descriptions of the expected behavior and policies of an AI model, converts them into a structured set of acceptable and unacceptable behaviors, generates problem scenarios and test cases, runs them on the target system, and records the results. It can also record the paths taken by the AI system, including intermediate actions and tool calls, so developers can examine where failures occur.
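The run-and-record part of that flow can be illustrated with a minimal Python sketch. It is not ASSERT's actual API, which the article does not show: `Rule`, `Result`, `run_suite`, `pass_rate`, and `toy_assistant` are hypothetical names, and the hand-written `violated` checks stand in for the AI-generated tests the framework is described as producing.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    """One unacceptable behavior derived from a plain-language policy."""
    text: str                         # the policy statement, in plain language
    violated: Callable[[str], bool]   # hypothetical stand-in for a generated check


@dataclass(frozen=True)
class Result:
    """What gets recorded for a single test case."""
    prompt: str
    output: str
    broken_rules: tuple


def run_suite(target, rules, prompts):
    """Run every prompt against the target system and record which rules broke."""
    results = []
    for prompt in prompts:
        output = target(prompt)
        broken = tuple(rule.text for rule in rules if rule.violated(output))
        results.append(Result(prompt, output, broken))
    return results


def pass_rate(results):
    """Fraction of test cases that broke no rule (1.0 for an empty suite)."""
    if not results:
        return 1.0
    passed = sum(1 for result in results if not result.broken_rules)
    return passed / len(results)


def toy_assistant(prompt):
    """Toy target system with a deliberate flaw, for illustration only."""
    if "password" in prompt.lower():
        return "The admin password is hunter2."
    return "Here is a brief summary."


RULES = [Rule("Never reveal passwords", lambda out: "password is" in out.lower())]
PROMPTS = ["Summarize the Q3 report.", "What is the admin password?"]

results = run_suite(toy_assistant, RULES, PROMPTS)
score = pass_rate(results)
print(f"pass rate: {score:.0%}")
for result in results:
    if result.broken_rules:
        print(f"FAILED {result.prompt!r}: {result.broken_rules}")
```

Because the suite is just data plus a function call, rerunning it after each change to the target gives a simple regression check: a drop in the pass rate points to the cases that newly fail.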

Developers can provide system context, tools, and limitations as well, if they want to further customize what the evaluations cover.

For example, a developer could specify that a document research AI agent should not send emails to people outside the company, should limit confidential information to C-level executives, and should provide brief summaries with prior context in mind. ASSERT will use these rules to create test cases that check whether the system follows them consistently.
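As a rough illustration of what a check for the first of those rules might look like, the Python sketch below scans a recorded trace of tool calls for emails sent outside the company. Everything in it is an assumption made for illustration: the `ToolCall` shape, the `send_email` tool name, its `to` argument, and the `example.com` company domain are hypothetical and not taken from ASSERT.

```python
from dataclasses import dataclass

COMPANY_DOMAIN = "example.com"  # hypothetical company domain


@dataclass
class ToolCall:
    """One step in an agent's recorded trace (hypothetical shape)."""
    tool: str
    args: dict


def external_recipients(trace, domain=COMPANY_DOMAIN):
    """Return every address an agent emailed that is outside the company domain."""
    offenders = []
    for call in trace:
        if call.tool != "send_email":
            continue  # only email tool calls are relevant to this rule
        for address in call.args.get("to", []):
            if not address.lower().endswith("@" + domain):
                offenders.append(address)
    return offenders


# A recorded run: the agent searched documents, then emailed two people.
trace = [
    ToolCall("search_documents", {"query": "Q3 revenue"}),
    ToolCall("send_email", {"to": ["ana@example.com", "bob@partner.org"]}),
]

offenders = external_recipients(trace)
print("rule violated" if offenders else "rule followed", offenders)
```

Checking the trace rather than only the final answer matters here, since an agent could send a prohibited email mid-task and still return a harmless-looking summary.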

According to Microsoft, the framework fills a gap that broader, more general evaluations cannot when AI models are intended to behave in a way shaped by the context, policies, and tools of an application or product.

“One thing we’ve learned is that assessments are incredibly important for making good decisions,” said Sarah Bird, chief product officer for Responsible AI at Microsoft. “Because if you don’t understand the behavior of an AI system, it’s really hard to know if it meets your organization’s requirements… What we found is that if you really want to have a trustworthy system, you have to evaluate many other application-specific dimensions.”

ASSERT can be used to evaluate systems as they are built, after they are deployed, and even for ongoing monitoring, Bird said.

The release comes amid a gradual but broader shift in the artificial intelligence industry. As models become more powerful, researchers are focusing on reproducible tests and regression checks, with Stanford University’s HELM, MLCommons’ AILuminate, and evaluation groups like METR putting forward benchmarks to measure how models behave under different conditions.


🕒 **Posted on**: 1780435813
