When Dorothy, Toto, the Scarecrow, the Tin Woodman, and the Cowardly Lion set off for the Emerald City [spoiler alert: the "wizard" is just a man behind a curtain], L. Frank Baum could hardly have imagined that the world of product innovation would one day find an analogy in his story.
And yet, our industry has found it and uses it (allow us to celebrate our nerdy side here) to test all kinds of things that don't exist yet. We used it extensively when we built our own proprietary product: Textractive, a natural language processing (NLP)-led application that summarizes meetings, intuitively extracts action items, and lists them for follow-up.
Zappos' founder would buy the shoes from a local store only after an order came in. Stripe processed credit card payments manually on the back end in its early days. Decades ago, IBM's speech-to-text experiment was a couple of people with incredible typing speeds in a back room, hidden from view. Aardvark did something similar as well.
Why does one need to test by creating an illusion? Why do you need Wizard of Oz testing at all? Simply put, artificial intelligence, machine learning (ML), augmented reality, and NLP-led product initiatives are expensive, require multiple iterations, and are time-consuming. They involve complex development cycles and serious engineering expertise. So before any entrepreneur goes down that route, the only smart move is to validate the idea's value to customers before investing time, effort, and money.
Wizard of Oz testing lets you test the problem-solution proposition in one fell swoop, in a world where 35% of start-ups fail due to lack of demand and 8% fail due to flawed products.
Wizard of Oz testing simulates an experience for the user as if it were real, but the backend system is actually operated by a person, and answers are human-generated rather than algorithm-generated.
For example, and we are oversimplifying this immensely: on the front end, a user wants to see the most frequently bought moisturizer among buyers of the serum they just purchased. Instead of an algorithm making the Application Programming Interface (API) call and a recommendation engine populating the answer, an actual person feeds the answer back to the user based on predefined rules.
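To make the "predefined rules" concrete, here is a minimal sketch of the kind of deterministic co-purchase rule a human operator could follow by hand behind the curtain. The product names and purchase data are entirely hypothetical, invented for illustration:

```python
from collections import Counter

# Hypothetical purchase history: each set is one customer's basket.
PURCHASE_HISTORY = [
    {"serum-a", "moisturizer-x"},
    {"serum-a", "moisturizer-y"},
    {"serum-a", "moisturizer-x", "cleanser-z"},
    {"serum-b", "moisturizer-y"},
]

def most_frequently_bought_with(item, category_prefix):
    """Count co-purchases of `item`; return the top match in a category."""
    counts = Counter()
    for basket in PURCHASE_HISTORY:
        if item in basket:
            for other in basket:
                if other != item and other.startswith(category_prefix):
                    counts[other] += 1
    return counts.most_common(1)[0][0] if counts else None

print(most_frequently_bought_with("serum-a", "moisturizer"))  # moisturizer-x
```

The point of writing the rule down this explicitly is that the human "wizard" and the eventual recommendation engine produce the same answer for the same data, so the test measures the value of the feature, not the talent of the operator.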
The most important questions to ask and answer are:
Wizard of Oz testing is best employed when you need to verify the functionality of a product feature or capability and see whether it has value to the user. It's ideal when the build itself, even a pared-down version of the offering, is expensive, requires specific expertise, and takes a long time.
It is the ultimate way to test ideas, and the value they will have for a buyer or user, where traditional wireframing and high-fidelity mock-ups just won't cut it.
For many situations, a wireframe is sufficient. We would not recommend this kind of testing investment for simple user journey elements, but we insist on it for an Artificial Intelligence (AI) algorithm, a sensor-led solution, an NLP-driven platform, and the like. In short: anywhere the technical investment is significant and the outcomes are unknown.
The most informative results come when the user doesn't know about the test. However, in most jurisdictions we must ensure data transparency: share exact details on what data is used, how, and where, without giving the test away. In the healthcare, security, and FinTech industries, this can get especially tricky.
While setting up a Wizard of Oz test is an exercise in learning, the last thing you want is your users and buyers walking away feeling used. And Theranos hasn't done a world of skeptics any favors with its fast one.
Machines are doing a really good job of imitating humans these days, but it is surprisingly difficult to train a human to imitate a machine. For a Wizard of Oz test to give you consistent data, the human needs to follow machine logic in most circumstances, so you will need to write incredibly detailed instructions for the test to succeed and yield the data you are after.
You need to have a clear understanding of how the product will operate, and what is technically feasible, and then train humans to provide this information in the same way that you architected it.
In general, you overcome the first two hurdles by consulting with counsel and being transparent. If you are in a highly regulated market with low trust, go over the details of what you propose to do and, if you can, fill the users in while you are providing the service. They are the lucky recipients of an early version: they get the same service, plus a bit of hand-holding while some things are still being worked out. For the last hurdle, feasibility, hire experts who have done it before and can create the right setup. It may have become harder or easier with remote environments, depending on whom you ask.
It can take months (if not years) to train a machine learning algorithm with data, and most successful entrepreneurs we've encountered are unwilling to pour in that kind of investment, or wait that long, without some validation first.
The product we were building, in this case, was an NLP solution that summarizes phone calls, takes notes, and detects intent to complete a financial transaction.
Humans are better at summarizing human speech and detecting intent than computers are (tone being an incredible giveaway). So the trick with this one was to work with data scientists, data engineers, and natural language processing gurus (all of whom are at Zemoso) to develop a human protocol that mimicked the proposed machine protocol.
It was decided that the key components of this algorithm would be 1) transcription, 2) summarization, and 3) action items. We would deliver the service via email, with a two-minute delay.
For transcription, we used readily available transcription software and copy-pasted its output into the email in a predefined format. For the summary, we copy-pasted key sentences directly from the transcription, taking out pauses. For action items, we looked at what each party agreed to do, copied those portions into the action items section of the email, and provided scores for higher- and lower-intent conversations.
The key here was to keep it to something a machine could feasibly do, using keyword cues around verb usage, time commitments, and so on. We deliberately ignored tone, intonation, and the like, because the first version of the Minimum Viable Product (MVP) wouldn't be able to account for them.
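The protocol above can be sketched as code. This is not the actual Zemoso protocol; it is an illustrative sketch of the kind of deterministic keyword rules the human operators could follow, so that their output matched what a first-version algorithm could plausibly produce. The verb lists, time cues, and sample transcript are all invented:

```python
import re

# Assumed cue lists -- illustrative, not the real protocol's vocabulary.
ACTION_VERBS = {"send", "schedule", "review", "sign", "call", "pay"}
TIME_CUES = re.compile(r"\b(today|tomorrow|by \w+|next week|on \w+day)\b", re.I)
INTENT_CUES = {"purchase", "buy", "invest", "transfer", "sign up"}

def extract_action_items(transcript_sentences):
    """Flag sentences containing an action verb plus a time commitment."""
    items = []
    for sentence in transcript_sentences:
        words = {w.lower().strip(".,") for w in sentence.split()}
        if words & ACTION_VERBS and TIME_CUES.search(sentence):
            items.append(sentence)
    return items

def intent_score(transcript_sentences):
    """Crude transaction-intent score: count sentences with intent keywords."""
    return sum(
        any(cue in s.lower() for cue in INTENT_CUES) for s in transcript_sentences
    )

call = [
    "I will send the contract by Friday.",
    "The weather was nice last week.",
    "We plan to purchase the premium tier.",
]
print(extract_action_items(call))  # ["I will send the contract by Friday."]
print(intent_score(call))          # 1
```

Note what the rules deliberately leave out: tone and intonation are never consulted, exactly because the first MVP wouldn't have access to them either.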
This Wizard of Oz setup tested well on value parameters for usage and on monetization benchmarks for buyers. The product is now the core of a billion-dollar offering. Though it was significantly refined later, that early validation was what these startup visionaries needed to double down on it.
The healthcare industry, and by extension the related insurance industry, is highly regulated. Trust and compliance are keys to success in this market, and several apparent Wizard of Oz testing setups have made the news (and even some movies) that betrayed customer and industry trust. Therefore, we're exceptionally conscious of that when setting up testing scenarios for the healthcare industry.
This is a solution for businesses looking to offer benefits digitally, which employees can use in their everyday lives to stay healthier and reduce costs. This entrepreneurial team has relationships with insurance providers, and data handling was set up in a way to ensure privacy and compliance with Health Insurance Portability and Accountability Act (HIPAA) regulations.
It was decided that the key components of this platform would be: identifying lower-cost, vetted, and highly rated providers near the employee's address; letting employees track and earn rewards for meeting health goals; and one-on-one work with a health coach.
This experiment was run with one large office of one large company. Employees were told that this was a new service the company was evaluating, and that it would be provided only on an opt-in basis.
Employees were provided with a human-researched list of providers in their area, indexed by services, reviews, office hours, and distance from the office. They were provided with a spreadsheet, made with a popular spreadsheet tool, where they could input their goals and track their progress towards the goal, and had weekly meetings with a certified health coach.
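The provider list above can be mimicked with a simple ranking rule, which is what makes it a good Wizard of Oz candidate: the human researchers and a future matching engine would sort the same way. A minimal sketch, with entirely hypothetical provider data and an assumed tie-break of rating first, then distance:

```python
# Hypothetical provider directory compiled by human researchers.
providers = [
    {"name": "Clinic A", "service": "physio", "rating": 4.8, "miles": 2.1},
    {"name": "Clinic B", "service": "physio", "rating": 4.8, "miles": 0.7},
    {"name": "Clinic C", "service": "physio", "rating": 4.2, "miles": 0.3},
]

def rank_providers(providers, service):
    """Return providers for one service, best-rated and nearest first."""
    matches = [p for p in providers if p["service"] == service]
    return sorted(matches, key=lambda p: (-p["rating"], p["miles"]))

print([p["name"] for p in rank_providers(providers, "physio")])
# ["Clinic B", "Clinic A", "Clinic C"]
```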
Based on the test results, businesses that took this approach could save $100 million annually on healthcare. Now that this is a product, the actual savings are considerably more.
The financial services industry is another highly regulated industry. We mocked a technical solution to get an idea about market and user viability.
The idea here was a buy-now, pay-later solution for certain high-spend purchases, where acceptance would be decided by a machine learning algorithm. It was decided that the key components of this platform would be a simple application process with an algorithm making the final decision, and an easy interface for users to input the information that would determine the result.
Success would be measured by a business's time to close a deal, the increase in deal closures, and successful payback completions. Initially, guardrails were put in place to limit risk, such as a maximum deal amount.
We created a questionnaire that asked the applicant standard financial questions. Users knew that an actual person would look at their data and that it would be stored and handled in compliance with privacy and industry regulations. Once the form was submitted, a human reviewed the data and decided whether to approve the application, using the same criteria the algorithm would eventually use, and sent a text within five minutes with a go/no-go decision.
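A decision checklist like the one the human reviewer followed might look like the sketch below. Every threshold here is made up for illustration; the real criteria were proprietary. What matters is that the rules are deterministic, so the human applies them exactly as the eventual model would:

```python
# Assumed guardrail from the test setup; the real figure is not public.
MAX_DEAL_AMOUNT = 25_000

def approve_application(app):
    """Go / no-go decision from a fixed, deterministic checklist."""
    if app["deal_amount"] > MAX_DEAL_AMOUNT:   # guardrail: cap exposure
        return False
    if app["years_in_business"] < 2:           # minimum track record
        return False
    debt_ratio = app["monthly_debt"] / max(app["monthly_revenue"], 1)
    return debt_ratio < 0.4                    # affordability check

application = {
    "deal_amount": 12_000,
    "years_in_business": 5,
    "monthly_debt": 3_000,
    "monthly_revenue": 20_000,
}
print(approve_application(application))  # True
```

Because the checklist is mechanical, the five-minute human turnaround felt like an automated decision to applicants while the team collected real demand data.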
The business received the full amount for the deal upfront. The test ultimately proved that plenty of businesses were willing to pay for this. In fact, many of the Wizard of Oz test participants became early adopters.
Wizard of Oz testing is tricky to get right, but when done well, it answers important questions about the business and financial viability of an idea while limiting risk. It is best applied when the cost of being wrong is too high and the investment in time or money is extensive.
We generally recommend it for technology products that involve a machine learning algorithm that needs to be trained on extensive data sets to mimic a human, and for platforms that need to integrate with a bevy of other products. Designing such an experiment takes significant technical knowledge: you need to know how to mimic a technology product convincingly, how to set up the experiment so the data tells you whether to make a significant investment, and when to hide or show the curtain and how that choice will affect the data you collect.
©2023 Zemoso Technologies. All rights reserved.