10 weeks to fully evaluate client's beta

The client hired us for 10 weeks to conduct a user experience evaluation of the beta version of their app before release. I worked directly with the client as the project lead and user experience researcher and designer.

Project kick-off

At the project's onset, I had several meetings with the client team to set expectations and chart a strategic path forward. I used Miro to lead the group through discussions on Zoom so the client could have a visual point of reference while I took notes. After these initial meetings, I had a set of simple artifacts that I then used throughout the project to build out user tests and to ensure we were staying on target.
An example of an "artifact" created during early kick-off meetings with the client team.

Testing

After kick-off meetings, I got to work planning and running my first tests for the project. For any test, there are two variables that are always in tension: (1) wanting to get as much learning out of every test as possible, and (2) needing to be aware of participants' limited energy and attention spans.

To start, I decide on an area of the app or a specific task flow that will be the focus of the test. I pull together all the learning goals that are relevant, and as I do, the flow of the test starts shaping up in my mind.

Once I'm actually writing the test, I start to work with that tension: I have so many questions I'd like to ask, but optimizing for the participants' energy and attention spans is just as important. If I'm worried the test is getting too long, I look at where I can go deeper, instead of broader.

Over the 10-week project, I ran a total of 19 tests (plus an informal heuristic analysis), which included:

A mixture of moderated and unmoderated task-based and/or desirability tests
Quantitative impression tests
One round of interviews
By the end of the project, we had received feedback from about 320 participants.
As part of kick-off conversations, the group brainstormed together what we hoped to learn about the product's user experience. After that session, I broke up all the learning goals into focus areas of the app (part of that work is shown here), which then served as a scaffolding for all testing.

Presenting findings

Once the testing was complete, I listened to recordings and took quick notes on post-its (using the online tool Miro). I then grouped the post-its around the learning goals they addressed and began synthesizing findings. Each week, after running 1-3 tests and synthesizing the learnings, I met with the client for about an hour to discuss what we learned, the implications of those learnings, and how the designs could be improved to better meet user needs.
This is an example of how I presented findings to the client team. Learning goals are in purple/blue and notes from tests are in yellow. If a verbatim or insight was especially helpful in understanding the user's mindset, I marked it with a little red heart icon.

Some learnings

Following is an overview of a handful of findings from the 19 tests run during the project, with corresponding changes made to the product.

Learning 1: Onboarding centers business needs at the expense of user needs.

The client wanted to use onboarding as a way to gather data on the app's users, but test participants were put off by the number of questions posed during sign-up. Some of the questions, such as those asking for first and last name, date of birth, and city and state, also caused discomfort.

After some discussion, the client and I decided to limit onboarding questions to only those required to run an assessment. There was, however, one important exception to this rule. To use the app's basic assessment, legal sex is a required data point. Some users expected a question about gender identity to accompany the question about legal sex, as a way of acknowledging and affirming the identities of those whose legal sex is incongruent with their gender identity. The client and I concluded that this was important enough to justify one additional question in the onboarding process.

When test participants were presented with a much shorter and simpler onboarding process, they no longer complained about, or even seemed to notice, the length of onboarding or the types of questions posed.
These are some participant verbatims about the onboarding process before changes were made.

Learning 2: Users are confused when trying to interpret results after completing an assessment.

At the outset, the assessment results screen was a source of many issues during testing. The screen is intended to communicate possible medical conditions that match the user's reported symptoms. The client was hesitant to use a numerical score (i.e., a percentage match), so the original UI featured visual elements meant to convey the concept without numbers.

However, through testing, we quickly learned that users kept getting tripped up by the visuals. When we tested the same screens but with numerical measures instead of the visual elements, users were able to decipher the intended meaning right away.

Even after we had tested and validated that the percentage match scores made the designs easier to understand, the client team was still concerned that a numerical measure, like a percentage, would lead users to place undue trust in the app. To be specific: would a high percentage match (e.g., 99%) cause users to take the app too seriously? Could this become a liability?

To see if the client's fears were warranted, I tested a number of variations of the same screen with different percentage matches ranging from 75% all the way up to 100%. From these tests it became clear that users are not in danger of using the app to replace a doctor; they know the app is not a human professional and therefore cannot provide a true diagnosis.

The only issue that arose with numbered percentages was when the app showed a 100% match. This actually caused users to distrust the app, saying things like, "There's no way it could actually be 100% certain."

Ultimately, the client team and I agreed on the best path forward: implement the percentage match in the final designs, do away with the visual symbols, and never display a 100% match.
This is an earlier design of the assessment results screen, showing a visual instead of a quantitative measure.
This is the final design for the assessment results screen.

Learning 3: Users notice the lack of nuance in an assessment by the app, as opposed to an assessment by a doctor.

Once test participants were able to quickly decipher the meaning of the results screen, they were curious for more. During tests, participants would say things like, "Does it think I have bronchitis because I said my cough lasted 1-2 days instead of 2-3 days? I want to go back and change my answer to that question."

However, the original app did not allow for this behavior because the client hadn't been aware of this user need.

Since we simply could not allow for the same level of nuance in the app as users experience when visiting a real human doctor, we started to think about how we could give the user more autonomy in a different way. Our approach was two-fold: 

1. Allow the users to see how the app came up with the results. For example, which responses and/or symptoms the algorithm prioritized over others to come up with a possible illness or condition.

2. Allow the users to change just one response at a time, instead of re-doing the entire assessment, and compare the app's results after each change.

Once we had a mock-up of a screen that allowed for these additional affordances, I returned to testing. The response was overwhelmingly positive.

From the participants' perspective, the app's willingness to show this information was an act of goodwill and transparency, and a significant differentiator from competitors such as WebMD. Of course, it's not the same as seeing a doctor, but in some ways it provided users with a workable replacement.
This is the final design for the screen demonstrating how the app calculates a match strength percentage.

Learning 4: Users are unsure how to interpret and use the symptom tracking area of the app.

The tracking feature of the app caused a good deal of confusion during user testing. The feature allows a user to set a reminder in the app, which then prompts them to rate their symptoms on a 1-10 severity scale, with 10 being the most severe.

During testing, participants were confused, first, about the records of past ratings. For example, they often incorrectly assumed that a symptom severity rating (for example, a 7/10) indicated the number of times the symptom was experienced that day (for example, 7 times). We quickly adjusted the design to more clearly indicate that these were ratings of symptom severity, not counts of incidences.

But more interestingly, we were able to understand through user testing that participants could see many more uses for the feature than the client had expected. Participants wanted to use the tracking feature to also track things such as medications and/or multi-vitamins taken, which remedies were effective and which made no impact, and fevers as they change over time (to the tenth of a degree).

To allow for the many uses that had not been foreseen before development, the client team and I decided on a highly feasible interim solution: a text field where users could add notes alongside every tracking entry. When future development allows for more changes to the app, the client plans to return to these user needs and build in more specific affordances to address them directly.
This is the original tracking screen from the MVP, before changes were made, which caused confusion and frustration for test participants.
This was one variation of a prototype made to address some of the issues with the earlier design. The final design is a similar iteration of this one.

Handing off

After 10 weeks of planning, testing, iterating and re-testing, the client team had a lot of work to do to develop and implement many improvements to their beta. For the last presentation to the team, I laid out all of the major findings and corresponding changes we had recommended.
This is one of the many slides I put together for the client's future reference, detailing insights from user testing and corresponding recommendations for additional changes.

Final thoughts

In addition to those findings, I included some user needs that, if addressed, could help the client team move from its transaction-focused beta (that users think of when they're feeling ill) to a more relational, longer-term experience (that users could use to manage their health over time). These included the following:
Add affordances for users to track supplements and vitamins, exercise, and sleep
Give users an easy way to share information with doctors and other health professionals
Serve up personalized suggestions, whether supplements, exercise routines, or remedies, and let users report back on what they've tried and what had positive effects
Explore providing highly personalized guidance on staying healthy and taking a proactive stance on lifelong well-being
After our 10 weeks of work together, the client was ready to move from our recommendations into development and paused our engagement for the time being. Once some of the development is complete, the client plans to return to testing and continue making user-centered improvements.
