Clarification & Understanding
What is the greater context and our motivation for doing this? Is there some org-wide strategic initiative we need to align with, or a specific goal? Let’s just focus on providing a really good experience to our users.
Is this a standalone offering or an extension of something Google currently offers? It can be either. What do we mean when we say app? Is there a specific type of app we should focus on: desktop, mobile, or web? If it’s up to us, let’s approach this as platform-agnostic, but I could see us leaning towards a mobile app since a lot of photo capturing, viewing, and sharing is done on mobile devices.
User Segments
This problem is still pretty ambiguous, so let’s try to break it down further by looking at some of the different user segments it could apply to. It is worth noting that there are varying levels of blindness:
- Completely Blind – Can’t see anything at all; effectively pitch black
- Partially Blind – Can still see shapes and colors
- Nearsighted – Functionally blind to objects far away, but may still be able to read text via a screen reader
- Not Blind – While the core users we are designing for have some level of blindness, sighted users, particularly friends and family, may still use the app to interact with those blind users.
Out of the above user segments, I’m going to suggest we focus on the completely blind. This is the most severe form of blindness, and if our solution works for someone who is completely blind it will also work for someone who is partially blind, whereas the opposite isn’t necessarily true.
User Needs / Pain Points
- Blind people have no way of knowing whether they successfully took a clear photo of what they were trying to capture
- Blind people need to know what is in the photo they are looking at
- Blind people need to know if their photo has been shared successfully
- Blind people can’t thumb or scroll through a photo album to find the photo they are looking for, so finding a specific photo is difficult
Out of the above pain points, I’m going to suggest we focus on pain point #2. This pain point is really central to the experience of a photo app: if users can’t understand what a photo shows, then we fundamentally aren’t providing a good experience.
Solutions
Now that we have a better understanding of some of the issues the user faces, let’s go ahead and brainstorm some solutions to help blind users understand what a photo is of:
- Description Prompt – When users share or send photos to a blind user of Google Photos, we could prompt them to type a short description of the image. When the visually impaired user receives the photo, or views it at a later date, the description would be read aloud.
- Audio Companion Files – When a user takes a photo, we could allow them to record an audio description of what they are photographing, which would then be attached to the photo.
- AI Descriptions – We could use machine learning to automatically analyze the contents of a photo and generate a verbal description that would then be read aloud to the user (a rough sketch of this flow follows this list).
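To make that last option concrete, here is a minimal sketch of how the AI Descriptions flow might hang together, assuming a generic image-captioning model and a text-to-speech hook. The names `caption_model`, `text_to_speech`, and `on_photo_viewed` are hypothetical stand-ins for illustration, not existing Google Photos APIs.

```python
from dataclasses import dataclass


@dataclass
class Photo:
    path: str
    caption: str | None = None  # cached description, generated once per photo


def caption_model(path: str) -> str:
    """Hypothetical stand-in for a trained image-captioning model."""
    return "A golden retriever catching a frisbee on a beach."


def text_to_speech(text: str) -> None:
    """Hypothetical stand-in for the platform TTS / screen-reader hook."""
    print(f"[spoken] {text}")


def on_photo_viewed(photo: Photo) -> None:
    # Generate the description lazily and cache it so repeat views are instant.
    if photo.caption is None:
        photo.caption = caption_model(photo.path)
    text_to_speech(photo.caption)


on_photo_viewed(Photo("beach_day.jpg"))
```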
Prioritization
I’m assuming we won’t have the bandwidth to build out all three solutions in parallel, so let’s choose one to prioritize. Remember, our goal here is to provide a really good experience to our users. To help us do this, I’m going to use the comparison matrix below:
| Solution | Ease of Implementation | User Satisfaction |
| --- | --- | --- |
| 1. Description Prompt | A | B |
| 2. Audio Companion Files | B | B- |
| 3. AI Descriptions | C+ | A |
I’m going to suggest we prioritize building out solution #3 first. While this is the hardest solution to implement, I think it is the one users will find most useful. The first two solutions are a little too niche in the sense that they each apply to an individual scenario: either someone sending the blind user a photo, or the blind user taking a photo themselves. Both require someone to manually describe the contents of a photo, and that is going to be difficult to scale. The AI Descriptions solution would not be constrained by requiring anyone to do something manually.
The AI Descriptions could of course be leveraged by any sort of photo-specific application, but I think they can be useful to our users beyond that. If incorporated as a general utility into audio-based screen readers, visually impaired users would be able to understand the contents of photos they encounter elsewhere online, for example an attachment in an email or an image embedded in a news article they’re reading.
While all of the above sounds great, let’s not lose sight of the fact that this is going to be quite difficult to implement. A very rudimentary version could read aloud information like the capture time and location from the image’s EXIF data, as well as detect standard objects present, like “person smiling”. Future versions of the model could be both more specific and tailored to the individual user. For example, instead of detecting “person smiling” it could detect “your granddaughter Sophie smirking while eating Halloween candy”. Facebook’s auto-tagging capability is proof that this is doable, but at the same time they have a very large and rich data set of photos.
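As a rough illustration of that rudimentary first version, here is a hedged sketch assuming Pillow for EXIF access. The `detect_objects` and `read_aloud` functions are hypothetical placeholders for an object-detection model and a TTS/screen-reader hook, not real APIs.

```python
from PIL import Image


def detect_objects(image: Image.Image) -> list[str]:
    """Hypothetical stand-in for an off-the-shelf object/scene detector."""
    return ["person smiling"]  # e.g. the top labels returned by a vision model


def read_aloud(text: str) -> None:
    """Hypothetical stand-in for handing text to TTS / the device screen reader."""
    print(text)


def describe_photo(path: str) -> None:
    image = Image.open(path)
    exif = image.getexif()

    parts = []
    capture_time = exif.get(306)  # EXIF tag 306 = DateTime
    if capture_time:
        parts.append(f"Photo taken {capture_time}.")

    labels = detect_objects(image)
    if labels:
        parts.append("It appears to show: " + ", ".join(labels) + ".")

    read_aloud(" ".join(parts) or "No description is available for this photo.")


describe_photo("vacation.jpg")  # expects a local JPEG with EXIF metadata
```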
Summary
In order to provide a really good product experience to users who are completely blind, we are going to build an AI-based application that will automatically analyze a photo and read an audio description of its contents aloud.
Metrics
Right off the top of my head, I think there are two points that are important to measure:
1. Does our product accurately describe the contents of a photo?
2. Do our users find our descriptions of the photos useful?
Let’s focus on picking a metric that accurately reflects our ability to do #2. I think #2 encapsulates #1: if our product can’t accurately describe the contents of photos, then our users aren’t going to find the descriptions useful. So let’s focus on question #2.
To assess whether our users find our descriptions of photos useful, we should focus on monitoring our number of daily active users. This is the type of application that is meant to be used day-to-day, so daily active users makes more sense than monthly active users. If users aren’t finding our product useful, they won’t use it.
One potential complication: if our AI Descriptions are bundled into something like a general audio-based screen reader as a standard offering, then users could be de facto exposed to our descriptions even if they didn’t find them useful. In that situation, we could offer a small subset of users the ability to skip the reading of a description and monitor how often they do so as a proxy for how useful they find our AI Descriptions.
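A hedged sketch of that skip-rate proxy follows, assuming we log a "description_started" and a "description_skipped" event per playback. The event schema and field names are illustrative, not an existing Google Photos logging format.

```python
from collections import defaultdict


def skip_rate_by_user(events: list[dict]) -> dict[str, float]:
    """Fraction of AI description playbacks each user chose to skip."""
    started = defaultdict(int)
    skipped = defaultdict(int)
    for event in events:
        if event["event"] == "description_started":
            started[event["user_id"]] += 1
        elif event["event"] == "description_skipped":
            skipped[event["user_id"]] += 1
    # A persistently high skip rate suggests the descriptions aren't useful.
    return {u: skipped[u] / started[u] for u in started if started[u] > 0}


sample = [
    {"user_id": "u1", "event": "description_started"},
    {"user_id": "u1", "event": "description_skipped"},
    {"user_id": "u2", "event": "description_started"},
]
print(skip_rate_by_user(sample))  # {'u1': 1.0, 'u2': 0.0}
```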