Tuesday, April 18, 2006

Back in October I opined that cameras in mobile phones were about to morph into primary data input devices. Based on a recent company blog entry, it sounds like Microsoft is sniffing down the same trail:
[Xing] Xie, a researcher for the Web Search and Mining group within Microsoft Research Asia, is working on technology called Photo2Search, which is designed to provide information on the go for users of camera phones.... Photo2Search gives users a way to search a Web-based database by using nothing more than an image captured by a cellphone equipped with a digital camera.

“This technology,” Xie says, “aims to solve the problem of mapping a physical-world object to a digital-world object. You see an object in the physical world, and you want to know the corresponding information in the digital world—for example, its price on the Web, user comments, or Web sites. There are many different solutions. You can use a bar code or radio-frequency identification. But using a picture of the object is very convenient and very easy to deploy.”

The prospect of snapping a picture of something and using the image data to form a search query is exciting, but as Xie and his colleagues have found, it's an idea that's heavily dependent on improvements in machine vision. I am skeptical that machine vision alone is enough to make something like Photo2Search work as a practical, scalable way of mapping arbitrary objects to their digital counterparts. A picture arriving on a server lacks context, and without it machine vision is of little use for search. What was the user doing at the time they took the picture? What kind of information were they hoping to get when they took it? What did the user think they were taking a picture of? The billboard on the side of the road? One of the stores in the background? The airplane taking off from a nearby airport? Most photographic data is too rich to form a determinate query, and the last thing you want to do on a mobile device is sift through irrelevant search results.

There are at least two ways around the blind alley down which Microsoft's researchers seem to be walking. The simplest one stems from the insight that much of what people want information about when they are mobile is information that other people really want them to have. The local movie theater would like nothing better than for you to snap a picture of their billboard, browse their current shows and buy tickets while you sit in traffic. A digital code on the sign wouldn't require the power of a server to be parsed into a web link--the phone's own software could manage it, given a reasonably high-resolution image to work with. The embedded code wouldn't even have to be visible to the human eye if the camera had an infrared-sensitive mode--that would be the 2.0 version of this idea. I think we can be assured that businesses and other people with messages they want to get out will be quite resourceful in helping cameraphone users receive and do useful things with those messages. And without Microsoft's fancy machine vision research, thank you very much.

The other way around the problem is the one that Numenta seems to be taking: develop software that takes sensory data over time and builds an empirical context for perceiving what it should see or do next. I admit that this sounds like a tall order for a pocketable device to accomplish when PDAs seem barely able to make it through a day of intermittent use on a battery charge. And it also sounds like it's more about making devices that respond intelligently to their owners and their habits than about solving Microsoft's search problem.

But search isn't the only--or even the most interesting--thing here. Turning dumb "computer-like" gadgets into smart ones that seem to "understand" us will be the key to moving mobile devices out of geekdom and into the mass market. We might not be far away from portable devices that understand things we do regularly. For example, it might not be hard for my phone to recognize that I have a habit of taking pictures of people's business cards, and learn to send that kind of data to an OCR server for entry into my contacts database. If it knew the sound or appearance of my car and correlated it with my usual commute times, it might be practical to train it to use a text-to-speech engine to read my incoming mail and messages out loud in that circumstance. Perhaps it could learn to do more than remind me with a beep that, as usual, I am running late as I drive somewhere, and instead make an intelligent audible offer to dial or message the poor soul who needs to adjust their schedule to this fact. These basic levels of intelligence seem like they might be within reach, and could turn cameras and microphones into the eyes and ears of smart devices.
