It's today, 1pm to 3pm @ Fusionopolis.
Last night, TDM/A*STAR held a nice event for people to interact with the finalists. If you're new to image/video/audio/speech search, I thought it was a great event, but personally I didn't learn anything new except more anecdotes on the need to search in a more human manner. Some tried to ask social questions and such, but it wasn't really useful. Here's a quick summary if you left halfway or didn't attend (sorry, I didn't jot down who said what…).
In the past, audio, video, image and speech were all different fields. This competition brought them together, which was exciting. One thing is clear: text/language is an artificial abstraction. For multimedia, the low-level details are in bits, which can work quite well for some tasks (e.g. filtering out unwanted stuff, classification of languages) but not 100%. However, we're used to keyword search, and that continues to present itself as the largest challenge: searching multimedia through annotations.
There was some initial discussion on whether search would go down the path of being specialized by context or generalized. It was agreed that generalized search was what the teams were going after in this specific competition, but there are certainly distinct markets where specialized search is worth multi-millions right now, one example being speech translation between doctors and patients in the US.
Another discussion was on whether we need to apply semantics to multimedia, like a grammar, or whether the use of low-level features (looking at 1s and 0s) would be sufficient. Some compared it to decoding DNA, which some experts disagreed with, feeling that their problem is much harder than decoding DNA.
Reference to the human experience was also discussed. A baby can learn from visual cues. Humans don't necessarily use semantics, and might use statistical methods as well, one example being the ability to accurately determine that another person is speaking Japanese without knowing a single word of the language. A vivid example was given of a six-year-old asking (his dad) how to search for all clips of Beckham kicking a goal and the interview about how he felt after that.
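That "recognize the language without knowing a word" trick really is statistical at heart. Here's a toy sketch of one common approach, character-bigram frequency profiles, with made-up miniature corpora (real systems train on far larger samples; all data and names below are my own illustration, not anything from the event):

```python
from collections import Counter

def bigram_profile(text):
    """Relative frequency of character bigrams in a text."""
    text = text.lower()
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(counts.values())
    return {bg: n / total for bg, n in counts.items()}

def similarity(p, q):
    """Dot product of two sparse bigram-frequency profiles."""
    return sum(freq * q.get(bg, 0.0) for bg, freq in p.items())

def identify_language(snippet, profiles):
    """Pick the training profile most similar to the snippet's profile."""
    target = bigram_profile(snippet)
    return max(profiles, key=lambda lang: similarity(target, profiles[lang]))

# Tiny illustrative corpora -- purely made up for this sketch.
profiles = {
    "english": bigram_profile(
        "the quick brown fox jumps over the lazy dog and then runs away"),
    "romaji_japanese": bigram_profile(
        "watashi wa nihongo ga sukoshi wakarimasu arigatou gozaimasu"),
}

print(identify_language("domo arigatou", profiles))        # romaji_japanese
print(identify_language("the dog ran over there", profiles))  # english
```

No semantics anywhere, just counting which letter pairs show up, which is roughly what our ears do with the sounds of an unfamiliar language.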
Experts seem to have an emotional response when the G word is mentioned. But it is agreed that the web in general is a boon as a source of external knowledge. Someone used Yahoo news to create a large database of celebrity faces and names, with no manual annotation, and could subsequently use the database to identify celebrities. Thus, a possible solution to the Beckham problem would be to use that database to find clips with Beckham. Other external knowledge, such as soccer commentary, can also help. And Wikipedia can teach us things now.
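The clever part of that Yahoo news trick is that the captions label the faces for free. A toy sketch of the weak-supervision idea, assuming faces have already been clustered into signatures (all data below is made up, and real systems work on face-feature vectors, not string IDs):

```python
from collections import defaultdict

# Hypothetical news items: detected face signatures plus caption names.
news_items = [
    {"faces": ["f1"], "caption_names": ["David Beckham"]},
    {"faces": ["f1", "f2"], "caption_names": ["David Beckham", "Victoria Beckham"]},
    {"faces": ["f2"], "caption_names": ["Victoria Beckham"]},
    {"faces": ["f1"], "caption_names": ["David Beckham", "Referee"]},
]

def label_faces(items):
    """Label each face cluster with the caption name it co-occurs with most."""
    cooccur = defaultdict(lambda: defaultdict(int))
    for item in items:
        for face in item["faces"]:
            for name in item["caption_names"]:
                cooccur[face][name] += 1
    return {face: max(names, key=names.get) for face, names in cooccur.items()}

labels = label_faces(news_items)
print(labels)  # {'f1': 'David Beckham', 'f2': 'Victoria Beckham'}
```

Noisy co-occurrences (the "Referee") get washed out by sheer volume, which is why nobody had to annotate anything by hand.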
Some noted that even as many databases were created, the interpretation of a database is still not complete without combining it with other analysis methods (something about top-down/bottom-up Bayesian combinations, all said in a fluffy manner without going into details…). An example being that a search for "homeless person wandering" in a bucket of images might lead to nothing, but "a man walking" might yield something, notwithstanding the possibility of creating a stereotyped way of detecting homeless people.
From the business side, some asked if there's a vertical that could lead to the horizontal solution, perhaps akin to the space race. The wine industry was given as an example of a vertical with real applications for classifying wine. Another would be the creation of robots with human capabilities. Some felt there's a huge economic opportunity in just being able to map "taste" onto consumer products. On the flip side, there's the possible proliferation of video-based email spam. One interesting problem brought up was a kid-focused search engine that filters out unwanted websites, instead of manually whitelisting sites with OK image/video content, or utilizing some form of crowdsourcing (Captcha was given as an example).
From the social side, some asked about the implications of gaining such great powers. For one, it allows for easier retrieval of currently harder-to-find data, which raises privacy concerns, if not a red alert over more p*rnography.
An interesting situation was posed where you can't use text, like a foreigner finding out about char kuay teow in Singapore when there's no way she would know to use the keyword "char kuay teow" (animals was her actual example). The ability for humans to give inputs in multimedia requires tools (e.g. a camera) and matching against a database of knowledge.
And a weird question which sounded forced (was it planted!?) was about the advantages of doing such research in Singapore. Besides reputation and funds (government + indirect through customers), there wasn't any mention of the availability of experts 🙂 ah well.
So, if you're around Fusionopolis, go take a look now, as the competition will be projected in a private Second Life-like environment in the "egg" (I think that's the atrium in the lobby). I'll just read the papers for the results.