Vision-Based Gesture Recognition: An Ideal Human Interface for Industrial Control Applications
Contributed By: Convergence Promotions LLC
2012-03-14
Embedded vision, the evolution and extension of computer-based vision systems that process still and video images and extract meaning from them, is poised to be the next big technology success story. Consider, for example, the image sensors and processors now commonly found in cellular phones, tablets, laptop computers, and dedicated computer displays. Originally intended for video conferencing and photography, they are now being harnessed for additional applications, such as augmented reality.[1]
Similarly, consider the burgeoning popularity of consumer surveillance systems, driven by steady improvements in cameras and their subsystems, as well as the increasingly user-friendly associated surveillance software and services.[2] Also, as anyone who has recently shopped for an automobile already knows, image sensors are increasingly found in numerous locations around a vehicle, leveraged for parking assistance, rear-view safety, impending-collision alert, lane-departure warning, and other functions.[3]
The same feature-rich, cost-effective image sensors, processors, memory devices, I/O transceivers, and other ICs used in the aforementioned systems are equally available to developers of vision-enabled industrial automation applications. Gesture-based human interfaces are ideal in many respects, and therefore increasingly common, in such environments. For one thing, they are immediately intuitive; why click a mouse or a button, or even slide your finger across a touch screen to flip pages or move within a menu, when you can instead just sweep your hand through the air?
A gesture-based UI also dispenses with the environmental restrictions that often hamper a touch-based interface: water and other fluids, non-conductive gloves, dirt and germs, and so on. However, a first-generation motion implementation such as that used by the Nintendo® Wii™ game console has limitations of its own. It requires an easy-to-lose, breakable, in-hand controller. Additionally, the interface between the controller and the system, usually implemented via Bluetooth®, ZigBee®, or another RF wireless technology, is (like a touchscreen interface) vulnerable to functional degradation from environmental EMI.
Instead, consider an image sensor-inclusive design. Vision-based gesture interfaces use the human body as the controller rather than a dedicated piece of extra hardware, interpreting hand, arm, and other body movements. They are comparatively EMI-immune; all you need to ensure is sufficient operator-to-equipment distance along with adequate ambient lighting. In addition to gesture-based control, and as with the earlier-mentioned computers and cell phones, you can use facial recognition technology not only to "unlock" the system in response to the presence of a valid operator's visage, but also to custom-configure the system on the fly for a particular operator, logging into a specific user account, for example.[4] Vision-based interfaces can also offer a more extensive suite of user control options than does a coarser-grained accelerometer- or gyroscope-based motion interface.
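As a rough illustration of the facial-recognition "unlock" concept, the following Python sketch uses the open-source OpenCV library's stock Haar-cascade face detector to gate access. The camera index, the frame budget, and the detection-only approach (a production system would add recognition against enrolled operators before unlocking) are illustrative assumptions, not a prescribed implementation.

```python
import cv2

# Hypothetical sketch: only proceed when an operator's face is visible to the camera.
# The cascade file ships with OpenCV's Python package; camera index 0 is an assumption.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture(0)  # operator-facing camera
unlocked = False

for _ in range(300):  # examine up to ~10 seconds of video at 30 fps
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) > 0:
        # A real system would follow detection with recognition (matching against
        # enrolled operators) before unlocking and loading that operator's profile.
        unlocked = True
        break

cap.release()
print("Operator present; unlock" if unlocked else "No operator detected; stay locked")
```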
A Kinect case study
If your system employs a dual-image-sensor (i.e., stereo or 3-D) arrangement, your range of available gestures becomes even richer, encompassing not only horizontal and vertical movements but also depth discernment. Stereo sensor setups also enable facial recognition software to more accurately distinguish between a real-life human being and a photograph of a person. Microsoft® took a different approach, called structured light, to discern depth with the Kinect peripheral for the Xbox® 360 (see Figure 1).[5]
Figure 1: Microsoft's Kinect peripheral for the Xbox 360 game console, a well-known embedded vision success story (a), combines both monochrome and Bayer-patterned full color image sensors, along with an infrared transmitter for structured light depth discernment (b). Further dissection by iFixit revealed additional component details (c). (Courtesy Microsoft and iFixit, respectively).
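Returning to the dual-sensor (stereo) alternative mentioned above: depth follows from simple triangulation, Z ≈ f·B/d, where f is the focal length in pixels, B the baseline between the two sensors, and d the measured disparity. The Python sketch below uses OpenCV's block-matching stereo correspondence to make this concrete; the image filenames, focal length, and baseline are placeholder assumptions.

```python
import cv2
import numpy as np

# left.png / right.png stand in for a rectified stereo pair; the focal length and
# baseline below are placeholder values that would come from camera calibration.
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)       # block-matching correspondence
disparity = stereo.compute(left, right).astype(np.float32) / 16.0   # fixed-point output -> pixels

focal_length_px = 600.0   # assumed focal length, in pixels
baseline_m = 0.06         # assumed 6 cm separation between the two sensors

with np.errstate(divide="ignore"):
    depth_m = (focal_length_px * baseline_m) / disparity            # Z = f * B / d
depth_m[disparity <= 0] = 0.0                                       # unmatched pixels carry no depth
```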
Kinect is one of the best-known embedded vision examples, selling eight million units in its first 60 days on the market beginning in early November 2010.[6] It is not currently an industrial automation device, at least officially, although hackers' efforts have notably broadened its usefulness beyond its game console origins. Microsoft plans to unveil an official SDK for the Windows® 7 operating system this year, along with a PC-optimized product variant.[7] Regardless, the design trade-offs and decisions made by Microsoft are instructive to others developing vision-based user interface hardware and software.
Teardowns of Kinect conducted by Chipworks, Inc. and iFixit shortly after product introduction revealed that both the monochrome and full-color image sensors came from Aptina. Their relatively generic VGA-resolution CMOS characteristics mean that alternate supply sources such as OmniVision are also feasible.[8],[9] Microsoft included an infrared transmitter in the Kinect design in order to provide a known-illumination-pattern light source that, by virtue of its operating frequency, is also invisible to the naked eye. This design decision, however, hampers Kinect's use in sunlight and other infrared-rich ambient environments.
The monochrome image sensor works in tandem with the infrared transmitter and a PrimeSense-sourced processing SoC to output QVGA-resolution, 11-bit depth map images to the Xbox 360 over a USB 2.0 interface, with white pixels representing nearby objects and a color gradient extending to blue-pixel (far) objects (see Figure 2). Kinect also provides 24-bit interpolated color VGA-resolution images from the Bayer filter-patterned color image sensor, useful, for example, in both capturing the facial image of each game player and subsequently identifying a particular user.[10] Finally, Kinect incorporates a four-element microphone array configuration, useful in pinpointing a particular participant's voice in 3-D space, in the process filtering out both ambient noise and the vocal utterances of other game players.
Figure 2: A PrimeSense-developed vision SoC (a) both drives a transmitter that "paints" the area in front of Kinect with infrared light (b) and processes the output of Kinect's VGA resolution monochrome image sensor, creating per-frame depth map images of objects from near (white) to far (blue) distances (c). (Courtesy of PrimeSense).
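To make the near-white to far-blue depth rendering described above concrete, the short sketch below maps an 11-bit depth frame (raw values 0 to 2047) onto such a gradient. The specific color mapping is an illustrative choice, not Kinect's exact rendering.

```python
import numpy as np

def colorize_depth(depth_11bit: np.ndarray) -> np.ndarray:
    """Map raw 11-bit depth codes (0..2047) to a BGR image: white (near) fading to blue (far)."""
    norm = np.clip(depth_11bit / 2047.0, 0.0, 1.0)            # 0 = near, 1 = far
    fade = ((1.0 - norm) * 255).astype(np.uint8)              # red and green fall off with distance
    blue = np.full_like(fade, 255)                            # blue channel stays saturated
    return np.dstack([blue, fade, fade])                      # BGR channel order, as OpenCV expects

# Example: a synthetic 240 x 320 (QVGA) frame sloping from near (left) to far (right).
frame = np.tile(np.linspace(0, 2047, 320, dtype=np.uint16), (240, 1))
bgr = colorize_depth(frame)
```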
Simpler implementations can sometimes suffice
Several analyst firms have independently estimated the bill of materials cost for Kinect to be just over $50, and the device is also fairly large (11" x 3" x 3") and heavy (~4 lbs). Keep in mind that this particular peripheral is intended not only to discern users' hand gestures but also to successfully tackle full-body motion capture and facial recognition tasks, including discerning users' smiles, frowns, eyebrow-raises, and other facial-element motions, and mirroring them on the on-screen user avatar.[11] It is also intended for use in a variety of operating environments, thereby explaining the infrared transmitter (and associated heat-removing fan), as well as the unit-orienting accelerometer, motor, and triple-gear assembly.
Kinect needs to minimize the amount of USB 2.0 system bus bandwidth it consumes, reserving sufficient spare bandwidth for other console peripherals such as networking adapters and the HD DVD drive. On the other hand, it is able to harness both its own processing resources (the earlier-mentioned PrimeSense IC, along with a Marvell-developed, ARM®-based SoC) and the USB 2.0-tethered game console's combination of a triple-core, six-thread 3.2 GHz PowerPC™ CPU and 500 MHz GPU. However, Kinect's optics subsystem and infrared transmission scheme combine to set the near end of its guaranteed usable range at six feet (eight feet for multiple-player situations); in combination with processing limitations, these factors allow Kinect-enabled games to simultaneously discern only a couple of players.
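A back-of-the-envelope calculation shows why bus bandwidth conservation matters. The stream formats below follow the ones described in this article, while the roughly 35 MB/s figure for practically achievable USB 2.0 throughput is a commonly cited rule of thumb rather than a specification value.

```python
# Rough bandwidth estimate for the uncompressed streams described in this article.
def stream_mb_per_s(width, height, bits_per_pixel, fps):
    return width * height * bits_per_pixel * fps / 8 / 1e6

color = stream_mb_per_s(640, 480, 24, 30)   # VGA 24-bit color at 30 fps
depth = stream_mb_per_s(320, 240, 16, 30)   # QVGA depth, 11 bits padded to 16, at 30 fps

usable_usb2 = 35.0  # MB/s; rule-of-thumb practical ceiling for USB 2.0 bulk transfers
print(f"color ~{color:.1f} MB/s, depth ~{depth:.1f} MB/s, "
      f"total ~{color + depth:.1f} MB/s of ~{usable_usb2} MB/s usable")
```

Even without audio, the uncompressed color and depth streams together approach the bus's practical ceiling, which is why a peripheral of this kind must budget its bandwidth carefully.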
Keep these trade-offs made by the Microsoft team in mind as you develop your own embedded vision-based design. For example, you can dispense with the microphone array if voice recognition isn't required, or alternatively simplify it to a single- or two-microphone setup if less robust source location and noise suppression schemes are sufficient. It is possible that you will need your gesture configuration to accurately respond to users located closer than six feet from the image sensor. On the other hand, you might be able to guarantee sufficient ambient lighting in all possible usage cases to preclude the requirement for ancillary infrared or other illumination.
Accurate depth discernment, both for complex hand motions and for object dimensions, sometimes requires a dual-image-sensor setup, but you may already be planning such a configuration for implementing 3-D video conferencing or photography functions. On the other hand, if the gesture-based interface is fairly simple, you can probably get away with a single-image-sensor setup. Single-sensor configurations also suffice (as Kinect exemplifies) for structured lighting-based depth discernment, as well as for the time-of-flight depth resolution approach.[12]
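For reference, the time-of-flight approach resolves depth from the round-trip travel time of emitted light, distance = c·t/2, as the trivial sketch below illustrates.

```python
C = 299_792_458.0  # speed of light in m/s

def tof_distance_m(round_trip_s: float) -> float:
    """Distance to a reflecting surface, given the measured round-trip time of light."""
    return C * round_trip_s / 2.0

print(tof_distance_m(13.3e-9))  # a ~13.3 ns round trip corresponds to roughly 2 m
```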
CPUs and software
If your vision-based interface is less complex than the earlier Kinect case study, the processing resources required to implement the various algorithms shrink accordingly. A variety of processing candidates is available, which you can harness either standalone or in combination, for example in a CPU-plus-GPU pairing (see the sketch that follows this list).[13] They include:
- CPUs from companies like AMD and Intel Corporation
- DSPs from suppliers such as Analog Devices and Texas Instruments
- FPGAs from Xilinx or another programmable logic provider
- GPUs from firms like AMD and NVIDIA
- Vision-tailored ICs from companies such as CogniVue and Maxim
- Vision-optimized processor cores from suppliers like CEVA
- SoCs from Freescale Semiconductor and several of the previously-mentioned semiconductor firms, along with others
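As one hedged illustration of such a CPU-plus-GPU pairing, recent OpenCV builds can transparently offload many image operations to a GPU via OpenCL when data is wrapped in a UMat, leaving lightweight decision logic on the CPU. Whether the offload actually occurs depends on the platform's OpenCL support, so treat this as a sketch rather than a guaranteed acceleration path.

```python
import cv2

cv2.ocl.setUseOpenCL(True)        # request OpenCL offload where a device is available

cap = cv2.VideoCapture(0)         # assumed camera index
ok, frame = cap.read()
if ok:
    gpu_frame = cv2.UMat(frame)                              # candidate for GPU residence
    gray = cv2.cvtColor(gpu_frame, cv2.COLOR_BGR2GRAY)       # heavy, data-parallel stages:
    blurred = cv2.GaussianBlur(gray, (9, 9), 2)              #   run on the GPU when possible
    edges = cv2.Canny(blurred, 50, 150)
    edge_count = cv2.countNonZero(edges)                     # small reduction; result on the CPU
    print("OpenCL active:", cv2.ocl.useOpenCL(), "| edge pixels:", edge_count)
cap.release()
```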
Gesture recognition is a sufficiently specialized and demanding function that you might choose to license foundation algorithms and/or middleware code from a company whose core focus is the development and implementation of gesture technology for various processing platforms. During the research phase, you may discover that gesture recognition means different things to different people. Middleware developer Omek Interactive, for example, focuses its efforts on implementations that leverage 3-D image sensor arrays, while other companies concentrate only on recognizing hand-based gestures, disregarding broader body motion.[14]
If, on the other hand, you decide to tackle developing your own gesture interface code, the most commonly leveraged APIs and reference algorithms are open-source in nature (a minimal hand-swipe example built on one of them follows this list), specifically:
- OpenCL™ for GPGPU (general-purpose computing on graphics processing units) acceleration of massively parallelizable code segments[15]
- OpenMP® (multi-processing) and Grand Central Dispatch, the latter originally developed by Apple®, for partitioning and scheduling code across multiple CPU cores
- The OpenCV (Computer Vision) code library originally developed by Intel® and now maintained by Willow Garage[16]
- OpenNI (natural interaction), an organization with PrimeSense as a key founder, offering both a set of APIs and a framework for supporting natural voice and voice command recognition, hand gestures, and body motion tracking
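The minimal example promised above: a single-camera hand-swipe detector built on the OpenCV library using simple frame differencing. The camera index, thresholds, frame budget, and swipe-distance criterion are illustrative assumptions; a production gesture engine would be considerably more robust.

```python
import cv2

cap = cv2.VideoCapture(0)          # single forward-facing camera (assumed index 0)
prev_gray, prev_cx = None, None

for _ in range(300):               # examine ~10 seconds of video at 30 fps
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.GaussianBlur(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), (21, 21), 0)
    if prev_gray is not None:
        diff = cv2.absdiff(prev_gray, gray)                     # inter-frame motion
        _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
        m = cv2.moments(mask, binaryImage=True)
        if m["m00"] > 5000:                                     # enough moving pixels
            cx = m["m10"] / m["m00"]                            # horizontal centroid of motion
            if prev_cx is not None and abs(cx - prev_cx) > 40:  # assumed swipe threshold (pixels)
                print("swipe right" if cx > prev_cx else "swipe left")
            prev_cx = cx
        else:
            prev_cx = None                                      # motion stopped; reset tracking
    prev_gray = gray

cap.release()
```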
Alternative APIs of a more proprietary nature come from a number of suppliers. Higher-level frameworks and software development toolsets are available from vendors such as National Instruments and MathWorks. If you are interested in further improving your gesture algorithm's effectiveness via image enhancement techniques, contact a company such as Apical Limited.[17]
The Embedded Vision Alliance
A number of the companies mentioned in this article (along with many others) are members of the Embedded Vision Alliance, which publicly launched at the end of May 2011. Embedded vision technology has the potential to enable a wide range of electronic products that are more intelligent and responsive than before, so that they are more valuable to users. It can enable electronic equipment companies to both create valuable new products and add helpful features to existing products. Also, it can provide significant new markets for hardware, software, and semiconductor manufacturers. The Embedded Vision Alliance, a unified worldwide organization of technology developers and providers, is helping to transform this potential into reality in a rich, rapid, and efficient manner.
The Embedded Vision Alliance has developed a full-featured website, freely accessible to all, that includes (among other things) articles, videos, a daily news portal, and a discussion forum staffed by a variety of technology experts. Registered website users can receive the Embedded Vision Alliance's monthly email newsletter; they also gain access to the Embedded Vision Academy, which contains numerous tutorial presentations, technical papers, and file downloads intended to enable new players in the embedded vision application space to rapidly ramp up their expertise.
References:
1. http://embedded-vision.com/news/2011/08/16/augmented-reality-applications-strive-meaningful-applicability
2. http://embedded-vision.com/news/2011/11/02/surveillance-analytics-consumer-success-stories-silence-critics
3. http://embedded-vision.com/news/2011/12/27/driver-assistance-gm-thinks-camera-based-systems-make-fiscal-and-functional-sense
4. http://embedded-vision.com/news/2011/12/15/facial-recognition-mobile-application-yearning-stereo-vision
5. http://en.wikipedia.org/wiki/Structured_light
6. http://en.wikipedia.org/wiki/Kinect
7. http://embedded-vision.com/news/2011/11/23/microsofts-kinect-startup-investments-and-pc-enhancements
8. http://www.chipworks.com/en/technical-competitive-analysis/resources/recent-teardowns/2010/12/teardown-of-the-microsoft-kinect-focused-on-motion-capture
9. http://www.ifixit.com/Teardown/Microsoft-Kinect-Teardown/4066/1
10. http://embedded-vision.com/platinum-members/bdti/embedded-vision-training/documents/pages/selecting-and-designing-image-sensor-
11. http://embedded-vision.com/news/2011/07/26/avatar-kinect-hits-xbox-live-marketplace-facial-recognition-becomes-commonplace
12. http://en.wikipedia.org/wiki/Time-of-flight_camera
13. http://embedded-vision.com/platinum-members/bdti/embedded-vision-training/documents/pages/implementing-vision-capabilities-embe
14. http://embedded-vision.com/industry-analysis/video-interviews-demos/2011/10/05/embedded-vision-alliance-conversation-gershom-ku
15. http://embedded-vision.com/platinum-members/bdti/embedded-vision-training/documents/pages/introduction-computer-vision-using-op
16. http://embedded-vision.com/platinum-members/bdti/embedded-vision-training/videos/pages/conversation-gary-bradski-part-1-2
17. http://embedded-vision.com/industry-analysis/video-interviews-demos/2011/10/04/embedded-vision-alliance-conversation-michael-tu
