Issue 1, June 1999
Engineering & Applied Sciences
Evaluation of an Optical Human Motion Tracking System
Mark Palatucci
University of Pennsylvania
Abstract
We present
an evaluation of the ExpertVision HiRES optical motion capture system.
Using a spreadsheet, we performed a statistical analysis on motion
data from a female dancer. This analysis provided insight into the major
sources of error in the HiRES device. Among our results, we found
the major causes of error to be optical marker movement and occlusion.
We also analyzed how this error inhibits the animation of anatomically
correct human models.
Introduction
Movement is perhaps
the most significant interaction that humans have with their environment.
Various motions can convey messages, manipulate objects, and traverse
surroundings. Virtually all human processes, both internal and external,
are a result of some sort of movement. Given this significance it
is no surprise that movement must be an integral part of any computer
generated or virtual environment (VE). It requires a good VR system
to provide these realistic interactions. Such systems will capture
the motions of their users while simulating the movements of various
virtual objects (Matsuba and Roehl, 1996).
VR systems use
tracking to capture and produce motion data. Tracking methods include
mechanical, ultrasonic, magnetic, and optical.
While each type of system has its own advantages and disadvantages,
recent advances in technology have made optical capture the method
of choice for many animators and developers.
One of the most
popular optical motion capture systems is the Motion Analysis Corporation
(MAC) ExpertVision HiRES. This system uses multiple cameras to record
the motion of actors wearing reflective markers. The included software
then analyzes the motion data from the various camera views to produce
3-D location coordinates for each marker. These coordinates can
then be exported to various animation packages such as Alias/Wavefront,
SoftImage, Prism, and Nichiman Graphics. This export ability allows
users of the HiRES system to produce incredibly realistic motions
in a fraction of the time that key-frame animation would take to
produce similar movements.
Another time
saving aspect of the HiRES system is what MAC calls "Accumulation
of Digital Resources." The feature allows users of the HiRES to
capture one set of motion data and then apply that data to any number
of figures regardless of age, size, race, clothing, etc. While
many people praise this feature of the HiRES, some criticize it
saying that it may produce movements that are unrealistic for the
given character. Others qualify this feature as useful in some situations
while unhelpful in others.
The debate raises
several issues about the overall usefulness of the HiRES motion
capture system. These issues include: 1) When should key-frame animation (see footnote 1)
be used over motion capture? When should motion capture be used
over key-frame animation? 2) What are the ideal uses of the HiRES
system? 3) To what level of realism can the motion data be trusted?
What kind of characters can be realistically animated with one set
of data? 4) What repercussions are there for animating only a portion
of a human figure with the HiRES system?
While this paper
analyzes the HiRES device, the discussion is relevant to all reflector
based optical systems. The evaluation will first present an analysis
of the error involved in the device. From the data we will then
discuss the various issues described above.
Methods
The optical capture
device that was used for this study was the Motion Analysis Corporation
ExpertVision HiRES. The system uses a combination of high speed
cameras and reflective markers to capture the motion of a performer.
As the performer executes a movement, light from each camera is
reflected by the markers. The cameras then filter this light to
record a 3-D location of each marker.
The trajectories
of the reflective (real) markers are used to reference the internal
skeletal system of the performer. In reality, the internal skeletal
system is defined by "virtual markers" which are nothing more than
estimated joint centers. The real markers are used to define these
virtual marker locations. The basic method for defining a virtual
marker is to physically estimate the distance between a real marker
and a joint center. After virtual markers are defined, the rest
of the setup includes forming a body segment hierarchy and defining
the motion area.
The setup process
is the most difficult aspect of this HiRES system. Since this paper
is an evaluation of the system, we wanted the setup to be as accurate
as possible. Rather than perform our own clumsy configuration, we
analyzed data given to us by Motion Analysis Corporation. The data
is that of a human female dancing and was recorded at 60 frames
per second. The motion lasts 15 seconds for a total of 900 frames
of data.
This dance demo
links the virtual markers together to create a 20 segment human
"stick" figure. Each segment has its own origin and local coordinate
system that enables calculation of joint translations and rotations.
A segment hierarchy is then defined to interpret these coordinate
systems. The hierarchy describes the parent-child relationships
of the different segments. A child's coordinate system is relative
to its parent's. The main parent of the hierarchy is the root, and
in our demo the root is the lower torso. Its translations and rotations
are relative to the global coordinate system or capture space.
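As a rough illustration of the parent-child idea, the sketch below sums each segment's local offset with those of its ancestors to obtain a position in the capture space. It handles translations only (rotations are omitted for brevity), and the segment names and offsets are invented for illustration; they are not taken from the dance demo or the HiRES software.

```python
# Hypothetical hierarchy: child segment -> parent segment (None marks the root).
PARENT = {
    "lower_torso": None,          # root, expressed in the global capture space
    "upper_torso": "lower_torso",
    "neck": "upper_torso",
    "head": "neck",
}

# Made-up local translations (x, y, z) of each segment origin relative to its parent.
LOCAL_OFFSET = {
    "lower_torso": (0.0, 0.0, 100.0),
    "upper_torso": (0.0, 0.0, 25.0),
    "neck": (0.0, 0.0, 30.0),
    "head": (0.0, 0.0, 10.0),
}


def global_position(segment):
    """Walk up the hierarchy, summing local offsets until the root is passed."""
    x, y, z = 0.0, 0.0, 0.0
    while segment is not None:
        dx, dy, dz = LOCAL_OFFSET[segment]
        x, y, z = x + dx, y + dy, z + dz
        segment = PARENT[segment]
    return (x, y, z)


head_global = global_position("head")
```

Because every child is stored relative to its parent, moving the root in this sketch moves every descendant with it, which is the property the hierarchy is meant to provide.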
Defining virtual
markers and a segment hierarchy is only necessary when using the
HiRES .htr (hierarchical translation and rotation) format. There are
currently two formats that the HiRES uses for animation purposes.
The first, the track row column (.trc) output file, contains XYZ
position values for the reflective markers relative to the global
coordinate system. With this approach the animation software must
use inverse kinematics to create joint translations and rotations
(Badler and Hollick, 1993). The second format is the .htr format
mentioned above. With this format the output file contains translation
and rotation information for each segment relative to its parent
segment. For our study, we consider the .htr format as it allows
the segment based motion of a single subject to animate a variety
of different models.
The hierarchical
translation and rotation (htr) format separates the body into twenty
distinct segments. In each frame of data the length of each segment
is determined by measuring the distance between joint centers (virtual
markers). Rather than report actual segment lengths for each frame,
the software supplied with the HiRES system records each segment
length as a "scale factor". For each segment, the software calculates
the actual length for the first frame of data (scale factor 1.0).
The segment length in subsequent frames is then reported as a scale
factor (or percentage) of the length measured in the first frame.
For example, if the segment length in frame ten is calculated to
be 97% of the length measured for that segment in frame one, the
scale factor for frame ten would be reported as 0.97.
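The scale factor calculation described above can be sketched as follows. This is a minimal illustration, not the HiRES software's own code: the function names and the joint-center coordinates are assumptions made up for the example.

```python
import math


def segment_length(joint_a, joint_b):
    """Euclidean distance between two joint centers (virtual markers)."""
    return math.dist(joint_a, joint_b)


def scale_factors(lengths):
    """Express each frame's segment length relative to the first frame."""
    reference = lengths[0]
    return [length / reference for length in lengths]


# Made-up joint-center positions for one segment over three frames.
frames = [
    ((0.0, 0.0, 0.0), (0.0, 0.0, 40.0)),   # frame 1: length 40.0
    ((0.0, 0.0, 0.0), (0.0, 0.0, 38.8)),   # frame 2: length 38.8
    ((1.0, 0.0, 0.0), (1.0, 0.0, 42.0)),   # frame 3: length 42.0
]
lengths = [segment_length(a, b) for a, b in frames]
factors = scale_factors(lengths)
```

With these invented numbers the second frame is 97% of the first-frame length, so its scale factor comes out at about 0.97, matching the example in the text.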
We performed
our statistical analysis on the 18,000 (900 frames * 20 segments)
scale factors recorded during the dance demo. We calculated the
mean, mode, and standard deviation values for each body segment.
The standard deviation value is most important as it gives a percentage
approximation of the error involved for each body part.
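The per-segment statistics were computed in a spreadsheet; an equivalent calculation might look like the sketch below. The scale factors are invented placeholders (the real demo has 900 per segment), and the use of the population standard deviation is an assumption, since the text does not say which form was used.

```python
import statistics

# Hypothetical scale factors for two segments; the real demo has 900 per segment.
scale_by_segment = {
    "head": [1.00, 1.00, 0.99, 1.01, 1.00],
    "neck": [1.00, 0.85, 1.10, 0.90, 1.10],
}


def summarize(factors):
    """Return mean, mode, and population standard deviation of scale factors."""
    return {
        "mean": statistics.mean(factors),
        "mode": statistics.mode(factors),
        "stdev": statistics.pstdev(factors),
    }


summary = {name: summarize(values) for name, values in scale_by_segment.items()}
# Average deviation over all segments, analogous to a whole-body figure.
average_deviation = statistics.mean(s["stdev"] for s in summary.values())
```

Because every scale factor is relative to a first-frame value of 1.0, a standard deviation of, say, 0.05 can be read directly as a deviation of about 5% in segment length.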
Analysis & Results
We performed this
statistical error analysis of the dance demo data using a simple
spreadsheet. This analysis raised several important questions that
need to be considered.
Question 1: What are the major sources/causes of error in the HiRES motion
capture system?
We found that
the average error over the entire body was 5.3% (an average segment
deviation of 0.053). For realism in motion capture, this error may
present noticeable differences. As a hypothetical, take the modal (most
common) male height to be 6'. A 5.3% deviation about that height spans
roughly 5' 8'' to 6' 4''. If realism is the prime
objective, motion data from a 6' male may be inaccurate for properly
animating both 5' 8'' and 6' 4'' males. We further consider this
realism problem in the discussion section.
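The height arithmetic can be checked with a few lines of code. This is only a worked example of the 5.3% figure applied to a 6' (72 inch) height; the function names are made up for illustration.

```python
def height_band(height_inches, deviation):
    """Return (low, high) heights in inches for a fractional deviation."""
    delta = height_inches * deviation
    return (height_inches - delta, height_inches + delta)


def feet_inches(inches):
    """Format a height in inches as feet and rounded whole inches."""
    total = round(inches)
    return f"{total // 12}' {total % 12}''"


low, high = height_band(72, 0.053)   # 6' is 72 inches
band = (feet_inches(low), feet_inches(high))
```

A 5.3% deviation on 72 inches is about 3.8 inches, so the band rounds to 5' 8'' and 6' 4''.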
What are the
sources of these segment deviations? We believe these deviations
to be a direct result of the HiRES motion capture process. Remember
that the segment lengths are nothing but distances between joint
centers (virtual markers). These virtual marker locations are estimated
when the performer is suited up. The estimation requires the setup
crew to approximate the distance from the real markers to the corresponding
joint center. These human approximations can cause inaccurate numbers
to be reported for the segment lengths.
Also consider
that the real markers placed on the performer are not entirely static.
Some may be attached with rubber bands while others may be attached
with glue. Some markers may be attached directly to the skin while
others may be fastened to clothing. The point is that the real markers
do not define a rigid human skeleton. Their ability to move causes
the virtual markers to move and hence the segment lengths will be
improperly reported.
Another major
source of error comes from the optical nature of this device. When
a marker cannot be seen by any of the cameras it is said to be occluded
or obscured. The HiRES system does take this into account. Unlike
many less developed systems, the HiRES incorporates a best fit algorithm
(spline polynomial) to estimate the location of a real marker based
on its location in nearby motion frames. Unfortunately, this algorithm
requires post processing and cannot help the occlusion problem for
real time applications.
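To make the gap-filling idea concrete, the sketch below estimates an occluded marker coordinate from visible frames on either side of the gap. It is a stand-in, not the HiRES algorithm: it uses simple polynomial (Lagrange) interpolation rather than the spline fit mentioned above, and the function names and sample track are invented for illustration.

```python
def lagrange_estimate(known, frame):
    """Evaluate the polynomial through the (frame, value) pairs at `frame`."""
    estimate = 0.0
    for i, (fi, vi) in enumerate(known):
        weight = 1.0
        for j, (fj, _) in enumerate(known):
            if i != j:
                weight *= (frame - fj) / (fi - fj)
        estimate += vi * weight
    return estimate


def visible_neighbors(track, frame, step, limit):
    """Collect up to `limit` visible (frame, value) pairs walking by `step`."""
    found = []
    f = frame + step
    while 0 <= f < len(track) and len(found) < limit:
        if track[f] is not None:
            found.append((f, track[f]))
        f += step
    return found


def fill_gaps(track, neighbors=2):
    """Replace None entries using visible frames before and after each gap."""
    filled = list(track)
    for frame, value in enumerate(track):
        if value is None:
            known = (visible_neighbors(track, frame, -1, neighbors)
                     + visible_neighbors(track, frame, +1, neighbors))
            if known:
                filled[frame] = lagrange_estimate(known, frame)
    return filled


# One coordinate of a marker; None marks frames where the marker was occluded.
x_track = [0.0, 1.0, 4.0, None, None, 25.0, 36.0]
x_filled = fill_gaps(x_track)
```

Note that the estimate draws on frames recorded after the gap. That dependence on future frames is why this kind of fill must run as post processing and cannot help a real time application.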
Question
2: How does the error vary with respect to individual body segments?
Although the
average segment deviation is 5.3%, individual body parts vary from
as little as 0.5% (the head segment) to 13.8% (the neck segment).
We believe this discrepancy to be a direct result of occlusion and
skin surface differences.
Just like the
method of attachment (glue, rubber bands, etc.), the surface of
attachment seems to cause error by allowing slippage in the real
marker positions. This means that the skin variations over the body
affect the segment length estimates. Consider the surface of the
head. The skin located near the temples (where the real head markers
are attached) does not stretch much during normal body movement.
The segment deviation of the head is very small (0.5%). Other segment
surfaces that do not stretch much during movement include the torso,
lower legs, and pelvis sections. These segments have deviations
of less than 4%.
The skin surface
of other segments can vary a relatively large amount. The skin around
the neck stretches and compresses enough to cause deviations of
up to 13.8%. Other skin sections with large variations are the wrists,
collars, and ankles. The corresponding body segments have deviations
of over 9%. In general, the closer the real marker is to a joint,
the greater the deviation in the corresponding body segment.
The other major
factor that affects the discrepancy in segment deviations is occlusion.
Simply put, certain markers get obscured more than others. The markers
placed on the head are unlikely to be blocked throughout a simple
motion. Even simple motions, however, have many complex translations
and rotations around the "extremity" segments such as the hands
or feet. Markers on these body parts can lose their tracking more
readily than those on other parts. When tracking is lost, the HiRES
system applies the best fit algorithm to estimate the real marker
location. The accuracy of this estimate depends on the number of
frames the marker is occluded.
Question
3: How does the error vary with respect to body symmetry?
It is possible
to see that segment error is greater for most segments on the left
side of the body. For those with moderate to heavy deviations (5.0%
+) the left segment deviates 1%-4% more than the corresponding segment
on the right side of the body. Consider the left collar bone. It
deviates by 12.7% while the right collar bone deviates by 8.9%.
Other large discrepancies occur in the hands (9.2% vs. 6.7%), and
feet (8.0% vs. 5.2%).
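The left-right gaps for the three segment pairs quoted above can be tabulated directly. The sketch uses only the percentages given in the text; the dictionary layout is an illustrative choice.

```python
# Segment deviations (percent) quoted in the text: (left, right).
deviations = {
    "collar bone": (12.7, 8.9),
    "hand": (9.2, 6.7),
    "foot": (8.0, 5.2),
}

# Left-minus-right difference in percentage points, rounded to one decimal.
asymmetry = {
    name: round(left - right, 1) for name, (left, right) in deviations.items()
}
```

The differences come to 3.8, 2.5, and 2.8 percentage points, all within the 1%-4% range stated above.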
This problem
was particularly puzzling until we viewed the dance sequence from
a "bird's eye view". When looking at the motion with respect to
the camera locations it was possible to see that the majority of
the motion was concentrated towards the right portion of the motion
area. This may explain why the left body segments were occluded
more often.
Question 4:
How can the accuracy of the device be increased?
After considering
the sources and causes of error we can suggest several methods for
increasing the accuracy of the HiRES device. Unfortunately, most
of these suggestions require additional setup time and/or cost to
the already expensive optical process. It is important to consider
what level of accuracy is desired before implementing these process
recommendations.
The easiest and
zero cost way to increase the accuracy of the device is to
raise the frame capture rate. The HiRES cameras can capture motion
data from 60 fps to 240 fps. Unfortunately, increasing the frame
rate does not really have a large effect on the error unless the
motion is extremely fast and complex. For normal human motion, a
frame rate of 60 fps is usually enough. Capturing at high rates
produces more information than is often needed for animation purposes.
Although this is the cheapest way to reduce error, there are other
low cost options that are much more effective.
We noted above
that a major source of error came from marker slippage or movement.
It is important to consider this when attaching the real markers
to the performer. If the marker movement can be reduced then the
segment length calculations will be much more accurate. We see two
ways to reduce marker movement. The first is to attach the markers
directly to the skin wherever possible. Although the skin is not
rigid, it is much more resistant to stretching and compressing than
clothing. When clothing must be used, try spandex as it hugs the
body well. The second method of reducing marker movement is to attach
the markers as firmly as possible. Use glue or adjustable straps
instead of rubber bands. If the marker is attached firmly then it
will only move as much as the surface itself moves. For those on
low/no budgets, marker attachment is the most cost effective method
to reduce segment deviation.
For those with
money to spend, there are other options for reducing error. The
most expensive and by far the best way of reducing error is to have
more capture cameras. If more angles are covered then markers are
less likely to be occluded. Eliminating occlusion is the best way
of reducing error in any optical capture system. This option is
not always practical because the high speed cameras are very expensive.
A lower cost alternative is to add more optical markers to the performer.
The particular advantage of this method is that the accuracy of
individual segments can be increased. The new markers further define
the location of a joint center. The more markers that define a particular
joint center, the less effect any one occluded marker will have
on the segment length calculation.
Discussion
We now relate our
analysis back to the general issues presented in the Introduction.
Issue 1: When
should key-frame animation be used over motion capture? When should
motion capture be used over key-frame animation?
The contention
between key-frame animators and motion capture professionals has
never been greater. While key-frame animators argue that motion
capture devices will never replace human creativity and imagination,
motion capture professionals (MCPs) stress that animators could
never duplicate the realism of natural movement. Both parties have
valid points.
Using systems
like the HiRES, motion capture professionals can produce the most
realistic looking animations in the least amount of time. "A small
production team and a single actor can produce many complex animations
in an afternoon" (MAC, 1997). For jobs where complex movements need
to be produced in a short period of time, the HiRES system is the
best option. Another advantage of using systems like the HiRES is
the ability to separate the motion data from the character. This
allows MCPs to exhibit many different characters using a single
set of data. This feature saves production houses a great deal of
time and money.
While motion
capture often saves overall expense, there are certain instances
when key-frame animation is a more appropriate choice. This occurs
most often in computer storytelling. Whereas traditional animators
are only limited by their imaginations, MCPs are limited by what
their performer can do. Computer animators often create interesting
characters that defy the laws of physics and anatomy. If animators
only used motion capture devices to add movement to their characters,
stories would be much less interesting and creative.
Issue 2: What
are the ideal uses of the HiRES system?
The HiRES device
is not a replacement for traditional computer animation. It is a
tool to help animators be more productive and efficient. Like any
tool, the HiRES has ideal applications.
The HiRES is
very well suited for capturing extremely realistic motion. This
is especially important in VR applications where realism is the
prime objective. A VR user is more easily convinced when virtual
objects behave just like their natural counterparts. By using the
HiRES, animators can eliminate the human error involved in reproducing
an object's motion.
The HiRES system
can also eliminate human error from multiple character animations.
When animating many characters with identical motions such as in
a synchronized dance, the HiRES system can save animators a massive
amount of time.
The HiRES is
best suited for any application in which identical and extremely
realistic movements are required.
Issue 3: To
what level of realism can the motion data be trusted? What kind
of characters can be realistically animated with one set of data?
This question
really deserves a philosophical discussion rather than a simple
yes or no answer. The easiest way to answer this question is to
say that captured motion data is only realistic for the particular
performer. If the computer character is not an accurate representation
of the performer, then the motion data is not realistic relative
to that character (Hodgins and Pollard, 1997). Consider data captured
from an average female jogging. If this data is used to animate an
average female jogger, the motion is extremely realistic. With the
HiRES system this data may also be used to animate an average male
jogger. While the male jogger animates very easily by importing
the female data, the motion is considerably unrealistic to the computer
figure.
When realism
is a significant goal it is important to choose a performer that
accurately represents what the computer character is portraying.
As we mentioned in our analysis, it is inaccurate to animate a 6'4''
character and a 5'8'' character with the same data. The animator
must consider what level of realism is required. Some animators
may be satisfied with identical motion for different physically
shaped characters while others may want each character to have his/her
own realistic movement.
Unfortunately,
when total realism is the ultimate goal, the HiRES system falls
a little short. The segment length deviations pose a major problem
for realistically animating anatomically correct human figures.
When real humans move, their body segments do not change in length
(except for a very small amount due to joint geometry). Animating
a correct human figure with motion data that deviates 13.8% in some
cases is an incredible task. Even if the motion data can be used
to animate this figure, there is no guarantee that the motion is
realistic to a correct human model.
Issue 4: What
repercussions are there for animating only a portion of a human
figure with the HiRES system?
This is another
question that relates back to realism demands. When a human moves
any part of his body, the other body parts react and respond to
this motion. If motion over the entire body is captured this is
not a problem because the translations and rotations of one marker
affect the translations and rotations of every other marker. When
only a section of the body is captured a number of realism problems
can occur.
Consider the
captured motion of a single segment, the head. If the performer
remains still and merely rotates his head, this does not pose a
major problem. But if the performer leans forward slightly, a large
behavioral problem arises. Without the rest of the segments for reference,
the computer has no way of telling how the head got from one place
to the other. Did the performer just lean forward? Or did he walk
forward a few inches? There is no way to know for sure. The only
thing to do is develop behavioral constraints that instruct the
human model on how to move to the new location. A great deal of
planning is required in order for the human movements to be realistic.
It is a complex problem with no easy solution.
Future Work
In future work we
plan to develop a method for using the .htr (hierarchical translation
and rotation) format to properly animate anatomically correct human
figures. Our immediate goal is to find a way to deal with the large
segment length deviations. Once we can animate our models we will
then evaluate the relative accuracy of the movement. We would then
like to define a set of behavioral constraints that will allow the
realistic movement of an entire human figure by capturing motion
of only certain body segments. As a long term goal, we would like
to develop methods for using these motion techniques in real time applications
such as interactive VR environments.
Acknowledgements
We would like to
thank the Army Research Lab at Aberdeen, Maryland for use of the
ExpertVision HiRES system. We would also like to thank Motion Analysis
Corporation for supplying the dance demo data.
References
Badler,
N. I., Hollick, M., Granieri, J. (1993). Real-Time control of a virtual
human using minimal sensors. Presence: Volume 2, Number 1, 1993.
FX Fighter - Motion Capture vs. Keyframing Page. http://www.im.gte.com/FXF/fxfwhat2.html
Hodgins, J., Pollard, N. (1997). Adapting simulated behaviors for new
characters. SIGGRAPH Proceedings, 1997.
Ko, H., Badler, N.I. (1993). Straight line walking animation based
on kinematic generalization that preserves the original characteristics.
Graphics Interface '93.
Matsuba, S., Roehl, B. (1996). Using VRML. Indianapolis: Que Publishing,
1996.
Motion Analysis Corporation. EVa HiRES Version 4.0 User's Manual,
1997.
Mulder, S., Human Movement Tracking Technology. Hand Centered Studies
of Human Movement Project, Simon Fraser University. Technical Report
94-1, July 1994.
Zeltzer, D. (1992). Autonomy, interaction, and presence. Presence,
Volume 1, Number 1, 1992.
Pixar's Renderman, http://www.pixar.com/renderman/
Footnotes:
1 Keyframing is a method by which an animator specifies where an object
should be at certain frames in an animation. The computer then linearly
interpolates between the "key-frames" to generate the entire
motion sequence.
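As a small illustration of the footnote, the sketch below linearly interpolates an object's position between key-frames. The key-frame values and the function name are invented for the example; production animation packages may use other interpolation schemes.

```python
def interpolate(keyframes, frame):
    """Linearly interpolate a value at `frame` from sorted (frame, value) pairs."""
    for (f0, v0), (f1, v1) in zip(keyframes, keyframes[1:]):
        if f0 <= frame <= f1:
            t = (frame - f0) / (f1 - f0)
            return v0 + t * (v1 - v0)
    raise ValueError("frame lies outside the key-framed range")


# Hypothetical key-frames: (frame number, x position of an object).
keys = [(0, 0.0), (10, 5.0), (30, 1.0)]
positions = [interpolate(keys, f) for f in range(0, 31)]
```

Here the animator specifies only three frames, and the remaining 28 positions are generated by the computer.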
Journal of Young
Investigators. 1999. Volume Two.
Copyright © 1999 by Mark Palatucci and JYI. All rights reserved.