
Egocentric vision: Difference between revisions

From Wikipedia, the free encyclopedia
Content deleted Content added
m clean up, typo(s) fixed: an huge → a huge using AWB
Citation bot (talk | contribs)
Alter: title, template type. Add: chapter-url, chapter. Removed or converted URL. Removed parameters. Some additions/deletions were parameter name changes. | Use this bot. Report bugs. | Suggested by Headbomb | #UCB_toolbar
 
(41 intermediate revisions by 23 users not shown)
Line 1: Line 1:
{{short description|Type of computer vision for wearable cameras}}
{{multiple issues|
'''Egocentric vision''' or '''first-person vision''' is a sub-field of [[computer vision]] that entails analyzing images and videos captured by a [[wearable camera]], which is typically worn on the head or on the chest and naturally approximates the visual field of the camera wearer. Consequently, visual data capture the part of the scene on which the user focuses to carry out the task at hand and offer a valuable perspective to understand the user's activities and their context in a naturalistic setting.<ref>An Introduction to the 3rd Workshop on Egocentric (First-person) Vision, Steve Mann, Kris M. Kitani, Yong Jae Lee, M. S. Ryoo, and Alireza Fathi, IEEE Conference on Computer Vision and Pattern Recognition Workshops 2160-7508/14, 2014, IEEE {{doi|10.1109/CVPRW.2014.1338272014}}</ref>
{{more footnotes|date=January 2018}}
{{orphan|date=January 2018}}
{{underlinked|date=January 2018}}
}}
'''Egocentric vision''' or '''first-person vision''' is a sub-field of [[computer vision]] that entails analyzing images and videos captured by a [[wearable camera]], which is typically worn on the head or on the chest and naturally approximates the visual field of the camera wearer. Consequently, visual data capture the part of the scene on which the user focuses to carry out the task at hand and offer a valuable perspective to understand the user's activities and their context in a naturalistic setting.


The wearable camera looking forwards is often supplemented with a camera looking inward at the user’s eye and able to measure a user’s eye gaze, which is useful to reveal attention and to better understand the
The wearable camera looking forwards is often supplemented with a camera looking inward at the user's eye and able to measure a user's eye gaze, which is useful to reveal attention and to better understand the
user’s activity and intentions.
user's activity and intentions.


== History ==
== History ==


The idea of using a wearable camera to gather visual data from a first-person perspective dates back to the 70s, when [[Steve Mann]] invented "Eye Glass", a device that, when worn, causes the human eye itself to effectively become both an electronic camera and a television display. But it was only after the introduction to the market of the [[Microsoft SenseCam]] in 2006 that wearable cameras were used for the first time in large scale experimental health research works.<ref>Doherty, A. R., Hodges, S. E., King, A. C., Smeaton, A. F., Berry, E., Moulin, C. J., ... & Foster, C. (2013). Wearable cameras in health. American journal of preventive medicine, 44(3), 320-323.</ref> The interest of the computer vision community into the egocentric paradigm has been arising slowly entering the 2010s and it is rapidly growing in recent years,<ref>Bolanos, M., Dimiccoli, M., & Radeva, P. (2017). Toward storytelling from visual lifelogging: An overview. IEEE Transactions on Human-Machine Systems, 47(1), 77-90.</ref> boosted by both the impressive advanced in the field of [[wearable technology]] and by the increasingly number of potential applications.
The idea of using a wearable camera to gather visual data from a first-person perspective dates back to the 70s, when [[Steve Mann (inventor)|Steve Mann]] invented "Digital Eye Glass", a device that, when worn, causes the human eye itself to effectively become both an electronic camera and a television display.<ref>Mann, S. (1998). [https://core.ac.uk/download/pdf/24060215.pdf Humanistic computing:" WearComp" as a new framework and application for intelligent signal processing.] Proceedings of the IEEE, 86(11), 2123-2151.</ref>


Subsequently, wearable cameras were used for health-related applications in the context of Humanistic Intelligence<ref>Haykin, Simon S., and Bart Kosko. Intelligent signal processing. Wiley-IEEE Press, 2001.</ref> and Wearable AI.<ref>“Wearable AI”, Steve Mann, Li-Te Cheng, John Robinson, Kaoru Sumi, Toyoaki Nishida, Soichiro Matsushita, Ömer Faruk Özer, Oguz Özun, C. Öncel Tüzel, Volkan Atalay, A. Enis Cetin, Joshua Anhalt, Asim Smailagic, Daniel P. Siewiorek, Francine Gemperle, Daniel Salber, Weber, Jim Beck, Jim Jennings, and David A. Ross, IEEE Intelligent Systems 16(3), 2001, Pages 0(cover) to 53.</ref> Egocentric vision is best done from the point-of-eye, but may also be done by way of a neck-worn camera when eyeglasses would be in-the-way.<ref name="Mann">{{Cite book|last=Mann|first=S.|title=Digest of Papers. Fourth International Symposium on Wearable Computers |chapter=Telepointer: Hands-free completely self-contained wearable visual augmented reality without headwear and without any infrastructural reliance |date=October 2000|chapter-url=https://ieeexplore.ieee.org/document/888489|pages=177–178|doi=10.1109/ISWC.2000.888489|isbn=0-7695-0795-6 |s2cid=6036868 }}</ref> This neck-worn variant was popularized by way of the [[Microsoft SenseCam]] in 2006 for experimental health research works.<ref name="auto1">Doherty, A. R., Hodges, S. E., King, A. C., Smeaton, A. F., Berry, E., Moulin, C. J., ... & Foster, C. (2013). [https://ora.ox.ac.uk/objects/uuid:53baca7d-7679-42fb-87f9-c1ddab2ef779/download_file?safe_filename=Wearable%2Bcameras%2Bin%2Bhealth&file_format=application%2Fpdf&type_of_work=Journal+article Wearable cameras in health.] American Journal of Preventive Medicine, 44(3), 320-323.</ref> The interest of the computer vision community into the egocentric paradigm has been arising slowly entering the 2010s and it is rapidly growing in recent years,<ref>Bolanos, M., Dimiccoli, M., & Radeva, P. 
(2017). [https://arxiv.org/pdf/1507.06120 Toward storytelling from visual lifelogging: An overview.] IEEE Transactions on Human-Machine Systems, 47(1), 77-90.</ref> boosted by both the impressive advances in the field of [[wearable technology]] and by the increasing number of potential applications.
The prototypical first-person vision system described by Kanade and Hebert,<ref>Kanade, T., & Hebert, M. (2012). First-person vision. Proceedings of the IEEE, 100(8), 2442-2453.</ref> in 2012 is composed by three basic components: a localization component able to estimate the surrounding, a recognition component able to identify object and people, and an [[activity recognition]] component, able to provide information about the current activity of the user. Together, these three components provide a complete situational awareness of the user, which in turn can be used to provide assistance to the itself or to the caregiver. Following this idea, the first computational techniques for egocentric analysis focused on hand-related activity recognition <ref>Fathi, A., Farhadi, A., & Rehg, J. M. (2011, November). Understanding egocentric activities. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 407-414). IEEE.</ref> and social interaction analysis.<ref>Fathi, A., Hodgins, J. K., & Rehg, J. M. (2012, June). Social interactions: A first-person perspective. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on (pp. 1226-1233). IEEE.</ref> Also, given the unconstrained nature of the video and the huge amount of data generated, [[Shot transition detection|temporal segmentation]]<ref>Poleg, Y., Arora, C., & Peleg, S. (2014). Temporal segmentation of egocentric videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2537-2544).</ref> and [[Video synopsis|summarization]]<ref>Lee, Y. J., Ghosh, J., & Grauman, K. (2012, June). Discovering important people and objects for egocentric video summarization. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on (pp. 1346-1353). IEEE.</ref> where among the first problem addressed. After almost ten years of egocentric vision (2007 - 2017), the field is still undergoing diversification. Emerging research topics include:


The prototypical first-person vision system described by Kanade and Hebert,<ref>{{Cite journal|last1=Kanade|first1=Takeo|last2=Hebert|first2=Martial|date=August 2012|title=First-Person Vision|url=https://ieeexplore.ieee.org/document/6232429|journal=Proceedings of the IEEE|volume=100|issue=8|pages=2442–2453|doi=10.1109/JPROC.2012.2200554|s2cid=33060600 |issn=1558-2256}}</ref> in 2012 is composed by three basic components: a localization component able to estimate the surrounding, a recognition component able to identify object and people, and an [[activity recognition]] component, able to provide information about the current activity of the user. Together, these three components provide a complete situational awareness of the user, which in turn can be used to provide assistance to the user or to the caregiver. Following this idea, the first computational techniques for egocentric analysis focused on hand-related activity recognition <ref>Fathi, A., Farhadi, A., & Rehg, J. M. (2011, November). [https://smartech.gatech.edu/bitstream/handle/1853/42262/Understanding%20Egocentric%20Activities.pdf?sequence=1 Understanding egocentric activities.] In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 407-414). IEEE.</ref> and social interaction analysis.<ref name="auto">Fathi, A., Hodgins, J. K., & Rehg, J. M. (2012, June). [https://smartech.gatech.edu/bitstream/handle/1853/44557/CVPR12.pdf?sequence=1 Social interactions: A first-person perspective.] In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on (pp. 1226-1233). IEEE.</ref> Also, given the unconstrained nature of the video and the huge amount of data generated, [[Shot transition detection|temporal segmentation]]<ref>Poleg, Y., Arora, C., & Peleg, S. (2014). 
[https://www.cv-foundation.org/openaccess/content_cvpr_2014/papers/Poleg_Temporal_Segmentation_of_2014_CVPR_paper.pdf Temporal segmentation of egocentric videos.] In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2537-2544).</ref> and [[Video synopsis|summarization]]<ref>Lee, Y. J., Ghosh, J., & Grauman, K. (2012, June). [http://ideal.ece.utexas.edu/pubs/pdf/2012/YLee2012.pdf Discovering important people and objects for egocentric video summarization.] In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on (pp. 1346-1353). IEEE.</ref> were among the first problems addressed. After almost ten years of egocentric vision (2007 - 2017), the field is still undergoing diversification. Emerging research topics include:
* Social saliency estimation<ref>Park, H. S., Jain, E., & Sheikh, Y. (2012). 3d social saliency from head-mounted cameras. In Advances in Neural Information Processing Systems (pp. 422-430).</ref>

* Social saliency estimation<ref>Park, H. S., Jain, E., & Sheikh, Y. (2012). [https://www.cs.cmu.edu/~hyunsoop/nips/NIPS12.pdf 3d social saliency from head-mounted cameras.] In Advances in Neural Information Processing Systems (pp. 422-430).</ref>
* Multi-agent egocentric vision systems
* Multi-agent egocentric vision systems
* Privacy preserving techniques and applications
* Privacy preserving techniques and applications
* Attention-based activity analysis<ref>{{Cite book|last1=Su|first1=Yu-Chuan|last2=Grauman|first2=Kristen|title=Computer Vision – ECCV 2016 |chapter=Detecting Engagement in Egocentric Video |date=2016|editor-last=Leibe|editor-first=Bastian|editor2-last=Matas|editor2-first=Jiri|editor3-last=Sebe|editor3-first=Nicu|editor4-last=Welling|editor4-first=Max|chapter-url=https://link.springer.com/chapter/10.1007/978-3-319-46454-1_28|series=Lecture Notes in Computer Science |volume=9909 |language=en|location=Cham|publisher=Springer International Publishing|pages=454–471|doi=10.1007/978-3-319-46454-1_28|isbn=978-3-319-46454-1|arxiv=1604.00906|s2cid=1599840 }}</ref>
* Attention-based activity analysis<ref>Su, Y. C., & Grauman, K. (2016, October). Detecting engagement in egocentric video. In European Conference on Computer Vision (pp. 454-471). Springer International Publishing.</ref>
* Social interaction analysis<ref name="auto"/>
* Social interaction analysis<ref>Fathi, A., Hodgins, J. K., & Rehg, J. M. (2012, June). Social interactions: A first-person perspective. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on (pp. 1226-1233). IEEE.</ref>
* Hand pose analysis<ref>Rogez, G., Supancic, J. S., & Ramanan, D. (2015). First-person pose recognition using egocentric workspaces. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4325-4333).</ref>
* Hand pose analysis<ref>Rogez, G., Supancic, J. S., & Ramanan, D. (2015). [http://openaccess.thecvf.com/content_cvpr_2015/papers/Rogez_First-Person_Pose_Recognition_2015_CVPR_paper.pdf First-person pose recognition using egocentric workspaces.] In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4325-4333).</ref>
* Ego graphical User Interfaces (EUI)<ref>Mann, S., Janzen, R., Ai, T., Yasrebi, S. N., Kawwa, J., & Ali, M. A. (2014, May). [http://wearcam.org/abaq.pdf Toposculpting: Computational lightpainting and wearable computational photography for abakographic user interfaces.] In Electrical and Computer Engineering (CCECE), 2014 IEEE 27th Canadian Conference on (pp. 1-10). IEEE.</ref>
* Ego graphical User Interfaces (EUI)
* Understanding social dynamics and attention<ref>Bettadapura, V., Essa, I., & Pantofaru, C. (2015, January). Egocentric field-of-view localization using first-person point-of-view devices. In Applications of Computer Vision (WACV), 2015 IEEE Winter Conference on (pp. 626-633). IEEE</ref>
* Understanding social dynamics and attention<ref>Bettadapura, V., Essa, I., & Pantofaru, C. (2015, January). [https://arxiv.org/pdf/1510.02073 Egocentric field-of-view localization using first-person point-of-view devices.] In Applications of Computer Vision (WACV), 2015 IEEE Winter Conference on (pp. 626-633). IEEE</ref>
* Revisiting robotic vision and [[machine vision]] as egocentric sensing <ref>{{Cite journal|last1=Ji|first1=Peng|last2=Song|first2=Aiguo|last3=Xiong|first3=Pengwen|last4=Yi|first4=Ping|last5=Xu|first5=Xiaonong|last6=Li|first6=Huijun|date=2017-09-01|title=Egocentric-Vision based Hand Posture Control System for Reconnaissance Robots|url=https://doi.org/10.1007/s10846-016-0440-2|journal=Journal of Intelligent & Robotic Systems|language=en|volume=87|issue=3|pages=583–599|doi=10.1007/s10846-016-0440-2|s2cid=254648250 |issn=1573-0409}}</ref>
* Revisiting robotic vision and [[machine vision]] as egocentric sensing
* Activity forecasting<ref>{{Cite book|last1=Bokhari|first1=Syed Zahir|last2=Kitani|first2=Kris M.|title=Computer Vision – ACCV 2016 |chapter=Long-Term Activity Forecasting Using First-Person Vision |date=2017|editor-last=Lai|editor-first=Shang-Hong|editor2-last=Lepetit|editor2-first=Vincent|editor3-last=Nishino|editor3-first=Ko|editor4-last=Sato|editor4-first=Yoichi|chapter-url=https://link.springer.com/chapter/10.1007/978-3-319-54193-8_22|series=Lecture Notes in Computer Science |volume=10115 |language=en|location=Cham|publisher=Springer International Publishing|pages=346–360|doi=10.1007/978-3-319-54193-8_22|isbn=978-3-319-54193-8}}</ref>
* Activity forecasting<ref>Bokhari, S. Z., & Kitani, K. M. (2016, November). Long-Term Activity Forecasting Using First-Person Vision. In Asian Conference on Computer Vision (pp. 346-360). Springer, Cham</ref>


== Technical challenges ==
== Technical challenges ==
[[File:Egomotion-odometry.gif|thumb|[[Egomotion]] estimation]]

Today's wearable cameras are small and lightweight digital recording devices that can acquire images and videos automatically, without the user intervention, with different resolutions and frame rates, and from a first-person point of view. Therefore, wearable cameras are naturally primed to gather visual information from our everyday interactions since they offer an intimate perspective of the visual field of the camera wearer.
Today's wearable cameras are small and lightweight digital recording devices that can acquire images and videos automatically, without the user intervention, with different resolutions and frame rates, and from a first-person point of view. Therefore, wearable cameras are naturally primed to gather visual information from our everyday interactions since they offer an intimate perspective of the visual field of the camera wearer.


Depending on the frame rate, it is common to distinguish between photo-cameras (also called lifelogging cameras) and video-cameras.
Depending on the frame rate, it is common to distinguish between photo-cameras (also called lifelogging cameras) and video-cameras.
* The former (e.g., [[Narrative Clip]] and [[Microsoft SenseCam]]), are commonly worn on the chest, and are characterized by a very low frame rate (up to 2fpm) that allows to capture images over a long period of time without the need of recharging the battery. Consequently, they offer considerable potential for inferring knowledge about e.g. behaviour patterns, habits or lifestyle of the user. However, due the low frame-rate and the free motion of the camera, temporally adjacent images typically present abrupt appearance changes so that motion features cannot be reliably estimated.
* The former (e.g., [[Narrative Clip]] and [[Microsoft SenseCam]]), are commonly worn on the chest, and are characterized by a very low frame rate (up to 2fpm) that allows to capture images over a long period of time without the need of recharging the battery. Consequently, they offer considerable potential for inferring knowledge about e.g. behaviour patterns, habits or lifestyle of the user. However, due to the low frame-rate and the free motion of the camera, temporally adjacent images typically present abrupt appearance changes so that motion features cannot be reliably estimated.
* The latter (e.g., [[Google Glass]], [[GoPro]]), are commonly mounted on the head, and capture conventional video (around 35fps) that allows to capture fine temporal details of interactions. Consequently, they offer potential for in-depth analysis of daily or special activities. However, since the camera is moving with the wearer head, it becomes more difficult to estimate the global motion of the wearer and in the case of abrupt movements, the images can result blurred.
* The latter (e.g., [[Google Glass]], [[GoPro]]), are commonly mounted on the head, and capture conventional video (around 35fps) that allows to capture fine temporal details of interactions. Consequently, they offer potential for in-depth analysis of daily or special activities. However, since the camera is moving with the wearer head, it becomes more difficult to estimate the global motion of the wearer and in the case of abrupt movements, the images can result blurred.


In both cases, since the camera is worn in a naturalistic setting, visual data present a huge variability in terms of illumination conditions and object appearance.
In both cases, since the camera is worn in a naturalistic setting, visual data present a huge variability in terms of illumination conditions and object appearance.
Moreover, the camera wearer is not visible in the image and what he/she is doing has to be inferred from the information in the visual field of the camera, implying that important information about the wearer, such for instance as pose or facial expression estimation, is not available.
Moreover, the camera wearer is not visible in the image and what he/she is doing has to be inferred from the information in the visual field of the camera, implying that important information about the wearer, such for instance as [[pose (computer vision)|pose]] or facial expression estimation, is not available.


== Applications ==
== Applications ==


A collection of studies published in a special theme issue of the American Journal of Preventive Medicine<ref>Doherty, A. R., Hodges, S. E., King, A. C., Smeaton, A. F., Berry, E., Moulin, C. J., ... & Foster, C. (2013). Wearable cameras in health. American journal of preventive medicine, 44(3), 320-323.</ref> has demonstrated the potential of lifelogs captured through wearable cameras from a number of viewpoints. In particular, it has been shown that used as a tool for understanding and tracking lifestyle behaviour, lifelogs would enable the prevention of noncommunicable diseases associated to unhealthy trends and risky profiles (such as obesity, depression, etc.). In addition, used as a tool of re-memory cognitive training, lifelogs would enable the prevention of cognitive and functional decline in elderly people.
A collection of studies published in a special theme issue of the American Journal of Preventive Medicine<ref name="auto1"/> has demonstrated the potential of lifelogs captured through wearable cameras from a number of viewpoints. In particular, it has been shown that used as a tool for understanding and tracking lifestyle behaviour, lifelogs would enable the prevention of noncommunicable diseases associated to unhealthy trends and risky profiles (such as obesity, depression, etc.). In addition, used as a tool of re-memory cognitive training, lifelogs would enable the prevention of cognitive and functional decline in elderly people.


More recently, egocentric cameras have been used to study human and animal cognition, human-human social interaction, human-robot interaction, human expertise in complex tasks.
More recently, egocentric cameras have been used to study human and animal cognition, human-human social interaction, human-robot interaction, human expertise in complex tasks.
Other applications include navigation/assistive technologies for the blind,<ref>Yagi, T., Mangalam, K., Yonetani, R., & Sato, Y. (2017). Future Person Localization in First-Person Videos. arXiv preprint arXiv:1711.11217.</ref> monitoring and assistance of industrial workflows.<ref>Leelasawassuk, T., Damen, D., & Mayol-Cuevas, W. (2017, March). Automated capture and delivery of assistive task guidance with an eyewear computer: the GlaciAR system</ref><ref>Edmunds, S. R., Rozga, A., Li, Y., Karp, E. A., Ibanez, L. V., Rehg, J. M., & Stone, W. L. (2017). Brief Report: Using a Point-of-View Camera to Measure Eye Gaze in Young Children with Autism Spectrum Disorder During Naturalistic Social Interactions: A Pilot Study. Journal of autism and developmental disorders, 47(3), 898-904.</ref>
Other applications include navigation/assistive technologies for the blind,<ref>Yagi, T., Mangalam, K., Yonetani, R., & Sato, Y. (2017). Future Person Localization in First-Person Videos. arXiv preprint {{arXiv|1711.11217}}.</ref> monitoring and assistance of industrial workflows,<ref>{{Cite book|last1=Leelasawassuk|first1=Teesid|last2=Damen|first2=Dima|last3=Mayol-Cuevas|first3=Walterio|title=Proceedings of the 8th Augmented Human International Conference |chapter=Automated capture and delivery of assistive task guidance with an eyewear computer |date=2017-03-16|chapter-url=https://doi.org/10.1145/3041164.3041185|series=AH '17|location=New York, NY, USA|publisher=Association for Computing Machinery|pages=1–9|doi=10.1145/3041164.3041185|hdl=1983/ed89a4ab-f375-40b7-bdf4-b3f97925a0fe |isbn=978-1-4503-4835-5|s2cid=10231349 |url=https://research-information.bris.ac.uk/en/publications/ed89a4ab-f375-40b7-bdf4-b3f97925a0fe }}</ref><ref>Edmunds, S. R., Rozga, A., Li, Y., Karp, E. A., Ibanez, L. V., Rehg, J. M., & Stone, W. L. (2017). [https://www.academia.edu/download/55008137/Edmunds_et_al_2017.pdf Brief Report: Using a Point-of-View Camera to Measure Eye Gaze in Young Children with Autism Spectrum Disorder During Naturalistic Social Interactions: A Pilot Study.]{{dead link|date=July 2022|bot=medic}}{{cbignore|bot=medic}} Journal of Autism and Developmental Disorders, 47(3), 898-904.</ref> and [[augmented reality]] interfaces.<ref name="Mann"/>


== See also ==
== See also ==
{{Div col|2|colwidth=}}
{{Div col}}
* [[Smartglasses]]
* [[Eye tracking]]
* [[Eye tracking]]
* [[Lifelog]]
* [[Lifelog]]
* [[Quantified self]]
* [[Quantified self]]
* [[Smartglasses]]
* [[Sousveillance]]
* [[Sousveillance]]
* [[Steve Mann]]
{{Div col end}}
{{Div col end}}



Latest revision as of 11:37, 4 September 2023

Egocentric vision or first-person vision is a sub-field of computer vision that entails analyzing images and videos captured by a wearable camera, which is typically worn on the head or on the chest and naturally approximates the visual field of the camera wearer. Consequently, the visual data capture the part of the scene on which the user focuses to carry out the task at hand and offer a valuable perspective for understanding the user's activities and their context in a naturalistic setting.[1]

The forward-facing wearable camera is often supplemented with an inward-facing camera that measures the user's eye gaze, which helps to reveal attention and to better understand the user's activity and intentions.

History


The idea of using a wearable camera to gather visual data from a first-person perspective dates back to the 1970s, when Steve Mann invented "Digital Eye Glass", a device that, when worn, causes the human eye itself to effectively become both an electronic camera and a television display.[2]

Subsequently, wearable cameras were used for health-related applications in the context of Humanistic Intelligence[3] and Wearable AI.[4] Egocentric vision is best done from the point-of-eye, but may also be done with a neck-worn camera when eyeglasses would be in the way.[5] This neck-worn variant was popularized by the Microsoft SenseCam in 2006 for experimental health research.[6] Interest in the egocentric paradigm within the computer vision community arose slowly at the start of the 2010s and has grown rapidly in recent years,[7] boosted both by advances in wearable technology and by the increasing number of potential applications.

The prototypical first-person vision system described by Kanade and Hebert[8] in 2012 is composed of three basic components: a localization component able to estimate the surroundings, a recognition component able to identify objects and people, and an activity recognition component able to provide information about the current activity of the user. Together, these three components provide complete situational awareness of the user, which in turn can be used to provide assistance to the user or to the caregiver. Following this idea, the first computational techniques for egocentric analysis focused on hand-related activity recognition[9] and social interaction analysis.[10] Also, given the unconstrained nature of the video and the huge amount of data generated, temporal segmentation[11] and summarization[12] were among the first problems addressed. After almost ten years of egocentric vision (2007–2017), the field is still undergoing diversification. Emerging research topics include:

  • Social saliency estimation[13]
  • Multi-agent egocentric vision systems
  • Privacy preserving techniques and applications
  • Attention-based activity analysis[14]
  • Social interaction analysis[10]
  • Hand pose analysis[15]
  • Ego graphical User Interfaces (EUI)[16]
  • Understanding social dynamics and attention[17]
  • Revisiting robotic vision and machine vision as egocentric sensing[18]
  • Activity forecasting[19]
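
As a rough illustration of the three-component system attributed to Kanade and Hebert above, the following Python sketch wires a localization, a recognition and an activity recognition component into one object. All class, field and function names are hypothetical, the stub components return fixed values, and passing the recognized entities to the activity classifier is an assumption made for this sketch, not something the cited paper is stated to do.

```python
from dataclasses import dataclass
from typing import Callable, List

Frame = bytes  # placeholder for an image; a real system would use an array type


@dataclass
class SituationalAwareness:
    location: str          # output of the localization component
    entities: List[str]    # objects and people from the recognition component
    activity: str          # output of the activity recognition component


@dataclass
class FirstPersonVisionSystem:
    # Each component is modelled as a plain callable (hypothetical interface).
    localize: Callable[[Frame], str]
    recognize: Callable[[Frame], List[str]]
    classify_activity: Callable[[Frame, List[str]], str]

    def process(self, frame: Frame) -> SituationalAwareness:
        """Run the three components on one frame and combine their outputs."""
        entities = self.recognize(frame)
        return SituationalAwareness(
            location=self.localize(frame),
            entities=entities,
            activity=self.classify_activity(frame, entities),
        )


# Stub components standing in for real models (fixed, made-up outputs).
system = FirstPersonVisionSystem(
    localize=lambda frame: "kitchen",
    recognize=lambda frame: ["cup", "kettle"],
    classify_activity=lambda frame, entities: (
        "making tea" if "kettle" in entities else "unknown"
    ),
)

print(system.process(b""))
```

Keeping the components behind simple callables means each one could be replaced independently, which mirrors the way the text treats them as separate building blocks.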

Technical challenges

[Figure: Egomotion estimation]

Today's wearable cameras are small, lightweight digital recording devices that can acquire images and videos automatically, without user intervention, at different resolutions and frame rates, and from a first-person point of view. Wearable cameras are therefore well suited to gathering visual information from everyday interactions, since they offer an intimate perspective of the visual field of the camera wearer.

Depending on the frame rate, it is common to distinguish between photo-cameras (also called lifelogging cameras) and video-cameras.

  • The former (e.g., Narrative Clip and Microsoft SenseCam) are commonly worn on the chest and are characterized by a very low frame rate (up to 2 fpm), which allows images to be captured over a long period of time without recharging the battery. Consequently, they offer considerable potential for inferring knowledge about, e.g., behaviour patterns, habits or lifestyle of the user. However, due to the low frame rate and the free motion of the camera, temporally adjacent images typically present abrupt appearance changes, so motion features cannot be reliably estimated.
  • The latter (e.g., Google Glass, GoPro) are commonly mounted on the head and capture conventional video (around 35 fps), which allows fine temporal details of interactions to be captured. Consequently, they offer potential for in-depth analysis of daily or special activities. However, since the camera moves with the wearer's head, it becomes more difficult to estimate the global motion of the wearer, and in the case of abrupt movements the images can be blurred.
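
To make the contrast between the two camera types concrete, the short Python sketch below works through the arithmetic for the two frame rates quoted above, reading "fpm" as frames per minute. The 12-hour wearing time is an assumed figure chosen only for illustration, and the function names are hypothetical.

```python
SECONDS_PER_MINUTE = 60
SECONDS_PER_HOUR = 3600


def frames_captured(rate_per_second: float, hours_worn: float) -> int:
    """Number of frames recorded over a wearing session."""
    return round(rate_per_second * hours_worn * SECONDS_PER_HOUR)


def gap_between_frames(rate_per_second: float) -> float:
    """Time in seconds separating two temporally adjacent frames."""
    return 1.0 / rate_per_second


PHOTO_RATE = 2 / SECONDS_PER_MINUTE  # 2 frames per minute, as frames per second
VIDEO_RATE = 35.0                    # 35 frames per second
HOURS_WORN = 12                      # assumption for illustration only

for name, rate in [("photo-camera", PHOTO_RATE), ("video-camera", VIDEO_RATE)]:
    print(
        name,
        frames_captured(rate, HOURS_WORN), "frames,",
        round(gap_between_frames(rate), 3), "s between adjacent frames",
    )
```

Under these assumptions the photo-camera records 1,440 images that are 30 seconds apart, while the video-camera records about 1.5 million frames less than 0.03 seconds apart. The 30-second gap is consistent with the abrupt appearance changes between adjacent lifelogging images, and the frame count is consistent with the large data volume of head-mounted video.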

In both cases, since the camera is worn in a naturalistic setting, visual data present huge variability in illumination conditions and object appearance. Moreover, the camera wearer is not visible in the image, and what he or she is doing has to be inferred from the information in the visual field of the camera. This implies that important information about the wearer, such as pose or facial expression, is not available.

Applications


A collection of studies published in a special theme issue of the American Journal of Preventive Medicine[6] has demonstrated the potential of lifelogs captured through wearable cameras from a number of viewpoints. In particular, it has been shown that, used as a tool for understanding and tracking lifestyle behaviour, lifelogs would enable the prevention of noncommunicable diseases associated with unhealthy trends and risky profiles (such as obesity, depression, etc.). In addition, used as a tool for re-memory cognitive training, lifelogs would enable the prevention of cognitive and functional decline in elderly people.

More recently, egocentric cameras have been used to study human and animal cognition, human-human social interaction, human-robot interaction, and human expertise in complex tasks. Other applications include navigation and assistive technologies for the blind,[20] monitoring and assistance of industrial workflows,[21][22] and augmented reality interfaces.[5]

See also


References

  1. An Introduction to the 3rd Workshop on Egocentric (First-person) Vision, Steve Mann, Kris M. Kitani, Yong Jae Lee, M. S. Ryoo, and Alireza Fathi, IEEE Conference on Computer Vision and Pattern Recognition Workshops 2160-7508/14, 2014, IEEE doi:10.1109/CVPRW.2014.1338272014
  2. Mann, S. (1998). Humanistic computing: "WearComp" as a new framework and application for intelligent signal processing. Proceedings of the IEEE, 86(11), 2123-2151.
  3. Haykin, Simon S., and Bart Kosko. Intelligent signal processing. Wiley-IEEE Press, 2001.
  4. “Wearable AI”, Steve Mann, Li-Te Cheng, John Robinson, Kaoru Sumi, Toyoaki Nishida, Soichiro Matsushita, Ömer Faruk Özer, Oguz Özun, C. Öncel Tüzel, Volkan Atalay, A. Enis Cetin, Joshua Anhalt, Asim Smailagic, Daniel P. Siewiorek, Francine Gemperle, Daniel Salber, Weber, Jim Beck, Jim Jennings, and David A. Ross, IEEE Intelligent Systems 16(3), 2001, Pages 0(cover) to 53.
  5. Mann, S. (October 2000). "Telepointer: Hands-free completely self-contained wearable visual augmented reality without headwear and without any infrastructural reliance". Digest of Papers. Fourth International Symposium on Wearable Computers. pp. 177–178. doi:10.1109/ISWC.2000.888489. ISBN 0-7695-0795-6. S2CID 6036868.
  6. Doherty, A. R., Hodges, S. E., King, A. C., Smeaton, A. F., Berry, E., Moulin, C. J., ... & Foster, C. (2013). Wearable cameras in health. American Journal of Preventive Medicine, 44(3), 320-323.
  7. Bolanos, M., Dimiccoli, M., & Radeva, P. (2017). Toward storytelling from visual lifelogging: An overview. IEEE Transactions on Human-Machine Systems, 47(1), 77-90.
  8. Kanade, Takeo; Hebert, Martial (August 2012). "First-Person Vision". Proceedings of the IEEE. 100 (8): 2442–2453. doi:10.1109/JPROC.2012.2200554. ISSN 1558-2256. S2CID 33060600.
  9. Fathi, A., Farhadi, A., & Rehg, J. M. (2011, November). Understanding egocentric activities. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 407-414). IEEE.
  10. Fathi, A., Hodgins, J. K., & Rehg, J. M. (2012, June). Social interactions: A first-person perspective. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on (pp. 1226-1233). IEEE.
  11. Poleg, Y., Arora, C., & Peleg, S. (2014). Temporal segmentation of egocentric videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2537-2544).
  12. Lee, Y. J., Ghosh, J., & Grauman, K. (2012, June). Discovering important people and objects for egocentric video summarization. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on (pp. 1346-1353). IEEE.
  13. ^ Park, H. S., Jain, E., & Sheikh, Y. (2012). 3d social saliency from head-mounted cameras. In Advances in Neural Information Processing Systems (pp. 422-430).
  14. ^ Su, Yu-Chuan; Grauman, Kristen (2016). "Detecting Engagement in Egocentric Video". In Leibe, Bastian; Matas, Jiri; Sebe, Nicu; Welling, Max (eds.). Computer Vision – ECCV 2016. Lecture Notes in Computer Science. Vol. 9909. Cham: Springer International Publishing. pp. 454–471. arXiv:1604.00906. doi:10.1007/978-3-319-46454-1_28. ISBN 978-3-319-46454-1. S2CID 1599840.
  15. ^ Rogez, G., Supancic, J. S., & Ramanan, D. (2015). First-person pose recognition using egocentric workspaces. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4325-4333).
  16. ^ Mann, S., Janzen, R., Ai, T., Yasrebi, S. N., Kawwa, J., & Ali, M. A. (2014, May). Toposculpting: Computational lightpainting and wearable computational photography for abakographic user interfaces. In Electrical and Computer Engineering (CCECE), 2014 IEEE 27th Canadian Conference on (pp. 1-10). IEEE.
  17. ^ Bettadapura, V., Essa, I., & Pantofaru, C. (2015, January). Egocentric field-of-view localization using first-person point-of-view devices. In Applications of Computer Vision (WACV), 2015 IEEE Winter Conference on (pp. 626-633). IEEE.
  18. ^ Ji, Peng; Song, Aiguo; Xiong, Pengwen; Yi, Ping; Xu, Xiaonong; Li, Huijun (2017-09-01). "Egocentric-Vision based Hand Posture Control System for Reconnaissance Robots". Journal of Intelligent & Robotic Systems. 87 (3): 583–599. doi:10.1007/s10846-016-0440-2. ISSN 1573-0409. S2CID 254648250.
  19. ^ Bokhari, Syed Zahir; Kitani, Kris M. (2017). "Long-Term Activity Forecasting Using First-Person Vision". In Lai, Shang-Hong; Lepetit, Vincent; Nishino, Ko; Sato, Yoichi (eds.). Computer Vision – ACCV 2016. Lecture Notes in Computer Science. Vol. 10115. Cham: Springer International Publishing. pp. 346–360. doi:10.1007/978-3-319-54193-8_22. ISBN 978-3-319-54193-8.
  20. ^ Yagi, T., Mangalam, K., Yonetani, R., & Sato, Y. (2017). Future Person Localization in First-Person Videos. arXiv preprint arXiv:1711.11217.
  21. ^ Leelasawassuk, Teesid; Damen, Dima; Mayol-Cuevas, Walterio (2017-03-16). "Automated capture and delivery of assistive task guidance with an eyewear computer". Proceedings of the 8th Augmented Human International Conference. AH '17. New York, NY, USA: Association for Computing Machinery. pp. 1–9. doi:10.1145/3041164.3041185. hdl:1983/ed89a4ab-f375-40b7-bdf4-b3f97925a0fe. ISBN 978-1-4503-4835-5. S2CID 10231349.
  22. ^ Edmunds, S. R., Rozga, A., Li, Y., Karp, E. A., Ibanez, L. V., Rehg, J. M., & Stone, W. L. (2017). Brief Report: Using a Point-of-View Camera to Measure Eye Gaze in Young Children with Autism Spectrum Disorder During Naturalistic Social Interactions: A Pilot Study. Journal of Autism and Developmental Disorders, 47(3), 898-904.