Information Assimilation 
Multimedia Surveillance and Monitoring Systems



·       Data set used for this research can be found here.

·       System Demos (WMV files):

o   Running event

o   Door knocking event


What is “Information Assimilation”?

Information assimilation refers to a process of combining the sensory and non-sensory information obtained from asynchronous multifarious sources using the context and past experience.



It is increasingly accepted that most surveillance and monitoring tasks are performed better with multiple types of sensors than with a single type. Most surveillance systems therefore now use sensors such as microphones, motion detectors and RFIDs in addition to video cameras. However, different sensors usually provide the sensed data in different formats and at different rates. For example, video may be captured at a frame rate that differs from the rate at which audio samples are obtained, and even two video sources can have different frame rates. Moreover, the processing time differs across types of data. Because of this asynchrony and diversity among streams, assimilating information in order to accomplish an analysis task is a challenging research problem.

Event detection is one of the fundamental analysis tasks in multimedia surveillance and monitoring systems. In this research, we propose an information assimilation framework for event detection in multimedia surveillance and monitoring systems.

Events are usually not impulse phenomena in the real world; they occur over an interval of time. Based on different granularity levels in time, location, number of objects and their activities, an event can be a “compound-event” or simply an “atomic-event”. We define events, atomic-events and compound-events as follows:

Event is a physical reality that consists of one or more living or non-living real world objects (who) having one or more attributes (of type) being involved in one or more activities (what) at a location (where) over a period of time (when).

Atomic-event is an event in which exactly one object having one or more attributes is involved in exactly one activity.

Compound-event is the composition of two or more different atomic-events.

A compound-event such as “a person is running and shouting in the corridor” can be decomposed into its constituent atomic-events: “a person is running in the corridor” and “a person is shouting in the corridor”. The atomic-events in a compound-event can occur simultaneously, as in the example given above, or one after another. For example, the compound-event “A person walked through the corridor, stood near the meeting room, and then ran to the other side of the corridor” consists of three atomic-events: “a person walked through the corridor”, followed by “person stood near the meeting room”, followed by “person ran to the other side of the corridor”.
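The definitions above can be sketched as a small data model. This is only an illustrative sketch: the class names, field names and the `is_simultaneous` helper are hypothetical and are not taken from the framework itself.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AtomicEvent:
    """Exactly one object involved in exactly one activity."""
    who: str                          # the object, e.g. "person"
    what: str                         # the single activity, e.g. "running"
    where: str                        # the location, e.g. "corridor"
    when: Tuple[float, float]         # (start, end) in seconds: an interval, not an instant
    attributes: Tuple[str, ...] = ()  # optional attributes of the object


@dataclass(frozen=True)
class CompoundEvent:
    """Composition of two or more different atomic-events."""
    parts: Tuple[AtomicEvent, ...]

    def __post_init__(self):
        # "Different" is checked by value: duplicates do not count twice.
        if len(set(self.parts)) < 2:
            raise ValueError("a compound-event needs at least two different atomic-events")

    @property
    def when(self) -> Tuple[float, float]:
        """Interval spanned by all constituent atomic-events."""
        return (min(p.when[0] for p in self.parts),
                max(p.when[1] for p in self.parts))

    def is_simultaneous(self) -> bool:
        """True if all parts share a common overlap in time."""
        latest_start = max(p.when[0] for p in self.parts)
        earliest_end = min(p.when[1] for p in self.parts)
        return latest_start < earliest_end


# Simultaneous: running and shouting in the corridor.
running = AtomicEvent("person", "running", "corridor", (0.0, 5.0))
shouting = AtomicEvent("person", "shouting", "corridor", (1.0, 4.0))
print(CompoundEvent((running, shouting)).is_simultaneous())  # True

# Sequential: walked, then stood, then ran.
walked = AtomicEvent("person", "walking", "corridor", (0.0, 4.0))
stood = AtomicEvent("person", "standing", "near meeting room", (4.0, 6.0))
ran = AtomicEvent("person", "running", "corridor", (6.0, 9.0))
sequence = CompoundEvent((walked, stood, ran))
print(sequence.is_simultaneous())  # False
print(sequence.when)               # (0.0, 9.0)
```

The time intervals in the usage example are made up; only the who/what/where/when structure follows the definitions in the text.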

Different atomic-events may require different types of sensors for their detection. For example, “walking” and “running” events can be detected based on video and audio streams, a “standing” event can be detected using video but not audio streams, and a “shouting” event is better detected using audio streams. Different atomic-events also require different minimum time-periods over which they can be confirmed. This minimum time-period depends on the time in which enough data to reliably detect the event can be obtained and processed. Even the same atomic-event can be confirmed in different time-periods using different data streams. For example, detecting a walking event might require at least two seconds of video data, whereas the same event could be detected from one second of audio data.


Framework: Research Issues

The media streams in a multimedia system are often correlated. We assume that the system designer has a confidence level in the decision obtained based on each of the media streams; and there is a cost of obtaining these decisions which usually includes the cost of sensor, its installation and maintenance cost, the cost of energy to operate it, and the processing cost of the stream. We also assume that each stream in a multimedia system partially helps in accomplishing the analysis task (e.g. event detection). The various research issues in the assimilation of information in such systems are –

*  When to assimilate? Events occur over a timeline. Timeline refers to a measurable span of time with information denoted at key points. Timeline-based event detection in multimedia surveillance systems requires the identification of key points along a timeline at which assimilation of information should take place. The identification of these key points is challenging because of asynchrony and diversity among streams and also because of the fact that different events have different granularity levels in time.

*  What to assimilate? Not all of the employed media streams necessarily contribute towards accomplishing the analysis task at any given instant, which raises the issue of finding the most informative subset of streams. From the available set of streams,

·        What is the optimal number of streams required to detect an event under the specified constraints?

·        Which subset of the streams is the optimal one?

·        In case the most suitable subset is unavailable, can one use alternate streams without much loss of cost-effectiveness and confidence?

·        How frequently should this optimal subset be computed so that the overall cost of the system is minimized?

*  How to assimilate? In combining different data sources,

·        How to utilize the correlation among streams?

·         How to integrate the contextual information (such as environment information) and the past experience?


The framework for information assimilation, which we propose, essentially addresses the above-mentioned issues.
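To make the “what to assimilate” question concrete, the following is a minimal sketch of choosing a subset of streams under a cost budget. It rests on simplifying assumptions that are not part of the framework: streams are treated as independent (whereas the text above notes that they are often correlated), confidence is ignored, costs are small integers, and the fused probability is the normalized product of the per-stream probabilities. Under those assumptions the log-odds of the streams add up, so the problem reduces to a 0/1 knapsack. The function name, stream names, probabilities and costs are all hypothetical.

```python
from math import exp, log


def select_streams(streams, budget):
    """Pick the subset of streams that maximizes the fused probability.

    streams: list of (name, probability, integer_cost), with 0 < probability < 1.
    budget:  maximum total cost (non-negative integer).
    Returns (fused_probability, tuple_of_chosen_names).
    """
    # best[c] = (total log-odds, chosen names) achievable with total cost <= c.
    # The empty subset has log-odds 0, i.e. a fused probability of 0.5.
    best = [(0.0, ())] * (budget + 1)
    for name, p, cost in streams:
        gain = log(p / (1 - p))  # log-odds contributed by this stream
        # Iterate costs downwards so that each stream is used at most once.
        for c in range(budget, cost - 1, -1):
            candidate = best[c - cost][0] + gain
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - cost][1] + (name,))
    log_odds, chosen = best[budget]
    return 1 / (1 + exp(-log_odds)), chosen


# Hypothetical streams: (name, probability of detecting the event, cost).
streams = [
    ("video1", 0.70, 4),
    ("video2", 0.62, 3),
    ("audio", 0.60, 2),
    ("motion", 0.55, 1),
]
probability, chosen = select_streams(streams, budget=5)
print(chosen)                 # ('video1', 'motion')
print(round(probability, 2))  # 0.74
```

With a budget of 5, the pair video1 and motion (total cost 5) beats video2 and audio (also cost 5) because its summed log-odds are higher. A subset selection that accounts for correlation and confidence, as the framework intends, would not decompose this simply.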


Framework: Distinct characteristics

The proposed framework for information assimilation has the following distinct characteristics -

*  Late thresholding over early thresholding: The detection of events based on individual streams is usually not accomplished with certainty. Early thresholding of this uncertain information to obtain a binary decision may lead to error. For example, let an event detector find the probabilities of occurrence of an event based on three media streams M1, M2 and M3 to be 0.60, 0.62 and 0.70, respectively. If the threshold is 0.65, these probabilistic decisions are converted into the binary decisions 0, 0 and 1, respectively, which implies that the event is found to occur based on stream M3 but not based on streams M1 and M2. Since two decisions favor non-occurrence and only one favors occurrence, a simple voting strategy concludes that the event did not occur, even though every stream assigns the event a probability of at least 0.60. Early thresholding can thus introduce errors in the overall decision. In contrast, the proposed framework advocates late thresholding: it first assimilates the probabilistic decisions obtained from the individual streams, and then thresholds the overall probability of occurrence of the event based on all the streams (which is usually higher than the individual probabilities, e.g. 0.85 in this case), which is less error-prone.

*  Use of agreement/disagreement among streams: Sensors capturing the same environment usually provide concurring or contradictory evidence about what is happening in the environment. The proposed framework utilizes this agreement/disagreement information among the media streams to strengthen the overall decision about the events happening in the environment. For example, if two sensors have been providing concurring evidence in the past, it makes sense to give more weight to their current combined evidence than if they had provided contradictory evidence in the past. The agreement/disagreement information among media streams (which we call the “agreement coefficient”) is computed based on how they have been agreeing or disagreeing in their decisions in the past. We also propose a method for fusing the agreement coefficients among the media streams.

*  Use of confidence in streams: The designer of a multimedia analysis system can have different confidence levels in different media streams for accomplishing different tasks. The proposed framework utilizes the confidence information by assigning a higher weight to the media stream which has a higher confidence level. The confidence in each stream is computed based on how accurate it has been in the past. Integrating confidence information in the assimilation process also requires the computation of the overall confidence in a group of streams, a method for which is also proposed.

*  Dynamic programming based approach for optimal subset selection of streams: The proposed framework adopts a dynamic programming approach that attempts to find the optimal subset of media streams so as to achieve the surveillance goal under specified constraints. It attempts to find the optimal subset of media streams based on three criteria: first, by maximizing the probability of achieving the surveillance goal (e.g. event detection) under the specified cost and the specified confidence; second, by maximizing the confidence in the achieved goal under the specified cost and the specified probability with which the surveillance goal is achieved; and third, by minimizing the cost to achieve the surveillance goal with a specified probability and a specified confidence. Each of these problems is proven to be NP-hard. The framework also allows for a trade-off among these three criteria, and offers the flexibility to compare whether a set of media streams of low cost would be better than another set of higher cost, or whether a set of media streams of high confidence would be better than another set of low confidence.

*  Assimilation over fusion: Information assimilation is different from information fusion in that the former brings the notion of integrating context and the past experience in the fusion process. The context is accessory information that helps in the correct interpretation of the observed data. We use the geometry of the monitored space along with the location, orientation and coverage area of the employed sensors as the spatial contextual information. We integrate the past experience by modelling the agreement/disagreement information among the media streams based on the accumulated past history of their agreement or disagreement.
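The early versus late thresholding example above (probabilities 0.60, 0.62 and 0.70 with a threshold of 0.65) can be worked through in a short sketch. The fusion rule used here is an assumption: a normalized product that treats the streams as independent with equal priors. It reproduces the value of about 0.85 quoted above, but it leaves out the agreement coefficients and confidence levels that the framework also uses. The function names are hypothetical.

```python
from math import prod


def early_threshold_vote(probs, threshold):
    """Threshold each stream first, then take a simple majority vote."""
    votes = [1 if p >= threshold else 0 for p in probs]
    return sum(votes) > len(votes) / 2, votes


def fuse_independent(probs):
    """Normalized product of probabilities (assumes independent streams)."""
    occurs = prod(probs)
    does_not_occur = prod(1 - p for p in probs)
    return occurs / (occurs + does_not_occur)


def late_threshold(probs, threshold):
    """Fuse the probabilistic decisions first, then threshold once."""
    fused = fuse_independent(probs)
    return fused >= threshold, fused


probs = [0.60, 0.62, 0.70]  # streams M1, M2, M3
threshold = 0.65

early_decision, votes = early_threshold_vote(probs, threshold)
late_decision, fused = late_threshold(probs, threshold)

print(votes, early_decision)           # [0, 0, 1] False
print(round(fused, 2), late_decision)  # 0.85 True
```

Early thresholding discards the fact that M1 and M2 were only slightly below the threshold, whereas late thresholding keeps that evidence until the final decision.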


Main contributions

The main contributions of this research are:


*  A framework for assimilation of information in order to detect events in surveillance and monitoring systems.

*  Notion of compound and atomic events that helps in describing events over a timeline.

*  Use and modeling, in the assimilation process, of two distinct properties of sensors: the agreement/disagreement information among them and the confidence in them.

*  NP-hardness proof of the media subset selection problems and a near-optimal solution to them using a dynamic programming based approach.


For details, please contact:

Pradeep K. Atrey (currently at State University of New York, Albany)

Mohan S. Kankanhalli

School of Computing, National University of Singapore, Singapore



*  P. K. Atrey, M. S. Kankanhalli and J. B. Oommen. Goal-oriented optimal subset selection of correlated multimedia streams. ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 3, Issue 1, Article 2 (2007).

*  P. K. Atrey, M. S. Kankanhalli and R. Jain. Information assimilation framework for event detection in multimedia surveillance systems. Springer/ACM Multimedia Systems Journal, Vol. 12, No. 3, pp 239-253 (2006).

*  P. K. Atrey, V. Kumar, A. Kumar and M. S. Kankanhalli. Experiential sampling based foreground/background segmentation for video surveillance. IEEE International Conference on Multimedia and Expo (ICME'2006), pp 1809-1812, July, 2006, Toronto, Canada.

*  P. K. Atrey, N. C. Maddage and M. S. Kankanhalli. Audio based event detection for multimedia surveillance. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'2006), pp V813-816, May, 2006, Toulouse, France.

*  P. K. Atrey, M. S. Kankanhalli and R. Jain. Timeline-based information assimilation in multimedia surveillance and monitoring systems, 3rd ACM International Workshop on Video Surveillance and Sensor Networks (ACM VSSN'05), pp 103-112, November, 2005, Singapore.

*  P. K. Atrey and M. S. Kankanhalli. Goal based optimal selection of media streams. IEEE International Conference on Multimedia & Expo (ICME'05), pp 305-308, July, 2005, Amsterdam, The Netherlands.

*  P. K. Atrey and M. S. Kankanhalli. Probability fusion for correlated multimedia streams. ACM International Conference on Multimedia (MM'04), pp 408-411, October, 2004, New York City, NY, USA.




© 2006, Pradeep Kumar Atrey and Mohan S. Kankanhalli.  All rights reserved.