Multimedia Systems - Class Project
Spatio-Temporal Invariant Points in Videos
Priyatham Bollimpalli – 10010148
Pydi Peddigari Venkat Sai – 10010149
PVS Dileep – 10010180
The objective is to find the spatio-temporal invariant points in a given input video. We apply the following methods to a set of contiguous frames of a video, called a scene. We divide the problem into three cases: first, the background is fixed and only part of the scene is dynamic; second, the background is fixed and the entire scene is reasonably dynamic; and third, the background is moving and the objects are also moving. We examine these cases below:
Case1: When the Background is fixed and the entire scene is not dynamic
In this case, the background of the scene is fixed across several frames. The foreground objects may keep moving throughout the video, but their movement does not cover the entire frame: only some parts of the frame contain motion, while a sizeable part of the frame remains static. The following procedure is used to detect the spatio-temporal interest points in the scene:
Every scene is a collection of several frames. As an example, we consider a scene in which the background is constant and, in the foreground, a ball moves quickly underneath a wooden block that stays stationary throughout the video. These are just four of the several frames present in the video:
The difference of the first two frames of the scene is computed. The resulting difference image can contain a wide range of pixel values.
Hence, we choose a grey-level threshold of, say, 0.2, and set every location of the difference image whose value is above that threshold to white. We also maintain an image, which starts out entirely white, that is written to across all iterations of this method (we refer to that image as Rframe). In the thresholded difference image, we find all the locations where the pixel value is white, and set the corresponding locations on the Rframe to black.
We keep repeating this process: computing the difference between two successive frames of the scene, thresholding the difference to obtain the white pixels, and setting every location on Rframe that corresponds to a white pixel to black. For example, the differences of the frames at several instants are as follows:
The final Rframe produced is as follows:
The black portions of this image mark the points where temporal invariance is not possible, since objects keep moving in those areas. The white portions mark the points that remain constant throughout the video. These are the regions of interest in which we look for SIFT points.
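The accumulation of the Rframe described above can be sketched as follows. This is only a minimal illustration, not the code used for the project: it assumes grayscale frames stored as NumPy arrays with intensities scaled to [0, 1], and the function name `build_rframe` and the toy frames are hypothetical.

```python
import numpy as np

def build_rframe(frames, threshold=0.2):
    """Accumulate a static-region mask (Rframe) from grayscale frames.

    frames: sequence of 2-D arrays with intensities in [0, 1] (assumption).
    Returns a boolean array: True (white) = never moved, False (black) = moved.
    """
    frames = [np.asarray(f, dtype=float) for f in frames]
    rframe = np.ones(frames[0].shape, dtype=bool)   # start entirely white
    for prev, curr in zip(frames, frames[1:]):
        diff = np.abs(curr - prev)                  # difference of successive frames
        moving = diff > threshold                   # thresholded difference (white pixels)
        rframe[moving] = False                      # black out those locations
    return rframe

# Toy scene: a bright one-pixel "ball" moves along the top row of a 3x4 frame.
frames = []
for x in range(3):
    frame = np.zeros((3, 4))
    frame[0, x] = 1.0
    frames.append(frame)

rframe = build_rframe(frames)
print(rframe.astype(int))
```

In the toy scene only the pixels the ball passes through are blacked out; every other pixel stays white and remains a candidate region for interest points.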
Finally, we apply SIFT to the original frames of the scene and collect all the points it returns. We accept such a point as an interest point only if it lies within the white portions of the Rframe. The resulting interest points form our spatio-temporal invariant points in this case. The above algorithm is also run on other videos as follows:
Some of the frames in the original video:
Some of the resulting frame differences at several instants are:
The resulting Rframe is as follows:
Now, we find the interest points that lie within the white portions of the Rframe, since only those are the temporally invariant parts of the image. The resulting interest points of the scene at some of the instants are as follows:
The points marked in green represent the spatio-temporal interest points of the scene at some of the instants.
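The selection of interest points by the Rframe can be sketched as below. The sketch assumes the keypoints have already been detected (for example by a SIFT implementation) and are available as (x, y) positions; the function name `filter_keypoints` and the sample coordinates are hypothetical.

```python
import numpy as np

def filter_keypoints(keypoints, rframe):
    """Keep only the keypoints that fall on white (static) pixels of the Rframe.

    keypoints: iterable of (x, y) positions from some detector (assumption).
    rframe: 2-D boolean array, True where the pixel stayed static.
    """
    height, width = rframe.shape
    kept = []
    for x, y in keypoints:
        col, row = int(round(x)), int(round(y))     # x indexes columns, y indexes rows
        inside = 0 <= row < height and 0 <= col < width
        if inside and rframe[row, col]:
            kept.append((x, y))
    return kept

# Rframe with a blacked-out block where motion was observed.
rframe = np.ones((4, 6), dtype=bool)
rframe[1:3, 2:5] = False

keypoints = [(0.2, 0.1), (3.0, 2.0), (5.4, 3.0), (2.0, 1.0)]
stable = filter_keypoints(keypoints, rframe)
print(stable)
```

Keypoints that land inside the blacked-out block are discarded, so only points in regions that never moved are reported.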
The algorithm is run on another video as follows:
The resulting Rframe is:
The resulting interest points at several frames are:
Case2: When the Background is fixed and the entire scene is reasonably dynamic
In this case, the background of the scene is fixed across several frames, while the foreground objects keep moving throughout the video and their movement covers most of the frame. The following procedure is used. Consider a scene as follows:
If the method of case-1 were used here, the resulting Rframe would be as follows:
With case-1, most of the region is blacked out because object motion is present over almost the entire image, and hence we lose some of the possible interest points.
Hence, we adopt a different method, which uses automatic detection and motion-based tracking of moving objects in a video. The problem can be split into two parts:
o detecting moving objects in each frame
o associating the detections corresponding to the same object over time
The association of detections to the same object is based solely on motion. The motion of each track is estimated by a Kalman filter. The filter is used to predict the track's location in each frame, and determine the likelihood of each detection being assigned to each track.
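The prediction and correction performed by the Kalman filter can be illustrated with a constant-velocity model. This is a sketch under stated assumptions rather than the tracker used for the results: the state layout [x, y, vx, vy], the noise variances, and the class name `ConstantVelocityKalman` are illustrative choices.

```python
import numpy as np

class ConstantVelocityKalman:
    """Kalman filter for a 2-D centroid with state [x, y, vx, vy]."""

    def __init__(self, x, y, process_var=1e-2, measurement_var=1.0):
        self.state = np.array([x, y, 0.0, 0.0])
        self.P = np.eye(4) * 10.0                    # initial state uncertainty
        self.F = np.array([[1, 0, 1, 0],             # position += velocity each frame
                           [0, 1, 0, 1],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],             # only the position is measured
                           [0, 1, 0, 0]], dtype=float)
        self.Q = np.eye(4) * process_var             # process noise (assumed value)
        self.R = np.eye(2) * measurement_var         # measurement noise (assumed value)

    def predict(self):
        """Advance the state by one frame and return the predicted position."""
        self.state = self.F @ self.state
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.state[:2]

    def update(self, measurement):
        """Correct the state with a detected centroid and return the position."""
        z = np.asarray(measurement, dtype=float)
        innovation = z - self.H @ self.state
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)     # Kalman gain
        self.state = self.state + K @ innovation
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.state[:2]

# An object moving 2 px right and 1 px down per frame, observed for 20 frames.
kf = ConstantVelocityKalman(0.0, 0.0)
for t in range(1, 21):
    kf.predict()
    kf.update((2.0 * t, 1.0 * t))

predicted = kf.predict()    # where the track is expected in frame 21
print(predicted)
```

The distance between `predicted` and each detection in the next frame can then serve as the cost of assigning that detection to the track.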
In any given frame, some detections may be assigned to tracks, while other detections and tracks may remain unassigned. The assigned tracks are updated using the corresponding detections. The unassigned tracks are marked invisible. Each track keeps a count of the number of consecutive frames in which it remained unassigned. If the count exceeds a specified threshold, the tracker assumes that the object has left the field of view and deletes the track.
In each iteration, a frame is read and motion segmentation is performed using the foreground detector, which yields the moving objects with their centroids and bounding boxes. Next, the detections are assigned to tracks. The assigned tracks are then updated, the unassigned tracks are marked invisible, and the lost tracks are deleted.
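The bookkeeping for one frame can be sketched as follows. Several simplifications are assumed here: a greedy nearest-neighbour match stands in for a proper cost-based assignment, each track stores its last position instead of a Kalman prediction, and unassigned detections start new tracks. The function name `update_tracks` and the thresholds `max_distance` and `max_invisible` are hypothetical.

```python
import math

def update_tracks(tracks, detections, max_distance=20.0, max_invisible=3):
    """One bookkeeping step: assign detections, age unassigned tracks, delete lost ones.

    tracks: list of dicts with keys 'id', 'position', 'invisible_count'.
    detections: list of (x, y) centroids found in the current frame.
    """
    unassigned = list(detections)
    for track in tracks:
        # Greedy choice: the closest detection that is still unassigned.
        best = min(unassigned,
                   key=lambda d: math.dist(d, track["position"]),
                   default=None)
        if best is not None and math.dist(best, track["position"]) <= max_distance:
            track["position"] = best            # assigned track: update it
            track["invisible_count"] = 0
            unassigned.remove(best)
        else:
            track["invisible_count"] += 1       # unassigned track: mark invisible

    # Delete tracks that stayed invisible for too many consecutive frames.
    survivors = [t for t in tracks if t["invisible_count"] <= max_invisible]

    # Assumption: every leftover detection starts a new track.
    next_id = max((t["id"] for t in tracks), default=0) + 1
    for detection in unassigned:
        survivors.append({"id": next_id, "position": detection, "invisible_count": 0})
        next_id += 1
    return survivors

tracks = [{"id": 1, "position": (10.0, 10.0), "invisible_count": 0}]
tracks = update_tracks(tracks, [(12.0, 11.0), (80.0, 80.0)])   # second object appears
for _ in range(4):                                             # then it disappears
    tracks = update_tracks(tracks, [(13.0, 11.0)])
print(tracks)
```

In this run the second object is tracked for one frame, stays invisible for four consecutive frames, and is then deleted, while the first track keeps following its detections.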
The following are the results of the tracked objects at some of the instants of the scene:
Now, we apply SIFT to the tracked objects at all instants of the video. The interest points produced at several instants of the video are as follows:
To gather more interest points from the video, we can also add the interest points generated by case-1. The resulting interest points are:
This captures the interest points obtainable by combining case-1 and case-2.
Case3: When the background is moving and the objects are also moving
When the camera as well as the objects are moving, tracking the objects is challenging. There is rarely anything invariant even within one particular scene. This is still an open research problem, although some heuristic methods are successful. The video used here has a moving car filmed with a moving camera. Some of the frames are given below.
One heuristic method, which we found on the web, is given below. Note that segmenting the car by frame differencing (background subtraction) will not work, because the camera is also moving and hence the background is also moving. Normal prediction algorithms would therefore fail in this case. To tackle this issue, optical flow is used, since the scene and the car have different directions of flow. In the figures below, the red points denote the optical flow of the background and the green points denote the points having an optical flow opposite to the red points. Note that the green points track the car throughout the video, and hence the green points are the ones that are spatio-temporally invariant.
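The separation of red and green points by flow direction can be sketched as below. The sketch assumes that per-point flow vectors are already available from some optical-flow estimator and that background points form the majority, so the median flow approximates the background motion. The function name `split_by_flow_direction` and the sample vectors are hypothetical, and this is not necessarily how the referenced method decides which flows are opposite.

```python
import numpy as np

def split_by_flow_direction(flow_vectors):
    """Label each point as moving with the background (False) or against it (True).

    flow_vectors: (N, 2) array of displacements (dx, dy) between two frames.
    The median flow is taken as the background motion (assumption); a point is
    "opposite" when its flow has a negative dot product with that motion.
    """
    flow = np.asarray(flow_vectors, dtype=float)
    background_flow = np.median(flow, axis=0)
    return flow @ background_flow < 0

# Four background points drift left as the camera pans; two car points move right.
flow = np.array([[-3.0, 0.1],
                 [-2.8, 0.0],
                 [-3.1, -0.1],
                 [-2.9, 0.0],
                 [2.0, 0.2],
                 [1.8, 0.1]])
is_object = split_by_flow_direction(flow)
print(is_object)
```

Points flagged `True` play the role of the green points, and the rest correspond to the red background points.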
Combining the points obtained here with those obtained by case-1 and case-2 (which are unlikely to yield any points for such a video) gives the total set of possible points.
Conclusion & Future Work:
In this report we have applied three kinds of techniques to a scene in a video: when the background is almost stationary and the scene is not dynamic, when the background is stationary and the scene is dynamic, and when the background and the scene are both dynamic. Combining the points obtained from all three methods gives the maximum possible set of spatio-temporal invariant points. There is a lot of scope for future work in this area, and we wish to pursue it further. The following are the issues involved.
In case-1, the level at which thresholding is done defines the degree to which the motion of the object is considered. The lower the threshold, the greater the impact of motion. The right level depends entirely on the video. If a video has many illumination and contrast changes between frames, the frame differences give many false contours, and a higher threshold value is desirable. In other cases the frame rate is so high that an almost negligible amount of motion is captured between frames, and a lower threshold value is desirable. Hence, determining an optimal threshold value automatically, taking the video quality and frame rate into account, is one area of future work.
In case-3, the above example gives one possible approach to a solution. It works because the camera moves continuously along with the car, so selecting the points with opposite optical flow works. But if the camera remains stationary for some time and then suddenly moves again, we need a separate system that first tracks the background movement; combined with the foreground motion, this can be used to find the points on the object. Many object tracking methods exist and many more are still being pursued, since this is a very active area of research. It is likely that exploring this area would yield a generic algorithm for obtaining an object and the points that are consistent on it throughout the scene.
Case-2 considers the background to be stationary and the motion of the objects to be uniform. The parameters for the detection, assignment, and deletion steps of the tracker may be tuned to the video. The tracking here was based solely on motion, with the assumption that all objects move in a straight line with constant speed. When the motion of an object deviates significantly from this model, the tracker may produce tracking errors. The likelihood of tracking errors can be reduced by using a more complex motion model, such as constant acceleration, or by using multiple Kalman filters for every object. Other cues for associating detections over time, such as size, shape, and colour, can also be incorporated.
References
1. http://www.mathworks.in/help/vision/examples/motion-based-multiple-object-tracking.html
2. http://www.youtube.com/watch?v=MOaKnCSejXQ