1

Human Action Recognition using Spatio-Temporal Classification

方競賢 Ching-Hsien Fang 98.10.14

2

Outline

1. Introduction

2. Flowchart

3. Learning for Spatio-Temporal Classification

3.1 Spatial Subspace Creation using Locality Preserving Projection (LPP)

3.2 Learning for Classification in Temporal Subspace

4. Recognition Process

5. Experimental Results

6. Conclusions

4

1. Introduction

The main idea is to add temporal information to the action recognition process (p1,1,2).

Our “Temporal-Vector Trajectory Learning (TVTL)” is:

-supervised (it uses labels)

-linear (it finds a linear transformation matrix)

We use human silhouettes as features instead of feature points, because feature-point methods are limited by their loss of global structural information (p1,3,8).

Silhouette-based methods are becoming more and more popular because silhouette features are easier to obtain and still contain detailed body-shape information (p1,4,1).

(Citations give Page No., Paragraph No., Line No.)

6

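2. Flowchart: Notation (1/2)

There are M training sequences, and these M sequences contain N frames in total.

$X = \{X_1, X_2, \ldots, X_M\} = \{x_1, x_2, \ldots, x_N\}$, $x_i \in R^{(w*h) \times 1}$. The images in this paper are 64*48, so $(w*h) = 64*48 = 3072$.

$A$ is the transformation matrix obtained by applying LPP to the data.

$Y = A^T X = \{y_1, y_2, \ldots, y_N\}$ is the data mapped into the new space by the transformation matrix $A$, with $y_i \in R^{d \times 1}$. In this paper LPP reduces the images to 31 dimensions, so $d = 31$.

$S$ is a matrix we design; multiplying it with $Y$ yields the temporal data $Y' = YS = \{y'_1, y'_2, \ldots, y'_N\}$, $y'_i \in R^{((2t+1)*d) \times 1}$. The dimension is $d*(2t+1)$, where $t$ means that the $t$ frames before and the $t$ frames after are used as supporting temporal information. For example, with $t = 2$ the data become $5*d = 5*31 = 155$-dimensional.

$L$ in $M = L^T L$ gives the space obtained after metric learning; its dimension is the same as that of $Y'$.

$X_t = \{x_{t,1}, x_{t,2}, \ldots, x_{t,n_t}\}$ is the test data.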
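As a quick check of the dimensions on this slide, the sketch below traces a batch of frames through the projection and computes the size of one temporal vector. It is illustrative only: the frame count `N = 200` and the random stand-ins for `X` and `A` are assumptions, not values or matrices from the paper.

```python
import numpy as np

# Values stated on the slide: 64*48 pixels, d = 31, t = 2.
n_pixels, d, t = 64 * 48, 31, 2
N = 200  # hypothetical number of training frames

rng = np.random.default_rng(0)
X = rng.random((n_pixels, N))            # columns are vectorized frames x_i
A = rng.standard_normal((n_pixels, d))   # random stand-in for the LPP matrix A

Y = A.T @ X                              # Y = A^T X, one column y_i per frame
temporal_dim = d * (2 * t + 1)           # length of each temporal vector y'_i

print(n_pixels, Y.shape, temporal_dim)   # prints 3072 (31, 200) 155
```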

7

2. Flowchart(2/2)

9

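3.1 Spatial Subspace Creation using Locality Preserving Projection (1/3)

We would like to obtain a low-dimensional space that reveals the intrinsically nonlinear structure of the spatial-motion information while preserving the local spatial information (p4,1,2).

We choose LPP because it is:

-linear (it yields a transformation matrix)

-preserving local structure (it keeps the local structure of the data)

-unsupervised (we do not want to consider labels at this step)

The basic idea of LPP is that data points that are close (similar) in the high-dimensional space should remain close to each other after dimensionality reduction; the global distribution is then built up from these local optimizations.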

10

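3.1 Spatial Subspace Creation using LPP: Concept Diagram and Weight Matrix (2/3)

[Figure: LPP concept diagram with sample points A, B, and C]

Suppose there are N sample points. After running k-NN, the LPP weight matrix is defined as follows:

$$W = \begin{pmatrix} 0 & 1 & 1 & \cdots & \cdots \\ 1 & 0 & 0 & \cdots & \cdots \\ 1 & 0 & 0 & \cdots & 0 \\ \cdots & \cdots & \cdots & \cdots & 1 \\ \cdots & \cdots & 0 & 1 & 0 \end{pmatrix}$$

$w_{ij} = 1$ if data $i$ is a neighbor of data $j$, or vice versa

$w_{ij} = 0$ otherwise

An example: $w_{AB} = 1$, $w_{AC} = 0$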
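One way to build such a weight matrix is sketched below. It assumes Euclidean distance for the k-NN search and a data matrix with one sample per column; the function name `knn_weight_matrix` and the toy points are hypothetical, chosen so that B is a neighbor of A and C is not.

```python
import numpy as np

def knn_weight_matrix(X, k):
    """X: (D, N), one sample per column. Returns a symmetric 0/1 matrix W of shape (N, N)."""
    N = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)  # pairwise squared distances
    np.fill_diagonal(dist2, np.inf)                      # a sample is not its own neighbor
    nn = np.argsort(dist2, axis=1)[:, :k]                # k nearest neighbors of each sample
    W = np.zeros((N, N))
    W[np.repeat(np.arange(N), k), nn.ravel()] = 1.0
    return np.maximum(W, W.T)                            # w_ij = 1 if either is a neighbor of the other

# Toy points on a line: A = 0, B = 1, C = 10, plus a fourth point at 11.
pts = np.array([[0.0, 1.0, 10.0, 11.0],
                [0.0, 0.0, 0.0, 0.0]])
W = knn_weight_matrix(pts, k=1)
print(W[0, 1], W[0, 2])   # w_AB and w_AC: prints 1.0 0.0
```

Taking the element-wise maximum of `W` and its transpose implements the "or vice versa" part of the definition, so the result is always symmetric.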

11

3.1 Spatial Subspace Creation using LPP: Formulas (3/3)

Once the weight matrix is obtained, the LPP objective function is:

$$\arg\min \sum_{ij} (y_i - y_j)^2 W_{ij} = \arg\min_A \sum_{ij} (A^T x_i - A^T x_j)^2 W_{ij} \qquad (1)$$

Using graph embedding, this can be rewritten as:

$$\arg\min_A \sum_{ij} (A^T x_i - A^T x_j)^2 W_{ij} = \arg\min_A A^T X L X^T A \qquad (2)$$

$$\text{subject to } A^T X D X^T A = 1 \qquad (3)$$

where $L$ is the “Laplacian matrix”, $L = D - W$, and $D$ is the diagonal matrix with $D_{ii} = \sum_j W_{ij}$.
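Minimizing $A^T X L X^T A$ subject to $A^T X D X^T A = 1$ leads to the generalized eigenproblem $X L X^T a = \lambda X D X^T a$, whose eigenvectors with the smallest eigenvalues form the columns of $A$. The sketch below solves it with NumPy only. The function name `lpp`, the small ridge term `eps` (added so the Cholesky factorization succeeds when $X D X^T$ is close to singular), and the ring-graph test data are assumptions for illustration, not part of the slides.

```python
import numpy as np

def lpp(X, W, d, eps=1e-8):
    """X: (D, N) data, W: (N, N) symmetric weights. Returns A of shape (D, d)."""
    Dg = np.diag(W.sum(axis=1))                  # D_ii = sum_j W_ij
    Lap = Dg - W                                 # Laplacian L = D - W
    P = X @ Lap @ X.T                            # X L X^T
    Q = X @ Dg @ X.T                             # X D X^T
    Q = Q + eps * np.trace(Q) / Q.shape[0] * np.eye(Q.shape[0])  # assumed ridge term
    C = np.linalg.cholesky(Q)                    # Q = C C^T
    Ci = np.linalg.inv(C)
    vals, vecs = np.linalg.eigh(Ci @ P @ Ci.T)   # eigenvalues in ascending order
    return Ci.T @ vecs[:, :d]                    # each column a satisfies a^T Q a = 1

# Hypothetical data: 30 samples in 5 dimensions, neighbors arranged in a ring.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 30))
W = np.zeros((30, 30))
for i in range(30):
    W[i, (i + 1) % 30] = W[(i + 1) % 30, i] = 1.0
A = lpp(X, W, d=2)
Y = A.T @ X                                      # low-dimensional embedding, shape (2, 30)
```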

12

Q and A:

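Q1: With so many dimensionality-reduction techniques available, why choose LPP?

-Answer: When we started working on action recognition, we read a number of papers, and it was LSTDE that gave us the idea of using time. We wondered how much it would help if each datum carried temporal information itself, so our first intuition was to combine each original datum with the trajectory to its temporally neighboring data into a single datum, as shown in the figure below.

[Figure: the frame $x_t$ combined with its temporal neighbors $x_{t-2}$, $x_{t-1}$, $x_{t+1}$, $x_{t+2}$ into a single datum]

Doing this directly in the high-dimensional space raises two major problems. First, the dimension is too high: expanding data that are already high-dimensional into temporal data makes the matrices too large and the computation heavy. Second, and more importantly, the high-dimensional data involve no feature extraction: each dimension is just the black/white value of one pixel. We therefore need a feature-extraction step, that is, a method that extracts features and reduces the dimension at the same time.

Many methods are available (PCA, LDA, LSDA, LLE, ISOMAP, LDE, LPP, ...), but at this layer we mainly want to preserve the structure among the data, obtain a linear matrix, and avoid using labels yet, since the aim here is merely to reduce the computation while keeping the structure of the data. LPP fits these expectations well, so we chose LPP.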

13

3.2 Learning for Classification in Temporal Subspace: Temporal Data (1/3)

After obtaining the spatial-motion subspace (LPP subspace), we extend the data to temporal data. Here we propose three kinds of temporal data.

1. Locations' Temporal Motion of Mahalanobis Distance (LTM)

$Y' = \{y'_1, y'_2, y'_3, \ldots, y'_n\}$

$y'_i = \{y_{i-t}, \ldots, y_{i-1}, y_i, y_{i+1}, \ldots, y_{i+t}\}$

For $t = 2$: $y'_i = \{y_{i-2}, y_{i-1}, y_i, y_{i+1}, y_{i+2}\}$

[Figure: $y_i$ with its temporal neighbors $y_{i-2}$, $y_{i-1}$, $y_{i+1}$, $y_{i+2}$]

14


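2. Difference' Temporal Motion of Mahalanobis Distance (DTM)

$Y' = \{y'_1, y'_2, y'_3, \ldots, y'_n\}$

$y'_i = \{y_i - y_{i-t}, \ldots, y_i - y_{i-1}, y_i, y_i - y_{i+1}, \ldots, y_i - y_{i+t}\}$

For $t = 2$: $y'_i = \{y_i - y_{i-2}, y_i - y_{i-1}, y_i, y_i - y_{i+1}, y_i - y_{i+2}\}$

[Figure: differences between $y_i$ and each of its temporal neighbors $y_{i-2}$, $y_{i-1}$, $y_{i+1}$, $y_{i+2}$]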

15


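3. Trajectory Temporal Motion of Mahalanobis Distance (TTM)

$Y' = \{y'_1, y'_2, y'_3, \ldots, y'_n\}$

$y'_i = \{y_{i-t+1} - y_{i-t}, \ldots, y_i - y_{i-1}, y_i, y_{i+1} - y_i, \ldots, y_{i+t} - y_{i+t-1}\}$

For $t = 2$: $y'_i = \{y_{i-1} - y_{i-2}, y_i - y_{i-1}, y_i, y_{i+1} - y_i, y_{i+2} - y_{i+1}\}$

[Figure: trajectory through $y_{i-2}$, $y_{i-1}$, $y_i$, $y_{i+1}$, $y_{i+2}$, using the differences between consecutive frames]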
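The three kinds of temporal data (LTM, DTM, TTM) can be sketched as one stacking routine. The slides build them with a designed matrix S; the sketch below instead stacks the vectors explicitly, which is an assumption made for readability. Two further assumptions: frames without t neighbors on both sides are dropped, and the sign of each difference (centre minus neighbor for DTM, later frame minus earlier frame for TTM) is a chosen convention. The function name `temporal_data` is hypothetical.

```python
import numpy as np

def temporal_data(Y, t, kind="LTM"):
    """Y: (d, n) frames of one sequence as columns. Returns shape (d*(2t+1), n - 2t)."""
    d, n = Y.shape
    cols = []
    for i in range(t, n - t):                      # frames with t neighbors on both sides
        parts = []
        for k in range(-t, t + 1):
            if k == 0 or kind == "LTM":
                parts.append(Y[:, i + k])          # raw location
            elif kind == "DTM":
                parts.append(Y[:, i] - Y[:, i + k])  # difference to the centre frame
            elif kind == "TTM":
                step = 1 if k > 0 else -1
                far, near = i + k, i + k - step    # two consecutive frames
                lo, hi = min(far, near), max(far, near)
                parts.append(Y[:, hi] - Y[:, lo])  # later frame minus earlier frame
            else:
                raise ValueError(kind)
        cols.append(np.concatenate(parts))
    return np.stack(cols, axis=1)

# One-dimensional toy sequence (d = 1) so the stacked values are easy to read.
Y = np.array([[0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0]])
print(temporal_data(Y, 2, "LTM")[:, 0])   # [ 0.  1.  4.  9. 16.]
print(temporal_data(Y, 2, "DTM")[:, 0])   # [  4.   3.   4.  -5. -12.]
print(temporal_data(Y, 2, "TTM")[:, 0])   # [1. 3. 4. 5. 7.]
```

With d = 31 and t = 2, each column of the result has 31 * 5 = 155 entries, matching the dimension given in the flowchart notation.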

16

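3.2 Metric Learning by LMNN (1/2)

Large Margin Nearest Neighbor (LMNN) is a metric learning method that tries to produce a new space with a better distance measure; the distance in this space is called the Mahalanobis distance. The objective function is shown below:

Minimize

$$\sum_{ij} \eta_{ij} (Y'_i - Y'_j)^T M (Y'_i - Y'_j) + c \sum_{ijk} \eta_{ij} (1 - y_{ik}) \xi_{ijk}$$

Subject to

(i) $(Y'_i - Y'_k)^T M (Y'_i - Y'_k) - (Y'_i - Y'_j)^T M (Y'_i - Y'_j) \geq 1 - \xi_{ijk}$

(ii) $\xi_{ijk} \geq 0$

(iii) $M$ has to be positive semi-definite

Here $\eta_{ij} = 1$ if $Y'_j$ is a target neighbor of $Y'_i$, $y_{ik} = 1$ if $Y'_i$ and $Y'_k$ share the same label, and $c$ is a positive constant weighting the two terms. Substituting $M = L^T L$, the objective becomes:

$$\sum_{ij} \eta_{ij} (Y'_i - Y'_j)^T L^T L (Y'_i - Y'_j) + c \sum_{ijk} \eta_{ij} (1 - y_{ik}) \xi_{ijk}$$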
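To make the two terms of the LMNN objective concrete, the sketch below evaluates it for a given transformation L rather than optimizing it. It assumes the slack of each triple takes its smallest feasible value, max(0, 1 + d_ij - d_ik), which turns the constrained form into a hinge loss. The function name `lmnn_loss`, the weighting constant `c`, and the toy data are assumptions for illustration.

```python
import numpy as np

def lmnn_loss(Yp, labels, targets, L, c=1.0):
    """Yp: (n, D) rows y'_i; targets[i]: indices of same-label target neighbors of i."""
    Z = Yp @ L.T                                      # rows are L y'_i
    pull, push = 0.0, 0.0
    for i, neigh in enumerate(targets):
        d_all = np.sum((Z - Z[i]) ** 2, axis=1)       # squared distances from i to all points
        differ = labels != labels[i]                  # points with a different label
        for j in neigh:
            pull += d_all[j]                          # pull target neighbors in
            slack = np.maximum(0.0, 1.0 + d_all[j] - d_all[differ])
            push += slack.sum()                       # penalize imposters inside the margin
    return pull + c * push

# Four 1-D points: two per class, with the point at 1.5 acting as an imposter for the point at 1.0.
Yp = np.array([[0.0], [1.0], [1.5], [5.0]])
labels = np.array([0, 0, 1, 1])
targets = [[1], [0], [3], [2]]
print(lmnn_loss(Yp, labels, targets, np.eye(1)))      # prints 52.25
```

An optimizer would search for the L (or M) that minimizes this value; the slides use LMNN for that step.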

17

3.2 Metric Learning by LMNN (2/2)

1. For neighbors with the same label, try to pull them in.

2. For neighbors with different labels, try to push them away by a margin.

The result after pulling and pushing.

18

Q and A:

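Q2: Why place the LMNN metric learning method here?

-Answer: Note that up to this point we have not used labels at all. After applying LPP and turning the data into temporal data, we want a mechanism that gathers data with the same action (label) and similar trajectories, and conversely pushes the imposters, that is, data with a different action (label) that nevertheless lie very close, out beyond a margin. In other words, the pulling and pushing here use not only the data themselves but also the temporal information they carry. LMNN is therefore applied as a distance-learning method to spatio-temporal data, which hold both the spatial information of the data and their temporal information within the sequence. LMNN learns a transformation matrix L that maps the data into a new space, and the distance between data in that space is the Mahalanobis distance we learn.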

20

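4. Recognition Process (1/2)

Test sequence: $X_{Test} = \{x_1, x_2, \ldots, x_{n_{Test}}\}$, $x_i \in R^{w*h}$

LPP: $Y_{Test} = A^T X_{Test} = \{y_1, y_2, y_3, \ldots, y_{n_{Test}}\}$, $y_i \in R^{d}$

Temporal data: $Y'_{Test} = Y_{Test} S = \{y'_1, y'_2, y'_3, \ldots, y'_{n_{Test}}\}$, $y'_i \in R^{d(2t+1)}$

LMNN: Mahalanobis distance $d(x_i, x_j^{test}) = (y'_i - y'^{test}_j)^T M (y'_i - y'^{test}_j)$, with $y'_i, y'^{test}_j \in R^{d(2t+1)}$

KNN: Assign a “Label” to each frame in the test sequence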
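The Mahalanobis distance step can be sketched as follows: given a learned positive semi-definite matrix M, compute the squared distance from every test vector to every training vector. The function name `mahalanobis_sq` and the small matrices are hypothetical; in the pipeline above the inputs would be the temporal vectors y'.

```python
import numpy as np

def mahalanobis_sq(train, test, M):
    """train: (n, D), test: (m, D), M: (D, D) PSD. Returns (m, n) squared distances."""
    diff = test[:, None, :] - train[None, :, :]          # (m, n, D) pairwise differences
    return np.einsum("mnd,de,mne->mn", diff, M, diff)    # diff^T M diff for every pair

L = np.array([[2.0, 0.0],
              [0.0, 1.0]])
M = L.T @ L                                              # M = L^T L
train = np.array([[0.0, 0.0], [1.0, 1.0]])
test = np.array([[1.0, 0.0]])
print(mahalanobis_sq(train, test, M))                    # prints [[4. 1.]]
```

Because M = L^T L, the same values are obtained as squared Euclidean distances after mapping every vector with L, which is why k-NN can be run directly in the transformed space.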

21

4. Recognition Process (2/2)

[Figure: KNN applied to a test datum, with candidate classes Walk, Run, and Jump]

5-NN vote shares: 3/5, 1/5, 1/5

The Winner Takes All: the test datum belongs to (Run)
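The winner-takes-all vote can be sketched in a few lines. The distances and training labels below are made up so that three of the five nearest neighbors are "Run"; the function name `knn_label` is hypothetical.

```python
import numpy as np
from collections import Counter

def knn_label(dist_row, train_labels, k=5):
    """Return the majority label among the k nearest training frames, plus the vote counts."""
    nearest = np.argsort(dist_row)[:k]                   # indices of the k smallest distances
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0], votes

# Hypothetical distances from one test frame to seven training frames.
dist_row = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 9.0, 9.5])
train_labels = ["Run", "Run", "Walk", "Run", "Jump", "Walk", "Walk"]
label, votes = knn_label(dist_row, train_labels, k=5)
print(label, votes["Run"], votes["Walk"], votes["Jump"])  # prints Run 3 1 1
```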

22

Q and A:

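Q3: Is it appropriate for the recognition process to use only a k-NN classifier?

-Answer: I agree that this part can be improved. We use k-NN classification because LMNN itself moves the data in a k-NN fashion, so k-NN is the intuitive classifier to pair with it. An ACCV reviewer did point out that using only k-NN seems somewhat simple and asked whether a better classifier exists. I have thought about this, for example an SVM or other classifiers, and I have seen other methods in some papers, but we have not yet studied this in depth. I think this is a part that can be improved.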

24

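5. Experimental Results (1/5)

Weizmann Dataset (Human Behavior Database):

- 9 persons, 10 actions, and 93 sequences in total

- In this paper the images are normalized to 64*48 and centered

[Figure: sample frames from the dataset]

In this paper we use binarized images of size 64*48.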

25

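5. Experimental Results (2/5)

Weizmann Dataset:

- We test with cross validation: one person is chosen as the test data and the other eight persons' data are used as training data. This is repeated nine times so that every person serves as test data once, and the average of the nine results is the reported figure.

- The variable t is the length of the temporal window; for example, t=2 uses the two frames before and the two frames after as supporting temporal information.

- Three dimensionality-reduction methods are compared: LPP, Supervised LPP, and LSDA (Locality Sensitive Discriminant Analysis).

- Five frameworks are compared:

1. SE (dimensionality reduction only)

2. SM (dimensionality reduction plus metric learning)

3. LTM (dimensionality reduction + LTM temporal data + metric learning)

4. DTM (dimensionality reduction + DTM temporal data + metric learning)

5. TTM (dimensionality reduction + TTM temporal data + metric learning)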
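The leave-one-person-out protocol described above can be sketched as a generator over folds. The person assignment below (9 persons with 10 sequences each) is a simplified stand-in for the dataset, and the function name `leave_one_person_out` is hypothetical.

```python
def leave_one_person_out(person_ids):
    """Yield (held_out, train_idx, test_idx), one fold per person."""
    for held_out in sorted(set(person_ids)):
        test_idx = [i for i, p in enumerate(person_ids) if p == held_out]
        train_idx = [i for i, p in enumerate(person_ids) if p != held_out]
        yield held_out, train_idx, test_idx

# Hypothetical layout: sequence i was performed by person person_ids[i].
person_ids = [p for p in range(9) for _ in range(10)]
folds = list(leave_one_person_out(person_ids))
print(len(folds), len(folds[0][1]), len(folds[0][2]))   # prints 9 80 10
```

The reported accuracy would then be the mean of the nine per-fold accuracies, each obtained by training on `train_idx` and classifying the frames of the sequences in `test_idx`.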

26


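Analysis (1): The first row shows that adding temporal information does help the results, and that DTM and TTM in particular work better. Our explanation is that these two use the differences between temporally adjacent data as information, so they improve more than LTM.

Analysis (2): Reducing the dimension with LPP works better than with SLPP or LSDA. Our explanation is that the first-layer reduction should preserve the original structure of the data, and a supervised method at that layer alters the distribution of the data. In my view, using labels once here, then adding temporal information, and then using labels again is somewhat redundant and distracts from the main point, which is the later spatio-temporal data. I therefore think the labels are better used in that later part.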

27


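Analysis (3): Here I analyze the effect of the size of t (the length of the temporal window). Comparing the bold figures on the previous slide with the bold figures on this slide, increasing t has little effect on DTM and TTM, but LTM drops considerably. We discussed this and believe the problem lies in LMNN itself: its metric learning has no notion of weighting, that is, nothing makes temporally closer points more important and more influential. While t is still small the numbers may change little, but when t becomes too large I think not only does the data grow too big, the accuracy also drops, because a large t brings in temporally distant data that are only weakly related. Using them blurs the focus and is superfluous.

I therefore think a larger t is not always better. We have not studied the choice of t in depth; my view is that its best value can be found experimentally.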

28

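5. Experimental Results: Noisy Data (5/5)

[Figure: noisy images for v=0.1, v=0.15, v=0.2]

Here we test whether our system is strongly affected by noisy data. We use MATLAB to generate noise of different variances; the resulting images are shown in the figure on the left. The experimental results are shown in the table below:

Analysis (4): The noise has little effect on our system, provided that it is salt noise rather than noise that occludes a large block of the image. I think the effect is small because, after dimensionality reduction, such noise is rarely kept as a feature. We did not test large-block occlusion, but I expect it would affect the results: if the occluded area is too large, or covers the body part that characterizes an action, the results will certainly suffer.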

30

6. Conclusion

We propose a novel framework, “TVTL”, for human action recognition. In this framework we try to find a proper way to measure similarity by taking into consideration not only the spatial information but also the temporal information.

We show that adding temporal information has a positive influence, and moreover that our method is robust to noisy data.

In the future we can look for ways to improve our method, in both speed and accuracy, and both deserve further study. Action recognition is becoming increasingly important, and I saw various approaches while attending ACCV. I believe using temporal information is very helpful; although our method adds it to the data in a very intuitive way, how it is added can still be explored further.

31

Reference

1. C. Fang, J. Chen, C. Tseng, and J. Lien, “Human Action Recognition using Spatio-Temporal Classification,” ACCV, 2009.

2. L. Jia and D. Yeung, “Human Action Recognition using Local Spatio-Temporal Discriminant Embedding,” CVPR, pp. 1-8, 2008.

3. S. Yan, D. Xu, B. Zhang, H. Zhang, Q. Yang, and S. Lin, “Graph Embedding and Extensions: A General Framework for Dimensionality Reduction,” IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 29, No. 1, pp. 40-51, 2007.

4. X. He and P. Niyogi, “Locality Preserving Projections,” Advances in Neural Information Processing Systems, pp. 153-160, 2003.

5. K. Weinberger and L. Saul, “Distance Metric Learning for Large Margin Nearest Neighbor Classification,” Journal of Machine Learning Research, pp. 209-244, 2009.
