Thesis Information

Chinese Title: 多模态特征融合的视频记忆度预测研究 (Research on Video Memorability Prediction Based on Multimodal Feature Fusion)

Name: 常诗颖

Student ID: 1049721801570

Confidentiality Level: Public

Thesis Language: Chinese

Discipline Code: 081200

Discipline: Engineering - Computer Science and Technology (Engineering or Science degree conferrable)

Student Type: Master's

Degree: Master of Engineering

University: Wuhan University of Technology

School: School of Computer Science and Technology

Major: Computer Science and Technology

Research Direction: Machine Learning

First Supervisor: 胡燕

First Supervisor's Institution: Wuhan University of Technology

Completion Date: 2021-06-10

Defense Date: 2021-06-25

Keywords: video memorability; textual features; deep visual features; optical flow features; multimodal feature fusion

Abstract:

With the explosive growth of online video, videos of every kind now appear on Internet sharing platforms. Research has shown that people do not remember the videos they watch equally well: some videos are remembered for a long time, while others are forgotten almost immediately. Video memorability is a measure of how memorable a video is, and computational models that predict it automatically have broad application prospects. How to predict video memorability effectively is therefore the main subject of this thesis.

Memorability is an intrinsic property of images, and people share common tendencies in what they remember. Unlike a single image, a video combines images, sound, text, motion, and other dimensions, conveying richer media content, so video memorability prediction is influenced by more factors. Because a single-modality model cannot describe video memorability comprehensively, its prediction performance in practice is poor. This thesis therefore takes video as its research object and explores how features from the video's textual title, image depth, motion information, and other dimensions affect video memorability. Its main work is to build an effective prediction model that improves video memorability prediction performance. The specific research content is as follows:

(1) To study the influence of video titles and image depth information on video memorability, a prediction model fusing textual and deep visual features is proposed. First, the TF-IDF algorithm extracts textual features from each video's descriptive title, assigning weights to words that influence memorability. Each video is then split into frames; a depth estimation model extracts depth maps as the video's depth information, a pretrained ResNet-152 network extracts visual features, and a ResNet-152 fine-tuned on a depth-map dataset extracts depth features, which are concatenated with the visual features to form deep visual features. The textual and deep visual features are each fed to a regression algorithm to predict memorability scores, and the two modalities are fused by the weighted-average method of late fusion. Comparative experiments on a public dataset achieve Spearman rank correlations of 0.547 and 0.260 on the short-term and long-term memorability prediction tasks respectively, demonstrating the model's effectiveness.
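The two-branch pipeline above can be sketched in miniature. This is a hypothetical illustration, not the thesis implementation: the titles, scores, fusion weights, and the random array standing in for ResNet-152 deep visual features are all invented for demonstration; only the structure (TF-IDF text branch, a regressor per modality, weighted-average late fusion) follows the description.

```python
# Minimal sketch of the text + deep-visual pipeline with late fusion.
# All data and weights below are illustrative assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVR

titles = [
    "a man is playing guitar on the street",
    "slow motion drop of water into a glass",
    "aerial view of a snowy mountain range",
    "a child blows out birthday candles",
]
short_term_scores = np.array([0.91, 0.78, 0.65, 0.88])  # ground-truth memorability

# Branch 1: textual features from the descriptive titles via TF-IDF.
vectorizer = TfidfVectorizer()
text_feats = vectorizer.fit_transform(titles)

# Branch 2: stand-in for concatenated deep visual features
# (in the thesis these come from ResNet-152 on frames and depth maps).
rng = np.random.default_rng(0)
visual_feats = rng.normal(size=(len(titles), 16))

# One regressor per modality predicts a memorability score.
text_reg = SVR().fit(text_feats, short_term_scores)
visual_reg = SVR().fit(visual_feats, short_term_scores)

# Late fusion: weighted average of the per-modality predictions.
w_text, w_visual = 0.4, 0.6  # assumed fusion weights
fused = (w_text * text_reg.predict(text_feats)
         + w_visual * visual_reg.predict(visual_feats))
# `fused` holds one fused memorability score per video.
```

In practice the fusion weights would be tuned on a validation split rather than fixed as here.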

(2) To address the inability of existing video memorability prediction models to adequately describe the influence of motion information, and to further improve prediction performance, a motion-feature dimension is added to the text-plus-deep-visual model. Motion information is described in the form of optical flow, yielding a multimodal prediction model that fuses textual, image-depth, and optical-flow information. An optical flow estimation model first extracts flow maps, and a ResNet-152 fine-tuned on an optical-flow dataset extracts flow features. The features of the three dimensions are each regressed to a memorability score within their own modality, and the three modality scores are then combined by late fusion. A series of comparative experiments on a public dataset achieves Spearman rank correlations of 0.567 and 0.272 on the short-term and long-term tasks respectively, demonstrating the improvement that multimodal feature fusion brings to video memorability prediction.
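The final late-fusion step over three modalities can be sketched as a small grid search over fusion weights, scored by the Spearman rank correlation the thesis reports. The per-modality predictions and ground-truth scores below are invented for illustration; the thesis does not specify this exact weight-selection procedure, so treat it as one plausible way to choose the weights.

```python
# Hypothetical three-modality late fusion: search a coarse weight grid
# and keep the combination maximizing Spearman rank correlation.
import itertools
import numpy as np
from scipy.stats import spearmanr

truth = np.array([0.91, 0.78, 0.65, 0.88, 0.72])  # illustrative ground truth
preds = {
    "text":   np.array([0.85, 0.80, 0.70, 0.82, 0.75]),
    "visual": np.array([0.90, 0.74, 0.66, 0.86, 0.70]),
    "flow":   np.array([0.88, 0.79, 0.60, 0.89, 0.74]),
}

best_weights, best_rho = None, -1.0
for w1, w2 in itertools.product(np.arange(0, 1.05, 0.05), repeat=2):
    w3 = 1.0 - w1 - w2
    if w3 < -1e-9:          # weights must sum to 1 and be non-negative
        continue
    w3 = max(w3, 0.0)
    fused = w1 * preds["text"] + w2 * preds["visual"] + w3 * preds["flow"]
    rho = spearmanr(fused, truth).correlation
    if rho > best_rho:
        best_weights, best_rho = (round(w1, 2), round(w2, 2), round(w3, 2)), rho
# best_weights / best_rho: chosen fusion weights and their rank correlation.
```

On a real validation set this selection would be done once, then the chosen weights applied to the test predictions.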

(3) The proposed multimodal feature-fusion model for video memorability prediction is applied to a company's network encoding streamer to predict advertisement memorability. The advertisement memorability prediction module is analyzed and designed, prediction experiments are conducted with mobile-phone advertisements as a case study, and the results are analyzed, showing that the proposed model can effectively predict the memorability of different advertisements.


CLC Number: TP391.41

Barcode: 002000062962

Call Number: TD10049615

Shelf Location: 403

Notes: 403 - West Campus branch doctoral/master's thesis collection; 203 - Yujiatou branch doctoral/master's thesis collection
