Computer Science: Computer Vision and Pattern Recognition

Video Analysis and Summarization

Description

This cluster of papers focuses on the automatic analysis and summarization of video content, covering topics such as shot boundary detection, user attention models, semantic analysis, key frame extraction, event detection, and the application of these techniques to soccer videos. It also explores the use of the MPEG-7 standard and content-based retrieval methods in video summarization.

Keywords

Video Summarization; Content-Based Retrieval; Shot Boundary Detection; User Attention Model; MPEG-7 Standard; Semantic Analysis; Key Frame Extraction; Event Detection; Soccer Video; Multimodal Indexing

We present a novel method for summarizing raw, casually captured videos. The objective is to create a short summary that still conveys the story. It should thus be both interesting and representative of the input video. Previous methods often used simplified assumptions and only optimized for one of these goals. Alternatively, they used hand-defined objectives that were optimized sequentially by making consecutive hard decisions. This limits their use to a particular setting. Instead, we introduce a new method that (i) uses a supervised approach in order to learn the importance of global characteristics of a summary and (ii) jointly optimizes for multiple objectives and thus creates summaries that possess multiple properties of a good summary. Experiments on two challenging and very diverse datasets demonstrate the effectiveness of our method, where we outperform or match the current state of the art.
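The following sketch illustrates the general idea of jointly optimizing several summary objectives under a length budget; it is not the paper's learned formulation, and the features, interestingness scores, and objective weights are invented placeholders.

```python
# Illustrative sketch (not the paper's exact method): greedily pick video
# segments that maximize a weighted sum of two summary objectives under a
# length budget. Features, interest scores, and weights are hypothetical.
import numpy as np

def representativeness(selected, features):
    """Mean similarity between every segment and its closest selected segment."""
    if not selected:
        return 0.0
    sims = features @ features[selected].T          # rows are L2-normalised, so this is cosine similarity
    return float(sims.max(axis=1).mean())

def summary_score(selected, features, interest, w_rep=1.0, w_int=1.0):
    """Weighted combination of two global summary objectives."""
    rep = representativeness(selected, features)
    inter = float(interest[selected].mean()) if selected else 0.0
    return w_rep * rep + w_int * inter

def greedy_summary(features, interest, budget, **weights):
    """Greedy joint optimisation: add the segment with the largest marginal gain."""
    selected = []
    for _ in range(budget):
        gains = []
        for i in range(len(features)):
            if i in selected:
                gains.append(-np.inf)
                continue
            gains.append(summary_score(selected + [i], features, interest, **weights))
        selected.append(int(np.argmax(gains)))
    return selected

rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 16))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
interest = rng.random(50)
print(greedy_summary(feats, interest, budget=5, w_rep=1.0, w_int=0.5))
```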
Many algorithms have been proposed for detecting video shot boundaries and classifying shot and shot transition types. Few published studies compare available algorithms, and those that do have looked at a limited range of test material. This paper presents a comparison of several shot boundary detection and classification techniques and their variations, including histogram, discrete cosine transform, motion vector, and block matching methods. The performance and ease of selecting good thresholds for these algorithms are evaluated based on a wide variety of video sequences with a good mix of transition types. Threshold selection requires a trade-off between recall and precision that must be guided by the target application.
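As a concrete illustration of the histogram-based family of detectors compared above, here is a minimal cut detector based on frame-histogram differences; the bin count, the threshold, and the synthetic frames are assumptions for demonstration only.

```python
# Minimal histogram-difference cut detector in the spirit of the methods
# compared above; frame data, bin count, and threshold are assumptions.
import numpy as np

def gray_histogram(frame, bins=64):
    hist, _ = np.histogram(frame, bins=bins, range=(0, 255))
    return hist / max(hist.sum(), 1)

def detect_cuts(frames, bins=64, threshold=0.4):
    """Flag a shot boundary where consecutive histograms differ strongly.

    `threshold` trades recall against precision: lower values catch more
    transitions but also more false alarms (e.g. camera flashes).
    """
    cuts = []
    prev = gray_histogram(frames[0], bins)
    for i in range(1, len(frames)):
        cur = gray_histogram(frames[i], bins)
        diff = 0.5 * np.abs(cur - prev).sum()   # L1 histogram difference in [0, 1]
        if diff > threshold:
            cuts.append(i)
        prev = cur
    return cuts

# Synthetic example: 20 dark frames followed by 20 bright frames -> one cut at index 20.
frames = np.concatenate([np.full((20, 32, 32), 30), np.full((20, 32, 32), 200)])
print(detect_cuts(frames))   # [20]
```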
Various methods of automatic shot boundary detection have been proposed and claimed to perform reliably. Detection of edits is fundamental to any kind of video analysis, since it segments a video into its basic components, that is, the shots. However, only a few comparative investigations on early shot boundary detection algorithms have been published. These investigations mainly concentrate on measuring the edit detection performance, but they do not consider the algorithms' ability to classify the types, and to locate the boundaries, of the edits correctly. This paper extends these comparative investigations. More recent algorithms designed explicitly to detect specific complex editing operations, such as fades and dissolves, are taken into account, and their ability to classify the types and locate the boundaries of such edits is examined. The algorithms' performance is measured in terms of hit rate, number of false hits, and miss rate for hard cuts, fades, and dissolves, over a large and diverse set of video sequences. The experiments show that while hard cuts and fades can be detected reliably, dissolves are still an open research issue. The false hit rate for dissolves is usually unacceptably high, ranging from 50 percent up to more than 400 percent. Moreover, all algorithms seem to fail under roughly the same conditions.
The TREC Video Retrieval Evaluation (TRECVid) is an international benchmarking activity that encourages research in video information retrieval by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results. TRECVid completed its fifth annual cycle at the end of 2005, and in 2006 TRECVid will involve almost 70 research organizations, universities and other consortia. Throughout its existence, TRECVid has benchmarked both interactive and automatic/manual searching for shots from within a video corpus, automatic detection of a variety of semantic and low-level video features, shot boundary detection and the detection of story boundaries in broadcast TV news. This paper gives an introduction to information retrieval (IR) evaluation from both a user and a system perspective, highlighting that system evaluation is by far the most prevalent type of evaluation carried out. We also include a summary of TRECVid as an example of a system evaluation benchmarking campaign, which allows us to discuss whether such campaigns are a good thing or a bad thing. There are arguments for and against these campaigns, and we present some of them in the paper, concluding that on balance they have had a very positive impact on research progress.
Given an image (or video clip, or audio song), how do we automatically assign keywords to it? The general problem is to find correlations across the media in a collection of multimedia objects like video clips, with colors, and/or motion, and/or audio, and/or text scripts. We propose a novel, graph-based approach, "MMG", to discover such cross-modal correlations. Our "MMG" method requires no tuning, no clustering, and no user-determined constants; it can be applied to any multimedia collection, as long as we have a similarity function for each medium; and it scales linearly with the database size. We report auto-captioning experiments on the "standard" Corel image database of 680 MB, where it outperforms domain-specific, fine-tuned methods by up to 10 percentage points in captioning accuracy (a 50% relative improvement).
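A random walk with restarts over a mixed-media graph is one standard way to score cross-modal affinities of the kind discussed above; the sketch below uses a tiny hand-made adjacency matrix and node layout and should not be read as the exact MMG algorithm.

```python
# Sketch of a random-walk-with-restart over a mixed-media graph, a common way
# to realise cross-modal correlation discovery. The adjacency matrix, node
# roles, and restart probability are illustrative assumptions.
import numpy as np

def random_walk_with_restart(adj, seed, restart=0.35, iters=100):
    """Steady-state visit probabilities of a walker restarting at `seed`."""
    col_sums = adj.sum(axis=0, keepdims=True)
    trans = adj / np.where(col_sums == 0, 1, col_sums)   # column-normalised transition matrix
    p = np.zeros(adj.shape[0]); p[seed] = 1.0
    restart_vec = p.copy()
    for _ in range(iters):
        p = (1 - restart) * trans @ p + restart * restart_vec
    return p

# Nodes 0-1: images, nodes 2-4: caption words ("sky", "sea", "grass").
adj = np.array([[0, 0, 1, 1, 0],
                [0, 0, 0, 1, 1],
                [1, 0, 0, 0, 0],
                [1, 1, 0, 0, 0],
                [0, 1, 0, 0, 0]], dtype=float)
scores = random_walk_with_restart(adj, seed=0)
print(scores[2:])   # affinity of each word to image 0 -> candidate auto-captions
```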
As increasingly powerful techniques emerge for machine tagging multimedia content, it becomes ever more important to standardize the underlying vocabularies. Doing so provides interoperability and lets the multimedia community focus ongoing research on a well-defined set of semantics. This paper describes a collaborative effort of multimedia researchers, library scientists, and end users to develop a large standardized taxonomy for describing broadcast news video. The large-scale concept ontology for multimedia (LSCOM) is the first of its kind designed to simultaneously optimize utility to facilitate end-user access, cover a large semantic space, make automated extraction feasible, and increase observability in diverse broadcast news video data sets.
The demand for various multimedia applications is rapidly increasing due to recent advances in the computing and network infrastructure, together with the widespread use of digital video technology. Among the key elements for the success of these applications is how to effectively and efficiently manage and store a huge amount of audio-visual information, while at the same time providing user-friendly access to the stored data. This has fueled a quickly evolving research area known as video abstraction. As the name implies, video abstraction is a mechanism for generating a short summary of a video, which can either be a sequence of stationary images (keyframes) or moving images (video skims). In terms of browsing and navigation, a good video abstract will enable the user to gain maximum information about the target video sequence under a specified time constraint, or sufficient information in the minimum time. Over the past years, various ideas and techniques have been proposed towards the effective abstraction of video contents. The purpose of this article is to provide a systematic classification of these works. We identify and detail, for each approach, the underlying components and how they are addressed in specific works.
Learning-based video annotation is a promising approach to facilitating video retrieval, and it can avoid the intensive labor costs of pure manual annotation. But it frequently encounters several difficulties, such as insufficiency of training data and the curse of dimensionality. In this paper, we propose a method named optimized multigraph-based semi-supervised learning (OMG-SSL), which aims to simultaneously tackle these difficulties in a unified scheme. We show that various crucial factors in video annotation, including multiple modalities, multiple distance functions, and temporal consistency, all correspond to different relationships among video units, and hence they can be represented by different graphs. Therefore, these factors can be simultaneously dealt with by learning with multiple graphs, namely, the proposed OMG-SSL approach. Different from the existing graph-based semi-supervised learning methods that only utilize one graph, OMG-SSL integrates multiple graphs into a regularization framework in order to sufficiently explore their complementarity. We show that this scheme is equivalent to first fusing multiple graphs and then conducting semi-supervised learning on the fused graph. Through an optimization approach, it is able to assign suitable weights to the graphs. Furthermore, we show that the proposed method can be implemented through a computationally efficient iterative process. Extensive experiments on the TREC video retrieval evaluation (TRECVID) benchmark have demonstrated the effectiveness and efficiency of our proposed approach.
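The core idea of fusing several relationship graphs before propagating annotation labels can be sketched in a few lines; here the per-graph weights are fixed by hand rather than learned through the paper's optimization, and the graphs themselves are toy matrices.

```python
# Minimal illustration of "fuse multiple graphs, then do semi-supervised
# learning on the fused graph"; graph weights are assumed, not learned.
import numpy as np

def label_propagation(W, labels, iters=50, alpha=0.9):
    """Propagate labels (+1/-1, 0 = unlabeled) over a similarity graph W."""
    D = np.diag(1.0 / np.maximum(W.sum(axis=1), 1e-12))
    S = D @ W                                  # row-normalised transition matrix
    f = labels.astype(float).copy()
    for _ in range(iters):
        f = alpha * S @ f + (1 - alpha) * labels
    return f

# Two graphs from two modalities (e.g. colour and motion similarity) plus weights.
W_color = np.array([[0.0, 1.0, 0.2], [1.0, 0.0, 0.1], [0.2, 0.1, 0.0]])
W_motion = np.array([[0.0, 0.5, 0.9], [0.5, 0.0, 0.2], [0.9, 0.2, 0.0]])
weights = [0.7, 0.3]                           # per-graph importance (hand-picked for the example)
W_fused = weights[0] * W_color + weights[1] * W_motion

labels = np.array([1, -1, 0])                  # third video unit is unlabeled
print(label_propagation(W_fused, labels))      # sign of entry 2 = predicted annotation
```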
We propose a visual event recognition framework for consumer videos by leveraging a large amount of loosely labeled web videos (e.g., from YouTube). Observing that consumer videos generally contain large intraclass variations within the same type of events, we first propose a new method, called Aligned Space-Time Pyramid Matching (ASTPM), to measure the distance between any two video clips. Second, we propose a new transfer learning method, referred to as Adaptive Multiple Kernel Learning (A-MKL), in order to 1) fuse the information from multiple pyramid levels and features (i.e., space-time features and static SIFT features) and 2) cope with the considerable variation in feature distributions between videos from two domains (i.e., the web video domain and the consumer video domain). For each pyramid level and each type of local features, we first train a set of SVM classifiers based on the combined training set from two domains by using multiple base kernels from different kernel types and parameters, which are then fused with equal weights to obtain a prelearned average classifier. In A-MKL, for each event class we learn an adapted target classifier based on multiple base kernels and the prelearned average classifiers from this event class or all the event classes by minimizing both the structural risk functional and the mismatch between the data distributions of the two domains. Extensive experiments demonstrate the effectiveness of our proposed framework, which requires only a small number of labeled consumer videos by leveraging web data. We also conduct an in-depth investigation of various aspects of the proposed method A-MKL, such as the analysis of the combination coefficients on the prelearned classifiers, the convergence of the learning algorithm, and the performance variation when using different proportions of labeled consumer videos. Moreover, we show that A-MKL using the prelearned classifiers from all the event classes leads to better performance when compared with A-MKL using the prelearned classifiers only from each individual event class.
Video indexing and retrieval have a wide spectrum of promising applications, motivating the interest of researchers worldwide. This paper offers a tutorial and an overview of the landscape of general strategies in visual content-based video indexing and retrieval, focusing on methods for video structure analysis, including shot boundary detection, key frame extraction and scene segmentation, extraction of features including static key frame features, object features and motion features, video data mining, video annotation, video retrieval including query interfaces, similarity measure and relevance feedback, and video browsing. Finally, we analyze future research directions.
We present a video summarization approach for egocentric or "wearable" camera data. Given hours of video, the proposed method produces a compact storyboard summary of the camera wearer's day. In contrast to traditional keyframe selection techniques, the resulting summary focuses on the most important objects and people with which the camera wearer interacts. To accomplish this, we develop region cues indicative of high-level saliency in egocentric video — such as the nearness to hands, gaze, and frequency of occurrence — and learn a regressor to predict the relative importance of any new region based on these cues. Using these predictions and a simple form of temporal event detection, our method selects frames for the storyboard that reflect the key object-driven happenings. Critically, the approach is neither camera-wearer-specific nor object-specific; that means the learned importance metric need not be trained for a given user or context, and it can predict the importance of objects and people that have never been seen previously. Our results with 17 hours of egocentric data show the method's promise relative to existing techniques for saliency and summarization.
Given an image, we propose a hierarchical generative model that classifies the overall scene, recognizes and segments each object component, as well as annotates the image with a list of tags. To our knowledge, this is the first model that performs all three tasks in one coherent framework. For instance, a scene of a `polo game' consists of several visual objects such as `human', `horse', `grass', etc. In addition, it can be further annotated with a list of more abstract (e.g. `dusk') or visually less salient (e.g. `saddle') tags. Our generative model jointly explains images through a visual model and a textual model. Visually relevant objects are represented by regions and patches, while visually irrelevant textual annotations are influenced directly by the overall scene class. We propose a fully automatic learning framework that is able to learn robust scene models from noisy Web data such as images and user tags from Flickr.com. We demonstrate the effectiveness of our framework by automatically classifying, annotating and segmenting images from eight classes depicting sport scenes. In all three tasks, our model significantly outperforms state-of-the-art algorithms.
A number of automated shot-change detection methods for indexing a video sequence to facilitate browsing and retrieval have been proposed. Many of these methods use color histograms or features computed from block motion or compression parameters to compute frame differences. It is important to evaluate and characterize their performance so as to deliver a single set of algorithms that may be used by other researchers for indexing video databases. We present the results of a performance evaluation and characterization of a number of shot-change detection methods that use color histograms, block motion matching, or MPEG compressed data.
This paper looks into a new direction in video content analysis - the representation and modeling of affective video content. The affective content of a given video clip can be defined as the intensity and type of feeling or emotion (both are referred to as affect) that are expected to arise in the user while watching that clip. The availability of methodologies for automatically extracting this type of video content will extend the current scope of possibilities for video indexing and retrieval. For instance, we will be able to search for the funniest or the most thrilling parts of a movie, or the most exciting events of a sport program. Furthermore, as the user may want to select a movie not only based on its genre, cast, director and story content, but also on its prevailing mood, affective content analysis is also likely to contribute to enhancing the quality of personalizing the video delivery to the user. We propose in this paper a computational framework for affective video content representation and modeling. This framework is based on the dimensional approach to affect that is known from the field of psychophysiology. According to this approach, the affective video content can be represented as a set of points in the two-dimensional (2-D) emotion space that is characterized by the dimensions of arousal (intensity of affect) and valence (type of affect). We map the affective video content onto the 2-D emotion space by using models that link the arousal and valence dimensions to low-level features extracted from the video data. This results in arousal and valence time curves that, either considered separately or combined into the so-called affect curve, are introduced as reliable representations of expected transitions from one feeling to another along a video, as perceived by a viewer.
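A toy version of turning a low-level feature into a smoothed, normalized arousal time curve is shown below; the choice of feature, the smoothing window, and the scaling are illustrative assumptions rather than the paper's actual arousal model.

```python
# Toy arousal-curve construction: map a per-frame motion-magnitude signal to a
# smoothed, normalised curve in [0, 1]. Window size and feature are assumptions.
import numpy as np

def arousal_curve(motion, win=25):
    kernel = np.kaiser(win, beta=5)
    kernel /= kernel.sum()
    smooth = np.convolve(motion, kernel, mode="same")   # temporal smoothing
    lo, hi = smooth.min(), smooth.max()
    return (smooth - lo) / (hi - lo + 1e-12)            # scale to [0, 1]

motion = np.abs(np.random.default_rng(1).normal(size=500)).cumsum() % 7
curve = arousal_curve(motion)
print(curve.argmax())    # frame index of the most "exciting" moment under this toy model
```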
Video summarization is a challenging problem with great application potential. Whereas prior approaches, largely unsupervised in nature, focus on sampling useful frames and assembling them as summaries, we consider video summarization as a supervised subset selection problem. Our idea is to teach the system to learn from human-created summaries how to select informative and diverse subsets, so as to best meet evaluation metrics derived from human-perceived quality. To this end, we propose the sequential determinantal point process (seqDPP), a probabilistic model for diverse sequential subset selection. Our novel seqDPP heeds the inherent sequential structures in video data, thus overcoming the deficiency of the standard DPP, which treats video frames as randomly permutable items. Meanwhile, seqDPP retains the power of modeling diverse subsets, essential for summarization. Our extensive results of summarizing videos from 3 datasets demonstrate the superior performance of our method, compared to not only existing unsupervised methods but also naive applications of the standard DPP model.
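To convey the diverse-subset intuition behind DPP-based selection, the following sketch greedily maximizes the log-determinant of a kernel submatrix (a standard DPP MAP heuristic); it is not the sequential seqDPP model, and the kernel is built from random frame features.

```python
# Greedy determinant-based selection: each new frame must add "volume" to the
# kernel submatrix of the summary. Kernel construction and sizes are made up.
import numpy as np

def greedy_dpp_map(L, k):
    """Greedily maximise log det of the selected kernel submatrix."""
    selected = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in range(L.shape[0]):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_gain:
                best, best_gain = i, logdet
        selected.append(best)
    return selected

rng = np.random.default_rng(0)
feats = rng.normal(size=(30, 8))
# RBF similarity kernel over frame features (positive definite).
L = np.exp(-0.5 * np.linalg.norm(feats[:, None] - feats[None, :], axis=-1) ** 2)
print(greedy_dpp_map(L, k=5))   # indices of a diverse 5-frame "summary"
```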
We present a video summarization approach that discovers the story of an egocentric video. Given a long input video, our method selects a short chain of video subshots depicting the essential events. Inspired by work in text analysis that links news articles over time, we define a random-walk based metric of influence between subshots that reflects how visual objects contribute to the progression of events. Using this influence metric, we define an objective for the optimal k-subshot summary. Whereas traditional methods optimize a summary's diversity or representativeness, ours explicitly accounts for how one sub-event "leads to" another, which, critically, captures event connectivity beyond simple object co-occurrence. As a result, our summaries provide a better sense of story. We apply our approach to over 12 hours of daily activity video taken from 23 unique camera wearers, and systematically evaluate its quality compared to multiple baselines with 34 human subjects.
We present an unsupervised technique for detecting unusual activity in a large video set using many simple features. No complex activity models and no supervised feature selections are used. We divide the video into equal length segments and classify the extracted features into prototypes, from which a prototype-segment co-occurrence matrix is computed. Motivated by a similar problem in document-keyword analysis, we seek a correspondence relationship between prototypes and video segments which satisfies the transitive closure constraint. We show that an important sub-family of correspondence functions can be reduced to co-embedding prototypes and segments to N-D Euclidean space. We prove that an efficient, globally optimal algorithm exists for the co-embedding problem. Experiments on various real-life videos have validated our approach.
Due to the information redundancy of video, automatically extracting essential video content is one of the key techniques for accessing and managing a large video library. In this paper, we present a generic framework of a user attention model, which estimates the attention viewers may pay to video contents. As human attention is an effective and efficient mechanism for information prioritizing and filtering, the user attention model provides an effective approach to video indexing based on importance ranking. In particular, we define viewer attention through multiple sensory perceptions, i.e., visual and aural stimulus as well as partly semantic understanding. Also, a set of modeling methods for visual and aural attention is proposed. As one of the important applications of the user attention model, a feasible solution for video summarization, without full semantic understanding of video content or complex heuristic rules, is implemented to demonstrate the effectiveness, robustness, and generality of the user attention model. The promising results from the user study on video summarization indicate that the user attention model is an alternative way to video understanding.
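The sketch below fuses hypothetical visual and aural attention curves into one importance curve and picks keyframes at its strongest, well-separated peaks; the curves, fusion weights, and peak rule are simplified stand-ins for the paper's attention models.

```python
# Schematic fusion of per-frame attention curves into a single importance
# curve, then keyframe picking at its strongest peaks. All inputs are synthetic.
import numpy as np

def fuse_attention(visual, aural, w_visual=0.6, w_aural=0.4):
    return w_visual * visual + w_aural * aural

def pick_keyframes(importance, n_keyframes=3, min_gap=10):
    order = np.argsort(importance)[::-1]       # frames ranked by fused attention
    chosen = []
    for idx in order:
        if all(abs(idx - c) >= min_gap for c in chosen):
            chosen.append(int(idx))
        if len(chosen) == n_keyframes:
            break
    return sorted(chosen)

t = np.linspace(0, 8 * np.pi, 400)
visual = (np.sin(t) + 1) / 2                   # stand-in visual attention curve
aural = (np.cos(t / 2) + 1) / 2                # stand-in aural attention curve
importance = fuse_attention(visual, aural)
print(pick_keyframes(importance))              # indices of the selected keyframes
```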
Key frame extraction has been recognized as one of the important research issues in video information retrieval. Although progress has been made in key frame extraction, the existing approaches are either computationally expensive or ineffective in capturing salient visual content. We first discuss the importance of key frame selection, and then review and evaluate the existing approaches. To overcome the shortcomings of the existing approaches, we introduce a new algorithm for key frame extraction based on unsupervised clustering. The proposed algorithm is both computationally simple and able to adapt to the visual content. Its efficiency and effectiveness are validated on a large amount of real-world videos.
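A minimal clustering-based keyframe extractor in this spirit can be written with scikit-learn's KMeans: cluster per-frame color histograms and keep the frame closest to each centroid. The histogram data and cluster count below are synthetic assumptions, not the paper's exact algorithm.

```python
# Clustering-based keyframe extraction: cluster per-frame histograms and keep
# the most central frame of each cluster. Frame data and k are assumptions.
import numpy as np
from sklearn.cluster import KMeans

def keyframes_by_clustering(histograms, n_keyframes=4, seed=0):
    km = KMeans(n_clusters=n_keyframes, n_init=10, random_state=seed).fit(histograms)
    keyframes = []
    for c in range(n_keyframes):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(histograms[members] - km.cluster_centers_[c], axis=1)
        keyframes.append(int(members[dists.argmin()]))   # frame nearest the cluster centre
    return sorted(keyframes)

rng = np.random.default_rng(0)
# 200 fake frames whose histograms drift through 4 distinct "scenes".
hists = np.concatenate([rng.normal(loc=m, scale=0.05, size=(50, 16)) for m in range(4)])
print(keyframes_by_clustering(hists))
```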
Several rapid scene analysis algorithms for detecting scene changes and flashlight scenes directly on compressed video are proposed. These algorithms operate on the DC sequence which can be readily extracted from video compressed using Motion JPEG or MPEG without full-frame decompression. The DC images occupy only a small fraction of the original data size while retaining most of the essential "global" information. Operating on these images offers a significant computation saving. Experimental results show that the proposed algorithms are fast and effective in detecting abrupt scene changes, gradual transitions including fade-ins and fade-outs, flashlight scenes and in deriving intrashot variations.
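The following toy example mimics working on small DC images rather than full frames: each frame is reduced to an 8x8 block-average thumbnail, and a cut is flagged when consecutive thumbnails differ sharply. The block size, threshold, and synthetic frames are assumptions for illustration.

```python
# Rough illustration of operating on tiny "DC images" (block averages) instead
# of full frames for abrupt-cut detection. Parameters are assumptions.
import numpy as np

def dc_image(frame, block=8):
    h, w = frame.shape
    return frame[:h - h % block, :w - w % block] \
        .reshape(h // block, block, w // block, block).mean(axis=(1, 3))

def dc_cut_detector(frames, threshold=40.0):
    cuts = []
    prev = dc_image(frames[0])
    for i in range(1, len(frames)):
        cur = dc_image(frames[i])
        if np.abs(cur - prev).mean() > threshold:   # mean absolute DC difference
            cuts.append(i)
        prev = cur
    return cuts

frames = np.concatenate([np.full((15, 64, 64), 40.0), np.full((15, 64, 64), 180.0)])
print(dc_cut_detector(frames))   # -> [15]
```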
Video management tools and techniques are based on pixels rather than perceived content. Thus, state-of-the-art video editing systems can easily manipulate such things as time codes and image frames, but they cannot "know," for example, what a basketball is. Our research addresses four areas of content-based video management.
We propose a fully automatic and computationally efficient framework for analysis and summarization of soccer videos using cinematic and object-based features. The proposed framework includes some novel low-level processing algorithms, such as dominant color region detection, robust shot boundary detection, and shot classification, as well as some higher-level algorithms for goal detection, referee detection, and penalty-box detection. The system can output three types of summaries: i) all slow-motion segments in a game; ii) all goals in a game; iii) slow-motion segments classified according to object-based features. The first two types of summaries are based on cinematic features only for speedy processing, while the summaries of the last type contain higher-level semantics. The proposed framework is efficient, effective, and robust. It is efficient in the sense that there is no need to compute object-based features when cinematic features are sufficient for the detection of certain events, e.g., goals in soccer. It is effective in the sense that the framework can also employ object-based features when needed to increase accuracy (at the expense of more computation). The efficiency, effectiveness, and robustness of the proposed framework are demonstrated over a large data set, consisting of more than 13 hours of soccer video, captured in different countries and under different conditions.
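One of the low-level cues mentioned above, the dominant (grass) color ratio, can be sketched as a simple HSV threshold followed by a shot-class rule; the color ranges and decision thresholds below are illustrative guesses, not the framework's tuned values.

```python
# Toy dominant-colour (grass) ratio check of the kind used for shot
# classification in soccer video; HSV ranges and cut-offs are assumptions.
import numpy as np

def grass_ratio(hsv_frame, h_range=(35, 85), s_min=60, v_min=40):
    h, s, v = hsv_frame[..., 0], hsv_frame[..., 1], hsv_frame[..., 2]
    mask = (h >= h_range[0]) & (h <= h_range[1]) & (s >= s_min) & (v >= v_min)
    return float(mask.mean())

def classify_shot(hsv_frame):
    """Long shots show mostly field; close-ups show little of it."""
    ratio = grass_ratio(hsv_frame)
    return "long" if ratio > 0.5 else ("medium" if ratio > 0.2 else "close-up")

# Synthetic HSV frame that is entirely "grass green".
field = np.dstack([np.full((90, 160), 60), np.full((90, 160), 180), np.full((90, 160), 120)])
print(classify_shot(field))   # -> "long"
```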
In today's fast-paced world, while the number of channels of television programming available is increasing rapidly, the time available to watch them remains the same or is decreasing. Users desire the capability to watch the programs time-shifted (on-demand) and/or to watch just the highlights to save time. In this paper we explore how to provide for the latter capability, that is, the ability to extract highlights automatically so that viewing time can be reduced.
Partitioning a video sequence into shots is the first step toward video-content analysis and content-based video browsing and retrieval. A video shot is defined as a series of interrelated consecutive frames taken contiguously by a single camera and representing a continuous action in time and space. As such, shots are considered to be the primitives for higher level content analysis, indexing, and classification. The objective of this paper is twofold. First, we analyze the shot-boundary detection problem in detail and identify major issues that need to be considered in order to solve this problem successfully. Then, we present a conceptual solution to the shot-boundary detection problem in which all issues identified in the previous step are considered. This solution is provided in the form of a statistical detector that is based on minimization of the average detection-error probability. We model the required statistical functions using a robust metric for visual content discontinuities (based on motion compensation) and take into account all (a priori) knowledge that we found relevant to shot-boundary detection. This knowledge includes the shot-length distribution, visual discontinuity patterns at shot boundaries, and characteristic temporal changes of visual features around a boundary. Major advantages of the proposed detector are its robust and sequence-independent performance, as well as the possibility to detect different types of shot boundaries simultaneously. We demonstrate the performance of our detector on the two most widely used types of shot boundaries: hard cuts and dissolves.
Dynamic events can be regarded as long-term temporal objects, which are characterized by spatio-temporal features at multiple temporal scales. Based on this, we design a simple statistical distance measure between video sequences (possibly of different lengths) based on their behavioral content. This measure is non-parametric and can thus handle a wide range of dynamic events. We use this measure for isolating and clustering events within long continuous video sequences. This is done without prior knowledge of the types of events, their models, or their temporal extent. An outcome of such a clustering process is a temporal segmentation of long video sequences into event-consistent sub-sequences, and their grouping into event-consistent clusters. Our event representation and associated distance measure can also be used for event-based indexing into long video sequences, even when only one short example-clip is available. However, when multiple example-clips of the same event are available (either as a result of the clustering process, or given manually), these can be used to refine the event representation, the associated distance measure, and accordingly the quality of the detection and clustering process.
We introduce the challenge problem for generic video indexing to gain insight into intermediate steps that affect the performance of multimedia analysis methods, while at the same time fostering repeatability of experiments. To arrive at a challenge problem, we provide a general scheme for the systematic examination of automated concept detection methods, by decomposing the generic video indexing problem into 2 unimodal analysis experiments, 2 multimodal analysis experiments, and 1 combined analysis experiment. For each experiment, we evaluate generic video indexing performance on 85 hours of international broadcast news data, from the TRECVID 2005/2006 benchmark, using a lexicon of 101 semantic concepts. By establishing a minimum performance on each experiment, the challenge problem allows for component-based optimization of the generic indexing issue, while simultaneously offering other researchers a reference for comparison during indexing methodology development. To stimulate further investigations in intermediate analysis steps that influence video indexing performance, the challenge offers to the research community a manually annotated concept lexicon, pre-computed low-level multimedia features, trained classifier models, and five experiments together with baseline performance, which are all available at http://www.mediamill.nl/challenge/.
We propose a deep neural network for the prediction of future frames in natural video sequences. To effectively handle complex evolution of pixels in videos, we propose to decompose the motion and content, two key components generating dynamics in videos. Our model is built upon the Encoder-Decoder Convolutional Neural Network and Convolutional LSTM for pixel-level prediction, which independently capture the spatial layout of an image and the corresponding temporal dynamics. By independently modeling motion and content, predicting the next frame reduces to converting the extracted content features into the next frame content by the identified motion features, which simplifies the task of prediction. Our model is end-to-end trainable over multiple time steps, and naturally learns to decompose motion and content without separate training. We evaluate the proposed network architecture on human activity videos using KTH, Weizmann action, and UCF-101 datasets. We show state-of-the-art performance in comparison to recent approaches. To the best of our knowledge, this is the first end-to-end trainable network architecture with motion and content separation to model the spatiotemporal dynamics for pixel-level future prediction in natural videos.
Video summarization aims to facilitate large-scale video browsing by producing short, concise summaries that are diverse and representative of original videos. In this paper, we formulate video summarization as a sequential decision-making process and develop a deep summarization network (DSN) to summarize videos. DSN predicts for each video frame a probability, which indicates how likely a frame is selected, and then takes actions based on the probability distributions to select frames, forming video summaries. To train our DSN, we propose an end-to-end, reinforcement learning-based framework, where we design a novel reward function that jointly accounts for diversity and representativeness of generated summaries and does not rely on labels or user interactions at all. During training, the reward function judges how diverse and representative the generated summaries are, while DSN strives to earn higher rewards by learning to produce more diverse and more representative summaries. Since labels are not required, our method can be fully unsupervised. Extensive experiments on two benchmark datasets show that our unsupervised method not only outperforms other state-of-the-art unsupervised methods, but also is comparable to or even superior to most published supervised approaches.
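A reward of the diversity-plus-representativeness kind described above can be sketched as follows; the exact terms in the paper differ in detail, and the frame features here are random placeholders.

```python
# Sketch of a diversity + representativeness reward for a selected frame subset;
# the precise reward terms are simplified and the data is synthetic.
import numpy as np

def reward(features, selected):
    """features: (T, d) L2-normalised frame features; selected: index list."""
    sel = np.asarray(selected)
    # Diversity: mean pairwise dissimilarity among the selected frames.
    sims = features[sel] @ features[sel].T
    off_diag = sims[~np.eye(len(sel), dtype=bool)]
    r_div = float((1 - off_diag).mean()) if len(sel) > 1 else 0.0
    # Representativeness: how well the selected frames cover all frames.
    cover = (features @ features[sel].T).max(axis=1)
    r_rep = float(np.exp(-(1 - cover)).mean())
    return r_div + r_rep

rng = np.random.default_rng(0)
feats = rng.normal(size=(60, 32))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
print(reward(feats, selected=[3, 17, 42]))
```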
A core challenge for an agent learning to interact with the world is to predict how its actions affect objects in its environment. Many existing methods for learning the dynamics of physical interactions require labeled object information. However, to scale real-world interaction learning to a variety of scenes and objects, acquiring labeled data becomes increasingly impractical. To learn about physical object motion without labels, we develop an action-conditioned video prediction model that explicitly models pixel motion, by predicting a distribution over pixel motion from previous frames. Because our model explicitly predicts motion, it is partially invariant to object appearance, enabling it to generalize to previously unseen objects. To explore video prediction for real-world interactive agents, we also introduce a dataset of 59,000 robot interactions involving pushing motions, including a test set with novel objects. In this dataset, accurate prediction of videos conditioned on the robot's future actions amounts to learning a "visual imagination" of different futures based on different courses of action. Our experiments show that our proposed method produces more accurate video predictions both quantitatively and qualitatively, when compared to prior methods.
In this paper we present a new computer vision task, named video instance segmentation. The goal of this new task is simultaneous detection, segmentation and tracking of instances in videos. In other words, it is the first time that the image instance segmentation problem is extended to the video domain. To facilitate research on this new task, we propose a large-scale benchmark called YouTube-VIS, which consists of 2,883 high-resolution YouTube videos, a 40-category label set and 131k high-quality instance masks. In addition, we propose a novel algorithm called MaskTrack R-CNN for this task. Our method introduces a tracking branch to Mask R-CNN to jointly perform the detection, segmentation and tracking tasks simultaneously. Finally, we evaluate the proposed method and several strong baselines on our new dataset. Experimental results clearly demonstrate the advantages of the proposed algorithm and reveal insights for future improvement. We believe the video instance segmentation task will motivate the community along the line of research for video understanding.
Video instance segmentation (VIS) is the task of simultaneously classifying, segmenting and tracking object instances of interest in video. Recent methods typically develop sophisticated pipelines to tackle this task. Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem. Given a video clip consisting of multiple image frames as input, VisTR outputs the sequence of masks for each instance in the video directly and in order. At the core is a new, effective instance sequence matching and segmentation strategy, which supervises and segments instances at the sequence level as a whole. VisTR frames instance segmentation and tracking in the same perspective of similarity learning, thus considerably simplifying the overall pipeline and differing significantly from existing approaches. Without bells and whistles, VisTR achieves the highest speed among all existing VIS models and the best result among methods using a single model on the YouTube-VIS dataset. For the first time, we demonstrate a much simpler and faster video instance segmentation framework built upon Transformers, achieving competitive accuracy. We hope that VisTR can motivate future research on more video understanding tasks. Code is available at: https://git.io/VisTR
Multi-label movie genre classification is challenging due to the inherent ambiguity and overlap between different genres. Most of the existing works in genre classification use audio-visual modalities. The potential of text-based modalities in movie genre classification is still underexplored. This paper proposes an ensemble deep-learning model that uses movie plots to predict movie genres. After pre-processing the text plots, three transformer-based models, Bidirectional Encoder Representations from Transformers (BERT), DistilBERT, and Robustly Optimized BERT Pre-training Approach (RoBERTa), are used to generate genre predictions, which are combined through a weighted soft-voting method. The proposed ensemble architecture achieves state-of-the-art performance on two benchmark datasets, Trailers12K and LMTD9, with a micro-average precision of 80.10% and 80.37%, respectively, significantly outperforming both traditional machine learning approaches and advanced deep learning models. The ensemble's superior performance is attributed to its ability to combine the diverse strengths of individual models and capture nuanced genre-specific information from textual features. The lack of interpretability in deep learning models for genre classification is addressed using Local Interpretable Model-Agnostic Explanations (LIME), which provides both local and global explanations for the model's predictions. The findings of the study highlight the potential of textual data in automated genre classification and emphasize the importance of interpretability methods in multi-label genre classification.
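The weighted soft-voting combination rule at the heart of the ensemble is straightforward to express; the per-model probabilities, weights, and 0.5 decision threshold below are placeholders rather than the paper's tuned values.

```python
# Weighted soft voting over per-model genre probabilities for multi-label
# prediction. Model outputs, weights, and the threshold are illustrative.
import numpy as np

def weighted_soft_vote(prob_list, weights, threshold=0.5):
    """prob_list: list of (n_samples, n_genres) probability arrays, one per model."""
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()
    fused = sum(w * p for w, p in zip(weights, prob_list))
    return fused, (fused >= threshold).astype(int)     # multi-label decisions

genres = ["action", "comedy", "drama"]
p_bert = np.array([[0.80, 0.10, 0.55]])
p_distil = np.array([[0.70, 0.20, 0.40]])
p_roberta = np.array([[0.90, 0.05, 0.60]])
fused, labels = weighted_soft_vote([p_bert, p_distil, p_roberta], weights=[0.4, 0.2, 0.4])
print(dict(zip(genres, labels[0])))    # -> {'action': 1, 'comedy': 0, 'drama': 1}
```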
This book chapter discusses the integration of Generative Artificial Intelligence (GenAI) technologies, which have become unavoidable in educational settings and are one of the foundations of digital transformation processes, into modern classrooms, together with their effects. Generative AI covers all artificial intelligence systems that can automatically construct multimedia elements and content such as text, images, audio, and video. In this context, GenAI differs from classical AI in that it does not merely analyze data; as part of creative processes, it also learns patterns and produces new data. Today these technologies are most visible in language models and in areas such as image generation and music and video production. From the perspective of educational stakeholders, GenAI enriches students' learning experiences through higher-order activities such as developing instructional materials, self-assessment, and providing feedback. It also brings many further benefits, such as creating course materials and content tailored to students' level, pace, and interests; enabling students to quickly notice and correct their mistakes through detailed feedback; and increasing students' creativity and supporting their productivity with creative content-generation tools. In short, generative AI increases students' success by making their learning processes more effective, personal, and creative. From the teachers' perspective, teachers can develop new materials and activities for the classroom with the support of generative AI and thereby develop their own creativity. In addition, thanks to generative AI tools that analyze student performance, they can quickly identify weak points and make targeted interventions. In terms of professional development, generative AI also makes it easier for teachers to stay informed about educational technologies and pedagogical innovations. In conclusion, generative AI, when integrated into classrooms, personalizes teaching processes and makes them more effective. It offers content adapted to students' needs, reduces teachers' workload, and enriches the learning experience. In this way, digital transformation both increases efficiency in classrooms and opens the door to pedagogical innovation.
D. Bikshalu, D. Veera Reddy, B. Nikhitha, E. Akhila | International Journal of Information Technology and Computer Engineering
India, a nation of 1.5 billion people, holds immense potential to excel in international sports, particularly the Olympics. Despite possessing a vast talent pool, the nation has yet to achieve sustained success comparable to other leading sporting countries. This project, India at Olympics: A Digital Tribute to Indian Athletes, aims to celebrate India's Olympic heritage by offering an interactive, web-based platform dedicated to Indian Olympians. The system utilizes modern web technologies (HTML, CSS, and JavaScript) to create a structured, accessible repository of athlete profiles, achievements, biographies, images, and sporting disciplines. A clickable map of India serves as the core interface, allowing users to explore state-wise representations of Olympians, fostering engagement among students, researchers, and sports enthusiasts. The platform is designed to be responsive and to ensure continuous updates and real-time accessibility. It not only preserves India's Olympic legacy but also inspires future generations by showcasing the country's sporting achievements in an engaging and informative manner.
The research field of computational aesthetics makes crucial contributions to the development of mechanisms for filtering and/or generating value-laden informational content. This paper acknowledges a recognized, escalating problem in the development of contemporary informational technologies and presents a practical solution for communicational quality management by employing an innovative approach to computational aesthetic evaluation (CAE). After discussing the problem and the approaches attempted to alleviate it, the paper offers a novel expert solution by presenting an original research approach and its resulting open-sourced model, which outperforms its current state-of-the-art competition in semantic and stylistic classification while providing an idiomatic measure for objective aesthetic evaluation and demonstrating semantically rich and professionally recognized explanatory power. This can serve as a solid basis for the development of reliable and user-friendly content retrieval, generative, or auxiliary design applications. The presented model is highly conservative with respect to both resources and privacy; its use avoids the ethical, legal, and security concerns that beset the currently prominent models, and its development and operational costs are practically nil.
Identifying macromolecular complexes in situ using cryo-electron tomography remains challenging, with low signal-to-noise ratios and heterogeneous backgrounds among the key limiting factors. By integrating prior knowledge on macromolecular localization, such as the preferred orientations of membrane-associated proteins, detection can be improved by constraining searches to biologically feasible orientations. However, previous approaches integrating such constraints fail to achieve both computational scalability to large and curved systems and accurate detection. To resolve this, here we present rejection sampling, a novel approach for integrating translational and rotational constraints into template matching. Using simulated influenza virus-like particles, we demonstrate that rejection sampling outperforms existing methods in terms of precision and recall. Our approach can uniquely integrate constraints at voxel resolution while being compatible with imaging filters such as the contrast transfer function, essential for accurate macromolecular localization. Rejection sampling thus provides a practical solution for macromolecular detection in large and curved systems.
The purpose of this study is to realize the automatic identification and classification of fouls in football matches and improve the overall identification accuracy. Therefore, a Deep Learning-Based Saliency Prediction Model (DLSPM) is proposed. DLSPM combines an improved DeepLabV3+ architecture for salient region detection, Graph Convolutional Networks (GCN) for feature extraction, and a Deep Neural Network (DNN) for classification. By automatically identifying the key action areas in the image, the model reduces the dependence on traditional image processing technology and manual feature extraction, and improves the accuracy and robustness of foul behavior identification. The experimental results show that DLSPM performs significantly better than existing methods on multiple video action recognition datasets, especially when dealing with complex scenes and dynamic changes. The research results not only provide a new perspective and method for the field of video action recognition, but also lay a foundation for applications in intelligent monitoring and human-computer interaction.
This paper analyzes how language and body interact in boxing sparring sessions by focusing on the Japanese particle hai (lit. 'yes') as it occurs turn-initially in the first part of instruction-compliance sequences. Based on sequential and embodied analysis of 11 boxing sparring sessions, this paper examines: (1) in what sequential and embodied environments hai is used; (2) if hai responds to a focal moment, what constitutes that moment; (3) what actions hai-prefaced instructions indicate, and how language and body interact when these actions emerge. The paper identifies three environments: (1) while a boxer is being attacked, the particle prefaces instruction to evade the attack; (2) after a first phase of combined boxing movements, it precedes instruction pursuing the second phase; (3) after a change of distance, the particle introduces instructions for punches which are suitable at that distance. In each environment, hai is used to identify the exact moment at which targeted shifts from a current body alignment to a different one should be implemented. Depending on the temporal order of language and body, hai-prefaced instructions express different actions; for example, a 'late' instruction can "acknowledge" (Mondada 2021) a boxer's independent initiation of the targeted action and, simultaneously, make its completion relevant.
Objective – This research investigates library research guides that share information about anti-fat bias to support weight-inclusive education or practice. By analyzing these guides, we seek to understand how academic librarians are engaging in this work and how they can continue to support weight inclusivity as educators, proponents of information literacy, and interdisciplinary partners. Methods – The authors searched for and screened publicly available LibGuides from academic libraries that included content about anti-fat bias, weight stigma, and/or body liberation. Relevant guides were then evaluated with an original framework to examine their content for insight about their target audience and context. Results – The authors identified and analyzed 36 relevant LibGuides, predominantly from college and university libraries. Thirty-three LibGuides came from institutions in the United States, and most of the institutions had at least one health sciences program, though eight offered no health-related programs. Thirty-two of the analyzed LibGuides presented anti-fat bias content in a tab within a larger guide, while the remaining few were standalone guides. The majority of guides with tab-level anti-fat bias content presented it as a social justice issue, though a few framed the content in a nutrition or other context. The most popular resource types offered in the guides were books, popular articles, videos, associations/organizations, and academic articles. Conclusion – Weight inclusivity discourse is growing across disciplines and is an area that librarians are well-situated to support. Presenting anti-fat bias as a social justice and diversity, equity, inclusion, and accessibility (DEIA) issue in libraries is promising and highlights library workers' commitment to anti-oppression efforts and learning. Work remains to be done to integrate more anti-fat bias content into academic curricula and education, and librarians should look to engage with disciplinary educators, learners, and colleagues to grow and support this work, particularly in the context of the health sciences.
Nicola Messina, Jan Sedmidubský, Fabrizio Falchi, et al. | ACM Transactions on Multimedia Computing, Communications, and Applications
Pose-estimation methods enable extracting human motion from common videos in the structured form of 3D skeleton sequences. Despite great application opportunities, effective content-based access to such spatio-temporal motion data is a challenging problem. In this paper, we focus on the recently introduced text-motion retrieval tasks, which aim to search for database motions that are the most relevant to a specified natural-language textual description (text-to-motion) and vice versa (motion-to-text). Despite recent efforts to explore these promising avenues, a primary challenge remains the insufficient data available to train robust text-motion models effectively. To address this issue, we propose to investigate joint-dataset learning, where we train on multiple text-motion datasets simultaneously, together with the introduction of a Cross-Consistent Contrastive Loss function (CCCL), which regularizes the learned text-motion common space by imposing uni-modal constraints that augment the representation ability of the trained network. To learn a proper motion representation, we also introduce a transformer-based motion encoder, called MoT++, which employs spatio-temporal attention to process skeleton data sequences. We demonstrate the benefits of the proposed approaches on the widely-used KIT Motion-Language and HumanML3D datasets, including some results on the recent Motion-X dataset. We perform detailed experimentation on joint-dataset learning and cross-dataset scenarios, showing the effectiveness of each introduced module in a carefully conducted ablation study and, in turn, pointing out the limitations of state-of-the-art methods. The code for reproducing our results is available at: https://github.com/mesnico/MOTpp
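A drastically simplified numpy sketch of a cross-modal contrastive objective with an added uni-modal consistency term is given below; it only conveys the flavor of the CCCL idea, and the random embeddings and perturbation-based uni-modal term are assumptions, not the paper's actual loss.

```python
# Simplified cross-modal contrastive objective with a uni-modal consistency
# term, loosely inspired by the CCCL idea; all inputs are placeholders.
import numpy as np

def info_nce(a, b, temp=0.07):
    """Symmetric-in-rows InfoNCE between row-aligned embedding matrices a and b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temp
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_probs).mean())

def cross_consistent_loss(text_emb, motion_emb, uni_weight=0.5):
    cross = info_nce(text_emb, motion_emb) + info_nce(motion_emb, text_emb)
    # Uni-modal constraint (stand-in): each modality should also align with a
    # slightly perturbed copy of itself.
    noise = 0.01 * np.random.default_rng(0).normal(size=text_emb.shape)
    uni = info_nce(text_emb, text_emb + noise) + info_nce(motion_emb, motion_emb + noise)
    return cross + uni_weight * uni

rng = np.random.default_rng(1)
text = rng.normal(size=(8, 64))
motion = rng.normal(size=(8, 64))
print(cross_consistent_loss(text, motion))
```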
Histopathological data are foundational in both biological research and clinical diagnostics but remain siloed from modern multimodal and single-cell frameworks. We introduce LazySlide, an open-source Python package built on the scverse ecosystem for efficient whole-slide image (WSI) analysis and multimodal integration. By leveraging vision-language foundation models and adhering to scverse data standards, LazySlide bridges histopathology with omics workflows. It supports tissue and cell segmentation, feature extraction, cross-modal querying, and zero-shot classification, with minimal setup. Its modular design empowers both novice and expert users, lowering the barrier to advanced histopathology analysis and accelerating AI-driven discovery in tissue biology and pathology.
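As a rough illustration of the zero-shot classification idea (not the LazySlide API; its actual functions are documented with the package), a vision-language model scores patch embeddings against text-prompt embeddings by cosine similarity. All names and shapes below are hypothetical.

```python
# Generic CLIP-style zero-shot labeling of tissue patches; hypothetical data only.
import numpy as np

def zero_shot_labels(patch_emb, prompt_emb, class_names):
    """patch_emb: (n_patches, d); prompt_emb: (n_classes, d); returns one label per patch."""
    p = patch_emb / np.linalg.norm(patch_emb, axis=1, keepdims=True)
    c = prompt_emb / np.linalg.norm(prompt_emb, axis=1, keepdims=True)
    sims = p @ c.T                              # cosine similarity, patches x classes
    return [class_names[i] for i in sims.argmax(axis=1)]

# Toy usage with random embeddings standing in for image and text encoder outputs.
labels = zero_shot_labels(np.random.rand(5, 512), np.random.rand(3, 512),
                          ["tumor", "stroma", "necrosis"])
```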
A significant goal for sports team management is establishing a reliable method for analyzing players’ performance. This research proposes an improved method for predicting sporting events, including the final score of a basketball game, by combining adaptively weighted features with machine learning algorithms. Specifically, the paper combines image processing with XGBoost and an Enhanced Support Vector Machine algorithm (XGB-SVM) to construct a real-time basketball game result prediction model. The model quantifies the key variables that influence game results and simulates outcome predictions at different points during a game. The findings show that the XGBoost algorithm can accurately forecast the results of basketball games, and that game outcomes correlate consistently with key performance metrics, including defensive rebounds, field goal percentage, and turnovers. By incorporating image processing, XGBoost, and the Enhanced Support Vector Machine algorithm, the real-time prediction model achieves strong and easily interpretable results, so its score forecasts can inform the team’s player management decisions. Compared with existing methods, the proposed approach improves the player positioning data ratio by 96.8%, the shot trajectory ratio by 98.3%, the historical performance data ratio by 91.7%, the efficiency ratio by 98.5%, and the accuracy ratio by 92.7%.
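One plausible reading of an XGBoost-plus-SVM predictor is a soft-voting ensemble over game statistics, sketched below. The feature names and synthetic labels are hypothetical, and the paper's adaptive feature weighting and "enhanced" SVM details are not reproduced here.

```python
# Hedged sketch: XGBoost and an RBF SVM combined by soft voting on toy game stats.
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Toy stand-in features: defensive rebounds, field-goal %, turnovers (per game).
X = np.random.rand(200, 3)
y = (X[:, 1] - 0.5 * X[:, 2] + 0.3 * X[:, 0] > 0.4).astype(int)   # synthetic win/loss labels

model = VotingClassifier(
    estimators=[
        ("xgb", XGBClassifier(n_estimators=200, max_depth=4)),
        ("svm", make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))),
    ],
    voting="soft",   # average predicted probabilities from both learners
)
model.fit(X, y)
print(model.predict_proba(X[:5]))   # win probabilities for the first five games
```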
Yan Zhang , Yali Peng , Shengnan Wu +1 more | Concurrency and Computation Practice and Experience
In the field of multiview clustering, how to make full use of information from multiple data sources to improve clustering performance has become a hot research topic. However, the rapid growth of high-dimensional multiview data brings great challenges to the research of multiview clustering algorithms, especially regarding their time and space complexity. As an effective solution, the anchor-based technique has gained wide attention in large-scale multiview clustering tasks. Nevertheless, current anchor-based methods fail to account simultaneously for the importance of different views and for the difference and diversity of anchors, which limits clustering performance to some extent. To address these problems, we propose a dual-weighted multiview clustering based on anchors (DwMVCA). First, we effectively distinguish the different impacts of high-quality and low-quality views on clustering by adaptively learning the weights of different views. Second, by introducing an adaptive weighting matrix for anchors and a self-correlation matrix regularization term, the difference and diversity of anchors are fully considered, effectively reducing the effect of redundant information on clustering. Furthermore, we design a three-step alternating optimization algorithm to solve the resulting optimization problem and prove its convergence. Extensive experimental results show that the proposed DwMVCA has clear advantages in clustering performance on large-scale datasets, and it maintains linear time complexity even on datasets with more than 100,000 samples.
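The general anchor-based pipeline that DwMVCA builds on can be sketched as follows: each view is compressed into similarities to a small set of anchors, views are fused with weights, and clustering runs on the fused representation. This is only the generic idea; the anchor sampling, RBF similarity, and fixed view weights below are assumptions, not DwMVCA's alternating optimization or its self-correlation regularizer.

```python
# Minimal anchor-based multiview clustering sketch (generic, not DwMVCA).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

def anchor_representation(X, n_anchors=50, gamma=1.0, rng=None):
    """Represent each sample by row-normalized RBF similarities to sampled anchors."""
    rng = np.random.default_rng(rng)
    anchors = X[rng.choice(len(X), n_anchors, replace=False)]
    Z = rbf_kernel(X, anchors, gamma=gamma)            # (n_samples, n_anchors)
    return Z / Z.sum(axis=1, keepdims=True)

def weighted_multiview_kmeans(views, view_weights, n_clusters=5):
    """Fuse weighted anchor representations of all views, then run k-means."""
    Z = np.hstack([w * anchor_representation(X, rng=0)
                   for X, w in zip(views, view_weights)])
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Z)

# Toy usage: two views of the same 300 samples, with the first view up-weighted.
views = [np.random.rand(300, 20), np.random.rand(300, 40)]
labels = weighted_multiview_kmeans(views, view_weights=[0.7, 0.3], n_clusters=5)
```

Because the anchor matrices have only n_anchors columns, the cost of the fused clustering step grows linearly with the number of samples, which is the property that makes anchor methods attractive at the 100,000-sample scale mentioned above.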
Abstractive summarization of humorous narratives presents unique computational challenges due to humor's multimodal, context-dependent nature. Conventional models often fail to preserve the rhetorical structure essential to comedic discourse, particularly the relationship between setup and punchline. This study proposes a novel Attention-Augmented Long Short-Term Memory (LSTM) model with discourse-aware decoding to enhance the summarization of stand-up comedy performances. The model is trained to capture temporal alignment between narrative elements and audience reactions by leveraging a richly annotated dataset of over 10,000 timestamped transcripts, each marked with audience laughter cues. The architecture integrates bidirectional encoding, attention mechanisms, and a cohesion-first decoding strategy to retain humor's structural and affective dynamics. Experimental evaluations demonstrate the proposed model outperforms baseline LSTM and transformer configurations in ROUGE scores and qualitative punchline preservation. Attention heatmaps and confusion matrices reveal the model's capability to prioritize humor-relevant content and align it with audience responses. Furthermore, analyses of laughter distribution, narrative length, and humor density indicate that performance improves when the model adapts to individual performers' pacing and delivery styles. The study also introduces punchline-aware evaluation as a critical metric for assessing summarization quality in humor-centric domains. The findings contribute to advancing discourse-sensitive summarization methods and offer practical implications for designing humor-aware AI systems. This research underscores the importance of combining structural linguistics, behavioral annotation, and deep learning to capture the complexity of comedic communication in narrative texts.
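The encoder side of such an architecture, a bidirectional LSTM with additive attention over timesteps, can be sketched as below. This is a generic building block under assumed dimensions; the discourse-aware, cohesion-first decoding strategy is not reproduced here.

```python
# Hedged sketch: bidirectional LSTM encoder with additive attention (PyTorch).
import torch
import torch.nn as nn

class AttentiveBiLSTMEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)           # additive attention scorer

    def forward(self, token_ids):
        h, _ = self.lstm(self.embed(token_ids))        # (batch, seq, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)   # attention weights over timesteps
        context = (weights * h).sum(dim=1)             # (batch, 2*hidden) summary vector
        return context, weights

# Toy usage on a batch of two 12-token transcript snippets.
enc = AttentiveBiLSTMEncoder(vocab_size=5000)
context, attn = enc(torch.randint(0, 5000, (2, 12)))
```

The returned attention weights are what would be visualized as the heatmaps mentioned in the abstract, showing which transcript positions (e.g., punchlines) the model emphasizes.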
This study proposes an automated classification framework for evaluating teacher behavior in classroom settings by integrating AlphaPose and Faster region-based convolutional neural networks (R-CNN) algorithms. The method begins by applying AlphaPose to classroom video footage to extract detailed skeletal pose information of both teachers and students across individual frames. These pose-based features are subsequently processed by a Faster R-CNN model, which classifies teacher behavior into appropriate or inappropriate categories. The approach is validated on the Classroom Behavior (PCB) dataset, comprising 74 video clips and 51,800 annotated frames. Experimental results indicate that the proposed system achieves an accuracy of 74.89% in identifying inappropriate behaviors while also reducing manual behavior logging time by 47% and contributing to a 63% decrease in such behaviors. The findings highlight the potential of computer vision techniques for scalable, objective, and real-time classroom behavior analysis, offering a viable tool for enhancing educational quality and teacher performance monitoring.
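A loose sketch of such a pose-then-classify pipeline follows, with torchvision's Keypoint R-CNN standing in for AlphaPose and a plain linear head standing in for the paper's Faster R-CNN behavior classifier; both substitutions, and the frame size, are assumptions for illustration only.

```python
# Frame-level pose extraction followed by a toy behavior classifier.
import torch
from torchvision.models.detection import keypointrcnn_resnet50_fpn

pose_model = keypointrcnn_resnet50_fpn(weights="DEFAULT").eval()
behavior_head = torch.nn.Linear(17 * 2, 2)   # 17 COCO keypoints -> {appropriate, inappropriate}

frame = torch.rand(3, 480, 640)              # one RGB video frame with values in [0, 1]
with torch.no_grad():
    detections = pose_model([frame])[0]
    if len(detections["keypoints"]) > 0:
        kpts = detections["keypoints"][0, :, :2].flatten()      # (x, y) of top-scoring person
        label = ["appropriate", "inappropriate"][behavior_head(kpts).argmax().item()]
```

In a real system the classifier would of course be trained on annotated frames such as those in the PCB dataset rather than applied with random weights as here.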
Ausaf Ahmad , Sonali Singh | International Journal For Multidisciplinary Research
In the digital age, movie recommendation systems have become essential for increasing user satisfaction and engagement. In this paper, we present a movie recommendation system based on collaborative filtering, content-based filtering, and a hybrid method that tailors suggested movies to individual users. By drawing on historical data and analyzing each user's viewing behavior, preferences, and the topics of movies they want to watch, the system applies machine learning algorithms to recommend movies the user is likely to be interested in.
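A minimal sketch of the hybrid idea is shown below: a collaborative score (ratings of similar users) is blended with a content-based score (genre similarity to movies the user liked). The ratings matrix, genre matrix, and blending weight are toy assumptions, not the paper's data or exact method.

```python
# Toy hybrid recommender: blend collaborative and content-based scores.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

ratings = np.array([[5, 4, 0, 0],       # user x movie ratings (0 = unrated)
                    [0, 0, 4, 5],
                    [4, 0, 0, 3]], dtype=float)
genres = np.array([[1, 0, 1],           # movie x genre indicator matrix
                   [1, 0, 0],
                   [0, 1, 1],
                   [0, 1, 0]], dtype=float)

def hybrid_scores(user, alpha=0.6):
    user_sim = cosine_similarity(ratings)[user]                  # similarity to other users
    collab = user_sim @ ratings / (user_sim.sum() + 1e-9)        # collaborative score per movie
    collab = collab / ratings.max()                              # rescale to roughly [0, 1]
    liked = ratings[user] > 0
    profile = genres[liked].mean(axis=0) if liked.any() else np.zeros(genres.shape[1])
    content = cosine_similarity(profile[None], genres)[0]        # content-based score per movie
    return alpha * collab + (1 - alpha) * content

print(hybrid_scores(user=0))   # higher score = stronger recommendation for that movie
```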
The article examines how players perceive role-playing video games (RPGs) depending on the type of virtual camera used: a fixed-angle camera, a player-controlled dynamic camera, and an automatic dynamic camera. The authors analyze the influence of the chosen camera type on gameplay, the level of immersion, emotional engagement, and the perception of the game world. The study is based on an analysis of user feedback collected while testing a purpose-built game in which the three camera models were applied in sequence. Particular attention is paid to differences between experienced and inexperienced players in their assessments of convenience, the difficulty of completing in-game tasks, and overall impressions. The results show that the choice of camera model directly affects the comfort and success of playing through the game, as well as the development of gaming skills. The findings may be useful to video game developers when designing virtual camera systems that account for the skill level of the target audience and the specifics of the RPG genre.
This paper proposes a novel framework that integrates swarm intelligence algorithms with controllable Generative Adversarial Networks (GANs) to meet the multi-objective demands of advertising illustration tasks. By combining Particle Swarm Optimization (PSO) for global parameter exploration with a StyleGAN-like architecture capable of fine-grained style manipulation, our method effectively balances visual fidelity, brand consistency, stylistic diversity, and computational efficiency. We formulate the generation process as a multi-objective optimization problem. Experiments conducted on a curated dataset show that the proposed method outperforms conventional GANs, conditional GANs, and evolutionary-based baselines. Ablation studies further demonstrate the importance of integrating both style and brand-related loss functions, while parameter sensitivity analyses highlight the role of swarm size, inertia weight, and acceleration coefficients in guiding the search toward optimal solutions.
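The search side of such a framework can be illustrated by a basic PSO loop over latent/style parameters, as sketched below. No GAN is instantiated here; the fitness function is a hypothetical stand-in for the combined visual, brand-consistency, and diversity objectives, and the swarm hyperparameters (inertia weight, acceleration coefficients) are placeholder values.

```python
# Toy particle swarm optimization over a latent vector; fitness() is a placeholder
# for scoring generator outputs against the multi-objective criteria.
import numpy as np

def fitness(z):                        # placeholder objective: lower is better
    return np.sum((z - 0.5) ** 2)

def pso(dim=64, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1, (n_particles, dim))      # particle positions (latent codes)
    v = np.zeros_like(x)
    pbest, pbest_f = x.copy(), np.array([fitness(p) for p in x])
    gbest = pbest[pbest_f.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)   # velocity update
        x = x + v
        f = np.array([fitness(p) for p in x])
        improved = f < pbest_f
        pbest[improved], pbest_f[improved] = x[improved], f[improved]
        gbest = pbest[pbest_f.argmin()].copy()
    return gbest

best_latent = pso()   # in the full pipeline this would be decoded by the generator
```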
Artem Konev , Bernhard Sadransky , Silvana Zechmeister +4 more | ACM Transactions on Modeling and Computer Simulation
With the increasing frequency and severity of flood events worldwide, the need for flood simulations has become more critical than ever, together with the communication of predicted hazards and possible mitigation measures. Since this communication frequently relies on static cartographic representations, it falls short in communicating complex spatial and temporal information and the intricate dynamics of flooding in an engaging and accessible manner. This limitation becomes particularly apparent when addressing diverse audiences with varying levels of expertise. This paper presents an approach that uses interactive geospatial stories to better communicate various aspects of flood management strategies. Our method integrates geospatial visualization with interactive storytelling techniques, creating a compelling platform for engaging stakeholders, fostering understanding and supporting participatory decision-making in flood-prone regions. We discuss the conceptual model, the principles of guided information visualization, and interaction modalities of our web-based, scenario-driven storytelling application. We also evaluate its effectiveness through two exemplary interactive geospatial stories designed to help domain experts explain different aspects of water-sensitive design practices and the use of simulation models for predicting, visualizing, and managing water flows.