Funding: Supported by Futurewei Technologies Inc. through funding for a research collaboration and student internships.
Abstract: Tokenized visual representations have shown promise in image compression, yet their extension to video remains underexplored due to the challenges posed by complex temporal dynamics and stringent bit rate constraints. In this paper, we present tokenized video compression (TVC), a token-based dual-stream framework designed to operate effectively at ultra-low bit rates. TVC leverages the Cosmos video tokenizer to extract both discrete and continuous token streams. The discrete tokens are partially masked using a strategic masking scheme and then compressed losslessly with a discrete checkerboard context model to reduce transmission overhead. The masked tokens are reconstructed by a decoder-only Transformer with spatiotemporal token prediction. In parallel, the continuous tokens are quantized and compressed using a continuous checkerboard context model, providing complementary continuous information at ultra-low bit rates. At the decoder side, the two streams are fused with a ControlNet-based multi-scale integration module, ensuring high perceptual quality alongside stable fidelity in reconstruction. Overall, this work illustrates the practicality of tokenized video compression and points to new directions for semantics-aware, token-native approaches.
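The abstract does not spell out how the checkerboard context models work; the generic checkerboard idea they build on codes tokens in two passes, with "anchor" positions coded context-free and the remaining positions coded conditioned on their already-decoded anchor neighbours. Below is a minimal, illustrative Python sketch of that two-pass schedule; the `code_fn` callback and all names are hypothetical stand-ins for a learned conditional entropy coder, not TVC's implementation.

```python
import numpy as np

def checkerboard_mask(h, w):
    """Boolean mask, True on 'anchor' positions (the white squares of a
    checkerboard) and False on 'non-anchor' positions."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return (ys + xs) % 2 == 0

def two_pass_code(tokens, code_fn):
    """Two-pass checkerboard coding of an (H, W) token grid.

    Pass 1 codes anchor tokens with no spatial context (hyperprior-only in a
    real codec); pass 2 codes the remaining tokens conditioned on decoded
    anchors, since every non-anchor's four neighbours are anchors."""
    h, w = tokens.shape
    anchors = checkerboard_mask(h, w)
    stream = []
    for y, x in zip(*np.where(anchors)):          # pass 1: anchors
        stream.append(code_fn(tokens[y, x], context=None))
    for y, x in zip(*np.where(~anchors)):         # pass 2: non-anchors
        neigh = [tokens[ny, nx]
                 for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                 if 0 <= ny < h and 0 <= nx < w]
        stream.append(code_fn(tokens[y, x], context=neigh))
    return stream

# Toy usage: 'code_fn' stands in for an entropy coder driven by a learned
# conditional distribution; here it just records (symbol, context size).
grid = np.random.randint(0, 1024, size=(4, 6))
coded = two_pass_code(grid, lambda s, context: (int(s), 0 if context is None else len(context)))
```

Because every non-anchor position is surrounded only by anchors, the second pass can be evaluated for all positions in parallel, which is the main appeal of the checkerboard schedule over raster-order autoregressive context models.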
Abstract: Building-polygon simplification methods in map generalization rely on hand-crafted rules, offer a low degree of automation, and struggle to exploit existing simplification results. To address these problems, this paper proposes a Transformer-based building-polygon simplification model. The model first maps building polygons into a bounded grid space and expresses each polygon's coordinate string as a grid sequence, thereby obtaining token sequences for polygons before and after simplification and constructing paired simplification samples. A Transformer model is then trained on these samples, using masked self-attention to learn the dependencies among point sequences and finally generating the new simplified polygon point by point. During training, the model uses the structured sample data together with a cross-entropy loss that ignores specific indices, which improves simplification quality. The experiments comprise a main study and a generalization test. The main study, based on a 1:2000 Los Angeles building dataset, encodes polygons at three grid sizes (0.2, 0.3, and 0.5 mm) and performs simplification to target scales of 1:5000 and 1:10000. The results show that the model performs best at the 0.3 mm grid size, with over 92.0% agreement between its simplification results and manual annotations on the validation set; a generalization test on building-polygon data from parts of Beijing confirms the model's transferability. A comparison with an LSTM model shows that, at a comparable parameter scale, the LSTM fails to converge effectively or produce usable results. This work confirms the potential of Transformers for spatial-geometry sequence tasks and their ability to effectively reuse existing simplification samples, offering a route to intelligent building-polygon simplification with practical engineering value.
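The abstract describes mapping each polygon's coordinate string into a bounded grid space and reading the occupied cells off as a token sequence. The following is a minimal sketch of one plausible encoding of that kind; the row-major token index (`row * grid_dim + col`), the cell-centre decoding, and all helper names are assumptions for illustration rather than the paper's exact scheme.

```python
def polygon_to_tokens(vertices, origin, grid_size, grid_dim):
    """Map polygon vertices (in map units) onto a grid_dim x grid_dim lattice
    and flatten each cell to one integer token (row * grid_dim + col).

    'origin' is the lower-left corner of the grid window; 'grid_size' is the
    cell edge length in the same map units as the vertices."""
    ox, oy = origin
    tokens = []
    for x, y in vertices:
        col = min(int((x - ox) / grid_size), grid_dim - 1)
        row = min(int((y - oy) / grid_size), grid_dim - 1)
        tokens.append(row * grid_dim + col)
    return tokens

def tokens_to_polygon(tokens, origin, grid_size, grid_dim):
    """Inverse mapping: recover cell-centre coordinates from tokens."""
    ox, oy = origin
    return [(ox + (t % grid_dim + 0.5) * grid_size,
             oy + (t // grid_dim + 0.5) * grid_size) for t in tokens]

# A square-ish building footprint tokenized on a 64x64 grid of 0.3-unit cells.
building = [(1.2, 1.1), (4.8, 1.0), (4.9, 3.9), (1.1, 4.0)]
toks = polygon_to_tokens(building, origin=(0.0, 0.0), grid_size=0.3, grid_dim=64)
restored = tokens_to_polygon(toks, origin=(0.0, 0.0), grid_size=0.3, grid_dim=64)
```

With an encoding like this, simplification becomes a pure sequence-to-sequence task: the model consumes the token sequence of the detailed polygon and emits, token by token, the sequence of the simplified one.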
Abstract: The language barrier is the biggest obstacle for users watching foreign-language videos. Because of it, videos cannot gain popularity across borders, and their viewership is limited to a single language and culture. The easiest way to solve this problem is to add subtitles in the viewer's language. However, current subtitling systems lack incentives, a secure transaction environment, and a trusting relationship between video creators and subtitle makers. In response, a tokenized subtitling crowdsourcing system (TSCS) based on blockchain and smart contract technologies is proposed. In the proposed system, subtitle source files are stored on the InterPlanetary File System (IPFS), and the returned storage address, together with subtitle-related information, is minted as a non-fungible token (NFT) based on the ERC-721 standard. Meanwhile, depending on the expected revenue from video view counts, a video token (VT) based on the ERC-777 standard and endorsed by the video platform is used as the payment token. The TSCS offers two payment strategies: one-time and dividend. Through this settlement mechanism, the subtitle maker's revenue is guaranteed by the code immutability and rule certainty of the deployed smart contracts. In addition, an incentive mechanism for viewers to audit subtitles enables community autonomy, increasing both the applicability of subtitles and user activity.
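As a rough illustration of what the two settlement strategies might compute, the toy Python model below contrasts one-time and dividend payouts off-chain; in TSCS itself this logic would be enforced by the deployed smart contracts and paid out in the ERC-777-based VT. All field names, prices, and rates here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SubtitleDeal:
    """Toy off-chain model of the two TSCS settlement strategies; the real
    system enforces this in a smart contract and pays in the video token (VT)."""
    maker: str                 # subtitle maker's address (hypothetical)
    one_time_price: int        # VT paid up front under the one-time strategy
    revenue_share: float       # fraction of view revenue under the dividend strategy

    def one_time_payout(self):
        # One-time strategy: a single fixed payment, independent of views.
        return {self.maker: self.one_time_price}

    def dividend_payout(self, view_revenue_vt):
        # Dividend strategy: the maker earns a share of realized view revenue.
        return {self.maker: int(view_revenue_vt * self.revenue_share)}

deal = SubtitleDeal(maker="0xSubtitler", one_time_price=500, revenue_share=0.05)
assert deal.one_time_payout() == {"0xSubtitler": 500}
assert deal.dividend_payout(view_revenue_vt=20_000) == {"0xSubtitler": 1000}
```

The trade-off this models is the one the abstract implies: a one-time payment is predictable, while a dividend ties the maker's income to how well the subtitled video actually performs.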
Funding: Supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) [RS-2021-II211341, Artificial Intelligence Graduate School Program (Chung-Ang University)], and by the Chung-Ang University Graduate Research Scholarship in 2024.
Abstract: Legal case classification involves the categorization of legal documents into predefined categories, which facilitates legal information retrieval and case management. However, real-world legal datasets often suffer from class imbalance due to the uneven distribution of case types across legal domains. This leads to biased model performance: high accuracy for overrepresented categories and underperformance for minority classes. To address this issue, we propose a data augmentation method that selectively masks unimportant terms within a document while preserving key terms from the perspective of the legal domain. This approach enhances data diversity and improves the generalization capability of conventional models. Our experiments demonstrate consistent improvements achieved by the proposed augmentation strategy in terms of accuracy and F1 score across all models, validating its effectiveness for legal case classification.
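The abstract does not state how term importance is scored, so the sketch below uses TF-IDF as a crude stand-in: tokens whose score falls in the bottom quantile of their document are masked with some probability, while a supplied set of domain-critical legal terms is never touched. The quantile, masking probability, and all names are illustrative assumptions.

```python
import random
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def mask_unimportant_terms(docs, keep_terms, quantile=0.3, mask_prob=0.7,
                           mask_token="[MASK]", seed=0):
    """Mask tokens whose TF-IDF score falls in the bottom 'quantile' of their
    document, never touching 'keep_terms' (domain-critical legal vocabulary).
    TF-IDF here is a crude stand-in for a real importance score."""
    rng = random.Random(seed)
    vec = TfidfVectorizer()
    tfidf = vec.fit_transform(docs)     # (n_docs, n_features) sparse matrix
    vocab = vec.vocabulary_             # term -> feature column index
    augmented = []
    for i, doc in enumerate(docs):
        row = tfidf[i].toarray().ravel()
        words = doc.split()
        # Out-of-vocabulary surface forms (e.g. punctuation-attached) score 0.
        scores = [row[vocab[w.lower()]] if w.lower() in vocab else 0.0
                  for w in words]
        cutoff = float(np.quantile(scores, quantile))
        augmented.append(" ".join(
            mask_token if (s <= cutoff and w.lower() not in keep_terms
                           and rng.random() < mask_prob) else w
            for w, s in zip(words, scores)))
    return augmented

corpus = [
    "the defendant knowingly breached the supply contract and the court awarded damages",
    "the plaintiff later filed an appeal but the court dismissed the claim",
]
aug = mask_unimportant_terms(corpus, keep_terms={"defendant", "contract", "court",
                                                 "damages", "plaintiff", "appeal"})
```

The design point the method rests on is visible even in this toy: because legally salient terms are exempt from masking, each augmented copy varies the filler language while leaving the case-discriminative vocabulary intact.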
Funding: Supported by the Zhejiang Provincial Natural Science Foundation of China (LY22F020025) and the National Natural Science Foundation of China (62072126).
Abstract: The Transformer has achieved great success in medical image segmentation, but its quadratic computational complexity limits its application in dense medical image prediction. Recently, the receptance weighted key value (RWKV) architecture has garnered widespread attention due to its linear computational complexity and its capability for parallel computation during training. Despite the RWKV model's proficiency in long-range modeling tasks at linear computational complexity, most current RWKV-based approaches employ static scanning patterns, which may inadvertently introduce biased prior knowledge into the model's predictions. To address this challenge, we propose a multi-head scan strategy, combined with padding methods, to effectively simulate spatial continuity in 2D images. Within the Feature Aggregation Attention (FAA) module, asymmetric convolutions aggregate 1D sequence features along a single dimension, expanding the effective receptive field while preserving structural sparsity. Additionally, a panoramic token shift (P-Shift) effectively models local dependency relationships by shifting tokens from a wide receptive field. Extensive experiments on the ISIC17/18 and ACDC datasets demonstrate that our method achieves superior performance in dense medical image prediction tasks.
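To make the multi-head scan idea concrete, the NumPy sketch below flattens a 2D feature map along four different scan orders, so that each head of a 1D sequence model sees a different spatial ordering instead of one fixed raster scan, and then applies a toy one-step token shift. The particular scan set, the fixed 50/50 blend, and all names are illustrative assumptions; the paper's P-Shift shifts tokens from a wider receptive field.

```python
import numpy as np

def multi_head_scans(feat):
    """Flatten an (H, W, C) feature map into four 1D token sequences, one per
    scan direction, so each head of a 1D sequence model (e.g. RWKV) sees a
    different spatial ordering rather than a single fixed raster scan."""
    h, w, c = feat.shape
    rows = feat.reshape(h * w, c)                     # left-to-right, top-down
    cols = feat.transpose(1, 0, 2).reshape(h * w, c)  # top-down, column-major
    return {"row": rows, "row_rev": rows[::-1],
            "col": cols, "col_rev": cols[::-1]}

def token_shift(seq, shift=1):
    """Toy token shift: blend each token with the token 'shift' steps earlier
    (zero-padded), injecting local context before the sequence mixer."""
    pad = np.zeros((shift, seq.shape[1]), dtype=seq.dtype)
    prev = np.concatenate([pad, seq[:-shift]], axis=0)
    return 0.5 * seq + 0.5 * prev   # fixed 50/50 blend; learned in practice

scans = multi_head_scans(np.random.rand(8, 8, 16).astype(np.float32))
mixed = {name: token_shift(s) for name, s in scans.items()}
```

Running several scan orders in parallel is what counteracts the bias of any single static pattern: a pixel that sits far from its spatial neighbours in one ordering sits next to them in another.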
Funding: Supported by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2025R195), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Abstract: With the rapid growth of online news, fake news detection has become one of the most important paradigms of modern research. Traditional detection techniques generally struggle with contextual understanding, sequential dependencies, and/or data imbalance, which makes distinguishing genuine from fabricated news a challenging task. To address this problem, we propose a novel hybrid architecture, T5-SA-LSTM, which integrates the T5 Transformer, for semantically rich contextual embeddings, with a Self-Attention-enhanced (SA) Long Short-Term Memory (LSTM) network. The LSTM is trained using the Adam optimizer, which provides faster and more stable convergence than Stochastic Gradient Descent (SGD) and Root Mean Square Propagation (RMSProp). The WELFake and FakeNewsPrediction datasets are used, which consist of labeled news articles containing fake and real news samples. Tokenization and the Synthetic Minority Over-sampling Technique (SMOTE) are used for data preprocessing to ensure linguistic normalization and address class imbalance. The incorporation of the Self-Attention (SA) mechanism enables the model to highlight critical words and phrases, thereby enhancing predictive accuracy. The proposed model is evaluated using accuracy, precision, recall (sensitivity), and F1-score as performance metrics. It achieved 99% accuracy on the WELFake dataset and 96.5% accuracy on the FakeNewsPrediction dataset, outperforming competitive schemes such as T5-SA-LSTM (RMSProp), T5-SA-LSTM (SGD), and other baseline models.
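As a minimal PyTorch sketch of the SA-LSTM half of such a pipeline, the module below runs a BiLSTM over precomputed (e.g. T5) token embeddings, pools the hidden states with additive self-attention so that critical positions receive higher weight, and classifies the pooled vector, with Adam as the optimizer as in the paper. Layer sizes, the attention form, and all names are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SALSTMHead(nn.Module):
    """Sketch of a self-attention-enhanced LSTM classifier head: a BiLSTM over
    pretrained token embeddings, additive attention pooling over time steps,
    and a binary fake/real classifier."""
    def __init__(self, embed_dim=512, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)    # one attention score per time step
        self.clf = nn.Linear(2 * hidden, 2)     # fake vs. real logits

    def forward(self, emb):                     # emb: (batch, seq_len, embed_dim)
        h, _ = self.lstm(emb)                   # (batch, seq_len, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over positions
        pooled = (w * h).sum(dim=1)             # weighted sum highlights key tokens
        return self.clf(pooled)

model = SALSTMHead()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam, as in the paper
logits = model(torch.randn(4, 32, 512))                # dummy batch of embeddings
```

The attention pooling is the piece the abstract credits for the accuracy gain: instead of relying only on the LSTM's final state, the classifier sees a weighted summary in which decisive words and phrases dominate.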