With the rise of remote collaboration, the demand for advanced storage and collaboration tools has rapidly increased. However, traditional collaboration tools primarily rely on access control, leaving data stored on cloud servers vulnerable due to insufficient encryption. This paper introduces a novel mechanism that encrypts data in 'bundle' units, designed to meet the dual requirements of efficiency and security for frequently updated collaborative data. Each bundle includes update information, allowing only the updated portions to be re-encrypted when changes occur. The encryption method proposed in this paper addresses the inefficiencies of traditional encryption modes, such as Cipher Block Chaining (CBC) and Counter (CTR), which require decrypting and re-encrypting the entire dataset whenever updates occur. The proposed method leverages update-specific information embedded within data bundles and metadata that maps the relationship between these bundles and the plaintext data. By utilizing this information, the method accurately identifies the modified portions and applies algorithms to selectively re-encrypt only those sections. This approach significantly enhances the efficiency of data updates while maintaining high performance, particularly in large-scale data environments. To validate this approach, we conducted experiments measuring execution time as both the size of the modified data and the total dataset size varied. Results show that the proposed method significantly outperforms CBC and CTR modes in execution speed, with greater performance gains as data size increases. Additionally, our security evaluation confirms that this method provides robust protection against both passive and active attacks.
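The paper's bundle construction is not specified in the abstract; as a rough, hedged illustration of the general idea only (an assumption, not the authors' design), the sketch below splits data into fixed-size bundles, encrypts each bundle independently with AES-GCM, and re-encrypts only the bundles whose plaintext changed.

```python
# Minimal sketch of per-bundle encryption with selective re-encryption.
# Illustrative assumption, not the paper's mechanism: bundle layout,
# metadata, and key management are all simplified.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

BUNDLE_SIZE = 4096  # bytes per bundle (assumed)

def split_bundles(data: bytes) -> list[bytes]:
    return [data[i:i + BUNDLE_SIZE] for i in range(0, len(data), BUNDLE_SIZE)]

class BundleStore:
    def __init__(self, key: bytes):
        self.aead = AESGCM(key)
        self.nonces: list[bytes] = []       # per-bundle nonce (metadata)
        self.ciphertexts: list[bytes] = []  # per-bundle ciphertext
        self.digests: list[int] = []        # cheap change detection per bundle

    def _encrypt_bundle(self, idx: int, plaintext: bytes) -> None:
        nonce = os.urandom(12)
        ct = self.aead.encrypt(nonce, plaintext, str(idx).encode())
        if idx < len(self.ciphertexts):
            self.nonces[idx], self.ciphertexts[idx] = nonce, ct
            self.digests[idx] = hash(plaintext)
        else:
            self.nonces.append(nonce)
            self.ciphertexts.append(ct)
            self.digests.append(hash(plaintext))

    def write(self, data: bytes) -> int:
        """Re-encrypt only bundles whose plaintext differs; return how many."""
        changed = 0
        for idx, bundle in enumerate(split_bundles(data)):
            if idx >= len(self.digests) or self.digests[idx] != hash(bundle):
                self._encrypt_bundle(idx, bundle)
                changed += 1
        return changed

key = AESGCM.generate_key(bit_length=128)
store = BundleStore(key)
doc = b"A" * 20000
store.write(doc)                        # first write encrypts every bundle
edited = doc[:10000] + b"B" + doc[10001:]
print(store.write(edited))              # only the single modified bundle is re-encrypted
```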
In this study, we delve into the realm of efficient Big Data Engineering and Extract, Transform, Load (ETL) processes within the healthcare sector, leveraging the robust foundation provided by the MIMIC-III Clinical Database. Our investigation entails a comprehensive exploration of various methodologies aimed at enhancing the efficiency of ETL processes, with a primary emphasis on optimizing time and resource utilization. Through meticulous experimentation utilizing a representative dataset, we shed light on the advantages associated with the incorporation of PySpark and Docker containerized applications. Our research illuminates significant advancements in time efficiency, process streamlining, and resource optimization attained through the utilization of PySpark for distributed computing within Big Data Engineering workflows. Additionally, we underscore the strategic integration of Docker containers, delineating their pivotal role in augmenting scalability and reproducibility within the ETL pipeline. This paper encapsulates the pivotal insights gleaned from our experimental journey, accentuating the practical implications and benefits entailed in the adoption of PySpark and Docker. By streamlining Big Data Engineering and ETL processes in the context of clinical big data, our study contributes to the ongoing discourse on optimizing data processing efficiency in healthcare applications. The source code is available on request.
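The authors' pipeline is not published in the abstract; as a minimal, hedged illustration of a PySpark ETL step (the paths and column names below are hypothetical, not taken from the authors' MIMIC-III code), one might extract a CSV, apply a transformation, and load the result as Parquet. Such a job would typically be packaged into a Docker image so the same Spark environment can be reproduced elsewhere.

```python
# Minimal PySpark extract-transform-load sketch (hypothetical paths and columns,
# not the authors' MIMIC-III pipeline).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: read a raw admissions-like CSV (path and schema are assumptions).
raw = spark.read.csv("data/admissions.csv", header=True, inferSchema=True)

# Transform: normalize a timestamp column and aggregate per patient.
transformed = (
    raw.withColumn("admit_ts", F.to_timestamp("admittime"))
       .groupBy("subject_id")
       .agg(F.count("*").alias("n_admissions"),
            F.min("admit_ts").alias("first_admission"))
)

# Load: write the result in a columnar format for downstream analysis.
transformed.write.mode("overwrite").parquet("out/admissions_summary.parquet")
spark.stop()
```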
Polymer informatics faces challenges owing to data scarcity arising from complex chemistries, experimental limitations, and processing-dependent properties. This review presents the recent advances in data-efficient machine learning for polymers. First, data preparation techniques such as data augmentation and rational representation help expand the dataset size and develop useful features for learning. Second, modeling approaches, including classical algorithms and physics-informed methods, enhance the model robustness and reliability under limited data conditions. Third, learning strategies, such as transfer learning and active learning, aim to improve generalization and guide efficient data acquisition. This review concludes by outlining future opportunities in machine learning for small-data scenarios in polymers. It is expected to serve as a useful tool for newcomers and offer deeper insights for experienced researchers in the field.
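As a small, hedged sketch of one of the learning strategies mentioned above (active learning under scarce data; the synthetic dataset and model choice are illustrative assumptions, not drawn from the review), a Gaussian-process surrogate can select the unlabeled candidate with the largest predictive uncertainty for the next experiment.

```python
# Illustrative active-learning loop for a small-data regression task
# (synthetic data and model choice are assumptions, not from the review).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X_pool = rng.uniform(0, 1, size=(200, 3))           # candidate "polymer descriptors"
y_pool = X_pool @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.standard_normal(200)

labeled = list(rng.choice(len(X_pool), size=5, replace=False))  # tiny initial set
for _ in range(10):                                  # 10 acquisition rounds
    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
    gp.fit(X_pool[labeled], y_pool[labeled])
    mean, std = gp.predict(X_pool, return_std=True)
    std[labeled] = -np.inf                           # never re-query labeled points
    labeled.append(int(np.argmax(std)))              # query the most uncertain sample
print(f"labeled {len(labeled)} of {len(X_pool)} candidates")
```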
This paper analyzes the advantages of legal digital currencies and explores their impact on bank big data practices. By combining bank big data collection and processing, it clarifies that legal digital currencies can enhance the efficiency of bank data processing, enrich data types, and strengthen data analysis and application capabilities. In response to future development needs, it is necessary to strengthen data collection management, enhance data processing capabilities, innovate big data application models, and provide references for bank big data practices, promoting the transformation and upgrading of the banking industry in the context of legal digital currencies.
Objective: To measure hospital operation efficiency, study the correlation between average length of stay and hospital operation efficiency, analyze the importance of shortening the average length of stay to improving hospital operation efficiency, and put forward relevant policy suggestions. Methods: Based on China provincial panel data from 2003 to 2012, hospital operation efficiencies are calculated using a Super Efficiency Data Envelopment Analysis model, and the correlation between average length of stay and hospital operation efficiency is tested using the Spearman rank correlation coefficient test. Results: From 2003 to 2012, the national average hospital operation efficiency increased slowly, and hospital operations were inefficient in most areas. National hospital operation efficiency is negatively correlated with the average length of stay. Conclusion: Measures should be taken to set the average length of stay in a scientific and reasonable way and to improve social and economic benefits on the basis of improved efficiency.
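The abstract names two concrete tools: super-efficiency DEA scores and a Spearman rank correlation test. A minimal sketch of the correlation step is below (the numbers are made up, not the study's 2003-2012 panel data); the DEA scores themselves would come from a separate linear-programming solver.

```python
# Spearman rank correlation between average length of stay and DEA efficiency
# scores (synthetic illustrative numbers, not the study's provincial panel data).
from scipy.stats import spearmanr

avg_length_of_stay = [12.1, 10.8, 11.5, 9.7, 10.2, 9.1, 8.8, 8.5, 8.2, 7.9]   # days
efficiency_score   = [0.71, 0.78, 0.74, 0.86, 0.81, 0.90, 0.92, 0.95, 0.97, 1.02]

rho, p_value = spearmanr(avg_length_of_stay, efficiency_score)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.4f}")  # a negative rho mirrors the paper's finding
```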
On November 4th, AQSIQ (General Administration of Quality Supervision, Inspection and Quarantine of the People's Republic of China), SAC (Standardization Administration of China), the National Audit Office of China (CNAO), and the Ministry of Finance of China jointly held a press conference in Beijing on the national standard Information Technology--Data Interface of Accounting Software (GB/T 19581-2004). The standard was approved and issued on September 20, 2004 by AQSIQ and SAC and would take effect nationwide from January 1st, 2005. Pu Changcheng, Vice Director of AQSIQ, Shi Aizhong, Vice Director of CNAO, Li Zhonghai, a member of the Party Group of AQSIQ and Director of SAC, and leaders of other concerned departments such as the Ministry of Finance and the National Telegraphy Office attended the press conference and made speeches. They fully affirmed the significance and achievements of standardization work for electronic government business and set new demands for future work.
Predicting the performance of a tunnel boring machine (TBM) is vitally important to avoid possible accidents during tunnel boring. The prediction is not straightforward due to uncertain geological conditions and complex rock-machine interactions. Based on the big data obtained from the 72.1 km long tunnel in the Yin-Song Diversion Project in China, this study developed a machine learning model to predict TBM performance in real time. The total thrust and the cutterhead torque during the stable period of a boring cycle were predicted in advance using the machine-returned parameters in the rising period. A long short-term memory model was developed and its accuracy was evaluated. The results show that the variation in the total thrust and cutterhead torque under various geological conditions is well reflected by the proposed model. This real-time prediction outperforms the classical theoretical model, in which only a single value can be obtained from a single measurement of the rock properties. To improve the accuracy of the model, a filtering process was proposed. Results indicate that filtering out unnecessary parameters can enhance both accuracy and computational efficiency. Finally, data deficiency was discussed by assuming a parameter was missing. It was found that the absence of a key parameter can significantly reduce the accuracy of the model, while supplementing a parameter highly correlated with the missing one can improve the prediction.
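Below is a hedged PyTorch sketch of the kind of model described: an LSTM mapping rising-period machine parameters to total thrust and cutterhead torque in the stable period. The feature count, sequence length, and layer sizes are assumptions, not the paper's configuration.

```python
# Minimal LSTM regressor sketch: rising-period parameter sequences in,
# (total thrust, cutterhead torque) out. Dimensions are assumptions.
import torch
import torch.nn as nn

class TBMPredictor(nn.Module):
    def __init__(self, n_features: int = 8, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 2)   # [total thrust, cutterhead torque]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)              # (batch, seq_len, hidden)
        return self.head(out[:, -1, :])    # predict from the last time step

model = TBMPredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# One illustrative training step on random data standing in for boring-cycle logs.
x = torch.randn(32, 30, 8)   # 32 cycles, 30 rising-period steps, 8 parameters
y = torch.randn(32, 2)       # scaled thrust and torque targets
loss = loss_fn(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```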
Automated essay scoring (AES) systems have gained significant importance in educational settings, offering a scalable, efficient, and objective method for evaluating student essays. However, developing AES systems for Arabic poses distinct challenges due to the language's complex morphology, diglossia, and the scarcity of annotated datasets. This paper presents a hybrid approach to Arabic AES by combining text-based, vector-based, and embedding-based similarity measures to improve essay scoring accuracy while minimizing the training data required. Using a large Arabic essay dataset categorized into thematic groups, the study conducted four experiments to evaluate the impact of feature selection, data size, and model performance. Experiment 1 established a baseline using a non-machine-learning approach, selecting top-N correlated features to predict essay scores. The subsequent experiments employed 5-fold cross-validation. Experiment 2 showed that combining embedding-based, text-based, and vector-based features in a Random Forest (RF) model achieved an R² of 88.92% and an accuracy of 83.3% within a 0.5-point tolerance. Experiment 3 further refined the feature selection process, demonstrating that 19 correlated features yielded optimal results, improving R² to 88.95%. In Experiment 4, an optimal data-efficiency training approach was introduced, where training data portions increased from 5% to 50%. The study found that using just 10% of the data achieved near-peak performance, with an R² of 85.49%, emphasizing an effective trade-off between performance and computational costs. These findings highlight the potential of the hybrid approach for developing scalable Arabic AES systems, especially in low-resource environments, addressing linguistic challenges while ensuring efficient data usage.
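The exact feature set is specific to the paper; the sketch below is only a hedged illustration of the general recipe (text-based and vector-based similarity features relative to a reference essay, fed into a Random Forest regressor), with toy English strings standing in for the Arabic corpus. The embedding-based similarity the paper also uses would require a pretrained Arabic embedding model and is omitted here.

```python
# Toy sketch of similarity-feature scoring with a Random Forest
# (toy data and simplified features; not the paper's Arabic pipeline).
import numpy as np
from difflib import SequenceMatcher
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.ensemble import RandomForestRegressor

reference = "the water cycle includes evaporation condensation and precipitation"
essays = [
    "water evaporates condenses into clouds and falls as precipitation",
    "the water cycle is evaporation and rain",
    "i like football and play every day",
]
scores = np.array([4.5, 3.0, 0.5])  # hypothetical human scores

vec = TfidfVectorizer().fit([reference] + essays)
ref_vec = vec.transform([reference])

def features(text: str) -> list[float]:
    return [
        cosine_similarity(vec.transform([text]), ref_vec)[0, 0],  # vector-based
        SequenceMatcher(None, text, reference).ratio(),           # text-based
        len(text.split()) / len(reference.split()),               # length ratio
    ]

X = np.array([features(e) for e in essays])
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, scores)
print(model.predict([features("evaporation and condensation drive the water cycle")]))
```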
Heterogeneous cellular networks (HCNs) are envisioned as a promising architecture to provide seamless wireless coverage and increase network capacity. However, the densified multi-tier network architecture introduces excessive intra- and cross-tier interference and makes HCNs vulnerable to eavesdropping attacks. In this article, a dynamic spectrum control (DSC)-assisted transmission scheme is proposed for HCNs to strengthen network security and increase network capacity. Specifically, the proposed DSC-assisted transmission scheme leverages the idea of block cryptography to generate sequence families, which represent the transmission decisions, by performing iterative and orthogonal sequence transformations. Based on the sequence families, multiple users can dynamically occupy different frequency slots for data transmission simultaneously. In addition, the collision probability of the data transmission is analyzed, which results in closed-form expressions of the reliable transmission probability and the secrecy probability. Then, the upper and lower bounds of network capacity are further derived with given requirements on the reliable and secure transmission probabilities. Simulation results demonstrate that the proposed DSC-assisted scheme can outperform the benchmark scheme in terms of security performance. Finally, the impacts of key factors in the proposed DSC-assisted scheme on network capacity and security are evaluated and discussed.
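The sequence-generation details are specific to the paper; purely as a hedged illustration of the collision-probability quantity being analyzed, the Monte Carlo sketch below estimates how often two or more users land on the same frequency slot when each user draws a slot independently (a simplification, not the paper's block-cryptography construction), and compares it against the birthday-style closed form.

```python
# Monte Carlo estimate of collision probability when N users each occupy one of
# K frequency slots per interval. A simplification of the DSC scheme: the real
# scheme derives slot choices from cryptographically generated sequences.
import numpy as np

def collision_probability(n_users: int, n_slots: int, trials: int = 100_000,
                          seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    slots = rng.integers(0, n_slots, size=(trials, n_users))
    # A trial has a collision if any slot is chosen by more than one user.
    collided = np.array([len(np.unique(row)) < n_users for row in slots])
    return collided.mean()

for k in (16, 32, 64):
    p = collision_probability(n_users=8, n_slots=k)
    # Closed-form birthday-style value for comparison: 1 - prod((k - i) / k).
    analytic = 1 - np.prod([(k - i) / k for i in range(8)])
    print(f"K={k}: simulated {p:.4f}, analytic {analytic:.4f}")
```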
The scarcity of bilingual parallel corpora imposes limitations on exploiting state-of-the-art supervised translation technology. One of the research directions is employing relations among multi-modal data to enhance performance. However, the reliance on manually annotated multi-modal datasets results in a high cost of data labeling. In this paper, the topic semantics of images is proposed to alleviate the above problem. First, topic-related images can be automatically collected from the Internet by search engines. Second, topic semantics is sufficient to encode the relations between multi-modal data such as texts and images. Specifically, we propose a visual topic semantic enhanced translation (VTSE) model that utilizes topic-related images to construct a cross-lingual and cross-modal semantic space, allowing the VTSE model to simultaneously integrate the syntactic structure and semantic features. In the above process, topic-similar texts and images are wrapped into groups so that the model can extract more robust topic semantics from a set of similar images and then further optimize the feature integration. The results show that our model outperforms competitive baselines by a large margin on the Multi30k and Ambiguous COCO datasets. Our model can use external images to bring gains to translation, improving data efficiency.
With the rapid development of information technology, data has become the cornerstone of digitalization, networking, and intelligence, profoundly impacting various sectors including production, distribution, circulation, consumption, and social service management. As the core resource of the digital economy and information society, the economic and social value of big data is increasingly prominent, yet it has also become a prime target for cyberattacks. In the face of a complex and ever-changing data environment and advanced cyber threats, traditional big data security technologies such as Hadoop and other mainstream technologies are proving inadequate in ensuring data security and compliance. Consequently, cryptography-based technologies such as fully encrypted execution environments and efficient data encryption and decryption have emerged as new directions for security protection in the field of big data. This paper delves into the latest advancements and challenges in this area by exploring the current state of big data security, the principles of endogenous security technologies, practical applications, and future prospects.
Data envelopment analysis (DEA) is an effective non-parametric method for measuring the relative efficiencies of decision making units (DMUs) with multiple inputs and outputs. In many real situations, the internal structure of a DMU is a two-stage network process with shared inputs used in both stages and common outputs produced by both stages. For example, hospitals have a two-stage network structure. Stage 1 consumes resources such as the information technology system, plant, equipment, and administrative personnel to generate outputs such as medical records, laundry, and housekeeping. Stage 2 consumes the same set of resources used by stage 1 (named shared inputs) and the outputs generated by stage 1 (named intermediate measures) to provide patient services. Besides, some outputs, for instance patient satisfaction degrees, are generated by the two individual stages together (named shared outputs). Since some of the shared inputs and outputs are hard to split up and allocate to each individual stage, two-stage DEA methods need to be developed for evaluating the performance of two-stage network processes in such problems. This paper extends the centralized model to measure the DEA efficiency of the two-stage process with non-splittable shared inputs and outputs. A weighted additive approach is used to combine the two individual stages. Moreover, additive efficiency decomposition models are developed to simultaneously evaluate the maximal and minimal achievable efficiencies of the individual stages. Finally, an example of 17 city branches of China Construction Bank in Anhui Province is employed to illustrate the proposed approach.
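The paper's two-stage weighted additive models are more involved than what fits here; purely as a hedged point of reference, the sketch below solves the classical single-stage CCR multiplier model with linear programming, which is the building block that two-stage extensions generalize. The data are made up, not the China Construction Bank branches.

```python
# Classical input-oriented CCR efficiency (multiplier form) via linear programming.
# Single-stage only: a simplified reference point, not the paper's two-stage
# shared-input/shared-output models. Data are illustrative.
import numpy as np
from scipy.optimize import linprog

X = np.array([[20.0, 300], [30, 200], [40, 100], [20, 200], [10, 400]])    # inputs  (n_dmu, m)
Y = np.array([[1000.0, 20], [1000, 30], [1200, 40], [900, 25], [600, 15]])  # outputs (n_dmu, s)

def ccr_efficiency(o: int) -> float:
    n, m = X.shape
    s = Y.shape[1]
    c = np.concatenate([-Y[o], np.zeros(m)])             # maximize u . y_o
    A_ub = np.hstack([Y, -X])                            # u . y_j - v . x_j <= 0 for all j
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.zeros(s), X[o]])[None, :]  # v . x_o = 1 (normalization)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (s + m))
    return -res.fun

for o in range(len(X)):
    print(f"DMU {o}: CCR efficiency = {ccr_efficiency(o):.3f}")
```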
With the developing demands of massive-data services, applications that rely on big geographic data play crucial roles in academic and industrial communities. Unmanned aerial vehicles (UAVs), combined with terrestrial wireless sensor networks (WSNs), can provide sustainable solutions for data harvesting. Rising demands for efficient data collection over large open areas have been posed in the literature, requiring UAV trajectory planning methods with lower energy consumption. Currently, there are numerous intricate solutions for UAV planning over large open areas, and one of the most practical techniques in previous studies is deep reinforcement learning (DRL). However, the overestimation problem in limited-experience DRL quickly traps the UAV path planning process in a local optimum. Moreover, using the central nodes of the sub-WSNs as the sink nodes or navigation points for UAVs to visit may lead to extra collection costs. This paper develops a data-driven DRL-based game framework with two partners to fulfill the above demands. A cluster head processor (CHP) is employed to determine the sink nodes, and a navigation order processor (NOP) is established to plan the path. The CHP and NOP receive information from each other and provide optimized solutions after reaching a Nash equilibrium. The numerical results show that the proposed game framework can offer UAVs low-cost data collection trajectories, saving at least 17.58% of energy consumption compared with the baseline methods.
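The paper's CHP/NOP pairing is a learned game; as a much simpler, hedged stand-in that only illustrates the two roles, the sketch below picks cluster heads with k-means and orders the visits with a nearest-neighbour heuristic (an illustrative assumption, not the DRL-based game framework).

```python
# Simplified stand-in for the two roles in the framework: choose sink nodes
# (cluster heads) and plan a visiting order. Not the paper's DRL-based game,
# just k-means plus a nearest-neighbour tour for illustration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
sensors = rng.uniform(0, 1000, size=(300, 2))   # sensor positions in a 1 km x 1 km area

# "CHP" stand-in: pick k cluster heads as the sensors closest to k-means centers.
k = 8
centers = KMeans(n_clusters=k, n_init=10, random_state=1).fit(sensors).cluster_centers_
heads = sensors[[np.argmin(np.linalg.norm(sensors - c, axis=1)) for c in centers]]

# "NOP" stand-in: nearest-neighbour visiting order starting from a depot.
def nearest_neighbour_tour(points: np.ndarray, depot: np.ndarray) -> list[int]:
    remaining, order, current = list(range(len(points))), [], depot
    while remaining:
        nxt = min(remaining, key=lambda i: np.linalg.norm(points[i] - current))
        order.append(nxt)
        current = points[nxt]
        remaining.remove(nxt)
    return order

depot = np.array([0.0, 0.0])
order = nearest_neighbour_tour(heads, depot)
tour_len = np.linalg.norm(heads[order[0]] - depot) + sum(
    np.linalg.norm(heads[order[i + 1]] - heads[order[i]]) for i in range(len(order) - 1))
print(f"visit order: {order}, tour length ~ {tour_len:.0f} m")
```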
An optimisation problem can have many forms and variants. It may consider different objectives, constraints, and variables. For that reason, providing a general application programming interface (API) to handle the problem data efficiently in all scenarios is impracticable. Nonetheless, in an R&D environment involving personnel from distinct backgrounds, having such an API can help with the development process because the team can focus on the research instead of implementations of data parsing, objective function calculation, and data structures. Also, some researchers might have a stronger background in programming than others, so having a standard, efficient API to handle the problem data improves reliability and productivity. This paper presents a design methodology to enable the development of efficient APIs for handling optimisation problem data, based on a data-centric development framework. The proposed methodology involves the design of a data parser to handle the problem definition and data files, and a set of efficient data structures to hold the data in memory. Additionally, we present three design patterns aimed at improving the performance of the API, along with techniques to improve memory access by the user application. We also introduce the concept of a Solution Builder that can manage solution objects in memory better than built-in garbage collectors and provide an integrated objective function so that researchers can easily compare solutions from different solving techniques. Finally, we describe the positive results of employing a tailored API in a project involving the development of optimisation solutions for workforce scheduling and routing problems.
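The paper's API is not published in the abstract; the sketch below only illustrates the kind of design it describes (a parser that loads problem data into an immutable container, plus a Solution Builder exposing one shared objective function). The field names are hypothetical, with a workforce-scheduling flavour.

```python
# Hedged sketch of a data-centric problem API: an immutable problem container
# filled by a parser, and a SolutionBuilder with one integrated objective
# function. Field names are hypothetical, not the paper's actual API.
import json
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ProblemData:
    tasks: tuple[str, ...]                      # task identifiers
    workers: tuple[str, ...]                    # worker identifiers
    cost: dict[tuple[str, str], float]          # (worker, task) -> assignment cost

    @classmethod
    def from_json(cls, path: str) -> "ProblemData":
        """Parser: single entry point from file to in-memory structures."""
        with open(path) as f:
            raw = json.load(f)
        cost = {(c["worker"], c["task"]): c["cost"] for c in raw["costs"]}
        return cls(tuple(raw["tasks"]), tuple(raw["workers"]), cost)

@dataclass
class SolutionBuilder:
    data: ProblemData
    assignment: dict[str, str] = field(default_factory=dict)  # task -> worker

    def assign(self, task: str, worker: str) -> "SolutionBuilder":
        self.assignment[task] = worker
        return self

    def objective(self) -> float:
        """Integrated objective: comparable across solving techniques."""
        return sum(self.data.cost[(w, t)] for t, w in self.assignment.items())

# Usage with in-memory data (a file would normally go through from_json).
data = ProblemData(("t1", "t2"), ("w1",), {("w1", "t1"): 3.0, ("w1", "t2"): 5.0})
print(SolutionBuilder(data).assign("t1", "w1").assign("t2", "w1").objective())  # 8.0
```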
With the increasing demand for and wide application of high-performance commodity multi-core processors, both the quantity and scale of data centers have grown dramatically, bringing heavy energy consumption. Researchers and engineers have applied much effort to reducing hardware energy consumption, but software is the true consumer of power and another key to making better use of energy. System software is critical to better energy utilization, because it is not only the manager of hardware but also the bridge and platform between applications and hardware. In this paper, we summarize some trends that can affect the efficiency of data centers. Meanwhile, we investigate the causes of software inefficiency. Based on these studies, major technical challenges and corresponding possible solutions for attaining green system software in programmability, scalability, efficiency, and software architecture are discussed. Finally, some of our research progress on trusted energy-efficient system software is briefly introduced.
Engineering application domains need database management systems to supply them with a good means of modeling, high data access efficiency, and a language interface with strong functionality. This paper presents a semantic hypergraph model based on relations, in order to express many-to-many relations among objects belonging to different semantic classes in engineering applications. A management mechanism expressed by the model and the basic data of engineering databases are managed in main memory. In particular, different objects are linked by different kinds of semantics defined by users, so table swaps, record swaps, and some unnecessary examinations are reduced and the access efficiency of the engineering data is increased. A C language interface that includes both generic and specialized functionality is proposed for closer connection with application programs.