This paper proposes a feature selection method based on Bayes' theorem. The purpose of the proposed method is to reduce the computational complexity and increase the classification accuracy of the selected feature subsets. The dependence between two (binary) attributes is determined from the probabilities of their joint values that contribute to positive and negative classification decisions. If opposing sets of attribute values do not lead to opposing classification decisions (zero probability), then the two attributes are considered independent of each other; otherwise they are dependent, one of them can be removed, and the number of attributes is thereby reduced. The process is repeated over all combinations of attributes. The paper also evaluates the approach by comparing it with existing feature selection algorithms on eight datasets from the University of California, Irvine (UCI) machine learning repository. The proposed method shows better results than most existing algorithms in terms of the number of selected features, classification accuracy, and running time.
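As a rough illustration of the pairwise dependence test described above, the following minimal sketch screens binary attributes with a simple Bayes-style check. The function name, the exact probability test, and the greedy removal order are assumptions, since the abstract does not spell them out.

```python
import itertools
import pandas as pd

def pairwise_bayes_selection(X: pd.DataFrame, y: pd.Series) -> list:
    """Greedy pairwise screening of binary attributes (values 0/1).

    For every attribute pair (a, b), estimate the probability that a joint
    value (va, vb) supports the positive class and that the opposing joint
    value (1-va, 1-vb) supports the negative class.  If such "opposing
    values -> opposing decisions" evidence exists (non-zero probability),
    the attributes are treated as dependent and one of them is dropped.
    """
    keep = list(X.columns)
    for a, b in itertools.combinations(list(X.columns), 2):
        if a not in keep or b not in keep:
            continue
        dependent = False
        for va, vb in itertools.product([0, 1], repeat=2):
            joint = (X[a] == va) & (X[b] == vb)
            opposing = (X[a] == 1 - va) & (X[b] == 1 - vb)
            if joint.any() and opposing.any():
                p_pos = (y[joint] == 1).mean()       # P(+ | a=va, b=vb)
                p_neg = (y[opposing] == 0).mean()    # P(- | a=1-va, b=1-vb)
                if p_pos > 0 and p_neg > 0:
                    dependent = True
                    break
        if dependent:
            keep.remove(b)   # drop one attribute of the dependent pair
    return keep
```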
In this contribution, we present iHEARu-PLAY, an online, multi-player platform for crowdsourced database collection and labelling, including the voice analysis application (VoiLA), a free web-based speech classification tool designed to educate iHEARu-PLAY users about state-of-the-art speech analysis paradigms. Through this associated speech analysis web interface, VoiLA also encourages users to take an active role in improving the service by providing labelled speech data. The platform allows users to record and upload voice samples directly from their browser, which are then analysed in a state-of-the-art classification pipeline. A set of pre-trained models targeting a range of speaker states and traits, such as gender, valence, arousal, dominance, and 24 different discrete emotions, is employed. The analysis results are visualised in a way that is easily interpretable by laypeople, giving users unique insights into how their voice sounds. We assess the effectiveness of iHEARu-PLAY and its integrated VoiLA feature via a series of user evaluations, which indicate that it is fun and easy to use and that it provides accurate and informative results.
In the face of a growing number of large-scale data sets, the affinity propagation clustering algorithm must build a full similarity matrix during its computation, which entails enormous storage and computational costs. Therefore, this paper proposes an improved affinity propagation clustering algorithm. First, subtractive clustering is added: the density values of the data points are used to obtain the initial cluster points. Then, the similarity distances between the initial cluster points are calculated and, drawing on the idea of semi-supervised clustering, pairwise constraint information is added to construct a sparse similarity matrix. Finally, AP clustering is performed on the cluster representative points until a suitable cluster division is obtained. Experimental results show that the algorithm greatly reduces the amount of computation and the storage required for the similarity matrix, and outperforms the original algorithm in both clustering quality and processing speed.
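A minimal sketch of the density-seeded idea is given below, using scikit-learn's AffinityPropagation: high-density representative points are selected in the spirit of subtractive clustering and AP is run only on them. The pairwise-constraint and sparse-matrix steps of the paper are omitted, and the density radius and number of representatives are assumed parameters.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics.pairwise import euclidean_distances

def density_representatives(X, ra=1.0, n_reps=200):
    """Subtractive-clustering-style density score: points in dense regions get
    high scores, and the top-scoring points serve as initial cluster points."""
    d2 = euclidean_distances(X, squared=True)
    density = np.exp(-d2 / (ra / 2.0) ** 2).sum(axis=1)
    return np.argsort(density)[::-1][: min(n_reps, len(X))]

def improved_ap(X, ra=1.0, n_reps=200):
    """Run AP only on the high-density representatives, then assign every
    original point to its nearest exemplar."""
    reps = density_representatives(X, ra, n_reps)
    X_reps = X[reps]
    ap = AffinityPropagation(random_state=0).fit(X_reps)
    exemplars = X_reps[ap.cluster_centers_indices_]
    labels = euclidean_distances(X, exemplars).argmin(axis=1)
    return labels, reps
```

Running AP on a few hundred representatives instead of all points is what keeps the similarity matrix small, which is the main computational saving the abstract refers to.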
System architecture. The Intelligent Teaching Team of the Shanghai Institute (Laboratory) of AI Education and the Institute of Curriculum and Instruction of East China Normal University collaborated to develop the High-Quality Classroom Intelligent Analysis Standard system. This system is measured along the dimensions of Class Efficiency, Equity and Democracy, and is referred to as the CEED system.
Substantial advancements have been achieved in Tunnel Boring Machine (TBM) technology and monitoring systems, yet the presence of missing data impedes accurate analysis and interpretation of TBM monitoring results. This study investigates the issue of missing data in extensive TBM datasets. Through a comprehensive literature review, we analyze the mechanisms behind missing TBM data and compare different imputation methods, including statistical analysis and machine learning algorithms. We also examine the impact of various missing patterns and rates on the efficacy of these methods. Finally, we propose a dynamic interpolation strategy tailored for TBM engineering sites. The results show that the K-Nearest Neighbors (KNN) and Random Forest (RF) algorithms achieve good imputation results; as the missing rate increases, the performance of all methods decreases; block missingness is the hardest to impute, followed by mixed missingness, while sporadic missingness is handled best. On-site application results validate the proposed interpolation strategy's capability to achieve robust imputation of missing values, applicable in machine learning scenarios such as parameter optimization, attitude warning, and pressure prediction. These findings contribute to enhancing the efficiency of TBM missing-data processing, offering more effective support for large-scale TBM monitoring datasets.
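A hedged sketch of the KNN-versus-RF comparison mentioned above, using scikit-learn's KNNImputer and an IterativeImputer driven by a RandomForestRegressor; the random masking scheme, neighbour count, and forest size are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer
from sklearn.ensemble import RandomForestRegressor

def compare_imputers(X_complete: np.ndarray, missing_rate: float = 0.2, seed: int = 0):
    """Mask a complete monitoring matrix at random ("sporadic" missingness),
    impute with KNN and an RF-based iterative imputer, and report RMSE."""
    rng = np.random.default_rng(seed)
    mask = rng.random(X_complete.shape) < missing_rate
    X_missing = X_complete.copy()
    X_missing[mask] = np.nan

    imputers = {
        "KNN": KNNImputer(n_neighbors=5),
        "RF": IterativeImputer(
            estimator=RandomForestRegressor(n_estimators=100, random_state=seed),
            max_iter=10, random_state=seed),
    }
    scores = {}
    for name, imputer in imputers.items():
        X_filled = imputer.fit_transform(X_missing)
        scores[name] = float(np.sqrt(np.mean((X_filled[mask] - X_complete[mask]) ** 2)))
    return scores
```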
Social media data have created a paradigm shift in assessing situational awareness during natural disasters or emergencies such as wildfires, hurricanes, and tropical storms. Twitter, as an emerging data source, is an effective and innovative digital platform for observing trends from the perspective of social media users who are direct or indirect witnesses of the calamitous event. This paper collects and analyzes Twitter data related to the recent wildfire in California to perform a trend analysis by identifying firsthand and credible information from Twitter users. This work investigates tweets on the recent wildfire in California and classifies them by witness type into two groups: 1) direct witnesses and 2) indirect witnesses. The collected and analyzed information can be useful for law enforcement agencies and humanitarian organizations for communication and verification of situational awareness during wildfire hazards. Trend analysis is an aggregated approach that includes sentiment analysis and topic modeling performed through domain-expert manual annotation and machine learning. Trend analysis ultimately builds a fine-grained analysis to assess evacuation routes and provide valuable information to firsthand emergency responders.
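The witness classification step lends itself to a simple supervised text model. The sketch below, with assumed labels (1 = direct witness, 0 = indirect), uses a TF-IDF representation and logistic regression rather than the paper's exact pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def witness_classifier(tweets, labels):
    """tweets: list of raw tweet texts; labels: 1 = direct witness, 0 = indirect.
    Returns the fitted model and its mean 5-fold cross-validated accuracy."""
    model = make_pipeline(
        TfidfVectorizer(lowercase=True, stop_words="english", ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    accuracy = cross_val_score(model, tweets, labels, cv=5).mean()
    model.fit(tweets, labels)
    return model, accuracy
```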
Both computer science and archival science are concerned with archiving large-scale data, but they have different focuses. Large-scale data archiving in computer science focuses on technical aspects that can reduce the cost of data storage and improve the reliability and efficiency of Big Data management; its weakness lies in inadequate and non-standardized management. Archiving in archival science focuses on management aspects and neglects the necessary technical considerations, resulting in high storage and retention costs and a poor ability to manage Big Data. Therefore, the integration of large-scale data archiving and archival theory can balance the existing research limitations of the two fields. This paper proposes two research topics for related research: the archival management of Big Data and the large-scale management of archived Big Data.
A kernel is a kind of data summary that is carefully extracted from a large dataset. Given a problem, the solution obtained from the kernel is an approximate version of the solution obtained from the whole dataset, with a provable approximation ratio. Kernels are widely used in geometric optimization, clustering, and approximate query processing, among other areas, to scale these tasks up to massive data. In this paper, we focus on the minimum ε-kernel (MK) computation, which asks for a kernel of the smallest size for large-scale data processing. For the open problem posed by Wang et al. of whether the minimum ε-coreset (MC) problem and the MK problem can be reduced to each other, we first formalize the MK problem and analyze its complexity. Due to the NP-hardness of the MK problem in three or higher dimensions, an approximate algorithm, namely the Set Cover-Based Minimum ε-Kernel algorithm (SCMK), is developed to solve it. We prove that the MC problem and the MK problem can be Turing-reduced to each other. Then, we discuss the update of the MK under insertion and deletion operations, respectively. Finally, a randomized algorithm, called the Randomized Algorithm of the Set Cover-Based Minimum ε-Kernel algorithm (RA-SCMK), is utilized to further reduce the complexity of SCMK. The efficiency and effectiveness of SCMK and RA-SCMK are verified by experimental results on real-world and synthetic datasets. Experiments show that the kernel sizes of SCMK are 2x and 17.6x smaller than those of an ANN-based method on real-world and synthetic datasets, respectively. The speedup ratio of SCMK over the ANN-based method is 5.67 on synthetic datasets. RA-SCMK runs up to three times faster than SCMK on synthetic datasets.
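For readers unfamiliar with ε-kernels, the sketch below shows the classic directional-width construction (keep the extreme points along sampled directions). It is only meant to make the notion of a kernel concrete; it is not the paper's set-cover-based SCMK algorithm, and the number of sampled directions is an assumed parameter.

```python
import numpy as np

def directional_kernel(P: np.ndarray, n_dirs: int = 64, seed: int = 0) -> np.ndarray:
    """Keep, for each of n_dirs random unit directions, the two extreme points
    of the point set P (shape n x d).  The kept subset approximately preserves
    the directional width of P, which is what an epsilon-kernel guarantees."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_dirs, P.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    proj = P @ dirs.T                         # (n, n_dirs) projections
    keep = set(proj.argmax(axis=0)) | set(proj.argmin(axis=0))
    return np.array(sorted(keep))             # indices of kernel points
```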
1. Introduction. Climate change mitigation pathways aimed at limiting global anthropogenic carbon dioxide (CO₂) emissions while striving to constrain the global temperature increase to below 2 °C, as outlined by the Intergovernmental Panel on Climate Change (IPCC), consistently predict the widespread implementation of CO₂ geological storage on a global scale.
Data Grids integrate geographically distributed resources for solving data-intensive scientific applications. Effective scheduling in a Grid can reduce the amount of data transferred among nodes by submitting a job to a node where most of the requested data files are available. Scheduling is a traditional problem in parallel and distributed systems. However, due to the special issues and goals of Grids, traditional approaches are no longer effective in this environment. Therefore, it is necessary to propose methods specialized for this kind of parallel and distributed system. Another solution is to use a data replication strategy to create multiple copies of files and store them in convenient locations to shorten file access times. To combine these two ideas, in this paper we develop a job scheduling policy, called the hierarchical job scheduling strategy (HJSS), and a dynamic data replication strategy, called the advanced dynamic hierarchical replication strategy (ADHRS), to improve data access efficiency in a hierarchical Data Grid. HJSS uses hierarchical scheduling to reduce the search time for an appropriate computing node. It considers network characteristics, the number of jobs waiting in the queue, file locations, and the disk read speed of the storage drives at the data sources. Moreover, due to limited storage capacity, a good replica replacement algorithm is needed. We present a novel replacement strategy that deletes files in two steps when free space is not enough for the new replica: first, it deletes the files with the minimum transfer time; second, if space is still insufficient, it considers the time the replica was last requested, the number of accesses, the size of the replica, and the file transfer time. The simulation results show that our proposed algorithm performs better than other algorithms in terms of job execution time, number of intercommunications, number of replications, hit ratio, computing resource usage, and storage usage.
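The two-step replacement rule can be sketched as below. The dataclass fields, the cheap-transfer threshold, and the way the second-step criteria are combined into a single eviction score are assumptions for illustration, since the abstract only lists the factors involved.

```python
import time
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    size: float            # bytes
    transfer_time: float   # seconds needed to re-fetch the file from a source
    last_request: float    # unix timestamp of the last access
    access_count: int

def choose_victims(replicas, needed, free, cheap_transfer=1.0):
    """Two-step replacement: first evict files that are cheap to re-fetch
    (transfer_time below a threshold); if space is still short, rank the rest
    by a combined score of recency, access count, size and transfer time."""
    victims, rest = [], []
    for r in sorted(replicas, key=lambda r: r.transfer_time):
        if free < needed and r.transfer_time <= cheap_transfer:
            victims.append(r)
            free += r.size
        else:
            rest.append(r)
    if free < needed:
        now = time.time()
        def score(r):  # higher score = better eviction candidate
            staleness = now - r.last_request
            return (staleness * r.size) / ((r.access_count + 1) * (r.transfer_time + 1))
        for r in sorted(rest, key=score, reverse=True):
            if free >= needed:
                break
            victims.append(r)
            free += r.size
    return victims, free
```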
Today, data is flowing into various organizations at an unprecedented scale. The ability to scale out to process an increased workload has become an important factor for the proliferation and popularization of database systems. Big data applications demand, and consequently drive, the development of diverse large-scale data management systems in different organizations, ranging from traditional database vendors to newly emerging Internet-based enterprises. In this survey, we investigate, characterize, and analyze large-scale data management systems in depth and develop comprehensive taxonomies for various critical aspects covering the data model, the system architecture, and the consistency model. We map the prevailing highly scalable data management systems to the proposed taxonomies, not only to classify the common techniques but also to provide a basis for analyzing current system scalability limitations. To overcome these limitations, we predict and highlight the principles that future efforts will need to follow for the next generation of large-scale data management systems.
How to effectively reduce the energy consumption of large-scale data centers is a key issue in cloud computing. This paper presents a novel low-power task scheduling algorithm (L3SA) for large-scale cloud data centers. A winner tree is introduced, with the data nodes as the leaf nodes of the tree, and the final winner is selected with the aim of reducing energy consumption. The complexity of large-scale cloud data centers is fully considered, and a task comparison coefficient is defined to make the task scheduling strategy more reasonable. Experiments and performance analysis show that the proposed algorithm can effectively improve node utilization and reduce the overall power consumption of the cloud data center.
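A winner tree is a complete binary tree whose leaves are the candidates and whose internal nodes record the winner of each pairwise comparison. The sketch below builds one in the usual array form; the comparison rule (lowest incremental power cost wins) is chosen purely for illustration and is not the paper's task comparison coefficient.

```python
def build_winner_tree(nodes, better):
    """nodes: list of candidate data-node records; better(a, b) returns the
    record that wins the comparison.  Returns an array-based complete binary
    tree: leaves occupy indices n..2n-1, internal nodes 1..n-1, and the final
    winner ends up at index 1."""
    n = len(nodes)
    tree = [None] * n + list(nodes)
    for i in range(n - 1, 0, -1):
        tree[i] = better(tree[2 * i], tree[2 * i + 1])
    return tree

# Illustrative comparison rule: the node with the lower incremental power cost wins.
nodes = [{"id": "n0", "power_cost": 4.2}, {"id": "n1", "power_cost": 3.1},
         {"id": "n2", "power_cost": 5.0}, {"id": "n3", "power_cost": 2.7}]
winner = build_winner_tree(nodes, lambda a, b: a if a["power_cost"] <= b["power_cost"] else b)[1]
```

After the build, `tree[1]` holds the final winner, i.e. the data node the scheduler would assign the task to.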
The recent upsurge in metro construction emphasizes the necessity of understanding the mechanical performance of metro shield tunnels subjected to the influence of ground fissures. In this study, a large-scale experiment, in combination with numerical simulation, was conducted to investigate the influence of ground fissures on a metro shield tunnel. The results indicate that the lining contact pressure at the vault increases in the hanging wall and decreases in the footwall, resulting in a two-dimensional stress state of vertical shear and axial tension-compression, and simultaneous vertical dislocation and axial tilt of the segments around the ground fissure. In addition, the damage to the curved bolts includes tensile yield, flexural yield, and shear twist, leading to obvious damage to the concrete lining, particularly at the vault, arch bottom, and haunch, indicating that the joints at these positions are weak areas. The shield tunnel orthogonal to the ground fissure ultimately experiences shear failure, suggesting that the maximum actual dislocation of the ground fissure that the structure can withstand is approximately 20 cm, and that five segment rings in the hanging wall and six segment rings in the footwall also need to be reinforced. This study can provide a reference for metro design at ground fissure sites.
Protein-protein interactions are of great significance for understanding the functional mechanisms of proteins. With the rapid development of high-throughput genomic technologies, massive protein-protein interaction (PPI) data have been generated, making it very difficult to analyze them efficiently. To address this problem, this paper presents a distributed framework that reimplements one of the state-of-the-art algorithms, CoFex, using MapReduce. To do so, an in-depth analysis of its limitations is conducted from the perspectives of efficiency and memory consumption when applying it to large-scale PPI data analysis and prediction. Respective solutions are then devised to overcome these limitations. In particular, we adopt a novel tree-based data structure to reduce the heavy memory consumption caused by the huge sequence information of proteins. After that, its procedure is modified to follow the MapReduce framework so that the prediction task is carried out distributively. A series of extensive experiments has been conducted to evaluate the performance of our framework in terms of both efficiency and accuracy. Experimental results demonstrate that the proposed framework can improve computational efficiency by more than two orders of magnitude while retaining the same high accuracy.
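To make the map/shuffle/reduce split concrete, here is a deliberately simplified toy (not CoFex): the mapper emits k-mer features per protein, the reducer pairs up proteins that share a feature, and the driver simulates the shuffle phase in memory. The k-mer length and the co-occurrence score are assumptions made only for illustration.

```python
from collections import defaultdict
from itertools import chain

def mapper(record, k=3):
    """record: (protein_id, sequence).  Emit (k-mer feature, protein_id) pairs
    so that proteins sharing a sequence feature meet at the same reducer."""
    pid, seq = record
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k], pid

def reducer(feature, pids):
    """Emit candidate interacting pairs among the proteins sharing this feature."""
    pids = sorted(set(pids))
    for i in range(len(pids)):
        for j in range(i + 1, len(pids)):
            yield (pids[i], pids[j]), 1

def run(records):
    """Driver that simulates the shuffle phase in memory: group mapper output
    by key, then feed each group to the reducer and sum the emitted counts."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapper(r) for r in records):
        groups[key].append(value)
    scores = defaultdict(int)
    for feature, pids in groups.items():
        for pair, count in reducer(feature, pids):
            scores[pair] += count   # shared-feature count as a crude interaction score
    return dict(scores)
```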
Background: The importance of structurally diverse forests for the conservation of biodiversity and the provision of a wide range of ecosystem services has been widely recognised. However, tools to quantify the structural diversity of forests in an objective and quantitative way across many forest types and sites are still needed, for example to support biodiversity monitoring. Existing approaches to quantifying forest structural diversity are based on small geographical regions or single forest types, typically using only small data sets. Results: Here we developed an index of structural diversity based on National Forest Inventory (NFI) data of Baden-Württemberg, Germany, a state with 1.3 million ha of diverse forest types under different ownerships. Based on a literature review, 11 aspects of structural diversity were identified a priori as crucially important for describing structural diversity. An initial comprehensive list of 52 variables derived from NFI data related to structural diversity was reduced by applying five selection criteria to arrive at one variable for each aspect of structural diversity. These variables comprise 1) quadratic mean diameter at breast height (DBH), 2) standard deviation of DBH, 3) standard deviation of stand height, 4) number of decay classes, 5) bark-diversity index, 6) trees with DBH ≥ 40 cm, 7) diversity of flowering and fructification, 8) average mean diameter of downed deadwood, 9) mean DBH of standing deadwood, 10) tree species richness, and 11) tree species richness in the regeneration layer. These variables were combined into a simple, additive index to quantify the level of structural diversity, which assumes values between 0 and 1. We applied this index in an exemplary way to broad forest categories and ownerships to assess its feasibility for analysing structural diversity in large-scale forest inventories. Conclusions: The forest structure index presented here can be derived in a similar way from standard inventory variables for most other large-scale forest inventories, providing important information about biodiversity-relevant forest conditions and thus an evidence base for forest management, planning, and reporting.
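A minimal sketch of one way to build such an additive index from plot-level variables is given below. The abstract does not state how each variable is rescaled to [0, 1], so the min-max normalisation used here is an assumption.

```python
import numpy as np
import pandas as pd

def structural_diversity_index(plots: pd.DataFrame) -> pd.Series:
    """plots: one row per inventory plot, one column per structural variable
    (e.g. standard deviation of DBH, number of decay classes, species richness).
    Each variable is min-max rescaled to [0, 1] across plots, and the index is
    the mean of the rescaled values, so it also lies between 0 and 1."""
    lo, hi = plots.min(), plots.max()
    span = (hi - lo).replace(0, np.nan)   # avoid division by zero for constant columns
    scaled = (plots - lo) / span
    return scaled.mean(axis=1, skipna=True)
```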
The titanium alloy strut serves as a key load-bearing component of aircraft landing gear and is typically manufactured via forging. The friction condition has an important influence on material flow and cavity filling during the forging process. Using the previously optimized shape and initial position of the preform, the influence of the friction condition (friction factor m = 0.1–0.3) on material flow and cavity filling was studied numerically with a shear friction model. A novel filling index was defined to reflect material flow into the left and right flashes and to magnify friction-induced effects. The results indicate that the workpiece moves rigidly to the right, with the displacement decreasing as m increases. When m < 0.18, an underfilling defect occurs on the left side of the strut forging, while overflow occurs in the right forging die cavity. By combining the filling index with analyses of material flow and filling status, a reasonable friction factor interval of m = 0.21–0.24 can be determined. Within this interval, the cavity filling behavior is robust, and friction fluctuations exert minimal influence.
Background: A task assigned to space exploration satellites involves detecting the physical environment within a certain region of space. However, space detection data are complex and abstract, and are not conducive to researchers' visual perception of the evolution and interaction of events in the space environment. Methods: A time-series dynamic data sampling method for large-scale space was proposed to sample detection data in space and time, and the corresponding relationships between data location features and other attribute features were established. A tone-mapping method based on statistical histogram equalization was proposed and applied to the final attribute feature data. The visualization process was optimized for rendering by merging materials, reducing the number of patches, and performing other operations. Results: Sampling, feature extraction, and uniform visualization of detection data of complex types, long duration spans, and uneven spatial distributions were achieved. The real-time visualization of large-scale spatial structures using augmented reality devices, particularly low-performance devices, was also investigated. Conclusions: The proposed visualization system can reconstruct the three-dimensional structure of a large-scale space, express the structure and changes of the spatial environment using augmented reality, and assist in intuitively discovering spatial environmental events and evolutionary rules.
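The histogram-equalization tone mapping step can be sketched in a few lines of NumPy: attribute values are mapped through their empirical cumulative distribution so that the colour scale is used evenly and rare extreme values no longer dominate. The bin count is an assumed parameter.

```python
import numpy as np

def histogram_equalize(values: np.ndarray, n_bins: int = 256) -> np.ndarray:
    """Map raw attribute values to [0, 1] so that the mapped values are
    approximately uniformly distributed across the colour scale."""
    hist, edges = np.histogram(values, bins=n_bins)
    cdf = np.cumsum(hist).astype(float)
    cdf /= cdf[-1]
    return np.interp(values, edges[1:], cdf)
```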
Based on questionnaire surveys and field interviews conducted with various types of agricultural production organizations across five districts and four counties in Daqing City, this study combines relevant theoretical frameworks to systematically examine the evolution, performance, and influencing factors of the governance mechanisms within these organizations. Using both quantitative and inductive analytical methods, the paper proposes innovative designs and supporting measures for improving governance mechanisms. The findings reveal that, amid large-scale farmland circulation, the governance mechanisms of agricultural production organizations in Daqing City are evolving from traditional to modern structures. However, challenges remain in areas such as decision-making efficiency, benefit distribution, and supervision mechanisms. In response, this study proposes innovative governance designs focusing on decision-making processes, profit-sharing mechanisms, and risk prevention. Corresponding policy recommendations are also provided to support the sustainable development of agricultural modernization in China.
Formalizing the complex processes and phenomena of a real-world problem may require a large number of variables and constraints, resulting in what is termed a large-scale optimization problem. Nowadays, such large-scale optimization problems are solved using computing machines, which can require enormous computational time and may delay the derivation of timely solutions. Decomposition methods, which partition a large-scale optimization problem into lower-dimensional subproblems, represent a key approach to addressing these time-efficiency issues. There has been significant progress on this front in both applied mathematics and emerging artificial intelligence approaches. This work aims to provide an overview of decomposition methods from both the mathematics and computer science points of view. We also remark on state-of-the-art developments and recent applications of decomposition methods, and discuss future research and development perspectives.
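As a toy illustration of the decomposition idea (not any specific method from the survey), the sketch below minimises a function by block coordinate descent: each inner step updates only one block of variables while the others stay fixed. The step size, iteration count, and numerical gradient are assumptions chosen for simplicity.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference gradient, good enough for a toy example."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

def block_coordinate_descent(f, x0, blocks, step=1e-2, iters=200):
    """Cycle over blocks of variables: each inner step updates only the
    variables in one block while the others stay fixed, which is the basic
    idea behind decomposing a large-scale problem into subproblems."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        for block in blocks:                    # e.g. [[0, 1, 2], [3, 4], ...]
            g = numerical_gradient(f, x)
            x[block] -= step * g[block]         # gradient step restricted to the block
    return x

# Example: a separable quadratic split into two blocks of variables.
x_opt = block_coordinate_descent(lambda v: float(np.sum(v ** 2)), np.ones(4), [[0, 1], [2, 3]])
```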
This article focuses on the management of large-scale machinery and equipment in highway construction, with the research objective of identifying issues at the management level and exploring more effective management measures. Through practical observation and logical analysis, the article elaborates on the management connotations of large-scale machinery and equipment in highway construction, affirming its management value from different perspectives. On this basis, it carefully analyzes the problems existing in the management of large-scale machinery and equipment, providing a detailed interpretation of issues such as the weak foundation of the equipment management system and the disconnection of equipment selection and configuration from actual conditions. Combining the manifestations of these problems, the article proposes strategies such as strengthening the institutional foundation of equipment management and selecting and configuring equipment based on actual conditions, aiming to provide a reference for relevant enterprises on the management of large-scale machinery and equipment.