This research paper has provided the methodology and design for implementing a hybrid author recommender system using Azure Data Lake Analytics and Power BI. It offers recommendations for the top 1000 computer science authors across different fields of study. The technique used in this paper handles inadequate citation information; it removes the cold-start problem encountered by many other recommender systems. Abstracts, titles, and the Microsoft Academic Graph have been used to build the recommendation list for every document, combining content-based approaches with co-citation analysis. Tunable system parameters allow each technique to be prioritized and blended, trading off the authority of the recommended results against paper novelty. In the end, we observe a direct correlation between the similarity rankings produced by the system and the participants' scores. The results from the associated analysis scripts and the user survey have been made available through the recommendation system. Managers must gain the required expertise to fully utilize the benefits of business intelligence systems [1]. Data mining has become an important tool for managers: it provides insights into their daily operations and leverages the information provided by decision support systems to improve customer relationships [2]. Additionally, managers require business intelligence systems that can rank output in order of priority. Ranking algorithms can replace the traditional data mining algorithms, which will be discussed in depth in the literature review [3].
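The blending step described above, where content-based and co-citation scores are combined under a tunable weight, can be sketched as follows. This is a minimal illustration, not the paper's actual implementation; the document IDs, scores, and the single `alpha` parameter are all invented for the example.

```python
# Minimal sketch of blending content-based and co-citation scores with a
# tunable weight (all names and values here are illustrative).

def hybrid_scores(content, cocitation, alpha=0.5):
    """Blend two per-document score dicts; alpha weights content similarity."""
    docs = set(content) | set(cocitation)
    return {d: alpha * content.get(d, 0.0) + (1 - alpha) * cocitation.get(d, 0.0)
            for d in docs}

content = {"paperA": 0.9, "paperB": 0.4}     # textual similarity scores
cocitation = {"paperB": 0.8, "paperC": 0.6}  # co-citation strength scores

# alpha close to 1 favors textual novelty; close to 0 favors citation authority
ranked = sorted(hybrid_scores(content, cocitation, alpha=0.7).items(),
                key=lambda kv: kv[1], reverse=True)
```

Tuning `alpha` is one plausible way to realize the authority-versus-novelty trade-off the abstract mentions.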
This paper addresses the challenge of efficiently querying multimodal related data in data lakes: large-scale storage and management systems that support heterogeneous data formats, including structured, semi-structured, and unstructured data. Multimodal data queries are crucial because they enable seamless retrieval of related data across modalities, such as tables, images, and text, with applications in fields like e-commerce, healthcare, and education. However, existing methods primarily focus on single-modality queries, such as joinable or unionable table discovery, and struggle to handle the heterogeneity and lack of metadata in data lakes while balancing accuracy and efficiency. To tackle these challenges, we propose a Multimodal data Query mechanism for Data Lakes (MQDL), which employs a modality-adaptive indexing mechanism and contrastive-learning-based embeddings to unify representations across modalities. Additionally, we introduce product quantization to optimize candidate verification during queries, reducing computational overhead while maintaining precision. We evaluate MQDL on a table-image dataset across multiple business scenarios, measuring precision, recall, and F1-score. Results show that MQDL achieves an accuracy of approximately 90%, while demonstrating strong scalability and reduced query response time compared to traditional methods. These findings highlight MQDL's potential to enhance multimodal data retrieval in complex data lake environments.
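Product quantization, which the abstract uses to cheapen candidate verification, works by splitting each embedding into subvectors and replacing each subvector with the ID of its nearest codebook centroid. The toy sketch below illustrates the encode/approximate-distance mechanics only; MQDL's actual codebook training and index layout are not specified here, and the random "codebooks" stand in for trained k-means centroids.

```python
import numpy as np

# Toy product-quantization sketch (codebooks here are random stand-ins for
# trained k-means centroids; dimensions are chosen for illustration only).
rng = np.random.default_rng(0)
d, m, k = 8, 2, 4            # vector dim, subspaces, centroids per subspace
sub = d // m
codebooks = rng.normal(size=(m, k, sub))

def encode(v):
    """Map each subvector to the id of its nearest centroid."""
    return np.array([np.argmin(np.linalg.norm(codebooks[i] - v[i*sub:(i+1)*sub],
                                              axis=1))
                     for i in range(m)])

def approx_dist(query, code):
    """Asymmetric distance: raw query subvectors vs. stored centroid ids."""
    return sum(np.linalg.norm(query[i*sub:(i+1)*sub] - codebooks[i, code[i]])**2
               for i in range(m))

v = rng.normal(size=d)
code = encode(v)                 # compact m-byte code instead of d floats
dist = approx_dist(v, code)      # cheap approximate distance for verification
```

Storing only the short codes is what reduces the computational overhead of verifying candidates at query time.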
Enterprise application integration encounters substantial hurdles, particularly in intricate contexts that require elevated scalability and speed. Transactional applications directly accessed by many systems frequently overload databases, undermining process efficiency. This paper examines the utilization of data lakes, historically used for data analysis, as a centralized integration layer that accommodates various temporalities and consumption modalities. The suggested method diminishes system interdependence and the burden on transactional databases, enhancing scalability and data governance in both monolithic and distributed frameworks.
Since their introduction by James Dixon in 2010, data lakes have attracted more and more attention, driven by the promise of high reusability of the stored data due to schema-on-read semantics. Building on this idea, several additional requirements have been discussed in the literature to improve the general usability of the concept, such as a central metadata catalog including all provenance information, overarching data governance, or the integration with (high-performance) processing capabilities. Although the necessity for a logical and a physical organisation of data lakes to meet those requirements is widely recognized, no concrete guidelines have yet been provided. The most common architecture implementing this conceptual organisation is the zone architecture, where data is assigned to a certain zone depending on its degree of processing. This paper discusses how FAIR Digital Objects can be used in a novel approach to organize a data lake by data types instead of zones, how they can abstract the physical implementation, and how they empower generic and portable processing capabilities through a provenance-based approach.
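To make the type-based (rather than zone-based) organisation concrete, a FAIR Digital Object record could look roughly like the sketch below. The field names (`pid`, `data_type`, `location`, `provenance`) and the storage URIs are hypothetical and do not follow any particular FDO schema; the point is only that the data type drives organisation and the provenance list links derived objects back to their sources.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a FAIR Digital Object record used to organize a data
# lake by data type (field names are illustrative, not a standard FDO schema).
@dataclass
class FairDigitalObject:
    pid: str                     # persistent identifier
    data_type: str               # drives organisation, replacing the zone
    location: str                # abstracted reference to physical storage
    provenance: list = field(default_factory=list)  # processing history

raw = FairDigitalObject("pid/001", "csv-table", "s3://lake/raw/001")
derived = FairDigitalObject("pid/002", "parquet-table", "s3://lake/d/002",
                            provenance=["pid/001", "csv_to_parquet v1.2"])
```

Because the physical `location` is hidden behind the record, a generic processor only needs the `data_type` and `provenance` to decide whether and how it can act on an object.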
A data lake (DL) denotes a vast reservoir or repository of data. It accumulates substantial volumes of data and employs advanced analytics to correlate data from diverse origins containing various forms of semi-structured, structured, and unstructured information. These systems use a flat architecture and run different types of data analytics. NoSQL databases are non-tabular and store data differently than relational tables. NoSQL databases come in various forms, including key-value pairs, documents, wide columns, and graphs, each based on its own data model. They offer simpler scalability and generally outperform traditional relational databases. While NoSQL databases can store diverse data types, they lack full support for the atomicity, consistency, isolation, and durability features found in relational databases. Consequently, machine learning approaches become necessary to categorize complex structured query language (SQL) queries. Results indicate that the most frequently used automatic classification technique in processing SQL queries on NoSQL databases is machine learning-based classification. Overall, this study provides an overview of the automatic classification techniques used in processing SQL queries on NoSQL databases. Understanding these techniques can aid the development of effective and efficient NoSQL database applications.
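The machine-learning-based classification of SQL queries that the survey highlights can be sketched with a tiny token-based Naive Bayes classifier. The query categories, training strings, and tokenizer below are all invented for illustration; a real system would use a far larger labeled corpus and richer features.

```python
from collections import Counter, defaultdict
import math

# Toy Naive Bayes classifier over SQL tokens (labels and training data are
# made up; this only illustrates ML-based query classification).
train = [
    ("SELECT name FROM users WHERE id = 1", "point_lookup"),
    ("SELECT * FROM orders WHERE total > 100", "filter_scan"),
    ("SELECT u.name FROM users u JOIN orders o ON u.id = o.uid", "join"),
    ("SELECT city, COUNT(*) FROM users GROUP BY city", "aggregate"),
]

def tokens(q):
    return q.upper().replace(",", " ").split()

counts = defaultdict(Counter)
for q, label in train:
    counts[label].update(tokens(q))

def classify(q):
    vocab = len({t for ctr in counts.values() for t in ctr})
    def log_prob(label):
        c = counts[label]
        total = sum(c.values())
        # Laplace-smoothed log-likelihood of the query's tokens
        return sum(math.log((c[t] + 1) / (total + vocab)) for t in tokens(q))
    return max(counts, key=log_prob)

predicted = classify("SELECT a.x FROM a JOIN b ON a.id = b.id")
```

The `JOIN` and `ON` keywords only occur in the "join" training example, so they dominate the smoothed likelihood for that class.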
Poyang Lake is the largest freshwater lake in China. This paper conducted a digital and rapid investigation of the lake's wetland vegetation biomass using Landsat ETM data acquired on April 16, 2000. First, using the false color composite derived from the ETM data as one of the main references, the authors designed a reasonable sampling route for field measurement of the biomass and carried it out on April 18–28, 2000. Then, after both the sampling data and the ETM data were geometrically corrected to an equal-area Albers projection, linear relationships between the sampling data and several transformed data sets derived from the ETM data, including ETM band 4, were calculated. The results show that the sampling data correlates best with the band 4 data, with a high correlation coefficient of 0.86, followed by the DVI and NDVI data with 0.83 and 0.80 respectively. Therefore, a linear regression model based on the field data and band 4 data was used to estimate the total biomass of the entire Poyang Lake, and a map of the biomass distribution was then compiled.
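The band-4 regression step amounts to an ordinary least-squares fit of measured biomass against band-4 values at the sample plots, then applying the fitted line to unsampled pixels. The sketch below uses invented sample values (the paper's field data are not reproduced here), so the fitted coefficients and the near-perfect correlation are artifacts of the toy data, not the paper's r = 0.86.

```python
import numpy as np

# Sketch of the band-4 linear regression used for biomass estimation
# (sample values are invented; the paper's field data are not reproduced).
band4 = np.array([30.0, 45.0, 60.0, 75.0, 90.0])         # band-4 DN at plots
biomass = np.array([120.0, 180.0, 230.0, 300.0, 355.0])  # measured biomass

# Fit biomass = a * band4 + b by ordinary least squares
a, b = np.polyfit(band4, biomass, 1)
r = np.corrcoef(band4, biomass)[0, 1]   # correlation of the toy samples

predicted = a * 50.0 + b   # estimate biomass for an unsampled pixel
```

Summing such per-pixel estimates over the lake extent yields the total-biomass figure, and mapping them yields the biomass distribution.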
The relatively rapid recession of glaciers in the Himalayas and the formation of moraine dammed glacial lakes (MDGLs) in the recent past have increased the risk of glacier lake outburst floods (GLOF) in Nepal, Bhutan, and the mountainous territory of Sikkim in India. As a product of climate change and global warming, such a risk has not only raised the level of threats to the habitation and infrastructure of the region, but has also contributed to the worsening of the balance of the unique ecosystem that exists in this domain and sustains several of the highest mountain peaks of the world. This study presents an up-to-date mapping of the MDGLs in the central and eastern Himalayan regions using remote sensing data, with the objective of analysing their surface area variations over time from 1990 through 2015, disaggregated over six episodes. The study also evaluates the susceptibility of MDGLs to GLOF with the least criteria decision analysis (LCDA). Forty-two major MDGLs, each having a lake surface area greater than 0.2 km², identified in the Himalayan ranges of Nepal, Bhutan, and Sikkim, have been categorized according to their surface area expansion rates in space and time. The lakes are located within the elevation range of 3800 m to 6800 m above mean sea level (amsl). With a total surface area of 37.9 km², these MDGLs as a whole were observed to have expanded by an astonishing 43.6% in area over the 25-year period of this study. A factor is introduced to numerically sort the lakes in terms of their relative yearly expansion rates, based on the interpretation of their surface area extents from satellite imagery.
Verification of predicted GLOF events in the past using this factor, with the limited field data reported in the literature, indicates that the present analysis may be considered a sufficiently reliable and rapid technique for assessing the potential bursting susceptibility of the MDGLs. The analysis also indicates that, at present, there are eight MDGLs in the region which appear to be in highly vulnerable states and have a high chance of causing GLOF events in the near future.
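The study's sorting factor is not defined in the abstract, but one plausible form of a relative yearly expansion rate is the area change per year as a fraction of the initial area, sketched below. The function name and the per-lake normalization are assumptions; only the aggregate figures (43.6% growth over 25 years) come from the text.

```python
# Illustrative relative yearly expansion-rate factor for a lake's surface area
# (the paper's exact factor definition is not given; this is one plausible form).
def yearly_expansion_rate(area_start, area_end, years):
    """Mean relative expansion per year, as a fraction of the initial area."""
    return (area_end - area_start) / (area_start * years)

# Applying it to the study's aggregate: 43.6% growth over the 25-year period
rate = yearly_expansion_rate(1.0, 1.436, 25)
```

Computing such a rate per lake and sorting descending would reproduce the kind of susceptibility ranking the study uses to flag its eight most vulnerable lakes.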
Lakes are an important component of the Earth's climate system. They play an important role in basin weather forecasting, air quality forecasting, and regional climate research. The accuracy of driving variables is the basic premise for ensuring the rationality of lake model simulations. Based on in-situ observations at the Bifenggang site of the Lake Taihu Eddy Flux Network from 2012 to 2017, this paper investigated temporal variations in temperature, relative humidity, wind speed, and radiation components at different time scales (hourly, seasonal, and interannual). ERA5 reanalysis data were compared with the in-situ observations to quantify errors and evaluate the performance of the reanalysis data. The results show that: 1) On the hourly scale, the ERA5 reanalysis data described air temperature and downward long-wave radiation more accurately. 2) On the seasonal scale, the ERA5 reanalysis data likewise described air temperature and downward long-wave radiation more accurately, but the descriptions of wind speed, relative humidity, and downward short-wave radiation show large deviations. 3) On the interannual scale, the ERA5 reanalysis data perform well for temperature, followed by downward long-wave radiation, downward short-wave radiation, and relative humidity.
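Quantifying the error between reanalysis and in-situ series typically comes down to bias and RMSE over co-located samples, as sketched below. The temperature values are invented; ERA5 itself is not accessed, and the abstract does not state which error metrics the authors used, so this is a generic sketch of the comparison.

```python
import math

# Sketch of a bias/RMSE comparison between reanalysis and in-situ series
# (values are invented; ERA5 is not accessed here).
obs = [12.1, 13.5, 15.0, 14.2, 11.8]   # in-situ air temperature, °C
era5 = [11.8, 13.9, 14.6, 14.5, 12.1]  # co-located reanalysis values

bias = sum(e - o for e, o in zip(era5, obs)) / len(obs)
rmse = math.sqrt(sum((e - o) ** 2 for e, o in zip(era5, obs)) / len(obs))
```

Repeating this per variable and per time scale (hourly, seasonal, interannual) yields the kind of ranking the abstract reports, with low-error variables such as air temperature scoring best.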
Customer data at traditional small and medium-sized enterprises (SMEs) is typically disordered and lacks standardization. To address this, an innovative data governance technique is proposed. The technique aggregates multi-source heterogeneous data and combines methods such as Optical Character Recognition (OCR) to build a standardized data lake of basic SME information, improving data quality at the source. It introduces the concept of "entropy reduction" and uses intelligent algorithms to quantitatively evaluate data quality, enabling data quality problems to be located and resolved promptly. In addition, a time-series database is built and an entropy-reduction-based Markov chain model is constructed to predict future data-quality trends and precisely govern potential problem areas. The technique not only maximizes the value of the data but also significantly lowers governance costs and improves the efficiency and accuracy of data governance, providing strong support for enterprise cost reduction and efficiency gains.
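The Markov-chain forecast of data-quality trends can be sketched as repeated multiplication of a state distribution by a transition matrix. The three quality states and all transition probabilities below are invented; the abstract does not disclose the model's actual states or how entropy reduction parameterizes them.

```python
# Toy Markov-chain forecast of data-quality state, in the spirit of the
# entropy-reduction model described above (all transition values are invented).
states = ["good", "degraded", "bad"]
# P[i][j]: probability of moving from state i to state j in one period
P = [[0.8, 0.15, 0.05],
     [0.3, 0.5, 0.2],
     [0.1, 0.4, 0.5]]

def step(dist, P):
    """One transition: next distribution = dist * P."""
    return [sum(dist[i] * P[i][j] for i in range(len(dist)))
            for j in range(len(dist))]

dist = [1.0, 0.0, 0.0]      # data currently assessed as 'good'
for _ in range(3):          # forecast three periods ahead
    dist = step(dist, P)
```

A rising probability mass in the "degraded" or "bad" states would flag a dataset as a potential problem area worth governing before quality actually drops.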
Funding (FAIR Digital Objects data lake paper): the "Niedersächsisches Vorab" funding line of the Volkswagen Foundation.
Funding (NoSQL query classification survey): the Student Scheme provided by Universiti Kebangsaan Malaysia under code TAP-20558.
Funding (Poyang Lake biomass study): the Knowledge Innovation Project of CAS, No. KZCX1-Y-02 and No. KZCX2-310; the key project of the Ninth Five-Year Plan of CAS, No. KZ951-A1-102-01; and the National Ninth Five-Year Plan Project, No. 96-b02-01.