Duplicate bug reporting is a critical problem in the mining software repositories area. Duplicate bug reports can lead to redundant effort, wasted resources, and delayed software releases; thus, their accurate identification is essential for streamlining the bug triage process. Several researchers have explored classical information retrieval, natural language processing, text and data mining, and machine learning approaches. The emergence of large language models (LLMs), e.g., ChatGPT and Hugging Face models, has presented a new line of models for semantic textual similarity (STS). Although LLMs have shown remarkable advancements, there remains a need for longitudinal studies to determine whether performance improvements are due to the scale of the models or to the distinct embeddings they produce compared with classical encoding models. This study systematically investigates this issue by comparing classical word embedding techniques against LLM-based embeddings for duplicate bug detection. We propose an amalgamation of models to detect duplicate bug reports using both textual and non-textual information about bug reports. The empirical evaluation was performed on open-source datasets using established metrics: mean reciprocal rank (MRR), mean average precision (MAP), and recall rate. The experimental results show that combined LLMs can outperform individual models for duplicate bug detection (recall-rate@k of 68%–74%). These findings highlight the effectiveness of amalgamating multiple techniques for improving duplicate bug report detection accuracy.
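To make the retrieval setup behind these metrics concrete, the sketch below ranks candidate reports by cosine similarity of precomputed embeddings and computes recall-rate@k and MRR. It is a minimal sketch under assumed inputs (embedding vectors and known duplicate masters supplied by the caller), not the authors' pipeline.

```python
import numpy as np

def rank_candidates(query_vec, candidate_vecs):
    """Return candidate indices sorted by descending cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    return np.argsort(-(c @ q))

def recall_at_k_and_mrr(queries, candidates, true_master, k=10):
    """queries: (n, d) embeddings of new reports; candidates: (m, d) embeddings
    of existing reports; true_master[i]: index of the known duplicate master."""
    hits, rr = 0, 0.0
    for i, q in enumerate(queries):
        order = rank_candidates(q, candidates)
        pos = int(np.where(order == true_master[i])[0][0])  # 0-based rank
        if pos < k:
            hits += 1
        rr += 1.0 / (pos + 1)
    return hits / len(queries), rr / len(queries)

# Toy usage with random vectors standing in for report embeddings.
rng = np.random.default_rng(0)
cand = rng.normal(size=(100, 64))
qry = cand[:5] + 0.01 * rng.normal(size=(5, 64))   # near-duplicates of the first 5
print(recall_at_k_and_mrr(qry, cand, true_master=[0, 1, 2, 3, 4], k=10))
```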
A duplicate identification model is presented to deal with semi-structured or unstructured data extracted from multiple data sources in the deep web. First, the extracted data is converted into entity records in the data preprocessing module; then, the heterogeneous records processing module calculates the similarity degree between entity records to obtain the duplicate records, based on the weights computed in the homogeneous records processing module. Unlike traditional methods, the proposed approach is implemented without schema matching in advance, and multiple estimators with selective algorithms are adopted to achieve better matching efficiency. The experimental results show that the duplicate identification model is feasible and efficient.
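The weighted record-similarity step can be pictured as below: a weighted sum of per-field similarities compared against a threshold. The field weights, the token-level Jaccard similarity, and the threshold are illustrative assumptions, not the paper's actual estimators.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two field values."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def record_similarity(r1: dict, r2: dict, weights: dict) -> float:
    """Weighted sum of per-field similarities over the shared fields."""
    total = sum(weights.values())
    return sum(w * jaccard(r1.get(f, ""), r2.get(f, "")) for f, w in weights.items()) / total

# Illustrative weights; a real system would estimate them per data source.
weights = {"title": 0.5, "author": 0.3, "year": 0.2}
r1 = {"title": "Deep Web Data Extraction", "author": "Li Wei", "year": "2009"}
r2 = {"title": "Deep Web Data Extraction Methods", "author": "Li Wei", "year": "2009"}
print(record_similarity(r1, r2, weights) > 0.8)   # treat as duplicates above a threshold
```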
The waveform audio (WAV) file is a widely used format for uncompressed audio. With the rapid development of digital media technology, one can easily insert duplicated segments with powerful audio editing software, e.g., inserting a segment of audio with a negative meaning into an existing audio file. Such duplicated segments can completely change the meaning of the audio file. So, for a WAV file to be used as evidence in legal proceedings or historical documents, it is very important to identify whether it contains any duplicated segments. This paper proposes a method to detect duplicated segments in a WAV file. Our method is based on the similarity calculation between two different segments: duplicated segments tend to have similar audio waveforms, i.e., a high similarity. We use a fast convolution algorithm to calculate the similarity, which makes our method quite efficient. We calculate the similarity between any two different segments in a digital audio file and use it to judge which segments are duplicated. Experimental results show the feasibility and efficiency of our method in detecting duplicated audio segments.
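A minimal sketch of the underlying idea, not the paper's exact algorithm: FFT-based cross-correlation (fast convolution) between a reference segment and the rest of the signal, flagging offsets where the normalized correlation is close to 1. The segment length and threshold are assumptions.

```python
import numpy as np

def find_duplicates_of(segment, signal, threshold=0.95):
    """Return sample offsets where `segment` recurs in `signal`,
    using FFT-based cross-correlation (fast convolution)."""
    n, m = len(signal), len(segment)
    seg = segment - segment.mean()
    sig = signal - signal.mean()
    size = 1 << (n + m - 1).bit_length()            # FFT length (power of two)
    corr = np.fft.irfft(np.fft.rfft(sig, size) * np.conj(np.fft.rfft(seg, size)), size)[: n - m + 1]
    # Normalize by local energy so identical waveforms score ~1.
    window_energy = np.sqrt(np.convolve(sig ** 2, np.ones(m), mode="valid") * np.sum(seg ** 2))
    score = corr / np.maximum(window_energy, 1e-12)
    return np.where(score > threshold)[0]

# Toy usage: a sine burst copied to a second location within a noisy signal.
rng = np.random.default_rng(1)
audio = rng.normal(scale=0.1, size=10_000)
burst = np.sin(2 * np.pi * 440 * np.arange(800) / 8000)
audio[1000:1800] += burst
audio[6000:6800] += burst
print(find_duplicates_of(audio[1000:1800], audio))   # expect offsets near 1000 and 6000
```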
A 37-year-old male presented with an acute abdomen suggestive of an appendiceal perforation. Urgent laparotomy showed a duplicated appendix, with one of the lumens involved with appendicitis and a focal periappendicular abscess, while the other lumen had a localized appendiceal cancer. Recognition of congenital intestinal duplications in adults is important to avoid serious clinical consequences.
The reliability evaluation of a multistate network is primarily based on d-minimal paths/cuts (d-MPs/d-MCs). However, being a nondeterministic polynomial-time hard (NP-hard) problem, searching for all d-MPs is a rather challenging task. In existing implicit enumeration algorithms based on minimal paths (MPs), duplicate d-MP candidates may be generated, and an extra step is needed to locate and remove these duplicates, which costs significant computational effort. This paper proposes an efficient method to prevent the generation of duplicate d-MP candidates in implicit enumeration algorithms for d-MPs. First, the mechanism by which duplicate d-MP candidates arise in implicit enumeration algorithms is discussed. Second, a direct and efficient duplicate-avoidance method is proposed. Third, an improved algorithm is developed, followed by complexity analysis and illustrative examples. Computational experiments comparing the method with two existing algorithms show that it can significantly improve the efficiency of generating d-MPs for a particular demand level d.
CRISPR (clustered regularly interspaced short palindromic repeats)-Cas9-based genome editing has revolutionized functional genomics in many biological research fields. The specificity and potency of CRISPR-Cas9 genome editing make it ideal for investigating the function of genes in vivo (Hsu et al., 2014). Gene duplication is a key driver of evolutionary novelty (Taylor and Raes, 2004). However, duplicated genes with near-identical sequences and functional redundancy have posed challenges for genetic analysis (Woollard, 2005). The functions of duplicated genes can be assessed by simultaneous knockdown using homology-based methods such as RNA interference (RNAi) (Tischler et al., 2006). Generation of double or triple mutants is an alternative way to assess the functional redundancy of duplicated genes; however, generating such compound mutants by forward or reverse genetic methods is time consuming.
On-site programming big data refers to the massive data generated in the process of software development, characterized by real-time generation, complexity, and high processing difficulty. Therefore, data cleaning is essential for on-site programming big data. Duplicate data detection is an important step in data cleaning, which can save storage resources and enhance data consistency. To address the insufficiency of the traditional Sorted Neighborhood Method (SNM) and the difficulty of detecting duplicates in high-dimensional data, an optimized algorithm based on random forests with a dynamic, adaptive window size is proposed. The efficiency of the algorithm is improved by refining the key-selection method, reducing the dimensionality of the data set, and using an adaptive variable-size sliding window. Experimental results show that the improved SNM algorithm exhibits better performance and achieves higher accuracy.
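For orientation, the basic fixed-window Sorted Neighborhood Method that such work builds on can be sketched as follows; the sort key, the pair-similarity test, and the window size are placeholder assumptions, and the adaptive-window, random-forest variant described in the abstract is not reproduced here.

```python
from difflib import SequenceMatcher

def sorted_neighborhood(records, key, window=5, threshold=0.9):
    """Basic SNM: sort records by a blocking key, then compare each record
    only with its neighbors inside a fixed-size sliding window."""
    ordered = sorted(records, key=key)
    duplicates = []
    for i, rec in enumerate(ordered):
        for j in range(i + 1, min(i + window, len(ordered))):
            other = ordered[j]
            if SequenceMatcher(None, key(rec), key(other)).ratio() >= threshold:
                duplicates.append((rec, other))
    return duplicates

# Toy usage: deduplicate by a name-based key.
people = [{"name": "Jon Smith"}, {"name": "John Smith"}, {"name": "Ann Lee"}, {"name": "Anne Lee"}]
print(sorted_neighborhood(people, key=lambda r: r["name"].lower(), window=3, threshold=0.85))
```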
Duplicate publication can introduce significant bias into a meta-analysis if studies are inadvertently included more than once. Many studies are published in more than one journal to maximize the readership and impact of the study findings. Inclusion of multiple publications of the same study within a meta-analysis affords inappropriate weight to the duplicated data if reports of the same study are not linked together. As studies with positive findings are more likely to be published in multiple journals, this leads to a potential overestimate of the benefits of an intervention. Recent advances in immunosuppression strategies following liver transplantation have led to many studies investigating immunosuppressive regimes, including immunosuppression monotherapy. In this letter we focus on a recently published meta-analysis by Lan et al investigating studies assessing immunosuppression monotherapy for liver transplantation. The authors claim to have identified fourteen separate randomised studies investigating immunosuppression monotherapy, yet seven of the references appear to relate to only three studies which have been subject to duplicate publication. Several similarities can be identified in each of the duplicate publications, including similar authorship, identical immunosuppression regimes, identical dates of enrolment, and citation of the original publication in the subsequent manuscripts. We discuss the evidence of the inclusion of duplicate publications in the meta-analysis.
Discovery of service nodes in flows is a challenging task, especially in large ISPs or campus networks where the amount of traffic across the network is massive. We propose an effective data structure called Round-robin Buddy Bloom Filters (RBBF) to detect duplicate elements in flows. A two-stage approximate algorithm based on RBBF, which can be used for detecting service nodes from NetFlow data, is also given, and the performance of the algorithm is analyzed. In our case, the proposed algorithm uses about 1% of the memory of a hash table with a false positive error rate of less than 5%. A prototype system, compatible with both IPv4 and IPv6, using the proposed data structure and algorithm is introduced. Some real-world case studies based on the prototype system are discussed.
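To make the memory/false-positive trade-off concrete, here is a minimal plain Bloom filter for duplicate detection over flow keys; it is not the RBBF structure itself (no round-robin aging or buddy filters), and the sizing constants are illustrative.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: constant memory, no false negatives,
    tunable false-positive rate via bit-array size and hash count."""
    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

# Duplicate detection over flow keys (src > dst : port) serialized as strings.
bf = BloomFilter()
seen_again = []
for flow in ["10.0.0.1>10.0.0.2:80", "10.0.0.3>10.0.0.2:80", "10.0.0.1>10.0.0.2:80"]:
    if flow in bf:
        seen_again.append(flow)   # probable duplicate (false positives possible)
    else:
        bf.add(flow)
print(seen_again)
```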
E-systems based on EHRs (electronic health records) have been in use for many years, and their amplified realizations have been felt recently; they have pioneered the collection of massive volumes of health data. Duplicate detection involves discovering records that refer to the same practical components, a task that generally depends on several input parameters supplied by experts. Record linkage is the problem of finding identical records across various data sources. The similarity between two records is characterized by domain-based similarity functions over different features. De-duplication of one dataset, or the linkage of multiple data sets, has become a highly significant operation in the data processing stages of different data mining programmes. The objective is to match all the records associated with the same entity. Various measures have been used to represent the quality and complexity of data linkage algorithms, and many other novel metrics have been introduced. An outline of the problems in measuring data linkage and de-duplication quality and complexity is presented. This article focuses on the reprocessing of health data that is horizontally divided among data custodians, with the custodians providing similar features for sets of patients. The first step in this technique is the automatic selection of high-quality training examples from the compared record pairs, and the second step involves training the reciprocal neuro-fuzzy inference system (RANFIS) classifier. Using the Optimal Threshold classifier, it is presumed that information about the original match status is available for all compared record pairs (i.e., Ant Lion Optimization), and therefore an optimal threshold can be computed based on the respective RANFIS. The Febrl, Clinical Decision (CD), and Cork Open Research Archive (CORA) data repositories are used to analyze the proposed method against benchmarks from current techniques.
In this issue of the Journal of Geriatric Cardiology, Jing et al. showed off their near-perfect results of percutaneous coronary interventions (PCI) through the transfemoral approach (TFA) and the transradial approach (TRA) in elderly Chinese patients. All patients were older than 60 years of age, with an average of 67.…
Semantic duplicates in databases represent an important data quality challenge today, one which leads to bad decisions. In large databases, we sometimes find ourselves with tens of thousands of duplicates, which necessitates automatic deduplication. For this, it is necessary to detect duplicates with a method that is reliable enough to find as many duplicates as possible and efficient enough to run in a reasonable time. This paper proposes, and compares on real data, effective duplicate detection methods for automatic deduplication of files based on names, working with French or English texts and the names of people or places, in Africa or in the West. After conducting a more complete classification of semantic duplicates than the usual classifications, we introduce several methods for detecting duplicates whose observed average complexity is less than O(2n). Through a simple model, we highlight a global efficacy rate combining precision and recall. We propose a new metric distance between records, as well as rules for automatic duplicate detection. Analyses made on a database containing real data for an administration in Central Africa, and on a known standard database containing names of restaurants in the USA, have shown better results than those of known methods, with lower complexity.
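As a concrete (and much simplified) instance of this kind of rule, the sketch below flags two name strings as duplicates when their normalized edit distance falls below a threshold, and reports precision and recall against hand-labelled pairs. The distance, the threshold, and the toy labels are assumptions for illustration; they are not the paper's metric or rules.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def is_duplicate(a: str, b: str, threshold: float = 0.25) -> bool:
    """Normalized edit distance below `threshold` => treat as duplicates."""
    d = edit_distance(a.lower(), b.lower())
    return d / max(len(a), len(b), 1) < threshold

def precision_recall(pairs, labels):
    """labels[i] is True when pairs[i] is a real duplicate."""
    predicted = [is_duplicate(a, b) for a, b in pairs]
    tp = sum(p and l for p, l in zip(predicted, labels))
    fp = sum(p and not l for p, l in zip(predicted, labels))
    fn = sum((not p) and l for p, l in zip(predicted, labels))
    return tp / ((tp + fp) or 1), tp / ((tp + fn) or 1)

pairs = [("N'Guessan Koffi", "Nguessan Kofi"), ("Yaoundé", "Douala")]
print(precision_recall(pairs, labels=[True, False]))
```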
The number of systematic reviews is gradually increasing over time, and the methods for performing a systematic review are being improved. However, little attention has been paid to the issue of how to find duplicates in systematic reviews. On the basis of the survey and systematic reviews by our team and others, we review the prevalence, significance, and classification of duplicates and the methods to find duplicates in a systematic review. Notably, although a preliminary method for finding duplicates has been established, its usefulness and convenience need to be further confirmed.
LaparoEndoscopic Single-site (LESS) renal surgery, emerging as a potential alternative to conventional laparoscopy, is technically challenging, and a major vascular anomaly may increase the risk of intraoperative haemorrhage. Herein, we present a case of right transumbilical LESS radical nephrectomy that was successfully performed in the presence of a double inferior vena cava and duplicated the standard laparoscopic techniques. Most importantly, we aim to bring such aberrant vascular anatomy to the attention of laparoscopic, and especially LESS, surgeons with high-resolution pictorial illustrations.
Duplicated inferior vena cava with bilateral iliac vein compression is extremely rare. We report the case of an 87-year-old man who presented with bilateral lower extremity swelling and was noted to have a duplicated inferior vena cava, as revealed by computed tomography angiography (CTA). This also revealed bilateral iliac vein compression caused by surrounding structures. Anticoagulant treatment combined with stent implantation completely alleviated this chronic debilitating condition during the 2-month follow-up, with no recurrence.
The duplicate form of the generalized Gould-Hsu inversions has been obtained by Shi and Zhang. In this paper, we present a simple proof of this duplicate form. With the same method, we construct the duplicate form of the generalized Carlitz inversions. Using this duplicate form, we obtain several terminating basic hypergeometric identities and some limiting cases.
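For readers unfamiliar with the underlying tool, the classical (non-duplicate) Gould-Hsu inverse pair is usually stated as below. This is background recalled from the standard literature, not the duplicate form constructed in the paper, and the exact statement should be checked against the original sources.

```latex
% Classical Gould--Hsu inversion (standard form; background only).
% Let $\{a_i\}_{i\ge 0}$ and $\{b_i\}_{i\ge 0}$ be sequences such that
% $\phi(x;0)=1$ and $\phi(x;n)=\prod_{i=0}^{n-1}(a_i+xb_i)\neq 0$
% for all nonnegative integers $x$ and $n$. Then
\[
  f(n)=\sum_{k=0}^{n}(-1)^{k}\binom{n}{k}\,\phi(k;n)\,g(k)
  \quad\Longleftrightarrow\quad
  g(n)=\sum_{k=0}^{n}(-1)^{k}\binom{n}{k}\,
        \frac{a_{k}+k\,b_{k}}{\phi(n;k+1)}\,f(k).
\]
```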
Transmission Control Protocol (TCP) performance over MANET is an area of extensive research. Congestion control mechanisms are major components of TCP which affect its performance, and improving these mechanisms is a big challenge, especially over wireless environments. Additive Increase Multiplicative Decrease (AIMD) mechanisms control the amount of increase and decrease of the transmission rate in response to changes in the level of contention for router buffer space and link bandwidth. The role of an AIMD mechanism in transmitting the proper amount of data is not easy, especially over MANET, because MANET has a very dynamic topology and high bit error rate wireless links that cause packet loss. Such loss can be misinterpreted as severe congestion by the transmitting TCP node, leading to an unnecessarily sharp reduction in the transmission rate which can degrade TCP throughput. This paper introduces a new AIMD algorithm that takes the number of duplicate ACKs already received when a timeout takes place into account in deciding the amount of multiplicative decrease. Specifically, it decides the point from which the Slow-start mechanism should begin its recovery of the congestion window size. The new AIMD algorithm has been developed as a new TCP variant which we call TCP Karak. The aim of TCP Karak is to be more adaptive to mobile wireless network conditions by being able to distinguish between loss due to severe congestion and loss due to link breakages or bit errors. Several simulated experiments have been conducted to evaluate TCP Karak and compare its performance with TCP NewReno. Results show that TCP Karak achieves higher throughput and goodput than TCP NewReno under various mobility speeds, traffic loads, and bit error rates.
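To illustrate the general idea (this is not the actual TCP Karak algorithm, whose exact rules are in the paper), the sketch below shows an AIMD-style window update in which the multiplicative decrease after a timeout is softened according to how many duplicate ACKs had already been received, on the assumption that duplicate ACKs indicate the path is still delivering packets.

```python
def aimd_update(cwnd: float, ssthresh: float, event: str, dup_acks: int = 0,
                mss: float = 1.0):
    """Toy AIMD update. `event` is 'ack', 'dupack_loss' (3 dup ACKs), or 'timeout'.
    On timeout, the more duplicate ACKs already seen, the less drastic the cut,
    treating the loss as more likely due to link errors than severe congestion."""
    if event == "ack":                       # additive increase / slow start
        cwnd += mss if cwnd < ssthresh else mss / cwnd
    elif event == "dupack_loss":             # fast-retransmit style halving
        ssthresh = max(cwnd / 2, 2 * mss)
        cwnd = ssthresh
    elif event == "timeout":
        ssthresh = max(cwnd / 2, 2 * mss)
        # Illustrative rule: scale the restart point by evidence of delivery.
        cwnd = mss + min(dup_acks, 3) / 3 * (ssthresh - mss)   # not always back to 1 MSS
    return cwnd, ssthresh

cwnd, ssthresh = 16.0, 8.0
print(aimd_update(cwnd, ssthresh, "timeout", dup_acks=0))  # severe: back to ~1 MSS
print(aimd_update(cwnd, ssthresh, "timeout", dup_acks=3))  # likely link loss: milder cut
```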