Funding: the National Natural Science Foundation of China (79990584).
Abstract: A new parallel expectation-maximization (EM) algorithm is proposed for large databases. The purpose of the algorithm is to accelerate the operation of the EM algorithm. As a well-known algorithm for estimation in generic statistical problems, the EM algorithm has been widely used in many domains, but it often requires significant computational resources, so more elaborate methods are needed to adapt it to databases with a large number of records or high dimensionality. The parallel EM algorithm is based on partial E-steps and retains the standard convergence guarantee of EM. The algorithm takes full advantage of parallel computation. Applied to large databases, it achieves a speedup of about 2.6 over the standard EM algorithm, and its running time decreases nearly linearly as the number of processors increases.
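As a hedged illustration of the partial E-step idea only (not the paper's implementation; the 1-D Gaussian-mixture setting and all names are assumptions), the sketch below computes per-partition responsibilities and sufficient statistics in worker processes and combines them in a global M-step:

```python
# Minimal sketch of a data-parallel EM step for a 1-D Gaussian mixture.
# The partition-level E-step and the aggregation scheme are assumptions,
# not the paper's exact algorithm.
import numpy as np
from multiprocessing import Pool

def partial_e_step(args):
    x, w, mu, sigma = args                        # one data partition + current parameters
    # responsibilities r[i, k] proportional to w_k * N(x_i | mu_k, sigma_k)
    r = w * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
    r /= r.sum(axis=1, keepdims=True)
    # partition-level sufficient statistics
    return r.sum(0), r.T @ x, r.T @ (x ** 2)

def parallel_em(x, k=2, n_iter=50, n_workers=4):
    rng = np.random.default_rng(0)
    w, mu, sigma = np.full(k, 1.0 / k), rng.choice(x, k), np.full(k, x.std())
    chunks = np.array_split(x, n_workers)
    with Pool(n_workers) as pool:
        for _ in range(n_iter):
            stats = pool.map(partial_e_step, [(c, w, mu, sigma) for c in chunks])
            n_k = sum(s[0] for s in stats)        # aggregate across partitions
            s1 = sum(s[1] for s in stats)
            s2 = sum(s[2] for s in stats)
            w, mu = n_k / len(x), s1 / n_k        # M-step on the global statistics
            sigma = np.sqrt(np.maximum(s2 / n_k - mu ** 2, 1e-12))
    return w, mu, sigma

if __name__ == "__main__":
    data = np.concatenate([np.random.normal(0, 1, 5000), np.random.normal(5, 2, 5000)])
    print(parallel_em(data))
```

Because each worker touches only its own partition, the per-iteration cost shrinks roughly in proportion to the number of processors, which is the behavior the abstract reports.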
Abstract: Influence maximization is the problem of identifying a set of the most influential nodes whose aggregated influence in the network is maximized. This research is of great application value for advertising, viral marketing, and public opinion monitoring. However, existing research on influence maximization usually ignores the tendency of nodes' behaviors and sentiment. In general, users' sentiment determines their behaviors, and users' behaviors reflect the influence between users in a social network. In this paper, we design a training model of sentimental words to expand the existing sentimental dictionary with a marked-comment data set, and propose an influence spread model that considers both the tendency of users' behaviors and their sentiment, named BSIS (Behavior and Sentiment Influence Spread), to depict and compute the influence between nodes. We also propose an algorithm for influence maximization named BS-G (BSIS with Greedy Algorithm) to select the initial seed nodes. In the experiments, we use two real social network data sets on a Hadoop and Spark distributed cluster platform, and the results show that the BSIS model and the BS-G algorithm on a big data platform achieve better influence spread and higher-quality seed node selection compared with approaches based on the traditional IC, LT, and CDNF models.
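For orientation, the sketch below shows the generic greedy seed-selection loop with Monte-Carlo spread estimation under the standard independent cascade (IC) model; BS-G follows this pattern but replaces the spread function with the BSIS behavior/sentiment model, which is not reproduced here. All names and parameter values are illustrative assumptions.

```python
# Hedged sketch: greedy seed selection with Monte-Carlo spread estimation
# under the plain IC model (a stand-in for the BSIS spread model).
import random

def ic_spread(graph, seeds, p=0.1, runs=200):
    """Average number of nodes activated from `seeds`; graph: {node: [neighbors]}."""
    total = 0
    for _ in range(runs):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            nxt = []
            for u in frontier:
                for v in graph.get(u, []):
                    if v not in active and random.random() < p:
                        active.add(v)
                        nxt.append(v)
            frontier = nxt
        total += len(active)
    return total / runs

def greedy_seeds(graph, k, spread=ic_spread):
    """Pick k seeds, each time adding the node with the largest marginal spread."""
    seeds = []
    for _ in range(k):
        best = max((n for n in graph if n not in seeds),
                   key=lambda n: spread(graph, seeds + [n]))
        seeds.append(best)
    return seeds
```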
Funding: Supported by the National Natural Science Foundation of China.
Abstract: A proximal iterative algorithm for the multivalued operator equation 0 ∈ T(x) is presented, where T is a maximal monotone operator. It is an improvement of the well-known proximal point algorithm. The convergence of the algorithm is discussed and an example is given.
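For context, the classical proximal point iteration that this work improves upon can be written as follows (standard form, not the paper's modified scheme):

```latex
% Classical proximal point iteration for 0 \in T(x), with T maximal monotone
% and step sizes c_k > 0 bounded away from zero.
x^{k+1} = (I + c_k T)^{-1}\bigl(x^{k}\bigr) = J_{c_k T}\bigl(x^{k}\bigr),
\qquad k = 0, 1, 2, \dots
```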
Funding: Supported in part by the National Natural Science Foundation of China (Grant Nos. 62325210 and 62272441).
Abstract: Submodular maximization is a significant area of interest in combinatorial optimization with various real-world applications. In recent years, streaming algorithms for submodular maximization have gained attention, allowing real-time processing of large data sets by examining each piece of data only once. However, most of the current state-of-the-art algorithms are only applicable to monotone submodular maximization, and there are still significant gaps in the approximation ratios between monotone and non-monotone objective functions. In this paper, we propose a streaming algorithm framework for non-monotone submodular maximization and use this framework to design deterministic streaming algorithms for the d-knapsack constraint and the knapsack constraint. Our 1-pass streaming algorithm for the d-knapsack constraint has a 1/(4(d+1)) − ε approximation ratio, using O(B log(B)/ε) memory and O(log(B)/ε) query time per element, where B = min(n, b) is the maximum number of elements that the knapsack can store. As a special case of the d-knapsack constraint, we obtain a 1-pass streaming algorithm with a 1/8 − ε approximation ratio for the knapsack constraint. To our knowledge, no previous streaming algorithm exists for this constraint when the objective function is non-monotone, even when d = 1. In addition, we propose a multi-pass streaming algorithm with a 1/6 − ε approximation ratio that stores O(B) elements.
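To make the single-pass setting concrete, here is a hedged, simplified density-threshold rule for knapsack-constrained selection; it only illustrates the "inspect each element once" pattern and does not reproduce the paper's algorithm or its approximation guarantees. The function and parameter names are assumptions.

```python
# Hedged sketch of a 1-pass, density-threshold streaming rule for submodular
# maximization under a single knapsack constraint.
def stream_knapsack(stream, f, cost, budget, tau):
    """stream: iterable of elements; f: list -> value (submodular set function);
    cost: element -> weight; tau: marginal-density threshold."""
    S, used, f_S = [], 0.0, 0.0
    for e in stream:
        c = cost(e)
        if used + c > budget:
            continue                      # element does not fit in the remaining budget
        gain = f(S + [e]) - f_S           # marginal gain of e with respect to S
        if gain >= tau * c:               # keep only elements with high gain per unit cost
            S.append(e)
            used += c
            f_S += gain
    return S
```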
Funding: Supported both by the Teaching and Research Award Fund for Outstanding Young Teachers in Higher Educational Institutions of MOE, China, and by the Dawn Program Fund in Shanghai.
Abstract: In order to find roots of maximal monotone operators, this paper introduces and studies the modified approximate proximal point algorithm with an error sequence $\{e^k\}$ such that $\|e^k\| \leq \eta_k \|x^k - \tilde{x}^k\|$, with $\sum_{k=0}^{\infty} (\eta_k - 1) < +\infty$ and $\inf_{k \geq 0} \eta_k = \mu \geq 1$. Here, the restrictions on $\{\eta_k\}$ are very different from the ones given by He et al. (Science in China Ser. A, 2002, 32(11): 1026–1032), namely $\sup_{k \geq 0} \eta_k = \nu < 1$. Moreover, the characteristic conditions for the convergence of the modified approximate proximal point algorithm are presented by virtue of a new technique, which is very different from the one used by He et al.
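The following block sketches the generic inexact setting behind such algorithms (a sketch of the setting, not the paper's exact update rule): $\tilde{x}^k$ approximately solves the resolvent subproblem, and $e^k$ measures the inexactness.

```latex
% Generic approximate proximal point step with the error criterion used above.
\tilde{x}^k \approx (I + c_k T)^{-1}(x^k), \qquad
\|e^k\| \le \eta_k \,\|x^k - \tilde{x}^k\|, \qquad
\sum_{k=0}^{\infty} (\eta_k - 1) < +\infty, \quad
\inf_{k \ge 0} \eta_k = \mu \ge 1 .
```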
Abstract: Solving the absent assignment problem with the shortest time limit in a weighted bipartite graph by the minimal weighted k-matching algorithm is unsuitable for situations in which large numbers of problems need to be addressed by large numbers of parties. This paper simplifies the algorithm for searching for the even alternating path that contains a maximal element, using the minimal weighted k-matching theorem and the intercept graph. A program for solving the maximal efficiency assignment problem was written. As a case study, the program was used to solve the assignment of water piping repairs among a large number of companies and broken pipes, and the validity of the program was verified.
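As a point of reference only, a small balanced assignment instance can be solved with SciPy's Hungarian-method routine; this minimizes the summed repair time, whereas the paper targets the shortest time limit and unmatched (absent) cases via the intercept-graph / minimal weighted k-matching procedure, which is not reproduced here. The repair times below are made-up illustrative numbers.

```python
# Baseline min-sum assignment for comparison; not the paper's method.
import numpy as np
from scipy.optimize import linear_sum_assignment

# time[i][j] = hours company i needs to repair broken pipe j (hypothetical data)
time = np.array([[4.0, 2.5, 7.0],
                 [3.0, 6.0, 5.5],
                 [8.0, 4.5, 3.0]])
rows, cols = linear_sum_assignment(time)   # minimizes the total repair time
for i, j in zip(rows, cols):
    print(f"company {i} -> pipe {j} ({time[i, j]} h)")
print("total time:", time[rows, cols].sum())
```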
Funding: This work was supported by the National Science Fund for Distinguished Young Scholars (62325104).
Abstract: The quality of a synthetic aperture radar (SAR) image degrades in the case of multiple imaging projection planes (IPPs) and multiple overlapping ship targets, which in turn degrades the performance of target classification and recognition. To address this issue, a method for extracting overlapping ship targets via the expectation-maximization (EM) algorithm is proposed. First, the scatterers of the ship targets are obtained with a target detection technique. Then, the EM algorithm is applied to extract the scatterers of a single ship target with a single IPP. Afterwards, a novel image amplitude estimation approach is proposed, with which the radar image of a single target with a single IPP can be generated. The proposed method accomplishes IPP selection and target separation in the image domain, which improves the image quality and preserves as much target information as possible. Results on simulated and real measured data demonstrate the effectiveness of the proposed method.
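A rough analogue of the scatterer-separation step is a two-component Gaussian mixture fitted by EM to detected scatterer coordinates, with each scatterer assigned to one ship. This is a hedged stand-in: the paper's EM formulation also handles IPP selection and amplitude estimation, and the synthetic coordinates below are invented for illustration.

```python
# EM-based separation of two overlapping point clouds via a Gaussian mixture.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
ship_a = rng.normal([0, 0], [5, 1.5], size=(300, 2))   # synthetic scatterers, target A
ship_b = rng.normal([6, 2], [5, 1.5], size=(300, 2))   # overlapping target B
scatterers = np.vstack([ship_a, ship_b])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
labels = gmm.fit_predict(scatterers)     # EM assigns each scatterer to one component
print("scatterers assigned to component 0:", int((labels == 0).sum()))
```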
Funding: This paper was supported by the National Natural Science Foundation of China (61562091), the Natural Science Foundation of Yunnan Province (2014FA023, 201501CF00022), the Program for Innovative Research Team in Yunnan University (XT412011), and the Program for Excellent Young Talents of Yunnan University (XT412003).
Abstract: Maximizing the spread of influence means selecting a set of seeds of a specified size that maximizes the spread of influence under a certain diffusion model in a social network. In the actual spread process, the activation probability of a node increases with its newly activated neighbors and also decreases with time. In this paper, we focus on the problem of selecting k seeds under a cascade model with diffusion decay to maximize the spread of influence in social networks. First, we extend the independent cascade model to incorporate a diffusion decay factor, calling the result the cascade model with diffusion decay (CMDD). Then, we discuss the objective function of maximizing the spread of influence under the CMDD, which is NP-hard, and further prove the monotonicity and submodularity of this objective function. Finally, we use the greedy algorithm to approximate the optimal result with a ratio of 1 − 1/e.
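The sketch below simulates a single cascade under a decayed independent-cascade rule: at step t an edge (u, v) succeeds with probability p·decay^t, so later activation attempts are weaker. The exact CMDD decay form and the parameter values are assumptions; a greedy loop like the one shown earlier (after the BSIS abstract) can plug this in as its spread estimator to obtain the 1 − 1/e guarantee for monotone submodular objectives.

```python
# Hedged single-cascade simulation with time-decayed activation probabilities.
import random

def cmdd_cascade(graph, seeds, p=0.2, decay=0.8, max_steps=20):
    """Return the number of nodes activated from `seeds`; graph: {node: [neighbors]}."""
    active, frontier, t = set(seeds), list(seeds), 0
    while frontier and t < max_steps:
        nxt = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in active and random.random() < p * decay ** t:
                    active.add(v)
                    nxt.append(v)
        frontier, t = nxt, t + 1
    return len(active)
```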
Abstract: Compositional data, such as relative information, is a crucial aspect of machine learning and other related fields. It is typically recorded as closed data, i.e., it sums to a constant such as 100%. The statistical linear model is the most widely used technique for identifying hidden relationships between underlying random variables of interest. However, data quality is a significant challenge in machine learning, especially when missing data is present. The linear regression model is a commonly used statistical modeling technique applied in various fields to find relationships between variables of interest. When estimating linear regression parameters, which are useful for tasks such as future prediction and partial-effects analysis of independent variables, maximum likelihood estimation (MLE) is the method of choice. However, many datasets contain missing observations, which can make data recovery costly and time-consuming. To address this issue, the expectation-maximization (EM) algorithm has been suggested as a solution for situations involving missing data. The EM algorithm iteratively finds maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models that depend on unobserved variables or data. Using the current estimate as input, the expectation (E) step constructs the expected log-likelihood function; finding the parameters that maximize this expected log-likelihood is the job of the maximization (M) step. This study examined how well the EM algorithm performed on a simulated compositional dataset with missing observations, using both robust least squares and ordinary least squares regression. The efficacy of the EM algorithm was compared with two alternative imputation techniques, k-nearest neighbor (k-NN) and mean imputation, in terms of Aitchison distances and covariance.
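The comparison set-up can be illustrated with a hedged stand-in: an EM-like iterative conditional imputation versus k-NN and mean imputation on data with values missing at random. This is not the paper's EM-with-regression procedure, it omits the compositional (log-ratio) transformation and the Aitchison-distance evaluation, and the simulated data and error metric are assumptions.

```python
# Hedged imputation comparison on synthetic correlated data with ~15% missingness.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0, 0],
                            [[1, .6, .3], [.6, 1, .4], [.3, .4, 1]], 500)
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.15] = np.nan          # values missing at random

for name, imp in [("EM-like", IterativeImputer(max_iter=20, random_state=0)),
                  ("k-NN", KNNImputer(n_neighbors=5)),
                  ("mean", SimpleImputer(strategy="mean"))]:
    err = np.mean((imp.fit_transform(X_miss) - X) ** 2)
    print(f"{name:8s} mean squared imputation error: {err:.4f}")
```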