Background:De novo genome assembly relies on two kinds of graphs:de Bruijn graphs and overlap graphs.Overlap graphs are the basis for the Celera assembler,while de Bruijn graphs have become the dominant technical devi...Background:De novo genome assembly relies on two kinds of graphs:de Bruijn graphs and overlap graphs.Overlap graphs are the basis for the Celera assembler,while de Bruijn graphs have become the dominant technical device in the last decade.Those two kinds of graphs are collectively called assembly graphs.Results:In this review,we discuss the most recent advances in the problem of constructing,representing and navigating assembly graphs,focusing on very large datasets.We will also explore some computational techniques,such as the Bloom filter,to compactly store graphs while keeping all functionalities intact.Conclusions:We complete our analysis with a discussion on the algorithmic issues of assembling from long reads(eg.,PacBio and Oxford Nanopore).Finally,we present some of the most relevant open problems in this field.展开更多
The undirected de Bruijn graph is often used as the model of communication network for its useful properties, such as short diameter, small maximum vertex degree. In this paper, we consider the alphabet overlap graph ...The undirected de Bruijn graph is often used as the model of communication network for its useful properties, such as short diameter, small maximum vertex degree. In this paper, we consider the alphabet overlap graph G(k, d, s): the vertex set V = {v|v = (v1...vk); vi ∈ {1,2,... ,d}, i = 1,2,... ,k}; they are distinct and two vertices u = (ul...uk) and v = (vl... vk) are adjacent if and only if us+i = vi or vs+i = ui (i = 1,2,...,k - s). In particular, when s = 1, G(k,d,s) is just an undirected de Bruijn graph. First, we give a formula to calculate the vertex degree of G(k, d, s). Then, we use the corollary of Menger's theorem to prove that the connectivity of G(k, d, s) is 2d^s - 2d^2s-k for s ≥ k/2.展开更多
In this paper, we show that for a locally LEW-embedded 3-connected graph G in orientable surface, the following results hold: 1) Each of such embeddings is minimum genus embedding; 2) The facial cycles are precisel...In this paper, we show that for a locally LEW-embedded 3-connected graph G in orientable surface, the following results hold: 1) Each of such embeddings is minimum genus embedding; 2) The facial cycles are precisely the induced nonseparating cycles which implies the uniqueness of such embeddings; 3) Every overlap graph O(G, C) is a bipartite graph and G has only one C-bridge H such that C U H is nonplanar provided C is a contractible cycle shorter than every noncontractible cycle containing an edge of C. This extends the results of C Thomassen's work on LEW-embedded graphs.展开更多
在复杂网络分析中,挖掘社区结构是一个重要且具有挑战性的研究方向。现有的基于深度学习方法在图相关任务中取得了不错的效果,但鲜有处理社区检测任务,尤其是重叠社区检测,并且也未能充分挖掘和利用网络的拓扑结构信息。为此,提出了一...在复杂网络分析中,挖掘社区结构是一个重要且具有挑战性的研究方向。现有的基于深度学习方法在图相关任务中取得了不错的效果,但鲜有处理社区检测任务,尤其是重叠社区检测,并且也未能充分挖掘和利用网络的拓扑结构信息。为此,提出了一种图正则化模糊自动编码器的重叠社区检测方法(Overlapping Community Detection with Graph Regularized Fuzzy AutoEncoder,FAE)。首先,运用自动编码器将网络拓扑编码为低维表示,进一步通过模糊C均值聚类形成模糊隶属度矩阵,随后解码模糊隶属度矩阵以重构网络拓扑。然后,将用于刻画网络中结构信息的图正则融入上述自动编码器。再者,融合后的自动编码器构成堆叠自动编码器,以获取深度模糊隶属度矩阵。最后,基于模糊集理论,使用深度模糊隶属度矩阵划分重叠社区。在3组人工网络和6个真实网络上的实验结果表明,该方法基于重叠标准互信息熵(ONMI)、杰卡德指数(Jaccard)、F1分数(F1-Score)的评估性能优于7种经典算法的大部分算法,展示了其在处理复杂网络重叠社区检测问题上的潜力。展开更多
Superposition of signals in DNA molecule is a sufficiently general principle of information coding. The necessary re-quirement for such superposition is the degeneracy of the code, which allows placing different messa...Superposition of signals in DNA molecule is a sufficiently general principle of information coding. The necessary re-quirement for such superposition is the degeneracy of the code, which allows placing different messages on the same DNA fragment. Code words that are equivalent in the informational sense (i.e., synonyms) form synonymous group and the entire set of code words is partitioned into synonymous groups. This paper is dedicated to constructing and analyzing the model of synonymous coding. We evaluate some characteristics of synonymous coding as applied to code words of length two although many definitions may be extended for words of arbitrary length.展开更多
提出一种新的面向复杂网络大数据的重叠社区检测算法DOC(detecting overlapping communities over complex network big data),时间复杂度为O(nlog2(n)),算法基于模块度聚类和图计算思想,应用新的节点和边的更新方法,利用平衡二叉树对...提出一种新的面向复杂网络大数据的重叠社区检测算法DOC(detecting overlapping communities over complex network big data),时间复杂度为O(nlog2(n)),算法基于模块度聚类和图计算思想,应用新的节点和边的更新方法,利用平衡二叉树对模块度增量建立索引,基于模块度最优的思想设计一种新的重叠社区检测算法.相对于传统的重叠节点检测算法,对每个节点分析的频率大为降低,可以在较低的算法运行时间下获得较高的识别准确率.复杂网络大数据集上的算法测试结果表明:DOC算法能够有效地检测出网络重叠社区,社区识别准确率较高,在大规模LFR基准数据集上其重叠社区检测标准化互信息指标NMI最高能达到0.97,重叠节点检测指标F-score的平均值在0.91以上,且复杂网络大数据下的运行时间明显优于传统算法.展开更多
现有的中文事件抽取方法存在触发词和论元依赖建模不足的问题,削弱事件内的信息交互,导致论元抽取性能低下,特别是论元角色存在重叠的情况下.对此,文中提出基于图注意力和表指针网络的中文事件抽取方法(Chinese Event Extraction Method...现有的中文事件抽取方法存在触发词和论元依赖建模不足的问题,削弱事件内的信息交互,导致论元抽取性能低下,特别是论元角色存在重叠的情况下.对此,文中提出基于图注意力和表指针网络的中文事件抽取方法(Chinese Event Extraction Method Based on Graph Attention and Table Pointer Network,ATCEE).首先,融合预训练字符向量和词性标注向量作为特征输入,并利用双向长短期记忆网络,得到事件文本的强化语义特征.再将字符级建模的依存句法图引入图注意力网络,捕获文本中各组成成分的长距离依赖关系.然后,使用表填充的方法进行特征融合,进一步增强触发词和其对应的所有论元之间的依赖性.最后,将学习得到的表特征输入全连接层和表指针网络层,进行触发词和论元的联合抽取,使用表指针网络对论元边界进行解码,更好地识别长论元实体.实验表明:ATCEE在ACE2005和DuEE1.0这两个中文基准数据集上都有明显的性能提升,并且字符级依存特征和表填充策略在一定程度上可以解决论元角色重叠问题.ATCEE源代码地址如下:https://github.com/event6/ATCEE.展开更多
目前,针对复杂网络的社区发现算法大多仅根据网络的拓扑结构来确定社区,然而现实复杂网络中的边可能带有表示连接紧密程度或者可信度意义的权重,这些先验信息对社区发现的准确性至关重要.针对该问题,提出了基于加权稠密子图的重叠聚类算...目前,针对复杂网络的社区发现算法大多仅根据网络的拓扑结构来确定社区,然而现实复杂网络中的边可能带有表示连接紧密程度或者可信度意义的权重,这些先验信息对社区发现的准确性至关重要.针对该问题,提出了基于加权稠密子图的重叠聚类算法(overlap community detection on weighted networks,简称OCDW).首先,综合考虑网络拓扑结构及真实网络中边权重的影响,给出了一种网络中边的权重定义方法;进而给出种子节点选取方式和权重更新策略;最终得到聚类结果.OCDW算法在无权网络和加权网络都适用.通过与一些经典的社区发现算法在9个真实网络数据集上进行分析比较,结果表明算法OCDW在F度量、准确度、分离度、标准互信息、调整兰德系数、模块性及运行时间等方面均表现出较好的性能.展开更多
当前社区发现算法主要是针对无向图研究社区结构,但在实际复杂网络中,链接关系时常表现出非对称性或方向性,比如Twitter的用户关注关系,文献网络的引用关系,网页之间的超链接关系等应用网络。因此,本文依据信息在复杂网络中的传播规律...当前社区发现算法主要是针对无向图研究社区结构,但在实际复杂网络中,链接关系时常表现出非对称性或方向性,比如Twitter的用户关注关系,文献网络的引用关系,网页之间的超链接关系等应用网络。因此,本文依据信息在复杂网络中的传播规律和流动方向性,提出了k-Path共社区邻近相似性概念及计算方法,用于衡量结点在同一社区的相似性程度,并给出了把有向图转换为带方向权值的无向图的方法。基于带权无向图提出了一种从局部扩展来探测社区的重叠社区发现算法(Local and wave-like extension algorithm of detecting overlapping community,LWS-OCD)。在真实数据集上的实验表明,共社区邻近相似性概念实现了有向到无向的合理转换,而且提高了社区结点的聚集效果,LWSOCD算法能够有效地发现带权无向图中的重叠社区。展开更多
文摘Background:De novo genome assembly relies on two kinds of graphs:de Bruijn graphs and overlap graphs.Overlap graphs are the basis for the Celera assembler,while de Bruijn graphs have become the dominant technical device in the last decade.Those two kinds of graphs are collectively called assembly graphs.Results:In this review,we discuss the most recent advances in the problem of constructing,representing and navigating assembly graphs,focusing on very large datasets.We will also explore some computational techniques,such as the Bloom filter,to compactly store graphs while keeping all functionalities intact.Conclusions:We complete our analysis with a discussion on the algorithmic issues of assembling from long reads(eg.,PacBio and Oxford Nanopore).Finally,we present some of the most relevant open problems in this field.
文摘The undirected de Bruijn graph is often used as the model of communication network for its useful properties, such as short diameter, small maximum vertex degree. In this paper, we consider the alphabet overlap graph G(k, d, s): the vertex set V = {v|v = (v1...vk); vi ∈ {1,2,... ,d}, i = 1,2,... ,k}; they are distinct and two vertices u = (ul...uk) and v = (vl... vk) are adjacent if and only if us+i = vi or vs+i = ui (i = 1,2,...,k - s). In particular, when s = 1, G(k,d,s) is just an undirected de Bruijn graph. First, we give a formula to calculate the vertex degree of G(k, d, s). Then, we use the corollary of Menger's theorem to prove that the connectivity of G(k, d, s) is 2d^s - 2d^2s-k for s ≥ k/2.
基金Supported by NNSF of China(10271048,10671073)Supported by Science and Technology Commission of Shanghai Municipality(07XD14011)Supported by Shanghai Leading Academic Discipline Project(B407)
文摘In this paper, we show that for a locally LEW-embedded 3-connected graph G in orientable surface, the following results hold: 1) Each of such embeddings is minimum genus embedding; 2) The facial cycles are precisely the induced nonseparating cycles which implies the uniqueness of such embeddings; 3) Every overlap graph O(G, C) is a bipartite graph and G has only one C-bridge H such that C U H is nonplanar provided C is a contractible cycle shorter than every noncontractible cycle containing an edge of C. This extends the results of C Thomassen's work on LEW-embedded graphs.
文摘在复杂网络分析中,挖掘社区结构是一个重要且具有挑战性的研究方向。现有的基于深度学习方法在图相关任务中取得了不错的效果,但鲜有处理社区检测任务,尤其是重叠社区检测,并且也未能充分挖掘和利用网络的拓扑结构信息。为此,提出了一种图正则化模糊自动编码器的重叠社区检测方法(Overlapping Community Detection with Graph Regularized Fuzzy AutoEncoder,FAE)。首先,运用自动编码器将网络拓扑编码为低维表示,进一步通过模糊C均值聚类形成模糊隶属度矩阵,随后解码模糊隶属度矩阵以重构网络拓扑。然后,将用于刻画网络中结构信息的图正则融入上述自动编码器。再者,融合后的自动编码器构成堆叠自动编码器,以获取深度模糊隶属度矩阵。最后,基于模糊集理论,使用深度模糊隶属度矩阵划分重叠社区。在3组人工网络和6个真实网络上的实验结果表明,该方法基于重叠标准互信息熵(ONMI)、杰卡德指数(Jaccard)、F1分数(F1-Score)的评估性能优于7种经典算法的大部分算法,展示了其在处理复杂网络重叠社区检测问题上的潜力。
文摘Superposition of signals in DNA molecule is a sufficiently general principle of information coding. The necessary re-quirement for such superposition is the degeneracy of the code, which allows placing different messages on the same DNA fragment. Code words that are equivalent in the informational sense (i.e., synonyms) form synonymous group and the entire set of code words is partitioned into synonymous groups. This paper is dedicated to constructing and analyzing the model of synonymous coding. We evaluate some characteristics of synonymous coding as applied to code words of length two although many definitions may be extended for words of arbitrary length.
文摘提出一种新的面向复杂网络大数据的重叠社区检测算法DOC(detecting overlapping communities over complex network big data),时间复杂度为O(nlog2(n)),算法基于模块度聚类和图计算思想,应用新的节点和边的更新方法,利用平衡二叉树对模块度增量建立索引,基于模块度最优的思想设计一种新的重叠社区检测算法.相对于传统的重叠节点检测算法,对每个节点分析的频率大为降低,可以在较低的算法运行时间下获得较高的识别准确率.复杂网络大数据集上的算法测试结果表明:DOC算法能够有效地检测出网络重叠社区,社区识别准确率较高,在大规模LFR基准数据集上其重叠社区检测标准化互信息指标NMI最高能达到0.97,重叠节点检测指标F-score的平均值在0.91以上,且复杂网络大数据下的运行时间明显优于传统算法.
文摘现有的中文事件抽取方法存在触发词和论元依赖建模不足的问题,削弱事件内的信息交互,导致论元抽取性能低下,特别是论元角色存在重叠的情况下.对此,文中提出基于图注意力和表指针网络的中文事件抽取方法(Chinese Event Extraction Method Based on Graph Attention and Table Pointer Network,ATCEE).首先,融合预训练字符向量和词性标注向量作为特征输入,并利用双向长短期记忆网络,得到事件文本的强化语义特征.再将字符级建模的依存句法图引入图注意力网络,捕获文本中各组成成分的长距离依赖关系.然后,使用表填充的方法进行特征融合,进一步增强触发词和其对应的所有论元之间的依赖性.最后,将学习得到的表特征输入全连接层和表指针网络层,进行触发词和论元的联合抽取,使用表指针网络对论元边界进行解码,更好地识别长论元实体.实验表明:ATCEE在ACE2005和DuEE1.0这两个中文基准数据集上都有明显的性能提升,并且字符级依存特征和表填充策略在一定程度上可以解决论元角色重叠问题.ATCEE源代码地址如下:https://github.com/event6/ATCEE.
文摘目前,针对复杂网络的社区发现算法大多仅根据网络的拓扑结构来确定社区,然而现实复杂网络中的边可能带有表示连接紧密程度或者可信度意义的权重,这些先验信息对社区发现的准确性至关重要.针对该问题,提出了基于加权稠密子图的重叠聚类算法(overlap community detection on weighted networks,简称OCDW).首先,综合考虑网络拓扑结构及真实网络中边权重的影响,给出了一种网络中边的权重定义方法;进而给出种子节点选取方式和权重更新策略;最终得到聚类结果.OCDW算法在无权网络和加权网络都适用.通过与一些经典的社区发现算法在9个真实网络数据集上进行分析比较,结果表明算法OCDW在F度量、准确度、分离度、标准互信息、调整兰德系数、模块性及运行时间等方面均表现出较好的性能.
文摘当前社区发现算法主要是针对无向图研究社区结构,但在实际复杂网络中,链接关系时常表现出非对称性或方向性,比如Twitter的用户关注关系,文献网络的引用关系,网页之间的超链接关系等应用网络。因此,本文依据信息在复杂网络中的传播规律和流动方向性,提出了k-Path共社区邻近相似性概念及计算方法,用于衡量结点在同一社区的相似性程度,并给出了把有向图转换为带方向权值的无向图的方法。基于带权无向图提出了一种从局部扩展来探测社区的重叠社区发现算法(Local and wave-like extension algorithm of detecting overlapping community,LWS-OCD)。在真实数据集上的实验表明,共社区邻近相似性概念实现了有向到无向的合理转换,而且提高了社区结点的聚集效果,LWSOCD算法能够有效地发现带权无向图中的重叠社区。