Suffix tree

From Wikipedia, the free encyclopedia
Suffix tree for the text BANANA. Each substring is terminated with special character $. The six paths from the root to the leaves (shown as boxes) correspond to the six suffixes A$, NA$, ANA$, NANA$, ANANA$ and BANANA$. The numbers in the leaves give the start position of the corresponding suffix. Suffix links, drawn dashed, are used during construction.

In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations.

The construction of such a tree for the string S takes time and space linear in the length of S. Once constructed, several operations can be performed quickly, such as locating a substring in S, locating a substring if a certain number of mistakes are allowed, and locating matches for a regular expression pattern. Suffix trees also provided one of the first linear-time solutions for the longest common substring problem.[2] These speedups come at a cost: storing a string's suffix tree typically requires significantly more space than storing the string itself.

History

The concept was first introduced by Weiner (1973). Rather than the suffix S[i..n], Weiner stored in his trie[3] the prefix identifier for each position, that is, the shortest string starting at i and occurring only once in S. His Algorithm D takes an uncompressed[4] trie for S[k+1..n] and extends it into a trie for S[k..n]. This way, starting from the trivial trie for S[n..n], a trie for S[1..n] can be built by n − 1 successive calls to Algorithm D; however, the overall run time is O(n^2). Weiner's Algorithm B maintains several auxiliary data structures to achieve an overall run time linear in the size of the constructed trie. The latter can still have O(n^2) nodes, e.g. for S = a^n b^n a^n b^n $. Weiner's Algorithm C finally uses compressed tries to achieve linear overall storage size and run time.[5] Donald Knuth subsequently characterized the latter as "Algorithm of the Year 1973" according to his student Vaughan Pratt.[6] The textbook Aho, Hopcroft & Ullman (1974, Sect.9.5) reproduced Weiner's results in a simplified and more elegant form, introducing the term position tree.

McCreight (1976) was the first to build a (compressed) trie of all suffixes of S. Although the suffix starting at i is usually longer than the prefix identifier, their path representations in a compressed trie do not differ in size. On the other hand, McCreight could dispense with most of Weiner's auxiliary data structures; only suffix links remained.

Ukkonen (1995) further simplified the construction.[6] He provided the first online construction of suffix trees, now known as Ukkonen's algorithm, with running time that matched the then fastest algorithms. These algorithms are all linear-time for a constant-size alphabet, and have worst-case running time of O(n log n) in general.

Farach (1997) gave the first suffix tree construction algorithm that is optimal for all alphabets. In particular, this is the first linear-time algorithm for strings drawn from an alphabet of integers in a polynomial range. Farach's algorithm has become the basis for new algorithms for constructing both suffix trees and suffix arrays, for example, in external memory, compressed, succinct, etc.

Definition

The suffix tree for the string S of length n is defined as a tree such that:[7]

  1. The tree has exactly n leaves numbered from 1 to n.
  2. Except for the root, every internal node has at least two children.
  3. Each edge is labelled with a non-empty substring of S.
  4. No two edges starting out of a node can have string-labels beginning with the same character.
  5. The string obtained by concatenating all the string-labels found on the path from the root to leaf i spells out suffix S[i..n], for i from 1 to n.

If a suffix of S is also the prefix of another suffix, such a tree does not exist for the string. For example, in the string abcbc, the suffix bc is also a prefix of the suffix bcbc. In such a case, the path spelling out bc will not end in a leaf, violating the fifth rule. To fix this problem, S is padded with a terminal symbol not seen in the string (usually denoted $). This ensures that no suffix is a prefix of another, and that there will be n leaf nodes, one for each of the n suffixes of S.[8] Since all internal non-root nodes are branching, there can be at most n − 1 such nodes, and n + (n − 1) + 1 = 2n nodes in total (n leaves, n − 1 internal non-root nodes, 1 root).
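The definition can be made concrete with a minimal sketch that inserts the suffixes of the padded string one at a time, splitting an edge wherever a new suffix diverges from an existing label. This is a naive quadratic-time construction written for illustration, not one of the linear-time algorithms discussed in this article; the names `Node`, `build_suffix_tree` and `count_nodes` are hypothetical.

```python
class Node:
    def __init__(self):
        # Maps the first character of an edge label to (label, child).
        self.children = {}
        # For leaves: 1-based start position of the suffix; None otherwise.
        self.leaf = None


def build_suffix_tree(text, terminal="$"):
    """Naive O(n^2) construction: insert every suffix of text + terminal."""
    assert terminal not in text, "the terminal symbol must not occur in the text"
    s = text + terminal
    root = Node()
    for i in range(len(s)):
        node, j = root, i
        while s[j] in node.children:
            label, child = node.children[s[j]]
            # Length of the common prefix of the edge label and s[j:].
            k = 0
            while k < len(label) and s[j + k] == label[k]:
                k += 1
            if k < len(label):
                # The suffix leaves the edge in the middle: split the edge.
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[s[j]] = (label[:k], mid)
                child = mid
            node, j = child, j + k
        leaf = Node()
        leaf.leaf = i + 1
        node.children[s[j]] = (s[j:], leaf)
    return root


def count_nodes(node):
    """Return (number of leaves, total number of nodes) below and including node."""
    if not node.children:
        return 1, 1
    leaves, total = 0, 1
    for _, child in node.children.values():
        sub_leaves, sub_total = count_nodes(child)
        leaves += sub_leaves
        total += sub_total
    return leaves, total


# "banana" padded to "banana$" has n = 7 suffixes, hence 7 leaves;
# the tree has 11 nodes in total, within the bound of 2n = 14.
leaves, total = count_nodes(build_suffix_tree("banana"))
```

Because the terminal symbol occurs only once, no inserted suffix can end in the middle of an existing path, which is why every insertion in the sketch finishes by creating a new leaf.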

Suffix links are a key feature for older linear-time construction algorithms, although most newer algorithms, which are based on Farach's algorithm, dispense with suffix links. In a complete suffix tree, all internal non-root nodes have a suffix link to another internal node. If the path from the root to a node spells the string χα, where χ is a single character and α is a string (possibly empty), it has a suffix link to the internal node representing α. See for example the suffix link from the node for ANA to the node for NA in the figure above. Suffix links are also used in some algorithms running on the tree.
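As an illustration of where suffix links point, the following sketch avoids building the tree at all. It relies on the observation that the internal nodes of the suffix tree of the padded string correspond to the substrings that are followed by at least two different characters, and it enumerates those by brute force. The function names are hypothetical, and the approach is meant only for small examples.

```python
from collections import defaultdict


def internal_node_strings(text, terminal="$"):
    """Path labels of all internal nodes (the root is the empty string).

    Brute force: a substring labels an internal node exactly when it is
    followed by at least two distinct characters in text + terminal.
    """
    s = text + terminal
    followers = defaultdict(set)
    for i in range(len(s)):
        for j in range(i, len(s)):
            followers[s[i:j]].add(s[j])  # s[i:j] is followed by s[j]
    return {w for w, nxt in followers.items() if len(nxt) >= 2}


def suffix_links(text):
    """Map each internal non-root path label to the label its suffix link targets."""
    nodes = internal_node_strings(text)
    links = {w: w[1:] for w in nodes if w}
    # Every target is itself an internal node (possibly the root, "").
    assert all(target in nodes for target in links.values())
    return links


links = suffix_links("BANANA")
# links == {"A": "", "NA": "A", "ANA": "NA"}
```

For BANANA the link from ANA leads to NA, matching the figure; the link from A leads to the root, whose path label is the empty string.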

A generalized suffix tree is a suffix tree made for a set of strings instead of a single string. It represents all suffixes from this set of strings. Each string must be terminated by a different termination symbol.
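The idea can be sketched with a simplified structure. The code below builds an uncompressed suffix trie (not a compressed suffix tree) over several strings, gives each string its own terminator, and records at every node which strings pass through it. The use of Unicode private-use code points as terminators and all function names are assumptions made for this example.

```python
def build_generalized_trie(strings):
    """Uncompressed trie of all suffixes of all strings, with string ids per node."""
    root = {"ids": set(), "next": {}}
    for sid, text in enumerate(strings):
        # Assumption: private-use code points do not occur in the inputs,
        # so every string gets a distinct terminator.
        s = text + chr(0xE000 + sid)
        for i in range(len(s)):
            node = root
            node["ids"].add(sid)
            for ch in s[i:]:
                node = node["next"].setdefault(ch, {"ids": set(), "next": {}})
                node["ids"].add(sid)
    return root


def strings_containing(root, pattern):
    """Return the ids of all input strings that contain pattern as a substring."""
    node = root
    for ch in pattern:
        node = node["next"].get(ch)
        if node is None:
            return set()
    return set(node["ids"])


trie = build_generalized_trie(["banana", "bandana"])
both = strings_containing(trie, "ana")   # {0, 1}
first = strings_containing(trie, "nan")  # {0}
```

The distinct terminators keep the suffixes of different strings apart, so each leaf belongs to exactly one input string. A real generalized suffix tree would additionally compress unary paths to keep the size linear.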

Functionality

A suffix tree for a string S of length n can be built in Θ(n) time, if the letters come from an alphabet of integers in a polynomial range (in particular, this is true for constant-sized alphabets).[9] For larger alphabets, the running time is dominated by first sorting the letters to bring them into a range of size O(n); in general, this takes O(n log n) time. The costs below are given under the assumption that the alphabet is constant.

Assume that a suffix tree has been built for the string S of length n, or that a generalised suffix tree has been built for the set of strings D = {S_1, S_2, ..., S_K} of total length n = n_1 + n_2 + ... + n_K. You can:

  • Search for strings:
    • Check if a string P of length m is a substring in O(m) time.[10]
    • Find the first occurrence of the patterns P_1, ..., P_q of total length m as substrings in O(m) time.
    • Find all z occurrences of the patterns P_1, ..., P_q of total length m as substrings in O(m + z) time.[11]
    • Search for a regular expression P in time expected sublinear in n.[12]
    • Find for each suffix of a pattern P, the length of the longest match between a prefix of P[i..m] and a substring in D in Θ(m) time.[13] This is termed the matching statistics for P.
  • Find properties of the strings:
    • Find the longest common substrings of the string S_i and S_j in Θ(n_i + n_j) time.[14]
    • Find all maximal pairs, maximal repeats or supermaximal repeats in Θ(n + z) time.[15]
    • Find the Lempel–Ziv decomposition in Θ(n) time.[16]
    • Find the longest repeated substrings in Θ(n) time.
    • Find the most frequently occurring substrings of a minimum length in Θ(n) time.
    • Find the shortest strings from Σ that do not occur in D, in O(n + z) time, if there are z such strings.
    • Find the shortest substrings occurring only once in Θ(n) time.
    • Find, for each i, the shortest substrings of S_i not occurring elsewhere in D in Θ(n) time.
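The first group of operations can be sketched as follows: a pattern is matched by walking down from the root, and every leaf below the point where the walk ends is one occurrence. The tree is built here with a naive quadratic-time insertion purely to keep the example self-contained; all names are hypothetical, and the pattern is assumed to be non-empty and free of the terminal symbol.

```python
class Node:
    def __init__(self):
        self.children = {}  # first character of edge label -> (label, child)
        self.start = None   # for leaves: 0-based start of the suffix


def build_suffix_tree(text, terminal="$"):
    """Naive O(n^2) construction, used only to obtain a tree to search in."""
    s = text + terminal
    root = Node()
    for i in range(len(s)):
        node, j = root, i
        while s[j] in node.children:
            label, child = node.children[s[j]]
            k = 0
            while k < len(label) and s[j + k] == label[k]:
                k += 1
            if k < len(label):  # split the edge at the point of divergence
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[s[j]] = (label[:k], mid)
                child = mid
            node, j = child, j + k
        leaf = Node()
        leaf.start = i
        node.children[s[j]] = (s[j:], leaf)
    return root


def find_node(root, pattern):
    """Follow pattern from the root; return the node at or below its end, or None."""
    node, j = root, 0
    while j < len(pattern):
        entry = node.children.get(pattern[j])
        if entry is None:
            return None
        label, child = entry
        chunk = pattern[j:j + len(label)]
        if label[:len(chunk)] != chunk:
            return None
        node, j = child, j + len(label)
    return node


def leaf_positions(node):
    """Collect the suffix start positions of all leaves below node."""
    if not node.children:
        return [node.start]
    found = []
    for _, child in node.children.values():
        found.extend(leaf_positions(child))
    return found


def occurrences(root, pattern):
    """All 0-based positions at which pattern occurs in the indexed text."""
    node = find_node(root, pattern)
    return sorted(leaf_positions(node)) if node is not None else []


tree = build_suffix_tree("banana")
hits = occurrences(tree, "ana")  # [1, 3]
```

The walk in `find_node` compares each pattern character once, which corresponds to the O(m) substring check; collecting the leaves adds work proportional to the number of occurrences, and the final sort is a convenience of this sketch rather than part of the stated bound.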

The suffix tree can be prepared for constant time lowest common ancestor retrieval between nodes in Θ(n) time.[17] One can then also:

  • Find the longest common prefix between the suffixes S_i[p..n_i] and S_j[q..n_j] in Θ(1).[18]
  • Search for a pattern P of length m with at most k mismatches in O(kn + z) time, where z is the number of hits.[19]
  • Find all z maximal palindromes in Θ(n),[20] or Θ(gn) time if gaps of length g are allowed, or Θ(kn) if k mismatches are allowed.[21]
  • Find all z tandem repeats in O(n log n + z), and k-mismatch tandem repeats in O(kn log(n/k) + z).[22]
  • Find the longest common substrings to at least k strings in D for k = 2, ..., K in Θ(n) time.[23]
  • Find the longest palindromic substring of a given string (using the generalized suffix tree of the string and its reverse) in linear time.[24]
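The link between lowest common ancestors and longest common prefixes can be sketched for a single string: the longest common prefix of two different suffixes is the path label of the lowest common ancestor of their leaves, so its length is that node's string depth. The sketch below finds the ancestor by climbing parent pointers, which is a deliberately naive substitute for the constant-time lowest common ancestor structure; the construction is likewise a naive quadratic one, and all names are hypothetical.

```python
class Node:
    def __init__(self, parent, depth):
        self.children = {}    # first character of edge label -> (label, child)
        self.parent = parent
        self.depth = depth    # string depth: length of the path label from the root


def build(text, terminal="$"):
    """Naive O(n^2) construction; returns the root and a map start -> leaf."""
    s = text + terminal
    root = Node(None, 0)
    leaf_of = {}
    for i in range(len(s)):
        node, j = root, i
        while s[j] in node.children:
            label, child = node.children[s[j]]
            k = 0
            while k < len(label) and s[j + k] == label[k]:
                k += 1
            if k < len(label):  # split the edge and re-parent the old child
                mid = Node(node, node.depth + k)
                mid.children[label[k]] = (label[k:], child)
                child.parent = mid
                node.children[s[j]] = (label[:k], mid)
                child = mid
            node, j = child, j + k
        leaf = Node(node, len(s) - i)
        node.children[s[j]] = (s[j:], leaf)
        leaf_of[i] = leaf
    return root, leaf_of


def lcp_length(leaf_of, p, q):
    """Length of the longest common prefix of the suffixes starting at p and q (p != q)."""
    ancestors = set()
    node = leaf_of[p]
    while node is not None:        # collect all ancestors of the first leaf
        ancestors.add(node)
        node = node.parent
    node = leaf_of[q]
    while node not in ancestors:   # climb from the second leaf until they meet
        node = node.parent
    return node.depth


root, leaf_of = build("banana")
length = lcp_length(leaf_of, 1, 3)  # "anana" and "ana" share "ana": 3
```

With a proper lowest common ancestor structure the climbing loops would be replaced by a single constant-time query, which is what gives the Θ(1) bound in the list above.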

Applications

Suffix trees can be used to solve a large number of string problems that occur in text-editing, free-text search, computational biology and other application areas.[25] Primary applications include:[25]

  • String search, in O(m) complexity, where m is the length of the sub-string (but with initial O(n) time required to build the suffix tree for the string)
  • Finding the longest repeated substring
  • Finding the longest common substring
  • Finding the longest palindrome in a string
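As one worked instance of these applications, the longest repeated substring is the path label of the deepest internal node, because a substring occurs at least twice exactly when at least two leaves lie below the end of its path. The sketch below uses a naive quadratic-time construction and carries path labels along as Python strings, so it illustrates the idea rather than the linear-time bound; the names are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}  # first character of edge label -> (label, child)


def build_suffix_tree(text, terminal="$"):
    """Naive O(n^2) construction by inserting every suffix of text + terminal."""
    s = text + terminal
    root = Node()
    for i in range(len(s)):
        node, j = root, i
        while s[j] in node.children:
            label, child = node.children[s[j]]
            k = 0
            while k < len(label) and s[j + k] == label[k]:
                k += 1
            if k < len(label):  # split the edge at the point of divergence
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[s[j]] = (label[:k], mid)
                child = mid
            node, j = child, j + k
        node.children[s[j]] = (s[j:], Node())
    return root


def longest_repeated_substring(text):
    """Return one longest substring that occurs at least twice (possibly overlapping)."""
    best = ""
    stack = [(build_suffix_tree(text), "")]
    while stack:
        node, path = stack.pop()
        # Only internal nodes count: their path label has two or more leaves below.
        if node.children and len(path) > len(best):
            best = path
        for label, child in node.children.values():
            stack.append((child, path + label))
    return best


result = longest_repeated_substring("banana")  # "ana"
```

The two occurrences of "ana" in "banana" overlap, which is allowed here; if several repeated substrings share the maximum length, the sketch returns whichever it visits first.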

Suffix trees are often used in bioinformatics applications, searching for patterns in DNA or protein sequences (which can be viewed as long strings of characters). The ability to search efficiently with mismatches might be considered their greatest strength. Suffix trees are also used in data compression; they can be used to find repeated data, and can be used for the sorting stage of the Burrows–Wheeler transform. Variants of the LZW compression schemes use suffix trees (LZSS). A suffix tree is also used in suffix tree clustering, a data clustering algorithm used in some search engines.[26]

Implementation

If each node and edge can be represented in Θ(1) space, the entire tree can be represented in Θ(n) space. The total length of all the strings on all of the edges in the tree is O(n^2), but each edge can be stored as the position and length of a substring of S, giving a total space usage of Θ(n) computer words. The worst-case space usage of a suffix tree is seen with a Fibonacci word, giving the full 2n nodes.
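The difference between storing edge labels as strings and storing them as (position, length) pairs can be sketched as follows. Each edge holds two integers that refer back into the padded string, and the label is recovered by slicing when needed. The construction is a naive quadratic one and the dictionary-based node layout is an assumption made for brevity.

```python
def build_edges(text, terminal="$"):
    """Build a suffix tree whose edges are [start, length, child] triples.

    Returns the padded string and the list of (start, length) pairs of all edges.
    """
    s = text + terminal
    root = {}  # first character -> [start, length, child dict]
    for i in range(len(s)):
        node, j = root, i
        while s[j] in node:
            edge = node[s[j]]
            start, length, child = edge
            k = 0
            while k < length and s[j + k] == s[start + k]:
                k += 1
            if k < length:
                # Split: the upper part keeps `start`, the lower part begins at start + k.
                mid = {s[start + k]: [start + k, length - k, child]}
                edge[1], edge[2] = k, mid
                child = mid
            node, j = child, j + k
        node[s[j]] = [j, len(s) - j, {}]
    edges, stack = [], [root]
    while stack:
        for start, length, child in stack.pop().values():
            edges.append((start, length))
            stack.append(child)
    return s, edges


s, edges = build_edges("abcdef")
words_used = 2 * len(edges)                          # two integers per edge: 14
label_chars = sum(length for _, length in edges)     # 7 + 6 + ... + 1 = 28
```

For a string of distinct characters every suffix hangs directly off the root, so the labels written out in full grow quadratically with the length while the number of stored integers grows only linearly.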

An important choice when making a suffix tree implementation is the parent-child relationships between nodes. The most common is using linked lists called sibling lists. Each node has a pointer to its first child, and to the next node in the child list it is a part of. Other implementations with efficient running time properties use hash maps, sorted or unsorted arrays (with array doubling), or balanced search trees. We are interested in:

  • The cost of finding the child on a given character.
  • The cost of inserting a child.
  • The cost of listing all children of a node (divided by the number of children in the table below).

Let σ be the size of the alphabet. Then you have the following costs:[citation needed]

Representation                     Lookup      Insertion   Traversal
Sibling lists / unsorted arrays    O(σ)        Θ(1)        Θ(1)
Bitwise sibling trees              O(log σ)    Θ(1)        Θ(1)
Hash maps                          Θ(1)        Θ(1)        O(σ)
Balanced search tree               O(log σ)    O(log σ)    O(1)
Sorted arrays                      O(log σ)    O(σ)        O(1)
Hash maps + sibling lists          O(1)        O(1)        O(1)

The insertion cost is amortised, and the costs for hashing are given for perfect hashing.
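Two of these representations can be sketched side by side to show where the costs come from. The class and method names are hypothetical, and Python's built-in dict stands in for the hash map, so its lookups are expected constant time rather than the perfect-hashing bound assumed in the table.

```python
class SiblingListNode:
    """Children kept in a singly linked list: first_child, then next_sibling."""

    def __init__(self, char=None):
        self.char = char
        self.first_child = None
        self.next_sibling = None

    def find_child(self, c):
        # Linear scan over the siblings: O(sigma) in the worst case.
        child = self.first_child
        while child is not None and child.char != c:
            child = child.next_sibling
        return child

    def insert_child(self, c):
        # Prepend to the list: constant time (assumes c is not present yet).
        child = SiblingListNode(c)
        child.next_sibling = self.first_child
        self.first_child = child
        return child

    def children(self):
        child = self.first_child
        while child is not None:
            yield child
            child = child.next_sibling


class HashMapNode:
    """Children kept in a hash map keyed by the first character of the edge."""

    def __init__(self, char=None):
        self.char = char
        self.kids = {}

    def find_child(self, c):
        return self.kids.get(c)  # expected constant time

    def insert_child(self, c):
        child = HashMapNode(c)
        self.kids[c] = child
        return child

    def children(self):
        return iter(self.kids.values())


node = SiblingListNode()
for ch in "abc":
    node.insert_child(ch)
found = node.find_child("b")
```

Both classes expose the same three operations, so a construction algorithm could be written against either; the choice then trades lookup time against memory per node.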

The large amount of information in each edge and node makes the suffix tree very expensive, consuming about 10 to 20 times the memory size of the source text in good implementations. The suffix array reduces this requirement to a factor of 8 (for an array including LCP values, built within a 32-bit address space and with 8-bit characters). This factor depends on these properties and may reach 2 with the use of 4-byte wide characters (needed to contain any symbol in some UNIX-like systems, see wchar_t) on 32-bit systems.[citation needed] Researchers have continued to find smaller indexing structures.

Parallel construction

Various parallel algorithms to speed up suffix tree construction have been proposed.[27][28][29][30][31] Recently, a practical parallel algorithm for suffix tree construction with O(n) work (sequential time) and O(log^2 n) span has been developed. The algorithm achieves good parallel scalability on shared-memory multicore machines and can index the human genome (approximately 3 GB) in under 3 minutes using a 40-core machine.[32]

External construction

Though linear, the memory usage of a suffix tree is significantly higher than the actual size of the sequence collection. For a large text, construction may require external memory approaches.

There are theoretical results for constructing suffix trees in external memory. The algorithm by Farach-Colton, Ferragina & Muthukrishnan (2000) is theoretically optimal, with an I/O complexity equal to that of sorting. However the overall intricacy of this algorithm has prevented, so far, its practical implementation.[33]

On the other hand, there have been practical works for constructing disk-based suffix trees which scale to (few) GB/hours. The state-of-the-art methods are TDD,[34] TRELLIS,[35] DiGeST,[36] and B2ST.[37]

TDD and TRELLIS scale up to the entire human genome, resulting in a disk-based suffix tree of a size in the tens of gigabytes.[34][35] However, these methods cannot efficiently handle collections of sequences exceeding 3 GB.[36] DiGeST performs significantly better and is able to handle collections of sequences in the order of 6 GB in about 6 hours.[36]

All these methods can efficiently build suffix trees for the case when the tree does not fit in main memory, but the input does. The most recent method, B2ST,[37] scales to handle inputs that do not fit in main memory. ERA is a recent parallel suffix tree construction method that is significantly faster. ERA can index the entire human genome in 19 minutes on an 8-core desktop computer with 16 GB RAM. On a simple Linux cluster with 16 nodes (4 GB RAM per node), ERA can index the entire human genome in less than 9 minutes.[38]

Notes

  1. ^ Donald E. Knuth; James H. Morris; Vaughan R. Pratt (Jun 1977). "Fast Pattern Matching in Strings" (PDF). SIAM Journal on Computing. 6 (2): 323–350. doi:10.1137/0206024. Here: p.339 bottom.
  2. ^ Knuth conjectured in 1970 that the problem could not be solved in linear time.[1] In 1973, this was refuted by the suffix-tree algorithm of Weiner (1973).
  3. ^ This term is used here to distinguish Weiner's precursor data structures from proper suffix trees as defined above and unconsidered before McCreight (1976).
  4. ^ i.e., with each branch labelled by a single character
  5. ^ See File:WeinerB aaaabbbbaaaabbbb.gif and File:WeinerC aaaabbbbaaaabbbb.gif for an uncompressed example tree and its compressed counterpart.
  6. ^ a b Giegerich & Kurtz (1997).
  7. ^ Gusfield (1999), p.90.
  8. ^ Gusfield (1999), p.90-91.
  9. ^ Farach (1997).
  10. ^ Gusfield (1999), p.92.
  11. ^ Gusfield (1999), p.123.
  12. ^ Baeza-Yates & Gonnet (1996).
  13. ^ Gusfield (1999), p.132.
  14. ^ Gusfield (1999), p.125.
  15. ^ Gusfield (1999), p.144.
  16. ^ Gusfield (1999), p.166.
  17. ^ Gusfield (1999), Chapter 8.
  18. ^ Gusfield (1999), p.196.
  19. ^ Gusfield (1999), p.200.
  20. ^ Gusfield (1999), p.198.
  21. ^ Gusfield (1999), p.201.
  22. ^ Gusfield (1999), p.204.
  23. ^ Gusfield (1999), p.205.
  24. ^ Gusfield (1999), pp.197–199.
  25. ^ a b Allison, L. "Suffix Trees". Archived from the original on 2025-08-05. Retrieved 2025-08-05.
  26. ^ First introduced by Zamir & Etzioni (1998).
  27. ^ Apostolico et al. (1988).
  28. ^ Hariharan (1994).
  29. ^ Sahinalp & Vishkin (1994).
  30. ^ Farach & Muthukrishnan (1996).
  31. ^ Iliopoulos & Rytter (2004).
  32. ^ Shun & Blelloch (2014).
  33. ^ Smyth (2003).
  34. ^ a b Tata, Hankins & Patel (2003).
  35. ^ a b Phoophakdee & Zaki (2007).
  36. ^ a b c Barsky et al. (2008).
  37. ^ a b Barsky et al. (2009).
  38. ^ Mansour et al. (2011).
