Compressed Data Structures for Strings: On Searching and by Rossano Venturini

By Rossano Venturini

Data compression is needed to control large datasets, indexing is key to question them. even if, their pursuits seem as counterposed: the previous goals at minimizing info redundancies, while the latter augments the dataset with auxiliary info to hurry up the question answer. during this monograph we introduce options that conquer this dichotomy. we begin through proposing using optimization strategies to enhance the compression of classical information compression algorithms, then we flow to the layout of compressed information buildings delivering quickly random entry or effective trend matching queries at the compressed dataset. those theoretical experiences are supported by way of experimental evidences in their influence in functional scenarios.

Show description

Read Online or Download Compressed Data Structures for Strings: On Searching and Extracting Strings from Compressed Textual Data PDF

Similar logic books

Logic Pro X: Audio and Music Production

From preliminary demos to blending and studying, pro authors Mark Cousins and Russ Hepworth-Sawyer assist you get the main from good judgment seasoned X. via exploring the basic workflow and the artistic chances provided by way of Logic’s digital tools and results, common sense seasoned X: Audio and tune construction leads you thru the song construction and construction strategy, providing you with all of the information and methods utilized by the professionals to create release-quality recordings.

Logic for Programming, Artificial Intelligence, and Reasoning: 16th International Conference, LPAR-16, Dakar, Senegal, April 25–May 1, 2010, Revised Selected Papers

This ebook constitutes the completely refereed post-conference lawsuits of the sixteenth foreign convention on good judgment for Programming, synthetic Intelligence, and Reasoning, LPAR 2010, which came about in Dakar, Senegal, in April/May 2010. The 27 revised complete papers and nine revised brief papers offered including 1 invited speak have been rigorously revised and chosen from forty seven submissions.

Additional info for Compressed Data Structures for Strings: On Searching and Extracting Strings from Compressed Textual Data

Sample text

In the rest of the chapter, and for simplicity of exposition, we will restrict to the LZ77-variant which avoids the additional symbol per phrase. This means that wi is represented by the integer pair ≤di , εi ∈, where di is the relative offset of the copied phrase wi within the prefix w1 . . wi−1 and εi is its length |wi |. Every first occurrence of a new symbol c is encoded as ≤0, c∈. , a phrase’s source could overlap the phrase itself. Once phrases are identified and represented via pairs of integers, their components are compressed via variable-length integer encoders which eventually produce the compressed output of T as a sequence of bits.

So, we are assuming that T has been (possibly) permuted in advance, and we are concentrating on the last two steps of the PPC-paradigm. Now, given a partition P of the input string into contiguous substrings, say T = T1 T2 · · · Tk , we denote by Cost(P) the cost of this partition and measure it as li=1 |C(Ti )|, where |C(α)| is the length in bit of the string α compressed by C. 2 As we mentioned above Popt might be computed via a Dynamic-Programming approach (Buchsbaum et al. 2003; Giancarlo and Sciortino 2003).

Several solutions are indeed known for this problem but they are either inefficient (Schuegraf and Heaps 1974), in that they take σ(n2 ) worst-case time and space, or they are approximate (Katajainen and Raita 1989), or they rely on heuristics (Klein 1997; Smith and Storer 1985; Bekesi et al. 1997; Cohn and Khazan 1996) which do not provide any guarantee on the time/space performance of the compression process. This is the reason why Rajpoot and Sahinalp stated in Rajpoot and Sahinalp (2002) at page 159 that “We are not aware of any on-line or off-line parsing scheme that achieves optimality when the LZ77-dictionary is in use under any constraint on the codewords other than being of equal length”.

Download PDF sample

Rated 4.97 of 5 – based on 26 votes