Using SED to Extract Info out of Data
#############################

We will use sed, a common search-and-replace tool, to extract information. Sed is very powerful because its search and replace can also use variables. So you can search for a part of a line, save it in a variable, and return that variable. Using that simple technique and working line by line (as sed does by default), you can “extract” data. So we will be constructing a sed string; it's like an equation for sed that tells it what to find and what to print instead of what it found.

First, let's cover the basics:

If you have used sed, you are probably used to simple sed strings that do a plain search and replace. The two usual forms are:

s/what to find/new/ : find the first occurrence of “what to find” and replace it with “new”. Sed works one line at a time, so this runs the search and replace once per line (only the first occurrence on each line is replaced).

s/what to find/new/g : find every occurrence of “what to find” and replace it with “new”. Sed still works one line at a time, but the trailing g tells it to run through the whole line, so every occurrence of “what to find” on the line is replaced.
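As a quick sketch of the difference between the two forms (the input line is made up for illustration):

```shell
# Hypothetical input line that contains the search text twice.
line='what to find, and again what to find'

# Without g: only the first occurrence on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/what to find/new/')

# With g: every occurrence on the line is replaced.
every=$(printf '%s\n' "$line" | sed 's/what to find/new/g')

printf '%s\n' "$first_only" "$every"
```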

Extracting is just a weird form of search and replace. We search the line for what we want to extract, save it in a variable (the backslash-1 variable, \1), then replace the whole line with that variable, so the line becomes “what we want to extract”. For that to work, the search pattern has to match the whole line, not just the part we want.

For extracting info we will use the p flag.
p means we want to print the result.
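A minimal sketch of that idea, assuming a made-up line of the form key=value where we only want the value:

```shell
# Hypothetical input line; we want only what follows the "=".
line='color=blue'

# .*=      matches everything up to and including the "=" (the part we drop)
# \(.*\)   matches the rest of the line and saves it in \1
# The whole line was matched, so replacing it with \1 leaves only the value.
value=$(printf '%s\n' "$line" | sed 's/.*=\(.*\)/\1/')

printf '%s\n' "$value"
```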

Useful tools (they select or do things in sed)

Legend: Tool – description
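A few standard pieces of sed's basic regular expressions, shown as a sketch. Picking these particular ones is my assumption; they are the ones the extraction commands in this post lean on. The input line is made up.

```shell
# Hypothetical input line.
line='abc 123 xyz'

# .        any single character
dot=$(printf '%s\n' "$line" | sed 's/a.c/X/')

# .*       any run of characters, including none
star=$(printf '%s\n' "$line" | sed 's/abc.*/X/')

# ^ and $  start of line and end of line
anchors=$(printf '%s\n' "$line" | sed -e 's/^abc/X/' -e 's/xyz$/Y/')

# [0-9]    one digit; [0-9][0-9]* means one or more digits
digits=$(printf '%s\n' "$line" | sed 's/[0-9][0-9]*/N/')

# \( \)    save whatever matched inside into \1
group=$(printf '%s\n' "$line" | sed 's/.* \([0-9]*\) .*/\1/')

printf '%s\n' "$dot" "$star" "$anchors" "$digits" "$group"
```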

 

s/"using the above tools, identify the part of the string we don't want, and set the part we do want to variable \1 using \( \)"/"return variable \1, so just put \1"/p

Format
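Spelled out as a real command, the format looks roughly like the sketch below. The input line and the names in it are hypothetical, and this is only one way to write it.

```shell
# Shape: sed -n 's/UNWANTED\(WANTED\)UNWANTED/\1/p'
# Hypothetical input line; we want the name in the middle of the line.
line='name: alice (admin)'

# ^name:    the unwanted start of the line
# \(.*\)    the wanted part, saved in \1
#  (.*$     the unwanted rest: a space, an opening parenthesis, anything
wanted=$(printf '%s\n' "$line" | sed -n 's/^name: \(.*\) (.*$/\1/p')

printf '%s\n' "$wanted"
```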

UPDATE: the -n argument makes sed print nothing on the screen by default (normally it prints the whole document with the changes made inline). So now that it prints nothing, the changes we make are useless, right? Nope. The p at the end of the s command (the /p) asks sed to print the line whenever the substitution succeeded. The end result is that we only see the lines that matched. A line that doesn't match the sed statement isn't printed (without -n it would have been printed).

If you use /p but not -n, you are asking sed to print everything and also to print on every match, so everything is printed and every line that matched is printed twice. In these data extraction scenarios we only want the data that matters, so we start by printing nothing with the -n argument and finish by asking sed to print only what applies with /p.
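The three combinations can be compared side by side. This is a small sketch on made-up input, where only the lines starting with “keep” match:

```shell
# Hypothetical three-line input; the middle line does not match.
input='keep 1
skip
keep 2'

# -n together with p: only the lines where the substitution happened.
only_matches=$(printf '%s\n' "$input" | sed -n 's/keep \(.*\)/\1/p')

# Neither -n nor p: every line printed once, matches changed inline.
inline=$(printf '%s\n' "$input" | sed 's/keep \(.*\)/\1/')

# p without -n: every line printed once, matched lines a second time.
doubled=$(printf '%s\n' "$input" | sed 's/keep \(.*\)/\1/p')

printf '%s\n' "--- -n and p" "$only_matches" "--- neither" "$inline" "--- p only" "$doubled"
```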

Example 1
##########

Here we show how to extract info at the end of the line (if it were in the middle of the line, it wouldn't be much different).

Imagine the following command's output:

Below, where you see “<anything>” or ‘anything’, I mean that it can be replaced by any run of human-readable characters other than an ENTER/newline (so it can't be a newline character, but it can be a tab, a space, a number or a letter).

Notice that the pattern is
<anything> path <anything>
If we wanted everything after the word path, we would just want the last <anything>

If we only wanted the paths with snap in them, the pattern is
<anything> path <anything>snap<anything>
Or it can be (note: there is often more than one way to construct a correct sed)
<anything> level 5 path <anything>snap<anything>

And we want the part that is <anything>snap<anything>
Let's say we need all of the names of the paths.

Here we say: find any line that starts with any characters (‘anything’), then has space path space (“ path ”), followed by any characters (‘anything’). Record the ‘anything’ that comes after “ path ” into placeholder 1, then replace the whole line with placeholder 1 and print it out.
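As a sketch, here is a command that does this. The sample lines are invented for illustration and only follow the <anything> path <anything> pattern described above; they are not real output.

```shell
# Hypothetical sample lines of the form "<anything> path <anything>".
sample='ID 256 gen 100 top level 5 path home
ID 257 gen 101 top level 5 path snapshots/home-snap-1
ID 258 gen 102 top level 5 path data
ID 259 gen 103 top level 5 path snapshots/data-snap-1'

# ".* path " is the part we throw away; \(.*\) after it goes into \1.
paths=$(printf '%s\n' "$sample" | sed -n 's/.* path \(.*\)/\1/p')

printf '%s\n' "$paths"
```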

Let's say we need the names of only the paths that are snapshots.

Here we say: find any line that starts with any characters (‘anything’), then has space path space (“ path ”), followed by any characters (‘anything’) surrounding the word ‘snap’. Record everything after “ path ” (the ‘anything’ surrounding ‘snap’, including ‘snap’ itself) into placeholder 1, then replace the whole line with placeholder 1 and print it out.

Or likewise, take the output of the first command and filter it with grep.
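A sketch of both variants of the pattern, again on invented sample lines (assumed data, not real output):

```shell
# Hypothetical sample lines of the form "<anything> path <anything>".
sample='ID 256 gen 100 top level 5 path home
ID 257 gen 101 top level 5 path snapshots/home-snap-1
ID 258 gen 102 top level 5 path data
ID 259 gen 103 top level 5 path snapshots/data-snap-1'

# Variant 1: anything, " path ", then text that contains "snap".
snaps=$(printf '%s\n' "$sample" | sed -n 's/.* path \(.*snap.*\)/\1/p')

# Variant 2: same idea, anchored on the longer " level 5 path " text.
snaps_alt=$(printf '%s\n' "$sample" | sed -n 's/.* level 5 path \(.*snap.*\)/\1/p')

printf '%s\n' "$snaps"
```

Lines without snap after “ path ” do not match at all, so with -n and p they are simply not printed.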
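The grep route might look like this sketch: extract every path name first, then keep only the names containing snap. The sample lines are made up.

```shell
# Hypothetical sample lines of the form "<anything> path <anything>".
sample='ID 256 gen 100 top level 5 path home
ID 257 gen 101 top level 5 path snapshots/home-snap-1
ID 258 gen 102 top level 5 path data
ID 259 gen 103 top level 5 path snapshots/data-snap-1'

# sed extracts all path names; grep then filters them.
snaps_grep=$(printf '%s\n' "$sample" | sed -n 's/.* path \(.*\)/\1/p' | grep 'snap')

printf '%s\n' "$snaps_grep"
```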

Another way to extract data:

Imagine the text:

How can we extract the numbers 123456, 21, 74 and 443? One way is to use the above method: look for the numbers, save them into a variable and print the variable.

Another way is to use sed for one of the main things it was made for (search and replace): search for the text that is NOT what we want and replace it with nothing (an empty string), so that what's left is what we want.

So my trick here is to look for a line that starts with “here comes the sun” and clear out “here comes the sun”. Then look for “dun dun dun /dev/sd. dadada” and replace that with nothing as well (the . matches any single character, so it covers whatever letter follows /dev/sd).
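A sketch of the capture method. The exact layout of the text is my assumption: I am guessing each line looks like “here comes the sun NUMBER dun dun dun /dev/sdX dadada”, which fits the phrases and numbers mentioned here.

```shell
# Assumed layout of the text (hypothetical reconstruction).
text='here comes the sun 123456 dun dun dun /dev/sda dadada
here comes the sun 21 dun dun dun /dev/sdb dadada
here comes the sun 74 dun dun dun /dev/sdc dadada
here comes the sun 443 dun dun dun /dev/sdd dadada'

# Save the run of digits after the fixed prefix in \1, drop the rest.
numbers=$(printf '%s\n' "$text" | sed -n 's/^here comes the sun \([0-9][0-9]*\) .*/\1/p')

printf '%s\n' "$numbers"
```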
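Here is a sketch of that trick. The layout of the text is assumed (each line as “here comes the sun NUMBER dun dun dun /dev/sdX dadada”), so treat the sample as hypothetical.

```shell
# Assumed layout of the text (hypothetical reconstruction).
text='here comes the sun 123456 dun dun dun /dev/sda dadada
here comes the sun 21 dun dun dun /dev/sdb dadada
here comes the sun 74 dun dun dun /dev/sdc dadada
here comes the sun 443 dun dun dun /dev/sdd dadada'

# First expression: delete the leading "here comes the sun ".
# Second expression: delete the trailing " dun dun dun /dev/sd? dadada".
# "|" is used as the s delimiter so the slashes in /dev need no escaping.
left_over=$(printf '%s\n' "$text" |
  sed -e 's/^here comes the sun //' -e 's| dun dun dun /dev/sd. dadada$||')

printf '%s\n' "$left_over"
```

No -n or p is needed here, since every line is kept and only the unwanted text is removed from it.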

End result:

The end of this file.

 
