Commit 30f06b8

Bit reworked regular expressions. More examples and tasks.
1 parent bcdba76 commit 30f06b8

1 file changed

+52
-23
lines changed

presentation/linux_bash_metacentrum_course.tex

Lines changed: 52 additions & 23 deletions
Original file line number | Diff line number | Diff line change
@@ -3242,7 +3242,7 @@ \subsection{Regular expressions}
32423242
\begin{frame}{Regular expressions are useful\ldots}
32433243
\begin{multicols}{2}
32443244
\begin{center}
3245-
\includegraphics[height=5.5cm]{regular_expressions.png}
3245+
\includegraphics[height=6.5cm]{regular_expressions.png}
32463246
\end{center}
32473247
\columnbreak
32483248
\begin{itemize}
@@ -3251,7 +3251,7 @@ \subsection{Regular expressions}
32513251
\item Syntax is variable among programming languages and applications
32523252
\item There are commonly more solutions for one task
32533253
\item Well supported in \texttt{grep}, \texttt{sed}, \texttt{vim}, \texttt{emacs},~\ldots
3254-
\item Probably the most advanced is Perl
3254+
\item Probably the most advanced is \href{https://www.perl.org/}{Perl}
32553255
\end{itemize}
32563256
\vfill
32573257
\url{https://xkcd.com/208/}
@@ -3261,6 +3261,23 @@ \subsection{Regular expressions}
32613261
\begin{frame}[allowframebreaks]{Regular expressions}
32623262
\label{regexp}
32633263
\begin{itemize}
3264+
\item Implementation in \texttt{vim}, \texttt{sed}, \texttt{grep}, \texttt{awk} and \texttt{perl} and among various UNIX systems is almost the same, but not identical, which can be confusing\ldots
3265+
\item \textbf{grep}, \textbf{sed} and \textbf{vim} \alert{require escaping} of \alert{\texttt{+}}, \alert{\texttt{?}}, \alert{\texttt{\{}}, \alert{\texttt{\}}}, \alert{\texttt{(}} and \alert{\texttt{)}} by backslash \alert{\texttt{\textbackslash}} (e.g. \texttt{\textbackslash +}, see also next slides)
3266+
\item \textbf{egrep} (extended version, launched as \texttt{grep -E \ldots} or \texttt{egrep \ldots}), \textbf{sed} with extended reg exp (\texttt{sed -r}) and \textbf{perl} \alert{do not} require escaping (simply e.g. \texttt{+}, not \texttt{\textbackslash +})
3267+
\item Mastering regular expressions requires practice -- solve practical problems and see their power
3268+
\item Read \url{https://en.wikibooks.org/wiki/Regular_Expressions}, \url{https://www.grymoire.com/Unix/Regular.html}, \url{https://www.regular-expressions.info/}
3269+
\begin{itemize}
3270+
\item In Czech: \url{http://www.nti.tul.cz/~satrapa/docs/regvyr/}, \url{https://www.root.cz/serialy/regularni-vyrazy/} and~\url{https://www.regularnivyrazy.info/}
3271+
\item Manuals for \href{https://www.gnu.org/software/grep/manual/}{Grep}, Vim, \href{https://www.gnu.org/software/sed/manual/}{Sed}, \href{https://www.gnu.org/software/gawk/manual/}{Awk}, \href{https://en.wikibooks.org/wiki/Perl_Programming}{Perl} (\href{https://en.wikibooks.org/wiki/Raku_Programming}{newer Perl~6 Raku}),~\ldots
3272+
\end{itemize}
3273+
\item See sed examples, slide~\ref{sedex}; and next slides
3274+
\item macOS ships by default with very outdated versions of \texttt{sed} and other tools, which lack some advanced features; users need to install e.g. the \texttt{gnu-sed} formula from \href{https://brew.sh/}{Homebrew} (slide~\ref{homebrew}), similarly for Grep, AWK,~\ldots
3275+
\item Do not confuse with shell globbing (slide~\ref{globbing}): regular expressions are used within a particular application (GNU Sed, GNU Grep, Perl,~\ldots), while shell globbing is a built-in BASH feature
3276+
\begin{itemize}
3277+
\item Globbing as well as regular expressions match/expand particular text string (in case of globbing typically file names)
3278+
\item Regular expressions usually must be quoted (\texttt{'\ldots'}) so that they are \textbf{not} interpreted by the shell; they work mostly with \textbf{text} files (their versatility allows using them with e.g. molecular data)
3279+
\end{itemize}
3280+
\item Word processors (LibreOffice,~\ldots), graphical text editors, etc. usually also support regular expressions, more or less following the syntax below, but sometimes a bit simplified
32643281
\item \alert{\texttt{.}} --- any single character
32653282
\item \alert{\texttt{*}} --- any number of occurrences of the preceding reg exp (including 0), e.g. \texttt{.*} is any number of any characters
32663283
\item \alert{\texttt{+}} --- one or more occurrences of the preceding reg exp
@@ -3273,8 +3290,8 @@ \subsection{Regular expressions}
32733290
\item \alert{\texttt{\textbackslash\{n\textbackslash\}}} --- exactly \textit{n} occurrences
32743291
\item \alert{\texttt{\textbackslash\{n,\textbackslash\}}} --- at least \textit{n} occurrences
32753292
\item \alert{\texttt{\textbackslash}} --- escape following special character (e.g. \texttt{\textbackslash .} to literally search for dot and not \enquote{any single character})
3276-
\item \alert{\texttt{|}} --- either the preceding or following reg exp can be matched (alternation)
3277-
\item \alert{\texttt{\textbackslash(\ldots\textbackslash)}} --- remembered group reg exp (numbered, starting with 1) --- can be called by \alert{\textbackslash\textit{n}}, where \textit{n} is number of the group (starting with 1)
3293+
\item \alert{\texttt{|}} --- either the preceding or following reg exp can be matched (alternation), in \texttt{grep} etc. escape it and use as \texttt{\textbackslash |}
3294+
\item \alert{\texttt{\textbackslash(\ldots\textbackslash)}} --- remembered group reg exp (numbered, starting with 1) --- can be recalled by \alert{\textbackslash\textit{n}}, where \textit{n} is the number of the group (see examples further)
32783295
\item \alert{\texttt{\textbackslash$<$}}, \alert{\texttt{\textbackslash$>$}} --- word boundaries
32793296
\item \alert{\texttt{[[:alnum:]]}} --- alphanumerical characters (letters and digits, no white space), like \texttt{[a-zA-Z0-9]}
32803297
\item \alert{\texttt{[[:alpha:]]}} --- alphabetic characters, like \texttt{[a-zA-Z]}
@@ -3292,17 +3309,13 @@ \subsection{Regular expressions}
32923309
\item \alert{\texttt{\textasciicircum.*\$}} --- entire line whatever it is
32933310
\item \alert{\texttt{ +}} --- one or more spaces (there is space before plus)
32943311
\item \alert{\texttt{\&}} --- content of pattern that was matched
3295-
\item Implementation in \texttt{vim}, \texttt{sed}, \texttt{grep}, \texttt{awk} and \texttt{perl} and among various UNIX systems is almost same, but not identical\ldots
3296-
\item \textbf{grep}, \textbf{sed} and \textbf{vim} \alert{require escaping} of \alert{\texttt{+}}, \alert{\texttt{?}}, \alert{\texttt{\{}}, \alert{\texttt{\}}}, \alert{\texttt{(}} and \alert{\texttt{)}} by backslash \alert{\texttt{\textbackslash}} (e.g. \texttt{\textbackslash +})
3297-
\item \textbf{egrep} (extended version, launched as \texttt{grep -E \ldots} or \texttt{egrep \ldots}), \textbf{sed} with extended reg exp (\texttt{sed -r}) and \textbf{perl} \alert{not} (simply e.g. \texttt{+})
3298-
\item Read \url{https://en.wikibooks.org/wiki/Regular_Expressions}, \url{https://www.grymoire.com/Unix/Regular.html}, \url{https://www.regular-expressions.info/}; česky \url{http://www.nti.tul.cz/~satrapa/docs/regvyr/}, \url{https://www.root.cz/serialy/regularni-vyrazy/} a~\url{https://www.regularnivyrazy.info/}
3299-
\item Manuals for \href{https://www.gnu.org/software/grep/manual/}{Grep}, Vim, \href{https://www.gnu.org/software/sed/manual/}{Sed}, \href{https://www.gnu.org/software/gawk/manual/}{Awk}, \href{https://en.wikibooks.org/wiki/Perl_Programming}{Perl} (\href{https://en.wikibooks.org/wiki/Raku_Programming}{newer Perl~6 Raku}),~\ldots
3300-
\item See sed examples, slide~\ref{sedex}; and next slide
3301-
\item macOS has by default very outdated version of \texttt{sed} and another tools --- it does not have all advanced features --- users need to install e.g. \texttt{gnu-sed} formulae from \href{https://brew.sh/}{Homebrew} (slide~\ref{homebrew})
33023312
\end{itemize}
33033313
\end{frame}
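The escaping rules and the difference from shell globbing listed on this slide can be tried on a tiny input. The following is only a minimal sketch: the three sample lines are made up for illustration and are not one of the course files, and the expected behaviour of the unescaped braces assumes a POSIX-style basic regular expression implementation such as GNU Grep.

```shell
# Sample input made up for this sketch (not one of the course files)
sample=$(printf '%s\n' 'AAAAAC' 'AAC' 'GGT')

# Basic regular expressions (plain grep): interval braces must be escaped
bre=$(printf '%s\n' "$sample" | grep 'A\{5,\}')

# Extended regular expressions (grep -E): the same braces are written unescaped
ere=$(printf '%s\n' "$sample" | grep -E 'A{5,}')

echo "BRE: $bre"
echo "ERE: $ere"

# In basic syntax an unescaped brace is a literal character,
# so this pattern finds no line in the sample ('|| true' hides grep's exit code 1)
literal=$(printf '%s\n' "$sample" | grep -c 'A{5,}' || true)
echo "Lines matching unescaped braces in basic syntax: $literal"

# Glob vs. regular expression: as a glob, 'A*' means "starts with A";
# as a regular expression, 'A*' means "zero or more A" and matches every line
case AAC in A*) glob_match=yes ;; *) glob_match=no ;; esac
regex_lines=$(printf '%s\n' "$sample" | grep -c 'A*')
echo "Glob A* matches AAC: $glob_match; regex A* matches $regex_lines of 3 lines"
```

Both the escaped basic form and the unescaped extended form select the same line, while the same two characters `A*` mean different things to the shell and to `grep`.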
33043314
33053315
\begin{frame}[fragile]{Grep and sed examples I}
3316+
\begin{itemize}
3317+
\item Be sure to understand all syntax on this and the following slide\ldots
3318+
\end{itemize}
33063319
\begin{bashcode}
33073320
# Extract sequences with at least 5 A bases in line
33083321
grep "A\{5,\}" Oxalis_HybSeq_nrDNA_selection_alignment.fasta
@@ -3318,24 +3331,40 @@ \subsection{Regular expressions}
33183331
sed -e 's/^/<p>/' -e 's/$/<\/p>/' long_text.txt | less
33193332
# Make first word of every paragraph bold in HTML (<strong>...</strong>)
33203333
sed -e 's/^/<strong>/' -e 's/^[[:graph:]]\+/&<\/strong>/' long_text.txt
3321-
# How many times is each word in the text
3322-
grep -o "\<[[:alpha:]]\+\>" long_text.txt | sort | uniq -ic | less
33233334
\end{bashcode}
33243335
\end{frame}
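The role of `&` and of the word-counting pipeline from these examples may be easier to see on a very small input. This is a sketch under stated assumptions: the sample strings are invented, `\{1,\}` is used instead of `\+` so that it should also work with non-GNU `sed` and `grep`, and `sort -f` is an addition of this sketch so that words differing only in case end up adjacent regardless of locale.

```shell
# Made-up sample paragraph (not the course file long_text.txt)
text='Lorem ipsum dolor'

# & in the replacement stands for whatever the pattern matched,
# here the first run of non-blank characters on the line
bold=$(printf '%s\n' "$text" | sed 's/^[[:graph:]]\{1,\}/<strong>&<\/strong>/')
echo "$bold"

# Count word occurrences, ignoring case (-i) and prefixing counts (-c);
# uniq only merges adjacent lines, hence the case-insensitive sort first
counts=$(printf '%s\n' 'the cat' 'The dog' |
  grep -o '[[:alpha:]]\{1,\}' | sort -f | uniq -ic)
echo "$counts"

# Pick the count of "the" out of the table for a quick check
the_count=$(printf '%s\n' "$counts" | awk 'tolower($2) == "the" { print $1 }')
echo "the: $the_count"
```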
33253336
33263337
\begin{frame}[fragile]{Grep and sed examples II}
33273338
\begin{bashcode}
3339+
# Count how many times each word occurs in the text
3340+
grep -o "\<[[:alpha:]]\+\>" long_text.txt | sort | uniq -ic | less
33283341
# List all Internet web links
3329-
grep -o "http[a-zA-Z0-9\.()/:\-]\+" long_text.txt
3330-
\end{bashcode}
3331-
\begin{block}{Tasks}
3332-
\begin{enumerate}
3333-
\item Remove "\texttt{S}" codes, replace underscore by dot and space (\texttt{. }), and capitalize initial "\texttt{o}" in FASTA names in \texttt{oxalis\_assembly\_6235.aln.fasta}, e.g. from \texttt{>o\_annae\_S499} to \texttt{>O. annae}.
3334-
\item Extract from \texttt{arabidopsis.vcf.gz} values of \texttt{DP} (only numbers), sort them and print on single line, separated by commas.
3335-
\item Determine, which sequence of \texttt{Oxalis\_HybSeq\_nrDNA\_selection\_alignment.fasta} has the longest block of missing data (\texttt{N}) or spaces (\texttt{-}).
3336-
\end{enumerate}
3337-
\end{block}
3338-
\end{frame}
3342+
grep -o 'https\?://[a-zA-Z0-9\.()/:\-]\+' long_text.txt
3343+
# Convert selected letters to upper case
3344+
sed 's/[acegikmoqsuwy]/\U&/g' diff_test_file_1.txt
3345+
# From file listing (compare with 'ls -l') remove permissions and number
3346+
# of links at the beginning, swap user and group ownership and add labels
3347+
# Note usage of numbered groups
3348+
# Note that the unmatched part of the line is left intact
3349+
ls -l | sed 's/^[[:graph:]]\+[[:blank:]]\+[0-9]\+[[:blank:]]\+'\
3350+
'\([[:alnum:]]\+\)[[:blank:]]\+\([[:alnum:]]\+\)/GRP: \2\tUSR: \1/g'
3351+
# Create list of samples (e.g. as input in script for some application)
3352+
SAMPLESLIST=$(find . -name "*.jpg" | sed 's/^\.\///' | sed 's/^/-I /' |
3353+
tr "\n" " ")
3354+
echo $SAMPLESLIST # What would be the difference from quoted "$SAMPLESLIST"?
3355+
application $SAMPLESLIST -method X -out Y ... # Rationale of such listing
3356+
\end{bashcode}
3357+
\end{frame}
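Two points from this slide, numbered groups and the effect of quoting a variable, can be checked without depending on the real directory content. The sketch below makes several assumptions: the `ls -l` line, the user `alice` and the group `users` are invented; `sed -E` (extended syntax) is used to drop the backslashes; a plain space replaces `\t`, which not every `sed` interprets; and `count_args` is a hypothetical helper defined only for this illustration.

```shell
# A made-up line in the format printed by 'ls -l' (hypothetical user and group)
line='-rw-r--r-- 1 alice users 1024 Jan  1 12:00 notes.txt'

# The regular expression must not contain a line break inside the quotes:
# a newline within 's/.../.../' would end the sed command prematurely.
# \1 is the first group (user), \2 the second (group); the rest is untouched
swapped=$(printf '%s\n' "$line" |
  sed -E 's/^[[:graph:]]+[[:blank:]]+[0-9]+[[:blank:]]+([[:alnum:]]+)[[:blank:]]+([[:alnum:]]+)/GRP: \2 USR: \1/')
echo "$swapped"

# Hypothetical helper: prints how many arguments it received
count_args() { echo "$#"; }

# Build a list of options the same way as SAMPLESLIST above, from fixed names
list=$(printf '%s\n' a.jpg b.jpg | sed 's/^/-I /' | tr '\n' ' ')

# Unquoted: the shell splits the value on white space into separate arguments
unquoted_count=$(count_args $list)
# Quoted: the whole value is passed as one single argument
quoted_count=$(count_args "$list")
echo "unquoted: $unquoted_count arguments, quoted: $quoted_count argument"
```

The unquoted form is what an application expecting separate `-I file` options needs, which is the rationale for building the list this way; it would break for file names containing spaces.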
3358+
3359+
\begin{frame}[fragile]{Regular expressions tasks}
3360+
\begin{enumerate}
3361+
\item Remove "\texttt{S}" codes, replace underscore by dot and space (\texttt{. }), and capitalize initial "\texttt{o}" in FASTA names in \texttt{oxalis\_assembly\_6235.aln.fasta}, e.g. from \texttt{>o\_annae\_S499} to \texttt{>O. annae}.
3362+
\item Extract from \texttt{arabidopsis.vcf.gz} values of \texttt{DP} (only numbers), sort them and print on single line, separated by commas.
3363+
\item Determine which sequence(s) of \texttt{Oxalis\_HybSeq\_nrDNA\_selection\_alignment.fasta} have a block of missing data (\texttt{N}) or gaps (\texttt{-}) longer than 10~bp.
3364+
\item Using \texttt{sed}, remove the column \texttt{Description} (\texttt{"Assembly of \# reads: \ldots "}) from the file \texttt{cut\_awk\_test\_file.tsv}.
3365+
\item Think about any task (manipulation with your data,~\ldots) you are (sometimes) dealing with, which could be simplified/solved by using regular expressions. Try to solve it. Discuss it with others.
3366+
\end{enumerate}
3367+
\end{frame}
33393368
33403369
\section{Scripting}
33413370
