\item Implementations in \texttt{vim}, \texttt{sed}, \texttt{grep}, \texttt{awk} and \texttt{perl}, and among various UNIX systems, are almost the same, but not identical --- this can be confusing\ldots
3265
+
\item\textbf{grep}, \textbf{sed} and \textbf{vim} \alert{require escaping} of \alert{\texttt{+}}, \alert{\texttt{?}}, \alert{\texttt{\{}}, \alert{\texttt{\}}}, \alert{\texttt{(}} and \alert{\texttt{)}} by backslash \alert{\texttt{\textbackslash}} (e.g. \texttt{\textbackslash +}, see also next slides)
3266
+
\item\textbf{egrep} (extended version, launched as \texttt{grep -E \ldots} or \texttt{egrep \ldots}), \textbf{sed} with extended reg exp (\texttt{sed -r}) and \textbf{perl} \alert{do not} require escaping (simply e.g. \texttt{+}, not \texttt{\textbackslash +})
3267
+
\item Mastering regular expressions requires practice -- solve practical problems and see their power
\item In Czech: \url{http://www.nti.tul.cz/~satrapa/docs/regvyr/}, \url{https://www.root.cz/serialy/regularni-vyrazy/} and~\url{https://www.regularnivyrazy.info/}
\item See sed examples, slide~\ref{sedex}; and next slides
3274
+
\item macOS has by default a very outdated version of \texttt{sed} and other tools --- it does not have all advanced features --- users need to install e.g. the \texttt{gnu-sed} formula from \href{https://brew.sh/}{Homebrew} (slide~\ref{homebrew}), similarly for Grep, AWK,~\ldots
3275
+
\item Do not confuse with shell globbing (slide~\ref{globbing}) --- regular expressions are used within a particular application (GNU Sed, GNU Grep, Perl,~\ldots), while shell globbing is a built-in BASH feature
3276
+
\begin{itemize}
3277
+
\item Globbing as well as regular expressions match/expand a particular text string (in the case of globbing typically file names)
3278
+
\item Regular expressions must mostly be quoted (\texttt{'\ldots'}) so that they are \textbf{not} interpreted by the shell; they work mostly with \textbf{text} files (their versatility allows using them to work with e.g. molecular data)
3279
+
\end{itemize}
3280
+
\item Word processors (LibreOffice,~\ldots), graphical text editors, etc. usually also support regular expressions, more or less following the syntax below, but sometimes a bit simplified
3264
3281
\item\alert{\texttt{.}} --- any single character
3265
3282
\item\alert{\texttt{*}} --- any number of occurrences of the preceding character/pattern (including 0)
3266
3283
\item\alert{\texttt{+}} --- one or more occurrences of the preceding reg exp
\item\alert{\texttt{\textbackslash\{n,\textbackslash\}}} --- at least \textit{n} occurrences
3275
3292
\item\alert{\texttt{\textbackslash}} --- escape following special character (e.g. \texttt{\textbackslash .} to literally search for dot and not \enquote{any single character})
3276
-
\item\alert{\texttt{|}} --- either the preceding or following reg exp can be matched (alternation)
3277
-
\item\alert{\texttt{\textbackslash(\ldots\textbackslash)}} --- remembered group reg exp (numbered, starting with 1) --- can be called by \alert{\textbackslash\textit{n}}, where \textit{n} is number of the group (starting with 1)
3293
+
\item\alert{\texttt{|}} --- either the preceding or following reg exp can be matched (alternation), in \texttt{grep} etc. escape it and use as \texttt{\textbackslash |}
3294
+
\item\alert{\texttt{\textbackslash(\ldots\textbackslash)}} --- remembered group reg exp (numbered, starting with 1) --- can be called by \alert{\textbackslash\textit{n}}, where \textit{n} is number of the group (starting with 1, see examples further)
3278
3295
\item\alert{\texttt{\textbackslash$<$}}, \alert{\texttt{\textbackslash$>$}} --- word boundaries
3279
3296
\item\alert{\texttt{[[:alnum:]]}} --- alphanumeric characters (does not include white space), same as \texttt{[a-zA-Z0-9]}
3280
3297
\item\alert{\texttt{[[:alpha:]]}} --- alphabetic characters, like \texttt{[a-zA-Z]}
\item\alert{\texttt{\textasciicircum.*\$}} --- entire line whatever it is
3293
3310
\item\alert{\texttt{ +}} --- one or more spaces (there is space before plus)
3294
3311
\item\alert{\texttt{\&}} --- content of the pattern that was matched (used in the replacement part, e.g. of the \texttt{sed} command \texttt{s})
3295
-
\item Implementation in \texttt{vim}, \texttt{sed}, \texttt{grep}, \texttt{awk} and \texttt{perl} and among various UNIX systems is almost same, but not identical\ldots
3296
-
\item\textbf{grep}, \textbf{sed} and \textbf{vim} \alert{require escaping} of \alert{\texttt{+}}, \alert{\texttt{?}}, \alert{\texttt{\{}}, \alert{\texttt{\}}}, \alert{\texttt{(}} and \alert{\texttt{)}} by backslash \alert{\texttt{\textbackslash}} (e.g. \texttt{\textbackslash +})
3297
-
\item\textbf{egrep} (extended version, launched as \texttt{grep -E \ldots} or \texttt{egrep \ldots}), \textbf{sed} with extended reg exp (\texttt{sed -r}) and \textbf{perl} \alert{not} (simply e.g. \texttt{+})
3298
-
\item Read \url{https://en.wikibooks.org/wiki/Regular_Expressions}, \url{https://www.grymoire.com/Unix/Regular.html}, \url{https://www.regular-expressions.info/}; česky \url{http://www.nti.tul.cz/~satrapa/docs/regvyr/}, \url{https://www.root.cz/serialy/regularni-vyrazy/} a~\url{https://www.regularnivyrazy.info/}
\item See sed examples, slide~\ref{sedex}; and next slide
3301
-
\item macOS has by default very outdated version of \texttt{sed} and another tools --- it does not have all advanced features --- users need to install e.g. \texttt{gnu-sed} formulae from \href{https://brew.sh/}{Homebrew} (slide~\ref{homebrew})
3302
3312
\end{itemize}
3303
3313
\end{frame}
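The escaping rules listed above can be tried on a tiny input. The following is only a minimal sketch: the three sample lines are made up for illustration, and it assumes a \texttt{grep} that supports both basic and extended syntax (e.g. GNU Grep). Each command prints the number of matching lines.

```shell
# Hypothetical sample input (three short lines), printed by a helper function
sample() { printf 'AAAC\nCGT\nA.C\n'; }

# Basic syntax: interval braces must be escaped by backslashes
sample | grep -c 'A\{2,\}'    # counts lines with at least two consecutive A

# Extended syntax (grep -E): the same pattern without backslashes
sample | grep -E -c 'A{2,}'

# Escaped dot matches only a literal dot
sample | grep -c 'A\.C'

# Unescaped dot matches any single character, so more lines match
sample | grep -c 'A.C'
```

The first three commands report one matching line each, the last one reports two, because the unescaped dot also matches the middle \texttt{A} in \texttt{AAAC}.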
3304
3314
3305
3315
\begin{frame}[fragile]{Grep and sed examples I}
3316
+
\begin{itemize}
3317
+
\item Be sure to understand all syntax on this and the following slides\ldots
3318
+
\end{itemize}
3306
3319
\begin{bashcode}
3307
3320
# Extract sequences with at least 5 A bases in line
\item Remove "\texttt{S}" codes, replace underscore by dot and space (\texttt{. }), and capitalize initial "\texttt{o}" in FASTA names in \texttt{oxalis\_assembly\_6235.aln.fasta}, e.g. from \texttt{>o\_annae\_S499} to \texttt{>O. annae}.
3334
-
\item Extract from \texttt{arabidopsis.vcf.gz} values of \texttt{DP} (only numbers), sort them and print on single line, separated by commas.
3335
-
\item Determine, which sequence of \texttt{Oxalis\_HybSeq\_nrDNA\_selection\_alignment.fasta} has the longest block of missing data (\texttt{N}) or spaces (\texttt{-}).
# Create list of samples (e.g. as input in script for some application)
3352
+
SAMPLESLIST=$(find . -name "*.jpg" | sed 's/^\.\///' | sed 's/^/-I /' |
3353
+
tr "\n" " ")
3354
+
echo $SAMPLESLIST # What would be difference from quoted "$SAMPLESLIST"?
3355
+
application $SAMPLESLIST -method X -out Y ... # Rationale of such listing
3356
+
\end{bashcode}
3357
+
\end{frame}
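Remembered groups, \texttt{\&} and alternation from the syntax overview can be illustrated in the same way. This is a small sketch with made-up input strings, not part of the course data; it assumes a \texttt{sed} that accepts \texttt{-E} for extended regular expressions (in GNU Sed equivalent to \texttt{-r}).

```shell
# Remembered groups in basic syntax: escaped parentheses, recalled by \1 and \2
echo 'annae_oxalis' | sed 's/\([a-z]*\)_\([a-z]*\)/\2 \1/'

# The same swap with extended syntax: no backslashes before parentheses
echo 'annae_oxalis' | sed -E 's/([a-z]+)_([a-z]+)/\2 \1/'

# & in the replacement stands for the whole matched text
echo 'sample42' | sed 's/[0-9][0-9]*/<&>/'

# Alternation in extended syntax: count lines containing "cat" or "dog"
printf 'cat\ndog\nbird\n' | grep -E -c 'cat|dog'
```

Both swaps print \texttt{oxalis annae}, the third command prints \texttt{sample<42>}, and the last one reports two matching lines.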
3358
+
3359
+
\begin{frame}[fragile]{Regular expressions tasks}
3360
+
\begin{enumerate}
3361
+
\item Remove "\texttt{S}" codes, replace underscore by dot and space (\texttt{. }), and capitalize initial "\texttt{o}" in FASTA names in \texttt{oxalis\_assembly\_6235.aln.fasta}, e.g. from \texttt{>o\_annae\_S499} to \texttt{>O. annae}.
3362
+
\item Extract from \texttt{arabidopsis.vcf.gz} values of \texttt{DP} (only numbers), sort them and print on single line, separated by commas.
3363
+
\item Determine which sequence(s) of \texttt{Oxalis\_HybSeq\_nrDNA\_selection\_alignment.fasta} have a block of missing data (\texttt{N}) or gaps (\texttt{-}) longer than 10~bp.
3364
+
\item Using \texttt{sed}, remove the column \texttt{Description} (\texttt{"Assembly of \# reads: \ldots "}) from file \texttt{cut\_awk\_test\_file.tsv}.
3365
+
\item Think about any task (manipulation with your data,~\ldots) you are (sometimes) dealing with, which could be simplified/solved by using regular expressions. Try to solve it. Discuss it with others.