LIBRARY
    #include <biolibc/align.h>
    -lbiolibc -lxtend

SYNOPSIS
    size_t  bl_align_map_seq_sub(const bl_align_t *params,
                const char *big, size_t big_len,
                const char *little, size_t little_len)

ARGUMENTS
    params  bl_align_t parameters.  Only min_match and
            max_mismatch_percent are used.
    big     Sequence to be searched for matches to little
    little  Sequence to be located within big

DESCRIPTION
    Locate the leftmost (farthest 5') match for sequence little within
    sequence big, tolerating the given percentage of mismatched bases.

    The content of little is assumed to be all upper case.  This improves
    speed by avoiding numerous redundant toupper() conversions on the same
    string, assuming multiple big strings will be searched for little, as
    in adapter removal and read mapping.  Use strlupper(3) or strupper(3)
    before calling this function if necessary.

    A minimum of min_match bases must match between little and big.  This
    mainly matters near the end of big, where fewer bases remain than the
    length of little.

    A maximum of max_mismatch_percent mismatched bases are tolerated to
    allow for read errors.  This is taken as a percent of the length of
    little, or the same percent of the remaining bases in big, whichever
    is smaller.  Note that the NUMBER of mismatched bases tolerated is
    truncated from the percent calculation.  E.g. using 10% tolerance,
    0 mismatched bases are tolerated among 9 total bases, or 1 mismatch
    among 10 total.

    Higher values of max_mismatch_percent will result in slightly longer
    run times, more alignments detected, and a higher risk of false
    positives (falsely identifying other big sequences as matching
    little).

    Indels (insertions and deletions) are not currently handled.

    Note that alignment is not an exact science.  We cannot detect every
    true little sequence without falsely detecting other sequences, since
    it is impossible to know whether any given sequence is really from
    the source of interest (e.g. an adapter) or naturally occurring from
    another source.
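The matching rules described above (leftmost match, truncated mismatch allowance, min_match near the end of big) can be illustrated with a short sketch. This is not the biolibc implementation. The names sketch_max_mismatches() and sketch_map_seq_sub() are hypothetical, and the sketch assumes that min_match counts matching bases within the compared window.

```c
#include <ctype.h>
#include <stddef.h>

/*
 *  Hypothetical helper: number of mismatches allowed among `bases`
 *  compared bases.  Integer division truncates, so 10% of 9 bases
 *  allows 0 mismatches and 10% of 10 bases allows 1.
 */
static size_t sketch_max_mismatches(size_t bases, unsigned percent)
{
    return bases * percent / 100;
}

/*
 *  Hypothetical sketch of the search.  Returns the leftmost index in
 *  big where little (assumed to be upper case already) matches within
 *  the tolerance, or big_len if no such position exists.
 *  Assumption: min_match is the minimum number of MATCHING bases.
 */
static size_t sketch_map_seq_sub(const char *big, size_t big_len,
                                 const char *little, size_t little_len,
                                 size_t min_match,
                                 unsigned max_mismatch_percent)
{
    for (size_t start = 0; start < big_len; ++start)
    {
        /* Near the end of big, only the remaining bases are compared */
        size_t remaining = big_len - start;
        size_t cmp_len = remaining < little_len ? remaining : little_len;
        size_t allowed = sketch_max_mismatches(cmp_len,
                                               max_mismatch_percent);
        size_t mismatches = 0;

        /* Stop comparing as soon as the allowance is exceeded */
        for (size_t c = 0; c < cmp_len && mismatches <= allowed; ++c)
            if (toupper((unsigned char)big[start + c]) != little[c])
                ++mismatches;

        if (mismatches <= allowed && cmp_len - mismatches >= min_match)
            return start;
    }
    return big_len;     /* Not found: index of the null terminator */
}
```

Because the allowance is recomputed from the compared length at each position, a partial match at the 3' end of big is held to the same percentage as a full-length match, and min_match keeps very short partial matches from being reported.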
    The best we can do is estimate which settings will provide the most
    true positives (best statistical power) and the fewest false
    positives.

    In the case of adapter removal, it is also not usually important to
    remove every adapter, but only to minimize adapter contamination.
    Failing to align a small percentage of sequences due to adapter
    contamination will not change the story told by the downstream
    analysis.  Nor will erroneously trimming off the 3' end of a small
    percentage of reads containing natural sequences resembling adapters.

    Just trimming exact matches of the adapter sequence will generally
    remove 99% or more of the adapter contamination and minimize false
    positives.  Tolerating 1 or 2 differences has been shown to do
    slightly better overall.  Modern read mapping software is also
    tolerant of adapter contamination and can clip adapters as needed.

RETURN VALUES
    Index of little sequence within big if found, index of the null
    terminator of big (i.e. big_len) otherwise

EXAMPLES
    bl_align_t  params;
    bl_fastq_t  read;
    char        *adapter;
    size_t      index;

    /* read and adapter are assumed to be initialized elsewhere */
    bl_align_set_min_match(&params, 3);
    bl_align_set_max_mismatch_percent(&params, 10);
    index = bl_align_map_seq_sub(&params,
                BL_FASTQ_SEQ(&read), BL_FASTQ_SEQ_LEN(&read),
                adapter, strlen(adapter));
    if ( index != BL_FASTQ_SEQ_LEN(&read) )
        bl_fastq_3p_trim(&read, index);

SEE ALSO
    bl_align_map_seq_exact(3), bl_align_set_min_match(3),
    bl_align_set_max_mismatch_percent(3), bl_fastq_3p_trim(3)