skbio.alignment.TabularMSA.join

TabularMSA.join(other, how='strict')

Join this MSA with another by sequence (horizontally).

Sequences will be joined by index labels. MSA positional_metadata will be joined by columns. Use how to control join behavior.

The alignment is not recomputed during the join operation (see the Notes section for details).

Parameters:

other : TabularMSA

MSA to join with. Must have same dtype as this MSA.

how : {'strict', 'inner', 'outer', 'left', 'right'}, optional

How to join the sequences and MSA positional_metadata:

  • 'strict': MSA indexes and positional_metadata columns must match
  • 'inner': an inner-join of the MSA indexes and positional_metadata columns (only the shared set of index labels and columns are used)
  • 'outer': an outer-join of the MSA indexes and positional_metadata columns (all index labels and columns are used). Unshared sequences will be padded with the MSA’s default gap character (TabularMSA.dtype.default_gap_char). Unshared columns will be padded with NaN.
  • 'left': a left-outer-join of the MSA indexes and positional_metadata columns (this MSA’s index labels and columns are used). Padding of unshared data is handled the same as 'outer'.
  • 'right': a right-outer-join of the MSA indexes and positional_metadata columns (other index labels and columns are used). Padding of unshared data is handled the same as 'outer'.
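
The five modes can be pictured with a small pure-Python model. This is only a conceptual sketch, not scikit-bio's implementation: the helpers `join_labels` and `join_rows` are hypothetical, MSAs are stood in for by plain dicts mapping index label to sequence string, and `GAP` is an assumed stand-in for TabularMSA.dtype.default_gap_char. The label order it produces is illustrative only, since the real method makes no ordering guarantee.

```python
GAP = "-"  # assumed stand-in for TabularMSA.dtype.default_gap_char


def join_labels(left, right, how="strict"):
    """Return the index labels kept by each join mode (hypothetical helper)."""
    left_set, right_set = set(left), set(right)
    if how == "strict":
        # Both sides must carry exactly the same labels.
        if left_set != right_set:
            raise ValueError("index labels must match for how='strict'")
        return list(left)
    if how == "inner":
        # Only labels present on both sides.
        return [label for label in left if label in right_set]
    if how == "outer":
        # Every label from either side.
        return list(left) + [label for label in right if label not in left_set]
    if how == "left":
        return list(left)
    if how == "right":
        return list(right)
    raise ValueError(f"invalid how: {how!r}")


def join_rows(left, right, how="strict"):
    """Concatenate rows label by label, gap-padding rows missing on one side."""
    left_width = len(next(iter(left.values()), ""))
    right_width = len(next(iter(right.values()), ""))
    return {
        label: left.get(label, GAP * left_width)
        + right.get(label, GAP * right_width)
        for label in join_labels(left, right, how)
    }


msa1 = {"a": "AC", "b": "A-", "c": "-C"}
msa2 = {"b": "G-T", "a": "T--", "z": "ACG"}

inner = join_rows(msa1, msa2, how="inner")
outer = join_rows(msa1, msa2, how="outer")
left_joined = join_rows(msa1, msa2, how="left")
right_joined = join_rows(msa1, msa2, how="right")
```

In this model, a label found on only one side receives a run of gap characters as wide as the other MSA, which is why every joined row has the same length.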

Returns:

TabularMSA

Joined MSA. There is no guaranteed ordering to its index (call sort to define one).

Raises:

ValueError

If how is invalid.

ValueError

If either the index of this MSA or the index of other contains duplicates.

ValueError

If how='strict' and this MSA’s index doesn’t match with other.

ValueError

If how='strict' and this MSA’s positional_metadata columns don’t match with other.

TypeError

If other is not an instance of TabularMSA or one of its subclasses.

TypeError

If the dtype of other does not match this MSA’s dtype.

Notes

The join operation does not automatically perform re-alignment; sequences are simply joined together. Therefore, this operation is not necessarily meaningful on its own.

The index labels of this MSA must be unique. Likewise, the index labels of other must be unique.

The MSA-wide and per-sequence metadata (TabularMSA.metadata and Sequence.metadata) are not retained on the joined TabularMSA.

The positional metadata of the sequences will be outer-joined, regardless of how (using Sequence.concat(how='outer')).

If the join operation results in a TabularMSA without any sequences, the MSA’s positional_metadata will not be set.

Examples

Join MSAs by sequence:

>>> from skbio import DNA, TabularMSA
>>> msa1 = TabularMSA([DNA('AC'),
...                    DNA('A-')])
>>> msa2 = TabularMSA([DNA('G-T'),
...                    DNA('T--')])
>>> joined = msa1.join(msa2)
>>> joined
TabularMSA[DNA]
---------------------
Stats:
    sequence count: 2
    position count: 5
---------------------
ACG-T
A-T--

Sequences are joined based on MSA index labels:

>>> msa1 = TabularMSA([DNA('AC'),
...                    DNA('A-')], index=['a', 'b'])
>>> msa2 = TabularMSA([DNA('G-T'),
...                    DNA('T--')], index=['b', 'a'])
>>> joined = msa1.join(msa2)
>>> joined
TabularMSA[DNA]
---------------------
Stats:
    sequence count: 2
    position count: 5
---------------------
ACT--
A-G-T
>>> joined.index
Index(['a', 'b'], dtype='object')

By default both MSA indexes must match. Use how to specify an inner join:

>>> msa1 = TabularMSA([DNA('AC'),
...                    DNA('A-'),
...                    DNA('-C')], index=['a', 'b', 'c'],
...                   positional_metadata={'col1': [42, 43],
...                                        'col2': [1, 2]})
>>> msa2 = TabularMSA([DNA('G-T'),
...                    DNA('T--'),
...                    DNA('ACG')], index=['b', 'a', 'z'],
...                   positional_metadata={'col2': [3, 4, 5],
...                                        'col3': ['f', 'o', 'o']})
>>> joined = msa1.join(msa2, how='inner')
>>> joined
TabularMSA[DNA]
--------------------------
Positional metadata:
    'col2': <dtype: int64>
Stats:
    sequence count: 2
    position count: 5
--------------------------
A-G-T
ACT--
>>> joined.index
Index(['b', 'a'], dtype='object')
>>> joined.positional_metadata
   col2
0     1
1     2
2     3
3     4
4     5

When performing an outer join ('outer', 'left', or 'right'), unshared sequences are padded with gaps and unshared positional_metadata columns are padded with NaN:

>>> joined = msa1.join(msa2, how='outer')
>>> joined
TabularMSA[DNA]
----------------------------
Positional metadata:
    'col1': <dtype: float64>
    'col2': <dtype: int64>
    'col3': <dtype: object>
Stats:
    sequence count: 4
    position count: 5
----------------------------
ACT--
A-G-T
-C---
--ACG
>>> joined.index
Index(['a', 'b', 'c', 'z'], dtype='object')
>>> joined.positional_metadata
   col1  col2 col3
0  42.0     1  NaN
1  43.0     2  NaN
2   NaN     3    f
3   NaN     4    o
4   NaN     5    o