Discussion:
[ORDNEWS:1593] log, sqrt and other transformation with Bray-Curtis dissimilarity
Michele Scardi
2010-07-10 08:15:47 UTC
A student of mine recently showed me an NMDS ordination of fish
assemblages, which was based on Bray-Curtis dissimilarity computed on
log-transformed data.

I told her that log-transforming data before computing BC did not make
sense to me, because the original interpretation of the BC dissimilarity
(the ratio between the sum of the differences between two samples and
the overall sum of the specimens found) would be lost.
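
A minimal sketch of the ratio described above, assuming the usual form of Bray-Curtis dissimilarity (sum of absolute differences over the sum of all abundances). The two count vectors are hypothetical, not the student's data; the point is only that the same formula gives a different value, with a less direct reading, once log(x+1) is applied first.

```python
import math

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum|x_i - y_i| / sum(x_i + y_i)."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den if den else 0.0

# Hypothetical counts of four species in two samples (illustration only).
site_a = [120, 10, 3, 0]
site_b = [40, 12, 0, 5]

# Raw counts: 90 differing specimens out of 190 caught in total.
raw = bray_curtis(site_a, site_b)

# Same formula after log(x + 1); numerator and denominator are now sums of logs.
logged = bray_curtis([math.log1p(v) for v in site_a],
                     [math.log1p(v) for v in site_b])

print(f"raw BC = {raw:.3f}, log(x+1) BC = {logged:.3f}")
```

On raw counts the value can be read directly as the fraction of specimens that differ between the two samples; after the transformation that reading no longer applies.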

She argued that I was probably right, but that she had read many papers
based on this approach. As for me, I had never noticed so many papers
based on log-transformed data and BC, but I ran a quick bibliographic
search and was surprised by the number of papers using this approach.

I cannot see why one should log-, sqrt- or sqrt(sqrt)-transform the
data before computing BC, which is meant to measure relative
differences, not quantitative differences. I am afraid that most people
just want to normalize data distributions even when normalization is
not really necessary. And the result of unnecessary normalization is
that the interpretation of distances/dissimilarities can be much less
straightforward than with raw data.

However, I'd really like to read other opinions about this!

All the best,

Michele
--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Michele Scardi
Associate Professor of Ecology

Department of Biology
Tor Vergata University
Rome, Italy

http://www.michele.scardi.name
http://www.mare-net.com/mscardi
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Dave Roberts
2010-07-11 03:19:30 UTC
Michele,

log transformation (and square root, among others) applies a concave
transformation to the scale, which is often very helpful. It tends to
emphasize differences in smaller values and de-emphasize small
differences in large values. E.g. cover of 1% vs 2% is a difference of 1%
but an increase of 100%. 50% cover vs 51% cover is also a
difference of 1% but a negligible increase on a relative scale.
Remember, BC sums before division, rather than averaging species
relative differences. So, if the original scale was broad, a log
transform can be very useful in down-weighting dominants and bringing
out the signal in lesser species.
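
The 1% vs 2% and 50% vs 51% comparison can be checked numerically. The sketch below assumes log(x+1) and square root as the transformations; the variable names are illustrative only.

```python
import math

# The two pairs of cover values from the example: same absolute difference,
# very different relative difference.
pairs = [(1, 2), (50, 51)]

raw_diff = {}
sqrt_diff = {}
log_diff = {}
for low, high in pairs:
    raw_diff[(low, high)] = high - low
    sqrt_diff[(low, high)] = math.sqrt(high) - math.sqrt(low)
    log_diff[(low, high)] = math.log1p(high) - math.log1p(low)
    print(f"{low} vs {high}: raw {raw_diff[(low, high)]}, "
          f"sqrt {sqrt_diff[(low, high)]:.3f}, "
          f"log(x+1) {log_diff[(low, high)]:.3f}")
```

Both pairs contribute the same amount to a sum of raw differences, while after either transformation the 1 vs 2 pair contributes far more than the 50 vs 51 pair.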

If the original scale was counts of individuals, what was the
range? If some other scale, what was that scale?

Dave
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
David W. Roberts office 406-994-4548
Professor and Head FAX 406-994-3190
Department of Ecology email droberts-***@public.gmane.org
Montana State University
Bozeman, MT 59717-3460
Salvador Herrando-Perez
2010-07-11 16:56:59 UTC
I concur with the other views already posted to the list that a transformation
can change a multiplicative scale into an additive one, and that transformations
(in ordination analyses) aid in accounting for the effect of ‘dominant versus
rare species’ in a data set of ‘species x site abundances’. I always recall the
point in Legendre and Legendre (1998) that the amount of information a species
contributes to a numerical analysis (like an ordination) increases with its
variance; however, higher variance does not necessarily mean greater
biological importance. Therefore species that are extremely abundant in some
sites and poorly represented in others will dominate an MDS ordination of sites,
and in those circumstances we will be unable to detect the effect of other
species which might also be of biological interest.

Along those lines, and as a mere exploratory tool, I have always found it useful
to plot average density against variance of each species across sites (i.e. the
plot has as many points as species, with each species' average density on the
abscissa and its variance on the ordinate). Logically, when relative densities
among taxa are high the emerging relationship between averages and variances
will be clearly linear, and the slope of that relationship will be relatively
pronounced (typically between 0.5 and 1, very much like a “pseudo Taylor’s power
law”).

After applying a transformation and plotting averages versus variances for the
transformed data, the slope will be reduced, the left side of the linear
relationship will be broken down (species variances have been leveled out),
while the range between maximum and minimum values for both axes (i.e. the scale
of variation) will also be reduced. Stronger transformations (e.g. log) will
reduce the slope and break the left side of the relationship more than milder
transformations (e.g. sqrt).
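
A rough sketch of this mean-versus-variance check, using a small hypothetical species x site table (three species, five sites) rather than real data. It only tabulates the points one would plot and summarises the spread of variances as a max/min ratio; that summary is an assumption made here for brevity, not part of the procedure described above.

```python
import math
from statistics import mean, variance

# Hypothetical counts per species across five sites (illustration only).
counts = {
    "dominant": [200, 5, 150, 0, 90],
    "common": [20, 8, 15, 2, 10],
    "rare": [1, 0, 2, 0, 1],
}

transforms = {
    "raw": lambda v: v,
    "sqrt": math.sqrt,
    "log(x+1)": math.log1p,
}

variance_ratio = {}
for name, f in transforms.items():
    points = {}
    for species, row in counts.items():
        values = [f(v) for v in row]
        # One (mean, variance) point per species, as in the exploratory plot.
        points[species] = (mean(values), variance(values))
    variances = [var for _, var in points.values()]
    # Spread of species variances: largest over smallest.
    variance_ratio[name] = max(variances) / min(variances)
    print(name, {sp: (round(m, 2), round(v, 2)) for sp, (m, v) in points.items()})

print({name: round(r, 1) for name, r in variance_ratio.items()})
```

With this toy table the spread of variances shrinks from raw to sqrt to log(x+1), which is the levelling-out effect described above.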

Bray-Curtis values will be dominated by dominant species if no transformation is
applied to the raw data: whether that is what we are after is a matter of our
research question.

Salva
Frye, Chris
2010-07-12 15:54:04 UTC
Michele,

I found O'Hara and Kotze (2010, Do not log-transform count data, Methods in Ecology and Evolution) helpful. The paper seems fairly specific to discrete count data, and I assume that your student has count data. My general query is this: if you apply a transformation to a matrix of samples x species data (count, cover), then what is the rationale for choosing Bray-Curtis over Euclidean or any other distance measure in NMDS? Additionally, you will have to fudge the zeroes in the data by adding a constant, so you are really doing two transformations. Perhaps it is more important to check the data for outliers, delete those points, and run NMDS again to see whether those points had an effect, rather than squashing all the data at the beginning. Just thinking out loud. I would like to hear other opinions on this subject.

Christopher T. Frye
State Botanist and Biometry student
Maryland Department of Natural Resources
Wildlife and Heritage Service
Natural Heritage Program
Wye Mills Field Station
909 Wye Mills Road
PO Box 68
Wye Mills, MD 21679


Michele Scardi
2010-07-12 21:04:24 UTC
> ...My general query is that if you apply a
> transformation to a matrix of samples x species data (count,
> cover)then what is the rationale for choosing bray-curtis over
> euclidean or any other distance measure in NMDS? ...
Chris,

I fully agree with you, and this is exactly the reason why I never
log-transformed (or sqrt-, etc.) data before using Bray-Curtis, and the
reason why I started this thread.

Log-transformation makes sense to me when I use quantitative
distances that are not based on ratios (e.g. euclidean or Manhattan),
but I cannot understand the meaning of a sum of log differences weighted
over a sum of logs, as with BC.

If I want to give equal relevance to species that are very
different in their abundances, I would rather use something like the
Canberra metric, not logs. Basically, I think that the simplest path
from data to distance/dissimilarity is always the best choice (the poor
ecologist's Occam's razor?).

Obviously, log-transformation is useful in many cases (e.g. when the
data involve dilution/concentration processes, when you need
to normalize skewed data distributions, etc.).
> However, I'd really like to read other opinions about this!
Me too!

All the best,

Michele
Dave Roberts
2010-07-12 22:48:01 UTC
Dear Michele and Chris,

I weighed in on this yesterday so I will try to be brief. You have
two decisions: (1) whether or not to transform abundances, and (2)
whether to use a dissimilarity or a distance. There is certainly
interaction among the two decisions, but in my opinion it is not the
case that transformations are only appropriate for distances, whereas
dissimilarities (such as BC) should only use raw data. Transformations
are chosen in either case to emphasize or de-emphasize some ranges of
the scale.

In response to Chris's direct question, the reason to prefer BC
over ED in NMDS is that BC considers both what samples have in common
and what is different between them. ED only considers the differences,
and ignores what may be a great deal in common.

The Canberra metric is an interesting choice, but relativizes each
species differently for each pair of samples. If you have three samples
A, B, and C with a species abundance 0, 1, 100, then comparison AB gives
(1-0)/(1+0) = 1, whereas comparison AC gives (100-0)/(100+0) = 1. So it
thinks that a difference of 1 and 100 is equal in contribution to the
pairwise distances. Alternatively, BC of log(x+1) treats those cases
as quite different because you sum before dividing. If the data are
counts, log(c+1) gives 0 for absence, just as you would like. O'Hara and
Kotze's admonition was for regression I believe, where they preferred
Poisson regression over linear regression of log data (and I do as well).
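
The A, B, C example can be worked through as below. This is a sketch under assumptions: the Canberra metric is taken as the plain sum of per-species |x - y| / (x + y) terms (skipping species absent from both samples), and a second species with abundance 50 in every sample is added hypothetically so that "sum before dividing" has something to sum.

```python
import math

def canberra(x, y):
    """Sum of per-species |x - y| / (x + y), skipping joint absences."""
    return sum(abs(a - b) / (a + b) for a, b in zip(x, y) if a + b > 0)

def bray_curtis(x, y):
    """Sum of absolute differences divided by the sum of all abundances."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den if den else 0.0

def log1p_all(x):
    return [math.log1p(v) for v in x]

# First species: 0, 1, 100 as in the example above.
# Second species (50 everywhere) is a hypothetical addition.
A = [0, 50]
B = [1, 50]
C = [100, 50]

canberra_ab = canberra(A, B)
canberra_ac = canberra(A, C)
bc_log_ab = bray_curtis(log1p_all(A), log1p_all(B))
bc_log_ac = bray_curtis(log1p_all(A), log1p_all(C))

print(f"Canberra    A-B {canberra_ab:.3f}   A-C {canberra_ac:.3f}")
print(f"BC log(x+1) A-B {bc_log_ab:.3f}   A-C {bc_log_ac:.3f}")
```

Canberra scores the two comparisons identically, while Bray-Curtis on log(x+1) data keeps A-C clearly larger than A-B.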

To repeat my question from yesterday, what is the range of the
student's data?

My 2 cents, Dave
Michele Scardi
2010-07-12 23:46:54 UTC
> ...Transformations are chosen in either case to emphasize or
> de-emphasize some ranges of the scale.
Dave,

this is a very good point, but I think that transformations should be
consistent with the rationale supporting a similarity or distance
coefficient. In case you're using a coefficient that sums up
differences, log-transformation can certainly emphasize or de-emphasize
some contributions to the overall similarity or distance between two
samples, and this can be very useful. When it comes to coefficients that
involve ratios, however, the effects of log-transformation are much more
complex and possibly unpredictable. That is the reason for my concern.
> The Canberra metric is an interesting choice, but relativizes each
> species differently for each pair of samples...
I agree with you, it was just an example. It often gives too much weight
to the contribution of rare species, whose presence or absence can be
essentially random (when 0 does not mean that a species is actually
absent from that area, but only that its density is very low, and
therefore the probabilities of finding 1 or 0 specimens are very similar).
BTW, you can experience very similar problems with log-transformed data
when that transformation is not strictly needed.
> To repeat my question from yesterday, what is the range of the
> student's data?
It's 0 to a (very) few hundred, as the data are counts of fish
juveniles caught with hand-towed nets (same nets and same fishermen
everywhere). Juvenile fish size is rather homogeneous among different
species and therefore differences in abundance are independent of
differences in fish size. No catches for a species do not mean that the
species is actually absent from the sampling site, therefore the role of
very low counts should not be emphasized too much. In this framework, I
would not consider log-transformation and I think that BC computed on
raw data is a good choice.

All the best,

Michele
Sean Porter
2010-07-13 09:26:12 UTC
I have been performing MDS and cluster analyses with both BC & ED as
distance measures depending on the type of data for a few years now and I
agree with what Dave and Salva are saying. For biological data (e.g. percent
cover of species or species fish counts per site), patterns often become
clearer when you down-weight dominants and bring out the signal in lesser
species by using transformations. I have often square root or log(x + 1)
transformed biological data when trying to detect biogeographic breaks or
discontinuities. I have found that untransformed data leaves a lot of
"noise" in a cluster diagram for example, while transforming the data and
down-weighting dominant species creates a much clearer cluster diagram with
more obvious differences and consistent clusters. For me that is one of the
key reasons for transforming biological data.

To answer Christopher's question about whether to choose BC or ED: it depends
on one's data. Most biological data of, say, fish counts per site will have a
high number of zeros... in fact the matrix may be dominated by zeros. This
becomes a problem when one uses many distance measures such as ED or
Manhattan, as sites start becoming similar because of the ABSENCE of various
fish species. It doesn't make biological sense that two sites are
similar to each other because they lack the same species (e.g. it's like
saying Africa & America are similar because they lack kangaroos). The
problem is that in typical species/sample matrices, because of the large
number of zeros, such shared absences dominate the analysis. Expressed as a
similarity, BC takes a value of 100 when two sites are identical but, most
importantly, unlike many other measures it takes the value of 0 when two
sites have NO species in common. BC also has a number of other advantages
and seems to be the most appropriate for analysing a species/sample matrix.
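
A small illustration of the contrast, with three hypothetical sites over six species. It assumes BC similarity is expressed on a 0-100 scale as 100 * (1 - dissimilarity). Sites A and B share no species at all, while A and C hold the same species at different abundances.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def bc_similarity(x, y):
    """Bray-Curtis similarity on a 0-100 scale (100 = identical samples)."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return 100.0 * (1.0 - num / den) if den else 100.0

# Hypothetical counts; the last two species are absent from all three sites.
site_a = [2, 1, 0, 0, 0, 0]
site_b = [0, 0, 1, 2, 0, 0]    # no species in common with site_a
site_c = [20, 10, 0, 0, 0, 0]  # same species as site_a, ten times the counts

ed_ab = euclidean(site_a, site_b)
ed_ac = euclidean(site_a, site_c)
sim_ab = bc_similarity(site_a, site_b)
sim_ac = bc_similarity(site_a, site_c)

print(f"Euclidean: A-B {ed_ab:.2f}, A-C {ed_ac:.2f}")
print(f"BC similarity: A-B {sim_ab:.1f}, A-C {sim_ac:.1f}")
```

Euclidean distance places A closer to B than to C even though A and B share nothing, whereas BC similarity is 0 for A-B and positive for A-C.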

all the best Sean


Dr. Sean Porter
12 Forest Lane
Hilton 3245
KwaZulu-Natal
South Africa
www.drseanporter.weebly.com
+27 82 5148014
+27 33 3434163
Mike Austin
2010-07-14 06:15:01 UTC
Log, sqrt and other transformation with Bray-Curtis dissimilarity
I would suggest the discussion is relevant to a much more general issue.

Statements (see below) about importance or lack of it for dominant species imply a model of what the important ecological properties of the species/site data matrix are.

e.g. Dave Roberts:
"It tends to emphasize differences in smaller values and de-emphasize small differences in large values. E.g. cover of 1% vs 2% is difference of 1% but an increase of 100%. 50% percent cover vs 51% cover is also a difference of 1% but a negligible increase on a relative scale."
Why and when is it ecologically meaningful to de-emphasise large values?

Michele Scardi:
"I think that transformations should be consistent with the rationale supporting a similarity or distance coefficient. In case you're using a coefficient that sums up differences, log-transformation can certainly emphasize or de-emphasize some contributions to the overall similarity or distance between two samples, and this can be very useful".
Why and when is this useful?

Sean Porter:
"For biological data (e.g. percent cover of species or species fish counts per site), patterns often become clearer when you down-weight dominants and bring out the signal in lesser species by using transformations."
What types of pattern become clearer? Why and When?

These implicit models need to be expressed explicitly and tested against what we know about:
1. species' responses to ecological gradients: are they bell-shaped or skewed?
2. responses of collective properties (e.g. dominance, species richness and stand abundance) to the same ecological gradients: are they random, bell-shaped or bimodal?

I suggest the usefulness of any transformations etc. will depend on what part of the ecological space the dataset samples.

The implications of having explicit models for ordination and other analysis methods are, I suggest, best examined by generating artificial datasets based on the models and exploring the performance of different methods and their consistency with the postulated models. This makes clear the weaknesses of methods, standardisations and models.
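
As a very small sketch of what generating an artificial dataset from an explicit model can look like, the code below assumes symmetric bell-shaped (Gaussian) responses along a single gradient. That choice, the three species and all parameter values are hypothetical; skewed responses or several gradients would need a different response function.

```python
import math

def gaussian_response(x, optimum, tolerance, max_abundance):
    """Expected abundance at gradient position x under a bell-shaped response."""
    return max_abundance * math.exp(-((x - optimum) ** 2) / (2 * tolerance ** 2))

# Hypothetical species: (optimum, tolerance, maximum abundance).
species = [
    (0.2, 0.10, 200),  # abundant, narrow niche
    (0.5, 0.20, 40),   # moderately abundant, broad niche
    (0.8, 0.15, 10),   # scarce
]

# Eleven sites evenly spaced along a gradient scaled from 0 to 1.
sites = [i / 10 for i in range(11)]

# Site x species matrix of expected counts (no sampling noise added).
community = [[round(gaussian_response(x, *sp)) for sp in species] for x in sites]

for x, row in zip(sites, community):
    print(f"{x:.1f}", row)
```

Because the true site positions are known, any dissimilarity and transformation can then be judged by how well the resulting ordination recovers them.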

References which hopefully will clarify these brief comments are:
Austin, M. P. & Smith, T. M. 1989. A new model for the continuum concept. Vegetatio 83: 35-47.
Faith, D. P., Minchin, P. R. & Belbin, L. 1987. Compositional dissimilarity as a robust measure of ecological distance. Vegetatio 69: 57-68.
Minchin, P. R. 1987a. An evaluation of the relative robustness of techniques for ecological ordination. Vegetatio 69: 89-107.
Minchin, P. R. 1987b. Simulation of multidimensional community patterns: towards a comprehensive model. Vegetatio 71: 145-156.
Minchin, P. R. 1989. Montane vegetation of the Mt. Field Massif, Tasmania: a test of some hypotheses about properties of community patterns. Vegetatio 83: 97-110.
Recent alternative approaches:
Heikkinen, J. & Mäkipää, R. 2010. Testing hypotheses on shape and distribution of ecological response curves. Ecological Modelling 221: 388-399.
Clarke, K. R., Somerfield, P. J. & Chapman, M. G. 2006. On resemblance measures for ecological studies, including taxonomic dissimilarities and a zero-adjusted Bray-Curtis coefficient for denuded assemblages. Journal of Experimental Marine Biology and Ecology 330: 55-80.
Hurst, C. P., Catterall, C. P. & Chaseling, J. 2008. A comparison of two methods for generating artificial multi-assemblage ecological datasets. Ecological Informatics 3: 286-294.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Mike Austin
CSIRO Sustainable Ecosystems

GPO Box 284, Canberra, ACT 2601 Australia
Tel: 61-(0)2-6242-1758; Fax: 61-(0)2-6242-1555
E-Mail : Mike.Austin-***@public.gmane.org
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Salvador Herrando-Perez
2010-07-14 08:15:48 UTC
Following on from Mike Austin's email, the performance of different association
indexes (similarity, dissimilarity, distances) has not been explored in much
detail, so work like that by Peter Minchin in the 1980s, or by Bob Clarke in the
1990s and more recently, is certainly exceptional. Simulations with sets of
artificial data with contrasting properties (e.g. dominance versus rareness,
gradients versus discontinuities, bell-shaped versus other types of species'
responses...) certainly offer a way of doing that. I guess many of us have
experienced that when community gradients and discontinuities are very clear
across sites, they can be recovered with an entire range of association indexes,
with and without transformations, irrespective of the type of species' response
assumed (if any), and thus, even more awkwardly, a PCA and an nMDS based on B-C
can produce very similar ordinations...
Dave Roberts
2010-07-17 19:33:10 UTC
Certainly I would not argue with Mike about the importance of a model of
species distributions (and thanks for the citations on Clarke et al and
Hurst et al; I had not seen those). Unfortunately, in my data sets the
vast majority of species are too infrequent to model (even P/A as
logistic regression) but still have significantly restricted
distributions in dissimilarity space, and apparently ecological
information content.

So, in the absence of that sort of modeling effort I think it's
important to do some empirical analysis of species effect sizes vs
information content in ordination or classification. Salvador suggests
that not much of that has been done, but certainly it goes back as far
as Jensen (1978) Vegetatio 37:19-31, who suggested calculating the ratio
of the largest value to the smallest non-zero value as an indication of
effect size in a given transformation.

Mike asked "Why and when is it ecologically meaningful to de-emphasise
large values?"

It is a relatively simple matter to empirically analyze species
importance in ordinations from the equations of the
dissimilarity/distance indices. For Bray-Curtis the effect size is
proportional to the sum of its variability, i.e. the sum of the absolute
difference for that species in all possible plot pairs. You can
calculate this easily in a nested loop, or simply apply Manhattan
distance to each column of the taxon matrix independently.
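
The nested-loop calculation can be sketched as follows, on a hypothetical plot x species table. The share printed for each species is simply its total divided by the sum over all species; that summary is an assumption added here, not part of the posted discussion.

```python
import math
from itertools import combinations

def species_contributions(matrix):
    """For each species (column), sum |x_ik - x_jk| over all pairs of plots (rows)."""
    n_species = len(matrix[0])
    totals = [0.0] * n_species
    for row_i, row_j in combinations(matrix, 2):
        for k in range(n_species):
            totals[k] += abs(row_i[k] - row_j[k])
    return totals

# Hypothetical counts: five plots (rows) by three species (columns).
plots = [
    [200, 20, 1],
    [5, 8, 0],
    [150, 15, 2],
    [0, 2, 0],
    [90, 10, 1],
]

raw_totals = species_contributions(plots)
log_totals = species_contributions([[math.log1p(v) for v in row] for row in plots])

raw_share = [t / sum(raw_totals) for t in raw_totals]
log_share = [t / sum(log_totals) for t in log_totals]

print("raw shares:     ", [round(s, 3) for s in raw_share])
print("log(x+1) shares:", [round(s, 3) for s in log_share])
```

In this toy table the first species accounts for most of the summed differences on the raw scale, and for a noticeably smaller share after log(x+1).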

I don't want to clog up the ORDNEWS email list with big attachments, so
I have posted a small discussion and example of this on

http://ecology.msu.montana.edu/ordnews/

I've already said more than my fair share on this subject in previous
posts (and it's getting to be a significant displacement activity), so
I'll sign off.

Dave Roberts
Post by M***@public.gmane.org
Log, sqrt and other transformation with Bray-Curtis dissimilarity
I would suggest the discussion is relevant to a much more general issue.
Statements (see below) about importance or lack of it for dominant species imply a model of what the important ecological properties of the species/site data matrix are.
"It tends to emphasize differences in smaller values and de-emphasize small differences in large values. E.g. cover of 1% vs 2% is difference of 1% but an increase of 100%. 50% percent cover vs 51% cover is also a difference of 1% but a negligible increase on a relative scale."
Why and when is it ecologically meaningful to de-emphasise large values?
"I think that transformations should be consistent with the rationale supporting a similarity or distance coefficient. In case you're using a coefficient that sums up differences, log-transformation can certainly emphasize or de-emphasize some contributions to the overall similarity or distance between two samples, and this can be very useful".
Why and when is this useful?
"For biological data (e.g. percent cover of species or species fish counts per site), patterns often become clearer when you down-weight dominants and bring out the signal in lesser species by using transformations."
What types of pattern become clearer? Why and When?
These implicit models need to be expressed explicitly and tested against
1. species' responses to ecological gradients, Are they bell-shaped or skewed?
2. responses of collective properties e.g. dominance, species richness and stand abundance to the same ecological gradients, Are they random, bell-shaped or bimodal?
I suggest the usefulness of any transformations etc. will depend on what part of the ecological space the dataset samples.
The implications of having explicit models for ordination and other analysis methods is, I suggest, best examined by generating artificial datasets based on the models and exploring the performance of different methods and their consistency with the postulated models. This makes clear the weaknesses of methods, standardisations and models.
Austin, M. P. & Smith, T. M. 1989. A new model for the continuum concept. Vegetatio 83: 35-47
Faith, D. P., Minchin, P. R. & Belbin, L. 1987. Compositional dissimilarity as a robust measure of ecological distance. Vegetatio 69: 57-68.
Minchin, P.R. 1987a An evaluation of the relative robustness of techniques for ecological ordination. Vegetatio, 69, 89-107.
Minchin, P. R. 1987b. Simulation of multidimensional community patterns: towards a comprehensive model. Vegetatio 71: 145-156.
Minchin, P.R. (1989) Montane vegetation of the Mt. Field Massif, Tasmania: a test of some hypotheses about properties of community patterns. Vegetatio, 83, 97-110.
Recent alternative approaches
Heikkinen, J., & Mäkipää,R. 2010. Testing hypotheses on shape and distribution of ecological response curves. Ecological Modelling 221:388-399.
Clarke, K.R., Somerfield, P.J. & Chapman, M.G. 2006 On resemblance measures for ecological studies, including taxonomic dissimilarities and a zero-adjusted Bray-Curtis coefficient for denuded assemblages. Journal of Experimental Marine Biology and Ecology 330:55-80.
Hurst, C.P., Catterall, C.P. & Chaseling, J. 2008. A comparison of two methods for generating artificial multi-assemblage ecological datasets. Ecological Informatics 3:286-294.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Mike Austin
CSIRO Sustainable Ecosystems
GPO Box 284, Canberra, ACT 2601 Australia
Tel: 61-(0)2-6242-1758; Fax: 61-(0)2-6242-1555
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Salvador Herrando-Perez
2010-07-18 08:41:05 UTC
Permalink
Hi Dave,
The comment of mine that you quote actually referred to the lack of authoritative
assessments of the performance of association indexes, not to transformations.
Additionally, assessing the performance of association indexes in
combination with different transformations adds a level of complexity and, to my
knowledge, has also been sparsely treated in the literature.
Salva
Dave Roberts
2010-07-19 15:21:53 UTC
Permalink
Colleagues,

Dr. Paul Somerfield forwarded the following to me and asked me to
post it as he has been having difficulties getting his material to
appear in ORDNEWS.

***********************************************************************

It always interests me that there is such a dichotomy in the literature
between marine and terrestrial (including aquatic) ecologists. I agree
with Dave: for 'biological' data, such as abundances, Bray-Curtis (or
one of the Bray-Curtis family of measures) is a much better choice than
distance-based measures, because of the scaling which treats zero as in
some sense special, rather than simply as a number. Thus differences in
abundance close to zero (i.e. among low abundance species) are more
important than equal differences between highly abundant species.
Considering transformations, Bray-Curtis is driven by the
presence-absence structure of the data (remember that Sorensen or Dice
is simply Bray-Curtis on binary data) as well as numerical abundance.
Transformations are chosen to down-weight the influence of the most
abundant species. There is no 'right' transformation, the choice is an
important one and should be made by the investigator in light of what
(s)he wants to know. It is perfectly valid to analyse data with no
transformation and a strong one, to compare the community patterns among
dominant species (which is all you get with raw data) with those in
which more of the species present contribute. The usual sequence is
square-root, fourth-root, log (generally log(x+1) with abundances) and
presence/absence. Because of the +1, log is generally
indistinguishable in effect from fourth-root. There are actually good
reasons for using a mild (e.g. square-root) transformation when relating
biotic patterns to environmental variables, as it reduces some of the
'noise' in raw abundances.
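
The points above can be checked numerically with a small sketch. The two
samples are invented and the function names are hypothetical; the code
computes Bray-Curtis along the transformation sequence and confirms that,
on presence/absence data, it coincides with the Sorensen (Dice)
dissimilarity 1 - 2c/(a + b), where c is the number of shared species and
a and b are the species counts of the two samples.

```python
import math

def bray_curtis(a, b):
    """Sum of absolute differences divided by the sum of all abundances."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(x + y for x, y in zip(a, b))
    return num / den

def sorensen_dissimilarity(a, b):
    """1 - 2c / (s_a + s_b), computed from presence/absence only."""
    pres_a = [x > 0 for x in a]
    pres_b = [y > 0 for y in b]
    shared = sum(1 for x, y in zip(pres_a, pres_b) if x and y)
    return 1.0 - 2.0 * shared / (sum(pres_a) + sum(pres_b))

def to_binary(sample):
    return [1.0 if x > 0 else 0.0 for x in sample]

# Hypothetical abundances of five species in two samples.
site_a = [200.0, 12.0, 0.0, 3.0, 1.0]
site_b = [150.0, 0.0, 5.0, 4.0, 0.0]

sequence = {
    "raw": lambda s: s,
    "square root": lambda s: [math.sqrt(x) for x in s],
    "fourth root": lambda s: [x ** 0.25 for x in s],
    "log(x + 1)": lambda s: [math.log1p(x) for x in s],
    "presence/absence": to_binary,
}

for name, f in sequence.items():
    print(f"{name:>16}: BC = {bray_curtis(f(site_a), f(site_b)):.3f}")

print("Sorensen dissimilarity:", round(sorensen_dissimilarity(site_a, site_b), 3))
```

Running the loop on one's own data gives a quick feel for how strongly
each step in the sequence shifts the dissimilarity away from the raw
value and toward the presence/absence value.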
Actually, there is another level to consider. There are two reasons one
might wish to down-weight the numerical dominants. One is that if you
don't, the resulting patterns will only reflect variation among the 3 or 4
most dominant species. Another is that species do not tend to occur at
random in samples, but rather arrive in 'clumps', and the more abundant
species are often rather badly behaved, in a statistical sense (variance
increases with mean). There is an objective statistical test for
detecting such species, and down-weighting them in proportion to their
clustering structure.
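
A deliberately simplified sketch of the idea follows, assuming replicate
counts per species. It uses the variance-to-mean ratio as an index of
clumping and divides counts by it when it exceeds 1. This is not the
published procedure: the statistical test mentioned above and any
averaging across groups are omitted, and the data and function names are
hypothetical.

```python
from statistics import mean, variance

# Hypothetical replicate counts for two species (illustration only).
replicates = {
    "clumped_sp": [0.0, 0.0, 150.0, 2.0, 0.0, 90.0],
    "even_sp": [4.0, 5.0, 3.0, 6.0, 4.0, 5.0],
}

def dispersion_index(counts):
    """Sample variance divided by the mean; near 1 for Poisson-like counts."""
    return variance(counts) / mean(counts)

def downweight(counts):
    """Divide counts by the dispersion index, but never up-weight (D >= 1)."""
    d = max(dispersion_index(counts), 1.0)
    return [c / d for c in counts]

for name, counts in replicates.items():
    d = dispersion_index(counts)
    print(f"{name}: D = {d:.2f}, weighted = {[round(c, 2) for c in downweight(counts)]}")
```

In this sketch the evenly distributed species is left untouched, while the
clumped species is scaled down in proportion to its variance-to-mean
ratio.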
You might like to read some of the following, for further thoughts and
examples:
Olsgard, F., Somerfield, P. J., Carr, M. R. 1997 Relationships between
taxonomic resolution and data transformations in analyses of a
macrobenthic community along an established pollution gradient. Mar.
Ecol. Prog. Ser. 149, 173-181.
Olsgard, F., Somerfield, P. J., Carr, M. R. 1998 Relationships between
taxonomic resolution, macrobenthic community patterns and disturbance.
Mar. Ecol. Prog. Ser. 172: 25-36.
Clarke, K. R., Somerfield, P. J., Chapman, M. G. 2006 On resemblance
measures for ecological studies, including taxonomic dissimilarities and
a zero-adjusted Bray-Curtis measure for denuded assemblages. J. Exp.
Mar. Biol. Ecol. 330: 55-80.
Clarke K. R., Chapman, M. G., Somerfield, P. J., Needham, H. R. 2006
Dispersion-based weighting of species counts in assemblage analyses.
Mar. Ecol. Prog. Ser. 320: 11-27.
Best wishes,

Dr Paul J. Somerfield
Biodiversity
Plymouth Marine Laboratory
Prospect Place, Plymouth, PL1 3DH UK
+44 1752 633100
see www.marbef.org
***************************************************************************

forwarded by Dave Roberts

Ann Huber
2010-07-13 16:41:15 UTC
Permalink