cross-posted from: https://mander.xyz/post/41224832

ITIS Tree of Life

Just finished another visualization of the entire taxonomy tree. The previous one is buried here: GBIF ToL.

The main concept is very simple: each taxon is a point, and each taxon has a clockwise-bent arc from its parent taxon.
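
To make the arc idea concrete, here is a minimal sketch of one way to bend an edge clockwise: a quadratic Bézier curve whose control point is pushed to the left of the parent-to-child chord. It assumes y-up coordinates (flip the sign of `bend` for screen coordinates), and `clockwise_arc`, `bend` and `steps` are illustrative names, not necessarily how the actual map draws its arcs.

```python
def clockwise_arc(parent, child, bend=0.2, steps=16):
    """Sample a quadratic Bezier curve from parent to child that turns clockwise.

    Assumes y-up coordinates. The control point sits at the chord midpoint,
    shifted along the chord's left normal by `bend` times the chord length.
    """
    px, py = parent
    cx, cy = child
    dx, dy = cx - px, cy - py
    mx, my = (px + cx) / 2, (py + cy) / 2
    # Left normal of (dx, dy) is (-dy, dx); bulging left makes the path turn right.
    qx, qy = mx - bend * dy, my + bend * dx
    points = []
    for i in range(steps + 1):
        t = i / steps
        a, b, c = (1 - t) ** 2, 2 * (1 - t) * t, t ** 2
        points.append((a * px + b * qx + c * cx, a * py + b * qy + c * cy))
    return points


# Example: an arc from a parent at the origin to a child one unit to the right.
arc = clockwise_arc((0.0, 0.0), (1.0, 0.0), bend=0.5, steps=4)
```

Because every arc bends the same way, the direction of an edge (parent to child) can be read from its curvature without drawing arrowheads.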

The trick is to place those points in a meaningful way. At first, I was using a force-directed algorithm to do it. In general, it succeeded in grouping points by clade, but it introduced a lot of branch overlapping (check how the purple Echinodermata “intrudes” into Arthropoda in the GBIF version).

Force-directed algorithms can lay out not only trees but basically any graph, and I thought: maybe a tree-specific algorithm would produce a better result? I found out there is a cool Voronoi Treemap algorithm which, for any given tree, can build a set of nested polygons, one polygon for each node in the tree. Not only does it eliminate the branch-overlapping problem, it also ensures that branches fit into convex polygons, and you can even add gaps between adjacent branches. So I built a CLI wrapper around a Java implementation I found on GitHub.

At first, I used it for the NCBI database, but I didn’t use gaps and haven’t published an interactive version yet (there are PNGs on Wikimedia Commons, though). Then I made a treemap for ITIS. Points are still points, and the polygons are used for the mouse-hover feature. When I was making the force-directed GBIF map, I had to compute those polygons separately for each clade of the given ranks. Now both points and polygons are computed by one algorithm, which is nice.

What do you think?

  • flora_explora@beehaw.org
    16 days ago

    Haha, this wasn’t even in detail and I only looked at it a bit on my phone screen. But it is quite enjoyable to have such a visualization, fantastic job!

    For plants, I usually go for POWO (Plants of the World Online). They have their own database on vascular plants, which according to them is incorporated into the GBIF database (and then also into the Catalogue of Life Checklist).

    You are probably right in that this is based on the underlying datasets and that GBIF does the best job regarding plants.

    Looking over to the animals in both visualizations, I feel like the GBIF one looks better and the ITIS one gives a slightly better overview. E.g. looking at Hymenoptera, I need to zoom in much more in the ITIS one to get to the families, and it doesn’t show any intermediate rank between order and family. The GBIF one does the same but shows the family names also when zoomed out more. However, it is harder to distinguish the borders between different orders there than in the ITIS one.

    However, looking at Hymenoptera also made me realize that both visualizations are a mess in their own way! There are many entries missing in the ITIS dataset: for example, there are 800 genera and over 8000 described species in the Symphyta, but they are only a tiny section south of the rest of the Hymenoptera. But just above there is a wasp genus named Microgaster with similarly many points that apparently only has about 100 described species!

    But in the GBIF visualization, the arrangement of various groups seems to be done in a haphazard way. For example, if you were to look for all the bees (Anthophila, but this rank is not shown), in the ITIS one they are all at least displayed close together (although not within one rank). But in the GBIF visualization, e.g. Apidae and Halictidae are at totally different ends within the Hymenoptera group. And there are much more basal groups like Tenthredinidae (in Symphyta) between. So within the Hymenoptera groups, this makes no sense at all! It would be much better if more basal groups would be closer to the origin and more distant lineages are more distant. But within this visualization all families within the Hymenoptera are just ‘related’ to the Hymenoptera and not each other.

    Maybe the problem is also that plants and animals have their own taxonomies that are structured differently. You usually don’t need intermediate ranks between order and family in plants. In animals this seems to be quite different. Hm, not sure how to solve this in an elegant manner.

    Regarding your actual question of whether the NCBI visualization is representative: I cannot say. The PNG only shows plant orders and not even families. So it’s impossible to tell which point cloud is which genus and how they are represented. Although looking at it, I feel like there are some locations where a huge number of points link to a single origin. E.g. within the Lepidoptera a third of all taxa lead to a single point. Not sure what this might be, because in the other two visualizations the Lepidoptera are much more diverse. The most diverse families are the Erebidae with about 25,000 species and the Geometridae with 23,000 species. But both are just a small portion of the 180,000 Lepidoptera species. So my guess is that the NCBI one shows superfamilies and not families as intermediate ranks, and in the case of the Lepidoptera, the large cloud is the Noctuoidea, which contain about 70,000 species. But then there are even fewer ranks than in the other visualizations, if all listed taxa within this superfamily point to a single origin.

    I could go on for ages! This is so much fun, haha :)

    • podbrushkin@mander.xyzOP
      16 days ago

      I will try to check if there are any discrepancies between visualization and ITIS db, and between ITIS db and other taxonomic sources.

      Hymenoptera in ITIS has two direct children: Apocrita and Symphyta with 1904 and 39 genera each.

      Hymenoptera in insecta.pro has three direct children (third is a dead end, will ignore): Apocrita and Symphyta with 6768 and 153 genera.

      Apocrita is about 44 to 49 times larger than Symphyta in both databases (1904/39 ≈ 49 in ITIS, 6768/153 ≈ 44 in insecta.pro), so ITIS is representative in this case. In the visualisation each clade gets as much space as it needs to fit all its leaf nodes (taxa without children). Apocrita probably got ~45 times more space than Symphyta, which is what I’d expect.
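
      The “space proportional to leaf nodes” rule can be sketched as follows. This is only an illustration on a made-up toy tree: `leaf_count` and `area_shares` are hypothetical helper names, and the real treemap tool may weight clades differently.

```python
def leaf_count(tree, node):
    """Count the leaves (taxa without children) under `node`.

    `tree` maps each taxon name to the list of its direct children.
    """
    children = tree.get(node, [])
    if not children:
        return 1
    return sum(leaf_count(tree, child) for child in children)


def area_shares(tree, node):
    """Fraction of the parent's area each direct child would get."""
    counts = {child: leaf_count(tree, child) for child in tree.get(node, [])}
    total = sum(counts.values())
    return {child: n / total for child, n in counts.items()}


# Toy tree with invented leaf names, NOT real ITIS data.
toy = {
    "Hymenoptera": ["Apocrita", "Symphyta"],
    "Apocrita": ["genus A1", "genus A2", "genus A3"],
    "Symphyta": ["genus S1"],
}
shares = area_shares(toy, "Hymenoptera")
```

      With three leaves against one, the toy Apocrita gets three quarters of the parent polygon and the toy Symphyta one quarter; the same rule applied to the real counts would give the ratio discussed above.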

      Also, I’ve tried to find Symphyta in LifeMap, but the NCBI page for Hymenoptera (LifeMap is based on NCBI) has a comment about Symphyta being a paraphyletic group and therefore NCBI doesn’t have this suborder at all.

      There are ~100 species in Microgaster clade and ~62 in Symphyta, not a big difference, they got comparable amount of space, I think it is also as expected.

      in the ITIS one they are all at least displayed close together (although not within one rank). But in the GBIF visualization, e.g. Apidae and Halictidae are at totally different ends

      As an LLM would’ve said, “you’ve got to the heart of how Voronoi Treemaps work”. GBIF does not keep track of intermediate taxa at all, therefore Apidae and Halictidae in their system are equally related to Hymenoptera, and there’s nothing else to group them together. ITIS, on the other hand, has a lot of taxa with intermediate ranks, including Aculeata and Apoidea. These two additional links prevented Apidae and Halictidae from spreading to opposite ends of Hymenoptera as they do in GBIF. I’ve decided to color points only by the six main ranks, and I’ve made the zoom jump between these ranks, therefore intermediate polygons are somewhat obscure, but they already did their job, and you’ve noticed that, cool! When I continue to work on these maps, I probably won’t consider using GBIF as a data source because of this exact detail you’ve mentioned: some branches can be placed further from each other than you expect.
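
      A small sketch of why the intermediate ranks matter: in a flat tree every family has the same lowest common ancestor, while a nested tree gives the layout something to group by. The parent chains below are simplified illustrations written for this example (not an export of either database), and `ancestors` / `lowest_common_ancestor` are hypothetical helper names.

```python
def ancestors(parent_of, taxon):
    """Return the chain [taxon, parent, grandparent, ..., root]."""
    chain = [taxon]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return chain


def lowest_common_ancestor(parent_of, a, b):
    """First taxon on b's ancestor chain that is also an ancestor of a."""
    seen = set(ancestors(parent_of, a))
    for taxon in ancestors(parent_of, b):
        if taxon in seen:
            return taxon
    return None


# Flat, GBIF-like: families hang directly off the order.
flat = {
    "Apidae": "Hymenoptera",
    "Halictidae": "Hymenoptera",
    "Tenthredinidae": "Hymenoptera",
}

# Nested, ITIS-like: intermediate taxa sit between order and family.
nested = {
    "Apidae": "Apoidea",
    "Halictidae": "Apoidea",
    "Apoidea": "Aculeata",
    "Aculeata": "Apocrita",
    "Apocrita": "Hymenoptera",
    "Tenthredinidae": "Symphyta",
    "Symphyta": "Hymenoptera",
}

print(lowest_common_ancestor(flat, "Apidae", "Halictidae"))        # Hymenoptera
print(lowest_common_ancestor(flat, "Apidae", "Tenthredinidae"))    # Hymenoptera
print(lowest_common_ancestor(nested, "Apidae", "Halictidae"))      # Apoidea
print(lowest_common_ancestor(nested, "Apidae", "Tenthredinidae"))  # Hymenoptera
```

      In the flat tree a layout algorithm has no reason to put the two bee families nearer to each other than to a sawfly family; in the nested tree they share the Apoidea polygon.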

      I feel like there are some locations where a huge number of points link to a single origin

      Yes, I was also looking at them. Usually it’s artificial groups like “unclassified Lepidoptera” with a lot of taxa which don’t even have a name; they have a code instead, like “BOLD:ACO0165”. You can find such groups in GBIF as well, e.g. in Lepidoptera there is a huge ball in the center with a lot of unnamed taxa squeezed together. This is somewhat similar. I think next time I will nuke them because they are not interesting, take a lot of space and don’t add to the structure or readability.
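
      Dropping such placeholder groups before running the layout could look roughly like this. It is a sketch under assumptions: the name patterns (anything starting with “unclassified”, or a bare “BOLD:” code) are guesses based on the two examples above, and `prune` is a hypothetical helper, not part of the existing tools.

```python
import re

# Assumed patterns for placeholder taxa, based on the examples in this thread.
PLACEHOLDER = re.compile(r"^(unclassified\b|BOLD:[A-Z0-9]+$)", re.IGNORECASE)


def prune(tree, root):
    """Return a copy of the tree below `root` without placeholder taxa.

    `tree` maps each taxon name to the list of its direct children.
    A placeholder taxon is removed together with everything below it.
    """
    kept = {}
    stack = [root]
    while stack:
        node = stack.pop()
        children = [c for c in tree.get(node, []) if not PLACEHOLDER.match(c)]
        if children:
            kept[node] = children
        stack.extend(children)
    return kept


# Toy input; the second BOLD code is invented for the example.
toy = {
    "Lepidoptera": ["Noctuoidea", "unclassified Lepidoptera"],
    "Noctuoidea": ["Erebidae"],
    "unclassified Lepidoptera": ["BOLD:ACO0165", "BOLD:AAA0001"],
}
cleaned = prune(toy, "Lepidoptera")
```

      Since space is allotted by leaf count, removing a big unnamed group also hands its area back to the named clades.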

      Also, you can check out this foamtree demo, which is also a treemap, but it displays polygons instead of points, and you have to move through all the intermediate taxa by double-clicking to get anywhere. To the right you can switch to Metazoa. They don’t use space as efficiently: Korarchaeota has a single known species but got a huge polygon anyway. I am not affiliated with this foamtree; they’re trying to sell a visualisation library, and to showcase it they’ve made a demo with a taxonomy tree.

      • flora_explora@beehaw.org
        16 days ago

        I guess one cannot achieve everything with a visualization like this and has to prioritize what’s more important.

        has a comment about Symphyta being a paraphyletic group and therefore NCBI doesn’t have this suborder at all.

        Ah true, it’s used in iNaturalist so I’m used to it as a group. But yeah, these are probably just all basal hymenopterans.

        There are ~100 species in Microgaster clade and ~62 in Symphyta, not a big difference

        I do not doubt that the visualization is correct, but rather that the underlying data is. The vast majority of Symphyta species seem to be missing from the dataset, which makes the disparity so apparent.

        usually it’s artificial groups like “unclassified Lepidoptera” with a lot of taxa which don’t even have a name

        Ah yes, forgot about that. Probably better to exclude them, I agree.

        I wonder if there is a way to somehow combine datasets to fill in the gaps. Like adding more intermediate ranks to the GBIF dataset by using the other ones. Looking at your tables, one could probably achieve this quite easily (although probably with some gaps). I didn’t find your code though, was wondering how you have written this :)

        Or maybe use the style of the GBIF visualization with the ITIS dataset?

        you can check out this foamtree demo

        Oof, I didn’t like this at all! It’s very hard to find anything in there. Tried to go for Araceae and could only find it by searching below for subfamilies. Apparently Araceae isn’t in their dataset as a rank, although other plant families are?

        • podbrushkin@mander.xyzOP
          16 days ago

          I wonder if there is a way to somehow combine datasets to fill in the gaps.

          It would’ve been zero fun and the same amount of success. Basically, it means creating a new taxonomy database while a lot of them already exist. I didn’t expect there to be so many taxonomy databases, almost all of them backed by scientific organizations and freely accessible and downloadable. Other areas (books, movies, history) are not even close to this diversity of data sources.

          I didn’t find your code though, was wondering how you have written this

          Apart from Gephi Commander (already on GitHub), which is used for generating PNG tiles when you already have x and y for every taxon, there is also a CLI tool to build the Voronoi treemap (assign x, y) and another CLI tool to split those points across zoom levels and PBF vector tiles. Neo4j serves as the database, and PowerShell brings all of this to life.
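
          The tile-splitting step could be sketched like this, assuming the common XYZ scheme (zoom level z has 2^z × 2^z tiles) and coordinates normalised to the range 0 to 1. `tile_of` and `bucket_points` are hypothetical names for illustration; the actual CLI tool and its PBF encoding are not shown here.

```python
from collections import defaultdict


def tile_of(x, y, zoom):
    """Return (zoom, column, row) of the tile containing a normalised point.

    Assumes x and y lie in [0, 1]; points exactly on the far edge are
    clamped into the last tile.
    """
    n = 2 ** zoom
    col = min(int(x * n), n - 1)
    row = min(int(y * n), n - 1)
    return zoom, col, row


def bucket_points(points, zoom):
    """Group (name, x, y) points by the tile they fall into at `zoom`."""
    tiles = defaultdict(list)
    for name, x, y in points:
        tiles[tile_of(x, y, zoom)].append(name)
    return dict(tiles)


# Made-up coordinates, only to show the grouping.
points = [("Apidae", 0.10, 0.90), ("Halictidae", 0.12, 0.88), ("Erebidae", 0.70, 0.20)]
tiles = bucket_points(points, zoom=1)
```

          Each bucket would then be serialised as one vector tile, so a viewer only downloads the points visible at its current zoom and position.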

          Oof, I didn’t like this at all!

          Not a fan either. There was another tool that looked similar to Voronoi, made by a person working at a scientific organization, but I can’t find it right now… There is a lot of interesting stuff on this topic.

          • flora_explora@beehaw.org
            15 days ago

            Oh right, now I see that you made very different network graphs based on all kinds of example data. I come from the opposite direction. I worked with a lot of ecological datasets, analyzing and plotting them. But I haven’t messed around with network graphs a lot. Maybe I’ll try to do my own version in R or Python (I don’t know any Java, so I cannot really understand your code). Because I’m really fascinated by the idea of having a nice rendering of the tree of life!