Friday 3 April 2015

It was all basic bricks in my day: part three

Well, that took longer than expected. You've waited long enough, so I'll dive straight into the results with a minimum of introduction. On the assumption that any part that appears in sets from multiple themes can't be particularly specialised, I've defined a specialised part here as one that has appeared in fewer than 25% of Bricklink's categories. At the time I started collecting data, 25% of the categories came to 28, so a specialised part is one that has appeared in fewer than 28 categories.
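
A minimal Python sketch of that definition (an illustration, not the scraper code itself). The mapping `categories_by_part`, the part number "x123" and the category names are all made up for the example; only the threshold of 28 comes from the text above.

```python
# Threshold from the text: a part is specialised if it has appeared in
# fewer than 28 categories (25% of Bricklink's categories at the time).
SPECIALISED_THRESHOLD = 28


def is_specialised(part_id, categories_by_part, threshold=SPECIALISED_THRESHOLD):
    """Return True if the part has appeared in fewer than `threshold` categories."""
    # Assumption: a part missing from the mapping has appeared in no categories.
    return len(categories_by_part.get(part_id, set())) < threshold


# Hypothetical data: the category names and counts are placeholders,
# not real Bricklink figures.
categories_by_part = {
    "3005": {f"category_{i}" for i in range(60)},  # a part seen in many categories
    "x123": {"category_1", "category_2"},          # a part seen in only two
}

print(is_specialised("3005", categories_by_part))
print(is_specialised("x123", categories_by_part))
```

Note that the comparison is strict: a part that has appeared in exactly 28 categories does not count as specialised.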

Here's the graph of the average proportion of a set's parts that are specialised each year/decade:


Or, if you prefer, a graph of the proportion of sets each year that contain more than 50% specialised parts:


Conclusion: Sets these days are not full of overly specialised parts - they're pretty much the least specialised they've ever been!

Update: I wouldn't pay too much attention to the startlingly high results for the 50's and 60's since, for reasons explained in the updates and comments below, they are known to be erroneously inflated to some degree. I'm in the process of trying to fix this.

The bad years of the late 90's and early 2000's, when specialised parts supposedly proliferated wildly, are clearly visible on the second graph, but they're perhaps not as bad as you might have expected. On the averages graph, they don't seem that bad at all. I guess we tend to focus on the worst examples from those years, without acknowledging that the majority of sets weren't that bad.

What you can clearly see on both graphs, beginning around 2006, is the effect of stricter controls on set designers' use of specialised parts. In David Robertson's book, Brick by Brick, he says:
"As a result [of the new rules], on average, at least 70 percent of every LEGO set, whether it's a LEGO City box or a new play theme such as Ninjago, is now made up of standard, universal bricks."
Lo and behold, by 2007 the averages graph is below the 30% mark and has stubbornly stayed there ever since.*

So that's what I found. Let me know what you think in the comments, or carry on reading to find out the details of the data collection, including some known flaws.

In my next post, which will follow shortly, I'll be breaking the data down by themes and looking at things like whether licensed sets are more specialised than non-licensed.

***

Update: Another flaw, which I just noticed as a result of the very first comment, is that older versions of basic bricks end up counting as specialised because they were phased out before having the chance to appear in lots of themes. I think this accounts for the unusually high proportions of specialised bricks and sets in the 50's and 60's. Luckily, since Bricklink appears to have labelled these older versions in a consistent way ('[item #]old'), this should be reasonably simple to fix.
To some extent this error actually cancels out some of the error from double counting counterparts mentioned below, since in sets where less common older bricks appear as counterparts, they will be cancelled out by the more common versions. 
In my last post, you may remember that I found the proportion of basic bricks and plates has fallen over time (in a limited sample of sets). That may seem completely at odds with the results in this post, and may partially be explained by older versions of basic bricks counting as specialised. But I think it's mostly because the definition of specialised in this post is closer to the AFOL's definition ('can't be used for anything else') rather than the traditionalist's definition ('isn't a cuboid brick or rectangular plate').
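
One possible way to apply the '[item #]old' fix mentioned in this update, sketched in Python. This is an assumption about how the fix could work, not something I've run against the real data: it simply treats any part number ending in "old" as the same part as the number without the suffix, and merges their category sets. The function names and the category names in the example are hypothetical.

```python
def canonical_part_id(part_id):
    """Map an old-style part number such as '3001old' to '3001'."""
    suffix = "old"
    # Assumption: a trailing 'old' always marks an older version of the same part.
    if part_id.endswith(suffix) and len(part_id) > len(suffix):
        return part_id[: -len(suffix)]
    return part_id


def merge_old_versions(categories_by_part):
    """Combine the category sets of old and current versions of each part."""
    merged = {}
    for part_id, categories in categories_by_part.items():
        merged.setdefault(canonical_part_id(part_id), set()).update(categories)
    return merged


# Hypothetical example: the category names are placeholders.
example = {"3001old": {"category_a"}, "3001": {"category_b", "category_c"}}
print(merge_old_versions(example))
```

After merging, the old 2x4 brick would inherit every category the current 2x4 brick has appeared in, so it would no longer fall under the threshold.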

All data was collected from Bricklink using a web scraping program I wrote in Python. I'm happy to share the code, but be warned that it isn't pretty. A lot of the flaws below could probably be fixed with smarter web scraping, but I'm not willing to spend huge amounts of time figuring it out in the near future. If anyone reading is good at Python and/or web-scraping and would be willing to help out, let me know, and I'll make the code public so we can all work together on improving it.

The data is not weighted by the number of each part in the set. So if a set contains one specialised part and fifty 2x4 bricks, it still counts as 50% specialised. To be honest, I did this because it made the data scraping easier. I'm happy to discuss how this will have affected the results in the comments, but if you want it done differently you'll have to help me rewrite the code!
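
To make the difference concrete, here is a small Python sketch comparing the unweighted count used for this post with a quantity-weighted alternative. It is an illustration under assumptions, not the scraper code: `inventory` is a hypothetical mapping from part number to quantity, "special" is a made-up part number, and the inventory is assumed to be non-empty.

```python
from fractions import Fraction


def specialised_proportion(inventory, specialised, weighted=False):
    """Proportion of a set that is specialised.

    inventory:   dict mapping part id -> quantity in the set (must be non-empty)
    specialised: set of part ids that count as specialised
    weighted:    if True, count every copy of a part; otherwise count each
                 type of part once, as in the data for this post
    """
    if weighted:
        total = sum(inventory.values())
        hits = sum(qty for part, qty in inventory.items() if part in specialised)
    else:
        total = len(inventory)
        hits = sum(1 for part in inventory if part in specialised)
    return Fraction(hits, total)


# The example from the text: one specialised part and fifty 2x4 bricks.
inventory = {"special": 1, "3001": 50}
specialised = {"special"}

print(specialised_proportion(inventory, specialised))                 # 1/2
print(specialised_proportion(inventory, specialised, weighted=True))  # 1/51
```

So the same set is 50% specialised by type of part but under 2% specialised by piece count.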

I removed all sets that Bricklink puts in the Duplo category, as well as any set with Duplo in its name (this catches educational Duplo sets and the like). While Duplo bricks are compatible with normal Lego bricks, they never show up in any Lego sets, so Duplo sets always come out as 100% specialised and skew the data. Contrast with Technic, which shows up in normal Lego sets all the time.

I also excluded any set with fewer than 10 types of part, to get rid of the majority of little promo sets and advent calendar builds etc.
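
The two exclusion rules above could be expressed as a single filter along these lines. This is a sketch under assumptions rather than the code I actually ran: the function name and arguments are hypothetical, and matching "duplo" case-insensitively anywhere in the category or set name is my guess at a reasonable rule.

```python
def keep_set(name, category, part_types, min_part_types=10):
    """Return True if a set should stay in the analysis.

    name:       the set's name
    category:   the Bricklink category the set is filed under
    part_types: collection of distinct part ids in the set's inventory
    """
    # Drop anything in the Duplo category or with Duplo in its name.
    # Assumption: case-insensitive substring matching is good enough.
    if "duplo" in category.lower() or "duplo" in name.lower():
        return False
    # Drop sets with fewer than 10 types of part (promos, advent builds etc.).
    return len(part_types) >= min_part_types


# Hypothetical examples: names, categories and part ids are placeholders.
print(keep_set("Garage", "Town", set(range(12))))
print(keep_set("Duplo Farm", "Educational", set(range(20))))
print(keep_set("Tiny Promo", "Town", set(range(9))))
```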

The parts counted are anything in a set's Bricklink inventory that Bricklink defines as a part, which is to say any item whose catalogue entry has a URL including P=[part number]. This excludes minifigs (deliberately) but unfortunately doesn't exclude stickers or counterparts. Again, this is something that I found too complicated to fix, and since stickers and counterparts rarely make up a huge proportion of a set's parts, it hopefully doesn't affect the data too much.
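
As an illustration of the P=[part number] test, here is one way to pull the part number out of a catalogue URL using Python's standard library. The function name is hypothetical and this is not necessarily how the scraper does it. The two URLs are the ones quoted in the comments on this post.

```python
from urllib.parse import parse_qs, urlparse


def part_number_from_url(url):
    """Return the value of the P= query parameter, or None if there isn't one."""
    query = parse_qs(urlparse(url).query)
    values = query.get("P")  # parameter names are matched case-sensitively
    return values[0] if values else None


part_url = "http://www.bricklink.com/catalogItemIn.asp?P=3001old&v=3&in=S"
set_url = "http://www.bricklink.com/catalogItemInv.asp?S=1236-2"

print(part_number_from_url(part_url))  # 3001old
print(part_number_from_url(set_url))   # None
```

Anything whose URL has no P= parameter would simply be skipped, which is also why stickers and counterparts slip through if they are catalogued as parts.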

One final flaw, which I only spotted after digging into the data for my next post, is that if a part appears in the same set in multiple colours then it counts once for every colour. So if a set has 20 different types of part and one of them is a specialised part that comes in 3 different colours then, assuming no other specialised parts or colour variations, the proportion will be 3/22, rather than the 1/20 I'd prefer. I don't think it will have made any difference to the overall trends, however.
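
The arithmetic in that example can be checked with a short Python sketch. The part numbers and colours are placeholders, and `entries` is a hypothetical list of (part, colour) inventory rows rather than real scraped data.

```python
from fractions import Fraction


def proportions(entries, specialised):
    """Return (per-colour proportion, per-part proportion) for one set.

    entries:     list of (part id, colour) rows from a set's inventory
    specialised: set of part ids that count as specialised
    """
    # Counting every (part, colour) row, which is what the data currently does.
    per_colour = Fraction(
        sum(1 for part, _ in entries if part in specialised), len(entries)
    )
    # Counting each part once regardless of colour, which is what I'd prefer.
    distinct = {part for part, _ in entries}
    per_part = Fraction(
        sum(1 for part in distinct if part in specialised), len(distinct)
    )
    return per_colour, per_part


# 19 non-specialised parts plus one specialised part in 3 colours:
# 20 types of part, but 22 inventory rows.
entries = [(f"basic_{i}", "red") for i in range(19)]
entries += [("special", colour) for colour in ("red", "blue", "black")]

print(proportions(entries, {"special"}))  # (Fraction(3, 22), Fraction(1, 20))
```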

I think that's all that needs mentioning - if you want more details, just ask.


* Amazing, right? I randomly chose 25% of categories as the threshold and it just turned out that this squared very nicely with recent averages being below 30% specialised. I definitely didn't try out a few different thresholds on a sample of recent sets until I found one that placed the average just below 30%. No siree! 

4 comments:

  1. I don't fully understand this chart, but it appears that specialized parts are only decreasing and decreasing... How come the early sets have such high stats? All these tin cars?

    Replies
    1. I think it's mostly because Bricklink classes older styles of basic bricks as different parts which, because they were phased out before there were many themes, show up as specialised in my data. See the old style of 2x4 brick for example (http://www.bricklink.com/catalogItemIn.asp?P=3001old&v=3&in=S).

      That goes some way to explaining the difference between this graph and the graph of basic bricks in my last post. I should put a note about this in the main text, thanks for pointing it out!

  2. I think your graphs are biased towards the present-day. The 25% should be adjusted for every year, imo, especially since back in the 1970's and 1980's, there wouldn't have been 28 categories to split into, hence the very high % in those years.

    Replies
    1. There may be some effect due to this, but probably not as much as you think because a piece in say 1970 doesn't have to appear in 28 categories in 1970 to count as non specialised - it has to appear in 28 categories over the whole of Lego history.
      So the 1x1 brick (http://www.bricklink.com/catalogItemIn.asp?P=3005&in=S&viewSum=Y) that appears in this garage set from 1955 (http://www.bricklink.com/catalogItemInv.asp?S=1236-2) counts as non specialised because it has appeared in basically every category over the years - I didn't only count the 1 category that existed in 1955.
