Archive for the 'Performance' Category

Architects need to properly Communicate with Developers: Scale-Up vs. Scale-Out

Wednesday, July 2nd, 2008

This is a problem that unfortunately we so often see.

Architects are employed to ensure that software projects have a reasonable base; solutions are well thought out, robust, and follow industry best practices; and provide a more detailed but higher-level view of application spread (and inter-operation) than a senior developer, project or programme manager could affect.

However, there’s a gap, the architects produce plans and ideas and don’t communicate them effectively to the developers, so new problems arise. Often it then gets worse because the architects blame the developers (and especially the lead developers) for not following the plan, while the developers blame the architects for poorly thought out plans and non-committal/non-understandable language. Out of frustration, I thought we could cover one of those major communication problems today: Scale-Up vs. Scale-Out.

Any developer who’s been in the industry has heard these terms a few times (probably by some people who have used them inter-changeably), however they mean very different things and have different design decisions and costs associated with them.

a. Scale-Up -> To add additional hardware (processing power, memory, faster/bigger disk) to a solution to enable it to scale further

b. Scale-Out -> To use additional hardware nodes (web servers/database servers/application servers) to an existing solution to enable scaling by spreading the workload further. (Note: this also includes strategies where duplicated hardware is put in place and the solution is segmented down functional/data lines so that each set of hardware runs independently of the other). Grid computing is an example of Scale-Out pushed to the maximum.
Scale-Up is usually the first port of call for badly performing applications, but there’s always a high point of what can be done. Even if you started on a single CPU Windows 2003 server and then ported up to a SUN Fire E25K you’ve still hit a ceiling. However, for most projects this makes sense as the user/work loading will never be exponentially larger than it started. So long as there’s sufficient upgrade path (and the application can handle it) then Scale-Up will often be the best in cost-benefit analysis, primarily because one only upgrades (or moves on to new hardware) if it’s strictly necessary.
True Scale-Out requires designing from the start as it’s hard to push a Scale-Out strategy on to existing solutions. If the service is known to be massive from day one then segment your workload like Google/Ebay and co and have multiple servers that all do a minor part of the workload, and other servers that collate that and have responses back to the user. In that way you can always add more hardware to do individual tasks without a major problem. However, for smaller projects this isn’t worth the cost or the stress.

Some architects like to say they’ve got the ultimate scaling strategy: a mix of Scale-Out and Scale-Up, and to an extent some do. However, having your webserver, app server and database server on a single machine, then scaling out to three machines, before beefing up each of those machines further isn’t true Scale-Out (it’s just a modification of the Scale-Up strategy really). For those solutions you have to be extremely careful about issues like marshalling (inefficient data passing between tiers won’t really hurt when you’re on the same machine, however if you have to pass that data across a network, serializing and de-serializing as you go, then that can cause major issues.

Hope that helps explain Scale-Up (aka ScaleUp) and Scale-Out (aka ScaleOut). As ever i’m happy to hear from you with questions etc.

Why adding an index to a DB table isn’t always a good idea

Tuesday, April 29th, 2008

When we’re on site we talk to a lot of people who are struggling with databases and how they work. I always try and help by explaining the basic of how a RDBMS works and how indexes are generally used, however there’s one thing that always requires extra discussion, as it’s contrary to commonsense, why removing indexes can sometimes increase database performance.

In this blog there’s not really room for a full discussion of Database (aka “DB” from now on) problems, however new users of databases (or rather new users of production databases with a reasonable workload) often do the following;

a) Start logging queries that don’t have indexes (or run tuning analysers like those in Microsoft SQL Server’s Management Studio)

b) Add indexes with gusto; each table starts to have 3 or 4 indexes (I won’t use the term indices today…)

c) Notice new and horrible issues, some inserts and updates take longer than before, and some selects take longer than they did before the indexes were added.

Of course some people then repeat (b) and (c) again…

Two universal truths need to be stated;

1) A database optimiser won’t magically concatenate multiple index data together and use the composite index for a super-quick lookup. Instead you should estimate that a DB will only select a single index to use, and that’ll be based off the statistics of the index “selectivity” (aka “granularity” in some quarters).

2) Because we have more intelligence than machines (and we know the business domain) we can do things that the machine optimiser can’t. This does mean that in some instances it’s better to not use an index for a select statement, or even (in some difficult spots) tell the query engine to use a specific index for it’s lookup.

I should also add another “Universal Truth” that’s not quite true (it’s kind of the Newtonian Physics of the DB Tuning world - it’s not quite true but it’s near)

3) Adding an Index has a cost for each insert and update (and when those indexes need to be rebuilt). Yes, it’s not good to have superfluous Indexes but this is often over-stated to a point where people are scared to add indexes, even when they’re sure it’s a good idea.

Right, first we’ll attack UT (Universal Truth) 1;

Universal Truth 1

“A database optimiser won’t magically concatenate multiple index data together and use the composite index for a super-quick lookup”

Databases are created to handle data, and they do it well. They maintain this lead over the competition by being ruthlessly efficient and sticking to sensible rules.

A database will typically (there are exceptions but they’re rare) use a single index to look up data in a table. If there isn’t one then you’ll get a table scan (usually bad bad news if there are many rows - more on that later).

When a DB uses an index it tries to select the best route/index to cut down the number of records it then has to read from the main table and finish off it’s selection (of course if it doesn’t need to select any further from the table that’s great.. however unless the index has all the necessary return columns from the query then it’ll still need to read that table row).

So let’s create a sample table;

CREATE TABLE `propagandr`.`animal` (
`id` INTEGER UNSIGNED NOT NULL AUTO_INCREMENT,
`animalname` VARCHAR(45) NOT NULL,
`animaltypeid` INTEGER UNSIGNED NOT NULL,
`creationusername` VARCHAR(45) NOT NULL,
`creationdate` VARCHAR(45) NOT NULL,
PRIMARY KEY(`id`)
)
/* note: this is an example - if you want to add a few thousand content lines then go for it and you can try the queries out */

Now this has no user defined indexes (it has one - notice the primary key is ‘id’ - that’s an indexed column as that’s the way it’s laid out on the disk within MySQL), so if I say “SELECT id, animalname FROM animal WHERE creationusername = ‘boris’” then we know it’ll do a table scan.

So, let’s add an index for the animaltypeid (as that links to another table)

ALTER TABLE animal ADD INDEX `Index_animaltypeid`(`animaltypeid`);

We then re-run the query, and no surprise it doesn’t get used. So we’ll add an index for the creationusername…

ALTER TABLE animal ADD INDEX `Index_creationusername`(`creationusername`);

Re-running the query we notice it’s now using the index.. winner!

So, let’s change the query to be more “realistic” - “SELECT id, animalname FROM animal WHERE animaltypeid = 3 AND creationusername = ‘boris’”. This should use both indexes right? erm. no. It’ll select the ONE index that it believes is the most selective (so if there is only two people who create records then each row has a 50% chance of being allotted to a single user - not that selective, however there might be 500 animal types).

Let’s assume we really need this query to be lightening fast; we could add a further index to include both columns

ALTER TABLE animal ADD INDEX `Index_animaltypeid_vs_creationusername `(`animaltypeid`, `creationusername`);

Now, i’ve used the business domain specific discussion earlier to establish that animaltypeid is more selective, so i’ve put that first - after all a query just based on animaltypeid can still use that index. But, what for creationusername? Well that’s largely useless and in many cases it makes more sense to do a table scan (if that doesn’t sound right then think hard about what the DB will do - it’ll look at the index, note the selectivity is very poor AND that it needs to go to the table to get the data anyway… so it’ll do a single serial read through the table rather than use the index, compile the animal id’s that match and then look them up and return them).

Hope that helps with part 1… Now on to part 2 (which is far weirder for most folk)

Universal Truth 2

“Because we have more intelligence than machines (and we know the business domain) we can do things that the machine optimiser can’t”

In the above example we used our knowledge of the data to guide which way we should build a key. We sometimes hav to do the same with respect to table data.

Let’s imagine that the above Animal table is constantly being updated (like > 10 per second), so we can’t key the data too heavily, but a query we often run is asking whether we’ve had a tiger added in the last 5 minutes (this is an example bear with me! - other examples could be a wiki update log, a money audit log etc.)

Here we can use some business domain knowledge to help us out. We know that tigers (animaltypeid 3 btw) are added pretty regularly, and that they’re a regular but largely insignificant number of entries (10 entries per sec * 60 secs in a minute * 60 minutes per hour - that’s a fair few non-tigers coming through).

Now we’ve already got an index that addresses the animaltypeid, so we figure that’s a good candidate, so we write the following SQL (if you’re not familiar with the LIMIT keyword it just does what TOP does in SQL Server);

“SELECT * FROM Animal WHERE animaltypeid = 3 /* that’s a tiger remember */ AND creationdate > DATE_ADD ( NOW(), INTERVAL -1 HOUR) LIMIT 1;”

(note: the limit 1 just returns the first one we find as we only want to know IF there was a tiger added in the last hour)

Now, if you analyse the DB you’ll see it uses the animaltypeid index, and you may assume this is good, however look at the timings and work out what it’s doing.

Using that index means it’s gone to the index, retrieved all the ids for animals that are of animaltypeid 3, then it goes to the animal table and scans those rows and if they pass the criteria it copies it out for returning to the caller. Assuming we’ve got a thousand tigers in the DB (and a 100,000 entries) then that’s a lot of Disk IO for not much.

So, we re-write it slightly to only use the table (i’ll skip the index hint that tells the DB not to use that index here but you can add it if you need it, but in a circumstance like this we’d re-write the index so it went creationusername then animaltypeid so this query wouldn’t trigger it but callers using both those criteria still get the benefits)

“SELECT * FROM animal WHERE animaltypeid = 3 and creationdate > DATE_ADD( NOW(), INTERVAL -1 HOUR) ORDER BY id DESC LIMIT 1;”

Here we using a few things, the fact that we know the table is inserted into in order (so we add the order by id DESC to force the DB to look backwards from the current page - we could have said “ORDER BY creationdate DESC” but we know that the id rises as the creationdate does, and as MySQL orders tables by the primary key it makes sense to use that), and also we know that it won’t use the animaltypeid index (as we’ve given an index hint or we’ve removed the index).

Now, this query uses no indexes for the lookup, and table scans are always bad, right? Well, you’ll find in these instances this runs very quickly as the answer is likely to be on the first page of data that’s parsed, and because it’s updated so frequently it’s also a good likelihood that it’s in memory.

So, what about if we just added an index that allows us to do that lookup quicker? Let’s define an index that does that

ALTER TABLE animal ADD INDEX `Index_animaltypeid_vs_creationdate `(`animaltypeid`, `creationdate`);

Now, that seems an ideal situation: the query now uses that index and everybody’s happy? Well not really, that index is going to be appended to all the time and that means the index will be less than optimal, plus we have to write an additional index entry for every insert that we do (or updates if we change the creationdate for some reason??). This of course is part of Universal Truth 3 :)

Adding an index is min-maxing, you need to profile the application performance before and after and get representative results back, otherwise you may as well be applying stuff in the dark.

Hopefully the above has helped some of you. If you’ve any comments then get in touch and let us know.

Making MySQL OR performance better

Thursday, February 28th, 2008

MySQL experts and Old SQL hands look away now. This is like teaching grannies to suck eggs, but if you’ve always had experience of advanced (which used to mean commercial) SQL databases then you probably don’t know this trick.

In the old days performance for OR clauses was pretty bad and a lot of DBMSs would ignore the index and start table scanning and performance would stink.

The old (and MySQL) way to get around this if the problem exhibits itself is to simply use a UNION and have each part of the OR on each side.So for example for table employee we need to find any employee where their name is ’stuart’ or ‘roger’ (i’m assuming we’re using an OR here - we could of course use an IN..).

The first version is the clear version

SELECT * FROM employee WHERE forename = 'stuart' OR forename = 'roger'

However, if you do an EXPLAIN on it and notice that it’s table scanning it’s trivial to re-write as a UNION

SELECT * FROM employee WHERE forename = 'stuart' UNION SELECT * FROM employee WHERE forename = 'roger'

Now if you know about your data (and the OR condition) then you can optimise a dash further. UNION merges the two result sets and checks for duplicate entries (de-dupes - just like a DISTINCT) before returning the values, but if you’re sure that the two sets won’t clash (and above is a great example) then use UNION ALL which ignore the de-duping stage.

So, why not always use a UNION? Well firstly it’s not clear and secondly there’s a duplication of code. Imagine if your OR clause was more complex (forename could be ’stuart’, ‘bob’, ‘roger’, ‘terry’, ‘june’) then that’s 5 repeats of the same statement. Throw in the fact that you’ll often explicitly mention the columns you want (remember i’m using a * in the SELECT clause for example only - try your best not to do this in real code of course) and that’ll you’ll have multiple joins then you realise you only want to do this if and when you get a performance problem.

Have fun and I hope not too many of you are sucking eggs at the moment :)

Helping Tune SSAS2005 OLAP Cubes

Thursday, September 13th, 2007

Today i was pointed at a good summary for optimising SSAS2005 cubes

In tandem you can use the “shared source” Visual Studio plugin - BIDS for more fine control of SSAS2005 - although use it with care if you don’t know what you’re doing!

OLAP MDX Optimisation Tip

Wednesday, September 12th, 2007

Ok Ok I know this is trivial but it’s saved more than my own bacon before…

You’ve built an OLAP (SSAS2005) Cube (and this isn’t your normal job) and some brain box has asked for a calculated member to be available (or alternately you’re adding this as a calc. member to a report and you want it calculated on the fly - see below it’s basically identical)

For example you want to find the ratio of the current bank accounts balance relative to all the others so you add a member of the ilk (in script form this is);

CREATE MEMBER CURRENTCUBE.[MEASURES].[Ratio of Account Balance]
AS iif ( IsEmpty( [Measures].[Account Balance] ),
[Measures].[Account Balance],
( [Measures].[Account Balance] ) / ( ROOT([Account]), [Measures].[Account Balance] )
),
VISIBLE = 1 ;

So, you run the cube in a report or the browser and suddenly that lovely cube response has shot up from sub-10 seconds to never coming back… what could you have done?

Well i’m assuming you’ve got a lot of data ™ - probably hundreds of millions of fact rows and you’ve now stopped SSAS optimising the search path and ignoring the empty bits. So, give it a hint and add the NON EMPTY BEHAVIOR (sic - behaviour - that gets me every time - colour becoming color i get but behaviour is just wrong) tip.

In the GUI simply select the requisite measure field, in the script tab add
NON_EMPTY_BEHAVIOR = { [Account Balance] }

voila! instant speed up and less heart attack!

(note: if this was an inline calc-on-the-fly type member then the full wording would be…

WITH MEMBER CURRENTCUBE.[MEASURES].[Ratio of Account Balance]
AS iif ( IsEmpty( [Measures].[Account Balance] ),
[Measures].[Account Balance],
( [Measures].[Account Balance] ) / ( ROOT([Account]), [Measures].[Account Balance] )
),
NON_EMPTY_BEHAVIOR = { [Account Balance] },
VISIBLE = 1 ;

SQL Server Triggers containing Cursors (AKA: how *NOT* to audit)

Monday, June 25th, 2007

I’m working with a client at the moment who has a lot of legacy code.

One of the things that makes the systems they’ve got so bad is that there’s little thought of performance when things have been written. An excellent example is the use of multiple nested cursors within triggers in SQL Server.

Obviously at some point a developer was given a requirement for auditting. As not all of of the tables (nor every column in a table) was required to be auditted they thought to themselves that triggers using cursors would be an excellent plan.

So, they created a table called “audit” that contained a timestamp; user id and name that caused the change, the source row id, table and column; and before and after values.

Now i’ve no doubt that this is a nice scheme for quickly showing a list of changes on a screen (i.e. select * from audit where sourcetable = ‘address’ and sourcetableid = ‘’ order by changedate DESC) but this is a performance killer if you ever have much concurrency and you update things in batch.

To get the data within the trigger to build the audit info you need to encode the table name, look up the username (for the user that made the change - as that was denormalised into this table too - ironically no doubt to speed up read performance), then iterate around another table that tells you which columns you’re interested in auditting, for each of those check if they’ve changed (in a cursor of course), then find the differences and do an insert for each.

So, if we’ve got 8 columns to change then we’ve got one trigger, one lookup for the user’s username, one cursor iterating around 8 “columns to audit” entries, at least 8 lookups against before and after data in the trigger, then up to 8 inserts into the audit table.

Now, some people are thinking… what’s the problem?

Firstly, one of volume: the client mentioned has a few million rows and does updates that can sometimes effect 10s of thousands of rows, suddenly a trigger per row hurts.

Secondly, they’ve obviously been looking at performance the wrong way - they have about 1000 users (and the table for users is the ntusername, vs the userid - not exactly large - so it’s easy for sql server to cache in tempdb as it’s used all the time). Next up the audit data isn’t going to change on a day by day basis - so why not generate that list of columns to check straight into the trigger code? At least that way you’ve skipped another table lookup.

I’m guessing you’re thinking - so what’s the right way? what’s the magic bullet/golden hammer? - There isn’t one, for every situation there’s a different solution, but it all starts by understanding the way your application needs to work.

I use two basic practises a lot;

a) Never update a row - have a timestamp against a row and the user id that created it and a marker that it’s historical. when you need to change data simply insert a new row (with the Now timestamp and the current user’s id) with the new data, then mark the original row as defunct. You either get the history live by showing before and afters, or have a batch job that turns it into user friendly stuff in a downtime (if one exists). If you use Stored Procedures/SPs (i’m assuming you do!) then it’s easy to implement this scheme in there.
b) Generate a table that matches the schema of the original table you wish to audit and prefix the name with audit or history (potentially you can trim down the columns in there if you’re not interested in all of them) and then insert the original data into the audit table whenever you go to update it. Again this is easy to do in the Stored Procs/SPs.

In general (a) is the simplest to implement but it’s only really usable if you have smallish data sets (especially if you put a clustered key on the “defunct” marker column - as that update will require a table reshuffle). (b) is more work but at least you’re segmenting your data into historical vs. current - so any stupid programmer mistakes (such as not using the defunct marker in a select query on the data can return a history row instead and confuse the user).