Friday, November 24, 2006

Star Vs Snowflake schema

Star Schemas

The star schema is the simplest data warehouse schema. It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of one or more fact tables and the points of the star are the dimension tables, as shown in figure.







Fact Tables

A fact table typically has two types of columns: those that contain numeric facts (often called measurements), and those that are foreign keys to dimension tables. A fact table contains either detail-level facts or facts that have been aggregated.


Dimension Tables

A dimension is a structure, often composed of one or more hierarchies, that categorizes data. Dimensional attributes help to describe the dimensional value. They are normally descriptive, textual values.
Dimension tables are generally small in size as compared to fact table.


-To take an example and understand, assume this schema to be of a retail-chain (like wal-mart or carrefour).

Fact will be revenue (money). Now how do you want to see data is called a dimension.



In above figure, you can see the fact is revenue and there are many dimensions to see the same data. You may want to look at revenue based on time (what was the revenue last quarter?), or you may want to look at revenue based on a certain product (what was the revenue for chocolates?) and so on.
In all these cases, the fact is same, however dimension changes as per the requirement.



Note: In an ideal Star schema, all the hierarchies of a dimension are handled within a single table.

Star Query

A star query is a join between a fact table and a number of dimension tables. Each dimension table is joined to the fact table using a primary key to foreign key join, but the dimension tables are not joined to each other. The cost-based optimizer recognizes star queries and generates efficient execution plans for them.






Snoflake Schema

The snowflake schema is a variation of the star schema used in a data warehouse.

The snowflake schema (sometimes callled snowflake join schema) is a more complex schema than the star schema because the tables which describe the dimensions are normalized.







Flips of "snowflaking"



- In a data warehouse, the fact table in which data values (and its associated indexes) are stored, is typically responsible for 90% or more of the storage requirements, so the benefit here is normally insignificant.

- Normalization of the dimension tables ("snowflaking") can impair the performance of a data warehouse. Whereas conventional databases can be tuned to match the regular pattern of usage, such patterns rarely exist in a data warehouse. Snowflaking will increase the time taken to perform a query, and the design goals of many data warehouse projects is to minimize these response times.


Benefits of "snowflaking"



- If a dimension is very sparse (i.e. most of the possible values for the dimension have no data) and/or a dimension has a very long list of attributes which may be used in a query, the dimension table may occupy a significant proportion of the database and snowflaking may be appropriate.



- A multidimensional view is sometimes added to an existing transactional database to aid reporting. In this case, the tables which describe the dimensions will already exist and will typically be normalised. A snowflake schema will hence be easier to implement.

- A snowflake schema can sometimes reflect the way in which users think about data. Users may prefer to generate queries using a star schema in some cases, although this may or may not be reflected in the underlying organisation of the database.

- Some users may wish to submit queries to the database which, using conventional multidimensional reporting tools, cannot be expressed within a simple star schema. This is particularly common in data mining of customer databases, where a common requirement is to locate common factors between customers who bought products meeting complex criteria. Some snowflaking would typically be required to permit simple query tools such as Cognos Powerplay to form such a query, especially if provision for these forms of query weren't anticpated when the data warehouse was first designed.

In practice, many data warehouses will normalize some dimensions and not others, and hence use a combination of snowflake and classic star schema.


Source: Oracle documentation, wikipedia

16 comments:

Yazad said...

Than you... it's a lovely article :)

1 small request, please add some more lines of explanation to the snowflake section. Thank you again, a lot of my concepts are now clear.

Regards,
Yazad Khambata.

Sachin said...

Thanks Yazad for your compliment.
I couldnt change the doc. But here is some information about snowflake schema.

The snowflake schema presents a dimensional model which is composed of a central fact table and a set of dimension tables which are further broken up into sub-dimension tables. This makes this kind of schema one step normalized than star schema yet preserving the essence of star schema.

In the above diagram “product” is a dimension and “supplier” is a sub-dimension. Similarly “customers” is a dimension and “countries” is sub-dimension.

Yazad said...

thank you very much Sachin.

Unknown said...

Lovely article

Anonymous said...

Easily understood article..
Thanks..

Nigel Ainscoe said...

Thank you for a clear and concise article. I think I now understand why the mantra "resist snowflaking where possible" comes about.

Anonymous said...

Sachin,

Thanks for making this clear it very straight forward and simple explanation.

Sudeep

Anonymous said...

Thanks Sachin,

This article is very helpful.

Could you also post regrding Informatica Client Tools:

1. Designer
2. Workflow Manager
3. Workflow Monitor &
4. Repository Manager

Thanks

Deepak said...

Very nice article to cover the basic differences between Star and snowflake schemas. Can you also post links to the advanced concepts regarding Star queries, optimization techniques etc.

Thanks for all the efforts!

Ankita Shah said...

Thanks...Nice concise article to clear the concepts..

Sachin said...

Thanks Ankita for the kind words. Thx, Sachin

Mitul Suthar said...

Hello Sachin,

Thank you for explaining in a nice article.

Can you explain this question to me? If there is a self referencing dimension such as a department dimension in which a boss can have one or many employees. And there is only one fact dimension. Would you call this a star schema or snowflake schema?

Answer would be appreciated.

Sachin said...

Hi Mitul,

Not sure if I understood your question correctly.

The use of dept and emp tables is best in OLTP environments. Using it in warehouse tables may confuse.

If you mean that we have 2 tables dept (having dept no) and emp (having deptno, emp_id and mgr_id - which is nothing but emp_id of his manager) and the query is to find all bosses of say dept# 10 with all of their subordinates. The schema design to my knowledge should be called Star schema Because the join between mgr and emp is already done in emp table and need not be doen again via table join. Only join that's required is between emp and dept tables.

Let me know if I grasped your question correctly.

Anonymous said...

nice article Sachin only one question what is Fact Constellation Schema??

Malar said...

It is a nice article...helps me to understand well...Thank you Sachin and do continue posting these kind of articles.

Anonymous said...

Very well explained.

Thank you