Google results linking to www.domainname.com/wb/admin/media/basic_header.html

chio

OK,
what is the URL of the homepage?
1) www.domain.de
or
2) www.domain.de/index.php
?

Correct answer: 1)
A normal WB installation never links to www.domain.de/index.php anywhere; it always links to www.domain.de

www.domain.de == /
That is also where the bot starts (necessarily; without deep links it does not know anything else).
But now it is not allowed to, because / is blocked.
So it will not even take an interest in /index.php, let alone /pages/ - how is it supposed to know those pages exist?

Sure, you can argue: even if the homepage is blocked via robots.txt, the spider might still crawl it anyway, and then it would find the links - and then it would also find the pages.

I wonder, though, whether Google actually says the same.

And what about the source you quoted?
http://janeandrobot.com/robots.txt
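
For later readers, a hedged aside: Google's own robots.txt documentation (not anything confirmed in this thread) resolves Allow/Disallow conflicts by the most specific, i.e. longest, matching path; other crawlers may behave differently. Annotated against the rules in question, that would mean:


User-agent: *
Disallow: /          # matches every URL; matching path length is 1
Allow: /index.php    # longer match, so for Google it wins for /index.php
Allow: /pages/       # likewise wins for anything under /pages/
# "/" itself matches only "Disallow: /", so the homepage stays blocked
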

kweitzel

Well Chio, I am not quite sure where your knowledge comes from, or how old it is ...

The rule block I set up does the following:


User-agent: * # applies to ALL user agents
Disallow: / # to begin with, nothing at all may be crawled, except:
Allow: /index.php # the index.php page
Allow: /pages/ # the pages folder


Maybe you should also read up here: http://janeandrobot.com/post/Managing-Robots-Access-To-Your-Website.aspx

The good crawlers have one trait ... they read these rules and stick to them, all of them.

Regards

Klaus

chio

Because the homepage is not called index.php (hence the parentheses)
but: /
If you block the root directory via robots.txt, the crawler never even gets onto the domain, no matter what is opened up after it. Where is it supposed to start?
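
If the goal is to open up the homepage itself, one possible sketch relies on the $ end-of-URL anchor - a Google/Bing extension, not part of the original robots.txt standard, so treat it as an assumption:


User-agent: *
Disallow: /
Allow: /$            # exactly the bare homepage URL "/", nothing more
Allow: /index.php
Allow: /pages/
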

kweitzel

Yes, you are right ... but then why don't you simply write the addition:
Allow: /index.php

That might have been helpful ...

Regards

Klaus


kweitzel

That is what we want to achieve ... the crawling allow list is in part 2 ...

Allow: /pages/

cheers

Klaus

chio

Quote:
User-agent: *
Disallow: /
Allow: /pages/

Whoa - and what about the homepage?
If you do this, your whole site isn't crawled if you have no deep links.

kweitzel

Guys,

I would actually disallow everything for everybody and then, as a next step, allow the pages directory and, if wanted, selected other directories. This way you do not give away your whole folder structure.


User-agent: *
Disallow: /


Then you open the pages directory, since you want to have the pages crawled


Allow: /pages/


Secure it the way host systems used to be secured ... close everything, then open only the required folders for indexing.

cheers

Klaus

chio

In some cases the apache "index of" pages might cause the problem.
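
If directory listings are indeed the culprit, a minimal .htaccess sketch for Apache (assuming the host permits Options overrides in .htaccess):


# .htaccess in the web root: turn off automatic "Index of" listings
Options -Indexes
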

I always disallow "modules" and "admin" in robots.txt. Of course I know a "bad guy" can read that too, but it is more dangerous when a bad guy can simply use a Google search to find security holes.
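
A minimal sketch of those two rules (the folder names assume a default WB layout):


User-agent: *
Disallow: /admin/
Disallow: /modules/
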

Ruud

The serious search engines will respect robots.txt, but they are not your worry.
The bad guys will use your robots.txt to find the places where you don't want the search engines to go.
Scanning for /admin will (on most WB sites) directly tell them they are dealing with a WB site.
You can be sure the admin structure is already known to anyone serious about hacking.

Protecting the admin area with a .htaccess is something you could do, but then you will need to log on twice every time you want to do something over there. The admin area is already protected at the PHP level.
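
For reference, a minimal sketch of that extra .htaccess login - the file path and realm name are placeholders, not taken from any WB documentation:


# .htaccess inside /admin/: HTTP basic auth in front of the WB login
AuthType Basic
AuthName "Restricted admin area"
# the password file is created with: htpasswd -c /path/to/.htpasswd someuser
AuthUserFile /path/to/.htpasswd
Require valid-user
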

There is nothing wrong with the HTML templates or images being accessible. They will not do anything anyway.
Personally I am a pretty paranoid guy, but as long as no vulnerabilities are popping up on sites like http://secunia.com/advisories/23828/, I don't worry too much.

As I explained in my previous post, pages in the admin area are not indexed by Google at all, unless you go to the effort of asking Google to do so.

Ruud

[url=https://dev4me.com/modules-snippets/]Dev4me - WebsiteBaker modules[/url] - [url=https://wbhelp.org/]WBhelp.org[/url]

albatros

Hi,

Correct me if I am wrong, but I am sure that a robots.txt as a solution is very insecure. Search engines can respect robots.txt, but they don't have to. And reading robots.txt files could well be a help and an inspiration for bad guys. :-D

And what if a bad guy knows the structure of WB (or any other CMS)? He only has to find a website built with that CMS, and he can look at the admin files. Whether the files are in Google's index or not doesn't matter.

The very simple and only safe solution is password protection of admin via .htaccess. Then you can't see any file in any subfolder, and the admin functions are safely locked.

So why don't you use this? Am I completely wrong, or thinking much too simply?

Regards

Uwe


I am not a terrorist. I have no bomb workshop in the basement. I have no anthrax either. I will not buy any ricin. I will never travel to the United States of America.

Ruud

I agree on the last part.
It should not be that difficult to prevent that. (using robots.txt or .php templates instead of .html)

BTW: I wonder how Google found those pages.
There are no links pointing to those files, and Google doesn't look for pages without links pointing to them.
I think the only way that could happen is when a WB tree is installed on a server without PHP enabled and with directory browsing enabled.

While writing this message, I found one site that had a google_sitemap.xml document (generated with some external generator) which included ALL the admin .html, .js, .gif, .png files, etc.
That is the opposite of a robots.txt: it is asking Google to "please index my hidden stuff".
(Do a search on Google like this: "/media/basic_header.html google_sitemap.xml" and you will see what I mean.)

Conclusion: don't worry. Your pages will not be indexed by Google unless you ask for it.

Ruud
[url=https://dev4me.com/modules-snippets/]Dev4me - WebsiteBaker modules[/url] - [url=https://wbhelp.org/]WBhelp.org[/url]

Argos

Although I understand your explanation, I do feel it's weird to have loose admin header files linked in Google. I'll do some robots.txt stuff, but maybe it's an idea for an updated WB version to simply prevent this possibility.
Jurgen Nijhuis
Argos Media
Heiloo, The Netherlands
----------------------------------------------------------------
Please don't request personal support, use the forums!

Ruud

There are lots of WB sites out there.
Every site has the /admin/media/basic_header.html file (and all the other non-.php files in the admin area).
Everybody can download it and look at the source code to find out what it can do.

It should not be dangerous.
Typically an HTML page will not do much more than display data.
PHP pages in the admin area (or the modules area) should all have a bit of code to prevent running without the WB framework (and its security).
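
A minimal sketch of such a guard, assuming WB_PATH is the constant the framework defines when a page is loaded the intended way - treat the exact constant name as an assumption, not a quote from the WB source:


<?php
// Top of an admin or module page: refuse to run outside the framework.
// WB_PATH is assumed to be set by WebsiteBaker's config when the page is
// included through the framework; a direct browser request never has it.
if (!defined('WB_PATH')) {
    header('HTTP/1.1 403 Forbidden');
    exit('Cannot access this file directly.');
}
?>
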

Although I can imagine that using HTML templates in the admin area, together with .htaccess mods that allow HTML files to run PHP code, could be dangerous.

Ruud

[url=https://dev4me.com/modules-snippets/]Dev4me - WebsiteBaker modules[/url] - [url=https://wbhelp.org/]WBhelp.org[/url]

kweitzel

You can always use a robots.txt to protect the folders.

cheers

Klaus

Argos

I searched Google for "website baker media" and found a lot of results linking to "WebsiteBaker Administration - Media-" with URLs that point to "www.domainname.com/wb/admin/media/basic_header.html".

I feel a bit uncomfortable about it somehow. Can it be dangerous?
Jurgen Nijhuis
Argos Media
Heiloo, The Netherlands
----------------------------------------------------------------
Please don't request personal support, use the forums!